Pay Attention to Your Neighbours:
Training-Free Open-Vocabulary Semantic Segmentation

Jose Dolz Sina Hajimiri^🖂 Ismail Ben Ayed Jose Dolz Sina Hajimiri^🖂
ÉTS Montreal
🖂 seyed-mohammadsina.hajimiri.1@etsmtl.net

Abstract

Despite the significant progress in deep learning for dense visual recognition problems, such as semantic segmentation, traditional methods are constrained by fixed class sets. Meanwhile, vision-language foundation models, such as CLIP, have showcased remarkable effectiveness in numerous zero-shot image-level tasks, owing to their robust generalizability. Recently, a body of work has investigated utilizing these models in open-vocabulary semantic segmentation (OVSS). However, existing approaches often rely on impractical supervised pre-training or access to additional pre-trained networks. In this work, we propose a strong baseline for training-free OVSS, termed Neighbour-Aware CLIP (NACLIP), representing a straightforward adaptation of CLIP tailored for this scenario. Our method enforces localization of patches in the self-attention of CLIP’s vision transformer which, despite being crucial for dense prediction tasks, has been overlooked in the OVSS literature. By incorporating design choices favouring segmentation, our approach significantly improves performance without requiring additional data, auxiliary pre-trained networks, or extensive hyperparameter tuning, making it highly practical for real-world applications. Experiments are performed on 8 popular semantic segmentation benchmarks, yielding state-of-the-art performance on most scenarios. Our code is publicly available at https://github.com/sinahmr/NACLIP.

1 Introduction

In recent years, we have witnessed remarkable progress of deep learning models in dense visual recognition tasks, such as semantic segmentation [28]. Nevertheless, a main limitation of these methods stems from the fixed set of classes available in the traditional training scenario. This hinders the deployment of these models in a myriad of real-world problems, where the number of visual concepts is far from being finite, and likely includes novel categories unseen during training. A straightforward solution would consist in collecting a large set of labeled images of each novel class to adapt the model. However, this solution is impractical in many aspects, from the cumbersome labeling process of additional images to the unrealistic adaptation of the model to each new class.

Open-vocabulary semantic segmentation (OVSS) has emerged as an appealing alternative to traditional close-set approaches, as it can handle novel categories that may not have been seen during training. Fueled by the zero-shot performance of vision-language models in visual recognition tasks, most recent OVSS approaches are inspired by the Contrastive Language-Image Pre-training (CLIP) [32] paradigm. A popular family of approaches integrates a fully-supervised training step [2, 16, 22, 44, 45, 46, 31], where pixel-wise masks from a limited set of categories are leveraged to transfer language-visual alignment at image level to pixel granularity. The pixel-level annotation assumption can be relaxed by considering a weakly-supervised adaptation dataset, where only image-text pairs are accessible [8, 6, 49, 40, 7, 42]. These methods, however, still require a substantially large set of annotations, which typically present a large overlap with the open-set categories in the test datasets (please see Table 1 in [45]). Furthermore, the final model’s performance may be biased towards the training dataset selected for adaptation.

To meet the requirements of real-world applications, where access to large labeled datasets is rare and one cannot anticipate novel classes, we focus in this work on training-free, a more challenging and realistic scenario without access to additional data for adaptation. Due to their practical relevance, there exists a growing literature on these approaches [52, 33, 41, 19, 38, 4, 3, 21, 27, 11], which leverages CLIP [32] as the source of knowledge. However, some also utilize additional pre-trained networks like MoCov2 [17], DeiT [30], or unsupervised object localization approaches [35] trained on extra datasets [33, 41, 19], or pre-trained text-conditional generative models (e.g., stable diffusion) to generate multiple additional images for novel concepts [3, 19, 11]. We argue that methods using these auxiliary pre-trained models, having been trained on significant extra data and introducing additional weights, cannot be fairly compared to training-free methods that use only CLIP. Furthermore, some prior works present practical limitations such as tuning many hyperparameters [3, 27], whose values change across datasets [3], or integrating a weakly supervised reward function for hyperparameter tuning [27]. Thus, devising novel alternatives that relax these requirements and present simple solutions in a more realistic training-free manner is paramount to deploying these methods in real-world problems.

In dense prediction tasks such as semantic segmentation, the crucial aspect of localization is often overlooked when employing vision transformers (ViTs) [12]. Notably, ViTs emphasize global representations, particularly through their [CLS] token, which leads to suboptimal dense prediction performance. Therefore, we advocate that devising novel and effective methodologies hinges on considering design choices that favour segmentation. The recent concurrent work, SCLIP [38], shares a similar line of research, and investigates the inherent problems of CLIP in dense prediction, proposing minor adjustments over the baseline model. In particular, the authors identified that the baseline’s poor performance is caused by a spatial misalignment of the patch representations, pointing to CLIP’s self-attention modules as the source of the problem. More concretely, they argue that CLIP learns spatial-invariant visual features, beneficial only in image-level tasks. To address this limitation, a new self-attention mechanism was introduced in [38], which re-organizes the spatial information. Nevertheless, despite its remarkable performance compared to CLIP, it does not guarantee semantic correlations across local tokens, as it does not have an explicit mechanism to attend to each token’s neighbours and therefore ensure spatial consistency. As depicted in Fig. 1, although SCLIP [38] captures semantic context better than the standard attention in CLIP, it yields unstable attention maps on nearby tokens.

Contributions.

Motivated by these limitations, in this work, we do due diligence in training-free OVSS and propose a straightforward baseline with minimal modifications to vanilla CLIP. The implemented changes are motivated by the specificity of dense prediction problems, which have been mostly disregarded in existing methods. Specifically, we identify the potential limitations of CLIP in image segmentation and modify its visual encoder, particularly in the final layer, to enhance its localization capabilities. To ensure spatial consistency, we propose a simple solution that encourages each patch to attend to its neighbours, generating consistent attention maps across adjacent patches. As demonstrated empirically through comprehensive experiments on 8 popular OVSS benchmarks, our strong baseline Neighbour-Aware CLIP (NACLIP), achieves state-of-the-art performance without requiring additional data, auxiliary pre-trained networks, or extensive hyperparameter tuning.

Refer to caption — Figure 1: Attention maps of the final visual encoder layer. For the patches shaded in red (denoted with (a) to (d)), the final layer’s attention maps are presented for CLIP [32], SCLIP [38], and our method. We have identified two problematic phenomena in the attention maps of CLIP and SCLIP, stemming from a lack of mechanisms to properly attend to patches’ neighbourhoods. First, as depicted in (a) and (b), attention intensity is sometimes dispersed among distant patches, neglecting the vicinity of a patch. Additionally, adjacent or closely located patches sharing the same real-world category and even similar visual characteristics can have inconsistent attention maps. For instance, while SCLIP generates a quality attention map for patch (d), its attention map for (c) is notably different and fails to focus on the desired object. By explicitly promoting attention to neighbours, our method produces consistent attention maps across adjacent patches.

2 Related work

2.1 Adapting large-scale vision-language models

The field of deep learning is undergoing a learning paradigm shift with the emergence of vision-language models trained at a large scale. In particular, the Contrastive Language-Image Pre-Training (CLIP) method [32] has shown unprecedented success, mainly due to its remarkable zero-shot and few-shot transfer capabilities on visual recognition tasks, notably in the context of classification. Fueled by its generalization and transfer capabilities, a realm of approaches have arisen to either improve its zero-shot performance by modifying its training on image-text pairs [47, 39, 14], or efficiently adapt to novel tasks with only a few labeled images [50, 15, 48, 36, 18, 34]. Nonetheless, the pre-training is performed at the image level, and therefore, typically, only the class token is trained to capture the global information, which limits the applicability of these approaches for dense prediction tasks.

2.2 Open-vocabulary semantic segmentation

The outstanding transferability of CLIP in classification has surged to a rapidly growing literature in open-vocabulary semantic segmentation (OVSS) [2, 16, 22, 44, 45, 46, 52, 33, 43, 26, 45, 27], which attempts to segment novel concepts in a given image without explicit supervision for them. These methods can be categorized into three main groups, depending on the level of auxiliary data required for adaptation: fully-supervised, weakly-supervised, and training-free. Fully-supervised OVSS approaches [2, 16, 22, 44, 45, 46, 31] adapt the pre-trained CLIP to semantic segmentation by exploiting a large labeled set containing pixel-wise annotations from a set of classes. The main idea of these approaches is that, by leveraging a large pixel-labeled dataset of an arbitrary limited set of categories, the model can learn to perform segmentation, while maintaining the remarkable transfer capability of CLIP. Nevertheless, the adaptation datasets employed in this family of approaches typically present a high class overlap with the testing images. Instead of requiring pixel-wise masks, weakly-supervised OVSS resorts to additional datasets with image-level tags, typically in the form of image-text pairs [8, 6, 49, 40, 42, 24, 7]. A common strategy is to use these large image-text pair sets as supervision during adaptation, where the categories present in an image are included in the text. The alignment between the visual and textual information is typically performed via a contrastive loss [42], similar to the pre-training of CLIP, which can be further enhanced by integrating additional learning strategies, such as online pixel clustering [24] or multiple contrastive losses between the features from original images and those restored from corrupted versions [6], for example. Nevertheless, a substantially large dataset is still required for adaptation, and classes appearing in each image need to be known in advance, presenting impractical considerations for real-world problems.

Given the points raised above, we focus in this work on training-free OVSS where, ideally, access to additional data for adaptation is not allowed. Nevertheless, as stated earlier, many of the methods rather rely on leveraging auxiliary models pre-trained on large-scale datasets (e.g., ViT [33, 41, 27] or stable diffusion [3, 19, 11]), or containing multiple components whose hyperparameters must be tuned [3, 27]. A different line of work attempts to improve the potential of CLIP in extracting dense visual features by modifying ViT’s self-attention [52, 4, 21]. This can be done, for instance, by directly using the value vectors [52], or by introducing additional pathways computing self-self attentions in parallel to the encoder blocks to aggregate output of several layers using the residual connections [21, 4]. A main difference with this line of work, however, is that we introduce explicit local spatial consistency in the self-attention, a concept that has been disregarded by the existing training-free OVSS literature.

The concurrent approach, SCLIP [38], proposes a self-attention mechanism encouraging each token to attend to itself and the positions sharing similar information. In particular, the authors state that an attention map with maximum values on the diagonal leads to proper localization. Nevertheless, since query and key vectors are not normalized in SCLIP, there is no guarantee that the maximum values of the attention map fall on the diagonal. Hence, if an outlier patch has a key vector with a magnitude much larger than the others, patches attend to the outlier more than themselves. Furthermore, even though in a well-localized model each patch should attend to itself with a high intensity, localization is not limited to self, and the vicinity of each patch should be taken into account as well. This is particularly important in segmentation, where adjacent patches often represent the same class. As apparent in our qualitative analysis (Fig. 1), despite SCLIP improving the spatial localization of attention maps over CLIP, it often generates contrasting attention maps for adjacent patches containing the same real-world object. This demonstrates that in this method, spatial localization in the neighbourhood of a patch does not typically go far beyond the patch itself. Moreover, although for a given patch, SCLIP attends to the patches that share similar query or key vectors, neighbouring patches do not necessarily do. We argue that only attending to patches that share similar vectors (and also self) is suboptimal, which is demonstrated empirically in Fig. 1. Our proposed method can handle these limitations by adding an explicit mechanism that enforces patches to attend to their neighbours, imposing the local spatial consistency required for semantic segmentation.

3 Preliminaries

In this work, we address the task of open-vocabulary semantic segmentation in a training-free scenario, i.e., with no supervision and without fine-tuning the parameters. Hence, for a given image $\mathbf{X}_{i}\in{\mathbb{R}}^{\Omega_{i}}$ , where $\Omega_{i}$ denotes its spatial domain $H\times W\times C$ , and a set of arbitrary concepts described by natural language $\mathbf{t}_{j}\in\mathcal{T}$ , our goal is to provide a segmentation mask of concepts present in the image. In particular, an image $\mathbf{X}_{i}$ is fed into CLIP’s ViT, which produces a representation of each patch. In parallel, $\mathbf{t}_{j}\;\forall j$ are fed into CLIP’s text encoder and the resulting representations are compared to each patch’s representation using cosine similarity, forming a probability distribution which is converted to a segmentation mask using an $\arg\max$ operation.

3.1 Background on CLIP

CLIP [32] employs a joint training approach, aligning visual and text modalities into the same feature space. Since most of CLIP-based OVSS methods utilize the vision transformer [12] version of CLIP, we focus on this architecture in what follows. In this model, the visual encoder comprises $L$ sequential blocks, each processing $1+hw$ tokens: the first one representing the [CLS] token, which captures global information, and the subsequent tokens each representing a patch. Thus, given an input image, it is initially partitioned into $hw$ non-overlapping patches ( $h=H/P$ and $w=W/P$ ) each in ${\mathbb{R}}^{P\times P\times C}$ , where $(P,P)$ denotes the resolution of each patch. Then, a linear transformation projects the sequence of $1+hw$ tokens from $C$ channels to a $D$ -dimensional space, and positional embeddings are added to create the input for the first encoder block. The components of this model are described below. For the sake of simplicity of our method’s formulation in Sec. 4.1, we avoid flattening the 2D grid of patches and do not consider the [CLS] token in the formulations below.

Encoder block.

Each encoder block $l$ receives a sequence of tokens, $\mathbf{Z}^{(l-1)}$ , from the preceding block and performs the following operations:

$\displaystyle\phantom{,}\mathbf{Z}^{\prime}$	$\displaystyle=\operatorname{LN}(\mathbf{Z}^{(l-1)}),$	(1)
$\displaystyle\phantom{,}\mathbf{Z}^{\prime}$	$\displaystyle=\mathbf{Z}^{(l-1)}+\operatorname{SA}(\mathbf{Z}^{\prime}),$	(2)
$\displaystyle\phantom{,}\mathbf{Z}^{*}$	$\displaystyle=\operatorname{LN}(\mathbf{Z}^{\prime}),$	(3)
$\displaystyle\phantom{.}\mathbf{Z}^{(l)}$	$\displaystyle=\mathbf{Z}^{\prime}+\operatorname{MLP}(\mathbf{Z}^{*}).$	(4)

In these equations, $\operatorname{LN}$ , $\operatorname{SA}$ and $\operatorname{MLP}$ denote layer normalization, the self-attention module, and the feed-forward neural network, respectively. Please note that skip connections refer to the addition of $\mathbf{Z}^{(l-1)}$ and $\mathbf{Z}^{\prime}$ in Eqs. 2 and 4. Furthermore, the combination of Eqs. 1 and 2 is commonly referred to as the self-attention block, whereas the combination of Eqs. 3 and 4 is called the feed-forward block.

Self-attention module.

Within the self-attention module, the sequence of token representations undergoes a linear transformation $\mathbf{W}^{qkv}$ , yielding three sequences of $d$ -dimensional vectors: query ( $\mathbf{q}$ ), key ( $\mathbf{k}$ ), and value ( $\mathbf{v}$ ). Subsequently, a similarity measure is calculated between all the tokens. More specifically, for a given patch at position $(i,j)$ the dot product of $\mathbf{q}_{ij}$ and $\mathbf{k}_{mn}$ is computed for all $m\in\{1,2,\dots,h\}$ and $n\in\{1,2,\dots,w\}$ . This measure is then scaled by $1/\sqrt{d}$ and passed through a $\mathrm{softmax}$ operation to calculate a weighted sum of $\mathbf{v}$ for the patch. Finally, the output is obtained via a projection using a linear transformation $\mathbf{W}^{o}$ . Thus, formally, operation $\operatorname{SA}$ for the patch at position $(i,j)$ can be described as:

$\displaystyle\phantom{,}[\mathbf{q},\mathbf{k},\mathbf{v}]$	$\displaystyle=\mathbf{Z}\mathbf{W}^{qkv},$	(5)
$\displaystyle\phantom{,}\text{sim}_{ij}$	$\displaystyle=\frac{\mathbf{k}\mathbf{q}_{ij}}{\sqrt{d}},$	(6)
$\displaystyle\phantom{,}A_{ij}$	$\displaystyle=\mathrm{softmax}\left(\text{sim}_{ij}\right)\mathbf{v},$	(7)
$\displaystyle\phantom{.}\operatorname{SA}(\mathbf{Z})_{ij}$	$\displaystyle=A_{ij}\mathbf{W}^{o}.$	(8)

Hereafter, we will refer to $\mathrm{softmax}\left(\text{sim}_{ij}\right)$ as the attention map of point $(i,j)$ , and $\text{sim}_{ij}$ as its corresponding logits. Note that while in practice a multi-head version of the self-attention is used, for simplicity, we have presented equations considering a single head.

3.2 Limitations of CLIP for image segmentation

As previously stated, localization holds a pivotal role in segmentation tasks, yet it is often overlooked in ViTs [12]. As described in [12], unlike CNNs, most of the operations in ViTs are global and the locality of patches is mostly not taken into account. Within ViT, and consequently, in CLIP’s visual encoder, positional information is integrated into the network through 1D positional embeddings added to the input at the first block. Recent evidence points to the possible suboptimal performance of 1D embeddings in image data, and thus alternative mechanisms are sought after [25, 9]. Furthermore, positional embeddings utilization is restricted only to the first encoder block, potentially diminishing their relevance in subsequent layers. This oversight could be detrimental in dense prediction tasks, where patches contain important contextual information, and such information remains valuable throughout the network’s depth. Please note that simply incorporating the pre-trained positional embeddings at all the layers is not feasible in the training-free scenario since such additions would change the distribution of the data that each layer expects and requires fine-tuning. In light of the arguments presented, we assert that explicitly attending to the neighbourhood of each patch is imperative within the context of segmentation. Moreover, recent studies have noted that the final encoder block in CLIP’s ViT disrupts spatial information, hindering dense prediction tasks [52]. Indeed, CLIP’s visual encoder is trained to emphasize the [CLS] token’s embedding (global embedding), while the outputs at other locations (i.e., embeddings of patches) are not optimally structured for tasks such as semantic segmentation.

4 Neighbour-Aware CLIP

The pre-training procedure of CLIP [32] encourages its ViT [12] to learn representations tailored to image-level tasks, thereby compromising its efficacy in dense prediction problems. Given the inherent differences between such tasks and segmentation, coupled with the fact that the output of the ViT for patch tokens has not been explicitly trained during CLIP’s pre-training, it struggles to generalize effectively to pixel-wise prediction scenarios. This underscores the necessity for targeted adjustments in the original CLIP model to accommodate the nuances of semantic segmentation. Our study delves into these specific components that potentially hinder CLIP’s segmentation performance and proposes minimal alterations to its overall framework, without changing any of the network’s parameters. The precise modifications are detailed in this section.

4.1 Introducing spatial consistency

In Sec. 3.2, we underscored the importance of explicitly attending to the locality of each patch and highlighted the inadequacy of vanilla ViT’s positional embeddings in achieving this. In this section, we introduce our simple method for enforcing explicit spatial attention to each patch’s neighbourhood. In particular, we augment the attention map information with an unnormalized multivariate (in our case, $2$ -dimensional) Gaussian kernel, which can be defined as:

\phantom{.}\phi(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma})=\exp\left(-{\frac{1}{2}}(\mathbf{x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right).

(9)

Assuming $\boldsymbol{\Sigma}=\sigma^{2}I$ , we can rewrite the kernel as

\phantom{.}\phi(\mathbf{x};\boldsymbol{\mu},\sigma)=\exp\left(-\frac{||\mathbf{x}-\boldsymbol{\mu}||^{2}}{2\sigma^{2}}\right),

(10)

which is maximized when $\mathbf{x}=\boldsymbol{\mu}$ and decreases as the Euclidean distance of $\mathbf{x}$ to $\boldsymbol{\mu}$ increases. We now define a function $\omega((i,j))$ that, given a coordinate as input, discretizes $\phi$ and outputs a matrix of size $h\times w$ :

\begin{gathered}\omega((i,j);\sigma)_{mn}=\phi((m,n);{(i,j)},\sigma),\\ \phantom{,}\forall m\in\{1,2,\dots,h\},\;\forall n\in\{1,2,\dots,w\},\end{gathered}

(11)

with the maximum value at $(i,j)$ , gradually decreasing with distance from $(i,j)$ (see $\omega$ in Fig. 2 for elaboration).

As a corner case, let us assume that the attention map’s logits have been explicitly set to $0$ and we set

\phantom{,}A_{ij}=\mathrm{softmax}\left(\omega((i,j),\sigma)\right)\mathbf{v},

(12)

so that the attention will be purely paid to the neighbourhood of patches. As we show in our empirical validation (termed Neighbourhood Only in Sec. 5.3), by just doing so we can observe a leap in performance compared to CLIP.

Examining Eq. 12, we observe that the patch information is not utilized as $\omega$ replaces $\text{sim}_{ij}$ in Eq. 7, leading to image-independent attention. This prompts us to move beyond the corner case and include the similarity information:

	$\displaystyle\phantom{,}\widetilde{A}_{ij}=\mathrm{softmax}\left(\text{sim}_{ij}+\omega((i,j);\sigma)\right)\mathbf{v},$		(13)
	$\displaystyle\phantom{.}\widetilde{\operatorname{SA}}(\mathbf{Z})_{ij}=\widetilde{A}_{ij}\mathbf{W}^{o}.$		(14)

By adding the Gaussian window to the logits of the attention map for patch $(i,j)$ , the attention increases not only to $\mathbf{v}_{ij}$ , which has been proven fruitful in semantic segmentation [52, 38], but also to the value vectors of the patches in the vicinity of $(i,j)$ , and therefore locality is introduced to the model.

4.2 Measure of similarity

Transformers [37] are complex deep neural networks and providing a definitive explanation of their inner workings remains challenging. However, from an intuitive standpoint, we can offer the following interpretations for query, key, and value vectors. The query vector denotes what a patch looks for; the key vector signifies what it represents; and the value vector displays what it has to offer. Guided by these descriptions, we deviate from the standard similarity measure ( $\mathbf{q}\mathbf{k}^{\top}$ ) in our self-attention module. This deviation is motivated by the discrepancy between what we want the model to look for (i.e., accurate patch-level predictions) and what it had been trained to look for during pre-training. In semantic segmentation, we need to focus on the nature of each patch and this naturally leads us to shift our attention toward using the key vectors. Thus, we opted to use $\mathbf{k}\mathbf{k}^{\top}$ scores in our similarity measure instead, resulting in $\text{sim}_{ij}={\mathbf{k}\mathbf{k}_{ij}}/{\sqrt{d}}$ . By doing so, patches that represent similar information (as portrayed by their key vectors) attend to each other’s value vectors with high intensity. This simple change in perspective brings considerable performance improvement compared to CLIP, as shown in our empirical validation (termed Key-Key Similarity in Sec. 5.3).

Considering the changes proposed up until now, we rewrite Eq. 13 as (see Fig. 2 for more insight):

\phantom{.}\widetilde{A}_{ij}=\mathrm{softmax}\left(\frac{\mathbf{k}\mathbf{k}_{ij}}{\sqrt{d}}+\omega((i,j);\sigma)\right)\mathbf{v}.

(15)

4.3 Eliminating image-level specialized units

As outlined in Sec. 3.2, the final encoder block of CLIP’s vision transformer undermines the network’s efficacy for dense prediction tasks [52]. Therefore, we have chosen to eliminate specific modules from the final encoder, rendering CLIP more suitable for semantic segmentation. Specifically, we have removed the feed-forward block from this encoder, as its parameters are tailored for image-level tasks rather than dense prediction. Additionally, with the alteration of the self-attention operation to incorporate locality and the removal of the feed-forward block, the inclusion of a skip connection becomes impractical. This is because it puts greater importance on the output of the previous encoder block, thereby diminishing the significance of the output of our self-attention module ( $\widetilde{\operatorname{SA}}$ ). Considering these modifications, the final visual encoder block in our approach simplifies the operations described in Sec. 3.1 to

\phantom{,}\mathbf{Z}^{(L)}=\widetilde{\operatorname{SA}}\left(\operatorname{LN}\left(\mathbf{Z}^{\left(L-1\right)}\right)\right),

(16)

in which $L$ denotes the index of the last encoder block. We refer to this structure as the Reduced architecture.

5 Experiments

Table 1: Quantitative evaluation on 6 datasets and 2 of their variants. The first 3 benchmarks (V21, PC60, and C-Obj) contain a background category, while the subsequent ones do not. The Fair column indicates whether a method’s comparison to ours is equitable, i.e., conducted without leveraging additional knowledge. The Post. column shows whether an approach contains a post-processing step for mask refinement. For elucidation on abbreviated benchmark names, please refer to Sec. 5.1.

Method		Fair	Post.	V21	PC60	C-Obj	V20	City	PC59	ADE	C-Stf	Avg
CLIP [32]	ICML’21	✓	✗	18.6	7.8	6.5	49.1	6.7	11.2	3.2	5.7	13.6
MaskCLIP [52]	ECCV’22	✓	✗	43.4	23.2	20.6	74.9	24.9	26.4	11.9	16.7	30.3
GroupViT [42]	CVPR’22	✗	✗	52.3	18.7	27.5	79.7	18.5	23.4	10.4	15.3	30.7
CLIP Surgery [21]	arXiv’23	✓	✗	41.2	30.5	-	-	31.4	-	12.9	21.9	-
CLIP-DIY [41]	WACV’24	✗	✗	59.0	-	30.4	-	-	-	-	-	-
GEM [4]	CVPR’24	✓	✗	46.2	-	-	-	-	32.6	15.7	-	-
SCLIP [38]	ECCV’24	✓	✗	59.1	30.4	30.5	80.4	32.2	34.2	16.1	22.4	38.2
NACLIP	Ours	✓	✗	58.9	32.2	33.2	79.7	35.5	35.2	17.4	23.3	39.4
ReCo [33]	NeurIPS’22	✗	✓	25.1	19.9	15.7	57.7	21.6	22.3	11.2	14.8	23.5
TCL [7]	CVPR’23	✗	✓	55.0	30.4	31.6	83.2	24.3	33.9	17.1	22.4	37.2
FreeSeg-Diff [11]	arXiv’24	✗	✓	53.3	-	31.0	-	-	-	-	-	-
FOSSIL [3]	WACV’24	✗	✓	-	-	-	-	23.2	35.8	18.8	24.8	-
PnP-OVSS [27]	CVPR’24	✗	✓	51.3	-	36.2	-	-	28.0	14.2	17.9	-
SCLIP [38]	ECCV’24	✓	✓	61.7	31.5	32.1	83.5	34.1	36.1	17.8	23.9	40.1
NACLIP	Ours	✓	✓	64.1	35.0	36.2	83.0	38.3	38.4	19.1	25.7	42.5

Table 2: Evaluation using different CLIP-ViT backbones.

	ViT-B/16			ViT-B/32			ViT-L/14
Method	V21	PC59	Avg.	V21	PC59	Avg.	V21	PC59	Avg.
SCLIP [38]	61.7	36.1	48.9	54.8	30.2	42.5	45.4	27.4	36.4
GEM [4]	46.2	32.6	39.4	40.5	27.0	33.8	44.6	28.6	36.6
NACLIP	64.1	38.4	51.3	54.8	34.9	44.9	57.9	36.4	47.2

5.1 Experimental setup

Datasets.

We evaluate our method on the following segmentation benchmarks, whose names are abbreviated (in parentheses) to conserve table space: PASCAL VOC 2012 (V21) [13], ADE20K-150 (ADE) [51], PASCAL Context (PC60) [29], COCO-Stuff (C-Stf) [5], Cityscapes (City) [10], COCO-Object (C-Obj) [23]. Additionally, alongside the original benchmarks on these datasets, we follow [38] and evaluate on variants of PASCAL VOC 2012 (V20) and PASCAL Context (PC59) in which the background class is removed from the evaluation. Furthermore, input images are resized to have a shorter side of 336 (560 for Cityscapes [10], because of high-resolution images), and following prior works [38, 3, 21, 7, 42], a slide inference is carried out with a $224\times 224$ window and a stride of 112.

Baselines.

We compare our method to a set of relevant works in OVSS, including: MaskCLIP [52], ReCo [33], CLIP Surgery [21], SCLIP [38], GEM [4], CLIP-DIY [41], FreeSeg-Diff [11], FOSSIL [3], and PnP-OVSS [27]. We also include a few influential weakly supervised OVSS methods, such as GroupViT [42] and TCL [7] in our comparison. Furthermore, since vanilla CLIP [32] can be adapted for semantic segmentation, we include it as a baseline in our comparison tables as well.

Implementation details.

In our experiments, we employ pre-trained CLIP-ViT [32]. Unless mentioned otherwise, we use the ViT-B/16 backbone ( $16\times 16$ patch size) comprising 12 visual encoder blocks. Since our approach operates in a training-free manner, we exclusively utilize the frozen CLIP model without any optimization. The standard deviation of the Gaussian kernel, $\sigma$ in Eq. 10, is set to 5, a choice further elaborated upon in Appendix A. OVSS methods often incorporate a mask refinement step [3, 38, 7, 33, 27, 11], such as DenseCRF [20] or pixel-adaptive mask refinement (PAMR) [1]. For our method, we opt for PAMR as it is lighter and more efficient. We also report results without this refinement step. We resort to mIoU for the evaluation metric across all experiments.

5.2 Main results

Comparison to training-free OVSS methods.

In Tab. 1, we evaluate our proposed method against baselines mentioned in Sec. 5.1. Although most methods are presented as training-free OVSS, it is important to note that they vary in underlying characteristics. Specifically, within the subset of methods that refrain from fine-tuning, some leverage auxiliary pre-trained models [33, 3, 41, 27, 11]. Given our primary focus on achieving training-free OVSS exclusively through a frozen CLIP model, comparing our method to those utilizing additional weights and data, while informative, may not be entirely equitable. Hence, we explicitly denote fair comparisons in Tab. 1. Analyzing the results, our approach outperforms state-of-the-art training-free OVSS methods in 7 out of 8 benchmarks. This signifies the effectiveness of our proposed architecture and formulation of CLIP’s visual encoder. The improvement trend is also noticeable among methods that opted not to perform mask refinement steps compared to our method without post-processing. Note that results for newer methods [3, 41, 4, 21, 27, 11] are extracted from their respective manuscripts, hence might not include experiments across all benchmarks, whereas other results have been derived from [38].

Robustness to visual backbones.

Table 2 reports the segmentation results of NACLIP using different CLIP-ViT backbones, namely ViT-B/16 (default), ViT-B/32, and ViT-L/14. Compared to concurrent approaches [38, 4], our method is more robust to the backbone choice. For example, SCLIP [38] experiences a significant performance drop of over 12% on average when using ViT-L/14 instead of the default backbone, whereas the degradation is about 4% in our case. Moreover, our method outperforms GEM [4] by over 10% across all settings, making NACLIP a more robust solution compared to [38, 4].

Visual examples.

Figure 3 presents several segmentation maps generated by our method on the PASCAL Context (59) dataset [29]. These visualizations demonstrate NACLIP’s significant performance enhancement over CLIP. Furthermore, SCLIP seems to find difficulties in identifying the limits of objects properly, sometimes confusing nearby concepts (e.g., the first row). Besides, while SCLIP tends to focus mainly on similar patches without explicitly considering the surrounding context, our method maintains a local contextual understanding. This is evident in the second row (the bus image) where SCLIP misclassifies trees, likely due to its limited attention to nearby objects such as the road. More examples are available in the Appendix B.

5.3 Ablation study

Effect of altering the self-attention module.

In Sec. 4.1, we discussed the rationale behind introducing spatial consistency to attention maps and described our approach. Besides, we defined the Neighbourhood Only (N-Only in the table) setting, where patches exclusively attend to their neighbours, disregarding similarities. Furthermore, we defined the Key-Key Similarity (KK-Sim in the table) setting in Sec. 4.2, in which $\mathbf{k}\mathbf{k}^{\top}$ is used as the similarity measure. Results from the comparative analysis (Tab. 3) between N-Only and Vanilla, as well as between KK-Sim and Vanilla underscore the efficacy of the modifications proposed in Secs. 4.1 and 4.2, respectively.

Table 3: Ablation on the effect of spatial consistency and similarity measure. Vanilla refers to CLIP’s self-attention module,

\widetilde{A}

has been defined in Eq. 15, and the rest of the settings have been mentioned in Sec. 5.3. Throughout this experiment, we adhere to the default CLIP encoder architecture, retaining all architectural elements of the encoder block.

Atten.	V21	PC60	C-Obj	V20	City	PC59	ADE	C-Stf	Avg
Vanilla	18.6	7.8	6.5	49.1	6.7	11.2	3.2	5.7	13.6
N-Only	38.0	17.0	16.1	65.9	21.4	23.9	10.2	14.6	25.9
KK-Sim	36.0	15.6	10.9	64.9	24.8	26.2	9.6	15.1	25.4
$\widetilde{A}$	40.2	17.4	13.9	68.2	28.1	27.9	11.2	16.5	27.9

Impact of architectural reduction.

Here, we explore the effects of architectural modifications made to the final block of CLIP’s visual encoder, as outlined in Sec. 4.3. Specifically, we investigate the efficacy of utilizing the self-attention module’s output as the output of the final encoder block. As showcased in Tab. 4, the reduction in architecture and the removal of previously described operations yield substantial benefits in semantic segmentation tasks.

Table 4: Investigating architectural reduction’s impact. We compare CLIP’s default encoder architecture, denoted as Vanilla, with the Reduced setting described in Sec. 4.3, where the self-attention module’s output directly serves as the final encoder block’s output. Throughout this experiment, we utilize CLIP’s default self-attention module.

Arch.	V21	PC60	C-Obj	V20	City	PC59	ADE	C-Stf	Avg
Vanilla	18.6	7.8	6.5	49.1	6.7	11.2	3.2	5.7	13.6
Reduced	37.5	22.3	23.2	81.4	20.2	24.9	11.6	16.7	29.7

6 Conclusion

Drawing from CLIP’s remarkable zero-shot generalizability, the paradigm of open-vocabulary semantic segmentation leveraging this model has gained significant traction, emerging as an appealing alternative to circumvent the limitations of traditional close-set supervised training. In this work, we have explored the inherent weaknesses of CLIP for dense prediction and proposed simple and minimal modifications that accommodate this powerful model to the restrictive training-free OVSS scenario. In addition to removing components of CLIP’s visual encoder that hamper its localization capabilities, we have integrated a simple mechanism that explicitly encourages local consistency in the self-attention maps, which has been unexplored in the existing works. Extensive experiments on popular OVSS benchmarks showcase the superiority of our approach over other existing OVSS methods, some of which resort to impractical or unrealistic choices, such as leveraging auxiliary models trained on additional large datasets or relying on validation sets for hyperparameter tuning. While yielding state-of-the-art performance on 7 out of 8 benchmarks, NACLIP does not require access to either labeled or unlabeled data, becoming a suitable solution for real-world applications.

Limitations.

In CLIP’s pre-training, only the output at the [CLS] token’s position directly influences optimization [32]. Although this token plays a primary role in the original CLIP’s encoders, our attempts to extract information useful for segmentation from it were unsuccessful, having almost no effect on performance. We argue that this failure can be attributed to the divergent nature of CLIP’s pre-training and the specific requirements of dense prediction problems, in which objects of interest must be recognized with their local positioning. Despite this, we acknowledge that the [CLS] token has proven effective in many image-level tasks and may contain information transferable to dense prediction. Therefore, we believe the use of this token in dense prediction requires further investigation.

Acknowledgments

This work was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). We also thank Calcul Quebec and Compute Canada.

References

[1] Nikita Araslanov and Stefan Roth. Single-stage semantic segmentation from image labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[2] Luca Barsellotti, Roberto Amoroso, Lorenzo Baraldi, and Rita Cucchiara. Enhancing open-vocabulary semantic segmentation with prototype retrieval. In International Conference on Image Analysis and Processing, pages 196–208, 2023.
[3] Luca Barsellotti, Roberto Amoroso, Lorenzo Baraldi, and Rita Cucchiara. FOSSIL: Free open-vocabulary semantic segmentation through synthetic references retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1464–1473, 2024.
[4] Walid Bousselham, Felix Petersen, Vittorio Ferrari, and Hilde Kuehne. Grounding everything: Emerging localization properties in vision-language transformers. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition, 2024.
[5] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[6] Kaixin Cai, Pengzhen Ren, Yi Zhu, Hang Xu, Jianzhuang Liu, Changlin Li, Guangrun Wang, and Xiaodan Liang. Mixreorg: Cross-modal mixed patch reorganization is a good mask learner for open-world semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1196–1205, 2023.
[7] Junbum Cha, Jonghwan Mun, and Byungseok Roh. Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[8] Jun Chen, Deyao Zhu, Guocheng Qian, Bernard Ghanem, Zhicheng Yan, Chenchen Zhu, Fanyi Xiao, Sean Chang Culatana, and Mohamed Elhoseiny. Exploring open-vocabulary semantic segmentation from clip vision encoder distillation only. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 699–710, 2023.
[9] Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, and Chunhua Shen. Conditional positional encodings for vision transformers. In International Conference on Learning Representations, 2023.
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
[11] Barbara Toniella Corradini, Mustafa Shukor, Paul Couairon, Guillaume Couairon, Franco Scarselli, and Matthieu Cord. Freeseg-diff: Training-free open-vocabulary segmentation with diffusion models. arXiv preprint arXiv:2403.20105, 2024.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[13] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 2015.
[14] Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, and Michal Drozdzal. Improved baselines for vision-language pre-training. Transactions on Machine Learning Research, 2023.
[15] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 2023.
[16] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, pages 540–557, 2022.
[17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
[18] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[19] Laurynas Karazija, Iro Laina, Andrea Vedaldi, and Christian Rupprecht. Diffusion models for zero-shot open-vocabulary segmentation. arXiv preprint arXiv:2306.09316, 2023.
[20] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. Advances in Neural Information Processing Systems, 2011.
[21] Yi Li, Hualiang Wang, Yiqun Duan, and Xiaomeng Li. Clip surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653, 2023.
[22] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7070, 2023.
[23] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[24] Quande Liu, Youpeng Wen, Jianhua Han, Chunjing Xu, Hang Xu, and Xiaodan Liang. Open-world semantic segmentation via contrasting and clustering vision-language embedding. In European Conference on Computer Vision, pages 275–292, 2022.
[25] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
[26] Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, and Tianrui Li. Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In International Conference on Machine Learning, 2023.
[27] Jiayun Luo, Siddhesh Khandelwal, Leonid Sigal, and Boyang Li. Emergent open-vocabulary semantic segmentation from off-the-shelf vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4029–4040, 2024.
[28] Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. Image segmentation using deep learning: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(7):3523–3542, 2021.
[29] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2014.
[30] Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. Advances in Neural Information Processing Systems, 34:23296–23308, 2021.
[31] Jie Qin, Jie Wu, Pengxiang Yan, Ming Li, Ren Yuxi, Xuefeng Xiao, Yitong Wang, Rui Wang, Shilei Wen, Xin Pan, et al. Freeseg: Unified, universal and open-vocabulary image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19446–19455, 2023.
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763, 2021.
[33] Gyungin Shin, Weidi Xie, and Samuel Albanie. Reco: Retrieve and co-segment for zero-shot transfer. Advances in Neural Information Processing Systems, 2022.
[34] Julio Silva-Rodriguez, Sina Hajimiri, Ismail Ben Ayed, and Jose Dolz. A closer look at the few-shot adaptation of large vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2024.
[35] Oriane Siméoni, Chloé Sekkat, Gilles Puy, Antonín Vobeckỳ, Éloi Zablocki, and Patrick Pérez. Unsupervised object localization: Observing the background to discover objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3176–3186, 2023.
[36] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5227–5237, 2022.
[37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
[38] Feng Wang, Jieru Mei, and Alan Yuille. Sclip: Rethinking self-attention for dense vision-language inference. European Conference on Computer Vision, 2024.
[39] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21372–21383, 2023.
[40] Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, and Chen Change Loy. Clipself: Vision transformer distills itself for open-vocabulary dense prediction. In International Conference on Learning Representations, 2024.
[41] Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, and Oriane Siméoni. Clip-DIY: Clip dense inference yields open-vocabulary semantic segmentation for-free. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1403–1413, 2024.
[42] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[43] Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, and Weidi Xie. Learning open-vocabulary semantic segmentation models from natural language supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[44] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2966, 2023.
[45] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2945–2954, 2023.
[46] Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, and Xiang Bai. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In European Conference on Computer Vision, pages 736–753, 2022.
[47] Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, and Jianfeng Gao. Unified contrastive learning in image-text-label space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19163–19173, 2022.
[48] Tao Yu, Zhihe Lu, Xin Jin, Zhibo Chen, and Xinchao Wang. Task residual for tuning vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10899–10909, 2023.
[49] Fei Zhang, Tianfei Zhou, Boyang Li, Hao He, Chaofan Ma, Tianjiao Zhang, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. Uncovering prototypical knowledge for weakly open-vocabulary semantic segmentation. Advances in Neural Information Processing Systems, 36, 2024.
[50] Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. In European Conference on Computer Vision, pages 1–19, 11 2022.
[51] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 2019.
[52] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In European Conference on Computer Vision, 2022.

Supplementary Material for
Pay Attention to Your Neighbours:
Training-Free Open-Vocabulary Semantic Segmentation

Appendix A Gaussian kernel’s standard deviation

In a realistic training-free open-vocabulary scenario, where additional data access is restricted, there should be no validation set available for hyperparameter tuning. Consequently, it is crucial for training-free methods to operate effectively without such procedures. In our approach, we introduce a hyperparameter denoted as $\sigma$ , representing the standard deviation of the Gaussian kernel used in Eq. 10, which we set to 5 in our experiments. In this section, we detail the heuristics guiding this choice, enabling us to determine this value without the need for fine-tuning.

For a patch located at $\boldsymbol{\mu}$ , the Gaussian kernel increases its attention logits by $1$ at $\boldsymbol{\mu}$ and by lesser values at neighbouring patch locations. Our choice of $\sigma$ is based on the number of neighbouring patches whose attention logits are modified by more than a threshold $\tau$ . To achieve this, we express:

$\displaystyle\phi(\mathbf{x};\boldsymbol{\mu},\sigma)=\exp\left(-\frac{\|\|\mathbf{x}-\boldsymbol{\mu}\|\|^{2}}{2\sigma^{2}}\right)$	$\displaystyle>\tau$	(17)
$\displaystyle\Rightarrow\;\frac{\|\|\mathbf{x}-\boldsymbol{\mu}\|\|^{2}}{2\sigma^{2}}$	$\displaystyle<-\ln\tau$	(18)
$\displaystyle\Rightarrow\;\phantom{.}\|\|\mathbf{x}-\boldsymbol{\mu}\|\|^{2}$	$\displaystyle<-2\sigma^{2}\ln\tau$	(19)
$\displaystyle\Rightarrow\;\phantom{.}\|\|\mathbf{x}-\boldsymbol{\mu}\|\|^{2}$	$\displaystyle<\left(\sigma\sqrt{-2\ln\tau}\right)^{2}.$	(20)

Considering Eq. 20, neighbouring patches for which $\boldsymbol{\mu}$ ’s attention logits are increased by at least $\tau$ are positioned within of a circle centered on $\boldsymbol{\mu}$ with a radius of $\sigma\sqrt{-2\ln\tau}$ . For instance, with $\sigma=5$ , patch $\boldsymbol{\mu}$ ’s attention to 37 patches gets a logit increase of at least $0.8$ as illustrated in Fig. 4.

Table 5 displays the value of this heuristic measure for $\sigma\in\{1,2,\dots,10\}$ and $\tau\in\{0.7,0.8,0.9\}$ . Besides, CLIP [32] has been trained on $224\times 224$ pixel images, meaning the ViT-B/16 backbone operates on $14\times 14$ patches for each image. Based on this fact and considering the values provided in Tab. 5, we opted for $\sigma=5$ in our experiments to maintain a balance, i.e., to have neither too small nor too large field of attention. It is worth noting that $\tau$ is defined solely for the purpose of the described heuristic and does not play a role in our approach. In other words, there is no $\tau$ value to fine-tune in our approach.

Table 5: Proposed heuristic measure for determining

\sigma

value. For 3 values of

\tau

and 10 values of

\sigma

, the table provides the number of patches that patch

\boldsymbol{\mu}

’s attention logits to them is increased by more than

\tau

. It is important to note that these values are derived based on an infinite grid of patches, while in practice, these numbers could be less depending on the window size and

\boldsymbol{\mu}

’s position.

$\sigma$	$\tau=0.7$	$\tau=0.8$	$\tau=0.9$
1	1	1	1
2	9	5	1
3	21	13	5
4	37	21	9
5	57	37	21
6	81	49	21
7	109	69	37
8	145	89	45
9	177	113	57
10	221	137	69

Although we employed a heuristic measure to determine $\sigma$ , we provide in Fig. 5 the impact of varying $\sigma$ values on test set performance. Please note that these experiments were conducted after deciding to use $\sigma=5$ , and whose goal is to demonstrate that i) our heuristic approach to find $\sigma$ provides indeed a good value; and ii) the performance across different datasets is not strongly sensitive to the hyperparameter $\sigma$ .

Appendix B Visual examples

Additional visual examples can be found in Fig. 6 for PASCAL Context (59) [29], and in Fig. 7 for COCO-Object [23, 5]. Upon reviewing the images in Fig. 6, we can observe that SCLIP [38] often encounters difficulties in segmenting objects wholly and finding their boundaries (e.g., rows 1, 2, 4, and 8). We attribute this problem to SCLIP’s failure to consistently incorporate information from surrounding patches. Similar observations can be made for the first four rows of Fig. 7. However, an interesting minute distinction between the methods emerges in the final row of the figure. Notably, the pixels representing the cat’s eyes differ significantly from those of its skin, resulting in SCLIP failing to segment them as the same class. In contrast, NACLIP attentively considers the surrounding context of the eyes, resulting in accurate segmentation.

$\sigma$	$\tau=0.7$	$\tau=0.8$	$\tau=0.9$
1	1	1	1
2	9	5	1
3	21	13	5
4	37	21	9
5	57	37	21
6	81	49	21
7	109	69	37
8	145	89	45
9	177	113	57
10	221	137	69

$\sigma$	$\tau=0.7$	$\tau=0.8$	$\tau=0.9$
1	1	1	1
2	9	5	1
3	21	13	5
4	37	21	9
5	57	37	21
6	81	49	21
7	109	69	37
8	145	89	45
9	177	113	57
10	221	137	69

Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation