Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning
Abstract
Consistency regularization has prevailed in semi-supervised semantic segmentation and achieved promising performance. However, existing methods typically concentrate on enhancing the Image-augmentation based Prediction consistency and optimizing the segmentation network as a whole, resulting in insufficient utilization of potential supervisory information. In this paper, we propose a Multi-Constraint Consistency Learning (MCCL) approach to facilitate the staged enhancement of the encoder and decoder. Specifically, we first design a feature knowledge alignment (FKA) strategy to promote the feature consistency learning of the encoder from image-augmentation. Our FKA encourages the encoder to derive consistent features for strongly and weakly augmented views from the perspectives of point-to-point alignment and prototype-based intra-class compactness. Moreover, we propose a self-adaptive intervention (SAI) module to increase the discrepancy of aligned intermediate feature representations, promoting Feature-perturbation based Prediction consistency learning. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance. The source code and models are made available at https://github.com/NUST-Machine-Intelligence-Laboratory/MCCL.
Index Terms:
Semi-supervised semantic segmentation, Consistency regularization, Multi-Constraint Consistency Learning, Feature knowledge alignment, Self-adaptive intervention.
I Introduction
Recent substantial advancements in deep neural networks [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] have significantly improved the performance of various tasks. Semantic segmentation [12, 3, 13, 14, 15, 16, 17], a prevalent task in computer vision, requires the model to interpret and analyze scenes; its primary goal is to allocate a specific class label to each pixel in an image. Fully supervised semantic segmentation methods [13, 18, 19, 20, 21] mainly improve segmentation performance by designing novel model architectures and modeling basic pixel features to strengthen per-pixel representations. However, a major challenge lies in the need for comprehensive pixel-level annotations for training, which are labor-intensive and difficult to acquire. To mitigate this, semi-supervised semantic segmentation approaches have been introduced. These methods utilize a combination of limited labeled data and a substantial amount of unlabeled data, as proposed in various studies [22, 23, 24].

Semi-supervised semantic segmentation aims to utilize the insights gained from labeled data to extract more valuable information from unlabeled data, a key focus of this study. The learning framework of generative adversarial networks [25, 26, 27] was initially applied in the realm of semi-supervised semantic segmentation and demonstrated early progress. Subsequently, consistency regularization [28, 29, 30, 31, 32], a prominent technique in this field, strives to ensure that the network produces uniform predictions for variously augmented versions of the same input. For instance, several studies [33, 34, 35] have concentrated on enhancing prediction consistency by introducing different perturbations to the model. Concurrently, various data augmentation strategies [36, 37, 38, 39, 40] have been developed to enhance the generalization capabilities of the network. Nonetheless, current methods in consistency regularization primarily focus on Image-augmentation based Prediction (IP) consistency and the overall optimization of the segmentation network, which often leads to insufficient utilization of potential supervisory information, e.g., feature consistency.
We perform an in-depth analysis of the features in strongly and weakly augmented views using the state-of-the-art consistency regularization methods UniMatch [36] and AugSeg [37], normalizing the similarity between pixel features of the two views. The fundamental principle of consistency regularization is to maximize prediction consistency between differently augmented views, and Fig. 1 (a) illustrates that prediction consistency improves with increasing feature similarity. Consequently, one of the primary focuses of this paper is to promote feature consistency learning within the network.

Building on the observations presented above, this paper proposes a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder by imposing additional consistency constraints, thus fully utilizing the potential supervisory information in the network. As shown in Fig. 2, previous methods have overlooked the potential of Image-augmentation based Feature (IF) consistency. A robust encoder should yield consistent features for both strongly and weakly augmented views. To achieve this, we propose a feature knowledge alignment (FKA) strategy, enhancing IF consistency learning in the encoder. In particular, strong augmentation applied to the original images, such as color jitter, graying, and Gaussian blur, introduces complex contextual information into the strongly augmented views. In contrast, weak augmentation, limited to horizontal and vertical flipping, has a negligible effect on contextual information. Building on this, we encourage features from weakly augmented views to guide those from strongly augmented views, utilizing both point-to-point alignment and prototype-based intra-class feature compactness. Point-to-point alignment fundamentally relies on using the similarity between features at corresponding positions in differently augmented views as a loss function. This encourages features in the strongly augmented feature map to align with those in the weakly augmented one, thereby allowing the network to extract richer knowledge from complex contextual information. However, the network may still misclassify some outlier features that are far from the cluster centers during training. Therefore, we resort to prototype-based intra-class feature compactness. After obtaining the feature prototype for each class, we select several intra-class features from the weakly augmented view that are most similar to the class prototypes. Simultaneously, we identify outlier features from the strongly augmented view that are more distant from the prototypes. Then, we minimize the nearest neighbor similarity loss between these selections to encourage the network to focus more on outlier features.
Fig. 1 (b) illustrates that the FKA strategy substantially enhances feature consistency learning for the encoder. The high degree of similarity between features from the strongly and weakly augmented views results in consistent predictions from the decoder. While this outcome may seem desirable, it restricts the decoder’s learning scope within the feature space. The strongly and weakly augmented view features used for decoder training are highly similar, which simplifies the training process and may result in the decoder becoming trapped in local optima. In the field of semi-supervised semantic segmentation, a robust and efficient decoder should maintain the ability to generate consistent predictions even when confronted with paired features that exhibit considerable differences. To address this, we design a self-adaptive intervention (SAI) module to enhance decoder training through Feature-perturbation based Prediction (FP) consistency. The SAI module, tailored for each instance, acts on features of strongly augmented views, thereby broadening the decoder’s learning capability in a wider feature space. The module comprises two components: self-adaptive feature masking and noise injection. The former masks highly activated regions in the feature map of strongly augmented views based on inter-view feature similarity, reducing network dependency on these regions and averting overfitting. The latter injects self-adaptive noise into features to diversify them, thus expanding the decoder’s learning range. The outcomes of these interventions are guided by predictions from the weakly augmented view to achieve FP consistency. Importantly, as the gradient backpropagation between the encoder and decoder remains uninterrupted, the conventional Image-augmentation based Prediction (IP) consistency is also inherently sustained in our Multi-Constraint Consistency Learning. Our contributions can be summarized as follows:
(1) We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder by imposing additional consistency constraints on the network, maximizing the utilization of available supervisory information.
(2) A feature knowledge alignment strategy is designed to encourage the feature consistency learning of the encoder from image-augmentation, specifically from the perspectives of point-to-point alignment and prototype-based intra-class feature compactness.
(3) We design a self-adaptive intervention module to promote prediction consistency of feature-perturbation, enabling the decoder to learn more useful information from the broader feature space.
(4) Extensive experiments on the Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed method achieves new state-of-the-art performance.
II Related Works
II-A Semantic Segmentation
Semantic segmentation aims to assign the correct label to every pixel in an image, enabling machines to analyze scenes, and finds wide application across fields including autonomous driving, medical image analysis, and augmented reality. Prior semantic segmentation methods [41, 42, 43, 44] have concentrated on crafting innovative network architectures to achieve more robust pixel feature representations. Examples include PSPNet [14], which integrates a pyramid pooling module, and CCNet [45], which employs a criss-cross attention mechanism. Current mainstream semantic segmentation methods [46, 47, 48, 49, 50] strive to extract diverse semantic information from the image context to aid the decoder in classification, thus enhancing segmentation accuracy. MCIBI++ [48] assumes that class semantic features at the dataset level follow a Gaussian distribution; during each training and testing instance, semantic features for each class are randomly sampled from their respective mean and variance, thereby enriching the basic pixel feature representations. IDRNet [49] guides contextual relationship modeling between pixels with a deletion diagnostic procedure, thus reducing reliance on extensive prior information. PSS [50] uses a set of prototypes to describe the features of each class and construct a more discriminative feature embedding space. The exceptional performance exhibited by the methods mentioned above heavily relies on the quantity of available labeled data.
II-B Semi-supervised Semantic Segmentation
Semi-supervised semantic segmentation (S4) endeavors to extract valuable information from unlabeled images, leveraging insights from a limited set of labeled images to improve model performance. Initially, the focus was predominantly on using generative adversarial networks (GANs) [51, 26, 52, 53, 27] to yield high-quality predictions. More recently, pseudo-labeling methods [23, 54, 55, 56] have driven significant progress. U2PL [57] uses unreliable pseudo labels for contrastive learning to enhance the performance of the model. The LDR [55] approach capitalizes on relational knowledge between class semantic concepts to refine incorrect pseudo labels. FPL [58] promotes adaptive fuzzy positive predictions while concurrently minimizing the likelihood of false negatives. ESL [59] employs high-entropy predictions to dynamically preserve high-probability classes. DCC [60] focuses on context consistency in randomly sized local patches through contrastive learning. In contrast, our proposed method emphasizes learning the consistency of paired pixel features in the global context using a similarity loss.
Consistency regularization represents a significant research avenue in semi-supervised semantic segmentation, with recent methodologies [61, 62, 60, 63, 64] delivering commendable results. RC2L [62] introduces region-level contrast along with consistency regularization, effectively reducing the impact of erroneous pixel noise on pixel-level regularization. PS-MT [65] innovates with an auxiliary teacher model and a more stringent confidence-weighted cross-entropy (CE) loss, replacing mean squared error loss to heighten segmentation accuracy on unlabeled data. MKD [32] advocates for mutual knowledge distillation, using two auxiliary mean models to supervise and facilitate knowledge exchange between student models. UniMatch [36] devises a dual-stream augmentation approach, enabling the simultaneous supervision of two strongly augmented views by a weakly augmented view. AugSeg [37] and Classmix [40] both aim to use a variety of data augmentation techniques to enhance prediction consistency in neural networks. However, these methods are primarily centered on Image-augmentation based Prediction (IP) consistency. In contrast, our approach emphasizes leveraging both Image-augmentation based Feature (IF) consistency and Feature-perturbation based Prediction (FP) consistency, by introducing additional consistency constraints to facilitate staged improvements in both the encoder and decoder.

III Proposed Method
This section describes the content of our method in detail. Initially, we introduce the overall architecture and training process of the network, followed by a successive introduction of the feature knowledge alignment module and the self-adaptive intervention module. Finally, several loss functions are summarized for the optimization of network parameters.
III-A Overall Framework
This paper investigates extracting more valuable information from unlabeled data, utilizing the insights gained from a limited labeled set $\mathcal{D}^l=\{(x^l_i, y^l_i)\}_{i=1}^{N_l}$ and a vast collection of unlabeled data $\mathcal{D}^u=\{x^u_i\}_{i=1}^{N_u}$. As depicted in Fig. 3 (a), a weakly augmented labeled image is fed into the network to produce the prediction $p^l$, guided by the ground truth $y^l$. In contrast, an unlabeled image $x^u$ first undergoes strong and weak augmentations simultaneously, and is then input into the encoder, resulting in two different sets of feature representations, $F^s$ and $F^w$. Our Multi-Constraint Consistency Learning (MCCL) approach integrates a feature knowledge alignment strategy and a self-adaptive feature intervention module. The feature knowledge alignment strategy employs similarity measures to regulate the features of the strongly augmented view $F^s$, focusing on point-to-point alignment and prototype-based intra-class feature compactness. The self-adaptive feature intervention module generates self-adaptive masking feature representations $F^{mask}$ and noisy feature representations $F^{noise}$, based on the cosine similarity between $F^s$ and $F^w$, tailored for each instance. Ultimately, the predictions from $F^{mask}$ and $F^{noise}$ are supervised by the prediction $p^w$ from the weakly augmented view.
III-B Feature Knowledge Alignment
Consistency regularization methods leverage prediction results from weakly augmented views to supervise those from strongly augmented views, thereby attaining Image-augmentation based Prediction (IP) consistency. However, these methods often neglect the crucial aspect of feature consistency under image augmentation. As depicted in Fig. 3 (c), our feature knowledge alignment strategy is designed to alleviate this problem. It prompts the features of the weakly augmented view to supervise those of the strongly augmented one from two primary perspectives: point-to-point alignment and prototype-based intra-class feature compactness.
Point-to-point alignment. The feature knowledge alignment, focusing on point-to-point alignment, is designed to preserve consistent contextual information across the feature representations $F^s$ and $F^w$ of differently augmented views.
In alignment with prior studies such as UniMatch [36] and AugSeg [37], we apply both strong and weak augmentations ($\mathcal{A}^s$ and $\mathcal{A}^w$, adapted from UniMatch [36]) to the unlabeled image $x^u$, resulting in a strongly augmented view $x^s$ and a weakly augmented view $x^w$:
$x^s = \mathcal{A}^s\!\left(x^u\right), \quad x^w = \mathcal{A}^w\!\left(x^u\right).$  (1)
$x^s$ and $x^w$ are fed into the encoder $E(\cdot)$ to obtain the corresponding feature representations $F^s$ and $F^w$:
$F^s = E\!\left(x^s\right), \quad F^w = E\!\left(x^w\right).$  (2)
Next, we compute the cosine similarity $s_i$ between $F^s$ and $F^w$ according to the following equation:
$s_i = \frac{F^s_i \cdot F^w_i}{\left\| F^s_i \right\| \left\| F^w_i \right\|}, \quad i \in \{1, \dots, H \times W\},$  (3)
where $H \times W$ is the spatial resolution of $F^s$ and $F^w$, and $F^s_i$ and $F^w_i$ denote the point features at the $i$-th position of $F^s$ and $F^w$ in the current iteration. The point-to-point alignment loss $\mathcal{L}_{p2p}$ is computed using the formula below:
$\mathcal{L}_{p2p} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{H \times W} \sum_{i=1}^{H \times W} \left(1 - s_{b,i}\right),$  (4)
where $B$ represents the batch size.
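To make this step concrete, a minimal PyTorch-style sketch of the point-to-point alignment loss is given below; the tensor layout, the function name, and the stop-gradient on the weak-view features are illustrative assumptions rather than details taken from the released implementation.

```python
import torch
import torch.nn.functional as F

def point_to_point_loss(feat_s: torch.Tensor, feat_w: torch.Tensor) -> torch.Tensor:
    """Point-to-point alignment loss of Eq. (4) for [B, C, H, W] feature maps.

    The weakly augmented features serve as the alignment target, so gradients
    are not propagated through them here (an assumed design choice).
    """
    # Cosine similarity over the channel dimension -> [B, H, W].
    sim = F.cosine_similarity(feat_s, feat_w.detach(), dim=1)
    # Push the similarity at every spatial position towards 1.
    return (1.0 - sim).mean()

# Usage with random tensors standing in for the encoder outputs.
feat_s = torch.randn(2, 256, 65, 65, requires_grad=True)
feat_w = torch.randn(2, 256, 65, 65)
loss_p2p = point_to_point_loss(feat_s, feat_w)
loss_p2p.backward()
```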
Intra-class feature compactness. Outlier features are often ignored by consistency regularization-based methods. The feature knowledge alignment strategy, from the perspective of prototype-based intra-class feature compactness, makes the network pay more attention to these outlier features.
After obtaining $F^w$ with Eq. (2), we input it into the decoder $D(\cdot)$ to get the class probability distribution $p^w$:
$p^w = D\!\left(F^w\right).$  (5)
Then we calculate the feature set $F^w_c$ of class $c$ in the weakly augmented view features as follows:
$F^w_c = \left\{ F^w_i \;\middle|\; \mathbb{1}\!\left(\arg\max_{c'} p^w_{i,c'} = c\right) = 1,\; i \in \{1, \dots, H \times W\} \right\}.$  (6)
The size of $F^w_c$ is $N_c \times d$, where $N_c$ is the number of selected features and $d$ is the number of channels. $\mathbb{1}(\cdot)$ is a judgment function. Similarly, we get the features $F^s_c$ of class $c$ in the strongly augmented view using the following formula:
$F^s_c = \left\{ F^s_i \;\middle|\; \mathbb{1}\!\left(\arg\max_{c'} p^w_{i,c'} = c\right) = 1,\; i \in \{1, \dots, H \times W\} \right\}.$  (7)
Next, we select the intra-cluster features $\hat{F}^w_c$ from the weakly augmented view features:
$\hat{F}^w_c = \left\{ F^w_j \in F^w_c \;\middle|\; \cos\!\left(F^w_j, z_c\right) \ge \tau_K \right\},$  (8)
where $\hat{F}^w_c$ is selected from $F^w_c$. We rank the features in $F^w_c$ based on the cosine similarity between the features of class $c$ in the current mini-batch and the class prototype $z_c$, and $\tau_K$ denotes the $K$-th highest cosine similarity value. On the other hand, we select the outlier features $\hat{F}^s_c$ in the strongly augmented view features as follows:
$\hat{F}^s_c = \left\{ F^s_j \in F^s_c \;\middle|\; \cos\!\left(F^s_j, z_c\right) \le \tau_N \right\},$  (9)
where $\hat{F}^s_c$ is selected from $F^s_c$. Similar to Eq. (8), $\tau_N$ is the $N$-th lowest cosine similarity value. Finally, we compute the outlier feature loss $\mathcal{L}_{of}$ using the following formula:
$\mathcal{L}_{of} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{\left|\hat{F}^s_c\right|} \sum_{F^s_j \in \hat{F}^s_c} \left( 1 - \cos\!\left(F^s_j, \operatorname{NN}\!\left(F^s_j, \hat{F}^w_c\right)\right) \right),$  (10)
where $\operatorname{NN}(F^s_j, \hat{F}^w_c)$ is committed to finding the nearest intra-cluster feature in $\hat{F}^w_c$ for each outlier feature $F^s_j$ in $\hat{F}^s_c$, and the similarity-based loss function aligns each outlier feature with its nearest intra-cluster feature. $C$ is the number of classes in the dataset.
In addition, we need to update the prototype $z_c$ of class $c$ to filter out more reliable intra-cluster features, so that the network can focus on outlier features more accurately:
$z_c \leftarrow \mu \, z_c + (1 - \mu) \, \bar{F}^w_c,$  (11)
where $\mu$ is a hyper-parameter that controls the updating speed and $\bar{F}^w_c$ denotes the mean feature of class $c$ in the weakly augmented view of the current mini-batch. Following PCR [66], we set $\mu$ to 0.99.
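A possible PyTorch-style sketch of the prototype update and the intra-class compactness loss is shown below; the tensor layout, the helper arguments, and the decision to detach the prototype update and the intra-cluster targets are assumptions made for illustration, not details of the released implementation.

```python
import torch
import torch.nn.functional as F

def compactness_loss(feat_s, feat_w, pseudo, prototypes, momentum=0.99, k=16, n=256):
    """Prototype-based intra-class compactness, sketched under assumed shapes.

    feat_s, feat_w : [B, d, H, W] encoder features of the strong / weak views.
    pseudo         : [B, H, W] hard pseudo labels from the weak-view prediction.
    prototypes     : [C, d] running class prototypes, updated in place by EMA.
    k, n           : numbers of intra-cluster and outlier features (16 / 256 here).
    """
    d = feat_s.shape[1]
    fs = feat_s.permute(0, 2, 3, 1).reshape(-1, d)   # [B*H*W, d]
    fw = feat_w.permute(0, 2, 3, 1).reshape(-1, d)
    labels = pseudo.reshape(-1)

    loss, num_classes = fs.new_zeros(()), prototypes.shape[0]
    for c in range(num_classes):
        mask = labels == c
        if mask.sum() == 0:
            continue
        # EMA prototype update with the mean weak-view feature of class c (Eq. 11).
        with torch.no_grad():
            prototypes[c] = momentum * prototypes[c] + (1 - momentum) * fw[mask].mean(0)

        sim_w = F.cosine_similarity(fw[mask], prototypes[c][None], dim=1)
        sim_s = F.cosine_similarity(fs[mask], prototypes[c][None], dim=1)
        intra = fw[mask][sim_w.argsort(descending=True)[:k]]       # Eq. (8)
        outlier = fs[mask][sim_s.argsort(descending=False)[:n]]    # Eq. (9)

        # Nearest-neighbor similarity: pull each outlier towards its closest
        # intra-cluster feature (Eq. 10).
        sim = F.cosine_similarity(outlier[:, None], intra[None].detach(), dim=2)
        loss = loss + (1 - sim.max(dim=1).values).mean()
    return loss / num_classes

# Usage: prototypes is a [C, d] buffer kept across iterations.
protos = torch.zeros(21, 256)
loss_of = compactness_loss(torch.randn(2, 256, 65, 65), torch.randn(2, 256, 65, 65),
                           torch.randint(0, 21, (2, 65, 65)), protos)
```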

III-C Self-adaptive Intervention
Our feature knowledge alignment strategy motivates the encoder to attain feature consistency through image-augmentation. However, it inadvertently restricts the learning scope of the decoder. To address this, as depicted in Fig. 3 (b) and drawing inspiration from the feature perturbations in CCT [29], we introduce a self-adaptive intervention module. This module is designed to enhance the prediction consistency of the decoder via feature perturbation, thereby broadening the decoder’s capacity to assimilate valuable information within the feature space. The module encompasses two main components: self-adaptive feature masking and noise injection.
# E: encoder, D: decoder, U(a, b): uniform distribution.
# z_c: prototype of class c, C: the number of classes.
# mse: mean squared error, lam_1/lam_2/lam_3: loss weights.
# x_w, x_s: weakly and strongly augmented views.
# delta, v_l, v_r: intervention value, left and right boundary values.
# nn_sim_loss: nearest-neighbor similarity loss.
for (x_w, x_s) in loader_u:
    f_w, f_s = E(x_w), E(x_s)
    p_w = D(f_w); pseudo = p_w.argmax(dim=1)
    # point-to-point alignment (IF consistency)
    cosine = nn.CosineSimilarity(dim=1)
    loss_p2p = (1 - cosine(f_s, f_w)).mean()
    # intra-class feature compactness (IF consistency)
    loss_of = 0
    for c in range(C):
        f_w_c, f_s_c = f_w[pseudo == c], f_s[pseudo == c]
        indice = torch.sort(cosine(f_w_c, z_c), descending=True)[1]
        intra_c = f_w_c[indice][:K]
        index = torch.sort(cosine(f_s_c, z_c))[1]
        outlier_c = f_s_c[index][0:N]
        # outlier feature loss
        loss_of = loss_of + nn_sim_loss(outlier_c, intra_c)
    loss_of = loss_of / C  # average over classes
    # self-adaptive intervention (FP & IP consistency)
    f_m, f_n = sa_masking(f_s, f_w), sa_noise(f_s, f_w)  # Sec. III-C, Eqs. (12)-(21)
    loss_mask, loss_noise = mse(D(f_m), p_w), mse(D(f_n), p_w)
    loss_u = lam_1 * loss_p2p + lam_2 * loss_of + lam_3 * (loss_mask + loss_noise)
Self-adaptive feature masking. During training, the model frequently overemphasizes activated regions, adversely impacting its generalization performance. To mitigate this, adaptive masking of features is essential for improving the model’s robustness, as shown in Fig. 4. We commence by computing the self-adaptive intervention value $\delta$ using the point-to-point similarity $s_i$:
$\delta = \beta \cdot \bar{s}, \quad \bar{s} = \frac{1}{H \times W} \sum_{i=1}^{H \times W} s_i,$  (12)
where $\bar{s} \in [-1,1]$ is the mean point-to-point similarity of the current instance, and $\beta$ is the scaling factor. Eq. (12) can be seen as adaptively adjusting the intervention value within a certain range based on the similarity $\bar{s}$. The higher the similarity, the greater the intervention value. We obtain the self-adaptive intervention left and right boundary values ($v_l$ and $v_r$) according to the following equations:
$v_l = 1 - 2\delta,$  (13)
$v_r = 1 - \delta.$  (14)
The masking matrix $M$ is then obtained as follows:
$M_i = \mathbb{1}\!\left( \operatorname{Avg}(F^s)_i < \theta \cdot \max_j \operatorname{Avg}(F^s)_j \right), \quad \theta \sim U(v_l, v_r),$  (15)
where the size of $M$ is $1 \times H \times W$, $\operatorname{Avg}(\cdot)$ denotes the average over the channel dimension, and $U(v_l, v_r)$ represents a uniform distribution. Eqs. (13) and (14) are focused on determining the left and right boundary values of the intervention based on $\delta$. As the intervention value increases, the corresponding left and right boundary values decrease, leading to more regions being filtered out as described in Eq. (15). Thus, the core of self-adaptive feature masking lies in adaptively filtering out highly activated regions based on the similarity between strongly and weakly augmented view features.
We perform the Hadamard product between $F^s$ and $M$ to generate the self-adaptive masking feature representations $F^{mask}$:
$F^{mask} = F^s \odot M.$  (16)
We then feed $F^{mask}$ into the decoder to obtain its prediction $p^{mask}$:
$p^{mask} = D\!\left(F^{mask}\right).$  (17)
The self-adaptive feature masking loss $\mathcal{L}_{mask}$ is defined as follows:
$\mathcal{L}_{mask} = \frac{1}{B} \sum_{b=1}^{B} d\!\left(p^{mask}_b, p^w_b\right),$  (18)
where $d(\cdot, \cdot)$ is a distance measure between two class probability distributions; the subsequent experiment section details the impact of three such measures on model performance: KL divergence, cross-entropy (CE), and mean squared error (MSE).
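A minimal PyTorch-style sketch of the self-adaptive feature masking step is given below; the max-normalization of the channel-averaged activation map, the per-instance threshold, and the boundary mapping follow the reconstruction in Eqs. (12)-(16) and should be read as illustrative assumptions rather than the exact released implementation.

```python
import torch

def self_adaptive_masking(feat_s: torch.Tensor, sim: torch.Tensor, beta: float = 0.15):
    """Self-adaptive feature masking sketch.

    feat_s : [B, d, H, W] features of the strongly augmented view.
    sim    : [B] mean point-to-point cosine similarity to the weak view, in [-1, 1].
    """
    delta = beta * sim                                    # intervention value (Eq. 12)
    v_l, v_r = 1 - 2 * delta, 1 - delta                   # assumed boundaries (Eqs. 13-14)
    act = feat_s.mean(dim=1, keepdim=True)                # channel-averaged activation map
    act = act / act.amax(dim=(2, 3), keepdim=True).clamp_min(1e-6)
    # One threshold per instance, drawn uniformly from [v_l, v_r] (Eq. 15).
    thresh = v_l + (v_r - v_l) * torch.rand_like(sim)
    mask = (act < thresh.view(-1, 1, 1, 1)).float()       # drop highly activated regions
    return feat_s * mask                                   # Hadamard product (Eq. 16)

# Usage: higher similarity -> larger delta -> lower thresholds -> more masking.
feats = torch.rand(2, 256, 65, 65)
sims = torch.tensor([0.9, 0.3])
masked = self_adaptive_masking(feats, sims)
```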
Methods | Publication | Backbone | 1/8 (183) | 1/4 (366) | 1/2 (732) | Full (1464)
---|---|---|---|---|---|---
Supervised | - | ResNet-50 | 52.26 | 61.65 | 66.72 | 72.94 |
Supervised | - | Transformer | 63.62 | 70.76 | 75.44 | 77.01 |
ECS [24] | ECCV 2020 | ResNet-50 | 70.20 | 72.60 | - | 76.30 |
PseudoSeg [67] | ICLR 2021 | ResNet-50 | 61.88 | 64.85 | 70.42 | 71.00 |
CPS [23] | CVPR 2021 | ResNet-50 | 67.42 | 71.71 | 75.88 | - |
PC2Seg [68] | ICCV 2021 | ResNet-50 | 64.63 | 67.62 | 70.90 | 72.26 |
DCC [60] | CVPR 2021 | ResNet-50 | 72.40 | 74.00 | - | 76.50 |
ELN [69] | CVPR 2022 | ResNet-50 | 73.20 | 74.63 | - | - |
GuidedMix-Net [70] | AAAI 2022 | ResNet-50 | 73.40 | 75.50 | 76.50 | - |
MKD [32] | ACMMM 2023 | ResNet-50 | 66.74 | 71.01 | 72.73 | 78.14 |
CPCL [71] | TIP 2023 | ResNet-50 | 67.02 | 72.14 | 74.25 | - |
AugSeg [37] | CVPR 2023 | ResNet-50 | 72.17 | 76.17 | 77.40 | 78.82 |
UniMatch [36] | CVPR 2023 | ResNet-50 | 72.48 | 75.96 | 77.39 | 78.70 |
SemiCVT [72] | CVPR 2023 | Transformer | 71.26 | 74.99 | 78.54 | 80.32 |
ESL [59] | ICCV 2023 | ResNet-50 | 69.50 | 72.63 | 74.69 | 77.11 |
Ours | - | ResNet-50 | 75.22 | 76.45 | 78.51 | 79.62 |
Ours | - | Transformer | 75.91 | 79.89 | 80.72 | 81.96 |
Self-adaptive feature noise injection. In addition to feature masking, we propose the adaptive injection of noise into the features of strongly augmented views to enhance feature diversity. Adaptive generation of the uniform noise $N$ is based on the intervention value $\delta$ as defined in Eq. (12):
$N \sim U(-\delta, \delta),$  (19)
where $U(-\delta, \delta)$ produces a tensor of the same size as $F^s$, with all positions taking values in $[-\delta, \delta]$. We then get the self-adaptive noise feature representations $F^{noise}$ by injecting the uniform noise into the feature representations of the strongly augmented view $F^s$:
$F^{noise} = F^s + N.$  (20)
$F^{noise}$ is fed into the decoder to get the prediction $p^{noise}$:
$p^{noise} = D\!\left(F^{noise}\right).$  (21)
Finally, we compute the self-adaptive feature noise injection loss $\mathcal{L}_{noise}$ according to the following equation:
$\mathcal{L}_{noise} = \frac{1}{B} \sum_{b=1}^{B} d\!\left(p^{noise}_b, p^w_b\right).$  (22)
Self-adaptive feature noise injection can assist the decoder in expanding the learning range in feature space and thus learning more useful information.
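Similarly, the noise-injection branch can be sketched as follows; the additive formulation and the tensor shapes mirror the reconstruction in Eqs. (19)-(20) and are illustrative assumptions.

```python
import torch

def self_adaptive_noise(feat_s: torch.Tensor, sim: torch.Tensor, beta: float = 0.15):
    """Self-adaptive noise injection, sketched with an assumed additive formulation.

    feat_s : [B, d, H, W] features of the strongly augmented view.
    sim    : [B] mean point-to-point cosine similarity to the weak view.
    """
    delta = (beta * sim).view(-1, 1, 1, 1)        # per-instance intervention value (Eq. 12)
    # Uniform noise in [-delta, delta] with the same shape as feat_s (Eq. 19).
    noise = (torch.rand_like(feat_s) * 2 - 1) * delta
    return feat_s + noise                          # injected features (Eq. 20)

# Instances whose views already look similar receive stronger noise.
feats = torch.randn(2, 256, 65, 65)
noisy = self_adaptive_noise(feats, torch.tensor([0.8, 0.2]))
```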
The main difference between feature perturbations in CCT [29] and ours is that our proposed self-adaptive intervention module intervenes on strongly augmented view features in an instance-specific manner based on the similarity of the differently augmented view features, while CCT perturbs features in an instance-independent manner. Furthermore, unlike CCT’s feature perturbations that necessitate multiple decoders, our module efficiently operates with a single decoder, significantly reducing the complexity of the model.
III-D Overall Training Objective
The detailed process of the proposed MCCL can be found in Algorithm 1. We train the network jointly with a supervised loss $\mathcal{L}_s$ for labeled data and a consistency loss $\mathcal{L}_u$ for unlabeled data. The overall training loss during each iteration is formulated as:
$\mathcal{L} = \mathcal{L}_s + \mathcal{L}_u.$  (23)
We apply the weak augmentation $\mathcal{A}^w$ to the labeled data $x^l$ and then input them into the network to get the predicted results:
$p^l = D\!\left(E\!\left(\mathcal{A}^w\!\left(x^l\right)\right)\right),$  (24)
where $p^l$ is the predicted class probability distribution. The standard cross-entropy loss is used to supervise network training on labeled data:
$\mathcal{L}_s = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{H \times W} \sum_{i=1}^{H \times W} \ell_{ce}\!\left(p^l_{b,i}, y^l_{b,i}\right),$  (25)
where $H \times W$ is the resolution of the image, $i$ indexes the $i$-th pixel on the image, $\ell_{ce}$ is the cross-entropy loss, and $y^l$ denotes the ground truth.
The consistency loss $\mathcal{L}_u$ consists of the point-to-point alignment loss $\mathcal{L}_{p2p}$, the outlier feature loss $\mathcal{L}_{of}$, the self-adaptive feature masking loss $\mathcal{L}_{mask}$, and the self-adaptive feature noise injection loss $\mathcal{L}_{noise}$. The formula for the consistency loss is as follows:
$\mathcal{L}_u = \lambda_1 \mathcal{L}_{p2p} + \lambda_2 \mathcal{L}_{of} + \lambda_3 \left(\mathcal{L}_{mask} + \mathcal{L}_{noise}\right),$  (26)
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weights of the corresponding losses.
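As a usage illustration, the overall objective of Eqs. (23)-(26) can be assembled per iteration as in the hypothetical snippet below; the individual loss tensors are assumed to be computed as sketched earlier, and the weight values follow those reported in the ablation study.

```python
import torch

LAM1, LAM2, LAM3 = 0.1, 0.01, 0.01  # loss weights reported in the experiments

def overall_loss(loss_sup, loss_p2p, loss_of, loss_mask, loss_noise):
    """Combine the supervised and consistency losses (Eqs. 23-26)."""
    loss_u = LAM1 * loss_p2p + LAM2 * loss_of + LAM3 * (loss_mask + loss_noise)
    return loss_sup + loss_u

# Example with placeholder scalar losses.
losses = [torch.tensor(v) for v in (0.52, 0.21, 0.34, 0.08, 0.07)]
print(overall_loss(*losses))
```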
Methods | Publication | ResNet-50 1/16 (662) | ResNet-50 1/8 (1323) | ResNet-50 1/4 (2646) | ResNet-101 1/16 (662) | ResNet-101 1/8 (1323) | ResNet-101 1/4 (2646)
---|---|---|---|---|---|---|---
Supervised | - | 62.40 | 68.20 | 72.30 | 67.50 | 71.10 | 74.20 |
MT [34] | NeurIPS 2017 | 66.80 | 70.80 | 73.20 | 70.60 | 73.20 | 76.60 |
CCT [29] | CVPR 2020 | 65.21 | 70.90 | 73.40 | 67.94 | 73.00 | 76.17 |
CPS [23] | CVPR 2021 | 71.98 | 73.67 | 74.90 | 74.48 | 76.44 | 77.68 |
ST++ [73] | CVPR 2022 | 72.60 | 74.40 | 75.40 | 74.50 | 76.30 | 76.60 |
ELN [69] | CVPR 2022 | 70.50 | 73.20 | 74.60 | 72.50 | 75.10 | 76.60 |
U2PL [57] | CVPR 2022 | 72.00 | 75.10 | 76.20 | 74.40 | 77.60 | 78.70 |
PS-MT [65] | CVPR 2022 | 72.83 | 75.70 | 76.43 | 75.50 | 78.20 | 78.70 |
AugSeg [37] | CVPR 2023 | 74.66 | 75.99 | 77.16 | 77.01 | 77.31 | 78.82 |
CCVC [31] | CVPR 2023 | 74.50 | 76.10 | 76.40 | 77.20 | 78.40 | 79.00 |
UniMatch [36] | CVPR 2023 | 75.80 | 76.90 | 76.80 | 78.10 | 78.40 | 79.20 |
ESL [59] | ICCV 2023 | 73.41 | 75.86 | 76.80 | 76.36 | 78.57 | 79.02 |
Ours | - | 76.06 | 77.55 | 77.27 | 78.53 | 78.71 | 79.30 |
Methods | Publication | Transformer 1/16 (662) | Transformer 1/8 (1323) | Transformer 1/4 (2646)
---|---|---|---|---
Supervised | - | 72.01 | 73.20 | 76.62
SemiCVT [72] | CVPR 2023 | 78.20 | 79.95 | 80.20
Ours | - | 80.82 | 81.17 | 81.44
Methods | Publication | Backbone | 1/16 (186) | 1/8 (372) | 1/4 (744) | 1/2 (1488)
---|---|---|---|---|---|---
Supervised | - | ResNet-50 | 63.30 | 70.20 | 73.10 | 76.60 |
Supervised | - | Transformer | 67.17 | 73.10 | 75.12 | 78.55 |
MT [34] | NeurIPS 2017 | ResNet-50 | 66.14 | 72.03 | 74.47 | 77.43 |
CCT [29] | CVPR 2020 | ResNet-50 | 66.35 | 72.46 | 75.68 | 76.78 |
CPS [23] | CVPR 2021 | ResNet-50 | 69.79 | 74.39 | 76.85 | 78.64 |
ELN [69] | CVPR 2022 | ResNet-50 | - | 70.33 | 73.52 | 75.33 |
U2PL [57] | CVPR 2022 | ResNet-50 | 69.00 | 73.00 | 76.30 | 77.10 |
PS-MT [65] | CVPR 2022 | ResNet-50 | - | 75.76 | 76.92 | 77.64 |
CPCL [71] | TIP 2023 | ResNet-50 | 69.92 | 74.60 | 76.98 | 78.17 |
CCVC [31] | CVPR 2023 | ResNet-50 | 74.90 | 76.40 | 77.30 | - |
UniMatch [36] | CVPR 2023 | ResNet-50 | 75.03 | 76.77 | 77.49 | 78.60 |
SemiCVT [72] | CVPR 2023 | Transformer | 72.19 | 75.41 | 77.17 | 79.55 |
ESL [59] | ICCV 2023 | ResNet-50 | 71.07 | 76.25 | 77.58 | 78.92 |
Ours | - | ResNet-50 | 75.69 | 76.83 | 78.20 | 79.33 |
Ours | - | Transformer | 74.12 | 77.03 | 78.63 | 80.87 |
IV EXPERIMENTS
This section outlines the experiments of the proposed method on various datasets and detailed ablation studies. We first elaborate on the experimental datasets and demonstrate the details of our approach during training. Subsequently, we compare the performance of our method with other methods on multiple datasets. Finally, we conduct detailed ablation experiments on each component of the proposed method to demonstrate their effectiveness. Comprehensive analyses are carried out on both the comparative experimental results and the ablation study results.
IV-A Experiment Setup
Datasets. We evaluate the performance of our method on Pascal VOC2012 [74] and Cityscapes [75] datasets. The Pascal VOC2012 is a standard semantic segmentation dataset comprising 20 foreground classes and 1 background class. 1464 fine-labeled training images and 1449 validation images collectively constitute the original Pascal VOC2012 dataset. Following the dataset setup of other state-of-the-art methods [59, 31], we also adopt the blended version of the Pascal VOC2012 dataset, comprising a total of 10,582 images obtained by augmenting 9,118 coarsely-labeled images from the SBD [76] dataset. Cityscapes is an urban scenes dataset comprising 19 classes. It consists of 2,975 images for training, 500 images for validation, and 1,525 images for testing.
Evaluation Metric. We conduct extensive comparative experiments on the original, blended Pascal VOC2012 and Cityscapes datasets using the same proportion of labeled images as previous methods [36, 37]. It is worth mentioning here that the Full (1464) setting indicates 1464 labeled images and 9118 unlabeled images. Mean Intersection over Union (mIoU) is employed as the evaluation metric.
Implementation Details. Our model code is built upon the PyTorch framework and trained on eight NVIDIA A6000 GPUs with 48 GB memory per card. We conduct experiments using two segmentation networks: the first is DeepLabv3+ [77] with a ResNet-50/ResNet-101 [78] backbone pretrained on ImageNet [79], and the second is SegFormer-B5 [20], which is based on the Transformer architecture. As with other state-of-the-art methods [36], for the original Pascal VOC2012 we set the input image size to 321×321, and for the blended Pascal VOC2012 we set the input image size to 513×513. The learning rate is consistently set to 0.001. In the case of the Cityscapes dataset, we use a learning rate of 0.005 and adjust the image size to 801×801. The number of training epochs is 80 for Pascal VOC2012 and 240 for Cityscapes. The batch size for both datasets is set to 8, which means each mini-batch consists of 8 labeled images and 8 unlabeled images. Weak augmentation operations consist of horizontal and vertical flipping, while strong augmentation operations not only include the weak augmentation operations but also contain color jitter, graying, and Gaussian blur.
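For reference, the augmentation pipelines described above could be sketched with torchvision as follows; the specific parameter values are illustrative and not taken from the released configuration, and the strong view is assumed to be built on top of the weakly augmented image so that the two views stay spatially aligned for the point-to-point feature alignment.

```python
from torchvision import transforms

weak_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
])

# Intensity-based perturbations applied on top of the weakly augmented image.
strong_on_weak = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(0.5, 0.5, 0.5, 0.25)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
])

# x_w = weak_aug(img);  x_s = strong_on_weak(x_w)
```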
IV-B Comparison with State-of-the-Art Methods
Results on original Pascal VOC2012. Table I presents the performance comparison of our approach with other state-of-the-art methods on the original Pascal VOC2012 dataset. The term ‘Supervised’ refers to training conducted solely with a specific proportion of labeled data, without incorporating any unlabeled data. As can be seen, our approach significantly outperforms the current state-of-the-art methods, whether using ResNet or the Transformer as the backbone network. Specifically, with the ResNet backbone, our method exceeds UniMatch [36] by 2.74%, 0.49%, 1.12%, and 0.92% mIoU in the 1/8, 1/4, 1/2, and Full settings, respectively. It also surpasses ESL [59] by 5.72%, 3.82%, 3.82%, and 2.51% mIoU in these settings. Additionally, when employing the Transformer architecture, our method surpasses SemiCVT [72] by 4.65%, 4.90%, 2.18%, and 1.64% mIoU in the 1/8, 1/4, 1/2, and Full settings, respectively. These results clearly demonstrate the effectiveness of our method, particularly in scenarios with extremely limited labeled data, such as the 1/8 setting.
Baseline | IP | IF | FP | Backbone | Original Pascal VOC2012 1/2 (732) | Original Pascal VOC2012 Full (1464) | Cityscapes 1/16 (186) | Cityscapes 1/4 (744)
---|---|---|---|---|---|---|---|---
✓ | | | | ResNet-50 | 66.72 | 72.94 | 63.30 | 73.10
✓ | ✓ | | | ResNet-50 | 77.39 | 78.70 | 75.03 | 77.49
✓ | ✓ | ✓ | | ResNet-50 | 78.33 | 79.36 | 75.47 | 77.99
✓ | ✓ | | ✓ | ResNet-50 | 77.98 | 79.29 | 75.36 | 77.78
✓ | ✓ | ✓ | ✓ | ResNet-50 | 78.51 | 79.62 | 75.69 | 78.20
Results on blended Pascal VOC2012. Table II displays the performance comparison of our method with other state-of-the-art methods on the blended Pascal VOC2012 dataset. The data presented in the table demonstrate that our proposed MCCL method considerably exceeds other approaches across various configurations, regardless of the encoder used—whether it is ResNet-50, ResNet-101, or a Transformer. Specifically, with ResNet-50 as the backbone network, our method outperforms ESL [59] by 2.65%, 1.69%, and 0.47% mIoU under the 1/16, 1/8, and 1/4 labeled data ratios, respectively, while also exceeding CCVC [31] by 1.56%, 1.45%, and 0.87% mIoU. Similarly, employing ResNet-101 as the encoder, our method outperforms AugSeg [37] by 1.52%, 1.4%, and 0.48% mIoU in the corresponding labeled data partition settings. Our proposed approach surpasses the SemiCVT [72] by 2.62%, 1.22%, and 1.24% mIoU across various configurations, highlighting its effectiveness within the Transformer architecture. The data comparisons above all demonstrate the superior performance of our method.
Results on Cityscapes. Table III presents a comparative analysis of MCCL against other state-of-the-art methods on the Cityscapes dataset. As can be seen, our proposed MCCL outperforms all other methods in various settings. Especially when using a Transformer as the backbone network, our method outperforms SemiCVT [72] by 1.93%, 1.62%, 1.46%, and 1.32% mIoU across the 1/16, 1/8, 1/4, and 1/2 settings, respectively. Moreover, under the same conditions, our method surpasses UniMatch [36] by 0.7% mIoU at the 1/2 setting. A notable observation is that MCCL shows a more significant performance improvement on the Pascal VOC 2012 dataset compared to Cityscapes, which could be attributed to the higher complexity of scene composition in Cityscapes. The performance results show that when using the Transformer as the backbone network, it outperforms ResNet in the 1/8, 1/4, and 1/2 settings, but slightly underperforms in the 1/16 setting. We hypothesize that although the Transformer efficiently captures scene information, when the amount of labeled data is too small, the Transformer may overfit the data, preventing it from learning more effective global information. Analyzing results from both Pascal VOC2012 and Cityscapes datasets, it is evident that compared to prior consistency regularization methods, our staged enhancement of the encoder and decoder, along with additional consistency constraints, sets a new benchmark in performance.


IV-C Ablation Studies
This section delineates the comprehensive results of ablation studies conducted on each element of our method, thereby elucidating the contribution of individual components to the model’s overall performance. All ablation experiments are conducted under a consistent experimental framework, employing ResNet-50 as the encoder and DeepLabv3+ as the decoder. The baseline used in all ablation experiments defaults to UniMatch [36].
Impact of feature consistency. We conduct detailed ablation experiments on the Image-augmentation based Feature (IF) consistency, and the experimental results are shown in Fig. 5. MCCL performs significantly better than the state-of-the-art method UniMatch [36] in feature consistency, outperforming it across different labeled data settings. Besides, we apply a loss to UniMatch that makes the features of strongly and weakly augmented views dissimilar, resulting in significant performance degradation of the resulting UniMatch∗. These experimental results show that mining feature consistency information is beneficial for improving the performance of the model.
Impact of IP, IF and FP consistency. We perform comprehensive ablation experiments on the original Pascal VOC2012 and Cityscapes datasets, assessing Image-augmentation based Prediction (IP) consistency, Image-augmentation based Feature (IF) consistency, and Feature-perturbation based Prediction (FP) consistency. The experimental results are detailed in Table IV. They indicate that the exclusive application of IP consistency invariably enhances the model’s performance. Specifically, under the 1/16 setting of the Cityscapes dataset, IP consistency results in an 11.73% increase in mIoU. When applied independently across different labeled data ratios, both IF and FP consistency deliver substantial performance gains. We observe that the performance improvement from IF consistency is more pronounced than that from FP consistency. While FP consistency employs self-adaptive intervention on features, its mechanism of action is similar to that of IP consistency, with both approaches focusing on alignment at the segmentation level. In contrast, IF consistency emphasizes alignment at the feature level, thereby complementing the role of IP consistency and significantly enhancing overall model performance. The best experimental results are achieved when IP, IF, and FP are cohesively integrated into the baseline model.
Scaling factor. We conduct ablation experiments on the scaling factor $\beta$ (Eq. (12)) in the self-adaptive intervention module, and the results are presented in Fig. 7. The experimental results indicate that appropriately increasing the scaling factor to enhance the intervention can significantly improve the performance of the model. Nonetheless, it is observed that surpassing a certain threshold in the scaling factor value leads to a gradual deterioration in performance. The best results are achieved when the scaling factor $\beta$ = 0.15.
Loss weight. Fig. 7 shows the impact of the weights ($\lambda_1$, $\lambda_2$, and $\lambda_3$) for the point-to-point alignment loss, the outlier feature loss, and the self-adaptive intervention loss on model performance. The experimental results from the figure indicate that setting the weights within a specific range can significantly improve the performance of the model. When $\lambda_1$ = 0.1, $\lambda_2$ = 0.01, and $\lambda_3$ = 0.01, the model achieves the best performance.
Baseline | $\mathcal{L}_{p2p}$ (FKA) | $\mathcal{L}_{of}$ (FKA) | $\mathcal{L}_{noise}$ (SAI) | $\mathcal{L}_{mask}$ (SAI) | mIoU
---|---|---|---|---|---
✓ | | | | | 78.70
✓ | ✓ | | | | 79.19
✓ | | ✓ | | | 78.93
✓ | ✓ | ✓ | | | 79.36
✓ | | | ✓ | | 78.99
✓ | | | | ✓ | 79.23
✓ | | | ✓ | ✓ | 79.29
✓ | ✓ | ✓ | ✓ | ✓ | 79.62

Baseline | F-Noise (CCT) | F-Drop (CCT) | SAF-Noise (Ours) | SAF-Masking (Ours) | mIoU
---|---|---|---|---|---
✓ | | | | | 78.70
✓ | ✓ | | | | 78.75
✓ | | ✓ | | | 78.89
✓ | | | ✓ | | 78.99
✓ | | | | ✓ | 79.23
$K$ (number of intra-cluster features) | 0 | 8 | 12 | 16 | 20 | 24
---|---|---|---|---|---|---|
mIoU | 72.48 | 73.34 | 73.77 | 73.99 | 73.81 | 73.52 |
$N$ (number of outlier features) | 0 | 64 | 128 | 256 | 512 | 1024
---|---|---|---|---|---|---|
mIoU | 72.48 | 73.45 | 73.65 | 73.99 | 73.92 | 73.62 |
Baseline | KL divergence | CE | MSE | mIoU
---|---|---|---|---
✓ | | | | 77.39
✓ | ✓ | | | 77.92
✓ | | ✓ | | 78.12
✓ | | | ✓ | 78.51


Component analysis. Table V shows the results of the ablation experiments for each component of our method. It can be seen that adding the feature knowledge alignment (FKA) strategy can improve the performance of the original baseline from 78.70% to 79.36% mIoU, demonstrating the effectiveness of feature consistency based on image-augmentation. Specifically, the point-to-point alignment loss and the outlier feature loss improve the performance of the model by 0.49% and 0.23% mIoU, respectively. In addition, the self-adaptive intervention (SAI) module can yield a 0.59% performance gain compared to the baseline. Self-adaptive noise injection loss and self-adaptive feature masking loss bring performance improvements of 0.29% and 0.53% mIoU to the model, respectively. This proves that applying self-adaptive intervention to the features of strongly augmented views can promote the decoder to learn more useful information from the broader feature space. In addition, we conduct a visual comparison of the segmentation results for FKA and SAI, as shown in Fig. 6. Both qualitative and quantitative results indicate that the complementary integration of the feature knowledge alignment strategy and the self-adaptive intervention module can enable the model to achieve the best experimental performance.
Self-adaptive vs Conventional feature perturbation. The performance comparison between the self-adaptive intervention module and conventional feature perturbations is shown in Table VI. Compared with conventional feature perturbations, adopting the self-adaptive intervention module for strongly augmented view features can significantly improve the performance of the model. Specifically, while F-Noise only improves the baseline by 0.05% mIoU, our self-adaptive feature noise injection (SAF-Noise) brings a 0.29% mIoU performance gain. Similarly, while F-Drop increases the baseline by 0.19% mIoU, our self-adaptive feature masking (SAF-Masking) significantly outperforms the baseline by 0.53% mIoU. The experimental results demonstrate that our self-adaptive intervention module, operating in an instance-specific manner, can promote the decoder to learn more useful information in a broader feature space, thereby significantly improving the performance of the model.
The impact of $K$ and $N$. Tables VII and VIII show the impact of the values of $K$ and $N$ on model performance. Compared with the baseline ($K$/$N$ = 0), setting the values of $K$ and $N$ reasonably can greatly improve the model. The experimental results show that the model achieves the best performance when $K$ is set to 16 and $N$ is set to 256. In other words, the 256 outlier features are compacted toward the 16 intra-cluster features by the nearest-neighbor similarity loss, making the model pay more attention to the outlier features during the training process.
Probability distribution measurement analysis. We perform detailed ablation experiments on a variety of probability distribution measures $d(\cdot, \cdot)$ in Eqs. (18) and (22), with the results presented in Table IX. These results unequivocally demonstrate that mean squared error (MSE) yields the most substantial improvement in model performance. We speculate that, compared to KL divergence or cross-entropy (CE), MSE is more sensitive to smaller differences between distributions and does not need to consider whether the distribution satisfies specific conditions, resulting in the best performance.
IV-D Comparison of Visualization Results
Fig. 8 displays the segmentation results on the Pascal VOC2012 and Cityscapes datasets using ResNet-50 as the encoder. It compares our method with other state-of-the-art methods (AugSeg [37] and UniMatch [36]). The visualization results clearly show that our method precisely identifies target areas and their boundaries. For example, AugSeg and UniMatch cannot accurately segment the car area in the second column, and they incorrectly classify some pixels that are part of the car as background. In the third column, our method more accurately segments the boundaries of the sofa area compared to the other two methods. The segmentation results on the Cityscapes dataset demonstrate that our method maintains superior performance even in complex scenes. The results above fully show that our method can achieve state-of-the-art experimental performance.
IV-E Failure Cases and Limitation
Although our approach has achieved state-of-the-art experimental performance compared to previous methods, the segmentation results in Fig. 9 show that our approach still has some limitations in certain scenarios. The results in the first row demonstrate that our method struggles to segment regions of different categories that have similar appearances, e.g., it incorrectly classifies the background as a person. The second row reveals that elongated objects (poles) in the image are not accurately segmented.
V Conclusion
In this paper, we present a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder by imposing additional consistency constraints on the network, thereby fully utilizing the potential supervisory information in the network. Specifically, we first propose a feature knowledge alignment strategy based on image-augmentation from the perspectives of point-to-point alignment and prototype-based intra-class compactness to promote feature consistency learning of the encoder. In addition, we further propose a self-adaptive intervention module to encourage prediction consistency learning of the decoder based on feature-perturbation. Extensive comparative and ablation experiments on the Pascal VOC2012 and Cityscapes datasets demonstrate that our method achieves state-of-the-art performance.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 62472222, 62302217), Natural Science Foundation of Jiangsu Province (No. BK20240080, BK20220934, BK20220936), China Postdoctoral Science Foundation (No. 2022M711635), Jiangsu Funding Program for Excellent Postdoctoral Talent (No. 2022ZB267), Fundamental Research Funds for the Central Universities (No. 30923010303).
References
- [1] T. Chen, Y. Yao, L. Zhang, Q. Wang, G.-S. Xie, and F. Shen, “Saliency guided inter-and intra-class relation constraints for weakly supervised semantic segmentation,” IEEE Trans. Multimedia, vol. 25, pp. 1727–1737, 2022.
- [2] X. Cai, Q. Lai, Y. Wang, W. Wang, Z. Sun, and Y. Yao, “Poly kernel inception network for remote sensing detection,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 27 706–27 716.
- [3] G. Pei, Y. Yao, F. Shen, D. Huang, X. Huang, and H.-T. Shen, “Hierarchical co-attention propagation network for zero-shot video object segmentation,” IEEE Trans. Image Process., vol. 32, pp. 2348–2359, 2023.
- [4] G. Pei, T. Chen, X. Jiang, H. Liu, Z. Sun, and Y. Yao, “Videomac: Video masked autoencoders meet convnets,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 22 733–22 743.
- [5] Y. Tang, T. Chen, X. Jiang, Y. Yao, G.-S. Xie, and H.-T. Shen, “Holistic prototype attention network for few-shot video object segmentation,” IEEE Trans. Circuit Syst. Video Technol., vol. 34, no. 8, pp. 6699–6709, 2024.
- [6] T. Chen, G.-S. Xie, Y. Yao, Q. Wang, F. Shen, Z. Tang, and J. Zhang, “Semantically meaningful class prototype learning for one-shot image segmentation,” IEEE Trans. Multimedia, vol. 24, pp. 968–980, 2021.
- [7] Y. Yao, T. Chen, G.-S. Xie, C. Zhang, F. Shen, Q. Wu, Z. Tang, and J. Zhang, “Non-salient region object mining for weakly supervised semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 2623–2632.
- [8] J. Li, Z. Jie, X. Wang, Y. Zhou, X. Wei, and L. Ma, “Weakly supervised semantic segmentation via progressive patch learning,” IEEE Trans. Multimedia, vol. 25, pp. 1686–1699, 2022.
- [9] G. Pei, F. Shen, Y. Yao, G.-S. Xie, Z. Tang, and J. Tang, “Hierarchical feature alignment network for unsupervised video object segmentation,” in Eur. Conf. Comput. Vis., 2022, pp. 596–613.
- [10] B. Zhou, L. Li, Y. Wang, H. Liu, Y. Yao, and W. Wang, “Unialign: Scaling multimodal alignment within one unified model.”
- [11] G. Pei, T. Chen, Y. Wang, X. Cai, X. Shu, T. Zhou, and Y. Yao, “Seeing what matters: Empowering clip with patch generation-to-selection.”
- [12] T. Chen, Y. Yao, and J. Tang, “Multi-granularity denoising and bidirectional alignment for weakly supervised semantic segmentation,” IEEE Trans. Image Process., vol. 32, pp. 2960–2971, 2023.
- [13] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3431–3440.
- [14] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 2881–2890.
- [15] T. Chen, X. Jiang, G. Pei, Z. Sun, Y. Wang, and Y. Yao, “Knowledge transfer with simulated inter-image erasing for weakly supervised semantic segmentation,” pp. 441–458, 2024.
- [16] T. Chen, Y. Yao, X. Huang, Z. Li, L. Nie, and J. Tang, “Spatial structure constraints for weakly supervised semantic segmentation,” IEEE Trans. Image Process., vol. 33, pp. 1136–1148, 2024.
- [17] H. Liu, P. Peng, T. Chen, Q. Wang, Y. Yao, and X.-S. Hua, “Fecanet: Boosting few-shot semantic segmentation with feature-enhanced context-aware network,” IEEE Trans. Multimedia, vol. 25, pp. 8580–8592, 2023.
- [18] J. Nie, C. Wang, S. Yu, J. Shi, X. Lv, and Z. Wei, “Mign: Multiscale image generation network for remote sensing image semantic segmentation,” IEEE Trans. Multimedia, vol. 25, pp. 5601–5613, 2022.
- [19] L. Li, T. Zhou, W. Wang, J. Li, and Y. Yang, “Deep hierarchical semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 1246–1257.
- [20] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Adv. Neural Inform. Process. Syst., pp. 12 077–12 090, 2021.
- [21] B. Cheng, A. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” Adv. Neural Inform. Process. Syst., pp. 17 864–17 875, 2021.
- [22] I. Alonso, A. Sabater, D. Ferstl, L. Montesano, and A. C. Murillo, “Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank,” in Int. Conf. Comput. Vis., 2021, pp. 8219–8228.
- [23] X. Chen, Y. Yuan, G. Zeng, and J. Wang, “Semi-supervised semantic segmentation with cross pseudo supervision,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 2613–2622.
- [24] R. Mendel, L. A. De Souza, D. Rauber, J. P. Papa, and C. Palm, “Semi-supervised segmentation based on error-correcting supervision,” in Eur. Conf. Comput. Vis., 2020, pp. 141–157.
- [25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Adv. Neural Inform. Process. Syst., pp. 2672–2680, 2014.
- [26] S. Mittal, M. Tatarchenko, and T. Brox, “Semi-supervised semantic segmentation with high-and low-level consistency,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 4, pp. 1369–1379, 2019.
- [27] N. Souly, C. Spampinato, and M. Shah, “Semi supervised semantic segmentation using generative adversarial network,” in Int. Conf. Comput. Vis., 2017, pp. 5688–5696.
- [28] J. Kim, J. Jang, H. Park, and S. Jeong, “Structured consistency loss for semi-supervised semantic segmentation,” arXiv, 2020.
- [29] Y. Ouali, C. Hudelot, and M. Tami, “Semi-supervised semantic segmentation with cross-consistency training,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 12 674–12 684.
- [30] Z. Ke, D. Qiu, K. Li, Q. Yan, and R. W. Lau, “Guided collaborative training for pixel-wise semi-supervised learning,” in Eur. Conf. Comput. Vis. Springer, 2020, pp. 429–445.
- [31] Z. Wang, Z. Zhao, X. Xing, D. Xu, X. Kong, and L. Zhou, “Conflict-based cross-view consistency for semi-supervised semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 19 585–19 595.
- [32] J. Yuan, J. Ge, Z. Wang, and Y. Liu, “Semi-supervised semantic segmentation with mutual knowledge distillation,” in ACM Int. Conf. Multimedia, ser. MM ’23, 2023, pp. 5436–5444.
- [33] Z. Feng, Q. Zhou, Q. Gu, X. Tan, G. Cheng, X. Lu, J. Shi, and L. Ma, “Dmt: Dynamic mutual training for semi-supervised learning,” Pattern Recognit., vol. 130, p. 108777, 2022.
- [34] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Adv. Neural Inform. Process. Syst., 2017, pp. 1195–1204.
- [35] J. Kim, Y. Min, D. Kim, G. Lee, J. Seo, K. Ryoo, and S. Kim, “Conmatch: Semi-supervised learning with confidence-guided consistency regularization,” in Eur. Conf. Comput. Vis., 2022, pp. 674–690.
- [36] L. Yang, L. Qi, L. Feng, W. Zhang, and Y. Shi, “Revisiting weak-to-strong consistency in semi-supervised semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 7236–7246.
- [37] Z. Zhao, L. Yang, S. Long, J. Pi, L. Zhou, and J. Wang, “Augmentation matters: A simple-yet-effective approach to semi-supervised semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 11 350–11 359.
- [38] G. French, S. Laine, T. Aila, M. Mackiewicz, and G. D. Finlayson, “Semi-supervised semantic segmentation needs strong, varied perturbations,” in Brit. Mach. Vis. Conf. BMVA Press, 2020, pp. 1–14.
- [39] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” in Adv. Neural Inform. Process. Syst., vol. 33, 2020, pp. 596–608.
- [40] V. Olsson, W. Tranheden, J. Pinto, and L. Svensson, “Classmix: Segmentation-based data augmentation for semi-supervised learning,” in IEEE/CVF Winter Conf. Appl. Comput. Vis., 2021, pp. 1369–1378.
- [41] L. Wang, D. Li, Y. Zhu, L. Tian, and Y. Shan, “Dual super-resolution learning for semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 3774–3783.
- [42] Z. Wang, R. Song, P. Duan, and X. Li, “Efnet: enhancement-fusion network for semantic segmentation,” Pattern Recognit., vol. 118, p. 108023, 2021.
- [43] H. Zhou, L. Qi, H. Huang, X. Yang, Z. Wan, and X. Wen, “Canet: Co-attention network for rgb-d semantic segmentation,” Pattern Recognit., vol. 124, p. 108468, 2022.
- [44] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang et al., “Deep high-resolution representation learning for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3349–3364, 2020.
- [45] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in Int. Conf. Comput. Vis., 2019, pp. 603–612.
- [46] C. Yang, H. Zhou, Z. An, X. Jiang, Y. Xu, and Q. Zhang, “Cross-image relational knowledge distillation for semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 12 319–12 328.
- [47] Z. Jin, B. Liu, Q. Chu, and N. Yu, “Isnet: Integrate image-level and semantic-level context for semantic segmentation,” in Int. Conf. Comput. Vis., 2021, pp. 7189–7198.
- [48] Z. Jin, D. Yu, Z. Yuan, and L. Yu, “Mcibi++: Soft mining contextual information beyond image for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 5, pp. 5988–6005, 2022.
- [49] Z. Jin, X. Hu, L. Zhu, L. Song, L. Yuan, and L. Yu, “Idrnet: Intervention-driven relation network for semantic segmentation,” Adv. Neural Inform. Process. Syst., vol. 36, 2024.
- [50] T. Zhou and W. Wang, “Prototype-based semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–15, 2024.
- [51] W. Hung, Y. Tsai, Y. Liou, Y. Lin, and M. Yang, “Adversarial learning for semi-supervised semantic segmentation,” in Brit. Mach. Vis. Conf. BMVA Press, 2018, p. 65.
- [52] M. Lee, S. Lee, J. Lee, and H. Shim, “Saliency as pseudo-pixel supervision for weakly and semi-supervised semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 12 341–12 357, 2023.
- [53] M. Qi, Y. Wang, J. Qin, and A. Li, “Ke-gan: Knowledge embedded generative adversarial networks for semi-supervised scene parsing,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 5237–5246.
- [54] M. S. Ibrahim, A. Vahdat, M. Ranjbar, and W. G. Macready, “Semi-supervised semantic image segmentation with self-correcting networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 12 715–12 725.
- [55] C. Liang, W. Wang, J. Miao, and Y. Yang, “Logic-induced diagnostic reasoning for semi-supervised semantic segmentation,” in Int. Conf. Comput. Vis., 2023, pp. 16 197–16 208.
- [56] Y. Zhou, H. Xu, W. Zhang, B. Gao, and P.-A. Heng, “C3-semiseg: Contrastive semi-supervised segmentation via cross-set learning and dynamic class-balancing,” in Int. Conf. Comput. Vis., 2021, pp. 7036–7045.
- [57] Y. Wang, H. Wang, Y. Shen, J. Fei, W. Li, G. Jin, L. Wu, R. Zhao, and X. Le, “Semi-supervised semantic segmentation using unreliable pseudo-labels,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4248–4257.
- [58] P. Qiao, Z. Wei, Y. Wang, Z. Wang, G. Song, F. Xu, X. Ji, C. Liu, and J. Chen, “Fuzzy positive learning for semi-supervised semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 15 465–15 474.
- [59] J. Ma, C. Wang, Y. Liu, L. Lin, and G. Li, “Enhanced soft label for semi-supervised semantic segmentation,” in Int. Conf. Comput. Vis., October 2023, pp. 1185–1195.
- [60] X. Lai, Z. Tian, L. Jiang, S. Liu, H. Zhao, L. Wang, and J. Jia, “Semi-supervised semantic segmentation with directional context-aware consistency,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 1205–1214.
- [61] J. Peng, G. Estrada, M. Pedersoli, and C. Desrosiers, “Deep co-training for semi-supervised image segmentation,” Pattern Recognit., vol. 107, p. 107269, 2020.
- [62] J. Zhang, T. Wu, C. Ding, H. Zhao, and G. Guo, “Region-level contrastive and consistency learning for semi-supervised semantic segmentation,” in IJCAI, 7 2022, pp. 1622–1628.
- [63] J. Fan, B. Gao, H. Jin, and L. Jiang, “Ucc: Uncertainty guided cross-head co-training for semi-supervised semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., June 2022, pp. 9947–9956.
- [64] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii, “Virtual adversarial training: a regularization method for supervised and semi-supervised learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1979–1993, 2018.
- [65] Y. Liu, Y. Tian, Y. Chen, F. Liu, V. Belagiannis, and G. Carneiro, “Perturbed and strict mean teachers for semi-supervised semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4258–4267.
- [66] H. Xu, L. Liu, Q. Bian, and Z. Yang, “Semi-supervised semantic segmentation with prototype-based consistency regularization,” in Adv. Neural Inform. Process. Syst., vol. 35, 2022, pp. 26 007–26 020.
- [67] Y. Zou, Z. Zhang, H. Zhang, C.-L. Li, X. Bian, J.-B. Huang, and T. Pfister, “Pseudoseg: Designing pseudo labels for semantic segmentation,” in Int. Conf. Learn. Represent., 2021, pp. 1–18.
- [68] Y. Zhong, B. Yuan, H. Wu, Z. Yuan, J. Peng, and Y.-X. Wang, “Pixel contrastive-consistent semi-supervised semantic segmentation,” in Int. Conf. Comput. Vis., 2021, pp. 7273–7282.
- [69] D. Kwon and S. Kwak, “Semi-supervised semantic segmentation with error localization network,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 9957–9967.
- [70] P. Tu, Y. Huang, F. Zheng, Z. He, L. Cao, and L. Shao, “Guidedmix-net: Semi-supervised semantic segmentation by using labeled images as reference,” in AAAI, 2022, pp. 2379–2387.
- [71] S. Fan, F. Zhu, Z. Feng, Y. Lv, M. Song, and F.-Y. Wang, “Conservative-progressive collaborative learning for semi-supervised semantic segmentation,” IEEE Trans. Image Process., pp. 6183–6194, 2023.
- [72] H. Huang, S. Xie, L. Lin, R. Tong, Y.-W. Chen, Y. Li, H. Wang, Y. Huang, and Y. Zheng, “Semicvt: Semi-supervised convolutional vision transformer for semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 11 340–11 349.
- [73] L. Yang, W. Zhuo, L. Qi, Y. Shi, and Y. Gao, “St++: Make self-training work better for semi-supervised semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4268–4277.
- [74] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” in Int. Conf. Comput. Vis., vol. 111. Springer, 2015, pp. 98–136.
- [75] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 3213–3223.
- [76] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in Int. Conf. Comput. Vis., 2011, pp. 991–998.
- [77] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Eur. Conf. Comput. Vis., 2018, pp. 801–818.
- [78] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 770–778.
- [79] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 248–255.