Learning Restoration is Not Enough: Transferring Identical Mapping for Single-Image Shadow Removal
Abstract.
Shadow removal aims to restore shadow regions to their shadow-free counterparts while leaving non-shadow regions unchanged. State-of-the-art shadow removal methods train deep neural networks on collected shadow & shadow-free image pairs, and these networks are expected to complete two distinct tasks with shared weights, i.e., data restoration for shadow regions and identical mapping for non-shadow regions. We find that these two tasks exhibit poor compatibility, and sharing weights between them can cause the model to be optimized towards only one of the tasks rather than both during training. Note that such a key issue is not identified by existing deep learning-based shadow removal methods. To address this problem, we propose to handle these two tasks separately and leverage the identical mapping results to guide shadow restoration in an iterative manner. Specifically, our method consists of three components: an identical mapping branch (IMB) for processing non-shadow regions, an iterative de-shadow branch (IDB) for restoring shadow regions based on the identical-mapping results, and a smart aggregation block (SAB). The IMB aims to reconstruct an image that is identical to the input one, which benefits the restoration of non-shadow regions without explicitly distinguishing between shadow and non-shadow regions. Utilizing the multi-scale features extracted by the IMB, the IDB progressively transfers information from non-shadow regions to shadow regions, facilitating the process of shadow removal. The SAB is designed to adaptively integrate features from both the IMB and IDB. Moreover, it generates a finely tuned soft shadow mask that guides the process of removing shadows. Extensive experiments demonstrate that our method outperforms all state-of-the-art shadow removal approaches on widely used shadow removal datasets.

1. Introduction
Shadows are created when objects block a light source. Single-image shadow removal aims to reconstruct the shadow-free image from its degraded shadow counterpart, which is an important and non-trivial task in the computer vision field and can benefit many downstream tasks, e.g., object detection (Nadimi and Bhanu, 2004; Fu et al., 2021a, b; Wang et al., 2022), object tracking (Sanin et al., 2010), and face recognition (Zhang et al., 2018b). Despite significant progress, shadow removal still faces challenges. Even state-of-the-art shadow removal methods can produce de-shadowed results that contain artifacts in both shadow and non-shadow regions, e.g., color inconsistencies between shadow and non-shadow regions, as well as visible marks along the shadow boundary (see Fig. 1).
The existing shadow removal methods often rely on complex deep neural networks to reconstruct both shadow and non-shadow regions simultaneously. However, these methods overlook the fact that shadow removal involves two distinct tasks: restoring shadow regions to their shadow-free counterparts and identical mapping for non-shadow regions. As a consequence, these deep neural networks are optimized towards only one of these tasks during training instead of both, due to the shared weights and the poor compatibility of the tasks. To address this problem, in this work, we propose to handle these two tasks separately. Intuitively, we could divide shadow removal into the two tasks based on a binary shadow mask. However, due to the diverse properties of shadows in the real world, obtaining an accurate shadow mask that efficiently distinguishes between shadow and non-shadow regions is challenging or even impossible. This is particularly true for areas around the shadow boundary, where a gradual transition occurs between shadow and non-shadow regions. Even the ground truth shadow masks provided by commonly used shadow removal datasets, e.g., the ISTD+ dataset (Le and Samaras, 2019), are not always precise and cannot effectively differentiate between shadow and non-shadow regions (see the red and green areas in Fig. 2). Therefore, decoupling shadow removal into these two distinct tasks and processing them separately is challenging.
To tackle this issue, we claim that shadow removal can be decoupled by transferring identical mapping without explicitly distinguishing between shadow and non-shadow regions. Specifically, our approach consists of three components: an identical mapping branch (IMB) for processing non-shadow regions, an iterative de-shadow branch (IDB) for restoring shadow regions based on the identical-mapping results, and a smart aggregation block (SAB). The IMB aims to reconstruct an image that is identical to the input one, which benefits the restoration of non-shadow regions without explicitly distinguishing between shadow and non-shadow regions. The IDB is responsible for progressively transferring information from the non-shadow regions to the shadow regions in an iterative manner to facilitate shadow removal, utilizing the multi-scale features provided by the IMB. The SAB is designed to adaptively integrate features from both the IMB and IDB. Moreover, the SAB generates finely tuned soft shadow masks at multiple feature levels (see Fig. 7 (c), (d), and (e)) to guide the process of removing shadows. In summary, this work makes the following contributions:
❶ We are the first to decouple the shadow removal problem into two distinct tasks, i.e., restoring shadow regions to their shadow-free counterparts and identical mapping for non-shadow regions, and we propose a novel Dual-Branch shadow removal paradigm to solve it.
❷ We propose a novel Dual-Branch shadow removal network that uses an identical mapping branch (IMB) to process the non-shadow regions, an iterative de-shadow branch (IDB) to process the shadow regions, and a smart aggregation block (SAB) to adaptively aggregate features from the two branches.
❸ Extensive experiments demonstrate that our proposed method outperforms all previous state-of-the-art shadow removal approaches on several public shadow removal datasets, i.e., ISTD+ and SRD.

2. Related Work
2.1. General Shadow Removal Methods
To restore the shadow-free image from the degraded shadow counterpart, traditional methods (Mohan et al., 2007; Finlayson et al., 2005; Guo et al., 2012; Zhang et al., 2015; Finlayson et al., 2009, 2005) rely on prior information, e.g., gradients, illumination, and patch similarity. For example, (Mohan et al., 2007) proposes a gradient-domain processing technique to adjust the softness of shadows without introducing artifacts. (Guo et al., 2012) uses a region-based approach that predicts relative illumination conditions between segmented regions to distinguish shadow regions and relight each pixel. (Zhang et al., 2015) extends this region-based approach and constructs a novel illumination recovering operator to effectively remove shadows and restore detailed texture information based on the texture similarity between shadow and non-shadow patches. Although well designed, these methods have recently been surpassed by deep learning-based shadow removal methods (Wan et al., 2022; Fu et al., 2021c; Zhu et al., 2022a; Inoue and Yamasaki, 2020; Qu et al., 2017; Le and Samaras, 2019, 2020; Hu et al., 2019a; Wei et al., 2019; Zhang et al., 2020; Zhu et al., 2022b; Li et al., 2023; Gao et al., 2022; Abiko and Ikehara, 2022). Specifically, (Qu et al., 2017) is the pioneering work that uses an end-to-end network to tackle shadow detection and shadow removal by extracting multi-context features from global and local regions. Since then, a large number of interesting deep learning-based methods have been proposed, each focusing on a different aspect of the problem. (Fu et al., 2021c) reformulates shadow removal as an exposure problem and employs a neural network to predict exposure parameters to obtain shadow-free images. (Zhu et al., 2022a) takes into account the auxiliary supervision of shadow generation in the shadow removal procedure and proposes a unified network to perform shadow removal and shadow generation. (Wan et al., 2022) explicitly considers the style consistency between shadow and non-shadow regions after shadow removal and proposes a style-guided shadow removal network. Although these methods achieve promising results, the lack of large-scale paired training data becomes a bottleneck that limits their performance. To alleviate this problem, (Inoue and Yamasaki, 2020) designs a pipeline to generate a large-scale synthetic shadow dataset to improve shadow removal performance. Meanwhile, unsupervised or weakly supervised methods (Liu et al., 2021b, a; Hu et al., 2019b; Jin et al., 2021; He et al., 2021) have also been introduced to train deep neural networks with unpaired data.
2.2. Iterative Network
Iterative networks are extensively employed in machine learning-based image processing tasks (Yan et al., 2022; Zhou et al., 2022; Li et al., 2022b; Hang et al., 2019; Saharia et al., 2022; Wang and Wang, 2022; Zhan and Lu, 2019; Yu et al., 2019; Li et al., 2020) to recursively and gradually improve the quality of predictions. For example, (Yan et al., 2022) uses a recurrent image-guided network to address challenges in depth prediction, where recurrence is applied to both the image guidance branch and the depth generation branch to gradually and sufficiently recover depth values. (Zhou et al., 2022) introduces an edge-guided recurrent positioning network for predicting salient objects in optical remote sensing images. The proposed approach sharpens the predicted positions by utilizing effective edge information and recurrently calibrating them during the prediction process. (Hang et al., 2019) introduces a cascaded recurrent neural network that utilizes gated recurrent units to effectively explore the redundant and complementary information present in hyperspectral images. (Saharia et al., 2022) performs image super-resolution via repeated refinement, employing a stochastic iterative denoising process to improve super-resolution performance. (Zhan and Lu, 2019) introduces an end-to-end trainable scene text recognition system that utilizes an iterative rectification framework to address perspective distortion and text line curvature. (Yu et al., 2019) introduces a deep iterative down-up convolutional neural network for image denoising that employs a resolution-adaptive approach by iteratively reducing and increasing the resolution at the feature level. (Li et al., 2020) presents a recurrent feature reasoning (RFR) network for single-image inpainting that iteratively predicts hole boundaries at the feature level of a convolutional neural network, which then serve as cues for subsequent inference. Inspired by the success of these iterative networks, in this paper, we employ an iterative de-shadow branch (IDB) to gradually improve the performance of shadow removal.
3. Discussion and Motivation
We can divide the shadow removal task into two distinct tasks: 1) restoring shadow regions to their shadow-free counterparts and 2) identical mapping for non-shadow regions. Previous methods (Qu et al., 2017; Fu et al., 2021c; Zhu et al., 2022a) use deep neural networks with shared weights to handle the two tasks. We argue that these networks are only optimized toward one of the tasks instead of both. In this section, we first conduct experiments to uncover the limitation of training a shared deep neural network to handle these two distinct tasks (See Sec. 3.1). Then, we illustrate the superiority of identical mapping for non-shadow regions. To do this, we utilize the Information in the Weights (IIW)(Wang et al., 2021) technique to demonstrate the benefits of restoring non-shadow regions using identical mapping training (See Sec. 3.2). Additionally, we employ an encoder-decoder architecture to explore the efficacy of the iterative technique for shadow removal and highlight its advantages and limitations (See Sec. 3.3).
3.1. Mutual Interference of Shadow Removal Network
To begin with, we define mutual interference as a phenomenon in which the performance of a shadow removal network increases in the shadow restoration task but decreases in the identical mapping task, or vice versa. To uncover this phenomenon, we build a baseline encoder-decoder shadow removal network and evaluate its performance on the ISTD+ dataset (Le and Samaras, 2019) via the root mean squared error (RMSE) every 10,000 iterations during training. Specifically, given the shadow removal network trained at the $t$-th iteration, we can calculate the RMSEs in both the shadow and non-shadow regions of the testing images. Then, we can draw plots along the iteration indexes for both shadow and non-shadow regions (i.e., the red plot and green plot in Fig. 3 (a)). We say that mutual interference occurs if the RMSE variation between two neighboring evaluation points in the shadow region differs from the one in the non-shadow region. We define the mutual interference ratio as the number of times mutual interference occurs during the training procedure. As depicted in Fig. 3(c), the baseline encoder-decoder network has a high mutual interference ratio (i.e., over 30 times within 100 counted evaluation points), which illustrates that a shared deep neural network can hardly cover the two distinct tasks at the same time.
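As a minimal illustration (assuming that a "different" variation means the two RMSE curves move in opposite directions between neighboring evaluation points), the mutual interference count can be computed from the two per-checkpoint RMSE curves as follows; the helper and example values are hypothetical.

```python
import numpy as np

def count_mutual_interference(rmse_shadow, rmse_nonshadow):
    """Count evaluation points where the shadow-region RMSE and the
    non-shadow-region RMSE move in opposite directions between two
    neighboring evaluations (one improves while the other degrades)."""
    rmse_shadow = np.asarray(rmse_shadow, dtype=float)
    rmse_nonshadow = np.asarray(rmse_nonshadow, dtype=float)
    d_shadow = np.diff(rmse_shadow)        # variation between neighboring checkpoints
    d_nonshadow = np.diff(rmse_nonshadow)
    return int(np.sum(d_shadow * d_nonshadow < 0))  # opposite signs = interference

# Example: RMSE measured every 10,000 iterations for both regions (dummy values).
shadow_curve = [12.1, 10.4, 9.8, 10.1, 9.2]
nonshadow_curve = [4.2, 4.5, 4.1, 3.9, 4.3]
print(count_mutual_interference(shadow_curve, nonshadow_curve))  # -> 3
```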
3.2. Advantages of The Identical Mapping
To analyze the potential advantages of identical mapping, we conduct two additional experiments using the same encoder-decoder network. For Exp1, the input is the shadow image, and the network is trained to remove shadows from it. For Exp2, the network is trained so that the reconstructed result is identical to the input shadow image (i.e., identical mapping). For both experiments, we evaluate the restoration quality in the non-shadow regions. Surprisingly, we observe that Exp2 has a significant advantage over Exp1. As shown in Fig. 4 (a), in the non-shadow regions, Exp2 has a much lower RMSE than Exp1 throughout the entire training procedure. To further explore the potential functionality of identical mapping, we use the Information in the Weights (IIW) (Wang et al., 2021) technique to analyze the training procedures of Exp1 and Exp2. A lower IIW means the network generalizes better to different non-shadow scenes. The results are displayed in Fig. 4 (b). We observe that Exp2 substantially improves the model's generalization (i.e., lower IIW) in the non-shadow regions compared to Exp1, which demonstrates the efficacy of identical mapping in reconstructing non-shadow regions.


3.3. Iterative Network for Shadow Removal
To investigate the effectiveness of the iterative network, we conduct a series of experiments using the same encoder-decoder structure as in Sec. 3.1. Specifically, we feed the decoder's features back into the encoder a varying number of times and evaluate the restoration quality in both shadow and non-shadow regions. As depicted in Fig. 4(c), both the encoder-decoder structure and our proposed method show significant improvement in the restoration quality of shadow regions with more iterations (see the red line). However, we notice that for the encoder-decoder structure, the restoration quality in non-shadow regions decreases as the number of iterations increases (see the green line with square markers). In contrast, with our method, the restoration quality in non-shadow regions remains nearly unchanged even with increasing iterations (see the dotted green line). In addition, we also calculate and visualize the differences between the ground truth and the reconstructed results obtained by the encoder-decoder structure at different iterations. The results are shown in Fig. 5, where green arrows highlight the differences in shadow regions, and red arrows highlight the differences in non-shadow regions. With a single iteration, the restoration quality is favorable in non-shadow regions but subpar in shadow regions, while the opposite trend is observed with two iterations.
Overall, the above experiments present the necessity of decoupling the shadow removal task into two distinct tasks and the effectiveness of the combination of identical mapping and iterative shadow removal.

4. Method
4.1. Overview
The proposed method consists of three components: an identical mapping branch (IMB) (see Sec. 4.2), an iterative de-shadow branch (IDB) (see Sec. 4.3), and a smart aggregation block (SAB) (see Sec. 4.4). Given a shadow image, we decouple shadow removal into two distinct tasks: restoring shadow regions to their shadow-free counterparts and identical mapping for non-shadow regions. We use the IMB to handle the identical mapping task and the IDB to handle the shadow restoration task. The SAB is designed to adaptively integrate features from both the IMB and IDB. To demonstrate the advantage of our method, we conduct the same experiment as discussed in Sec. 3.1 using our method. As shown in Fig. 3(b), for both tasks, our method exhibits less fluctuation in RMSE than the encoder-decoder structure during training. Besides, our method also demonstrates a lower mutual interference ratio, as shown in Fig. 3(c).

4.2. Identical Mapping Branch
We propose the identical mapping branch (IMB) to reconstruct the input image via an encoder-decoder network, which benefits the restoration of non-shadow regions. Given a shadow image $\mathbf{I}_s$, the objective of the IMB is to reconstruct $\hat{\mathbf{I}}_s$, where $\hat{\mathbf{I}}_s$ should be identical to $\mathbf{I}_s$. This procedure can be represented by

(1)  $\hat{\mathbf{I}}_s = \phi(\mathbf{I}_s),$

where $\phi$ denotes the IMB. Let $\mathbf{F}_{\phi}^{j}$ denote the feature extracted from the $j$-th convolution layer of $\phi$. $\mathbf{F}_{\phi}^{j}$ can be formalized as

(2)  $\mathbf{F}_{\phi}^{j} = \phi_{j}(\mathbf{F}_{\phi}^{j-1}), \quad \mathbf{F}_{\phi}^{0} = \mathbf{I}_s,$

where $\phi_{j}$ denotes the $j$-th convolution layer of $\phi$. After training the IMB, we freeze its parameters and rely solely on the multi-scale features, i.e., $\{\mathbf{F}_{\phi}^{j}\}$, to guide the iterative de-shadow branch (IDB).
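For illustration, a PyTorch-style sketch of the IMB interface assumed in the rest of this section is given below: an encoder-decoder that returns both the reconstruction $\hat{\mathbf{I}}_s$ and the multi-scale features $\{\mathbf{F}_{\phi}^{j}\}$. The layer sizes follow the spirit of Table 1 (the residual blocks are omitted for brevity), and the use of transposed convolutions in the decoder is an assumption of this sketch, not a detail confirmed by the text.

```python
import torch
import torch.nn as nn

class IMB(nn.Module):
    """Identical mapping branch (sketch): reconstructs the input image and
    exposes its intermediate features F_phi^j for the de-shadow branch."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 64, 7, 1, 3), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(inplace=True)),
            # Residual blocks and the two 3x3 convs from Table 1 are omitted here.
            nn.Sequential(nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Conv2d(64, 3, 7, 1, 3),
        ])

    def forward(self, x):
        feats = []                      # multi-scale features F_phi^j
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return x, feats                 # reconstruction \hat{I}_s and features
```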
4.3. Iterative De-shadow Branch
The iterative de-shadow branch (IDB) is responsible for progressively transferring information from the non-shadow regions to the shadow regions in an iterative manner, facilitating shadow removal by utilizing the multi-scale features provided by the IMB. Let $\varphi$ denote the IDB and $\mathbf{F}_{\varphi}^{j}$ denote the feature extracted from the $j$-th convolution layer of $\varphi$. $\mathbf{F}_{\varphi}^{j}$ can be formalized as

(3)  $\mathbf{F}_{\varphi}^{j} = \varphi_{j}(\mathbf{F}_{\varphi}^{j-1}), \quad \mathbf{F}_{\varphi}^{0} = \mathrm{Cat}(\mathbf{I}_s, \mathbf{M}),$

where $\mathrm{Cat}(\cdot)$ means channel-wise concatenation, $\varphi_{j}$ denotes the $j$-th convolution layer of $\varphi$, and $\mathbf{M}$ denotes the corresponding binary shadow mask of the shadow image $\mathbf{I}_s$. The shadow and non-shadow regions are annotated by 1 and 0, respectively, in $\mathbf{M}$. Then we aggregate $\mathbf{F}_{\phi}^{j}$ and $\mathbf{F}_{\varphi}^{j}$ in an adaptive manner at the multi-scale feature level (i.e., after the first, third, and second-to-last convolution layers of $\varphi$, as illustrated in Fig. 6). The procedure of the aggregation can be represented as

(4)  $\hat{\mathbf{F}}_{\varphi}^{j} = \mathrm{SAB}(\mathbf{F}_{\phi}^{j}, \mathbf{F}_{\varphi}^{j}),$

where $\hat{\mathbf{F}}_{\varphi}^{j}$ denotes the adaptively aggregated feature, which is used as the input of the next convolution layer of $\varphi$, i.e., $\varphi_{j+1}$, and $\mathrm{SAB}(\cdot)$ denotes the smart aggregation block (see Sec. 4.4).
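For concreteness, the following PyTorch-style sketch (not the authors' released code) shows one plausible way to wire up Eqs. (3)-(4): the frozen IMB provides the features $\mathbf{F}_{\phi}^{j}$, the IDB consumes the image-mask concatenation, and a SAB (sketched in Sec. 4.4) fuses the two feature streams after the first, third, and second-to-last layers. The layer list mirrors the Table 1 sketch with residual blocks omitted, the transposed convolutions are an assumption, and re-feeding the intermediate prediction together with the mask at each iteration is only one possible realization of the iterative scheme.

```python
import torch
import torch.nn as nn

class IDB(nn.Module):
    """Iterative de-shadow branch (sketch). `imb` is the frozen, pre-trained
    IMB from Sec. 4.2; SAB is the aggregation block sketched in Sec. 4.4."""
    def __init__(self, imb):
        super().__init__()
        self.imb = imb.eval()
        for p in self.imb.parameters():
            p.requires_grad_(False)                           # IMB stays frozen
        self.layers = nn.ModuleList([                         # mirrors the IDB column of Table 1
            nn.Sequential(nn.Conv2d(4, 64, 7, 1, 3), nn.ReLU(inplace=True)),   # image + mask
            nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Conv2d(64, 3, 7, 1, 3),
        ])
        # SABs after the first, third, and second-to-last layers (indices 0, 2, 4).
        self.sabs = nn.ModuleDict({"0": SAB(64), "2": SAB(256), "4": SAB(64)})

    def forward(self, shadow_img, mask, num_iters=4):
        with torch.no_grad():
            _, imb_feats = self.imb(shadow_img)               # frozen features F_phi^j
        out = shadow_img
        for _ in range(num_iters):                            # iterative refinement
            x = torch.cat([out, mask], dim=1)                 # Eq. (3): channel-wise concatenation
            for j, layer in enumerate(self.layers):
                x = layer(x)
                if str(j) in self.sabs:                       # Eq. (4): adaptive aggregation
                    x, _ = self.sabs[str(j)](imb_feats[j], x)
            out = x
        return out
```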
4.4. Smart Aggregation Block
Instead of directly concatenating the features extracted from the IMB and IDB, we propose to aggregate them in an adaptive manner. Specifically, we utilize a convolutional layer with a kernel size of 3×3 followed by a sigmoid activation function to estimate the adaptive aggregation weights, which can be represented as

(5)  $\mathbf{W}_{\phi}^{j}, \mathbf{W}_{\varphi}^{j} = \sigma(\mathrm{Conv}_{3\times3}(\mathbf{F}_{\varphi}^{j})),$

where $\mathbf{W}_{\phi}^{j}$ and $\mathbf{W}_{\varphi}^{j}$ denote the corresponding aggregation weights of $\mathbf{F}_{\phi}^{j}$ and $\mathbf{F}_{\varphi}^{j}$, respectively, and $\sigma(\cdot)$ is the sigmoid function. Since the IMB is frozen, $\mathbf{F}_{\phi}^{j}$ remains constant throughout the iterative procedure of the IDB. Therefore, we utilize only $\mathbf{F}_{\varphi}^{j}$ to predict the aggregation weights. The whole aggregation procedure can be formalized as

(6)  $\hat{\mathbf{F}}_{\varphi}^{j} = \mathrm{Conv}_{3\times3}(\mathbf{W}_{\phi}^{j} \odot \mathbf{F}_{\phi}^{j} + \mathbf{W}_{\varphi}^{j} \odot \mathbf{F}_{\varphi}^{j}), \quad \mathbf{M}_{\mathrm{soft}}^{j} = \mathrm{AvgPool}_{c}(\mathbf{W}_{\varphi}^{j}),$

where $\odot$ denotes element-wise multiplication, $\mathrm{Conv}_{3\times3}$ denotes a convolution layer with a kernel size of 3×3, and $\mathbf{M}_{\mathrm{soft}}^{j}$ denotes the generated soft shadow mask, which is obtained by applying average pooling ($\mathrm{AvgPool}_{c}$) along the channels of $\mathbf{W}_{\varphi}^{j}$. As shown in Fig. 7, the soft shadow masks (see (c)-(e)) obtained from the aggregation procedure are capable of accurately capturing the shadow regions, in contrast to the binary shadow mask (see (b)) provided by the shadow removal dataset, i.e., the SRD dataset (Qu et al., 2017), which cannot capture shadow details, especially for the regions along the shadow boundary.
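A minimal PyTorch-style sketch of the SAB described by Eqs. (5)-(6) is given below. Producing both weight maps from a single 3×3 convolution with 2C output channels, and deriving the soft mask from $\mathbf{W}_{\varphi}^{j}$, are assumptions of this sketch rather than details confirmed by the text.

```python
import torch
import torch.nn as nn

class SAB(nn.Module):
    """Smart aggregation block (sketch of Eqs. (5)-(6)): predicts aggregation
    weights from the IDB feature only, fuses the IMB/IDB features, and exposes
    a soft shadow mask."""
    def __init__(self, channels):
        super().__init__()
        # 3x3 conv + sigmoid producing the two weight maps W_phi and W_varphi.
        self.weight_conv = nn.Conv2d(channels, 2 * channels, 3, 1, 1)
        self.fuse_conv = nn.Conv2d(channels, channels, 3, 1, 1)

    def forward(self, f_imb, f_idb):
        w = torch.sigmoid(self.weight_conv(f_idb))            # Eq. (5): weights from F_varphi^j
        w_imb, w_idb = torch.chunk(w, 2, dim=1)
        fused = self.fuse_conv(w_imb * f_imb + w_idb * f_idb)  # Eq. (6): weighted fusion
        soft_mask = w_idb.mean(dim=1, keepdim=True)            # channel-wise average pooling
        return fused, soft_mask
```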

4.5. Implementation Details
Table 1. Layer configurations of the iterative de-shadow branch (IDB, $\varphi$) and the identical mapping branch (IMB, $\phi$). Conv(in, out, k, s, p) denotes a convolution with the given input/output channels, kernel size k, stride s, and padding p.

IDB ($\varphi$) | IMB ($\phi$)
---|---
Conv(4, 64, 7, 1, 3), ReLU | Conv(3, 64, 7, 1, 3), ReLU
SAB |
Conv(64, 128, 4, 2, 1), ReLU | Conv(64, 128, 4, 2, 1), ReLU
Conv(128, 256, 4, 2, 1), ReLU | Conv(128, 256, 4, 2, 1), ReLU
SAB |
ResNet block × 8 | ResNet block × 8
Conv(256, 256, 3, 1, 1), ReLU | Conv(256, 256, 3, 1, 1), ReLU
Conv(256, 256, 3, 1, 1), ReLU | Conv(256, 256, 3, 1, 1), ReLU
Conv(256, 128, 4, 2, 1), ReLU | Conv(256, 128, 4, 2, 1), ReLU
Conv(128, 64, 4, 2, 1), ReLU | Conv(128, 64, 4, 2, 1), ReLU
SAB |
Conv(64, 3, 7, 1, 3) | Conv(64, 3, 7, 1, 3)
Network architectures. Following (Li et al., 2022a; Guo et al., 2021), the identical mapping branch and the iterative de-shadow branch employ a similar encoder-decoder architecture with different inputs and outputs, as shown in Table 1. Theoretically, the smart aggregation block could be added after each convolution layer of $\varphi$ to maximize its potential impact. However, to optimize computational efficiency, in our experiments we selectively add the smart aggregation block after the first, third, and second-to-last layers of $\varphi$ to balance computation and performance.
Loss functions. Following the previous shadow removal method (Fu et al., 2021c), we only employ the $\ell_1$ loss during the training process. Specifically, we first train the identical mapping branch with the objective function

(7)  $\mathcal{L}_{\mathrm{IMB}} = \lVert \hat{\mathbf{I}}_s - \mathbf{I}_s \rVert_1.$

Then we freeze $\phi$ and train the iterative de-shadow branch with the same objective function

(8)  $\mathcal{L}_{\mathrm{IDB}} = \lVert \hat{\mathbf{I}} - \mathbf{I}^{*} \rVert_1,$

where $\hat{\mathbf{I}}$ and $\mathbf{I}^{*}$ denote the de-shadowed result and the corresponding ground truth shadow-free image, respectively.
Training details. We adopt a two-step training strategy. Firstly, we exclusively utilize shadow images to train the identical mapping branch for 500,000 iterations, employing a batch size of 8. Subsequently, we freeze the IMB and utilize paired shadow & shadow-free images to train the iterative de-shadow branch for 150,000 iterations with the same batch size. Following (Fu et al., 2021c), we resize the input shadow images to 256×256 resolution. Both branches are optimized using the Adam optimizer with a learning rate of 0.00005. All experiments are conducted on a Linux server equipped with two NVIDIA Tesla V100 GPUs.
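The two-step schedule can be summarized with the following sketch, which assumes the IMB/IDB sketches from the previous subsections and two hypothetical data iterators (`shadow_loader` yielding shadow images, `paired_loader` yielding shadow image, mask, and ground-truth triplets) that cycle through the training set; the loss follows Eqs. (7)-(8).

```python
import torch
import torch.nn.functional as F

# Stage 1: train the IMB on shadow images only (identity reconstruction, Eq. (7)).
imb = IMB().cuda()
opt_imb = torch.optim.Adam(imb.parameters(), lr=5e-5)
for step, shadow in zip(range(500_000), shadow_loader):        # shadow_loader: hypothetical iterable
    shadow = shadow.cuda()
    recon, _ = imb(shadow)
    loss = F.l1_loss(recon, shadow)
    opt_imb.zero_grad(); loss.backward(); opt_imb.step()

# Stage 2: freeze the IMB and train the IDB on paired data (Eq. (8)).
idb = IDB(imb).cuda()
trainable = [p for p in idb.parameters() if p.requires_grad]   # exclude the frozen IMB
opt_idb = torch.optim.Adam(trainable, lr=5e-5)
for step, (shadow, mask, gt) in zip(range(150_000), paired_loader):  # paired_loader: hypothetical
    pred = idb(shadow.cuda(), mask.cuda(), num_iters=4)
    loss = F.l1_loss(pred, gt.cuda())
    opt_idb.zero_grad(); loss.backward(); opt_idb.step()
```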
5. Experiments
5.1. Setups
Datasets. Following the previous shadow removal method (Wan et al., 2022), we conduct experiments on two widely used shadow removal datasets, i.e., ISTD+ (Le and Samaras, 2019) and SRD (Qu et al., 2017). The ISTD+ dataset consists of 1330 triplets for training and 540 triplets for testing. We use the provided ground truth masks directly during the training procedure. In the evaluation step, we follow the previous method (Fu et al., 2021c) and use Otsu's algorithm to detect the corresponding shadow masks. The SRD dataset contains 2680 paired shadow and shadow-free images for training and 408 paired shadow and shadow-free images for testing. Because the SRD dataset does not provide shadow masks, we use the shadow masks provided by DHAN (Cun et al., 2020) for both the training and evaluation steps.
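As a rough illustration of the evaluation-time mask extraction, the sketch below thresholds the darkening between the shadow-free and shadow images with Otsu's algorithm; the exact procedure used by Fu et al. (2021c) may differ in pre- and post-processing, so this is an assumption-laden sketch of the idea only.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu

def detect_shadow_mask(shadow_img, shadow_free_img):
    """Binarize the darkening between the shadow-free and shadow images with
    Otsu's threshold; returns a boolean mask (True = shadow pixel)."""
    diff = rgb2gray(shadow_free_img) - rgb2gray(shadow_img)   # shadows make pixels darker
    return diff > threshold_otsu(diff)
```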
Metrics. We adopt a comprehensive evaluation approach for assessing the performance of our proposed method. Firstly, we calculate the root mean squared error (RMSE) in the LAB color space. Furthermore, following the previous approaches (Zhu et al., 2022a; Wan et al., 2022), we employ the commonly used image quality evaluation metrics, i.e., peak signal-to-noise ratio (PSNR) (Johnson et al., 2016), structural similarity index (SSIM), and learned perceptual image patch similarity (LPIPS) (Zhang et al., 2018a). This allows us to thoroughly evaluate the restoration quality of our proposed method from multiple perspectives.
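A minimal sketch of the RMSE computation in the LAB color space is given below; it reads the metric literally, whereas released evaluation scripts may differ in resizing and averaging details.

```python
import numpy as np
from skimage import img_as_float
from skimage.color import rgb2lab

def lab_rmse(pred_rgb, gt_rgb, region_mask=None):
    """Root mean squared error in the LAB color space, optionally restricted
    to a region (e.g., the shadow or non-shadow pixels of a boolean mask)."""
    err = rgb2lab(img_as_float(pred_rgb)) - rgb2lab(img_as_float(gt_rgb))  # HxWx3 difference
    if region_mask is not None:
        err = err[np.asarray(region_mask, dtype=bool)]        # keep only the selected pixels
    return float(np.sqrt(np.mean(err ** 2)))
```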
Baselines. We conduct comprehensive comparisons with previous state-of-the-art shadow removal algorithms, including SP+M-Net (Le and Samaras, 2019), Param+M+D-Net (Le and Samaras, 2020), Fu et al. (Fu et al., 2021c), LG-ShadowNet (Liu et al., 2021a), DC-ShadowNet (Jin et al., 2021), G2R-ShadowNet (Liu et al., 2021b), BMNet (Zhu et al., 2022a), and SG-ShadowNet (Wan et al., 2022) on the ISTD+ dataset. Additionally, we compare with DSC (Hu et al., 2019a), DHAN (Cun et al., 2020), Fu et al. (Fu et al., 2021c), DC-ShadowNet (Jin et al., 2021), BMNet (Zhu et al., 2022a), and SG-ShadowNet (Wan et al., 2022) on the SRD dataset.
5.2. Comparison Results
Quantitative comparison. To validate the effectiveness of our proposed method, we first conduct a comprehensive comparison with recent state-of-the-art methods on the ISTD+ dataset. The results, as depicted in Table 2, clearly demonstrate our method’s superiority in terms of reconstruction quality, as evaluated by multiple metrics, including RMSE, PSNR, SSIM, and LPIPS. Specifically, for the comparison at the whole image level, our method outperforms all the competitors. Compared to Fu et al.(Fu et al., 2021c), our method achieves a reduction of 20.50% in RMSE and 68.38% in LPIPS, as well as an increase of 15.32% in PSNR and 14.21% in SSIM. Similarly, compared to Param+M+D-Net(Le and Samaras, 2020), our method demonstrates superior performance with a reduction of 15.92% in RMSE and 30.30% in LPIPS, as well as an increase of 12.68% in PSNR and 1.89% in SSIM. For the comparison in the shadow regions, our method also outperforms other methods. Compared to Fu et al.(Fu et al., 2021c), our method achieves a reduction of 9.80% in RMSE, as well as an increase of 5.14% in PSNR and 1.30% in SSIM. When compared to Param+M+D-Net(Le and Samaras, 2020), our method achieves a reduction of 38.87% in RMSE, as well as an increase of 13.96% in PSNR and 0.47% in SSIM. For the comparison in the non-shadow regions, our method continues to outperform other methods. Compared to Fu et al.(Fu et al., 2021c), our method achieves a reduction of 24.12% in RMSE, as well as an increase of 19.45% in PSNR and 11.54% in SSIM. Additionally, when compared to Param+M+D-Net(Le and Samaras, 2020), our method achieves a reduction of 1.06% in RMSE, as well as an increase of 7.89% in PSNR and 0.43% in SSIM.
To further substantiate the effectiveness of our proposed method, we conduct additional comparison experiments on the SRD dataset. The results, as presented in Table 3, demonstrate the superiority of our method over other state-of-the-art shadow removal approaches. Our method exhibits a significant margin of improvement across all evaluation metrics. Specifically, for the comparison at the whole image level, our method outperforms all the competitors. Compared to DHAN(Cun et al., 2020), our method achieves a reduction of 22.20% in RMSE and 9.09% in LPIPS, as well as an increase of 9.46% in PSNR and 1.47% in SSIM. Compared to BMNet(Zhu et al., 2022a), our method demonstrates superior performance with a reduction of 14.39% in RMSE and 11.87% in LPIPS, as well as an increase of 5.30% in PSNR and 0.41% in SSIM. For the comparison in the shadow regions, our method also outperforms other methods. Compared to DHAN(Cun et al., 2020), our method achieves a reduction of 24.44% in RMSE, as well as an increase of 7.15% in PSNR and 0.41% in SSIM. When compared to BMNet(Zhu et al., 2022a), our method achieves a reduction of 15.90% in RMSE, as well as an increase of 6.12% in PSNR and 0.43% in SSIM. For the comparison in the non-shadow regions, our method continues to outperform other methods. Compared to DHAN(Cun et al., 2020), our method achieves a reduction of 20.31% in RMSE, as well as an increase of 9.22% in PSNR and 0.93% in SSIM. Additionally, when compared to BMNet(Zhu et al., 2022a), our method achieves a reduction of 13.13% in RMSE, as well as an increase of 2.65% in PSNR and 0.04% in SSIM. These comparison results unequivocally support the effectiveness of our proposed method and its superiority over the state-of-the-art methods in terms of reconstruction quality in both the shadow regions and non-shadow regions.
Method | RMSE (All) | PSNR (All) | SSIM (All) | LPIPS (All) | RMSE (Shadow) | PSNR (Shadow) | SSIM (Shadow) | RMSE (Non-Shadow) | PSNR (Non-Shadow) | SSIM (Non-Shadow)
---|---|---|---|---|---|---|---|---|---|---
SP+M-Net(Le and Samaras, 2019) | 3.610 | 32.33 | 0.9479 | 0.0716 | 7.205 | 36.16 | 0.9871 | 2.913 | 35.84 | 0.9723 |
Param+M+D-Net(Le and Samaras, 2020) | 4.045 | 30.12 | 0.9420 | 0.0759 | 9.714 | 33.59 | 0.9850 | 2.935 | 34.33 | 0.9723 |
Fu et al.(Fu et al., 2021c) | 4.278 | 29.43 | 0.8404 | 0.1673 | 6.583 | 36.41 | 0.9769 | 3.827 | 31.01 | 0.8755 |
LG-ShadowNet(Liu et al., 2021a) | 4.402 | 29.20 | 0.9335 | 0.0920 | 9.709 | 32.65 | 0.9806 | 3.363 | 33.36 | 0.9683 |
DC-ShadowNet(Jin et al., 2021) | 4.781 | 28.76 | 0.9219 | 0.1112 | 10.434 | 32.20 | 0.9758 | 3.674 | 33.21 | 0.9630 |
G2R-ShadowNet(Liu et al., 2021b) | 3.970 | 30.49 | 0.9330 | 0.0868 | 8.872 | 34.01 | 0.9770 | 3.010 | 34.62 | 0.9707 |
BMNet(Zhu et al., 2022a) | 3.595 | 32.30 | 0.9551 | 0.0567 | 6.189 | 37.30 | 0.9899 | 3.087 | 35.06 | 0.9738 |
SG-ShadowNet(Wan et al., 2022) | 3.531 | 32.41 | 0.9524 | 0.0594 | 6.019 | 37.41 | 0.9893 | 3.044 | 34.95 | 0.9725 |
Ours | 3.401 | 33.94 | 0.9598 | 0.0529 | 5.938 | 38.28 | 0.9896 | 2.904 | 37.04 | 0.9765 |
Method | RMSE (All) | PSNR (All) | SSIM (All) | LPIPS (All) | RMSE (Shadow) | PSNR (Shadow) | SSIM (Shadow) | RMSE (Non-Shadow) | PSNR (Non-Shadow) | SSIM (Non-Shadow)
---|---|---|---|---|---|---|---|---|---|---
DSC(Hu et al., 2019a) | 5.704 | 29.01 | 0.9044 | 0.1145 | 8.828 | 34.20 | 0.9702 | 4.509 | 31.85 | 0.9555 |
DHAN(Cun et al., 2020) | 4.666 | 30.67 | 0.9278 | 0.0792 | 7.771 | 37.05 | 0.9818 | 3.486 | 32.98 | 0.9591 |
Fu et al.(Fu et al., 2021c) | 6.269 | 27.90 | 0.8430 | 0.1820 | 8.927 | 36.13 | 0.9742 | 5.259 | 29.43 | 0.8888 |
DC-ShadowNet(Jin et al., 2021) | 4.893 | 30.75 | 0.9118 | 0.1084 | 8.103 | 36.68 | 0.9759 | 3.674 | 33.10 | 0.9540 |
BMNet(Zhu et al., 2022a) | 4.240 | 31.88 | 0.9376 | 0.0817 | 6.982 | 37.41 | 0.9816 | 3.198 | 35.09 | 0.9676 |
SG-ShadowNet(Wan et al., 2022) | 4.297 | 31.31 | 0.9273 | 0.0835 | 7.564 | 36.55 | 0.9807 | 3.056 | 34.23 | 0.9611 |
Ours | 3.630 | 33.57 | 0.9414 | 0.0720 | 5.872 | 39.70 | 0.9858 | 2.778 | 36.02 | 0.9680 |
Qualitative comparison. We compare our visualized results with other state-of-the-art shadow removal methods on both the ISTD+ and SRD datasets. As shown in Fig. 8, our method consistently outperforms the competitors in two aspects: ❶ Our reconstructed results exhibit superior color consistency. In particular, for case 2 and case 4, our method produces color-consistent results where the shadow regions and non-shadow regions are nearly indistinguishable to the human eye. In contrast, the competitors’ results show obvious color inconsistency. ❷ The mask boundary in our reconstructed results is smoother and more seamless. For case 1, the mask boundary in the competitors’ results is clearly visible, whereas it is imperceptible in our method’s result. For case 3, the competitors’ results exhibit ghosting artifacts around the mask boundary, while our result does not show any artifacts there.

5.3. Ablation Study
Method | RMSE (All) | PSNR (All) | SSIM (All) | LPIPS (All) | RMSE (Shadow) | PSNR (Shadow) | SSIM (Shadow) | RMSE (Non-Shadow) | PSNR (Non-Shadow) | SSIM (Non-Shadow)
---|---|---|---|---|---|---|---|---|---|---
(a) Feature addition | 3.545 | 33.76 | 0.9582 | 0.0548 | 6.021 | 38.32 | 0.9893 | 3.060 | 36.77 | 0.9755 |
(b) Feature multiplication | 3.798 | 32.99 | 0.9554 | 0.0574 | 6.879 | 37.22 | 0.9882 | 3.195 | 36.41 | 0.9744 |
(c) Feature concatenation | 3.513 | 33.91 | 0.9595 | 0.0530 | 6.121 | 38.37 | 0.9895 | 3.002 | 37.04 | 0.9764 |
(d) SAB w/o soft-mask | 3.528 | 33.62 | 0.9585 | 0.0542 | 6.293 | 38.09 | 0.9892 | 2.987 | 36.69 | 0.9756 |
(e) w/o SAB (first layer) | 3.543 | 33.66 | 0.9580 | 0.0540 | 6.246 | 38.27 | 0.9894 | 3.013 | 36.70 | 0.9750 |
(f) w/o SAB (third layer) | 3.599 | 33.77 | 0.9589 | 0.0536 | 6.549 | 38.26 | 0.9891 | 3.022 | 37.00 | 0.9764 |
(g) w/o SAB (second-to-last layer) | 3.689 | 33.76 | 0.9570 | 0.0571 | 6.314 | 38.26 | 0.9889 | 3.175 | 36.82 | 0.9744 |
(h) One encoder-decoder | 3.911 | 33.06 | 0.9545 | 0.0638 | 7.319 | 37.04 | 0.9875 | 3.243 | 36.59 | 0.9743 |
(i) Two encoder-decoder | 3.855 | 33.01 | 0.9554 | 0.0619 | 7.227 | 36.96 | 0.9880 | 3.194 | 36.54 | 0.9748 |
(j) Iteration-1 | 3.451 | 33.57 | 0.9598 | 0.0541 | 6.555 | 37.52 | 0.9890 | 2.844 | 37.23 | 0.9778 |
(k) Iteration-2 | 3.474 | 33.60 | 0.9589 | 0.0541 | 6.308 | 37.91 | 0.9894 | 2.919 | 36.87 | 0.9762 |
(l) Iteration-3 | 3.460 | 33.78 | 0.9591 | 0.0538 | 6.236 | 37.99 | 0.9894 | 2.916 | 37.03 | 0.9765 |
(m) Iteration-4(Ours) | 3.401 | 33.94 | 0.9598 | 0.0529 | 5.938 | 38.28 | 0.9896 | 2.904 | 37.04 | 0.9765 |
In this section, we conduct comprehensive ablation experiments to validate each part of the proposed method, and the results are displayed in Table 4.
5.3.1. Effectiveness of SAB
Firstly, we evaluate the effectiveness of the smart aggregation block by comparing it with three substituted aggregation operations: feature addition, which adds $\mathbf{F}_{\phi}^{j}$ and $\mathbf{F}_{\varphi}^{j}$; feature multiplication, which multiplies $\mathbf{F}_{\phi}^{j}$ and $\mathbf{F}_{\varphi}^{j}$; and feature concatenation, which concatenates $\mathbf{F}_{\phi}^{j}$ and $\mathbf{F}_{\varphi}^{j}$. Empowered by the smart aggregation block, our method achieves the highest reconstruction quality across all evaluation metrics, as demonstrated in (a)-(c). Specifically, at the whole image level, replacing the smart aggregation block with feature addition leads to an increase of 4.23% in RMSE and 3.59% in LPIPS, and a decrease of 0.53% in PSNR and 0.17% in SSIM. Replacing the smart aggregation block with feature multiplication leads to an increase of 11.67% in RMSE and 8.51% in LPIPS, and a decrease of 2.80% in PSNR and 0.46% in SSIM. Replacing the smart aggregation block with feature concatenation leads to an increase of 3.29% in RMSE and 0.19% in LPIPS, and a decrease of 0.09% in PSNR and 0.03% in SSIM.
To further demonstrate the necessity of each smart aggregation block, we conduct experiments to remove them individually. As shown in (e)-(g), we find that removing any of the smart aggregation blocks leads to a decrease in reconstruction quality across all metrics. At the whole image level, removing the SAB after the first layer leads to an increase of 4.18% in RMSE and 2.08% in LPIPS, and a decrease of 0.82% in PSNR and 0.19% in SSIM. Removing the SAB after the third layer leads to an increase of 5.82% in RMSE and 1.32% in LPIPS, and a decrease of 0.50% in PSNR and 0.09% in SSIM. Removing the SAB after the second-to-last layer leads to an increase of 8.47% in RMSE and 7.94% in LPIPS, and a decrease of 0.53% in PSNR and 0.29% in SSIM.
5.3.2. Effectiveness of the soft mask
Besides, we evaluate the significance of the soft mask $\mathbf{M}_{\mathrm{soft}}^{j}$ produced in the smart aggregation block by removing it directly. As shown in (d), we observe that without $\mathbf{M}_{\mathrm{soft}}^{j}$, the reconstruction quality significantly deteriorates. Specifically, it leads to an increase of 3.73% in RMSE and 2.46% in LPIPS, and a decrease of 0.94% in PSNR and 0.14% in SSIM at the whole image level.
5.3.3. Effectiveness of the Dual-Branch shadow removal paradigm
Furthermore, we compare our method with an encoder-decoder architecture in two scenarios. Firstly, we evaluate a single saved model that is optimal for the whole image. As shown in (h), our method outperforms this scenario. Specifically, at the whole image level, our method achieves a reduction of 13.04% in RMSE and 17.08% in LPIPS, as well as an increase of 2.66% in PSNR and 0.56% in SSIM. Secondly, we evaluate two saved models that are optimal for the shadow regions and non-shadow regions, respectively. To obtain a de-shadowed clean image, we combine the restored results of these two selected models using the binary shadow masks. As shown in (i), our method also outperforms this scenario. Specifically, at the whole image level, our method achieves a reduction of 11.78% in RMSE and 14.54% in LPIPS, as well as an increase of 2.82% in PSNR and 0.46% in SSIM.
5.3.4. Effectiveness of the iterative strategy
Finally, we evaluate the performance of our method with different numbers of iterations. As shown in (j)-(m), increasing the number of iterations significantly improves performance in the shadow regions, with only a negligible decrease in performance in the non-shadow regions. Specifically, compared to iteration-1, our method reduces the RMSE in shadow regions by 9.41% while only increasing it in non-shadow regions by 2.11%. Compared to iteration-2, our method surprisingly reduces the RMSE in both shadow and non-shadow regions by 5.87% and 0.51%, respectively. Similarly, in comparison to iteration-3, our method achieves a reduction in RMSE of 4.78% and 0.41% in shadow and non-shadow regions, respectively.
6. Conclusions
In this work, we first identify the limitation of existing shadow removal approaches that use a shared model to restore both shadow and non-shadow regions. To overcome this limitation, we propose to decouple shadow removal into two distinct tasks: restoring shadow regions to their shadow-free counterparts and identical mapping for non-shadow regions. Specifically, our proposed method comprises three components. Firstly, we employ an identical mapping branch (IMB) to handle the non-shadow regions. Secondly, we use an iterative de-shadow branch (IDB) to handle the shadow regions by progressively transferring information from the non-shadow regions to the shadow regions in an iterative manner, which facilitates the process of shadow removal. Finally, we design a smart aggregation block (SAB) to adaptively integrate features from both the IMB and IDB. Extensive experiments demonstrate the superiority of our proposed method over all state-of-the-art competitors.
References
- Abiko and Ikehara (2022) Ryo Abiko and Masaaki Ikehara. 2022. Channel attention GAN trained with enhanced dataset for single-image shadow removal. IEEE Access 10 (2022), 12322–12333.
- Cun et al. (2020) Xiaodong Cun, Chi-Man Pun, and Cheng Shi. 2020. Towards ghost-free shadow removal via dual hierarchical aggregation network and shadow matting GAN. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 10680–10687.
- Finlayson et al. (2009) Graham D Finlayson, Mark S Drew, and Cheng Lu. 2009. Entropy minimization for shadow removal. International Journal of Computer Vision 85, 1 (2009), 35–57.
- Finlayson et al. (2005) Graham D Finlayson, Steven D Hordley, Cheng Lu, and Mark S Drew. 2005. On the removal of shadows from images. IEEE transactions on pattern analysis and machine intelligence 28, 1 (2005), 59–68.
- Fu et al. (2021a) Lan Fu, Qing Guo, Felix Juefei-Xu, Hongkai Yu, Wei Feng, Yang Liu, and Song Wang. 2021a. Benchmarking shadow removal for facial landmark detection and beyond. arXiv preprint arXiv:2111.13790 (2021).
- Fu et al. (2021b) Lan Fu, Hongkai Yu, Xiaoguang Li, Craig P Przybyla, and Song Wang. 2021b. Deep Learning for Object Detection in Materials-Science Images: A tutorial. IEEE Signal Processing Magazine 39, 1 (2021), 78–88.
- Fu et al. (2021c) Lan Fu, Changqing Zhou, Qing Guo, Felix Juefei-Xu, Hongkai Yu, Wei Feng, Yang Liu, and Song Wang. 2021c. Auto-exposure fusion for single-image shadow removal. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10571–10580.
- Gao et al. (2022) Jianhao Gao, Quanlong Zheng, and Yandong Guo. 2022. Towards real-world shadow removal with a shadow simulation method and a two-stage framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 599–608.
- Guo et al. (2021) Qing Guo, Xiaoguang Li, Felix Juefei-Xu, Hongkai Yu, Yang Liu, and Song Wang. 2021. JPGNet: Joint Predictive Filtering and Generative Network for Image Inpainting. In Proceedings of the 29th ACM International Conference on Multimedia. 386–394.
- Guo et al. (2012) Ruiqi Guo, Qieyun Dai, and Derek Hoiem. 2012. Paired regions for shadow detection and removal. IEEE transactions on pattern analysis and machine intelligence 35, 12 (2012), 2956–2967.
- Hang et al. (2019) Renlong Hang, Qingshan Liu, Danfeng Hong, and Pedram Ghamisi. 2019. Cascaded recurrent neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 57, 8 (2019), 5384–5394.
- He et al. (2021) Yingqing He, Yazhou Xing, Tianjia Zhang, and Qifeng Chen. 2021. Unsupervised Portrait Shadow Removal via Generative Priors. In Proceedings of the 29th ACM International Conference on Multimedia. 236–244.
- Hu et al. (2019a) Xiaowei Hu, Chi-Wing Fu, Lei Zhu, Jing Qin, and Pheng-Ann Heng. 2019a. Direction-aware spatial context features for shadow detection and removal. IEEE TPAMI 42, 11 (2019), 2795–2808.
- Hu et al. (2019b) Xiaowei Hu, Yitong Jiang, Chi-Wing Fu, and Pheng-Ann Heng. 2019b. Mask-ShadowGAN: Learning to remove shadows from unpaired data. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2472–2481.
- Inoue and Yamasaki (2020) Naoto Inoue and Toshihiko Yamasaki. 2020. Learning from synthetic shadows for shadow detection and removal. IEEE Transactions on Circuits and Systems for Video Technology 31, 11 (2020), 4187–4197.
- Jin et al. (2021) Yeying Jin, Aashish Sharma, and Robby T Tan. 2021. DC-ShadowNet: Single-Image Hard and Soft Shadow Removal Using Unsupervised Domain-Classifier Guided Network. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5027–5036.
- Johnson et al. (2016) Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision. Springer, 694–711.
- Le and Samaras (2019) Hieu Le and Dimitris Samaras. 2019. Shadow removal via shadow image decomposition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 8578–8587.
- Le and Samaras (2020) Hieu Le and Dimitris Samaras. 2020. From shadow segmentation to shadow removal. In European Conference on Computer Vision. Springer, 264–281.
- Li et al. (2020) Jingyuan Li, Ning Wang, Lefei Zhang, Bo Du, and Dacheng Tao. 2020. Recurrent feature reasoning for image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7760–7768.
- Li et al. (2022b) Jiankun Li, Peisen Wang, Pengfei Xiong, Tao Cai, Ziwei Yan, Lei Yang, Jiangyu Liu, Haoqiang Fan, and Shuaicheng Liu. 2022b. Practical stereo matching via cascaded recurrent network with adaptive correlation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16263–16272.
- Li et al. (2023) Xiaoguang Li, Qing Guo, Rabab Abdelfattah, Di Lin, Wei Feng, Ivor Tsang, and Song Wang. 2023. Leveraging Inpainting for Single-Image Shadow Removal. arXiv preprint arXiv:2302.05361 (2023).
- Li et al. (2022a) Xiaoguang Li, Qing Guo, Di Lin, Ping Li, Wei Feng, and Song Wang. 2022a. MISF: Multi-level Interactive Siamese Filtering for High-Fidelity Image Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1869–1878.
- Liu et al. (2021a) Zhihao Liu, Hui Yin, Yang Mi, Mengyang Pu, and Song Wang. 2021a. Shadow removal by a lightness-guided network with training on unpaired data. IEEE Transactions on Image Processing 30 (2021), 1853–1865.
- Liu et al. (2021b) Zhihao Liu, Hui Yin, Xinyi Wu, Zhenyao Wu, Yang Mi, and Song Wang. 2021b. From Shadow Generation to Shadow Removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4927–4936.
- Mohan et al. (2007) Ankit Mohan, Jack Tumblin, and Prasun Choudhury. 2007. Editing soft shadows in a digital photograph. IEEE Computer Graphics and Applications 27, 2 (2007), 23–31.
- Nadimi and Bhanu (2004) Sohail Nadimi and Bir Bhanu. 2004. Physical models for moving shadow and object detection in video. IEEE transactions on pattern analysis and machine intelligence 26, 8 (2004), 1079–1087.
- Qu et al. (2017) Liangqiong Qu, Jiandong Tian, Shengfeng He, Yandong Tang, and Rynson WH Lau. 2017. Deshadownet: A multi-context embedding deep network for shadow removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4067–4075.
- Saharia et al. (2022) Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. 2022. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
- Sanin et al. (2010) Andres Sanin, Conrad Sanderson, and Brian C Lovell. 2010. Improved shadow removal for robust person tracking in surveillance scenarios. In ICPR. 141–144.
- Wan et al. (2022) Jin Wan, Hui Yin, Zhenyao Wu, Xinyi Wu, Yanting Liu, and Song Wang. 2022. Style-Guided Shadow Removal. In Proceedings of the European Conference on Computer Vision (ECCV).
- Wang et al. (2022) Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. 2022. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696 (2022).
- Wang and Wang (2022) Dong Wang and Xiao-Ping Wang. 2022. The iterative convolution–thresholding method (ICTM) for image segmentation. Pattern Recognition 130 (2022), 108794.
- Wang et al. (2021) Zifeng Wang, Shao-Lun Huang, Ercan E Kuruoglu, Jimeng Sun, Xi Chen, and Yefeng Zheng. 2021. PAC-bayes information bottleneck. arXiv preprint arXiv:2109.14509 (2021).
- Wei et al. (2019) Jinjiang Wei, Chengjiang Long, Hua Zou, and Chunxia Xiao. 2019. Shadow inpainting and removal using generative adversarial networks with slice convolutions. In Computer Graphics Forum, Vol. 38. Wiley Online Library, 381–392.
- Yan et al. (2022) Zhiqiang Yan, Kun Wang, Xiang Li, Zhenyu Zhang, Jun Li, and Jian Yang. 2022. RigNet: Repetitive image guided network for depth completion. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVII. Springer, 214–230.
- Yu et al. (2019) Songhyun Yu, Bumjun Park, and Jechang Jeong. 2019. Deep iterative down-up cnn for image denoising. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. 0–0.
- Zhan and Lu (2019) Fangneng Zhan and Shijian Lu. 2019. Esir: End-to-end scene text recognition via iterative image rectification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2059–2068.
- Zhang et al. (2020) Ling Zhang, Chengjiang Long, Xiaolong Zhang, and Chunxia Xiao. 2020. Ris-gan: Explore residual and illumination with generative adversarial networks for shadow removal. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 12829–12836.
- Zhang et al. (2015) Ling Zhang, Qing Zhang, and Chunxia Xiao. 2015. Shadow remover: Image shadow removal based on illumination recovering optimization. IEEE Transactions on Image Processing 24, 11 (2015), 4623–4636.
- Zhang et al. (2018a) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018a. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 586–595.
- Zhang et al. (2018b) Wuming Zhang, Xi Zhao, Jean-Marie Morvan, and Liming Chen. 2018b. Improving shadow suppression for illumination robust face recognition. IEEE transactions on pattern analysis and machine intelligence 41, 3 (2018), 611–624.
- Zhou et al. (2022) Xiaofei Zhou, Kunye Shen, Li Weng, Runmin Cong, Bolun Zheng, Jiyong Zhang, and Chenggang Yan. 2022. Edge-guided recurrent positioning network for salient object detection in optical remote sensing images. IEEE Transactions on Cybernetics 53, 1 (2022), 539–552.
- Zhu et al. (2022a) Yurui Zhu, Jie Huang, Xueyang Fu, Feng Zhao, Qibin Sun, and Zheng-Jun Zha. 2022a. Bijective Mapping Network for Shadow Removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5627–5636.
- Zhu et al. (2022b) Yurui Zhu, Zeyu Xiao, Yanchi Fang, Xueyang Fu, Zhiwei Xiong, and Zheng-Jun Zha. 2022b. Efficient Model-Driven Network for Shadow Removal. (2022).