Discriminative feature encoding for
intrinsic image decomposition
Abstract
Intrinsic image decomposition is an important and long-standing computer vision problem. Given an input image, recovering the physical scene properties is ill-posed. Several physically motivated priors have been used to restrict the solution space of the optimization problem for intrinsic image decomposition. This work takes advantage of deep learning, and shows that it can solve this challenging computer vision problem with high efficiency. The focus lies in the feature encoding phase to extract discriminative features for different intrinsic layers from an input image. To achieve this goal, we explore the distinctive characteristics of different intrinsic components in the high dimensional feature embedding space. We define feature distribution divergence to efficiently separate the feature vectors of different intrinsic components. The feature distributions are also constrained to fit the real ones through a feature distribution consistency. In addition, a data refinement approach is provided to remove data inconsistency from the Sintel dataset, making it more suitable for intrinsic image decomposition. Our method is also extended to intrinsic video decomposition based on pixel-wise correspondences between adjacent frames. Experimental results indicate that our proposed network structure can outperform the existing state-of-the-art.
Index Terms:
intrinsic image decomposition; deep learning; feature distribution; data refinement.
I Introduction
In terms of intrinsic image decomposition, the albedo image A captures the surface material’s reflectivity, which is unchanging under different illumination conditions, while the shading image S accounts for illumination effects due to object geometry and camera viewpoint [1]. It is an ill-posed problem to reconstruct these two intrinsic images from a single color image I, which follows the formation model:
I = A ⊙ S,     (1)
where ⊙ denotes the per-pixel (element-wise) product.
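To make the formation model concrete, the following minimal NumPy sketch (ours, not from the paper) synthesizes an image from given albedo and shading layers and illustrates why the inverse problem is ill-posed: rescaling the two layers in opposite directions yields exactly the same observed image.

```python
import numpy as np

# Minimal illustration of the formation model in Eq. (1): the observed image is
# the per-pixel product of an albedo layer and a (gray) shading layer.
h, w = 4, 4
albedo = np.random.rand(h, w, 3)            # reflectance, assumed in [0, 1]
shading = np.random.rand(h, w, 1)           # gray shading, broadcast over RGB
image = albedo * shading                    # I = A ⊙ S

# Ill-posedness: scaling albedo down and shading up by the same factor k
# produces a different, equally valid decomposition of the same image.
k = 2.0
assert np.allclose(image, (albedo / k) * (shading * k))
```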
To solve this challenging inverse image formation problem, many researchers have tried applying physically motivated priors as constraints to disambiguate the decomposition [2, 3, 4, 5, 6, 7, 8, 9]. These methods usually represent the priors in the form of energy terms and solve the decomposition problem through graph-based inference algorithms. With the surge of ground-truth intrinsic decomposition data [10, 11, 12], data-driven deep learning methods [1, 13, 14, 15, 16, 17] have achieved promising decomposition results and have drawn more and more research interest. However, fully-supervised methods require high-quality and densely-labelled decompositions, which are expensive to acquire. To overcome this problem, methods training across different datasets [17], training on synthetic datasets [15, 17], adding additional constraints [16] and reusing physically motivated priors [1] have been proposed.
When developing their specific deep learning techniques, previous methods usually extract features via a shared encoder, and then use different decoders to disentangle information for specific intrinsic layers. Observing the different distributions between albedo and shading in the gradient domain [2], it is natural to assume that features representing different intrinsic layers can be separated in the embedding space. With the features separated during the encoding phase, decoders can be relieved from distilling clues for specific targets and focus on the reconstruction procedure. This idea motivates the research in this paper.
We propose a novel two-stream encoder-decoder network for intrinsic image decomposition. In particular, our feature distribution divergence (FDD) constraint is designed to encourage the two encoders to extract distinctive features for different intrinsic layers. Our feature distribution consistency (FDC) constraint is used to encourage the features of a reconstructed intrinsic layer to have a similar distribution pattern to ground-truth decompositions. Moreover, we provide an approach to deal with the illumination inconsistency between the ground truth shading and input images in the MPI Sintel dataset, making it more suitable for intrinsic image decomposition. We also provide an intrinsic decomposition method for video data based on pixel-wise correspondences between adjacent frames. This work is an extension of our previous published workshop paper [18], giving more detailed method descriptions, novel technical contributions, and more comprehensive experiments.
The major contributions of this work are:
A novel two-stream encoder-decoder network for intrinsic image decomposition, in which discriminative feature encoding is achieved via feature distribution divergence and feature distribution consistency constraints.
A data refinement algorithm for the MPI Sintel dataset, producing a more physically consistent dataset that better suits the intrinsic decomposition task.
Experimental results on various datasets to demonstrate the effectiveness of our proposed method, including experimental extension to decomposition of video data.
II Related work
II-A Approaches
Intrinsic image decomposition is a long-standing computer vision problem. However, it is seriously ill-posed to recover an albedo layer and a shading layer from a single color image [15]. In recent decades, considerable effort has been devoted to this challenging problem. These approaches can be coarsely classified into optimization-based methods using physically motivated priors, and deep learning based, data-driven methods [15, 19]. There are also approaches using multiple images as input [20, 21, 22, 23, 24, 25], treating the reflectance as a constant factor under changing illumination. These methods require the images to be captured by a static camera with varying illumination. A generative adversarial network (GAN) based domain transfer framework has also been applied to image layer separation tasks [26]. Additional cues, including depth maps [27, 28, 29, 30] and near-infrared images [31], are also taken into account in some work. Here, we focus on key works recovering intrinsic images from a single input.
II-B Physically motivated priors based methods
To solve this ill-posed intrinsic decomposition problem, researchers have derived several physically-inspired priors to constrain the solution space [15]. Land et al. [2] proposed the retinex algorithm, exploring the different properties of intrinsic components in the gradient domain: large derivatives are perceived as changes in reflectance properties, while smoother variations are seen as changes in illumination. Based on this assumption, many priors for intrinsic image decomposition have been explored. Derived from a piece-wise constant property, reflectance sparsity [3, 4] and low-rank reflectance [8] have been used as constraints. Other constraints include the distribution difference in the gradient domain [32, 33, 34], non-local textures [5, 6], shape and illumination [7], and user strokes [8, 9]. These hand-crafted priors are not likely to be valid across complex datasets [19]. Bi et al. [32] presented an approach using the L1 norm for piece-wise image flattening, and proposed an algorithm for complex scene-level intrinsic image decomposition. Li et al. [33] presented a method to automatically extract two layers from an image based on differences in their distributions in the gradient domain. Sheng et al. [34] proposed an approach based on illumination decomposition, in which a shading image is decomposed into drift and step shading channels based on different distribution properties in the gradient domain. Based on an analysis of the logarithmic transformation, Fu et al. [35] introduced a weighted variational model to refine the regularization terms for intrinsic image decomposition. Later, Fu et al. [36] presented an algorithm incorporating a norm-based reflectance sparseness regularizer and a shading smoothness regularizer based on total variation. Non-local texture constraints [5, 6] are used to find pixels with the most similar reflectance within an image. Krebs et al. [37] developed a method for intrinsic image decomposition from a single RGB or multispectral image, taking into consideration the mathematical properties of the mean and standard deviation along the spectral axis. Although these methods restrict the solution space to a feasible region, such specifically designed priors cannot hold under complex conditions. The results are also largely influenced by the parameter settings, which require expert knowledge. Our method differs in that it is data driven, and parameters are automatically learned from the dataset.

II-C Deep learning methods
Thanks to the public availability of intrinsic image datasets such as the MIT intrinsic dataset [10], the MPI Sintel dataset [11] and Intrinsic Images in the Wild (IIW) [12], application of deep learning to intrinsic decomposition has surged [13, 38, 39, 40, 41]. Direct intrinsics [14] provided the first entirely deep learning model that directly outputs albedo and shading layers given a color image. Results from this method are blurred due to down-sampling during encoding and deconvolution during decoding.
Facing the fact that high-quality and densely-labelled intrinsic images are expensive to acquire, many methods have been developed to train models with additional constraints [16, 42], reusing physically motivated priors [1, 43, 44, 45, 46], expanding the dataset with synthetic images [15, 47, 17] or training across datasets [17].
Fan et al. [16] provided a network structure using a domain filter between edges in the guidance map to encourage piece-wise constant reflectance. [42] used region masks to guide the separation of different intrinsic components. By utilizing additional constraints, the solution space for the problem is further restricted. Seo et al. [43] proposed an image-decomposition network, which makes use of all three consistency premises from retinex theory; in this work, pseudo images (generated, color-transferred multi-exposure images) are used for training. Baslamisli et al. [1] presented a two-stage framework to first split the image gradients into albedo and shading components, which are then fed into decoders to predict pixel-wise intrinsic values. In their later work [44], the gradient descriptors for albedo and shading are derived from a physics-based reflection model and used to compute the shading map directly from RGB image gradients. [45] and [46] derived fine-grained shading components from a physics-based image formation model, in which the shading component is further decomposed into direct and indirect components, and shape-dependent/independent ones. These works proposed novel methods by revisiting physically motivated priors. Shi et al. [15] trained a model to learn albedo, shading and specular images on a large-scale object-level synthetic dataset by rendering ShapeNet [48]. Sial et al. [47] trained an intrinsic decomposition model on a synthetic ShapeNet-based scene dataset which has more realistic lighting effects. Li et al. [17] presented an end-to-end learning approach that learns better intrinsic image decomposition by leveraging datasets with different types of labels.
The majority of these methods extract features via a shared encoder, and then use different decoders to disentangle information for specific intrinsic layers. In contrast to these works, we try to exploit the difference between intrinsic components in feature space through a novel two-stream framework. With the features separated in the encoding phase, decoders can be relieved from distilling clues for specific targets and focus on the reconstruction procedure.
II-D Intrinsic video
Kong et al. [49] defined intrinsic video estimation as the problem of extracting temporally coherent albedo and shading from video alone. Ye et al. [50] proposed a probabilistic approach to propagate the reflectance from the initial intrinsic decomposition of the first frame. In order to achieve temporal consistency, these methods rely on optical flow to provide correspondences across time. Meka et al. [51] presented the first approach to tackle the hard intrinsic video decomposition problem at real-time frame rates. This method applies global consistency constraints in space and time based on random sampling. Lei et al. [52] presented a novel and general approach for blind video temporal consistency; it is trained directly on a single pair of original and processed videos rather than on a large dataset. In this paper, we simply extend our intrinsic image decomposition method to video based on optical flow, preserving temporal consistency during the decomposition process.
III Method

III-A Network structure
Our network architecture is visualized in Figure 1. The framework consists of two streams of encoder-decoder sub-networks. One performs albedo image reconstruction, and the other, shading image reconstruction. Taking the albedo sub-network for example, the input image is passed through a convolutional encoder to extract multi-level features, which are then aggregated by sequences of upsampling, concatenation, and convolution. In the decoding phase, the fused multi-scale features are fed into a sequence of three residual dilated blocks to reconstruct the albedo intrinsic image. The shading sub-network has the same structure as the albedo sub-network. In practice, we adopt VGG-19 [53] pretrained on ImageNet [54] as the initial encoder.
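The exact layer configuration is given in Figure 1; the PyTorch sketch below is only our simplified reading of one stream: a VGG-19 feature encoder tapped at five levels, multi-scale fusion by upsampling and concatenation, and a decoder of three residual dilated blocks. The block boundaries, channel widths, fusion convolution, and residual block design are our assumptions, and the paper initializes the encoder from ImageNet-pretrained weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class ResidualDilatedBlock(nn.Module):
    """Residual block with dilated convolutions (this block design is a plausible guess)."""
    def __init__(self, ch, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation))
    def forward(self, x):
        return F.relu(x + self.body(x))

class IntrinsicStream(nn.Module):
    """One encoder-decoder stream (used once for albedo and once for shading)."""
    def __init__(self, out_ch=3):
        super().__init__()
        feats = vgg19().features          # the paper uses ImageNet-pretrained VGG-19
        # tap the encoder after each of its five convolutional stages
        self.blocks = nn.ModuleList(
            [feats[:4], feats[4:9], feats[9:18], feats[18:27], feats[27:36]])
        self.fuse = nn.Conv2d(64 + 128 + 256 + 512 + 512, 256, 1)
        self.decoder = nn.Sequential(*[ResidualDilatedBlock(256) for _ in range(3)],
                                     nn.Conv2d(256, out_ch, 3, padding=1))
    def forward(self, x):
        levels, h = [], x
        for blk in self.blocks:
            h = blk(h)
            levels.append(h)              # multi-level encoder features
        up = [F.interpolate(f, size=x.shape[-2:], mode="bilinear", align_corners=False)
              for f in levels]
        fused = self.fuse(torch.cat(up, dim=1))
        return self.decoder(fused), levels

albedo_net, shading_net = IntrinsicStream(), IntrinsicStream()   # the two streams
```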
Previous works usually use a shared encoder to extract features containing both albedo and shading information. Different decoders are then applied to distill clues from the comprehensive features for specific intrinsic image reconstruction. The ‘Y’-shaped framework can be formulated as:
Â = D_A(E(I; θ_E); θ_{D_A}),    Ŝ = D_S(E(I; θ_E); θ_{D_S}),     (2)
where E and D denote the feature encoder and decoder respectively, and θ_E and θ_D represent their corresponding trainable parameters. In this equation, we write D(·; θ_{D_A}) as D_A: the subscript of the function symbol means that the trainable parameters differ for the specific intrinsic components, while the network architectures are the same.
Unlike such methods, our network design has two encoders, one each for the albedo and shading images. In this paper, we denote this structure as a two-stream framework:
Â = D_A(E_A(I; θ_{E_A}); θ_{D_A}),    Ŝ = D_S(E_S(I; θ_{E_S}); θ_{D_S}).     (3)
Using this framework, the encoders (E_A, E_S) are able to extract features more pertinent to their reconstruction targets (albedo, shading). In Figure 2, we visualize the feature distributions of different network structures, which explains our idea pictorially.
In each row of Figure 2, a feature embedding visualization using t-SNE is provided on the right. Each data point represents a feature extracted by the encoder, which is then fed into the corresponding intrinsic image reconstruction decoder. We use the same color coding for the embedding’s data points and the extracted feature vectors from the simplified network structure. The proposed two-stream framework (given in detail in Figure 1) results in a better feature embedding: in the embedding space, the features for different intrinsic components are better separated.
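An embedding plot in the style of Figure 2 can be produced with a short script like the one below (ours); the feature arrays here are random placeholders standing in for encoder activations sampled at individual spatial locations.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder (N, C) feature samples; in practice these would be feature vectors
# sampled from the albedo and shading encoders at individual spatial locations.
feat_albedo = np.random.randn(500, 256)
feat_shading = np.random.randn(500, 256)

emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(
    np.concatenate([feat_albedo, feat_shading], axis=0))
plt.scatter(emb[:500, 0], emb[:500, 1], s=4, label="albedo features")
plt.scatter(emb[500:, 0], emb[500:, 1], s=4, label="shading features")
plt.legend()
plt.show()
```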
The rest of this section introduces the core idea and detailed design of the discriminative feature encoding. Then, important constraints for our intrinsic decomposition network are explained.
III-B Discriminative feature encoding
III-B1 Basis
Our work is inspired by Land et al. [2]: the retinex approach assumes that albedo and shading layers possess different properties in the gradient domain. By utilizing such discriminative properties, the intrinsic decomposition results can be improved. In this work, we study and exploit the discriminative properties in a more general convolutional feature space. We next describe the proposed discriminative feature encoding in detail.
III-B2 Feature distribution divergence
As Figure 1 shows, the encoding phase consists of multiple (convolution, ReLU, max-pooling) blocks, through which the input signal is encoded into several different abstraction levels. The multi-scale features are denoted {F^l}, in which F^l represents the output feature of the l-th block. We define a feature distance function d(F_A^l, F_S^l), where the features have C channels and the input signal has spatial size H × W:

(4)

The feature distance measurement is based on the cosine distance between two vectors together with a norm term. In Eq. (4), F_A^l and F_S^l represent features from the albedo encoder and the shading encoder respectively; ⟨·, ·⟩ is the inner product in Euclidean space, and (i, j) indexes a spatial location in a feature map. g(·) is a distance rescaling function in the form of a modified sigmoid, which keeps the distance values bounded.
The feature distribution divergence is now formulated as:
FDD = Σ_{l=1}^{5} w_l · d(F_A^l, F_S^l),     (5)

where w_l is the weight for the feature distance at abstraction level l. Empirically, we use five different levels of abstraction in our experiments (l = 1, …, 5); the weights w_l are set empirically.
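Since Eq. (4) is not reproduced in full above, the sketch below is only our reading of the construction: a cosine-based per-location distance between the two encoders' features, rescaled by a sigmoid-like g, then combined across the five abstraction levels with weights w_l as in Eq. (5). The rescaling constants and unit weights are placeholders, and the norm term of Eq. (4) is omitted.

```python
import torch
import torch.nn.functional as F

def feature_distance(fa, fs):
    """Per-level distance between albedo- and shading-encoder features (B, C, H, W).
    Only the cosine part of Eq. (4) is sketched; g() is a modified sigmoid so the
    distance stays bounded (constants here are illustrative, not the paper's)."""
    raw = 1.0 - F.cosine_similarity(fa, fs, dim=1)     # (B, H, W): larger = more distinct
    return torch.sigmoid(4.0 * raw - 2.0).mean()

def fdd(feats_albedo, feats_shading, weights=(1.0,) * 5):
    """Feature distribution divergence (Eq. (5)): weighted sum over abstraction levels.
    Training would maximize this quantity (e.g. by minimizing its negative)."""
    return sum(w * feature_distance(fa, fs)
               for w, fa, fs in zip(weights, feats_albedo, feats_shading))
```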
III-B3 Feature distribution consistency
The feature distribution divergence aims to increase the distance between the feature vectors embedded by different encoders. However, this is not sufficient for discriminative feature encoding. The core idea of Fisher’s linear discriminant is to maximize the distance between classes and minimize the distance within classes simultaneously. As an analogue of that, along with the feature distribution divergence described above, we use the feature perceptual loss [55] between the predicted and ground truth intrinsic images to constrain the encoding process, encouraging the embedded features to fit the real distribution.
We use the same distance measurement as for the feature distribution divergence; d_l denotes the feature distance at the l-th abstraction level.
The feature distribution consistency is formulated as:
(6)

where w_l is a weight, and each term in Eq. (6) measures the feature similarity between a predicted intrinsic image and its ground truth. Minimizing Eq. (6) encourages the predicted and ground truth intrinsic images to have similar perceptual features. In practice, the encoders are reused to extract features from the predicted and target results in our framework, so the embedded feature distribution can be optimized directly during training. The weights are again set empirically.
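As a rough sketch of the feature distribution consistency, assuming `encoder(x)` returns the list of five multi-level features of a stream: the encoder is reused on the prediction and on the ground-truth intrinsic image, and per-level feature discrepancies are penalized. We use an L1 feature discrepancy here purely for brevity, whereas the paper uses the same distance measurement as the FDD.

```python
import torch.nn.functional as F

def fdc(encoder, pred, target, weights=(1.0,) * 5):
    """Feature distribution consistency (Eq. (6)), sketched as a perceptual loss [55]:
    features of the predicted intrinsic image should match those of the reference.
    Gradients flow through both passes, so the embedded feature distribution itself
    is optimized during training, as described in the text."""
    pred_feats = encoder(pred)
    target_feats = encoder(target)
    return sum(w * F.l1_loss(fp, ft)
               for w, fp, ft in zip(weights, pred_feats, target_feats))
```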
III-C Basic supervision constraints
III-C1 Use of losses
Besides the above constraints for discriminative feature encoding, several basic supervision losses are adopted to train the intrinsic image decomposition network.
As described in Eq. (3), given an image I, the albedo image Â and the shading image Ŝ are predicted through the trained encoder-decoder streams. With densely-labelled intrinsic images A and S as ground truth data, we constrain the pixel-wise predictions using a reconstruction loss and a gradient loss.
III-C2 Reconstruction loss
We use a pixel-wise loss combined with the SSIM (structural similarity index [56]) loss as the reconstruction loss:

(7)

where SSIM(x, y) measures the structural similarity of images x and y. The SSIM loss is 1 − SSIM, indicating structural dissimilarity. The balance weights are set empirically, and the pixel-wise term is computed per pixel.
A cycle loss is additionally used to encourage the product of the predicted Â and Ŝ to be similar to the input image I.
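A sketch of the reconstruction and cycle losses follows, under two assumptions of ours: the pixel term is taken to be L1, and SSIM is computed globally over the image rather than with the usual sliding window; the balance weight `alpha` is a placeholder, not the paper's setting.

```python
import torch.nn.functional as F

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole image (a simplification of windowed SSIM [56])."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def reconstruction_loss(pred, gt, alpha=0.85):
    """Pixel loss plus structural dissimilarity (Eq. (7)); alpha is a placeholder weight."""
    return alpha * F.l1_loss(pred, gt) + (1 - alpha) * (1.0 - global_ssim(pred, gt))

def cycle_loss(pred_albedo, pred_shading, image):
    """Encourage the product of the predicted layers to reproduce the input image."""
    return F.l1_loss(pred_albedo * pred_shading, image)
```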
III-C3 Gradient loss
We also use image gradients as supervision to help preserve details in the intrinsic images:
(8)

where ∇_x and ∇_y are the image gradients in the x and y directions.
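A corresponding sketch of the gradient loss, using forward finite differences; the L1 penalty is our assumption, since the norm used in Eq. (8) is not reproduced above.

```python
import torch.nn.functional as F

def gradient_loss(pred, gt):
    """Match image gradients in the x and y directions (Eq. (8)) to preserve details."""
    def grads(img):                                     # img: (B, C, H, W)
        gx = img[..., :, 1:] - img[..., :, :-1]         # horizontal differences
        gy = img[..., 1:, :] - img[..., :-1, :]         # vertical differences
        return gx, gy
    pgx, pgy = grads(pred)
    tgx, tgy = grads(gt)
    return F.l1_loss(pgx, tgx) + F.l1_loss(pgy, tgy)
```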
For datasets with ground truth decomposition like the MIT intrinsic dataset and MPI Sintel, the total loss is constructed as:
(9)

where the relative loss weights are set empirically.
III-D Adjustment for sparsely-labelled data
III-D1 Sparse labelling
As well as the densely-labelled datasets mentioned above, recently, the sparsely-labelled dataset IIW has become available, with a larger number of real-world images. To apply our core idea to this situation, we must slightly adjust the training framework, as we first explain, and then give the loss functions measuring the intrinsic image reconstruction quality.

III-D2 Framework adjustment
In order to train on the sparsely-labelled IIW dataset, there are a few barriers to overcome. Figure 3 shows the adjustments made to our framework.
One problem is the lack of dense supervision during the albedo reconstruction procedure. The reconstruction loss in Eq. (7) and the gradient loss in Eq. (8) both need pixel-wise dense supervision. However, the IIW dataset only has sparse annotations of reflectance comparisons at selected points of the images. Therefore, we have to apply alternative constraints utilizing sparse labelling. Specifically, we use the ordinal loss to measure the difference between the output albedo image and the annotated input. Furthermore, smoothness constraints are applied to model the smoothness prior of the albedo component.
A further problem is the lack of shading ground truth, which is necessary in our two-stream training framework. The core idea of the feature distribution divergence is to extract distinctive features corresponding to different target intrinsic layers from the same input image. However, there are no annotations for the shading layer in this dataset. To circumvent the lack of ground truth data, we directly synthesize the target shading image from the input I and the reconstructed albedo Â, using Eq. (1). The synthesized shading image can then be used as dense supervision.
Last but not least, the lack of reference intrinsic images causes problems for computing the feature distribution consistency. The FDC is designed to constrain the intrinsic image to have a similar distribution to the corresponding real reference. In detail, this constraint is achieved by minimizing the feature perceptual loss between the reconstructed and reference images. However, dense ground truth images are not provided in the IIW dataset, making ground truth features unavailable. To solve this problem, we maintain an image pool for each training stream, in which batches of reconstructed intrinsic images are gathered as the guidance for feature perceptual loss.
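The text does not spell out the pooling policy, so the sketch below follows the common image-pool pattern from GAN training: recent reconstructions are stored and occasionally swapped for older ones when queried as references for the feature perceptual loss. The capacity and swap probability are placeholders.

```python
import random

class ImagePool:
    """Pool of recent reconstructions used as stand-in references for the FDC on IIW."""
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.images = []

    def query(self, batch):
        out = []
        for img in batch:
            img = img.detach()
            if len(self.images) < self.capacity:        # still filling the pool
                self.images.append(img)
                out.append(img)
            elif random.random() < 0.5:                  # return (and replace) a stored image
                idx = random.randrange(self.capacity)
                out.append(self.images[idx])
                self.images[idx] = img
            else:                                        # return the fresh image unchanged
                out.append(img)
        return out
```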
III-D3 Training constraints
As noted, we thus modify the constraints to suit the sparse annotations in the IIW dataset. We now describe the ordinal loss and the smoothness constraints used.
We first consider the ordinal loss. Since dense ground truth labels are not available, [12] introduced the weighted human disagreement rate (WHDR) as an error metric. Following [17], we use an ordinal loss based on WHDR as the sparse supervision term.
The ordinal loss is obtained by accumulating all the annotated pairs in the albedo image:
L_ord(Â) = Σ_{(i,j)} e_{i,j}(Â),     (10)

where e_{i,j}(Â) represents the error for a pair of annotated pixels (i, j) in the predicted albedo image Â. A detailed definition is provided in Appendix A.
We next consider smoothness priors: we adopt the same ones as [17]. The smoothness constraints comprise an albedo smoothness constraint and a shading smoothness constraint. The albedo component is constrained using a multi-scale smoothness term, through which the albedo layer reconstruction is encouraged to be piecewise constant. Shading smoothness is constrained using a densely-connected term. Detailed definitions are provided in Appendix A.
III-E Adjustment for video data
The MPI Sintel dataset is composed of short films, so it naturally has temporal consistency. We also investigated a suitable framework for intrinsic decomposition of such video data.
For video intrinsic decomposition, adjacent pairs of frames are input into our two-stream networks to get the corresponding intrinsic layers. In addition to using the single image decomposition framework, optical flow is computed from the sequential input images, and used to provide temporal consistency guidance for the output intrinsic layers.
Optical flow is typically used to construct temporal correspondences between two adjacent frames, assuming that the pixel intensities of an object do not change between consecutive frames, and that neighbouring pixels have similar motion. In this work, the optical flow field of pairs of consecutive input images is obtained directly from the MPI Sintel dataset. Besides the single image intrinsic decomposition loss in Eq. (9), the optical flow is used to enhance the temporal consistency of the output intrinsic layers:
(12)

where A_t(p) denotes pixel p in the albedo image of frame t, and the optical flow map between consecutive frames provides the pixel-wise correspondences. A validity mask records which pixels have a valid optical flow value: it is zero if a pixel is occluded, and one otherwise. Two weights balance the importance of the temporal consistency terms for the albedo and shading layers.
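A sketch of the temporal consistency term, under our assumptions about Eq. (12): the previous frame's prediction is backward-warped to the current frame with the provided flow, and masked L1 differences are penalized for both intrinsic layers. The flow convention (which frame is warped onto which) and the weights `wa`, `ws` are placeholders.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B, C, H, W) with a dense flow field (B, 2, H, W) in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(img).unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow
    grid = torch.stack((2 * coords[:, 0] / (w - 1) - 1,                # normalize to [-1, 1]
                        2 * coords[:, 1] / (h - 1) - 1), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def temporal_loss(a_t, a_prev, s_t, s_prev, flow, mask, wa=1.0, ws=1.0):
    """Masked temporal consistency for albedo and shading (our reading of Eq. (12))."""
    la = (mask * (a_t - warp(a_prev, flow)).abs()).mean()
    ls = (mask * (s_t - warp(s_prev, flow)).abs()).mean()
    return wa * la + ws * ls
```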
IV Intrinsic data refinement
IV-A Basic approach
The MPI Sintel dataset [11] is a publicly-available densely-labelled dataset containing complex indoor and outdoor scenes. It was originally designed for optical flow evaluation. For research into intrinsic image decomposition, ground truth shading images have been rendered with a constant gray albedo considering illumination effects. However, due to the creation process, the original input frames can not be reconstructed from the ground truth albedo and shading layers through Eq. (1).
As the first row of Figure 5 shows later, the specular component of the shading image cannot be observed in the original image, which means it does not share the same illumination condition. Although the simplified image formation model Eq. (1) need not be strictly respected, it is physically incorrect to extract a shading layer depicting different illumination effects from the original image. To overcome this inconsistency, previous works [14] directly resynthesize original images from the ground truth albedo and shading via Eq. (1). However, this approach does not deal with the specular component of the shading layer, which is considered not to be modeled well by Eq. (1) [15].
In this paper, we propose an approach to refine the dataset in order to shift it into a domain more representative of real images. The refined MPI Sintel dataset (MPI_RD) obeys the image formation model in Eq. (1), and the shading layers contain no color information (gray shading). In addition, the shading layers in MPI_RD maintain consistency with the original images in two ways: first, the specular component is removed from the shading layer; second, shape details observed in the original images are preserved in the shading layer. We describe our data refinement procedure in Algorithm 1. In summary, we shift the distribution of the albedo layer to a higher mean value, and then reconstruct the shading layer from the original image and the shifted albedo (steps 1 to 5). Next, invalid pixels in the reconstructed shading layer are filled in using locally linear embedding (LLE) [57], with the input adopted as the guiding image to construct the embedding weights (steps 6 to 7). Finally, the input image is resynthesized from the processed albedo and shading images (step 8).
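Algorithm 1 itself is not reproduced above; the following is only a loose single-frame sketch of the summarized steps, with our own placeholders: a constant mean shift of the albedo (steps 1 to 5), a plain fill-in of invalid shading pixels in place of the guided LLE reconstruction (steps 6 to 7), and resynthesis via Eq. (1) (step 8). The shift amount and the validity test are guesses.

```python
import numpy as np

def refine_frame(image, albedo, shift=0.2, eps=1e-4):
    """Loose sketch of per-frame refinement; `shift` and the fill-in rule are placeholders."""
    # steps 1-5: move the albedo distribution towards a higher mean, then recover a
    # gray shading layer from the original image and the shifted albedo
    albedo_shifted = np.clip(albedo + shift, eps, 1.0)
    shading = image.mean(axis=-1, keepdims=True) / albedo_shifted.mean(axis=-1, keepdims=True)
    # steps 6-7: repair invalid shading pixels (the paper uses guided LLE [57] here;
    # a global mean fill is used below purely to keep the sketch short)
    invalid = (shading < eps) | (shading > 1.0)
    if invalid.any() and (~invalid).any():
        shading[invalid] = shading[~invalid].mean()
    # step 8: resynthesize the input image from the refined layers via Eq. (1)
    resynthesized = albedo_shifted * shading
    return albedo_shifted, shading, resynthesized
```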
IV-B Improvements
Although data refinement Algorithm 1 successfully suppresses inconsistencies between the intrinsic components in the MPI dataset, problems remain in terms of temporal consistency. For example, we can observe intensity jittering in consecutive albedo images, and areas lacking detail in shading images. We now discuss possible causes of these defects, as determined by investigating the dataset, and consequently provide methods to alleviate them.
Jittering effects can be observed in frames with notable statistical changes (mean color changes caused by large dark object occlusion). Areas lacking detail are usually observed in shading images whose source image has low-contrast areas. Note that in the single-image refinement procedure, the validity mask is computed based on the source image and the original albedo and shading images. Invalid pixels (pixel values outside the valid range) are directly truncated, and therefore not used when computing statistics of the albedo image. Such invalid areas in the shading image are then reconstructed using the LLE algorithm. In fact, the areas lacking detail often overlap with the LLE-reconstructed areas, resulting in low-quality reconstruction. In conclusion, jittering effects are mainly due to large dark region changes between frames, while lack of detail is mainly caused by invalid region reconstruction artifacts.
This analysis results in a simple method to deal with those problems. To avoid intensity jittering between consecutive frames, we use optical flow to construct the correspondences between frames, and expand the pixel set for statistical computation. This increases robustness of the image statistics. To overcome lack of detail in shading images, the region reconstruction method is optimized to take temporal correspondences into consideration.
IV-C Temporal consistency measurement
To measure temporal consistency in the sequential data, we use the same video temporal consistency metric (TCM) as [58]:
(13)

where O_t and I_t represent the t-th frames of the output video O and the input video I respectively, and W(·) is the warping function using the optical flow. The TCM of a frame is calculated using the warping error between frames; here, the 2-norm of a matrix is the sum of squares of its elements. Through this equation, the processed video O is encouraged to be temporally consistent in accordance with variations in the input video.
In order to visualize the video processing effects, we record the pixel-wise temporal consistency values in a TCM map, computed as:

(14)
where the TCM score is computed per pixel.

V Results and discussion
V-A Datasets
V-A1 MPI Sintel dataset and our refined version
Sintel is an open source 3D animated short film, which has been published in many formats for various research purposes. For intrinsic image decomposition, the clean pass images and the corresponding albedo and shading layers have been published as the MPI Sintel dataset, containing 18 sequences with a total of 890 frames. As discussed in Section IV, there is severe illumination inconsistency between the input frames and the shading layers in this dataset. Therefore, we provide the refined MPI Sintel dataset as a more suitable dataset for intrinsic image decomposition.
Figure 4(bottom) compares our refined MPI dataset (MPI_RD) to the original MPI dataset (MPI). In the shading layer of the first column, we can see that in our refined shading image, the specularity on the shoulder of the girl is removed, making the shading illumination consistent with the original input image. In the second and third columns, the shading layers from the MPI_RD dataset contain more geometric details than those from the MPI dataset. For instance, the wooden cart’s coarse surface is depicted in the refined shading in the third column, while the original shading from the MPI dataset has a smooth surface. These examples demonstrate that our refined MPI_RD improves consistency between the intrinsic decomposition and the input image. In Figure 4(bottom, right), the mean squared error (MSE) between the input image and the resynthesized image is computed. The MSE for the MPI_RD dataset is significantly smaller than for the MPI dataset, showing that the intrinsic decomposition model Eq. (1) is well respected in the refined dataset.
| Split | Method | MSE albedo | MSE shading | MSE avg | LMSE albedo | LMSE shading | LMSE avg | DSSIM albedo | DSSIM shading | DSSIM avg |
|---|---|---|---|---|---|---|---|---|---|---|
| image split | retinex [10] | 0.0606 | 0.0727 | 0.0667 | 0.0366 | 0.0419 | 0.0393 | 0.2270 | 0.2400 | 0.2335 |
| image split | Barron et al. [7] | 0.0420 | 0.0436 | 0.0428 | 0.0298 | 0.0264 | 0.0281 | 0.2100 | 0.2060 | 0.2080 |
| image split | Chen et al. [28] | 0.0307 | 0.0277 | 0.0292 | 0.0185 | 0.0190 | 0.0188 | 0.1960 | 0.1650 | 0.1805 |
| image split | MSCR [14] | 0.0100 | 0.0092 | 0.0096 | 0.0083 | 0.0085 | 0.0084 | 0.2014 | 0.1505 | 0.1760 |
| image split | Revisiting [16] | 0.0069 | 0.0059 | 0.0064 | 0.0044 | 0.0042 | 0.0043 | 0.1194 | 0.0822 | 0.1008 |
| image split | Ours | 0.0047 | 0.0046 | 0.0047 | 0.0037 | 0.0038 | 0.0038 | 0.0950 | 0.0774 | 0.0862 |
| scene split | MSCR [14] | 0.0190 | 0.0213 | 0.0201 | 0.0129 | 0.0141 | 0.0135 | 0.2056 | 0.1596 | 0.1826 |
| scene split | Revisiting [16] | 0.0189 | 0.0171 | 0.0180 | 0.0122 | 0.0117 | 0.0119 | 0.1645 | 0.1450 | 0.1547 |
| scene split | Ours | 0.0173 | 0.0195 | 0.0184 | 0.0118 | 0.0147 | 0.0133 | 0.1587 | 0.1405 | 0.1496 |
For data augmentation, we randomly resize the input image by a random scale factor and randomly crop a patch from the resized image at each iteration. We also use horizontal flipping in the training phase. When comparing methods, following [16], we evaluate our results on both a scene split and an image split. For the scene split, half of the scenes are used for training and the other half for testing. For the image split, all 890 images are randomly separated into two sets. Evaluation on the scene split is considered more challenging as it requires more generalization capacity.
For video data, the refined version MPI_VRD (video refined data) alleviates illumination jitter and local region inconsistencies. Quantitatively, taking the shading images of the video sleeping_2 as an example, temporal consistency refinement increases the TCM from 0.74 to 0.78.
V-A2 IIW dataset
Intrinsic Images in the Wild (IIW) [12] is a large scale, public dataset of real-world scenes intended for intrinsic image decomposition. It contains 5,230 real images of mostly indoor scenes, combined with a total of 872,161 crowd-sourced annotations of reflectance comparisons between pairs of points sparsely selected throughout the images (on average 100 judgements per image). Following many prior works [13, 41, 39, 16], we split the IIW dataset by placing the first of every five consecutive images sorted by image ID into the test set, and the others into the training set. WHDR from [12] is employed to measure the quality of the reconstructed albedo images.
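The split described above is easy to reproduce; a minimal sketch:

```python
def split_iiw(image_ids):
    """IIW protocol used here: after sorting by image ID, the first of every five
    consecutive images goes to the test set and the remaining four to training."""
    ids = sorted(image_ids)
    test = [i for k, i in enumerate(ids) if k % 5 == 0]
    train = [i for k, i in enumerate(ids) if k % 5 != 0]
    return train, test
```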
For the IIW dataset, our proposed network structure cannot be used directly due to the lack of dense labelling of albedo and shading layers; only sparse, relative reflectance annotations are provided. In order to take advantage of the proposed feature distribution divergence and feature distribution consistency, we modify the network. In detail, the predicted dense albedo is collected into an image pool to describe the distribution of albedo. The shading reconstructed from the original image and predicted albedo is used as dense supervision for shading prediction, and is also collected in an image pool to describe the shading distribution. The weights in Eq. (6) are set empirically.
V-B Comparison to state-of-the-art methods
V-B1 Using MPI Sintel and the refined dataset
As Table I shows, our method achieves the best results on the MPI Sintel dataset using the image split. On the more challenging scene split, our method is competitive with the state of the art, and achieves the best results in 5 out of the 9 cases in the table. We show a group of qualitative results evaluated on the scene split in Figure 5. While the MSCR [14] results are relatively blurred due to the large kernel convolutions and down-sampling, our method provides sharper results comparable to Revisiting [16]. Moreover, our shading layer depicts the shadow areas better than [16].

As explained in Section IV, the MPI Sintel dataset has issues of data consistency between the original input images and the corresponding shading images. Because of the proposed feature distribution divergence, feature distribution consistency and the use of the cycle loss, our method is sensitive to such data inconsistency. Therefore, we compare our method to the state-of-the-art methods on the more challenging scene split of the refined MPI Sintel dataset. As Table II shows, our method achieves the best result, demonstrating the effectiveness of our method and data refinement process.
| Group | Method | MSE albedo | MSE shading | MSE avg | LMSE albedo | LMSE shading | LMSE avg | DSSIM albedo | DSSIM shading | DSSIM avg |
|---|---|---|---|---|---|---|---|---|---|---|
| | MSCR [14] | 0.0222 | 0.0175 | 0.0199 | 0.0151 | 0.0122 | 0.0136 | 0.1803 | 0.1619 | 0.1711 |
| | Revisiting [16] | 0.0196 | 0.0137 | 0.0167 | 0.0146 | 0.0094 | 0.0120 | 0.1651 | 0.1082 | 0.1366 |
| ablation studies | Plain w/o ssim or grad | 0.0189 | 0.0147 | 0.0168 | 0.0137 | 0.0104 | 0.0120 | 0.1560 | 0.1140 | 0.1350 |
| ablation studies | Plain w/o ssim | 0.0169 | 0.0149 | 0.0159 | 0.0117 | 0.0103 | 0.0110 | 0.1530 | 0.1100 | 0.1320 |
| ablation studies | Plain w/o grad | 0.0188 | 0.0142 | 0.0165 | 0.0123 | 0.0095 | 0.0109 | 0.1520 | 0.1090 | 0.1300 |
| ablation studies | Plain | 0.0172 | 0.0147 | 0.0159 | 0.0116 | 0.0097 | 0.0106 | 0.1528 | 0.1085 | 0.1307 |
| ablation studies | Ours w/o FDD | 0.0166 | 0.0134 | 0.0150 | 0.0112 | 0.0090 | 0.0101 | 0.1474 | 0.1048 | 0.1261 |
| ablation studies | Ours w/o FDC | 0.0170 | 0.0130 | 0.0150 | 0.0113 | 0.0089 | 0.0101 | 0.1530 | 0.1070 | 0.1300 |
| ablation studies | Ours | 0.0157 | 0.0126 | 0.0142 | 0.0105 | 0.0087 | 0.0096 | 0.1419 | 0.1015 | 0.1217 |

| Method | MSE albedo | MSE shading | MSE avg | LMSE total |
|---|---|---|---|---|
| Barron et al. [7] | 0.0064 | 0.0098 | 0.0081 | 0.0125 |
| SRIE [35] | 0.0136 | 0.0128 | 0.0132 | 0.0192 |
| Relative Smoothness [33] | 0.0492 | 0.0342 | 0.0417 | 0.0216 |
| Zhou et al. [39] | 0.0252 | 0.0229 | 0.0240 | 0.0319 |
| Shi et al. [15] | 0.0216 | 0.0135 | 0.0175 | 0.0271 |
| MSCR [14] | 0.0207 | 0.0124 | 0.0165 | 0.0239 |
| Revisiting [16] | 0.0134 | 0.0089 | 0.0111 | 0.0203 |
| Ours (scratch) | 0.0134 | 0.0099 | 0.0117 | 0.0186 |
| Ours (MPI) | 0.0126 | 0.0106 | 0.0116 | 0.0175 |
| Ours (RD) | 0.0120 | 0.0095 | 0.0108 | 0.0170 |
To further validate the effectiveness of the proposed method, we also conducted an ablation study on the training losses as well as the network architecture. Results are given in the bottom part of Table II. ‘Plain’ represents the baseline two-stream network structure shown in Figure 2(b). In the ablation study, the gradient loss and SSIM loss are progressively added to train the plain network. The experimental results show that using the gradient loss and SSIM loss simultaneously achieves better intrinsic image decomposition results. The proposed network architecture, as defined in Figure 2(c), is denoted by ‘Ours’. It can be observed that using only the feature distribution divergence (w/o FDC) or only the feature distribution consistency (w/o FDD) does not improve results much, while using both of them results in considerable improvement.
Figure 6 displays a side-by-side comparison with two other methods using the refined dataset MPI_RD. As can be seen, our method better separates shading from albedo information. For example, our method outputs consistent shadow around the girl’s neck.

V-B2 Using the MIT intrinsic dataset
We further experimented on the MIT intrinsic dataset [10], which consists of object-level real images. In this experiment, classical methods [7, 35, 33] as well as learning-based methods [16, 15, 14, 39] were compared to our proposed method. We also conducted an ablation study on the training strategies for our proposed network, including training from scratch (Ours scratch), pre-training on the original MPI Sintel dataset (Ours MPI), and pre-training on the refined MPI Sintel dataset (Ours RD). The experimental results are reported in Table III, and representative instances are selected for visual comparison in Figure 7.
As in [16], we used the 220 images in the dataset. For comparisons to previous methods, the split from [7] was used. Our refined MPI Sintel dataset has grayscale shading images, so we first pre-trained the model on MPI_RD and then fine-tuned it on the MIT training set.
Numerical results are shown in Table III. Our method (Ours RD) achieves the best results in most cases in the table. Moreover, Ours RD performs better than Ours MPI in terms of LMSE, and both pre-training methods perform better than training from scratch, demonstrating that pre-training on the refined MPI Sintel dataset helps intrinsic image decomposition on the MIT dataset.
Qualitative results are illustrated in Figure 7. We can observe that our method (Ours RD) predicts sharp and accurate intrinsic layers. Classical methods may produce meaningful layer separation results, but good results depend on parameter tuning, which requires expert knowledge. Compared to (Ours scratch) and (Ours MPI), (Ours RD) achieves better region consistency with the ground truth.

V-B3 Using the IIW dataset
| Method | WHDR (mean) |
|---|---|
| Baseline (const shading) | 51.37 |
| Baseline (const reflectance) | 36.54 |
| Shen et al. 2011 [4] | 36.90 |
| Retinex (color) [10] | 26.89 |
| Retinex (gray) [10] | 26.84 |
| Garces et al. 2012 [59] | 25.46 |
| Zhao et al. 2012 [6] | 23.20 |
| L1 flattening [32] | 20.94 |
| Bell et al. 2014 [12] | 20.64 |
| Zhou et al. 2015 [39] | 19.95 |
| Nestmeyer et al. 2017 (CNN) [41] | 19.49 |
| Zoran et al. 2015* [40] | 17.85 |
| Nestmeyer et al. 2017 [41] | 17.69 |
| Bi et al. 2015 [32] | 17.67 |
| CGIntrinsic [17] | 14.80 |
| Revisiting [16] | 14.45 |
| Ours | 13.60 |
In Table IV, we report results obtained using the test set of the IIW dataset. Our proposed method achieves the best performance, with a mean WHDR of 13.60, a considerable improvement over the second best [16], which has a mean WHDR of 14.45. To better show the quality of our results, we give a qualitative comparison to state-of-the-art methods in Figure 8. Detailed intrinsic decomposition results are shown in the close-up windows. It can be observed that our method successfully preserves the texture of the floor tiles in the albedo layer, while the other approaches treat such texture as shading.

To further investigate the influence of the smoothness priors applied in the sparsely-labelled case, a comparison was conducted. Figure 9 shows the effects of changing the weights of the smoothness terms in Eq. (11): the albedo smoothness weight and the shading smoothness weight were each varied from 0 to 4. Results show that using the shading smoothness term increases the reconstruction loss, while using the albedo smoothness term decreases it; using both simultaneously provides the best result. In Figure 9(c, d, e), three representative parameter settings are used to show the resulting intrinsic decomposition. The albedo image in Figure 9(e) contains more precise contours and consistent region colors.
V-B4 Intrinsic decomposition of video data
In this experiment, we evaluated the proposed intrinsic decomposition method on video data with respect to reconstruction quality of the intrinsic images, and to temporal consistency, through comparisons to alternative methods.
Framewise methods used for comparison include MSCR [14] and Revisiting [16]. The blind video temporal consistency method [52] is an unsupervised video smoothing method. In the experiment (Ours+DVP), it is applied as post-processing to increase the temporal consistency of the results produced by our framewise method (Ours). (Ours+Flow) is the proposed intrinsic decomposition method for video data. The temporal consistency constraint is applied by taking optical flow as input to construct correspondences between adjacent frames. We also extend the temporal consistency constraint to MSCR in the same way as for (Ours+Flow) in (MSCR+Flow). All methods were trained and tested on the MPI_VRD dataset.
The intrinsic decomposition accuracy scores of framewise methods and flow methods are shown in Table V, while the temporal consistency scores for 8 test short videos are shown in Table VI. Using adjacent frame temporal consistency as a constraint, (Ours+Flow) and (MSCR+Flow) achieve much better temporal consistency metric scores than their framewise counterparts, while the intrinsic decomposition accuracy scores remain almost unchanged. This demonstrates the effectiveness of our proposed extension method for video data. As a result, (Ours+Flow) achieves the best average accuracy and temporal consistency for intrinsic decomposition of video data.
We also visualize temporal consistency of a specific frame clipped from a video, using a temporal inconsistency metric (TICM) map to highlight temporally inconsistent areas. First, the TCM map is computed using Eq. (14). Then, the TCM map is smoothed by a Gaussian kernel of size 65. Finally, the Jet colormap is inverted to highlight inconsistent areas with warm colors (the colder the color, the better the consistency). A qualitative comparison of different intrinsic decomposition methods on video data is shown in Figure 10. It can be observed that, using the temporal consistency constraint, both (MSCR+Flow) and (Ours+Flow) achieve better temporal consistency compared to their framewise counterparts (MSCR and Ours).
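The visualization step can be sketched as follows, assuming `tcm_map` is the per-pixel TCM map of Eq. (14); the Gaussian sigma is a rough stand-in for the kernel of size 65 mentioned above.

```python
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter

def show_ticm(tcm_map):
    """Smooth the per-pixel TCM map and display it with an inverted Jet colormap,
    so that warm colors mark temporally inconsistent areas."""
    smoothed = gaussian_filter(tcm_map, sigma=65 / 6.0)     # approximates a 65-pixel kernel
    smoothed = smoothed / (smoothed.max() + 1e-8)           # normalize to [0, 1]
    plt.imshow(1.0 - smoothed, cmap="jet")                  # invert: high TCM -> cold color
    plt.axis("off")
    plt.colorbar()
    plt.show()
```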

In Figure 11, temporal consistency loss is added to framewise methods (MSCR) and (Ours). In the (Ours+Flow) result, temporally inconsistent areas are suppressed compared to (Ours). However, in the (MSCR+Flow) result, temporal inconsistency is unexpectedly amplified in the highlighted area in the red box. In addition, the (MSCR+Flow) result is more blurred than the one without temporal consistency loss. In MSCR, multiple scales of feature maps are merged in the encoding phase. While features from deeper layers contain high-level knowledge, they may lack texture details. The temporal consistency constraint may encourage MSCR to pay more attention to high-level features, resulting in more blurred outputs. The case of (MSCR+Flow) indicates that while temporal consistency loss could be easily extended to other deep neural networks, the outcome will largely depend on the characteristics of specific methods.
| Group | Method | MSE albedo | MSE shading | MSE avg | LMSE albedo | LMSE shading | LMSE avg | DSSIM albedo | DSSIM shading | DSSIM avg |
|---|---|---|---|---|---|---|---|---|---|---|
| framewise | MSCR [14] | 0.0223 | 0.0181 | 0.0202 | 0.0154 | 0.0127 | 0.0140 | 0.1880 | 0.1609 | 0.1745 |
| framewise | Revisiting [16] | 0.0199 | 0.0137 | 0.0168 | 0.0148 | 0.0094 | 0.0121 | 0.1655 | 0.1095 | 0.1375 |
| framewise | Ours | 0.0173 | 0.0163 | 0.0168 | 0.0120 | 0.0115 | 0.0117 | 0.1472 | 0.1123 | 0.1298 |
| flow | MSCR+Flow | 0.0227 | 0.0184 | 0.0206 | 0.0150 | 0.0123 | 0.0136 | 0.1950 | 0.1727 | 0.1839 |
| flow | Ours+DVP [52] | 0.0169 | 0.0165 | 0.0167 | 0.0117 | 0.0114 | 0.0116 | 0.1549 | 0.1204 | 0.1377 |
| flow | Ours+Flow | 0.0173 | 0.0152 | 0.0163 | 0.0117 | 0.0102 | 0.0109 | 0.1423 | 0.1054 | 0.1238 |
| Group | Method | A | B | C | D | E | F | G | H | avg |
|---|---|---|---|---|---|---|---|---|---|---|
| framewise | MSCR [14] | 48.18 | 62.30 | 38.70 | 31.80 | 5.49 | 52.36 | 86.50 | 31.47 | 44.60 |
| framewise | Revisiting [16] | 8.32 | 35.48 | 18.06 | 25.44 | 5.03 | 68.92 | 19.37 | 39.41 | 27.50 |
| framewise | Ours | 0.12 | 27.61 | 6.55 | 4.57 | 2.04 | 88.99 | 12.80 | 40.05 | 22.84 |
| flow | MSCR+Flow | 68.69 | 56.42 | 50.50 | 56.76 | 7.87 | 48.27 | 60.34 | 31.78 | 47.58 |
| flow | Ours+DVP [52] | 78.93 | 47.52 | 28.72 | 10.19 | 22.71 | 51.89 | 72.33 | 54.06 | 45.79 |
| flow | Ours+Flow | 84.22 | 70.91 | 78.33 | 60.32 | 6.42 | 51.64 | 54.05 | 45.12 | 56.37 |

VI Conclusions
In this paper, we have presented a novel two-stream encoder-decoder network for intrinsic image decomposition. Our method is able to exploit discriminative properties of the features for different intrinsic images. Specifically, our feature distribution divergence is designed to increase the distance between features corresponding to different intrinsic images, and our feature perceptual loss is applied to constrain the feature distribution. These two modules work together to encode discriminative features for intrinsic image decomposition. We have also provided an algorithm to refine the MPI Sintel dataset to make it more suitable for intrinsic image decomposition. Visual results for MPI_RD and the more challenging IIW dataset demonstrate that our proposed method can achieve results with better albedo and shading separation than existing methods. Its extension to video data is able to decompose video into intrinsic image sequences with temporal consistency.
The limitations of our method are of two kinds. Firstly, the mathematical model of intrinsic decomposition applied in this work is relatively preliminary: we cannot recover scene properties such as 3D geometry, light source positions, or global illumination. It is a worthwhile challenge to exploit feature discrimination properties in more complex and powerful mathematical models. Secondly, the temporal consistency constraint used to extend our method to intrinsic video decomposition does not generalize well. As Figure 11 shows, directly using the temporal consistency loss in MSCR can result in unwanted blurring. In future, it is of interest to derive more general temporal consistency constraints for intrinsic video decomposition.
Acknowledgements. Portions of this work were presented at the International Conference on Computer Vision Workshops in 2019 [18]. This work was supported by the National Natural Science Foundation of China (NSFC) (Grants 61972012, 61732016).
Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Appendix A Constraints for sparsely-labelled data
A-A Ordinal loss
For each pair of annotated pixels (i, j) in the predicted albedo image Â, we have the error function:

(15)

where the relative reflectance (albedo) judgement from IIW is set to 1, 0, or −1 depending on whether the relative brightness of pixel i is greater than, the same as, or lower than that of pixel j.
A-B Smoothness priors
The albedo component is constrained using a multi-scale smoothness term:
(16)

in which the neighbourhood term ranges over the 8-connected neighborhood of a pixel at a given position and scale. Each pair of albedo pixels is weighted according to their similarity: the greater the similarity of the two pixels, the smaller the weight, making the pairwise term less significant. The similarity is computed from a feature vector containing the spatial position, the image intensity, and the first two elements of chromaticity, together with a covariance matrix that defines the distance between two feature vectors. This albedo smoothness term encourages the reconstructed albedo layer to be piecewise constant.
The shading smoothness is formulated using a densely-connected term:
(17)

where a bi-stochastic weight matrix, derived as detailed in [23, 60], measures the positional difference of a pair of pixels in the image, giving greater weight to nearby pixels.
Appendix B Further results
Further qualitative comparisons are shown in Figure 12 (using the refined MPI Sintel dataset) and Figure 13 (using the IIW dataset).


References
- [1] Baslamisli AS, Le HA, Gevers T. CNN based learning using reflection and retinex models for intrinsic image decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, 6674–6683.
- [2] Land EH, McCann JJ. Lightness and retinex theory. JOSA, 1971, 61(1): 1–11.
- [3] Rother C, Kiefel M, Zhang L, Schölkopf B, Gehler PV. Recovering intrinsic images with a global sparsity prior on reflectance. In Advances in neural information processing systems, 2011, 765–773.
- [4] Shen L, Yeo C. Intrinsic images decomposition using a local and global sparse representation of reflectance. In CVPR 2011, 2011, 697–704.
- [5] Shen L, Tan P, Lin S. Intrinsic image decomposition with non-local texture cues. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008, 1–7.
- [6] Zhao Q, Tan P, Dai Q, Shen L, Wu E, Lin S. A closed-form solution to retinex with nonlocal texture constraints. IEEE transactions on pattern analysis and machine intelligence, 2012, 34(7): 1437–1444.
- [7] Barron JT, Malik J. Shape, illumination, and reflectance from shading. IEEE transactions on pattern analysis and machine intelligence, 2014, 37(8): 1670–1687.
- [8] Bousseau A, Paris S, Durand F. User-Assisted Intrinsic Images. In ACM SIGGRAPH Asia 2009 Papers, SIGGRAPH Asia ’09, 2009, 130–140.
- [9] Shen J, Yang X, Li X, Jia Y. Intrinsic image decomposition using optimization and user scribbles. IEEE transactions on cybernetics, 2013, 43(2): 425–436.
- [10] Grosse R, Johnson MK, Adelson EH, Freeman WT. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In 2009 IEEE 12th International Conference on Computer Vision, 2009, 2335–2342.
- [11] Butler DJ, Wulff J, Stanley GB, Black MJ. A naturalistic open source movie for optical flow evaluation. In A Fitzgibbon et al (Eds), editor, European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, 2012, 611–625.
- [12] Bell S, Bala K, Snavely N. Intrinsic images in the wild. ACM Transactions on Graphics (TOG), 2014, 33(4): 159.
- [13] Narihira T, Maire M, Yu SX. Learning lightness from human judgement on relative reflectance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, 2965–2973.
- [14] Narihira T, Maire M, Yu SX. Direct Intrinsics: Learning Albedo-Shading Decomposition by Convolutional Regression. In International Conference on Computer Vision (ICCV), 2015, 2992–2992.
- [15] Shi J, Dong Y, Su H, Yu SX. Learning non-lambertian object intrinsics across shapenet categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, 1685–1694.
- [16] Fan Q, Yang J, Hua G, Chen B, Wipf D. Revisiting deep intrinsic image decompositions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, 8944–8952.
- [17] Li Z, Snavely N. Cgintrinsics: Better intrinsic image decomposition through physically-based rendering. In Proceedings of the European Conference on Computer Vision (ECCV), 2018, 371–387.
- [18] Wang Z, Lu F. Single image intrinsic decomposition with discriminative feature encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, 0–0.
- [19] Bonneel N, Kovacs B, Paris S, Bala K. Intrinsic Decompositions for Image Editing. Computer Graphics Forum (Eurographics State of the Art Reports 2017), 2017, 36(2).
- [20] Weiss Y. Deriving intrinsic images from image sequences. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, 2001, 68–75.
- [21] Matsushita Y, Lin S, Kang SB, Shum HY. Estimating intrinsic images from image sequences with biased illumination. In European Conference on Computer Vision, 2004, 274–286.
- [22] Laffont PY, Bazin JC. Intrinsic decomposition of image sequences from local temporal variations. In Proceedings of the IEEE International Conference on Computer Vision, 2015, 433–441.
- [23] Li Z, Snavely N. Learning intrinsic image decomposition from watching the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, 9039–9048.
- [24] Lettry L, Vanhoey K, Van Gool L. Unsupervised Deep Single-Image Intrinsic Decomposition using Illumination-Varying Image Sequences. Computer Graphics Forum, 2018, 37(7): 409–419.
- [25] Gong W, Xu W, Wu L, Xie X, Cheng Z. Intrinsic image sequence decomposition using low-rank sparse model. IEEE Access, 2018, 7: 4024–4030.
- [26] Liu Y, Lu F. Separate in Latent Space: Unsupervised Single Image Layer Separation. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(07): 11661–11668.
- [27] Barron JT, Malik J. Intrinsic scene properties from a single rgb-d image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, 17–24.
- [28] Chen Q, Koltun V. A simple model for intrinsic image decomposition with depth cues. In Proceedings of the IEEE International Conference on Computer Vision, 2013, 241–248.
- [29] Lee KJ, Zhao Q, Tong X, Gong M, Izadi S, Lee SU, Tan P, Lin S. Estimation of intrinsic image sequences from image+ depth video. In European Conference on Computer Vision, 2012, 327–340.
- [30] Kim S, Park K, Sohn K, Lin S. Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In European conference on computer vision, 2016, 143–159.
- [31] Cheng Z, Zheng Y, You S, Sato I. Non-local intrinsic decomposition with near-infrared priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, 2521–2530.
- [32] Bi S, Han X, Yu Y. An L1 Image Transform for Edge-Preserving Smoothing and Scene-Level Intrinsic Decomposition. ACM Trans. Graph., 2015, 34(4).
- [33] Li Y, Brown MS. Single image layer separation using relative smoothness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, 2752–2759.
- [34] Sheng B, Li P, Jin Y, Tan P, Lee TY. Intrinsic image decomposition with step and drift shading separation. IEEE Transactions on Visualization and Computer Graphics, 2018, 26(2): 1332–1346.
- [35] Fu X, Zeng D, Huang Y, Zhang XP, Ding X. A weighted variational model for simultaneous reflectance and illumination estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, 2782–2790.
- [36] Fu G, Zhang Q, Xiao C. Towards high-quality intrinsic images in the wild. In 2019 IEEE International Conference on Multimedia and Expo (ICME), 2019, 175–180.
- [37] Krebs A, Benezeth Y, Marzani F. Intrinsic image decomposition as two independent deconvolution problems. Signal Processing: Image Communication, 2020, 86: 115872.
- [38] Tang Y, Salakhutdinov R, Hinton G. Deep Lambertian Networks. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, 2012, 1419–1426.
- [39] Zhou T, Krahenbuhl P, Efros AA. Learning data-driven reflectance priors for intrinsic image decomposition. In Proceedings of the IEEE International Conference on Computer Vision, 2015, 3469–3477.
- [40] Zoran D, Isola P, Krishnan D, Freeman WT. Learning ordinal relationships for mid-level vision. In Proceedings of the IEEE International Conference on Computer Vision, 2015, 388–396.
- [41] Nestmeyer T, Gehler PV. Reflectance adaptive filtering improves intrinsic image estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, 6789–6798.
- [42] Fu G, Zhang Q, Zhu L, Li P, Xiao C. A Multi-Task Network for Joint Specular Highlight Detection and Removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, 7752–7761.
- [43] Seo K, Kinoshita Y, Kiya H. Deep Retinex Network for Estimating Illumination Colors with Self-Supervised Learning. In 2021 IEEE 3rd Global Conference on Life Sciences and Technologies (LifeTech), 2021, 1–5.
- [44] Baslamisli AS, Liu Y, Karaoglu S, Gevers T. Physics-based shading reconstruction for intrinsic image decomposition. Computer Vision and Image Understanding, 2021, 205: 103183.
- [45] Baslamisli AS, Das P, Le HA, Karaoglu S, Gevers T. ShadingNet: image intrinsics by fine-grained shading decomposition. International Journal of Computer Vision, 2021: 1–29.
- [46] Zhu Y, Tang J, Li S, Shi B. DeRenderNet: Intrinsic Image Decomposition of Urban Scenes with Shape-(In)dependent Shading Rendering. In 2021 IEEE International Conference on Computational Photography (ICCP), 2021, 1–11.
- [47] Sial HA, Baldrich R, Vanrell M. Deep intrinsic decomposition trained on surreal scenes yet with realistic light effects. JOSA A, 2020, 37(1): 1–15.
- [48] Chang AX, Funkhouser T, Guibas L, Hanrahan P, Huang Q, Li Z, Savarese S, Savva M, Song S, Su H, et al.. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
- [49] Kong N, Gehler PV, Black MJ. Intrinsic video. In European Conference on Computer Vision, 2014, 360–375.
- [50] Ye G, Garces E, Liu Y, Dai Q, Gutierrez D. Intrinsic video and applications. ACM Transactions on Graphics (ToG), 2014, 33(4): 1–11.
- [51] Meka A, Zollhöfer M, Richardt C, Theobalt C. Live intrinsic video. ACM Transactions on Graphics (TOG), 2016, 35(4): 1–14.
- [52] Lei C, Xing Y, Chen Q. Blind video temporal consistency via deep video prior. Advances in Neural Information Processing Systems, 2020, 33.
- [53] Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, 2014, abs/1409.1556.
- [54] Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 2009, 248–255.
- [55] Johnson J, Alahi A, Fei-Fei L. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, 2016, 694–711.
- [56] Wang Z, Bovik AC, Sheikh HR, Simoncelli EP, et al.. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 2004, 13(4): 600–612.
- [57] Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. science, 2000, 290(5500): 2323–2326.
- [58] Yao CH, Chang CY, Chien SY. Occlusion-aware video temporal consistency. In Proceedings of the 25th ACM international conference on Multimedia, 2017, 777–785.
- [59] Garces E, Muñoz A, Lopez-Moreno J, Gutiérrez D. Intrinsic Images by Clustering. Computer Graphics Forum, 2012, 31(4): 1415–1424.
- [60] Barron JT, Adams A, Shih Y, Hernández C. Fast bilateral-space stereo for synthetic defocus. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, 4466–4474.
Zongji Wang received his B.S. degree in the School of Mathematics and Systems Science at Beihang University in 2014, and his Ph.D. degree in the School of Computer Science and Engineering at Beihang University in 2021. He is currently an Assistant Professor in the Key Laboratory of Network Information Systems Technology (NIST), Aerospace Information Research Institute, Chinese Academy of Sciences. His research interests include computer vision and computer graphics.
Yunfei Liu is currently pursuing a Ph.D. degree in the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University. His research interests include computer vision, computational photography, and image processing.
Feng Lu received his B.S. and M.S. degrees in automation from Tsinghua University, in 2007 and 2010, respectively, and his Ph.D. degree in information science and technology from the University of Tokyo, in 2013. He is currently a Professor with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University. His research interests include computer vision, human-computer interaction and augmented intelligence.