
¹ Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea
² SK Planet Co., Ltd., Seongnam 13487, Republic of Korea
Email: sungho369@skku.edu, yeonghyeon@sk.com, mjss016@skku.edu, jhyi@skku.edu

Scene Depth Estimation from Traditional Oriental Landscape Paintings

Sungho Kang¹ (0009-0008-5829-4503), YeongHyeon Park¹,² (0000-0002-5686-5543), Hyunkyu Park¹ (0009-0003-4889-1498), Juneho Yi¹ (0000-0002-9181-4784)
Abstract

Scene depth estimation from paintings can streamline the process of 3D sculpture creation so that visually impaired people can appreciate paintings with their tactile sense. However, measuring the depth of oriental landscape painting images is extremely challenging due to their unique methods of depicting depth and their poor preservation. To address the problem of scene depth estimation from oriental landscape painting images, we propose a novel framework that consists of a two-step Image-to-Image translation method with CLIP-based image matching at the front end to predict the real scene image that best matches the given oriental landscape painting image. We then employ a pre-trained SOTA depth estimation model on the generated real scene image. In the first step, CycleGAN converts an oriental landscape painting image into a pseudo-real scene image. We utilize CLIP to semantically match landscape photo images with oriental landscape painting images for training CycleGAN in an unsupervised manner. Then, the pseudo-real scene image and the oriental landscape painting image are fed into DiffuseIT to predict a final real scene image in the second step. Finally, we measure the depth of the generated real scene image using a pre-trained depth estimation model such as MiDaS. Experimental results show that our approach performs well enough to predict real scene images corresponding to oriental landscape painting images. To the best of our knowledge, this is the first study to measure the depth of oriental landscape painting images. Our research can potentially assist visually impaired people in experiencing paintings in diverse ways. We will release our code and resulting dataset.

Keywords:
CLIP-based image matching · Oriental landscape painting · Monocular depth estimation · Image-to-Image translation · 3D sculpture
Figure 1: An overview of our method. Direct application of pre-trained SOTA depth estimation models to the given oriental landscape painting image, $I_{ori}$, ends up with a meaningless depth map, ‘Depth map of $I_{ori}$’. We propose a novel framework that consists of CLIP-based image matching and I2I translation in two steps. Through CLIP-based image matching, we initially obtain pairs of an oriental landscape painting image and its corresponding landscape photo image. For the two-step I2I translation, we employ CycleGAN [63] and DiffuseIT [32]. In the first step, CycleGAN [63], trained on the CLIP-matched dataset, translates an oriental landscape painting image into a pseudo-real scene image. To create a real scene image, the generated pseudo-real scene image and the given oriental landscape painting image are fed into DiffuseIT [32] in the second step. The produced real scene image enables measuring depth using a pre-trained depth estimation model such as MiDaS [48]. The measured depth map can be directly applied to craft 3D sculptures.

1 Introduction

The advancement of media art technology enables visually impaired people to appreciate paintings through tactile sensation by crafting paintings as 3D sculptures [45]. Motivated by this, we aim to find a way to produce depth maps specifically for traditional oriental landscape paintings, in an attempt to automate the process of crafting such paintings as 3D sculptures. Using the estimated depth maps, accurate 3D sculptures of the artwork can be created in an efficient manner with the help of a 3D printer.

Figure 2: Examples of oriental landscape paintings from museums [3, 5, 1]. Oriental landscape paintings have been expressed in various styles due to their long history [34, 27]. A unique technique called the ‘three-way method’ is employed to create a sense of perspective [6, 28, 24, 16]. However, oriental landscape paintings often lack consistency in portraying objects due to the use of multiple types of the ‘three-way method’ in a single painting. Moreover, oriental landscape paintings often contain empty spaces to depict depth [29, 55]. Additionally, oriental landscape paintings exhibit poor preservation conditions.

Depth estimation for oriental landscape painting images is a very challenging problem, whereas pre-trained state-of-the-art (SOTA) depth estimation models can measure depth reasonably well in some western landscape painting images. Through the study of numerous oriental art papers [35, 6, 26, 28, 55, 24, 16, 30, 34, 27, 29, 42], we have found why direct application of pre-trained SOTA depth estimation methods to oriental landscape painting images does not work. First, as illustrated in Fig. 2, oriental landscape paintings use a unique method of depicting depth, called the ‘three-way method’, that does not follow the rules of perspective. This is a particular way of creating a sense of perspective by representing objects from specific viewpoints [6, 28, 24, 16]. Also, painters often use large empty spaces to emphasize the semantic relationship between objects as a way to depict depth in oriental landscape paintings [29, 55]. Second, oriental landscape paintings have a long historical background, profoundly influenced and shaped by various nations including China, Japan, and Korea. This has produced a multitude of stylistic variations [34, 27]. Third, numerous oriental landscape paintings suffer from poor preservation, with blurry edges and weak contrast.

Thus, direct application of pre-trained SOTA depth estimation models to traditional oriental landscape painting images is likely to end up with meaningless depth maps, as depicted in Fig. 1, because those depth estimation models are trained on real scene images, which have a huge domain difference from oriental landscape painting images.

Our strategy to achieve the best possible scene depth estimation is to first predict real scene images that best match the given oriental landscape painting images, and then utilize a pre-trained SOTA depth estimation model to conduct depth estimation for the generated real scene images. The process of predicting a real scene image consists of two Image-to-Image (I2I) translation steps in order to produce a high-quality real scene image corresponding to the given oriental landscape painting image for depth estimation. That is, in the first step, we employ CycleGAN [63] to translate an oriental landscape painting image into a pseudo-real scene image that serves as a reference image for DiffuseIT [32]. In the second step, the generated pseudo-real scene image and the oriental landscape painting image are fed into DiffuseIT [32] to create a high-quality real scene image from which depth can be estimated using a pre-trained SOTA depth estimation model. The reason for using two I2I translation steps is that, while diffusion-based I2I translation models show excellent performance in producing high-quality translations, training a diffusion-based I2I translation model on two image domains with a huge domain gap does not work well. Furthermore, if the images of the two domains are structurally different, the output undergoes significant distortion during translation. Therefore, to obtain high-quality images from a pre-trained diffusion-based I2I translation model, it is necessary to first produce pseudo-real scene images that are structurally similar to the oriental landscape painting images.

More importantly, since actual location photo images matched with oriental landscape painting images are not available[35, 26], we employ CLIP-based image matching to get pairs of oriental landscape painting images and corresponding landscape photo images for training CycleGAN[63] to predict pseudo-real scene images for the given oriental landscape painting images. For this, we semantically match oriental landscape painting images with landscape photo images by using a pre-defined dictionary. To build the pre-defined dictionary, we have carefully selected objects that frequently appear in oriental landscape paintings by referring to museum collections[3, 5, 4, 1, 2] and oriental art papers[30, 42].

Our method can be summarized as follows. We first build a CLIP-matched dataset by matching oriental landscape painting images from the Chinese landscape dataset [58] with semantically similar landscape photo images from the LHQ dataset [51], based on the pre-defined dictionary using CLIP [46]. When creating the CLIP-matched dataset, a 1-to-$K$ matching manner is employed that matches a given oriental landscape painting image with the $K$ most similar landscape photo images, instead of a 1-to-1 matching manner. This reduces the impact on performance when CLIP [46] incorrectly matches oriental landscape painting images with landscape photo images. CycleGAN [63], trained on the CLIP-matched dataset, converts oriental landscape painting images into pseudo-real scene images. Then, the oriental landscape painting images and the pseudo-real scene images are fed into the pre-trained DiffuseIT [32] to create real scene images that are structurally similar to the oriental landscape painting images. Finally, the depth of the real scene images is measured using pre-trained MiDaS [48]. MiDaS [48] is one of the SOTA depth estimation models; it is trained on a variety of datasets and shows an outstanding generalization performance that allows it to produce plausible depth maps of paintings with high probability. The resulting depth maps of real scene images can be directly used in crafting 3D sculptures, which can assist visually impaired people in appreciating paintings with their touch. The experimental results show that our method outperforms previous I2I translation methods in predicting real scene images corresponding to oriental landscape painting images. Accordingly, the predicted real scene images enable MiDaS [48] to measure depth in a satisfactory manner. To the best of our knowledge, this is the first study to measure the depth of oriental landscape painting images.

Our contributions are as follows.

  • We introduce a novel framework that converts oriental landscape painting images into real scene images using I2I translation models in two steps, producing the best possible high-quality real scene images for depth estimation.

  • We propose a CLIP-based image matching method to initially obtain pairs of oriental landscape painting images and their corresponding landscape photo images. For this, we have built a pre-defined dictionary by selecting objects that frequently appear in oriental landscape paintings, referring to museum collections and oriental art papers. This strategy gives a new perspective on an otherwise very difficult problem with no image pairs.

2 Related work

2.1 Depth estimation in painting

Various depth estimation methods have been proposed and have shown satisfactory performance in data domains such as synthetic data [17, 64], multi-view cameras [12], and LiDAR [22]. In particular, MiDaS [48] showed excellent generalization performance, allowing the network to estimate depth in images completely unrelated to its training datasets, such as paintings. Ranftl et al. [47] changed the architecture of MiDaS [48] to a transformer-based encoder-decoder architecture to achieve better performance. Birkl et al. [9] presented a variety of new MiDaS [48] models based on different pre-trained vision transformers. Bhat et al. [7] and Miangoleh et al. [41] produced more detailed depth maps based on the approach of MiDaS [48].

On the other hand, there have been several attempts to leverage deep learning models to measure the depth of painting images. Jiang et al. [25] utilized a depth estimation model [62] to gauge the depth in western painting images. Lin et al. [37] introduced a user-interactive framework to manually generate depth maps of painting images and photo images. Bhattacharjee et al. [8] proposed a novel framework that estimates the depth of comics by transforming them into real scenes. Pauls et al. [45] used depth maps of painting images to enhance the experience of museum visits. Ehret [14] reported that although DPT [47] and MiDaS [48] were not explicitly trained for measuring the depth of painting images, they produced reasonable depth maps owing to their high generalization performance.

Previous depth estimation works on painting images only attempted to measure the depth of western landscape painting images, which are somewhat similar to landscape photo images. We have observed that SOTA depth estimation models such as MiDaS [48] cannot generate satisfactory depth maps for oriental landscape painting images due to the characteristics mentioned earlier. We take the strategy of first converting oriental landscape painting images into real scene images and then measuring the depth of the real scene images using MiDaS [48].

2.2 Image-to-Image translation from painting to photo

I2I translation is a popular solution for transforming an image in domain A into an image in domain B. Tomei et al. [53] proposed a method to convert painting images into photo images. This method employed a segmentation model [21] to divide painting images and photo images into patches and then match them, showing outstanding translation performance. Despite its good performance, it cannot be applied to converting oriental landscape painting images to real scene images due to the huge domain gap between them.

Pix2Pix [23] was the first to apply a conditional GAN to the I2I translation task and showed excellent performance. On the other hand, CycleGAN [63] was proposed as an unsupervised approach to I2I translation. Park et al. [43] presented a method, dubbed CUT, that maximizes mutual information based on contrastive learning. Liu et al. [38] made a shared-latent-space assumption and proposed a framework called UNIT. Lee et al. [33] introduced a disentangled representation framework with a cross-cycle consistency loss. Recently, Chen et al. [11] conducted I2I translation based on VQGAN [15], which gives outstanding performance in image generation.

Although GAN-based models perform well in I2I translation tasks, Ho et al. [20] successfully applied the diffusion process to neural networks, which has yielded many diffusion-based I2I models that translate images with high quality. Kwon et al. [32] proposed diffusion-based I2I and Text-to-Image (T2I) translation methods using disentangled style and content representations. Saharia et al. [49] implemented diffusion models for various tasks, including I2I translation, without task-specific hyperparameter tuning. Su et al. [52] proposed diffusion-based I2I translation in a latent encoding and defined it via ordinary differential equations. Li et al. [36] successfully applied the stochastic Brownian bridge process to diffusion, translating between two domains directly through a bidirectional diffusion process.

GAN-based I2I translation models cannot transform oriental landscape painting images into landscape photo images as realistically as diffusion-based models do. However, diffusion-based models are not directly applicable to our problem because they drastically modify the structural information of an image during translation when the input images are structurally different. Therefore, to overcome these limitations, we employ a GAN-based model in the first I2I step and a diffusion-based model in the second I2I step.

2.3 CLIP-based image matching

Radford et al. [46] proposed a contrastive learning-based Image-to-Text (I2T) matching model called CLIP, which allows zero-shot prediction. When CLIP [46] receives an input image and a set of texts, it finds the text that best matches the image in a multi-modal embedding space. Diverse CLIP-based methods have been proposed for image retrieval [50, 57], style transfer [39, 31], image generation [31, 44, 10], classification [13, 59], anomaly detection [60], and image aesthetic assessment [18, 56]. Vinker et al. [54] introduced a technique for converting images into sketches by employing diverse types and multiple levels of abstraction. Hessel et al. [19] proposed a novel metric for robust automatic evaluation of image captioning. In particular, Materzyńska et al. [40] found that CLIP [46] has a strong ability to match words with natural images of scenes.

To generate more plausible real scene images, we build a CLIP-matched dataset consisting of pairs of oriental landscape painting images and landscape photo images, since there are no actual location photo images that correspond to oriental landscape painting images. We opt to match semantically similar landscape photo images to a given oriental landscape painting image. For this, we utilize CLIP [46] to build the CLIP-matched dataset because CLIP [46] can handle matching tasks in unseen image domains, such as matching oriental landscape painting images to landscape photo images.

Figure 3: An image matching method using CLIP [46]. We employ CLIP [46] to semantically match oriental landscape painting images with landscape photo images. To build a pre-defined dictionary for CLIP [46], we carefully selected frequently appearing objects in oriental landscape paintings by referencing papers [30, 42] and collections from museums [3, 5, 4, 1, 2]. The pre-defined dictionary is fed into the text encoder of CLIP [46], while oriental landscape painting images are input into the image encoder to measure similarity. Simultaneously, all landscape photo images in the LHQ dataset [51] are fed into another image encoder for measuring similarity. By comparing the similarities, we match an oriental landscape painting image with the top-$K$ landscape photo images most similar to it. The matched top-$K$ landscape photo images are used to create the CLIP-matched dataset for training CycleGAN [63] to generate more plausible pseudo-real scene images.

3 Method

As depicted in Fig. 1, our method exploits CLIP-based image matching and two-step I2I translation to predict a high-quality real scene image corresponding to a given oriental landscape painting image. For the two-step I2I translation, CycleGAN [63] and DiffuseIT [32] are sequentially employed. In the first step, we utilize CycleGAN [63] to obtain a pseudo-real scene image that is structurally similar to the given oriental landscape painting image. In the second step, DiffuseIT [32] predicts a high-quality real scene image from the generated pseudo-real scene image and the given oriental landscape painting image. The predicted real scene image is then used for depth estimation by pre-trained MiDaS [48].

3.1 CLIP-matched dataset

Since oriental landscape painting images, $I_{ori}$, and landscape photo images, $I_{photo}$, are in different domains, direct I2I matching in the embedding space does not perform well. Furthermore, actual location photo images that match the oriental landscape painting images are not available. Thus, we utilize CLIP [46] to semantically match oriental landscape painting images with landscape photo images for building a CLIP-matched dataset.

To make the CLIP-matched dataset, we built a pre-defined dictionary. As the pre-defined dictionary is important for matching, we have selected objects that frequently appear in $I_{ori}$ by referencing papers [30, 42] and museum collections [3, 5, 4, 1, 2]. Park et al. [42] reported that an analysis of 629 oriental landscape paintings showed that 98.4% of them depicted mountains, rivers, and the sea, underlining that many oriental landscape paintings are related to nature, while those depicting villages are scarce. Therefore, when constructing the pre-defined dictionary, it is reasonable to include more words related to nature. According to Kim et al. [30], when water is expressed in oriental landscape paintings, it is subdivided into details such as waterfalls, streams, and ponds depending on the terrain.

Hence, when creating the pre-defined dictionary, we not only included keywords related to nature, but also categorized those keywords. To semantically match $I_{ori}$ with $I_{photo}$ using CLIP [46], we build the pre-defined dictionary as shown in Fig. 3.
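For illustration, a minimal sketch (in Python) of how such a dictionary can be organized is shown below. Apart from the objects explicitly named above (mountains, rivers, the sea, waterfalls, streams, and ponds), the category names and keywords are illustrative placeholders rather than the exact dictionary used in our experiments.

```python
# Illustrative sketch of a pre-defined dictionary grouped by category.
# Only mountains, rivers, the sea, waterfalls, streams, and ponds are named
# in the text above; the remaining categories and keywords are placeholders.
PREDEFINED_DICTIONARY = {
    "mountain": ["a mountain", "a rocky cliff", "a mountain ridge"],
    "water": ["a river", "the sea", "a waterfall", "a stream", "a pond"],
    "vegetation": ["a pine tree", "a dense forest"],
    "structure": ["a pavilion", "a small village"],
}

# Flattened prompt list fed to the CLIP text encoder (see Fig. 3).
DICTIONARY_PROMPTS = [w for words in PREDEFINED_DICTIONARY.values() for w in words]
```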

Each set of images has $m$ samples of $I_{ori}$ and $n$ samples of $I_{photo}$, as in (1).

I_{ori}=\{i_{ori}^{1},\cdots,i_{ori}^{m}\},\quad I_{photo}=\{i_{photo}^{1},\cdots,i_{photo}^{n}\}   (1)

To mitigate cases in which the CLIP-based image matching method incorrectly matches $I_{ori}$ with $I_{photo}$, we construct the CLIP-matched dataset using a 1-to-$K$ matching method instead of a 1-to-1 matching manner. Thus, for each image in $I_{ori}$, we search for the top-$K$ best matching results in the set $I_{photo}$. The result of CLIP-based matching is $I^{\prime}_{photo}$, a subset of $I_{photo}$, as expressed in (2).

S_{photo}^{K}(i_{ori}^{j})_{j=1,\cdots,m}=I^{\prime}_{photo},\quad I^{\prime}_{photo}\subset I_{photo}   (2)

In summary, since there are no actual location photo images corresponding to oriental landscape painting images, we employ CLIP [46] to match landscape photo images that are semantically similar to oriental landscape painting images. The CLIP-matched dataset that consists of the matched images is used to train CycleGAN [63] in an unsupervised manner. The performance evaluation for different values of $K$ in 1-to-$K$ matching is presented in Section 4.2.
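One way to realize this 1-to-$K$ matching is sketched below with the public clip package. Comparing per-image similarity profiles over the dictionary prompts is a plausible reading of Fig. 3 rather than our released implementation, and DICTIONARY_PROMPTS refers to the illustrative dictionary above.

```python
# Sketch of 1-to-K CLIP-based matching via similarity profiles over the
# pre-defined dictionary. Batching over large photo sets (e.g., LHQ) is
# omitted for brevity.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def dictionary_profile(image_paths, prompts):
    """Cosine-similarity profile of each image over the dictionary prompts."""
    text_feat = model.encode_text(clip.tokenize(prompts).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    imgs = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in image_paths])
    img_feat = model.encode_image(imgs.to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return img_feat @ text_feat.T            # shape: (num_images, num_prompts)

@torch.no_grad()
def match_top_k(painting_paths, photo_paths, prompts, k=3):
    """Match each painting to its K most similar photos in profile space."""
    p_prof = torch.nn.functional.normalize(dictionary_profile(painting_paths, prompts), dim=-1)
    q_prof = torch.nn.functional.normalize(dictionary_profile(photo_paths, prompts), dim=-1)
    sim = p_prof @ q_prof.T                  # shape: (num_paintings, num_photos)
    topk = sim.topk(k, dim=-1).indices
    return {painting_paths[i]: [photo_paths[j] for j in topk[i].tolist()]
            for i in range(len(painting_paths))}
```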

3.2 Two-step real scene image prediction with I2I translation

As in CycleGAN [63], we employ two auto-encoders, $G_{photo\rightarrow ori}$ and $G_{ori\rightarrow photo}$, for mapping. $G_{photo\rightarrow ori}$ transforms $I_{photo}$ into $I_{ori}$, while $G_{ori\rightarrow photo}$ converts $I_{ori}$ into $I_{photo}$. Additionally, we use two discriminators, $D_{photo}$ and $D_{ori}$. $D_{photo}$ learns to differentiate $G_{ori\rightarrow photo}(I_{ori})$ and $I_{photo}$ as fake and real, while $D_{ori}$ distinguishes between $G_{photo\rightarrow ori}(I_{photo})$ and $I_{ori}$, aiming to discern fake and real images.

The adversarial and cycle consistency losses we utilize can be summarized as follows. The adversarial loss ensures that the auto-encoder $G_{photo\rightarrow ori}$ generates images as similar as possible to $I_{ori}$ and simultaneously trains the discriminator $D_{ori}$ to classify $I_{ori}$ as real and $G_{photo\rightarrow ori}(I_{photo})$ as fake. This process facilitates the transformation of $I_{photo}$ into $I_{ori}$ as closely as possible.

L_{adv}(G_{photo\rightarrow ori},D_{ori},I_{ori},I_{photo})
  = \mathbb{E}_{I_{ori}\sim P_{data}(I_{ori})}\left[\log(D_{ori}(I_{ori}))\right]
  + \mathbb{E}_{I_{photo}\sim P_{data}(I_{photo})}\left[\log(1-D_{ori}(G_{photo\rightarrow ori}(I_{photo})))\right]   (3)

The adversarial loss is also employed during the transformation from $I_{ori}$ to $I_{photo}$, and is expressed as:

L_{adv}(G_{ori\rightarrow photo},D_{photo},I_{ori},I_{photo})
  = \mathbb{E}_{I_{photo}\sim P_{data}(I_{photo})}\left[\log(D_{photo}(I_{photo}))\right]
  + \mathbb{E}_{I_{ori}\sim P_{data}(I_{ori})}\left[\log(1-D_{photo}(G_{ori\rightarrow photo}(I_{ori})))\right]   (4)

The cycle consistency loss is utilized under the premise that when $I_{photo}$ is transformed using $G_{photo\rightarrow ori}$ and the resulting $G_{photo\rightarrow ori}(I_{photo})$ is further transformed using $G_{ori\rightarrow photo}$, the output $G_{ori\rightarrow photo}(G_{photo\rightarrow ori}(I_{photo}))$ should resemble $I_{photo}$; the same holds for $I_{ori}$ in the opposite direction. The loss can be represented as:

L_{cyc}(G_{photo\rightarrow ori},G_{ori\rightarrow photo})
  = \mathbb{E}_{I_{photo}\sim P_{data}(I_{photo})}\left[\left\|G_{ori\rightarrow photo}(G_{photo\rightarrow ori}(I_{photo}))-I_{photo}\right\|_{1}\right]
  + \mathbb{E}_{I_{ori}\sim P_{data}(I_{ori})}\left[\left\|G_{photo\rightarrow ori}(G_{ori\rightarrow photo}(I_{ori}))-I_{ori}\right\|_{1}\right]   (5)

The final objective can be expressed as follows, with the terms weighted by $\lambda_{adv}$ and $\lambda_{cyc}$.

L_{total}(G_{photo\rightarrow ori},G_{ori\rightarrow photo},D_{ori},D_{photo})
  = L_{adv}(G_{photo\rightarrow ori},D_{ori},I_{ori},I_{photo})
  + \lambda_{adv}L_{adv}(G_{ori\rightarrow photo},D_{photo},I_{ori},I_{photo})
  + \lambda_{cyc}L_{cyc}(G_{photo\rightarrow ori},G_{ori\rightarrow photo})   (6)

In summary, in the first step, we use CycleGAN [63] to transform $I_{ori}$ into $I_{photo}$. We denote the translated $I_{photo}$ as $I_{pseudoRS}$, i.e., $\Phi_{CycleGAN}(I_{ori})=I_{pseudoRS}$. The generated $I_{pseudoRS}$ and $I_{ori}$ are fed into a pre-trained DiffuseIT [32] to predict a real scene image, $I_{RS}$, i.e., $\Phi_{DiffuseIT}(I_{ori},I_{pseudoRS})=I_{RS}$. The generated $I_{RS}$ enables depth prediction via a pre-trained depth estimation model such as MiDaS [48].
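At inference time, the whole pipeline reduces to a simple composition of the three models. The sketch below uses hypothetical wrappers phi_cyclegan, phi_diffuseit, and midas_depth around the trained CycleGAN generator, the pre-trained DiffuseIT model, and MiDaS [48]; these helper names do not come from any released code.

```python
# Inference-time composition of the two I2I steps and depth estimation.
# phi_cyclegan, phi_diffuseit, and midas_depth are hypothetical wrappers
# around the respective pre-trained models.
import torch

@torch.no_grad()
def painting_to_depth(i_ori: torch.Tensor, phi_cyclegan, phi_diffuseit, midas_depth):
    """i_ori: oriental landscape painting image tensor, shape (1, 3, H, W)."""
    i_pseudo_rs = phi_cyclegan(i_ori)          # step 1: Phi_CycleGAN(I_ori) = I_pseudoRS
    i_rs = phi_diffuseit(i_ori, i_pseudo_rs)   # step 2: Phi_DiffuseIT(I_ori, I_pseudoRS) = I_RS
    depth = midas_depth(i_rs)                  # relative depth map for 3D sculpting
    return i_rs, depth
```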

Figure 4: For qualitative evaluation, we compare our method with other I2I translation models. Initially, MiDaS [48], a pre-trained depth estimation model, failed to accurately estimate the depth of oriental landscape painting images despite its strong generalization ability. While GAN-based I2I translation methods [63, 43, 33] preserved the structural information of these painting images, they struggled to achieve realistic translations. CAST [61] and VQ-I2I [11] not only failed to produce realistic translations but also lost structural fidelity. Diffusion-based I2I models such as DiffuseIT [32] and BBDM [36] managed to realistically transform oriental landscape painting images into real scene images, albeit with huge distortions of structural information. In contrast, our method preserves structural information like CycleGAN [63], CUT [43], or DRIT++ [33] while achieving realistic translations akin to BBDM [36] or DiffuseIT [32]. Our approach successfully enables depth measurement using pre-trained MiDaS [48] for the given oriental landscape painting images.
Figure 5: The w/o CycleGAN [63] cases show significant structural distortion during translation. When DiffuseIT [32] is missing, the predicted images maintain the structure of the oriental landscape painting images but fail to be realistically converted into real scene images. The w/ CLIP-based image matching cases demonstrate that CLIP-based image matching enables CycleGAN [63] to produce high-quality pseudo-real scene images. Our method shows that employing CLIP-based image matching, CycleGAN [63], and DiffuseIT [32] together generates more plausible real scene images. Best viewed in color.

4 Experiments

To evaluate the performance of our method, we describe the experimental settings, a qualitative evaluation, an ablation study, the optimal $K$ in CLIP-based image matching, and a user study.

In Section 4.1, we compare our method with previous I2I translation models [63, 43, 61, 11, 33, 32, 36] for qualitative evaluation. In addition, an ablation study is also provided in Section 4.1 to demonstrate the importance of each part of the proposed framework.

For the experimental settings, as the first I2I step, CycleGAN [63] has been trained in an unsupervised manner using the CLIP-matched dataset that pairs oriental landscape painting images from the Chinese landscape dataset [58] with the corresponding landscape photo images from the LHQ dataset [51]. For DiffuseIT [32] in the second I2I step, the ImageNet 256×256 pre-trained model has been utilized. The other I2I translation models [63, 43, 61, 11, 33, 32, 36] have been trained in an unsupervised manner, using the Chinese landscape dataset [58] as oriental landscape painting images and randomly sampled images from the LHQ dataset [51] as landscape photo images. Their official source codes and recommended hyperparameters have been used for training. For generating depth maps, BEiT512-L pre-trained MiDaS v3.1 [48] has been employed to estimate depth.
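A minimal sketch of the MiDaS v3.1 [48] depth-estimation step via torch.hub is shown below for reference; the hub entry names (DPT_BEiT_L_512, beit512_transform) follow the MiDaS repository but should be treated as assumptions that may vary across releases.

```python
# Sketch of depth estimation with MiDaS v3.1 via torch.hub; the entry names
# below follow the intel-isl/MiDaS repository but may differ across releases
# (dpt_transform / DPT_Large are common fallbacks).
import cv2
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
midas = torch.hub.load("intel-isl/MiDaS", "DPT_BEiT_L_512").to(device).eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.beit512_transform

# "real_scene.png" stands in for a real scene image predicted by our framework.
img = cv2.cvtColor(cv2.imread("real_scene.png"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    prediction = midas(transform(img).to(device))
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().cpu().numpy()   # relative (inverse) depth map, same size as the input
```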

In order to analyze the impact of $K$ on CLIP-based image matching and find the optimal value, we have experimented with different values of $K$, i.e., $K$ = 1, 3, 5, and 10.

Since there is no metric for quantitative evaluation, we conducted a user study. We asked forty anonymous participants five questions, each consisting of $Q^{s}$ and $Q^{q}$. $Q^{s}$ aims to assess how well the real scene image preserves the structure of a given oriental landscape painting image during translation. $Q^{q}$ rates how realistically a produced real scene image is translated. Detailed information is provided in Section 4.3.

Figure 6: When $K$ is either 1 or 10, CycleGAN [63] is not trained effectively and a low-quality pseudo-real scene image is generated. Consequently, DiffuseIT [32] fails to predict a plausible real scene image in the second step, which in turn degrades the depth measurements. When $K$ is set to either 3 or 5, the first-step translation generates a more realistic pseudo-real scene image, and the second-step I2I translation predicts the best real scene image corresponding to a given oriental landscape painting image. This result demonstrates that a suitable $K$ in CLIP-based image matching leads to a more plausible real scene image. Furthermore, the quality of the generated real scene image affects the depth estimation results.

4.1 Qualitative evaluation

To evaluate qualitative results, we have compared our method with other I2I translation models. For our method, CycleGAN [63] in the first step has been trained on the CLIP-matched dataset, and the ImageNet 256×256 pre-trained model has been employed for DiffuseIT [32] in the second step. The other I2I translation models have been trained on pairs of oriental landscape painting images from the Chinese landscape dataset [58] and randomly sampled landscape photo images from the LHQ dataset [51].

As illustrated in the first column of Fig. 4, direct application of the pre-trained depth estimation model MiDaS [48] fails to estimate the depth of oriental landscape painting images even though MiDaS [48] has excellent generalization ability. As depicted in the second, third, and fourth columns of Fig. 4, the GAN-based I2I translation methods [63, 43, 33] preserve the structural information of the oriental landscape painting images well, but cannot convert them realistically. Surprisingly, the GAN-based I2I translation models CAST [61] and VQ-I2I [11] not only fail to realistically translate oriental landscape painting images to real scene images, but also distort structural information, as shown in the fifth and sixth columns of Fig. 4. In the seventh and eighth columns of Fig. 4, the diffusion-based I2I models DiffuseIT [32] and BBDM [36] realistically transform oriental landscape painting images into real scene images. However, the predicted real scene images drastically change the structural information during translation. This means that even if depth can be estimated using pre-trained MiDaS [48], the obtained depth map does not represent the scene in the given oriental landscape painting image. Our method predicts real scene images corresponding to oriental landscape painting images, preserving structural information as well as CycleGAN [63], CUT [43], or DRIT++ [33] do, while translating as realistically as BBDM [36] or DiffuseIT [32].

As depicted in Fig. 5, regardless of the use of CLIP-based image matching, the absence of CycleGAN [63] leads to significant structural deformation during translation. The depth of structurally modified real scene images is meaningless even if it can be measured using a pre-trained depth estimation model [48]. When DiffuseIT [32] is missing, although the predicted real scene images maintain structural similarity, the oriental landscape painting images are not realistically converted into real scene images, which leads to unsatisfactory depth maps. Without CLIP-based image matching, CycleGAN [63] creates low-quality pseudo-real scene images, which degrade the final result. To achieve plausible real scene images corresponding to the given oriental landscape painting images, CLIP-based image matching, CycleGAN [63], and DiffuseIT [32] should all be employed.

4.2 Optimal $K$ in CLIP-based matching

As illustrated in Fig. 6, we have trained CycleGAN [63] using the CLIP-matched dataset. When building this dataset, a 1-to-$K$ manner is employed. To prove the effectiveness of the 1-to-$K$ manner, we have compared real scene images with $K$ set to 1, 3, 5, and 10. Setting $K$ to either 1 or 10 results in either too few or too many images being treated as correct data, significantly influencing CycleGAN [63] and ultimately yielding poorly translated outcomes. When $K$ is set to either 3 or 5, CycleGAN [63] produces the most plausible pseudo-real scene images, which also improve the final results produced through DiffuseIT [32].

4.3 User study

To evaluate the structural preservation and translation quality of the predicted real scene images, we conducted a user study. We asked forty anonymous participants five question sets, each subdivided into $Q^{s}$ and $Q^{q}$.

Table 1: User study result. For $Q^{s}$, more than 90% of the respondents chose the oriental landscape painting image that corresponds to the real scene image for all the questions, with an average of 93.6%. This shows that our method effectively preserves structural information during transformation. Additionally, the questions regarding the quality of the predicted real scene images, $Q^{q}$, all received scores higher than 4.0, with an average score of 4.22. This indicates that the predicted real scene images are realistically translated.
Question No. 1 2 3 4 5 avg
$Q^{s}$ score (%) 90% 100% 93% 95% 90% 93.6%
$Q^{q}$ score (1–5) 4.3 4.3 4.1 4.2 4.2 4.22

First, for $Q^{s}$, we provided one real scene image and five oriental landscape painting images, including the one painting that corresponds to the given real scene image. We asked participants to select the single oriental landscape painting image that is most structurally similar to the real scene image. This question assesses how well the real scene image preserves the structure of the given oriental landscape painting image during translation.

After the first question was answered, we asked another question, $Q^{q}$. We gave participants a pair of an oriental landscape painting image and its real scene image and asked them to rate how realistically the produced real scene image is translated, on a scale from 1 (min) to 5 (max).

As summarized in Table 1, regarding the assessment of structural preservation in $Q^{s}$, an average of 93.6% of the participants chose the oriental landscape painting images that corresponded to the predicted real scene images. For the questions that evaluate the quality of the predicted real scene images, $Q^{q}$, an average score of 4.22 points was obtained. These results demonstrate the capability of our method to not only preserve the structural similarity of oriental landscape paintings during prediction, but also realistically translate oriental landscape painting images into real scene images.

5 Conclusion

In this paper, we propose a novel framework that features CLIP-based image matching and two-step I2I translation to predict real scene images corresponding to oriental landscape painting images. We utilize a CLIP-based image matching method with the pre-defined dictionary that includes objects commonly appearing in oriental landscape paintings. In the first I2I step, oriental landscape painting images are translated to pseudo-real scene images using CycleGAN[63]. The generated pseudo-real scene images and the given oriental landscape painting images are input into DiffuseIT[32] for predicting final real scene images in the second I2I step. This strategy enables oriental landscape painting images to be translated into real scene images that are structurally similar yet realistic.

Experimental results show that our approach is highly effective for measuring the depth of oriental landscape painting images through I2I translation. The measured depth of the predicted real scene images can be directly applied to generating 3D sculptures that aid visually impaired people in experiencing paintings through touch. We hope that our work paves the way for helping visually impaired people appreciate diverse paintings.

There are tasks that need to be addressed in future research. Oriental landscape paintings suffer from inherent preservation issues such as splitting, blurring, and deformation, which significantly impact the outcomes of I2I translation. In other words, a model that restores oriental landscape paintings to their original state is required. Furthermore, oriental landscape paintings often contain numerous text inscriptions and stamps that affect depth estimation results. Removing this text, written with strokes and colors similar to the artwork, is difficult. To address these problems, substantial amounts of data need to be collected and shared. However, many globally recognized museums frequently restrict the public release and manipulation of their data.

6 Acknowledgement

This study was supported in part by the National Research Foundation of Korea (NRF) under Grant 2020R1F1A1048438 and the High-performance Computing (HPC) Support project funded by the Ministry of Science and ICT and the National IT Industry Promotion Agency (NIPA). We also sincerely thank the anonymous participants who took part in the user study.

References

  • [1] Harvard Art Museums. https://harvardartmuseums.org/
  • [2] Kyoto National Museum. https://www.kyohaku.go.jp/eng/
  • [3] The Metropolitan Museum of Art. https://www.metmuseum.org/
  • [4] National Museum of Korea. http://www.emuseum.go.kr/main
  • [5] National Palace Museum. https://www.npm.gov.tw/
  • [6] Bao, Y., Yang, T., Lin, X., Fang, Y., Wang, Y., Pöppel, E., Lei, Q.: Aesthetic preferences for eastern and western traditional visual art: Identity matters. Frontiers Psychology 7(000), 000–000 (2017)
  • [7] Bhat, S.F., Birkl, R., Wofk, D., Wonka, P., Müller, M.: Zoedepth: Zero-shot transfer by combining relative and metric depth (2023)
  • [8] Bhattacharjee, D., Everaert, M., Salzmann, M., Süsstrunk, S.: Estimating image depth in the comics domain. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 2070–2079 (January 2022)
  • [9] Birkl, R., Wofk, D., Müller, M.: Midas v3.1 – a model zoo for robust monocular relative depth estimation (2023)
  • [10] Chen, Y.: Generate lanscape paintings from text to landscape painting generation and redrawing using deep learning model. In: 2022 5th International Conference on Artificial Intelligence and Big Data (ICAIBD). pp. 145–149 (2022). https://doi.org/10.1109/ICAIBD55127.2022.9820385
  • [11] Chen, Y.J., Cheng, S.I., Chiu, W.C., Tseng, H.Y., Lee, H.Y.: Vector quantized image-to-image translation. In: European Conference on Computer Vision (ECCV) (2022)
  • [12] Cheng, X., Wang, P., Yang, R.: Learning depth with convolutional spatial propagation network. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(10), 2361–2379 (2020). https://doi.org/10.1109/TPAMI.2019.2947374
  • [13] Conde, M.V., Turgutlu, K.: Clip-art: Contrastive pre-training for fine-grained art classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 3956–3960 (June 2021)
  • [14] Ehret, T.: Monocular depth estimation: a review of the 2022 state of the art. Image Processing On Line 13, 38–56 (2023). https://doi.org/10.5201/ipol.2023.459
  • [15] Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 12873–12883 (June 2021)
  • [16] Gary, H.: Western perspective japanese vision. POUR UN VOCABULAIRE DE LA SPATIALITÉ JAPONAISE 43(000), 153–158 (2013)
  • [17] Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [18] Hentschel, S., Kobs, K., Hotho, A.: Clip knows image aesthetics. Frontiers in Artificial Intelligence 5 (2022). https://doi.org/10.3389/frai.2022.976235
  • [19] Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: CLIPScore: a reference-free evaluation metric for image captioning. In: EMNLP (2021)
  • [20] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 6840–6851. Curran Associates, Inc. (2020)
  • [21] Hu, R., Dollár, P., He, K., Darrell, T., Girshick, R.: Learning to segment every thing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
  • [22] Imran, S., Long, Y., Liu, X., Morris, D.: Depth coefficients for depth completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
  • [23] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [24] Jang, J.: Re-consideration about concept of samwonbub: based on middle level school art textbook. Journal of Art Education 33(000), 221–239 (2012)
  • [25] Jiang, E., Reif, E., Kim, B.: Depth predictions in art. https://pair-code.github.io/depth-maps-art-andillusions/art_history_vis/blogpost/blogpost_1.html
  • [26] Jin, J.H.: Real scenery landscape painting in the early and mid joseon era. Korean Association of Art History Education 24(000), 39–66 (2010)
  • [27] Jungmann, B.: Confusing traditions: Elements of the korean an kyŏn school in early japanese nanga landscape painting. Artibus Asiae 55(4), 303–318 (1995)
  • [28] Kim, H.B.: The influence of landscape painting concepts on garden design principles in east-asia. Journal of the Korean Institute of Landscape Architecture 37(000), 000–000 (2010)
  • [29] Kim, J.s.: The aesthetic meaning of the ’mountatin-water’ space expressed in ’three distances’. Journal of The Society of philosophical studies 125, 83–107 (2013)
  • [30] Kim, Y.H., kang, Y.: A study on the micro-topography landscape characteristics and waterfront landscape style of waterfront in korean jingyeong landscape painting. journal of the korean institute of landscape architecture 47(1), 26–38 (2019)
  • [31] Kwon, G., Ye, J.C.: Clipstyler: Image style transfer with a single text condition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18062–18071 (June 2022)
  • [32] Kwon, G., Ye, J.C.: Diffusion-based image translation using disentangled style and content representation. In: The Eleventh International Conference on Learning Representations (2023)
  • [33] Lee, H.Y., Tseng, H.Y., Mao, Q., Huang, J.B., Lu, Y.D., Singh, M.K., Yang, M.H.: Drit++: Diverse image-to-image translation viadisentangled representations. International Journal of Computer Vision pp. 1–16 (2020)
  • [34] Lee, H.S.: Tradition of korean landscape: Its historic perspective and indigenization. International Journal of Korean Humanities and Social Sciences 2(00), 49–70 (2016)
  • [35] Lee, T.h.: Painting from actual scenery and painting from memory: Viewpoint and angle of view in landscape paintings of the late joseon dynasty. Journal of Korean Art and Archaeology 3, 106–150 (2008)
  • [36] Li, B., Xue, K., Liu, B., Lai, Y.K.: Bbdm: Image-to-image translation with brownian bridge diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1952–1961 (June 2023)
  • [37] Lin, Y.H., Tsai, M.H., Wu, J.L.: Depth sculpturing for 2d paintings: A progressive depth map completion framework. Journal of Visual Communication and Image Representation 25(4), 670–678 (2014). https://doi.org/10.1016/j.jvcir.2013.12.005, 3D Video Processing
  • [38] Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. Advances in neural information processing systems 30 (2017)
  • [39] Liu, Z.S., Wang, L.W., Siu, W.C., Kalogeiton, V.: Name your style: Text-guided artistic style transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 3530–3534 (June 2023)
  • [40] Materzyńska, J., Torralba, A., Bau, D.: Disentangling visual and written concepts in clip. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16410–16419 (June 2022)
  • [41] Miangoleh, S.M.H., Dille, S., Mai, L., Paris, S., Aksoy, Y.: Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9685–9694 (June 2021)
  • [42] Park, C.W., Lee, Y.H., Kim, J.J.: The types and characteristics of natural scenery in landscape painting during joseon dynasty. Artibus Asiae 55(4), 303–318 (1995)
  • [43] Park, T., Efros, A.A., Zhang, R., Zhu, J.Y.: Contrastive learning for unpaired image-to-image translation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. pp. 319–345. Springer (2020)
  • [44] Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., Lischinski, D.: Styleclip: Text-driven manipulation of stylegan imagery. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 2085–2094 (October 2021)
  • [45] Pauls, A., Pierdicca, R., Mancini, A., Zingaretti, P.: The depth estimation of 2d content: A new life for paintings. In: De Paolis, L.T., Arpaia, P., Sacco, M. (eds.) Extended Reality. pp. 127–145. Springer Nature Switzerland, Cham (2023)
  • [46] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (18–24 Jul 2021)
  • [47] Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12179–12188 (2021)
  • [48] Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2020)
  • [49] Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., Norouzi, M.: Palette: Image-to-image diffusion models. In: ACM SIGGRAPH 2022 Conference Proceedings. SIGGRAPH ’22, Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3528233.3530757
  • [50] Sain, A., Bhunia, A.K., Chowdhury, P.N., Koley, S., Xiang, T., Song, Y.Z.: Clip for all things zero-shot sketch-based image retrieval, fine-grained or not. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2765–2775 (June 2023)
  • [51] Skorokhodov, I., Sotnikov, G., Elhoseiny, M.: Aligning latent and image spaces to connect the unconnectable. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 14144–14153 (October 2021)
  • [52] Su, X., Song, J., Meng, C., Ermon, S.: Dual diffusion implicit bridges for image-to-image translation. In: International Conference on Learning Representations (2023)
  • [53] Tomei, M., Cornia, M., Baraldi, L., Cucchiara, R.: Art2real: Unfolding the reality of artworks via semantically-aware image-to-image translation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5849–5859 (2019)
  • [54] Vinker, Y., Alaluf, Y., Cohen-Or, D., Shamir, A.: Clipascene: Scene sketching with different types and levels of abstraction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4146–4156 (October 2023)
  • [55] Wang, G., Shen, J., Yue, M., Ma, Y.: A computational study of empty space ratios in chinese landscape painting. Leonardo 55(000), 43–47 (2022)
  • [56] Wang, J., Chan, K.C., Loy, C.C.: Exploring clip for assessing the look and feel of images. In: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence. AAAI’23/IAAI’23/EAAI’23, AAAI Press (2023). https://doi.org/10.1609/aaai.v37i2.25353
  • [57] Xie, C.W., Sun, S., Xiong, X., Zheng, Y., Zhao, D., Zhou, J.: Ra-clip: Retrieval augmented contrastive language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19265–19274 (June 2023)
  • [58] Xue, A.: End-to-end chinese landscape painting creation using generative adversarial networks. In: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 3862–3870 (2021). https://doi.org/10.1109/WACV48630.2021.00391
  • [59] Yong, G., Jeon, K., Gil, D., Lee, G.: Prompt engineering for zero-shot and few-shot defect detection and classification using a visual-language pretrained model. Computer-Aided Civil and Infrastructure Engineering 38(11), 1536–1554 (2023). https://doi.org/10.1111/mice.12954
  • [60] Zhang, R., Zhang, W., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H.: Tip-adapter: Training-free adaption of clip for few-shot classification. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV. p. 493–510. Springer-Verlag, Berlin, Heidelberg (2022). https://doi.org/10.1007/978-3-031-19833-5_29
  • [61] Zhang, Y., Tang, F., Dong, W., Huang, H., Ma, C., Lee, T.Y., Xu, C.: Domain enhanced arbitrary image style transfer via contrastive learning. In: ACM SIGGRAPH 2022 Conference Proceedings. pp. 1–8 (2022)
  • [62] Zhou, T., Tucker, R., Flynn, J., Fyffe, G., Snavely, N.: Stereo magnification: learning view synthesis using multiplane images. ACM Trans. Graph. 37(4) (jul 2018). https://doi.org/10.1145/3197517.3201323
  • [63] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2223–2232 (2017)
  • [64] Zhu, S., Brazil, G., Liu, X.: The edge of depth: Explicit constraints between segmentation and depth. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)