\jyear

2023

[2]\fnmJun \surYang

1]\orgnameZhejiang Sci-Tech University, \orgaddress\cityHangzhou, \countryChina 2]\orgnameJiaxing University, \orgaddress\cityJiaxing, \countryChina

Rethinking PRL: A Multiscale Progressively Residual Learning Network for Inverse Halftoning

\fnmFeiyu \surLi lifeiyu.cs@foxmail.com yangj95@mail2.sysu.edu.cn [ [

Abstract

Image inverse halftoning is a classic image restoration task, aiming to recover continuous-tone images from halftone images with only bilevel pixels. Because the halftone images lose much of the original image content, inverse halftoning is a classic ill-problem. Although existing inverse halftoning algorithms achieve good performance, their results lose image details and features. Therefore, it is still a challenge to recover high-quality continuous-tone images. In this paper, we propose an end-to-end multiscale progressively residual learning network (MSPRL), which has a UNet architecture and takes multiscale input images. To make full use of different input image information, we design a shallow feature extraction module to capture similar features between images of different scales. We systematically study the performance of different methods and compare them with our proposed method. In addition, we employ different training strategies to optimize the model, which is important for optimizing the training process and improving performance. Extensive experiments demonstrate that our MSPRL model obtains considerable performance gains in detail restoration.

keywords:

Image inverse halftoning, error diffusion, multiscale progressively learning, deep learning.

1 Introduction

The halftoning method represents continuous-tone images with two levels of color, namely black and white, due to cost considerations and is commonly used in digital image printing, publishing and displaying applications (Mulligan and Ahumada Jr, 1992). There are various methods for halftoning algorithms, such as error diffusion, dot diffusion, ordered dithering and direct binary search (Floyd, 1976; Eschbach and Knox, 1991; Knuth, 1987; Bayer, 1973; Seldowitz et al, 1987). Because the halftone image has only two values, it can save considerable storage space and network transfer bandwidth compared to continuous-tone images. It is also a feasible and important image compression method. Fig. 1 illustrates an original grayscale image, corresponding to the halftone images and the inverse halftoning images.

Image inverse halftoning is an image restoration task that reconstructs continuous-tone images with 255 or more levels from its corresponding halftone images. The purpose is to convert a binary image at $\{0,1\}^{H\times W}$ into a continuous image in $\mathbb{R}^{H\times W}$ space, where $H$ and $W$ are the image height and width respectively. Because the halftone image loses many detailed features during the halftoning process, it is a challenging and ill-posed problem. In the past several decades numerous image inverse halftoning approaches have been explored to achieve good inverse halftoning performance (Kite et al, 2000; Analoui and Allebach, 1992; Mese and Vaidyanathan, 2001; Liu et al, 2010; Wong, 1995).

Owing to the success of deep convolutional neural networks (CNNs) in vision tasks, CNN-based image restoration methods have been extensively studied and have shown amazing performance. Several inverse halftoning methods based on deep learning have also achieved the significant advancements (Hou and Qiu, 2017; Xiao et al, 2017; Yuan et al, 2019; Xia and Wong, 2018). These methods mainly use the typical UNet architecture to build their CNN models. The UNet architecture is a multilevel design that aims to recover detailed features by extracting different information at multiple scales of the image. Therefore, it is widely used as a baseline in many vision models.

However, there is still a certain gap in detail restoration. Although most of the existing methods use the UNet architecture, they cannot effectively extract features at different image scales, because of which the quality of image reconstruction still has much room for improvement. Previous studies (Hou and Qiu, 2017; Xiao et al, 2017; Yuan et al, 2019) did not effectively extract image textures and features of multilayer downsampled images, and failed to restore high-quality continuous-tone images. In addition, Shao et al (2021) added an attention mechanism to enhance the detail extraction, but, despite the increased complexity of the model, there was no obvious performance improvement.

In this paper, we present a novel multiscale progressively learning network architecture that inspired by precious progressively learning UNet architectures (Zamir et al, 2022; Chen et al, 2022; Cho et al, 2021). Our model takes multiscale input images and uses a shallow feature extraction module to extract similar features from multiscale images. The encoder and decoder are composed of multiple residual block modules. Then, the feature fusion module fuses the output images of different stages of the encoder as the input images of the decoder, and finally outputs the continuous-tone images via progressively learning, which can ensure the efficiency of the model’s learning ability. We conduct experiments on the VOC2012 dataset, which is widely used in other vision tasks, such as image classification, object detection and instance segmentation. The main contributions of this paper are as follows.

1) Our MSPRL contains encoder and decoder stages. The encoder is mainly responsible for restoring image information and removing noise that affects image quality. The decoder aims to recover the texture details of different feature maps from the encoding stage, and outputs continuous-tone grayscale images. Meanwhile, we compare some common feature extraction blocks in the encoder and decoder.

2) We propose a computationally inexpensive shallow feature extraction module (SFE) to extract attention information between images to recover content feature representation and use a feature fusion module (FF) to fuse different stages of feature information.

3) Compared with other researchers who are keen on designing model architectures, we delve into the optimization of training strategies. Good training strategies for the performance improvement of the model are obvious and are used in our model training process, such as data augmentation methods and compound loss functions, which bring considerable improvements to model training and optimization.

2 Related Work

2.1 Conventional Inverse Halftoning

During the last decade, many approaches have been proposed for image inverse halftoning. Simple approaches use low-pass filtering to remove halftone noise (Wong, 1995; Catté et al, 1992). Although these methods can remove most of the halftone noise, they also remove high-frequency edge information. Thus, Kite et al (2000) proposed gradient-based spatially varying filtering for error diffused images to better recover high-frequency details of images. Unal and Çetin (2001), and Analoui and Allebach (1992) proposed projection onto the convex sets method (POCS) for inverse halftoning. In addition, some researchers used wavelet-based methods to separate halftoned image noises and then reconstruct the original image by wavelet shrinkage. Based on the Bayesian approach, Liu et al (2010) built a correlation map between adjacent points for inverse halftoning. Dictionary-based learning has also been widely and successfully applied to inverse halftoning (Zhang et al, 2018b). Son and Choo (2014) proposed an edge-oriented local learned dictionaries (LLD) method to enhance the edge details of the restored image. Considering computational efficiency, Mese and Guo further proposed a precomputed look-up table (LUT) (Mese and Vaidyanathan, 2001; Guo et al, 2013) to improve performance and utilize efficiency. Huang et al (2008) used a hybrid neural network method to process halftone and inverse halftoning images.

2.2 Deep Convolutional Neural Networks

Deep convolutional neural networks (CNNs) have become the dominant method for solving various image reconstruction problems and have achieved state-of-the-art performance on a wide variety of vision datasets. SRCNN (Dong et al, 2014) first introduced CNNs to the image super-resolution (SR) task, which focuses on reconstructing high-resolution (HR) details from corresponding low-resolution (LR) images, and obtains superior performance against previous conventional SR methods. ResNet (He et al, 2016) introduced an identity skip connection that alleviates the difficulty of model degradation in deep neural networks and allows networks to learn deeper feature representations. VDSR (Kim et al, 2016) achieved a good recovery effect using a residual learning architecture for super resolution. EDSR (Lim et al, 2017) built a very wide network using residual blocks. DnCNN (Zhang et al, 2017) used CNNs to remove the white Gaussian noise of images. MIMO-UNet (Cho et al, 2021), NAFNet (Chen et al, 2022) and Restormer (Zamir et al, 2022) presented the multi-input fusion UNet architecture to aggregate multiscale feature information for image deblurring and image deraining.

Image inverse halftoning is similar to many image restoration tasks. Thus, Hou and Qiu (2017), and Xiao et al (2017) applied CNNs to inverse halftoning by building a UNet network as the restoration network. Xia and Wong (2018) proposed a progressively residual learning network (PRL) including two main stages: the content aggregation stage, which restores the content map, and the detail enhancement stage, which restores the extract texture and details. Yuan et al (2019) proposed gradient-guided residual learning CNNs (GGRL) for inverse halftoning. The same subnetworks are used to learn gradient maps of different Sobel orientations from the input halftone image, and a coarse map is output that is used to restore the continuous-tone images. Shao et al (2021) presented an attention model for inverse halftoning by using residual channel attention blocks (RCAB) (Zhang et al, 2018a). Xia et al (2021) and Yen et al (2021) combined inverse halftoning with image colorization methods to recover color continuous-tone images with better visual quality from corresponding halftone grayscale images.

2.3 The importance of training strategies

Better training strategies can increase the performance of a model and effectively decrease the training time (Goyal et al, 2017; He et al, 2019; Qian et al, 2022; Lin et al, 2022). Data augmentation is one of the most important strategies to boost the performance of a neural network (Cubuk et al, 2020). It can provide more learning samples and improve model generalization through various random changes for training images. Many researchers use cosine annealing (Loshchilov and Hutter, 2016) decay to boost performance. Furthermore, the warm-up method (Goyal et al, 2017; He et al, 2019) is used to alleviate the instability of the model in the early training stage. In many vision tasks, removing batch normalization (BN) layers can increase performance and reduce computational complexity such as SR and deblurring (Lim et al, 2017; Wang et al, 2018). Zhao et al (2016) showed that L₁ loss has a better convergence effect and image perceptual quality than L₂ loss. In this paper, we adopt suitable training strategies for our inverse halftoning task to improve the visual quality of restored continuous-tone images.

3 Methodology

In this section, we first introduce our MSPRL model based on the UNet architecture and propose the shallow feature extraction module (SFE) in Sec. 3.1. The overall architecture of MSPRL is shown in Fig. 2. Then we describe our loss function in Sec. 3.2. Last, we use an overall different training strategy compared with PRL in Sec. 3.3.

3.1 model architecture

As shown in Fig. 2, given a halftone input image $X\in\{0,1\}^{H\times W\times 1}$ , the goal of our method is to restore the clear and continuous-tone grayscale image $Y\in\mathbb{R}^{H\times W\times 1}$ by progressively learning. Our model is mainly divided into two stages: the left encoder (EC) stage and the right decoder (DC) stage, and three levels from top to bottom.

Overall Pipeline. In the encoder stage, we first use a 3 $\times$ 3 convolution layer to obtain a low-level feature map F ${}_{k}^{EB}\in\mathbb{R}^{H\times W\times C}$ , where $H$ $\times$ $W$ denotes the spatial dimension, $C$ is the number of feature map channels that we set to 48, $k$ represents the $k^{th}$ level and EB represents encoder block that consist of 8 residual blocks (RBs). Then, F ${}_{k}^{EB}$ passes through an encoder EB_k which transforms it into deep feature maps at level 1. EB_k obtains an output EB ${}_{k}^{down}$ by downsampling, where the number of channels is doubled and the image size is halved. For the downsampling and upsampling modules, we apply pixel-unshuffle and pixel-shuffle operations respectively. To extract the similar information of multiscale images, we use the shallow feature extraction module (SFE) to exploit the attention features of EB ${}_{k}^{down}$ and X ${}_{k-1}^{resize}$ in the second and third levels, respectively, and output the fusion attention feature maps of SFE_k, where $resize$ uses linear interpolation downsampling from the corresponding $k$ -1 level of the input image $X$ . Then SFE_k passes through EB to obtain deep features. The left encoding stage process is defined as:

EB_{k}=\left\{\begin{array}[]{ll}EB_{k}(Conv_{3}(X_{k}))&k=1,\\ EB_{k}(SFE_{k}(EB{{}_{k=1}}))&k=2,3,\end{array}\right.

(1)

where $X$ is the input image, $Conv_{3}$ represents a 3 $\times$ 3 convolutional layer, and EB_k and SFE_k represent the outputs of the $k^{th}$ level EB and SFE respectively.

In MSPRL, the decoder takes the encoder features EB as input and progressively recovers the continuous-tone representations. First, the feature fusion module (FF) aggregates the feature maps of different encoder stages EB_k and EB ${}_{k+1}^{up}$ and outputs the aggregated features FF_k. Then, we use the decoder block (DB) DB_k to reconstruct the image details, where DB is also composed of 8 residual blocks (RBs). Through a series of decoding and reconstruction, we obtain F ${}_{k}^{DB}$ . Finally, we apply a 3 $\times$ 3 convolution and residual connection to obtain the final continuous-tone image $Y$ . The overall process is progressively learning. The right decoding stage process is defined as:

DB_{k}=DB_{k}(FF_{k}(EB_{k},EB_{k+1}^{up})),

(2)

Y=Conv_{3}(DB_{1})+X,

(3)

where $X$ and $Y$ are the input and output images, respectively, $Conv_{3}$ represents a 3 $\times$ 3 convolutional layer, and DB_k and FF_k represent the outputs of the $k^{th}$ DB and FF respectively, and $k$ = 1,2 in Eq. 2.

Shallow Feature Extraction and Feature Fusion. Inspired by the shallow convolutional module (SCM) in MIMO-UNet (Cho et al, 2021), our shallow feature extraction module (SFE) is shown in Fig. 3LABEL:sub@fig:3_1. The $X_{k-1}^{resize}$ passes through a 3 $\times$ 3 convolutional layer and two stacks of 1 $\times$ 1 point-wise convolution to output a low-level feature map Conv ${}_{k}^{stack}$ . Then, we use element-wise multiplication to obtain attention features between the previous Conv ${}_{k}^{stack}$ and EB ${}_{k-1}^{down}$ . A 1 $\times$ 1 point-wise convolution is used to aggregate the attention features of previous $X_{k-1}^{resize}$ and EB ${}_{k-1}^{down}$ , which are shown in Fig. 3LABEL:sub@fig:3_2. The SFE is formulated as:

SFE_{k}^{att}=Conv_{k}^{stack}(X_{k-1}^{resize})\otimes EC_{k-1}^{down},

(4)

SFE_{k}=Conv_{1}(Concat(X_{k-1}^{resize},SFE_{k}^{att}))+EC_{k-1}^{down},

(5)

where $k$ = 2,3 represents the $k^{th}$ level, $Conv_{stack}$ , $Conv_{1}$ and $\otimes$ represent multiple stacked convolutional layers, a 1 $\times$ 1 convolutional layer and element-wise multiplication respectively.

For the feature fusion module (FF), the FF aggregates the feature maps of EB_k and EB ${}_{k+1}^{up}$ and is formulated as:

FF_{k}=Conv_{1}(Concat(EB_{k},EB_{k+1}^{up})),

(6)

where $k$ = 1,2 represents $k^{th}$ level and $Conv_{1}$ represents a 1 $\times$ 1 convolutional layer.

Downsampling and Upsampling. We use PixelShuffle as the upsampling and downsampling operation. Compared with convolution upsampling and downsampling, PixelShuffle can obtain better visual quality (Shi et al, 2016).

Progressively Learning. Progressively learning allows the network to learn local and global features, which makes full use of semantic information of different scale images. In addition, it can also greatly reduce the convolution operation time under the small image patches. The feature maps of different stages are shown in Fig. 4.

3.2 Loss Function

Although L₁ loss, MSE loss and perceptual loss are used in PRL, we experimentally found that perceptual loss is added with a very large penalty coefficient, which has little effect on model convergence, and that MSE loss has a smooth ing effect. In this paper, we only use L₁ loss as follows:

L_{pixel}={\|X_{gt}-Y\|}_{1},

(7)

where $L_{pixel}$ is the pixel-wise loss that evaluates the L₁ distance between recovered image $Y$ and the ground-truth gray image $X_{gt}$ . Some studies have shown that composite loss functions can improve performance. Inspired by (Cho et al, 2021), we add the fast Fourier transform (FFT) (Cochran et al, 1967) loss function to strengthen the high-frequency extraction as follows:

L_{FFT}={\|FFT(X_{gt})-FFT(Y)\|}_{1},

(8)

where FFT represents the fast Fourier transform that transfers the image signal to the frequency domain and uses L₁ loss to evaluate the distance between recovered image $Y$ and the ground-truth gray image $X_{gt}$ . The final loss function for training our model is as follows:

L_{total}=L_{pixel}+\lambda L_{FFT},

(9)

where we set $\lambda$ = 0.1 in experiment.

3.3 Training strategies

We first show the different training strategy comparisons in Tab. 1. Then we illustrate the strategies that differ from PRL.

Data augmentation. We found that other researchers use the resize operation to scale the images to 256 $\times$ 256. However, this resizing operation results in the loss of much of detail and texture information of the original image. During our training period, we use random cropping on the training data so that the model can learn image information of different regions. Data augmentation enables the model to learn richer feature representations and improves model generalization.

Bigger batch-size. The original PRL uses a minimum batch-size of 1. A smaller batch size will make the model training unstable and affect the convergence speed. We use the most commonly used batch-size 16.

Optimizer and Schedule. Unlike PRL, we utilize the AdamW optimizer (Loshchilov and Hutter, 2017) instead of Adam (Kingma and Ba, 2014) and optimizer momentum with ( $\beta_{1}=0.9,\beta_{2}=0.999$ ). For the learning rate decreasing strategy, we use the cosine annealing decay schedule (Loshchilov and Hutter, 2016) instead of linear decay.

4 Experiments

In this section, we first describe the datasets, evaluation metrics and training details. We then show the impact of different training strategies in the same baseline PRL. We demonstrate the performance of the different models later.

4.1 Datasets and Implementation Details

Datasets and Metrics. Following the PRL, we use the VOC2012¹¹1http://host.robots.ox.ac.uk/pascal/VOC/. dataset (Everingham et al, 2015), which includes over 17000 images. We randomly select 13841 images for training and 3000 nonoverlapping images for validation, where excluding images patch size smaller than 256 $\times$ 256. We evaluate the model on the Place365²²2http://places2.csail.mit.edu/. small test dataset (Zhou et al, 2017), In addition, some classic images, such as Lena, Barbara and Baboon, and the Kodak³³3http://r0k.us/graphics/kodak/. dataset are added to the test dataset. We also test five standard SR benchmark datasets including Set5 (Bevilacqua et al, 2012), Set14 (Zeyde et al, 2010), BSD100 (Martin et al, 2001), Urban100 (Huang et al, 2015) and Manga109 (Matsui et al, 2017), where some images are properly cropped to fit the original PRL (Xia and Wong, 2018) model. In the experiment, the halftone images for all datasets are generated by the Floyd Steinberg error diffusion algorithm (Floyd, 1976). For the evaluation metrics, the peak signal to noise ratio (PSNR), and the structural similarity metric (SSIM) are used in all experiments. Our code and pre-trained models are available at https://github.com/FeiyuLi-cs/MSPRL.

Table 1: Comparison of training strategies between PRL and MSPRL.

Training config	PRL	MSPRL
System implement	TensorFlow	Pytorch
Dataset size	13K	13K
Data augment	✗	✓
Batch size	1	16
Image size	256	128
Epochs	150	347
Total iterations	1950K	300K
Channel dimension	64	48
Optimizer	Adam	AdamW
Optimizer momentum	$\beta_{1}=0.9,\beta_{2}=0.999$	$\beta_{1}=0.9,\beta_{2}=0.999$
Learning rate decay	$2e^{-4}\to 2e^{-6}$	$2e^{-4}\to 1e^{-6}$
Learning rate schedule	Linearly decay	Cosine decay
Loss function	L₁+MSE+Perceptual Loss	L₁+FFT Loss
\botrule

Training detail. In the training process, the batch-size is set to 16, and then the sampled images are randomly cropped to 128 $\times$ 128. For data augmentation, each image patch is horizontally flipped with a probability of 0.5. We use iterations instead of epochs to represent the training length. The model is trained by the AdamW optimizer (Loshchilov and Hutter, 2017) ( $\beta_{1}=0.9,\beta_{2}=0.999$ ) for 300K iterations. The initial learning rate is set to 0.0002, which gradually decays to $1e^{-6}$ with cosine annealing (Loshchilov and Hutter, 2016). The model training time is approximately 18 hours and runs on one Nvidia RTX 3090 GPU.

4.2 Ablation study

In this section, we conduct experiments to show the effects of different modules, activation functions and feature blocks with our method. Our MSPRL model employs 8 residual blocks for each encoder and decoder. First, we evaluate the effectiveness of MSPRL without SFE and FF. The experimental results are shown in Tab. 2. The FF improves PSNR by 0.02 dB compared with SFE in the Kodak datasets, and the performance gain is further increased to 0.05 dB when we combine FF with SFE. The results show that aggregating feature maps from different encoders is more important than computing attention feature maps for our model learning.

Many vision networks adopt ReLU (Nair and Hinton, 2010) or LeakyReLU (Maas et al, 2013) as the activation function. In recent years, GELU (Hendrycks and Gimpel, 2016) has gradually become the first choice. Therefore, we test three activation functions to explore the best performance for our method. The experimental results are shown in Tab. 3. The results show the effect of different activation functions on model performance. ReLU performs better overall on multiple datasets; the results of LeakyReLU and GELU are close to ReLU but add some training time. Thus, we choose ReLU as the activation function in our model.

Table 2: Ablation study of SFE and FF.

SFE	FF	Place365		Kodak
SFE	FF	PSNR	SSIM	PSNR	SSIM
✓		30.76	0.9019	31.84	0.8897
	✓	30.76	0.9019	31.86	0.8897
✓	✓	30.77	0.9020	31.89	0.8898
\botrule

Table 3: Performance comparison of different activation functions.

Method	Place365		Kodak
Method	PSNR	SSIM	PSNR	SSIM
ReLU	30.77	0.9020	31.89	0.8898
LeakyReLU	30.76	0.9015	31.87	0.8894
GELU	30.76	0.9017	31.85	0.8896
\botrule

Besides, we also compared three common feature blocks: residual block (RB) (He et al, 2016), residual channel attention block (RCAB) (Zhang et al, 2018a) and residual-in-residual dense block (RRDB) (Wang et al, 2018) to explore the performance of encoder and decoder of MSPRL. Both RCAB and RRDB will increase the computational complexity, and RRDB will greatly increase the model parameters, while RB can maintain model performance between low computational complexity and parameters. Their parameters and performance comparisons are shown in Tab. 4.

Table 4: Comparison of PSNR performance of different feature blocks.

Method	Amounts	Total parameters	Place365	Kodak
RB	8	9681505	30.77	31.89
RCAB	8	9745489	30.79	31.85
RRDB	2	22082593	30.80	31.90
\botrule

4.3 Impact of Training Strategies

To explore the impact of training strategies, we conduct multiple experiments with different image sizes and loss functions using PRL and MSPRL models, respectively. We use the original PRL baseline and only use our different training strategies as shown in Tab. 1. The average improvement is approximately 1.5 dB in all test datasets, and is named PRL-dt. For image size, we found that the training time sharply decreased by approximately 70% when using images smaller than 128 pixels and that the performance was comparable to larger images of size 256. We assert that this phenomenon is due to data augmentation, random sampling and more iterations, which make the model learn as much feature information as from large images on small image sizes. For different loss functions, we minimize the fast Fourier transform loss in the frequency domain, so that the model can be further optimized and improved in image details compared to only using a single L1 loss function. The experimental results are shown in Tab. 6. Meanwhile, we also test the performance of our MSPRL with different numbers of channels and residual blocks in Tab. 5. The validation PSNR curves of the model under these different settings are shown in Fig. 5.

Table 5: Performance comparison of different number of channels and residual blocks (RBs).

Channels	RBs	Place365		Kodak
Channels	RBs	PSNR	SSIM	PSNR	SSIM
48	8	30.77	0.9020	31.89	0.8898
64	8	30.77	0.9022	31.89	0.8900
48	16	30.80	0.9025	31.93	0.8904
\botrule

Table 6: Performance comparison at different channels and image sizes. Green and blue show the best PSNR for PRL and MSPRL, respectively, under different settings. L1 means using a separate L1 loss function. PRL-dt uses the original PRL baseline, but in this paper uses different training strategies. Details are discussed in Sec. 4.3.

Model	Image size	Training time	Place365	Kodak
PRL	256 $\times$ 256	-	29.23	30.28
PRL-dt	256 $\times$ 256	2 Days	30.65	31.72
PRL-dt	128 $\times$ 128	17 Hours	30.65	31.71
MSPRL(L1)	128 $\times$ 128	18 Hours	30.75	31.82
MSPRL	256 $\times$ 256	2.2 Days	30.76	31.87
MSPRL	128 $\times$ 128	18 Hours	30.77	31.89
\botrule

Table 7: Performance comparison of inverse halftoning methods on different datasets. We improve the performance of PRL only using a different training strategy, which model is named PRL-dt. Our MSPRL outperforms all other models and is closer to the true grayscale image in detail recovery. The top two results are marked in red and blue.

Model	Place365		Kodak		Set5		Set14		BSD100		Urban100		Manga109
Model	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
DnCNN (Zhang et al, 2017)	30.31	0.8913	31.24	0.8759	33.26	0.9192	30.76	0.8812	29.72	0.8600	29.81	0.9031	33.44	0.9427
VDSR (Kim et al, 2016)	30.15	0.8868	30.97	0.8718	32.92	0.9134	30.44	0.8758	29.53	0.8555	29.34	0.8964	32.87	0.9391
EDSR (Lim et al, 2017)	30.48	0.8960	31.48	0.8830	33.42	0.9219	30.95	0.8857	29.86	0.8652	30.22	0.9106	33.90	0.9466
PRL (Xia and Wong, 2018)	29.23	0.8840	30.28	0.8722	32.06	0.9103	29.97	0.8746	28.99	0.8525	29.39	0.9017	32.55	0.9365
GGRL (Yuan et al, 2019)	30.46	0.8960	31.44	0.8830	-	-	-	-	29.85	0.8654	-	-	-	-
MIMOUNet (Cho et al, 2021)	30.56	0.8977	31.55	0.8855	33.54	0.9235	31.07	0.8883	29.91	0.8674	30.41	0.9140	34.21	0.9488
PRL-dt (ours)	30.65	0.9000	31.71	0.8875	33.70	0.9254	31.25	0.8904	30.01	0.8691	30.71	0.9183	34.50	0.9502
MSPRL (ours)	30.77	0.9020	31.89	0.8898	33.81	0.9264	31.40	0.8925	30.09	0.8708	31.10	0.9226	34.85	0.9518
\botrule

4.4 Performance comparison

We compare MSPRL with other inverse halftoning methods and CNN models of relevant vision tasks, such as DnCNN (Zhang et al, 2017), VDSR (Kim et al, 2016) and EDSR (Lim et al, 2017). The single baseline model in EDSR, which contains 16 residual blocks with 64 convolution kernel channels, is used in our task. We remove data pre/postprocessing and upscaling layers. For GGRL (Yuan et al, 2019), the public pretrained model is not available and their training dataset size is 8 times our dataset. Therefore, we only use the GGRL model in our training process, leading to some gaps in its performance compared to the original paper. In order to distinguish similar models, we also test MIMOUNet (Cho et al, 2021). For a fair comparison, these methods employ our training strategies. Because DnCNN, VDSR and EDSR adopt our training strategy, their results are higher than the values of the corresponding models trained in (Xia and Wong, 2018). The performance comparison is demonstrated in Tab. 7. The experimental results show that our MSPRL obtains the best performance on multiple datasets by 0.3 dB gain. Especially on the Urban100 dataset, MSPRL is 0.69 dB higher than MIMOUNet, meanwhile, other models outperform the original PRL due to our training strategies. We also changed the training strategy of PRL, named PRL-dt, and its model performance greatly improved compared with original PRL. The average PSNR on multiple datasets improved by approximately 1.5 dB only by changing the training strategy. Finally, MSPRL also outperforms PRL-dt on all datasets.

We show the visual comparisons in Fig. 6. Our MSPRL can obtain more obvious texture and structure information than PRL-dt, which effectively restores the image details. In the Lena image, the hat texture can be well restored using MSPRL, which is closer to the original real image. And, in the Barbara image, the cloth texture restoration of the other models shows more bending phenomena. In Fig. 7 (Row 2, 3 and 4), other models cannot restore the dense circle and point shape of the image, showing the line shape in different directions, while MSPRL can avoid this problem and reconstruct the image. Compared with other models, although the netted information loss of halftone image is very serious, MSPRL is still able to recover the main details, as shown in Fig. 8. In addition, the restoration visual effects of MSPRL in the architectures, letters and lines are more smooth and refined, which are shown in Fig. 9, Fig. 10, Fig. 11 and Fig. 12. Lastly, we also compare the restoration performance on the classic images shown in Tab. 8.

Table 8: Performance comparison of different inverse halftoning methods on some classic images with size of 512 (PSNR). The best results are marked in Bold.

Model	DnCNN	VDSR	EDSR	PRL	GGRL	MIMOUNet	PRL-dt	MSPRL
Baboon	24.73	24.59	24.85	24.50	24.83	24.98	25.03	25.12
Barbara	29.35	28.08	29.95	29.44	30.19	30.58	30.79	31.59
Boat	31.77	31.54	31.95	31.21	31.92	32.00	32.14	32.25
Couple	31.55	31.36	31.79	30.91	31.77	31.87	31.95	32.07
Goldhill	31.71	31.51	31.86	31.01	31.87	31.90	32.06	32.15
House	38.90	38.55	39.38	36.21	39.39	39.42	39.75	39.95
Lena	34.51	34.32	34.78	33.34	34.77	34.84	35.00	35.09
Man	31.86	31.68	31.97	30.96	31.97	32.00	32.08	32.15
Peppers	34.32	34.09	34.42	33.11	34.39	34.43	34.49	34.55
\botrule

5 Conclusion

In this paper, we present a multiscale progressively residual learning architecture network (MSPRL) for the inverse halftoning task. The encoder restores content information from different scale images and the decoder collects encoder features to extract deep features. The feature maps of the entire model are progressively learned. Our MSPRL is a simple and efficient model that can learn information of different scale images. In addition, we use suitable training strategies compared with many previous CNN-based inverse halftoning methods. We also explored the performance of the model between different settings and feature blocks. The experimental results demonstrate that our method outperforms the other model methods. Recently, many researchers have added colorization tasks to inverse halftoning; we will follow up research to restore better visual perception of color continuous-tone images in the future.

References

\bibcommenthead
Analoui and Allebach (1992) Analoui M, Allebach J (1992) New results on reconstruction of continuous-tone from halftone. In: Acoustics, Speech, and Signal Processing, IEEE International Conference on, IEEE Computer Society, pp 313–316
Bayer (1973) Bayer BE (1973) An optimum method for two-level rendition of continuous tone pictures. In: IEEE International Conference on Communications, June, 1973
Bevilacqua et al (2012) Bevilacqua M, Roumy A, Guillemot C, et al (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding
Catté et al (1992) Catté F, Lions PL, Morel JM, et al (1992) Image selective smoothing and edge detection by nonlinear diffusion. SIAM Journal on Numerical analysis 29(1):182–193
Chen et al (2022) Chen L, Chu X, Zhang X, et al (2022) Simple baselines for image restoration. arXiv preprint arXiv:220404676
Cho et al (2021) Cho SJ, Ji SW, Hong JP, et al (2021) Rethinking coarse-to-fine approach in single image deblurring. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4641–4650
Cochran et al (1967) Cochran WT, Cooley JW, Favin DL, et al (1967) What is the fast fourier transform? Proceedings of the IEEE 55(10):1664–1674
Cubuk et al (2020) Cubuk ED, Zoph B, Shlens J, et al (2020) Randaugment: Practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp 702–703
Dong et al (2014) Dong C, Loy CC, He K, et al (2014) Learning a deep convolutional network for image super-resolution. In: European conference on computer vision, Springer, pp 184–199
Eschbach and Knox (1991) Eschbach R, Knox KT (1991) Error-diffusion algorithm with edge enhancement. JOSA A 8(12):1844–1850
Everingham et al (2015) Everingham M, Eslami SA, Van Gool L, et al (2015) The pascal visual object classes challenge: A retrospective. International journal of computer vision 111:98–136
Floyd (1976) Floyd RW (1976) An adaptive algorithm for spatial gray-scale. In: Proc. Soc. Inf. Disp., pp 75–77
Goyal et al (2017) Goyal P, Dollár P, Girshick R, et al (2017) Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:170602677
Guo et al (2013) Guo JM, Liu YF, Chang JY, et al (2013) Efficient halftoning based on multiple look-up tables. IEEE transactions on image processing 22(11):4522–4531
He et al (2016) He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
He et al (2019) He T, Zhang Z, Zhang H, et al (2019) Bag of tricks for image classification with convolutional neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 558–567
Hendrycks and Gimpel (2016) Hendrycks D, Gimpel K (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:160608415
Hou and Qiu (2017) Hou X, Qiu G (2017) Image companding and inverse halftoning using deep convolutional neural networks. arXiv preprint arXiv:170700116
Huang et al (2015) Huang JB, Singh A, Ahuja N (2015) Single image super-resolution from transformed self-exemplars. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5197–5206
Huang et al (2008) Huang WB, Su AW, Kuo YH (2008) Neural network based method for image halftoning and inverse halftoning. Expert Systems with Applications 34(4):2491–2501
Kim et al (2016) Kim J, Lee JK, Lee KM (2016) Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1646–1654
Kingma and Ba (2014) Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:14126980
Kite et al (2000) Kite TD, Damera-Venkata N, Evans BL, et al (2000) A fast, high-quality inverse halftoning algorithm for error diffused halftones. IEEE Transactions on Image Processing 9(9):1583–1592
Knuth (1987) Knuth DE (1987) Digital halftones by dot diffusion. ACM Transactions on Graphics (TOG) 6(4):245–273
Lim et al (2017) Lim B, Son S, Kim H, et al (2017) Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 136–144
Lin et al (2022) Lin Z, Garg P, Banerjee A, et al (2022) Revisiting rcan: Improved training for image super-resolution. arXiv preprint arXiv:220111279
Liu et al (2010) Liu YF, Guo JM, Lee JD (2010) Inverse halftoning based on the bayesian theorem. IEEE Transactions on Image Processing 20(4):1077–1084
Loshchilov and Hutter (2016) Loshchilov I, Hutter F (2016) Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:160803983
Loshchilov and Hutter (2017) Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. arXiv preprint arXiv:171105101
Maas et al (2013) Maas AL, Hannun AY, Ng AY, et al (2013) Rectifier nonlinearities improve neural network acoustic models. In: Proc. icml, Atlanta, Georgia, USA, p 3
Martin et al (2001) Martin D, Fowlkes C, Tal D, et al (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, IEEE, pp 416–423
Matsui et al (2017) Matsui Y, Ito K, Aramaki Y, et al (2017) Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications 76(20):21,811–21,838
Mese and Vaidyanathan (2001) Mese M, Vaidyanathan PP (2001) Look-up table (lut) method for inverse halftoning. IEEE Transactions on Image Processing 10(10):1566–1578
Mulligan and Ahumada Jr (1992) Mulligan JB, Ahumada Jr AJ (1992) Principled halftoning based on human vision models. In: Human vision, visual processing, and digital display III, SPIE, pp 109–121
Nair and Hinton (2010) Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: Icml
Qian et al (2022) Qian G, Li Y, Peng H, et al (2022) Pointnext: Revisiting pointnet++ with improved training and scaling strategies. arXiv preprint arXiv:220604670
Seldowitz et al (1987) Seldowitz MA, Allebach JP, Sweeney DW (1987) Synthesis of digital holograms by direct binary search. Applied optics 26(14):2788–2798
Shao et al (2021) Shao L, Zhang E, Li M (2021) An efficient convolutional neural network model combined with attention mechanism for inverse halftoning. Electronics 10(13):1574
Shi et al (2016) Shi W, Caballero J, Huszár F, et al (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1874–1883
Son and Choo (2014) Son CH, Choo H (2014) Local learned dictionaries optimized to edge orientation for inverse halftoning. IEEE Transactions on Image Processing 23(6):2542–2556
Unal and Çetin (2001) Unal GB, Çetin AE (2001) Restoration of error-diffused images using projection onto convex sets. IEEE transactions on image processing 10(12):1836–1841
Wang et al (2018) Wang X, Yu K, Wu S, et al (2018) Esrgan: Enhanced super-resolution generative adversarial networks. In: Proceedings of the European conference on computer vision (ECCV) workshops, pp 0–0
Wong (1995) Wong PW (1995) Inverse halftoning and kernel estimation for error diffusion. IEEE Transactions on Image Processing 4(4):486–498
Xia and Wong (2018) Xia M, Wong TT (2018) Deep inverse halftoning via progressively residual learning. In: Asian Conference on Computer Vision, Springer, pp 523–539
Xia et al (2021) Xia M, Hu W, Liu X, et al (2021) Deep halftoning with reversible binary pattern. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 14,000–14,009
Xiao et al (2017) Xiao Y, Pan C, Zhu X, et al (2017) Deep neural inverse halftoning. In: 2017 International Conference on Virtual Reality and Visualization (ICVRV), IEEE, pp 213–218
Yen et al (2021) Yen YT, Cheng CC, Chiu WC (2021) Inverse halftone colorization: Making halftone prints color photos. In: 2021 IEEE International Conference on Image Processing (ICIP), IEEE, pp 1734–1738
Yuan et al (2019) Yuan J, Pan C, Zheng Y, et al (2019) Gradient-guided residual learning for inverse halftoning and image expanding. IEEE Access 8:50,995–51,007
Zamir et al (2022) Zamir SW, Arora A, Khan S, et al (2022) Restormer: Efficient transformer for high-resolution image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5728–5739
Zeyde et al (2010) Zeyde R, Elad M, Protter M (2010) On single image scale-up using sparse-representations. In: International conference on curves and surfaces, Springer, pp 711–730
Zhang et al (2017) Zhang K, Zuo W, Chen Y, et al (2017) Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing 26(7):3142–3155
Zhang et al (2018a) Zhang Y, Li K, Li K, et al (2018a) Image super-resolution using very deep residual channel attention networks. In: Proceedings of the European conference on computer vision (ECCV), pp 286–301
Zhang et al (2018b) Zhang Y, Zhang E, Chen W, et al (2018b) Sparsity-based inverse halftoning via semi-coupled multi-dictionary learning and structural clustering. Engineering Applications of Artificial Intelligence 72:43–53
Zhao et al (2016) Zhao H, Gallo O, Frosio I, et al (2016) Loss functions for image restoration with neural networks. IEEE Transactions on computational imaging 3(1):47–57
Zhou et al (2017) Zhou B, Lapedriza A, Khosla A, et al (2017) Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence