
Generation and Recombination for Multifocus Image Fusion with Free Number of Inputs

Huafeng Li, Dan Wang, Yuxin Huang, Yafei Zhang and Zhengtao Yu This work was supported in part by the National Natural Science Foundation of China (61966021, 62276120, 62161015), and the Yunnan Fundamental Research Projects (202301AV070004). H. Li, D. Wang, Y. Huang, Y. Zhang and Z. Yu are with the Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China. (E-mail: lhfchina99@kust.edu.cn (H. Li); wd97@kust.edu.cn (D. Wang)) Manuscript received xxxx;
Abstract

Multifocus image fusion is an effective way to overcome the limitations of optical lenses. Many existing methods obtain fused results by generating decision maps. However, such methods often assume that the focused areas of the two source images are complementary, making it impossible to fuse multiple images simultaneously. Additionally, existing methods ignore the impact of hard pixels on fusion performance, limiting the visual quality of the fused image. To address these issues, a model combining generation and recombination, termed GRFusion, is proposed. In GRFusion, the focus property of each source image is detected independently, enabling simultaneous fusion of multiple source images and avoiding the information loss caused by alternating fusion. This makes GRFusion independent of the number of inputs. To distinguish hard pixels in the source images, we determine them by considering the inconsistency among the focus detection results of the source images. Furthermore, a multi-directional gradient embedding method for generating full-focus images is proposed. Subsequently, a hard-pixel-guided recombination mechanism for constructing the fused result is devised, effectively integrating the complementary advantages of feature reconstruction-based and focused pixel recombination-based methods. Extensive experimental results demonstrate the effectiveness and superiority of the proposed method. The source code will be released on https://github.com/xxx/xxx.

Index Terms:
Multifocus image fusion; Image generation; Free number of inputs; Hard pixel detection.

I Introduction

Multifocus image fusion aims to integrate the information of several images captured from the same scene with different focus settings into a single image in which all regions are in focus. This technique has received wide attention from researchers because it can overcome the depth-of-field limitation of optical lenses and produce an image with all objects in focus. Roughly, existing fusion methods can be classified into two main categories: feature reconstruction-based methods and focused pixel recombination-based methods.

Feature reconstruction-based methods extract features from the source images as the initial step, then fuse these features using specific fusion strategies. Finally, the fused features are used to generate the fusion result. Typical feature extraction methods include multiscale geometric analysis (such as the wavelet transform [1] and the multiscale contourlet transform [2]) and deep neural networks [3, 4]. Fused images obtained by feature reconstruction-based methods often exhibit high visual quality and can effectively avoid block artifacts. However, since the fusion result is reconstructed or generated from the fused features, there is still a slight gap between the fused image and the ideal result. Focused pixel recombination-based methods typically involve detection and recombination of focused pixels. Although these methods usually achieve better objective evaluation scores, they are prone to introducing artifacts due to the ambiguity of pixel focus properties, degrading the visual quality of the fusion result.

Figure 1: Fusion strategies for multiple images. (a) Alternating fusion strategy used in existing methods. (b) Fusion strategy in this paper.

To integrate the advantages of these two types of methods, Liu et al. [5] proposed to combine deep residual learning and focus property detection for the fusion of multifocus images. This method assumes that the focused regions of the two source images are complementary. Based on this assumption, the detection result of focused pixels in one source image can be obtained immediately once the focus property of the other source image is determined, which is a commonly used way to obtain the focus property detection of two source images. However, this strategy has the following limitations. Firstly, there are a large number of hard pixels in the source images, whose focus properties exhibit significant ambiguity. These pixels typically come from smooth regions or the boundaries of focused regions. Since it is difficult to determine whether such pixels are in focus, the detection result of focused pixels contains many misclassifications, and the quality of the fused image is compromised if this detection result is directly used to determine the focus property of the other source image. Secondly, these methods can only fuse two images at a time. When faced with the fusion of multiple images, they typically adopt an alternating fusion strategy, as shown in Fig. 1(a): two source images are fused first, and then the result of the previous stage is fused with the remaining source images one by one until all source images are fused. Although this strategy is feasible, it may lead to information loss due to repeated feature extraction and fusion, which is detrimental to the final fused image.

To address the aforementioned issues, we propose a multifocus image fusion method named GRFusion that places no restriction on the number of input images. Specifically, the focus property of each source image is determined independently with the assistance of the remaining source images. Therefore, the detection result of one source image no longer affects that of the other images. Owing to this independent focus property detection, GRFusion can fuse an arbitrary number of images, reducing the risk of information loss caused by alternating fusion. Additionally, to integrate the complementary advantages of feature reconstruction-based and pixel recombination-based methods, a hard-pixel-guided fusion result construction mechanism is devised, which effectively reduces the negative impact of hard pixels on the fusion result. Technically, this mechanism determines the origin of each pixel in the fused image according to whether its focus property is determined. For a pixel whose focus property is determined, the corresponding pixel in the fused result is taken from the corresponding source image. Otherwise, the pixel in the fused image is taken from the result generated by a feature reconstruction-based method with the multi-directional gradient embedding proposed in this paper. In summary, the three main contributions of this paper are as follows.

  • A hard pixel detection method is designed to meet the requirement of parallel fusion of multiple images, where hard pixels are identified by finding outliers among the focus property detection results of the source images. Owing to the independent pixel focus property detection, the developed method is independent of the number of source images.

  • By integrating the advantages of feature reconstruction-based and pixel recombination-based methods into a single fusion framework, a novel multifocus image fusion approach termed GRFusion is devised. Unlike existing methods that only apply this combined fusion strategy in the transition region between focused and defocused areas, GRFusion applies it to all hard pixels, thus avoiding the negative impact of focus detection errors on the fused result.

  • Experimental results on both visual quality and objective evaluation demonstrate the effectiveness and superiority of our method in comparison to 9 well-known multifocus image fusion methods.

The rest of the paper is structured as follows: Section II briefly reviews the related work; Section III describes the details of the proposed method; Section IV analyzes the experimental results; and Section V concludes the paper.

II Related Work

II-A Feature Reconstruction-Based Multifocus Image Fusion

Feature reconstruction-based methods usually fuse the extracted multifocus image features and then generate the fused images from the fused features. In such methods, the representation of image features is a crucial factor that influences the fused image. Currently, commonly used image feature representations include multi-scale decomposition (MSD) [6, 1, 7, 8, 9], sparse representation (SR) [10, 11, 12, 13, 14, 15, 16, 17], and deep learning (DL) [18, 4, 19, 20, 21, 22, 23, 24]. MSD-based methods often employ multiscale image analysis methods such as the Wavelet Transform [6, 1], the Nonsubsampled Contourlet Transform [7, 2], and the Curvelet Transform [8] to represent the source images. However, these methods employ fixed bases to represent the input image, so their sparse representation ability is weak. SR-based methods have been widely used in multifocus image fusion since they can learn a set of bases from training samples to sparsely represent the image. Specifically, Yang et al. [14] proposed an SR-based method for restoration and fusion of noisy multifocus images. Zhang et al. [15] developed an analysis-synthesis dictionary pair learning method for fusion of multi-source images, including multifocus images. Li et al. [16] devised an edge-preserving method for fusion of multi-source noisy images, which can perform fusion and denoising of noisy multifocus images in one framework.

Figure 2: Overall framework of the proposed method. Here we take the fusion of three images as an example. In the proposed method, the focus property of each image is detected independently under the guidance of the remaining images, so the method is not restricted by the number of input images.

In DL-based methods, the feature reconstruction-based approach generally includes three steps: feature extraction, feature fusion, and fused image reconstruction. Since the fusion result is reconstructed from fused features, there may be some errors between the reconstructed and the ideal results, which indicates that the result obtained by feature reconstruction is not ideal. To alleviate this issue, Zhao et al. [18] proposed a CNN-based multi-level supervised network for multifocus image fusion, which utilizes multi-level supervision to refine the extracted features. Zang et al. [19] introduced spatial and channel attention to enhance the salient information of the source images, effectively avoiding the loss of details. To address the challenge raised by the absence of ground truth, Jung et al. [20] proposed an unsupervised CNN framework for fusion of multi-source images, where an unsupervised loss function was designed for training.

In the unsupervised learning setting, to prevent the loss of source image details, Mustafa et al. [25] added dense connections into the feature extraction network at different levels, improving the visual quality of the fused images. Hu et al. [22] proposed a mask prior network and a deep image prior network to achieve multifocus image fusion with zero-shot learning, thereby solving the problem caused by the inconsistency between synthetic and real images. Although feature reconstruction-based methods can generate natural-looking images, there is still a certain gap between the generated and the ideal results due to the information loss during feature extraction and image reconstruction.

II-B Pixel Recombination-Based Multifocus Image Fusion

Pixel recombination-based methods usually combine the focused pixels of the source images to construct the fused image. Therefore, accurately detecting the pixels in the focused area is critical to such methods. To this end, Liu et al. [26] proposed a twin network trained with a large number of samples generated by Gaussian filtering to detect focused pixels. Guo et al. [27] employed a fully connected network to explore the relationship between pixels in the entire image, enabling the determination of pixel focus properties. Ma et al. [28] introduced an encoder-decoder feature extraction network consisting of SEDense Blocks to extract features from source images, and focused pixels are then distinguished by measuring the spatial frequency of the features. Xiao et al. [29] proposed a U-Net with global-feature encoding to determine the focus property of each pixel. Ma et al. [30] proposed SMFuse for focused pixel detection, where a repeated blur map and a guided filter are both used to generate the binary mask. Considering the complementarity of features under different receptive fields, Liu et al. [31] proposed a multiscale feature interactive network to improve the detection of focused pixels.

The aforementioned methods usually construct the fused image from the detected focused pixels. However, it is extremely challenging to determine the focus property of hard pixels located in smooth regions or at the boundaries of focused regions, and severe artifacts will be introduced into the fused result if the fused image is obtained in this way. To solve this problem, Li et al. [32] proposed an effective multifocus image fusion method based on non-local means filtering, in which the detection result of focused pixels in the transition regions between focused and defocused areas is smoothed by non-local means filtering, and the smoothed result is then used to guide the fusion process. Similarly, Liu et al. [5] combined feature reconstruction with focused pixel recombination and developed an effective method for fusing multifocus images. Although the above methods are effective, they ignore the negative impact of hard pixels on fusion quality. Additionally, most existing methods assume that the number of input images is two and that their focus properties are complementary. As a result, they cannot fuse multiple images in parallel. The alternating fusion strategy is a common workaround, but it tends to lose information from the source images. In contrast, the developed GRFusion can fuse multiple images simultaneously and effectively solves the problems caused by hard pixels.

III Proposed Method

III-A Overview

As shown in Fig. 2, the proposed method mainly consists of three components: hard pixel detection (HPD), full-focus image generation with multi-directional feature embedding (FFIG-MFE), and fusion result construction guided by hard pixels (FRC-G-HP). HPD is mainly used to determine which pixels come from focus-ambiguous areas, which allows us to adopt a specific strategy for fusing hard pixels. FFIG-MFE is used to generate an all-in-focus image from the fused features of the source images. FRC-G-HP combines the all-in-focus result and the source images to generate the final fused result under the guidance of the focused pixel detection results. In the following subsections, we elaborate on these components.

III-B Hard Pixel Detection

III-B1 Focus Modulation

As shown in Fig. 2, the HPD is mainly composed of two sub-modules: focus modulation (FM) and multi-scale feature aggregation (MSFA). Meanwhile, convolutional layers are embedded between FM and MSFA to further extract features. FM is mainly used to adjust the discriminability of focused features and improve the detection accuracy of the pixel focus property; its process is shown in Fig. 3. It consists of a source image feature extractor (SIFE) and a modulation parameter generator (MPG). Let $input_{i}$ and $input_{i^{\prime}}$ be the inputs of FM, where $input_{i}$ denotes the $i$-th source image and $input_{i^{\prime}}$ denotes the sum of the remaining source images; $input_{i^{\prime}}$ mainly provides assistance for the focus detection of $input_{i}$. SIFE is used to extract features from $input_{i}$ and $input_{i^{\prime}}$, and it consists of a convolutional operation, batch normalization and a ReLU activation function. MPG is used to generate the modulation parameters that highlight the effect of important features. Technically, we first obtain the edge details of the two inputs by:

$edge_{l}=input_{l}-input_{l}\circledast\bm{G},\quad(l=i,i^{\prime}),$   (1)

where $\bm{G}$ denotes the Gaussian blur kernel and $\circledast$ denotes the convolution operation. In order to highlight the focus property of each pixel in the source images and prevent the values in $edge_{l}$ from being dispersed, a normalization is performed on $edge_{l}$:

$\bm{B}_{l}=\dfrac{edge_{l}-\min(edge_{l})}{\max(edge_{l})-\min(edge_{l})},\quad(l=i,i^{\prime}).$   (2)

To make full use of the focus information in $edge_{l}$, we reverse the normalization result to obtain:

$\bm{R}_{l}=\bm{1}-\bm{B}_{l},$   (3)

where $\bm{1}$ denotes an all-one matrix with the same size as $\bm{R}_{l}$. Since $\bm{B}_{i}$ and $\bm{R}_{i^{\prime}}$ share the same focus property, their concatenation can highlight the effect of the features in the focused region.
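For concreteness, the following is a minimal PyTorch sketch of Eqs. (1)-(3), assuming single-channel inputs and an illustrative 5×5 Gaussian kernel (the kernel size and standard deviation are not specified above); the normalization is applied over the whole tensor for simplicity.

import torch
import torch.nn.functional as F

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> torch.Tensor:
    # Separable Gaussian built as an outer product, normalized to sum to 1
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    kernel = torch.outer(g, g)
    return (kernel / kernel.sum()).view(1, 1, size, size)

def edge_and_reverse(inp: torch.Tensor, eps: float = 1e-8):
    """inp: (B, 1, H, W). Returns the normalized edge map B_l and its reverse R_l."""
    kernel = gaussian_kernel().to(inp.device)
    blurred = F.conv2d(inp, kernel, padding=kernel.shape[-1] // 2)
    edge = inp - blurred                                        # Eq. (1)
    b = (edge - edge.min()) / (edge.max() - edge.min() + eps)   # Eq. (2)
    return b, 1.0 - b                                           # Eq. (3)

# usage:
# b_i, r_i = edge_and_reverse(input_i)        # i-th source image
# b_ip, r_ip = edge_and_reverse(input_ip)     # sum of the remaining sources
# guide = torch.cat([b_i, r_ip], dim=1)       # [B_i, R_i'] sent to MPG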

Figure 3: The detailed architecture of FM.

For this reason, we send the concatenated features $[\bm{B}_{i},\bm{R}_{i^{\prime}}]$ to two sub-networks, each consisting of a $3\times 3$ convolution and a sigmoid activation function, to generate the scale factor $\bm{\alpha}_{i}$ and the transform parameter $\bm{\beta}_{i}$, respectively:

$\bm{\alpha}_{i}={\rm sigmoid}(Conv_{3\times 3}([\bm{B}_{i},\bm{R}_{i^{\prime}}])),\quad\bm{\beta}_{i}={\rm sigmoid}(Conv_{3\times 3}([\bm{B}_{i},\bm{R}_{i^{\prime}}])),$   (4)

where $[\cdot,\cdot]$ denotes the concatenation operation along the channel dimension and $Conv_{3\times 3}$ denotes a $3\times 3$ convolutional layer. It should be noted that the parameters of the two convolutional layers in Eq. (4) are not shared.

To prevent over-modulation by $\bm{\alpha}_{i}$ and $\bm{\beta}_{i}$ from causing irreparable effects on the features, the parameters $\bm{\alpha}_{i}$ and $\bm{\beta}_{i}$ are only applied to the first half of the feature channels of $input_{l}\ (l=i,i^{\prime})$. In detail, $input_{l}\ (l=i,i^{\prime})$ is sent into a feature extraction network consisting of a $3\times 3$ convolutional layer, BN and a ReLU activation function, and the result obtained after channel split can be expressed as:

$(\overleftarrow{\bm{F}_{i}},\overrightarrow{\bm{F}_{i}})={\rm split}(ReLU(BN(Conv_{3\times 3}(input_{i})))_{\times 2}),$   (5)

where $\overleftarrow{\bm{F}_{i}}$ and $\overrightarrow{\bm{F}_{i}}$ denote the first and second halves of the features along the channel dimension, respectively. The result after modulation by $\bm{\alpha}_{i}$ and $\bm{\beta}_{i}$ is:

$\bm{F}_{i}^{\text{mod}}=[\bm{\alpha}_{i}\odot\overleftarrow{\bm{F}_{i}}+\bm{\beta}_{i},\ \overrightarrow{\bm{F}_{i}}].$   (6)

$\bm{F}_{i}^{\text{mod}}$ denotes the result after focus modulation. To use $\bm{F}_{i}^{\text{mod}}$ and $\bm{F}_{i^{\prime}}^{\text{mod}}$ jointly, they are concatenated and then fed into MSFA to further extract the features of focused pixels.
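A minimal sketch of the modulation in Eqs. (4)-(6) is given below. The feature channel count and the single Conv-BN-ReLU block are illustrative assumptions; the two 3×3 convolutions that produce the scale and transform parameters do not share weights, as noted above.

import torch
import torch.nn as nn

class FocusModulation(nn.Module):
    def __init__(self, feat_channels: int = 64):
        super().__init__()
        # Unshared 3x3 convolutions generating the modulation parameters, Eq. (4)
        self.alpha_conv = nn.Conv2d(2, feat_channels // 2, 3, padding=1)
        self.beta_conv = nn.Conv2d(2, feat_channels // 2, 3, padding=1)
        # Conv-BN-ReLU feature extractor applied to the source input, Eq. (5)
        self.extract = nn.Sequential(
            nn.Conv2d(1, feat_channels, 3, padding=1),
            nn.BatchNorm2d(feat_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, inp_i, b_i, r_ip):
        guide = torch.cat([b_i, r_ip], dim=1)                  # [B_i, R_i']
        alpha = torch.sigmoid(self.alpha_conv(guide))
        beta = torch.sigmoid(self.beta_conv(guide))
        first, second = self.extract(inp_i).chunk(2, dim=1)    # channel split
        # Only the first half of the channels is modulated, Eq. (6)
        return torch.cat([alpha * first + beta, second], dim=1)

# f_mod_i = FocusModulation()(input_i, b_i, r_ip)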

III-B2 Multi-Scale Feature Aggregation

Figure 4: The detailed architecture of MSFA.

After FM, the features in the focused region are highlighted but are still not sufficient to correctly determine the focus property of pixels. To this end, MSFA is designed as shown in Fig. 4. In this module, the saliency of the features in the focused region is improved by interactive enhancement of features extracted under different receptive fields. Methodologically, MSFA mainly consists of a multiscale convolution, BN, spatial attention (SA) [33], channel attention (CA) [33], a $1\times 1$ convolution, and multiscale cross-attention. In detail, we send $[\bm{F}_{i}^{\text{mod}},\bm{F}_{i^{\prime}}^{\text{mod}}]$ into MSFA, and the outputs of BN can be formulated as:

$\bm{F}_{i}^{3\times 3}=BN(Conv_{3\times 3}([\bm{F}_{i}^{\text{mod}},\bm{F}_{i^{\prime}}^{\text{mod}}])),\quad\bm{F}_{i}^{7\times 7}=BN(Conv_{7\times 7}([\bm{F}_{i}^{\text{mod}},\bm{F}_{i^{\prime}}^{\text{mod}}])).$   (7)

To highlight the important information of $\bm{F}_{i}^{3\times 3}$ and $\bm{F}_{i}^{7\times 7}$ for hard pixel detection, $\bm{F}_{i}^{3\times 3}$ and $\bm{F}_{i}^{7\times 7}$ are sent into SA and CA and then integrated by a $1\times 1$ convolution:

$\bar{\bm{F}}_{i}^{3\times 3}=Conv_{1\times 1}([CA(\bm{F}_{i}^{3\times 3}),SA(\bm{F}_{i}^{3\times 3})]),\quad\bar{\bm{F}}_{i}^{7\times 7}=Conv_{1\times 1}([CA(\bm{F}_{i}^{7\times 7}),SA(\bm{F}_{i}^{7\times 7})]).$   (8)

Features under different receptive fields not only have consistent focus properties but also complement each other in focus property detection. If the features of the same image at different scales are used to assist each other and enhance the focus information, the saliency of the focused-area features will be effectively highlighted [34], which is beneficial for determining the pixel focus property. To this end, a feature saliency cross-enhancement mechanism is introduced into MSFA. Specifically, $\tilde{\bm{F}}_{i}^{3\times 3}$ and $\tilde{\bm{F}}_{i}^{7\times 7}$ are obtained after linear mappings of $\bar{\bm{F}}_{i}^{3\times 3}$ and $\bar{\bm{F}}_{i}^{7\times 7}$:

$\tilde{\bm{F}}_{Q,i}^{3\times 3}=\bm{W}_{Q}^{3}\bar{\bm{F}}_{i}^{3\times 3},\ \tilde{\bm{F}}_{K,i}^{3\times 3}=\bm{W}_{K}^{3}\bar{\bm{F}}_{i}^{3\times 3},\ \tilde{\bm{F}}_{V,i}^{3\times 3}=\bm{W}_{V}^{3}\bar{\bm{F}}_{i}^{3\times 3};\quad\tilde{\bm{F}}_{Q,i}^{7\times 7}=\bm{W}_{Q}^{7}\bar{\bm{F}}_{i}^{7\times 7},\ \tilde{\bm{F}}_{K,i}^{7\times 7}=\bm{W}_{K}^{7}\bar{\bm{F}}_{i}^{7\times 7},\ \tilde{\bm{F}}_{V,i}^{7\times 7}=\bm{W}_{V}^{7}\bar{\bm{F}}_{i}^{7\times 7},$   (9)

where $\bm{W}_{l^{\prime}}^{k}\ (l^{\prime}=Q,K,V;\ k=3,7)$ is a linear transformation matrix. The dimension of $\tilde{\bm{F}}_{l^{\prime},i}^{k}\in\mathbb{R}^{C\times H\times W}\ (l^{\prime}=Q,K;\ k=3,7)$ is changed to $C\times(HW)$ after a reshaping operation. To enhance the features of the focused region, the feature saliency cross-enhancement mechanism is formulated as:

$\hat{\bm{F}}_{i}^{3\times 3}={\rm softmax}\!\left(\frac{\tilde{\bm{F}}_{Q,i}^{3\times 3}\times(\tilde{\bm{F}}_{K,i}^{7\times 7})^{T}}{\sqrt{d}}\right)\times\tilde{\bm{F}}_{V,i}^{3\times 3}+\tilde{\bm{F}}_{i}^{3\times 3},\quad\hat{\bm{F}}_{i}^{7\times 7}={\rm softmax}\!\left(\frac{\tilde{\bm{F}}_{Q,i}^{7\times 7}\times(\tilde{\bm{F}}_{K,i}^{3\times 3})^{T}}{\sqrt{d}}\right)\times\tilde{\bm{F}}_{V,i}^{7\times 7}+\tilde{\bm{F}}_{i}^{7\times 7},$   (10)

where $T$ denotes the transpose operation. In $\hat{\bm{F}}_{i}^{3\times 3}$ and $\hat{\bm{F}}_{i}^{7\times 7}$, a larger value indicates a more significant focus property. Therefore, Eq. (11) is used to synthesize the significant information:

$\hat{\bm{F}}_{i}(\cdot,\cdot)=\begin{cases}\hat{\bm{F}}_{i}^{3\times 3}(\cdot,\cdot), & {\rm if}\ abs(\hat{\bm{F}}_{i}^{3\times 3}(\cdot,\cdot))\geq abs(\hat{\bm{F}}_{i}^{7\times 7}(\cdot,\cdot)),\\ \hat{\bm{F}}_{i}^{7\times 7}(\cdot,\cdot), & {\rm otherwise},\end{cases}$   (11)

where $abs(\cdot)$ denotes the absolute value operation.
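The cross-scale enhancement of Eqs. (9)-(11) can be sketched as follows. It assumes the two branch outputs of Eq. (8) are already available, uses the branch input itself as the residual term of Eq. (10), and takes the scaling factor d as the channel number; all three are illustrative choices, and the CA/SA blocks are omitted.

import torch
import torch.nn as nn

class CrossScaleEnhance(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Linear mappings W_Q, W_K, W_V for each scale, Eq. (9)
        self.q3, self.k3, self.v3 = (nn.Conv2d(channels, channels, 1) for _ in range(3))
        self.q7, self.k7, self.v7 = (nn.Conv2d(channels, channels, 1) for _ in range(3))

    @staticmethod
    def _attend(q, k, v, skip):
        b, c, h, w = q.shape
        q, k, v = q.flatten(2), k.flatten(2), v.flatten(2)       # (B, C, HW)
        attn = torch.softmax(q @ k.transpose(1, 2) / (c ** 0.5), dim=-1)
        return (attn @ v).view(b, c, h, w) + skip                # Eq. (10)

    def forward(self, f3, f7):
        # Queries of one scale attend to keys of the other scale
        e3 = self._attend(self.q3(f3), self.k7(f7), self.v3(f3), f3)
        e7 = self._attend(self.q7(f7), self.k3(f3), self.v7(f7), f7)
        # Keep the response with larger magnitude at each position, Eq. (11)
        return torch.where(e3.abs() >= e7.abs(), e3, e7)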

III-B3 Hard Pixel Detection

For focused pixel detection, the features output by FM are sent to an encoder and a decoder to obtain the focus property detection result of each source image, where the encoder consists of five MSFAs and convolutional layers, and the decoder consists of only five convolutional layers. Thanks to the independent focus detection of each source image, the proposed method can fuse multiple source images in parallel. In this process, pixels with ambiguous focus properties need to be fused separately.

Let $\bm{M}_{i}$ be the focus property detection result of the source image $\bm{I}_{i}$, where $\bm{M}_{i}$ is a binary mask comprised of 0 and 1: 1 indicates that the corresponding pixel in $\bm{I}_{i}$ is focused, and 0 indicates that it is defocused. Since each source image has an independent detection result, hard pixels can be identified by finding inconsistencies among the detection results. For $N$ source images, $\bm{M}_{1}(x,y)+\bm{M}_{2}(x,y)+\cdots+\bm{M}_{N}(x,y)=1$ means that the focus property of the pixel at $(x,y)$ can be determined; otherwise, the pixel at $(x,y)$ is regarded as a hard pixel. This process can be formulated as:

$\bm{M}_{h}(x,y)=\begin{cases}0, & {\rm if}\ \bm{M}_{1}(x,y)+\bm{M}_{2}(x,y)+\cdots+\bm{M}_{N}(x,y)=1,\\ 1/2, & {\rm otherwise},\end{cases}$   (12)

where $\bm{M}_{h}(x,y)=0$ indicates that the focus property of the pixel at location $(x,y)$ is determined, and $\bm{M}_{h}(x,y)=1/2$ indicates that the pixel at $(x,y)$ is a hard pixel.
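Given the N binary focus maps, Eq. (12) reduces to a per-pixel voting rule; a minimal sketch follows.

import torch

def hard_pixel_mask(masks: torch.Tensor) -> torch.Tensor:
    """masks: (N, H, W) stack of binary focus maps M_1..M_N (0/1 valued)."""
    votes = masks.float().sum(dim=0)     # number of sources claiming the pixel is focused
    determined = (votes == 1).float()    # exactly one source in focus
    return 0.5 * (1.0 - determined)      # Eq. (12): 0 if determined, 1/2 otherwise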

III-C Full-focus Image Generation With Multi-directional Feature Embedding

Figure 5: Illustration of eight directional edge extraction kernels.

Due to the ambiguity of the focus properties of hard pixels, it is not optimal to construct the final fused result only by selecting focused pixels from the source images. To reduce the negative effect of hard pixels on the fusion result, we devise a full-focus image generation method with multi-directional feature embedding (FFIG-MFE) to construct the fusion result for hard pixels. FFIG-MFE consists of multi-directional edge extraction (MDEE), a weight generator and edge feature embedding (EFE). As shown in Fig. 2, the multi-directional edge extraction in FFIG-MFE is achieved by the designed multi-directional edge extraction kernels presented in Fig. 5. Specifically, the size of each kernel is $3\times 3$ with the value at the center position being 1. In FFIG-MFE, eight kernels corresponding to eight different directions are used to extract the edges from the source images. The multi-directional edges are denoted as:

$\bm{F}_{i,k_{n}}=abs(\bm{I}_{i}\circledast\bm{K}_{n}),\quad(n=1,2,\ldots,8),$   (13)

where $\bm{F}_{i,k_{n}}$ denotes the edge features of the source image $\bm{I}_{i}$ in the $n$-th direction.
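A minimal sketch of Eq. (13) is given below. The exact kernel values of Fig. 5 are not reproduced here; as an assumption, each 3×3 kernel keeps +1 at the center and places -1 at one of the eight neighbouring positions, giving one difference kernel per direction.

import torch
import torch.nn.functional as F

def directional_kernels() -> torch.Tensor:
    # One 3x3 kernel per direction: +1 at the center, -1 at one neighbour (assumed values)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    kernels = torch.zeros(8, 1, 3, 3)
    for n, (dy, dx) in enumerate(offsets):
        kernels[n, 0, 1, 1] = 1.0
        kernels[n, 0, 1 + dy, 1 + dx] = -1.0
    return kernels

def multi_directional_edges(image: torch.Tensor) -> torch.Tensor:
    """image: (B, 1, H, W) -> (B, 8, H, W) absolute edge responses F_{i,k_n}, Eq. (13)."""
    return F.conv2d(image, directional_kernels().to(image.device), padding=1).abs()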

Figure 6: The detailed architecture of the weight generator, where the residual branch is used to extract features under different receptive fields. More appropriate weights are generated by integrating features from multiple scales.

To fully utilize the features in the eight directions to generate a full-focus image, a weight generation mechanism is designed to dynamically produce the weights for integrating the multi-directional features. As shown in Fig. 6, the weight generator consists of a $1\times 1$ convolutional layer and a residual branch, which first employs a $1\times 1$ convolution to reduce the number of channels to $1/4$ of the original number, and then utilizes a $3\times 3$ convolution and a $3\times 3$ dilated convolution with dilation rate $d\ (d=1,3,5,7)$ to extract features under different receptive fields. The outputs of the $1\times 1$ convolutional layer and the residual branch are concatenated, and the weights $\omega_{i}^{n}(x,y)$ are then obtained after a $1\times 1$ convolution and softmax. With $\omega_{i}^{n}(x,y)$, the integrated multi-directional edge features can be represented as:

$\bm{F}_{edge,i}(x,y)=\sum_{n=1}^{8}\omega_{i}^{n}(x,y)\odot\bm{F}_{i,k_{n}}(x,y).$   (14)
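The weight generator of Fig. 6 and the weighted integration of Eq. (14) can be sketched as follows; a single dilation rate is shown for brevity, whereas the residual branch described above uses d = 1, 3, 5, 7, and the output channel layout is an illustrative assumption.

import torch
import torch.nn as nn

class DirectionWeightGenerator(nn.Module):
    def __init__(self, in_channels: int = 8, dilation: int = 3):
        super().__init__()
        reduced = max(in_channels // 4, 1)
        self.reduce = nn.Conv2d(in_channels, reduced, 1)         # channel reduction to 1/4
        self.residual = nn.Sequential(                           # features under larger receptive fields
            nn.Conv2d(reduced, reduced, 3, padding=1),
            nn.Conv2d(reduced, reduced, 3, padding=dilation, dilation=dilation),
        )
        self.project = nn.Conv2d(2 * reduced, in_channels, 1)    # back to one weight per direction

    def forward(self, edges: torch.Tensor) -> torch.Tensor:
        """edges: (B, 8, H, W) directional responses -> integrated edge feature, Eq. (14)."""
        r = self.reduce(edges)
        merged = torch.cat([r, self.residual(r)], dim=1)
        weights = torch.softmax(self.project(merged), dim=1)     # per-pixel weights over 8 directions
        return (weights * edges).sum(dim=1, keepdim=True)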

Since $\bm{F}_{edge,i}$ contains the focus property of the source image $\bm{I}_{i}$, we use it in EFE to highlight the effect of focused features when generating a full-focus image. As shown in Fig. 7, we extract features from $\bm{F}_{edge,i}$ using a $1\times 1$ convolution to make its dimension consistent with $\bm{F}_{i}$, where $\bm{F}_{i}$ denotes the result after $\bm{I}_{i}$ is processed by a convolutional operation, BN and a ReLU activation function. Let $\bm{f}_{edge,i}(x,y)$ be the feature vector composed of the elements from different channels of $\bm{F}_{edge,i}$ at location $(x,y)$. We measure the similarity between $\bm{f}_{edge,i}(x,y)$ and all feature vectors $\bm{f}_{i}(x^{\prime},y^{\prime})$ within the window $[x\pm\delta,y\pm\delta]$ by inner product. Then, the weight is obtained after softmax activation:

$\bm{D}_{i}(x^{\prime},y^{\prime})={\rm softmax}(\bm{f}_{edge,i}(x,y)\bm{f}_{i}^{T}(x^{\prime},y^{\prime})),$   (15)

where $(x^{\prime},y^{\prime})\in[x\pm\delta,y\pm\delta]$. The weight is used to scale all feature vectors $\bm{f}_{i}(x\pm\delta,y\pm\delta)$:

$\hat{\bm{f}}_{i}(x,y)=\sum_{y^{\prime}=y-\delta}^{y+\delta}\sum_{x^{\prime}=x-\delta}^{x+\delta}\bm{D}_{i}(x^{\prime},y^{\prime})\bm{f}_{i}(x^{\prime},y^{\prime}).$   (16)
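A minimal sketch of the window attention in Eqs. (15)-(16), gathering each position's (2δ+1)×(2δ+1) neighbourhood with unfold; δ and the channel count are illustrative.

import torch
import torch.nn.functional as F

def edge_feature_embedding(f_edge: torch.Tensor, f: torch.Tensor, delta: int = 1):
    """f_edge, f: (B, C, H, W). Returns the locally re-weighted features f_hat."""
    b, c, h, w = f.shape
    k = 2 * delta + 1
    # Neighbourhood features of every position: (B, C, k*k, H*W)
    neigh = F.unfold(f, kernel_size=k, padding=delta).view(b, c, k * k, h * w)
    query = f_edge.view(b, c, 1, h * w)
    sim = (query * neigh).sum(dim=1)                       # inner-product similarity, Eq. (15)
    weights = torch.softmax(sim, dim=1).unsqueeze(1)       # softmax over the window
    return (weights * neigh).sum(dim=2).view(b, c, h, w)   # weighted aggregation, Eq. (16)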

When $(x,y)$ traverses all pixels of the entire image, the features for generating a full-focus image are obtained. Let

$\hat{\bm{F}}(x,y)=\max\{\hat{\bm{f}}_{1}(x,y)+\bm{f}_{1}(x,y),\cdots,\hat{\bm{f}}_{N}(x,y)+\bm{f}_{N}(x,y)\}.$   (17)

Since $\bm{M}_{h}$ represents the difficulty of focused pixel detection, we concatenate it with $\hat{\bm{F}}$ and then feed the result to the decoder to generate the full-focus image $\bm{F}_{g}$:

$\bm{F}_{g}=\bm{E}_{d}([2\bm{M}_{h},\hat{\bm{F}}]),$   (18)

where $\bm{E}_{d}$ denotes the full-focus image decoder consisting of a residual block and a convolutional layer.

Figure 7: The detailed architecture of EFE.

III-D Fusion Result Construction Guided by Hard Pixels

With $\bm{M}_{h}$ and the full-focus image $\bm{F}_{g}$, a hard-pixel-guided fusion result construction is proposed. If a pixel is not a hard pixel, the corresponding pixel of the fused result is selected directly from the source images; otherwise, the corresponding pixel of $\bm{F}_{g}$ is selected. To this end, Eq. (19) is used to update the decision map for fusion result construction:

$\tilde{\bm{M}}_{i}(x,y)=\begin{cases}1, & {\rm if}\ \bm{M}_{i}(x,y)+\bm{M}_{h}(x,y)=1,\\ 0, & {\rm otherwise},\end{cases}$   (19)

where $\tilde{\bm{M}}_{i}(x,y)=1$ indicates that the pixel located at $(x,y)$ is a focused pixel of the source image $\bm{I}_{i}$. The final fused result guided by $\bm{M}_{h}$ can be expressed as:

$\bm{F}=\tilde{\bm{M}}_{1}\odot\bm{I}_{1}+\tilde{\bm{M}}_{2}\odot\bm{I}_{2}+\cdots+\tilde{\bm{M}}_{N}\odot\bm{I}_{N}+2\bm{M}_{h}\odot\bm{F}_{g}.$   (20)
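Eqs. (19)-(20) amount to a mask-guided recombination of source pixels and generated pixels; a minimal sketch:

import torch

def construct_fusion(sources, masks, m_h, full_focus):
    """sources, masks: lists of (H, W) tensors (source images and binary focus maps);
    m_h: hard-pixel map taking values 0 or 1/2; full_focus: generated image F_g."""
    fused = 2.0 * m_h * full_focus                 # hard pixels are taken from F_g
    for img, mask in zip(sources, masks):
        m_tilde = ((mask + m_h) == 1).float()      # Eq. (19)
        fused = fused + m_tilde * img              # Eq. (20)
    return fused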

III-E Loss Function

The loss function of the proposed method is divided into two parts: the hard pixel detection loss and the full-focus image generation loss.

Hard pixel detection loss: In HPD, the cross-entropy loss formulated in Eq. (21) is used to enable the network to detect the pixel focus property:

$\ell_{ce}=-\sum_{i=1}^{N}\sum_{x=1,y=1}^{H,W}\big(g_{i}(x,y)\log(p_{i}(x,y))+(1-g_{i}(x,y))\log(1-p_{i}(x,y))\big),$   (21)

where $p_{i}(x,y)$ denotes the probability that the pixel at position $(x,y)$ of the $i$-th source image is focused, and $g_{i}(x,y)$ is the corresponding label.

Full-focus image generation loss: To obtain a clear full-focus image, besides the global reconstruction loss, a local reconstruction loss is also applied to the hard pixel region:

$\ell_{rec}=\left\|\bm{F}_{g}-\bm{G}_{f}\right\|_{1}+\lambda\left\|\bm{M}_{h}\odot\bm{F}_{g}-\bm{M}_{h}\odot\bm{G}_{f}\right\|_{1},$   (22)

where $\bm{G}_{f}$ denotes the label of the generated full-focus image and $\lambda$ is a hyperparameter that balances the global and local loss terms.
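A minimal sketch of the two losses: the cross-entropy of Eq. (21) is written with PyTorch's binary cross-entropy, and both terms of Eq. (22) are mean-reduced here, which differs from the sums above only by a constant factor.

import torch.nn.functional as F

def detection_loss(probs, labels):
    """probs, labels: (N, B, H, W) focus probabilities and 0/1 ground truth, Eq. (21)."""
    return F.binary_cross_entropy(probs, labels, reduction='sum')

def generation_loss(f_g, g_f, m_h, lam: float = 0.05):
    """f_g: generated full-focus image; g_f: its label; m_h: hard-pixel map, Eq. (22)."""
    return F.l1_loss(f_g, g_f) + lam * F.l1_loss(m_h * f_g, m_h * g_f)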

IV Experiments

IV-A Datasets

To train GRFusion, we construct the training set by artificial synthesis. Similar to DRPL [35], 200 all-in-focus images selected from the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) [36] are used to synthesize the training samples. For data augmentation, each raw image is randomly cropped into 9 sub-images of size $256\times 256$. Let a $256\times 256$ sub-image be $\bm{I}$; a pair of multifocus images is synthesized by:

$\bm{I}_{1}=\bm{M}\odot\bm{I}+(\bm{1}-\bm{M})\odot\tilde{G}(\bm{I}),\quad\bm{I}_{2}=\bm{M}\odot\tilde{G}(\bm{I})+(\bm{1}-\bm{M})\odot\bm{I},$   (23)

where $\bm{M}$ denotes a binary mask randomly generated by the “findContours” function in OpenCV, and $\tilde{G}(\cdot)$ is a Gaussian filter with five different blur levels. In testing, the datasets used in our experiments are Lytro [37], MFI-WHU [38] and MFFW [39]. The details of the testing datasets are shown in Table I, and the source images used for visual comparison of the fused results are displayed in Fig. 8.
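As an illustration of the synthesis in Eq. (23), a minimal sketch follows; the binary mask is assumed to be given (e.g., from “findContours”), and a single blur level is shown although five differently blurred versions are used above.

import cv2
import numpy as np

def synthesize_pair(image: np.ndarray, mask: np.ndarray, sigma: float = 2.0):
    """image: all-in-focus crop I of shape (H, W) or (H, W, 3); mask: binary mask M of shape (H, W)."""
    blurred = cv2.GaussianBlur(image, (7, 7), sigma)     # stand-in for G~(I); kernel size assumed
    m = mask[..., None] if image.ndim == 3 else mask     # broadcast over color channels
    i1 = m * image + (1 - m) * blurred                   # Eq. (23), first synthetic source
    i2 = m * blurred + (1 - m) * image                   # Eq. (23), second synthetic source
    return i1.astype(image.dtype), i2.astype(image.dtype)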

Figure 8: Source images used for visual comparison of the fused results. (a) Source images from “Lytro”. (b) Source images from “MFI-WHU”. (c) Source images from “MFFW”.
TABLE I: The details of the testing datasets.
Dataset | Resolution | No. of source images per group | No. of groups
Lytro | 520×520 | 2 | 20
Lytro | 520×520 | 3 | 4
MFI-WHU | Various | 2 | 120
MFFW | Various | 2 | 13
MFFW | Various | 3 | 4
MFFW | Various | 4 | 1
MFFW | Various | 6 | 1

IV-B Implementation Details

The training process includes two stages: the training of HPD and the training of FFIG-MFE, where FFIG-MFE is trained after HPD. In the training of HPD, the total number of epochs is set to 600 with a batch size of 24. In the training of FFIG-MFE, the total number of epochs is also set to 600 and the batch size is set to 4. The loss functions in Eq. (21) and Eq. (22) are minimized with the Adam optimizer [40]. The learning rate is initialized to 0.0001 and the exponential decay rate is set to 0.9. The proposed method is implemented in the PyTorch framework, and the experiments are conducted on a desktop with a single GeForce RTX 3090.
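A minimal sketch of this training configuration, with a placeholder network standing in for HPD or FFIG-MFE; applying the exponential decay once per epoch is an assumption.

import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, 3, padding=1)   # placeholder standing in for HPD / FFIG-MFE
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(600):
    # ... one pass over the synthetic training set (batch size 24 for HPD, 4 for FFIG-MFE)
    scheduler.step()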

Figure 9: Fused results of different methods on the “lytro-01”, “lytro-03” and “lytro-05” image pairs. For each image pair, the images on the left are the fused results and those on the right are the differences between the fused image and one of the source images. For the boxed areas, less residual information indicates a better fusion result.
Figure 10: Fused results of different methods on the “MFI-WHU-11”, “MFI-WHU-17” and “MFI-WHU-05” image pairs. For each image pair, the images on the left are the fused results and those on the right are the differences between the fused image and one of the source images. For the boxed areas, less residual information indicates a better fusion result.
Figure 11: Fused results with multiple source images. The fusion results of “Lytro-triple series-03”, “MFFW3-03”, “MFFW3-06” and “MFFW3-02” are presented in the first, third, fifth and seventh columns.
TABLE II: Quantitative comparison of fusion results on “Lytro” and “MFI-WHU” datasets. The red font represents the optimal result, while the blue font represents the suboptimal result.
Methods | Lytro dataset ($Q_{MI}$ / $Q_{AB/F}$ / $Q_{CB}$ / $Q_{NCIE}$ / $Q_{SSIM}$) | MFI-WHU dataset ($Q_{MI}$ / $Q_{AB/F}$ / $Q_{CB}$ / $Q_{NCIE}$ / $Q_{SSIM}$)
MFF-GAN[38] 0.80995 0.64471 0.61022 0.82540 1.65349 0.76162 0.62241 0.60932 0.82352 1.64444
ZMFF[22] 0.90869 0.67569 0.69665 0.83001 1.72113 0.71858 0.61924 0.66719 0.82227 1.63110
MUFusion[41] 0.73928 0.61593 0.59090 0.82231 1.60150 0.65023 0.56568 0.56337 0.81991 1.56340
SwinFusion[42] 0.81836 0.66624 0.61071 0.82555 1.66042 0.76030 0.64815 0.63719 0.82306 1.64592
DRPL[35] 1.09872 0.73257 0.76223 0.84132 1.72350 1.07076 0.71212 0.78625 0.83999 1.71368
GACN[43] 1.12565 0.73649 0.77273 0.84336 1.72347 1.09467 0.70672 0.79610 0.84212 1.70880
SESF[28] 1.17177 0.74152 0.79114 0.84681 1.72534 1.20181 0.72097 0.80893 0.84908 1.71498
SMFuse[30] 1.16521 0.74007 0.77957 0.84628 1.72435 1.17048 0.72933 0.81081 0.84724 1.71463
MSFIN[31] 1.18912 0.74739 0.79569 0.84747 1.72547 1.19948 0.72820 0.82107 0.84911 1.71533
GRFusion 1.18998 0.74742 0.79497 0.84798 1.72495 1.20633 0.73382 0.82473 0.85014 1.71536

IV-C Comparison Methods and Evaluation Metrics

To demonstrate the effectiveness of our GRFusion, both qualitative and quantitative assessments are performed. In this process, we compare GRFusion with 9 state-of-the-art methods, which can be divided into two categories: feature reconstruction-based methods, including MFF-GAN [38], ZMFF [22], MUFusion [41] and SwinFusion [42], and pixel recombination-based methods, including DRPL [35], GACN [43], SESF [28], SMFuse [30] and MSFIN [31]. To quantitatively evaluate the fusion results, five commonly used metrics are adopted: mutual information ($Q_{MI}$ [44]), the edge-based similarity metric ($Q_{AB/F}$ [45]), the Chen-Blum metric ($Q_{CB}$ [46]), nonlinear correlation information entropy ($Q_{NCIE}$ [47]) and structural similarity ($Q_{SSIM}$ [48]). Among these metrics, $Q_{MI}$ evaluates how much information of the source images is included in the fusion result. $Q_{AB/F}$ measures the degree to which edges of the source images are transferred into the fused result. $Q_{CB}$ assesses the amount of information transferred from the source images into the fused result. $Q_{NCIE}$ evaluates the fused image in terms of the preservation of detail, structure, color and brightness. $Q_{SSIM}$ measures the structural similarity between the fusion result and the source images. The larger the values of these metrics, the better the quality of the fusion results.

IV-D Fusion of Two Source Images

The first experiment is performed on the fusion of two source images. In this process, the image pairs from “Lytro” and “MFI-WHU” are used as test samples, and the fusion results of the source image pairs shown in Fig. 8 are displayed in Fig. 9 and Fig. 10. As observed, GRFusion shows observable advantages compared with the other 9 methods. First of all, GRFusion is designed by combining a feature reconstruction-based method and a focused pixel recombination-based method. Therefore, it can maximize the retention of focused information from the source images, avoiding the loss of detailed information and the introduction of artifacts. Meanwhile, thanks to FRC-G-HP, the degradation of fusion results caused by focus detection errors is effectively alleviated. Consequently, the developed method produces results with better visual quality, as shown in Fig. 9 and Fig. 10.

Panels (a)-(e): values of $Q_{MI}$, $Q_{AB/F}$, $Q_{CB}$, $Q_{NCIE}$ and $Q_{SSIM}$.
Figure 12: Objective evaluation of fusion results of multiple source images. The horizontal axis represents the source image group label, where 1-4 denote the 4 source image groups from Lytro and 5-8 denote the 4 source image groups from MFFW.

Specifically, from the boxed areas we can see that the residual information in the results generated by the proposed method is less than that of the compared methods. This indicates that GRFusion transfers more focused information of the source images into the fusion results. Moreover, we can also find that MFF-GAN, MUFusion and SwinFusion suffer from chromatic aberrations, and ZMFF tends to lose information in focused regions. The results obtained by the focused pixel recombination-based methods, i.e., DRPL, GACN, SESF, SMFuse and MSFIN, are prone to blurring at the boundary of the focused region, which can be seen from the residual information in the boxed areas. To evaluate the fused results of each method more objectively, we further list the quantitative comparison results in Table II. As we can see, the results generated by the proposed method are optimal on most metrics, which further demonstrates the effectiveness of the proposed method.

IV-E Fusion of Multiple Source Images

Unlike most existing methods, GRFusion is able to fuse multiple images in parallel and shows superior fusion performance. To validate this argument, the sample groups in Lytro and MFFW with more than two multifocus source images are tested; the groups “Lytro-triple series-03”, “MFFW3-03”, “MFFW3-06” and “MFFW3-02” contain 3, 4 or 6 source images. The fusion results generated by different methods are shown in Fig. 11. From the boxed areas shown in the first and second columns, it can be seen that GRFusion can not only fuse three source images but also preserve the focus information more effectively, resulting in a fused image with higher visual quality. Similar conclusions can be drawn from the remaining results in Fig. 11. This is because the proposed method performs parallel fusion of multiple source images, avoiding the repeated feature extraction used in alternating fusion. Therefore, the proposed method effectively alleviates the information loss caused by insufficient feature extraction. To objectively evaluate the quality of the fusion results, the evaluation results of the five metrics on the fused images shown in Fig. 11 are displayed in Fig. 12. The objective evaluation results further prove that the proposed method also has advantages when fusing more than two source images.

IV-F Ablation Study

IV-F1 Ablation Study on Network Structure

TABLE III: Ablation study on network structure.
Configuration | $Q_{MI}$ | $Q_{AB/F}$ | $Q_{CB}$ | $Q_{NCIE}$ | $Q_{SSIM}$
mode1 | 1.18157 | 0.74660 | 0.79488 | 0.84702 | 1.72472
mode2 | 1.18402 | 0.74698 | 0.79441 | 0.84765 | 1.72493
mode3 | 1.11146 | 0.73550 | 0.75651 | 0.84253 | 1.72491
GRFusion | 1.18998 | 0.74742 | 0.79497 | 0.84798 | 1.72495

To verify the effectiveness of combining feature reconstruction and focused pixel recombination, we compare GRFusion with three fusion modes: mode1 generates the fused image by $\bm{I}_{1}\odot\bm{M}_{1}+\bm{I}_{2}\odot(\bm{1}-\bm{M}_{1})$, mode2 merges the source images by $\bm{I}_{1}\odot(\bm{1}-\bm{M}_{2})+\bm{I}_{2}\odot\bm{M}_{2}$, and mode3 produces the fusion result by the full-focus image generation method proposed in this paper. In this process, $\bm{M}_{i}\ (i=1,2)$ is the binary map of the focused pixel detection result of the source image $\bm{I}_{i}\ (i=1,2)$. As shown in the boxed regions in Fig. 13, defocused information is introduced when the binary map is estimated incorrectly, and the fused results are not ideal when only the full-focus image generation method is used. In contrast, GRFusion can integrate the complementary advantages of mode1, mode2 and mode3; consequently, fused results with better performance can be achieved. The above results are also objectively evaluated in Table III.

IV-F2 Ablation Study on Focus Detection

Figure 13: Ablation study on network structure. (a) Source images (b) mode1 (c) mode2 (d) mode3 (e) GRFusion.
Figure 14: Ablation study on focus detection. (a) Source images (b) Baseline (c) Baseline+FM\MPG (d) Baseline+FM\Rev (e) Baseline+FM (f) Baseline+MSFA (g) Baseline+FM+MSFA.
TABLE IV: Objective evaluation of different methods in FFIG-MFE ablation studies.
Configuration | $Q_{MI}$ | $Q_{AB/F}$ | $Q_{CB}$ | $Q_{NCIE}$ | $Q_{SSIM}$
Baseline | 0.90407 | 0.72214 | 0.70561 | 0.82991 | 1.69022
Baseline+EFE+$\bm{M}_{h}$ | 1.09770 | 0.73418 | 0.75487 | 0.84118 | 1.72432
Baseline+MDEE+EFE+$\bm{M}_{h}$ | 0.99938 | 0.73418 | 0.71647 | 0.83506 | 1.72021
Baseline+MDEE+WG+$\bm{M}_{h}$ | 0.92208 | 0.72209 | 0.68758 | 0.83068 | 1.69305
Baseline+MDEE+WG+EFE | 1.10961 | 0.73389 | 0.76239 | 0.84108 | 1.72499
Baseline+MDEE+WG+EFE+$\bm{M}_{h}$ | 1.11146 | 0.73550 | 0.75651 | 0.84253 | 1.72491
Figure 15: Ablation analysis on FFIG-MFE. (a) Source images (b) Baseline (c) Baseline+EFE+$\bm{M}_{h}$ (d) Baseline+MDEE+EFE+$\bm{M}_{h}$ (e) Baseline+MDEE+WG+$\bm{M}_{h}$ (f) Baseline+MDEE+WG+EFE (g) Baseline+MDEE+WG+EFE+$\bm{M}_{h}$.
Figure 16: Objective evaluation when $\lambda$ takes different values. The horizontal axis represents the value of $\lambda$, while the vertical axis represents the objective evaluation result. (a) $Q_{MI}$ (b) $Q_{AB/F}$ (c) $Q_{CB}$ (d) $Q_{NCIE}$ (e) $Q_{SSIM}$.
Figure 17: The impact of different values of $\lambda$ on fusion performance. (a) Source images (b) $\lambda=0$ (c) $\lambda=0.01$ (d) $\lambda=0.05$ (e) $\lambda=0.1$ (f) $\lambda=0.5$ (g) $\lambda=1$.

To validate the effectiveness of FM and MSFA in focus property detection, we remove them from HPD respectively. The ablation results are shown in Fig. 14, where “Baseline” denotes the baseline constructed by removing FM and MSFA from HPD, “Baseline+FM\MPG” means that the modulation parameter generator is removed from FM, “Baseline+FM\Rev” means that the reverse operation in Eq. (3) is removed from FM, “Baseline+FM” means that FM is added to the Baseline, “Baseline+MSFA” means that MSFA is added to the Baseline, and “Baseline+FM+MSFA” means that both FM and MSFA are added to the Baseline. From these results we can see the effects of the different components in HPD and find that all of them play a positive role in detecting focused pixels.

IV-F3 Ablation Analysis on FFIG-MFE

To verify the effectiveness of the different components in FFIG-MFE, ablation experiments are performed for MDEE, EFE, the weight generator (WG) and hard pixel information implantation, respectively. In this process, the remaining components of FFIG-MFE after removing MDEE, EFE, WG and hard pixel information implantation are used as the baseline. In FFIG-MFE, MDEE is used to extract the edge features from different directions of the source images, which are then integrated via WG. In the ablation study of MDEE, the Laplace operator is employed to replace MDEE. To investigate the effectiveness of WG and EFE, WG is replaced by an addition operation and EFE is replaced with a common concatenation operation. In FFIG-MFE, hard pixel information implantation is implemented by Eq. (18); in its ablation study, we directly remove $\bm{M}_{h}$ from Eq. (18). As seen in Fig. 15 and Table IV, the performance is optimal when all components are included, and the residual information in the boxed region is also minimal in Fig. 15(g). This demonstrates the effectiveness of each component in FFIG-MFE.

IV-G Analysis and Selection of $\lambda$

In our method, there is one critical hyperparameter in the loss function, i.e., $\lambda$. To find an appropriate value for $\lambda$, we investigate its impact when it varies within $[0,1]$. As shown in Fig. 17, when $\lambda=0$ and $\lambda=0.01$, the fusion result suffers from slight chromatic aberration, and when $\lambda=0.1$, $\lambda=0.5$ and $\lambda=1$, residual information is present in the boxed areas, indicating that not all information of the focused regions has been transferred into the fusion result. When $\lambda=0.05$, the fusion result shows higher quality since there is less residual information remaining in the boxed region. From the visual quality of the fusion results, we can conclude that $\lambda=0.05$ is the best choice. The quantitative evaluation of fusion results when $\lambda$ takes different values is shown in Fig. 16, which further demonstrates the superiority of $\lambda=0.05$ over other settings.

V Conclusion

In this paper, a novel multifocus image fusion method named GRFusion is proposed, which combines feature reconstruction with focused pixel recombination. By detecting the focus property of each input source image individually, GRFusion is able to fuse multiple source images simultaneously and effectively avoid the information loss caused by the alternating fusion strategy. Meanwhile, hard pixels are determined based on the inconsistency of all the detection results. To reduce the difficulty of fusing hard pixels, a full-focus image generation method with multi-directional gradient embedding is proposed. With the generated full-focus image, a hard-pixel-guided fusion result construction mechanism is designed, which effectively integrates the respective advantages of the feature reconstruction-based method and the focused pixel recombination-based method. Experimental results and ablation studies demonstrate the effectiveness of the proposed method and each of its components.

References

  • [1] H. Li, B. Manjunath, and S. K. Mitra, “Multisensor image fusion using the wavelet transform,” Graphical Models and Image Processing, vol. 57, no. 3, pp. 235–245, 1995.
  • [2] X. Li, F. Zhou, H. Tan, Y. Chen, and W. Zuo, “Multi-focus image fusion based on nonsubsampled contourlet transform and residual removal,” Signal Processing, vol. 184, p. 108062, 2021.
  • [3] H. Ma, Q. Liao, J. Zhang, and S. Liu, “An $\alpha$-matte boundary defocus model-based cascaded network for multi-focus image fusion,” IEEE Transactions on Image Processing, vol. 29, pp. 8668–8679, 2020.
  • [4] O. Bouzos, I. Andreadis, and N. Mitianoudis, “A convolutional neural network-based conditional random field model for structured multi-focus image fusion robust to noise,” IEEE Transactions on Image Processing, vol. 32, pp. 2915–2930, 2023.
  • [5] Y. Liu, L. Wang, H. Li, and X. Chen, “Multi-focus image fusion with deep residual learning and focus property detection,” Information Fusion, vol. 86, pp. 1–16, 2022.
  • [6] B. Yu, B. Jia, L. Ding, Z. Cai, Q. Wu, R. Law, J. Huang, L. Song, and S. Fu, “Hybrid dual-tree complex wavelet transform and support vector machine for digital multi-focus image fusion,” Neurocomputing, vol. 182, pp. 1–9, 2016.
  • [7] A. D. Cunha, J. Zhou, and M. Do, “The nonsubsampled contourlet transform: Theory, design, and applications,” IEEE Transactions on Image Processing, vol. 15, no. 10, pp. 3089–3101, 2006.
  • [8] Z. Liu, Y. Chai, H. Yin, J. Zhou, and Z. Zhu, “A novel multi-focus image fusion approach based on image decomposition,” Information Fusion, vol. 35, pp. 102–116, 2017.
  • [9] H. Li, X. Liu, Z. Yu, and Y. Zhang, “Performance improvement scheme of multifocus image fusion derived by difference images,” Signal Processing, vol. 128, pp. 474–493, 2016.
  • [10] H. Li, X. He, D. Tao, Y. Tang, and R. Wang, “Joint medical image fusion, denoising and enhancement via discriminative low-rank sparse dictionaries learning,” Pattern Recognition, vol. 79, pp. 130–146, 2018.
  • [11] B. Yang and S. Li, “Multifocus image fusion and restoration with sparse representation,” IEEE Transactions on Instrumentation and Measurement, vol. 59, no. 4, pp. 884–892, 2009.
  • [12] X. Ma, S. Hu, S. Liu, J. Fang, and S. Xu, “Multi-focus image fusion based on joint sparse representation and optimum theory,” Signal Processing: Image Communication, vol. 78, pp. 125–134, 2019.
  • [13] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang, “Image fusion with convolutional sparse representation,” IEEE Signal Processing Letters, vol. 23, no. 12, pp. 1882–1886, 2016.
  • [14] B. Yang and S. Li, “Multifocus image fusion and restoration with sparse representation,” IEEE Transactions on Instrumentation and Measurement, vol. 59, no. 4, pp. 884–892, 2009.
  • [15] Y. Zhang, M. Yang, N. Li, and Z. Yu, “Analysis-synthesis dictionary pair learning and patch saliency measure for image fusion,” Signal Processing, vol. 167, p. 107327, 2020.
  • [16] H. Li, Y. Wang, Z. Yang, R. Wang, X. Li, and D. Tao, “Discriminative dictionary learning-based multiple component decomposition for detail-preserving noisy image fusion,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 4, pp. 1082–1102, 2020.
  • [17] H. Li, X. He, Z. Yu, and J. Luo, “Noise-robust image fusion with low-rank sparse decomposition guided by external patch prior,” Information Sciences, vol. 523, pp. 14–37, 2020.
  • [18] W. Zhao, D. Wang, and H. Lu, “Multi-focus image fusion with a natural enhancement via a joint multi-level deeply supervised convolutional neural network,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 4, pp. 1102–1115, 2018.
  • [19] Y. Zang, D. Zhou, C. Wang, R. Nie, and Y. Guo, “Ufa-fuse: A novel deep supervised and hybrid model for multifocus image fusion,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–17, 2021.
  • [20] H. Jung, Y. Kim, H. Jang, N. Ha, and K. Sohn, “Unsupervised deep image fusion with structure tensor representations,” IEEE Transactions on Image Processing, vol. 29, pp. 3845–3858, 2020.
  • [21] S. Xu, L. Ji, Z. Wang, P. Li, K. Sun, C. Zhang, and J. Zhang, “Towards reducing severe defocus spread effects for multi-focus image fusion via an optimization based strategy,” IEEE Transactions on Computational Imaging, vol. 6, pp. 1561–1569, 2020.
  • [22] X. Hu, J. Jiang, X. Liu, and J. Ma, “Zmff: Zero-shot multi-focus image fusion,” Information Fusion, vol. 92, pp. 127–138, 2023.
  • [23] Z. Wang, X. Li, H. Duan, and X. Zhang, “A self-supervised residual feature learning model for multifocus image fusion,” IEEE Transactions on Image Processing, vol. 31, pp. 4527–4542, 2022.
  • [24] X. Nie and B. H. X. Gao, “Mlnet: A multi-domain lightweight network for multi-focus image fusion,” IEEE Transactions on Multimedia, 2023, doi:10.1109/TMM.2022.3194991.
  • [25] H. T. Mustafa, M. Zareapoor, and J. Yang, “Mldnet: Multi-level dense network for multi-focus image fusion,” Signal Processing: Image Communication, vol. 85, p. 115864, 2020.
  • [26] Y. Liu, X. Chen, H. Peng, and Z. Wang, “Multi-focus image fusion with a deep convolutional neural network,” Information Fusion, vol. 36, pp. 191–207, 2017.
  • [27] X. Guo, R. Nie, J. Cao, D. Zhou, and W. Qian, “Fully convolutional network-based multifocus image fusion,” Neural Computation, vol. 30, no. 7, pp. 1775–1800, 2018.
  • [28] B. Ma, Y. Zhu, X. Yin, X. Ban, H. Huang, and M. Mukeshimana, “Sesf-fuse: An unsupervised deep model for multi-focus image fusion,” Neural Computing and Applications, vol. 33, pp. 5793–5804, 2021.
  • [29] B. Xiao, B. Xu, X. Bi, and W. Li, “Global-feature encoding u-net (geu-net) for multi-focus image fusion,” IEEE Transactions on Image Processing, vol. 30, pp. 163–175, 2020.
  • [30] J. Ma, Z. Le, X. Tian, and J. Jiang, “Smfuse: Multi-focus image fusion via self-supervised mask-optimization,” IEEE Transactions on Computational Imaging, vol. 7, pp. 309–320, 2021.
  • [31] Y. Liu, L. Wang, J. Cheng, and X. Chen, “Multiscale feature interactive network for multifocus image fusion,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–16, 2021.
  • [32] H. Li, H. Qiu, Z. Yu, and B. Li, “Multifocus image fusion via fixed window technique of multiscale images and non-local means filtering,” Signal Processing, vol. 138, pp. 71–85, 2017.
  • [33] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
  • [34] H. Tang, C. Yuan, Z. Li, and J. Tang, “Learning attention-guided pyramidal features for few-shot fine-grained recognition,” Pattern Recognition, vol. 130, p. 108792, 2022.
  • [35] J. Li, X. Guo, G. Lu, B. Zhang, Y. Xu, F. Wu, and D. Zhang, “Drpl: Deep regression pair learning for multi-focus image fusion,” IEEE Transactions on Image Processing, vol. 29, pp. 4816–4831, 2020.
  • [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, pp. 211–252, 2015.
  • [37] M. Nejati, S. Samavi, and S. Shirani, “Multi-focus image fusion using dictionary-based sparse representation,” Information Fusion, vol. 25, pp. 72–84, 2015.
  • [38] H. Zhang, Z. Le, Z. Shao, H. Xu, and J. Ma, “Mff-gan: An unsupervised generative adversarial network with adaptive and gradient joint constraints for multi-focus image fusion,” Information Fusion, vol. 66, pp. 40–53, 2021.
  • [39] S. Xu, X. Wei, C. Zhang, J. Liu, and J. Zhang, “Mffw: A new dataset for multi-focus image fusion,” arXiv preprint arXiv:2002.04780, 2020.
  • [40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [41] C. Cheng, T. Xu, and X.-J. Wu, “Mufusion: A general unsupervised image fusion network based on memory unit,” Information Fusion, vol. 92, pp. 80–92, 2023.
  • [42] J. Ma, L. Tang, F. Fan, J. Huang, X. Mei, and Y. Ma, “Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 7, pp. 1200–1217, 2022.
  • [43] B. Ma, X. Yin, D. Wu, H. Shen, X. Ban, and Y. Wang, “End-to-end learning for simultaneously generating decision map and multi-focus image fusion result,” Neurocomputing, vol. 470, pp. 204–216, 2022.
  • [44] M. Hossny, S. Nahavandi, and D. Creighton, “Comments on ‘Information measure for performance of image fusion’,” Electronics Letters, vol. 44, no. 18, pp. 1066–1067, 2008.
  • [45] C. S. Xydeas, V. Petrovic et al., “Objective image fusion performance measure,” Electronics Letters, vol. 36, no. 4, pp. 308–309, 2000.
  • [46] Y. Chen and R. S. Blum, “A new automated quality assessment algorithm for image fusion,” Image and Vision Computing, vol. 27, no. 10, pp. 1421–1432, 2009.
  • [47] Q. Wang, Y. Shen, and J. Jin, “Performance evaluation of image fusion techniques,” Image fusion: algorithms and applications, vol. 19, pp. 469–492, 2008.
  • [48] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.