PixelGame: Infrared small target segmentation as a Nash equilibrium

Heng Zhou, Chunna Tian^∗, Zhenxi Zhang, Chengyang Li, Yongqiang Xie^∗, Zhongbo Li

Manuscript received October 14, 2021; revised May 1, 2022. (Corresponding authors: Chunna Tian, Yongqiang Xie.)

Abstract

A key challenge of infrared small target segmentation (ISTS) is to balance false negative pixels (FNs) and false positive pixels (FPs). Traditional methods combine FNs and FPs into a single objective by a weighted sum, so the optimization is decided by one actor. Minimizing FNs and FPs with the same strategy leads to antagonistic decisions. To address this problem, we propose a competitive game framework (pixelGame) that approaches ISTS from a novel perspective. In pixelGame, FNs and FPs are controlled by different players, each of which aims to minimize its own utility function. The FNs-player and FPs-player are designed with different strategies: one minimizes FNs and the other minimizes FPs. The utility function drives the evolution of the two participants in competition. We consider the Nash equilibrium of pixelGame as the optimal solution. In addition, we propose maximum information modulation (MIM) to highlight the target information. MIM effectively focuses on the salient region including small targets. Extensive experiments on two standard public datasets demonstrate the effectiveness of our method. Compared with other state-of-the-art methods, our method achieves better performance in terms of F1-measure ( $\mathrm{F_{1}}$ ) and intersection over union ( $\mathrm{IoU}$ ).

Index Terms:

Game theory, deep learning, infrared image, small target segmentation.

I Introduction

Infrared target segmentation plays a fundamental role in many applications, including surveillance and reconnaissance [1], [2] and precise strike and guidance [3] in the military field, and organ segmentation [4], [5] and cell identification [6] in the biomedical field. In real applications, due to the long imaging distance, infrared (IR) targets are usually “dim”, “small” and “sparse” compared with targets in RGB images. The top of Fig. 1 is a typical IR image [7], and the bottom is a common RGB image of a natural scene [8]. 1) Dim: The noisy, cluttered background gives IR targets low contrast and a low signal-to-clutter ratio. 2) Small: The pixels of IR targets account for only a small proportion of the image; most pixels in an IR image are background pixels. 3) Sparse: As Fig. 1 shows, unlike in most RGB images, the layout of IR small targets is also sparse.

Many traditional methods rely on hand-crafted features. Because IR small targets lack texture and shape features, these methods cannot adapt to an open and diverse environment. They mainly either simplify the small target as a bright spot [9], [10], or model the background, the target and the relationship between them [11] in a particular scene. Such methods achieve good performance only under specific a priori hypotheses. In an open environment with diverse background scenes, they struggle to segment IR small targets robustly and accurately.

Different from traditional methods, deep convolutional neural networks (CNN) learn infrared small target representations in a data-driven manner. In recent years, inspired by the superior performance of deep learning in machine learning, CNN-based approaches have made new advances in infrared small target detection [12] and segmentation [13], [14], [15], [16].

One of the main challenges is the foreground-background imbalance problem in the ISTS. The foreground pixels in the image are far fewer than the background pixels. Specifically, a large number of background pixels are incorrectly segmented as targets (false positive pixels, $FPs$ ). A small number of target pixels are submerged by clutter (false negative pixels, $FNs$ ).

In order to balance $FNs$ and $FPs$ , most previous CNN-based methods [17] combine the two objectives into one function through a weighted sum. The combined objective functions include Dice loss [18], Jaccard loss [19], Tversky loss [20], asymmetric similarity loss [21], sensitivity-specificity loss [22] and penalty loss [23]. Methods based on a combined loss function tune the weights to obtain acceptable solutions, and the selection is made by a single actor. The weights, as hyperparameters of the loss function, control the tradeoff between $FNs$ and $FPs$ . However, such a training objective design suffers from two main limitations. 1) Using the same strategy to minimize $FNs$ and $FPs$ simultaneously leads to antagonistic decisions. The former aims to predict as many target pixels as possible, while the latter tends to predict a small number of target pixels with high confidence. 2) Extensive studies on loss functions show that setting their hyperparameters is difficult and relies heavily on experience. Therefore, it is intuitive and rational to optimize $FNs$ and $FPs$ as two independent objectives, as in our work.

Our work is motivated by game theory. Game theory is a framework or paradigm for solving multi-objective optimization problems, especially those with antagonistic criteria [24], [25], [26]. Game-theoretic learning uses an equilibrium state instead of the optimal solution. Following this idea, we design two tailored sub-networks to act as the FNs-player and FPs-player in the game. The FNs-player and FPs-player focus on false negative pixels and false positive pixels, respectively. The players, actions and utility functions of game theory correspond to the sub-networks, the changes of network parameters and the loss functions in the proposed framework, respectively [27], [28]. In this way, ISTS is transformed into a game paradigm. Under the constraints of the utility function, the players choose the actions that minimize their utility. Finally, the opponents reach the Nash equilibrium, which is an ingenious tradeoff between $FNs$ and $FPs$ .

To obtain high-quality segmentation masks, image context is important for small targets [29], [30], [31]. In deep CNNs, the receptive field is mainly expanded by down-sampling. Although traditional deep convolutional networks have a larger receptive field to aggregate contextual information, they lose some spatial location information. Unlike large-scale targets, IR small targets may fail to be segmented once the resolution degrades. To resolve the conflict between deep high-level semantics and shallow high-resolution feature maps, we adopt dilated convolution modules [15], [32] in both the FNs-player and FPs-player. The fully dilated convolution network ( $\mathrm{FDCN}$ ) obtains larger receptive fields while maintaining the spatial resolution. Fig. 2 visualizes the dilated convolutional feature maps from $\mathrm{FDCN}$ . As shown in Fig. 2, as the receptive field of dilated convolution increases, the small targets are effectively retained and enhanced in the high-resolution features.

In addition, in contrast to the obvious semantic dependencies between objects and backgrounds (e.g., boats and rivers, cars and roads) in RGB object segmentation, the targets in ISTS are grouped into one broad class. The semantic contrast between backgrounds and targets is weak in ISTS. We observe that small infrared targets are usually locally salient in a specific region. Based on these observations, we propose maximum information modulation (MIM). MIM draws on the strength of attention mechanisms [33], [34] in focusing on effective information. MIM effectively suppresses irrelevant information and enhances the representation of small targets in the proposed framework.

In summary, the main contributions of this article are given as follows.

1. We present a novel perspective to model infrared small target segmentation as a multi-player strategy game (pixelGame). FNs-player and FPs-player focus on reducing $FNs$ and $FPs$ in pixelGame, respectively.

2. At the same time, a new utility function of pixelGame is designed to encourage two players to conduct the game. The utility function ensures that the participants fully play the game and eventually reach Nash equilibrium.

3. To handle the small targets, we adopt the dilated convolution modules in both FNs-player and FPs-player. $\mathrm{FDCN}$ takes both large receptive field and high-resolution feature map into account.

4. Due to the fact that IR targets are usually local salient regions in images, we propose MIM to suppress irrelevant background information by calculating local maxima, which improves the feature discrimination ability on small targets.

The remainder of this article is organized as follows. In Section II, we briefly review the related work on ISTS, deep learning based game theory and two benchmark datasets. Section III gives a detailed description of the proposed pixelGame model for ISTS. In Section IV, we conduct extensive experiments, including an ablation study and a comparison with state-of-the-art (SOTA) methods on the benchmark datasets; the results demonstrate the effectiveness of our method. Conclusions and future work are given in Section V.

II Related works

In this section, we briefly review related works on ISTS, deep learning based game theory and two benchmark infrared dim small target datasets.

II-A Infrared small target segmentation

Infrared small target segmentation models are divided into two categories: Traditional methods based on mathematical modeling with strong prior assumptions and CNN-based methods emerging in recent years. Specifically, the traditional ISTS methods mainly include spatial domain filtering methods and optimization-based methods.

Traditional spatial domain-based methods, such as Top-Hat filtering [9], Max-Median filtering [35], focus on suppressing the background. Compared with the classic Top-Hat filter, Bai et al. [10] constructed a ring filter template through two related but different structural elements to better suppress the background and noise. Deng et al. [36] weighted the image entropy by multiscale gray difference to improve the SCR of small targets. Gao et al. [37] treated the target as a special sparse component in the noise, so as to distinguish the target from similar background noise. Huang et al. [38] used a relatively large density gap between targets and their neighbors to eliminate the interference caused by clutters in complex backgrounds. Moradi et al. [39] proposed a directional approach to enhance the target area and suppress structural backgrounds. Many methods apply multi-scale technology to suppress the background. Nie et al. [40] proposed a multi-scale local homogeneity measure to improve the saliency of small targets. Gao et al. [41] utilized multi-scale gray and variance difference metrics to enhance the feature representation of small target and mitigate background fluctuations, which improves the detection accuracy. In cluttered background, He et al. [42] enhanced targets by exploiting multi-scale differences in intensity distribution changes and gray values.

Inspired by the human visual system, many local contrast measure (LCM) based methods [11], [43] have been explored. Wei et al. [44] presented multiscale patch-based contrast measure to increase the contrast between the target and background. Huang et al. improved LCM from the aspects of multiscale [45], target shape [46] and the difference between the target and the background [47]. Lu et al. [48] utilized a division scheme of surrounding area to capture the derivative properties of the target. These methods extract the difference between the target and the background from various aspects, but their performance is limited when the background changes dramatically or the target is hidden in the background.

The optimizing methods based on low-rank matrix recovery theory assume that the raw image is generated by a low-rank subspace, and the small targets are formulated as sparse singularity. The infrared patch-image (IPI) model [49] regarded small target detection as an optimization problem of recovering low-rank and sparse matrices. Dai et al. [50] used the partial sum of singular values instead of the nuclear norm of IPI to constrain the low rank background patch image. For various highly complex background scenes, Wang et al. [51] combined the total variation regularization term and principal component pursuit (TV-PCP) to comprehensively describe background feature. Wang et al. [52] analyzed the multi-subspace structure of heterogeneous background data, and proposed a stable multi-subspace learning method (SMSL) based on the internal structure of actual images to improve the robustness of the model. Self-regularized weighted sparse (SRWS) [53] model mined the potential information in the background, and transformed the small target segmentation into the optimization problem of extracting clutter from multiple subspaces.

Compared with matrices, tensors have more advantages in handling high-dimensional data [54]. To distinguish real targets from background residuals in heterogeneous scenes, Zhang et al. [55] proposed an edge and corner awareness-based spatial-temporal tensor (ECA-STT) model. Sun et al. [56] extended the properties of multi-subspace to the infrared patch-tensor (IPT) structure to better characterize the highly heterogeneous infrared image background. Kong et al. [57] extended t-SVD to multimodal t-SVD and enhanced the accuracy of background rank representation in the IPT model. The more accurate the assumptions, the better these existing IPT-based methods perform. Therefore, the performance of optimization-based models relies on the constructed data structure and prior assumptions.

TABLE I: Several recent ISTS datasets. BC and LSR represent background clutter and low spatial resolution, respectively.

Datasets	Background	Year	Samples (train&val)	Samples (test)	Image size	Label Type	Sequence	Synthetic/Real
NUST-ISTS[16]	BC & LSR	2019	10,000	100	(101 $\sim$ 442) $\times$ (96 $\sim$ 327)	Pixel	✗	Synthetic
NUAA-ISTS[7]	BC & LSR	2021	341	86	(135 $\sim$ 456) $\times$ (96 $\sim$ 349)	Pixel/Box	✗	Real
DSAT[58]	BC	2019	16,177	N/A	256 $\times$ 256	Centroid	✔	Real

Specific assumptions cannot adapt to an open and diversified background environment. Models based on deep learning have a large capacity and can cover a variety of different scenes. In recent years, the release of multiple infrared small target datasets has promoted research on CNN-based methods. Dai et al. [7] proposed the first CNN-based single-frame infrared small target segmentation model and designed an asymmetric context modulation (ACM) module to fuse high-level semantics and low-level details. ALCNet [59] transforms the traditional local contrast measurement method into a nonlinear module in the convolutional network by incorporating domain knowledge, which alleviates the problem of minimal intrinsic features in purely data-driven methods. ISTDU-Net [60] improves the U-Net [61] segmentation model by increasing the response of small target features and suppressing similar background information, which improves its ability to recognize small targets.

ACM, ALCNet and ISTDU-Net take ResNet-20 [62] as the backbone network to extract infrared small target features, and use the IoU-based weighted loss function to guide model optimization. Nevertheless, there are obvious differences in size, energy and layout between infrared and visible targets.

On the one hand, the ResNet structure obtains a large receptive field to fuse context information by sacrificing spatial resolution. Unlike large-scale RGB targets, IR small targets may fail to be segmented once the resolution degrades. On the other hand, the IoU-based loss function minimizes $FNs$ and $FPs$ simultaneously, which increases the difficulty of model optimization and leads to antagonistic decisions. Minimizing $FNs$ aims to predict as many target pixels as possible, while minimizing $FPs$ tends to predict fewer target pixels, each with high confidence.

Under the framework of generative adversarial networks (GAN) [63], MDvsFA-cGAN [16] uses two different segmentation networks as generators and aims to balance the results of the two generators through adversarial learning between the generators and the discriminator. MDvsFA-cGAN tries to find the Nash equilibrium of the generator and discriminator. However, GANs are difficult to train and prone to mode collapse, which causes the generators to produce samples in the same mode. In contrast, our method aims to find the equilibrium state of the two segmentation networks.

The core idea of GAN is game theory [24]. To take advantage of game theory, we design two novel segmentation networks as two players. Under the guidance of the utility function, the two networks play games directly. When two players reach Nash equilibrium in the game, the final results are generated by the segmentation networks.

II-B Deep learning based game theory

Generally, a deep learning based game is simplified to three parts [27]: 1) the participants of the game, which are neural units or neural networks; 2) the choices of each participant; 3) the objective function of each participant. A game in strategic form is given as follows:

\left(\mathcal{T},\left(\varTheta_{k}\right)_{k\in\mathcal{T}},\left(\varPhi_{k}\right)_{k\in\mathcal{T}}\right),

(1)

where $\mathcal{T}=\{1,2,\ldots\}$ is the set of players, $\varTheta_{k}$ is the set of actions of player $k\in\mathcal{T}$ , which is essentially a choice of weights, and $\varPhi_{k}$ is the loss function of player $k$ . The strategy of player $k$ is to optimize $\varPhi_{k}$ .
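To make the strategic form in Eq. (1) concrete, the following minimal sketch plays a two-player game in which each player updates only its own action to reduce its own loss. The scalar actions `theta1` and `theta2`, the two quadratic losses and the step size are illustrative assumptions and are not taken from the paper. For these toy losses the Nash equilibrium can be found analytically at $(1.2, -0.4)$, where neither player can lower its loss by changing its own action alone.

```python
# Toy two-player game in strategic form (hypothetical losses, for illustration only).
# Player 1 controls theta1 and minimizes Phi_1 = (theta1 - 1)^2 + theta1 * theta2.
# Player 2 controls theta2 and minimizes Phi_2 = (theta2 + 1)^2 - theta1 * theta2.


def grad_phi1(theta1, theta2):
    """Gradient of Phi_1 with respect to player 1's own action."""
    return 2.0 * (theta1 - 1.0) + theta2


def grad_phi2(theta1, theta2):
    """Gradient of Phi_2 with respect to player 2's own action."""
    return 2.0 * (theta2 + 1.0) - theta1


def play(steps=200, lr=0.1):
    """Simultaneous gradient play: each player descends its own loss."""
    theta1, theta2 = 0.0, 0.0
    for _ in range(steps):
        g1 = grad_phi1(theta1, theta2)
        g2 = grad_phi2(theta1, theta2)
        theta1 -= lr * g1
        theta2 -= lr * g2
    return theta1, theta2


theta1, theta2 = play()
```

At the returned point both own-action gradients vanish, which is the first-order condition for a Nash equilibrium of this toy game; in pixelGame the actions would be network weights rather than scalars.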

Inspired by the outstanding performance of game theory on principal component analysis [28], we regard the task of infrared small target segmentation as a competitive game. Different networks as game players focus on different incorrectly segmented pixels. Under the guidance of the utility function, the two players continue to play the game, and finally reach the Nash equilibrium to output the segmentation results.

II-C Infrared dim small target datasets

The lack of large-scale datasets has severely hampered the application of deep learning to small infrared targets. The recently released NUST-ISTS [16] is the first large-scale dataset for infrared small target segmentation. NUST-ISTS contains small targets with various real backgrounds, which results in rich data samples. Hu et al. [58] released the first multi-scene dataset of low slow infrared dim-small aircraft target image sequences (DSAT), in which the centroid coordinates of objects are marked. NUAA-ISTS [7] is the first real-scene infrared small target dataset and provides various annotations, including masks and bounding boxes. The detailed properties of these three datasets are shown in TABLE I.

In terms of dataset size, the NUST-ISTS and DSAT data sets have more than 10,000 labeled image samples. The difference between them is that the NUST-ISTS dataset is per-pixel labeled, while DSAT only provides the center coordinates of targets. The advantage of the NUAA-ISTS dataset is that the samples cover a variety of natural scenes, and all samples with various sizes are taken from the real background.

The experiments of our work are mainly carried out on NUST-ISTS and NUAA-ISTS datasets with pixel-wise target masks. Some representative infrared images of the two datasets are shown in Fig. 3.

To further analyze the two datasets, we count the number and size of the targets. The distribution of the number of targets per image in the two datasets is shown in Fig. 4 (a). More than 80% of the images contain only one target. In detail, the images in NUST-ISTS mainly contain a single target, while images in NUAA-ISTS usually contain multiple targets. Fig. 4 (b) shows the statistical distribution of the target area (the number of pixels contained in the target) on each dataset. From the cumulative line chart on the right, we find that more than 90% of infrared targets contain fewer than 100 pixels, occupying less than 1% of the image. We can see that the targets in infrared images are small, dim and scattered. It can also be seen from Fig. 4 (b) that more than half of the targets contain about 20 pixels.

The signal-to-clutter ratio (SCR) [49], [64] is used to measure the target intensity relative to the background intensity. In general, the higher the SCR of the target, the easier the target is to segment. In infrared dim small target segmentation, SCR is defined as follows,

\textrm{SCR}=\frac{\left|\mu_{t}-\mu_{c}\right|}{\sigma_{c}}.

(2)

In Eq. (2), $\mu_{t}$ represents the target intensity, which is the mean gray value of the target region. $\mu_{c}$ and $\sigma_{c}$ represent the mean and standard deviation of the gray value in the target neighborhood region, respectively. The target neighborhood region is set to be three times the size of the target region in this paper.
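As a rough illustration of Eq. (2), the sketch below computes the SCR of a single target from an image and a binary target mask. The function name, and the reading of "three times the size of the target region" as a bounding box expanded to three times the target's height and width with the target pixels excluded, are assumptions made for this example; the paper may define the neighborhood differently.

```python
import numpy as np


def signal_to_clutter_ratio(image, target_mask):
    """SCR = |mu_t - mu_c| / sigma_c for one target (Eq. 2).

    Assumption: the neighborhood is the target's bounding box padded by one
    box height/width on every side (3x the box size), minus the target pixels.
    """
    target_mask = target_mask.astype(bool)
    ys, xs = np.nonzero(target_mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    h, w = y1 - y0, x1 - x0

    # Expanded box, clipped to the image borders.
    ny0, ny1 = max(y0 - h, 0), min(y1 + h, image.shape[0])
    nx0, nx1 = max(x0 - w, 0), min(x1 + w, image.shape[1])

    neighborhood = np.zeros_like(target_mask, dtype=bool)
    neighborhood[ny0:ny1, nx0:nx1] = True
    neighborhood &= ~target_mask

    mu_t = image[target_mask].mean()
    mu_c = image[neighborhood].mean()
    sigma_c = image[neighborhood].std()
    if sigma_c == 0:
        return float("inf")  # perfectly flat neighborhood
    return float(abs(mu_t - mu_c) / sigma_c)


# Synthetic example: checkerboard clutter (values 0 and 2) with one bright pixel.
rows, cols = np.indices((9, 9))
img = 2.0 * ((rows + cols) % 2)
img[4, 4] = 11.0
mask = np.zeros((9, 9), dtype=bool)
mask[4, 4] = True
scr_value = signal_to_clutter_ratio(img, mask)
```

In this synthetic case the eight neighbors have mean 1 and standard deviation 1, so the SCR is 10, well above the value of 5 that most real targets in the two datasets fall below.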

As shown in Fig. 4 (c), the SCR of about 70% of IR small targets is lower than 5. Clutter signals, such as clouds, trees, rocks and the ground, account for most of the energy in IR images. The target can be regarded as a local extreme point in a specific region, but its energy is very weak against the global background.

Given the above analysis, Fig. 4 indicates that the challenges of ISTS lie not only in the limited information from the target, but also in the complex and changeable background.

III PixelGame: Infrared dim small target segmentation based on Game Theory

As shown in Fig. 5, the proposed pixelGame consists of two sub-networks: FNs-player ( $S_{1}$ ) and FPs-player ( $S_{2}$ ), which segment the IR image pixel by pixel under the guidance of their respective utility functions. To handle small targets, we pay attention to deep high-level semantics and shallow high-resolution feature maps at the same time. Therefore, we combine dilated convolution with an encoder-decoder structure to form the backbones of FNs-player and FPs-player.

In this section, we first introduce how to transform the ISTS task into a game in pixelGame. Specifically, we deal with three challenges: 1) How to design suitable sub-networks to control the focus of different players (Section III-A). 2) How to improve the feature representation ability on small targets (Section III-B). 3) How to set a sound and effective utility function for players to achieve the Nash equilibrium in the competitive game (Section III-C).

III-A FNs-player and FPs-player

Inspired by the use of game theory to resolve antagonistic decisions, we let two segmentation networks act as two players in the game, each optimizing its own utility function. The sub-networks $S_{1}$ and $S_{2}$ segment the infrared image $\mathbf{I}$ pixel by pixel, as shown in Fig. 5. Formally, it can be represented as

\left\{\begin{array}[]{l}S_{1}(\mathbf{I})\rightarrow\textit{{O}}_{1},\\ S_{2}(\mathbf{I})\rightarrow\textit{{O}}_{2},\end{array}\right.

(3)

where $\textit{{O}}_{1}$ and $\textit{{O}}_{2}$ denote the segmentation results of the two players, respectively.

TABLE II: Detailed backbone of the networks. In the table, “conv-k(m)-d(n)-c(t)” represents a convolutional layer with an m $\times$ m kernel, a dilation factor of n and t output feature-map channels. Head represents the last output layer of the network.

Network	Encoder-decoder layers (in order)	Head
$\mathrm{FDCN_{9}}$	conv-k3-d1-c128, conv-k3-d2-c128, conv-k3-d4-c128, conv-k3-d8-c128, conv-k3-d16-c128, conv-k3-d8-c128, conv-k3-d4-c128, conv-k3-d2-c128, conv-k3-d1-c128	conv-k1-d1-c1
$\mathrm{FDCN_{13}}$	conv-k3-d1-c64, conv-k3-d2-c64, conv-k3-d4-c64, conv-k3-d8-c64, conv-k3-d16-c64, conv-k3-d32-c64, conv-k3-d64-c64, conv-k3-d32-c64, conv-k3-d16-c64, conv-k3-d8-c64, conv-k3-d4-c64, conv-k3-d2-c64, conv-k3-d1-c64	conv-k1-d1-c1

In the small target segmentation task, the $FNs$ and $FPs$ are difficult to balance delicately. We separate $FNs$ and $FPs$ , and employ two players to divide and conquer. In order to obtain better performance, the two sub-networks use different structures according to tasks. FNs-player and FPs-player use fully dilated convolution network ( $\mathrm{FDCN}$ ) with different network depth and dilation factor.

The detailed encoder-decoder structures of $\mathrm{FDCN_{9}}$ and $\mathrm{FDCN_{13}}$ are shown in TABLE II. $\mathrm{FDCN_{9}}$ and $\mathrm{FDCN_{13}}$ are the backbone networks of FNs-player and FPs-player, respectively. In all of the models, the convolutional layers except the last one are followed by batch normalization (BN) [65] and leaky rectified linear unit (leakyReLU) [32]. Specifically, the goal of the $S_{1}$ player is to reduce the false negative pixels of targets by optimizing $TNs$ and $FNs$ . We employ the shallow encoder-decoder network to extract local information and segment all the pixels of the suspected target. $\mathrm{FDCN_{9}}$ uses 9 convolutional layers, and its dilation factor increases from 1 to a maximum of 16.

Compared with FNs-player, FPs-player increases the accuracy of the pixels predicted as the target class by optimizing $TPs$ and $FPs$ . The pixels predicted by $S_{2}$ should be as precise as possible. FPs-player needs a larger context and a better local receptive field, so $\mathrm{FDCN_{13}}$ is deeper and its dilation factor is larger. $\mathrm{FDCN_{13}}$ contains 13 convolutional layers, and its maximum dilation factor is 64. Finally, the head layer predicts the class of each pixel, generating the binary mask of foreground and background.
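The difference in context between the two backbones can be quantified with the standard receptive-field formula for a stack of stride-1 convolutions, where each layer with kernel size $k$ and dilation $d$ widens the field by $(k-1)d$. The sketch below applies it to the dilation schedules listed in TABLE II; stride 1 is an assumption consistent with the statement that spatial resolution is maintained, and the function name is illustrative.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in pixels, per axis) of stacked stride-1 dilated convs."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# Dilation schedules of the encoder-decoder layers in TABLE II.
FDCN9_DILATIONS = [1, 2, 4, 8, 16, 8, 4, 2, 1]
FDCN13_DILATIONS = [1, 2, 4, 8, 16, 32, 64, 32, 16, 8, 4, 2, 1]

rf_fdcn9 = receptive_field(FDCN9_DILATIONS)    # 93
rf_fdcn13 = receptive_field(FDCN13_DILATIONS)  # 381
```

Under these assumptions $\mathrm{FDCN_{9}}$ sees a 93 $\times$ 93 window and $\mathrm{FDCN_{13}}$ a 381 $\times$ 381 window per output pixel, which is on the order of the image sizes in TABLE I. The 1 $\times$ 1 head does not enlarge either field.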

III-B Maximum Information Modulation

The information modulation methods represented by the attention mechanism aim to make the model focus on task-related information. In RGB object detection and segmentation, SENet [66] adaptively enhances task-relevant channels by learning the dependencies between different channels. Non-local network [67] is used to capture long-range dependencies and establish the interaction between two pixels with a certain distance on the image. GCNet [33] improves the Non-local network and SENet, enabling query-independent lightweight modules to effectively extract global context information. Triplet attention [34] encodes inter-channel relation and spatial relation, and establishes the dependencies between them to calculate attention weights.

Unlike RGB images, the SCR of infrared dim small targets is very low, and the useful target information is usually submerged in irrelevant clutter and noise. Considering the small targets are arduous to segment, we introduce global max pooling (GMP) [68] and cross-channel max pooling (cMaxPool) [34] to enhance the local salient information of these targets. The $\mathrm{MIM}$ aims to increase the pertinence and capacity of extracted features. In $\mathrm{FDCN_{9}}$ and $\mathrm{FDCN_{13}}$ , we add the $\mathrm{MIM}$ module in the skip connection.

The differences between $\mathrm{MIM}$ and other attention modules are highlighted in Fig. 6. It can be seen from the visualization results that the $\mathrm{MIM}$ performs better in capturing infrared small targets with low SCR than other attention mechanisms.

The $\mathrm{MIM}$ enhances the salient information related to the target, and suppresses a large amount of noise and clutters that are not related to the target. In Eq. (4), the $\mathrm{MIM}$ is performed on the features $\mathbf{X}\in\mathbb{R}^{C\times H\times W}$ of each layer of two encoders.

\mathbf{Z}=m(\mathbf{X}),

(4)

where $m(\cdot)$ represents the $\mathrm{MIM}$ , $\mathbf{Z}\in\mathbb{R}^{C\times H\times W}$ is the modulated feature, and $C$ , $H$ and $W$ represent the number of channels, the height and the width of the feature map, respectively.

The network structure of $\mathrm{MIM}$ is shown in Fig. 7. First, the pixel-wise correlation in the spatial domain is used to obtain cross-channel attention $\mathbf{V_{1}}\in\mathbb{R}^{C\times 1\times 1}$ . Specifically, in Eq. (5), the feature map $\mathbf{X}$ is first transformed through point-wise convolution (PWConv) [69] to fuse the relationship between the features of different channels. Eq. (6) makes the $\mathrm{MIM}$ pay more attention to the correlation between different spatial positions of feature.

\mathbf{Y}=\textrm{PWConv}(\mathbf{X}),

(5)

\mathbf{V_{1}}={reshape}(\mathbf{X})*\rho\left({reshape}\left(\mathbf{Y}\right)\right),

(6)

where PWConv is followed by BN and leakyReLU in sequence, and $\rho$ is the Softmax function. The tensors are reshaped as $\mathbf{X}\in\mathbb{R}^{C\times H\times W}\rightarrow\mathbb{R}^{C\times HW}$ and $\mathbf{Y}\in\mathbb{R}^{1\times H\times W}\rightarrow\mathbb{R}^{HW\times 1\times 1}$ .

Second, unlike in RGB target segmentation, the targets in ISTS are often relatively small. GMP selects the extreme values in the feature map, so it can effectively improve the saliency of the feature map in the region and suppress noise. Global average pooling (GAP) [70], by contrast, usually pays more attention to large objects: it tends to give higher responses to large irrelevant objects and neglect small extreme regions. Because the small targets to be segmented in our task are usually local extreme points in images, GMP extracts the target-related features more effectively. Thus, GMP can reduce the impact of useless background information and highlight striking target information. In Eq. (7), we obtain channel-wise attention $\mathbf{V_{2}}\in\mathbb{R}^{C\times 1\times 1}$ :

\mathbf{V_{2}}=\sigma(\textrm{GMP}(\mathbf{X})),

(7)

where $\sigma$ is the Sigmoid function.

We combine different target information obtained by Eq. (6) and Eq. (7) to obtain dual-channel attention $\mathbf{M}_{\mathbf{1}}\in\mathbb{R}^{2C\times 1\times 1}$ :

\mathbf{M}_{\textbf{1}}=[\mathbf{V_{1}};\mathbf{V_{2}}].

(8)

As shown in Fig. 7, the two-layer PWConv realizes the full interaction of different channel information through squeezing and excitation, and further enhances the cross-channel global attention $\mathbf{M}_{\mathbf{2}}$ , in Eq. (9).

\mathbf{M}_{\mathbf{2}}=\rho\left(\textrm{PWConv}\left(\textrm{PWConv}\left(\mathbf{M}_{\textbf{1}}\right)\right)\right),

(9)

where $\mathbf{M}_{\mathbf{2}}\in\mathbb{R}^{C\times 1\times 1}$ , and the last layer does not include the leakyReLU activation.

Then, we use cMaxPool to obtain the most significant information of different channels. $\mathbf{M}_{\mathbf{3}}$ in Eq. (10) is the max channel pooling feature map, where each point in the feature map is the max of the points at the same position within feature maps.

\mathbf{M}_{\mathbf{3}}=\sigma\left(\textrm{cMaxPool}\left(\mathbf{X},0\right)\right),

(10)

where $\sigma$ is the Sigmoid function, $\mathbf{M}_{\mathbf{3}}\in\mathbb{R}^{1\times H\times W}$ .

Finally, guided by Eq. (9) and Eq. (10), $\mathrm{MIM}$ enhances small target features in both spatial and channel dimensions. The final feature is obtained by

\mathbf{Z}=\mathbf{M}_{\mathbf{2}}\cdot\mathbf{X}+\mathbf{M}_{\mathbf{3}}\cdot\mathbf{X},

(11)

where $\mathbf{Z}\in\mathbb{R}^{C\times H\times W}$ .
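The data flow of Eqs. (5) to (11) can be traced with a small NumPy forward pass. This is only a shape-level sketch under stated assumptions: the point-wise convolutions use random weights, batch normalization is omitted, the hidden width of the two-layer PWConv and the choice of the channel axis for the Softmax in Eq. (9) are guesses, and all function and variable names are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)


def pwconv(x, weight):
    """1x1 convolution: weight (C_out, C_in) applied to x (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)


def mim(x, w_y, w_squeeze, w_excite):
    """Maximum information modulation, Eqs. (5)-(11), without batch norm."""
    c, h, w = x.shape
    y = leaky_relu(pwconv(x, w_y))                          # Eq. (5): (1, H, W)
    attn = softmax(y.reshape(h * w, 1), axis=0)             # rho(reshape(Y)): (HW, 1)
    v1 = (x.reshape(c, h * w) @ attn).reshape(c, 1, 1)      # Eq. (6): (C, 1, 1)
    v2 = sigmoid(x.max(axis=(1, 2), keepdims=True))         # Eq. (7): GMP, (C, 1, 1)
    m1 = np.concatenate([v1, v2], axis=0)                   # Eq. (8): (2C, 1, 1)
    hidden = leaky_relu(pwconv(m1, w_squeeze))              # squeeze
    m2 = softmax(pwconv(hidden, w_excite), axis=0)          # Eq. (9): (C, 1, 1)
    m3 = sigmoid(x.max(axis=0, keepdims=True))              # Eq. (10): cMaxPool, (1, H, W)
    return m2 * x + m3 * x                                  # Eq. (11): (C, H, W)


rng = np.random.default_rng(0)
C, H, W, HIDDEN = 8, 16, 16, 4                              # illustrative sizes
x = rng.standard_normal((C, H, W))
z = mim(
    x,
    w_y=rng.standard_normal((1, C)),
    w_squeeze=rng.standard_normal((HIDDEN, 2 * C)),
    w_excite=rng.standard_normal((C, HIDDEN)),
)
```

The output keeps the input shape, so the module can sit in a skip connection as described above. The channel branch rescales whole feature maps, while the spatial branch rescales each location by its strongest channel response.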

III-C Utility function

When evaluating segmentation results, the predictions are often divided into four parts: the number of correctly segmented target pixels ( $TPs$ ), the number of background pixels incorrectly segmented as target ( $FPs$ ), the number of correctly segmented background pixels ( $TNs$ ), and the number of target pixels incorrectly segmented as background ( $FNs$ ). The target pixels are composed of $TPs$ and $FNs$ . The background pixels contain $TNs$ and $FPs$ . The confusion matrix of ISTS is shown in TABLE III. In the confusion matrix, the rows represent the predicted masks O of pixelGame, and the columns represent the ground truth G of the input images.

TABLE III: Confusion matrix. Among them, “1” and “0” represent the target and background respectively.

Predicted O \ Ground truth G	Actually Positive (1)	Actually Negative (0)
Predicted Positive (1)	$TPs$	$FPs$
Predicted Negative (0)	$FNs$	$TNs$

In order to achieve high-quality results in the game between FNs-player and FPs-player, we design a novel utility function according to each player’s own focus and the overall constraints of the game. This utility function consists of three parts: player utility, game utility and small target constraints. A well-designed utility function helps the model reach the equilibrium state in the competitive game.

III-C1 Player utility

For player utility, the main goals of FNs-player and FPs-player are to minimize $FNs$ and $FPs$ , respectively. Combined with the confusion matrix in TABLE III, the utility functions $U$ of the players are defined as follows:

U(\textit{{O}}_{1},\textit{{G}})=\frac{FNs}{TNs+FNs},

(12)

U(\textit{{O}}_{2},\textit{{G}})=\frac{FPs}{TPs+FPs}.

(13)

The utility function of FNs-player in Eq. (12) drives the player to predict as many target pixels as possible. In contrast, the utility of FPs-player in Eq. (13) makes the player distinguish the target and background more finely.
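As an illustration, the confusion-matrix counts of TABLE III and the two player utilities of Eqs. (12) and (13) can be computed from binary masks as in the sketch below. The function names and the small constant `eps`, which guards against division by zero, are assumptions made for this example.

```python
import numpy as np

def confusion_counts(o, g):
    """Return (TPs, FPs, FNs, TNs) for a binary prediction o and ground truth g."""
    o = np.asarray(o).astype(bool)
    g = np.asarray(g).astype(bool)
    tp = int(np.sum(o & g))      # target predicted as target
    fp = int(np.sum(o & ~g))     # background predicted as target
    fn = int(np.sum(~o & g))     # target predicted as background
    tn = int(np.sum(~o & ~g))    # background predicted as background
    return tp, fp, fn, tn

def fns_player_utility(o, g, eps=1e-8):
    # Eq. (12): FNs / (TNs + FNs); eps is an assumed numerical guard.
    _, _, fn, tn = confusion_counts(o, g)
    return fn / (tn + fn + eps)

def fps_player_utility(o, g, eps=1e-8):
    # Eq. (13): FPs / (TPs + FPs); eps is an assumed numerical guard.
    tp, fp, _, _ = confusion_counts(o, g)
    return fp / (tp + fp + eps)

g = np.array([[0, 1, 1, 0], [0, 0, 0, 0]])
o = np.array([[0, 1, 0, 1], [0, 0, 0, 0]])
print(confusion_counts(o, g))   # one TP, one FP, one FN, five TNs
print(fns_player_utility(o, g), fps_player_utility(o, g))
```

In this toy mask the FNs-player utility is about 1/6 and the FPs-player utility is about 1/2, so the two players are penalized for different kinds of errors on the same prediction.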

III-C2 Game utility

As the two sides of an antagonistic decision, the two players need to further enhance the complementary differences between them. Therefore, we define the game utility $G$ as follows:

G\left(\textit{{O}}_{1},\textit{{O}}_{2}\mid\textit{{G}}\right)=\left\|\left(\textit{{O}}_{1}-\textit{{G}}\right)\left(\textit{{O}}_{2}-\textit{{G}}\right)\right\|_{2}.

(14)

Eq. (14) makes the incorrectly segmented pixels of FNs-player and FPs-player as different as possible. Game utility further intensifies the confrontation between them. Explicit antagonistic utility constraints not only enhance the complementarity of their results, but also help the game optimization reach the equilibrium state.
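A minimal sketch of Eq. (14) is given below. It assumes that the product of the two error maps is taken element-wise and that the norm runs over all pixels; the function name is hypothetical.

```python
import numpy as np

def game_utility(o1, o2, g):
    """Eq. (14): L2 norm of the element-wise product of the two error maps."""
    e1 = np.asarray(o1, dtype=float) - np.asarray(g, dtype=float)
    e2 = np.asarray(o2, dtype=float) - np.asarray(g, dtype=float)
    # Assumption: the product in Eq. (14) is element-wise (per pixel).
    return float(np.linalg.norm((e1 * e2).ravel(), ord=2))

g = [1, 1, 0, 0]
o1 = [1, 0, 0, 0]        # misses the second target pixel
o2 = [1, 1, 1, 0]        # raises a false alarm on the third pixel
o2_same = [1, 0, 0, 0]   # makes the same mistake as o1

print(game_utility(o1, o2, g))        # errors on different pixels
print(game_utility(o1, o2_same, g))   # errors on the same pixel
```

When the two players err on different pixels the product vanishes and the utility is zero; when they err on the same pixel the utility is positive, so minimizing it pushes their mistakes apart.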

III-C3 Small target constraints

For the specific task of ISTS, we add a small target constraint, the mean of the predicted values $o_{k}$ over the $N$ pixels of the mask O, to ensure that the model is optimized in a reasonable space. It is defined as follows:

A\left(\textit{{O}}\right)=\frac{1}{N}\sum_{k=1}^{N}o_{k}.

(15)

Finally, based on extensive experiments, we weight the three terms equally. The utility $\varPhi$ of pixelGame is as follows:

\varPhi\left(\textit{{O}}_{1},\textit{{O}}_{2}\mid\textit{{G}}\right)=U\left(\textit{{O}},\textit{{G}}\right)+G\left(\textit{{O}}_{1},\textit{{O}}_{2}\mid\textit{{G}}\right)+A\left(\textit{{O}}\right).

(16)
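The small target constraint of Eq. (15) and the equal-weight sum of Eq. (16) can be sketched as follows. How the player terms are aggregated into $U$ is not spelled out here, so the sketch simply takes three already-computed terms; the function names are hypothetical.

```python
import numpy as np

def area_constraint(o):
    """Eq. (15): mean predicted value over all N pixels of the mask."""
    return float(np.mean(np.asarray(o, dtype=float)))

def total_utility(u_term, g_term, a_term):
    """Eq. (16): equal-weight sum of player utility, game utility and constraint."""
    return u_term + g_term + a_term

# A 64 x 64 mask with one 3 x 3 target: the constraint stays small.
small = np.zeros((64, 64))
small[30:33, 30:33] = 1.0
print(area_constraint(small))

# Predicting every pixel as target maximizes the penalty.
print(area_constraint(np.ones((64, 64))))
```

Because infrared small targets occupy only a few pixels, a mask that marks large regions as target receives a large value of $A$, which discourages over-segmentation.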

III-D PixelGame Network

$\mathrm{FDCN_{9}}$ and $\mathrm{FDCN_{13}}$ are the backbones of the pixelGame network. A high-resolution prediction feature map is indispensable for small target segmentation. The dilated convolution captures a larger receptive field without reducing the spatial resolution of the feature. The dilation factors of the decoder are symmetrical to those of the encoder. In the encoder-decoder structure, feature maps with the same dilation factor exchange information across layers through skip connections.

A larger dilation factor results in a larger receptive field. On the one hand, some pixels in the large receptive field are not fully utilized. On the other hand, the long-distance dependence of the pixels captured by the large receptive field is not accurate. Therefore, we use the $\mathrm{MIM}$ module to improve the feature representation of small objects.

The specific implementation is shown in Algorithm 1.

Data: input image $\mathbf{I}$ , ground truth G, utility function $\varPhi$ , game player networks $\varTheta\{{S_{1},S_{2}}\}$ , maximum information modulation ( $\mathrm{MIM}$ )

Output: equilibrium state of the game O

$S_{1}\leftarrow\mathrm{MIM}(\mathrm{FDCN_{9}})$
$S_{2}\leftarrow\mathrm{MIM}(\mathrm{FDCN_{13}})$
foreach epoch do
    $\textit{{O}}_{1},\textit{{O}}_{2}\leftarrow S_{1}(\mathbf{I}),S_{2}(\mathbf{I})$
    $\varPhi\leftarrow$ Eq. (16)
    $S_{1},S_{2}\leftarrow\arg\min\varPhi$
end foreach
$\textit{{O}}\leftarrow mean(\textit{{O}}_{1},\textit{{O}}_{2})$
return O

Function MIM( $\mathrm{FDCN_{2n+1}}$ ):
    $\mathrm{FDCN_{2n+1}}=\{x_{1},...,x_{n},x_{n+1},z_{n},...,z_{1},o\}$ , where $x_{n}$ and $z_{n}$ are the feature maps of each layer of the encoder and decoder, respectively.
    foreach layer $k\in[2,n]$ do
        $z\leftarrow m(x_{k})$ using Eq. (11)
        $z_{k}^{\ast}\leftarrow z_{k}+z$
    end foreach
    return $S\leftarrow\{x_{1},x_{2},...,x_{n},x_{n+1},z_{n}^{\ast},...,z_{2}^{\ast},z_{1},o\}$

Algorithm 1 PixelGame: ISTS as a Nash Equilibrium
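The two structural steps of Algorithm 1, enhancing the decoder features with modulated encoder features and fusing the two player outputs by their mean, can be sketched as below. This is only an illustration: the callable `m` stands in for the modulation of the $\mathrm{MIM}$ module, the function names are hypothetical, and encoder and decoder maps at the same layer are assumed to share a shape.

```python
import numpy as np

def enhance_decoder(xs, zs, m):
    """Add modulated encoder features to decoder features for layers 2..n.

    xs: encoder maps [x_1, ..., x_{n+1}]; zs: decoder maps [z_1, ..., z_n].
    Returns a new list in which z_k is replaced by z_k + m(x_k) for k = 2..n,
    while z_1 is left unchanged, as in Algorithm 1.
    """
    n = len(zs)
    out = list(zs)
    for k in range(2, n + 1):              # 1-based layer index
        out[k - 1] = zs[k - 1] + m(xs[k - 1])
    return out

def fuse(o1, o2):
    """Final prediction: mean of the two player outputs."""
    return (np.asarray(o1, dtype=float) + np.asarray(o2, dtype=float)) / 2.0

# Toy example with n = 3 and a stand-in modulation that halves its input.
xs = [np.full((2, 2), float(k)) for k in (1, 2, 3, 4)]
zs = [np.zeros((2, 2)) for _ in range(3)]
enhanced = enhance_decoder(xs, zs, m=lambda x: 0.5 * x)
print([float(z[0, 0]) for z in enhanced])

print(fuse([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))
```

Pixels on which both players agree keep their value after fusion, while pixels on which they disagree land at an intermediate value.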

IV Experimental results and analysis

In order to analyze the potential of pixelGame based on game theory, we compare it with related SOTA methods on the NUST-ISTS and NUAA-ISTS datasets. Moreover, we conduct ablation experiments to verify the effectiveness of different components in pixelGame. In particular, the following questions are investigated in our experimental evaluation.

1. Q1: Our key insight is to transform the antagonistic decision of $FNs$ and $FPs$ using the same strategy into a competitive game in which two player networks participate. Based on game theory, we study how FNs-player and FPs-player achieve a delicate balance under the guidance of a utility function with inherent complementarity and optimization antagonism (Section IV-B1).

2. Q2: We further explore the contribution of each component of the utility function to the Nash equilibrium in pixelGame (Section IV-B2).

3. Q3: We examine whether, for the antagonistic decision in pixelGame, a multi-player strategy game performs better than a single objective integrated through weighted sums (Section IV-B3).

4. Q4: We compare our pixelGame with other SOTA methods, show the segmentation results, and verify the effectiveness of the $\mathrm{MIM}$ module (Section IV-C).

IV-A Experimental Settings

The experiments are conducted on a computer with a 3.0 GHz CPU, 128 GB RAM, and four NVIDIA GeForce RTX 3090 GPUs. The pixelGame is implemented in PyTorch. We use Adam as the optimizer. The mini-batch size is set to 8. The learning rate is set to $10^{-5}$ for both FNs-player and FPs-player. The weights of the players are initialized using the identity initialization technique. All models are trained from scratch, for 70 epochs on NUST-ISTS and 400 epochs on NUAA-ISTS.

IV-A1 Datasets

Two pixel-level annotated IR small target datasets with diverse scenes and targets are used to verify the performance of the proposed methods, as shown in TABLE I.

To increase the number of training samples, NUST-ISTS randomly samples many patches of the same size from the original images, which add up to 10,000 training patches under different configurations [16]. NUAA-ISTS contains 427 representative images and 480 instances of different scenarios from hundreds of real-world videos [7]. Since the training images released by the authors of NUST-ISTS are cropped, the training data composed of cropped images and the testing data differ in distribution. Therefore, we re-divide the dataset: we randomly sample 80% of the original training data and the testing data to form the new training dataset, and the rest forms the new test dataset. Following the setting in [59], in order to stack images of different sizes into a batch, each image is resized to 512 $\times$ 512 and randomly cropped to 480 $\times$ 480 during training.
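The random cropping step can be sketched as follows. The sketch assumes the image and its mask have already been resized to 512 $\times$ 512 (the resize itself is omitted because it depends on an image library), and the function name and the `rng` argument are hypothetical. The key point is that the image and its pixel-level mask must share the same crop window.

```python
import numpy as np

def random_crop_pair(image, mask, size=480, rng=None):
    """Crop the same random size x size window from an image and its mask."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    # Generator.integers excludes the upper bound, hence the +1.
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    window = (slice(top, top + size), slice(left, left + size))
    return image[window], mask[window]

rng = np.random.default_rng(0)
image = np.arange(512 * 512).reshape(512, 512)
mask = image.copy()          # stand-in mask, used to check alignment
crop_img, crop_mask = random_crop_pair(image, mask, size=480, rng=rng)
print(crop_img.shape, crop_mask.shape)
```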

IV-A2 Implementation Details

We compare our proposed pixelGame with several SOTA small target detection and segmentation methods: 22 traditional methods and 3 CNN-based methods.

These methods can be categorized into four groups: i) traditional spatial domain-based methods (Top-Hat [9], MNWTH [10], Max-Median [35], NWIE [36], MoG-MRF [37], DPIR [38], ADMD [39]); ii) human visual system-based methods (LCM [11], ILCM [43], MPCM [44], RLCM [45], TLLCM [46], WSLCM [47], MDWCM [48]); iii) optimization-based methods (IPI [49], SMSL [52], NIPPS [50], TV-PCP [51], SRWS [53], WSNMSTIPT [54], ECA-STT [55], NTFRA [57]); and iv) CNN-based methods (MDvsFA-cGAN [16], ACM [7], ALCNet [59]).

IV-A3 Evaluation Metrics

We use precision ( $\mathrm{P}$ ), recall ( $\mathrm{R}$ ), F1 score ( $\mathrm{F_{1}}$ ) and intersection over union ( $\mathrm{IoU}$ ) to evaluate the infrared small target segmentation methods.

The precision measures the proportion of correctly segmented target pixels in all segmented target pixels. The recall measures the proportion of correctly segmented target pixels in all true target pixels. In Eq. (17) and Eq. (18), $\mathrm{P}$ and $\mathrm{R}$ are defined as

\mathrm{P}=\frac{\sum_{i=1}^{N}o_{i}g_{i}}{\sum_{i=1}^{N}o_{i}},

(17)

\mathrm{R}=\frac{\sum_{i=1}^{N}o_{i}g_{i}}{\sum_{i=1}^{N}g_{i}}.

(18)

Besides, $\mathrm{IoU}$ is used to measure the overlap between the predicted mask and the ground truth. $\mathrm{IoU}$ is defined in Eq. (19).

\mathrm{IoU}=\frac{\sum_{i=1}^{N}o_{i}g_{i}}{\sum_{i=1}^{N}\left(o_{i}+g_{i}-o_{i}g_{i}\right)}.

(19)

In Eq. (17), (18) and (19), $o_{i}$ denotes the class probability of the $i$ -th pixel in the segmentation result O. $g_{i}$ is the class probability of the corresponding position of ground truth G.

In order to compare different algorithms with a single score, the $\mathrm{F_{1}}$ measure is computed from Eq. (17) and Eq. (18). $\mathrm{F_{1}}$ evaluates $\mathrm{P}$ and $\mathrm{R}$ together. $\mathrm{F_{1}}$ is defined as follows:

\mathrm{F_{1}}=2/\left(\frac{1}{\mathrm{P}}+\frac{1}{\mathrm{R}}\right)=2\frac{\mathrm{P}\times\mathrm{R}}{\mathrm{P}+\mathrm{R}}.

(20)
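For reference, Eqs. (17)-(20) can be computed directly from the flattened prediction and ground truth, as in the sketch below. The function name and the constant `eps`, which avoids division by zero on empty masks, are assumptions made for this example.

```python
import numpy as np

def segmentation_metrics(o, g, eps=1e-8):
    """Precision, recall, IoU and F1 following Eqs. (17)-(20)."""
    o = np.asarray(o, dtype=float).ravel()
    g = np.asarray(g, dtype=float).ravel()
    inter = float(np.sum(o * g))                          # sum of o_i * g_i
    p = inter / (float(np.sum(o)) + eps)                  # Eq. (17)
    r = inter / (float(np.sum(g)) + eps)                  # Eq. (18)
    iou = inter / (float(np.sum(o + g - o * g)) + eps)    # Eq. (19)
    f1 = 2.0 * p * r / (p + r + eps)                      # Eq. (20)
    return {"P": p, "R": r, "IoU": iou, "F1": f1}

g = [1, 1, 1, 0, 0, 0, 0, 0]
o = [1, 1, 0, 1, 0, 0, 0, 0]
scores = segmentation_metrics(o, g)
print(scores)
```

In this toy case two of the three predicted pixels are correct and two of the three target pixels are found, so precision, recall and $\mathrm{F_{1}}$ are all about 2/3, while $\mathrm{IoU}$ is about 0.5 because the union covers four pixels.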

IV-B Ablation Study

In this section, we study questions Q1-Q3 raised above.

IV-B1 Players Game and Nash Equilibrium (Q1)

In Fig. 8 (a) and (b), the curves clearly trend toward a steady state. As the game between the players deepens, their performance improves gradually and finally reaches the Nash equilibrium. On single metrics, FNs-player has the highest recall and FPs-player has the highest precision, and the game result lies between the two. However, the final $\mathrm{F_{1}}$ score shows that pixelGame as a whole outperforms both of its components. The $\mathrm{F_{1}}$ scores in Fig. 8 (a) and (b) illustrate the feasibility and superiority of our separate optimization method.

Among the players, FNs-player tends to predict the pixels of all potential targets, so the recall curves in Fig. 8 (a) and (b) show that more than 98% of the true target pixels are correctly predicted by FNs-player. In contrast, FPs-player has high precision and low recall: it puts quality before quantity. Therefore, the precision of FPs-player on the two datasets is more than 85%. Two players with opposing strategies help pixelGame achieve a better balance in the overall game framework.

Furthermore, the $\mathrm{F_{1}}$ scores in Fig. 8 show that antagonistic learning at the pixel level makes the two players more complementary at the image level. The fusion result of the two players is considerably better than that of a single player. As shown in Fig. 8 (a), the fusion results of FNs-player and FPs-player are improved by about 0.14 and 0.37 on the NUST-ISTS dataset, respectively. Fig. 8 (b) shows that the fusion results of the two players are improved by about 0.05 and 0.46 on the NUAA-ISTS dataset, respectively. The performance improvement demonstrates the effectiveness of multi-player adversarial learning based on game theory. Under the guidance of the proposed utility function, the model achieves higher accuracy and a more delicate balance between $FNs$ and $FPs$ .

Nash equilibrium [24], [26] represents a static state, where no player is willing to change strategy any more. During the pixelGame, Fig. 9 (a) and (b) illustrate the evolution of player utilities for FNs-player and FPs-player on the NUST-ISTS and NUAA-ISTS datasets, respectively. As shown in Fig. 9 (a) and (b), the utilities of the two players in the game tend to be steady after decline, and finally reach a stable state. The invariance of the utility functions of two players means that the players do not want to change their strategies, which means the pixelGame reaches a Nash equilibrium.
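The stabilization of the utility curves can be checked numerically with a simple plateau test, sketched below. This is an illustrative heuristic rather than the criterion used in the experiments: the window length `window` and the tolerance `tol` are hypothetical, and a plateau in both utilities is only a practical proxy for the equilibrium described above.

```python
def has_stabilized(utilities, window=5, tol=1e-3):
    """True if the last `window` utility values vary by at most `tol`."""
    if len(utilities) < window:
        return False
    recent = utilities[-window:]
    return max(recent) - min(recent) <= tol

def at_empirical_equilibrium(u_fn, u_fp, window=5, tol=1e-3):
    """Both players' utilities have stopped changing over the recent window."""
    return has_stabilized(u_fn, window, tol) and has_stabilized(u_fp, window, tol)

# Made-up utility histories, one value per epoch.
u_fn = [0.9, 0.5, 0.3, 0.2, 0.2001, 0.2002, 0.2001, 0.2, 0.2001]
u_fp_flat = [0.8, 0.6, 0.41, 0.4, 0.4001, 0.4, 0.4002, 0.4001, 0.4]
u_fp_moving = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.25, 0.2, 0.15]

print(at_empirical_equilibrium(u_fn, u_fp_flat))
print(at_empirical_equilibrium(u_fn, u_fp_moving))
```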

IV-B2 Impact of Utility Function (Q2)

In order to understand which utility function is critical to the game, we analyze the results on the NUST-ISTS and NUAA-ISTS datasets for each component of the utility function.

The contribution of different components of the utility function to the model is shown in TABLE IV. It can be seen from TABLE IV that the competitive game under the player utility constraint alone already achieves high $\mathrm{F_{1}}$ and $\mathrm{IoU}$ . $FNs$ and $FPs$ , as two kinds of incorrectly segmented pixels, are the two opposing sides in the overall goal of the game. As shown in TABLE IV, under the guidance of player utility, most of the incorrectly segmented pixels are correctly segmented. It can be observed from TABLE IV and Fig. 10 that game utility makes the results of the two players more complementary, and improves the results by about 5% and 10% on NUST-ISTS and NUAA-ISTS, respectively. The game utility makes the direct competition between the two players more purposeful. The area constraint of the small target is conducive to faster convergence of the player networks, and it improves the performance of the model to a certain extent.

We further analyze the impact of game utility on FNs-player and FPs-player. The game utility urges the two players to pay attention to different predicted pixels. As shown in Fig. 10 (a) and (b), the game utility results in a better Nash equilibrium state for pixelGame.

TABLE IV: Results of ablation experiments for utility functions. In the table, $U$ , $G$ and $A$ represent player utility, game utility and small target constraint, respectively. Bold font highlights the best results in each column.

Utility functions	NUST-ISTS		NUAA-ISTS
Utility functions	$\mathrm{F_{1}}$	$\mathrm{IoU}$	$\mathrm{F_{1}}$	$\mathrm{IoU}$
$U$	0.5618	0.4531	0.7418	0.6129
$U+A$	0.6047	0.4796	0.8095	0.7400
$U+G$	0.6272	0.5092	0.8493	0.7339
$U+A+G$	0.6387	0.5135	0.8659	0.7452

IV-B3 The Advantage of Separate Game Objectives Compared to Combined Objectives (Q3)

Next, we compare the performance of different classical combined objective functions, including $Dice\ loss$ [18], $IoU\ loss$ [19], and sensitivity specificity loss ( $SS\ loss$ ) [22], etc. The parameter controlling the balance between sensitivity and specificity in $SS\ loss$ is set to $0.5$ .

TABLE V shows that the model performs poorly under the guidance of the combined loss functions. In ISTS, infrared targets are dim, small and sparse within images. These challenges prevent commonly used combined loss functions from focusing effectively on the small number of target pixels and from distinguishing background and noise.
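For context, commonly used forms of the three combined objectives are sketched below. These are standard textbook formulations written for illustration; the exact variants in [18], [19] and [22], and those used in the experiments, may differ. The function names and `eps` are assumptions, and `r=0.5` mirrors the balance parameter stated above for $SS\ loss$ .

```python
import numpy as np

def dice_loss(o, g, eps=1e-8):
    o, g = np.asarray(o, dtype=float), np.asarray(g, dtype=float)
    return 1.0 - 2.0 * np.sum(o * g) / (np.sum(o) + np.sum(g) + eps)

def iou_loss(o, g, eps=1e-8):
    o, g = np.asarray(o, dtype=float), np.asarray(g, dtype=float)
    return 1.0 - np.sum(o * g) / (np.sum(o + g - o * g) + eps)

def ss_loss(o, g, r=0.5, eps=1e-8):
    """Sensitivity-specificity loss: r weights the error on target pixels,
    (1 - r) weights the error on background pixels."""
    o, g = np.asarray(o, dtype=float), np.asarray(g, dtype=float)
    sq = (g - o) ** 2
    sens_err = np.sum(sq * g) / (np.sum(g) + eps)
    spec_err = np.sum(sq * (1.0 - g)) / (np.sum(1.0 - g) + eps)
    return r * sens_err + (1.0 - r) * spec_err

g = [1, 1, 0, 0, 0, 0, 0, 0]
o = [1, 0, 1, 0, 0, 0, 0, 0]   # one missed target pixel, one false alarm
print(dice_loss(o, g), iou_loss(o, g), ss_loss(o, g))
```

Each of these objectives folds the missed pixel and the false alarm into a single scalar that one network minimizes, which is the single-actor setting the separate game objectives are compared against.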

TABLE V: Results ( $\mathrm{F_{1}}/\mathrm{IoU}$ ) of ablation experiments for combined objectives ( $IoU\ loss$ , $Dice\ loss$ and $SS\ loss$ ) and separated game objective ( $Ours$ ). The best results are highlighted in boldface.

Objectives	NUST-ISTS		NUAA-ISTS
Objectives	FNs-player	FPs-player	FNs-player	FPs-player
$IoU\ loss$	0.54/0.45	0.50/0.41	0.77/0.66	0.76/0.65
$Dice\ loss$	0.55/0.46	0.48/0.40	0.76/0.65	0.76/0.66
$SS\ loss$	0.33/0.22	0.32/0.21	0.37/0.25	0.35/0.23
$Ours$	0.27/0.17	0.51/0.43	0.38/0.29	0.82/0.72

The $\mathrm{IoU}$ and $\mathrm{Dice}$ similarity coefficients have very similar equations, and both are among the most commonly used evaluation metrics in object segmentation. As reported in TABLE V, using $IoU\ loss$ and $Dice\ loss$ on the NUST-ISTS and NUAA-ISTS datasets, FNs-player and FPs-player only reach about 0.50 and 0.76 on $\mathrm{F_{1}}$ , and the $\mathrm{IoU}$ of these two sub-networks is about 0.45 and 0.65, respectively. The optimization objectives of both losses mainly focus on $TPs$ and treat the two classes of incorrectly segmented pixels equally. However, because the IR targets are small, the model easily over-segments. The over-segmented pixels are mainly noise and background near the target, which appear as $FPs$ . The model cannot effectively balance the small number of very critical pixels, such as the edges of dim small targets ( $FPs$ and $FNs$ ), resulting in stagnant performance.

Although $FNs$ and $FPs$ are considered explicitly in $SS\ loss$ , its performance is not ideal. From the perspective of features, false positive and false negative pixels are similar, but their optimization strategies are different or even opposite. For ISTS, the $SS\ loss$ function uses the same strategy to optimize this antagonistic decision, so the model cannot converge to the global optimal solution.

Comparing the results of combined objectives and separate optimization in TABLE V, it can be seen that FNs-player alone scores higher with the combined objectives, whereas FPs-player achieves its best results with our separate objective. Considering the beneficial balance that FNs-player provides to FPs-player, the performance of pixelGame is further improved. This also supports the strong performance of our FNs-player and FPs-player network structure for ISTS.

TABLE VI: Comparison with the state-of-the-art ISTS methods. The best result in each column is in red, the second is in blue, and the third is in green. The pixelGame (w/o $\mathrm{MIM}$ ) is the baseline model.

	NUST-ISTS				NUAA-ISTS
Method	Precision	Recall	$\mathrm{F_{1}}$	IoU	Precision	Recall	$\mathrm{F_{1}}$	IoU
Top-Hat[9] (OE 1996)	0.09	0.21	0.12	0.08	0.14	0.11	0.12	0.13
Max-Median[35] (ISOP 1999)	0.05	0.14	0.05	0.03	0.04	0.18	0.04	0.03
MNWTH[10] (PR 2010)	0.23	0.61	0.27	0.18	0.18	0.27	0.22	0.27
IPI[49] (TIP 2013)	0.51	0.49	0.50	0.38	0.31	0.48	0.34	0.28
LCM[11] (TGRS 2013)	0.15	0.36	0.21	0.13	0.15	0.29	0.22	0.14
ILCM[43] (GRSL 2014)	0.14	0.22	0.20	0.13	0.14	0.25	0.21	0.13
MPCM[44] (PR 2016)	0.28	0.45	0.34	0.24	0.29	0.37	0.34	0.27
NWIE[36] (TAES 2016)	0.23	0.21	0.21	0.18	0.31	0.21	0.23	0.14
NIPPS[50] (INFPHY 2017)	0.12	0.29	0.15	0.10	0.29	0.31	0.32	0.21
TV-PCP[51] (ICV 2017)	0.38	0.25	0.34	0.22	0.32	0.47	0.38	0.40
SMSL[52] (TGRS 2017)	0.54	0.20	0.26	0.17	0.40	0.45	0.41	0.40
RLCM[45] (GRSL 2018)	0.39	0.45	0.41	0.31	0.46	0.44	0.44	0.31
MoG-MRF[37] (PR 2018)	0.22	0.45	0.28	0.20	0.25	0.34	0.31	0.23
DPIR[38] (GRSL 2019)	0.52	0.21	0.24	0.17	0.45	0.32	0.37	0.27
TLLCM[46] (GRSL 2019)	0.52	0.42	0.41	0.35	0.60	0.62	0.54	0.44
WSLCM[47] (GRSL 2020)	0.38	0.30	0.33	0.22	0.58	0.42	0.50	0.38
WSNMSTIPT[54] (INFPHY 2019)	0.28	0.15	0.15	0.13	0.26	0.25	0.21	0.25
MDvsFA-cGAN[16] (ICCV 2019)	0.63	0.65	0.61	0.47	0.72	0.77	0.71	0.60
MDWCM[48] (GRSL 2020)	0.38	0.42	0.40	0.43	0.32	0.43	0.37	0.41
ECA-STT[55] (TGRS 2020)	0.53	0.46	0.49	0.38	0.79	0.58	0.64	0.53
ADMD[39] (SP 2020)	0.57	0.22	0.27	0.19	0.69	0.38	0.42	0.30
ACM[7] (WACV 2021)	0.55	0.74	0.60	0.44	0.83	0.84	0.83	0.71
SRWS[53] (NC 2021)	0.56	0.34	0.44	0.33	0.74	0.47	0.53	0.39
NTFRA[57] (TGRS 2021)	0.47	0.43	0.46	0.38	0.67	0.55	0.57	0.44
ALCNet[59] (TGRS 2021)	0.56	0.73	0.60	0.45	0.88	0.84	0.85	0.73
pixelGame (w/o $\mathrm{MIM}$ )	0.60	0.72	0.61	0.48	0.86	0.85	0.85	0.74
pixelGame	0.66	0.74	0.64	0.51	0.88	0.86	0.86	0.75

IV-C Comparison to State-of-the-Art Approaches

Finally, we address question Q4 by comparing our pixelGame with several CNN-based methods and other traditional mathematical modeling methods. For a long time, the lack of an open benchmark that allows various algorithms to be compared fairly has been one of the bottlenecks hindering the development of ISTS. We have summarized the existing detection and segmentation methods for infrared small targets, which helps to promote ISTS work. The comparison between pixelGame and the SOTA methods is reported in TABLE VI. In the experiments, we evaluate the performance of the algorithms at the pixel level using precision, recall, $\mathrm{F_{1}}$ and $\mathrm{IoU}$ .

Compared with traditional algorithms, pixelGame is comfortably ahead in both the single scores (precision and recall) and the comprehensive scores ( $\mathrm{F_{1}}$ and $\mathrm{IoU}$ ). Among CNN-based models, pixelGame better suppresses $FNs$ and $FPs$ , and achieves a more delicate balance between precision and recall. In TABLE VI, our method achieves the highest $\mathrm{F_{1}}$ (0.64 and 0.86) and $\mathrm{IoU}$ (0.51 and 0.75) on the NUST-ISTS and NUAA-ISTS datasets, respectively. For the foreground-background imbalance problem, our pixelGame achieves the best balance between $FNs$ and $FPs$ . The precision and recall of our model (0.66 and 0.74 on NUST-ISTS, 0.88 and 0.86 on NUAA-ISTS) are closer to each other and more balanced.

Moreover, it can be seen from TABLE VI that deep learning models perform better than the traditional algorithms. Specifically, the CNN-based algorithms mostly design the loss function of the model at the pixel level, such as MDvsFA-cGAN [16] (based on miss detection and false alarm), ACM [7] and ALCNet [59] (based on $nIoU$ ). This demonstrates that the pixel-wise loss function has significant advantages in dense segmentation tasks. The traditional algorithms, such as Top-Hat [9] and LCM [11], mostly suppress the background and enhance the difference between the target and the background at the image level. Similarly, algorithms that recover the target from the feature space, such as the IPI [49] and IPT [53] series, can determine the target position and approximate shape, but cannot achieve precise control at the pixel level.

Most of the existing CNN-based methods treat different kinds of error pixels identically during optimization. For example, ACM [7] and ALCNet [59] use $IoU\ loss$ as the optimization function. Such integrated loss methods are not suitable for handling the extreme foreground-background imbalance problem of ISTS. MDvsFA-cGAN [16] makes the two sub-networks focus on false positive pixels and false negative pixels, respectively, by loss re-weighting. However, in essence, MDvsFA-cGAN optimizes $FNs$ and $FPs$ at the same time, which is not pure antagonistic learning between $FNs$ and $FPs$ . In contrast, our method based on game theory optimizes false positive pixels and false negative pixels separately. The pixelGame allows flexible selection of different optimization strategies and reaches Nash equilibrium more easily.

To demonstrate the effectiveness of $\mathrm{MIM}$ , we visualize the feature maps enhanced by $\mathrm{MIM}$ module. It can be observed from Fig. 6 that $\mathrm{MIM}$ performs better in capturing IR small targets with low SCR than other attention methods. The main content in infrared image is the building, while the region of target is very small compared with the main building. SENet, GCNet and Triplet attention tend to focus on the main building and neglect the small target. Differently, $\mathrm{MIM}$ prominently enhances the target information and suppresses the background. Benefiting from the different dilation factors, $\mathrm{MIM}$ effectively focuses on the salient region including small targets, which better suits our purpose. Fig. 6 suggests that the enhanced features by $\mathrm{MIM}$ are powerful.

In addition, we design ablation experiments by removing $\mathrm{MIM}$ (denoted as w/o $\mathrm{MIM}$ ). We apply simple addition when removing each module. From the last two rows of TABLE VI, compared to pixelGame (w/o $\mathrm{MIM}$ ), pixelGame improves the segmentation accuracy by about 3% and 1% on the two datasets, respectively. As can be seen from TABLE VII, $\mathrm{MIM}$ achieves the best performance on both datasets. The $\mathrm{F_{1}}$ score of $\mathrm{MIM}$ is about 2% higher on average than other methods. TABLE VI and TABLE VII show the effectiveness of $\mathrm{MIM}$ for local significant context information enhancement in ISTS tasks.

TABLE VII: Results of ablation experiments for attention modules. Bold font highlights the best results in each column.

Method	NUST-ISTS		NUAA-ISTS
Method	$\mathrm{F_{1}}$	IoU	$\mathrm{F_{1}}$	IoU
w/ SENet	0.59	0.50	0.81	0.73
w/ GCNet	0.62	0.51	0.83	0.74
w/ Triplet	0.55	0.47	0.75	0.67
w/ MIM	0.64	0.51	0.86	0.75

TABLE VIII: Complexity analysis of CNN-based models on the NUST-ISTS dataset. Params refers to the total number of parameters that need to be trained. The speed (FPS) of the model is measured on an RTX 2080 Ti. FLOPs are floating point operations. The best results are highlighted in boldface.

Method	Params	FPS	FLOPs	IoU
ACM	0.63M	100	1.75G	0.44
ALCNet	0.52M	100	1.41G	0.45
MDvsFA-cGAN	3.76M	2.63	868.75G	0.47
pixelGame	1.69M	9.09	273.15G	0.51

We further study the spatiotemporal complexity of CNN-based methods. The results are shown in TABLE VIII. ACM and ALCNet use ResNet-20 [62] as the backbone architecture. Both MDvsFA-cGAN and pixelGame employ dilated convolutional networks. MDvsFA-cGAN needs more parameters due to more channels of features and fully connected layers in the discriminator. The computational complexity is relatively high. It can be seen from TABLE VIII that pixelGame trades off speed for a high improvement in segmentation accuracy.

Some segmentation results are shown in Fig. 11 and Fig. 12. They illustrate that pixelGame segments dim small targets with low SCR accurately and robustly. Due to the low spatial resolution, the targets of the NUST-ISTS dataset are small and dim, and some bright background noise is incorrectly predicted as target pixels. The background of the NUAA-ISTS dataset mostly comes from real scenes, where the contrast between the target and the background is relatively larger, so the segmentation results are more refined.

By and large, the method based on deep learning has obvious advantages in feature representation compared with the traditional methods. Combining the advantages of deep learning and attention model, our method based on game theory achieves a better balance in precision and recall, and achieves the best performance in $\mathrm{F_{1}}$ and $\mathrm{IoU}$ . For small targets in infrared images with low SCR, our proposed $\mathrm{MIM}$ effectively improves the quality of extracted features.

V Conclusion

In ISTS, the IR targets are small in size, weak in energy, and sparse in layout. To solve these problems, we transform infrared small target segmentation into a competitive game problem. Under the guidance of the utility function, each player optimizes different pixels. In the continuous game confrontation, the network continues to learn and finally reaches Nash equilibrium. The proposed $\mathrm{MIM}$ performs well in sensing small target signals. Compared with traditional methods and deep learning methods based on combined loss functions, our pixelGame achieves a better balance between precision and recall, and the highest $\mathrm{F_{1}}$ and $\mathrm{IoU}$ .

For future work, we will try to further improve the efficiency of game optimization, and introduce more prior knowledge to compensate for the lack of infrared small target information.

Acknowledgment

The authors would like to thank the editor and the anonymous reviewers for their constructive comments, which are very helpful to improve the quality of this article.

References

[1] S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A small target detection benchmark,” Journal of Visual Communication and Image Representation, vol. 34, pp. 187–203, 2016.
[2] G. Chen, H. Wang, K. Chen, Z. Li, Z. Song, Y. Liu, W. Chen, and A. Knoll, “A survey of the four pillars for small object detection: Multiscale representation, contextual information, super-resolution, and region proposal,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 52, no. 2, pp. 936–953, 2022.
[3] P. Khaledian, S. Moradi, and E. Khaledian, “A new method for detecting variable-size infrared targets,” in Sixth International Conference on Digital Image Processing, vol. 9159, 2014, p. 91591J.
[4] Q. Yu, L. Xie, Y. Wang, Y. Zhou, E. K. Fishman, and A. L. Yuille, “Recurrent saliency transformation network: Incorporating multi-stage visual cues for small organ segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8280–8289.
[5] M. H. Hesamian, W. Jia, X. He, and P. Kennedy, “Deep learning techniques for medical image segmentation: achievements and challenges,” Journal of Digital Imaging, vol. 32, no. 4, pp. 582–596, 2019.
[6] J. Shen, T. Li, C. Hu, H. He, and J. Liu, “Automatic cell segmentation using mini-u-net on fluorescence in situ hybridization images,” in Medical Imaging 2019: Computer-Aided Diagnosis, vol. 10950, 2019, p. 109502T.
[7] Y. Dai, Y. Wu, F. Zhou, and K. Barnard, “Asymmetric contextual modulation for infrared small target detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 950–959.
[8] M. Everingham and J. Winn, “The pascal visual object classes challenge 2012 (voc2012) development kit,” Pattern Analysis, Statistical Modelling and Computational Learning, Tech. Rep, vol. 8, p. 5, 2011.
[9] J.-F. Rivest and R. Fortin, “Detection of dim targets in digital infrared imagery by morphological image processing,” Optical Engineering, vol. 35, no. 7, pp. 1886–1893, 1996.
[10] X. Bai and F. Zhou, “Analysis of new top-hat transformation and the application for infrared dim small target detection,” Pattern Recognition, vol. 43, no. 6, pp. 2145–2156, 2010.
[11] C. P. Chen, H. Li, Y. Wei, T. Xia, and Y. Y. Tang, “A local contrast method for small infrared target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 1, pp. 574–581, 2013.
[12] Q. Liu, X. Li, Z. He, N. Fan, D. Yuan, W. Liu, and Y. Liang, “Multi-task driven feature models for thermal infrared tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 604–11 611.
[13] D. Guo, L. Zhu, Y. Lu, H. Yu, and S. Wang, “Small object sensitive segmentation of urban street scene with spatial adjacency between object classes,” IEEE Transactions on Image Processing, vol. 28, no. 6, pp. 2643–2653, 2018.
[14] Z. Fan, D. Bi, L. Xiong, S. Ma, L. He, and W. Ding, “Dim infrared image enhancement based on convolutional neural network,” Neurocomputing, vol. 272, pp. 396–404, 2018.
[15] R. Hamaguchi, A. Fujita, K. Nemoto, T. Imaizumi, and S. Hikosaka, “Effective use of dilated convolutions for segmenting small object instances in remote sensing imagery,” in 2018 IEEE Winter Conference on Applications of Computer Vision. IEEE, 2018, pp. 1442–1450.
[16] H. Wang, L. Zhou, and L. Wang, “Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8509–8518.
[17] J. Ma, J. Chen, M. Ng, R. Huang, Y. Li, C. Li, X. Yang, and A. L. Martel, “Loss odyssey in medical image segmentation,” Medical Image Analysis, vol. 71, p. 102035, 2021.
[18] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnu-net: a self-configuring method for deep learning-based biomedical image segmentation,” Nature methods, vol. 18, no. 2, pp. 203–211, 2021.
[19] M. A. Rahman and Y. Wang, “Optimizing intersection-over-union in deep neural networks for image segmentation,” in International Symposium on Visual Computing. Springer, 2016, pp. 234–244.
[20] S. S. M. Salehi, D. Erdogmus, and A. Gholipour, “Tversky loss function for image segmentation using 3d fully convolutional deep networks,” in International Workshop on Machine Learning in Medical Imaging. Springer, 2017, pp. 379–387.
[21] S. R. Hashemi, S. S. M. Salehi, D. Erdogmus, S. P. Prabhu, S. K. Warfield, and A. Gholipour, “Asymmetric loss functions and deep densely-connected networks for highly-imbalanced medical image segmentation: Application to multiple sclerosis lesion detection,” IEEE Access, vol. 7, pp. 1721–1735, 2018.
[22] T. Brosch, Y. Yoo, L. Y. Tang, D. K. Li, A. Traboulsee, and R. Tam, “Deep convolutional encoder networks for multiple sclerosis lesion segmentation,” in International Conference on Medical Image Computing and Computer-assisted Intervention. Springer, 2015, pp. 3–11.
[23] S. Yang, J. Kweon, and Y.-H. Kim, “Major vessel segmentation on X-ray coronary angiography using deep networks with a novel penalty loss function,” in International Conference on Medical Imaging with Deep Learning–Extended Abstract Track, 2019.
[24] M. Kallel, R. Aboulaich, A. Habbal, and M. Moakher, “A Nash-game approach to joint image restoration and segmentation,” Applied Mathematical Modelling, vol. 38, no. 11-12, pp. 3038–3053, 2014.
[25] T. Wang, Q. Sun, Z. Ji, Q. Chen, and P. Fu, “Multi-layer graph constraints for interactive image segmentation via game theory,” Pattern Recognition, vol. 55, pp. 28–44, 2016.
[26] Y.-G. Hsieh, K. Antonakopoulos, and P. Mertikopoulos, “Adaptive learning in continuous games: Optimal regret bounds and convergence to Nash equilibrium,” in Conference on Learning Theory. PMLR, 2021, pp. 2388–2422.
[27] H. Tembine, “Deep learning meets game theory: Bregman-based algorithms for interactive deep generative adversarial networks,” IEEE Transactions on Cybernetics, vol. 50, no. 3, pp. 1132–1145, 2019.
[28] I. Gemp, B. McWilliams, C. Vernade, and T. Graepel, “EigenGame: PCA as a Nash equilibrium,” in International Conference on Learning Representations, 2021.
[29] P. Hu and D. Ramanan, “Finding tiny faces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 951–959.
[30] S. Mittal, M. Tatarchenko, and T. Brox, “Semi-supervised semantic segmentation with high- and low-level consistency,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 4, pp. 1369–1379, 2019.
[31] Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling, “M2Det: A single-shot object detector based on multi-level feature pyramid network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 9259–9266.
[32] X. Zhang, Y. Zou, and W. Shi, “Dilated convolution neural network with LeakyReLU for environmental sound classification,” in 2017 22nd International Conference on Digital Signal Processing. IEEE, 2017, pp. 1–5.
[33] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, “GCNet: Non-local networks meet squeeze-excitation networks and beyond,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 1971–1980.
[34] D. Misra, T. Nalamada, A. U. Arasanipalai, and Q. Hou, “Rotate to attend: Convolutional triplet attention module,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3139–3148.
[35] S. D. Deshpande, M. H. Er, R. Venkateswarlu, and P. Chan, “Max-mean and max-median filters for detection of small targets,” in Signal and Data Processing of Small Targets 1999, vol. 3809, 1999, pp. 74–83.
[36] H. Deng, X. Sun, M. Liu, C. Ye, and X. Zhou, “Infrared small-target detection using multiscale gray difference weighted image entropy,” IEEE Transactions on Aerospace and Electronic Systems, vol. 52, no. 1, pp. 60–72, 2016.
[37] C. Gao, L. Wang, Y. Xiao, Q. Zhao, and D. Meng, “Infrared small-dim target detection based on Markov random field guided noise modeling,” Pattern Recognition, vol. 76, pp. 463–475, 2018.
[38] S. Huang, Z. Peng, Z. Wang, X. Wang, and M. Li, “Infrared small target detection by density peaks searching and maximum-gray region growing,” IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 12, pp. 1919–1923, 2019.
[39] S. Moradi, P. Moallem, and M. F. Sabahi, “Fast and robust small infrared target detection using absolute directional mean difference algorithm,” Signal Processing, vol. 177, p. 107727, 2020.
[40] J. Nie, S. Qu, Y. Wei, L. Zhang, and L. Deng, “An infrared small target detection method based on multiscale local homogeneity measure,” Infrared Physics & Technology, vol. 90, pp. 186–194, 2018.
[41] J. Gao, Y. Guo, Z. Lin, W. An, and J. Li, “Robust infrared small target detection using multiscale gray and variance difference measures,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 12, pp. 5039–5052, 2018.
[42] Y. He, C. Zhang, T. Mu, T. Yan, Y. Wang, and Z. Chen, “Multiscale local gray dynamic range method for infrared small-target detection,” IEEE Geoscience and Remote Sensing Letters, vol. 18, no. 10, pp. 1846–1850, 2021.
[43] J. Han, Y. Ma, B. Zhou, F. Fan, K. Liang, and Y. Fang, “A robust infrared small target detection algorithm based on human visual system,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 12, pp. 2168–2172, 2014.
[44] Y. Wei, X. You, and H. Li, “Multiscale patch-based contrast measure for small infrared target detection,” Pattern Recognition, vol. 58, pp. 216–226, 2016.
[45] J. Han, K. Liang, B. Zhou, X. Zhu, J. Zhao, and L. Zhao, “Infrared small target detection utilizing the multiscale relative local contrast measure,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 4, pp. 612–616, 2018.
[46] J. Han, S. Moradi, I. Faramarzi, C. Liu, H. Zhang, and Q. Zhao, “A local contrast method for infrared small-target detection utilizing a tri-layer window,” IEEE Geoscience and Remote Sensing Letters, vol. 17, no. 10, pp. 1822–1826, 2019.
[47] J. Han, S. Moradi, I. Faramarzi, H. Zhang, Q. Zhao, X. Zhang, and N. Li, “Infrared small target detection based on the weighted strengthened local contrast measure,” IEEE Geoscience and Remote Sensing Letters, vol. 18, no. 9, pp. 1670–1674, 2021.
[48] R. Lu, X. Yang, W. Li, J. Fan, D. Li, and X. Jing, “Robust infrared small target detection via multidirectional derivative-based weighted contrast measure,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022.
[49] C. Gao, D. Meng, Y. Yang, Y. Wang, X. Zhou, and A. G. Hauptmann, “Infrared patch-image model for small target detection in a single image,” IEEE Transactions on Image Processing, vol. 22, no. 12, pp. 4996–5009, 2013.
[50] Y. Dai, Y. Wu, Y. Song, and J. Guo, “Non-negative infrared patch-image model: Robust target-background separation via partial sum minimization of singular values,” Infrared Physics & Technology, vol. 81, pp. 182–194, 2017.
[51] X. Wang, Z. Peng, D. Kong, P. Zhang, and Y. He, “Infrared dim target detection based on total variation regularization and principal component pursuit,” Image and Vision Computing, vol. 63, pp. 1–9, 2017.
[52] X. Wang, Z. Peng, D. Kong, and Y. He, “Infrared dim and small target detection based on stable multisubspace learning in heterogeneous scene,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 10, pp. 5481–5493, 2017.
[53] T. Zhang, Z. Peng, H. Wu, Y. He, C. Li, and C. Yang, “Infrared small target detection via self-regularized weighted sparse model,” Neurocomputing, vol. 420, pp. 124–148, 2021.
[54] Y. Sun, J. Yang, M. Li, and W. An, “Infrared small target detection via spatial–temporal infrared patch-tensor model and weighted Schatten p-norm minimization,” Infrared Physics & Technology, vol. 102, p. 103050, 2019.
[55] P. Zhang, L. Zhang, X. Wang, F. Shen, T. Pu, and C. Fei, “Edge and corner awareness-based spatial–temporal tensor model for infrared small-target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 12, pp. 10708–10724, 2020.
[56] Y. Sun, J. Yang, and W. An, “Infrared dim and small target detection via multiple subspace learning and spatial-temporal patch-tensor model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 5, pp. 3737–3752, 2020.
[57] X. Kong, C. Yang, S. Cao, C. Li, and Z. Peng, “Infrared small target detection via nonconvex tensor fibered rank approximation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–21, 2021.
[58] B. Hui, Z. Song, and H. Fan, “A dataset for infrared detection and tracking of dim-small aircraft targets under ground/air background,” China Sci. Data, vol. 5, no. 3, pp. 291–302, 2020.
[59] Y. Dai, Y. Wu, F. Zhou, and K. Barnard, “Attentional local contrast networks for infrared small target detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 11, pp. 9813–9824, 2021.
[60] Q. Hou, L. Zhang, F. Tan, Y. Xi, H. Zheng, and N. Li, “ISTDU-Net: Infrared small-target detection U-Net,” IEEE Geoscience and Remote Sensing Letters, 2022.
[61] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-assisted Intervention. Springer, 2015, pp. 234–241.
[62] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[63] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems, vol. 27, 2014.
[64] H. Deng, X. Sun, M. Liu, C. Ye, and X. Zhou, “Entropy-based window selection for detecting dim and small infrared targets,” Pattern Recognition, vol. 61, pp. 66–77, 2017.
[65] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning. PMLR, 2015, pp. 448–456.
[66] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[67] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[68] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free? - Weakly-supervised learning with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 685–694.
[69] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[70] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.