
Unsupervised Landmark Detection Based Spatiotemporal Motion Estimation for 4D Dynamic Medical Images

Yuyu Guo1, Lei Bi2(✉), Dongming Wei1, Liyun Chen1, Zhengbin Zhu3, Dagan Feng2, Ruiyan Zhang3, Qian Wang1(✉), and Jinman Kim2(✉). Y. Guo, D. Wei, L. Chen and Q. Wang are with the School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, China, e-mail: wang.qian@sjtu.edu.cn. L. Bi, D. Feng and J. Kim are with the School of Computer Science, University of Sydney, NSW 2006, Australia, e-mail: lei.bi@sydney.edu.au, jinman.kim@sydney.edu.au. Z. Zhu and R. Zhang are with Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China. This work was supported in part by Australian Research Council (ARC) grants (LP140100686 and IC170100022) and the University of Sydney – Shanghai Jiao Tong University Joint Research Alliance (USYD-SJTU JRA) grants.
Abstract

Motion estimation is a fundamental step in dynamic medical image processing for the assessment of target organ anatomy and function. However, existing image-based motion estimation methods, which optimize the motion field by evaluating local image similarity, are prone to producing implausible estimates, especially in the presence of large motion. In addition, the correct anatomical topology is difficult to preserve because the global image context is not well incorporated into motion estimation. In this study, we provide a novel Dense-Sparse-Dense (DSD) motion estimation framework that comprises two stages. In the first stage, we process the raw dense image to extract sparse landmarks that represent the target organ's anatomical topology, and discard the redundant information that is unnecessary for motion estimation. For this purpose, we introduce an unsupervised 3D landmark detection network to extract spatially sparse but representative landmarks for the target organ's motion estimation. In the second stage, we derive the sparse motion displacement from the landmarks extracted from two images at different time-points. Then, we present a motion reconstruction network that constructs the motion field by projecting the sparse landmarks' displacement back into the dense image domain. Furthermore, we employ the motion field estimated by our two-stage DSD framework as an initialization and boost the motion estimation quality with lightweight yet effective iterative optimization. We evaluate our method on two dynamic medical imaging tasks, modelling cardiac motion and lung respiratory motion, respectively. Our method produces superior motion estimation accuracy compared to existing methods. In addition, extensive experimental results demonstrate that our solution can extract representative anatomical landmarks without any manual annotation. Our code is publicly available online at https://github.com/yyguo-sjtu/DSD-3D-Unsupervised-Landmark-Detection-Based-Motion-Estimation.

I Introduction

Four-dimensional (4D) dynamic medical imaging is a vital technique for organ motion monitoring, as it provides comprehensive spatiotemporal information via multiple 3D volumes sampled over a period of time [1]. These 4D images can be used to estimate the motion of the target organ, which allows for the assessment of its functional and structural properties and thereby aids clinical outcomes. Examples of motion estimation are abundant, including functional heart analysis with 4D-MRI [2], respiratory organ motion modelling with 4D-CT [3], image-guided intervention with 4D-CT [4], and diagnosis of disease and treatment planning [5].

Figure 1: Examples of 4D cardiac imaging from two patient cases (one per column). The temporal variation within an image sequence can be large, e.g., between end-systole (ES) and end-diastole (ED), shown in the two rows. All images are shown in trans-axial views and cropped to focus on the heart. LVC indicates the left ventricle cavity, while LAC indicates the left atrium cavity.

Technically, motion estimation establishes the corresponding deformation field between two time-point images, which can be attained by image registration methods [6, 7]. Image registration is a broad topic and is widely applied in various fields, e.g., image reconstruction [8, 9], image super-resolution [10], and atlas-based segmentation [11, 12]. In particular, in our prior work [13], we presented a supervised interpolation network leveraging image registration to improve the sequential sampling rate for individual subjects. Conventional image registration approaches derive the motion field between two input images (usually referred to as the fixed image and the moving image) by solving an optimization problem, which aims to maximize the image appearance similarity under certain smoothness regularization [14, 15]. However, these methods are typically time-consuming (or computationally expensive) due to the iterative optimization process. Besides, when coping with large deformation, such as cardiac motion between the two ends of the motion cycle (end-systole (ES) and end-diastole (ED), as shown in Fig. 1), non-physiological misestimation may occur as the price of excessively pursuing appearance similarity.

Recently, deep-learning-based registration methods have drawn much attention due to their superior efficiency [16]. They have shown that the conventional optimization can be replaced by a single deep model, and these methods can be trained in an unsupervised manner [17, 18]. They typically input a pair of images to a convolutional neural network (CNN), and then encode the appearance features hierarchically to decode the deformation field between the input images. Nevertheless, the performance of CNNs in registration is limited by the receptive field of the convolution kernels, which may lead to underestimation in the presence of large deformation.

Apart from image-based registration approaches, anatomical priors, such as landmarks or segmentation masks, form another crucial branch that aids motion estimation [19, 20]. These approaches first extract a sparse topological representation (i.e., landmarks) of the target organ, instead of using the original image in its dense representation form [21]. Then, the motion field is predicted via the corresponding landmarks between two different volumetric images. Compared with estimating directly from the images, these anatomy-based methods are usually accurate and robust [22] when coping with large deformation. However, the drawback is that user interaction is often required to annotate the representative landmarks or segmentation masks, which is prone to manual errors and admittedly labor-intensive. To solve this issue, some studies present automatic identification of landmarks (keypoints) [23, 24]. These methods extract the keypoints by calculating local image features and establish the mapping relationship between the keypoints to obtain the motion information of the target objects. Even so, these methods still require a point matching step to compute the sparse displacement of the landmarks, and then reconstruct the dense motion field in the image domain [25, 26]. This increases the risk of errors in landmark detection and matching, which may reduce the accuracy of the final motion estimation. Moreover, the efficacy of these methods has not been demonstrated on volumetric medical images.

To estimate the accurate motion in dynamic 4D medical images, especially for the images with large deformation, we present a novel learning-based motion estimation framework of Dense-Sparse-Dense (DSD). In summary, there are two main contributions in our proposed DSD framework:

  1.

    We introduce a motion estimation framework guided by the change of the sparse topology. In our framework, the motion field is estimated from the dynamic changes of the target organ in its sparse landmark representation. This reduces erroneous motion estimation in the presence of large deformation by leveraging the topological guidance.

  2.

    We propose a 3D unsupervised landmark detection network for volumetric medical images. It can effectively extract sparse landmarks that represent the anatomical topology of the target organ without any manual annotation. Specifically, we design multiple loss functions to guide the anatomical landmark detection network and utilize optimal transport theory to enforce a prior topology constraint, enhancing the richness of the landmarks' anatomical description.

Benefiting from the guidance of the target organ's anatomy, the estimated motion field is more plausible and copes better with large deformation. We evaluate our proposed framework on two clinical motion analysis tasks, cardiac motion and lung respiratory motion, and achieve outstanding performance.

II Related Works

We partition the related works into three topics: (1) image-based registration, (2) landmark-based registration, and (3) landmark detection.

II-A Image-based Registration

Image registration is a basic tool in motion estimation and can be mainly divided into two categories, i.e., non-learning-based methods and learning-based methods. Classic non-learning-based approaches mainly include diffeomorphic Demons [14] and Free-Form Deformation (FFD) [27]. These methods derive the deformation field by iteratively maximizing the image similarity under certain smoothness regularization. Both methods are widely used in medical image registration and are often regarded as baselines in related research. However, their hyper-parameters are not easy to tune, and the iterative optimization is computationally expensive.

Inspired by the great success of deep learning, several learning-based studies have been proposed in recent years. Many of them leverage a supervised learning strategy, which requires the ground-truth (deformation field) derived from conventional registration tools to train the model. For instance, Yang et al. [28] utilized a U-Net architecture [29] to predict the deformation field for brain MR images. Similarly, Rohe et al. [30] also leveraged U-Net to generate the motion field for cardiac volumes. Krebs et al. [31] presented a supervised agent-learning-based non-rigid registration method to learn the deformation field in a low-dimensional representation. However, the supervision can be cumbersome to acquire, which restricts the deformation types that can be learned.

Figure 2: An overview of our DSD framework which contains an unsupervised landmark detection network and a dense motion reconstruction network. The first process (with blue background) involves the fixed image and the moving image which are separately input to the landmark detection network to obtain their respective landmarks. In the second process (with orange background), we leverage a dense motion reconstruction network to estimate the dense motion field from the sparse motion vectors of the corresponding landmarks of the two input images.

Recently, unsupervised learning solutions for registration have been proposed to eliminate the high reliance on ground truth. Li et al. [32] and Vos et al. [33] introduced fully convolutional networks to estimate the deformation field for brain volumetric images and 4D cardiac images, respectively. Balakrishnan et al. [34] developed the popular VoxelMorph method for brain image registration. Zhang et al. [35] proposed an inverse-consistency loss based on a network similar to VoxelMorph. These unsupervised registration methods utilize a spatial transform layer with an image similarity loss to train the deep model. However, since the existing methods derive the dense deformation field over the entire image domain, their performance is prone to degrade in situations such as large deformation.

II-B Landmark-based Registration

Another category of registration is based on sparse landmarks [36, 37]. Landmarks are salient points of geometrical and anatomical uniqueness. Basically, landmark-based registration methods are mainly applicable to estimating rigid or affine transformations. With an increasing number of landmarks, they can also be used to describe more complex (e.g., non-rigid) deformation [38]. Note that the identification of landmarks is particularly important for these methods.

The landmarks can be manually or automatically detected. Most existing landmark-based registration methods rely on manually labeled landmarks. One may directly derive the sparse displacement from the manually labelled landmarks, and further obtain the dense motion field with the help of interpolation such as thin plate splines (TPS) [39]. However, the manual labeling of landmarks is inevitably time-consuming and prone to intra- and inter-observer uncertainty, which limits its practical application. Automatic keypoint detection methods generally capture the keypoints through local image features [40]. For instance, Yang et al. [41] presented a CNN feature-based motion tracking method for temporal images. Hosseini et al. [42] utilized feature points to estimate cardiac motion. Li et al. [43] introduced a multiview-based parameter-free framework, where pixel-wise motion estimation was used to derive feature points for identifying the number of groups of people in a crowd video. In another study, Liu et al. [44] presented a point trajectory generation method that explores motion boundaries for video processing. However, automatic landmark detection usually comes with complex point matching [45], such as iterative closest point [46], to track the movement of corresponding landmarks and then complete the subsequent image registration. Although it does not require manual annotation, the errors caused by automatic landmark detection and subsequent point matching limit the accuracy and robustness of motion estimation, especially for medical images with large motion.

II-C Landmark Detection

Landmark detection is widely used in computer vision, e.g., for human face/pose recognition [47, 48, 49]. Initially, regression-based CNN methods utilized a fully convolutional layer to directly estimate the landmark positions. For instance, Yu et al. [50] presented a deformable landmark regression network that utilized a TPS layer to achieve the cross-talk between point-based deformation and images. Lv et al. [51] introduced a regression CNN with a global stage and a local stage to improve the performance of facial landmark detection.

Different from the regression approaches, localizing landmarks by predicting their heatmaps is an alternative way. These methods usually use a kernel function to convert the coordinate information into the form of a heatmap. Nie et al. [52] proposed a hierarchical contextual refinement network for landmarks to efficiently predict human pose. Siarohin et al. [53] recently presented a heatmap-based landmark detection method to transfer animation from videos to still images. However, compared to exploring discriminative landmarks in 2D space, 3D landmark detection is admittedly much more challenging owing to the exponential growth in data complexity. Besides, compared with 2D facial images, 3D/4D medical images usually lack sufficient visual saliency, as the anatomical landmarks are severely interfered with by the noisy background.

III Method

We aim to accurately estimate the motion field in 4D dynamic medical images with large motion, based on the guidance of the target organ's topological representation. As presented in Fig. 2, an unsupervised landmark detection network is first utilized to recognize discriminative and consistent landmarks from the temporal volumetric images. Then, we construct the motion field by projecting the motion displacement of the sparse landmarks back into the dense image domain.

Let $Q$ be the set of $N$ subjects. Each subject includes $T$ time-point (phase) images, denoted as $\{I_{t}\mid t=1...T\}$. The spatial domain is denoted as $\Omega$ for all subjects, where the image size is $(X,Y,Z)$. $\mathcal{H}_{t}$ denotes the predicted landmark heatmaps corresponding to the image $I_{t}$. We define the grid coordinate of the volumetric image as $\mathcal{G}_{(x,y,z)}$. The size of $\mathcal{G}_{(x,y,z)}$ is $(3,X,Y,Z)$, where $(X,Y,Z)$ represents the image size and the coordinate of each voxel is represented by three numbers. The locations of the individual landmarks for the image $I_{t}$ are denoted by $L_{t}=\{L_{tk}\mid k=1...K\}$, where $K$ indicates the number of detected landmarks in each volumetric image. The motion field between the input images $I_{t_{m}}$ and $I_{t_{n}}$ is denoted as $\varphi_{t_{m}\rightarrow t_{n}}$, where $t_{m}$ and $t_{n}$ denote two time points. Note that all image sequences have the same temporal sampling.

III-A From Dense Volumetric Image to Sparse Landmarks - Unsupervised Landmark Detection

Landmark detection serves as a core step in our DSD motion estimation framework. To perform landmark detection from the dense volumetric image, we propose an unsupervised learning-based landmark detection network. A standard multi-scale encoder-decoder architecture is used as the backbone, as shown in Fig. 3. To obtain representative landmarks [54], we first introduce a self-supervised pre-training strategy to help the encoder of the landmark detection network focus on the potential motion-affected regions. Then, to obtain accurate and consistent landmarks, multiple losses are introduced to train the network for landmark localization. Note that no annotation information is used to supervise any training step.

Figure 3: The architecture of our 3D landmark detection network. The coordinates of the detected landmarks can be derived from the output heatmaps.

III-A1 Self-Supervised Pre-Training for Motion Attention

The detected landmarks are required to represent the temporal changes of the target organ to further guide the motion estimation. To this end, we leverage a self-supervised pre-training strategy to help the encoder of the landmark detection network focus its attention on the motion-affected regions. Considering that many organ/tissue movements of interest are cyclic (such as heart beating and breathing), half-cycle movement is presumably monotonic, implying a one-way morphological change from large volume to small (or inversely). This characteristic can be utilized as self-supervision to train the encoder network to recover the sequence order from a randomly permuted one. In this way, the latent feature maps produced by the encoder can focus on the potential motion-affected regions in the temporal sequence.

Figure 4: A self-supervised learning strategy to obtain a well pre-trained encoder for landmark detection.

As depicted in Fig. 4, we leverage a shared encoder $E_{\theta}(\cdot)$ to extract the features from each volumetric image in a sequence, which are later used for landmark detection. The 1D feature codes (for individual 3D volumes) are then processed by a following fully convolutional layer. The fusion layer, which further concatenates the codes of the multiple input images, is expected to learn the sequential order (i.e., the phase index) of the inputs. That is, by employing a shuffle function $\mathcal{S}$ to randomly permute the order of the input sequence, the network in Fig. 4 is trained to recover the correct phase index of the sequence, based on the features encoded through $E_{\theta}(\cdot)$:

\mathcal{S}\{t_{1},t_{2},t_{3},t_{4},t_{5}\}=E_{\theta}\bigl(\mathcal{S}(I_{t_{1}},I_{t_{2}},I_{t_{3}},I_{t_{4}},I_{t_{5}})\bigr) \qquad (1)

We thus formalize the above as a classification task, as in Eq. 1. The training is then supervised by a cross-entropy classification loss. Through this pre-training, the encoder learns to focus its attention on the motion-affected regions rather than the irrelevant background during encoding (e.g., the attention maps in Fig. 4).
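For illustration, the PyTorch sketch below shows one way the order-recovery pre-training of Eq. 1 might be set up; the code length, the number of phases, the fusion head, and the helper names are assumptions for illustration rather than the exact architecture used in this work.

```python
import torch
import torch.nn as nn

# A minimal sketch of the self-supervised pre-training (Eq. 1): a shared encoder
# E_theta encodes each shuffled volume, the codes are fused, and the head predicts
# the original phase index of every input slot. Sizes below are illustrative.

class OrderRecoveryHead(nn.Module):
    def __init__(self, encoder, code_dim=128, num_phases=5):
        super().__init__()
        self.encoder = encoder                        # shared E_theta: (B,1,X,Y,Z) -> (B, code_dim)
        self.num_phases = num_phases
        self.fuse = nn.Sequential(
            nn.Linear(code_dim * num_phases, 256), nn.ReLU(),
            nn.Linear(256, num_phases * num_phases))  # one phase prediction per input slot

    def forward(self, volumes):                       # volumes: list of num_phases tensors
        codes = [self.encoder(v) for v in volumes]
        logits = self.fuse(torch.cat(codes, dim=1))
        return logits.view(-1, self.num_phases, self.num_phases)

def pretrain_step(model, sequence, optimizer):
    """sequence: list of volumes in the correct temporal order."""
    perm = torch.randperm(len(sequence))              # the shuffle function S
    shuffled = [sequence[i] for i in perm]
    logits = model(shuffled)                          # (B, num_phases, num_phases)
    target = perm.unsqueeze(0).expand(logits.shape[0], -1).to(logits.device)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, model.num_phases), target.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```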

III-A2 Landmark Coordinate Extraction

As shown in Fig. 3, the input image first passes through the encoder and the decoder to generate $K$-channel outputs (note that the encoder is pre-trained already and will be further refined in the subsequent training). Then we apply a softmax layer in a spatial-wise manner to generate $K$ heatmaps, i.e., $\mathcal{H}_{t}$. We then regress the landmark positions $L_{t}$ from the predicted heatmaps $\mathcal{H}_{t}$ as:

L_{tk}=\Big(\sum_{x}^{\Omega}(\mathcal{H}_{tk}\times\mathcal{G}_{x}),\ \sum_{y}^{\Omega}(\mathcal{H}_{tk}\times\mathcal{G}_{y}),\ \sum_{z}^{\Omega}(\mathcal{H}_{tk}\times\mathcal{G}_{z})\Big),\quad s.t.\ \sum\mathcal{H}_{tk}=1,\ \{k=1,...,K\} \qquad (2)

where $(\mathcal{G}_{x},\mathcal{G}_{y},\mathcal{G}_{z})$ denotes the grid coordinate of the volumetric image, i.e., the aforementioned $\mathcal{G}_{(x,y,z)}$ with the size of $(3,X,Y,Z)$. The axis range of the grid coordinate $\mathcal{G}_{(x,y,z)}$ (in $x,y,z$) is $[-1,1]$. We finally obtain the exact coordinates of the extracted landmarks by localizing the position with the summed response in the heatmap, as in [55].
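A minimal PyTorch sketch of Eq. 2 is given below: a spatial softmax turns the $K$-channel decoder output into heatmaps, and each landmark coordinate is the heatmap-weighted expectation of a normalized grid in $[-1,1]$. Tensor shapes and the helper name are assumptions.

```python
import torch
import torch.nn.functional as F

def heatmaps_to_landmarks(raw, grid=None):
    """raw: (B, K, X, Y, Z) decoder output. Returns heatmaps and (B, K, 3) coordinates in [-1, 1]."""
    b, k, x, y, z = raw.shape
    heat = F.softmax(raw.view(b, k, -1), dim=-1).view(b, k, x, y, z)   # spatial softmax, sums to 1
    if grid is None:
        axes = [torch.linspace(-1.0, 1.0, s, device=raw.device) for s in (x, y, z)]
        gx, gy, gz = torch.meshgrid(axes[0], axes[1], axes[2], indexing="ij")
        grid = torch.stack([gx, gy, gz], dim=0)                        # (3, X, Y, Z), the grid G
    # expectation of the grid coordinates weighted by the heatmap (soft-argmax)
    coords = torch.einsum("bkxyz,cxyz->bkc", heat, grid)               # (B, K, 3)
    return heat, coords
```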

III-A3 Training Loss

We introduce four losses to assist in detecting the landmarks in 3D space. First, we propose a landmark exclusive loss $\mathcal{L}_{1}$ to enhance exploration of the 3D image space. Then we present a temporal coherence loss $\mathcal{L}_{2}$ to guide the landmarks to focus on the potential motion-affected region. Considering that the extracted landmarks should conform to human anatomy, we further propose a topological distribution loss $\mathcal{L}_{3}$. Finally, following [53], we leverage a consistency loss $\mathcal{L}_{4}$ to enhance the stability of landmark detection. More details are given below.

The landmark detector is denoted as $L_{t}=D_{\zeta}(I_{t})$, where $D$ represents the detection network in volumetric space and $\zeta$ are the learnable parameters of the network. As shown in Fig. 3, we use an end-to-end encoder-decoder architecture with skip-connections to generate two sets of heatmaps $\mathcal{H}_{t_{m}}$ and $\mathcal{H}_{t_{n}}$, separately, given two images $I_{t_{m}}$ and $I_{t_{n}}$ of different phases. We can derive the spatial coordinates $L_{t_{m}}$ and $L_{t_{n}}$ of the detected landmarks from the heatmaps $\mathcal{H}_{t_{m}}$ and $\mathcal{H}_{t_{n}}$.

Landmark Exclusive Loss — To increase the space exploration capability of landmark detection and enhance the richness of the landmarks' spatial description, we assume there is mutual repulsion between any two landmarks that are spatially in close proximity. Specifically, we apply a Gaussian distribution in volumetric space to describe each landmark, centered on its own position, as below:

\mathcal{F}_{tk}=\exp\Big(-\dfrac{(L_{tk}-\mathcal{G}_{(x,y,z)})^{2}}{2\sigma^{2}}\Big),\quad (x,y,z)\in\Omega \qquad (3)

where $\sigma$ controls the spread of each landmark, to tackle the uncertainty in its representation. $\mathcal{F}_{tk}$ is a 3D Gaussian heatmap representing the landmark's coordinate position. Note that the value range of each landmark's heatmap is $(0,1]$. The sum of two landmarks' heatmaps will exceed 1 at certain coordinates if the two landmarks are close to each other, which should be penalized. The repulsive-force loss is thus defined as:

\mathcal{L}_{1}=ReLU\bigg(\sum_{k=1}^{K}\mathcal{F}_{tk}-1\bigg) \qquad (4)

By minimizing $\mathcal{L}_{1}$, we avoid the case where all landmarks conglomerate during landmark detection.

Temporal Coherence Loss — While the above loss drives the landmarks to explore the entire 3D space, we add a temporal coherence loss to help the landmark detector better focus on the motion-affected region. Specifically, the extracted landmarks are expected to cover the motion-affected region rather than the static background. Therefore, if two corresponding landmarks in the temporal sequence are static across different time-points, such landmarks are penalized. Thus, given two corresponding landmarks $L_{t_{m}k}$ and $L_{t_{n}k}$, the self-supervised temporal coherence loss is defined as:

\mathcal{L}_{2}=\sum_{k=1}^{K}\left\|\mathcal{F}_{t_{m}k}\times\mathcal{F}_{t_{n}k}\right\|_{1} \qquad (5)

where $\times$ is the element-wise product.
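The sketch below illustrates, under our reading of Eqs. 3-5, how the Gaussian landmark heatmaps, the exclusive loss, and the temporal coherence loss could be implemented in PyTorch; the value of sigma and the batch reductions are illustrative choices, and the normalized grid is the same one used in the coordinate-extraction sketch above.

```python
import torch
import torch.nn.functional as F

def gaussian_heatmaps(coords, grid, sigma=0.1):
    """coords: (B, K, 3) landmark positions in [-1, 1]; grid: (3, X, Y, Z).
    Returns F_tk with shape (B, K, X, Y, Z), as in Eq. 3. sigma is an assumed value."""
    diff = coords[..., None, None, None] - grid.unsqueeze(0).unsqueeze(0)   # (B, K, 3, X, Y, Z)
    return torch.exp(-(diff ** 2).sum(dim=2) / (2 * sigma ** 2))

def exclusive_loss(f):
    """Eq. 4: penalize voxels where the summed landmark heatmaps exceed 1 (landmarks too close)."""
    return F.relu(f.sum(dim=1) - 1.0).mean()

def temporal_coherence_loss(f_tm, f_tn):
    """Eq. 5: penalize landmark pairs whose heatmaps overlap across time (i.e., static landmarks)."""
    return (f_tm * f_tn).sum(dim=(1, 2, 3, 4)).mean()
```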

Topological Distribution Loss — Considering that the extracted landmarks should largely conform to human anatomy, we apply a reference point distribution loss to the landmark detector.

Figure 5: Reference point distribution: (a) surface visualization of the reference annotations (from an atlas); (b) subgroup annotations with different colors after k-means clustering; (c) Reference points are derived as the center points of the clusters.
(a) Topological representation.
(b) The optimal solution for topological distribution alignment.
Figure 6: The optimal transport theory based topological loss: (a) indicates the calculation of the reference topological representation $M_{R}$ from its corresponding landmark set $R$; (b) depicts the optimal solution to align our detected landmarks to the reference landmarks by minimizing the cost matrix $C$ (maximizing the correspondence) between the topological representations $M_{R}$ and $M_{L}$.

Fig. 5 depicts the process to extract the reference landmarks. We first choose an atlas downloaded from the public cardiac segmentation dataset (https://zmiclab.github.io/projects/mmwhs). It is worth noting that we only utilize a single atlas and its annotation, which is sufficient to establish the reference point distribution in our work. Then, we extract the surface from the annotated segmentation as shown in Fig. 5(a). The extracted surface represents the structural topology of the target organ. Next, we use k-means [56] to cluster the points on the surface based on their 3D coordinates, which produces several subgroups as shown in Fig. 5(b). Finally, the center of each subgroup is used as the reference point distribution to represent the anatomical topology of the target organ as a whole (see Fig. 5(c)).
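A minimal sketch of this reference-point construction is given below, assuming a binary atlas segmentation and using marching cubes plus k-means; the surface-extraction routine and the number of clusters are illustrative choices, not necessarily the exact pipeline used here.

```python
import numpy as np
from skimage import measure
from sklearn.cluster import KMeans

def reference_points(atlas_mask, k=24):
    """atlas_mask: binary numpy array of the atlas segmentation. Returns (k, 3) reference points."""
    # Extract the organ surface from the annotated segmentation (Fig. 5(a)).
    verts, _, _, _ = measure.marching_cubes(atlas_mask.astype(float), level=0.5)
    # Cluster the surface points by their 3D coordinates (Fig. 5(b)) and
    # keep the cluster centers as the reference point distribution (Fig. 5(c)).
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit(verts)
    return clusters.cluster_centers_
```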

We then utilize optimal transport theory [57, 58] to enforce the reference point distribution in landmark detection. Let $R_{k}\,(k=1,...,K)$ denote the coordinates of the reference landmarks in the set. Then, the reference landmarks and our detected landmarks $L$ can be used to derive two tensors $M_{R}$ and $M_{L}$, both of which are of size $K\times K\times 3$ (see Fig. 6(a)), following Eq. 6:

M_{L}=L_{r}-L_{c},\quad M_{R}=R_{r}-R_{c} \qquad (6)

where $r$ and $c$ indicate the row and column indices of the tensors, respectively. Note that $M_{R}$ and $M_{L}$ encode the mutual distances between landmarks in the reference and the detected sets. Each row/column of the two tensors can be regarded as a representation of the specific landmark.

Since the landmarks in the two sets $R$ and $L$ are presumably corresponding to each other, we expect high correlation between $M_{R}$ and $M_{L}$ given their one-to-one correspondence assignment. That is, we formulate the following optimal transport problem:

\mathcal{L}_{3}=\mathop{\arg\min}\sum_{r}^{K}\sum_{c}^{K}T^{r,c}\times C^{r,c},\quad s.t.\ \sum_{r}T^{r,c}=\sum_{c}T^{r,c}=1 \qquad (7)

where $T$ is the binary transport matrix that establishes one-to-one correspondences from the detected landmark set $L$ to the reference set $R$, and $C$ gauges the distances between the landmarks in the two sets. Thus, the element-wise product between $T$ and $C$, as in Eq. 7, penalizes two landmarks if they correspond to each other across the two sets (thus activated by $T$) yet differ a lot (with high distance in $C$).

In our implementation, we refer to the representations of the landmarks in $M_{L}$ and $M_{R}$ to derive the matrix $C$. Specifically, the value at the $r$-th row and $c$-th column of $C$ is calculated as:

C^{r,c}=-\sum_{k=1}^{K}\langle M_{L}^{r,k},M_{R}^{c,k}\rangle \qquad (8)

Additionally, in our study, we expect the detected landmark set and the reference landmark set to have one-to-one correspondences. Thus, $T$ can be simplified as an identity matrix. Consequently, all entries of the optimal transport matrix $T$ are 0 when $r\neq c$, and 1 when $r=c$. The above process is illustrated in Fig. 6.
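Under this identity-transport simplification, Eq. 7 reduces to summing the diagonal of the cost matrix from Eq. 8. The sketch below illustrates this computation; the tensor layout and reductions are assumptions.

```python
import torch

def topological_loss(detected, reference):
    """detected: (B, K, 3) predicted landmarks; reference: (K, 3) atlas-derived landmarks."""
    ref = reference.unsqueeze(0).expand_as(detected)          # (B, K, 3)
    # M_L and M_R: pairwise coordinate differences between landmarks (Eq. 6).
    m_l = detected.unsqueeze(2) - detected.unsqueeze(1)       # (B, K, K, 3)
    m_r = ref.unsqueeze(2) - ref.unsqueeze(1)                 # (B, K, K, 3)
    # C[r, c] = -sum_k <M_L[r, k], M_R[c, k]>  (Eq. 8)
    cost = -torch.einsum("brkd,bckd->brc", m_l, m_r)          # (B, K, K)
    # T = identity: only the diagonal of C contributes (Eq. 7 simplified).
    return torch.diagonal(cost, dim1=1, dim2=2).sum(dim=1).mean()
```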

Consistency Loss — To further improve the stability of landmark detection, we leverage a consistency loss inspired by [59]. For the landmarks extracted from each image, we can randomly generate an affine transformation matrix $\mathcal{A}=[\mathcal{R},\mathcal{B}]+[\mathcal{C}]$, where $\mathcal{R}$ denotes the rotation, $\mathcal{B}$ denotes the translation, and $\mathcal{C}$ is Gaussian noise. The generated affine transformation warps image $I_{t_{m}}$ through $\mathcal{W}$ (cf. the warp layer in Fig. 2) into a new image $\widetilde{I}_{t_{m}}$. The deformed image $\widetilde{I}_{t_{m}}$ then serves as a new input for landmark detection, from which we extract another landmark set $\widetilde{L}_{t_{m}}=D_{\zeta}(\mathcal{W}(I_{t_{m}}\mid\mathcal{A}))$. The two landmark sets, given the known affine transformation matrix $\mathcal{A}$, should be consistent following the loss in Eq. 9:

\mathcal{L}_{4}=\sum_{k=1}^{K}\|\mathcal{W}(L_{tk}\mid\mathcal{A})-\widetilde{L}_{tk}\|_{1} \qquad (9)
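A minimal sketch of Eq. 9 is shown below; the random affine generation, the placeholder warp layer `warp_affine`, and the detector interface (returning heatmaps and coordinates) are assumptions for illustration only.

```python
import torch

def consistency_loss(detector, warp_affine, image, coords, noise_std=0.05):
    """image: (B, 1, X, Y, Z); coords: (B, K, 3) landmarks detected on `image` in [-1, 1]."""
    b = image.shape[0]
    # Random affine A = [R, B] + [C]: identity plus Gaussian perturbation (illustrative).
    theta = torch.eye(3, 4, device=image.device).unsqueeze(0).repeat(b, 1, 1)
    theta = theta + noise_std * torch.randn_like(theta)
    warped_img = warp_affine(image, theta)                     # \tilde{I}_{t_m} via the warp layer W
    # Apply the same affine to the detected coordinates: W(L_tk | A).
    coords_h = torch.cat([coords, torch.ones_like(coords[..., :1])], dim=-1)   # (B, K, 4)
    warped_coords = coords_h @ theta.transpose(1, 2)           # (B, K, 3)
    _, coords_on_warped = detector(warped_img)                 # \tilde{L}_{t_m}
    return (warped_coords - coords_on_warped).abs().sum(dim=-1).mean()
```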

III-B From Sparse Motion to Dense Motion - Motion Reconstruction Network

In this section, we introduce the estimation of the dense motion field $\varphi_{t_{m}\leftarrow t_{n}}$ from the detected landmarks ($L_{t_{m}}$, $L_{t_{n}}$) and their sparse correspondences. We first compute the sparse displacement $V_{t_{m}k\leftarrow t_{n}k},\{k=1,...,K\}$ from the detected landmarks $L_{t_{m}}$ and $L_{t_{n}}$. Then, we utilize a dense motion reconstruction network $\mathcal{M}$ (a standard U-Net architecture) to estimate the dense motion field from the displacement of the detected landmarks.

The sparse displacement $V_{t_{m}k\leftarrow t_{n}k},\{k=1,...,K\}$ links two temporally-consecutive landmarks $L_{t_{n}k}$ and $L_{t_{m}k}$ in $I_{t_{n}}$ and $I_{t_{m}}$, respectively. As the displacement $V_{t_{m}k\leftarrow t_{n}k}$ is spatially sparse and cannot be directly input to our motion network $\mathcal{M}$, we first fill the entire grid with the value of $V_{t_{m}k\leftarrow t_{n}k}$ to acquire a dense volumetric deformation field $\widetilde{\varphi}_{t_{m}k\leftarrow t_{n}k}$. We then use $\widetilde{\varphi}_{t_{m}k\leftarrow t_{n}k}$ to translate the image $I_{t_{n}}$ via the warp layer $\mathcal{W}$, which results in $K$ different deformed images $\widetilde{I}_{t_{m}k\leftarrow t_{n}k}$ in total. These deformed images $\widetilde{I}_{t_{m}k\leftarrow t_{n}k}$ are then input to $\mathcal{M}$.

However, the deformed images $\widetilde{I}_{t_{m}k\leftarrow t_{n}k}$ only contain the displacement information in the image domain; the landmark positions are missing. We thus further inject the corresponding landmarks' heatmaps into the input of $\mathcal{M}$. Specifically, we derive the variation between the corresponding heatmaps as:

\mathcal{F}_{t_{m}k\leftarrow t_{n}k}=\mathcal{F}_{t_{m}k}-\mathcal{F}_{t_{n}k} \qquad (10)

Finally, we concatenate the deformed images $\widetilde{I}_{t_{m}k\leftarrow t_{n}k}$ and the heatmap variations $\mathcal{F}_{t_{m}k\leftarrow t_{n}k}$ along the channel dimension, and feed them into the motion reconstruction network $\mathcal{M}$ to fuse the fields $\widetilde{\varphi}_{t_{m}k\leftarrow t_{n}k}$ into the final dense motion field $\varphi_{t_{m}\leftarrow t_{n}}$ as:

\varphi_{t_{m}\leftarrow t_{n}}=\sum_{k=1}^{K}\mathcal{M}(\widetilde{I}_{t_{m}k\leftarrow t_{n}k},\mathcal{F}_{t_{m}k\leftarrow t_{n}k})*\widetilde{\varphi}_{t_{m}k\leftarrow t_{n}k} \qquad (11)
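The following sketch illustrates one possible implementation of Eqs. 10-11: a grid_sample-based warp layer, constant-filled per-landmark displacement fields, and a weighted fusion of the $K$ fields by the reconstruction network. The softmax fusion and the network's output shape are assumptions, not the exact design of $\mathcal{M}$.

```python
import torch
import torch.nn.functional as F

def warp(volume, flow, grid):
    """Backward-warp a (B, C, X, Y, Z) volume with a (B, 3, X, Y, Z) flow (the warp layer W).
    `grid` is the normalized (3, X, Y, Z) coordinate grid in [-1, 1]."""
    sample = (grid.unsqueeze(0) + flow).permute(0, 2, 3, 4, 1)       # (B, X, Y, Z, 3)
    # grid_sample expects (x, y, z) ordered as (W, H, D), hence the channel flip.
    return F.grid_sample(volume, sample[..., [2, 1, 0]], align_corners=True)

def reconstruct_motion(net_m, moving_img, coords_m, coords_n, heat_m, heat_n, grid):
    """coords_*: (B, K, 3); heat_*: (B, K, X, Y, Z); net_m is the reconstruction network M."""
    b, k, _ = coords_m.shape
    disp = coords_m - coords_n                                       # sparse displacement V
    fields = disp[..., None, None, None].expand(b, k, 3, *grid.shape[1:])   # constant-filled fields
    warped = torch.cat([warp(moving_img, fields[:, i], grid) for i in range(k)], dim=1)
    heat_diff = heat_m - heat_n                                      # Eq. 10
    weights = torch.softmax(net_m(torch.cat([warped, heat_diff], dim=1)), dim=1)  # (B, K, X, Y, Z)
    return (weights.unsqueeze(2) * fields).sum(dim=1)                # Eq. 11: (B, 3, X, Y, Z)
```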

The training of $\mathcal{M}$ is supervised by a voxel-wise similarity loss calculated from the intensity image and its corresponding heatmaps as:

\mathcal{L}_{d}=\parallel\mathcal{W}(\mathcal{M}_{\gamma}(\varphi_{t_{m}\leftarrow t_{n}}),I_{t_{n}})-I_{t_{m}}\parallel_{2}+\sum_{k=1}^{K}\parallel\mathcal{W}(\mathcal{M}_{\gamma}(\varphi_{t_{m}\leftarrow t_{n}}),\mathcal{H}_{t_{n}k})*\mathcal{G}_{x,y,z}-L_{t_{m}k}\parallel_{1} \qquad (12)

Overall, by integrating the landmark detection network and the dense motion reconstruction network, our final training loss $\mathcal{L}$ is defined as:

\mathcal{L}=\lambda_{1}\mathcal{L}_{1}+\lambda_{2}\mathcal{L}_{2}+\lambda_{3}\mathcal{L}_{3}+\lambda_{4}\mathcal{L}_{4}+\lambda_{d}\mathcal{L}_{d} \qquad (13)

The weights $\lambda_{1}=1,\lambda_{2}=0.5,\lambda_{3}=20,\lambda_{4}=100,\lambda_{d}=500$ have been set empirically. As shown in Fig. 2, we thus achieve the dense-sparse-dense transfer and estimate the motion field between the input images. Finally, the reconstructed motion field can be employed as an initial transformation and further refined via existing registration methods [60]. The refinement aims to boost the quality of the estimated motion field in detail, while the initial transformation produced by DSD has already addressed the large deformation. Thus, the refinement only needs a small amount of iterative computation, which is highly efficient.
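As an illustration of this refinement step, the sketch below initializes a short SimpleITK Demons optimization with the DSD field, following the common pattern of passing an initial displacement field to the demons filter; the filter choice, iteration count, and smoothing value are illustrative assumptions rather than the exact settings of [60].

```python
import SimpleITK as sitk
import numpy as np

def refine_with_demons(fixed, moving, dsd_field, iterations=20):
    """fixed, moving: sitk.Image volumes; dsd_field: numpy array of per-voxel displacement
    vectors with shape (Z, Y, X, 3), assumed to be in the same geometry as `fixed`."""
    init = sitk.GetImageFromArray(dsd_field.astype(np.float64), isVector=True)
    init.CopyInformation(fixed)                       # align origin/spacing/direction with fixed
    demons = sitk.DiffeomorphicDemonsRegistrationFilter()
    demons.SetNumberOfIterations(iterations)          # only a few iterations are needed after DSD
    demons.SetSmoothDisplacementField(True)
    demons.SetStandardDeviations(1.5)
    refined = demons.Execute(fixed, moving, init)     # DSD field used as the initial displacement
    return sitk.DisplacementFieldTransform(refined)
```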

IV Experiments

IV-A Materials and Implementation details

We demonstrate our method with three datasets: 4D cardiac CT (4D-C-CT), 4D lung CT (4D-L-CT) [61], and the 4D cardiac MR ACDC dataset [62].

(a) 4D-C-CT.
(b) 4D-L-CT.
(c) ACDC.
Figure 7: A snapshot of the selected data at the two time-points with the largest motion difference: (a) the 4D-C-CT (axial views) showing the cardiac phases at end-diastole (ED) and end-systole (ES), on the left and right, respectively; (b) the 4D-L-CT (sagittal views) depicting the lungs in the states of maximum breath holding and minimum exhalation. The red and yellow lines highlight the volume changes of the lung between the two phases. (c) the ACDC dataset depicting the cardiac MR images at ED and ES, on the left and right, respectively.

IV-A1 4D-C-CT

Fig. 7(a) shows a snapshot of randomly sampled cardiac volume slices from a sequence. The 4D-C-CT dataset consists of 18 subjects, each having 5 time-points (image volumes) capturing the half cardiac cycle from ED to ES. Since the dataset contains several patients with left ventricular aneurysm, resulting in abnormally subtle deformation, we select the 12 patients with large volume changes (volume changing rate from 0.37 to 0.75), and only test on these selected challenging data. Each volume is characterized by a high intra-slice (x- and y-) resolution ranging from 0.32 to 0.45 mm and an inter-slice (z-) resolution from 0.37 to 0.82 mm.

IV-A2 4D-L-CT

Fig. 7(b) shows a snapshot of lung volume slices from a sequence. The 4D-L-CT dataset consists of 20 patients, each having 10 time-points representing the whole respiratory cycle. The 4D-L-CT images were acquired using a 16-slice helical CT scanner (Brilliance Big Bore, Philips Medical Systems) with a slice thickness of 3 mm and $512\times 512$ axial resolution ($\sim$1 mm pixel size).

IV-A3 ACDC

Fig. 7(c) shows a snapshot of cardiac MR middle slices from a sequence. The ACDC dataset consists of 100 patients, each having multiple time-points representing the whole cardiac cycle. It has a lower imaging resolution: the in-plane voxel spacing ranges from 1.37 to 1.68 mm and the inter-slice spacing ranges from 5 to 10 mm. Note that, as the inter-slice spacing is large in ACDC, we perform all the comparisons in 2D and evaluate on the three middle slices of each volume.

TABLE I: Motion estimation results on the 4D-C-CT dataset among the individual time-points, from ED to ES for all the comparison methods.
Interval-1 Interval-2 Interval-3 Interval-4 Mean
DSC HD ASD RVD DSC HD ASD RVD DSC HD ASD RVD DSC HD ASD RVD DSC HD ASD RVD
FFD 0.972 4.45 0.414 0.011 0.943 6.88 0.759 0.028 0.884 7.21 1.406 0.086 0.838 8.65 1.935 0.172 0.909±0.062 6.80±1.95 1.13±0.75 0.074±0.100
Demons 0.977 3.82 0.335 0.006 0.958 4.64 0.554 0.027 0.888 5.69 1.419 0.199 0.825 7.37 2.164 0.377 0.912±0.068 5.38±1.87 1.12±0.96 0.152±0.212
FeaturePoints 0.943 4.63 0.709 0.041 0.867 10.03 1.454 0.237 0.736 11.22 2.895 0.699 0.650 12.96 3.970 1.071 0.800±0.150 9.72±3.43 2.26±1.61 0.512±0.607
VoxelMorph 0.951 4.76 0.463 0.017 0.905 6.12 1.135 0.122 0.836 7.57 2.066 0.203 0.761 9.02 2.991 0.511 0.863±0.090 6.87±2.30 1.64±0.99 0.213±0.226
VoxelMorph + Demons 0.976 3.34 0.347 0.008 0.960 3.96 0.547 0.033 0.891 6.36 1.226 0.162 0.820 7.19 1.761 0.332 0.911±0.071 5.21±1.89 0.97±0.88 0.134±0.207
DSD (Ours) 0.954 4.09 0.510 0.022 0.929 4.96 0.981 0.093 0.853 7.14 2.115 0.208 0.792 8.41 2.939 0.447 0.882±0.084 6.16±2.18 1.66±0.92 0.192±0.258
DSD + Demons (Ours) 0.978 3.25 0.266 0.005 0.964 3.90 0.427 0.023 0.916 5.70 1.066 0.117 0.862 6.22 1.440 0.254 0.930±0.058 4.81±1.56 0.80±0.72 0.100±0.184
Figure 8: Visual results of three samples from 4D-C-CT. (a) and (b) indicate the input paired images. (c) - (h) indicate the deformed moving images estimated by the comparison methods. The red circles highlight regions associated with folding artifacts and erroneous estimations.

We apply contrast normalization to all the images, consistent with other similar experiments [63]. All the networks are implemented using the PyTorch library and trained on two 11GB Nvidia 1080Ti GPUs. The landmark detection model is trained with a learning rate of $2\times 10^{-5}$, while the dense motion reconstruction network is trained with a learning rate of $6\times 10^{-5}$. For the evaluations on 4D-C-CT and 4D-L-CT, we use four-fold cross-validation on both datasets. For the evaluation on the ACDC dataset, we randomly select 70 training / 30 testing subjects.

IV-B Evaluation and Metrics

We conduct comprehensive experiments to validate our proposed method: DSD (without subsequent refinement) and DSD + Demons (with refinement by Demons). First, we compare our proposed method with three different categories of state-of-the-art motion estimation methods, including conventional non-rigid image registration, unsupervised CNN-based registration, and feature point based registration. We run all the compared methods with their default parameter settings in the same environment, including the same training and testing sets:

  1.

    Conventional non-rigid image registration methods - Demons [14] and Free-Form Deformation (FFD) [27]. Demons utilizes the intensity similarity of the images to derive the driving force that deforms the moving image toward the fixed image. FFD requires initializing a group of control points, and optimizes the control points' displacements to interpolate the deformation field that warps the moving image to the fixed image.

  2.

    Unsupervised deep CNN-based method – VoxelMorph [34]. The deep CNN is trained in an unsupervised manner with a loss combining image appearance similarity and deformation field smoothness. In the inference stage, it directly yields the entire deformation field given the moving and fixed images.

  3.

    Feature point based method - FeaturePoints [42]. Local image feature points are first extracted via the popular scale-invariant feature transform (SIFT) descriptors, and then utilized to register the two images.

Next, for unsupervised landmark detection, we compare it with the state-of-the-art first-order motion model (FOMM) [53] (https://aliaksandrsiarohin.github.io/first-order-model-website/) in the ablation study. Note that, since FOMM was originally presented for 2D images, we adapted it to 3D and regard this 3D-FOMM as the baseline for comparison.

For all the compared methods, we use their default parameter settings. For Demons and FFD, we use the SimpleITK library. For evaluation, we use the popular metrics including the Dice Similarity Coefficient (DSC), Hausdorff Distance (HD), Average Surface Distance (ASD), and Relative Volume Difference (RVD). Lower HD, ASD, and RVD and higher DSC indicate better performance.
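For reference, a minimal sketch of these four metrics on binary masks is given below, using distance transforms for the surface-based measures; the reductions shown (maximum HD rather than a percentile variant, and one common RVD definition) are assumptions.

```python
import numpy as np
from scipy import ndimage

def surface_distances(a, b, spacing):
    """Distances (in mm, given voxel `spacing`) from each surface voxel of `a` to the surface of `b`, and vice versa."""
    a_border = a ^ ndimage.binary_erosion(a)
    b_border = b ^ ndimage.binary_erosion(b)
    dt_b = ndimage.distance_transform_edt(~b_border, sampling=spacing)
    dt_a = ndimage.distance_transform_edt(~a_border, sampling=spacing)
    return dt_b[a_border], dt_a[b_border]

def metrics(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """pred, gt: binary masks (e.g., warped moving segmentation vs. fixed segmentation)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    dsc = 2.0 * (pred & gt).sum() / (pred.sum() + gt.sum())
    d_pg, d_gp = surface_distances(pred, gt, spacing)
    hd = max(d_pg.max(), d_gp.max())              # Hausdorff distance
    asd = np.concatenate([d_pg, d_gp]).mean()     # average (symmetric) surface distance
    rvd = abs(pred.sum() - gt.sum()) / gt.sum()   # relative volume difference
    return dsc, hd, asd, rvd
```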

V Results and Discussion

V-A Comparison with the state-of-the-art motion estimation methods

We apply our DSD framework to both the 4D-C-CT and 4D-L-CT datasets to evaluate its performance. Furthermore, we employ the motion field produced by the deep-learning-based methods as an initialization, which is then refined by Demons, following the strategy in [60]. Note that FFD is not used to refine our motion field because it does not support an initial deformation setting.

V-A1 4D-C-CT dataset

The quantitative comparison for cardiac motion estimation is shown in Table I. As a motion estimation method should be able to handle time-points with varying intervals, we also evaluate the performance with different time intervals from ED to ES in Table I. All the methods obtain relatively high performance when the time interval is small (i.e., implying small deformation), but the performance drops as the time interval becomes larger (i.e., implying large deformation). This is because large deformation decreases the accuracy of motion estimation. In contrast, our DSD + Demons improves the motion estimation accuracy relative to the comparison methods, and the improvement increases as the time interval becomes larger. We attribute this to the motion estimation being guided by the extracted landmarks, which effectively handles large deformation by avoiding erroneous estimation.

As highlighted in Fig. 8, FFD and Demons are prone to errors for images with large motion, which may result in folding in the deformed image. The deep-learning-based VoxelMorph slightly alleviates the erroneous deformation, but it underestimates the deformation, as shown in Fig. 8. In contrast to these methods, our proposed method first extracts landmarks to represent the topological structure of the target organ and estimates the dense motion field by considering the dynamic changes of the organ's morphology. While our DSD exhibits a great capability of modeling large motion from the detected landmarks, the local appearance must still be considered for subtle refinement of the motion field.

TABLE II: Motion estimation results on the 4D-L-CT dataset among the individual time-points, from maximum breath holding to minimum exhalation for all the comparison methods.
Interval-1 Interval-2 Interval-3 Interval-4 Mean
DSC HD ASD RVD DSC HD ASD RVD DSC HD ASD RVD DSC HD ASD RVD DSC HD ASD RVD
FFD 0.985 4.70 0.115 0.003 0.981 4.63 0.155 0.007 0.977 4.75 0.186 0.009 0.975 5.34 0.206 0.010 0.979±0.005 4.85±0.578 0.17±0.06 0.008±0.005
Demons 0.987 4.67 0.105 0.003 0.982 4.61 0.140 0.009 0.978 4.78 0.181 0.017 0.973 5.41 0.219 0.027 0.980±0.005 4.87±0.584 0.17±0.07 0.015±0.013
FeaturePoints 0.937 4.28 0.508 0.012 0.931 5.19 0.521 0.032 0.925 7.62 0.562 0.048 0.918 9.07 0.594 0.069 0.925±0.014 7.18±2.178 0.56±0.08 0.048±0.033
VoxelMorph 0.979 4.98 0.068 0.006 0.973 5.05 0.113 0.010 0.970 5.16 0.130 0.015 0.967 5.07 0.141 0.020 0.972±0.005 5.06±0.523 0.11±0.03 0.007±0.010
VoxelMorph + Demons 0.986 4.86 0.073 0.000 0.982 5.18 0.154 0.008 0.979 5.21 0.193 0.015 0.975 5.43 0.209 0.024 0.981±0.005 5.17±0.626 0.17±0.05 0.013±0.009
DSD (Ours) 0.983 4.79 0.092 0.003 0.977 5.16 0.154 0.008 0.970 5.19 0.193 0.015 0.967 5.34 0.209 0.024 0.974±0.006 5.12±0.651 0.17±0.05 0.013±0.009
DSD + Demons (Ours) 0.987 4.43 0.084 0.002 0.983 4.87 0.123 0.003 0.980 4.94 0.144 0.005 0.979 5.17 0.164 0.006 0.982±0.005 4.85±0.597 0.13±0.03 0.004±0.003
Figure 9: Visual results of three samples from 4D-L-CT. The red and green lines describe the position of the right diaphragm (in anatomy) at the end of expiration and the end of inspiration, respectively. The blue circle indicates the horizontally over-squeezed regions, which should instead move upward.

Overall, our DSD + Demons scheme achieves the best performance (DSC=0.930, HD=4.81), and VoxelMorph + Demons ranks second-best (DSC=0.911, HD=5.21). This reveals that our DSD framework can effectively guarantee the accuracy of motion estimation compared with VoxelMorph, which is the state-of-the-art deep learning method. In addition, compared with the method only using conventional Demons (DSC=0.912, HD=5.38), our DSD + Demons also improves the accuracy of the estimations. We attribute this to the fact that the conventional methods, i.e., Demons and FFD, are greatly limited by their optimization based on local image similarity, which leads to an inability to effectively estimate large deformation.

V-A2 4D-L-CT dataset

Table II lists the quantitative motion estimation results on the 4D-L-CT dataset. Similar to the evaluation on the cardiac data, we also compare the results with respect to different time intervals between the two input time-points. Our DSD + Demons achieves the best registration accuracy. In fact, the performance of all the compared methods is similar on the 4D-L-CT dataset, with only 0.01 difference in DSC between the highest (DSD + Demons: DSC=0.982, HD=4.85) and the lowest (VoxelMorph: DSC=0.972, HD=5.06). This is because the anatomical structures of the lungs are relatively simple, and their motion is easier to estimate compared to the cardiac structure.

Fig. 9 depicts the warped outputs of all the methods. The temporal changes of the lungs in appearance and shape are generally mild between the phases of maximum breath holding and minimum exhalation (cf. the green and red lines in Fig. 9). Although the conventional Demons method achieves good quantitative performance, there is unfortunately an over-squeezing in the deformation, and thus artifacts appear in the deformed image. For instance, as indicated by the blue circle in Fig. 9, the estimated deformation on the lower lobes of the lungs is horizontally over-squeezed, where the correct deformation (see the fixed and moving images in Fig. 9) should have been an upward movement. In contrast, our proposed method effectively reduces the error of motion estimation in the stenosis of the lower lobe of the lung.

TABLE III: Motion estimation results on the ACDC dataset among the right ventricle (RV), left ventricle myocardium (LVM), and left ventricle (LV) between ED and ES
RV LVM LV
DSC HD ASD RVD DSC HD ASD RVD DSC HD ASD RVD
FFD 0.771 7.324 3.202 0.647 0.734 6.364 1.414 0.144 0.847 3.695 1.593 0.266
Demons 0.735 8.513 3.724 0.812 0.815 4.882 1.248 0.101 0.829 4.494 1.895 0.381
FeaturePoints 0.649 10.219 4.905 1.171 0.560 8.436 2.241 0.137 0.632 8.325 4.548 1.247
VoxelMorph 0.771 3.991 2.439 0.188 0.731 4.321 1.564 0.171 0.763 3.981 2.761 0.270
DSD (Ours) 0.806 4.240 1.647 0.246 0.842 3.672 1.242 0.116 0.876 3.073 1.323 0.234
Figure 10: Visual results of three samples from the ACDC dataset: (a) and (b) indicate the input images; (c) - (f) indicate the deformed moving images by individual methods. The red circles highlight comparisons regarding folding artifacts and errors in the deformed images.

V-A3 ACDC dataset

Table III lists the quantitative motion estimation results on the ACDC dataset. Our proposed DSD method achieves outstanding performance across most evaluations. As shown in Fig. 10, FFD and Demons are prone to errors in motion estimation. For the feature point based method, the moving image is not effectively deformed, because the feature points extracted from the two time-point images lack consistency. Meanwhile, VoxelMorph obtains relatively accurate estimation, but it may still result in folding in the image deformation, as shown in Fig. 10. In contrast, our proposed DSD method effectively and accurately tracks the motion in the images.

V-B Ablation study of the DSD method

We also conduct an ablation study for our unsupervised landmark detection network, which is a critical component of our framework. To effectively assess the impact of our proposed losses on the accuracy of motion estimation and landmark detection, we analyze our proposed method with three groups of loss function combinations, following the scheme in [64], i.e., Only-Keeping, Removing, and Accumulating, as shown in Table IV. In the 'Only-Keeping' subgroup, we only keep one selected loss function combined with the baseline model. In the 'Removing' subgroup, we remove the selected loss function and employ all other proposed loss functions. In the 'Accumulating' subgroup, we incrementally add the loss terms to the baseline model in the order of $\mathcal{L}_{1}$, $\mathcal{L}_{2}$, and $\mathcal{L}_{3}$. Note that the baseline model (3D-FOMM) already uses $\mathcal{L}_{4}$ and $\mathcal{L}_{d}$; we thus only investigate $\mathcal{L}_{1}$, $\mathcal{L}_{2}$, and $\mathcal{L}_{3}$ in this ablation experiment.

TABLE IV: The quantitative performance of spatiotemporal motion field estimation on the 4D-C-CT dataset for different ablation settings.
Only-Keeping Removing Accumulating
DSC HD ASD RVD DSC HD ASD RVD DSC HD ASD RVD
Baseline (3D-FOMM) 0.783 8.26 3.36 0.604 - - - - 0.783 8.26 3.36 0.604
$\mathcal{L}_{1}$ 0.837 7.46 2.43 0.356 0.807 8.56 2.78 0.518 0.837 7.46 2.43 0.356
$\mathcal{L}_{2}$ 0.806 8.34 2.92 0.462 0.880 6.42 1.51 0.196 0.842 7.19 2.34 0.261
$\mathcal{L}_{3}$ 0.799 8.72 2.76 0.508 0.842 7.19 2.34 0.261 0.882 6.16 1.66 0.192

The motion estimation results of the ablation study on 4D-C-CT are listed in Table IV. Compared to the baseline model 3D-FOMM, our $\mathcal{L}_{1}$ loss greatly improves the effectiveness of landmark detection and also the accuracy of motion estimation. In contrast, using $\mathcal{L}_{2}$ or $\mathcal{L}_{3}$ alone does not significantly improve the model performance compared to the baseline, as shown in the 'Only-Keeping' subgroup in Table IV. However, relying only on the $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ losses is not able to handle paired images with large motion. The simultaneous use of $\mathcal{L}_{1}$ and $\mathcal{L}_{3}$ can further significantly improve the accuracy of motion estimation. We attribute this to our specifically designed topological loss $\mathcal{L}_{3}$, which effectively establishes the connection between the detected landmarks and the target organ's anatomy. Therefore, the detected landmarks accurately reflect the dynamic changes of the target organ, which can be further used to guide the motion estimation. Compared with $\mathcal{L}_{1}$ and $\mathcal{L}_{3}$, $\mathcal{L}_{2}$ does not greatly improve the accuracy of the model. We attribute this to its effect being mainly manifested in improving the description of the landmark representation, especially without $\mathcal{L}_{3}$, as shown in Fig. 11. Meanwhile, $\mathcal{L}_{d}$ and the subsequently added $\mathcal{L}_{3}$ weaken its effect on the model performance, because they also guide the model to pay attention to the motion-affected region.

Figure 11: Qualitative results of a selected sample from 4D-C-CT. It visually compares the extracted landmarks with different losses enabled.

Fig. 11 presents the landmark distributions with different loss settings. Adding the driving force, $\mathcal{L}_{1}$, greatly improves the spatial exploration of the landmark detection compared with the baseline. Although the loss $\mathcal{L}_{1}$ can effectively spread the landmarks in 3D space, they are evenly distributed and cannot match the shape of the target organ (see Fig. 11(b)). On this basis, the temporal coherence loss, $\mathcal{L}_{2}$, further guides the model to explore the regions that may provide more effective guidance for the motion estimation. Finally, to make the detected landmarks conform to the anatomical topology, we add the reference topological loss, $\mathcal{L}_{3}$, which achieves the best quantitative and qualitative performance of all compared settings, as shown in Table IV, Fig. 11(d), and Fig. 11(e). Our detected landmarks exhibit strong distribution consistency with the reference landmarks, which are extracted from the manual segmentation annotations. This also reflects that the landmarks detected by our unsupervised landmark detection network conform to the anatomical topology of the target organ.

TABLE V: The quantitative performance of spatiotemporal motion field estimation on the 4D-C-CT dataset with different number of landmarks.
Baseline+$\mathcal{L}_{1}$+$\mathcal{L}_{2}$ Baseline+$\mathcal{L}_{1}$+$\mathcal{L}_{2}$+$\mathcal{L}_{3}$
DSC HD ASD RVD DSC HD ASD RVD
12 Landmarks 0.722 11.51 4.29 0.895 0.784 8.97 2.66 0.471
16 Landmarks 0.772 9.26 3.18 0.618 0.813 8.02 2.28 0.287
20 Landmarks 0.814 8.25 2.32 0.397 0.847 7.13 2.06 0.263
24 Landmarks 0.842 7.19 2.34 0.261 0.882 6.16 1.66 0.192

In addition to verifying the proposed losses, we also evaluate the effect of different numbers of landmarks and report the results in Table V. We use 12, 16, 20 and 24 landmarks to represent the target organ, respectively. Note that, to effectively evaluate the impact of the number of landmarks on motion estimation, we leverage Baseline+$\mathcal{L}_{1}$+$\mathcal{L}_{2}$ and omit $\mathcal{L}_{3}$ to constrain the landmark detection and motion reconstruction. As expected, more landmarks result in higher accuracy of the estimated motion fields (12 landmarks: DSC=0.722, HD=11.51; 16 landmarks: DSC=0.772, HD=9.26; 20 landmarks: DSC=0.814, HD=8.25; 24 landmarks: DSC=0.842, HD=7.19). The 24-landmark set achieves the best results in our experiments, and we therefore use 24 landmarks in our study, which achieves state-of-the-art results. Nevertheless, we expect that a higher number of landmarks would further improve the performance, albeit incrementally.

VI Conclusion

VI-A Summary

We present a novel unsupervised topology-extraction-guided motion estimation method for 4D dynamic medical images. Our main findings are that our DSD method: 1) effectively captures the sparse topological anatomy of the volumetric image through unsupervised landmark detection with multiple losses, especially the landmark topological loss, which aligns the detected landmarks to conform to the anatomical prior; and 2) better reconstructs the dense deformation field, with motion tracking guided by the detected landmarks.

In our framework, landmark detection can be regarded as a process of sparse encoding of the dense image representation, while dense motion reconstruction is a process of decoding from the sparse representation. As shown in Fig. 11, we found that even if the landmarks are not directly related to the shape of the target organ, the motion field between the paired input images can be relatively accurate (see Fig. 11(b)). We attribute this to the network training, in which the connection between the image coding (landmark detection) and the motion decoding (dense motion reconstruction) is established, even if the result of the image coding (detected landmarks) is not morphologically meaningful. However, this compulsory learning limits the accuracy of the motion estimation. In contrast, our unsupervised landmark detection network can effectively capture sparse landmarks that represent the salient anatomy of the target organ. This ensures that the motion field estimated from the detected landmarks is more plausible.

Compared with the refinement scheme VoxelMorph + Demons and with deriving the motion field directly by Demons (iterative derivation), our DSD + Demons achieves better performance across all metrics on both the 4D-C-CT and 4D-L-CT datasets (Table I and Table II). This is attributed to the fact that these methods, VoxelMorph and Demons, estimate the deformation only by calculating the similarity of local appearance while ignoring the global structural changes. This may lead to their inability to accurately estimate the moving direction of the target organ, which is prone to incorrect estimation and prevents the subsequent optimization from further improving the estimation (see Fig. 8 and Fig. 9). In contrast, the initial motion field estimated by our DSD can effectively handle large deformation and keep the correct anatomical shape under the guidance of the whole anatomical topology change. Then, our initial estimation can be further iteratively optimized to improve the detailed estimation accuracy. In addition, our ablation study further demonstrates that our unsupervised 3D landmark detection network has high consistency and anatomical significance in temporal landmark detection, even when the target object's shape changes greatly (see Fig. 11).

In conclusion, we introduce a novel learning-based method to estimate motion in dynamic medical images, together with an unsupervised 3D landmark detection method. In particular, our proposed topological constraint injects prior knowledge of human anatomy into landmark detection, making the detected landmarks conform to the anatomical prior of the organ. In addition, our landmark detection network is adaptable to manual landmarks, implying that it can be trained with manually annotated landmarks in a supervised or weakly-supervised manner. We note that, based on our experiments, our proposed self-supervised pre-training strategy contributes positively to the model's convergence, yet has no significant impact on the accuracy of motion estimation.
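For instance, a supervised or weakly-supervised adaptation could attach a simple coordinate regression term to the detection network. The sketch below is a hypothetical loss (names and shapes are ours, not from the paper), where a mask marks which landmarks were manually annotated.

```python
import torch

def landmark_supervision_loss(pred_landmarks, anno_landmarks, anno_mask=None):
    """Hypothetical supervision term: penalise the distance between detected
    and manually annotated landmarks; anno_mask enables weak supervision when
    only some landmarks are labelled. Shapes: (B, K, 3) and (B, K)."""
    err = ((pred_landmarks - anno_landmarks) ** 2).sum(dim=-1)  # squared distance per landmark
    if anno_mask is not None:
        err = err * anno_mask                                   # ignore unlabelled landmarks
        return err.sum() / anno_mask.sum().clamp(min=1)
    return err.mean()
```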

VI-B Limitations

Several limitations of our work should be discussed. First, the location of each landmark is represented by an independent image-sized heatmap, which leads to a large GPU memory footprint as the number of landmarks increases. Second, although our method improves the accuracy of large motion estimation, it is not sensitive to detailed structural changes and thus requires further refinement by conventional methods. In future work, we plan to reduce the GPU memory consumption, so that better estimation results can potentially be obtained by increasing the number of landmarks.
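A back-of-the-envelope calculation illustrates the memory issue; the volume size below is an assumed example, not a dataset specification.

```python
# Memory for the per-landmark heatmaps alone (illustrative volume size).
num_landmarks = 24
depth, height, width = 128, 128, 128
bytes_per_voxel = 4  # float32
mb = num_landmarks * depth * height * width * bytes_per_voxel / 1024 ** 2
print(f"{mb:.0f} MB per sample")  # ~192 MB, before activations and gradients
```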
