BInGo: Bayesian Intrinsic Groupwise Registration via Explicit Hierarchical Disentanglement
(X. Wang and X. Luo contributed equally to this work.)
Abstract
Multimodal groupwise registration aligns the internal structures in a group of medical images. Current approaches to this problem involve developing similarity measures over the joint intensity profile of all images, which may be computationally prohibitive for large image groups and unstable under various conditions. To tackle these issues, we propose BInGo, a general unsupervised hierarchical Bayesian framework based on deep learning, which learns intrinsic structural representations to measure the similarity of multimodal images. In particular, we propose a variational auto-encoder with a novel posterior, which facilitates the disentanglement of structural representations and spatial transformations, and characterizes the imaging process from the common structure with shape transition and appearance variation. Notably, BInGo is scalable: it can learn from small groups while being applied to large-scale groupwise registration at test time, thus significantly reducing computational costs. We compared BInGo with five iterative or deep-learning methods on three public intrasubject and intersubject datasets, i.e., BraTS, MS-CMR of the heart, and Learn2Reg abdomen MR-CT, and demonstrate its superior accuracy and computational efficiency, even for very large group sizes (e.g., over 1300 2D images from MS-CMR in each group).
1 Introduction
Multimodal groupwise registration aims to align multimodal images into a common structural space. Unlike conventional pairwise registration, which aligns moving images separately to a fixed image, groupwise registration ameliorates the bias introduced by designating a reference image, by estimating a common space to which all images are co-registered. It has therefore become an essential task in multivariate image analysis, including longitudinal studies, atlas construction, motion estimation, and population studies [14, 10, 18, 6].
Conventional iterative methods estimate the desired spatial transformations by optimizing intensity-based similarity measures [13, 20, 25, 22, 16]. Recently, unsupervised deep-learning methods have attempted to realize groupwise registration in an end-to-end fashion [4, 7]. For instance, the authors of [4] devised a network to optimize the conditional template entropy (CTE) introduced in [22]. These learning-based models optimize conventional similarity measures, developed for iterative registration, with stochastic gradient methods, and hence may inherit the same suboptimal performance caused by insufficient modeling of image similarity. Besides, these measures rely on correctly characterizing the joint intensity profile over the entire image group, which may be computationally prohibitive and of limited applicability for large group sizes.
In this work, we instead shift attention from devising similarity metrics to learning the underlying generative imaging process. Specifically, we establish a probabilistic generative model of the observed images, in which the transformations and the common structure are disentangled as latent variables. In this way, groupwise registration is achieved by unsupervised variational auto-encoding: the encoder extracts structural representations of the images and then infers the spatial deformations; the decoder imitates the imaging process by reconstructing the original images from the estimated common structure and the inverse deformations. The contributions of this work can be summarized as follows:
(1) We propose BInGo, a theoretically grounded unsupervised Bayesian framework for groupwise registration, which learns the intrinsic structural similarity of multimodal images through a principled variational distribution, and is capable of disentangling structural representations from image appearance.
(2) BInGo is scalable: it can be trained with small image groups while being applied to large-scale and variable-size test groups, which significantly improves its applicability and computational efficiency.
(3) We demonstrate the superiority of BInGo over similarity-based methods on three multimodal intrasubject and intersubject datasets.
To the best of our knowledge, this is the first work that realizes scalable multimodal groupwise registration by disentangled representation learning.
2 Methodology
Let the original image group be $x = \{x_m\}_{m=1}^M$, with $M$ the number of modalities and each $x_m$ defined on a common spatial domain. Groupwise registration aims to find spatial transformations $\{\phi_m\}_{m=1}^M$, such that the aligned images $\{x_m \circ \phi_m\}_{m=1}^M$ share a common structural representation $z$.
We assume $x_m$ is generated from $z$ and $\phi_m$ by a transformation-equivariant imaging process $f_m$ that contains modality-specific appearance information, i.e.
$x_m = f_m(z) \circ \phi_m^{-1}, \quad m = 1, \dots, M. \tag{1}$
Therefore, a 3-step unsupervised auto-encoding scheme can be formed: I) by modeling $f_m^{-1}$ we can encode an individual structural representation of each $x_m$ as $z_m = f_m^{-1}(x_m)$, from which $\phi_m$ can be estimated more easily; II) $z$ can then be inferred from the warped images $\{x_m \circ \phi_m\}_{m=1}^M$ following the equivariance assumption; III) by modeling $f_m$ we can reconstruct $x_m$ from the estimated $z$ and $\phi_m^{-1}$.
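To make step II explicit: equivariance means that warping commutes with the imaging process, i.e. $f_m(z) \circ \phi = f_m(z \circ \phi)$, so applying $f_m^{-1}$ to a perfectly registered image recovers the common structure. A one-line derivation from Eq. 1:

```latex
f_m^{-1}(x_m \circ \phi_m)
  = f_m^{-1}\bigl( f_m(z) \circ \phi_m^{-1} \circ \phi_m \bigr)
  = f_m^{-1}\bigl( f_m(z) \bigr)
  = z, \qquad m = 1, \dots, M.
```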
The following subsections establish BInGo, which realizes this scheme via variational and disentangled auto-encoding, thus learning $f_m$, $f_m^{-1}$, $\phi_m$, and $z$ in a unified, end-to-end fashion.
2.1 Hierarchical Bayesian Inference
Fig. 1. The proposed hierarchical Bayesian model of BInGo: (a) the generative process; (b) the inference model.
We first formalize the estimation of the spatial deformations through Bayesian inference, as illustrated in Fig. 1. Specifically, let $x = \{x_m\}_{m=1}^M$ be a sample of the observed image group. We decompose the latent variables generating $x$ into two independent subgroups: 1) the common structural representation $z$, and 2) the stationary velocity fields $v = \{v_m\}_{m=1}^M$ that parameterize the diffeomorphic transformations $\{\phi_m\}_{m=1}^M$ [1]. Thus, following the variational Bayes framework, the objective function to maximize is the evidence lower bound (ELBO) of the log-likelihood, which takes the form
$\mathcal{L}(q) = \mathbb{E}_{q(z, v \mid x)}\bigl[\log p(x \mid z, v)\bigr] - \operatorname{KL}\bigl[q(z, v \mid x) \,\|\, p(z, v)\bigr], \tag{2}$
where $q$ denotes the variational posterior distribution and $\operatorname{KL}[\,\cdot\,\|\,\cdot\,]$ the Kullback-Leibler (KL) divergence. After optimizing the ELBO, the desired velocity fields can be inferred via maximum a posteriori estimation.
Furthermore, we express the latent variables with $L$ hierarchical levels [24], i.e. $z = \{z^l\}_{l=1}^L$ and $v_m = \{v_m^l\}_{l=1}^L$, where a higher level indicates a finer-scale (larger) resolution. The total deformation can then be computed deterministically by exponentiating and composing the velocity fields, $\phi_m = \phi_m^L$, where $\phi_m^l = \exp(v_m^l) \circ \phi_m^{l-1}$ and $\phi_m^0 = \operatorname{Id}$. The hierarchical strategy allows a complex deformation to be modeled by several simpler, easier-to-learn ones.
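For concreteness, the exponentiation $\exp(v)$ can be computed by scaling and squaring [1], and deformations across levels combine by composition. Below is a minimal 2D PyTorch sketch, not the authors' implementation: the helper `warp`, the displacement convention (channels in normalized $(x, y)$ grid coordinates), and the omission of inter-level upsampling are all our assumptions.

```python
import torch
import torch.nn.functional as F

def warp(field, disp):
    """Resample `field` (B, C, H, W) with displacement `disp` (B, 2, H, W),
    given in normalized [-1, 1] grid coordinates (channel 0 = x, channel 1 = y)."""
    B, _, H, W = disp.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    identity = torch.stack((xs, ys), dim=-1).to(disp)           # (H, W, 2)
    grid = identity.unsqueeze(0) + disp.permute(0, 2, 3, 1)     # (B, H, W, 2)
    return F.grid_sample(field, grid, align_corners=True)

def exp_svf(v, steps=7):
    """Scaling and squaring: integrate a stationary velocity field v into
    the displacement of a diffeomorphism, phi = exp(v)."""
    disp = v / (2 ** steps)
    for _ in range(steps):
        # phi_{2t} = phi_t o phi_t: compose the displacement with itself.
        disp = disp + warp(disp, disp)
    return disp

def compose(disp_outer, disp_inner):
    """Displacement of the composition phi_outer o phi_inner."""
    return disp_inner + warp(disp_outer, disp_inner)
```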
To simplify the KL divergence, we introduce additional independence assumptions: 1) both the prior and the posterior of the velocities factorize among different images and levels, i.e. $p(v) = \prod_{m,l} p(v_m^l)$ and $q(v \mid x) = \prod_{m,l} q(v_m^l \mid x, v^{<l})$, where $v^{<l}$ denotes the group of latent variables at levels less than $l$; and 2) the common structure can be inferred directly from the observed images and the estimated velocity fields, i.e. $q(z \mid x, v) = \prod_l q(z^l \mid x, v, z^{>l})$, since $(x, v)$ determines the registered images and thus contains all information about $z^l$ for any $l$.
Taking these assumptions into account, the KL divergence can be written as
$\operatorname{KL}\bigl[q(z, v \mid x) \,\|\, p(z, v)\bigr] = \underbrace{\textstyle\sum_{l=1}^{L}\sum_{m=1}^{M} \mathbb{E}_{q}\,\operatorname{KL}\bigl[q(v_m^l \mid x, v^{<l}) \,\|\, p(v_m^l)\bigr]}_{\text{(i)}} + \underbrace{\textstyle\sum_{l=1}^{L} \mathbb{E}_{q}\,\operatorname{KL}\bigl[q(z^l \mid x, v, z^{>l}) \,\|\, p(z^l \mid z^{>l})\bigr]}_{\text{(ii)}}, \tag{3}$
where we prescribe $v^{<1} = \varnothing$ and $z^{>L} = \varnothing$ for simplicity. The key idea here is that the overall KL divergence decomposes w.r.t. (i) the velocity fields $v$ and (ii) the common structure $z$. The former serves as a regularization ensuring diffeomorphisms, fulfilled by the constraint introduced in [5]; the latter estimates the structural similarity among the images, as detailed in the next subsection.
2.2 Intrinsic Similarity over Structural Representations
For registration, it is crucial to measure the structural similarity efficiently, which is achieved in this work by learning instead of by a pre-defined metric. To this end, we propose to learn multilevel “expert” distributions $q_m^l := q(z^l \mid x_m \circ \phi_m^l, z^{>l})$, which serve as $f_m^{-1}$ (i.e., the inverses of the imaging processes) to extract the structural representations of the warped images. We further assume the posterior $q(z^l \mid x, v, z^{>l})$ and the prior $p(z^l \mid z^{>l})$ to be the geometric mean [15] and the arithmetic mean [23] of the experts, respectively:
$q(z^l \mid x, v, z^{>l}) \propto \prod_{m=1}^{M} \bigl(q_m^l\bigr)^{1/M}, \qquad p(z^l \mid z^{>l}) = \frac{1}{M}\sum_{m=1}^{M} q_m^l. \tag{4}$
Therefore, the KL in Eq. 3(ii) essentially measures the intrinsic (dis)similarity, i.e., the dissimilarity among the experts (the intrinsic structural representations); its minimization, as part of maximizing the ELBO, encourages the experts to become identical, thus forcing the multilevel posteriors to represent the common structure. Meanwhile, the velocity fields are learned in tandem.
We model the experts as Gaussians, $q_m^l = \mathcal{N}(z^l; \mu_m^l, \Sigma_m^l)$, and thus so is the joint posterior, i.e. $q(z^l \mid x, v, z^{>l}) = \mathcal{N}(z^l; \mu^l, \Sigma^l)$ with
$\Sigma^l = \Bigl(\frac{1}{M}\sum_{m=1}^{M} (\Sigma_m^l)^{-1}\Bigr)^{-1}, \qquad \mu^l = \Sigma^l \cdot \frac{1}{M}\sum_{m=1}^{M} (\Sigma_m^l)^{-1} \mu_m^l. \tag{5}$
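Under the common assumption of diagonal covariances (not precluded by the derivation above), Eq. 5 reduces to a precision-weighted average. A minimal sketch, with `mu` and `logvar` stacked over the $M$ experts:

```python
import torch

def poe_fuse(mu, logvar):
    """Geometric-mean (product-of-experts) fusion of M diagonal Gaussians.
    mu, logvar: (M, B, C, H, W) per-expert means and log-variances.
    Returns the fused mean and log-variance of Eq. (5)."""
    precision = torch.exp(-logvar)          # (M, B, C, H, W)
    fused_prec = precision.mean(dim=0)      # (1/M) * sum of precisions
    fused_var = 1.0 / fused_prec
    fused_mu = fused_var * (precision * mu).mean(dim=0)
    return fused_mu, torch.log(fused_var)
```

Because the fusion is a mean over the experts, it is defined for any number of inputs $M$, which is precisely what allows BInGo to be trained on pairs and tested on much larger groups (Sect. 2.4).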
In light of the computational intractability of the KL divergence involving Gaussian mixture distributions, we further exploit its convexity to obtain
$\operatorname{KL}\Bigl[q(z^l \mid x, v, z^{>l}) \,\Big\|\, \frac{1}{M}\sum_{m=1}^{M} q_m^l\Bigr] \;\leq\; \frac{1}{M}\sum_{m=1}^{M} \operatorname{KL}\bigl[q(z^l \mid x, v, z^{>l}) \,\|\, q_m^l\bigr]. \tag{6}$
Hence, minimizing the KL in Eq. 3(ii) is approximated by minimizing the right-hand side of Eq. 6, where each summand has a closed-form expression.
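The right-hand side of Eq. 6 is a plain average of Gaussian-to-Gaussian KL divergences. A sketch under the same diagonal-covariance assumption as above:

```python
import torch

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal
    covariances, summed over all but the batch dimension."""
    var_q, var_p = torch.exp(logvar_q), torch.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.flatten(1).sum(dim=1)

def kl_upper_bound(mu_q, logvar_q, mu_experts, logvar_experts):
    """Convexity bound of Eq. (6): KL(q || (1/M) sum_m q_m) <=
    (1/M) sum_m KL(q || q_m). mu_experts, logvar_experts: (M, B, ...)."""
    kls = torch.stack([kl_diag_gauss(mu_q, logvar_q, mu_m, lv_m)
                       for mu_m, lv_m in zip(mu_experts, logvar_experts)])
    return kls.mean(dim=0)
```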
2.3 Explicit Disentanglement with Neural Networks
The ELBO has thus been decomposed into three terms: the reconstruction of the original images (the first term in Eq. 2), the KL divergence of the velocity fields, which ensures diffeomorphisms (Eq. 3(i)), and the KL divergence measuring the intrinsic (dis)similarity, which drives the registration (Eq. 3(ii)), balanced by weighting hyperparameters. To estimate the ELBO, we propose a dedicated hierarchical variational auto-encoder (VAE) as the inference model of Fig. 1(b), whose architecture with $L$ levels of hierarchy is depicted in Fig. 2. The network learns $q(v_m^l \mid x, v^{<l})$ and the experts $q_m^l$, and the expectations in the ELBO are estimated by Monte Carlo sampling, similar to traditional VAEs.
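Schematically, one Monte Carlo estimate of the (negative) ELBO combines the three terms as a weighted sum; `lambda_v` and `lambda_z` below stand in for the weighting hyperparameters, whose values are not prescribed here:

```python
def negative_elbo(recon_l1, kl_v, kl_z, lambda_v, lambda_z):
    """Training loss of BInGo (schematic): L1 reconstruction of the original
    images (first term of Eq. 2), velocity-field regularization (Eq. 3(i)),
    and the intrinsic dissimilarity of the experts (Eq. 3(ii), via Eq. 6)."""
    return recon_l1 + lambda_v * kl_v + lambda_z * kl_z
```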
Fig. 2. The network architecture of BInGo with $L$ levels of hierarchy, comprising encoders, multilevel registration (Reg) modules, and decoders.
The motif of BInGo is to explicitly disentangle the common structure, the spatial transformations, and the appearance information of multimodal images, thus realizing Eq. 1. To this end, the network comprises three types of cooperative modules: 1) encoders that extract multilevel transformation-equivariant structural representations, corresponding to the inverse imaging functions $f_m^{-1}$; 2) multilevel registration (Reg) modules that infer the spatial transformations from the individual structural representations, corresponding to estimating $\phi_m$ from $z_m$; and 3) decoders, with modality-specific appearance information embedded as parameters, which reconstruct the registered images from the learned common structure, corresponding to the imaging functions $f_m$.
Particularly, the Reg module at level $l$ assumes the coarser-scale deformations $\phi_m^{l-1}$ have been resolved, and thus takes the individual representations of the partially warped images as inputs; these are like “partially distorted” versions of the expert distributions, since the experts are representations of the fully registered images $x_m \circ \phi_m$. Thus, in the same spirit as Eq. 4, the Reg module computes their geometric mean as the “partially common” structure, and then concatenates this mean with each individual representation to infer $q(v_m^l \mid x, v^{<l})$ for each image separately.
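A possible skeleton of one Reg module is sketched below. For readability it stands in a plain average of feature maps for the geometric-mean fusion of Sect. 2.2, and the two-layer convolutional head `to_velocity` is illustrative, not the actual architecture:

```python
import torch
import torch.nn as nn

class RegModule(nn.Module):
    """Infer per-image velocity fields at one level from the warped
    individual representations (a sketch; channel sizes are illustrative)."""
    def __init__(self, channels):
        super().__init__()
        # Shared head mapping [individual || partially-common] features
        # to a 2D stationary velocity field.
        self.to_velocity = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, 2, 3, padding=1))

    def forward(self, feats):                  # feats: (M, B, C, H, W)
        common = feats.mean(dim=0)             # "partially common" structure
        return torch.stack([self.to_velocity(torch.cat([f, common], dim=1))
                            for f in feats])   # (M, B, 2, H, W)
```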
Therefore, BInGo can realize the aforementioned 3-step unsupervised scheme induced by Eq. 1, as follows:
I. Bottom-up Inference of Velocity Fields. The encoders first produce multilevel feature maps as the individual structural representations of the original images. Then, up from the bottom ($l$ increasing from 1 to $L$), given that the velocities $v_m^{<l}$ have been inferred (and hence $\phi_m^{l-1}$ computed), the feature maps are warped by $\phi_m^{l-1}$ to obtain the partially registered representations. The Reg module at level $l$ then infers $q(v_m^l \mid x, v^{<l})$, from which $v_m^l$ is sampled and $\phi_m^l$ is finally computed through integration.
II. Top-down Inference of Common Structures. Once the total deformations $\phi_m$ have been inferred, the encoders are fed again with the warped images $x_m \circ \phi_m$ to produce the experts in a top-down ($l$ decreasing from $L$ to 1) manner. The posterior is then computed by Eq. 4, from which the modality-invariant common structural representation $z^l$ is sampled. Note that the second encoder pass could be avoided by warping the feature maps of the original images to compute the experts (thanks to equivariance), but this incurs a higher computational cost.
III. Disentangled Auto-Encoding. Based on the common structure inferred by the encoders, the decoders reconstruct the registered images. Reconstructions of the original images are then obtained by applying the inverse deformations $\phi_m^{-1}$. We use the (negative) L1 loss between the original images and their reconstructions to estimate the first (reconstruction) term in Eq. 2. In addition, to better disentangle structure and appearance, the convolutional layers of the network are shared across modalities, while domain-specific batch normalization (BN) layers encode the appearance information of each modality [3].
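The domain-specific batch normalization of [3] keeps one BN layer per modality behind convolutions whose weights are shared; a minimal sketch:

```python
import torch.nn as nn

class DomainSpecificBN2d(nn.Module):
    """One BatchNorm2d per modality; the convolution weights elsewhere are
    shared, so only the normalization statistics and affine parameters
    encode modality-specific appearance."""
    def __init__(self, num_features, num_modalities):
        super().__init__()
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(num_features) for _ in range(num_modalities))

    def forward(self, x, modality):
        return self.bns[modality](x)
```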
2.4 Towards Large-Scale and Variable-Size Groupwise Registration
Most conventional methods optimize similarity measures defined over the entire image group, making the computation for large groups formidable. Besides, learning-based models that rely on these measures require all groups to have the same size. These drawbacks significantly limit their applicability.
BInGo, in contrast, is scalable: it handles multimodal image groups of various sizes for training and testing. As shown in Fig. 3, it can be trained with either bimodal image pairs or image groups of complete modalities, referred to as partial or complete learning, respectively. Trained either way, BInGo can be flexibly applied to larger, unseen test groups with arbitrary numbers of images, which substantially boosts the computational efficiency of training.
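Partial learning changes only how training groups are drawn. A hypothetical sampling routine (the argument `images_by_modality` and all names are placeholders, not BInGo's data pipeline):

```python
import random

def sample_training_group(images_by_modality, partial=True):
    """Draw one training group: a random bimodal pair under partial learning,
    or all M modalities under complete learning. `images_by_modality` maps
    modality name -> image tensor of one subject."""
    modalities = list(images_by_modality)
    if partial:
        modalities = random.sample(modalities, 2)
    return [images_by_modality[m] for m in modalities]
```

Because every group-dependent operation in BInGo (the expert fusion of Eq. 4 and the bound of Eq. 6) is an average over the $M$ inputs, the same trained weights apply unchanged to test groups of any size.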
Fig. 3. Complete and partial learning schemes of BInGo for variable-size groupwise registration.
The scalability of our model is beneficial in two scenarios:
Intrasubject Images with Missing Modalities. The goal is to co-register the multimodal scans of each patient. In practice, there can be many patients, each with different missing modalities. BInGo can be trained on such heterogeneous datasets via partial learning, while performing groupwise registration for test subjects with complete modalities.
Intersubject Populations. Intersubject groupwise registration, which is crucial for atlas construction and population analysis, can involve a multitude of images for every modality. Conventional methods for this problem rely on iterative optimization and suffer from computational complexity that scales with the group size, while BInGo greatly relieves the training burden and can subsequently be applied to test groups of complete populations in one shot.
3 Experiments and Results
3.1 Datasets and Preprocessing
BraTS-2021. The dataset provides 3D pre-operative T1, T1Gd, T2, and T2-FLAIR MR scans of patients with glioblastoma [17, 2]. We randomly selected 300/50/150 patient sets for training/validation/test. The volumes were downsampled and cropped to a fixed ROI. As the images of each patient are pre-registered, we used synthetic free-form deformations (FFDs) with different control point spacings to simulate misalignments.
MS-CMR. The MS-CMRSeg challenge [26] provides the cardiac MR sequences LGE, bSSFP, and T2 from 45 patients, which exhibit complementary information about cardiac structure and pathology. The images were preprocessed by affine co-registration, ROI cropping, and slice selection, producing 39/15/44 slices for training/validation/test. We simulated severer misalignments by applying additional FFDs to the original images, to better demonstrate model efficacy.
Learn2Reg. The Learn2Reg abdomen MR-CT dataset [8, 11] provides intersubject abdominal MR and CT scans, which we used for intersubject population groupwise registration.
3.2 Experimental Setups
Compared Methods. Three types of unsupervised methods for multimodal groupwise registration were compared on the intrasubject BraTS and MS-CMR datasets: 1) similarity-based iterative methods using the information-theoretic metrics CTE, APE, or X-CoReg [22, 25, 16]; 2) similarity-based deep-learning models that optimize CTE or APE using an attention residual U-Net (AttResUNet) [9, 19] as the backbone; and 3) BInGo, the proposed model learning intrinsic similarity through hierarchical disentanglement. For intersubject population groupwise registration on Learn2Reg, the models in 2) are not applicable, since they can only be trained with complete image groups, whereas BInGo can be trained partially. In addition, for BraTS and MS-CMR, all baseline methods used single-level velocity fields as the transformation model; for Learn2Reg, the iterative baselines performed rigid, affine, and FFD registration successively for optimal accuracy, owing to the severe initial misalignments.
Implementation Details. Training used the Adam optimizer [12], with the learning rate and batch size chosen per dataset (one setting for MS-CMR, another for the other datasets). The experiments were conducted with PyTorch [21] on an NVIDIA 3090 GPU.
Evaluation Metrics. We reported the Dice similarity coefficient (DSC) averaged over all pairwise combinations of the segmentation masks for registered images. For the BraTS dataset, as the ground-truth misalignments were available, we also reported the groupwise warping index (gWI) [16], which would reduce to zero if the misaligned images were perfectly co-registered.
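For reference, the reported groupwise DSC is the plain average of Dice scores over all pairwise combinations of masks; a sketch assuming binary masks (the handling of empty masks is our choice):

```python
import itertools
import torch

def groupwise_dice(masks):
    """Average Dice similarity coefficient over all pairwise combinations of
    the binary segmentation masks of the M registered images."""
    scores = []
    for a, b in itertools.combinations(masks, 2):
        inter = (a & b).sum().item()
        denom = a.sum().item() + b.sum().item()
        scores.append(2.0 * inter / denom if denom > 0 else 1.0)
    return sum(scores) / len(scores)
```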
3.3 Results
Table 1. Registration accuracy and number of parameters (#Params, in millions) on BraTS and MS-CMR. "None" denotes the initial misalignment; †: complete learning; ∗: partial learning.

| Method | BraTS DSC | BraTS gWI | BraTS #Params (M) | MS-CMR DSC | MS-CMR #Params (M) |
|---|---|---|---|---|---|
| None | | | — | | — |
| APE [25] | | | 7.373 | | 0.154 |
| CTE [22] | | | 7.373 | | 0.154 |
| X-CoReg [16] | | | 7.373 | | 0.154 |
| APE+AttResUNet | | | 22.955 | | 8.036 |
| CTE+AttResUNet | | | 22.955 | | 8.036 |
| BInGo (complete†) | | | 13.429 | | 4.516 |
| BInGo (partial∗) | | | 13.429 | | 4.516 |
Table 2. DSC on the Learn2Reg abdomen MR-CT test populations for different numbers of merged MR-CT pairs $N$ (each group contains $2N$ images). N/A: failed due to excessive GPU cost.

| Method | $N=2$ | $N=4$ | $N=8$ | $N=16$ |
|---|---|---|---|---|
| None | | | | |
| APE [25] | | | N/A | N/A |
| CTE [22] | | | N/A | N/A |
| X-CoReg [16] | | | N/A | N/A |
| BInGo (partial∗) | | | | |
Fig. 4. Scalability tests: accuracy of BInGo on merged test groups of increasing size from MS-CMR and BraTS.
Fig. 5. Qualitative comparison of the best baselines and BInGo, with predicted deformations and feature maps.
Intrasubject Image Groups. Table 1 presents the registration accuracy on the BraTS and MS-CMR datasets. BInGo (with complete† or partial∗ learning) worked consistently better than the similarity-based iterative or learning approaches, except for the APE-based iterative method on BraTS. Notably, BInGo achieved superior performance to the deep-learning baselines with only about half of their trainable parameters, even though their backbone AttResUNet is more advanced than the vanilla U-Net-like structure used in BInGo. Furthermore, the proposed partial learning strategy performed better than most baselines with significantly lower computational demands, which makes large-scale end-to-end groupwise registration practical.
Intersubject Population Image Groups. For Learn2Reg, we merged every $N$ test MR-CT pairs to form larger groups (each containing $2N$ images to co-register) as test sets of intersubject populations. Table 2 presents the results for different $N$. All iterative approaches failed when the groups became large, due to excessive GPU cost, while our partial learning strategy worked consistently better and maintained good performance for large groups. Note that the deep-learning baselines cannot handle test groups of varying sizes and are thus not presented.
Additional Scalability Tests. For MS-CMR and BraTS, we merged test groups in a similar way to evaluate BInGo (trained via complete learning) on image groups of different sizes. As shown in Fig. 4, as the group size increases, co-registration becomes significantly more difficult (indicated by the worse initial DSC/gWI), whereas BInGo maintained decent performance for ultra-large groups (e.g., over 1300 images from MS-CMR) and achieved even better accuracy for larger 3D image groups from BraTS, showing remarkable robustness in large-scale groupwise registration.
Qualitative Results. We visualize the results of the best baselines (APE, CTE+AttResUNet, and X-CoReg for BraTS, MS-CMR, and Learn2Reg, respectively) and BInGo in Fig. 5. BInGo achieved better alignment of both large-scale anatomy and local fine structures in most cases, and the predicted deformations were smooth and diffeomorphic. The feature maps from BInGo were nearly modality-invariant and shared similar structures, illustrating successful disentanglement of the common structure, spatial transformations, and appearance information. The relatively poor performance in certain local areas may be due to overly severe initial misalignment, which places different tissues with similar intensities at the same location. Still, BInGo performed no worse than the baselines in such regions and achieved satisfactory accuracy over the entire foreground.
4 Conclusion and Discussion
In this work we have presented BInGo, a generative Bayesian framework for unsupervised multimodal groupwise registration. This new formulation of image registration achieves competitive performance compared with similarity-based iterative and unsupervised learning methods. In particular, we demonstrated that, equipped with its unique scalability, BInGo can reduce the computational burden of groupwise registration without compromising accuracy. This opens up the possibility of realizing learning-based multimodal groupwise registration at large scale and with various group sizes. A potential limitation of our work is that performance may drop on specific datasets, e.g., the highly challenging abdominal images. Future work includes investigating the negative factors that may inhibit BInGo from generalizing to large image groups.
References
- [1] Ashburner, J.: A fast diffeomorphic image registration algorithm. Neuroimage 38(1), 95–113 (2007)
- [2] Baid, U., Ghodasara, S., Mohan, S., Bilello, M., Calabrese, E., Colak, E., Farahani, K., Kalpathy-Cramer, J., Kitamura, F.C., Pati, S., et al.: The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314 (2021)
- [3] Chang, W.G., You, T., Seo, S., Kwak, S., Han, B.: Domain-specific batch normalization for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7354–7362 (2019)
- [4] Che, T., Zheng, Y., Sui, X., Jiang, Y., Cong, J., Jiao, W., Zhao, B.: Dgr-net: Deep groupwise registration of multispectral images. In: International Conference on Information Processing in Medical Imaging. pp. 706–717. Springer (2019)
- [5] Dalca, A.V., Balakrishnan, G., Guttag, J., Sabuncu, M.R.: Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Medical image analysis 57, 226–236 (2019)
- [6] Geng, X., Christensen, G.E., Gu, H., Ross, T.J., Yang, Y.: Implicit reference-based group-wise image registration and its application to structural and functional MRI. Neuroimage 47(4), 1341–1351 (Oct 2009)
- [7] He, Z., Chung, A.C.: Unsupervised end-to-end groupwise registration framework without generating templates. In: 2020 IEEE International Conference on Image Processing (ICIP). pp. 375–379. IEEE (2020)
- [8] Hering, A., Hansen, L., Mok, T.C.W., Chung, A.C.S., Siebert, H., Häger, S., Lange, A., Kuckertz, S., Heldmann, S., Shao, W., et al.: Learn2reg: comprehensive multi-task medical image registration challenge, dataset and evaluation in the era of deep learning. IEEE Transactions on Medical Imaging (2022)
- [9] Hu, Y., Modat, M., Gibson, E., Li, W., Ghavami, N., Bonmati, E., Wang, G., Bandula, S., Moore, C.M., Emberton, M., et al.: Weakly-supervised convolutional neural networks for multimodal image registration. Medical image analysis 49, 1–13 (2018)
- [10] Joshi, S.C., Davis, B.C., Jomier, M., Gerig, G.: Unbiased diffeomorphic atlas construction for computational anatomy. NeuroImage 23, S151–S160 (2004)
- [11] Kavur, A.E., Selver, M.A., Dicle, O., Barış, M., Gezer, N.S.: CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge Data (Apr 2019). https://doi.org/10.5281/zenodo.3362844
- [12] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations (2015)
- [13] Learned-Miller, E.G.: Data driven image models through continuous joint alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(2), 236–250 (2005)
- [14] Liao, S., Jia, H., Wu, G., Shen, D.: A novel framework for longitudinal atlas construction with groupwise registration of subject image sequences. Neuroimage 59(2), 1275–1289 (Jan 2012)
- [15] Lorenzen, P., Prastawa, M., Davis, B., Gerig, G., Bullitt, E., Joshi, S.: Multi-modal image set registration and atlas formation. Medical image analysis 10(3), 440–451 (2006)
- [16] Luo, X., Zhuang, X.: X-metric: An n-dimensional information-theoretic framework for groupwise registration and deep combined computing. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
- [17] Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al.: The multimodal brain tumor image segmentation benchmark (BraTS). IEEE Transactions on Medical Imaging 34(10), 1993–2024 (2014)
- [18] Metz, C.T., Klein, S., Schaap, M., van Walsum, T., Niessen, W.J.: Nonrigid registration of dynamic medical imaging data using nd+ t b-splines and a groupwise optimization approach. Medical image analysis 15(2), 238–249 (2011)
- [19] Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., Glocker, B., Rueckert, D.: Attention u-net: Learning where to look for the pancreas. In: Medical Imaging with Deep Learning (2018)
- [20] Orchard, J., Mann, R.: Registering a multisensor ensemble of images. IEEE Transactions on Image Processing 19(5), 1236–1247 (2009)
- [21] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
- [22] Polfliet, M., Klein, S., Huizinga, W., Paulides, M.M., Niessen, W.J., Vandemeulebroucke, J.: Intrasubject multimodal groupwise registration with the conditional template entropy. Medical image analysis 46, 15–25 (2018)
- [23] Shi, Y., Paige, B., Torr, P., et al.: Variational mixture-of-experts autoencoders for multi-modal deep generative models. Advances in Neural Information Processing Systems 32, 15718–15729 (2019)
- [24] Vahdat, A., Kautz, J.: NVAE: A deep hierarchical variational autoencoder. arXiv preprint arXiv:2007.03898 (2020)
- [25] Wachinger, C., Navab, N.: Simultaneous registration of multiple images: Similarity metrics and efficient optimization. IEEE transactions on pattern analysis and machine intelligence 35(5), 1221–1233 (2012)
- [26] Zhuang, X., Xu, J., Luo, X., Chen, C., Ouyang, C., Rueckert, D., Campello, V.M., Lekadir, K., Vesal, S., RaviKumar, N., et al.: Cardiac segmentation on late gadolinium enhancement mri: a benchmark study from multi-sequence cardiac mr segmentation challenge. Medical Image Analysis 81, 102528 (2022)