BInGo: Bayesian Intrinsic Groupwise Registration via Explicit Hierarchical Disentanglement
(X. Wang and X. Luo contributed equally to this work.)
Abstract
Multimodal groupwise registration aligns the internal structures in a group of medical images. Current approaches to this problem involve developing similarity measures over the joint intensity profile of all images, which may be computationally prohibitive for large image groups and unstable under various conditions. To tackle these issues, we propose BInGo, a general unsupervised hierarchical Bayesian framework based on deep learning, which learns intrinsic structural representations to measure the similarity of multimodal images. In particular, we propose a variational auto-encoder with a novel posterior, which facilitates the disentanglement of structural representations and spatial transformations, and characterizes the imaging process from the common structure with shape transition and appearance variation. Notably, BInGo is scalable: it can learn from small groups while being applied to large-scale groupwise registration at test time, thus significantly reducing computational costs. We compared BInGo with five iterative or deep-learning methods on three public intrasubject and intersubject datasets, i.e., BraTS, MS-CMR of the heart, and Learn2Reg abdomen MR-CT, and demonstrate its superior accuracy and computational efficiency, even for very large group sizes (e.g., over 1300 2D images from MS-CMR in each group).
1 Introduction
Multimodal groupwise registration aims to align multimodal images into a common structural space. Unlike conventional pairwise registration, which aligns moving images separately to a fixed image, groupwise registration ameliorates the bias introduced by designating a reference image, by estimating a common space to which all images are co-registered. It has therefore become an essential task in multivariate image analysis, including longitudinal studies, atlas construction, motion estimation, and population studies [14, 10, 18, 6].
Conventional iterative methods estimate the desired spatial transformations by optimizing intensity-based similarity measures [13, 20, 25, 22, 16]. Recently, unsupervised deep-learning methods have attempted to realize groupwise registration in an end-to-end fashion [4, 7]. For instance, the authors of [4] devised a network to optimize the conditional template entropy (CTE) introduced in [22]. These learning-based models optimize conventional similarity measures, developed for iterative registration, with stochastic gradient methods, and hence may inherit the same suboptimal performance caused by insufficient modeling of image similarity. Besides, these measures rely on correctly characterizing the joint intensity profile over the entire image group, which may be computationally prohibitive and of limited applicability for large group sizes.
In this work, we instead shift attention from devising similarity metrics to learning the underlying generative imaging process. Specifically, we establish a probabilistic generative model of the observed images, in which the transformations and the common structure are disentangled as latent variables. In this way, groupwise registration is achieved by unsupervised variational auto-encoding: the encoder extracts structural representations of the images and then infers the spatial deformations; the decoder imitates the imaging process by reconstructing the original images from the estimated common structure and the inverse deformations. The contributions of this work can be summarized as follows:
(1) We propose BInGo, a theoretically grounded unsupervised Bayesian framework for groupwise registration, which learns the intrinsic structural similarity of multimodal images through a principled variational distribution, and is capable of disentangling structural representations from image appearance.
(2) BInGo is scalable: it can be trained with small image groups while being applied to large-scale and variable-size test groups, which significantly improves its applicability and computational efficiency.
(3) We demonstrate the superiority of BInGo over similarity-based methods on three multimodal intrasubject and intersubject datasets.
To the best of our knowledge, this is the first work that realizes scalable multimodal groupwise registration by disentangled representation learning.
2 Methodology
Let the original image group be $x = \{x_m\}_{m=1}^M$, with $M$ the number of modalities and each $x_m$ defined on a common spatial domain. Groupwise registration aims to find spatial transformations $\{\phi_m\}_{m=1}^M$, such that the aligned images $\{x_m \circ \phi_m\}_{m=1}^M$ share a common structural representation $z$.
We assume $x_m$ is generated from $z$ and $\phi_m$ by a transformation-equivariant imaging process $f_m$ that contains modality-specific appearance information, i.e.
$x_m = f_m(z) \circ \phi_m^{-1}, \quad m = 1, \dots, M. \tag{1}$
Therefore, a 3-step unsupervised auto-encoding scheme can be formed: I) by modeling $f_m^{-1}$ we can encode an individual structural representation of each $x_m$ as $z_m = f_m^{-1}(x_m)$, from which $\phi_m$ can be estimated more easily; II) $z$ can then be inferred from the warped images $\{x_m \circ \phi_m\}_{m=1}^M$ following the equivariance assumption; III) by modeling $f_m$ we can reconstruct $x_m$ from the estimated $z$ and $\phi_m^{-1}$.
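To make step II explicit: equivariance means that warping commutes with the imaging process, i.e. $f_m(z) \circ \phi = f_m(z \circ \phi)$, so applying $f_m^{-1}$ to a perfectly registered image recovers the common structure. A one-line derivation from Eq. 1:

```latex
f_m^{-1}(x_m \circ \phi_m)
  = f_m^{-1}\bigl( f_m(z) \circ \phi_m^{-1} \circ \phi_m \bigr)
  = f_m^{-1}\bigl( f_m(z) \bigr)
  = z, \qquad m = 1, \dots, M.
```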
The following subsections establish BInGo, which realizes this scheme via variational and disentangled auto-encoding, thus learning $f_m$, $f_m^{-1}$, $\phi_m$, and $z$ in a unified, end-to-end fashion.
2.1 Hierarchical Bayesian Inference
Fig. 1. The proposed hierarchical Bayesian model of BInGo: (a) the generative process; (b) the inference model.
We first formalize the estimation of the spatial deformations through Bayesian inference, as illustrated in Fig. 1. Specifically, let $x = \{x_m\}_{m=1}^M$ be a sample of the observed image group. We decompose the latent variables generating $x$ into two independent subgroups: 1) the common structural representation $z$, and 2) the stationary velocity fields $v = \{v_m\}_{m=1}^M$ that parameterize the diffeomorphic transformations $\{\phi_m\}_{m=1}^M$ [1]. Thus, following the variational Bayes framework, the objective function to maximize is the evidence lower bound (ELBO) of the log-likelihood, which takes the form
$\mathcal{L}(q) = \mathbb{E}_{q(z, v \mid x)}\bigl[\log p(x \mid z, v)\bigr] - \operatorname{KL}\bigl[q(z, v \mid x) \,\|\, p(z, v)\bigr], \tag{2}$
where $q$ denotes the variational posterior distribution and $\operatorname{KL}[\,\cdot\,\|\,\cdot\,]$ the Kullback-Leibler (KL) divergence. After optimizing the ELBO, the desired velocity fields can be inferred via maximum a posteriori estimation.
Furthermore, we express the latent variables with $L$ hierarchical levels [24], i.e. $z = \{z^l\}_{l=1}^L$ and $v_m = \{v_m^l\}_{l=1}^L$, where a higher level indicates a finer-scale (larger) resolution. The total deformation can then be computed deterministically by exponentiating and composing the velocity fields, $\phi_m = \phi_m^L$, where $\phi_m^l = \exp(v_m^l) \circ \phi_m^{l-1}$ and $\phi_m^0 = \operatorname{Id}$. The hierarchical strategy allows a complex deformation to be modeled by several simpler, easier-to-learn ones.
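For concreteness, the exponentiation $\exp(v)$ can be computed by scaling and squaring [1], and deformations across levels combine by composition. Below is a minimal 2D PyTorch sketch, not the authors' implementation: the helper `warp`, the displacement convention (channels in normalized $(x, y)$ grid coordinates), and the omission of inter-level upsampling are all our assumptions.

```python
import torch
import torch.nn.functional as F

def warp(field, disp):
    """Resample `field` (B, C, H, W) with displacement `disp` (B, 2, H, W),
    given in normalized [-1, 1] grid coordinates (channel 0 = x, channel 1 = y)."""
    B, _, H, W = disp.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    identity = torch.stack((xs, ys), dim=-1).to(disp)           # (H, W, 2)
    grid = identity.unsqueeze(0) + disp.permute(0, 2, 3, 1)     # (B, H, W, 2)
    return F.grid_sample(field, grid, align_corners=True)

def exp_svf(v, steps=7):
    """Scaling and squaring: integrate a stationary velocity field v into
    the displacement of a diffeomorphism, phi = exp(v)."""
    disp = v / (2 ** steps)
    for _ in range(steps):
        # phi_{2t} = phi_t o phi_t: compose the displacement with itself.
        disp = disp + warp(disp, disp)
    return disp

def compose(disp_outer, disp_inner):
    """Displacement of the composition phi_outer o phi_inner."""
    return disp_inner + warp(disp_outer, disp_inner)
```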
To simplify the KL divergence, we introduce additional independence assumptions: 1) both the prior and the posterior of the velocities factorize among different images and levels, i.e. $p(v) = \prod_{m,l} p(v_m^l)$ and $q(v \mid x) = \prod_{m,l} q(v_m^l \mid x, v^{<l})$, where $v^{<l}$ denotes the group of latent variables at levels less than $l$; and 2) the common structure can be inferred directly from the observed images and the estimated velocity fields, i.e. $q(z \mid x, v) = \prod_l q(z^l \mid x, v, z^{>l})$, since $(x, v)$ determines the registered images and thus contains all information about $z^l$ for any $l$.
Taking these assumptions into account, the KL divergence can be written as
$\operatorname{KL}\bigl[q(z, v \mid x) \,\|\, p(z, v)\bigr] = \underbrace{\textstyle\sum_{l=1}^{L}\sum_{m=1}^{M} \mathbb{E}_{q}\,\operatorname{KL}\bigl[q(v_m^l \mid x, v^{<l}) \,\|\, p(v_m^l)\bigr]}_{\text{(i)}} + \underbrace{\textstyle\sum_{l=1}^{L} \mathbb{E}_{q}\,\operatorname{KL}\bigl[q(z^l \mid x, v, z^{>l}) \,\|\, p(z^l \mid z^{>l})\bigr]}_{\text{(ii)}}, \tag{3}$
where we prescribe $v^{<1} = \varnothing$ and $z^{>L} = \varnothing$ for simplicity. The key idea here is that the overall KL divergence decomposes w.r.t. (i) the velocity fields $v$ and (ii) the common structure $z$. The former serves as a regularization ensuring diffeomorphisms, fulfilled by the constraint introduced in [5]; the latter estimates the structural similarity among the images, as detailed in the next subsection.
2.2 Intrinsic Similarity over Structural Representations
For registration, it is crucial to measure the structural similarity efficiently, which is achieved in this work by learning instead of by a pre-defined metric. To this end, we propose to learn multilevel “expert” distributions $q_m^l := q(z^l \mid x_m \circ \phi_m^l, z^{>l})$, which serve as $f_m^{-1}$ (i.e., the inverses of the imaging processes) to extract the structural representations of the warped images. We further assume the posterior $q(z^l \mid x, v, z^{>l})$ and the prior $p(z^l \mid z^{>l})$ to be the geometric mean [15] and the arithmetic mean [23] of the experts, respectively:
$q(z^l \mid x, v, z^{>l}) \propto \prod_{m=1}^{M} \bigl(q_m^l\bigr)^{1/M}, \qquad p(z^l \mid z^{>l}) = \frac{1}{M}\sum_{m=1}^{M} q_m^l. \tag{4}$
Therefore, the KL in Eq. 3(ii) essentially measures the intrinsic (dis)similarity, i.e., the dissimilarity among the experts (the intrinsic structural representations); its minimization, as part of maximizing the ELBO, encourages the experts to become identical, thus forcing the multilevel posteriors to represent the common structure. Meanwhile, the velocity fields are learned in tandem.
We model the experts as Gaussians, $q_m^l = \mathcal{N}(z^l; \mu_m^l, \Sigma_m^l)$, and thus so is the joint posterior, i.e. $q(z^l \mid x, v, z^{>l}) = \mathcal{N}(z^l; \mu^l, \Sigma^l)$ with
$\Sigma^l = \Bigl(\frac{1}{M}\sum_{m=1}^{M} (\Sigma_m^l)^{-1}\Bigr)^{-1}, \qquad \mu^l = \Sigma^l \cdot \frac{1}{M}\sum_{m=1}^{M} (\Sigma_m^l)^{-1} \mu_m^l. \tag{5}$
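Under the common assumption of diagonal covariances (not precluded by the derivation above), Eq. 5 reduces to a precision-weighted average. A minimal sketch, with `mu` and `logvar` stacked over the $M$ experts:

```python
import torch

def poe_fuse(mu, logvar):
    """Geometric-mean (product-of-experts) fusion of M diagonal Gaussians.
    mu, logvar: (M, B, C, H, W) per-expert means and log-variances.
    Returns the fused mean and log-variance of Eq. (5)."""
    precision = torch.exp(-logvar)          # (M, B, C, H, W)
    fused_prec = precision.mean(dim=0)      # (1/M) * sum of precisions
    fused_var = 1.0 / fused_prec
    fused_mu = fused_var * (precision * mu).mean(dim=0)
    return fused_mu, torch.log(fused_var)
```

Because the fusion is a mean over the experts, it is defined for any number of inputs $M$, which is precisely what allows BInGo to be trained on pairs and tested on much larger groups (Sect. 2.4).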
In light of the computational intractability of the KL divergence involving Gaussian mixture distributions, we further exploit its convexity to obtain
$\operatorname{KL}\Bigl[q(z^l \mid x, v, z^{>l}) \,\Big\|\, \frac{1}{M}\sum_{m=1}^{M} q_m^l\Bigr] \;\leq\; \frac{1}{M}\sum_{m=1}^{M} \operatorname{KL}\bigl[q(z^l \mid x, v, z^{>l}) \,\|\, q_m^l\bigr]. \tag{6}$
Hence, minimizing the KL in Eq. 3(ii) is approximated by minimizing the right-hand side of Eq. 6, where each summand has a closed-form expression.
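The right-hand side of Eq. 6 is a plain average of Gaussian-to-Gaussian KL divergences. A sketch under the same diagonal-covariance assumption as above:

```python
import torch

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal
    covariances, summed over all but the batch dimension."""
    var_q, var_p = torch.exp(logvar_q), torch.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.flatten(1).sum(dim=1)

def kl_upper_bound(mu_q, logvar_q, mu_experts, logvar_experts):
    """Convexity bound of Eq. (6): KL(q || (1/M) sum_m q_m) <=
    (1/M) sum_m KL(q || q_m). mu_experts, logvar_experts: (M, B, ...)."""
    kls = torch.stack([kl_diag_gauss(mu_q, logvar_q, mu_m, lv_m)
                       for mu_m, lv_m in zip(mu_experts, logvar_experts)])
    return kls.mean(dim=0)
```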
2.3 Explicit Disentanglement with Neural Networks
The ELBO has thus been decomposed into three terms: the reconstruction of the original images (the first term in Eq. 2), the KL divergence of the velocity fields, which ensures diffeomorphisms (Eq. 3(i)), and the KL divergence measuring the intrinsic (dis)similarity, which drives the registration (Eq. 3(ii)), balanced by weighting hyperparameters. To estimate the ELBO, we propose a dedicated hierarchical variational auto-encoder (VAE) as the inference model of Fig. 1(b), whose architecture with $L$ levels of hierarchy is depicted in Fig. 2. The network learns $q(v_m^l \mid x, v^{<l})$ and the experts $q_m^l$, and the expectations in the ELBO are estimated by Monte Carlo sampling, similar to traditional VAEs.
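Schematically, one Monte Carlo estimate of the (negative) ELBO combines the three terms as a weighted sum; `lambda_v` and `lambda_z` below stand in for the weighting hyperparameters, whose values are not prescribed here:

```python
def negative_elbo(recon_l1, kl_v, kl_z, lambda_v, lambda_z):
    """Training loss of BInGo (schematic): L1 reconstruction of the original
    images (first term of Eq. 2), velocity-field regularization (Eq. 3(i)),
    and the intrinsic dissimilarity of the experts (Eq. 3(ii), via Eq. 6)."""
    return recon_l1 + lambda_v * kl_v + lambda_z * kl_z
```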
Fig. 2. The network architecture of BInGo with $L$ levels of hierarchy, comprising encoders, multilevel registration (Reg) modules, and decoders.
The motif of BInGo is to explicitly disentangle the common structure, the spatial transformations, and the appearance information of multimodal images, thus realizing Eq. 1. To this end, the network comprises three types of cooperative modules: 1) encoders that extract multilevel transformation-equivariant structural representations, corresponding to the inverse imaging functions $f_m^{-1}$; 2) multilevel registration (Reg) modules that infer the spatial transformations from the individual structural representations, corresponding to estimating $\phi_m$ from $z_m$; and 3) decoders, with modality-specific appearance information embedded as parameters, which reconstruct the registered images from the learned common structure, corresponding to the imaging functions $f_m$.
Particularly, the Reg module at level $l$ assumes the coarser-scale deformations $\phi_m^{l-1}$ have been resolved, and thus takes the individual representations of the partially warped images as inputs; these are like “partially distorted” versions of the expert distributions, since the experts are representations of the fully registered images $x_m \circ \phi_m$. Thus, in the same spirit as Eq. 4, the Reg module computes their geometric mean as the “partially common” structure, and then concatenates this mean with each individual representation to infer $q(v_m^l \mid x, v^{<l})$ for each image separately.
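A possible skeleton of one Reg module is sketched below. For readability it stands in a plain average of feature maps for the geometric-mean fusion of Sect. 2.2, and the two-layer convolutional head `to_velocity` is illustrative, not the actual architecture:

```python
import torch
import torch.nn as nn

class RegModule(nn.Module):
    """Infer per-image velocity fields at one level from the warped
    individual representations (a sketch; channel sizes are illustrative)."""
    def __init__(self, channels):
        super().__init__()
        # Shared head mapping [individual || partially-common] features
        # to a 2D stationary velocity field.
        self.to_velocity = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, 2, 3, padding=1))

    def forward(self, feats):                  # feats: (M, B, C, H, W)
        common = feats.mean(dim=0)             # "partially common" structure
        return torch.stack([self.to_velocity(torch.cat([f, common], dim=1))
                            for f in feats])   # (M, B, 2, H, W)
```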
Therefore, BInGo can realize the aforementioned 3-step unsupervised scheme induced by Eq. 1, as follows:
I. Bottom-up Inference of Velocity Fields. The encoders first produce multilevel feature maps as the individual structural representations of the original images. Then, up from the bottom ($l$ increasing from 1 to $L$), given that the velocities $v_m^{<l}$ have been inferred (and hence $\phi_m^{l-1}$ computed), the feature maps are warped by $\phi_m^{l-1}$ to obtain the partially registered representations. The Reg module at level $l$ then infers $q(v_m^l \mid x, v^{<l})$, from which $v_m^l$ is sampled and $\phi_m^l$ is finally computed through integration.
II. Top-down Inference of Common Structures. Once the total deformations $\phi_m$ have been inferred, the encoders are fed again with the warped images $x_m \circ \phi_m$ to produce the experts in a top-down ($l$ decreasing from $L$ to 1) manner. The posterior is then computed by Eq. 4, from which the modality-invariant common structural representation $z^l$ is sampled. Note that the second encoder pass could be avoided by warping the feature maps of the original images to compute the experts (thanks to equivariance), but this incurs a higher computational cost.
III. Disentangled Auto-Encoding. Based on the common structure inferred by the encoders, the decoders reconstruct the registered images. Reconstructions of the original images are then obtained by applying the inverse deformations $\phi_m^{-1}$. We use the (negative) L1 loss between the original images and their reconstructions to estimate the first (reconstruction) term in Eq. 2. In addition, to better disentangle structure and appearance, the convolutional layers of the network are shared across modalities, while domain-specific batch normalization (BN) layers encode the appearance information of each modality [3].
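The domain-specific batch normalization of [3] keeps one BN layer per modality behind convolutions whose weights are shared; a minimal sketch:

```python
import torch.nn as nn

class DomainSpecificBN2d(nn.Module):
    """One BatchNorm2d per modality; the convolution weights elsewhere are
    shared, so only the normalization statistics and affine parameters
    encode modality-specific appearance."""
    def __init__(self, num_features, num_modalities):
        super().__init__()
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(num_features) for _ in range(num_modalities))

    def forward(self, x, modality):
        return self.bns[modality](x)
```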
2.4 Towards Large-Scale and Variable-Size Groupwise Registration
Most conventional methods optimize similarity measures defined over the entire image group, making the computation for large groups formidable. Besides, learning-based models that rely on these measures require all groups to have the same size. These drawbacks significantly limit their applicability.
BInGo, in contrast, is scalable: it handles multimodal image groups of various sizes for training and testing. As shown in Fig. 3, it can be trained with either bimodal image pairs or image groups of complete modalities, referred to as partial or complete learning, respectively. Trained either way, BInGo can be flexibly applied to larger, unseen test groups with arbitrary numbers of images, which substantially boosts the computational efficiency of training.
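Partial learning changes only how training groups are drawn. A hypothetical sampling routine (the argument `images_by_modality` and all names are placeholders, not BInGo's data pipeline):

```python
import random

def sample_training_group(images_by_modality, partial=True):
    """Draw one training group: a random bimodal pair under partial learning,
    or all M modalities under complete learning. `images_by_modality` maps
    modality name -> image tensor of one subject."""
    modalities = list(images_by_modality)
    if partial:
        modalities = random.sample(modalities, 2)
    return [images_by_modality[m] for m in modalities]
```

Because every group-dependent operation in BInGo (the expert fusion of Eq. 4 and the bound of Eq. 6) is an average over the $M$ inputs, the same trained weights apply unchanged to test groups of any size.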
Fig. 3. Complete and partial learning schemes of BInGo for variable-size groupwise registration.
The scalability of our model is beneficial in two scenarios:
Intrasubject Images with Missing Modalities. The goal is to co-register the multimodal scans of each patient. In practice, there can be many patients, each with different missing modalities. BInGo can be trained on such heterogeneous datasets via partial learning, while performing groupwise registration for test subjects with complete modalities.
Intersubject Populations. Intersubject groupwise registration, which is crucial for atlas construction and population analysis, can involve a multitude of images for every modality. Conventional methods for this problem rely on iterative optimization and suffer from computational complexity that scales with the group size, while BInGo greatly relieves the training burden and can subsequently be applied to test groups of complete populations in one shot.
3 Experiments and Results
3.1 Datasets and Preprocessing
BraTS-2021. The dataset provides 3D pre-operative T1, T1Gd, T2, and T2-FLAIR MR scans of patients with glioblastoma [17, 2]. We randomly selected 300/50/150 patient sets for training/validation/test. The volumes were downsampled and cropped to a fixed ROI. As the images of each patient are pre-registered, we used synthetic free-form deformations (FFDs) with different control point spacings to simulate misalignments.
MS-CMR. The MS-CMRSeg challenge [26] provides the cardiac MR sequences LGE, bSSFP, and T2 from 45 patients, which exhibit complementary information about cardiac structure and pathology. The images were preprocessed by affine co-registration, ROI cropping, and slice selection, producing 39/15/44 slices for training/validation/test. We simulated severer misalignments by applying additional FFDs to the original images, to better demonstrate model efficacy.
Learn2Reg. The Learn2Reg abdomen MR-CT dataset [8, 11] provides intersubject abdominal MR and CT scans, which we used for intersubject population groupwise registration.
3.2 Experimental Setups
Compared Methods. Three types of unsupervised methods for multimodal groupwise registration were compared on the intrasubject BraTS and MS-CMR datasets: 1) similarity-based iterative methods using the information-theoretic metrics CTE, APE, or X-CoReg [22, 25, 16]; 2) similarity-based deep-learning models that optimize CTE or APE using an attention residual U-Net (AttResUNet) [9, 19] as the backbone; and 3) BInGo, the proposed model learning intrinsic similarity through hierarchical disentanglement. For intersubject population groupwise registration on Learn2Reg, the models in 2) are not applicable, since they can only be trained with complete image groups, whereas BInGo can be trained partially. In addition, for BraTS and MS-CMR, all baseline methods used single-level velocity fields as the transformation model; for Learn2Reg, the iterative baselines performed rigid, affine, and FFD registration successively for optimal accuracy, owing to the severe initial misalignments.
Implementation Details. Training used the Adam optimizer [12], with the learning rate and batch size chosen per dataset (one setting for MS-CMR, another for the other datasets). The experiments were conducted with PyTorch [21] on an NVIDIA 3090 GPU.
Evaluation Metrics. We reported the Dice similarity coefficient (DSC) averaged over all pairwise combinations of the segmentation masks for registered images. For the BraTS dataset, as the ground-truth misalignments were available, we also reported the groupwise warping index (gWI) [16], which would reduce to zero if the misaligned images were perfectly co-registered.
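For reference, the reported groupwise DSC is the plain average of Dice scores over all pairwise combinations of masks; a sketch assuming binary masks (the handling of empty masks is our choice):

```python
import itertools
import torch

def groupwise_dice(masks):
    """Average Dice similarity coefficient over all pairwise combinations of
    the binary segmentation masks of the M registered images."""
    scores = []
    for a, b in itertools.combinations(masks, 2):
        inter = (a & b).sum().item()
        denom = a.sum().item() + b.sum().item()
        scores.append(2.0 * inter / denom if denom > 0 else 1.0)
    return sum(scores) / len(scores)
```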
3.3 Results
Table 1. Registration accuracy and number of parameters (#Params, in millions) on BraTS and MS-CMR. "None" denotes the initial misalignment; †: complete learning; ∗: partial learning.

| Method | BraTS DSC | BraTS gWI | BraTS #Params (M) | MS-CMR DSC | MS-CMR #Params (M) |
|---|---|---|---|---|---|
| None | | | — | | — |
| APE [25] | | | 7.373 | | 0.154 |
| CTE [22] | | | 7.373 | | 0.154 |
| X-CoReg [16] | | | 7.373 | | 0.154 |
| APE+AttResUNet | | | 22.955 | | 8.036 |
| CTE+AttResUNet | | | 22.955 | | 8.036 |
| BInGo (complete†) | | | 13.429 | | 4.516 |
| BInGo (partial∗) | | | 13.429 | | 4.516 |
Table 2. DSC on the Learn2Reg abdomen MR-CT test populations for different numbers of merged MR-CT pairs $N$ (each group contains $2N$ images). N/A: failed due to excessive GPU cost.

| Method | $N=2$ | $N=4$ | $N=8$ | $N=16$ |
|---|---|---|---|---|
| None | | | | |
| APE [25] | | | N/A | N/A |
| CTE [22] | | | N/A | N/A |
| X-CoReg [16] | | | N/A | N/A |
| BInGo (partial∗) | | | | |
Fig. 4. Scalability tests: accuracy of BInGo on merged test groups of increasing size from MS-CMR and BraTS.
Fig. 5. Qualitative comparison of the best baselines and BInGo, with predicted deformations and feature maps.
Intrasubject Image Groups. Table 1 presents the registration accuracy on the BraTS and MS-CMR datasets. BInGo (with complete† or partial∗ learning) worked consistently better than the similarity-based iterative or learning approaches, except for the APE-based iterative method on BraTS. Notably, BInGo achieved superior performance to the deep-learning baselines with only about half of their trainable parameters, even though their backbone AttResUNet is more advanced than the vanilla U-Net-like structure used in BInGo. Furthermore, the proposed partial learning strategy performed better than most baselines with significantly lower computational demands, which makes large-scale end-to-end groupwise registration practical.
Intersubject Population Image Groups. For Learn2Reg, we merged every $N$ test MR-CT pairs to form larger groups (each containing $2N$ images to co-register) as test sets of intersubject populations. Table 2 presents the results for different $N$. All iterative approaches failed when the groups became large, due to excessive GPU cost, while our partial learning strategy worked consistently better and maintained good performance for large groups. Note that the deep-learning baselines cannot handle test groups of varying sizes and are thus not presented.
Additional Scalability Tests. For MS-CMR and BraTS, we merged test groups in a similar way to evaluate BInGo (trained via complete learning) on image groups of different sizes. As shown in Fig. 4, as the group size increases, co-registration becomes significantly more difficult (indicated by the worse initial DSC/gWI), whereas BInGo maintained decent performance for ultra-large groups (e.g., over 1300 images from MS-CMR) and achieved even better accuracy for larger 3D image groups from BraTS, showing remarkable robustness in large-scale groupwise registration.
Qualitative Results. We visualize the results of the best baselines (APE, CTE+AttResUNet, and X-CoReg for BraTS, MS-CMR, and Learn2Reg, respectively) and BInGo in Fig. 5. BInGo achieved better alignment of both large-scale anatomy and local fine structures in most cases, and the predicted deformations were smooth and diffeomorphic. The feature maps from BInGo were nearly modality-invariant and shared similar structures, illustrating successful disentanglement of the common structure, spatial transformations, and appearance information. The relatively poor performance in certain local areas may be due to overly severe initial misalignment, which places different tissues with similar intensities at the same location. Still, BInGo performed no worse than the baselines in such regions and achieved satisfactory accuracy over the entire foreground.
4 Conclusion and Discussion
In this work we have presented BInGo, a generative Bayesian framework for unsupervised multimodal groupwise registration. This new formulation of image registration achieves competitive performance compared with similarity-based iterative and unsupervised learning methods. In particular, we demonstrated that, equipped with its unique scalability, BInGo can reduce the computational burden of groupwise registration without compromising accuracy. This opens up the possibility of realizing learning-based multimodal groupwise registration at large scale and with various group sizes. A potential limitation of our work is that performance may drop on specific datasets, e.g., the highly challenging abdominal images. Future work includes investigating the negative factors that may inhibit BInGo from generalizing to large image groups.
References
- [1] Ashburner, J.: A fast diffeomorphic image registration algorithm. Neuroimage 38(1), 95–113 (2007)
- [2] Baid, U., Ghodasara, S., Mohan, S., Bilello, M., Calabrese, E., Colak, E., Farahani, K., Kalpathy-Cramer, J., Kitamura, F.C., Pati, S., et al.: The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314 (2021)
- [3] Chang, W.G., You, T., Seo, S., Kwak, S., Han, B.: Domain-specific batch normalization for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7354–7362 (2019)
- [4] Che, T., Zheng, Y., Sui, X., Jiang, Y., Cong, J., Jiao, W., Zhao, B.: Dgr-net: Deep groupwise registration of multispectral images. In: International Conference on Information Processing in Medical Imaging. pp. 706–717. Springer (2019)
- [5] Dalca, A.V., Balakrishnan, G., Guttag, J., Sabuncu, M.R.: Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Medical image analysis 57, 226–236 (2019)
- [6] Geng, X., Christensen, G.E., Gu, H., Ross, T.J., Yang, Y.: Implicit reference-based group-wise image registration and its application to structural and functional MRI. Neuroimage 47(4), 1341–1351 (Oct 2009)
- [7] He, Z., Chung, A.C.: Unsupervised end-to-end groupwise registration framework without generating templates. In: 2020 IEEE International Conference on Image Processing (ICIP). pp. 375–379. IEEE (2020)
- [8] Hering, A., Hansen, L., Mok, T.C.W., Chung, A.C.S., Siebert, H., Häger, S., Lange, A., Kuckertz, S., Heldmann, S., Shao, W., et al.: Learn2reg: comprehensive multi-task medical image registration challenge, dataset and evaluation in the era of deep learning. IEEE Transactions on Medical Imaging (2022)
- [9] Hu, Y., Modat, M., Gibson, E., Li, W., Ghavami, N., Bonmati, E., Wang, G., Bandula, S., Moore, C.M., Emberton, M., et al.: Weakly-supervised convolutional neural networks for multimodal image registration. Medical image analysis 49, 1–13 (2018)
- [10] Joshi, S.C., Davis, B.C., Jomier, M., Gerig, G.: Unbiased diffeomorphic atlas construction for computational anatomy. NeuroImage 23, S151–S160 (2004)
- [11] Kavur, A.E., Selver, M.A., Dicle, O., Barış, M., Gezer, N.S.: CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge Data (Apr 2019). https://doi.org/10.5281/zenodo.3362844
- [12] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations (2015)
- [13] Learned-Miller, E.G.: Data driven image models through continuous joint alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(2), 236–250 (2005)
- [14] Liao, S., Jia, H., Wu, G., Shen, D.: A novel framework for longitudinal atlas construction with groupwise registration of subject image sequences. Neuroimage 59(2), 1275–1289 (Jan 2012)
- [15] Lorenzen, P., Prastawa, M., Davis, B., Gerig, G., Bullitt, E., Joshi, S.: Multi-modal image set registration and atlas formation. Medical image analysis 10(3), 440–451 (2006)
- [16] Luo, X., Zhuang, X.: X-metric: An n-dimensional information-theoretic framework for groupwise registration and deep combined computing. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
- [17] Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al.: The multimodal brain tumor image segmentation benchmark (BraTS). IEEE Transactions on Medical Imaging 34(10), 1993–2024 (2014)
- [18] Metz, C.T., Klein, S., Schaap, M., van Walsum, T., Niessen, W.J.: Nonrigid registration of dynamic medical imaging data using nd+ t b-splines and a groupwise optimization approach. Medical image analysis 15(2), 238–249 (2011)
- [19] Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., Glocker, B., Rueckert, D.: Attention u-net: Learning where to look for the pancreas. In: Medical Imaging with Deep Learning (2018)
- [20] Orchard, J., Mann, R.: Registering a multisensor ensemble of images. IEEE Transactions on Image Processing 19(5), 1236–1247 (2009)
- [21] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
- [22] Polfliet, M., Klein, S., Huizinga, W., Paulides, M.M., Niessen, W.J., Vandemeulebroucke, J.: Intrasubject multimodal groupwise registration with the conditional template entropy. Medical image analysis 46, 15–25 (2018)
- [23] Shi, Y., Paige, B., Torr, P., et al.: Variational mixture-of-experts autoencoders for multi-modal deep generative models. Advances in Neural Information Processing Systems 32, 15718–15729 (2019)
- [24] Vahdat, A., Kautz, J.: NVAE: A deep hierarchical variational autoencoder. arXiv preprint arXiv:2007.03898 (2020)
- [25] Wachinger, C., Navab, N.: Simultaneous registration of multiple images: Similarity metrics and efficient optimization. IEEE transactions on pattern analysis and machine intelligence 35(5), 1221–1233 (2012)
- [26] Zhuang, X., Xu, J., Luo, X., Chen, C., Ouyang, C., Rueckert, D., Campello, V.M., Lekadir, K., Vesal, S., RaviKumar, N., et al.: Cardiac segmentation on late gadolinium enhancement mri: a benchmark study from multi-sequence cardiac mr segmentation challenge. Medical Image Analysis 81, 102528 (2022)