
Improving Convolutional Neural Networks for Cosmological Fields with Random Permutation

Kunhao Zhong (kunhaoz@sas.upenn.edu), Marco Gatti, and Bhuvnesh Jain
Department of Physics and Astronomy, University of Pennsylvania, Philadelphia, PA 19104, USA
Abstract

Convolutional Neural Networks (CNNs) have recently been applied to cosmological fields – weak lensing mass maps and galaxy maps. However, cosmological maps differ in several ways from the vast majority of images that CNNs have been tested on: they are stochastic, typically low signal-to-noise per pixel, and with correlations on all scales. Further, the cosmology goal is a regression problem aimed at inferring posteriors on parameters that must be unbiased. We explore simple CNN architectures and present a novel approach of regularization and data augmentation to improve their performance for lensing mass maps. We find robust improvement by using a mixture of pooling and shuffling of the pixels in the deep layers. The random permutation regularizes the network in the low signal-to-noise regime and effectively augments the existing data. We use simulation-based inference (SBI) to show that the model outperforms CNN designs in the literature. We find a 30% improvement in the constraints on the $S_8$ parameter for simulated Stage-III surveys, including systematic uncertainties such as intrinsic alignments. We explore various statistical errors corresponding to next-generation surveys and find comparable improvements. We expect that our approach will have applications to other cosmological fields as well, such as galaxy maps or 21-cm maps.

I Introduction

The Large Scale Structure (LSS) of the universe contains crucial information about its late-time growth history and fundamental physics. Over the past few decades, a diverse array of probes has been employed to study these structures, each contributing significantly to our understanding of the universe. Prominent among these are the Stage-III LSS surveys, which include the Dark Energy Survey (DES) Dark Energy Survey Collaboration et al. (2016); Troxel et al. (2018); Amon et al. (2022), the Kilo-Degree Survey (KiDS) Kuijken et al. (2015); Heymans et al. (2021); Asgari et al. (2021), the Hyper Suprime-Cam Subaru Strategic Program (HSC) Dalal et al. (2023); Li et al. (2023); Sugiyama et al. (2023), and the Baryon Oscillation Spectroscopic Survey (BOSS and eBOSS) Smee et al. (2013); Zhao et al. (2022); Alam et al. (2021).

The upcoming Stage-IV LSS surveys, including Rubin Observatory’s LSST Ivezić et al. (2019), the Roman Space Telescope Dore et al. (2019), and the recently launched Euclid mission Laureijs et al. (2011), in combination with the Dark Energy Spectroscopic Instrument Levi et al. (2019), are already gathering data or will begin soon. All these efforts will shed light on two of the most mysterious problems in physics: dark matter and dark energy.

One of the most important analytical tools used in these studies is the two-point correlation function. This function links observational measurements with theoretical predictions derived from perturbation theory. However, the two-point correlation function only captures the Gaussian information.

A variety of non-Gaussian statistics have thus been proposed to study cosmological fields. These include the three-point function and bispectrum Takada and Jain (2003); Halder et al. (2023); Takada and Jain (2004), peaks and voids Jain and Van Waerbeke (2000); Marian et al. (2009); Martinet et al. (2018); Harnois-Déraps et al. (2021); Zürcher et al. (2022a); Marques et al. (2023), Minkowski functionals Munshi et al. (2012); Kratochvil et al. (2012); Grewal et al. (2022), Betti numbers and persistent homology Feldbrugge et al. (2019); Parroni et al. (2021); Heydenreich et al. (2021, 2022), k-NN and CDF Banerjee and Abel (2023); Anbajagane et al. (2023), and wavelet transform based statistics Cheng et al. (2020); Cheng and Ménard (2021); Valogiannis and Dvorkin (2022); Hothi et al. (2023); Pedersen et al. (2022); Gatti et al. (2023a). See Ref. Euclid Collaboration et al. (2023) for a forecast with the settings of the Euclid Mission.

Most recently, machine learning has been used to extract cosmological information at the field level, particularly with Convolutional Neural Networks (CNN) Fluri et al. (2018); Ribli et al. (2019a); Gupta et al. (2018); Fluri et al. (2019); Ribli et al. (2019b); Matilla et al. (2020); Jeffrey et al. (2021a); Lu et al. (2022, 2023); Hortúa et al. (2020); Akhmetzhanova et al. (2024), Graph Neural Networks (GNN) Perraudin et al. (2019); Fluri et al. (2022); Makinen et al. (2022); Mishra-Sharma (2022), and generative models Dai and Seljak (2022); Ullmo et al. (2021); Yiu et al. (2022); Villaescusa-Navarro et al. (2021a); Dai and Seljak (2023).

Neural networks generally surpass traditional Gaussian statistics, such as the power spectrum, in constraining cosmological parameters. However, the extent of the improvement can vary significantly depending on the realism of the simulations and the scales considered. When compared to other non-Gaussian statistics, the situation is more complicated. For instance, several studies Fluri et al. (2018); Gupta et al. (2018) have suggested that CNNs outperform peak counts methods, whereas others have pointed out that their enhancement over Minkowski functionals is only moderate Matilla et al. (2020). Moreover, there are scenarios where CNNs appear not to perform as effectively as other methods, such as the scattering wavelet transform Cheng et al. (2020). These observations suggest that while neural networks offer substantial advancements in many areas, their efficacy can vary for cosmological fields. In other words, even though a CNN often contains thousands of free parameters to approximate the underlying function, directly applying the commonly used classification-oriented CNN designs may not be suitable in cosmology. Such insights are crucial for guiding future research and methodology choices in the field of cosmology, particularly in the integration and application of machine learning techniques.

In this paper, we explore modifications to the architecture of CNNs to enhance their performance in cosmological tasks. Cosmological maps such as weak lensing mass maps differ from the images used to develop CNNs in that they are stochastic and low signal-to-noise. Moreover, in cosmological inference we use them for regression, not classification as is typically done in industry. So we explore a variety of regularization and data augmentation schemes in applying CNNs to lensing maps. Our preliminary findings suggest that employing a combination of maximum and average pooling improves the network’s ability to extract relevant information. We also introduce a novel regularization scheme rooted in our understanding of cosmological fields: the incorporation of random permutation layers designed to effectively augment the training data by sacrificing the information in large-scale correlations. We demonstrate the effectiveness of this straightforward technique for statistical noise levels expected in current and upcoming weak lensing surveys. We also include a number of sources of systematic uncertainty.

Additionally, we conduct comparative analyses with conventional regularization methods and alternative strategies that disrupt large-scale correlation. Our initial results indicate that the permutation operation directly applied to the latent space of the neural network consistently boosts accuracy in both the simplified and realistic simulation cases considered in this work. This suggests that by selectively filtering out large-scale information in the deeper layers, the neural network exhibits enhanced performance in extracting information from the more critical scales. We anticipate that this simple regularization technique not only holds promise for improving analyses of weak lensing fields but may also be useful for other stochastic fields, such as galaxy maps or 21-cm maps. We do not assert that our architecture represents the optimal model, as we have not carried out a systematic study for all possible use cases.

The structure of this paper is as follows: In Sec. II we describe the simulations we used for this work, the preprocessing of the data, and the simulation-based inference setup. In Sec. III we discuss how the CNN for cosmological fields should be different from the standard design for image classification. In Sec. IV we present the random permutation layer as a regularization scheme that helps model generalization. We summarize the findings and outlook for future work in Sec. V and we perform extensive tests comparing to other models in the Appendices.

II Methodology

II.1 Simulation of Weak Lensing Convergence Field

For this work, we use weak lensing mass maps (i.e., maps of the weak lensing convergence field) created from the DarkGridV1 N-body simulation suite (Zürcher et al., 2021, 2022b). The DarkGridV1 suite consists of $\Lambda$CDM-only simulations that vary two parameters, $\Omega_{\rm m}$ and $\sigma_8$, exploring 58 different cosmologies (the samples are chosen to follow lines of approximately constant $S_8$; see Fig. 2 of Ref. Zürcher et al. (2021) for the grid spanning the $\Omega_{\rm m}$–$\sigma_8$ plane). Each cosmology is represented by five independent full-sky simulations. The simulations have been produced using the PKDGRAV3 code (Potter et al., 2017); the code produces particle number counts at different redshifts (100 redshift shells from $z=49$ to $z=0$), provided as HEALPIX (Górski et al., 2005) maps. For this work, we downsample the resolution of the original maps to NSIDE = 512, corresponding to a pixel resolution $\approx 6.9$ arcmin. To create weak lensing mass maps, we follow a procedure similar to that in Gatti et al. (2024, 2023b). First, noiseless convergence maps are produced from the particle counts for each redshift shell assuming the Born approximation (Fosalba et al., 2015). Noiseless shear maps, for each redshift shell, are obtained from the convergence maps using a generalisation of the Kaiser-Squires algorithm (Jeffrey et al., 2021b; Kaiser and Squires, 1993). Then, we consider two cases:

simulation case 1 - no observational systematics or tomography. In case 1, we simplify the map-making procedure, and we choose not to include observational systematics in the modeling. We first generate integrated DES-Y3-like shear maps by weighting the shear maps as a function of redshift by the DES-Y3 redshift distributions Myles et al. (2021). To simplify and reduce the number of maps generated, we focus on just one of the four DES redshift bins, specifically the third bin. Next, we introduce Gaussian noise to the shear maps. This noise is assumed to have a per-pixel value of $\sigma=\sigma_e/\sqrt{N_{\rm eff}}$, where $\sigma_e=0.26$ represents the shape noise, and $N_{\rm eff}=5.6\times A_{\rm pixel}$ is the effective number of galaxies per pixel. This is equivalent to adding $\sigma=\sigma_e/\sqrt{2N_{\rm eff}}$ to the mass map as used in Ref. Ribli et al. (2019a). Note that here we have used the number density 5.6 of the full DES-Y3 shear sample, not just that of the third bin. This approach is intended to yield cosmological constraints similar to the entire DES-Y3 survey, even though we are using only one tomographic redshift bin. After adding the noise to the full-sky convergence map, we cut 12 non-overlapping square patches; each patch is $512\times512$ pixels (see Fig. 1), and is chosen to be centered on the center of a HEALPIX pixel with a resolution NSIDE = 1. For the 285 full-sky simulations we generated, we have $285\times12=3420$ patches in total.

simulation case 2 - full map-making procedure. In case 2, we produce weak lensing maps using the full procedure implemented in Gatti et al. (2024, 2023b). The procedure produces four maps, one for each of the four DES tomographic bins. Moreover, a DES footprint (fixed mask for bright galaxies, see Fig. 1) is applied. Each map also includes observational and astrophysical systematics, namely redshift uncertainties, shear biases, intrinsic alignment (in the form of the non-linear alignment (NLA) model), and source clustering. These are managed using various nuisance parameters. Practically, for each map, we forward-model these effects by randomly selecting the nuisance parameters that control them from their respective priors – see Gatti et al. (2023b) for details. In this case, we only train using a single patch with the same galaxy mask, approximately centered at the center of the DES footprint (see Fig. 1). For each full-sky simulation with the same cosmology, we shift and rotate the DES footprint four times to generate 4 non-overlapping maps. For the 2280 full-sky simulations we generated, we thus have $2280\times4=9120$ patches in total.

In both cases, the training set consists of images of size $512\times512$ with a resolution of 6.9 arcmin, which corresponds to $3367\ \mathrm{deg}^2$. In case 1 images have one channel, while in case 2 images have four channels, one for each tomographic bin. These maps are significantly larger than in previous works Fluri et al. (2018); Ribli et al. (2019a); Gupta et al. (2018); Jeffrey et al. (2020); Lu et al. (2022); Matilla et al. (2020); Lu et al. (2023), allowing us to use one patch to represent the survey area. Using a larger patch also has the benefit of keeping more modes. We tested the 6-layer CNN model in comparison to cutting smaller patches of size $256\times256$ or $128\times128$. Although the total training set is larger with smaller patches, we observed a lower accuracy of the CNN, especially in the case of galaxy masks. Therefore, we use the large $512\times512$ image in this work. Note that the large sky area of a single projection breaks the flat-sky approximation. This in theory would not produce a biased result since it is part of the data compression for simulation-based inference as discussed in Sec. II.3. However, in the future we will extend the work to the curved sky, e.g. with structures as presented in Perraudin et al. (2019).
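As a minimal illustration of the case-1 noise prescription described above (array names and the random seed are hypothetical; the numbers follow the values quoted in the text):

import numpy as np

# Assumed case-1 noise parameters from the text
sigma_e = 0.26                        # per-component shape noise
n_eff = 5.6                           # effective galaxies per arcmin^2 (full DES-Y3 sample)
pix_arcmin = 6.9                      # pixel side in arcmin
N_eff = n_eff * pix_arcmin**2         # effective galaxies per pixel

sigma_pix = sigma_e / np.sqrt(N_eff)  # per-pixel noise on each shear component

rng = np.random.default_rng(42)
shear_patch = np.zeros((2, 512, 512))  # hypothetical noiseless (gamma1, gamma2) patch
noisy_patch = shear_patch + rng.normal(0.0, sigma_pix, size=shear_patch.shape)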

Figure 1: Upper Panel: Example patch of the weak lensing mass maps for simulation case 1, in which no mask is applied and only one tomographic bin is used. The side length is 58 degrees; the axes are labeled in pixel units. Lower Panel: Example patch for simulation case 2, which includes a mask that follows the DES footprint. The figure shows the third of the four tomographic bins.

II.2 Data Augmentation and Training

Prior to training, we add random Gaussian noise to each pixel as a data augmentation technique, particularly effective for noisy images Shorten and Khoshgoftaar (2019). In our methodology, we apply Gaussian noise with a mean equal to the image’s mean value and a standard deviation equal to 10% of the original image’s standard deviation. In this work, we add this to all images (training, validation, and test). The noise level remains almost unchanged, since it increases only by a factor $\sqrt{1+0.1^2}\approx 1.005$. Note that since the additional noise is generated according to each image and each channel, it is different from the global shape noise, whose variance is fixed for all maps. While Ref. Ribli et al. (2019a) demonstrates the efficacy of augmenting data through random rotation and flipping, we observe no significant difference with these methods. The difference may be attributed to our approach of extracting patches from non-overlapping regions of the full sky. Although our data augmentation strategy differs from previous studies, we emphasize its importance as it improves the test error by $\sim 25\%$. This approach is also applied to the entire masked map, ensuring that the augmented map assigns non-zero values to the masked regions. This procedure could be optimized in future work, for example with a dynamic approach in which different noise is added during training and the noise amplitude decays as the model is trained for more epochs.
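A minimal sketch of this augmentation step (the function name is ours; per-image, per-channel statistics as described above):

import torch

def augment_with_noise(maps, frac=0.1):
    # maps: (batch, channels, H, W). Add Gaussian noise whose mean equals each
    # map's mean and whose standard deviation is `frac` of each map's standard
    # deviation, computed per image and per channel.
    mean = maps.mean(dim=(-2, -1), keepdim=True)
    std = maps.std(dim=(-2, -1), keepdim=True)
    noise = mean + frac * std * torch.randn_like(maps)
    return maps + noise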

We train separate CNNs for inferring $S_8$ and $\Omega_{\rm m}$, as the training time is short (15 mins for simulation case 1, and 1 hour for simulation case 2 on one A100 GPU). This approach also mitigates the issue of degeneracy when inferring two parameters simultaneously in a single network. In this work, we chose the Mean Square Error (MSE) as the loss function, and we report both the Root Mean Square Error (RMSE) and the coefficient of determination (R-squared) in the test results, which provide complementary measures of the goodness of fit. Although MSE is widely adopted, alternative loss functions can significantly influence results. For instance, had we noticed a tendency for biased predictions at extreme input values, adopting a $\log$-based loss could have been beneficial to more heavily penalize such discrepancies. Since we did not observe such biases in the predictions, we kept MSE in this work. The data set is randomly divided into 80% for training, 10% for validation, and 10% for testing. The last 10% test set is then used for the simulation-based inference.

We note that in Ribli et al. (2019a) a more careful division method based on the initial conditions of the simulations is utilized. Since our simulation suite only has 58 different cosmologies, we expect this not to be an issue, as the validation set effectively covers the full sampling region. Due to computational resource limits, we did not repeat every test with averaging over different random seeds. The random seeds in training affect the weight initialization and also the training-validation split, and thus affect the final results. However, we tested a few cases with different random seeds and found the variance of the RMSE to be as small as $\Delta\mathrm{RMSE}\sim 0.01\%$, partially because we trained for a sufficiently large number of epochs.

The CNN is implemented using Pytorch Paszke et al. (2019) and trained using the AdamW Loshchilov and Hutter (2017) optimizer. We use a batch size of 16 and a total of 200 epochs. A learning rate scheduler, ReduceLROnPlateau, is used with a reduction factor of 0.3 (the default is 0.1). For all other hyperparameters not explicitly mentioned, we use Pytorch-v1.13’s default settings. We keep the training hyperparameters the same for all tests in this paper. We observe effective convergence with this substantial number of epochs. However, it is noteworthy that Quasi-Newtonian optimization methods, such as L-BFGS Andrew and Gao (2007) or the Levenberg–Marquardt algorithm Levenberg (1944), are theoretically more suitable for parameter inference where accuracy is more important. Yet, their application often demands substantial memory due to the Jacobian computation. An effective Quasi-Newtonian optimizer for precision training is an interesting direction for future research.
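A minimal sketch of this training configuration (the model and data loaders are placeholders; the settings follow the values quoted in the text):

import torch
from torch import nn, optim

model = ...            # the CNN of Fig. 2 (placeholder)
criterion = nn.MSELoss()
optimizer = optim.AdamW(model.parameters())   # otherwise default AdamW settings
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.3)

for epoch in range(200):                      # 200 epochs; batch size 16 in the loaders
    model.train()
    for maps, target in train_loader:         # placeholder DataLoader
        optimizer.zero_grad()
        loss = criterion(model(maps), target)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(m), t).item() for m, t in val_loader)
    scheduler.step(val_loss)                  # reduce LR when validation loss plateaus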

II.3 Simulation-based Inference

Figure 2: The 6-layer CNN design used in our study, including the optional random permutation blocks. The network is designed similarly to a VGG-net Simonyan and Zisserman (2014). In this work, we use $n_{\rm ch}=8$ channels in the first layer, with $n_{\rm ch}$ increasing by a factor of 2 for successive deeper layers. The three "Options" shown involve replacing the deeper layers with a random permutation layer followed by convolution and activation at different depths as indicated (note that the arrows do not indicate skip connections). In Option 1, one extra random permutation layer is added without any change to the 6-layer CNN design. For Option 2 and Option 3, the random permutation is performed at earlier layers. Note that we also mix average and maximum pooling layers: see Sec. IV for details. The blocks shown in this figure are not to scale.

We do not use neural networks to directly infer posteriors, since that usually requires assuming a Gaussian likelihood. Instead, we use the Neural Density Estimator (NDE) package PyDelfi introduced in Charnock et al. (2018). Our CNN thus serves as a field-level data compressor, similar to the DeepCompressor in Jeffrey et al. (2021a). PyDelfi provides different flow models for Neural Density Estimation. In this paper, we use Masked Autoregressive Flows (MAF) Papamakarios et al. (2017), which is a stacked version of Masked Autoencoders for Distribution Estimation (MADEs) Germain et al. (2015) with normalizing flow. The network of the NDE is trained to minimize the Kullback-Leibler (KL) divergence Kullback and Leibler (1951) between the true probability distribution and the proposed probability distribution. The true likelihood is of course unknown, but it reduces to a constant in calculating the expectation value of the KL divergence over the implicit prior (the prior set by the distribution of the point cloud of parameter values used for the training simulations). For details, see Charnock et al. (2018); Jeffrey et al. (2021a); Cranmer et al. (2020).

In this work, we stack 2 MAFs with 2 and 3 hidden layers of width 50. Since one patch corresponds to approximately 3367 square degrees of the sky, we infer the posterior using only one image. Note that the simulations used for NDE training are the test set that is not used for CNN training or validation. After the iso-likelihood surface is learned, we use the affine-invariant MCMC sampler emcee Foreman-Mackey et al. (2013) to get the final posterior with a flat prior of $S_8\in[0.45,1.1]$ and $\Omega_{\rm m}\in[0.1,0.5]$. For the purpose of forecasting, we average over 9 predicted values of the same underlying cosmology as the target vector (in our case $S_8$ and $\Omega_{\rm m}$). This way the posterior is more centered and minimizes the effect of the prior, which allows us to make a more direct comparison of the constraining power.
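As a sketch of the final sampling step (the learned NDE log-density nde_log_prob and the compressed data vector are hypothetical placeholders; the prior ranges follow the text):

import numpy as np
import emcee

def log_posterior(theta, compressed_data):
    # theta = (S8, Omega_m); flat prior from the text, then the NDE log-density
    # conditioned on the CNN-compressed data (hypothetical function).
    s8, om = theta
    if not (0.45 < s8 < 1.1 and 0.1 < om < 0.5):
        return -np.inf
    return nde_log_prob(theta, compressed_data)

ndim, nwalkers = 2, 32
p0 = np.column_stack([np.random.uniform(0.7, 0.9, nwalkers),
                      np.random.uniform(0.2, 0.4, nwalkers)])
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior,
                                args=(compressed_data,))  # compressed_data: hypothetical CNN summary
sampler.run_mcmc(p0, 5000, progress=True)
samples = sampler.get_chain(discard=1000, flat=True)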

We follow previous work on the validation of SBI via the empirical coverage test Guo et al. (2017); Hermans et al. (2021); Lemos et al. (2023), which plots the expected coverage probability (usually from the highest posterior density) vs. the credibility level. For a calibrated posterior with enough samples in the coverage test, one should find that the true parameter is contained within the $X\%$ credible region in $X\%$ of the tests. This plot, often called the p-p plot, is not unique to SBI, as it is used for comparing two probability distributions in general. In this work, we use TARP (Test of Accuracy with Random Points) Lemos et al. (2023), which is a more computationally efficient way of performing the coverage test. It is important to note that the coverage test aims to verify that the posterior learned from the NDE is neither overconstrained nor underconstrained. A failed result would suggest that the NDE network is not suitable for the given problem, e.g. due to overfitting. As shown in Sec. IV, the results we obtain are well calibrated.
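A schematic of the idea behind such a coverage test (this is not the TARP implementation; it simply counts, for a 1D parameter and central credible intervals, how often the truth falls inside the $X\%$ region):

import numpy as np

def empirical_coverage(posterior_samples, true_values, levels):
    # posterior_samples: (n_tests, n_samples) posterior draws for one parameter;
    # true_values: (n_tests,) true parameter values.
    coverage = []
    for x in levels:
        lo = np.percentile(posterior_samples, 50 * (1 - x), axis=1)
        hi = np.percentile(posterior_samples, 50 * (1 + x), axis=1)
        coverage.append(np.mean((true_values >= lo) & (true_values <= hi)))
    return np.array(coverage)

levels = np.linspace(0.05, 0.95, 19)
# A calibrated posterior gives coverage close to `levels` (the diagonal of the p-p plot).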

SBI, also called likelihood-free inference, is an emerging field and has been applied to cosmology in several recent studies Cranmer et al. (2020); Lueckmann et al. (2021); Hahn et al. (2023); Lemos et al. (2023). The results for the inferred likelihood have been tested for summary statistics that have an explicit form of the likelihood such as the power spectrum Jeffrey et al. (2021a); Gatti et al. (2023a); Lemos et al. (2023). However, a robust convergence criterion and calibration of misspecification are missing. Various other robustness tests have been proposed to check that the results are not biased or over-estimated Linhart et al. (2022); Jia (2024). Since the main topic of this paper is to improve the CNN for data compression, we defer a detailed study of SBI in the presence of model misspecification to future work.

III CNN for Cosmological Fields

import torch
import torch.nn as nn

class ShuffledCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Initialize CNN layers here (conv1-conv6, pooling, MLP, ...)

    def randomize_images(self, tensor, image_size):
        # Flatten the spatial dimensions, keeping one row per (batch, channel) slice
        x_flat = tensor.view(-1, image_size * image_size)
        # Draw an independent pixel permutation for every row
        idx = torch.stack([torch.randperm(image_size * image_size)
                           for _ in range(x_flat.size(0))])
        if tensor.is_cuda:
            idx = idx.cuda()
        # Apply the permutations and restore the (batch, channel, H, W) shape
        x_randomized_flat = torch.gather(x_flat, 1, idx)
        return x_randomized_flat.view(tensor.shape[0], tensor.shape[1],
                                      image_size, image_size)

    def forward(self, x):
        # Implement the initial forward pass here
        # Randomly permute the feature maps at the corresponding position
        if self.training:  # This is optional
            x = self.randomize_images(x, x.shape[-1])
        x = self.conv6(x)
        x = self.LeakyReLU(x)
        x = self.pool6(x)
        # Implement the rest of the forward pass here
        return x
Algorithm 1: Pytorch pseudo-code of CNN with random permutation

Unlike the prevalent use of CNNs in image recognition tasks, their application in cosmology typically addresses regression problems rather than image classification. In this context, the precision of the output is most important. Moreover, the input data in cosmology, in particular weak lensing fields, significantly differ from conventional images, such as those of cats, boats, and other objects in ImageNet Deng et al. (2009), the long-standing industry benchmark for image classification algorithms. As depicted in Fig. 1, the weak lensing field is stochastic, low signal-to-noise, and has correlations on all scales. We seek a CNN approach that is simple and generalizable, rather than adopting complex architectures like ResNet He et al. (2015) or VGGNet Simonyan and Zisserman (2014) that have been optimized for classification tasks on images in other domains. Prior research Fluri et al. (2018); Ribli et al. (2019a); Gupta et al. (2018); Jeffrey et al. (2020); Lu et al. (2022); Matilla et al. (2020); Lu et al. (2023) employing CNNs for analyzing weak lensing fields has yielded promising results using relatively simple architectures.

Our 6-layer CNN model, as shown in Fig. 2, incorporates a notable departure from previous designs Fluri et al. (2018); Ribli et al. (2019a); Gupta et al. (2018); Fluri et al. (2019); Ribli et al. (2019b); Matilla et al. (2020); Jeffrey et al. (2021a); Lu et al. (2022, 2023); Hortúa et al. (2020), where either maximum pooling or average pooling is used throughout. Instead, we combine the use of maximum pooling and average pooling layers. One motivation for using maximum pooling in the initial layers is the observation that positive extreme values in our pixels correspond to galaxy clusters, which are known to carry valuable cosmological information. Indeed, a study of CNNs using saliency maps for lensing mass maps Matilla et al. (2020) found that clusters are more informative than voids in the presence of realistic noise. Conversely, deeper layers, which correlate distant sky regions of the original map, may require equitable consideration. Here, our choice to switch to average pooling facilitates better generalization in the subsequent Multi-Layer Perceptron (MLP) layers. We test these choices as discussed below and in Appendix A.
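A minimal sketch of such a pooling mix (channel counts follow Fig. 2 and the 4 max + 2 avg baseline of Table 2; kernel sizes and the single-channel input are illustrative assumptions):

import torch.nn as nn

def conv_block(c_in, c_out, pool):
    # Convolution + activation + the chosen pooling operation
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                         nn.LeakyReLU(),
                         pool)

n_ch = 8
features = nn.Sequential(
    conv_block(1,         n_ch,      nn.MaxPool2d(2)),   # early layers: max pooling
    conv_block(n_ch,      2 * n_ch,  nn.MaxPool2d(2)),
    conv_block(2 * n_ch,  4 * n_ch,  nn.MaxPool2d(2)),
    conv_block(4 * n_ch,  8 * n_ch,  nn.MaxPool2d(2)),
    conv_block(8 * n_ch,  16 * n_ch, nn.AvgPool2d(2)),   # deeper layers: average pooling
    conv_block(16 * n_ch, 32 * n_ch, nn.AvgPool2d(2)),
)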

We tested both simulation cases 1 and 2 and also increased the effective galaxy density (lower noise) to very approximately represent future LSST and Roman weak lensing surveys (we do not attempt to simulate the redshift coverage of those surveys). However, the best pooling option does not appear to correlate with the noise level. In fact, all-average pooling is similar to or better than the mixed pooling for $S_8$; when training on $\Omega_{\rm m}$, though, the all-average network fails to converge. We also tested extreme pooling with the largest absolute value (which then includes underdense extrema) and found no difference in the final results. For this reason, we use the mixed pooling as our baseline choice. One caveat is that the extremely valued pixels could also be the ones most affected by noise, limitations in the simulations, or unmodelled effects; this becomes more of a concern for smaller pixel sizes than ours. Adopting pooling layers in a Bayesian way could potentially overcome this problem Njieutcheu Tassi (2020).

It is important to note that this architecture is not optimal for every scenario, such as when dealing with patches of different sizes or different choices of systematic uncertainties. Furthermore, we only optimize the architecture for the inference of $S_8$, while a better design for $\Omega_{\rm m}$ is likely different and merits follow-up work.

Nonetheless, our findings below show that simple, physics-inspired modifications to a CNN can significantly improve accuracy. This encourages the pursuit of tailored models for specific problems in cosmology, rather than relying on existing architectures primarily designed for image classification.

IV CNN with Random Permutation Layers

Figure 3: The test errors for the two simulation cases are shown for different positions of the random permutation layers. "Conv 6" denotes adding the shuffling operation before the final convolution, namely Option 1 in Fig. 2. On the other hand, "Conv 1" refers to the first layer, so every pixel of the input image is randomly permuted (which leads to a higher loss, as expected). While it is clear that shuffling is effective only in the deeper layers, the best shuffling position varies for the two cases and is not necessarily the final layers – possibly because either option 2 or 3 leads to a simplified MLP structure (see Fig. 2).
Figure 4: Comparison of the test error in $S_8$ with and without shuffling (random permutation) for simulation case 1 (case 2 is shown in Figure 5). Left Panel: 6-layer CNN with no shuffle. Right Panel: Shuffled-CNN (Option 2 in Fig. 2, which corresponds to a 5-layer CNN), which corresponds to removing correlations for scales larger than 109 arcmin. Note the significant improvement due to shuffling for the full range of values of $S_8$. The two panels in this figure represent the best results we found without and with shuffling.
Figure 5: Comparison of the test error in $S_8$ with and without shuffling for simulation case 2 (with mask and tomography). Left Panel: 6-layer CNN with no shuffle. Right Panel: Shuffled-CNN (Option 3 in Fig. 2, which corresponds to a 4-layer CNN), which removes correlations for scales larger than 56 arcmin. The two panels in this figure represent the best results we found without and with shuffling. (If we shuffle at deeper layers, the performance is degraded to $\mathrm{RMSE}\approx 3.0\%$ but is still better than the non-shuffle case.)

Adding randomness is a common way in machine learning to improve the generalization accuracy. In this section, we introduce such a technique: random permutation of the pixels in the deeper layers of the network that correspond to large scales.

Figure 2 illustrates our design and the positions at which we replace the standard layers with random permutation layers. The random permutation layer shuffles (randomizes) the position of each pixel of the feature map. In this work, the shuffle is different for each batch and each channel, and it changes every epoch. Note that the four tomographic bins have the same random permutation (shuffle) applied (this is because we use the Conv2d function implemented in Pytorch, https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html).

As an example, if we add the permutation layer before convolution layer 6 (Option 1), the input is the feature map of size 8 by 8 with 128 channels. The random permutation operation shuffles these $8\times8$ pixels randomly and independently for each of the 128 channels. Each pixel in this feature map approximately corresponds to $512/8\times 6.8\approx 435.2$ arcmin. At this scale, the two-point correlation has a large sample variance. We find that by removing the information at large scales in the deeper layers, the neural network optimizes for the scales that are more relevant for cosmology. Moreover, it appears to extract some of the large-scale information in the earlier layers, though as discussed below we do not yet understand exactly how it does that. The shuffling operation can be inserted at different positions in the network, depending on the scale at which noise dominates over the signal. Note that this way of calculating scales is only approximate.
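A small helper illustrating this approximate bookkeeping (assuming one factor-of-2 downsampling per layer and the pixel scale used above):

def feature_pixel_scale(n_poolings, pixel_arcmin=6.8):
    # Approximate angular size (arcmin) of one feature-map pixel after
    # n_poolings factor-of-2 downsamplings of the original map.
    return (2 ** n_poolings) * pixel_arcmin

# Before convolution 6 (Option 1), the 512x512 input has been pooled down to 8x8:
print(feature_pixel_scale(6))   # 435.2 arcmin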

We test the positions of the shuffling operation for simulation cases 1 and 2. As shown in Fig. 3, the accuracy depends on where we perform the shuffling: for simulation case 1 the best position is the second-to-last layer, and for simulation case 2 it is the third-to-last layer. The simplification effectively makes them 5-layer and 4-layer models, respectively. In general, we expect the results to depend on the underlying variation of cosmologies, the noise level, tomographic binning, patch size, training set size, and other details. In particular, it will be interesting to see how increasing sample variance at large scales impacts the choice of the network Doux et al. (2022). To ensure a fair comparison, we also varied the depth of the non-shuffled CNN and found that the 6-layer version works best (see Appendix B).

The errors on the test set for both simulation cases are shown in Figs. 4 and 5, respectively. The posteriors for the two parameters are shown in Fig. 6. We find that with shuffling, the standard deviation of the CNN's $S_8$ constraint is 31% smaller than in the non-shuffle case.

Note that the CNN for the two cases is optimized separately, and indeed it is also different for $S_8$ and $\Omega_{\rm m}$: the 6-layer model for the non-shuffle case for $S_8$, the 4-layer model for the non-shuffle case for $\Omega_{\rm m}$, and the 4-layer model (option 3) for the shuffle-CNN for both parameters. To make the comparison with the same CNN model and the same number of weights, we also use shuffle option 1 for $S_8$ and get a 17% improvement in the standard deviation of $S_8$. However, it is important to note that this change of layer depth only helps the shuffle-CNN, while the non-shuffle CNN suffers from insufficient depth in our case.

Figure 6: The posterior inferred using simulation-based inference for simulation case 2. We find the improvement in the standard deviation of the marginalized $S_8$ to be 31%. The Figure of Merit $=1/\sqrt{\det\mathrm{Cov}(\Omega_{\rm m},S_8)}$ is 80% larger, although this may not be the best metric as the posterior is non-Gaussian.
Figure 7: Empirical coverage vs credibility level. A well-calibrated posterior closely aligns with the identity line, indicating that the CNN compression is neither overconstrained nor underconstrained. The lines are generated with TARP Lemos et al. (2023), as discussed in Sec. II.3.

We applied the TARP coverage test with 58 simulations (as there are only 58 independent cosmologies in the $S_8$–$\Omega_{\rm m}$ plane), and found that the posterior is well calibrated, as shown in Fig. 7.

Pseudo-code in Pytorch is displayed in Algorithm 1. Only the lines associated with the random permutation operation are shown. Note that the self.training flag might not behave as intended depending on the version of Pytorch; in that case, a custom flag should be used. The check is also optional, as in some cases random shuffling in both training and validation could boost the overall accuracy. Similarly, we can also easily define the model to only re-shuffle the pixels after every $N$ epochs.
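A minimal usage sketch of how the self.training flag controls the shuffle (the model and batches are placeholders):

import torch

model = ShuffledCNN()

# During training, model.training is True, so the permutation layer is active.
model.train()
pred_train = model(train_batch)      # hypothetical batch of maps

# model.eval() sets model.training to False, disabling the shuffle at test time
# (unless one chooses to keep it on, as discussed above).
model.eval()
with torch.no_grad():
    pred_test = model(test_batch)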

IV.1 Experiments to understand the improvement

To get insights into the source of these improvements, we conducted a series of experiments, varying the models and altering the training datasets. We plot in Fig. 8 the loss function decay for two models, with and without shuffling. The 6-layer non-shuffle CNN can fit the underlying data very well, as suggested by the training loss, but it does not generalize well to the validation or test set. The 4-layer CNN with shuffling, however, generalizes very well, even though it does not fit the training set as well. This demonstrates that the shuffling technique is effective in preventing overfitting. One hypothesis is that the 6-layer non-shuffle model suffers from co-adaptation, a scenario where certain neurons become highly interdependent, leading to overfitting and poor performance on new test inputs. The shuffling prevents any large-scale outliers from significantly affecting the results.

One case of special interest is when we directly shuffle the original images. As shown in Fig. 9, the result worsens, but by less than a factor of 2, which is not as bad as one might expect in the context of standard image analysis, since no structure of the image is retained. However, in cosmology, it is known that the 1-point probability distribution function (PDF) of the image pixels has useful information, as it is the PDF of the projected density. In particular, it includes the variance, i.e., the smoothed two-point function. The pixel scale in our original image corresponds to a few Mpc at the lens redshift. We verified that if we simply measure the average of the entire image, or even use the PDF as a summary statistic, the RMSE is $\sim 7\%$, which is significantly worse. This example shows that even if all the information on scales larger than the single pixel is destroyed, the convolutional layers (plus subsequent nonlinear operations) can extract more information than just the PDF. We do not pursue this point further in this study.

Next, we test cases where we remove the large-scale correlations directly from the map. We do this by setting the spherical harmonic coefficients $a_{\ell m}=0$ for $\ell<\ell_{\rm min}$ when making the map for simulation case 2. The results are summarized in Table 1. On removing the large-scale correlations, the relative improvement from shuffling is decreased, which makes sense as sample variance in the large-scale modes is also removed by setting $a_{\ell m}=0$ (so the CNN was not learning the noise in the non-shuffle case). For different shuffling options, the variance is larger but still follows the pattern in Fig. 3.
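A sketch of this large-scale filtering with healpy (the input full-sky convergence map and the function name are placeholders; the map resolution follows Sec. II.1):

import numpy as np
import healpy as hp

def remove_large_scales(kappa_map, lmin, nside=512):
    # Zero out all spherical-harmonic modes with ell < lmin.
    lmax = 3 * nside - 1
    alm = hp.map2alm(kappa_map, lmax=lmax)
    filt = np.ones(lmax + 1)
    filt[:lmin] = 0.0                  # kill ell < lmin
    alm = hp.almxfl(alm, filt)         # multiply a_lm by the ell-dependent filter
    return hp.alm2map(alm, nside=nside, lmax=lmax)

# e.g. filtered_map = remove_large_scales(full_sky_kappa, lmin=76)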

To check if shuffling in the deeper layers causes the CNN to lose large-scale information, we did the test of adding the lensing power spectrum as a summary statistic to the MLP. We found no improvement in the performance, suggesting that the early layers of the CNN do extract some information on large-scale correlations. We leave for future work whether mode coupling due to nonlinear evolution plays a role in this.

We note that if the 6-layer CNN model (no shuffling in training) undergoes a post hoc permutation in the final layer during testing, the error only marginally increases from 1.1% to 1.7%. This suggests that, typically, the last convolution layer is not optimized to capture spatial correlation but instead functions more as the global average and optimizes performance by its weighting of the different channels. We note that the tests discussed in this section are exploratory; a detailed analysis of regularization methods in cosmology is interesting for future research.

Models    RMSE (%)    $R^2$    $\Delta$RMSE
Case 2 with no $\ell_{\rm min}$ cut
6-layer CNN (no shuffle) 3.703 0.934 -
CNN with Shuffle Option 1 2.946 0.958 20.4 %
CNN with Shuffle Option 2 2.914 0.959 21.3 %
CNN with Shuffle Option 3 2.579 0.968 30.5 %
Case 2 with $\ell_{\rm min}=76$
6-layer CNN (no shuffle) 4.047 0.923 -
CNN with Shuffle Option 1 3.837 0.930 5.2 %
CNN with Shuffle Option 2 3.261 0.950 19.4 %
CNN with Shuffle Option 3 2.953 0.959 27.1 %
Case 2 with $\ell_{\rm min}=148$
6-layer CNN (no shuffle) 4.476 0.903 -
CNN with Shuffle Option 1 4.095 0.918 8.5 %
CNN with Shuffle Option 2 3.549 0.939 20.7 %
CNN with Shuffle Option 3 3.240 0.949 27.6 %
Table 1: This table summarizes the results for the $S_8$ predictions when we remove the large-scale correlations from the maps by setting the spherical harmonic coefficients $a_{\ell m}=0$ for $\ell<\ell_{\rm min}$. The goal is to understand the shuffle CNN better, as discussed in the text. $\ell_{\rm min}=76$ approximately corresponds to 108 arcmin, and $\ell_{\rm min}=148$ approximately corresponds to 56 arcmin. The last column is the percentage improvement over the CNN model without shuffle. In both cases, the overall accuracy is decreased because information is removed from the maps. The relative improvement of shuffling option 1 is also decreased because shuffling at large scales becomes less relevant.

IV.2 Other tests, symmetries, and permutation invariance

We perform various tests of regularization in comparison with the random permutation layers. In Appendix B, we show 7 additional models that share the same goal as shuffling – to impose permutation invariance at certain layers. For example, we replaced deep layers with an adaptive average pooling layer that simply reduces the 32x32 feature maps to one value per channel. We also tested using other permutation-invariant quantities such as the variance, maximum, and minimum. The results suggest that they are not as effective as the shuffling option. In Appendix C we compare with two other regularization schemes, Dropout and batch normalization, and show that they negatively affect the accuracy in our case. As a supplementary test, we show in Appendix E the results for different noise levels and training set sizes. Across the various test scenarios, the addition of random permutations consistently elevates the generalization accuracy.

Constraining neural networks with physical symmetries has been shown to help the training and the accuracy of the model. In cosmology particularly, translation and rotation symmetries are key ingredients for improving performance. Previous work that imposes these symmetries in normalizing flows Dai and Seljak (2022) and graph neural networks Makinen et al. (2022) demonstrates great potential for both generative models and models used for parameter inference. The scattering wavelet transform Mallat (2012) is powerful for cosmological fields partially due to its translation-invariant construction. The convolution operation is translation equivariant by definition. However, note that the max-pooling operation breaks the translation equivariance, so the CNN is not rigorously translation invariant. Investigating architectures suitable for cosmology with translational Zhang (2019) and rotational symmetry Cohen and Welling (2016); Weiler and Cesa (2019) is a promising future direction – we leave such investigations for future work.

Permutation invariance, although less relevant for CNNs, is very common in Graph Neural Networks (GNNs) Wu et al. (2019). Janossy Pooling Murphy et al. (2018) employs a random sample of a subset of possible permutations as an alternative pooling operation. Such tweaks to the training process can also be found in CNNs. In particular, ShuffleNet Zhang et al. (2017) employs a channel shuffle preceding a group convolution. A more extreme case can be found in Xie et al. (2016), where the authors disturb the loss layer by directly giving wrong labels, and show that this boosts the neural network’s ability to learn more general features.

In this work, we shuffle the feature map every epoch, but only during training. Other variations of random permutation are also worth exploring – for example, we tested shuffling every 10 epochs, which allows the convolution layers to learn from the same permuted map within those 10 epochs. The test error is similar to that of shuffling every epoch. One can also introduce a probability distribution such that, say, the feature map is shuffled only 10% of the time. We expect shuffling every $N$ epochs to be a more general solution that can be optimized for other cases.
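A sketch of the every-$N$-epochs variant (a hypothetical subclass of the ShuffledCNN in Algorithm 1; here a single cached permutation is shared by all channels, and the training loop calls maybe_redraw once per epoch):

import torch

class PeriodicShuffleCNN(ShuffledCNN):
    def __init__(self, shuffle_period=10):
        super().__init__()
        self.shuffle_period = shuffle_period
        self.fixed_idx = None          # cached permutation indices

    def maybe_redraw(self, epoch, n_pix):
        # Called once per epoch by the training loop; re-draws the permutation
        # only every `shuffle_period` epochs.
        if self.fixed_idx is None or epoch % self.shuffle_period == 0:
            self.fixed_idx = torch.randperm(n_pix)

    def apply_cached_permutation(self, x):
        # Apply the cached permutation to the spatial pixels of every channel.
        x_flat = x.flatten(start_dim=-2)          # (batch, channels, H*W)
        return x_flat[..., self.fixed_idx].view_as(x)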

Figure 8: The MSE loss vs. training epoch. The solid line shows the training loss while the dashed line shows the validation loss. The blue lines are for the CNN without shuffle and the orange lines are for the shuffled CNN (option 3 in Fig. 2). The 6-layer CNN without shuffle can fit the training set very well but the generalization performance is not good as evident from the higher validation loss. This gives an accuracy boost for the models with random shuffling. The loss decay curve shows some differences from case to case, but this is a typical case of how shuffling improves the CNN.
Figure 9: Test error for simulation case 1 with the random permutation as the first layer, followed by a $[1\rightarrow 8]$ convolution layer (8 different kernels to learn). The image is entirely disrupted, so the maximum information available in each channel is the PDF of pixel values. However, if we directly train an MLP using the mean, variance, or the full PDF, we get an RMSE of $\sim 7\%$ at best. This example shows that the convolution operation is a computationally efficient way of summarizing different information across channels.

V Discussions and Conclusions

CNNs are the most extensively developed deep learning approach for image analysis. In recent years they have also been applied for inferring cosmological parameters from weak lensing and other cosmological maps. We show that for the characteristics of cosmological fields like lensing mass maps, variations of the standard CNN architecture can lead to improved performance. We have presented a set of regularization and data augmentation methods and quantified the performance improvement for simulated lensing surveys. The simulated surveys mimic the Dark Energy Survey parameters in both statistical and systematic uncertainties, but we also consider the lower shape noise levels corresponding to Stage-IV surveys. The explorations in this paper are not definitive as the detailed implementation should be adapted to the specific applications.

We use random permutations (a shuffling operation) in the deep layers of the CNN as a novel regularization and data augmentation technique. We also use a mix of maximum and average pooling for varying layer depths. These simple modifications can enhance the performance compared to traditional CNN designs that have been adapted from image classification. Figure 6 demonstrates the improvement in parameter inference. We also show in Fig. 8 that the approach can be expected to have better generalization accuracy. While the design requires optimization for specific problems, incorporating a random permutation layer emerges as a promising approach for cosmological fields. An open question that we have addressed only partially is whether CNNs like the ones we have tested lose some large-scale information. It is possible that the early layers of the CNN, prior to shuffling, have extracted the available signal. We find some evidence for this when considering the power spectrum – adding the power spectrum, including the large scales (small angular wavenumbers), to the MLP does not improve the performance further. We leave a detailed investigation for future work.

With the upcoming Stage-IV surveys (Euclid, Rubin-LSST, and Roman), we expect a substantial improvement in the quality and size of lensing mass maps. Deep learning approaches, along with higher-order statistics and field-level inference, can extract the vast amount of information in these maps that goes beyond standard 2-point statistics. While CNNs may not capture all the information contained in cosmological maps, they offer some clear advantages. As a model-specific data compressor, CNNs are flexible for many different settings. For example, if the data has multiple tomographic bins, as in current weak lensing surveys, CNNs can take them as different channels and capture the cross-channel information. If one prefers to analyze data with a shear field instead of the convergence field, one can also treat the two components as different channels and avoid the complexity of reconstructing the convergence map. Hence, exploring improved performance with CNNs is likely to be of value in cosmological applications.

Caveats and Future work: This paper has focused on testing a set of regularization and data augmentation schemes with a simple CNN design. Given that this approach is inspired by our understanding that cosmological fields are noisy on scales approaching the survey size, we anticipate that the random permutation layer, when applied at appropriate scales, could be synergized with more complex architectures. For instance, similar shuffling operations can be easily tested with Vision Transformer Hwang et al. (2023) or Graph-based Neural Networks Perraudin et al. (2019).

It is also interesting to test if this regularization and data augmentation method can be effective in other cosmological fields such as the overdensity field Villaescusa-Navarro et al. (2020, 2021b), 21-cm maps Novaes et al. (2023), and secondary anisotropies in CMB temperature or polarization maps Casas et al. (2022); Münchmeyer and Smith (2019).

A useful direction for future research is the optimization of the shuffling operation. One possibility is to introduce a hyperparameter that modulates the probability of performing the shuffle, which is a technique used in Swiderski et al. (2022). Our limited explorations are conservative in the sense that we did not tune hyperparameters, so there is room for improvement by tuning all the hyperparameters for different variations of the noise level. Additionally, exploring other randomness-incorporating structures, such as random shifting Zhao et al. (2017) or randomly wired CNNs Xie et al. (2019), could also offer valuable insights.

Another important direction for future research is in the interpretability of CNNs when applied to cosmological data. Despite their proven efficacy in regression problems, a persistent challenge with CNNs is the ’black box’ nature of their decision-making processes. Understanding how these networks derive their conclusions is important, not just for validating and improving accuracy, but also for gaining deeper insights into the underlying physics. Previous studies Matilla et al. (2020); Ntampaka and Vikhlinin (2022) show intriguing results by using saliency maps Simonyan et al. (2013); Springenberg et al. (2014) (see also a recent study You et al. (2023) using a ‘sum-of-parts’ approach to interpretability). Architectures designed to relate the neural network results to physical qualities like the N-point correlation function are also promising Miles et al. (2021); Gong et al. (2024). It will be interesting to see how the results change with the presence of random permutation layers.

Acknowledgements

We thank Shubh Agrawal, Pratik Chaudhari, Cyrille Doux, Rafael Gomes, Mike Jarvis, Amrut Nadgir, Shivam Pandey, Helen Qu, Dimitrios Tanoglidis, and Eric Wong for useful discussions. We are especially grateful to Tomek Kacprzak and his collaborators for generating and making available the DarkGrid simulations used in this work. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231. B.J. and M.G. are supported in part by the US Department of Energy grant DE-SC0007901.

References

Appendix A Test pooling choices

Models    RMSE (%)    $R^2$
simulation case 1
All Max Pooling 2.174 0.975
All Avg Pooling 1.308 0.991
Mixed Pooling 1.187 0.992
simulation case 2
All Max Pooling 4.764 0.890
All Avg Pooling 3.61 0.937
Mixed Pooling 3.703 0.934
simulation case 2 with $2n_{\rm eff}$
All Max Pooling 4.5820 0.898
All Avg Pooling 2.520 0.969
Mixed Pooling 3.079 0.954
Mixed Pooling-v2 2.699 0.964
simulation case 2 with $4n_{\rm eff}$
All Max Pooling 3.526 0.939
All Avg Pooling 2.042 0.980
Mixed Pooling 2.380 0.972
Mixed Pooling-v2 1.683 0.986
Table 2: Summary of pooling choices in different simulation cases when applied to $S_8$. Mixed Pooling stands for the 4max+2avg arrangement shown in Fig. 2. Mixed Pooling-v2 denotes 2max+4avg. Note that for simulation case 2 the all-avg is slightly better than mixed pooling, but the all-avg does not work at all for inferring $\Omega_{\rm m}$. Hence, we use mixed pooling (4max+2avg) as the baseline choice for this work.

The experiments involving various pooling options are detailed in Table 2. The outcomes vary between simulation cases 1 and 2 and with the level of shape noise, but not in a straightforward way. This quantitative comparison is restricted to $S_8$. We selected mixed pooling as the baseline since the all-average approach is consistently ineffective for training on $\Omega_{\rm m}$. Although this investigation is far from comprehensive due to computational resource limitations, it shows that different choices of pooling can impact the outcomes.

Appendix B Testing Alternatives to Random Permutation Layer

Models    RMSE (%)    $R^2$
Best Model with Shuffle 0.865 0.996
Test Model 1 1.235 0.992
Test Model 2 1.126 0.993
Test Model 3 1.147 0.993
Test Model 4 1.263 0.991
Test Model 5 1.419 0.989
Test Model 6 2.316 0.971
Test Model 7 1.84 0.982
Table 3: Different models that we test as alternatives to random permutation layers, in comparison with the best model with shuffle (option 2). The description of each model is listed in Appendix B. None of these models performs better than even the 6-layer CNN model without shuffling. This test shows that the effect of random permutation layers cannot be replaced by such permutation-invariant statistics.

Since introducing random permutation layers improves performance, we consider other ways to impose permutation invariance. A large average pooling over the entire feature map, for example, is permutation invariant. One can also add other permutation-invariant summary statistics such as the variance, maximum, or minimum. This is more common in Graph Neural Networks, where permutation invariance is crucial due to the data structure. For example, in Wang et al. (2023), the input of the MLP layers is the collection of sum, mean, max, and min.

Here, we test the following possibilities:

• Test Models 1–3: Large Avg

These test models replace the random permutation layer and the subsequent convolution layer with a large average pooling. Test models 1–3 correspond to models that have 5, 4, and 3 convolution layers, respectively. Note that the following MLP also changes in size because the channel number changes.

• Test Model 4: Mean + Variance

This model employs a large average pooling layer and additionally calculates the variance of the 8x8 pixel blocks for each channel. These are then concatenated as the input for the MLP.

• Test Model 5: Mean + Sum + Max + Min

This model is an extension of Test Model 4, using the Mean (Avg), Sum, Maximum, and Minimum of the current channel as the inputs of the MLP (see the sketch after this list).

• Test Model 6: Large Flat Layer

In this model, the final average pooling is omitted. Instead, a flattening operation is used to compile all the information. The input size of the MLP is increased to $128\times8\times8=8192$. We also tested other channel numbers with flattening at the end and found similar results.

• Test Model 7: Patchify into smaller CNNs

For this model, we patchify the original image into an $8\times8$ grid of smaller images of size $64\times64$. We forward these smaller patches through the same weight-sharing convolutional layers and then take the average as the input of the MLP layers.
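A sketch of the Test Model 5 style pooling head (a hypothetical implementation of concatenating permutation-invariant per-channel summaries before the MLP):

import torch

def invariant_summaries(features):
    # features: (batch, channels, H, W) feature maps.
    # Returns (batch, 4 * channels): per-channel mean, sum, max, and min,
    # all invariant under permutations of the spatial pixels.
    flat = features.flatten(start_dim=2)      # (batch, channels, H*W)
    return torch.cat([flat.mean(dim=2),
                      flat.sum(dim=2),
                      flat.amax(dim=2),
                      flat.amin(dim=2)], dim=1)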

Note that none of the models above involve random permutation; models 1 to 5 are strictly permutation invariant at the corresponding layers. The test errors are summarized in Table 3, and none of these alternatives helps the 6-layer CNN model as much as the random permutation layers do. This is a model-specific result: following the idea of random shuffling at large scales, we anticipate that some of the choices above could increase accuracy in certain simulation settings. However, the random permutation layer remains a promising option when designing neural networks for cosmological fields.

For Model 7, we patchify the original image into smaller patches, and we tried various combinations of patch sizes, convolutional layers, and MLP sizes. This choice shares the same idea of making the corresponding scales permutation invariant. However, the modes between patches are not captured by the convolution operations, and thus much less information is available. The RMSE is 1.84%, significantly larger than that of the 4-layer shuffled model or the 6-layer non-shuffle model. Note that this experiment is conducted with simulation case 1, and we expect this method to work worse for simulation case 2 in the presence of masks.

Appendix C Comparison with Standard Regularization

C.1 Dropout

Dropout is a widely used regularization technique for mitigating overfitting in machine learning. Essentially, Dropout combats co-adaptation by randomly deactivating a fraction of the neurons during training. Despite its effectiveness, Dropout is not universally adopted, especially in more recent and complex computer vision models such as ResNet He et al. (2015).

We show in Table 4 three different uses of Dropout:

Dropout-v1

One Dropout layer before the MLP of the 6-layer CNN model. This test checks whether masking some of the embedding representations can prevent overfitting of the MLP.

Dropout-v2

Three Dropout layers, one after each linear+activation pair in the MLP, namely fc 7-9 in Fig. 2. This is the more traditional use of Dropout: all the representations from the convolution layers are kept, and Dropout directly prevents overfitting of the MLP regression.

Dropout-v3

A 2D Dropout layer before convolution 6, where the random permutation layer is applied. This test makes a direct comparison with the permutation layer: instead of shuffling the positions, some patches are masked out during training.

For all three tests, we use a Dropout rate of 0.2. We found no improvement over the 6-layer CNN model.
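For concreteness, the sketch below shows where the three variants place Dropout in PyTorch; the channel counts and MLP widths are placeholders rather than the exact dimensions of our 6-layer CNN.

import torch.nn as nn

p = 0.2                                   # Dropout rate used in all three tests

# Dropout-v1: a single Dropout on the embedding, just before the MLP
head_v1 = nn.Sequential(nn.Dropout(p), nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

# Dropout-v2: Dropout after each linear + activation inside the MLP
head_v2 = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p),
    nn.Linear(64, 1),
)

# Dropout-v3: 2D Dropout on the feature map, placed where the random
# permutation layer would otherwise act (before convolution 6)
block_v3 = nn.Sequential(nn.Dropout2d(p), nn.Conv2d(64, 128, kernel_size=3))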

Models   RMSE (%)   $R^2$
Best Model with Shuffle 0.865 0.996
Dropout
All Avg w/ Dropout-v1 1.214 0.992
Mixed Avg/Max w/ Dropout-v1 1.115 0.993
Mixed Avg/Max w/ Dropout-v2 1.563 0.987
Mixed Avg/Max w/ Dropout-v3 1.165 0.993
BatchNorm before activation
All Avg w/ BN 1.664 0.985
Mixed Avg/Max w/ BN 2.780 0.959
BatchNorm after activation
All Avg w/ BN 2.169 0.975
Mixed Avg/Max w/ BN 1.890 0.981
Table 4: Test models with two other regularization methods, Dropout and BatchNorm, applied to simulation case 1 and $S_8$. We also tested applying BatchNorm before and after each activation function. See Appendix C for the details of where these regularization layers are used. As in Table 3, none of these methods does as well as the shuffle CNN.

C.2 BatchNorm

Layer   Kernel size   Stride   Output dimensions
(Input)         $4\times 512\times 512$
Convolution   $5\times 5$   2   $(n_{ch}/2)\times 254\times 254$
Convolution   $5\times 5$   2   $n_{ch}\times 125\times 125$
Residual block
Residual block
Pooling   $2\times 2$   2   $n_{ch}\times 62\times 62$
Convolution   $3\times 3$   1   $(2n_{ch})\times 60\times 60$
Pooling   $2\times 2$   2   $(2n_{ch})\times 30\times 30$
Convolution   $3\times 3$   1   $(4n_{ch})\times 28\times 28$
Pooling   $2\times 2$   2   $(4n_{ch})\times 14\times 14$
Convolution   $3\times 3$   1   $(8n_{ch})\times 12\times 12$
Pooling   $2\times 2$   2   $(8n_{ch})\times 6\times 6$
Convolution   $3\times 3$   1   $(16n_{ch})\times 4\times 4$
Pooling   $4\times 4$      $(16n_{ch})\times 1\times 1$
Linear         256
ReLU         256
Linear         1
Table 5: The architecture used in Lu et al. (2022), except for differences in the input and output sizes. The RMSE for simulation case 2 is 3.776%, similar to that of our non-shuffle CNN model as shown in Fig. 5.

Batch Normalization (BatchNorm) Ioffe and Szegedy (2015) is another very powerful regularization scheme that is commonly used in machine learning. It has been adopted in previous weak lensing analyses such as Ribli et al. (2019a); Lu et al. (2022, 2023), but not in Fluri et al. (2018). BatchNorm essentially whitens the activations and thus alleviates the internal covariate shift problem (see https://joelouismarino.github.io/posts/2017/08/statistical-whitening/ for a detailed explanation of statistical whitening and data normalization, which is common in most neural network designs).
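For reference, the following sketch reproduces the per-channel whitening that a 2D BatchNorm performs at training time (the epsilon value and tensor shapes are illustrative); gamma and beta are the two learnable affine parameters discussed below.

import torch

def batchnorm2d_train(x, gamma, beta, eps=1e-5):
    # x: (batch, channels, H, W); statistics are taken over batch and spatial dims
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)          # whitened activations
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

x = torch.randn(16, 32, 8, 8)
gamma, beta = torch.ones(32), torch.zeros(32)
y = batchnorm2d_train(x, gamma, beta)   # matches nn.BatchNorm2d(32) in training mode
                                        # (ignoring the running-statistics bookkeeping)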

Models   RMSE (%)   $R^2$
simulation case 1 with $2n_{\mathrm{eff}}$
non-shuffle 6-layer CNN   0.974   0.996
CNN with Shuffle Option 1   0.740   0.998
CNN with Shuffle Option 2   0.684   0.998
simulation case 1 with $0.5n_{\mathrm{eff}}$
non-shuffle 6-layer CNN   1.537   0.988
CNN with Shuffle Option 1   1.275   0.991
CNN with Shuffle Option 2   1.833   0.982
simulation case 1 with no shape noise
non-shuffle 6-layer CNN   0.561   0.998
CNN with Shuffle Option 1   0.424   0.999
CNN with Shuffle Option 2   0.565   0.998
simulation case 2 with half training data
non-shuffle 6-layer CNN   4.097   0.912
CNN with Shuffle Option 1   3.594   0.932
CNN with Shuffle Option 2   3.478   0.936
CNN with Shuffle Option 3   2.921   0.955
simulation case 2 with $2n_{\mathrm{eff}}$
non-shuffle 6-layer CNN (mixed-pooling)   3.079   0.954
non-shuffle 6-layer CNN (avg-pooling)   2.520   0.969
CNN with Shuffle Option 1   2.520   0.969
CNN with Shuffle Option 2   2.179   0.977
CNN with Shuffle Option 3   2.013   0.980
simulation case 2 with $4n_{\mathrm{eff}}$
non-shuffle 6-layer CNN (mixed-pooling-v2)   1.683   0.9886
non-shuffle 6-layer CNN (avg-pooling)   2.042   0.980
CNN with Shuffle Option 1   1.994   0.981
CNN with Shuffle Option 2   1.748   0.985
CNN with Shuffle Option 3   1.410   0.990
Table 6: Additional tests varying the noise level and the size of the training set. The reported RMSE is for $S_8$. The shuffling operation consistently improves the accuracy when comparing the non-shuffle 6-layer CNN and shuffle option 1, which differ only by one random permutation layer. Note that the best position of the shuffling operation changes relative to the fiducial test of the paper.

However, BatchNorm is not a once-and-for-all solution. As a simple example, placing BatchNorm before or after the activation function is a non-trivial choice, since the two placements affect the goal of whitening differently (see http://torch.ch/blog/2016/02/04/resnets.html for a discussion of the effects of placing BatchNorm before or after the activation). The issue is further complicated by regularization schemes such as Dropout Li et al. (2018). In fact, many architectures find that combining the two most popular regularization schemes, Dropout and BatchNorm, often decreases accuracy.
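The sketch below illustrates the two placements compared in Table 4; the channel counts are placeholders.

import torch.nn as nn

# BatchNorm before the activation
block_bn_pre = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

# BatchNorm after the activation
block_bn_post = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3),
    nn.ReLU(),
    nn.BatchNorm2d(64),
)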

In this work, we did not adopt any BatchNorm in our architecture, and we show in Table 4 that it negatively affects the results. Note that we use the default setting that enables the affine parameters, which are the only two learnable parameters of a BatchNorm layer. However, they complicate training in the mini-batch setting because they have a large effect on the whitened output.

Again, this choice is specific to optimizing our problem and could differ if the data or other assumptions change. For instance, if one uses data augmentation with a lower level of Gaussian noise injection, or studies the case in which all cosmological parameters vary, the larger internal variance of the input data could make BatchNorm necessary. We do, nevertheless, expect the permutation layer to improve generalization performance in the presence of BatchNorm.

Appendix D Comparison with ResNet Used in Previous Work

We also test an architecture with residual connections. The architecture is summarized in Table 5 and is the same as in Lu et al. (2022, 2023), except that the input has 4 channels for our 4 tomographic bins and the output is a single number, since we train one parameter per network. The residual block consists of two convolution layers and a skip connection, similar to He et al. (2015). We use 10 residual blocks and $n_{\mathrm{ch}}=10$ as suggested in Lu et al. (2022). We find no difference between the ResNet and our 6-layer CNN model without shuffle. For simplicity, we did not further optimize this design or test random permutation layers within this structure. This test suggests that the residual connection might not be necessary for our simulations. We stress again that the comparison is not exact because our simulations are very different: even though the input sizes are both 512x512, the resolutions differ, with the maps in Lu et al. (2022, 2023) having a pixel scale of 0.87 arcmin. Nevertheless, this result suggests that the proposed random permutation layer optimizes the neural network for cosmological fields in a different way.
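A minimal sketch of a residual block of this type (two convolutions plus a skip connection) is shown below; the kernel size, padding, and activation are assumptions for illustration rather than the exact block of Lu et al. (2022) or He et al. (2015).

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.act(self.conv1(x)))
        return self.act(out + x)        # skip connection: add the input back

x = torch.randn(2, 10, 125, 125)        # n_ch = 10 feature maps, as in Table 5
print(ResidualBlock(10)(x).shape)       # torch.Size([2, 10, 125, 125])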

Appendix E Additional Tests with Different Noise Level and Size of Training Set

We provide additional tests varying the noise level and the size of the training set. As summarized in Table 6, the shuffling operation consistently boosts the generalization accuracy of the model. Note that the 6-layer CNN should be compared with the CNN with shuffle option 1, as they have the same structure and number of trainable parameters and differ only by one random permutation layer. In these tests, the other options are not always better than option 1, but the CNN with Shuffle Option 1 is always better than the 6-layer CNN model.
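For reference, the sketch below shows one way such a spatial random permutation ("shuffle") layer could be written in PyTorch; whether the permutation acts on single pixels or on coarser blocks, and where the layer is inserted (options 1-3), is specified in the main text, so this sketch is illustrative only.

import torch
import torch.nn as nn

class RandomPermutation2d(nn.Module):
    def forward(self, x):
        # x: (batch, channels, H, W); a fresh permutation is drawn at every forward pass
        b, c, h, w = x.shape
        idx = torch.randperm(h * w, device=x.device)
        x = x.flatten(start_dim=2)[:, :, idx]       # same shuffle applied to all channels
        return x.reshape(b, c, h, w)

x = torch.randn(2, 128, 8, 8)
print(RandomPermutation2d()(x).shape)               # torch.Size([2, 128, 8, 8])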