
An empirical approach to model selection: weak lensing and intrinsic alignments

A. Campos,1,2 S. Samuroff,1,3 and R. Mandelbaum1,2
1McWilliams Center for Cosmology, Department of Physics, Carnegie Mellon University, Pittsburgh, PA 15213, USA
2NSF AI Planning Institute for Physics of the Future, Carnegie Mellon University, Pittsburgh, PA 15213, USA
3Department of Physics, Northeastern University, Boston, MA, 02115, USA
Contact: andresar@andrew.cmu.edu
ORCID: 0000-0002-5124-0771, 0000-0001-7147-8843, 0000-0003-2271-1527
Abstract

In cosmology, we routinely choose between models to describe our data, and can incur biases due to insufficient models or lose constraining power with overly complex models. In this paper we propose an empirical approach to model selection that explicitly balances parameter bias against model complexity. Our method uses synthetic data to calibrate the relation between bias and the $\chi^2$ difference between models. This allows us to interpret $\chi^2$ values obtained from real data (even if catalogues are blinded) and choose a model accordingly. We apply our method to the problem of intrinsic alignments, one of the most significant weak lensing systematics and a major contributor to the error budget in modern lensing surveys. Specifically, we consider the example of the Dark Energy Survey Year 3 (DES Y3), and compare the commonly used nonlinear alignment (NLA) and tidal alignment & tidal torque (TATT) models. The models are calibrated against bias in the $\Omega_{\mathrm{m}}-S_8$ plane. Once noise is accounted for, we find that it is possible to set a threshold $\Delta\chi^2$ that guarantees an analysis using NLA is unbiased at some specified level $N\sigma$ and confidence level. By contrast, we find that theoretically defined thresholds (based on, e.g., $p$-values for $\chi^2$) tend to be overly optimistic, and do not reliably rule out cosmological biases of up to $\sim 1{-}2\sigma$. Considering the real DES Y3 cosmic shear results, based on the reported difference in $\chi^2$ from NLA and TATT analyses, we find a roughly $30\%$ chance that, were NLA the fiducial model, the results would be biased (in the $\Omega_{\mathrm{m}}-S_8$ plane) by more than $0.3\sigma$. More broadly, the method we propose here is simple and general, and requires a relatively low level of resources. We foresee applications to future analyses as a model selection tool in many contexts.

keywords:
methods: statistical – cosmology: observations – cosmological parameters – gravitational lensing: weak

1 Introduction

Modern cosmology is an increasingly high-dimensional problem. Although the standard cosmological model itself is relatively simple, containing only five or six free parameters, it cannot, in general, be constrained in isolation. One must rely on measurements of real data, which can contain any number of additional features resulting from non-cosmological processes. It is necessary to include models for such systematics in any cosmological inference, and to marginalise over their parameters. Contemporary weak lensing analyses (see, e.g., Heymans et al. 2013; Dark Energy Survey Collaboration 2016; Jee et al. 2016; Hildebrandt et al. 2017; Troxel et al. 2018; Hikage et al. 2019; Hamana et al. 2020; Asgari et al. 2021; Loureiro et al. 2022; Amon et al. 2022; Secco, Samuroff et al. 2022; Doux et al. 2022) typically have around 15-30 free parameters, the majority of which are related to measurement uncertainties. This picture is unlikely to change in the coming years. Indeed, as we move into the era of Stage IV surveys (Ivezić et al. 2019; Spergel et al. 2015; Laureijs et al. 2011), the unprecedented statistical power of these new data sets carries an increasing sensitivity to systematics.

Some systematic uncertainties can be modelled quite accurately given our prior knowledge of their nature; for instance PSF modelling error (Jarvis et al., 2021) and shear measurement biases (Heymans et al., 2006; Bridle et al., 2010; Mandelbaum et al., 2015). In most cases, however, there is a relative lack of prior knowledge about the magnitude and/or scale dependence of the effects being modelled. Some examples include the impact of baryonic feedback (Osato et al., 2015; Chen et al., 2022; Tröster et al., 2022), nonlinear structure formation (and the impact of neutrinos on it; Saito et al. 2008; Bird et al. 2012; Mead et al. 2021; Knabenhans et al. 2021) and galaxy bias (Desjacques et al., 2018; Simon & Hilbert, 2018; Pandey et al., 2020). Here, there is clearly an argument for using the most sophisticated (physically motivated) model available. This is the safest way to avoid bias due to model insufficiency. That said, extra free parameters do potentially carry a cost in terms of constraining power. They can also worsen projection effects, which complicate the interpretation of projected parameter constraints (see Joachimi et al. 2021a; Krause et al. 2021). The ideal approach, then, would be to select a model that balances the two: complex enough to avoid bias, but not more complex than is needed to describe the data.

Model selection methods are widely used in cosmology, often seeking to answer the question of whether introducing new parameters to cosmological models is justified by the data. Some of the most common tools for this are $\chi^2$ tests, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Deviance Information Criterion (DIC) and Bayes ratios (see e.g. Liddle et al. 2006a; Liddle 2007; Trotta 2007, 2008; Kerscher & Weller 2019). A characteristic of how all these statistics have been used is that they are interpreted using threshold values derived from the theoretical properties of the model, e.g. the Jeffreys scale for Bayes ratios. They have also most commonly been applied to compare how well cosmological models fit the data post-analysis, rather than actively being used to select elements of the analysis in the blinded stages. By contrast, the process for choosing the fiducial model for an analysis typically does not make use of model comparison statistics at all. Rather, we tend to rely on generating and analysing simulations (either analytic or numerical) containing various forms of unmodelled systematics (e.g., Krause et al. 2017, 2021; Joachimi et al. 2021a). This approach works, but depends heavily on the ability to create realistic mocks. It is also important to note that any model selection method will typically have a number of subjective choices built into it, e.g., whether to compare data vectors or perform likelihood inference, and what cutoff to use for decision-making. This is also true, to an extent, for the method we present in this paper. That said, our method has the feature that the decision-making happens in well-defined places and has a well-defined interpretation connected to parameter biases (e.g. selecting a tolerable bias level and a confidence interval), as we will see in the following sections.
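As a concrete illustration of two of these criteria, the short sketch below computes the AIC and BIC from a best-fit $\chi^2$, using the standard identification $-2\ln L_{\rm max}=\chi^2_{\rm min}$ up to an additive constant. All numbers are hypothetical, chosen only to show the comparison.

```python
import numpy as np

def aic(chi2_min: float, n_params: int) -> float:
    """Akaike Information Criterion for a Gaussian likelihood,
    where -2 ln L_max = chi^2_min up to an additive constant."""
    return chi2_min + 2.0 * n_params

def bic(chi2_min: float, n_params: int, n_data: int) -> float:
    """Bayesian Information Criterion under the same assumption."""
    return chi2_min + n_params * np.log(n_data)

# Hypothetical numbers: a 2-parameter model vs a 5-parameter extension,
# both fit to 400 data points. The model with the lower value is preferred.
print(aic(460.0, 2), aic(455.0, 5))
print(bic(460.0, 2, 400), bic(455.0, 5, 400))
```

Note that the BIC penalises extra parameters more harshly than the AIC once $n_{\rm data}>e^2\approx 7$, which is why the two criteria can disagree in marginal cases.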

One of the most significant sources of systematic uncertainty in weak lensing is an effect known as intrinsic alignment (IA; Troxel & Ishak 2015; Joachimi et al. 2015; Kirk et al. 2015; Kiessling et al. 2015). IAs are coherent galaxy shape alignments that are not purely due to lensing, but rather arise from interactions with the local and large-scale gravitational field. Although in essence an astrophysical effect, IA correlations appear on much the same angular scales as cosmological ones, and it can be very difficult to disentangle the two. They are not universal, in the sense that they depend significantly on the particular galaxy sample (colour, luminosity, satellite fraction and redshift distribution; e.g. Johnston et al. 2019), and also the details of the shape measurement (Singh & Mandelbaum, 2016). To add to the problem, unlike, for example, photometric redshift error or shear bias, one cannot easily derive tight priors on IAs using simulations or external data. Some physically-motivated IA models that have been developed in the last two decades include the linear alignment (LA) model (Catelan et al., 2001; Hirata & Seljak, 2004, 2010) which, as the name suggests, assumes a linear relationship between galaxy shapes and the local tidal field; an empirical modification of this, known as the nonlinear alignment model (NLA; Hirata et al. 2007; Bridle & King 2007), which is now one of the most common IA models in contemporary weak lensing; and in recent years, the tidal alignment and tidal torquing model (TATT; Blazek et al. 2015; Blazek et al. 2019), which has provided a slightly more complex alternative to NLA. Based on perturbation theory, TATT includes additional terms that are quadratic in the tidal field, intended to encapsulate the processes driving IAs in spiral galaxies, and also additional terms that are designed to enable better IA modelling on smaller (but still 2-halo) scales.

In this paper we propose a new model selection method, which uses the real data. The general idea is to run two competing models on the blinded data, and compare them using statistical metrics. Here we explore two convenient metrics: the difference in the best $\chi^2$ per degree of freedom, $\Delta\chi^{2}_{(\mathrm{df})}=\Delta\chi^{2}/\Delta\mathrm{df}$, and the Bayes ratio $R$. We show that, for the method we are proposing, $\Delta\chi^{2}_{(\mathrm{df})}$ is a very useful metric for model selection ($R$ is less so, for reasons discussed in Section 5.2 and Appendix C). To allow us to interpret $\Delta\chi^{2}_{(\mathrm{df})}$, we use simulated data to calibrate its relation to biases in parameter space due to model insufficiency. It is this process, of running a set of simulations and measuring parameter bias as a function of observable metrics, that we refer to as “calibrating the bias-metric relation” in the following sections. The full details of how to perform this calibration are outlined in Section 4. This approach can, in principle, be applied to any type of data and/or systematics. However, in what follows, we apply it to the specific scenario of choosing an intrinsic alignment model for a cosmic shear analysis.
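The first metric is straightforward to compute once both models have been fit to the same data vector. A minimal sketch, with hypothetical numbers for an NLA-vs-TATT style comparison:

```python
def delta_chi2_per_dof(chi2_simple, dof_simple, chi2_complex, dof_complex):
    """Difference in best-fit chi^2 per extra degree of freedom between a
    simpler and a more complex model fit to the same data. Here
    dof = n_data - n_free_params, so the simpler model has more dof."""
    return (chi2_simple - chi2_complex) / (dof_simple - dof_complex)

# Hypothetical example: a 2-IA-parameter model vs a 5-IA-parameter model
# (Delta df = 3), with the more complex model improving chi^2 by 6.
print(delta_chi2_per_dof(510.0, 383, 504.0, 380))  # -> 2.0
```

The point of the calibration described below is precisely that a number like this has no fixed meaning on its own; its interpretation in terms of parameter bias is established empirically on simulations.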

The paper is structured as follows: we describe the theoretical modelling of the cosmic shear two-point data vector in Section 2. In Section 3 we describe how the synthetic cosmic shear data is generated, including the choice of IA scenarios. The ingredients and steps for the model selection method are described in Section 4. Our results when applying our method to the problem of IAs in the Dark Energy Survey Year 3 (DES Y3) are presented in Section 5. Finally, in Section 6 we summarise our findings and their significance in the context of the field.

2 Theory & Modelling

We carry out our analysis in the context of the flat $\Lambda$CDM cosmological model. The cosmological parameters are $\left\{\Omega_{\rm m},\Omega_{\rm b},h_{0},A_{\rm s},n_{\rm s},\Omega_{\nu}h^{2}\right\}$, where $\Omega_{\rm m}$ is the density parameter for matter, and $\Omega_{\rm b}$ the equivalent for baryons; $h_{0}$ is the dimensionless Hubble constant; $A_{\rm s}$ and $n_{\rm s}$ are the amplitude and slope of the primordial curvature power spectrum at a scale of $k=0.05\,\mathrm{Mpc}^{-1}$ respectively; and $\Omega_{\nu}h^{2}$ is the neutrino mass density parameter. We assume three degenerate massive neutrino species, following Krause et al. (2021). We discuss the nuisance parameters of our analysis in the following sections. Prior choices are further described in Appendix A.

2.1 Modelling Cosmic Shear

The impact of gravitational lensing along a particular line of sight is determined by two quantities, known as convergence and shear. The convergence $\kappa$ term of the weak lensing transformation describes how much a galaxy on a particular line of sight is distorted due to intervening large scale structure. It is defined as the weighted mass overdensity $\delta$, integrated along the line of sight to the distance of the source $\chi_{\mathrm{s}}$:

\kappa\left(\boldsymbol{\theta}\right)=\int_{0}^{\chi_{\mathrm{s}}}\mathrm{d}\chi\,W(\chi)\,\delta(\boldsymbol{\theta},\chi), (1)

where $\boldsymbol{\theta}$ is the angular position at which the source is observed. The kernel $W(\chi)$, defined in Eq. (5), is sensitive to the relative distances of the source and the lens. It is this geometrical term that makes cosmic shear sensitive to the expansion history of the Universe.

The two-point cosmic shear correlations $\xi_{\pm}(\theta)$ are obtained by decomposing $\kappa$ into $E$- and $B$-mode components (Crittenden et al., 2002; Schneider et al., 2002). For two redshift bins $i$ and $j$, they can be written in terms of the convergence power spectrum $C_{\kappa}(\ell)$ at an angular wavenumber $\ell$ as

\xi_{+}^{ij}(\theta)=\sum_{\ell}\frac{2\ell+1}{4\pi}G_{\ell}^{\pm}\left(\cos\theta\right)\left[C_{\kappa,EE}^{ij}(\ell)+C_{\kappa,BB}^{ij}(\ell)\right], (2)
\xi_{-}^{ij}(\theta)=\sum_{\ell}\frac{2\ell+1}{4\pi}G_{\ell}^{\pm}\left(\cos\theta\right)\left[C_{\kappa,EE}^{ij}(\ell)-C_{\kappa,BB}^{ij}(\ell)\right], (3)

where the functions $G^{\pm}_{\ell}(x)$ are calculated from Legendre polynomials $P_{\ell}(x)$ and averaged over angular bins (see Eqs. 19 and 20 in Krause et al. 2021).

Assuming the Limber approximation (Limber, 1953; LoVerde & Afshordi, 2008), the 2D convergence power spectrum $C^{ij}_{\kappa}(\ell)$ is related to the 3D matter power spectrum as:

C_{\kappa}^{ij}(\ell)=\int_{0}^{\chi(z_{\textrm{max}})}\mathrm{d}\chi\frac{W^{i}(\chi)W^{j}(\chi)}{\chi^{2}}P_{\delta}\left(\frac{\ell+0.5}{\chi},z(\chi)\right), (4)

where $P_{\delta}(k,z)$ is the nonlinear matter power spectrum and the lensing weight is:

W^{i}(\chi)=\frac{3H_{0}^{2}\Omega_{\mathrm{m}}}{2c^{2}}\frac{\chi}{a(\chi)}\int_{\chi}^{\chi_{\mathrm{H}}}\mathrm{d}\chi^{\prime}\,n^{i}\left(z(\chi^{\prime})\right)\frac{\mathrm{d}z}{\mathrm{d}\chi^{\prime}}\frac{\chi^{\prime}-\chi}{\chi^{\prime}}, (5)

with the source galaxy redshift distribution $n^{i}(z)$ normalised to integrate to 1, and $\chi_{\mathrm{H}}$ the horizon distance. We follow Krause et al. (2021), and model $P_{\delta}$ using a combination of CAMB (Lewis et al., 2000) for the linear part, and HaloFit (Takahashi et al., 2012) for nonlinear modifications. In theory, the power spectra of convergence, $C_{\kappa}$, and cosmological shear, $C_{\gamma}$, are the same, and can be modelled fairly simply as described in Eq. (4). In practice, however, $\xi_{\pm}$ measurements are sensitive not only to the pure cosmological shear, but also to additional correlations due to, e.g., intrinsic alignments. In the presence of IAs, the convergence spectra in Eqs. (2) and (3) are replaced by $C_{\gamma}$, the calculation of which we come to in Section 2.2.
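The Limber integral of Eqs. (4) and (5) is simple to evaluate numerically. The toy sketch below does so in Python with stand-in ingredients: a crude power-law matter power spectrum with no redshift evolution, and an arbitrary analytic $n(z)$. A real analysis would instead use CAMB/HaloFit outputs and the survey redshift distributions; only the structure of the calculation is faithful here.

```python
import numpy as np

# Flat LCDM comoving distances (c in km/s, H0 in km/s/Mpc)
c, H0, Om = 299792.458, 70.0, 0.3
zgrid = np.linspace(0.0, 2.0, 400)
Ez = np.sqrt(Om * (1 + zgrid)**3 + 1 - Om)
chi = np.concatenate([[0.0], np.cumsum(c / H0 / Ez[1:] * np.diff(zgrid))])  # Mpc

# Toy source redshift distribution, normalised to integrate to 1
nz = zgrid**2 * np.exp(-(zgrid / 0.5)**1.5)
nz /= np.trapz(nz, zgrid)

# Lensing efficiency W(chi), Eq. (5): note chi/a(chi) = chi*(1+z)
W = np.zeros_like(chi)
for i in range(len(chi) - 1):
    mask = chi > chi[i]
    integrand = nz[mask] * (chi[mask] - chi[i]) / chi[mask]
    W[i] = (1.5 * (H0 / c)**2 * Om * chi[i] * (1 + zgrid[i])
            * np.trapz(integrand, zgrid[mask]))

def p_delta(k):
    """Toy power-law P_delta(k); no redshift evolution in this sketch."""
    return 2e4 * k / (1 + (k / 0.02)**2.5)

def c_kappa(ell):
    """Limber integral of Eq. (4) for an auto-correlation (i = j)."""
    integrand = W[1:]**2 / chi[1:]**2 * p_delta((ell + 0.5) / chi[1:])
    return np.trapz(integrand, chi[1:])

print(c_kappa(100.0))
```

The per-bin spectra $C^{ij}_\kappa(\ell)$ of a tomographic analysis follow by repeating this with one $W^i$ per redshift bin.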

2.2 Modelling Intrinsic Alignments

In general terms, the impact of intrinsic alignments (IAs) can be thought of as adding a coherent additional component to each galaxy’s shape. That is, the observed ellipticity is $\gamma^{\rm obs}=\gamma^{\rm G}+\gamma^{\rm I}+\epsilon_{\rm SN}$, or the sum of a shear due to cosmological lensing, an IA-induced shear, and a random shape noise component. Although the latter is typically dominant for any single galaxy, it cancels when the ellipticity is averaged across a large population of galaxies. At the level of angular correlation functions, one has:

C^{ij}_{\gamma}(\ell)=C^{ij}_{\kappa}(\ell)+C^{ij}_{\rm GI}(\ell)+C^{ji}_{\rm GI}(\ell)+C^{ij}_{\rm II}(\ell). (6)

The first term, $C_{\kappa}$, is the auto-correlation of cosmological lensing, and is defined in Eq. (4). The intrinsic-intrinsic contribution is referred to as the II term, and arises from galaxies that are spatially close to one another. The intrinsic-shear cross-correlation is known as the GI term, and emerges from the fact that galaxies at different distances along the same line of sight can either be lensed by, or experience gravitational tidal interaction with, the same large scale structure.

Again assuming the Limber approximation, the angular power spectra can be written as

C^{ij}_{\rm II}(\ell)=\int_{0}^{\chi_{H}}\mathrm{d}\chi\frac{n^{i}(\chi)n^{j}(\chi)}{\chi^{2}}P_{\rm II}\left(k=\frac{\ell+0.5}{\chi},\chi\right) (7)

and

C^{ij}_{\rm GI}(\ell)=\int_{0}^{\chi_{H}}\mathrm{d}\chi\frac{W^{i}(\chi)n^{j}(\chi)}{\chi^{2}}P_{\rm GI}\left(k=\frac{\ell+0.5}{\chi},\chi\right). (8)

Given Eqs. (4), (7) and (8), we have the ingredients to use Eqs. (2) and (3) to predict the observable $\xi_{\pm}$. Note that the GI and II power spectra are model dependent. Indeed, how one calculates them is a significant analysis choice in any cosmic shear analysis. In the sections below we describe the two model choices explored in this work.

2.2.1 TATT Model

The tidal alignment and tidal torque model (TATT; Blazek et al. 2019) is based on nonlinear perturbation theory, which is used to expand the field of intrinsic galaxy shapes $\gamma^{\rm I}$ in terms of the tidal field $s$ and the matter overdensity $\delta$. Whereas $\delta$ is a scalar at all points in space, $\gamma^{\rm I}$ and $s$ are $3\times 3$ matrices, defining an ellipsoid in 3D space. Although in principle the expansion could be extended to any order, our implementation includes terms up to quadratic in the tidal field:

\gamma^{\rm I}_{ij}=C_{1}s_{ij}+C_{2}\sum_{k}s_{ik}s_{kj}+b_{\rm TA}C_{1}\delta s_{ij}, (9)

where $C_{1}$, $C_{2}$ and $b_{\rm TA}$ are free parameters. This leads to the power spectra:

P_{\rm GI}=C_{1}P_{\delta}+b_{\rm TA}C_{1}P_{0|0E}+C_{2}P_{0|E2}, (10)
P^{\rm EE}_{\rm II}=C^{2}_{1}P_{\delta}+2b_{\rm TA}C^{2}_{1}P_{0|0E}+b^{2}_{\rm TA}C^{2}_{1}P_{0E|0E}+C^{2}_{2}P_{E2|E2}+2C_{1}C_{2}P_{0|E2}+2b_{\rm TA}C_{1}C_{2}P_{0E|E2}, (11)
P^{\rm BB}_{\rm II}=b^{2}_{\rm TA}C^{2}_{1}P_{0B|0B}+C^{2}_{2}P_{B2|B2}+2b_{\rm TA}C_{1}C_{2}P_{0B|B2}. (12)

The various subscripts to the power spectra indicate correlations between different order terms in the expansion of $\gamma^{\mathrm{I}}$. They can all be calculated to one-loop order as integrals of the linear matter power spectrum over $k$ (see Blazek et al. 2019 for the full definitions). As can be seen here, the TATT model predicts both $E$- and $B$-mode II contributions. These are propagated to separate $E$- and $B$-mode angular power spectra, which enter $\xi_{\pm}$ in Eqs. (2) and (3). The amplitudes are defined, by convention, as:

C_{1}(z)=-A_{1}\frac{\bar{C}_{1}\rho_{\rm c}\Omega_{\mathrm{m}}}{D(z)}\left(\frac{1+z}{1+z_{0}}\right)^{\eta_{1}}, (13)
C_{2}(z)=5A_{2}\frac{\bar{C}_{1}\rho_{\rm c}\Omega_{\mathrm{m}}}{D^{2}(z)}\left(\frac{1+z}{1+z_{0}}\right)^{\eta_{2}}. (14)

The pivot redshift $z_{0}$ and the constant $\bar{C}_{1}$ are fixed to values of $z_{0}=0.62$ and $\bar{C}_{1}=5\times 10^{-14}\,h^{-2}M_{\odot}^{-1}\mathrm{Mpc}^{3}$. Again, this is a convention, such that $C_{1}(z)$ and $C_{2}(z)$ are roughly of order 1 for a typical population of source galaxies. The power law term in $C_{1}(z)$ and $C_{2}(z)$ adds some flexibility to capture possible redshift evolution beyond what is already encoded in the model. Our implementation of the TATT model then has five free parameters: $A_{1},A_{2},\eta_{1},\eta_{2},b_{\rm TA}$, which we allow to vary with wide flat priors $A_{1},A_{2},\eta_{1},\eta_{2}\in[-5,5]$, $b_{\rm TA}\in[0,2]$. This choice of uninformative priors is motivated by the fact that IAs are very sensitive to the properties of the galaxy population, making it very difficult to derive informative priors, and resulting in a lack of directly transferable constraints in the literature for the TATT model parameters. Although intended to match up with different alignment mechanisms, in practice $A_{1}$ and $A_{2}$ capture any correlations that scale linearly and quadratically with the tidal field. The third amplitude $b_{\rm TA}$ is designed to capture the fact that galaxies over-sample densely populated regions (i.e., one cannot sample the $\gamma^{\rm I}$ field uniformly throughout the Universe).
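Eqs. (13) and (14) are cheap to evaluate. The sketch below does so in Python; the $1/(1+z)$ stand-in for the growth factor $D(z)$ and the numerical value used for $\bar{C}_{1}\rho_{\rm c}$ are illustrative assumptions (a real pipeline computes the proper linear growth factor), but the structure mirrors the equations above.

```python
C1_BAR_RHO_C = 0.0134  # assumed value of \bar{C}_1 \rho_c in this convention
Z0 = 0.62              # pivot redshift

def growth(z):
    """Crude stand-in for the linear growth factor D(z); exact only in
    matter domination. Replace with a proper calculation in practice."""
    return 1.0 / (1.0 + z)

def C1(z, A1, eta1, Om=0.3):
    """Tidal alignment amplitude, Eq. (13). Note the minus sign convention."""
    return -A1 * C1_BAR_RHO_C * Om / growth(z) * ((1 + z) / (1 + Z0))**eta1

def C2(z, A2, eta2, Om=0.3):
    """Tidal torquing amplitude, Eq. (14), with the 1/D^2(z) scaling."""
    return 5 * A2 * C1_BAR_RHO_C * Om / growth(z)**2 * ((1 + z) / (1 + Z0))**eta2

# At the pivot redshift the power-law factor is exactly 1
print(C1(0.62, A1=0.7, eta1=-1.7), C2(0.62, A2=-1.36, eta2=-2.5))
```

Evaluating these on the $n(z)$ support of each tomographic bin is all that is needed to build the spectra of Eqs. (10)-(12) from the FAST-PT templates.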

For this work we use the DES Y3 implementation of TATT, within CosmoSIS v1.6 (https://bitbucket.org/joezuntz/cosmosis/wiki/Home; des-y3 branch of cosmosis-standard-library; Zuntz et al. 2015). The power spectra in Equations (10)-(12) (with the exception of the nonlinear matter power spectrum $P_{\delta}$) are generated using FAST-PT v2.1 (https://github.com/JoeMcEwen/FAST-PT; McEwen et al. 2016; Fang et al. 2017).

2.2.2 NLA Model

Although chronologically older and more commonly used, the nonlinear alignment model (NLA; Bridle & King 2007) is a subspace of TATT. Built on the assumption that galaxy shapes align linearly with the background tidal field, it predicts:

P_{\rm GI}=C_{1}(z)P_{\delta},\;\;\;P_{\rm II}=C^{2}_{1}(z)P_{\delta}, (15)

with the amplitude $C_{1}(z)$ as defined in Eq. (13) in our implementation. The NLA model as implemented here differs from its predecessor, the linear alignment model (Catelan et al., 2001; Hirata & Seljak, 2004, 2010), in that $P_{\delta}$ in the above equations is the full nonlinear matter power spectrum (in our case generated using HaloFit), not the linear version. Unlike the original formulation, our implementation of NLA also includes a power law redshift dependence in $C_{1}(z)$ to capture any additional evolution beyond the basic model (as in Eq. (13) above). In total, our implementation of the NLA model has two free parameters, $A_{1}$ and $\eta_{1}$, which we vary with the priors given in Section 2.2.1.
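Because NLA is the subspace of TATT with $C_{2}=b_{\rm TA}=0$, its spectra (Eq. 15) reduce to simple rescalings of $P_{\delta}$. A minimal sketch, using arbitrary illustrative numbers; note that for $A_{1}>0$ the sign convention of Eq. (13) gives $C_{1}<0$, so the GI term is negative while the II term is always positive:

```python
import numpy as np

def nla_spectra(C1_z, P_delta):
    """NLA GI and II power spectra, Eq. (15): both are rescalings of the
    nonlinear matter power spectrum by the redshift-dependent amplitude."""
    P_GI = C1_z * P_delta
    P_II = C1_z**2 * P_delta
    return P_GI, P_II

# Toy example: an arbitrary (negative) amplitude and a few P_delta values
P_GI, P_II = nla_spectra(-0.01, np.array([1e4, 5e3, 1e3]))
print(P_GI, P_II)
```

This sign structure is what allows the GI term to partially cancel the cosmological signal in cross-bin correlations.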

2.3 Other Nuisance Parameters & Scale Cuts

Both the TATT and NLA pipelines include free parameters for redshift error and residual shear bias. We adopt the same modelling as Krause et al. (2021), giving us one $\Delta z$ and one $m$ parameter per redshift bin, or a total of eight nuisance parameters. Note, however, that these parameters are prior dominated for Y3 cosmic shear-only chains, and so add relatively little to the effective dimensionality. For details about the priors see Appendix A and Table 2. We also adopt the fiducial DES Y3 cosmic shear scale cuts (see Krause et al. 2021 for an explanation of how these were derived).

3 Creating and analysing the cosmic shear data vector

In this section we describe how we generate mock data. This is necessary to calibrate the relation between bias in cosmological parameters and statistical metrics used for model comparison, which is central to our method for model selection. Essentially we wish to create an ensemble of data vectors that span a useful range of bias in cosmological parameters and $\Delta\chi^{2}_{(\mathrm{df})}$ (or $R$), allowing us to map out the relation between the two. In Section 3.1 we focus on how to simulate the cosmological lensing terms (which depend on cosmology, not the IA model). Then in Section 3.2 we describe the IA terms (IA model-dependent). We explain how we incorporate noise into our analyses, and why it is necessary, in Section 3.3. Finally, we describe our sampler choices in Section 3.4, and in particular an approximation using importance sampling that we use to accelerate the analysis of the noisy data vectors.

3.1 Generating Mock Data

For a given set of input parameters, we generate a noiseless DES Y3-like cosmic shear data vector, $\boldsymbol{D}$, using the theory pipeline described in Section 2. We assume the fiducial Y3 redshift distributions, as presented in Myles, Alarcon et al. (2021). In all data vectors, the same input flat $\Lambda$CDM cosmology is used $(\Omega_{\mathrm{m}}=0.29,\;A_{\mathrm{s}}=2.38\times 10^{-9},\;\Omega_{\mathrm{b}}=0.052,\;h=0.75,\;n_{\mathrm{s}}=0.99,\;\Omega_{\nu}h^{2}=0.00053)$. This corresponds to $\sigma_{8}=0.79$, $S_{8}=0.77$, where $S_{8}\equiv\sigma_{8}\sqrt{\Omega_{\mathrm{m}}/0.3}$. We choose these to match the marginalised mean values from the DES Y1 $3\times 2$pt chain used to generate IA samples (see Section 3.2 below). Note, however, that the exact values are not expected to affect our results. We also fix all the redshift and shear calibration nuisance parameters to zero when generating data vectors.

Figure 1: An illustration of how we generate samples in IA model parameter space for this work. The purple contours show the 68% and 95% confidence levels from the TATT model analysis of the DES Y1 $3\times 2$pt data (Samuroff et al., 2019). Overlain (black points) are the IA samples we derive from this posterior probability distribution after marginalising over all other parameters. On the diagonal, we show the Y1 marginal posterior (purple), and also the distribution of IA samples (black), both normalised to integrate to 1 over the prior range. As shown, the latter is slightly broader than we would obtain by drawing from the DES posterior distribution.

3.2 Choosing IA Scenarios

When constructing simulated data vectors, it is important to remember that IA model parameters are not independent. The total GI+II intrinsic alignment component in a scenario with, e.g., $A_{1}=A_{2}=1$ is very different from one with $A_{1}=0.1, A_{2}=1$. As a consequence, it is possible for two sets of input IA parameters to give similar cosmological bias (when analysed with NLA), but quite different $\chi^{2}$ values. Specific combinations of TATT parameter values may enhance or cancel out cosmological parameter bias, and so it is useful to sample the 5D TATT parameter space rather than scaling up individual parameters to explore the potential for cosmological parameter bias due to model insufficiency. Therefore, instead of a single mock data vector, we generate a set of 21 data vectors, all with the same cosmology, but with different possible IA scenarios. The number of mock data vectors is an analysis choice. The more we generate, the better we cover the IA parameter space, but at increased computational cost. We verified that 21 scenarios were sufficient to include samples with low, medium and high bias in cosmological parameters, while remaining reasonable in terms of computational cost (i.e. the number of chains to run).

To do this, we follow the recipe set out in Section 2 of DeRose et al. (2019). Starting with the posterior from the DES Y1 $3\times 2$pt TATT analysis (the purple contours in Figure 1; Samuroff et al. 2019), we evaluate the covariance matrix of the $N_{p}=5$ TATT parameters, and perform an eigenvalue decomposition. We then use Latin Hypercube sampling to generate $N_{\rm samp}$ samples, which are roughly evenly distributed in $N_{p}$-dimensional space. Finally, we use the eigenvalues/vectors to rotate and normalise those samples into the parameter space. The results are shown in Figure 1. The idea is that this provides a slightly broader coverage than could be obtained simply by drawing points from the joint posterior, while maintaining the correlations between parameters. In this way, we can cover a range of marginal cases, which are pessimistic, but still consistent with the data; this is useful, since for our purposes it is more important to span the range of plausible TATT model parameters than to preserve the statistics of the Y1 posterior exactly.
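The recipe above can be sketched in a few lines of Python using `scipy.stats.qmc.LatinHypercube`. The mean vector, the synthetic "chain" used to build the covariance, and the $\pm 2\sigma$ scaling are all placeholders standing in for the actual DES Y1 chain and the choices made there:

```python
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(0)

# Placeholder stand-in for the DES Y1 TATT posterior: a fake chain over
# the 5 parameters (A1, A2, eta1, eta2, b_TA) with assumed scales.
mean = np.array([0.7, -1.36, -1.7, -2.5, 1.0])
chain = rng.normal(size=(4000, 5)) @ np.diag([0.3, 0.8, 1.5, 1.8, 0.4])

# Covariance of the chain and its eigendecomposition
cov = np.cov(chain, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Latin Hypercube samples, evenly spread over the 5D unit cube...
n_samp = 21
u = qmc.LatinHypercube(d=5, seed=42).random(n=n_samp)

# ...mapped to +/- 2 sigma along each eigenvector (an assumed range),
# then rotated back into the physical TATT parameter space
x = (2 * u - 1) * 2.0 * np.sqrt(eigvals)   # coordinates in the eigenbasis
samples = mean + x @ eigvecs.T             # the 21 IA scenarios

print(samples.shape)  # (21, 5)
```

Each row of `samples` then defines one input IA scenario $\boldsymbol{\theta}_{\mathrm{IA},i}$ for the mock data vectors.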

3.3 Adding Noise

Since real measurements unavoidably include an (unknown) noise realisation, the calibration of the bias-metric relation is inherently a probabilistic problem (we will return to this point in Section 5; see the discussion there for details). For this reason, it is important that our simulations capture all sources of scatter in the data.

For each of our 21 IA scenarios, defined by a set of input TATT parameter values $\boldsymbol{\theta}_{\mathrm{IA},i}$, we have a set of noisy data vectors $\tilde{\boldsymbol{D}}_{i,j}=\boldsymbol{D}(\boldsymbol{\theta}_{\mathrm{IA},i})+\boldsymbol{N}_{j}$, where noise realisation $\boldsymbol{N}_{j}$ is drawn from the covariance matrix, and is assumed to be independent of $\boldsymbol{\theta}_{\mathrm{IA},i}$. We use the final DES Y3 covariance matrix, which is analytic and includes a Gaussian shape noise and cosmic variance contribution, as well as higher order non-Gaussian and super-sample terms (Friedrich et al., 2021). In total we generate 50 noise realisations, which we apply to each data vector. This gives us a collection of 21 noiseless data vectors, and $21\times 50=1050$ noisy ones.
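Generating the noisy ensemble amounts to drawing Gaussian realisations from the covariance and adding the same set to every scenario. A toy sketch with placeholder dimensions and a synthetic positive-definite covariance (the real inputs are the Y3 data vectors and analytic covariance):

```python
import numpy as np

rng = np.random.default_rng(7)

# Placeholder sizes: 3 IA scenarios, data vectors of length 20, 50 noise
# realisations (the paper uses 21 scenarios and the full Y3 data vector).
n_dim, n_ia, n_noise = 20, 3, 50
noiseless = rng.normal(size=(n_ia, n_dim))           # D(theta_IA,i)
A = rng.normal(size=(n_dim, n_dim)) * 0.05
cov = A @ A.T + np.eye(n_dim) * 0.1                  # positive definite

# Draw the noise realisations once via a Cholesky factor, then add the
# same set to every scenario (noise independent of theta_IA)
L = np.linalg.cholesky(cov)
noise = rng.normal(size=(n_noise, n_dim)) @ L.T      # N_j ~ N(0, cov)
noisy = noiseless[:, None, :] + noise[None, :, :]    # D_tilde_{i,j}

print(noisy.shape)  # (3, 50, 20)
```

Reusing one set of $\boldsymbol{N}_{j}$ across scenarios, as here, means differences between scenarios at fixed $j$ isolate the IA signal rather than the noise draw.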

For testing, however, it is convenient to arbitrarily choose a single fixed noise realisation, which we refer to as fiducial noise. Figure 2 shows an example DES Y3-like data vector, generated using the setup described above, with the fiducial noise realisation added. For this particular example, the input TATT parameters are the mean values from the Y1 posteriors in Figure 1. For reference, we also show the noiseless version (purple solid), as well as the separate (again, noiseless) IA contributions. Since the IA signal in the lowest bin (1,1) seems to dominate, one could reasonably ask whether we could simply focus on this part of the data vector for model selection. We ultimately decide against this for a few reasons. First, although the IA signal is strongest in bin (1,1), we can see there is also non-negligible signal in the surrounding bins (e.g. (2,1) and (3,1)). Indeed, during the DES Y3 analysis, a range of IA mitigation techniques were explored, including dropping the lowest auto bin correlations. For moderate TATT scenarios, it was found that this could not reliably eliminate cosmology biases, suggesting the IA contamination is not confined to the (1,1) data vector. Additionally, the degree to which the low redshift II contribution dominates the IA signal is somewhat dependent on the input TATT parameters, and so the strength of this assumption varies in parameter space.

Figure 2: An example of a noisy simulated data vector of the type used in this paper. Each panel shows a redshift bin combination (as labelled), and the upper and lower triangles show $\xi_{+}$ and $\xi_{-}$ respectively. In each panel we show the simulated cosmic shear data vector with fiducial noise (black points with error bars), as well as the noiseless version (smooth purple). We also show the GI and II intrinsic alignment components separately. For reference, the input IA parameters here correspond to the mean of the DES Y1 posterior discussed in Section 3.2 $(A_{1}=0.7, A_{2}=-1.36, \eta_{1}=-1.7, \eta_{2}=-2.5, b_{\rm TA}=1)$. The grey bands represent the fiducial DES Y3 cosmic shear scale cuts, i.e., the scales removed from our analysis.

3.4 Choice of Sampler

3.4.1 Nested Sampling

To sample the cosmic shear likelihood, we use the PolyChord nested sampling algorithm (Handley et al., 2015), which simultaneously generates estimates for the multidimensional posterior $\mathcal{P}(\boldsymbol{\theta}|\boldsymbol{D},M)$ and the evidence for a given model, $\mathcal{Z}(\boldsymbol{D}|M)$. This matches the DES Y3 choice, and has been validated in terms of both the evidence and the contour size compared with a long Monte Carlo chain (Lemos, Weaverdyck et al., 2022). We briefly explored the possibility of using MultiNest (Feroz et al., 2019), which is conceptually similar, but significantly faster. Ultimately, however, we found that MultiNest underestimates the width of the posteriors in all cases we tested (both NLA and TATT; see Appendix D). It also gives inaccurate evidence values (Lemos, Weaverdyck et al., 2022), which tend to skew towards preferring NLA. For these reasons, we did not pursue this option.

To obtain estimates for the best $\chi^2$, we use oversampled chains (i.e., output with $10\times$ the number of points saved in the standard chains). This approach has been tested in the Y3 cosmic shear setup, and shown to give comparable results to running a likelihood maximiser (Secco, Samuroff et al., 2022). All sampling, as well as the modelling steps described in Section 2, is carried out using CosmoSIS.

3.4.2 Importance Sampling

To assess the impact of data vector noise, in addition to nested sampling we also make use of importance sampling (IS; Neal 1998; Tokdar & Kass 2010; see also Lewis & Bridle 2002; Padilla et al. 2019 for cosmology-specific applications). For each IA scenario $\boldsymbol{\theta}_{\mathrm{IA},i}$, we wish to estimate the shape and position of the $S_8-\Omega_{\rm m}$ posterior, as well as the best fit and evidence. Running full chains for every combination of noise and IA scenario would be expensive, and IS provides a fast approximation.

Say one wants to estimate the characteristics of a distribution $P$ over a parameter $\theta$. One can estimate the mean of a function $f(\theta)$ under $P$ as:

$\hat{f}=\int f(\theta)\,P(\theta)\,\mathrm{d}\theta.$ (16)

This can be rewritten in terms of a second distribution $P_0$:

$\hat{f}=\int f(\theta)\,\frac{P(\theta)}{P_0(\theta)}\,P_0(\theta)\,\mathrm{d}\theta$ (17)
$\approx\sum_{\theta_i\sim P_0} f(\theta_i)\,w(\theta_i),$ (18)

where we have redefined the ratio of distributions as a weight $w=P/P_0$. The second line follows as a Monte Carlo estimate of the first, and the sum runs over values of $\theta$ drawn from $P_0(\theta)$. The equations above make no assumptions about Gaussianity, or about the nature of the distributions. To work well, however, the method does require $P_0$ to be non-zero over the range of $\theta$ for which we wish to estimate $P$, and it works better when the number of samples is large. Functionally, it also requires (a) that one has, or can generate, samples from $P_0$, and (b) that for any given $\theta$, one can evaluate both $P$ and $P_0$.
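As a toy illustration of Eqs. (16)-(18) (this is a one-dimensional sketch of our own, not the paper's pipeline; the distributions and names here are purely illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

# Proposal P0: a broad Gaussian we can sample from.
# Target P: a narrower, shifted Gaussian we can only evaluate.
P0 = norm(loc=0.0, scale=2.0)
P = norm(loc=0.5, scale=1.0)

# Draw samples from P0 and weight each by w = P / P0.
theta = P0.rvs(size=200_000, random_state=rng)
w = P.pdf(theta) / P0.pdf(theta)
w /= w.sum()  # self-normalised importance weights

# Estimate the mean of f(theta) = theta under P.
mean_est = np.sum(w * theta)
print(mean_est)  # close to 0.5, the true mean of P
```

Here the weights are self-normalised, which is convenient when the target distribution (e.g. a posterior) is only known up to a constant.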

For our application, $P_0=\mathcal{P}(\boldsymbol{\theta}|\boldsymbol{D},M)$ is a reference posterior obtained from running a chain on the noiseless data vector $\boldsymbol{D}(\boldsymbol{\theta}_{\rm IA})$. As before, we use the $10\times$ oversampled PolyChord output for this. The target distribution $P=\mathcal{P}(\boldsymbol{\theta}|\tilde{\boldsymbol{D}},M)$ is the posterior we are trying to estimate, conditioned on a noisy data vector $\tilde{\boldsymbol{D}}$. With this setup, we can estimate $P$ for each noise realisation by simply iterating through the samples from $P_0$ and assigning each a weight equal to the ratio of the two posteriors.

In addition to the target posterior for each model, we also estimate the best $\chi^2$. For this, we create a high-density pool of samples by merging all of our oversampled PolyChord chains (21 IA samples), together with a small number of additional chains run with a Y1-like covariance matrix. This gives us over a million points in parameter space per model. For each noisy data vector, we re-evaluate the likelihood at each point, and select the maximum. Given an estimate for the best $\chi^2$ from IS, and assuming a Gaussian likelihood, the Bayesian evidence can then be estimated as (see Section 3 of Joachimi et al. 2021b):

$\ln\mathcal{Z}(\tilde{\boldsymbol{D}}_i|M)=\ln\mathcal{Z}_0(\boldsymbol{D}|M)-0.5\left(\chi^2_i-\chi^2_0\right),$ (19)

where the index $i$ indicates a noise realisation, and $\mathcal{Z}_0$ and $\chi^2_0$ are the evidence and best $\chi^2$ obtained from a fiducial reference chain, which in our case is the corresponding noiseless chain.
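Eq. (19) amounts to a one-line update; a minimal sketch (the function name and numbers are ours, for illustration only):

```python
def ln_evidence_noisy(ln_Z0: float, chi2_i: float, chi2_0: float) -> float:
    """Estimate ln Z for noise realisation i from a reference chain,
    assuming a Gaussian likelihood (Eq. 19)."""
    return ln_Z0 - 0.5 * (chi2_i - chi2_0)

# Example: a noise realisation that worsens the best chi^2 by 4 units
# lowers the log-evidence by 2.
print(ln_evidence_noisy(-100.0, 504.0, 500.0))  # -102.0
```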

To test the performance of our IS setup, we ran five additional PolyChord chains at different noise realisations, once in the low-bias regime and again in the intermediate ($\sim 1\sigma$) bias regime. We verified that in all cases, our IS setup recovered the best $\chi^2$, as well as the shape and mean of the posteriors, with a level of precision comparable to a full chain. Our implementation is a slight modification of the code discussed in Weaverdyck, Alves et al. (2022), and will be made available on release of that paper.

4 Model Selection

In this section, we define the components of our model selection method. In essence, we propose to calibrate the observed value of model comparison statistics against the probability of cosmological parameter bias. These quantities can be computed using noisy simulated data, but first they must be properly defined. To this end, we define how we quantify cosmological parameter bias in Section 4.1. The metrics that we tested in search of a useful bias-metric relation are discussed in Section 4.2. In Section 4.3 we discuss how we handle unphysical values and unconverged samples. We summarise our model selection method in Section 4.4.

4.1 Significance Level of Cosmological Parameter Biases

Now that we have a set of noisy data vectors, the next step is to fit them with all parameters free. We analyse each data vector twice, fitting our full set of cosmological and nuisance parameters, in one case using TATT, and in the other NLA. We then define bias as the distance between the peaks of the marginalised posteriors in the $S_8-\Omega_{\rm m}$ plane. Figure 3 illustrates this for a particular simulated data vector. In brief, the algorithm evaluates the Euclidean distance between the peaks of the two posteriors, $\mathcal{P}_1$ and $\mathcal{P}_2$, in $S_8-\Omega_{\rm m}$ space. It then sequentially computes the confidence ellipses of $\mathcal{P}_1$ at different $\sigma$ levels, and finds the value of $N_\sigma$ that minimises the distance between the ellipse and the peak of $\mathcal{P}_2$. Note that this is the same recipe used in Krause et al. (2021). The choice of $S_8$ and $\Omega_{\rm m}$ as the parameters of interest comes simply from the fact that these are the cosmological parameters best constrained by DES. One could conceivably use a more complicated separation metric that is sensitive to the full parameter space, along the lines of those used for assessing tensions between data sets (e.g. Lemos, Raveri, Campos et al. 2021). For our purposes, however, we follow Krause et al. (2021) and consider the simpler 2D metric to be sufficient.
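Under a Gaussian approximation to the reference posterior, the ellipse-matching step reduces to a Mahalanobis distance calculation. The following sketch (our own simplification, not the Krause et al. 2021 implementation, which works with the full, possibly non-Gaussian, marginal posterior) illustrates the idea:

```python
import numpy as np
from scipy.stats import chi2, norm

def bias_nsigma(samples_ref, peak_other):
    """Approximate N-sigma separation of `peak_other` from a reference
    posterior, using a Gaussian fit to its samples.
    `samples_ref` is an (N, 2) array in the (Omega_m, S8) plane."""
    mean = samples_ref.mean(axis=0)
    cov = np.cov(samples_ref, rowvar=False)
    delta = np.asarray(peak_other) - mean
    d2 = delta @ np.linalg.solve(cov, delta)  # squared Mahalanobis distance
    cl = chi2.cdf(d2, df=2)                   # 2D confidence level of the matching ellipse
    return norm.ppf(0.5 * (1.0 + cl))         # convert to a 1D-equivalent sigma

# Example: reference samples from a tight Gaussian, peak offset by ~1 posterior width.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal([0.3, 0.8], 1e-4 * np.eye(2), size=200_000)
print(bias_nsigma(samples, [0.31, 0.8]))  # ~0.51
```

Note that the final conversion from a 2D confidence level to an "$N\sigma$" figure is convention-dependent, and should be chosen to match the convention used in the calibration.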

We use noisy simulated data, as described in Section 3.3 – this is an important feature of our analysis, and is necessary to allow us to meaningfully interpret our statistical metrics. Consequently, the relative separation of the two posteriors is a more useful quantity than the distance from the input values of $S_8$ and $\Omega_{\rm m}$. The $0.22\sigma$ value shown in Figure 3 is assessed relative to the TATT posterior. This was found to be more stable than assessing it relative to the NLA posterior, particularly in relatively extreme IA scenarios, where the NLA posteriors are significantly shifted and can be distorted by prior edge effects.

Finally, it is implicit in the above that the marginalised TATT constraints represent correct results against which to measure bias. That is, when we refer to bias, we mean bias in the cosmological parameters when assuming the NLA model, with respect to what we find when assuming the TATT model. Although this is clearly reasonable (since our data were created using TATT), marginal contours can be subject to projection effects. Indeed, since some of the TATT parameters are relatively poorly constrained by a shear-shear analysis alone, the two IA models cannot be assumed to experience projection effects to the same degree. We test this in Appendix A, and find projection offsets between TATT and NLA at the level of $0.1\sigma$. This is well below the threshold of $0.3\sigma$ used for this work, and is thus unlikely to significantly affect our results.

Figure 3: An example of how cosmological parameter bias is defined for a given IA scenario and noisy data vector. The purple point and the dotted ellipse show the maximum likelihood and $0.3\sigma$ contour obtained from the analysis of a noisy simulated data vector with the TATT model. The black versions show the same for the NLA model. The vector connecting the two peaks in the $\Omega_{\rm m}-S_8$ plane defines our bias metric. Note that the TATT contour is slightly offset from the input parameter values (the dashed lines) due to noise and projection effects. It is for this reason that the relative separation, rather than the distance from the input, is the more appropriate bias definition.

4.2 Model Comparison Statistics

We investigate two commonly used test statistics: the $\chi^2$ difference and the Bayes ratio. We show in Section 5 that the former is more robust against noise than the latter, and is therefore a more useful metric for the method we are proposing.

4.2.1 $\chi^2$ Difference Tests

When dealing with nested models (i.e., where one model is a subspace of the other, as is the case with NLA and TATT), the difference in the best $\chi^2$ that can be achieved by each model on the same data is a convenient statistic for model selection (Steiger et al., 1985; Rigdon, 1999; Schermelleh-Engel et al., 2003; Andrae et al., 2010).

The $\chi^2$ difference metric is defined as the difference in the best $\chi^2$ values of the parameter fits under the two IA models, divided by the difference in their numbers of degrees of freedom ($df$):

$\Delta\chi^2_{\rm(df)}=\frac{\chi^2_s-\chi^2_l}{df_s-df_l},$ (20)

where $s$ indicates the smaller model (the one with fewer free parameters and therefore more degrees of freedom; NLA in our example), and $l$ denotes the larger model, with more parameters and so fewer degrees of freedom (TATT in our case). Note that it is this quantity, weighted by the difference in $df$, that we refer to as $\Delta\chi^2_{\rm(df)}$ throughout the paper, and not the simple difference $\chi^2_s-\chi^2_l$; the weighting allows us to briefly compare with theoretical cut-offs in Section 5.1, though for practical purposes it is not strictly necessary – one could just as easily calibrate the raw difference. Note also that the point estimate to use for the "best $\chi^2$" here is somewhat subjective. For a given chain, we use the value closest to the peak of the multidimensional posterior. One could also use the maximum likelihood, although in practice this tends to be a slightly noisier quantity.

In the limit that the additional parameters in the larger model have no impact on the quality of the fit, the metric is exactly zero: $\Delta\chi^2=\chi^2_s-\chi^2_l=0$. Very small $\Delta\chi^2_{\rm(df)}$ values can therefore be taken as evidence that the smaller model is sufficient, given the data. In practice, however, this is an unlikely outcome, as extra parameters will typically allow the model more flexibility. Under the null hypothesis that the two models $s$ and $l$ both adequately fit the data, the numerator $\chi^2_s-\chi^2_l$ is $\chi^2$-distributed with $df_{\rm diff}=df_s-df_l$ degrees of freedom, and the expectation value is $\langle\Delta\chi^2_{\rm(df)}\rangle\sim 1$ (Wilks, 1938). One can interpret larger $\Delta\chi^2$ values using the corresponding $p$-values, to quantify the degree to which the data appear to favour the larger model. As we will discuss in Section 5, however, for our purposes it is more useful to focus on the observed relation between $\Delta\chi^2_{\rm(df)}$ and parameter bias than on formal statistical thresholds. That is, we propose to use $\Delta\chi^2_{\rm(df)}$ as an empirical tool, which requires calibration using simulated likelihood analyses for any given problem. This way, we are also free from other assumptions behind the standard use of the $\Delta\chi^2_{\rm(df)}$ metric – for example, it formally requires nested models, whereas an empirical calibration does not. Note that our approach here is functionally similar to, though motivated slightly differently from, the calibration of Posterior Predictive Distribution (PPD) $p$-values for internal consistency testing, as implemented in Doux, Baxter et al. (2021).
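As a concrete illustration of the standard $p$-value interpretation described above (the $\chi^2$ values below are invented; only $df_{\rm diff}=2$ is taken from the text):

```python
from scipy.stats import chi2

# Illustrative (invented) best-fit values for the smaller and larger model:
chi2_s, chi2_l = 245.1, 239.7
df_diff = 2  # effective df difference for the DES Y3 shear-only case

delta = chi2_s - chi2_l               # chi^2-distributed under the null (Wilks)
delta_df = delta / df_diff            # the df-weighted statistic of Eq. (20)
p_value = chi2.sf(delta, df=df_diff)  # probability of a difference this large by chance

print(delta_df, p_value)  # ~2.7, ~0.067
```

A small $p$-value would indicate that the data favour the larger model; the point of this section is that such thresholds need empirical calibration before they can be trusted to bound parameter biases.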

In principle, $\Delta\chi^2_{\rm(df)}$ is prior-independent. In Bayesian inference, however, the prior typically controls the regions of parameter space that can be explored, and so restricts the values of $\chi^2$ that can be attained. In practice, this is only an issue if the likelihood peaks outside the prior bounds (which is, in any case, usually a red flag).

One other point to remember is that, although in an ideal case with well-constrained parameters one extra parameter constitutes one fewer degree of freedom, in practice this is often not true. In such cases, one can calculate the effective number of degrees of freedom (see Raveri & Hu 2019). With the fiducial DES Y3 cosmic shear setup (minus the shear ratios), the effective degrees of freedom for TATT and NLA are 222 and 224 respectively, giving $df_{\rm diff}=2$ (compared with $df_{\rm diff}=3$ from simple parameter counting; see Secco, Samuroff et al. 2022, Table III).

In Section 5, we also briefly consider two other likelihood-based metrics: the Akaike Information Criterion (AIC; Akaike 1973) and the Bayesian Information Criterion (BIC; Schwarz 1978). Although these statistics have very different theoretical underpinnings (see Liddle 2007), they are similar in form, and can be conveniently reformulated as thresholds in $\Delta\chi^2_{\rm(df)}$. As with $p$-value cut-offs, however, they prove to be relatively under-cautious in separating high- and low-bias scenarios (see Section 5 and Figure 4).
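To make the reformulation concrete, here is a sketch of how the AIC and BIC break-even points translate into $\Delta\chi^2_{\rm(df)}$ cut-offs (the data vector length below is an invented illustrative value, not the one behind Figure 5):

```python
import math

def aic_bic_thresholds(k_diff, df_diff, n_data):
    """Break-even points where the AIC / BIC prefer the two models equally,
    expressed as cut-offs on the df-weighted chi^2 difference.
    AIC = chi^2 + 2k,  BIC = chi^2 + k ln(n)."""
    return 2.0 * k_diff / df_diff, k_diff * math.log(n_data) / df_diff

# Illustrative numbers: 3 extra TATT parameters, effective df_diff = 2,
# and a few hundred data points after scale cuts (invented value).
aic_thr, bic_thr = aic_bic_thresholds(3, 2, 220)
print(aic_thr, bic_thr)  # 3.0, ~8.1
```

Because the BIC penalty grows with the data vector length, its cut-off sits well above the AIC one, making it the more permissive criterion towards the larger model here.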

4.2.2 Bayes Ratio

The Bayesian evidence ratio, or Bayes ratio $R$ (Jeffreys, 1961; Kass & Raftery, 1995), is a slightly more complicated alternative to $\Delta\chi^2_{\rm(df)}$. It is defined as the probability of measuring a data vector $\boldsymbol{D}$ assuming a given model $M_1$, divided by the probability of measuring the same data $\boldsymbol{D}$ under a second model $M_2$:

R𝒵(𝑫|M1)𝒵(𝑫|M2).R\equiv\frac{\mathcal{Z}(\boldsymbol{D}|M_{1})}{\mathcal{Z}(\boldsymbol{D}|M_{2})}. (21)

Here, $\mathcal{Z}(\boldsymbol{D}|M)$ is the Bayesian evidence, which can be obtained by marginalising over all the model parameters $\boldsymbol{\theta}$:

$\mathcal{Z}(\boldsymbol{D}|M)=\int\mathrm{d}\boldsymbol{\theta}\ \mathcal{L}(\boldsymbol{D}|\boldsymbol{\theta},M)\,P(\boldsymbol{\theta}|M),$ (22)

where $\mathcal{L}(\boldsymbol{D}|\boldsymbol{\theta},M)$ is the likelihood, and $P(\boldsymbol{\theta}|M)$ is the prior, both assuming a particular model. The Bayes ratio is typically interpreted using the Jeffreys scale (Jeffreys, 1961), which defines ranges of values that correspond to qualitative labels (e.g., "strong evidence", "substantial evidence", etc.).
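The prior dependence of the evidence is easy to see in a toy one-dimensional version of Eq. (22), with a Gaussian likelihood and a uniform prior (all numbers here are illustrative):

```python
import numpy as np

# Toy 1D version of Eq. (22): Gaussian likelihood in theta with sigma = 0.1,
# uniform prior on [lo, hi]; evidence by simple Riemann sum.
def evidence_1d(lo, hi, sigma=0.1, n=200_000):
    theta = np.linspace(lo, hi, n)
    like = np.exp(-0.5 * (theta / sigma) ** 2) / np.sqrt(2 * np.pi * sigma**2)
    prior = 1.0 / (hi - lo)  # normalised uniform prior
    return np.sum(like * prior) * (theta[1] - theta[0])

# Widening the prior dilutes the evidence (the Occam penalty):
print(evidence_1d(-1.0, 1.0))    # ~0.5
print(evidence_1d(-10.0, 10.0))  # ~0.05
```

In a real analysis this integral is over dozens of dimensions, which is why nested sampling (rather than quadrature) is the practical tool.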

Note that $R$ and $\Delta\chi^2_{\rm(df)}$ are not independent of one another (indeed, the latter approximates the former under certain assumptions; see Bishop 2006; Marshall et al. 2006). It is important, then, to be careful when seeking to combine information from the two.

Evidence ratios have been widely used in cosmology, both for comparing different data sets under the same model (i.e., as a tension metric; Marshall et al. 2006; Lemos, Raveri, Campos et al. 2021), and for model comparison on the same data (Liddle, Mukherjee & Parkinson 2006a; Kilbinger et al. 2010; Secco, Samuroff et al. 2022). It is worth bearing in mind that the formulation in the two contexts is slightly different. In the former case there is explicit prior dependence, which motivates the use of statistics such as Suspiciousness (see e.g. Lemos, Raveri, Campos et al. 2021, Section 4.2). The version commonly used for model selection, on the other hand, should be independent of the choice of priors, at least in the limit that (a) the models are nested and (b) the priors on the extra parameters are wide compared with the likelihood.

Since cosmological analyses involve a large number of free parameters, computing the Bayesian evidence requires integrating a probability distribution over a large number of dimensions. A common way to calculate it is alongside the posterior distributions, using nested sampling (Skilling, 2006). The precision required from the sampler in order to compute reliable Bayesian evidences, however, often makes the sampling time very long. We choose to use the PolyChord nested sampling algorithm in this work (Handley et al., 2015) – although see Appendix D, where we consider the feasibility of using MultiNest (Feroz et al., 2019) as a somewhat faster alternative.

4.2.3 Bias Probability

The above quantities give us the basic tools for our model comparison. There is, however, a piece missing. As we mention in Section 3.3, the calibration is inherently probabilistic. The model comparison metrics (both $\Delta\chi^2_{\rm(df)}$ and $R$), as well as the offset between the NLA and TATT best fits, are somewhat sensitive to noise, and we do not know the true noise realisation in the data. We thus define a bias probability $P$ for a particular bias tolerance $X$:

$P(b>X\sigma\,|\,\Delta\chi^2_{\rm(df),obs}<\Delta\chi^2_{\rm(df),thr})=\frac{N^{b>X}_{\rm samp}}{N^{b>X}_{\rm samp}+N^{b<X}_{\rm samp}}.$ (23)

In words, $P$ is the probability of the bias in $S_8$-$\Omega_{\rm m}$ being greater than $X\sigma$, given that the observed $\Delta\chi^2_{\rm(df)}$ from the data is below some threshold $\Delta\chi^2_{\rm(df),thr}$ (which is to be determined empirically based on the adopted $X$ and $P$).

It is estimated by plotting the distribution of all of our noisy data vectors in the bias-$\Delta\chi^2_{\rm(df)}$ plane, and, for a particular $\Delta\chi^2_{\rm(df),thr}$, evaluating the fraction of points that lie both above ${\rm bias}=X\sigma$ and below $\Delta\chi^2_{\rm(df),thr}$ (i.e., in the lower right quadrant of Figure 4). In practice, one starts by defining the tolerance $X$ and the desired bias confidence $P$. For example, one might require the bias to be smaller than $X=0.3\sigma$ at $90\%$ confidence. Given those numbers, we can then iteratively evaluate Eq. (23) with different $\Delta\chi^2_{\rm(df),thr}$ thresholds until the required $P$ is achieved.
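A minimal sketch of this counting and threshold search (the function names and toy numbers are ours; in practice the resulting $P$ curves are also smoothed before a threshold is read off):

```python
import numpy as np

def bias_probability(bias, dchi2, dchi2_thr, X=0.3):
    """Eq. (23): fraction of noisy realisations with bias > X sigma,
    among those whose Delta chi^2_(df) falls below the threshold."""
    below = dchi2 < dchi2_thr
    if not below.any():
        return 0.0  # no realisations pass the cut; design choice for the sketch
    return np.mean(bias[below] > X)

def find_threshold(bias, dchi2, X=0.3, P_max=0.10):
    """Largest Delta chi^2_(df) cut such that P(b > X sigma) <= P_max."""
    best = np.sort(dchi2)[0]
    for thr in np.sort(dchi2):
        if bias_probability(bias, dchi2, thr, X) <= P_max:
            best = thr
    return best

# Toy example: bias (in sigma) and metric values for a handful of realisations.
bias = np.array([0.1, 0.2, 0.5, 0.8])
dchi2 = np.array([0.5, 1.0, 3.0, 5.0])
print(bias_probability(bias, dchi2, dchi2_thr=4.0))  # 1/3: one of the three passing points exceeds 0.3 sigma
print(find_threshold(bias, dchi2, X=0.3, P_max=0.10))
```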

4.3 Dealing with Unphysical $\Delta\chi^2_{\rm(df)}$ Values and Unconverged IA Samples

It is also worth briefly remarking that in our analysis we found about 50 (out of 1050) data points for which $\chi^2_{\rm TATT}>\chi^2_{\rm NLA}$, and so $\Delta\chi^2_{\rm(df)}<0$. Given that NLA and TATT are nested models, these points are unphysical (a more flexible model should always be able to produce a better, or equally good, fit). We conclude that they are an artefact of the sampling method; although we tested the robustness of our IS setup, and found it can reproduce the best fit from chains to reasonable precision, some level of sampling noise is still present. Given this logic, it is reasonable to assume that if we were to find $\chi^2_{\rm TATT}>\chi^2_{\rm NLA}$ in real data, there would likely be some follow-up investigation and the chains would be rerun. This is particularly true if the $\chi^2$ is an integral part of the analysis plan, as it is in our method. We thus choose to discard these points. It is worth bearing in mind, however, that this may or may not be a reasonable decision in other setups, depending on the models being compared and the details of the analysis.

Also note that, although our results are based on 21 IA samples, we initially drew 25 scenarios (see Section 3.2). Of these 25, we found four to be so extreme that the NLA PolyChord chains failed to converge even in the noiseless case. These resulted in highly distorted and often bimodal contours in the $S_8-\Omega_{\rm m}$ plane, making it difficult to obtain meaningful estimates for the bias. Given this, and also for the reasons discussed above, we choose to omit these samples from our analysis. This leaves us with a total of 21 IA samples.

4.4 The Recommended Method for Model Selection

Given the definitions set out in the sections above, we now follow the recipe outlined below in order to map and calibrate the bias-metric relation. These steps are, in essence, our method; written out in this form, it can easily be generalised to other model selection problems beyond our particular example of IA in cosmic shear.

  1. Sample IA scenarios: Draw roughly 10-30 parameter samples, either from a posterior from a previous analysis or from some reasonable priors, using the method described in Section 3.2 (we used 21, drawn from the DES Y1 TATT posteriors).

  2. Generate data vectors: Generate a simulated noiseless data vector for each IA scenario drawn in the previous step. Other model parameters (e.g., cosmological and nuisance parameters) should be fixed to some fiducial values. See Section 3.1.

  3. Analyse noiseless data vectors: Run two chains per data vector, one corresponding to model $M_1$, and the other to $M_2$. In our case, we compared the TATT and NLA IA models, using PolyChord to compute statistics. These choices might vary under different applications. Details on the sampling can be found in Section 3.4.1.

  4. Compute parameter biases and plot the bias-metric relation: Demonstrate that the noiseless data vectors show a clear correlation between the test statistic (e.g., $\Delta\chi^2_{\rm(df)}$ or Bayes factor) and parameter bias, as described in Section 4.1.

  5. Generate noise realisations: For each data vector, generate $\sim 50$ noise realisations using the covariance matrix, as explained in Section 3.3. In our case, this gives a total of $21\times 50=1050$ noisy data vectors.

  6. Analyse noisy data vectors: As discussed above, we choose to use importance sampling as a fast approximation to the noisy posteriors. For all noise realisations (50) and IA scenarios (21), and for both IA models (2) – 2100 combinations in total – estimate the posterior, the NLA-TATT bias, and the model test statistics. See Section 3.4.2.

  7. Calculate probabilities: Plot the bias-metric relation. Use the quantities computed in the previous step to calculate the probability of a bias greater than some pre-defined threshold $X\sigma$ (see Section 4.2.3).

  8. Run analyses on blinded data: Run a full chain on the real data in order to evaluate the observed model statistic. This can then be interpreted in terms of bias probability using the results of the previous step. See Section 5.

Note that, to obtain an accurate calibration of the $\Delta\chi^2_{\rm(df)}$ value, all aspects of the modelling should be as close to the final fiducial analysis setup as possible. For an estimate of the computational resources required to employ the proposed method, see Appendix E.
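The noise-generation step above is a standard correlated-Gaussian draw; a minimal sketch (variable names and the toy covariance are ours):

```python
import numpy as np

def add_noise(d_noiseless, cov, n_real=50, seed=0):
    """Draw Gaussian noise realisations from the analysis covariance
    and add them to a noiseless model data vector."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(cov)  # cov = L L^T
    noise = rng.standard_normal((n_real, len(d_noiseless))) @ L.T
    return d_noiseless + noise

# Toy example: a 3-point data vector with a diagonal covariance.
d = np.array([1.0, 2.0, 3.0])
cov = np.diag([0.1, 0.1, 0.1])
noisy = add_noise(d, cov)
print(noisy.shape)  # (50, 3)
```

The Cholesky factor maps independent unit-variance draws onto the correlated noise described by the covariance, so the same routine applies unchanged to a full, non-diagonal analysis covariance.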

5 Results

Having outlined the details of our method in Section 4.4, we now consider a specific application. As discussed earlier, we choose to focus on the problem of deciding between two intrinsic alignment models for a cosmic shear analysis: NLA and TATT. Although these models are nested, the method does not assume this. Indeed, the only requirement is the ability to generate mock data with which to calibrate the chosen test statistics; the method is therefore quite general, and can be applied to a variety of model selection problems.

In Section 5.1 we discuss the results from our PolyChord chains on noiseless data vectors, and the basic trends. Section 5.2 then discusses the more complete probabilistic calibration, which properly factors in the impact of noise. We also compare our empirical results against theoretically derived $\chi^2$ thresholds. Section 5.3 examines the extent to which bias can be inferred from NLA fits alone, without explicit model comparison. Lastly, Section 5.4 considers the wider outlook for lensing cosmology.

5.1 The Noiseless Case

Figure 4: $\Delta\chi^2_{\rm(df)}$ as a function of cosmological parameter bias for a DES Y3-like cosmic shear analysis. The 21 points correspond to noiseless data vectors, generated with different input IA parameters. As defined in Eq. (20), large values of $\Delta\chi^2_{\rm(df)}$ indicate that the data prefer TATT over NLA. The vertical dotted line marks the $0.3\sigma$ bias limit used in DES Y3 (Krause et al., 2021). We see a clear correlation between the observable metric ($\Delta\chi^2_{\rm(df)}$) and the underlying parameter bias, particularly for those points for which the bias exceeds $\sim 0.2\sigma$.

Considering first the noiseless case, Figure 4 shows the relation between bias in the $S_8-\Omega_{\rm m}$ plane and the NLA-TATT $\Delta\chi^2_{\rm(df)}$. Each point results from running two chains on the same noiseless simulated data vector, first using NLA, and then using TATT. As defined in Eq. (20), large values of $\Delta\chi^2_{\rm(df)}$ indicate statistical preference for the larger model (i.e., TATT). We see a relatively tight relation between bias and $\Delta\chi^2_{\rm(df)}$: low bias corresponds to $\Delta\chi^2_{\rm(df)}<1$, and high bias to relatively large values. Interestingly, the relation appears to take (approximately) the form of a double power law, with a steep slope in the high-bias regime, switching to a somewhat shallower function below $0.3\sigma$. It is worth stressing, however, that this relation is empirical. We do not have a particular expectation for its shape, and it is likely that the details depend on the analysis choices and survey setup. Note that even without data vector noise, this relation presents some scatter. This arises both from sampler noise, and from the fact that this is a complex high-dimensional problem, for which two sets of IA values that produce biases of similar magnitude will not necessarily produce identical $\Delta\chi^2_{\rm(df)}$ values. The vertical dotted line marks the $0.3\sigma$ bias threshold adopted by DES (Krause et al., 2021), which we adopt as our fiducial tolerance (see Section 5.2 below). Although we cannot use this noiseless result directly in any empirical method, because real data will always contain noise, confirming that these quantities clearly correlate is a necessary first step in our method, and important to check before incurring the expense of further calculations. We will see in the next section that the correlation between $\Delta\chi^2$ and $S_8-\Omega_{\rm m}$ bias holds (with some additional scatter) when we proceed to the noisy case.

In addition to $\Delta\chi^2_{\rm(df)}$, we also consider the Bayes ratio as a potential model comparison metric; while the former presents a clear relation to the bias (as seen in Figure 4), we find the latter to be a relatively weak indicator, with additional intrinsic scatter. This can be seen in Appendix C, and in particular Figure 11. We also note that $R$ and $\Delta\chi^2_{\rm(df)}$ are correlated. In principle one could seek to combine them, but naively treating them as independent metrics would almost certainly double-count information. We therefore focus on the results using $\Delta\chi^2_{\rm(df)}$. For further discussion of the Bayes ratio, see Appendix C.

5.2 Noise & Probabilistic Calibration

Figure 5: The impact of data vector noise on $\Delta\chi^2_{\rm(df)}$. The larger open points show our 21 IA samples with zero noise (identical to those in Figure 4). The smaller coloured dots show the effect of adding random noise realisations, for which parameter constraints are estimated using importance sampling. For each of the 21 colours, we have a collection of 50 realisations. The red and blue horizontal dashed lines mark threshold $\Delta\chi^2_{\rm(df)}$ values, defined by the points where the BIC and AIC respectively prefer NLA and TATT equally. The orange dashed line corresponds to a $p$-value $p(\Delta\chi^2)=0.05$ (see text, Section 5.2). The fact that these formal cut-offs are relatively ineffective in isolating the bias $<0.3\sigma$ region motivates us to adopt an empirical approach.

In Figure 5 we illustrate the impact of data vector noise in the bias-$\Delta\chi^2_{\rm(df)}$ plane. We show the same 21 noiseless samples discussed above, but now overlain with multiple different noise realisations, as approximated using importance sampling. As we can see, noise introduces scatter in the bias-$\Delta\chi^2_{\rm(df)}$ relation. While this scatter is considerably smaller than in the case of the Bayes factor (for which we show in Appendix C that the scatter due to noise is so large that the relation with bias is extremely weak), it is still non-negligible.

For comparison, we show the $\Delta\chi^2_{\rm(df)}$ cut-offs implied by some standard model selection metrics: the BIC, the AIC, and a $p$-value significance threshold of $p=0.05$, assuming that the TATT model has 2 additional effective degrees of freedom compared with NLA (see Section 4.2.1 for definitions; note that this $df_{\rm diff}$ value was calculated for the DES Y3 shear-only, no shear ratio, case in Secco, Samuroff et al. 2022, and so is valid for our particular case, but would not necessarily hold under changes to the data vector or analysis choices). Unfortunately, in the presence of noise, we see that all three cut-offs are relatively weak indicators of bias – i.e., they still favour the simpler model even when a significant amount of bias has been introduced in the cosmological parameters. Even in the case of the AIC, which is the strictest of the three, there is a problematic fraction of noise realisations for which the observable metric favours NLA, and yet NLA is biased by $>0.3\sigma$ (see the points in the lower right hand corner of Figure 5). This illustrates a key motivation for adopting an empirical calibration. Theoretical limits imposed using, e.g., $p$-values are not designed to optimise the quantities we care most about (i.e., parameter biases). For a given analysis, it is impossible to know from first principles what level of bias is excluded by a given statistical metric cut-off without some form of calibration.

These observations have important consequences for our method. Since the exact noise realisation in any real data set is unknown, one cannot simply run a single set of IA samples (as in Figure 4) and perform a 1:1 bias-$\Delta\chi^2_{\rm(df)}$ mapping. Nor, as we can see from Figure 5, can we simply fall back on theoretical cut-offs to reliably guard against model bias. Instead, we must treat the problem as a probabilistic one, and factor in the uncertainty from noise.

5.2.1 Probabilistic Interpretation

Figure 6: Probability of exceeding some specified level of cosmological parameter bias, as a function of the threshold $\Delta\chi^2_{\rm(df)}$ value. For a DES Y3-like cosmic shear data vector with unknown noise and IA realisation, which is found to give an observed $\Delta\chi^2_{\rm(df)}$ lower than the threshold $\Delta\chi^2_{\rm(df),thr}$, $P$ is defined as the probability that the results using NLA are biased by more than $X\sigma$ in the $S_8-\Omega_{\rm m}$ plane. Different values of $X$ are represented by different colours. In each case, we show both the direct measurement of $P$ using importance sampling (coloured points) and a polynomial fit to those points (lines). For illustrative purposes, we also show the $\Delta\chi^2_{\rm(df)}$ threshold that would guarantee NLA is unbiased to within $0.3\sigma$ at a confidence level of $90\%$ (dotted lines and shading).

To interpret our results quantitatively, we use Eq. (23) to calculate the bias probability P(b > Xσ | Δχ²_(df),obs < Δχ²_(df),thr). This quantity should be interpreted as the conditional probability that, if in the real data one finds a Δχ²_(df),obs value below some limit Δχ²_(df),thr (a horizontal line in Figure 5), the analysis using NLA will nevertheless be biased by Xσ or more.
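As a minimal sketch of this calculation (with entirely synthetic stand-in samples; the function and variable names are ours, not the paper's pipeline), the conditional probability of Eq. (23) is just a conditional fraction over the noisy (Δχ²_(df), bias) points:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-ins for the noisy (Delta chi^2, bias) points of Figure 5;
# the loose positive correlation mimics the observed relation.
dchi2 = rng.gamma(shape=2.0, scale=2.0, size=5000)
bias = 0.15 * dchi2 + rng.normal(0.0, 0.1, size=5000)

def p_bias_given_threshold(dchi2, bias, thr, x_sigma):
    """Estimate P(b > x_sigma | Delta chi^2_obs < thr) as a conditional
    fraction; importance-sampling weights could enter as an extra argument."""
    below = dchi2 < thr
    if not np.any(below):
        return float("nan")
    return float(np.mean(bias[below] > x_sigma))

print(p_bias_given_threshold(dchi2, bias, thr=0.4, x_sigma=0.3))
```

Lowering the bias tolerance x_sigma can only increase the estimated probability, matching the trend described in the text.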

Figure 6 shows three curves corresponding to chosen bias thresholds of 0.15σ, 0.3σ, and 0.5σ. Each point is calculated using Eq. (23), and the curves are obtained by fitting a fifth-order polynomial to the points. We tested the stability of these smoothed fits, and found them robust to doubling the number of noise realisations in Figure 5 (from 50 per IA sample to 100). This result provides a powerful tool for interpreting results from real data. For instance, say we were to run NLA and TATT chains on a blinded Y3 data vector and find Δχ²_(df),obs < 0.4. With the aid of Figure 6, we could say that the chance of the NLA run being biased by more than 0.5σ in S8-Ωm is about 3%. The probability of exceeding a 0.3σ threshold is about 10%, and the chance of bias greater than 0.15σ is about 37%. In practice, the bias tolerance is an analysis choice. As discussed previously, DES Y3 chose a value of 0.3σ by which to judge simulated chains. The exact number, however, is somewhat subjective, and the most convenient value may depend on how well sampled the low-bias end of the bias-Δχ²_(df) relation is. As one might expect, the lower the bias threshold X, the stronger the requirement on Δχ²_(df) (i.e., the more strongly the data need to favour NLA) in order to keep the bias probability P(b > Xσ) low.
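The smoothing step described above (points to curves in Figure 6) amounts to a simple polynomial fit. A sketch with invented probability values (the numbers below are stand-ins, not the paper's measurements):

```python
import numpy as np

# Stand-in values for the measured bias probabilities P at a grid of
# Delta chi^2 thresholds (the coloured points of Figure 6); the smooth
# curve is a fifth-order polynomial fit, as in the text.
thresholds = np.linspace(0.1, 3.0, 15)
p_measured = 0.05 + 0.1 * np.log1p(thresholds)  # invented, monotonic values

coeffs = np.polyfit(thresholds, p_measured, deg=5)  # fifth-order fit
p_smooth = np.polyval(coeffs, thresholds)

# The fitted curve can then be read off (or inverted) to find the
# threshold that keeps P below a chosen level, e.g. 10%.
print(np.max(np.abs(p_smooth - p_measured)))
```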

To understand how our results depend on various analysis choices, it is useful to think of the process in Section 4.4 as a series of transformations between distributions. The points in Figure 5, which determine the final χ² threshold, can be thought of as the convolution of two parts: an initial distribution of noiseless points P_s(Δχ²_(df), b) (the open points in Figure 5 and the filled points in Figure 4), and a second distribution conditioned on each of them, P_N(Δχ̃²_(df), b̃ | Δχ²_(df), b) (where the tilde denotes noisy values of Δχ²_(df) and bias). In the first case, P_s, we start with a distribution in IA parameter space, P(θ_IA), which we choose. The samples from P(θ_IA) are mapped onto a distribution of noiseless data vectors, which are then transformed (via running chains) into samples in the final bias-Δχ²_(df) space: P(θ_IA) → P(D) → P_s(Δχ²_(df), b). Both mapping steps depend on the survey analysis choices (choice of power spectrum, n(z), covariance matrix, etc.). This is not a problem, as long as these choices match the ones that will be applied to the real data; it is, however, likely that they affect the observed bias-Δχ²_(df) correlation. It is clear from this that P_s(Δχ²_(df), b) also depends to some extent on the choice of P(θ_IA), which behaves analogously to a prior, restricting the range of possibilities in the subsequent steps. However, given that the purple points in Figure 4 show a relatively tight correlation and cover a broad range of bias relatively uniformly, we do not expect these details to change things considerably.

The other part of the final sampling of points is the noise distribution P_N(Δχ̃²_(df), b̃ | Δχ²_(df), b). We obtain this for a particular IA sample by drawing noise realisations, thereby transforming P(D̃|D) → P_N(Δχ̃²_(df), b̃ | Δχ²_(df), b). This process again depends on the covariance matrix, but not (at least, not directly) on the choice of P(θ_IA).
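A sketch of this noise-realisation step, using a toy three-element data vector and covariance (in the real analysis these would be the theory data vector and the survey covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(10)

# Toy "data vector" and covariance, standing in for the survey products.
d_theory = np.array([1.0, 0.8, 0.5])
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.03, 0.01],
                [0.00, 0.01, 0.05]])

# Draw 50 noise realisations: D-tilde = D + Gaussian noise consistent
# with the analysis covariance.
noisy = rng.multivariate_normal(mean=d_theory, cov=cov, size=50)
print(noisy.shape)
```

Each row of `noisy` would then be analysed exactly as the real data, yielding one noisy (Δχ²_(df), bias) pair per realisation.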

The end result of the above is that, by convolving to obtain P(Δχ̃²_(df), b̃), we are able to map out the relationship between a quantity we can measure (the noisy Δχ̃²_(df)) and the one we are interested in (the parameter bias b̃).

5.2.2 Bias Tolerance Implications

Refer to caption
Figure 7: The impact of data vector noise on Δχ²_(df). The points are the same as in Figure 5. The horizontal line represents an empirical Δχ²_(df) threshold, derived to ensure bias below 0.3σ with 90% confidence. The four shaded regions distinguish the following possible scenarios: purple: NLA is sufficient and the calibrated Δχ²_(df) favours NLA; grey: NLA would be sufficient, and yet Δχ²_(df) chooses TATT; pink: NLA is insufficient and Δχ²_(df) favours TATT; red: NLA is insufficient, and yet Δχ²_(df) still chooses NLA. This last case is the most dangerous, and the Δχ²_(df) threshold is chosen to keep the fraction of points in this quadrant acceptably small.

We further illustrate our results by taking a concrete example. For our DES Y3 setup, we choose a bias tolerance of X = 0.3σ and a bias probability of 10% (P = 0.1). Using Figure 6, this gives us Δχ²_(df),thr = 0.4 (reading across where the horizontal dashed line meets the purple curve), which is shown in Figure 7 (the horizontal line labelled "empirical threshold"). With the bias and Δχ²_(df) thresholds fixed, the four shaded regions in Figure 7 distinguish the following possible scenarios: (a) NLA is sufficient (i.e., the bias is below our 0.3σ limit) and Δχ²_(df) chooses NLA (i.e., Δχ²_(df) < Δχ²_(df),thr; purple); (b) NLA is sufficient and Δχ²_(df) chooses TATT (grey); (c) NLA is insufficient and Δχ²_(df) chooses TATT (pink); (d) NLA is insufficient and Δχ²_(df) chooses NLA (red). As discussed previously, case (d) is the most dangerous, for obvious reasons. Scenario (b) is not ideal (since we may end up with a model that is more complicated than strictly necessary), but does not result in cosmological parameter biases. The different scenarios can be better understood with the help of a confusion matrix, shown in Table 1.

Table 1: Confusion matrix. The samples are split into quadrants, corresponding to the four shaded regions in Figure 7. The left/right columns show the fraction of IA samples that give a bias below/above 0.3σ. The rows indicate whether our method, using the calibrated Δχ²_(df), prefers NLA or TATT.
                            Model preferred by bias
                            NLA       TATT      Total
Model preferred    TATT     19.1%     77.1%     96.2%
by Δχ²_(df)        NLA       3.4%      0.4%      3.8%
                   Total    22.5%     77.5%     100%

The columns here represent the model preference according to the amount of bias, and the rows represent the model preference according to Δχ²_(df). Since we are effectively using Δχ²_(df) as an empirical proxy for bias, we treat the classification according to the former (i.e., does using NLA cause cosmological parameter biases exceeding 0.3σ for a particular data vector?) as the truth, and the label according to the latter (i.e., is Δχ²_(df) below Δχ²_(df),thr?) as the prediction (in machine learning language).
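In this machine-learning framing, the quadrant fractions of Table 1 follow from two boolean labels per sample. A sketch with hypothetical inputs (the function name and toy values are ours, not the paper's):

```python
import numpy as np

def confusion_fractions(bias, dchi2, bias_tol=0.3, dchi2_thr=0.4):
    """Quadrant fractions of Table 1: 'truth' is whether NLA keeps the
    cosmology bias below bias_tol; 'prediction' is whether the calibrated
    Delta chi^2 threshold selects NLA."""
    truth_nla = bias <= bias_tol   # NLA would in fact be sufficient
    pred_nla = dchi2 < dchi2_thr   # the metric selects NLA
    n = len(bias)
    return {
        "TATT|bias=NLA": np.sum(~pred_nla & truth_nla) / n,   # false positive
        "TATT|bias=TATT": np.sum(~pred_nla & ~truth_nla) / n,
        "NLA|bias=NLA": np.sum(pred_nla & truth_nla) / n,
        "NLA|bias=TATT": np.sum(pred_nla & ~truth_nla) / n,   # dangerous quadrant
    }

frac = confusion_fractions(np.array([0.1, 0.5, 0.9, 0.2]),
                           np.array([0.2, 1.5, 2.5, 0.9]))
print(frac)
```

The four fractions sum to one by construction, matching the totals row of Table 1.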

We see that in our samples, the cosmological parameter bias indicates that NLA should be preferred about 22.5% of the time, while TATT should be preferred 77.5% of the time. In other words, NLA introduces a bias smaller than our threshold in ∼1/4 of the cases. Note that this fraction is somewhat dependent on our particular choices; a different choice of posteriors in Figure 1, for example, could change it. We do not, however, expect this to affect the validity of the method.

When it comes to the performance of Δχ²_(df) in identifying the correct model, we see that it favours TATT in 96.2% of cases, and NLA in only 3.8%. Our method is therefore quite conservative, in the sense that there is a non-negligible false positive rate: it prefers TATT over NLA in 19.1% of cases, even though NLA would not introduce bias above the 0.3σ threshold. Reassuringly, however, we also see that our method is highly effective at ruling out real bias. Perhaps its strongest feature is that it is very unlikely to select NLA if NLA is, in fact, introducing biases to the analysis. We can see this from the very small population of points in the lower right of the matrix (and the red shaded quadrant in Figure 7): this happens in only ∼0.4% of cases. Put another way, if the calibrated Δχ²_(df) favours TATT, there is a roughly 20% chance (19.1/96.2) that NLA would, in reality, have been fine. Conversely, if it prefers NLA, there is a ∼10% chance (0.4/3.8) that NLA is insufficient. Therefore, even though the end result is somewhat cautious (in that there is a moderate false positive rate for TATT), on the positive side we can be confident that if NLA is in fact preferred by the data, it is very unlikely to introduce biases to the analysis. It is important to acknowledge, however, that a possible conclusion from these results is that simply using the most general model is the cheapest alternative from the perspective of computational resources. It is not obvious that this will always be the case, given the dependence on analysis setup and other factors.

It is also worth noting that although the above discussion applied to our specific choices, we can control the conservatism to a significant degree through our analysis choices. We chose a specific bias tolerance and probability that we considered realistic. By changing these values (for example, allowing a bias probability of 20% or 25%) one can effectively shift the position of the cross in Figure 7, and trade off false positives for false negatives. This is another advantage of the method: it makes the level of conservatism explicit (and indeed quantifiable), and allows one to adjust that level as preferred. This is much less true of alternative approaches to model selection.

5.3 A Simpler Approach: How Much Can We Tell From A Single Model?

It is also worth taking a moment to consider a related question: if the true IA scenario is extreme enough to give significant cosmological biases, would there be clear red flags from NLA alone, assuming that no fits were carried out with TATT? If so, it would provide a simpler route: instead of performing empirical calibration using simulations, we could simply run one model on the data and interpret the results to see if a more sophisticated model is needed. Considering our 21 IA scenarios with a fixed noise realisation, we find that around 50% of cases with bias >0.3σ have χ² values that appear entirely consistent with being drawn from the corresponding χ² distribution, p(χ²_NLA) > 0.05 (the NLA p-value here is calculated by assuming χ²_NLA is drawn from a χ² distribution with 224 degrees of freedom; see Secco, Samuroff et al. 2022. The null hypothesis in this case is that the NLA model is adequate to describe the data, and so small p-values would indicate model insufficiency). A similar picture is seen when we consider a single high-bias IA scenario with alternative noise realisations: computing p-values for each realisation, the majority are above 0.05, even in the presence of bias >0.3σ. In other words, even in cases where NLA is significantly biased, this is not necessarily obvious from the uncalibrated value of χ²_NLA alone. In contrast, the method we propose, using Δχ²_(df), can correctly identify the need to use TATT to achieve sufficiently unbiased results 77.1/77.5 ∼ 99% of the time (see the confusion matrix in Section 5.2).
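The single-model check described here is a one-line survival-function evaluation. A sketch (the 224 degrees of freedom follow Secco, Samuroff et al. 2022, while the example χ² values are invented):

```python
from scipy import stats

def nla_p_value(chi2_obs, dof=224):
    """p-value of an observed chi^2 under a chi^2(dof) distribution;
    small values would flag NLA as an inadequate description of the data."""
    return stats.chi2.sf(chi2_obs, df=dof)

# A mildly elevated chi^2 (the distribution has mean 224 and standard
# deviation ~21) still "passes" the conventional p > 0.05 criterion:
print(nla_p_value(240.0))
```

This illustrates the point in the text: a fit can be noticeably worse than average yet still look entirely acceptable by the uncalibrated p-value criterion.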

Likewise, although extreme biases do tend to distort the shape of the posteriors, this is not always true in more moderate (but still significantly biased) cases. Figure 8 shows the NLA posteriors in a few different IA scenarios, spanning the range from almost no bias (purple shaded) to ∼1σ (pink, open contours). For reference, we also show the posteriors from TATT fits to the same data vectors in Appendix B. Taken in isolation, none of these show clear signs of problems with the model. It is also interesting that IA mismodelling bias does not always translate into significantly non-zero values for the inferred NLA parameters. In the medium bias case, for example, A1 and η1 are both consistent with zero to <1σ. Here there is a relatively strong degeneracy between A1 and S8, allowing A1 ∼ 0 combined with low S8, but also a stronger IA amplitude (A1 ∼ 1) with a larger S8. In projection, this results in broad contours on both parameters (notice the black contours in the upper panel of Figure 8 are slightly wider than the others, with a more prominent asymmetry at high/low S8).

Refer to caption
Refer to caption
Figure 8: Examples of the simulated NLA posteriors from three particular IA scenarios with our fiducial noise realisation. These samples were chosen to span a range of bias levels (as defined relative to the TATT posteriors from the same data vectors). In order of severity: the low bias case (purple) has a bias in the S8-Ωm plane of ∼0.1σ, with Δχ²_(df) = 0.24 and R = 21.9 ± 6.1; the medium bias case (black) has 0.36σ bias, Δχ²_(df) = 0.49, R = 1.5 ± 0.3; the high bias case (pink, open) has 0.82σ bias, Δχ²_(df) = 1.98, R = 1.1 ± 0.2. In no case are the posteriors visibly distorted (although in the medium and high bias cases, the η1 posterior is cut off slightly by the upper prior edge at η1 = 5).

5.4 Intrinsic alignment modelling & wider implications for weak lensing

The results discussed so far have a number of direct implications for the question of intrinsic alignment model selection. Primarily, we have shown that it is possible to perform empirical model selection with lensing data. There is a clear relation between cosmological parameter bias and Δχ²_(df), which allows one to define a threshold that can then be applied to the real data. That said, the failure of conventional statistical metrics (e.g., p-values) to identify scenarios with significant cosmological parameter biases is notable, and should be kept in mind when trying to interpret statistics derived from any single run on real data. Properly calibrated model statistics, however, provide an alternative to the model selection exercises used in previous analyses, which have tended to rely on either simulated analyses (Secco, Samuroff et al. 2022, Section A3) or arguments based on direct-detection studies (Hikage et al. 2019; Joachimi et al. 2021a, Sections 5.4 and 2.4 respectively). The empirical method is arguably an advance on both. First, it avoids questions about what constitutes an "extreme" model, which tend to arise in the simulation-based approach: since the current best constraints on TATT model parameters are relatively weak, it is relatively easy to select an IA scenario that is both consistent with observations and would cause significant bias in an NLA analysis (note that this is still true in light of the most recent DES Y3 results; Secco, Samuroff et al. 2022; Amon et al. 2022; DES Collaboration 2022). Second, our empirical approach avoids the uncertainties inherent in extrapolating from direct IA measurements on one very specific type of galaxy to weak lensing measurements on an entirely different population.

Although the empirical approach has various strengths, it is worth reiterating that data vector noise is a significant source of scatter in the bias-Δχ²_(df) relation. For this reason it is important to accurately simulate the noise properties of the particular data set. While this is in principle simple, given an accurate covariance matrix, it does mean the model selection exercise needs to be repeated for any new data set or change to the analysis. It also means it is crucial to have a fast and accurate way of estimating posteriors for a large number of noise realisations, such as the IS framework used here.

Finally, it is also interesting to consider how our findings relate to the real Y3 results. Although comparing with the full 3×2pt results is difficult, for the reasons given above, Secco, Samuroff et al. (2022) (Section VIIB and Table III) present a comparison of IA models without shear ratios, an analysis configuration that matches ours. Specifically, comparing the 2-parameter NLA model with 5-parameter TATT, they find R = 1.70 ± 0.36 and Δχ²_(df) = 5.2/2 = 2.6. Interpreted with the help of Figure 6, this puts the risk of NLA being biased by >0.3σ at around 30%, meaning runs using NLA on Y3 were more likely than not to be unbiased to within the 0.3σ threshold.

6 Conclusions

In this paper, we explore the idea that model selection for cosmological analyses could be performed a posteriori, being informed by the blinded data themselves. Our goal is to select a model that is sufficient to describe the data, resulting in unbiased parameter constraints at some specified tolerance level. We chose to focus on a specific problem: how best to decide on an intrinsic alignment model for a cosmic shear analysis. This is an important question, and one that has been the subject of much discussion within the weak lensing community in recent years. That said, the basic concept behind our method is much more general, and could be applied in a variety of different contexts; in principle it requires only that the data (including its noise) can be readily simulated.

Using simulated noisy DES Y3 weak lensing data, we tested the method, and identified statistical tools with which to implement it. The main conclusions of our study are as follows:

  • We showed a clear relation between the χ2\chi^{2} difference between two models, and model insufficiency bias on cosmological parameters. This relation was seen to extend across a wide range of biases, from low to high, allowing one to define an empirical Δχ2\Delta\chi^{2} threshold in order to ensure bias is below an acceptable level.

  • We tested a number of common χ²-based metrics such as the AIC, BIC and pre-defined p-value cutoffs. These were seen to be generally under-cautious, favouring the simpler model even in the presence of 1-2σ parameter biases. This result motivates our use of an empirical Δχ² calibration. Similarly, when interpreting the goodness-of-fit statistic from a single model, a standard p(χ²) = 0.05 cutoff cannot reliably rule out significant biases.

  • In addition to maximum likelihood-based statistics, we also considered the Bayes factor as a model selection tool. Although useful in extreme cases, it was seen to be only weakly discriminating for cosmological parameter biases in the range 0.2-1σ. We therefore recommend it be used with caution, ideally in conjunction with other model selection metrics.

  • Noise is seen to have a potentially significant impact on both the cosmological bias and the Δχ²_(df) for any given input IA scenario. At high bias, the picture is relatively stable: noise cannot, in general, cause model selection metrics to prefer the simpler model in a case where adopting that model induces large cosmological parameter biases. The reverse is, however, possible. Due to noise, one can end up in a scenario with small cosmological parameter bias, but with selection metrics favouring the more complex model. In this regard, our method tends to err on the side of caution.

Although our qualitative findings are general, it is worth bearing in mind that the details are specific to the DES Y3 cosmic shear only setup. Factors such as choice of two-point statistics, covariance and scale cuts could very easily have an impact, as could modelling choices (baryonic treatment, power spectrum, cosmological model etc) and the choice of sampler. It is therefore important that the simulated analyses used to derive a Δχ2\Delta\chi^{2} threshold are as close as possible (and ideally identical) to the real setup that will be applied to the blinded data.

Model selection is an important topic in cosmology, and in science more generally. It is quite common to have a set of models under consideration, with little prior knowledge about the values of their parameters; what level of complexity is sufficient to describe the data, given its precision, depends on the unknown true model and its unknown parameter values. Given these circumstances, arguably the most cautious approach would be to use the most flexible model, which is more likely to be unbiased. This paper sets out an alternative method, which allows information in the data to inform model selection. Although applicable in similar situations to Bayesian Model Averaging (BMA; Liddle et al., 2006b; Vardanyan et al., 2011), i.e., where there is not enough prior information to justify choosing one model over another, our approach has the advantage of simplicity, and maintains the idea of a fiducial model, which is often useful for practical purposes. It also avoids the prior dependence of methods such as BMA, which is well documented in the literature. Given its generality, simplicity, and the relatively low level of resources required, we foresee applications of the empirical method discussed in this paper to future analyses as a model selection tool in many contexts.

Data Availability

All simulated data vectors, PolyChord chains and Importance Sampling noise samples used in this work are publicly available at https://github.com/AndresaCampos/empirical_model_selection.

Acknowledgements

We thank Scott Dodelson, Sukhdeep Singh, Lucas Secco, Alex Amon, Judit Prat and Agnès Ferté for useful discussions contributing to this work. Many thanks also to Jessie Muir, Noah Weaverdyck, Otávio Alves, Shivam Pandey, and Cyrille Doux for support with code, in particular with setting up the importance sampling pipeline used in this paper. Contour plots were made using the GetDist package (Lewis, 2019).

Andresa Campos thanks the support from the U.S. Department of Energy grant DE-SC0010118 and the NSF AI Institute: Physics of the Future, NSF PHY-2020295. Simon Samuroff is partially supported by NSF grant AST-2206563. RM is supported in part by the Department of Energy grant DE-SC0010118 and in part by a grant from the Simons Foundation (Simons Investigator in Astrophysics, Award ID 620789).

References

Appendix A Parameters & Priors

Our setup matches the fiducial choices of the DES Y3 cosmic shear analysis. The only significant difference is that, for the sake of simplicity, we choose not to use the additional shear ratio likelihood included by Secco, Samuroff et al. (2022); Amon et al. (2022) (a similar decision was made for validating the analysis choices pre-unblinding; see Krause et al. 2021). As a result, our model space is slightly smaller, since we do not need to vary parameters for galaxy bias or lens photo-z error. The corresponding parameters and their priors are shown in Table 2. Note that these are almost identical to the priors used in the Y3 analysis, except for those on the shear calibration parameters, which have been shifted to match the input to the simulated data.

In a particular setup, one should expect some level of projection effects in the marginal parameter constraints (Krause et al., 2021). Since such offsets are artefacts of the way we choose to visualise our results (i.e., the global best fit is still accurate) it is not, in general, useful to think of them as a form of bias; our method does, however, rely on our ability to interpret differences in the 2D projected S8ΩmS_{8}-\Omega_{\mathrm{m}} plane. It is thus helpful to try to quantify such effects in our case. In Figure 9 we show the results of our NLA and TATT analyses on an NLA-only data vector, with our fiducial noise realisation. That is, in this case, both the TATT and NLA models can reproduce the data exactly (up to noise). The offset between the best-fitting parameters when using the two models, shown in Figure 9, is at the level of 0.1σ\sim 0.1\sigma. This effectively provides a floor to the bias in our analysis. Although we can occasionally find biases below this level due to noise (see Section 5 for discussion), we should consider all these cases as unbiased, at least to within the uncertainty due to projection effects. Note that this is consistent with the results of Krause et al. (2021), who performed a similar test using noiseless data (see their Figure 4).

Note that projection effects are complicated, and may be a function of (among other things) the choice of input parameters, noise realisation, priors and constraining power of the data. Although it is reassuring that our result matches that of Krause et al. (2021), there may still be some residual uncertainty in the size of the effect. This is not, however, necessarily a problem for our method. Indeed, variable projection effects would simply add an extra source of noise in the bias-metric relation, which would be factored into our results in the same way as, e.g., chain-to-chain sampler noise.

Refer to caption
Figure 9: Projected 0.3σ0.3\sigma contours from NLA and TATT chains run on a noisy NLA data vector (see Section 4.1 for definitions). The NLA input parameters are A1=0.7A_{1}=0.7, η1=1.7\eta_{1}=-1.7. Since, by construction, both IA models are sufficient to describe the data, any residual offset is thought to be the result of projection effects. As labelled, this is at the level of 0.1σ0.1\sigma for our analysis setup.
Table 2: A summary of the central values and priors used in our analysis. The rows in the top section are cosmological parameters, while those in the lower sections are nuisance parameters corresponding to astrophysics and data calibration. Priors are either uniform, U[a, b], or normally distributed, N(μ, σ). Note the IA parameters are marked with a star because many different values are used as input to our data vectors, as discussed in Section 3.2. The values shown here are used for convenience, whenever it is useful to show/discuss a single data realisation (e.g., in Figure 2).

Parameter   Fiducial Value   Prior
Cosmological Parameters
Ωm          0.29             U[0.1, 0.9]
Ωb          0.052            U[0.03, 0.07]
h           0.75             U[0.55, 0.91]
As          2.38×10⁻⁹        U[0.5, 5.0]×10⁻⁹
ns          0.99             U[0.87, 1.07]
Ωνh²        0.00053          U[0.6, 6.44]×10⁻³
Calibration Parameters
m1          0.0              N(0.0, 0.0059)
m2          0.0              N(0.0, 0.0042)
m3          0.0              N(0.0, 0.0054)
m4          0.0              N(0.0, 0.0072)
Δz1         0.0              N(0.0, 0.018)
Δz2         0.0              N(0.0, 0.015)
Δz3         0.0              N(0.0, 0.011)
Δz4         0.0              N(0.0, 0.017)
Intrinsic Alignment Parameters*
A1          0.7              U[-5, 5]
A2          -1.36            U[-5, 5]
η1          -1.7             U[-5, 5]
η2          -2.5             U[-5, 5]
bTA         1.0              U[0, 2]

Appendix B NLA & TATT Posteriors

For completeness, in Figure 10 we show the TATT model posteriors for the IA scenarios discussed in Section 5 and Figure 8. In that section we discussed three sets of IA model parameters that were selected to give a range of severity of S8ΩmS_{8}-\Omega_{\mathrm{m}} bias in NLA. Our results there showed that significant biases can be present in an NLA analysis without necessarily distorting the shape of the contours or giving a “bad" χ2\chi^{2} (interpreted in the conventional way, using statistically-motivated cut-offs). As expected, the cosmological parameter contours in Figure 10 (upper panel) are consistent with each other. Since the data vectors contain (the same) noise, they are offset from the input point slightly. Depending on the input scenario, the width also varies slightly, primarily due to the tail in A1A2A_{1}-A_{2} space, which correlates with S8S_{8}.

It is worth also briefly commenting here on the shapes of the TATT posteriors. It has been observed before that the TATT model can give rise to teardrop shaped, sometimes slightly bimodal contours in the A1A2A_{1}-A_{2} plane (see for example Secco, Samuroff et al. 2022 Fig. 8 and Sánchez, Prat et al. 2022 Fig. 15). A significant tail to positive (or negative) A2A_{2} tends to create a tail in S8S_{8}, of the sort seen in grey contours in the bottom panel of Figure 10. Note however that the shape and asymmetry depends quite heavily on where the posteriors sit in parameter space (meaning the noise realisation as well as the “true" TATT parameters), and on the constraining power of the data (more constraining power tends to trim away some of the non Gaussian tails). It is not clear that we can use any sort of qualitative assessment based on the TATT posterior shape as an indicator for bias in simpler models.

Refer to caption
Refer to caption
Figure 10: Top: 68% and 95% cosmology confidence contours from TATT model fits on simulated noisy data vectors. Like in Figure 8, the different colours represent samples selected to cover a range from relatively extreme (i.e., large bias in NLA) to mild (low bias) cases. The dotted cross represents the input cosmological parameters (which is offset from the center of the contours due to data vector noise). Bottom: The same, but showing the two TATT amplitude parameters. The markers (dot, star, triangle) show the input IA parameters for each case.

Appendix C Bayes Ratio

Although we ultimately choose to use the Δχ(df)2\Delta\chi^{2}_{(\textrm{df})} as our model comparison statistic, it is also useful to consider other commonly used alternatives. The Bayes ratio has become a popular tool in weak lensing cosmology in recent years, in part because it in principle contains more information than the likelihood. It is also readily available as the by-product of running a nested sampling algorithm to estimate posteriors.

In Figure 11 we show the same 21 IA scenarios as in Figures 4 and 6, but now using the Bayes factor as the model comparison statistic. In the top panel, the open points again show the noiseless case, with evidence values computed using PolyChord. As we can see, there is a weak correlation, with low-bias scenarios tending to fall somewhere between “substantial” and “barely worth mentioning” on the Jeffreys scale (the coloured bands). Interestingly, in none of our IA scenarios, not even in the regime that is functionally unbiased, do we see “strong” evidence for NLA.
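For reference, the Jeffreys bands shown in Figure 11 follow the conventional half-decade steps in $\log_{10}R$. A minimal sketch of the mapping (the function name is ours, and some authors draw the boundaries slightly differently):

```python
import math

def jeffreys_category(R):
    """Classify a Bayes ratio R = Z_NLA / Z_TATT on the Jeffreys scale.

    R > 1 favours NLA; the category boundaries are the conventional
    half-decade steps in log10(R)."""
    if R < 1.0:
        return "favours TATT"
    log10R = math.log10(R)
    if log10R < 0.5:
        return "barely worth mentioning"
    if log10R < 1.0:
        return "substantial"
    if log10R < 1.5:
        return "strong"
    if log10R < 2.0:
        return "very strong"
    return "decisive"
```

With this convention, for instance, $R=2$ lands in “barely worth mentioning” and $R=5$ in “substantial”.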

The scattered points in the top panel of Figure 11 show the impact of noise, as estimated using importance sampling. Each colour represents an IA sample, with different points representing different noise realisations. Clearly $R$ is significantly more sensitive to noise than $\Delta\chi^{2}_{(\textrm{df})}$, as we can see by comparing Figures 11 and 6. We observe essentially two regimes: when the bias in the $S_{8}-\Omega_{\mathrm{m}}$ plane is greater than $\sim 1\sigma$, the Bayes ratio tells us relatively reliably that NLA is disfavoured. When the bias is smaller than $1\sigma$, on the other hand, there is a considerable amount of scatter.
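The reweighting step behind these estimates is straightforward: for equally weighted posterior samples from a chain run on the original data vector, the posterior mean of the likelihood ratio between the noisy and original data equals the evidence ratio $\mathcal{Z}_{\rm new}/\mathcal{Z}_{\rm old}$, so $R$ can be updated for each noise realisation without rerunning the sampler. A minimal sketch (the function name is ours; it assumes equally weighted samples):

```python
import numpy as np

def evidence_ratio_update(logL_old, logL_new):
    """Importance-sampling estimate of Z_new / Z_old.

    For samples theta_i ~ P(theta | d_old), the posterior average of
    L_new(theta_i) / L_old(theta_i) is exactly Z_new / Z_old, so a
    noisy data vector can be reprocessed without a new chain."""
    logw = logL_new - logL_old
    # Subtract the maximum before exponentiating for numerical stability.
    m = logw.max()
    return np.exp(m) * np.mean(np.exp(logw - m))
```

Applying this separately to the NLA and TATT chains, the updated ratio is $R_{\rm new} = R_{\rm old}\,(\mathcal{Z}^{\rm NLA}_{\rm new}/\mathcal{Z}^{\rm NLA}_{\rm old})\,/\,(\mathcal{Z}^{\rm TATT}_{\rm new}/\mathcal{Z}^{\rm TATT}_{\rm old})$.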

The bottom panel in Figure 11 shows the bias probability, conditioned on the Bayes ratio category. That is, given that the data return $R$ in a particular class on the Jeffreys scale, $P$ is the probability that NLA is biased by more than $X\sigma$. There is clearly at least some information here; if the Bayes factor actively favours TATT ($R<1$), there is a high probability that NLA will be biased, even factoring in noise. Values within the “barely worth mentioning” category could essentially go either way. In the “substantial” category things look better, but even here there is a $\sim 10-15\%$ chance of biases larger than $0.3\sigma$, and almost a $50\%$ chance that NLA is biased by more than $0.15\sigma$ in the $S_{8}-\Omega_{\mathrm{m}}$ plane.
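These conditional probabilities can be estimated directly from the paired (bias, $R$) values accumulated over IA scenarios and noise realisations. A minimal sketch (function and argument names are ours):

```python
import numpy as np

def bias_probability(bias_sigma, bayes_ratio, category, threshold):
    """P(|bias| > threshold sigma | R falls in `category`).

    Estimated as a simple frequency over paired (bias, R) values from
    the IA scenarios x noise realisations. `category` is an (R_lo, R_hi)
    interval, e.g. (10**0.5, 10.0) for 'substantial' on the Jeffreys
    scale; `bias_sigma` is the cosmological bias in units of sigma."""
    bias_sigma = np.asarray(bias_sigma)
    bayes_ratio = np.asarray(bayes_ratio)
    lo, hi = category
    in_cat = (bayes_ratio >= lo) & (bayes_ratio < hi)
    if not in_cat.any():
        return np.nan  # no samples fall in this category
    return np.mean(np.abs(bias_sigma[in_cat]) > threshold)
```

The same function, swept over thresholds $X$ and the Jeffreys intervals, reproduces the structure of the bottom panel.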

We can perhaps understand the relative noisiness of $R$ by considering Eq. (19). Assuming a Gaussian likelihood, the Bayes factor scales exponentially with $\Delta\chi^{2}$; any small perturbation in $\chi^{2}$ due to sampling noise is thus magnified exponentially. We cannot say from this whether this is an inherent issue with the Bayes ratio, or one specific to our importance sampling estimate. In the absence of an alternative fast method for estimating $R$ over many noise realisations, however, we recommend $\Delta\chi^{2}_{(\textrm{df})}$ as the more robust metric to use with our method.
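To make the exponential sensitivity concrete: for a Gaussian likelihood whose evidence is dominated by the peak, $R\propto e^{-\Delta\chi^{2}/2}$ up to Occam factors, so a perturbation $\delta$ in $\Delta\chi^{2}$ rescales $R$ by $e^{-\delta/2}$. A minimal numerical illustration (assuming this peak-dominated scaling):

```python
import math

# For a Gaussian likelihood the peak contribution to each evidence
# scales as exp(-chi2_min / 2), so (up to Occam-factor differences)
# R = Z_NLA / Z_TATT ~ exp(-dchi2 / 2), with dchi2 = chi2_NLA - chi2_TATT.
def ratio_rescaling(delta_dchi2):
    """Multiplicative change in R from a perturbation in Delta chi^2."""
    return math.exp(-delta_dchi2 / 2.0)

# A shift of +-2 in dchi2, easily produced by sampler noise in the best
# fit, changes R by a factor of ~e in either direction, enough to move
# it across a Jeffreys category boundary, while dchi2 itself only
# shifts by 2.
```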

Figure 11: Top: The same 21 samples as in Figure 4, but now showing the Bayes ratio $R=\mathcal{Z}_{\rm NLA}/\mathcal{Z}_{\rm TATT}$ rather than $\Delta\chi^{2}_{(\textrm{df})}$. As before, the open points show the bias/evidence ratios estimated by running PolyChord on noiseless data vectors. The filled points show the scatter due to noise (50 noise realisations for each IA scenario; see Section 3.3 for details). The vertical line shows the $0.3\sigma$ bound, and the shaded colours show how the different ranges of $R$ are interpreted according to the Jeffreys scale. Bottom: The same as Figure 6, but showing the Bayes factor $R$ (defined as the ratio of Bayesian evidence values obtained from running NLA and TATT on the same data). The coloured bands represent categories on the Jeffreys scale, and $P$ is the probability of more than $X\sigma$ cosmological bias in the NLA model, given an observed Bayes factor in each category.

Appendix D Sampler Comparison

In this appendix we present a brief comparison of two commonly used nested sampling codes: PolyChord and MultiNest. Although a similar (albeit more extensive) exercise is discussed in Lemos, Weaverdyck et al. (2022), their analysis choices differ significantly from ours, and so it is worth revisiting the question. To this end, we re-analyse our 21 noiseless IA data vectors using MultiNest (500 live points, efficiency $=0.3$, tolerance $=0.01$). The results are then compared with our fiducial PolyChord run (500 live points, num_repeats $=30$, tolerance $=0.01$). We find:

  • The two samplers give consistent point estimates. That is, both reliably locate the posterior mean, and the sampling around the peak is comparable, giving a similar level of noise in the best fit. As a result, the $\chi^{2}$ difference between NLA and TATT analyses is relatively insensitive to the choice of sampler.

  • MultiNest significantly underestimates the width of the $1\sigma$ posteriors on cosmological parameters. This is true in both models; combined with the previous point, it leads to a systematic overestimation of the $S_{8}-\Omega_{\mathrm{m}}$ bias for any given IA scenario. This can be seen in Figure 12, which shows the posteriors estimated by both samplers for a particular IA scenario.

  • The Bayesian evidence estimates from MultiNest are low compared with PolyChord, as shown in Lemos, Weaverdyck et al. (2022). Although this is true for both models, the net effect is to increase $R$ (i.e., to push the Bayes ratio towards favouring NLA more strongly).

Given the observations listed above, we chose to use PolyChord as our fiducial sampler, despite the runtime advantage of MultiNest.
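The second bullet above has a simple arithmetic consequence: the bias expressed in units of $\sigma$ is the parameter offset divided by the posterior width, so underestimating the width inflates the quoted bias proportionally. A minimal illustration (the numbers are hypothetical):

```python
def bias_in_sigma(offset, sigma):
    """Parameter bias expressed in units of the 1-sigma posterior width."""
    return abs(offset) / sigma

# Hypothetical S8 offset of 0.01 with a true posterior width of 0.02:
true_bias = bias_in_sigma(0.01, 0.02)
# If the sampler underestimates the width by 20% (sigma = 0.016), the
# same offset is quoted as a proportionally larger bias:
quoted_bias = bias_in_sigma(0.01, 0.016)
```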

Refer to caption
Figure 12: Marginalised posteriors from a single noisy data vector, with a given input IA scenario. The shaded purple and black contours show the results of fitting that data assuming TATT and NLA respectively, using the PolyChord nested sampling code. The unfilled contours are the results of the same analyses, but using the faster, but less accurate, MultiNest algorithm.

Appendix E Computational Resources

Here we describe the computational resources used for this paper. The aim is to provide an estimate of the computing power required to apply the method described in Section 4.4 to perform model selection. The exact amount of time/resources will naturally vary depending on the details of the analysis pipeline. In our particular case:

  • Generate $\sim$25 IA samples: less than 1 minute on 1 core.

  • Generate $\sim$25 IA data vectors: less than 1 minute per data vector on 1 core.

  • Run 25 NLA chains using PolyChord: around 22 h per chain on 128 cores.

  • Run 25 TATT chains using PolyChord: around 28 h per chain on 128 cores.

  • Generate 50 noise realisations, to be added to each of the 25 data vectors: less than 1 minute on 1 core.

  • Generate importance sampling weights and the $\chi^{2}$ pool for 25$\times$50 noisy data vectors: around 6-12 h on 128 cores.

  • Apply the weights to evaluate the NLA and TATT posteriors and compute the bias and best fit in each case: around 6-12 h on 128 cores.
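Summing the items above gives a rough total budget; a back-of-the-envelope sketch (taking the midpoints of the quoted 6-12 h ranges, and neglecting the sub-minute single-core steps):

```python
# Rough core-hour budget from the items above.
nla_chains  = 25 * 22 * 128   # 25 chains x 22 h x 128 cores
tatt_chains = 25 * 28 * 128   # 25 chains x 28 h x 128 cores
is_weights  = 9 * 128         # 6-12 h on 128 cores (midpoint 9 h)
postprocess = 9 * 128         # same for the bias / best-fit step
total = nla_chains + tatt_chains + is_weights + postprocess
# The nested-sampling chains dominate, with everything else at the
# percent level or below.
```

In our case the total comes to roughly $1.6\times 10^{5}$ core-hours, almost all of it in the PolyChord chains; the importance sampling over $25\times 50$ noisy data vectors adds only a few percent on top.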