
Mesoscopic Bayesian Inference by Solvable Models

Shun Katakami Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan    Shuhei Kashiwamura Graduate School of Science, The University of Tokyo, Bunkyo, Tokyo 113-0033, Japan    Kenji Nagata Center for Basic Research on Materials, National Institute for Materials Science, Ibaraki, 305-0044, Japan    Masaichiro Mizumaki Faculty of Science, Course for Physical Sciences, Kumamoto University, Kumamoto, Kumamoto, 860-8555, Japan    Masato Okada Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan
Abstract

The rapid advancement of data science and artificial intelligence has affected physics in numerous ways, including the application of Bayesian inference, setting the stage for a revolution in research methodology. Our group has proposed Bayesian measurement, a framework that applies Bayesian inference to measurement science with broad applicability across various natural sciences. This framework enables the determination of posterior probability distributions of system parameters, model selection, and the integration of multiple measurement datasets. However, applying Bayesian measurement to real data analysis requires a more sophisticated approach than traditional statistical methods like the Akaike information criterion (AIC) and Bayesian information criterion (BIC), which are designed for an infinite number of measurements $N$. Therefore, in this paper, we propose an analytical theory that explicitly addresses the case where $N$ is finite in the linear regression model. We introduce $O(1)$ mesoscopic variables for the $N$ observation noises. Using this mesoscopic theory, we analyze the three core principles of Bayesian measurement: parameter estimation, model selection, and measurement integration. Furthermore, by introducing these mesoscopic variables, we demonstrate that the difference in free energies, critical for both model selection and measurement integration, can be analytically reduced to a few mesoscopic variables of the $N$ observation noises. This provides a deeper qualitative understanding of model selection and measurement integration, and further provides deeper insights into actual measurements for nonlinear models. Our framework presents a novel approach to understanding Bayesian measurement results.

I Introduction

The rapid development of data science and artificial intelligence has led to numerous studies in physics that actively incorporate these fields [1, 2], aiming for new developments in physics. Among them, Bayesian inference shows high compatibility with traditional physics. Our group has proposed Bayesian measurement as a framework for applying Bayesian inference from statistics to measurement science [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]. Bayesian measurement can be applied to almost all natural sciences, including physics, chemistry, life sciences, and earth and planetary sciences. In this framework, one can determine the posterior probability distribution of parameters for a mathematical model constituting a system. Additionally, if there are multiple mathematical models explaining the same phenomenon, one can perform model selection to determine the most appropriate model solely on the basis of measurement data. Furthermore, Bayesian integration, i.e., measurement integration, enables the integration of multiple data obtained from multiple measurements on the same system and determines how to integrate this data solely on the basis of the data itself. Bayesian measurement consists of three core principles: estimation of the posterior probability distribution of parameters, model selection, and Bayesian integration.

When performing Bayesian measurement, the results of model selection and Bayesian integration vary depending on the fluctuation of the measurement data when the number of data $N$ is finite. While Bayesian inference was first proposed by Thomas Bayes in the 18th century, its theoretical framework was traditionally developed under the assumption of an infinite number of measurement data $N$, as represented by the Bayesian information criterion (BIC) [19]. Consequently, conventional BIC theory proves ineffective for model selection using the Bayesian free energy when dealing with a finite number of data $N$. The construction of theories that explicitly address finite $N$ has become a crucial test of the practicality of Bayesian measurement. Our goal is to go beyond existing theories for an infinite number of data $N$.

The purpose of this paper is to propose a novel theoretical framework for the three core principles of Bayesian measurement: estimation of the posterior probability distribution of parameters, model selection, and Bayesian integration, when the number of measurement data $N$ is finite within the linear regression model. The proposed theory for finite $N$ aims to analytically address the results of model selection and Bayesian integration. In the conventional framework that assumes the infinite limit of measurement data $N$, which is typical of many theoretical frameworks of Bayesian inference, it is impossible to treat the fluctuations arising from the finiteness of $N$ as random variables. The proposed theory is an innovative framework that is fundamentally different from conventional theories. In this paper, we develop a solvable theory for the linear regression model $y=ax+b$ with Gaussian measurement noise, based on $N$ measurement data points. This model, while seemingly simple, is not merely for theoretical analysis. It is widely used in real measurement settings, such as with linear system responses. Furthermore, insights gained from this model can be extended to general nonlinear models.

Let us assume that the observation noise in the $N$ observation data follows a Gaussian distribution. We define $O(1)$ mesoscopic variables consisting of the $N$ Gaussian noises within the linear regression model. Specifically, we define two Gaussian variables and one chi-squared variable, each built from sums over the $N$ Gaussian noises. Using these mesoscopic variables, we propose a mesoscopic theory of the three core principles of Bayesian measurement: estimation of the posterior probability distribution of parameters, model selection, and Bayesian integration.

This paper is structured as follows. In Section II, we develop a theory using mesoscopic variables to express the estimation of the posterior probability distribution of parameters in Bayesian measurement using the linear regression model $y=ax+b$. In Section III, we build on the mesoscopic theory of Section II to propose a mesoscopic theory for model selection. This theory shows that the Bayesian free energy difference $\Delta F$ that determines model selection fluctuates greatly when the number of data $N$ is small. Furthermore, we show that by introducing mesoscopic variables, the free energy difference necessary for model selection can be analytically expressed with one mesoscopic variable of the observation noise. In Section IV, we propose a mesoscopic theory for Bayesian integration building on the mesoscopic theory of Section II, and show that the Bayesian free energy difference $\Delta F$ that determines the Bayesian integration fluctuates significantly when the number of data $N$ is small. Furthermore, by introducing mesoscopic variables, we show that the free energy difference necessary for the Bayesian integration can be analytically expressed with several mesoscopic variables of the $N$ observation noises. In Sections III and IV, we provide the results of numerical calculations of model selection and Bayesian integration, respectively.

II Bayesian Inference with Linear Models

In this section, we will demonstrate how the probability distribution of Bayesian free energy for finite data size in linear models can be described using a small number of variables within the basic framework of Bayesian inference. To advance the logic of Bayesian inference in linear models, we will first explain the mean squared error (MSE) associated with these models. Subsequently, we will derive the Bayesian posterior probability, enabling model parameter estimation, and the Bayesian free energy, which facilitates model selection.

II.1 Mean Squared Error of Linear Models

Here, to prepare for the discussion on Bayesian inference, we present the conventional MSE for linear models. Consider regressing data $D=\{(x_{i},y_{i})\}_{i=1}^{N}$ with $N$ samples using a two-variable linear model as follows:

$y = ax + b$. (1)

In this context, the MSE is given by

$E(a,b) = \frac{1}{2N}\sum_{i=1}^{N}\left\{y_{i}-(ax_{i}+b)\right\}^{2}$ (2)
$= \frac{1}{2}\left(\bar{y^{2}}-2a\bar{xy}-2b\bar{y}+a^{2}\bar{x^{2}}+2ab\bar{x}+b^{2}\right)$. (3)

Here, we introduce the empirical means of the variables

$\bar{x} = \frac{1}{N}\sum_{i=1}^{N}x_{i}$, (4)
$\bar{y} = \frac{1}{N}\sum_{i=1}^{N}y_{i}$, (5)
$\bar{x^{2}} = \frac{1}{N}\sum_{i=1}^{N}x_{i}^{2}$, (6)
$\bar{y^{2}} = \frac{1}{N}\sum_{i=1}^{N}y_{i}^{2}$, (7)
$\bar{xy} = \frac{1}{N}\sum_{i=1}^{N}x_{i}y_{i}$. (8)

For simplicity, let us assume that the input mean of the data vanishes, $\bar{x}=0$. Under this assumption, the MSE $E(a,b)$ can be reformulated as:

$E(a,b) = \mathcal{E}_{a}(a) + \mathcal{E}_{b}(b) + E(\hat{a},\hat{b}) \geq E(\hat{a},\hat{b})$, (9)

where $\mathcal{E}_{a}(a)=\frac{1}{2}\bar{x^{2}}\left(a-\frac{\bar{xy}}{\bar{x^{2}}}\right)^{2}$, $\mathcal{E}_{b}(b)=\frac{1}{2}(b-\bar{y})^{2}$, $\hat{a}=\frac{\bar{xy}}{\bar{x^{2}}}$, and $\hat{b}=\bar{y}$. The minimum value of the MSE, $E(\hat{a},\hat{b})$, is referred to as the residual error.
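As a minimal numerical sketch of this decomposition (the function name, data, and noise level below are illustrative assumptions, not taken from the paper), the estimates $\hat{a}$, $\hat{b}$ and the residual error can be computed directly from the empirical means:

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y = a*x + b under the x_bar = 0 convention (Eq. (9)):
    a_hat = xy_bar / x2_bar, b_hat = y_bar, plus the residual error E(a_hat, b_hat)."""
    x = x - x.mean()  # enforce x_bar = 0, as assumed in the text
    a_hat = np.mean(x * y) / np.mean(x**2)
    b_hat = np.mean(y)
    resid = 0.5 * np.mean((y - (a_hat * x + b_hat))**2)  # E(a_hat, b_hat), Eq. (2)
    return a_hat, b_hat, resid

# Illustrative data around y = 1.0*x + 0.5 with Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 100)
y = 1.0 * x + 0.5 + rng.normal(0.0, 0.3, size=x.size)
print(fit_linear(x, y))
```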

II.2 Representation Through Microscopic Variables

II.2.1 Microscopic Notation of Mean Squared Error

From this section, we introduce a noise model to facilitate the discussion of Bayesian inference. Up to this point, we have not specified the noise model added to the data. Here, we assume the true parameters of $a$ and $b$ to be $a_{0}$ and $b_{0}$, respectively, and that the noise added to the data $D$, denoted as $\{n_{i}\}_{i=1}^{N}$, follows a normal distribution with mean zero and variance $\sigma^{2}_{0}$. The process of generating the data is assumed to adhere to the following relation:

$y_{i} = a_{0}x_{i} + b_{0} + n_{i}$, (10)

where the probability distribution for the noise $n_{i}$ is given by:

$p(n_{i}) = \frac{1}{\sqrt{2\pi\sigma^{2}_{0}}}\exp\left(-\frac{n_{i}^{2}}{2\sigma^{2}_{0}}\right)$. (11)

In this section, we delve deeper into understanding linear models by examining the dependency of the MSE on the stochastic variables $\{n_{i}\}_{i=1}^{N}$. Given that $\bar{x}=0$, the empirical means of inputs and outputs can be described as follows:

$\bar{x} = \frac{1}{N}\sum_{i=1}^{N}x_{i} = 0$, (12)
$\bar{xy} = \frac{1}{N}\sum_{i=1}^{N}x_{i}y_{i}$ (13)
$= a_{0}\bar{x^{2}} + \bar{xn}$, (14)
$\bar{y} = \frac{1}{N}\sum_{i=1}^{N}y_{i}$ (15)
$= b_{0} + \bar{n}$, (16)
$\bar{y^{2}} = \frac{1}{N}\sum_{i=1}^{N}y_{i}^{2}$ (17)
$= a_{0}^{2}\bar{x^{2}} + b_{0}^{2} + \bar{n^{2}} + 2b_{0}\bar{n} + 2a_{0}\bar{xn}$. (18)

This can be described by introducing:

$\bar{n} = \frac{1}{N}\sum_{i=1}^{N}n_{i}$, (19)
$\bar{n^{2}} = \frac{1}{N}\sum_{i=1}^{N}n_{i}^{2}$. (20)

Therefore, the MSE $E(a,b)$ can be expressed as:

$E(a,b) = \frac{1}{2}\bar{x^{2}}\left(a-a_{0}-\frac{\bar{xn}}{\bar{x^{2}}}\right)^{2} + \frac{1}{2}\left(b-b_{0}-\bar{n}\right)^{2} + \frac{1}{2}\left(-\frac{\bar{xn}^{2}}{\bar{x^{2}}}-\bar{n}^{2}+\bar{n^{2}}\right)$. (21)

II.2.2 Bayesian Inference for Linear Models

From Equation (11), the conditional probability of observing the output $y_{i}$ given the input variables and model parameters is described by

$p(y_{i}|a,b) = \frac{1}{\sqrt{2\pi\sigma^{2}_{0}}}\exp\left[-\frac{(y_{i}-ax_{i}-b)^{2}}{2\sigma^{2}_{0}}\right]$. (22)

Consequently, the joint conditional probability of all observed outputs $Y=\{y_{i}\}_{i=1}^{N}$ can be expressed as

$p(Y|a,b) = \prod_{i=1}^{N}p(y_{i}|a,b)$ (23)
$= \left(\frac{1}{\sqrt{2\pi\sigma^{2}_{0}}}\right)^{N}\exp\left(-\frac{N}{\sigma^{2}_{0}}E(a,b)\right)$. (24)

Utilizing the prior distributions of the linear model parameters $a$ and $b$, denoted as $p(a)$ and $p(b)$, respectively, the posterior distribution of the model parameters $a,b$ according to Bayes’ theorem can be formulated as:

$p(a,b|Y) = \frac{p(Y|a,b)p(a)p(b)}{p(Y)}$. (25)

When the prior distributions of the model parameters $a$ and $b$ are independently assumed to be uniform within the ranges $[-\xi_{a},\xi_{a}]$ and $[-\xi_{b},\xi_{b}]$, respectively, the prior distributions for each parameter can be expressed as follows:

$p(a) = \frac{1}{2\xi_{a}}\left\{\Theta(a+\xi_{a})-\Theta(a-\xi_{a})\right\}$, (26)
$p(b) = \frac{1}{2\xi_{b}}\left\{\Theta(b+\xi_{b})-\Theta(b-\xi_{b})\right\}$, (27)
where $\Theta(\cdot)$ denotes the Heaviside step function.

The term $p(Y)$, known as the marginal likelihood, is given by

$p(Y) = \int\mathrm{d}a\,\mathrm{d}b\ p(Y|a,b)p(a)p(b)$. (28)

Given that the prior distributions are uniform, the posterior distribution can be expressed as

$p(a,b|Y) = \left(\frac{1}{\sqrt{2\pi\sigma^{2}_{0}}}\right)^{N}\exp\left(-\frac{N}{\sigma^{2}_{0}}E(a,b)\right)$
$\times \frac{1}{2\xi_{a}}\left\{\Theta(a+\xi_{a})-\Theta(a-\xi_{a})\right\} \times \frac{1}{2\xi_{b}}\left\{\Theta(b+\xi_{b})-\Theta(b-\xi_{b})\right\} \times \frac{1}{p(Y)}$ (29)
$= \frac{2N\sqrt{\bar{x^{2}}}}{\sigma^{2}_{0}\pi}\exp\left\{-\frac{N}{\sigma^{2}_{0}}\left[\mathcal{E}_{a}(a)+\mathcal{E}_{b}(b)\right]\right\}$
$\times \left\{\Theta(a+\xi_{a})-\Theta(a-\xi_{a})\right\}\left\{\Theta(b+\xi_{b})-\Theta(b-\xi_{b})\right\}$
$\times \left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(-\xi_{a}-\frac{\bar{xy}}{\bar{x^{2}}}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(\xi_{a}-\frac{\bar{xy}}{\bar{x^{2}}}\right)\right)\right]^{-1}$
$\times \left[\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{0}}}\left(-\xi_{b}-\bar{y}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{0}}}\left(\xi_{b}-\bar{y}\right)\right)\right]^{-1}$. (30)

This expression enables us to compute the conditional probability of the model parameters given the data, as the posterior distribution.

Here, we derive the Bayesian free energy, which serves as an indicator for model selection and is defined as the negative logarithm of the marginal likelihood.

$F(Y) = -\ln p(Y)$ (31)
$= \frac{N}{2}\ln(2\pi\sigma^{2}_{0}) - \ln\left(\frac{\sigma^{2}_{0}\pi}{2N}\right) + \frac{1}{2}\ln\left(\bar{x^{2}}\right) + \ln(2\xi_{a}) + \ln(2\xi_{b}) + \frac{N}{\sigma^{2}_{0}}E(\hat{a},\hat{b})$
$- \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(-\xi_{a}-\hat{a}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(\xi_{a}-\hat{a}\right)\right)\right]$
$- \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{0}}}\left(-\xi_{b}-\hat{b}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{0}}}\left(\xi_{b}-\hat{b}\right)\right)\right]$. (32)
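A minimal sketch of Equation (32) in code may help; the following Python function (an illustrative assumption, not code from the paper) evaluates the Bayesian free energy of the two-variable model for centered inputs, using scipy's complementary error function:

```python
import numpy as np
from scipy.special import erfc

def free_energy_two_var(x, y, sigma0_sq, xi_a, xi_b):
    """Bayesian free energy F(Y) of Eq. (32) for the model y = a*x + b,
    centered inputs (x_bar = 0), and uniform priors on [-xi_a, xi_a], [-xi_b, xi_b]."""
    N = x.size
    x2 = np.mean(x**2)
    a_hat = np.mean(x * y) / x2
    b_hat = np.mean(y)
    E_min = 0.5 * np.mean((y - (a_hat * x + b_hat))**2)  # residual error
    ca = np.sqrt(N * x2 / (2.0 * sigma0_sq))
    cb = np.sqrt(N / (2.0 * sigma0_sq))
    return (0.5 * N * np.log(2.0 * np.pi * sigma0_sq)
            - np.log(sigma0_sq * np.pi / (2.0 * N))
            + 0.5 * np.log(x2)
            + np.log(2.0 * xi_a) + np.log(2.0 * xi_b)
            + N * E_min / sigma0_sq
            - np.log(erfc(ca * (-xi_a - a_hat)) - erfc(ca * (xi_a - a_hat)))
            - np.log(erfc(cb * (-xi_b - b_hat)) - erfc(cb * (xi_b - b_hat))))
```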

II.3 Representation Through Mesoscopic Variables

Up to this point, each statistical quantity has been treated empirically as an average. This section introduces the concept of mesoscopic variables, which enables a theoretical treatment of these quantities.

II.3.1 Residual Error Through Mesoscopic Variables

In the previous sections, the residual error was obtained as a probabilistic variable dependent on the stochastic variables $\{n_{i}\}_{i=1}^{N}$. Here, we discuss the probability distribution of the value $E(\hat{a},\hat{b})\times\frac{2N}{\sigma^{2}_{0}}$ and demonstrate that it follows a chi-squared distribution. The residual error was given by

$E(\hat{a},\hat{b}) = \frac{1}{2}\left(-\frac{\bar{xn}^{2}}{\bar{x^{2}}}-\bar{n}^{2}+\bar{n^{2}}\right)$. (33)

The first and second terms on the right side of Equation (33) are independently distributed. Therefore, $E(\hat{a},\hat{b})\times\frac{2N}{\sigma^{2}_{0}}$ follows a chi-squared distribution with $N-2$ degrees of freedom (see Appendix A for the proof). Introducing a probability variable $\upsilon$ that follows a chi-squared distribution with $N-2$ degrees of freedom, we can write

$p(\upsilon) = \frac{1}{2^{\frac{N-2}{2}}\Gamma\left(\frac{N-2}{2}\right)}\upsilon^{\frac{N-4}{2}}\exp\left(-\frac{\upsilon}{2}\right)$. (34)

Hence, the left side of Equation (33), which is the residual error, can be expressed as

$E(\hat{a},\hat{b}) = \frac{\sigma^{2}_{0}}{2N}\upsilon$. (35)

Furthermore, the first and second terms on the right side of Equation (33) can be expressed using independent stochastic variables $\tau_{1},\tau_{2}$, each following a normal distribution $\mathcal{N}(0,1)$, as

$\frac{\bar{xn}^{2}}{\bar{x^{2}}} = \frac{\sigma^{2}_{0}}{N}\tau_{1}^{2}$, (36)
$\bar{n}^{2} = \frac{\sigma^{2}_{0}}{N}\tau_{2}^{2}$. (37)

This approach enables us to theoretically analyze the residual error, understand its distribution and behavior within the framework of Bayesian inference, and provide a more nuanced understanding of the error’s properties. The respective representations of the microscopic and mesoscopic variables are summarized in Table 1.

Table 1: Summary of the respective representations of the micro and meso variables, and their respective relationships.

Meso variable | Micro representation
$\tau_{1}$ | $\sqrt{\frac{N}{\sigma^{2}_{0}\bar{x^{2}}}}\,\bar{xn}$
$\tau_{2}$ | $\sqrt{\frac{N}{\sigma^{2}_{0}}}\,\bar{n}$
$\upsilon$ | $\frac{N}{\sigma^{2}_{0}}\left(-\frac{\bar{xn}^{2}}{\bar{x^{2}}}-\bar{n}^{2}+\bar{n^{2}}\right)$

Quantity | Micro representation | Meso representation
$\hat{a}$ | $a_{0}+\frac{\bar{xn}}{\bar{x^{2}}}$ | $a_{0}+\sqrt{\frac{\sigma^{2}_{0}}{N\bar{x^{2}}}}\tau_{1}$
$\hat{b}$ | $b_{0}+\bar{n}$ | $b_{0}+\sqrt{\frac{\sigma^{2}_{0}}{N}}\tau_{2}$
$E(\hat{a},\hat{b})$ | $\frac{1}{2}\left(-\frac{\bar{xn}^{2}}{\bar{x^{2}}}-\bar{n}^{2}+\bar{n^{2}}\right)$ | $\frac{\sigma^{2}_{0}}{2N}\upsilon$
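The mapping of Table 1 is straightforward to implement. The sketch below (illustrative code, with assumed inputs and seed) builds the mesoscopic variables from a microscopic noise vector and checks their moments against $\tau_{1},\tau_{2}\sim\mathcal{N}(0,1)$ and a mean of $N-2$ for $\upsilon$:

```python
import numpy as np

def meso_from_micro(x, n, sigma0_sq):
    """Map microscopic noises {n_i} to the mesoscopic variables of Table 1 (x_bar = 0)."""
    N = x.size
    x2, xn = np.mean(x**2), np.mean(x * n)
    nbar, n2 = np.mean(n), np.mean(n**2)
    tau1 = np.sqrt(N / (sigma0_sq * x2)) * xn
    tau2 = np.sqrt(N / sigma0_sq) * nbar
    upsilon = (N / sigma0_sq) * (-xn**2 / x2 - nbar**2 + n2)
    return tau1, tau2, upsilon

# Moment check: tau1, tau2 should be ~N(0,1); upsilon should average to N - 2
rng = np.random.default_rng(1)
N, sigma0_sq = 1000, 1.0
x = rng.uniform(-1.0, 1.0, N)
x -= x.mean()
samples = np.array([meso_from_micro(x, rng.normal(0.0, np.sqrt(sigma0_sq), N), sigma0_sq)
                    for _ in range(2000)])
print(samples.mean(axis=0))        # approx (0, 0, N - 2)
print(samples[:, :2].var(axis=0))  # approx (1, 1)
```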

II.3.2 Posterior Distribution Through Mesoscopic Variables

Using the mesoscopic variables introduced in the previous section, we can reformulate the posterior distribution. From Equation (30), the posterior distribution $p(a,b|Y)$ can be rewritten as:

$p(a,b|Y) = \frac{2N\sqrt{\bar{x^{2}}}}{\sigma^{2}_{0}\pi}\exp\left\{-\frac{N}{2\sigma^{2}_{0}}\left[\bar{x^{2}}\left(a-\hat{a}(\tau_{1})\right)^{2}+\left(b-\hat{b}(\tau_{2})\right)^{2}\right]\right\}$
$\times \left\{\Theta(a+\xi_{a})-\Theta(a-\xi_{a})\right\}\left\{\Theta(b+\xi_{b})-\Theta(b-\xi_{b})\right\}$
$\times \left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(-\xi_{a}-\hat{a}(\tau_{1})\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(\xi_{a}-\hat{a}(\tau_{1})\right)\right)\right]^{-1}$
$\times \left[\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{0}}}\left(-\xi_{b}-\hat{b}(\tau_{2})\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{0}}}\left(\xi_{b}-\hat{b}(\tau_{2})\right)\right)\right]^{-1}$. (38)

Here, $\hat{a}(\tau_{1})=a_{0}+\sqrt{\frac{\sigma^{2}_{0}}{N\bar{x^{2}}}}\tau_{1}$ and $\hat{b}(\tau_{2})=b_{0}+\sqrt{\frac{\sigma^{2}_{0}}{N}}\tau_{2}$. Hence, the posterior distribution is determined solely by the two stochastic variables $\tau_{1}$ and $\tau_{2}$. Moreover, since Equation (38) enables independent calculations for $a$ and $b$, the distributions of the model parameters $a,b$ given the model, denoted as $p_{\mathrm{m}}(a),p_{\mathrm{m}}(b)$, can be expressed as

$p_{\mathrm{m}}(a) = \int\mathrm{d}\tau_{1}\,\delta(a-\hat{a}(\tau_{1}))p(\tau_{1})$ (39)
$= \sqrt{\frac{N\bar{x^{2}}}{2\pi\sigma^{2}_{0}}}\exp\left(-\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}(a-a_{0})^{2}\right)$, (40)
$p_{\mathrm{m}}(b) = \int\mathrm{d}\tau_{2}\,\delta(b-\hat{b}(\tau_{2}))p(\tau_{2})$ (41)
$= \sqrt{\frac{N}{2\pi\sigma^{2}_{0}}}\exp\left(-\frac{N}{2\sigma^{2}_{0}}(b-b_{0})^{2}\right)$. (42)

This shows that the posterior distribution can be represented in terms of mesoscopic variables, providing a theoretical framework to understand the distribution of model parameters $a$ and $b$ on the basis of observed data and assumed noise characteristics.

Here, we reformulate the Bayesian free energy using mesoscopic variables. From Equation (32), the Bayesian free energy can be rewritten as

$F(Y) = \frac{N}{2}\ln(2\pi\sigma^{2}_{0}) - \ln\left(\frac{\sigma^{2}_{0}\pi}{2N}\right) + \frac{1}{2}\ln\left(\bar{x^{2}}\right) + \ln(2\xi_{a}) + \ln(2\xi_{b}) + \frac{\upsilon}{2}$
$- \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(-\xi_{a}-\hat{a}(\tau_{1})\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(\xi_{a}-\hat{a}(\tau_{1})\right)\right)\right]$
$- \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{0}}}\left(-\xi_{b}-\hat{b}(\tau_{2})\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{0}}}\left(\xi_{b}-\hat{b}(\tau_{2})\right)\right)\right]$. (43)

Note that in the limit of large $N$, the negative logarithmic terms in the second and third lines of Equation (43) converge to $-\ln 2$. Therefore, the effect of stochastic fluctuations is effectively captured solely by the term $\upsilon$.

Thus, the Bayesian free energy is determined by three stochastic variables $\upsilon,\tau_{1},\tau_{2}$, and can be expressed as $F(Y)=F(\upsilon,\tau_{1},\tau_{2})$. The probability distribution of the Bayesian free energy is

$p(F) = \int\mathrm{d}\upsilon\,\mathrm{d}\tau_{1}\,\mathrm{d}\tau_{2}\,\delta(F-F(\upsilon,\tau_{1},\tau_{2}))p(\upsilon)p(\tau_{1})p(\tau_{2})$. (44)

In this section, we derive the representation of the probability distribution using mesoscopic variables. Although it is possible to describe the probability distribution without introducing mesoscopic variables, using microscopic variables leads to a computational complexity that scales proportionally with the number of data points $N$. In contrast, the mesoscopic variable representation enables us to compute the probability distribution independently of the number of data points $N$.

Specifically, the impact of the $\ln\,\mathrm{erfc}$ terms in Equation (43) is negligible, so they can be treated as constants ($-\ln 2$), leaving the free energy distribution dependent only on $\upsilon$. Since $\upsilon$ follows a chi-squared distribution, Equation (44) can be calculated approximately in closed form.
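Under that approximation, the density of $F$ is a shifted and rescaled chi-squared density. A sketch of this closed-form approximation (illustrative code; the function name and arguments are assumptions, not from the paper):

```python
import numpy as np
from scipy.stats import chi2

def approx_free_energy_pdf(F, N, sigma0_sq, x2_bar, xi_a, xi_b):
    """Approximate p(F) from Eqs. (43)-(44): replace both -ln[erfc - erfc] terms
    by -ln 2, so F = const + upsilon/2 with upsilon ~ chi-squared(N - 2)."""
    const = (0.5 * N * np.log(2.0 * np.pi * sigma0_sq)
             - np.log(sigma0_sq * np.pi / (2.0 * N))
             + 0.5 * np.log(x2_bar)
             + np.log(2.0 * xi_a) + np.log(2.0 * xi_b)
             - 2.0 * np.log(2.0))
    # change of variables upsilon = 2 (F - const), |d upsilon / d F| = 2
    return 2.0 * chi2.pdf(2.0 * (F - const), df=N - 2)
```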

II.4 Numerical Experiments: Bayesian Inference

Here, we numerically verify that the results of Bayesian estimation using the microscopic and mesoscopic expressions coincide. First, Figure 1 presents the probability distributions of residual errors calculated from the microscopic expression (33) and the mesoscopic expression (35) for stochastically generated data. Panels (a)–(c) of Figure 1 show the probability distribution of normalized residual errors calculated using the microscopic expression (33) for 100,000 artificially generated data patterns with model parameters $a_{0}=1.0$, $b_{0}=0.0$, $\sigma_{0}^{2}=1.0$. Panels (d)–(f) display the probability distribution obtained from 100,000 samples of the probability distribution of residual errors under the mesoscopic expression (35). Comparing the top and bottom rows of Figure 1, we can confirm that the distributions of residual errors from both microscopic and mesoscopic expressions match. As seen in Equation (35), the residual error follows a scaled chi-squared distribution, and Figure 1 demonstrates that as the number of data points increases, the chi-squared distribution asymptotically approaches a Gaussian distribution.
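A minimal sketch of this micro–meso comparison (illustrative code with assumed seed and inputs; only the residual-error part of the experiment is reproduced):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
N, a0, b0, sigma0_sq = 20, 1.0, 0.0, 1.0
x = rng.uniform(-1.0, 1.0, N)
x -= x.mean()
x2 = np.mean(x**2)

# Microscopic route, Eq. (33): fit artificial datasets and record the residual error
micro = []
for _ in range(100_000):
    y = a0 * x + b0 + rng.normal(0.0, np.sqrt(sigma0_sq), N)
    a_hat = np.mean(x * y) / x2
    b_hat = np.mean(y)
    micro.append(0.5 * np.mean((y - a_hat * x - b_hat)**2))

# Mesoscopic route, Eq. (35): E = sigma0^2/(2N) * upsilon, upsilon ~ chi-squared(N - 2)
meso = sigma0_sq / (2.0 * N) * chi2.rvs(df=N - 2, size=100_000, random_state=rng)

print(np.mean(micro), np.mean(meso))  # both approx sigma0_sq * (N - 2) / (2 * N)
```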

Figure 1 (panels (a)–(f)): Probability distribution of residual errors. (a)–(c): Probability distribution of residual errors calculated from the microscopic expression (33) for 100,000 artificially generated data points with model parameters $a_{0}=1.0$, $b_{0}=0.0$, $\sigma_{0}^{2}=1.0$. (d)–(f): Probability distribution obtained from 100,000 samples of the probability distribution of residual errors using the mesoscopic expression (35). Solid black lines represent the theoretical curves calculated from the chi-squared distribution (Eq. (35)).

Next, Figure 2 presents the probability distributions of free energy calculated from the microscopic expression (32) and the mesoscopic expression (43) for stochastically generated data. Panels (a)–(c) of Figure 2 show the probability distribution of normalized values of free energy calculated using the microscopic expression (32) for 100,000 artificially generated data points with model parameters $a_{0}=1.0$, $b_{0}=0.0$, $\sigma_{0}^{2}=1.0$. Meanwhile, panels (d)–(f) display the probability distribution obtained from 100,000 samples of the probability distribution of free energy using the mesoscopic expression (43). A comparison between the top and bottom rows of Figure 2 confirms that the distributions of free energy from both the microscopic and mesoscopic expressions match.

Figure 2 (panels (a)–(f)): Probability distribution of free energy density. (a)–(c): Probability distribution of normalized values of free energy, where the normalization is performed by dividing the free energy by the number of data points $N$. These values are calculated from the microscopic expression (32) using 100,000 artificially generated data points with model parameters $a_{0}=1.0$, $b_{0}=0.0$, $\sigma_{0}^{2}=1.0$. (d)–(f): Probability distribution obtained from 100,000 samples of the probability distribution of free energy using the mesoscopic expression (43). Solid black lines represent the theoretical curves calculated from the chi-squared distribution, where the terms in the second and third lines of the mesoscopic expression (43) were each approximated as $-\ln 2$.

III Model Selection

This section explores model selection between a two-variable and a one-variable linear regression model using the Bayesian free energy, as discussed in previous sections. That is, we deal with the problem of which model best fits a given dataset $D=\{(x_{i},y_{i})\}_{i=1}^{N}$. The two models are defined as follows.

$y_{i} = ax_{i}$, (45)
$y_{i} = ax_{i} + b$. (46)

Since the theoretical analysis of the two-variable model was covered in the previous section, this section first discusses the theoretical analysis of the one-variable model. Then, by considering the relationship between the two models via mesoscopic variables, we discuss the difference in free energy and the nature of model selection. In this section, the noise level is assumed to be predefined. The case where the noise level is also estimated is discussed in Appendix B.

III.1 Representation of the One-Variable Linear Regression Model Using Microscopic Variables

In this section, we assume that the data are generated from the one-variable model, i.e., according to the following equation:

$y_{i} = a_{0}x_{i} + n_{i}$, (47)

where the $\{n_{i}\}_{i=1}^{N}$ are normally distributed with mean zero and variance $\sigma^{2}_{0}$.

III.1.1 Microscopic Notation of Mean Squared Error for One-Variable Linear Model

The MSE, similar to the discussions in the previous sections, can be written as

$E(a) = \frac{1}{2}\left(\bar{y^{2}}-2a\bar{xy}+a^{2}\bar{x^{2}}\right)$ (48)
$= \frac{1}{2}\left[\bar{x^{2}}\left(a-\frac{\bar{xy}}{\bar{x^{2}}}\right)^{2}-\frac{\bar{xy}^{2}}{\bar{x^{2}}}+\bar{y^{2}}\right]$ (49)
$= \mathcal{E}_{a}(a) + E(\hat{a})$. (50)

Given that $\bar{x}=0$, the empirical means of input and output can be described as

$\bar{x} = \frac{1}{N}\sum_{i=1}^{N}x_{i} = 0$, (51)
$\bar{xy} = \frac{1}{N}\sum_{i=1}^{N}x_{i}y_{i}$ (52)
$= a_{0}\bar{x^{2}} + \bar{xn}$, (53)
$\bar{y} = \frac{1}{N}\sum_{i=1}^{N}y_{i}$ (54)
$= \bar{n}$, (55)
$\bar{y^{2}} = \frac{1}{N}\sum_{i=1}^{N}y_{i}^{2}$ (56)
$= a_{0}^{2}\bar{x^{2}} + \bar{n^{2}} + 2a_{0}\bar{xn}$. (57)

Here, the residual error $E(\hat{a})$ can be expressed as

$E(\hat{a}) = \frac{1}{2}\left[-\frac{\bar{xn}^{2}}{\bar{x^{2}}}+\bar{n^{2}}\right]$. (58)

III.1.2 Bayesian Inference for One-Variable Linear Model

Assuming that each noise $n_{i}$ added to the data $D=\{(x_{i},y_{i})\}_{i=1}^{N}$ independently follows a normal distribution with mean zero and variance $\sigma^{2}_{0}$, the joint conditional probability of all output data $Y=\{y_{i}\}_{i=1}^{N}$ given the input variables and model parameter can be written as

$p(Y|a) = \prod_{i=1}^{N}p(y_{i}|a)$ (59)
$= \left(\frac{1}{\sqrt{2\pi\sigma^{2}_{0}}}\right)^{N}\exp\left(-\frac{N}{\sigma^{2}_{0}}E(a)\right)$. (60)

Therefore, the marginal likelihood $p(Y)$ can be expressed as

$p(Y) = \int\mathrm{d}a\ p(Y|a)p(a)$. (61)

According to Bayes’ theorem, the posterior distribution is

$p(a|Y) = \left(\frac{1}{\sqrt{2\pi\sigma^{2}_{0}}}\right)^{N}\exp\left(-\frac{N}{\sigma^{2}_{0}}E(a)\right) \times \frac{1}{2\xi_{a}}\left\{\Theta(a+\xi_{a})-\Theta(a-\xi_{a})\right\}\frac{1}{p(Y)}$ (62)
$= \sqrt{\frac{2N\bar{x^{2}}}{\sigma^{2}_{0}\pi}}\exp\left\{-\frac{N}{\sigma^{2}_{0}}\mathcal{E}_{a}(a)\right\}\left\{\Theta(a+\xi_{a})-\Theta(a-\xi_{a})\right\}$
$\times \left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(-\xi_{a}-\hat{a}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(\xi_{a}-\hat{a}\right)\right)\right]^{-1}$. (63)

Here, we derive the Bayesian free energy for a one-variable linear regression model. The Bayesian free energy is obtained by taking the negative logarithm of the marginal likelihood.

$F(Y) = \frac{N}{2}\ln(2\pi\sigma^{2}_{0}) - \frac{1}{2}\ln\left(\frac{\sigma^{2}_{0}\pi}{2N\bar{x^{2}}}\right) + \ln(2\xi_{a}) + \frac{N}{\sigma^{2}_{0}}E(\hat{a})$
$- \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(-\xi_{a}-\hat{a}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(\xi_{a}-\hat{a}\right)\right)\right]$. (64)

III.2 Representation of the One-Variable Linear Regression Model Using Mesoscopic Variables

Up to this point, each statistical quantity has been considered as an empirical mean. This section, following the approach of the previous one, introduces mesoscopic variables to provide a theoretical framework for handling these quantities.

III.2.1 Residual Error in One-Variable Linear Regression Model Through Mesoscopic Variables

In the previous sections, the residual error was obtained as a probabilistic variable dependent on the stochastic variables $\{n_{i}\}_{i=1}^{N}$. Here, we discuss the probability distribution of the value $E(\hat{a})\times\frac{2N}{\sigma^{2}_{0}}$ and demonstrate that it follows a chi-squared distribution. The residual error was given by

$E(\hat{a}) = \frac{1}{2}\left(-\frac{\bar{xn}^{2}}{\bar{x^{2}}}+\bar{n^{2}}\right)$. (65)

The terms on the right side of Equation (65) are independently distributed. Therefore, $E(\hat{a})\times\frac{2N}{\sigma^{2}_{0}}$ follows a chi-squared distribution with $N-1$ degrees of freedom. Introducing a probability variable $\upsilon_{2}$ that follows a chi-squared distribution with $N-1$ degrees of freedom, we can write

$p(\upsilon_{2}) = \frac{1}{2^{\frac{N-1}{2}}\Gamma\left(\frac{N-1}{2}\right)}\upsilon_{2}^{\frac{N-3}{2}}\exp\left(-\frac{\upsilon_{2}}{2}\right)$. (66)

Thus, the left side of Equation (65), which is the residual error, can be expressed as

$E(\hat{a}) = \frac{\sigma^{2}_{0}}{2N}\upsilon_{2}$. (67)

Furthermore, the first term on the right side of Equation (65) can be expressed using an independent stochastic variable $\tau_{1}$, following a normal distribution $\mathcal{N}(0,1)$, as

$\frac{\bar{xn}^{2}}{\bar{x^{2}}} = \frac{\sigma^{2}_{0}}{N}\tau_{1}^{2}$. (68)

III.2.2 Posterior Distribution in One-Variable Linear Regression Model Through Mesoscopic Variables

Using the mesoscopic variables introduced in the previous section, we can reformulate the posterior distribution. From Equation (63), the posterior distribution $p(a|Y)$ can be rewritten as

$p(a|Y) = \sqrt{\frac{2N\bar{x^{2}}}{\sigma^{2}_{0}\pi}}\exp\left\{-\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}\left(a-\hat{a}(\tau_{1})\right)^{2}\right\}\left\{\Theta(a+\xi_{a})-\Theta(a-\xi_{a})\right\}$
$\times \left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(-\xi_{a}-\hat{a}(\tau_{1})\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(\xi_{a}-\hat{a}(\tau_{1})\right)\right)\right]^{-1}$. (69)

Thus, the posterior distribution is determined solely by the stochastic variable $\tau_{1}$. Moreover, the distribution of the model parameter $a$, given the model, denoted as $p_{\mathrm{m}}(a)$, can be expressed as

$p_{\mathrm{m}}(a) = \int\mathrm{d}\tau_{1}\,\delta(a-\hat{a}(\tau_{1}))p(\tau_{1})$ (70)
$= \sqrt{\frac{N\bar{x^{2}}}{2\pi\sigma^{2}_{0}}}\exp\left(-\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}(a-a_{0})^{2}\right)$. (71)

Here, we reformulate the Bayesian free energy using mesoscopic variables. From Equation (64), the Bayesian free energy can be rewritten as

$F(Y) = \frac{N}{2}\ln(2\pi\sigma^{2}_{0}) - \frac{1}{2}\ln\left(\frac{\sigma^{2}_{0}\pi}{2N\bar{x^{2}}}\right) + \ln(2\xi_{a}) + \frac{\upsilon_{2}}{2}$
$- \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(-\xi_{a}-\hat{a}(\tau_{1})\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(\xi_{a}-\hat{a}(\tau_{1})\right)\right)\right]$. (72)

Therefore, the Bayesian free energy is determined by two stochastic variables, $\upsilon_{2}$ and $\tau_{1}$, and can be expressed as $F(Y)=F(\upsilon_{2},\tau_{1})$. The probability distribution of the Bayesian free energy can be described as

$p(F) = \int\mathrm{d}\upsilon_{2}\,\mathrm{d}\tau_{1}\,\delta(F-F(\upsilon_{2},\tau_{1}))p(\upsilon_{2})p(\tau_{1})$. (73)

III.2.3 Model Selection Through Bayesian Free Energy

This section compares the Bayesian free energies of the two- and one-variable linear regression models to perform model selection. First, we discuss the relationship between the mesoscopic variables $\upsilon_{1}$ and $\upsilon_{2}$, where $\upsilon_{1}$ denotes the chi-squared variable of the two-variable model (written simply as $\upsilon$ in Section II). The residual errors of the two models are related as

$E(\hat{a},\hat{b}) = \frac{1}{2N}\sum_{i=1}^{N}\left\{y_{i}-(\hat{a}x_{i}+\hat{b})\right\}^{2}$ (74)
$= E(\hat{a}) - \frac{1}{2}\left(b_{0}+\sqrt{\frac{\sigma_{0}^{2}}{N}}\tau_{2}\right)^{2}$, (75)

leading to the relationship between $\upsilon_{1}$ and $\upsilon_{2}$ as

$\upsilon_{1} = \upsilon_{2} - \frac{N}{\sigma_{0}^{2}}\left(b_{0}+\sqrt{\frac{\sigma_{0}^{2}}{N}}\tau_{2}\right)^{2}$. (76)
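Equation (76) can be verified numerically on a single synthetic dataset; the short sketch below (illustrative, with assumed parameters and seed) compares the two residual errors directly:

```python
import numpy as np

rng = np.random.default_rng(3)
N, a0, b0, sigma0_sq = 50, 1.0, 1.0, 1.0
x = rng.uniform(-1.0, 1.0, N)
x -= x.mean()
y = a0 * x + b0 + rng.normal(0.0, np.sqrt(sigma0_sq), N)

x2, xy, ybar, y2 = np.mean(x**2), np.mean(x * y), np.mean(y), np.mean(y**2)
E_two = 0.5 * (y2 - xy**2 / x2 - ybar**2)  # two-variable residual E(a_hat, b_hat)
E_one = 0.5 * (y2 - xy**2 / x2)            # one-variable residual E(a_hat)
ups1 = 2.0 * N / sigma0_sq * E_two
ups2 = 2.0 * N / sigma0_sq * E_one
print(np.isclose(ups1, ups2 - N / sigma0_sq * ybar**2))  # True: Eq. (76), b_hat = y_bar
```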

The Bayesian free energy for each model, from Equations (43) and (72), is given by:

$F_{y=ax+b}(\upsilon_{1},\tau_{1},\tau_{2}) = \frac{N}{2}\ln(2\pi\sigma^{2}_{0}) - \ln\left(\frac{\sigma^{2}_{0}\pi}{2N}\right) + \frac{1}{2}\ln\left(\bar{x^{2}}\right) + \ln(2\xi_{a}) + \ln(2\xi_{b}) + \frac{\upsilon_{1}}{2}$
$- \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(-\xi_{a}-\hat{a}(\tau_{1})\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(\xi_{a}-\hat{a}(\tau_{1})\right)\right)\right]$
$- \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{0}}}\left(-\xi_{b}-\hat{b}(\tau_{2})\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{0}}}\left(\xi_{b}-\hat{b}(\tau_{2})\right)\right)\right]$, (77)
$F_{y=ax}(\upsilon_{2},\tau_{1}) = \frac{N}{2}\ln(2\pi\sigma^{2}_{0}) - \frac{1}{2}\ln\left(\frac{\sigma^{2}_{0}\pi}{2N\bar{x^{2}}}\right) + \ln(2\xi_{a}) + \frac{\upsilon_{2}}{2}$
$- \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(-\xi_{a}-\hat{a}(\tau_{1})\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{0}}}\left(\xi_{a}-\hat{a}(\tau_{1})\right)\right)\right]$. (78)

Hence, the difference in the Bayesian free energy, $\Delta F$, depends only on the stochastic variable $\tau_{2}$, and can be expressed as

$\Delta F(\tau_{2}) = F_{y=ax}(\upsilon_{2},\tau_{1}) - F_{y=ax+b}(\upsilon_{1},\tau_{1},\tau_{2})$ (79)
$= \frac{1}{2}\ln\left(\frac{\sigma^{2}_{0}\pi}{2N}\right) - \ln(2\xi_{b}) + \frac{N}{2\sigma_{0}^{2}}\hat{b}(\tau_{2})^{2}$
$+ \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{0}}}\left(-\xi_{b}-\hat{b}(\tau_{2})\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{0}}}\left(\xi_{b}-\hat{b}(\tau_{2})\right)\right)\right]$. (80)

Note that the logarithmic term in the last line converges to $\ln 2$ in the limit of large $N$, indicating that the stochastic fluctuations are primarily governed by the term $\frac{N}{2\sigma^{2}_{0}}\hat{b}(\tau_{2})^{2}$. Since $\hat{b}(\tau_{2})$ follows a normal distribution with mean $b_{0}$ and variance $\sigma^{2}_{0}/N$, $\frac{N}{\sigma^{2}_{0}}\hat{b}(\tau_{2})^{2}$ follows a non-central chi-squared distribution. This enables analytical treatment of the distribution of the free energy difference. The probability distribution of the difference in the Bayesian free energy is

$p(\Delta F) = \int\mathrm{d}\tau_{2}\,\delta(\Delta F-\Delta F(\tau_{2}))p(\tau_{2})$. (81)

Thus, the fluctuations of model selection can be assessed effectively via the mesoscopic variables, as described in Section II.
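Because $\Delta F$ depends only on $\tau_{2}$, the fluctuation of model selection can be sampled in a few lines. The sketch below (illustrative code; the prior width $\xi_{b}=10$ and the other settings are assumptions) estimates the probability that the two-variable model is selected, i.e., $\Delta F>0$:

```python
import numpy as np
from scipy.special import erfc

def delta_F(tau2, N, b0, sigma0_sq, xi_b):
    """Free energy difference Delta F(tau2) of Eq. (80)."""
    b_hat = b0 + np.sqrt(sigma0_sq / N) * tau2
    cb = np.sqrt(N / (2.0 * sigma0_sq))
    return (0.5 * np.log(sigma0_sq * np.pi / (2.0 * N))
            - np.log(2.0 * xi_b)
            + N / (2.0 * sigma0_sq) * b_hat**2
            + np.log(erfc(cb * (-xi_b - b_hat)) - erfc(cb * (xi_b - b_hat))))

# Monte Carlo over tau2 ~ N(0,1): selection probability of the two-variable model
rng = np.random.default_rng(4)
tau2 = rng.normal(0.0, 1.0, 100_000)
dF = delta_F(tau2, N=100, b0=1.0, sigma0_sq=1.0, xi_b=10.0)
print(np.mean(dF > 0.0))
```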

III.3 Numerical Experiments: Model Selection

Here, we examine the impact of data quantity and the noise intensity inherent in the data on the outcomes of model selection using the mesoscopic representation. Figure 3 shows the two-dimensional frequency distribution from 100,000 samples of the Bayesian free energy difference (Equation (81)) and $\tau_{2}$. The model parameters are $a_{0}=1.0$, $b_{0}=1.0$, and $\sigma_{0}^{2}=1.0$, for data sizes $N=5,100,1000$. The vertical and horizontal axes represent the frequency distributions of the free energy difference and of $\tau_{2}$, respectively. Figure 3(a) shows that with a small number of data points, the frequency of $\Delta F<0$ is high, indicating frequent failures in model selection. Conversely, Figures 3(b) and (c) show that with a larger number of data points, failures in model selection become negligible.

Figure 3 (panels (a)–(c)): Probability distribution of differences in the Bayesian free energy for model selection. Probability distribution from 100,000 samples of the Bayesian free energy difference (Equation (81)) with model parameters $a_{0}=1.0$, $b_{0}=1.0$, $\sigma_{0}^{2}=1.0$ for data sizes $N=5,100,1000$. Solid black lines represent the theoretical curves calculated from the non-central chi-squared distribution, where the logarithmic term in the last line of Equation (80) was approximated as $\ln 2$.
Figure 4 (panels (a)–(c)): Two-dimensional frequency distribution from 100,000 samples of the Bayesian free energy difference (Equation (81)) and $\hat{b}(\tau_{2})$. The model parameters are $a_{0}=1.0$, $b_{0}=1.0$, and $\sigma_{0}^{2}=1.0$, for data sizes $N=5,100,1000$. The vertical and horizontal axes represent the free energy difference and $\hat{b}(\tau_{2})$, respectively.
Figure 5 (panels (a)–(c)): Probability distribution of differences in the Bayesian free energy for model selection. Probability distribution from 100,000 samples of the Bayesian free energy difference (Equation (81)) with model parameters $a_{0}=1.0$, $b_{0}=0.0$, $\sigma_{0}^{2}=1.0$ for data sizes $N=5,100,1000$. Solid black lines represent the theoretical curves calculated from the non-central chi-squared distribution, where the logarithmic term in the last line of Equation (80) was approximated as $\ln 2$.
Figure 6 (panels (a)–(c)): Two-dimensional frequency distribution from 100,000 samples of the Bayesian free energy difference (Equation (81)) and $\hat{b}(\tau_{2})$. The model parameters are $a_{0}=1.0$, $b_{0}=0.0$, and $\sigma_{0}^{2}=1.0$, for data sizes $N=5,100,1000$. The vertical and horizontal axes represent the free energy difference and $\hat{b}(\tau_{2})$, respectively.

Figure 4 shows the relationship between the free energy difference $\Delta F(\tau_{2})$ and the estimated parameter $\hat{b}(\tau_{2})$ for the case $b_{0}=1.0$. Equation (80) shows that $\hat{b}(\tau_{2})$ is distributed around the true parameter $b_{0}$. Therefore, as shown in Figure 4 ($b_{0}=1.0$), $\hat{b}(\tau_{2})^{2}$ is distributed over positive values, so that $\Delta F(\tau_{2})$ is concentrated at positive values. This enables us to understand why the two-variable model is predominantly selected in the cases of Figure 4.

Next, Figure 5 shows the two-dimensional frequency distribution from 100,000 samples of the Bayesian free energy difference (Equation (81)) and $\tau_{2}$. The model parameters are $a_{0}=1.0$, $b_{0}=0.0$, and $\sigma_{0}^{2}=1.0$, for data sizes $N=5,100,1000$. The vertical and horizontal axes represent the frequency distributions of the free energy difference and of $\tau_{2}$, respectively. Figures 5(a)–(c) show that as the number of data points increases, the frequency of $\Delta F>0$ gradually decreases, but even at $N=1000$, occurrences of $\Delta F>0$ remain, indicating that failures in model selection are still present. The minimum value of the distribution, as evident from Equation (80), shifts negatively on a $\log N$ scale. Therefore, when $y=ax$ is the true model, the pace of improvement in model selection with increasing number of data points is slower compared with when $y=ax+b$ is the true model.

In the case shown in Figure 6 ($b_{0}=0.0$), the value of $\hat{b}(\tau_{2})^{2}$ is distributed around zero, so that the value of $\Delta F(\tau_{2})$ becomes negatively distributed due to the effect of the other terms. This suggests that the one-variable model is selected in the cases of Figure 6.

Figure 7 shows the probability of selecting the two-variable model $y=ax+b$ on the basis of the Bayesian free energy difference (Equation (81)) for model parameters $b_{0}=1.0,0.5,0.0$. Here, we set $a_{0}=1.0$ and display the selection probability as a two-dimensional histogram from 100,000 samples across the dimensions of data number $N$ and data noise intensity $\sigma^{2}$. Figure 7(a) shows that model selection tends to fail along the diagonal line where $N$ and $\sigma^{2}$ have similar values. As the data number increases from this line, appropriate model selection gradually becomes possible. Conversely, as the data number decreases away from this diagonal line, discerning the correct model becomes challenging. This diagonal line, as shown in Figures 7(b) and (c), broadens as the value of $b_{0}$ decreases, and at $b_{0}=0.0$, the probability of selecting the model $y=ax+b$ vanishes.
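A Figure-7-style map can be sketched by sweeping $N$ and the noise variance, reusing the delta_F function from the previous sketch (again illustrative: the grid ranges, $b_{0}$, and $\xi_{b}$ are assumptions, not the paper's settings):

```python
import numpy as np

# Selection probability P(Delta F > 0) over an (N, sigma^2) grid, reusing delta_F above
rng = np.random.default_rng(5)
Ns = np.unique(np.logspace(0.5, 3.0, 20).astype(int))
sigma_sqs = np.logspace(-1.0, 2.0, 20)
prob = np.empty((Ns.size, sigma_sqs.size))
for i, N in enumerate(Ns):
    for j, s2 in enumerate(sigma_sqs):
        tau2 = rng.normal(0.0, 1.0, 10_000)
        prob[i, j] = np.mean(delta_F(tau2, N=int(N), b0=1.0, sigma0_sq=s2, xi_b=10.0) > 0.0)
print(prob.round(2))
```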

Figure 7 (panels (a)–(c)): Average selection probability of the two-variable model $y=ax+b$ derived from the probability distribution of the difference in the Bayesian free energy (Equation (81)) with model parameters $b_{0}=1.0,0.5,0.0$. The parameters are set as $a_{0}=1.0$, $\sigma_{0}^{2}=1.0$, and the selection probability is shown as a two-dimensional histogram over the number of data points $N$ and the data noise intensity $\sigma^{2}$.

IV Bayesian Integration

In this section, we explore a framework for Bayesian inference of model parameters by integrating multiple sets of measurement data obtained under varying conditions. The primary focus here is to demonstrate that the Bayesian free energy, which determines whether multiple measurement datasets should be analyzed jointly or independently, can be described using a small number of variables, similar to the previous sections.

We specifically address the regression problem involving two sets of one-dimensional data: $D_{1}=\{(x_{i}^{(1)},y_{i}^{(1)})\}_{i=1}^{N_{1}}$ with sample size $N_{1}$, and $D_{2}=\{(x_{i}^{(2)},y_{i}^{(2)})\}_{i=1}^{N_{2}}$ with sample size $N_{2}$. The regression is formulated with a two-variable linear model as follows:

$y_{i}^{(1)} = a_{0}^{(1)}x_{i}^{(1)} + b_{0}^{(1)} + n_{i}^{(1)}$, (82)
$y_{i}^{(2)} = a_{0}^{(2)}x_{i}^{(2)} + b_{0}^{(2)} + n_{i}^{(2)}$, (83)

where the noise terms $n_{i}^{(1)}$ and $n_{i}^{(2)}$ are assumed to follow normal distributions with mean zero and variances $(\sigma_{0}^{(1)})^{2}$ and $(\sigma_{0}^{(2)})^{2}$, respectively. This setup enables us to perform an integrated analysis that accounts for the different noise levels and relationships in the data from two distinct experimental conditions.
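As a small illustrative sketch of this setup (the parameter values and the input distribution are assumptions, not from the paper), two such datasets with different sample sizes and noise levels can be generated as follows:

```python
import numpy as np

def make_dataset(N, a0, b0, sigma0_sq, rng):
    """One synthetic dataset following Eqs. (82)-(83), with centered inputs."""
    x = rng.uniform(-1.0, 1.0, N)
    x -= x.mean()
    y = a0 * x + b0 + rng.normal(0.0, np.sqrt(sigma0_sq), N)
    return x, y

rng = np.random.default_rng(6)
x1, y1 = make_dataset(100, 1.0, 0.5, 1.0, rng)   # measurement condition 1
x2, y2 = make_dataset(200, 1.0, 0.5, 0.25, rng)  # measurement condition 2 (lower noise)
```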

As in the previous section, the noise variances $(\sigma_{0}^{(1)})^{2}$ and $(\sigma_{0}^{(2)})^{2}$ are assumed to be known in this section. The estimation of these noise variances for each model, and how these estimates affect the Bayesian inference process, are discussed in Appendix B.

IV.1 Representation through Microscopic Variables in Bayesian Integration

IV.1.1 Microscopic Notation of Mean Squared Error for Bayesian Integration

In this case, we define the MSE as

$E_{m}(a,b) = \frac{1}{2N_{m}}\sum_{i=1}^{N_{m}}\left(y_{i}^{(m)}-ax_{i}^{(m)}-b\right)^{2}$
$= \frac{1}{2}\left(\bar{{x^{(m)}}^{2}}\left(a-\frac{\bar{x^{(m)}y^{(m)}}}{\bar{{x^{(m)}}^{2}}}\right)^{2}+\left(b-\bar{y^{(m)}}\right)^{2}-\frac{\bar{x^{(m)}y^{(m)}}^{2}}{\bar{{x^{(m)}}^{2}}}-\bar{y^{(m)}}^{2}+\bar{{y^{(m)}}^{2}}\right)$. (84)

When each dataset has independent parameters $(a^{(1)},b^{(1)})$, $(a^{(2)},b^{(2)})$, the combined MSE after integration can be written as

$E(a^{(1)},b^{(1)},a^{(2)},b^{(2)}) = \sum_{m=1}^{2}\frac{N_{m}}{(\sigma_{0}^{(m)})^{2}}E_{m}(a^{(m)},b^{(m)})$. (85)

Since there are two two-variable linear regression models, we can complete the square independently for each model. Therefore, the expression for the total MSE is:

$E(a^{(1)},b^{(1)},a^{(2)},b^{(2)}) = \frac{N_{1}}{2(\sigma_{0}^{(1)})^{2}}\left(\bar{{x^{(1)}}^{2}}\left(a^{(1)}-\frac{\bar{x^{(1)}y^{(1)}}}{\bar{{x^{(1)}}^{2}}}\right)^{2}+\left(b^{(1)}-\bar{y^{(1)}}\right)^{2}-\frac{\bar{x^{(1)}y^{(1)}}^{2}}{\bar{{x^{(1)}}^{2}}}-\bar{y^{(1)}}^{2}+\bar{{y^{(1)}}^{2}}\right)$
$+ \frac{N_{2}}{2(\sigma_{0}^{(2)})^{2}}\left(\bar{{x^{(2)}}^{2}}\left(a^{(2)}-\frac{\bar{x^{(2)}y^{(2)}}}{\bar{{x^{(2)}}^{2}}}\right)^{2}+\left(b^{(2)}-\bar{y^{(2)}}\right)^{2}-\frac{\bar{x^{(2)}y^{(2)}}^{2}}{\bar{{x^{(2)}}^{2}}}-\bar{y^{(2)}}^{2}+\bar{{y^{(2)}}^{2}}\right)$. (86)

This matches the results that would be obtained by treating the datasets $D_{1}$ and $D_{2}$ independently with linear regression models. If we instead infer common model parameters $a,b$ from the two datasets, the expression for the MSE becomes:

$E(a,b) = \sum_{m=1}^{2}\frac{N_{m}}{(\sigma_{0}^{(m)})^{2}}E_{m}(a,b)$
$= \frac{N_{1}}{2(\sigma_{0}^{(1)})^{2}}\left(\bar{{x^{(1)}}^{2}}\left(a-\frac{\bar{x^{(1)}y^{(1)}}}{\bar{{x^{(1)}}^{2}}}\right)^{2}+\left(b-\bar{y^{(1)}}\right)^{2}-\frac{\bar{x^{(1)}y^{(1)}}^{2}}{\bar{{x^{(1)}}^{2}}}-\bar{y^{(1)}}^{2}+\bar{{y^{(1)}}^{2}}\right)$
$+ \frac{N_{2}}{2(\sigma_{0}^{(2)})^{2}}\left(\bar{{x^{(2)}}^{2}}\left(a-\frac{\bar{x^{(2)}y^{(2)}}}{\bar{{x^{(2)}}^{2}}}\right)^{2}+\left(b-\bar{y^{(2)}}\right)^{2}-\frac{\bar{x^{(2)}y^{(2)}}^{2}}{\bar{{x^{(2)}}^{2}}}-\bar{y^{(2)}}^{2}+\bar{{y^{(2)}}^{2}}\right)$. (87)

Further transformations will be applied to $a$ and $b$. Let $\beta^{(1)}=\frac{N_{1}}{(\sigma^{(1)})^{2}}$, $\beta^{(2)}=\frac{N_{2}}{(\sigma^{(2)})^{2}}$, $\beta^{(1)}_{0}=\frac{N_{1}}{(\sigma_{0}^{(1)})^{2}}$, $\beta^{(2)}_{0}=\frac{N_{2}}{(\sigma_{0}^{(2)})^{2}}$, $\hat{a}^{(m)}=\frac{\bar{x^{(m)}y^{(m)}}}{\bar{{x^{(m)}}^{2}}}$, and $\hat{b}^{(m)}=\bar{y^{(m)}}$. Then, the error function can be written as:

$E(a,b) = \frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}\left(a-\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}\hat{a}^{(1)}+\beta^{(2)}\bar{{x^{(2)}}^{2}}\hat{a}^{(2)}}{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}\right)^{2}$
$+ \frac{\beta^{(1)}+\beta^{(2)}}{2}\left(b-\frac{\beta^{(1)}\hat{b}^{(1)}+\beta^{(2)}\hat{b}^{(2)}}{\beta^{(1)}+\beta^{(2)}}\right)^{2}$
$+ \frac{1}{2}\left(\frac{(\beta^{(1)}\bar{{x^{(1)}}^{2}})(\beta^{(2)}\bar{{x^{(2)}}^{2}})}{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}\left(\hat{a}^{(1)}-\hat{a}^{(2)}\right)^{2}+\frac{\beta^{(1)}\beta^{(2)}}{\beta^{(1)}+\beta^{(2)}}\left(\hat{b}^{(1)}-\hat{b}^{(2)}\right)^{2}\right.$
$\left.- \frac{\beta_{0}^{(1)}}{\beta^{(1)}}{\tau_{1}^{(1)}}^{2}-\frac{\beta_{0}^{(1)}}{\beta^{(1)}}{\tau_{2}^{(1)}}^{2}+\frac{\bar{{n^{(1)}}^{2}}}{\beta^{(1)}}-\frac{\beta_{0}^{(2)}}{\beta^{(2)}}{\tau_{1}^{(2)}}^{2}-\frac{\beta_{0}^{(2)}}{\beta^{(2)}}{\tau_{2}^{(2)}}^{2}+\frac{\bar{{n^{(2)}}^{2}}}{\beta^{(2)}}\right)$. (88)

Let us define the integrated errors for parameters aa and bb as follows:

$\mathcal{E}^{\mathrm{int}}_{a}(a) = \frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}\left(a-\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}\hat{a}^{(1)}+\beta^{(2)}\bar{{x^{(2)}}^{2}}\hat{a}^{(2)}}{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}\right)^{2}$, (89)
$\mathcal{E}^{\mathrm{int}}_{b}(b) = \frac{\beta^{(1)}+\beta^{(2)}}{2}\left(b-\frac{\beta^{(1)}\hat{b}^{(1)}+\beta^{(2)}\hat{b}^{(2)}}{\beta^{(1)}+\beta^{(2)}}\right)^{2}$. (90)

The optimal parameters a^\hat{a} and b^\hat{b} are given by:

a^\displaystyle\hat{a} =\displaystyle= (β(1)x(1)2¯a^(1)+β(2)x(2)2¯a^(2))(β(1)x(1)2¯+β(2)x(2)2¯)\displaystyle\frac{\left(\beta^{(1)}\bar{{x^{(1)}}^{2}}\hat{a}^{(1)}+\beta^{(2)}\bar{{x^{(2)}}^{2}}\hat{a}^{(2)}\right)}{\left(\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}\right)} (91)
b^\displaystyle\hat{b} =\displaystyle= (β(1)b^(1)+β(2)b^(2))β(1)+β(2)\displaystyle\frac{\left(\beta^{(1)}\hat{b}^{(1)}+\beta^{(2)}\hat{b}^{(2)}\right)}{\beta^{(1)}+\beta^{(2)}} (92)

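As a concrete illustration of Equations (91) and (92), the following minimal Python sketch combines per-dataset least-squares estimates by precision weighting. The function names and the synthetic two-dataset setup are our own illustrative choices, not part of the formal development.

import numpy as np

def per_dataset_stats(x, y):
    """Empirical moments entering the per-dataset estimates."""
    x2_bar = np.mean(x**2)           # \bar{x^{(m)2}}
    a_hat = np.mean(x * y) / x2_bar  # \hat{a}^{(m)} = \bar{xy}/\bar{x^2}
    b_hat = np.mean(y)               # \hat{b}^{(m)} = \bar{y}
    return x2_bar, a_hat, b_hat

def integrated_estimates(datasets, sigma2):
    """Precision-weighted combination of per-dataset estimates (Eqs. (91)-(92)).

    datasets: list of (x, y) arrays; sigma2: list of noise variances sigma^{(m)2}.
    beta^{(m)} = N_m / sigma^{(m)2} plays the role of the precision.
    """
    wa = wb = num_a = num_b = 0.0
    for (x, y), s2 in zip(datasets, sigma2):
        beta = len(x) / s2
        x2_bar, a_hat, b_hat = per_dataset_stats(x, y)
        num_a += beta * x2_bar * a_hat   # numerator of Eq. (91)
        wa += beta * x2_bar
        num_b += beta * b_hat            # numerator of Eq. (92)
        wb += beta
    return num_a / wa, num_b / wb

# Example: two synthetic datasets sharing a = 2.0, b = 0.5
rng = np.random.default_rng(0)
ds, s2 = [], [1.0, 4.0]
for N, s in [(50, 1.0), (200, 4.0)]:
    x = rng.uniform(-1, 1, N)
    ds.append((x, 2.0 * x + 0.5 + rng.normal(0, np.sqrt(s), N)))
print(integrated_estimates(ds, s2))  # close to (2.0, 0.5)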
The residual error can be expressed as:

E(\hat{a},\hat{b}) = \frac{1}{2}\left(\frac{(\beta^{(1)}\bar{{x^{(1)}}^{2}})(\beta^{(2)}\bar{{x^{(2)}}^{2}})}{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}(\hat{a}^{(1)}-\hat{a}^{(2)})^{2}\right.
 + \frac{\beta^{(1)}\beta^{(2)}}{\beta^{(1)}+\beta^{(2)}}(\hat{b}^{(1)}-\hat{b}^{(2)})^{2}
 - \frac{\beta^{(1)}}{\beta_{0}^{(1)}}{\tau_{1}^{(1)}}^{2}-\frac{\beta^{(1)}}{\beta_{0}^{(1)}}{\tau_{2}^{(1)}}^{2}+\beta^{(1)}\bar{{n^{(1)}}^{2}}
 \left.- \frac{\beta^{(2)}}{\beta_{0}^{(2)}}{\tau_{1}^{(2)}}^{2}-\frac{\beta^{(2)}}{\beta_{0}^{(2)}}{\tau_{2}^{(2)}}^{2}+\beta^{(2)}\bar{{n^{(2)}}^{2}}\right) (93)

IV.1.2 Bayesian Inference in Bayesian Integration

Given the input variables and model parameters, the conditional probability of the output can be expressed as:

p(y_{i}^{(m)}|a^{(m)},b^{(m)})=\frac{1}{\sqrt{2\pi(\sigma_{0}^{(m)})^{2}}}\exp\left[-\frac{(y_{i}^{(m)}-a^{(m)}x_{i}^{(m)}-b^{(m)})^{2}}{2(\sigma^{(m)}_{0})^{2}}\right] (94)

Therefore, when we have independent parameters a^{(1)}, a^{(2)}, b^{(1)}, b^{(2)}, the joint conditional probability of all output data Y=\{D_{1},D_{2}\} can be expressed as:

p(Y|a^{(1)},a^{(2)},b^{(1)},b^{(2)}) = \prod_{m=1}^{2}\prod_{i=1}^{N_{m}}p(y_{i}^{(m)}|a^{(m)},b^{(m)})
 = \left(\frac{1}{\sqrt{2\pi(\sigma^{(1)}_{0})^{2}}}\right)^{N_{1}}\left(\frac{1}{\sqrt{2\pi(\sigma^{(2)}_{0})^{2}}}\right)^{N_{2}}\exp\left(-\frac{N_{1}}{(\sigma^{(1)}_{0})^{2}}E_{1}(a^{(1)},b^{(1)})-\frac{N_{2}}{(\sigma^{(2)}_{0})^{2}}E_{2}(a^{(2)},b^{(2)})\right) (95)

The posterior distribution can be analyzed independently for each model parameter a^{(1)}, a^{(2)}, b^{(1)}, b^{(2)}, and can be computed as:

p(a^{(1)},a^{(2)},b^{(1)},b^{(2)}|Y) = \prod_{m=1}^{2}\frac{2N_{m}\sqrt{\bar{{x^{(m)}}^{2}}}}{{\sigma^{(m)}}^{2}_{0}\pi}\exp\left\{-\frac{N_{m}}{{\sigma^{(m)}}^{2}_{0}}\left[\mathcal{E}_{a}(a^{(m)})+\mathcal{E}_{b}(b^{(m)})\right]\right\}
 \times\left\{\Theta(a^{(m)}+\xi_{a^{(m)}})-\Theta(a^{(m)}-\xi_{a^{(m)}})\right\}\left\{\Theta(b^{(m)}+\xi_{b^{(m)}})-\Theta(b^{(m)}-\xi_{b^{(m)}})\right\}
 \times\left[\mathrm{erfc}\left(\sqrt{\frac{N_{m}\bar{{x^{(m)}}^{2}}}{2{\sigma^{(m)}}^{2}_{0}}}\left(-\xi_{a^{(m)}}-\frac{\bar{x^{(m)}y^{(m)}}}{\bar{{x^{(m)}}^{2}}}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N_{m}\bar{{x^{(m)}}^{2}}}{2{\sigma^{(m)}}^{2}_{0}}}\left(\xi_{a^{(m)}}-\frac{\bar{x^{(m)}y^{(m)}}}{\bar{{x^{(m)}}^{2}}}\right)\right)\right]^{-1}
 \times\left[\mathrm{erfc}\left(\sqrt{\frac{N_{m}}{2{\sigma^{(m)}}^{2}_{0}}}\left(-\xi_{b^{(m)}}-\bar{y^{(m)}}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N_{m}}{2{\sigma^{(m)}}^{2}_{0}}}\left(\xi_{b^{(m)}}-\bar{y^{(m)}}\right)\right)\right]^{-1} (96)

When the range of the prior distribution is sufficiently large, the posterior distribution of each model parameter can be described as a Gaussian distribution centered around \hat{a}^{(1)}, \hat{a}^{(2)}, \hat{b}^{(1)}, \hat{b}^{(2)}. On the other hand, when both datasets share common model parameters a, b, the joint conditional probability is given by:

p(Y|a,b) = \prod_{m=1}^{2}\left(\frac{1}{\sqrt{2\pi(\sigma^{(m)}_{0})^{2}}}\right)^{N_{m}}\exp\left(-\frac{N_{m}}{(\sigma^{(m)}_{0})^{2}}E_{m}(a,b)\right)
 = \left(\frac{1}{\sqrt{2\pi(\sigma^{(1)}_{0})^{2}}}\right)^{N_{1}}\left(\frac{1}{\sqrt{2\pi(\sigma^{(2)}_{0})^{2}}}\right)^{N_{2}}\exp\left(-\sum_{m=1}^{2}\frac{N_{m}}{(\sigma^{(m)}_{0})^{2}}E_{m}(a,b)\right) (97)

The posterior distribution can then be expressed as:

p(a,b|Y) = \frac{2}{\pi}\sqrt{(\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}})(\beta^{(1)}+\beta^{(2)})}\exp(-\mathcal{E}^{\mathrm{int}}_{a}(a)-\mathcal{E}^{\mathrm{int}}_{b}(b)) (98)
 \times\left\{\Theta(a+\xi_{a})-\Theta(a-\xi_{a})\right\}\left\{\Theta(b+\xi_{b})-\Theta(b-\xi_{b})\right\}
 \times\left[\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}(-\xi_{a}-\hat{a})\right)-\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}(\xi_{a}-\hat{a})\right)\right]^{-1}
 \times\left[\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}+\beta^{(2)}}{2}}(-\xi_{b}-\hat{b})\right)-\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}+\beta^{(2)}}{2}}(\xi_{b}-\hat{b})\right)\right]^{-1}.

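For reference, the truncated-Gaussian posterior of Equation (98) can be evaluated numerically as follows. This is a hedged sketch: the helper name and argument layout are our own, and prec_a, prec_b stand for the precisions \beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}} and \beta^{(1)}+\beta^{(2)}.

import numpy as np
from scipy.special import erfc

def log_posterior_ab(a, b, a_hat, b_hat, prec_a, prec_b, xi_a, xi_b):
    """Log of the truncated-Gaussian posterior in Eq. (98).

    The erfc differences restore the normalization lost to the uniform
    prior's finite support [-xi, xi]; outside that support the Theta
    factors make the density vanish.
    """
    if abs(a) > xi_a or abs(b) > xi_b:
        return -np.inf  # outside the prior support (Theta factors)
    logp = np.log(2.0 / np.pi) + 0.5 * np.log(prec_a * prec_b)
    logp -= 0.5 * prec_a * (a - a_hat) ** 2 + 0.5 * prec_b * (b - b_hat) ** 2
    za = erfc(np.sqrt(prec_a / 2) * (-xi_a - a_hat)) - erfc(np.sqrt(prec_a / 2) * (xi_a - a_hat))
    zb = erfc(np.sqrt(prec_b / 2) * (-xi_b - b_hat)) - erfc(np.sqrt(prec_b / 2) * (xi_b - b_hat))
    return logp - np.log(za) - np.log(zb)

# Illustrative call; for a wide prior each erfc difference is ~2.
print(log_posterior_ab(0.1, 0.0, a_hat=0.1, b_hat=0.0,
                       prec_a=200.0, prec_b=300.0, xi_a=10.0, xi_b=10.0))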
Here, we derive the Bayesian free energy from the results of the previous section, which is used as a criterion for model selection. The Bayesian free energy is the negative logarithm of the marginal likelihood. Assuming a uniform prior distribution and that model parameters are independent for each dataset, the Bayesian free energy can be expressed as:

F^{(m)}(Y)=\frac{N_{m}}{2}\ln(2\pi{\sigma^{(m)}}^{2}_{0})-\ln\left(\frac{{\sigma^{(m)}}^{2}_{0}\pi}{2N_{m}}\right)+\frac{1}{2}\ln\left(\bar{{x^{(m)}}^{2}}\right)+\ln(2\xi^{(m)}_{a})+\ln(2\xi^{(m)}_{b})+\frac{N_{m}}{{\sigma^{(m)}}^{2}_{0}}E_{m}(\hat{a}^{(m)},\hat{b}^{(m)})
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N_{m}\bar{{x^{(m)}}^{2}}}{2{\sigma^{(m)}}^{2}_{0}}}\left(-\xi^{(m)}_{a}-\hat{a}^{(m)}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N_{m}\bar{{x^{(m)}}^{2}}}{2{\sigma^{(m)}}^{2}_{0}}}\left(\xi^{(m)}_{a}-\hat{a}^{(m)}\right)\right)\right]
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N_{m}}{2{\sigma^{(m)}}^{2}_{0}}}\left(-\xi^{(m)}_{b}-\hat{b}^{(m)}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N_{m}}{2{\sigma^{(m)}}^{2}_{0}}}\left(\xi^{(m)}_{b}-\hat{b}^{(m)}\right)\right)\right] (99)
F(Y)=\sum_{m=1}^{2}F^{(m)}(Y). (100)

On the other hand, if the model has common parameters, the Bayesian free energy is obtained by taking the negative logarithm of the marginal likelihood:

F(Y) = \frac{N_{1}}{2}\ln 2\pi(\sigma_{0}^{(1)})^{2}+\frac{N_{2}}{2}\ln 2\pi(\sigma_{0}^{(2)})^{2}+\ln 2\xi_{a}+\ln 2\xi_{b}+E(\hat{a},\hat{b}) (101)
 + \frac{1}{2}\ln\frac{2(\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}})}{\pi}+\frac{1}{2}\ln\frac{2(\beta^{(1)}+\beta^{(2)})}{\pi}
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}(-\xi_{a}-\hat{a})\right)-\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}(\xi_{a}-\hat{a})\right)\right]
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}+\beta^{(2)}}{2}}(-\xi_{b}-\hat{b})\right)-\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}+\beta^{(2)}}{2}}(\xi_{b}-\hat{b})\right)\right].

IV.2 Representation through Mesoscopic Variables in Bayesian Integration

Up to this point, each statistical measure has been handled as an empirical average. In this section, similar to the previous section, we introduce mesoscopic variables to theoretically manage these statistical measures.

IV.2.1 Residual Error with Mesoscopic Variables in Bayesian Integration

In the previous sections, the residual error was derived as a random variable dependent on the set of noise variables \{n^{(m)}_{i}\}_{i=1}^{N_{m}}. In this section, we discuss the probability distribution of the residual error E(\hat{a},\hat{b}) and demonstrate that it follows a chi-squared distribution. The expression for the residual error is given by:

E(\hat{a},\hat{b}) = \frac{1}{2}\left(\frac{(\beta^{(1)}\bar{{x^{(1)}}^{2}})(\beta^{(2)}\bar{{x^{(2)}}^{2}})}{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}(\hat{a}^{(1)}-\hat{a}^{(2)})^{2}+\frac{\beta^{(1)}\beta^{(2)}}{\beta^{(1)}+\beta^{(2)}}(\hat{b}^{(1)}-\hat{b}^{(2)})^{2}\right.
 \left.-\frac{\beta^{(1)}}{\beta_{0}^{(1)}}{\tau_{1}^{(1)}}^{2}-\frac{\beta^{(1)}}{\beta_{0}^{(1)}}{\tau_{2}^{(1)}}^{2}+\beta^{(1)}\bar{{n^{(1)}}^{2}}-\frac{\beta^{(2)}}{\beta_{0}^{(2)}}{\tau_{1}^{(2)}}^{2}-\frac{\beta^{(2)}}{\beta_{0}^{(2)}}{\tau_{2}^{(2)}}^{2}+\beta^{(2)}\bar{{n^{(2)}}^{2}}\right). (102)

In this case,

E_{m}(\hat{a}^{(m)},\hat{b}^{(m)}) = \frac{1}{2}\left(-\frac{\bar{x^{(m)}n^{(m)}}^{2}}{\bar{{x^{(m)}}^{2}}}-\bar{n^{(m)}}^{2}+\bar{{n^{(m)}}^{2}}\right) (103)

is established. From the content of the previous sections,

E_{m}(\hat{a}^{(m)},\hat{b}^{(m)}) = \frac{{\sigma_{0}^{(m)}}^{2}}{2N_{m}}\upsilon^{(m)}=\frac{1}{2\beta_{0}^{(m)}}\upsilon^{(m)} (104)
\frac{\bar{x^{(m)}n^{(m)}}^{2}}{\bar{{x^{(m)}}^{2}}} = \frac{{\sigma_{0}^{(m)}}^{2}}{N_{m}}{\tau_{1}^{(m)}}^{2}=\frac{{\tau_{1}^{(m)}}^{2}}{\beta_{0}^{(m)}} (105)
\bar{n^{(m)}}^{2} = \frac{{\sigma_{0}^{(m)}}^{2}}{N_{m}}{\tau_{2}^{(m)}}^{2}=\frac{{\tau_{2}^{(m)}}^{2}}{\beta_{0}^{(m)}} (106)

can be expressed by introducing mesoscopic variables. Thus, \hat{a}^{(m)} and \hat{b}^{(m)} can be expressed similarly to the previous section as:

\hat{a}^{(m)}(\tau_{1}^{(m)}) = a_{0}+\sqrt{\frac{1}{\bar{{x^{(m)}}^{2}}\beta_{0}^{(m)}}}\tau_{1}^{(m)} (107)
\hat{b}^{(m)}(\tau_{2}^{(m)}) = b_{0}+\sqrt{\frac{1}{\beta_{0}^{(m)}}}\tau_{2}^{(m)} (108)

Hence, the residual error in Bayesian integration can be expressed as:

E(\hat{a},\hat{b}) = \frac{1}{2}\left(\frac{(\beta^{(1)}\bar{{x^{(1)}}^{2}})(\beta^{(2)}\bar{{x^{(2)}}^{2}})}{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}(\hat{a}^{(1)}(\tau_{1}^{(1)})-\hat{a}^{(2)}(\tau_{1}^{(2)}))^{2}+\frac{\beta^{(1)}\beta^{(2)}}{\beta^{(1)}+\beta^{(2)}}(\hat{b}^{(1)}(\tau_{2}^{(1)})-\hat{b}^{(2)}(\tau_{2}^{(2)}))^{2}\right)
 + \beta^{(1)}E_{1}(\hat{a}^{(1)},\hat{b}^{(1)})+\beta^{(2)}E_{2}(\hat{a}^{(2)},\hat{b}^{(2)}) (109)
 = \frac{1}{2}\left(\frac{(\beta^{(1)}\bar{{x^{(1)}}^{2}})(\beta^{(2)}\bar{{x^{(2)}}^{2}})}{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}(\hat{a}^{(1)}(\tau_{1}^{(1)})-\hat{a}^{(2)}(\tau_{1}^{(2)}))^{2}+\frac{\beta^{(1)}\beta^{(2)}}{\beta^{(1)}+\beta^{(2)}}(\hat{b}^{(1)}(\tau_{2}^{(1)})-\hat{b}^{(2)}(\tau_{2}^{(2)}))^{2}\right)
 + \frac{\beta^{(1)}}{2\beta_{0}^{(1)}}\upsilon^{(1)}+\frac{\beta^{(2)}}{2\beta_{0}^{(2)}}\upsilon^{(2)} (110)

and is thus described by the six mesoscopic variables \tau_{1}^{(1)}, \tau_{1}^{(2)}, \tau_{2}^{(1)}, \tau_{2}^{(2)}, \upsilon^{(1)}, \upsilon^{(2)}.
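The mesoscopic representation of Equation (110) lends itself directly to sampling. The sketch below, under the known-noise assumption \beta^{(m)}=\beta_{0}^{(m)} and with illustrative parameter values, draws the six mesoscopic variables and returns Monte Carlo samples of the residual error.

import numpy as np

rng = np.random.default_rng(1)

def residual_error_meso(N, sigma0_sq, x2_bar, n_samples=100_000):
    """Monte Carlo draws of E(a_hat, b_hat) in Eq. (110) for two datasets,
    assuming the known-noise case beta^{(m)} = beta_0^{(m)} and a common
    (a_0, b_0), so the a_0 and b_0 terms cancel in the differences.
    N, sigma0_sq, x2_bar: length-2 sequences of N_m, sigma_0^{(m)2}, x2bar^{(m)}."""
    N = np.asarray(N, float)
    beta0 = N / np.asarray(sigma0_sq, float)
    A = beta0 * np.asarray(x2_bar, float)            # beta^{(m)} x2bar^{(m)}
    tau1 = rng.standard_normal((n_samples, 2))       # tau_1^{(m)} ~ N(0,1)
    tau2 = rng.standard_normal((n_samples, 2))       # tau_2^{(m)} ~ N(0,1)
    ups = rng.chisquare(N - 2, size=(n_samples, 2))  # upsilon^{(m)} ~ chi2(N_m - 2)
    # Eqs. (107)-(108): fluctuations of the per-dataset estimators
    da = tau1[:, 0] / np.sqrt(A[0]) - tau1[:, 1] / np.sqrt(A[1])
    db = tau2[:, 0] / np.sqrt(beta0[0]) - tau2[:, 1] / np.sqrt(beta0[1])
    quad = 0.5 * (A[0] * A[1] / A.sum()) * da**2 \
         + 0.5 * (beta0[0] * beta0[1] / beta0.sum()) * db**2
    return quad + 0.5 * ups.sum(axis=1)              # beta/beta0 = 1 here

E_samples = residual_error_meso([100, 200], [1.0, 1.0], [1.0, 1.0])
print(E_samples.mean())  # ~ (N1 - 2)/2 + (N2 - 2)/2 + 1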

IV.2.2 Posterior Distribution with Mesoscopic Variables in Bayesian Integration

In this section, we use the mesoscopic variables introduced in the previous section to reformulate the posterior distribution. The error functions can be expressed using mesoscopic variables as:

\mathcal{E}^{\mathrm{int}}_{a}(a,\tau_{1}^{(1)},\tau_{1}^{(2)}) = \frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}\left(a-\hat{a}(\tau_{1}^{(1)},\tau_{1}^{(2)})\right)^{2} (111)
\mathcal{E}^{\mathrm{int}}_{b}(b,\tau_{2}^{(1)},\tau_{2}^{(2)}) = \frac{\beta^{(1)}+\beta^{(2)}}{2}\left(b-\hat{b}(\tau_{2}^{(1)},\tau_{2}^{(2)})\right)^{2} (112)
\hat{a}(\tau_{1}^{(1)},\tau_{1}^{(2)}) = \frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}\hat{a}^{(1)}(\tau_{1}^{(1)})+\beta^{(2)}\bar{{x^{(2)}}^{2}}\hat{a}^{(2)}(\tau_{1}^{(2)})}{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}} (113)
\hat{b}(\tau_{2}^{(1)},\tau_{2}^{(2)}) = \frac{\beta^{(1)}\hat{b}^{(1)}(\tau_{2}^{(1)})+\beta^{(2)}\hat{b}^{(2)}(\tau_{2}^{(2)})}{\beta^{(1)}+\beta^{(2)}} (114)

Therefore, the posterior distribution as per Equation (98) can be described as:

p(a,b|Y) = \frac{2}{\pi}\sqrt{(\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}})(\beta^{(1)}+\beta^{(2)})}\exp(-\mathcal{E}^{\mathrm{int}}_{a}(a,\tau_{1}^{(1)},\tau_{1}^{(2)})-\mathcal{E}^{\mathrm{int}}_{b}(b,\tau_{2}^{(1)},\tau_{2}^{(2)})) (115)
 \times\left\{\Theta(a+\xi_{a})-\Theta(a-\xi_{a})\right\}\left\{\Theta(b+\xi_{b})-\Theta(b-\xi_{b})\right\}
 \times\left[\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}(-\xi_{a}-\hat{a}(\tau_{1}^{(1)},\tau_{1}^{(2)}))\right)-\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}(\xi_{a}-\hat{a}(\tau_{1}^{(1)},\tau_{1}^{(2)}))\right)\right]^{-1}
 \times\left[\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}+\beta^{(2)}}{2}}(-\xi_{b}-\hat{b}(\tau_{2}^{(1)},\tau_{2}^{(2)}))\right)-\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}+\beta^{(2)}}{2}}(\xi_{b}-\hat{b}(\tau_{2}^{(1)},\tau_{2}^{(2)}))\right)\right]^{-1}

Thus, the posterior distribution is determined solely by the four stochastic variables \tau_{1}^{(1)}, \tau_{1}^{(2)}, \tau_{2}^{(1)}, \tau_{2}^{(2)}. Also, given the model, the distributions p_{\mathrm{m}}(a), p_{\mathrm{m}}(b) of the model parameters a, b can be described as:

p_{\mathrm{m}}(a) = \int\mathrm{d}\tau_{1}^{(1)}\mathrm{d}\tau_{1}^{(2)}\,\delta(a-\hat{a}(\tau_{1}^{(1)},\tau_{1}^{(2)}))\,p(\tau_{1}^{(1)})p(\tau_{1}^{(2)}) (116)
 \propto \exp\left(-\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}(a-a_{0})^{2}\right) (117)
p_{\mathrm{m}}(b) = \int\mathrm{d}\tau_{2}^{(1)}\mathrm{d}\tau_{2}^{(2)}\,\delta(b-\hat{b}(\tau_{2}^{(1)},\tau_{2}^{(2)}))\,p(\tau_{2}^{(1)})p(\tau_{2}^{(2)}) (118)
 \propto \exp\left(-\frac{\beta^{(1)}+\beta^{(2)}}{2}(b-b_{0})^{2}\right) (119)

since \hat{a} and \hat{b} are linear combinations of the independent standard Gaussian variables \tau, their marginals are Gaussian with means a_{0} and b_{0}; for \beta^{(m)}=\beta^{(m)}_{0}, the precision of \hat{a} reduces to \beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}} and that of \hat{b} to \beta^{(1)}+\beta^{(2)}.

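A quick Monte Carlo check of the marginal in Equations (116)-(117) in the known-noise case (all numerical values are illustrative):

import numpy as np

rng = np.random.default_rng(2)
beta = np.array([100.0, 200.0])      # beta^{(m)} = N_m / sigma^{(m)2}
x2b = np.array([1.0, 0.5])           # x2bar^{(m)}
A = beta * x2b
tau1 = rng.standard_normal((1_000_000, 2))
a_hat_m = 2.0 + tau1 / np.sqrt(A)    # Eq. (107) with a0 = 2.0
a_hat = (A * a_hat_m).sum(axis=1) / A.sum()   # Eq. (113)
print(a_hat.var(), 1.0 / A.sum())    # both ~ 1/(beta1*x2bar1 + beta2*x2bar2)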
Here, we reformulate the Bayesian free energy using mesoscopic variables. From Equation (101),

F(Y) = \frac{N_{1}}{2}\ln(2\pi(\sigma^{(1)})^{2})+\frac{N_{2}}{2}\ln(2\pi(\sigma^{(2)})^{2})+\ln(2\xi_{a})+\ln(2\xi_{b})
 + E(\hat{a}(\tau_{1}^{(1)},\tau_{1}^{(2)}),\hat{b}(\tau_{2}^{(1)},\tau_{2}^{(2)}),\upsilon^{(1)},\upsilon^{(2)})
 + \frac{1}{2}\ln\left(\frac{2(\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}})}{\pi}\right)+\frac{1}{2}\ln\left(\frac{2(\beta^{(1)}+\beta^{(2)})}{\pi}\right)
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}(-\xi_{a}-\hat{a}(\tau_{1}^{(1)},\tau_{1}^{(2)}))\right)-\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}(\xi_{a}-\hat{a}(\tau_{1}^{(1)},\tau_{1}^{(2)}))\right)\right]
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}+\beta^{(2)}}{2}}(-\xi_{b}-\hat{b}(\tau_{2}^{(1)},\tau_{2}^{(2)}))\right)-\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}+\beta^{(2)}}{2}}(\xi_{b}-\hat{b}(\tau_{2}^{(1)},\tau_{2}^{(2)}))\right)\right] (120)

is rewritten. Therefore, the Bayesian free energy is determined by the six stochastic variables \tau_{1}^{(1)}, \tau_{1}^{(2)}, \tau_{2}^{(1)}, \tau_{2}^{(2)}, \upsilon^{(1)}, \upsilon^{(2)}, and can be expressed as F(Y)=F(\tau_{1}^{(1)},\tau_{1}^{(2)},\tau_{2}^{(1)},\tau_{2}^{(2)},\upsilon^{(1)},\upsilon^{(2)}). Consequently, the probability distribution of the Bayesian free energy is

p(F)=\int\mathrm{d}\tau_{1}^{(1)}\mathrm{d}\tau_{1}^{(2)}\mathrm{d}\tau_{2}^{(1)}\mathrm{d}\tau_{2}^{(2)}\mathrm{d}\upsilon^{(1)}\mathrm{d}\upsilon^{(2)}\,\delta\left(F-F(\tau_{1}^{(1)},\tau_{1}^{(2)},\tau_{2}^{(1)},\tau_{2}^{(2)},\upsilon^{(1)},\upsilon^{(2)})\right)p(\tau_{1}^{(1)})p(\tau_{1}^{(2)})p(\tau_{2}^{(1)})p(\tau_{2}^{(2)})p(\upsilon^{(1)})p(\upsilon^{(2)}). (121)

IV.2.3 Model Selection in Bayesian Integration

In this section, we compare the Bayesian free energy of the integrated linear regression model with that of independently analyzed linear regression models in order to perform model selection. The Bayesian free energy for independent analysis is given by:

F^{\mathrm{iso}}(\tau_{1}^{(1)},\tau_{1}^{(2)},\tau_{2}^{(1)},\tau_{2}^{(2)},\upsilon^{(1)},\upsilon^{(2)}) = \sum_{m=1}^{2}\left\{\frac{N_{m}}{2}\ln(2\pi{\sigma^{(m)}}^{2})-\ln\left(\frac{{\sigma^{(m)}}^{2}\pi}{2N_{m}}\right)+\frac{1}{2}\ln\left(\bar{{x^{(m)}}^{2}}\right)+\ln(2\xi^{(m)}_{a})+\ln(2\xi^{(m)}_{b})+\frac{N_{m}}{{\sigma^{(m)}}^{2}}E_{m}(\hat{a}^{(m)},\hat{b}^{(m)})\right.
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N_{m}\bar{{x^{(m)}}^{2}}}{2{\sigma^{(m)}}^{2}}}\left(-\xi^{(m)}_{a}-\hat{a}^{(m)}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N_{m}\bar{{x^{(m)}}^{2}}}{2{\sigma^{(m)}}^{2}}}\left(\xi^{(m)}_{a}-\hat{a}^{(m)}\right)\right)\right]
 \left.- \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N_{m}}{2{\sigma^{(m)}}^{2}}}\left(-\xi^{(m)}_{b}-\hat{b}^{(m)}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N_{m}}{2{\sigma^{(m)}}^{2}}}\left(\xi^{(m)}_{b}-\hat{b}^{(m)}\right)\right)\right]\right\} (123)

and the Bayesian free energy through integration is:

F^{\mathrm{int}}(\tau_{1}^{(1)},\tau_{1}^{(2)},\tau_{2}^{(1)},\tau_{2}^{(2)},\upsilon^{(1)},\upsilon^{(2)})=\frac{N_{1}}{2}\ln 2\pi(\sigma^{(1)})^{2}+\frac{N_{2}}{2}\ln 2\pi(\sigma^{(2)})^{2}+\ln 2\xi_{a}+\ln 2\xi_{b}+E(\hat{a},\hat{b})
 + \frac{1}{2}\ln\frac{2(\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}})}{\pi}+\frac{1}{2}\ln\frac{2(\beta^{(1)}+\beta^{(2)})}{\pi}
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}(-\xi_{a}-\hat{a}(\tau_{1}^{(1)},\tau_{1}^{(2)}))\right)-\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}(\xi_{a}-\hat{a}(\tau_{1}^{(1)},\tau_{1}^{(2)}))\right)\right]
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}+\beta^{(2)}}{2}}(-\xi_{b}-\hat{b}(\tau_{2}^{(1)},\tau_{2}^{(2)}))\right)-\mathrm{erfc}\left(\sqrt{\frac{\beta^{(1)}+\beta^{(2)}}{2}}(\xi_{b}-\hat{b}(\tau_{2}^{(1)},\tau_{2}^{(2)}))\right)\right] (124)

If the noise intensity is known, \beta^{(m)} equals \beta^{(m)}_{0}, leading to a difference in the residual-error contributions given by:

E(\hat{a}(\tau_{1}^{(1)},\tau_{1}^{(2)}),\hat{b}(\tau_{2}^{(1)},\tau_{2}^{(2)}),\upsilon^{(1)},\upsilon^{(2)})-\sum_{m=1}^{2}\beta_{0}^{(m)}E_{m}(\hat{a}^{(m)},\hat{b}^{(m)})
 = \frac{1}{2}\left(\frac{(\beta^{(1)}\bar{{x^{(1)}}^{2}})(\beta^{(2)}\bar{{x^{(2)}}^{2}})}{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}(\hat{a}^{(1)}(\tau_{1}^{(1)})-\hat{a}^{(2)}(\tau_{1}^{(2)}))^{2}+\frac{\beta^{(1)}\beta^{(2)}}{\beta^{(1)}+\beta^{(2)}}(\hat{b}^{(1)}(\tau_{2}^{(1)})-\hat{b}^{(2)}(\tau_{2}^{(2)}))^{2}\right), (125)

from which we find that the mesoscopic variables \upsilon^{(1)} and \upsilon^{(2)} cancel in this difference. Therefore, the difference in the Bayesian free energy is determined by the four stochastic variables \tau_{1}^{(1)}, \tau_{1}^{(2)}, \tau_{2}^{(1)}, \tau_{2}^{(2)}, and can be expressed as

\Delta F(\tau_{1}^{(1)},\tau_{1}^{(2)},\tau_{2}^{(1)},\tau_{2}^{(2)})=F^{\mathrm{int}}(\tau_{1}^{(1)},\tau_{1}^{(2)},\tau_{2}^{(1)},\tau_{2}^{(2)},\upsilon^{(1)},\upsilon^{(2)})-F^{\mathrm{iso}}(\tau_{1}^{(1)},\tau_{1}^{(2)},\tau_{2}^{(1)},\tau_{2}^{(2)},\upsilon^{(1)},\upsilon^{(2)}). (126)

Consequently, the probability distribution of the difference in the Bayesian free energy can be expressed as:

p(\Delta F)=\int\mathrm{d}\tau_{1}^{(1)}\mathrm{d}\tau_{1}^{(2)}\mathrm{d}\tau_{2}^{(1)}\mathrm{d}\tau_{2}^{(2)}\,\delta(\Delta F-\Delta F(\tau_{1}^{(1)},\tau_{1}^{(2)},\tau_{2}^{(1)},\tau_{2}^{(2)}))\,p(\tau_{1}^{(1)})p(\tau_{1}^{(2)})p(\tau_{2}^{(1)})p(\tau_{2}^{(2)}) (127)

We can effectively assess the fluctuations of Bayesian integration by mesoscopic variables as described in Section II.

IV.3 Numerical Experiment: Bayesian Integration

Here, we examine the effects of the number of data points and the noise intensity of the data on estimation from the results of Bayesian integration via the mesoscopic representation. Figure 8 shows the frequency distribution from 100,000 samples of the probability distribution of the difference of the Bayesian free energy with model parameters a_{0}^{(1)}=2.0, a_{0}^{(2)}=4.0, \sigma_{0}^{2}=1.0. Here, for simplicity, we omit the terms b_{0}^{(1)} and b_{0}^{(2)} from Equations (82) and (83). Consequently, we proceed to calculate the difference in the residual error

E(\hat{a}(\tau_{1}^{(1)},\tau_{1}^{(2)}),\upsilon^{(1)},\upsilon^{(2)})-\sum_{m=1}^{2}\beta_{0}^{(m)}E_{m}(\hat{a}^{(m)})=\frac{1}{2}\left(\frac{(\beta^{(1)}\bar{{x^{(1)}}^{2}})(\beta^{(2)}\bar{{x^{(2)}}^{2}})}{\beta^{(1)}\bar{{x^{(1)}}^{2}}+\beta^{(2)}\bar{{x^{(2)}}^{2}}}(\hat{a}^{(1)}(\tau_{1}^{(1)})-\hat{a}^{(2)}(\tau_{1}^{(2)}))^{2}\right), (128)

and the difference of the Bayesian free energy

\Delta F(\tau_{1}^{(1)},\tau_{1}^{(2)})=F^{\mathrm{int}}(\tau_{1}^{(1)},\tau_{1}^{(2)},\upsilon^{(1)},\upsilon^{(2)})-F^{\mathrm{iso}}(\tau_{1}^{(1)},\tau_{1}^{(2)},\upsilon^{(1)},\upsilon^{(2)}), (129)
p(\Delta F)=\int\mathrm{d}\tau_{1}^{(1)}\mathrm{d}\tau_{1}^{(2)}\,\delta(\Delta F-\Delta F(\tau_{1}^{(1)},\tau_{1}^{(2)}))\,p(\tau_{1}^{(1)})p(\tau_{1}^{(2)}), (130)

which is derived from the Bayesian integration of these simplified equations and thus does not account for the effect of the mesoscopic variables \tau_{2}^{(1)} and \tau_{2}^{(2)} associated with b_{0}^{(1)} and b_{0}^{(2)}.
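The distribution p(\Delta F) in Equation (130) can be sampled directly from the two mesoscopic variables. The sketch below assumes known noise (\beta^{(m)}=\beta_{0}^{(m)}) and a wide uniform prior [-\xi_{a},\xi_{a}], in which limit the erfc factors reduce to constants; the resulting closed form for the volume terms is our own simplification, so this should be read as illustrative rather than as the exact expression behind Figure 8.

import numpy as np

rng = np.random.default_rng(3)

def delta_F_samples(N, sigma0_sq, x2_bar, a0, xi_a=10.0, n_samples=100_000):
    """Monte Carlo samples of Delta F (Eq. (130)) for the slope-only model.

    N, sigma0_sq, x2_bar, a0: length-2 sequences for the two datasets.
    The N*ln(sigma^2) terms and the upsilon contributions cancel between
    F_int and F_iso, leaving the quadratic term of Eq. (128) plus
    log-volume terms (wide-prior limit)."""
    beta0 = np.asarray(N, float) / np.asarray(sigma0_sq, float)
    A = beta0 * np.asarray(x2_bar, float)           # beta^{(m)} x2bar^{(m)}
    tau1 = rng.standard_normal((n_samples, 2))
    a_hat = np.asarray(a0, float) + tau1 / np.sqrt(A)    # Eq. (107)
    quad = 0.5 * (A[0] * A[1] / A.sum()) * (a_hat[:, 0] - a_hat[:, 1]) ** 2
    log_terms = 0.5 * np.log(np.pi * A.sum() / (2 * A[0] * A[1])) - np.log(xi_a)
    return quad + log_terms

dF = delta_F_samples([100, 100], [1.0, 1.0], [1.0, 1.0], a0=[2.0, 4.0])
print((dF < 0).mean())  # fraction of samples where integration is (wrongly) preferred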

Figure 8(a) shows that with a small number of data points, the frequency of \Delta F<0 is high, and failures in model selection occur frequently. Conversely, Figures 8(b) and 8(c) show that with a larger number of data points, failures in model selection do not occur.

Figure 8: Probability distribution of differences in the Bayesian free energy for Bayesian integration. Probability distribution from 100,000 samples of the Bayesian free energy difference (Equation (130)) with model parameters a_{0}^{(1)}=2.0, a_{0}^{(2)}=4.0, \sigma_{0}^{2}=1.0 for data sizes N=5,100,1000. Solid black lines represent the theoretical curves calculated from the non-central chi-squared distribution.
Figure 9: Heat map of \Delta F as a function of \tau_{1}^{(1)} and \tau_{1}^{(2)} with model parameters a_{0}^{(1)}=2.0, a_{0}^{(2)}=4.0, \sigma_{0}^{2}=1.0 for data sizes N=5,100,1000. The small, medium, and large circles have radii of 1, 2, and 3, respectively. According to the properties of the chi-squared distribution with 2 degrees of freedom, the probabilities for (\tau_{1}^{(1)},\tau_{1}^{(2)}) to lie within these circles are approximately 39%, 86%, and 98%, respectively.

Figure 9 shows the relationship between the free energy difference \Delta F(\tau_{1}^{(1)},\tau_{1}^{(2)}) and (\tau_{1}^{(1)},\tau_{1}^{(2)}). Here, the model parameters are set as a_{0}^{(1)}=2.0, a_{0}^{(2)}=4.0, \sigma_{0}^{2}=1.0 for data sizes N=5,100,1000. The depicted region corresponds to the main area in which \tau_{1}^{(1)} and \tau_{1}^{(2)} are generated. Equation (128) shows that the difference in free energy is primarily due to the difference in the estimated parameter a. Since the true slopes differ substantially here (|a_{0}^{(1)}-a_{0}^{(2)}|=2.0), model selection based on the free energy indicates that the two datasets should be treated as almost independent. However, when the amount of data is small, the fluctuations of \tau_{1}^{(1)} and \tau_{1}^{(2)} in the estimation of a become relatively large, so the posterior probability of treating the datasets as independent decreases slightly.

Next, Figure 10 shows the frequency distribution from 100,000 samples of the probability distribution of the difference in the Bayesian free energy with model parameters a_{0}^{(1)}=2.0, a_{0}^{(2)}=2.0, \sigma_{0}^{2}=1.0 (Equation (130)). Figures 10(a)-(c) show that as the number of data points increases, the frequency of \Delta F>0 gradually decreases. However, even at N=1000, some frequency of \Delta F>0 remains, indicating that failures in model selection still occur. The minimum of the distribution shifts in the negative direction on a \log N scale, as is evident from Equation (127).

Figure 10: Probability distribution of differences in the Bayesian free energy for Bayesian integration. Frequency distribution from 100,000 samples of the Bayesian free energy difference (Equation (130)) with model parameters a_{0}^{(1)}=2.0, a_{0}^{(2)}=2.0, \sigma_{0}^{2}=1.0 for data sizes N=5,100,1000. Solid black lines represent the theoretical curves calculated from the non-central chi-squared distribution.
Figure 11: Heat map of \Delta F as a function of \tau_{1}^{(1)} and \tau_{1}^{(2)} with model parameters a_{0}^{(1)}=2.0, a_{0}^{(2)}=2.0, \sigma_{0}^{2}=1.0 for data sizes N=5,100,1000. The small, medium, and large circles have radii of 1, 2, and 3, respectively. According to the properties of the chi-squared distribution with 2 degrees of freedom, the probabilities for (\tau_{1}^{(1)},\tau_{1}^{(2)}) to lie within these circles are approximately 39%, 86%, and 98%, respectively.

Figure 11 shows the relationship between the free energy difference \Delta F(\tau_{1}^{(1)},\tau_{1}^{(2)}) and (\tau_{1}^{(1)},\tau_{1}^{(2)}). Here, the model parameters are set as a_{0}^{(1)}=2.0, a_{0}^{(2)}=2.0, \sigma_{0}^{2}=1.0 for data sizes N=5,100,1000. In this case, since the true parameters coincide (a_{0}^{(1)}-a_{0}^{(2)}=0.0), the result indicates that the datasets should be integrated, as shown in the figure. Furthermore, as the amount of data increases, the free energy difference moves further in the negative direction, indicating that the posterior probability of integration increases.

Finally, Figure 12 shows the selection probabilities of the separate model derived from the probability distribution of the difference in the Bayesian free energy with model parameters a_{0}^{(1)}=2.0 and a_{0}^{(2)}=4.0, 3.0, 2.0 (Equation (130)). Here, we display the frequency distribution as a two-dimensional histogram from 100,000 samples in the two-dimensional space of the number of data points N and the data noise intensity \sigma^{2}. Figure 12(a) shows that model selection tends to fail along the diagonal line where N and \sigma^{2} take comparable values. As the number of data points increases away from this line, model selection gradually becomes more reliable. Conversely, as the number of data points decreases below the diagonal line, the two models can no longer be discriminated. As shown in Figures 12(b) and (c), this diagonal band widens as the value of a_{0}^{(2)} approaches that of a_{0}^{(1)}, and when a_{0}^{(2)}=2.0, only the integrated model is selected.
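A sketch of the scan behind such a phase diagram, reusing the delta_F_samples routine from the previous sketch; the grid ranges are illustrative choices, not the exact settings of Figure 12.

import numpy as np

# For each (N, sigma^2) cell, estimate P(Delta F > 0), i.e. how often
# the separate model is selected (known-noise, wide-prior assumptions).
N_grid = np.unique(np.logspace(0.7, 3, 12).astype(int))
s2_grid = np.logspace(-2, 2, 12)
p_sep = np.zeros((len(N_grid), len(s2_grid)))
for i, N in enumerate(N_grid):
    for j, s2 in enumerate(s2_grid):
        dF = delta_F_samples([N, N], [s2, s2], [1.0, 1.0],
                             a0=[2.0, 4.0], n_samples=5_000)
        p_sep[i, j] = (dF > 0).mean()
print(p_sep.round(2))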

Figure 12: Selection probabilities of the separate model derived from the probability distribution of the difference in the Bayesian free energy with model parameters a_{0}^{(1)}=2.0 and a_{0}^{(2)}=4.0, 3.0, 2.0 (Equation (130)). Here, we set b_{0}^{(1)}=0.0, b_{0}^{(2)}=0.0 and display the frequency distribution as a two-dimensional histogram from 100,000 samples in the two-dimensional space of the number of data points N and the data noise intensity \sigma^{2}.

V Conclusion

In this study, we proposed an innovative theory that can address a finite number of measurements N in the linear regression model, a previously unaddressed challenge. This is the first step towards a new theoretical framework that goes beyond the conventional Bayesian measurement framework. Through our results, we confirmed that the outcomes of Bayesian estimation using the theoretically derived microscopic and mesoscopic representations are consistent.

We summarize the insights gained from our results as follows. In conventional theoretical frameworks, only asymptotic characteristics, in which the number of observation data N approaches infinity, are discussed, making it difficult to account for fluctuations due to finite data [19]. By introducing O(1) mesoscopic variables defined from the N Gaussian noises, we succeeded in theoretically determining how important statistics such as the free energy difference \Delta F converge to their limits as N increases to infinity. We have established a theoretical foundation that can handle fluctuations due to finite data in the estimation of the posterior probability distribution of parameters, model selection, and Bayesian integration, the three core principles of Bayesian measurement. This is a groundbreaking achievement in the history of Bayesian inference.

As a result, the posterior probability distribution of parameters in the linear regression model could be analytically expressed in terms of mesoscopic variables consisting of sums of N Gaussian variables when using the microscopic and mesoscopic representations. The residual error can be described using Gaussian and chi-squared distributions of mesoscopic variables, enabling the posterior probability distribution to be derived analytically even for a finite number of observed data N. This is particularly important in real measurements where data are limited, demonstrating the potential for practical applications.

Regarding model selection, the proposed theory proved particularly useful. In conventional Bayesian measurements based on numerical calculations, fluctuations due to the finite number N are often overlooked, leading to erroneous model selection results. The proposed theory addresses this by analytically evaluating the Bayesian free energy difference \Delta F, which depends on the number of observed data N. This enables us to quantitatively evaluate the variation in the free energy difference distribution obtained from the microscopic and mesoscopic representations, demonstrating how the number of observational data N and the observation noise variance \sigma^{2} affect model selection. Theoretically, we demonstrated that the model selection results are stable when the number of observational data N is about 100, suggesting that the proposed theory can provide guidelines for actual measurements.

The proposed theory also proved useful in Bayesian integration, enabling analytical evaluation even when the number of observational data N is finite, and showing how N and the observation noise variance \sigma^{2} affect the results of Bayesian integration. Theoretically, we demonstrated that the Bayesian integration results are stable when N is about 100, again suggesting that the proposed theory can provide guidelines for the data analysis and design of actual measurements.

The proposed theory establishes a new paradigm in Bayesian measurement, leading to more accurate and reliable scientific and technological results. The linear regression model y=ax+b, while seemingly simple, is not merely a vehicle for theoretical analysis. It is widely used in real measurement settings, such as linear system responses. Furthermore, insights gained from this model can be extended to general nonlinear models. We hope that this research will contribute to the development of measurement and of data analysis and design in various fields of natural science.

Acknowledgements.
This work was supported by JSPS KAKENHI Grant Numbers JP23H00486, 23KJ0723, and 23K16959, and CREST Grant Numbers JPMJCR1761 and JPMJCR1861 from the Japan Science and Technology Agency (JST).

References

  • [1] Giuseppe Carleo, Ignacio Cirac, Kyle Cranmer, Laurent Daudet, Maria Schuld, Naftali Tishby, Leslie Vogt-Maranto, and Lenka Zdeborová. Machine learning and the physical sciences. Rev. Mod. Phys., 91:045002, Dec 2019.
  • [2] H. Wang, T. Fu, Y. Du, et al. Scientific discovery in the age of artificial intelligence. Nature, 620:47–60, 2023.
  • [3] K. Nagata, S. Sugita, and M. Okada. Bayesian spectral deconvolution with the exchange monte carlo method. Neural Networks, 28:82, 2012.
  • [4] K. Nagata, R. Muraoka, Y. Mototake, T. Sasaki, and M. Okada. Bayesian spectral deconvolution based on poisson distribution: Bayesian measurement and virtual measurement analytics (vma). J. Phys. Soc. Jpn., 88:044003, 2019.
  • [5] S. Tokuda, K. Nagata, and M. Okada. Simultaneous estimation of noise variance and number of peaks in bayesian spectral deconvolution. J. Phys. Soc. Jpn., 86:024001, 2017.
  • [6] S. Katakami, H. Sakamoto, K. Nagata, T. Arima, and M. Okada. Bayesian parameter estimation from dispersion relation observation data with poisson process. Phys. Rev. E, 105:065301, 2022.
  • [7] H. Ueda, S. Katakami, S. Yoshida, T. Koyama, Y. Nakai, T. Mito, M. Mizumaki, and M. Okada. Bayesian approach to t1 analysis in nmr spectroscopy with applications to solid state physics. J. Phys. Soc. Jpn., 92:054002, 2023.
  • [8] R. Nishimura, S. Katakami, K. Nagata, M. Mizumaki, and M. Okada. Bayesian integration for hamiltonian parameters of crystal field. J. Phys. Soc. Jpn., 93:034003, 2024.
  • [9] Y. Yokoyama, T. Uozumi, K. Nagata, M. Okada, and M. Mizumaki. Bayesian integration for hamiltonian parameters of x-ray photoemission and absorption spectroscopy. J. Phys. Soc. Jpn., 90:034703, 2021.
  • [10] R. Moriguchi, S. Tsutsui, S. Katakami, K. Nagata, M. Mizumaki, and M. Okada. Bayesian inference on hamiltonian selections for mössbauer spectroscopy. J. Phys. Soc. Jpn., 91:104002, 2022.
  • [11] Y. Hayashi, S. Katakami, S. Kuwamoto, K. Nagata, M. Mizumaki, and M. Okada. Bayesian inference for small-angle scattering data. J. Phys. Soc. Jpn., 92:094002, 2023.
  • [12] Y. Yokoyama, S. Kawaguchi, and M. Mizumaki. Bayesian framework for analyzing adsorption processes observed via time-resolved x-ray diffraction. Sci. Rep., 13:14349, 2023.
  • [13] Y. Yokoyama, N. Tsuji, I. Akai, K. Nagata, M. Okada, and M. Mizumaki. Bayesian orbital decomposition and determination of end condition for magnetic compton scattering. J. Phys. Soc. Jpn., 90:094802, 2021.
  • [14] T. Yamasaki, K. Iwamitsu, H. Kumazoe, M. Okada, M. Mizumaki, and I. Akai. Bayesian spectroscopy of synthesized soft x-ray absorption spectra showing magnetic circular dichroism at the Ni-L3, L2 edges. Sci. Technol. Adv. Mater. Methods, 1:75, 2021.
  • [15] K. Iwamitsu, T. Yokota, K. Murata, M. Kamezaki, M. Mizumaki, T. Uruga, and I. Akai. Spectral analysis of x-ray absorption near edge structure in α-Fe2O3 based on bayesian spectroscopy. Phys. Status Solidi B, 257:2000107, 2020.
  • [16] H. Kumazoe, K. Iwamitsu, M. Imamura, K. Takahashi, Y. Mototake, M. Okada, and I. Akai. Quantifying physical insights cooperatively with exhaustive search for bayesian spectroscopy of x-ray photoelectron spectra. Sci. Rep., 13:13221, 2023.
  • [17] S. Kashiwamura, S. Katakami, R. Yamagami, K. Iwamitsu, H. Kumazoe, K. Nagata, T. Okajima, I. Akai, and M. Okada. Bayesian spectral deconvolution of x-ray absorption near edge structure discriminating between high- and low-energy domains. J. Phys. Soc. Jpn., 91:074009, 2022.
  • [18] S. Tokuda, K. Nagata, and M. Okada. Intrinsic regularization effect in bayesian nonlinear regression scaled by observed data. Phys. Rev. Res., 4:043165, 2022.
  • [19] G. Schwarz. Estimating the dimension of a model. Ann. Stat., 6:461, 1978.
  • [20] Serge Lang. Linear algebra. Springer Science & Business Media, 1987.
  • [21] G. A. F. Seber and A. J. Lee. Linear regression analysis. John Wiley & Sons, Hoboken, NJ, 2012.

Appendix A Proof that Residual Errors Follow a Chi-squared Distribution

Consider an orthogonal matrix Q\in\mathbb{R}^{N\times N} whose first and second rows are defined as follows:

\bm{q}_{1}^{T}=\left(\frac{1}{\sqrt{N}},\frac{1}{\sqrt{N}},\cdots,\frac{1}{\sqrt{N}}\right) (131)
\bm{q}_{2}^{T}=\left(\frac{x_{1}}{\sqrt{N\bar{x^{2}}}},\frac{x_{2}}{\sqrt{N\bar{x^{2}}}},\cdots,\frac{x_{N}}{\sqrt{N\bar{x^{2}}}}\right) (132)

The existence of such an orthogonal matrix is guaranteed by the basis extension theorem in linear algebra [20, 21]. From the properties of orthogonal matrices, we have:

Q^{T}Q=QQ^{T}=I (133)

The random variables \{n_{i}\}_{i=1}^{N} independently follow a Gaussian distribution \mathcal{N}(0,\sigma^{2}); let \bm{n}^{T}=(n_{1},n_{2},\cdots,n_{N}). The probability density function of \bm{n} is given by:

f(\bm{n})=\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\right)^{N}\exp\left(-\frac{1}{2\sigma^{2}}\bm{n}^{T}\bm{n}\right) (134)

Applying the orthogonal transformation \tilde{\bm{n}}=Q\bm{n}, we have:

f(\tilde{\bm{n}})=\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\right)^{N}\exp\left(-\frac{1}{2\sigma^{2}}\tilde{\bm{n}}^{T}\tilde{\bm{n}}\right) (135)

The elements \tilde{n}_{i} of \tilde{\bm{n}} obtained by the orthogonal transformation are therefore independent. In addition, \tilde{n}_{1} and \tilde{n}_{2} are given by:

\tilde{n}_{1} = \bm{q}_{1}^{T}\bm{n} (136)
 = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}n_{i} (137)
\tilde{n}_{2} = \bm{q}_{2}^{T}\bm{n} (138)
 = \frac{1}{\sqrt{N\bar{x^{2}}}}\sum_{i=1}^{N}x_{i}n_{i} (139)

which correspond to the second and first terms of the right-hand side of Equation (33), respectively. From this, the residual error is:

E(\hat{a},\hat{b})\times 2N = -\left(\frac{1}{\sqrt{N\bar{x^{2}}}}\sum_{i=1}^{N}x_{i}n_{i}\right)^{2}-\left(\frac{1}{\sqrt{N}}\sum_{i=1}^{N}n_{i}\right)^{2}+\sum_{i=1}^{N}n_{i}^{2} (140)
 = -\tilde{n}_{2}^{2}-\tilde{n}_{1}^{2}+\sum_{i=1}^{N}n_{i}^{2} (141)
 = -\tilde{n}_{2}^{2}-\tilde{n}_{1}^{2}+\sum_{i=1}^{N}\tilde{n}_{i}^{2} (142)
 = \sum_{i=3}^{N}\tilde{n}_{i}^{2} (143)

Thus, E(\hat{a},\hat{b})\times 2N/\sigma^{2} follows a chi-squared distribution with N-2 degrees of freedom, independent of the first and second terms on the right-hand side of Equation (33).
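This result is easy to verify numerically. The sketch below centers x so that q_1 and q_2 are orthogonal (an assumption we make explicit here) and compares the empirical distribution of 2N E(\hat{a},\hat{b})/\sigma^{2} with the chi-squared distribution with N-2 degrees of freedom.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Numerical check that 2N * E(a_hat, b_hat) / sigma^2 ~ chi^2(N-2).
N, sigma2, trials = 20, 1.0, 200_000
x = rng.uniform(-1, 1, N)
x -= x.mean()                 # makes q_1 and q_2 of Eqs. (131)-(132) orthogonal
x2_bar = np.mean(x**2)
n = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
stat = (n**2).sum(1) - (n @ x)**2 / (N * x2_bar) - n.sum(1)**2 / N  # Eq. (140)
stat /= sigma2
print(stat.mean(), stat.var())                              # ~ N-2 and ~ 2(N-2)
print(stats.kstest(stat, "chi2", args=(N - 2,)).statistic)  # small KS distance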

Appendix B Noise Estimation

In this section, we examine how the inclusion of noise estimation affects the results of model selection and Bayesian integration, using mesoscopic variables for Bayesian representation.

B.1 Bayesian Inference with Noise Estimation

In this subsection, we explore the impact of noise estimation on Bayesian inference. We extend the previous models to include noise variance as a probabilistic variable, making the framework applicable to realistic situations where the noise intensity is unknown beforehand.

B.1.1 Noise Variance Estimation

Up to this point, the noise variance \sigma^{2}_{0} has been treated as a constant. By considering the noise variance as a probabilistic variable \sigma^{2}, this section demonstrates how to estimate the noise variance from the data by maximizing the posterior distribution p(\sigma^{2}|Y) on the basis of Bayesian inference.

The posterior probability of the noise variance p(\sigma^{2}|Y) can be determined from the joint probability p(\sigma^{2},a,b,Y) as

p(\sigma^{2},a,b,Y) = p(Y|\sigma^{2},a,b)p(a)p(b)p(\sigma^{2}). (144)

The \sigma^{2}-dependent part of the posterior probability p(\sigma^{2}|Y) is

p(\sigma^{2}|Y) \propto p(\sigma^{2})p(Y|\sigma^{2}), (145)
 = p(\sigma^{2})\int\mathrm{d}a\,\mathrm{d}b\ p(Y|\sigma^{2},a,b)p(a)p(b),
 = p(\sigma^{2})\left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\right)^{N}\exp\left(-\frac{N}{\sigma^{2}}E(\hat{a},\hat{b})\right)
 \times \frac{1}{2\xi_{a}}\sqrt{\frac{\sigma^{2}\pi}{2N\bar{x^{2}}}}\left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}}}\left(-\xi_{a}-\frac{\bar{xy}}{\bar{x^{2}}}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}}}\left(\xi_{a}-\frac{\bar{xy}}{\bar{x^{2}}}\right)\right)\right]
 \times \frac{1}{2\xi_{b}}\sqrt{\frac{\sigma^{2}\pi}{2N}}\left[\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}}}\left(-\xi_{b}-\bar{y}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}}}\left(\xi_{b}-\bar{y}\right)\right)\right]. (147)

When the prior distribution p(\sigma^{2}) is uniform, the posterior probability of the noise variance can be equated, up to a constant, to the marginal likelihood over the model parameters a, b, enabling us to treat Equation (145) similarly to the calculation of Equation (32). The free energy F(\sigma^{2}), obtained by taking the negative logarithm of p(\sigma^{2}|Y), is

F(\sigma^{2}) = -\ln p(\sigma^{2}|Y),
 \sim \frac{N}{2}\ln(2\pi\sigma^{2})-\ln\left(\frac{\sigma^{2}\pi}{2N}\right)+\frac{1}{2}\ln\left(\bar{x^{2}}\right)+\ln(2\xi_{a})+\ln(2\xi_{b})+\frac{N}{\sigma^{2}}E(\hat{a},\hat{b})
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}}}\left(-\xi_{a}-\hat{a}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}}}\left(\xi_{a}-\hat{a}\right)\right)\right]
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}}}\left(-\xi_{b}-\hat{b}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}}}\left(\xi_{b}-\hat{b}\right)\right)\right]. (149)

The optimal noise variance \sigma^{2} can be obtained by

\hat{\sigma}^{2}(\upsilon,\tau_{1},\tau_{2}) = \arg\max_{\sigma^{2}}p(\sigma^{2}|Y), (150)
 = \arg\min_{\sigma^{2}}F(\sigma^{2}). (151)
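Since the minimization in Equation (151) has no closed form, it can be carried out by a one-dimensional numerical search. The following sketch implements F(\sigma^{2}) from Equation (149) and minimizes it with SciPy; all numerical inputs are illustrative, and to leading order the minimizer sits near 2E(\hat{a},\hat{b})N/(N-2) when the erfc terms are negligible.

import numpy as np
from scipy.special import erfc
from scipy.optimize import minimize_scalar

def free_energy_sigma2(s2, N, x2_bar, E_hat, a_hat, b_hat, xi_a, xi_b):
    """F(sigma^2) of Eq. (149); E_hat = E(a_hat, b_hat) computed from the data."""
    F = 0.5 * N * np.log(2 * np.pi * s2) - np.log(s2 * np.pi / (2 * N)) \
        + 0.5 * np.log(x2_bar) + np.log(2 * xi_a) + np.log(2 * xi_b) \
        + N * E_hat / s2
    ca = np.sqrt(N * x2_bar / (2 * s2))
    cb = np.sqrt(N / (2 * s2))
    F -= np.log(erfc(ca * (-xi_a - a_hat)) - erfc(ca * (xi_a - a_hat)))
    F -= np.log(erfc(cb * (-xi_b - b_hat)) - erfc(cb * (xi_b - b_hat)))
    return F

# One-dimensional search for sigma_hat^2 (Eq. (151)); values are illustrative.
res = minimize_scalar(free_energy_sigma2, bounds=(1e-4, 1e2), method="bounded",
                      args=(100, 1.0, 0.49, 1.0, 0.0, 10.0, 10.0))
print(res.x)  # near 2*E_hat*N/(N-2) ~ 1.0 for these inputs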

B.1.2 Noise Variance Through Mesoscopic Variables

Here, we describe the noise variance using mesoscopic variables. Since \hat{\sigma}^{2}(\upsilon,\tau_{1},\tau_{2}) cannot be determined analytically, we assume it has been estimated numerically. The probability distribution of the noise variance can then be described as

p(\sigma^{2})=\int\mathrm{d}\upsilon\,\mathrm{d}\tau_{1}\,\mathrm{d}\tau_{2}\,\delta(\sigma^{2}-\hat{\sigma}^{2}(\upsilon,\tau_{1},\tau_{2}))p(\upsilon)p(\tau_{1})p(\tau_{2}). (152)

B.1.3 Numerical Experiment Including Noise Estimation: Bayesian Inference

Here, we numerically verify the results of Bayesian estimation including noise estimation. Initially, noise estimation is performed using Equation (149), and the estimated noise is used to calculate the free energy from Equation (43).

Figure 13 shows the probability distribution of the free energy density during noise estimation and that of the estimated noise. Figures 13(a)-(c) show the probability distributions of the normalized free energy values, calculated from Equation (43) for 100,000 artificially generated data patterns with model parameters a_{0}=1.0, b_{0}=0.0, \sigma_{0}^{2}=1.0, where noise estimation was performed. Figures 13(d)-(f) show the frequency distribution of the noise estimated for each dataset.

Figure 13: Probability distribution of free energy density and estimated noise. (a)-(c): Probability distribution of the normalized values of free energy calculated from Equation (43) for 100,000 artificially generated data patterns with model parameters a_{0}=1.0, b_{0}=0.0, \sigma_{0}^{2}=1.0, where noise estimation was performed. (d)-(f): Probability distribution of noise estimated for each of the 100,000 patterns of artificial data.

Figure 14 shows the frequency distribution of the estimated noise for 10,000 artificially generated data patterns with model parameters a_{0}=1.0, b_{0}=0.0, \sigma_{0}^{2}=1.0 and N\in[5,1000]. The figure shows that as N increases, the frequency distribution of the estimated noise converges towards the true value.

Figure 14: Frequency distribution of estimated noise for 10,000 artificially generated data patterns with model parameters a_{0}=1.0, b_{0}=0.0, \sigma_{0}^{2}=1.0 and N\in[5,1000].

Figure 15 shows the frequency distribution of the estimated noise for 10,000 artificially generated data patterns with N=5,100,1000, \sigma^{2}_{0}\in[0.01,1], and model parameters a_{0}=1.0, b_{0}=0.0. Figure 15 shows that the estimation accuracy of the estimated noise depends only on N and not on \sigma^{2}_{0}.

Figure 15: Frequency distribution of estimated noise for 10,000 artificially generated data patterns with model parameters a_{0}=1.0, b_{0}=0.0 and N=5,100,1000, \sigma^{2}_{0}\in[0.01,1].

B.2 Model Selection with Noise Estimation

In this subsection, we delve into how noise estimation affects the process of model selection. By incorporating the noise variance as a probabilistic variable, we refine the Bayesian framework to better handle realistic scenarios where the noise intensity is unknown.

We will specifically explore how the estimation of noise variance affects the comparison between different models. This includes demonstrating the application of Bayesian free energy to select between one- and two-variable linear regression models with the additional complexity of noise estimation.

B.2.1 Noise Variance Estimation for One-Variable Linear Regression Model

Up to this point, the noise variance \sigma_{0}^{2} was treated as known. By considering the noise variance as a probabilistic variable \sigma^{2} and building upon the discussions in previous sections, this section demonstrates a method for estimating the noise variance from data by maximizing the posterior distribution p(\sigma^{2}|Y) on the basis of Bayesian inference.

Given the joint probability p(\sigma^{2},a,Y), the posterior probability of the noise variance p(\sigma^{2}|Y) is derived from

p(\sigma^{2},a,Y) = p(Y|\sigma^{2},a)p(a)p(\sigma^{2}). (153)

The portion of the posterior probability p(\sigma^{2}|Y) dependent on \sigma^{2} is

p(\sigma^{2}|Y) = \frac{1}{p(Y)}\int\mathrm{d}a\ p(\sigma^{2},a,Y), (154)
 = \frac{p(\sigma^{2})}{p(Y)}\int\mathrm{d}a\ p(Y|\sigma^{2},a)p(a), (155)
p(Y) = \int\mathrm{d}a\,\mathrm{d}\sigma^{2}\ p(Y|\sigma^{2},a)p(a)p(\sigma^{2}). (156)

Assuming a uniform prior distribution p(\sigma^{2}), the integral on the right side of Equation (155) can be evaluated similarly to the calculation of Equation (64). Taking the negative logarithm of Equation (155), the free energy F(\sigma^{2}) is expressed as

F(\sigma^{2}) = -\ln p(\sigma^{2}|Y) (158)
 \sim \frac{N}{2}\ln(2\pi\sigma^{2})-\frac{1}{2}\ln\left(\frac{\sigma^{2}\pi}{2N\bar{x^{2}}}\right)+\ln(2\xi_{a})+\frac{N}{\sigma^{2}}E(\hat{a})
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}}}\left(-\xi_{a}-\hat{a}\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}}}\left(\xi_{a}-\hat{a}\right)\right)\right].

Here, note that p(Y) is constant with respect to \sigma^{2}.

The optimal noise variance \sigma^{2} can be obtained by

\hat{\sigma}^{2}(\upsilon_{2},\tau_{1}) = \arg\max_{\sigma^{2}}p(\sigma^{2}|Y), (159)
 = \arg\min_{\sigma^{2}}F(\sigma^{2}). (160)

B.2.2 Noise Variance in One-Variable Linear Regression Model Through Mesoscopic Variables

Here, we describe the noise variance using mesoscopic variables. Since \hat{\sigma}^{2}(\upsilon_{2},\tau_{1}) cannot be determined analytically, assuming it has been determined numerically, the probability distribution of the noise variance can be described as:

p(\sigma^{2})=\int\mathrm{d}\upsilon_{2}\,\mathrm{d}\tau_{1}\,\delta(\sigma^{2}-\hat{\sigma}^{2}(\upsilon_{2},\tau_{1}))p(\upsilon_{2})p(\tau_{1}). (161)

B.2.3 Model Selection Through Bayesian Free Energy with Noise Estimation

Up to this point, we have considered model selection on the basis of the known true noise variance. Now, let us consider model selection when the noise variance is also estimated within each model, denoted as \sigma_{1}^{2} and \sigma_{2}^{2} for the two models, respectively. The Bayesian free energy for each model, after estimating the noise variance, is

F_{y=ax+b}(\upsilon_{1},\tau_{1},\tau_{2},\sigma^{2}_{1}) = \frac{N}{2}\ln(2\pi\sigma^{2}_{1})-\ln\left(\frac{\sigma^{2}_{1}\pi}{2N}\right)+\frac{1}{2}\ln\left(\bar{x^{2}}\right)+\ln(2\xi_{a})+\ln(2\xi_{b})+\frac{\sigma^{2}_{0}\upsilon_{1}}{2\sigma^{2}_{1}}
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{1}}}\left(-\xi_{a}-\hat{a}(\tau_{1})\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{1}}}\left(\xi_{a}-\hat{a}(\tau_{1})\right)\right)\right]
 - \ln\left[\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{1}}}\left(-\xi_{b}-\hat{b}(\tau_{2})\right)\right)-\mathrm{erfc}\left(\sqrt{\frac{N}{2\sigma^{2}_{1}}}\left(\xi_{b}-\hat{b}(\tau_{2})\right)\right)\right] (162)

for the two-variable model and

$$
\begin{aligned}
F_{y=ax}(\upsilon_{2},\tau_{1},\sigma^{2}_{2}) &= \frac{N}{2}\ln(2\pi\sigma^{2}_{2}) - \frac{1}{2}\ln\!\left(\frac{\sigma^{2}_{2}\pi}{2N\bar{x^{2}}}\right) + \ln(2\xi_{a}) + \frac{\sigma^{2}_{0}\upsilon_{2}}{2\sigma^{2}_{2}} \\
&\quad - \ln\!\left[\mathrm{erfc}\!\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{2}}}\,\bigl(-\xi_{a}-\hat{a}(\tau_{1})\bigr)\right) - \mathrm{erfc}\!\left(\sqrt{\frac{N\bar{x^{2}}}{2\sigma^{2}_{2}}}\,\bigl(\xi_{a}-\hat{a}(\tau_{1})\bigr)\right)\right] && (163)
\end{aligned}
$$

for the one-variable model.

From these expressions, the difference in the Bayesian free energy, taking noise-variance estimation into account, can be written as

$$\Delta F(\upsilon_{1},\tau_{1},\tau_{2},\sigma^{2}_{1},\sigma^{2}_{2}) = F_{y=ax}(\upsilon_{2},\tau_{1},\sigma^{2}_{2}) - F_{y=ax+b}(\upsilon_{1},\tau_{1},\tau_{2},\sigma^{2}_{1}). \qquad (164)$$

This difference is determined by the mesoscopic variables and their relationships, as noted in Equation (75), enabling us to express the probability distribution of the difference in the Bayesian free energy as a function of the mesoscopic variables:

$$p(\Delta F) = \int \mathrm{d}\upsilon_{1}\, \mathrm{d}\tau_{1}\, \mathrm{d}\tau_{2}\; \delta\!\left(\Delta F - \Delta F\!\left(\upsilon_{1},\tau_{1},\tau_{2},\sigma^{2}_{1}(\upsilon_{1},\tau_{1},\tau_{2}),\sigma^{2}_{2}(\upsilon_{1},\tau_{1},\tau_{2})\right)\right) p(\upsilon_{1})\, p(\tau_{1})\, p(\tau_{2}). \qquad (165)$$
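The following sketch reproduces this construction at the data level (which implicitly samples the mesoscopic variables): for each synthetic dataset, it minimizes the free energies of Equations (162) and (163) over the respective noise variances and records the sign of $\Delta F$ in Equation (164). Centered inputs $x$ (so that the $a$- and $b$-integrals decouple) and all settings are illustrative assumptions.

```python
# Data-level sketch of Eqs. (162)-(165): estimate the selection probability of
# the two-variable model by sampling datasets, minimizing each model's free
# energy over its own noise variance, and checking the sign of Delta F.
# Centered x (decoupling the a- and b-integrals) and all settings are
# illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import erfc

rng = np.random.default_rng(2)
N, a0, b0, sigma0_2 = 50, 1.0, 0.5, 1.0
xi_a = xi_b = 10.0

def log_bracket(scale, xi, center):
    """ln[ erfc(scale(-xi - center)) - erfc(scale(xi - center)) ]."""
    return np.log(erfc(scale * (-xi - center)) - erfc(scale * (xi - center)))

def F_two(sigma2, x, y):
    """Free energy of y = ax + b, Eq. (162), with data-level residuals."""
    a_hat = np.sum(x * y) / np.sum(x**2)   # decoupled because x is centered
    b_hat = np.mean(y)
    rss = np.sum((y - a_hat * x - b_hat) ** 2)
    x2b = np.mean(x**2)
    return (0.5 * N * np.log(2 * np.pi * sigma2)
            - np.log(sigma2 * np.pi / (2 * N)) + 0.5 * np.log(x2b)
            + np.log(2 * xi_a) + np.log(2 * xi_b) + rss / (2 * sigma2)
            - log_bracket(np.sqrt(N * x2b / (2 * sigma2)), xi_a, a_hat)
            - log_bracket(np.sqrt(N / (2 * sigma2)), xi_b, b_hat))

def F_one(sigma2, x, y):
    """Free energy of y = ax, Eq. (163), with data-level residuals."""
    a_hat = np.sum(x * y) / np.sum(x**2)
    rss = np.sum((y - a_hat * x) ** 2)
    x2b = np.mean(x**2)
    return (0.5 * N * np.log(2 * np.pi * sigma2)
            - 0.5 * np.log(sigma2 * np.pi / (2 * N * x2b))
            + np.log(2 * xi_a) + rss / (2 * sigma2)
            - log_bracket(np.sqrt(N * x2b / (2 * sigma2)), xi_a, a_hat))

trials, wins = 1000, 0
for _ in range(trials):
    x = rng.uniform(-1, 1, N)
    x -= x.mean()                          # enforce the centering assumption
    y = a0 * x + b0 + rng.normal(0, np.sqrt(sigma0_2), N)
    F_ab = minimize_scalar(lambda s2: F_two(s2, x, y),
                           bounds=(1e-2, 10), method="bounded").fun
    F_a = minimize_scalar(lambda s2: F_one(s2, x, y),
                          bounds=(1e-2, 10), method="bounded").fun
    wins += (F_a - F_ab) > 0               # Delta F > 0 selects y = ax + b
print(f"selection probability of y = ax + b: {wins / trials:.3f}")
```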

B.2.4 Numerical Experiment Including Noise Estimation: Model Selection

Here, we examine how the number of data points and the noise strength of the data affect estimation, based on the results of model selection performed with noise estimation in the mesoscopic representation.

Figure 16 shows the frequency distribution from 100,000 samples of the probability distribution of the difference in the Bayesian free energy when the noise is estimated and when it is assumed known, with model parameters $a_0 = 1.0$, $b_0 = 1.0$, $\sigma_0^2 = 1.0$ (Equation (81)). The horizontal and vertical axes represent the free-energy difference when the noise is known and when it is estimated, respectively. Figure 16 shows that as $N$ increases, the difference between the frequency distributions for estimated and known noise diminishes.

Figure 16 (panels (a)-(c)): Frequency distribution from 100,000 samples of the probability distribution of the difference in the Bayesian free energy when the noise is estimated and when it is assumed known, with model parameters $a_0 = 1.0$, $b_0 = 1.0$, $\sigma_0^2 = 1.0$ (Equation (81)). The horizontal and vertical axes represent the free-energy difference when the noise is known and when it is estimated, respectively.

Figure 17 shows the frequency distribution from 100,000 samples of the probability distribution of the difference in the Bayesian free energy when the noise is estimated and when it is assumed known, with model parameters $a_0 = 1.0$, $b_0 = 0.0$, $\sigma_0^2 = 1.0$ (Equation (81)). The horizontal and vertical axes represent the free-energy difference when the noise is known and when it is estimated, respectively. Figure 17 shows that as $N$ increases, the difference between the frequency distributions for estimated and known noise diminishes.

Figure 17 (panels (a)-(c)): Frequency distribution from 100,000 samples of the probability distribution of the difference in the Bayesian free energy when the noise is estimated and when it is assumed known, with model parameters $a_0 = 1.0$, $b_0 = 0.0$, $\sigma_0^2 = 1.0$ (Equation (81)). The horizontal and vertical axes represent the free-energy difference when the noise is known and when it is estimated, respectively.

Finally, Figure 18 shows the selection probabilities of the two-variable model $y = ax + b$ derived from the probability distribution of the difference in the Bayesian free energy with model parameters $b_0 = 1.0, 0.5, 0.0$ (Equation (81)). Here, we set $a_0 = 1.0$ and display the frequency distribution as a two-dimensional histogram from 100,000 samples over the two-dimensional space of the number of data points $N$ and the data noise intensity $\sigma^2$. Figure 18 (a) shows that model selection tends to fail along the diagonal line where $N$ and $\sigma^2$ take similar values. As the number of data points increases away from this line, increasingly appropriate model selection becomes possible. Conversely, as the number of data points decreases away from the diagonal line, the two models can no longer be discriminated. As shown in Figures 18 (b) and (c), this diagonal line widens as the value of $b_0$ decreases, and at $b_0 = 0.0$ the selection probability of $y = ax + b$ vanishes. The overall behavior of the probability distribution does not change regardless of whether noise estimation is performed.

From the aforementioned results, we found that the difference between simultaneously estimating two noises and estimating each noise independently becomes negligible for large data sizes. The former method involves optimization in a high-dimensional space, whereas the latter involves optimization in a one-dimensional space. Estimating multiple noises simultaneously enlarges the search space exponentially, whereas optimizing each noise independently still ensures sufficient accuracy, which is beneficial for real-world applications.

Figure 18 (panels (a)-(c)): Average selection probability of the two-variable model $y = ax + b$ derived from the probability distribution of the difference in the Bayesian free energy with model parameters $b_0 = 1.0, 0.5, 0.0$ (Equation (81)). The parameters are set as $a_0 = 1.0$, $\sigma_0^2 = 1.0$, and the frequency distribution is shown as a two-dimensional histogram over the number of data points $N$ and the data noise intensity $\sigma^2$.

B.3 Bayesian Integration with Noise Estimation

In this subsection, we examine how noise estimation affects the process of Bayesian integration when multiple datasets are involved. By treating the noise variances as probabilistic variables, we enhance the Bayesian framework to accommodate realistic situations where the noise levels are unknown and may vary between datasets.

B.3.1 Noise Variance Estimation in Bayesian Integration

In this section, we treat the noise variances ${\sigma^{(1)}}^2, {\sigma^{(2)}}^2$ as random variables and demonstrate a method to estimate them by maximizing the posterior distribution $p({\sigma^{(1)}}^2, {\sigma^{(2)}}^2 \mid Y)$ on the basis of Bayesian inference. When the model parameters are inferred independently for each dataset, the noise variances ${\sigma^{(1)}}^2, {\sigma^{(2)}}^2$ should likewise be inferred independently. Here, we consider the case where the model parameters are common across datasets.

Let us assume that the joint probability of ${\sigma^{(1)}}^2, {\sigma^{(2)}}^2, a, b, Y$ is given by:

$$p({\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2}, a, b, Y) = p(Y \mid {\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2}, a, b)\, p(a)\, p(b)\, p({\sigma^{(1)}}^{2})\, p({\sigma^{(2)}}^{2}). \qquad (166)$$

From this, the posterior probability of the noise variances $p({\sigma^{(1)}}^2, {\sigma^{(2)}}^2 \mid Y)$ can be expressed as:

$$
\begin{aligned}
p({\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2} \mid Y) &= \frac{1}{p(Y)}\int \mathrm{d}a\, \mathrm{d}b\; p({\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2}, a, b, Y) && (167) \\
&= \frac{p({\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2})}{p(Y)}\int \mathrm{d}a\, \mathrm{d}b\; p(Y \mid {\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2}, a, b)\, p(a)\, p(b) && (168) \\
p(Y) &= \int \mathrm{d}a\, \mathrm{d}b\, \mathrm{d}{\sigma^{(1)}}^{2}\, \mathrm{d}{\sigma^{(2)}}^{2}\; p(Y \mid {\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2}, a, b)\, p(a)\, p(b)\, p({\sigma^{(1)}}^{2})\, p({\sigma^{(2)}}^{2}). && (169)
\end{aligned}
$$

Here, note that $p(Y)$ is constant with respect to ${\sigma^{(1)}}^2$ and ${\sigma^{(2)}}^2$. The portion of the posterior probability $p({\sigma^{(1)}}^2, {\sigma^{(2)}}^2 \mid Y)$ that depends on ${\sigma^{(1)}}^2, {\sigma^{(2)}}^2$ can then be expressed as:

$$
\begin{aligned}
p({\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2} \mid Y) &\propto p({\sigma^{(1)}}^{2})\, p({\sigma^{(2)}}^{2}) \int \mathrm{d}a\, \mathrm{d}b\; p(Y \mid {\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2}, a, b)\, p(a)\, p(b) && (170) \\
&= p({\sigma^{(1)}}^{2})\, p({\sigma^{(2)}}^{2}) \left(\frac{1}{\sqrt{2\pi{\sigma^{(1)}}^{2}}}\right)^{N_{1}} \left(\frac{1}{\sqrt{2\pi{\sigma^{(2)}}^{2}}}\right)^{N_{2}} \frac{1}{2\xi_{a}}\frac{1}{2\xi_{b}} \exp\!\left(-E(\hat{a},\hat{b})\right) \\
&\quad \times \sqrt{\frac{\pi}{2(\beta^{(1)}\bar{{x^{(1)}}^{2}} + \beta^{(2)}\bar{{x^{(2)}}^{2}})}} \left[\mathrm{erfc}\!\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}} + \beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}\,(-\xi_{a}-\hat{a})\right) - \mathrm{erfc}\!\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}} + \beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}\,(\xi_{a}-\hat{a})\right)\right] \\
&\quad \times \sqrt{\frac{\pi}{2(\beta^{(1)} + \beta^{(2)})}} \left[\mathrm{erfc}\!\left(\sqrt{\frac{\beta^{(1)} + \beta^{(2)}}{2}}\,(-\xi_{b}-\hat{b})\right) - \mathrm{erfc}\!\left(\sqrt{\frac{\beta^{(1)} + \beta^{(2)}}{2}}\,(\xi_{b}-\hat{b})\right)\right]. && (171)
\end{aligned}
$$

When the prior distribution $p({\sigma^{(1)}}^2)\, p({\sigma^{(2)}}^2)$ is uniform, the right-hand side of Equation (171) can be computed similarly to Equation (101), and the free energy, obtained by taking the negative logarithm of Equation (171), is given by:

$$
\begin{aligned}
F({\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2}) &= -\ln p({\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2} \mid Y) \\
&\sim \frac{N_{1}}{2}\ln 2\pi{\sigma^{(1)}}^{2} + \frac{N_{2}}{2}\ln 2\pi{\sigma^{(2)}}^{2} + \ln 2\xi_{a} + \ln 2\xi_{b} + E(\hat{a},\hat{b}) \\
&\quad + \frac{1}{2}\ln\frac{2(\beta^{(1)}\bar{{x^{(1)}}^{2}} + \beta^{(2)}\bar{{x^{(2)}}^{2}})}{\pi} + \frac{1}{2}\ln\frac{2(\beta^{(1)} + \beta^{(2)})}{\pi} \\
&\quad - \ln\!\left[\mathrm{erfc}\!\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}} + \beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}\,(-\xi_{a}-\hat{a})\right) - \mathrm{erfc}\!\left(\sqrt{\frac{\beta^{(1)}\bar{{x^{(1)}}^{2}} + \beta^{(2)}\bar{{x^{(2)}}^{2}}}{2}}\,(\xi_{a}-\hat{a})\right)\right] \\
&\quad - \ln\!\left[\mathrm{erfc}\!\left(\sqrt{\frac{\beta^{(1)} + \beta^{(2)}}{2}}\,(-\xi_{b}-\hat{b})\right) - \mathrm{erfc}\!\left(\sqrt{\frac{\beta^{(1)} + \beta^{(2)}}{2}}\,(\xi_{b}-\hat{b})\right)\right]. && (173)
\end{aligned}
$$

Minimizing the free energy numerically yields the values of ${\sigma^{(1)}}^2$ and ${\sigma^{(2)}}^2$ that maximize the posterior probability. The optimal noise variances ${\sigma^{(1)}}^2$ and ${\sigma^{(2)}}^2$ are obtained as

$$
\begin{aligned}
\left(\hat{\sigma}^{(1)}(\tau_{1}^{(1)},\tau_{1}^{(2)},\tau_{2}^{(1)},\tau_{2}^{(2)},\upsilon^{(1)},\upsilon^{(2)}),\; \hat{\sigma}^{(2)}(\tau_{1}^{(1)},\tau_{1}^{(2)},\tau_{2}^{(1)},\tau_{2}^{(2)},\upsilon^{(1)},\upsilon^{(2)})\right)
&= \arg\max_{(\sigma^{(1)},\sigma^{(2)})} p({\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2} \mid Y), && (174) \\
&= \arg\min_{(\sigma^{(1)},\sigma^{(2)})} F({\sigma^{(1)}}^{2}, {\sigma^{(2)}}^{2}). && (175)
\end{aligned}
$$
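As a sketch of this two-dimensional optimization, the following code (an illustration, not the authors' implementation) minimizes the free energy of Equation (173) over $({\sigma^{(1)}}^2, {\sigma^{(2)}}^2)$ in log-coordinates; it assumes $\beta^{(m)} = N_m / {\sigma^{(m)}}^2$, centered inputs so that the $a$- and $b$-integrals decouple, and the toy settings shown.

```python
# Sketch of Eqs. (173)-(175): jointly estimate the two noise variances of the
# integrated model by minimizing F(sigma1^2, sigma2^2) in log-coordinates.
# beta^(m) = N_m / sigma_m^2, centered x, and all settings are illustrative
# assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.special import erfc

rng = np.random.default_rng(3)
N1, N2, a0, b0 = 100, 100, 2.0, 0.0
true_s2 = (1.0, 2.0)                  # true noise variances of the datasets
xi_a = xi_b = 10.0

def make_data(n, s2):
    x = rng.uniform(-1, 1, n)
    x -= x.mean()                     # center so a and b decouple
    return x, a0 * x + b0 + rng.normal(0, np.sqrt(s2), n)

(x1, y1), (x2, y2) = make_data(N1, true_s2[0]), make_data(N2, true_s2[1])

def log_bracket(scale, xi, center):
    return np.log(erfc(scale * (-xi - center)) - erfc(scale * (xi - center)))

def F_integrated(log_s2):
    """F(sigma1^2, sigma2^2) of Eq. (173) with shared (a, b)."""
    s1, s2 = np.exp(log_s2)
    ca = np.sum(x1**2) / s1 + np.sum(x2**2) / s2   # beta-weighted x^2 sums
    cb = N1 / s1 + N2 / s2                         # beta^(1) + beta^(2)
    a_hat = (np.sum(x1 * y1) / s1 + np.sum(x2 * y2) / s2) / ca
    b_hat = (np.sum(y1) / s1 + np.sum(y2) / s2) / cb
    E = (np.sum((y1 - a_hat * x1 - b_hat) ** 2) / (2 * s1)
         + np.sum((y2 - a_hat * x2 - b_hat) ** 2) / (2 * s2))
    return (0.5 * N1 * np.log(2 * np.pi * s1) + 0.5 * N2 * np.log(2 * np.pi * s2)
            + np.log(2 * xi_a) + np.log(2 * xi_b) + E
            + 0.5 * np.log(2 * ca / np.pi) + 0.5 * np.log(2 * cb / np.pi)
            - log_bracket(np.sqrt(ca / 2), xi_a, a_hat)
            - log_bracket(np.sqrt(cb / 2), xi_b, b_hat))

res = minimize(F_integrated, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
print("estimated variances:", np.exp(res.x), "true:", true_s2)
```

The log-parametrization keeps the variances positive without imposing explicit bounds on the optimizer.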

B.3.2 Noise Variance Through Mesoscopic Variables in Bayesian Integration

Noise estimation numerically determines the ${\sigma^{(m)}}^2$ that minimizes Equation (173). Since the estimated noise variance depends on the six random variables $\tau_1^{(1)}, \tau_1^{(2)}, \tau_2^{(1)}, \tau_2^{(2)}, \upsilon^{(1)}, \upsilon^{(2)}$, it can be written as $\hat{\sigma}^{(m)}(\tau_1^{(1)}, \tau_1^{(2)}, \tau_2^{(1)}, \tau_2^{(2)}, \upsilon^{(1)}, \upsilon^{(2)})$. The probability distribution of the noise variance is given by

$$
\begin{aligned}
p({\sigma^{(m)}}^{2}) &= \int \mathrm{d}\tau_{1}^{(1)}\, \mathrm{d}\tau_{1}^{(2)}\, \mathrm{d}\tau_{2}^{(1)}\, \mathrm{d}\tau_{2}^{(2)}\, \mathrm{d}\upsilon^{(1)}\, \mathrm{d}\upsilon^{(2)}\;
\delta\!\left({\sigma^{(m)}}^{2} - \left(\hat{\sigma}^{(m)}(\tau_{1}^{(1)},\tau_{1}^{(2)},\tau_{2}^{(1)},\tau_{2}^{(2)},\upsilon^{(1)},\upsilon^{(2)})\right)^{2}\right) \\
&\quad \times p(\tau_{1}^{(1)})\, p(\tau_{1}^{(2)})\, p(\tau_{2}^{(1)})\, p(\tau_{2}^{(2)})\, p(\upsilon^{(1)})\, p(\upsilon^{(2)}). && (176)
\end{aligned}
$$

B.3.3 Model Selection in Bayesian Integration with Noise Estimation

The difference in the Bayesian free energy, $\Delta F$, is determined by the six stochastic variables $\tau_1^{(1)}, \tau_1^{(2)}, \tau_2^{(1)}, \tau_2^{(2)}, \upsilon^{(1)}, \upsilon^{(2)}$ and can be expressed as:

$$\Delta F = \Delta F(\tau_{1}^{(1)}, \tau_{1}^{(2)}, \tau_{2}^{(1)}, \tau_{2}^{(2)}, \upsilon^{(1)}, \upsilon^{(2)}). \qquad (177)$$

This indicates that the calculation of the difference in the Bayesian free energy incorporates these six variables, reflecting the additional complexity introduced when the noise intensity is part of the estimation process.

Consequently, the probability distribution of the difference in the Bayesian free energy can be expressed as:

$$
\begin{aligned}
p(\Delta F) &= \int \mathrm{d}\tau_{1}^{(1)}\, \mathrm{d}\tau_{1}^{(2)}\, \mathrm{d}\tau_{2}^{(1)}\, \mathrm{d}\tau_{2}^{(2)}\, \mathrm{d}\upsilon^{(1)}\, \mathrm{d}\upsilon^{(2)}\;
\delta\!\left(\Delta F - \Delta F(\tau_{1}^{(1)}, \tau_{1}^{(2)}, \tau_{2}^{(1)}, \tau_{2}^{(2)}, \upsilon^{(1)}, \upsilon^{(2)})\right) \\
&\quad \times p(\tau_{1}^{(1)})\, p(\tau_{1}^{(2)})\, p(\tau_{2}^{(1)})\, p(\tau_{2}^{(2)})\, p(\upsilon^{(1)})\, p(\upsilon^{(2)}). && (178)
\end{aligned}
$$
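A data-level Monte Carlo sketch of Equation (178) is given below: for each sampled pair of datasets, the noise variances are estimated jointly for the integrated model (Equation (173)) and independently for the separated model (playing the role of Equation (149)), and the resulting $\Delta F$ values are accumulated. The separated-model free energy, the sign convention that positive $\Delta F$ favors integration, and all settings are illustrative assumptions.

```python
# Monte Carlo sketch of Eq. (178) at the data level: sample dataset pairs,
# estimate the noise variances (jointly for the integrated model, independently
# for the separated one), and accumulate Delta F. The separated-model free
# energy, the sign convention (positive Delta F favoring integration), and all
# settings are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.special import erfc

rng = np.random.default_rng(4)
N, b0, sigma0_2 = 50, 0.0, 1.0
a0 = (2.0, 4.0)                  # dataset-wise true slopes (Fig. 19 setting)
xi_a = xi_b = 10.0

def log_bracket(scale, xi, center):
    return np.log(erfc(scale * (-xi - center)) - erfc(scale * (xi - center)))

def F_single(s2, x, y):
    """Two-parameter free energy of one dataset (Eq. (162) pattern)."""
    n = len(x)
    a_hat = np.sum(x * y) / np.sum(x**2)
    b_hat = np.mean(y)
    rss = np.sum((y - a_hat * x - b_hat) ** 2)
    x2b = np.mean(x**2)
    return (0.5 * n * np.log(2 * np.pi * s2)
            - np.log(s2 * np.pi / (2 * n)) + 0.5 * np.log(x2b)
            + np.log(2 * xi_a) + np.log(2 * xi_b) + rss / (2 * s2)
            - log_bracket(np.sqrt(n * x2b / (2 * s2)), xi_a, a_hat)
            - log_bracket(np.sqrt(n / (2 * s2)), xi_b, b_hat))

def F_integrated(log_s2, x1, y1, x2, y2):
    """Integrated-model free energy with shared (a, b) (Eq. (173) pattern)."""
    s1, s2 = np.exp(log_s2)
    ca = np.sum(x1**2) / s1 + np.sum(x2**2) / s2
    cb = len(x1) / s1 + len(x2) / s2
    a_hat = (np.sum(x1 * y1) / s1 + np.sum(x2 * y2) / s2) / ca
    b_hat = (np.sum(y1) / s1 + np.sum(y2) / s2) / cb
    E = (np.sum((y1 - a_hat * x1 - b_hat) ** 2) / (2 * s1)
         + np.sum((y2 - a_hat * x2 - b_hat) ** 2) / (2 * s2))
    return (0.5 * len(x1) * np.log(2 * np.pi * s1)
            + 0.5 * len(x2) * np.log(2 * np.pi * s2)
            + np.log(2 * xi_a) + np.log(2 * xi_b) + E
            + 0.5 * np.log(2 * ca / np.pi) + 0.5 * np.log(2 * cb / np.pi)
            - log_bracket(np.sqrt(ca / 2), xi_a, a_hat)
            - log_bracket(np.sqrt(cb / 2), xi_b, b_hat))

delta_F = []
for _ in range(500):
    data = []
    for a_true in a0:
        x = rng.uniform(-1, 1, N)
        x -= x.mean()
        data.append((x, a_true * x + b0 + rng.normal(0, np.sqrt(sigma0_2), N)))
    (x1, y1), (x2, y2) = data
    # separated model: independent 1-D noise estimation per dataset
    F_sep = sum(minimize_scalar(lambda s2, d=d: F_single(s2, *d),
                                bounds=(1e-2, 10), method="bounded").fun
                for d in data)
    # integrated model: joint 2-D noise estimation via Eq. (173)
    F_int = minimize(F_integrated, np.log([1.0, 1.0]),
                     args=(x1, y1, x2, y2), method="Nelder-Mead").fun
    delta_F.append(F_sep - F_int)
print(f"P(select integrated model): {np.mean(np.array(delta_F) > 0):.3f}")
```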

B.3.4 Numerical Experiment Including Noise Estimation: Bayesian Integration

Here, we examine how the number of data points and the noise intensity inherent in the data affect estimation, based on the results of Bayesian integration with noise estimation in the mesoscopic representation.

Figure 19 shows the frequency distribution of the difference in the Bayesian free energy, sampled from 100,000 instances with model parameters $a_0^{(1)} = 2.0$, $a_0^{(2)} = 4.0$, $\sigma_0^2 = 1.0$ (see Equation (127)). Two noise-estimation methods are compared: simultaneous estimation of the two noises from the two datasets by minimizing the free energy of Equation (173), and independent estimation of each noise from each dataset using Equation (149). The results of simultaneous estimation are shown in the upper section and those of independent estimation in the lower section. The horizontal and vertical axes represent the free-energy difference when the noise is known and when it is estimated, respectively. Figure 19 shows that as $N$ increases, the difference between the frequency distributions for estimated and known noise diminishes.

Figure 19 (panels (a)-(f)): Frequency distribution sampled from 100,000 instances on the basis of the probability distribution of the difference in the Bayesian free energy with model parameters $a_0^{(1)} = 2.0$, $a_0^{(2)} = 4.0$, $\sigma_0^2 = 1.0$ (see Equation (127)). The horizontal and vertical axes represent the free-energy difference when the noise is known and when it is estimated, respectively. The two noises are estimated either simultaneously, by minimizing the free energy of Equation (173) (upper section), or independently from each dataset, using Equation (149) (lower section).

Next, Figure 20 shows the frequency distribution obtained from 100,000 samples of the probability distribution of the difference in the Bayesian free energy, with model parameters $a_0^{(1)} = 2.0$, $a_0^{(2)} = 2.0$, $\sigma_0^2 = 1.0$ (Equation (127)). The horizontal and vertical axes represent the free-energy difference when the noise is known and when it is estimated, respectively. Two noise-estimation methods are used: one estimates the two noises simultaneously by minimizing the free energy of Equation (173) from the two datasets, and the other estimates each noise independently from each dataset using Equation (149). The results of simultaneous estimation are shown in the upper section, and those of independent estimation in the lower section. Figure 20 shows that as $N$ increases, the difference between the frequency distributions for estimated and known noise disappears.

Figure 20 (panels (a)-(f)): Frequency distribution from 100,000 samples of the probability distribution of the difference in the Bayesian free energy, given the model parameters $a_0^{(1)} = 2.0$, $a_0^{(2)} = 2.0$, $\sigma_0^2 = 1.0$ (Equation (127)). The horizontal and vertical axes represent the free-energy difference when the noise is known and when it is estimated, respectively. The two noises are estimated either simultaneously, by minimizing the free energy of Equation (173) from the two datasets (upper section), or independently from each dataset, using Equation (149) (lower section).

Finally, Figure 21 displays the probabilities of selecting the integrated model obtained from the probability distribution of the difference in the Bayesian free energy with model parameters $a_0^{(1)} = 2.0$ and $a_0^{(2)} = 4.0, 3.0, 2.0$ (Equation (127)). Here, we set $b_0^{(1)} = 0.0$, $b_0^{(2)} = 0.0$ and show the frequency distribution as a two-dimensional histogram, from 100,000 samples over the two-dimensional space of the number of data points $N$ and the noise strength $\sigma^2$. Two noise-estimation methods are used: one estimates the two noises simultaneously by minimizing the free energy of Equation (173) from the two datasets, and the other estimates each noise independently from each dataset using Equation (149). The results of simultaneous estimation are shown in the upper section, and those of independent estimation in the lower section.

From the aforementioned results, we found that the difference between simultaneously estimating two noises and estimating each noise independently becomes negligible for large data sizes. The former method involves optimization in a high-dimensional space, whereas the latter involves optimization in a one-dimensional space. Estimating multiple noises simultaneously enlarges the search space exponentially, whereas optimizing each noise independently still ensures sufficient accuracy, which is beneficial for real-world applications.

Figure 21 (panels (a)-(f)): Probabilities of selecting the separated model derived from the probability distribution of the difference in the Bayesian free energy, given model parameters $a_0^{(1)} = 2.0$ and $a_0^{(2)} = 4.0, 3.0, 2.0$ (Equation (127)). Setting $b_0^{(1)} = 0.0$, $b_0^{(2)} = 0.0$, the frequency distribution is shown as a two-dimensional histogram from 100,000 samples over the two-dimensional space of the number of data points $N$ and the data noise strength $\sigma^2$. The two noises are estimated either simultaneously, by minimizing the free energy of Equation (173) from the two datasets (upper section), or independently from each dataset, using Equation (149) (lower section).