
A Bayesian approach to multi-task learning with network lasso

Kaito Shimamura (NTT Advanced Technology Corporation, Muza Kawasaki Central Tower, 1310 Omiya-cho, Saiwai-ku, Kawasaki-shi, Kanagawa 212-0014, Japan; Graduate School of Informatics and Engineering, The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585, Japan; kaito.shimamura@ai.lab.uec.ac.jp, corresponding author) and Shuichi Kawano (Graduate School of Informatics and Engineering, The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585, Japan; skawano@ai.lab.uec.ac.jp)
Abstract

Network lasso is a method for solving a multi-task learning problem through the regularized maximum likelihood method. A characteristic of network lasso is setting a different model for each sample. The relationships among the models are represented by relational coefficients. A crucial issue in network lasso is to provide appropriate values for these relational coefficients. In this paper, we propose a Bayesian approach to solve multi-task learning problems by network lasso. This approach allows us to objectively determine the relational coefficients by Bayesian estimation. The effectiveness of the proposed method is shown in a simulation study and a real data analysis.

Key Words and Phrases: Dirichlet–Laplace distribution, Hierarchical Bayesian model, Markov chain Monte Carlo, Variable selection.

1 Introduction

One important research topic is personalization, that is, building a different prediction model for each sample. For example, in the field of personalized medicine, disease should not be predicted by a single model for all patients but rather by a different model for each patient. This issue is common to many research fields. To achieve this aim, multi-task learning (Evgeniou and Pontil, (2007); Argyriou et al., (2008); Lengerich et al., (2018)) is useful.

Hallac et al., (2015) proposed solving a multi-task learning problem through the regularized maximum likelihood method, an approach referred to as network lasso. A characteristic of network lasso is to estimate a model for each sample and group the models simultaneously. Various applications of network lasso have been reported in the literature (e.g., Ambos et al., (2018); Jung, (2020); Tran et al., (2020)). The estimation in network lasso uses relational coefficients, which are quantitative expressions of the relationships among the samples. However, no method for estimating the relational coefficients in network lasso has yet been reported.

To overcome this problem, we propose a Bayesian framework for network lasso that includes estimating the relational coefficients. We define the likelihood and the prior distribution of the regression coefficients, and consider network lasso from a Bayesian perspective. Under this framework, we assume a Dirichlet prior distribution for the relational coefficients. The model is estimated by a Markov chain Monte Carlo (MCMC) algorithm. As a result, our method provides more accurate estimation than existing methods, even if the relational coefficients of some of the samples are unknown. In addition, by using a prior distribution that induces variable selection, irrelevant features are excluded from the model. This should broaden the range of data to which the method can be applied compared with existing methods.

The remainder of the paper is organized as follows. Section 2 describes regression modeling based on network lasso. In Section 3, we propose a Bayesian approach to multi-task learning with network lasso. Section 4 focuses on related work. Section 5 compares the performance of the proposed method with the existing method by Monte Carlo simulation, and Section 6 compares the performance by real data analysis. Concluding remarks are given in Section 7.

2 Regression modeling based on network lasso

Suppose that we have observed data $\{(y_i, \bm{x}_i);\ i=1,2,\cdots,n\}$ for a response variable $y$ and $p$-dimensional predictor variables $\bm{x} = (x_1, x_2, \cdots, x_p)^T$. A linear regression model is assumed as follows:

y_i = \bm{x}_i^T \bm{w}_i + \epsilon_i, \qquad (1)

where $\bm{w}_i = (w_{i1}, w_{i2}, \cdots, w_{ip})^T\ (i=1,2,\cdots,n)$ is a $p$-dimensional regression coefficient vector and $\epsilon_i$ is an error variable distributed as $\mathrm{N}(0, \sigma^2)$, in which $\sigma^2\ (>0)$ is a variance parameter. In a general regression problem, all regression coefficients are assumed to be the same (i.e., $\bm{w}_1 = \bm{w}_2 = \cdots = \bm{w}_n$). In (1), a different regression coefficient vector $\bm{w}_i$ is assumed for each sample. This can be regarded as considering a different model for each sample.

To estimate all the models jointly, we consider the following minimization problem:

\min_{\bm{w}_1, \cdots, \bm{w}_n} \sum_{i=1}^{n} (y_i - \bm{x}_i^T \bm{w}_i)^2 + \lambda \sum_{(i_1, i_2) \in \mathcal{E}} r_{i_1 i_2} \|\bm{w}_{i_1} - \bm{w}_{i_2}\|_2, \qquad (2)

where $\lambda\ (>0)$ is a regularization parameter, $\mathcal{E}$ is a set whose elements are pairs of sample subscripts, and $r_{i_1 i_2}$ is a relational coefficient. In this minimization problem, the second term allows us to obtain $\hat{\bm{w}}_{i_1} = \hat{\bm{w}}_{i_2}$. If $\hat{\bm{w}}_{i_1} = \hat{\bm{w}}_{i_2}$, then the $i_1$th sample and the $i_2$th sample belong to the same model. The relational coefficient $r_{i_1 i_2}$ expresses the strength of the relationship between samples $(y_{i_1}, \bm{x}_{i_1})$ and $(y_{i_2}, \bm{x}_{i_2})$: the stronger the relationship is, the larger the value of $r_{i_1 i_2}$. Hallac et al., (2015) referred to the minimization problem in (2) as network lasso.
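To make the objective concrete, the following is a minimal sketch (not the authors' implementation) of evaluating (2) with numpy; `W` stacks the per-sample coefficient vectors as an $n \times p$ matrix, `edges` lists the pairs in $\mathcal{E}$, and `r` is an array holding the relational coefficients. These names are illustrative assumptions.

```python
import numpy as np

def network_lasso_objective(y, X, W, edges, r, lam):
    # first term: squared error with one coefficient vector per sample
    fit = np.sum((y - np.einsum("ij,ij->i", X, W)) ** 2)
    # second term: weighted (non-squared) L2 differences over the edge set E
    penalty = sum(r[i1, i2] * np.linalg.norm(W[i1] - W[i2]) for i1, i2 in edges)
    return fit + lam * penalty
```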

Here we focus on the relational coefficients $r_{i_1 i_2}$. The more accurately the relational coefficients are set, the better the estimation accuracy of network lasso is. In network lasso, the relational coefficients are determined independently of (2). For example, if the samples have location information, then the relational coefficients can be defined by the distances between the samples. On the other hand, if the data do not contain such information, determining the relational coefficients becomes difficult. Therefore, we propose an approach based on Bayesian sparse modeling. Adopting this approach allows us to estimate the model and the relational coefficients simultaneously.

3 Proposed method

In this section, we introduce multi-task learning using network lasso from the viewpoint of Bayesian sparse modeling. Bayesian sparse modeling, as represented by Bayesian lasso (Park and Casella, 2008), has advantages such as quantifying posterior uncertainty and estimating hyperparameters. The proposed method overcomes the problem described in the previous section.

3.1 Model

We consider the following likelihood and prior distribution for the regression coefficients:

f(\bm{y} \mid X, W, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \bm{w}_i^T \bm{x}_i)^2\right\},
\pi(W \mid \lambda_1, \lambda_2, R, \sigma^2) \propto \prod_{(i_1,i_2)\in\mathcal{E}} \frac{\lambda_1 r_{i_1 i_2}}{\sqrt{\sigma^2}} \exp\left\{-\frac{\lambda_1 r_{i_1 i_2}}{\sqrt{\sigma^2}} \|\bm{w}_{i_1} - \bm{w}_{i_2}\|_2\right\}
\quad \times \prod_{i=1}^{n} \left(\frac{\lambda_2}{\sqrt{\sigma^2}}\right)^{p} \exp\left\{-\frac{\lambda_2}{\sqrt{\sigma^2}} \|\bm{w}_i\|_1\right\}, \qquad (3)

where $\bm{y} = (y_1, y_2, \cdots, y_n)^T$ is the $n$-dimensional response vector, $\lambda_1\ (>0)$ and $\lambda_2\ (>0)$ are hyperparameters, $X = (\bm{x}_1, \bm{x}_2, \cdots, \bm{x}_n)^T$ is the $n \times p$ design matrix, and $W = (\bm{w}_1, \bm{w}_2, \cdots, \bm{w}_n)^T$ is the $n \times p$ regression coefficient matrix. The prior distribution (3) simultaneously plays a role in the grouping of individuals and in variable selection: the hyperparameter $\lambda_1$ controls the degree of individual grouping and $\lambda_2$ controls the degree of variable selection. Furthermore, we define an $n \times n$ relational matrix $R$ whose $(i_1, i_2)$-th element is the relational coefficient $r_{i_1 i_2}$. In particular, we assume that all diagonal elements of $R$ are zero.

Next, we assume the following prior distribution for the relational coefficients:

\{r_{i_1 i_2}^{-1}\} \sim \mathrm{Dir}(\alpha, \cdots, \alpha),

where $\mathrm{Dir}(\cdot)$ represents the Dirichlet distribution and $\alpha\ (>0)$ is a hyperparameter. Note that we could also assume independent gamma distributions for $r_{i_1 i_2}^{-1}$. However, assuming independent prior distributions for parameters with relative measures, such as relational coefficients, empirically reduces the predictive accuracy of the model. Therefore, we use the Dirichlet distribution, which takes the joint distribution of the relational coefficients into account. We also assume the following prior distribution for $\lambda_1$:

\lambda_1^{-1} \sim \mathrm{Ga}(\alpha\#\mathcal{E},\ 1/2),

where $\#\mathcal{E}$ is the number of elements in the set $\mathcal{E}$ and $\mathrm{Ga}(\cdot,\cdot)$ represents the gamma distribution.

Thus, the tuning parameters in the proposed method are $\alpha$ and $\lambda_2$. The parameter $\alpha$ controls the grouping of individuals. When $\alpha$ is small, $\lambda_1$ tends to be large and the variance of $\{r_{i_1 i_2}\}$ is large. In other words, the smaller $\alpha$ is, the more of the differences $\|\bm{w}_{i_1} - \bm{w}_{i_2}\|_2$ are estimated to be zero.

3.2 Estimation by MCMC

We use Gibbs sampling to estimate the proposed model. Since it is difficult to obtain the full conditional distributions directly from the prior distribution in (3), we derive them by rewriting the prior distribution. Using scale mixtures of normal distributions (Andrews and Mallows, 1974), we obtain the following transformation:

\pi(W, \{r_{i_1 i_2}\}, \lambda_1, \sigma^2 \mid \alpha, \lambda_2)
\propto \prod_{(i_1,i_2)\in\mathcal{E}} \frac{\lambda_1 r_{i_1 i_2}}{\sqrt{\sigma^2}} \exp\left\{-\frac{\lambda_1 r_{i_1 i_2}}{\sqrt{\sigma^2}} \|\bm{w}_{i_1} - \bm{w}_{i_2}\|_2\right\}
\quad \times \prod_{i=1}^{n}\prod_{j=1}^{p} \frac{\lambda_2}{\sqrt{\sigma^2}} \exp\left\{-\frac{\lambda_2}{\sqrt{\sigma^2}} |w_{ij}|\right\}
\quad \times \lambda_1^{-(\alpha\#\mathcal{E}-1)} \exp\left\{-\frac{1}{2\lambda_1}\right\} \pi(\sigma^2) \prod_{(i_1,i_2)\in\mathcal{E}} r_{i_1 i_2}^{-(\alpha-1)}
\propto \prod_{(i_1,i_2)\in\mathcal{E}} \frac{\lambda_1 r_{i_1 i_2}}{\sqrt{\sigma^2}} \int_{0}^{\infty} g_1(\bm{w}_{i_1}, \bm{w}_{i_2}, r_{i_1 i_2}, \lambda_1, \tau_{i_1 i_2}, \sigma^2)\, d\tau_{i_1 i_2}
\quad \times \prod_{i=1}^{n}\prod_{j=1}^{p} \frac{\lambda_2^2}{\sqrt{\sigma^2}} \int_{0}^{\infty} g_2(w_{ij}, \widetilde{\tau}_{ij}, \sigma^2, \lambda_2)\, d\widetilde{\tau}_{ij}
\quad \times \lambda_1^{-(\alpha\#\mathcal{E}-1)} \exp\left\{-\frac{1}{2\lambda_1}\right\} \pi(\sigma^2) \prod_{(i_1,i_2)\in\mathcal{E}} r_{i_1 i_2}^{-(\alpha-1)},

where

g_1(\bm{w}_{i_1}, \bm{w}_{i_2}, r_{i_1 i_2}, \lambda_1, \tau_{i_1 i_2}, \sigma^2) = \tau_{i_1 i_2}^{-1/2} \exp\left\{-\frac{\lambda_1^2 r_{i_1 i_2}^2}{2\tau_{i_1 i_2}\sigma^2} \|\bm{w}_{i_1} - \bm{w}_{i_2}\|_2^2\right\} \exp\left\{-\frac{1}{2}\tau_{i_1 i_2}\right\},
g_2(w_{ij}, \widetilde{\tau}_{ij}, \sigma^2, \lambda_2) = \widetilde{\tau}_{ij}^{-1/2} \exp\left\{-\frac{1}{2\widetilde{\tau}_{ij}\sigma^2} w_{ij}^2\right\} \exp\left\{-\frac{\lambda_2^2}{2}\widetilde{\tau}_{ij}\right\}.

This transformation allows us to rewrite the prior distribution as follows:

\pi(W \mid \{1/r_{i_1 i_2}\}, 1/\lambda_1, \{\tau_{i_1 i_2}^{-1}\}, \{\widetilde{\tau}_{ij}^{-1}\}, \sigma^2) \propto \prod_{(i_1,i_2)\in\mathcal{E}} \mathrm{N}(\|\bm{w}_{i_1} - \bm{w}_{i_2}\|_2 \mid 0,\ \tau_{i_1 i_2}\sigma^2/\lambda_1^2 r_{i_1 i_2}^2) \times \prod_{i=1}^{n}\prod_{j=1}^{p} \mathrm{N}(w_{ij} \mid 0,\ \widetilde{\tau}_{ij}\sigma^2), \qquad (4)
\{r_{i_1 i_2}^{-1}\} \sim \mathrm{Dir}(\alpha, \cdots, \alpha),
\lambda_1^{-1} \sim \mathrm{Ga}(\alpha\#\mathcal{E},\ 1/2),
\tau_{i_1 i_2} \sim \mathrm{Exp}(1/2),
\widetilde{\tau}_{ij} \sim \mathrm{Exp}(\lambda_2^2/2),
\sigma^2 \sim \mathrm{IG}(\nu_0/2,\ \eta_0/2),

where $\mathrm{Exp}(\cdot)$ represents the exponential distribution and $\mathrm{IG}(\cdot,\cdot)$ represents the inverse-gamma distribution. The inverse-gamma probability density function is given by

\mathrm{IG}(x \mid \nu, \eta) = \frac{\eta^{\nu}}{\Gamma(\nu)} x^{-(\nu+1)} \exp\left(-\frac{\eta}{x}\right), \qquad (5)

where $\nu\ (>0)$ is a shape parameter, $\eta\ (>0)$ is a scale parameter, and $\Gamma(\cdot)$ is the gamma function. This representation allows us to treat $\tau_{i_1 i_2}$ and $\widetilde{\tau}_{ij}$ as new parameters in addition to $W$, $\{r_{i_1 i_2}\}$, $\lambda_1$, and $\sigma^2$. By using these prior distributions, the posterior distribution can be estimated by Gibbs sampling. The details of the hierarchical representation and the sampling algorithm are given in Appendix A.
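As a quick illustration of the scale-mixture identity underlying this rewriting, the following sketch (illustrative only, with an arbitrary value of $\lambda_2$) draws $\widetilde{\tau}_{ij} \sim \mathrm{Exp}(\lambda_2^2/2)$ and then $w_{ij} \mid \widetilde{\tau}_{ij} \sim \mathrm{N}(0, \widetilde{\tau}_{ij})$, and checks numerically that the marginal draws behave like $\mathrm{Laplace}(\lambda_2)$ draws.

```python
import numpy as np

rng = np.random.default_rng(1)
lam2 = 1.5
# scale mixture of normals: tau_tilde ~ Exp(rate = lam2^2 / 2), w | tau_tilde ~ N(0, tau_tilde)
tau_tilde = rng.exponential(scale=2.0 / lam2**2, size=100_000)
w_mixture = rng.normal(0.0, np.sqrt(tau_tilde))
# direct Laplace draws with the same rate lam2
w_laplace = rng.laplace(0.0, 1.0 / lam2, size=100_000)
print(np.var(w_mixture), np.var(w_laplace))  # both close to 2 / lam2^2
```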

We note that the prior distribution in (4) is identical to the Dirichlet–Laplace prior distribution (Bhattacharya et al., 2015). The prior distributions in (4) are expressed as follows:

\pi(W \mid \sigma^2) \propto \prod_{(i_1,i_2)\in\mathcal{E}} \frac{1}{\sqrt{\sigma^2}} \mathrm{DL}(\|\bm{w}_{i_1} - \bm{w}_{i_2}\|_2/\sqrt{\sigma^2} \mid \alpha) \times \prod_{i=1}^{n}\prod_{j=1}^{p} \frac{1}{\sqrt{\sigma^2}} \mathrm{Laplace}(w_{ij}/\sqrt{\sigma^2} \mid \lambda_2),
\sigma^2 \sim \mathrm{IG}(\nu_0/2,\ \eta_0/2),

where $\mathrm{DL}(\cdot)$ represents the Dirichlet–Laplace prior distribution. Based on this expression, the proposed model can be regarded as applying the Dirichlet–Laplace prior to the grouping of individuals and the Laplace distribution to variable selection. The details of the Dirichlet–Laplace prior distribution are described in Appendix B.

The Dirichlet–Laplace prior distribution is a type of global–local prior distribution (Polson and Scott, 2010). The important features of the global–local prior distribution include that it has a peak at the origin and that it has heavy tails. These features allow us to perform more robust estimation than using the Laplace distribution. Various global–local priors have been proposed: e.g., the normal-exponential-gamma distribution (Griffin and Brown, 2005), the normal-gamma distribution (Brown and Griffin, 2010), and the horseshoe distribution (Carvalho et al., 2010).

4 Related work

Here we summarize research related to the proposed method. First, we describe localized lasso (Yamada et al., 2017), which performs grouping of individuals and variable selection simultaneously. Next, we describe another Bayesian approach to network lasso.

4.1 Localized lasso

The localized lasso proposed by Yamada et al., (2017) considers the following minimization problem:

\min_{\bm{w}_1, \cdots, \bm{w}_n} \sum_{i=1}^{n} (y_i - \bm{x}_i^T \bm{w}_i)^2 + \lambda_1 \sum_{(i_1,i_2)\in\mathcal{E}} r_{i_1 i_2} \|\bm{w}_{i_1} - \bm{w}_{i_2}\|_2 + \lambda_2 \sum_{i=1}^{n} \|\bm{w}_i\|_1^2.

This minimization is based on network lasso and performs grouping of individuals and variable selection. In addition, Yamada et al., (2017) proposed an efficient iterative algorithm for localized lasso. This algorithm converges to the global optimal solution without the need to tune the parameters $\lambda_1$ and $\lambda_2$.

Our proposed method uses the $L_1$ norm, while localized lasso uses the $L_{1,2}$ norm. Because the $L_{1,2}$ norm is used, at least one of the regression coefficients for each sample is always non-zero in localized lasso. Since our proposed method employs the $L_1$ norm, $\bm{w}_i = \bm{0}$ may arise. To avoid this case, we adjust the hyperparameter $\lambda_2$ so that $\bm{w}_i \neq \bm{0}$.

Localized lasso requires the relational coefficients to be given. That is, if the relational coefficients are unknown, they must be determined from the data in advance. This means that the performance of localized lasso depends on the method used to determine the relational coefficients.

4.2 Network lasso in a Bayesian framework

Rad et al., (2017) proposed using the following likelihood and prior distribution:

f(\bm{y} \mid X, W, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \bm{w}_i^T \bm{x}_i)^2\right\}, \qquad (6)
\pi(W \mid \lambda, \sigma^2) \propto \prod_{(i_1,i_2)\in\mathcal{E}} \frac{\lambda}{\sqrt{\sigma^2}} \exp\left\{-\frac{\lambda}{\sqrt{\sigma^2}} \|\bm{w}_{i_1} - \bm{w}_{i_2}\|_2\right\}.

Unlike the prior distribution in (4), this prior distribution does not include the relational coefficients $r_{i_1 i_2}$. Moreover, this prior does not allow variable selection. Gibbs sampling is used to estimate the posterior distribution. Rad et al., (2017) pointed out that the MAP estimates of the regression coefficients given by (6) are similar to those from network lasso.

Rad et al., (2017) also showed that the alternating direction method of multipliers (ADMM) allows fast computation of the regression coefficient estimates, but pointed out that using ADMM sacrifices the quantification of posterior uncertainty. With posterior uncertainty quantified, the estimates can be evaluated from the shape of the posterior distribution rather than from point estimates alone.

Rad et al., (2017) assumed that the target data form an evenly spaced time series or have lattice position information, and determined $\mathcal{E}$ based on these assumptions. The method can therefore be applied only to data that satisfy these prerequisites.

5 Simulation study

We demonstrated our proposed method with artificial data. We simulated data $(y_i, \bm{x}_i)\ (i=1,2,\cdots,n)$ from the following model:

y_i = \bm{x}_i^T \bm{w}_{k,i}^{*} + \epsilon_i, \qquad (i=1,2,\cdots,n),

where $\bm{w}_{k,i}^{*}$ is the $p$-dimensional true regression coefficient vector and $\epsilon_i$ is an error distributed as $\mathrm{N}(0,1)$. In addition, $\bm{x}_i\ (i=1,2,\cdots,n)$ is generated from a uniform distribution on $[-1,1]$. We considered several settings: $n=30,\ 120$ and $p=5,\ 10$. When $p=10$, the features consist of five irrelevant features and five relevant features. The data have three groups, and the sample size for each group is the same. The true regression coefficients were set as follows:

  • When $p=5$,

    \bm{w}_{1,i}^{*} = (5.0,\ 1.0,\ -1.0,\ 0.0,\ 0.0)^T,
    \bm{w}_{2,i}^{*} = (0.0,\ 1.0,\ -5.0,\ 1.0,\ 0.0)^T,
    \bm{w}_{3,i}^{*} = (0.0,\ 0.0,\ 0.0,\ 0.5,\ -0.5)^T.

  • When $p=10$,

    \bm{w}_{1,i}^{*} = (5.0,\ 1.0,\ -1.0,\ 0.0,\ 0.0,\ \bm{0}_5^T)^T,
    \bm{w}_{2,i}^{*} = (0.0,\ 1.0,\ -5.0,\ 1.0,\ 0.0,\ \bm{0}_5^T)^T,
    \bm{w}_{3,i}^{*} = (0.0,\ 0.0,\ 0.0,\ 0.5,\ -0.5,\ \bm{0}_5^T)^T.

The subscript set $\mathcal{E}$ was also generated by random numbers. Let $\mathcal{F}$ be the set of subscript pairs $(i,j)$ such that $\bm{x}_i$ and $\bm{x}_j$ belong to the same group, and let $\mathcal{G}$ be the set of subscript pairs $(i,j)$ such that $\bm{x}_i$ and $\bm{x}_j$ belong to different groups. The subscript set $\mathcal{E}$ was given by $\mathcal{E} = f \cup g$, where $f \subset \mathcal{F}$ and $g \subset \mathcal{G}$. We assume that $\#f = \mathrm{TR} \times \#\mathcal{F}$ and $\#g = \mathrm{FR} \times \#\mathcal{G}$, where TR and FR are constants between zero and one. A sketch of this construction is given below.
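The following is a hedged sketch of how such an edge set could be drawn; `groups` assigns each sample to one of the three groups, and the function name and random-number handling are illustrative assumptions rather than the authors' code.

```python
import itertools
import numpy as np

def sample_edge_set(groups, TR, FR, rng=np.random.default_rng(0)):
    # split all pairs into within-group (F) and between-group (G) pairs
    same, diff = [], []
    for i, j in itertools.combinations(range(len(groups)), 2):
        (same if groups[i] == groups[j] else diff).append((i, j))
    # keep a fraction TR of F and a fraction FR of G, chosen at random
    f = rng.choice(len(same), size=int(TR * len(same)), replace=False)
    g = rng.choice(len(diff), size=int(FR * len(diff)), replace=False)
    return [same[k] for k in f] + [diff[k] for k in g]
```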

To evaluate estimation accuracy, we used the mean squared error (MSE) given by

\mathrm{MSE} = \mathrm{E}\left[(\bm{y}^{*} - \hat{\bm{y}})^T (\bm{y}^{*} - \hat{\bm{y}})\right] = \sum_{i=1}^{n} (\bm{w}_{k,i}^{*} - \hat{\bm{w}}_i)^T \Sigma_i (\bm{w}_{k,i}^{*} - \hat{\bm{w}}_i),

where

\bm{y}^{*} = (y_1^{*}, y_2^{*}, \cdots, y_n^{*})^T, \quad y_i^{*} = \bm{w}_{k,i}^{*T} \bm{x}_i,
\hat{\bm{y}} = (\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_n)^T, \quad \hat{y}_i = \hat{\bm{w}}_i^T \bm{x}_i,

$\hat{\bm{w}}_i$ is an estimate of $\bm{w}_i$, and $\Sigma_i$ is the variance-covariance matrix of $\bm{x}_i$.
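A small sketch of this criterion is given below, assuming the true and estimated coefficients are stacked as $n \times p$ matrices and that a common covariance $\Sigma_i = \Sigma$ is used (for uniform draws on $[-1,1]$, $\Sigma = (1/3) I_p$); the function is illustrative only.

```python
import numpy as np

def mse(W_true, W_hat, Sigma):
    # sum_i (w*_i - w_hat_i)^T Sigma (w*_i - w_hat_i)
    diffs = W_true - W_hat
    return sum(d @ Sigma @ d for d in diffs)

# example with p = 5 and x_i uniform on [-1, 1], so Sigma = (1/3) * I_p
# value = mse(W_true, W_hat, np.eye(5) / 3.0)
```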

The dataset was generated 50 times. We computed the mean and standard deviation of the MSE over the 50 repetitions. For each generated dataset, the estimates were obtained using 50,000 iterations of a Gibbs sampler. Candidate values of the hyperparameters were set as $10^{l-i}\ (i=1,2,\cdots,m)$. To make a fair comparison, the results were compared using the hyperparameter that minimized the MSE.

The existing method to be compared was localized lasso (LL). We also compared a method that assumes fixed values for $r_{i_1 i_2}$ and $\lambda_1$ instead of assuming prior distributions. We refer to this method as the Bayesian approach to multi-task learning with network lasso (BMN). The proposed method is referred to as the Bayesian approach to multi-task learning with network lasso using the Dirichlet–Laplace distribution (DLBMN). By comparing BMN and DLBMN, we can examine whether assuming a prior distribution for the hyperparameters increases the accuracy of the estimation.

Tables 1 to 4 show the means and standard deviations of the MSEs. Not surprisingly, all tables show that the larger TR is and the smaller FR is, the smaller the MSE is. Also, for all tables, regardless of TR, the increase in MSE with increasing FR is smaller for DLBMN than for LL and BMN. From the results for $n=30$ and $p=5$ in Table 1, when $\mathrm{TR}=0.2$, the MSE is large regardless of FR. The results for $n=30$ and $p=10$ in Table 2 also show that the MSE is large regardless of FR for $\mathrm{TR}=0.2$. When FR is greater than $0.1$, LL and BMN have MSEs greater than 20, while the MSE of DLBMN is very small. In the results for $n=120$ in Tables 3 and 4, the MSE is small when $\mathrm{TR}=0.2$ and $\mathrm{FR}=0.01$, unlike the results for $n=30$. In the results for $n=120$ and $p=5$, unlike the other results, DLBMN has a small MSE even with $\mathrm{TR}=1.0$ and $\mathrm{FR}=0.4$. DLBMN has a smaller MSE than LL and BMN in most of the results. In particular, the number of results such that the MSE was less than $1.0$ was 34 for DLBMN, compared to 19 for LL and 13 for BMN.

Table 1: Means and standard deviations of MSEs in simulation study with n=30n=30 and p=5p=5.
LL BMN DLBMN
TR=0.20 FR=0.01 30.96 [17.85] 31.21 [16.05] 29.39 [16.11]
FR=0.10 89.26 [21.85] 82.29 [17.33] 79.13 [18.66]
FR=0.20 102.52 [13.90] 97.41 [13.55] 95.83 [14.73]
FR=0.30 105.98 [11.14] 101.82 [12.01] 99.85 [13.13]
FR=0.40 108.18 [11.24] 104.91 [11.40] 102.96 [12.60]
FR=1.00 109.36 [10.71] 107.36 [11.49] 105.30 [11.36]
TR=0.40 FR=0.01 1.99 [3.29] 2.25 [3.45] 2.10 [3.41]
FR=0.10 22.36 [14.65] 40.98 [18.30] 14.34 [16.57]
FR=0.20 67.87 [17.34] 74.74 [18.86] 68.57 [24.93]
FR=0.30 85.66 [15.42] 86.50 [16.06] 84.29 [17.67]
FR=0.40 94.31 [14.44] 93.38 [14.67] 91.41 [15.14]
FR=1.00 106.79 [11.50] 103.66 [12.05] 102.14 [11.81]
TR=0.60 FR=0.01 0.31 [0.18] 0.36 [0.28] 0.38 [0.26]
FR=0.10 3.37 [3.94] 16.31 [13.09] 0.88 [1.48]
FR=0.20 34.01 [21.75] 53.62 [20.83] 22.31 [30.43]
FR=0.30 64.28 [19.06] 72.42 [18.57] 57.60 [34.38]
FR=0.40 81.65 [16.92] 82.68 [17.09] 80.49 [20.15]
FR=1.00 103.89 [11.90] 100.75 [13.26] 99.90 [13.33]
TR=0.80 FR=0.01 0.28 [0.16] 0.30 [0.20] 0.32 [0.22]
FR=0.10 0.78 [0.75] 5.25 [6.01] 0.38 [0.27]
FR=0.20 12.51 [11.42] 32.57 [18.45] 3.35 [9.70]
FR=0.30 39.92 [21.39] 58.10 [20.97] 23.20 [33.60]
FR=0.40 65.45 [19.54] 72.60 [18.84] 61.29 [36.77]
FR=1.00 101.05 [13.13] 96.81 [14.36] 96.90 [13.88]
TR=1.00 FR=0.01 0.28 [0.16] 0.30 [0.21] 0.28 [0.19]
FR=0.10 0.46 [0.29] 2.56 [3.76] 0.31 [0.20]
FR=0.20 3.96 [5.29] 18.07 [15.71] 0.36 [0.23]
FR=0.30 19.81 [17.85] 40.87 [21.39] 10.65 [25.81]
FR=0.40 43.95 [21.58] 59.26 [20.16] 36.48 [37.36]
FR=1.00 96.15 [14.20] 92.77 [14.24] 94.04 [14.58]
Table 2: Means and standard deviations of MSEs in simulation study with n=30n=30 and p=10p=10.
LL BMN DLBMN
TR=0.20 FR=0.01 85.60 [29.38] 53.85 [21.58] 51.84 [19.60]
FR=0.10 130.96 [17.11] 110.48 [18.36] 110.13 [19.39]
FR=0.20 131.11 [13.38] 120.50 [14.54] 118.54 [15.13]
FR=0.30 130.16 [13.31] 123.04 [12.56] 120.74 [13.29]
FR=0.40 129.58 [12.84] 124.71 [12.05] 122.72 [12.92]
FR=1.00 129.35 [13.34] 124.70 [12.55] 122.94 [12.69]
TR=0.40 FR=0.01 20.17 [21.40] 8.32 [8.67] 5.31 [6.81]
FR=0.10 93.14 [27.10] 79.07 [19.33] 60.55 [26.00]
FR=0.20 122.32 [17.46] 108.41 [15.80] 104.73 [17.28]
FR=0.30 126.18 [15.40] 116.53 [14.15] 114.26 [14.94]
FR=0.40 127.55 [14.47] 119.65 [12.89] 117.21 [13.87]
FR=1.00 129.21 [13.50] 123.42 [12.53] 122.49 [13.44]
TR=0.60 FR=0.01 12.02 [15.26] 2.35 [4.91] 1.74 [3.80]
FR=0.10 48.17 [24.13] 53.31 [21.78] 12.69 [16.04]
FR=0.20 104.08 [23.43] 94.27 [18.32] 75.69 [32.33]
FR=0.30 121.88 [17.22] 108.57 [15.08] 104.34 [17.67]
FR=0.40 126.76 [15.00] 114.92 [13.54] 112.10 [15.23]
FR=1.00 128.86 [13.59] 121.33 [12.72] 120.82 [13.74]
TR=0.80 FR=0.01 9.33 [14.75] 1.87 [5.95] 1.33 [3.06]
FR=0.10 28.75 [20.15] 33.04 [20.23] 3.98 [8.67]
FR=0.20 77.24 [29.98] 77.41 [21.23] 30.33 [32.15]
FR=0.30 111.00 [21.73] 98.60 [17.72] 86.84 [34.78]
FR=0.40 122.71 [16.16] 108.14 [15.38] 107.07 [16.76]
FR=1.00 128.69 [13.35] 119.45 [13.19] 119.67 [13.73]
TR=1.00 FR=0.01 7.93 [8.08] 1.43 [2.08] 0.78 [0.88]
FR=0.10 21.34 [13.84] 21.67 [15.33] 2.10 [4.32]
FR=0.20 57.37 [29.16] 59.66 [23.64] 10.07 [15.18]
FR=0.30 94.69 [26.93] 86.22 [20.60] 52.09 [42.84]
FR=0.40 116.11 [20.41] 100.84 [18.46] 101.56 [23.35]
FR=1.00 128.33 [13.74] 116.58 [14.23] 118.79 [14.50]
Table 3: Means and standard deviations of MSEs in simulation study with n=120n=120 and p=5p=5.
LL BMN DLBMN
TR=0.20 FR=0.01 0.93 [2.99] 1.03 [2.89] 0.77 [2.84]
FR=0.10 201.14 [32.03] 222.27 [27.69] 33.10 [64.60]
FR=0.20 318.97 [19.16] 319.84 [19.31] 320.61 [19.48]
FR=0.30 348.21 [16.70] 347.98 [17.16] 347.75 [17.00]
FR=0.40 361.51 [15.99] 360.85 [17.12] 360.53 [16.47]
FR=1.00 380.77 [14.87] 381.24 [15.29] 380.17 [15.98]
TR=0.40 FR=0.01 0.22 [0.09] 0.24 [0.11] 0.17 [0.08]
FR=0.10 2.53 [1.38] 13.22 [9.58] 0.31 [0.12]
FR=0.20 200.63 [35.25] 220.72 [31.88] 6.21 [37.70]
FR=0.30 290.24 [21.22] 293.01 [21.31] 301.01 [21.26]
FR=0.40 321.41 [17.83] 321.51 [18.53] 326.63 [18.65]
FR=1.00 367.69 [15.22] 367.48 [15.63] 366.98 [16.57]
TR=0.60 FR=0.01 0.18 [0.08] 0.21 [0.10] 0.23 [0.09]
FR=0.10 0.86 [0.26] 1.53 [0.53] 0.21 [0.09]
FR=0.20 22.10 [21.29] 61.46 [35.50] 0.25 [0.13]
FR=0.30 200.57 [36.42] 218.47 [32.46] 11.02 [52.97]
FR=0.40 273.53 [23.65] 278.57 [22.96] 288.15 [47.29]
FR=1.00 353.08 [16.13] 353.43 [16.33] 355.18 [17.32]
TR=0.80 FR=0.01 0.17 [0.08] 0.21 [0.11] 0.21 [0.09]
FR=0.10 0.61 [0.18] 0.78 [0.23] 0.19 [0.08]
FR=0.20 2.17 [0.96] 6.55 [4.56] 0.20 [0.10]
FR=0.30 60.41 [38.97] 102.47 [43.17] 0.25 [0.14]
FR=0.40 198.08 [37.27] 215.65 [33.84] 44.49 [99.13]
FR=1.00 338.08 [17.27] 339.14 [17.21] 344.62 [18.09]
TR=1.00 FR=0.01 0.17 [0.08] 0.21 [0.11] 0.21 [0.11]
FR=0.10 0.50 [0.15] 0.55 [0.17] 0.18 [0.08]
FR=0.20 1.13 [0.38] 2.03 [0.77] 0.20 [0.10]
FR=0.30 5.91 [4.57] 20.82 [16.34] 0.23 [0.12]
FR=0.40 89.68 [42.72] 126.74 [43.54] 0.28 [0.16]
FR=1.00 322.05 [18.44] 324.17 [18.42] 335.27 [18.19]
Table 4: Means and standard deviations of MSEs in simulation study with n=120n=120 and p=10p=10.
LL BMN DLBMN
TR=0.20 FR=0.01 0.87 [1.35] 1.10 [1.51] 0.60 [1.09]
FR=0.10 361.14 [26.82] 353.15 [22.02] 355.39 [22.64]
FR=0.20 411.92 [22.00] 404.21 [19.47] 402.18 [19.61]
FR=0.30 425.32 [20.88] 420.40 [19.48] 416.43 [19.86]
FR=0.40 431.38 [20.26] 427.08 [18.53] 423.01 [19.69]
FR=1.00 438.29 [18.46] 430.65 [16.93] 431.33 [18.02]
TR=0.40 FR=0.01 0.45 [0.15] 0.38 [0.20] 0.33 [0.11]
FR=0.10 126.75 [60.35] 177.94 [48.44] 0.41 [0.16]
FR=0.20 362.05 [26.46] 359.88 [22.05] 365.44 [21.27]
FR=0.30 396.84 [22.85] 391.57 [18.87] 394.19 [19.66]
FR=0.40 410.57 [22.19] 403.36 [18.37] 405.67 [19.89]
FR=1.00 432.03 [19.76] 423.16 [16.56] 424.74 [18.32]
TR=0.60 FR=0.01 0.40 [0.13] 0.34 [0.15] 0.34 [0.11]
FR=0.10 4.32 [3.59] 15.55 [11.60] 0.37 [0.13]
FR=0.20 276.53 [40.62] 284.20 [31.54] 7.57 [49.54]
FR=0.30 363.08 [25.91] 356.69 [21.10] 371.72 [21.68]
FR=0.40 390.34 [23.20] 380.98 [18.92] 391.74 [20.32]
FR=1.00 426.53 [20.83] 416.03 [16.99] 419.09 [18.76]
TR=0.80 FR=0.01 0.38 [0.13] 0.32 [0.12] 0.34 [0.10]
FR=0.10 1.47 [0.69] 2.54 [1.26] 0.37 [0.12]
FR=0.20 112.35 [58.22] 150.19 [47.83] 0.43 [0.17]
FR=0.30 311.25 [33.57] 309.14 [27.34] 198.92 [176.32]
FR=0.40 363.11 [25.56] 355.28 [21.15] 377.08 [21.96]
FR=1.00 419.47 [21.53] 409.90 [17.67] 415.17 [19.82]
TR=1.00 FR=0.01 0.37 [0.13] 0.31 [0.11] 0.34 [0.10]
FR=0.10 0.99 [0.43] 1.17 [0.44] 0.36 [0.11]
FR=0.20 17.18 [17.23] 36.70 [26.78] 0.41 [0.13]
FR=0.30 227.63 [49.11] 237.62 [38.73] 15.23 [72.67]
FR=0.40 326.41 [29.59] 321.66 [23.86] 332.25 [100.64]
FR=1.00 411.29 [21.01] 401.90 [16.85] 410.47 [18.92]

6 Application

We applied our proposed method to a real dataset comprising Greater Sacramento area data (http://support.spatialkey.com/spatialkey-sample-csv-data/). This dataset was used by Hallac et al., (2015). It is a listing of real estate transactions for one week of May 2008 and contains 985 samples of sales information, including latitude, longitude, number of bedrooms, number of bathrooms, square footage, and sales price. We considered a model that predicts the selling price from the number of bedrooms, the number of bathrooms, and the square footage. The data have some missing values, with 17% of the samples missing at least one of the above three features. We used the remaining 814 samples after excluding those with missing values. We determined $\mathcal{E}$ using the latitude and longitude coordinates of each house: $\mathcal{E}$ is the set of subscript pairs $(i,j)$ such that the $j$-th house is among the five houses closest to the $i$-th house $(i=1,2,\cdots,n)$ after removing the test set.
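A sketch of this construction is shown below, connecting each house to its five nearest neighbours by Euclidean distance in the (latitude, longitude) plane; scikit-learn's NearestNeighbors is one convenient tool for this step, although the paper does not specify the implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_edge_set(coords, k=5):
    # coords: an (n, 2) array of (latitude, longitude) pairs
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)  # +1: each point is its own nearest neighbour
    _, idx = nn.kneighbors(coords)
    return {(i, int(j)) for i, row in enumerate(idx) for j in row[1:]}
```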

We considered the same three methods as in Section 5: LL, BMN, and DLBMN. The methods were compared using five-fold cross-validation. The accuracy of each method was evaluated using the prediction squared error (PSE). The PSE corresponding to the $k$-th fold of the cross-validation is given by

\mathrm{PSE} = \frac{1}{n^{(k)}} \|\widetilde{\bm{y}}^{(k)} - \widetilde{X}^{(k)} \hat{\bm{\beta}}^{(k)}\|_2^2, \qquad (k=1,2,\cdots,5),

where $\widetilde{\bm{y}}^{(k)}$ and $\widetilde{X}^{(k)}$ are the test data in the cross-validation, $n^{(k)}$ is the number of test samples, and $\hat{\bm{\beta}}^{(k)}$ is the vector of regression coefficients estimated from the data other than the test data. We used the value of the hyperparameter that minimized the PSE.
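For clarity, a minimal sketch of this criterion is given below, assuming the test responses, test design matrix, and estimated coefficient vector for the fold are available as numpy arrays; the function is illustrative, not the authors' code.

```python
import numpy as np

def pse(y_test, X_test, beta_hat):
    # (1 / n^(k)) * || y_tilde - X_tilde beta_hat ||_2^2
    resid = y_test - X_test @ beta_hat
    return resid @ resid / len(y_test)
```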

All methods had a mean PSE of 0.37 with a standard deviation of 0.09. We defined $\mathcal{E}$ by the distances between houses, and the locations of properties are assumed to be highly related to the grouping of individuals. Therefore, the relational coefficients used by LL and BMN are expected to be accurate, and in such a case LL is known to estimate the model with high accuracy. This study confirms that DLBMN achieves the same accuracy as LL even in settings where LL can estimate with high accuracy.

7 Conclusion

We proposed an approach to multi-task learning with network lasso based on Bayesian sparse modeling. By grouping samples based on a Dirichlet–Laplace prior, we modeled the relationships among samples, which have not been considered by existing methods. Compared with the existing methods, the proposed method estimated the parameters with high accuracy, regardless of how accurately the relationships among the samples were specified. This result suggests that the range of applicable data can be expanded compared with existing methods.

Acknowledgements

S. K. was supported by JSPS KAKENHI Grant Numbers JP19K11854 and JP20H02227. Super-computing resources were provided by Human Genome Center (The Univ. of Tokyo).

Appendix A Estimation by Gibbs sampling

For estimation by Gibbs sampling, it is necessary to derive the full conditional distribution from the joint prior distribution. In the proposed model, the full conditional distributions are obtained from the prior distributions of each parameter and the likelihood function as follows:

\bm{w}_i \mid \bm{y}, \{\tau_{i_1 i_2}\}, \{r_{i_1 i_2}\}, \{\widetilde{\tau}_{ij}\}, \sigma^2 \sim \mathrm{N}(S^{-1}\bm{m},\ \sigma^2 S^{-1}),
\quad S = \bm{x}_i\bm{x}_i^T + \lambda_1^2 \sum_{k\in\mathcal{E}_i} \frac{r_{ik}^2}{\tau_{ik}} I_p + \widetilde{D}_i,
\quad \mathcal{E}_i = \{j;\ (i,j)\in\mathcal{E}\}, \quad \widetilde{D}_i = \mathrm{diag}(\widetilde{\tau}_{i1}^{-1}, \cdots, \widetilde{\tau}_{ip}^{-1}),
\quad \bm{m} = y_i\bm{x}_i + \lambda_1^2 \sum_{k\in\mathcal{E}_i} \frac{r_{ik}^2}{\tau_{ik}} \bm{w}_k,

\tau_{i_1 i_2}^{-1} \mid W, r_{i_1 i_2}, \sigma^2 \sim \mathrm{IGauss}(\mu',\ \lambda'),
\quad \mu' = \sqrt{\frac{\sigma^2}{\lambda_1^2 r_{i_1 i_2}^2 \|\bm{w}_{i_1}-\bm{w}_{i_2}\|_2^2}}, \quad \lambda' = 1,

\lambda_1^{-1} \mid W, \{r_{i_1 i_2}\}, \sigma^2 \sim \mathrm{giG}(\chi',\ \rho',\ \lambda'),
\quad \chi' = 2\sum_{(i_1,i_2)\in\mathcal{E}} r_{i_1 i_2}\|\bm{w}_{i_1}-\bm{w}_{i_2}\|_2/\sqrt{\sigma^2},
\quad \rho' = 1, \quad \lambda' = \#\mathcal{E}(\alpha-1),

T_{i_1 i_2} \mid W, \sigma^2 \sim \mathrm{giG}(\chi'',\ \rho'',\ \lambda''),
\quad \chi'' = 2\|\bm{w}_{i_1}-\bm{w}_{i_2}\|_2/\sqrt{\sigma^2},
\quad \rho'' = 1, \quad \lambda'' = \alpha-1,

r_{i_1 i_2}^{-1} = T_{i_1 i_2} \Big/ \sum_{(i_1,i_2)\in\mathcal{E}} T_{i_1 i_2},

\widetilde{\tau}_{ij}^{-1} \mid W, \sigma^2 \sim \mathrm{IGauss}(\widetilde{\mu}',\ \widetilde{\lambda}'),
\quad \widetilde{\mu}' = \sqrt{\frac{\lambda_2^2\sigma^2}{w_{ij}^2}}, \quad \widetilde{\lambda}' = \lambda_2^2,

\sigma^2 \mid \bm{y}, W, \{\tau_{i_1 i_2}\}, \{r_{i_1 i_2}\}, \{\widetilde{\tau}_{ij}\} \sim \mathrm{IG}(\nu',\ \eta'),
\quad \nu' = n + \#\mathcal{E} + np + \nu_0,
\quad \eta' = \sum_{i=1}^{n}(y_i - \bm{w}_i^T\bm{x}_i)^2 + \sum_{(i_1,i_2)\in\mathcal{E}} \frac{\lambda_1^2 r_{i_1 i_2}^2}{\tau_{i_1 i_2}} \|\bm{w}_{i_1}-\bm{w}_{i_2}\|_2^2 + \sum_{i=1}^{n}\sum_{j=1}^{p} \frac{w_{ij}^2}{\widetilde{\tau}_{ij}} + \eta_0,

where $\mathrm{IGauss}(\cdot,\cdot)$ denotes the inverse-Gaussian distribution and $\mathrm{giG}(\cdot,\cdot,\cdot)$ denotes the generalized inverse-Gaussian distribution.

These full conditional distributions make it possible for us to generate samples from the posterior distribution by Gibbs sampling.
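As an illustration of one step of the sampler, the following sketch draws $\bm{w}_i$ from its multivariate normal full conditional above; the current state ($W$, scales, relational coefficients, neighbour lists) is assumed to be stored in plain numpy arrays, and the code is a hedged illustration rather than the authors' implementation.

```python
import numpy as np

def sample_w_i(i, y, X, W, neighbours, r, tau, tau_tilde, lam1, sigma2,
               rng=np.random.default_rng()):
    p = X.shape[1]
    S = np.outer(X[i], X[i]) + np.diag(1.0 / tau_tilde[i])  # x_i x_i^T + D_tilde_i
    m = y[i] * X[i]
    for k in neighbours[i]:                                  # k in E_i
        c = lam1**2 * r[i, k] ** 2 / tau[i, k]
        S += c * np.eye(p)
        m += c * W[k]
    S_inv = np.linalg.inv(S)
    return rng.multivariate_normal(S_inv @ m, sigma2 * S_inv)
```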

Appendix B Dirichlet–Laplace prior distribution

The Dirichlet–Laplace prior distribution (Bhattacharya et al., 2015) was proposed to satisfy posterior consistency. Bayesian regression models using this prior distribution are known to be asymptotically consistent in variable selection. The probability density function of the Dirichlet–Laplace prior distribution is defined as follows:

\mathrm{DL}(\bm{\theta} \mid \alpha) \propto \int\cdots\int \prod_{j=1}^{p} \left\{p(\theta_j \mid \tau_j, \nu)\right\} p(\bm{\tau} \mid \alpha) p(\nu) \prod_{j=1}^{p} (d\tau_j)\, d\nu
\propto \int\cdots\int \prod_{j=1}^{p} \left\{p(\theta_j \mid \psi_j, \tau_j^2, \nu^2) p(\psi_j)\right\} p(\bm{\tau} \mid \alpha) p(\nu) \prod_{j=1}^{p} (d\tau_j\, d\psi_j)\, d\nu,

where $\bm{\theta} = (\theta_1, \theta_2, \cdots, \theta_p)^T$ and $\bm{\tau} = (\tau_1, \tau_2, \cdots, \tau_p)^T$. The prior distribution of each parameter is assumed as

\theta_j \mid \tau_j, \nu \sim \mathrm{Laplace}(1/\tau_j\nu),
\theta_j \mid \tau_j, \psi_j, \nu \sim \mathrm{N}(0,\ \psi_j\tau_j^2\nu^2),
\bm{\tau} \sim \mathrm{Dir}(\alpha, \cdots, \alpha),
\psi_j \sim \mathrm{Exp}(1/2),
\nu \sim \mathrm{Ga}(p\alpha,\ 1/2),

where the hyperparameter $\alpha\ (>0)$ controls the degree of sparseness of $\theta_j$.
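The hierarchy above can be sampled directly, which is useful for visualising the spike at the origin and the heavy tails of the prior. The sketch below does so under the parametrisations stated above (Exp and Ga given by their rates); it is an illustration only, not part of the proposed estimation algorithm.

```python
import numpy as np

def sample_dl(p, alpha, size=1, rng=np.random.default_rng(0)):
    tau = rng.dirichlet([alpha] * p, size=size)                 # tau ~ Dir(alpha, ..., alpha)
    psi = rng.exponential(scale=2.0, size=(size, p))            # psi_j ~ Exp(1/2): scale = 2
    nu = rng.gamma(shape=p * alpha, scale=2.0, size=(size, 1))  # nu ~ Ga(p * alpha, 1/2): scale = 2
    return rng.normal(0.0, tau * nu * np.sqrt(psi))             # theta_j | ... ~ N(0, psi_j tau_j^2 nu^2)
```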

References

  • Ambos et al., (2018) Ambos, H., Tran, N., and Jung, A. (2018). The logistic network lasso. arXiv preprint arXiv:1805.02483.
  • Andrews and Mallows, (1974) Andrews, D. F. and Mallows, C. L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society Series B, 36(1):99–102.
  • Argyriou et al., (2008) Argyriou, A., Evgeniou, T., and Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73(3):243–272.
  • Bhattacharya et al., (2015) Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512):1479–1490.
  • Brown and Griffin, (2010) Brown, P. J. and Griffin, J. E. (2010). Inference with normal–gamma prior distributions in regression problems. Bayesian Analysis, 5(1):171–188.
  • Carvalho et al., (2010) Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480.
  • Evgeniou and Pontil, (2007) Evgeniou, A. and Pontil, M. (2007). Multi-task feature learning. Advances in neural information processing systems, 19:41.
  • Griffin and Brown, (2005) Griffin, J. and Brown, P. (2005). Alternative prior distributions for variable selection with very many more variables than observations. University of Kent Technical Report.
  • Hallac et al., (2015) Hallac, D., Leskovec, J., and Boyd, S. (2015). Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 387–396.
  • Jung, (2020) Jung, A. (2020). On the duality between network flows and network lasso. IEEE Signal Processing Letters, 27:940–944.
  • Lengerich et al., (2018) Lengerich, B. J., Aragam, B., and Xing, E. P. (2018). Personalized regression enables sample-specific pan-cancer analysis. Bioinformatics, 34(13):i178–i186.
  • Park and Casella, (2008) Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686.
  • Polson and Scott, (2010) Polson, N. G. and Scott, J. G. (2010). Shrink globally, act locally: Sparse Bayesian regularization and prediction. Bayesian Statistics, 9:501–538.
  • Rad et al., (2017) Rad, K. R., Machado, T. A., Paninski, L., et al. (2017). Robust and scalable Bayesian analysis of spatial neural tuning function data. The Annals of Applied Statistics, 11(2):598–637.
  • Tran et al., (2020) Tran, N., Ambos, H., and Jung, A. (2020). Classifying partially labeled networked data via logistic network lasso. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3832–3836. IEEE.
  • Yamada et al., (2017) Yamada, M., Koh, T., Iwata, T., Shawe-Taylor, J., and Kaski, S. (2017). Localized lasso for high-dimensional regression. In Artificial Intelligence and Statistics, pages 325–333. PMLR.