
An Efficient Approach for Computing Optimal Low-Rank Regularized Inverse Matrices

Julianne Chung
Virginia Tech
Department of Mathematics
Blacksburg, VA, USA
jmchung@vt.edu
   Matthias Chung
Virginia Tech
Department of Mathematics
Blacksburg, VA, USA
mcchung@vt.edu

Abstract

Standard regularization methods that are used to compute solutions to ill-posed inverse problems require knowledge of the forward model. In many real-life applications, the forward model is not known, but training data is readily available. In this paper, we develop a new framework that uses training data, as a substitute for knowledge of the forward model, to compute an optimal low-rank regularized inverse matrix directly, allowing for very fast computation of a regularized solution. We consider a statistical framework based on Bayes and empirical Bayes risk minimization to analyze theoretical properties of the problem. We propose an efficient rank update approach for computing an optimal low-rank regularized inverse matrix for various error measures. Numerical experiments demonstrate the benefits and potential applications of our approach to problems in signal and image processing.

Keywords: ill-posed inverse problems, regularization, low-rank approximation, TSVD, Bayes risk, empirical Bayes risk, machine learning

1 Introduction

Inverse problems arise in scientific applications such as biomedical imaging, geophysics, computational biology, computer graphics, and security [1, 2, 3, 4, 5, 6, 7, 8]. Assume that ${\bf A}\in\mathbb{R}^{m\times n}$ and ${\bf b}\in\mathbb{R}^{m}$ are given; then the linear inverse problem can be written as

$${\bf A}{\boldsymbol{\xi}}+{\boldsymbol{\delta}}={\bf b},$$ (1)

where ${\boldsymbol{\xi}}\in\mathbb{R}^{n}$ is the desired solution and ${\boldsymbol{\delta}}\in\mathbb{R}^{m}$ is additive noise. We assume the underlying problem is ill-posed. A problem is ill-posed if the solution does not exist, is not unique, or does not depend continuously on the data [9]. A main challenge of ill-posed inverse problems is that small errors in the data may result in large errors in the reconstruction. In order to obtain a more meaningful reconstruction, regularization in the form of prior knowledge is needed to stabilize the inversion process.

Standard regularization methods require knowledge of ${\bf A}$ [10, 2]. However, in many applications, the operator ${\bf A}$ may not be readily available, or is only partially known. In this work, we investigate an approach that avoids including ${\bf A}$ in the problem formulation, and instead learns the necessary information from training or calibration data. Assume training data ${\bf b}^{(k)},{\boldsymbol{\xi}}^{(k)}$, $k=1,\ldots,K$, for (1) are given. First, our goal is to find a rank-$r$ matrix ${\bf Z}\in\mathbb{R}^{n\times m}$ that gives a small error for the given training set. That is, the sample mean error,

$$f_{K}({\bf Z})=\frac{1}{K}\sum_{k=1}^{K}e_{k}({\bf Z})=\frac{1}{K}\sum_{k=1}^{K}\rho\left({\bf Z}{\bf b}^{(k)}-{\boldsymbol{\xi}}^{(k)}\right)$$ (2)

should be small for some given error measure $\rho:\mathbb{R}^{n}\to\mathbb{R}^{+}_{0}$, e.g., $\rho({\bf x})=\left\|{\bf x}\right\|_{p}^{p}$. Here, $e_{k}({\bf Z})=\rho\left({\bf Z}{\bf b}^{(k)}-{\boldsymbol{\xi}}^{(k)}\right)$ refers to the sample error for a single observation $k$. In this paper, we consider efficient methods for obtaining an optimal low-rank regularized inverse matrix,

$$\widehat{\bf Z}=\operatorname*{arg\,min}_{{\rm rank}\left({\bf Z}\right)\leq r}\frac{1}{K}\sum_{k=1}^{K}\rho\left({\bf Z}{\bf b}^{(k)}-{\boldsymbol{\xi}}^{(k)}\right).$$ (3)

Then, once matrix $\widehat{\bf Z}$ is computed, solving the inverse problem (1) requires a simple matrix-vector multiplication: ${\boldsymbol{\xi}}=\widehat{\bf Z}{\bf b}$. Matrix $\widehat{\bf Z}$ can be precomputed, so that regularized solutions to inverse problems can be obtained efficiently and accurately.

Methods that incorporate training data have been widely used in machine learning [11, 12] and have been analyzed in the statistical learning literature [13]. Most previous work on using training data for solving inverse problems has been restricted to minimizing the predictive mean squared error and requires knowledge of the forward model [14, 15]. In this paper, we provide a novel framework for solving inverse problems that does not require knowledge of the forward model. We develop efficient methods to compute an optimal low-rank regularized inverse matrix for a variety of error measures. In addition, we provide a natural framework for evaluating solutions and predicting uncertainties.

The paper is structured as follows. In Section 2 we provide background on regularization of inverse problems and discuss the connection to Bayes risk minimization. Some theoretical results are provided for the corresponding Bayes risk minimization problem. In Section 3 we describe numerical methods to efficiently solve (3), including a rank-update approach that can be used for various error measures and large-scale problems. Numerical results are presented in Section 4 and conclusions are provided in Section 5.

2 Background and Motivation

The singular value decomposition (SVD) is an important tool for analyzing regularization methods. For any real $m\times n$ matrix ${\bf A}$, let ${\bf A}={\bf U}{\boldsymbol{\Sigma}}{\bf V}^{\top}$ be the SVD, where ${\boldsymbol{\Sigma}}$ is a diagonal matrix containing the singular values $\sigma_{1}\geq\sigma_{2}\geq\ldots\geq\sigma_{{\rm rank}({\bf A})}>0$, and orthogonal matrices ${\bf U}$ and ${\bf V}$ contain the left and right singular vectors ${\bf u}_{i}$, $i=1,2,\ldots,m$, and ${\bf v}_{i}$, $i=1,2,\ldots,n$, respectively [16]. For a given positive integer $r\leq{\rm rank}({\bf A})$, define

$${\bf A}_{r}={\bf U}_{r}{\boldsymbol{\Sigma}}_{r}{\bf V}_{r}^{\top},$$

where ${\bf U}_{r}=[{\bf u}_{1},{\bf u}_{2},\ldots,{\bf u}_{r}]\in\mathbb{R}^{m\times r}$, ${\bf V}_{r}=[{\bf v}_{1},{\bf v}_{2},\ldots,{\bf v}_{r}]\in\mathbb{R}^{n\times r}$, and

$${\boldsymbol{\Sigma}}_{r}={\rm diag}\left(\sigma_{1},\sigma_{2},\ldots,\sigma_{r}\right)\in\mathbb{R}^{r\times r}.$$

Then, we denote ${\bf A}_{r}^{\dagger}={\bf V}_{r}{\boldsymbol{\Sigma}}_{r}^{-1}{\bf U}_{r}^{\top}$ to be the pseudoinverse of ${\bf A}_{r}$ [17].

For ill-posed inverse problems, the singular values of ${\bf A}$ decay to and cluster at zero. Spectral filtering methods provide solutions where components of the solution corresponding to the small singular values are damped or neglected. For example, the rank-$r$ truncated SVD (TSVD) solution can be written as

$${\boldsymbol{\xi}}_{\rm TSVD}={\bf A}_{r}^{\dagger}{\bf b}.$$ (4)

In Figure 1, we provide TSVD sample mean errors for different values of $r$ for an image deblurring problem. For TSVD, the choice of rank is crucial: for small rank, there are not enough singular vectors to capture the details of the solution, but for large rank, small singular values are included, and noise corrupts the solution. This is an illustrative example, so details are not specified here; see Section 4 for more thorough numerical investigations.
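For concreteness, a minimal NumPy sketch of the TSVD reconstruction (4) used in this illustration follows; it assumes ${\bf A}$ and ${\bf b}$ are available as arrays, and the function name is illustrative only.

```python
import numpy as np

def tsvd_reconstruction(A, b, r):
    """Return the rank-r truncated SVD solution A_r^dagger b, cf. (4)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the r largest singular values and corresponding vectors.
    coeffs = (U[:, :r].T @ b) / s[:r]
    return Vt[:r].T @ coeffs
```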

Figure 1: Illustration of a typical sample mean error $f_{K}({\bf A}^{\dagger}_{r})$ on a logarithmic scale for truncated SVD reconstructions for various ranks $r$.

It is worth noting that the TSVD reconstruction matrix ${\bf A}_{r}^{\dagger}$ is one example of a rank-$r$ matrix that solves (1). In this paper, we seek an optimal rank-$r$ matrix for all training data. Computing a solution to problem (3) is non-trivial, and in the next section we provide some underlying theory from a Bayes risk perspective.

2.1 Bayes risk minimization

By interpreting problem (3) as an empirical Bayes risk minimization problem [18, 13, 19], we can provide a statistical framework for analyzing the problem and motivate numerical algorithms for computing an optimal low-rank regularized inverse matrix. In the following we assume that ${\boldsymbol{\xi}}$ and ${\boldsymbol{\delta}}$ are random variables with given probability distributions $\mathcal{P}_{{\boldsymbol{\xi}}}$ and $\mathcal{P}_{{\boldsymbol{\delta}}}$. Using the fact that ${\bf Z}{\bf b}-{\boldsymbol{\xi}}=({\bf Z}{\bf A}-{\bf I}){\boldsymbol{\xi}}+{\bf Z}{\boldsymbol{\delta}}$, the Bayes risk minimization problem can be written as

$$\check{\bf Z}=\operatorname*{arg\,min}_{{\rm rank}\left({\bf Z}\right)\leq r}\ \mathbb{E}_{{\boldsymbol{\xi}},{\boldsymbol{\delta}}}\left(\rho\left(({\bf Z}{\bf A}-{\bf I}){\boldsymbol{\xi}}+{\bf Z}{\boldsymbol{\delta}}\right)\right),$$ (5)

where $\mathbb{E}_{{\boldsymbol{\xi}},{\boldsymbol{\delta}}}$ is the expected value with respect to the joint distribution of ${\boldsymbol{\xi}}$ and ${\boldsymbol{\delta}}$ [18, 13]. The particular choices of $\rho$, $\mathcal{P}_{{\boldsymbol{\xi}}}$, and $\mathcal{P}_{{\boldsymbol{\delta}}}$ in (5) determine an optimal reconstruction matrix $\check{\bf Z}$. Optimization problem (3) can be interpreted as an empirical Bayes risk minimization problem corresponding to Bayes risk minimization problem (5), where ${\boldsymbol{\xi}}^{(k)}$ and ${\bf b}^{(k)}={\bf A}{\boldsymbol{\xi}}^{(k)}+{\boldsymbol{\delta}}^{(k)}$ denote realizations of the random variables ${\boldsymbol{\xi}}$ and ${\boldsymbol{\delta}}$.

2.2 Bayes Risk Minimization for the 2-norm case

For the special case where $\rho({\bf x})=\left\|{\bf x}\right\|_{2}^{2}$, the Bayes risk can be written as

$$\tilde{f}({\bf Z})=\mathbb{E}_{{\boldsymbol{\xi}},{\boldsymbol{\delta}}}\left(\left\|({\bf Z}{\bf A}-{\bf I}_{n}){\boldsymbol{\xi}}+{\bf Z}{\boldsymbol{\delta}}\right\|_{2}^{2}\right).$$ (6)

In this section, we provide closed form solutions for both the unconstrained and rank-constrained Bayes risk minimization problems for the 2-norm case.

We assume that the random variables ${\boldsymbol{\xi}}$ and ${\boldsymbol{\delta}}$ are statistically independent. Let ${\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}$ be the mean and ${\bf C}_{{\boldsymbol{\xi}}}$ be the covariance matrix for the distribution $\mathcal{P}_{{\boldsymbol{\xi}}}$. Similarly, ${\boldsymbol{\mu}}_{{\boldsymbol{\delta}}}$ and ${\bf C}_{{\boldsymbol{\delta}}}$ are the mean and covariance matrix for $\mathcal{P}_{{\boldsymbol{\delta}}}$. We assume that both matrices are positive definite and that all entries are finite. One fact that will be useful to us is that the expectation of the quadratic form ${\boldsymbol{\phi}}^{\top}{\bf K}{\boldsymbol{\phi}}$ involving a symmetric matrix ${\bf K}$ and a random variable ${\boldsymbol{\phi}}$ with covariance matrix ${\bf C}_{{\boldsymbol{\phi}}}$ can be written as

$$\mathbb{E}_{{\boldsymbol{\phi}}}({\boldsymbol{\phi}}^{\top}{\bf K}{\boldsymbol{\phi}})=\mathbb{E}_{{\boldsymbol{\phi}}}({\boldsymbol{\phi}})^{\top}{\bf K}\,\mathbb{E}_{{\boldsymbol{\phi}}}({\boldsymbol{\phi}})+{\rm tr}\left({\bf K}{\bf C}_{{\boldsymbol{\phi}}}\right),$$

where ${\rm tr}(\,\cdot\,)$ denotes the trace of a matrix [20]. With this framework and the Cholesky decompositions ${\bf C}_{{\boldsymbol{\xi}}}={\bf M}_{{\boldsymbol{\xi}}}{\bf M}_{{\boldsymbol{\xi}}}^{\top}$ and ${\bf C}_{{\boldsymbol{\delta}}}={\bf M}_{{\boldsymbol{\delta}}}{\bf M}_{{\boldsymbol{\delta}}}^{\top}$, we can write the Bayes risk (6) as

$$\begin{aligned}
\tilde{f}({\bf Z}) &=\mathbb{E}_{{\boldsymbol{\xi}},{\boldsymbol{\delta}}}\left(\left\|({\bf Z}{\bf A}-{\bf I}_{n}){\boldsymbol{\xi}}+{\bf Z}{\boldsymbol{\delta}}\right\|_{2}^{2}\right)\\
&=\mathbb{E}_{{\boldsymbol{\xi}}}\left({\boldsymbol{\xi}}^{\top}({\bf Z}{\bf A}-{\bf I}_{n})^{\top}({\bf Z}{\bf A}-{\bf I}_{n}){\boldsymbol{\xi}}\right)+2\,\mathbb{E}_{{\boldsymbol{\delta}}}({\boldsymbol{\delta}}^{\top})\,{\bf Z}^{\top}({\bf Z}{\bf A}-{\bf I}_{n})\,\mathbb{E}_{{\boldsymbol{\xi}}}\left({\boldsymbol{\xi}}\right)+\mathbb{E}_{{\boldsymbol{\delta}}}\left({\boldsymbol{\delta}}^{\top}{\bf Z}^{\top}{\bf Z}{\boldsymbol{\delta}}\right)\\
&={\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}^{\top}({\bf Z}{\bf A}-{\bf I}_{n})^{\top}({\bf Z}{\bf A}-{\bf I}_{n}){\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}+{\rm tr}\left(({\bf Z}{\bf A}-{\bf I}_{n})^{\top}({\bf Z}{\bf A}-{\bf I}_{n}){\bf M}_{{\boldsymbol{\xi}}}{\bf M}_{{\boldsymbol{\xi}}}^{\top}\right)\\
&\quad+2{\boldsymbol{\mu}}_{{\boldsymbol{\delta}}}^{\top}{\bf Z}^{\top}({\bf Z}{\bf A}-{\bf I}_{n}){\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}+{\boldsymbol{\mu}}_{{\boldsymbol{\delta}}}^{\top}{\bf Z}^{\top}{\bf Z}{\boldsymbol{\mu}}_{{\boldsymbol{\delta}}}+{\rm tr}\left({\bf Z}^{\top}{\bf Z}{\bf M}_{{\boldsymbol{\delta}}}{\bf M}_{{\boldsymbol{\delta}}}^{\top}\right)\\
&=\left\|({\bf Z}{\bf A}-{\bf I}_{n}){\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}\right\|_{2}^{2}+\left\|({\bf Z}{\bf A}-{\bf I}_{n}){\bf M}_{{\boldsymbol{\xi}}}\right\|_{\rm F}^{2}+2{\boldsymbol{\mu}}_{{\boldsymbol{\delta}}}^{\top}{\bf Z}^{\top}({\bf Z}{\bf A}-{\bf I}_{n}){\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}+\left\|{\bf Z}{\boldsymbol{\mu}}_{{\boldsymbol{\delta}}}\right\|_{2}^{2}+\left\|{\bf Z}{\bf M}_{{\boldsymbol{\delta}}}\right\|_{\rm F}^{2},
\end{aligned}$$

where $\left\|\,\cdot\,\right\|_{\rm F}$ denotes the Frobenius norm of a matrix.

We can simplify this final expression by pre-whitening the noise ${\boldsymbol{\delta}}$, thereby converting to a coordinate system where the noise mean is zero and the noise covariance is a multiple of the identity. Define the random variable

$${\boldsymbol{\chi}}={\bf M}_{{\boldsymbol{\delta}}}^{-1}({\boldsymbol{\delta}}-{\boldsymbol{\mu}}_{{\boldsymbol{\delta}}}).$$ (7)

Then, ${\boldsymbol{\mu}}_{{\boldsymbol{\chi}}}={\bf 0}$ and ${\bf C}_{{\boldsymbol{\chi}}}={\bf I}_{m}$, and we have

$${\bf M}_{{\boldsymbol{\delta}}}^{-1}{\bf A}{\boldsymbol{\xi}}+{\boldsymbol{\chi}}={\bf M}_{{\boldsymbol{\delta}}}^{-1}({\bf b}-{\boldsymbol{\mu}}_{{\boldsymbol{\delta}}}),$$ (8)

which can be written as

$$\widetilde{\bf A}{\boldsymbol{\xi}}+{\boldsymbol{\chi}}=\widetilde{\bf b},$$ (9)

where $\widetilde{\bf A}={\bf M}_{{\boldsymbol{\delta}}}^{-1}{\bf A}$ and $\widetilde{\bf b}={\bf M}_{{\boldsymbol{\delta}}}^{-1}({\bf b}-{\boldsymbol{\mu}}_{{\boldsymbol{\delta}}})$. Thus, without loss of generality (although with the need for numerical care if ${\bf M}_{{\boldsymbol{\delta}}}$ is ill-conditioned), we can always assume that ${\boldsymbol{\mu}}_{{\boldsymbol{\delta}}}={\bf 0}$ and ${\bf C}_{{\boldsymbol{\delta}}}=\eta^{2}{\bf I}_{m}$. Therefore, our Bayes risk (6) takes the form

$$\tilde{f}({\bf Z})=\left\|({\bf Z}{\bf A}-{\bf I}_{n}){\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}\right\|_{2}^{2}+\left\|({\bf Z}{\bf A}-{\bf I}_{n}){\bf M}_{{\boldsymbol{\xi}}}\right\|_{\rm F}^{2}+\eta^{2}\left\|{\bf Z}\right\|_{\rm F}^{2}.$$ (10)
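A minimal sketch of the pre-whitening transformation (7)-(9) follows, assuming the noise mean ${\boldsymbol{\mu}}_{{\boldsymbol{\delta}}}$ and covariance ${\bf C}_{{\boldsymbol{\delta}}}$ are available as arrays; the function name is illustrative.

```python
import numpy as np

def prewhiten(A, b, mu_delta, C_delta):
    # Lower-triangular Cholesky factor M with C_delta = M M^T.
    M = np.linalg.cholesky(C_delta)
    A_tilde = np.linalg.solve(M, A)               # M^{-1} A, cf. (9)
    b_tilde = np.linalg.solve(M, b - mu_delta)    # M^{-1} (b - mu_delta), cf. (8)
    return A_tilde, b_tilde
```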

Under certain conditions, summarized in the following theorem, there exists a unique global minimizer for (10).

Theorem 2.1.

Given ${\bf A}\in\mathbb{R}^{m\times n}$, ${\bf M}_{{\boldsymbol{\xi}}}\in\mathbb{R}^{n\times n}$, ${\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}\in\mathbb{R}^{n}$, and $\eta\in\mathbb{R}$, consider the problem of finding a matrix $\check{\bf Z}\in\mathbb{R}^{n\times m}$ with

$$\check{\bf Z}=\operatorname*{arg\,min}_{{\bf Z}}\left\|({\bf Z}{\bf A}-{\bf I}_{n}){\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}\right\|_{2}^{2}+\left\|({\bf Z}{\bf A}-{\bf I}_{n}){\bf M}_{{\boldsymbol{\xi}}}\right\|_{\rm F}^{2}+\eta^{2}\left\|{\bf Z}\right\|_{\rm F}^{2}.$$ (11)

If either $\eta\neq 0$, or if ${\bf A}$ has rank $m$ and ${\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}{\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}^{\top}+{\bf M}_{{\boldsymbol{\xi}}}{\bf M}_{{\boldsymbol{\xi}}}^{\top}$ is nonsingular, then the unique global minimizer is

$$\check{\bf Z}=({\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}{\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}^{\top}+{\bf M}_{{\boldsymbol{\xi}}}{\bf M}_{{\boldsymbol{\xi}}}^{\top}){\bf A}^{\top}\left[{\bf A}({\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}{\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}^{\top}+{\bf M}_{{\boldsymbol{\xi}}}{\bf M}_{{\boldsymbol{\xi}}}^{\top}){\bf A}^{\top}+\eta^{2}{\bf I}_{m}\right]^{-1}.$$ (12)

Proof.

Using derivatives of norms and traces, we get

$$\frac{{\rm d}\tilde{f}({\bf Z})}{{\rm d}{\bf Z}}=2({\bf Z}{\bf A}-{\bf I}_{n}){\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}{\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}^{\top}{\bf A}^{\top}+2({\bf Z}{\bf A}-{\bf I}_{n}){\bf M}_{{\boldsymbol{\xi}}}{\bf M}_{{\boldsymbol{\xi}}}^{\top}{\bf A}^{\top}+2\eta^{2}{\bf Z}.$$ (13)

Setting the above derivative equal to zero gives the following first order condition,

$${\bf Z}\left[{\bf A}({\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}{\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}^{\top}+{\bf M}_{{\boldsymbol{\xi}}}{\bf M}_{{\boldsymbol{\xi}}}^{\top}){\bf A}^{\top}+\eta^{2}{\bf I}_{m}\right]=({\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}{\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}^{\top}+{\bf M}_{{\boldsymbol{\xi}}}{\bf M}_{{\boldsymbol{\xi}}}^{\top}){\bf A}^{\top}.$$ (14)

Under our assumptions, the matrix ${\bf A}({\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}{\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}^{\top}+{\bf M}_{{\boldsymbol{\xi}}}{\bf M}_{{\boldsymbol{\xi}}}^{\top}){\bf A}^{\top}+\eta^{2}{\bf I}_{m}$ is invertible, so we get the unique solution $\check{\bf Z}$ that is given in the statement of the theorem. It is a minimizer since the second derivative matrix of $\tilde{f}({\bf Z})$ is positive definite. ∎

To summarize, we have minimized the Bayes risk (6) under the assumptions that ${\boldsymbol{\xi}}$ and ${\boldsymbol{\delta}}$ are statistically independent random variables with symmetric positive definite finite covariance matrices.
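A minimal sketch of the closed-form minimizer (12) follows, assuming ${\bf A}$, ${\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}$, ${\bf M}_{{\boldsymbol{\xi}}}$, and $\eta$ are given; rather than forming the explicit inverse, the sketch solves the linear system from the first-order condition (14). Names are illustrative.

```python
import numpy as np

def bayes_optimal_Z(A, mu_xi, M_xi, eta):
    m, n = A.shape
    C = np.outer(mu_xi, mu_xi) + M_xi @ M_xi.T      # mu mu^T + M M^T
    K = A @ C @ A.T + eta**2 * np.eye(m)            # bracketed matrix in (12)
    # Solve Z K = C A^T for Z, i.e., K^T Z^T = (C A^T)^T, cf. (14).
    return np.linalg.solve(K.T, (C @ A.T).T).T
```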

Next we consider the rank-constrained Bayes risk minimization problem. With the additional assumption that ${\boldsymbol{\mu}}_{{\boldsymbol{\xi}}}={\bf 0}$, problem (5) can be written as

$$\min_{{\rm rank}\left({\bf Z}\right)\leq r}\ \left\|({\bf Z}{\bf A}-{\bf I}){\bf M}\right\|_{\rm F}^{2}+\alpha^{2}\left\|{\bf Z}\right\|_{\rm F}^{2},$$ (15)

where ${\bf M}={\bf M}_{{\boldsymbol{\xi}}}$ and $\alpha=\eta$. A closed form solution to (15) was recently provided in [21], and the result is summarized in the following theorem.

Theorem 2.2.

Given a matrix ${\bf A}\in\mathbb{R}^{m\times n}$ where ${\rm rank}\left({\bf A}\right)\leq n\leq m$ and an invertible matrix ${\bf M}\in\mathbb{R}^{n\times n}$, let their generalized singular value decomposition be ${\bf A}={\bf U}{\boldsymbol{\Sigma}}{\bf G}^{-1}$, ${\bf M}={\bf G}{\bf S}^{-1}{\bf V}^{\top}$. Let $\alpha$ be a given parameter, nonzero if ${\rm rank}\left({\bf A}\right)<m$. Let $r\leq{\rm rank}\left({\bf A}\right)$ be a given positive integer. Define ${\bf D}={\boldsymbol{\Sigma}}{\bf S}^{-2}{\boldsymbol{\Sigma}}^{\top}+\alpha^{2}{\bf I}$. Let the symmetric matrix ${\bf H}={\bf G}{\bf S}^{-4}{\boldsymbol{\Sigma}}^{\top}{\bf D}^{-1}{\boldsymbol{\Sigma}}{\bf G}^{\top}$ have eigenvalue decomposition ${\bf H}=\check{\bf V}{\boldsymbol{\Lambda}}\check{\bf V}^{\top}$ with eigenvalues ordered so that $\lambda_{j}\geq\lambda_{i}$ for $j<i\leq n$. Then a global minimizer $\check{\bf Z}\in\mathbb{R}^{n\times m}$ of problem (15) is

$$\check{\bf Z}=\check{\bf V}_{r}\check{\bf V}_{r}^{\top}{\bf G}{\bf S}^{-2}{\boldsymbol{\Sigma}}^{\top}{\bf D}^{-1}{\bf U}^{\top},$$ (16)

where $\check{\bf V}_{r}$ contains the first $r$ columns of $\check{\bf V}$. Moreover, this $\check{\bf Z}$ is the unique global minimizer of (15) if and only if $\lambda_{r}>\lambda_{r+1}$.

Next we consider the case where ${\bf C}_{{\boldsymbol{\xi}}}=\beta^{2}{\bf I}_{n}$ for some scalar $\beta$. In this case the Bayes risk simplifies to

$$\tilde{f}({\bf Z})=\beta^{2}\left\|{\bf Z}{\bf A}-{\bf I}_{n}\right\|_{\rm F}^{2}+\eta^{2}\left\|{\bf Z}\right\|_{\rm F}^{2},$$ (17)

and, by rescaling, the low-rank Bayes risk minimization problem can be restated as optimization problem (15), where $\alpha^{2}=\frac{\eta^{2}}{\beta^{2}}$ represents the noise-to-signal ratio and ${\bf M}={\bf I}_{n}$. It was shown in [21] that for this problem, the global minimizer simplifies to

$$\check{\bf Z}={\bf V}_{r}{\boldsymbol{\Psi}}_{r}{\bf U}_{r}^{\top},$$ (18)

where ${\boldsymbol{\Psi}}_{r}={\rm diag}\left(\frac{\sigma_{1}}{\sigma_{1}^{2}+\alpha^{2}},\ldots,\frac{\sigma_{r}}{\sigma_{r}^{2}+\alpha^{2}}\right)$ and ${\bf V}_{r}$ and ${\bf U}_{r}$ were defined at the beginning of this section. Moreover, $\check{\bf Z}$ is the unique global minimizer if and only if $\sigma_{r}>\sigma_{r+1}$. For problems where ${\bf A}$ is known, matrix $\check{\bf Z}$ can be used to compute a filtered solution to inverse problems. That is,

$${\bf x}_{\rm filter}\equiv\check{\bf Z}{\bf b}=\sum_{j=1}^{n}\phi_{j}\frac{{\bf u}_{j}^{\top}{\bf b}}{\sigma_{j}}{\bf v}_{j},$$

where the filter factors are given by

$$\phi_{j}=\begin{cases}\dfrac{\sigma_{j}^{2}}{\sigma_{j}^{2}+\alpha^{2}},&\text{for }j\leq r,\\[2mm] 0,&\text{for }j>r.\end{cases}$$

Here, $\check{\bf Z}$ is the truncated Tikhonov reconstruction matrix [21].
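A minimal sketch of the rank-$r$ truncated Tikhonov reconstruction matrix (18) for the white-prior case ${\bf M}={\bf I}_{n}$ follows, assuming ${\bf A}$ is known and $\alpha$ is the noise-to-signal ratio; the function name is illustrative.

```python
import numpy as np

def truncated_tikhonov_Z(A, r, alpha):
    """Rank-r truncated Tikhonov reconstruction matrix V_r Psi_r U_r^T, cf. (18)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    psi = s[:r] / (s[:r]**2 + alpha**2)        # diagonal entries of Psi_r
    return Vt[:r].T @ np.diag(psi) @ U[:, :r].T
```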

3 Computing Low-rank Regularized Inverse Matrices

In this section, we describe numerical methods to efficiently compute a low-rank regularized inverse matrix, i.e., a solution to (3), for problems where matrix ${\bf A}$ is unknown and training data are available. We first describe a method that can be used for the special case where $\rho=\left\|\,\cdot\,\right\|_{2}^{2}$. We then discuss a rank-update approach that can be utilized for large-scale problems and for general error measures.

3.1 2-norm case

For the case $\rho=\left\|\,\cdot\,\right\|_{2}^{2}$, we can use the relationship between the 2-norm of a vector and the Frobenius norm of a matrix to rewrite optimization problem (3) as

$$\widehat{\bf Z}=\operatorname*{arg\,min}_{{\rm rank}\left({\bf Z}\right)\leq r}\frac{1}{K}\sum_{k=1}^{K}\left\|{\bf Z}{\bf b}^{(k)}-{\boldsymbol{\xi}}^{(k)}\right\|_{2}^{2}=\operatorname*{arg\,min}_{{\rm rank}\left({\bf Z}\right)\leq r}\frac{1}{K}\left\|{\bf Z}{\bf B}-{\bf C}\right\|_{\rm F}^{2},$$ (19)

where ${\bf B}=[{\bf b}^{(1)},\ldots,{\bf b}^{(K)}]$ and ${\bf C}=[{\boldsymbol{\xi}}^{(1)},\ldots,{\boldsymbol{\xi}}^{(K)}]$. A closed form solution for (19) exists, and the result is summarized in the following theorem.

Theorem 3.1.

Let matrices ${\bf B}\in\mathbb{R}^{m\times K}$ and ${\bf C}\in\mathbb{R}^{n\times K}$ be given. Further, let $\tilde{\bf V}$ be the matrix of right singular vectors of ${\bf B}$ and define ${\bf P}={\bf C}\tilde{\bf V}_{s}(\tilde{\bf V}_{s})^{\top}$, where $s={\rm rank}\left({\bf B}\right)$. Then

$$\widehat{\bf Z}={\bf P}_{r}{\bf B}^{\dagger}$$

is a solution to the minimization problem (19). This solution is unique if and only if either $r\geq{\rm rank}\left({\bf P}\right)$, or $1\leq r\leq{\rm rank}\left({\bf P}\right)$ and $\bar{\sigma}_{r}>\bar{\sigma}_{r+1}$, where $\bar{\sigma}_{r}$ and $\bar{\sigma}_{r+1}$ denote the $r$th and $(r+1)$st singular values of ${\bf P}$.

Proof.

This result is a special case of the main result of Sondermann [22] and was reproved with if-and-only-if uniqueness conditions in Theorem 2.1 of Friedland and Torokhti [23]. ∎

The main computational cost of constructing $\widehat{\bf Z}$ using Theorem 3.1 lies in two SVD computations (for matrices ${\bf B}$ and ${\bf P}$). If these computations are possible and minimizing the mean squared error is desired, then we propose to use the above result to compute the optimal low-rank regularized inverse matrix $\widehat{\bf Z}$, which in turn can be used to solve inverse problems [24].
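A minimal sketch of the construction in Theorem 3.1 follows, assuming the training matrices ${\bf B}$ and ${\bf C}$ fit in memory and using a simple threshold for the numerical rank $s$; the function name and tolerance are illustrative.

```python
import numpy as np

def orim2(B, C, r, tol=1e-12):
    """Rank-r minimizer of (1/K)||Z B - C||_F^2, cf. Theorem 3.1."""
    # Right singular vectors of B and its numerical rank s.
    _, sb, Vbt = np.linalg.svd(B, full_matrices=False)
    s = int(np.sum(sb > tol * sb[0]))
    P = C @ Vbt[:s].T @ Vbt[:s]                     # P = C V_s V_s^T
    # Best rank-r approximation P_r of P via its SVD.
    Up, sp, Vpt = np.linalg.svd(P, full_matrices=False)
    P_r = Up[:, :r] @ np.diag(sp[:r]) @ Vpt[:r]
    return P_r @ np.linalg.pinv(B)                  # Z_hat = P_r B^dagger
```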

3.2 A rank-update approach

For general error measures or for large problems with large training sets, we propose a rank-update approach for obtaining the optimal low-rank regularized inverse matrix $\widehat{\bf Z}$. The approach is motivated by the following corollary to Theorem 2.2.

Corollary.

Assume all conditions of Theorem 2.2 are fulfilled. Let $\check{\bf Z}^{J}$ be a global minimizer of (15) for maximal rank $J$ and $\check{\bf Z}^{J+\ell}$ be a global minimizer of (15) for maximal rank $J+\ell$. Then $\bar{\bf Z}^{\ell}=\check{\bf Z}^{J+\ell}-\check{\bf Z}^{J}$ is the global minimizer of

$$\min_{{\rm rank}\left({\bf Z}\right)\leq\ell}\left\|((\check{\bf Z}^{J}+{\bf Z}){\bf A}-{\bf I}){\bf M}\right\|_{\rm F}^{2}+\alpha^{2}\left\|\check{\bf Z}^{J}+{\bf Z}\right\|_{\rm F}^{2}.$$ (20)

Furthermore, $\bar{\bf Z}^{\ell}$ is the unique global minimizer if and only if $\lambda_{J}>\lambda_{J+1}$ and $\lambda_{J+\ell}>\lambda_{J+\ell+1}$ with $J+\ell+1<{\rm rank}\left({\bf A}\right)$.

The corollary states that in the Bayes framework for the 2-norm case, a rank-update approach (in exact arithmetic) can be used to compute an optimal low-rank regularized inverse matrix. That is, assume we are given the rank-$J$ approximation $\check{\bf Z}^{J}$ and we are interested in finding $\check{\bf Z}^{J+\ell}$. To calculate the optimal rank-$(J+\ell)$ approximation $\check{\bf Z}^{J+\ell}$, we just need to solve a rank-$\ell$ optimization problem of the form (20) and then update the solution, $\check{\bf Z}^{J+\ell}=\check{\bf Z}^{J}+\bar{\bf Z}^{\ell}$.

We are interested in the empirical Bayes problem, and we extend the rank-update approach to problem (3), where the basic idea is to iteratively build up matrices $\widehat{\bf X}\in\mathbb{R}^{n\times r}$ and $\widehat{\bf Y}\in\mathbb{R}^{m\times r}$ such that $\widehat{\bf Z}=\widehat{\bf X}\widehat{\bf Y}^{\top}$. Suppose we are given the optimal rank-$J$ matrix $\widehat{\bf Z}$, that is, $\widehat{\bf Z}$ solves (3) for $r=J$. Then, we compute a rank-$\ell$ update to $\widehat{\bf Z}$ by solving the following rank-$\ell$ optimization problem,

$$(\widehat{\bf X}^{\ell},\widehat{\bf Y}^{\ell})=\operatorname*{arg\,min}_{{\bf X}\in\mathbb{R}^{n\times\ell},\,{\bf Y}\in\mathbb{R}^{m\times\ell}}f_{K}({\bf X},{\bf Y})=\frac{1}{K}\sum_{k=1}^{K}\rho\left((\widehat{\bf Z}+{\bf X}{\bf Y}^{\top}){\bf b}^{(k)}-{\boldsymbol{\xi}}^{(k)}\right),$$ (21)

and a rank-$(J+\ell)$ regularized inverse matrix can be computed as $\widehat{\bf Z}=\widehat{\bf Z}+\widehat{\bf X}^{\ell}(\widehat{\bf Y}^{\ell})^{\top}$. This process continues until the desired rank is achieved. A summary of the general algorithm can be found below; see Algorithm 1.

Algorithm 1 Rank-$\ell$ update approach for computing a solution to (3)
1: Initialize $\widehat{\bf Z}={\bf 0}_{n\times m}$, $\widehat{\bf X}=[\,\,]$, $\widehat{\bf Y}=[\,\,]$
2: while stopping criteria not satisfied do
3:   solve rank-$\ell$ update problem (21)
4:   update matrix inverse approximation: $\widehat{\bf Z}\longleftarrow\widehat{\bf Z}+\widehat{\bf X}^{\ell}(\widehat{\bf Y}^{\ell})^{\top}$
5:   update solutions: $\widehat{\bf X}\longleftarrow[\widehat{\bf X},\widehat{\bf X}^{\ell}]$, $\widehat{\bf Y}\longleftarrow[\widehat{\bf Y},\widehat{\bf Y}^{\ell}]$
6: end while
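A skeleton of Algorithm 1 is sketched below; here solve_rank_update is a hypothetical stand-in for a solver of (21) (for example, the alternating scheme of Algorithm 2), and only the maximal-rank safeguard among the stopping criteria discussed below is shown.

```python
import numpy as np

def rank_update_orim(B, C, ell, r_max, solve_rank_update):
    """B holds observations b^(k) as columns, C the true solutions xi^(k)."""
    n, m = C.shape[0], B.shape[0]
    Z = np.zeros((n, m))                                  # line 1
    X_all = np.zeros((n, 0))
    Y_all = np.zeros((m, 0))
    while X_all.shape[1] < r_max:                         # safeguard only
        X_ell, Y_ell = solve_rank_update(Z, B, C, ell)    # line 3: solve (21)
        Z = Z + X_ell @ Y_ell.T                           # line 4: update Z
        X_all = np.hstack([X_all, X_ell])                 # line 5: append factors
        Y_all = np.hstack([Y_all, Y_ell])
    return Z, X_all, Y_all
```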

Next we provide some computational remarks for Algorithm 1.

Choice of $\ell$.

At each iteration of Algorithm 1, we must solve optimization problem (21), which has $(m+n)\,\ell$ unknowns. Hence, the choice of $\ell$ will depend on the size of the problem and may even vary per iteration, e.g., $\ell_{1},\ell_{2},\ldots$. For large-scale problems such as those arising in image reconstruction, we recommend updates of size $\ell=1$ to keep optimization problem (21) as small as possible.

Stopping criteria.

Stopping criteria for Algorithm 1 should have four components. First, a safeguard maximal rank condition: we stop if ${\rm rank}(\widehat{\bf Z})$ is greater than some predetermined integer value $r_{\max}$. Second, let $f_{\max}$ be a user-defined error tolerance for the function value; then we stop if $f_{K}\leq f_{\max}$. Third, we can monitor the relative improvement of $f_{K}$ in each iteration and stop if the relative improvement is smaller than some tolerance. Lastly, if a rank-$\ell$ update does not substantially improve $f_{K}$, the iteration should be terminated. More specifically, let $\widehat{\bf X}^{\ell}=[\widehat{\bf x}_{1},\ldots,\widehat{\bf x}_{\ell}]$ and choose a tolerance $\tau$. Then we can control the rank deficiency of $\widehat{\bf X}^{\ell}$ by stopping if $\min\{\left\|\widehat{\bf x}_{1}\right\|_{\infty},\ldots,\left\|\widehat{\bf x}_{\ell}\right\|_{\infty}\}\leq\tau$. The particular choices of stopping criteria and tolerances are problem dependent.

Solving the rank-$\ell$ update problem (21).

The main challenge for Algorithm 1 is in line 3, where we need to solve the rank-$\ell$ update problem (21). First, we remark that the solution $(\widehat{\bf X}^{\ell},\widehat{\bf Y}^{\ell})$ is not unique. In fact, $\widehat{\bf X}^{\ell}(\widehat{\bf Y}^{\ell})^{\top}=\frac{1}{a}\widehat{\bf X}^{\ell}\,(a\widehat{\bf Y}^{\ell})^{\top}$ for any $a\neq 0$. As a consequence, optimization problem (21) is ill-posed and a particular solution needs to be selected. Various computational approaches exist to handle this kind of ill-posedness. One way to tackle this problem is to constrain each column in ${\bf X}$ to have norm 1, i.e., $\left\|{\bf X}_{(:,j)}\right\|_{2}=1$ for $j=1,\ldots,\ell$, and solve the constrained or relaxed joint optimization problem. We found this approach to be computationally inefficient and instead propose an alternating optimization approach. Notice that for any convex function $\rho$, the function $f_{K}({\bf X},{\bf Y})$ is convex in ${\bf X}$ and convex in ${\bf Y}$, so an alternating optimization approach avoids the above ill-posedness. Until convergence, the solution to $\min_{{\bf Y}}f_{K}({\bf X},{\bf Y})$ is computed for fixed ${\bf X}$, and then the solution to $\min_{{\bf X}}f_{K}({\bf X},{\bf Y})$ is computed for fixed ${\bf Y}$. Columns of ${\bf X}$ are normalized at each iteration. The alternating optimization is summarized in Algorithm 2.

Algorithm 2 Alternating optimization for solving (21)
1: choose ${\bf X}^{(0)}\in\mathbb{R}^{n\times\ell}$, $i=0$
2: while stopping criteria not satisfied do
3:   Solve ${\bf Y}^{(i+1)}=\operatorname*{arg\,min}_{{\bf Y}}f_{K}({\bf X}^{(i)},{\bf Y})$
4:   Solve ${\bf X}^{(i+1)}=\operatorname*{arg\,min}_{{\bf X}}f_{K}({\bf X},{\bf Y}^{(i+1)})$
5:   ${\bf X}_{(:,j)}^{(i+1)}\longleftarrow{\bf X}_{(:,j)}^{(i+1)}/\left\|{\bf X}_{(:,j)}^{(i+1)}\right\|_{2}$ for $j=1,\ldots,\ell$
6:   $i\longleftarrow i+1$
7: end while
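For $\rho=\left\|\,\cdot\,\right\|_{2}^{2}$, the subproblems in lines 3 and 4 of Algorithm 2 are linear least-squares problems with closed-form solutions. A minimal sketch under this assumption follows, where the matrix Cres collects the residual targets ${\bf c}^{(k)}$ of (24) as columns; names are illustrative, and a final Y-solve is added so that the returned pair is consistent with the normalized ${\bf X}$.

```python
import numpy as np

def alternating_ls(B, Cres, ell, iters=20, seed=0):
    """Alternating least squares for min ||X Y^T B - Cres||_F^2 over X, Y."""
    rng = np.random.default_rng(seed)
    n = Cres.shape[0]
    X = rng.standard_normal((n, ell))
    X /= np.linalg.norm(X, axis=0)                     # unit-norm columns
    BpT = np.linalg.pinv(B).T                          # (B^T)^dagger, reused
    for _ in range(iters):
        # line 3: minimizer over Y of ||B^T Y X^T - Cres^T||_F
        Y = BpT @ Cres.T @ np.linalg.pinv(X).T
        # line 4: minimizer over X of ||X (Y^T B) - Cres||_F
        X = Cres @ np.linalg.pinv(Y.T @ B)
        X /= np.linalg.norm(X, axis=0)                 # line 5: normalize columns
    Y = BpT @ Cres.T @ np.linalg.pinv(X).T             # re-fit Y to normalized X
    return X, Y
```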

Methods for solving the optimization problems in lines 3 and 4 of Algorithm 2 depend on the specific form of $\rho$. For instance, for $\rho=\left\|\,\cdot\,\right\|_{p}^{p}$ with $2\leq p<\infty$ we suggest using Gauss-Newton type methods. For $p$-norms with $1<p<2$ we propose to use quasi-Newton type methods such as SR1, BFGS, and L-BFGS [25], or to utilize smoothing techniques to compensate for discontinuities in the derivative. Standard stopping criteria can be used for these optimization problems, as well as for Algorithm 2 [26]. The required gradients of $f_{K}$ and the Gauss-Newton approximation to the Hessian are provided below.

Let ${\bf e}^{(k)}={\bf X}{\bf Y}^{\top}{\bf b}^{(k)}-{\bf c}^{(k)}$, ${\bf x}={\rm vec}({\bf X})$, and ${\bf y}={\rm vec}({\bf Y})$. Then the gradients are given by

$$\nabla_{{\bf x}}\,f_{K}=\frac{1}{K}\left({\bf J}_{{\bf x}}\right)^{\top}{\boldsymbol{\rho}}\qquad\mbox{and}\qquad\nabla_{{\bf y}}\,f_{K}=\frac{1}{K}\left({\bf J}_{{\bf y}}\right)^{\top}{\boldsymbol{\rho}},$$ (22)

where ${\boldsymbol{\rho}}={\rm vec}([{\boldsymbol{\rho}}^{(1)},\ldots,{\boldsymbol{\rho}}^{(K)}])$ with ${\boldsymbol{\rho}}^{(k)}=\nabla\rho$ evaluated at ${\bf e}^{(k)}$. The Jacobian matrices have the form

$${\bf J}_{\bf x}=\begin{bmatrix}{\bf J}^{(1)}_{{\bf x}_{1}}&\cdots&{\bf J}^{(1)}_{{\bf x}_{\ell}}\\ \vdots&&\vdots\\ {\bf J}^{(K)}_{{\bf x}_{1}}&\cdots&{\bf J}^{(K)}_{{\bf x}_{\ell}}\end{bmatrix}\qquad\mbox{and}\qquad{\bf J}_{\bf y}=\begin{bmatrix}{\bf J}^{(1)}_{{\bf y}_{1}}&\cdots&{\bf J}^{(1)}_{{\bf y}_{\ell}}\\ \vdots&&\vdots\\ {\bf J}^{(K)}_{{\bf y}_{1}}&\cdots&{\bf J}^{(K)}_{{\bf y}_{\ell}}\end{bmatrix},$$

where ${\bf J}^{(k)}_{{\bf x}_{j}}={\bf y}_{j}^{\top}{\bf b}^{(k)}{\bf I}_{n}$ and ${\bf J}^{(k)}_{{\bf y}_{j}}={\bf x}_{j}\left({\bf b}^{(k)}\right)^{\top}$. The Gauss-Newton approximations of the Hessians are given by

$${\bf H}_{\bf x}=\frac{1}{K}\left({\bf J}_{\bf x}\right)^{\top}{\bf D}\,{\bf J}_{\bf x}\qquad\mbox{and}\qquad{\bf H}_{\bf y}=\frac{1}{K}\left({\bf J}_{\bf y}\right)^{\top}{\bf D}\,{\bf J}_{\bf y},$$ (23)

where ${\bf D}={\rm diag}\left({\bf D}^{(1)},\ldots,{\bf D}^{(K)}\right)$ and ${\bf D}^{(k)}=\nabla^{2}\rho$ evaluated at ${\bf e}^{(k)}$. Note that ${\bf D}$ is diagonal for any error measure $\rho$ of the form $\rho({\bf x})=\sum_{i=1}^{n}\phi(x_{i})$, which includes the Huber function and all $p$-norms with $p<\infty$.

Another important remark is that for computational efficiency, we do not need to form $\widehat{\bf Z}$ explicitly. That is, if we let ${\bf c}^{(k)}={\boldsymbol{\xi}}^{(k)}-\widehat{\bf Z}{\bf b}^{(k)}$, then (21) can be reformulated as

$$(\widehat{\bf X}^{\ell},\widehat{\bf Y}^{\ell})=\operatorname*{arg\,min}_{{\bf X}\in\mathbb{R}^{n\times\ell},\,{\bf Y}\in\mathbb{R}^{m\times\ell}}\frac{1}{K}\sum_{k=1}^{K}\rho\left({\bf X}{\bf Y}^{\top}{\bf b}^{(k)}-{\bf c}^{(k)}\right),$$ (24)

where the vectors ${\bf c}^{(k)}$ should be updated at each iteration as ${\bf c}^{(k)}\longleftarrow{\bf c}^{(k)}-\widehat{\bf X}^{\ell}(\widehat{\bf Y}^{\ell})^{\top}{\bf b}^{(k)}$.
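Using the reformulation (24), the gradients (22) can be assembled in matrix form without building the Jacobians explicitly. A minimal sketch for the separable measure $\rho({\bf x})=\frac{1}{p}\left\|{\bf x}\right\|_{p}^{p}$ (with $p\geq 2$, so that the derivative is smooth) follows; names are illustrative.

```python
import numpy as np

def fK_and_gradients(X, Y, B, Cres, p=2.0):
    """Sample mean error and its gradients w.r.t. X and Y for rho = (1/p)||.||_p^p."""
    K = B.shape[1]
    E = X @ Y.T @ B - Cres                  # residuals e^(k) stored as columns
    fK = np.sum(np.abs(E)**p) / (p * K)     # f_K(X, Y)
    G = np.sign(E) * np.abs(E)**(p - 1)     # elementwise derivative of rho
    grad_X = (G @ B.T @ Y) / K              # matrix form of (1/K) J_x^T rho
    grad_Y = (B @ G.T @ X) / K              # matrix form of (1/K) J_y^T rho
    return fK, grad_X, grad_Y
```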

Using $\widehat{\bf Z}$ to solve inverse problems.

We have described some numerical approaches for computing $\widehat{\bf Z}$, and it is worthwhile to note that all of these computations can be done off-line. Once computed, the optimal low-rank regularized inverse matrix $\widehat{\bf Z}$ can be used to solve inverse problems very quickly, requiring only one matrix-vector multiplication $\widehat{\bf Z}{\bf b}$ if $\widehat{\bf Z}$ is obtained explicitly, or two multiplications $\widehat{\bf X}(\widehat{\bf Y}^{\top}{\bf b})$ if $\widehat{\bf X}$ and $\widehat{\bf Y}$ are computed separately. This is particularly important for applications where hundreds of inverse problems need to be solved in real time. The ability to obtain fast and accurate reconstructions of many inverse problems in real time quickly compensates for the initial cost of constructing $\widehat{\bf Z}$ (or $\widehat{\bf X}$ and $\widehat{\bf Y}$).
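A minimal illustration of the online step: with precomputed factors, each reconstruction costs two thin matrix-vector products, roughly $(m+n)r$ multiplications, instead of a dense $n\times m$ product. The function name is illustrative.

```python
import numpy as np

def apply_orim(X_hat, Y_hat, b):
    # Two thin products instead of forming Z_hat = X_hat @ Y_hat.T explicitly.
    return X_hat @ (Y_hat.T @ b)
```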

One of the significant advantages of our approach is that we seek a regularized inverse matrix $\widehat{\bf Z}$ directly, so that knowledge of the forward model ${\bf A}$ is not required. However, for applications where ${\bf A}$ is desired, our framework could equivalently be used to first estimate a low-rank forward matrix from the training data by solving problem (3) with ${\bf A}$ in place of ${\bf Z}$ and with the roles of ${\bf b}^{(k)}$ and ${\boldsymbol{\xi}}^{(k)}$ switched. This approach is similar to ideas from machine learning [27, 28] and is related to matrix completion problems, where the goal is to reconstruct unknown or missing values in the forward matrix [29]. Then, once $\widehat{\bf A}$ is computed, standard regularization methods may be required to solve inverse problems. If this is the case, we propose to use the training data to obtain optimal regularization parameters, as discussed in [14]. Since the ultimate goal in many applications is to compute solutions to inverse problems, obtaining an approximation of ${\bf A}$ is not necessary and may even introduce additional errors. Thus, if ${\bf A}$ is not required, our approach to compute the optimal regularized inverse matrix is more efficient and less prone to errors.

4 Numerical Results

In this section, we compute optimal low-rank regularized inverse matrices for a variety of problems and evaluate the performance of these matrices when solving inverse problems. We generated a training set of $K$ observations, ${\boldsymbol{\xi}}^{(k)}$ and ${\bf b}^{(k)}$, $k=1,\ldots,K$. Then, for various error measures $\rho$, we compute optimal low-rank regularized inverse matrices $\widehat{\bf Z}$ as described in Section 3. To validate the performance of the computed optimal low-rank regularized inverse matrices, we generate a validation set of $\bar{K}$ different observations, $\bar{\bf b}^{(k)}$, $k=1,\ldots,\bar{K}$, and compute the error between the reconstructed solution and the true solution $\bar{\boldsymbol{\xi}}^{(k)}$, $k=1,\ldots,\bar{K}$, i.e., $e_{k}(\widehat{\bf Z})=\rho(\widehat{\bf Z}\bar{\bf b}^{(k)}-\bar{\boldsymbol{\xi}}^{(k)})$. For each experiment, we report the sample mean error $f_{K}(\widehat{\bf Z})$ as computed in (2), along with other descriptive statistics regarding the error distributions.

Standard regularization methods such as TSVD require knowledge of ${\bf A}$, while our approach requires training data. Thus, it is difficult to provide a fair comparison of methods. In our experiments, we compare our reconstructions to two "standard" reconstructions:

  • TSVD-${\bf A}$: This refers to the TSVD reconstruction, as computed in (4), and it requires matrix ${\bf A}$ and its SVD. The rank-$r$ reconstruction matrix for TSVD-${\bf A}$ is given by ${\bf Z}={\bf A}_{r}^{\dagger}$.

  • TSVD-$\widehat{\bf A}$: If ${\bf A}$ is not available, one could estimate it from the training data by solving the following rank-constrained optimization problem,

    $$\widehat{\bf A}=\operatorname*{arg\,min}_{{\rm rank}\left({\bf A}\right)\leq\bar{r}}\frac{1}{K}\sum_{k=1}^{K}\rho\left({\bf A}{\boldsymbol{\xi}}^{(k)}-{\bf b}^{(k)}\right),$$ (25)

    for some fixed integer $\bar{r}$. We propose to use the methods described in Section 3 to obtain matrix $\widehat{\bf A}$; a minimal sketch for the 2-norm case is given after this list. The computed matrix $\widehat{\bf A}$ can then be used to construct a TSVD reconstruction. We refer to this approach as TSVD-$\widehat{\bf A}$, and the rank-$r$ reconstruction matrix for TSVD-$\widehat{\bf A}$ for $r\leq\bar{r}$ is given by ${\bf Z}=\widehat{\bf A}_{r}^{\dagger}$. This seems to be a fair comparison to our approach since only training data are used; however, a potential disadvantage of TSVD-$\widehat{\bf A}$ is that $\widehat{\bf A}$ can be large, and computing its SVD may be prohibitive.
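For the 2-norm, estimating $\widehat{\bf A}$ amounts to the construction of Theorem 3.1 with the roles of the training matrices swapped. A minimal, self-contained sketch of the TSVD-$\widehat{\bf A}$ reconstruction matrix follows; names and the rank tolerance are illustrative, and $r\leq\bar{r}$ is assumed.

```python
import numpy as np

def tsvd_Ahat_Z(B, Xi, rbar, r, tol=1e-12):
    """B holds b^(k), Xi holds xi^(k) as columns; returns the rank-r TSVD-A_hat matrix."""
    # Rank-rbar fit of the forward model: A_hat = Q_rbar Xi^dagger, where
    # Q = B V_s V_s^T is built from the right singular vectors of Xi.
    _, sx, Vxt = np.linalg.svd(Xi, full_matrices=False)
    s = int(np.sum(sx > tol * sx[0]))
    Q = B @ Vxt[:s].T @ Vxt[:s]
    Uq, sq, Vqt = np.linalg.svd(Q, full_matrices=False)
    A_hat = (Uq[:, :rbar] * sq[:rbar]) @ Vqt[:rbar] @ np.linalg.pinv(Xi)
    # Rank-r TSVD reconstruction matrix A_hat_r^dagger (requires r <= rbar).
    U, sa, Vt = np.linalg.svd(A_hat, full_matrices=False)
    return Vt[:r].T @ np.diag(1.0 / sa[:r]) @ U[:, :r].T
```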

We present results for three experiments: Experiments 1 and 2 are 1D examples, and Experiment 3 is a 2D example.

4.1 One-dimensional examples

For the 1D examples, we generate random time signals as piecewise constant functions, as seen in Figure 2. Here, the number of jumps in each signal is a randomly chosen integer between 3 and 20, drawn from a uniform distribution. The times of the jumps and the signal strengths were both randomly selected in the interval $[0,1]$, also from a uniform distribution. The discretized signal has length $n=150$. The corresponding blurred signals were computed as in Equation (1), where matrix ${\bf A}$ is $150\times 150$ and represents convolution with a 1D Gaussian kernel with zero mean and variance 2, and ${\boldsymbol{\delta}}$ represents Gaussian white noise. The noise level for the blurred signals was $0.01$, which means that $\left\|{\boldsymbol{\delta}}\right\|_{2}^{2}/\left\|{\bf A}{\boldsymbol{\xi}}\right\|_{2}^{2}\approx 0.01$. Training and validation signals were generated in the same manner. Figure 2 displays four sample signals from the validation set and their corresponding observed signals.
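A minimal sketch of this data-generation procedure follows. The kernel discretization and normalization, and the baseline value of the signal before the first jump, are illustrative assumptions not specified in the text.

```python
import numpy as np

def gaussian_blur_matrix(n=150, variance=2.0):
    # Dense 1D Gaussian convolution matrix; width interpreted in grid units
    # and rows normalized (illustrative assumptions).
    t = np.arange(n)
    K = np.exp(-(t[:, None] - t[None, :])**2 / (2.0 * variance))
    return K / K.sum(axis=1, keepdims=True)

def make_signal(n=150, rng=None):
    # Piecewise-constant signal with 3-20 jumps at uniformly random times.
    rng = np.random.default_rng() if rng is None else rng
    t = np.linspace(0.0, 1.0, n)
    xi = np.zeros(n)
    for tj in np.sort(rng.uniform(0.0, 1.0, rng.integers(3, 21))):
        xi[t >= tj] = rng.uniform(0.0, 1.0)   # new constant level after each jump
    return xi

def observe(A, xi, noise_level=0.01, rng=None):
    # Blur and add white noise scaled so ||delta||^2 / ||A xi||^2 = noise_level.
    rng = np.random.default_rng() if rng is None else rng
    b_clean = A @ xi
    delta = rng.standard_normal(b_clean.size)
    delta *= np.sqrt(noise_level) * np.linalg.norm(b_clean) / np.linalg.norm(delta)
    return b_clean + delta
```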

Figure 2: Four sample validation signals and corresponding observed signals, used in Experiments 1 and 2.
Experiment 1.

For this experiment, we consider the case where $\rho=\frac{1}{2}\left\|\,\cdot\,\right\|_{2}^{2}$ and compute an optimal regularized inverse matrix (ORIM) as described in Section 3.1. We refer to this approach as $\rm ORIM_{2}$, and we refer to the computed matrix as $\widehat{\bf Z}_{2}$.

First, we fix the number of training signals to $K=10{,}000$. For varying ranks $r$, we compute the sample mean error for $\rm ORIM_{2}$, TSVD-${\bf A}$, and TSVD-$\widehat{\bf A}$ for a set of 10,000 different validation signals. For computing $\widehat{\bf A}$, $\bar{r}$ was fixed to 100. Error plots as a function of rank are shown in Figure 3. We observe that the sample mean error for $\rm ORIM_{2}$ is monotonically decreasing and consistently smaller than the sample mean errors for both TSVD-${\bf A}$ and TSVD-$\widehat{\bf A}$, for all ranks. Furthermore, both TSVD-${\bf A}$ and TSVD-$\widehat{\bf A}$ exhibit semi-convergence behavior similar to Figure 1, whereby the error initially improves as the rank increases, but reconstructions degrade for larger ranks. On the other hand, since regularization is built into the design and construction of $\widehat{\bf Z}_{2}$, $\rm ORIM_{2}$ avoids semi-convergence behavior altogether, as evident in the flattening of the error curve. Thus, $\rm ORIM_{2}$ solutions are less sensitive to the choice of rank.

For ranks less than $r=57$, TSVD-${\bf A}$ achieves a smaller sample mean error than TSVD-$\widehat{\bf A}$. We also note that the smallest error that TSVD-$\widehat{\bf A}$ can achieve is larger than the smallest error that TSVD-${\bf A}$ can achieve. This is to be expected since $\widehat{\bf A}$ is an approximation of ${\bf A}$. Although the semi-convergence phenomenon for TSVD-$\widehat{\bf A}$ occurs later and errors grow more slowly than for TSVD-${\bf A}$, both TSVD solutions require the selection of a good choice for the rank. Since training data is available, one could use the training data to select an "optimal" rank for TSVD-${\bf A}$ and TSVD-$\widehat{\bf A}$, i.e., the rank which provides the minimal sample mean error for the training set. This is equivalent to the opt-TSVD approach from [14]. For TSVD-${\bf A}$ and TSVD-$\widehat{\bf A}$, the minimal sample mean error for the training set occurs at ranks 51 and 54 (consistent with the validation set), respectively.

Figure 3: Sample mean error $f_{K}$ on a logarithmic scale for $\rm ORIM_{2}$, TSVD-${\bf A}$, and TSVD-$\widehat{\bf A}$ for the validation signals for various ranks. The dots correspond to the minimal sample mean error for the training data.

In Figure 4, we provide box-and-whisker plots to describe the statistics of the sample errors for the validation set, $e_{k}({\bf Z})=\frac{1}{2}\left\|{\bf Z}\bar{\bf b}^{(k)}-\bar{\boldsymbol{\xi}}^{(k)}\right\|_{2}^{2}$ for $k=1,\ldots,\bar{K}$. Each box provides the median error, along with the 25th and 75th percentiles. The whiskers include extreme data, and outliers are plotted individually. For TSVD-${\bf A}$ and TSVD-$\widehat{\bf A}$, errors correspond to the ranks computed using the opt-TSVD approach ($r=51$ and $54$, respectively), and errors for $\rm ORIM_{2}$ correspond to maximal rank 100. The averages of the errors $e_{k}({\bf Z})$ correspond to the dots in Figure 3.

Figure 4: Box plots of sample errors for the validation set used in Experiment 1. All results use the 2-norm.

Next, we investigate how the errors for $\rm ORIM_{2}$ vary for different numbers of training signals. That is, for increasing numbers of training signals, we compute the $\rm ORIM_{2}$ matrix and record sample errors $e_{k}(\widehat{\bf Z}_{2})$ for both the training set and the validation set. It is worthwhile to mention that the rank of $\widehat{\bf Z}_{2}$ is inherently limited by the number of training signals and the signal length, i.e., $r\leq\min\{K,n\}$. For this example, we have $n=150$, and we select the rank of $\widehat{\bf Z}_{2}$ to be

$$r=\begin{cases}K&\text{for }K<50,\\ 50&\text{for }K\geq 50.\end{cases}$$ (26)

Notice that for rank $r=50$ with a signal length of $n=150$, we have a total of $2\cdot r\cdot n=15{,}000$ unknowns (for ${\bf X}$ and ${\bf Y}$). In Figure 5 we provide the median errors along with the 25th to 75th percentiles. These plots are often referred to as Pareto curves in the literature.

We observe that for numbers of training signals $K<50$, the $\rm ORIM_{2}$ matrix is significantly biased towards the training set. This is evident in the numerically zero computed errors for the training set. In addition, we expect the errors for the training set to increase and the errors for the validation set to decrease as more training signals are included in the training set, due to a reduction in bias towards the training set. Eventually, errors for both signal sets reach a plateau, where additional training data do not significantly improve the solution. The point at which the errors begin to stabilize indicates a natural choice for the number of training signals to use, e.g., when the relative change in errors reaches a predetermined tolerance. For this example, a relative improvement of $10^{-3}$ for the errors for the training set is reached at 1,000 training signals.

There is an interesting phenomenon that occurs in the reconstruction errors for the validation set, whereby the error peaks around $K=150$. Although puzzling at first, the peak in errors is intuitive when one considers the problem that is being solved. For $K=150$, matrix ${\bf B}$ is $150\times 150$. If ${\bf B}$ is full rank ($s=150$), then ${\bf P}={\bf C}$ in Theorem 3.1 and problem (19) has the solution

$$\widehat{\bf Z}_{2}=\operatorname*{arg\,min}_{{\rm rank}\left({\bf Z}\right)\leq r}\frac{1}{K}\left\|{\bf Z}{\bf B}-{\bf C}\right\|_{\rm F}^{2}={\bf C}_{r}{\bf B}^{-1},$$

which is the special case considered by Eckart and Young [30] and by Schmidt [31]. Notice that $\widehat{\bf Z}_{2}$ is uniquely determined by the training set, so reconstructions at this point will be most biased to the training set.

Figure 5: Pareto curves. The curves in the upper figure correspond to sample errors for the training set, and the curves in the lower figure correspond to sample errors for the validation set. Here we plot the median values along with the 25th and 75th percentiles of the errors as a function of the number of training signals used to construct $\widehat{\bf Z}_{2}$.
Experiment 2.

For this experiment, we use the same setup as described in Experiment 1 for the 1D example, but we compare results obtained by using different error measures $\rho$ in equation (3). In particular, we consider $\rho=\frac{1}{p}\left\|\,\cdot\,\right\|_{p}^{p}$ for $p=1.2,2,$ and $5$. For $p=1.2$, we use additional smoothing to deal with non-differentiability. We compute an optimal low-rank regularized inverse matrix $\widehat{\bf Z}_{p}$ for each $p$. The computed matrices are then used to reconstruct the validation set. We refer to these approaches as $\rm ORIM_{1.2}$, $\rm ORIM_{2}$, and $\rm ORIM_{5}$, respectively. Box-and-whisker plots of the sample errors for the validation signals are presented in Figure 6, along with errors for TSVD-$\widehat{\bf A}$. In each subfigure, the reconstruction error was computed using a different error measure. As expected, the reconstructions with smallest error correspond to $\widehat{\bf Z}_{p}$, where $p$ defines the error measure used for the training set.

Figure 6: Box plots of sample errors for the validation set used in Experiment 2. Sample errors were computed using $\rho=\frac{1}{p}\left\|\,\cdot\,\right\|_{p}^{p}$ for $p=1.2,2,$ and $5$, respectively.

Figure 7: Absolute reconstruction errors for the validation signals presented in Figure 2.

We emphasize that even though the sample errors for the various methods may appear similar in value in the box plots, the reconstructions are significantly different, as seen in the absolute errors. Plots of the absolute values of the error vectors, $\widehat{\bf Z}_{p}\bar{\bf b}^{(k)}-\bar{\boldsymbol{\xi}}^{(k)}$, for the four validation signals presented in Figure 2 can be found in Figure 7. We observe that $\rm ORIM_{5}$ allows for large reconstruction errors throughout the signal without major peaks, while absolute reconstruction errors for $\rm ORIM_{1.2}$ tend to remain small in flat regions, with only a few high peaks in the error at locations corresponding to discontinuities in the signal. These observations are consistent with norms in general and illustrate the potential benefits of using a different error measure.

4.2 Two-dimensional example

Next, we describe a problem from image deblurring, where the underlying model can be represented as (1), where matrix ${\bf A}$ represents a blurring process, and ${\boldsymbol{\xi}}$ and ${\bf b}$ represent the true and observed images, respectively. Entries in matrix ${\bf A}$ depend on the point spread function (PSF). For image deblurring examples, matrix ${\bf A}$ is typically very large and sparse. However, under assumptions of spatial invariance and assumed boundary conditions and/or properties of the PSF, structure in ${\bf A}$ can be exploited and fast transform algorithms can be used to efficiently diagonalize ${\bf A}$ [5]. For example, if the PSF satisfies a double symmetry condition and reflexive boundary conditions on the image are assumed, then ${\bf A}$ can be diagonalized using the 2D discrete cosine transform (DCT). Of course, these approaches rely on knowledge of the exact PSF.

Blind deconvolution methods have been developed for problems where the PSF is not known or partially known [32]; however, most of these approaches require strict assumptions on the PSF such as spatial invariance or parameterization of the blur function, or they require a good initial estimate of the PSF. Our approach uses training data and does not require knowledge of ${\bf A}$ and, hence, knowledge of the PSF. Furthermore, we do not require any assumptions regarding spatial invariance of the blur, boundary conditions, or special properties of the blur (e.g., separability, symmetry, etc.). We refer the reader to [24] for comparisons to blind deconvolution methods.

Experiment 3.
Figure 8: Sample true images and corresponding observed, blurred and noisy images that were used for training in Experiment 3.

In this experiment, we generate a set of 6,000 training images. Each image has $128\times 128$ pixels and represents a collection of circles, where the number of circles and their locations, radii, and intensities were all randomly drawn from a uniform distribution. In this example, the PSF was a 2D Gaussian kernel with zero mean and variance 5. We assume reflexive boundary conditions for the image, and Gaussian white noise was added to each image, where the noise level was randomly selected between 0.1 and 0.15. Sample true images along with the corresponding blurred, noisy images can be found in Figure 8. A validation set of 7,000 images was generated in the same manner.

Figure 9: Sample mean error for $\rm ORIM_{2}$, TSVD-${\bf A}$, and TSVD-$\widehat{\bf A}$ on a logarithmic scale for the validation data in Experiment 3. The dots correspond to the minimal sample mean error for the training data.

We compare the sample mean errors for the validation set for $\rm ORIM_{2}$, TSVD-${\bf A}$, and TSVD-$\widehat{\bf A}$, where the maximal rank for $\widehat{\bf A}$ was $\bar{r}=1{,}000$. The sample mean errors are provided in Figure 9 for various ranks. Dots correspond to the minimal sample mean error for the training data. As expected, TSVD-${\bf A}$ exhibits clear semi-convergence behavior, where the minimum error is achieved at rank 238 (for both the training and validation sets). In contrast, $\rm ORIM_{2}$ maintains low sample mean errors for various ranks, and only exhibits slight semi-convergence for larger ranks. This can be attributed to bias towards the training data, since for larger ranks, more training data is required to compensate for more unknowns. We observe that the relative difference between the lowest sample mean error for $\rm ORIM_{2}$ and the lowest sample mean error for TSVD-${\bf A}$ is approximately 10%. That is, even without knowledge of ${\bf A}$, $\rm ORIM_{2}$ can produce reconstructions that are comparable to TSVD-${\bf A}$ reconstructions.

Next, we compare the distribution of the sample errors for $\rm ORIM_{2}$ and TSVD-$\widehat{\bf A}$, corresponding to rank 1,000. Density plots can be found in Figure 10. Among the methods that only require training data, $\rm ORIM_{2}$ is less sensitive than TSVD-$\widehat{\bf A}$, implying that it is better to seek a low-rank regularized inverse matrix directly, versus seeking a low-rank approximation to ${\bf A}$ and then inverting it.

Figure 10: Density of sample errors for $\rm ORIM_{2}$ and TSVD-$\widehat{\bf A}$ for Experiment 3.

One of the validation images, with the corresponding observed image, can be found in the first column of Figure 11. Reconstructions for $\rm ORIM_{2}$, TSVD-${\bf A}$, and TSVD-$\widehat{\bf A}$, and the corresponding absolute error images between the reconstructed image and the true image (with inverted colormap so that white corresponds to zero error), are provided in Figure 11 (top and bottom row, respectively).

Figure 11: Image reconstructions for four validation images for Experiment 3. The bottom row presents absolute error images between the reconstructed image and the true image.

5 Conclusions

In this paper, we described a new framework for solving inverse problems where the forward model is not known, but training data is readily available. We considered the problem of finding an optimal low-rank regularized inverse matrix that, once computed, can be used to solve hundreds of inverse problems in real time. We used a Bayes risk minimization framework to provide underlying theory for the problem. For the empirical Bayes problem, we proposed a rank-$\ell$ update approach and showed that this approach is optimal in the Bayes framework for the 2-norm. Numerical results demonstrate that our approach can produce solutions which are superior to standard methods like TSVD, but without full knowledge of the forward model. The examples illustrate the benefits of using different error measures and the potential for use in large-scale imaging applications.

Acknowledgment

We are grateful to Dianne P. O’Leary for helpful discussions.

References

  • [1] H.W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Springer, New York, 2000.
  • [2] C.R. Vogel. Computational Methods for Inverse Problems. SIAM, Philadelphia, 2002.
  • [3] A.G. Ramm. Inverse Problems: Mathematical and Analytical Techniques with Applications To Engineering. Springer, New York, 2004.
  • [4] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, Philadelphia, 2005.
  • [5] P.C. Hansen, J.G. Nagy, and D.P. O’Leary. Deblurring Images: Matrices, Spectra and Filtering. SIAM, Philadelphia, 2006.
  • [6] A. Kirsch. An Introduction to the Mathematical Theory of Inverse Problems. Springer, New York, 2011.
  • [7] R.C. Aster, B. Borchers, and C.H. Thurber. Parameter Estimation and Inverse Problems. Academic Press, Waltham, 2012.
  • [8] F.D.M. Neto and A.J.S. Neto. An Introduction to Inverse Problems With Applications. Springer, New York, 2012.
  • [9] J. Hadamard. Lectures on Cauchy’s Problem in Linear Differential Equations. Yale University Press, New Haven, 1923.
  • [10] P.C. Hansen. Rank-deficient and Discrete Ill-posed Problems. SIAM, Philadelphia, 1998.
  • [11] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, Cambridge, MA, 2012.
  • [12] Y.S. Abu-Mostafa, M. Magdon-Ismail, and H.T. Lin. Learning from Data. AMLBook, 2012.
  • [13] V.N. Vapnik. Statistical Learning Theory. Wiley, San Francisco, 1998.
  • [14] J. Chung, M. Chung, and D.P. O’Leary. Designing optimal filters for ill-posed inverse problems. SIAM Journal on Scientific Computing, 33(6):3132–3152, 2011.
  • [15] J. Chung, M. Chung, and D.P. O’Leary. Optimal filters from calibration data for image deconvolution with data acquisition error. Journal of Mathematical Imaging and Vision, 44(3):366–374, 2012.
  • [16] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 3 edition, 1996.
  • [17] G.W. Stewart. Matrix Algorithms: Volume 1, Basic Decompositions. SIAM, Philadelphia, 1998.
  • [18] B.P. Carlin and T.A. Louis. Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall/CRC, Boca Raton, 2 edition, 2000.
  • [19] A. Shapiro, D. Dentcheva, and R. Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia, 2009.
  • [20] G.A.F. Seber and A.J. Lee. Linear Regression Analysis. Wiley, San Francisco, 2 edition, 2003.
  • [21] J. Chung, M. Chung, and D.P. O’Leary. Optimal regularized low rank inverse approximation. In revision at Linear Algebra and Applications, 2014.
  • [22] D. Sondermann. Best approximate solutions to matrix equations under rank restrictions. Statistische Hefte, 27:57–66, 1986.
  • [23] S. Friedland and A. Torokhti. Generalized rank-constrained matrix approximations. SIAM Journal on Matrix Analysis and Applications, 29(2):656–659, 2007.
  • [24] J. Chung and M. Chung. Computing optimal low-rank matrix approximations for image processing. In IEEE Proceedings of the Asilomar Conference on Signals, Systems, and Computers. November 3-6, 2013, Pacific Grove, CA, USA, 2013.
  • [25] J. Nocedal and S.J. Wright. Numerical Optimization. Springer, New York, 1999.
  • [26] P.E. Gill, W. Murray, and M.H. Wright. Practical Optimization. Emerald Group Publishing, Bingley, UK, 1981.
  • [27] P. Baldi and S. Brunak. Bioinformatics: the Machine Learning Approach. MIT press, Cambridge, 2001.
  • [28] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2 edition, 2009.
  • [29] J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
  • [30] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
  • [31] E. Schmidt. Zur Theorie der linearen und nichtlinearen Integralgleichungen. Mathematische Annalen, 63(4):433–476, 1907.
  • [32] P. Campisi and K. Egiazarian. Blind Image Deconvolution: Theory and Applications. CRC press, Boca Raton, 2007.