
Solving sparse linear systems with approximate inverse preconditioners on analog devices

Vasileios Kalantzis1, Anshul Gupta2, Lior Horesh2, Tomasz Nowicki2, Mark S. Squillante2, and Chai Wah Wu2
IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY, USA
Email: 1vkal@ibm.com, 2{anshul,lhoresh,tnowicki,mss,cwwu}@us.ibm.com
Abstract

Sparse linear system solvers are computationally expensive kernels that lie at the heart of numerous applications. This paper proposes a preconditioning framework that combines approximate inverses with stationary iterations to substantially reduce the time and energy requirements of this task by utilizing a hybrid architecture that combines conventional digital microprocessors with analog crossbar array accelerators. Our analysis and experiments with a simulator for analog hardware demonstrate that an order of magnitude speedup is readily attainable without much impact on convergence, despite the noise in analog computations.

Index Terms:
Richardson iteration, approximate inverse preconditioners, analog crossbar arrays

I Introduction

The iterative solution of sparse linear systems [1] is a fundamental task across many applications in science, engineering, and optimization. For decades, progress in speeding up preconditioned iterative solvers was primarily driven by constructing more effective preconditioning algorithms and via steady microprocessor performance gains. The last decade saw the advent of fast power-efficient low-precision hardware, such as Graphics Processing Units (GPUs), which gave rise to new opportunities to accelerate sparse iterative solvers [2, 3, 4, 5, 6, 7, 8]. While these advances have led to remarkable gains in the performance of sparse iterative solvers, conventional CMOS-based digital microprocessors possess inherent scaling and power dissipation limitations. It is therefore imperative that we explore alternative hardware architectures to address current and future challenges of sparse iterative solvers. One such alternative paradigm is based on analog devices configured as crossbar arrays of non-volatile memory. Such arrays can achieve high degrees of parallelism with low energy consumption by mapping matrices onto arrays of memristive elements capable of storing information and executing simple operations such as a multiply-and-add. In particular, these devices can perform Matrix-Vector Multiplication (MVM) in near constant time independent of the number of nonzero entries in the operand matrices [9, 10]. Indeed, considerable computational speed-ups have been achieved with analog crossbar arrays [11, 12, 13, 14, 15, 16, 17, 18, 19], mostly for machine learning applications that can tolerate the noise and lower precision of these devices.

Analog designs with enhanced precision have been proposed [20, 21] to meet the accuracy demands of scientific applications, but the improvement in precision comes at the cost of increased hardware complexity and energy consumption. Memristive hardware with precision enhancement has been exploited in the context of Jacobi iterations to solve linear systems stemming from partial differential equations (PDEs) [22], and as an autonomous linear system solver [23] for small dense systems. The latter approach was used to apply overlapping block-Jacobi preconditioners in the Generalized Minimal RESidual (GMRES) iterative solver [20] with bit-slicing for increasing precision. Inner-outer iterations [24, 25, 26] have been considered for dense systems, where analog arrays were used for the inner solver while iterative refinement was used as an outer solver in the digital space.

The main contribution of this paper is a flexible preconditioning framework for inexpensively solving a large class of sparse linear systems by effectively utilizing simple analog crossbar arrays without precision-enhancing extensions. Our approach combines flexible iterative solvers and approximate inverse preconditioners on a hybrid architecture consisting of a conventional digital processor and an analog crossbar array accelerator. We analytically and experimentally show that this combination can solve certain sparse linear systems at a fraction of the cost of solving them on digital processors. While the relative advantage of analog hardware grows with the size of the linear systems, we demonstrate the potential for an order of magnitude speedup even for systems with just a few hundred equations.

While our method can be used to provide a fast stand-alone sparse solver, it also has exciting applications in the context of the upcoming exascale platforms [27]. Extreme-scale solvers are likely to be highly composable [28] and hierarchical [29] with multiple levels of nested algorithmic components. A solver like the one proposed in this paper, running on analog accelerator-equipped nodes of a massively parallel platform, is an ideal fast and low-power candidate for a local subdomain or coarse-grid solver to tackle the local components of a much larger sparse linear system.

II Linear Algebra on Analog Hardware

We begin by briefly describing our architecture model and its key properties. Figure 1 illustrates a hybrid digital-analog architecture that combines a conventional microprocessor system with an analog crossbar array for performing MVM. The crossbar consists of rows and columns of conductors with a memristive element at each row-column intersection [30]. The conductance of these elements can be set, reset, or updated in a non-volatile manner. Matrix operations are performed by mapping the n\times n operand matrix M onto the crossbar array, where the conductance G_{ij} at the intersection of row i and column j represents the value M[i,j] (i.e., the (i,j)-th entry of the matrix M) after a suitable scaling.

Figure 1: A hybrid digital-analog architecture consisting of a CPU and a memristive crossbar array for multiplying an n\times n matrix M with a vector r to approximate y=Mr in O(1) time.

The MVM operation y=Mr can be emulated by sending pulses of V_{in} volts along the columns of the crossbar such that the length t_{j} of the pulse along column j is proportional to the jth entry r[j] of vector r, with suitable normalization. Following Ohm’s Law, this contributes a current equal to V_{in}G_{ij} on the conductor corresponding to row i for a duration t_{j}. Following Kirchhoff’s Current Law, the currents along each row accumulate and can be integrated over the time period equivalent to the maximum pulse length using capacitors, yielding a charge proportional to \sum_{j=1}^{n}M[i,j]r[j], which in turn is proportional to y[i]. This integrated value can be recovered as a digital quantity via an analog-to-digital converter (ADC), and yields an approximation \widehat{y}[i] to y[i].

The above procedure involves multiple sources of nondeterministic noise so that the output \widehat{y}\in\mathbb{R}^{n} is equivalent to a low precision approximation of y. Writing M to the crossbar array incurs multiplicative and additive write noises N^{Wm}\in\mathbb{R}^{n\times n} and N^{Wa}\in\mathbb{R}^{n\times n}, respectively, and the actual conductance values at the crosspoints of the array reflect \widehat{M}=M\odot(I+N^{Wm})+N^{Wa}, where \odot denotes element-wise multiplication. Similarly, digital-to-analog conversion (DAC) of the vector r into voltage pulses suffers from multiplicative and additive input noises N^{Im}\in\mathbb{R}^{n} and N^{Ia}\in\mathbb{R}^{n}, respectively. As a result, the matrix \widehat{M} is effectively multiplied by a perturbed version of r given by \widehat{r}=r\odot(\mathbf{1}+N^{Im})+N^{Ia}, where \mathbf{1} is a vector of all ones.

A characteristic equation to describe the output \widehat{y} of an analog MVM Mr can be written as

\widehat{y}=\widehat{M}\widehat{r}\odot(\mathbf{1}+N^{Om})+N^{Oa}, (1)

where N^{Om}\in\mathbb{R}^{n} and N^{Oa}\in\mathbb{R}^{n} denote the multiplicative and additive components of the output noise, respectively. These components reflect the inherent inexactness in the multiplication based on circuit laws and current integration, as well as the loss of precision in the ADC conversion of the result vector. Equation (1) can be simplified as

\widehat{y}=(M+E)r+\widehat{N}^{Oa}, (2)

where \widehat{N}^{Oa} is the overall effective additive noise that captures all the lower order noise terms from Equation (1) and the nondeterministic error matrix E is given by

E=M\odot(N^{Wm}+\mathbf{1}\otimes N^{Im}+N^{Om}\otimes\mathbf{1})+N^{Wa}; (3)

here \otimes denotes outer product. We provide additional technical details in Appendix A.
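To make the noise model concrete, a minimal NumPy sketch that emulates Equations (1)-(3) might look as follows; the function name analog_mvm, the noise magnitudes, and the interpretation of I in the write-noise term as an all-ones matrix are illustrative assumptions, not calibrated device characteristics.

    import numpy as np

    def analog_mvm(M, r, s_w=5e-3, s_in=1e-2, s_out=1e-2, rng=np.random.default_rng(0)):
        # Emulate y_hat = (M + E) r + N^Oa following Equations (1)-(3).
        # s_w, s_in, s_out are illustrative standard deviations of the write,
        # input (DAC), and output (integration/ADC) noises, respectively.
        n = M.shape[0]
        # Write noise: M_hat = M o (1 + N^Wm) + N^Wa (elementwise).
        M_hat = M * (1 + s_w * rng.standard_normal(M.shape)) + s_w * rng.standard_normal(M.shape)
        # Input noise: r_hat = r o (1 + N^Im) + N^Ia.
        r_hat = r * (1 + s_in * rng.standard_normal(n)) + s_in * rng.standard_normal(n)
        # Output noise: y_hat = (M_hat r_hat) o (1 + N^Om) + N^Oa.
        y = M_hat @ r_hat
        return y * (1 + s_out * rng.standard_normal(n)) + s_out * rng.standard_normal(n)

    # Example: the analog result is a low-precision approximation of the exact product.
    rng = np.random.default_rng(1)
    M, r = rng.random((8, 8)), rng.random(8)
    print(np.linalg.norm(analog_mvm(M, r) - M @ r) / np.linalg.norm(M @ r))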

The exact characteristics and magnitudes of the noises depend on the physical implementation [12, 25] of the analog array and the materials used therein, the details of which are beyond the scope of this paper but they are generally captured by our model. Regardless of the implementation, the noises are always device dependent and stochastic.

Once the matrix M is written to the crossbar array, analog MVMs involving the matrix can be performed in O(1) time, resulting in an O(n^{2}) speedup over the equivalent digital computation if M is dense. Therefore, a linear system solver that can overcome the analog MVM error, even if it requires many more MVM operations than its digital counterpart, can substantially reduce the cost of the solution.

III An iterative solver based on memristive approximate inverse preconditioning

The goal of preconditioning is to transform an n\times n system of linear equations Ax=b to enable an iterative procedure to compute a sufficient approximation of x in fewer steps. Algebraically, preconditioning results in solving the system MAx=Mb, where M\in\mathbb{R}^{n\times n} is an approximation of A^{-1}. Since A^{-1} is dense in general, most popular general-purpose preconditioners are applied implicitly; e.g., the ILU preconditioner generates a lower (upper) triangular matrix L (U) such that A\approx LU. The preconditioner M=U^{-1}L^{-1} is then applied by forward/backward substitution [1].

III-A Approximate inverse matrices as preconditioners

Although the inverse of a sparse matrix is dense in general, a significant fraction of the entries in the inverse matrix often have small magnitudes [31, 32]. Approximate inverse preconditioners [33, 34, 35] exploit this fact and seek to compute an approximation M to A^{-1} that can be applied via an MVM operation. Computing M can be framed as an optimization problem that seeks to minimize \|I-MA\|_{F}. One popular method for computing M is the Sparse Approximate Inverse (SPAI) technique [35], which updates the sparsity pattern of M dynamically by repeatedly choosing new profitable indices for each column of M that lead to an improved approximation of A^{-1}. The quality of the approximation is controlled by computing the norm of the residual I(:,j)-AM(:,j) after each column update. The jth column of M has at most \mathtt{nnz}_{\mathrm{AI}} nonzero entries and satisfies \|AM(:,j)-I(:,j)\|_{F}\leq\mathtt{tol}_{\mathrm{AI}}, for a given threshold tolerance \mathtt{tol}_{\mathrm{AI}}\in\mathbb{R}.
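As a concrete illustration of the column-by-column least-squares formulation, the sketch below constructs a right approximate inverse with a fixed sparsity pattern taken from A, in the spirit of the a priori pattern approach of [34]; the adaptive pattern selection of SPAI [35] is omitted, and the function name static_spai is a placeholder.

    import numpy as np
    import scipy.sparse as sp

    def static_spai(A):
        # Least-squares approximate inverse M with the sparsity pattern of A:
        # for each column j, solve min_m ||A(:,J) m - e_j||_2, where J is the
        # nonzero pattern of A(:,j), and place the solution in M(J,j) so that AM ~ I.
        A = sp.csc_matrix(A)
        n = A.shape[1]
        rows, cols, vals = [], [], []
        for j in range(n):
            J = A[:, j].nonzero()[0]                 # allowed nonzeros of column j of M
            I = np.unique(A[:, J].nonzero()[0])      # rows of A touched by the columns in J
            rhs = (I == j).astype(float)             # e_j restricted to the rows in I
            m_j, *_ = np.linalg.lstsq(A[I][:, J].toarray(), rhs, rcond=None)
            rows.extend(J); cols.extend([j] * len(J)); vals.extend(m_j)
        return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

    # Example: compare ||I - AM||_F against ||I||_F = sqrt(50) for a 1D Laplacian.
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(50, 50), format="csc")
    M = static_spai(A)
    print(np.linalg.norm((sp.identity(50) - A @ M).toarray()), np.sqrt(50))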

An effective approximate inverse preconditioner M can still be considerably denser than the coefficient matrix A. Therefore, preconditioning must lead to a drastic improvement in convergence in order to reduce the overall wall-clock time. Applying approximate inverse preconditioners as proposed in this paper has the advantage that the computation time and energy consumption of preconditioning would be much lower on an analog crossbar array than on digital hardware, even at low precisions. Since MVM with M can be performed in O(1) time on an analog array regardless of the number of nonzeros in M, denser approximate inverses may be utilized without increasing the associated costs.

While analog crossbar arrays can speed up the application of the approximate inverse preconditioner, the accuracy of the application step is constrained by the various nondeterministic noises in the analog hardware discussed in Section II. This restricts the use of popular Krylov subspace approaches such as GMRES [36], which require a deterministic preconditioner. Although flexible variants [37] exist, in which the preconditioner can vary from one step to another, these variants generally require the orthonormalization of large subspaces and perform a relatively larger number of Floating Point Operations (FLOPs) outside the preconditioning step. Therefore, in this paper, we consider a stationary iterative scheme.

III-B Preconditioned Richardson iteration

Stationary methods approximate the solution of a linear system Ax=b by iteratively applying an update of the form x_{i}=Dx_{i-1}+d, where D\in\mathbb{R}^{n\times n} and d\in\mathbb{R}^{n} are fixed. Preconditioned Richardson iteration can be defined as

\alpha MAx = \alpha Mb, \qquad \alpha\in\mathbb{R},
x = x+\alpha Mb-\alpha MAx = (I-\alpha MA)x+\alpha Mb,
x_{i} = (I-\alpha MA)x_{i-1}+\alpha Mb,

which is identical to the form x_{i}=Dx_{i-1}+d, with D=I-\alpha MA and d=\alpha Mb [38]. The approximate solution x_{i} produced by preconditioned Richardson iteration during the ith iteration satisfies

x_{i}-x = x_{i-1}-x-\alpha MAx_{i-1}+\alpha MAx = (x_{i-1}-x)-\alpha MA(x_{i-1}-x) = (I-\alpha MA)(x_{i-1}-x),
\|x_{i}-x\| \leq \|I-\alpha MA\|\,\|x_{i-1}-x\| \leq \left(\|I-\alpha MA\|\right)^{i}\|x_{0}-x\|,

where the latter inequality holds for any vector norm and the corresponding induced matrix norm. A similar bound can also be shown for the residual vector r_{i}=b-Ax_{i}; i.e.,

\|r_{i}\|\leq\left(\|I-\alpha AM\|\right)^{i}\|r_{0}\|.

The above inequalities show that the sequence x_{i} will converge to the true solution x as long as the preconditioner M and scaling scalar \alpha are chosen such that \|I-\alpha MA\|<1. In fact, the iterates will converge to x as long as the spectral radius \rho(I-\alpha MA) of the matrix I-\alpha MA is strictly less than one. Nonetheless, we use the spectral norm since \rho(I-\alpha MA)\leq\|I-\alpha MA\|. If we denote the eigenvalues of the matrix MA by |\lambda_{1}|\leq\ldots\leq|\lambda_{n}|, then we need to pick \alpha such that |1-\alpha\lambda_{n}|<1, i.e., we need 0<\alpha<\dfrac{2\lambda_{n}^{*}}{|\lambda_{n}|^{2}}. We use \alpha=1 in the remainder of the paper.

III-C Preconditioning through memristive hardware

Algorithm 1 summarizes our hybrid Richardson iterations preconditioned by applying an approximate inverse through analog crossbar hardware. The iterative procedure terminates when either the maximum number of iterations m_{\mathrm{it}} is reached or the relative residual norm drops below a given threshold tolerance t_{\mathrm{ol}}\in\mathbb{R}. Each iteration of Algorithm 1 requires a sparse MVM with the coefficient matrix A (Line 5), two “scalar alpha times x plus y” (AXPY) [39] operations (Lines 5 and 7), a vector norm computation (Line 6), and an MVM with the approximate inverse preconditioner M (Line 7) computed in analog hardware. Therefore, of the 3n+2(\mathtt{nnz}(A)+\mathtt{nnz}(M)) total FLOPs per iteration, 2\mathtt{nnz}(M) can be performed in O(1) time on the analog hardware.

1: input: A\in\mathbb{R}^{n\times n};\ b,x_{0}\in\mathbb{R}^{n};\ t_{\mathrm{ol}}\in\mathbb{R};\ m_{\mathrm{it}}\in\mathbb{N}.
2: Construct approximate inverse preconditioner M\in\mathbb{R}^{n\times n};
3: Write M to the analog crossbar array;
4: for i=0 to m_{\mathrm{it}}-1 do
5:   r_{i}=b-Ax_{i}; // Digital MVM and AXPY
6:   If (\|r_{i}\|_{2}\leq t_{\mathrm{ol}}\|b\|) exit;
7:   x_{i+1}=x_{i}+Mr_{i}; // Analog MVM and digital AXPY
8: end for
9: return x_{i};
Algorithm 1 Richardson iterations with approximate inverse preconditioning on hybrid hardware
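The sketch below mirrors Algorithm 1 in NumPy, with a randomly perturbed matrix-vector product standing in for the analog application of M on Line 7; the noise model and its magnitudes are illustrative assumptions (in particular, the additive term is rescaled to the output magnitude to mimic per-call renormalization of the device range), not a faithful model of any specific hardware.

    import numpy as np

    def hybrid_richardson(A, b, M, tol=1e-5, m_it=50, noise=1e-2, rng=np.random.default_rng(0)):
        # Richardson iteration x_{i+1} = x_i + M r_i with a noisy preconditioner apply.
        def noisy_apply(v):
            y = M @ v                                            # exact product
            scale = np.linalg.norm(y) / np.sqrt(y.size)          # typical entry magnitude
            return (y * (1 + noise * rng.standard_normal(y.size))
                    + noise * scale * rng.standard_normal(y.size))

        x = np.zeros_like(b)
        for i in range(m_it):
            r = b - A @ x                                        # Line 5: digital MVM and AXPY
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):     # Line 6: convergence test
                break
            x = x + noisy_apply(r)                               # Line 7: analog MVM, digital AXPY
        return x, i

    # Example with a diagonally dominant tridiagonal matrix and a perturbed inverse as M.
    n = 100
    A = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    M = np.linalg.inv(A) + 1e-3 * np.random.default_rng(1).standard_normal((n, n))
    b = np.ones(n)
    x, its = hybrid_richardson(A, b, M)
    print(its, np.linalg.norm(b - A @ x) / np.linalg.norm(b))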

The ideal per-iteration speedup that the hybrid implementation can achieve over its digital counterpart is {\cal S}_{\mathrm{ideal}}=1+\dfrac{2\mathtt{nnz}(M)}{3n+2\mathtt{nnz}(A)}, which increases with the density \mathtt{nnz}(M) of the preconditioner. In reality, the hybrid implementation is likely to require more iterations to reach the same residual norm as the digital implementation because the noise introduced by the analog hardware leads to a less accurate application of the preconditioner M. More specifically, excluding the time spent on constructing M, the overall computational speedup achieved by the hybrid implementation is

{\cal S}_{\mathrm{tot}}=\dfrac{m_{\mathrm{d}}}{m_{\mathrm{h}}}\left(1+\dfrac{2\mathtt{nnz}(M)}{3n+2\mathtt{nnz}(A)}\right),

where m_{\mathrm{d}} and m_{\mathrm{h}} denote the number of digital and hybrid preconditioned Richardson iterations, respectively.
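Plugging the FE Square entries of Table II into these expressions gives a rough feel for the attainable speedup (a back-of-the-envelope calculation, not a measured timing):

    # Per-row densities and iteration counts for the FE Square problem (Table II).
    n = 625
    nnz_A = 4.4 * n                                 # nonzeros of A
    nnz_M = 93.5 * n                                # nonzeros of M
    m_d, m_h = 41, 44                               # digital vs. hybrid iteration counts

    s_ideal = 1 + 2 * nnz_M / (3 * n + 2 * nnz_A)   # ideal per-iteration speedup, ~16.8x
    s_tot = (m_d / m_h) * s_ideal                   # overall speedup, ~15.7x
    print(round(s_ideal, 1), round(s_tot, 1))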

III-D The effects of device noise

Since the MVM between the preconditioner M and the residual vector r_{i} of the ith iteration is computed through an analog crossbar array, the inherent inaccuracies of the analog device lead to an update of the form

x_{i+1}=x_{i}+(M+E)r_{i}+\widehat{N}^{Oa}

from Equation (2). Note that E and \widehat{N}^{Oa} vary stochastically and nondeterministically between iterations. The error norm \|x_{i+1}-x\| then satisfies the relationships

\|x_{i+1}-x\|\geq\|\widehat{N}^{Oa}\|

and

\|x_{i+1}-x\|\leq\|I-(M+E)A\|\,\|x_{i}-x\|.

The additive noise \widehat{N}^{Oa} has a direct impact on the lower bound of the error that can be achieved. Additionally, the norm of the error \|x_{i+1}-x\| will be smaller than \|x_{i}-x\| as long as \|I-(M+E)A\|<1, for which we have the upper bound

\|I-(M+E)A\|\leq\|I-MA\|+\|E\|\,\|A\|.

Under the assumption that the spectral norm of the matrix I-MA is less than one, i.e., \|I-MA\|<1, a sufficient condition for the spectral norm of the matrix I-(M+E)A to be less than one is

\|E\|<\dfrac{1-\|I-MA\|}{\|A\|}.

Additionally, for \|I-(M+E)A\|<1 to hold under the condition M=A^{-1}+\Delta, it suffices that \|E\|<\|A\|^{-1}-\|M-A^{-1}\|.

Consistent with the original deterministic analysis for iterative refinement by Moler [40] from a digital perspective, but adapted to Richardson iteration with noisy analog devices, a sufficiently small \|E\| in a stochastic sense guarantees the convergence of the preconditioned Richardson iteration process. In particular, we need to have \|E\|<\frac{1-\|I-MA\|}{\|A\|} hold in a sufficiently large number of iterations. To understand and ensure when these conditions will hold, we establish explicit bounds for \|E\| in terms of the statistical characteristics of the elements of the matrix E. We omit the details here and refer the interested reader to Appendix A.
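A small Monte Carlo experiment along the following lines can probe how often the sufficient condition \|E\|<(1-\|I-MA\|)/\|A\| holds under the noise model of Equation (3); the test matrix, noise scales, and trial count below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials, s_w, s_io = 64, 200, 5e-3, 1e-2      # illustrative size, trials, noise scales

    # A well-conditioned tridiagonal test matrix and a slightly perturbed inverse as M.
    A = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    M = np.linalg.inv(A) + 1e-3 * rng.standard_normal((n, n))
    threshold = (1 - np.linalg.norm(np.eye(n) - M @ A, 2)) / np.linalg.norm(A, 2)

    hits = 0
    for _ in range(trials):
        # Error matrix of Equation (3): elementwise noise on M plus additive write noise.
        N_wm = s_w * rng.standard_normal((n, n))
        N_im = s_io * rng.standard_normal(n)
        N_om = s_io * rng.standard_normal(n)
        E = M * (N_wm + N_im[None, :] + N_om[:, None]) + s_w * rng.standard_normal((n, n))
        hits += np.linalg.norm(E, 2) < threshold
    print(f"sufficient condition held in {hits}/{trials} trials")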

IV Simulation Experiments

Our experiments were conducted in a Matlab environment (version R2020b) with 64-bit arithmetic on a single core of a 2.3 GHz 8-Core Intel i9 machine with 64 GB of system memory. We used a Matlab version of the publicly available simulator [41] with a PyTorch interface for emulating the noise, timing, and energy characteristics of an analog crossbar array. The simulator models all sources of analog noise outlined in Section II as scaled Gaussian processes. Using Matlab notation, the components of the matrix write noise were modeled as \mathtt{randn}(\cdot)\times 5.0\mathtt{e}{-3}, and those of the input and output noises were both modeled as \mathtt{randn}(\cdot)\times 1.0\mathtt{e}{-2}; these are the default settings in the simulator based on currently realizable analog hardware [16]. The number of bits used in the DAC and ADC was set to 7 and 9, respectively. Table I shows the default parameters used throughout our experiments to simulate the analog device and to construct the approximate inverse preconditioner.
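As a rough illustration of what the DAC/ADC bit widths imply, the following sketch applies symmetric uniform quantization to the input and output vectors of an otherwise exact MVM; this is a simplified stand-in for the simulator's conversion stages (an assumption), not its actual implementation.

    import numpy as np

    def quantize(v, bits):
        # Symmetric uniform quantization of v to the given number of bits.
        scale = np.max(np.abs(v)) or 1.0
        levels = 2 ** (bits - 1) - 1
        return np.round(v / scale * levels) / levels * scale

    rng = np.random.default_rng(0)
    M, r = rng.random((64, 64)), rng.random(64)
    y_hat = quantize(M @ quantize(r, 7), 9)          # 7-bit DAC input, 9-bit ADC output
    print(np.linalg.norm(y_hat - M @ r) / np.linalg.norm(M @ r))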

TABLE I: Default parameters.
Module Parameter Value
Analog device N^{W} 5.0e-3
Analog device N^{I} 1.0e-2
Analog device N^{O} 1.0e-2
Richardson it. t_{\mathrm{ol}} 1.0e-5
Richardson it. m_{\mathrm{it}} 50
Approximate inverse \mathtt{nnz}_{\mathrm{AI}} 40\times\mathtt{nnz}(A)
Approximate inverse \mathtt{tol}_{\mathrm{AI}} 5.0e-2
Figure 2: Left: Sparsity patterns of the discretized Laplacian for the square mesh \Omega:=[0,1]^{2} (top) and the circular mesh \Omega:=\{x^{2}+y^{2}\leq 1\} (bottom). Right: Surface plots of the solutions u(x,y).

Our test problems originate from discretizations of model PDEs. The first two matrices are derived from a Finite Element (FE) discretization of Poisson’s equation on \Omega\subset\mathbb{R}^{2} with homogeneous Dirichlet boundary conditions

-\left(\dfrac{\partial^{2}}{\partial x^{2}}+\dfrac{\partial^{2}}{\partial y^{2}}\right)u(x,y)=f(x,y),\ \ u|_{\partial\Omega}=0, (4)

where \partial denotes the partial derivative with respect to an independent variable and f(x,y) denotes the load (or source) function. We consider two different domains: a square domain \Omega:=[0,1]^{2} and a circular domain \Omega=\{[x,y]\in\mathbb{R}^{2}\;|\;x^{2}+y^{2}\leq 1\} of radius one centered at the origin. The Laplace operator \left(\dfrac{\partial^{2}}{\partial x^{2}}+\dfrac{\partial^{2}}{\partial y^{2}}\right)u(x,y) is discretized by linear finite elements while the load vector is set equal to the constant source function f(x,y)\equiv 1. Figure 2 plots the sparsity patterns of the discretized Laplacian matrices A and the surfaces of the solutions u(x,y) for the square (n=625) and circular (n=362) meshes. Our third test matrix stems from a Finite Difference (FD) discretization of the Laplace operator in the unit cube with homogeneous Dirichlet boundary conditions and n=8^{3}. The average number of nonzero entries per row in the discretized sparse matrices A and in the corresponding preconditioners M are listed in Table II, which shows that the fill-in factor (i.e., the ratio \mathrm{nnz}(M)/\mathrm{nnz}(A)) ranges from 14 to 24. Our test matrices are relatively small because we are limited by the speed of the Matlab version of the simulator for the analog crossbar arrays; however, our results are meaningful because the relative advantage of analog hardware, in general, only increases with larger matrices.
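For reference, the third test matrix can be reproduced (up to scaling) with the standard Kronecker-product construction of the 7-point finite-difference Laplacian; the sketch below is a generic construction under that assumption, not the exact code used in the experiments.

    import scipy.sparse as sp

    def fd_laplacian_3d(m):
        # 7-point finite-difference Laplacian on an m x m x m interior grid with
        # homogeneous Dirichlet boundary conditions (n = m**3 unknowns).
        T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m), format="csr")
        I = sp.identity(m, format="csr")
        return (sp.kron(sp.kron(T, I), I) + sp.kron(sp.kron(I, T), I)
                + sp.kron(sp.kron(I, I), T)).tocsr()

    A = fd_laplacian_3d(8)                           # n = 8**3 = 512, as in Table II
    print(A.shape, A.nnz / A.shape[0])               # about 6.2 nonzeros per row on average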

TABLE II: Comparison of Richardson iterations with digital (M_{\mathrm{d}}) and hybrid (M_{\mathrm{h}}) preconditioning; m, m_{\mathrm{d}} and m_{\mathrm{h}} denote the number of non-preconditioned, digital, and hybrid preconditioned Richardson iterations, respectively; \kappa denotes the condition number of A.
Problem n \kappa \mathrm{nnz}(A)/n \mathrm{nnz}(M)/n \rho(I-A) \rho(I-M_{\mathrm{d}}A) \rho(I-M_{\mathrm{h}}A) m m_{\mathrm{d}} m_{\mathrm{h}} FLOPs(M_{\mathrm{d}}) FLOPs(M_{\mathrm{h}})
FE Square 625 3.4e+2 4.4 93.5 0.99 0.75 0.75 Failed 41 44 5.0e+6 3.1e+5
FE Circular 362 2.0e+2 5.6 86.6 0.97 0.55 0.55 Failed 21 23 1.4e+6 1.1e+5
FD 3D 512 6.2e+1 6.2 81.1 0.75 0.17 0.54 Failed 7 16 6.3e+5 1.2e+5
Figure 3: Relative residual norm achieved by Richardson iterations without and with digital/hybrid approximate inverse preconditioning.
Figure 4: FLOPs as a function of the achieved relative residual norm for each of the three problems in Table II.
Figure 5: FLOPs required to reach a certain accuracy by standard and preconditioned Richardson iterations as the maximum number of nonzero entries in M varies according to \mathtt{nnz}_{\mathrm{AI}}=20\times\mathtt{nnz}(A),\ 40\times\mathtt{nnz}(A), and 60\times\mathtt{nnz}(A).

In our experiments, we mainly compare the following iterative solvers: (a) Richardson iterations without preconditioning; (b) Richardson iterations with approximate inverse preconditioning applied in the IEEE 754 standard (denoted by “Richardson+d-ai”); and (c) Richardson iterations with approximate inverse preconditioning applied through simulated analog hardware (denoted by “Richardson+h-ai”). We observe from the results in Table II that standard Richardson iterations fail to converge within m_{\mathrm{it}}=50 iterations for all three test problems, but converge rapidly with approximate inverse preconditioning. This is not surprising since the spectral radius of the matrix I-A, i.e., \rho(I-A), is very close to one in all cases. The convergence behavior of the two preconditioned variants is almost identical for the FE matrices, but hybrid preconditioned Richardson iteration requires more than twice the number of iterations compared to the digital variant for the FD matrix. Overall, the hybrid preconditioned Richardson requires 5\times to 10\times fewer digital FLOPs compared to the purely digital version.

Figure 3 plots the relative residual norm after each Richardson iteration both without preconditioning and with the various preconditioners considered in this paper. As expected, hybrid preconditioning requires more iterations; however, this trade-off is highly beneficial because analog hardware allows for a rapid application of the preconditioner. This is evident in Figure 4, which shows that, on average, hybrid preconditioning requires substantially fewer digital FLOPs to converge. For reference, we have also included Richardson iterations with an ILU(0) preconditioner [1] applied in the IEEE 754 standard.

Figure 5 plots the number of digital FLOPs versus the relative residual norm for \mathtt{nnz}_{\mathrm{AI}} values of 20\times\mathtt{nnz}(A), 40\times\mathtt{nnz}(A), and 60\times\mathtt{nnz}(A). A denser preconditioner speeds up the convergence rate at the cost of additional FLOPs per iteration in the digital implementation. For the square FE mesh, the hybrid implementation yields a greater improvement in FLOPs compared to its digital counterpart as the preconditioner is made denser because the time to apply M on the analog device does not increase with \mathrm{nnz}(M). However, this trend is not observed for the other matrices and diminishes with increasing preconditioner density because the analog noises limit the accuracy of the MVM operation Mr even if M is highly accurate. As a result, the number of iterations does not always decline as \mathrm{nnz}(M) increases.

Figures 3-5 show that Richardson iterations preconditioned with approximate inverses on the hybrid architecture outperform preconditioned Richardson iterations on the digital architecture by greater margins as the condition number (listed in Table II) of A grows larger. Figure 5 also shows that our method can use denser preconditioners more effectively for matrices with higher condition numbers. These trends bode well for the potential use of our proposed preconditioning framework on a hybrid architecture in real-life problems that are likely to be larger and more ill-conditioned.

[Plot: Time breakdown; time (s) on a logarithmic scale versus \gamma, with curves for Construct M, Write M, MVM with M_{\mathtt{d}}, MVM with M_{\mathtt{h}}, and MVM with A.]
Figure 6: Timings of parts of Algorithm 1 as the maximum number of allowed nonzero entries \mathtt{nnz}_{\mathrm{AI}} in matrix M is set to \gamma\times\mathtt{nnz}(A).

Figure 6 plots the impact of varying the fill-in factor for the FE square matrix on the time required by the key operations of Algorithm 1, such as constructing M, writing it to the analog crossbar, performing an MVM with it in both the digital and hybrid settings, and performing a digital MVM with the coefficient matrix A. The sequential time for the construction of the preconditioner in Matlab is reported for academic purposes only; in practice, this step would utilize compiler optimizations and the ample availability of parallelism [34, 35] to finish much faster. The time for writing M is roughly equivalent to that of performing seven digital MVMs with it, implying that the benefits of the hybrid architecture start accruing after seven iterations. Note that the size of this matrix M is only 625\times 625. The current state of analog hardware technology is capable of realizing crossbar arrays of sizes up to 4000\times 4000 [16]. These can accommodate much larger preconditioner matrices for which the analog MVM times would still be similar to those in Figure 6 and the write times will increase as O(n), but the digital MVM times would be much higher, growing as O(n^{2}).

[Plot: number of Richardson iterations versus number of DAC bits (5, 7, 9, 11), with the number of ADC bits equal to the number of DAC bits plus 2, for the FD 3D, FE Square, and FE Circular problems.]
Figure 7: Number of iterations required by hybrid preconditioned Richardson as the number of DAC/ADC bits varies. For the 5-bit DAC, no convergence was obtained after 100 iterations for the FE matrices.

Some hardware parameters, such as the number of DAC and ADC bits, can have a profound impact on the time and energy consumption of the analog device, and therefore on its overall performance. We vary the number of DAC and ADC bits and plot the results in Figure 7 to justify our choice of 7 DAC bits and 9 ADC bits. Fewer bits tend to lead to divergence, while more bits result in increased time and energy consumption with little or no improvement in convergence, as the accuracy of the analog computations is limited by other sources of noise.

V Concluding remarks

With the slowing of Moore’s law [42] for digital microprocessors, there has been considerable interest in exploring alternatives, such as analog computing, for speeding up computationally expensive kernels. While simple analog crossbar arrays have been shown to be effective in many applications involving dense matrices, it has been challenging to address sparse matrix problems, such as solving sparse linear systems, because they do not naturally map to dense crossbar arrays and generally have a low tolerance for the stochastic errors pervasive in analog computing. This paper proposes, analyzes, and experimentally evaluates a preconditioning framework that exploits inexpensive MVM on analog arrays to substantially reduce the time and energy required for solving an important practical class of sparse linear systems. These and similar efforts are critical for meeting the continually growing computational demands of various applications of sparse solvers and for preparing these solvers for the fast but energy-constrained embedded and exascale [27] systems of the future.

Acknowledgments

We would like to thank Tayfun Gokmen, Malte Rasch, and Shashanka Ubaru for helpful discussions and input that substantially contributed to the content of this paper.

References

  • [1] Y. Saad, Iterative methods for sparse linear systems. SIAM, 2003.
  • [2] D. Bertaccini and S. Filippone, “Sparse approximate inverse preconditioners on high performance GPU platforms,” Computers & Mathematics with Applications, vol. 71, no. 3, pp. 693–711, 2016.
  • [3] A. Abdelfattah, H. Anzt, E. G. Boman, E. Carson, T. Cojean, J. Dongarra, M. Gates, T. Grützmacher, N. J. Higham, S. Li et al., “A survey of numerical methods utilizing mixed precision arithmetic,” arXiv preprint arXiv:2007.06674, 2020.
  • [4] K. L. Oo and A. Vogel, “Accelerating geometric multigrid preconditioning with half-precision arithmetic on GPUs,” arXiv preprint arXiv:2007.07539, 2020.
  • [5] M. Baboulin, A. Buttari, J. Dongarra, J. Kurzak, J. Langou, J. Langou, P. Luszczek, and S. Tomov, “Accelerating scientific computations with mixed precision algorithms,” Computer Physics Communications, vol. 180, no. 12, pp. 2526–2533, 2009.
  • [6] A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, “Harnessing GPU tensor cores for fast fp16 arithmetic to speed up mixed-precision iterative refinement solvers,” in SC18: Int. Conf. for High Perf. Computing, Networking, Storage and Analysis. IEEE, 2018, pp. 603–613.
  • [7] H. Anzt, M. Gates, J. Dongarra, M. Kreutzer, G. Wellein, and M. Köhler, “Preconditioned krylov solvers on GPUs,” Parallel Computing, vol. 68, pp. 32–44, 2017.
  • [8] A. Haidar, H. Bayraktar, S. Tomov, J. Dongarra, and N. J. Higham, “Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems,” Proceedings of the Royal Society A, vol. 476, no. 2243, p. 20200110, 2020.
  • [9] M. Hu et al., “Dot-product engine for neuromorphic computing: Programming 1t1m crossbar to accelerate matrix-vector multiplication,” in 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 2016, pp. 1–6.
  • [10] L. Xia, P. Gu, B. Li, T. Tang, X. Yin, W. Huangfu, S. Yu, Y. Cao, Y. Wang, and H. Yang, “Technological exploration of rram crossbar array for matrix-vector multiplication,” Journal of Computer Science and Technology, vol. 31, no. 1, pp. 3–19, 2016.
  • [11] A. Sebastian, M. L. G. Le Gallo, R. Khaddam-Aljameh, and E. Eleftheriou, “Memory devices and applications for in-memory computing,” Nature Nanotechnology, vol. 15, pp. 529–544, 2020.
  • [12] W. Haensch, T. Gokmen, and R. Puri, “The next generation of deep learning hardware: Analog computing,” Proceedings of the IEEE, vol. 107, pp. 108–122, 2019.
  • [13] M. J. Rasch, T. Gokmen, M. Rigotti, and W. Haensch, “RAPA-ConvNets: Modified convolutional networks for accelerated training on architectures with analog arrays,” Frontiers in Neuroscience, vol. 13, p. 753, 2019.
  • [14] S. Ambrogio et al., “Equivalent-accuracy accelerated neural-network training using analogue memory,” Nature, vol. 558, pp. 60–67, 2018.
  • [15] A. Fumarola, P. Narayanan, L. L. Sanches, S. Sidler, J. Jang, and K. Moon, “Accelerating machine learning with non-volatile memory: exploring device and circuit tradeoffs,” in IEEE International Conference on Rebooting Computing (ICRC), 2016, pp. 1–8.
  • [16] T. Gokmen and Y. Vlasov, “Acceleration of deep neural network training with resistive cross-point devices: Design considerations,” Frontiers in Neuroscience, vol. 10, 2016.
  • [17] G. W. Burr, R. M. Shelby, A. Sebastian, S. Kim, and S. Sidler, “Neuromorphic computing using non-volatile memory,” Advances in Physics, vol. 2, pp. 89–124, 2016.
  • [18] M. N. Bojnordi and E. Ipek, “Memristive Boltzmann machine: A hardware accelerator for combined optimization and deep learning,” in Int. Symp. on High Performance Computer Architecture (HPCA), 2016.
  • [19] A. Shafiee et al., “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in International Symposium on Computer Architecture (ISCA), 2016.
  • [20] B. Feinberg, R. Wong, T. P. Xiao, C. H. Bennett, J. N. Rohan, E. G. Boman, M. J. Marinella, S. Agarwal, and E. Ipek, “An analog preconditioner for solving linear systems,” in 2021 IEEE Int. Symp. on High-Performance Computer Architecture (HPCA). IEEE, pp. 761–774.
  • [21] B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, “Enabling scientific computing on memristive accelerators,” in International Symposium on Computer Architecture (ISCA), 2018, pp. 367–382.
  • [22] M. A. Zidan, Y. Jeong, J. Lee, B. Chen, S. Huang, M. J. Kushner, and W. D. Lu, “A general memristor-based partial differential equation solver,” Nature Electronics, vol. 1, no. 7, pp. 411–420, 2018.
  • [23] Z. Sun, G. Pedretti, E. Ambrosi, A. Bricalli, W. Wang, and D. Ielmini, “Solving matrix equations in one step with cross-point resistive arrays,” Proceedings of the National Academy of Sciences, vol. 116, no. 10, pp. 4123–4128, 2019.
  • [24] I. Richter, K. Pas, X. Guo, R. Patel, J. Liu, E. Ipek, and E. G. Friedman, “Memristive accelerator for extreme scale linear solvers,” in Government Microcircuit Applications & Critical Technology Conference (GOMACTech), 2015.
  • [25] M. Le Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma, C. Bekas, A. Curioni, and E. Eleftheriou, “Mixed-precision in-memory computing,” Nature Electronics, vol. 1, no. 4, pp. 246–253, 2018.
  • [26] Z. Zhu, A. B. Klein, G. Li, and S. Pang, “Fixed-point iterative linear inverse solver with extended precision,” arXiv preprint arXiv:2105.02106, 2021.
  • [27] H. Anzt et al., “Preparing sparse solvers for exascale computing,” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 378, no. 20190053, 2020.
  • [28] J. Brown, M. G. Knepley, D. A. May, L. C. McInnes, and B. Smith, “Composable linear solvers for multiphysics,” in 11th International Symposium on Parallel and Distributed Computing, 2012, pp. 55–62.
  • [29] L. C. McInnes, B. Smith, H. Zhang, and R. T. Mills, “Hierarchical and nested Krylov methods for extreme-scale computing,” Parallel Computing, vol. 40, no. 1, pp. 17–31, 2014.
  • [30] L. Chua, “Memristor-the missing circuit element,” IEEE Transactions on Circuit Theory, vol. 18, no. 5, pp. 507–519, 1971.
  • [31] M. Benzi and G. H. Golub, “Bounds for the entries of matrix functions with applications to preconditioning,” BIT Numerical Mathematics, vol. 39, no. 3, pp. 417–438, 1999.
  • [32] S. Demko, W. F. Moss, and P. W. Smith, “Decay rates for inverses of band matrices,” Mathematics of computation, vol. 43, no. 168, pp. 491–499, 1984.
  • [33] M. Benzi and M. Tuma, “A comparative study of sparse approximate inverse preconditioners,” Applied Numerical Mathematics, vol. 30, no. 2-3, pp. 305–340, 1999.
  • [34] E. Chow, “Parallel implementation and practical use of sparse approximate inverse preconditioners with a priori sparsity patterns,” The International Journal of High Performance Computing Applications, vol. 15, no. 1, pp. 56–74, 2001.
  • [35] M. J. Grote and T. Huckle, “Parallel preconditioning with sparse approximate inverses,” SIAM Journal on Scientific Computing, vol. 18, no. 3, pp. 838–853, 1997.
  • [36] Y. Saad and M. H. Schultz, “GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems,” SIAM J. on scientific and statistical computing, vol. 7, no. 3, pp. 856–869, 1986.
  • [37] Y. Saad, “A flexible inner-outer preconditioned GMRES algorithm,” SIAM J. on Scientific Computing, vol. 14, no. 2, pp. 461–469, 1993.
  • [38] L. F. Richardson, “IX. The approximate arithmetical solution by finite differences of physical problems involving differential equations, with an application to the stresses in a masonry dam,” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 210, no. 459-470, pp. 307–357, 1911.
  • [39] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, “Basic Linear Algebra Subprograms for Fortran usage,” ACM Transactions on Mathematical Software, vol. 5, no. 3, pp. 308–323, 1979.
  • [40] C. Moler, “Iterative refinement in floating point,” Journal of the ACM, vol. 14, no. 2, pp. 316–321, 1967.
  • [41] M. J. Rasch, D. Moreda, T. Gokmen, M. L. Gallo, F. Carta, C. Goldberg, K. E. Maghraoui, A. Sebastian, and V. Narayanan, “A flexible and fast pytorch toolkit for simulating training and inference on analog crossbar arrays,” 2021.
  • [42] J. Shalf, “The future of computing beyond Moore’s law,” Phil. Trans. R. Soc. A., vol. 378, no. 20190061, 2020.
  • [43] J. Wilkinson, Rounding Errors in Algebraic Processes. Prentice Hall, 1963.
  • [44] J. Merikoski and R. Kumar, “Lower bounds for the spectral norm,” Journal of Inequalities in Pure and Applied Mathematics, vol. 6, no. 4, 2005, article 124.
  • [45] P. Billingsley, Probability and Measure, 3rd ed. John Wiley & Sons, 1995.

Appendix A Error and Convergence Analysis

A-A Error in MVM Due to Device Noise

Performing the MVM operation y=Mr with a matrix M\in\mathbb{R}^{n\times n} and vector r\in\mathbb{R}^{n} on an analog crossbar array involves multiple sources of nondeterministic noise so that the output \widehat{y}\in\mathbb{R}^{n} is equivalent to a low precision approximation of y. Writing M to the crossbar array incurs multiplicative and additive write noises N^{Wm}\in\mathbb{R}^{n\times n} and N^{Wa}\in\mathbb{R}^{n\times n}, respectively, and the actual conductance values at the crosspoints of the array are given by \widehat{M}=M\odot(I+N^{Wm})+N^{Wa}, where \odot denotes element-wise multiplication. Similarly, digital-to-analog conversion (DAC) of the vector r into voltage pulses suffers from multiplicative and additive input noises N^{Im}\in\mathbb{R}^{n} and N^{Ia}\in\mathbb{R}^{n}, respectively. As a result, the matrix \widehat{M} is effectively multiplied by a perturbed version of r given by \widehat{r}=r\odot(\mathbf{1}+N^{Im})+N^{Ia}, where \mathbf{1} is a vector of all ones.

A characteristic equation to describe the output \widehat{y} of an analog MVM Mr is given by

\widehat{y}=\left((M\odot(I+N^{Wm})+N^{Wa})(r\odot(\mathbf{1}+N^{Im})+N^{Ia})\right)\odot(\mathbf{1}+N^{Om})+N^{Oa}, (5)

where N^{Om}\in\mathbb{R}^{n} and N^{Oa}\in\mathbb{R}^{n} denote the multiplicative and additive components of the output noise, respectively, which reflect the inherent inexactness of the multiplication based on circuit laws and current integration, as well as the loss of precision in the ADC conversion of the result vector.

Equation (5) can be rearranged as

\widehat{y}=\left((M\odot(I+N^{Wm})+N^{Wa})r\odot(\mathbf{1}+N^{Im})\right)\odot(\mathbf{1}+N^{Om})+(M\odot(I+N^{Wm})N^{Ia}+N^{Wa}N^{Ia})\odot(\mathbf{1}+N^{Om})+N^{Oa},

or more concisely

\widehat{y}=\left(M\odot(I+N^{Wm})+N^{Wa}\right)r\odot(\mathbf{1}+N^{Im})\odot(\mathbf{1}+N^{Om})+\widehat{N}^{Oa}, (6)

where \widehat{N}^{Oa} is the overall effective additive noise (M\odot(I+N^{Wm})N^{Ia}+N^{Wa}N^{Ia})\odot(\mathbf{1}+N^{Om})+N^{Oa} and captures the lower order noise terms. The ith entry \widehat{y}[i] of \widehat{y} in Equation (6) is given by

\widehat{y}[i]=\left(\sum_{j=1}^{n}\left(M[i,j](1+N^{Wm}[i,j])+N^{Wa}[i,j]\right)r[j](1+N^{Im}[j])(1+N^{Om}[i])\right)+\widehat{N}^{Oa}[i],

and upon ignoring the products of components of N^{Wm}, N^{Im} and N^{Om}, we then have

\widehat{y}[i]=\left(\sum_{j=1}^{n}\left(M[i,j](1+N^{Wm}[i,j]+N^{Im}[j]+N^{Om}[i])+N^{Wa}[i,j]\right)r[j]\right)+\widehat{N}^{Oa}[i]. (7)

Rewriting Equation (7) in matrix form yields

\widehat{y}=\left(M\odot(I+N^{Wm}+\mathbf{1}\otimes N^{Im}+N^{Om}\otimes\mathbf{1})+N^{Wa}\right)r+\widehat{N}^{Oa},

where \otimes denotes outer product, or more concisely

\widehat{y}=(M+E)r+\widehat{N}^{Oa}, (8)

where the nondeterministic error matrix E is given by

E=M\odot(N^{Wm}+\mathbf{1}\otimes N^{Im}+N^{Om}\otimes\mathbf{1})+N^{Wa}. (9)

A-B Wilkinson and Moler’s Deterministic Analysis of Iterative Refinement from a Digital Perspective

The iterative refinement (IR) scheme, proposed by Wilkinson in 1948 [43], is often used to reduce errors in solving linear systems Ax=b, especially when the matrix A is ill-conditioned. This IR approach consists of the use of an inaccurate solver for the most expensive operations, namely solving the residual equation Ad=r\triangleq b-Ax (with time complexity O(n^{3})), and then updating the estimate x and computing the next residual (both with time complexity O(n^{2})) using forms of higher precision arithmetic. Error bounds for the IR algorithm were subsequently analyzed by Moler in 1967 [40]. In particular, it was shown by Wilkinson and Moler that if the condition number of the matrix A is not too large and the residual is computed at double the working precision, then the IR algorithm will converge to the true solution within working precision.

More precisely, the IR algorithm comprises the following steps:

  1. Solve Ax=b;
  2. Compute residual r_{m}=b-Ax_{m};
  3. Solve Ad_{m}=r_{m} using a basic (inaccurate computation) method;
  4. Update x_{m+1}=x_{m}+d_{m};
  5. Repeat until stopping criteria satisfied.

Although there are many variants of the above IR algorithm, our primary focus here is on the use of analog hardware to compute the solution of the linear system Ad_{m}=r_{m} in step 3, rendering the benefits of greater efficiency and/or lower power consumption, but at the expense of lower and noisier precision. All other steps in the above algorithm are assumed to be computed on a digital computer at some form of higher precision.

We end this brief summary of previous work by providing a high-level overview of the original error and convergence analysis due to Moler [40]. Consider the IR algorithm and focus on the following three steps for each iteration m:

  1. r_{m}=b-Ax_{m}+c_{m};
  2. A(I+F_{m})d_{m}=r_{m};
  3. x_{m+1}=x_{m}+d_{m}+g_{m}.

Here, c_{m}, F_{m} and g_{m} are corrections that allow for the above equations to hold exactly, since the actual computations are performed on hardware that is inexact. For now, to simplify the exposition of our analysis, let us assume g_{m}=c_{m}=0, i.e., steps 1 and 3 are computed with perfect precision; positive instances of these precision constants can be readily incorporated subsequently in our analysis.

Suppose it can be shown that \|F_{m}\|<1,\;\forall m, and hence the matrices I+F_{m} are nonsingular. Furthermore, we have

(I+F_{m})^{-1}F_{m}=\sum_{i=0}^{\infty}(-1)^{i}F_{m}^{i+1},

which implies

\|(I+F_{m})^{-1}F_{m}\|\leq\sum_{i=0}^{\infty}\|F_{m}\|^{i+1}=\dfrac{\|F_{m}\|}{1-\|F_{m}\|}. (10)

Then, letting x=A^{-1}b be the true solution, it is possible to algebraically establish that

x_{m+1}-x=(I+F_{m})^{-1}F_{m}(x_{m}-x),

from which it follows

\|x_{m+1}-x\|\leq\dfrac{\|F_{m}\|}{1-\|F_{m}\|}\|x_{m}-x\|.

This therefore leads to a contraction when \|F_{m}\| is sufficiently small (in particular, \|F_{m}\|<1/2,\;\forall m), thus guaranteeing convergence with a rate corresponding to the above equation.
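A compact simulation of this 3-step scheme, with a freshly perturbed inner solve playing the role of the inexact step 3, might look as follows; the perturbation model and the test matrix are illustrative assumptions.

    import numpy as np

    def iterative_refinement(A, b, steps=10, noise=1e-2, rng=np.random.default_rng(0)):
        # Wilkinson/Moler-style iterative refinement with an inexact inner solve:
        # step 3 solves (A + K_m) d_m = r_m with a fresh random K_m at every sweep.
        n = len(b)
        x = np.zeros(n)
        history = []
        for _ in range(steps):
            r = b - A @ x                                     # step 2: residual (accurate)
            K = noise * rng.standard_normal((n, n))           # per-iteration perturbation
            d = np.linalg.solve(A + K, r)                     # step 3: inexact solve
            x = x + d                                         # step 4: update (accurate)
            history.append(np.linalg.norm(r) / np.linalg.norm(b))
        return x, history

    # Example: with a well-conditioned matrix the relative residual contracts by
    # roughly ||F_m|| / (1 - ||F_m||) per sweep.
    n = 64
    A = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    x, hist = iterative_refinement(A, np.ones(n))
    print(["%.1e" % h for h in hist])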

A-C Analysis of Iterative Refinement from an Analog Perspective

Adapting the above error and convergence analysis due to Moler [40] for our objectives herein, we start by modeling the writing of the n\times n matrix A into the analog hardware before beginning the 3-step iterative process. This results in a one-time write error that we model as

\widetilde{A}=A+\widetilde{J}, (11)

where \widetilde{J} is a matrix whose elements are independent and identically distributed (i.i.d.) according to a general distribution {\mathcal{D}}_{J}(\mu_{J},\sigma_{J}^{2}) (not necessarily Gaussian) having finite mean \mu_{J} and finite variance \sigma_{J}^{2}. For the analog devices under consideration [41], \mu_{J}=0.

Next, we consider step 2 of the 3-step iterative process in Appendix A-B as being the only one performed on inaccurate (analog) hardware. We want to model in a general fashion the overall noise that is incurred when solving the linear system in step 2 of each iteration m. To this end, let us assume that the computation of d_{m} in step 2 of iteration m satisfies

(\widetilde{A}+\widetilde{K}_{m})d_{m}=r_{m}, (12)

where \widetilde{K}_{m} are matrices whose elements are i.i.d. according to a general distribution {\mathcal{D}}_{K}(\mu_{K},\sigma_{K}^{2}) (not necessarily Gaussian) having finite mean \mu_{K} and finite variance \sigma_{K}^{2}.

We seek to establish bounds on \|\widetilde{K}_{m}\|. Even though all matrix norms are equivalent in a topological sense, it is convenient and prudent for our purposes to choose matrix norms with which it is easy to work. Let us consider the max norm \|\cdot\|_{\max}, the Euclidean distance operator norm \|\cdot\|_{2} and the Frobenius matrix norm \|\cdot\|_{F}, taking advantage of the known bounds

\sqrt{\dfrac{|r_{1}|^{2}+\cdots+|r_{n}|^{2}}{n}}\leq\|\widetilde{K}_{m}\|_{\max}\leq\|\widetilde{K}_{m}\|_{2}\leq\|\widetilde{K}_{m}\|_{F},

where r_{1},\ldots,r_{n} are the row sums of \widetilde{K}_{m}; refer to [44] for this lower bound. Focusing on the norm \|\cdot\| being either \|\cdot\|_{\max} or \|\cdot\|_{2}, we then derive

{\mathbb{E}}(\|\widetilde{K}_{m}\|)\leq{\mathbb{E}}(\|\widetilde{K}_{m}\|_{F})={\mathbb{E}}\left(\sqrt{\sum_{ij}(\widetilde{K}_{m})_{ij}^{2}}\right)\leq\sqrt{{\mathbb{E}}\left(\sum_{ij}(\widetilde{K}_{m})_{ij}^{2}\right)}=\sqrt{\sum_{ij}{\mathbb{E}}((\widetilde{K}_{m})_{ij}^{2})}=\sqrt{n^{2}{\mathbb{E}}((\widetilde{K}_{m})_{11}^{2})}=\sqrt{n^{2}(\mu_{K}^{2}+\sigma_{K}^{2})}=n\sqrt{\mu_{K}^{2}+\sigma_{K}^{2}}, (13)

where the first inequality follows from the above matrix-norm upper bound, the first equality is by definition of the Frobenius matrix norm, the second inequality is due to Jensen’s inequality, the second equality is a basic property of the expectation operator, the third equality is due to the i.i.d. assumption, the fourth equality is by definition of the second moment, the (i,j)th element of the matrix \widetilde{K}_{m} is denoted by (\widetilde{K}_{m})_{ij}, and expectation is with respect to {\mathcal{D}}_{K}(\cdot,\cdot).

Let us next consider the variance of \|\widetilde{K}_{m}\|, more specifically {\mathbb{V}ar}(\|\widetilde{K}_{m}\|)={\mathbb{E}}(\|\widetilde{K}_{m}\|^{2})-{\mathbb{E}}(\|\widetilde{K}_{m}\|)^{2}. To this end, we focus initially on the second moment {\mathbb{E}}(\|\widetilde{K}_{m}\|^{2}) and follow along the lines of the derivation of Equation (13) above, to obtain

{\mathbb{E}}(\|\widetilde{K}_{m}\|^{2})\leq{\mathbb{E}}(\|\widetilde{K}_{m}\|^{2}_{F})={\mathbb{E}}\left(\left(\sqrt{\sum_{ij}(\widetilde{K}_{m})_{ij}^{2}}\right)^{2}\right)={\mathbb{E}}\left(\sum_{ij}(\widetilde{K}_{m})_{ij}^{2}\right)=\sum_{ij}{\mathbb{E}}((\widetilde{K}_{m})_{ij}^{2})=n^{2}{\mathbb{E}}((\widetilde{K}_{m})_{11}^{2})=n^{2}(\mu_{K}^{2}+\sigma_{K}^{2}), (14)

where expectation is again with respect to {\mathcal{D}}_{K}(\cdot,\cdot). Now we turn to a lower bound on {\mathbb{E}}(\|\widetilde{K}_{m}\|), for which we exploit the above matrix-norm lower bound to conclude

{\mathbb{E}}(\|\widetilde{K}_{m}\|)\geq{\mathbb{E}}\left(\sqrt{\dfrac{\sum_{i}|\sum_{j}(\widetilde{K}_{m})_{ij}|^{2}}{n}}\right).

Then we derive for the numerator on the right hand side

\sum_{i}\Big{|}\sum_{j}(\widetilde{K}_{m})_{ij}\Big{|}^{2}=\sum_{i}\bigg{(}n\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n}\bigg{)}^{2}
=n^{2}\left\{\sum_{i}\bigg{(}\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n}-\sum_{k}\sum_{j}\dfrac{(\widetilde{K}_{m})_{kj}}{n^{2}}\bigg{)}^{2}+2\bigg{(}\sum_{i}\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n^{2}}\bigg{)}\bigg{(}\sum_{i}\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n}\bigg{)}-n\bigg{(}\sum_{i}\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n^{2}}\bigg{)}^{2}\right\}
=n^{2}\left\{\sum_{i}\bigg{(}\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n}-\sum_{k}\sum_{j}\dfrac{(\widetilde{K}_{m})_{kj}}{n^{2}}\bigg{)}^{2}+2n\bigg{(}\sum_{i}\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n^{2}}\bigg{)}^{2}-n\bigg{(}\sum_{i}\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n^{2}}\bigg{)}^{2}\right\}
=n^{2}\left\{\sum_{i}\bigg{(}\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n}-\sum_{k}\sum_{j}\dfrac{(\widetilde{K}_{m})_{kj}}{n^{2}}\bigg{)}^{2}+n\bigg{(}\sum_{i}\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n^{2}}\bigg{)}^{2}\right\}
\geq n^{3}\bigg{(}\sum_{i}\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n^{2}}\bigg{)}^{2},

from which we obtain

{\mathbb{E}}(\|\widetilde{K}_{m}\|)\geq{\mathbb{E}}\left(\sqrt{\dfrac{\sum_{i}|\sum_{j}(\widetilde{K}_{m})_{ij}|^{2}}{n}}\right)\geq{\mathbb{E}}\left(\sqrt{\dfrac{n^{3}\big{(}\sum_{i}\sum_{j}(\widetilde{K}_{m})_{ij}/n^{2}\big{)}^{2}}{n}}\right)={\mathbb{E}}\left(\sqrt{n^{2}\bigg{(}\sum_{i}\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n^{2}}\bigg{)}^{2}}\right)\geq n{\mathbb{E}}\bigg{(}\Big{|}\sum_{i}\sum_{j}\dfrac{(\widetilde{K}_{m})_{ij}}{n^{2}}\Big{|}\bigg{)}\geq n{\mathbb{E}}\bigg{(}\dfrac{1}{n^{2}}\sum_{i}\sum_{j}(\widetilde{K}_{m})_{ij}\bigg{)}=n\mu_{K}. (15)

Upon combining Equations (14) and (15), this yields

{\mathbb{V}ar}(\|\widetilde{K}_{m}\|)={\mathbb{E}}(\|\widetilde{K}_{m}\|^{2})-{\mathbb{E}}(\|\widetilde{K}_{m}\|)^{2}\leq n^{2}(\mu_{K}^{2}+\sigma_{K}^{2})-n^{2}\mu_{K}^{2}=n^{2}\sigma_{K}^{2}. (16)

Hence, under the assumption that the elements of \widetilde{K}_{m} are i.i.d. random variables with general distribution {\mathcal{D}}_{K}(\mu_{K},\sigma_{K}^{2}), it follows from the above analysis that

0\leq{\mathbb{E}}(\|\widetilde{K}_{m}\|)\leq n\sqrt{\mu_{K}^{2}+\sigma_{K}^{2}}\qquad\mbox{ and }\qquad{\mathbb{V}ar}(\|\widetilde{K}_{m}\|)\leq n^{2}\sigma_{K}^{2}. (17)
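The bounds in Equation (17) are straightforward to sanity check numerically; the sketch below draws \widetilde{K}_{m} with i.i.d. Gaussian entries (one admissible choice of {\mathcal{D}}_{K}) and compares the sample mean and variance of the spectral norm against the bounds, with illustrative parameter values.

    import numpy as np

    rng = np.random.default_rng(0)
    n, mu, sigma, trials = 50, 1e-3, 1e-2, 500       # illustrative parameters

    norms = np.array([np.linalg.norm(mu + sigma * rng.standard_normal((n, n)), 2)
                      for _ in range(trials)])
    print("mean %.3f <= %.3f" % (norms.mean(), n * np.hypot(mu, sigma)))   # Eq. (17), left
    print("var  %.5f <= %.5f" % (norms.var(), (n * sigma) ** 2))           # Eq. (17), right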

Next we turn to complete the error and convergence analysis that covers the iterative process with per-iteration errors modeled via \widetilde{K}_{m}. From step 2 of the original 3-step IR algorithm together with Equation (12), we have

\widetilde{F}_{m}=\widetilde{A}^{-1}\widetilde{K}_{m}.

Letting \kappa(\widetilde{A}) denote an upper bound on {\mathbb{E}}(\|\widetilde{A}^{-1}\|), together with Equation (17) and recalling that \widetilde{J} and \widetilde{K}_{m} are independent by definition, we then derive

{\mathbb{E}}(\|\widetilde{F}_{m}\|)\leq{\mathbb{E}}(\|\widetilde{A}^{-1}\|)\,{\mathbb{E}}(\|\widetilde{K}_{m}\|)\leq\kappa(\widetilde{A})\left(n\sqrt{\mu_{K}^{2}+\sigma_{K}^{2}}\right) (18)

and

{\mathbb{V}ar}(\|\widetilde{F}_{m}\|)\leq{\mathbb{V}ar}(\|\widetilde{A}^{-1}\|\,\|\widetilde{K}_{m}\|)=\|\widetilde{A}^{-1}\|^{2}\,{\mathbb{V}ar}(\|\widetilde{K}_{m}\|)\leq\kappa(\widetilde{A})^{2}\left(n^{2}\sigma_{K}^{2}\right), (19)

which also renders

{\mathbb{E}}(\|\widetilde{F}_{m}\|^{2})\leq\kappa(\widetilde{A})^{2}\,n^{2}\sigma_{K}^{2}+\kappa(\widetilde{A})^{2}n^{2}(\mu_{K}^{2}+\sigma_{K}^{2})=\kappa(\widetilde{A})^{2}\,n^{2}(\mu_{K}^{2}+2\sigma_{K}^{2}).

Recall from the original analysis above due to Moler that

x_{m+1}-x=(I+F_{m})^{-1}F_{m}(x_{m}-x), (20)

from which we have by definition for our purposes

{\mathbb{P}}\left(\dfrac{\|\widetilde{x}_{m+1}-\widetilde{x}\|}{\|\widetilde{x}_{m}-\widetilde{x}\|}>z\right)\leq{\mathbb{P}}\left(\|(I+\widetilde{F}_{m})^{-1}\widetilde{F}_{m}\|>z\right),\qquad\forall z,

where $\widetilde{x}$ denotes the solution to $\widetilde{A}\widetilde{x}=b$. We further obtain from Equation (20)

\widetilde{x}_{m+1}-\widetilde{x}=(I+\widetilde{F}_{m})^{-1}\widetilde{F}_{m}(\widetilde{x}_{m}-\widetilde{x})=\left(\prod_{k=0}^{m}(I+\widetilde{F}_{k})^{-1}\widetilde{F}_{k}\right)(\widetilde{x}_{0}-\widetilde{x}),

which then leads to

\displaystyle\|\widetilde{x}_{m+1}-\widetilde{x}\|\leq\left\|\prod_{k=0}^{m}(I+\widetilde{F}_{k})^{-1}\widetilde{F}_{k}\right\|\,\|\widetilde{x}_{0}-\widetilde{x}\|\leq\prod_{k=0}^{m}\|(I+\widetilde{F}_{k})^{-1}\widetilde{F}_{k}\|\,\|\widetilde{x}_{0}-\widetilde{x}\| (21)
\displaystyle\log\|\widetilde{x}_{m+1}-\widetilde{x}\|\leq\sum_{k=0}^{m}\log\|(I+\widetilde{F}_{k})^{-1}\widetilde{F}_{k}\|+\log\|\widetilde{x}_{0}-\widetilde{x}\|, (22)

where the second inequality follows from the submultiplicativity of the matrix norm, and Equation (22) follows by taking logarithms of both sides. Let $\widetilde{x}_{*}$ denote the asymptotic limit of $\widetilde{x}_{m}$, namely $\widetilde{x}_{m}\rightarrow\widetilde{x}_{*}$ as $m\rightarrow\infty$. Then, taking the limit as $m\rightarrow\infty$, it follows from the above that

\displaystyle\log\|\widetilde{x}_{*}-\widetilde{x}\|\leq\sum_{m=0}^{\infty}\log\|(I+\widetilde{F}_{m})^{-1}\widetilde{F}_{m}\|+\log\|\widetilde{x}_{0}-\widetilde{x}\|\leq\sum_{m=0}^{\infty}\log\dfrac{\|\widetilde{F}_{m}\|}{1-\|\widetilde{F}_{m}\|}+C
\displaystyle=\sum_{m=0}^{\infty}\left(\log\|\widetilde{F}_{m}\|-\log(1-\|\widetilde{F}_{m}\|)\right)+C, (23)

where the second inequality is due to Equation (10) and $C$ is a finite constant.

Taking the expectation of both sides of Equation (23) yields

\displaystyle{\mathbb{E}}(\log\|\widetilde{x}_{*}-\widetilde{x}\|)\leq{\mathbb{E}}\left(\sum_{m=0}^{\infty}\left(\log\|\widetilde{F}_{m}\|-\log(1-\|\widetilde{F}_{m}\|)\right)+C\right)
\displaystyle=\sum_{m=0}^{\infty}{\mathbb{E}}\left(\log\|\widetilde{F}_{m}\|-\log(1-\|\widetilde{F}_{m}\|)\right)+C. (24)

Consistent with the original analysis of Moler, we assume that the support of $\|\widetilde{F}_{m}\|$ lies between $0$ and $1$, i.e., the distribution of $\widetilde{F}_{m}$ is such that $0\leq\|\widetilde{F}_{m}\|<1$. Further assume that the distribution of $\|\widetilde{F}_{m}\|$ is symmetric around its mean, with ${\mathbb{E}}(\|\widetilde{F}_{m}\|)$ bounded away from $\frac{1}{2}$. Then, due to the antisymmetry of $\log(x)-\log(1-x)$ about $x=\frac{1}{2}$ for $0<x<1$, the second term in the expectation in Equation (24) is greater than the first term, and therefore each expectation is negative and the summation tends to $-\infty$ since $C<\infty$. To see this, note that $f(x)=\log(x)-\log(1-x)$ is monotonically increasing in $x$ and, for $\delta\in[0,\frac{1}{2})$, $f(\frac{1}{2}+\delta)=-f(\frac{1}{2}-\delta)$. Let $p_{m}(x)$ be the probability density of the random variable $X_{m}=\|\widetilde{F}_{m}\|$, symmetric around its mean $\mu_{m}<\frac{1}{2}$. Then $f(\mu_{m}-\delta)<-f(\mu_{m}+\delta)$ for $\delta>0$, and hence ${\mathbb{E}}(f(X_{m}))=\int p_{m}(x)f(x)\,dx\leq 0$ is in fact bounded away from $0$. We have therefore established convergence in expectation of the iterative process under the assumptions above, which implies that ${\mathbb{E}}(x_{*})=\widetilde{x}$.
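A quick numerical check of this sign argument is given by the following sketch, which evaluates ${\mathbb{E}}[\log X_{m}-\log(1-X_{m})]$ for a symmetric (uniform) distribution with mean below $\frac{1}{2}$; the specific mean and half-width are arbitrary choices made purely for illustration.

import numpy as np

rng = np.random.default_rng(1)
mu, delta = 0.4, 0.3                      # assumed mean (below 1/2) and half-width of the support
X = rng.uniform(mu - delta, mu + delta, size=1_000_000)
f = np.log(X) - np.log(1.0 - X)
print("E[log X - log(1 - X)] ~", f.mean())   # negative and bounded away from zero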

Lastly, we establish a related result that exploits an important characteristic of analog computing devices [41], namely that the random variables $X_{m}=\|\widetilde{F}_{m}\|$ have different means $\mu_{m}={\mathbb{E}}(\|\widetilde{F}_{m}\|)$ whose average is asymptotically equal to zero; i.e., $\sum_{m=0}^{\ell-1}\mu_{m}/\ell\rightarrow 0$ in the limit as $\ell\rightarrow\infty$. Consider the sequence of independent random variables $\{X_{0},\ldots,X_{\ell-1}\}$ and assume that all of the random variables $X_{m}$ have the same variance, i.e., $\sigma_{m}^{2}=\sigma^{2}$. Define $s_{\ell}^{2}:=\sum_{m=0}^{\ell-1}\sigma_{m}^{2}=\ell\sigma^{2}$, and thus $s_{\ell}=\sigma\sqrt{\ell}$. Let us now consider

\displaystyle\dfrac{1}{s_{\ell}}\sum_{m=0}^{\ell-1}(X_{m}-\mu_{m})=\sum_{m=0}^{\ell-1}\left(\dfrac{X_{m}}{\sigma\sqrt{\ell}}-\dfrac{\mu_{m}}{\sigma\sqrt{\ell}}\right)=\sum_{m=0}^{\ell-1}\left(\dfrac{X_{m}}{\sigma\sqrt{\ell}}-\dfrac{\mu_{m}\sqrt{\ell}}{\ell\sigma}\right)
\displaystyle=\sum_{m=0}^{\ell-1}\dfrac{X_{m}}{\sigma\sqrt{\ell}}-\sum_{m=0}^{\ell-1}\dfrac{\mu_{m}}{\ell}\dfrac{\sqrt{\ell}}{\sigma}. (25)

Assuming that $\sum_{m=0}^{\ell-1}\mu_{m}/\ell\rightarrow 0$ at a rate faster than $1/\sqrt{\ell}$ (equivalently, $\sum_{m=0}^{\ell-1}\mu_{m}=o(\sqrt{\ell})$, so that the second term on the right-hand side of Equation (25) vanishes), and taking the limit as $\ell\rightarrow\infty$ on both sides of Equation (25), we obtain

\lim_{\ell\rightarrow\infty}\dfrac{1}{s_{\ell}}\sum_{m=0}^{\ell-1}(X_{m}-\mu_{m})=\lim_{\ell\rightarrow\infty}\sum_{m=0}^{\ell-1}\dfrac{X_{m}}{\sigma\sqrt{\ell}}. (26)

Further suppose that, for some $\delta>0$, the following (Lyapunov) condition is satisfied:

\lim_{\ell\rightarrow\infty}\dfrac{1}{s_{\ell}^{2+\delta}}\sum_{m=0}^{\ell-1}{\mathbb{E}}\left[|X_{m}-\mu_{m}|^{2+\delta}\right]=0.

Then, from the Lyapunov central limit theorem [45] and Equation (26), we have that the sum of $X_{m}/(\sigma\sqrt{\ell})$ converges in distribution to a standard normal random variable as $\ell$ goes to infinity, namely

\sum_{m=0}^{\ell-1}\dfrac{X_{m}}{\sigma\sqrt{\ell}}\quad\stackrel{d}{\rightarrow}\quad\mathcal{N}(0,1),\qquad\mbox{ as }\ell\rightarrow\infty, (27)

where $\mathcal{N}(0,1)$ denotes the standard normal distribution.
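The following Monte Carlo sketch is a schematic illustration of the normalization in Equation (27); the choice of means $\mu_{m}$, the common variance, and the uniform fluctuations are assumptions made only for this example (in particular, the draws are not constrained to be norms). Independent draws with means whose running average vanishes quickly yield a normalized sum whose sample mean and standard deviation approach $0$ and $1$.

import numpy as np

rng = np.random.default_rng(2)
ell, reps, sigma = 1000, 2000, 0.05
m = np.arange(ell)
mu_m = (-1.0) ** m * 0.01 / (m + 1.0)      # assumed means; their running average vanishes quickly

# non-Gaussian (uniform) fluctuations with standard deviation sigma around each mean
noise = rng.uniform(-np.sqrt(3.0) * sigma, np.sqrt(3.0) * sigma, size=(reps, ell))
Z = (mu_m + noise).sum(axis=1) / (sigma * np.sqrt(ell))   # left-hand side of Equation (27)
print("sample mean of the normalized sum (should be near 0):", Z.mean())
print("sample std  of the normalized sum (should be near 1):", Z.std())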

We have therefore established convergence in expectation of the iterative refinement process such that ${\mathbb{E}}(x_{*})=\widetilde{x}$, as long as the distribution of $\|\widetilde{F}_{m}\|$ is symmetric around its mean with $\|\widetilde{F}_{m}\|$ taking values in $[0,1)$ and ${\mathbb{E}}(\|\widetilde{F}_{m}\|)<\frac{1}{2}$. From Equation (18), this implies the requirement that

{\mathbb{E}}(\|\widetilde{F}_{m}\|)\leq\kappa(\widetilde{A})\left(n\sqrt{\mu_{K}^{2}+\sigma_{K}^{2}}\right)<\dfrac{1}{2} (28)

to ensure convergence in expectation of the iterative process. Our asymptotic result in Equation (27) suggests that we can drop $\mu_{K}$ as part of the overall iterative process, and thus this requirement can be further simplified over many iterations as

{\mathbb{E}}(\|\widetilde{F}_{m}\|)\leq\kappa(\widetilde{A})\left(n\sqrt{\sigma_{K}^{2}}\right)=\kappa(\widetilde{A})\,n\sigma_{K}<\dfrac{1}{2}. (29)

It is important to note, however, that Equations (28) and (29) only represent upper bounds on the parameters associated with $\|\widetilde{F}_{m}\|$, which can be (and are likely to be) quite loose. Mathematically speaking, we really only need ${\mathbb{E}}(\|\widetilde{F}_{m}\|)<\frac{1}{2}$ or $\|\widetilde{F}_{m}\|<\frac{1}{2}$ infinitely often. In practice, we simply need $\|\widetilde{F}_{m}\|<\frac{1}{2}$ to hold a sufficiently large number of times such that $\left\|\prod_{k=0}^{m}(I+\widetilde{F}_{k})^{-1}\widetilde{F}_{k}\right\|<\frac{1}{2}$ for large $m$.
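To make this last point concrete, the sketch below simulates the error recursion of Equation (20) with randomly generated $\widetilde{F}_{m}$ scaled so that $\|\widetilde{F}_{m}\|_{2}=0.3<\frac{1}{2}$; the dimension, the scaling, and the Gaussian construction of $\widetilde{F}_{m}$ are our own illustrative choices, not settings of the paper's simulator. The error norm decays geometrically, as the analysis above predicts.

import numpy as np

rng = np.random.default_rng(3)
n, iters = 32, 40
err = rng.standard_normal(n)
err /= np.linalg.norm(err)                    # initial error x_0 - x, normalized

for m in range(iters):
    F = rng.standard_normal((n, n))
    F *= 0.3 / np.linalg.norm(F, 2)           # scale so that ||F_m||_2 = 0.3 < 1/2
    err = np.linalg.solve(np.eye(n) + F, F @ err)   # error recursion of Equation (20)
    if (m + 1) % 10 == 0:
        print(f"iteration {m + 1:2d}: error norm = {np.linalg.norm(err):.3e}")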

A-D Analysis of Preconditioned Richardson Iteration from an Analog Perspective

We next apply our error and convergence analysis above to the case of preconditioned Richardson iteration. In particular, analogous to the analysis in Appendix A-C for $\widetilde{K}_{m}$, we have that ${\mathbb{E}}(x_{\ast})=x$ as long as ${\mathbb{E}}(\|I-MA\|)<1$. With $E$ defined as in Section III-D to be

E=M\odot(N^{Wm}+\mathbf{1}\otimes N^{Im}+N^{Om}\otimes\mathbf{1})+N^{Wa},

we then seek to establish bounds on $\|E\|$ such that

\|E\|<\dfrac{1-\|I-MA\|}{\|A\|}.

Analogous again to the analysis in Appendix A-C for $\widetilde{K}_{m}$, we assume $E$ to be a matrix whose elements are i.i.d. according to a general distribution ${\mathcal{D}}_{E}(\mu_{E},\sigma_{E}^{2})$ (not necessarily Gaussian) having finite mean $\mu_{E}$ and finite variance $\sigma_{E}^{2}$. Even though all matrix norms are equivalent in a topological sense, it is convenient and prudent for our purposes to choose matrix norms with which it is easy to work. As in Appendix A-C, we shall continue to consider the max norm $\|\cdot\|_{\max}$, the Euclidean operator norm $\|\cdot\|_{2}$, and the Frobenius matrix norm $\|\cdot\|_{F}$.

Focusing on the norm $\|\cdot\|$ being either $\|\cdot\|_{\max}$ or $\|\cdot\|_{2}$, we follow our derivation above that leads to Equation (13) and obtain

{\mathbb{E}}(\|E\|)\leq{\mathbb{E}}(\|E\|_{F})=n\sqrt{\mu_{E}^{2}+\sigma_{E}^{2}}. (30)

Similarly, from our analysis above leading to Equations (14) and (16), we respectively have

{\mathbb{E}}(\|E\|^{2})\leq n^{2}(\mu_{E}^{2}+\sigma_{E}^{2}) (31)

and

{\mathbb{V}ar}(\|E\|)\leq n^{2}\sigma_{E}^{2}. (32)

Our asymptotic result in Equation (27) applied to $\|E\|$ suggests that we can drop $\mu_{E}$ as part of the overall preconditioned Richardson iteration process, and thus Equations (30) and (31) can be further simplified over many iterations as

{\mathbb{E}}(\|E\|)\leq n\sigma_{E}\qquad\mbox{ and }\qquad{\mathbb{E}}(\|E\|^{2})\leq n^{2}\sigma_{E}^{2}. (33)

Putting everything together, and assuming that write errors are the dominant source of noise, we can guarantee from Equation (30) that the hybrid preconditioned Richardson iteration process will make progress in expectation during each iteration as long as

\sqrt{\mu_{E}^{2}+\sigma_{E}^{2}}<\dfrac{1-\|I-MA\|}{n\|A\|};

whereas Equation (33) suggests that such progress will be guaranteed in expectation over many iterations as long as

\sigma_{E}<\dfrac{1-\|I-MA\|}{n\|A\|}.

Observe that, in both cases, the upper bound becomes larger when: (a) $\|I-MA\|\rightarrow 0$; (b) the spectral norm of the matrix $A$ becomes smaller; and/or (c) the size of the matrix $A$ decreases. The first component (a) is the only part that we can control, by constructing a more accurate approximate inverse, while components (b) and (c) depend on the given matrix $A$. Hence, for a given problem and a given quality of the approximate inverse, the above analysis indicates whether the noise characteristics of the analog device are suitable for the iteration to make progress.
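As an illustration of how these bounds translate into a concrete noise budget, the sketch below computes the tolerance $\sigma_{E}<(1-\|I-MA\|)/(n\|A\|)$ for a small diagonally dominant model matrix with a simple Jacobi-type approximate inverse; both the matrix and the choice of $M$ are hypothetical examples, not the test problems used in our experiments.

import numpy as np

n = 64
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # hypothetical diagonally dominant model matrix
M = np.diag(1.0 / np.diag(A))                             # simple (Jacobi-type) approximate inverse

rho = np.linalg.norm(np.eye(n) - M @ A, 2)                # ||I - MA|| < 1 for this example
sigma_E_max = (1.0 - rho) / (n * np.linalg.norm(A, 2))
print(f"||I - MA||_2            = {rho:.3f}")
print(f"tolerable noise sigma_E < {sigma_E_max:.2e}")

With these illustrative choices, the tolerance shrinks like $1/n$, consistent with the form of the bound.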

At the same time, it is important to note that Equations (30)–(33) only represent upper bounds on the parameters associated with $\|E\|$, which can be (and are likely to be) quite loose. Mathematically speaking, we really only need $\|E\|<\frac{1-\|I-MA\|}{\|A\|}$ infinitely often, and thus in practice we simply need to have $\|E\|<\frac{1-\|I-MA\|}{\|A\|}$ hold a sufficiently large number of times to realize convergence.

Similarly, for the bound on $E$ under the condition $M=A^{-1}+\Delta$, we have from the above analysis that a sufficient condition for convergence in expectation is given by

\sigma_{E}<\frac{1}{n}\left(\|A\|^{-1}-\|\Delta\|\right)=\frac{1}{n}\left(\|A\|^{-1}-\|M-A^{-1}\|\right).
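A minimal sketch of this last condition, assuming $M=A^{-1}+\Delta$ with a small random perturbation $\Delta$ (an assumption made purely for illustration), is as follows.

import numpy as np

rng = np.random.default_rng(4)
n = 64
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # same hypothetical model matrix as above
A_inv = np.linalg.inv(A)
Delta = 1e-3 * rng.standard_normal((n, n)) / n            # assumed small perturbation of A^{-1}
M = A_inv + Delta                                         # approximate inverse M = A^{-1} + Delta

bound = (1.0 / np.linalg.norm(A, 2) - np.linalg.norm(M - A_inv, 2)) / n
print(f"sufficient condition: sigma_E < {bound:.2e}")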