
Lecture Notes on Support Preconditioning

Sivan Toledo
Tel Aviv University
(Date: April 2007)

Preface

These lecture notes were written in 2007. They have been available since then at https://www.tau.ac.il/~stoledo/Support. I compiled them and uploaded them without any editing in December 2021.

Chapter 1 Motivation and Overview

In this chapter we examine solvers for two families of linear systems of equations. Our goal is to motivate certain ideas and techniques, not to explain in detail how the solvers work. We focus on the measured behavior of the solvers and on the fundamental questions that the solvers and their performance pose.

1.1. The Model Problems: Two- and Three Dimensional Meshes

The coefficient matrices of the linear systems that we study arise from two families of undirected graphs: two- and three-dimensional meshes. The vertices of the graphs, which correspond to rows and columns in the coefficient matrix $A$, are labeled $1$ through $n$. The edges of the graph correspond to off-diagonal nonzeros in $A$, the value of which is always $-1$. If $(i,j)$ is an edge in the graph, then $A_{ij}=A_{ji}=-1$. If $(i,j)$ is not an edge in the mesh for some $i\neq j$, then $A_{ij}=A_{ji}=0$. The value of the diagonal elements in $A$ is selected to make the sums of elements in rows $2,3,\ldots,n$ exactly $0$, and to make the sum of the elements in the first row exactly $1$. We always number meshes so that vertices that differ only in their $x$ coordinates are contiguous and numbered from small $x$ coordinates to large ones; vertices that have the same $z$ coordinate are also contiguous, numbered from small $y$ coordinates to high ones. Figure 1.1.1 shows an example of a two-dimensional mesh and the matrix that corresponds to it. Figure 1.1.2 shows an example of a small three-dimensional mesh.


A=\left[\setcounter{MaxMatrixCols}{12}\begin{smallmatrix}3&-1&&&-1\\ -1&3&-1&&&-1\\ &-1&3&-1&&&-1\\ &&-1&2&&&&-1\\ -1&&&&3&-1&&&-1\\ &-1&&&-1&4&-1&&&-1\\ &&-1&&&-1&4&-1&&&-1\\ &&&-1&&&-1&3&&&&-1\\ &&&&-1&&&&2&-1\\ &&&&&-1&&&-1&3&-1\\ &&&&&&-1&&&-1&3&-1\\ &&&&&&&&&&-1&2\end{smallmatrix}\right]

Figure 1.1.1. A 4-by-3 two-dimensional mesh, with vertices labeled $1,\ldots,12$, and the coefficient matrix $A$ that corresponds to it.
Figure 1.1.2. A 4-by-3-by-2 three-dimensional mesh. The indices of the vertices are not shown.

Matrices similar to these do arise in applications, such as finite-difference discretizations of partial differential equations. These particular families of matrices are highly structured, and the structure allows specialized solvers to be used, such as multigrid solvers and spectral solvers. The methods that we present in this book are, however, applicable to a much wider class of matrices, not only to those arising from structured meshes. We use these matrices here only to illustrate and motivate various linear solvers.

The experimental results that we present in the rest of the chapter all use a solution vector $x$ with random uniformly-distributed elements between $0$ and $1$.
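For concreteness, the matrix construction described above can be written out in a few lines of code. The following Python sketch builds the coefficient matrix of a two-dimensional mesh; the function name mesh_matrix and the dense array are illustrative choices only, and a real code would use a sparse format.

```python
import numpy as np

def mesh_matrix(nx, ny):
    """Build the coefficient matrix of an nx-by-ny two-dimensional mesh as
    described in Section 1.1: off-diagonal entries are -1 for mesh edges, and
    the diagonal is chosen so that every row sums to 0 except the first,
    which sums to 1.  A small dense sketch for illustration only."""
    n = nx * ny
    A = np.zeros((n, n))
    def index(ix, iy):                 # x varies fastest, as in the numbering above
        return iy * nx + ix
    for iy in range(ny):
        for ix in range(nx):
            i = index(ix, iy)
            if ix + 1 < nx:            # edge to the neighbor in the x direction
                A[i, index(ix + 1, iy)] = A[index(ix + 1, iy), i] = -1
            if iy + 1 < ny:            # edge to the neighbor in the y direction
                A[i, index(ix, iy + 1)] = A[index(ix, iy + 1), i] = -1
    for i in range(n):
        A[i, i] = -A[i, :].sum()       # make the row sum zero
    A[0, 0] += 1                       # the first row sums to 1
    return A
```

Calling mesh_matrix(4, 3) reproduces the 12-by-12 matrix shown in Figure 1.1.1.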

1.2. The Cholesky Factorization

Our matrices are symmetric (by definition) and positive definite. We do not show here that they are positive definite (all their eigenvalues are positive); this will wait until a later chapter. But they are positive definite.

The simplest way to solve a linear system $Ax=b$ with a positive definite coefficient matrix $A$ is to factor $A$ into a product of a lower triangular factor $L$ and its transpose, $A=LL^{T}$. When $A$ is symmetric positive definite, such a factorization always exists. It is called the Cholesky factorization, and it is a variant of Gaussian elimination. Once we compute the factorization, we can solve $Ax=b$ by solving two triangular linear systems,

Ly = b
L^{T}x = y\;.

Solvers for linear systems of equations that factor the coefficient matrix into a product of simpler matrices are called direct solvers. Here the factorization is into triangular matrices, but orthogonal, diagonal, tridiagonal, and block diagonal matrices are also used as factors in direct solvers.

The factorization can be computed recursively. If $A$ is 1-by-1, then $L=\sqrt{A_{11}}$. Otherwise, we can partition $A$ and $L$ into 4 blocks each, such that the diagonal blocks are square,

A=\begin{bmatrix}A_{11}&A_{21}^{T}\\ A_{21}&A_{22}\end{bmatrix}=\begin{bmatrix}L_{11}\\ L_{21}&L_{22}\end{bmatrix}\begin{bmatrix}L_{11}^{T}&L_{21}^{T}\\ &L_{22}^{T}\end{bmatrix}=LL^{T}\;.

This yields the following equations, which define the blocks of $L$,

(1.2.1) A_{11} = L_{11}L_{11}^{T}
(1.2.2) A_{21} = L_{21}L_{11}^{T}
(1.2.3) A_{22} = L_{21}L_{21}^{T}+L_{22}L_{22}^{T}\;.

We compute $L$ by recursively solving Equation 1.2.1 for $L_{11}$. Once we have computed $L_{11}$, we solve Equation 1.2.2 for $L_{21}$ by substitution. The final step in the computation of $L$ is to subtract $L_{21}L_{21}^{T}$ from $A_{22}$ and to factor the difference recursively.
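As a concrete illustration of equations 1.2.1-1.2.3 with a 1-by-1 pivot block, here is a minimal dense Python sketch of the factorization; it is meant only to make the recursion explicit, and any real sparse code is organized quite differently.

```python
import numpy as np

def cholesky(A):
    """Unblocked Cholesky factorization A = L L^T of a symmetric positive-
    definite matrix; returns the lower triangular factor L.  Each step applies
    (1.2.1)-(1.2.3) with a 1-by-1 block A11.  Dense sketch for illustration."""
    A = np.array(A, dtype=float)                # work on a copy
    n = A.shape[0]
    L = np.zeros((n, n))
    for k in range(n):
        L[k, k] = np.sqrt(A[k, k])              # L11 = sqrt(A11)
        L[k+1:, k] = A[k+1:, k] / L[k, k]       # L21 = A21 / L11
        A[k+1:, k+1:] -= np.outer(L[k+1:, k], L[k+1:, k])  # A22 - L21 L21^T
    return L
```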

This algorithm is reliable and numerically stable, but if implemented naively, it is slow. Let $\phi(n)$ be the number of arithmetic operations that this algorithm performs on a matrix of size $n$. To estimate $\phi$, we partition $A$ so that $A_{11}$ is 1-by-1. The partitioning does not affect $\phi(n)$, but this particular partitioning makes it easy to estimate $\phi(n)$. With this partitioning, computing $L$ involves computing one square root, dividing an $(n-1)$-vector by the root, computing the symmetric outer product of the scaled $(n-1)$-vector, subtracting this outer product from a symmetric $(n-1)$-by-$(n-1)$ matrix, and recursively factoring the difference. We can set this up as a recurrence relation,

\phi(n) = 1+(n-1)+(n-1)^{2}+\phi(n-1)
\phi(1) = 1\;.

This yields

\phi(n) = n+\sum_{j=2}^{n}(j-1)+\sum_{j=2}^{n}(j-1)^{2}
= n+\sum_{j=1}^{n-1}j+\sum_{j=1}^{n-1}j^{2}
= n+\frac{n(n-1)}{2}+\frac{n(n-\frac{1}{2})(n-1)}{3}
= \frac{n^{3}}{3}+o\left(n^{3}\right)\;.

The little-$o$ notation means that we have neglected terms that grow slower than $cn^{3}$ for any constant $c$.

Cubic growth means that for large meshes, a Cholesky-based solver is slow. For a mesh with a million vertices, the solver needs to perform more than $0.3\times 10^{18}$ arithmetic operations. Even at a rate of $10^{12}$ operations per second, the factorization takes almost 4 days. On a personal computer (say a 2005 model), the factorization will take more than a year. A mesh with a million vertices may seem like a large mesh, but it is not. The mesh has less than 2 million edges in two dimensions and less than 3 million in three dimensions, so its representation does not take a lot of space in memory. We can do a lot better.

1.3. Sparse Cholesky

One way to speed up the factorization is to exploit sparsity in $A$. If we inspect the nonzero pattern of $A$, we see that most of its elements are zero. If $A$ is derived from a 2-dimensional mesh, only 5 diagonals contain nonzero elements. If $A$ is derived from a 3-dimensional mesh, only 7 diagonals are nonzero. Figure 1.3.1 shows the nonzero pattern of such matrices. Only $O(n)$ elements out of $n^{2}$ in $A$ are nonzero.

[Nonzero-pattern plots of $A_{\textrm{2D}}$ (left) and $A_{\textrm{3D}}$ (right).]

Figure 1.3.1. The nonzero pattern of matrices corresponding to an 8-by-7 mesh (left) and to a 4-by-6-by-3 mesh. Nonzero elements are denoted by small squares. The other elements are all zeros.

Sparse factorization codes exploit the sparsity of $A$ and avoid computations involving zero elements (except, sometimes, for zeros that are created by exact cancellations). Figure 1.3.2 compares the running times of a sparse Cholesky solver with that of a dense Cholesky solver, which does not exploit zeros. The sparse solver is clearly much faster, especially on 2-dimensional meshes.

Figure 1.3.2. The performance of dense and sparse Cholesky on 2- and 3-dimensional meshes. The data in this graph, as well as in all the other graphs in this chapter, was generated using Matlab 7.0.

To fully understand the performance of the sparse solver, it helps to inspect the nonzero pattern of the Cholesky factor $L$, shown in Figure 1.3.3. Exploiting sparsity pays off because the factor is also sparse: it fills, but not completely. It is easy to characterize exactly where the factor fills. It fills almost completely within the band of $A$, the set of diagonals that are closer to the main diagonal than the outermost nonzero diagonal. Outside this band, the elements of $L$ are all zero. For a $\sqrt{n}$-by-$\sqrt{n}$ mesh, the outermost nonzero diagonal is the $\sqrt{n}$th, so the total amount of arithmetic in a sparse factorization is $n\cdot 2\left(\sqrt{n}\right)^{2}+o(n^{2})=2n^{2}+o(n^{2})$. For large $n$, this is a lot less than the dense $n^{3}/3+o(n^{3})$ bound. For three-dimensional meshes, exploiting sparsity helps, but not as much. In a matrix corresponding to a $\sqrt[3]{n}$-by-$\sqrt[3]{n}$-by-$\sqrt[3]{n}$ mesh, the outermost nonzero diagonal is the $n^{2/3}$th, so the amount of arithmetic is $2n^{7/3}+o(n^{7/3})$.
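To get a feel for these bounds, plugging $n=10^{6}$ into the leading terms of the counts above (the constants are only indicative) gives roughly

2n^{2}=2\times 10^{12}\;\textrm{(2D band)},\qquad 2n^{7/3}=2\times 10^{14}\;\textrm{(3D band)},\qquad \frac{n^{3}}{3}\approx 3.3\times 10^{17}\;\textrm{(dense)}\;,

so the band factorization of a two-dimensional mesh is about five orders of magnitude cheaper than the dense factorization.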

[Spy plots of $A$ (left) and its Cholesky factor $L$ (right).]

Figure 1.3.3. The matrix of a 15-by-11 mesh (left) and its Cholesky factor (right).

1.4. Preordering for Sparsity

Sparse factorization codes exploit zero elements in the reduced matrices and the factors. It turns out that by reordering the rows and columns of a matrix we can reduce fill in the reduced matrices and the factors. The most striking example of this phenomenon is the arrow matrix,

A=\begin{bmatrix}n+1&-1&-1&-1&\cdots&-1\\ -1&1\\ -1&&1\\ -1&&&1\\ \vdots&&&&\ddots\\ -1&&&&&1\end{bmatrix}\;.

When we factor this matrix, it fills completely after the elimination of the first column, since the outer product $L_{2\colon n,1}L_{2\colon n,1}^{T}$ is full (has no zeros). But if we reverse the order of the rows and columns, we have

PAP^{T} = \left[\begin{smallmatrix}&&&&&1\\ &&&&1\\ &&&1\\ &&1\\ &\cdot\\ 1\end{smallmatrix}\right]\left[\begin{smallmatrix}n+1&-1&-1&-1&\cdots&-1\\ -1&1\\ -1&&1\\ -1&&&1\\ \vdots&&&&\ddots\\ -1&&&&&1\end{smallmatrix}\right]\left[\begin{smallmatrix}&&&&&1\\ &&&&1\\ &&&1\\ &&1\\ &\cdot\\ 1\end{smallmatrix}\right]
= \left[\begin{smallmatrix}1&&&&&-1\\ &1&&&&-1\\ &&1&&&-1\\ &&&1&&-1\\ &&&&\ddots&\vdots\\ -1&-1&-1&-1&-1&n+1\end{smallmatrix}\right]\;.

The Cholesky factor of $PAP^{T}$ is as sparse as $PAP^{T}$,

\textrm{chol}(PAP^{T})=\begin{bmatrix}1\\ &1\\ &&1\\ &&&1\\ &&&&\ddots\\ -1&-1&-1&-1&\cdots&1\end{bmatrix}\;.

Solving a linear system with a Cholesky factorization of a permuted coefficient matrix is trivial. We apply the same permutation to the right-hand side $b$ before the two triangular solves, and apply the inverse permutation after the triangular solves.
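The following Python sketch spells out this permuted solve. It uses a dense Cholesky factorization as a stand-in for a sparse one, and p is any permutation of the row indices (for example, a fill-reducing ordering); the function name and interface are illustrative only.

```python
import numpy as np

def solve_with_permuted_cholesky(A, b, p):
    """Solve A x = b through a Cholesky factorization of the symmetrically
    permuted matrix P A P^T, where p is a permutation of 0..n-1.
    A minimal dense sketch of the procedure described above."""
    PAP = A[np.ix_(p, p)]                 # P A P^T
    L = np.linalg.cholesky(PAP)           # P A P^T = L L^T
    pb = b[p]                             # permute the right-hand side
    y = np.linalg.solve(L, pb)            # forward substitution  L y = P b
    z = np.linalg.solve(L.T, y)           # backward substitution L^T z = y
    x = np.empty_like(z)
    x[p] = z                              # apply the inverse permutation
    return x
```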

[Spy plots of $\tilde{A}$ (left) and $\tilde{L}$ (right).]

Figure 1.4.1. A symmetric nested-dissection permutation $\tilde{A}$ of the matrix shown in Figure 1.3.3 and the Cholesky factor $\tilde{L}$ of $\tilde{A}$.

For matrices that correspond to meshes, there is no permutation that leads to a no-fill factorization, but there are permutations that reduce fill significantly. Figure 1.4.1 shows a theoretically-effective fill-reducing permutation of the matrix of an 11-by-15 mesh and its Cholesky factor. This symmetric reordering of the rows and columns is called a nested-dissection ordering. It reorders the rows and columns by finding a small set of vertices in the mesh whose removal breaks the mesh into two separate components of roughly the same size. This set is called a vertex separator, and the corresponding rows and columns are ordered last. The rows and columns corresponding to each connected component of the mesh minus the separator are ordered using the same strategy recursively. Figure 1.4.2 illustrates this construction, and a code sketch of the ordering appears after the figure.

[A nested-dissection ordering of a mesh (left) and the spy plot of the reordered matrix $\tilde{A}$ (right).]

Figure 1.4.2. A nested-dissection ordering of a mesh (left) and the nested-dissection reordered matrix of the mesh (right). The shaded ellipse in the mesh shows the top-level vertex separator. The shading in the matrix shows the rows and columns that correspond to the separator; the thin lines separate the submatrices that correspond to the connected components of the mesh minus the separator.
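Here is a minimal Python sketch of the recursive construction just described, for a structured rows-by-cols grid: a middle row or column of vertices serves as the separator and is ordered last, and the two remaining pieces are ordered recursively. The function name and the base-case threshold are illustrative choices, not a production ordering code.

```python
def nested_dissection(rows, cols):
    """Return a nested-dissection ordering of the vertices of a rows-by-cols
    grid as a list of (row, col) pairs: the separator is ordered last and the
    two remaining pieces are ordered recursively."""
    if rows == 0 or cols == 0:
        return []
    if rows * cols <= 2:                       # small piece: any order will do
        return [(i, j) for i in range(rows) for j in range(cols)]
    if rows >= cols:                           # split along the longer side
        mid = rows // 2
        top = nested_dissection(mid, cols)
        bottom = [(i + mid + 1, j)
                  for (i, j) in nested_dissection(rows - mid - 1, cols)]
        separator = [(mid, j) for j in range(cols)]
        return top + bottom + separator
    else:
        mid = cols // 2
        left = nested_dissection(rows, mid)
        right = [(i, j + mid + 1)
                 for (i, j) in nested_dissection(rows, cols - mid - 1)]
        separator = [(i, mid) for i in range(rows)]
        return left + right + separator
```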

The factorization of the nested-dissection-ordered matrix of a $\sqrt{n}$-by-$\sqrt{n}$ mesh performs $\Theta(n^{3/2})$ arithmetic operations. (We do not give here the detailed analysis of fill and work under nested-dissection orderings.) For large meshes, this is a significant improvement over the $\Theta(n^{2})$ operations that the factorization performs without reordering. For a $\sqrt[3]{n}$-by-$\sqrt[3]{n}$-by-$\sqrt[3]{n}$ mesh, the factorization performs $\Theta(n^{2})$ operations. This is again an improvement, but the $\Theta(n^{2})$ solution cost is too high for many applications.

Figure 1.4.3. The effect of symmetrically reordering the rows and columns on the running time of sparse Cholesky factorizations.

Figure 1.4.3 compares the running times of sparse Cholesky applied to the natural ordering of the meshes and to fill-reducing orderings. The nested-dissection and minimum-degree orderings speed up the factorization considerably.

Unfortunately, nested-dissection-type orderings are usually the best we can do in terms of fill and work in sparse factorization of meshes, both structured and unstructured ones. A class of ordering heuristics called minimum-degree heuristics can beat nested dissection on small and medium-size meshes. On larger meshes, various hybrids of nested-dissection and minimum-degree are often more effective than pure variants of either method, but these hybrids have the same asymptotic behavior as pure nested-dissection orderings.

To asymptotically beat sparse Cholesky factorizations, we need something more sophisticated than a factorization of the coefficient matrix.

1.5. Krylov-Subspace Iterative Methods

A radically different way to solve $Ax=b$ is to find the vector $x^{(k)}$ that minimizes $\|Ax^{(k)}-b\|$ in a subspace of $\mathbb{R}^{n}$, the subspace that is spanned by $b,Ab,A^{2}b,\ldots,A^{k-1}b$. This subspace is called the Krylov subspace of $A$ and $b$, and such solvers are called Krylov-subspace solvers. This idea is appealing for two reasons, the first obvious and the other less so. First, we can create a basis for this subspace by repeatedly multiplying vectors by $A$. Since our matrices, as well as many of the matrices that arise in practice, are so sparse, multiplying $A$ by a vector is cheap. Second, it turns out that when $A$ is symmetric, finding the minimizer $x^{(k)}$ is also cheap. These considerations, taken together, imply that the work performed by such solvers is proportional to $k$. If $x^{(k)}$ converges quickly to $x$, the total solution cost is low. But for most real-world problems, as well as for our meshes, convergence is slow.
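To see why a matrix-vector product with a matrix this sparse costs only a number of operations proportional to the number of nonzeros, consider a sketch of the product in compressed sparse row (CSR) form; the storage format is an illustrative assumption, since the notes do not prescribe one.

```python
import numpy as np

def csr_matvec(values, col_index, row_start, x):
    """Multiply a sparse matrix stored in compressed-sparse-row (CSR) form by
    a dense vector.  The work is proportional to the number of stored
    nonzeros, which is Theta(n) for the mesh matrices above.  Minimal sketch."""
    n = len(row_start) - 1
    y = np.zeros(n)
    for i in range(n):                                   # one row at a time
        for k in range(row_start[i], row_start[i + 1]):  # nonzeros of row i
            y[i] += values[k] * x[col_index[k]]
    return y
```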

The conjugate gradients method (cg) and the minimal-residual method (minres) are two Krylov-subspace methods that are appropriate for symmetric positive-definite matrices. Minres is more theoretically appealing, because it minimizes the residual $Ax^{(k)}-b$ in the 2-norm. It works even if $A$ is indefinite. Its main drawback is that it suffers from a certain numerical instability when implemented in floating point. The instability is almost always mild, not catastrophic, but it can slow convergence and it can prevent the method from converging to very small residuals. Conjugate gradients is more reliable numerically, but it minimizes the residual in the $A^{-1}$-norm, not in the 2-norm.

Figure 1.5.1. A comparison of a sparse direct Cholesky solver with an iterative minres solver.

Figure 1.5.1 compares the performance of a sparse direct solver (with a fill-reducing ordering) with the performance of a minres solver. The performance of an iterative solver depends on the prescribed convergence threshold. Here we instructed the solver to halt once the relative norm of the residual

\frac{\|Ax^{(k)}-b\|}{\|b\|}

is $10^{-6}$ or smaller. On 2-dimensional meshes, the direct solver is faster. On 3-dimensional meshes, the iterative solver is faster. The fundamental reason for the difference is the size of the smallest balanced vertex separators in 2- and 3-dimensional meshes. In a 2-dimensional mesh with a bounded aspect ratio, the smallest balanced vertex separator has $\Theta(n^{1/2})$ vertices; in a 3-dimensional mesh, the size of the smallest separator is $\Theta(n^{2/3})$. The larger separators in 3-dimensional meshes cause the direct solver to be expensive, but they speed up the iterative solver. This is shown in Figure 1.5.2: minres converges after fewer iterations on 3-dimensional meshes than on 2-dimensional meshes. The full explanation of this behavior is beyond the scope of this chapter, but this phenomenon shows how different direct and iterative solvers are. Something that hurts a direct solver may help an iterative solver.

Figure 1.5.2. The number of iterations required to reduce the relative norm of the residual to $10^{-6}$ or less.

As long as $A$ is symmetric and positive definite, the performance of a sparse Cholesky solver does not depend on the values of the nonzero elements of $A$, only on $A$'s nonzero pattern. The performance of an iterative solver, on the other hand, depends on the values of the nonzero elements. The behavior of minres displayed in Figures 1.5.1 and 1.5.2 depends on our way of assigning values to the nonzero elements of $A$.

Comparing the running times of direct and iterative solvers does not reveal all the differences between them. The graphs that present the number of arithmetic operations, in Figure 1.5.1, reveal another aspect. The differences in the amount of work that the solvers perform are much greater than the differences in running times. This indicates that the direct solver performs arithmetic at a much higher rate than the iterative solver. The exact arithmetic-rate ratio is implementation dependent, but qualitatively, direct solvers usually run at higher computational rates than iterative solvers. Most of the difference is due to the fact that it is easier to exploit complex computer architectures, in particular cache memories, in direct solvers.

Another important difference between direct and iterative solvers is memory usage. A solver like minres only allocates a few $n$-vectors. A direct solver needs more memory, because, as we have seen, the factors fill. The Cholesky factor of a 3-dimensional mesh contains $\Omega(n^{4/3})$ nonzeros, so the memory-usage ratio between a direct and an iterative solver worsens as the mesh grows.

1.6. Preconditioning

How can we improve the speed of Krylov-subspace iterative solvers? We cannot significantly reduce the cost of each iteration, because for matrices as sparse as ours, the cost of each iteration is only $\Theta(n)$. There is little hope for finding a better approximation $x^{(k)}$ in the span of $b,Ab,\ldots,A^{k-1}b$, because minres already finds the approximation that minimizes the norm of the residual.

To understand what can be improved, we examine simpler iterative methods. Given an approximate solution $x^{(k)}$ to the system $Ax=b$, we define the error $e^{(k)}=x-x^{(k)}$ and the residual $r^{(k)}=b-Ax^{(k)}$. The error $e^{(k)}$ is the required correction to $x^{(k)}$; if we add $e^{(k)}$ to $x^{(k)}$, we obtain $x$. We obviously do not know what $e^{(k)}$ is, but we know what $Ae^{(k)}$ is: it is $r^{(k)}$, because

Ae^{(k)} = A\left(x-x^{(k)}\right)
= Ax-Ax^{(k)}
= b-Ax^{(k)}
= r^{(k)}\;.

Computing the correction requires solving a linear system of equations whose coefficient matrix is $A$, which is what we are trying to do in the first place. The situation seems cyclic, but it is not. The key idea is to realize that even an inexact correction can bring us closer to $x$. Suppose that we had another matrix $B$ that is close to $A$ and such that linear systems of the form $Bz=r$ are much easier to solve than systems whose coefficient matrix is $A$. Then we could solve $Bz^{(k)}=r^{(k)}$ for a correction $z^{(k)}$ and set $x^{(k+1)}=x^{(k)}+z^{(k)}=x^{(k)}+B^{-1}(b-Ax^{(k)})$. Since $Ae^{(k)}=b-Ax^{(k)}$, we can express $e^{(k+1)}$ directly in terms of $e^{(0)}$,

e^{(k+1)} = x-x^{(k+1)}
= x-x^{(k)}-B^{-1}\left(b-Ax^{(k)}\right)
= e^{(k)}-B^{-1}Ae^{(k)}
= \left(I-B^{-1}A\right)e^{(k)}
= \left(I-B^{-1}A\right)^{k+1}e^{(0)}\;.

Therefore, if $\left(I-B^{-1}A\right)^{k}$ converges quickly to the zero matrix, then $x^{(k)}$ converges quickly towards $x$. This is the precise sense in which $B$ must be close to $A$: $\left(I-B^{-1}A\right)^{k}$ should converge quickly to zero. This happens if the spectral radius of $\left(I-B^{-1}A\right)$ is small (the spectral radius of a matrix is the maximal absolute value of its eigenvalues).
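The iteration just described is easy to state as code. The following Python sketch is an illustration only; B_solve stands for any routine that applies $B^{-1}$ (for example, two triangular solves with a precomputed Cholesky factor of $B$), and the Jacobi choice in the comment is merely a toy example.

```python
import numpy as np

def stationary_iteration(A, B_solve, b, x0, iters):
    """Preconditioned stationary iteration x^{k+1} = x^k + B^{-1}(b - A x^k).
    B_solve(r) must return the solution z of B z = r.  Minimal sketch."""
    x = x0.copy()
    for _ in range(iters):
        r = b - A @ x          # residual r^k = b - A x^k
        z = B_solve(r)         # correction: solve B z = r^k
        x = x + z              # x^{k+1} = x^k + z
    return x

# Hypothetical usage with a diagonal (Jacobi) preconditioner B = diag(A):
# x = stationary_iteration(A, lambda r: r / np.diag(A), b, np.zeros_like(b), 100)
```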

The convergence of more sophisticated iterative methods, like minres and cg, can also be accelerated using this technique, which is called preconditioning. In every iteration, we solve a correction equation $Bz^{(k)}=r^{(k)}$. The coefficient matrix $B$ of the correction equation is called a preconditioner. The convergence of preconditioned minres and preconditioned cg also depends on spectral (eigenvalue) properties associated with $A$ and $B$. For these methods, it is not the spectral radius that determines the convergence rate, but other spectral properties. But the principle is the same: $B$ should be close to $A$ in some spectral metric.

In this book, we mainly focus on one family of methods to construct and analyze preconditioners. There are many other ways. The constructions that we present are called support preconditioners, and our analysis method is called support theory.

We introduce support preconditioning here using one particular family of support preconditioners for our meshes. These preconditioners, which are only effective for regular meshes whose edges all have the same weights (the off-diagonals in $A$ all have the same value), are called Joshi preconditioners. We construct a Joshi preconditioner $B$ for a mesh matrix $A$ by dropping certain edges from the mesh, and then constructing a coefficient matrix $B$ for the sparsified mesh using the procedure given in Section 1.1. Figures 1.6.1 and 1.6.2 present examples of such sparsified meshes.

Figure 1.6.1. A 15-by-11 two-dimensional mesh (top left) with Joshi subgraphs for $k=4$ (top right), $k=2$ (bottom left), and $k=15$ (bottom right). The subgraph for $k=2$ is the densest nontrivial subgraph and the subgraph for $k=15$ is the sparsest possible; it is a spanning tree of the mesh.
Figure 1.6.2. A 3-by-4-by-3 three-dimensional Joshi mesh.

To solve the correction equations during the iterations, we compute the sparse Cholesky factorization of $PBP^{T}$, where $P$ is a fill-reducing permutation for $B$.

The construction of Joshi preconditioners depends on a parameter $k$. For small $k$, the mesh of the preconditioner is similar to the mesh of $A$. In particular, for $k=1$ we obtain $B=A$. As we shall see, Joshi preconditioners with a small $k$ lead to fast convergence rates. But Joshi preconditioners with a small $k$ do not have small balanced vertex separators, so their Cholesky factorization suffers from significant fill. Not as much as in the Cholesky factor of $A$, but still significant. This fill slows down the preconditioner-construction phase of the linear solver, in which we construct and factor $B$, and it also slows down each iteration. As $k$ gets larger, the similarity between the meshes diminishes. Convergence rates are worse than for small $k$, but the factorization of $B$ and the solution of the correction equation in each iteration become cheaper. In particular, when $k$ is at least as large as the side of the mesh, the mesh of $B$ is a tree, which has single-vertex balanced separators; factoring it costs only $\Theta(n)$ operations, and its factor contains only $\Theta(n)$ nonzeros.

Figure 1.6.3. The behavior of preconditioned minres solvers with Joshi preconditioners. For $k=1$, the performance is essentially the performance of the direct solver. The preconditioners were reordered using a minimum-degree ordering and factored.

Figure 1.6.3 presents the performance of preconditioned minres with Joshi preconditioners on two meshes. On the three-dimensional mesh, the preconditioned iterative solver is much faster than the direct solver and about as fast as the unpreconditioned minres solver. The preconditioned solver performs fewer arithmetic operations than both the direct solver and the unpreconditioned solver. In terms of arithmetic operations, the Joshi preconditioner for $k=3$ is the most efficient. For small $k$, the preconditioned solvers perform fewer iterations than the unpreconditioned solver. The number of iterations rises with $k$. Perhaps the most interesting aspect of these graphs is the fact that the cost of the factorization of $B$ drops dramatically from $k=1$, where $B=A$, to $k=2$. Even slightly sparsifying the mesh reduces the cost of the factorization dramatically. The behavior on the 2-dimensional mesh is similar, except that in terms of time, the cost of the solver rises monotonically with $k$. In particular, the direct solver is fastest. But in terms of arithmetic operations, the preconditioned solver with $k=2$ is the most efficient.

1.7. The Main Questions

We have now set the stage to pose the main questions in support theory and preconditioning. The main theme in support preconditioning is the construction of preconditioners using graph algorithms. Given a matrix $A$, we model it as a graph $G_{A}$. From $G_{A}$ we build another graph $G_{B}$, which approximates $G_{A}$ in some graph-related metric. From $G_{B}$ we build the preconditioner $B$. The preconditioner is usually factored in preparation for solving correction equations $Bz=r$, but sometimes the correction equation is also solved iteratively. This raises three questions:

(1) How do we model matrices as graphs?
(2) Given two matrices $A$ and $B$ and their associated graphs $G_{A}$ and $G_{B}$, how do we use the graphs to estimate or bound the convergence of iterative linear solvers in which $B$ preconditions $A$?
(3) Given the graph $G_{A}$ of a matrix $A$, how do we construct a graph $G_{B}$ that approximates $G_{A}$ in a metric that ensures fast convergence rates?

Support theory addresses all three questions: the modeling question, the analysis question, and the graph-approximation question.

The correspondence between graphs and matrices is usually an isomorphism, to allow the analysis to yield useful convergence-rate bounds. It may seem useless to transform one problem, the matrix-approximation problem, into a completely equivalent one, a graph-approximation problem. It turns out, however, that many effective constructions only make sense when the problem is described in terms of graphs. They would not have been invented as matrix constructions. The Joshi preconditioners are a good example. They are highly structured when viewed as graph constructions, but would make little sense as direct sparse-matrix constructions. The analysis, on the other hand, is easier, more natural, and more general when carried out in terms of matrices. And so we transform the original matrix to a graph to allow for sophisticated constructions, and then we transform the graphs back to matrices to analyze the convergence rate.

The preconditioning framework that we presented is the main theme of support theory, but there are other themes. The combinatorial and algebraic tools that were invented in order to develop preconditioners have also been used to analyze earlier preconditioners, to bound the extreme eigenvalues of matrices, to analyze the null spaces of matrices, and more.

Chapter 2 Iterative Krylov-Subspace Solvers

Preconditioned Krylov-subspace iterations are a key ingredient in many modern linear solvers, including in solvers that employ support preconditioners. This chapter presents Krylov-subspace iterations for symmetric semidefinite matrices. In particular, we analyze the convergence behavior of these solvers. Understanding what determines the convergence rate is key to designing effective preconditioners.

2.1. The Minimal Residual Method

The minimal-residual method (minres) is an iterative algorithm that finds in each iteration the vector $x^{(t)}$ that minimizes the residual $\|Ax^{(t)}-b\|_{2}$ in a subspace $\mathcal{K}_{t}$ of $\mathbb{R}^{n}$. These subspaces, called the Krylov subspaces, are nested, $\mathcal{K}_{t}\subseteq\mathcal{K}_{t+1}$, and the dimension of the subspace usually grows by one in every iteration, so the accuracy of the approximate solution $x^{(t)}$ tends to improve from one iteration to the next. The construction of the spaces $\mathcal{K}_{t}$ is designed to allow an efficient computation of the approximate solution in every iteration.

Definition 1.

The Krylov subspace $\mathcal{K}_{t}$ is the subspace that is spanned by the columns of the matrix $K_{t}=\begin{bmatrix}b&Ab&A^{2}b&\cdots&A^{t-1}b\end{bmatrix}$. That is,

\mathcal{K}_{t}=\left\{K_{t}y:y\in\mathbb{R}^{t}\right\}\;.

Once we have a basis for $\mathcal{K}_{t}$, we can express the minimization of $\|Ax-b\|_{2}$ in $\mathcal{K}_{t}$ as an unconstrained linear least-squares problem, since

\min_{x\in\mathcal{K}_{t}}\|Ax-b\|_{2}=\min_{y\in\mathbb{R}^{t}}\|AK_{t}y-b\|_{2}\;.

There are three fundamental tools in the solution of least-squares problems with full-rank $m$-by-$n$ coefficient matrices with $m\geq n$. The first tool is unitary transformations. A unitary matrix $Q$ (a square matrix with orthonormal columns, or equivalently, a matrix such that $QQ^{*}=I$) preserves the 2-norm of vectors: $\|Qx\|_{2}=\|x\|_{2}$ for any $x$. This allows us to transform a least-squares problem into equivalent forms,

(2.1.1) \min_{x}\|Ax-b\|_{2}=\min_{x}\|QAx-Qb\|_{2}\;.

The second tool is an insight about matrices that contain only zeros in rows $n+1$ through $m$. Consider the least-squares problem

\min_{x}\left\|\begin{bmatrix}R_{1}\\ 0\end{bmatrix}x-\begin{bmatrix}b_{1}\\ b_{2}\end{bmatrix}\right\|_{2}\;.

If the coefficient matrix has full rank, then the square block $R_{1}$ must be invertible. The solution to the problem is the vector $x$ for which the Euclidean distance between $\left[\begin{smallmatrix}R_{1}\\ 0\end{smallmatrix}\right]x$ and $\left[\begin{smallmatrix}b_{1}\\ b_{2}\end{smallmatrix}\right]$ is minimal. But it is clear what this vector is: $x=R_{1}^{-1}b_{1}$. This vector yields $\left[\begin{smallmatrix}R_{1}\\ 0\end{smallmatrix}\right]x=\left[\begin{smallmatrix}b_{1}\\ 0\end{smallmatrix}\right]$, and we clearly cannot get any closer. The third tool combines the first two into an algorithm. If we factor $A$ into a product $QR$ of a unitary matrix $Q$ and an upper trapezoidal matrix $R=\begin{bmatrix}R_{1}^{*}&0\end{bmatrix}^{*}$, which is always possible, then we can multiply the system by $Q^{*}$ to obtain

\min_{x}\left\|Ax-b\right\|_{2} = \min_{x}\left\|QRx-b\right\|_{2}
= \min_{x}\left\|Rx-Q^{*}b\right\|_{2}
= \min_{x}\left\|\begin{bmatrix}R_{1}\\ 0\end{bmatrix}x-Q^{*}b\right\|_{2}\;.

We can now compute the minimizer $x$ by substitution. Furthermore, the norm of the residual is given by $\|(Q^{*}b)_{n+1\colon m}\|_{2}$.
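Putting the three tools together gives a tiny least-squares solver. The following Python sketch relies on numpy's dense QR factorization and is meant only to illustrate the reasoning above; the function name is an arbitrary choice.

```python
import numpy as np

def lstsq_via_qr(A, b):
    """Solve min_x ||Ax - b||_2 for a full-rank m-by-n matrix A (m >= n) with
    a QR factorization, as described above; returns the minimizer and the
    residual norm ||(Q^* b)_{n+1:m}||_2.  Minimal dense sketch."""
    m, n = A.shape
    Q, R = np.linalg.qr(A, mode="complete")   # A = Q [R1; 0]
    c = Q.T @ b                               # transform the right-hand side
    x = np.linalg.solve(R[:n, :], c[:n])      # back-substitution R1 x = c1
    residual_norm = np.linalg.norm(c[n:])     # norm of the leftover block
    return x, residual_norm
```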

In principle, we could solve $\min_{y}\|AK_{t}y-b\|_{2}$ for $y$ using a $QR$ factorization of $AK_{t}$. Once we find $y$, we can compute $x=K_{t}y$. There are, however, two serious defects in this approach. First, it is inefficient, because $AK_{t}$ is dense. Second, as $t$ grows, $A^{t}b$ tends to the subspace spanned by the dominant eigenvectors of $A$ (the eigenvectors associated with the eigenvalues with maximal absolute value). This causes $K_{t}$ to become ill conditioned; for some vectors $x\in\mathcal{K}_{t}$ with $\|x\|_{2}=1$, the vector $y$ such that $x=K_{t}y$ has huge elements. This phenomenon will happen even if we normalize the columns of $K_{t}$ to have unit norm, and it causes instabilities when the computation is carried out using floating-point arithmetic.

We need a better basis for $\mathcal{K}_{t}$ for stability, and we need to exploit the special properties of $AK_{t}$ for efficiency. Minres does both. It uses a stable basis that can be computed efficiently, and it combines the three basic tools in a clever way to achieve efficiency.

An orthonormal basis $Q_{t}$ for $\mathcal{K}_{t}$ would work better, because for $x=Q_{t}z$ we would have $\|z\|_{2}=\|x\|_{2}$. There are many orthonormal bases for $\mathcal{K}_{t}$. We choose a particular basis that we can compute incrementally, one basis column in each iteration. The basis that we use consists of the columns of the $Q$ factor from the $QR$ factorization $K_{t}=Q_{t}R_{t}$, where $Q_{t}$ is orthonormal and $R_{t}$ is upper triangular. We compute $Q_{t}$ using Gram-Schmidt orthogonalization. The Gram-Schmidt process is not always numerically stable when carried out in floating-point arithmetic, but it is the only efficient way to compute $Q_{t}$ one column at a time.

We now show how to compute, given $Q_{t}=\begin{bmatrix}q_{1}&\cdots&q_{t}\end{bmatrix}$, the next basis column $q_{t+1}$. The key to the efficient computation of $q_{t+1}$ is the special relationship of the matrices $Q_{t}$ with $A$. We assume by induction that $K_{t}=Q_{t}R_{t}$ and that $r_{t,t}\neq 0$. If $r_{t,t}=0$, then it is not hard to show that the exact solution $x$ is in $\mathcal{K}_{t}$, so we would have found it in one of the previous iterations. If we reached iteration $t$, then $x\not\in\mathcal{K}_{t}$, so $r_{t,t}\neq 0$. Expanding the last column of $K_{t}=Q_{t}R_{t}$, we get

A^{t-1}b=r_{1,t}q_{1}+r_{2,t}q_{2}+\cdots+r_{t-1,t}q_{t-1}+r_{t,t}q_{t}\;.

We now isolate $q_{t}$ to obtain

q_{t}=r_{t,t}^{-1}\left(A^{t-1}b-r_{1,t}q_{1}-r_{2,t}q_{2}-\cdots-r_{t-1,t}q_{t-1}\right)\;,

so

Aq_{t}=r_{t,t}^{-1}\left(A^{t}b-r_{1,t}Aq_{1}-r_{2,t}Aq_{2}-\cdots-r_{t-1,t}Aq_{t-1}\right)\;.

Because $q_{t}\in\mathcal{K}_{t}$, clearly $Aq_{t}\in\mathcal{K}_{t+1}$. Furthermore, if we reached iteration $t+1$, then $Aq_{t}\not\in\mathcal{K}_{t}$, because if $Aq_{t}\in\mathcal{K}_{t}$ then $x\in\mathcal{K}_{t}$. Therefore, we can orthogonalize $Aq_{t}$ with respect to $q_{1},\ldots,q_{t}$ and normalize to obtain $q_{t+1}$,

\tilde{q}_{t+1} = Aq_{t}-\left(q_{t}^{*}Aq_{t}\right)q_{t}-\cdots-\left(q_{1}^{*}Aq_{t}\right)q_{1}
q_{t+1} = \tilde{q}_{t+1}/\left\|\tilde{q}_{t+1}\right\|_{2}\;.

These expressions allow us not only to compute $q_{t+1}$, but also to express $Aq_{t}$ as a linear combination of $q_{1},\ldots,q_{t},q_{t+1}$,

Aq_{t} = \left\|\tilde{q}_{t+1}\right\|_{2}q_{t+1}+\left(q_{t}^{*}Aq_{t}\right)q_{t}+\cdots+\left(q_{1}^{*}Aq_{t}\right)q_{1}
= h_{t+1,t}q_{t+1}+h_{t,t}q_{t}+\cdots+h_{1,t}q_{1}\;,

where $h_{t+1,t}=\|\tilde{q}_{t+1}\|_{2}$, and where $h_{j,t}=q_{j}^{*}Aq_{t}$ for $j\leq t$. The same argument also holds for $Aq_{1},\ldots,Aq_{t-1}$, so in matrix terms,

AQ_{t}=Q_{t+1}\tilde{H}_{t}\;.

The matrix $\tilde{H}_{t}\in\mathbb{R}^{(t+1)\times t}$ is upper Hessenberg: in column $j$, rows $j+2$ to $t+1$ are zero. If we multiply both sides of this equation from the left by $Q_{t}^{*}$, we obtain

Q_{t}^{*}AQ_{t} = Q_{t}^{*}Q_{t+1}\tilde{H}_{t}
= \left(\begin{array}{cccc}1&&&0\\ &\ddots&&\vdots\\ &&1&0\end{array}\right)\tilde{H}_{t}
= \left(\tilde{H}_{t}\right)_{1\colon t,1\colon t}\;.

We denote the first $t$ rows of $\tilde{H}_{t}$ by $H_{t}$ to obtain

Q_{t}^{*}AQ_{t}=H_{t}\;.

Because $A$ is symmetric, $H_{t}$ must be symmetric, and hence tridiagonal. The symmetry of $H_{t}$ implies that in column $j$ of $H_{t}$ and $\tilde{H}_{t}$, rows $1$ through $j-2$ are also zero. This is the key to the efficiency of minres and similar algorithms, because it implies that for $j\leq t-2$, $h_{j,t}=q_{j}^{*}Aq_{t}=0$; we do not need to compute these inner products and we do not need to subtract $\left(q_{j}^{*}Aq_{t}\right)q_{j}$ from $Aq_{t}$ in the orthogonalization process. As we shall shortly see, we do not even need to store $q_{j}$ for $j\leq t-2$.
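The three-term recurrence that this argument justifies is the Lanczos iteration. The Python sketch below builds the basis $Q_t$ and the tridiagonal $H_t$ while orthogonalizing each new vector only against the previous two; it is a bare-bones illustration (no reorthogonalization, dense storage), not the basis construction used inside a production minres code.

```python
import numpy as np

def lanczos(A, b, t):
    """Build an orthonormal basis of the Krylov subspace K_t(A, b) with the
    three-term (Lanczos) recurrence: each new vector is orthogonalized only
    against the previous two basis vectors.  Returns Q (n-by-t) and the
    tridiagonal H (t-by-t).  Minimal sketch for a symmetric A."""
    n = b.shape[0]
    Q = np.zeros((n, t + 1))
    H = np.zeros((t + 1, t))
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(t):
        v = A @ Q[:, j]
        H[j, j] = Q[:, j] @ v
        v -= H[j, j] * Q[:, j]
        if j > 0:
            H[j - 1, j] = H[j, j - 1]        # symmetry: h_{j-1,j} = h_{j,j-1}
            v -= H[j - 1, j] * Q[:, j - 1]
        H[j + 1, j] = np.linalg.norm(v)
        if H[j + 1, j] == 0:                 # exact solution reached; stop early
            return Q[:, :j + 1], H[:j + 1, :j + 1]
        Q[:, j + 1] = v / H[j + 1, j]
    return Q[:, :t], H[:t, :t]
```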

Now that we have an easy-to-compute orthonormal basis for $\mathcal{K}_{t}$, we return to our least-squares problem

\min_{x\in\mathcal{K}_{t}}\|Ax-b\|_{2} = \min_{y\in\mathbb{R}^{t}}\|AK_{t}y-b\|_{2}
= \min_{z\in\mathbb{R}^{t}}\|AQ_{t}z-b\|_{2}
= \min_{z\in\mathbb{R}^{t}}\|Q_{t+1}\tilde{H}_{t}z-b\|_{2}\;.

Our strategy now is to compute the minimizer $z$ from the expression in the last line. Once we compute $z$, the minimizer $x$ in $\mathcal{K}_{t}$ is $x=Q_{t}z$. To compute the minimizer, we use the equality

\arg\min_{z\in\mathbb{R}^{t}}\|Q_{t+1}\tilde{H}_{t}z-b\|_{2} = \arg\min_{z\in\mathbb{R}^{t}}\|\tilde{H}_{t}z-Q_{t+1}^{*}b\|_{2}
= \arg\min_{z\in\mathbb{R}^{t}}\left\|\tilde{H}_{t}z-\|b\|_{2}e_{1}\right\|_{2}\;,

where $e_{1}$ is the first unit vector. The equality $Q_{t+1}^{*}b=\|b\|_{2}e_{1}$ holds because $Q_{t+1}$ is the orthogonal factor in the $QR$ factorization of $K_{t+1}$, whose first column is $b$. To solve this least-squares problem, we use a $QR$ factorization $\tilde{H}_{t}=V_{t}U_{t}$ of $\tilde{H}_{t}$: the minimizer $z$ is then defined by $U_{t}z=\|b\|_{2}V_{t}^{*}e_{1}$. Because $\tilde{H}_{t}\in\mathbb{R}^{(t+1)\times t}$ is tridiagonal, we can compute its $QR$ factorization with a sequence of $t$ Givens rotations, where the $i$th rotation transforms rows $i$ and $i+1$ of $\tilde{H}_{t}$ and of $\|b\|_{2}e_{1}$. A Givens rotation is a unitary matrix of the form

\begin{bmatrix}1\\ &\ddots\\ &&1\\ &&&\cos\theta&\sin\theta\\ &&&-\sin\theta&\cos\theta\\ &&&&&1\\ &&&&&&\ddots\\ &&&&&&&1\end{bmatrix}\;.

We choose $\theta$ so as to annihilate the subdiagonal element in column $i$ of $\tilde{H}_{t}$.

One difficulty that arises is that $z=z_{t}$ changes completely in every iteration, so to form $x=Q_{t}z$, we would need either to store all the columns of $Q_{t}$ to produce the approximate solution $x$, or to recompute $Q_{t}$ once we obtain a $z$ that ensures a small-enough residual. Fortunately, there is a cheaper alternative that only requires storing a constant number of vectors. Let $M_{t}=Q_{t}U_{t}^{-1}$, and let $w=U_{t}z$. Instead of solving for $z$, we only compute $w=\|b\|_{2}V_{t}^{*}e_{1}$. Because the $i$th rotation only transforms rows $i$ and $i+1$ of $\|b\|_{2}e_{1}$, we can compute $w$ one entry at a time. Since $x=Q_{t}z=Q_{t}U_{t}^{-1}U_{t}z=M_{t}w$, we can accumulate $x$ using the columns of $M_{t}$. We compute the columns of $M_{t}$ one at a time using the triangular linear system $M_{t}U_{t}=Q_{t}$; in iteration $t$, we compute the last column of $M_{t}$ from $U_{t}$ and the last column of $Q_{t}$.

minres($A$, $b$)
  $q_{1}=b/\|b\|_{2}$ \triangleright the first column of $Q_{t}$
  $w_{1}=\|b\|_{2}$ \triangleright the first element of the vector $w$
  for $t=1,2,\ldots$ until convergence
    compute $Aq_{t}$
    $\tilde{q}_{t+1}=Aq_{t}-\left(q_{t}^{*}Aq_{t}\right)q_{t}-\left(q_{t-1}^{*}Aq_{t}\right)q_{t-1}$
    $H_{t+1,t}=\|\tilde{q}_{t+1}\|_{2}$
    $H_{t,t}=q_{t}^{*}Aq_{t}$
    $H_{t-1,t}=q_{t-1}^{*}Aq_{t}$
    $q_{t+1}=\tilde{q}_{t+1}/\left\|\tilde{q}_{t+1}\right\|_{2}$
    \triangleright Apply rotation $t-2$ to $H_{\colon,t}$ (rows $t-2$ and $t-1$)
    $U_{t-2,t}=s_{t-2}H_{t-1,t}$
    if $t>2$ then $U_{t-1,t}=c_{t-2}H_{t-1,t}$ else $U_{t-1,t}=H_{t-1,t}$
    \triangleright Apply rotation $t-1$ to $H_{\colon,t}$ (rows $t-1$ and $t$)
    if $t>1$ then $U_{t,t}=-s_{t-1}U_{t-1,t}+c_{t-1}H_{t,t}$ else $U_{t,t}=H_{t,t}$
    $U_{t-1,t}=c_{t-1}U_{t-1,t}+s_{t-1}H_{t,t}$
    compute $\left[\begin{smallmatrix}c_{t}&s_{t}\\ -s_{t}&c_{t}\end{smallmatrix}\right]$, the Givens rotation such
     that $\left[\begin{smallmatrix}c_{t}&s_{t}\\ -s_{t}&c_{t}\end{smallmatrix}\right]\left[\begin{smallmatrix}U_{t,t}\\ H_{t+1,t}\end{smallmatrix}\right]=\left[\begin{smallmatrix}\textrm{anything}\\ 0\end{smallmatrix}\right]$
    \triangleright Apply rotation $t$ to $H_{\colon,t}$
    $U_{t,t}=c_{t}U_{t,t}+s_{t}H_{t+1,t}$
    \triangleright Apply rotation $t$ to form $w=U_{t}z=V_{t}^{*}\|b\|e_{1}$
    $w_{t+1}=-s_{t}w_{t}$
    $w_{t}=c_{t}w_{t}$
    $m_{t}=U_{t,t}^{-1}\left(q_{t}-U_{t-1,t}m_{t-1}-U_{t-2,t}m_{t-2}\right)$ \triangleright next column of $M$
    $x^{(t)}=x^{(t-1)}+w_{t}m_{t}$
  end for

Figure 2.1.1. Minres. To keep the pseudo-code simple, we use the convention that vectors and matrix/vector elements with nonpositive indices are zeros. By this convention, $x^{(0)}=q_{0}=m_{0}=m_{-1}=0$, and so on. In an actual code, this convention can be implemented either using explicit zero vectors or using conditionals.

2.2. The Conjugate Gradients Algorithm

Minres is a variant of an older and more well-known algorithm called the Conjugate Gradients method (cg). The Conjugate Gradients method is only guaranteed to work when $A$ is symmetric positive definite, whereas minres only requires $A$ to be symmetric. Conjugate Gradients also minimizes the norm of the residual over the same Krylov subspaces, but not in the 2-norm: it minimizes the $A^{-1}$-norm,

\|Ax^{(t)}-b\|_{A^{-1}}=\sqrt{\left(Ax^{(t)}-b\right)^{*}A^{-1}\left(Ax^{(t)}-b\right)}\;.

This is equivalent to minimizing the $A$-norm of the error,

\|x-x^{(t)}\|_{A}=\sqrt{\left(x-x^{(t)}\right)^{*}A\left(x-x^{(t)}\right)}\;.

Minimizing the $A^{-1}$-norm of the residual may seem like an odd idea, because of the dependence on $A$ in the measurement of the residual and the error. But there are several good reasons not to worry. First, we use the 2-norm of the residual as a stopping criterion, to stop the iterations only when the 2-norm is small enough, not when the $A^{-1}$-norm is small. Second, even when the stopping criterion is based on the 2-norm of the residual, Conjugate Gradients usually converges only slightly slower than minres. Still, the minimization of the 2-norm of the residual in minres is more elegant.

Why do people use Conjugate Gradients if minres is more theoretically appealing? The main reason is that the matrices $M_{t}=Q_{t}U_{t}^{-1}$ and $U_{t}$ that are used in minres to form $x^{(t)}$ can be ill conditioned. This means that in floating-point arithmetic, the computed $x^{(t)}$ are not always accurate minimizers. In the Conjugate Gradients method, the columns of the basis matrix (the equivalent of $M_{t}$) are $A$-conjugate, not arbitrary. This reduces the inaccuracies in the computation of the approximate solution in each iteration, so the method is numerically more stable than minres.

We shall not derive the details of the Conjugate Gradient method here. The derivation is similar to that of minres (there are also other ways to derive cg), and is presented in many textbooks.

2.3. Convergence-Rate Bounds

The appeal of Krylov-subspace iterations stems to some extent from the fact that their convergence is easy to understand and to bound.

The crucial step in the analysis is the expression of the residual $b-Ax^{(t)}$ as an application of a univariate polynomial $\tilde{p}$ to $A$ and a multiplication of the resulting matrix $\tilde{p}(A)$ by $b$. Since $x^{(t)}\in\mathcal{K}_{t}$, $x^{(t)}=K_{t}y$ for some $y$. That is, $x^{(t)}=y_{1}b+y_{2}Ab+\cdots+y_{t}A^{t-1}b$. Therefore,

b-Ax^{(t)}=b-y_{1}Ab-y_{2}A^{2}b-\cdots-y_{t}A^{t}b\;.

If we denote $p(z)=1-y_{1}z-y_{2}z^{2}-\cdots-y_{t}z^{t}=1-z\tilde{p}(z)$, we obtain $b-Ax^{(t)}=p(A)b$.

Definition 2.

Let $x^{(t)}\in\mathcal{K}_{t}$ be an approximate solution of $Ax=b$ and let $r^{(t)}=b-Ax^{(t)}$ be the corresponding residual. The polynomial $\tilde{p}_{t}$ such that $x^{(t)}=\tilde{p}_{t}(A)b$ is called the solution polynomial of the iteration and the polynomial $p_{t}(z)=1-z\tilde{p}_{t}(z)$ is called the residual polynomial of the iteration.

Figure 2.3.1 shows several minres residual polynomials.

Figure 2.3.1. Minres residual polynomials of four linear problems. The order $n$ of the matrices is shown in each plot, as well as the degree of the polynomials. The eigenvalues of the matrices are shown using gray tick marks on the $x$ axis. In all but the bottom-right plot the solution vector is a random vector; in the bottom-right plot, the solution vector (and hence also the right-hand side) is a random combination of only 100 eigenvectors of $A$, those associated with the 100 smallest eigenvalues.

We now express $p(A)$ in terms of the eigendecomposition of $A$. Let $A=V\Lambda V^{*}$ be an eigendecomposition of $A$. Since $A$ is Hermitian, $\Lambda$ is real and $V$ is unitary. We have

p(A)b=p\left(V\Lambda V^{*}\right)b=Vp(\Lambda)V^{*}b\;,

so

(2.3.1) \left\|b-Ax^{(t)}\right\|_{2}=\left\|p(A)b\right\|_{2}=\left\|Vp(\Lambda)V^{*}b\right\|_{2}=\left\|p(\Lambda)V^{*}b\right\|_{2}\;.

We can obtain several bounds on the norm of the residual from this expression. The most important one is

\left\|b-Ax^{(t)}\right\|_{2} = \left\|p(\Lambda)V^{*}b\right\|_{2}
\leq \left\|p(\Lambda)\right\|_{2}\left\|V^{*}b\right\|_{2}=\left\|p(\Lambda)\right\|_{2}\left\|b\right\|_{2}
= \max_{i=1}^{n}\left\{\left|p\left(\lambda_{i}\right)\right|\right\}\left\|b\right\|_{2}\;.

The last equality follows from the facts that $p(\Lambda)$ is a diagonal matrix and that the 2-norm of a diagonal matrix is the largest absolute value of an element in it. This proves the following result.

Theorem 3.

The relative 2-norm of the residual in the $t$th iteration of minres,

\left\|b-Ax^{(t)}\right\|_{2}/\left\|b\right\|_{2}\;,

is bounded by $\max_{i=1}^{n}\left\{\left|p\left(\lambda_{i}\right)\right|\right\}$ for any univariate polynomial $p$ of degree $t$ such that $p(0)=1$, where the $\lambda_{i}$'s are the eigenvalues of $A$.

We can strengthen this result by noting that if $b$ is a linear combination of only some of the eigenvectors of $A$, then only the action of $p$ on the corresponding eigenvalues matters (not on all the eigenvalues). An example of this behavior is shown in the bottom-right plot of Figure 2.3.1. More formally, from (2.3.1) we obtain

\left\|p(\Lambda)V^{*}b\right\|_{2}=\left\|\begin{bmatrix}p(\lambda_{1})(v_{1}^{*}b)\\ p(\lambda_{2})(v_{2}^{*}b)\\ \vdots\\ p(\lambda_{n})(v_{n}^{*}b)\end{bmatrix}\right\|_{2}\;.

In general, if $b$ is orthogonal or nearly orthogonal to an eigenvector $v_{j}$ of $A$, then the product $p(\lambda_{j})(v_{j}^{*}b)$ can be small even if $p(\lambda_{j})$ is quite large. But since right-hand sides $b$ with this property are rare in practice, this stronger bound is not useful for us.

Theorem 3 states that if there are low-degree polynomials that are low on the eigenvalues of $A$ and assume the value $1$ at $0$, then minres converges quickly. Let us examine a few examples. A degree-$n$ polynomial $p$ can satisfy $p(0)=1$ and $p(\lambda_{i})=0$ simultaneously for any set of $n$ nonzero eigenvalues $\lambda_{1},\ldots,\lambda_{n}$. Therefore, in the absence of rounding errors minres must converge after $n$ iterations to the exact solution. We could also derive this exact-convergence result from the fact that $\mathcal{K}_{n}=\mathbb{R}^{n}$, but the argument that we just gave characterizes the minres polynomials at or near convergence: their roots are at or near the eigenvalues of $A$. If $A$ has repeated eigenvalues, then it has fewer than $n$ distinct eigenvalues, so we expect exact convergence after fewer than $n$ iterations. Even if $A$ does not have repeated eigenvalues, but it does have only a few tight clusters of eigenvalues, then minres will converge quickly, because a polynomial with one root near every cluster and a bounded derivative at the roots will assume low values at all the eigenvalues. On the other hand, a residual polynomial cannot have small values very close to $0$, because it must assume the value $1$ at $0$. These examples lead us to the most important observation about Krylov-subspace solvers:

Symmetric Krylov-subspace iterative methods for solving linear systems of equations $Ax=b$ converge quickly if the eigenvalues of $A$ form a few tight clusters and if $A$ does not have eigenvalues very close to $0$.

Scaling both $A$ and $b$ can cluster the eigenvalues or move them away from zero, but it has no effect at all on convergence. Scaling up $A$ and $b$ moves the eigenvalues away from zero, but distributes them on a larger interval; scaling $A$ and $b$ down clusters the eigenvalues around zero, but this brings them closer to zero. Krylov-subspace iterations are invariant to scaling.

This observation leads to two questions, one analytic and one constructive: (1) Exactly how quickly does the iteration converge given some characterization of the spectrum of $A$? (2) How can we alter the spectrum of $A$ in order to accelerate convergence? We shall start with the second question.

2.4. Preconditioning

Suppose that we have a matrix $B$ that approximates $A$ (in a sense that will become clear shortly), and whose inverse is easier to apply than the inverse of $A$. That is, linear systems of the form $Bz=r$ are much easier to solve for $z$ than linear systems $Ax=b$. Perhaps the sparse Cholesky factorization of $B$ is cheaper to compute than $A$'s, and perhaps there is another inexpensive way to apply $B^{-1}$ to $r$. If $B$ approximates $A$ in the sense that $B^{-1}A$ is close to the identity, then an algorithm like minres will converge quickly when applied to the linear system

(2.4.1) \left(B^{-1}A\right)x=B^{-1}b\;,

because the eigenvalues of the coefficient matrix $B^{-1}A$ are clustered around $1$. We initialize the algorithm by computing the right-hand side $B^{-1}b$, and in every iteration we multiply $q_{t}$ by $A$ and then apply $B^{-1}$ to the product. This technique is called preconditioning.

The particular form of preconditioning that we used in (2.4.1) is called left preconditioning, because the inverse of the preconditioner $B$ is applied to both sides of $Ax=b$ from the left. Left preconditioning is not appropriate for algorithms like minres and Conjugate Gradients that exploit the symmetry of the coefficient matrix, because in general $B^{-1}A$ is not symmetric. We could replace minres by a Krylov-subspace iterative solver that is applicable to unsymmetric matrices, but this would force us to give up either the efficiency or the optimality of minres and Conjugate Gradients.

Fortunately, if $B$ is symmetric positive definite, then there are forms of preconditioning that are appropriate for symmetric Krylov-subspace solvers. One form of symmetric preconditioning solves

(2.4.2) \left(B^{-1/2}AB^{-1/2}\right)\left(B^{1/2}x\right)=B^{-1/2}b

for $x$. A clever transformation of Conjugate Gradients yields an iterative algorithm in which the matrix that generates the Krylov subspace is $\left(B^{-1/2}AB^{-1/2}\right)$, but which only applies $A$ and $B^{-1}$ in every iteration. In other words, $B^{-1/2}$ is never used, not even as a linear operator. We shall not show this transformation here.

We will show how to use a simpler but equally effective form of preconditioning. Let $B=LL^{T}$ be the Cholesky factorization of $B$. We shall solve

(2.4.3) \left(L^{-1}AL^{-T}\right)\left(L^{T}x\right)=L^{-1}b

for $y=L^{T}x$ using minres, and then we will solve $L^{T}x=y$ by substitution. To form the right-hand side $L^{-1}b$, we solve for it by substitution as well. The coefficient matrix $L^{-1}AL^{-T}$ is clearly symmetric, so we can indeed apply minres to it. To apply the coefficient matrix to $q_{t}$ in every iteration, we apply $L^{-T}$ by substitution, apply $A$, and apply $L^{-1}$ by substitution.
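A short Python sketch of this preconditioned solve follows. The argument krylov_solve is an assumed stand-in for any symmetric Krylov solver (such as minres) that accepts a matrix-vector-product routine and a right-hand side; the triangular solves come from scipy, and dense factors stand in for sparse ones.

```python
import numpy as np
from scipy.linalg import solve_triangular

def preconditioned_solve(A, B, b, krylov_solve):
    """Solve A x = b through the symmetrically preconditioned system (2.4.3),
    (L^{-1} A L^{-T}) (L^T x) = L^{-1} b, where B = L L^T.  Minimal dense
    sketch; krylov_solve(apply, c) is an assumed solver interface."""
    L = np.linalg.cholesky(B)                              # B = L L^T
    def apply(v):                                          # v -> L^{-1} A L^{-T} v
        u = solve_triangular(L, v, lower=True, trans='T')  # u = L^{-T} v
        w = A @ u
        return solve_triangular(L, w, lower=True)          # L^{-1} w
    c = solve_triangular(L, b, lower=True)                 # right-hand side L^{-1} b
    y = krylov_solve(apply, c)                             # y = L^T x
    return solve_triangular(L, y, lower=True, trans='T')   # x = L^{-T} y
```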

Figure 2.4.1. Minres residual polynomials for a 15-by-15 two-dimensional mesh (both graphs). The graph on the left shows a polynomial from the application of minres directly to the original linear problem $Ax=b$, and the graph on the right shows a polynomial from the application of minres to a preconditioned problem $\left(L^{-1}AL^{-T}\right)\left(L^{T}x\right)=\left(L^{-1}b\right)$. The scaling of the axes in the two graphs is the same.

Figure 2.4.1 shows the spectrum and one minres polynomial for two matrices: a two-dimensional mesh and the same mesh preconditioned as in (2.4.3) with a Joshi preconditioner $B$. We can see that this particular preconditioner causes three changes in the spectrum. The most important change is that the small eigenvalues of $L^{-1}AL^{-T}$ are much larger than the small eigenvalues of $A$. This allows the minres polynomial to assume much smaller values on the spectrum. The minres polynomials must assume the value $1$ at $0$, so they tend to have large values in the neighborhood of $0$ as well. Therefore, a spectrum whose smallest eigenvalues are larger leads to faster convergence. Indeed, the preconditioned problem decreased the size of the residual by a factor of $10^{14}$ in 46 iterations, whereas the unpreconditioned problem took 110 iterations to achieve a residual with a similar norm. Two other changes that the preconditioner caused are a large gap in the middle of the spectrum, which is a good thing, and a larger largest eigenvalue, which is not. But these two changes are probably less important than the large increase in the smallest eigenvalues.

The different forms of preconditioning differ in the algorithmic details of the solver, but the preconditioned coefficient matrices that they produce all have the same spectrum.

Theorem 4.

Let AA be a symmetric matrix and let B=LLTB=LL^{T} be a symmetric positive-definite matrix. A scalar λ\lambda is either an eigenvalue of all the following eigenvalue problems or of none of them:

B1Ax\displaystyle B^{-1}Ax =\displaystyle= λx\displaystyle\lambda x
B1/2AB1/2y\displaystyle B^{-1/2}AB^{-1/2}y =\displaystyle= λy\displaystyle\lambda y
L1ALTz\displaystyle L^{-1}AL^{-T}z =\displaystyle= λz\displaystyle\lambda z
Aw\displaystyle Aw =\displaystyle= λBw\displaystyle\lambda Bw
Proof.

The following relations prove the equivalence of the spectra:

x\displaystyle x =\displaystyle= B1/2y\displaystyle B^{-1/2}y
y\displaystyle y =\displaystyle= B1/2x\displaystyle B^{1/2}x
x\displaystyle x =\displaystyle= LTz\displaystyle L^{-T}z
z\displaystyle z =\displaystyle= LTx\displaystyle L^{T}x
x\displaystyle x =\displaystyle= w\displaystyle w

Substituting each of these relations into the corresponding eigenvalue problem transforms it into the problem B1Ax=λxB^{-1}Ax=\lambda x with the same λ\lambda, which proves the equivalence of the spectra. ∎
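As a sanity check (added for illustration; not in the original text), the equivalence can be verified numerically on a small example. The sketch below builds random SPD matrices A and B, computes the four spectra with dense eigensolvers, and confirms that they agree; all matrices here are hypothetical test data.

import numpy as np
from scipy.linalg import cholesky, eig, eigh, solve_triangular

rng = np.random.default_rng(1)
n = 5
MA = rng.standard_normal((n, n)); A = MA @ MA.T + n * np.eye(n)   # random SPD A
MB = rng.standard_normal((n, n)); B = MB @ MB.T + n * np.eye(n)   # random SPD B
L = cholesky(B, lower=True)

w, V = eigh(B)                                    # B^{-1/2} from the eigendecomposition of B
Bmh = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

ev1 = np.sort(eig(np.linalg.solve(B, A))[0].real)     # B^{-1} A x = lambda x
ev2 = np.sort(eigh(Bmh @ A @ Bmh)[0])                 # B^{-1/2} A B^{-1/2} y = lambda y
Linv = solve_triangular(L, np.eye(n), lower=True)     # explicit L^{-1}: fine for a tiny test
ev3 = np.sort(eigh(Linv @ A @ Linv.T)[0])             # L^{-1} A L^{-T} z = lambda z
ev4 = np.sort(eigh(A, B)[0])                          # A w = lambda B w (generalized problem)

print(np.allclose(ev1, ev2), np.allclose(ev2, ev3), np.allclose(ev3, ev4))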

We can strengthen this theorem to also include certain semidefinite preconditioners.

Theorem 5.

Let AA be a symmetric matrix and let B=LLTB=LL^{T} be a symmetric positive-semidefinite matrix such that null(B)=null(A)\textrm{null}(B)=\textrm{null}(A). Denote by X+X^{+} the pseudo-inverse of a matrix XX. A scalar λ\lambda is either an eigenvalue of all the following eigenvalue problems or of none of them:

B+Ax\displaystyle B^{+}Ax =\displaystyle= λx\displaystyle\lambda x
(B+)1/2A(B+)1/2y\displaystyle\left(B^{+}\right)^{1/2}A\left(B^{+}\right)^{1/2}y =\displaystyle= λy\displaystyle\lambda y
(L+)A(L+)Tz\displaystyle\left(L^{+}\right)A\left(L^{+}\right)^{T}z =\displaystyle= λz\displaystyle\lambda z
Aw\displaystyle Aw =\displaystyle= λBw\displaystyle\lambda Bw
Proof.

We note that null(L+)=null((L+)T)=null(B+)=null(B)=null(A)\textrm{null}(L^{+})=\textrm{null}((L^{+})^{T})=\textrm{null}(B^{+})=\textrm{null}(B)=\textrm{null}(A). Therefore, λ=0\lambda=0 is an eigenvalue of all the above problems. If λ0\lambda\neq 0 is an eigenvalue of one of the above problems, then the corresponding eigenvector is not in null(A)\textrm{null}(A). This implies that the relations defined in the proof of Theorem 4, with inverses replaced by pseudo-inverses, define relations between nonzero vectors. Therefore, λ\lambda is an eigenvalue of all the eigenvalue problems. ∎

Even though the different forms of preconditioning are equivalent in terms of the spectra of the coefficient matrices, they are different algorithmically. If symmetry is not an issue (e.g., if AA itself is unsymmetric), the form B1AB^{-1}A is the most general. When AA is symmetric, we usually require that BB is symmetric positive-definite (or semi-definite with the same null space as AA). In this case, the form B1/2AB1/2B^{-1/2}AB^{-1/2}, when coupled with the transformation that allows multiplications only by B1B^{-1} (and not by B1/2B^{-1/2}), is more widely applicable than the form L1ALTL^{-1}AL^{-T}, because the latter requires a Cholesky factorization of BB, whereas in the former any method of applying B1B^{-1} can be used.

One issue that arises with any form of preconditioning is the definition of the residual. If we apply minres to (L1ALT)(LTx)=L1b\left(L^{-1}AL^{-T}\right)\left(L^{T}x\right)=L^{-1}b, say, it minimizes the 22-norm of the preconditioned residual

L1b(L1ALT)(LTx(t))=L1bL1Ax(t)=L1(bAx(t)).L^{-1}b-\left(L^{-1}AL^{-T}\right)\left(L^{T}x^{(t)}\right)=L^{-1}b-L^{-1}Ax^{(t)}=L^{-1}\left(b-Ax^{(t)}\right)\;.

Thus, the true residual bAx(t)b-Ax^{(t)} in preconditioned minres may not be minimal. This is roughly the same issue as with the norms used in Conjugate Gradients: we minimize the residual in a norm that is related to AA.

2.5. Chebyshev Polynomials and Convergence Bounds

The link between Krylov-subspace iterations and polynomials suggests another idea. Given some information on the spectrum of AA, we can try to analytically define a sequence p~t\tilde{p}_{t} of solution polynomials such that for any bb the vector x(t)p~t(A)b𝒦tx^{(t)}\equiv\tilde{p}_{t}(A)b\in\mathcal{K}_{t} is a good approximation to xx, the exact solution of Ax=bAx=b. More specifically, we can try to define p~t\tilde{p}_{t} such that the residuals bAx(t)b-Ax^{(t)} are small. We have seen that the residual can be expressed as pt(A)bp_{t}(A)b, where pt(z)=1zp~t(z)p_{t}(z)=1-z\tilde{p}_{t}(z) is the residual polynomial. Therefore, if p~t\tilde{p}_{t} is such that ptp_{t} assumes low values on the eigenvalues of AA and satisfies pt(0)=1p_{t}(0)=1, then x(t)x^{(t)} is a good approximate solution. This idea can be used both to construct iterative solvers and to prove bounds on the convergence of methods like minres and Conjugate Gradients.

One obvious problem is that we do not know what the eigenvalues of AA are. Finding the eigenvalues is more difficult than solving Ax=bAx=b. However, in some cases we can use the structure of AA (even with preconditioning) to derive bounds on the smallest and largest eigenvalues of a positive-definite matrix AA, denoted λmin\lambda_{\min} and λmax\lambda_{\max}.

Suppose that we somehow obtained bounds on the extreme eigenvalues of AA,

0<ρminλminλiλmaxρmax.0<\rho_{\min}\leq\lambda_{\min}\leq\lambda_{i}\leq\lambda_{\max}\leq\rho_{\max}\;.

We shall not discuss here how we might obtain ρmin\rho_{\min} and ρmax\rho_{\max}; this is the topic of much of the rest of the book. It turns out that we can build a sequence ptp_{t} of polynomials such that

  1. pt(0)=1p_{t}(0)=1, and

  2. maxz[ρmin,ρmax]|pt(z)|\max_{z\in[\rho_{\min},\rho_{\max}]}\left|p_{t}(z)\right| is as small as possible for a degree tt polynomial with value 11 at 0.

The polynomials that solve this optimization problem are derived from Chebyshev polynomials, which can be defined using the recurrence

c0(z)\displaystyle c_{0}(z) =\displaystyle= 1\displaystyle 1
c1(z)\displaystyle c_{1}(z) =\displaystyle= z\displaystyle z
ct(z)\displaystyle c_{t}(z) =\displaystyle= 2zct1(z)ct2(z).\displaystyle 2zc_{t-1}(z)-c_{t-2}(z)\;.

The polynomials that reduce the residual are

pt(z)=1ct(ρmax+ρminρmaxρmin)ct(ρmax+ρmin2zρmaxρmin).p_{t}(z)=\frac{1}{c_{t}\left(\frac{\rho_{\max}+\rho_{\min}}{\rho_{\max}-\rho_{\min}}\right)}c_{t}\left(\frac{\rho_{\max}+\rho_{\min}-2z}{\rho_{\max}-\rho_{\min}}\right)\;.
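A small sketch (Python, illustrative only; the interval endpoints rho_min and rho_max are arbitrary made-up values) that evaluates p_t through the three-term recurrence and checks the two properties above:

import numpy as np

def cheb(t, z):
    # Chebyshev polynomial c_t evaluated at z via the three-term recurrence.
    c_prev, c_cur = 1.0, z
    if t == 0:
        return c_prev
    for _ in range(t - 1):
        c_prev, c_cur = c_cur, 2 * z * c_cur - c_prev
    return c_cur

def residual_poly(t, z, rho_min, rho_max):
    # The shifted-and-scaled residual polynomial p_t defined above.
    num = cheb(t, (rho_max + rho_min - 2 * z) / (rho_max - rho_min))
    den = cheb(t, (rho_max + rho_min) / (rho_max - rho_min))
    return num / den

rho_min, rho_max, t = 0.1, 4.0, 10                       # arbitrary illustrative interval
zs = np.linspace(rho_min, rho_max, 1001)
print(residual_poly(t, 0.0, rho_min, rho_max))                  # property 1: equals 1 at 0
print(np.max(np.abs(residual_poly(t, zs, rho_min, rho_max))))   # small on [rho_min, rho_max]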

An Iterative Linear Solver based on Chebyshev Polynomials

Our first application of Chebyshev polynomials is an iterative Krylov-subspace solver based on them. We will refer to this solver as the Krylov-Chebyshev solver. (The algorithm that we describe below is related to a better-known Chebyshev linear solver that is used with matrix splittings.) The polynomials ptp_{t} implicitly define polynomials p~t\tilde{p}_{t} that we can use to construct approximate solutions. The residual for an approximate solution x(t)x^{(t)} is r(t)=bAx(t)r^{(t)}=b-Ax^{(t)}. If we define x(t)=p~t(A)bx^{(t)}=\tilde{p}_{t}(A)b, we have r(t)=pt(A)b=bAp~t(A)br^{(t)}=p_{t}(A)b=b-A\tilde{p}_{t}(A)b.

We now derive recurrences for x(t)x^{(t)} and r(t)r^{(t)}. To keep the notation simple, we define ρ+=ρmax+ρmin\rho_{+}=\rho_{\max}+\rho_{\min} and ρ=ρmaxρmin\rho_{-}=\rho_{\max}-\rho_{\min}. For the two base cases we have

r(0)=p0(A)b\displaystyle r^{(0)}=p_{0}(A)b =\displaystyle= c01(ρ+ρ)c0(ρ+ρI2ρA)b=Ib=b\displaystyle c_{0}^{-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)c_{0}\left(\frac{\rho_{+}}{\rho_{-}}I-\frac{2}{\rho_{-}}A\right)b=Ib=b
r(1)=p1(A)b\displaystyle r^{(1)}=p_{1}(A)b =\displaystyle= c11(ρ+ρ)c1(ρ+ρI2ρA)b\displaystyle c_{1}^{-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)c_{1}\left(\frac{\rho_{+}}{\rho_{-}}I-\frac{2}{\rho_{-}}A\right)b
=\displaystyle= (ρ+ρ)1(ρ+ρI2ρA)b=b2ρ+Ab.\displaystyle\left(\frac{\rho_{+}}{\rho_{-}}\right)^{-1}\left(\frac{\rho_{+}}{\rho_{-}}I-\frac{2}{\rho_{-}}A\right)b=b-\frac{2}{\rho_{+}}Ab\;.

This implies that

x(0)\displaystyle x^{(0)} =\displaystyle= 0\displaystyle 0
x(1)\displaystyle x^{(1)} =\displaystyle= 2ρ+b.\displaystyle\frac{2}{\rho_{+}}b\;.

For t2t\geq 2 we have

pt(A)b\displaystyle p_{t}(A)b =\displaystyle= ct1(ρ+ρ)ct(ρ+ρI2ρA)b\displaystyle c_{t}^{-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)c_{t}\left(\frac{\rho_{+}}{\rho_{-}}I-\frac{2}{\rho_{-}}A\right)b
=\displaystyle= ct1(ρ+ρ)(2(ρ+ρI2ρA)ct1(ρ+ρI2ρA)ct2(ρ+ρI2ρA))b\displaystyle c_{t}^{-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)\left(2\left(\frac{\rho_{+}}{\rho_{-}}I-\frac{2}{\rho_{-}}A\right)c_{t-1}\left(\frac{\rho_{+}}{\rho_{-}}I-\frac{2}{\rho_{-}}A\right)-c_{t-2}\left(\frac{\rho_{+}}{\rho_{-}}I-\frac{2}{\rho_{-}}A\right)\right)b
=\displaystyle= ct1(ρ+ρ)(2(ρ+ρI2ρA)ct1(ρ+ρ)pt1(A)ct2(ρ+ρ)pt2(A))b\displaystyle c_{t}^{-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)\left(2\left(\frac{\rho_{+}}{\rho_{-}}I-\frac{2}{\rho_{-}}A\right)c_{t-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)p_{t-1}\left(A\right)-c_{t-2}\left(\frac{\rho_{+}}{\rho_{-}}\right)p_{t-2}\left(A\right)\right)b
=\displaystyle= ct1(ρ+ρ)(2(ρ+ρI2ρA)ct1(ρ+ρ)r(t1)ct2(ρ+ρ)r(t2)).\displaystyle c_{t}^{-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)\left(2\left(\frac{\rho_{+}}{\rho_{-}}I-\frac{2}{\rho_{-}}A\right)c_{t-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)r^{(t-1)}-c_{t-2}\left(\frac{\rho_{+}}{\rho_{-}}\right)r^{(t-2)}\right)\;.

To compute r(t)r^{(t)} from this recurrence, we need r(t1)r^{(t-1)} and r(t2)r^{(t-2)}, three elements of the sequence ct(ρ+/ρ)c_{t}(\rho_{+}/\rho_{-}), and one multiplication of a vector by AA. Therefore, we can compute r(t)r^{(t)} and ct(ρ+/ρ)c_{t}(\rho_{+}/\rho_{-}) concurrently in a loop. From the recurrence for r(t)r^{(t)} we can derive a recurrence for x(t)x^{(t)},

r(t)\displaystyle r^{(t)} =\displaystyle= pt(A)b\displaystyle p_{t}(A)b
=\displaystyle= ct1(ρ+ρ)(2ρ+ρct1(ρ+ρ)r(t1)4ρAct1(ρ+ρ)r(t1)ct2(ρ+ρ)r(t2))\displaystyle c_{t}^{-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)\left(\frac{2\rho_{+}}{\rho_{-}}c_{t-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)r^{(t-1)}-\frac{4}{\rho_{-}}Ac_{t-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)r^{(t-1)}-c_{t-2}\left(\frac{\rho_{+}}{\rho_{-}}\right)r^{(t-2)}\right)
=\displaystyle= ct1(ρ+ρ)(2ρ+ρct1(ρ+ρ)(bAx(t1))\displaystyle c_{t}^{-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)\biggl{(}\frac{2\rho_{+}}{\rho_{-}}c_{t-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)\left(b-Ax^{(t-1)}\right)
4ρAct1(ρ+ρ)r(t1)\displaystyle\quad\quad\quad\quad\quad\quad-\frac{4}{\rho_{-}}Ac_{t-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)r^{(t-1)}
ct2(ρ+ρ)(bAx(t2))).\displaystyle\quad\quad\quad\quad\quad\quad-c_{t-2}\left(\frac{\rho_{+}}{\rho_{-}}\right)\left(b-Ax^{(t-2)}\right)\biggr{)}\;.

so

x(t)\displaystyle x^{(t)} =\displaystyle= p~t(A)b\displaystyle\tilde{p}_{t}(A)b
=\displaystyle= ct1(ρ+ρ)(2ρ+ρct1(ρ+ρ)x(t1)\displaystyle c_{t}^{-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)\biggl{(}\frac{2\rho_{+}}{\rho_{-}}c_{t-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)x^{(t-1)}
+4ρct1(ρ+ρ)r(t1)\displaystyle\quad\quad\quad\quad\quad\quad+\frac{4}{\rho_{-}}c_{t-1}\left(\frac{\rho_{+}}{\rho_{-}}\right)r^{(t-1)}
ct2(ρ+ρ)x(t2)).\displaystyle\quad\quad\quad\quad\quad\quad-c_{t-2}\left(\frac{\rho_{+}}{\rho_{-}}\right)x^{(t-2)}\biggr{)}\;.
Refer to caption
Figure 2.5.1. Chebyshev residual polynomials for two 200200-by-200200 matrices with different spectra, where ρmin=λmin\rho_{\min}=\lambda_{\min} is set at the minimal eigenvalue and ρmax=λmax\rho_{\max}=\lambda_{\max} is set at the largest eigenvalue. The two matrices have the same extreme eigenvalues but otherwise their spectra are very different. The Chebyshev polynomials depend only on ρmax\rho_{\max} and ρmin\rho_{\min}. Compare to the minres polynomials for the similar spectra in Figure 2.3.1.

This algorithm converges more slowly than minres and Conjugate Gradients. Minres is guaranteed to minimize the residual over all x(t)x^{(t)} in 𝒦t\mathcal{K}_{t}. The solution that we obtain from the Krylov-Chebyshev recurrences is in 𝒦t\mathcal{K}_{t}, so it cannot yield a smaller residual than the minres solution. Its main algorithmic advantage over minres and Conjugate Gradients is that the polynomials p~t\tilde{p}_{t} that it produces depend only on tt and on ρmin\rho_{\min} and ρmax\rho_{\max}, but they do not depend on bb. Therefore, p~t(A)\tilde{p}_{t}(A) is a fixed linear operator, so it can be used as a preconditioner B1=p~t(A)B^{-1}=\tilde{p}_{t}(A). In contrast, the polynomials that minres and Conjugate Gradients generate depend on bb, so we cannot use these solvers as preconditioners unless we set very strict convergence bounds, in which case an application of these solvers is numerically indistinguishable from an application of A1A^{-1}.

Since this algorithm does not exploit the symmetry of AA, it can be used with left preconditioning, not only with symmetric forms of preconditioning.
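The recurrences above translate directly into an iteration that performs one multiplication by A per step. The following sketch is one possible transcription (Python; the test matrix with prescribed eigenvalues is invented for illustration), not an optimized implementation:

import numpy as np

def krylov_chebyshev(A, b, rho_min, rho_max, iters):
    # Krylov-Chebyshev iteration from the recurrences derived above.
    # rho_min and rho_max must bound the spectrum of A; one matvec per iteration.
    rp = rho_max + rho_min            # rho_+
    rm = rho_max - rho_min            # rho_-
    gamma = rp / rm                   # the argument rho_+ / rho_- of c_t

    x_prev, x_cur = np.zeros_like(b), (2.0 / rp) * b        # x^(0), x^(1)
    r_prev, r_cur = b.copy(), b - A @ x_cur                 # r^(0), r^(1)
    c_prev, c_cur = 1.0, gamma                              # c_0, c_1 at gamma

    for _ in range(2, iters + 1):
        c_next = 2 * gamma * c_cur - c_prev                 # Chebyshev recurrence for c_t
        Ar = A @ r_cur                                      # the only matvec in this step
        x_next = ((2 * rp / rm) * c_cur * x_cur
                  + (4.0 / rm) * c_cur * r_cur
                  - c_prev * x_prev) / c_next
        r_next = ((2 * rp / rm) * c_cur * r_cur
                  - (4.0 / rm) * c_cur * Ar
                  - c_prev * r_prev) / c_next
        x_prev, x_cur = x_cur, x_next
        r_prev, r_cur = r_cur, r_next
        c_prev, c_cur = c_cur, c_next
    return x_cur, r_cur

# Usage on a small SPD matrix with known extreme eigenvalues.
rng = np.random.default_rng(2)
n = 50
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.linspace(0.5, 20.0, n)
A = Q @ np.diag(lam) @ Q.T
b = rng.standard_normal(n)
x, r = krylov_chebyshev(A, b, lam.min(), lam.max(), 80)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # small relative residual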

Chebyshev-Based Convergence Bounds

We can also use the Chebyshev iteration to bound the convergence of minres and Conjugate Gradients. The key is the following theorem, which we state without a proof.

Theorem 6.

Let pt(z)p_{t}(z) be the polynomials defined above, with ρmin=λmin\rho_{\min}=\lambda_{\min} and ρmax=λmax\rho_{\max}=\lambda_{\max}. Denote by κ\kappa the ratio κ=λmax/λmin\kappa=\lambda_{\max}/\lambda_{\min}. For any z[λmin,λmax]z\in[\lambda_{\min},\lambda_{\max}] we have

|pt(z)|2((κ+1κ1)t+(κ+1κ1)t)12(κ1κ+1)t.\left|p_{t}(z)\right|\leq 2\left(\left(\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}\right)^{t}+\left(\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}\right)^{-t}\right)^{-1}\leq 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{t}\;.

In this theorem, 0<λminλmax0<\lambda_{\min}\leq\lambda_{\max} are arbitrary positive numbers that denote the ends of the interval that defines ptp_{t}. But when these numbers are the extreme eigenvalues of a symmetric positive definite matrix, the ratio κ=λmax/λmin\kappa=\lambda_{\max}/\lambda_{\min} plays an important-enough role in numerical linear algebra to deserve a name.

Definition 7.

Let AA be a symmetric semidefinite matrix, and let λmin\lambda_{\min} and λmax\lambda_{\max} be its extreme nonzero eigenvalues. The ratio

κ=λmaxλmin=A2A+2\kappa=\frac{\lambda_{\max}}{\lambda_{\min}}=\left\|A\right\|_{2}\left\|A^{+}\right\|_{2}

is called the spectral condition number of AA. The definition

κ=AA+\kappa=\left\|A\right\|\left\|A^{+}\right\|

generalizes the condition number to any matrix and to any norm.

We can use Theorem 6 to bound the residuals in minres.

Theorem 8.

Consider the application of minres to the linear system Ax=bAx=b. Let r(t)r^{(t)} be the minres residual at iteration tt. Then

r(t)2b22(κ1κ+1)t.\frac{\left\|r^{(t)}\right\|_{2}}{\left\|b\right\|_{2}}\leq 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{t}\;.
Proof.

The residual r(t)r^{(t)} is the minimal residual for any x(t)𝒦tx^{(t)}\in\mathcal{K}_{t}. Let r^(t)\hat{r}^{(t)} be the Krylov-Chebyshev residual for AA and bb and let ptp_{t} be the Krylov-Chebyshev residual polynomial. Let A=VΛVA=V\Lambda V^{*} be an eigendecomposition of AA. The theorem follows from the following inequalities.

r(t)2\displaystyle\left\|r^{(t)}\right\|_{2} \displaystyle\leq r^(t)2=p(A)b2=Vp(Λ)Vb2=p(Λ)Vb2\displaystyle\left\|\hat{r}^{(t)}\right\|_{2}=\left\|p(A)b\right\|_{2}=\left\|Vp(\Lambda)V^{*}b\right\|_{2}=\left\|p(\Lambda)V^{*}b\right\|_{2}
\displaystyle\leq p(Λ)V2b2=p(Λ)2b2\displaystyle\left\|p(\Lambda)V^{*}\right\|_{2}\left\|b\right\|_{2}=\left\|p(\Lambda)\right\|_{2}\left\|b\right\|_{2}
=\displaystyle= maxi{|p(λi)|}b2\displaystyle\max_{i}\left\{\left|p\left(\lambda_{i}\right)\right|\right\}\left\|b\right\|_{2}
\displaystyle\leq maxz[λmin,λmax]{|p(z)|}b2\displaystyle\max_{z\in[\lambda_{\min},\lambda_{\max}]}\left\{\left|p\left(z\right)\right|\right\}\left\|b\right\|_{2}
\displaystyle\leq 2(κ1κ+1)tb2.\displaystyle 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{t}\left\|b\right\|_{2}\;.

Similar results can be stated for the error and residual of Conjugate Gradients in the AA and A1A^{-1} norms, respectively.

For small κ\kappa, we can expect convergence to a fixed tolerance, say r(t)21012b2\left\|r^{(t)}\right\|_{2}\leq 10^{-12}\left\|b\right\|_{2}, within a constant number of iterations. As κ\kappa grows,

κ1κ+112κ,\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\approx 1-\frac{2}{\sqrt{\kappa}}\;,

so we are guaranteed convergence to a fixed tolerance within O(κ)O(\sqrt{\kappa}) iterations.
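As a small worked example of the bound (added for illustration), we can ask for the smallest t at which the right-hand side of Theorem 8 drops below a given tolerance; the count indeed grows roughly like the square root of the condition number:

import math

def iterations_for_tolerance(kappa, tol=1e-12):
    # Smallest t with 2 * ((sqrt(kappa)-1)/(sqrt(kappa)+1))**t <= tol (Theorem 8).
    rho = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
    return math.ceil(math.log(tol / 2.0) / math.log(rho))

for kappa in (10, 100, 1000, 10000):
    print(kappa, iterations_for_tolerance(kappa))   # grows roughly like sqrt(kappa)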

Chapter 3 Computing the Cholesky Factorization of Sparse Matrices

In many support preconditioners, the preconditioner BB is factored before the iterations begin. The Cholesky factorization of BB allows us to efficiently solve the correction equations Bz=rBz=r. This chapter explains the principles behind the factorization of sparse symmetric positive definite matrices.

3.1. The Cholesky Factorization

We first show that the Cholesky factorization A=LLTA=LL^{T} of a symmetric positive-definite (spd) matrix AA always exists.

A matrix AA is positive definite if xTAx>0x^{T}Ax>0 for all 0xn0\neq x\in\mathbb{R}^{n}. The same definition extends to complex Hermitian matrices. The matrix is positive semidefinite if xTAx0x^{T}Ax\geq 0 for all xnx\in\mathbb{R}^{n}; it is semidefinite but not definite if, in addition, xTAx=0x^{T}Ax=0 for some 0xn0\neq x\in\mathbb{R}^{n}.

To prove the existence of the factorization, we use induction and the construction shown in Chapter XXX. If AA is 11-by-11, then xTAx=A11x12>0x^{T}Ax=A_{11}x_{1}^{2}>0, so A11>0A_{11}>0, so it has a real square root. We set L11=A11L_{11}=\sqrt{A_{11}} and we are done. We now assume by induction that all spd matrices of dimension n1n-1 or smaller have a Cholesky factorization. We partition AA into a 22-by-22 block matrix and use the partitioning to construct a factorization,

A=[A11A21TA21A22].A=\begin{bmatrix}A_{11}&A_{21}^{T}\\ A_{21}&A_{22}\end{bmatrix}\;.

Because AA is spd, A11A_{11} must also be spd. If it is not positive definite, then a vector y0y\neq 0 such that yTA11y0y^{T}A_{11}y\leq 0 can be extended with zeros to a vector xx such that xTAx0x^{T}Ax\leq 0. Therefore, A11A_{11} has a Cholesky factor L11L_{11} by induction. The Cholesky factor of a nonsingular matrix must be nonsingular, so we can define L21=A21L11TL_{21}=A_{21}L_{11}^{-T} (L11TL_{11}^{-T} denotes the inverse of L11TL_{11}^{T}). The key step in the proof is to show that the Schur complement A22L21L21T=A22A21A111A21TA_{22}-L_{21}L_{21}^{T}=A_{22}-A_{21}A_{11}^{-1}A_{21}^{T} is also positive definite. Suppose for contradiction that it is not, and let z0z\neq 0 be such that zT(A22L21L21T)z0z^{T}\left(A_{22}-L_{21}L_{21}^{T}\right)z\leq 0. We define

x=[A111A21Tzz]=[wz].x=\begin{bmatrix}-A_{11}^{-1}A_{21}^{T}z\\ z\end{bmatrix}=\begin{bmatrix}w\\ z\end{bmatrix}\;.

Now we have

xTAx\displaystyle x^{T}Ax =\displaystyle= wTA11w+wTA21Tz+zTA21w+zTA22z\displaystyle w^{T}A_{11}w+w^{T}A_{21}^{T}z+z^{T}A_{21}w+z^{T}A_{22}z
=\displaystyle= (A111A21Tz)TA11(A111A21Tz)2zTA21(A111A21Tz)+zTA22z\displaystyle\left(A_{11}^{-1}A_{21}^{T}z\right)^{T}A_{11}\left(A_{11}^{-1}A_{21}^{T}z\right)-2z^{T}A_{21}\left(A_{11}^{-1}A_{21}^{T}z\right)+z^{T}A_{22}z
=\displaystyle= zTA21A111A21Tz2zTA21A111A21Tz+zTA22z\displaystyle z^{T}A_{21}A_{11}^{-1}A_{21}^{T}z-2z^{T}A_{21}A_{11}^{-1}A_{21}^{T}z+z^{T}A_{22}z
=\displaystyle= zT(A22L21L21T)z\displaystyle z^{T}\left(A_{22}-L_{21}L_{21}^{T}\right)z
\displaystyle\leq 0,\displaystyle 0\;,

which is impossible, so our supposition was wrong. Because A22L21L21TA_{22}-L_{21}L_{21}^{T} is spd, it has a Cholesky factor L22L_{22} by induction. The three blocks L11L_{11}, L21L_{21}, and L22L_{22} form a Cholesky factor for AA, since

A11\displaystyle A_{11} =\displaystyle= L11L11T\displaystyle L_{11}L_{11}^{T}
A21\displaystyle A_{21} =\displaystyle= L21L11T\displaystyle L_{21}L_{11}^{T}
A22\displaystyle A_{22} =\displaystyle= L21L21T+L22L22T.\displaystyle L_{21}L_{21}^{T}+L_{22}L_{22}^{T}\;.
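The inductive construction is itself an algorithm. The following recursive sketch (Python/NumPy, added for illustration; it always takes a 1-by-1 leading block and is not meant to be efficient) mirrors the proof:

import numpy as np

def cholesky_recursive(A):
    # Cholesky factor of an SPD matrix, following the 2-by-2 block construction above
    # with a 1-by-1 leading block: factor A11, form L21, then factor the Schur complement.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    L[0, 0] = np.sqrt(A[0, 0])                      # L11
    if n == 1:
        return L
    L[1:, 0] = A[1:, 0] / L[0, 0]                   # L21 = A21 L11^{-T}
    S = A[1:, 1:] - np.outer(L[1:, 0], L[1:, 0])    # Schur complement A22 - L21 L21^T
    L[1:, 1:] = cholesky_recursive(S)               # L22, by the inductive step
    return L

rng = np.random.default_rng(3)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)
L = cholesky_recursive(A)
print(np.allclose(L @ L.T, A))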

Symmetric positive semidefinite (spsd) matrices also have a Cholesky factorization, but in floating-point arithmetic, it is difficult to compute a Cholesky factor that is both backward stable and has the same rank as AA. To see that a factorization exists, we modify the construction as follows. If AA is 11-by-11, then if it is singular it must be exactly zero, in which case we can set L=AL=A. Otherwise, we partition AA, selecting A11A_{11} to be 11-by-11. If A110A_{11}\neq 0, then it is invertible and we can continue with the same proof, except that we show that the Schur complement is semidefinite, not definite. If A11=0A_{11}=0, then it is not hard to show that A21A_{21} must be zero as well. This implies that the equation A21=L21L11TA_{21}=L_{21}L_{11}^{T} is zero on both sides for any choice of L21L_{21}. We set L21=0L_{21}=0 and continue.

The difficulty in a floating-point implementation lies in deciding whether a computed A11A_{11} would be zero in exact arithmetic. In general, even if the next diagonal element A11=0A_{11}=0 in exact arithmetic, in floating-point it might be computed as a small nonzero value. Should we round it (and A21A_{21}) to zero? Treating it as a nonzero when in fact it should be treated as a zero leads to an unstable factorization. The small magnitude of the computed L11L_{11} can cause the elements of L21L_{21} to be large, which leads to large inaccuracies in the Schur complement. On the other hand, rounding small values to zero always leads to a backward stable factorization, since the rounding is equivalent to a small perturbation in AA. But rounding a column to zero when the value in exact arithmetic is not zero causes the rank of LL to be smaller than the rank of AA. This can later cause trouble, since some vectors bb that are in the range of AA are not in the range of LL. In such a case, there is no xx such that LLTx=bLL^{T}x=b even if Ax=bAx=b is consistent.

3.2. Work and Fill in Sparse Cholesky

When AA is sparse, operations on zeros can be skipped. For example, suppose that

A=[2003]=[2003][2003]T.A=\begin{bmatrix}2&0&\cdots\\ 0&3\\ \vdots&&\ddots\end{bmatrix}=\begin{bmatrix}\sqrt{2}&0&\cdots\\ 0&\sqrt{3}\\ \vdots&&\ddots\end{bmatrix}\begin{bmatrix}\sqrt{2}&0&\cdots\\ 0&\sqrt{3}\\ \vdots&&\ddots\end{bmatrix}^{T}\;.

There is no need to divide 0 by 2\sqrt{2} to obtain L2,1L_{2,1} (the element in row 22 and column 11; from here on, expressions like L2,1L_{2,1} denote matrix elements, not submatrices). Similarly, there is no need to subtract 0×00\times 0 from the 33, the second diagonal element. If we represent zeros implicitly rather than explicitly, we can avoid computations that have no effect, and we can save storage. How many arithmetic operations do we need to perform in this case?

Definition 9.

We define η(A)\eta(A) to be the number of nonzero elements in a matrix AA. We define ϕ𝚊𝚕𝚐(A)\phi_{\mathtt{alg}}(A) to be the number of arithmetic operations that some algorithm 𝚊𝚕𝚐\mathtt{alg} performs on an input matrix AA, excluding multiplications by zeros, divisions of zeros, additions and subtractions of zeros, and taking square roots of zeros. When the algorithm is clear from the context, we drop the subscript in the ϕ\phi notation.

Theorem 10.

Let 𝚂𝚙𝚊𝚛𝚜𝚎𝙲𝚑𝚘𝚕\mathtt{SparseChol} be a sparse Cholesky factorization algorithm that does not multiply by zeros, does not divide zeros, does not add or subtract zeros, and does not compute square roots of zeros. Then for a symmetric positive-definite matrix AA with a Cholesky factor LL we have

ϕ𝚂𝚙𝚊𝚛𝚜𝚎𝙲𝚑𝚘𝚕(A)\displaystyle\phi_{\mathtt{SparseChol}}(A) =\displaystyle= j=1n(1+η(Lj+1:n)+η(Lj+1:n)(η(Lj+1:n)+1)2)\displaystyle\sum_{j=1}^{n}\left(1+\eta\left(L_{j+1\colon n}\right)+\frac{\eta\left(L_{j+1\colon n}\right)\left(\eta\left(L_{j+1\colon n}\right)+1\right)}{2}\right)
=\displaystyle= j=1nO(η2(Lj+1:n)).\displaystyle\sum_{j=1}^{n}O\left(\eta^{2}\left(L_{j+1\colon n}\right)\right)\;.
Proof.

Every arithmetic operation in the Cholesky factorization involves an element or two from a column of LL: in square roots and divisions the output is an element of LL, and in multiply-subtract the two inputs that are multiplied are elements from one column of LL. We count the total number of operations by summing over columns of LL.

Let us count the operations in which elements of Lj:n,jL_{j\colon n,j} are involved. First, one square root. Second, divisions of η(Lj+1:n)\eta\left(L_{j+1\colon n}\right) by that square root. We now need to count the operations in which elements from this column update the Schur complement. To count these operations, we assume that the partitioning of AA is into a 22-by-22 block matrix, in which the first diagonal block consists of jj rows and columns and the second of njn-j. The computation of the Schur complement is

Aj+1:n,j+1:nLj+1:n,1:jLj+1:n,1:jT=Aj+1:n,j+1:nk=1jLj+1:n,kLj+1:n,kT.A_{j+1\colon n,j+1\colon n}-L_{j+1\colon n,1\colon j}L_{j+1\colon n,1\colon j}^{T}=A_{j+1\colon n,j+1\colon n}-\sum_{k=1}^{j}L_{j+1\colon n,k}L_{j+1\colon n,k}^{T}\;.

This is the only Schur-complement computation in which Lj:n,jL_{j\colon n,j} is involved under this partitioning of AA. It was not yet used, because it has just been computed. It will not be used again, because the recursive factorization of the Schur complement is self contained. The column contributes one outer product Lj+1:n,jLj+1:n,jTL_{j+1\colon n,j}L_{j+1\colon n,j}^{T}. This outer product contains η2(Lj+1:n)\eta^{2}\left(L_{j+1\colon n}\right) nonzero elements, but it is symmetric, so only its lower triangle needs to be computed. For each nonzero element in this outer product, two arithmetic operations are performed: a multiplication of two elements of Lj+1:nL_{j+1\colon n} and a subtraction of the product from another number. This yields the total operation count. ∎

Thus, the number of arithmetic operations is asymptotically proportional to the sum of squares of the nonzero counts in columns of LL. The total number of nonzeros in LL is, of course, simply the sum of the nonzero counts,

η(L)=j=1nη(Lj+1:n).\eta(L)=\sum_{j=1}^{n}\eta\left(L_{j+1\colon n}\right)\;.

This relationship between the arithmetic operation count and the nonzero counts shows two things. First, sparser factors usually (but not always) require less work to compute. Second, a factor with balanced nonzero counts requires less work to compute than a factor with some relatively dense columns, even if the two factors have the same dimension and the same total number of nonzeros.
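Both observations can be illustrated with the operation-count formula of Theorem 10 alone. In the sketch below (Python, added for illustration; the column counts are made up), two factors have the same total number of subdiagonal nonzeros, but the factor with one nearly dense column requires far more work:

def cholesky_work_and_fill(column_counts):
    # column_counts[j] = number of subdiagonal nonzeros in column j of L.
    # Work is the per-column formula of Theorem 10; fill is just the sum.
    work = sum(1 + eta + eta * (eta + 1) // 2 for eta in column_counts)
    fill = sum(column_counts)
    return work, fill

balanced = [10] * 10            # ten columns with ten subdiagonal nonzeros each
skewed = [1] * 9 + [91]         # same total of 100 nonzeros, one nearly dense column
print(cholesky_work_and_fill(balanced))   # much less work for the same fill
print(cholesky_work_and_fill(skewed))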

The nonzero structure of AA and LL does not always reveal everything about the sparsity during the factorization. Consider the following matrix and its Cholesky factor,

A=[422422226226]=[22112112][22112112]T.A=\begin{bmatrix}4&&2&2\\ &4&2&-2\\ 2&2&6\\ 2&-2&&6\end{bmatrix}=\begin{bmatrix}2\\ &2\\ 1&1&2\\ 1&-1&&2\end{bmatrix}\begin{bmatrix}2\\ &2\\ 1&1&2\\ 1&-1&&2\end{bmatrix}^{T}\;.

The element in position 4,34,3 is zero in AA and in LL, but it might fill in one of the Schur complements. If we partition AA into two 22-by-22 blocks, this element never fills, since

A3:4,3:4L3:4,1:2L3:4,1:2T\displaystyle A_{3\colon 4,3\colon 4}-L_{3\colon 4,1\colon 2}L_{3\colon 4,1\colon 2}^{T} =\displaystyle= [66][1111][1111]\displaystyle\begin{bmatrix}6\\ &6\end{bmatrix}-\begin{bmatrix}1&1\\ 1&-1\end{bmatrix}\begin{bmatrix}1&1\\ 1&-1\end{bmatrix}
=\displaystyle= [66][2002]\displaystyle\begin{bmatrix}6\\ &6\end{bmatrix}-\begin{bmatrix}2&0\\ 0&2\end{bmatrix}
=\displaystyle= [44].\displaystyle\begin{bmatrix}4\\ &4\end{bmatrix}\;.

However, if we first partition AA into a 11-by-11 block and a 33-by-33 block, then the 4,34,3 element fills in the Schur complement,

A2:4,2:4L2:4,1L2:4,1T\displaystyle A_{2\colon 4,2\colon 4}-L_{2\colon 4,1}L_{2\colon 4,1}^{T} =\displaystyle= [4222626][011][011]\displaystyle\begin{bmatrix}4&2&-2\\ 2&6\\ -2&&6\end{bmatrix}-\begin{bmatrix}0\\ 1\\ 1\end{bmatrix}\begin{bmatrix}0&1&1\end{bmatrix}
=\displaystyle= [4222626][000011011]\displaystyle\begin{bmatrix}4&2&-2\\ 2&6\\ -2&&6\end{bmatrix}-\begin{bmatrix}0&0&0\\ 0&1&1\\ 0&1&1\end{bmatrix}
=\displaystyle= [422251215].\displaystyle\begin{bmatrix}4&2&-2\\ 2&5&-1\\ -2&-1&5\end{bmatrix}\;.

When we continue the factorization, the 4,34,3 element in the Schur complement must cancel exactly by a subsequent subtraction, because we know it is zero in LL (the Cholesky factor is unique). This example shows that an element can fill in some of the Schur complements, even if it is zero in LL. Clearly, even a position that is not zero in AA can become zero in LL due to similar cancellation. For some analyses, it helps to define fill in a way that accounts for this possibility.

Definition 11.

A fill in a sparse Cholesky factorization is a row-column pair i,ji,j such that Ai,j=0A_{i,j}=0 and which fills in at least one Schur complement. That is, Ai,j=0A_{i,j}=0 and Ai,jLi,1:kLj,1:kT0A_{i,j}-L_{i,1\colon k}L_{j,1\colon k}^{T}\neq 0 for some k<jk<j.

3.3. An Efficient Implemention of Sparse Cholesky

To fully exploit sparsity, we need to store the matrix and the factor in a data structure in which effect-free operations on zeros incur no computational cost at all. Testing values to determine whether they are zero before performing an arithmetic operation is a bad idea: the test takes time even if the value is zero (on most processors, a test like this takes more time than an arithmetic operation). The data structure should allow the algorithm to implicitly skip zeros. Such a data structure increases the cost of arithmetic on nonzeros. Our goal is to show that in spite of the overhead of manipulating the sparse data structure, the total number of operations in the factorization can be kept proportional to the number of arithmetic operations.

Another important goal in the design of the data structure is memory efficiency. We would like the total amount of memory required for the sparse-matrix data structure to be proportional to the number of nonzeros that it stores.

Before we can present a data structure that achieves these goals, we need to reorder the computations that the Cholesky factorization performs. The reordered variant that we present is called column-oriented Cholesky. In the framework of our recursive formulations, this variant is based on repartitioning the matrix after the elimination of each column. We begin with a partitioning of AA into the first row and column and the rest of the matrix,

A=[A1,1A2:n,1TA2:n,1A2:n,2:n].A=\begin{bmatrix}A_{1,1}&A_{2\colon n,1}^{T}\\ A_{2\colon n,1}&A_{2\colon n,2\colon n}\end{bmatrix}\;.

We compute L1,1=A1,1L_{1,1}=\sqrt{A_{1,1}} and divide A2:n,1A_{2\colon n,1} by the root to obtain L2:n,1L_{2\colon n,1}. Now comes the key step. Instead of computing the entire Schur complement A2:n,2:nL2:n,1L2:n,1TA_{2\colon n,2\colon n}-L_{2\colon n,1}L_{2\colon n,1}^{T}, we compute only its first column, A2:n,2L2:n,1L2,1TA_{2\colon n,2}-L_{2\colon n,1}L_{2,1}^{T}. The first column of the Schur complement allows us to compute the second column of LL. At this point we have computed the first two columns of LL and nothing else. We now view the partitioning of AA as

A=[A1:2,1:2A3:n,1:2TA3:n,1:2A3:n,3:n]=[L1:2,1:2L3:n,1:2L3:n,3:n][L1:2,1:2L3:n,1:2L3:n,3:n]T.A=\begin{bmatrix}A_{1:2,1:2}&A_{3\colon n,1:2}^{T}\\ A_{3\colon n,1:2}&A_{3\colon n,3\colon n}\end{bmatrix}=\begin{bmatrix}L_{1:2,1:2}\\ L_{3\colon n,1:2}&L_{3\colon n,3\colon n}\end{bmatrix}\begin{bmatrix}L_{1:2,1:2}\\ L_{3\colon n,1:2}&L_{3\colon n,3\colon n}\end{bmatrix}^{T}\;.

We have already computed L1:2,1:2L_{1:2,1:2} and L3:n,1:2L_{3\colon n,1:2}. We still need to compute L3:n,3:nL_{3\colon n,3\colon n}. We do so in the same way: computing the first column of the Schur complement A3:n,3L3:n,1:2L3,1:2TA_{3\colon n,3}-L_{3\colon n,1:2}L_{3,1\colon 2}^{T} and eliminating it. The algorithm, ignoring sparsity issues, is shown in Figure 3.3.1.

for j=1:nj=1\colon n

Sj:n=Aj:n,jLj:n,1:j1Lj,1:j1TS_{j\colon n}=A_{j\colon n,j}-L_{j\colon n,1:j-1}L_{j,1\colon j-1}^{T}

Lj,j=SjL_{j,j}=\sqrt{S_{j}}

Lj+1:n,j=Sj+1:n/Lj,jL_{j+1\colon n,j}=S_{j+1\colon n}/L_{j,j}

end

Figure 3.3.1. Column-oriented Cholesky. The vector SS is a temporary vector that stores a column of the Schur complement. By definition, operations involving a range i:ji\colon j of rows or columns for i<1i<1 or j>nj>n are not performed at all. (This allows us to drop the conditional that skips the computation of the Schur-complement column for j=1j=1 and the computation of the nonexisting subdiagonal part of the last column of LL; in an actual algorithm, these conditionals would be present, or the processing of j=1j=1 and j=nj=n would be performed outside the loop.)
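A dense transcription of the loop in Figure 3.3.1 (Python/NumPy, ignoring sparsity, added for illustration) may help to see the column-by-column flow:

import numpy as np

def column_cholesky(A):
    # Dense column-oriented Cholesky, a direct transcription of Figure 3.3.1.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        S = A[j:, j] - L[j:, :j] @ L[j, :j]   # column j of the Schur complement
        L[j, j] = np.sqrt(S[0])
        L[j + 1:, j] = S[1:] / L[j, j]
    return L

rng = np.random.default_rng(4)
M = rng.standard_normal((7, 7))
A = M @ M.T + 7 * np.eye(7)
L = column_cholesky(A)
print(np.allclose(L @ L.T, A))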

We now present a data structure for the efficient implementation of sparse column-oriented Cholesky. Our only objective in presenting this data structure is to show that sparse Cholesky can be implemented in time proportional to arithmetic operations and in space proportional to the number of nonzeros in AA and LL. The data structure that we present is not the best possible. Using more sophisticated data structures, the number of operations in the factorization and the space required can be reduced further, but not asymptotically.

We store AA using an array of nn linked lists. Each linked list stores the diagonal and subdiagonal nonzeros in one column of AA. Each structure in the linked-list stores the value of one nonzero and its row index. There is no need to store the elements of AA above its main diagonal, because (1) AA is symmetric, and (2) column-oriented Cholesky never references them. There is also no need to store the column indices in the linked-list structures, because each list stores elements from only one column. The linked lists are ordered from low row indices to high row indices.

We store the already-computed columns of LL redundantly. One part of the data structure, which we refer to as columns, stores the columns of LL in exactly the same way we store AA. The contents of the two other parts of the data structure, called cursors and rows, depend on the number of columns that have already been computed. Immediately before step jj begins, these parts contain the following data. Cursors is an array of nn pointers. The first j1j-1 pointers, if set, point to linked-list elements of columns. 𝙲𝚞𝚛𝚜𝚘𝚛𝚜i\mathtt{Cursors}_{i} points to the first element with row index jj or larger in 𝚌𝚘𝚕𝚞𝚖𝚗𝚜i\mathtt{columns}_{i}, if there is such an element. If not, it is not set (that is, it contains a special invalid value). The other nj+1n-j+1 elements of cursors are not yet used. Like columns, rows is an array of linked lists. The iith list stores the elements of Li,1:j1L_{i,1\colon j-1}, but in reverse order, from high to low column indices. Each element in such a list contains a nonzero value and its column index.

The column SS of the Schur complement is represented by a data structure s called a sparse accumulator. This data structure consists of an array s.values of nn real or complex numbers, an array s.rowind of nn row indices, an array s.exists of booleans, and an integer s.nnz that specifies how many row indices are actually stored in s.rowind.

Here is how we use these data structures to compute the factorization. Step jj begins by copying Aj:n,jA_{j\colon n,j} to s. To do so, we start by setting s.nnz to zero. We then traverse the list that represents Aj:n,jA_{j\colon n,j}. For a list element that represents Ai,jA_{i,j}, we increment s.nnz, store ii in 𝚜.𝚛𝚘𝚠𝚒𝚗𝚍𝚜.𝚗𝚗𝚣\mathtt{s.rowind_{s.nnz}}, store AijA_{ij} in 𝚜.𝚟𝚊𝚕𝚞𝚎𝚜i\mathtt{s.values}_{i}, and set 𝚜.𝚎𝚡𝚒𝚜𝚝𝚜i\mathtt{s.exists}_{i} to a true value.

Our next task is to subtract Lj:n,1:j1Lj,1:j1TL_{j\colon n,1:j-1}L_{j,1\colon j-1}^{T} from SS. We traverse 𝚛𝚘𝚠𝚜j\mathtt{rows}_{j} to find the nonzero elements in Lj,1:j1TL_{j,1\colon j-1}^{T}. For each nonzero Lj,kL_{j,k} that we find, we subtract Lj:n,kLj,kT=Lj,kLj:n,kL_{j\colon n,k}L_{j,k}^{T}=L_{j,k}L_{j\colon n,k} from SS. To subtract, we traverse 𝚌𝚘𝚕𝚞𝚖𝚗𝚜k\mathtt{columns}_{k} starting from 𝚌𝚞𝚛𝚜𝚘𝚛𝚜k\mathtt{cursors}_{k}. Let Li,kL_{i,k} be a nonzero found during this traversal. To subtract Lj,kLi,kL_{j,k}L_{i,k} from SiS_{i}, we first check whether 𝚜.𝚎𝚡𝚒𝚜𝚝𝚜i\mathtt{s.exists}_{i} is true. If it is, we simply subtract Lj,kLi,kL_{j,k}L_{i,k} from 𝚜.𝚟𝚊𝚕𝚞𝚎𝚜i\mathtt{s.values}_{i}. If not, then SiS_{i} was zero until now (in step jj). We increment s.nnz, store ii in 𝚜.𝚛𝚘𝚠𝚒𝚗𝚍𝚜.𝚗𝚗𝚣\mathtt{s.rowind_{s.nnz}}, store 0Lj,kLi,k0-L_{j,k}L_{i,k} in 𝚜.𝚟𝚊𝚕𝚞𝚎𝚜i\mathtt{s.values}_{i}, and set 𝚜.𝚎𝚡𝚒𝚜𝚝𝚜i\mathtt{s.exists}_{i} to true. During the traversal of 𝚌𝚘𝚕𝚞𝚖𝚗𝚜k\mathtt{columns}_{k}, we may need to advance 𝚌𝚞𝚛𝚜𝚘𝚛𝚜k\mathtt{cursors}_{k} to prepare it for subsequent steps of the algorithm. If the first element that we find in the traversal has row index jj, we advance 𝚌𝚞𝚛𝚜𝚘𝚛𝚜k\mathtt{cursors}_{k} to the next element in 𝚌𝚘𝚕𝚞𝚖𝚗𝚜k\mathtt{columns}_{k}. Otherwise, we do not modify 𝚌𝚞𝚛𝚜𝚘𝚛𝚜k\mathtt{cursors}_{k}.

Finally, we compute Lj:n,jL_{j\colon n,j} and insert it into the nonzero data structure that represents LL. We replace 𝚜.𝚟𝚊𝚕𝚞𝚎𝚜j\mathtt{s.values}_{j} by its square root. For each row index iji\neq j stored in one of the first 𝚜.𝚗𝚗𝚣\mathtt{s.nnz} entries of 𝚜.𝚛𝚘𝚠𝚒𝚗𝚍\mathtt{s.rowind}, we divide 𝚜.𝚟𝚊𝚕𝚞𝚎𝚜i\mathtt{s.values}_{i} by 𝚜.𝚟𝚊𝚕𝚞𝚎𝚜j\mathtt{s.values}_{j}. We now sort elements 11 through 𝚜.𝚗𝚗𝚣\mathtt{s.nnz} of 𝚜.𝚛𝚘𝚠𝚒𝚗𝚍\mathtt{s.rowind}, to ensure that the elements of 𝚌𝚘𝚕𝚞𝚖𝚗𝚜j\mathtt{columns}_{j} are linked in ascending row order. We traverse the row indices stored in 𝚜.𝚛𝚘𝚠𝚒𝚗𝚍\mathtt{s.rowind}. For each index ii such that 𝚜.𝚟𝚊𝚕𝚞𝚎𝚜i0\mathtt{s.values}_{i}\neq 0, we allocate a new element for 𝚌𝚘𝚕𝚞𝚖𝚗𝚜j\mathtt{columns}_{j}, link it to the previous element that we created, and store in it ii and 𝚜.𝚟𝚊𝚕𝚞𝚎𝚜i\mathtt{s.values}_{i}. We also set 𝚜.𝚎𝚡𝚒𝚜𝚝𝚜i\mathtt{s.exists}_{i} to false, to prepare it for the next iteration.
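The following simplified sketch (Python; dictionaries stand in for the linked lists, and the cursors are omitted, so it does not achieve exactly the operation bounds proved below) shows the overall flow of the factorization with a sparse accumulator; the small tridiagonal example at the end is made up:

from collections import defaultdict
from math import sqrt

def sparse_column_cholesky(A_cols, n):
    # A_cols[j] is a dict {row: value} with the diagonal and subdiagonal nonzeros
    # of column j of A.  Columns of L are returned in the same format.
    L_cols = [dict() for _ in range(n)]
    rows = defaultdict(list)          # rows[i] = [(k, L[i,k]) for already-computed columns k]
    for j in range(n):
        s = dict(A_cols[j])           # copy A(j:n, j) into the sparse accumulator
        for k, Ljk in rows[j]:        # subtract L(j:n,k) * L[j,k] for k < j with L[j,k] != 0
            for i, Lik in L_cols[k].items():
                if i >= j:
                    s[i] = s.get(i, 0.0) - Ljk * Lik
        Ljj = sqrt(s[j])              # square root of the diagonal element
        L_cols[j][j] = Ljj
        for i in sorted(s):           # scale the subdiagonal and record the row structure
            if i > j and s[i] != 0.0:
                L_cols[j][i] = s[i] / Ljj
                rows[i].append((j, s[i] / Ljj))
    return L_cols

# The tridiagonal matrix [[4,-1,0],[-1,4,-1],[0,-1,4]] as column dicts.
A_cols = [{0: 4.0, 1: -1.0}, {1: 4.0, 2: -1.0}, {2: 4.0}]
for j, col in enumerate(sparse_column_cholesky(A_cols, 3)):
    print(j, col)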

We now analyze the number of operations and the storage requirements of this algorithm.

Theorem 12.

The total number of operations that the sparse Cholesky algorithm described above performs is Θ(ϕ(A))\Theta(\phi(A)). The amount of storage that the algorithm uses, including the representation of its input and output, is Θ(n+η(A)+η(L))\Theta(n+\eta(A)+\eta(L)).

Proof.

Let’s start with bounding work. Step jj starts with copying column jj of AA into the sparse accumulator, at a total cost of Θ(1+η(Aj:n,j))\Theta(1+\eta(A_{j\colon n,j})). No arithmetic is performed, but we can charge these operations to subsequent arithmetic operations. If one of these values is modified in the accumulator, we charge the copying to the subsequent multiply-subtract. If not, we charge it to either the square root of the diagonal element or to the division of subdiagonal elements. We cannot charge all the copying operations to roots and divisions, because some of the copied elements might get canceled before they are divided by Lj,jL_{j,j}.

The traversal of 𝚛𝚘𝚠𝚜j\mathtt{rows}_{j} and the subtractions of scaled columns of LL from the accumulator are easy to account for. The processing of an element of 𝚛𝚘𝚠𝚜j\mathtt{rows}_{j} is charged to the modification of the diagonal, Sj=SjLj,k2S_{j}=S_{j}-L_{j,k}^{2}. The traversal of the suffix of 𝚌𝚘𝚕𝚞𝚖𝚗𝚜k\mathtt{columns}_{k} performs 2η(Lj:n,k)2\eta(L_{j\colon n,k}) arithmetic operations and Θ(η(Lj:n,k))\Theta(\eta(L_{j\colon n,k})) non-arithmetic operations, so all operations are accounted for.

After the column of the Schur complement is computed, the algorithm computes a square root, scales η(Lj:n,j)1\eta(L_{j\colon n,j})-1 elements, sorts the indices of Lj:n,jL_{j\colon n,j}, and creates a linked-list to hold the elements of Lj:n,jL_{j\colon n,j}. The total number of operations in these computations is clearly O(η2(Lj+1:n,j))O(\eta^{2}(L_{j+1\colon n,j})), even if we use a quadratic sorting algorithm, so by using Theorem 10, we conclude that the operation-count bound holds.

Bounding space is easier. Our data structure includes a few arrays of length nn and linked lists. Every linked-list element represents a nonzero in AA or in LL, and every nonzero is represented by at most two linked-list elements. Therefore, the total storage requirement is Θ(n+η(A)+η(L))\Theta(n+\eta(A)+\eta(L)). ∎

The theorem shows that the only way to asymptotically reduce the total number of operations in a sparse factorization is to reduce the number of arithmetic operations. Similarly, to asymptotically reduce the total storage we must reduce the fill. There are many ways to optimize and improve both the data structure and the algorithm that we described, but these optimizations can reduce the number of operations and the storage requirements only by a constant multiplicative factor.

Improvements in this algorithm range from simple data-structure optimizations, through sophisticated preprocessing steps, to radical changes in the representation of the Schur complement. The most common data-structure optimization, which is used by many sparse factorization codes, is the use of compressed-column storage. In this data structure, all the elements in all the 𝚌𝚘𝚕𝚞𝚖𝚗𝚜\mathtt{columns} lists are stored in two contiguous arrays, one for the actual numerical values and another for the row indices. A third integer array of size n+1n+1 points to the beginning of each column in these arrays (and to the location just past column nn). Preprocessing steps can bound from above the number of nonzeros in each column of LL (this is necessary for exact preallocation of the compressed-column arrays) and can predict the identity of potential nonzeros. The prediction of fill in LL can eliminate the conditional in the inner loop that updates the sparse accumulator; this can significantly speed up the computation. The preprocessing steps can also construct a compact data structure called the elimination tree of AA that can be used for determining the nonzero structure of a row of LL without maintaining the 𝚛𝚘𝚠𝚜\mathtt{rows} lists. This also speeds up the computation significantly. The elimination tree has many other uses. Finally, multifrontal factorization algorithms maintain the Schur complement in a completely different way, as a sum of sparse symmetric update matrices.

The number of basic computational operations in an algorithm is not an accurate predictor of its running time. Different algorithms have different mixtures of operations and different memory-access patterns, and these factors affect running times, not only the total number of operations. We saw evidence for this in Chapter XXX, where direct solvers sometimes run at much higher computational rates than iterative solvers. But to develop fundamental improvements in algorithms, it helps to focus mainly on operation counts, and in particular, on asymptotic operation counts. Once new algorithms are discovered, we can and should optimize them both in terms of operation counts and in terms of the ability to exploit cache memories, multiple processors, and other architectural features of modern computers. But our primary concern here is asymptotic operation counts.

3.4. Characterizations of Fill

If we want to reduce fill, we need to characterize fill. Definition 11 provides a characterization, but it is not useful for predicting fill before it occurs. One reason is that it accounts for cancellations, which are difficult to predict. In this section we provide two other characterizations. They are not exact, in the sense that they characterize a superset of the fill. In many cases these characterizations are exact, but not always. On the other hand, they can be used to predict fill before the factorization begins. Both of them are given in terms of graphs that are related to AA, not in terms of matrices.

Definition 13.

The pattern graph of an nn-by-nn symmetric matrix AA is an undirected graph GA=({1,2,n},E)G_{A}=(\{1,2,\ldots n\},E) whose vertices are the integers 11 through nn and whose edges are pairs (i,j)(i,j) such that Ai,j0A_{i,j}\neq 0.

Definition 14.

Let GG be an undirected graph with vertices 1,2,,n1,2,\ldots,n. A vertex elimination step on vertex jj of GG is the transformation of GG into another undirected graph eliminate(G,j)\textrm{eliminate}(G,j). The edge set of eliminate(G,j)\textrm{eliminate}(G,j) is the union of the edge set of GG with a clique on the neighbors of jj in GG whose indices are larger than jj,

E(eliminate(G,j))=E(G){(i,k)\displaystyle E(\textrm{eliminate}(G,j))=E(G)\cup\{(i,k) |\displaystyle| i>j\displaystyle i>j
k>j\displaystyle k>j
(j,i)E(G)\displaystyle(j,i)\in E(G)
(j,k)E(G)}.\displaystyle(j,k)\in E(G)\}\;.

The fill graph GA+G_{A}^{+} of AA is the graph

GA+=eliminate(eliminate(eliminate(GA,1)),n1),n).G_{A}^{+}=\textrm{eliminate}(\textrm{eliminate}(\cdots\textrm{eliminate}(G_{A},1)\cdots),n-1),n)\;.
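Definition 14 can be simulated directly on small graphs. The sketch below (Python, 0-based vertex indices, added for illustration) eliminates the vertices in order and collects the edges of the fill graph; in the 4-cycle example, eliminating the first vertex creates the single fill edge (1, 3):

def fill_graph(n, edges):
    # Edge set of the fill graph G_A^+, by eliminating vertices 0..n-1 in order
    # (Definition 14, with 0-based indices).
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        if i != j:
            adj[i].add(j); adj[j].add(i)
    all_edges = {(min(i, j), max(i, j)) for i, j in edges if i != j}
    for j in range(n):
        higher = sorted(v for v in adj[j] if v > j)
        for a in range(len(higher)):          # clique on the higher-numbered neighbors of j
            for b in range(a + 1, len(higher)):
                u, v = higher[a], higher[b]
                adj[u].add(v); adj[v].add(u)
                all_edges.add((u, v))
    return all_edges

# A 4-cycle 0-1-2-3-0: eliminating vertex 0 creates the single fill edge (1, 3).
print(sorted(fill_graph(4, [(0, 1), (1, 2), (2, 3), (3, 0)])))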

The edges of the fill graph provide an upper bound on the fill.

Lemma 15.

Let AA be an nn-by-nn symmetric positive-definite matrix. Let 1j<in1\leq j<i\leq n. If Ai,j0A_{i,j}\neq 0 or i,ji,j is a fill element, then (i,j)(i,j) is an edge of GA+G_{A}^{+}.

Proof.

The elimination of a vertex only adds edges to the graph. The fill graph is produced by a chain of vertex eliminations, starting from the graph of AA. If Ai,j0A_{i,j}\neq 0, then (i,j)(i,j) is an edge of the graph of AA, and therefore also an edge of its fill graph.

We now prove by induction on jj that a fill element i,ji,j is also an edge (i,j)(i,j) in

eliminate(eliminate(eliminate(GA,1)),j2),j1).\textrm{eliminate}(\textrm{eliminate}(\cdots\textrm{eliminate}(G_{A},1)\cdots),j-2),j-1)\;.

The claim holds for j=1j=1 because Li,10L_{i,1}\neq 0 if and only if Ai,10A_{i,1}\neq 0. Therefore, there are no fill edges for j=1j=1, so the claim holds. We now prove the claim for j>1j>1. Assume that i,ji,j is a fill element, and let S(j)S^{(j)} be the Schur complement just prior to the jjth elimination step,

S(j)=Aj:n,j:nLj:n,1:j1Lj:n,1:j1T=Aj:n,j:nk=1j1Lj:n,kLj:n,kT.S^{(j)}=A_{j\colon n,j\colon n}-L_{j\colon n,1\colon j-1}L_{j\colon n,1\colon j-1}^{T}=A_{j\colon n,j\colon n}-\sum_{k=1}^{j-1}L_{j\colon n,k}L_{j\colon n,k}^{T}\;.

Since i,ji,j is a fill element, Si,j(j)0S_{i,j}^{(j)}\neq 0 but Ai,j=0A_{i,j}=0. Therefore, Li,kLj,k0L_{i,k}L_{j,k}\neq 0 for some k<jk<j. This implies that Li,k0L_{i,k}\neq 0 and Lj,k0L_{j,k}\neq 0. By induction, (i,k)(i,k) and (j,k)(j,k) are edges of

eliminate(eliminate(GA,1)),k1).\textrm{eliminate}(\cdots\textrm{eliminate}(G_{A},1)\cdots),k-1)\;.

Therefore, (i,j)(i,j) is an edge of

eliminate(eliminate(eliminate(GA,1)),k1),k).\textrm{eliminate}(\textrm{eliminate}(\cdots\textrm{eliminate}(G_{A},1)\cdots),k-1),k)\;.

Since edges are never removed by vertex eliminations, (i,j)(i,j) is also an edge of

eliminate(eliminate(eliminate(GA,1)),j2),j1).\textrm{eliminate}(\textrm{eliminate}(\cdots\textrm{eliminate}(G_{A},1)\cdots),j-2),j-1)\;.

This proves the claim and the entire lemma. ∎

The converse is not true. An element in a Schur complement can fill and then be canceled out, but the edge in the fill graph remains. Also, fill in a Schur complement can cancel exactly an original element of AA, but the fill graph still contains the corresponding edge.

The following lemma provides another characterization of fill.

Lemma 16.

The pair (i,j)(i,j) is an edge of the fill graph of AA if and only if GAG_{A} contains a simple path from ii to jj whose internal vertices all have indices smaller than min(i,j)\min(i,j).

Proof.

Suppose that GAG_{A} contains such a path. We prove that (i,j)(i,j) is an edge of the fill graph by induction on the length of the path. If the path contains only one edge then (i,j)(i,j) is an edge of GAG_{A}, so it is also an edge of GA+G_{A}^{+}. Suppose that the claim holds for paths of length 1\ell-1 or shorter and that a path of length \ell connects ii and jj. Let kk be the smallest-index vertex in the path. The elimination of vertex kk adds an edge connecting its neighbors in the path, because their indices are higher than kk. Now there is a path of length 1\ell-1 between ii and jj; the internal vertices still have indices smaller than min(i,j)\min(i,j). By induction, future elimination operations on the graph will create the fill edge (i,j)(i,j).

To prove the other direction, suppose that (i,j)(i,j) is an edge of GA+G_{A}^{+}. If (i,j)(i,j) is an edge of GAG_{A}, a path exists. If not, the edge (i,j)(i,j) is created by some elimination step. Let kk be the vertex whose elimination creates this edge. We must have k<min(i,j)k<\min(i,j), otherwise the kkth elimination step does not create the edge. Furthermore, when kk is eliminated, the edges (k,i)(k,i) and (k,j)(k,j) exist. If they are original edges of GAG_{A}, we are done: we have found the path. If not, we use a similar argument to show that there must be suitable paths from ii to kk and from kk to jj. The concatenation of these paths is the sought-after path. ∎
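The path criterion of Lemma 16 is also easy to test directly: restrict a breadth-first search from i to vertices with indices below min(i, j) and see whether j is reached. A small sketch (Python, 0-based indices, using the same 4-cycle as in the earlier example) follows:

from collections import deque

def fill_path_exists(adj, i, j):
    # Lemma 16: is there a path from i to j in G_A whose internal vertices all
    # have indices smaller than min(i, j)?
    limit = min(i, j)
    seen, queue = {i}, deque([i])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v == j:
                return True
            if v < limit and v not in seen:
                seen.add(v)
                queue.append(v)
    return False

# The 4-cycle again: 1 and 3 are joined through vertex 0 < min(1, 3), so (1, 3)
# is an edge of the fill graph, while (0, 2) is not.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(fill_path_exists(adj, 1, 3), fill_path_exists(adj, 0, 2))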

3.5. Perfect and Almost-Perfect Elimination Orderings

The Cholesky factorization of symmetric positive definite matrices is numerically stable, and symmetrically permuting the rows and columns of an spd matrix yields another spd matrix. Therefore, we can try to symmetrically permute the rows and columns of a sparse matrix to reduce fill and work in the factorization. We do not have to worry that the permutation will numerically destabilize the factorization. In terms of the graph of the matrix, a symmetric row and column permutation corresponds to relabeling the vertices of the graph. In other words, given an undirected graph we seek an elimination ordering for its vertices.

Some graphs have elimination orderings that generate no fill, so that GA+=GAG_{A}^{+}=G_{A}. Such orderings are called perfect-elimination orderings. The most common examples are trees.

Lemma 17.

If GAG_{A} is a tree or a forest, then GA+=GAG_{A}^{+}=G_{A}.

Proof.

On a tree or forest, any depth-first-search postorder is a perfect-elimination ordering.

By Lemma 16, all the fill edges occur within connected components of GAG_{A}. Therefore, it is enough to show that each tree in a forest has a perfect-elimination ordering. We assume from here on that GAG_{A} is a tree.

Let vv be an arbitrary vertex of GAG_{A}. Perform a depth-first traversal of GAG_{A} starting from the root vv, and number the vertices in postorder: give the next higher index to a vertex uu immediately after the traversal returns from the last child of uu, starting from index 11. Such an ordering guarantees that in the tree rooted at vv, a vertex uu has a higher index than all the vertices in the subtree rooted at uu.

Under such an ordering, the elimination of a vertex uu creates no fill edges at all, because uu has at most one neighbor with a higher index, its parent in the rooted tree. ∎
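The argument can be checked mechanically. The sketch below (Python, 0-based labels, added for illustration) builds a random tree, relabels it by a depth-first postorder, and counts the fill produced by eliminating vertices in index order; the postorder labeling produces none, while the original labeling generally does produce some:

import random

def count_fill(n, edges):
    # Number of fill edges produced by eliminating vertices 0..n-1 in order.
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j); adj[j].add(i)
    fill = 0
    for j in range(n):
        higher = sorted(v for v in adj[j] if v > j)
        for a in range(len(higher)):
            for b in range(a + 1, len(higher)):
                u, v = higher[a], higher[b]
                if v not in adj[u]:
                    adj[u].add(v); adj[v].add(u)
                    fill += 1
    return fill

def postorder_labels(tree_adj, root=0):
    # Depth-first postorder: every vertex gets a higher label than its whole subtree.
    labels, counter = {}, [0]
    def dfs(u, parent):
        for v in tree_adj[u]:
            if v != parent:
                dfs(v, u)
        labels[u] = counter[0]
        counter[0] += 1
    dfs(root, None)
    return labels

random.seed(0)
n = 30
tree = [(v, random.randrange(v)) for v in range(1, n)]   # a random tree on n vertices
tree_adj = {v: set() for v in range(n)}
for u, v in tree:
    tree_adj[u].add(v); tree_adj[v].add(u)

lab = postorder_labels(tree_adj)
relabeled = [(lab[u], lab[v]) for u, v in tree]
print(count_fill(n, tree))        # the original labeling may produce fill
print(count_fill(n, relabeled))   # a postorder labeling produces none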

Most graphs do not have perfect-elimination orderings, but some orderings are almost as good. The elimination of a vertex with only one higher-numbered neighbor is called a perfect elimination step, because it produces no fill edge. Eliminating a vertex with two higher-numbered neighbors is not perfect, but almost: it produces at most one fill edge. If GAG_{A} contains an isolated path

v0v1v2vv+1v_{0}\leftrightarrow v_{1}\leftrightarrow v_{2}\leftrightarrow\cdots\leftrightarrow v_{\ell}\leftrightarrow v_{\ell+1}

such that the degree of v1,v2,,vv_{1},v_{2},\ldots,v_{\ell} in GAG_{A} is 22, then we can eliminate v1,,vv_{1},\ldots,v_{\ell} (in any order), producing only \ell fill edges and performing only Θ()\Theta(\ell) operations. This observation is useful if we try to sparsify a graph so that the sparsified graph has a good elimination ordering.

3.6. Symbolic Elimination and the Elimination Tree

We can use the fill-characterization technique that Lemma 15 describes to create an efficient algorithm for predicting fill. Predicting fill, or even predicting just the nonzero counts in rows and columns of the factor, can lead to more efficient factorization algorithms. The improvements are not asymptotic, but they can be nonetheless significant.

The elimination of vertex kk adds to the graph the edges of a clique, a complete subgraph induced by the higher-numbered neighbors of kk. Some of these edges may have already been part of the graph, but some may be new. If we represent the edge set of the partially-eliminated graph using a clique-cover data structure, we can efficiently simulate the factorization process and enumerate the fill edges. An algorithm that does this is called a symbolic elimination algorithm. It is symbolic in the sense that it simulates the sparse factorization process, but without computing the numerical values of elements in the Schur complement or the factor.

A clique cover represents the edges of an undirected graph using an array of linked lists. Each list specifies a clique by specifying the indices of a set of vertices: each link-list element specifies one vertex index. The edge set of the graph is the union of the cliques. We can create a clique-cover data structure by creating one two-vertex clique for each edge in the graph.

A clique cover allows us to simulate elimination steps efficiently. We also maintain an array of vertices. Each element in the vertex array is a doubly-linked list of cliques that the vertex participates in. To eliminate vertex kk, we need to create a new clique containing its higher-numbered neighbors. These neighbors are exactly the union of higher-than-kk vertices in all the cliques that kk participates in. We can find them by traversing the cliques that kk participates in. For each clique qq that kk participates in, we traverse qq’s list of vertices and add the higher-than-kk vertices that we find to the new clique. We add a vertex to the new clique at most once, using a length-nn bit vector to keep track of which vertices have already been added to the new clique. Before we move on to eliminate vertex k+1k+1, we clear the bits that we have set in the bit vector, using the vertices in the new clique to indicate which bits must be cleared. For each vertex jj in the new clique, we also add the new clique to the list of cliques that jj participates in.

The vertices in the new clique, together with kk, are exactly the nonzero rows in column kk of a cancellation-free Cholesky factor of AA. Thus, the symbolic-elimination algorithm predicts the structure of LL if there are no cancellations.

We can make the process more efficient by not only creating new cliques, but merging cliques. Let qq be a clique that kk participates in, and let i>ki>k and j>kj>k be vertices in qq. The clique qq represents the edge (i,j)(i,j). But the new clique that the elimination of kk creates also represents (i,j)(i,j), because both ii and jj belong to the new clique. Therefore, we can partially remove clique qq from the data structure. The removal makes the elimination of higher-than-kk vertices belonging to it cheaper, because we will not have to traverse it again. To remove qq, we need each element of the list representing qq to point to the element representing qq in the appropriate vertex list. Because the vertex lists are doubly-linked, we can remove elements from them in constant time.

Because each clique either represents a single nonzero of AA or the nonzeros in a single column of a cancellation-free factor LL, the total size of the data structure is Θ(η(A)+η(L)+n)\Theta(\eta(A)+\eta(L)+n). If we do not need to store the column structures (for example, if we are only interested in nonzero counts), we can remove completely merged cliques. Because we create at most one list element for each list element that we delete, the size of the data structure in this scheme is bounded by the size of the original data structure, which is only Θ(η(A)+n)\Theta(\eta(A)+n).

The number of operations in this algorithm is Θ(η(A)+η(L)+n)\Theta(\eta(A)+\eta(L)+n), which is often much less than the cost of the actual factorization,

ϕ(A)=j=1nO(η2(Lj+1:n)).\phi(A)=\sum_{j=1}^{n}O\left(\eta^{2}\left(L_{j+1\colon n}\right)\right)\;.

To see that this is indeed the case, we observe that each linked-list element is touched by the algorithm only twice. An element of a clique list is touched once when it is first created, and another time when the clique is merged into another clique. Because the clique is removed from vertex lists when it is merged, it is never referenced again. An element of a vertex list is also touched only once after it is created. Either the element is removed from the vertex’s list before the vertex is eliminated, or it is touched during the traversal of the vertex’s list.

The process of merging the cliques defines a structure that plays an important role in many sparse-matrix factorization algorithms, the elimination tree of AA. To define the elimination tree, we name some of the cliques. Cliques that are created when we initialize the clique cover to represent the edge set of GAG_{A} have no name. Cliques that are formed when we eliminate a vertex are named after the vertex. The elimination tree is a rooted forest on the vertices {1,2,,n}\{1,2,\ldots,n\}. The parent π(k)\pi(k) of vertex kk is the name of the clique into which clique kk is merged, if it is merged at all. If clique kk is not merged in the symbolic-elimination process, kk is a root of a tree in the forest. There are many alternative definitions of the elimination tree. The elimination tree has many applications. In particular, it allows us to compute the number of nonzeros in each row and column of a cancellation-free Cholesky factor in time almost linear in η(A)\eta(A), even faster than using symbolic elimination.
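One of the alternative definitions mentioned above (a standard one, stated here without proof) takes the parent of k to be the smallest row index of a subdiagonal nonzero in column k of the cancellation-free factor. The sketch below (Python, 0-based indices, added for illustration) computes the parent array by simulating the elimination directly rather than with the clique-cover data structure:

def elimination_tree(n, edges):
    # Parent array of the elimination tree, via the equivalent rule
    # pi(k) = smallest i > k with a nonzero in row i of column k of the
    # cancellation-free factor, computed by simulating the elimination.
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        if i != j:
            adj[i].add(j); adj[j].add(i)
    parent = [None] * n
    for k in range(n):
        higher = sorted(v for v in adj[k] if v > k)
        if higher:
            parent[k] = higher[0]
            for a in range(len(higher)):      # eliminating k links its higher neighbors
                for b in range(a + 1, len(higher)):
                    adj[higher[a]].add(higher[b]); adj[higher[b]].add(higher[a])
    return parent

# The 4-cycle: the parent array is [1, 2, 3, None], a single chain rooted at 3.
print(elimination_tree(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))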

3.7. Minimum-Degree Orderings

The characterization of fill in Lemma 15 also suggests an ordering heuristic for reducing fill. Since the elimination of a yet-uneliminated vertex creates a clique whose size is the number of uneliminated neighbors of that vertex, it makes sense to eliminate the vertex with the fewest uneliminated neighbors. Choosing this vertex minimizes the size of the clique created in the next step and minimizes the arithmetic work in that step. Ordering heuristics based on this idea are called minimum-degree heuristics.

There are two problems with the minimum-degree idea. First, not all the edges in the new clique are new; some of them might be edges of G_A or might have already been created by a previous elimination step. It is possible to minimize not the degree but the actual new fill, but this turns out to be more expensive algorithmically, and the resulting orderings are usually not significantly better than minimum-degree orderings. Second, a choice that is optimal for the next elimination step may be suboptimal globally. There are families of matrices on which minimum-degree orderings generate asymptotically more fill than the optimal ordering. Minimum-degree and minimum-fill heuristics are greedy and myopic: they select vertices for elimination one at a time, and the lack of global planning can hurt them. This is also why minimum fill is often not much better than minimum degree.

Even though minimum-degree algorithms are theoretically known to be suboptimal, they are very fast and often produce effective orderings. On huge matrices nested-dissection orderings, which we discuss next, are often more effective than minimum-degree orderings, but on smaller matrices, minimum-degree orderings are sometimes more effective in reducing fill and work.

Minimum-degree algorithms usually employ data structures similar to the clique cover used by the symbolic-elimination algorithm. The data structure is augmented to allow vertex degrees to be determined or approximated. Maintaining exact vertex degrees is expensive, because a clique cover does not represent vertex degrees directly. Many successful minimum-degree algorithms therefore use degree approximations that are cheaper to maintain or compute. Since minimum degree is only a heuristic, these approximations are not necessarily less effective in reducing fill than exact degrees; sometimes they are even more effective.
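To make the greedy rule concrete, here is a naive Python sketch of a minimum-degree ordering (our own illustration, not one of the optimized algorithms just described): it keeps an explicit elimination graph and recomputes exact degrees, which is exactly what practical codes avoid.

  def minimum_degree_ordering(n, edges):
      adj = [set() for _ in range(n)]       # elimination graph as adjacency sets
      for (i, j) in edges:
          adj[i].add(j)
          adj[j].add(i)
      eliminated = [False] * n
      order = []
      for _ in range(n):
          # pick an uneliminated vertex of minimum current degree
          k = min((v for v in range(n) if not eliminated[v]),
                  key=lambda v: len(adj[v]))
          order.append(k)
          eliminated[k] = True
          nbrs = adj[k]
          for u in nbrs:                    # the neighbors of k become a clique
              adj[u].discard(k)
              adj[u].update(nbrs - {u})
          adj[k] = set()
      return order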

3.8. Nested-Dissection Orderings for Regular Meshes

Nested-dissection orderings are known to be approximately optimal, and on huge matrices they can be significantly more effective than other ordering heuristics. On large matrices, the most effective orderings are often nested-dissection/minimum-degree hybrids.

Nested-dissection orderings are defined recursively using vertex subsets called separators.

Definition 18.

Let G=(V,E)G=(V,E) be an undirected graph with |V|=n|V|=n. Let α\alpha and β\beta be real positive constants, and let ff be a real function over the positive integers. An (α,β,f)(\alpha,\beta,f) vertex separator in GG is a set SVS\subseteq V of vertices that satisfies the following conditions.

Separation:

There is a partition V1V2=VSV_{1}\cup V_{2}=V\setminus S such that for any vV1v\in V_{1} and uV2u\in V_{2}, the edge (u,v)E(u,v)\not\in E.

Balance:

|V1|,|V2|αn|V_{1}|,|V_{2}|\leq\alpha n.

Size:

|S|βf(n)|S|\leq\beta f(n).

Given a vertex separator S in G_A, we order the rows and columns of A that correspond to the separator last, say in an arbitrary order (although if G∖S contains many connected components, a clever ordering of S can further reduce fill and work), and the rows and columns corresponding to V_1 and V_2 first. By Lemma 16, this ensures that for any v∈V_1 and u∈V_2, the edge (u,v) is not a fill edge, so L_{u,v}=L_{v,u}=0. Therefore, the interleaving of vertices of V_1 and V_2 has no effect on fill, so we can just as well order all the vertices of V_1 before all the vertices of V_2.

The function of the separator in the ordering is to ensure that L_{u,v}=0 for any v∈V_1 and u∈V_2. Any ordering in which the vertices of V_1 appear first, followed by the vertices of V_2, followed by the vertices of S, ensures that a |V_1|-by-|V_2| rectangular block of L does not fill. A good separator is one for which this block is large. The size of S determines the circumference of this rectangular block, because half the circumference is |V_1|+|V_2|=n−|S|. The size imbalance between V_1 and V_2 determines the aspect ratio of the rectangle. For a given circumference, the area is maximized when the rectangle is as close to square as possible. Therefore, a small balanced separator reduces fill more effectively than a large or unbalanced separator.

By using separators in the subgraphs of GAG_{A} induced by V1V_{1} and V2V_{2} to order the diagonal blocks of AA that correspond to V1V_{1} and V2V_{2}, we can avoid fill in additional blocks of LL. These blocks are usually smaller than the top-level |V1||V_{1}|-by-|V2||V_{2}| block, but they are still helpful and significant in reducing the total fill and work. If |V1||V_{1}| or |V2||V_{2}| are small, say smaller than a constant threshold, we order the corresponding subgraphs arbitrarily.

Let us see how this works on square two-dimensional meshes. To keep the analysis simple, we use a cross-shaped separator that partitions the mesh into four square or nearly-square subgraphs. We assume that G_A is an n_x-by-n_y mesh with n_x n_y = n and |n_x−n_y| ≤ 1. The size of a cross-shaped separator in G_A is |S| = n_x+n_y−1 < 2√n. To see this, note that if n_x = n_y = √n then |S| = 2√n−1. Otherwise, without loss of generality n_x = n_y−1 < n_y, so |S| = 2n_x = 2√(n_x n_x) < 2√(n_x n_y) = 2√n. The separator breaks the mesh into four subgraphs, each of which is almost square (their x and y dimensions differ by at most 1), of size at most n/4. This implies that throughout the recursion, each size-m subgraph produced by the nested-dissection ordering has a (0.25, 2, √m) four-way separator that we use in the ordering.
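Before analyzing fill, here is a small Python sketch (our own illustration) of such a nested-dissection ordering of an n_x-by-n_y mesh: each quadrant is ordered recursively and the cross-shaped separator is ordered last. Vertices are indexed with x varying fastest, matching the mesh numbering used in these notes, and small blocks are ordered arbitrarily.

  def nested_dissection_2d(nx, ny, x0=0, y0=0, full_nx=None):
      if full_nx is None:
          full_nx = nx
      vid = lambda x, y: (y0 + y) * full_nx + (x0 + x)     # global vertex index
      if nx <= 2 or ny <= 2:                               # small block: order arbitrarily
          return [vid(x, y) for y in range(ny) for x in range(nx)]
      sx, sy = nx // 2, ny // 2                            # the cross goes through column sx, row sy
      order = []
      for (qx0, qy0, qnx, qny) in [(0, 0, sx, sy),
                                   (sx + 1, 0, nx - sx - 1, sy),
                                   (0, sy + 1, sx, ny - sy - 1),
                                   (sx + 1, sy + 1, nx - sx - 1, ny - sy - 1)]:
          order += nested_dissection_2d(qnx, qny, x0 + qx0, y0 + qy0, full_nx)
      order += [vid(sx, y) for y in range(ny)]             # the separator is
      order += [vid(x, sy) for x in range(nx) if x != sx]  # ordered last
      return order

Permuting A symmetrically by the returned order and factoring it yields the fill and work bounds derived next.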

We analyze the fill and work in the factorization by blocks of columns. Let SS be the top-level separator and let V1,V2,,V4V_{1},V_{2},\ldots,V_{4} be the vertex sets of the separated subgraphs. We have

\eta(L) = \eta\left(L_{\colon,S}\right)+\eta\left(L_{\colon,V_{1}}\right)+\eta\left(L_{\colon,V_{2}}\right)+\eta\left(L_{\colon,V_{3}}\right)+\eta\left(L_{\colon,V_{4}}\right)
\leq \frac{|S|\left(|S|+1\right)}{2}+\eta\left(L_{\colon,V_{1}}\right)+\eta\left(L_{\colon,V_{2}}\right)+\eta\left(L_{\colon,V_{3}}\right)+\eta\left(L_{\colon,V_{4}}\right)\;.

Bounding the fill in a block of columns that corresponds to one of the separated subgraphs is tricky. We cannot simply substitute a similar expression to form a recurrence. The matrix A is square, but when we analyze fill in one of these column blocks, we need to account for fill both in the square diagonal block and in the subdiagonal block. If u∈S and v∈V_1 and A_{u,v}≠0, then elements in L_{u,V_1} can fill. An accurate estimate of how much fill occurs in such blocks of the factor is critical to the analysis of nested-dissection orderings. If we use the trivial bound η(L_{u,V_1}) ≤ n/4, we only get an asymptotically loose bound η(L)=O(n^{1.5}) on fill.

To achieve an asymptotically tight bound, we set up a recurrence for fill in a nested-dissection ordering of the mesh. Let η¯(mx,my)\bar{\eta}(m_{x},m_{y}) be a bound on the fill in the columns corresponding to an mxm_{x}-by-mym_{y} submesh (with |mxmy|1|m_{x}-m_{y}|\leq 1). At most 2mx+2my2m_{x}+2m_{y} edges connect the submesh to the rest of the mesh. Therefore we have

\bar{\eta}(m_{x},m_{y})\leq\frac{\left(m_{x}+m_{y}-1\right)\left(m_{x}+m_{y}\right)}{2}+\left(m_{x}+m_{y}-1\right)\left(2m_{x}+2m_{y}\right)+4\bar{\eta}\left(\frac{m_{x}}{2},\frac{m_{y}}{2}\right)\;.

The first term in the right-hand side bounds fill in the separator’s diagonal block. The second term, which is the crucial ingredient in the analysis, bounds fill in the subdiagonal rows of the separator columns. The third term bounds fill in the four column blocks corresponding to the subgraphs of the separated submesh. If we cast the recurrence in terms of m=mxmym=m_{x}m_{y} we obtain

\bar{\eta}(m)\leq\Theta\left(m\right)+4\bar{\eta}\left(\frac{m}{4}\right)\;.

By the master theorem (Theorem 4.1 in Cormen et al., 2nd edition), the solution of the recurrence is η̄(m)=O(m log m). Since η(L) ≤ η̄(n), we have η(L)=O(n log n). We can analyze the work in the factorization in a similar way. We set up a recurrence

\bar{\phi}(m)\leq\Theta\left(m^{1.5}\right)+4\bar{\phi}\left(\frac{m}{4}\right)\;,

whose solution leads to ϕ(A)=O(nn)\phi(A)=O(n\sqrt{n}). It is possible to show that these two bounds are tight and that under such a nested-dissection ordering for a two-dimensional mesh, η(L)=Θ(nlogn)\eta(L)=\Theta(n\log n) and ϕ(A)=Θ(nn)\phi(A)=\Theta(n\sqrt{n}).
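To see where the m log m bound comes from, one can unroll the fill recurrence directly (a standard master-theorem calculation): at depth i of the recursion there are 4^i subproblems of size m/4^i, each contributing Θ(m/4^i) fill, and there are about log_4 m levels, so

\bar{\eta}(m)\leq\sum_{i=0}^{\log_{4}m}4^{i}\cdot\Theta\!\left(\frac{m}{4^{i}}\right)=\Theta\left(m\log m\right)\;.

The work recurrence behaves differently: the per-level cost shrinks geometrically down the levels, so the top level dominates and the solution is Θ(m^{1.5}).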

For a three-dimensional mesh, a similar analysis yields η(L)=Θ(n4/3)\eta(L)=\Theta(n^{4/3}) and ϕ(A)=Θ(n6/3)=Θ(n2)\phi(A)=\Theta(n^{6/3})=\Theta(n^{2}).

In practice, we can reduce the constants by stopping the recursion at fairly large subgraphs, say around m=100, and ordering these subgraphs using a minimum-degree ordering. This does not change the asymptotic worst-case bounds on work and fill, but it usually improves the actual work and fill counts. There are other nested-dissection/minimum-degree hybrids that often improve the work and fill counts without hurting the asymptotic worst-case bounds.


3.9. Generalized Nested-Dissection Orderings

When GAG_{A} is not a regular mesh, the analysis becomes much harder. When applied to general graphs, the ordering framework that we described in which a small balanced vertex separator is ordered last and the connected components are ordered recursively is called generalized nested dissection.

Definition 20.

A class of graphs satisfies an (α,β,f)(\alpha,\beta,f) vertex-separator theorem (or a separator theorem for short) if every nn-vertex graph in the class has an (α,β,f)(\alpha,\beta,f) vertex separator.

For example, planar graphs satisfy a (2/3, √8, √n) separator theorem. Since nested dissection orders subgraphs recursively, we must ensure that the subgraphs belong to the same class of graphs, so that they can be ordered effectively as well.

Definition 21.

A class of graphs is said to be closed under subgraph if every subgraph of a graph in the class is also in the class.

There are two ways to use small balanced vertex separators to order an arbitrary graph. The first algorithm, which we call the lrt algorithm, guarantees an effective ordering for graphs belonging to a class that satisfies a separator theorem and is closed under subgraph. The algorithm receives as input a range [i,j] of indices and a graph G with n vertices, ℓ of which may already have been numbered. If ℓ>0, then the numbered vertices have indices j−ℓ+1,…,j. The algorithm assigns the rest of the indices, i,…,j−ℓ, to the unnumbered vertices. The algorithm works as follows (a code sketch appears after the description).

  1. (1)

    If nn is small enough, n(β(1α))2n\leq(\beta(1-\alpha))^{2}, the algorithm orders the unnumbered vertices arbitrarily and returns.

  2. (2)

The algorithm finds a small balanced vertex separator S in G. The separator separates the graph into subgraphs with vertex sets V_1 and V_2. The two subgraphs may contain more than one connected component each.

  3. (3)

    The algorithm arbitrarily assigns indices j|S|+1,,jj-\ell-|S|+1,\ldots,j-\ell to the vertices of SS.

  4. (4)

The algorithm recurses on the subgraphs induced by V_1∪S and by V_2∪S. The unnumbered vertices of the first subgraph are assigned the middle range of [i,j], and those of the second subgraph the first range (starting at i).

We initialize the algorithm by setting =0\ell=0 and [i,j]=[1,n][i,j]=[1,n].
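The following Python sketch follows this recursion in a simplified form (our own illustration): instead of passing index ranges explicitly, it returns the unnumbered vertices in elimination order, with the separator vertices last at every level. The routine find_separator is a placeholder for whatever separator algorithm is available (for example, a planar-separator routine); it is assumed to return the sets (S, V_1, V_2) of Definition 18.

  def lrt_order(G, numbered, alpha, beta, find_separator):
      # G: dict mapping each vertex to the set of its neighbors
      # numbered: set of vertices that are already numbered (they are ordered last)
      n = len(G)
      free = [v for v in G if v not in numbered]
      if n <= (beta * (1.0 - alpha)) ** 2:       # step 1: small graph, arbitrary order
          return free
      S, V1, V2 = find_separator(G)              # step 2: find a separator
      sep = [v for v in S if v not in numbered]  # step 3: S gets the highest free indices
      induced = lambda W: {v: G[v] & W for v in W}
      # step 4: recurse; the separator is passed to both calls as already numbered
      part2 = lrt_order(induced(V2 | S), numbered | S, alpha, beta, find_separator)
      part1 = lrt_order(induced(V1 | S), numbered | S, alpha, beta, find_separator)
      return part2 + part1 + sep                 # lowest indices first, separator last

  # initial call, for a graph with no pre-numbered vertices:
  # order = lrt_order(G, set(), alpha, beta, find_separator)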

The second algorithm, called the gt algorithm, finds a separator, orders its vertices last, and then recurses on each connected component of the graph with the separator removed; the vertices of each component are ordered consecutively. In each recursive call, the algorithm finds a separator that separates the vertices of one component, ignoring the rest of the graph. This algorithm does not guarantee an effective ordering for every graph belonging to a class that satisfies a separator theorem and is closed under subgraph; additional conditions on the graphs are required, such as bounded degree or closure under edge contraction. Therefore, we analyze the first algorithm.

Theorem 22.

Let G be a graph belonging to a class that satisfies an (α,β,√n) separator theorem and is closed under subgraph. Ordering the vertices of G using the lrt algorithm leads to O(n log n) fill edges.

Proof.

We prove the theorem by induction on nn. If nn0n\leq n_{0}, the theorem holds by setting the constant cc in the big-OO notation high enough so that cnlogn>n(n1)/2cn\log n>n(n-1)/2 for all nn0n\leq n_{0}. Suppose that the theorem holds for all graphs with fewer than nn vertices.

The algorithm finds a separator S that splits the graph into subgraphs induced by V_1 and by V_2. We partition the fill edges into four categories: (1) fill edges with both endpoints in S; (2) fill edges with one endpoint in S and the other among the ℓ already-numbered vertices; (3,4) fill edges with at least one endpoint in V_1 or in V_2, respectively. Since there are no fill edges with one endpoint in V_1 and the other in V_2, categories 3 and 4 are indeed disjoint.

The number of fill edges in Category 1 is at most |S|(|S|1)/2β2n/2|S|(|S|-1)/2\leq\beta^{2}n/2.

The number of fill edges in Category 2 is at most |S|βn|S|\ell\leq\beta\ell\sqrt{n}.

Let η¯(n,)\bar{\eta}(n,\ell) be an upper bound on the number of fill edges in the ordering produced by the algorithm on a graph with nn vertices, \ell of which are already numbered.

The number of fill edges in Category 3 is at most η̄(|V_1|+|S|, ℓ_1), where ℓ_1 is the number of already-numbered vertices in V_1∪S after the vertices of S have been numbered. Note that ℓ_1 may be smaller than ℓ+|S| because some of the ℓ initially-numbered vertices may be in V_2. Similarly, the number of fill edges in Category 4 is at most η̄(|V_2|+|S|, ℓ_2).

By summing the bounds for the four categories, we obtain a recurrence on η¯\bar{\eta},

(3.9.1) \bar{\eta}\left(n,\ell\right)\leq\frac{\beta^{2}n}{2}+\beta\ell\sqrt{n}+\bar{\eta}\left(|V_{1}|+|S|,\ell_{1}\right)+\bar{\eta}\left(|V_{2}|+|S|,\ell_{2}\right)\;.

We claim that

(3.9.2) \bar{\eta}\left(n,\ell\right)\leq c^{\prime}(n+\ell)\log_{2}n+c^{\prime\prime}\ell\sqrt{n}

for some constants cc^{\prime} and c′′c^{\prime\prime}. Since initially =0\ell=0, this bound implies the O(nlogn)O(n\log n) bound on fill.

To prove the bound, we denote n1=|V1|+|S|n_{1}=|V_{1}|+|S| and n2=|V2|+|S|n_{2}=|V_{2}|+|S| and note the following bounds:

\begin{array}{rcccl}&&\ell_{1}+\ell_{2}&\leq&\ell+2\beta\sqrt{n}\\ n&\leq&n_{1}+n_{2}&\leq&n+\beta\sqrt{n}\\ (1-\alpha)n&\leq&n_{1},n_{2}&\leq&\alpha n+\beta\sqrt{n}\end{array}

The first inequality follows from the fact that every already-numbered vertex in the input to the two recursive calls is either one of the ℓ initially-numbered vertices or a vertex of S. An initially-numbered vertex that is not in S is passed to only one of the recursive calls; vertices of S are passed as already numbered to both calls, but there are at most β√n of them. The second inequality follows from the fact that the subgraphs passed to the two recursive calls together contain all the vertices in V = S∪V_1∪V_2 and that the |S| ≤ β√n separator vertices are passed to both of the recursive calls. The third inequality follows from the guarantees on |V_1|, |V_2|, and |S| under the separator theorem.

We now prove the claim (3.9.2) by induction on nn. For nn smaller than some constant size, we can ensure that the claim holds simply by choosing cc^{\prime} and c′′c^{\prime\prime} to be large enough. In particular, η¯(n,)n(n1)/2\bar{\eta}\left(n,\ell\right)\leq n(n-1)/2, which is smaller than the right-hand side of (3.9.2) for small enough nn and large enough cc^{\prime} and c′′c^{\prime\prime}.

For larger nn, we use Equation (3.9.1) and invoke the inductive assumption regarding the correctness of (3.9.2),

\bar{\eta}\left(n,\ell\right) \leq \frac{\beta^{2}n}{2}+\beta\ell\sqrt{n}+\bar{\eta}\left(|V_{1}|+|S|,\ell_{1}\right)+\bar{\eta}\left(|V_{2}|+|S|,\ell_{2}\right)
\leq \frac{\beta^{2}n}{2}+\beta\ell\sqrt{n}+c^{\prime}\left(n_{1}+\ell_{1}\right)\log_{2}n_{1}+c^{\prime\prime}\ell_{1}\sqrt{n_{1}}+c^{\prime}\left(n_{2}+\ell_{2}\right)\log_{2}n_{2}+c^{\prime\prime}\ell_{2}\sqrt{n_{2}}\;.

The rest of the proof only involves manipulations of the expression in the second line to show that for large enough cc^{\prime} and c′′c^{\prime\prime}, it is bounded by (3.9.2). We omit the details. ∎

A similar analysis yields a bound on the number of arithmetic operations.

Theorem 23.

Let G be a graph belonging to a class that satisfies an (α,β,√n) separator theorem and is closed under subgraph. Ordering the vertices of G using the lrt algorithm leads to O(n^{1.5}) arithmetic operations in the factorization.

These results can be applied directly to matrices whose pattern graphs are planar, and more generally to graphs that can be embedded in surfaces with a bounded genus. Such graphs are closed under subgraph and satisfy a (2/3,8,n)(2/3,\sqrt{8},\sqrt{n}) separator theorem. Furthermore, the separator in an nn-vertex graph of this family can be found in O(n)O(n) operations, and it is possible to show that we can even find all the separators required in all the levels of the recursive algorithm in O(nlogn)O(n\log n) time.

For the gt algorithm, which is somewhat easier to implement and which finds separators faster (because the separator is not included in the two subgraphs to be recursively partitioned), a separator theorem is not sufficient to guarantee comparable bounds on fill and work. To ensure that the algorithm leads to similar asymptotic bounds, one must also assume that the graphs are closed under edge contraction or that they have bounded degree. However, the two algorithms have roughly the same applicability, because planar graphs and graphs that can be embedded on surfaces with bounded genus are closed under edge contraction. It is not clear which of the two algorithms is better in practice.

The fill and work results for both the lrt and the gt algorithms can be extended to graphs that satisfy separator theorems with separators smaller or larger than n\sqrt{n}. See the original articles for the asymptotic bounds.

Finally, we mention that an algorithm that finds approximately optimal balanced vertex separators can be used to find a permutation that approximately minimizes fill and work in sparse Cholesky. The algorithm is similar in principle to the lrt and gt algorithms. This result shows that, up to polylogarithmic factors, the quality of vertex separators in a graph determines the sparsity that we can achieve in sparse Cholesky.

3.10. Notes and References

Bandwidth and envelope reduction orderings.

The idea of nested dissection, and the analysis of nested dissection on regular meshes is due to George XXX. The lrt generalized-nested-dissection algorithm and its analysis are due to Lipton, Rose, and Tarjan XXX. The gt algorithm and its analysis are due to Gilbert and Tarjan XXX.

Exercises

Exercise 24.

The proof of Theorem 10 was essentially based on the fact that the following algorithm is a correct implementation of the Cholesky factorization:

  S = A
  for j = 1:n
    L_{j,j} = sqrt( S_{j,j} )
    L_{j+1:n,j} = S_{j+1:n,j} / L_{j,j}
    for i = j+1:n
      for k = j+1:n
        S_{i,k} = S_{i,k} - L_{i,j} L_{k,j}
      end
    end
  end

  1. (1)

    Show that this implementation is correct.

  2. (2)

    Show that in each outer iteration, the code inside the inner loop, the kk loop, performs η2(Lj+1:n,j)\eta^{2}(L_{j+1\colon n,j}) nontrivial subtractions (subtractions in which a nonzero value is subtracted).

  3. (3)

This implementation of the Cholesky factorization is often called jik Cholesky, because the ordering of the loops is j first (outermost), then i, and finally k, the innermost loop. In fact, all 6 permutations of the loop indices yield correct Cholesky factorizations; the expression inside the inner loop is the same in all the permutations. Show this by providing 6 appropriate Matlab functions.

  4. (4)

For each of the 6 permutations, consider the middle iteration of the outer loop. Sketch the elements of A, the elements of L, and the elements of S that are referenced during this iteration of the outer loop. (For the jik loop ordering shown above, A is not referenced inside the outer loop.)

Exercise 25.

In this exercise we explore fill and work in the Cholesky factorization of banded and low-profile matrices. We say that a matrix has a half-bandwidth kk if for all i<jki<j-k we have Aj,i=0A_{j,i}=0.

  1. (1)

    Suppose that AA corresponds to an xx-by-yy two-dimensional mesh whose vertices are ordered as in the previous chapter. What is the half bandwidth of AA? What happens when xx is larger than yy, and what happens when yy is larger? Can you permute AA to achieve a minimal bandwidth in both cases?

  2. (2)

    Show that in the factorization of a banded matrix, all the fill occurs within the band. That is, LL is also banded with the same bound on its half bandwidth.

  3. (3)

    Compute a bound on fill and work as a function of nn and kk.

  4. (4)

    In some cases, AA is banded but with a large bandwidth, but in most of the rows and/or columns all the nonzeros are closer to the main diagonal than predicted by the half-bandwidth. Can you derive an improved bound that takes into account the local bandwidth of each row and/or column? In particular, you need to think about whether a bound on the rows or on the columns, say in the lower triangular part of AA, is more useful.

  5. (5)

We ordered the vertices of our meshes in a way that makes the matrices come out banded. In many applications, the matrices are not banded, but they can be symmetrically permuted to a banded form. Using an x-by-y mesh with x≫y as a motivating example, develop a graph algorithm that finds a permutation P that clusters the nonzeros of PAP^T near the main diagonal. Use breadth-first search as a starting point. Consider the previous question in the exercise as you refine the algorithm.

  6. (6)

Given an x-by-y mesh with x≫y, derive bounds on fill and work in a generalized-nested-dissection factorization of the matrix corresponding to the mesh. Use separators that approximately bisect the mesh along the x dimension until you reach subgraphs with a square aspect ratio. Give the bounds in terms of x and y.

Chapter 4 Symmetric Diagonally-Dominant Matrices and Graphs

Support theory provides effective algorithms for constructing preconditioners for diagonally-dominant matrices, and effective ways to analyze these preconditioners. This chapter explores the structure of diagonally-dominant matrices and the relation between graphs and diagonally-dominant matrices. The next chapter will show how we can use the structure of diagonally-dominant matrices to analyze preconditioners, and the chapter that follows presents algorithms for constructing preconditioners.

4.1. Incidence Factorizations of Diagonally-Dominant Matrices

Definition 26.

A square matrix An×nA\in\mathbb{R}^{n\times n} is called diagonally-dominant if for every i=1,2,ni=1,2,\ldots n we have

A_{ii}\geq\sum_{{j=1\atop j\neq i}}^{n}\left|A_{ij}\right|\;.

Symmetric diagonally dominant matrices have symmetric factorizations A=UUTA=UU^{T} such that each column of UU has at most two nonzeros, and all nonzeros in each column have the same absolute values. We now establish a notation for such columns.

Definition 27.

Let 1i,jn1\leq i,j\leq n, iji\neq j. A length-nn positive edge vector, denoted i,j\left\langle i,-j\right\rangle, is the vector

\left\langle i,-j\right\rangle=\begin{matrix}\\ i\\ \\ j\\ \end{matrix}\begin{bmatrix}\vdots\\ +1\\ \vdots\\ -1\\ \vdots\end{bmatrix},\quad\left\langle i,-j\right\rangle_{k}=\left\{\begin{array}{rl}+1&k=i\\ -1&k=j\\ 0&\textrm{otherwise}.\end{array}\right.

A negative edge vector i,j\left\langle i,j\right\rangle is the vector

\left\langle i,j\right\rangle=\begin{matrix}\\ i\\ \\ j\\ \end{matrix}\begin{bmatrix}\vdots\\ +1\\ \vdots\\ +1\\ \vdots\end{bmatrix},\quad\left\langle i,j\right\rangle_{k}=\left\{\begin{array}{rl}+1&k=i\\ +1&k=j\\ 0&\textrm{otherwise}.\end{array}\right.

The reason for the assignment of signs to edge vectors will become apparent later. A vertex vector i\left\langle i\right\rangle is the unit vector

\left\langle i\right\rangle_{k}=\left\{\begin{array}{rl}+1&k=i\\ 0&\textrm{otherwise}.\end{array}\right.

We now show that a symmetric diagonally dominant matrix can always be expressed as a sum of outer products of edge and vertex vectors, and therefore, as a symmetric product of a matrix whose columns are edge and vertex vectors.

Lemma 28.

Let An×nA\in\mathbb{R}^{n\times n} be a diagonally dominant symmetric matrix. We can decompose A as follows

A = \sum_{\begin{subarray}{c}i<j\\ A_{ij}>0\end{subarray}}\left|A_{ij}\right|\left\langle i,j\right\rangle\left\langle i,j\right\rangle^{T}
+\sum_{\begin{subarray}{c}i<j\\ A_{ij}<0\end{subarray}}\left|A_{ij}\right|\left\langle i,-j\right\rangle\left\langle i,-j\right\rangle^{T}
+\sum_{i=1}^{n}\left(A_{ii}-\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{n}\left|A_{ij}\right|\right)\left\langle i\right\rangle\left\langle i\right\rangle^{T}
= \sum_{\begin{subarray}{c}i<j\\ A_{ij}>0\end{subarray}}\left(\sqrt{\left|A_{ij}\right|}\;\left\langle i,j\right\rangle\right)\left(\sqrt{\left|A_{ij}\right|}\;\left\langle i,j\right\rangle\right)^{T}
+\sum_{\begin{subarray}{c}i<j\\ A_{ij}<0\end{subarray}}\left(\sqrt{\left|A_{ij}\right|}\;\left\langle i,-j\right\rangle\right)\left(\sqrt{\left|A_{ij}\right|}\;\left\langle i,-j\right\rangle\right)^{T}
+\sum_{i=1}^{n}\left(\sqrt{A_{ii}-\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{n}\left|A_{ij}\right|}\;\left\langle i\right\rangle\right)\left(\sqrt{A_{ii}-\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{n}\left|A_{ij}\right|}\;\left\langle i\right\rangle\right)^{T}\;.
Proof.

The terms in the summations in lines 1 and 4 are clearly equal (we only distributed the scalars), and so are the terms in lines 2 and 5 and in lines 3 and 6. Therefore, the second equality holds.

We now show that the first equality holds. Consider the rank-11 matrix

\left\langle i,-j\right\rangle\left\langle i,-j\right\rangle^{T}=\begin{bmatrix}\ddots\\ &+1&&-1\\ &&\ddots\\ &-1&&+1\\ &&&&\ddots\end{bmatrix}\;,

and similarly,

\left\langle i,j\right\rangle\left\langle i,j\right\rangle^{T}=\begin{bmatrix}\ddots\\ &+1&&+1\\ &&\ddots\\ &+1&&+1\\ &&&&\ddots\end{bmatrix}

(in both matrices the four nonzeros are in rows and columns i and j). Suppose A_{ij}>0 for some i≠j. The only contribution to element i,j from the three sums is from a single term in the first sum, either the term |A_{ij}| ⟨i,j⟩⟨i,j⟩^T or the term |A_{ji}| ⟨j,i⟩⟨j,i⟩^T, depending on whether i>j. The i,j element of this rank-1 matrix is |A_{ij}| = A_{ij}. If A_{ij}<0, a similar argument shows that only |A_{ij}| ⟨i,−j⟩⟨i,−j⟩^T (assuming i<j) contributes to the i,j element, and that the value of that element is −|A_{ij}| = A_{ij}. The third summation ensures that the values of the diagonal elements are also correct. ∎

The matrix decompositions of this form play a prominent role in support theory, so we give them a name:

Definition 29.

A matrix whose columns are scaled edge and vertex vectors (that is, vectors of the forms ci,jc\left\langle i,-j\right\rangle, ci,jc\left\langle i,j\right\rangle, and cic\left\langle i\right\rangle) is called an incidence matrix. A factorization A=UUTA=UU^{T} where UU is an incidence matrix is called an incidence factorization. An incidence factorization with no zero columns, with at most one vertex vector for each index ii, with at most one edge vector for each index pair i,ji,j, and whose positive edge vectors are all of the form cmin(i,j),max(i,j)c\left\langle\min(i,j),-\max(i,j)\right\rangle is called a canonical incidence factorization.

Lemma 30.

Let An×nA\in\mathbb{R}^{n\times n} be a diagonally dominant symmetric matrix. Then AA has an incidence factorization A=UUTA=UU^{T}, and a unique canonical incidence factorization.

Proof.

The existence of the factorization follows directly from Lemma 28. We now show that the canonical incidence factorization is uniquely determined by A. Suppose that A_{ij}=0. Then U cannot have a column that is a nonzero multiple of ⟨i,j⟩, ⟨i,−j⟩, or ⟨j,−i⟩, since if it did, there would be only one such column, which would imply (UU^T)_{ij} ≠ 0 = A_{ij}. Now suppose that A_{ij}>0. Then U must have a column √A_{ij} ⟨i,j⟩. Similarly, if A_{ij}<0, then U must have a column √(−A_{ij}) ⟨min(i,j),−max(i,j)⟩. The uniqueness of the edge vectors implies that the scaled vertex vectors in U are also unique. ∎

4.2. Graphs and Their Laplacian Matrices

We now define the connection between undirected graphs and diagonally-dominant symmetric matrices.

Definition 31.

Let G=({1,2,…,n},E) be an undirected graph on the vertex set {1,2,…,n} with no self loops or parallel edges. That is, the edge set E consists of unordered pairs of unequal integers (i,j) such that 1≤i,j≤n and i≠j. The degree of a vertex i is the number of edges incident on it. The Laplacian of G is the matrix A∈ℝ^{n×n} such that

A_{ij}=\left\{\begin{array}{rl}\textrm{degree}(i)&i=j\\ -1&(i,j)\in E\text{ (the index pair is unordered)}\\ 0&\textrm{otherwise}.\end{array}\right.
Lemma 32.

The Laplacian of an undirected graph is symmetric and diagonally dominant.

Proof.

The Laplacian is symmetric because the graph is undirected. The Laplacian is diagonally dominant because the number of off-diagonal nonzero entries in row ii is exactly degree(i)\textrm{degree}(i) and the value of each such nonzero is 1-1. ∎
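As a small illustration (ours, using numpy with a dense matrix for clarity), the Laplacian of Definition 31 can be assembled directly from the edge list:

  import numpy as np

  def laplacian(n, edges):
      A = np.zeros((n, n))
      for (i, j) in edges:          # 0-based vertex indices, i != j, no parallel edges
          A[i, j] = A[j, i] = -1.0  # off-diagonal entries are -1
          A[i, i] += 1.0            # each diagonal entry ends up equal to the degree
          A[j, j] += 1.0
      return A

  # Example: laplacian(3, [(0, 1), (1, 2)]) is the Laplacian of a path,
  # [[ 1, -1,  0], [-1,  2, -1], [ 0, -1,  1]].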

We now generalize the definition of Laplacians to undirected graphs with positive edge weights.

Definition 33.

Let G=({1,2,n},E,c)G=(\{1,2,\ldots n\},E,c) be a weighted undirected graph on the vertex set {1,2,,n}\{1,2,\ldots,n\} with no self loops, and with a weight function c:E{0}c:E\rightarrow\mathbb{R}\setminus\{0\}. The weighted Laplacian of GG is the matrix An×nA\in\mathbb{R}^{n\times n} such that

A_{ij}=\left\{\begin{array}{rl}\sum_{(i,k)\in E}\left|c(i,k)\right|&i=j\\ -c(i,j)&(i,j)\in E\\ 0&\textrm{otherwise}.\end{array}\right.

If some of the edge weights are negative, we call GG a signed graph, otherwise, we simply call it a weighted graph.

Lemma 34.

The Laplacian of a weighted undirected graph (signed or unsigned) is symmetric and diagonally dominant.

Proof.

Again symmetry follows from the fact that the graph is undirected. Diagonal dominance follows from the following equation, which holds for any ii, 1in1\leq i\leq n:

\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{n}\left|A_{ij}\right|=\sum_{\begin{subarray}{c}j=1\\ j\neq i\\ A_{ij}\neq 0\end{subarray}}^{n}\left|A_{ij}\right|=\sum_{(i,j)\in E}\left|A_{ij}\right|=\sum_{(i,j)\in E}\left|c(i,j)\right|=A_{ii}\;.

If we allow graphs to have nonnegative vertex weights as well as edge weights, then this class of graphs becomes completely isomorphic to symmetric diagonally-dominant matrices.

Definition 35.

Let G=({1,2,n},E,c,d)G=(\{1,2,\ldots n\},E,c,d) be a weighted undirected graph on the vertex set {1,2,,n}\{1,2,\ldots,n\} with no self loops, and with weight functions c:E{0}c:E\rightarrow\mathbb{R}\setminus\{0\} and d:{1,,n}+{0}d:\{1,\ldots,n\}\rightarrow\mathbb{R}_{+}\cup\{0\}. The Laplacian of GG is the matrix An×nA\in\mathbb{R}^{n\times n} such that

A_{ij}=\left\{\begin{array}{rl}d(i)+\sum_{(i,k)\in E}\left|c(i,k)\right|&i=j\\ -c(i,j)&(i,j)\in E\\ 0&\textrm{otherwise}.\end{array}\right.

A vertex ii such that d(i)>0d(i)>0 is called a strictly dominant vertex.

Lemma 36.

The Laplacians of the graphs defined in Definition 35 are symmetric and diagonally dominant. Furthermore, these graphs are isomorphic to symmetric diagonally-dominant matrices under this Laplacian mapping.

Proof.

Symmetry again follows from the fact that the graph is undirected. For any ii we have

\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{n}\left|A_{ij}\right|=\sum_{\begin{subarray}{c}j=1\\ j\neq i\\ A_{ij}\neq 0\end{subarray}}^{n}\left|A_{ij}\right|=\sum_{(i,j)\in E}\left|A_{ij}\right|=\sum_{(i,j)\in E}\left|c(i,j)\right|=A_{ii}-d(i)\;,

so

A_{ii}=d(i)+\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{n}\left|A_{ij}\right|\geq\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{n}\left|A_{ij}\right|

because d(i)≥0. This shows that these Laplacians are diagonally dominant. The isomorphism follows from the fact that the following expressions uniquely determine the graph ({1,…,n},E,c,d) associated with a diagonally-dominant symmetric matrix:

(i,j)\in E \text{ if and only if } i\neq j\text{ and }A_{ij}\neq 0
c(i,j) = -A_{ij}
d(i) = A_{ii}-\sum_{\begin{subarray}{c}j=1\\ j\neq i\end{subarray}}^{n}\left|A_{ij}\right|\;.

We prefer to work with vertex weights rather than allowing self loops because edge and vertex vectors are algebraically different. As we shall see below, linear combinations of edge vectors can produce new edge vectors, but never vertex vectors. Therefore, it is convenient to distinguish edge weights that correspond to scaling of edge vectors from vertex weights that correspond to scaling of vertex vectors.

In algorithms, given an explicit representation of a diagonally-dominant matrix AA, we can easily compute an explicit representation of an incidence factor UU (including the canonical incidence factor if desired). Sparse matrices are often represented by a data structure that stores a compressed array of nonzero entries for each row or each column of the matrix. Each entry in a row (column) array stores the column index (row index) of the nonzero, and the value of the nonzero. From such a representation of AA we can easily construct a sparse representation of UU by columns. We traverse each row of AA, creating a column of UU for each nonzero in the upper (or lower) part of AA. During the traversal, we can also compute all the d(i)d(i)’s. The conversion works even if only the upper or lower part of AA is represented explicitly.
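A minimal numpy sketch of this conversion (our own, operating on a dense symmetric diagonally-dominant array A for clarity rather than on a compressed sparse representation) builds the canonical incidence factor of Definition 29:

  import numpy as np

  def canonical_incidence_factor(A):
      n = A.shape[0]
      cols = []
      for i in range(n):
          for j in range(i + 1, n):        # one column per upper-triangle nonzero
              if A[i, j] != 0:
                  u = np.zeros(n)
                  w = np.sqrt(abs(A[i, j]))
                  if A[i, j] < 0:          # positive edge vector <i,-j>, with i = min(i,j)
                      u[i], u[j] = w, -w
                  else:                    # negative edge vector <i,j>
                      u[i], u[j] = w, w
                  cols.append(u)
      offdiag = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
      d = np.diag(A) - offdiag             # d(i) = A_ii - sum over j != i of |A_ij|
      for i in range(n):
          if d[i] > 0:                     # one vertex vector per strictly dominant row
              u = np.zeros(n)
              u[i] = np.sqrt(d[i])
              cols.append(u)
      return np.column_stack(cols)

  # For a diagonally-dominant symmetric A, np.allclose(U @ U.T, A) holds for the returned U.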

We can use the explicit representation of AA as an implicit representation of UU, with each off-diagonal nonzero of AA representing an edge-vector column of UU. If AA has no strictly-dominant rows, that is all. If AA has strictly dominant rows, we need to compute their weights using a linear traversal of AA.

4.3. Laplacians and Resistive Networks

Weighted (but unsigned) Laplacians model voltage-current relationships in electrical circuits made up of resistors. This fact has been used in developing some support preconditioners. This section explains how Laplacians model resistive networks.

Let G_A = ({1,…,n},E,c) be an edge-weighted graph and let A be its Laplacian. We view G_A as an electrical circuit with n nodes (connection points) and m=|E| resistors. If (i,j)∈E, then there is a resistor with conductance c(i,j) between nodes i and j (equivalently, with resistance 1/c(i,j)).

Given a vector xx of node potentials, the voltage drop across resistor (i,j)(i,j) is |xixj||x_{i}-x_{j}|. If the voltage drop is nonzero, current will flow across the resistor. Current flow has a direction, so we need to orient resistors in order to assign a direction to current flows. We arbitrarily orient resistors from the lower-numbered node to the higher-numbered one. In that direction, the directional voltage drop across (i,j)(i,j) is

x_{\min(i,j)}-x_{\max(i,j)}=\left\langle\min(i,j),-\max(i,j)\right\rangle^{T}x\;.

We denote E={(i1,j1),(i2,j2),,(im,jm)}E=\{(i_{1},j_{1}),(i_{2},j_{2}),\ldots,(i_{m},j_{m})\} and we denote

\tilde{U}=\left[\begin{array}{ccc}\left\langle\min(i_{1},j_{1}),-\max(i_{1},j_{1})\right\rangle&\cdots&\left\langle\min(i_{m},j_{m}),-\max(i_{m},j_{m})\right\rangle\end{array}\right]\;.

Using this notation, the vector U~Tx\tilde{U}^{T}x is the vector of all directional voltage drops. The directional current flow across resistor (i,j)(i,j) is c(i,j)(xmin(i,j)xmax(i,j))c(i,j)(x_{\min(i,j)}-x_{\max(i,j)}). We denote by CC the diagonal matrix with Ck,k=c(ik,jk)C_{k,k}=c(i_{k},j_{k}). Then the vector CU~TxC\tilde{U}^{T}x is the vector of directional current flow across the mm resistors.

Next, we compute nodal currents, the net current flowing into or out of each node. If we set the potential of node i to x_i, this means that we have connected node i to a voltage source, such as a battery. The voltage source keeps the potential at i at exactly x_i volts. To do so, it might need to send current into i or to absorb current flowing out of i. These current flows also have a direction: we can either compute the current flowing into i or the current flowing out of i. We will arbitrarily compute the current flowing into i from the source, which equals the net current flowing from i to its neighbors in the circuit. Let (i_k,j_k)∈E. If i_k<j_k, then (CŨ^T x)_k represents the current c(i_k,j_k)(x_{min(i_k,j_k)}−x_{max(i_k,j_k)}) flowing from i_k to j_k, which is positive if x_{i_k}>x_{j_k}. If i_k>j_k, then the current flowing from i_k to j_k has a negative sign (even though it too flows out of i_k if x_{i_k}>x_{j_k}), so we have to negate it before we add it to the total current flowing from i_k to its neighbors, which is exactly the net current flowing into i_k from the voltage source. That is, the total current from a node i to its neighbors is

\sum_{\begin{subarray}{c}(i_{k},j_{k})\in E\\ i_{k}=i\end{subarray}}(-1)^{i>j_{k}}\left(C\tilde{U}^{T}x\right)_{k}=\tilde{U}_{i,\colon}C\tilde{U}^{T}x\;.

Therefore, the vector of net nodal currents is exactly ŨCŨ^T x = Ax. We can also compute the total power dissipated by the circuit: we multiply the current flowing across each resistor by the voltage drop and sum over the resistors, to obtain (x^T Ũ)(CŨ^T x) = x^T ŨCŨ^T x.

We summarize the various voltage-current relationships:

\begin{array}{ll}x&\text{nodal voltages}\\ \tilde{U}^{T}x&\text{voltage drops across resistors}\\ C\tilde{U}^{T}x&\text{current flows across resistors}\\ \tilde{U}C\tilde{U}^{T}x=Ax&\text{net nodal currents}\\ x^{T}\tilde{U}C\tilde{U}^{T}x=x^{T}Ax&\text{total power dissipated by the circuit}\end{array}
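These relationships are easy to check numerically; the following numpy sketch (our own) builds Ũ and C from a list of edges (i, j, c) with i < j and conductance c > 0:

  import numpy as np

  def circuit_matrices(n, edges):
      m = len(edges)
      U = np.zeros((n, m))                 # the matrix called U-tilde above
      C = np.zeros((m, m))
      for k, (i, j, c) in enumerate(edges):
          U[i, k], U[j, k] = 1.0, -1.0     # edge vector <i,-j> with i < j
          C[k, k] = c
      return U, C

  # With x a vector of nodal voltages:
  #   U.T @ x              voltage drops across the resistors
  #   C @ U.T @ x          current flows across the resistors
  #   U @ C @ U.T @ x      net nodal currents, equal to A @ x for the Laplacian A
  #   x @ U @ C @ U.T @ x  dissipated power, equal to x.T @ A @ x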

These relationships provide physical interpretations to many quantities and equations that we have already seen. Given a vector xx of nodal voltages, AxAx is the corresponding vector of nodal currents, where AA is the Laplacian of the circuit. Conversely, given a vector bb of nodal currents, solving Ax=bAx=b determines the corresponding nodal voltages. The maximal eigenvalue of AA measures the maximum power that the circuit can dissipate under a unit vector of voltages, and the minimal nonzero eigenvalue measures the minimum power that is dissipated under a unit voltage vector that is orthogonal to the constant vector (which spans the null space of AA).

Laplacians with strictly-dominant rows arise when the boundary conditions, the known quantities in the circuit, include a mixture of nodal voltages and nodal currents. Suppose that we know the currents in all the nodes except for one, say nn, where we know the voltage, not the current. Let us assume that the voltage at node nn is 55 Volts. We want to compute the voltages at nodes 11 to n1n-1. We cannot use the linear system U~CU~Tx=Ax=b\tilde{U}C\tilde{U}^{T}x=Ax=b directly, because we do not know bnb_{n}, and on the other hand, we do know that xn=5x_{n}=5. Therefore, we drop the last row from this linear system, since we do not know its right-hand side. We have

\begin{bmatrix}A_{1,1}&\cdots&A_{1,n-1}&A_{1,n}\\ \vdots&&\vdots&\vdots\\ A_{n-1,1}&\cdots&A_{n-1,n-1}&A_{n-1,n}\end{bmatrix}\begin{bmatrix}x_{1}\\ \vdots\\ x_{n-1}\\ 5\end{bmatrix}=\begin{bmatrix}b_{1}\\ \vdots\\ b_{n-1}\end{bmatrix}

or

\begin{bmatrix}A_{1,1}&\cdots&A_{1,n-1}\\ \vdots&&\vdots\\ A_{n-1,1}&\cdots&A_{n-1,n-1}\end{bmatrix}\begin{bmatrix}x_{1}\\ \vdots\\ x_{n-1}\end{bmatrix}=\begin{bmatrix}b_{1}\\ \vdots\\ b_{n-1}\end{bmatrix}-5\begin{bmatrix}A_{1,n}\\ \vdots\\ A_{n-1,n}\end{bmatrix}\;.

We have obtained a new square and symmetric coefficient matrix and a known right-hand side. The new matrix is still diagonally dominant, but now has strictly-dominant rows: if Ak,n0A_{k,n}\neq 0, then row kk in the new coefficient matrix is now strictly-dominant, since we removed Ak,nA_{k,n} from it. If we know the voltages at other nodes, we repeat the process and drop more rows and columns, making the remaining coefficient matrix even more dominant.
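A short numpy sketch of this reduction (our own illustration), for a known voltage v at the last node:

  import numpy as np

  def reduce_known_voltage(A, b, v):
      # A: the full Laplacian; b: nodal currents at nodes 1..n-1 (b[-1] is unused);
      # v: the known voltage at the last node
      A11 = A[:-1, :-1]                    # drop the last row and column
      rhs = b[:-1] - v * A[:-1, -1]        # move the known term to the right-hand side
      return A11, rhs

  # Solving A11 @ x = rhs and appending x_n = v recovers all the nodal voltages.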

Given two resistive networks with the same number of unknown-voltage nodes, what is the interpretation of a path embedding π\pi? An embedding of GAG_{A} in GBG_{B} shows, for each edge (i,j)(i,j) in GAG_{A}, a path between ii and jj in GBG_{B}. That path can carry current and its conductance, which is the inverse of the sum of resistances along the path, serves as a lower bound on the conductance between ii and jj in GBG_{B}. Intuitively, an embedding allows us to show that GBG_{B} is, up to a certain factor, “as good as” GAG_{A}, in the sense that for a given vector of voltages, currents in GBG_{B} are not that much smaller than the currents in GAG_{A}.

Chapter 7 Augmented Spanning-Tree Preconditioners

It’s time to construct a preconditioner! This chapter presents combinatorial algorithms for constructing preconditioners that are based on augmenting spanning trees with extra edges.

7.1. Spanning Tree Preconditioners

We start with the simplest support preconditioner, a maximum spanning tree for weighted Laplacians. The construction of the preconditioner BB aims to achieve three goals:

  • The generalized eigenvalues (A,B)(A,B) should be at least 11.

  • The preconditioner should be as sparse as possible and as easy to factor as possible.

  • The product of the maximum dilation and the maximum congestion should be low, to ensure that generalized eigenvalues of the pencil (A,B)(A,B) are not too large. (We can also try to achieve low stretch.)

These objectives will not lead us to a very effective preconditioner. It usually pays to relax the second objective and make the preconditioner a little denser in order to achieve a smaller κ(A,B). But here we strive for simplicity, so we stick with these objectives.

We achieve the first objective using a simple technique. Given AA, we compute its canonical incidence factor UU. We construct VV by dropping some of the columns of UU. If we order the columns of UU so that the columns that we keep in VV appear first, then

V=U\left(\begin{array}{c}I\\ 0\end{array}\right)\;.

This is a subset preconditioner, so

\lambda(A,B)\geq\sigma^{-1}(B,A)\geq\left\|\left(\begin{array}{c}I\\ 0\end{array}\right)\right\|_{2}^{-2}=1^{-2}=1\;.

We now study the second objective, sparsity in B. By Lemma LABEL:lemma:preconditioner-connectedness, we should keep G_B as connected as G_A. That is, we cannot drop so many columns from U that connected components of G_A become disconnected in G_B. How sparse can we make G_B under this constraint? As sparse as a spanning forest of G_A. That is, we can drop edges from G_A until no cycles remain in G_B: if there are cycles, we can clearly drop an edge; if there are no cycles, dropping an edge disconnects a connected component. A spanning subgraph with no cycles is called a spanning forest; if G_A is connected, the spanning forest is a spanning tree. A spanning forest G_B of G_A is always very sparse; its number of edges is n minus the number of connected components of G_A. The weighted Laplacian B of a spanning forest G_B can be factored into triangular factors in Θ(n) time, the factor requires Θ(n) words of memory to store, and each preconditioning step requires Θ(n) arithmetic operations.

Which edges should we drop, and which should we keep in the spanning forest? If G_B is a spanning forest of G_A, then for every edge (i_1,i_2) of G_A there is exactly one path in G_B between i_1 and i_2. Therefore, if G_B is a spanning forest, there is a unique path embedding π of the edges of G_A into paths in G_B. The dropping policy should try to minimize the maximum congestion and dilation of the embedding (or its stretch/crowding). The expressions for congestion, dilation, stretch, and crowding are sums of ratios whose denominators are absolute values, or squares of absolute values, of entries of B. Therefore, to reduce the congestion, dilation, and stretch, we can try to drop edges (i_1,i_2) that correspond to entries A_{i_1,i_2} with small absolute values, and keep “heavy” edges that correspond to entries of A with large absolute values.

One class of forests that is easy to construct and that favors heavy edges is the class of maximum spanning forests. A maximum spanning forest maximizes the sum of the weights −A_{i_1,i_2} of the edges of the forest. One property of maximum spanning forests is particularly useful for us.

Lemma 37.

Let G_B be a maximum spanning forest of the weighted (but unsigned) graph G_A, and let π(i_1,i_ℓ)=(i_1,i_2,…,i_ℓ) be the path with endpoints i_1 and i_ℓ in G_B. Then for j=1,…,ℓ−1 we have |A_{i_1,i_ℓ}| ≤ |A_{i_j,i_{j+1}}|.

Proof.

Suppose for contradiction that for some j, |A_{i_1,i_ℓ}| > |A_{i_j,i_{j+1}}|. If we add (i_1,i_ℓ) to G_B, we create a cycle. If we then drop (i_j,i_{j+1}), the resulting subgraph again becomes a spanning forest. The total weight of the new spanning forest is |A_{i_1,i_ℓ}| − |A_{i_j,i_{j+1}}| > 0 more than that of G_B, contradicting the hypothesis that G_B is a maximum spanning forest. ∎

It follows that in all the summations that constitute the congestion, dilation, stretch and crowding, the terms are bounded by 11. This yields the following result.

Lemma 38.

Let AA and BB be weighted Laplacians with identical row sums, such that GBG_{B} is a maximum spanning forest of GAG_{A}. Then

\kappa(A,B)\leq(n-1)m\;,

where nn is the order of AA and BB and mm is the number of nonzeros in the strictly upper triangular part of AA.

Proof.

Let WW correspond to the path embedding of GAG_{A} in GBG_{B}. For an edge (i1,i2)(i_{1},i_{2}) in GAG_{A} we have

\textrm{dilation}_{\pi}(i_{1},i_{2})=\sum_{(j_{1},j_{2})\in\pi(i_{1},i_{2})}\sqrt{\frac{A_{i_{1},i_{2}}}{B_{j_{1},j_{2}}}}\leq\sum_{(j_{1},j_{2})\in\pi(i_{1},i_{2})}1\leq n-1\;.

The rightmost inequality holds because the number of edges in a simple path in a graph is at most n−1, the number of vertices minus one. The total number of paths that use a single edge of G_B is at most m, the number of edges of G_A (in fact, it is not hard to see that the number of such paths is at most m−(n−2)). Therefore,

\kappa(A,B) \leq \sigma(A,B)/\sigma(B,A)
\leq \sigma(A,B)/1
\leq \left(\max\left\{1,\max_{(i_{1},i_{2})\in G_{A}}\textrm{dilation}_{\pi}(i_{1},i_{2})\right\}\right)\left(\max\left\{1,\max_{(j_{1},j_{2})\in G_{A}}\textrm{congestion}_{\pi}(j_{1},j_{2})\right\}\right)
\leq (n-1)m\;.

A similar argument shows that the stretch of an edge is at most n−1, and since there are at most m edges, ‖W‖_F² is at most (n−1)m+n. ∎

Algorithms for constructing minimum spanning trees and forests can easily be adapted to construct maximum spanning forests. For example, Kruskal’s algorithm starts out with no edges in G_B. It sorts the edges of G_A by weight and processes them from heavy to light. For each edge, the algorithm determines whether its endpoints are in the same connected component of the current forest. If they are, then adding the edge would close a cycle, so the edge is dropped from G_B; otherwise, the edge is added to G_B. This algorithm requires a union-find data structure to determine whether two vertices belong to the same connected component. It is easy to implement the algorithm so that it performs O(m log m) operations. The main data structure in another famous algorithm, Prim’s, is a priority queue of vertices; it can be implemented so that it performs O(m + n log n) operations. If A is very sparse, Prim’s algorithm is faster.
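A compact Python sketch of Kruskal’s algorithm adapted to maximum spanning forests (our own illustration), with a simple union-find; edges is a list of (i, j, w) where w = −A_{i,j} > 0 is the weight of an edge of G_A:

  def maximum_spanning_forest(n, edges):
      parent = list(range(n))                # union-find over the vertices
      def find(v):
          while parent[v] != v:
              parent[v] = parent[parent[v]]  # path halving
              v = parent[v]
          return v
      forest = []
      for (i, j, w) in sorted(edges, key=lambda e: -e[2]):   # heavy to light
          ri, rj = find(i), find(j)
          if ri != rj:                       # endpoints in different components: keep
              parent[ri] = rj
              forest.append((i, j, w))
          # otherwise the edge would close a cycle and is dropped
      return forest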

When we put together the work required to construct a maximum-spanning-forest preconditioner, to factor it, the condition number of the preconditioned system, and the cost per iteration, we can bound the total cost of the linear solver.

Theorem 39.

Let A be a weighted Laplacian of order n with n+2m nonzeros. Then a minimal-residual preconditioned Krylov-subspace method with a maximum-spanning-forest preconditioner can solve a consistent linear system Ax=b using O((n+m)√(nm)) operations.

Proof.

Constructing the preconditioner requires O(m+n log n) work. Computing the Cholesky factorization of the preconditioner requires Θ(n) work. The cost per iteration is Θ(n+m) operations, because A has Θ(n+m) nonzeros and the factor of the preconditioner has only Θ(n). The condition number of the preconditioned system is bounded by nm, so the number of iterations required to reduce the relative residual by a constant factor is O(√(nm)). Thus, the total solution cost is

O(m+n\log n)+\Theta(n)+O((n+m)\sqrt{nm})=O((n+m)\sqrt{nm})\;.

Can we do any better with a spanning-forest preconditioner? It seems that the congestion-dilation-product bound cannot give a condition-number bound better than O(mn). (Is there a lower bound?) But the stretch bound can. Constructions of low-stretch trees can produce a spanning forest G_B for G_A such that

\sum_{(i_{1},i_{2})\in G_{A}}\textrm{stretch}_{\pi}(i_{1},i_{2})=O\left((m+n)(\log n\log\log n)^{2}\right)\;,

where π is the embedding of the edges of G_A into paths in G_B. Like maximum spanning forests, low-stretch forests also favor heavy edges, but their optimization criterion is different. The number of operations required to construct such a forest is O((m+n)log²n). The construction details are considerably more complex than those of maximum spanning forests, so we do not give them here. The next theorem summarizes the total cost of solving a linear system with a low-stretch-forest preconditioner.

Theorem 40.

Let A be a weighted Laplacian of order n with n+2m nonzeros. A minimal-residual preconditioned Krylov-subspace method with a low-stretch-forest preconditioner can solve a consistent linear system Ax=b using O((n+m)^{1.5} log n log log n) operations.

Proof.

Constructing the preconditioner requires O((m+n)log2n)O((m+n)\log^{2}n) work. Computing the Cholesky factorization of the preconditioner requires Θ(n)\Theta(n) work. The cost per iteration is Θ(n+m)\Theta(n+m). The condition number of the preconditioned system is O((m+n)(lognloglogn)2)O((m+n)(\log n\log\log n)^{2}). Thus, the total solution cost is

O((m+n)\log^{2}n)+\Theta(n)+O\left((n+m)\sqrt{(m+n)(\log n\log\log n)^{2}}\right)=O\left((n+m)^{1.5}\log n\log\log n\right)\;.

7.2. Vaiyda’s Augmented Spanning Trees

The analysis of spanning-tree preconditioners shows how they can be improved. The construction of the preconditioner is cheap, the factorization of the preconditioner is cheap, each iteration is cheap, but the solver performs many iterations. This suggests that it might be better to make the preconditioner a bit denser. This would make the construction and the factorization more expensive, and it would make every iteration more expensive. But if we add edges cleverly, the number of iterations can be dramatically reduced.

The following algorithm adds edges to a maximum spanning tree in an attempt to reduce the bounds on both the congestion and the dilation. The algorithm works in two phases. In the first phase, the algorithm removes edges to partition the tree into about tt subtrees of roughly the same size. In the second phase, the algorithm adds the heaviest edge between every two subtrees. The value tt is a parameter that controls the density of the preconditioner. When tt is small, the preconditioner remains a tree or close to a tree. As tt grows, the preconditioner becomes denser and more expensive to construct and to factor, the cost of every iteration grows, but the number of iterations shrinks.

The partitioning algorithm, called TreePartition, is recursive and works as follows. It maintains an integer vector ss. The element sis_{i} is the number of vertices in the subtree rooted at ii. We begin by calling TreePartition on the root of the maximum spanning tree. The algorithm processes a vertex ii by initializing sis_{i} to 11 and then recursively processing each child jj; after the recursive call it either merges the child's subtree into the one containing ii (adding sjs_{j} to sis_{i}) or, if the merged subtree would contain more than n/t+1n/t+1 vertices, removes the edge (i,j)(i,j), so that the subtree rooted at jj becomes one of the output subtrees.

TreePartition(vertex ii, integer array ss)
  si1s_{i}\leftarrow 1
  for each child jj of ii
    TreePartition(jj,ss)
    if (si+sj>n/t+1)(s_{i}+s_{j}>n/t+1)
      remove the edge (i,j)(i,j) from the tree
    else
      sisi+sjs_{i}\leftarrow s_{i}+s_{j}

Algorithm 1
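
The second phase is simpler than the first. The sketch below is our own Python illustration of it (the names augment, graph_edges, tree_edges, and subtree_of are ours, not from the text): given the partition produced by TreePartition, it adds, for every pair of distinct subtrees, the heaviest edge of the original graph that connects them.

def augment(graph_edges, tree_edges, subtree_of):
    # graph_edges: the edges of the original graph as a dict {(u, v): weight};
    # tree_edges: the edges kept from the maximum spanning tree;
    # subtree_of: maps each vertex to the index of its subtree.
    best = {}  # (s1, s2) -> heaviest edge seen so far between subtrees s1, s2
    for (u, v), w in graph_edges.items():
        s1, s2 = subtree_of[u], subtree_of[v]
        if s1 == s2:
            continue  # the edge is internal to a subtree
        key = (min(s1, s2), max(s1, s2))
        if key not in best or w > best[key][1]:
            best[key] = ((u, v), w)
    return list(tree_edges) + [edge for edge, w in best.values()]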

7.3. Notes and References

The low-stretch forests are from Elkin-Emek-Spielman-Teng, STOC 2005. This construction improves on an earlier one by Alon-Karp-Peleg-West.

Appendix A Linear Algebra Notation and Definitions

This appendix provides notation, definitions and some well known results, mostly in linear algebra. We assume that all matrices and vectors are real unless we explicitly state otherwise.

A square matrix AA is symmetric if Aij=AjiA_{ij}=A_{ji} for all ii and jj.

A.1. Eigenvalues and Eigenvectors

A square matrix AA is positive definite if xTAx>0x^{T}Ax>0 for all x0x\neq 0 and positive semidefinite if xTAx0x^{T}Ax\geq 0 for all xx.

The eigendecomposition of a square matrix is a factorization A=VΛV1A=V\Lambda V^{-1} where Λ\Lambda is diagonal. The columns of VV are called eigenvectors of AA and the diagonal elements of Λ\Lambda are called the eigenvalues of AA. A real matrix may have complex eigenvalues. The expression Λ(A)\Lambda(A) denotes the set of eigenvalues of AA.

Not all matrices have an eigendecomposition, but every symmetric matrix has one. Moreover, the eigenvalues of a symmetric matrix are real and its eigenvectors can be chosen to be orthonormal. Therefore, the eigendecomposition of a symmetric matrix can be written in the form A=VΛVTA=V\Lambda V^{T}. The eigenvalues of positive definite matrices are all positive, and the eigenvalues of positive semidefinite matrices are all non-negative.
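
As a quick numerical illustration of these facts (ours, not part of the text), NumPy's eigh routine computes the symmetric eigendecomposition:

import numpy as np

A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])              # symmetric and positive definite
lam, V = np.linalg.eigh(A)                      # real eigenvalues, orthonormal eigenvectors
print(lam)                                      # all positive: 2 - sqrt(2), 2, 2 + sqrt(2)
print(np.allclose(V @ V.T, np.eye(3)))          # the columns of V are orthonormal
print(np.allclose(V @ np.diag(lam) @ V.T, A))   # A = V Lambda V^T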

For any matrix VV, the product VVTVV^{T} is symmetric and positive semidefinite.

A.2. The Singular Value Decomposition

Every matrix Am×nA\in\mathbb{C}^{m\times n}, mnm\geq n, has a singular value decomposition (SVD) A=UΣVA=U\Sigma V^{*} where Um×nU\in\mathbb{C}^{m\times n} has orthonormal columns, Vn×nV\in\mathbb{C}^{n\times n} has orthonormal columns, and Σn×n\Sigma\in\mathbb{R}^{n\times n} is non-negative and diagonal. (This decomposition is sometimes called the reduced SVD; in the full SVD, UU is a square mm-by-mm unitary matrix and Σ\Sigma is mm-by-nn.) The columns of UU are called left singular vectors, the columns of VV are called right singular vectors, and the diagonal elements of Σ\Sigma are called singular values. The singular values are non-negative. We denote the set of singular values of AA by Σ(A)\Sigma(A).
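
A short NumPy check of the reduced SVD as defined above (our illustration; full_matrices=False requests the reduced form):

import numpy as np

A = np.random.randn(5, 3)                         # m-by-n with m >= n
U, s, Vh = np.linalg.svd(A, full_matrices=False)  # reduced SVD: A = U Sigma V^*
print(U.shape, s.shape, Vh.shape)                 # (5, 3), (3,), (3, 3)
print(np.allclose(U.T @ U, np.eye(3)))            # U has orthonormal columns
print(np.allclose(U @ np.diag(s) @ Vh, A))        # the factors reconstruct A
print(np.all(s >= 0))                             # singular values are non-negative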

A.3. Generalized Eigenvalues

Preconditioning involves two matrices, the coefficient matrix and the preconditioner. The convergence of iterative linear solvers for symmetric semidefinite problems depends on the generalized eigenvalues of this pair of matrices. A pair (S,T)(S,T) of matrices is also called a pencil.

Definition 41.

Let SS and TT be nn-by-nn complex matrices. We say that a scalar λ\lambda is a finite generalized eigenvalue of the matrix pencil (pair) (S,T)(S,T) if there is a vector v0v\neq 0 such that

Sv=λTvSv=\lambda Tv

and Tv0Tv\neq 0. We say that \infty is an infinite generalized eigenvalue of (S,T)(S,T) if there exists a vector v0v\neq 0 such that Tv=0Tv=0 but Sv0Sv\neq 0. Note that \infty is an eigenvalue of (S,T)(S,T) if and only if 0 is an eigenvalue of (T,S)(T,S). The finite and infinite eigenvalues of a pencil are called determined eigenvalues (the eigenvector uniquely determines the eigenvalue). If Sv=Tv=0Sv=Tv=0 for some vector v0v\neq 0, we say that vv is an indeterminate eigenvector, because Sv=λTvSv=\lambda Tv for any scalar λ\lambda.
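
A tiny example of the three kinds of eigenvectors (ours, not from the text), using SciPy's generalized eigenvalue routine; infinite generalized eigenvalues are typically reported as inf and indeterminate ones as nan, although the exact reporting depends on the library version.

import numpy as np
from scipy.linalg import eigvals

S = np.diag([2.0, 3.0, 5.0, 0.0])
T = np.diag([1.0, 1.0, 0.0, 0.0])
# e_1, e_2: finite generalized eigenvalues 2 and 3.
# e_3: T e_3 = 0 but S e_3 != 0, so it carries the infinite eigenvalue.
# e_4: S e_4 = T e_4 = 0, so it is an indeterminate eigenvector.
print(eigvals(S, T))   # expected: 2, 3, inf, nan (in some order)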

We order eigenvalues from smallest to largest. We denote the kkth eigenvalue of SS by λk(S)\lambda_{k}(S), and the kkth determined generalized eigenvalue of (S,T)(S,T) by λk(S,T)\lambda_{k}(S,T). Therefore λ1(S)λl(S)\lambda_{1}(S)\leq\dots\leq\lambda_{l}(S) and λ1(S,T)λd(S,T)\lambda_{1}(S,T)\leq\dots\leq\lambda_{d}(S,T), where ll is the number of eigenvalues of SS and dd is the number of determined eigenvalues of (S,T)(S,T).

Definition 42.

A pencil (S,T)(S,T) is Hermitian positive semidefinite (H/PSD) if SS is Hermitian, TT is positive semidefinite, and null(T)null(S)\mbox{null}(T)\subseteq\mbox{null}(S).

The generalized eigenvalue problem on H/PSD pencils is, mathematically, a generalization of the Hermitian eigenvalue problem. In fact, the generalized eigenvalues of an H/PSD pencil can be shown to be the eigenvalues of an equivalent Hermitian matrix; the proof appears in the Appendix. Based on this observation it is easy to show that other eigenvalue properties of Hermitian matrices have analogues for H/PSD pencils. For example, an H/PSD pencil (S,T)(S,T) has exactly rank(T)\mbox{rank}(T) determined eigenvalues (counting multiplicity), all of them finite and real.
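
The reduction to an equivalent Hermitian matrix can be sketched in a few lines of Python (our illustration; the helper name pencil_eigenvalues is hypothetical): if T = QDQ* and Q_+, D_+ collect the eigenvectors and positive eigenvalues of T, then the determined eigenvalues of an H/PSD pencil (S, T) are the eigenvalues of the Hermitian matrix D_+^{-1/2} Q_+^* S Q_+ D_+^{-1/2}.

import numpy as np

def pencil_eigenvalues(S, T, tol=1e-12):
    # Determined eigenvalues of an H/PSD pencil (S, T), assuming null(T) is
    # contained in null(S).
    d, Q = np.linalg.eigh(T)                 # T = Q diag(d) Q*, d >= 0
    keep = d > tol * d.max()                 # columns of Q spanning range(T)
    W = Q[:, keep] / np.sqrt(d[keep])        # W = Q_+ D_+^{-1/2}
    return np.linalg.eigvalsh(W.conj().T @ S @ W)   # rank(T) real eigenvalues

# Example: S v = lambda T v with determined eigenvalues 3 and 4; e_3 is indeterminate.
S = np.diag([3.0, 8.0, 0.0])
T = np.diag([1.0, 2.0, 0.0])
print(pencil_eigenvalues(S, T))              # [3., 4.]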

A useful tool for analyzing the spectrum of an Hermitian matrix is the Courant-Fischer Minimax Theorem [MatrixComp].

Theorem 43.

(Courant-Fischer Minimax Theorem) Suppose that Sn×nS\in\mathbb{C}^{n\times n} is an Hermitian matrix. Then

λk(S)=mindim(U)=kmaxxUxSxxx\lambda_{k}(S)=\min_{\dim(U)=k}\max_{x\in U}\frac{x^{*}Sx}{x^{*}x}

and

λk(S)=maxdim(V)=nk+1minxVxSxxx.\lambda_{k}(S)=\max_{\dim(V)=n-k+1}\min_{x\in V}\frac{x^{*}Sx}{x^{*}x}\,.
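
The extreme cases k=1 and k=n of Theorem 43 say that every Rayleigh quotient x*Sx/x*x lies between λ_1(S) and λ_n(S). The following small NumPy check (ours, a numerical illustration only) samples random vectors and verifies this:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
S = B + B.T                                   # a real symmetric (Hermitian) matrix
lam = np.linalg.eigvalsh(S)                   # eigenvalues, smallest to largest
X = rng.standard_normal((6, 10000))           # random nonzero test vectors
rq = np.einsum('ij,ij->j', X, S @ X) / np.einsum('ij,ij->j', X, X)
print(lam[0] <= rq.min() and rq.max() <= lam[-1])   # True: quotients lie in [lambda_1, lambda_n]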

As discussed above, the generalized eigenvalue problem on H/PSD pencils is a generalization of the eigenvalue problem on Hermitian matrices. Therefore, there is a natural generalization of Theorem 43 to H/PSD pencils, which we refer to as the Generalized Courant-Fischer Minimax Theorem. We now state the theorem. For completeness the proof appears in the Appendix.

Theorem 44.

(Generalized Courant-Fischer Minimax Theorem) Suppose that Sn×nS\in\mathbb{C}^{n\times n} is an Hermitian matrix and that Tn×nT\in\mathbb{C}^{n\times n} is an Hermitian positive semidefinite matrix such that null(T)null(S)\text{\emph{null}}(T)\subseteq\text{\emph{null}}(S). For 1krank(T)1\leq k\leq\mbox{\emph{rank}}(T) we have

λk(S,T)\displaystyle\lambda_{k}(S,T) =mindim(U)=kUnull(T)maxxUxSxxTx\displaystyle=\min_{\begin{array}[]{c}\dim(U)=k\\ U\perp\mbox{\emph{null}}(T)\end{array}}\max_{x\in U}\frac{x^{*}Sx}{x^{*}Tx}

and

λk(S,T)=maxdim(V)=rank(T)k+1Vnull(T)minxVxSxxTx.\lambda_{k}(S,T)=\max_{\begin{array}[]{c}\dim(V)=\mbox{rank}(T)-k+1\\ V\perp\mbox{\emph{null}}(T)\end{array}}\min_{x\in V}\frac{x^{*}Sx}{x^{*}Tx}\,.
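
Analogously, the cases k=1 and k=rank(T) of Theorem 44 imply that for every x orthogonal to null(T) the generalized Rayleigh quotient x*Sx/x*Tx lies between the smallest and largest determined eigenvalues of (S,T). A short numerical check (ours; the construction of S and T below is just one way to build an H/PSD pencil):

import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((5, 3))
T = C @ C.T                                  # positive semidefinite, rank 3
P = C @ np.linalg.pinv(C)                    # orthogonal projector onto range(T)
B = rng.standard_normal((5, 5))
S = P @ (B + B.T) @ P                        # Hermitian with null(T) inside null(S)

# Determined eigenvalues of (S, T) via the reduction to a Hermitian matrix.
d, Q = np.linalg.eigh(T)
keep = d > 1e-12 * d.max()
W = Q[:, keep] / np.sqrt(d[keep])
lam = np.linalg.eigvalsh(W.T @ S @ W)

X = P @ rng.standard_normal((5, 2000))       # random vectors orthogonal to null(T)
rq = np.einsum('ij,ij->j', X, S @ X) / np.einsum('ij,ij->j', X, T @ X)
print(lam[0] - 1e-9 <= rq.min() and rq.max() <= lam[-1] + 1e-9)   # True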