
A Two-Level Direct Solver for the Hierarchical Poincaré-Steklov Method

Joseph Kump, Anna Yesypenko, Per-Gunnar Martinsson
Abstract

The Hierarchical Poincaré-Steklov method (HPS) is a solver for linear elliptic PDEs that combines a multidomain spectral collocation discretization scheme with a highly efficient direct solver. It is particularly well-suited for solving problems with highly oscillatory solutions. However, its hierarchical structure requires a heterogeneous and recursive approach that limits the potential parallelism. Our work introduces a new two-level solver for HPS which divides the method into dense linear algebra operations that can be implemented on GPUs, and a sparse system encoding the hierarchical aspects of the method which interfaces with highly optimized multilevel sparse solvers. This methodology improves the numerical stability of HPS as well as its computational performance. We present a description and justification of our method, as well as an array of tests on three-dimensional problems to showcase its accuracy and performance.

keywords:
HPS, sparse direct solvers, elliptic PDEs, spectral methods, domain decomposition, GPUs
journal: arXiv
Oden Institute, University of Texas at Austin, 201 E 24th St, Austin, TX 78712, USA

1 Introduction

Consider an elliptic boundary value problem of the form

[Au](\mathbf{x}) = f(\mathbf{x}), \quad \mathbf{x} \in \Omega \subset \mathbb{R}^{d},
u(\mathbf{x}) = g(\mathbf{x}), \quad \mathbf{x} \in \partial\Omega.   (1)

Here $d = 2$ or $3$, $A$ is an elliptic partial differential operator (PDO), $\Omega$ is a bounded domain in $\mathbb{R}^{d}$, $g : \partial\Omega \rightarrow \mathbb{R}$ is a known Dirichlet boundary condition, and $f : \Omega \rightarrow \mathbb{R}$ is a known body load. Our goal is to find $u : \Omega \rightarrow \mathbb{R}$, the continuous function that satisfies (1). An example PDO is the 3D variable-coefficient Helmholtz operator with wavenumber $\kappa$ and smooth nonnegative function $b$,

Au = -\left(\frac{\partial^{2}}{\partial x_{1}^{2}} + \frac{\partial^{2}}{\partial x_{2}^{2}} + \frac{\partial^{2}}{\partial x_{3}^{2}}\right)u - \kappa^{2} b(\mathbf{x})\,u.   (2)

Numerically solving an oscillatory problem such as the Helmholtz equation is difficult because of a common phenomenon called the pollution effect [1, 2]. A domain-based numerical method for the Helmholtz equation suffers from the pollution effect if, as $\kappa \rightarrow \infty$, the total number of degrees of freedom needed to maintain accuracy grows faster than $\kappa^{d}$ [3]. This effect makes many numerical methods infeasible for highly oscillatory problems. One group of methods that generally avoids the pollution effect, however, is global spectral methods [4]. These discretize a bounded, continuous domain using a set of basis functions restricted to the domain; a common choice is polynomials interpolated at Chebyshev nodes, which are more effective than other classical polynomial bases (such as those built on Gaussian nodes) at accurately approximating continuous functions [5]. For a global spectral method using Chebyshev nodes, the degrees of freedom correspond to the order of the polynomials used to approximate the solution function. High-order polynomials can be highly oscillatory while remaining smooth, so they are well suited to approximating oscillatory functions.

However, global spectral methods introduce their own problems. The matrices they produce are typically dense, making their use computationally expensive, and they are often poorly conditioned [6]. In addition, Chebyshev nodes are denser near the domain boundary than in the interior, which can lead to uneven accuracy in the approximation.

One strategy to leverage the strengths of spectral methods while mitigating their drawbacks is to integrate them into multidomain approaches, which partition the domain into subdomains on which the spectral method is applied, then use some technique to “glue” these subdomains together. Examples include spectral collocation methods [7, 8] and spectral element methods [9, 10, 11]. With these approaches the total number of degrees of freedom is generally the product of the number of subdomains and the polynomial order per subdomain. Since the spectral method is localized to smaller subdomains, its polynomial order can remain relatively low, which improves conditioning. This approach also allows the operators assembled in the method’s implementation to be sparse, since subdomains typically interact only with nearby subdomains in the discretization.

The Hierarchical Poincaré-Steklov method, or HPS method, is a multidomain spectral collocation scheme. As its name suggests, HPS uses a hierarchical divide-and-conquer approach in which the domain $\Omega$ is recursively split in half into a series of subdomains [12]. The PDE is enforced locally on each lowest-level subdomain (“leaf”) through a spectral method. Then, in an upward pass through the tree of hierarchical subdomains, solution operators are formed for leaf boundaries by enforcing continuity of the Neumann derivative. The HPS scheme is typically solved directly.

In addition to the advantages of other multidomain spectral methods, the HPS scheme’s use of a direct solver makes it well-suited for problems where effective preconditioners are difficult to obtain (including the aforementioned oscillatory problems), and the operators constructed as part of the method can be reused for problems with the same PDO but different $f$ and $g$ [13]. This also makes HPS well-suited for solving parabolic PDEs, since data from the previous time step enters through the body load [14, 15]. However, the hierarchical structure of HPS causes some issues of its own. In particular, the solution operators generated at each level tend to be ill-conditioned, since they are derived from spectral differentiation matrices and Dirichlet-to-Neumann (DtN) maps which are themselves ill-conditioned. Moreover, the recursive nature of hierarchical methods does not lend itself to easy parallelization: the levels of solution operators have to be constructed in serial, since each one depends on the previous level [16].

We propose a modified solver for the HPS scheme on 3D domains that forgoes the hierarchical approach for a two-level framework, analogous to [17]. Instead of recursively constructing a chain of solution operators in an upward pass, we use the DtN maps of each leaf subdomain to formulate a sparse system that encodes all leaf boundaries concurrently. This system can easily interface with effective multifrontal sparse solvers such as MUMPS, which can apply state-of-the-art pivoting techniques to improve the method’s stability. In addition, flattening the hierarchical component of the method enables easier parallelization and the use of GPU accelerators for parts of the method. This paper provides a summary of the traditional HPS method, then presents our improved method and an array of numerical experiments to showcase its effectiveness.

2 Preliminaries

Recall the boundary value problem we intend to solve, shown in (1). In this section we will describe the use of spectral methods and merge operations for discretizing and solving this type of problem.

2.1 Leaf Computation

In this subsection we describe a global spectral method to approximate an elliptic PDE on a box domain. We will later apply it to individual subdomains within a larger boundary value problem. Suppose we have a PDE such as (1), where $\Omega$ is a rectangular domain such as $[0,1]^{d}$, $d = 2$ or $3$. We discretize $\Omega$ into a product grid of $p^{d}$ Chebyshev nodes $\{x_{k}\}_{k=1}^{p^{d}}$, where $p$ is the chosen polynomial order (other works have used polynomial coefficients as degrees of freedom instead [18]). Let $\mathbf{u} \in \mathbb{R}^{p^{d}}$ be the numerical approximation of $u$ on this discretization, such that $\mathbf{u}_{k} \approx u(x_{k})$. There exist spectral differentiation matrices $\mathbf{D}^{(1)}$, $\mathbf{D}^{(2)}$, and (in 3D) $\mathbf{D}^{(3)}$ such that $[\mathbf{D}^{(i)}\mathbf{u}]_{k} \approx [\partial_{i}u](x_{k})$, and more generally $\mathbf{D}^{(i)}\mathbf{u}$ is a discretization of $\partial_{i}u$ [19]. Thus we can approximate the partial differential operator with a spectral differentiation matrix. For example, for the 3D Helmholtz equation shown in (2), we can discretize the partial differential operator with

\mathbf{A}\mathbf{u} = -\left(\left(\mathbf{D}^{(1)}\right)^{2} + \left(\mathbf{D}^{(2)}\right)^{2} + \left(\mathbf{D}^{(3)}\right)^{2} + \kappa^{2} b(\mathbf{x})\mathbf{I}\right)\mathbf{u},   (3)

where $\mathbf{I}$ is the $p^{3} \times p^{3}$ identity matrix. Since Chebyshev nodes include the endpoints of an interval, there will always be points on $\partial\Omega$: $4p$ points on the edges for $d = 2$, and $6p^{2}$ points on the faces for $d = 3$. Let $B$ denote the indices of $\mathbf{u}$ that lie on $\partial\Omega$, and $I$ denote the indices of $\mathbf{u}$ on the interior of $\Omega$. We can then show

\mathbf{A}\mathbf{u} = \mathbf{f}
\implies \mathbf{A}_{I,:}\mathbf{u} = \mathbf{f}_{I}
\implies \mathbf{A}_{I,I}\mathbf{u}_{I} + \mathbf{A}_{I,B}\mathbf{u}_{B} = \mathbf{f}_{I}
\implies \mathbf{A}_{I,I}\mathbf{u}_{I} = -\mathbf{A}_{I,B}\mathbf{u}_{B} + \mathbf{f}_{I}
\implies \mathbf{u}_{I} = \left(-\left(\mathbf{A}_{I,I}\right)^{-1}\mathbf{A}_{I,B}\right)\mathbf{u}_{B} + \left(\mathbf{A}_{I,I}\right)^{-1}\mathbf{f}_{I}
\implies \mathbf{u}_{I} = \mathbf{S}\mathbf{u}_{B} + \left(\mathbf{A}_{I,I}\right)^{-1}\mathbf{f}_{I}.   (4)

$\mathbf{S}$ is sometimes called the solution operator [20], and is of size $\#\text{ of interior points} \times \#\text{ of boundary points} = (p-2)^{d} \times 2dp^{d-1}$. Thus, given a boundary value problem such as (1) with partial differential operator $A$ and a discretization $\mathbf{A}$ based on Chebyshev nodes $\{x_{k}\}_{k=1}^{p^{d}}$, we can derive $\mathbf{S}$, which maps the problem’s Dirichlet data to its solution on the interior of $\Omega$ (with the addition of a body load term, if the problem is inhomogeneous). This is a global spectral method, and it excels at approximating highly oscillatory problems such as the Helmholtz equation. However, it is poorly conditioned for larger $p$, and it can struggle to capture function details near the center of $\Omega$, since there are fewer Chebyshev nodes in that region. In addition, both $\mathbf{A}$ and $\mathbf{S}$ are dense matrices, making this method computationally expensive for large $p$.
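
For concreteness, the following is a minimal NumPy sketch of this leaf computation: it builds the spectral differentiation matrices by Kronecker products, forms $\mathbf{A}$ for the constant-coefficient Helmholtz operator on the reference box $[-1,1]^{3}$, and extracts the solution operator $\mathbf{S} = -\mathbf{A}_{I,I}^{-1}\mathbf{A}_{I,B}$ of (4). The routine names (cheb, leaf_operators) and the reference box are our own illustrative choices, not the paper's code.

import numpy as np

def cheb(p):
    # Chebyshev-Lobatto nodes on [-1, 1] and the p x p spectral differentiation matrix
    # (standard construction, e.g. Trefethen, "Spectral Methods in MATLAB").
    x = np.cos(np.pi * np.arange(p) / (p - 1))
    c = np.hstack([2.0, np.ones(p - 2), 2.0]) * (-1.0) ** np.arange(p)
    dX = x[:, None] - x[None, :]
    D = np.outer(c, 1.0 / c) / (dX + np.eye(p))
    D -= np.diag(D.sum(axis=1))
    return D, x

def leaf_operators(p, kappa=0.0, b=lambda pts: np.ones(len(pts))):
    # Discretize A = -Laplacian - kappa^2 b(x) on the reference box [-1, 1]^3 with a
    # p^3 tensor-product Chebyshev grid, then form S = -A_II^{-1} A_IB as in (4).
    D, x = cheb(p)
    I = np.eye(p)
    D1 = np.kron(np.kron(D, I), I)   # d/dx1
    D2 = np.kron(np.kron(I, D), I)   # d/dx2
    D3 = np.kron(np.kron(I, I), D)   # d/dx3
    X1, X2, X3 = np.meshgrid(x, x, x, indexing="ij")
    pts = np.column_stack([X1.ravel(), X2.ravel(), X3.ravel()])
    A = -(D1 @ D1 + D2 @ D2 + D3 @ D3) - kappa**2 * np.diag(b(pts))
    on_bdry = np.any(np.isclose(np.abs(pts), 1.0), axis=1)
    B, Iint = np.where(on_bdry)[0], np.where(~on_bdry)[0]
    S = -np.linalg.solve(A[np.ix_(Iint, Iint)], A[np.ix_(Iint, B)])
    return S, A, B, Iint, pts, (D1, D2, D3)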

2.2 Merge Operation for Two Subdomains

Figure 1: A rectangular 2D domain $\Omega$ divided into subdomains $\Omega^{(1)}, \Omega^{(2)}$. $\mathbf{u}_{1}$ and $\mathbf{u}_{2}$ refer to function values on nodes in the interiors of $\Omega^{(1)}$ and $\Omega^{(2)}$, while $\mathbf{u}_{3}$ and $\mathbf{u}_{4}$ are on $\partial\Omega$ restricted to $\Omega^{(1)}$ and $\Omega^{(2)}$, respectively (the Dirichlet boundary condition). $\mathbf{u}_{5}$ is the shared boundary between $\Omega^{(1)}$ and $\Omega^{(2)}$; its points are not on $\partial\Omega$.

To mitigate the shortcomings of a global spectral method in solving an elliptic PDE such as (1), we partition $\Omega$ into two rectangular non-overlapping subdomains $\Omega^{(1)}$ and $\Omega^{(2)}$ (we will alternatively call these subdomains “leaves” or “boxes”). We discretize each subdomain using $p^{d}$ Chebyshev nodes, as we did for $\Omega$ in Section 2.1. This leads to a total of $2p^{d} - p^{d-1}$ points, since there is one shared face of points between $\Omega^{(1)}$ and $\Omega^{(2)}$. A diagram of this discretization in 2D is presented in Figure 1, with $\mathbf{u}_{5}$ denoting the shared face. We already know the data on the other faces ($\mathbf{u}_{3}, \mathbf{u}_{4}$ in the figure) of both boxes as part of the Dirichlet boundary condition. If we can solve for the points on the shared face first, then we can solve for the interior of each subdomain separately using a spectral method as in Section 2.1, enforcing the PDE locally. To help solve for the shared face, we add an additional condition to our problem: the Neumann derivatives of $u$ on $\partial\Omega^{(1)}$ and $\partial\Omega^{(2)}$ must be equal on the shared face. In other words, we enforce continuity of the Neumann derivative.

To enforce this continuity, we need a numerical approximation to the Neumann derivative. This can be done using spectral differentiation matrices. In particular, we form a Neumann operator matrix $\mathbf{N}$ on each subdomain such that

u_{\text{Neumann}}\big|_{\partial\Omega^{(i)}} \approx \mathbf{N}\mathbf{u}\big|_{\Omega^{(i)}} = \begin{bmatrix}\mathbf{D}_{K_{l},:}^{(1)}\\ \mathbf{D}_{K_{r},:}^{(1)}\\ \mathbf{D}_{K_{d},:}^{(2)}\\ \mathbf{D}_{K_{u},:}^{(2)}\\ \mathbf{D}_{K_{b},:}^{(3)}\\ \mathbf{D}_{K_{f},:}^{(3)}\end{bmatrix}\mathbf{u}\big|_{\Omega^{(i)}} \quad \text{for } i = 1, 2.

Here $l, r, d, u, b, f$ are indices for each face of the subdomain boundary, with $l, r$ pointing in the $x_{1}$ direction, $d, u$ in the $x_{2}$ direction, and $b, f$ in the $x_{3}$ direction (respectively “left”, “right”, “down”, “up”, “back”, and “front”; in 2D, $b$ and $f$ are omitted). $\mathbf{N}$ obtains the Neumann derivative on each face, taking both the boundary points $\mathbf{u}_{B}$ and the interior points $\mathbf{u}_{I}$ as inputs. To obtain the Neumann derivatives using only $\mathbf{u}_{B}$, we combine $\mathbf{N}$ with the solution operator $\mathbf{S}$ to get

\mathbf{N}\mathbf{u} = \mathbf{N}_{:,[B,I]}\begin{bmatrix}\mathbf{u}_{B}\\ \mathbf{u}_{I}\end{bmatrix} \quad \text{(reindexing entries / columns)}
= \mathbf{N}_{:,[B,I]}\begin{bmatrix}\mathbf{u}_{B}\\ \mathbf{S}\mathbf{u}_{B} + \mathbf{A}_{I,I}^{-1}\mathbf{f}_{I}\end{bmatrix} \quad \text{from (4)}
= \mathbf{N}_{:,[B,I]}\begin{bmatrix}\mathbf{Id}\\ \mathbf{S}\end{bmatrix}\mathbf{u}_{B} + \mathbf{N}_{:,[B,I]}\begin{bmatrix}0\\ \mathbf{A}_{I,I}^{-1}\mathbf{f}_{I}\end{bmatrix}
= \mathbf{T}\mathbf{u}_{B} + \mathbf{N}_{:,I}\mathbf{A}_{I,I}^{-1}\mathbf{f}_{I}.   (5)

$\mathbf{Id}$ is the $\text{size}(\mathbf{u}_{B}) \times \text{size}(\mathbf{u}_{B})$ identity matrix, and $\mathbf{T}$ denotes the Dirichlet-to-Neumann (DtN) map for $\Omega^{(i)}$ (also called the Poincaré-Steklov operator). For homogeneous PDEs we approximate the Neumann derivative entirely with $\mathbf{T}\mathbf{u}_{B}$, whereas the inhomogeneous case requires a contribution from the body load, as seen in (5). Since $\mathbf{T}$ maps from the box boundary to itself, it is a square matrix of size $2dp^{d-1} \times 2dp^{d-1}$.
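
A minimal sketch of forming the DtN map $\mathbf{T}$ of (5) is given below, reusing the hypothetical leaf_operators routine from the earlier sketch. The Neumann operator $\mathbf{N}$ is assembled by stacking rows of the spectral differentiation matrices at the six faces; the signs here follow a fixed coordinate direction per axis, whereas an outward-normal convention would flip the sign on three of the faces.

import numpy as np

def dtn_map(p, kappa=0.0):
    S, A, B, Iint, pts, (D1, D2, D3) = leaf_operators(p, kappa)
    faces = [(0, D1, -1.0), (0, D1, +1.0),    # left / right  (x1 = -1, +1)
             (1, D2, -1.0), (1, D2, +1.0),    # down / up     (x2 = -1, +1)
             (2, D3, -1.0), (2, D3, +1.0)]    # back / front  (x3 = -1, +1)
    # Rows of N: the appropriate differentiation matrix restricted to each face.
    N = np.vstack([Dk[np.isclose(pts[:, ax], side)] for ax, Dk, side in faces])
    # T = N[:, [B, I]] [Id; S] as in (5); rows are ordered face by face,
    # columns follow the grid ordering of the boundary indices B.
    T = N[:, B] + N[:, Iint] @ S
    return T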

Borrowing terminology from Figure 1, let $\mathbf{u}_{1}, \mathbf{u}_{2}, \mathbf{u}_{3}, \mathbf{u}_{4}, \mathbf{u}_{5}$ denote the various subsets of $\mathbf{u}$ on the discretized $\Omega$. Let $\mathbf{v}_{3}, \mathbf{v}_{4}, \mathbf{v}_{5}$ denote the Neumann derivatives on the same points as $\mathbf{u}_{3}, \mathbf{u}_{4}, \mathbf{u}_{5}$, and let $\mathbf{T}^{(1)}, \mathbf{T}^{(2)}$ denote the DtNs on $\Omega^{(1)}, \Omega^{(2)}$. Then, with a reindexing of some rows and columns of the DtNs,

\begin{bmatrix}\mathbf{v}_{3}\\ \mathbf{v}_{5}\end{bmatrix} = \mathbf{T}^{(1)}\begin{bmatrix}\mathbf{u}_{3}\\ \mathbf{u}_{5}\end{bmatrix}, \qquad \begin{bmatrix}\mathbf{v}_{4}\\ \mathbf{v}_{5}\end{bmatrix} = \mathbf{T}^{(2)}\begin{bmatrix}\mathbf{u}_{4}\\ \mathbf{u}_{5}\end{bmatrix}.   (6)

By requiring that the two expressions for $\mathbf{v}_{5}$ (the Neumann derivative on the shared face) agree, we get

\mathbf{v}_{5} = \mathbf{v}_{5}
\implies \mathbf{T}_{5,:}^{(1)}\begin{bmatrix}\mathbf{u}_{3}\\ \mathbf{u}_{5}\end{bmatrix} = \mathbf{T}_{5,:}^{(2)}\begin{bmatrix}\mathbf{u}_{4}\\ \mathbf{u}_{5}\end{bmatrix}
\implies \mathbf{T}_{5,3}^{(1)}\mathbf{u}_{3} + \mathbf{T}_{5,5}^{(1)}\mathbf{u}_{5} = \mathbf{T}_{5,4}^{(2)}\mathbf{u}_{4} + \mathbf{T}_{5,5}^{(2)}\mathbf{u}_{5}
\implies \left(\mathbf{T}_{5,5}^{(1)} - \mathbf{T}_{5,5}^{(2)}\right)\mathbf{u}_{5} = -\mathbf{T}_{5,3}^{(1)}\mathbf{u}_{3} + \mathbf{T}_{5,4}^{(2)}\mathbf{u}_{4}
\implies \mathbf{u}_{5} = \left(\mathbf{T}_{5,5}^{(1)} - \mathbf{T}_{5,5}^{(2)}\right)^{-1}\begin{bmatrix}-\mathbf{T}_{5,3}^{(1)} & \mathbf{T}_{5,4}^{(2)}\end{bmatrix}\begin{bmatrix}\mathbf{u}_{3}\\ \mathbf{u}_{4}\end{bmatrix} = \mathbf{S}^{(3)}\begin{bmatrix}\mathbf{u}_{3}\\ \mathbf{u}_{4}\end{bmatrix}.   (7)

Thus we can compute the shared face $\mathbf{u}_{5}$ from $\mathbf{u}_{3}$ and $\mathbf{u}_{4}$, which are given by the Dirichlet boundary condition, and a new solution operator $\mathbf{S}^{(3)}$ computed entirely from submatrices of the DtNs. This assumes a homogeneous PDE; if the problem is inhomogeneous, additional terms involving the body load appear, as in (5). Once $\mathbf{u}_{5}$ is solved, we can treat it as Dirichlet data for $\Omega^{(1)}$ and $\Omega^{(2)}$, along with $\mathbf{u}_{3}$ and $\mathbf{u}_{4}$, to get $\mathbf{u}_{1}$ and $\mathbf{u}_{2}$.
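
The two-box merge of (7) then amounts to a few submatrix extractions and one dense solve. A minimal sketch for the homogeneous case, where the index arrays selecting the relevant faces inside each DtN map are assumed to be available and are our own names:

import numpy as np

def merge_two_boxes(T1, T2, idx3, idx5_in_T1, idx4, idx5_in_T2, u3, u4):
    # Submatrices of the two DtN maps appearing in (7).
    T55_1 = T1[np.ix_(idx5_in_T1, idx5_in_T1)]
    T55_2 = T2[np.ix_(idx5_in_T2, idx5_in_T2)]
    T53_1 = T1[np.ix_(idx5_in_T1, idx3)]
    T54_2 = T2[np.ix_(idx5_in_T2, idx4)]
    rhs = -T53_1 @ u3 + T54_2 @ u4              # right-hand side of (7)
    u5 = np.linalg.solve(T55_1 - T55_2, rhs)    # applies S^(3) implicitly
    return u5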

This divide-and-conquer approach has a few advantages over a global spectral method. The solution operators and DtN maps for each subdomain can be computed separately, providing an easy source of parallelism. While the merged solution operator $\mathbf{S}^{(3)}$ is still somewhat ill-conditioned, it is better conditioned than a global spectral matrix with the same domain size and degrees of freedom. In addition, nodes are more evenly spaced across the entire domain than in a global spectral method.

2.2.1 Merging with Corner and Edge Points

Since Chebyshev nodes include the ends of their interval, using them to discretize boxes produces nodes on the corners and edges. Corner nodes present a challenge in this method, since their Neumann derivative is ill-defined (in three dimensions this challenge extends to edge points as well, but for consistency we will also refer to these nodes as corner nodes). If our partial differential operator $A$ does not include mixed second-order derivatives (such as in the Poisson equation or constant-coefficient Helmholtz), then the corner nodes do not impact the values of interior domain points through the solution operator $\mathbf{S}$, due to the structure of spectral differentiation matrices, so we may simply drop them from our discretization: when merging boxes, we only consider the $(p-2)^{d-1}$ points on the interior of the shared face.

However, if $A$ does include mixed second-order terms, then the corner points impact interior points through $\mathbf{S}$, so excluding them would significantly reduce accuracy. To make their role in the Neumann derivative unambiguous, we instead interpolate the box face points to Gaussian nodes (which do not include the ends of their interval). We then define the DtN map $\mathbf{T}_{\text{interp}} = \mathbf{L}_{CtG}\mathbf{T}\mathbf{L}_{GtC}$, where $\mathbf{L}_{CtG}$ is the Chebyshev-to-Gaussian interpolation operator and $\mathbf{L}_{GtC}$ is the Gaussian-to-Chebyshev interpolation operator. Once the shared face is solved, we interpolate it back to Chebyshev nodes before applying the local solution operators. A projection operator is composed onto the Gaussian-to-Chebyshev interpolation matrix to preserve continuity at corner points.

We call $q$ the degree of the modified box exteriors. In the case of dropped corners, $q = p - 2$. We can choose $q$ when we use Gaussian interpolation, but $q = p - 2$ or $p - 1$ is generally most effective. This means that in practice the DtN maps $\mathbf{T}$ are of size $2dq^{d-1} \times 2dq^{d-1}$.
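
The paper does not spell out the interpolation formula, but one standard way to build the face interpolation operators $\mathbf{L}_{CtG}$ and $\mathbf{L}_{GtC}$ is from 1D polynomial Vandermonde matrices, combined with a Kronecker product for a 2D face. The following sketch is written under that assumption, using Gauss-Legendre points for the “Gaussian” nodes:

import numpy as np
from numpy.polynomial import chebyshev as C, legendre as L

def interp_matrix(x_from, x_to):
    # Map function values at nodes x_from to values of the interpolating polynomial at x_to.
    n = len(x_from)
    V_from = C.chebvander(x_from, n - 1)
    V_to = C.chebvander(x_to, n - 1)
    return V_to @ np.linalg.inv(V_from)

p, q = 10, 8
x_cheb = np.cos(np.pi * np.arange(p) / (p - 1))   # Chebyshev-Lobatto nodes (include endpoints)
x_gauss = L.leggauss(q)[0]                        # Gauss-Legendre nodes (interior only)
L_CtG_1d = interp_matrix(x_cheb, x_gauss)         # p values -> q values
L_GtC_1d = interp_matrix(x_gauss, x_cheb)         # q values -> p values
L_CtG_face = np.kron(L_CtG_1d, L_CtG_1d)          # acts on a p x p face, returns q x q data
L_GtC_face = np.kron(L_GtC_1d, L_GtC_1d)
# Per face, T_interp = L_CtG_face @ T_face @ L_GtC_face, as described in Section 2.2.1.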

2.3 Analogy to Sparse $LU$ Factorization

One interpretation of this merge operation is as an analogy to $LU$ factorization [21]. Consider Figure 1, simplified by treating the domain Dirichlet boundary terms $\mathbf{u}_{3}, \mathbf{u}_{4}$ as parts of $\mathbf{u}_{1}$ and $\mathbf{u}_{2}$, respectively. Then we can represent our numerical partial differential operator (PDO) $\mathbf{A}$ with

\mathbf{A}\mathbf{u} = \mathbf{f} \equiv \begin{bmatrix}\mathbf{A}_{11} & & \mathbf{A}_{15}\\ & \mathbf{A}_{22} & \mathbf{A}_{25}\\ \mathbf{A}_{51} & \mathbf{A}_{52} & \mathbf{A}_{55}\end{bmatrix}\begin{bmatrix}\mathbf{u}_{1}\\ \mathbf{u}_{2}\\ \mathbf{u}_{5}\end{bmatrix} = \begin{bmatrix}\mathbf{f}_{1}\\ \mathbf{f}_{2}\\ \mathbf{f}_{5}\end{bmatrix}.   (8)

Here $\mathbf{u}$ is the solution and $\mathbf{f}$ is the body load on the interior of $\Omega$, split into the two subdomains and the shared boundary. Because an elliptic PDO is a local operator, there is no interaction through $\mathbf{A}$ between $\mathbf{f}_{1}$ and $\mathbf{u}_{2}$, or between $\mathbf{f}_{2}$ and $\mathbf{u}_{1}$, so the corresponding submatrices of $\mathbf{A}$ are zero. However, both $\mathbf{u}_{1}$ and $\mathbf{u}_{2}$ are local to the shared boundary where $\mathbf{u}_{5}$ lies, so they both interact with the body load there, $\mathbf{f}_{5}$, through $\mathbf{A}$. To interpret the first two block rows, note that

\mathbf{A}_{11}\mathbf{u}_{1} + \mathbf{A}_{15}\mathbf{u}_{5} = \mathbf{f}_{1} \iff \mathbf{u}_{1} = -\mathbf{A}_{11}^{-1}\mathbf{A}_{15}\mathbf{u}_{5} + \mathbf{A}_{11}^{-1}\mathbf{f}_{1}.   (9)

This is similar to our solution operator formulation shown in (4), with the part of the subdomain boundary on $\partial\Omega$ now folded into $\mathbf{u}_{1}$. Next, let $\mathbf{\tilde{u}}_{1} = \mathbf{A}_{11}^{-1}\mathbf{f}_{1}$ and $\mathbf{\tilde{u}}_{2} = \mathbf{A}_{22}^{-1}\mathbf{f}_{2}$. If we supply these vectors in place of $\mathbf{u}_{1}, \mathbf{u}_{2}$ in (8) and set $\mathbf{u}_{5} = \mathbf{0}$, then the equation still holds with the same body loads $\mathbf{f}_{1}$ and $\mathbf{f}_{2}$. Thus $\mathbf{\tilde{u}}_{1}$ and $\mathbf{\tilde{u}}_{2}$ are the particular solutions to our PDE on the interiors of $\Omega^{(1)}$ and $\Omega^{(2)}$ with $\mathbf{u}_{5} = \mathbf{0}$. Now consider an upper triangular matrix $\mathbf{U}$ that satisfies

\underbracket{\begin{bmatrix}\mathbf{I} & & \mathbf{A}_{11}^{-1}\mathbf{A}_{15}\\ & \mathbf{I} & \mathbf{A}_{22}^{-1}\mathbf{A}_{25}\\ & & \mathbf{I}\end{bmatrix}}_{\mathbf{U}}\begin{bmatrix}\mathbf{u}_{1}\\ \mathbf{u}_{2}\\ \mathbf{u}_{5}\end{bmatrix} = \begin{bmatrix}\mathbf{\tilde{u}}_{1}\\ \mathbf{\tilde{u}}_{2}\\ \mathbf{u}_{5}\end{bmatrix}.   (10)

This linear system can be confirmed with (9). $\mathbf{U}$ decouples our solutions $\mathbf{u}_{1}, \mathbf{u}_{2}$ into the particular solutions $\mathbf{\tilde{u}}_{1}, \mathbf{\tilde{u}}_{2}$, which are derived from our global Dirichlet BC without the subdomain-only (shared face) Dirichlet BC, and $\mathbf{u}_{5}$, which is that subdomain-only boundary condition. Thus $\mathbf{U}^{-1}$ collects both components of our solutions: in effect it represents the end of solving the merged system, where we use $\mathbf{u}_{3}$ and $\mathbf{u}_{5}$ to get $\mathbf{u}_{1}$, and $\mathbf{u}_{4}$ and $\mathbf{u}_{5}$ to get $\mathbf{u}_{2}$. Next, let $\mathbf{L}$ be a lower triangular matrix defined such that

\underbracket{\begin{bmatrix}\mathbf{I} & & \\ & \mathbf{I} & \\ \mathbf{A}_{51}\mathbf{A}_{11}^{-1} & \mathbf{A}_{52}\mathbf{A}_{22}^{-1} & \mathbf{I}\end{bmatrix}}_{\mathbf{L}}\begin{bmatrix}\mathbf{f}_{1}\\ \mathbf{f}_{2}\\ \underbracket{\mathbf{f}_{5} - \mathbf{A}_{51}\mathbf{A}_{11}^{-1}\mathbf{f}_{1} - \mathbf{A}_{52}\mathbf{A}_{22}^{-1}\mathbf{f}_{2}}_{\mathbf{\tilde{f}}_{5}}\end{bmatrix} = \begin{bmatrix}\mathbf{f}_{1}\\ \mathbf{f}_{2}\\ \mathbf{f}_{5}\end{bmatrix}.   (11)

To interpret this, note that

\mathbf{\tilde{f}}_{5} = \mathbf{f}_{5} - \mathbf{A}_{51}\mathbf{\tilde{u}}_{1} - \mathbf{A}_{52}\mathbf{\tilde{u}}_{2} \iff \mathbf{f}_{5} = \begin{bmatrix}\mathbf{A}_{51} & \mathbf{A}_{52}\end{bmatrix}\begin{bmatrix}\mathbf{\tilde{u}}_{1}\\ \mathbf{\tilde{u}}_{2}\end{bmatrix} + \mathbf{\tilde{f}}_{5}.   (12)

This bears some resemblance to (7), noting that $\mathbf{\tilde{u}}_{1}$ encodes the impact of the Dirichlet data $\mathbf{u}_{3}$ without $\mathbf{u}_{5}$ (and likewise $\mathbf{\tilde{u}}_{2}$ encodes $\mathbf{u}_{4}$ without $\mathbf{u}_{5}$). If an additional operator resembling a Neumann derivative approximation (following an inverse of the PDO) were composed onto $[\mathbf{A}_{51}\,|\,\mathbf{A}_{52}]$, then it would closely resemble $[-\mathbf{T}_{53}^{(1)}\,|\,\mathbf{T}_{54}^{(2)}]$ from (7). Another way to view (12) is that it removes the loads induced by the decoupled $\mathbf{\tilde{u}}_{1}, \mathbf{\tilde{u}}_{2}$ on the two subdomains from $\mathbf{f}_{5}$, leaving another decoupled term $\mathbf{\tilde{f}}_{5}$. By producing $\mathbf{\tilde{f}}_{5}$ from $\mathbf{f}$, $\mathbf{L}^{-1}$ lets us formulate a reduced problem that solves for $\mathbf{u}_{5}$ given $\mathbf{\tilde{f}}_{5}$; it performs the first part of the merge operation. To define this reduced problem, consider a block diagonal matrix $\mathbf{D}$ where

\underbracket{\begin{bmatrix}\mathbf{A}_{11} & & \\ & \mathbf{A}_{22} & \\ & & \mathbf{T}_{55}\end{bmatrix}}_{\mathbf{D}}\begin{bmatrix}\mathbf{\tilde{u}}_{1}\\ \mathbf{\tilde{u}}_{2}\\ \mathbf{u}_{5}\end{bmatrix} = \begin{bmatrix}\mathbf{f}_{1}\\ \mathbf{f}_{2}\\ \mathbf{\tilde{f}}_{5}\end{bmatrix}.   (13)

This now represents a fully decoupled system for the particular solutions $\mathbf{\tilde{u}}_{1}, \mathbf{\tilde{u}}_{2}$ and the shared boundary $\mathbf{u}_{5}$. However, we need to find a submatrix $\mathbf{T}_{55}$ that satisfies this system. Such a $\mathbf{T}_{55}$ can be found as

\mathbf{\tilde{f}}_{5} = \mathbf{f}_{5} - \mathbf{A}_{51}\mathbf{\tilde{u}}_{1} - \mathbf{A}_{52}\mathbf{\tilde{u}}_{2}
= \mathbf{f}_{5} - \mathbf{A}_{51}(\mathbf{u}_{1} + \mathbf{A}_{11}^{-1}\mathbf{A}_{15}\mathbf{u}_{5}) - \mathbf{A}_{52}(\mathbf{u}_{2} + \mathbf{A}_{22}^{-1}\mathbf{A}_{25}\mathbf{u}_{5})
= (\mathbf{f}_{5} - \mathbf{A}_{51}\mathbf{u}_{1} - \mathbf{A}_{52}\mathbf{u}_{2}) - \mathbf{A}_{51}\mathbf{A}_{11}^{-1}\mathbf{A}_{15}\mathbf{u}_{5} - \mathbf{A}_{52}\mathbf{A}_{22}^{-1}\mathbf{A}_{25}\mathbf{u}_{5}
= \mathbf{A}_{55}\mathbf{u}_{5} - \mathbf{A}_{51}\mathbf{A}_{11}^{-1}\mathbf{A}_{15}\mathbf{u}_{5} - \mathbf{A}_{52}\mathbf{A}_{22}^{-1}\mathbf{A}_{25}\mathbf{u}_{5}
= (\mathbf{A}_{55} - \mathbf{A}_{51}\mathbf{A}_{11}^{-1}\mathbf{A}_{15} - \mathbf{A}_{52}\mathbf{A}_{22}^{-1}\mathbf{A}_{25})\mathbf{u}_{5}
= \mathbf{T}_{55}\mathbf{u}_{5}.   (14)

This formulation for $\mathbf{T}_{55}$ includes the effect of $\mathbf{u}_{5}$ on its body load ($\mathbf{A}_{55}$), as well as terms that resemble the solution operators in (4) acting on $\mathbf{u}_{5}$ only ($\mathbf{A}_{11}^{-1}\mathbf{A}_{15}$, $\mathbf{A}_{22}^{-1}\mathbf{A}_{25}$). If we combine it with (12), then we see

\mathbf{u}_{5} = -\mathbf{T}_{55}^{-1}\begin{bmatrix}\mathbf{A}_{51} & \mathbf{A}_{52}\end{bmatrix}\begin{bmatrix}\mathbf{\tilde{u}}_{1}\\ \mathbf{\tilde{u}}_{2}\end{bmatrix} + \mathbf{T}_{55}^{-1}\mathbf{f}_{5}.   (15)

This more closely resembles (7). Lastly, by combining the matrices together,

\mathbf{L}\mathbf{D}\mathbf{U}\mathbf{u} = \mathbf{L}\mathbf{D}\begin{bmatrix}\mathbf{\tilde{u}}_{1}\\ \mathbf{\tilde{u}}_{2}\\ \mathbf{u}_{5}\end{bmatrix} = \mathbf{L}\begin{bmatrix}\mathbf{f}_{1}\\ \mathbf{f}_{2}\\ \mathbf{\tilde{f}}_{5}\end{bmatrix} = \mathbf{f}.   (16)

Thus $\mathbf{L}\mathbf{D}\mathbf{U}$ is an LDU factorization of $\mathbf{A}$, with $(\mathbf{L}\mathbf{D})^{-1}$ resembling the merge operations we derive via DtN maps, and $\mathbf{U}^{-1}$ resembling the interior solve for $\mathbf{u}_{1}, \mathbf{u}_{2}$ once we obtain the shared boundary data. This conceptualization will prove useful in justifying our algorithm for HPS.
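
The block identities (8)-(16) are easy to check numerically. The following sketch builds the block matrix of (8) with random blocks of arbitrary sizes, forms $\mathbf{T}_{55}$ as the Schur complement of (14), and verifies that $\mathbf{L}\mathbf{D}\mathbf{U}$ reproduces $\mathbf{A}$:

import numpy as np

rng = np.random.default_rng(0)
n1, n2, n5 = 6, 6, 4                       # arbitrary block sizes for the check
A11, A22, A55 = rng.standard_normal((n1, n1)), rng.standard_normal((n2, n2)), rng.standard_normal((n5, n5))
A15, A25 = rng.standard_normal((n1, n5)), rng.standard_normal((n2, n5))
A51, A52 = rng.standard_normal((n5, n1)), rng.standard_normal((n5, n2))

Z = np.zeros
A = np.block([[A11, Z((n1, n2)), A15],
              [Z((n2, n1)), A22, A25],
              [A51, A52, A55]])
# Schur complement of equation (14).
T55 = A55 - A51 @ np.linalg.solve(A11, A15) - A52 @ np.linalg.solve(A22, A25)
L = np.block([[np.eye(n1), Z((n1, n2)), Z((n1, n5))],
              [Z((n2, n1)), np.eye(n2), Z((n2, n5))],
              [A51 @ np.linalg.inv(A11), A52 @ np.linalg.inv(A22), np.eye(n5)]])
U = np.block([[np.eye(n1), Z((n1, n2)), np.linalg.solve(A11, A15)],
              [Z((n2, n1)), np.eye(n2), np.linalg.solve(A22, A25)],
              [Z((n5, n1)), Z((n5, n2)), np.eye(n5)]])
D = np.block([[A11, Z((n1, n2)), Z((n1, n5))],
              [Z((n2, n1)), A22, Z((n2, n5))],
              [Z((n5, n1)), Z((n5, n2)), T55]])
print(np.allclose(L @ D @ U, A))           # True: LDU reproduces the block matrix in (8)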

3 Algorithm Description

In the previous section we discussed how to merge two subdomains of a boundary value problem together using DtN maps before solving each subdomain with a localized spectral method. Here we briefly explain how the original Hierarchical Poincaré-Steklov method extends this concept to more than two subdomains, before introducing the contributions of this work.

3.1 Hierarchical Poincaré-Steklov: Traditional Solver

Figure 2: A hierarchical breakdown of subdomains. Panels (a)-(d) show levels $\ell = 1, 2, 3, 4$.

Given an elliptic PDE such as (1) with a rectangular domain $\Omega$, we can hierarchically partition $\Omega$ into a series of rectangular “box” subdomains, illustrated in Figure 2. Note that the boxes in each level are children of boxes in the previous level, with each box in level $\ell$ having 2 children in level $\ell + 1$, until we reach the final level of “leaf” boxes.

The HPS algorithm is split into two stages: the “build” stage and the “solve” stage. In the build stage, we first construct solution and DtN maps for every leaf box in the final level of our discretization ($\ell = 4$ in Fig. 2). We then proceed in an upward pass, constructing solution maps for each box in the next level up using the leaf DtNs, following (7). These solution maps solve for the boundaries between paired leaf boxes (boxes 8 and 9, 10 and 11, 12 and 13, and 14 and 15 in Fig. 2). We then recursively construct solution and DtN maps for each level of the discretization, until we get a solution operator for $\ell = 1$, which solves for one large face in the middle of the domain (between boxes 2 and 3 in Fig. 2). Once all solution operators are constructed, we proceed with the solve stage, which follows a downward pass: we use the Dirichlet boundary condition and the $\ell = 1$ solution operator to solve for the middle boundary, then proceed down, using each level’s solution operators to solve for subsequent box faces, until every leaf box has all faces solved. Then we solve for the leaf interiors using their local solution operators.

The method’s upward and downward passes recursively extend the formulation and use of the solution operators we derived in Section 2.2. Similarly, they also resemble the LDU factorization described in Section 2.3. We can imagine the upward pass as similar to $\mathbf{L}^{-1}$ described there, with both encoding merge operations that reduce the body loads of the problem onto shared boundaries. $\mathbf{D}^{-1}$ then resembles formulating the final components of the merged solution operators and solving for the shared face points; it combines elements of the upward and downward passes. Lastly, $\mathbf{U}^{-1}$ is the end of the downward pass, obtaining the total solution. Given the similarities, it is not surprising that the classical algorithm for obtaining the DtN and solution operators of traditional HPS resembles multifrontal methods for sparse LU factorization [22, 23].

The original HPS method has a few significant drawbacks. The solution and DtN maps can be formed in parallel within each level, and prior work has used shared-memory models or MPI to accelerate this component [24, 25]. However, across levels these operators must be constructed recursively. This forces a large serial component onto the algorithm. In addition, the method on a discretization with $L$ levels requires the storage of $2^{L} - 1$ dense solution operators, which can be large, especially for 3D problems. One can choose to recompute certain levels during the downward pass to reduce storage cost, but this greatly increases the computation time as a trade-off. In addition, higher-level DtN maps tend to become increasingly ill-conditioned in this method, which can cause numerical challenges for the method’s stability (formulations using other merge operators such as impedance maps may be better conditioned [26]). Since each solution operator is applied separately in the downward pass, techniques such as pivoting are limited to improving the conditioning of one solution operator at a time; this is one way HPS differs significantly from sparse LU algorithms, despite the other similarities. Our contributions mitigate these problems with the original HPS method.

3.2 Two-Level HPS

Figure 3: The build stage of HPS resembles methods for static condensation, reducing the problem to one on subdomain boundaries.

In this section we discuss the contribution of our work: a direct solver for elliptic PDEs in 2D or 3D that utilizes the HPS method in a two-level framework. Instead of constructing higher-level solution and DtN operators, we use the leaf DtN maps to formulate a system that solves for all leaf boundaries concurrently. We use this system to acquire the leaf boundaries, then apply the leaf solution operators in parallel to solve for the leaf interiors. With this approach we use the HPS discretization to reduce the problem in a way analogous to static condensation, visualized in Figure 3. Similar to traditional HPS, our method naturally divides into a build stage and a solve stage.

3.2.1 Build Stage

Figure 4: Diagram of the discretization points for a problem with $2^{3}$ subdomains, with $p = 5$. Points on subdomain interiors are blue and are solved for using spectral differentiation. Points on exteriors are red, and are solved for by enforcing continuity of the Neumann derivative.

Figure 5: Diagram of the box faces to solve for a problem discretized with $2 \times 2 \times 2$ boxes (a: box boundary points, with the faces to solve highlighted in various colors), and the resulting sparsity pattern (b). A total of 12 box faces do not lie on the domain boundary $\partial\Omega$, and must be solved in a sparse system which has 12 blocks along the main diagonal and 4 additional blocks in each block row (corresponding to adjacent faces that do not lie on $\partial\Omega$).

As in previous sections, suppose we have an elliptic PDE with a Dirichlet boundary condition on a rectangular domain $\Omega$, in this case partitioned into a number of non-overlapping rectangular subdomains $\Omega^{(i)}$, each discretized with Chebyshev nodes. An example with 8 such boxes can be seen in Figure 4, similar to level $\ell = 4$ in Figure 2. We know the value of the numerical solution $\mathbf{u}$ at every node on $\partial\Omega$ due to the Dirichlet boundary condition; we call this Dirichlet data $\mathbf{u}_{D}$. However, there are a number of box faces that do not lie on $\partial\Omega$ (see Figure 5(a), where there are 12 such faces). These must be solved for before applying the leaf solution operators on each box to obtain $\mathbf{u}$ on the box interiors.

To solve for the shared faces we enforce continuity of the Neumann derivative using Dirichlet-to-Neumann maps as described in Section 2.2. However, each shared face depends on other shared faces, since DtN maps require the entire box boundary data as input (unlike traditional HPS, where some box faces are solved earlier by different-level solution operators). So we must encode all shared faces in one coupled system. We construct this system $\mathbf{T}_{\rm sp}$ using the DtN maps $\mathbf{T}^{(i)}$ and the relation between two boxes that share a boundary. Recall that each $\mathbf{T}^{(i)}$ is of size $2dq^{d-1} \times 2dq^{d-1}$, where $q \approx p - 2$ is either the size of a box face with corner points dropped, or the number of Gaussian nodes to which we interpolate (see Section 2.2.1). Thus the contribution of one box face to another face’s Neumann derivative via the DtN is a submatrix of $\mathbf{T}^{(i)}$ of size $q^{d-1} \times q^{d-1}$. We index our discretization points by box face so we can represent $\mathbf{T}_{\rm sp}$ as a block matrix consisting of these submatrices. Referring back to (7) and Figure 1 in Section 2.2, each block row of $\mathbf{T}_{\rm sp}$ (encoding the shared face between leaf boxes $\Omega^{(1)}$ and $\Omega^{(2)}$) looks like

\left(\mathbf{T}_{5,5}^{(1)} - \mathbf{T}_{5,5}^{(2)}\right)\mathbf{u}_{5} + \mathbf{T}_{5,3}^{(1)}\mathbf{u}_{3} - \mathbf{T}_{5,4}^{(2)}\mathbf{u}_{4} = \mathbf{0}_{q^{d-1}}
\iff \begin{bmatrix}\dots & \mathbf{0} & \mathbf{T}_{5,3}^{(1)} & \left(\mathbf{T}_{5,5}^{(1)} - \mathbf{T}_{5,5}^{(2)}\right) & -\mathbf{T}_{5,4}^{(2)} & \mathbf{0} & \dots\end{bmatrix}\begin{bmatrix}(\text{rest of }\mathbf{u})\\ \vdots\\ \mathbf{u}_{3}\\ \mathbf{u}_{5}\\ \mathbf{u}_{4}\\ \vdots\\ (\text{rest of }\mathbf{u})\end{bmatrix} = \mathbf{0}_{q^{d-1}}.   (17)

Recall that $\mathbf{u}_{5}$ is $\mathbf{u}$ on the shared face between $\Omega^{(1)}$ and $\Omega^{(2)}$, $\mathbf{u}_{3}$ is $\mathbf{u}$ on the other $2d - 1$ faces of $\Omega^{(1)}$, and $\mathbf{u}_{4}$ is $\mathbf{u}$ on the other $2d - 1$ faces of $\Omega^{(2)}$. Thus the block on the main diagonal, $\mathbf{T}_{5,5}^{(1)} - \mathbf{T}_{5,5}^{(2)}$, is of size $q^{d-1} \times q^{d-1}$, and the off-diagonal blocks from $\mathbf{T}_{5,3}^{(1)}$ and $-\mathbf{T}_{5,4}^{(2)}$ are each of size $q^{d-1} \times (2d-1)q^{d-1}$. This means each block row of $\mathbf{T}_{\rm sp}$ has at most $q^{d-1} \times (4d-1)q^{d-1}$ nonzero entries arranged into $q^{d-1} \times q^{d-1}$ blocks (for some rows it is fewer, since portions of $\mathbf{u}_{3}, \mathbf{u}_{4}$ can lie on $\partial\Omega$ and be encoded in $\mathbf{u}_{D}$ instead), so $\mathbf{T}_{\rm sp}$ has a distinctive block-sparse structure. This sparse structure can be seen for a problem with $2^{3}$ subdomains in Figure 5(b), and for a larger problem with $5^{3}$ subdomains in Figure 6. Note that the number of nonzero blocks in each block row stays bounded regardless of the total number of subdomains.

Once $\mathbf{T}_{\rm sp}$ is obtained, we may factorize it using a multilevel sparse solver such as MUMPS [27, 28]. An outline of the construction and factorization of $\mathbf{T}_{\rm sp}$ is shown in Algorithm 1. First we construct the DtN maps $\mathbf{T}^{(i)}$, potentially on GPUs. Then we use submatrices of these DtN maps to formulate $\mathbf{T}_{\rm sp}$, which we then factorize.

Algorithm 1 Build Stage
1: domain $\Omega$, discretized box partition $\{\Omega^{(i)}\}_{i=1}^{n}$, polynomial order $p$.
2: $\{\mathbf{T}^{(i)}\} \leftarrow$ DtN maps from Alg. 3.
3: $\{E^{(j)}\} \leftarrow$ set of box faces not on $\partial\Omega$.
4: $N_{\text{face}} \leftarrow p^{2} \times \#(\{E^{(j)}\})$ ($\#$ of disc. points on faces).
5: $\mathbf{T}_{\rm sp} \leftarrow$ sparse $\mathbf{0}^{N_{\text{face}} \times N_{\text{face}}}$.
6: for all $E^{(j)}$ in $\{E^{(j)}\}$ do
7:     $\Omega^{(i_{1})}, \Omega^{(i_{2})} \leftarrow$ the two boxes that share $E^{(j)}$.
8:     for $k = 1, 2$ do
9:         $F \leftarrow$ set of faces on $\partial\Omega^{(i_{k})} \setminus E^{(j)}$.
10:         $\mathsf{E} \leftarrow$ indices for $E^{(j)}$ local to their matrix. $\triangleright$ differ for $\mathbf{T}_{\rm sp}, \mathbf{T}^{(i_{k})}$
11:         $\mathsf{F} \leftarrow$ indices for $F$ local to their matrix.
12:         $[\mathbf{T}_{\rm sp}]_{\mathsf{E},\mathsf{E}}$ += $\mathbf{T}^{(i_{k})}_{\mathsf{E},\mathsf{E}}$. $\triangleright$ block main diagonal
13:         $[\mathbf{T}_{\rm sp}]_{\mathsf{E},\mathsf{F}}$ += $\mathbf{T}^{(i_{k})}_{\mathsf{E},\mathsf{F}}$. $\triangleright$ block off-diagonal
14:     end for
15: end for
16: $\mathbf{LU} \leftarrow$ factorization of $\mathbf{T}_{\rm sp}$ for use as a solver.
17: return $\mathbf{LU}$.
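
As a rough illustration of Algorithm 1, the sketch below assembles $\mathbf{T}_{\rm sp}$ in SciPy and factorizes it with a sparse LU (the paper uses MUMPS). The data structures shared_faces, T, face_rows, and faces_of_box are hypothetical stand-ins for the bookkeeping described in the text, and the explicit signs follow the convention of (7) rather than the outward-normal convention implied by the "+=" updates in Algorithm 1.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def assemble_Tsp(shared_faces, T, face_rows, faces_of_box, n_face_pts):
    # shared_faces: list of (face_id, box1, box2); T: dict box -> DtN map;
    # face_rows: (box, face_id) -> local row/col indices of that face inside T[box];
    # faces_of_box: box -> list of its face ids; n_face_pts: points per face (q^{d-1}).
    glob = {fid: k * n_face_pts for k, (fid, _, _) in enumerate(shared_faces)}
    n = len(shared_faces) * n_face_pts
    rows, cols, vals = [], [], []
    def add_block(r0, c0, block):
        rr, cc = np.meshgrid(np.arange(block.shape[0]), np.arange(block.shape[1]), indexing="ij")
        rows.append((r0 + rr).ravel()); cols.append((c0 + cc).ravel()); vals.append(block.ravel())
    for fid, b1, b2 in shared_faces:
        r0 = glob[fid]
        for box, sign in ((b1, +1.0), (b2, -1.0)):           # sign convention of eq. (7)
            E = face_rows[(box, fid)]
            add_block(r0, r0, sign * T[box][np.ix_(E, E)])   # block main diagonal
            for gid in faces_of_box[box]:
                if gid != fid and gid in glob:               # faces on the domain boundary go to the RHS
                    F = face_rows[(box, gid)]
                    add_block(r0, glob[gid], sign * T[box][np.ix_(E, F)])
    Tsp = sp.coo_matrix((np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
                        shape=(n, n)).tocsc()
    return splu(Tsp)                                         # sparse LU factorization of T_sp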

3.2.2 Solve Stage

Figure 6: The sparsity pattern for a problem discretized with $5 \times 5 \times 5$ boxes. Unlike the $2^{3}$ case shown in Figure 5, here some faces have up to 10 faces on the adjacent boxes that are not on the domain boundary $\partial\Omega$.

Once the factorized sparse system is obtained, we may use it as a direct solver for multiple variations of the Dirichlet data $\mathbf{u}_{D}$ and body load $\mathbf{f}$ in the same boundary value problem. This is because the DtN maps $\mathbf{T}^{(i)}$, and thus $\mathbf{T}_{\rm sp}$, depend only on the problem’s partial differential operator. We find $\mathbf{u}$ on the shared faces of our discretization by solving the sparse system

\mathbf{T}_{\rm sp}\mathbf{u}_{I_{x}} = \mathbf{b}\left(\mathbf{u}_{D}, \mathbf{f}\right) \quad \text{where } \mathbf{u}_{I_{x}} \approx u\big|_{\bigcup_{i}\partial\Omega^{(i)} \setminus \partial\Omega}.   (18)

Here $\mathbf{b}$ is a right-hand side specified by the Dirichlet data and body load. To solve for the box interior points (blue in Figure 4), we use the solution operators $\mathbf{S}^{(i)}$ shown in (4). These operators enforce the PDE on the box interiors given the now-complete box boundary data. Each matrix $\mathbf{S}^{(i)}$ is dense, but they can be formed and applied to box boundary data independently. Thus they are well-suited to batched linear algebra routines on GPUs, such as those offered in PyTorch [29] (the solution operators are also needed to construct the DtN operators $\mathbf{T}^{(i)}$, but in practice it is often preferable to recompute them and avoid transferring memory between CPU and GPU). The solve stage is detailed in Algorithm 2.

Algorithm 2 Solve Stage
1: domain $\Omega$, discretized box partition $\{\Omega^{(i)}\}_{i=1}^{n}$, factorized boundary solve system $\mathbf{LU}$, Dirichlet data $\mathbf{u}_{D}$, body load $\mathbf{f}$, number of discretization points $N$.
2: $\mathbf{u} \leftarrow \mathbf{0}^{N}$.
3: $\mathbf{b} \leftarrow$ right-hand side, derived from $\mathbf{u}_{D}$ and $\mathbf{f}$.
4: $\mathbf{u}[\text{indices on }\partial\Omega] = \mathbf{u}_{D}$.
5: $\mathbf{u}[\text{indices on box boundaries not on }\partial\Omega] = \mathbf{LU}(\mathbf{b})$.
6: $\{\mathbf{S}^{(i)}\} \leftarrow$ from Alg. 3.
7: for $i = 1, \dots, n$ do
8:     $\mathsf{c}_{i} \leftarrow$ indices for points in the interior of $\Omega^{(i)}$.
9:     $\mathsf{x}_{i} \leftarrow$ indices for points on $\partial\Omega^{(i)}$.
10:     $\mathbf{u}[\mathsf{c}_{i}] = \mathbf{S}^{(i)}\mathbf{u}[\mathsf{x}_{i}]$.
11: end for
12: return $\mathbf{u}$.
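
A compact sketch of Algorithm 2 for the homogeneous case follows (the body-load contribution $\mathbf{A}_{I,I}^{-1}\mathbf{f}_{I}$ of (4) is omitted for brevity). The factorization object lu, the index maps, and the per-box solution operators S[i] are assumed to come from the build stage, and all names are our own.

import numpy as np

def solve_stage(lu, b, u_dirichlet, bdry_idx, shared_idx, S, box_int, box_bdry, N):
    u = np.zeros(N)
    u[bdry_idx] = u_dirichlet        # known Dirichlet data on the domain boundary
    u[shared_idx] = lu.solve(b)      # interior box faces from the sparse system (18)
    for i in range(len(S)):          # leaf interiors; a batched GPU version applies all S[i] at once
        u[box_int[i]] = S[i] @ u[box_bdry[i]]
    return u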

3.2.3 Advantages of the Two-Level Framework

In our solver we still discretize $\Omega$ into a series of non-overlapping subdomains, but instead of constructing a hierarchical series of solution operators in an upward pass, we only form solution operators at the leaf level and use the leaf DtN maps to form a larger system that solves for all of the subdomain boundaries concurrently, unlike the original HPS methodology, which encodes different subdomain boundaries at each level. The resulting system is sparse, so we can factorize it using efficient sparse direct solvers such as those within the PETSc library [30].

Highly optimized multifrontal solvers such as MUMPS can use various pivoting techniques, such as partial threshold pivoting, relaxed pivoting [31], and tournament pivoting [32], to maximize the numerical stability of a matrix factorization like sparse LU [33]. This enables accurate direct solves with potentially ill-conditioned systems. In theory, such pivoting techniques could be used on the higher-level solution operators in traditional HPS to improve their stability (although these operators are dense, so sparse solvers could not be used). However, encoding the solutions for all shared faces into one sparse matrix operator $\mathbf{T}_{\rm sp}$ allows such pivoting techniques to be applied holistically to a much larger portion of the problem at once. This means the best possible pivoting can be applied, maximizing the factorization’s stability while preserving an optimal sparsity pattern. In effect we outsource the multilevel component of HPS to an optimized solver better equipped to handle its conditioning. Previous work has found other effective means of representing the HPS discretization as one large system, but required iterative techniques to solve it [34]. Our approach preserves the original strength of HPS as a direct method.

In addition, the non-hierarchical formulation of $\mathbf{T}_{\rm sp}$, which relies only on leaf DtN maps and not on higher-level DtN or solution operators, increases the method’s uniformity and parallelism. All DtN maps can be constructed in parallel; they are dense matrices, but that makes them well-suited to formation on GPUs, if available, via batched linear algebra routines [35]. Similarly, the computation of the leaf solution operators in the solve stage may also be done in batch on GPUs. This allows us to use modern hardware to greatly improve the computational performance of the HPS method.

Algorithm 3 Form DtN Maps
1: domain $\Omega$, discretized box partition $\{\Omega^{(i)}\}_{i=1}^{n}$, box local interior indices $I_{loc}$, box local boundary indices $B_{loc}$, Neumann operator $\mathbf{N}$.
2: for $i = 1, \dots, n$ do
3:     $\mathbf{A}^{(i)} \leftarrow$ spectral differentiation matrix on $\Omega^{(i)}$.
4:     $\mathbf{S}^{(i)} \leftarrow -\left(\mathbf{A}^{(i)}_{I_{loc},I_{loc}}\right)^{-1}\mathbf{A}^{(i)}_{I_{loc},B_{loc}}$. $\triangleright$ solution maps
5:     $\mathbf{T}^{(i)} \leftarrow \mathbf{N}\begin{bmatrix}\mathbf{I}\\ \mathbf{S}^{(i)}\end{bmatrix}$. $\triangleright$ DtN maps
6: end for
7: return $\{\mathbf{S}^{(i)}\}, \{\mathbf{T}^{(i)}\}$.
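
The loop in Algorithm 3 maps naturally onto batched GPU routines. Below is a minimal PyTorch sketch of that idea; the paper reports using batched linear algebra of this kind, but the exact implementation and these tensor layouts are our own assumptions.

import torch

def batched_dtn(A_batch, N, B, I):
    # A_batch: (nboxes, p^3, p^3) stacked leaf discretizations; N: (nface_pts, p^3) Neumann operator;
    # B, I: 1-D LongTensors of boundary / interior indices shared by all leaves.
    AII = A_batch[:, I][:, :, I]              # (nboxes, nI, nI)
    AIB = A_batch[:, I][:, :, B]              # (nboxes, nI, nB)
    S = -torch.linalg.solve(AII, AIB)         # batched solution operators
    T = N[:, B].unsqueeze(0) + N[:, I] @ S    # batched DtN maps, as in Algorithm 3
    return S, T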

3.3 Asymptotic Cost

The build stage has an asymptotic cost of

O\left(p^{6}N\right) \text{ (forming DtNs)} + O\left(N^{2}\right) \text{ (sparse factorization).}

We formulate the DtNs with embarrassingly parallel dense matrix operations that are well-suited to GPUs using batched linear algebra routines. As $p$ increases, its effect on the DtN size and run time is substantial; this is because of the need to invert a large block of the spectral differentiation matrix $\mathbf{A}^{(i)}$, as shown in Algorithm 3. Nevertheless, we can still easily formulate DtNs in batch for $p \leq 15$. Even for larger $p$, where batched operations are impractical, formulating the DtNs in serial is relatively fast, as shown in our numerical results. In practice the steepest cost in the build stage comes from the factorization of the sparse system, which for a 3D problem using a multilevel solver is $O(N^{2})$. However, the sparse factorization is unaffected by the choice of $p$, and this cost is only incurred once per differential operator and domain discretization. Since the sparse system is relatively small (its dimension is less than $\frac{6N}{p} \times \frac{6N}{p}$, with $\approx 11p^{2}$ nonzero entries per row), it can be stored in memory and reused for multiple solves. For the solve stage, our asymptotic cost is

O\left(N^{4/3}\right) \text{ (sparse solve)} + O\left(p^{6}N\right) \text{ (interior box solves).}

If we store the solution operators in memory, then the interior box solves cost only $O(p^{2}N)$, but in practice it is typically preferable to reformulate the solution operators and avoid having to store and potentially transfer additional memory.
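
As a back-of-the-envelope illustration of these estimates (our own arithmetic, using the bounds quoted above), the following computes concrete sizes for a hypothetical configuration with $p = 10$ and $8^{3}$ boxes:

p, boxes_per_side = 10, 8
n_boxes = boxes_per_side ** 3
N = n_boxes * p ** 3                                        # total discretization points (ignoring shared faces)
dtn_entries = (6 * p ** 2) ** 2                             # dense entries in one DtN map
sparse_dim_bound = 6 * N // p                               # upper bound on the dimension of T_sp
nnz_estimate = sparse_dim_bound * 11 * p ** 2               # ~11 p^2 nonzeros per row
q = p - 2
n_shared_faces = 3 * boxes_per_side ** 2 * (boxes_per_side - 1)
exact_dim = n_shared_faces * q ** 2                         # unknowns actually solved in T_sp
print(N, dtn_entries, sparse_dim_bound, nnz_estimate, exact_dim)
# e.g. N = 512000, one DtN has 360000 entries, and T_sp has 86016 unknowns here.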

4 Numerical Experiments

Figure 7: (left) Accuracy of our solver for the Poisson equation. (right) Accuracy for the Helmholtz equation fixed to 10 points per wavelength. Both cases feature $h$-refinement, with each line marking a fixed $p$.

4.1 Accuracy results - Poisson and Helmholtz Equations

We investigate the accuracy of our HPS solver on the homogeneous Poisson equation,

\Delta u = 0 \quad \text{on } \Omega = [-1.1, 0.1] \times [1, 2] \times [1.2, 2.2],
u(\mathbf{x}) = \frac{1}{4\pi\|\mathbf{x}\|_{2}} \quad \text{on } \partial\Omega.   (19)

Here the Dirichlet boundary condition is chosen so that the exact solution is the Green’s function of the Laplace operator. We compute numerical solutions to this equation across a range of box sizes $h$ and polynomial orders $p$, comparing their relative $\ell^{2}$-norm error against the Green’s function. The accuracy results are shown in Figure 7. We observe convergence rates of $\approx h^{7}$ for $p = 6$ and $h^{12}$ for $p = 8$ until the relative error reaches $\approx 10^{-13}$, at which point it flattens (for larger $p$ the error reaches $10^{-13}$ before a convergence rate can be fairly estimated). Subsequent refinement does not generally improve accuracy, but it does not lead to an increase in error either. This is likely due to the HPS discretization being fairly well-conditioned, especially compared to global spectral methods.

To assess our method’s accuracy on oscillatory problems, we also evaluate it on the homogeneous Helmholtz equation with wavenumber $\kappa$, using its Green’s function as a manufactured solution,

\Delta u + \kappa^{2} u = 0 \quad \text{on } \Omega = [-1.1, 0.1] \times [1, 2] \times [1.2, 2.2],
u(\mathbf{x}) = \frac{\cos(\kappa|\mathbf{x}|)}{4\pi|\mathbf{x}|} \quad \text{on } \partial\Omega.   (20)

In Figure 8 we show relative $\ell^{2}$-norm errors for our solver on Equation (20) with fixed wavenumbers $\kappa = 16$ and $\kappa = 30$, which correspond to roughly 2.5 and 4.5 wavelengths across the domain, respectively. We see convergence comparable to the Poisson equation, though the lower bound on the relative error is somewhat higher for each problem ($\approx 10^{-11}$ for $\kappa = 16$ and $\approx 10^{-10}$ for $\kappa = 30$). However, increasing the resolution further does not cause the error to diverge. Along with the results for Equation (19), this suggests our method remains stable when problems are over-resolved. Figure 7 shows the relative error for the Helmholtz equation with a varying $\kappa$, set to provide $\approx 10$ discretization points per wavelength. We see that higher $p$ corresponds to better numerical accuracy, while refining $h$ does not improve the error (since the wavenumber is increased correspondingly), but the errors remain relatively stable. Since our method maintains accuracy while increasing degrees of freedom linearly with $\kappa$, it appears not to suffer from the pollution effect.

Figure 8: Accuracy of our solver for the Helmholtz equation with wavenumbers $\kappa = 16$ ($\approx 2.5$ wavelengths across the domain) and $\kappa = 30$ ($\approx 4.5$ wavelengths).

We then consider the gravity Helmholtz equation,

\Delta u + \kappa^{2}(1 - x_{3})u = -1 \quad \text{on } \Omega = [1.1, 2.1] \times [-1, 0] \times [-1.2, -0.2],
u(\mathbf{x}) = 0 \quad \text{on } \partial\Omega.   (21)

This equation adds a spatially varying component to the wavenumber based on $x_{3}$, analogous to the effect of a gravitational field, hence the name [36]. Since $1 < 1 - x_{3} < 2.2$, the variation across $\Omega$ is greater than in the constant-coefficient Helmholtz equation with the same $\kappa$. Unlike the previous Poisson and Helmholtz equations, equation (21) does not have a manufactured solution. It also produces relatively sharp gradients near the domain boundaries, since there is a zero Dirichlet boundary condition, and it has a coefficient that varies in the spatial coordinates (specifically $x_{3}$). We investigate our solver’s accuracy by computing the relative $\ell^{2}$-norm errors of our results against an over-resolved solution at select points. Figure 9 shows the solution of this problem for select $p$ and $h$ values, while Figure 10 shows these errors under $p$-refinement for a range of values of $h$. We see that our solver converges for each $h$.

Figure 9: Gravity Helmholtz equation as described in (21), computed with (a) $\approx 64000$ and (b) $\approx 185000$ discretization points. A cross-section has been taken through the $x$-axis. We see consistent results across different $p$ and $h$.

Figure 10: Convergence of our solver on the gravity Helmholtz equation with $\kappa = 12$, relative to a highly resolved solution. Here $h$ is fixed and the resolution is increased through $p$-refinement.

4.2 Speed and scaling

To evaluate the computational complexity of our HPS solver, we measure the execution times of its three components: the construction of the DtN operators used in the build stage, the factorization time for the sparse matrix constructed from the DtNs, and the solve time. The asymptotic cost of constructing the DtN operators is $O(p^{6}N)$, as shown in Section 3.3. Figure 11 illustrates the run times for the build stage when solving the Poisson and Helmholtz equations detailed in Section 4.1, using batched linear algebra routines on an Nvidia V100 GPU. As expected, we observe a linear increase in runtime with $h$-refinement (i.e. a higher $N$ with fixed $p$). The constant for the linear scaling in $h$, though, increases sharply with higher values of $p$.

A challenge in utilizing GPUs for DtN formation is memory storage. In 3D each DtN map is of size $\approx 6p^{2} \times 6p^{2}$. Computing them requires the inversion of the spectral differentiation operator on interior points, which is of size $\approx (p-2)^{3} \times (p-2)^{3}$. These matrices grow quickly as $p$ increases (at $p = 20$ they have $\approx 5.7$ million and $34$ million entries, respectively). For larger problem sizes it is impossible to store all DtN maps on a GPU concurrently. Interior differentiation submatrices do not have to be stored beyond forming their box’s DtN map, but they are much larger for high $p$ and present considerable overhead. To handle this memory concern we have implemented an aggressive two-level scheduler that stores a limited number of DtNs on the GPU at once, and forms a smaller number of DtNs in batch at a time. For $p \geq 16$ on a V100 GPU the best results occur when only one DtN is formed at a time. However, we may still store multiple DtNs on the GPU before transferring them collectively to the CPU.
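
A minimal sketch of this kind of chunked, buffered DtN formation is shown below; the paper's two-level scheduler is more elaborate, and the chunk and buffer sizes here are illustrative only. It reuses the hypothetical batched_dtn routine from the earlier PyTorch sketch.

import torch

def form_dtns_chunked(A_all, N, B, I, batch_size=4, buffer_size=32, device="cuda"):
    N, B, I = N.to(device), B.to(device), I.to(device)
    out, buffer, done = [], [], 0
    n_boxes = A_all.shape[0]
    while done < n_boxes:
        chunk = A_all[done:done + batch_size].to(device)     # form a few leaf DtNs at a time
        _, T = batched_dtn(chunk, N, B, I)
        buffer.append(T)
        done += chunk.shape[0]
        if sum(t.shape[0] for t in buffer) >= buffer_size or done == n_boxes:
            out.append(torch.cat(buffer).cpu())              # ship a full buffer back to the host
            buffer = []
    return torch.cat(out)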

Figure 11: Time to formulate the DtN maps and construct the sparse system.

In Figure 12 we show the time to factorize the resulting sparse matrix for the build stage using MUMPS. The expected asymptotic cost is $O(N^{2})$. However, our results more closely resemble a lower cost of $O(N^{3/2})$ for $N$ up to $\approx 8$ million. Overall, the factorization time dominates the build stage of our method, given its more costly scaling in $N$ and its reliance on CPUs through MUMPS. Lastly, in Figure 13 we show the runtime for the leaf interior solves. These follow a linear scaling similar to the batched DtN construction.

Figure 12: Time to factorize the sparse system using MUMPS.

Figure 13: Time to solve for interior values of the leaf nodes in parallel.

4.3 Curved and non-rectangular domains

(a) On a curved domain with a wavy edge, ≈ 2.6 × 10^5 discretization points.
(b) On an annulus, with ≈ 1.04 × 10^5 discretization points.
Figure 14: Helmholtz equation on two non-rectangular domains, both with a Dirichlet boundary condition of 1. A cross-section has been taken along the z-axis.
Figure 15: Accuracy of our solver for the Helmholtz equation on the curved domain shown in Fig. 14(a), with wavenumbers κ = 16 (≈ 2.5 wavelengths across the domain) and κ = 30 (≈ 4.5 wavelengths).

The HPS method has been extended to other problems such as surface PDEs [37], inverse scattering problems [38], and more generally to non-rectangular domains. We can apply our HPS solver to non-rectangular geometries through an analytic parameterization between the domain we wish to model, such as the domain with a sinusoidal edge along the y-axis shown in Figure 14(a), and a rectangular reference domain. These parameter maps extend the versatility of the solver, but they present several implementation challenges: they require variable coefficients in the differential operators; they may introduce mixed second-order terms, i.e. \frac{\partial^{2}u}{\partial x_{i}\partial x_{j}} with i ≠ j; and, depending on the domain, they may produce near-singular regions, such as the neighborhood of the sharp point (x, y) ≈ (2, 0) in Figure 14(a). We investigate the accuracy of our solver on this sinusoidal domain for the Helmholtz equation in Eq. (20), but now posed on the domain

\Psi=\left\{\left(x_{1},\frac{x_{2}}{\psi(x_{1})},x_{3}\right)\text{ for }(x_{1},x_{2},x_{3})\in\Omega\right\}\text{ where }\psi(z)=1-\frac{1}{4}\sin(6z). (22)

We map (20) on Ψ to the rectangular reference domain \Omega=[1.1,2.1]\times[-1,0]\times[-1.2,0.2]; under this parameter map the equation becomes

\displaystyle-\frac{\partial^{2}u}{\partial x_{1}^{2}}-\left(\left(\frac{\psi^{\prime}(x_{1})x_{2}}{\psi(x_{1})}\right)^{2}+\psi(x_{1})^{2}\right)\frac{\partial^{2}u}{\partial x_{2}^{2}}-\frac{\partial^{2}u}{\partial x_{3}^{2}}-2\frac{\psi^{\prime}(x_{1})x_{2}}{\psi(x_{1})}\frac{\partial^{2}u}{\partial x_{1}\partial x_{2}}
\displaystyle-\frac{\psi^{\prime\prime}(x_{1})x_{2}}{\psi(x_{1})}\frac{\partial u}{\partial x_{2}}-\kappa^{2}u=0,\text{ }(x_{1},x_{2},x_{3})\in\Omega. (23)

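A minimal sketch of the coordinate map in (22) is given below, with the forward map from the reference box to Ψ and its inverse; the function names are illustrative and not part of our code.

```python
import numpy as np

def psi(z):
    # "Wavy" scaling factor along x1, from Eq. (22).
    return 1.0 - 0.25 * np.sin(6.0 * z)

def reference_to_physical(x1, x2, x3):
    # Map a point of the reference box Omega to the curved domain Psi.
    return x1, x2 / psi(x1), x3

def physical_to_reference(y1, y2, y3):
    # Inverse map: x1 and x3 are unchanged, so only the second
    # coordinate needs to be rescaled.
    return y1, y2 * psi(y1), y3
```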
We test this problem with a manufactured solution given by the Green's function of the Helmholtz equation, with corresponding Dirichlet data, matching our test in Section 4.1. A plot of this problem and the computed solution for wavenumber κ = 26 is shown in Figure 14(a), while convergence studies with κ = 16 and κ = 30 are shown in Figure 15. We observe a convergence pattern similar to that of the Helmholtz test on a cubic domain, although convergence is somewhat slower.

We also investigate our solver's accuracy on the Helmholtz equation when the domain is a three-dimensional annulus, shown in Figure 14(b). We use an analytic parametrization to map this problem to a rectangular reference domain with higher resolution along the y-axis and a periodic boundary condition. Relative errors in p and N are comparable to those for the problem on a curved domain.
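As an illustration only, an annular domain can be parameterized over a rectangular reference box with one periodic coordinate, as in the sketch below; the radii and axis conventions are assumptions for the example, not the values used in our tests.

```python
import numpy as np

# Map reference coordinates (s, t, z) in [0, 1]^3 to a 3D annulus.
# R_INNER and R_OUTER are placeholder radii; t is the periodic coordinate.
R_INNER, R_OUTER = 0.5, 1.0

def annulus_map(s, t, z):
    r = R_INNER + s * (R_OUTER - R_INNER)   # radial coordinate
    theta = 2.0 * np.pi * t                 # periodic angular coordinate
    return r * np.cos(theta), r * np.sin(theta), z
```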

5 Conclusion and Future Work

We have designed and implemented a two-level framework for the Hierarchical Poincaré-Steklov method that effectively decouples the hierarchical and dense portions of the method, using state-of-the-art multifrontal solvers to improve the stability of the hierarchical component and GPUs to improve the performance of the dense component. Our code supports three-dimensional problems. Several directions remain for future work.

Although multifrontal solvers like MUMPS provide excellent stability, their use is also the main performance bottleneck in our implementation. GPU-based sparse solvers such as those in the NVIDIA cuDSS library may offer better performance. Alternatively, the reduced problem given to the sparse solver may itself be compressed into a two-level framework using SlabLU [39]. There are also potential performance gains by exploiting rank structure in the leaf operators, especially in the 3D discretization where these operators can be fairly large.

This implementation of HPS may be extended to other formulations of the method, such as those using impedance maps rather than DtN maps, which may further improve conditioning. It may also be extended to adaptive discretizations [40, 41]. In addition to handling corner nodes, interpolation is useful for implementing merge operators across subdomains with varying p. Our framework could accommodate this adaptive case as well, although variable p would somewhat alter the block-sparse structure of \mathbf{T}_{\rm sp}.

References

• [1] I. Babuška, F. Ihlenburg, E. T. Paik, S. A. Sauter, A generalized finite element method for solving the Helmholtz equation in two dimensions with minimal pollution, Computer Methods in Applied Mechanics and Engineering 128 (3-4) (1995) 325–359.
• [2] I. M. Babuška, S. A. Sauter, Is the pollution effect of the FEM avoidable for the Helmholtz equation considering high wave numbers?, SIAM Journal on Numerical Analysis 34 (6) (1997) 2392–2423.
• [3] J. Galkowski, E. A. Spence, Does the Helmholtz boundary element method suffer from the pollution effect?, SIAM Review 65 (3) (2023) 806–828.
• [4] A. Townsend, S. Olver, The automatic solution of partial differential equations using a global spectral method, Journal of Computational Physics 299 (2015) 106–123.
• [5] J. P. Boyd, Chebyshev and Fourier spectral methods, Courier Corporation, 2001.
• [6] S. Olver, A. Townsend, A fast and well-conditioned spectral method, SIAM Review 55 (3) (2013) 462–489.
• [7] T. Babb, A. Gillman, S. Hao, P.-G. Martinsson, An accelerated Poisson solver based on multidomain spectral discretization, BIT Numerical Mathematics 58 (2018) 851–879.
• [8] A. Gillman, P.-G. Martinsson, A direct solver with O(N) complexity for variable coefficient elliptic PDEs discretized via a high-order composite spectral collocation method, SIAM Journal on Scientific Computing 36 (4) (2014) A2023–A2046.
• [9] A. T. Patera, A spectral element method for fluid dynamics: laminar flow in a channel expansion, Journal of Computational Physics 54 (3) (1984) 468–488.
• [10] Y. Maday, R. Munoz, Spectral element multigrid. II. Theoretical justification, Journal of Scientific Computing 3 (1988) 323–353.
• [11] A. Yeiser, A. A. Townsend, A spectral element method for meshes with skinny elements, arXiv preprint arXiv:1803.10353 (2018).
• [12] P. Martinsson, The Hierarchical Poincaré-Steklov (HPS) solver for elliptic PDEs: a tutorial, arXiv preprint arXiv:1506.01308 (2015).
• [13] P.-G. Martinsson, Fast direct solvers for elliptic PDEs, SIAM, 2019.
• [14] T. Babb, P.-G. Martinsson, D. Appelö, HPS accelerated spectral solvers for time dependent problems: Part I, algorithms, in: Spectral and High Order Methods for Partial Differential Equations ICOSAHOM 2018: Selected Papers from the ICOSAHOM Conference, London, UK, July 9-13, 2018, Springer International Publishing, 2020, pp. 131–141.
• [15] K. Chen, D. Appelö, T. Babb, P.-G. Martinsson, Fast and high-order approximation of parabolic equations using hierarchical direct solvers and implicit Runge-Kutta methods, Communications on Applied Mathematics and Computation (2024) 1–21.
• [16] S. Hao, P.-G. Martinsson, A direct solver for elliptic PDEs in three dimensions based on hierarchical merging of Poincaré–Steklov operators, Journal of Computational and Applied Mathematics 308 (2016) 419–434.
• [17] A. Yesypenko, P.-G. Martinsson, GPU optimizations for the Hierarchical Poincaré-Steklov scheme, in: International Conference on Domain Decomposition Methods, Springer, 2022, pp. 519–528.
• [18] D. Fortunato, N. Hale, A. Townsend, The ultraspherical spectral element method, Journal of Computational Physics 436 (2021) 110087.
• [19] L. N. Trefethen, Finite difference and spectral methods for ordinary and partial differential equations, Cornell University-Department of Computer Science and Center for Applied …, 1996, Ch. 8.
• [20] A. Gillman, P.-G. Martinsson, An O(N) algorithm for constructing the solution operator to 2D elliptic boundary value problems in the absence of body loads, Advances in Computational Mathematics 40 (2014) 773–796.
• [21] T. A. Davis, Direct methods for sparse linear systems, SIAM, 2006.
• [22] A. George, Nested dissection of a regular finite element mesh, SIAM Journal on Numerical Analysis 10 (2) (1973) 345–363.
• [23] P. R. Amestoy, T. A. Davis, I. S. Duff, An approximate minimum degree ordering algorithm, SIAM Journal on Matrix Analysis and Applications 17 (4) (1996) 886–905.
• [24] N. N. Beams, A. Gillman, R. J. Hewett, A parallel shared-memory implementation of a high-order accurate solution technique for variable coefficient Helmholtz problems, Computers & Mathematics with Applications 79 (4) (2020) 996–1011.
• [25] D. Chipman, EllipticForest: a direct solver library for elliptic partial differential equations on adaptive meshes, Journal of Open Source Software 9 (96) (2024) 6339.
• [26] A. Gillman, A. H. Barnett, P.-G. Martinsson, A spectrally accurate direct solution technique for frequency-domain scattering problems with variable media, BIT Numerical Mathematics 55 (2015) 141–170.
• [27] P. R. Amestoy, I. S. Duff, J.-Y. L’Excellent, J. Koster, MUMPS: a general purpose distributed memory sparse solver, in: International Workshop on Applied Parallel Computing, Springer, 2000, pp. 121–130.
• [28] J. Pan, L. Xiao, M. Tian, T. Liu, L. Wang, Heterogeneous multi-core optimization of MUMPS solver and its application, in: Proceedings of the 2021 ACM International Conference on Intelligent Computing and Its Emerging Applications, 2021, pp. 122–127.
• [29] S. Imambi, K. B. Prakash, G. Kanagachidambaresan, PyTorch, Programming with TensorFlow: Solution for Edge Computing Applications (2021) 87–104.
• [30] S. Balay, S. Abhyankar, M. Adams, J. Brown, P. Brune, K. Buschelman, E. Constantinescu, A. Dener, J. Faibussowitsch, W. Gropp, et al., PETSc/TAO users manual revision 3.22, Tech. rep., Argonne National Laboratory (ANL), Argonne, IL (United States) (2024).
• [31] I. S. Duff, S. Pralet, Towards stable mixed pivoting strategies for the sequential and parallel solution of sparse symmetric indefinite systems, SIAM Journal on Matrix Analysis and Applications 29 (3) (2007) 1007–1024.
• [32] L. Grigori, J. W. Demmel, H. Xiang, Communication avoiding Gaussian elimination, in: SC’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE, 2008, pp. 1–12.
• [33] P. Amestoy, J.-Y. L’Excellent, T. Mary, C. Puglisi, MUMPS: multifrontal massively parallel solver for the direct solution of sparse linear equations, in: EMS/ECMI Lanczos Prize, ECM Conference, Sevilla, 2024.
• [34] J. P. Lucero Lorca, N. Beams, D. Beecroft, A. Gillman, An iterative solver for the HPS discretization applied to three dimensional Helmholtz problems, SIAM Journal on Scientific Computing 46 (1) (2024) A80–A104.
• [35] J. Dongarra, S. Hammarling, N. J. Higham, S. D. Relton, M. Zounon, Optimized batched linear algebra for modern architectures, in: European Conference on Parallel Processing, Springer, 2017, pp. 511–522.
• [36] A. H. Barnett, B. J. Nelson, J. M. Mahoney, High-order boundary integral equation solution of high frequency wave scattering from obstacles in an unbounded linearly stratified medium, Journal of Computational Physics 297 (2015) 407–426.
• [37] D. Fortunato, A high-order fast direct solver for surface PDEs, SIAM Journal on Scientific Computing 46 (4) (2024) A2582–A2606.
• [38] C. Borges, A. Gillman, L. Greengard, High resolution inverse scattering in two dimensions using recursive linearization, SIAM Journal on Imaging Sciences 10 (2) (2017) 641–664.
• [39] A. Yesypenko, P.-G. Martinsson, SlabLU: a two-level sparse direct solver for elliptic PDEs, Advances in Computational Mathematics 50 (4) (2024) 90.
• [40] P. Geldermans, A. Gillman, An adaptive high order direct solution technique for elliptic boundary value problems, SIAM Journal on Scientific Computing 41 (1) (2019) A292–A315.
• [41] D. Chipman, D. Calhoun, C. Burstedde, A fast direct solver for elliptic PDEs on a hierarchy of adaptively refined quadtrees, arXiv preprint arXiv:2402.14936 (2024).