
High-Order Algorithms for Compressible Reacting Flow with Complex Chemistry

M. Emmett (corresponding author; email: MWEmmett@lbl.gov), W. Zhang, and J. B. Bell
Center for Computational Sciences and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720.
Abstract

In this paper we describe a numerical algorithm for integrating the multicomponent, reacting, compressible Navier-Stokes equations, targeted for direct numerical simulation of combustion phenomena. The algorithm addresses two shortcomings of previous methods. First, it incorporates an eighth-order narrow stencil approximation of diffusive terms that reduces communication compared to existing methods and eliminates the need for a filtering algorithm to damp Nyquist-frequency oscillations, which are not damped by traditional approaches. Second, the methodology incorporates a multirate temporal integration strategy that provides an efficient mechanism for treating chemical mechanisms that are stiff relative to fluid dynamical time scales. The overall methodology is eighth order in space with options for fourth order to eighth order in time. The implementation uses a hybrid programming model designed for effective utilization of many-core architectures. We present numerical results demonstrating the convergence properties of the algorithm with realistic chemical kinetics and illustrating its performance characteristics. We also present a validation example showing that the algorithm matches detailed results obtained with an established low Mach number solver.


keywords:
DNS; spectral deferred corrections; finite difference; high-order methods; flame simulations

1 Introduction

The simulation of a turbulent reacting flow without the use of either turbulence models or closure models for turbulence-chemistry interaction is referred to as a direct numerical simulation, or DNS. One of the standard computational tools used in DNS studies of combustion is the numerical integration of the multicomponent, reacting, compressible Navier-Stokes equations. DNS approaches have been in active use in the combustion community for more than twenty-five years; a comprehensive review of the literature is outside the scope of this article. Examples of early work in the field can be found in [3, 15, 28, 38, 17, 8]. With the advent of massively parallel, high performance computer architectures, it is now feasible to perform fully compressible DNS of turbulent flows with detailed kinetics. See, for example, [35, 31, 40, 21].

Compressible DNS approaches place a premium on high order of accuracy. Spatial discretizations are typically based on defining high-order spatial derivative operators, $D_x$, $D_y$ and $D_z$. Common choices are central difference approximations [19] or compact differences based on Padé approximations [23]. Two codes that have seen significant application in combustion, SENGA [17, 7] and S3D [32, 40, 21], use tenth and eighth order central differences, respectively. Second-order derivative terms are approximated using repeated application of the first-derivative operators, namely

(a(x)U_x)_x \approx D_x(a D_x U)

This treatment of second-order derivatives does not provide adequate damping of Nyquist-frequency oscillations. Consequently, some type of higher-order filter is employed periodically to damp high-frequency components of the solution. Alternatively, the chain rule can be applied to second-order derivative terms; however, the chain rule approach is known to have stability issues (see, e.g., [18]). Discretization of the spatial operators results in a system of ordinary differential equations (ODEs), which are typically integrated using high-order explicit Runge-Kutta schemes such as those discussed by Kennedy et al. [20].

Although these types of compressible DNS codes have been highly successful, they suffer from two basic weaknesses. First, the use of explicit integration techniques limits the time step to the minimum of the acoustic and chemical time scales. (Diffusive time scales are typically less restrictive than acoustic time scales for DNS.) This can be particularly restrictive for detailed chemical kinetics that can be stiff on acoustic time scales. The time step restriction can be relaxed by using an implicit/explicit (IMEX) temporal integration scheme, but at the expense of solving linear systems to implicitly advance the kinetics at acoustic time scales.

The other issue with traditional compressible DNS methodology is the evaluation of second-order derivative terms by repeated application of the high-order first derivative operators. This approach is efficient in terms of floating point work but requires either multiple communication steps or much wider layers of ghost cells for each evaluation of the spatial operators. Although this type of design has been effective, as we move toward new many-core architectures that place a premium on reduced communication, the algorithm becomes communication bound and computational efficiency is reduced.

In this paper, we introduce two new algorithmic features that address the weaknesses discussed above. For the spatial discretization, we introduce an extension of the sixth-order narrow stencil finite difference algorithm of Kamakoti and Pantano [18] that provides a single discretization of $(a(x)U_x)_x$ to eighth-order accuracy. This approach requires significantly more floating point work than the standard approach but allows the spatial discretization to be computed with less communication. Furthermore, it eliminates the need for a filter to damp Nyquist-frequency oscillations. We also introduce a multirate time stepping strategy based on spectral deferred corrections (SDC) that enables the chemistry to be advanced on a different time scale than the fluid dynamics. This allows the fluid dynamics to be evaluated less frequently than in a fully coupled code, resulting in shorter run times. These algorithms are implemented in a hybrid OpenMP/MPI parallel code named SMC.

In §2 we introduce the spatial discretization used by SMC and the eighth-order narrow stencil discretization of variable coefficient diffusion operators. In §3 we discuss the temporal discretization used by SMC and the multirate spectral deferred correction algorithm. In §4 we discuss some of the issues associated with a hybrid parallel implementation of SMC. In particular, we discuss a spatial blocking strategy that provides effective use of OpenMP threads. In §5 we present numerical results illustrating the convergence properties and parallel performance of SMC, and show its application to dimethyl ether-air combustion. Finally, the main results of the paper are summarized in §6.

2 Spatial Discretization

2.1 Navier-Stokes Equations

The multicomponent reacting compressible Navier-Stokes equations are given by

\frac{\partial\rho}{\partial t}+\nabla\cdot(\rho\boldsymbol{u}) = 0, (1a)
\frac{\partial\rho\boldsymbol{u}}{\partial t}+\nabla\cdot(\rho\boldsymbol{u}\boldsymbol{u})+\nabla p = \nabla\cdot\boldsymbol{\tau}, (1b)
\frac{\partial\rho Y_k}{\partial t}+\nabla\cdot(\rho Y_k\boldsymbol{u}) = \rho\dot{\omega}_k-\nabla\cdot\mathcal{F}_k, (1c)
\frac{\partial\rho E}{\partial t}+\nabla\cdot\bigl[(\rho E+p)\boldsymbol{u}\bigr] = \nabla\cdot(\lambda\nabla T)+\nabla\cdot(\boldsymbol{\tau}\cdot\boldsymbol{u})-\nabla\cdot\sum_k\mathcal{F}_k h_k, (1d)

where $\rho$ is the density, $\boldsymbol{u}$ is the velocity, $p$ is the pressure, $E$ is the specific energy density (kinetic energy plus internal energy), $T$ is the temperature, $\boldsymbol{\tau}$ is the viscous stress tensor, and $\lambda$ is the thermal conductivity. For each of the chemical species $k$, $Y_k$ is the mass fraction, $\mathcal{F}_k$ is the diffusive flux, $h_k$ is the specific enthalpy, and $\dot{\omega}_k$ is the production rate. The system is closed by an equation of state that specifies $p$ as a function of $\rho$, $T$, and $Y_k$. For the examples presented here, we assume an ideal gas mixture.

The viscous stress tensor is given by

\tau_{ij}=\eta\left(\frac{\partial u_i}{\partial x_j}+\frac{\partial u_j}{\partial x_i}-\frac{2}{3}\delta_{ij}\nabla\cdot\boldsymbol{u}\right)+\xi\,\delta_{ij}\nabla\cdot\boldsymbol{u}, (2)

where $\eta$ is the shear viscosity and $\xi$ is the bulk viscosity.

A mixture model for species diffusion is employed in SMC. We define an initial approximation $\bar{\mathcal{F}}_k$ to the species diffusion flux given by

\bar{\mathcal{F}}_k=-\rho D_k\left(\nabla X_k+(X_k-Y_k)\frac{\nabla p}{p}\right) (3)

where $X_k$ is the mole fraction and $D_k$ is the diffusion coefficient of species $k$. For species transport to be consistent with conservation of mass we require

\sum_k\mathcal{F}_k=0. (4)

In general, the fluxes defined by (3) do not satisfy this requirement and some type of correction is required. When there is a good choice for a reference species that has a high molar concentration throughout the entire domain, one can use (3) to define the species diffusion flux (i.e., $\mathcal{F}_k\equiv\bar{\mathcal{F}}_k$) for all species except the reference, and then use (4) to define the flux of the reference species. This strategy is often used in combustion with $\mathrm{N_2}$ as the reference species. An alternative approach [30] is to define a correction velocity $\boldsymbol{V}_c$ given by

\boldsymbol{V}_c=\sum_\ell\bar{\mathcal{F}}_\ell (5)

and then to set

\mathcal{F}_k=\bar{\mathcal{F}}_k-Y_k\boldsymbol{V}_c. (6)

The SMC code supports both of these operating modes. We will discuss these approaches further in §2.4.

2.2 Eighth-Order and Sixth-Order Discretization

The SMC code discretizes the governing Navier-Stokes equations (1) in space using high-order centered finite-difference methods. For first order derivatives a standard $8^{\rm th}$ order stencil is employed. For second order derivatives with variable coefficients of the form

\frac{\partial}{\partial x}\left(a(x)\frac{\partial u}{\partial x}\right), (7)

a novel $8^{\rm th}$ order narrow stencil is employed. This stencil is an extension of those developed by Kamakoti and Pantano [18]. The derivatives in (7) on a uniform grid are approximated by

\frac{\partial}{\partial x}\left(a\frac{\partial u}{\partial x}\right)\bigg|_i\approx\frac{H_{i+1/2}-H_{i-1/2}}{\Delta x^2}, (8)

where

H_{i+1/2}=\sum_{m=-s+1}^{s}\sum_{n=-s+1}^{s}a_{i+m}M_{mn}u_{i+n}, (9)

$s=4$, and $M$ is an $8\times 8$ matrix (in general, for a discretization of order $2s$, $M$ is a $2s\times 2s$ matrix). This discretization is conservative because (8) is in flux form. Note, however, that $H_{i+1/2}$ is not a high-order approximation of the flux at $x_{i+1/2}$, even though the flux difference (8) approximates the derivative of the flux to order $2s$. This type of stencil is narrow because the width of the stencil is only $2s+1$ for both $u$ and $a$. In contrast, the width of a wide stencil, in which the first-order derivative operator is applied twice, is $4s+1$ for $u$ and $2s+1$ for $a$.
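To illustrate how (8) and (9) might be evaluated in practice, the following sketch applies the narrow stencil to a one-dimensional periodic array. It is not SMC source code: the subroutine name, the 1-based storage of the stencil matrix (row $m+4$, column $n+4$ holding $M_{mn}$), and the periodic wrapping are our own illustrative choices, and the matrix is assumed to have been filled with the coefficients of (10) below.

! Minimal sketch (not SMC source): evaluate d/dx ( a du/dx ) with the
! narrow stencil (8)-(9) on a one-dimensional periodic grid.  M(m+4,n+4)
! holds the entry M_{mn} of Eq. (10) for m,n = -3,...,4.
subroutine narrow_second_deriv(n, dx, a, u, M, d2)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in) :: dx, a(n), u(n), M(8,8)
  double precision, intent(out) :: d2(n)
  double precision :: H(0:n)            ! H(i) approximates H_{i+1/2}
  integer :: i, mm, nn
  do i = 0, n
     H(i) = 0.d0
     do mm = -3, 4
        do nn = -3, 4
           ! periodic wrap of the offsets i+mm and i+nn into 1..n
           H(i) = H(i) + a(modulo(i+mm-1,n)+1) * M(mm+4,nn+4) &
                       * u(modulo(i+nn-1,n)+1)
        end do
     end do
  end do
  do i = 1, n
     d2(i) = (H(i) - H(i-1)) / dx**2    ! flux-difference form of Eq. (8)
  end do
end subroutine narrow_second_deriv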

Following the strategy in [18], a family of $8^{\rm th}$ order narrow stencils for the second order derivative in (7) has been derived. The resulting $8^{\rm th}$ order stencil matrix is given by

M=\left[\begin{array}{cccccccc}
m_{11}&m_{12}&m_{13}&m_{14}&m_{15}&0&0&0\\
m_{21}&m_{22}&m_{23}&m_{24}&m_{25}&m_{26}&0&0\\
m_{31}&m_{32}&m_{33}&m_{34}&m_{35}&m_{36}&m_{37}&0\\
m_{41}&m_{42}&m_{43}&m_{44}&m_{45}&m_{46}&m_{47}&m_{48}\\
-m_{48}&-m_{47}&-m_{46}&-m_{45}&-m_{44}&-m_{43}&-m_{42}&-m_{41}\\
0&-m_{37}&-m_{36}&-m_{35}&-m_{34}&-m_{33}&-m_{32}&-m_{31}\\
0&0&-m_{26}&-m_{25}&-m_{24}&-m_{23}&-m_{22}&-m_{21}\\
0&0&0&-m_{15}&-m_{14}&-m_{13}&-m_{12}&-m_{11}
\end{array}\right], (10)

where

m_{11} = \frac{5}{336}+m_{48},   m_{12} = -\frac{83}{3600}-\frac{1}{5}m_{47}-\frac{14}{5}m_{48},
m_{13} = \frac{299}{50400}+\frac{2}{5}m_{47}+\frac{13}{5}m_{48},   m_{14} = \frac{17}{12600}-\frac{1}{5}m_{47}-\frac{4}{5}m_{48},
m_{15} = \frac{1}{1120},   m_{21} = -\frac{11}{560}-2m_{48},
m_{22} = -\frac{31}{360}+m_{47}+3m_{48},   m_{23} = \frac{41}{200}-\frac{9}{5}m_{47}+\frac{4}{5}m_{48},
m_{24} = -\frac{5927}{50400}+\frac{4}{5}m_{47}-\frac{9}{5}m_{48},   m_{25} = \frac{17}{600}-\frac{1}{5}m_{47}-\frac{4}{5}m_{48},
m_{26} = -\frac{503}{50400}+\frac{1}{5}m_{47}+\frac{4}{5}m_{48},   m_{31} = -\frac{1}{280},
m_{32} = \frac{1097}{5040}-2m_{47}+6m_{48},   m_{33} = -\frac{1349}{10080}+3m_{47}-12m_{48},
m_{34} = -\frac{887}{5040}-m_{47}+6m_{48},   m_{35} = \frac{3613}{50400}+\frac{4}{5}m_{47}-\frac{9}{5}m_{48},
m_{36} = \frac{467}{25200}-\frac{3}{5}m_{47}+\frac{18}{5}m_{48},   m_{37} = \frac{139}{25200}-\frac{1}{5}m_{47}-\frac{9}{5}m_{48},
m_{41} = \frac{17}{1680}+2m_{48},   m_{42} = -\frac{319}{2520}+2m_{47}-8m_{48},
m_{43} = -\frac{919}{5040}-2m_{47}+6m_{48},   m_{44} = -\frac{445}{2016},
m_{45} = \frac{583}{720}-m_{47}+6m_{48},   m_{46} = -\frac{65}{224}-7m_{48},

and $m_{47}$ and $m_{48}$ are two free parameters. Note that the well-posedness requirement established in [18] is satisfied by the stencil matrix above independent of the two free parameters. In SMC the default values of the free parameters were set to $m_{47}=3557/44100$ and $m_{48}=-2083/117600$ to minimize an upper bound on truncation error that is calculated by summing the absolute value of each term in the truncation error. Additional analysis of truncation errors and numerical experiments on the $8^{\rm th}$ order narrow stencil using various choices of the two parameters are presented in Appendix 7.
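For completeness, the coefficients above can be assembled into the stencil matrix as in the sketch below (again not SMC source; the 1-based indexing and subroutine name are ours). The lower half of (10) follows from the antisymmetry $M_{ij}=-M_{9-i,9-j}$ visible in the matrix.

! Minimal sketch (not SMC source): fill the 8x8 narrow-stencil matrix of
! Eq. (10) from the two free parameters m47 and m48 (SMC defaults:
! m47 = 3557/44100, m48 = -2083/117600).
subroutine fill_m8(m47, m48, M)
  implicit none
  double precision, intent(in) :: m47, m48
  double precision, intent(out) :: M(8,8)
  integer :: i, j
  M = 0.d0
  M(1,1) =  5.d0/336.d0 + m48
  M(1,2) = -83.d0/3600.d0 - m47/5.d0 - 14.d0*m48/5.d0
  M(1,3) =  299.d0/50400.d0 + 2.d0*m47/5.d0 + 13.d0*m48/5.d0
  M(1,4) =  17.d0/12600.d0 - m47/5.d0 - 4.d0*m48/5.d0
  M(1,5) =  1.d0/1120.d0
  M(2,1) = -11.d0/560.d0 - 2.d0*m48
  M(2,2) = -31.d0/360.d0 + m47 + 3.d0*m48
  M(2,3) =  41.d0/200.d0 - 9.d0*m47/5.d0 + 4.d0*m48/5.d0
  M(2,4) = -5927.d0/50400.d0 + 4.d0*m47/5.d0 - 9.d0*m48/5.d0
  M(2,5) =  17.d0/600.d0 - m47/5.d0 - 4.d0*m48/5.d0
  M(2,6) = -503.d0/50400.d0 + m47/5.d0 + 4.d0*m48/5.d0
  M(3,1) = -1.d0/280.d0
  M(3,2) =  1097.d0/5040.d0 - 2.d0*m47 + 6.d0*m48
  M(3,3) = -1349.d0/10080.d0 + 3.d0*m47 - 12.d0*m48
  M(3,4) = -887.d0/5040.d0 - m47 + 6.d0*m48
  M(3,5) =  3613.d0/50400.d0 + 4.d0*m47/5.d0 - 9.d0*m48/5.d0
  M(3,6) =  467.d0/25200.d0 - 3.d0*m47/5.d0 + 18.d0*m48/5.d0
  M(3,7) =  139.d0/25200.d0 - m47/5.d0 - 9.d0*m48/5.d0
  M(4,1) =  17.d0/1680.d0 + 2.d0*m48
  M(4,2) = -319.d0/2520.d0 + 2.d0*m47 - 8.d0*m48
  M(4,3) = -919.d0/5040.d0 - 2.d0*m47 + 6.d0*m48
  M(4,4) = -445.d0/2016.d0
  M(4,5) =  583.d0/720.d0 - m47 + 6.d0*m48
  M(4,6) = -65.d0/224.d0 - 7.d0*m48
  M(4,7) =  m47
  M(4,8) =  m48
  do i = 5, 8                 ! lower half by antisymmetry
     do j = 1, 8
        M(i,j) = -M(9-i,9-j)
     end do
  end do
end subroutine fill_m8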

The SMC code uses a $6^{\rm th}$ order stencil near physical boundaries where the conditions outside the domain are not well specified (see §2.5). The general $6^{\rm th}$ order stencil matrix is given by

M=\left[\begin{array}{cccccc}
m_{11}&m_{12}&m_{13}&m_{14}&0&0\\
m_{21}&m_{22}&m_{23}&m_{24}&m_{25}&0\\
m_{31}&m_{32}&m_{33}&m_{34}&m_{35}&m_{36}\\
-m_{36}&-m_{35}&-m_{34}&-m_{33}&-m_{32}&-m_{31}\\
0&-m_{25}&-m_{24}&-m_{23}&-m_{22}&-m_{21}\\
0&0&-m_{14}&-m_{13}&-m_{12}&-m_{11}
\end{array}\right], (11)

where

m_{11} = -\frac{11}{180}+m_{36},   m_{12} = \frac{1}{9}-2m_{36},
m_{13} = -\frac{1}{18}+m_{36},   m_{14} = \frac{1}{180},
m_{21} = \frac{7}{60}-3m_{36},   m_{22} = -\frac{1}{120}+5m_{36},
m_{23} = -\frac{17}{90}-2m_{36},   m_{24} = \frac{5}{72}+m_{36},
m_{25} = \frac{1}{90}-m_{36},   m_{31} = -\frac{1}{15}+3m_{36},
m_{32} = -\frac{11}{60}-3m_{36},   m_{33} = -\frac{101}{360},
m_{34} = \frac{137}{180}-2m_{36},   m_{35} = -\frac{83}{360}+m_{36},

and $m_{36}$ is a free parameter. The $6^{\rm th}$ order narrow stencil developed in [18] is a special case of the general $6^{\rm th}$ order stencil with $m_{36}=0.220063$. As with the $8^{\rm th}$ order stencil, the well-posedness and the order of accuracy of the $6^{\rm th}$ order stencil are independent of the free parameter. In SMC, the free parameter is set to $281/3600$ to minimize an upper bound on truncation error that is calculated by summing the absolute value of each term in the truncation error.

2.3 Cost Analysis of Eighth-Order Stencil

The $8^{\rm th}$ order narrow stencil presented in §2.2 requires significantly more floating-point operations (FLOPs) than the standard wide stencil, but has the benefit of less communication. The computational cost of a single evaluation of the narrow stencil, for $N$ cells in one dimension, is $112N+111$ FLOPs (this is the cost of the operations in (9) and (8), excluding the division by $\Delta x^2$). For the $8^{\rm th}$ order wide stencil with two steps of communication, the computational cost on $N$ cells is $23N+8$ FLOPs. Hence, the narrow stencil costs $89N+103$ more FLOPs than the wide stencil on $N$ cells in one dimension. As for communication, the wide stencil requires 64 more bytes than the narrow stencil for $N$ cells with 8 ghost cells in one dimension. Here we assume that an 8-byte double-precision floating-point format is used. Ignoring latency, the difference in time between the two algorithms is then

\frac{89N+103}{F}-\frac{64}{B}

where $F$ is the floating-point operations per second (FLOPS) and $B$ is the network bandwidth. To estimate this difference on current machines, we used the National Energy Research Scientific Computing Center's newest supercomputer, Edison, as a reference. Each compute node on Edison has a peak performance of 460.8 GFLOPS, and the MPI bandwidth is about 8 GB/s (the configuration of Edison is available at http://www.nersc.gov/users/computational-systems/edison/configuration/). For $N=64$ in one dimension, the run time cost of the extra FLOPs of the narrow stencil is about $1.3\times 10^{-8}\,\mathrm{s}$, which is about 1.6 times the extra communication cost of the wide stencil, $8.0\times 10^{-9}\,\mathrm{s}$. It should be emphasized that these estimates are crude for several reasons. We neglected the efficiency of both floating-point and network operations for simplicity. In counting FLOPs, we did not consider fused multiply-add instructions or memory bandwidth. In estimating communication costs, we did not consider network latency, the cost of data movement if MPI messages are aggregated to reduce latency, or communication hiding through overlapping communication and computation. Moreover, the efficiency of network operations is likely to drop as the number of nodes increases, whereas the efficiency of FLOPs alone is independent of the total number of nodes. Nevertheless, as multicore systems continue to evolve, the ratio of FLOPS to network bandwidth is likely to increase. Thus, we expect that the narrow stencil will cost less than the wide stencil on future systems, although any net benefit will depend on the system's configuration. As a data point, we note that both stencils are implemented in SMC, and for a test with $512^3$ cells using 512 MPI ranks and 12 threads on 6144 cores of Edison, a run using the narrow stencil took about the same time as one using the wide stencil. Further measurements are presented in §5.2.
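The numbers quoted above follow directly from this expression; the short program below (ours, not part of SMC) evaluates both terms for $N=64$ using Edison's quoted peak FLOP rate and MPI bandwidth.

! Minimal sketch: reproduce the crude estimate of Section 2.3 for N = 64
! with F = 460.8 GFLOPS per node and B = 8 GB/s MPI bandwidth.
program stencil_cost_estimate
  implicit none
  double precision, parameter :: F = 460.8d9   ! peak FLOP rate (per node)
  double precision, parameter :: B = 8.0d9     ! MPI bandwidth (bytes/s)
  integer, parameter :: N = 64
  double precision :: t_flops, t_comm
  t_flops = (89.d0*N + 103.d0) / F   ! extra FLOP time of the narrow stencil
  t_comm  = 64.d0 / B                ! extra communication time of the wide stencil
  print *, 'extra FLOP time (narrow):', t_flops            ! ~1.3e-8 s
  print *, 'extra comm time (wide)  :', t_comm             ! ~8.0e-9 s
  print *, 'ratio                   :', t_flops / t_comm   ! ~1.6
end program stencil_cost_estimate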

2.4 Correction for Species Diffusion

In §2.1, we introduced the two approaches employed in SMC to enforce the consistency of the mixture model of species diffusion with respect to mass conservation. The first approach uses a reference species and is straightforward to implement. We note that the last term in (1d) can be treated as $-\nabla\cdot\sum_k\mathcal{F}_k(h_k-h_{\mathrm{ref}})$, where $h_{\mathrm{ref}}$ is the species enthalpy of the reference species, because of (4). The second approach uses a correction velocity (5) and involves the computation of $\nabla\cdot(Y_k\boldsymbol{V}_c)$ and $\nabla\cdot(h_kY_k\boldsymbol{V}_c)$. Although one might attempt to compute the correction velocity according to $V_{c,i+1/2}=\sum_\ell H_{\ell,i+1/2}$, where $H_{\ell,i+1/2}$ for species $\ell$ is computed using (9), this would reduce the accuracy of the correction velocity to second order, since $H_{i+1/2}$ is only a second-order approximation to the flux at $i+1/2$. To maintain $8^{\rm th}$ order accuracy without incurring extra communication or too much extra computation, we apply the chain rule to the correction terms. For example, we compute $\nabla\cdot(h_kY_k\boldsymbol{V}_c)$ as $h_kY_k\nabla\cdot\boldsymbol{V}_c+\nabla(h_kY_k)\cdot\boldsymbol{V}_c$, and use $\nabla\cdot\boldsymbol{V}_c=\sum_\ell\nabla\cdot\bar{\mathcal{F}}_\ell$, where $\nabla\cdot\bar{\mathcal{F}}_\ell$ is computed using the $8^{\rm th}$ order narrow stencil for second order derivatives. The velocity correction form arises naturally in multicomponent transport theory; namely, it represents the first term in an expansion of the full multicomponent diffusion matrix (see Giovangigli [13]). For that reason, velocity correction is the preferred option in SMC.

2.5 Physical Boundaries

The SMC code uses characteristic boundary conditions to treat physical boundaries. Previous studies in the literature have shown that characteristic boundary conditions can successfully lower acoustic reflections at open boundaries without suffering from numerical instabilities. Thompson [36, 37] applied the one-dimensional approximation of the characteristic boundary conditions to the hyperbolic Euler equations, and the one-dimensional formulation was extended by Poinsot and Lele [29] to the viscous Navier-Stokes equations. For multicomponent reacting flows, Sutherland and Kennedy [34] further improved the treatment of boundary conditions by the inclusion of reactive source terms in the formulation. In SMC, we have adopted the formulation of characteristic boundary conditions proposed in [41, 39], in which multi-dimensional effects are also included.

The $8^{\rm th}$ order stencils used in SMC require four cells on each side of the point where the derivative is being calculated. Near the physical boundaries, where the conditions outside the domain are unknown, lower-order stencils are used for derivatives with respect to the normal direction. At the fourth and third cells from the boundary we use $6^{\rm th}$ and $4^{\rm th}$ order centered stencils, respectively, for both advection and diffusion. At cells next to the boundary we reduce to a centered second-order stencil for diffusion and a biased third-order discretization for advection. Finally, for the boundary cells, a completely biased second-order stencil is used for diffusion and a completely biased third-order stencil is used for advection.

3 Temporal Discretization

3.1 Spectral Deferred Corrections

SDC methods for ODEs were first introduced in [10] and have been subsequently refined and extended in [25, 26, 16, 14, 5, 22]. SDC methods construct high-order solutions within one time step by iteratively approximating a series of correction equations at collocation nodes using low-order sub-stepping methods.

Consider the generic ODE initial value problem

u'(t)=f\bigl(u(t),t\bigr),\qquad u(0)=u_0 (12)

where $t\in[0,T]$; $u_0,\,u(t)\in\mathbb{C}^N$; and $f:\mathbb{R}\times\mathbb{C}^N\rightarrow\mathbb{C}^N$. SDC methods are derived by considering the equivalent Picard integral form of (12) given by

u(t)=u_0+\int_0^t f\bigl(u(s),s\bigr)\,ds. (13)

A single time step $[t_n,t_{n+1}]$ is divided into a set of intermediate sub-steps by defining $M+1$ collocation points $\tau_m\in[t_n,t_{n+1}]$ such that $t_n=\tau_0<\tau_1<\cdots<\tau_M=t_{n+1}$. Then, the integrals of $f\bigl(u(t),t\bigr)$ over each of the intervals $[t_n,\tau_m]$ are approximated by

I_n^m\equiv\int_{t_n}^{\tau_m}f\bigl(u(s),s\bigr)\,ds\approx\Delta t\sum_{j=0}^{M}q_{mj}f(U_j,\tau_j) (14)

where $U_j\approx u(\tau_j)$, $\Delta t=t_{n+1}-t_n$, and $q_{mj}$ are quadrature weights. The quadrature weights that give the highest order of accuracy given the collocation points $\tau_m$ are obtained by computing exact integrals of the Lagrange interpolating polynomial over the collocation points $\tau_m$. Note that each of the integral approximations $I_n^m$ (which represent the integral of $f$ from $t_n$ to $\tau_m$) in (14) depends on the function values $f(U_m,\tau_m)$ at all of the collocation nodes $\tau_m$.

To simplify notation, we define the integration matrix $\boldsymbol{Q}$ to be the $M\times(M+1)$ matrix consisting of entries $q_{mj}$, and the vectors $\boldsymbol{U}\equiv[U_1,\cdots,U_M]$ and $\boldsymbol{F}\equiv[f(U_0,t_0),\cdots,f(U_M,t_M)]$. With these definitions, the Picard equation (13) within the time step $[t_n,t_{n+1}]$ is approximated by

\boldsymbol{U}=\boldsymbol{U}_0+\Delta t\,\boldsymbol{Q}\,\boldsymbol{F} (15)

where $\boldsymbol{U}_0=U_0\boldsymbol{1}$. Note again that the integration matrix $\boldsymbol{Q}$ is dense, so that each entry of $\boldsymbol{U}$ depends on all other entries of $\boldsymbol{U}$ (through $\boldsymbol{F}$) and on $U_0$. Thus, (15) is an implicit equation for the unknowns in $\boldsymbol{U}$ at all of the quadrature points. Finally, we note that the solution of (15) corresponds to the collocation or implicit Runge-Kutta solution of (12) over the nodes $\tau_m$. Hence SDC can be considered as an iterative method for solving the spectral collocation formulation.

The SDC scheme used here begins by spreading the initial condition $u_n$ to each of the collocation nodes so that the provisional solution $\boldsymbol{U}^0$ is given by $\boldsymbol{U}^0=[U_0,\cdots,U_0]$. Subsequent iterations (denoted by $k$ superscripts) proceed by applying the node-to-node integration matrix $\boldsymbol{S}$ to $\boldsymbol{F}^k$ and correcting the result using a forward-Euler time-stepper between the collocation nodes. The node-to-node integration matrix $\boldsymbol{S}$ is used to approximate the integrals $I_m^{m+1}=\int_{\tau_m}^{\tau_{m+1}}f(u(s),s)\,ds$ (as opposed to the integration matrix $\boldsymbol{Q}$, which approximates $I_n^m$) and is constructed in a manner similar to $\boldsymbol{Q}$. The update equation corresponding to the forward-Euler sub-stepping method for computing $\boldsymbol{U}^{k+1}$ is given by

U^{k+1}_{m+1}=U^{k+1}_m+\Delta t_m\bigl[f(U^{k+1}_m,t_m)-f(U^k_m,t_m)\bigr]+\Delta t\,S^{k,m} (16)

where $S^{k,m}$ is the $m^{\rm th}$ row of $\boldsymbol{S}\boldsymbol{F}^k$. The process of computing (16) at all of the collocation nodes $\tau_m$ is referred to as an SDC sweep or an SDC iteration. The accuracy of the solution generated after $k$ SDC iterations done with such a first-order method is formally $O(\Delta t^k)$, as long as the spectral integration rule (which is determined by the choice of collocation nodes $\tau_m$) is at least of order $k$.

Here, a method of lines discretization based on the spatial discretization presented in §2 is used to reduce the partial differential equations (PDEs) in question to a large system of ODEs.

The computational cost of the SDC scheme used here is determined by the number of nodes $M+1$ chosen and the number of SDC iterations $K$ taken per time step. The resulting number of function evaluations is $N_{\rm evals}=KM$. For example, with 3 Gauss-Lobatto nodes ($M=2$), the resulting integration rule is $4^{\rm th}$ order. To achieve this formal order of accuracy we perform $K=4$ SDC iterations per time step and hence require $N_{\rm evals}=8$ function evaluations (note that $f(U_0,0)$ is recycled from the previous time step, and hence the first time step requires 9 function evaluations).

By increasing the number of SDC nodes and adjusting the number of SDC iterations taken, we can construct schemes of arbitrary order. For example, to construct a $6^{\rm th}$ order scheme one could use a 5-point Gauss-Lobatto integration rule (which is $8^{\rm th}$ order accurate) but take only 6 SDC iterations. For the high-order explicit schemes used here we have observed that the SDC iterations sometimes converge to the collocation solution in fewer iterations than is formally required. This can be detected by computing the SDC residual

R^k=U_0+\boldsymbol{q}\cdot\boldsymbol{F}^k-U_M^k, (17)

where $\boldsymbol{q}$ is the last row of $\boldsymbol{Q}$, and comparing successive residuals. If $|R^{k-1}|/|R^k|$ is close to one, the SDC iterations have converged to the collocation solution and subsequent iterations can be skipped.
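To make the single-rate scheme concrete, the following sketch (ours, not SMC source) performs the sweep (16) and the residual check (17) for the scalar test equation $u'=\lambda u$ on 3 Gauss-Lobatto nodes with $K=4$ iterations. The node-to-node weights $S$ and the last row of $Q$ are hard-coded for these nodes, and the $\Delta t$ factor in (17) is written out explicitly.

! Minimal sketch (not SMC source): single-rate SDC with forward-Euler
! sub-stepping, Eq. (16), on 3 Gauss-Lobatto nodes for u' = lam*u over
! one time step, followed by the residual of Eq. (17).
program sdc_sweep_demo
  implicit none
  integer, parameter :: M = 2, K = 4
  double precision, parameter :: lam = -2.d0, dt = 0.1d0, u0 = 1.d0
  double precision :: tau(0:M), S(M,0:M), qw(0:M)
  double precision :: U(0:M), F(0:M), Unew(0:M), dtm, resid
  integer :: kk, mm

  tau = (/ 0.d0, 0.5d0, 1.d0 /) * dt                   ! 3 Gauss-Lobatto nodes
  S(1,:) = (/  5.d0/24.d0, 1.d0/3.d0, -1.d0/24.d0 /)   ! integral tau_0 -> tau_1
  S(2,:) = (/ -1.d0/24.d0, 1.d0/3.d0,  5.d0/24.d0 /)   ! integral tau_1 -> tau_2
  qw     = (/  1.d0/6.d0,  2.d0/3.d0,  1.d0/6.d0 /)    ! last row of Q (Simpson)

  U = u0                        ! spread the initial condition to all nodes
  F = lam*U
  do kk = 1, K                  ! SDC iterations (sweeps)
     Unew(0) = u0
     do mm = 0, M-1
        dtm = tau(mm+1) - tau(mm)
        ! forward-Euler correction sub-step, Eq. (16)
        Unew(mm+1) = Unew(mm) + dtm*(lam*Unew(mm) - F(mm)) &
                     + dt*dot_product(S(mm+1,:), F)
     end do
     U = Unew
     F = lam*U
     ! SDC residual, Eq. (17), with the Delta-t factor written explicitly
     resid = u0 + dt*dot_product(qw, F) - U(M)
     print *, 'iteration', kk, 'residual', resid
  end do
  print *, 'SDC  :', U(M), '  exact:', u0*exp(lam*dt)
end program sdc_sweep_demo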

3.2 Multirate Integration

Multirate SDC (MRSDC) methods use a hierarchy of SDC schemes to integrate systems that contain processes with disparate time-scales more efficiently than fully coupled methods [5, 22]. A traditional MRSDC method integrates processes with long time-scales with fewer SDC nodes than those with short time-scales. They are similar in spirit to sub-cycling schemes in which processes with short characteristic time-scales are integrated with smaller time steps than those with long characteristic time-scales. MRSDC schemes, however, update the short time-scale components of the solutions at the long time-scale time steps during each MRSDC iteration, as opposed to only once for typical sub-cycling schemes, which results in tighter coupling between processes throughout each time-step [5, 22]. The MRSDC schemes presented here generalize the structure and coupling of the quadrature node hierarchy relative to the schemes first presented in [5, 22].

Before presenting MRSDC in detail, we note that the primary advantage of MRSDC schemes lies in how the processes are coupled. For systems in which the short time-scale process is moderately stiff, integrating it with more SDC nodes than the long time-scale process helps alleviate its stiffness. However, extremely stiff processes are best integrated between SDC nodes with, for example, a variable order/step-size backward differentiation formula (BDF) scheme. In this case, the short time-scale processes can in fact be treated with fewer SDC nodes than the long time-scale processes.

Consider the generic ODE initial value problem with two disparate time-scales

u'(t)=f_1\bigl(u(t),t\bigr)+f_2\bigl(u(t),t\bigr),\qquad u(0)=u_0 (18)

where we have split the generic right-hand side $f$ in (12) into two components, $f_1$ and $f_2$. A single time step $[t_n,t_{n+1}]$ is first divided into a set of intermediate sub-steps by defining $M_1+1$ collocation points $\tau_m\in[t_n,t_{n+1}]$. The vector of these $M_1+1$ collocation points is denoted $\boldsymbol{t}_1$, and these collocation points will be used to integrate the $f_1$ component. The time step is further divided by defining $M_2+1$ collocation points $\tau_p\in[t_n,t_{n+1}]$ such that $\boldsymbol{t}_1\subset\boldsymbol{t}_2$, where $\boldsymbol{t}_2$ is the vector of these new collocation points; they will be used to integrate the $f_2$ component. That is, a hierarchy of collocation points is created with $\boldsymbol{t}_1\subset\cdots\subset\boldsymbol{t}_N$. The splitting of the right-hand side in (18) should be informed by the numerics at hand. Traditionally, slow time-scale processes are split into $f_1$ and placed on the $\boldsymbol{t}_1$ nodes, and fast time-scale processes are split into $f_2$ and placed on the $\boldsymbol{t}_2$ nodes. However, if the fast time-scale processes in the MRSDC update equation (20) are extremely stiff, then it is more computationally effective to advance those processes using stiff integration algorithms over a longer time scale. In this case, we can swap the roles of the processes within MRSDC and treat the explicit update on the faster time scale (i.e., with the $\boldsymbol{t}_2$ nodes). As such, we hereafter refer to the collocation points in $\boldsymbol{t}_1$ as the "coarse nodes" and those in $\boldsymbol{t}_2$ as the "fine nodes".

The SDC nodes $\boldsymbol{t}_1$ for the coarse component are typically chosen to correspond to a formal quadrature rule like the Gauss-Lobatto rule. These nodes determine the overall order of accuracy of the resulting MRSDC scheme. Subsequently, there is some flexibility in how the nodes for the fine components are chosen. For example, the points in $\boldsymbol{t}_2$ can be chosen to correspond to a formal quadrature rule

(a) over the entire time step $[t_n,t_{n+1}]$;

(b) over each of the intervals defined by the points in $\boldsymbol{t}_1$;

(c) over several sub-intervals of the intervals defined by the points in $\boldsymbol{t}_1$.

Note that in the first case a quadrature rule that nests (or embeds) itself naturally is required (such as the Clenshaw-Curtis rule), whereas in the second and third cases the nodes in $\boldsymbol{t}_2$ usually do not correspond to a formal quadrature rule over the entire time step $[t_n,t_{n+1}]$. In the second and third cases the basis used to construct the $\boldsymbol{Q}$ and $\boldsymbol{S}$ integration matrices is typically piecewise polynomial; this allows us to shrink the effective time-step size of the fine process without constructing an excessively high-order polynomial interpolant. These three possibilities are depicted in Fig. 1.

Figure 1: MRSDC quadrature nodes. (a) The left axis shows an MRSDC hierarchy with $\boldsymbol{t}_1$ corresponding to 5 Clenshaw-Curtis nodes and $\boldsymbol{t}_2$ corresponding to 9 Clenshaw-Curtis nodes. (b) The middle axis shows an MRSDC hierarchy with $\boldsymbol{t}_1$ corresponding to 3 Gauss-Lobatto nodes and $\boldsymbol{t}_2$ corresponding to 5 Gauss-Lobatto nodes between each pair of nodes in $\boldsymbol{t}_1$. (c) The right axis shows an MRSDC hierarchy with $\boldsymbol{t}_1$ corresponding to 3 Gauss-Lobatto nodes and $\boldsymbol{t}_2$ corresponding to 3 Gauss-Lobatto nodes repeated twice between each pair of nodes in $\boldsymbol{t}_1$. The red and blue lines show that the MRSDC integration matrix $\boldsymbol{Q}_{2,2}$ is constructed with (a) one polynomial spanning all 9 fine nodes, (b) two polynomials spanning 5 fine nodes each, and (c) four polynomials spanning 3 fine nodes each.
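As a concrete instance of case (b), the short sketch below (ours, not SMC source) builds the nested hierarchy of the middle axis of Fig. 1: 3 Gauss-Lobatto coarse nodes $\boldsymbol{t}_1$ and 5 Gauss-Lobatto fine nodes on each coarse sub-interval, giving 9 fine nodes $\boldsymbol{t}_2$ with $\boldsymbol{t}_1\subset\boldsymbol{t}_2$.

! Minimal sketch (not SMC source): build the nested node hierarchy of
! case (b) -- 3 Gauss-Lobatto coarse nodes t1 and 5 Gauss-Lobatto fine
! nodes between each pair of coarse nodes, giving 9 fine nodes t2.
! Nodes are expressed as fractions of the time step [t_n, t_n + dt].
program mrsdc_nodes_demo
  implicit none
  double precision :: gl3(3), gl5(5), t1(3), t2(9)
  double precision :: a, b
  integer :: i, j
  ! Gauss-Lobatto nodes mapped to [0,1]
  gl3 = (/ 0.d0, 0.5d0, 1.d0 /)
  gl5 = (/ 0.d0, 0.5d0*(1.d0 - sqrt(3.d0/7.d0)), 0.5d0, &
           0.5d0*(1.d0 + sqrt(3.d0/7.d0)), 1.d0 /)
  t1 = gl3
  do i = 1, 2                      ! map gl5 onto each coarse sub-interval
     a = t1(i)
     b = t1(i+1)
     do j = 1, 5
        t2(4*(i-1)+j) = a + (b-a)*gl5(j)
     end do
  end do
  print *, 'coarse nodes t1:', t1
  print *, 'fine   nodes t2:', t2  ! t1 is a subset of t2
end program mrsdc_nodes_demo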

Once the SDC nodes for the desired MRSDC scheme have been chosen, the node-to-node and regular integration matrices ($\boldsymbol{S}$ and $\boldsymbol{Q}$, respectively) can be constructed. While single-rate SDC schemes, as presented in §3.1, require one set of $\boldsymbol{S}$ and $\boldsymbol{Q}$ matrices, the two-component MRSDC scheme used here requires two sets of $\boldsymbol{S}$ and $\boldsymbol{Q}$ matrices. These are denoted with subscripts $i,j$ so that the action of $\boldsymbol{S}_{i,j}$ approximates the node-to-node integrals between nodes in $\boldsymbol{t}_i$ of the $\boldsymbol{F}_j$ component that "lives" on the $\boldsymbol{t}_j$ nodes, and similarly for the $\boldsymbol{Q}_{i,j}$ matrices. Note that both $\boldsymbol{S}_{i,j}$ and $\boldsymbol{Q}_{i,j}$ are of size $M_i\times(M_j+1)$.

With these definitions, the Picard equation (13) within the time step $[t_n,t_{n+1}]$ is approximated by

\boldsymbol{U}=\boldsymbol{U}_0+\Delta t\,\boldsymbol{Q}_{2,1}\boldsymbol{F}_1+\Delta t\,\boldsymbol{Q}_{2,2}\boldsymbol{F}_2 (19)

where $\boldsymbol{U}$ is the vector of approximate solution values $U_q$ over the fine nodes in $\boldsymbol{t}_2$. The MRSDC scheme used in SMC has two operating modes:

1. The coarse ($f_1$) and fine ($f_2$) components of (18) correspond to the advection-diffusion and reaction parts of (1), respectively, and the forward-Euler sub-stepping scheme is used for both components.

2. The coarse ($f_1$) and fine ($f_2$) components of (18) correspond to the reaction and advection-diffusion parts of (1), respectively. The advection-diffusion component is treated with a forward-Euler sub-stepper, while the reaction component is treated with the variable-order BDF scheme in VODE [6].

In the first mode, the scheme is purely explicit whereas in the second mode the algorithm uses an implicit treatment of reactions. Which version is most appropriate for a given simulation depends on the relative stiffness of chemistry and fluid dynamics. When the chemical time scales are not significantly disparate from the fluid mechanical time scale, the fully explicit version will be the most efficient. When the chemistry is extremely stiff, the second mode is preferred.

The MRSDC update equation used in SMC for mode 1 over the nodes in $\boldsymbol{t}_2$ is given by

U^{k+1}_{q+1}=U^{k+1}_q+\Delta t_q\bigl[f_1(U^{k+1}_p,t_p)-f_1(U^k_p,t_p)\bigr]+\Delta t_q\bigl[f_2(U^{k+1}_q,t_q)-f_2(U^k_q,t_q)\bigr]+\Delta t\,S_{2,2}^{k,q}+\Delta t\,S_{2,1}^{k,q} (20)

where $S_{2,j}^{k,q}$ is the $q^{\rm th}$ row of $\boldsymbol{S}_{2,j}\boldsymbol{F}_j$, and $p$ is the index of the coarse node closest to the left of the fine node $q$. That is, the coarse component $f_1$ is held constant, equal to its value at the left coarse node $p=\lfloor q/(M_2/M_1)\rfloor$, during the fine updating steps. The update equation for mode 2 is

U^{k+1}_{q+1}=U^{k+1}_q+\Delta t_q\bigl[f_1(U^{k+1}_{p+1},t_p)-f_1(U^k_{p+1},t_p)\bigr]+\Delta t_q\bigl[f_2(U^{k+1}_q,t_q)-f_2(U^k_q,t_q)\bigr]+\Delta t\,S_{2,2}^{k,q}+\Delta t\,S_{2,1}^{k,q} (21)

where $f_1(U^{k+1}_{p+1},t_p)$ is computed with VODE.

The computational cost of an MRSDC scheme is determined by the number of nodes $M_\ell+1$ chosen for each multirate component $\ell$ and the number of SDC iterations $K$ taken per time step. The resulting number of function evaluations per component $\ell$ is $N_{\rm evals}^\ell=KM_\ell$. For example, with 3 Gauss-Lobatto nodes ($M_1=2$) for the coarse component and 5 Gauss-Lobatto nodes for the fine component between each pair of coarse nodes (so that there are 9 nodes in $\boldsymbol{t}_2$ and hence $M_2=8$), the number of coarse function evaluations over $K=4$ iterations is $N_{\rm evals}^1=8$, and the number of fine function evaluations is $N_{\rm evals}^2=32$.

The reacting compressible Navier-Stokes equations (1) have two characteristic time-scales: one over which advection and diffusion processes evolve, and another over which reactions evolve. The reaction time-scale is usually shorter than the advection/diffusion time-scale. As such, the coarse and fine MRSDC nodes typically correspond to the advection/diffusion and reaction parts of (1), respectively. However, as is the case for the Dimethyl Ether Jet problem shown in §5.3, the reaction terms in (1) can become so stiff that it becomes impractical to integrate them explicitly. In these cases, SMC employs the variable-order BDF schemes in VODE [6] to integrate the reaction terms implicitly. This effectively reverses the time-step constraint of the system: the reaction terms can be integrated using VODE (which internally takes many steps) with a larger time-step than the advection-diffusion terms. As such, the roles of the coarse and fine MRSDC nodes can be reversed so that the coarse and fine MRSDC nodes correspond to the reaction and advection-diffusion terms of (1), respectively, thereby amortizing the cost of the implicit solves by using fewer SDC nodes to integrate the stiff chemistry.

4 Parallelization

SMC is implemented within the BoxLib software framework with a hybrid MPI-OpenMP approach for parallelization [1]. In BoxLib, the computational domain is divided into boxes. Each MPI process owns one or more boxes via a decomposition of the spatial domain, and each process can spawn multiple OpenMP threads. For parallelization with OpenMP, the traditional approach is a fine-grained approach in which individual loops are threaded using OpenMP PARALLEL DO directives (with a COLLAPSE clause if the number of threads is relatively large). This approach works well for tasks like computing transport coefficients and chemistry production rates because there is only one PARALLEL DO region for each box. However, this approach to fine-grained parallelism does not work well for computing diffusive and hyperbolic terms. The reason is that many loops are used in computing terms like $\nabla\cdot\mathcal{F}_k$. This results in many PARALLEL DO regions with serial parts in between, which can incur significant overhead and limit parallel efficiency. To overcome these issues, a coarser-grained approach is taken for computing diffusive and hyperbolic terms. In this approach, OpenMP PARALLEL DO directives are not used. Instead, an SPMD (single program, multiple data) style is used, and each thread works on its own sub-domain. Listing 1 shows a code snippet illustrating the coarse-grained parallelism approach. Here q, D and dUdt are four-dimensional Fortran arrays for primitive variables, transport coefficients, and the rate of change of conserved variables, respectively. These shared arrays contain data for the entire box. The get_threadbox subroutine computes the bounds of the sub-box owned by each thread and stores them in the private arrays lo and hi. The original box (without ghost cells) is entirely covered by the sub-boxes without overlap. Typically, the domain decomposition among threads is performed in the $y$ and $z$-dimensions, but not in the $x$-dimension, because the $x$-dimension of our data arrays is contiguous in memory. The diffterm subroutine is called by each thread with its private lo and hi, and its local variables (including arrays) are automatically private to each thread. Note that there are no OpenMP statements in the diffterm subroutine. This approach reduces synchronization points and thread overhead, resulting in improved thread scaling as demonstrated in the next section.

Listing 1: Fortran code snippet illustrating the coarse-grained parallelism approach.
subroutine compute_dUdt() ! for a box of nx * ny * nz cells
! Only part of the subroutine is shown here
integer :: lo(3), hi(3)
!$omp parallel private(lo,hi)
call get_threadbox(lo,hi)
call diffterm(lo,hi,q,D,dUdt)
!$omp end parallel
end subroutine compute_dUdt
subroutine diffterm(lo,hi,q,D,dUdt)
! Only part of the subroutine is shown here
integer, intent(in) :: lo(3), hi(3)
! q and D have four ghost cells on each side
real, intent(in) :: q(-3:nx+4,-3:ny+4,-3:nz+4,nq)
real, intent(in) :: D(-3:nx+4,-3:ny+4,-3:nz+4,nD)
real, intent(inout) :: dUdt( 1:nx , 1:ny , 1:nz ,nU)
! tmp is a local array
real :: tmp(lo(1)-4:hi(1)+4,lo(2)-4:hi(2)+4,lo(3)-4:hi(3)+4)
! a nested loop
! In the fine-grained approach, this line would be !$omp do.
do k =lo(3)-4,hi(3)+4
do j =lo(2)-4,hi(2)+4
do i=lo(1)-4,hi(1)+4
tmp(i,j,k) = ......
end do
end do
end do
! another nested loop
! In the fine-grained approach, this line would be !$omp do.
do k =lo(3),hi(3)
do j =lo(2),hi(2)
do i=lo(1),hi(1)
dUdt(i,j,k,1) = ......
end do
end do
end do
end subroutine diffterm

5 Numerical Results

In this section we present numerical tests to demonstrate the theoretical rate of convergence and performance of SMC. We also compare simulation results obtained using SMC with the results from the low Mach number LMC code [9, 4, 27]. Numerical simulations presented here use EGLIB [12, 11] to compute transport coefficients.

5.1 Convergence

5.1.1 Space/time Convergence

In this section we present numerical tests to demonstrate the theoretical rate of convergence of SMC for the single-rate SDC integrator.

The convergence tests presented here are set up as follows. The simulations use a 9-species $\mathrm{H_2}/\mathrm{O_2}$ reaction mechanism [24]. The computational domain is $L_x=L_y=L_z=(-0.0005\,\mathrm{m},\,0.0005\,\mathrm{m})$ with periodic boundaries in all three dimensions. The initial pressure, temperature and velocity of the gas are set to

p = p_0\left[1+0.1\exp\left(-\frac{r^2}{r_0^2}\right)\right], (22)
T = T_0+T_1\exp\left(-\frac{r^2}{r_0^2}\right), (23)
v_x = v_0\sin\left(\frac{2\pi}{L_x}x\right)\cos\left(\frac{2\pi}{L_y}y\right)\cos\left(\frac{2\pi}{L_z}z\right), (24)
v_y = -v_0\cos\left(\frac{2\pi}{L_x}x\right)\sin\left(\frac{2\pi}{L_y}y\right)\cos\left(\frac{2\pi}{L_z}z\right), (25)
v_z = 0, (26)

where $p_0=1\,\mathrm{atm}$, $T_0=1100\,\mathrm{K}$, $T_1=400\,\mathrm{K}$, $v_0=3\,\mathrm{m\,s^{-1}}$, $r_0=0.0001\,\mathrm{m}$, and $r=\sqrt{x^2+y^2+z^2}$. The mole fractions of the species are initially set to zero except that

X(\mathrm{H_2}) = 0.1+0.025\exp\left(-\frac{r^2}{r_0^2}\right), (27)
X(\mathrm{O_2}) = 0.25+0.050\exp\left(-\frac{r^2}{r_0^2}\right), (28)
X(\mathrm{N_2}) = 1-X(\mathrm{H_2})-X(\mathrm{O_2}). (29)

To demonstrate convergence, we performed a series of runs with increasing spatial and temporal resolutions. A single-rate SDC scheme with 5 Gauss-Lobatto nodes and 8 SDC iterations is used for time stepping. The simulations are stopped at $t=8\times 10^{-7}\,\mathrm{s}$. When the spatial resolution changes by a factor of 2 from one run to another, we also change the time step by a factor of 2. Therefore, the convergence rate in this study is expected to be $8^{\rm th}$ order. Table 5.1.1 shows the $L_\infty$ and $L_2$-norm errors and convergence rates for $\rho$, $T$, $\boldsymbol{u}$, $Y(\mathrm{H_2})$, $Y(\mathrm{O_2})$, $Y(\mathrm{OH})$, $Y(\mathrm{H_2O})$, and $Y(\mathrm{N_2})$. Because there is no analytic solution for this problem, we compute the errors by comparing the numerical solution using $N^3$ points with the solution using $(2N)^3$ points. The $L_2$-norm error is computed as

L_2(N)=\sqrt{\frac{1}{N^3}\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{k=1}^{N}(\phi_{N,ijk}-\phi_{2N,ijk})^2}, (30)

where $\phi_{N,ijk}$ is the solution at point $(i,j,k)$ of the $N^3$-point run and $\phi_{2N,ijk}$ is the solution of the $(2N)^3$-point run at the same location. The results of our simulations demonstrate the high-order convergence rate of our scheme. The $L_\infty$ and $L_2$-norm convergence rates for the various variables approach $8^{\rm th}$ order, the theoretical order of the scheme. Relatively low orders are observed at low resolution because those numerical solutions are not yet in the regime of asymptotic convergence.
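A direct transcription of (30) might look like the function below (ours, not SMC source); the mapping of coarse point $(i,j,k)$ to fine point $(2i-1,2j-1,2k-1)$ assumes node-centered grids whose points coincide, which is an assumption of this sketch rather than a statement about SMC's grid.

! Minimal sketch (not SMC source): L2-norm error of Eq. (30) between a
! coarse run with N^3 points and a fine run with (2N)^3 points, sampled
! at the coarse-grid locations.
function l2_error(N, phiN, phi2N) result(err)
  implicit none
  integer, intent(in) :: N
  double precision, intent(in) :: phiN(N,N,N), phi2N(2*N,2*N,2*N)
  double precision :: err
  integer :: i, j, k
  err = 0.d0
  do k = 1, N
     do j = 1, N
        do i = 1, N
           err = err + (phiN(i,j,k) - phi2N(2*i-1,2*j-1,2*k-1))**2
        end do
     end do
  end do
  err = sqrt(err / dble(N)**3)
end function l2_error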

Table: Errors and convergence rates for three-dimensional simulations of a hydrogen flame using single-rate SDC and the $8^{\rm th}$ order finite difference stencil. Density $\rho$, temperature $T$ and $\boldsymbol{u}$ are in SI units.

Variable    No. of points   Δt (10^-9 s)   L∞ error    L∞ rate   L2 error    L2 rate
ρ           32^3            4              2.613E-08             5.365E-09
            64^3            2              3.354E-10   6.28      6.655E-11   6.33
            128^3           1              2.053E-12   7.35      3.438E-13   7.60
            256^3           0.5            8.817E-15   7.86      1.437E-15   7.90
T           32^3            4              3.328E-02             6.722E-03
            64^3            2              4.184E-04   6.31      8.266E-05   6.35
            128^3           1              2.515E-06   7.38      4.266E-07   7.60
            256^3           0.5            1.095E-08   7.84      1.783E-09   7.90
u_x         32^3            4              6.184E+00             8.451E-01
            64^3            2              7.368E-02   6.39      8.859E-03   6.58
            128^3           1              4.554E-04   7.34      4.559E-05   7.60
            256^3           0.5            2.011E-06   7.82      1.904E-07   7.90
u_y         32^3            4              6.677E+00             8.801E-01
            64^3            2              7.218E-02   6.53      9.121E-03   6.59
            128^3           1              4.784E-04   7.24      4.693E-05   7.60
            256^3           0.5            2.041E-06   7.87      1.961E-07   7.90
u_z         32^3            4              6.443E+00             8.632E-01
            64^3            2              7.211E-02   6.48      8.987E-03   6.59
            128^3           1              4.626E-04   7.28      4.624E-05   7.60
            256^3           0.5            2.034E-06   7.83      1.932E-07   7.90
Y(H2)       32^3            4              5.507E-08             1.272E-08
            64^3            2              8.619E-10   6.00      1.764E-10   6.17
            128^3           1              5.802E-12   7.21      9.828E-13   7.49
            256^3           0.5            2.512E-14   7.85      4.169E-15   7.88
Y(O2)       32^3            4              1.358E-06             9.166E-08
            64^3            2              1.406E-08   6.59      7.136E-10   7.00
            128^3           1              7.111E-11   7.63      3.468E-12   7.68
            256^3           0.5            2.982E-13   7.90      1.433E-14   7.92
Y(OH)       32^3            4              3.858E-12             3.791E-14
            64^3            2              2.740E-14   7.14      2.661E-16   7.15
            128^3           1              1.288E-16   7.73      1.245E-18   7.74
            256^3           0.5            5.278E-19   7.93      5.093E-21   7.93
Y(H2O)      32^3            4              7.994E-12             7.318E-14
            64^3            2              5.847E-14   7.09      5.347E-16   7.10
            128^3           1              2.778E-16   7.72      2.548E-18   7.71
            256^3           0.5            1.141E-18   7.93      1.048E-20   7.93
Y(N2)       32^3            4              1.328E-06             8.762E-08
            64^3            2              1.383E-08   6.58      6.622E-10   7.05
            128^3           1              6.999E-11   7.63      3.160E-12   7.71
            256^3           0.5            2.943E-13   7.89      1.300E-14   7.92

5.1.2 Multirate Integration

In this section we present numerical tests to demonstrate the theoretical convergence rate of the MRSDC integrator in SMC.

Table 5.1.2 shows the $L_\infty$ and $L_2$-norm errors and convergence rates for $\rho$, $T$, $u_x$, $Y(\mathrm{H_2})$, $Y(\mathrm{O_2})$, $Y(\mathrm{OH})$, $Y(\mathrm{H_2O})$, and $Y(\mathrm{N_2})$ for the same test problem as described in §5.1, except that $T_0=300\,\mathrm{K}$ and $T_1=1100\,\mathrm{K}$. All runs were performed on a $32^3$ grid, so that the errors and convergence rates reported here are ODE errors. Because there is no analytic solution for this problem, a reference solution was computed using the same spatial grid, but with an $8^{\rm th}$-order single-rate SDC integrator and $\Delta t=1.25\times 10^{-9}\,\mathrm{s}$. We note that the $L_\infty$ and $L_2$-norm convergence rates for all variables and MRSDC configurations are $4^{\rm th}$-order, as expected since the coarse MRSDC component uses 3 Gauss-Lobatto nodes. This verifies that the MRSDC implementation in SMC operates according to its specifications regardless of how the fine nodes are chosen (see §3.2).

Table: Errors and convergence rates for three-dimensional simulations of a hydrogen flame using $4^{\rm th}$ order MRSDC and the $8^{\rm th}$ order finite difference stencil. Density $\rho$, temperature $T$ and $\boldsymbol{u}$ are in SI units. The MRSDC $X$ / $Y$ configurations use $X$ coarse nodes and $Y$ fine nodes between each pair of coarse nodes (type (b) in §3.2). The MRSDC $X$ / $Y\times R$ configurations use $X$ coarse nodes and $Y$ fine nodes repeated $R$ times between each pair of coarse nodes (type (c) in §3.2). All SDC nodes are Gauss-Lobatto (GL) nodes.

SDC configuration   Variable   Δt (10^-9 s)   L∞ error    L∞ rate   L2 error    L2 rate
GL 3 / GL 9         ρ          10             5.638e-07             3.958e-08
                               5              3.392e-08   4.06      2.426e-09   4.03
                               2.5            2.093e-09   4.02      1.509e-10   4.01
                    T          10             2.748e-04             1.577e-05
                               5              1.661e-05   4.05      9.504e-07   4.05
                               2.5            1.029e-06   4.01      5.877e-08   4.02
                    u_x        10             6.928e-04             2.253e-05
                               5              4.255e-05   4.03      1.369e-06   4.04
                               2.5            2.650e-06   4.01      8.496e-08   4.01
                    Y(H2)      10             1.960e-09             5.051e-11
                               5              1.160e-10   4.08      3.059e-12   4.05
                               2.5            7.108e-12   4.03      1.896e-13   4.01
                    Y(O2)      10             9.501e-09             3.340e-10
                               5              5.762e-10   4.04      2.027e-11   4.04
                               2.5            3.562e-11   4.02      1.255e-12   4.01
                    Y(OH)      10             5.989e-17             3.531e-19
                               5              3.671e-18   4.03      2.183e-20   4.02
                               2.5            2.292e-19   4.00      1.370e-21   3.99
                    Y(H2O)     10             4.309e-18             6.660e-20
                               5              2.519e-19   4.10      3.928e-21   4.08
                               2.5            1.527e-20   4.04      2.394e-22   4.04
                    Y(N2)      10             8.743e-09             3.191e-10
                               5              5.258e-10   4.06      1.938e-11   4.04
                               2.5            3.238e-11   4.02      1.200e-12   4.01
GL 3 / GL 5 × 2     ρ          10             5.638e-07             3.958e-08
                               5              3.392e-08   4.06      2.426e-09   4.03
                               2.5            2.093e-09   4.02      1.509e-10   4.01
                    T          10             2.748e-04             1.577e-05
                               5              1.661e-05   4.05      9.504e-07   4.05
                               2.5            1.029e-06   4.01      5.877e-08   4.02
                    u_x        10             6.928e-04             2.253e-05
                               5              4.255e-05   4.03      1.369e-06   4.04
                               2.5            2.650e-06   4.01      8.496e-08   4.01
                    Y(H2)      10             1.960e-09             5.051e-11
                               5              1.160e-10   4.08      3.059e-12   4.05
                               2.5            7.108e-12   4.03      1.896e-13   4.01
                    Y(O2)      10             9.501e-09             3.340e-10
                               5              5.762e-10   4.04      2.027e-11   4.04
                               2.5            3.562e-11   4.02      1.255e-12   4.01
                    Y(OH)      10             5.989e-17             3.531e-19
                               5              3.671e-18   4.03      2.183e-20   4.02
                               2.5            2.292e-19   4.00      1.370e-21   3.99
                    Y(H2O)     10             4.309e-18             6.660e-20
                               5              2.519e-19   4.10      3.928e-21   4.08
                               2.5            1.527e-20   4.04      2.394e-22   4.04
                    Y(N2)      10             8.743e-09             3.191e-10
                               5              5.258e-10   4.06      1.938e-11   4.04
                               2.5            3.238e-11   4.02      1.200e-12   4.01

5.2 Performance

5.2.1 Parallel Performance

In this section we present numerical tests to demonstrate the parallel efficiency of SMC.

Two strong scaling studies were performed: a pure OpenMP study on a 61-core Intel Xeon Phi, and a pure MPI study on the Hopper supercomputer at the National Energy Research Scientific Computing Center (NERSC).

The pure OpenMP strong scaling study was carried out on a 61-core 1.1 GHz Intel Xeon Phi coprocessor that supports up to 244 hyperthreads. We performed a series of pure OpenMP runs of the test problem in §5.1, each consisting of a single box of $128^3$ points, using various numbers of threads. The simulations are run for 10 time steps. Table 5.2.1 shows that SMC has excellent thread scaling behavior: a factor of 86 speedup over the 1-thread run is obtained on the 61-core coprocessor using 240 hyperthreads.

The pure MPI strong scaling study was carried out on the Hopper supercomputer at NERSC. Each compute node of Hopper has 4 non-uniform memory access (NUMA) nodes, with 6 cores on each NUMA node. We performed a series of pure MPI runs of the test problem in §5.1 using various numbers of processors and integration schemes on a grid of $64^3$ points. The simulations are run for 10 time steps. Several observations can be made from the results listed in Table 5.2.1. First, when using a single processor the wide stencil is slightly faster than the narrow stencil regardless of the integration scheme used. This is consistent with the cost analysis of the stencils in §2.3, which shows that the wide stencil uses fewer FLOPs per stencil evaluation than the narrow stencil. The difference in single-processor run times, however, is not significant, which suggests that both stencils are operating in a regime dominated by memory access times. Second, when using multiple MPI processes, both the wall clock time per step and the efficiency of the narrow stencil are consistently better than those of the wide stencil regardless of the integrator being used. Again, this is consistent with the cost analysis in §2.3, which shows that the communication cost associated with the narrow stencil is less than that of the wide stencil. The difference between the stencils when using multiple MPI processes is observable but not dramatic. Regarding parallel efficiency, we note that as more processors are used the amount of work performed grows due to the presence of ghost cells where, for example, diffusion coefficients need to be computed. As such, we cannot expect perfect parallel efficiency. In the extreme case where the problem is spread across 512 processors, each processor treats an $8^3$ box, which grows to a $16^3$ box with the addition of ghost cells. Finally, note that the parallel efficiency of the MRSDC scheme is higher than that of the SRSDC scheme regardless of the stencil being used since, for this test, the MRSDC scheme performs the same number of advection/diffusion evaluations (and hence stencil evaluations and extra ghost cell work) but significantly more evaluations of the chemistry terms; MRSDC therefore has a higher arithmetic intensity with respect to MPI communication than SRSDC.

\tbl

Strong scaling behavior of pure OpenMP runs on a 61-core Intel Xeon Phi coprocessor. Average wall clock time per time step and speedup are shown for runs using various numbers of threads. Hyperthreading is used for the 128- and 240-thread runs.
No. of threads | Wall time per step (s) | Speedup
1   | 1223.58 | --
2   | 626.13  | 2.0
4   | 328.97  | 3.7
8   | 159.88  | 7.7
16  | 79.73   | 16
32  | 40.12   | 31
60  | 24.55   | 50
128 | 18.19   | 67
240 | 14.28   | 86

\tbl

Strong scaling behavior of pure MPI runs with 64^3 grid points on the Hopper supercomputer at NERSC. Average wall clock time per time step, speedup, and efficiency are shown for runs using various numbers of processors and schemes. The SRSDC scheme uses single-rate SDC with 3 nodes, with narrow and wide stencils. The MRSDC scheme uses multi-rate SDC with 3 coarse nodes and 5 fine nodes repeated 2 times between each pair of coarse nodes (type (c) in §3.2). All SDC nodes are Gauss-Lobatto (GL) nodes.
Method/Stencil | No. of MPI processes | Wall time per step (s) | Speedup | Efficiency
SRSDC/Narrow | 1   | 243.35 | --     | --
SRSDC/Narrow | 8   | 42.92  | 5.67   | 71%
SRSDC/Narrow | 64  | 7.91   | 30.84  | 48%
SRSDC/Narrow | 512 | 2.14   | 113.95 | 22%
SRSDC/Wide   | 1   | 231.60 | --     | --
SRSDC/Wide   | 8   | 43.21  | 5.36   | 67%
SRSDC/Wide   | 64  | 8.04   | 28.80  | 45%
SRSDC/Wide   | 512 | 2.27   | 102.09 | 20%
MRSDC/Narrow | 1   | 695.57 | --     | --
MRSDC/Narrow | 8   | 101.86 | 6.82   | 85%
MRSDC/Narrow | 64  | 15.25  | 45.69  | 71%
MRSDC/Narrow | 512 | 3.15   | 220.55 | 43%
MRSDC/Wide   | 1   | 685.87 | --     | --
MRSDC/Wide   | 8   | 102.45 | 6.69   | 84%
MRSDC/Wide   | 64  | 15.75  | 43.55  | 68%
MRSDC/Wide   | 512 | 3.28   | 208.85 | 41%

We also performed a weak scaling study using the test problem in section 5.1 on the Hopper supercomputer at NERSC. In this study, a hybrid MPI/OpenMP approach with 6 threads per MPI process is used, and each MPI process works on two 64^3 boxes. The simulations are run for 10 time steps. Table 5.2.1 shows the average wall clock time per time step for a series of runs on various numbers of cores. We use the 6-core run as the baseline and define the parallel efficiency of an n-core run as T_6/T_n, where T_6 and T_n are the average wall clock times per time step for the 6-core and n-core runs, respectively. Excellent scaling to about 100 thousand cores is observed.

\tbl

Weak scaling behavior of hybrid MPI/OpenMP runs on Hopper at NERSC. Average wall clock time per time step and parallel efficiency are shown for runs on various numbers of cores. Each MPI process spawns 6 OpenMP threads.
No. of cores | No. of MPI processes | Wall time per step (s) | Efficiency
6     | 1     | 13.72 | --
24    | 4     | 14.32 | 96%
192   | 32    | 14.46 | 95%
1536  | 256   | 14.47 | 95%
12288 | 2048  | 15.02 | 91%
98304 | 16384 | 15.32 | 90%
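The efficiencies in the table follow directly from the definition T_6/T_n; the small Python check below (illustrative only, with values transcribed from the table) reproduces them.

```python
# Parallel efficiency T_6 / T_n for the weak scaling runs listed above.
times = {6: 13.72, 24: 14.32, 192: 14.46, 1536: 14.47, 12288: 15.02, 98304: 15.32}
for cores, t in times.items():
    print(f"{cores:6d} cores: {times[6] / t:.0%}")   # 100%, 96%, 95%, 95%, 91%, 90%
```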

5.2.2 Multirate Integration

Table 5.2.2 compares run times and function evaluation counts for a premixed methane flame using various configurations of SDC and MRSDC and a six-stage, fourth-order Runge-Kutta integrator. The configuration is a cubic box (L_x = L_y = L_z = 0.008 m) with periodic boundary conditions in the x and y directions and inlet/outlet boundary conditions in the z direction. The GRI-MECH 3.0 chemistry network (53 species, 325 reactions) [33] was used, so the reaction terms in (1) are quite stiff and an MRSDC integrator with the reaction terms on the fine nodes is appropriate. A reference solution was computed using a single-rate SDC integrator with a time step Δt of 1×10^{-9} s to a final time of 4.2×10^{-7} s. All simulations were run on a 32^3 point grid. For each integrator, the test was run with successively larger time steps until the solution no longer matched the reference solution; the largest time step that matched the reference solution (to within a relative L∞ error of less than 1×10^{-5}) is reported. All SDC integrators use 3 Gauss-Lobatto nodes for the coarse component, and hence all integrators, including the Runge-Kutta integrator, are formally 4th-order accurate. We note that all of the multi-rate integrators are able to use larger time steps than the SRSDC and RK integrators due to the stiffness of the reaction terms: the multi-rate integrators use more nodes to integrate the reaction terms, and hence the effective time steps used in the update equation (20) are smaller. This means that, in all cases, the multi-rate integrators evaluate the advection/diffusion terms less frequently than the single-rate integrator. However, in all but one case, the multi-rate integrators also evaluate the reaction terms more frequently than the SRSDC and RK integrators. The trade-off between the frequency of advection/diffusion and reaction evaluations is highlighted by comparing the MRSDC 3 GL / 9 GL and MRSDC 3 GL / 13 GL runs: the run with 13 fine nodes can use a slightly larger time step at the cost of evaluating the reaction terms more frequently, resulting in only a very slight runtime advantage over the run using 9 fine nodes and a smaller time step. Ultimately, the largest runtime advantage is realized by choosing a multi-rate configuration that allows a large time step to be taken without requiring too many extra calls to the reaction network. For this test the MRSDC 3 / 5×2 configuration achieves this goal and runs to completion approximately 6.0 times faster than the single-rate SDC integrator and 4.5 times faster than the RK integrator.

\tbl

Comparison of runtime and function evaluation counts for various configurations of SDC and MRSDC. Function evaluation counts are reported as the number of advection/diffusion evaluations / the number of chemistry evaluations. All schemes use the narrow stencil, except where noted. The SRSDC X configuration uses single-rate SDC with X nodes. The RK scheme is fourth order and uses six stages. The MRSDC X / Y configurations use MRSDC with X coarse nodes and Y fine nodes between each pair of coarse nodes (type (b) in §3.2). The MRSDC X / Y×R configurations use MRSDC with X coarse nodes and Y fine nodes repeated R times between each pair of coarse nodes (type (c) in §3.2). All SDC nodes are Gauss-Lobatto (GL) nodes.
Scheme | Δt (10^-9 s) | Runtime (s) | Evaluations (AD/R)
SRSDC, 3 GL                    | 6  | 130.6 | 561/561
RK, 6/4                        | 6  | 96.5  | 420/420
RK, 6/4 wide                   | 6  | 95.4  | 420/420
MRSDC, 3 GL / 9 GL, type (b)   | 30 | 44.1  | 113/897
MRSDC, 3 GL / 13 GL, type (b)  | 42 | 39.7  | 81/961
MRSDC, 3 GL / 3×8 GL, type (c) | 60 | 30.8  | 57/897
MRSDC, 3 GL / 5×2 GL, type (c) | 60 | 21.7  | 57/449
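The evaluation counts in the table can be reproduced with simple bookkeeping. The Python sketch below is our own reconstruction, not code from SMC; it assumes 4 SDC iterations per step (the iteration count is not stated for this particular test), that each iteration re-evaluates every quadrature node except the first (which is reused from the previous step), plus one initial evaluation, and a final time of 4.2×10^{-7} s as used for the reference solution. The function and parameter names are hypothetical.

```python
# Sketch: bookkeeping model for the advection/diffusion (AD) and reaction (R)
# evaluation counts in the table above. Assumptions are listed in the lead-in.
final_time = 4.2e-7

def sdc_counts(dt, n_coarse, n_fine=None, repeat=1, iters=4):
    """Return (AD evaluations, reaction evaluations) for SRSDC/MRSDC."""
    steps = round(final_time / dt)
    coarse_new = n_coarse - 1                        # new coarse nodes per iteration
    ad = steps * iters * coarse_new + 1              # +1 for the initial evaluation
    if n_fine is None:                               # single-rate: chemistry on coarse nodes
        return ad, ad
    fine_new = coarse_new * repeat * (n_fine - 1)    # new fine nodes per iteration
    chem = steps * iters * fine_new + 1
    return ad, chem

print(sdc_counts(6e-9, 3))                           # SRSDC 3 GL        -> (561, 561)
print(sdc_counts(30e-9, 3, n_fine=9))                # MRSDC 3 / 9       -> (113, 897)
print(sdc_counts(42e-9, 3, n_fine=13))               # MRSDC 3 / 13      -> (81, 961)
print(sdc_counts(60e-9, 3, n_fine=3, repeat=8))      # MRSDC 3 / 3x8     -> (57, 897)
print(sdc_counts(60e-9, 3, n_fine=5, repeat=2))      # MRSDC 3 / 5x2     -> (57, 449)
```

Under these assumptions the model reproduces every entry in the evaluations column, which makes the trade-off discussed above explicit: larger time steps reduce the AD count, while more (or more finely repeated) fine nodes increase the chemistry count.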

5.3 Dimethyl Ether Jet

We present here a two-dimensional simulation of a jet using a 39-species dimethyl ether (DME) chemistry mechanism [2]. A 2D Cartesian domain, -0.00114 m < x < 0.00114 m and 0 < y < 0.00228 m, is used for this test. The pressure is initially set to 40 atm everywhere. The initial temperature, velocity and species mole fractions are set to

T_0 = \eta T_{\mathrm{jet}} + (1-\eta) T_{\mathrm{air}},   (31)
v_{0x} = 0,   (32)
v_{0y} = \eta v_{\mathrm{jet}} + (1-\eta) v_{\mathrm{air}},   (33)
X_0 = \eta X_{\mathrm{jet}} + (1-\eta) X_{\mathrm{air}},   (34)

where T_jet = 400 K, T_air = 1525 K, v_jet = 51.2 m s^{-1}, and v_air = 5.12 m s^{-1}. The variable η is given by

\eta = \frac{1}{2}\left(\tanh\frac{x+x_0}{\sigma} - \tanh\frac{x-x_0}{\sigma}\right),   (35)

where x_0 = 5.69×10^{-5} m and σ = 0.5 x_0. The mole fractions of species for the “jet” and “air” states are set to zero except that

X_{\mathrm{jet}}(\mathrm{CH_3OCH_3}) = 0.2,   (36)
X_{\mathrm{jet}}(\mathrm{N_2}) = 0.8,   (37)
X_{\mathrm{air}}(\mathrm{O_2}) = 0.21,   (38)
X_{\mathrm{air}}(\mathrm{N_2}) = 0.79.   (39)

An inflow boundary is used at the lower y-boundary, whereas outflow boundary conditions are applied at the other three boundaries. A sinusoidal variation is added to the inflow velocity as follows,

v_y(x,t) = v_{0y} + \tilde{v}\,\eta\,\sin\left(\frac{2\pi}{L_x}x\right)\sin\left(\frac{2\pi}{L_t}t\right),   (40)

where ṽ = 10 m s^{-1}, L_x = 0.00228 m is the length of the domain in the x-direction, and L_t = 10^{-5} s is the period of the variation. With these parameters, the resulting flow problem is in the low Mach number regime, which facilitates cross-validation with an established low Mach number solver [9].
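For readers who wish to reproduce the setup, the following Python sketch assembles the initial and inflow profiles directly from equations (31)-(35) and (40). It is an illustration only (the grid and the field layout are ours, not SMC's); the pressure is simply 40 atm everywhere and is not shown.

```python
import numpy as np

# DME jet initial and inflow profiles, eqs. (31)-(35) and (40); illustrative only.
x0, sigma = 5.69e-5, 0.5 * 5.69e-5
T_jet, T_air = 400.0, 1525.0
v_jet, v_air = 51.2, 5.12
Lx, Lt, v_tilde = 0.00228, 1.0e-5, 10.0

def eta(x):
    """Smoothed top-hat profile marking the jet core, eq. (35)."""
    return 0.5 * (np.tanh((x + x0) / sigma) - np.tanh((x - x0) / sigma))

def initial_state(x):
    """Initial temperature, y-velocity and mole fractions, eqs. (31)-(34)."""
    e = eta(x)
    T0 = e * T_jet + (1.0 - e) * T_air
    v0y = e * v_jet + (1.0 - e) * v_air
    X_jet = {"CH3OCH3": 0.2, "N2": 0.8}
    X_air = {"O2": 0.21, "N2": 0.79}
    species = set(X_jet) | set(X_air)
    X0 = {s: e * X_jet.get(s, 0.0) + (1.0 - e) * X_air.get(s, 0.0) for s in species}
    return T0, v0y, X0

def inflow_vy(x, t):
    """Inflow velocity with sinusoidal forcing, eq. (40)."""
    _, v0y, _ = initial_state(x)
    return v0y + v_tilde * eta(x) * np.sin(2 * np.pi * x / Lx) * np.sin(2 * np.pi * t / Lt)

x = np.linspace(-0.00114, 0.00114, 9)
print(initial_state(x)[0])      # temperature profile across the jet
print(inflow_vy(x, 2.5e-6))     # forced inflow velocity at an example time
```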

The 39-species DME mechanism used in this test is extremely stiff, so it is impractical to evolve the reaction system explicitly. Hence, we use the variable-order BDF scheme in VODE [6] to compute the species production rates. In this test we use MRSDC and place the reaction terms ρω̇_k in (1c) on the coarse nodes and the advection-diffusion terms on the fine nodes. Due to the use of an implicit solver, the reaction terms can be evaluated on the coarse nodes with a large time step without suffering from instability, whereas the time step for the advection-diffusion terms is limited by the acoustic time scale. In this test, we use 5 coarse Gauss-Lobatto nodes for the reaction terms, and for the advection-diffusion terms we use 3 fine Gauss-Lobatto nodes repeated twice between each pair of coarse nodes. The time step is chosen to give a Courant-Friedrichs-Lewy (CFL) number of 2. The effective CFL numbers for a substep of advection-diffusion (i.e., the interval spanned by 3 fine Gauss-Lobatto nodes) are about 0.17 and 0.33. (Note that the 5 coarse Gauss-Lobatto nodes are not evenly distributed.) We perform 4 SDC iterations per time step. The simulation is run with 2048^2 uniform points, corresponding to a cell size of ~1.1×10^{-6} m, up to time 6×10^{-5} s. The numerical results are shown in figures 2 & 3. As a validation of SMC, we also present numerical results obtained with the LMC code, which is based on a low Mach number formulation [9, 4, 27] that approximates the compressible flow equations to O(M^2). The results of the two codes are nearly indistinguishable in spite of the difference between the two formulations; this is because the maximum Mach number M in this problem is only 0.14, so the low Mach number model is applicable. We note that this simulation is essentially infeasible with a fully explicit solver because of the stiffness of the chemical mechanism.
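The quoted effective CFL numbers follow from the uneven spacing of the Gauss-Lobatto nodes. The short Python check below is our own illustration (not SMC code): with a full-step CFL of 2, the 5-node Gauss-Lobatto spacing gives coarse sub-interval CFL numbers of roughly 0.35 and 0.65, and since each coarse sub-interval is covered by two repetitions of the 3 fine nodes, each fine substep carries half of that.

```python
import numpy as np

# Effective CFL per advection-diffusion substep for the DME jet configuration.
cfl_step = 2.0

interior = np.polynomial.legendre.Legendre.basis(4).deriv().roots()   # roots of P_4'
nodes = np.sort(np.concatenate(([-1.0], interior, [1.0])))            # 5 GL nodes on [-1, 1]

coarse_cfl = cfl_step * np.diff(nodes) / 2.0   # per coarse sub-interval: ~0.35, 0.65, 0.65, 0.35
fine_cfl = coarse_cfl / 2.0                    # per fine substep:        ~0.17, 0.33, 0.33, 0.17
print(coarse_cfl, fine_cfl)
```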

Figure 2: Density and temperature at 6×10^{-5} s for the DME jet test problem. We show (a) density from SMC, (b) density from LMC, (c) temperature from SMC, and (d) temperature from LMC. Only the part of the domain immediately surrounding the jet is shown here.
Figure 3: Mass fractions of OH and C2H4 at 6×10^{-5} s for the DME jet test problem. We show (a) Y(OH) from SMC, (b) Y(OH) from LMC, (c) Y(C2H4) from SMC, and (d) Y(C2H4) from LMC. Only the part of the domain immediately surrounding the jet is shown here.

6 Conclusions

We have described a numerical algorithm for integrating the multicomponent, reacting, compressible Navier-Stokes equations that is eighth-order in space with fourth-order to eighth-order temporal integration. The methodology uses a narrow stencil approximation of diffusive terms that reduces communication compared to existing methods and removes the need for a filtering algorithm to remove Nyquist-frequency oscillations that are not damped with traditional approaches. It also incorporates a multirate temporal integration strategy that provides an efficient mechanism for treating chemical mechanisms that are stiff relative to fluid dynamical time scales. The narrow stencil uses more floating point operations per stencil evaluation than the standard wide stencil but requires less communication. A strong scaling study demonstrated that the narrow stencil scales better than the wide stencil, which has important ramifications for stencil performance on future multicore machines. The multirate integration approach supports two modes of operation: one treats the reactions explicitly using smaller time steps than are used for the fluid mechanics; the other treats the reactions implicitly for cases in which the chemical time scales are significantly faster than the fluid dynamical time scales. A performance study of the multi-rate integration scheme with realistic chemical kinetics showed that the multi-rate integrator can operate with a larger time step than single-rate integrators and ultimately obtains an accurate solution in less time. The implementation uses a hybrid programming model designed for effective utilization of a large number of threads and is targeted toward next-generation many-core architectures. The code scales well to approximately 100K cores and shows excellent thread performance on a 61-core Intel Xeon Phi coprocessor. We have presented numerical results demonstrating the convergence properties of the algorithm with realistic chemical kinetics and validating the algorithm against results obtained with an established low Mach number solver.

Acknowledgements

This work was supported by the Applied Mathematics Program and the Exascale Co-Design Program of the U.S. Department of Energy Office of Advanced Scientific Computing Research under contract DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

References

  • [1] Center for Computational Sciences and Engineering, Lawrence Berkeley National Laboratory, BoxLib, software available at https://ccse.lbl.gov/BoxLib/.
  • [2] G. Bansal, A. Mascarenhas, J. H. Chen, T. Lu, and Z. Luo, Direct numerical simulations of autoignition in stratified dimethyl-ether (DME)/air turbulent mixtures, Combustion and Flame, in preparation (2013).
  • [3] M. Baum, T. J. Poinsot, D. C. Haworth, and N. Darabiha, Direct numerical simulation of H2/O2/N2 flames with complex chemistry in two-dimensional turbulent flows, J. Fluid Mech., 281 (1994), pp. 1–32.
  • [4] J. B. Bell, M. S. Day, A. S. Almgren, M. J. Lijewski, and C. A. Rendleman, Adaptive numerical simulation of turbulent premixed combustion, in Proceedings of the first MIT conference on computational fluid and solid mechanics, Cambridge, MA, June 2001, MIT Press.
  • [5] A. Bourlioux, A. T. Layton, and M. L. Minion, High-order multi-implicit spectral deferred correction methods for problems of reactive flow, Journal of Computational Physics, 189 (2003), pp. 651–675.
  • [6] P. Brown, G. Byrne, and A. Hindmarsh, VODE: A variable-coefficient ODE solver, SIAM Journal on Scientific and Statistical Computing, 10 (1989), pp. 1038–1051.
  • [7] N. Chakraborty and S. Cant, Unsteady effects of strain rate and curvature on turbulent premixed flames in an inflow-outflow configuration, Combustion and Flame, 137 (2004), pp. 129 – 147.
  • [8] J. H. Chen and H. Im, Correlation of flame speed with stretch in turbulent premixed methane/air flames, Proc. Combust. Inst., 27 (1998), pp. 819–826.
  • [9] M. S. Day and J. B. Bell, Numerical simulation of laminar reacting flows with complex chemistry, Combust. Theory Modelling, 4 (2000), pp. 535–556.
  • [10] A. Dutt, L. Greengard, and V. Rokhlin, Spectral deferred correction methods for ordinary differential equations, BIT Numerical Mathematics, 40 (2000), pp. 241–266.
  • [11] A. Ern, C. C. Douglas, and M. D. Smooke, Detailed chemistry modeling of laminar diffusion flames on parallel computers, Int. J. Sup. App., 9 (1995), pp. 167–186.
  • [12] A. Ern and V. Giovangigli, Multicomponent Transport Algorithms, vol. m24 of Lecture Notes in Physics, Springer-Verlag, Berlin, 1994.
  • [13] V. Giovangigli, Multicomponent Flow Modeling, Birkhäuser, Boston, 1999.
  • [14] A. C. Hansen and J. Strain, Convergence theory for spectral deferred correction, Preprint, (2006).
  • [15] D. C. Haworth and T. J. Poinsot, Numerical simulations of Lewis number effects in turbulent premixed flames, J. Fluid Mech., 244 (1992), pp. 405–436.
  • [16] J. Huang, J. Jia, and M. Minion, Accelerating the convergence of spectral deferred correction methods, Journal of Computational Physics, 214 (2006), pp. 633–656.
  • [17] K. Jenkins and R. Cant, Direct numerical simulation of turbulent flame kernels, in Recent Advances in DNS and LES, D. Knight and L. Sakell, eds., vol. 54 of Fluid Mechanics and its Applications, Springer Netherlands, 1999, pp. 191–202.
  • [18] R. Kamakoti and C. Pantano, High-order narrow stencil finite-difference approximations of second-order derivatives involving variable coefficients, SIAM Journal on Scientific Computing, 31 (2010), pp. 4222–4243.
  • [19] C. A. Kennedy and M. H. Carpenter, Several new numerical methods for compressible shear-layer simulations, Applied Numerical Mathematics, 14 (1994), pp. 397 – 433.
  • [20] C. A. Kennedy, M. H. Carpenter, and R. M. Lewis, Low-storage, explicit Runge-Kutta schemes for the compressible Navier-Stokes equations, Applied Numerical Mathematics, 35 (2000), pp. 177–219.
  • [21] H. Kolla, R. W. Grout, A. Gruber, and J. H. Chen, Mechanisms of flame stabilization and blowout in a reacting turbulent hydrogen jet in cross-flow, Combustion and Flame, 159 (2012), pp. 2755 – 2766. Special Issue on Turbulent Combustion.
  • [22] A. T. Layton and M. L. Minion, Conservative multi-implicit spectral deferred correction methods for reacting gas dynamics, Journal of Computational Physics, 194 (2004), pp. 697–715.
  • [23] S. K. Lele, Compact finite difference schemes with spectral-like resolution, Journal of Computational Physics, 103 (1992), pp. 16 – 42.
  • [24] J. Li, Z. Zhao, A. Kazakov, and F. L. Dryer, An updated comprehensive kinetic model of hydrogen combustion, International Journal of Chemical Kinetics, 36 (2004), pp. 566–575.
  • [25] M. L. Minion, Semi-implicit spectral deferred correction methods for ordinary differential equations, Communications in Mathematical Sciences, 1 (2003), pp. 471–500.
  • [26]  , Semi-implicit projection methods for incompressible flow based on spectral deferred corrections, Applied numerical mathematics, 48 (2004), pp. 369–387.
  • [27] A. Nonaka, J. B. Bell, M. S. Day, C. Gilet, A. S. Almgren, and M. L. Minion, A deferred correction coupling strategy for low mach number flow with complex chemistry, Combustion Theory and Modelling, 16 (2012), pp. 1053–1088.
  • [28] T. Poinsot, S. Candel, and A. Trouvé, Applications of direct numerical simulation to premixed turbulent combustion, Progress in Energy and Combustion Science, 21 (1995), pp. 531–576.
  • [29] T. Poinsot and S. K. Lele, Boundary conditions for direct simulations of compressible viscous flows, Journal of Computational Physics, 101 (1992), pp. 104–129.
  • [30] T. Poinsot and D. Veynante, Theoretical and Numerical Combustion, R. T. Edwards, Inc., Philadelphia, first ed., 2001.
  • [31] R. Sankaran, E. R. Hawkes, J. H. Chen, T. Lu, and C. K. Law, Structure of a spatially developing turbulent lean methane-air bunsen flame, Proceedings of the Combustion Institute, 31 (2007), pp. 1291 – 1298.
  • [32] R. Sankaran, E. R. Hawkes, C. Y. Yoo, J. H. Chen, T. Lu, and C. K. Law, Direct numerical simulation of stationary lean premixed methane-air flames under intense turbulence, in Proceedings of the 5th US Combustion Meeting (CD-ROM), Western States Section of the Combustion Institute, Mar. 2007. Paper B09.
  • [33] G. P. Smith et al., GRI-Mech 3.0. Available at http://www.me.berkeley.edu/gri_mech.
  • [34] J. C. Sutherland and C. A. Kennedy, Improved boundary conditions for viscous, reacting, compressible flows, Journal of Computational Physics, 191 (2003), pp. 502 – 524.
  • [35] M. Tanahashi, M. Fujimura, and T. Miyauchi, Coherent fine scale eddies in turbulent premixed flames, Proc. Combust. Inst., 28 (2000), pp. 529–535.
  • [36] K. W. Thompson, Time dependent boundary conditions for hyperbolic systems, Journal of Computational Physics, 68 (1987), pp. 1 – 24.
  • [37]  , Time-dependent boundary conditions for hyperbolic systems, II, Journal of Computational Physics, 89 (1990), pp. 439–461.
  • [38] L. Vervisch and T. Poinsot, Direct numerical simulation of non-premixed turbulent flames, Annual Review of Fluid Mechanics, 30 (1998), pp. 655–691.
  • [39] C. S. Yoo and H. G. Im, Characteristic boundary conditions for simulations of compressible reacting flows with multi-dimensional, viscous and reaction effects, Combustion Theory and Modelling, 11 (2007), pp. 259–286.
  • [40] C. S. Yoo, E. S. Richardson, R. Sankaran, and J. H. Chen, A DNS study on the stabilization mechanism of a turbulent lifted ethylene jet flame in highly-heated coflow, Proceedings of the Combustion Institute, 33 (2011), pp. 1619 – 1627.
  • [41] C. S. Yoo, Y. Wang, A. Trouvé, and H. G. Im, Characteristic boundary conditions for direct simulations of turbulent counterflow flames, Combustion Theory and Modelling, 9 (2005), pp. 617–646.
\appendices

7 Eighth-Order Narrow Stencil: Free Parameters and Numerical Experiments

The truncation error (numerical minus exact) of the 8^{th}-order narrow stencil discretization of \partial(a\,\partial u/\partial x)/\partial x (§ 2.2) is given by

e_i = -\frac{1}{3150}\,a\,\frac{\partial^{10}u}{\partial x^{10}} - \frac{1}{630}\frac{\partial a}{\partial x}\frac{\partial^{9}u}{\partial x^{9}} - \frac{1}{35}\frac{\partial^{2}a}{\partial x^{2}}\frac{\partial^{8}u}{\partial x^{8}} + \left(-\frac{113}{840} - 4m_{48}\right)\frac{\partial^{3}a}{\partial x^{3}}\frac{\partial^{7}u}{\partial x^{7}}
    + \left(-\frac{487}{1680} - 14m_{48}\right)\frac{\partial^{4}a}{\partial x^{4}}\frac{\partial^{6}u}{\partial x^{6}} + \left(-\frac{4513}{12600} + \frac{2}{5}m_{47} - \frac{92}{5}m_{48}\right)\frac{\partial^{5}a}{\partial x^{5}}\frac{\partial^{5}u}{\partial x^{5}}
    + \left(-\frac{2777}{10080} + m_{47} - 11m_{48}\right)\frac{\partial^{6}a}{\partial x^{6}}\frac{\partial^{4}u}{\partial x^{4}} + \left(-\frac{3181}{25200} + \frac{4}{5}m_{47} - \frac{14}{5}m_{48}\right)\frac{\partial^{7}a}{\partial x^{7}}\frac{\partial^{3}u}{\partial x^{3}}
    + \left(-\frac{1403}{50400} + \frac{1}{5}m_{47} - \frac{1}{5}m_{48}\right)\frac{\partial^{8}a}{\partial x^{8}}\frac{\partial^{2}u}{\partial x^{2}} - \frac{1}{630}\frac{\partial^{9}a}{\partial x^{9}}\frac{\partial u}{\partial x} + O(\Delta x^{8}),   (41)

where a and all derivatives of a and u are evaluated at x = x_i. An upper bound on the truncation error can be calculated by summing the absolute value of each term in (41). Assuming that all products (\partial^n a/\partial x^n)(\partial^{10-n} u/\partial x^{10-n}) are of similar order, we can minimize this upper bound by setting m_{47} = 3557/44100 and m_{48} = -2083/117600. We do not expect this particular choice to be the best in general.
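Because each m-dependent coefficient in (41) is affine in m_{47} and m_{48}, minimizing the sum of their absolute values is a small linear program. The Python sketch below is our own check (not SMC code) under the equal-weighting assumption stated above; the m-independent terms only add a constant and are omitted.

```python
import numpy as np
from scipy.optimize import linprog

# Coefficients of (41) that depend on the free parameters: c_k = a_k + b_k*m47 + d_k*m48.
a = np.array([-113/840, -487/1680, -4513/12600, -2777/10080, -3181/25200, -1403/50400])
b = np.array([0.0, 0.0, 2/5, 1.0, 4/5, 1/5])
d = np.array([-4.0, -14.0, -92/5, -11.0, -14/5, -1/5])
n = len(a)

# Variables: [m47, m48, t_1..t_n]; minimize sum(t) subject to t_k >= |c_k|.
c = np.concatenate(([0.0, 0.0], np.ones(n)))
A_ub = np.zeros((2 * n, 2 + n))
b_ub = np.zeros(2 * n)
for k in range(n):
    A_ub[2 * k, :2] = [b[k], d[k]];   A_ub[2 * k, 2 + k] = -1.0;   b_ub[2 * k] = -a[k]
    A_ub[2 * k + 1, :2] = [-b[k], -d[k]];   A_ub[2 * k + 1, 2 + k] = -1.0;   b_ub[2 * k + 1] = a[k]
bounds = [(None, None), (None, None)] + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x[:2])                    # ~[0.080658, -0.017713]
print(3557/44100, -2083/117600)     # the values quoted in the text
```

Running this recovers the quoted parameter values, which correspond to the vertex of the piecewise-linear bound at which the fifth- and sixth-derivative coefficients vanish.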

We now present numerical results for the two test problems in [18] using the narrow stencil. The test problems solve the two-dimensional heat equation

\frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left(a(x,y)\frac{\partial u}{\partial x}\right) + \frac{\partial}{\partial y}\left(b(x,y)\frac{\partial u}{\partial y}\right) + g(x,y,t),   (42)

on the periodic spatial domain [0, 2π] × [0, 2π]. The initial condition is given by

u(x,y,t=0) = \sin(x)\sin(y).   (43)

In the first test problem (T1), the variable coefficients and source term are given by

a(x,y) = b(x,y) = 1 + \epsilon\cos(x)\cos(y),   (44)
g(x,y,t) = (1 + 4\epsilon\cos(x)\cos(y))\,u_{\mathrm{ex}}(x,y,t),   (45)

where ε = 0.1, and the exact solution is

u_{\mathrm{ex}}(x,y,t) = e^{-t}\sin(x)\sin(y).   (46)

In the second test problem (T2), the variable coefficients and source term are given by

a(x,y) = 1 + \epsilon\cos(2x)\sin\left(2y + \frac{\pi}{3}\right),   (47)
b(x,y) = 1 + \epsilon\cos\left(2x + \frac{\pi}{3}\right)\sin(2y),   (48)
g(x,y,t) = 0,   (49)

where ε = 0.9. There is no exact analytic solution for the second test.

We have performed a series of calculations at various resolutions. A 4th-order SDC scheme with 3 Gauss-Lobatto nodes and 4 SDC iterations is used with time step Δt = 0.4((Δx^{-2} + Δy^{-2}) × max(a(x,y), b(x,y)))^{-1}. Three sets of the two free parameters in the narrow stencil are used in these runs: (1) m_{47} = 3557/44100 and m_{48} = -2083/117600, hereafter denoted “SMC”; (2) m_{47} = 0 and m_{48} = 0, hereafter denoted “ZERO”; and (3) m_{47} = 1059283/13608000 and m_{48} = -856481/40824000, hereafter denoted “OPTIMAL”. The SMC stencil minimizes the upper bound of the truncation error, whereas the OPTIMAL stencil is designed to minimize the 2-norm of the truncation error for these test problems given the initial condition. The two tables below show the L∞ and L2-norm errors and convergence rates for the three stencils. For the second test problem, the errors are computed by comparing the solutions to a high-resolution run using 640^2 cells. The designed order of convergence is observed in the numerical experiments; in particular, the convergence rate is independent of the choice of the two free parameters, as expected. The SMC stencil, which is the default stencil used in SMC, performs well in both tests. The OPTIMAL stencil, which is optimized specifically for these two test problems, produces the smallest errors (especially for the first test problem). Finally, note that the accuracy of the stencil is essentially insensitive to the two free parameters for the second, more challenging, test problem.
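The convergence rates reported in the tables follow from successive errors on grids refined by a factor of two, rate = log2(E_coarse/E_fine). The short Python check below (illustrative only, using the SMC-stencil L∞ errors for the first test problem) reproduces the tabulated rates.

```python
import math

# Observed convergence rate between successive grid refinements (factor of 2).
errors = [4.167e-08, 1.721e-10, 6.827e-13, 2.676e-15]   # SMC stencil, L_inf, test problem 1
for coarse, fine in zip(errors, errors[1:]):
    print(f"{math.log2(coarse / fine):.2f}")             # 7.92, 7.98, 8.00
```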

\tbl

Errors and convergence rates for the two-dimensional stencil test problem 1 at t = 1. Quadruple precision is used in this test to minimize roundoff errors in the floating-point computation.
Stencil | No. of Points | L∞ Error | L∞ Rate | L2 Error | L2 Rate
SMC     | 20^2  | 4.167E-08 | --   | 1.984E-08 | --
SMC     | 40^2  | 1.721E-10 | 7.92 | 8.005E-11 | 7.95
SMC     | 80^2  | 6.827E-13 | 7.98 | 3.152E-13 | 7.99
SMC     | 160^2 | 2.676E-15 | 8.00 | 1.234E-15 | 8.00
ZERO    | 20^2  | 2.642E-07 | --   | 1.394E-07 | --
ZERO    | 40^2  | 1.223E-09 | 7.76 | 5.911E-10 | 7.88
ZERO    | 80^2  | 4.877E-12 | 7.97 | 2.357E-12 | 7.97
ZERO    | 160^2 | 1.915E-14 | 7.99 | 9.252E-15 | 7.99
OPTIMAL | 20^2  | 1.444E-08 | --   | 7.320E-09 | --
OPTIMAL | 40^2  | 5.757E-11 | 7.97 | 2.880E-11 | 7.99
OPTIMAL | 80^2  | 2.261E-13 | 7.99 | 1.130E-13 | 7.99
OPTIMAL | 160^2 | 8.842E-16 | 8.00 | 4.422E-16 | 8.00

\tbl

Errors and convergence rates for the two-dimensional stencil test problem 2 at t = 1.
Stencil | No. of Points | L∞ Error | L∞ Rate | L2 Error | L2 Rate
SMC     | 40^2  | 2.279E-04 | --   | 4.270E-05 | --
SMC     | 80^2  | 4.261E-06 | 5.74 | 6.151E-07 | 6.12
SMC     | 160^2 | 3.351E-08 | 6.99 | 4.021E-09 | 7.26
SMC     | 320^2 | 1.639E-10 | 7.68 | 1.835E-11 | 7.78
ZERO    | 40^2  | 2.611E-04 | --   | 4.913E-05 | --
ZERO    | 80^2  | 4.744E-06 | 5.78 | 6.918E-07 | 6.15
ZERO    | 160^2 | 3.682E-08 | 7.01 | 4.481E-09 | 7.27
ZERO    | 320^2 | 1.797E-10 | 7.68 | 2.039E-11 | 7.78
OPTIMAL | 40^2  | 2.221E-04 | --   | 4.160E-05 | --
OPTIMAL | 80^2  | 4.174E-06 | 5.73 | 6.014E-07 | 6.11
OPTIMAL | 160^2 | 3.291E-08 | 6.99 | 3.938E-09 | 7.25
OPTIMAL | 320^2 | 1.610E-10 | 7.68 | 1.798E-11 | 7.77