
Memory-efficient $w$-projection with the fast Gauss transform

K. W. Bannister1,2,3, T. J. Cornwell1
1CSIRO Astronomy and Space Science, PO Box 76, Epping NSW 1710, Australia
2Bolton Fellow
E-mail: keith.bannister@csiro.au
(Accepted 2011 XXX XXX. Received XXX December XX; in original form XXX XXX XXX)
Abstract

We describe a method for performing $w$-projection using the fast Gauss transform of Strain (1991). We derive the theoretical performance, and simulate the actual performance for a range of $w$ for a canonical array. While our implementation is dominated by overheads, we argue that this approach could form the basis of higher-performing algorithms, with particular application to the Square Kilometre Array.

keywords:
techniques: interferometric
pagerange: Memory-efficient $w$-projection with the fast Gauss transform–References
pubyear: 2011

1 Introduction

Interferometers are composed of an array of antennas arranged in a three-dimensional volume. For long observations, even instantaneously planar arrays have baselines that are non-coplanar due to Earth rotation synthesis. Imaging with non-coplanar baselines requires a de-projection technique to compensate for the non-coplanarity. To date, $w$-projection (Cornwell et al., 2008) is the most computationally efficient algorithm known.

Interferometric images are usually formed by convolutional resampling of the measured visibilities on a regularly sampled, two-dimensional $uv$ plane. Convolutional resampling operates by multiplying each visibility by a convolution function and adding the result to the $uv$ plane. In the simplest case, the convolution function is used for anti-aliasing purposes, although it can also be used to compensate for primary beam effects or, in the case of $w$-projection, for non-coplanar baselines. $w$-projection uses a convolution function that is dependent on $w$, and which is the convolution of an anti-aliasing function with a Fresnel term.

While more efficient than other methods (Cornwell et al., 2008), $w$-projection does have some shortcomings. Firstly, the convolution functions take time to compute and, for a short observation, can dominate the time to compute an image. Secondly, the amount of memory required to store the convolution functions can be large, and can exhaust the capabilities of a computing node. Finally, the multiply-add operation is generally memory-bandwidth bound, so that the central processing unit (CPU) spends the majority of its time waiting for data to arrive, and relatively little time performing the multiply-add itself.

The memory constraints of $w$-projection are of particular concern. The trend in computing has been for memory bandwidth to increase much more slowly than arithmetic capacity, which has roughly doubled every 18 months (Moore’s law), and this trend is likely to continue for the foreseeable future. Therefore, the performance of memory-bandwidth-bound algorithms such as $w$-projection will not improve as quickly as that of algorithms bound by arithmetic capacity. Furthermore, modern high-performance computers rely on a very large number of small nodes, each with relatively limited memory. The requirement to store large convolution functions is, therefore, at odds with nodes that have small amounts of memory.

Memory-efficient algorithms, therefore, will be an important step on the path to implementing wide-field imaging capabilities for the Square Kilometre Array (SKA). We have embarked on a program to research new approaches to interferometric imaging, with particular emphasis on algorithms that align more closely with the memory-bound computing that will be available in the time frame of the SKA.

In this paper we describe a different approach to $w$-projection. The key to our method is to represent the convolution function, including both the $w$-projection term and the anti-aliasing function, as a complex Gaussian. The gridding and degridding problems can then be solved using the variable-width fast Gauss transform (FGT; Strain 1991). One advantage of this approach is that it does not require convolution functions to be computed and stored, and in some situations the theoretical memory bandwidth is less than that of standard gridding.

In section 2 we describe Gaussian anti-aliasing functions used for $w$-projection, and in section 3 we apply these functions, with the fast Gauss transform, to the gridding and degridding problems. In section 4 we derive the theoretical operations and memory requirements, and in section 5 we compare the theoretical requirements of our method and the standard methods. In section 6 we compare the theoretical results with a real-world implementation. We discuss our results in section 7 and draw our conclusions in section 8.

2 $w$-projection with a Gaussian anti-aliasing function

In this section we describe the gridding process, and use a Gaussian anti-aliasing function to obtain a closed form expression for the convolution function.

We acknowledge that the anti-aliasing properties of a Gaussian are not ideal. Typical interferometric imaging uses prolate spheroidal wavefunctions, which have optimal out-of-band rejection (Schwab, 1980). Nonetheless, Gaussian functions have simple analytical relationships (e.g. the convolutions and Fourier transforms of Gaussians are both Gaussians) that allow us to proceed.

2.1 Gridding

The aim of convolutional gridding is to take visibilities sampled at arbitrary $(u,v,w)$ coordinates and interpolate them onto a regular grid so that a 2D fast Fourier transform (FFT) can be used to estimate the sky brightness distribution (Taylor et al., 1999, p. 134ff). Mathematically, the grid is evaluated at a point $(u_{c},v_{c})$ by multiplying each visibility by a shifted convolution function according to:

F(u_{c},v_{c})=\sum_{k=1}^{M}C(u_{c}-u_{k},v_{c}-v_{k},w_{k})\,V(u_{k},v_{k},w_{k}) (1)

where $M$ is the number of visibilities to be gridded, $u$, $v$, and $w$ are the coordinates of the projected baseline (in wavelengths), $C(u,v,w)$ is a convolution function and $V(u,v,w)$ is the measured visibility. Typically, $C$ is assumed to be zero outside some region of support (6–8 pixels). If we take no account of the $w$ term, then the convolution function is chosen to be an anti-aliasing function and is independent of $w$, i.e. $C(u,v,w)=A(u,v)$. If we account for the $w$ term using the $w$-projection of Cornwell et al. (2008), then the convolution function is given by

C(u,v,w)=A(u,v)*G(u,v,w) (2)

where $*$ denotes convolution, $A(u,v)$ is an anti-aliasing function, and

G(u,v,w)=\frac{i}{w}\exp\left(-\pi i[u^{2}+v^{2}]/w\right) (3)

is the Fresnel diffraction term required by the $w$-projection algorithm.
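As an illustration, the gridding sum of Equation 1 can be evaluated directly. The sketch below (a minimal example, assuming a hypothetical $w$-independent Gaussian convolution function with illustrative width $\delta_{A}=1$) is the brute-force operation that $w$-projection and the method of this paper accelerate.

```python
import numpy as np

# Direct evaluation of Eq. (1) at a single grid point (u_c, v_c).
# The convolution function here is a hypothetical w-independent Gaussian;
# all parameter values are illustrative.
def grid_point(u_c, v_c, u, v, w, vis, conv):
    """F(u_c, v_c) = sum_k C(u_c - u_k, v_c - v_k, w_k) V_k."""
    return np.sum(conv(u_c - u, v_c - v, w) * vis)

delta_A = 1.0  # assumed anti-aliasing width (wavelengths)
conv = lambda du, dv, w: np.exp(-(du ** 2 + dv ** 2) / delta_A)

# Synthetic visibilities at random (u, v, w) coordinates.
rng = np.random.default_rng(0)
u, v, w = rng.normal(size=(3, 100))
vis = rng.normal(size=100) + 1j * rng.normal(size=100)
F = grid_point(0.0, 0.0, u, v, w, vis, conv)
```

Every visibility touches every evaluated grid point here, which is why convolutional gridding truncates $C$ to a small support region.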

2.2 Radially symmetric Gaussian anti-aliasing functions

For a fixed $w$, equation 2 is the convolution of the anti-aliasing function with a complex Gaussian. If we choose a Gaussian for the anti-aliasing function $A$, the resulting convolution function is also Gaussian. To simplify notation, we will consider the radially symmetric problem by introducing a new parameter $t^{2}=u^{2}+v^{2}$, and can now write the Fresnel term as:

G(t,w) = \frac{i}{w}\exp\left(-\frac{\pi it^{2}}{w}\right) (4)
       = \frac{i}{w}\exp\left(\frac{-t^{2}}{i\delta_{G}}\right) (5)

with $\delta_{G}=-w/\pi$. We can also write the Gaussian anti-aliasing function as

A(t)=\exp\left(\frac{-t^{2}}{\delta_{A}}\right). (6)

The gridding function is given by the convolution $G(t,w)*A(t)$, which is another Gaussian whose variance is equal to the sum of the variances of the original Gaussians, i.e., to within a scaling factor:

C(t,w) = A(t)*G(t,w) (7)
       = \frac{i}{w}\exp\left(-t^{2}\frac{1}{\delta_{A}+i\delta_{G}}\right) (8)
       = \frac{i}{w}\exp\left(-t^{2}\frac{\delta_{A}}{\delta_{A}^{2}+\delta_{G}^{2}}\right)\exp\left(-t^{2}\frac{-i\delta_{G}}{\delta_{A}^{2}+\delta_{G}^{2}}\right) (9)
       = \frac{i}{w}\underbrace{\exp\left(-\frac{t^{2}}{\delta_{R}}\right)}_{\textrm{real envelope}}\underbrace{\exp\left(\frac{it^{2}}{\delta_{I}}\right)}_{\textrm{complex chirp}} (10)

The width of the envelope is given by

\delta_{R}=\frac{\delta_{A}^{2}+\delta_{G}^{2}}{\delta_{A}}, (12)

and the width of the complex chirp is

\delta_{I}=\frac{\delta_{A}^{2}+\delta_{G}^{2}}{\delta_{G}}. (13)

The convolution function is, therefore, a complex Gaussian, which can be expressed as the product of a real-valued Gaussian envelope (the $\delta_{R}$ term in equation 10) with a complex chirp (the $\delta_{I}$ term). Clearly, as $|w|$ increases, the width of the Gaussian envelope $\delta_{R}$ increases.

For a given visibility indexed by $j$, the form of the convolution function can be calculated from its $w$ coordinate using:

C(t,w_{j})=\frac{i}{w_{j}}\exp\left(\frac{-t^{2}}{\delta_{j}}\right) (14)

where

\delta_{j}=\delta_{A}-i\frac{w_{j}}{\pi} (15)

is the width of the relevant convolution function, and $w_{j}$ is the $w$ coordinate of the visibility.
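The widths above follow directly from $w_{j}$; a minimal numerical sketch of Equations 12, 13 and 15, with illustrative parameter values:

```python
import math

# Widths of the combined convolution function for a Gaussian anti-aliasing
# width delta_A (wavelengths) and a visibility w-coordinate w_j, following
# Eqs. (12), (13) and (15).  The input values are illustrative.
def conv_widths(delta_A, w_j):
    delta_G = -w_j / math.pi                        # Fresnel width (Eq. 5)
    delta_j = delta_A - 1j * w_j / math.pi          # complex width (Eq. 15)
    delta_R = (delta_A**2 + delta_G**2) / delta_A   # envelope width (Eq. 12)
    delta_I = (delta_A**2 + delta_G**2) / delta_G   # chirp width (Eq. 13)
    return delta_j, delta_R, delta_I

delta_j, delta_R, delta_I = conv_widths(1.0, 10.0)
```

As the text notes, the envelope width `delta_R` grows as $|w|$ increases, so larger-$w$ visibilities spread over a wider region of the $uv$ plane.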

2.3 Conversion from wavelengths to pixels

Up to this point, the $u,v,w$ coordinates and the Gaussian widths have been in units of wavelengths. For the remainder of this paper, widths will often be quoted in pixels. We use the terms pixel and $uv$-cell interchangeably. To convert from wavelengths to pixels, the size of a $uv$-cell is required. The size of the $uv$-cell (in wavelengths) is set by the inverse of the desired field of view (in radians), i.e. $\phi=1/\theta_{\rm f.o.v.}$.

Therefore, a Gaussian width in pixels can be calculated from the width in wavelengths by:

\delta({\rm pixels})=\frac{\delta({\rm wavelengths})}{\phi^{2}} (16)
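A minimal sketch of Equation 16, assuming a hypothetical 1-degree field of view and an illustrative Gaussian width:

```python
import math

# Eq. (16): convert a Gaussian width from wavelengths to pixels.
theta_fov = math.radians(1.0)     # assumed field of view (radians)
phi = 1.0 / theta_fov             # uv-cell size (wavelengths)
delta_wavelengths = 3000.0        # hypothetical Gaussian width (wavelengths)
delta_pixels = delta_wavelengths / phi ** 2
```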

3 $w$-projection with the Fast Gauss Transform

3.1 A brief introduction to the Fast Gauss Transform

In this section we introduce the fast Gauss transform (FGT) with variable scales (Strain, 1991; hereafter S91), which is an approximate technique for calculating the sum of Gaussians. We conform to the language of S91, which uses the term ‘source’ to describe a Gaussian function with a given position, amplitude and width. We describe the application of the FGT to the gridding problem in section 3.2.

The FGT works by partitioning the evaluation region into square boxes, with each box containing a set of Taylor coefficients (Fig. 1). For each Gaussian ‘source’, the Taylor coefficients of the boxes surrounding the source position are updated, with the size of the updated region being set by the width of the source, i.e. wider Gaussians update more boxes. When all sources have been applied, the Taylor coefficients are evaluated at any number of target locations. The algorithm is parameterised by three numbers: $r$, which sets the box size; $p$, which defines the number of Taylor coefficients to consider; and $\delta$, which sets the smallest size of Gaussian that will be considered. Reducing the box size, or increasing the number of Taylor coefficients, increases the accuracy of the sum.

3.2 Gridding with the Fast Gauss Transform

Now that we have a Gaussian convolution function (Equation 14), we can evaluate the sum of Equation 1 using the FGT with source-dependent scales. In the language of gridding visibilities, the evaluation region is the $uv$ plane. The ‘sources’ are the visibilities, with the source position given by the coordinates of the visibility on the $uv$ plane. The width of a source is related to the $w$ coordinate of the visibility as in Equation 15, with the minimum width, occurring when $w=0$, being essentially the width of the anti-aliasing function. When all the visibilities have been applied to the Taylor coefficients, the Taylor coefficients are evaluated on a regular grid, for each $uv$ cell of the $uv$ plane.

Figure 1: This figure illustrates the process of gridding or degridding a single visibility. The $uv$ plane of size $N_{\rm pix}$ is partitioned into boxes of size $L$. The box size can be the same size as a $uv$ cell. Each box contains $(p-p_{d}+1)^{2}$ Taylor coefficients. To grid or degrid a visibility indexed by $j$, all boxes within $R_{j}$ of the visibility $uv$ position ($t_{j}$) are updated (circular blue shaded region). Within each box, $q^{2}$ Taylor coefficients are updated with no cheating. If cheating is enabled (i.e. $p_{d}>0$), then only $q^{\prime 2}$ of the $(p-p_{d}+1)^{2}$ Taylor coefficients (orange shaded region) are updated. $R_{j}$ has been exaggerated for clarity.

For a detailed description of the algorithm, we refer the reader to S91. For the purposes of explaining our implementation, we outline the algorithm in the following text, using notation similar to that of S91. This description applies to the one-dimensional case, but S91 describes the relationships needed to extend it to multiple dimensions (using multi-index notation), of which the 2D case is relevant here. Additionally, this description is only valid for real Gaussians. The extension to complex Gaussians is considered in the next section.

The algorithm proceeds as follows:

  1.

    Given values of $r$ and $p$, calculate:

    \epsilon=\frac{r_{p}^{p}}{1-r_{p}}, (17)

    where

    r_{p}=r\sqrt{\frac{e}{p+1}}<1 (18)
  2.

    Partition the $uv$ plane into the set of boxes $B$ with centers $t_{B}$ and side length $L=r\sqrt{2\delta}$ parallel to the axes, where $\delta=\min_{j}\delta_{j}=\delta_{A}$ is the smallest Gaussian width to be considered.

  3.

    For each visibility indexed by $j$, calculate the range $R_{j}=\sqrt{-\delta_{j}\log\epsilon}$. Also calculate the order $q\leq p$ satisfying

    \frac{r_{q,j}^{q}}{1-r_{q,j}}\leq\epsilon (19)

    where $r_{q,j}=r\sqrt{e\delta/\delta_{j}(q+1)}$.

  4.

    For each box within $R_{j}$ of the visibility position $t_{j}$, update the Taylor coefficients according to:

    C_{\beta}=\frac{1}{\beta!}\sum_{j:\,\delta_{j}\leq b}v_{j}\left(\frac{\delta}{\delta_{j}}\right)^{|\beta|/2}h_{\beta}\left(\frac{t_{j}-t_{B}}{\sqrt{\delta_{j}}}\right) (20)

    where $\beta$ runs from 0 to $q$, $v_{j}$ is the visibility weight, $h_{\beta}$ is the Hermite function (see S91), and $b$ is the upper scale cutoff (sources with $\delta_{j}>b$ are handled separately; see section 3.4).

  5.

    Once all the visibilities have been applied, evaluate the Taylor coefficients at each cell of the $uv$ plane, by finding the box in which the $uv$ cell is situated and calculating the sum:

    F(t)=\sum_{\beta\leq p}C_{\beta}\left(\frac{t_{C}-t_{B}}{\sqrt{\delta}}\right)^{\beta} (21)

    where $t_{C}$ is the position of the $uv$ cell, and $\beta$ runs from 0 to $p$.

The error in the evaluation of the sum is bounded by:

E_{p}\leq V\frac{r_{p}^{p}}{1-r_{p}}, (22)

where

V=\sum_{j=1}^{M}|v_{j}|. (23)

$r$ and $p$ are chosen to make $|E_{p}|\leq V\epsilon$.
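The steps above can be sketched in one dimension for real Gaussians (the case covered directly by S91; the complex extension is discussed in section 3.3). This is a simplified sketch, not our implementation: it uses a fixed Taylor order $p$ and truncates each source at a fixed multiple of its width, instead of computing $R_{j}$ and $q$ from $\epsilon$, and all parameter values are illustrative.

```python
import math
import numpy as np

def hermite_functions(x, p):
    """h_n(x) = (-1)^n d^n/dx^n exp(-x^2) for n = 0..p,
    via the recurrence h_{n+1} = 2x h_n - 2n h_{n-1}."""
    h = np.empty(p + 1)
    h[0] = math.exp(-x * x)
    if p > 0:
        h[1] = 2.0 * x * h[0]
    for n in range(1, p):
        h[n + 1] = 2.0 * x * h[n] - 2.0 * n * h[n - 1]
    return h

def fgt_grid_1d(t_src, v_src, delta_src, t_eval, delta, L, p, n_widths=4.0):
    """Steps 2-5 in 1-D: partition into boxes of side L, accumulate per-box
    Taylor coefficients for each Gaussian source, then evaluate at t_eval."""
    n_boxes = int(math.ceil(t_eval.max() / L)) + 1
    coeffs = np.zeros((n_boxes, p + 1))
    orders = np.arange(p + 1)
    fact = np.array([math.factorial(n) for n in range(p + 1)], dtype=float)
    for tj, vj, dj in zip(t_src, v_src, delta_src):
        R = n_widths * math.sqrt(dj)          # truncation range (cf. R_j)
        lo = max(int((tj - R) // L), 0)
        hi = min(int((tj + R) // L), n_boxes - 1)
        for b in range(lo, hi + 1):
            tB = (b + 0.5) * L                # box centre
            h = hermite_functions((tj - tB) / math.sqrt(dj), p)
            coeffs[b] += vj * (delta / dj) ** (orders / 2) * h / fact
    # Step 5: evaluate each target from the Taylor series of its own box.
    out = np.empty_like(t_eval)
    for i, t in enumerate(t_eval):
        b = min(int(t // L), n_boxes - 1)
        x = (t - (b + 0.5) * L) / math.sqrt(delta)
        out[i] = np.polynomial.polynomial.polyval(x, coeffs[b])
    return out
```

With $L$ of order $\sqrt{2\delta}$ and a modest order $p$, the truncated per-box Taylor series reproduces the direct Gaussian sum to high accuracy, while each source touches only the boxes within its range.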

3.3 Extension to complex Gaussians

The formulation of S91 is explicitly for real Gaussians, i.e. with $\delta_{j},v_{j}\in\mathbb{R}$, but in our case $\delta_{j},v_{j}\in\mathbb{C}$. A full derivation of the algorithm for complex values is outside the scope of this work. The important questions are whether the series of Equation 20 converges (and for what values of a complex $\delta$), and what is a suitable range $R_{j}$.

Empirically, we have determined that:

  1.

    Using a range of $R_{j}=\sqrt{-\delta_{R,j}\log\epsilon}$ provides good results. This also makes sense because it is the amplitude of the envelope that determines the range.

  2.

    For calculating the order $q$ (Equation 19) we used

    \delta_{j}=\max(\delta_{A},\delta_{I}) (24)

    Other possibilities (e.g. $\delta_{j}=\delta_{R}$) reduced $q$ too aggressively as $w$ increased, which resulted in very large errors.

In our case, the minimum value of $\delta$ occurs when $w=0$, which also coincides with $\Im(\delta)=0$, so the box size is easily set to $L=r\sqrt{2\delta_{A}}$.

3.4 Optimisations

S91 describes a number of optimisations that can be used in the FGT with source-dependent scales:

  1.

    S91 introduces a single set of Taylor coefficients, positioned at the center of the $uv$ plane, to capture Gaussians with the largest scales. Visibilities larger than an upper scale size $\delta_{j}>b$ can be added to this single set, rather than to the sets associated with the boxes. S91 recommends the upper scale size be at least 10% of the size of the evaluation region, in order to allow for practical precision in evaluating the large number of Taylor coefficients. In practice, most array configurations have relatively few baselines with such a large $w$ that their convolution functions span more than 10% of the $uv$ plane, so this optimisation has limited impact.

  2.

    S91 introduces a lower scale $a$ below which the Gaussian sum is directly evaluated. If there are only a few Gaussians with scales smaller than $a$, we can increase the box size, which reduces the memory and arithmetic capacity required, while incurring minimal penalty from direct evaluation. In practice, any interferometric observation is likely to be arranged such that the mean value of $w$ is approximately zero. Therefore, there are unlikely to be outlying, small Gaussians that would allow for increasing the box size. Once again, this optimisation has limited impact.

  3.

    S91 proposes direct evaluation for boxes containing only a few evaluation points. As we are evaluating on a regular grid, we have the same number of evaluation points inside each box, so applying this optimisation would degenerate into standard gridding.

  4.

    S91 proposes a method for avoiding needlessly accurate computation: for large scales (which are smoother than small ones) more boxes are updated, but a smaller number of Taylor coefficients (i.e. the order $q$ described in section 3.2) can be updated per box. This optimisation is useful, and derives the most benefit for visibilities with large scales ($|w|$ large; see section 4). We have incorporated this optimisation in Equation 19.

  5.

    S91 notes that the equation used to calculate the order (Equation 19) sometimes overestimates the error by two orders of magnitude. Memory bandwidth and operations can be saved by ‘cheating’ on the value of $q$, i.e. decrementing $q$ by some value. This strategy is described in section 3.4.1.

  6.

    The fact that each $uv$ cell contains a number of Taylor terms affords an additional axis for parallelisation. The strategy is described in section 3.4.2.

3.4.1 Cheating

There are two sources of error when a sample is gridded (see Figure 3). The first is from truncating the support; that is, only boxes within a finite range $R_{j}=\sqrt{-\delta_{R}\log\epsilon}$ are updated. Convolutional gridding suffers from the same type of truncation error. The second source of error is from truncation of the Taylor terms; that is, only $q<\infty$ Taylor terms are updated. S91 notes that the equation for calculating the error from truncation of the Taylor terms (Equation 19) overestimates the error by several orders of magnitude. If this is the case, then for a given value of $\epsilon$, the error will be dominated by the truncated support, and the contribution from the truncation of the Taylor terms will be negligible. In principle, therefore, one can reduce the number of Taylor terms without substantially increasing the overall error, as long as it remains less than the error from the truncation of support.

In order to balance the errors from both effects, we introduce an additional parameter $p_{d}$, which decrements the value of $q$, so that the value actually used is $q^{\prime}=\max(q-p_{d},0)$. This reduces the CPU and memory requirements of updating a given box, at the expense of increased errors due to truncating the Taylor terms.
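The order selection of Equation 19, with the cheating decrement applied, can be sketched as follows (parameter values illustrative):

```python
import math

# Select the per-visibility Taylor order q from Eq. (19), then apply the
# 'cheating' decrement p_d: q' = max(q - p_d, 0).
def choose_order(r, delta, delta_j, p, epsilon, p_d=0):
    for q in range(p + 1):
        r_qj = r * math.sqrt(math.e * delta / (delta_j * (q + 1)))
        if r_qj < 1.0 and r_qj ** q / (1.0 - r_qj) <= epsilon:
            break
    else:
        q = p  # fall back to the maximum order
    return max(q - p_d, 0)
```

Wider sources (larger $\delta_{j}$) make $r_{q,j}$ smaller, so the loop exits at a lower order, which is the smoothness argument behind optimisation 4 above.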

3.4.2 Parallelisation

One interesting property of our approach is that it affords an additional axis of parallelisation (the Taylor terms) beyond those available for standard gridding (typically, data parallelisation over frequency channels). We assume that the parallelism is achieved through message passing.

In section 4 we show that the memory bandwidth and operations count of the FGT are dominated by updating the Taylor terms. As each of the Taylor terms is independent, one can in principle store each of the $(p-p_{d}+1)^{2}$ Taylor terms on (up to) as many nodes, thereby dividing the per-node storage requirement by the number of nodes. For the case where $q^{\prime 2}$ is a multiple of the number of nodes, the work is spread equally among the nodes. Each node can update its Taylor terms for the boxes within range, and the total memory bandwidth is increased by the number of nodes.

One penalty of parallelising in this way is that each node must calculate the same interim values (e.g. the values of the Hermite polynomials). This duplication can be reduced by partitioning the Taylor terms among fewer nodes, with each node being responsible for a particular row or column of the Taylor terms.

If $q^{\prime 2}$ is less than the number of nodes, then some nodes will have no work to perform, as they are responsible for Taylor coefficients that are not being updated. One must be careful, therefore, to choose an algorithm parameterisation (i.e. $r$, $p$ and $p_{d}$) that is guaranteed never to give $q^{\prime 2}$ less than the number of nodes.

The efficiency of this approach depends crucially on the properties of the computing nodes. It is useful if the nodes are storage or memory bound, but not if the cost of computing the interim values dominates the computing time. As this tradeoff requires intimate knowledge of the particular computing architecture, we will not pursue a detailed analysis of parallelisation in this paper.

3.5 Degridding with the Fast Gauss Transform

Degridding is used as part of the major cycle of Cotton-Schwab clean (Rau & Cornwell, 2011). Degridding involves taking the dot product of the $uv$ plane with the convolution function shifted to the location of a visibility at an arbitrary position (i.e. not at the center of a $uv$ cell). It is the dual of gridding.

Degridding can be performed analogously to gridding, using the fast Gauss transform with target-dependent scales (S91). All the comments about optimisations for gridding (Section 3.4) apply equally to degridding.

3.6 Key differences between classical gridding and the FGT

The key conceptual differences between classical gridding and FGT gridding are described in Table 1.

Table 1: Conceptual differences between classical gridding and gridding with the Fast Gauss Transform.
Property | Classical Gridding | Fast Gauss Transform
Accuracy | Exact | Approximate, but with controllable error
Anti-aliasing function | Arbitrary | Gaussian only
Fundamental scale | A $uv$-cell, set by imaging geometry | A box, which can be smaller or larger than a $uv$-cell
Gridding a single visibility updates | Cells on the $uv$ plane | Taylor coefficients in a set of boxes; the cells of the $uv$ plane are calculated in a post-processing step

4 Theoretical performance

Here we derive the theoretical performance of the gridding and degridding processes using the fast Gauss transform, and traditional convolutional gridding.

4.1 Operations count of the fast Gauss transform

4.1.1 Gridding

To grid a single visibility of width $\delta_{j}$, the number of boxes within a circular region of range $R_{j}=\sqrt{-\delta_{R}\log\epsilon}$ is approximately:

N_{B}=\frac{\pi}{4}(2\lceil R_{j}/L\rceil+1)^{2}. (25)

The order $q\leq p$ is chosen to satisfy the error estimate:

\frac{r_{q,j}^{q}}{1-r_{q,j}}\leq\epsilon (26)

where $r_{q,j}=r\sqrt{e\delta/\delta_{j}(q+1)}$.

To apply cheating (Section 3.4.1), we decrement the value of $q$, such that $q^{\prime}=\max(q-p_{d},0)$.

Now we consider the work required to update the Taylor coefficients in each box (Eq. 20). The key is to compute and store the Hermite functions and the powers $(\delta/\delta_{j})^{|\beta|/2}$ separately, and perform the sum at the end, taking advantage of the product form of the multi-index notation in multiple dimensions.

The Hermite function is a polynomial multiplied by a complex exponential. The argument of the complex exponential is the same for every order $q^{\prime}$, so only one calculation is required per set of Hermite function evaluations. The way in which a complex exponential is computed by a CPU is highly implementation-dependent, so we simplify the analysis here and assume that it requires $N_{\rm cexp}$ floating point operations.

We observe that the Hermite function of order $q^{\prime}$ has only $N_{c}=\lfloor q^{\prime}/2\rfloor+1$ nonzero coefficients. Second, we note that we evaluate the Hermite functions of order $0\leq\beta\leq q^{\prime}$ with the same argument. Therefore, we can calculate the powers of the argument initially with $q^{\prime}$ operations. Evaluating each Hermite function then requires only $N_{c}$ additional operations. Therefore, to compute all Hermite functions up to order $q^{\prime}$ requires approximately:

N_{\rm ops,hermite} = N_{\rm cexp}+q^{\prime}+2\sum_{\beta=0}^{q^{\prime}}\left(\lfloor\beta/2\rfloor+1\right) (27)
                   \simeq N_{\rm cexp}+q^{\prime}+N_{c}(N_{c}+1)

The multi-index powers $t_{\beta}=(\delta/\delta_{j})^{|\beta|/2}$ can be computed recursively, via $t_{n}=(\delta/\delta_{j})t_{n-2}$ with $t_{0}=(\delta/\delta_{j})^{1/2}$ and $t_{1}=(\delta/\delta_{j})^{1}$. This requires $2(q^{\prime}+1)$ operations for $0\leq n\leq 2q^{\prime}$.

The weight $v_{j}$ can also be folded into the power values, with $2(q^{\prime}+1)$ operations.

Finally, the full sum to update $C_{\beta}$ over two dimensions requires the multiplication of the Hermite functions for the two dimensions and the powers, plus the accumulation, which requires $3(q^{\prime}+1)^{2}$ operations in total.

The total number of operations to update a single box is therefore the sum of two Hermite evaluations (one for each dimension), plus the powers of $\delta/\delta_{j}$, the weights and the final sum, i.e.:

N_{\rm grid}=2(N_{\rm cexp}+q^{\prime}+N_{c}[N_{c}+1])+4(q^{\prime}+1)+3(q^{\prime}+1)^{2} (28)

For small $q^{\prime}$, the cost is dominated by the cost of evaluating the complex exponential. For large $q^{\prime}$, it is dominated by the multiply-add step in the final sum.
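Equations 25 and 28 combine into a per-visibility cost estimate; a sketch, using the assumption (made later, in section 5) that a complex exponential costs $N_{\rm cexp}\approx 100$ flops:

```python
import math

# Per-visibility gridding cost for the FGT: boxes updated (Eq. 25) times
# operations per box (Eq. 28).  n_cexp = 100 is the paper's assumed flop
# cost of a complex exponential.
def fgt_grid_ops(q_prime, R_j, L, n_cexp=100):
    n_boxes = math.pi / 4.0 * (2 * math.ceil(R_j / L) + 1) ** 2   # Eq. 25
    n_c = q_prime // 2 + 1              # nonzero Hermite coefficients
    per_box = (2 * (n_cexp + q_prime + n_c * (n_c + 1))
               + 4 * (q_prime + 1) + 3 * (q_prime + 1) ** 2)      # Eq. 28
    return n_boxes * per_box
```

At $q^{\prime}=0$ the per-box cost is dominated by the `n_cexp` term, illustrating the small-$q^{\prime}$ regime described above.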

4.1.2 Degridding

The operations count for the degridding process is similar to the gridding case. The order $q^{\prime}$ is evaluated identically, as are the Hermite functions and the powers of $\delta/\delta_{j}$. The only differences are that no weight is included, and that the multiply-add stage includes the product of the two Hermite functions, the power and the Taylor term, requiring $4(q^{\prime}+1)^{2}$ operations.

The number of operations per visibility is therefore:

N_{\rm degrid}=2(N_{\rm cexp}+q^{\prime}+N_{c}[N_{c}+1])+2(q^{\prime}+1)+4(q^{\prime}+1)^{2}. (29)

Once again, for small $q^{\prime}$, the cost is dominated by the cost of evaluating the complex exponential. For large $q^{\prime}$, it is dominated by the multiply-add step in the final sum.

4.1.3 Memory bandwidth

Calculating memory bandwidth is complicated by the cache hierarchy. We assume the simplest case, i.e. no caching. In practice this is a reasonable assumption, as visibilities are often stored in no particular order. Therefore, the accesses to the grid (and to the convolution functions, in the case of convolutional gridding) are essentially random, meaning that all memory accesses go to main memory and bypass the caches.

For the fast Gauss transform, gridding a visibility requires updating $(q^{\prime}+1)^{2}N_{B}$ complex coefficients. The coefficients must be read into the processor, updated, and written back to memory, requiring $2(q^{\prime}+1)^{2}N_{B}$ memory transactions per visibility.

For degridding, the coefficients do not need to be written back to memory, so only $(q^{\prime}+1)^{2}N_{B}$ memory transactions are required.

4.1.4 Storage

For an image of size $N_{\rm pix}^{2}$, the gridding operation requires storage for $(N_{\rm pix}/L)^{2}(p-p_{d}+1)^{2}$ complex coefficients. Degridding requires the same storage.
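A one-line sketch of this storage requirement, with illustrative parameter values:

```python
# Coefficient storage for the FGT on an N_pix^2 grid (section 4.1.4):
# (N_pix / L)^2 boxes, each holding (p - p_d + 1)^2 complex Taylor
# coefficients.  The values below are illustrative, not a recommendation.
n_pix, L, p, p_d = 4096, 2, 12, 2
n_coeffs = (n_pix // L) ** 2 * (p - p_d + 1) ** 2
```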

4.2 Performance of convolutional gridding and degridding

In this section we derive the operations count, memory bandwidth and storage required to perform gridding and degridding with stored convolution functions. In order to put the two methods on an equal footing in terms of imaging performance, we use Gaussian anti-aliasing functions for the $w$-projection, rather than the commonly-used prolate spheroidal wavefunctions. This means that the imaging performance of the two approaches is essentially the same (except for the errors in truncating the Taylor series of the FGT), and the support size of the convolution functions is also easily computed.

We assume standard $w$-projection computes the convolution functions in advance and stores them in memory, with a convolution function of 1-D size $M_{k}$ for the $k$th $w$-plane $w_{k}=k\Delta w$, where $k$ is an integer and $\Delta w$ is the difference in $w$ between adjacent planes. Usually the cached convolution function is oversampled by a factor $\kappa$ of 4–8 in order to accurately grid visibilities whose coordinates are not exactly at the center of a $uv$ cell.

We assume that we truncate the convolution function when the real envelope reaches a value $\epsilon$, which occurs at a distance $t_{\epsilon}=\sqrt{-\delta_{R}\log\epsilon}$ from the centre of the convolution function. Therefore, the 1D support size in pixels of the $k$th $w$-plane is (by substituting Equation 12)

M_{k} = 2t_{\epsilon} (30)
     = 2\sqrt{-\log\epsilon\left(\delta_{A}+\frac{k^{2}\Delta w^{2}}{\pi^{2}\delta_{A}}\right)}
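A sketch of Equation 30, with illustrative values of $\delta_{A}$, $\Delta w$ and $\epsilon$:

```python
import math

# 1-D support size in pixels of the k-th w-plane convolution function
# (Eq. 30).  All argument values used below are illustrative.
def support_size(k, delta_A, delta_w, eps):
    return 2.0 * math.sqrt(-math.log(eps)
                           * (delta_A + (k * delta_w) ** 2
                              / (math.pi ** 2 * delta_A)))
```

The support grows roughly linearly with $k\Delta w$ once the Fresnel term dominates the anti-aliasing width.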

4.2.1 Operations count

Gridding a visibility requires two operations per point (weight times convolution function, add to grid), for a total of $2M_{k}^{2}$ operations per visibility. Degridding also requires two operations per point (convolution function times grid, add to result), and therefore also requires $2M_{k}^{2}$ operations per visibility.

4.2.2 Memory bandwidth

Gridding requires the convolution function and the grid values to be retrieved, and the grid to be written back to memory, requiring $3M_{k}^{2}$ memory transactions per visibility. Degridding has no requirement to write back the result (as the intermediate sum is stored in a register), so only $2M_{k}^{2}$ memory transactions are required.

4.2.3 Storage

To compute the total memory required, we determine the amount required to store $N_{w}$ $w$-planes for values of $w$ in $[0,+w_{\rm max}]$, uniformly sampled with spacing $\Delta w=w_{\rm max}/N_{w}$. In practice, we also need to store $w$-planes for $[-w_{\rm max},0)$, so we take the result below and multiply by two.

The amount of memory required to store the convolution functions is therefore equal to the sum of the sizes of the oversampled convolution functions, i.e.:

\frac{N_{\rm mem}}{2} = \sum_{k=0}^{N_{w}}{(\kappa M_{k})^{2}} = -4\kappa^{2}\log\epsilon\left(N_{w}\delta_{A}+\frac{\Delta w^{2}}{\pi^{2}\delta_{A}}\frac{N_{w}}{6}[N_{w}+1][2N_{w}+1]\right) \qquad (31)
\simeq -\frac{4}{3}\kappa^{2}\log\epsilon\frac{w_{\rm max}^{2}}{\pi^{2}\delta_{A}}N_{w} \qquad (32)
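The closed form of Equation 31 and its large-$N_w$ limit (Equation 32) can be checked against a direct sum. A sketch under the same assumptions as before (natural log, $w_{\rm max}$ in the scaled units of Equation 12; names are ours):

```python
import math

def kernel_support(k, dw, delta_a=1.0, eps=1e-3):
    """1-D support of the k-th w-plane kernel (Equation 30)."""
    return 2.0 * math.sqrt(-math.log(eps)
                           * (delta_a + (k * dw) ** 2 / (math.pi ** 2 * delta_a)))

def cache_size_direct(n_w, w_max, kappa=8, delta_a=1.0, eps=1e-3):
    """Cache size (complex values) by summing the oversampled kernels,
    doubled for negative w (twice Equation 31)."""
    dw = w_max / n_w
    return 2.0 * sum((kappa * kernel_support(k, dw, delta_a, eps)) ** 2
                     for k in range(n_w + 1))

def cache_size_approx(n_w, w_max, kappa=8, delta_a=1.0, eps=1e-3):
    """Large-N_w closed form, doubled for negative w (twice Equation 32)."""
    return 2.0 * (-4.0 / 3.0) * kappa ** 2 * math.log(eps) \
        * w_max ** 2 / (math.pi ** 2 * delta_a) * n_w
```

For a regime where the Fresnel term dominates ($w_{\rm max}^2/\pi^2 \gg N_w\delta_A$), the two agree to within a few per cent.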

5 Theoretical Performance Comparison for Gridding

We consider the gridding problem for two scenarios, $L=1$ and $L=2$. We assume $\delta_{A}=1$ pixel, a required accuracy of $\epsilon=10^{-3}$, a 1 degree field of view and a 6 km maximum baseline at $\lambda=0.2$ m, giving a $w_{\rm max}$ of $30\,\mathrm{k\lambda}$. The convolution function size at maximum $w$ is $M_{w_{\rm max}}=191$ pixels. We assume the cost of evaluating a complex exponential is $N_{\rm cexp}=100$ flops. This estimate is also justified by measurements of our implementation (Fig. 5).
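The quoted $w_{\rm max}$ follows directly from the array geometry (the 191-pixel kernel size additionally depends on the field-of-view scaling of Equation 12, which we do not reproduce here). A trivial check, with variable names of our own:

```python
# Sanity check of the scenario numbers: the maximum w coordinate equals
# the maximum baseline length expressed in wavelengths.
baseline_m = 6000.0    # 6 km maximum baseline
wavelength_m = 0.2     # lambda = 0.2 m
w_max = baseline_m / wavelength_m  # in wavelengths
print(w_max / 1e3)     # -> 30.0, i.e. 30 klambda
```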

The FGT method for $L=1$ and $L=2$ is outperformed by standard gridding both in terms of operations and memory bandwidth, if cheating is not enabled (Figure 2). For the $L=1$ case in particular, the operations count is dominated by the complex exponential.

If cheating is enabled, the situation is substantially improved. Once the $w$ value rises above a certain threshold, the order $q^{\prime}=0$, and only one Taylor coefficient is updated per box. In the $L=1$ case, the memory bandwidth is reduced to 0.5 of the standard gridding case, because the FGT requires only two memory transactions per pixel (read + write coefficient), while standard gridding requires three (read pixel + read convolution function + write pixel). For $L=2$ the improvement in memory bandwidth is even greater, with the FGT requiring 10 times less memory bandwidth than standard gridding. The FGT is more efficient as it updates only one coefficient for each box, which encapsulates 4 pixels. The FGT case is further improved because it updates only boxes within a circular region, while standard gridding updates all pixels within a square.

This encouraging result leads to a number of questions: is the FGT bound by memory bandwidth? Can these performance gains be realised in practice? Can cheating with $p_{d}=3$ give reasonable image errors? We will address these questions in the following section.

Figure 2: Gridding with the fast Gauss transform can save memory bandwidth for large support sizes, at a cost of floating point operations. Plotted here is the ratio of operations (top panel) and memory bandwidth (bottom panel) for the fast Gauss transform compared with convolutional gridding (see Section 4). The x-axis is the value of the $w$ coordinate, which is also a measure of the width of the convolution function (Equation 15). The target error for each algorithm was set to $\epsilon=10^{-3}$. The fast Gauss transform was configured with $\delta_{A}=1$ pixel and we have assumed $N_{\rm cexp}=100$. The step drops in the ratio occur when the FGT reduces the order $q^{\prime}$ of the Taylor truncation. For $w>14\,\mathrm{k\lambda}$, the FGT with $L=2$ requires 21.8 times more operations but only 0.1 times the memory bandwidth of classical gridding, but only if cheating is enabled with $p_{d}=3$.

6 Implementation

We implemented the gridding and degridding algorithms described in Section 3 in C++. As with the theoretical simulation, we set the width of the anti-aliasing function to $\delta_{A}=1$ $uv$ cell. We aim to explore a parameter space similar to the theoretical analysis, i.e. around $L=1$ and $L=2$ and a range of errors $10^{-3}<\epsilon<10^{-2}$.

To compare the gridding errors, we compared the relative error of a visibility gridded with the FGT against the equivalent complex Gaussian (Equation 9), calculated over $256^{2}$ pixels (which is larger than the 191 pixels we expect for $\epsilon=10^{-3}$ and $w=w_{\rm max}$). To compare the compute times, we compare the time to grid 100 visibilities of fixed $u,v,w$ coordinates with the FGT against the same number of visibilities gridded with standard gridding. The support size of the gridding kernel was chosen to have equivalent error to the FGT error, i.e. $M_{k}^{2}$ pixels. We chose 101 $w$-planes from 0 to $w_{\rm max}$. The processing was performed on an Intel Core 2 Duo processor running at 2.66 GHz, with 32 kB of L1 cache, 3 MB of L2 cache and 8 GB of 1057 MHz DDR3 RAM.

We consider here only the results for the gridding operation, as the results for the degridding algorithm are essentially the same.

6.1 Error patterns

A typical error pattern is shown in Figure 3. The absolute error in the gridded visibility contains two components. The circular component is due to the truncation to finite support, i.e. boxes outside the circle were not updated. The vertical and horizontal lines are due to truncation to a finite number of Taylor terms, with the largest error where the grid is evaluated near the boundary of adjacent boxes. The box structure of the gridding process can also be clearly seen in the FFT of the error.

Figure 3: Error pattern for a visibility gridded at $w=w_{\rm max}$ for $L=0.7$ and $q^{\prime}=7$. Each box is $256\times256$ $uv$ pixels. The rows are the visibility computed with the full complex Gaussian, the FGT, the error, and the FFT of the error. The columns are the real, imaginary and absolute values of the data respectively. The circular component of the error is due to the truncated range, while the horizontal and vertical components are due to the truncation of the Taylor series. This example was chosen because it illustrates both components clearly.

6.2 Peak Image Errors vs $w$

The peak error in the FFT of the visibility (i.e. the image plane), as a function of $w$ for a range of algorithm parameterisations, is shown in Figure 4. To begin with, we consider the case for $L\sim 1$. Somewhat surprisingly, Equation 17, which describes the error in gridding a single visibility on the $uv$ plane, also predicts the error in the image plane for many parameterisations and values of $w$. For $p_{d}=0$ and small $w$, the errors are substantially smaller than the predicted $\epsilon$. As $w$ increases, $q^{\prime}$ is reduced, and the error generally matches the predicted $\epsilon$. For some parameterisations (e.g. $r=0.5$, $p=7$) there is a jump in error above the predicted $\epsilon$ at a particular $w$.

For the $L\sim 2$ case and small $w$, the errors are at or below the theoretical limits in most cases; however, in many cases the errors are considerably worse than for $L=1$. For example, for small $w$ in some cases (e.g. $L=2.5$, $p=18$, $p_{d}=0$) the error increases far beyond the theoretical limit. For large $w$, the errors are several orders of magnitude worse than predicted.

Figure 4: The image plane error of the FGT is a strong function of $w$ and the algorithm parameterisation. Here we plot the relative error in the image plane of the FGT vs. $w$ for a range of algorithm parameterisations. The solid lines are the measured error in the image plane and the dotted lines are the predicted values of $\epsilon$ (Equation 17). The top panels are for a box size $L\sim 1$, and the bottom panels are for a box size of $L\sim 2$ pixels.

6.3 Processing time

The measured processing time for the FGT vs. standard gridding is shown in Figure 5. Our implementation operates substantially slower than standard gridding for all parameterisations and ranges of $w$, and the operations rate is well matched by our theoretical model for the operations count.

Figure 5: Our implementation of gridding with the fast Gauss transform is substantially slower than standard gridding, and compute bound. Here we plot the ratio of compute time to grid 100 visibilities with the FGT vs. standard gridding, over a range of $w$. The solid lines are the ratio of the measured processing times, while the dotted lines are the predicted ratio in operations count (the same as the top panel of Fig. 2). The top panels are for a box size $L\sim 1$, and the bottom panels are for a box size of $L\sim 2$ pixels.

7 Discussion

7.1 FGT Implementation

The results from testing our implementation are somewhat conflicting. On the plus side, Figure 4 clearly shows that, for small values of $w$, Equation 19 overestimates the truncation error, and the value of $q^{\prime}$ being used is too large. This was the original motivation for introducing cheating, and it is clear that the error is still overestimated even for $p_{d}=3$ in some cases.

Unfortunately, for larger values of $w$, the fact that the error jumps above the predicted $\epsilon$, even with no cheating, implies that our empirical extension of the FGT to complex Gaussians (in particular Equation 19 and Equation 24) is not correct. As a result, our current cheating algorithm has limited value. Even with $p_{d}=1$, most parameterisations contain jumps in error of factors of a few to 100 over the range of $w$. A different mapping between $w$ and $q^{\prime}$ is required to maintain a constant error across $w$.

Increasing the box size above $L>2.3$ appears to have catastrophic effects on the error for small $w$. We suspect this is due either to numerical instability in evaluating high-order polynomials or, more likely, that the Hermite expansions simply fail to converge when the Gaussian width is less than some function of the box size.

Our implementation’s processing time is clearly compute-bound as our computing times match our theoretical model for the operations count almost exactly in most instances. The theoretical model suggests the operations count is dominated by the cost of computing the complex exponential, and we expect that is the case, although other overheads may also be responsible. Sadly, we were not able to approach the memory-bound performance that originally motivated this work.

7.2 Future work

In spite of the disappointing performance of our implementation, we hope that some of the ideas from this work could be extended in future. In particular:

  • The idea of having tuneable error is attractive. For arrays with high sidelobes in their synthesised beams (i.e. poor $uv$ coverage), large gridding errors can be tolerated, as long as they remain below the clean threshold for the given major cycle.

  • The required memory bandwidth of our algorithm is only weakly dependent on $|w|$, so for arrays with very long baselines, our algorithm may be suitable.

  • The Hermite expansions were originally proposed by Greengard & Strain (1991) for real Gaussians. It may be that faster-converging expansions can be found for complex Gaussians - particularly in the case for large ww where the real envelope is smooth and the complex chirp contains many peaks.

  • The form of the convolution function is smooth at the centre and oscillates more rapidly as $t$ increases, with the fastest oscillations being damped at the edges by the envelope. Currently, our approach is to use the same order for all boxes, which leads to the largest errors being made away from the centre of the convolution function (Figure 3), where the oscillations are the fastest and largest. There may, therefore, be some value in varying the order $q^{\prime}$ as a function of the distance from the centre, thereby applying the majority of the computational effort where the errors are likely to be worst.

  • We implemented the polynomial evaluations in a naive manner. More sophisticated methods exist (Knuth, 1997, p. 485ff) that could reduce the operations count and improve numerical stability.

  • To calculate the complex exponential, we used the CEXP function from the standard C library, which is accurate but slow. Faster, approximate methods are available (e.g. Schraudolph, 1998; http://gruntthepeon.free.fr/ssemath/). For an implementation whose run time is dominated by CEXP, such as ours, these methods could improve speed by a factor of a few, albeit at the cost of some accuracy.

  • Gaussian anti-aliasing functions are not well suited to imaging, as their alias rejection is not very good. The modern state of the art is to use prolate spheroidal wavefunctions (Schwab, 1980), which have better alias rejection. Unfortunately, the Hermite functions are not a suitable basis for Taylor expansion of the prolate spheroidal wavefunctions. If suitable analytic multipole expansions of the convolution of the complex chirp and the prolate spheroidal wavefunctions can be found, then they can be applied with the fast multipole method (Greengard & Rokhlin, 1987; Greengard, 1988) to obtain a similar result to our FGT, but with better anti-aliasing properties.

  • The fast multipole method could also be used to compensate for the primary beam in the $uv$-plane (AW-projection; Bhatnagar et al., 2008). Once again, this would require analytic multipole expansions of the required convolution functions. While this may not be possible with an arbitrary primary beam, it may be more tractable if we assume Gaussian primary beams.

  • For problems where the compute time of $w$-projection is dominated by the time to compute convolution functions (by Fourier transforming a phase screen in the image plane), using standard gridding with Gaussian anti-aliasing functions could be preferable, as they can be directly and efficiently computed in the $uv$ domain.

  • Easily parallelised algorithms will clearly be important for large arrays such as the SKA, where the computational work is likely to be spread over many thousands of nodes. Once the gridding problem has been distributed over the usual axes of frequency, pointing direction and polarisation, the problem may still be too large to be efficiently computed by a single node. For the gridding operation, the option of distributing and parallelising over Taylor coefficients adds an additional axis which can be exploited when distributing work among processors, over and above the traditional axes.
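The polynomial-evaluation point above can be illustrated with Horner's rule (Knuth, 1997), which replaces the per-term power computation of a naive evaluation with one multiply and one add per coefficient, and is also more numerically stable. A minimal sketch (the coefficients are an arbitrary example):

```python
def poly_naive(coeffs, x):
    """Evaluate sum of c_j * x**j term by term (one pow per term)."""
    return sum(c * x ** j for j, c in enumerate(coeffs))

def poly_horner(coeffs, x):
    """Horner's rule: one multiply and one add per coefficient."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

coeffs = [1.0, -2.0, 0.5, 3.0]  # 1 - 2x + 0.5x^2 + 3x^3
print(poly_horner(coeffs, 2.0))  # -> 23.0
```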
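The approximate-exponential point can be illustrated with Schraudolph's (1998) trick for the real exponential: write a scaled and biased argument directly into the exponent field of an IEEE-754 double. The sketch below uses the constants from the published method; accuracy is a few per cent over moderate arguments, which is the kind of speed/accuracy trade-off meant above.

```python
import math
import struct

EXP_A = (1 << 20) / math.log(2)    # 2^20 / ln 2: scales x into the exponent field
EXP_BC = 1023 * (1 << 20) - 60801  # IEEE bias term minus Schraudolph's correction

def fast_exp(x):
    """Approximate exp(x): build the high 32 bits of a double from a*x + (b - c).
    Relative error is a few per cent for moderate |x|."""
    hi = int(EXP_A * x) + EXP_BC
    return struct.unpack("<d", struct.pack("<q", hi << 32))[0]

# e.g. fast_exp(1.0) agrees with math.e to within a few per cent
```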

8 Conclusions

We have described a procedure to perform $w$-projection with the fast Gauss transform with variable scales, where the anti-aliasing function is chosen to be a Gaussian. The gridding problem is solved by the FGT with source variable scales, and the degridding problem by the FGT with target variable scales. While the theoretical efficiency of our approach is encouraging, we were not able to approach the theoretical performance gains with our implementation. Nonetheless, we find that $w$-projection with approximate algorithms such as the FGT or fast multipole methods may yet have promise, by having the attractive properties of tuneable error, an additional parallelisation axis, and no calculation and storage of convolution functions. The methods require additional research to improve the practical implementation, find expansions with faster convergence, and find closed forms with better anti-aliasing properties.

9 Acknowledgements

The authors would like to thank Maxim Voronkov and Ben Humphries for invaluable help with C++ coding, Matthew Whiting for the idea of tuneable errors within major cycles, and Oleg Smirnov for his insightful referee comments.

References

  • Bhatnagar et al. (2008) Bhatnagar, S., Cornwell, T. J., Golap, K., Uson, J. M. 2008, A&A, 487, 419
  • Cornwell et al. (2008) Cornwell, T. J., Golap, K., Bhatnagar, S. 2008, IEEE Journal of Selected Topics in Signal Processing, 2, 647
  • Greengard (1988) Greengard, L. 1988, The rapid evaluation of potential fields in particle systems, Vol. 1987 (the MIT Press)
  • Greengard & Rokhlin (1987) Greengard, L. & Rokhlin, V. 1987, Journal of Computational Physics, 73, 325
  • Greengard & Strain (1991) Greengard, L. & Strain, J. 1991, SIAM Journal on Scientific and Statistical Computing, 12, 79
  • Knuth (1997) Knuth, D. E. 1997, The Art of Computer Programming, 3rd edn., Vol. 2 (Addison-Wesley)
  • Rau & Cornwell (2011) Rau, U. & Cornwell, T. J. 2011, A&A, 532, A71
  • Schraudolph (1998) Schraudolph, N. N. 1998, Neural Computation, 11, 853
  • Schwab (1980) Schwab, F. R. 1980, Optimal Gridding, Tech. rep., VLA Scientific Memoranda 132
  • Strain (1991) Strain, J. 1991, SIAM Journal on Scientific and Statistical Computing, 12, 1131
  • Taylor et al. (1999) Taylor, G. B., Carilli, C. L., Perley, R. A., eds. 1999, Astronomical Society of the Pacific Conference Series, Vol. 180, Synthesis Imaging in Radio Astronomy II (Orem, Utah, USA: ASP)