This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A multimodular algorithm for computing Bernoulli numbers

David Harvey Courant Institute of Mathematical Sciences, New York University, 251 Mercer St, New York NY 10012, USA dmharvey@cims.nyu.edu http://cims.nyu.edu/˜harvey/
(Date: 13th October 2008)
Abstract.

We describe an algorithm for computing Bernoulli numbers. Using a parallel implementation, we have computed BkB_{k} for k=108k=10^{8}, a new record. Our method is to compute BkB_{k} modulo pp for many small primes pp, and then reconstruct BkB_{k} via the Chinese Remainder Theorem. The asymptotic time complexity is O(k2log2+εk)O(k^{2}\log^{2+\varepsilon}k), matching that of existing algorithms that exploit the relationship between BkB_{k} and the Riemann zeta function. Our implementation is significantly faster than several existing implementations of the zeta-function method.

1. Introduction

The Bernoulli numbers B0,B1,B2,B_{0},B_{1},B_{2},\ldots are rational numbers defined by

xex1=k=0Bkk!xk.\frac{x}{e^{x}-1}=\sum_{k=0}^{\infty}\frac{B_{k}}{k!}x^{k}.

In this paper we focus on the problem of computing BkB_{k}, as an exact rational number, for a single large (even) value of kk. To date, the most efficient algorithms for computing an isolated BkB_{k} have been based on the formula

(1) Bk=(1)k/2+12ζ(k)k!(2π)k,k2 and k even,B_{k}=(-1)^{k/2+1}\frac{2\zeta(k)k!}{(2\pi)^{k}},\qquad\text{$k\geq 2$ and $k$ even},

where ζ(s)\zeta(s) is the Riemann zeta function [IR90, p. 231]. One first computes a high-precision approximation for BkB_{k} via (1), using the Euler product for ζ(s)\zeta(s). The exact denominator of BkB_{k} is known by the von Staudt–Clausen theorem. Provided that the numerical approximation to BkB_{k} has enough correct bits — it turns out that O(klogk)O(k\log k) bits is enough — we may recover the numerator of BkB_{k} exactly. We will refer to this algorithm and its variants as the ‘zeta-function algorithm’.

The zeta-function algorithm has been rediscovered numerous times. The formula (1) itself goes back to Euler. Chowla and Hartung [CH72] give an ‘exact formula’ for BkB_{k}, the evaluation of which is tantamount to executing the algorithm, but using the defining sum for ζ(k)\zeta(k) instead of the Euler product, and using 2(2k1)2(2^{k}-1) in place of the exact denominator of BkB_{k}. Fillebrown [Fil92] mentions [CH72] and points out that using the Euler product is asymptotically faster; she gives a detailed time and space complexity analysis, and attributes the algorithm to Herbert Wilf (unpublished). Fee and Plouffe [FP07] state that they discovered and implemented the zeta-function algorithm in 1996. Kellner [Kel02] also refers to [CH72] and suggests using the Euler product. He mentions some data produced by Fee and Plouffe, although not their algorithm. According to Pavlyk [Pav08], the implementation in Mathematica 6 [Wol07] derives from Kellner’s work. Stein [Ste07, p. 30] indicates another independent discovery by Henri Cohen, Karim Belabas, and Bill Daly (unpublished), that led to the implementation in PARI/GP [Par07].

In this paper we present a new algorithm for computing BkB_{k} that does not depend on (1) or any properties of ζ(s)\zeta(s). The basic idea is to compute BkB_{k} modulo pp for sufficiently many small primes pp, and then reconstruct BkB_{k} using the Chinese Remainder Theorem. Using a parallel implementation, we have computed BkB_{k} for k=108k=10^{8}, improving substantially on the previous record of k=107k=10^{7} [Pav08].

Let M(n)M(n) denote the time (bit operations) required to multiply two nn-bit integers. We may assume that M(n)=O(nlog1+εn)M(n)=O(n\log^{1+\varepsilon}n), using for example the Schönhage–Strassen algorithm [SS71]. According to Fillebrown’s analysis [Fil92, p. 441], the zeta-function algorithm for computing BkB_{k} requires O(k)O(k) multiplications of integers with O(klogk)O(k\log k) bits, so its running time is O(k2log2+εk)O(k^{2}\log^{2+\varepsilon}k). The new algorithm also runs in time O(k2log2+εk)O(k^{2}\log^{2+\varepsilon}k). However, we will see that over the feasible range of computation the running time behaves more like O(k2logk)O(k^{2}\log k).

The new algorithm is easily parallelisable, as the computations for the various pp are independent. Only the final stages of the modular reconstruction are more difficult to execute in parallel, but these account for only O(k1+ε)O(k^{1+\varepsilon}) of the running time. During the modular computations, each thread requires only O(log1+εk)O(\log^{1+\varepsilon}k) space. The zeta-function algorithm may also be parallelised, for instance by mapping Euler factors to threads, but this requires O(klogk)O(k\log k) space per thread.

2. Computing BkB_{k} modulo pp

Throughout this section p5p\geq 5 denotes a prime, and kk an even integer satisfying 2kp32\leq k\leq p-3. Our algorithm for computing BkB_{k} modulo pp is based on the following well-known congruence. Let 0<c<p0<c<p, and suppose that ck1(modp)c^{k}\not\equiv 1\pmod{p}. Then

(2) Bkk1ckx=1p1xk1hc(x)(modp),B_{k}\equiv\frac{k}{1-c^{k}}\sum_{x=1}^{p-1}x^{k-1}h_{c}(x)\pmod{p},

where

hc(x):=(xmodp)c(x/cmodp)p+c12.h_{c}(x):=\frac{(x\bmod p)-c(x/c\bmod p)}{p}+\frac{c-1}{2}.

Equation (2) may be deduced by reading the statement of Theorem 2.3 of [Lan90, §2] modulo pp (as is done in the proof of Corollary 2 to that theorem).

Equation (2) may be used to compute BkB_{k} modulo pp directly, but computing xk1x^{k-1} modulo pp for each xx is unnecessarily expensive. Instead, we select a generator gg of the multiplicative group (𝐙/p𝐙)×(\mathbf{Z}/p\mathbf{Z})^{\times}, put c=gc=g, and rewrite the sum as

(3) Bkk1gki=1p1gi(k1)hg(gi)(modp).B_{k}\equiv\frac{k}{1-g^{k}}\sum_{i=1}^{p-1}g^{i(k-1)}h_{g}(g^{i})\pmod{p}.

Note that gk1(modp)g^{k}\not\equiv 1\pmod{p} since p1kp-1\mathbin{\nmid}k. For x0(modp)x\not\equiv 0\pmod{p} we have hc(x)=hc(x)h_{c}(-x)=-h_{c}(x), and g(p1)/21(modp)g^{(p-1)/2}\equiv-1\pmod{p}, so we obtain the half-length sum

(4) Bk2k1gki=1(p1)/2gi(k1)hg(gi)(modp),B_{k}\equiv\frac{2k}{1-g^{k}}\sum_{i=1}^{(p-1)/2}g^{i(k-1)}h_{g}(g^{i})\pmod{p},

leading to the following algorithm.

ga generator of (𝐙/p𝐙)×g\leftarrow\text{a generator of $(\mathbf{Z}/p\mathbf{Z})^{\times}$}
rgk1(modp)r\leftarrow g^{k-1}\pmod{p}
u(g1)/2(modp)u\leftarrow(g-1)/2\pmod{p}
S0S\leftarrow 0,   X1X\leftarrow 1,   YrY\leftarrow r
for i1i\leftarrow 1 to (p1)/2(p-1)/2 do  qgX/pq\leftarrow\lfloor gX/p\rfloor
S(S+(uq)Y)modpS\leftarrow(S+(u-q)Y)\bmod p
XgXmodpX\leftarrow gX\bmod p,   YrYmodpY\leftarrow rY\bmod p
end return 2kS/(1gk)modp2kS/(1-g^{k})\bmod p
Algorithm 1 Compute BkB_{k} modulo pp
Proposition 1.

Let p5p\geq 5 be a prime, and let kk be an even integer in the interval 2kp32\leq k\leq p-3. Algorithm 1 computes BkB_{k} modulo pp in time O(plog1+εp)O(p\log^{1+\varepsilon}p).

Proof.

At the top of the loop we have X=gi1(modp)X=g^{i-1}\pmod{p} and Y=gi(k1)(modp)Y=g^{i(k-1)}\pmod{p}. The value assigned to qq is (g(gi1modp)(gimodp))/p=uhg(gi)(g(g^{i-1}\bmod p)-(g^{i}\bmod p))/p=u-h_{g}(g^{i}). Therefore the value added to SS is gi(k1)hg(gi)g^{i(k-1)}h_{g}(g^{i}). By (4) the return value is BkB_{k} modulo pp.

We now analyse the complexity. A generator of (𝐙/p𝐙)×(\mathbf{Z}/p\mathbf{Z})^{\times} may be found deterministically in time O(p1/4+ε)O(p^{1/4+\varepsilon}) [Shp96]. In the main loop, the algorithm performs O(p)O(p) additions, multiplications and divisions (with remainder) of integers with O(logp)O(\log p) bits. The computation of rr requires another O(logk)=O(logp)O(\log k)=O(\log p) such operations. The division by 1gk1-g^{k} requires an extended GCD of integers with O(logp)O(\log p) bits. The total cost is O(plog1+εp)O(p\log^{1+\varepsilon}p). ∎

Remark.

To compute B108B_{10^{8}} (see §5), the largest prime needed is 15583220631558322063. Even on a 32-bit machine, this fits into a single machine word, and the cost of arithmetic modulo pp does not depend on pp. Therefore, over the feasible range of computation, the complexity of Algorithm 1 may be regarded as only O(p)O(p).

3. Computing BkB_{k} as a rational number

In this section we present a multimodular algorithm for computing Bk=Nk/DkB_{k}=N_{k}/D_{k} as an exact rational number. Before getting to the algorithm proper, we make some preliminary calculations. In what follows, we assume that k4k\geq 4 and that kk is even.

The denominator DkD_{k} is given by the von Staudt–Clausen theorem [IR90, p. 233]:

Dk=p primep1|kp.D_{k}=\prod_{\begin{subarray}{c}\text{$p$ prime}\\ p-1\mathbin{|}k\end{subarray}}p.

Note that Dk2k+1D_{k}\leq 2^{k+1}, since DkD_{k} divides 2(2k1)2(2^{k}-1) [CH72, p. 114].

The size of the numerator NkN_{k} may be bounded quite tightly as follows. From (1) and Stirling’s approximation [Lan97, p. 120] we have

log|Bk|\displaystyle\log|B_{k}| =log2+logζ(k)+logk!klog2π\displaystyle=\log 2+\log\zeta(k)+\log k!-k\log 2\pi
(k+12)logk(1+log2π)k+(log2+logζ(k)+log2π)+112k.\displaystyle\leq\left(k+\frac{1}{2}\right)\log k-(1+\log 2\pi)k+(\log 2+\log\zeta(k)+\log\sqrt{2\pi})+\frac{1}{12k}.

Since ζ(k)ζ(4)\zeta(k)\leq\zeta(4) and 12k4812k\geq 48, a short calculation shows that |Nk|<2β|N_{k}|<2^{\beta} where β:=(k+0.5)log2k4.094k+2.470+log2Dk\beta:=\lceil(k+0.5)\log_{2}k-4.094k+2.470+\log_{2}D_{k}\rceil.

For any X>0X>0, let MX:=pX,p1kpM_{X}:=\prod_{p\leq X,p-1\mathbin{\nmid}k}p. Our strategy will be to select XX so that MX2βM_{X}\geq 2^{\beta}, and then compute NkN_{k} modulo MXM_{X}. This ensures that we have sufficient information to reconstruct NkN_{k}. In the algorithm itself our choice of XX will be quite sharp, but we also need a rough a priori estimate. We will take Y:=max(37,(k+0.5)log2k)Y:=\max(37,\lceil(k+0.5)\log_{2}k\rceil). To check that this works, we will use the estimate pxlogp0.73x\sum_{p\leq x}\log p\geq 0.73x, a weak form of the prime number theorem, valid for x37x\geq 37 [Nar00, p. 111]. Then we have

log2MY\displaystyle\log_{2}M_{Y} =log2Dk+pYlog2p\displaystyle=-\log_{2}D_{k}+\sum_{p\leq Y}\log_{2}p
(k+1)+0.73log2Y\displaystyle\geq-(k+1)+\frac{0.73}{\log 2}Y
Yk1\displaystyle\geq Y-k-1
(k+0.5)log2kk1\displaystyle\geq(k+5)\log_{2}k-k-1
(k+0.5)log2k3.094k+6.470(since k4)\displaystyle\geq(k+5)\log_{2}k-094k+470\quad(\text{since $k\geq 4$})
(k+0.5)log2k4.094k+3.470+log2Dk+2\displaystyle\geq(k+5)\log_{2}k-094k+470+\log_{2}D_{k}+2
(k+0.5)log2k4.094k+2.470+log2Dk+2\displaystyle\geq\lceil(k+5)\log_{2}k-094k+470+\log_{2}D_{k}\rceil+2
=β+2,\displaystyle=\beta+2,

so that MY2β+2M_{Y}\geq 2^{\beta+2}. Algorithm 2 presents the resulting algorithm; several important details are given in the proof of the following theorem.

// compute DkD_{k} and preliminary bounds
Ymax(37,(k+0.5)log2k)Y\leftarrow\max(37,\lceil(k+0.5)\log_{2}k\rceil)
Compute list of all primes pYp\leq Y
Dkp1|kpD_{k}\leftarrow\prod_{p-1\mathbin{|}k}p
β(k+0.5)log2k4.094k+2.470+log2Dk\beta\leftarrow\lceil(k+0.5)\log_{2}k-4.094k+2.470+\log_{2}D_{k}\rceil
// compute tight bound XX
p3p\leftarrow 3,   M1.0M^{\prime}\leftarrow 1.0
while M<2β+1M^{\prime}<2^{\beta+1} do  pNextPrime(p)p\leftarrow\mathrm{NextPrime}(p)
if p1kp-1\mathbin{\nmid}k then MpMM^{\prime}\leftarrow pM^{\prime}
end XpX\leftarrow p
// collect modular information
for p{2,3,5,,X},p1kp\in\{2,3,5,\ldots,X\},p-1\mathbin{\nmid}k do rpBk(modp)r_{p}\leftarrow B_{k}\pmod{p}
// reconstruction
MMX=pX,p1kpM\leftarrow M_{X}=\prod_{p\leq X,p-1\mathbin{\nmid}k}p
RR\leftarrow unique integer in [0,M)[0,M) congruent to rpr_{p} modulo pp for all primes p|Mp\mathbin{|}M
NDkR(modM)N^{\prime}\leftarrow D_{k}R\pmod{M}
if k2(mod4)k\equiv 2\pmod{4} then NkNN_{k}\leftarrow N^{\prime} else NkNMN_{k}\leftarrow N^{\prime}-M
return Nk/DkN_{k}/D_{k}
Algorithm 2 Compute BkB_{k} as a rational number (k4k\geq 4, kk even)
Theorem 2.

Algorithm 2 computes BkB_{k} in time O(k2log2+εk)O(k^{2}\log^{2+\varepsilon}k).

Proof.

Computing all primes less than YY requires time O(k1+ε)O(k^{1+\varepsilon}), via an elementary sieving algorithm, since Y=O(klogk)Y=O(k\log k). In what follows we assume that all primality tests and enumeration of primes are performed with the aid of this list.

To compute DkD_{k}, make a list of the factors of kk (for example, by testing divisibility of kk by each integer in [2,k][2,\sqrt{k}]), and for each factor d|kd\mathbin{|}k check whether d+1d+1 is prime. Multiply together those that are prime using a product tree [vzGG03, Algorithm 10.3, p. 293]. Since logDk=O(k)\log D_{k}=O(k), this takes time O(k1+ε)O(k^{1+\varepsilon}).

The goal of the while-loop is to quickly find a bound XX, much tighter than YY, such that MX2βM_{X}\geq 2^{\beta}. The variable MM^{\prime} holds an approximation to the product of the primes encountered so far. It should be represented in floating-point, with at least n=log2Y+1n=\lceil\log_{2}Y\rceil+1 bits in the mantissa, so that each multiplication by pp magnifies the relative error by at most 1+2n1+2^{-n}. We claim that the loop terminates before exhausting the precomputed list of primes. Indeed, let ss be the number of pp such that pYp\leq Y and p1kp-1\mathbin{\nmid}k. After ss iterations, MM^{\prime} approximates MYM_{Y} with relative error at most (1+2n)s(1+(2Y)1)Ye1/2(1+2^{-n})^{s}\leq(1+(2Y)^{-1})^{Y}\leq e^{1/2}; that is, Me1/2MYe1/22β+22β+1M^{\prime}\geq e^{-1/2}M_{Y}\geq e^{-1/2}2^{\beta+2}\geq 2^{\beta+1}. This is impossible since the loop termination condition would already have been met earlier. This argument also shows that the selected bound XX satisfies MXe1/22β+12βM_{X}\geq e^{-1/2}2^{\beta+1}\geq 2^{\beta}. Since Y=O(klogk)Y=O(k\log k), computing XX takes time O(k1+ε)O(k^{1+\varepsilon}).

In the for-loop, for each pp we use Algorithm 1 to compute rp:=Bk(modp)r_{p}:=B_{k}\pmod{p}. Note that Algorithm 1 requires that 2kp32\leq k\leq p-3. To satisfy this requirement, we put m=kmod(p1)m=k\bmod(p-1), compute Bm(modp)B_{m}\pmod{p} using Algorithm 1, and then recover Bk(modp)B_{k}\pmod{p} via Kummer’s congruence Bk/kBm/m(modp)B_{k}/k\equiv B_{m}/m\pmod{p} [Lan90, p. 41]. The total number of primes processed is no more than π(Y)=O(Y/logY)=O(klogk/log(klogk))=O(k)\pi(Y)=O(Y/\log Y)=O(k\log k/\log(k\log k))=O(k). According to Proposition 1, the cost for each prime is O(plog1+εp)=O(klog2+εk)O(p\log^{1+\varepsilon}p)=O(k\log^{2+\varepsilon}k), since pY=O(klogk)p\leq Y=O(k\log k). Therefore the total cost of the while-loop is O(k2log2+εk)O(k^{2}\log^{2+\varepsilon}k).

We compute MM and RR simultaneously using fast Chinese Remaindering [vzGG03, Theorem 10.25, p. 302]; the cost is O((logMX)1+ε)=O((klogk)1+ε)=O(k1+ε)O((\log M_{X})^{1+\varepsilon})=O((k\log k)^{1+\varepsilon})=O(k^{1+\varepsilon}).

In the last few lines we recover NkN_{k}. By construction we have RBk(modM)R\equiv B_{k}\pmod{M}, so NNk(modM)N^{\prime}\equiv N_{k}\pmod{M}. Note that |Nk|<2βM|N_{k}|<2^{\beta}\leq M. If k2(mod4)k\equiv 2\pmod{4} then NkN_{k} is positive so we simply have Nk=NN_{k}=N^{\prime}; otherwise NkN_{k} is negative, and we must correct NN^{\prime} by subtracting MM. ∎

Remark.

During the modular collection phase, we skipped those pp for which p1|kp-1\mathbin{|}k. An alternative would be to use the fact that pBk1(modp)pB_{k}\equiv-1\pmod{p} for these pp [IR90, p. 233], and to apply the multimodular algorithm directly to NkN_{k} rather than to BkB_{k}. This would require the additional minor overhead of computing Dk(modp)D_{k}\pmod{p} for all of the other primes.

Remark.

As mentioned below the proof of Proposition 1, the running time of Algorithm 1 behaves like O(p)O(p) in practice. The corresponding bound for the running time of Algorithm 2 is O(k2logk)O(k^{2}\log k).

4. Implementation techniques for computing BkB_{k} modulo pp

In this section we sketch several techniques that yield a large constant factor improvement in the speed of computing BkB_{k} modulo pp, for almost all kk and pp. In experiments we have found that these techniques yield an increase in speed by a factor in excess of 150150 compared to an efficient implementation of Algorithm 1.

We first perform some preparatory algebra. Consider again the congruence (2). Let nn be the multiplicative order of cc in (𝐙/p𝐙)×(\mathbf{Z}/p\mathbf{Z})^{\times}, and put m=(p1)/nm=(p-1)/n. Let gg be a generator of (𝐙/p𝐙)×(\mathbf{Z}/p\mathbf{Z})^{\times}. Then {gicj}0i<m,0j<n\{g^{i}c^{-j}\}_{0\leq i<m,0\leq j<n} forms a complete set of representatives for (𝐙/p𝐙)×(\mathbf{Z}/p\mathbf{Z})^{\times}. Putting x=gicjx=g^{i}c^{-j}, we obtain

(5) Bkk1ck0i<m0j<n(gk1)i(ck1)jhc(gicj)(modp).B_{k}\equiv\frac{k}{1-c^{k}}\sum_{\begin{subarray}{c}0\leq i<m\\ 0\leq j<n\end{subarray}}(g^{k-1})^{i}(c^{k-1})^{-j}h_{c}(g^{i}c^{-j})\pmod{p}.

Considerations similar to those of §2 allow us to halve the number of terms, as follows. First suppose that nn is even. Since hc(x)=hc(x)h_{c}(-x)=-h_{c}(x) and cn/21(modp)c^{n/2}\equiv-1\pmod{p}, the double sum in (5) becomes

0i<m0j<n/2(gk1)i(ck1)jhc(gicj)+0i<m0j<n/2(gk1)i(ck1)jn/2hc(gicjn/2)\displaystyle\mathrel{\phantom{\equiv}}\sum_{\begin{subarray}{c}0\leq i<m\\ 0\leq j<n/2\end{subarray}}(g^{k-1})^{i}(c^{k-1})^{-j}h_{c}(g^{i}c^{-j})+\sum_{\begin{subarray}{c}0\leq i<m\\ 0\leq j<n/2\end{subarray}}(g^{k-1})^{i}(c^{k-1})^{-j-n/2}h_{c}(g^{i}c^{-j-n/2})
20i<m0j<n/2(gk1)i(ck1)jhc(gicj)(modp).\displaystyle\equiv 2\sum_{\begin{subarray}{c}0\leq i<m\\ 0\leq j<n/2\end{subarray}}(g^{k-1})^{i}(c^{k-1})^{-j}h_{c}(g^{i}c^{-j})\pmod{p}.

Now suppose that nn is odd. Then mm is even, and we have (gm/2)n(1)ng(p1)/2(1)(1)1(modp)(-g^{m/2})^{n}\equiv(-1)^{n}g^{(p-1)/2}\equiv(-1)(-1)\equiv 1\pmod{p}, so that gm/2cλ(modp)-g^{m/2}\equiv c^{\lambda}\pmod{p} for some 0λ<n0\leq\lambda<n. The double sum in (5) becomes

0i<m/20j<n(gk1)i(ck1)jhc(gicj)+0i<m/20j<n(gk1)i+m/2(ck1)jhc(gi+m/2cj)\displaystyle\mathrel{\phantom{\equiv}}\sum_{\begin{subarray}{c}0\leq i<m/2\\ 0\leq j<n\end{subarray}}(g^{k-1})^{i}(c^{k-1})^{-j}h_{c}(g^{i}c^{-j})+\sum_{\begin{subarray}{c}0\leq i<m/2\\ 0\leq j<n\end{subarray}}(g^{k-1})^{i+m/2}(c^{k-1})^{-j}h_{c}(g^{i+m/2}c^{-j})
0i<m/20j<n(gk1)i(ck1)jhc(gicj)0i<m/20j<n(gk1)i(ck1)j+λhc(gicj+λ)\displaystyle\equiv\sum_{\begin{subarray}{c}0\leq i<m/2\\ 0\leq j<n\end{subarray}}(g^{k-1})^{i}(c^{k-1})^{-j}h_{c}(g^{i}c^{-j})-\sum_{\begin{subarray}{c}0\leq i<m/2\\ 0\leq j<n\end{subarray}}(g^{k-1})^{i}(c^{k-1})^{-j+\lambda}h_{c}(-g^{i}c^{-j+\lambda})
20i<m/20j<n(gk1)i(ck1)jhc(gicj)(modp).\displaystyle\equiv 2\sum_{\begin{subarray}{c}0\leq i<m/2\\ 0\leq j<n\end{subarray}}(g^{k-1})^{i}(c^{k-1})^{-j}h_{c}(g^{i}c^{-j})\pmod{p}.

We combine both cases into a single statement by writing

(6) Bk2k1ck0i<m(gk1)i0j<n(ck1)jhc(gicj)(modp),B_{k}\equiv\frac{2k}{1-c^{k}}\sum_{0\leq i<m^{\prime}}(g^{k-1})^{i}\sum_{0\leq j<n^{\prime}}(c^{k-1})^{-j}h_{c}(g^{i}c^{-j})\pmod{p},

where

n={n/2n even,nn odd,m=(p1)/2n.n^{\prime}=\begin{cases}n/2&\text{$n$ even,}\\ n&\text{$n$ odd,}\end{cases}\qquad m^{\prime}=\frac{(p-1)/2}{n^{\prime}}.

The sum in (6) may be evaluated by an algorithm similar to Algorithm 1, with an outer loop over ii and an inner loop over jj; we omit the details. The inner loop is executed nm=(p1)/2n^{\prime}m^{\prime}=(p-1)/2 times, the same number of iterations as in the main loop of Algorithm 1. Note that if cc is chosen ‘randomly’, then we expect nn^{\prime} to be quite large, so it is imperative to make the inner loop as efficient as possible.

To this end, we specialise to the case c=1/2(modp)c=1/2\pmod{p}; this will allow us to exploit the fact that multiplication by 22 modulo pp is very cheap. Of course, this specialisation does not work if 2k1(modp)2^{k}\equiv 1\pmod{p}. In practice, such primes tend to be rare for any given kk, and we may fall back on Algorithm 1 to handle them.

Directly from the definition of hc(x)h_{c}(x), we have h1/2(x)=f(x)/4h_{1/2}(x)=-f(x)/4 where

f(x):={10<(xmodp)<p/2,1p/2<(xmodp)<p.f(x):=\begin{cases}1&\phantom{p/}0<(x\bmod p)<p/2,\\ -1&p/2<(x\bmod p)<p.\end{cases}

From (6) we obtain

(7) Bkk2(2k1)0i<m(gk1)i0j<n(2k1)jf(gi2j)(modp),B_{k}\equiv\frac{k}{2(2^{-k}-1)}\sum_{0\leq i<m^{\prime}}(g^{k-1})^{i}\sum_{0\leq j<n^{\prime}}(2^{k-1})^{j}f(g^{i}2^{j})\pmod{p},

where nn is the order of 22 in (𝐙/p𝐙)×(\mathbf{Z}/p\mathbf{Z})^{\times}, and nn^{\prime} and mm^{\prime} are defined as before.

In the algorithm that evaluates (7), the inner loop is considerably simpler than the main loop of Algorithm 1. Given gi2jg^{i}2^{j} at the beginning of the loop, we compute the next term gi2j+1g^{i}2^{j+1} by doubling it and subtracting pp if necessary, and f(gi2j)f(g^{i}2^{j}) is 1-1 or +1+1 according to whether the subtraction occurred. Therefore only a single modular multiplication is required on each iteration, to keep track of (2k1)j(2^{k-1})^{j}.

Next we describe further optimisations for computing a sum of the form

(8) 0j<Nrjf(2js)(modp).\sum_{0\leq j<N}r^{j}f(2^{j}s)\pmod{p}.

(For evaluating (7), we take N=nN=n^{\prime}, r=2k1r=2^{k-1}, s=gis=g^{i}.) The following observation is crucial: if 0<s<p0<s<p, and if the binary expansion of s/ps/p is 0.b0b1b20.b_{0}b_{1}b_{2}\ldots (where each bjb_{j} is 0 or 1), then f(2js)=(1)bjf(2^{j}s)=(-1)^{b_{j}}. Indeed, the binary expansion of 2js/p2^{j}s/p is obtained by shifting that of s/ps/p to the left by jj digits, so

(2jsmodp)/p=0.bjbj+1,(2^{j}s\bmod p)/p=0.b_{j}b_{j+1}\ldots,

and then

f(2js)=+10<(2jsmodp)<p/2bj=0.f(2^{j}s)=+1\iff 0<(2^{j}s\bmod p)<p/2\iff b_{j}=0.

Assume that we have available a routine for computing the binary expansion of s/ps/p, whose output is a sequence of base-2m2^{m} digits. Strip off the last NmodmN\bmod m terms of (8) and compute them separately; we may then assume that NN is divisible by mm. Put j=mt+uj=mt+u, so that (8) becomes

(9) 0t<N/m(rm)t0u<mruf(2mt+us).\sum_{0\leq t<N/m}(r^{m})^{t}\sum_{0\leq u<m}r^{u}f(2^{mt+u}s).

We may now use a caching strategy as follows. Maintain a table {Tσ}\{T_{\sigma}\} of length 2m2^{m}, where σ{+1,1}m\sigma\in\{+1,-1\}^{m}. Each TσT_{\sigma} is a residue modulo pp, and is initialised to zero. For each 0t<N/m0\leq t<N/m, add (rm)t(r^{m})^{t} to the table element TσT_{\sigma}, where σ\sigma is determined from the tt-th base-2m2^{m} digit of s/ps/p by

σ=(f(2mts),f(2mt+1s),,f(2mt+m1s)).\sigma=(f(2^{mt}s),f(2^{mt+1}s),\ldots,f(2^{mt+m-1}s)).

(In practice, σ\sigma is represented by the sequence of bits themselves, and is obtained by a simple bit-mask.) After finishing the loop over tt, compute the 2m2^{m} values Vσ=0u<mruσuV_{\sigma}=\sum_{0\leq u<m}r^{u}\sigma_{u}, where σ=(σ0,,σm1)\sigma=(\sigma_{0},\ldots,\sigma_{m-1}). Then the desired sum is given by σVσTσ\sum_{\sigma}V_{\sigma}T_{\sigma}. The main benefit of this approach is that in the inner loop we only perform N/mN/m modular multiplications, instead of NN. The tradeoff is that we must maintain the table {Tσ}\{T_{\sigma}\}, and some extra work is required at the end to assemble the information from the table. For this to be worthwhile, we should have N2mN\gg 2^{m}. We should also choose mm so that the base-2m2^{m} digits of s/ps/p can be extracted efficiently. Moreover, memory locality problems can offset the savings in arithmetic if mm is too large. We have found in practice that m=8m=8 works quite well.

In the implementation discussed in §5, we made several further optimisations, which we describe only briefly.

For better memory locality — and indeed less total memory consumption — we split (9) into blocks and process each block separately. This avoids needing to store and then later retrieve too many bits of the expansion of s/ps/p. Instead of computing s/ps/p for each ss separately, we precompute an approximation to 1/p1/p, and then multiply it by each ss that occurs, taking advantage of the speed of multiplication by a single-word integer compared to the corresponding division.

We use several tables instead of just one. For example, on a 32-bit machine, we use four tables: the highest 8 bits of each word of s/ps/p contribute to the first table, the next 8 bits to the second table, and so on. This reduces further the number of modular multiplications. When adding values into a table, we skip the reduction modulo pp, delaying this until the end of the loop. Of course, this is only possible when the word size is large enough compared to pp, otherwise overflow may occur. Finally, instead of re-initialising the table on each iteration of the outer loop, we simply continue to accumulate data into the same table. This partly mitigates against the overhead incurred when mm^{\prime} is unusually large compared to nn^{\prime}.

Most of the modular arithmetic uses Montgomery’s reduction algorithm [Mon85].

5. Performance data

The author implemented the above algorithms in a C++ package named bernmm (Bernoulli multi-modular). The source code is released under the GNU General Public License (GPL), and is freely available from the author’s web page. It depends on the GMP library [Gra08] for arbitrary precision arithmetic, and on NTL [Sho07] for some of the single-precision modular arithmetic. It is included as a standard component of the Sage computer algebra system (version 3.1.2) [SJ05], accessible via the bernoulli function.

The parallel version distributes the modular computations among the requested number of threads, and depends on the standard pthreads API. For the reconstruction phase, the lowest layers of the subproduct tree are performed in parallel, but the number of threads decreases towards the top of the tree, since the extended GCD routine is not threaded.

Table 1 shows timing data for 104k10810^{4}\leq k\leq 10^{8}, spaced by intervals of 10\sqrt{10}, for bernmm and several existing implementations of the zeta-function algorithm. The timings were obtained on a 16-core 2.6GHz AMD Opteron (64-bit) machine with 96 GB RAM, running Ubuntu Linux. The last two rows of the table (k=107.5k=\lfloor 10^{7.5}\rfloor and k=108k=10^{8}) improve on the prior record k=107k=10^{7} set by Pavlyk [Pav08]. The values B31622776B_{31622776} and B108B_{10^{8}} may be downloaded from the author’s web page.

The compiler used was gcc 4.1.3. We used the version of GMP shipped with Sage 3.1.2, which is the same as GMP 4.2.2, but also includes assembly code written by Pierrick Gaudry that improves performance on the Opteron, and code by Niels Möller that implements a quasi-linear time extended GCD algorithm [Möl08]. We used NTL version 5.4.1, also included in Sage.

The timings for PARI/GP were obtained for version 2.3.3, using the same patched version of GMP. We used the bernfrac function from the GP interpreter. The timings for calcbn, a specialised program for computing Bernoulli numbers, were obtained from calcbn 2.0 [Kel08], again using the same patched GMP. The timings for Mathematica were obtained from the Linux x86 (64-bit) version of Mathematica 6.0, using the BernoulliB function. Neither PARI/GP nor Mathematica supply a parallel implementation as far as we are aware.

The peak memory usage for computing B108B_{10^{8}} was approximately 5 GB; this occurred during the final extended GCD operation. By comparison, PARI/GP used 4.3 GB for k=107k=10^{7}, and calcbn used 1.7 GB for k=107k=10^{7} (with ten threads).

For each implementation, we observe that the ratio of the times in adjacent rows of the table is approximately 10, reflecting the predicted quasi-quadratic asymptotic running time of both the zeta-function algorithm and the multimodular algorithm.

We checked our results by two methods. First, we checked that as a real number, the proposed value for BkB_{k} agrees with the estimate |Bk|2(2π)kk!|B_{k}|\approx 2(2\pi)^{-k}k!. Since we computed BkB_{k} by modular information alone, this is a very strong consistency check. Second, we reduced modulo pp the proposed value for BkB_{k}, and compared this against the output of Algorithm 1, for a few primes distinct from those primes that we used to compute BkB_{k}. Again, this is a very strong consistency check — stronger in fact, since it involves all of the bits of the proposed BkB_{k}.

Mathematica, PARI/GP and calcbn use GMP internally, so naturally any improvement in GMP’s large integer arithmetic will improve the timings reported in the table. In this connection we mention the improvements in multiplication speed reported by [GKZ07]; their paper also gives comparisons with other implementations of large-integer arithmetic.

PARI MMA calcbn bernmm
CPU CPU CPU wall CPU wall
kk threads time time time time time time
10410^{4} 1 0.07 0.18 0.04 0.25
104.5\lfloor 10^{4.5}\rfloor 1 0.68 1.80 0.49 1.02
10510^{5} 1 7.45 21.7 5.72 5.46
105.5\lceil 10^{5.5}\rceil 1 88.3 285 82.2 38.6
2 82.3 41.6 38.6 20.4
10610^{6} 1 1058 3434 935 340
4 967 246 341 96.1
106.5\lceil 10^{6.5}\rceil 1 1.241041.24\hskip 1.0pt\mathord{\cdot}\hskip 1.0pt10^{4} 4.251044.25\hskip 1.0pt\mathord{\cdot}\hskip 1.0pt10^{4} 3397
4 1.331041.33\hskip 1.0pt\mathord{\cdot}\hskip 1.0pt10^{4} 3385 3406 909
10710^{7} 1 1.421051.42\hskip 1.0pt\mathord{\cdot}\hskip 1.0pt10^{5} 5.101055.10\hskip 1.0pt\mathord{\cdot}\hskip 1.0pt10^{5}
10 1.601051.60\hskip 1.0pt\mathord{\cdot}\hskip 1.0pt10^{5} 1.611041.61\hskip 1.0pt\mathord{\cdot}\hskip 1.0pt10^{4} 3.591043.59\hskip 1.0pt\mathord{\cdot}\hskip 1.0pt10^{4} 3938
107.5\lfloor 10^{7.5}\rfloor 10 3.861053.86\hskip 1.0pt\mathord{\cdot}\hskip 1.0pt10^{5} 4.031044.03\hskip 1.0pt\mathord{\cdot}\hskip 1.0pt10^{4}
10810^{8} 10 4.191064.19\hskip 1.0pt\mathord{\cdot}\hskip 1.0pt10^{6} 4.271054.27\hskip 1.0pt\mathord{\cdot}\hskip 1.0pt10^{5}
Table 1. Time (in seconds) for computing BkB_{k} for PARI/GP, Mathematica, calcbn and bernmm.

Acknowledgments

Many thanks to Joe Buhler, Bernd Kellner, and Andrew Sutherland for their comments on a draft of this paper, and to the Department of Mathematics at Harvard University for providing the hardware on which the computations were performed.

References

  • [CH72] S. Chowla and P. Hartung, An “exact” formula for the mm-th Bernoulli number, Acta Arith. 22 (1972), 113–115.
  • [Fil92] Sandra Fillebrown, Faster computation of Bernoulli numbers, J. Algorithms 13 (1992), no. 3, 431–445.
  • [FP07] Greg Fee and Simon Plouffe, An efficient algorithm for the computation of Bernoulli numbers, 2007, preprint arXiv:math.NT/0702300v2.
  • [GKZ07] Pierrick Gaudry, Alexander Kruppa, and Paul Zimmermann, A GMP-based Implementation of Schönhage–Strassen’s Large Integer Multiplication Algorithm, ISSAC (Dongming Wang, ed.), ACM, 2007, pp. 167–174.
  • [Gra08] Torbjörn Granlund, The GNU Multiple Precision Arithmetic library, 2008, http://gmplib.org/.
  • [IR90] Kenneth Ireland and Michael Rosen, A classical introduction to modern number theory, 2 ed., Graduate Texts in Mathematics, vol. 84, Springer-Verlag, 1990.
  • [Kel02] Bernd C. Kellner, Über irreguläre Paare höherer Ordnungen, Diplomarbeit, 2002.
  • [Kel08] by same author, calcbn, 2008, http://www.bernoulli.org/.
  • [Lan90] Serge Lang, Cyclotomic fields I and II, second ed., Graduate Texts in Mathematics, vol. 121, Springer-Verlag, New York, 1990.
  • [Lan97] by same author, Undergraduate analysis, second ed., Undergraduate Texts in Mathematics, Springer-Verlag, New York, 1997.
  • [Möl08] Niels Möller, On Schönhage’s algorithm and subquadratic integer GCD computation, Math. Comp. 77 (2008), no. 261, 589–607 (electronic).
  • [Mon85] Peter L. Montgomery, Modular multiplication without trial division, Math. Comp. 44 (1985), no. 170, 519–521.
  • [Nar00] Władysław Narkiewicz, The development of prime number theory, Springer Monographs in Mathematics, Springer-Verlag, Berlin, 2000, From Euclid to Hardy and Littlewood.
  • [Par07] The PARI Group, Bordeaux, PARI/GP, version 2.3.3, 2007, available from http://pari.math.u-bordeaux.fr/.
  • [Pav08] Oleksandr Pavlyk, Today We Broke the Bernoulli Record: From the Analytical Engine to Mathematica, 2008, posted on Wolfram Blog (http://blog.wolfram.com/) on 29th April 2008, retrieved 15th June 2008.
  • [Sho07] Victor Shoup, NTL: A library for doing number theory, http://www.shoup.net/ntl/, 2007.
  • [Shp96] Igor Shparlinski, On finding primitive roots in finite fields, Theoret. Comput. Sci. 157 (1996), no. 2, 273–275.
  • [SJ05] William Stein and David Joyner, Sage: System for algebra and geometry experimentation, Communications in Computer Algebra (ACM SIGSAM Bulletin) 39 (2005), no. 2, 61–64.
  • [SS71] A. Schönhage and V. Strassen, Schnelle Multiplikation grosser Zahlen, Computing (Arch. Elektron. Rechnen) 7 (1971), 281–292.
  • [Ste07] William Stein, Modular forms, a computational approach, Graduate Studies in Mathematics, vol. 79, American Mathematical Society, Providence, RI, 2007, With an appendix by Paul E. Gunnells.
  • [vzGG03] Joachim von zur Gathen and Jürgen Gerhard, Modern computer algebra, second ed., Cambridge University Press, Cambridge, 2003.
  • [Wol07] Wolfram Research, Inc., Mathematica 6.0, 2007, http://www.wolfram.com/.