
Tight Bounds for Hashing Block Sources
(An extended abstract of this paper will appear in RANDOM '08 [CV08].)

Kai-Min Chung
Harvard University
Work done when visiting U.C. Berkeley, supported by US-Israel BSF grant 2002246 and NSF grant CNS-0430336.
   Salil Vadhan
Harvard University
Work done when visiting U.C. Berkeley, supported by the Miller Institute for Basic Research in Science, a Guggenheim Fellowship, US-Israel BSF grant 2006060, and ONR grant N00014-04-1-0478.
Abstract

It is known that if a 2-universal hash function $H$ is applied to elements of a block source $(X_1,\ldots,X_T)$, where each item $X_i$ has enough min-entropy conditioned on the previous items, then the output distribution $(H,H(X_1),\ldots,H(X_T))$ will be “close” to the uniform distribution. We provide improved bounds on how much min-entropy per item is required for this to hold, both when we ask that the output be close to uniform in statistical distance and when we only ask that it be statistically close to a distribution with small collision probability. In both cases, we reduce the dependence of the min-entropy on the number $T$ of items from $2\log T$ in previous work to $\log T$, which we show to be optimal. This leads to corresponding improvements to the recent results of Mitzenmacher and Vadhan (SODA '08) on the analysis of hashing-based algorithms and data structures when the data items come from a block source.

1 Introduction

A block source is a sequence of items $\mathbf{X}=(X_1,\dots,X_T)$ in which each item has at least some $k$ bits of “entropy” conditioned on the previous ones [CG88]. Previous works [CG88, Zuc96, MV08] have analyzed what happens when one applies a 2-universal hash function to each item in such a sequence, establishing results of the following form:

Block-Source Hashing Theorems (informal): If $(X_1,\ldots,X_T)$ is a block source with $k$ bits of “entropy” per item and $H$ is a random hash function from a 2-universal family mapping to $m\ll k$ bits, then $(H(X_1),\ldots,H(X_T))$ is “close” to the uniform distribution.

In this paper, we prove new results of this form, achieving improved (in some cases, optimal) bounds on how much entropy $k$ per item is needed to ensure that the output is close to uniform, as a function of the other parameters (the output length $m$ of the hash functions, the number $T$ of items, and the “distance” from the uniform distribution). But first we discuss the two applications that have motivated the study of Block-Source Hashing Theorems.

1.1 Applications of Block-Source Hashing

Randomness Extractors.

A randomness extractor is an algorithm that extracts almost-uniform bits from a source of biased and correlated bits, using a short seed of truly random bits as a catalyst [NZ96]. Extractors have many applications in theoretical computer science and have played a central role in the theory of pseudorandomness. (See the surveys [NT99, Sha04, Vad07].) Block-Source Hashing Theorems immediately yield methods for extracting randomness from block sources, where the seed is used to specify a universal hash function. The gain over hashing the entire $T$-tuple at once is that the blocks may be much shorter than the entire sequence, and thus a much shorter seed is required to specify the universal hash function. Moreover, many subsequent constructions of extractors for general sources (without the block structure) work by first converting the source into a block source and performing block-source hashing.

Analysis of Hashing-Based Algorithms.

The idea of hashing has been widely applied in designing algorithms and data structures, including hash tables [Knu98], Bloom filters [BM03], summary algorithms for data streams [Mut03], etc. Given a stream of data items $(x_1,\dots,x_T)$, we first hash the items into $(H(x_1),\dots,H(x_T))$, and then carry out a computation using the hashed values. In the literature, the analysis of a hashing algorithm is typically a worst-case analysis over the input data items, and the best results are often obtained by unrealistically modelling the hash function as a truly random function mapping the items to uniform and independent $m$-bit strings. On the other hand, for realistic, efficiently computable hash functions (e.g., 2-universal or $O(1)$-wise independent hash functions), the provable performance is sometimes significantly worse. However, such gaps seem not to show up in practice, and even standard 2-universal hash functions empirically seem to match the performance of truly random hash functions. To explain this phenomenon, Mitzenmacher and Vadhan [MV08] suggested that the discrepancy is due to worst-case analysis, and proposed to instead model the input items as coming from a block source. Block-Source Hashing Theorems then imply that the performance of universal hash functions is close to that of truly random hash functions, provided that each item has enough bits of entropy.

1.2 How Much Entropy is Required?

A natural question about Block-Source Hashing Theorems is: how large does the “entropy” $k$ per item need to be to ensure a certain amount of “closeness” to uniform (where both the entropy and the closeness can be measured in various ways)? This question also has practical significance for the latter motivation regarding hashing-based algorithms, as it corresponds to the amount of entropy we need to assume in the data items. Mitzenmacher and Vadhan [MV08] provide bounds on the entropy required for two measures of closeness, and use these as basic tools to bound the required entropy in various applications. The requirement is usually some small constant multiple of $\log T$, where $T$ is the number of items in the source, which can be on the borderline between a reasonable and an unreasonable assumption about real-life data. It is therefore interesting to pin down the optimal answers to these questions. In what follows, we first summarize the previous results, and then discuss our improved analysis and the corresponding lower bounds.

A standard way to measure the distance of the output from the uniform distribution is by statistical distance. (The statistical distance of two random variables $X$ and $Y$ is $\Delta(X,Y)=\max_{T}|\Pr[X\in T]-\Pr[Y\in T]|$, where $T$ ranges over all possible events.) In the randomness extractor literature, classic results [CG88, ILL89, Zuc96] show that using 2-universal hash functions, $k=m+2\log(T/\varepsilon)+O(1)$ bits of min-entropy (or even Renyi entropy) per item suffice for the output distribution to be $\varepsilon$-close to uniform in statistical distance. (The min-entropy of a random variable $X$ is $\mathrm{H}_{\infty}(X)=\min_{x}\log(1/\Pr[X=x])$. All of the results mentioned actually hold for the less stringent measure of Renyi entropy $\mathrm{H}_{2}(X)=\log(1/\mathrm{E}_{x\leftarrow X}[\Pr[X=x]])$.) Sometimes a less stringent closeness requirement suffices, where we only require that the output distribution be $\varepsilon$-close to a distribution having “small” collision probability. (The collision probability of a random variable $X$ is $\sum_{x}\Pr[X=x]^{2}$. By “small collision probability,” we mean that the collision probability is within a constant factor of that of the uniform distribution.) A result of [MV08] shows that $k=m+2\log T+\log(1/\varepsilon)+O(1)$ suffices to achieve this requirement. Using 4-wise independent hash functions, [MV08] further reduce the required entropy to $k=\max\{m+\log T,\,\frac{1}{2}(m+3\log T+\log(1/\varepsilon))\}+O(1)$.

Setting | Previous Results | Our Results
2-universal hashing, $\varepsilon$-close to uniform | $m+2\log T+2\log(1/\varepsilon)$ [CG88, ILL89, Zuc96] | $m+\log T+2\log(1/\varepsilon)$
2-universal hashing, $\varepsilon$-close to small cp. | $m+2\log T+\log(1/\varepsilon)$ [MV08] | $m+\log T+\log(1/\varepsilon)$
4-wise indep. hashing, $\varepsilon$-close to small cp. | $\max\{m+\log T,\ \frac{1}{2}(m+3\log T+\log(1/\varepsilon))\}$ [MV08] | $\max\{m+\log T,\ \frac{1}{2}(m+2\log T+\log(1/\varepsilon))\}$

Table 1: Our Results. Each entry denotes the min-entropy (actually, Renyi entropy) required per item when hashing a block source of $T$ items to $m$-bit strings to ensure that the output has statistical distance at most $\varepsilon$ from uniform (or from having collision probability within a constant factor of uniform). Additive constants are omitted for readability.

Our Results.

We reduce the entropy required in the previous results, as summarized in Table 1. Roughly speaking, we save an additive $\log T$ bits of min-entropy (or Renyi entropy) in all cases. We show that using universal hash functions, $k=m+\log T+2\log(1/\varepsilon)+O(1)$ bits per item are sufficient for the output to be $\varepsilon$-close to uniform, and $k=m+\log(T/\varepsilon)+O(1)$ is enough for the output to be $\varepsilon$-close to having small collision probability. Using 4-wise independent hash functions, the required entropy $k$ further reduces to $\max\{m+\log T,\,\frac{1}{2}(m+2\log T+\log(1/\varepsilon))\}+O(1)$. The results hold even if we consider the joint distribution $(H,H(X_1),\dots,H(X_T))$ (corresponding to “strong extractors” in the literature on randomness extractors). Substituting our improved bounds into the analysis of hashing-based algorithms from [MV08], we obtain similar reductions in the min-entropy required for every application with 2-universal hashing. With 4-wise independent hashing, we obtain a slight improvement for Linear Probing, and for the other applications, we show that the previous bounds can already be achieved with 2-universal hashing. The results are summarized in Table 2.

Although the $\log T$ improvement seems small, we remark that it can be significant for practical settings of parameters. For example, suppose we want to hash 64 thousand internet traffic flows, so $\log T\approx 16$. Each flow is specified by the 32-bit IP addresses and 16-bit port numbers for the source and destination, plus the 8-bit transport protocol, for a total of 104 bits. There is a noticeable difference between assuming that each flow contains $3\log T\approx 48$ versus $4\log T\approx 64$ bits of entropy, since the flows are only 104 bits long and are very structured.

We also prove corresponding lower bounds showing that our upper bounds are almost tight. Specifically, we show that when the data items do not have enough entropy, the joint distribution $(H,H(X_1),\dots,H(X_T))$ can be “far” from uniform. More precisely, we show that if $k=m+\log T+2\log(1/\varepsilon)-O(1)$, then there exists a block source $(X_1,\dots,X_T)$ with $k$ bits of min-entropy per item such that the distribution $(H,H(X_1),\dots,H(X_T))$ is $\varepsilon$-far from uniform in statistical distance (for $H$ coming from any hash family). This matches our upper bound up to an additive constant. Similarly, we show that if $k=m+\log T-O(1)$, then there exists a block source $(X_1,\dots,X_T)$ with $k$ bits of min-entropy per item such that the distribution $(H,H(X_1),\dots,H(X_T))$ is $0.99$-far from having small collision probability (for $H$ coming from any hash family). This matches our upper bound up to an additive constant in case the statistical distance parameter $\varepsilon$ is constant; we also exhibit a specific 2-universal family for which the $\log(1/\varepsilon)$ in our upper bound is nearly tight: it cannot be reduced below $\log(1/\varepsilon)-\log\log(1/\varepsilon)$. Finally, we extend all of our lower bounds to the case where we only consider the distribution of hashed values $(H(X_1),\ldots,H(X_T))$, rather than their joint distribution with the hash function $H$. For this case, the lower bounds are necessarily reduced by a term that depends on the size of the hash family. (For standard constructions of universal hash functions, this amounts to $\log n$ bits of entropy, where $n$ is the bit-length of an individual item.)

Type of Hash Family | Previous Results [MV08] | Our Results

Linear Probing
2-universal hashing | $4\log T$ | $3\log T$
4-wise independence | $2.5\log T$ | $2\log T$

Balanced Allocations with $d$ Choices
2-universal hashing | $(d+2)\log T$ | $(d+1)\log T$
4-wise independence | $(d+1)\log T$ |

Bloom Filters
2-universal hashing | $4\log T$ | $3\log T$
4-wise independence | $3\log T$ |

Table 2: Applications. Each entry denotes the min-entropy (actually, Renyi entropy) required per item to ensure that the performance of the given application is “close” to the performance when using truly random hash functions. In all cases, the bounds omit additive terms that depend on how close a performance is desired, and we restrict to the (standard) case that the size of the hash table is linear in the number of items being hashed, i.e. $m=\log T+O(1)$.

Techniques.

At a high level, all of the previous analyses for hashing block sources were loose due to summing error probabilities over the $T$ blocks. Our improvements come from avoiding this linear blow-up by choosing more refined measures of error. For example, when we want the output to have small statistical distance from uniform, the classic Leftover Hash Lemma [ILL89] says that min-entropy $k=m+2\log(1/\varepsilon_0)$ suffices for a single hashed block to be $\varepsilon_0$-close to uniform, and then a “hybrid argument” implies that the joint distribution of $T$ hashed blocks is $T\varepsilon_0$-close to uniform [Zuc96]. Setting $\varepsilon_0=\varepsilon/T$, this leads to a min-entropy requirement of $k=m+2\log(1/\varepsilon)+2\log T$ per block. We obtain a better bound, reducing $2\log T$ to $\log T$, by using Hellinger distance to analyze the error accumulation over blocks, and only passing to statistical distance at the end.

For the case where we only want the output to be close to having small collision probability, the previous analysis of [MV08] worked by first showing that the expected collision probability of each hashed block $h(X_i)$ is “small” even conditioned on previous blocks, then using Markov's Inequality to deduce that each hashed block has small collision probability except with some probability $\varepsilon_0$, and finally doing a union bound to deduce that all hashed blocks have small collision probability except with probability $T\varepsilon_0$. We avoid the union bound by working with more refined notions of “conditional collision probability,” which enable us to apply Markov's Inequality to the entire sequence rather than to each block individually.

The starting point for our negative results is the tight lower bound for randomness extractors due to Radhakrishnan and Ta-Shma [RT00]. Their methods show that if the min-entropy parameter $k$ is not large enough, then for any hash family, there exists a (single-block) source $X$ such that $h(X)$ is “far” from uniform (in statistical distance) for “many” hash functions $h$. We then take our block source $(X_1,\ldots,X_T)$ to consist of $T$ i.i.d. copies of $X$, and argue that the statistical distance from uniform grows sufficiently fast with the number $T$ of copies taken. For example, we show that if two distributions have statistical distance $\varepsilon$, then their $T$-fold products have statistical distance $\Omega(\min\{1,\sqrt{T}\cdot\varepsilon\})$, strengthening a previous bound of Reyzin [Rey04], who proved a bound of $\Omega(\min\{\varepsilon^{1/3},\sqrt{T}\cdot\varepsilon\})$.
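To illustrate the $\sqrt{T}$ growth numerically (this small check is ours, not part of the paper), one can take two biased coins at statistical distance $\varepsilon$; the distance between their $T$-fold products equals the distance between the corresponding binomial distributions, which can be computed exactly. The parameters below are arbitrary.

```python
from math import comb, sqrt

def binom_pmf(T, q, k):
    """Probability of exactly k ones among T independent Bernoulli(q) trials."""
    return comb(T, k) * q**k * (1 - q)**(T - k)

def product_distance(T, eps):
    """Statistical distance between the T-fold products of Bernoulli(1/2)
    and Bernoulli(1/2 + eps); both products depend only on the number of ones."""
    return 0.5 * sum(abs(binom_pmf(T, 0.5, k) - binom_pmf(T, 0.5 + eps, k))
                     for k in range(T + 1))

eps = 0.05   # single-copy statistical distance
for T in [1, 4, 16, 64, 256]:
    print(T, round(product_distance(T, eps), 3),
          "vs sqrt(T)*eps =", round(sqrt(T) * eps, 3))
```

The output grows roughly like $\sqrt{T}\cdot\varepsilon$ before saturating at a constant, consistent with the bound stated above.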

2 Preliminaries

Notations.

All logs are base 2. We use the convention that $N=2^n$, $K=2^k$, and $M=2^m$. We think of a data item $X$ as a random variable over $[N]=\{1,\dots,N\}$, which can be viewed as the set of $n$-bit strings. A hash function $h:[N]\rightarrow[M]$ hashes an item to an $m$-bit string. A hash function family $\mathcal{H}$ is a multiset of hash functions, and $H$ will usually denote a uniformly random hash function drawn from $\mathcal{H}$. $U_{[M]}$ denotes the uniform distribution over $[M]$. Let $\mathbf{X}=(X_1,\dots,X_T)$ be a sequence of data items. We use $X_{<i}$ to denote the first $i-1$ items $(X_1,\dots,X_{i-1})$. We refer to $X_i$ as an item or a block interchangeably. Our goal is to study the distribution of the hashed sequence $(H,\mathbf{Y})=(H,Y_1,\dots,Y_T)\stackrel{\rm def}{=}(H,H(X_1),\dots,H(X_T))$.

Hash Families.

The truly random hash family $\mathcal{H}$ is the set of all functions from $[N]$ to $[M]$. A hash family $\mathcal{H}$ is $s$-universal if for every sequence of distinct elements $x_1,\dots,x_s\in[N]$, $\Pr_{H}[H(x_1)=\dots=H(x_s)]\leq 1/M^{s-1}$. $\mathcal{H}$ is $s$-wise independent if for every sequence of distinct elements $x_1,\dots,x_s\in[N]$, the random variables $H(x_1),\dots,H(x_s)$ are independent and uniform over $[M]$.
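For concreteness, here is a small Python sketch (ours, not from the paper) of the classic Carter–Wegman family $h_{a,b}(x)=((ax+b)\bmod p)\bmod M$, where $p$ is a prime with $p\geq N$; for distinct $x\neq x'$ the collision probability over a random $(a,b)$ is at most $1/M$, so the family is 2-universal in the above sense. All parameter choices below are illustrative.

```python
import random

def make_universal_family(N, M, p=2_147_483_647):
    """Sampler for the Carter-Wegman family h_{a,b}(x) = ((a*x + b) mod p) mod M
    over the universe [N]; p must be a prime with p >= N."""
    assert N <= p
    def sample_hash():
        a = random.randrange(1, p)   # a != 0
        b = random.randrange(0, p)
        return lambda x: ((a * x + b) % p) % M
    return sample_hash

# Example: hash a few items with one random function from the family.
sample_hash = make_universal_family(N=10**6, M=256)
h = sample_hash()
print([h(x) for x in [12345, 67890, 13579]])
```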

Block Sources and Collision Probability.

For a random variable $X$, the collision probability of $X$ is $\mathrm{cp}(X)=\Pr[X=X']=\sum_{x}\Pr[X=x]^{2}$, where $X'$ is an independent copy of $X$. The Renyi entropy $\mathrm{H}_2(X)=\log(1/\mathrm{cp}(X))$ can be viewed as a measure of the amount of randomness in $X$. (In the randomness extractor literature, the entropy is measured by min-entropy $\mathrm{H}_\infty(X)=\min_{x\in\mathrm{supp}(X)}\log(1/\Pr[X=x])$, but using the less stringent measure of Renyi entropy makes our results stronger, since $\mathrm{H}_2(X)\geq \mathrm{H}_\infty(X)$.) For an event $E$, $(X|_E)$ is the random variable defined by conditioning $X$ on $E$.

Definition 2.1 (Block Sources)

A sequence of random variables $(X_1,\dots,X_T)$ over $[N]^T$ is a block $K$-source if for every $i\in[T]$ and every $x_{<i}$ in the support of $X_{<i}$, we have $\mathrm{cp}(X_i|_{X_{<i}=x_{<i}})\leq 1/K$. That is, each item $X_i$ has at least $k=\log K$ bits of Renyi entropy even after conditioning on the previous items.
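To make these definitions concrete, the following Python sketch (ours, not from the paper) computes collision probability and Renyi entropy from a probability table, and checks the block $K$-source condition for a toy two-block distribution; the representation of distributions as dicts is an illustrative assumption.

```python
import math
from collections import defaultdict

def collision_prob(p):
    """cp(X) = sum_x Pr[X=x]^2 for a probability table p."""
    return sum(q * q for q in p.values())

def renyi_entropy(p):
    """H_2(X) = log2(1 / cp(X))."""
    return math.log2(1.0 / collision_prob(p))

def is_block_k_source(joint, K):
    """joint maps (x1, x2) -> probability; checks cp(X1) <= 1/K and
    cp(X2 | X1 = x1) <= 1/K for every x1 in the support."""
    p1 = defaultdict(float)                      # marginal of X1
    for (x1, _), pr in joint.items():
        p1[x1] += pr
    if collision_prob(p1) > 1.0 / K:
        return False
    for x1, px1 in p1.items():                   # conditionals X2 | X1 = x1
        cond = {x2: pr / px1 for (y1, x2), pr in joint.items() if y1 == x1}
        if collision_prob(cond) > 1.0 / K:
            return False
    return True

# Toy example: two independent items, each uniform over 8 values (k = 3 bits).
uniform8 = {x: 1 / 8 for x in range(8)}
joint = {(x1, x2): (1 / 8) * (1 / 8) for x1 in range(8) for x2 in range(8)}
print(renyi_entropy(uniform8))          # 3.0
print(is_block_k_source(joint, K=8))    # True
print(is_block_k_source(joint, K=16))   # False
```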

Let $\mathbf{X}=(X_1,\dots,X_T)$ be a sequence of random variables over $[M]^T$. We are interested in bounding the overall collision probability $\mathrm{cp}(\mathbf{X})$ in terms of the collision probability of each block. If all the $X_i$'s are independent, then $\mathrm{cp}(\mathbf{X})=\prod_{i=1}^T\mathrm{cp}(X_i)$. The following lemma generalizes Lemma 4.2 in [MV08]; it says that if, for every $\mathbf{x}\in\mathrm{supp}(\mathbf{X})$, the average collision probability of the blocks $X_i$ conditioned on $X_{<i}=x_{<i}$ is small, then the overall collision probability $\mathrm{cp}(\mathbf{X})$ is also small. In particular, if $\mathbf{X}$ is a block $K$-source, then $\mathrm{cp}(\mathbf{X})\leq 1/K^T$.

Lemma 2.2

Let $\mathbf{X}=(X_1,\dots,X_T)$ be a sequence of random variables such that for every $\mathbf{x}\in\mathrm{supp}(\mathbf{X})$,

\[
\frac{1}{T}\sum_{i=1}^{T}\mathrm{cp}(X_{i}|_{X_{<i}=x_{<i}})\leq\alpha.
\]

Then the overall collision probability satisfies $\mathrm{cp}(\mathbf{X})\leq\alpha^{T}$.

Proof. By the Arithmetic Mean-Geometric Mean Inequality, the inequality in the premise implies

\[
\prod_{i=1}^{T}\mathrm{cp}(X_{i}|_{X_{<i}=x_{<i}})\leq\alpha^{T}.
\]

Therefore, it suffices to prove

\[
\mathrm{cp}(\mathbf{X})\leq\max_{\mathbf{x}\in\mathrm{supp}(\mathbf{X})}\prod_{i=1}^{T}\mathrm{cp}(X_{i}|_{X_{<i}=x_{<i}}).
\]

We prove this by induction on $T$. The base case $T=1$ is trivial. Suppose the lemma is true for $T-1$. We have

\begin{align*}
\mathrm{cp}(\mathbf{X}) &= \sum_{x_{1}}\Pr[X_{1}=x_{1}]^{2}\cdot\mathrm{cp}(X_{2},\dots,X_{T}|_{X_{1}=x_{1}})\\
&\leq \Bigl(\sum_{x_{1}}\Pr[X_{1}=x_{1}]^{2}\Bigr)\cdot\max_{x_{1}}\mathrm{cp}(X_{2},\dots,X_{T}|_{X_{1}=x_{1}})\\
&\leq \mathrm{cp}(X_{1})\cdot\max_{x_{1}}\Bigl(\max_{x_{2},\dots,x_{T}}\prod_{i=2}^{T}\mathrm{cp}(X_{i}|_{X_{<i}=x_{<i}})\Bigr)\\
&= \max_{\mathbf{x}}\prod_{i=1}^{T}\mathrm{cp}(X_{i}|_{X_{<i}=x_{<i}}),
\end{align*}

as desired. $\Box$

Statistical Distance.

The statistical distance is a standard way to measure the distance between two distributions. Let $X$ and $Y$ be two random variables. The statistical distance of $X$ and $Y$ is $\Delta(X,Y)=\max_{T}|\Pr[X\in T]-\Pr[Y\in T]|=(1/2)\cdot\sum_{x}|\Pr[X=x]-\Pr[Y=x]|$, where $T$ ranges over all possible events. When $\Delta(X,Y)\leq\varepsilon$, we say that $X$ is $\varepsilon$-close to $Y$. Similarly, if $\Delta(X,Y)\geq\varepsilon$, then $X$ is $\varepsilon$-far from $Y$. The following standard lemma says that if $X$ has small collision probability, then $X$ is close to uniform in statistical distance.

Lemma 2.3

Let $X$ be a random variable over $[M]$ such that $\mathrm{cp}(X)\leq(1+\varepsilon)/M$. Then $\Delta(X,U_{[M]})\leq\sqrt{\varepsilon}$.
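For completeness, here is the standard one-line argument (our sketch, via Cauchy-Schwarz), which in fact gives the slightly stronger bound $\sqrt{\varepsilon}/2$:

\[
\Delta(X,U_{[M]})=\frac{1}{2}\sum_{x\in[M]}\Bigl|\Pr[X=x]-\tfrac{1}{M}\Bigr|
\leq\frac{\sqrt{M}}{2}\Bigl(\sum_{x}\bigl(\Pr[X=x]-\tfrac{1}{M}\bigr)^{2}\Bigr)^{1/2}
=\frac{1}{2}\sqrt{M\cdot\mathrm{cp}(X)-1}\leq\frac{\sqrt{\varepsilon}}{2}.
\]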

Conditional Collision Probability.

Let $(X,Y)$ be jointly distributed random variables. We can define the conditional collision probability and the conditional Renyi entropy of $X$ conditioned on $Y$ as follows.

Definition 2.4

The conditional collision probability of $X$ conditioned on $Y$ is $\mathrm{cp}(X|Y)=\mathrm{E}_{y\leftarrow Y}[\mathrm{cp}(X|_{Y=y})]$. The conditional Renyi entropy is $\mathrm{H}_2(X|Y)=\log(1/\mathrm{cp}(X|Y))$.
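A small Python sketch (ours, not from the paper) of Definition 2.4, computing $\mathrm{cp}(X|Y)$ from a joint probability table; the dict representation and the toy numbers are illustrative assumptions.

```python
from collections import defaultdict

def conditional_collision_prob(joint):
    """cp(X|Y) = E_{y <- Y}[ cp(X | Y = y) ] for joint[(x, y)] = Pr[X=x, Y=y]."""
    py = defaultdict(float)            # marginal of Y
    by_y = defaultdict(dict)           # conditional tables indexed by y
    for (x, y), pr in joint.items():
        py[y] += pr
        by_y[y][x] = by_y[y].get(x, 0.0) + pr
    cp = 0.0
    for y, pry in py.items():
        cond = {x: pr / pry for x, pr in by_y[y].items()}
        cp += pry * sum(q * q for q in cond.values())
    return cp

# Example: X equals Y over {0,1} except with probability 0.1;
# conditioning on Y leaves X with little Renyi entropy.
joint = {(0, 0): 0.45, (1, 0): 0.05, (1, 1): 0.45, (0, 1): 0.05}
print(conditional_collision_prob(joint))   # 0.82, i.e. H_2(X|Y) ~ 0.29 bits
```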

The following lemma says that as in the case of Shannon entropy, conditioning can only decrease the entropy.

Lemma 2.5

Let $(X,Y,Z)$ be jointly distributed random variables. We have $\mathrm{cp}(X)\leq\mathrm{cp}(X|Y)\leq\mathrm{cp}(X|Y,Z)$.

Proof. For the first inequality, we have

\begin{align*}
\mathrm{cp}(X) &= \sum_{x}\Pr[X=x]^{2}\\
&= \sum_{y,y'}\Pr[Y=y]\cdot\Pr[Y=y']\cdot\Bigl(\sum_{x}\Pr[X=x|Y=y]\cdot\Pr[X=x|Y=y']\Bigr)\\
&\leq \sum_{y,y'}\Pr[Y=y]\cdot\Pr[Y=y']\cdot\Bigl(\sum_{x}\Pr[X=x|Y=y]^{2}\Bigr)^{1/2}\cdot\Bigl(\sum_{x}\Pr[X=x|Y=y']^{2}\Bigr)^{1/2}\\
&= \mathrm{E}_{y\leftarrow Y}\bigl[\mathrm{cp}(X|_{Y=y})^{1/2}\bigr]^{2}\\
&\leq \mathrm{cp}(X|Y),
\end{align*}

where the first inequality is Cauchy-Schwarz and the last is Jensen's inequality. For the second inequality of the lemma, observe that for every $y$ in the support of $Y$, we have $\mathrm{cp}(X|_{Y=y})\leq\mathrm{cp}((X|_{Y=y})|(Z|_{Y=y}))$ by the first inequality. It follows that

\begin{align*}
\mathrm{cp}(X|Y) &= \mathrm{E}_{y\leftarrow Y}[\mathrm{cp}(X|_{Y=y})]\\
&\leq \mathrm{E}_{y\leftarrow Y}[\mathrm{cp}((X|_{Y=y})|(Z|_{Y=y}))]\\
&= \mathrm{E}_{y\leftarrow Y}\bigl[\mathrm{E}_{z\leftarrow(Z|_{Y=y})}[\mathrm{cp}(X|_{Y=y,Z=z})]\bigr]\\
&= \mathrm{cp}(X|Y,Z). \qquad\Box
\end{align*}

3 Positive Results: How Much Entropy is Sufficient?

In this section, we present our positive results, showing that the distribution of the hashed sequence $(H,\mathbf{Y})=(H,H(X_1),\dots,H(X_T))$ is close to uniform when $H$ is a random hash function from a 2-universal hash family and $\mathbf{X}=(X_1,\dots,X_T)$ has sufficient entropy per block. The new contribution is that we do not need $K=2^k$ to be as large as in previous works, and so we save on the randomness required of the block source $\mathbf{X}=(X_1,\dots,X_T)$.

3.1 Small Collision Probability Using 2-universal Hash Functions

Let $H:[N]\rightarrow[M]$ be a random hash function from a 2-universal family $\mathcal{H}$. We first study the conditions under which $(H,\mathbf{Y})=(H,H(X_1),\dots,H(X_T))$ is $\varepsilon$-close to having collision probability $O(1/(|\mathcal{H}|\cdot M^T))$. This requirement is less stringent than $(H,\mathbf{Y})$ being $\varepsilon$-close to uniform in statistical distance, and so requires fewer bits of entropy. Mitzenmacher and Vadhan [MV08] show that this guarantee suffices for some hashing applications, and that $K\geq MT^2/\varepsilon$ is enough to satisfy the requirement. We save a factor of $T$, and show that in fact $K\geq MT/\varepsilon$ is sufficient. (Taking logs yields the second row of Table 1, i.e. it suffices to have Renyi entropy $k=m+\log T+\log(1/\varepsilon)$ per block.) Formally, we prove the following theorem.

Theorem 3.1

Let $H:[N]\rightarrow[M]$ be a random hash function from a 2-universal family $\mathcal{H}$. Let $\mathbf{X}=(X_1,\dots,X_T)$ be a block $K$-source over $[N]^T$. For every $\varepsilon>0$, the hashed sequence $(H,\mathbf{Y})=(H,H(X_1),\dots,H(X_T))$ is $\varepsilon$-close to a distribution $(H,\mathbf{Z})=(H,Z_1,\dots,Z_T)$ such that

\[
\mathrm{cp}(H,\mathbf{Z})\leq\frac{1}{|\mathcal{H}|\cdot M^{T}}\left(1+\frac{M}{K\varepsilon}\right)^{T}.
\]

In particular, if $K\geq MT/\varepsilon$, then $(H,\mathbf{Z})$ has collision probability at most $(1+2MT/K\varepsilon)/(|\mathcal{H}|\cdot M^{T})$.

To analyze the distribution of the hashed sequence $(H,\mathbf{Y})$, the starting point is the following version of the Leftover Hash Lemma [BBR85, ILL89], which says that when we hash a random variable $X$ with enough entropy using a 2-universal hash function $H$, the conditional collision probability of $H(X)$ given $H$ is small.

Lemma 3.2 (The Leftover Hash Lemma)

Let $H:[N]\rightarrow[M]$ be a random hash function from a 2-universal family $\mathcal{H}$. Let $X$ be a random variable over $[N]$ with $\mathrm{cp}(X)\leq 1/K$. Then $\mathrm{cp}(H(X)|H)\leq 1/M+1/K$.
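The proof of this version of the lemma is short; here is a sketch of the standard argument (ours, added for completeness), where $X'$ is an independent copy of $X$ and the last step uses $\mathrm{cp}(X)\leq 1/K$ together with 2-universality:

\[
\mathrm{cp}(H(X)|H)=\mathrm{E}_{h\leftarrow H}\Bigl[\Pr_{X,X'}[h(X)=h(X')]\Bigr]
\leq\Pr[X=X']+\max_{x\neq x'}\Pr_{h\leftarrow H}[h(x)=h(x')]
\leq\frac{1}{K}+\frac{1}{M}.
\]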

We now sketch how the hashed block source $\mathbf{Y}=(Y_1,\ldots,Y_T)=(H(X_1),\ldots,H(X_T))$ is analyzed in [MV08], and how we improve the analysis. The following natural approach is taken in [MV08]. Since the data $\mathbf{X}$ is a block $K$-source, the Leftover Hash Lemma tells us that for every block $i\in[T]$, if we condition on the previous blocks $X_{<i}=x_{<i}$, then the hashed value $(Y_i|_{X_{<i}=x_{<i}})$ has small conditional collision probability, i.e. $\mathrm{cp}((Y_i|_{X_{<i}=x_{<i}})|H)\leq 1/M+1/K$. This is equivalent to saying that the average collision probability of $(Y_i|_{X_{<i}=x_{<i}})$ over the choice of the hash function $H$ is small, i.e.,

\[
\mathrm{E}_{h\leftarrow H}[\mathrm{cp}(h(X_{i})|_{X_{<i}=x_{<i}})]=\mathrm{cp}((Y_{i}|_{X_{<i}=x_{<i}})|H)\leq\frac{1}{M}+\frac{1}{K}.
\]

A Markov argument then shows that for each block, with probability at least $1-\varepsilon/T$ over $h\leftarrow H$, the collision probability is at most $1/M+T/(K\varepsilon)$. A union bound then shows that for every $\mathbf{x}\in\mathrm{supp}(\mathbf{X})$, at least a $(1-\varepsilon)$-fraction of hash functions $h$ are good in the sense that $\mathrm{cp}(h(X_i)|_{X_{<i}=x_{<i}})$ is small for all blocks $i=1,\dots,T$. [MV08] shows that if this condition holds for every $(h,\mathbf{x})\in\mathrm{supp}(H,\mathbf{X})$, then every block of $\mathbf{Y}$ has collision probability at most $1/M+T/(K\varepsilon)$ conditioned on the previous blocks, and thus the overall collision probability is at most $(1+MT/K\varepsilon)^{T}/M^{T}$. [MV08] also shows how to modify an $\varepsilon$-fraction of the distribution to fix the bad hash functions, and thus complete the analysis.

The problem with the above analysis is that applying a Markov argument to each block and then taking a union bound incurs a loss of a factor of $T$. To avoid this, we want to apply the Markov argument only once, to the whole sequence. For example, a natural thing to try is to average over the blocks to get

\[
\mathrm{E}_{h\leftarrow H}\left[\frac{1}{T}\sum_{i=1}^{T}\mathrm{cp}(h(X_{i})|_{X_{<i}=x_{<i}})\right]=\frac{1}{T}\sum_{i=1}^{T}\mathrm{cp}((Y_{i}|_{X_{<i}=x_{<i}})|H)\leq\frac{1}{M}+\frac{1}{K},
\]

and use a Markov argument to deduce that for every $\mathbf{x}\in\mathrm{supp}(\mathbf{X})$, with probability $1-\varepsilon$ over $h\leftarrow H$, the average collision probability per block satisfies

\[
\frac{1}{T}\cdot\sum_{i=1}^{T}\mathrm{cp}(h(X_{i})|_{X_{<i}=x_{<i}})\leq\frac{1}{M}+\frac{1}{K\varepsilon}.
\]

We need to bound the collision probability of $\mathbf{Y}$ using this information. We may want to apply Lemma 2.2, but it requires information about $(1/T)\sum_{i=1}^{T}\mathrm{cp}(Y_{i}|_{Y_{<i}=y_{<i}})$ instead of $(1/T)\sum_{i=1}^{T}\mathrm{cp}(h(X_{i})|_{X_{<i}=x_{<i}})$. That is, Lemma 2.2 requires us to condition on the previous hashed values $Y_{<i}$, whereas the above argument refers to conditioning on the un-hashed values $X_{<i}$. The difficulty with directly reasoning about the former is that conditioned on the hashed values $Y_{<i}$, the hash function $H$ may no longer be uniform (as it is correlated with $Y_{<i}$), and thus the Leftover Hash Lemma no longer applies.

To get around these issues, we work with the averaged form of conditional collision probability $\mathrm{cp}(Y_i|H,Y_{<i})$ from Definition 2.4. Our key observation is that we can now apply Lemma 2.5 to deduce that for every block $i\in[T]$, the conditional collision probability satisfies $\mathrm{cp}(Y_i|H,Y_{<i})\leq\mathrm{cp}(Y_i|H,X_{<i})\leq 1/M+1/K$. Then, by a Markov argument, it follows that with probability $1-\varepsilon$ over $(h,\mathbf{y})\leftarrow(H,\mathbf{Y})$, the average collision probability satisfies

\[
\frac{1}{T}\sum_{i=1}^{T}\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})\leq\frac{1}{M}+\frac{1}{K\varepsilon}.
\]

We can then modify an $\varepsilon$-fraction of the distribution, and apply Lemma 2.2 to complete the analysis.

The following lemma formalizes the claim that the conditional collision probability of every block of $(H,\mathbf{Y})$ is small.

Lemma 3.3

Let $H:[N]\rightarrow[M]$ be a random hash function from a 2-universal family $\mathcal{H}$. Let $\mathbf{X}=(X_1,\dots,X_T)$ be a block $K$-source over $[N]^T$. Let $(H,\mathbf{Y})=(H,H(X_1),\dots,H(X_T))$. Then $\mathrm{cp}(H)=1/|\mathcal{H}|$ and, for every $i\in[T]$, $\mathrm{cp}(Y_i|H,Y_{<i})\leq 1/M+1/K$.

Proof. $\mathrm{cp}(H)=1/|\mathcal{H}|$ is trivial since $H$ is uniform over $\mathcal{H}$. Fix $i\in[T]$. By the definition of a block $K$-source, for every $x_{<i}$ in the support of $X_{<i}$, $\mathrm{cp}(X_i|_{X_{<i}=x_{<i}})\leq 1/K$. By the Leftover Hash Lemma, we have $\mathrm{cp}((Y_i|_{X_{<i}=x_{<i}})|(H|_{X_{<i}=x_{<i}}))\leq 1/M+1/K$ for every $x_{<i}$. It follows that $\mathrm{cp}(Y_i|H,X_{<i})\leq 1/M+1/K$. Now, we can think of $(Y_i|H,X_{<i})$ as $Y_i$ first conditioned on $(H,Y_{<i})$, and then further conditioned on $X_{<i}$. By Lemma 2.5, we have

\[
\mathrm{cp}(Y_{i}|H,Y_{<i})\leq\mathrm{cp}(Y_{i}|H,Y_{<i},X_{<i})=\mathrm{cp}(Y_{i}|H,X_{<i})\leq 1/M+1/K,
\]

as desired. $\Box$

We use this to prove Theorem 3.1 as outlined above.

Proof of Theorem 3.1: By Lemma 3.3, for every $i\in[T]$, we have

\[
\mathrm{E}_{(h,\mathbf{y})\leftarrow(H,\mathbf{Y})}\left[\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})\right]=\mathrm{cp}(Y_{i}|H,Y_{<i})\leq\frac{1}{M}+\frac{1}{K}.
\]

By linearity of expectation, the average conditional collision probability is also small:

\[
\mathrm{E}_{(h,\mathbf{y})\leftarrow(H,\mathbf{Y})}\left[\frac{1}{T}\sum_{i=1}^{T}\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})\right]\leq\frac{1}{M}+\frac{1}{K}.
\]

Note that the collision probability of a random variable over $[M]$ is at least $1/M$. Thus, Markov's inequality (applied to the nonnegative excess over $1/M$) implies that with probability at least $1-\varepsilon$ over $(h,\mathbf{y})\leftarrow(H,\mathbf{Y})$,

\[
\frac{1}{T}\sum_{i=1}^{T}\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})\leq\frac{1}{M}+\frac{1}{K\varepsilon}=\frac{1}{M}\cdot\left(1+\frac{M}{K\varepsilon}\right). \tag{1}
\]

In Lemma 3.4 below, we show how to fix the bad $(h,\mathbf{y})$'s by modifying at most an $\varepsilon$-fraction of the distribution. Formally, Lemma 3.4 says that there exists a distribution $(H,\mathbf{Z})=(H,Z_1,\dots,Z_T)$ such that $(H,\mathbf{Y})$ is $\varepsilon$-close to $(H,\mathbf{Z})$, and for every $(h,\mathbf{z})\in\mathrm{supp}(H,\mathbf{Z})$,

\[
\frac{1}{T}\sum_{i=1}^{T}\mathrm{cp}(Z_{i}|_{(H,Z_{<i})=(h,z_{<i})})\leq\frac{1}{M}\cdot\left(1+\frac{M}{K\varepsilon}\right).
\]

Applying Lemma 2.2 to $(\mathbf{Z}|_{H=h})$ for every $h\in\mathrm{supp}(H)$, we have

\[
\mathrm{cp}(H,\mathbf{Z})=\frac{1}{|\mathcal{H}|}\cdot\mathrm{E}_{h\leftarrow H}\left[\mathrm{cp}(\mathbf{Z}|_{H=h})\right]\leq\frac{1}{|\mathcal{H}|\cdot M^{T}}\cdot\left(1+\frac{M}{K\varepsilon}\right)^{T}. \qquad\Box
\]
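As a sanity check (ours, not part of the paper), one can verify the Markov step numerically in the simplest case where the blocks are i.i.d. copies of a flat $K$-source: there, $\mathrm{cp}(Y_i|_{(H,Y_{<i})=(h,y_{<i})})=\mathrm{cp}(h(X))$ for every block, so inequality (1) just asks that $\Pr_h[\mathrm{cp}(h(X))>1/M+1/(K\varepsilon)]\leq\varepsilon$. The sketch below samples hash functions from a Carter–Wegman family like the one sketched in Section 2 and estimates this probability; all parameter values are arbitrary.

```python
import random
from collections import Counter

p, N, M, K, eps = 10007, 10000, 16, 256, 0.1   # p prime, p >= N
trials = 2000

# A flat K-source: uniform over a random set S of size K.
S = random.sample(range(N), K)

def cp_of_hashed(h):
    counts = Counter(h(x) for x in S)          # h(X) puts mass c/K on each bucket
    return sum((c / K) ** 2 for c in counts.values())

bad = 0
for _ in range(trials):
    a, b = random.randrange(1, p), random.randrange(p)
    h = lambda x: ((a * x + b) % p) % M
    if cp_of_hashed(h) > 1 / M + 1 / (K * eps):
        bad += 1

print(f"fraction of 'bad' hash functions: {bad / trials:.4f}  (Markov bound: {eps})")
```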

Lemma 3.4

Let $(H,\mathbf{Y})=(H,Y_1,\dots,Y_T)$ be jointly distributed random variables over $\mathcal{H}\times[M]^T$ such that with probability at least $1-\varepsilon$ over $(h,\mathbf{y})\leftarrow(H,\mathbf{Y})$, the average conditional collision probability satisfies

\[
\frac{1}{T}\cdot\sum_{i=1}^{T}\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})\leq\frac{1}{M}+\alpha.
\]

Then there exists a distribution $(H,\mathbf{Z})=(H,Z_1,\dots,Z_T)$ such that $(H,\mathbf{Z})$ is $\varepsilon$-close to $(H,\mathbf{Y})$, and for every $(h,\mathbf{z})\in\mathrm{supp}(H,\mathbf{Z})$, we have

\[
\frac{1}{T}\cdot\sum_{i=1}^{T}\mathrm{cp}(Z_{i}|_{(H,Z_{<i})=(h,z_{<i})})\leq\frac{1}{M}+\alpha.
\]

Furthermore, the marginal distribution of $H$ is unchanged.

Proof. We define the distribution $(H,\mathbf{Z})$ as follows.

• Sample $(h,\mathbf{y})\leftarrow(H,\mathbf{Y})$.

• If $(1/T)\cdot\sum_{i=1}^{T}\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})\leq 1/M+\alpha$, then output $(h,\mathbf{y})$.

• Otherwise, let $j\in[T]$ be the least index such that

\[
\frac{1}{T}\sum_{i=1}^{j}\left(\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})-\frac{1}{M}\right)\leq\alpha \quad\mbox{and}\quad \frac{1}{T}\sum_{i=1}^{j+1}\left(\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})-\frac{1}{M}\right)>\alpha.
\]

• Choose $w_{j+1},\dots,w_{T}\leftarrow U_{[M]}$, and output $(h,y_{1},\dots,y_{j},w_{j+1},\dots,w_{T})$.

It is easy to check that (i) $(H,\mathbf{Z})$ is well-defined, (ii) $(H,\mathbf{Y})$ is $\varepsilon$-close to $(H,\mathbf{Z})$, (iii) for every $(h,\mathbf{z})\in\mathrm{supp}(H,\mathbf{Z})$, $(1/T)\cdot\sum_{i=1}^{T}\mathrm{cp}(Z_{i}|_{(H,Z_{<i})=(h,z_{<i})})\leq 1/M+\alpha$, and (iv) the marginal distribution of $H$ is unchanged. $\Box$

3.2 Small Collision Probability Using 4-wise Independent Hash Functions

As discussed in [MV08], using 4-wise independent hash functions $H:[N]\rightarrow[M]$ from $\mathcal{H}$, we can further reduce the required randomness in the data $\mathbf{X}=(X_1,\dots,X_T)$. [MV08] shows that in this case, $K\geq MT+\sqrt{2MT^{3}/\varepsilon}$ is enough for the hashed sequence $(H,\mathbf{Y})$ to be $\varepsilon$-close to having collision probability $O(1/(|\mathcal{H}|\cdot M^{T}))$. As in the previous subsection, by avoiding the union bound, we show that $K\geq MT+\sqrt{2MT^{2}/\varepsilon}$ suffices. (Taking logs yields the third row of Table 1, i.e. it suffices to have Renyi entropy $k=\max\{m+\log T,\,(1/2)\cdot(m+2\log T+\log(1/\varepsilon))\}+O(1)$ per block.) Formally, we prove the following theorem.

Theorem 3.5

Let $H:[N]\rightarrow[M]$ be a random hash function from a 4-wise independent family $\mathcal{H}$. Let $\mathbf{X}=(X_1,\dots,X_T)$ be a block $K$-source over $[N]^T$. For every $\varepsilon>0$, the hashed sequence $(H,\mathbf{Y})=(H,H(X_1),\dots,H(X_T))$ is $\varepsilon$-close to a distribution $(H,\mathbf{Z})=(H,Z_1,\dots,Z_T)$ such that

\[
\mathrm{cp}(H,\mathbf{Z})\leq\frac{1}{|\mathcal{H}|\cdot M^{T}}\left(1+\frac{M}{K}+\sqrt{\frac{2M}{K^{2}\varepsilon}}\right)^{T}.
\]

In particular, if $K\geq MT+\sqrt{2MT^{2}/\varepsilon}$, then $(H,\mathbf{Z})$ has collision probability at most $(1+\gamma)/(|\mathcal{H}|\cdot M^{T})$ for $\gamma=2\cdot(MT+\sqrt{2MT^{2}/\varepsilon})/K$.

The improvement of Theorem 3.5 over Theorem 3.1 comes from the fact that when we use 4-wise independent hash families, we have a concentration result on the collision probability of each hashed block, via the following lemma.

Lemma 3.6 ([MV08])

Let $H:[N]\rightarrow[M]$ be a random hash function from a 4-wise independent family $\mathcal{H}$, and let $X$ be a random variable over $[N]$ with $\mathrm{cp}(X)\leq 1/K$. Then we have

\[
\mathrm{Var}_{h\leftarrow H}[\mathrm{cp}(h(X))]\leq\frac{2}{MK^{2}}.
\]

We can then replace the application of Markov's Inequality in the proof of Theorem 3.1 by Chebyshev's Inequality to get a stronger result. Formally, we prove the following lemma, which suffices to prove Theorem 3.5.

Lemma 3.7

Let $H:[N]\rightarrow[M]$ be a random hash function from a 4-wise independent family $\mathcal{H}$. Let $\mathbf{X}=(X_1,\dots,X_T)$ be a block $K$-source over $[N]^T$. Let $(H,\mathbf{Y})=(H,H(X_1),\dots,H(X_T))$. Then with probability at least $1-\varepsilon$ over $(h,\mathbf{y})\leftarrow(H,\mathbf{Y})$,

\[
\frac{1}{T}\sum_{i=1}^{T}\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})\leq\frac{1}{M}\cdot\left(1+\frac{M}{K}+\sqrt{\frac{2M}{K^{2}\varepsilon}}\right).
\]

Theorem 3.5 follows immediately by composing Lemmas 3.7, 3.4, and 2.2 in the same way as in the proof of Theorem 3.1.

Proof of Lemma 3.7: Recall that we have

\[
\mathrm{E}_{(h,\mathbf{y})\leftarrow(H,\mathbf{Y})}\left[\frac{1}{T}\sum_{i=1}^{T}\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})\right]\leq\frac{1}{M}+\frac{1}{K}.
\]

Hence, our goal is to upper bound the probability that the value $(1/T)\sum_{i=1}^{T}\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})$ deviates from its mean by more than $\sqrt{2/(MK^{2}\varepsilon)}$. Our strategy is to bound the variance of a properly defined random variable, and then apply Chebyshev's Inequality. By Lemma 3.6, the information we get from the 4-wise independent hash function is that for every $i\in[T]$,

\[
\mathrm{Var}_{h\leftarrow H}\left[\mathrm{cp}(Y_{i}|_{(H,X_{<i})=(h,x_{<i})})\right]\leq\frac{2}{MK^{2}}\quad\quad\forall x_{<i}\in\mathrm{supp}(X_{<i}). \tag{2}
\]

Fix $i\in[T]$; let us bound the variance of the $i$-th block. There are two issues to take care of. First, the variance we have is conditioned on $X_{<i}$ instead of $Y_{<i}$. Second, even when conditioning on $X_{<i}$, it is possible that the variance is

\[
\mathrm{Var}_{(h,\mathbf{x})\leftarrow(H,\mathbf{X})}\left[\mathrm{cp}(Y_{i}|_{(H,X_{<i})=(h,x_{<i})})\right]=\Omega\left(\frac{1}{K^{2}}\right)\gg\frac{2}{MK^{2}}.
\]

The reason is that conditioned on different values $X_{<i}=x_{<i}$, the collision probability of $(Y_i|_{X_{<i}=x_{<i}})$ may have a different expectation over $h\leftarrow\mathcal{H}$. Thus, we have to subtract the mean first. Let us define

\[
f(h,x_{<i})=\mathrm{cp}(Y_{i}|_{(H,X_{<i})=(h,x_{<i})})-\mathrm{E}_{h'\leftarrow H}\left[\mathrm{cp}(Y_{i}|_{(H,X_{<i})=(h',x_{<i})})\right].
\]

Now, for every $x_{<i}\in\mathrm{supp}(X_{<i})$, $f(H,x_{<i})$ has mean 0 and variance at most $2/MK^{2}$. It follows that

\[
\mathrm{Var}_{(h,\mathbf{x})\leftarrow(H,\mathbf{X})}\left[f(h,x_{<i})\right]\leq\frac{2}{MK^{2}}.
\]

We now deal with the issue of conditioning on $X_{<i}$ versus $Y_{<i}$. Let us define

\[
g(h,y_{<i})=\mathrm{E}_{x_{<i}\leftarrow(X_{<i}|_{(H,Y_{<i})=(h,y_{<i})})}\left[f(h,x_{<i})\right].
\]

We claim that

\[
\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})\leq\frac{1}{M}+\frac{1}{K}+g(h,y_{<i}).
\]

Indeed, by Lemma 2.5 and the definitions of $f$ and $g$,

\begin{align*}
\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})
&\leq \mathrm{cp}\bigl((Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})\,\big|\,(X_{<i}|_{(H,Y_{<i})=(h,y_{<i})})\bigr)\\
&= \mathrm{E}_{x_{<i}\leftarrow(X_{<i}|_{(H,Y_{<i})=(h,y_{<i})})}\left[\mathrm{cp}(Y_{i}|_{(H,X_{<i})=(h,x_{<i})})\right]\\
&= \mathrm{E}_{x_{<i}\leftarrow(X_{<i}|_{(H,Y_{<i})=(h,y_{<i})})}\left[f(h,x_{<i})+\mathrm{E}_{h'\leftarrow H}\left[\mathrm{cp}(Y_{i}|_{(H,X_{<i})=(h',x_{<i})})\right]\right]\\
&\leq g(h,y_{<i})+\frac{1}{M}+\frac{1}{K}.
\end{align*}

Also note that $g(H,Y_{<i})$ has mean 0 and small variance:

\[
\mathrm{E}_{(h,y_{<i})\leftarrow(H,Y_{<i})}[g(h,y_{<i})]=\mathrm{E}_{(h,\mathbf{x})\leftarrow(H,\mathbf{X})}[f(h,x_{<i})]=0,
\]
\[
\mathrm{Var}_{(h,y_{<i})\leftarrow(H,Y_{<i})}[g(h,y_{<i})]\leq\mathrm{Var}_{(h,\mathbf{x})\leftarrow(H,\mathbf{X})}[f(h,x_{<i})]\leq\frac{2}{MK^{2}}.
\]

The above argument holds for every block $i\in[T]$. Averaging over the blocks, we get

\[
\mathrm{E}_{(h,\mathbf{y})\leftarrow(H,\mathbf{Y})}\left[\frac{1}{T}\sum_{i=1}^{T}g(h,y_{<i})\right]=0,
\]
\[
\mathrm{Var}_{(h,\mathbf{y})\leftarrow(H,\mathbf{Y})}\left[\frac{1}{T}\sum_{i=1}^{T}g(h,y_{<i})\right]\leq\frac{2}{MK^{2}},\mbox{ and }
\]
\[
\frac{1}{T}\sum_{i=1}^{T}\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})\leq\frac{1}{M}+\frac{1}{K}+\left(\frac{1}{T}\sum_{i=1}^{T}g(h,y_{<i})\right).
\]

Finally, we can apply Chebyshev's Inequality to the random variable $(1/T)\cdot\sum_{i}g(H,Y_{<i})$ to get the desired result: with probability $1-\varepsilon$ over $(h,\mathbf{y})\leftarrow(H,\mathbf{Y})$,

\[
\frac{1}{T}\sum_{i=1}^{T}\mathrm{cp}(Y_{i}|_{(H,Y_{<i})=(h,y_{<i})})\leq\frac{1}{M}\cdot\left(1+\frac{M}{K}+\sqrt{\frac{2M}{K^{2}\varepsilon}}\right). \qquad\Box
\]

3.3 Statistical Distance to Uniform Distribution

Let $H:[N]\rightarrow[M]$ be a random hash function from a 2-universal family $\mathcal{H}$. Let $\mathbf{X}=(X_1,\dots,X_T)$ be a block $K$-source over $[N]^T$. In this subsection, we study the statistical distance between the distribution of the hashed sequence $(H,\mathbf{Y})=(H,H(X_1),\dots,H(X_T))$ and the uniform distribution $(H,U_{[M]^T})$. Classic results of [CG88, ILL89, Zuc96] show that if $K\geq MT^2/\varepsilon^2$, then $(H,\mathbf{Y})$ is $\varepsilon$-close to uniform. The proof idea is as follows. The Leftover Hash Lemma, together with Lemma 2.3, tells us that the joint distribution of the hash function and a hashed value $(H,Y_i)=(H,H(X_i))$ is $\sqrt{M/K}$-close to uniform $(H,U_{[M]})$ even conditioned on the previous blocks $X_{<i}$. One can then use a hybrid argument to show that the distance grows at most linearly with the number of blocks, so $(H,\mathbf{Y})$ is $T\cdot\sqrt{M/K}$-close to uniform. Taking $K\geq MT^2/\varepsilon^2$ completes the analysis.

We save a factor of $T$, and show that in fact $K=MT/\varepsilon^2$ is sufficient. (Taking logs yields the first row of Table 1, i.e. it suffices to have Renyi entropy $k=m+\log T+2\log(1/\varepsilon)$ per block.) Formally, we prove the following theorem.

Theorem 3.8

Let $H:[N]\rightarrow[M]$ be a random hash function from a 2-universal family $\mathcal{H}$. Let $\mathbf{X}=(X_1,\dots,X_T)$ be a block $K$-source over $[N]^T$. For every $\varepsilon>0$ such that $K\geq MT/\varepsilon^2$, the hashed sequence $(H,\mathbf{Y})=(H,H(X_1),\dots,H(X_T))$ is $\varepsilon$-close to the uniform distribution $(H,U_{[M]^T})$.

Recall that the previous analysis proceeds by passing to statistical distance first, and then measuring the growth of the distance over blocks in statistical distance. This incurs a quadratic dependence of $K$ on $T$. Since, without further information, the hybrid argument is tight, to save a factor of $T$ we have to measure the increase of the distance over blocks in another way, and pass to statistical distance only at the end. It turns out that the Hellinger distance (cf. [GS02]) is a good measure for our purposes:

Definition 3.9 (Hellinger distance)

Let $X$ and $Y$ be two random variables over $[M]$. The Hellinger distance between $X$ and $Y$ is

\[
d(X,Y)\stackrel{\rm def}{=}\left(\frac{1}{2}\sum_{i}\left(\sqrt{\Pr[X=i]}-\sqrt{\Pr[Y=i]}\right)^{2}\right)^{1/2}=\sqrt{1-\sum_{i}\sqrt{\Pr[X=i]\cdot\Pr[Y=i]}}.
\]

Like statistical distance, Hellinger distance is a distance measure on distributions, and it takes values in $[0,1]$. The following standard lemma says that the two distance measures are closely related. We remark that the lemma is tight in both directions even if $Y$ is the uniform distribution.

Lemma 3.10 (cf. [GS02])

Let $X$ and $Y$ be two random variables over $[M]$. We have

\[
d(X,Y)^{2}\leq\Delta(X,Y)\leq\sqrt{2}\cdot d(X,Y).
\]

In particular, the lemma allows us to upper-bound the statistical distance by upper-bounding the Hellinger distance. Since our goal is to bound the distance to uniform, it is convenient to introduce the following definition.

Definition 3.11 (Hellinger Closeness to Uniform)

Let $X$ be a random variable over $[M]$. The Hellinger closeness of $X$ to uniform $U_{[M]}$ is

\[
C(X)\stackrel{\rm def}{=}\frac{1}{M}\sum_{i}\sqrt{M\cdot\Pr[X=i]}=1-d(X,U_{[M]})^{2}.
\]

Note that $C(X,Y)=C(X)\cdot C(Y)$ when $X$ and $Y$ are independent random variables, so the Hellinger closeness is well-behaved with respect to products (unlike statistical distance). By Lemma 3.10, if the Hellinger closeness $C(X)$ is close to $1$, then $X$ is close to uniform in statistical distance. Recall that collision probability behaves similarly: if the collision probability $\mathrm{cp}(X)$ is close to $1/M$, then $X$ is close to uniform. In fact, by the following normalization, we can view the collision probability as the 2-norm of $X$, and the Hellinger closeness as the 1/2-norm of $X$.

Let $f(i)=M\cdot\Pr[X=i]$ for $i\in[M]$. In terms of $f(\cdot)$, the collision probability is $\mathrm{cp}(X)=(1/M^{2})\cdot\sum_{i}f(i)^{2}$, and Lemma 2.3 says that if the “2-norm” $M\cdot\mathrm{cp}(X)=\mathrm{E}_{i}[f(i)^{2}]$ is at most $1+\varepsilon$, where the expectation is over uniform $i\in[M]$, then $\Delta(X,U)\leq\sqrt{\varepsilon}$. Similarly, Lemma 3.10 says that if the “1/2-norm” $C(X)=\mathrm{E}_{i}[\sqrt{f(i)}]$ is at least $1-\varepsilon$, then $\Delta(X,U)\leq\sqrt{2\varepsilon}$.
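A quick numerical illustration (ours, not from the paper) of Definition 3.9 and Lemma 3.10: the sketch below computes the Hellinger and statistical distances of random distributions to uniform and confirms $d^2\leq\Delta\leq\sqrt{2}\,d$; the alphabet size and number of trials are arbitrary.

```python
import math
import random

def hellinger(p, q):
    """Hellinger distance d(X,Y) = sqrt(1 - sum_i sqrt(p_i * q_i))."""
    return math.sqrt(max(0.0, 1.0 - sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))))

def stat_dist(p, q):
    """Statistical distance (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def random_dist(M):
    w = [random.random() for _ in range(M)]
    s = sum(w)
    return [x / s for x in w]

M = 64
uniform = [1.0 / M] * M
for _ in range(5):
    p = random_dist(M)
    d, delta = hellinger(p, uniform), stat_dist(p, uniform)
    assert d * d <= delta + 1e-12 and delta <= math.sqrt(2) * d + 1e-12
    print(f"d^2 = {d*d:.4f} <= Delta = {delta:.4f} <= sqrt(2)*d = {math.sqrt(2)*d:.4f}")
```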

We now discuss our approach to proving Theorem 3.8. We want to show that $(H,\mathbf{Y})$ is close to uniform. All we know is that the conditional collision probability $\mathrm{cp}(Y_i|H,Y_{<i})$ is close to $1/M$ for every block. If all blocks were independent, then the overall collision probability $\mathrm{cp}(H,\mathbf{Y})$ would be small, and so $(H,\mathbf{Y})$ would be close to uniform. However, this is not true without independence, since the 2-norm tends to over-weight heavy elements. In contrast, the 1/2-norm does not suffer from this problem. Therefore, our approach is to show that small conditional collision probability implies large Hellinger closeness. Formally, we have the following lemma.

Lemma 3.12

Let $\mathbf{X}=(X_1,\dots,X_T)$ be jointly distributed random variables over $[M_1]\times\dots\times[M_T]$ such that $\mathrm{cp}(X_i|X_{<i})\leq\alpha_i/M_i$ for every $i\in[T]$. Then the Hellinger closeness satisfies

\[
C(\mathbf{X})\geq\sqrt{\frac{1}{\alpha_{1}\cdots\alpha_{T}}}.
\]

With this lemma, the proof of Theorem 3.8 is immediate.

Proof of Theorem 3.8: By Lemma 3.3, $\mathrm{cp}(H)=1/|\mathcal{H}|$, and $\mathrm{cp}(Y_i|H,Y_{<i})\leq(1+M/K)/M$ for every $i\in[T]$. By Lemma 3.12, the Hellinger closeness satisfies $C(H,\mathbf{Y})\geq(1+M/K)^{-T/2}\geq 1-MT/2K$. It follows by Lemma 3.10 that

\[
\Delta((H,\mathbf{Y}),(H,U_{[M]^{T}}))\leq\sqrt{2}\cdot d((H,\mathbf{Y}),(H,U_{[M]^{T}}))=\sqrt{2}\cdot\sqrt{1-C(H,\mathbf{Y})}\leq\sqrt{MT/K}\leq\varepsilon,
\]

where the last inequality uses $K\geq MT/\varepsilon^{2}$. $\Box$

We proceed to prove Lemma 3.12. The main idea is to use Hölder’s inequality to relate two different norms. We recall Hölder’s inequality first.

Lemma 3.13 (Hölder's inequality [Dur04])

• Let $F,G$ be two non-negative functions from $[M]$ to $\mathbb{R}$, and $p,q>0$ satisfying $1/p+1/q=1$. Let $x$ be a uniformly random index over $[M]$. We have

\[
\mathrm{E}_{x}[F(x)\cdot G(x)]\leq\mathrm{E}_{x}[F(x)^{p}]^{1/p}\cdot\mathrm{E}_{x}[G(x)^{q}]^{1/q}.
\]

• In general, let $F_1,\dots,F_n$ be non-negative functions from $[M]$ to $\mathbb{R}$, and $p_1,\dots,p_n>0$ satisfying $1/p_1+\dots+1/p_n=1$. We have

\[
\mathrm{E}_{x}[F_{1}(x)\cdots F_{n}(x)]\leq\mathrm{E}_{x}[F_{1}(x)^{p_{1}}]^{1/p_{1}}\cdots\mathrm{E}_{x}[F_{n}(x)^{p_{n}}]^{1/p_{n}}.
\]
  • Proof of Lemma 3.12:   We prove it by induction on TT. The base case T=1T=1 is already non-trivial. Let XX be a random variable over [M][M] with cp(X)α/M\mathrm{cp}(X)\leq\alpha/M, we need to show that the Hellinger closeness C(X)1/αC(X)\geq\sqrt{1/\alpha}. Recall the normalization we mentioned before. Let f(x)=MPr[X=x]f(x)=M\cdot\Pr[X=x] for every x[M]x\in[M]. In terms of f()f(\cdot), we want to show that Ex[f(x)2]α\mathop{\mathrm{E}}\displaylimits_{x}[f(x)^{2}]\leq\alpha implies Ex[f(x)]1/α\mathop{\mathrm{E}}\displaylimits_{x}[\sqrt{f(x)}]\geq\sqrt{1/\alpha}. Note that Ex[f(x)]=1\mathop{\mathrm{E}}\displaylimits_{x}[f(x)]=1. We now apply Hölder’s inequality with F=f2/3F=f^{2/3}, G=f1/3G=f^{1/3}, p=3p=3, and q=3/2q=3/2. We have

    Ex[f(x)]Ex[f(x)2]1/3Ex[f(x)1/2]2/3,\mathop{\mathrm{E}}\displaylimits_{x}[f(x)]\leq\mathop{\mathrm{E}}\displaylimits_{x}[f(x)^{2}]^{1/3}\cdot\mathop{\mathrm{E}}\displaylimits_{x}[f(x)^{1/2}]^{2/3},

    which implies

    C(X)=Ex[f(x)]Ex[f(x)]3/2/Ex[f(x)2]1/21/α.C(X)=\mathop{\mathrm{E}}\displaylimits_{x}[\sqrt{f(x)}]\geq\mathop{\mathrm{E}}\displaylimits_{x}[f(x)]^{3/2}/\mathop{\mathrm{E}}\displaylimits_{x}[f(x)^{2}]^{1/2}\geq\sqrt{1/\alpha}.

    Suppose the lemma is true for T1T-1, we show that it is true for TT. Let f(x)=M1Pr[X1=x]f(x)=M_{1}\cdot\Pr[X_{1}=x]. To apply the induction hypothesis, we consider the conditional random variables (X2,,XT|X1=x)(X_{2},\dots,X_{T}|_{X_{1}=x}) for every x[M1]x\in[M_{1}]. For every x[M1]x\in[M_{1}] and j=2,,Tj=2,\dots,T, we define gj(x)=Mjcp((Xj|X1=x)|(X2,,Xj1|X1=x))g_{j}(x)=M_{j}\cdot\mathrm{cp}((X_{j}|_{X_{1}=x})|(X_{2},\dots,X_{j-1}|_{X_{1}=x})) to be the ”normalized” conditional collision probability. By induction hypothesis, we have C(X2,,XT|X1=x)g2(x)gT(x)C(X_{2},\dots,X_{T}|_{X_{1}=x})\geq\sqrt{g_{2}(x)\cdots g_{T}(x)} for every x[M1]x\in[M_{1}]. It follows that

    C(𝐗)=Ex[f(x)C(X2,,XT|X1=x)]Ex[f(x)/g2(x)gT(x)].C(\mathbf{X})=\mathop{\mathrm{E}}\displaylimits_{x}[\sqrt{f(x)}\cdot C(X_{2},\dots,X_{T}|_{X_{1}=x)}]\geq\mathop{\mathrm{E}}\displaylimits_{x}[\sqrt{f(x)/g_{2}(x)\cdots g_{T}(x)}].

    We use Hölder’s inequality twice to show that Ex[f(x)/g2(x)gT(x)]1/α1αT\mathop{\mathrm{E}}\displaylimits_{x}[\sqrt{f(x)/g_{2}(x)\cdots g_{T}(x)}]\geq\sqrt{1/\alpha_{1}\cdots\alpha_{T}}. Let us first summarize the constraints we have. By definition, we have Ex[f(x)2]α1\mathop{\mathrm{E}}\displaylimits_{x}[f(x)^{2}]\leq\alpha_{1}. Fix j{2,,T}j\in\{2,\dots,T\}. Note that

    cp(Xj|X<j)\displaystyle\mathrm{cp}(X_{j}|X_{<j})
    =\displaystyle= ExX1[cp((Xj|X1=x)|(X2,,Xj1|X1=x))]\displaystyle\mathop{\mathrm{E}}\displaylimits_{x\leftarrow X_{1}}\left[\mathrm{cp}((X_{j}|_{X_{1}=x})|(X_{2},\dots,X_{j-1}|_{X_{1}=x}))\right]
    =\displaystyle= ExX1[gj(x)/Mj]\displaystyle\mathop{\mathrm{E}}\displaylimits_{x\leftarrow X_{1}}[g_{j}(x)/M_{j}]
    =\displaystyle= ExU[M][f(x)gj(x)]/Mj\displaystyle\mathop{\mathrm{E}}\displaylimits_{x\leftarrow U_{[M]}}[f(x)g_{j}(x)]/M_{j}

    It follows that Ex[f(x)gj(x)]αj\mathop{\mathrm{E}}\displaylimits_{x}[f(x)g_{j}(x)]\leq\alpha_{j} for j=2,,Tj=2,\dots,T. Now, we apply the second version of Hölder’s Inequality with F1=(f/g2gT)1/2F_{1}=(f/g_{2}\cdots g_{T})^{1/2}, Fj=(fgj)1/(T+1)F_{j}=(fg_{j})^{1/(T+1)} for j=2,,Tj=2,\dots,T, p1=2/(T+1)p_{1}=2/(T+1), and pj=1/(T+1)p_{j}=1/(T+1) for j=2,,Tj=2,\dots,T, which gives

    Ex[f(x)T/(T+1)]Ex[f(x)/g2(x)gT(x)]2/(T+1)Ex[f(x)g2(x)]1/(T+1)Ex[f(x)gT(x)]1/(T+1),\mathop{\mathrm{E}}\displaylimits_{x}\left[f(x)^{T/(T+1)}\right]\leq\mathop{\mathrm{E}}\displaylimits_{x}\left[\sqrt{f(x)/g_{2}(x)\cdots g_{T}(x)}\right]^{2/(T+1)}\cdot\mathop{\mathrm{E}}\displaylimits_{x}\left[f(x)g_{2}(x)\right]^{1/(T+1)}\cdots\mathop{\mathrm{E}}\displaylimits_{x}\left[f(x)g_{T}(x)\right]^{1/(T+1)},

    so

    Ex[f(x)/g2(x)gT(x)]\displaystyle\mathop{\mathrm{E}}\displaylimits_{x}\left[\sqrt{f(x)/g_{2}(x)\cdots g_{T}(x)}\right] \displaystyle\geq Ex[f(x)T/(T+1)](T+1)/2j=2TEx[f(x)gj(x)]1/2\displaystyle\mathop{\mathrm{E}}\displaylimits_{x}\left[f(x)^{T/(T+1)}\right]^{(T+1)/2}\cdot\prod_{j=2}^{T}\mathop{\mathrm{E}}\displaylimits_{x}\left[f(x)g_{j}(x)\right]^{-1/2}
    \displaystyle\geq Ex[f(x)T/(T+1)](T+1)/21/α2αT.\displaystyle\mathop{\mathrm{E}}\displaylimits_{x}\left[f(x)^{T/(T+1)}\right]^{(T+1)/2}\cdot\sqrt{1/\alpha_{2}\cdots\alpha_{T}}.

    It remains to lower bound the first term by 1/α1\sqrt{1/\alpha_{1}}. We apply Hölder again with F=f2/(T+2)F=f^{2/(T+2)}, G=fT/(T+2)G=f^{T/(T+2)}, p=T+2p=T+2, and q=(T+2)/(T+1)q=(T+2)/(T+1), which gives

    Ex[f(x)]Ex[f(x)2]1/(T+2)Ex[f(x)T/(T+1)](T+1)/(T+2),\mathop{\mathrm{E}}\displaylimits_{x}\left[f(x)\right]\leq\mathop{\mathrm{E}}\displaylimits_{x}\left[f(x)^{2}\right]^{1/(T+2)}\cdot\mathop{\mathrm{E}}\displaylimits_{x}\left[f(x)^{T/(T+1)}\right]^{(T+1)/(T+2)},

    so

    Ex[f(x)T/(T+1)](T+1)/2Ex[f(x)](T+2)/2/Ex[f(x)2]1/21/α1.\mathop{\mathrm{E}}\displaylimits_{x}\left[f(x)^{T/(T+1)}\right]^{(T+1)/2}\geq\mathop{\mathrm{E}}\displaylimits_{x}\left[f(x)\right]^{(T+2)/2}/\mathop{\mathrm{E}}\displaylimits_{x}\left[f(x)^{2}\right]^{1/2}\geq\sqrt{1/\alpha_{1}}.

    Combining the inequalities, we have C(𝐗)1/α1αTC(\mathbf{X})\geq\sqrt{1/\alpha_{1}\cdots\alpha_{T}}.    
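As a quick sanity check of the base case T=1 (purely illustrative, not part of the proof), the following Python sketch draws an arbitrary distribution over [M], computes its normalized collision probability α and its Hellinger closeness to uniform, and verifies that C(X) is at least √(1/α); all parameter choices here are ours.

    import math, random

    random.seed(1)
    M = 16
    w = [random.random() + 0.5 for _ in range(M)]
    p = [wi / sum(w) for wi in w]                 # an arbitrary distribution over [M]

    alpha = M * sum(pi ** 2 for pi in p)          # normalized collision probability: cp(X) = alpha / M
    C = sum(math.sqrt(pi / M) for pi in p)        # Hellinger closeness of X to the uniform distribution

    assert C >= math.sqrt(1 / alpha) - 1e-12      # the base case of Lemma 3.12
    print(f"alpha = {alpha:.4f}, C(X) = {C:.4f} >= sqrt(1/alpha) = {math.sqrt(1 / alpha):.4f}")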

4 Negative Results: How Much Entropy is Necessary?

In this section, we provide lower bounds on the entropy needed for the data items. We show that if KK is not large enough, then for every hash family \mathcal{H}, there exists a block KK-source 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) such that the hashed sequence 𝐘=(H(X1),,H(XT))\mathbf{Y}=(H(X_{1}),\dots,H(X_{T})) does not satisfy the desired closeness to uniform (possibly in conjunction with the hash function HH).

4.1 Lower Bound for Statistical Distance to Uniform Distribution

Let us first consider the requirement that the joint distribution (H,𝐘)(H,\mathbf{Y}) be ε\varepsilon-close to uniform. When there is only one block, this is exactly the requirement for a “strong extractor”. The lower bound in the extractor literature, due to Radhakrishnan and Ta-Shma [RT00], shows that KΩ(M/ε2)K\geq\Omega(M/\varepsilon^{2}) is necessary, which is tight up to a constant factor. Our goal is to show that when hashing TT blocks, the value of KK required for each block increases by a factor of TT. Intuitively, each block will produce some error (i.e., the hashed value is not close to uniform), and the overall error will accumulate over the blocks, so we need to inject more randomness per block to reduce the error. Indeed, we use this intuition to show that KΩ(MT/ε2)K\geq\Omega(MT/\varepsilon^{2}) is necessary for the hashed sequence to be ε\varepsilon-close to uniform, matching the upper bound in Theorem 3.8. Note that the lower bound holds even for a truly random hash family. Formally, we prove the following theorem.

Theorem 4.1

Let N,M,N,M, and TT be positive integers and ε(0,ε0)\varepsilon\in(0,\varepsilon_{0}) a real number such that NMT/ε2N\geq MT/\varepsilon^{2}, where ε0>0\varepsilon_{0}>0 is a small absolute constant. Let H:[N][M]H:[N]\rightarrow[M] be a random hash function from a hash family \mathcal{H}. Then there exist an integer K=Ω(MT/ε2)K=\Omega(MT/\varepsilon^{2}) and a block KK-source 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) such that (H,𝐘)=(H,H(X1),,H(XT))(H,\mathbf{Y})=(H,H(X_{1}),\dots,H(X_{T})) is ε\varepsilon-far from uniform (H,U[M]T)(H,U_{[M]^{T}}) in statistical distance.

To prove the theorem, we need to find such an 𝐗\mathbf{X} for every hash family \mathcal{H}. Following the intuition, we find an XX that incurs a certain amount of error on a single block, and take 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) to be TT i.i.d. copies of XX. More precisely, we first find a KK-source XX such that for an Ω(1)\Omega(1)-fraction of hash functions hh\in\mathcal{H}, h(X)h(X) is Ω(ε/T)\Omega(\varepsilon/\sqrt{T})-far from uniform. This step is the same as the lower bound proof for extractors [RT00], which uses the probabilistic method. We pick XX to be a random flat KK-source, i.e., a uniform distribution over a random set of size KK, and show that XX satisfies the desired property with nonzero probability. The next step is to measure how the error accumulates over independent blocks. Note that for a fixed hash function hh, the hashed sequence (h(X1),,h(XT))(h(X_{1}),\dots,h(X_{T})) consists of TT i.i.d. copies of h(X)h(X). Reyzin [Rey04] has shown that the statistical distance increases by a factor of T\sqrt{T} when we take TT independent copies, provided TT is small. However, Reyzin’s result only shows an increase up to distance O(δ1/3)O(\delta^{1/3}), where δ\delta is the statistical distance of the original random variables. We improve Reyzin’s result to show that the Ω(T)\Omega(\sqrt{T}) growth continues until the distance reaches some absolute constant. We then use it to show that the joint distribution (H,𝐘)(H,\mathbf{Y}) is far from uniform.

The following lemma corresponds to the first step.

Lemma 4.2

Let NN and MM be positive integers and ε(0,1/4),δ(0,1)\varepsilon\in(0,1/4),\delta\in(0,1) real numbers such that NM/ε2N\geq M/\varepsilon^{2}. Let H:[N][M]H:[N]\rightarrow[M] be a random hash function from a hash family \mathcal{H}. Then there exist an integer K=Ω(δ2M/ε2)K=\Omega(\delta^{2}M/\varepsilon^{2}) and a flat KK-source XX over [N][N] such that with probability at least 1δ1-\delta over hHh\leftarrow H, h(X)h(X) is ε\varepsilon-far from uniform.

Proof. Let K=min{αM/ε2,N/2}K=\lfloor\min\{\alpha\cdot M/\varepsilon^{2},N/2\}\rfloor for some α\alpha to be determined later. Let XX be a random flat KK-source over [N][N]. That is, X=USX=U_{S} where S[N]S\subset[N] is a uniformly random size KK subset of [N][N]. We claim that for every hash function h:[N][M]h:[N]\rightarrow[M],

PrS[ h(US) is ε-far from uniform ]1cα\Pr_{S}[\mbox{ $h(U_{S})$ is $\varepsilon$-far from uniform }]\geq 1-c\cdot\sqrt{\alpha} (3)

for some absolute constant cc. Let us assume (3), and prove the lemma first. Since the claim holds for every hash function hh,

PrhH,S[ h(US) is ε-far from uniform ]1cα.\Pr_{h\leftarrow H,S}[\mbox{ $h(U_{S})$ is $\varepsilon$-far from uniform }]\geq 1-c\cdot\sqrt{\alpha}.

Thus, there exists a flat KK-source USU_{S} such that

PrhH[ h(US) is ε-far from uniform ]1cα.\Pr_{h\leftarrow H}[\mbox{ $h(U_{S})$ is $\varepsilon$-far from uniform }]\geq 1-c\cdot\sqrt{\alpha}.

The lemma follows by setting α=min{δ2/c2,1/32}\alpha=\min\{\delta^{2}/c^{2},1/32\}. We proceed to prove (3). It suffices to show that for every y[M]y\in[M], with probability at least 1cα1-c^{\prime}\cdot\sqrt{\alpha} over random USU_{S}, the deviation of Pr[h(US)=y]\Pr[h(U_{S})=y] from 1/M1/M is at least 4ε/M4\varepsilon/M, where cc^{\prime} is another absolute constant. That is,

PrS[|Pr[h(US)=y]1M|4εM]1cα.\Pr_{S}\left[\left|\Pr[h(U_{S})=y]-\frac{1}{M}\right|\geq\frac{4\varepsilon}{M}\right]\geq 1-c^{\prime}\cdot\sqrt{\alpha}. (4)

Again, let us first see why (4) is sufficient to prove (3). We say that y[M]y\in[M] is bad for SS if

|Pr[h(US)=y]1M|4εM.\left|\Pr[h(U_{S})=y]-\frac{1}{M}\right|\geq\frac{4\varepsilon}{M}.

Since Inequality (4) holds for every y[M]y\in[M], we have

PrS,y[y is bad for S]1cα,\Pr_{S,y}[\mbox{$y$ is bad for $S$}]\geq 1-c^{\prime}\cdot\sqrt{\alpha},

where yy is uniformly random over [M][M]. It follows that

PrS[at least 1/2-fraction of y are bad for S]12cα\Pr_{S}[\mbox{at least $1/2$-fraction of $y$ are bad for $S$}]\geq 1-2c^{\prime}\cdot\sqrt{\alpha}

Observe that if at least 1/21/2-fraction of yy are bad for SS, then Δ(h(X),U[M])ε\Delta(h(X),U_{[M]})\geq\varepsilon. Inequality (3) follows by setting c=2cc=2c^{\prime}.

It remains to prove (4). Let T=h1(y)T=h^{-1}(y). We have Pr[h(US)=y]=|ST|/|S|\Pr[h(U_{S})=y]=|S\cap T|/|S|. Thus, recalling that KαM/ε2K\leq\alpha M/\varepsilon^{2}, (4) follows from the inequality

PrS[||ST|KM|<4KεM]cKε2M,\Pr_{S}\left[\left||S\cap T|-\frac{K}{M}\right|<\frac{4K\varepsilon}{M}\right]\leq c^{\prime}\cdot\sqrt{\frac{K\varepsilon^{2}}{M}},

which follows from the claim below by setting L=K/ML=K/M and β=4εK/M\beta=4\varepsilon\sqrt{K/M}. (Working out the parameters, we have c=4c′′c^{\prime}=4c^{\prime\prime}; moreover, ε<1/4\varepsilon<1/4 implies β<L\beta<\sqrt{L}, and α1/32\alpha\leq 1/32 implies β<1\beta<1.)

Claim 4.3

Let N,K>1N,K>1 be positive integers such that N>2KN>2K, and L[0,K/2]L\in[0,K/2], β(0,min{1,L})\beta\in(0,\min\{1,\sqrt{L}\}) real numbers. Let S[N]S\subset[N] be a random subset of size KK, and T[N]T\subset[N] be a fixed subset of arbitrary size. We have

PrS[||ST|L|βL]c′′β,\Pr_{S}\left[\left||S\cap T|-L\right|\leq\beta\sqrt{L}\right]\leq c^{\prime\prime}\cdot\beta,

for some absolute constant c′′c^{\prime\prime}.

Intuitively, the probability in the claim is maximized when the set TT has size NL/KNL/K so that L=ES[|ST|]L=\mathop{\mathrm{E}}\displaylimits_{S}[|S\cap T|], and the claim follows by observing that in this case, the distribution of |ST||S\cap T| has standard deviation Θ(L)\Theta(\sqrt{L}) and each possible outcome has probability O(1/L)O(\sqrt{1/L}). The formal proof of the claim, given in Appendix A, expresses the probability in terms of binomial coefficients and estimates them using Stirling’s formula.    
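To get a feel for the parameters in Lemma 4.2, the following Monte Carlo sketch (illustrative only; the sizes N, M, K, and the number of trials are arbitrary choices of ours) hashes a random flat K-source with a truly random function and reports how far the output typically is from uniform; when K is only on the order of M/ε², the distance is indeed on the order of ε ≈ √(M/K).

    import random

    random.seed(0)
    N, M, K, trials = 10_000, 8, 64, 300          # illustrative sizes; sqrt(M/K) is about 0.35 here

    def tv_from_uniform(counts, K, M):
        # Statistical distance between the hashed flat source and the uniform distribution on [M].
        return 0.5 * sum(abs(c / K - 1 / M) for c in counts)

    dists = []
    for _ in range(trials):
        S = random.sample(range(N), K)            # a random flat K-source U_S
        h = {x: random.randrange(M) for x in S}   # a truly random hash function, sampled only where needed
        counts = [0] * M
        for x in S:
            counts[h[x]] += 1
        dists.append(tv_from_uniform(counts, K, M))

    dists.sort()
    print(f"median distance of h(U_S) from uniform: {dists[trials // 2]:.3f}")
    print(f"rough sqrt(M/K) scale:                  {(M / K) ** 0.5:.3f}")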

The next step is to measure how the statistical distance grows when we take independent copies.

Lemma 4.4

Let XX and YY be random variables over [M][M] such that Δ(X,Y)ε\Delta(X,Y)\geq\varepsilon. Let 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) be TT i.i.d. copies of XX, and let 𝐘=(Y1,,YT)\mathbf{Y}=(Y_{1},\dots,Y_{T}) be TT i.i.d. copies of YY. We have

Δ(𝐗,𝐘)min{ε0,cTε},\Delta(\mathbf{X},\mathbf{Y})\geq\min\{\varepsilon_{0},c\sqrt{T}\cdot\varepsilon\},

where ε0,c\varepsilon_{0},c are absolute constants.

We defer the proof of the above lemma to Appendix B.
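To illustrate the √T growth in Lemma 4.4, the following sketch (a direct computation, with the bias ε and the values of T chosen arbitrarily by us) evaluates the statistical distance between T i.i.d. Bernoulli(1/2−ε) bits and T uniform bits; since both distributions are uniform within each Hamming-weight class, it suffices to compare the two weight distributions.

    from math import lgamma, log, exp

    def binom_pmf(T, k, p):
        # Binomial(T, p) probability mass at k, computed in log-space to avoid underflow.
        return exp(lgamma(T + 1) - lgamma(k + 1) - lgamma(T - k + 1)
                   + k * log(p) + (T - k) * log(1 - p))

    def dist_biased_bits_vs_uniform(T, eps):
        # Statistical distance between T i.i.d. Bernoulli(1/2 - eps) bits and T uniform bits.
        return 0.5 * sum(abs(binom_pmf(T, k, 0.5 + eps) - binom_pmf(T, k, 0.5))
                         for k in range(T + 1))

    eps = 0.01
    for T in (1, 4, 16, 64, 256, 1024):
        d = dist_biased_bits_vs_uniform(T, eps)
        print(f"T = {T:5d}   distance = {d:.4f}   distance / (sqrt(T) * eps) = {d / (T ** 0.5 * eps):.3f}")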

  • Proof of Theorem 4.1:   The absolute constant ε0\varepsilon_{0} in the theorem is half of the ε0\varepsilon_{0} in Lemma 4.4. By Lemma 4.2, there is a flat KK-source XX such that for a 1/21/2-fraction of hash functions hh\in\mathcal{H}, h(X)h(X) is (2ε/cT)(2\varepsilon/c\sqrt{T})-far from uniform, for K=Ω((1/2)2M/(2ε/cT)2)=Ω(MT/ε2)K=\Omega((1/2)^{2}M/(2\varepsilon/c\sqrt{T})^{2})=\Omega(MT/\varepsilon^{2}). We set 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) to be TT independent copies of XX. Consider a hash function hh such that h(X)h(X) is (2ε/cT)(2\varepsilon/c\sqrt{T})-far from uniform. By Lemma 4.4, (h(X1),,h(XT))(h(X_{1}),\dots,h(X_{T})) is 2ε2\varepsilon-far from uniform. Note that this holds for a 1/21/2-fraction of hash functions hh. It follows that

    Δ((H,𝐘),(H,U[M]T))=EhH[Δ((h(X1),,h(XT)),U[M]T)]122ε=ε.\Delta((H,\mathbf{Y}),(H,U_{[M]^{T}}))=\mathop{\mathrm{E}}\displaylimits_{h\leftarrow H}\left[\Delta((h(X_{1}),\dots,h(X_{T})),U_{[M]^{T}})\right]\geq\frac{1}{2}\cdot 2\varepsilon=\varepsilon.

     

4.2 Lower Bound for Small Collision Probability

In this subsection, we prove lower bounds on the entropy needed per item to ensure that the sequence of hashed values is close to having small collision probability. Since this requirement is less stringent than being close to uniform, less entropy is needed from the source. The interesting setting in applications is to require the hashed sequence (H,𝐘)=(H,H(X1),,H(XT))(H,\mathbf{Y})=(H,H(X_{1}),\dots,H(X_{T})) to be ε\varepsilon-close to having collision probability O(1/(||MT))O(1/(|\mathcal{H}|\cdot M^{T})). Recall that in this setting, instead of requiring KMT/ε2K\geq MT/\varepsilon^{2}, KΩ(MT/ε)K\geq\Omega(MT/\varepsilon) is sufficient for 22-universal hash functions (Theorem 3.1), and KΩ(MT+TM/ε)K\geq\Omega(MT+T\sqrt{M/\varepsilon}) is sufficient for 44-wise independent hash functions (Theorem 3.5). The main improvement from 22-universal to 44-wise independent hashing is the better dependency on ε\varepsilon. Indeed, it can be shown that if we use truly random hash functions, we can reduce the dependency on ε\varepsilon to log(1/ε)\log(1/\varepsilon). Since we are now proving lower bounds for arbitrary hash families, we focus on the dependency on MM and TT. Specifically, our goal is to show that K=Ω(MT)K=\Omega(MT) is necessary. More precisely, we show that when KMTK\ll MT, it is possible for the hashed sequence (H,𝐘)(H,\mathbf{Y}) to be .99.99-far from any distribution that has collision probability less than 100/(||MT)100/(|\mathcal{H}|\cdot M^{T}).

We use the same strategy as in the previous subsection to prove this lower bound. Fixing a hash family \mathcal{H}, we take TT independent copies (X1,,XT)(X_{1},\dots,X_{T}) of the worst-case XX found in Lemma 4.2, and show that (H,H(X1),,H(XT))(H,H(X_{1}),\dots,H(X_{T})) is far from having small collision probability. The new ingredient is to show that when we have TT independent copies, and KMTK\ll MT, then (h(X1),,h(XT))(h(X_{1}),\dots,h(X_{T})) is very far from uniform (say, 0.990.99-far) for many hh\in\mathcal{H}. We then argue that in this case, we cannot reduce the collision probability of (h(X1),,h(XT))(h(X_{1}),\dots,h(X_{T})) by changing a small fraction of the distribution, which implies that the overall distribution (H,𝐘)(H,\mathbf{Y}) is far from any distribution (H,𝐙)(H^{\prime},\mathbf{Z}) with small collision probability. Formally, we prove the following theorem.

Theorem 4.5

Let N,M,N,M, and TT be positive integers such that NMTN\geq MT. Let δ(0,1)\delta\in(0,1) and α>1\alpha>1 be real numbers such that α<δ3eT/32/128\alpha<\delta^{3}\cdot e^{T/32}/128. Let H:[N][M]H:[N]\rightarrow[M] be a random hash function from a hash family \mathcal{H}. There exists an integer K=Ω(δ2MT/log(α/δ))K=\Omega(\delta^{2}MT/\log(\alpha/\delta)), and a block KK-source 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) such that (H,𝐘)=(H,H(X1),,H(XT))(H,\mathbf{Y})=(H,H(X_{1}),\dots,H(X_{T})) is (1δ)(1-\delta)-far from any distribution (H,𝐙)(H^{\prime},\mathbf{Z}) with cp(H,𝐙)α/(||MT)\mathrm{cp}(H^{\prime},\mathbf{Z})\leq\alpha/(|\mathcal{H}|\cdot M^{T}).

Think of α\alpha and δ\delta as constants. Then the theorem says that K=Ω(MT)K=\Omega(MT) is necessary for the hashed sequence (H,H(X1),,H(XT))(H,H(X_{1}),\dots,H(X_{T})) to be close to having small collision probability, matching the upper bound in Theorem 3.1. In the previous proof, we used Lemma 4.4 to measure the increase of distance over blocks. However, the lemma can only measure the progress up to some small constant. It is known that if the number of copies TT is larger than Ω(1/ε2)\Omega(1/\varepsilon^{2}), where ε\varepsilon is the statistical distance of the original pair of random variables, then the statistical distance goes to 11 exponentially fast. Formally, we use the following lemma.

Lemma 4.6 ([SV99])

Let XX and YY be random variables over [M][M] such that Δ(X,Y)ε\Delta(X,Y)\geq\varepsilon. Let 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) be TT i.i.d. copies of XX, and let 𝐘=(Y1,,YT)\mathbf{Y}=(Y_{1},\dots,Y_{T}) be TT i.i.d. copies of YY. We have

Δ(𝐗,𝐘)1eTε2/2.\Delta(\mathbf{X},\mathbf{Y})\geq 1-e^{-T\varepsilon^{2}/2}.

We remark that Lemmas 4.4 and 4.6 are incomparable: in the parameter range of Lemma 4.4, Lemma 4.6 only gives Δ(𝐗,𝐘)Ω(Tε2)\Delta(\mathbf{X},\mathbf{Y})\geq\Omega(T\varepsilon^{2}) instead of Ω(Tε)\Omega(\sqrt{T}\varepsilon). To argue that the overall distribution is far from having small collision probability, we introduce the following notion of nonuniformity.
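For a rough comparison of the two bounds (illustrative arithmetic only, with the constants c and ε₀ replaced by 1 and 1/4 purely for illustration; these are not the actual constants), one can tabulate both expressions:

    from math import exp, sqrt

    eps = 0.01
    for T in (10, 100, 1_000, 10_000, 100_000):
        lemma_4_4_style = min(0.25, sqrt(T) * eps)        # min{eps_0, c*sqrt(T)*eps} with eps_0 = 1/4, c = 1
        lemma_4_6 = 1 - exp(-T * eps * eps / 2)           # the bound of Lemma 4.6
        print(f"T = {T:6d}   Lemma 4.4 style: {lemma_4_4_style:.4f}   Lemma 4.6: {lemma_4_6:.4f}")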

Definition 4.7

Let XX be a random variable over [M][M] with probability mass function pp. XX is (δ,β)(\delta,\beta)-nonuniform if for every function q:[M]q:[M]\rightarrow{\mathbb{R}} such that 0q(x)p(x)0\leq q(x)\leq p(x) for all x[M]x\in[M], and xq(x)δ\sum_{x}q(x)\geq\delta, the function satisfies

x[M]q(x)2>β/M.\sum_{x\in[M]}q(x)^{2}>\beta/M.

Intuitively, a distribution XX over [M][M] being (δ,β)(\delta,\beta)-nonuniform means that even if we remove a (1δ)(1-\delta)-fraction of the probability mass from XX, the “collision probability” remains greater than β/M\beta/M. In particular, XX is then (1δ)(1-\delta)-far from any random variable YY with cp(Y)β/M\mathrm{cp}(Y)\leq\beta/M.
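As a concrete illustration of Definition 4.7 (not used in our proofs), one can test (δ,β)-nonuniformity numerically: minimizing the sum of q(x)² subject to 0 ≤ q ≤ p and total mass δ is a water-filling problem whose optimum is q(x) = min{p(x), λ} for a suitable threshold λ. The following sketch, with a toy distribution of our choosing, checks the definition this way.

    def min_collision_mass(p, delta, iters=60):
        # Minimize sum(q^2) over 0 <= q <= p with sum(q) = delta.
        # The minimizer is q(x) = min(p(x), lam); find lam by bisection.
        lo, hi = 0.0, max(p)
        for _ in range(iters):
            lam = (lo + hi) / 2
            if sum(min(px, lam) for px in p) < delta:
                lo = lam
            else:
                hi = lam
        lam = (lo + hi) / 2
        return sum(min(px, lam) ** 2 for px in p)

    def is_nonuniform(p, delta, beta):
        # Check whether the distribution p over [M] is (delta, beta)-nonuniform.
        return min_collision_mass(p, delta) > beta / len(p)

    # Toy example: all mass spread over 10% of a domain of size 100.
    p = [0.1 if x < 10 else 0.0 for x in range(100)]
    print(is_nonuniform(p, delta=0.5, beta=2.0))   # True: even half the mass has sum q^2 = 0.025 > 2/100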

Lemma 4.8

Let XX be a random variable over [M][M]. If XX is (1η)(1-\eta)-far from uniform, then XX is (2βη,β)(2\sqrt{\beta\cdot\eta},\beta)-nonuniform for every β1\beta\geq 1.

Proof. Let pp be the probability mass function of XX, and q:[M]q:[M]\rightarrow{\mathbb{R}} be a function such that 0q(x)p(x)0\leq q(x)\leq p(x) for every x[M]x\in[M], and xq(x)2βη\sum_{x}q(x)\geq 2\sqrt{\beta\cdot\eta}. Our goal is to show that xq(x)2>β/M\sum_{x}q(x)^{2}>\beta/M. Let T={x[M]:p(x)1/M}T=\{x\in[M]:p(x)\geq 1/M\}. Note that

Δ(X,U[M])=Pr[XT]Pr[U[M]T]1η.\Delta(X,U_{[M]})=\Pr[X\in T]-\Pr[U_{[M]}\in T]\geq 1-\eta.

This implies Pr[XT]1η\Pr[X\in T]\geq 1-\eta, and μ(T)=Pr[U[M]T]η\mu(T)=\Pr[U_{[M]}\in T]\leq\eta. Now,

xTq(x)2βηPr[XT]2βηη>βη,\sum_{x\in T}q(x)\geq 2\sqrt{\beta\cdot\eta}-\Pr[X\notin T]\geq 2\sqrt{\beta\cdot\eta}-\eta>\sqrt{\beta\cdot\eta},

and μ(T)η\mu(T)\leq\eta implies

x[M]q(x)2xTq(x)2(xTq(x))2|T|>βM.\sum_{x\in[M]}q(x)^{2}\geq\sum_{x\in T}q(x)^{2}\geq\frac{\left(\sum_{x\in T}q(x)\right)^{2}}{|T|}>\frac{\beta}{M}.

 

We are ready to prove Theorem 4.5.

  • Proof of Theorem 4.5:   By Lemma 4.2 with ε=2ln(128α/δ3)/T<1/4\varepsilon=\sqrt{2\ln(128\alpha/\delta^{3})/T}<1/4, there is a flat KK-source XX such that for a (1δ/4)(1-\delta/4)-fraction of hash functions hh\in\mathcal{H}, h(X)h(X) is ε\varepsilon-far from uniform, for K=Ω((δ/4)2M/ε2)=Ω(δ2MT/log(α/δ))K=\Omega((\delta/4)^{2}M/\varepsilon^{2})=\Omega(\delta^{2}MT/\log(\alpha/\delta)). We set 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) to be TT independent copies of XX. Consider a hash function hh such that h(X)h(X) is ε\varepsilon-far from uniform. By Lemma 4.6, (h(X1),,h(XT))(h(X_{1}),\dots,h(X_{T})) is (1η)(1-\eta)-far from uniform, for η=eε2T/2=δ3/128α\eta=e^{-\varepsilon^{2}T/2}=\delta^{3}/128\alpha. By Lemma 4.8, (h(X1),,h(XT))(h(X_{1}),\dots,h(X_{T})) is (δ/4,2α/δ)(\delta/4,2\alpha/\delta)-nonuniform for a (1δ/4)(1-\delta/4)-fraction of hash functions hh. By the first statement of Lemma 4.9 below, this implies that (H,𝐘)(H,\mathbf{Y}) is (1δ)(1-\delta)-far from any distribution (H,𝐙)(H^{\prime},\mathbf{Z}) with collision probability at most α/(||MT)\alpha/(|\mathcal{H}|\cdot M^{T}).    

Lemma 4.9

Let (H,Y)(H,Y) be a joint distribution over ×[M]\mathcal{H}\times[M] such that the marginal distribution HH is uniform over \mathcal{H}. Let ε,δ,α\varepsilon,\delta,\alpha be positive real numbers.

  1. 1.

    If Y|H=hY|_{H=h} is (δ/4,2α/δ)(\delta/4,2\alpha/\delta)-nonuniform for at least (1δ/4)(1-\delta/4)-fraction of hh\in\mathcal{H}, then (H,Y)(H,Y) is (1δ)(1-\delta)-far from any distribution (H,Z)(H^{\prime},Z) with cp(H,Z)α/(||M)\mathrm{cp}(H^{\prime},Z)\leq\alpha/(|\mathcal{H}|\cdot M).

  2. 2.

    If Y|H=hY|_{H=h} is (0.1,2α/ε)(0.1,2\alpha/\varepsilon)-nonuniform for at least a 2ε2\varepsilon-fraction of hh\in\mathcal{H}, then (H,Y)(H,Y) is ε\varepsilon-far from any distribution (H,Z)(H^{\prime},Z) with cp(H,Z)α/(||M)\mathrm{cp}(H^{\prime},Z)\leq\alpha/(|\mathcal{H}|\cdot M).

Proof. We first introduce some notation. Given a candidate distribution (H,Z)(H^{\prime},Z) over ×[M]\mathcal{H}\times[M], for every hh\in\mathcal{H} we define qh:[M]q_{h}:[M]\rightarrow{\mathbb{R}} by

qh(y)=min{Pr[(H,Y)=(h,y)],Pr[(H,Z)=(h,y)]}q_{h}(y)=\min\{\Pr[(H,Y)=(h,y)],\Pr[(H^{\prime},Z)=(h,y)]\}

for every y[M]y\in[M]. We also define f:f:\mathcal{H}\rightarrow{\mathbb{R}} by

f(h)=y[M]qh(y)Pr[H=h]=1||.f(h)=\sum_{y\in[M]}q_{h}(y)\leq\Pr[H=h]=\frac{1}{|\mathcal{H}|}.

For the first statement, let (H,Z)(H^{\prime},Z) be a random variable over ×[M]\mathcal{H}\times[M] that is (1δ)(1-\delta)-close to (H,Y)(H,Y). We need to show that cp(H,Z)>α/(||M)\mathrm{cp}(H^{\prime},Z)>\alpha/(|\mathcal{H}|\cdot M). Note that hf(h)=1Δ((H,Y),(H,Z))δ\sum_{h}f(h)=1-\Delta((H,Y),(H^{\prime},Z))\geq\delta. So there are at least a (3δ/4)(3\delta/4)-fraction of hash functions hh with f(h)(δ/4)/||f(h)\geq(\delta/4)/|\mathcal{H}|. At least a (3δ/4)(δ/4)=δ/2(3\delta/4)-(\delta/4)=\delta/2-fraction of hh satisfy both f(h)(δ/4)/||f(h)\geq(\delta/4)/|\mathcal{H}| and Y|H=hY|_{H=h} is (δ/4,2α/δ)(\delta/4,2\alpha/\delta)-nonuniform. By the definition of nonuniformity, for each such hh, we have

y[M](||qh(y))2>2αδM.\sum_{y\in[M]}(|\mathcal{H}|\cdot q_{h}(y))^{2}>\frac{2\alpha}{\delta\cdot M}.

Therefore,

cp(H,Z)h,yqh(y)2>(δ2||)2αδ||2M=α||M.\mathrm{cp}(H^{\prime},Z)\geq\sum_{h,y}q_{h}(y)^{2}>\left(\frac{\delta}{2}\cdot|\mathcal{H}|\right)\cdot\frac{2\alpha}{\delta\cdot|\mathcal{H}|^{2}M}=\frac{\alpha}{|\mathcal{H}|\cdot M}.

Similarly, for the second statement, let (H,Z)(H^{\prime},Z) be a random variable over ×[M]\mathcal{H}\times[M] that is ε\varepsilon-close to (H,Y)(H,Y). We need to show that cp(H,Z)>α/(||M)\mathrm{cp}(H^{\prime},Z)>\alpha/(|\mathcal{H}|\cdot M). Note that hf(h)=1Δ((H,Y),(H,Z))1ε\sum_{h}f(h)=1-\Delta((H,Y),(H^{\prime},Z))\geq 1-\varepsilon. So there are at least a 1ε/0.91-\varepsilon/0.9-fraction of hh with f(h)0.1/||f(h)\geq 0.1/|\mathcal{H}|. At least a 2εε/0.9>ε/22\varepsilon-\varepsilon/0.9>\varepsilon/2-fraction of hash functions satisfy both f(h)0.1/||f(h)\geq 0.1/|\mathcal{H}| and Y|H=hY|_{H=h} is (0.1,2α/ε)(0.1,2\alpha/\varepsilon)-nonuniform. By the definition of nonuniformity, for each such hh, we have

y[M](||qh(y))2>2αεM.\sum_{y\in[M]}(|\mathcal{H}|\cdot q_{h}(y))^{2}>\frac{2\alpha}{\varepsilon\cdot M}.

Therefore,

cp(H,Z)h,yqh(y)2>(ε2||)2αε||2M=α||M.\mathrm{cp}(H^{\prime},Z)\geq\sum_{h,y}q_{h}(y)^{2}>\left(\frac{\varepsilon}{2}\cdot|\mathcal{H}|\right)\cdot\frac{2\alpha}{\varepsilon\cdot|\mathcal{H}|^{2}M}=\frac{\alpha}{|\mathcal{H}|\cdot M}.

 

4.3 Lower Bounds for the Distribution of Hashed Values Only

We can extend our lower bounds to the distribution of the hashed sequence 𝐘=(H(X1),,H(XT))\mathbf{Y}=(H(X_{1}),\dots,H(X_{T})) alone (without HH) for both closeness requirements, at the price of losing the dependency on ε\varepsilon and incurring some dependency on the size of the hash family. Let 2d=||2^{d}=|\mathcal{H}| be the size of the hash family. The dependency on dd is necessary. Intuitively, the hashed sequence 𝐘\mathbf{Y} contains at most TmT\cdot m bits of entropy, while the input (H,X1,,XT)(H,X_{1},\dots,X_{T}) contains at least d+Tkd+T\cdot k bits of entropy. When dd is large enough, it is possible that all the randomness of the hashed sequence comes from the randomness of the hash function. Indeed, if HH is TT-wise independent (which is possible with dTmd\simeq T\cdot m), then (H(X1),,H(XT))(H(X_{1}),\dots,H(X_{T})) is uniform when X1,,XTX_{1},\dots,X_{T} are all distinct. Therefore,

Δ((H(X1),,H(XT)),U[M]T)Pr[ not all X1,,XT are distinct ]\Delta((H(X_{1}),\dots,H(X_{T})),U_{[M]^{T}})\leq\Pr[\mbox{ not all $X_{1},\dots,X_{T}$ are distinct }]

Thus, by a birthday bound, K=Ω(T2)K=\Omega(T^{2}) (independent of MM) suffices to make the hashed values close to uniform.
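The birthday-style calculation behind this remark is elementary; the sketch below (with arbitrary illustrative values of T and K, and taking the X_i to be i.i.d. flat K-sources) compares the exact collision probability to the union bound T(T−1)/(2K).

    def prob_some_collision(T, K):
        # Exact probability that T i.i.d. samples from a flat K-source are NOT all distinct.
        p_all_distinct = 1.0
        for i in range(T):
            p_all_distinct *= (K - i) / K
        return 1.0 - p_all_distinct

    T = 100
    for K in (T * T // 2, 2 * T * T, 20 * T * T):
        print(f"K = {K:7d}   Pr[collision] = {prob_some_collision(T, K):.4f}   "
              f"union bound T(T-1)/(2K) = {T * (T - 1) / (2 * K):.4f}")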

Theorem 4.10

Let N,M,TN,M,T be positive integers, and dd a positive real number such that NMT/dN\geq MT/d. Let δ(0,1)\delta\in(0,1), α>1\alpha>1 be real numbers such that α2d<δ3eT/32/128\alpha\cdot 2^{d}<\delta^{3}\cdot e^{T/32}/128. Let H:[N][M]H:[N]\rightarrow[M] be a random hash function from a hash family \mathcal{H} of size at most 2d2^{d}. There exist an integer K=Ω(δ2MT/(dlog(α/δ)))K=\Omega(\delta^{2}MT/(d\cdot\log(\alpha/\delta))) and a block KK-source 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) such that 𝐘=(H(X1),,H(XT))\mathbf{Y}=(H(X_{1}),\dots,H(X_{T})) is (1δ)(1-\delta)-far from any distribution 𝐙=(Z1,,ZT)\mathbf{Z}=(Z_{1},\dots,Z_{T}) with cp(𝐙)α/MT\mathrm{cp}(\mathbf{Z})\leq\alpha/M^{T}. In particular, 𝐘\mathbf{Y} is (1δ)(1-\delta)-far from uniform.

Think of α\alpha and δ\delta as constants. Then the theorem says that when the hash function contains dT/(32ln2)O(1)d\leq T/(32\ln 2)-O(1) bits of randomness, K=Ω(MT/d)K=\Omega(MT/d) is necessary for the hashed sequence to be close to uniform. For example, in some typical hash applications, N=poly(M)N={\mathrm{poly}}(M) and the hash function is 22-universal or O(1)O(1)-wise independent. In this case, d=O(logM)d=O(\log M) and we need K=Ω(MT/logM)K=\Omega(MT/\log M). (Recall that our upper bound in Theorem 3.1 says that K=O(MT)K=O(MT) suffices.)

Proof. We will deduce the theorem from Theorem 4.5. Replacing the parameter α\alpha by α2d\alpha\cdot 2^{d} in Theorem 4.5, we know that there exists an integer K=Ω(δ2MT/(dlog(α/δ)))K=\Omega(\delta^{2}MT/(d\cdot\log(\alpha/\delta))) and a block KK-source 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) such that (H,𝐘)=(H,H(X1),,H(XT))(H,\mathbf{Y})=(H,H(X_{1}),\dots,H(X_{T})) is (1δ)(1-\delta)-far from any distribution (H,𝐙)(H^{\prime},\mathbf{Z}) with cp(H,𝐙)α2d/(2dMT)=α/MT\mathrm{cp}(H^{\prime},\mathbf{Z})\leq\alpha\cdot 2^{d}/(2^{d}\cdot M^{T})=\alpha/M^{T}. Now, suppose we are given a random variable 𝐙\mathbf{Z} on [M]T[M]^{T} with Δ(𝐘,𝐙)1δ\Delta(\mathbf{Y},\mathbf{Z})\leq 1-\delta. Then we can define an HH^{\prime} such that Δ((H,𝐘),(H,𝐙))=Δ(𝐘,𝐙)\Delta((H,\mathbf{Y}),(H^{\prime},\mathbf{Z}))=\Delta(\mathbf{Y},\mathbf{Z}) (indeed, define the conditional distribution H|𝐙=𝐳H^{\prime}|_{\mathbf{Z}=\mathbf{z}} to equal H|𝐘=𝐳H|_{\mathbf{Y}=\mathbf{z}} for every 𝐳[M]T\mathbf{z}\in[M]^{T}). Then we have

cp(𝐙)cp(H,𝐙)>αMT.\mathrm{cp}(\mathbf{Z})\geq\mathrm{cp}(H^{\prime},\mathbf{Z})>\frac{\alpha}{M^{T}}.

 

One limitation of the above lower bound is that it only works when dT/(32ln2)O(1)d\leq T/(32\ln 2)-O(1). For example, the lower bound cannot be applied when the hash function is TT-wise independent. Although d=Ω(T)d=\Omega(T) may not be interesting in practice, for the sake of completeness, we provide another simple lower bound to cover this parameter region.

Theorem 4.11

Let N,M,TN,M,T be positive integers, and δ(0,1)\delta\in(0,1), α>1\alpha>1, d>0d>0 real numbers. Let H:[N][M]H:[N]\rightarrow[M] be a random hash function from a hash family \mathcal{H} of size at most 2d2^{d}. Suppose KNK\leq N is an integer such that K(δ2/(4α2d))1/TMK\leq(\delta^{2}/(4\alpha\cdot 2^{d}))^{1/T}\cdot M. Then there exists a block KK-source 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) such that 𝐘=(H(X1),,H(XT))\mathbf{Y}=(H(X_{1}),\dots,H(X_{T})) is (1δ)(1-\delta)-far from any distribution 𝐙=(Z1,,ZT)\mathbf{Z}=(Z_{1},\dots,Z_{T}) with cp(𝐙)α/MT\mathrm{cp}(\mathbf{Z})\leq\alpha/M^{T}. In particular, 𝐘\mathbf{Y} is (1δ)(1-\delta)-far from uniform.

Again, think of α\alpha and δ\delta as constants. The theorem says that K=Ω(M/2d/T)K=\Omega(M/2^{d/T}) is necessary for the hashed sequence to be close to uniform. In particular, when d=Θ(T)d=\Theta(T), K=Ω(M)K=\Omega(M) is necessary. Theorem 4.10 gives the same conclusion, but only works for dT/(32ln2)O(1)d\leq T/(32\ln 2)-O(1). On the other hand, when d=o(T)d=o(T), Theorem 4.10 gives the better lower bound K=Ω(MT/d)K=\Omega(MT/d).

Proof. Let XX be any flat KK-source, i.e., a uniform distribution over a set of size KK. We simply take 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) to be TT independent copies of XX. Note that the support of 𝐘\mathbf{Y} is at most as large as that of (H,𝐗)(H,\mathbf{X}). Thus,

|supp(𝐘)||supp(H,𝐗)|=2dKTδ24αMT.|\mathrm{supp}(\mathbf{Y})|\leq|\mathrm{supp}(H,\mathbf{X})|=2^{d}\cdot K^{T}\leq\frac{\delta^{2}}{4\alpha}\cdot M^{T}.

Therefore, 𝐘\mathbf{Y} is (1δ2/(4α))(1-\delta^{2}/(4\alpha))-far from uniform. By Lemma 4.8 (together with the remark after Definition 4.7), 𝐘\mathbf{Y} is (1δ)(1-\delta)-far from any distribution 𝐙=(Z1,,ZT)\mathbf{Z}=(Z_{1},\dots,Z_{T}) with cp(𝐙)α/MT\mathrm{cp}(\mathbf{Z})\leq\alpha/M^{T}.    
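The support-counting step above amounts to simple arithmetic; the following sketch (illustrative parameter values of our choosing) works it out for one setting of M, T, d, α, δ.

    # If K <= (delta^2 / (4 * alpha * 2^d))^(1/T) * M, then 2^d * K^T <= (delta^2 / (4*alpha)) * M^T,
    # so Y misses all but a delta^2/(4*alpha) fraction of [M]^T.
    M, T, d = 64, 10, 8
    alpha, delta = 2.0, 0.5
    K = int((delta ** 2 / (4 * alpha * 2 ** d)) ** (1 / T) * M)
    support_bound = 2 ** d * K ** T
    print(f"K = {K}")
    print(f"|supp(Y)| <= {support_bound:.3e} <= {(delta ** 2 / (4 * alpha)) * M ** T:.3e}")
    print(f"Delta(Y, uniform) >= {1 - support_bound / M ** T:.4f} >= 1 - delta^2/(4*alpha) = {1 - delta ** 2 / (4 * alpha):.4f}")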

4.4 Lower Bound for 22-universal Hash Functions

In this subsection, we show that Theorem 3.1 is almost tight in the following sense. We show that there exist K=Ω(MT/(εlog(1/ε)))K=\Omega(MT/(\varepsilon\cdot\log(1/\varepsilon))), a 22-universal hash family \mathcal{H}, and a block KK-source 𝐗\mathbf{X} such that (H,𝐘)(H,\mathbf{Y}) is ε\varepsilon-far from having collision probability 100/(||MT)100/(|\mathcal{H}|\cdot M^{T}). The improvement over Theorem 4.5 is the almost tight dependency on ε\varepsilon. Recall that Theorem 3.1 says that for a 22-universal hash family, K=O(MT/ε)K=O(MT/\varepsilon) suffices. The upper and lower bounds differ by a factor of log(1/ε)\log(1/\varepsilon). In particular, our result for 44-wise independent hash functions (Theorem 3.5) cannot be achieved with 22-universal hash functions. The lower bound can further be extended to the distribution of the hashed sequence 𝐘=(H(X1),,H(XT))\mathbf{Y}=(H(X_{1}),\dots,H(X_{T})) alone, as in the previous subsection. Furthermore, since the 22-universal hash family we use has small size, we only pay a factor of O(logM)O(\log M) in the lower bound on KK. Formally, we prove the following results.

Theorem 4.12

For every prime power MM, real numbers ε(0,1/4)\varepsilon\in(0,1/4) and α1\alpha\geq 1, the following holds. For all integers tt and NN such that εMt11\varepsilon\cdot M^{t-1}\geq 1 and N6εM2tN\geq 6\varepsilon M^{2t}, and for T=ε2M2t1log(α/ε)T=\lceil\varepsilon^{2}M^{2t-1}\log(\alpha/\varepsilon)\rceil (for technical reasons, our lower bound proof does not work for every sufficiently large TT; however, the density of TT for which the lower bound holds is 1/M21/M^{2}), there exist an integer K=Ω(MT/(εlog(α/ε)))K=\Omega(MT/(\varepsilon\cdot\log(\alpha/\varepsilon))), a 22-universal hash family \mathcal{H} from [N][N] to [M][M], and a block KK-source 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) such that (H,𝐘)=(H,H(X1),,H(XT))(H,\mathbf{Y})=(H,H(X_{1}),\dots,H(X_{T})) is ε\varepsilon-far from any distribution (H,𝐙)(H^{\prime},\mathbf{Z}) with cp(H,𝐙)α/(||MT)\mathrm{cp}(H^{\prime},\mathbf{Z})\leq\alpha/(|\mathcal{H}|\cdot M^{T}).

Theorem 4.13

For every prime power MM, real numbers ε(0,1/4)\varepsilon\in(0,1/4) and α1\alpha\geq 1, the following holds. For all integers tt and NN such that εMt11\varepsilon\cdot M^{t-1}\geq 1 and N6εM2tN\geq 6\varepsilon M^{2t}, and for T=ε2M2t1log(αM/ε)T=\lceil\varepsilon^{2}M^{2t-1}\log(\alpha M/\varepsilon)\rceil, there exist an integer K=Ω(MT/(εlog(αM/ε)))K=\Omega(MT/(\varepsilon\cdot\log(\alpha M/\varepsilon))), a 22-universal hash family \mathcal{H} from [N][N] to [M][M], and a block KK-source 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) such that 𝐘=(H(X1),,H(XT))\mathbf{Y}=(H(X_{1}),\dots,H(X_{T})) is ε\varepsilon-far from any distribution 𝐙\mathbf{Z} with cp(𝐙)α/MT\mathrm{cp}(\mathbf{Z})\leq\alpha/M^{T}.

Basically, the idea is to show that the Markov Inequality applied in the proof of Theorem 3.1 (see Inequality (1)) is tight for a single block. More precisely, we show that there exist a 22-universal hash family \mathcal{H} and a KK-source XX such that with probability ε\varepsilon over hHh\leftarrow H, cp(h(X))1/M+Ω(1/(Kε))\mathrm{cp}(h(X))\geq 1/M+\Omega(1/(K\varepsilon)). Intuitively, if we take T=Θ(Kεlog(α/ε)/M)T=\Theta(K\varepsilon\cdot\log(\alpha/\varepsilon)/M) independent copies of such XX, then the collision probability will satisfy cp(h(X1),,h(XT))(1+Ω(M/(Kε)))T/MTα/(εMT)\mathrm{cp}(h(X_{1}),\dots,h(X_{T}))\geq(1+\Omega(M/(K\varepsilon)))^{T}/M^{T}\geq\alpha/(\varepsilon M^{T}), and so the overall collision probability is cp(H,𝐘)α/(||MT)\mathrm{cp}(H,\mathbf{Y})\geq\alpha/(|\mathcal{H}|\cdot M^{T}). Formally, we analyze our construction below using Hellinger distance, and show that the collision probability remains high even after modifying a Θ(ε)\Theta(\varepsilon)-fraction of the distribution.
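The compounding step in this intuition is just arithmetic; the sketch below (with γ standing in for the Ω(M/(Kε)) excess, and all numbers chosen arbitrarily by us) shows how many blocks are needed before the excess collision probability reaches α/ε.

    from math import ceil, log

    M = 256
    gamma = 1e-3                    # plays the role of Omega(M/(K*eps))
    alpha, eps = 100.0, 0.01
    # Per-block cp of (1 + gamma)/M compounds to ((1 + gamma)/M)^T over T independent blocks.
    T = ceil(log(alpha / eps) / log(1 + gamma))
    print(f"T = {T}")
    print(f"(1 + gamma)^T = {(1 + gamma) ** T:.1f} >= alpha/eps = {alpha / eps:.1f}")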

  • Proof of Theorem 4.12:   Fix a prime power MM and ε>0\varepsilon>0; we identify [M][M] with the finite field 𝔽\mathbb{F} of size MM. Let tt be an integer parameter such that Mt1>1/εM^{t-1}>1/\varepsilon. Recall that the set 0\mathcal{H}_{0} of linear functions {ha:𝔽t𝔽}a𝔽t\{h_{\vec{a}}:\mathbb{F}^{t}\rightarrow\mathbb{F}\}_{\vec{a}\in\mathbb{F}^{t}}, where ha(x)=iaixih_{\vec{a}}(\vec{x})=\sum_{i}a_{i}x_{i}, is 22-universal. Note that picking a random hash function h0h\leftarrow\mathcal{H}_{0} is equivalent to picking a random vector a𝔽t\vec{a}\leftarrow\mathbb{F}^{t}. Two special properties of 0\mathcal{H}_{0} are (i) when a=0\vec{a}=\vec{0}, the whole domain 𝔽t\mathbb{F}^{t} is sent to 0𝔽0\in\mathbb{F}, and (ii) the size of the hash family |0||\mathcal{H}_{0}| is the same as the size of the domain, namely |𝔽t||\mathbb{F}^{t}|. We will use 0\mathcal{H}_{0} as a building block in our construction.

    We proceed to construct the hash family \mathcal{H}. We partition the domain [N][N] into several sub-domains, and apply a different hash function to each sub-domain. Let ss be an integer parameter to be determined later. We require NsMtN\geq s\cdot M^{t}, and partition [N][N] into D0,D1,,DsD_{0},D_{1},\dots,D_{s}, where each of D1,,DsD_{1},\dots,D_{s} has size MtM^{t} and is identified with 𝔽t\mathbb{F}^{t}, and D0D_{0} is the remaining part of [N][N]. In our construction, the data 𝐗\mathbf{X} will never come from D0D_{0}. Thus, without loss of generality, we can assume D0D_{0} is empty. For every i=1,,si=1,\dots,s, we use a linear hash function hai0h_{\vec{a}_{i}}\in\mathcal{H}_{0} to send DiD_{i} to 𝔽\mathbb{F}. Thus, a hash function hh\in\mathcal{H} consists of ss linear hash functions (ha1,,has)(h_{\vec{a}_{1}},\dots,h_{\vec{a}_{s}}), and can be described by ss vectors a1,,as𝔽t\vec{a}_{1},\dots,\vec{a}_{s}\in\mathbb{F}^{t}. Note that to make \mathcal{H} 22-universal, it suffices to pick a1,,as\vec{a}_{1},\dots,\vec{a}_{s} pairwise independently. Specifically, we identify 𝔽t\mathbb{F}^{t} with the finite field 𝔽^\hat{\mathbb{F}} of size MtM^{t}, and pick (a1,,as)(\vec{a}_{1},\dots,\vec{a}_{s}) by picking a,b𝔽^a,b\in\hat{\mathbb{F}} uniformly at random and outputting (a+α1b,a+α2b,,a+αsb)(a+\alpha_{1}\cdot b,a+\alpha_{2}\cdot b,\dots,a+\alpha_{s}\cdot b) for some distinct constants α1,,αs𝔽^\alpha_{1},\dots,\alpha_{s}\in\hat{\mathbb{F}}. Formally, we define the hash family to be

    ={ha,b:[N][M]}a,b𝔽^, where ha,b=(ha+α1b,,ha+αsb)=def(h1a,b,,hsa,b).\mathcal{H}=\{h^{a,b}:[N]\rightarrow[M]\}_{a,b\in\hat{\mathbb{F}}}\mbox{, where }h^{a,b}=(h_{a+\alpha_{1}b},\dots,h_{a+\alpha_{s}b})\mathbin{\stackrel{{\scriptstyle\rm def}}{{=}}}(h^{a,b}_{1},\dots,h^{a,b}_{s}).

    It is easy to verify that \mathcal{H} is indeed 22-universal, and ||=M2t|\mathcal{H}|=M^{2t}.

    We next define a single block KK-source XX that makes the Markov Inequality (1) tight. We simply take XX to be a uniform distribution over D1DsD_{1}\cup\cdots\cup D_{s}, and so K=sMtK=s\cdot M^{t}. Consider a hash function ha,bh^{a,b}\in\mathcal{H}. If all hia,bh^{a,b}_{i} are non-zero and distinct, then ha,b(X)h^{a,b}(X) is the uniform distribution. If exactly one hia,b=0h^{a,b}_{i}=0, then ha,bh^{a,b} sends Mt+(s1)Mt1M^{t}+(s-1)M^{t-1} elements in [N][N] to 0, and (s1)Mt1(s-1)M^{t-1} elements to each nonzero y𝔽y\in\mathbb{F}. Let us call such ha,bh^{a,b} bad hash functions. Thus, if ha,bh^{a,b} is bad, then

    cp(ha,b(X))\displaystyle\mathrm{cp}(h^{a,b}(X)) =\displaystyle= (Mt+(s1)Mt1K)2+(M1)((s1)Mt1K)2\displaystyle\left(\frac{M^{t}+(s-1)M^{t-1}}{K}\right)^{2}+(M-1)\cdot\left(\frac{(s-1)M^{t-1}}{K}\right)^{2}
    =\displaystyle= 1M+M1s2M1M+12s2.\displaystyle\frac{1}{M}+\frac{M-1}{s^{2}M}\geq\frac{1}{M}+\frac{1}{2s^{2}}.

    Note that ha,bh^{a,b} is bad with probability

    Pr[exactly one hia,b=0]=Pr[b0i(a+αib=0)]=(11Mt)sMts2Mt.\Pr[\mbox{exactly one $h^{a,b}_{i}=0$}]=\Pr[b\neq 0\wedge\exists i\quad(a+\alpha_{i}b=0)]=\left(1-\frac{1}{M^{t}}\right)\cdot\frac{s}{M^{t}}\geq\frac{s}{2M^{t}}.

    We set s=4εMtMts=\lceil 4\varepsilon M^{t}\rceil\leq M^{t}. It follows that with probability at least 2ε2\varepsilon over hh\leftarrow\mathcal{H}, the collision probability satisfies cp(h(X))1/M+1/(4Kε)\mathrm{cp}(h(X))\geq 1/M+1/(4K\varepsilon), as we intuitively desired. However, instead of working with collision probability directly, we need to use Hellinger closeness to measure the growth of distance to uniform (see Definition 3.9.) The following claim upper bounds the Hellinger closeness of h(X)h(X) for bad hash functions hh. The proof of the claim is deferred to the end of this section.

    Claim 4.14

    Suppose hh is a bad hash function defined as above, then the Hellinger closeness of h(X)h(X) satisfies C(h(X))1M/(64Kε)C(h(X))\leq 1-M/(64K\varepsilon).

Finally, for every integer T[ε2M2t1log(α/ε),c0ε2M2t1log(α/ε)]T\in[\varepsilon^{2}M^{2t-1}\log(\alpha/\varepsilon),c_{0}\cdot\varepsilon^{2}M^{2t-1}\log(\alpha/\varepsilon)], we can write T=c(64Kε/M)ln(800α/ε)T=c\cdot(64K\varepsilon/M)\cdot\ln(800\alpha/\varepsilon) for some constant c<c0c<c_{0}. Let 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) be TT independent copies of XX. We now show that K,,𝐗K,\mathcal{H},\mathbf{X} satisfy the conclusion of the theorem. That is, K=Ω(MT/(εlog(α/ε)))K=\Omega(MT/(\varepsilon\log(\alpha/\varepsilon))) (as follows from the definition of TT) and (H,𝐘)=(H,H(X1),,H(XT))(H,\mathbf{Y})=(H,H(X_{1}),\dots,H(X_{T})) is ε\varepsilon-far from any distribution (H,𝐙)(H^{\prime},\mathbf{Z}) with cp(H,𝐙)α/(||MT)\mathrm{cp}(H^{\prime},\mathbf{Z})\leq\alpha/(|\mathcal{H}|\cdot M^{T}).

Consider the distribution (h(X1),,h(XT))(h(X_{1}),\dots,h(X_{T})) for a bad hash function hh\in\mathcal{H}. From the above claim, the Hellinger closeness satisfies

C(h(X1),,h(XT))=C(h(X))T(1M/(64Kε))TeMT/(64Kε)ε800α.C(h(X_{1}),\dots,h(X_{T}))=C(h(X))^{T}\leq(1-M/(64K\varepsilon))^{T}\leq e^{-MT/(64K\varepsilon)}\leq\frac{\varepsilon}{800\alpha}.

By Lemma 3.10 and the definition of Hellinger closeness, we have

Δ((h(X1),,h(XT)),U[M]T)1C(h(X1),,h(XT))1ε800α.\Delta((h(X_{1}),\dots,h(X_{T})),U_{[M]^{T}})\geq 1-C(h(X_{1}),\dots,h(X_{T}))\geq 1-\frac{\varepsilon}{800\alpha}.

By Lemma 4.8, (h(X1),,h(XT))(h(X_{1}),\dots,h(X_{T})) is (0.1,2α/ε)(0.1,2\alpha/\varepsilon)-nonuniform for every bad hash function hh, and hence for at least a 2ε2\varepsilon-fraction of hh\in\mathcal{H}. By the second statement of Lemma 4.9, this implies that (H,𝐘)(H,\mathbf{Y}) is ε\varepsilon-far from any distribution (H,𝐙)(H^{\prime},\mathbf{Z}) with cp(H,𝐙)α/(||MT)\mathrm{cp}(H^{\prime},\mathbf{Z})\leq\alpha/(|\mathcal{H}|\cdot M^{T}).

In sum, given M,ε,α,tM,\varepsilon,\alpha,t that satisfy the premise of the theorem, we set K=4εMtMtK=\lceil 4\varepsilon M^{t}\rceil\cdot M^{t}, and we have proved that for every NKN\geq K and T=Θ((Kε/M)ln(α/ε))T=\Theta((K\varepsilon/M)\cdot\ln(\alpha/\varepsilon)), the conclusion of the theorem is true. It remains to prove Claim 4.14 (a direct numerical check of the bad-hash distribution is sketched at the end of this subsection).

Proof of claim:   Let p(x)=MPr[h(X)=x]p(x)=M\cdot\Pr[h(X)=x] for every x𝔽x\in\mathbb{F}. For a bad hash function hh, we have p(0)=(1+(M1)/s)p(0)=(1+(M-1)/s), and p(x)=(11/s)p(x)=(1-1/s) for every x0x\neq 0. We will upper bound C(h(X))=(1/M)xp(x)C(h(X))=(1/M)\cdot\sum_{x}\sqrt{p(x)} using Taylor series. Recall that for z0z\geq 0, there exists some z,z′′[0,z]z^{\prime},z^{\prime\prime}\in[0,z] such that

1+z=1+z2+z22(14(1+z)3/2)1+z2z28(1+z)3/2 , and\sqrt{1+z}=1+\frac{z}{2}+\frac{z^{2}}{2}\cdot\left(-\frac{1}{4(1+z^{\prime})^{3/2}}\right)\leq 1+\frac{z}{2}-\frac{z^{2}}{8(1+z)^{3/2}}\mbox{ , and}
1z=1z121z′′1z2.\sqrt{1-z}=1-z\cdot\frac{1}{2\sqrt{1-z^{\prime\prime}}}\leq 1-\frac{z}{2}.

We have

C(h(X))\displaystyle C(h(X)) =\displaystyle= 1Mxp(x)\displaystyle\frac{1}{M}\sum_{x}\sqrt{p(x)}
\displaystyle\leq 1M(1+M12s(M1)28s2(1+(M1)/s)3/2+(M1)(112s))\displaystyle\frac{1}{M}\left(1+\frac{M-1}{2s}-\frac{(M-1)^{2}}{8s^{2}\cdot(1+(M-1)/s)^{3/2}}+(M-1)\cdot\left(1-\frac{1}{2s}\right)\right)
=\displaystyle= 1(M1)28Ms2(1+(M1)/s)3/2\displaystyle 1-\frac{(M-1)^{2}}{8Ms^{2}(1+(M-1)/s)^{3/2}}

Recalling that M2M\geq 2, s=εMtMs=\varepsilon M^{t}\geq M, and s2=Kεs^{2}=K\varepsilon, we have

C(h(X))1M264Kε.C(h(X))\leq 1-\frac{M^{2}}{64K\varepsilon}.

\Box

 

Recall that ||=M2t|\mathcal{H}|=M^{2t}. Theorem 4.13 follows from Theorem 4.12 by exactly the same argument as in the proof of Theorem 4.10.
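For concreteness, the following sketch (with small illustrative values of M and s chosen by us) recomputes the output distribution of a bad hash function directly and checks the collision probability formula and the Hellinger closeness used above.

    import math

    M, s = 8, 32
    # A bad hash function: the zeroed block sends all of its mass to 0, and each of the
    # other s-1 blocks is uniform, so p(0) = 1/s + (s-1)/(sM) and p(y) = (s-1)/(sM) otherwise.
    p = [1 / s + (s - 1) / (s * M) if y == 0 else (s - 1) / (s * M) for y in range(M)]
    assert abs(sum(p) - 1) < 1e-12

    cp = sum(q * q for q in p)
    print(f"cp(h(X))              = {cp:.6f}")
    print(f"1/M + (M-1)/(s^2 * M) = {1 / M + (M - 1) / (s * s * M):.6f}")

    C = sum(math.sqrt(q / M) for q in p)          # Hellinger closeness of h(X) to uniform
    print(f"C(h(X)) = {C:.8f}   (deficiency 1 - C = {1 - C:.2e})")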

Acknowledgments

We thank Wei-Chun Kao for helpful discussions in the early stages of this work, David Zuckerman for telling us about Hellinger distance, and Michael Mitzenmacher for suggesting parameter settings useful in practice.

References

  • [BBR85] Charles H. Bennett, Gilles Brassard, and Jean-Marc Robert. How to reduce your enemy’s information (extended abstract). In Hugh C. Williams, editor, Advances in Cryptology—CRYPTO ’85, volume 218 of Lecture Notes in Computer Science, pages 468–476. Springer-Verlag, 1986, 18–22 August 1985.
  • [BM03] Andrei Z. Broder and Michael Mitzenmacher. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4), 2003.
  • [CG88] Benny Chor and Oded Goldreich. Unbiased bits from sources of weak randomness and probabilistic communication complexity. SIAM J. Comput., 17(2):230–261, 1988.
  • [CV08] Kai-Min Chung and Salil Vadhan. Tight bounds for hashing block sources. In APPROX-RANDOM, 2008.
  • [Dur04] Richard Durrett. Probability: Theory and Examples. Third Edition. Duxbury, 2004.
  • [GS02] Alison L. Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International Statistical Review, 70:419, 2002.
  • [ILL89] Russell Impagliazzo, Leonid A. Levin, and Michael Luby. Pseudo-random generation from one-way functions (extended abstracts). In Proceedings of the Twenty First Annual ACM Symposium on Theory of Computing, pages 12–24, Seattle, Washington, 15–17 May 1989.
  • [Knu98] Donald E. Knuth. The art of computer programming, Volume 3: Sorting and Searching. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1998.
  • [MV08] Michael Mitzenmacher and Salil Vadhan. Why simple hash functions work: Exploiting the entropy in a data stream. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ‘08), pages 746–755, 20–22 January 2008.
  • [Mut03] S. Muthukrishnan. Data streams: algorithms and applications. In SODA, pages 413–413, 2003.
  • [NT99] Noam Nisan and Amnon Ta-Shma. Extracting randomness: A survey and new constructions. J. Comput. Syst. Sci., 58(1):148–173, 1999.
  • [NZ96] Noam Nisan and David Zuckerman. Randomness is linear in space. Journal of Computer and System Sciences, 52(1):43–52, February 1996.
  • [RT00] Jaikumar Radhakrishnan and Amnon Ta-Shma. Bounds for dispersers, extractors, and depth-two superconcentrators. SIAM Journal on Discrete Mathematics, 13(1):2–24 (electronic), 2000.
  • [Rey04] Leonid Reyzin. A note on the statistical difference of small direct products. Technical Report BUCS-TR-2004-032, Boston University Computer Science Department, 2004.
  • [SV99] Amit Sahai and Salil Vadhan. Manipulating statistical difference. In Panos Pardalos, Sanguthevar Rajasekaran, and José Rolim, editors, Randomization Methods in Algorithm Design (DIMACS Workshop, December 1997), volume 43 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 251–270. American Mathematical Society, 1999.
  • [Sha04] Ronen Shaltiel. Recent developments in extractors. In G. Paun, G. Rozenberg, and A. Salomaa, editors, Current Trends in Theoretical Computer Science, volume 1: Algorithms and Complexity. World Scientific, 2004.
  • [Vad07] Salil Vadhan. The unified theory of pseudorandomness. SIGACT News, 38(3), September 2007.
  • [Zuc96] David Zuckerman. Simulating BPP using a general weak random source. Algorithmica, 16(4/5):367–391, October/November 1996.

Appendix A Technical Lemma on Binomial Coefficients

Lemma A.1 (Claim 4.3, restated)

Let N,K>1N,K>1 be integers such that N>2KN>2K, and L[0,K/2]L\in[0,K/2], β(0,min{1,L})\beta\in(0,\min\{1,\sqrt{L}\}) real numbers. Let S[N]S\subset[N] be a random subset of size KK, and T[N]T\subset[N] be a fixed subset of [N][N] of arbitrary size. We have

PrS[||ST|L|βL]O(β).\Pr_{S}\left[\left||S\cap T|-L\right|\leq\beta\sqrt{L}\right]\leq O(\beta).

Proof. By an abuse of notation, we use TT to denote the size of set TT. The probability can be expressed as a sum of binomial coefficients as follows.

PrS[||ST|L|βL]=R=LβLL+βL(TR)(NTKR)(NK).\Pr_{S}\left[\left||S\cap T|-L\right|\leq\beta\sqrt{L}\right]=\sum_{R=\left\lceil L-\beta\sqrt{L}\right\rceil}^{\left\lfloor L+\beta\sqrt{L}\right\rfloor}\frac{{T\choose R}{N-T\choose K-R}}{{N\choose K}}.

Since there are at most 2βL+1\lfloor 2\beta\sqrt{L}\rfloor+1 terms, it suffices to show that for every R[LβL,L+βL]R\in\left[L-\beta\sqrt{L},L+\beta\sqrt{L}\right],

f(T)=def(TR)(NTKR)(NK)O(1L).f(T)\mathbin{\stackrel{{\scriptstyle\rm def}}{{=}}}\frac{{T\choose R}{N-T\choose K-R}}{{N\choose K}}\leq O\left(\sqrt{\frac{1}{L}}\right).

We use the following bound on binomial coefficients, which can be derived from Stirling’s formula.

Claim A.2

For integers 0<i<a0<i<a, 0<j<b0<j<b, we have

(ai)(bj)(a+bi+j)O(ab(i+j)(a+bij)i(ai)j(bj)(a+b)).\frac{{a\choose i}{b\choose j}}{{a+b\choose i+j}}\leq O\left(\sqrt{\frac{a\cdot b\cdot(i+j)\cdot(a+b-i-j)}{i\cdot(a-i)\cdot j\cdot(b-j)\cdot(a+b)}}\right).

Note that L[0,K/2]L\in[0,K/2] implies KR=Ω(K)K-R=\Omega(K). When 2RTN2K+2R2R\leq T\leq N-2K+2R, we have

f(T)=(TR)(NTKR)(NK)\displaystyle f(T)=\frac{{T\choose R}{N-T\choose K-R}}{{N\choose K}}
=\displaystyle= O(T(NT)K(NK)R(TR)(KR)(NTK+R)N)\displaystyle O\left(\sqrt{\frac{T(N-T)K(N-K)}{R(T-R)(K-R)(N-T-K+R)N}}\right)
=\displaystyle= O(1RKKRNKNT(NT)(TR)(NTK+R))\displaystyle O\left(\sqrt{\frac{1}{R}\cdot\frac{K}{K-R}\cdot\frac{N-K}{N}\cdot\frac{T(N-T)}{(T-R)(N-T-K+R)}}\right)
=\displaystyle= O(1R)=O(1L),\displaystyle O\left(\sqrt{\frac{1}{R}}\right)=O\left(\sqrt{\frac{1}{L}}\right),

as desired. Note that when N>2KN>2K, such TT exists. Finally, observe that β2<L\beta^{2}<L implies R1R\geq 1, and

f(T)f(T+1)=(TR+1)(NT)(T+1)(NTK+R).\frac{f(T)}{f(T+1)}=\frac{(T-R+1)(N-T)}{(T+1)(N-T-K+R)}.

It follows that f(T)f(T) is increasing when T2RT\leq 2R, and f(T)f(T) is decreasing when TN2K+2RT\geq N-2K+2R. Therefore, f(T)f(2R)=O(1/L)f(T)\leq f(2R)=O(\sqrt{1/L}) for T2RT\leq 2R, and f(T)f(N2K+2R)=O(1/L)f(T)\leq f(N-2K+2R)=O(\sqrt{1/L}) for TN2K+2RT\geq N-2K+2R, which completes the proof.    
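The claim can also be checked numerically for specific parameters; the sketch below (with illustrative values of N, K, L, and β chosen by us, and with |T| = NL/K so that the expected intersection size is L) evaluates the hypergeometric probability exactly.

    from math import comb, sqrt, ceil, floor

    def window_probability(N, K, T_size, L, beta):
        # Exact probability, over a uniform K-subset S of [N], that | |S ∩ T| - L | <= beta*sqrt(L),
        # for a fixed set T of size T_size (a sum of hypergeometric terms).
        lo = max(ceil(L - beta * sqrt(L)), 0)
        hi = min(floor(L + beta * sqrt(L)), K, T_size)
        return sum(comb(T_size, r) * comb(N - T_size, K - r) for r in range(lo, hi + 1)) / comb(N, K)

    N, K, L = 4000, 400, 100
    T_size = N * L // K
    for beta in (0.25, 0.5, 1.0):
        print(f"beta = {beta:4.2f}   Pr = {window_probability(N, K, T_size, L, beta):.3f}")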

Appendix B Proof of Lemma 4.4

Lemma B.1 (Lemma 4.4, restated)

Let XX and YY be random variables over [M][M] such that Δ(X,Y)ε\Delta(X,Y)\geq\varepsilon. Let 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) be TT i.i.d. copies of XX, and let 𝐘=(Y1,,YT)\mathbf{Y}=(Y_{1},\dots,Y_{T}) be TT i.i.d. copies of YY. We have

Δ(𝐗,𝐘)min{ε0,cTε},\Delta(\mathbf{X},\mathbf{Y})\geq\min\{\varepsilon_{0},c\sqrt{T}\cdot\varepsilon\},

where ε0,c\varepsilon_{0},c are absolute constants.

Proof. We prove the lemma by the following two claims. The first claim reduces the lemma to the special case that XX is a Bernoulli random variable with bias Ω(ε)\Omega(\varepsilon), and YY is a uniform coin. The second claim proves the special case.

Claim B.2

Let XX and YY be random variables over [M][M] such that Δ(X,Y)=ε\Delta(X,Y)=\varepsilon. Then there exists a randomized function f:[M]{0,1}f:[M]\rightarrow\{0,1\} such that f(Y)=U{0,1}f(Y)=U_{\{0,1\}}, and Δ(f(X),f(Y))ε/2\Delta(f(X),f(Y))\geq\varepsilon/2.

Proof of claim:   By the definition, there exists a set T[M]T\subset[M] such that

|Pr[XT]Pr[YT]|=ε.\left|\Pr[X\in T]-\Pr[Y\in T]\right|=\varepsilon.

Without loss of generality, we can assume that Pr[YT]=p1/2\Pr[Y\in T]=p\leq 1/2 (otherwise we can take the complement of TT). Let g:[M]{0,1}g:[M]\rightarrow\{0,1\} be the indicator function of TT, so we have PrY[g(Y)=1]=p\Pr_{Y}[g(Y)=1]=p. For every x[M]x\in[M], we define f(x)=g(x)bf(x)=g(x)\vee b, where bb is an independent biased coin with Pr[b=0]=1/(2(1p))\Pr[b=0]=1/(2(1-p)). The claim follows by observing that

Pr[f(Y)=0]=Pr[g(Y)=0b=0]=(1p)1/(2(1p))=1/2,\Pr[f(Y)=0]=\Pr[g(Y)=0\wedge b=0]=(1-p)\cdot 1/(2(1-p))=1/2,

and

Δ(f(X),f(Y))Δ(X,Y)Pr[b=0]ε/2.\Delta(f(X),f(Y))\geq\Delta(X,Y)\cdot\Pr[b=0]\geq\varepsilon/2.

\Box

Claim B.3

Let XX be a Bernoulli random variable over {0,1}\{0,1\} such that Pr[X=0]=1/2ε\Pr[X=0]=1/2-\varepsilon. Let 𝐗=(X1,,XT)\mathbf{X}=(X_{1},\dots,X_{T}) be TT independent copies of XX. Then

Δ(𝐗,U{0,1}T)min{ε0,cTε},\Delta(\mathbf{X},U_{\{0,1\}^{T}})\geq\min\{\varepsilon_{0},c\sqrt{T}\varepsilon\},

where ε0,c\varepsilon_{0},c are absolute constants independent of ε\varepsilon and TT.

Proof of claim:   For 𝐱{0,1}T\mathbf{x}\in\{0,1\}^{T}, let the weight wt(𝐱)\mathrm{wt}(\mathbf{x}) of 𝐱\mathbf{x} be the number of 11’s in 𝐱\mathbf{x}. Let

S={x{0,1}T:wt(x)T2T}S=\left\{x\in\{0,1\}^{T}:\mathrm{wt}(x)\leq\frac{T}{2}-\sqrt{T}\right\}

be the subset of {0,1}T\{0,1\}^{T} with small weight (This choice of SS is the main source of improvement in our proof compared to that of Reyzin [Rey04], who instead considers the set of all xx with weight at most T/2T/2.) For every 𝐱S\mathbf{x}\in S, we have

Pr[𝐗=𝐱]12T(1ε)T/2+T(1+ε)T/2T(1min{Tε2,12})Pr[U{0,1}T=𝐱].\Pr[\mathbf{X}=\mathbf{x}]\leq\frac{1}{2^{T}}\cdot(1-\varepsilon)^{T/2+\sqrt{T}}\cdot(1+\varepsilon)^{T/2-\sqrt{T}}\leq\left(1-\min\left\{\frac{\sqrt{T}\cdot\varepsilon}{2},\frac{1}{2}\right\}\right)\cdot\Pr[U_{\{0,1\}^{T}}=\mathbf{x}].

By standard results on large deviation, we have

Pr[U{0,1}TS]Ω(1).\Pr[U_{\{0,1\}^{T}}\in S]\geq\Omega(1).

Combining the above two inequalities, we get

Δ(𝐗,U{0,1}T)\displaystyle\Delta(\mathbf{X},U_{\{0,1\}^{T}}) \displaystyle\geq Pr[U{0,1}TS]Pr[𝐗S]\displaystyle\Pr[U_{\{0,1\}^{T}}\in S]-\Pr[\mathbf{X}\in S]
\displaystyle\geq (1(1min{Tε2,12}))Pr[U{0,1}TS]\displaystyle\left(1-\left(1-\min\left\{\frac{\sqrt{T}\cdot\varepsilon}{2},\frac{1}{2}\right\}\right)\right)\cdot\Pr[U_{\{0,1\}^{T}}\in S]
\displaystyle\geq min{Tε2,12}Ω(1)\displaystyle\min\left\{\frac{\sqrt{T}\cdot\varepsilon}{2},\frac{1}{2}\right\}\cdot\Omega(1)
=\displaystyle= min{cTε,ε0}\displaystyle\min\{c\sqrt{T}\varepsilon,\varepsilon_{0}\}

for some absolute constants c,ε0c,\varepsilon_{0}, which completes the proof.   \Box

Note that applying the same randomized function ff to two random variables XX and YY cannot increase the statistical distance; that is, Δ(f(X),f(Y))Δ(X,Y)\Delta(f(X),f(Y))\leq\Delta(X,Y). The lemma follows immediately from the above two claims:

Δ(𝐗,𝐘)\displaystyle\Delta(\mathbf{X},\mathbf{Y}) \displaystyle\geq Δ(((f1(X1),,fT(XT)),((f1(Y1),,fT(YT))\displaystyle\Delta(((f_{1}(X_{1}),\dots,f_{T}(X_{T})),((f_{1}(Y_{1}),\dots,f_{T}(Y_{T}))
\displaystyle\geq min{ε0,cTε}\displaystyle\min\{\varepsilon_{0},c\sqrt{T}\varepsilon\}

where f1,,fTf_{1},\dots,f_{T} are independent copies of the randomized function ff defined in Claim B.2, and ε0,c\varepsilon_{0},c are the absolute constants from Claim B.3 (the factor of 1/21/2 lost in Claim B.2 is absorbed into the constant cc).