
Xiao Hu, University of Waterloo, Canada, xiaohu@uwaterloo.ca, https://orcid.org/0000-0002-7890-665X
Zhiang Wu, University of Waterloo, Canada, zhiang.wu@uwaterloo.ca, https://orcid.org/0009-0004-8647-1416
28th International Conference on Database Theory (ICDT 2025)

Optimal Oblivious Algorithms for Multi-way Joins

Xiao Hu    Zhiang Wu
Abstract

In cloud databases, cloud computation over sensitive data uploaded by clients inevitably causes concern about data security and privacy. Even when encryption primitives and trusted computing environments are integrated into query processing to safeguard the actual contents of the data, access patterns of algorithms can still leak private information about the data. Oblivious RAM (ORAM) and circuits are two generic approaches to address this issue, ensuring that access patterns of algorithms remain oblivious to the data. However, deploying these methods on insecure algorithms, particularly for multi-way join processing, is computationally expensive and inherently challenging.

In this paper, we propose a novel sorting-based algorithm for multi-way join processing that operates without relying on ORAM simulations or other security assumptions. Our algorithm is a non-trivial, provably oblivious composition of basic primitives, with time complexity matching the insecure worst-case optimal join algorithm, up to a logarithmic factor. Furthermore, it is cache-agnostic, with cache complexity matching the insecure lower bound, also up to a logarithmic factor. This clean and straightforward approach has the potential to be extended to other security settings and implemented in practical database systems.

keywords:
oblivious algorithms, multi-way joins, worst-case optimality

1 Introduction

In outsourced query processing, a client entrusts sensitive data to a cloud service provider, such as Amazon, Google, or Microsoft, and subsequently issues queries to the provider. The service provider performs the required computations and returns the results to the client. Since these computations are carried out on remote infrastructure, ensuring the security and privacy of query evaluation is a critical requirement. Specifically, servers must remain oblivious to any information about the underlying data throughout the computation process. To achieve this, advanced cryptographic techniques and trusted computing hardware are employed to prevent servers from inferring the actual contents of the data [34, 19]. However, the memory accesses during execution may still lead to information leakage, posing an additional challenge to achieving comprehensive privacy. For example, consider the basic (natural) join operator on two database instances: $R_1=\{(a_i,b_i):i\in[N]\}\Join S_1=\{(b_i,c_i):i\in[N]\}$ and $R_2=\{(a_i,b_1):i\in[N]\}\Join S_2=\{(b_1,c_i):i\in[N]\}$ for some $N\in\mathbb{Z}^+$, where a pair of tuples can be joined if and only if they have the same $b$-value. Suppose each relation is sorted by its $b$-values. Using the merge join algorithm, there is only one access to $S_1$ between two consecutive accesses to $R_1$, but there are $N$ accesses to $S_2$ between two consecutive accesses to $R_2$. Hence, the server can distinguish the degree information of the join keys by observing the sequence of memory accesses. Moreover, if the server counts the total number of memory accesses, it can further infer the number of join results of the input data.
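
To make this leakage concrete, the following toy Python sketch (our own illustration, not part of the formal model) simulates a textbook sort-merge join on the two instances above and records which relation each access touches. Both instances have the same size, yet the traces differ visibly: on $R_1\Join S_1$ the accesses to $S_1$ come in short runs, whereas on $R_2\Join S_2$ every access to $R_2$ is followed by $N$ consecutive accesses to $S_2$.

def merge_join_trace(R, S):
    """Sort-merge join on the shared b-attribute.  Returns the sequence of
    relation accesses ('R' or 'S') that an adversary observing untrusted
    memory could record, plus the join results."""
    trace, out, j = [], [], 0
    for a, b in R:                              # R is sorted by its b-values
        trace.append('R')
        while j < len(S) and S[j][0] < b:       # skip S-tuples with smaller keys
            trace.append('S'); j += 1
        k = j
        while k < len(S) and S[k][0] == b:      # scan the run of matching S-tuples
            trace.append('S'); out.append((a, b, S[k][1])); k += 1
    return trace, out

N = 4
R1 = [(f"a{i}", i) for i in range(N)]; S1 = [(i, f"c{i}") for i in range(N)]
R2 = [(f"a{i}", 0) for i in range(N)]; S2 = [(0, f"c{i}") for i in range(N)]
print(merge_join_trace(R1, S1)[0])   # S-accesses come in runs of length at most 2
print(merge_join_trace(R2, S2)[0])   # every R-access is followed by N S-accesses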

The notion of obliviousness was proposed to formally capture such a privacy guarantee on the memory access pattern of algorithms [31, 30]. This concept has inspired a substantial body of research focused on developing algorithms that achieve obliviousness in practical database systems [55, 24, 20, 17]. A generic approach to achieving obliviousness is Oblivious RAM (ORAM) [31, 41, 29, 52, 23, 48, 7], which translates each logical access into a poly-logarithmic (in terms of the data size) number of physical accesses to random locations of the memory, but this poly-logarithmic additional cost per memory access is very expensive in practice [15]. Another generic approach leverages circuits [53, 26]. Despite their theoretical promise, generating circuits is inherently complex and resource-intensive, and integrating such constructions into database systems often proves to be inefficient. These challenges highlight the advantages of designing algorithms that are inherently oblivious to the input data, eliminating the need for ORAM frameworks or circuit constructions.

In this paper, we take on this question for multi-way join processing and examine the insecure worst-case optimal join (WCOJ) algorithm [43, 44, 50], which can compute any join query in time proportional to the maximum number of join results. Our objective is to investigate the intrinsic properties of the WCOJ algorithm and transform it into an oblivious version while preserving its optimal complexity guarantee.

1.1 Problem Definition

Multi-way join. A (natural) join query can be represented as a hypergraph $\mathcal{Q}=(\mathcal{V},\mathcal{E})$ [1], where $\mathcal{V}$ models the set of attributes and $\mathcal{E}\subseteq 2^{\mathcal{V}}$ models the set of relations. Let $\mathrm{dom}(x)$ be the domain of attribute $x\in\mathcal{V}$. An instance of $\mathcal{Q}$ is a function $\mathcal{R}$ that maps each $e\in\mathcal{E}$ to a set of tuples $R_e$, where each tuple $t\in R_e$ specifies a value in $\mathrm{dom}(x)$ for each attribute $x\in e$. The result of a join query $\mathcal{Q}$ over an instance $\mathcal{R}$, denoted by $\mathcal{Q}(\mathcal{R})$, is the set of all combinations of tuples, one from each relation $R_e$, that share the same values in their common attributes, i.e.,

$$\mathcal{Q}(\mathcal{R})=\left\{t\in\prod_{x\in\mathcal{V}}\mathrm{dom}(x)\mid\forall e\in\mathcal{E},\ \exists t_e\in R_e,\ \pi_e t=t_e\right\}.$$

Let $N=\sum_{e\in\mathcal{E}}|R_e|$ be the input size of instance $\mathcal{R}$, i.e., the total number of tuples over all relations. Let $\mathrm{OUT}=|\mathcal{Q}(\mathcal{R})|$ be the output size of the join query $\mathcal{Q}$ over instance $\mathcal{R}$. We study the data complexity [1] of join algorithms by measuring their running time in terms of the input and output size of the instance. We consider the size of $\mathcal{Q}$, i.e., $|\mathcal{V}|$ and $|\mathcal{E}|$, as constant.

Model of computation.

We consider a two-level hierarchical memory model [40, 18]. The computation is performed within trusted memory, which consists of $M$ registers of the same width. For simplicity, we assume that the trusted memory size is $c\cdot M$, where $c$ is a constant; this assumption changes our results by at most a constant factor. Since we assume the query size is a constant, the arity of each relation is irrelevant. Each tuple is assumed to fit into a single register, with one register allocated per tuple, including those from input relations as well as intermediate results. We further assume that $c\cdot M$ tuples from any set of relations can fit into the trusted memory. Input data and all intermediate results generated during the execution are encrypted and stored in an untrusted memory of unlimited size. Both trusted and untrusted memory are divided into blocks of size $B$. One memory access moves a block of $B$ consecutive tuples from trusted to untrusted memory or vice versa. The complexity of an algorithm is measured by the number of such memory accesses.

An algorithm typically operates by repeating the following three steps: (1) read encrypted data from the untrusted memory into the trusted memory, (2) perform computation inside the trusted memory, and (3) encrypt the necessary data and write it back to the untrusted memory. Adversaries can only observe the addresses of the blocks read from or written to the untrusted memory in (1) and (3), but not the data contents. They also cannot interfere with the execution of the algorithm. The sequence of memory accesses to the untrusted memory during the execution is referred to as the “access pattern” of the algorithm. In this context, we focus on two specific scenarios of interest:

  • Random Access Model (RAM). This model can simulate the classic RAM model with $M=O(1)$ and $B=1$, where the trusted memory corresponds to $O(1)$ registers and the untrusted memory corresponds to the main memory. The time complexity in this model is defined as the number of accesses to the main memory by a RAM algorithm.

  • External Memory Model (EM). This model can naturally simulate the classic EM model [3, 51], where the trusted memory corresponds to the main memory and the untrusted memory corresponds to the disk. Following prior work [28, 21, 18], we focus on cache-agnostic EM algorithms, which are unaware of the values of $M$ (memory size) and $B$ (block size), a property commonly referred to as cache-oblivious in the literature. To avoid ambiguity, we use the term “cache-agnostic” to refer to “cache-oblivious” and “oblivious” to refer to “access-pattern-oblivious” throughout this paper. The advantages of cache-agnostic algorithms have been extensively studied, particularly in multi-level memory hierarchies: a cache-agnostic algorithm can seamlessly adapt to operate efficiently between any two adjacent levels of the hierarchy. We adopt the tall cache assumption, $M=\Omega(B^2)$ and further $M=\Omega(\log^{1+\epsilon}N)$ for an arbitrarily small constant $\epsilon\in(0,1)$, and the wide block assumption, $B=\Omega(\log^{0.55}N)$. (In this work, $\log(\cdot)$ always means $\log_2(\cdot)$ and should be distinguished from $\log_{\frac{M}{B}}(\cdot)$.) These are standard assumptions widely adopted in the literature on EM algorithms [3, 51, 6, 28, 21, 18]. The cache complexity in this model is defined as the number of accesses to the disk by an EM algorithm.

Oblivious Algorithms. The notion of obliviousness is defined based on the access pattern of an algorithm. Memory accesses to the trusted memory are invisible to the adversary and, therefore, have no impact on security. Let $\mathcal{A}$ be an algorithm, $\mathcal{Q}$ a join query, and $\mathcal{R}$ an arbitrary input instance of $\mathcal{Q}$. We denote by $\mathsf{Access}_{\mathcal{A}}(\mathcal{Q},\mathcal{R})$ the sequence of memory accesses made by $\mathcal{A}$ to the untrusted memory when given $(\mathcal{Q},\mathcal{R})$ as the input, where each memory access is a read or write operation associated with a physical address. The join query $\mathcal{Q}$ and the size $N$ of the input instance are considered non-sensitive information and can be safely exposed to the adversary. In contrast, all input tuples are considered sensitive information and should be hidden from adversaries. Thus, the access pattern of an oblivious algorithm $\mathcal{A}$ should depend only on $\mathcal{Q}$ and $N$, ensuring no leakage of sensitive information.

Definition 1.1 (Obliviousness [30, 31, 14]).

An algorithm $\mathcal{A}$ is oblivious for a join query $\mathcal{Q}$ if, given an arbitrary parameter $N\in\mathbb{Z}^+$, for any pair of instances $\mathcal{R},\mathcal{R}'$ of $\mathcal{Q}$ with input size $N$, $\mathsf{Access}_{\mathcal{A}}(\mathcal{Q},\mathcal{R})\overset{\delta}{\equiv}\mathsf{Access}_{\mathcal{A}}(\mathcal{Q},\mathcal{R}')$, where $\delta$ is a negligible function in terms of $N$: for any positive constant $c$, there exists $N_c$ such that $\delta(N)<\frac{1}{N^c}$ for any $N>N_c$. The notation $\overset{\delta}{\equiv}$ indicates that the statistical distance between the two distributions is at most $\delta$.

This notion of obliviousness applies to both deterministic and randomized algorithms. For a randomized algorithm, different execution states may arise from the same input instance due to the algorithm's inherent randomness. Each execution state corresponds to a specific sequence of memory accesses, allowing the access pattern to be modeled as a random variable with an associated probability distribution over the set of all possible access patterns. The statistical distance between two probability distributions is typically quantified using standard metrics, such as the total variation distance. A randomized algorithm is thus oblivious if its access pattern exhibits statistically indistinguishable distributions across all input instances of the same size. More simply, a deterministic algorithm is oblivious if it displays an identical access pattern for all input instances of the same size.

1.2 Review of Existing Results

Oblivious RAM. ORAM is a general randomized framework designed to protect access patterns [31]. In ORAM, each logical access is translated into a poly-logarithmic number of random physical accesses, thereby incurring a poly-logarithmic overhead. Goldreich et al. [31] established a lower bound of $\Omega(\log N)$ on the access overhead of ORAMs in the RAM model. Subsequently, Asharov et al. [7] proposed a theoretically optimal ORAM construction with an overhead of $O(\log N)$ in the RAM model under the assumption of the existence of a one-way function, which is rather impractical [47]. It remains unknown whether a cache complexity better than $O(\log N)$ can be shown for such a construction. Path ORAM [48] is currently the most practical ORAM construction, but it introduces an $O(\log^2 N)$ overhead and requires $\Omega(1)$ trusted memory. In the EM model, one can place the tree data structures for ORAM in a van Emde Boas layout, resulting in a memory access overhead of $O(\log N\cdot\log_B N)$.

Insecure Join Algorithms.

The WCOJ algorithm [43] has been developed to compute any join query in $O(N^{\rho^*})$ time (a hashing-based algorithm achieves $O(N^{\rho^*})$ time in the worst case using the lazy array technique [27]), where $\rho^*$ is the fractional edge cover number of the join query (formally defined in Section 2.1). The optimality is implied by the AGM bound [8]: the maximum number of join results produced by any instance of input size $N$ is $O(N^{\rho^*})$, which is also tight in the sense that there exists some instance of input size $N$ that produces $\Theta(N^{\rho^*})$ join results. However, these WCOJ algorithms are not oblivious. In Section 4, we use the triangle join as an example to illustrate the information leakage from the WCOJ algorithm. Another line of research has explored output-sensitive join algorithms. A join query can be computed in $O((N^{\mathsf{subw}}+\mathrm{OUT})\cdot\mathsf{polylog}\,N)$ time [54, 2], where $\mathsf{subw}$ is the submodular width of the join query. For example, $\mathsf{subw}=1$ if and only if the join query is acyclic [11, 25]. These algorithms are also not oblivious due to various potential information leakages. For instance, the total number of memory accesses is influenced by the output size, which can range from a constant to a polynomially large value relative to the input size. A possible mitigation strategy is worst-case padding, which involves padding dummy accesses to match the worst case. However, this approach does not necessarily result in oblivious algorithms, as their access patterns may still vary significantly across instances with the same input size.

In contrast, there has been significantly less research on multi-way join processing in the EM model. First of all, we note that an EM version of the WCOJ algorithm incurs at least $\Omega\left(\frac{N^{\rho^*}}{B}\right)$ cache complexity, since there are $\Theta(N^{\rho^*})$ join results in the worst case and all join results must be written back to disk. For the basic two-way join, the nested-loop algorithm has cache complexity $O\left(\frac{N^2}{B}\right)$ and the sort-merge algorithm has cache complexity $O\left(\frac{N}{B}\log_{\frac{M}{B}}\frac{N}{B}+\frac{\mathrm{OUT}}{B}\right)$. For multi-way join queries, an EM algorithm with cache complexity $O\left(\frac{N^{\rho^*}}{M^{\rho^*-1}B}\cdot\log_{\frac{M}{B}}\frac{N}{B}+\frac{\mathrm{OUT}}{B}\right)$ has been achieved for Berge-acyclic joins [37], $\alpha$-acyclic joins [36, 39], graph joins [38, 22], and Loomis-Whitney joins [39]. (Some of these algorithms were developed for the Massively Parallel Computation (MPC) model [10] and can be adapted to the EM model through the MPC-to-EM reduction [39].) These results were previously stated without the output-dependent term $\frac{\mathrm{OUT}}{B}$ since they do not consider the cost of writing join results back to disk. Again, even after padding the output size to the worst case, these algorithms remain non-oblivious since their access patterns heavily depend on the input data. Furthermore, even in the insecure setting, no algorithm with cache complexity $O\left(\frac{N^{\rho^*}}{B}\right)$ is known for general join queries.

Oblivious Join Algorithms.

Oblivious algorithms have been studied for join queries in both the RAM and EM models. In the RAM model, the naive nested-loop algorithm can be transformed into an oblivious one by incorporating some dummy writes, as it enumerates all possible combinations of tuples from the input relations in a fixed order. This algorithm runs in $O(N^{|\mathcal{E}|})$ time, where $|\mathcal{E}|$ is the number of relations in the join query. Wang et al. [53] designed circuits for conjunctive queries (capturing all join queries as a special case) whose time complexity matches the AGM bound up to poly-logarithmic factors. Running such a circuit automatically yields an oblivious join algorithm with $O\left(N^{\rho^*}\cdot\mathsf{polylog}\,N\right)$ time complexity. By integrating the insecure WCOJ algorithm [44] with the optimal ORAM [7], it is possible to achieve an oblivious algorithm with $O(N^{\rho^*}\cdot\log N)$ time complexity, albeit under restrictive theoretical assumptions. Alternatively, incorporating the insecure WCOJ algorithm into Path ORAM yields an oblivious join algorithm with $O\left(N^{\rho^*}\cdot\log^2 N\right)$ time complexity.

In the EM model, He et al. [35] proposed a cache-agnostic nested-loop join algorithm for the basic two-way join RSR\Join S with O(|R||S|B)O\left(\frac{|R|\cdot|S|}{B}\right) cache complexity, which is also oblivious. Applying worst-case padding and the optimal ORAM construction to the existing EM join algorithms, we can derive an oblivious join algorithm with O(NρBlogMBNBlogN)O\left(\frac{N^{\rho^{*}}}{B}\cdot\log_{\frac{M}{B}}\frac{N}{B}\cdot\log N\right) cache complexity for specific cases such as acyclic joins, graph joins and Loomis-Whitney joins. However, these algorithms are not cache-agnostic. For general join queries, no specific oblivious algorithm has been proposed for the EM model, aside from results derived from the oblivious RAM join algorithm. These results yield cache complexities of either O(NρlogN)O\left(N^{\rho^{*}}\cdot\log N\right) or O(NρlogNlogBN)O\left(N^{\rho^{*}}\cdot\log N\cdot\log_{B}N\right), as they rely heavily on retrieving tuples from hash tables or range search indices.

RAM model. Previous: $O\left(N^{\rho^{*}}\cdot\log N\right)$ [44, 7] (one-way function assumption). New: $O\left(N^{\rho^{*}}\cdot\log N\right)$ (no assumption).
Cache-agnostic EM model. Previous: $O\left(\frac{N^{\min\{\rho^{*}+1,\rho\}}}{B}\cdot\log_{\frac{M}{B}}\frac{N^{\min\{\rho^{*}+1,\rho\}}}{B}\right)$ (no assumption). New: $O\left(\frac{N^{\rho^{*}}}{B}\cdot\log_{\frac{M}{B}}\frac{N^{\rho^{*}}}{B}\right)$ (tall cache and wide block assumptions).
Table 1: Comparison between previous and new oblivious algorithms for multi-way joins. $N$ is the input size, $\rho^{*}$ and $\rho$ are the input join query's fractional and integral edge cover numbers, respectively, $M$ is the trusted memory size, and $B$ is the block size.

Relaxed Variants of Oblivious Join Algorithms.

Beyond fully oblivious algorithms, researchers have explored relaxed notions of obliviousness by allowing specific types of leakage, such as the join size, the multiplicity of join values, and the size of intermediate results. One relevant line of work examines join processing with released input and output sizes. For example, integrating an insecure output-sensitive join algorithm into an ORAM framework produces a relaxed oblivious algorithm with $O\left((N^{\mathsf{subw}}+\mathrm{OUT})\cdot\mathrm{polylog}\,N\right)$ time complexity. Relaxed oblivious algorithms with $O((N+\mathrm{OUT})\cdot\log N)$ time complexity have been proposed without requiring ORAM [5, 40] for the basic two-way join as well as acyclic joins. Although not fully oblivious, these algorithms serve as fundamental building blocks for developing our oblivious algorithms for general join queries. Another line of work considered differentially oblivious algorithms [14, 12, 18], which require only that access patterns appear similar across neighboring input instances. However, differentially oblivious algorithms have so far been limited to the basic two-way join [18]. This paper does not pursue this direction further.

1.3 Our Contribution

Our main contribution can be summarized as follows (see Table 1):

  • We give a nested-loop-based algorithm for general join queries with $O\left(N^{\min\{\rho^*+1,\rho\}}\cdot\log N\right)$ time complexity and $O\left(\frac{N^{\min\{\rho^*+1,\rho\}}}{B}\cdot\log_{\frac{M}{B}}\frac{N^{\min\{\rho^*+1,\rho\}}}{B}\right)$ cache complexity, where $\rho^*$ and $\rho$ are the fractional and integral edge cover numbers of the join query, respectively (formally defined in Section 2.1). This algorithm is also cache-agnostic. For classes of join queries with $\rho^*=\rho$, such as acyclic joins, even-length cycle joins, and boat joins (see Section 3), this is optimal up to logarithmic factors.

  • We design an oblivious algorithm for general join queries with $O\left(N^{\rho^*}\cdot\log N\right)$ time complexity, which matches the insecure counterpart up to a logarithmic factor and recovers the previous ORAM-based result, which assumes the existence of one-way functions. This algorithm is also cache-agnostic, with $O\left(\frac{N^{\rho^*}}{B}\cdot\log_{\frac{M}{B}}\frac{N^{\rho^*}}{B}\right)$ cache complexity. This cache complexity simplifies to $O\left(\frac{N^{\rho^*}}{B}\cdot\log_{\frac{M}{B}}\frac{N}{B}\right)$ when $B<N^{\frac{c-\rho^*}{c-1}}$ for some sufficiently large constant $c$. This result also establishes the first worst-case near-optimal join algorithm in the insecure EM model when all join results are returned to disk.

  • We develop an improved algorithm for relaxed two-way joins with better cache complexity, which is also cache-agnostic. By integrating our oblivious algorithm with generalized hypertree decompositions [33], we obtain a relaxed oblivious algorithm for general join queries with $O\left((N^{\mathsf{fhtw}}+\mathrm{OUT})\cdot\log N\right)$ time complexity and $O\left(\frac{N^{\mathsf{fhtw}}+\mathrm{OUT}}{B}\cdot\log_{\frac{M}{B}}\frac{N^{\mathsf{fhtw}}+\mathrm{OUT}}{B}\right)$ cache complexity, where $\mathsf{fhtw}$ is the fractional hypertree width of the input query.

Roadmap. This paper is organized as follows. In Section 2, we introduce the preliminaries for building our algorithms. In Section 3, we present our first algorithm, based on the nested-loop algorithm. While effective, this algorithm is not always optimal, as demonstrated by the triangle join. In Section 4, we use the triangle join to demonstrate the leakage of the insecure WCOJ algorithm and show how to transform it into an oblivious algorithm. We introduce our oblivious WCOJ algorithm for general join queries in Section 5, and conclude in Section 6.

2 Preliminaries

2.1 Fractional and Integral Edge Cover Number

For a join query $\mathcal{Q}=(\mathcal{V},\mathcal{E})$, a function $W:\mathcal{E}\to[0,1]$ is a fractional edge cover for $\mathcal{Q}$ if $\sum_{e\in\mathcal{E}:x\in e}W(e)\geq 1$ for any $x\in\mathcal{V}$. An optimal fractional edge cover is one minimizing $\sum_{e\in\mathcal{E}}W(e)$, which is captured by the following linear program:

$$\min\ \sum_{e\in\mathcal{E}}W(e)\quad\textrm{s.t.}\quad\sum_{e\in\mathcal{E}:x\in e}W(e)\geq 1,\ \forall x\in\mathcal{V};\quad W(e)\in[0,1],\ \forall e\in\mathcal{E}\qquad(1)$$

The optimal value of (1) is the fractional edge cover number of $\mathcal{Q}$, denoted as $\rho^*(\mathcal{Q})$. Similarly, a function $W:\mathcal{E}\to\{0,1\}$ is an integral edge cover if $\sum_{e\in\mathcal{E}:x\in e}W(e)\geq 1$ for any $x\in\mathcal{V}$. The optimal integral edge cover is the one minimizing $\sum_{e\in\mathcal{E}}W(e)$, which can be captured by a linear program similar to (1), except that $W(e)\in[0,1]$ is replaced with $W(e)\in\{0,1\}$. The optimal value of this program is the integral edge cover number of $\mathcal{Q}$, denoted as $\rho(\mathcal{Q})$.
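
As a small illustration of these two quantities, the sketch below (our own, assuming SciPy is available) solves LP (1) for $\rho^*$ and enumerates all 0/1 covers for $\rho$; on the triangle query of Section 4 it returns $\rho^*=1.5$ and $\rho=2$.

from itertools import product
from scipy.optimize import linprog

def edge_cover_numbers(attributes, relations):
    """relations: list of attribute sets. Returns (rho*, rho)."""
    m = len(relations)
    # LP (1): minimize sum_e W(e)  s.t.  sum_{e: x in e} W(e) >= 1 for all x.
    c = [1.0] * m
    A_ub = [[-1.0 if x in e else 0.0 for e in relations] for x in attributes]
    b_ub = [-1.0] * len(attributes)
    lp = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * m, method="highs")
    rho_frac = lp.fun
    # Integral edge cover number: brute force over all 0/1 assignments.
    rho_int = min(sum(w) for w in product([0, 1], repeat=m)
                  if all(any(w[i] and x in e for i, e in enumerate(relations))
                         for x in attributes))
    return rho_frac, rho_int

# Triangle query: R1(x2,x3) joined with R2(x1,x3) and R3(x1,x2)
print(edge_cover_numbers(["x1", "x2", "x3"],
                         [{"x2", "x3"}, {"x1", "x3"}, {"x1", "x2"}]))
# expected output: (1.5, 2)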

2.2 Oblivious Primitives

We introduce the following oblivious primitives, which form the foundation of our algorithms. Each primitive displays an identical access pattern across instances of the same input size.

Linear Scan.

Given an input array of $N$ elements, a linear scan of all elements can be done with $O(N)$ time complexity and $O\left(\frac{N}{B}\right)$ cache complexity in a cache-agnostic way.

Sort [4, 9].

Given an input array of $N$ elements, the goal is to output the array sorted according to some predetermined ordering. The classical bitonic sorting network [9] requires $O(N\cdot\log^2 N)$ time. This time complexity was later improved to $O(N\cdot\log N)$ [4] in 1983. However, due to the large constant hidden in the $O(N\cdot\log N)$ bound, the classical bitonic sorting network is more commonly used in practice, particularly when the size $N$ is not too large. Ramachandran and Shi [45] showed a randomized algorithm for sorting with $O(N\cdot\log N)$ time complexity and $O\left(\frac{N}{B}\log_{\frac{M}{B}}\frac{N}{B}\right)$ cache complexity under the tall cache assumption.
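
For intuition on why sorting networks are oblivious, here is a minimal in-memory sketch of the bitonic sorting network [9] (array length assumed to be a power of two): the positions compared at every step depend only on the length of the array, never on its contents, so the induced access pattern is identical across all inputs of the same size.

def bitonic_sort(a):
    """In-place bitonic sorting network. The set of index pairs that are
    compared depends only on len(a), not on the values."""
    n = len(a)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    k = 2
    while k <= n:                         # size of the bitonic blocks
        j = k // 2
        while j >= 1:                     # compare-exchange distance
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    up = (i & k) == 0     # sort this block ascending?
                    if (up and a[i] > a[partner]) or (not up and a[i] < a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([7, 3, 1, 8, 5, 2, 6, 4]))   # [1, 2, 3, 4, 5, 6, 7, 8]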

Compact [32, 46].

Given an input array of $N$ elements, some of which are distinguished as $\perp$, the goal is to output an array with all non-distinguished elements moved to the front before any $\perp$, while preserving the ordering of non-distinguished elements. Lin et al. [42] showed a randomized algorithm for compaction with $O(N\cdot\log\log N)$ time complexity and $O\left(\frac{N}{B}\right)$ cache complexity under the tall cache assumption.

We use the above primitives to construct additional building blocks for our algorithms. To ensure obliviousness, the output of each of these building blocks has a fixed size equal to the worst case, i.e., $N$, comprising both real and dummy elements. All these building blocks achieve $O(N\cdot\log N)$ time complexity and $O\left(\frac{N}{B}\cdot\log_{\frac{M}{B}}\frac{N}{B}\right)$ cache complexity. Further details are provided in Appendix B.

SemiJoin.

Given two input relations RR, SS of at most NN tuples and their common attribute(s) xx, the goal is to output the set of tuples in RR that can be joined with at least one tuple in SS.

Project.

Given an input relation $R$ of $N$ tuples defined over attributes $e$, and a subset of attributes $x\subseteq e$, the goal is to output $\{\pi_x t:t\in R\}$, ensuring no duplication.

Intersect.

Given two input arrays R,SR,S of at most NN elements, the goal is to output RSR\cap S.

Augment.

Given a relation $R$ and $k$ additional relations $S_1,S_2,\cdots,S_k$ (each with at most $N$ tuples) sharing common attribute(s) $x$, the goal is to attach to each tuple $t\in R$ the number of tuples in $S_i$ (for each $i\in[k]$) that can be joined with $t$ on $x$.

We note that any sequential composition of oblivious primitives yields a more complex algorithm that remains oblivious, which is the key principle underlying our approach.
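
As one concrete illustration of this composition principle, the sketch below realizes SemiJoin as a fixed pipeline of sort, linear scan, and compaction. Python's built-in sort and list filtering stand in for the oblivious Sort and Compact primitives; in a genuinely oblivious implementation each stage would be replaced by the corresponding primitive above, while the sequence of stages and the fixed output size would stay the same.

DUMMY = None   # stands for the dummy element

def semi_join(R, S, key):
    """SemiJoin(R, S): keep the tuples of R that join with at least one tuple
    of S on `key`, padded with dummies to the fixed size |R|."""
    # 1. Tag tuples with their origin and sort by (join key, origin),
    #    so every S-tuple precedes the R-tuples with the same key.
    tagged = [(t[key], 0, t) for t in S] + [(t[key], 1, t) for t in R]
    tagged.sort(key=lambda x: (x[0], x[1]))          # stand-in for oblivious Sort
    # 2. Linear scan: propagate a "key seen in S" flag.
    out, seen_key = [], object()
    for k, origin, t in tagged:
        if origin == 0:
            seen_key = k
        else:
            out.append(t if k == seen_key else DUMMY)
    # 3. Compact: move real tuples to the front, keep fixed length |R|.
    real = [t for t in out if t is not DUMMY]        # stand-in for oblivious Compact
    return real + [DUMMY] * (len(R) - len(real))

R = [{"b": 1, "a": "a1"}, {"b": 2, "a": "a2"}, {"b": 3, "a": "a3"}]
S = [{"b": 2, "c": "c1"}, {"b": 2, "c": "c2"}]
print(semi_join(R, S, "b"))   # [{'b': 2, 'a': 'a2'}, None, None]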

2.3 Oblivious Two-way Join

NestedLoop. The nested-loop algorithm computes $R\Join S$ with $O(|R|\cdot|S|)$ time complexity: it iterates over all combinations of tuples from $R$ and $S$ and writes a join result (or a dummy result, if necessary, to maintain obliviousness) for each combination. He et al. [35] proposed a cache-agnostic version in the EM model with $O\left(\frac{|R|\cdot|S|}{B}\right)$ cache complexity, which is also oblivious.

Theorem 2.1 ([35]).

For $R\Join S$, there is a cache-agnostic algorithm that can compute $R\Join S$ with $O\left(|R|\cdot|S|\right)$ time complexity and $O\left(\frac{|R|\cdot|S|}{B}\right)$ cache complexity, whose access pattern only depends on $M$, $B$, $|R|$, and $|S|$.

RelaxedTwoWay. The relaxed two-way join algorithm [5, 40] takes as input two relations $R,S$ and a parameter $\tau\geq|R\Join S|$, and outputs a table of $\tau$ elements containing the join results of $R\Join S$, whose access pattern only depends on $|R|$, $|S|$, and $\tau$. This algorithm can also be easily transformed into a cache-agnostic version with $O((|R|+|S|+\tau)\cdot\log(|R|+|S|+\tau))$ time complexity and $O\left(\frac{|R|+|S|+\tau}{B}\cdot\log\tau\right)$ cache complexity. In Appendix C, we show how to improve this algorithm to obtain better cache complexity without sacrificing the time complexity.

Theorem 2.2.

For $R\Join S$ and a parameter $\tau\geq|R\Join S|$, there is a cache-agnostic algorithm that can compute $R\Join S$ with $O\left((|R|+|S|+\tau)\cdot\log(|R|+|S|+\tau)\right)$ time complexity and $O\left(\frac{|R|+|S|+\tau}{B}\cdot\log_{\frac{M}{B}}\frac{|R|+|S|+\tau}{B}\right)$ cache complexity under the tall cache and wide block assumptions, whose access pattern only depends on $M$, $B$, $|R|$, $|S|$, and $\tau$.
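
The sketch below captures only the interface of RelaxedTwoWay that later algorithms rely on: given $R$, $S$, and a bound $\tau\geq|R\Join S|$, it returns a table of exactly $\tau$ entries, the real join results followed by dummies. The body computes the join non-obliviously and is only a placeholder; the constructions of [5, 40] and Appendix C realize the same contract with an access pattern depending only on $|R|$, $|S|$, and $\tau$.

DUMMY = None

def relaxed_two_way(R, S, key, tau):
    """Output table has exactly tau entries: the results of the join of R and S
    on `key`, padded with dummies.  Non-oblivious placeholder for the primitive."""
    results = [{**r, **s} for r in R for s in S if r[key] == s[key]]
    assert len(results) <= tau, "tau must upper-bound the join size"
    return results + [DUMMY] * (tau - len(results))

R = [{"b": 1, "a": "a1"}, {"b": 2, "a": "a2"}]
S = [{"b": 2, "c": "c1"}, {"b": 2, "c": "c2"}]
print(relaxed_two_way(R, S, "b", tau=4))
# [{'b': 2, 'a': 'a2', 'c': 'c1'}, {'b': 2, 'a': 'a2', 'c': 'c2'}, None, None]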

3 Beyond Oblivious Nested-loop Join

Although the nested-loop join algorithm is described for the two-way join, it can be extended to multi-way joins. For a general join query with kk relations, the nested-loop primitive can be recursively invoked k1k-1 times, resulting in an oblivious algorithm with O(NkB)O\left(\frac{N^{k}}{B}\right) cache complexity. Careful inspection reveals that we do not necessarily feed all input relations into the nested loop; instead, we can restrict enumeration to combinations of tuples from relations included in an integral edge cover of the join query. Recall that for 𝒬=(𝒱,)\mathcal{Q}=(\mathcal{V},\mathcal{E}), an integral edge cover of 𝒬\mathcal{Q} is a function W:{0,1}W:\mathcal{E}\to\{0,1\}, such that e:xeW(e)1\sum_{e:x\in e}W(e)\geq 1 holds for every x𝒱x\in\mathcal{V}. While enumerating combinations of tuples from relations “chosen” by WW, we can apply semi-joins using remaining relations to filter intermediate join results.

As described in Algorithm 1, it first chooses an optimal integral edge cover WW^{*} of 𝒬\mathcal{Q} (line 1), and then invokes the NestedLoop primitive to iteratively compute the combinations of tuples from relations with W(e)=1W^{*}(e)=1 (line 7), whose output is denoted as LL. Meanwhile, we apply the semi-join between LL and the remaining relations (line 8).

Below, we analyze the complexity of this algorithm. First, as $|\mathcal{E}'|\leq\rho$, the number of intermediate join results materialized in the while-loop is at most $O(N^{\rho})$. After semi-join filtering, the number of surviving results is at most $O(N^{\rho^*})$, by the AGM bound applied to the subquery induced by the attributes covered so far. Hence, the number of intermediate results materialized by line 7 is at most $O(N^{\rho^*+1})$. Putting everything together, we obtain:

Theorem 3.1.

For a general join query $\mathcal{Q}$, there is an oblivious and cache-agnostic algorithm that can compute $\mathcal{Q}(\mathcal{R})$ for an arbitrary instance $\mathcal{R}$ of input size $N$ with $O\left(N^{\min\{\rho,\rho^*+1\}}\right)$ time complexity and $O\left(\frac{N^{\min\{\rho,\rho^*+1\}}}{B}\cdot\log_{\frac{M}{B}}\frac{N^{\min\{\rho,\rho^*+1\}}}{B}\right)$ cache complexity under the tall cache and wide block assumptions, where $\rho^*$ and $\rho$ are the optimal fractional and integral edge cover numbers of $\mathcal{Q}$, respectively.

It is important to note that any oblivious join algorithm incurs a cache complexity of Ω(NρB)\Omega\left(\frac{N^{\rho^{*}}}{B}\right), so Theorem 3.1 is optimal up to a logarithmic factor for join queries where ρ=ρ\rho=\rho^{*}. Below, we list several important classes of join queries that exhibit this desirable property:

1 $W^{*}\leftarrow$ an optimal integral edge cover of $\mathcal{Q}$, $L\leftarrow\emptyset$;
2 $\mathcal{E}^{\prime}\leftarrow\{e\in\mathcal{E}:W^{*}(e)=1\}$;
3 while $\mathcal{E}^{\prime}\neq\emptyset$ do
4       $e\leftarrow$ an arbitrary relation in $\mathcal{E}^{\prime}$;
5       $\mathcal{E}^{\prime}\leftarrow\mathcal{E}^{\prime}-\{e\}$;
6       if $L=\emptyset$ then $L\leftarrow R_{e}$;
7       else $L\leftarrow\textsc{NestedLoop}(L,R_{e})$;
8       foreach $e^{\prime}\in\mathcal{E}-\{e\}$ do $L\leftarrow\textsc{SemiJoin}(L,R_{e^{\prime}})$;
9 return $L$;
Algorithm 1 $\textsc{ObliviousNestedLoopJoin}(\mathcal{Q},\mathcal{R})$
Example 3.2 (α\alpha-acyclic Join).

A join query 𝒬\mathcal{Q} is α\alpha-acyclic [11, 25] if there is a tree structure 𝒯\mathcal{T} of 𝒬=(𝒱,)\mathcal{Q}=(\mathcal{V},\mathcal{E}) such that (1) there is a one-to-one correspondence between relations in 𝒬\mathcal{Q} and nodes in 𝒯\mathcal{T}; (2) for every attribute x𝒱x\in\mathcal{V}, the set of nodes containing xx form a connected subtree of 𝒯\mathcal{T}. Any α\alpha-acyclic join admits an optimal fractional edge cover that is integral [36].

Example 3.3 (Even-length Cycle Join).

An even-length cycle join is defined as 𝒬=R1(x1,x2)R2(x2,x3)Rk1(xk1,xk)Rk(xk,x1)\mathcal{Q}=R_{1}(x_{1},x_{2})\Join R_{2}(x_{2},x_{3})\Join\cdots\Join R_{k-1}(x_{k-1},x_{k})\Join R_{k}(x_{k},x_{1}) for some even integer kk. It has two integral edge covers {R1,R3,,Rk1}\{R_{1},R_{3},\cdots,R_{k-1}\} and {R2,R4,,Rk}\{R_{2},R_{4},\cdots,R_{k}\}, both of which are also an optimal fractional edge cover. Hence, ρ=ρ=k2\rho^{*}=\rho=\frac{k}{2}.

Example 3.4 (Boat Join).

A boat join is defined as 𝒬=R1(x1,y1)R2(x2,y2)Rk(xk,yk)Rk+1(x1,x2,,xk)Rk+2(y1,y2,,yk)\mathcal{Q}=R_{1}(x_{1},y_{1})\Join R_{2}(x_{2},y_{2})\Join\cdots\Join R_{k}(x_{k},y_{k})\Join R_{k+1}(x_{1},x_{2},\cdots,x_{k})\Join R_{k+2}(y_{1},y_{2},\cdots,y_{k}). It has an integral edge cover {R1,R2}\{R_{1},R_{2}\} that is also an optimal fractional edge cover. Hence, ρ=ρ=2\rho^{*}=\rho=2.
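
As a quick sanity check, the edge_cover_numbers sketch from Section 2.1 confirms these values on small instances of the cycle and boat examples (a hypothetical usage, assuming that helper is in scope):

# 4-cycle R1(x1,x2), R2(x2,x3), R3(x3,x4), R4(x4,x1): rho* = rho = 2
print(edge_cover_numbers(["x1", "x2", "x3", "x4"],
                         [{"x1", "x2"}, {"x2", "x3"}, {"x3", "x4"}, {"x4", "x1"}]))
# Boat join with k = 2: R1(x1,y1), R2(x2,y2), R3(x1,x2), R4(y1,y2): rho* = rho = 2
print(edge_cover_numbers(["x1", "x2", "y1", "y2"],
                         [{"x1", "y1"}, {"x2", "y2"}, {"x1", "x2"}, {"y1", "y2"}]))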

4 Warm Up: Triangle Join

The simplest join query that the oblivious nested-loop join algorithm cannot solve optimally is the triangle join $\mathcal{Q}_{\triangle}=R_1(x_2,x_3)\Join R_2(x_1,x_3)\Join R_3(x_1,x_2)$, which has $\rho=2$ and $\rho^*=\frac{3}{2}$. While various worst-case optimal algorithms for the triangle join have been proposed in the RAM model, none of them is oblivious due to their inherent leakage of intermediate statistics. Below, we outline the issues with existing insecure algorithms and propose a strategy to make them oblivious.

Insecure Triangle Join Algorithm 2.

We start with attribute $x_1$. Each value $a\in\mathrm{dom}(x_1)$ induces a subquery $\mathcal{Q}_a=R_1\Join(\sigma_{x_1=a}R_2)\Join(\sigma_{x_1=a}R_3)$. Moreover, a value $a\in\mathrm{dom}(x_1)$ is heavy if $|\pi_{x_3}\sigma_{x_1=a}R_2|\cdot|\pi_{x_2}\sigma_{x_1=a}R_3|$ is greater than $|R_1|$, and light otherwise. If $a$ is light, $\mathcal{Q}_a$ is computed by materializing the Cartesian product between $\pi_{x_3}\sigma_{x_1=a}R_2$ and $\pi_{x_2}\sigma_{x_1=a}R_3$, and then filtering the intermediate result by a semi-join with $R_1$. Every surviving tuple forms a join result with $a$, which is written back to untrusted memory. If $a$ is heavy, $\mathcal{Q}_a$ is computed by applying semi-joins of $R_1$ with $\sigma_{x_1=a}R_2$ and $\sigma_{x_1=a}R_3$. This algorithm achieves a time complexity of $O(N^{\frac{3}{2}})$ (see [43] for a detailed analysis), but it leaks sensitive information through the following mechanisms:

  • |(πx1R2)(πx1R3)|\left|(\pi_{x_{1}}R_{2})\cap(\pi_{x_{1}}R_{3})\right| is leaked by the number of for-loop iterations in line 2;

  • |πx2σx1=aR3|\left|\pi_{x_{2}}\sigma_{x_{1}=a}R_{3}\right| and |πx3σx1=aR2|\left|\pi_{x_{3}}\sigma_{x_{1}=a}R_{2}\right| are leaked by the number of for-loop iterations in line 4;

  • The numbers of heavy and light values in $(\pi_{x_1}R_2)\cap(\pi_{x_1}R_3)$ are leaked by the if-else condition in lines 3 and 6.

To protect intermediate statistics, we could pad every intermediate result (such as $(\pi_{x_1}R_2)\cap(\pi_{x_1}R_3)$, $\pi_{x_3}\sigma_{x_1=a}R_2$, and $\pi_{x_2}\sigma_{x_1=a}R_3$) with dummy tuples to match the worst-case size $N$. To hide heavy and light values, we could replace the conditional if-else branches with a unified execution plan that visits every possible combination in $\left(\pi_{x_2}\sigma_{x_1=a}R_3\right)\times\left(\pi_{x_3}\sigma_{x_1=a}R_2\right)$ and every tuple of $R_1$. However, integrating these techniques leads to $N^2$ memory accesses, destroying the power of two choices that is a key advantage of the insecure WCOJ algorithm.

1 $L\leftarrow\emptyset$;
2 foreach $a\in\left(\pi_{x_{1}}R_{2}\right)\cap\left(\pi_{x_{1}}R_{3}\right)$ do
3       if $|\sigma_{x_{1}=a}R_{2}|\cdot|\sigma_{x_{1}=a}R_{3}|\leq|R_{1}|$ then
4             foreach $(b,c)\in\left(\pi_{x_{2}}\sigma_{x_{1}=a}R_{3}\right)\times\left(\pi_{x_{3}}\sigma_{x_{1}=a}R_{2}\right)$ do
5                   if $(b,c)\in R_{1}$ then write $(a,b,c)$ to $L$;
6       else
7             foreach $(b,c)\in R_{1}$ do
8                   if $(a,b)\in R_{3}$ and $(a,c)\in R_{2}$ then write $(a,b,c)$ to $L$;
9 return $L$;
Algorithm 2 Compute $\mathcal{Q}_{\triangle}$ by power of two choices
1 $A\leftarrow\left(\pi_{x_{1}}R_{2}\right)\cap\left(\pi_{x_{1}}R_{3}\right)$ by Project and Intersect;
2 $A\leftarrow\textsc{Augment}(A,\{R_{2},R_{3}\},x_{1})$;
3 $A_{1},A_{2}\leftarrow\emptyset$;
4 while read $(a,\Delta_{1},\Delta_{2})$ from $A$ do // $\Delta_{1}=|\pi_{x_{3}}\sigma_{x_{1}=a}R_{2}|$ and $\Delta_{2}=|\pi_{x_{2}}\sigma_{x_{1}=a}R_{3}|$
5       if $\Delta_{1}\cdot\Delta_{2}\leq|R_{1}|$ then write $a$ to $A_{1}$, write $\perp$ to $A_{2}$;
6       else write $a$ to $A_{2}$, write $\perp$ to $A_{1}$;
7 $L_{1}\leftarrow\textsc{RelaxedTwoWay}(A_{2},R_{1},N^{\frac{3}{2}})$;
8 $L_{1}\leftarrow\textsc{SemiJoin}(L_{1},R_{2})$, $L_{1}\leftarrow\textsc{SemiJoin}(L_{1},R_{3})$;
9 $R_{2}\leftarrow\textsc{SemiJoin}(R_{2},A_{1})$, $R_{3}\leftarrow\textsc{SemiJoin}(R_{3},A_{1})$;
10 $L_{2}\leftarrow\textsc{RelaxedTwoWay}(R_{2},R_{3},N^{\frac{3}{2}})$;
11 $L_{2}\leftarrow\textsc{SemiJoin}(L_{2},R_{1})$;
12 return Compact $L_{1}\cup L_{2}$ while keeping the first $N^{\frac{3}{2}}$ tuples;
Algorithm 3 Inject Obliviousness to Algorithm 2

Inject Obliviousness to Algorithm 2.

To inject obliviousness into Algorithm 2, Algorithm 3 leverages oblivious primitives to ensure the same access pattern across all instances of the same input size. We start with computing $A=\left(\pi_{x_1}R_2\right)\cap\left(\pi_{x_1}R_3\right)$ by the Project and Intersect primitives. Then, we partition the values in $A$ into two subsets $A_1,A_2$, depending on the relative order between $|\pi_{x_3}\sigma_{x_1=a}R_2|\cdot|\pi_{x_2}\sigma_{x_1=a}R_3|$ and $|R_1|$. We next compute the two-way joins $A_2\Join R_1$ and $(R_2\ltimes A_1)\Join(R_3\ltimes A_1)$ by invoking the RelaxedTwoWay primitive separately, each with the upper bound $N^{\frac{3}{2}}$. At last, we filter intermediate join results by the SemiJoin primitive and remove unnecessary dummy tuples by the Compact primitive.
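
The only data-dependent decision in Algorithm 3 is the light/heavy split inside the while-loop, and it is hidden by writing exactly one entry (real or dummy) to each of $A_1$ and $A_2$ per value read from $A$. A minimal sketch of that step (variable names are ours):

DUMMY = None

def split_light_heavy(A, size_R1):
    """One pass over the augmented list A = [(a, d1, d2), ...] where
    d1 = |pi_{x3} sigma_{x1=a} R2| and d2 = |pi_{x2} sigma_{x1=a} R3|.
    For every value we append exactly one entry to each of A1 and A2,
    so the write pattern reveals nothing about which side a falls on."""
    A1, A2 = [], []
    for a, d1, d2 in A:
        if d1 * d2 <= size_R1:           # light value: handled via Cartesian product
            A1.append(a); A2.append(DUMMY)
        else:                            # heavy value: handled via a scan of R1
            A1.append(DUMMY); A2.append(a)
    return A1, A2

# toy instance with |R1| = 4
print(split_light_heavy([("a1", 1, 2), ("a2", 3, 3)], size_R1=4))
# (['a1', None], [None, 'a2'])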

Analysis of Algorithm 3.

It suffices to show that |(R2A1)(R3A1)|N32|(R_{2}\ltimes A_{1})\Join(R_{3}\ltimes A_{1})|\leq N^{\frac{3}{2}} and |A2R1|N32|A_{2}\Join R_{1}|\leq N^{\frac{3}{2}}, which directly follows from the query decomposition lemma [44]:

$$\sum_{a\in A}\min\left\{\left|\sigma_{x_1=a}R_2\right|\cdot\left|\sigma_{x_1=a}R_3\right|,\,|R_1|\right\}\leq\sum_{a\in A}\left(\left|R_2\ltimes a\right|\cdot\left|R_3\ltimes a\right|\right)^{\frac{1}{2}}\cdot\left|R_1\ltimes a\right|^{\frac{1}{2}}\leq N^{\frac{3}{2}}.$$

All other primitives have O(NlogN)O(N\cdot\log N) time complexity and O(NBlogMBNB)O\left(\frac{N}{B}\cdot\log_{\frac{M}{B}}\frac{N}{B}\right) cache complexity. Hence, this whole algorithm incurs O(N32logN)O\left(N^{\frac{3}{2}}\cdot\log N\right) time complexity and O(N32BlogMBN32B)O\left(\frac{N^{\frac{3}{2}}}{B}\cdot\log_{\frac{M}{B}}\frac{N^{\frac{3}{2}}}{B}\right) cache complexity. As each step is oblivious, the composition of all these steps is also oblivious.

Insecure Triangle Join Algorithm 4.

We start with attribute $x_1$. We first compute the candidate values of $x_1$ that appear in some join results, i.e., $(\pi_{x_1}R_2)\cap(\pi_{x_1}R_3)$. For each candidate value $a$, we retrieve the candidate values of $x_2$ that can appear together with $a$ in some join results, i.e., $\left(\pi_{x_2}\sigma_{x_1=a}R_3\right)\cap\left(\pi_{x_2}R_1\right)$. Furthermore, for each candidate value $b$, we explore the possible values of $x_3$ that can appear together with $(a,b)$ in some join results. More precisely, every value $c$ that appears in $\pi_{x_3}\sigma_{x_2=b}R_1$ as well as in $\pi_{x_3}\sigma_{x_1=a}R_2$ forms a triangle with $(a,b)$. This algorithm runs in $O(N^{\frac{3}{2}})$ time (see [44] for a detailed analysis). Similarly, it is not oblivious, as the following intermediate statistics may be leaked:

  • |(πx1R2)(πx1R3)|\left|(\pi_{x_{1}}R_{2})\cap(\pi_{x_{1}}R_{3})\right| is leaked by the number of for-loop iterations in line 2;

  • |(πx2σx1=aR3)(πx2R1)|\left|(\pi_{x_{2}}\sigma_{x_{1}=a}R_{3})\cap(\pi_{x_{2}}R_{1})\right| is leaked by the number of for-loop iterations in line 3;

  • $\left|(\pi_{x_3}\sigma_{x_2=b}R_1)\cap(\pi_{x_3}\sigma_{x_1=a}R_2)\right|$ is leaked by the number of for-loop iterations in line 4;

1 $L\leftarrow\emptyset$;
2 foreach $a\in\left(\pi_{x_{1}}R_{2}\right)\cap\left(\pi_{x_{1}}R_{3}\right)$ do
3       foreach $b\in\left(\pi_{x_{2}}\sigma_{x_{1}=a}R_{3}\right)\cap\left(\pi_{x_{2}}R_{1}\right)$ do
4             foreach $c\in\left(\pi_{x_{3}}\sigma_{x_{2}=b}R_{1}\right)\cap\left(\pi_{x_{3}}\sigma_{x_{1}=a}R_{2}\right)$ do
5                   write $(a,b,c)$ to $L$;
6 return $L$;
Algorithm 4 Compute $\mathcal{Q}_{\triangle}$ by delaying computation
1 $R_{3}\leftarrow\textsc{Augment}(R_{3},R_{1},x_{2})$, $R_{3}\leftarrow\textsc{Augment}(R_{3},R_{2},x_{1})$;
2 $K_{1},K_{2}\leftarrow\emptyset$;
3 while read $(t,\Delta_{1},\Delta_{2})$ from $R_{3}$ do // Suppose $\Delta_{i}=|R_{i}\ltimes\{t\}|$
4       if $\Delta_{1}\leq\Delta_{2}$ then write $t$ to $K_{1}$, write $\perp$ to $K_{2}$;
5       else write $t$ to $K_{2}$, write $\perp$ to $K_{1}$;
6 $L_{1}\leftarrow\textsc{RelaxedTwoWay}(K_{1},R_{1},N^{\frac{3}{2}})$, $L_{1}\leftarrow\textsc{SemiJoin}(L_{1},R_{2})$;
7 $L_{2}\leftarrow\textsc{RelaxedTwoWay}(K_{2},R_{2},N^{\frac{3}{2}})$, $L_{2}\leftarrow\textsc{SemiJoin}(L_{2},R_{1})$;
8 return Compact $L_{1}\cup L_{2}$ while keeping the first $N^{\frac{3}{2}}$ tuples;
Algorithm 5 Inject Obliviousness to Algorithm 4

To achieve obliviousness, a straightforward solution is to pad every intermediate result with dummy tuples to match the worst-case size NN. However, this would result in N3N^{3} memory accesses, which is even less efficient than the nested-loop-based algorithm in Section 3.

Inject Obliviousness to Algorithm 4.

We transform Algorithm 4 into an oblivious version, presented as Algorithm 5, by employing oblivious primitives. The first modification merges the first two for-loops (lines 2–3 in Algorithm 4) into one step (line 1 in Algorithm 5). This is achieved by applying the semi-joins on R3R_{3} using R1,R2R_{1},R_{2} separately. Then, the third for-loop (line 4 in Algorithm 4) is replaced with a strategy based on the power of two choices. Specifically, for each surviving tuple (a,b)R3(a,b)\in R_{3}, we first compute the size of two lists, |πx3σx2=bR1|\left|\pi_{x_{3}}\sigma_{x_{2}=b}R_{1}\right| and |πx3σx1=aR2|\left|\pi_{x_{3}}\sigma_{x_{1}=a}R_{2}\right|, and put (a,b)(a,b) into either K1K_{1} or K2K_{2}, based on the relative order between |πx3σx2=bR1|\left|\pi_{x_{3}}\sigma_{x_{2}=b}R_{1}\right| and |πx3σx1=aR2|\left|\pi_{x_{3}}\sigma_{x_{1}=a}R_{2}\right|. We next compute the following two-way joins K1R1K_{1}\Join R_{1} and K2R2K_{2}\Join R_{2} by invoking the RelaxedTwoWay primitive, each with the upper bound N32N^{\frac{3}{2}} separately. Finally, we filter intermediate join results by the SemiJoin primitive and remove unnecessary dummy tuples by the Compact primitive.

Complexity of Algorithm 5.

It suffices to show that |K1R1|N32|K_{1}\Join R_{1}|\leq N^{\frac{3}{2}} and |K2R2|N32|K_{2}\Join R_{2}|\leq N^{\frac{3}{2}}, which directly follows from the query decomposition lemma [44]:

$$\sum_{(a,b)\in R_3}\min\left\{\left|\pi_{x_3}\sigma_{x_2=b}R_1\right|,\left|\pi_{x_3}\sigma_{x_1=a}R_2\right|\right\}\leq\sum_{(a,b)\in R_3}\left|R_1\ltimes(a,b)\right|^{\frac{1}{2}}\cdot\left|R_2\ltimes(a,b)\right|^{\frac{1}{2}}\leq N^{\frac{3}{2}}.$$

All other primitives incur O(NlogN)O(N\log N) time complexity and O(NBlogMBNB)O\left(\frac{N}{B}\cdot\log_{\frac{M}{B}}\frac{N}{B}\right) cache complexity. Hence, this algorithm incurs O(N32logN)O\left(N^{\frac{3}{2}}\cdot\log N\right) time complexity and O(N32BlogMBN32B)O\left(\frac{N^{\frac{3}{2}}}{B}\cdot\log_{\frac{M}{B}}\frac{N^{\frac{3}{2}}}{B}\right) cache complexity. As each step is oblivious, the composition of all these steps is also oblivious.

Theorem 4.1.

For the triangle join $\mathcal{Q}_{\triangle}$, there is an oblivious and cache-agnostic algorithm that can compute $\mathcal{Q}_{\triangle}(\mathcal{R})$ for any instance $\mathcal{R}$ of input size $N$ with $O\left(N^{\frac{3}{2}}\cdot\log N\right)$ time complexity and $O\left(\frac{N^{\frac{3}{2}}}{B}\cdot\log_{\frac{M}{B}}\frac{N^{\frac{3}{2}}}{B}\right)$ cache complexity under the tall cache and wide block assumptions.

5 Oblivious Worst-case Optimal Join Algorithm

1 if |𝒱|=1|\mathcal{V}|=1 then return eRe\cap_{e\in\mathcal{E}}R_{e} by Intersect;
2 (I,J)(I,J)\leftarrow an arbitrary partition of 𝒱\mathcal{V};
3 𝒬IGenericJoin((I,[I]),{πIRe:e})\mathcal{Q}_{I}\leftarrow\textsc{GenericJoin}((I,\mathcal{E}[I]),\left\{\pi_{I}R_{e}:e\in\mathcal{E}\right\});
4 foreach t𝒬It\in\mathcal{Q}_{I} do  𝒬tGenericJoin((J,[J]),{πJ(Ret):e})\mathcal{Q}_{t}\leftarrow\textsc{GenericJoin}((J,\mathcal{E}[J]),\left\{\pi_{J}(R_{e}\ltimes t):e\in\mathcal{E}\right\});
5 return t𝒬I{t}×𝒬t\bigcup_{t\in\mathcal{Q}_{I}}\{t\}\times\mathcal{Q}_{t};
Algorithm 6 GenericJoin(𝒬=(𝒱,),)\textsc{GenericJoin}(\mathcal{Q}=(\mathcal{V},\mathcal{E}),\mathcal{R}) [44]

In this section, we first revisit the insecure WCOJ algorithm in Section 5.1, then present our oblivious algorithm in Section 5.2 and its analysis in Section 5.3. Subsequently, in Section 5.4, we explore the implications of our oblivious algorithm for relaxed oblivious algorithms designed for cyclic join queries.

5.1 Generic Join Revisited

In a join query $\mathcal{Q}=(\mathcal{V},\mathcal{E})$, for a subset of attributes $S\subseteq\mathcal{V}$, we use $\mathcal{Q}[S]=(S,\mathcal{E}[S])$ to denote the sub-query induced by the attributes in $S$, where $\mathcal{E}[S]=\{e\cap S:e\in\mathcal{E}\}$. For each attribute $x\in\mathcal{V}$, we use $\mathcal{E}_x=\{e\in\mathcal{E}:x\in e\}$ to denote the set of relations containing $x$. The insecure WCOJ algorithm described in [44] is outlined in Algorithm 6, which takes as input a general join query $\mathcal{Q}=(\mathcal{V},\mathcal{E})$ and an instance $\mathcal{R}$. In the base case, when only one attribute exists, it computes the intersection of all relations. For the general case, it partitions the attributes into two disjoint subsets $I$ and $J$, such that $I\cap J=\emptyset$ and $I\cup J=\mathcal{V}$. The algorithm first computes the sub-query $\mathcal{Q}[I]$ induced by the attributes in $I$, whose join result is denoted $\mathcal{Q}_I$. Then, for each tuple $t\in\mathcal{Q}_I$, it recursively invokes the whole algorithm to compute the sub-query $\mathcal{Q}[J]$ induced by the attributes in $J$, over the tuples that can be joined with $t$. The resulting join result for each tuple $t$ is denoted as $\mathcal{Q}_t$. Finally, it attaches $t$ to each tuple in $\mathcal{Q}_t$, representing the join results in which $t$ participates. The algorithm ultimately returns the union of all join results for tuples in $\mathcal{Q}_I$. However, Algorithm 6 exhibits significant leakage of data statistics that violates the obliviousness constraint, for example:

  • |eRe|\left|\bigcap_{e\in\mathcal{E}}R_{e}\right| is leaked in line 1;

  • |πIRe|\left|\pi_{I}R_{e}\right| for each relation ee\in\mathcal{E} is leaked in line 3;

  • |𝒬I||\mathcal{Q}_{I}|, |πJ(Ret)|\left|\pi_{J}\left(R_{e}\ltimes t\right)\right|, and |𝒬t|\left|\mathcal{Q}_{t}\right| for each tuple t𝒬It\in\mathcal{Q}_{I} are leaked in line 4.

More importantly, this algorithm relies heavily on hashing indexes or range search indexes for retrieving tuples, so that the intersection at line 1 can be computed in $O\left(\min_{e\in\mathcal{E}}|R_e|\right)$ time. However, these indexes do not work well in the external memory model, since naively extending this algorithm could result in $O\left(N^{\rho^*}\right)$ cache complexity, which is too expensive. Consequently, designing a WCOJ algorithm that simultaneously maintains cache locality and achieves obliviousness remains a significant challenge.
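
For intuition about the recursion itself, here is a compact and deliberately insecure Python rendering of Algorithm 6, with relations stored as lists of attribute-to-value dictionaries and a fixed attribute split; loop bounds, projection sizes, and intermediate result sizes are all visible to an observer, which is exactly the leakage listed above.

def dedup(tuples):
    """Remove duplicate tuples (dicts) after a projection."""
    return [dict(items) for items in {frozenset(t.items()) for t in tuples}]

def generic_join(attrs, relations):
    """Insecure GenericJoin sketch. `relations` is a list of (schema, tuples),
    schema a set of attribute names, tuples a list of dicts."""
    if len(attrs) == 1:
        x = attrs[0]
        # Base case: intersect the x-values of all relations containing x.
        common = set.intersection(*[{t[x] for t in ts}
                                    for sch, ts in relations if x in sch])
        return [{x: v} for v in common]
    I, J = attrs[:-1], attrs[-1:]              # one particular attribute split
    proj_I = [(sch & set(I), dedup([{a: t[a] for a in sch & set(I)} for t in ts]))
              for sch, ts in relations if sch & set(I)]
    out = []
    for t in generic_join(I, proj_I):          # Q_I
        residual = [(sch & set(J),
                     dedup([{a: s[a] for a in sch & set(J)} for s in ts
                            if all(s[a] == t[a] for a in sch if a in t)]))
                    for sch, ts in relations if sch & set(J)]
        for u in generic_join(J, residual):    # Q_t
            out.append({**t, **u})
    return out

# Triangle query: R1(x2,x3), R2(x1,x3), R3(x1,x2)
R1 = [{"x2": 1, "x3": 2}, {"x2": 1, "x3": 3}]
R2 = [{"x1": 0, "x3": 2}]
R3 = [{"x1": 0, "x2": 1}]
print(generic_join(["x1", "x2", "x3"],
                   [({"x2", "x3"}, R1), ({"x1", "x3"}, R2), ({"x1", "x2"}, R3)]))
# -> [{'x1': 0, 'x2': 1, 'x3': 2}]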

1 if |𝒱|=1|\mathcal{V}|=1 then return eRe\cap_{e\in\mathcal{E}}R_{e} by Intersect;
2 (I,J)(I,J)\leftarrow a partition of 𝒱\mathcal{V} such that (1) |J|=1|J|=1; or (2) |J|=2|J|=2 (say J={y,z}J=\{y,z\}) and yz\mathcal{E}_{y}-\mathcal{E}_{z}\neq\emptyset and zy\mathcal{E}_{z}-\mathcal{E}_{y}\neq\emptyset;
3 foreach ee\in\mathcal{E} do SeProject(Re,eI)S_{e}\leftarrow\textsc{Project}(R_{e},e\cap I);
4 𝒬IObliviousGenericJoin((I,[I]),{Se:e})\mathcal{Q}_{I}\leftarrow\textsc{ObliviousGenericJoin}((I,\mathcal{E}[I]),\left\{S_{e}:e\in\mathcal{E}\right\});
5 if |J|=1|J|=1 then // Suppose J={x}J=\{x\}
6       foreach exe\in\mathcal{E}_{x} do  𝒬IAugment(𝒬I,Re,eI)\mathcal{Q}_{I}\leftarrow\textsc{Augment}(\mathcal{Q}_{I},R_{e},e\cap I);
7      {QIe}exPartition-I(𝒬I,x)\{Q_{I}^{e}\}_{e\in\mathcal{E}_{x}}\leftarrow\textsc{Partition-I}(\mathcal{Q}_{I},\mathcal{E}_{x});
8       foreach exe\in\mathcal{E}_{x} do
9             LeRelaxedTwoWay(𝒬Ie,Re,Nρ(𝒬))L_{e}\leftarrow\textsc{RelaxedTwoWay}\left(\mathcal{Q}^{e}_{I},R_{e},N^{\rho^{*}(\mathcal{Q})}\right);
10             for ex{e}e^{\prime}\in\mathcal{E}_{x}-\{e\} do LeSemiJoin(Le,Re)L_{e}\leftarrow\textsc{SemiJoin}(L_{e},R_{e^{\prime}});
11            
12      LexLeL\leftarrow\bigcup_{e\in\mathcal{E}_{x}}L_{e};
13      
14else // Suppose J={y,z}J=\{y,z\}
15       foreach eyze\in\mathcal{E}_{y}\cup\mathcal{E}_{z} do  𝒬IAugment(𝒬I,Re,eI)\mathcal{Q}_{I}\leftarrow\textsc{Augment}(\mathcal{Q}_{I},R_{e},e\cap I);
16       {𝒬Ie1,e2}(e1,e2)(yz)×(zy),{𝒬Ie3}e3xyPartition-II(QI,y,z)\{\mathcal{Q}_{I}^{e_{1},e_{2}}\}_{(e_{1},e_{2})\in(\mathcal{E}_{y}-\mathcal{E}_{z})\times(\mathcal{E}_{z}-\mathcal{E}_{y})},\{\mathcal{Q}_{I}^{e_{3}}\}_{e_{3}\in\mathcal{E}_{x}\cap\mathcal{E}_{y}}\leftarrow\textsc{Partition-II}(Q_{I},\mathcal{E}_{y},\mathcal{E}_{z});
17       foreach (e1,e2)(yz)×(zy)(e_{1},e_{2})\in(\mathcal{E}_{y}-\mathcal{E}_{z})\times(\mathcal{E}_{z}-\mathcal{E}_{y}) do
18             Le1,e2RelaxedTwoWay(𝒬Ie1,e2,Re1,Nρ(𝒬))L_{e_{1},e_{2}}\leftarrow\textsc{RelaxedTwoWay}\left(\mathcal{Q}^{e_{1},e_{2}}_{I},R_{e_{1}},N^{\rho^{*}(\mathcal{Q})}\right);
19             Le1,e2RelaxedTwoWay(Le1,e2,Re2,Nρ(𝒬))L_{e_{1},e_{2}}\leftarrow\textsc{RelaxedTwoWay}\left(L_{e_{1},e_{2}},R_{e_{2}},N^{\rho^{*}(\mathcal{Q})}\right);
20             foreach e{e1,e2}e\in\mathcal{E}-\{e_{1},e_{2}\} do Le1,e2SemiJoin(Le1,e2,Re)L_{e_{1},e_{2}}\leftarrow\textsc{SemiJoin}(L_{e_{1},e_{2}},R_{e});
21            
22      foreach e3yze_{3}\in\mathcal{E}_{y}\cap\mathcal{E}_{z} do
23             Le3RelaxedTwoWay(𝒬Ie3,Re3,Nρ(𝒬))L_{e_{3}}\leftarrow\textsc{RelaxedTwoWay}\left(\mathcal{Q}^{e_{3}}_{I},R_{e_{3}},N^{\rho^{*}(\mathcal{Q})}\right);
24             foreach e{e3}e\in\mathcal{E}-\{e_{3}\} do  Le3SemiJoin(Le3,Re)L_{e_{3}}\leftarrow\textsc{SemiJoin}(L_{e_{3}},R_{e});
25            
26      L((e1,e2)(yz)×(zy)Le1,e2)(e3yzLe3)L\leftarrow\left(\bigcup_{(e_{1},e_{2})\in(\mathcal{E}_{y}-\mathcal{E}_{z})\times(\mathcal{E}_{z}-\mathcal{E}_{y})}L_{e_{1},e_{2}}\right)\cup\left(\bigcup_{e_{3}\in\mathcal{E}_{y}\cap\mathcal{E}_{z}}L_{e_{3}}\right);
27      
28return Compact LL while keeping the first Nρ(𝒬)N^{\rho^{*}(\mathcal{Q})} tuples;
Algorithm 7 ObliviousGenericJoin(𝒬=(𝒱,),)\textsc{ObliviousGenericJoin}(\mathcal{Q}=(\mathcal{V},\mathcal{E}),\mathcal{R})
1 foreach exe\in\mathcal{E}_{x} do 𝒬Ie\mathcal{Q}^{e}_{I}\leftarrow\emptyset;
2 while read (t,{Δe(t)}ex)(t,\{\Delta_{e}(t)\}_{e\in\mathcal{E}_{x}}) from 𝒬I\mathcal{Q}_{I} do // Suppose Δe(t)=|Re{t}|\Delta_{e}(t)=|R_{e}\ltimes\{t\}|
3       eargminexΔe(t)e^{\prime}\leftarrow\arg\min_{e\in\mathcal{E}_{x}}\Delta_{e}(t);
4       write tt to 𝒬Ie\mathcal{Q}^{e^{\prime}}_{I} and write \perp to 𝒬Ie′′\mathcal{Q}^{e^{\prime\prime}}_{I} for each e′′x{e}e^{\prime\prime}\in\mathcal{E}_{x}-\{e^{\prime}\};
5      
return {QIe}ex\{Q_{I}^{e}\}_{e\in\mathcal{E}_{x}};
Algorithm 8 Partition-I(𝒬I,x)\textsc{Partition-I}(\mathcal{Q}_{I},\mathcal{E}_{x})
1 foreach (e1,e2)(yz)×(zy)(e_{1},e_{2})\in(\mathcal{E}_{y}-\mathcal{E}_{z})\times(\mathcal{E}_{z}-\mathcal{E}_{y}) do  𝒬Ie1,e2\mathcal{Q}^{e_{1},e_{2}}_{I}\leftarrow\emptyset;
2 foreach e3yze_{3}\in\mathcal{E}_{y}\cap\mathcal{E}_{z} do 𝒬Ie3\mathcal{Q}^{e_{3}}_{I}\leftarrow\emptyset;
3 while read (t,{Δe(t)}eyz)(t,\{\Delta_{e}(t)\}_{e\in\mathcal{E}_{y}\cup\mathcal{E}_{z}}) from 𝒬I\mathcal{Q}_{I} do // Suppose Δe(t)=|Re{t}|\Delta_{e}(t)=|R_{e}\ltimes\{t\}|
4       e1,e2,e3argmineyzΔe(t),argminezyΔe(t),argmineyzΔe(t)\displaystyle{e_{1},e_{2},e_{3}\leftarrow\arg\min_{e\in\mathcal{E}_{y}-\mathcal{E}_{z}}\Delta_{e}(t),\arg\min_{e\in\mathcal{E}_{z}-\mathcal{E}_{y}}\Delta_{e}(t),\arg\min_{e\in\mathcal{E}_{y}\cap\mathcal{E}_{z}}\Delta_{e}(t)};
5       if Δe1(t)Δe2(t)Δe3(t)\Delta_{e_{1}}(t)\cdot\Delta_{e_{2}}(t)\leq\Delta_{e_{3}}(t) then
6             write tt to 𝒬Ie1,e2\mathcal{Q}^{e_{1},e_{2}}_{I};
7             foreach (e1,e2)(yz)×(zy){(e1,e2)}(e_{1}^{\prime},e_{2}^{\prime})\in(\mathcal{E}_{y}-\mathcal{E}_{z})\times(\mathcal{E}_{z}-\mathcal{E}_{y})-\{(e_{1},e_{2})\} do write \perp to 𝒬Ie1,e2\mathcal{Q}^{e_{1}^{\prime},e_{2}^{\prime}}_{I};
8             foreach e3yze_{3}^{\prime}\in\mathcal{E}_{y}\cap\mathcal{E}_{z} do write \perp to 𝒬Ie3\mathcal{Q}^{e_{3}^{\prime}}_{I};
9            
10      else
11             write tt to 𝒬Ie3\mathcal{Q}^{e_{3}}_{I};
12             foreach (e1,e2)(yz)×(zy)(e_{1}^{\prime},e_{2}^{\prime})\in(\mathcal{E}_{y}-\mathcal{E}_{z})\times(\mathcal{E}_{z}-\mathcal{E}_{y}) do write \perp to 𝒬Ie1,e2\mathcal{Q}^{e_{1}^{\prime},e_{2}^{\prime}}_{I};
13             foreach e3yz{e3}e_{3}^{\prime}\in\mathcal{E}_{y}\cap\mathcal{E}_{z}-\{e_{3}\} do write \perp to 𝒬Ie3\mathcal{Q}^{e_{3}^{\prime}}_{I};
14            
15      
16return {𝒬Ie1,e2}(e1,e2)(yz)×(zy),{𝒬Ie3}e3yz\{\mathcal{Q}_{I}^{e_{1},e_{2}}\}_{(e_{1},e_{2})\in(\mathcal{E}_{y}-\mathcal{E}_{z})\times(\mathcal{E}_{z}-\mathcal{E}_{y})},\{\mathcal{Q}_{I}^{e_{3}}\}_{e_{3}\in\mathcal{E}_{y}\cap\mathcal{E}_{z}};
Algorithm 9 Partition-II(𝒬I,y,z)\textsc{Partition-II}(\mathcal{Q}_{I},\mathcal{E}_{y},\mathcal{E}_{z})

5.2 Our Algorithm

Now, we extend our oblivious triangle join algorithms from Section 4 to general join queries, as described in Algorithm 7. It is built on a recursive framework:

Base Case: |𝒱|=1|\mathcal{V}|=1. In this case, the join degenerates to the set intersection of all input relations, which can be efficiently computed by the Intersect primitive.

General Case: |𝒱|>1|\mathcal{V}|>1.

In general, we partition 𝒱\mathcal{V} into two subsets II and JJ, with the constraint that either |J|=1|J|=1, or |J|=2|J|=2 and the two attributes y,zy,z in JJ satisfy yz\mathcal{E}_{y}-\mathcal{E}_{z}\neq\emptyset and zy\mathcal{E}_{z}-\mathcal{E}_{y}\neq\emptyset. Similar to Algorithm 6, we compute the sub-query 𝒬[I]\mathcal{Q}[I] by invoking the whole algorithm recursively, and denote its join result by 𝒬I\mathcal{Q}_{I}. To prevent potential leakage, we must be careful about the projection of each relation involved in this sub-query, which is computed by the Project primitive. We further distinguish two cases based on |J||J|:

General Case 1: |J|=1|J|=1.

Suppose J={x}J=\{x\}. Recall that for each tuple t𝒬It\in\mathcal{Q}_{I}, Algorithm 6 computes the intersection ex(Ret)\cap_{e\in\mathcal{E}_{x}}\left(R_{e}\ltimes t\right) on xx in the base case. To ensure this step remains oblivious, we must conceal the size of RetR_{e}\ltimes t. To achieve this, we augment each tuple t𝒬It\in\mathcal{Q}_{I} with its degree in ReR_{e}, which is defined as Δe(t)=|Ret|\Delta_{e}(t)=\left|R_{e}\ltimes t\right|, using the Augment primitive. Then, we partition tuples in 𝒬I\mathcal{Q}_{I} into |x||\mathcal{E}_{x}| subsets based on their smallest degree across all relations in x\mathcal{E}_{x}. The details are described in Algorithm 8. Let 𝒬Ie𝒬I\mathcal{Q}^{e}_{I}\subseteq\mathcal{Q}_{I} denote the set of tuples whose degree is the smallest in ReR_{e}, i.e., e=argminexΔe(t)e=\arg\min_{e^{\prime}\in\mathcal{E}_{x}}\Delta_{e^{\prime}}(t) for each t𝒬Iet\in\mathcal{Q}^{e}_{I}. Whenever we write one tuple t𝒬It\in\mathcal{Q}_{I} to one subset, we also write a dummy tuple \perp to the other |x|1|\mathcal{E}_{x}|-1 subsets. At last, for each exe\in\mathcal{E}_{x}, we compute Re𝒬IeR_{e}\Join\mathcal{Q}^{e}_{I} by invoking the RelaxedTwoWay primitive (line 9), with upper bound NρN^{\rho^{*}}, and further filter them by remaining relations with semi-joins (line 10).
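To make this partitioning step concrete, the following minimal Python sketch mirrors the logic of Algorithm 8; it assumes the degrees Δe(t) have already been attached by the Augment primitive, and all names (partition_one, q_i, deg, DUMMY) are illustrative rather than part of our formal construction. Note that every input tuple triggers exactly one write to every output part, mirroring the dummy writes of Algorithm 8.

DUMMY = None  # stands for the dummy tuple ⊥

def partition_one(q_i, edges_x):
    """Sketch of Algorithm 8: route each tuple of Q_I to the relation in E_x
    in which its degree is smallest, writing a dummy to every other part so
    that the number and order of writes depend only on |Q_I| and |E_x|."""
    parts = {e: [] for e in edges_x}
    for t, deg in q_i:                     # deg[e] plays the role of Δ_e(t)
        e_min = min(edges_x, key=lambda e: deg[e])
        for e in edges_x:
            parts[e].append(t if e == e_min else DUMMY)
    return parts

# toy usage with two relations covering the attribute x
q_i = [("t1", {"R1": 3, "R2": 7}), ("t2", {"R1": 9, "R2": 2})]
print(partition_one(q_i, ["R1", "R2"]))   # {'R1': ['t1', None], 'R2': [None, 't2']}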

General Case 2: |J|=2|J|=2.

Suppose J={y,z}J=\{y,z\}. Consider an arbitrary tuple t𝒬It\in\mathcal{Q}_{I}. Algorithm 6 computes the residual query {eyz(Ret)}{eyz(Ret)}{ezy(Ret)}\left\{\bigcap_{e\in\mathcal{E}_{y}\cap\mathcal{E}_{z}}(R_{e}\ltimes t)\right\}\Join\left\{\bigcap_{e\in\mathcal{E}_{y}-\mathcal{E}_{z}}(R_{e}\ltimes t)\right\}\Join\left\{\bigcap_{e\in\mathcal{E}_{z}-\mathcal{E}_{y}}(R_{e}\ltimes t)\right\}. As in the case above, we first compute its degree Δe(t)\Delta_{e}(t) in each ReR_{e} using the Augment primitive. We then partition tuples in 𝒬I\mathcal{Q}_{I} into |yz|+|yz||zy||\mathcal{E}_{y}\cap\mathcal{E}_{z}|+|\mathcal{E}_{y}-\mathcal{E}_{z}|\cdot|\mathcal{E}_{z}-\mathcal{E}_{y}| subsets based on their degrees, in a more involved way than in Case 1. The details are described in Algorithm 9. More specifically, for each e3yze_{3}\in\mathcal{E}_{y}\cap\mathcal{E}_{z}, let

𝒬Ie3={t𝒬I:\displaystyle\mathcal{Q}^{e_{3}}_{I}=\biggl{\{}t\in\mathcal{Q}_{I}: Δe3(t)=mine′′yzΔe′′(t),Δe3(t)<mineyz,ezyΔe(t)Δe(t)};\displaystyle\Delta_{e_{3}}(t)=\min_{e^{\prime\prime}\in\mathcal{E}_{y}\cap\mathcal{E}_{z}}\Delta_{e^{\prime\prime}}(t),\Delta_{e_{3}}(t)<\min_{e\in\mathcal{E}_{y}-\mathcal{E}_{z},e^{\prime}\in\mathcal{E}_{z}-\mathcal{E}_{y}}\Delta_{e}(t)\cdot\Delta_{e^{\prime}}(t)\biggl{\}};

and for each pair (e1,e2)(yz)×(zy)(e_{1},e_{2})\in(\mathcal{E}_{y}-\mathcal{E}_{z})\times(\mathcal{E}_{z}-\mathcal{E}_{y}), let

𝒬Ie1,e2={t𝒬I:\displaystyle\mathcal{Q}^{e_{1},e_{2}}_{I}=\biggl{\{}t\in\mathcal{Q}_{I}: Δe1(t)Δe2(t)=mineyz,ezyΔe(t)Δe(t)mine′′yzΔe′′(t)}\displaystyle\Delta_{e_{1}}(t)\cdot\Delta_{e_{2}}(t)=\min_{e\in\mathcal{E}_{y}-\mathcal{E}_{z},e^{\prime}\in\mathcal{E}_{z}-\mathcal{E}_{y}}\Delta_{e}(t)\cdot\Delta_{e^{\prime}}(t)\leq\min_{e^{\prime\prime}\in\mathcal{E}_{y}\cap\mathcal{E}_{z}}\Delta_{e^{\prime\prime}}(t)\biggl{\}}

For each (e1,e2)(yz)×(zy)(e_{1},e_{2})\in(\mathcal{E}_{y}-\mathcal{E}_{z})\times(\mathcal{E}_{z}-\mathcal{E}_{y}), we compute Re1Re2𝒬Ie1,e2R_{e_{1}}\Join R_{e_{2}}\Join\mathcal{Q}_{I}^{e_{1},e_{2}} by invoking the RelaxedTwoWay primitive iteratively (lines 16-17), with the upper bound Nρ(𝒬)N^{\rho^{*}(\mathcal{Q})}, and filter these results by the remaining relations with semi-joins (line 18). For each e3yze_{3}\in\mathcal{E}_{y}\cap\mathcal{E}_{z}, we compute Re3𝒬Ie3R_{e_{3}}\Join\mathcal{Q}^{e_{3}}_{I} by invoking the RelaxedTwoWay primitive (line 20), with the upper bound Nρ(𝒬)N^{\rho^{*}(\mathcal{Q})}, and filter these results by the remaining relations with semi-joins (line 21).
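For intuition, consider an illustrative tuple t∈𝒬I for which the minimizing pair (e1,e2)∈(ℰy−ℰz)×(ℰz−ℰy) has degrees Δe1(t)=3 and Δe2(t)=4, while the minimizing e3∈ℰy∩ℰz has Δe3(t)=10. Since 3⋅4=12>10, Algorithm 9 routes t to 𝒬Ie3, so its residual query is answered through Re3⋉t (at most 10 tuples) rather than through the pair (e1,e2) (up to 12 tuples); if instead Δe3(t)=20, then 12≤20 and t would be routed to 𝒬Ie1,e2.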

5.3 Analysis of Algorithm 7

Base Case: |𝒱|=1|\mathcal{V}|=1. The obliviousness is guaranteed by the Intersect primitive. The cache complexity is O(NBlogMBNB)O\left(\frac{N}{B}\cdot\log_{\frac{M}{B}}\frac{N}{B}\right). In this case, ρ=1\rho^{*}=1. Hence, Theorem 5.1 holds.

General Case: |𝒱|>1|\mathcal{V}|>1.

By the induction hypothesis, the recursive invocation of ObliviousGenericJoin at line 4 takes O(Nρ(𝒬)logN)O\left(N^{\rho^{*}(\mathcal{Q})}\cdot\log N\right) time and incurs O(NρBlogMBNB)O\left(\frac{N^{\rho^{*}}}{B}\cdot\log_{\frac{M}{B}}\frac{N}{B}\right) cache complexity, since ρ((I,[I]))ρ(𝒬)\rho^{*}((I,\mathcal{E}[I]))\leq\rho^{*}(\mathcal{Q}). We next establish the correctness and complexity of all invocations of the RelaxedTwoWay primitive. Let ρ()\rho^{*}(\cdot) be an optimal fractional edge cover of 𝒬\mathcal{Q}. The real size of the two-way joins at line 9 can be rewritten as:

ex|Re𝒬Ie|=ext𝒬Ie|Ret|=ext𝒬Ieminex|Ret|t𝒬Iex|Ret|ρ(e)Nρ\sum_{e\in\mathcal{E}_{x}}\left|R_{e}\Join\mathcal{Q}^{e}_{I}\right|=\sum_{e\in\mathcal{E}_{x}}\sum\limits_{t\in\mathcal{Q}^{e}_{I}}\left|R_{e}\ltimes t\right|=\sum_{e\in\mathcal{E}_{x}}\sum\limits_{t\in\mathcal{Q}^{e}_{I}}\min_{e^{\prime}\in\mathcal{E}_{x}}\left|R_{e^{\prime}}\ltimes t\right|\leq\sum_{t\in\mathcal{Q}_{I}}\prod_{e^{\prime}\in\mathcal{E}_{x}}\left|R_{e^{\prime}}\ltimes t\right|^{\rho^{*}(e^{\prime})}\leq N^{\rho^{*}}

where the inequalities follow from the facts that exρ(e)1\displaystyle{\sum_{e^{\prime}\in\mathcal{E}_{x}}\rho^{*}(e^{\prime})\geq 1}, ex𝒬Ie=𝒬I\bigcup_{e\in\mathcal{E}_{x}}\mathcal{Q}^{e}_{I}=\mathcal{Q}_{I}, and the query decomposition lemma [44]. Hence, Nρ(𝒬)N^{\rho^{*}(\mathcal{Q})} is a valid upper bound on Re𝒬IeR_{e}\Join\mathcal{Q}^{e}_{I} for each exe\in\mathcal{E}_{x}. The real size of the two-way joins at lines 18-19 and line 22 can be rewritten as

e1yz,e2zy|Re1Re2𝒬Ie1,e2|+e3yz|Re3𝒬Ie3|\displaystyle\sum_{e_{1}\in\mathcal{E}_{y}-\mathcal{E}_{z},e_{2}\in\mathcal{E}_{z}-\mathcal{E}_{y}}\left|R_{e_{1}}\Join R_{e_{2}}\Join\mathcal{Q}^{e_{1},e_{2}}_{I}\right|+\sum_{e_{3}\in\mathcal{E}_{y}\cap\mathcal{E}_{z}}\left|R_{e_{3}}\Join\mathcal{Q}^{e_{3}}_{I}\right|
=e1yz,e2zyt𝒬Ie1,e2|(Re1t)(Re2t)|+e3yzt𝒬Ie3|Re3t|\displaystyle=\sum_{e_{1}\in\mathcal{E}_{y}-\mathcal{E}_{z},e_{2}\in\mathcal{E}_{z}-\mathcal{E}_{y}}\sum_{t\in\mathcal{Q}^{e_{1},e_{2}}_{I}}\left|\left(R_{e_{1}}\ltimes t\right)\Join\left(R_{e_{2}}\ltimes t\right)\right|+\sum_{e_{3}\in\mathcal{E}_{y}\cap\mathcal{E}_{z}}\sum_{t\in\mathcal{Q}^{e_{3}}_{I}}\left|R_{e_{3}}\ltimes t\right|
=t𝒬Imin{mine1yz,e2zy|Re1t||Re2t|,mine3yz|Re3t|}\displaystyle=\sum_{t\in\mathcal{Q}_{I}}\min\left\{\min_{e_{1}\in\mathcal{E}_{y}-\mathcal{E}_{z},e_{2}\in\mathcal{E}_{z}-\mathcal{E}_{y}}|R_{e_{1}}\ltimes t|\cdot|R_{e_{2}}\ltimes t|,\min_{e_{3}\in\mathcal{E}_{y}\cap\mathcal{E}_{z}}|R_{e_{3}}\ltimes t|\right\} (2)

Let ρ1=eyzρ(e)\displaystyle{\rho_{1}=\sum_{e\in\mathcal{E}_{y}-\mathcal{E}_{z}}\rho^{*}(e)}, ρ2=ezyρ(e)\displaystyle{\rho_{2}=\sum_{e\in\mathcal{E}_{z}-\mathcal{E}_{y}}\rho^{*}(e)} and ρ3=eyzρ(e)\displaystyle{\rho_{3}=\sum_{e\in\mathcal{E}_{y}\cap\mathcal{E}_{z}}\rho^{*}(e)}. Note ρ31min{ρ1,ρ2}\rho_{3}\geq 1-\min\{\rho_{1},\rho_{2}\} as ρ()\rho^{*}(\cdot) is a valid fractional edge cover for both yy and zz. For each tuple t𝒬It\in\mathcal{Q}_{I}, we have

min{mine1yz,e2zy|Re1t||Re2t|,mine3yz|Re3t|}\displaystyle\min\left\{\min_{e_{1}\in\mathcal{E}_{y}-\mathcal{E}_{z},e_{2}\in\mathcal{E}_{z}-\mathcal{E}_{y}}\left|R_{e_{1}}\ltimes t\right|\cdot|R_{e_{2}}\ltimes t|,\min_{e_{3}\in\mathcal{E}_{y}\cap\mathcal{E}_{z}}\left|R_{e_{3}}\ltimes t\right|\right\}
(mine1yz|Re1t|)ρ1(mine2zy|Re2t|)ρ2(mine3yz|Re3t|)ρ3\displaystyle\leq\left(\min_{e_{1}\in\mathcal{E}_{y}-\mathcal{E}_{z}}\left|R_{e_{1}}\ltimes t\right|\right)^{\rho_{1}}\cdot\left(\min_{e_{2}\in\mathcal{E}_{z}-\mathcal{E}_{y}}|R_{e_{2}}\ltimes t|\right)^{\rho_{2}}\cdot\left(\min_{e_{3}\in\mathcal{E}_{y}\cap\mathcal{E}_{z}}\left|R_{e_{3}}\ltimes t\right|\right)^{\rho_{3}}
eyz|Ret|ρ(e)ezy|Ret|ρ(e)eyz|Ret|ρ(e)=eyz|Ret|ρ(e),\displaystyle\leq\prod_{e\in\mathcal{E}_{y}-\mathcal{E}_{z}}\left|R_{e}\ltimes t\right|^{\rho^{*}(e)}\cdot\prod_{e\in\mathcal{E}_{z}-\mathcal{E}_{y}}\left|R_{e}\ltimes t\right|^{\rho^{*}(e)}\cdot\prod_{e\in\mathcal{E}_{y}\cap\mathcal{E}_{z}}\left|R_{e}\ltimes t\right|^{\rho^{*}(e)}=\prod_{e\in\mathcal{E}_{y}\cup\mathcal{E}_{z}}\left|R_{e}\ltimes t\right|^{\rho^{*}(e)},

where the first inequality follows from min{a,b}apb1p\min\left\{a,b\right\}\leq a^{p}\cdot b^{1-p} for a,b0a,b\geq 0 and p[0,1]p\in[0,1], applied with p=min{ρ1,ρ2}p=\min\left\{\rho_{1},\rho_{2}\right\}, together with the facts that ρ1,ρ2min{ρ1,ρ2}\rho_{1},\rho_{2}\geq\min\left\{\rho_{1},\rho_{2}\right\} and ρ31min{ρ1,ρ2}\rho_{3}\geq 1-\min\left\{\rho_{1},\rho_{2}\right\}; the second inequality follows since each minimum is no larger than every individual term in it. Now, we can further bound (5.3) as

(5.3)t𝒬Ieyz|Ret|ρ(e)e|Re|ρ(e)Nρ(\ref{eq:2})\leq\sum_{t\in\mathcal{Q}_{I}}\prod_{e\in\mathcal{E}_{y}\cup\mathcal{E}_{z}}\left|R_{e}\ltimes t\right|^{\rho^{*}(e)}\leq\prod_{e\in\mathcal{E}}|R_{e}|^{\rho^{*}(e)}\leq N^{\rho^{*}}

where the second inequality follows from the query decomposition lemma [44], and the last one follows since |Re|N|R_{e}|\leq N for each ee\in\mathcal{E}.

Theorem 5.1.

For a general join query 𝒬\mathcal{Q}, there is an oblivious and cache-agnostic algorithm that can compute 𝒬()\mathcal{Q}(\mathcal{R}) for any instance \mathcal{R} of input size NN with O(NρlogN)O\left(N^{\rho^{*}}\cdot\log N\right) time complexity and O(NρBlogMBNρB)O\left(\frac{N^{\rho^{*}}}{B}\cdot\log_{\frac{M}{B}}\frac{N^{\rho^{*}}}{B}\right) cache complexity under the tall cache and wide block assumptions, where ρ\rho^{*} is the optimal fractional edge cover number of 𝒬\mathcal{Q}.

5.4 Implications to Relaxed Oblivious Algorithms

Our oblivious WCOJ algorithm can be combined with the generalized hypertree decomposition framework [33] to develop a relaxed oblivious algorithm for general join queries.

Definition 5.2 (Generalized Hypertree Decomposition (GHD)).

Given a join query 𝒬=(𝒱,)\mathcal{Q}=(\mathcal{V},\mathcal{E}), a GHD of 𝒬\mathcal{Q} is a pair (𝒯,λ)(\mathcal{T},\lambda), where 𝒯\mathcal{T} is a tree over a set of nodes and λ:𝒯2𝒱\lambda:\mathcal{T}\to 2^{\mathcal{V}} is a labeling function that associates with each node u𝒯u\in\mathcal{T} a subset of attributes in 𝒱\mathcal{V}, denoted λu\lambda_{u}, such that (1) for each ee\in\mathcal{E}, there is a node u𝒯u\in\mathcal{T} such that eλue\subseteq\lambda_{u}; and (2) for each x𝒱x\in\mathcal{V}, the set of nodes {u𝒯:xλu}\{u\in\mathcal{T}:x\in\lambda_{u}\} forms a connected subtree of 𝒯\mathcal{T}. The fractional hypertree width of 𝒬\mathcal{Q} is defined as min(𝒯,λ)maxu𝒯ρ((λu,{eλu:e}))\displaystyle{\min_{(\mathcal{T},\lambda)}\max_{u\in\mathcal{T}}\rho^{*}\left((\lambda_{u},\{e\cap\lambda_{u}:e\in\mathcal{E}\})\right)}.
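For example, consider the path query with ℰ = {{x1,x2},{x2,x3},{x3,x4}}. Taking 𝒯 to be a path of three nodes u1,u2,u3 with λu1={x1,x2}, λu2={x2,x3} and λu3={x3,x4} satisfies both conditions, and every bag is covered by a single relation with weight 1, so this GHD has width 1; in particular, acyclic joins admit GHDs of width 1, namely their join trees.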

The pseudocode of our algorithm is given in Appendix D. Suppose we take as input a join query 𝒬=(𝒱,)\mathcal{Q}=(\mathcal{V},\mathcal{E}), an instance \mathcal{R}, and an upper bound on the output size τ|𝒬()|\tau\geq|\mathcal{Q}(\mathcal{R})|. Let (𝒯,λ)(\mathcal{T},\lambda) be an arbitrary GHD of 𝒬\mathcal{Q}. We first invoke Algorithm 7 to compute the subquery 𝒬u=(λu,u)\mathcal{Q}_{u}=(\lambda_{u},\mathcal{E}_{u}) defined by each node u𝒯u\in\mathcal{T}, where u={eλu:e}\mathcal{E}_{u}=\{e\cap\lambda_{u}:e\in\mathcal{E}\}, and materialize its join result as one relation. We then apply the classic Yannakakis algorithm [54] on the materialized relations by invoking the SemiJoin primitive for semi-joins and the RelaxedTwoWay primitive for pairwise joins. After removing dangling tuples, the size of each two-way join is upper bounded by the size of the final join result and, therefore, by τ\tau. This leads to a relaxed oblivious algorithm whose access pattern only depends on NN and τ\tau.

Theorem 5.3.

For a join query 𝒬\mathcal{Q}, an instance \mathcal{R} of input size NN, and parameter τ|𝒬()|\tau\geq|\mathcal{Q}(\mathcal{R})|, there is a cache-agnostic algorithm that can compute 𝒬()\mathcal{Q}(\mathcal{R}) with O((Nw+τ)log(Nw+τ))O\left((N^{w}+\tau)\cdot\log(N^{w}+\tau)\right) time complexity and O(Nw+τBlogMBNw+τB)O\left(\frac{N^{w}+\tau}{B}\cdot\log_{\frac{M}{B}}\frac{N^{w}+\tau}{B}\right) cache complexity, whose access pattern only depends on NN and τ\tau, where ww is the fractional hypertree width of 𝒬\mathcal{Q}.

6 Conclusion

This paper has introduced a general framework for oblivious multi-way join processing, achieving near-optimal time and cache complexity. However, several intriguing questions remain open for future exploration:

  • Balancing privacy and efficiency. Recent research has investigated relaxed security notions, such as differentially oblivious algorithms [14], to achieve better trade-offs between privacy and efficiency and to overcome worst-case overheads. It remains interesting to explore such relaxations in the context of multi-way join processing.

  • Emit model for EM algorithms. In the context of EM join algorithms, the emit model, where join results are directly output without being written back to disk, has been considered. It remains open whether oblivious, worst-case optimal join algorithms can be developed without requiring all join results to be written back to disk.

  • Communication-oblivious join algorithms for the MPC model. A natural connection exists between the MPC and EM models in join processing. While recent work has explored communication-oblivious algorithms in the MPC model [13, 49], extending these ideas to multi-way join processing remains an open challenge.

References

  • [1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of databases, volume 8. Addison-Wesley Reading, 1995.
  • [2] M. Abo Khamis, H. Q. Ngo, and A. Rudra. Faq: questions asked frequently. In PODS, pages 13–28, 2016.
  • [3] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, 1988.
  • [4] M. Ajtai, J. Komlós, and E. Szemerédi. An O(n log n) sorting network. In STOC, pages 1–9, 1983.
  • [5] A. Arasu and R. Kaushik. Oblivious query processing. ICDT, 2013.
  • [6] L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. Ian Munro. An optimal cache-oblivious priority queue and its application to graph algorithms. SIAM Journal on Computing, 36(6):1672–1695, 2007.
  • [7] G. Asharov, I. Komargodski, W.-K. Lin, K. Nayak, E. Peserico, and E. Shi. Optorama: Optimal oblivious ram. In Eurocrypt, pages 403–432. Springer, 2020.
  • [8] A. Atserias, M. Grohe, and D. Marx. Size bounds and query plans for relational joins. In FOCS, pages 739–748. IEEE, 2008.
  • [9] K. E. Batcher. Sorting networks and their applications. In Proceedings of the April 30–May 2, 1968, spring joint computer conference, pages 307–314, 1968.
  • [10] P. Beame, P. Koutris, and D. Suciu. Communication steps for parallel query processing. JACM, 64(6):1–58, 2017.
  • [11] C. Beeri, R. Fagin, D. Maier, and M. Yannakakis. On the desirability of acyclic database schemes. JACM, 30(3):479–513, 1983.
  • [12] A. Beimel, K. Nissim, and M. Zaheri. Exploring differential obliviousness. In APPROX/RANDOM, 2019.
  • [13] T. Chan, K.-M. Chung, W.-K. Lin, and E. Shi. Mpc for mpc: secure computation on a massively parallel computing architecture. In ITCS. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
  • [14] T. H. Chan, K.-M. Chung, B. M. Maggs, and E. Shi. Foundations of differentially oblivious algorithms. In SODA, pages 2448–2467. SIAM, 2019.
  • [15] Z. Chang, D. Xie, and F. Li. Oblivious ram: A dissection and experimental evaluation. Proc. VLDB Endow., 9(12):1113–1124, 2016.
  • [16] Z. Chang, D. Xie, F. Li, J. M. Phillips, and R. Balasubramonian. Efficient oblivious query processing for range and knn queries. TKDE, 2021.
  • [17] Z. Chang, D. Xie, S. Wang, and F. Li. Towards practical oblivious join. In SIGMOD, 2022.
  • [18] S. Chu, D. Zhuo, E. Shi, and T.-H. H. Chan. Differentially Oblivious Database Joins: Overcoming the Worst-Case Curse of Fully Oblivious Algorithms. In ITC, volume 199, pages 19:1–19:24, 2021.
  • [19] V. Costan and S. Devadas. Intel sgx explained. Cryptology ePrint Archive, 2016.
  • [20] N. Crooks, M. Burke, E. Cecchetti, S. Harel, R. Agarwal, and L. Alvisi. Obladi: Oblivious serializable transactions in the cloud. In OSDI, pages 727–743, 2018.
  • [21] E. D. Demaine. Cache-oblivious algorithms and data structures. Lecture Notes from the EEF Summer School on Massive Data Sets, 8(4):1–249, 2002.
  • [22] S. Deng and Y. Tao. Subgraph enumeration in optimal i/o complexity. In ICDT. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2024.
  • [23] S. Devadas, M. v. Dijk, C. W. Fletcher, L. Ren, E. Shi, and D. Wichs. Onion oram: A constant bandwidth blowup oblivious ram. In TCC, pages 145–174. Springer, 2016.
  • [24] S. Eskandarian and M. Zaharia. Oblidb: Oblivious query processing for secure databases. Proc. VLDB Endow., 13(2).
  • [25] R. Fagin. Degrees of acyclicity for hypergraphs and relational database schemes. JACM, 30(3):514–550, 1983.
  • [26] A. Z. Fan, P. Koutris, and H. Zhao. Tight bounds of circuits for sum-product queries. SIGMOD, 2(2):1–20, 2024.
  • [27] J. Flum, M. Frick, and M. Grohe. Query evaluation via tree-decompositions. JACM, 49(6):716–752, 2002.
  • [28] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In FOCS, pages 285–297. IEEE, 1999.
  • [29] C. Gentry, K. A. Goldman, S. Halevi, C. Julta, M. Raykova, and D. Wichs. Optimizing oram and using it efficiently for secure computation. In PETs, pages 1–18. Springer, 2013.
  • [30] O. Goldreich. Towards a theory of software protection and simulation by oblivious rams. In STOC, pages 182–194, 1987.
  • [31] O. Goldreich and R. Ostrovsky. Software protection and simulation on oblivious rams. JACM, 43(3):431–473, 1996.
  • [32] M. T. Goodrich. Data-oblivious external-memory algorithms for the compaction, selection, and sorting of outsourced data. In SPAA, pages 379–388, 2011.
  • [33] G. Gottlob, N. Leone, and F. Scarcello. Hypertree decompositions and tractable queries. JCSS, 64(3):579–627, 2002.
  • [34] H. Hacigümüş, B. Iyer, C. Li, and S. Mehrotra. Executing sql over encrypted data in the database-service-provider model. In SIGMOD, pages 216–227, 2002.
  • [35] B. He and Q. Luo. Cache-oblivious nested-loop joins. In CIKM, pages 718–727, 2006.
  • [36] X. Hu. Cover or pack: New upper and lower bounds for massively parallel joins. In PODS, pages 181–198, 2021.
  • [37] X. Hu, M. Qiao, and Y. Tao. I/o-efficient join dependency testing, loomis–whitney join, and triangle enumeration. JCSS, 82(8):1300–1315, 2016.
  • [38] B. Ketsman and D. Suciu. A worst-case optimal multi-round algorithm for parallel computation of conjunctive queries. In PODS, pages 417–428, 2017.
  • [39] P. Koutris, P. Beame, and D. Suciu. Worst-case optimal algorithms for parallel query processing. In ICDT. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
  • [40] S. Krastnikov, F. Kerschbaum, and D. Stebila. Efficient oblivious database joins. VLDB, 13(12):2132–2145, 2020.
  • [41] E. Kushilevitz, S. Lu, and R. Ostrovsky. On the (in)security of hash-based oblivious ram and a new balancing scheme. In SODA, pages 143–156. SIAM, 2012.
  • [42] W.-K. Lin, E. Shi, and T. Xie. Can we overcome the n log n barrier for oblivious sorting? In SODA, pages 2419–2438. SIAM, 2019.
  • [43] H. Q. Ngo, E. Porat, C. Ré, and A. Rudra. Worst-case optimal join algorithms. JACM, 65(3):1–40, 2018.
  • [44] H. Q. Ngo, C. Ré, and A. Rudra. Skew strikes back: New developments in the theory of join algorithms. Acm Sigmod Record, 42(4):5–16, 2014.
  • [45] V. Ramachandran and E. Shi. Data oblivious algorithms for multicores. In SPAA, pages 373–384, 2021.
  • [46] S. Sasy, A. Johnson, and I. Goldberg. Fast fully oblivious compaction and shuffling. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 2565–2579, 2022.
  • [47] E. Shi. Path oblivious heap: Optimal and practical oblivious priority queue. In 2020 IEEE Symposium on Security and Privacy (SP), pages 842–858. IEEE, 2020.
  • [48] E. Stefanov, M. V. Dijk, E. Shi, T.-H. H. Chan, C. Fletcher, L. Ren, X. Yu, and S. Devadas. Path oram: an extremely simple oblivious ram protocol. JACM, 65(4):1–26, 2018.
  • [49] Y. Tao, R. Wang, and S. Deng. Parallel communication obliviousness: One round and beyond. Proceedings of the ACM on Management of Data, 2(5):1–24, 2024.
  • [50] T. L. Veldhuizen. Leapfrog triejoin: A simple, worst-case optimal join algorithm. In ICDT, 2014.
  • [51] J. S. Vitter. External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys (CSUR), 33(2):209–271, 2001.
  • [52] X. Wang, H. Chan, and E. Shi. Circuit oram: On tightness of the goldreich-ostrovsky lower bound. In CCS, pages 850–861, 2015.
  • [53] Y. Wang and K. Yi. Query evaluation by circuits. In PODS, 2022.
  • [54] M. Yannakakis. Algorithms for acyclic database schemes. In VLDB, pages 82–94, 1981.
  • [55] W. Zheng, A. Dave, J. G. Beekman, R. A. Popa, J. E. Gonzalez, and I. Stoica. Opaque: An oblivious and encrypted distributed analytics platform. In NSDI 17, pages 283–298, 2017.

Appendix A Missing Materials in Section 1

Graph Joins. A join query 𝒬=(𝒱,)\mathcal{Q}=(\mathcal{V},\mathcal{E}) is a graph join if |e|2|e|\leq 2 for each ee\in\mathcal{E}, i.e., each relation contains at most two attributes.

Loomis-Whitney Joins.

A join query 𝒬=(𝒱,)\mathcal{Q}=(\mathcal{V},\mathcal{E}) is a Loomis-Whitney join if 𝒱={x1,x2,,xk}\mathcal{V}=\{x_{1},x_{2},\cdots,x_{k}\} and ={𝒱{xi}:i[k]}\mathcal{E}=\{\mathcal{V}-\{x_{i}\}:i\in[k]\}.

Appendix B Oblivious Primitives in Section 2

We provide the algorithm descriptions and pseudocode for the oblivious primitives declared in Section 2.2. We do not need to establish obliviousness for the local variables used in these primitives, namely key, val, 𝗉𝗈𝗌\mathsf{pos} and 𝖼𝗇𝗍\mathsf{cnt}, because they are stored in trusted memory during the entire execution of the algorithms, so the adversary cannot observe the access pattern to them. In contrast, all temporary arrays of non-constant size, such as KK and LL, are stored in untrusted memory.

SemiJoin.

Given two input relations RR, SS and their common attribute(s) xx, the goal is to replace each tuple in RR that cannot be joined with any tuple in SS with a dummy tuple \perp, i.e., compute RSR\ltimes S. As shown in Algorithm 10, we first sort all tuples by their join values and break ties by putting SS-tuples before RR-tuples if they share the same join value in xx. We then perform a linear scan, using an additional variable key to track the join value of the most recently visited SS-tuple, which, by the sort order, is the largest SS join value no larger than that of the current tuple tt. More specifically, we distinguish two cases on tt. Suppose tRt\in R. If πxt=key\pi_{x}t=\textsf{key}, we just write tt to the result array LL. Otherwise, we write a dummy tuple \perp to LL. Suppose tSt\in S. We simply write a dummy tuple \perp to LL and update key with πxt\pi_{x}t. At last, we compact the elements in LL, moving all \perp to the end and keeping the first |R||R| tuples of LL.

1 KK\leftarrow Sort RSR\cup S by attribute(s) xx, breaking ties by putting SS-tuples before RR-tuples when they have the same value in xx;
2key\textsf{key}\leftarrow\perp, LL\leftarrow\emptyset;
3 while read tt from KK do
4       if tRt\in R then
5             if tt\neq\perp and πxt=key\pi_{x}t=\textsf{key} then write tt to LL;
6             else write \perp to LL;
7            
8      else write \perp to LL, keyπxt\textsf{key}\leftarrow\pi_{x}t;
10      
11return Compact LL while keeping the first |R||R| tuples;
Algorithm 10 SemiJoin(R,S,xR,S,x)
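As a complement to the pseudocode, the following minimal Python sketch replays the scan of Algorithm 10 under simplifying assumptions: a single join attribute, Python's built-in sort standing in for the oblivious Sort primitive, and a plain filter standing in for Compact (so the sketch itself is not oblivious); all names are illustrative. The point to observe is that exactly one element is written per tuple of the sorted input, so the write pattern depends only on |R|+|S|.

DUMMY = None  # stands for "⊥"

def semi_join(R, S):
    """Sketch of Algorithm 10 for tuples of the form (join_value, payload)."""
    # sort R ∪ S by join value; S-tuples (tag 0) come before R-tuples (tag 1) on ties
    K = sorted([(t[0], 0, t) for t in S] + [(t[0], 1, t) for t in R],
               key=lambda x: (x[0], x[1]))
    key, L = DUMMY, []
    for join_val, tag, t in K:
        if tag == 1:                  # t ∈ R: keep it only if an S-tuple shares its key
            L.append(t if join_val == key else DUMMY)
        else:                         # t ∈ S: emit a dummy and remember its join value
            L.append(DUMMY)
            key = join_val
    # oblivious compaction would move dummies to the end; keep the first |R| slots
    real = [t for t in L if t is not DUMMY]
    return (real + [DUMMY] * len(R))[:len(R)]

print(semi_join([(1, "a"), (2, "b"), (3, "c")], [(2, "x"), (2, "y")]))
# [(2, 'b'), None, None]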
1 KSortK\leftarrow\textrm{{Sort}} RR by attribute(s) xx with all \perp moved to the last;
2 𝗄𝖾𝗒\mathsf{key}\leftarrow\perp, val0\textsf{val}\leftarrow 0, LL\leftarrow\emptyset;
3 while read tt from KK do
4       if t=t=\perp then  write \perp to LL;
5       else if tt\neq\perp and πxt=𝗄𝖾𝗒\pi_{x}t=\mathsf{key} then  write \perp to LL, valvalw(t)\textsf{val}\leftarrow\textsf{val}\oplus w(t);
6       else  write (key,val)(\textsf{key},\textsf{val}) to LL, valw(t)\textsf{val}\leftarrow w(t), keyπxt\textsf{key}\leftarrow\pi_{x}t;
7      
8write (key,val)(\textsf{key},\textsf{val}) to LL;
9 return Compact LL while keeping the first |R||R| tuples;
Algorithm 11 ReduceByKey(R,x,w(),R,x,w(\cdot),\oplus)

ReduceByKey.

Given an input relation RR defined over attributes ee, some of whose tuples are distinguished as \perp, a set of key attribute(s) xex\subseteq e, a weight function ww, and an aggregate function \oplus, the goal is to output, for each key value, the aggregate obtained by applying \oplus to the weights of all tuples sharing that key value. This primitive can be used to compute degree information, i.e., the number of tuples with a specific key value in a relation.

As shown in Algorithm 11, we sort all tuples by their key values (values in attribute(s) xx) while moving all distinguished tuples to the end of the relation. Then, we perform a linear scan, using an additional variable key to track the key value of the previous tuple, and val to track the aggregate over the weights of the tuples visited so far with that key. We distinguish three cases. If t=t=\perp, the remaining tuples in KK are all distinguished as \perp, implied by the sorting, and we write a dummy tuple \perp to LL. If tt\neq\perp and πxt=key\pi_{x}t=\textsf{key}, we write a dummy tuple \perp to LL and update val to valw(t)\textsf{val}\oplus w(t). If tt\neq\perp and keyπxt\textsf{key}\neq\pi_{x}t, the weights of all tuples with key key have already been aggregated into val; in this case, we write (key,val)(\textsf{key},\textsf{val}) to LL, update val with w(t)w(t), i.e., the weight of the current tuple, and update key with πxt\pi_{x}t. After the scan, we write the final pair (key,val)(\textsf{key},\textsf{val}) to LL (line 8). At last, we compact the tuples in LL by moving all \perp to the end and keeping the first |R||R| tuples in LL for obliviousness.
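For instance, with key-weight pairs (b1,2),(b1,3),(b2,5) and ⊕ = +, the scan writes a dummy entry for each of the first two pairs, writes (b1,5) when the key changes to b2, and writes (b2,5) after the loop; compaction then returns (b1,5),(b2,5) followed by one dummy tuple.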

1 KK\leftarrow Sort RSR\cup S by attribute(s) xx while moving all \perp to the last and breaking ties by putting SS-tuples before RR-tuples when they have the same value in xx;
2 key\textsf{key}\leftarrow\perp, val0\textsf{val}\leftarrow 0 , LL\leftarrow\emptyset;
3 while read tt from KK do
4       if t=t=\perp then write \perp to LL;
5       else if tSt\in S then
6             write \perp to LL, valπx¯t\textsf{val}\leftarrow\pi_{\bar{x}}t, keyπxt\textsf{key}\leftarrow\pi_{x}t
7      else if tRt\in R and πxt=key\pi_{x}t=\textsf{key} then write (t,val)(t,\textsf{val}) to LL;
8       else write \perp to LL;
9      
10return Compact LL while keeping the first |R||R| tuples;
Algorithm 12 Annotate(R,S,x)R,S,x))

Annotate.

Given an input relation RR, where each tuple is associated with a key, and a list SS of key-value pairs, where each pair has a distinct key, the goal is to attach to each tuple in RR the value of the pair in SS whose key matches it. As shown in Algorithm 12, we first sort all tuples in RR and SS by their key values in attribute xx, while moving all \perp to the end and breaking ties by putting SS-tuples before RR-tuples when they have the same key value. We then perform a linear scan, using two additional variables key,val\textsf{key},\textsf{val} to track the most recently visited SS-pair, i.e., the one with the largest key no larger than the key of the current tuple visited. We distinguish the following cases. If tt is an SS-tuple and tt\neq\perp, we update key,val\textsf{key},\textsf{val} with the key and value of tt and write \perp to LL. If tt is an RR-tuple, tt\neq\perp, and its key equals key, we attach val to tt by writing (t,val)(t,\textsf{val}) to LL. We write a dummy tuple \perp to LL in all remaining cases. Finally, we compact the tuples in LL to move the dummy tuples to the end, keeping the first |R||R| tuples.
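For instance, if R contains tuples r1,r2 with keys b1,b2 respectively and S = {(b1,5)}, the sorted order is (b1,5),r1,r2; the scan writes ⊥,(r1,5),⊥, and compaction returns (r1,5) followed by one dummy tuple, since no pair in S matches the key b2.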

1 KK\leftarrow Sort RR by attribute(s) xx;
2 key\textsf{key}\leftarrow\perp, val0\textsf{val}\leftarrow 0, LL\leftarrow\emptyset;
3 foreach tKt\in K do
4       if πxt=key\pi_{x}t=\textsf{key} then  valval+1\textsf{val}\leftarrow\textsf{val}+1 ;
5       else val1\textsf{val}\leftarrow 1,   keyπxt\textsf{key}\leftarrow\pi_{x}t;
6       write (t,val)(t,\textsf{val}) to LL;
7      
8return LL;
Algorithm 13 MultiNumber(R,xR,x)

MultiNumber.

Given an input relation RR, where each tuple is associated with a key in attribute(s) xx, the goal is to attach consecutive numbers 1,2,3,,1,2,3,\cdots, to the tuples sharing the same key.

As shown in Algorithm 13, we first sort all tuples in RR by attribute xx. We then perform a linear scan, using two additional variables key,val\textsf{key},\textsf{val} to track the key of the previous tuple and the number assigned to it. Consider tt as the current element visited. If πxt=key\pi_{x}t=\textsf{key}, we simply increase val by 11. Otherwise, we set val to 11 and update key with πxt\pi_{x}t. In both cases, we assign val to tuple tt and write (t,val)(t,\textsf{val}) to LL.
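For instance, three tuples with keys b1,b1,b2 (in sorted order) are assigned the numbers 1, 2, and 1, respectively.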

Project.

Given an input relation RR defined over attributes ee, and a subset of attributes xex\subseteq e, the goal is to output the list {πxt:tR}\{\pi_{x}t:t\in R\} (without duplicates). This primitive can be simply implemented by sorting by attribute(s) xx and then removing duplicates with a linear scan.

Intersect.

Given two input arrays R,SR,S, each containing distinct elements, the goal is to output the common elements appearing in both RR and SS. This primitive can be implemented by sorting by attribute(s) xx, after which a linear scan suffices to identify the common elements.

Augment.

Given two relations R,SR,S of at most NN tuples and their common attribute(s) xx, the goal is to attach to each tuple tRt\in R the number of tuples in SS that can be joined with tt on xx; Algorithm 14 states the slightly more general version in which RR is augmented with one such count for each of the relations S1,S2,,SkS_{1},S_{2},\cdots,S_{k}. The Augment primitive can be implemented by the ReduceByKey and Annotate primitives.

1 foreach i[k]i\in[k] do
2       LReduceByKey(Si,x)L\leftarrow\textsc{ReduceByKey}(S_{i},x);
3       RAnnotate(R,L,x)R\leftarrow\textsc{Annotate}(R,L,x);
4      
5return RR;
Algorithm 14 Augment(R,{S1,S2,,Sk},xR,\{S_{1},S_{2},\cdots,S_{k}\},x)

Appendix C RelaxedTwoWay Primitive

Given two relations R,SR,S of N1,N2N_{1},N_{2} tuples and an integral parameter τ\tau, where N1+N2=NN_{1}+N_{2}=N and |RS|τ|R\Join S|\leq\tau, the goal is to output a relation of size τ\tau whose first |RS||R\Join S| tuples are the join results and the remaining tuples are dummy tuples. Arasu et al. [5] first proposed an oblivious algorithm for τ=|RS|\tau=|R\Join S|, but it involves rather complicated primitives and does not give complete details [16]. Krastnikov et al. [40] later showed a cleaner and more effective version, but their algorithm does not have a satisfactory cache complexity. Below, we present our own version of the relaxed two-way join. We need one important helper primitive first.

Expand Primitive.

Given a sequence (ti,wi):wi+,i[N]\langle(t_{i},w_{i}):w_{i}\in\mathbb{Z}^{+},i\in[N]\rangle and a parameter τi[N]wi\tau\geq\sum_{i\in[N]}w_{i}, the goal is to expand each tuple tit_{i} into wiw_{i} copies and output a table of τ\tau tuples. The naive way of reading a pair (ti,wi)(t_{i},w_{i}) and then writing wiw_{i} copies does not preserve obliviousness, since the number of consecutive writes can leak the weights. Alternatively, one might consider writing a fixed number of tuples after reading each pair. Still, the order in which pairs are read is critical to avoid dummy writes and to avoid storing too many pairs in trusted memory (this is exactly the strategy adopted by [5]).

We present a simpler algorithm by composing our oblivious primitives. Suppose LL is the output table of RR, such that LL contains wiw_{i} copies of tit_{i}, and any copy of tit_{i} comes before any copy of tjt_{j} if i<ji<j. As described in Algorithm 15, the algorithm consists of the following phases:

  • (lines 1-4) for each pair (ti,wi)R(t_{i},w_{i})\in R with wi0w_{i}\neq 0, attach the beginning position of tit_{i} in R~\tilde{R}, which is 1+j<iwj1+\sum_{j<i}w_{j}. For the remaining pairs with wi=0w_{i}=0, replace them with \perp and attach the infinite position, as these tuples will not participate in any join result;

  • (lines 5-7) pad τ\tau dummy tuples and attach to them the consecutive positions 1.5,2.5,1.5,2.5,\cdots; after sorting by position, each tuple tit_{i} will be followed by wiw_{i} dummy tuples, and all tuples with infinite positions are placed at the end;

  • (lines 8-14) for each tuple tit_{i}, we replace it with \perp and replace the following wiw_{i} dummy tuples with copies of tit_{i}. After moving all remaining dummy tuples to the end, the first τ\tau elements form the output.
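For instance, on input ⟨(t1,2),(t2,0),(t3,1)⟩ with τ=5 (treating the zero-weight pair as a dummy, as described above), the first phase writes (t1,1) and (t3,3) and replaces (t2,0) by (⊥,+∞); after padding dummies at positions 1.5,2.5,…,5.5 and sorting, the scan of lines 8-14 produces ⊥,t1,t1,⊥,t3,⊥,⊥,⊥, and the final compaction returns t1,t1,t3,⊥,⊥.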

1 𝗉𝗈𝗌1\mathsf{pos}\leftarrow 1, KK\leftarrow\emptyset;
2 while read (ti,wi)(t_{i},w_{i}) from RR do
3      if ti=t_{i}=\perp or wi=0w_{i}=0 then write (,+)(\perp,+\infty) to KK;
4       else write (ti,𝗉𝗈𝗌)(t_{i},\mathsf{pos}) to KK, 𝗉𝗈𝗌𝗉𝗈𝗌+wi\mathsf{pos}\leftarrow\mathsf{pos}+w_{i};
5      
6𝗉𝗈𝗌1.5\mathsf{pos}\leftarrow 1.5;
7 foreach i[τ]i\in[\tau] do write (,𝗉𝗈𝗌)(\perp,\mathsf{pos}) to KK, 𝗉𝗈𝗌𝗉𝗈𝗌+1\mathsf{pos}\leftarrow\mathsf{pos}+1;
8 Sort KK by 𝗉𝗈𝗌\mathsf{pos};
9 tt\leftarrow\perp, 𝖼𝗇𝗍0\mathsf{cnt}\leftarrow 0, LL\leftarrow\emptyset;
10 while read (key,𝗉𝗈𝗌)(\textsf{key},\mathsf{pos}) from KK do
11       if 𝗉𝗈𝗌=+\mathsf{pos}=+\infty then  write \perp to LL;
12       else if key\textsf{key}\neq\perp then tkeyt\leftarrow\textsf{key}, write \perp to LL;
13       else if 𝖼𝗇𝗍<τ\mathsf{cnt}<\tau then write tt to LL ;
14       else write \perp to LL;
15       𝖼𝗇𝗍𝖼𝗇𝗍+1\mathsf{cnt}\leftarrow\mathsf{cnt}+1;
16      
17return Compact LL while keeping the first τ\tau elements;
Algorithm 15 Expand(R=(ti,wi):i[N],τR=\langle(t_{i},w_{i}):i\in[N]\rangle,\tau)

It can be easily checked that the access pattern of Expand only depends on the values of τ\tau and NN. Moreover, Expand is cache-agnostic since it is a sequential composition of cache-agnostic primitives (Scan, Sort, and Compact).

Lemma C.1.

Given a relation \mathcal{R} of input size NN and a parameter τ\tau, the Expand primitive is cache-agnostic with O((N+τ)log(N+τ))O\left((N+\tau)\cdot\log(N+\tau)\right) time complexity and O(N+τBlogMBN+τB)O\left(\frac{N+\tau}{B}\log_{\frac{M}{B}}\frac{N+\tau}{B}\right) cache complexity, whose access pattern only depends on NN and τ\tau.

Now, we are ready to describe the algorithmic details of the RelaxedTwoWay primitive. The high-level idea is to simulate the sort-merge join algorithm without revealing the movement of the pointers in the merge phase. Let L=R(x1,x2)S(x2,x3)L=R(x_{1},x_{2})\Join S(x_{2},x_{3}) be the join result sorted by x2,x3,x1x_{2},x_{3},x_{1} lexicographically. The idea is to transform RR and SS into the projections of LL onto attributes (x1,x2)(x_{1},x_{2}) and (x2,x3)(x_{2},x_{3}), respectively, without removing duplicates. Then, a one-to-one merge suffices to obtain the final join result. As described in Algorithm 16, we construct these two expanded relations from the input relations R,SR,S via the following steps (a running example is given in Figure 1):

  • (line 1) attach each tuple with the number of tuples it can be joined in the other relation;

  • (line 2) expand each tuple to the annotated number of copies;

  • (lines 3-4) prepare the expanded R~\tilde{R} and S~\tilde{S} with the “correct” ordering, as it appears in the final sort-merge join results;

  • (lines 5-8) perform a one-to-one merge of ordered tuples in R~\tilde{R} and S¯\bar{S};

As a sequential composition of (relaxed) oblivious primitives, RelaxedTwoWay is cache-agnostic, with O((N+τ)log(N+τ))O((N+\tau)\cdot\log(N+\tau)) time complexity and O(N+τBlogMBN+τB)O(\frac{N+\tau}{B}\cdot\log_{\frac{M}{B}}\frac{N+\tau}{B}) cache complexity, whose access pattern only depends on NN and τ\tau.

1 R^Augment(R,S,x2)\hat{R}\leftarrow\textsc{Augment}(R,S,x_{2}), S^Augment(S,R,x2)\hat{S}\leftarrow\textsc{Augment}(S,R,x_{2});
2 R~Expand(R^,τ)\tilde{R}\leftarrow\textsc{Expand}(\hat{R},\tau), S~Expand(S^,τ)\tilde{S}\leftarrow\textsc{Expand}(\hat{S},\tau);
3 S¯MultiNumber(S~,x2)\bar{S}\leftarrow\textsc{MultiNumber}(\tilde{S},x_{2}); // S~\tilde{S} is enriched with another attribute 𝗇𝗎𝗆\mathsf{num}
4 Sort S¯\bar{S} by attributes x2x_{2} and 𝗇𝗎𝗆\mathsf{num} lexicographically;
5 LL\leftarrow\emptyset;
6 while read t1t_{1} from R~\tilde{R} and read t2t_{2} from S¯\bar{S} do
7       if t1t_{1}\not=\perp and t2t_{2}\not=\perp then write t1t2t_{1}\Join t_{2} to LL;
8       else write \perp to LL;
9      
10return LL;
Algorithm 16 RelaxedTwoWay(R(x1,x2),S(x2,x3),τR(x_{1},x_{2}),S(x_{2},x_{3}),\tau)
Figure 1: A running example of Algorithm 16.

Appendix D Missing Materials in Section 5

1 (𝒯,λ)(\mathcal{T},\lambda)\leftarrow a GHD of 𝒬\mathcal{Q};
2 foreach node u𝒯u\in\mathcal{T} do
3       u{eλu:e}\mathcal{E}_{u}\leftarrow\{e\cap\lambda_{u}:e\in\mathcal{E}\};
4       foreach ee\in\mathcal{E} do Se,uπeλuReS_{e,u}\leftarrow\pi_{e\cap\lambda_{u}}R_{e} by the Project primitive;
5       𝒬uObliviousGenericJoin((λu,u),{Se,u:e})\mathcal{Q}_{u}\leftarrow\textsc{ObliviousGenericJoin}\left((\lambda_{u},\mathcal{E}_{u}),\{S_{e,u}:e\in\mathcal{E}\}\right);
6      
7while visit nodes u𝒯u\in\mathcal{T} in a bottom-up way (excluding the root) do
8       pup_{u}\leftarrow the parent node of uu;
9       𝒬puSemiJoin(𝒬pu,𝒬u)\mathcal{Q}_{p_{u}}\leftarrow\textsc{SemiJoin}(\mathcal{Q}_{p_{u}},\mathcal{Q}_{u});
10      
11while visit nodes u𝒯u\in\mathcal{T} in a top-down way (excluding the leaves) do
12       foreach child node vv of uu do  𝒬vSemiJoin(𝒬v,𝒬u)\mathcal{Q}_{v}\leftarrow\textsc{SemiJoin}(\mathcal{Q}_{v},\mathcal{Q}_{u});
13      
14while visit nodes u𝒯u\in\mathcal{T} in a bottom-up way (excluding the root) do
15       pup_{u}\leftarrow the parent node of uu;
16       𝒬puRelaxedTwoWay(𝒬pu,𝒬u,τ)\mathcal{Q}_{p_{u}}\leftarrow\textsc{RelaxedTwoWay}(\mathcal{Q}_{p_{u}},\mathcal{Q}_{u},\tau);
17      
18return 𝒬r\mathcal{Q}_{r} for the root node rr of 𝒯\mathcal{T};
Algorithm 17 RelaxedJoin(𝒬=(𝒱,),,τ)\textsc{RelaxedJoin}(\mathcal{Q}=(\mathcal{V},\mathcal{E}),\mathcal{R},\tau)