Complex event recognition meets hierarchical conjunctive queries

Dante Pinto, Cristian Riveros
Pontificia Universidad Católica de Chile
drpinto1@uc.cl, cristian.riveros@uc.cl
Abstract

Hierarchical conjunctive queries (HCQ) are a subclass of conjunctive queries (CQ) with robust algorithmic properties. Among others, Berkholz, Keppeler, and Schweikardt have shown that HCQ is the subclass of CQ (without projection) that admits dynamic query evaluation with constant update time and constant delay enumeration. In a different but related setting stands Complex Event Recognition (CER), a prominent technology for evaluating sequence patterns over streams. Since one can interpret a data stream as an unbounded sequence of inserts in dynamic query evaluation, it is natural to ask to what extent CER can take advantage of HCQ to find a robust class of queries that can be evaluated efficiently.

In this paper, we seek to combine HCQ with sequence patterns to find a class of CER queries that can get the best of both worlds. To reach this goal, we propose a complex event automata model called Parallelized Complex Event Automata (PCEA) for evaluating CER queries with correlation (i.e., joins) over streams. This model allows us to express sequence patterns and compare values among tuples, but it also allows us to express conjunctions by incorporating a novel form of non-determinism that we call parallelization. We show that for every HCQ (under bag semantics), we can construct an equivalent PCEA. Further, we show that HCQ is the biggest class of acyclic CQ that this automata model can define. Then, PCEA stands as a sweet spot that precisely expresses HCQ (i.e., among acyclic CQ) and extends them with sequence patterns. Finally, we show that PCEA also inherits the good algorithmic properties of HCQ by presenting a streaming evaluation algorithm under sliding windows with logarithmic update time and output-linear delay for the class of PCEA with equality predicates.

1 Introduction

Hierarchical Conjunctive Queries [12] (HCQ) are a subclass of Conjunctive Queries (CQ) with good algorithmic properties for dynamic query evaluation [9, 18]. In this scenario, users want to continuously evaluate a CQ over a database that receives insertions, updates, or deletions of tuples, and to efficiently retrieve the output after each modification. A landmark result by Berkholz, Keppeler, and Schweikardt [5] shows that HCQ are the subfragment among CQ for dynamic query evaluation. Specifically, they show one can evaluate every HCQ with constant update time and constant-delay enumeration. Furthermore, they show that HCQ are the only class of full CQ (i.e., CQ without projection) with such guarantees, namely, under fine-grained complexity assumptions, a full CQ can be evaluated with constant update time and constant delay enumeration if, and only if, the query is hierarchical. Therefore, HCQ stand as the fragment for efficient evaluation under a dynamic scenario (see also [18]).

Data stream processing is another dynamic scenario where we want to evaluate queries continuously but now over an unbounded sequence of tuples (i.e., a data stream). Complex Event Recognition (CER) is one such technology for processing information flow [14, 11]. CER systems read high-velocity streams of data, called events, and evaluate expressive patterns for detecting complex events, a subset of relevant events that witness a critical case for a user. A singular aspect of CER compared to other frameworks is that the order of the stream’s data matters, reflecting the temporal order of events in reality (see [27]). For this reason, sequencing operators are first-class citizens in CER query languages, which one combines with other operators, like filtering, disjunction, and correlation (i.e., joins), among others [4].

Similar to dynamic query evaluation, this work aims to find a class of CER query languages with efficient streaming query evaluation. Our strategy to pursue this goal is simple but effective: we use HCQ as a starting point to guide our search for CER query languages with good algorithmic properties. Since one can interpret a data stream as an unbounded sequence of inserts in dynamic query evaluation, we want to extend HCQ with sequencing while maintaining efficient evaluation. We plan this strategy from an algorithmic point of view. Instead of studying which CER query language fragments have such properties, we look for automata models that can express HCQ. By finding such a model, we can later design our CER query language to express these queries [17].

With this goal and strategy in mind, we start from the proposal of Chain Complex Event Automata (CCEA), an automata model for CER expressing sequencing queries with correlation, but that cannot express simple HCQ [16]. We extend this model with a new sort of non-deterministic power that we call parallelization. This feature allows us to run several parallel executions that start independently and to gather them together when reading new data items. We define the class of Parallelized Complex Event Automata (PCEA), the extension of CCEA with parallelization. As an extension, PCEA can express patterns with sequencing, disjunction, iteration, and correlation but also allows conjunction. In particular, we can show that PCEA can express an acyclic CQ $Q$ if, and only if, $Q$ is hierarchical. Then, PCEA is a sweet spot that precisely expresses HCQ (i.e., among acyclic CQ) and extends them with sequencing and other operations. Moreover, we show that PCEA inherits the good algorithmic properties of HCQ by presenting a streaming evaluation algorithm under sliding windows, reaching our desired goal.

Contributions

The technical contributions and outline of the paper are the following.

In Section 2, we provide some basic definitions and recall the definition of CCEA.

In Section 3, we introduce the concept of parallelization for standard NFA, which yields a model called PFA, and study its properties. We show that PFA can be determinized in exponential time (similar to NFA) (Proposition 3.2). We then apply this notion to CER and define the class of PCEA, showing that it is strictly more expressive than CCEA (Proposition 3.4).

Section 4 compares PCEA with HCQ under bag semantics. Given that PCEA runs over streams and HCQ over relational databases, we must revisit the semantics of HCQ and formalize in which sense an HCQ and a PCEA define the same query. We show that under such comparison, every HCQ $Q$ under bag semantics can be expressed by a PCEA with equality predicates of exponential size in $|Q|$ and of quadratic size if $Q$ does not have self joins (Theorem 4.1). Furthermore, if $Q$ is acyclic but not hierarchical, then $Q$ cannot be defined by any PCEA (Theorem 4.2).

In Section 5, we study the evaluation of PCEA in a streaming scenario. Specifically, we present a streaming evaluation algorithm under a sliding window with logarithmic update time and output-linear delay for the class of unambiguous PCEA with equality predicates (Theorem 5.1).

Related work

Dynamic query evaluation of HCQ and acyclic CQ has been studied in [5, 18, 28, 20]. This research line did not study HCQ or acyclic CQ in the presence of order predicates. [29, 26] studied CQ under comparisons (i.e., $\theta$-joins) but in a static setting (i.e., no updates). The closest work is [19], which studied dynamic query evaluation of CQ with comparisons; however, this work did not study well-behaved classes of HCQ with comparisons, and, further, their algorithms have update time linear in the data.

Complex event recognition and, more generally, data stream processing have studied the evaluation of joins over streams (see, e.g., [31, 30, 21]). To the best of our knowledge, no work in this research line optimizes queries focused on HCQ or provides guarantees regarding update time or enumeration delay in this setting. We base our work on [16], which we will discuss extensively.

2 Preliminaries

Strings and NFA

A string is a sequence of elements $\bar{s}=a_{0}\ldots a_{n-1}$. For presentation purposes, we make no distinction between a sequence or a string and, thus, we also write $\bar{s}=a_{0},\ldots,a_{n-1}$ for denoting a string. We will denote strings using a bar and its $i$-th element by $\bar{s}[i]=a_{i}$. We use $|\bar{s}|=n$ for the length of $\bar{s}$ and $\{\bar{s}\}=\{a_{0},\ldots,a_{n-1}\}$ to consider $\bar{s}$ as a set. Given two strings $\bar{s}$ and $\bar{s}^{\prime}$, we write $\bar{s}\bar{s}^{\prime}$ for the concatenation of $\bar{s}$ followed by $\bar{s}^{\prime}$. Further, we say that $\bar{s}^{\prime}$ is a prefix of $\bar{s}$, written as $\bar{s}^{\prime}\preceq_{p}\bar{s}$, if $|\bar{s}^{\prime}|\leq|\bar{s}|$ and $\bar{s}^{\prime}[i]=\bar{s}[i]$ for all $i<|\bar{s}^{\prime}|$. Given a non-empty set $\Sigma$ we denote by $\Sigma^{*}$ the set of all strings from elements in $\Sigma$, where $\epsilon\in\Sigma^{*}$ denotes the $0$-length string. For a function $f:\Sigma\rightarrow\Omega$ and $\bar{s}\in\Sigma^{*}$, we write $f(\bar{s})=f(a_{0})\ldots f(a_{n-1})$ to denote the point-wise application of $f$ over $\bar{s}$.

A Non-deterministic Finite Automaton (NFA) is a tuple $\pazocal{A}=(Q,\Sigma,\Delta,I,F)$ such that $Q$ is a finite set of states, $\Sigma$ is a finite alphabet, $\Delta\subseteq Q\times\Sigma\times Q$ is the transition relation, and $I$ and $F$ are the sets of initial and final states, respectively. A run of $\pazocal{A}$ over a string $\bar{s}=a_{0}\ldots a_{n-1}\in\Sigma^{*}$ is a non-empty sequence $p_{0}\ldots p_{n}$ such that $p_{0}\in I$, and $(p_{i},a_{i},p_{i+1})\in\Delta$ for every $i<n$. We say that $\pazocal{A}$ accepts a string $\bar{s}\in\Sigma^{*}$ iff there exists a run $p_{0}\ldots p_{n}$ of $\pazocal{A}$ over $\bar{s}$ such that $p_{n}\in F$. We define the language $\pazocal{L}(\pazocal{A})\subseteq\Sigma^{*}$ as the set of all strings accepted by $\pazocal{A}$. Finally, we say that $\pazocal{A}$ is a Deterministic Finite Automaton (DFA) iff $\Delta$ is given as a partial function $\Delta:Q\times\Sigma\rightarrow Q$ and $|I|=1$.
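To make the definition of acceptance concrete, the following minimal sketch decides whether an NFA accepts a string by tracking the set of states that some run can reach after each symbol. The encoding of the transition relation as a set of triples and the toy automaton (strings over {a, b} that end in "ab") are assumptions made for this illustration and do not come from the paper.

```python
def nfa_accepts(delta, initial, final, s):
    """Return True iff some run p_0 ... p_n over s ends in a final state."""
    current = set(initial)  # candidates for p_0
    for a in s:
        # Candidates for p_{i+1}: targets of transitions (p_i, a_i, p_{i+1}).
        current = {q for (p, b, q) in delta if p in current and b == a}
    return bool(current & set(final))


# Hypothetical NFA for strings over {a, b} that end in "ab".
DELTA = {
    ("s0", "a", "s0"),
    ("s0", "b", "s0"),
    ("s0", "a", "s1"),
    ("s1", "b", "s2"),
}
INITIAL = {"s0"}
FINAL = {"s2"}
```

Calling `nfa_accepts(DELTA, INITIAL, FINAL, "aab")` explores all runs at once, so no backtracking is needed.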

Schemas, tuples, and streams

Fix a set $\textbf{D}$ of data values. A relational schema $\sigma$ (or just schema) is a pair $(\textbf{T},\operatorname{arity})$ where $\textbf{T}$ are the relation names and $\operatorname{arity}:\textbf{T}\rightarrow\mathbb{N}$ maps each name to a number, that is, its arity. An $R$-tuple of $\sigma$ (or just a tuple) is an object $R(a_{0},\ldots,a_{k-1})$ such that $R\in\textbf{T}$, each $a_{i}\in\textbf{D}$, and $k=\operatorname{arity}(R)$. We will write $R(\bar{a})$ to denote a tuple with values $\bar{a}$. We denote by $\operatorname{Tuples}[\sigma]$ the set of all $R$-tuples of all $R\in\textbf{T}$. We define the size of a tuple $R(\bar{a})$ as $|R(\bar{a})|=\sum_{i=0}^{k-1}|\bar{a}[i]|$ with $k=\operatorname{arity}(R)$ where $|\bar{a}[i]|$ is the size of the data value $\bar{a}[i]\in\textbf{D}$, which depends on the domain.

A stream $\pazocal{S}$ over $\sigma$ is an infinite sequence of tuples $\pazocal{S}=t_{0}t_{1}t_{2}\ldots$ such that $t_{i}\in\operatorname{Tuples}[\sigma]$ for every $i\geq 0$. For a running example, consider the schema $\sigma_{0}$ with relation names $\textbf{T}=\{R,S,T\}$, $\operatorname{arity}(R)=\operatorname{arity}(S)=2$ and $\operatorname{arity}(T)=1$. A stream $\pazocal{S}_{0}$ over $\sigma_{0}$ could be the following:

$$\pazocal{S}_{0}\ :=\ \underbrace{S(2,11)}_{0}\ \underbrace{T(2)}_{1}\ \underbrace{R(1,10)}_{2}\ \underbrace{S(2,11)}_{3}\ \underbrace{T(1)}_{4}\ \underbrace{R(2,11)}_{5}\ \underbrace{S(4,13)}_{6}\ \underbrace{T(1)}_{7}\ \ldots$$

where we add an index (i.e., the position) below each tuple (for simplification, we use $\textbf{D}=\mathbb{N}$).

Predicates

For a fixed $k$, a $k$-predicate $P$ is a subset of $\operatorname{Tuples}[\sigma]^{k}$. Further, we say that $\bar{t}=(t_{1},\ldots,t_{k})$ satisfies $P$ iff $\bar{t}\in P$. We say that $P$ is unary if $k=1$ and binary if $k=2$. In the following, we denote any class of unary or binary predicates by $\textbf{U}$ or $\textbf{B}$, respectively.

Although we define our automata models for any class of unary and binary predicates, the following two predicate classes will be relevant for algorithmic purposes (see Sections 4 and 5). Let $\sigma$ be a schema. We denote by $\textbf{U}_{\operatorname{lin}}$ the class of all unary predicates $U$ such that, for every $t\in\operatorname{Tuples}[\sigma]$, one can decide in linear time over $|t|$ whether $t$ satisfies $U$ or not. In addition, we denote by $\textbf{B}_{\operatorname{eq}}$ the class of all equality predicates defined as follows: a binary predicate $B$ is an equality predicate iff there exist partial functions $\overleftarrow{B}$ and $\vec{B}$ over $\operatorname{Tuples}[\sigma]$ such that, for every $t_{1},t_{2}\in\operatorname{Tuples}[\sigma]$, $(t_{1},t_{2})\in B$ iff $\overleftarrow{B}(t_{1})$ and $\vec{B}(t_{2})$ are defined and $\overleftarrow{B}(t_{1})=\vec{B}(t_{2})$. Further, we require that one can compute $\overleftarrow{B}(t_{1})$ and $\vec{B}(t_{2})$ in linear time over $|t_{1}|$ and $|t_{2}|$, respectively. For example, recall our schema $\sigma_{0}$ and consider the binary predicate $(Tx,Sxy)=\{(T(a),S(a,b))\mid a,b\in\textbf{D}\}$. Then, by using the functions $\overleftarrow{B}(T(a))=a$ and $\vec{B}(S(a,b))=a$, one can check that $(Tx,Sxy)$ is an equality predicate.
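The definition of an equality predicate through a pair of partial functions can be sketched in a few lines. In the sketch below, the names `Tup`, `left_Tx`, `right_Sxy`, and `satisfies` are hypothetical, and returning `None` is an assumed encoding of "undefined" for a partial function; only the predicate $(Tx,Sxy)$ itself comes from the text.

```python
from typing import Callable, Hashable, NamedTuple, Optional, Tuple


class Tup(NamedTuple):
    rel: str                # relation name, e.g. "T" or "S"
    vals: Tuple[int, ...]   # data values (here D is the natural numbers)


Key = Optional[Hashable]    # None encodes "undefined" (assumption of this sketch)


def left_Tx(t: Tup) -> Key:
    """Left function of (Tx, Sxy): defined only on T-tuples, returns x."""
    return t.vals[0] if t.rel == "T" else None


def right_Sxy(t: Tup) -> Key:
    """Right function of (Tx, Sxy): defined only on S-tuples, returns x."""
    return t.vals[0] if t.rel == "S" else None


def satisfies(left: Callable[[Tup], Key], right: Callable[[Tup], Key],
              t1: Tup, t2: Tup) -> bool:
    """(t1, t2) is in B iff both functions are defined and their values agree."""
    k1, k2 = left(t1), right(t2)
    return k1 is not None and k2 is not None and k1 == k2
```

Because membership reduces to comparing two keys, an evaluation algorithm could index previously seen tuples by the value of the left function and look up matches by the value of the right function; this is a design remark about the sketch, not a description of the paper's algorithm.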

Note that $\textbf{B}_{\operatorname{eq}}$ is a more general class of equality predicates than the ones used in [16], and it will serve in our automata models for comparing tuples by “equality” on different subsets of attributes. We take here a more semantic presentation, where the equality comparison between tuples is directly given by the functions $\overleftarrow{B}$ and $\vec{B}$ and not symbolically by some formula.

Chain complex event automata

A Chain Complex Event Automaton (CCEA) [16] is a tuple $\pazocal{C}=(Q,\textbf{U},\textbf{B},\Omega,\Delta,I,F)$ where $Q$ is a finite set of states, $\textbf{U}$ is a set of unary predicates, $\textbf{B}$ is a set of binary predicates, $\Omega$ is a finite set of labels, $I:Q\rightarrow\textbf{U}\times(2^{\Omega}\setminus\{\emptyset\})$ is a partial initial function, $F\subseteq Q$ is the set of final states, and $\Delta$ is a finite transition relation of the form $\Delta\subseteq Q\times\textbf{U}\times\textbf{B}\times(2^{\Omega}\setminus\{\emptyset\})\times Q$. Let $\pazocal{S}=t_{0}t_{1}\ldots$ be a stream. A configuration of $\pazocal{C}$ over $\pazocal{S}$ is a tuple $(p,i,L)\in Q\times\mathbb{N}\times(2^{\Omega}\setminus\{\emptyset\})$, representing that the automaton $\pazocal{C}$ is at state $p$ after having read and marked $t_{i}$ with the set of labels $L$. For $\ell\in\Omega$, we say that $(p,i,L)$ marked position $i$ with $\ell$ iff $\ell\in L$. Given a position $n\in\mathbb{N}$, we say that a configuration is accepting at position $n$ iff it is of the form $(p,n,L)$ and $p\in F$. Then a run $\rho$ of $\pazocal{C}$ over $\pazocal{S}$ is a sequence of configurations:

$$\rho\ :=\ (p_{0},i_{0},L_{0}),(p_{1},i_{1},L_{1}),\ldots,(p_{n},i_{n},L_{n})$$

such that $i_{0}<i_{1}<\ldots<i_{n}$, $I(p_{0})=(U,L_{0})$ is defined and $t_{i_{0}}\in U$, and there exists a transition $(p_{j-1},U_{j},B_{j},L_{j},p_{j})\in\Delta$ such that $t_{i_{j}}\in U_{j}$ and $(t_{i_{j-1}},t_{i_{j}})\in B_{j}$ for every $j\in[1,n]$. Intuitively, a run of a CCEA is a subsequence of the stream that can follow a path of transitions, where each transition checks a local condition (i.e., the unary predicate $U_{j}$) and a join condition (i.e., the binary predicate $B_{j}$) with the previous tuple. For the first tuple, a CCEA can only check a local condition (i.e., there is no previous tuple).

Given a run $\rho$ like above, we define its valuation $\nu_{\rho}:\Omega\rightarrow 2^{\mathbb{N}}$ such that $\nu_{\rho}(\ell)$ is the set consisting of all positions in $\rho$ marked by $\ell$, formally, $\nu_{\rho}(\ell)=\{i_{j}\mid j\leq n\wedge\ell\in L_{j}\}$. Further, given a position $n\in\mathbb{N}$, we say that $\rho$ is an accepting run at position $n$ iff $(p_{n},i_{n},L_{n})$ is an accepting configuration at $n$. Then the output of $\pazocal{C}$ over $\pazocal{S}$ at position $n$ is defined as:

$$[\![\pazocal{C}]\!]_{n}(\pazocal{S})\ =\ \{\nu_{\rho}\mid\text{$\rho$ is an accepting run at position $n$ of $\pazocal{C}$ over $\pazocal{S}$}\}.$$

Example 2.1.

Below, we show an example of a CCEA over the schema $\sigma_{0}$ with $\Omega=\{\bullet\}$:

[Diagram of $\pazocal{C}_{0}$: states $q_{0}$, $q_{1}$, $q_{2}$, with edge labels $T\,/\,\bullet$, then $S,(Tx,Sxy)\,/\,\bullet$, then $R,(Sxy,Rxy)\,/\,\bullet$.]

We use $T$ to denote the predicate $T=\{T(a)\mid a\in\textbf{D}\}$ and similar for $S$ and $R$. Further, we use $(Tx,Sxy)$ and $(Sxy,Rxy)$ to denote equality predicates as defined above. An accepting run of $\pazocal{C}_{0}$ over $\pazocal{S}_{0}$ is $\rho=(q_{0},1,\{\bullet\}),(q_{1},3,\{\bullet\}),(q_{2},5,\{\bullet\})$, which produces the valuation $\nu_{\rho}=\{\bullet\mapsto\{1,3,5\}\}$ that represents the subsequence $T(2),S(2,11),R(2,11)$ of $\pazocal{S}_{0}$. Intuitively, $\pazocal{C}_{0}$ defines all subsequences of the form $T(a),S(a,b),R(a,b)$ for every $a,b\in\textbf{D}$.
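The run in Example 2.1 can be reproduced with a small brute-force enumeration. The sketch below is only an illustration of the run definition, not an efficient evaluation algorithm: it encodes $\pazocal{C}_{0}$ with the initial function on $q_{0}$ and $q_{2}$ as the only final state, which is an assumption read off the accepting run in the example, and it drops the label sets since $\Omega=\{\bullet\}$. The helper names `rel`, `eq_on`, and `accepting_runs` are hypothetical.

```python
# Stream S_0 from the running example (positions 0..7).
S0 = [("S", (2, 11)), ("T", (2,)), ("R", (1, 10)), ("S", (2, 11)),
      ("T", (1,)), ("R", (2, 11)), ("S", (4, 13)), ("T", (1,))]


def rel(name):
    """Unary predicate: the tuple belongs to relation `name`."""
    return lambda t: t[0] == name


def eq_on(left_idx, right_idx):
    """Compare chosen attributes; relation names are checked by the unary predicates."""
    return lambda t1, t2: (tuple(t1[1][i] for i in left_idx)
                           == tuple(t2[1][i] for i in right_idx))


# Assumed encoding of C_0 (label sets omitted because Omega = {•}).
INITIAL = {"q0": rel("T")}
DELTA = [("q0", rel("S"), eq_on([0], [0]), "q1"),        # S, (Tx, Sxy)
         ("q1", rel("R"), eq_on([0, 1], [0, 1]), "q2")]  # R, (Sxy, Rxy)
FINAL = {"q2"}


def accepting_runs(stream, n):
    """Position lists of all accepting runs whose last position is n."""
    results = []

    def extend(state, positions):
        last = positions[-1]
        if last == n:
            if state in FINAL:
                results.append(positions)
            return
        for (p, unary, binary, q) in DELTA:
            if p != state:
                continue
            for i in range(last + 1, n + 1):
                # Local condition on t_i, then join condition with the previous tuple.
                if unary(stream[i]) and binary(stream[last], stream[i]):
                    extend(q, positions + [i])

    for q, unary in INITIAL.items():
        for i in range(n + 1):
            if unary(stream[i]):
                extend(q, [i])
    return results
```

On the stream prefix above, `accepting_runs(S0, 5)` yields the single position list `[1, 3, 5]`, matching the valuation $\{\bullet\mapsto\{1,3,5\}\}$ of the example.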

Note that the definition of CCEA above differs from [16] to fit our purpose better. Specifically, we use a set of labels $\Omega$ to annotate positions in the streams and define valuations in the same spirit as the model of annotated automata used in [3, 23]. One can see this extension as a generalization of the model in [16], where $|\Omega|=1$. This extension will be helpful to enrich the outputs of our models for comparing them with hierarchical conjunctive queries with self-joins (see Section 4).

Computational model

For our algorithms, we assume the computational model of Random Access Machines (RAM) with uniform cost measure, and addition as its basic operation [1, 15]. This RAM has read-only registers for the input, read-write registers for the work, and write-only registers for the output. This computation model is a standard assumption in the literature [5, 6].

3 Parallelized complex event automata

This section presents our automata model for specifying CER queries with conjunction, called Parallelized Complex Event Automata (PCEA), which strictly generalizes CCEA by adding a new feature called parallelization. For the sake of presentation, we first formalize the notion of parallelization for NFA and then extend the idea to CCEA. Before this, we need the notation of labeled trees that will be useful for our definitions and proofs.

Labeled trees

As is common in the area [24], we define (unordered) trees as a finite set of strings $t\subseteq\mathbb{N}^{\ast}$ that satisfies two conditions: (1) $t$ contains the empty string (i.e., $\varepsilon\in t$), and (2) $t$ is a prefix-closed set, namely, if $a_{1}\ldots a_{n}\in t$, then $a_{1}\ldots a_{j}\in t$ for every $j<n$. We will refer to the strings of $t$ as nodes, and the root of a tree, $\texttt{root}(t)$, will be the empty string $\varepsilon$.

Let $\bar{u},\bar{v}\in t$ be nodes. The depth of $\bar{u}$ will be given by its length, $\texttt{depth}_{t}(\bar{u})=|\bar{u}|$. We say that $\bar{u}$ is the parent of $\bar{v}$ and write $\texttt{parent}_{t}(\bar{v})=\bar{u}$ if $\bar{v}=\bar{u}\cdot n$ for some $n\in\mathbb{N}$. Likewise, we say that $\bar{v}$ is a child of $\bar{u}$ if $\bar{u}$ is the parent of $\bar{v}$ and define $\texttt{children}_{t}(\bar{u})=\{\bar{v}\in t\mid\texttt{parent}_{t}(\bar{v})=\bar{u}\}$. Similarly, we define the descendants of $\bar{u}$ as $\texttt{desc}_{t}(\bar{u})=\{\bar{v}\in t\mid\bar{u}\preceq_{p}\bar{v}\}$ and the ancestors as $\texttt{ancst}_{t}(\bar{u})=\{\bar{v}\in t\mid\bar{v}\preceq_{p}\bar{u}\}$; note that $\bar{u}\in\texttt{desc}_{t}(\bar{u})$ and $\bar{u}\in\texttt{ancst}_{t}(\bar{u})$. A node $\bar{u}$ is a leaf of $t$ if $\texttt{desc}_{t}(\bar{u})=\{\bar{u}\}$, and an inner node if it is not a leaf node. We define the set of leaves of $\bar{u}$ as $\texttt{leaves}_{t}(\bar{u})=\{\bar{v}\in\texttt{desc}_{t}(\bar{u})\mid\bar{v}\text{ is a leaf node}\}$.
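The tree notions above translate directly into operations on sets of strings. The following sketch assumes that nodes are encoded as Python tuples of natural numbers (so the root is the empty tuple) and that a tree is a Python set of such tuples; the function names mirror the definitions but are otherwise hypothetical.

```python
def is_prefix(u, v):
    """u is a prefix of v."""
    return len(u) <= len(v) and v[:len(u)] == u


def is_tree(t):
    """t contains the empty string and is prefix-closed."""
    return () in t and all(v[:j] in t for v in t for j in range(len(v)))


def depth(u):
    return len(u)


def parent(v):
    """Parent of a non-root node; the root has no parent."""
    return v[:-1] if v else None


def children(t, u):
    return {v for v in t if v and v[:-1] == u}


def desc(t, u):
    return {v for v in t if is_prefix(u, v)}


def ancst(t, u):
    return {v for v in t if is_prefix(v, u)}


def leaves(t, u):
    """Leaves among the descendants of u (a leaf has only itself as descendant)."""
    return {v for v in desc(t, u) if desc(t, v) == {v}}


# A small example tree: the root has children 0 and 1; node 0 has children 00 and 01.
T = {(), (0,), (1,), (0, 0), (0, 1)}
```

For instance, `leaves(T, ())` collects the nodes `(1,)`, `(0, 0)`, and `(0, 1)`, while `leaves(T, (1,))` contains only `(1,)` itself.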

A labeled tree $\tau$ is a function $\tau\colon t\rightarrow L$ where $t$ is a tree and $L$ is any finite set of labels. We use $\operatorname{dom}(\tau)$ to denote the underlying tree structure $t$ of $\tau$. Given that $\tau$ is a function, we can write $\tau(\bar{u})$ to denote the label of node $\bar{u}\in\operatorname{dom}(\tau)$. To simplify the notation, we extend all the definitions above for a tree $t$ to a labeled tree $\tau$, changing $t$ by $\operatorname{dom}(\tau)$. For example, we write $\bar{u}\in\tau$ to refer to $\bar{u}\in\operatorname{dom}(\tau)$, or $\texttt{parent}_{\tau}(\bar{u})$ to refer to $\texttt{parent}_{\operatorname{dom}(\tau)}(\bar{u})$. Finally, we say that two labeled trees $\tau$ and $\tau^{\prime}$ are isomorphic if there exists a bijection $f\colon\operatorname{dom}(\tau)\rightarrow\operatorname{dom}(\tau^{\prime})$ such that $\bar{u}\preceq_{p}\bar{v}$ iff $f(\bar{u})\preceq_{p}f(\bar{v})$ and $\tau(\bar{u})=\tau^{\prime}(f(\bar{u}))$ for every $\bar{u},\bar{v}\in\operatorname{dom}(\tau)$. We will usually say that $\tau$ and $\tau^{\prime}$ are equal, meaning they are isomorphic.

Parallelized finite automata

A Parallelized Finite Automaton (PFA) is a tuple $\mathcal{P}=(Q,\Sigma,\Delta,I,F)$ where $Q$ is a finite set of states, $\Sigma$ is a finite alphabet, $I,F\subseteq Q$ are the sets of initial and accepting states, respectively, and $\Delta\subseteq 2^{Q}\times\Sigma\times Q$ is the transition relation. We define the size of $\mathcal{P}$ as $|\mathcal{P}|=|Q|+\sum_{(P,a,q)\in\Delta}(|P|+1)$, namely, the number of states plus the size of encoding the transitions.

A run tree of a PFA $\mathcal{P}$ over a string $\bar{s}=a_{1}\ldots a_{n}\in\Sigma^{\ast}$ is a labeled tree $\tau\colon t\rightarrow Q$ such that $\texttt{depth}_{\tau}(\bar{u})=n$ for every leaf $\bar{u}\in\tau$; in other words, every node of $\tau$ is labeled by a state of $\mathcal{P}$ and all branches have the same length $n$. In addition, $\tau$ must satisfy the following two conditions: (1) every leaf node $\bar{u}$ of $\tau$ is labeled by an initial state (i.e., $\tau(\bar{u})\in I$) and (2) for every inner node $\bar{v}$ at depth $i$ (i.e., $\texttt{depth}_{\tau}(\bar{v})=i$) there must be a transition $(P,a_{n-i},q)\in\Delta$ such that $\tau(\bar{v})=q$, $|\texttt{children}_{\tau}(\bar{v})|=|P|$ and $P=\{\tau(\bar{u})\mid\bar{u}\in\texttt{children}_{\tau}(\bar{v})\}$, that is, children have different labels and $P$ is the set of labels in the children of $\bar{v}$. We say that $\tau$ is an accepting run of $\mathcal{P}$ over $\bar{s}$ iff $\tau$ is a run of $\mathcal{P}$ over $\bar{s}$ and $\tau(\varepsilon)\in F$ (recall that $\varepsilon=\texttt{root}(\tau)$). We say that $\mathcal{P}$ accepts a string $\bar{s}\in\Sigma^{\ast}$ if there is an accepting run of $\mathcal{P}$ over $\bar{s}$, and we define the language recognized by $\mathcal{P}$, $\pazocal{L}(\mathcal{P})$, as the set of strings that $\mathcal{P}$ accepts.

Example 3.1.

In Figure 1 (left), we show the example of a PFA $\mathcal{P}_{0}$ over the alphabet $\Sigma=\{T,S,R\}$. Intuitively, the upper part (i.e., $p_{0},p_{1}$) looks for a symbol $T$, the lower part (i.e., $p_{2},p_{3}$) for a symbol $S$, and both runs join together in $p_{4}$ when they see a symbol $R$. Then, $\mathcal{P}_{0}$ defines all strings that contain symbols $T$ and $S$ (in any order) before a symbol $R$.
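One way to test acceptance by a PFA without building run trees is to read the string left to right while maintaining the set of states that can label the root of some run tree over the prefix read so far. The sketch below follows that idea. The concrete transitions of the example automaton are an assumption reconstructed from the description in Example 3.1 (self-loops on every state, initial states $p_{0}$ and $p_{2}$, accepting state $p_{4}$), and the function name `pfa_accepts` is hypothetical.

```python
def pfa_accepts(delta, initial, final, s):
    """Return True iff some run tree over s has its root labeled by a final state."""
    # States that can be the root of a run tree over the empty prefix.
    reachable = set(initial)
    for a in s:
        # q becomes a root if some transition (P, a, q) has a non-empty P
        # whose states are all roots of run trees over the previous prefix.
        reachable = {q for (P, b, q) in delta
                     if b == a and P and P <= reachable}
    return bool(reachable & set(final))


SIGMA = "TSR"
STATES = ("p0", "p1", "p2", "p3", "p4")

# Assumed transitions of the example PFA: a Sigma self-loop on every state,
# p0 -T-> p1, p2 -S-> p3, and the joint transition {p1, p3} -R-> p4.
DELTA = {(frozenset({p}), a, p) for p in STATES for a in SIGMA}
DELTA |= {
    (frozenset({"p0"}), "T", "p1"),
    (frozenset({"p2"}), "S", "p3"),
    (frozenset({"p1", "p3"}), "R", "p4"),
}
INITIAL = {"p0", "p2"}
FINAL = {"p4"}
```

Since the maintained set is always a subset of the states, this simulation visits at most exponentially many distinct sets, which gives some intuition for why such automata stay within the regular languages.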

One can see that PFA is a generalization of NFA. Indeed, an NFA is a special case of a PFA where each run tree $\tau$ is a line. Nevertheless, PFA do not add expressive power to NFA, given that PFA is another model for recognizing regular languages, as the next result shows.

Proposition 3.2.

For every PFA $\mathcal{P}$ with $n$ states there exists a DFA $\pazocal{A}$ with at most $2^{n}$ states such that $\pazocal{L}(\mathcal{P})=\pazocal{L}(\pazocal{A})$. In particular, all languages defined by PFA are regular.

Intuitively, one could interpret a PFA as an Alternating Finite Automaton (AFA) [7] that runs backwards over the string (however, a PFA still processes the string in a forward direction). It was shown in [7, Theorems 5.2 and 5.3] that for every AFA with $n$ states that defines a language $L$, there exists an equivalent DFA with $2^{2^{n}}$ states in the worst case that recognizes $L$. Nevertheless, they argued that the reverse language $L^{R}=\{a_{1}a_{2}\ldots a_{n}\in\Sigma^{\ast}\mid a_{n}\ldots a_{2}a_{1}\in L\}$ can always be accepted by a DFA with at most $2^{n}$ states. Then, one can see Proposition 3.2 as a consequence of reversing an alternating automaton. Despite this connection, we use here PFA as a proper automata model, which was not studied or used in [7]. Another related proposal is the parallel finite automata model presented in [25]. Indeed, one can consider PFA as a restricted case of this model, although it was not studied in [25]. For this reason, we decided to name the PFA model with the same acronym but a slightly different name from the one in [25].

[Figure 1 diagrams. Left: PFA $\mathcal{P}_{0}$ with states $p_{0},p_{1},p_{2},p_{3},p_{4}$ and edge labels $\Sigma$ (five occurrences), $T$, $S$, and $R$. Right: PCEA $\pazocal{P}_{0}$ with states $q_{0},q_{1},q_{2}$ and edge labels $T\,/\,\bullet$, $S\,/\,\bullet$, $(Tx,Rxy)$, $(Sxy,Rxy)$, and $R/\bullet$.]
Figure 1: On the left, an example of a PFA and, on the right, an example of a PCEA.

Parallelized complex event automata

A Parallelized Complex Event Automaton (PCEA) is the extension of CCEA with the idea of parallelization as in PFA. Specifically, a PCEA is a tuple $\pazocal{P}=(Q,\textbf{U},\textbf{B},\Omega,\Delta,F)$, where $Q$, $\textbf{U}$, $\textbf{B}$, $\Omega$, and $F$ are the same as for CCEA, and $\Delta$ is a finite transition relation of the form:

$$\Delta\subseteq 2^{Q}\times\textbf{U}\times\textbf{B}^{Q}\times(2^{\Omega}\setminus\{\emptyset\})\times Q,$$

where $\textbf{B}^{Q}$ are all partial functions $\mathcal{B}:Q\rightarrow\textbf{B}$ that associate a state $q$ to a binary predicate $\mathcal{B}(q)$. We define the size of $\pazocal{P}$ as $|\pazocal{P}|=|Q|+\sum_{(P,U,\mathcal{B},L,q)\in\Delta}(|P|+|L|)$. Note that $\pazocal{P}$ does not define the initial function explicitly. As we will see, transitions of the form $(\emptyset,U,\mathcal{B},L,q)$ will play the role of the initial function on a run of $\pazocal{P}$.

Next, we extend the notion of a run from CCEA to its parallelized version. Let $\pazocal{S}=t_{0}t_{1}\ldots$ be a stream. A run tree of $\pazocal{P}$ over $\pazocal{S}$ is now a labeled tree $\tau:t\rightarrow(Q\times\mathbb{N}\times(2^{\Omega}\setminus\{\emptyset\}))$ where each node $\bar{u}\in\tau$ is labeled with a configuration $\tau(\bar{u})=(q,i,L)$ such that, for every child $\bar{v}\in\texttt{children}_{\tau}(\bar{u})$ with $\tau(\bar{v})=(p,j,M)$, it holds that $j<i$. In other words, the positions of $\tau$-configurations increase towards the root of $\tau$, similar to the runs of a CCEA. In addition, $\bar{u}$ must satisfy the transition relation $\Delta$, that is, there must exist a transition $(P,U,\mathcal{B},L,q)\in\Delta$ such that (1) $t_{i}\in U$, (2) $|\texttt{children}_{\tau}(\bar{u})|=|P|$ and $P=\{p\mid\exists\bar{v}\in\texttt{children}_{\tau}(\bar{u}).\,\tau(\bar{v})=(p,j,M)\}$, and (3) for every $\bar{v}\in\texttt{children}_{\tau}(\bar{u})$ with $\tau(\bar{v})=(p,j,M)$, $(t_{j},t_{i})\in\mathcal{B}(p)$. Similar to PFA, condition (2) forces that there exists a bijection between $P$ and the states at the children of $\bar{u}$. Instead, condition (3) forces that two consecutive configurations $(p,j,M)$ and $(q,i,L)$ must satisfy the binary predicate $\mathcal{B}(p)$ associated with $p$. Notice that, if $\bar{u}$ is a leaf node in $\tau$, then it must hold that $P=\emptyset$ and condition (3) is trivially satisfied. Also, note that we do not assume that all leaves are at the same depth.

Given a position $n\in\mathbb{N}$, we say that $\tau$ is an accepting run at position $n$ iff the root configuration $\tau(\varepsilon)$ is accepting at position $n$. Further, we define the output of a run $\tau$ as the valuation $\nu_{\tau}:\Omega\rightarrow 2^{\mathbb{N}}$ such that $\nu_{\tau}(\ell)=\{i\mid\exists\bar{u}\in\tau.\ \tau(\bar{u})=(q,i,L)\wedge\ell\in L\}$ for every label $\ell\in\Omega$. Finally, the output of a PCEA $\pazocal{P}$ over $\pazocal{S}$ at the position $n$ is defined as:

$$[\![\pazocal{P}]\!]_{n}(\pazocal{S})\ =\ \{\nu_{\tau}\mid\text{$\tau$ is an accepting run at position $n$ of $\pazocal{P}$ over $\pazocal{S}$}\}.$$

Example 3.3.

In Figure 1 (right), we show an example of a PCEA $\pazocal{P}_{0}$ over schema $\sigma_{0}$ with $\Omega=\{\bullet\}$. We use the same notation as in Example 2.1 to represent unary and equality predicates. If we run $\pazocal{P}_{0}$ over $\pazocal{S}_{0}$, we have the following two run trees at position $5$:

$\tau_{0}$: root $(q_{2},5,\bullet)$ with children $(q_{0},1,\bullet)$ and $(q_{1},3,\bullet)$
$\tau_{1}$: root $(q_{2},5,\bullet)$ with children $(q_{0},1,\bullet)$ and $(q_{1},0,\bullet)$

which produce the valuations $\nu_{\tau_{0}}=\{\bullet\mapsto\{1,3,5\}\}$ and $\nu_{\tau_{1}}=\{\bullet\mapsto\{0,1,5\}\}$, representing the subsequences $T(2),S(2,11),R(2,11)$ and $S(2,11),T(2),R(2,11)$ of $\pazocal{S}_{0}$, respectively. Note that the former is an output of $\pazocal{C}_{0}$ in Example 2.1, but the latter is not.
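The two run trees of Example 3.3 can be recomputed with a brute-force enumeration that follows the run-tree conditions literally. This is a sketch for illustration only, not the streaming algorithm of Section 5. It assumes the following encoding of $\pazocal{P}_{0}$, reconstructed from the example: two transitions with empty source set (into $q_{0}$ on $T$ and into $q_{1}$ on $S$) and one joint transition from $\{q_{0},q_{1}\}$ into the final state $q_{2}$ on $R$. Label sets are dropped since $\Omega=\{\bullet\}$, so each run tree is summarized by its set of positions; all helper names are hypothetical.

```python
from itertools import product

# Stream S_0 from the running example (positions 0..7).
S0 = [("S", (2, 11)), ("T", (2,)), ("R", (1, 10)), ("S", (2, 11)),
      ("T", (1,)), ("R", (2, 11)), ("S", (4, 13)), ("T", (1,))]


def rel(name):
    """Unary predicate: the tuple belongs to relation `name`."""
    return lambda t: t[0] == name


def eq_on(left_idx, right_idx):
    """Compare chosen attributes; relation names are checked by the unary predicates."""
    return lambda t1, t2: (tuple(t1[1][i] for i in left_idx)
                           == tuple(t2[1][i] for i in right_idx))


# Assumed transitions (P, U, B, q) of P_0.
DELTA = [
    (frozenset(), rel("T"), {}, "q0"),
    (frozenset(), rel("S"), {}, "q1"),
    (frozenset({"q0", "q1"}), rel("R"),
     {"q0": eq_on([0], [0]),               # (Tx, Rxy)
      "q1": eq_on([0, 1], [0, 1])},        # (Sxy, Rxy)
     "q2"),
]
FINAL = {"q2"}


def run_trees(stream, q, i):
    """Position sets of all run trees whose root configuration is (q, i, {•})."""
    out = []
    for (P, unary, B, target) in DELTA:
        if target != q or not unary(stream[i]):      # condition (1)
            continue
        options = []                                  # one child per state in P: condition (2)
        for p in sorted(P):
            subtrees = []
            for j in range(i):                        # child positions satisfy j < i
                subs = run_trees(stream, p, j)
                if subs and B[p](stream[j], stream[i]):   # condition (3)
                    subtrees.extend(subs)
            options.append(subtrees)
        for combo in product(*options):
            out.append(frozenset({i}).union(*combo))
    return out


def output_at(stream, n):
    """Set of valuations (as position sets) of accepting run trees at position n."""
    return {v for q in FINAL for v in run_trees(stream, q, n)}
```

Evaluating `output_at(S0, 5)` gives the two position sets `{1, 3, 5}` and `{0, 1, 5}` of the example. The enumeration recomputes subtrees many times, which is precisely the cost that a dedicated evaluation algorithm would avoid.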

It is easy to see that every CCEA is a PCEA where every transition $(P,U,\mathcal{B},L,q)\in\Delta$ satisfies that $|P|\leq 1$. Additionally, the previous example gives evidence that PCEA is a strict generalization of CCEA, namely, there exists no CCEA that can define $\pazocal{P}_{0}$. Intuitively, since a CCEA can only compare the current tuple to the last tuple, for a stream like $\pazocal{S}=R(a,b),T(a),S(a,b)$ it would be impossible to check conditions over the second attribute of tuples $R(a,b)$ and $S(a,b)$.

Proposition 3.4.

PCEA is strictly more expressive than CCEA.

Unambiguous PCEA

We end this section by introducing a subclass of PCEA relevant to our algorithmic results. Let $\pazocal{P}$ be a PCEA and $\tau$ a run of $\pazocal{P}$ over some stream. We say that $\tau$ is simple iff for every two different nodes $\bar{u},\bar{u}^{\prime}\in\tau$ with $\tau(\bar{u})=(q,i,L)$ and $\tau(\bar{u}^{\prime})=(q^{\prime},i^{\prime},L^{\prime})$, if $i=i^{\prime}$, then $L\cap L^{\prime}=\emptyset$. In other words, $\tau$ is simple if all positions of the valuation $\nu_{\tau}$ are uniquely represented in $\tau$. We say that $\pazocal{P}$ is unambiguous if (1) every accepting run of $\pazocal{P}$ is simple and (2) for every stream $\pazocal{S}$ and accepting run $\tau$ of $\pazocal{P}$ over $\pazocal{S}$ with valuation $\nu_{\tau}$, there is no other run $\tau^{\prime}$ of $\pazocal{P}$ with valuation $\nu_{\tau^{\prime}}$ such that $\nu_{\tau}=\nu_{\tau^{\prime}}$. For example, the reader can check that $\pazocal{P}_{0}$ is unambiguous.

Condition (2) of unambiguous PCEA ensures that each output is witnessed by exactly one run. This condition is common in MSO enumeration [2, 22] for a one-to-one correspondence between outputs and runs. Condition (1) forces a correspondence between the size of the run and the size of the output it represents. As we will see, both conditions will be helpful for our evaluation algorithm, and satisfied by our translation of hierarchical conjunctive queries into PCEA in the next section.

4 Representing hierarchical conjunctive queries

This section studies the connection between PCEA and hierarchical conjunctive queries (HCQ) over streams. For this purpose, we must first define the semantics of HCQ over streams and how to relate their expressiveness with PCEA. We connect them by using a bag semantics of CQ. We start by introducing bags that will be useful throughout this section.

Bags

A bag (also called a multiset) is usually defined in the literature as a function that maps each element to its multiplicity (i.e., the number of times it appears). In this work, we use a different but equivalent representation of a bag where each element has its own identity. This representation will be helpful in our context to deal with duplicates in the stream and define the semantics of hierarchical CQ in the case of self joins.

We define a bag (with own identity) $B$ as a surjective function $B:I\rightarrow U$ where $I$ is a finite set of identifiers (i.e., the identity of each element) and $U$ is the underlying set of the bag. Given any bag $B$, we refer to these components as $I(B)$ and $U(B)$, respectively. For example, a bag $B=\{\!\!\{a,a,b\}\!\!\}$ (where $a$ is repeated twice) can be represented with a surjective function $B_{0}=\{0\mapsto a,1\mapsto a,2\mapsto b\}$ where $I(B_{0})=\{0,1,2\}$ and $U(B_{0})=\{a,b\}$. In general, we will use the standard notation for bags $\{\!\!\{a_{0},\ldots,a_{n-1}\}\!\!\}$ to denote the bag $B$ whose identifiers are $I(B)=\{0,\ldots,n-1\}$ and $B(i)=a_{i}$ for each $i\in I(B)$. Note that if $B:I\rightarrow U$ is injective, then $B$ encodes a set (i.e., no repetitions). We write $a\in B$ if $B(i)=a$ for some $i\in I(B)$ and define the empty bag $\emptyset$ such that $I(\emptyset)=\emptyset$ and $U(\emptyset)=\emptyset$.

For a bag $B$ and an element $a$, we define the multiplicity of $a$ in $B$ as $\operatorname{mult}_{B}(a)=|\{i\mid B(i)=a\}|$. Then, we say that a bag $B^{\prime}$ is contained in $B$, denoted as $B^{\prime}\subseteq B$, iff $\operatorname{mult}_{B^{\prime}}(a)\leq\operatorname{mult}_{B}(a)$ for every $a$. We also say that two bags $B^{\prime}$ and $B$ are equal, and write $B=B^{\prime}$, if $B^{\prime}\subseteq B$ and $B\subseteq B^{\prime}$. Note that two bags can be equal although their sets of identifiers are different (i.e., they are equal up to a renaming of the identifiers). Given a set $A$, we say that $B$ is a bag from elements of $A$ (or just a bag of $A$) if $U(B)\subseteq A$.
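The bag definitions above can be illustrated with a minimal Python sketch. It assumes a bag is stored as a dictionary from identifiers to elements; the function names (`make_bag`, `mult`, `contained`, `bag_equal`) are hypothetical and chosen only for this illustration.

```python
from collections import Counter

def make_bag(elements):
    """Standard notation {{a_0, ..., a_{n-1}}}: identifiers are 0..n-1."""
    return dict(enumerate(elements))

def identifiers(bag):
    """I(B): the set of identifiers of the bag."""
    return set(bag.keys())

def underlying_set(bag):
    """U(B): the set of distinct elements of the bag."""
    return set(bag.values())

def mult(bag, a):
    """Multiplicity of a in B: number of identifiers mapped to a."""
    return sum(1 for value in bag.values() if value == a)

def contained(b1, b2):
    """b1 is contained in b2 iff every multiplicity in b1 is bounded by the one in b2."""
    c1, c2 = Counter(b1.values()), Counter(b2.values())
    return all(c2[a] >= n for a, n in c1.items())

def bag_equal(b1, b2):
    """Equality up to renaming of identifiers."""
    return contained(b1, b2) and contained(b2, b1)

B0 = make_bag(["a", "a", "b"])      # {0 -> a, 1 -> a, 2 -> b}
B1 = {10: "b", 11: "a", 12: "a"}    # same bag with different identifiers
```

Here `B0` and `B1` have disjoint identifier sets but are equal as bags, which is the renaming invariance noted above.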

Relational databases

Recall that D is our set of data values and let σ=(T,arity)\sigma=(\textbf{T},\operatorname{arity}) be a schema. A relational database DD (with duplicates) over σ\sigma is a bag of Tuples[σ]\operatorname{Tuples}[\sigma]. Given a relation name RTR\in\textbf{T}, we write RDR^{D} as the bag of DD containing only the RR-tuples of DD, formally, I(RD)={iI(D)D(i)=R(a¯) for some a¯}I(R^{D})=\{i\in I(D)\mid D(i)=R(\bar{a})\text{ for some $\bar{a}$}\} and RD(i)=D(i)R^{D}(i)=D(i) for every iI(RD)i\in I(R^{D}). For example, consider again the schema σ0\sigma_{0}. Then a database D0D_{0} over σ0\sigma_{0} is the bag:

$$D_{0}\ :=\ \{\!\!\{\,S(2,11),T(2),R(1,10),S(2,11),T(1),R(2,11)\,\}\!\!\}.$$

Here, one can check that TD0={{T(2),T(1)}}T^{D_{0}}=\{\!\!\{T(2),T(1)\}\!\!\} and SD0={{S(2,11),S(2,11)}}S^{D_{0}}=\{\!\!\{S(2,11),S(2,11)\}\!\!\}.

Conjunctive queries

Fix a schema σ=(T,arity)\sigma=(\textbf{T},\operatorname{arity}) and a set of variables X disjoint from D (i.e., XD=\textbf{X}\cap\textbf{D}=\emptyset). A Conjunctive Query (CQ) over relational schema σ\sigma is a syntactic structure of the form:

$$Q(\bar{x})\ \leftarrow\ R_{0}(\bar{x}_{0}),\ldots,R_{m-1}(\bar{x}_{m-1}) \tag{4}$$

such that QQ is a relational name not in T, RiTR_{i}\in\textbf{T}, x¯i\bar{x}_{i} is a sequence of variables in X and data values in D, and |x¯i|=arity(Ri)|\bar{x}_{i}|=\operatorname{arity}(R_{i}) for every i<mi<m. Further, x¯\bar{x} is a sequence of variables in x¯0,,x¯m1\bar{x}_{0},\ldots,\bar{x}_{m-1}. We will denote a CQ like (4) by QQ, where Q(x¯)Q(\bar{x}) and R0(x¯0),,Rm1(x¯m1)R_{0}(\bar{x}_{0}),\ldots,R_{m-1}(\bar{x}_{m-1}) are called the head and the body of QQ, respectively. Furthermore, we call each Ri(x¯i)R_{i}(\bar{x}_{i}) an atom of QQ. For example, the following are two conjunctive queries Q0Q_{0} and Q1Q_{1} over the schema σ0\sigma_{0}:

$$Q_{0}(x,y)\leftarrow T(x),\,S(x,y),\,R(x,y) \qquad\qquad Q_{1}(x,y)\leftarrow T(x),\,R(x,y),\,S(2,y),\,T(x)$$

Note that a query can repeat atoms. For this reason, we will regularly consider QQ as a bag of atoms, where I(Q)I(Q) are the positions of QQ and U(Q)U(Q) is the set of distinct atoms. For instance, we can consider Q1Q_{1} above as a bag of atoms, where I(Q1)={0,1,2,3}I(Q_{1})=\{0,1,2,3\} (i.e., the position of the atoms) and Q1(0)=T(x)Q_{1}(0)=T(x), Q1(1)=R(x,y)Q_{1}(1)=R(x,y), Q1(2)=S(2,y)Q_{1}(2)=S(2,y), Q1(3)=T(x)Q_{1}(3)=T(x). We say that a CQ QQ has self joins if there are two atoms with the same relation name. We can see in the previous example that Q1Q_{1} has self joins, while Q0Q_{0} does not.

Homomorphisms and CQ bag semantics

Let QQ be a CQ, and DD be a database over the same schema σ\sigma. A homomorphism is any function h:XDDh:\textbf{X}\cup\textbf{D}\rightarrow\textbf{D} such that h(a)=ah(a)=a for every aDa\in\textbf{D}. We extend hh as a function from atoms to tuples such that h(R(x¯)):=R(h(x¯))h(R(\bar{x})):=R(h(\bar{x})) for every atom R(x¯)R(\bar{x}). We say that hh is a homomorphism from QQ to DD if hh is a homomorphism and h(R(x¯))Dh(R(\bar{x}))\in D for every atom R(x¯)R(\bar{x}) in QQ. We denote by Hom(Q,D)\operatorname{Hom}(Q,D) the set of all homomorphisms from QQ to DD.

To define the bag semantics of CQ, we need a more refined notion of homomorphism that specifies the correspondence between atoms in $Q$ and tuples in $D$. Formally, a tuple-homomorphism from $Q$ to $D$ (or t-homomorphism for short) is a function $\eta:I(Q)\rightarrow I(D)$ such that there exists a homomorphism $h_{\eta}$ from $Q$ to $D$ satisfying that $h_{\eta}(Q(i))=D(\eta(i))$ for every $i\in I(Q)$. For example, consider again $Q_{0}$ and $D_{0}$ above; then $\eta_{0}=\{0\mapsto 1,1\mapsto 3,2\mapsto 5\}$ and $\eta_{1}=\{0\mapsto 1,1\mapsto 0,2\mapsto 5\}$ are two t-homomorphisms from $Q_{0}$ to $D_{0}$.

Intuitively, a t-homomorphism is like a homomorphism, but it additionally specifies the correspondence between atoms (i.e., I(Q)I(Q)) and tuples (i.e., I(D)I(D)) in the underlying bags. One can easily check that if η\eta is a t-homomorphism, then hηh_{\eta} (restricted to the variables of QQ) is unique. For this reason, we usually say that hηh_{\eta} is the homomorphism associated to η\eta. Note that the converse does not hold: for hh from QQ to DD, there can be several t-homomorphisms η\eta such that h=hηh=h_{\eta}.

Let Q(x¯)Q(\bar{x}) be the head of QQ. We define the output of a CQ QQ over a database DD as:

$${\lsem{}{Q}\rsem}(D)\ =\ \{\!\!\{Q(h_{\eta}(\bar{x}))\,\mid\,\text{$\eta$ is a t-homomorphism from $Q$ to $D$}\}\!\!\}.$$

Note that the result is another relation where each Q(hη(x¯))Q(h_{\eta}(\bar{x})) is witnessed by a t-homomorphism from QQ to DD. In other words, there is a one-to-one correspondence between tuples in \lsemQ\rsem(D){\lsem{}{Q}\rsem}(D) and t-homomorphisms from QQ to DD.
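The definition of t-homomorphisms can be made concrete with a brute-force Python sketch over $Q_{0}$ and $D_{0}$. This is only an illustration under stated assumptions: atoms and tuples are encoded as `(relation_name, args)` pairs, variables are strings, data values are integers, and the helper names (`match`, `t_homomorphisms`, `output_bag`) are hypothetical. It tries every function from $I(Q)$ to $I(D)$, so it is not meant to be efficient.

```python
from itertools import product

# D0 from the text; list positions play the role of the identifiers I(D0).
D0 = [("S", (2, 11)), ("T", (2,)), ("R", (1, 10)),
      ("S", (2, 11)), ("T", (1,)), ("R", (2, 11))]
Q0_head = ("x", "y")
Q0_body = [("T", ("x",)), ("S", ("x", "y")), ("R", ("x", "y"))]

def match(atom, tup, h):
    """Extend the partial homomorphism h so that h(atom) = tup; None if impossible."""
    (rel_a, args_a), (rel_t, args_t) = atom, tup
    if rel_a != rel_t or len(args_a) != len(args_t):
        return None
    h = dict(h)
    for term, value in zip(args_a, args_t):
        if isinstance(term, str):               # variable: bind or check binding
            if h.setdefault(term, value) != value:
                return None
        elif term != value:                     # data value: must be preserved
            return None
    return h

def t_homomorphisms(body, db):
    """All pairs (eta, h_eta) where eta maps atom positions to tuple positions."""
    results = []
    for eta in product(range(len(db)), repeat=len(body)):
        h = {}
        for i, j in enumerate(eta):
            h = match(body[i], db[j], h)
            if h is None:
                break
        else:
            results.append((dict(enumerate(eta)), h))
    return results

def output_bag(head, body, db):
    """One output tuple per t-homomorphism (bag semantics)."""
    return [tuple(h[x] for x in head) for _, h in t_homomorphisms(body, db)]
```

On this input the sketch finds exactly the two t-homomorphisms $\eta_{1}$ and $\eta_{0}$ given above, and the output bag contains the tuple $(2,11)$ twice, once per t-homomorphism.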

Discussion

In the literature, homomorphisms are usually used to define the set semantics of a CQ $Q$ over a database $D$. They are helpful for set semantics but “inconvenient” for bag semantics since they do not specify the correspondence between atoms and tuples; namely, they only witness the existence of such a correspondence. In [8], Chaudhuri and Vardi introduced the bag semantics of CQ by using homomorphisms, which we recall next. Let $Q$ be a CQ like (4) and $D$ a database over the same schema $\sigma$, and let $h\in\operatorname{Hom}(Q,D)$. We define the multiplicity of $h$ with respect to $Q$ and $D$ by:

$$\operatorname{mult}_{Q,D}(h)\ =\ \prod_{i=0}^{m-1}\operatorname{mult}_{D}(h(R_{i}(\bar{x}_{i})))$$

Chaudhuri and Vardi defined the bag semantics $\lceil\lceil{Q}\rfloor\rfloor$ of $Q$ over $D$ as the bag $\lceil\lceil{Q}\rfloor\rfloor(D)$ such that each tuple $Q(\bar{a})$ has multiplicity equal to:

$$\operatorname{mult}_{\lceil\lceil{Q}\rfloor\rfloor(D)}(Q(\bar{a}))\ =\ \sum_{h\in\operatorname{Hom}(Q,D)\,\colon\,h(\bar{x})=\bar{a}}\operatorname{mult}_{Q,D}(h)$$

In Appendix B, we prove that for every CQ QQ and database DD it holds that \lsemQ\rsem(D)=Q(D){\lsem{}{Q}\rsem}(D)=\lceil\lceil{Q}\rfloor\rfloor(D), namely, the bag semantics introduced here (i.e., with t-homomorphisms) is equivalent to the standard bag semantics of CQ. The main difference is that the standard bag semantics of CQ are defined in terms of homomorphisms and multiplicities, and there is no direct correspondence between outputs and homomorphisms. For this reason, we redefine the bag semantics of CQ in terms of t-homomorphism that will connect the outputs of CQ with the outputs of PCEA over streams.
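As a sanity check of this equivalence on the running example (a worked computation added for illustration, not part of the proof in the appendix), consider $Q_{0}$ and $D_{0}$. Restricted to the variables of $Q_{0}$, the only homomorphism from $Q_{0}$ to $D_{0}$ is $h=\{x\mapsto 2,y\mapsto 11\}$, since $D_{0}$ contains no $S$-tuple whose first component is $1$. Its multiplicity is:

```latex
\begin{align*}
\operatorname{mult}_{Q_{0},D_{0}}(h)
  &= \operatorname{mult}_{D_{0}}(T(2)) \cdot \operatorname{mult}_{D_{0}}(S(2,11)) \cdot \operatorname{mult}_{D_{0}}(R(2,11)) \\
  &= 1 \cdot 2 \cdot 1 \;=\; 2 .
\end{align*}
```

Hence $Q_{0}(2,11)$ has multiplicity $2$ in $\lceil\lceil{Q_{0}}\rfloor\rfloor(D_{0})$, which agrees with the two t-homomorphisms $\eta_{0}$ and $\eta_{1}$ from $Q_{0}$ to $D_{0}$ given earlier.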

CQ over streams

Now, we define the semantics of CQ over streams, formalizing its comparison with queries in complex event recognition. For this purpose, we must show how to interpret streams as databases and encode CQ’s outputs as valuations. Fix a schema σ\sigma and a stream S=t0t1\pazocal{S}=t_{0}t_{1}\cdots over σ\sigma. Given a position nn\in\mathbb{N}, we define the database of S\pazocal{S} at position nn as the σ\sigma-database Dn[S]={{t0,t1,,tn}}D_{n}[\pazocal{S}]=\{\!\!\{t_{0},t_{1},\ldots,t_{n}\}\!\!\}. For example, D5[S0]=D0D_{5}[\pazocal{S}_{0}]=D_{0}. One can interpret here that S\pazocal{S} is a sequence of inserts, and then Dn[S]D_{n}[\pazocal{S}] is the database version at position nn. Since Dn[S]D_{n}[\pazocal{S}] is a bag, the identifiers I(Dn[S])I(D_{n}[\pazocal{S}]) coincide with the positions of the sequence t0tnt_{0}\ldots t_{n}.

Let QQ be a CQ over σ\sigma, and let η:I(Q)I(Dn[S])\eta:I(Q)\rightarrow I(D_{n}[\pazocal{S}]) be a t-homomorphism from QQ to Dn[S]D_{n}[\pazocal{S}]. If we consider Ω=I(Q)\Omega=I(Q), we can interpret η\eta as a valuation η^:Ω2\hat{\eta}:\Omega\rightarrow 2^{\mathbb{N}} that maps each atom of QQ to a set with a single position; formally, η^(i)={η(i)}\hat{\eta}(i)=\{\eta(i)\} for every iI(Q)i\in I(Q). Then, we define the semantics of QQ over stream S\pazocal{S} at position nn as:

\lsemQ\rsemn(S)={η^η is a t-homomorphism from Q to Dn[S]}{\lsem{}{Q}\rsem}_{n}(\pazocal{S})\ =\ \{\hat{\eta}\mid\text{$\eta$ is a t-homomorphism from $Q$ to $D_{n}[\pazocal{S}]$}\}

Note that \lsemQ\rsemn(S){\lsem{}{Q}\rsem}_{n}(\pazocal{S}) is equivalent to evaluating QQ over Dn[S]D_{n}[\pazocal{S}] where instead of outputting a bag of tuples \lsemQ\rsem(Dn[S]){\lsem{}{Q}\rsem}(D_{n}[\pazocal{S}]), we output the t-homomorphisms (i.e., as valuations) that are in a one-to-one correspondence with the tuples in \lsemQ\rsem(Dn[S]){\lsem{}{Q}\rsem}(D_{n}[\pazocal{S}]).

Hierarchical conjunctive queries and main results

Let $Q$ be a CQ of the form (4). Given a variable $x\in\textbf{X}$, define $\textit{atoms}(x)$ as the bag of all atoms $R_{i}(\bar{x}_{i})$ of $Q$ such that $x$ appears in $\bar{x}_{i}$. We say that $Q$ is full if every variable appearing in $\bar{x}_{0},\ldots,\bar{x}_{m-1}$ also appears in $\bar{x}$. Then, $Q$ is a Hierarchical Conjunctive Query (HCQ) [12] iff $Q$ is full and for every pair of variables $x,y\in\textbf{X}$ it holds that $\textit{atoms}(x)\subseteq\textit{atoms}(y)$, $\textit{atoms}(y)\subseteq\textit{atoms}(x)$, or $\textit{atoms}(x)\cap\textit{atoms}(y)=\emptyset$. For example, one can check that $Q_{0}$ is a HCQ, but $Q_{1}$ is not: in $Q_{1}$, $\textit{atoms}(x)$ and $\textit{atoms}(y)$ share $R(x,y)$, yet $T(x)$ occurs only in the former and $S(2,y)$ only in the latter.
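The hierarchical condition is easy to test mechanically. The following Python sketch assumes the same hypothetical encoding of atoms as `(relation_name, args)` pairs with string variables, and represents $\textit{atoms}(x)$ by the set of atom positions in $I(Q)$ where $x$ occurs, which is one way to account for repeated atoms.

```python
from itertools import combinations

def atoms_of(body):
    """Map each variable to the set of atom positions in which it occurs."""
    occurrences = {}
    for i, (_, args) in enumerate(body):
        for term in args:
            if isinstance(term, str):           # strings are variables
                occurrences.setdefault(term, set()).add(i)
    return occurrences

def is_full(head, body):
    """Every variable of the body also appears in the head."""
    return set(atoms_of(body)) <= set(head)

def is_hierarchical(head, body):
    """Full, and atoms(x), atoms(y) are nested or disjoint for every pair x, y."""
    if not is_full(head, body):
        return False
    occurrences = atoms_of(body)
    return all(a <= b or b <= a or not (a & b)
               for a, b in combinations(occurrences.values(), 2))

Q0 = (("x", "y"), [("T", ("x",)), ("S", ("x", "y")), ("R", ("x", "y"))])
Q1 = (("x", "y"), [("T", ("x",)), ("R", ("x", "y")), ("S", (2, "y")), ("T", ("x",))])
```

For $Q_{1}$ the sketch computes $\{0,1,3\}$ for $x$ and $\{1,2\}$ for $y$, which overlap without being nested, so the test fails as expected.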

HCQ is a subset of CQ that can be evaluated with constant-delay enumeration under updates [5, 18]. Moreover, it is the greatest class of full conjunctive queries that can be evaluated with such guarantees under fine-grained complexity assumptions. Therefore, HCQ is the right yardstick to measure the expressive power of PCEA for defining queries with strong efficiency guarantees. Given a PCEA P\pazocal{P} and a CQ QQ over the same schema σ\sigma, we say that P\pazocal{P} is equivalent to QQ (denoted as PQ\pazocal{P}\equiv Q) iff for every stream S\pazocal{S} over σ\sigma and every position nn it holds that \lsemP\rsemn(S)=\lsemQ\rsemn(S){\lsem{}{\pazocal{P}}\rsem}_{n}(\pazocal{S})={\lsem{}{Q}\rsem}_{n}(\pazocal{S}).

Theorem 4.1.

Let σ\sigma be a schema. For every HCQ QQ over σ\sigma, there exists a PCEA PQ\pazocal{P}_{Q} over σ\sigma with unary predicates in Ulin\textbf{U}_{\operatorname{lin}} and binary predicates in Beq\textbf{B}_{\operatorname{eq}} such that PQQ\pazocal{P}_{Q}\equiv Q. Furthermore, PQ\pazocal{P}_{Q} is unambiguous and of at most exponential size with respect to QQ. If QQ does not have self joins, then PQ\pazocal{P}_{Q} is of quadratic size.

[Figure 2. Left: the HCQ $Q_{0}(x,y)\leftarrow T(x),S(x,y),R(x,y)$ with atoms labeled $0$, $1$, $2$, and its $q$-tree over the nodes $x$, $y$, $0$, $1$, $2$. Right: the PCEA $\pazocal{P}_{Q_{0}}$, with transitions labeled $T/0$, $S/1$, $R/2$ and equality predicates $(Tx,Rxy)$, $(Sx,Rxy)$, $(Tx,Sxy)$, $(Rxy,Sxy)$, $(Sxy,Rxy)$, and $(?xy,Tx)$.]
Figure 2: An illustration of constructing a PCEA from a HCQ. On the left, the HCQ $Q_{0}$ and its $q$-tree. On the right, a PCEA $\pazocal{P}_{Q_{0}}$ equivalent to $Q_{0}$. For presentation purposes, states are repeated several times and $?xy$ means a binary relation with any relation name (i.e., $R$ or $S$).
Proof sketch.

We give an example of the construction to provide insights on the expressive power of PCEA for defining HCQ (the full technical proof is in the appendix). For this construction, we rely on a qq-tree of a HCQ, a structure introduced in [5]. Formally, let QQ be a HCQ and assume, for the sake of simplification, that QQ is connected (i.e., the Gaifman graph associated to QQ is connected). A q-tree for QQ is a labeled tree, τQ:tI(Q){x¯}\tau_{Q}:t\rightarrow I(Q)\cup\{\bar{x}\}, where for every x{x¯}x\in\{\bar{x}\} there is a unique inner node u¯t\bar{u}\in t such that τQ(u¯)=x\tau_{Q}(\bar{u})=x, and for every atom iI(Q)i\in I(Q) there is a unique leaf node v¯t\bar{v}\in t such that τQ(v¯)=i\tau_{Q}(\bar{v})=i. Further, if u¯1,,u¯k\bar{u}_{1},\ldots,\bar{u}_{k} are the inner nodes of the path from the root until v¯\bar{v}, then {x¯i}={τQ(u¯1),,τQ(u¯k)}\{\bar{x}_{i}\}=\{\tau_{Q}(\bar{u}_{1}),\ldots,\tau_{Q}(\bar{u}_{k})\}. In [5], it was shown that a CQ QQ is hierarchical and connected iff there exists a qq-tree for QQ. For instance, in Figure 2 (left) we display again the HCQ Q0Q_{0}, labeled with the identifiers of the atoms, and below a qq-tree for Q0Q_{0}.

For a connected HCQ without self-joins the idea of the construction is to use the qq-tree of QQ as the underlying structure of the PCEA PQ\pazocal{P}_{Q}. Indeed, the nodes of the qq-tree will be the states of PQ\pazocal{P}_{Q}. For example, in Figure 2 (right) we present a PCEA PQ0\pazocal{P}_{Q_{0}} equivalent to Q0Q_{0}, where we use multiple copies of the states for presentation purposes (i.e., if two states have the same label, they are the same state in the figure). As you can check, the states are {0,1,2,x,y}\{0,1,2,x,y\}, which are the nodes of the qq-tree. Furthermore, the leaves of the qq-trees (i.e., the atoms) are the initial states {0,1,2}\{0,1,2\} where PQ0\pazocal{P}_{Q_{0}} uses a unary predicate to check that the tuples have arrived and annotates with the corresponding identifier.

For every atom $R_{i}(\bar{x}_{i})$ and every variable $x\in\{\bar{x}_{i}\}$, $\pazocal{P}_{Q}$ jumps with a transition to the state $x$, which is a node in the $q$-tree, and joins with all the atoms and variables “hanging” from the path from $x$ to the leaf $i$ in the $q$-tree. For example, consider the first component (i.e., top-left) of $\pazocal{P}_{Q_{0}}$ in Figure 2. When $\pazocal{P}_{Q_{0}}$ reads a tuple $R(a,b)$, it jumps to state $x$ and joins with all the atoms hanging from the path from $x$ to $2$, namely, the atoms $T$ and $S$. Similarly, consider the last component (i.e., below) of $\pazocal{P}_{Q_{0}}$ in Figure 2. When $\pazocal{P}_{Q_{0}}$ reads a tuple $R(a,b)$, it also jumps to state $y$, but now the only atom hanging from the path from $y$ to $2$ in the $q$-tree is $1$, which corresponds to a single transition from $1$ to $y$ joining with the atom $S(x,y)$. When $\pazocal{P}_{Q_{0}}$ reads a tuple $T(a)$, the only variable that hangs in the path from the root to $0$ is the variable $y$, and then there is a single transition from $y$ to $x$, joining with an equality predicate $(?xy,Tx)$ where $?xy$ means a binary relation with any relational name (i.e., $R$ or $S$). Finally, the root of the $q$-tree serves as the final state of $\pazocal{P}_{Q_{0}}$, namely, all atoms were found. Note that an accepting run tree of $\pazocal{P}_{Q_{0}}$ serves as a witness that the $q$-tree is complete. The construction for HCQ with self-joins is more involved, and we present the details in the appendix. ∎

The previous result shows that PCEA has the expressive power to specify every HCQ. Given that HCQ characterize the full CQ that can be evaluated in a dynamic setting (under complexity assumptions), a natural question is whether PCEA has the right expressive power, in the sense that it cannot define non-hierarchical CQ. We answer this question positively by focusing on acyclic CQ. Let $Q$ be a CQ of the form (4). A join-tree for $Q$ is a labeled tree $\tau:t\rightarrow U(Q)$ such that for every variable $x$ the set $\{\bar{u}\in t\mid\tau(\bar{u})\in\textit{atoms}(x)\}$ forms a connected tree in $\tau$. We say that $Q$ is acyclic if $Q$ has a join-tree. One can check that both $Q_{0}$ and $Q_{1}$ are examples of acyclic CQ.

Theorem 4.2.

Let σ\sigma be a schema. For every acyclic CQ QQ over σ\sigma, if QQ is not hierarchical, then PQ\pazocal{P}\not\equiv Q for all PCEA P\pazocal{P} over σ\sigma.

We note that, although PCEA can only define acyclic CQ that are hierarchical, it can define queries that are not CQ. For instance, P0\pazocal{P}_{0} in Example 3.3 cannot be defined by any CQ, since a CQ cannot express that the RR-tuple must arrive after TT and SS. Therefore, the class of queries defined by PCEA  is strictly more expressive than HCQ.

By Theorems 4.1 and 4.2, PCEA capture the expressibility of HCQ among acyclic CQ. In the next section, we show that they also share their good algorithmic properties for streaming evaluation.

5 An evaluation algorithm for PCEA

Below, we present our evaluation algorithm for unambiguous PCEA with equality predicates. We do this in a streaming setting where the algorithm reads a stream sequentially, and at each position, we can enumerate the new outputs fired by the last tuple. Furthermore, our algorithm works under a sliding window scenario, where we only want to enumerate the outputs inside the last ww items for some window size ww. This scenario is motivated by CER [14, 11, 6], where the importance of data decreases with time, and then, we want the outputs inside some relevant time window.

In the following, we start by defining the evaluation problem and stating the main theorem, followed by describing our data structure for storing valuations. We end this section by explaining the algorithm and stating its correctness.

The streaming evaluation problem

Let $\sigma$ be a fixed schema. For a valuation $\nu:\Omega\rightarrow 2^{\mathbb{N}}$, we define $\min(\nu)=\min\{i\mid\exists\ell\in\Omega.\ i\in\nu(\ell)\}$, namely, the minimum position appearing in $\nu$. In this section, we study the following evaluation problem of PCEA over streams:

Problem: $\textsl{EvalPCEA}[\sigma]$
Input: An unambiguous PCEA $\pazocal{P}=(Q,\textbf{U}_{\operatorname{lin}},\textbf{B}_{\operatorname{eq}},\Omega,\Delta,F)$ over $\sigma$, a window size $w\in\mathbb{N}$, and a stream $\pazocal{S}=t_{0}t_{1}\ldots$
Output: At each position $i$, enumerate all valuations $\nu\in{\lsem{}{\pazocal{P}}\rsem}_{i}(\pazocal{S})$ such that $|i-\min(\nu)|\leq w$.

The goal is to output the set \lsemP\rsemiw(S)={ν\lsemP\rsemi(S)|imin(ν)|w}{\lsem{}{\pazocal{P}}\rsem}_{i}^{w}(\pazocal{S})=\{\nu\in{\lsem{}{\pazocal{P}}\rsem}_{i}(\pazocal{S})\mid|i-\min(\nu)|\leq w\} by reading the stream S\pazocal{S} tuple-by-tuple sequentially. We assume here a method yield[S]\texttt{yield}[\pazocal{S}] such that each call retrieves the next tuple, that is, the ii-th call to yield[S]\texttt{yield}[\pazocal{S}] retrieves tit_{i} for each i0i\geq 0.

For solving EvalPCEA[σ]\textsl{EvalPCEA}[\sigma], our desire is to find a streaming evaluation algorithm [16, 18] that, for each tuple tit_{i}, updates its internal state quickly and enumerates the set \lsemP\rsemiw(S){\lsem{}{\pazocal{P}}\rsem}_{i}^{w}(\pazocal{S}) with output-linear delay. More precisely, let f:3f:\mathbb{N}^{3}\rightarrow\mathbb{N}. A streaming enumeration algorithm \mathcal{E} with ff-update time for EvalPCEA[σ]\textsl{EvalPCEA}[\sigma] works as follows. Before reading the stream S\pazocal{S}, \mathcal{E} receives as input a PCEA P\pazocal{P} and ww\in\mathbb{N}, and does some preprocessing. By calling yield[S]\texttt{yield}[\pazocal{S}], \mathcal{E} reads S\pazocal{S} sequentially and processes the next tuple tit_{i} in two phases called the update phase and enumeration phase, respectively. In the update phase, \mathcal{E} updates a data structure 𝖣𝖲\mathsf{DS} with tit_{i} taking time O(f(|P|,|ti|,w))\pazocal{O}(f(|\pazocal{P}|,|t_{i}|,w)). In the enumeration phase, \mathcal{E} uses 𝖣𝖲\mathsf{DS} for enumerating \lsemP\rsemiw(S){\lsem{}{\pazocal{P}}\rsem}_{i}^{w}(\pazocal{S}) with output-linear delay. Formally, if \lsemP\rsemiw(S)={ν1,,νk}{\lsem{}{\pazocal{P}}\rsem}_{i}^{w}(\pazocal{S})=\{\nu_{1},\ldots,\nu_{k}\} (i.e., in arbitrary order), the algorithm prints #ν1#ν2##νk#\#\nu_{1}\#\nu_{2}\#\ldots\#\nu_{k}\# to the output registers, sequentially. Furthermore, \mathcal{E} prints the first and last symbols #\# when the enumeration phase starts and ends, respectively, and the time difference (i.e., the delay) between printing the #\#-symbols surrounding νi\nu_{i} is in O(|νi|)\pazocal{O}(|\nu_{i}|). Finally, if such an algorithm exists, we say that EvalPCEA[σ]\textsl{EvalPCEA}[\sigma] admits a streaming evaluation algorithm with ff-update time and output-linear delay.

We now state our main algorithmic result for evaluating PCEA.

Theorem 5.1.

$\textsl{EvalPCEA}[\sigma]$ admits a streaming evaluation algorithm with $(|\pazocal{P}|\cdot|t|+|\pazocal{P}|\cdot\log(|\pazocal{P}|)+|\pazocal{P}|\cdot\log(w))$-update time and output-linear delay.

Note that the update time does not depend on the number of outputs seen so far, and regarding data complexity (i.e., assuming that P\pazocal{P} and the size of the tuples, |t||t| are fixed), the update time is logarithmic in the size of the sliding window. Theorem 5.1 improves with respect to [16] by considering a more general class of queries and evaluating over a sliding window. In contrast, Theorem 5.1 is incomparable to the algorithms for dynamic query evaluation of HCQ in [5, 18]. On the one hand, [5, 18] show constant update time algorithms for HCQ under insertions and deletions. On the other hand, Theorem 5.1 works for CER queries that can compare the order over tuples. If we restrict to HCQ, the algorithms in [5, 18] have better complexity, given that there is no need to maintain and check the order of how tuples are inserted or deleted.

It is important to note that we base the algorithm of Theorem 5.1 on the ideas introduced in [16]. Nevertheless, it relies on several insights that are not present in [16]. First, our algorithm evaluates PCEA, which is a generalization of CCEA, and then the approach in [16] requires several changes. Second, the data structure for our algorithm must manage the evaluation of a sliding window and simultaneously combine parallel runs into one. This challenge requires a new strategy for enumeration that combines cross-products with checking a time condition. Finally, maintaining the runs that are valid inside the sliding window with logarithmic update time requires the design of a new data structure based on the principles of a heap. We believe this data structure is interesting in its own right and could lead to new advances in streaming evaluation algorithms with enumeration.

We dedicate the rest of this section to explaining the streaming evaluation algorithm of Theorem 5.1, starting by describing the data structure 𝖣𝖲\mathsf{DS}.

The data structure

Fix a set of labels $\Omega$. For representing sets of valuations $\nu:\Omega\rightarrow 2^{\mathbb{N}}$, we use a data structure composed of nodes, where each node stores a position, a set of labels, and pointers to other nodes. Formally, the data structure $\mathsf{DS}$ is composed of a set of nodes, denoted by $\operatorname{Nodes}(\mathsf{DS})$, where each node $\mathsf{n}$ has a set $L(\mathsf{n})\subseteq\Omega$, a position $i(\mathsf{n})\in\mathbb{N}$, a set $\operatorname{prod}(\mathsf{n})\subseteq\operatorname{Nodes}(\mathsf{DS})$, and two links to other nodes $\operatorname{uleft}(\mathsf{n}),\operatorname{uright}(\mathsf{n})\in\operatorname{Nodes}(\mathsf{DS})$. We assume that the directed graph $G_{\mathsf{DS}}$ with $V(G_{\mathsf{DS}})=\operatorname{Nodes}(\mathsf{DS})$ and $E(G_{\mathsf{DS}})=\{(\mathsf{n}_{1},\mathsf{n}_{2})\mid\mathsf{n}_{2}\in\operatorname{prod}(\mathsf{n}_{1})\vee\mathsf{n}_{2}=\operatorname{uleft}(\mathsf{n}_{1})\vee\mathsf{n}_{2}=\operatorname{uright}(\mathsf{n}_{1})\}$ is acyclic. In addition, we assume a special node $\bot\in\operatorname{Nodes}(\mathsf{DS})$ that serves as a bottom node (i.e., all components above are undefined for $\bot$) and $\bot\notin\operatorname{prod}(\mathsf{n})$ for every $\mathsf{n}$.

Each node in $\mathsf{DS}$ represents a bag of valuations. For explaining this representation, we need to first introduce some algebraic operations on valuations. Given two valuations $\nu,\nu^{\prime}:\Omega\rightarrow 2^{\mathbb{N}}$, we define the product $\nu\oplus\nu^{\prime}:\Omega\rightarrow 2^{\mathbb{N}}$ such that $[\nu\oplus\nu^{\prime}](\ell)=\nu(\ell)\cup\nu^{\prime}(\ell)$ for every $\ell\in\Omega$. Further, we extend this product to bags of valuations $V$ and $V^{\prime}$ such that $V\oplus V^{\prime}=\{\!\!\{\nu\oplus\nu^{\prime}\mid\nu\in V,\nu^{\prime}\in V^{\prime}\}\!\!\}$. Note that $\oplus$ is an associative and commutative operation and, thus, we can write $\bigoplus_{i}V_{i}$ for referring to a sequence of $\oplus$-operations. Given a pair $(L,i)\in 2^{\Omega}\times\mathbb{N}$, we define the valuation $\nu_{L,i}:\Omega\rightarrow 2^{\mathbb{N}}$ such that $\nu_{L,i}(\ell)=\{i\}$ if $\ell\in L$, and $\nu_{L,i}(\ell)=\emptyset$ otherwise. With this notation, for every $\mathsf{n}\in\operatorname{Nodes}(\mathsf{DS})$ we define the bags ${\lsem{}{\mathsf{n}}\rsem}_{\operatorname{prod}}$ and ${\lsem{}{\mathsf{n}}\rsem}$ recursively as follows:

$${\lsem{}{\mathsf{n}}\rsem}_{\operatorname{prod}}:=\{\!\!\{\nu_{L(\mathsf{n}),i(\mathsf{n})}\}\!\!\}\oplus\bigoplus_{\mathsf{n}^{\prime}\in\operatorname{prod}(\mathsf{n})}{\lsem{}{\mathsf{n}^{\prime}}\rsem} \qquad\qquad {\lsem{}{\mathsf{n}}\rsem}:={\lsem{}{\mathsf{n}}\rsem}_{\operatorname{prod}}\cup{\lsem{}{\operatorname{uleft}(\mathsf{n})}\rsem}\cup{\lsem{}{\operatorname{uright}(\mathsf{n})}\rsem}.$$

For $\bot$, we define ${\lsem{}{\bot}\rsem}_{\operatorname{prod}}={\lsem{}{\bot}\rsem}=\emptyset$. Intuitively, the set $\operatorname{prod}(\mathsf{n})$ represents the product of its nodes with the valuation $\nu_{L,i}$, and the nodes $\operatorname{uleft}(\mathsf{n})$ and $\operatorname{uright}(\mathsf{n})$ represent unions (for union-left and union-right, respectively). This interpretation is analogous to the product and union nodes used in previous work on MSO enumeration [2, 22], but here we encode products and unions in a single node.
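The recursive semantics of nodes can be sketched directly in Python. This is a naive illustration that materializes the bags as lists, not the enumeration procedure of the paper. It assumes that `None` plays the role of $\bot$, that a valuation is a dictionary listing only the labels with non-empty sets, and that a single hypothetical label `"out"` stands for $\bullet$.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    labels: frozenset                       # L(n)
    pos: int                                # i(n)
    prod: List["Node"] = field(default_factory=list)
    uleft: Optional["Node"] = None          # None stands for the bottom node
    uright: Optional["Node"] = None

def nu(labels, pos):
    """The valuation nu_{L,i}: every label in L is mapped to {i}."""
    return {label: frozenset({pos}) for label in labels}

def oplus(v1, v2):
    """Product of two valuations: label-wise union of position sets."""
    return {label: v1.get(label, frozenset()) | v2.get(label, frozenset())
            for label in set(v1) | set(v2)}

def sem_prod(n):
    """The bag [[n]]_prod, as a list of valuations."""
    if n is None:
        return []
    bag = [nu(n.labels, n.pos)]
    for child in n.prod:
        bag = [oplus(v, v2) for v in bag for v2 in sem(child)]
    return bag

def sem(n):
    """The bag [[n]]: own product plus both union branches."""
    if n is None:
        return []
    return sem_prod(n) + sem(n.uleft) + sem(n.uright)

# Two S-positions (3 and 0) joined by a union link, multiplied with T at 1 and R at 5.
t1 = Node(frozenset({"out"}), 1)
s0 = Node(frozenset({"out"}), 0)
s3 = Node(frozenset({"out"}), 3, uleft=s0)
r5 = Node(frozenset({"out"}), 5, prod=[t1, s3])
```

Here `sem(r5)` contains the two valuations with position sets $\{1,3,5\}$ and $\{0,1,5\}$: the product node `r5` is stored once, while the union link under `s3` supplies the two alternatives.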

For efficiently enumerating \lsem𝗇\rsem{\lsem{}{\mathsf{n}}\rsem}, we require that valuations in 𝖣𝖲\mathsf{DS} are represented without overlapping. To formalize this idea, define that the product νν\nu\oplus\nu^{\prime} is simple if for every Ω\ell\in\Omega, ν()\nu(\ell) and ν()\nu^{\prime}(\ell) are disjoint and [νν]()=ν()ν()[\nu\oplus\nu^{\prime}](\ell)=\nu(\ell)\cup\nu^{\prime}(\ell). Accordingly, we extend this notion to bags of valuations: VVV\oplus V^{\prime} is simple if νν\nu\oplus\nu^{\prime} is simple for every νV\nu\in V and νV\nu^{\prime}\in V^{\prime}. We say that 𝖣𝖲\mathsf{DS} is simple if {{νL(𝗇),i(𝗇)}}𝗇prod(𝗇)\lsem𝗇\rsem\{\!\!\{\nu_{L(\mathsf{n}),i(\mathsf{n})}\}\!\!\}\oplus\bigoplus_{\mathsf{n}^{\prime}\in\operatorname{prod}(\mathsf{n})}{\lsem{}{\mathsf{n}^{\prime}}\rsem} is simple for every 𝗇Nodes(𝖣𝖲)\mathsf{n}\in\operatorname{Nodes}(\mathsf{DS}). This notion is directly related to unambiguous PCEA in Section 3. Intuitively, the first condition of unambiguous PCEA will help us to force that 𝖣𝖲\mathsf{DS} is always simple.

The next step is to incorporate the window-size restriction to 𝖣𝖲\mathsf{DS}. For a node 𝗇Nodes(𝖣𝖲)\mathsf{n}\in\operatorname{Nodes}(\mathsf{DS}), let max(𝗇)=max{iν()ν\lsem𝗇\rsemΩ}\max(\mathsf{n})=\max\{i\in\nu(\ell)\mid\nu\in{\lsem{}{\mathsf{n}}\rsem}\wedge\ell\in\Omega\}. Then, given a position imax(𝗇)i\geq\max(\mathsf{n}) and a window size ww\in\mathbb{N}, define the bag:

$${\lsem{}{\mathsf{n}}\rsem}^{w}_{i}\ :=\ \{\!\!\{\nu\in{\lsem{}{\mathsf{n}}\rsem}\mid|i-\min(\nu)|\leq w\}\!\!\}.$$

Our plan is to represent \lsem𝗇\rsemiw{\lsem{}{\mathsf{n}}\rsem}^{w}_{i} and enumerate its valuations with output-linear delay. For this goal, from now on we fix a ww\in\mathbb{N} and write 𝖣𝖲w\mathsf{DS}_{w} to denote the data structure with window size ww. For the enumeration of \lsem𝗇\rsemiw{\lsem{}{\mathsf{n}}\rsem}^{w}_{i}, in each node 𝗇\mathsf{n} we store the value:

$$\operatorname{max-start}(\mathsf{n})\ :=\ \max\left\{\min(\nu)\mid\nu\in{\lsem{}{\mathsf{n}}\rsem}_{\operatorname{prod}}\right\}$$

This value is helpful for verifying whether $\llbracket \mathsf{n} \rrbracket^{w}_{i}$ is non-empty; in particular, one can check that $\llbracket \mathsf{n} \rrbracket^{w}_{i} \neq \emptyset$ iff $|i - \operatorname{max-start}(\mathsf{n})| \leq w$. We always assume that $|\max(\mathsf{n}) - \operatorname{max-start}(\mathsf{n})| \leq w$ (otherwise $\llbracket \mathsf{n} \rrbracket^{w}_{i} = \emptyset$). In addition, we require an order with $\operatorname{uleft}(\mathsf{n})$ and $\operatorname{uright}(\mathsf{n})$ to discard empty unions easily. For every node $\mathsf{n} \in \operatorname{Nodes}(\mathsf{DS}_{w})$, we require:

\[ \operatorname{max-start}(\mathsf{n}) \geq \operatorname{max-start}(\operatorname{uleft}(\mathsf{n})) \quad\text{and}\quad \operatorname{max-start}(\mathsf{n}) \geq \operatorname{max-start}(\operatorname{uright}(\mathsf{n})) \qquad (5) \]

whenever $\operatorname{uleft}(\mathsf{n}) \neq \bot \neq \operatorname{uright}(\mathsf{n})$. Intuitively, the binary tree formed by $\mathsf{n}$ and all nodes that can be reached by following $\operatorname{uleft}(\cdot)$ and $\operatorname{uright}(\cdot)$ is not strictly ordered; however, it follows the same principle (5) as a heap [10]. Note that it is not our goal to use $\mathsf{DS}_{w}$ as a priority queue (since removing the max element from a heap takes logarithmic time, and we need constant time), but to use condition (5) to quickly check whether there are more outputs to enumerate in $\operatorname{uleft}(\mathsf{n})$ or $\operatorname{uright}(\mathsf{n})$ by comparing the max-start value of a node with the start of the current location of the time window.
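The pruning idea behind condition (5) can be illustrated with a minimal Python sketch. It is not the output-linear-delay enumeration procedure of the paper; it only shows how a subtree of unions can be skipped as soon as its root falls out of the window. The class `UnionNode`, the function `live_nodes`, and the use of `None` for $\bot$ are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class UnionNode:
    """Hypothetical stand-in for a node of DS_w; only the fields used here."""
    name: str
    max_start: int
    uleft: Optional["UnionNode"] = None   # None plays the role of the bottom node
    uright: Optional["UnionNode"] = None


def live_nodes(n: Optional[UnionNode], i: int, w: int) -> Iterator[UnionNode]:
    """Yield the nodes of the union tree rooted at n that are still in the window."""
    # If the root is out of the window, condition (5) says every descendant
    # has a max-start that is no larger, so the whole subtree can be skipped.
    if n is None or i - n.max_start > w:
        return
    yield n
    yield from live_nodes(n.uleft, i, w)
    yield from live_nodes(n.uright, i, w)


# A small union tree that satisfies condition (5).
old = UnionNode("old", max_start=3)
left = UnionNode("left", max_start=7, uleft=old)
right = UnionNode("right", max_start=5)
root = UnionNode("root", max_start=9, uleft=left, uright=right)

in_window = [n.name for n in live_nodes(root, i=10, w=3)]
```

With position 10 and window size 3, only `root` and `left` survive; `old` and `right` are discarded after a single comparison each, without inspecting anything below them.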

Theorem 5.2.

Let $w \in \mathbb{N}$ be a window size and assume that $\mathsf{DS}_{w}$ is simple. Then, for every $\mathsf{n} \in \operatorname{Nodes}(\mathsf{DS}_{w})$ and every position $i \geq \max(\mathsf{n})$, the valuations in $\llbracket \mathsf{n} \rrbracket^{w}_{i}$ can be enumerated with output-linear delay and without preprocessing (i.e., the enumeration starts immediately).

We require two procedures, called extend and union, for operating on nodes in our algorithm. The first procedure $\mathtt{extend}(L, i, \mathsf{N})$ receives as input a set $L \subseteq \Omega$, a position $i \in \mathbb{N}$, and $\mathsf{N} \subseteq \operatorname{Nodes}(\mathsf{DS}_{w})$ such that $i(\mathsf{n}) < i$ for every $\mathsf{n} \in \mathsf{N}$. The procedure outputs a fresh node $\mathsf{n}_{e}$ such that $\llbracket \mathsf{n}_{e} \rrbracket^{w}_{i} := \{\!\!\{\nu_{L,i}\}\!\!\} \oplus \bigoplus_{\mathsf{n} \in \mathsf{N}} \llbracket \mathsf{n} \rrbracket^{w}_{i}$. By the construction of $\mathsf{DS}_{w}$, this operation is straightforward to implement by defining $L(\mathsf{n}_{e}) = L$, $i(\mathsf{n}_{e}) = i$, $\operatorname{prod}(\mathsf{n}_{e}) = \mathsf{N}$, and $\operatorname{uleft}(\mathsf{n}_{e}) = \operatorname{uright}(\mathsf{n}_{e}) = \bot$. Further, we can compute $\operatorname{max-start}(\mathsf{n}_{e})$ from the set $\mathsf{N}$ as follows: $\operatorname{max-start}(\mathsf{n}_{e}) = \min\{i, \min\{\operatorname{max-start}(\mathsf{n}) \mid \mathsf{n} \in \mathsf{N}\}\}$. Overall, we can implement $\mathtt{extend}(L, i, \mathsf{N})$ with running time $\mathcal{O}(|\mathsf{N}|)$.
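As an illustration of the description above, the following Python sketch builds a fresh node from a label set, a position, and a collection of existing nodes, and computes its max-start value with the stated formula. The `Node` class and its field names are assumptions chosen to mirror $L(\mathsf{n})$, $i(\mathsf{n})$, $\operatorname{prod}(\mathsf{n})$, $\operatorname{uleft}(\mathsf{n})$, and $\operatorname{uright}(\mathsf{n})$; `None` stands for $\bot$.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical immutable node of DS_w."""
    labels: FrozenSet[str]            # L(n)
    pos: int                          # i(n)
    prod: Tuple["Node", ...]          # prod(n)
    max_start: int                    # max-start(n)
    uleft: Optional["Node"] = None    # None plays the role of the bottom node
    uright: Optional["Node"] = None


def extend(labels: Iterable[str], i: int, nodes: Iterable[Node]) -> Node:
    """Return a fresh node for {{nu_{L,i}}} combined with the product of `nodes`."""
    nodes = tuple(nodes)
    # Precondition from the text: every input node ends strictly before i.
    assert all(n.pos < i for n in nodes)
    # max-start(n_e) = min{ i, min{ max-start(n) | n in N } }, one pass over N.
    max_start = min([i] + [n.max_start for n in nodes])
    return Node(frozenset(labels), i, nodes, max_start)


a = extend({"R"}, 2, [])
b = extend({"T"}, 4, [])
c = extend({"S"}, 7, [a, b])
```

Here `c.max_start` is 2, the smallest start among the nodes it multiplies, and no existing node is modified, which is in line with the persistence requirement discussed for union.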

The second procedure $\mathtt{union}(\mathsf{n}_{1}, \mathsf{n}_{2})$ receives as input two nodes $\mathsf{n}_{1}, \mathsf{n}_{2} \in \operatorname{Nodes}(\mathsf{DS}_{w})$ such that $\max(\mathsf{n}_{1}) \leq i(\mathsf{n}_{2})$ and $\operatorname{uleft}(\mathsf{n}_{2}) = \operatorname{uright}(\mathsf{n}_{2}) = \bot$. It outputs a fresh node $\mathsf{n}_{u}$ such that $\llbracket \mathsf{n}_{u} \rrbracket^{w}_{i} := \llbracket \mathsf{n}_{1} \rrbracket^{w}_{i} \cup \llbracket \mathsf{n}_{2} \rrbracket^{w}_{i}$. The implementation of this procedure is more involved since it requires inserting $\mathsf{n}_{2}$ into $\mathsf{n}_{1}$ by using $\operatorname{uleft}(\mathsf{n}_{1})$ and $\operatorname{uright}(\mathsf{n}_{1})$, while maintaining condition (5). Furthermore, we require both procedures to be fully persistent [13], namely, $\mathsf{n}_{1}$ and $\mathsf{n}_{2}$ are unmodified after each operation.

Proposition 5.3.

Let $k \in \mathbb{N}$ and assume that one performs $\mathtt{union}(\mathsf{n}_{1}, \mathsf{n}_{2})$ over $\mathsf{DS}_{w}$ with the same position $i = i(\mathsf{n}_{2})$ at most $k$ times. Then one can implement $\mathtt{union}(\mathsf{n}_{1}, \mathsf{n}_{2})$ with running time $\mathcal{O}(\log(k \cdot w))$ per call.

Algorithm 1 Evaluation of an unambiguous PCEA $\mathcal{P} = (Q, \mathbf{U}_{\operatorname{lin}}, \mathbf{B}_{\operatorname{eq}}, \Omega, \Delta, F)$ with equality predicates over a stream $\mathcal{S}$ under a sliding window of size $w$.
1: procedure Evaluation($\mathcal{P}, w, \mathcal{S}$)
2:     $\mathsf{DS}_{w} \leftarrow \emptyset$
3:     $i \leftarrow -1$
4:     while $t \leftarrow \texttt{yield}[\mathcal{S}]$ do
5:         Reset()
6:         FireTransitions($t, i$)
7:         UpdateIndices($t, i$)
8:         for each $\mathsf{n} \in \bigcup_{p \in F} \mathsf{N}_{p}$ do
9:             Enumerate($\mathsf{n}, i, w$)
10:
11: procedure Reset()
12:     $i \leftarrow i + 1$
13:     for each $p \in Q$ do
14:         $\mathsf{N}_{p} \leftarrow \emptyset$
15: procedure FireTransitions($t, i$)
16:     for each $e = (P, U, \mathcal{B}, L, q) \in \Delta$ do
17:         if $t \in U \wedge \bigwedge_{p \in P} \mathsf{H}[e, p, \overrightarrow{\mathcal{B}}_{p}(t)] \neq \emptyset$ then
18:             $\mathsf{N} \leftarrow \{\, \mathsf{H}[e, p, \overrightarrow{\mathcal{B}}_{p}(t)] \mid p \in P \,\}$
19:             $\mathsf{N}_{q} \leftarrow \mathsf{N}_{q} \cup \{\mathtt{extend}(L, i, \mathsf{N})\}$
20:
21: procedure UpdateIndices($t, i$)
22:     for each $e = (P, U, \mathcal{B}, L, q) \in \Delta$ do
23:         for each $p \in P$ and $\mathsf{n} \in \mathsf{N}_{p}$ do
24:             if $\mathsf{H}[e, p, \overleftarrow{\mathcal{B}}_{p}(t)] = \emptyset$ then
25:                 $\mathsf{H}[e, p, \overleftarrow{\mathcal{B}}_{p}(t)] \leftarrow \mathsf{n}$
26:             else
27:                 $\mathsf{n}' \leftarrow \mathsf{H}[e, p, \overleftarrow{\mathcal{B}}_{p}(t)]$
28:                 $\mathsf{H}[e, p, \overleftarrow{\mathcal{B}}_{p}(t)] \leftarrow \mathtt{union}(\mathsf{n}', \mathsf{n})$

The streaming evaluation algorithm

In Algorithm 1, we present the main procedures of the evaluation algorithm given a fixed schema $\sigma$. The algorithm receives as input a PCEA $\mathcal{P} = (Q, \mathbf{U}_{\operatorname{lin}}, \mathbf{B}_{\operatorname{eq}}, \Omega, \Delta, F)$ over $\sigma$, a window size $w \in \mathbb{N}$, and a reference to a stream $\mathcal{S}$. We assume that these inputs are globally accessible by all procedures. Recall that we can test whether $t \in U$ in linear time for any $U \in \mathbf{U}_{\operatorname{lin}}$. Further, recall that $\mathbf{B}_{\operatorname{eq}}$ are equality predicates and, for every $B \in \mathbf{B}_{\operatorname{eq}}$, there exist linear-time computable partial functions $\overleftarrow{B}$ and $\overrightarrow{B}$ such that $(t_{1}, t_{2}) \in B$ iff $\overleftarrow{B}(t_{1})$ and $\overrightarrow{B}(t_{2})$ are defined and $\overleftarrow{B}(t_{1}) = \overrightarrow{B}(t_{2})$, for every $t_{1}, t_{2} \in \operatorname{Tuples}[\sigma]$.
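To make the role of the two partial functions concrete, here is a small Python sketch of an equality predicate given by a left key and a right key. The tuple encoding `(relation_name, values)`, the helper `attr_key`, and the use of `None` for "undefined" are assumptions for this illustration and are not part of the paper's formalism.

```python
from typing import Callable, Hashable, Optional, Tuple

# Hypothetical encoding of a tuple R(a1, ..., ak) as ("R", (a1, ..., ak)).
Tup = Tuple[str, Tuple[int, ...]]
Key = Callable[[Tup], Optional[Hashable]]


def attr_key(relation: str, positions: Tuple[int, ...]) -> Key:
    """Partial function: defined only on tuples of `relation`, projecting `positions`."""
    def key(t: Tup) -> Optional[Hashable]:
        name, values = t
        if name != relation:
            return None  # undefined on tuples of other relations
        return tuple(values[p] for p in positions)
    return key


def make_equality_predicate(left: Key, right: Key) -> Callable[[Tup, Tup], bool]:
    """(t1, t2) is in B iff both keys are defined and left(t1) == right(t2)."""
    def holds(t1: Tup, t2: Tup) -> bool:
        k1, k2 = left(t1), right(t2)
        return k1 is not None and k2 is not None and k1 == k2
    return holds


# Predicate joining R(x, y) with a later S(x, y') on the first attribute.
B = make_equality_predicate(attr_key("R", (0,)), attr_key("S", (0,)))
joined = B(("R", (0, 5)), ("S", (0, 7)))
```

Because the predicate factors through two keys, a hash table indexed by the left key of earlier tuples can answer "is there a matching earlier tuple?" with a single look-up on the right key of the current tuple.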

For the algorithm, we require some data structures. First, we use the previously described data structure $\mathsf{DS}_{w}$ and its nodes $\operatorname{Nodes}(\mathsf{DS}_{w})$. Second, we consider a look-up table $\mathsf{H}$ that maps triples of the form $(e, p, d)$ to nodes in $\operatorname{Nodes}(\mathsf{DS}_{w})$, where $e \in \Delta$, $p \in Q$, and $d$ is the output of any partial function $\overleftarrow{B}$ or $\overrightarrow{B}$. We write $\mathsf{H}[e, p, d]$ for accessing its node, and $\mathsf{H}[e, p, d] \leftarrow \mathsf{n}$ for storing a node $\mathsf{n}$ at entry $(e, p, d)$. Also, we write $\mathsf{H}[e, p, d] = \emptyset$ or $\mathsf{H}[e, p, d] \neq \emptyset$ for checking whether there is a node at entry $(e, p, d)$. We assume all entries are empty at the beginning. Intuitively, for $e = (P, U, \mathcal{B}, L, q) \in \Delta$ and $p \in P$, we use $\mathsf{H}[e, p, \cdot]$ to check whether the equality predicate $\mathcal{B}_{p}$ is satisfied (here $\mathcal{B}_{p} = \mathcal{B}(p)$). As is standard in the literature [5, 18] (i.e., by adopting the RAM model), we assume that each operation over look-up tables takes constant time. Finally, we assume a set of nodes $\mathsf{N}_{p}$ for each $p \in Q$, whose use will become clear later.

Algorithm 1 starts at the main procedure Evaluation. It initializes the data structure $\mathsf{DS}_{w}$ to empty (i.e., the only node it has is the special node $\bot$) and the index $i$ for keeping the current position in the stream (lines 2-3). Then, the algorithm loops by reading the next tuple $\texttt{yield}[\mathcal{S}]$, performing the update phase (lines 5-7), followed by the enumeration phase (lines 8-9), and repeating the process. Next, we explain the update phase and the enumeration phase separately.

The update phase is composed of three steps, encoded as procedures. The first one, Reset, is in charge of starting a new iteration by updating $i$ to the next position and emptying the sets $\mathsf{N}_{p}$ (lines 12-14). The second step, FireTransitions, uses the new tuple $t$ to fire all transitions $e = (P, U, \mathcal{B}, L, q) \in \Delta$ of $\mathcal{P}$ (lines 16-19). We do this by checking whether $t$ satisfies $U$ and all equality predicates $\{\mathcal{B}_{p}\}_{p \in P}$ (line 17). The main intuition is that the algorithm stores partial runs in the look-up table $\mathsf{H}$, whose outputs are represented by nodes in $\mathsf{DS}_{w}$. Then the call $\mathsf{H}[e, p, \overrightarrow{\mathcal{B}}_{p}(t)]$ is used to verify the equality $\overleftarrow{\mathcal{B}}_{p}(t') = \overrightarrow{\mathcal{B}}_{p}(t)$ for some previous tuple $t'$. Furthermore, if $\mathsf{H}[e, p, \overrightarrow{\mathcal{B}}_{p}(t)]$ is non-empty, it contains the node that represents all runs that have reached $p$. If $U$ and all predicates $\{\mathcal{B}_{p}\}_{p \in P}$ are satisfied, we collect all nodes at states $P$ in the set $\mathsf{N}$ (line 18), and symbolically extend these runs by using the method $\mathtt{extend}(L, i, \mathsf{N})$ of $\mathsf{DS}_{w}$. We collect the output node of $\mathtt{extend}$ in the set $\mathsf{N}_{q}$ for use in the next procedure, UpdateIndices.

The last step of the update phase, UpdateIndices, updates the look-up table $\mathsf{H}$ by using $t$ and the nodes stored in the sets $\{\mathsf{N}_{p}\}_{p \in Q}$ (lines 22-28). Intuitively, the nodes in $\mathsf{N}_{p}$ represent new runs (i.e., valuations) that reached state $p$ when reading $t$. Then, for every transition $e = (P, U, \mathcal{B}, L, q) \in \Delta$ such that $p \in P$, we want to update the entry $(e, p, \overleftarrow{\mathcal{B}}_{p}(t))$ of $\mathsf{H}$ with the nodes from $\mathsf{N}_{p}$, so that they are ready to be fired by future tuples. For this goal, we check each $\mathsf{n} \in \mathsf{N}_{p}$ and, if $\mathsf{H}[e, p, \overleftarrow{\mathcal{B}}_{p}(t)]$ is empty, we just place $\mathsf{n}$ at the entry $(e, p, \overleftarrow{\mathcal{B}}_{p}(t))$ (lines 23-25). Otherwise, we use the union operator of $\mathsf{DS}_{w}$ to combine the previous outputs with the new ones of $\mathsf{n}$ (lines 26-28). Note that the call to $\mathtt{union}(\mathsf{n}', \mathsf{n})$ satisfies the requirements of this operator, given that $\mathsf{n}$ was created recently.

Based on the previous description, the enumeration phase is straightforward. Given that the nodes in $\{\mathsf{N}_{p}\}_{p \in Q}$ represent new runs at the last position, $\bigcup_{p \in F} \mathsf{N}_{p}$ contains all new runs that reached some final state. Then, for each node $\mathsf{n} \in \bigcup_{p \in F} \mathsf{N}_{p}$ we call the procedure $\textsc{Enumerate}(\mathsf{n}, i, w)$, which enumerates all valuations in $\llbracket \mathsf{n} \rrbracket^{w}_{i}$. Theorem 5.2 shows that this method exists with the desired guarantees, given that $\mathcal{P}$ is unambiguous, which implies that $\mathsf{DS}_{w}$ is simple. Further, runs correspond to valuations, namely, $\llbracket \mathsf{n} \rrbracket^{w}_{i}$ is a set, and, thus, we enumerate the outputs without repetitions.

Proposition 5.4.

For every unambiguous PCEA $\mathcal{P}$ with equality predicates, $w \in \mathbb{N}$, stream $\mathcal{S}$, and position $i \in \mathbb{N}$, Algorithm 1 enumerates all valuations $\llbracket \mathcal{P} \rrbracket^{w}_{i}(\mathcal{S})$ without repetitions.

We end by discussing the update time of Algorithm 1. By inspection, one can check that the update phase performs a linear pass over $\Delta$, where each iteration takes linear time in the size of the transition. Overall, we make at most $\mathcal{O}(|\mathcal{P}|)$ calls to unary predicates, the look-up table, or the data structure $\mathsf{DS}_{w}$. Each call to a unary predicate takes $\mathcal{O}(|t|)$ time and, thus, at most $\mathcal{O}(|\mathcal{P}| \cdot |t|)$ time in total. The operations on the look-up table and $\mathtt{extend}$ take constant time. In contrast, we perform at most $\mathcal{O}(|\mathcal{P}|)$ unions over the same position $i$. By Proposition 5.3, each $\mathtt{union}$ takes time $\mathcal{O}(\log(|\mathcal{P}| \cdot w))$. Summing up, the update time is $\mathcal{O}(|\mathcal{P}| \cdot |t| + |\mathcal{P}| \cdot \log(|\mathcal{P}|) + |\mathcal{P}| \cdot \log(w))$.

6 Future work

We presented an automata model for CER that expresses HCQ and can be evaluated in a streaming fashion under a sliding window with logarithmic update time and output-linear delay. These results achieve the primary goal of this paper but leave several directions for future work. First, it would be interesting to define a query language that characterizes the expressive power of PCEA. Second, one would like to understand a disambiguation procedure to convert any PCEA into an unambiguous PCEA, or to decide when this is possible. Last, we studied here algorithms for PCEA with equality predicates, but the model works for any binary predicate. Then, it would be interesting to understand for which other predicates (e.g., inequalities) the model still admits efficient streaming evaluation.

References

  • [1] A. V. Aho and J. E. Hopcroft. The design and analysis of computer algorithms. Addison-Wesley, 1974.
  • [2] A. Amarilli, P. Bourhis, L. Jachiet, and S. Mengel. A circuit-based approach to efficient enumeration. In ICALP, volume 80 of LIPIcs, pages 111:1–111:15, 2017.
  • [3] A. Amarilli, L. Jachiet, M. Muñoz, and C. Riveros. Efficient enumeration for annotated grammars. In PODS, pages 291–300. ACM, 2022.
  • [4] A. Artikis, A. Margara, M. Ugarte, S. Vansummeren, and M. Weidlich. Complex event recognition languages: Tutorial. In DEBS, pages 7–10. ACM, 2017.
  • [5] C. Berkholz, J. Keppeler, and N. Schweikardt. Answering conjunctive queries under updates. In PODS, pages 303–318, 2017.
  • [6] M. Bucchi, A. Grez, A. Quintana, C. Riveros, and S. Vansummeren. CORE: a complex event recognition engine. VLDB, 15(9):1951–1964, 2022.
  • [7] A. K. Chandra, D. C. Kozen, and L. J. Stockmeyer. Alternation. Journal of the ACM (JACM), 28(1):114–133, 1981.
  • [8] S. Chaudhuri and M. Y. Vardi. Optimization of real conjunctive queries. In PODS, pages 59–70, 1993.
  • [9] R. Chirkova, J. Yang, et al. Materialized views. Foundations and Trends® in Databases, 4(4):295–405, 2012.
  • [10] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms. MIT press, 2022.
  • [11] G. Cugola and A. Margara. Processing flows of information: From data stream to complex event processing. ACM Computing Surveys (CSUR), 44(3):1–62, 2012.
  • [12] N. N. Dalvi and D. Suciu. The dichotomy of conjunctive queries on probabilistic structures. In PODS, pages 293–302, 2007.
  • [13] J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. E. Tarjan. Making data structures persistent. In STOC, pages 109–121, 1986.
  • [14] N. Giatrakos, E. Alevizos, A. Artikis, A. Deligiannakis, and M. N. Garofalakis. Complex event recognition in the big data era: a survey. VLDB J., 29(1):313–352, 2020.
  • [15] E. Grandjean and L. Jachiet. Which arithmetic operations can be performed in constant time in the RAM model with addition? CoRR, abs/2206.13851, 2022.
  • [16] A. Grez and C. Riveros. Towards Streaming Evaluation of Queries with Correlation in Complex Event Processing. In ICDT, volume 155, pages 14:1–14:17, 2020.
  • [17] A. Grez, C. Riveros, M. Ugarte, and S. Vansummeren. A formal framework for complex event recognition. ACM Trans. Database Syst., 46(4):16:1–16:49, 2021.
  • [18] M. Idris, M. Ugarte, and S. Vansummeren. The dynamic yannakakis algorithm: Compact and efficient query processing under updates. In SIGMOD, pages 1259–1274, 2017.
  • [19] M. Idris, M. Ugarte, S. Vansummeren, H. Voigt, and W. Lehner. General dynamic yannakakis: conjunctive queries with theta joins under updates. VLDB J., 29(2-3):619–653, 2020.
  • [20] A. Kara, M. Nikolic, D. Olteanu, and H. Zhang. Pods. pages 375–392. ACM, 2020.
  • [21] Q. Lin, B. C. Ooi, Z. Wang, and C. Yu. Scalable distributed stream join processing. In SIGMOD, pages 811–825, 2015.
  • [22] M. Muñoz and C. Riveros. Streaming enumeration on nested documents. In ICDT, volume 220 of LIPIcs, pages 19:1–19:18, 2022.
  • [23] M. Muñoz and C. Riveros. Constant-delay enumeration for slp-compressed documents. In ICDT, volume 255 of LIPIcs, pages 7:1–7:17, 2023.
  • [24] F. Neven. Automata theory for XML researchers. SIGMOD Record, 31(3):39–46, 2002.
  • [25] P. D. Stotts and W. W. Pugh. Parallel finite automata for modeling concurrent software systems. J. Syst. Softw., 27(1):27–43, 1994.
  • [26] N. Tziavelis, W. Gatterbauer, and M. Riedewald. Beyond equi-joins: Ranking, enumeration and factorization. VLDB, 14(11):2599–2612, 2021.
  • [27] M. Ugarte and S. Vansummeren. On the difference between complex event processing and dynamic query evaluation. In AMW, volume 2100, 2018.
  • [28] Q. Wang, X. Hu, B. Dai, and K. Yi. Change propagation without joins. VLDB, 16(5):1046–1058, 2023.
  • [29] Q. Wang and K. Yi. Conjunctive queries with comparisons. In SIGMOD, pages 108–121. ACM, 2022.
  • [30] E. Wu, Y. Diao, and S. Rizvi. High-performance complex event processing over streams. In SIGMOD, pages 407–418, 2006.
  • [31] J. Xie and J. Yang. A survey of join processing in data streams. Data Streams: Models and Algorithms, pages 209–236, 2007.

Appendix A Proofs of Section 3

Proof of Proposition 3.2

Proof.

To prove this statement, we follow the same principle used in the subset construction. To simulate all possible run trees of a PFA with a DFA, we start at the leaves, with all initial states. Then, for each symbol, we move up the tree, firing all transitions that use a subset of the current set of states. At the end of the string, if the last set contains a final state, then one can construct a run tree that accepts the input.

Let $\mathcal{P} = (Q, \Sigma, \Delta, I, F)$ be a parallelized finite automaton. We build the DFA $\mathcal{A} = (2^{Q}, \Sigma, \delta, I, F')$ such that $F' = \{P \mid P \cap F \neq \emptyset\}$ and $\delta(P, a) = \{q \mid \exists P' \subseteq P.\, (P', a, q) \in \Delta\}$ for every $P \subseteq Q$ and $a \in \Sigma$. We now prove that both automata define the same language.

$\mathcal{L}(\mathcal{P}) \subseteq \mathcal{L}(\mathcal{A})$

Let $\bar{s} = a_{1} \ldots a_{n} \in \Sigma^{\ast}$ be a string such that $\bar{s} \in \mathcal{L}(\mathcal{P})$ and let $\tau : t \rightarrow Q$ be an accepting run tree of $\mathcal{P}$ over $\bar{s}$. We need to prove that the run $\rho : S_{n} \xrightarrow{a_{1}} S_{n-1} \xrightarrow{a_{2}} \ldots \xrightarrow{a_{n}} S_{0}$ is an accepting run of $\mathcal{A}$ over $\bar{s}$, i.e., $S_{0} \in F'$. To this end, we define $L_{i} = \{\tau(\bar{u}) \mid \operatorname{depth}_{\tau}(\bar{u}) = i\}$ as the set of states labeling $\tau$ at depth $i$ and prove that $L_{i} \subseteq S_{i}$ for all $0 \leq i \leq n$. Since $L_{0} = \{\tau(\varepsilon)\}$ and $\tau(\varepsilon) \in F$, this in turn means that $S_{0} \cap F \neq \emptyset$ and $S_{0} \in F'$.

For every leaf node $\bar{u}$ it holds that $\operatorname{depth}_{\tau}(\bar{u}) = n$ and $\tau(\bar{u}) \in I$, meaning $L_{n} \subseteq S_{n} = I$. Now assume that $L_{i+1} \subseteq S_{i+1}$ for some $i < n$. For every inner node $\bar{v}$ at depth $i$ there must be a transition $(P, a_{n-i}, q) \in \Delta$ such that $\tau(\bar{v}) = q$ and $P = \{\tau(\bar{u}) \mid \bar{u} \in \texttt{children}_{\tau}(\bar{v})\}$. The children of $\bar{v}$ are at depth $i+1$, so $P \subseteq L_{i+1} \subseteq S_{i+1}$ and, following the definition of $\delta$, $q \in \delta(S_{i+1}, a_{n-i}) = S_{i}$. Since this is true for every node at depth $i$, we have that $L_{i} \subseteq S_{i}$.

Given that $L_{0} \subseteq S_{0}$, we know that $S_{0} \in F'$, which means that $\rho$ is an accepting run of $\mathcal{A}$ over $\bar{s}$ and therefore $\mathcal{L}(\mathcal{P}) \subseteq \mathcal{L}(\mathcal{A})$.

$\mathcal{L}(\mathcal{A}) \subseteq \mathcal{L}(\mathcal{P})$

Let $\bar{s} = a_{1} \ldots a_{n} \in \Sigma^{\ast}$ be a string such that $\bar{s} \in \mathcal{L}(\mathcal{A})$ and let $\rho : S_{n} \xrightarrow{a_{1}} S_{n-1} \xrightarrow{a_{2}} \ldots \xrightarrow{a_{n}} S_{0}$ be the run of $\mathcal{A}$ over $\bar{s}$. We can now construct a run tree of $\mathcal{P}$ over $\bar{s}$.

Since $\rho$ is an accepting run, we know that $S_{0} \cap F \neq \emptyset$. We define $\tau : t \rightarrow Q$ such that $\tau(\varepsilon) = f$ with $f \in S_{0} \cap F$. If we consider a node $\bar{v} \in t$ at depth $i < n$ such that $\tau(\bar{v}) = q$ and $q \in S_{i}$, we can follow the definition of $\delta$ and inductively add nodes to $\tau$ according to a transition $(P, a_{n-i}, q) \in \Delta$ with $P \subseteq S_{i+1}$, so that $|\texttt{children}_{\tau}(\bar{v})| = |P|$ and $P = \{\tau(\bar{u}) \mid \bar{u} \in \texttt{children}_{\tau}(\bar{v})\}$. For every leaf node $\bar{v}$ it holds that $\operatorname{depth}_{\tau}(\bar{v}) = n$ and, since $S_{n} = I$, all of them will be labeled by initial states.

The labeled tree $\tau$ we just constructed is an accepting run of $\mathcal{P}$ over $\bar{s}$, meaning $\mathcal{L}(\mathcal{A}) \subseteq \mathcal{L}(\mathcal{P})$ and, therefore, $\mathcal{L}(\mathcal{P}) = \mathcal{L}(\mathcal{A})$. ∎
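The construction in this proof can be sketched in Python by simulating the DFA on the fly instead of materializing all of $2^{Q}$. This is a minimal sketch under the definition of $\delta$ given above; the function names `dfa_step` and `accepts`, the string encoding of states and symbols, and the example automaton are assumptions for illustration.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# A PFA transition (P', a, q): fire on symbol a when all states in P' are present.
Transition = Tuple[FrozenSet[str], str, str]


def dfa_step(delta: Iterable[Transition], current: FrozenSet[str], symbol: str) -> FrozenSet[str]:
    """delta(P, a) = { q | exists P' subset of P with (P', a, q) in Delta }."""
    return frozenset(q for (src, a, q) in delta if a == symbol and src <= current)


def accepts(delta: Iterable[Transition], initial: Set[str], final: Set[str], word: str) -> bool:
    """Run the subset-construction DFA on `word`, starting from the set I."""
    delta = list(delta)
    current = frozenset(initial)
    for symbol in word:
        current = dfa_step(delta, current, symbol)
    # The DFA state is accepting iff it contains a final state of the PFA.
    return bool(current & frozenset(final))


# Hypothetical PFA: 'f' needs both 'p' and 'q', which are reached in parallel from 's'.
example_delta = [
    (frozenset({"s"}), "a", "p"),
    (frozenset({"s"}), "a", "q"),
    (frozenset({"p", "q"}), "b", "f"),
]
result_ab = accepts(example_delta, {"s"}, {"f"}, "ab")
```

On the word `ab` the simulated sets are `{s}`, `{p, q}`, `{f}`, so the word is accepted; the transition into `f` fires only because both `p` and `q` are in the current set, which mirrors a run-tree node with two children.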

Proof of Proposition 3.4

Proof.

To prove this statement, we just need to find a PCEA $\mathcal{P}$ with no equivalent CCEA, i.e., such that there is no CCEA $\mathcal{C}$ with $\llbracket \mathcal{P} \rrbracket(\mathcal{S}) = \llbracket \mathcal{C} \rrbracket(\mathcal{S})$ for every stream $\mathcal{S}$. Let $\mathcal{P}$ be the PCEA represented in Figure 2; then $\mathcal{P} = (Q, \mathbf{U}, \mathbf{B}, \Omega, \Delta, F)$, with $Q = \{R(x,y), S(x,y), T(x), x, y\}$, $\Omega = \{R, S, T\}$, $F = \{x\}$ and:

\[
\begin{array}{rcl}
\Delta & = & \big\{ (\emptyset, U_{R(x,y)}, \emptyset, \{R(x,y)\}, R(x,y)), \\
& & (\emptyset, U_{S(x,y)}, \emptyset, \{S(x,y)\}, S(x,y)), \\
& & (\emptyset, U_{T(x)}, \emptyset, \{T(x)\}, T(x)), \\
& & (\{R(x,y), T(x)\}, U_{S(x,y)}, \{(R(x,y), B_{R(x,y),S(x,y)}), (T(x), B_{T(x),S(x,y)})\}, \{S(x,y)\}, x), \\
& & (\{S(x,y), T(x)\}, U_{R(x,y)}, \{(S(x,y), B_{S(x,y),R(x,y)}), (T(x), B_{T(x),R(x,y)})\}, \{R(x,y)\}, x), \\
& & (\{R(x,y)\}, U_{S(x,y)}, \{(R(x,y), B_{R(x,y),S(x,y)})\}, \{S(x,y)\}, y), \\
& & (\{S(x,y)\}, U_{R(x,y)}, \{(S(x,y), B_{S(x,y),R(x,y)})\}, \{R(x,y)\}, y), \\
& & (\{y\}, U_{T(x)}, \{(y, B_{y,T(x,y)})\}, \{T(x)\}, x) \big\}
\end{array}
\]

with the predicates $U_{R(\bar{x})}$ and $B_{R(\bar{x}),S(\bar{y})}$ defined as:

\[ U_{R(\bar{x})} \ :=\ \{R(\bar{a}) \in \operatorname{Tuples}[\sigma] \mid \exists h \in \operatorname{Hom}.\; h(R(\bar{x})) = R(\bar{a})\} \]

and:

\[ B_{R(\bar{x}),S(\bar{y})} \ :=\ \{(R(\bar{a}), S(\bar{b})) \mid \exists h \in \operatorname{Hom}.\; h(R(\bar{x})) = R(\bar{a}) \wedge h(S(\bar{y})) = S(\bar{b})\}. \]

Let $\mathcal{S}_{i} = \{\!\!\{R(0,i), T(0), S(0,i), \ldots\}\!\!\}$ be a family of streams over the set of data values $\mathbf{D} = \mathbb{N}$ with $i \in \mathbb{N}$. It is clear that the valuation $\{\!\!\{0,1,2\}\!\!\} \in \llbracket \mathcal{P} \rrbracket(\mathcal{S}_{i})$ for every $i \in \mathbb{N}$. Let $\mathcal{C} = (Q', \mathbf{U}', \mathbf{B}', \Omega', \Delta', I', F')$ be a deterministic CCEA such that $\llbracket \mathcal{C} \rrbracket(\mathcal{S}_{i}) = \llbracket \mathcal{P} \rrbracket(\mathcal{S}_{i})$ for every $i \in \mathbb{N}$. This means that for every stream $\mathcal{S}_{i}$, there is an accepting run of $\mathcal{C}$ over $\mathcal{S}_{i}$ of the form $\rho_{i} : q_{i,0} \xrightarrow{R(0,i)} q_{i,1} \xrightarrow{T(0)} q_{i,2} \xrightarrow{S(0,i)} q_{i,3}$.

Since $\mathcal{C}$ has a finite number of states, we know that there must be two streams, $\mathcal{S}_{j}$ and $\mathcal{S}_{k}$ with $j \neq k$, with accepting runs $\rho_{j} : q_{j,0} \xrightarrow{R(0,j)} q_{j,1} \xrightarrow{T(0)} q_{j,2} \xrightarrow{S(0,j)} q_{j,3}$ and $\rho_{k} : q_{k,0} \xrightarrow{R(0,k)} q_{k,1} \xrightarrow{T(0)} q_{k,2} \xrightarrow{S(0,k)} q_{k,3}$, respectively, such that $q_{j,\ell} = q_{k,\ell}$ for every $0 \leq \ell \leq 3$.

Given the run $\rho_{k}$ of $\mathcal{C}$, we know that there must be a transition $(q_{k,2}, U, B, \omega, q_{k,3}) \in \Delta'$ such that $S(0,k) \in U$ and $(T(0), S(0,k)) \in B$. Since $q_{k,2} = q_{j,2}$ and $q_{k,3} = q_{j,3}$, the following is an accepting run of $\mathcal{C}$ over the stream $\mathcal{S}_{j,k} = \{\!\!\{R(0,j), T(0), S(0,k)\}\!\!\}$: $\rho_{j,k} : q_{j,0} \xrightarrow{R(0,j)} q_{j,1} \xrightarrow{T(0)} q_{j,2} \xrightarrow{S(0,k)} q_{j,3}$.

We can easily check that there are no accepting runs of $\mathcal{P}$ over $\mathcal{S}_{j,k}$, meaning $\llbracket \mathcal{P} \rrbracket(\mathcal{S}_{j,k}) \neq \llbracket \mathcal{C} \rrbracket(\mathcal{S}_{j,k})$, and therefore there is no CCEA $\mathcal{C}$ such that $\llbracket \mathcal{P} \rrbracket(\mathcal{S}) = \llbracket \mathcal{C} \rrbracket(\mathcal{S})$ for every stream $\mathcal{S}$. ∎

Appendix B Proofs of Section 4

Proof of equivalence between CQ bag-semantics

Fix a schema $\sigma$, a set of data values $\mathbf{D}$ and a CQ $Q$ over $\sigma$ of the form:

\[ Q(\bar{x}) \ \leftarrow\ R_{0}(\bar{x}_{0}), \ldots, R_{m-1}(\bar{x}_{m-1}) \]

To prove that $\llbracket Q \rrbracket(D) = \lceil\lceil Q \rfloor\rfloor(D)$ we need to prove that $\llbracket Q \rrbracket(D) \subseteq \lceil\lceil Q \rfloor\rfloor(D)$ and $\lceil\lceil Q \rfloor\rfloor(D) \subseteq \llbracket Q \rrbracket(D)$, where $\lceil\lceil Q \rfloor\rfloor(D) \subseteq \llbracket Q \rrbracket(D)$ if $U(\lceil\lceil Q \rfloor\rfloor(D)) \subseteq U(\llbracket Q \rrbracket(D))$ and, for every $a$ in $I(\lceil\lceil Q \rfloor\rfloor(D))$, $\operatorname{mult}_{\lceil\lceil Q \rfloor\rfloor(D)}(a) \leq \operatorname{mult}_{\llbracket Q \rrbracket(D)}(a)$.

Both $\llbracket Q \rrbracket(D)$ and $\lceil\lceil Q \rfloor\rfloor(D)$ map every atom of $Q$ to the database $D$, meaning $U(\lceil\lceil Q \rfloor\rfloor(D)) = U(\llbracket Q \rrbracket(D))$. So we just need to prove that, for every $Q(\bar{a})$ in $I(\lceil\lceil Q \rfloor\rfloor(D))$, $\operatorname{mult}_{\lceil\lceil Q \rfloor\rfloor(D)}(Q(\bar{a})) = \operatorname{mult}_{\llbracket Q \rrbracket(D)}(Q(\bar{a}))$.

By following the definitions given previously for the multiplicities we get:

\[ \operatorname{mult}_{\llbracket Q \rrbracket(D)}(Q(\bar{a})) = \big|\{j \mid \llbracket Q \rrbracket(D)(j) = Q(\bar{a})\}\big| \]

We also know that if $\llbracket Q \rrbracket(D)(j) = Q(\bar{a})$ holds, there must be a t-homomorphism $\eta$ such that $Q(h_{\eta}(\bar{x})) = Q(\bar{a})$. Since every t-homomorphism can map more than one atom, it is clear that:

\begin{align*}
\operatorname{mult}_{\llbracket Q \rrbracket(D)}(Q(\bar{a})) &= \bigg\lvert \bigcup_{\substack{\eta \in \text{t-Hom}(Q,D) \colon \\ h_{\eta}(\bar{x}) = \bar{a}}} \{j \mid \llbracket Q \rrbracket(D)(j) = Q(h_{\eta}(\bar{x}))\} \bigg\rvert \\
&= \sum_{\substack{\eta \in \text{t-Hom}(Q,D) \colon \\ h_{\eta}(\bar{x}) = \bar{a}}} \big|\{j \mid \llbracket Q \rrbracket(D)(j) = Q(h_{\eta}(\bar{x}))\}\big| \\
&= \sum_{\substack{\eta \in \text{t-Hom}(Q,D) \colon \\ h_{\eta}(\bar{x}) = \bar{a}}} \operatorname{mult}_{Q,D}(h_{\eta})
\end{align*}

As was stated before, there is a correspondence between t-homomorphisms and homomorphisms, meaning:

\begin{align*}
\operatorname{mult}_{\llbracket Q \rrbracket(D)}(Q(\bar{a})) &= \sum_{\substack{h \in \text{Hom}(Q,D) \colon \\ h(\bar{x}) = \bar{a}}} \operatorname{mult}_{Q,D}(h) \\
&= \operatorname{mult}_{\lceil\lceil Q \rfloor\rfloor(D)}(Q(\bar{a}))
\end{align*}

Given that $\operatorname{mult}_{\lceil\lceil Q \rfloor\rfloor(D)}(Q(\bar{a})) = \operatorname{mult}_{\llbracket Q \rrbracket(D)}(Q(\bar{a}))$ for every $Q(\bar{a})$ in $I(\lceil\lceil Q \rfloor\rfloor(D))$ and that both bags have the same underlying set, we conclude that for every CQ $Q$ and database $D$ it holds that $\llbracket Q \rrbracket(D) = \lceil\lceil Q \rfloor\rfloor(D)$.

Proof of Theorem 4.1

Proof.

Fix a schema $\sigma$ and fix a HCQ $Q$ over $\sigma$ of the form:

\[ Q(\bar{x}) \ \leftarrow\ R_{0}(\bar{x}_{0}), \ldots, R_{m-1}(\bar{x}_{m-1}). \]

Without loss of generality, we assume that $Q$ has at least two atoms, i.e., $m \geq 2$. If not, it is straightforward to construct a PCEA for $Q$. Further, for the sake of simplicity, we assume that $Q$ does not have data values (i.e., constants); all the constructions below work with data values in the atoms with the additional cost of differentiating the variables from the data values in the set $\{\bar{x}_{i}\}$. Given that $Q$ is full, this means that $\{\bar{x}\} = \bigcup_{i=0}^{m-1} \{\bar{x}_{i}\}$. For this reason, we will use $\{\bar{x}\}$ to refer to the set of all variables in $Q$. Recall that we usually consider $Q$ as a bag of atoms, namely, $Q = \{\!\!\{R_{0}(\bar{x}_{0}), \ldots, R_{m-1}(\bar{x}_{m-1})\}\!\!\}$. Then $I(Q)$ is the set of all the identifiers $\{0, \ldots, m-1\}$ and $U(Q)$ is the set of all different atoms in $Q$. We say that $Q$ is connected iff there exists a variable $x \in \{\bar{x}\}$ such that $x \in \{\bar{x}_{i}\}$ for every $i < m$. (Footnote 1: Given that $Q$ is hierarchical, this definition is equivalent to the notion of connected CQ, i.e., that the Gaifman graph of $Q$ is connected.)

In the following, we first give the proof of the theorem for connected HCQ without self joins, then move to the case of connected HCQ with self joins, and end with the general case.

Connected HCQ without self joins

A q-tree for $Q$ is a labeled tree, $\tau_Q: t \rightarrow I(Q) \cup \{\bar{x}\}$, where for every $x \in \{\bar{x}\}$ there is a unique inner node $\bar{u} \in t$ such that $\tau_Q(\bar{u}) = x$, and for every $i \in I(Q)$ there is a unique leaf node $\bar{v} \in t$ such that $\tau_Q(\bar{v}) = i$ and, if $\bar{u}_1, \ldots, \bar{u}_k$ are the inner nodes of the path from the root to $\bar{v}$, then $\{\bar{x}_i\} = \{\tau_Q(\bar{u}_1), \ldots, \tau_Q(\bar{u}_k)\}$, i.e., $\{\bar{x}_i\} = \{\tau_Q(\bar{u}) \mid \bar{u} \in \texttt{ancst}_{\tau_Q}(\bar{v}) \wedge \bar{u} \neq \bar{v}\}$. As an example, in Figure 3 we can see a q-tree associated with $Q_1(x,y,z,v,w) \leftarrow \{\!\!\{R(x,y,z), S(x,y,v), T(x,w), U(x,y)\}\!\!\}$ and two valid q-trees for the query $Q_2(x,y,z,v) \leftarrow \{\!\!\{R(x,y,z), R(x,y,v), U(x,y)\}\!\!\}$. Note that we use the identifiers of the atoms (instead of the atoms themselves), so if an atom appears $n$ times in the query, there will be $n$ different leaf nodes for the same atom.
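The definition can be made concrete with a small sketch. It assumes a query encoded as a dictionary from atom identifiers to variable tuples (a hypothetical encoding, not one used in the paper). The function orders the variables of each atom by the number of atoms that mention them and inserts the resulting paths as in a trie. For a connected hierarchical query this should produce one valid q-tree; otherwise an error is raised. The name `build_q_tree` and the parent-map representation are illustrative assumptions.

```python
def build_q_tree(atoms):
    """Build a q-tree as a parent map (node -> parent, root -> None).

    `atoms` maps each atom identifier (an int) to its tuple of variables
    (strings). This encoding is an assumption made for illustration.
    """
    # atoms_of[x] = identifiers of the atoms that mention variable x
    atoms_of = {}
    for ident, variables in atoms.items():
        for x in variables:
            atoms_of.setdefault(x, set()).add(ident)

    # Hierarchical condition: atoms(x) and atoms(y) are nested or disjoint.
    for x in atoms_of:
        for y in atoms_of:
            a, b = atoms_of[x], atoms_of[y]
            if a & b and not (a <= b or b <= a):
                raise ValueError(f"not hierarchical: {x} and {y}")

    parent = {}
    for ident, variables in atoms.items():
        # Variables shared by more atoms go closer to the root; ties are
        # broken by name so that every atom sees the same order.
        path = sorted(set(variables), key=lambda x: (-len(atoms_of[x]), x))
        prev = None
        for node in path + [ident]:
            parent[node] = prev
            prev = node

    roots = [n for n, p in parent.items() if p is None]
    if len(roots) != 1:
        raise ValueError("query is not connected")
    return parent


# Q1 from the text, with identifiers 0..3 standing for R, S, T, U.
Q1 = {0: ("x", "y", "z"), 1: ("x", "y", "v"), 2: ("x", "w"), 3: ("x", "y")}
tree_q1 = build_q_tree(Q1)
```

Breaking ties by variable name is an arbitrary choice: when two variables occur in exactly the same atoms (as $x$ and $y$ do in $Q_2$), either order gives a valid q-tree.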

[Tree diagrams not recoverable from the extracted text; the three panels are labeled $\tau_{Q_1}$, $\tau_{Q_2}$ and $\tau_{Q_2}^{\prime}$.]
Figure 3: Example of valid q-trees for $Q_1$ and $Q_2$. For the sake of readability we represent each node $\bar{u}$ by the variable/atom in the underlying set of $Q$ that labels the node, $Q(\tau_Q(\bar{u}))$. Since we have no self joins in $Q_1$, we omit the variables when representing the atoms.
Theorem B.1 ([5]).

QQ is hierarchical and connected iff there exists a q-tree for QQ.

In the following, we fix a q-tree for QQ that we denote by τQ\tau_{Q}. Furthermore, given that there is a one-to-one correspondence between nodes of τQ\tau_{Q} and the set I(Q){x¯}I(Q)\cup\{\bar{x}\}, by some abuse of notation, we will usually use variables and identifiers as nodes in τQ\tau_{Q} (e.g., descτQ(x)\texttt{desc}_{\tau_{Q}}(x) or ancstτQ(i)\texttt{ancst}_{\tau_{Q}}(i)).

We define the compact q-tree of an HCQ $Q$ as the result of taking the original q-tree of the query and removing all inner nodes with a single child. In other words, to generate the compact q-tree $\tau_Q^c$ from $\tau_Q$, we copy $\tau_Q$, then for each node $\bar{u} \in \tau_Q^c$ such that $\texttt{children}_{\tau_Q^c}(\bar{u}) = \{\bar{v}\}$, we remove $\bar{v}$ and $\tau_Q^c(\bar{v})$ from $\tau_Q^c$, and we redefine $\texttt{children}_{\tau_Q^c}(\bar{u}) = \texttt{children}_{\tau_Q^c}(\bar{v})$. We can see in Figure 4 the compact q-trees associated with each of the q-trees of Figure 3. Note that for every q-tree $\tau_Q$, the root of $\tau_Q^c$ is always a variable and has at least two children, since we assume that $Q$ has at least two atoms (and then $\tau_Q$ has at least two leaves). For simplification, from now on, we assume that $\tau_Q$ is compact. If not, all the constructions below hold verbatim by replacing $\tau_Q$ with $\tau_Q^c$ and restricting to the variables that appear in $\tau_Q^c$.
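The following sketch gives one possible reading of the compaction step, assuming a q-tree stored as a parent map (node to parent, with `None` for the root). The splice rule is an interpretation, not a statement of the paper's exact procedure: a chain of single-child variables collapses into its topmost variable, and a variable whose only child is a leaf is replaced by that leaf, so that leaves remain atom identifiers. The function name and representation are hypothetical.

```python
def compact_q_tree(parent):
    """Compact a q-tree given as a parent map (node -> parent, root -> None)."""
    children = {}
    root = None
    for node, p in parent.items():
        if p is None:
            root = node
        else:
            children.setdefault(p, []).append(node)

    compact = {}

    def visit(node, new_parent):
        kids = children.get(node, [])
        while len(kids) == 1:
            only = kids[0]
            if only not in children:
                # The only child is a leaf: the leaf takes the node's place.
                node, kids = only, []
            else:
                # The only child is an inner node: drop it, adopt its children.
                kids = children[only]
        compact[node] = new_parent
        for k in kids:
            visit(k, node)

    visit(root, None)
    return compact


# The q-tree of Q1 (identifiers 0..3 stand for R, S, T, U).
TAU_Q1 = {"x": None, "y": "x", "z": "y", 0: "z", "v": "y", 1: "v",
          "w": "x", 2: "w", 3: "y"}
compact_q1 = compact_q_tree(TAU_Q1)
```

Under this reading, the compact tree of $Q_1$ keeps only the variables $x$ and $y$: the atoms $R$, $S$ and $U$ hang from $y$, while $T$ and $y$ hang from $x$.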

[Tree diagrams not recoverable from the extracted text; the three panels are labeled $\tau^c_{Q_1}$, $\tau^c_{Q_2}$ and $\tau^{\prime c}_{Q_2}$.]
Figure 4: Example of compact q-trees for $\tau_{Q_1}$, $\tau_{Q_2}$ and $\tau_{Q_2}^{\prime}$. Since we have no self joins in $Q_1$, we omit the variables when representing the atoms.

We can now proceed with the proof. From QQ we construct the PCEA:

\[ \mathcal{P}_Q = (I(Q) \cup \{\bar{x}\}, \mathbf{U}_{\operatorname{lin}}, \mathbf{B}_{\operatorname{eq}}, I(Q), \Delta, \{\tau_Q(\varepsilon)\}) \]

where the states of the automaton are the nodes of $\tau_Q$, the labeling set is the set of atom identifiers of the query $Q$, and the only final state is the variable at the root of $\tau_Q$ (i.e., $\tau_Q(\varepsilon) = \tau_Q(\texttt{root}(\tau_Q))$). For defining the transition relation $\Delta$, we need additional definitions regarding predicates and some further constructs over $\tau_Q$.

Below, we use h:{x¯}DDh:\{\bar{x}\}\cup\textbf{D}\rightarrow\textbf{D} to denote a homomorphism and Hom\operatorname{Hom} to denote the set of all homomorphisms. Let x{x¯}x\in\{\bar{x}\} be a variable and let S(y¯),T(z¯)U(Q)S(\bar{y}),T(\bar{z})\in U(Q) be atoms of QQ. We define the unary predicate of S(y¯)S(\bar{y}) as:

\[ U_{S(\bar{y})} \ :=\ \{S(\bar{b}) \in \operatorname{Tuples}[\sigma] \mid \exists h \in \operatorname{Hom}.\; h(S(\bar{y})) = S(\bar{b})\} \]

and the binary predicate of S(y¯)S(\bar{y}) and T(z¯)T(\bar{z}) as:

\[ B_{S(\bar{y}),T(\bar{z})} \ :=\ \{(S(\bar{b}), T(\bar{c})) \mid \exists h \in \operatorname{Hom}.\; h(S(\bar{y})) = S(\bar{b}) \wedge h(T(\bar{z})) = T(\bar{c})\}. \]

One can easily check that US(y¯)UlinU_{S(\bar{y})}\in\textbf{U}_{\operatorname{lin}} and BS(y¯),T(z¯)B_{S(\bar{y}),T(\bar{z})} is an equality predicate for every pair of atoms S(y¯)S(\bar{y}) and T(z¯)T(\bar{z}).
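As a minimal sketch of why these predicates reduce to equality checks, the two functions below test membership in $U_{S(\bar{y})}$ and $B_{S(\bar{y}),T(\bar{z})}$ by trying to build the homomorphism position by position. The encoding of atoms and tuples as `(relation, values)` pairs and the helper name `extend` are assumptions made for illustration; constants in atoms are not handled, in line with the simplifying assumption made earlier in the proof.

```python
def extend(h, atom, tup):
    """Try to extend the partial homomorphism h so that it maps atom to tup.

    Returns the extended mapping, or None if no extension exists.
    """
    relation, variables = atom
    name, values = tup
    if relation != name or len(variables) != len(values):
        return None
    h = dict(h)
    for x, a in zip(variables, values):
        # A variable must be sent to a single value everywhere it occurs.
        if h.setdefault(x, a) != a:
            return None
    return h


def unary(atom, tup):
    """Membership of tup in the unary predicate of atom."""
    return extend({}, atom, tup) is not None


def binary(atom1, atom2, tup1, tup2):
    """Membership of (tup1, tup2) in the binary predicate of atom1 and atom2."""
    h = extend({}, atom1, tup1)
    return h is not None and extend(h, atom2, tup2) is not None


R = ("R", ("x", "y", "z"))
S = ("S", ("x", "y", "v"))
joinable = binary(R, S, ("R", (1, 2, 3)), ("S", (1, 2, 9)))
not_joinable = binary(R, S, ("R", (1, 2, 3)), ("S", (1, 5, 9)))
```

Both checks take time linear in the size of the tuples, and the binary check only compares values at positions that share a variable.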

We extend the definition of BS(y¯),T(z¯)B_{S(\bar{y}),T(\bar{z})} to a pair variable-atom as follows. Take again the atom S(y¯)S(\bar{y}) and let xx be a variable such that x{y¯}x\notin\{\bar{y}\}. Note that this implies that {y¯}{x¯i}={y¯}{x¯j}\{\bar{y}\}\cap\{\bar{x}_{i}\}=\{\bar{y}\}\cap\{\bar{x}_{j}\} for every pair of distinct atoms i,jdescτQ(x)i,j\in\texttt{desc}_{\tau_{Q}}(x). In other words, all atoms (i.e., identifiers) in the q-tree below xx share the same variables with S(y¯)S(\bar{y}). Define the binary predicate of xx and S(y¯)S(\bar{y}) as:

\[ B_{x,S(\bar{y})} = \bigcup_{i \in \texttt{desc}_{\tau_Q}(x)} B_{R_i(\bar{x}_i),S(\bar{y})}. \]

Given the previous discussion and given that QQ does not have self joins (all atoms are different), one can easily check that Bx,S(y¯)B_{x,S(\bar{y})} is an equality predicate in Beq\textbf{B}_{\operatorname{eq}}.

For our last definition, let iI(Q)i\in I(Q) be an identifier of QQ and x{x¯i}x\in\{\bar{x}_{i}\}. Define the set of incomplete states of xx given ii as:

\[ C_{x,i} \ :=\ \{\ell \in I(Q) \cup \{\bar{x}\} \mid \texttt{parent}_{\tau_Q}(\ell) \in (\texttt{desc}_{\tau_Q}(x) \cap \{\bar{x}_i\})\} \setminus \big(\{i\} \cup \{\bar{x}_i\}\big) \]

In other words, $C_{x,i}$ is the set of all variables or atom identifiers that hang from a variable of $\bar{x}_i$ that is a descendant of $x$, except for the variables in $\bar{x}_i$ and the atom $i$. For example, in the compact q-tree of $Q_1$ in Figure 4 one can check that $C_{x,U} = \{R, S, T\}$ and $C_{x,T} = \{y\}$.
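The definition of $C_{x,i}$ can be checked mechanically on small examples. The sketch below assumes a compact q-tree stored as a parent map and assumes that $\texttt{desc}$ is reflexive (a node counts as its own descendant), which is the reading under which $C_{x,T} = \{y\}$ comes out for $Q_1$. The function names, and the use of relation names as atom identifiers, are hypothetical.

```python
def descendants(children, x):
    """All nodes below x in the tree, including x itself."""
    seen, stack = {x}, [x]
    while stack:
        for child in children.get(stack.pop(), []):
            seen.add(child)
            stack.append(child)
    return seen


def incomplete_states(parent, atom_vars, x, i):
    """Compute C_{x,i} over the tree given by the parent map."""
    children = {}
    for node, p in parent.items():
        if p is not None:
            children.setdefault(p, []).append(node)
    # Variables of atom i that are descendants of x.
    anchors = descendants(children, x) & set(atom_vars[i])
    # Everything hanging directly from one of those variables ...
    hanging = {node for node, p in parent.items() if p in anchors}
    # ... except the atom i itself and its own variables.
    return hanging - ({i} | set(atom_vars[i]))


# Compact q-tree of Q1, using relation names as atom identifiers.
PARENT = {"x": None, "y": "x", "T": "x", "U": "y", "R": "y", "S": "y"}
ATOM_VARS = {"R": ("x", "y", "z"), "S": ("x", "y", "v"),
             "T": ("x", "w"), "U": ("x", "y")}
c_x_T = incomplete_states(PARENT, ATOM_VARS, "x", "T")
```

Here `c_x_T` evaluates to `{"y"}`: once $T$ is read, the root $x$ is complete as soon as the subtree below $y$ is.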

With these definitions, we define the transition relation Δ\Delta as follows:

\begin{align*}
\Delta \ :=\ & \Big\{ \big(\emptyset, U_{R_i(\bar{x}_i)}, \emptyset, \{i\}, i\big) \mid i \in I(Q) \Big\} \\
\cup\ & \Big\{ \big(C_{x,i}, U_{R_i(\bar{x}_i)}, \mathcal{B}_{x,i}, \{i\}, x\big) \mid i \in I(Q) \wedge x \in \{\bar{x}_i\} \Big\}
\end{align*}

such that $\mathcal{B}_{x,i}$ is the predicate function associated with $C_{x,i}$, defined as follows: $\mathcal{B}_{x,i}(y) = B_{y,R_i(\bar{x}_i)}$ for every variable $y \in C_{x,i}$ and $\mathcal{B}_{x,i}(j) = B_{R_j(\bar{x}_j),R_i(\bar{x}_i)}$ for every identifier $j \in C_{x,i}$. Intuitively, $\Delta$ can be interpreted as the traversal and completion of the compact q-tree: the first set of transitions represents the independent leaf nodes being completed by their respective atoms, while the second set represents the completion of a variable, meaning that the last tuple among its descendants was found.
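To make the shape of $\Delta$ tangible, the sketch below enumerates the transitions symbolically for a self-join-free query, as triples (source states, atom read, target state). Predicates are left out, since they are determined by the atom and the source states. The sketch assumes a compact q-tree given as a parent map, a reflexive $\texttt{desc}$, and skips variables that compaction removed; all names are hypothetical.

```python
def transitions(parent, atom_vars):
    """Enumerate Delta as triples (frozenset of sources, atom, target)."""
    children = {}
    for node, p in parent.items():
        if p is not None:
            children.setdefault(p, []).append(node)

    def below(x):
        # Descendants of x, including x itself.
        seen, stack = {x}, [x]
        while stack:
            for c in children.get(stack.pop(), []):
                seen.add(c)
                stack.append(c)
        return seen

    delta = []
    for i, variables in atom_vars.items():
        # Initial transition: reading atom i completes leaf i.
        delta.append((frozenset(), i, i))
        for x in variables:
            if x not in parent:  # variable removed by compaction
                continue
            anchors = below(x) & set(variables)
            c = {n for n, p in parent.items() if p in anchors}
            c -= {i} | set(variables)
            # Reading atom i last completes variable x.
            delta.append((frozenset(c), i, x))
    return delta


# Compact q-tree of Q1, using relation names as atom identifiers.
PARENT = {"x": None, "y": "x", "T": "x", "U": "y", "R": "y", "S": "y"}
ATOM_VARS = {"R": ("x", "y", "z"), "S": ("x", "y", "v"),
             "T": ("x", "w"), "U": ("x", "y")}
delta_q1 = transitions(PARENT, ATOM_VARS)
```

Each atom contributes one initial transition plus one transition per variable on its path in the compact tree, so the number of transitions is at most quadratic in the size of the query.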

From the previous construction, one can easily check by inspection that PQ\pazocal{P}_{Q} is of quadratic size with respect to QQ. In the sequel, we prove that PQ\pazocal{P}_{Q} is unambiguous and PQQ\pazocal{P}_{Q}\equiv Q. To prove both of these conditions, we will use the following lemma:

Lemma B.2.

For every accepting run ρ\rho of PQ\pazocal{P}_{Q} and for every atom iI(Q)i\in I(Q), there is exactly one node u¯ρ\bar{u}\in\rho such that ρ(u¯)=(p,j,{i})\rho(\bar{u})=(p,j,\{i\}).

Proof.

Let $i_R \in I(Q)$ be an atom with $Q(i_R) = R(\bar{x})$. Since $\rho$ is an accepting run of $\mathcal{P}_Q$, its root must be labeled by the root of the compact q-tree, i.e., $\rho(\texttt{root}(\rho)) = (r, j, \{i_S\})$ with $r = \texttt{root}(\tau_Q)$ and $i_S \in I(Q)$ such that $Q(i_S) = S(\bar{y})$. Assuming the non-trivial case, i.e., $i_R \neq i_S$, there must be a transition $(C_{r,i_S}, U_{S(\bar{y})}, \mathcal{B}_{r,i_S}, \{i_S\}, r) \in \Delta$, which leaves us with two cases.

First, the case where none of the ancestors of $i_R$ are present in $C_{r,i_S}$, which will be the base case. Looking at the definition of $C_{r,i_S}$ and considering that $i_S \neq i_R$, this means that $i_R$ must be present in $C_{r,i_S}$ and therefore, there must be a node labeled by $(i_R, k, \{i_R\})$.

On the other hand, if one of the ancestors of $i_R$ is present in $C_{r,i_S}$, the case is recursive. Let this ancestor be $s \in \texttt{ancst}_{\tau_Q}(i_R) \cap C_{r,i_S}$. Since $s$ is present in $C_{r,i_S}$, the run must use a transition $(C_{s,i_T}, U_{T(\bar{z})}, \mathcal{B}_{s,i_T}, \{i_T\}, s) \in \Delta$ for some atom $i_T \in I(Q)$ with $Q(i_T) = T(\bar{z})$, leaving us with the same cases as before: the trivial case where $i_T = i_R$, the base case where there are no ancestors of $i_R$ in $C_{s,i_T}$, and the current recursive case. If we repeat the recursive case, we will eventually visit every ancestor of $i_R$, leaving us with either the trivial or the base case, meaning that there exists a node $\bar{u} \in \rho$ such that $\rho(\bar{u}) = (p, j, \{i_R\})$.

We have proven that for every atom $i_R \in I(Q)$, there is a node $\bar{u}_R \in \rho$ such that $\rho(\bar{u}_R) = (p_R, j_R, \{i_R\})$ and $Q(i_R) = R(\bar{x}_R)$. Following the definition of $\Delta$, one can easily see that if $\bar{u}_R$ is an inner node, then $i_R \in \texttt{desc}_{\tau_Q}(p_R)$, and if $\bar{u}_R$ is a leaf node, then $p_R = i_R$.

On the other hand for each inner node, there must exist a transition of the form:

(CpR,iR,UR(x¯R),pR,iR,{iR},pR)Δ(C_{p_{R},i_{R}},U_{R(\bar{x}_{R})},\mathcal{B}_{p_{R},i_{R}},\{i_{R}\},p_{R})\in\Delta

and from the definition of CpR,iRC_{p_{R},i_{R}} it follows that for each S(x¯S)descτQ(pR)S(\bar{x}_{S})\in\texttt{desc}_{\tau_{Q}}(p_{R}), with S(x¯S)R(x¯R)S(\bar{x}_{S})\neq R(\bar{x}_{R}) and Q(iS)=S(x¯S)Q(i_{S})=S(\bar{x}_{S}), there is exactly one node u¯Sρ\bar{u}_{S}\in\rho such that ρ(u¯S)=(iS,jS,{iS})\rho(\bar{u}_{S})=(i_{S},j_{S},\{i_{S}\}) or ρ(u¯S)=(pS,jS,{i})\rho(\bar{u}_{S})=(p_{S},j_{S},\{i\}) where if iiSi\neq i_{S}, iSdescτQ(pS)i_{S}\in\texttt{desc}_{\tau_{Q}}(p_{S}). Since each transition of Δ\Delta has exactly one atom as its label, then there will be exactly one node in the run marked by each atom. ∎

The automaton $\mathcal{P}_Q$ is unambiguous iff every accepting run of $\mathcal{P}_Q$ is simple and, for every stream $\mathcal{S}$ and valuation $\nu$, there is at most one run $\rho$ of $\mathcal{P}_Q$ over $\mathcal{S}$ such that $\nu_\rho = \nu$. Every transition of $\mathcal{P}_Q$ uses a single label $i \in I(Q)$, which, combined with the correspondence between atoms and nodes in each accepting run given by Lemma B.2, directly implies that every accepting run of $\mathcal{P}_Q$ is simple.

On the other hand, it is easy to see that, given a tuple $R(\bar{a}) \in \mathcal{S}$, there is only a single initial transition $(\emptyset, U_{R_i(\bar{x}_i)}, \emptyset, \{i\}, i) \in \Delta$ such that $R(\bar{a}) \in U_{R_i(\bar{x}_i)}$. Moreover, given a set $C$ of states reached by a run of $\mathcal{P}_Q$, there is at most one transition that $\mathcal{P}_Q$ can take from $C$, due to the definition of $C_{x,i}$. Hence there is at most one accepting run for each output, and therefore $\mathcal{P}_Q$ is unambiguous.

Now proving that PQQ\pazocal{P}_{Q}\equiv Q, this will hold iff \lsemPQ\rsemn(S)=\lsemQ\rsemn(S){\lsem{}{\pazocal{P}_{Q}}\rsem}_{n}(\pazocal{S})={\lsem{}{Q}\rsem}_{n}(\pazocal{S}) for every position nn\in\mathbb{N} and every stream S\pazocal{S}, which means we need to prove the following two conditions:

  • \lsemPQ\rsemn(S)\lsemQ\rsemn(S){\lsem{}{\pazocal{P}_{Q}}\rsem}_{n}(\pazocal{S})\subseteq{\lsem{}{Q}\rsem}_{n}(\pazocal{S}). The output of the automaton PQ\pazocal{P}_{Q} over the stream S=t0t1\pazocal{S}=t_{0}t_{1}\ldots at position nn is defined as the set of valuations:

    \[ \llbracket \mathcal{P}_Q \rrbracket_n(\mathcal{S}) \ =\ \{\nu_\rho \mid \text{$\rho$ is an accepting run of $\mathcal{P}_Q$ over $\mathcal{S}$ at position $n$}\} \]

    meaning that for every valuation νρ\lsemPQ\rsemn(S)\nu_{\rho}\in{\lsem{}{\pazocal{P}_{Q}}\rsem}_{n}(\pazocal{S}) there is an associated accepting run tree ρ:t(2Ω{})\rho:t\rightarrow(2^{\Omega}\setminus\{\emptyset\}).

    Because of Lemma B.2, we know that for every atom iRI(Q)i_{R}\in I(Q), with Q(iR)=R(x¯R)Q(i_{R})=R(\bar{x}_{R}), there is a node u¯Rρ\bar{u}_{R}\in\rho such that ρ(u¯R)=(pR,jR,{iR})\rho(\bar{u}_{R})=(p_{R},j_{R},\{i_{R}\}). Using this, we can define η:I(Q)I(Dn[S])\eta:I(Q)\to I(D_{n}[\pazocal{S}]) and hη:XDn[S]Dn[S]h_{\eta}:X\cup D_{n}[\pazocal{S}]\to D_{n}[\pazocal{S}] such that for every iRI(Q)i_{R}\in I(Q), η(iR)=jR\eta(i_{R})=j_{R} and:

    \[ h_\eta(x) = \begin{cases} x & x \in D_n[\mathcal{S}] \\ D_n[\mathcal{S}](j_R) & x \in \mathbf{X} \end{cases} \]

    Since hη(a)=ah_{\eta}(a)=a for every aDn[S]a\in D_{n}[\pazocal{S}], it is a homomorphism, so if hηh_{\eta} is well-defined it will be a homomorphism from QQ to Dn[S]D_{n}[\pazocal{S}] and η\eta will be a t-homomorphism from QQ to Dn[S]D_{n}[\pazocal{S}] with hηh_{\eta} as its corresponding homomorphism.

    Consider the atoms iR,iTI(Q)i_{R},i_{T}\in I(Q) and the nodes of the run tree u¯R,u¯Tρ\bar{u}_{R},\bar{u}_{T}\in\rho such that ρ(u¯R)=(pR,jR,{iR})\rho(\bar{u}_{R})=(p_{R},j_{R},\{i_{R}\}), ρ(u¯T)=(pT,jT,{iT})\rho(\bar{u}_{T})=(p_{T},j_{T},\{i_{T}\}), Q(iR)=R(x¯R)Q(i_{R})=R(\bar{x}_{R}) and Q(iT)=T(x¯T)Q(i_{T})=T(\bar{x}_{T}). We need to prove that for every xXx\in\textbf{X}, hη(x)h_{\eta}(x) has a single value; in other words, if z{x¯Rx¯T}z\in\{\bar{x}_{R}\cap\bar{x}_{T}\}, then Dn[S](jR).z=Dn[S](jT).zD_{n}[\pazocal{S}](j_{R}).z=D_{n}[\pazocal{S}](j_{T}).z. Given the structure of the q-tree, the path from the root to every atom corresponds to the variables of the atom, so zancstτQ(R(x¯R))ancstτQ(T(x¯T))z\in\texttt{ancst}_{\tau_{Q}}(R(\bar{x}_{R}))\cap\texttt{ancst}_{\tau_{Q}}(T(\bar{x}_{T})).

    For the purpose of simplification we will start by assuming that one of the states of the run is marked by zz, and then explain the general case. Let iSI(Q)i_{S}\in I(Q) be an atom and u¯Sρ\bar{u}_{S}\in\rho be a node of the run tree such that ρ(u¯S)=(z,jS,{iS})\rho(\bar{u}_{S})=(z,j_{S},\{i_{S}\}) and Q(iS)=S(x¯S)Q(i_{S})=S(\bar{x}_{S}), which means there must be a transition of the form, (Cz,iS,USi(x¯S),z,iS,{iS},z)(C_{z,i_{S}},U_{S_{i}(\bar{x}_{S})},\mathcal{B}_{z,i_{S}},\{i_{S}\},z) in Δ\Delta. Given zancstτQ(R(x¯R))z\in\texttt{ancst}_{\tau_{Q}}(R(\bar{x}_{R})), we have two possible cases.

    Case (1) corresponds to pRCz,iSp_{R}\in C_{z,i_{S}}. This implies that the tuples associated with iRi_{R} and iSi_{S} must satisfy the binary predicate given by the transition, i.e. (Dn[S](jR),Dn[S](jS))z,iS(iR))\big{(}D_{n}[\pazocal{S}](j_{R}),D_{n}[\pazocal{S}](j_{S})\big{)}\in\mathcal{B}_{z,i_{S}}(i_{R})), and therefore, by the definition of the binary predicate of R(x¯R)R(\bar{x}_{R}) and S(x¯S)S(\bar{x}_{S}), Dn[S](jR).z=Dn[S](jS).zD_{n}[\pazocal{S}](j_{R}).z=D_{n}[\pazocal{S}](j_{S}).z.

    On the other hand, for case (2), pRCz,iSp_{R}\notin C_{z,i_{S}}. Since there is a node labeled by zz in the run tree, the variable must have been completed during the run, meaning there must be a sequence of nodes u¯1,,u¯mρ\bar{u}_{1},\ldots,\bar{u}_{m}\in\rho, with u¯1=u¯S\bar{u}_{1}=\bar{u}_{S}, u¯m=u¯R\bar{u}_{m}=\bar{u}_{R}, such that for each node u¯k\bar{u}_{k}, ρ(u¯k)=(pk,jk,{ik})\rho(\bar{u}_{k})=(p_{k},j_{k},\{i_{k}\}) and u¯k=parentρ(u¯k+1)\bar{u}_{k}=\texttt{parent}_{\rho}(\bar{u}_{k+1}). This means that pk+1Cpk,ikp_{k+1}\in C_{p_{k},i_{k}} and pk+1descτQ(pk)p_{k+1}\in\texttt{desc}_{\tau_{Q}}(p_{k}) for every node u¯k\bar{u}_{k}, which indicates that if pkzp_{k}\neq z, then pkdescτQ(z)p_{k}\in\texttt{desc}_{\tau_{Q}}(z) and zz is a variable of Q(ik)Q(i_{k}). Following the same steps of (1), it is easy to see that for every u¯k\bar{u}_{k}, (Dn[S](jk+1),Dn[S](jk))pk,ik(ik+1)\big{(}D_{n}[\pazocal{S}](j_{k+1}),D_{n}[\pazocal{S}](j_{k})\big{)}\in\mathcal{B}_{p_{k},i_{k}}(i_{k+1}) and Dn[S](jk).z=Dn[S](jk+1).zD_{n}[\pazocal{S}](j_{k}).z=D_{n}[\pazocal{S}](j_{k+1}).z, therefore Dn[S](jR).z=Dn[S](jS).zD_{n}[\pazocal{S}](j_{R}).z=D_{n}[\pazocal{S}](j_{S}).z. Note that we purposefully omitted the case iS=iRi_{S}=i_{R} in which the condition holds trivially.

    It is clear that the same arguments as in (1) and (2) are valid for the atom iTi_{T}, so if the run tree has a node labeled by zz, it holds that Dn[S](jT).z=Dn[S](jS).z=Dn[S](jR).zD_{n}[\pazocal{S}](j_{T}).z=D_{n}[\pazocal{S}](j_{S}).z=D_{n}[\pazocal{S}](j_{R}).z. If there is no such node in the run, we can start from the root of the run and its corresponding transition, where ρ(root(ρ))=(r,jS,{iS})\rho(\texttt{root}(\rho))=(r,j_{S},\{i_{S}\}) with r=root(τQ)r=\texttt{root}(\tau_{Q}) and (Cr,iS,US(x¯S),r,iS,{iS},r)Δ(C_{r,i_{S}},U_{S(\bar{x}_{S})},\mathcal{B}_{r,i_{S}},\{i_{S}\},r)\in\Delta with Q(iS)=S(x¯S)Q(i_{S})=S(\bar{x}_{S}). It is easy to see that if there are no nodes labeled by zz, then zz must be a variable of S(x¯S)S(\bar{x}_{S}), so we can use the exact same arguments that we used in (1) and (2) to prove that Dn[S](jR).z=Dn[S](jS).zD_{n}[\pazocal{S}](j_{R}).z=D_{n}[\pazocal{S}](j_{S}).z and Dn[S](jT).z=Dn[S](jS).zD_{n}[\pazocal{S}](j_{T}).z=D_{n}[\pazocal{S}](j_{S}).z, therefore in any case it holds that Dn[S](jT).z=Dn[S](jR).zD_{n}[\pazocal{S}](j_{T}).z=D_{n}[\pazocal{S}](j_{R}).z and hηh_{\eta} is well-defined and a homomorphism from QQ to Dn[S]D_{n}[\pazocal{S}].

    Finally, since we have a t-homomorphism from QQ to Dn[S]D_{n}[\pazocal{S}] for every run tree ρ\rho of PQ\pazocal{P}_{Q}, and \lsemQ\rsemn(S)={η^η is a t-homomorphism from Q to Dn[S]}{\lsem{}{Q}\rsem}_{n}(\pazocal{S})\ =\ \{\hat{\eta}\mid\text{$\eta$ is a t-homomorphism from $Q$ to $D_{n}[\pazocal{S}]$}\}, then \lsemPQ\rsemn(S)\lsemQ\rsemn(S){\lsem{}{\pazocal{P}_{Q}}\rsem}_{n}(\pazocal{S})\subseteq{\lsem{}{Q}\rsem}_{n}(\pazocal{S}).

  • \lsemQ\rsemn(S)\lsemPQ\rsemn(S){\lsem{}{Q}\rsem}_{n}(\pazocal{S})\subseteq{\lsem{}{\pazocal{P}_{Q}}\rsem}_{n}(\pazocal{S}). Let \lsemQ\rsemn(S)={η^η is a t-homomorphism from Q to Dn[S]}{\lsem{}{Q}\rsem}_{n}(\pazocal{S})\ =\ \{\hat{\eta}\mid\text{$\eta$ is a t-homomorphism from $Q$ to $D_{n}[\pazocal{S}]$}\} be the set of outputs of QQ over the stream S\pazocal{S}.

    Let $\hat{\eta} \in \llbracket Q \rrbracket_n(\mathcal{S})$ be a valuation with its associated t-homomorphism $\eta$ and homomorphism $h_\eta$. We can represent the tuples of $\hat{\eta}$ as the sequence $t_\eta = \{\!\!\{i_1, \ldots, i_m\}\!\!\}$, where $D_n[\mathcal{S}](i_k) = R_k(\bar{a}_k)$ for all $k \leq m$ and $i_k < i_{k+1}$ for all $k < m$.

    Using the valuation and the sequence $t_\eta$, we can build a run tree of $\mathcal{P}_Q$, $\rho: t \rightarrow (Q \times \mathbb{N} \times (2^{\Omega} \setminus \{\emptyset\}))$. We start with a single leaf node $\varepsilon$, with $\rho(\varepsilon) = (R_1(\bar{x}_1), t_\eta(1), \{R_1(\bar{x}_1)\})$. After this, for every tuple $k \in t_\eta$, we check whether there is a transition $(P, U, B, L, q) \in \Delta$ such that for each $p \in P$, there is a node $\bar{u} \in \rho$ with $\rho(\bar{u}) = (p, j, L)$ and $\bar{u} \notin \texttt{children}_\rho(\bar{v})$ for every $\bar{v} \in \rho$, i.e., we check whether there are parentless nodes labeled by each state needed for the transition. If we find a transition that satisfies this condition, we add a new node $\bar{v}$ labeled by $(q, t_\eta(k), L)$, with those nodes as its children; otherwise, we add a new leaf node labeled by $(R_k(\bar{x}_k), t_\eta(k), \{R_k(\bar{x}_k)\})$.

    Since we get each tuple from the values given by the homomorphism hηh_{\eta}, we know that they will satisfy the unary and binary predicates of each transition, and since there is exactly one tuple for each atom of QQ, we know that there will be a one to one correspondence between the tuples and the labels in the tree ρ\rho, meaning it is an accepting run and therefore \lsemQ\rsemn(S)\lsemPQ\rsemn(S){\lsem{}{Q}\rsem}_{n}(\pazocal{S})\subseteq{\lsem{}{\pazocal{P}_{Q}}\rsem}_{n}(\pazocal{S}).

Connected HCQ with self joins

For a non-empty set AI(Q)A\subseteq I(Q) we say that AA is a self join (of QQ) iff Ri=RjR_{i}=R_{j} for every i,jAi,j\in A. That is, if all relations of the identifiers are the same. We denote by SJQ\operatorname{SJ}_{Q} the set of all self joins AA of QQ. Note that iA{x¯i}\bigcap_{i\in A}\{\bar{x}_{i}\}\neq\emptyset for every ASJQA\in\operatorname{SJ}_{Q} since τQ(ε)iA{x¯i}\tau_{Q}(\varepsilon)\in\bigcap_{i\in A}\{\bar{x}_{i}\} (i.e., QQ is connected).

To deal with self joins, we must modify the previous construction by keeping track of the last atom or self join that was read. This modification is necessary to use equality predicates. Furthermore, the construction has an inherently exponential blow-up in the number of transitions, given that we need to annotate tuples with self joins. This blow-up seems unavoidable for the model since, if $Q$ has only atoms with the same relational symbol, then the last transition is forced to annotate with sets of atoms, of which there are exponentially many.
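The source of the blow-up can be seen by enumerating $\operatorname{SJ}_Q$ directly. The sketch below assumes a query given as a map from atom identifiers to relation names (a hypothetical encoding) and lists every non-empty set of identifiers that share a relation.

```python
from itertools import combinations


def self_joins(relations):
    """List all self joins: non-empty sets of identifiers with equal relation.

    `relations` maps each atom identifier to its relation name.
    """
    by_relation = {}
    for ident, name in relations.items():
        by_relation.setdefault(name, []).append(ident)

    result = []
    for idents in by_relation.values():
        for size in range(1, len(idents) + 1):
            result.extend(frozenset(c) for c in combinations(idents, size))
    return result


# Q2 from the text: atoms 0 and 1 use R, atom 2 uses U.
sj_q2 = self_joins({0: "R", 1: "R", 2: "U"})

# A query whose five atoms all use the same relation symbol.
sj_same = self_joins({i: "R" for i in range(5)})
```

A relation that occurs $k$ times contributes $2^k - 1$ self joins, so a query with five atoms over one relation already has 31 of them, against 4 for $Q_2$.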

Let xSJQ={(x,A)ASJQxiA{x¯i}}\operatorname{xSJ}_{Q}=\{(x,A)\mid A\in\operatorname{SJ}_{Q}\wedge\ x\in\bigcap_{i\in A}\{\bar{x}_{i}\}\}. We define the PCEA:

\[ \mathcal{P}_Q = (I(Q) \cup \operatorname{xSJ}_Q, \mathbf{U}_{\operatorname{lin}}, \mathbf{B}_{\operatorname{eq}}, I(Q), \Delta, \{\tau_Q(\varepsilon)\} \times \operatorname{SJ}_Q). \]

Note that {τQ(ε)}×SJQxSJQ\{\tau_{Q}(\varepsilon)\}\times\operatorname{SJ}_{Q}\subseteq\operatorname{xSJ}_{Q}. The main change is that the variable states of the automaton are pairs formed by a variable and a self join that brought PQ\pazocal{P}_{Q} to that variable. We dedicate the rest of this subsection to introduce the notation that we need to define Δ\Delta.

To account for a single tuple satisfying a self join of QQ, we need to define a new unary predicate associated with a self join ASJQA\in\operatorname{SJ}_{Q}. For this, the following lemma will be relevant.

Lemma B.3.

Let ASJQA\in\operatorname{SJ}_{Q}. There exists an atom tAt_{A} (not necessarily in QQ) such that, for every tTuples[σ]t\in\operatorname{Tuples}[\sigma], the following statements are equivalent: (1) there exists hHomh\in\operatorname{Hom} such that h(Ri(x¯i))=th(R_{i}(\bar{x}_{i}))=t for every iAi\in A; (2) there exists hHomh^{\prime}\in\operatorname{Hom} such that h(tA)=th^{\prime}(t_{A})=t.

Proof.

Let [1,n]={1,,n}[1,n]=\{1,\ldots,n\} and define the relation H[1,n]2H\subseteq[1,n]^{2} such that:

\[ H = \{(k_1, k_2) \in [1,n]^2 \mid \exists i,j.\, \bar{x}_i[k_1] = \bar{x}_j[k_2]\} \]

Note that HH is reflexive and symmetric, but not necessarily transitive.

Let $H^T$ be the transitive closure of $H$. Then $H^T$ is an equivalence relation. Given $k \in [1,n]$, let $[k]$ be the equivalence class of $k$ with respect to $H^T$. Define $R(\bar{x}) = R([1], \ldots, [n])$. Next we prove that the atom $t_A = R(\bar{x})$ satisfies the lemma.

  • (1) \rightarrow (2). Suppose that hh is a homomorphism such that h(R(x¯))=R(a¯)h(R(\bar{x}))=R(\bar{a}). Define the homomorphism h(x)=[k]h^{\ast}(x)=[k], with x¯i[k]=x\bar{x}_{i}[k]=x for some i.

    One can prove that (1) hh^{\ast} is well-defined, i.e., it does not depend on kk and (2) h(R(x¯i))=R(x¯)h^{\ast}(R(\bar{x}_{i}))=R(\bar{x}) for all ini\leq n.

    Define h=hhh^{\prime}=h^{\ast}\circ h, then h(R(x¯i))=h(h(R(x¯i)))=h(R(x¯))=R(a¯)h^{\prime}(R(\bar{x}_{i}))=h(h^{\ast}(R(\bar{x}_{i})))=h(R(\bar{x}))=R(\bar{a}).

  • (2) $\rightarrow$ (1). Suppose that $h' \in \operatorname{Hom}$ satisfies $h'(R(\bar{x}_i)) = R(\bar{a})$ for every $i$.

    Define the homomorphism hh such that h([k])=h(x¯1[k])h([k])=h^{\prime}(\bar{x}_{1}[k]). It is easy to see that hh is well-defined, namely if [k1]=[k2][k_{1}]=[k_{2}], then h(x¯1[k1])=h(x¯1[k2])h^{\prime}(\bar{x}_{1}[k_{1}])=h^{\prime}(\bar{x}_{1}[k_{2}]).

    Since hh is well-defined, then hh is a homomorphism and:

    \begin{align*}
    h(R(\bar{x})) &= R(h([1]), \ldots, h([n])) \\
    &= R(h'(\bar{x}_1[1]), \ldots, h'(\bar{x}_1[n])) \\
    &= h'(R(\bar{x}_1)) \\
    &= R(\bar{a})
    \end{align*}

Note that the homomorphism hh implicitly forces that tAt_{A} has the same relational symbol as every RiR_{i} with iAi\in A, otherwise there would be no tuple that satisfies the single homomorphism condition. It also motivates the following unary predicate for a self join ASJQA\in\operatorname{SJ}_{Q}:

\[ U_A = \{t \in \operatorname{Tuples}[\sigma] \mid \exists h \in \operatorname{Hom}.\, h(t_A) = t\}. \]

One can check that UAUlinU_{A}\in\textbf{U}_{\operatorname{lin}} given that one can compute tAt_{A} from AA in advance, and then check that tt is homomorphic to tAt_{A} in linear time over tt for every tTuples[σ]t\in\operatorname{Tuples}[\sigma].
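A possible way to precompute $t_A$ and then test membership in $U_A$ is sketched below. It obtains the equivalence classes of positions with a union-find structure that links every position to the variables occurring there, which yields the same classes as the transitive closure of $H$. The encoding (0-based positions, each class named by its smallest position) and the function names are assumptions made for illustration.

```python
def canonical_atom(relation, var_tuples):
    """Compute t_A for a self join given by the variable tuples of its atoms.

    Returns (relation, classes), where classes[k] is the smallest position
    that is equivalent to position k.
    """
    n = len(var_tuples[0])
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    # Two positions are related when some variable occurs at both.
    for variables in var_tuples:
        for k, x in enumerate(variables):
            union(("pos", k), ("var", x))

    smallest = {}
    for k in range(n):
        smallest.setdefault(find(("pos", k)), k)
    return (relation, tuple(smallest[find(("pos", k))] for k in range(n)))


def in_unary_predicate(atom, tup):
    """Check in linear time whether tup is a homomorphic image of atom."""
    relation, classes = atom
    name, values = tup
    if name != relation or len(values) != len(classes):
        return False
    return all(values[k] == values[c] for k, c in enumerate(classes))


# Self join {R(x, y, y), R(z, z, w)}: all three positions collapse.
t_a = canonical_atom("R", [("x", "y", "y"), ("z", "z", "w")])
```

With this self join, a tuple satisfies both atoms under a single homomorphism only if its three values coincide, which is what the collapsed atom expresses.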

Similar to unary predicates, we can derive a lemma for a pair of self joins.

Lemma B.4.

Let A1,A2SJQA_{1},A_{2}\in\operatorname{SJ}_{Q}. There exist atoms tA1,A2\reflectbox{$\vec{\reflectbox{$t$}}$}_{A_{1},A_{2}} and tA1,A2\vec{t}_{A_{1},A_{2}} (not necessarily in QQ) such that, for every pair (t1,t2)Tuples[σ]2(t_{1},t_{2})\in\operatorname{Tuples}[\sigma]^{2}, the following statements are equivalent: (1) there exists hHomh\in\operatorname{Hom} such that h(Ri(x¯i))=tjh(R_{i}(\bar{x}_{i}))=t_{j} for every j{1,2}j\in\{1,2\} and every iAji\in A_{j}; (2) there exists hHomh^{\prime}\in\operatorname{Hom} such that h(tA1,A2)=t1h^{\prime}(\reflectbox{$\vec{\reflectbox{$t$}}$}_{A_{1},A_{2}})=t_{1} and h(tA1,A2)=t2h^{\prime}(\vec{t}_{A_{1},A_{2}})=t_{2}.

Proof.

Let $A_1 = \{R_1(\bar{x}_1), \ldots, R_1(\bar{x}_m)\}$, $A_2 = \{R_2(\bar{y}_1), \ldots, R_2(\bar{y}_{m'})\}$, $[1,n] = \{1, \ldots, n\}$ and $[1',n'] = \{1', \ldots, n'\}$. Note that $[1,n] \cap [1',n'] = \emptyset$. Define the relation $H \subseteq ([1,n] \cup [1',n'])^2$ such that:

\begin{align*}
H \ =\ & \{(k_1, k_2) \mid \exists i,j.\, \bar{x}_i[k_1] = \bar{x}_j[k_2]\} \\
\cup\ & \{(k_1', k_2') \mid \exists i,j.\, \bar{y}_i[k_1] = \bar{y}_j[k_2]\} \\
\cup\ & \{(k_1, k_2') \mid \exists i,j.\, \bar{x}_i[k_1] = \bar{y}_j[k_2]\} \\
\cup\ & \{(k_1', k_2) \mid \exists i,j.\, \bar{y}_i[k_1] = \bar{x}_j[k_2]\}
\end{align*}

Just like for Lemma B.3, take $H^T$, the transitive closure of $H$, which is an equivalence relation. Given $k \in [1,n] \cup [1',n']$, let $[k]$ be the equivalence class of $k$ in $H^T$. Define $R_1(\bar{x}) = R_1([1], \ldots, [n])$ and $R_2(\bar{y}) = R_2([1'], \ldots, [n'])$, and take $\overleftarrow{t}_{A_1,A_2} = R_1(\bar{x})$ and $\vec{t}_{A_1,A_2} = R_2(\bar{y})$. The proof that $\overleftarrow{t}_{A_1,A_2}$ and $\vec{t}_{A_1,A_2}$ satisfy the lemma is completely analogous to the one in Lemma B.3.

Analogous to the unary predicates, this lemma motivates the following binary predicate for every pair $A_1, A_2 \in \operatorname{SJ}_Q$:

\[ B_{A_1,A_2} = \{(t_1, t_2) \in \operatorname{Tuples}[\sigma]^2 \mid \exists h \in \operatorname{Hom}.\, h(\overleftarrow{t}_{A_1,A_2}) = t_1 \wedge h(\vec{t}_{A_1,A_2}) = t_2\}. \]

Note that BA1,A2BeqB_{A_{1},A_{2}}\in\textbf{B}_{\operatorname{eq}}. Indeed, BA1,A2=BS(y¯),T(z¯)B_{A_{1},A_{2}}=B_{S(\bar{y}),T(\bar{z})} with S(y¯)=tA1,A2S(\bar{y})=\reflectbox{$\vec{\reflectbox{$t$}}$}_{A_{1},A_{2}} and T(z¯)=tA1,A2T(\bar{z})=\vec{t}_{A_{1},A_{2}}.

Now, we move to the definitions for constructing the transition relation Δ\Delta. Let ASJQA\in\operatorname{SJ}_{Q}. For a variable xiA{x¯i}x\in\bigcap_{i\in A}\{\bar{x}_{i}\} we define the set of incomplete states of xx given AA as:

\[ C_{x,A} \ :=\ \big\{\ell \in I(Q) \cup \{\bar{x}\} \mid \texttt{parent}_{\tau_Q}(\ell) \in \big(\texttt{desc}_{\tau_Q}(x) \cap \bigcup_{i \in A} \{\bar{x}_i\}\big)\big\} \setminus \big(A \cup \bigcup_{i \in A} \{\bar{x}_i\}\big). \]

$C_{x,A}$ is the generalization of the set $C_{x,i}$. Indeed, one can check that $C_{x,i} = C_{x,\{i\}}$. The intuition behind $C_{x,A}$ is similar: in $C_{x,A}$ we collect all variables or atom identifiers in $\tau_Q$ that hang directly from a variable that is a descendant of $x$ and occurs in an atom of $A$, except for the variables and atoms of $A$ themselves.

Let Cx,AC_{x,A} be the incomplete states of xx given AA. We say that a subset of states CI(Q)xSJQC\subseteq I(Q)\cup\operatorname{xSJ}_{Q} encodes Cx,AC_{x,A} in PQ\pazocal{P}_{Q} iff CI(Q)=Cx,AI(Q)C\cap I(Q)=C_{x,A}\cap I(Q) and for every variable yCx,Ay\in C_{x,A} there exists a unique pair (y,A)C(y,A^{\prime})\in C. Intuitively, CC will be the analog of Cx,iC_{x,i} that we use as a set of states in the case without self joins. Note that Cx,AC_{x,A} can be empty, and then \emptyset is the only set representing Cx,AC_{x,A}. We denote by C¯x,A\bar{C}_{x,A} the set of all encodings of Cx,AC_{x,A} in PQ\pazocal{P}_{Q}.

With this machinery, we construct the new transition relation Δ\Delta as follows:

Δ\displaystyle\Delta ={(,URi(x¯i),,{i},i)iI(Q)}\displaystyle=\Big{\{}\Big{(}\emptyset,U_{R_{i}(\bar{x}_{i})},\emptyset,\{i\},i\Big{)}\mid\,i\in I(Q)\Big{\}}
{(C,UA,C,A,A,(x,A))ASJQxiA{x¯i}CC¯x,A}\displaystyle\cup\Big{\{}\Big{(}C,U_{A},\mathcal{B}_{C,A},A,(x,A)\Big{)}\mid A\in\operatorname{SJ}_{Q}\wedge\,x\in\bigcap_{i\in A}\{\bar{x}_{i}\}\wedge C\in\bar{C}_{x,A}\Big{\}}

such that $\mathcal{B}_{C,A}: C \rightarrow \mathbf{B}_{\operatorname{eq}}$ is the predicate function associated with $C$ and $A$, defined as follows: $\mathcal{B}_{C,A}(j) = B_{\{j\},A}$ for every identifier $j \in C$, and $\mathcal{B}_{C,A}((y,A')) = B_{A',A}$ for every $(y,A') \in C$. Note that all the binary predicates of $\mathcal{B}_{C,A}$ are equality predicates as defined above.

This ends the definition of $\Delta$ and $\pazocal{P}_{Q}$ for a connected HCQ $Q$ with self joins. Next, we prove that $\pazocal{P}_{Q}$ is unambiguous and $\pazocal{P}_{Q}\equiv Q$.

Thanks to Lemma B.3 and Lemma B.4, this proof is analogous to the one without self joins.

The general case

We end this proof by showing the case when $Q$ is disconnected. If this is the case, let $x^{*}$ be a fresh variable and redefine $Q$ as:

\[
Q^{*}(x^{*},\bar{x})\ \leftarrow\ R_{0}(x^{*},\bar{x}_{0}),\ldots,R_{m-1}(x^{*},\bar{x}_{m-1})
\]

where $Q^{*}$ is defined over a new schema $\sigma^{*}$ in which the arity of each relational symbol is increased by one. $Q^{*}$ is hierarchical and is now connected. By the previous case, we can construct a PCEA $\pazocal{P}_{Q^{*}}$ such that $\pazocal{P}_{Q^{*}}\equiv Q^{*}$. Now, take $\pazocal{P}_{Q^{*}}$ and define $\pazocal{P}_{Q}$ by removing the fresh variable $x^{*}$ from $\pazocal{P}_{Q^{*}}$, namely, by removing it from the unary and binary predicates in the transitions. One can easily check that $\pazocal{P}_{Q}\equiv Q$. ∎
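As an illustration of the reduction above, the following minimal Python sketch adds a fresh variable to every atom of a disconnected query and then removes it again. The representation of atoms as `(relation, variables)` pairs and the names `is_connected`, `connect_with_fresh_variable`, `drop_fresh_variable` and `x_star` are assumptions made for this example; they are not part of the construction of $\pazocal{P}_{Q}$ itself.

```python
from typing import List, Tuple

# Hypothetical encoding: an atom is (relation name, tuple of variable names).
Atom = Tuple[str, Tuple[str, ...]]


def is_connected(atoms: List[Atom]) -> bool:
    """True if every atom can be reached from the first one via shared variables."""
    if not atoms:
        return True
    seen = {0}
    frontier = [0]
    while frontier:
        i = frontier.pop()
        for j, (_, variables) in enumerate(atoms):
            if j not in seen and set(variables) & set(atoms[i][1]):
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(atoms)


def connect_with_fresh_variable(atoms: List[Atom], fresh: str = "x_star") -> List[Atom]:
    """Prepend a fresh variable to every atom, raising each arity by one."""
    used = {v for _, variables in atoms for v in variables}
    if fresh in used:
        raise ValueError("the fresh variable must not occur in the query")
    return [(rel, (fresh,) + variables) for rel, variables in atoms]


def drop_fresh_variable(atoms: List[Atom], fresh: str = "x_star") -> List[Atom]:
    """Remove the fresh variable again, recovering the original atoms."""
    return [(rel, tuple(v for v in variables if v != fresh)) for rel, variables in atoms]


query = [("R", ("x",)), ("S", ("y",))]          # disconnected: no shared variable
query_star = connect_with_fresh_variable(query)  # every atom now contains x_star
```

In this sketch, `query_star` is connected because all atoms share `x_star`, and dropping the variable gives back the original atoms.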

Proof of Theorem 4.2

Fix a schema $\sigma$, a set of data values $\textbf{D}=\mathbb{N}$ and an acyclic CQ $Q$ over $\sigma$ of the form:

\[
Q(\bar{x})\ \leftarrow\ R_{0}(\bar{x}_{0}),\ldots,R_{m-1}(\bar{x}_{m-1})
\]

that is not an HCQ, meaning there is a pair of variables $y,z\in\textbf{X}$ with $\textit{atoms}(y)\not\subseteq\textit{atoms}(z)$, $\textit{atoms}(z)\not\subseteq\textit{atoms}(y)$ and $\textit{atoms}(y)\cap\textit{atoms}(z)\neq\emptyset$. Without loss of generality, assume that $y\in\{\bar{x}_{0}\}\cap\{\bar{x}_{1}\}$, $y\notin\{\bar{x}_{2}\}$, $z\in\{\bar{x}_{0}\}\cap\{\bar{x}_{2}\}$ and $z\notin\{\bar{x}_{1}\}$; moreover, assume that the first three atoms are $R_{0}(x_{0}^{0},\ldots,x_{0}^{n_{0}})$, $R_{1}(x_{1}^{0},\ldots,x_{1}^{n_{1}})$, $R_{2}(x_{2}^{0},\ldots,x_{2}^{n_{2}})$ with $x_{0}^{0}=x_{1}^{0}=y$ and $x_{0}^{1}=x_{2}^{0}=z$.
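The condition on $\textit{atoms}(y)$ and $\textit{atoms}(z)$ can be checked mechanically. The following Python sketch searches for such a pair of variables; the encoding of a query as a list of `(relation, variables)` pairs and the names `atoms_of` and `violating_pair` are hypothetical and introduced only for this illustration.

```python
from itertools import combinations
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical encoding: an atom is (relation name, tuple of variable names).
Atom = Tuple[str, Tuple[str, ...]]


def atoms_of(query: List[Atom]) -> Dict[str, Set[int]]:
    """Map every variable to the set of atom identifiers in which it occurs."""
    occurrences: Dict[str, Set[int]] = {}
    for i, (_, variables) in enumerate(query):
        for v in variables:
            occurrences.setdefault(v, set()).add(i)
    return occurrences


def violating_pair(query: List[Atom]) -> Optional[Tuple[str, str]]:
    """Return variables (y, z) whose atom sets overlap without being nested, if any."""
    occ = atoms_of(query)
    for y, z in combinations(sorted(occ), 2):
        a, b = occ[y], occ[z]
        if a & b and not a <= b and not b <= a:
            return (y, z)
    return None  # no such pair: the query satisfies the hierarchical condition


# R0(y, z), R1(y), R2(z): atoms(y) = {0, 1} and atoms(z) = {0, 2}.
non_hierarchical = [("R0", ("y", "z")), ("R1", ("y",)), ("R2", ("z",))]
# R(x, y), S(x): atoms(y) = {0} is contained in atoms(x) = {0, 1}.
hierarchical = [("R", ("x", "y")), ("S", ("x",))]
```

The first query has exactly the shape assumed in the proof, with $y$ shared by atoms 0 and 1 and $z$ shared by atoms 0 and 2.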

Let $\pazocal{S}_{k}=\{\!\!\{\,R_{0}(\bar{a}_{0}),\ldots,R_{m-1}(\bar{a}_{m-1}),\ldots\,\}\!\!\}$ be a family of streams where, for every tuple $R_{i}(a_{i}^{0},\ldots,a_{i}^{n_{i}})$, $a_{i}^{j}=k$ with $k\in\mathbb{N}$ if $x_{i}^{j}=y$ or $x_{i}^{j}=z$, and $a_{i}^{j}=0$ if $x_{i}^{j}\neq y$ and $x_{i}^{j}\neq z$. In other words, every variable of every tuple is mapped to $0$, except for $y$ and $z$, which are mapped to $k$. Recalling our previous definition for CQ over streams, i.e., considering $\Omega=I(Q)$, it is clear that the valuation $\nu=\{\!\!\{0,1,2,\ldots,m-1\}\!\!\}\in{\lsem{}{Q}\rsem}_{n}(\pazocal{S}_{k})$ for every $k\in\mathbb{N}$.

Assume there is a PCEA $\pazocal{P}_{Q}=\big(\pazocal{P}_{Q},\textbf{U},\textbf{B},\Omega,\Delta,F\big)$ such that $\pazocal{P}_{Q}\equiv Q$, meaning that for every stream and, in particular, for every stream $\pazocal{S}_{k}$ it holds that ${\lsem{}{\pazocal{P}_{Q}}\rsem}={\lsem{}{Q}\rsem}$, which means $\nu\in{\lsem{}{\pazocal{P}_{Q}}\rsem}_{n}(\pazocal{S}_{k})$ for every $k\in\mathbb{N}$. Since $\pazocal{P}_{Q}$ has a finite number of states, there must exist distinct $j,k\in\mathbb{N}$ such that their accepting runs for $\nu$, $\rho_{j}$ and $\rho_{k}$ respectively, are isomorphic, meaning their runs go through the exact same states. For the sake of simplification, we assume that $\operatorname{dom}(\rho_{j})=\operatorname{dom}(\rho_{k})$.

Let $\pazocal{S}_{j\leftarrow k}=\{\!\!\{R_{0}(\bar{a}_{0}),\ldots,R_{m-1}(\bar{a}_{m-1}),\ldots\}\!\!\}$ where, for every tuple $R_{i}(a_{i}^{0},\ldots,a_{i}^{n_{i}})$, $a_{i}^{\ell}=j$ if $x_{i}^{\ell}=y$, $a_{i}^{\ell}=k$ if $x_{i}^{\ell}=z$, and $a_{i}^{\ell}=0$ if $x_{i}^{\ell}\neq y$ and $x_{i}^{\ell}\neq z$. It is clear that this time, $\nu=\{\!\!\{0,1,2,\ldots,m-1\}\!\!\}\notin{\lsem{}{Q}\rsem}_{n}(\pazocal{S}_{j\leftarrow k})$.

Since $\rho_{j}$ and $\rho_{k}$ are runs associated with $\nu$, there are nodes $\bar{u}_{0},\bar{u}_{1},\bar{u}_{2}\in I(\rho_{j})$ such that $\rho_{j}(\bar{u}_{i})=(p_{i},i,L_{i})$ and $\rho_{k}(\bar{u}_{i})=(p_{i},i,L_{i})$, with $R_{i}(\bar{x}_{i})\in L_{i}$ for each $i\in\{0,1,2\}$. Since these nodes exist in $\rho_{j}$ and $\rho_{k}$, for every $i\in\{0,1,2\}$ there must be transitions $(P_{i},U_{i},B_{i},L_{i},p_{i})$ and $(P_{i},U_{i}^{\prime},B_{i}^{\prime},L_{i},p_{i})$ such that $\pazocal{S}_{j}(i)$ satisfies the unary and binary predicates $U_{i},B_{i}$ and $\pazocal{S}_{k}(i)$ satisfies the unary and binary predicates $U_{i}^{\prime},B_{i}^{\prime}$.

Using these transitions, we can define $\rho_{j\leftarrow k}:I(\rho_{j})\to(\pazocal{P}_{Q},\mathbb{N},(2^{\Omega}\setminus\emptyset))$ with $\rho_{j\leftarrow k}(\bar{u})=\rho_{j}(\bar{u})$ for every $\bar{u}\in\operatorname{dom}(\rho_{j\leftarrow k})$. One can easily check that $\rho_{j\leftarrow k}$ is an accepting run tree of $\pazocal{P}_{Q}$ over $\pazocal{S}_{j\leftarrow k}$, since it can follow $(P_{i},U_{i}^{\prime},B_{i}^{\prime},L_{i},p_{i})$ for $R_{0}(\bar{a}_{0}),R_{1}(\bar{a}_{1}),R_{2}(\bar{a}_{2})$ and follow the exact same transitions as $\rho_{j}$ for every other tuple.

Since for the valuation $\nu=\{\!\!\{0,1,2,\ldots,m-1\}\!\!\}$ it holds that $\nu\notin{\lsem{}{Q}\rsem}_{n}(\pazocal{S}_{j\leftarrow k})$ and $\nu\in{\lsem{}{\pazocal{P}_{Q}}\rsem}_{n}(\pazocal{S}_{j\leftarrow k})$, we get $\pazocal{P}_{Q}\not\equiv Q$. Therefore, if $Q$ is an acyclic CQ that is not hierarchical, then $\pazocal{P}_{Q}\not\equiv Q$ for every PCEA $\pazocal{P}_{Q}$.

Appendix C Proofs of Section 5

Proof of Theorem 5.2

Proof.

Let $w\in\mathbb{N}$ be a window size, $\mathsf{DS}_{w}$ be a simple data structure and $\mathsf{n}\in\operatorname{Nodes}(\mathsf{DS}_{w})$ be a node of the data structure. The valuations in ${\lsem{}{\mathsf{n}}\rsem}^{w}_{i}$ are defined as:

\[
{\lsem{}{\mathsf{n}}\rsem}^{w}_{i}\ :=\ \{\!\!\{\nu\in{\lsem{}{\mathsf{n}}\rsem}\mid|i-\min(\nu)|\leq w\}\!\!\}
\]

with

\[
{\lsem{}{\mathsf{n}}\rsem}_{\operatorname{prod}}:=\{\!\!\{\nu_{L(\mathsf{n}),i(\mathsf{n})}\}\!\!\}\oplus\bigoplus_{\mathsf{n}^{\prime}\in\operatorname{prod}(\mathsf{n})}{\lsem{}{\mathsf{n}^{\prime}}\rsem}
\qquad
{\lsem{}{\mathsf{n}}\rsem}:={\lsem{}{\mathsf{n}}\rsem}_{\operatorname{prod}}\cup{\lsem{}{\operatorname{uleft}(\mathsf{n})}\rsem}\cup{\lsem{}{\operatorname{uright}(\mathsf{n})}\rsem}.
\]

Following the definitions used in [22], we say that the algorithm enumerates the results $\nu\in{\lsem{}{\mathsf{n}}\rsem}^{w}_{i}$ by writing $\#\nu_{1}\#\nu_{2}\#\ldots\#\nu_{m}\#$ to the output registers, where $\#\notin\Omega$ is a separator symbol. Let $\mathsf{time}(i)$ be the time in the enumeration when the algorithm writes the $i$-th symbol $\#$; we define $\mathsf{delay}(i)=\mathsf{time}(i+1)-\mathsf{time}(i)$ for each $i\leq m$. We say that the enumeration has output-linear delay if there is a constant $k$ such that for every $i\leq m$ it holds that $\mathsf{delay}(i)\leq k\cdot|\nu_{i}|$.

To output the first valuation of ${\lsem{}{\mathsf{n}}\rsem}^{w}_{i}$ we need to (1) determine if ${\lsem{}{\mathsf{n}}\rsem}^{w}_{i}=\emptyset$ and (2) build the valuation by calculating the products in ${\lsem{}{\mathsf{n}}\rsem}$. We know that ${\lsem{}{\mathsf{n}}\rsem}^{w}_{i}\neq\emptyset$ iff $|i-\operatorname{max-start}(\mathsf{n})|\leq w$, and since the value $\operatorname{max-start}(\mathsf{n})=\max\{\min(\nu)\mid\nu\in{\lsem{}{\mathsf{n}}\rsem}_{\operatorname{prod}}\}$ is stored in every node $\mathsf{n}$ and we are doing a simple calculation with constants, we can check (1) in constant time. Note that it is not necessary to recursively check the $\operatorname{max-start}$ of the rest of the nodes in ${\lsem{}{\mathsf{n}}\rsem}_{\operatorname{prod}}$ since they are considered in the definition.

On the other hand, the product of two bags of valuations $V,V^{\prime}$ is defined as the bag $V\oplus V^{\prime}=\{\!\!\{\nu\oplus\nu^{\prime}\mid\nu\in V,\nu^{\prime}\in V^{\prime}\}\!\!\}$, where $\nu\oplus\nu^{\prime}$ is the product of two valuations, defined as the valuation such that $[\nu\oplus\nu^{\prime}](\ell)=\nu(\ell)\cup\nu^{\prime}(\ell)$ for every $\ell\in\Omega$. With these definitions, we can enumerate a single valuation $\nu\in{\lsem{}{\mathsf{n}}\rsem}_{\operatorname{prod}}$ by calculating the union of the valuation $\nu_{L(\mathsf{n}),i(\mathsf{n})}$ with one valuation $\nu_{\mathsf{n}^{\prime}}\in{\lsem{}{\mathsf{n}^{\prime}}\rsem}$ for each $\mathsf{n}^{\prime}\in\operatorname{prod}(\mathsf{n})$. It is easy to see that we can complete (2) by both calculating and writing this valuation in time linear in its size. It is worth noting that we can make sure that we find valuations inside the time window in constant time by traversing every bag in reverse order (from higher to lower $\min(\nu)$).
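The product of valuations and bags, together with the window condition $|i-\min(\nu)|\leq w$, can be illustrated by a minimal Python sketch. It assumes that a valuation is a dictionary from labels to sets of positions and that a bag is a plain list (so multiplicities are kept); the names `val_product`, `bag_product`, `min_pos` and `in_window` are hypothetical. The sketch materializes the whole product, so it illustrates the semantics only and not the constant-delay enumeration discussed in the proof.

```python
from itertools import product
from typing import Dict, FrozenSet, List

# Hypothetical encoding: a valuation maps each label to a set of stream positions.
Valuation = Dict[str, FrozenSet[int]]


def val_product(v1: Valuation, v2: Valuation) -> Valuation:
    """Label-wise union of two valuations."""
    labels = set(v1) | set(v2)
    return {l: v1.get(l, frozenset()) | v2.get(l, frozenset()) for l in labels}


def bag_product(bag1: List[Valuation], bag2: List[Valuation]) -> List[Valuation]:
    """All pairwise products; the result has len(bag1) * len(bag2) entries."""
    return [val_product(a, b) for a, b in product(bag1, bag2)]


def min_pos(v: Valuation) -> int:
    """Smallest position used by the valuation."""
    return min(p for positions in v.values() for p in positions)


def in_window(bag: List[Valuation], i: int, w: int) -> List[Valuation]:
    """Keep the valuations whose first position is at most w away from i."""
    return [v for v in bag if abs(i - min_pos(v)) <= w]


V = [{"R": frozenset({1})}, {"R": frozenset({4})}]
W = [{"S": frozenset({5})}]
P = bag_product(V, W)            # two valuations, starting at positions 1 and 4
recent = in_window(P, i=6, w=3)  # only the valuation starting at position 4 survives
```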

After enumerating the first output, we can continue traversing the bags of valuations, checking in constant time whether $|i-\min(\nu_{\mathsf{n}^{\prime}})|\leq w$. In the worst case, which occurs right after writing the last valuation to the output, we have to check that $|i-\min(\nu_{\mathsf{n}^{\prime}})|\leq w$ for every node $\mathsf{n}^{\prime}\in\operatorname{prod}(\mathsf{n})$, but since each check takes constant time and there is one node for each valuation we are adding to the output, this step can also be done in linear time with respect to $|\nu|$. Finally, after enumerating every output of $\operatorname{prod}(\mathsf{n})$ inside the time window, we can recursively start the enumeration for $\operatorname{uleft}(\mathsf{n})$ and $\operatorname{uright}(\mathsf{n})$ in constant time, which maintains an output-linear delay. ∎

Proof of Proposition 5.3

Proof.

Fix $k,w\in\mathbb{N}$ and assume that one performs $\mathtt{union}(\mathsf{n}_{1},\mathsf{n}_{2})$ over $\mathsf{DS}_{w}$ with the same position $i=i(\mathsf{n}_{2})$ at most $k$ times. In the following, we first prove the proposition with an implementation of the $\mathtt{union}$ operation that is not fully persistent and then show how to modify the implementation to maintain this property.

Let $\mathsf{n}_{1},\mathsf{n}_{2}\in\operatorname{Nodes}(\mathsf{DS}_{w})$ be two nodes such that $\max(\mathsf{n}_{1})\leq i(\mathsf{n}_{2})$ and $\operatorname{uleft}(\mathsf{n}_{2})=\operatorname{uright}(\mathsf{n}_{2})=\bot$. We say that $\mathsf{n}_{1}\leq\mathsf{n}_{2}$ iff (1) $\operatorname{max-start}(\mathsf{n}_{1})\leq\operatorname{max-start}(\mathsf{n}_{2})$ and (2) if $\operatorname{max-start}(\mathsf{n}_{1})=\operatorname{max-start}(\mathsf{n}_{2})$ then $i(\mathsf{n}_{1})\leq i(\mathsf{n}_{2})$.
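Conditions (1) and (2) together amount to a lexicographic comparison of the pairs $(\operatorname{max-start}(\mathsf{n}),i(\mathsf{n}))$. A minimal Python sketch of this order is given below; the class `NodeKey` and the function `leq` are hypothetical names that keep only the two fields needed for the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeKey:
    """Hypothetical stand-in for a node: only max-start and position are kept."""
    max_start: int
    i: int


def leq(n1: NodeKey, n2: NodeKey) -> bool:
    """n1 <= n2: compare max-start first and break ties with the position."""
    return (n1.max_start, n1.i) <= (n2.max_start, n2.i)
```

For instance, a node with max-start 5 at position 6 is below a node with max-start 5 at position 9, while any node with max-start 6 is above both.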

Recall that this operation requires inserting $\mathsf{n}_{2}$ into $\mathsf{n}_{1}$ and it outputs a fresh node $\mathsf{n}_{u}$ such that ${\lsem{}{\mathsf{n}_{u}}\rsem}^{w}_{i}:={\lsem{}{\mathsf{n}_{1}}\rsem}^{w}_{i}\cup{\lsem{}{\mathsf{n}_{2}}\rsem}^{w}_{i}$.

If $|\operatorname{max-start}(\mathsf{n}_{1})-i(\mathsf{n}_{2})|>w$ then all of the outputs from $\mathsf{n}_{1}$ are now out of the time window, so ${\lsem{}{\mathsf{n}_{1}}\rsem}^{w}_{i}\cup{\lsem{}{\mathsf{n}_{2}}\rsem}^{w}_{i}={\lsem{}{\mathsf{n}_{2}}\rsem}^{w}_{i}$ and therefore $\mathtt{union}(\mathsf{n}_{1},\mathsf{n}_{2})=\mathsf{n}_{2}$. The time needed for this operation is the time necessary to insert $\mathsf{n}_{2}$ in $\mathsf{DS}_{w}$, which we analyze later.

On the other hand, if $|\operatorname{max-start}(\mathsf{n}_{1})-i(\mathsf{n}_{2})|\leq w$, we have to consider the outputs of both nodes, $\mathsf{n}_{2}$ and $\mathsf{n}_{1}$. First, check how $\mathsf{n}_{1}$ compares with $\mathsf{n}_{2}$. If $\mathsf{n}_{1}\leq\mathsf{n}_{2}$, then we create the new node $\mathsf{n}=\mathsf{n}_{2}$ such that $\operatorname{uleft}(\mathsf{n})=\mathtt{union}(\operatorname{uleft}(\mathsf{n}_{2}),\mathsf{n}_{1})$ and, similarly, if $\mathsf{n}_{1}>\mathsf{n}_{2}$, we create the new node $\mathsf{n}=\mathsf{n}_{1}$ such that $\operatorname{uleft}(\mathsf{n})=\mathtt{union}(\operatorname{uleft}(\mathsf{n}_{1}),\mathsf{n}_{2})$. In both cases, we only create one node and switch or add pointers between nodes a constant number of times. Although it might seem like a recursive operation at first glance, we know beforehand that $\operatorname{max-start}(\mathsf{n})\geq\operatorname{max-start}(\operatorname{uleft}(\mathsf{n}))$, so at most one other union process is generated. Once again, this part of the operation takes constant time, and the bulk of the operation is the time necessary to insert the new node.

We know that for every position $i=i(\mathsf{n}_{2})$ we will perform a $\mathtt{union}$ operation at most $k$ times. Starting with an empty data structure, there will be at most $k\cdot w$ nodes in $\mathsf{DS}_{w}$ given a time window $w$. Assuming $\mathsf{DS}_{w}$ is a perfectly balanced binary tree, this means that the tree has a depth of $\log_{2}(k\cdot w)$.

To ensure that the tree will always be balanced, we can add one bit of information to every node, which we will call the direction bit, that indicates which of the children of the node we need to visit for the insertion. If $bit(\mathsf{n})=0$, we must go to its left child, and we must go to the right one otherwise. After each insertion, we need to change the value of the direction bit of every node in the path from the root to the newly inserted one, to avoid repeating the same path on the next insertion. This operation can be done in constant time for each node, so the time it will take to update all of the direction bits for each insertion will be exactly the depth of the tree.
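The direction-bit idea can be sketched in Python as follows. This is a minimal, non-persistent illustration under simplifying assumptions: nodes only carry a key, the window-based replacement of old nodes is not modeled, and the names `Node`, `insert` and `depth` are hypothetical.

```python
from typing import Optional


class Node:
    """Binary tree node with a direction bit (0: go left next, 1: go right next)."""

    def __init__(self, key: int) -> None:
        self.key = key
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None
        self.bit = 0


def insert(root: Optional[Node], key: int) -> Node:
    """Follow the direction bits from the root, flipping each bit on the way down."""
    new = Node(key)
    if root is None:
        return new
    cur = root
    while True:
        go_left = cur.bit == 0
        cur.bit ^= 1  # the next insertion through this node takes the other child
        child = cur.left if go_left else cur.right
        if child is None:
            if go_left:
                cur.left = new
            else:
                cur.right = new
            return root
        cur = child


def depth(node: Optional[Node]) -> int:
    """Number of edges on the longest root-to-leaf path; -1 for the empty tree."""
    if node is None:
        return -1
    return 1 + max(depth(node.left), depth(node.right))
```

Because every node alternates between its two children, the sizes of its subtrees never differ by more than one, so after inserting $n$ nodes the depth in this sketch is $\lfloor\log_{2}n\rfloor$.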

Since one performs $\mathtt{union}(\mathsf{n}_{1},\mathsf{n}_{2})$ over $\mathsf{DS}_{w}$ with the same position $i=i(\mathsf{n}_{2})$ at most $k$ times, if we start with an empty data structure, it will have at most $k\times w$ nodes after reading $w$ tuples from the stream. To insert the next node $\mathsf{n}^{\prime}$, by following the direction bits, we will end up in the oldest node of the tree $\mathsf{n}$, but it is clear that $i(\mathsf{n})\leq i(\mathsf{n}^{\prime})+w$, meaning that $\operatorname{max-start}(\mathsf{n})-i(\mathsf{n})\leq w$, which indicates that all of the outputs of $\mathsf{n}$ are outside of the time window, and therefore we can safely remove $\mathsf{n}$ from the tree and replace it with $\mathsf{n}^{\prime}$ without losing outputs. Given that the depth of $\mathsf{DS}_{w}$ is at most $\log_{2}(k\cdot w)$, and all of our previous operations take time proportional to the depth of the tree, we can conclude that the running time of the $\mathtt{union}$ operation is in $\pazocal{O}(\log(k\cdot w))$ for each call.

As we stated before, although the method we just discussed works and has a running time in $\pazocal{O}(\log(k\cdot w))$ for each call, it is not a fully persistent implementation, since we are removing nodes from the leaves when they are not producing an output and we are also modifying the direction bits of the nodes. To solve this problem, we can use the path copying method. With this method, whenever we need to modify a node, we create a copy with the modifications applied instead.

In our case, for every insertion we will create a copy of the entire path from the root to the new node, since we will modify the direction bit of each of these nodes, setting the modified copy of the root as the new root of the data structure. It is easy to see that with clever use of pointers, the copying of a node can be done in constant time, so the usage of this method does not increase the overall running time of the $\mathtt{union}$ operation. ∎
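Path copying can be illustrated with a small Python sketch in which nodes are immutable and an insertion rebuilds only the nodes on the path from the root to the new leaf. The names `PNode`, `p_insert` and `size` are hypothetical, nodes only carry a key, and the window-based removal of old nodes is again left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PNode:
    """Immutable tree node; bit 0 means the next insertion goes left, 1 right."""
    key: int
    left: Optional["PNode"] = None
    right: Optional["PNode"] = None
    bit: int = 0


def p_insert(root: Optional[PNode], key: int) -> PNode:
    """Return a new root; only the nodes on the insertion path are copied."""
    if root is None:
        return PNode(key)
    if root.bit == 0:
        # Copy this node with a flipped bit and a rebuilt left subtree.
        return PNode(root.key, p_insert(root.left, key), root.right, 1)
    # Copy this node with a flipped bit and a rebuilt right subtree.
    return PNode(root.key, root.left, p_insert(root.right, key), 0)


def size(node: Optional[PNode]) -> int:
    """Number of nodes reachable from the given root."""
    return 0 if node is None else 1 + size(node.left) + size(node.right)


# Every version stays usable after later insertions.
versions = [None]
for key in range(1, 9):
    versions.append(p_insert(versions[-1], key))
```

Each entry of `versions` is a complete tree of its own, while untouched subtrees are shared between consecutive versions rather than copied.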

Proof of Proposition 5.4

Proof.

Fix a time window size $w\in\mathbb{N}$, a stream $\pazocal{S}$, a position $i\in\mathbb{N}$ and a PCEA with equality predicates $\pazocal{P}=(Q,\textbf{U}_{\operatorname{lin}},\textbf{B}_{\operatorname{eq}},\Omega,\Delta,F)$. The output of the automaton $\pazocal{P}$ over $\pazocal{S}$ at position $i$ with time window $w$ is defined as the set of valuations:

\[
{\lsem{}{\pazocal{P}}\rsem}_{i}^{w}(\pazocal{S})\ =\ \{\nu_{\rho}\mid\text{$\rho$ is an accepting run of $\pazocal{P}$ over $\pazocal{S}$ at position $i$}\wedge|i-\min(\nu_{\rho})|\leq w\}
\]

We need to prove that Algorithm 1 enumerates every valuation $\nu_{\rho}$ without repetitions. One way to do this is to show that, at any position in the stream, the indices in $H$ contain the information of every single run of $\pazocal{P}$ so far, and that the outputs for each of these runs can be enumerated.

  • Let $i=0$ and suppose that $\pazocal{S}=\{\!\!\{R(\bar{x}),\ldots\}\!\!\}$. $H$ trivially contains the information of all the runs up to this point, so we need to show that this condition still holds after the first tuple.

    Looking at the algorithm, after the Reset call, we start with $i=0$, $\mathsf{DS}_{w}=\emptyset$, and $N_{p}=\emptyset$ for every $p\in Q$. Calling $\textsc{FireTransitions}(R(\bar{x}),0)$, we check each transition satisfied by $R(\bar{x})$ and register it in a node of $\mathsf{DS}_{w}$. Since $\pazocal{P}$ is unambiguous, there is only one transition that can lead to an accepting state, $e_{f}=(\emptyset,U,\emptyset,L_{f},p_{f})\in\Delta$ with $p_{f}\in F$, and for $e_{f}$ we have $N=\{\}$ and $N_{p_{f}}=\mathtt{extend}(L_{f},0,\{\})$. In addition, $\pazocal{P}$ can take (several) transitions that do not lead to a final state; these are transitions of the form $e=(\emptyset,U,\emptyset,L,p)\in\Delta$ with $N_{p}=\mathtt{extend}(L,0,\{\})$.

    On the other hand, $\textsc{UpdateIndices}(R(\bar{x}))$ uses the nodes created in FireTransitions and assigns them to every possible transition that could be satisfied by them. In particular, for every reached state $p$, we add each node in $\mathsf{N}_{p}$ to the data structure $H[e,p,\reflectbox{$\vec{\reflectbox{$\mathcal{B}$}}$}_{p}(t)]$, registering every incomplete run of $\pazocal{P}$.

    Finally, we enumerate the outputs of each run that reached a final state. Since $\pazocal{P}$ is unambiguous, this enumeration will not have duplicates. It is easy to see that the enumeration will output our only valuation since $p_{f}\in F$.

  • Suppose that $H$ contains the information of every single run of $\pazocal{P}$ up until position $i-1$, that $\pazocal{S}[i]=S(\bar{y})$, and that we can enumerate every valuation in ${\lsem{}{\pazocal{P}}\rsem}_{i-1}^{w}(\pazocal{S})$. We want to prove that after calling $\textsc{FireTransitions}(S(\bar{y}),i)$ and $\textsc{UpdateIndices}(S(\bar{y}))$, $H$ will also contain the information of the runs up until $i$.

    Once again we start with $\mathsf{N}_{p}=\emptyset$ for each $p\in Q$, but this time $H[e,p,\reflectbox{$\vec{\reflectbox{$\mathcal{B}$}}$}_{p}(t)]$ is not empty. Similar to the previous case, upon calling $\textsc{FireTransitions}(S(\bar{y}),i)$, we create a new node for every new state reached by any of the runs, and it is clear by the definition of $\Delta$ and $\reflectbox{$\vec{\reflectbox{$\mathcal{B}$}}$}_{p}(t)$ that $S(\bar{y})\in U$ and $\bigwedge_{p\in P}\mathsf{H}[e,p,\vec{\mathcal{B}}_{p}(t)]\neq\emptyset$ for a transition $e=(P,U,\mathcal{B},L,q)\in\Delta$ iff there is a run tree $\rho$ and a node $\bar{u}$ such that $\rho(\bar{u})=(q,i,L)$.

    In the same fashion, $\textsc{UpdateIndices}(S(\bar{y}))$ will calculate, for each transition and each state in those transitions, the left projection of the binary predicate for $S(\bar{y})$, keeping the data structure in $H$ updated with the runs of $\pazocal{P}$.

    Finally, the algorithm was already capable of enumerating every valuation that ends in a position $j<i$, and we get from $\textsc{FireTransitions}(S(\bar{y}),i)$ that every new accepting run will have its associated nodes.