
Accurate Summary-based Cardinality Estimation Through the Lens of Cardinality Estimation Graphs

Jeremy Chen University of Waterloo jeremy.chen@uwaterloo.ca Yuqing Huang University of Waterloo y558huan@uwaterloo.ca Mushi Wang University of Waterloo m358wang@uwaterloo.ca Semih Salihoglu University of Waterloo semih.salihoglu@uwaterloo.ca  and  Ken Salem University of Waterloo ken.salem@uwaterloo.ca
Abstract.

We study two classes of summary-based cardinality estimators that use statistics about input relations and small-size joins in the context of graph database management systems: (i) optimistic estimators that make uniformity and conditional independence assumptions; and (ii) the recent pessimistic estimators that use information-theoretic linear programs. We begin by addressing the problem of how to make accurate estimates with optimistic estimators. We model these estimators as picking bottom-to-top paths in a cardinality estimation graph (CEG), which contains sub-queries as nodes and weighted edges between sub-queries that represent average degrees. We outline a space of heuristics for making an optimistic estimate in this framework and show that the effective heuristics depend on the structure of the input queries. We observe that on acyclic queries and queries with small-size cycles, using the maximum-weight path is an effective technique to address the well-known underestimation problem of optimistic estimators. We show that on a large suite of datasets and workloads, such estimates are up to three orders of magnitude more accurate in mean q-error than those of heuristics proposed in prior work. In contrast, we show that on queries with larger cycles these estimators tend to overestimate, which can be addressed partially by using minimum-weight paths and more effectively by using an alternative CEG. We then show that CEGs can also model the recent pessimistic estimators. This surprising result allows us to connect two disparate lines of work on optimistic and pessimistic estimators, adopt an optimization from pessimistic estimators to optimistic ones, and provide insights into the pessimistic estimators, such as showing that there are alternative combinatorial solutions to the linear programs that define them.


1. Introduction

The problem of estimating the output size of a natural multi-join query (henceforth join query for short) is a fundamental problem solved in the query optimizers of database management systems when generating efficient query plans. This problem arises both in systems that manage relational data and in those that manage graph-structured data, where systems need to estimate the cardinalities of subgraphs in their input graphs. It is well known that both problems are equivalent, since subgraph queries can equivalently be written as join queries over binary relations that store the edges of a graph.

We focus on the prevalent technique used by existing systems of using statistics about the base relations or outputs of small-size joins to estimate cardinalities of joins. These techniques use these statistics in algebraic formulas that make independence and uniformity assumptions to generate estimates for queries (Aboulnaga et al., 2001; Maduko et al., 2008; Mhedhbi and Salihoglu, 2019; Neumann and Moerkotte, 2011). We refer to these as summary-based optimistic estimators (optimistic estimators for short), to emphasize that these estimators can both under- and overestimate. This contrasts with the recent pessimistic estimators that are based on worst-case optimal join size bounds (Abo Khamis et al., 2016; Atserias et al., 2013; Cai et al., 2019; Gottlob et al., 2012; Joglekar and Ré, 2018) and avoid underestimation, albeit at the cost of very loose, and so inaccurate, estimates (Park et al., 2020). In this work, we study how to make accurate estimates with optimistic estimators through a new framework, which we call cardinality estimation graphs (CEGs), that represents them. We observe and address several shortcomings of these estimators under the CEG framework. We show that this framework is useful in two additional ways: (i) CEGs can also represent the pessimistic estimators, establishing that these two classes of estimators are in fact surprisingly connected; and (ii) CEGs are useful mathematical tools for proving several theoretical properties of pessimistic estimators.

Refer to caption
Figure 1. Example subgraph query $Q_{5f}$.

We begin by showing that the algebraic formulas of prior optimistic estimators can be modeled as picking a bottom-to-top path in a weighted CEG, which we call $CEG_O$, for Optimistic. In this CEG, nodes are intermediate sub-queries and edge weights are average degree statistics that extend sub-queries to larger queries. For example, consider the subgraph query in Figure 1 and the input dataset in Figure 2, whose $CEG_O$ is shown in Figure 3. We observe that in estimators that can be represented with CEGs, there is often more than one way to generate an estimate for a query, corresponding to different bottom-to-top paths, and that this choice does not have a clear answer in the case of $CEG_O$. For example, consider the subgraph query in Figure 1. Given the accurate cardinalities of all subqueries of size $\leq 2$, there are 252 formulas (or bottom-to-top paths in the CEG of optimistic estimators) for estimating the cardinality of the query. Examples of these formulas are:

  • $|\xrightarrow{A}\xrightarrow{B}|\times\frac{|\xrightarrow{B}\xrightarrow{C}|}{|\xrightarrow{B}|}\times\frac{|\xleftarrow{C}\xrightarrow{D}|}{|\xrightarrow{C}|}\times\frac{|\xleftarrow{D}\xrightarrow{E}|}{|\xrightarrow{D}|}$

  • $|\xrightarrow{A}\xrightarrow{B}|\times\frac{|\xrightarrow{B}\xrightarrow{D}|}{|\xrightarrow{B}|}\times\frac{|\xleftarrow{C}\xrightarrow{D}|}{|\xrightarrow{D}|}\times\frac{|\xrightarrow{B}\xrightarrow{E}|}{|\xrightarrow{B}|}$

In previous work (Aboulnaga et al., 2001; Maduko et al., 2008; Mhedhbi and Salihoglu, 2019), the choice of which of these estimates to use has either been left unspecified or decided by a heuristic, without acknowledging other possible choices or empirically justifying the choice made. As our first main contribution, we systematically describe a space of heuristics for making an estimate with optimistic estimators and show empirically that the better-performing heuristics depend on the structure of the query. We show that on acyclic queries and queries with small-size cycles, whose statistics are available, using the maximum-weight path through the CEG is an effective way to make accurate estimates. We observe that, as in the relational setting, estimators that use independence assumptions tend to underestimate the true cardinalities on these queries, and the maximum-weight path in the CEG can offset these underestimations. In contrast, we observe that on queries that contain larger cycles, the optimistic estimators in effect estimate modified versions of these queries that break the cycles into paths, which leads to overestimation. We show that on such queries, using the minimum-weight paths generally leads to more accurate estimates than other heuristics. However, these estimates can still be highly inaccurate, and we address this shortcoming by defining a new CEG for these queries, which we call $CEG_{OCR}$, for Optimistic Cycle closing Rate, whose edge weights are statistics that estimate the probability that two edges connected through a path close into a cycle. We further show that, in general, considering only the “long-hop” paths, i.e., those with more edges, which make more independence assumptions but condition on larger sub-queries, performs better than considering paths with fewer edges and comparably to considering every path.

Refer to caption
Figure 2. Example dataset in graph and relational formats.

As our second main contribution, we show that CEGs are expressive enough to also model the recent linear program-based pessimistic estimators. Specifically, we show that we can take $CEG_O$, replace its edge weights (which are average degrees) with maximum degrees of base relations and small-size joins, and construct a new CEG, which we call $CEG_M$, for the MOLP bound from reference (Joglekar and Ré, 2018). We show that each path in $CEG_M$ is guaranteed to be an overestimate of the cardinality of the query, so picking the minimum-weight path on this CEG yields the most accurate estimate. We show that this path is in fact equivalent to the MOLP pessimistic estimator from reference (Joglekar and Ré, 2018). The ability to model pessimistic estimators as CEGs allows us to make several contributions. First, we connect two seemingly disparate classes of estimators: both subgraph summary-based optimistic estimators and the recent LP-based pessimistic ones can be modeled as different instances of estimators that pick paths through CEGs. Second, this result sheds light on the nature of the arguably opaque LPs that define pessimistic estimators. Specifically, we show that in addition to their numeric representations, the pessimistic estimators have a combinatorial representation. Third, we show that a bound sketch optimization from the recent pessimistic estimator of reference (Cai et al., 2019) can be applied directly to any estimator using a CEG, in particular to the optimistic estimators, and we empirically demonstrate its benefits in some settings.

CEGs further turn out to be very useful mathematical tools for proving certain properties of the pessimistic estimators, which may be of independent interest to readers interested in the theory of pessimistic estimators. Specifically, using CEGs in our proofs, we show that MOLP can be simplified because some of its constraints are unnecessary, and we provide alternative combinatorial proofs of several known properties of MOLP, such as the theorem that MOLP is tighter than another bound called DBPLP (Joglekar and Ré, 2018), and that MOLP is at least as tight as the pessimistic estimator proposed by Cai et al. (Cai et al., 2019) and identical to it on acyclic queries over binary relations.

Refer to caption
Figure 3. $CEG_O$ for query $Q_{5f}$ in Figure 1 when the Markov table (§4) contains joins up to size 3. Only a subset of the vertices and edges of the full CEG is shown.

The remainder of this paper is structured as follows. Section 2 provides our query and database notation. Section 3 gives an overview of generic estimators that can be seen as picking paths from a CEG. Section 4 reviews optimistic estimators that can be modeled with $CEG_O$ and outlines the space of possible heuristics for making estimates using $CEG_O$. We also discuss the shortcoming of $CEG_O$ when estimating queries with large cycles and present $CEG_{OCR}$ to address this shortcoming. Section 5 reviews the pessimistic estimators, the $CEG_M$ of the MOLP estimator, and the bound sketch refinement of pessimistic estimators. Using $CEG_M$, we prove several properties of MOLP and connect some of these pessimistic estimators. Section 6 presents extensive experiments evaluating our space of optimistic estimators on both $CEG_O$ and $CEG_{OCR}$, as well as the benefits of the bound sketch optimization. We compare the optimistic estimators against other summary-based and sampling-based techniques and evaluate the effects of our estimators on plan quality in an actual system. We also confirm, as in reference (Park et al., 2020), that the pessimistic estimators in our settings are not competitive and lead to highly inaccurate estimates. Finally, Sections 7 and 8 cover related work and conclude, respectively.

2. Query and Database Notation

We consider conjunctive queries of the form

$Q(\mathcal{A}) = R_1(\mathcal{A}_1), \ldots, R_m(\mathcal{A}_m)$

where $R_i(\mathcal{A}_i)$ is a relation with attributes $\mathcal{A}_i$ and $\mathcal{A}=\cup_i\mathcal{A}_i$. Most of the examples used in this paper involve edge-labeled subgraph queries, in which case each $R_i$ is a binary relation containing a subset of the edges in a graph as source/destination pairs. Figure 2 presents an example showing a graph with edge labels $A$, $B$, $C$, $D$, and $E$, shown in black, orange, green, red, and blue. This graph can be represented using five binary relations, one for each of the edge labels. These relations are also shown in Figure 2.

We will often represent queries over such relations using a graph notation. For example, consider the relations $A$ and $B$ from Figure 2. We will represent the query $Q(a_1,a_2,a_3)=A(a_1,a_2)\bowtie B(a_2,a_3)$ as $a_1\xrightarrow{A}a_2\xrightarrow{B}a_3$. Similarly, the query $Q(a_1,a_2,a_3)=A(a_1,a_2)\bowtie B(a_3,a_2)$ will be represented as $a_1\xrightarrow{A}a_2\xleftarrow{B}a_3$.

3. CEG Overview

Next, we offer some intuition for cardinality estimation graphs (CEGs). In Sections 4 and 5 we will define specific CEGs corresponding to different classes of optimistic and pessimistic estimators. However, all of these share a common structure for representing cardinality estimates. Specifically, a CEG for a query $Q$ consists of:

  • Vertices labeled with subqueries of $Q$, where subqueries are defined by subsets of $Q$’s relations or attributes.

  • Edges from smaller subqueries to larger subqueries, labeled with extension rates, which represent the cardinality of the larger subquery relative to that of the smaller subquery.

Refer to caption
Figure 4. $CEG_O$ for query $Q_{5f}$ from Figure 1. The figure shows a subset of the vertices and edges of the full CEG.

Each bottom-to-top path (from $\emptyset$ to $Q$) in a CEG represents a different way of generating a cardinality estimate for $Q$. An estimator using a CEG picks one of these paths as its estimate. The estimate of a path is the product of the extension rates along the edges of the path. Equivalently, one can use the logarithms of the extension rates as edge weights, sum them along the path, and exponentiate to compute the estimate.

Figure 4 illustrates a CEG (specifically, a $CEG_O$, defined in Section 4) for the query $Q_{5f}$ shown in Figure 1 over the relations shown in Figure 2, assuming that statistics are available for all size-2 subqueries of $Q_{5f}$. For example, the leftmost path starts with $a_1\xrightarrow{A}a_2\xrightarrow{B}a_3$, then extends to $a_1\xrightarrow{A}a_2\xrightarrow{B}a_3\xrightarrow{C}a_4$, then to the 4-fork subquery involving $A$, $B$, $C$, and $D$, and finally extends the 4-fork subquery to $Q_{5f}$. The first extension rate, from $\emptyset$ to $a_1\xrightarrow{A}a_2\xrightarrow{B}a_3$, is simply the known cardinality of $a_1\xrightarrow{A}a_2\xrightarrow{B}a_3$, which is 4, and the second extension rate makes the uniformity assumption of $|a_2\xrightarrow{B}a_3\xrightarrow{C}a_4|/|a_2\xrightarrow{B}a_3|=\frac{3}{2}$. The final estimate of this path is $4\times\frac{3}{2}\times\frac{5}{2}\times\frac{7}{2}=52.5$.
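To make this concrete, the following short Python sketch (our own illustration, not code from the paper) evaluates the leftmost path of Figure 4 both as a product of extension rates and, equivalently, as an exponentiated sum of their logarithms:

```python
import math

# Extension rates along the leftmost (empty set to Q5f) path of Figure 4:
# |A->B| = 4, then the three uniformity-based rates 3/2, 5/2, and 7/2.
rates = [4.0, 3 / 2, 5 / 2, 7 / 2]

def path_estimate(rates):
    # Estimate of a bottom-to-top CEG path: product of its extension rates.
    est = 1.0
    for r in rates:
        est *= r
    return est

def path_estimate_log(rates):
    # Equivalent estimate: sum log-weights along the path, then exponentiate.
    return 2 ** sum(math.log2(r) for r in rates)

print(path_estimate(rates))      # 52.5
print(path_estimate_log(rates))  # 52.5 (up to floating-point rounding)
```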

In the rest of this paper, we will show how some of the optimistic and pessimistic estimators from the literature can be modeled as instances of this generic estimator using different CEGs. We will show that while it is clear that the minimum-weight path should be the estimate chosen in the CEG of pessimistic estimators, it is not clear which path should be chosen from the CEG of optimistic estimators. We will also discuss a shortcoming of the CEG for optimistic estimators and address it by defining a new CEG.

4. Optimistic Estimators

The estimators that we refer to as optimistic in this paper use known statistics about the input database in formulas that make uniformity and independence or conditional independence assumptions. The cardinality estimators of several systems fall under this category. We focus on three estimators: Markov tables (Aboulnaga et al., 2001) from XML databases, graph summaries (Maduko et al., 2008) from RDF databases, and the graph catalogue estimator of the Graphflow system (Mhedhbi and Salihoglu, 2019) for managing property graphs. These estimators are extensions of each other and use statistics about the cardinalities of small-size joins. We first give an overview of these estimators, then describe their CEG, which we will refer to as $CEG_O$, and describe a space of possible estimates that an optimistic estimator can make. We then discuss a shortcoming of $CEG_O$ when queries contain large cycles whose statistics are missing and describe a modification to $CEG_O$ that makes more accurate estimates.

4.1. Overview

We begin by giving an overview of the Markov tables estimator (Aboulnaga et al., 2001), which was used to estimate the cardinalities of paths in XML documents. A Markov table of length $h\geq 2$ stores the cardinality of each path of length up to $h$ in an XML document’s element tree and uses these to make predictions for the cardinalities of longer paths. Table 1 shows a subset of the entries in an example Markov table for $h=2$ for our running example dataset shown in Figure 2. The formula to estimate a 3-path using a Markov table with $h=2$ multiplies the cardinality of the leftmost 2-path with the cardinality of the consecutive 2-path divided by the cardinality of the common edge. For example, consider the query $Q_{3p}=\xrightarrow{A}\xrightarrow{B}\xrightarrow{C}$ against the dataset in Figure 2. The formula for $Q_{3p}$ would be: $|\xrightarrow{A}\xrightarrow{B}|\times(|\xrightarrow{B}\xrightarrow{C}|/|\xrightarrow{B}|)$. Observe that this formula is inspired by the Bayesian probability rule $Pr(ABC)=Pr(AB)Pr(C|AB)$ but makes a conditional independence assumption between $A$ and $C$, under which the Bayesian formula simplifies to $Pr(ABC)=Pr(AB)Pr(C|B)$. For $Pr(AB)$ the formula uses the true cardinality $|\xrightarrow{A}\xrightarrow{B}|$. For $Pr(C|B)$ the formula makes a uniformity assumption that the number of $C$ edges that each $B$ edge extends to is the same for every $B$ edge and equals $r=|\xrightarrow{B}\xrightarrow{C}|/|\xrightarrow{B}|$. Equivalently, this can be seen as an “average degree” assumption that the average $C$-degree of nodes in the $\xrightarrow{B}\xrightarrow{C}$ paths is $r$. The result of this formula is $4\times\frac{3}{2}=6$, which underestimates the true cardinality of 7. The graph summaries (Maduko et al., 2008) for RDF databases and the graph catalogue estimator (Mhedhbi and Salihoglu, 2019) for property graphs have extended the contents of Markov tables, respectively, to other acyclic joins, such as stars, and to small cycles, such as triangles, but use the same uniformity and conditional independence assumptions.

Path | |Path|
$\xrightarrow{B}$ | 2
$\xrightarrow{A}\xrightarrow{B}$ | 4
$\xrightarrow{B}\xrightarrow{C}$ | 3
Table 1. Example Markov table for $h=2$.
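As an illustration, the following Python sketch (our own; the string encoding of patterns is hypothetical) reproduces the $Q_{3p}$ estimate above from the Table 1 entries:

```python
# Markov table for h = 2, keyed by a (hypothetical) string encoding of the
# patterns; the cardinalities are the Table 1 entries for Figure 2's dataset.
markov = {"->B": 2, "->A->B": 4, "->B->C": 3}

def estimate_3path(e1, e2, e3):
    # |->e1->e2->e3| ~= |->e1->e2| * |->e2->e3| / |->e2|, i.e., conditional
    # independence of e1 and e3 given e2, plus the uniformity assumption.
    return markov[f"->{e1}->{e2}"] * markov[f"->{e2}->{e3}"] / markov[f"->{e2}"]

print(estimate_3path("A", "B", "C"))  # 6.0; the true cardinality is 7
```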

4.2. Space of Possible Optimistic Estimators

We next represent such estimators using a CEG that we call $CEG_O$. This will help us describe the space of possible estimates that can be made with these estimators. We assume that the given query $Q$ is connected. $CEG_O$ consists of the following:

  • Vertices: For each connected subset $S\subseteq\mathcal{R}$ of $Q$’s relations, $CEG_O$ has a vertex with label $S$. This vertex represents the sub-query $\bowtie_{R_i\in S}R_i$.

  • Edges: Consider two vertices with labels $S$ and $S^{\prime}$ s.t. $S\subset S^{\prime}$. Let $\mathcal{D}$, for difference, be $S^{\prime}\setminus S$, let $\mathcal{E}\supset\mathcal{D}$, for extension, be a join query in the Markov table, and let $\mathcal{I}$, for intersection, be $\mathcal{E}\cap S$. If $\mathcal{E}$ and $\mathcal{I}$ exist in the Markov table, then there is an edge with weight $\frac{|\mathcal{E}|}{|\mathcal{I}|}$ from $S$ to $S^{\prime}$ in $CEG_O$.

When making estimates, we apply two basic rules from prior work that limit the paths considered in $CEG_O$. First, if the Markov table contains size-$h$ joins, the formulas use size-$h$ joins in their numerators. For example, if $h=3$, we do not estimate the cardinality of a sub-query $\xrightarrow{A}\xrightarrow{B}\xrightarrow{C}$ by the formula $|\xrightarrow{A}\xrightarrow{B}|\times\frac{|\xrightarrow{B}\xrightarrow{C}|}{|\xrightarrow{B}|}$ because we store the true cardinality of $\xrightarrow{A}\xrightarrow{B}\xrightarrow{C}$ in the Markov table. Second, for cyclic graph queries, which were covered in reference (Mhedhbi and Salihoglu, 2019), an additional early cycle closing rule is used when generating formulas. In our CEG formulation, this translates to the rule that if $S$ can extend to multiple $S^{\prime}$s and some of them contain additional cycles that are not in $S$, then only the outgoing edges of $S$ to such $S^{\prime}$s are considered when finding paths.

Even when these rules are applied to limit the number of paths considered in a CEG, in general there may be multiple $(\emptyset,Q)$ paths that lead to different estimates. Consider the $CEG_O$ shown in Figure 4, which uses a Markov table of size 2. There are 36 $(\emptyset,Q)$ paths leading to 7 different estimates. Two examples are:

  • $|\xrightarrow{A}\xrightarrow{B}|\times\frac{|\xrightarrow{B}\xrightarrow{C}|}{|\xrightarrow{B}|}\times\frac{|\xrightarrow{B}\xrightarrow{D}|}{|\xrightarrow{B}|}\times\frac{|\xrightarrow{B}\xrightarrow{E}|}{|\xrightarrow{B}|}=52.5$

  • $|\xrightarrow{A}\xrightarrow{B}|\times\frac{|\xrightarrow{B}\xrightarrow{C}|}{|\xrightarrow{B}|}\times\frac{|\xleftarrow{C}\xrightarrow{D}|}{|\xrightarrow{C}|}\times\frac{|\xleftarrow{D}\xrightarrow{E}|}{|\xrightarrow{D}|}=57.6$

Similarly, consider the fork query $Q_{5f}$ in Figure 1 and a Markov table with joins of size up to 3. The $CEG_O$ of $Q_{5f}$ is shown in Figure 3, which contains multiple paths leading to 2 different estimates:

  • $|\xrightarrow{A}\xrightarrow{B}\xrightarrow{C}|\times\frac{|\xleftarrow{C}\overset{D}{\underset{E}{\rightrightarrows}}|}{|\xrightarrow{C}|}$

  • $|\xrightarrow{A}\xrightarrow{B}\xrightarrow{C}|\times\frac{|\xrightarrow{A}\xrightarrow{B}\xrightarrow{D}|}{|\xrightarrow{A}\xrightarrow{B}|}\times\frac{|\xrightarrow{A}\xrightarrow{B}\xrightarrow{E}|}{|\xrightarrow{A}\xrightarrow{B}|}$

Both formulas start with $|\xrightarrow{A}\xrightarrow{B}\xrightarrow{C}|$. The first, “short-hop” formula makes one fewer conditional independence assumption than the “long-hop” formula, which is an advantage. On the other hand, the short-hop formula makes a uniformity assumption that conditions on a smaller-size join. We can expect this assumption to be less accurate than the two assumptions made in the long-hop estimate, which condition on 2-size joins. In general, these two formulas can lead to different estimates.

For many queries, there can be many more than 2 different estimates. Therefore, any optimistic estimator implementation needs to make choices about which formulas to use, which corresponds to picking paths in $CEG_O$. Prior optimistic estimators have either left these choices unspecified or described procedures that implicitly pick a specific path, without acknowledging other possible choices or empirically justifying their choice. Instead, we systematically identify a space of choices that an optimistic estimator can make along two parameters, which also capture the choices made in prior work:

  • Path length: The estimator can identify a set of paths to consider based on the path lengths, i.e., the number of edges or hops, in $CEG_O$, which can be: (i) maximum-hop (max-hop); (ii) minimum-hop (min-hop); or (iii) any number of hops (all-hops).

  • Estimate aggregator: Among the set of paths that are considered, each path gives an estimate. The estimator then has to aggregate these estimates into a final estimate, for which we identify three heuristics: (i) the largest estimated cardinality among the paths (max-aggr); (ii) the lowest estimated cardinality (min-aggr); or (iii) the average of the estimates over all paths (avg-aggr).

Any combination of these two choices can be used to design an optimistic estimator. The original Markov tables (Aboulnaga et al., 2001) chose the max-hop heuristic. In that work, each query was a path, so once the path length is fixed, any path in $CEG_O$ leads to the same estimate; therefore an aggregator was not needed. Graph summaries (Maduko et al., 2008) uses the min-hop heuristic and leaves the aggregator unspecified. Finally, the graph catalogue (Mhedhbi and Salihoglu, 2019) picks min-hop paths and the min-aggr aggregator. We perform a systematic empirical analysis of this space of estimators in Section 6. In fact, we will show that which heuristic to use depends on the structure of the query. For example, on acyclic queries, unlike the choice in reference (Mhedhbi and Salihoglu, 2019), systems can combat the well-known underestimation problem of optimistic estimators by picking the ‘pessimistic’ paths, i.e., using the max-aggr heuristic. Similarly, we find that the max-hop heuristic generally leads to highly accurate estimates. A sketch of this two-parameter space is shown below.
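The following Python sketch (our own illustration) shows how the two parameters compose; it assumes the $(\emptyset,Q)$ paths have already been enumerated as (hop count, estimate) pairs:

```python
import statistics

def pick_estimate(paths, hop_choice="max-hop", aggr="max-aggr"):
    # paths: list of (num_hops, estimate) pairs, one per (empty set, Q) path.
    # First filter the paths by length, then aggregate the surviving estimates.
    if hop_choice == "max-hop":
        hops = max(h for h, _ in paths)
        candidates = [e for h, e in paths if h == hops]
    elif hop_choice == "min-hop":
        hops = min(h for h, _ in paths)
        candidates = [e for h, e in paths if h == hops]
    else:  # "all-hops"
        candidates = [e for _, e in paths]
    if aggr == "max-aggr":
        return max(candidates)
    if aggr == "min-aggr":
        return min(candidates)
    return statistics.mean(candidates)  # "avg-aggr"

# E.g., the max-hop/max-aggr combination suggested for acyclic queries:
print(pick_estimate([(2, 30.0), (3, 52.5), (3, 57.6)]))  # 57.6
```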

4.3. $CEG_{OCR}$: Handling Large Cyclic Patterns

Recall that a Markov table stores the cardinalities of patterns up to some size $h$. Given a Markov table with $h\geq 2$, optimistic estimators can produce estimates for any acyclic query of size larger than $h$. But what about large cyclic queries, i.e., cyclic queries of size larger than $h$?

Faced with a large cyclic query $Q$, the optimistic estimators we have described do not actually produce estimates for $Q$. Instead, they produce an estimate for a similar acyclic query $Q^{\prime}$ that includes all of $Q$’s edges but is not closed. To illustrate this, consider estimating the 4-cycle query in Figure 5 using a Markov table with $h=3$. Recall that the estimates of prior optimistic estimators are captured by paths in the $CEG_O$ for the 4-cycle query, shown in Figure 6(a). Consider as an example the leftmost path, which corresponds to the formula: $|\xrightarrow{A}\xrightarrow{B}\xrightarrow{C}|\times|\xrightarrow{B}\xrightarrow{C}\xrightarrow{D}|/|\xrightarrow{B}\xrightarrow{C}|$. Note that this formula is in fact estimating the 4-path $\xrightarrow{A}\xrightarrow{B}\xrightarrow{C}\xrightarrow{D}$ rather than the 4-cycle shown in Figure 5. This is true for each path in $CEG_O$.

More generally, when queries contain cycles of length $>h$, $CEG_O$ breaks the cycles into paths. Although optimistic estimators generally tend to underestimate acyclic queries due to the independence assumptions they make, the situation is different for cyclic queries. Since there are often significantly more paths than cycles, estimates over $CEG_O$ can be highly inaccurate overestimates. We note that this problem does not arise if a cycle $C$ of length $>h$ contains smaller cycles within it, such as a clique of size $h+1$, because the early cycle closing rule from Section 4.2 avoids formulas that estimate $C$ as a sub-query (i.e., no $(\emptyset,C)$ sub-path will exist in the CEG).

Refer to caption
Figure 5. An example 4-cycle query.
Refer to caption
(a) $CEG_O$
Refer to caption
(b) $CEG_{OCR}$
Figure 6. $CEG_O$ and $CEG_{OCR}$ for the query in Figure 5.

We next describe a modified CEG to address this problem. Consider a query $Q$ with a $k$-cycle $C$ where $k>h$. In order not to break cycles into paths, we need CEG edges whose weights capture the cycle-closing effect of extending a sub-query $S$ that contains $k-1$ edges of $C$ to a sub-query $S^{\prime}$ that contains $C$. We capture this in a new CEG called $CEG_{OCR}$, for Optimistic Cycle closing Rate, which modifies $CEG_O$ as follows. We keep the same vertices as in $CEG_O$ and the same set of edges, except when we detect two vertices $S$ and $S^{\prime}$ with the above property. Then, instead of using the weights from the original $CEG_O$ between $S$ and $S^{\prime}$, we use pre-computed cycle closing probabilities. Suppose the last edge that closes the cycle $C$ is $E_i$ and it lies between the query edges $E_{i-1}$ and $E_{i+1}$. In the Markov table, we store the probability that two connected $E_{i-1}$ and $E_{i+1}$ edges are connected by an additional $E_i$ edge that closes a cycle. We denote this statistic as $P(E_{i-1}*E_{i+1}|E_i)$. We can compute $P(E_{i-1}*E_{i+1}|E_i)$ by computing all paths of varying lengths that start with an $E_{i-1}$ edge and end with an $E_{i+1}$ edge and then counting the number of $E_i$ edges that close such paths into cycles. On many datasets there may be prohibitively many such paths, so we can instead sample $p$ of them. Suppose these $p$ paths lead to $c$ cycles; then we take the probability to be $c/p$. In our implementation we perform the sampling through random walks that start from $E_{i-1}$ and end at $E_{i+1}$, but other sampling strategies can also be employed (a sketch of this procedure is given below). Figure 6(b) shows the $CEG_{OCR}$ for the 4-cycle query in Figure 5. We note that the Markov table for $CEG_{OCR}$ requires computing additional $P(E_{i-1}*E_{i+1}|E_i)$ statistics that $CEG_O$ does not require. The number of entries is at most $O(L^3)$, where $L$ is the number of edge labels in the dataset. For many datasets, e.g., all of the ones we used in our evaluations, $L$ is small, on the order of 10s or 100s, so even in the worst case these entries can be stored in several MBs. In contrast, storing large cycles with $h>3$ edges could potentially require $\Theta(L^h)$ more entries.
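The sampling procedure can be sketched as follows in Python (our own illustration; the graph-access helpers sample_walk and has_edge are hypothetical, and the direction of the closing edge depends on the query's edge directions):

```python
def cycle_closing_rate(g, e_prev, e_mid, e_next, p=1000):
    # Estimate P(E_{i-1} * E_{i+1} | E_i): the probability that a path that
    # starts with an e_prev edge and ends with an e_next edge is closed into
    # a cycle by an e_mid edge.
    closed = 0
    for _ in range(p):
        # Hypothetical helper: a random walk whose first edge has label e_prev
        # and whose last edge has label e_next; returns its endpoint vertices.
        u, v = g.sample_walk(start_label=e_prev, end_label=e_next)
        # Hypothetical helper: does an e_mid edge close the walk into a cycle?
        if g.has_edge(v, u, label=e_mid):
            closed += 1
    return closed / p  # c / p in the paper's notation
```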

5. Pessimistic Estimators

Join cardinality estimation is directly related to the following fundamental question: Given a query $Q$ and a set of statistics over the relations $R_i$, such as their cardinalities or degree information about values in different columns, what is the worst-case output size of $Q$? Starting from the seminal result by Atserias, Grohe, and Marx in 2008 (Atserias et al., 2013), several upper bounds have been given for this question under different known statistics. For example, the initial upper bound from reference (Atserias et al., 2013), now called the AGM bound, used only the cardinalities of each relation, while later bounds, DBPLP (Joglekar and Ré, 2018), MOLP (Joglekar and Ré, 2018), and CLLP (Abo Khamis et al., 2016), used maximum degrees of the values in the columns and improved on the AGM bound. Since these bounds are upper bounds on the query output size, they can be used as pessimistic estimators. This was done recently by Cai et al. (Cai et al., 2019) in an actual estimator implementation. We refer to this as the CBS estimator, after the names of the authors.

In this section, we show a surprising connection between the optimistic estimators from Section 4 and the recent pessimistic estimators (Joglekar and Ré, 2018; Cai et al., 2019). Specifically, in Section 5.1, we show that, similar to optimistic estimators, MOLP (and CBS) can also be modeled as an estimator using a CEG. CEGs further turn out to be useful mathematical tools for proving properties of pessimistic estimators. We next show applications of CEGs in several of our proofs, obtaining theoretical results that provide insights into the pessimistic estimators. Section 5.2 reviews the CBS estimator. Using our CEG framework, we show that the CBS estimator is in fact equivalent to the MOLP bound on the acyclic queries on which it was evaluated in reference (Cai et al., 2019). In Section 5.2, we also review the bound sketch refinement of the CBS estimator from reference (Cai et al., 2019), which we show can be applied to any estimator using a CEG, specifically the optimistic ones we cover in this paper. Finally, Appendix D reviews the DBPLP bound and provides an alternative, CEG-based proof that MOLP is tighter than DBPLP.

5.1. MOLP

MOLP was defined in reference (Joglekar and Ré, 2018) as a bound that is tighter than the AGM bound and that uses additional degree statistics about input relations that the AGM bound does not use. We first review the formal notion of a degree. Let $\mathcal{X}$ be a subset of the attributes $\mathcal{A}_i$ of some relation $\mathcal{R}_i$, and let $v$ be a possible value of $\mathcal{X}$. The degree of $v$ in $\mathcal{R}_i$ is the number of times $v$ occurs in $\mathcal{R}_i$, i.e., $deg(\mathcal{X}(v),\mathcal{R}_i)=|\{t\in\mathcal{R}_i\ |\ \pi_{\mathcal{X}}(t)=v\}|$. For example, in Figure 2, $deg(s(3),E)=3$ because the outgoing $E$-degree of vertex 3 is 3. Similarly, $deg(d(2),A)=1$ because the incoming $A$-degree of vertex 2 is 1. We also define $deg(\mathcal{X},\mathcal{R}_i)$ to be the maximum degree in $\mathcal{R}_i$ of any value $v$ over $\mathcal{X}$, i.e., $deg(\mathcal{X},\mathcal{R}_i)=\max_v deg(\mathcal{X}(v),\mathcal{R}_i)$. So, $deg(d,A)=3$ because vertex 13 has 3 incoming $A$ edges, which is the maximum $A$-in-degree in the dataset. The notion of degree can be generalized to $deg(X(v),Y,\mathcal{R}_i)$, the “degree of a value $v$ over attributes $X$ in $\pi_Y\mathcal{R}_i$”, which counts the number of times $v$ occurs in $\pi_Y(\mathcal{R}_i)$. Similarly, we let $deg(X,Y,\mathcal{R}_i)=\max_v deg(X(v),Y,\mathcal{R}_i)$. Suppose a system has stored the $deg(X,Y,\mathcal{R}_i)$ statistics for each $\mathcal{R}_i$ and $X\subseteq Y\subseteq\mathcal{A}_i$. MOLP is:

$$
\begin{aligned}
&\textbf{Maximize } s_{\mathcal{A}}\\
&s_{\emptyset}=0\\
&s_{X}\leq s_{Y},\quad\forall X\subseteq Y\\
&s_{Y\cup E}\leq s_{X\cup E}+\log(deg(X,Y,\mathcal{R}_{i})),\quad\forall X,Y,E\subseteq\mathcal{A},\ X\subseteq Y\subseteq\mathcal{A}_{i}
\end{aligned}
$$

The base of the logarithm can be any constant; we take it to be 2. Let $m_{\mathcal{A}}$ be the optimal value of MOLP. Reference (Joglekar and Ré, 2018) has shown that $2^{m_{\mathcal{A}}}$ is an upper bound on the output size of $Q$. In our running example, the bound implied by these inequalities is 96, which is an overestimate of the true cardinality of 78. It is not easy to see the solution of the MOLP on our running example directly. However, we will next show that we can represent the MOLP bound as the weight of the minimum-weight $(\emptyset,Q)$ path in a CEG that we call $CEG_M$.

MOLP CEG ($CEG_M$): Let $Q_Z$ be the projection of $Q$ onto attributes $Z$, so $Q_Z=\Pi_Z Q$. Each variable $s_Z$ in MOLP represents (the logarithm of) the maximum size of $Q_Z$, i.e., the number of tuples in the projection $Q_Z$ that contribute to the final output. We next interpret the two sets of inequalities in MOLP:

  • Extension Inequalities $s_{Y\cup E}\leq s_{X\cup E}+\log(deg(X,Y,\mathcal{R}_i))$: These inequalities intuitively indicate the following: each tuple $t_{X\cup E}\in Q_{X\cup E}$ can extend to at most $deg(X,Y,\mathcal{R}_i)$ tuples of $Q_{Y\cup E}$. For example, in our running example, let $X=\{a_2\}$, $Y=\{a_2,a_3\}$, and $E=\{a_1\}$, so both $X$ and $Y$ are subsets of the attributes of $B(a_2,a_3)$. The inequality indicates that each $a_1a_2$ tuple, so an $R_A$ tuple, can extend to at most $deg(\{a_2\},\{a_2,a_3\},B(a_2,a_3))=deg(a_2,B)$ $a_1a_2a_3$ tuples. This is true because $deg(a_2,B)$ is the maximum degree of any $a_2$ value in $B$ (in graph terms, the maximum degree of any vertex with an outgoing $B$ edge).

  • Projection Inequalities $s_X\leq s_Y$ $(\forall X\subseteq Y)$: These indicate that the number of tuples in $Q_X$ is at most the number of tuples in $Q_Y$, if $Y$ is a larger subquery.

With these interpretations we can now construct $CEG_M$:

  • Vertices: For each $X\subseteq\mathcal{A}$ there is a vertex. This vertex represents the subquery $\Pi_X Q$.

  • Extension Edges: Add an edge with weight $\log(deg(X,Y,\mathcal{R}_i))$ between any $W_1=X\cup E$ and $W_2=Y\cup E$ for which there is an $s_{Y\cup E}\leq s_{X\cup E}+\log(deg(X,Y,\mathcal{R}_i))$ inequality. Note that there can be multiple edges between $W_1$ and $W_2$, corresponding to inequalities from different relations.

  • Projection Edges: $\forall X\subseteq Y$, add an edge with weight 0 from $Y$ to $X$. These edges correspond to projection inequalities and intuitively indicate that, in the worst-case instances, $\Pi_Y Q$ is always at least as large as $\Pi_X Q$.

Figure 7 shows the $CEG_M$ of our running example. For simplicity, we use actual degrees instead of their logarithms as edge weights and omit the projection edges in the figure. Below we write $(\emptyset,\mathcal{A})$ instead of $(\emptyset,Q)$ for the bottom-to-top paths in $CEG_M$, because $\Pi_{\mathcal{A}}Q=Q$.

Refer to caption
Figure 7. $CEG_M$ for query $Q_{5f}$ in Figure 1.
Theorem 5.1.

Let $Q$ be a query whose degree statistics $deg(X,Y,\mathcal{R}_i)$ for each $\mathcal{R}_i$ and $X\subseteq Y\subseteq\mathcal{A}_i$ are known. The optimal solution $m_{\mathcal{A}}$ to the MOLP of $Q$ is equal to the weight of the minimum-weight $(\emptyset,\mathcal{A})$ path in $CEG_M$.

Proof.

Our proof consists of two steps. First we show that any feasible solution $v$ to MOLP has a value at most the weight of any $(\emptyset,\mathcal{A})$ path. Then we show that a particular feasible solution, which we call $v_{CEG}$, has a value exactly equal to the weight of the minimum-weight $(\emptyset,\mathcal{A})$ path. Let $v$ be a feasible solution to $MOLP_Q$. We refer to the value of $v$, i.e., the value that $s_{\mathcal{A}}$ takes in $v$, simply as $s_{\mathcal{A}}$. Let $P$ be any $(\emptyset,\mathcal{A})$ path in $CEG_M$ and let $w(P)$ be the weight of $P$. Suppose w.l.o.g. that $P=(\emptyset)\xrightarrow{e_0}(E_1)\ldots(E_k)\xrightarrow{e_k}(E_{k+1}=\mathcal{A})$ and, for the purpose of contradiction, that $w(P)=w(e_0)+\ldots+w(e_k)<s_{\mathcal{A}}$. If this is the case, we can inductively (from $i=k+1$ down to 0) show that $w(e_0)+\ldots+w(e_{i-1})<s_{E_i}$. The base case for $s_{E_{k+1}}$ holds by our assumption. Suppose $w(e_0)+\ldots+w(e_i)<s_{E_{i+1}}$ by the induction hypothesis. Then consider the inequality in $MOLP_Q$ that corresponds to the edge $(E_i)\xrightarrow{e_i}(E_{i+1})$. There are two possible cases for this inequality:

Case 1: $e_i$ is a projection edge, so $w(e_i)=0$ and we have an inequality $s_{E_{i+1}}\leq s_{E_i}$. Then $w(e_0)+\ldots+w(e_i)<s_{E_{i+1}}\leq s_{E_i}$, and since $w(e_i)=0$, $w(e_0)+\ldots+w(e_{i-1})<s_{E_i}$.

Case 2: $e_i$ is an extension edge, so we have an inequality $s_{E_{i+1}}\leq s_{E_i}+w(e_i)$. Then $w(e_0)+\ldots+w(e_i)<s_{E_{i+1}}\leq s_{E_i}+w(e_i)$, so $w(e_0)+\ldots+w(e_{i-1})<s_{E_i}$, completing the inductive proof. Taking $i=0$, this implies that $0<s_{\emptyset}$, which contradicts the constraint $s_{\emptyset}=0$ of MOLP, completing the proof that any feasible solution $v$ to MOLP has a value at most the weight of any $(\emptyset,\mathcal{A})$ path in $CEG_M$.

Next, let $v_{CEG}$ be the assignment of variables that sets each $s_X$ to the weight of the minimum-weight $(\emptyset,X)$ path in $CEG_M$, and let $v_X$ be the value of $s_X$ in $v_{CEG}$. We show that $v_{CEG}$ is a feasible solution to $MOLP_Q$. First, note that $v_{CEG}$ assigns $s_{\emptyset}$ the value 0, so the first constraint of MOLP holds. Second, consider any extension inequality $s_{Y\cup E}\leq s_{X\cup E}+\log(deg(X,Y,\mathcal{R}_i))$; $CEG_M$ contains an edge from $X\cup E$ to $Y\cup E$ with weight $\log(deg(X,Y,\mathcal{R}_i))$. By the definition of minimum-weight paths, $v_{Y\cup E}\leq v_{X\cup E}+\log(deg(X,Y,\mathcal{R}_i))$. Therefore, all of the extension inequalities hold in $v_{CEG}$. Finally, consider a projection inequality $s_X\leq s_Y$, where $X\subseteq Y$; $CEG_M$ contains an edge from vertex $Y$ to vertex $X$ with weight 0. By the definition of minimum-weight paths, $v_X\leq v_Y+0$, so all of these inequalities also hold. Therefore, $v_{CEG}$ is indeed a feasible solution to $MOLP_Q$. Since any feasible solution to MOLP has a value at most the weight of any $(\emptyset,\mathcal{A})$ path in $CEG_M$, we conclude that $v_{\mathcal{A}}$ in $v_{CEG}$, which is the weight of the minimum-weight $(\emptyset,\mathcal{A})$ path, is equal to $m_{\mathcal{A}}$. ∎

With this connection, readers can verify that the MOLP bound in our running example is 96 by inspecting the paths in Figure 7. In this CEG, the minimum-weight $(\emptyset,\mathcal{A})$ path has weight 96 (specifically, $\log_2(96)$ when logarithms are used), corresponding to the leftmost path in Figure 7. We make three observations.

Observation 1: Similar to the CEGs of optimistic estimators, each $(\emptyset,\mathcal{A})$ path in $CEG_M$ corresponds to a sequence of extensions from $\emptyset$ to $Q$ and is an estimate of the cardinality of $Q$. For example, the rightmost path in Figure 7 indicates that there are 7 $a_3a_6$’s (so $a_3\xrightarrow{E}a_6$ edges), each of which extends to at most 3 $a_3a_5a_6$’s (so $a_5\xleftarrow{D}a_3\xrightarrow{E}a_6$ patterns), and so forth. This path yields $7\times3\times2\times1\times3=126$ possible outputs. Since the edge weights are maximum degrees, each $(\emptyset,\mathcal{A})$ path is by construction an upper bound on the size of $Q$. So any path in $CEG_M$ is a pessimistic estimator. This provides an alternative simple proof of the following proposition:

Proposition 5.2 (Prop. 2 (Joglekar and Ré, 2018)).

Let $Q$ be a join query and $OUT$ be the output size of $Q$. Then $OUT\leq 2^{m_{\mathcal{A}}}$. (This is a slight variant of Prop. 2 from reference (Joglekar and Ré, 2018), which states that another bound, called the MO bound, which adds a preprocessing step to MOLP, is an upper bound on $OUT$.)

Proof.

Since for any $(\emptyset,\mathcal{A})$ path $P$ in $CEG_M$, $OUT\leq 2^{w(P)}$, and by Theorem 5.1, $m_{\mathcal{A}}$ is equal to the weight of the minimum-weight $(\emptyset,\mathcal{A})$ path in $CEG_M$, we have $OUT\leq 2^{m_{\mathcal{A}}}$. ∎

Observation 2: Theorem 5.1 implies that MOLP can be solved using a combinatorial algorithm, e.g., Dijkstra’s algorithm, instead of a numeric LP solver.
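For instance, the following Python sketch (our own; the adjacency-map encoding of $CEG_M$ is an assumption, not taken from the paper) computes the MOLP bound with Dijkstra's algorithm over log-weights:

```python
import heapq
import itertools
import math

def molp_bound(edges, attributes):
    # edges: maps a vertex (a frozenset of attributes) to a list of
    # (neighbor, degree) pairs; the edge weight is log2(degree).
    # Returns 2^(min-weight path from the empty set to all attributes),
    # i.e., the MOLP cardinality bound itself.
    source, target = frozenset(), frozenset(attributes)
    tie = itertools.count()  # tie-breaker so the heap never compares vertices
    dist = {source: 0.0}
    heap = [(0.0, next(tie), source)]
    done = set()
    while heap:
        d, _, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            return 2 ** d
        for v, degree in edges.get(u, []):
            nd = d + math.log2(degree)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, next(tie), v))
    return float("inf")  # no (empty set, A) path exists

# Toy usage: a single relation R(a1, a2) with |R| = 10 and deg(a1, R) = 3.
edges = {frozenset(): [(frozenset({"a1", "a2"}), 10), (frozenset({"a1"}), 10)],
         frozenset({"a1"}): [(frozenset({"a1", "a2"}), 3)]}
print(molp_bound(edges, {"a1", "a2"}))  # min(10, 10 * 3) = 10.0
```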

Observation 3: Theorem 5.1 implies that we can simplify MOLP by removing the projection inequalities, which correspond to the weight-0 edges in $CEG_M$. To see this, consider any $(\emptyset,\mathcal{A})$ path $P=(\emptyset)\xrightarrow{e_0}(E_1)\ldots(E_k)\xrightarrow{e_k}(\mathcal{A})$ and its first projection edge, say $e_i$. In Appendix A, we show that we can remove $e_i$ and construct an alternative path with at most the same weight as $P$ but with one fewer projection edge, showing that the MOLP linear program can be simplified to use only the extension inequalities.

5.1.1. Using Degree Statistics of Small-Size Joins

MOLP can directly integrate degree statistics from the results of small-size joins. For example, if a system knows the size of $Q_{RS}=R(a_1,a_2)\bowtie S(a_2,a_3)$, then MOLP can include the inequality $s_{a_1a_2a_3}\leq\log(|Q_{RS}|)$. Similarly, the extension inequalities can use the degree information from $Q_{RS}$ simply by taking the output of $Q_{RS}$ as an additional relation in the query with three attributes $a_1$, $a_2$, and $a_3$. When comparing the accuracy of the MOLP bound with optimistic estimators, we ensure that MOLP uses the degree information of the same small-size joins as the optimistic estimators, so that MOLP uses a strict superset of the statistics used by the optimistic estimators.

5.2. CBS and Bound Sketch Optimization

We review the CBS estimator very briefly and refer the reader to reference (Cai et al., 2019) for details. The CBS estimator has two subroutines, Bound Formula Generator (BFG) and Feasible Coverage Generator (FCG) (Algorithms 1 and 2 in reference (Cai et al., 2019)), that, given a query $Q$ and the degree statistics about $Q$, generate a set of bounding formulas. A coverage is a mapping $(R_j, A_j)$ of a subset of the relations in the query to attributes such that each $A_j\in\mathcal{A}$ appears in the mapping. A bounding formula is a product of known degree statistics that can be used as an upper bound on the size of a query. In Appendix B, we use our CEG framework to show that the MOLP bound is in fact at least as tight as the CBS estimator on general acyclic queries and exactly equal to the CBS estimator on acyclic queries over binary relations, which are the queries used in reference (Cai et al., 2019). Therefore BFG and FCG can be seen as a method for solving the MOLP linear program on acyclic queries over binary relations, albeit in a brute-force manner that enumerates all paths in $CEG_M$. We show this by proving that each path in $CEG_M$ corresponds to a bounding formula and vice versa. These observations allow us to connect two worst-case upper bounds from the literature using CEGs. Henceforth, we do not differentiate between MOLP and the CBS estimator. It is important to note that a similar connection between MOLP and CBS cannot be established for cyclic queries. This is because, although not explicitly acknowledged in reference (Cai et al., 2019), on cyclic queries the coverages that FCG generates may not be safe, i.e., the output of BFG may not be an upper bound. We provide a counterexample in Appendix C. In contrast, MOLP generates a pessimistic estimate for arbitrary queries, both acyclic and cyclic.

5.2.1. Bound Sketch

We next review an optimization described in reference (Cai et al., 2019) to improve the MOLP bound. Given a partitioning budget $K$, for each bottom-to-top path in $CEG_M$, the optimization partitions the input relations into multiple pieces and derives $K$ subqueries of $Q$. The estimate for $Q$ is then the sum of the estimates of all $K$ subqueries. Intuitively, partitioning decreases the maximum degrees in the subqueries and yields better estimates; specifically, their sum is guaranteed to be more accurate than a direct estimate for $Q$. We give an overview of the optimization here and refer the reader to reference (Cai et al., 2019) for details.

We divide the edges in $CEG_M$ into two groups. Recall that each edge $W_1\xrightarrow{e_j}W_2$ in $CEG_M$ is constructed from an inequality $s_{Y\cup E}\leq s_{X\cup E}+\log(deg(X,Y,R_i))$ in MOLP. We call $e_j$: (i) an unbound edge if $X=\emptyset$, i.e., the weight of $e_j$ is $|R_i|$; and (ii) a bound edge if $X\neq\emptyset$, i.e., the weight of $e_j$ is the maximum degree of some value in a column of $R_i$. Note that an unbound edge extends $W_1$ exactly with the attributes $\mathcal{A}_i$, i.e., $W_2\setminus W_1=\mathcal{A}_i$, and a bound edge with the attributes $Y$, i.e., $W_2\setminus W_1=Y$. Below, we refer to these attributes as “extension” attributes.

Step 1: For each path $p=(\emptyset,\mathcal{A})$ in $CEG_M$ (so a bounding formula in the terminology of reference (Cai et al., 2019)), let $S$ be the join attributes that are not extension attributes through a bound edge. For each attribute in $S$, allocate $K^{1/|S|}$ partitions. For example, consider the path $P_1=\emptyset\xrightarrow{|B|}a_2a_3\xrightarrow{deg(a_3,C)}a_{2-4}\xrightarrow{deg(a_2,A)}a_{1-4}\xrightarrow{deg(a_3,E)}a_{1-4}a_6\xrightarrow{deg(a_3,D)}a_{1-6}$ in the $CEG_M$ of $Q_{5f}$ from Figure 7, where $a_{i-j}$ refers to $a_ia_{i+1}\ldots a_j$. Then both $a_2$ and $a_3$ would be in $S$. For the path $P_2=\emptyset\xrightarrow{|A|}a_1a_2\xrightarrow{deg(a_2,B)}a_{1-3}\xrightarrow{deg(a_3,C)}a_{1-4}\xrightarrow{deg(a_3,D)}a_{1-5}\xrightarrow{deg(a_3,E)}a_{1-6}$, only $a_2$ would be in $S$.

Step 2: Partition each relation $R_i$ as follows. Let $PA_i$, for partition attributes, be $PA_i=S\cap\mathcal{A}_i$ and let $z=|PA_i|$. Then partition $R_i$ into $K^{z/|S|}$ pieces using $z$ hash functions, each hashing a tuple $t\in R_i$ based on one of the attributes in $PA_i$ into $\{0,\ldots,K^{1/|S|}-1\}$. For example, the relation $B$ for our example path $P_1$ would be partitioned into 4 pieces: $B_{00}$, $B_{01}$, $B_{10}$, and $B_{11}$.

Step 3: Then divide $Q$ into $K$ components $Q_{0\ldots 0}$ to $Q_{K^{1/|S|}-1,\ldots,K^{1/|S|}-1}$, such that $Q_{j_1,\ldots,j_z}$ contains only the partitions of each relation $R_i$ that match the $\{j_1,\ldots,j_z\}$ indices. For example, in our running example, $Q_{0\ldots 0}$ is $A_0\bowtie B_{0,0}\bowtie C_0\bowtie D_0\bowtie E_0$. This final partitioning is called the bound sketch of $Q$ for path $p$.
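The three steps can be sketched in Python as follows (our own illustration; the relation encoding and names are assumptions, and Python's built-in hash stands in for the hash functions of reference (Cai et al., 2019)):

```python
import itertools

def bound_sketch(relations, S, K):
    # relations: name -> (attrs, tuples), where attrs is a tuple of attribute
    # names and each tuple in `tuples` holds one value per attribute.
    # S: the join attributes chosen in Step 1. K: the partitioning budget.
    S = sorted(S)
    b = round(K ** (1 / len(S)))  # buckets per attribute in S (Step 1)
    sketches = {}
    # One subquery per vector of bucket indices, K = b^|S| in total (Step 3).
    for idx in itertools.product(range(b), repeat=len(S)):
        bucket = dict(zip(S, idx))
        sub = {}
        for name, (attrs, tuples) in relations.items():
            pa = [a for a in attrs if a in bucket]  # partition attributes PA_i
            # Keep the tuples whose hash on every partition attribute matches
            # this subquery's bucket indices (Step 2).
            sub[name] = [t for t in tuples
                         if all(hash(t[attrs.index(a)]) % b == bucket[a]
                                for a in pa)]
        sketches[idx] = sub
    # The final estimate for Q sums the estimates over all K sketches.
    return sketches
```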

5.2.2. Implementing Bound Sketch in Opt. Estimators

Note that a bound sketch can be used directly to refine any estimator using a CEG, as it is a general technique for partitioning $Q$ into subqueries based on each path $p$ in a CEG. The estimator can then sum the estimates for the subqueries to generate an estimate for $Q$. Specifically, we can use a bound sketch to refine optimistic estimators, and we evaluate its benefits in Section 6.3. Intuitively, one advantage of using a bound sketch is that tuples that hash to different buckets of the join attributes are guaranteed not to produce outputs together, and they never appear in the same subquery. This can make the uniformity assumptions of the optimistic estimators more accurate, because two tuples that hash to the same bucket of an attribute are more likely to join.

We implemented the bound sketch optimization for optimistic estimators as follows. Given a partitioning budget $K$ and a set of queries in a workload, we worked backwards from the queries to find the necessary subqueries and, for each subquery, the necessary statistics to store in the Markov table. For example, for $Q_{5f}$, one of the formulas that is needed is: $|a_1\xrightarrow{A_0}a_2\xrightarrow{B_{00}}a_3|\times\frac{|a_2\xrightarrow{B_{00}}a_3\xrightarrow{C_0}a_4|}{|a_2\xrightarrow{B_{00}}a_3|}\times\frac{|a_4\xleftarrow{C_0}a_3\xrightarrow{D_0}a_5|}{|a_3\xrightarrow{C_0}a_4|}\times\frac{|a_5\xleftarrow{D_0}a_3\xrightarrow{E_0}a_6|}{|a_3\xrightarrow{D_0}a_5|}$, so we ensure that our Markov table has these statistics.

6. Evaluation

We next present our experiments, which aim to answer five questions: (1) Which heuristic out of the space we identified in Section 4.2 leads to better accuracy for optimistic estimators, and why? We answer this question for acyclic queries on $CEG_O$ and for cyclic queries on $CEG_O$ and $CEG_{OCR}$. (2) For cyclic queries, which of these two CEGs leads to more accurate results under its best-performing heuristics? (3) How much does the bound sketch optimization improve the optimistic estimators’ accuracy? (4) How do optimistic and pessimistic estimators, which are both summary-based, compare against each other and against other baseline summary-based estimators from the literature? (5) How does the best-performing optimistic estimator compare against Wander Join (Li et al., 2016; Park et al., 2020), the state-of-the-art sampling-based estimator? Finally, as in prior work, we use the RDF-3X system (Neumann and Weikum, 2008) to verify that our estimators’ more accurate estimates lead to more performant query plans.

Throughout this section, except in our first set of experiments, where we set $h=3$, we use a Markov table of size $h=2$ for optimistic estimators. We generated workload-specific Markov tables, which required less than 0.6MB of memory for any workload-dataset combination for $h=2$ or $h=3$. For $CEG_{OCR}$, which requires computing the cycle closing rates, the size was slightly higher but at most 0.9MB. Our code, datasets, and queries are publicly available at https://github.com/cetechreport/CEExperiments and https://github.com/cetechreport/gcare.

6.1. Setup, Datasets and Workloads

For all of our experiments, we use a single machine with two Intel E5-2670 2.6GHz CPUs, each with 8 physical and 16 logical cores, and 512 GB of RAM. We represent our datasets as labeled graphs and queries as edge-labeled subgraph queries, but our datasets and queries can equivalently be represented as relational tables, one for each edge label, and SQL queries. We focused on edge-labeled queries for simplicity. Estimating queries with vertex labels can be done in a straightforward manner for both optimistic and pessimistic estimators, e.g., by extending Markov table entries to include vertex labels, as was done in reference (Mhedhbi and Salihoglu, 2019). We used a total of 6 real-world datasets, shown in Table 2, and 5 workloads on these datasets. Our dataset and workload combinations are as follows.

Dataset | Domain | |V| | |E| | |E. Labels|
IMDb | Movies | 27M | 65M | 127
YAGO | Knowledge Graph | 13M | 16M | 91
DBLP | Citations | 23M | 56M | 27
WatDiv | Products | 1M | 11M | 86
Hetionet | Social Networks | 45K | 2M | 24
Epinions | Consumer Reviews | 76K | 509K | 50
Table 2. Dataset descriptions.

IMDb and JOB: The IMDb relational database, together with a workload called JOB, has been used in prior cardinality estimation studies (Cai et al., 2019; Leis et al., 2018). We created property graph versions of this database and workload as follows. IMDb contains three groups of tables: (i) entity tables representing entities, such as actors (e.g., the name table), movies, and companies; (ii) relationship tables representing many-to-many relationships between the entities (e.g., the movie_companies table represents relationships between movies and companies); and (iii) type tables, which denormalize the entity or relationship tables to indicate the types of entities or relationships. We converted each row of an entity table to a vertex. We ignored vertex types because many queries in the JOB workload have no predicates on entity types. Let $u$ and $v$ be vertices representing, respectively, rows $r_u$ and $r_v$ from tables $T_u$ and $T_v$. We added two sets of edges between $u$ and $v$: (i) a foreign key edge from $u$ to $v$ if the primary key of row $r_u$ is a foreign key in row $r_v$; and (ii) a relationship edge from $u$ to $v$ if a row $r_{\ell}$ in a relationship table $T_{\ell}$ connects rows $r_u$ and $r_v$.

We then transformed the JOB workload (Leis et al., 2018) into equivalent subgraph queries on our transformed graph. We removed non-join predicates from the queries, since we are focusing on join cardinality estimation, and we encoded equality predicates on types directly as edge labels. This resulted in 7 join query templates: four 4-edge queries, two 5-edge queries, and one 6-edge query. All of these queries are acyclic. There were also 2- and 3-edge queries, which we ignored because their estimation is trivial with Markov tables of size 3. We generated 100 query instances from each template by choosing one edge label uniformly at random for each edge, while ensuring that the output of the query was non-empty. The final workload contained a total of 369 specific query instances.

Figure 8. Our full acyclic query templates. The directions of the edges are not shown in the figure.

YAGO and G-CARE-Acyclic and G-CARE-Cyclic Workloads: G-CARE (Park et al., 2020) is a recent cardinality estimation benchmark for subgraph queries. From this benchmark we took the YAGO knowledge graph dataset and the acyclic and cyclic query workloads for that dataset. The acyclic workload contains 382 queries generated from query templates with 3-, 6-, 9-, and 12-edge star and path queries, as well as randomly generated trees. We will refer to this workload as G-CARE-Acyclic. The cyclic query workload contains 240 queries generated from templates with 6- and 9-edge cycle, 6-edge clique, 6-edge flower, and 6- and 9-edge petal queries. We will refer to this workload as G-CARE-Cyclic. The only other large dataset available from G-CARE was LUBM, which contained only 6 queries, so we did not include it in our study.

DBLP, WatDiv, Hetionet, and Epinions Datasets and Acyclic and Cyclic Workloads: We used four other datasets: (i) Hetionet: a biological network; (ii) DBLP: a real knowledge graph; (iii) WatDiv: a synthetic knowledge graph; and (iv) Epinions: a real-world social network graph. Epinions by default does not have any edge labels; we added a random set of 50 edge labels to it. Our goal in using Epinions was to test whether our experimental observations also hold on a graph that is guaranteed to not have any correlations between edge labels. For these datasets we created one acyclic and one cyclic query workload, which we refer to as Acyclic and Cyclic. The Acyclic workload contains queries generated from 6-, 7-, or 8-edge templates. We ensured that for each query size $k$, we had patterns of every possible depth. Specifically, for any $k$, the minimum depth of any query is 2 (stars) and the maximum is $k$ (paths); for each depth $d$ in between, we picked a pattern. These patterns are shown in Figure 8. Then, we generated 20 non-empty instances of each template by putting one edge label uniformly at random on each edge, which yielded 360 queries in total. The Cyclic workload contains queries generated from the templates used in reference (Mhedhbi and Salihoglu, 2019): 4-edge cycle, 5-edge diamond with a crossing edge, 6-edge cycle, complete graph $K_4$, 6-edge query of two triangles with a common vertex, 8-edge query of a square with two triangles on adjacent sides, and a 7-edge query with a square and a triangle. We then randomly generated instances of these queries by randomly matching each edge of the query template one at a time in the datasets, with a 1 hour time limit per dataset. We generated 70 queries for DBLP, 212 queries for Hetionet, 129 queries for WatDiv, and 394 queries for Epinions.

6.2. Space of Optimistic Estimators

We begin by comparing the 9 optimistic estimators we described on the two optimistic CEGs we defined. In order to set up an experiment in which we could test all of the 9 possible optimistic estimators, we used a Markov table that contained up to 3-size joins (i.e., $h=3$). A Markov table with only 2-size joins cannot differentiate estimators based on different path-length heuristics, nor handle any cyclic query. We aim to answer four specific questions: (1) Which of the 9 possible optimistic estimators we outlined in Section 4.2 leads to the most accurate estimates on $CEG_O$ for acyclic queries and for cyclic queries whose cycles contain $\leq 3$ edges? (2) Which of the 9 estimators leads to the most accurate estimates for cyclic queries with cycles of size $>3$ on $CEG_O$ and $CEG_{OCR}$? (3) Which of these CEGs leads to the most accurate estimations under its best performing estimator? (4) Given the accuracies of the best performing estimators, how much room for improvement is there for designing more accurate techniques, e.g., heuristics, for making estimates on $CEG_O$ and $CEG_{OCR}$?

To compare the accuracies of different estimators, for each query $Q$ in our workloads we make an estimate using each estimator and compute its q-error. If the true cardinality of $Q$ is $c$ and the estimate is $e$, then the q-error is $\max\{\frac{c}{e},\frac{e}{c}\}\geq 1$. For each workload, this gives us a distribution of q-errors, which we compare as follows. First, we take the logs of the q-errors, so they are now $\geq 0$. If a q-error was an underestimate, we attach a negative sign to it. This allows us to order the estimates from the least accurate underestimation to the least accurate overestimation. We then generate a box plot where the box represents the 25th percentile, median, and 75th percentile cut-off marks. We also compute the mean of this distribution, excluding the top 10% of the distribution (ignoring whether estimates are under- or overestimations), and draw it with a red dashed line in the box plots.
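The following is a minimal sketch (ours) of these two summary statistics, the signed log q-error and the trimmed mean:

```python
import math

def signed_log_qerror(true_card, estimate):
    """q-error is max(c/e, e/c) >= 1; we take its log (now >= 0) and attach
    a negative sign to underestimates so that a workload's estimates order
    from the worst underestimation to the worst overestimation."""
    q = max(true_card / estimate, estimate / true_card)
    return (-1.0 if estimate < true_card else 1.0) * math.log10(q)

def trimmed_mean_log_qerror(signed_log_qerrors, trim=0.10):
    """Mean of the log q-errors after dropping the top `trim` fraction by
    magnitude, ignoring the under/over-estimation sign."""
    kept = sorted(signed_log_qerrors, key=abs)
    kept = kept[: int(len(kept) * (1 - trim))]
    return sum(abs(x) for x in kept) / len(kept)
```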

Figure 9. Evaluation of the optimistic estimators on $CEG_O$ on acyclic queries (JOB or Acyclic workloads). Estimators are labeled “X-Y”, where X describes the path-length choice (one of max-hop, min-hop, or all-hops) and Y describes the estimate aggregator (one of max-aggr, min-aggr, or avg-aggr).

6.2.1. Acyclic Queries and Cyclic Queries With Only Triangles

For our first question, we first compare our 9 estimators on $CEG_O$ for each acyclic query workload in our setup. We then compare our 9 estimators on each cyclic query workload, but only using the queries whose only cycles are triangles. All except one clique-6 query in the G-CARE-Cyclic workload contained cycles with more than 3 edges, so the YAGO-G-CARE-Cyclic combination is omitted.

The accuracies of the 9 estimators on acyclic workloads are shown in Figure 9 (ignore the $P^*$ column for now). We make several observations. First, regardless of the path-length heuristic chosen, the max aggregator (the last 3 box plots in the figures) makes significantly more accurate estimates (note that the y-axes of the plots are in log scale) than avg, which in turn is more accurate than min. This is true across all acyclic experiments and all datasets. For example, on IMDb and the JOB workload, the all-hops-min, all-hops-avg, and all-hops-max estimators have logs of mean q-errors (after removing the top 10 percent of outliers) of 6.5 (underestimation), 1.7 (underestimation), and 1.02 (underestimation), respectively. Overall, we observe that using the most pessimistic of the optimistic estimates leads to significantly more accurate estimations in our evaluations than the heuristics used in prior work (up to three orders of magnitude improvement in mean accuracy). Therefore, on acyclic queries, when there are multiple formulas that can be used for estimating a query’s cardinality, using the pessimistic ones is an effective technique to combat the well known underestimation problem.

We next analyze the path-length heuristics. Observe that across all experiments, if we ignore the outliers and focus on the 25-75 percentile boxes, max-hop and all-hops do at least as well as min-hop. Further observe that on IMDb, Hetionet, and the Acyclic workload on Epinions, max-hop and all-hops lead to significantly more accurate estimates. Finally, the performance of max-hop and all-hops is comparable across our experiments. We verified that this is because all-hops effectively picks one of the max-hop paths in the majority of the queries in our workloads. Since max-hop enumerates strictly fewer paths than all-hops to make an estimate, we conclude that on acyclic queries, systems implementing the optimistic estimators can prefer the max-hop-max estimator.
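To make the “X-Y” estimator space concrete, here is a minimal sketch (our simplification) of how a path-length choice and an aggregator combine over a CEG's candidate bottom-to-top paths, where each path is represented by its list of edge weights (average degrees) and its estimate is their product:

```python
from math import prod

def optimistic_estimate(paths, hop_choice="max-hop", aggregator="max-aggr"):
    """`paths` lists the bottom-to-top CEG paths for the query, each given
    as its list of edge weights; a path's estimate is their product.
    hop_choice restricts which paths are considered by number of hops;
    aggregator combines the surviving paths' estimates into one number."""
    if hop_choice == "max-hop":
        hops = max(len(p) for p in paths)
        candidates = [p for p in paths if len(p) == hops]
    elif hop_choice == "min-hop":
        hops = min(len(p) for p in paths)
        candidates = [p for p in paths if len(p) == hops]
    else:  # "all-hops": every path is considered
        candidates = paths

    estimates = [prod(p) for p in candidates]
    if aggregator == "max-aggr":
        return max(estimates)
    if aggregator == "min-aggr":
        return min(estimates)
    return sum(estimates) / len(estimates)  # "avg-aggr"
```

For instance, max-hop-max keeps only the paths with the most hops and returns the largest, i.e., most pessimistic, of their estimates.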

Figure 10. Evaluation of the space of optimistic estimators on $CEG_O$ on the Cyclic workload, on queries with only triangles.

Figure 10 shows the accuracies of the 9 estimators on the cyclic query workloads with only triangles. Our observations are similar to those for acyclic queries: we find that the max aggregator yields more accurate estimates than the other aggregators, irrespective of the path-length choice. This is again because on most datasets all of the 9 estimators tend to underestimate, and the max aggregator combats this problem better than min or avg. When using the max aggregator, we also observe that the max-hop heuristic performs at least as well as the min-hop heuristic. Therefore, as we observed for acyclic queries, we find the max-hop-max estimator to be an effective way to make accurate estimations for cyclic queries with only triangles.

For the above experiments, we also performed a query template-specific analysis and verified that our conclusions generally hold for each acyclic and cyclic query template we used in the workloads of Figures 9 and 10. Our charts evaluating the 9 estimators on each query template can be found in our GitHub repo.

Figure 11. Evaluation of the space of optimistic estimators on $CEG_O$ and $CEG_{OCR}$ on cyclic queries (Cyclic or G-CARE-Cyclic workloads) with cycles of 4 or more edges.

6.2.2. Cyclic Queries With Cycles of Size >3

For our second question, we compare our 9 estimators on $CEG_O$ and $CEG_{OCR}$ for each dataset-cyclic query workload combination in our benchmark, but only using queries that contain cycles of size $>3$. Recall that we now expect estimates on $CEG_O$ to be generally pessimistic. This is because any formula (or bottom-to-top path) in $CEG_O$ breaks large cycles and instead estimates paths, and on real-world datasets there are often many more paths than cycles. In contrast, we expect the edge weights in $CEG_{OCR}$ to fix this pessimism, so estimates on this CEG can still be optimistic. Figure 11 shows our results. As we expected, we now see that across all of the datasets, our 9 estimators on $CEG_O$ generally overestimate. In contrast to our observations for acyclic queries and cyclic queries with only triangles, we now see that the most accurate results are achieved when using the min aggregator (instead of max). For the min aggregator, any of the path-length heuristics seems to perform reasonably well. On DBLP the min-hop-min heuristic leads to more accurate results, while on Hetionet the max-hop and all-hops heuristics do; the results are comparable on the other datasets.

In contrast, on $CEG_{OCR}$, similar to our results from Figures 9 and 10, we still observe that the max aggregator yields more accurate results, although the avg aggregator also yields competitive results on Hetionet and YAGO. This shows that simulating the cycle closing by using cycle closing rates avoids the pessimism of $CEG_O$ and results in optimistic estimates. Therefore, as before, countering this optimism using the pessimistic paths in $CEG_{OCR}$ is an effective technique to achieve accurate estimates. In addition, we again see that any of the path-length heuristics performs reasonably well.

Finally, to answer our third question, we compare $CEG_O$ and $CEG_{OCR}$ under their best performing heuristics: min-hop-min for $CEG_O$ and max-hop-max for $CEG_{OCR}$. We see that, except on DBLP and YAGO, where the estimates are competitive, the estimates on $CEG_{OCR}$ are more accurate. For example, on Hetionet, while the median q-error for min-hop-min on $CEG_O$ is 213.8 (overestimation), it is only 2.0 (overestimation) for max-hop-max on $CEG_{OCR}$. Therefore, we observe that even using the most optimistic estimator on $CEG_O$ may not be very accurate on cyclic queries with large cycles, and modifying this CEG with cycle closing rates can fix this shortcoming.

As we did for the experiments presented in Section 6.2.1, we verified that our conclusions do not change across the different query templates used in Figure 11. The figures corresponding to each query template in Figure 11 can be found in our GitHub repo.

6.2.3. $P^*$ Estimator and Room for Improvement

We next answer the question of how much room for improvement there is for the space of optimistic estimators we considered on $CEG_O$ and $CEG_{OCR}$. To do so, we consider a thought experiment in which, for each query in our workloads, an oracle picks the most accurate path in our CEGs. The accuracies of this oracle-based estimator are shown as the $P^*$ columns in Figures 9-11. We compare the $P^*$ columns in these figures with the max-hop-max estimator on $CEG_O$ for acyclic queries and cyclic queries with only triangles, and with the max-hop-max estimator on $CEG_{OCR}$ for queries with larger cycles. On acyclic queries, shown in Figure 9, we generally see little room for improvement, though there is some room on Hetionet and YAGO. For example, although the median q-errors of max-hop-max and $P^*$ are indistinguishable on Hetionet, their 75th percentile cutoffs are 1.52 and 1.07, respectively. We see more room for improvement on the cyclic query workloads that contain large cycles, shown in Figure 11. Although we still find that on DBLP, WatDiv, and Epinions the max-hop-max estimator on $CEG_{OCR}$ is competitive with $P^*$, there is more visible room for improvement on Hetionet and YAGO. For example, on Hetionet, the median q-errors of max-hop-max and $P^*$ are 1.48 (overestimation) and 1.02 (underestimation), respectively. On YAGO the median q-errors are 39.8 (overestimation) and 1.01 (overestimation), respectively. This indicates that future work on designing other techniques for making estimations with CEG-based estimators can focus on workloads with large cycles on these datasets to find opportunities for improvement.

6.3. Effects of Bound Sketch

Our next set of experiments aims to answer: How much does the bound-sketch optimization improve the optimistic estimators’ accuracy? This question was answered for the CBS pessimistic estimator in reference (Cai et al., 2019); we reproduce the experiment for MOLP in our context as well. To answer this question, we tested the effects of bound sketch on the JOB workload on IMDb and the Acyclic workload on Hetionet and Epinions. We excluded WatDiv and DBLP, as the max-hop-max estimator’s estimates are already very close to perfect on these datasets and there is no room for significant improvement (recall Figure 9). We then applied the bound sketch optimization to both the max-hop-max (on $CEG_O$) and MOLP estimators and measured the q-errors of the estimators under partitioning budgets of 1 (no partitioning), 4, 16, 64, and 128.
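At a high level, bound sketch hash-partitions the input on a join attribute, estimates each partition independently, and sums the per-partition estimates; the sketch below (ours, simplified from our understanding of reference (Cai et al., 2019)) shows this outer loop.

```python
def bound_sketch_estimate(tuples_by_relation, estimator, partition_attr,
                          budget=16):
    """Simplified sketch of the bound-sketch outer loop. `tuples_by_relation`
    maps a relation name to a list of tuples (dicts from attribute to value),
    and `estimator` maps such a dict of relations to a cardinality estimate.
    Details such as how the partitioning attribute is chosen are part of the
    actual optimization and omitted here."""
    partitions = [{rel: [] for rel in tuples_by_relation}
                  for _ in range(budget)]
    for rel, tuples in tuples_by_relation.items():
        for t in tuples:
            if partition_attr in t:
                partitions[hash(t[partition_attr]) % budget][rel].append(t)
            else:
                # relations without the partitioning attribute are
                # replicated to every partition
                for part in partitions:
                    part[rel].append(t)
    # the final estimate is the sum of the per-partition estimates
    return sum(estimator(part) for part in partitions)
```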

Our results are shown in Figure 12. As demonstrated in reference (Cai et al., 2019), our results confirm that bound sketch improves the accuracy of MOLP: the mean accuracy of MOLP improves by between 15% and 89% across our datasets when moving from 1 to 128 partitions. We also observe significant gains for the max-hop-max estimator, though the results are data dependent. On Hetionet and Epinions, partitioning improves the mean accuracy by 25% and 89%, respectively, and the estimates of 68% of the queries on Hetionet and 93% on Epinions strictly improved, so bound sketch is highly robust for optimistic estimators. In contrast, we did not observe significant improvements for max-hop-max on IMDb, although 93% of the q-errors there strictly improved, albeit by small amounts. We will compare optimistic and pessimistic estimators in more detail in the next sub-section, but readers can already see, by inspecting the scales of the y-axes in Figure 12, that, as observed in reference (Park et al., 2020), the pessimistic estimators are highly inaccurate and in our context orders of magnitude less accurate than the optimistic ones.

Figure 12. Effects of bound sketch on the max-hop-max (left column) and MOLP (right column) estimators.

6.4. Summary-based Estimator Comparison

The optimistic and pessimistic estimators we consider in this paper are summary-based estimators. Our next set of experiments compares max-hop-max (on $CEG_O$) and MOLP against each other and against other baseline summary-based estimators. Recent work (Park et al., 2020) compared MOLP against two other summary-based estimators, Characteristic Sets (CS) (Neumann and Moerkotte, 2011) and SumRDF (Stefanoni et al., 2018). We reproduce and extend this comparison in our setting with our suggested max-hop-max optimistic estimator. We first give an overview of CS and SumRDF.

Figure 13. Summary-based estimation technique comparison.

CS: CS is a cardinality estimator that was used in the RDF-3X system (Neumann and Weikum, 2008). CS is primarily designed to estimate the cardinalities of stars in an RDF graph. The statistics it uses are based on the so-called characteristic set of each vertex $v$ in an RDF graph, which is the set of distinct outgoing labels $v$ has. CS keeps statistics about the vertices with the same characteristic set, such as the number of vertices that belong to a characteristic set. Then, using these statistics, the estimator makes estimates for the number of distinct matches of stars. For a non-star query $Q$, $Q$ is decomposed into multiple stars $s_1,\ldots,s_k$, and the estimates for the $s_i$ are multiplied, which corresponds to an independence assumption.
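A minimal sketch of our reading of this scheme is below; the statistics layout is our own simplification of reference (Neumann and Moerkotte, 2011):

```python
def cs_star_estimate(char_sets, star_labels):
    """Estimate the number of matches of a star query with the given
    outgoing labels. `char_sets` maps each characteristic set (a frozenset
    of labels) to (number of vertices with that set, {label: average number
    of occurrences of the label per such vertex}). A star can be matched
    at any vertex whose characteristic set contains the star's labels."""
    needed = frozenset(star_labels)
    total = 0.0
    for cset, (num_vertices, avg_occurrences) in char_sets.items():
        if needed <= cset:
            matches = num_vertices
            for label in star_labels:
                matches *= avg_occurrences[label]
            total += matches
    return total

def cs_estimate(stars, char_sets):
    """For a non-star query decomposed into stars s_1, ..., s_k, multiply
    the per-star estimates (an independence assumption)."""
    result = 1.0
    for star in stars:
        result *= cs_star_estimate(char_sets, star)
    return result
```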

SumRDF: SumRDF builds a summary graph $S$ of an RDF graph and adopts a holistic approach to make an estimate. Given the summary $S$, SumRDF considers all possible RDF graphs $G$ that could have the same summary $S$. Then, it returns the average cardinality of $Q$ across all possible instances. This is effectively another form of uniformity assumption: each possible world has the same probability of representing the actual graph on which the estimate is being made. Note that the pessimistic estimators can also be seen as doing something similar, except that they consider the most pessimistic of the possible worlds and return the cardinality of $Q$ on that instance.

We measured the q-errors of max-hop-max, MOLP, CS, and SumRDF on the JOB workload on IMDb, the Acyclic workload on Hetionet, WatDiv, and Epinions, and the G-CARE-Acyclic workload on YAGO. We did not use bound sketch for MOLP or max-hop-max. However, we ensured that MOLP uses the cardinality and degree information from 2-size joins, which guarantees that the statistics MOLP uses are a strict superset of the statistics the max-hop-max estimator uses.

Our results are shown in Figure 13. We omit CS from all figures except the first one because it was not competitive with the rest of the estimators, and even with the y-axis in logarithmic scale, plotting CS decreases the visibility of the differences among the rest of the estimators. SumRDF timed out on several queries on YAGO and Hetionet, and we removed those queries from max-hop-max’s and MOLP’s distributions as well. We make two observations. First, these results confirm the results from reference (Park et al., 2020): although MOLP does not underestimate, its estimates are very loose. Second, across all summary-based estimators, unequivocally, max-hop-max generates significantly more accurate estimations, often by several orders of magnitude in mean q-error. For example, on the IMDb and JOB workload, the mean q-errors of max-hop-max, SumRDF, MOLP, and CS are 1.8, 193.3, 44.6, and 333751, respectively. We also note that both CS and SumRDF underestimate on virtually all queries, whereas there are datasets, such as WatDiv and YAGO, where the majority of max-hop-max’s estimates are overestimations.

6.5. Comparison Against WanderJoin

Although we studied summary-based estimators in this paper, an alternative class of techniques that has been studied is based on sampling. Sampling-based techniques are fundamentally different and use unbiased samplers of the query’s output. As such, their primary advantage is that with enough sampling they are guaranteed to achieve any required accuracy. However, because they effectively perform the join on a sample of tuples, they can be slow, which is why they have seen little adoption in practice for join estimation. This time-vs-accuracy tradeoff is fundamentally different for summary-based estimators, which can give more accurate estimates only by storing more statistics (e.g., larger join sizes in Markov tables), so with more memory rather than more time. For completeness of our work, we compare the max-hop-max estimator with WanderJoin (WJ) (Li et al., 2016; Park et al., 2020), a sampling-based estimator that was identified in reference (Park et al., 2020) as the most efficient sampling-based technique out of the set of techniques the authors experimented with. Our goal is to answer: What is the sampling ratio at which WJ’s estimates outperform max-hop-max (on $CEG_O$) in accuracy, and how do the estimation speeds of WJ and max-hop-max compare at this ratio? We first give an overview of WJ as implemented in reference (Park et al., 2020).

WanderJoin: WJ is similar to the index-based sampling described in reference (Leis et al., 2017). Given a query $Q$, WJ picks one of the query edges $e_q$ of $Q$ to start the join from and picks a sampling ratio $r\in(0,1]$, which is the fraction of edges that can match $e_q$ that it will sample. For each sampled edge $e_q^*$ (sampled with replacement), WJ computes the join one query edge at a time by picking, uniformly at random, one of the possible edges of a vertex that has already been matched. Then, if the join is successfully computed, a correction factor depending on the degrees of the vertices that were extended is applied to get an estimate for the number of output results that $e_q^*$ would extend to. Finally, the sum of the estimates for the samples is multiplied by $1/r$ to get the final estimate.
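A minimal sketch of this random walk, as we understand it from references (Li et al., 2016; Park et al., 2020), is below; the graph-access helpers matches_of and extensions_of are hypothetical:

```python
import random

def wanderjoin_estimate(query_edges, matches_of, extensions_of, ratio):
    """Sketch of WJ. `matches_of(qe)` returns the data edges that can match
    query edge `qe`; `extensions_of(walk, qe)` returns the data edges that
    extend the partial walk to also match `qe`. Each completed walk
    contributes the product of its choice-set sizes (the inverse of its
    sampling probability given the starting edge); failed walks contribute 0."""
    start, rest = query_edges[0], query_edges[1:]
    candidates = matches_of(start)
    if not candidates:
        return 0.0
    num_samples = max(1, int(len(candidates) * ratio))
    total = 0.0
    for _ in range(num_samples):
        walk = [random.choice(candidates)]  # sampled with replacement
        weight = 1.0
        for qe in rest:
            exts = extensions_of(walk, qe)
            if not exts:  # the walk failed to complete the join
                weight = 0.0
                break
            weight *= len(exts)  # correction factor for the uniform pick
            walk.append(random.choice(exts))
        total += weight
    # average over the samples, scaled up to all starting edges
    # (equivalently, the sum of per-sample estimates multiplied by 1/ratio)
    return (total / num_samples) * len(candidates)
```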

We used the G-CARE codebase (Park et al., 2020): we integrated the max-hop-max estimator into G-CARE and used the WJ code that was provided. We compared WJ and max-hop-max with sampling ratios 0.01%, 0.1%, 0.25%, 0.5%, and 0.75% on the JOB workload on IMDb, the Acyclic workload on Hetionet, WatDiv, and Epinions, and the G-CARE-Acyclic workload on YAGO. We ran both estimators five times (each run executes inside a single thread) and report the averages of their estimation times. We also report the average q-error of WJ. However, we can no longer present under- and over-estimations in our figures, as WJ might under- and over-estimate the same query across different runs.

The box-plot q-error distributions of max-hop-max and WJ are shown in Figure 14. We identify the sampling ratios at which the mean accuracy of WJ becomes better than the mean accuracy of max-hop-max, except on DBLP and Hetionet, where both max-hop-max’s and WJ’s mean estimates are generally close to perfect, so there we instead look at the sampling ratio at which WJ’s maximum q-error becomes smaller than max-hop-max’s. We find that this sampling ratio is 0.1% on IMDb, 0.5% on DBLP, 0.75% on Hetionet, 0.5% on Epinions, and 0.75% on YAGO. However, at these ratios the estimation time of WJ is between 15x and 212x (one to two orders of magnitude) slower than max-hop-max, except on our smallest dataset, Epinions, where the difference is 1.95x. Observe that max-hop-max’s estimation times are very stable and consistently in sub-milliseconds, between 0.18ms and 0.54ms. This is because max-hop-max’s estimation time is independent of the dataset’s size. In contrast, WJ’s estimation times get slower as the datasets get larger, because WJ performs more joins. For example, at the 0.25% ratio, while WJ takes 0.28ms on our smallest dataset Epinions, it takes 35.4ms on DBLP.

We emphasize that these comparisons are not perfect because it is difficult to compare distributions and these are two fundamentally different classes of estimators, providing systems with different tradeoffs. However, we believe that our ‘competitive sampling ratio’ analysis (more than runtime numbers) is instructive for interested readers.

Figure 14. Comparison of max-hop-max and WJ, with average estimation times (in milliseconds) indicated at the top of the boxes.

6.6. Impact on Plan Quality

Reference (Leis et al., 2015) convincingly established that accurate cardinality estimation is critical for the optimizers of RDBMSs to generate good plans. Several other works have verified this in different contexts, in RDBMSs (Cai et al., 2019) and in RDF systems (Park et al., 2020). In our final set of experiments we set out to verify this in our context as well by comparing the impact of our estimators on plan quality. We used the RDF-3X system (Neumann and Weikum, 2008), whose source code is available at reference (rdf3x, 2020). We issued our Acyclic workload as join-only queries to RDF-3X on the DBLP and WatDiv datasets. We then ran each query under 10 configurations: first using RDF-3X’s default estimator, and then by injecting the cardinality estimates of our 9 optimistic estimators into the system. The cardinalities are injected inside the system’s dynamic programming-based join optimizer. We then filtered out the queries in which all of the 10 estimators led to picking exactly the same plan, as well as those where there was less than 10% runtime difference between the minimum and maximum runtimes across the plans generated from the 10 estimators. We were left with 15 queries for DBLP and 8 queries for WatDiv. We ran each query 5 times and report the best query execution time. The open source version of RDF-3X uses a simple cardinality estimator that is not based on characteristic sets, as in reference (Neumann and Weikum, 2008), but on basic statistics about the original triple counts and some ‘magic’ constants. We observed that this estimator is highly inaccurate compared to the 9 optimistic estimators. For example, we analyzed the final estimates of the RDF-3X estimator on the 8 WatDiv queries and compared them with those of the other estimators. We omit the full results, but while the RDF-3X estimator had a median q-error of 127.4 (underestimation), the worst-performing of the 9 estimators had a median q-error of only 1.947 (underestimation). So we expect RDF-3X’s estimator to lead to worse plans than the other estimators. We further expect that, in general, the more accurate of the optimistic estimators, such as max-hop-max, yield more efficient plans than the less accurate ones, such as min-hop-min.

Figure 15 shows the runtimes of the system under each configuration, where the y-axis shows the log-scale speedup or slowdown of each plan under each estimator compared to the plans under the RDF-3X estimator. This is why the figure does not contain a separate box plot for the RDF-3X estimator. First observe that the median lines of the 9 estimators are above 0, indicating that each of these estimators, which all have more accurate estimates than RDF-3X’s default estimator, leads to better plans. In addition, observe that the box plots of the estimators with the max aggregator are generally better than those of the estimators that use the min or avg aggregator. This correlates with Figure 9, where we showed these estimators lead to more accurate estimations on the same set of queries. We then performed a detailed analysis of the max-hop-max and min-hop-min estimators as representatives of, respectively, the most and least accurate of the 9 estimators. We analyzed the queries in which the plans under these estimators differed significantly. Specifically, we found 10 queries across both datasets where the runtime differences were at least 1.15x. Of these, in only 1 did min-hop-min lead to the more efficient plan, and by a factor of 1.21x. In the other 9, max-hop-max led to more efficient plans, by a median of 2.05x and up to 276.3x, confirming our expectation that better estimates generally lead to better plans.

Figure 15. RDF-3X runtimes on the Acyclic workload.

7. Related Work

There are decades of research on cardinality estimation of queries in the context of different database management systems. We cover the part of this literature focusing on graph-based database management systems, specifically XML and RDF systems, and on relational systems. We covered Characteristic Sets (Neumann and Moerkotte, 2011), SumRDF (Stefanoni et al., 2018), and WanderJoin (Li et al., 2016; Park et al., 2020) in Section 6. We also cover another technique, based on maximum entropy, that can be used with any estimator that can return estimates for base tables or small-size joins. We do not cover work that uses machine learning techniques to estimate cardinalities and instead refer the reader to several recent works in this space (Kipf et al., 2019; Liu et al., 2015; Woltmann et al., 2019) for the details of these techniques.

Other Summary-based Estimators: The estimators we studied in this paper fall under the category of summary-based estimators. Many relational systems, including commercial ones such as PostgreSQL, use summary-based estimators. Example summaries include the cardinalities of relations, the number of distinct values in columns, histograms (Aboulnaga and Chaudhuri, 1999; Muralikrishna and DeWitt, 1988; Poosala and Ioannidis, 1997), wavelets (Matias et al., 1998), or probabilistic and statistical models (Getoor et al., 2001; Sun et al., 1993) that capture the distribution of values in columns. These statistics are used to estimate the selectivities of each join predicate, which are put together using several approaches, such as independence assumptions. In contrast, the estimators we studied store degree statistics about base relations and small-size joins (note that cardinalities are a form of degree statistics, e.g., $|R_i|=deg(\emptyset,\mathcal{R}_i)$).

Several works on subgraph queries have proposed summary-based estimators that compute a sketch of the input graph. SumRDF, which we covered in Section 6, falls under this category. In the context of estimating the selectivities of path expressions, XSeed (Zhang et al., 2006) and XSketch (Polyzotis and Garofalakis, 2002) build a sketch $S$ of the input XML document. The sketch effectively collapses multiple nodes and edges of the graph into supernodes and superedges, with metadata on the nodes and edges. The metadata contains statistics, such as the number of nodes that were collapsed into a supernode. Then, given a query $Q$, $Q$ is matched on $S$ and, using the metadata, an estimate is made. Because these techniques do not decompose a query into smaller sub-queries, the question of which decomposition to use does not arise for these estimators.

Several works have used data structures that are adaptations of histograms from relational systems to store selectivities of path or tree queries in XML documents. Examples include positional histograms (Wu et al., 2002) and Bloom histograms (Wang et al., 2004). These techniques do not consecutively make estimates for larger paths and have not been adapted to general subgraph queries. For example, instead of storing small-size paths in a data structure, as in Markov tables, Bloom histograms store all paths, hashed in a Bloom filter. Other works used similar summaries of XML documents (or of databases in XML's precursor, the object exchange model (Papakonstantinou et al., 1995)) for purposes other than cardinality estimation. For example, TreeSketch (Polyzotis et al., 2004) produces a summary of large XML documents to provide approximate answers to queries.

Sampling-based Estimators: Another important class of estimators is based on sampling tuples (Haas et al., 1996; Leis et al., 2017; Li et al., 2016; Vengerov et al., 2015; Wu et al., 2016). These estimators sample input records from base tables, either offline or during query optimization, and evaluate queries on these samples to make estimates. Research has focused on different ways the samples can be generated, such as independent or correlated sampling, or sampling through existing indexes. Wander Join, which we covered in Section 6, falls under this category. As we discussed, by increasing the sizes of the samples these estimators can be arbitrarily accurate, but they are in general slower than summary-based ones because they actually perform the join on a sample of tuples. We are aware of systems (Leis et al., 2018) that use sampling-based estimators to estimate the selectivities of predicates on base tables, but not on multiway joins. Finally, sampling-based estimators have also been used to estimate the frequencies of subgraphs relative to each other to discover motifs, i.e., frequently appearing subgraphs (Kashtan et al., 2004).

The Maximum Entropy Estimator: Markl et al. (Markl et al., 2007) proposed another approach to make estimates for conjunctive predicates, say $p_1\land\ldots\land p_k$, given a set of $\ell$ selectivity estimates $s_{i_{11},\ldots,i_{1k}},\ldots,s_{i_{\ell 1},\ldots,i_{\ell k}}$, where $s_{i_{j1},\ldots,i_{jk}}$ is the selectivity estimate for the predicate $p_{i_{j1}}\land\ldots\land p_{i_{jk}}$. Markl et al.’s maximum entropy approach takes these known selectivities and uses a constrained optimization problem to compute the distribution that maximizes the entropy of the joint distribution over the $2^k$-element predicate space. Reference (Markl et al., 2007) only evaluated the accuracy of this approach for estimating conjunctive predicates on base tables and not on joins, but it briefly described how the same approach can be used to estimate the cardinalities of join queries. Multiway join queries can be modeled as estimating the selectivity of the full join predicate that contains the equality constraints of all attributes with the same name. The statistics that we considered in this paper can be translated to selectivities of such predicates. For example, the size of $|(a_1)\xrightarrow{A}(a_2)\xrightarrow{B}(a_3)|$ can be modeled as the selectivity $s_i=\frac{|(a_1)\xrightarrow{A}(a_2)\xrightarrow{B}(a_3)|}{|A||B|}$, as the join of $A$ and $B$ is by definition the result of applying the $A.src=B.dst$ predicate to the Cartesian product of relations $A$ and $B$. This way, one can construct another optimistic estimator using the same statistics. We have not investigated the accuracy of this approach within the scope of this work and leave it to future work.
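As a toy rendering of this idea (our own sketch, for $k=2$ predicates), the program below maximizes the entropy of the joint distribution over the $2^k$ atoms subject to known single-predicate selectivities; with no joint statistic, maximum entropy recovers the independence assumption.

```python
import numpy as np
from scipy.optimize import minimize

def max_entropy_selectivity(known_selectivities):
    """Toy sketch of the maximum entropy approach for k=2 predicates.

    The joint distribution has 2^2 = 4 atoms, one per truth assignment
    (b1, b2) to predicates (p1, p2). `known_selectivities` maps a tuple of
    predicate indices to the selectivity of their conjunction, e.g.,
    {(0,): 0.1, (1,): 0.2} states s_{p1} = 0.1 and s_{p2} = 0.2.
    We maximize entropy subject to these constraints and sum(x) = 1, then
    read off the selectivity of p1 AND p2."""
    atoms = [(b1, b2) for b1 in (False, True) for b2 in (False, True)]

    def neg_entropy(x):
        x = np.clip(x, 1e-12, 1.0)
        return float(np.sum(x * np.log(x)))  # minimize negative entropy

    constraints = [{"type": "eq", "fun": lambda x: float(np.sum(x) - 1.0)}]
    for preds, sel in known_selectivities.items():
        idx = [i for i, atom in enumerate(atoms)
               if all(atom[p] for p in preds)]
        constraints.append({"type": "eq",
                            "fun": lambda x, idx=idx, sel=sel:
                                float(np.sum(x[idx]) - sel)})

    x0 = np.full(len(atoms), 1.0 / len(atoms))
    result = minimize(neg_entropy, x0, bounds=[(0.0, 1.0)] * len(atoms),
                      constraints=constraints)
    return result.x[atoms.index((True, True))]

# With s_{p1} = 0.1 and s_{p2} = 0.2 and no joint statistic, maximum
# entropy yields the independence assumption: s_{p1 AND p2} = 0.02.
print(max_entropy_selectivity({(0,): 0.1, (1,): 0.2}))
```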

8. Conclusions and Future Work

We focused on how to make accurate estimations using the optimistic estimators through a new framework in which we model these estimators as paths in a weighted CEG we called $CEG_O$, which uses average degree statistics as edge weights. We addressed two shortcomings of optimistic estimators from prior work. First, we addressed the question of which formulas, i.e., CEG paths, to use when there are multiple formulas to estimate a given query. We outlined and empirically evaluated a space of heuristics and showed that the heuristic that generally leads to the most accurate estimates depends on the structure of the query. For example, we showed that for acyclic queries and cyclic queries with small-size cycles, using the maximum weight path is an effective way to make accurate estimates. Second, we addressed how to make accurate estimates for queries with large cycles by proposing a new CEG that we call $CEG_{OCR}$. We then showed that, surprisingly, the recent pessimistic estimators can also be modeled as picking a path, this time the minimum weight path, in another CEG in which the edge weights are maximum degree statistics. Aside from connecting two disparate classes of estimators, this observation allowed us to apply the bound sketch optimization for pessimistic estimators to optimistic ones. We also showed that CEGs are useful mathematical tools to prove several properties of pessimistic estimators, e.g., that the CBS estimator is equivalent to the MOLP estimator on acyclic queries over binary relations.

We believe the CEG framework can be the foundation for further research to define and evaluate other novel estimators. Within the scope of this work, we considered only three specific CEGs, with a fourth one used for theoretical purposes in Appendix D. However, many other CEGs can be defined to develop new estimators using different statistics and potentially different techniques to pick paths in CEGs as estimates. For example, one can use the variances, standard deviations, or entropies of the degree distributions of small-size joins as edge weights in a CEG, possibly along with degree statistics, and pick the minimum-weight, e.g., “lowest entropy”, paths, assuming that degrees are more regular in lower-entropy edges. An important research direction is to systematically study a class of CEG instances that use different combinations of statistics as edge weights, as well as heuristics on these CEGs for picking paths, to understand which statistics lead to more accurate results in practice.

9. Acknowledgment

This work was supported by NSERC and by a grant from Waterloo-Huawei Joint Innovation Laboratory.

References

  • Abo Khamis et al. (2016) Mahmoud Abo Khamis, Hung Q. Ngo, and Dan Suciu. 2016. Computing Join Queries with Functional Dependencies. In PODS.
  • Aboulnaga et al. (2001) Ashraf Aboulnaga, Alaa R. Alameldeen, and Jeffrey F. Naughton. 2001. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. In VLDB.
  • Aboulnaga and Chaudhuri (1999) Ashraf Aboulnaga and Surajit Chaudhuri. 1999. Self-Tuning Histograms: Building Histograms Without Looking at Data. In SIGMOD.
  • Atserias et al. (2013) A. Atserias, M. Grohe, and D. Marx. 2013. Size Bounds and Query Plans for Relational Joins. SICOMP 42, 4 (2013).
  • Cai et al. (2019) Walter Cai, Magdalena Balazinska, and Dan Suciu. 2019. Pessimistic Cardinality Estimation: Tighter Upper Bounds for Intermediate Join Cardinalities. In SIGMOD.
  • Getoor et al. (2001) Lise Getoor, Benjamin Taskar, and Daphne Koller. 2001. Selectivity Estimation Using Probabilistic Models. In SIGMOD.
  • Gottlob et al. (2012) Georg Gottlob, Stephanie Tien Lee, Gregory Valiant, and Paul Valiant. 2012. Size and Treewidth Bounds for Conjunctive Queries. JACM 59, 3 (2012).
  • Haas et al. (1996) Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Arun N. Swami. 1996. Selectivity and Cost Estimation for Joins Based on Random Sampling. JCSS 52, 3 (1996).
  • Joglekar and Ré (2018) Manas Joglekar and Christopher Ré. 2018. It’s All a Matter of Degree - Using Degree Information to Optimize Multiway Joins. TOCS 62, 4 (2018).
  • Kashtan et al. (2004) Nadav Kashtan, Shalev Itzkovitz, Ron Milo, and Uri Alon. 2004. Efficient Sampling Algorithm for Estimating Subgraph Concentrations and Detecting Network Motifs. Bioinformatics 20, 11 (2004).
  • Kipf et al. (2019) Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter A. Boncz, and Alfons Kemper. 2019. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. In CIDR.
  • Leis et al. (2015) Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How Good Are Query Optimizers, Really?. In VLDB.
  • Leis et al. (2017) Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, and Thomas Neumann. 2017. Cardinality Estimation Done Right: Index-Based Join Sampling. In CIDR.
  • Leis et al. (2018) Viktor Leis, Bernhard Radke, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2018. Query Optimization Through the Looking Glass, and What We Found Running the Join Order Benchmark. VLDBJ 27, 5 (2018).
  • Li et al. (2016) Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2016. Wander Join: Online Aggregation via Random Walks. In SIGMOD.
  • Liu et al. (2015) Henry Liu, Mingbin Xu, Ziting Yu, Vincent Corvinelli, and Calisto Zuzarte. 2015. Cardinality Estimation Using Neural Networks. In CASCON.
  • Maduko et al. (2008) Angela Maduko, Kemafor Anyanwu, Amit Sheth, and Paul Schliekelman. 2008. Graph Summaries for Subgraph Frequency Estimation. In ESWC.
  • Markl et al. (2007) V. Markl, P. J. Haas, M. Kutsch, N. Megiddo, U. Srivastava, and T. M. Tam. 2007. Consistent Selectivity Estimation via Maximum Entropy. VLDBJ 16 (2007).
  • Matias et al. (1998) Yossi Matias, Jeffrey Scott Vitter, and Min Wang. 1998. Wavelet-Based Histograms for Selectivity Estimation. In SIGMOD.
  • Mhedhbi and Salihoglu (2019) Amine Mhedhbi and Semih Salihoglu. 2019. Optimizing Subgraph Queries by Combining Binary and Worst-Case Optimal Joins. PVLDB 12, 11 (2019).
  • Muralikrishna and DeWitt (1988) M. Muralikrishna and David J. DeWitt. 1988. Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries. In SIGMOD.
  • Neumann and Moerkotte (2011) Thomas Neumann and Guido Moerkotte. 2011. Characteristic Sets: Accurate Cardinality Estimation for RDF Queries with Multiple Joins. In ICDE.
  • Neumann and Weikum (2008) Thomas Neumann and Gerhard Weikum. 2008. RDF-3X: A RISC-Style Engine for RDF. In VLDB.
  • Papakonstantinou et al. (1995) Yannis Papakonstantinou, Hector Garcia-Molina, and Jennifer Widom. 1995. Object Exchange Across Heterogeneous Information Sources. In ICDE.
  • Park et al. (2020) Yeonsu Park, Seongyun Ko, Sourav S. Bhowmick, Kyoungmin Kim, Kijae Hong, and Wook-Shin Han. 2020. G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching. In SIGMOD.
  • Polyzotis and Garofalakis (2002) Neoklis Polyzotis and Minos Garofalakis. 2002. Statistical Synopses for Graph-Structured XML Databases. In SIGMOD.
  • Polyzotis et al. (2004) Neoklis Polyzotis, Minos Garofalakis, and Yannis Ioannidis. 2004. Approximate XML Query Answers. In SIGMOD.
  • Poosala and Ioannidis (1997) Viswanath Poosala and Yannis E. Ioannidis. 1997. Selectivity Estimation Without the Attribute Value Independence Assumption. In VLDB.
  • rdf3x (2020) rdf3x 2020. RDF-3X Source Code. https://github.com/gh-rdf3x/gh-rdf3x/. (2020).
  • Stefanoni et al. (2018) Giorgio Stefanoni, Boris Motik, and Egor V. Kostylev. 2018. Estimating the Cardinality of Conjunctive Queries over RDF Data Using Graph Summarisation. In WWW.
  • Sun et al. (1993) Wei Sun, Yibei Ling, Naphtali Rishe, and Yi Deng. 1993. An Instant and Accurate Size Estimation Method for Joins and Selections in a Retrieval-Intensive Environment. In SIGMOD.
  • Vengerov et al. (2015) David Vengerov, Andre Cavalheiro Menck, Mohamed Zait, and Sunil P. Chakkappen. 2015. Join Size Estimation Subject to Filter Conditions. In VLDB.
  • Wang et al. (2004) Wei Wang, Haifeng Jiang, Hongjun Lu, and Jeffrey Xu Yu. 2004. Bloom Histogram: Path Selectivity Estimation for XML Data with Updates. In VLDB.
  • Woltmann et al. (2019) Lucas Woltmann, Claudio Hartmann, Maik Thiele, Dirk Habich, and Wolfgang Lehner. 2019. Cardinality Estimation with Local Deep Learning Models. In aiDM.
  • Wu et al. (2016) Wentao Wu, Jeffrey F. Naughton, and Harneet Singh. 2016. Sampling-Based Query Re-Optimization. In SIGMOD.
  • Wu et al. (2002) Yuqing Wu, Jignesh M. Patel, and H. V. Jagadish. 2002. Estimating Answer Sizes for XML Queries. In EDBT.
  • Yu and Ozsoyoglu (1979) C. T. Yu and M. Z. Ozsoyoglu. 1979. An Algorithm for Tree-query Membership of a Distributed Query. In COMPSAC.
  • Zhang et al. (2006) Ning Zhang, M. Tamer Ozsu, Ashraf Aboulnaga, and Ihab F. Ilyas. 2006. XSEED: Accurate and Fast Cardinality Estimation for XPath Queries. In ICDE.

Appendix A Projection Inequalities Can Be Removed From MOLP

Consider any $(\emptyset,\mathcal{A})$ path $P=(\emptyset)\xrightarrow{e_0}(E_1)\ldots(E_k)\xrightarrow{e_k}(\mathcal{A})$ and consider its first projection edge, say $e_i$. We show that we can remove $e_i$ and replace the rest of the edges $e_{i+1}$ to $e_k$ with (possibly) new edges $e^{\prime}_{i+1}$ to $e^{\prime}_k$ with exactly the same weights, constructing $P^{\prime}=(\emptyset)\xrightarrow{e_0}(E_1)\ldots\xrightarrow{e_{i-1}}(E_i)\xrightarrow{e^{\prime}_{i+1}}(E^{\prime}_{i+2})\ldots(E^{\prime}_k)\xrightarrow{e^{\prime}_k}(\mathcal{A})$, where $E^{\prime}_{i+2}\supseteq E_{i+2}$, ..., $E^{\prime}_k\supseteq E_k$. This can be seen inductively as follows. We know $E_i\supseteq E_{i+1}$ because $e_i$ is a projection edge. If the $(E_{i+1})\xrightarrow{e_{i+1}}(E_{i+2})$ edge was an extension edge that extended $E_{i+1}$ by the attributes $N_{i+1}=E_{i+2}\setminus E_{i+1}$, then by construction there is another edge $(E_i)\xrightarrow{e^{\prime}_{i+1}}(E^{\prime}_{i+2}=E_i\cup N_{i+1})$ in $CEG_M$ with the same weight as $e_{i+1}$. If instead $e_{i+1}$ was a projection edge that removed a set of attributes from $E_{i+1}$, similarly there is another projection edge $e^{\prime}_{i+1}$ that removes the same set of attributes from $E_i$. So, inductively, we can find an alternative sub-path from $E_{i+1}$ to $\mathcal{A}$, $(E_{i+1})\xrightarrow{e^{\prime}_{i+1}}\ldots\xrightarrow{e^{\prime}_k}(\mathcal{A})$, with the same length as the sub-path $(E_{i+1})\xrightarrow{e_{i+1}}\ldots\xrightarrow{e_k}(\mathcal{A})$.

Appendix B CBS Estimator’s Connection to MOLP On Acyclic Queries

We first show that on acyclic queries MOLP is at least as tight as the CBS estimator. Our proof is based on showing that for each bounding formula generated by BFG and FCG (respectively, Algorithms 1 and 2 in reference (Cai et al., 2019)), there is a path in $CEG_M$. For a detailed overview of these algorithms, we refer the reader to reference (Cai et al., 2019). We then show that if the queries are further over binary relations, then the standard MOLP bound, which only contains degree statistics about subsets of attributes in each relation, is exactly equal to the CBS estimator.

For each bounding formula generated by BFG and FCG, there is a path in $CEG_M$: Let $C$ be a coverage combination enumerated by FCG and consider its bounding formula $F_C$. We can represent $F_C$ as a set of $(R_i, X_i)$ pairs, where $X_i\subseteq\mathcal{A}_i$ is the set of attributes that $R_i$ covers and is of size either 0, $|\mathcal{A}_i|-1$, or $|\mathcal{A}_i|$. Let $Y_i=\mathcal{A}_i\setminus X_i$. Then the bounding formula generated for $F_C$ will have cost exactly $\sum_i\log deg(Y_i,R_i)$ (recall that $deg(Y_i,R_i)=deg(Y_i,\mathcal{A}_i,R_i)$). This is because there are 3 cases: (i) if $|X_i|=0$, then BFG ignores $R_i$ and $\log deg(Y_i,R_i)=0$; (ii) if $|X_i|=|\mathcal{A}_i|-1$, then BFG uses in its formula the degree of the single uncovered attribute $a$ in $\mathcal{A}_i$, which is equal to $deg(Y_i,R_i)$, as $Y_i$ only contains $a$; and (iii) if $|X_i|=|\mathcal{A}_i|$, then BFG uses $|R_i|$ in its formula, and since $Y_i=\emptyset$, $deg(Y_i,R_i)=|R_i|$.

We next show that $CEG_M$ contains an $(\emptyset,\mathcal{A})$ path with exactly the same weight as the cost of $F_C$. We first argue that if $Q$ is acyclic, then there must always be at least one relation $R_i$ in the coverage $C$ that covers all of its attributes. Assume for the purpose of contradiction that each relation $R_i(\mathcal{A}_i)\in Q$ either covers 0 attributes or $|\mathcal{A}_i|-1$ attributes. Then start with any $R_{i1}(\mathcal{A}_{i1})$ that covers $|\mathcal{A}_{i1}|-1$ attributes. Let $a_{i1}\in\mathcal{A}_{i1}$ be the attribute not covered by $R_{i1}$. Then another relation $R_{i2}(\mathcal{A}_{i2})$ must be covering $a_{i1}$ while not covering exactly one attribute $a_{i2}\in\mathcal{A}_{i2}$. Similarly, a third relation $R_{i3}$ must be covering $a_{i2}$ but not covering another attribute $a_{i3}\in\mathcal{A}_{i3}$, and so on. Since the query is finite, there must be a relation $R_j$ that covers an $a_{j-1}$ and whose other attributes are covered by some relation $R_k$, where $k<j$, which forms a cycle, contradicting our assumption that $Q$ is acyclic.

We can finally construct our path in $CEG_M$. Let us divide the relations into $\mathcal{R}_C$, those that cover all of their attributes, i.e., $\mathcal{R}_C=\{R_i(\mathcal{A}_i)\ |\ R_i$ covers $|\mathcal{A}_i|$ attributes$\}$, and $\mathcal{R}_E$, those that cover all but one of their attributes, i.e., $\mathcal{R}_E=\{R_i(\mathcal{A}_i)\ |\ R_i$ covers $|\mathcal{A}_i|-1$ attributes$\}$. We ignore the relations that do not cover any attributes. Let the relations in $\mathcal{R}_C$ be $R_{C1}(\mathcal{A}_{C1}),\ldots,R_{Ck}(\mathcal{A}_{Ck})$ and those in $\mathcal{R}_E$ be $R_{E1}(\mathcal{A}_{E1}),\ldots,R_{Ek^{\prime}}(\mathcal{A}_{Ek^{\prime}})$. Then we can construct the following path. The first part of the path uses the cardinalities of the relations in $\mathcal{R}_C$ in an arbitrary order to extend $(\emptyset)$ to $U=(\cup_{i=1,\ldots,k}\mathcal{A}_{Ci})$. For example, this part can be: $P_1=(\emptyset)\xrightarrow{\log|R_{C1}|}(\mathcal{A}_{C1})\xrightarrow{\log|R_{C2}|}(\mathcal{A}_{C1}\cup\mathcal{A}_{C2})\ldots\xrightarrow{\log|R_{Ck}|}(U)$. Now to extend $U$ to $\mathcal{A}$, observe that for each uncovered attribute in $T=\mathcal{A}\setminus U$, there must be some relation $R_{Ej}(\mathcal{A}_{Ej})\in\mathcal{R}_E$ such that all of its $|\mathcal{A}_{Ej}|-1$ covered attributes are already bound in $U$. This is because $|T|=k^{\prime}$ and if each $R_{Ej}$ had at least 2 attributes in $T$, then $Q$ would be cyclic. Note that this is true because, by the definition of acyclicity (Yu and Ozsoyoglu, 1979), any “ear” that we remove can be iteratively covered by at most one relation, which means that by the time we remove the last ear, we must be left with a relation $R_{Ej}$ and one attribute, contradicting that $R_{Ej}$ had at least 2 uncovered attributes in $T$. So we can iteratively extend the path $P_1$ one attribute at a time with edges of weight $\log(deg(Y_{Ej},R_{Ej}))$ until we reach $\mathcal{A}$. Note that this path’s length is exactly the same as the cost of the bounding formula generated by BFG and FCG for the coverage $C$.

When the relations of an acyclic query are binary, the CBS estimator is equal to MOLP: Ideally we would want to prove that, when relations are binary, any path in $CEG_M$ corresponds to a bounding formula. However, this is not true. Consider the simple join $R(A,B)\bowtie S(B,C)$. $CEG_M$ contains the following path: $P=(\emptyset)\xrightarrow{\log deg(\{B\},\{B\},R)}(\{B\})\xrightarrow{\log deg(C,\{B,C\},S)}(\{B,C\})\xrightarrow{\log deg(A,\{A,B\},R)}(\{A,B,C\})$. There is no bounding formula corresponding to this path in the CBS estimator, because the first edge, with weight $\log deg(\{B\},\{B\},R)$, uses the cardinality of a projection of $R$. The CBS estimator, however, does not use cardinalities of projections in its bounding formulas; it only uses the cardinalities of relations. Instead, what can be shown is that if a path $P$ in $CEG_M$ uses the cardinalities of projections, then there is another path $P^{\prime}$ with at most the same length for which the CBS estimator has a bounding formula.

First we show that, given an acyclic query over binary relations, if a path $P$ in $CEG_M$ contains cardinalities of projections, then there is an alternative path $P^{\prime}$ that has at most the length of $P$ and that contains at least one more edge that uses the cardinality of a full relation. We assume w.l.o.g. that $Q$ is connected. Note that in $P$ any edge from $(\emptyset)$ in $CEG_M$ must use the cardinality of a relation or of a projection of a relation (the only outgoing edges from $(\emptyset)$ are these edges). Let us divide the edges in $P$ into multiple groups: (i) $Card$ are those edges that extend a subquery with two attributes and use the cardinality of a relation; (ii) $Ext\text{-}Card$ are those edges that bound an attribute in a relation $R_i$ in $Card$ and extend a subquery to one new attribute $a$ using the degree of $a$ in $R_i$; (iii) $Proj$ are those edges that extend a subquery by a single attribute $a$ without bounding any attributes in the subquery, i.e., using the cardinality of the projection of a relation $R_i$ onto $a$ (so the weights of these edges look like $\log(deg(\{a\},\{a\},R_i))$); and (iv) $Rem$ are the remaining edges that extend a subquery by one attribute, bounding either another attribute in $Proj$ or some other attribute in $Rem$.

We first note that we can assume w.l.o.g. that if any relation $R_i(a_1,a_2)$ is used in an edge $e_p\in Proj$ to extend to, say, $a_2$, then $a_1$ cannot be an attribute covered by the edges in $Card$ or $Ext\text{-}Card$. Otherwise we can always replace $e_p$, which has weight $\log(|\Pi_{a_2}R_i|)$, with an edge we can classify as $Ext\text{-}Card$ with weight $\log(deg(a_2,\{a_1,a_2\},R_i))$, because $|\Pi_{a_2}R_i|\geq deg(a_2,\{a_1,a_2\},R_i)$. Next, we argue that we can iteratively remove two edges from $Proj$ and possibly $Rem$ and instead add one edge to $Card$ without increasing the length of $P$. First observe that if $Rem$ is empty, we must have a relation $R_i(a_1,a_2)$ both of whose attributes are extended by edges in $Proj$, in which case we can remove these two edges and replace them with a single $Card$ edge with weight $\log(deg(\emptyset,\{a_1,a_2\},R_i))$ and reduce $P$’s length, because $|R_i|\leq|\Pi_{a_1}R_i|\times|\Pi_{a_2}R_i|$. Note that if $Rem$ is not empty, then at least one of its edges $e_r$ must be bounding an attribute $a_1$ and extending to $a_2$ using a relation $R_p$, where $a_1$ is extended by an edge $e_p$ in $Proj$ using the same relation $R_p$. Otherwise there would be some edge $e$ in $Rem$ that extended a subquery $W_1$ to $W_1\cup\{a_j\}$ without bounding the attribute that appears in the weight of $e$. This is because, if we remove the relations that were used in the edges in $Card$ and $Ext\text{-}Card$, we are left with an acyclic query, so we have $t$ relations and $t+1$ attributes that need to be covered by $Proj$ and $Rem$. If no edge $e_r$ bounds an attribute extended in $Proj$, then one of the $t$ relations must appear twice in the edges in $Rem$, which cannot happen because the relations are binary (i.e., this would imply that the attributes of a relation $R_x(a_{x1},a_{x2})$ were covered by two edges with weights $\log(deg(\{a_{x1}\},\{a_{x1},a_{x2}\},R_x))$ and $\log(deg(\{a_{x2}\},\{a_{x1},a_{x2}\},R_x))$, which cannot happen). Therefore such an $e_r$ and $e_p$ must exist. Then we can again remove them and replace them with one $Card$ edge with weight $\log(deg(\emptyset,\{a_1,a_2\},R_p))$ and decrease the weight of $P$. Therefore from any $P$ we can construct a $P^{\prime}$ whose length is at most that of $P$ and that consists only of $Card$ and $Ext\text{-}Card$ edges. Readers can verify that each such path $P^{\prime}$ directly corresponds to a bounding formula generated by BFG and FCG (each relation $R_i$ used by an edge in $Card$ and $Ext\text{-}Card$ corresponds, respectively, to a relation covering exactly $|\mathcal{A}_i|$ and $|\mathcal{A}_i|-1$ attributes).

Appendix C Counterexample for Using the CBS Estimator on Cyclic Queries

Consider the triangle query $R(a,b) \bowtie S(b,c) \bowtie T(c,a)$. FCG would generate the cover $a \rightarrow R$, $b \rightarrow S$, and $c \rightarrow T$. For this cover, BFG would generate the bounding formula $h(a,b,c) \leq h(a|b) + h(b|c) + h(c|a)$, which may not be a pessimistic bound. As an example, suppose each of the relations $R$, $S$, and $T$ contains the $n$ tuples $(1,1),\ldots,(n,n)$. Then this formula would generate a bound of $0$ on $h(a,b,c)$, i.e., an output size bound of $2^0 = 1$, whereas the actual number of triangles in this input is $n$.
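For concreteness, the following Python sketch (the helper max_degree and the choice of $n$ are ours) instantiates the degree-based version of this formula, replacing each conditional entropy $h(x|y)$ by the log of the corresponding maximum degree, and compares the resulting bound against the true triangle count:

from collections import defaultdict

def max_degree(rel, bound_idx, ext_idx):
    # Max number of distinct ext-attribute values per bound-attribute value.
    groups = defaultdict(set)
    for t in rel:
        groups[t[bound_idx]].add(t[ext_idx])
    return max(len(vals) for vals in groups.values())

n = 50
R = S = T = [(i, i) for i in range(n)]  # the instance (1,1), ..., (n,n)

# h(a|b) + h(b|c) + h(c|a) instantiated with max degrees: every degree is 1,
# so the claimed output size bound is 2^(0+0+0) = 1.
bound = max_degree(R, 1, 0) * max_degree(S, 1, 0) * max_degree(T, 1, 0)

T_set = set(T)
triangles = sum(1 for (a, b) in R for (b2, c) in S
                if b2 == b and (c, a) in T_set)

print(bound, triangles)  # prints: 1 50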

Appendix D DBPLP

We demonstrate another application of CEGs by providing alternative combinatorial proofs of some properties of DBPLP, another worst-case output size bound from reference (Joglekar and Ré, 2018). DBPLP is not as tight as MOLP, which our proofs demonstrate through a path analysis of CEGs. We begin by reviewing the notion of a cover of a query.

Definition 1.

A cover $C$ is a set of $(R_j, A_j)$ pairs, where $R_j \in \mathcal{R}$ and $A_j \subseteq \mathcal{A}_j$, such that the union of the $A_j$ in $C$ "covers" all of $\mathcal{A}$, i.e., $\cup_{(R_j,A_j)\in C} A_j = \mathcal{A}$.

The DBPLP of a query is defined for a given cover $C$ as follows:

$$\begin{aligned}
\textbf{Minimize } & \Sigma_{a\in\mathcal{A}}\, v_a \\
\text{subject to } & \Sigma_{a\in A_j\setminus A_j^{\prime}}\, v_a \geq \log(deg(A_j^{\prime},\Pi_{A_j}R_j)),\ \forall (R_j,A_j)\in C,\ A_j^{\prime}\subseteq A_j
\end{aligned}$$

Note that unlike MOLP, DBPLP is a minimization problem and has one variable for each attribute $a \in \mathcal{A}$ (instead of one for each subset of $\mathcal{A}$). Similar to the MOLP constraints, we can also provide an intuitive interpretation of the DBPLP constraints. For any $(R_j, A_j)$ and $A_j^{\prime} \subseteq A_j$, let $B = A_j \setminus A_j^{\prime}$. Each constraint of the DBPLP indicates that the number of $B$'s that any tuple containing $A_j^{\prime}$ can extend to is at most $deg(A_j^{\prime},\Pi_{A_j}R_j)$, which is the maximum degree of any $A_j^{\prime}$ value in $\Pi_{A_j}R_j$. Each constraint is therefore effectively an extension inequality using a maximum degree constraint. Based on this interpretation, we next define a DBPLP CEG, $CEG_D$, to provide insights into DBPLP.
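As an illustration, the following sketch solves the DBPLP for the triangle query $R(a,b) \bowtie S(b,c) \bowtie T(c,a)$ under the cover $C = \{(R,\{a,b\}), (S,\{b,c\}), (T,\{c,a\})\}$ with an off-the-shelf LP solver; the statistics (relation size $m$ and maximum degree $d$) are hypothetical:

import math
from scipy.optimize import linprog

m, d = 1000, 10           # hypothetical: each |R_j| = m, each max degree = d
attrs = ['a', 'b', 'c']

# One DBPLP constraint per (R_j, A_j) in C and proper subset A_j' of A_j:
#   sum_{x in A_j \ A_j'} v_x >= log(deg(A_j', Pi_{A_j} R_j)).
constraints = []  # pairs (attributes summed on the left, right-hand side)
for (x, y) in [('a', 'b'), ('b', 'c'), ('c', 'a')]:
    constraints.append(({x, y}, math.log2(m)))  # A_j' = empty: deg = |R_j|
    constraints.append(({y}, math.log2(d)))     # A_j' = {x}
    constraints.append(({x}, math.log2(d)))     # A_j' = {y}

# linprog minimizes c @ v subject to A_ub @ v <= b_ub, so negate the >= rows.
A_ub = [[-1.0 if a in lhs else 0.0 for a in attrs] for lhs, _ in constraints]
b_ub = [-rhs for _, rhs in constraints]

res = linprog(c=[1.0] * 3, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print(2 ** res.fun)  # DBPLP output size bound; here m^(3/2), about 31623

Under these uniform statistics the optimum is $\frac{3}{2}\log_2 m$, i.e., the familiar $m^{3/2}$ bound on the output size of the triangle query.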

DBPLP CEG ($CEG_D$):

  •

    Vertices: For each $X \subseteq \mathcal{A}$ we have a vertex.

  •

    Extension Edges: For each $(R_j, A_j) \in C$ and $A_j^{\prime} \subseteq A_j$, add an edge with weight $\log(deg(A_j^{\prime},\Pi_{A_j}R_j))$ between any $W_1$ and $W_2$ such that $A_j^{\prime} \subseteq W_1$, $(A_j \setminus A_j^{\prime}) \cap W_1 = \emptyset$, and $W_2 = W_1 \cup (A_j \setminus A_j^{\prime})$.

We note that DBPLP is not the solution to some path in this CEG, but defining it allows us to provide some insights into DBPLP. Observe that DBPLP and MOLP use the same degree information, and the condition for an extension edge is the same. Therefore $CEG_D$ contains the same set of vertices as $CEG_M$ but a subset of the edges of $CEG_M$. For example, $CEG_D$ does not contain any of the projection edges of $CEG_M$. Similarly, $CEG_D$ does not contain any edges whose degree constraints cannot be expressed in the cover $C$, because in the $(R_j, A_j)$ pairs in $C$, $A_j$ may not contain every attribute in $\mathcal{A}_j$. Consider our running example and the cover $C = \{(R_A,\{a_1,a_2\}), (R_C,\{a_3,a_4\})\}$. The DBPLP would contain 6 constraints, 3 for $(R_A,\{a_1,a_2\})$ and 3 for $(R_C,\{a_3,a_4\})$. For example, one constraint would be $v_{a_1} + v_{a_2} \geq \log(deg(\emptyset,\Pi_{\{a_1,a_2\}}R_A)) = \log(|R_A|) = \log(4)$. The following theorem uses $CEG_D$ to provide insight into why DBPLP is looser than MOLP.
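For concreteness, the following sketch materializes $CEG_D$ for a given cover; the cover representation and the log_deg lookup are hypothetical placeholders for whatever degree statistics are available:

import itertools
import math

def build_ceg_d(attrs, cover, log_deg):
    # cover: list of (relation_name, A_j) pairs; log_deg(rel, Ajp, Aj) returns
    # log(deg(A_j', Pi_{A_j} R_j)). Vertices are all subsets of attrs.
    vertices = [frozenset(s) for k in range(len(attrs) + 1)
                for s in itertools.combinations(sorted(attrs), k)]
    edges = []  # (W1, W2, weight) triples
    for rel, Aj in cover:
        Aj = frozenset(Aj)
        for k in range(len(Aj)):  # proper subsets A_j' of A_j
            for Ajp in map(frozenset, itertools.combinations(sorted(Aj), k)):
                for W1 in vertices:
                    # Extension-edge condition: A_j' is bound inside W1 and the
                    # remaining attributes of A_j are new to W1.
                    if Ajp <= W1 and (Aj - Ajp).isdisjoint(W1):
                        edges.append((W1, W1 | (Aj - Ajp),
                                      log_deg(rel, Ajp, Aj)))
    return vertices, edges

# Hypothetical triangle cover with |R_j| = m and all max degrees = d.
m, d = 1000, 10
cover = [('R', {'a', 'b'}), ('S', {'b', 'c'}), ('T', {'c', 'a'})]
log_deg = lambda rel, Ajp, Aj: math.log2(m) if not Ajp else math.log2(d)
V, E = build_ceg_d({'a', 'b', 'c'}, cover, log_deg)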

Theorem D.1.

Let $P$ be any $(\emptyset,\mathcal{A})$ path in the $CEG_D$ of a query $Q$ and a cover $C$ of $Q$. Let $d_{\mathcal{A}}$ be the solution to the DBPLP of $Q$ for $C$. Then $d_{\mathcal{A}} \geq |P|$.

Proof.

We first need to show that there is always an $(\emptyset,\mathcal{A})$ path in $CEG_D$. We can see the existence of such a path as follows. Start with an arbitrary $(R_j, A_j)$ pair in $C$, which has an inequality for $A_j^{\prime} = \emptyset$ that leads to an $\emptyset \rightarrow A_j$ edge. Let $X = A_j$. Then take an arbitrary $(R_i, A_i)$ such that $Z = A_i \setminus X \neq \emptyset$, which must exist (while $X \neq \mathcal{A}$) because $C$ is a cover. We can extend $X$ to $Y = X \cup Z$, because by construction we added an $X \rightarrow Y$ edge for the constraint where $A_i^{\prime} = A_i \setminus Z$ in DBPLP (so the constraint is $\Sigma_{a\in Z}\, v_a \geq \log(deg(A_i \setminus Z,\Pi_{A_i}R_i))$). Repeating this step until $X = \mathcal{A}$ yields an $(\emptyset,\mathcal{A})$ path.

Now consider any $(\emptyset,\mathcal{A})$ path $P = \emptyset \xrightarrow{e_0} X_0 \xrightarrow{e_1} \cdots \xrightarrow{e_k} \mathcal{A}$. Observe that by the construction of $CEG_D$, each edge $e_i$ extends an $X$ to $Y = X \cup Z$ and the weight of $e_i$ comes from a constraint $\Sigma_{a\in Z}\, v_a \geq \log(deg(A_j \setminus Z,\Pi_{A_j}R_j))$ for some $(R_j, A_j)$. Therefore the sets of variables summed over the edges are pairwise disjoint and together contain every variable. So we can conclude that $\sum_{a\in\mathcal{A}} v_a \geq |P|$. In other words, each $(\emptyset,\mathcal{A})$ path identifies a subset of the constraints $c_1,\ldots,c_k$, no two of which share a variable $v_a$, so summing these constraints yields the constraint $\sum_{a\in\mathcal{A}} v_a \geq |P|$. Therefore any feasible solution $v^*$ to DBPLP, and in particular the optimal solution $d_{\mathcal{A}}$, has a value of at least $|P|$. ∎
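Continuing the hypothetical triangle example, the theorem can be checked numerically by enumerating every $(\emptyset,\mathcal{A})$ path in the $CEG_D$ built by the earlier sketch and comparing its length against the LP optimum computed before (E and res are reused from those sketches):

from collections import defaultdict

def all_path_lengths(edges, source, target):
    # Exhaustive DFS over (source, target) paths; CEG_D is a DAG because every
    # edge extends its source by at least one new attribute.
    out = defaultdict(list)
    for w1, w2, wt in edges:
        out[w1].append((w2, wt))
    lengths, stack = [], [(source, 0.0)]
    while stack:
        v, dist = stack.pop()
        if v == target:
            lengths.append(dist)
            continue
        for nxt, wt in out[v]:
            stack.append((nxt, dist + wt))
    return lengths

empty, full = frozenset(), frozenset({'a', 'b', 'c'})
d_A = res.fun  # DBPLP optimum from the earlier linprog sketch
# Theorem D.1: the DBPLP value dominates the length of every such path.
assert all(l <= d_A + 1e-9 for l in all_path_lengths(E, empty, full))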

Corollary D.2.

Let $m_{\mathcal{A}}$ and $d_{\mathcal{A}}$ be the solutions to MOLP and DBPLP, respectively. Then $m_{\mathcal{A}} \leq d_{\mathcal{A}}$ for any cover $C$ used in DBPLP.

Proof.

Directly follows from Theorem 5.1, Theorem D.1, and the observation that $CEG_D$ contains the same vertices and a subset of the edges of $CEG_M$: every $(\emptyset,\mathcal{A})$ path $P$ in $CEG_D$ is also an $(\emptyset,\mathcal{A})$ path in $CEG_M$, so $m_{\mathcal{A}} \leq |P| \leq d_{\mathcal{A}}$. ∎

Corollary D.2 is a variant of Theorem 2 from reference (Joglekar and Ré, 2018), which compares refinements of MOLP and DBPLP. Our proof serves as an alternative combinatorial proof to the inductive proof in reference (Joglekar and Ré, 2018), which compares the LPs that define the bounds. Specifically, by representing the MOLP and DBPLP bounds through CEGs and relating them, respectively, to the lengths of shortest and longest $(\emptyset,\mathcal{A})$ paths, one can directly observe that MOLP is tighter than DBPLP.