Consistent Answers of Aggregation Queries using SAT Solvers
Abstract.
The framework of database repairs and consistent answers to queries is a principled approach to managing inconsistent databases. We describe the first system able to compute the consistent answers of general aggregation queries with the COUNT(), COUNT(*), SUM(), MIN(), and MAX() operators, and with or without grouping constructs. Our system uses reductions to optimization versions of Boolean satisfiability (SAT) and then leverages powerful SAT solvers. We carry out an extensive set of experiments on both synthetic and real-world data that demonstrate the usefulness and scalability of this approach.
1. Introduction
The framework of database repairs and consistent query answering, introduced by Arenas, Bertossi, and Chomicki (Arenas et al., 1999), is a principled approach to managing inconsistent databases, i.e., databases that violate one or more integrity constraints on their schema. In this framework, inconsistencies are handled at query time by considering all possible repairs of the inconsistent database, where a repair of an inconsistent database I is a consistent database that differs from I in a “minimal” way. The consistent answers to a query q on a given database I are the intersection of the results of q applied on each repair of I. Thus, a consistent answer comes with the guarantee that it will be found no matter on which repair the query is evaluated. Computing the consistent answers can be an intractable problem, because an inconsistent database may have exponentially many repairs. In particular, computing the consistent answers of a fixed Select-Project-Join (SPJ) query can be a coNP-complete problem. By now, there is an extensive body of work on the complexity of the consistent answers of SPJ queries (see Section 2).
Range Semantics: Concept and Motivation. Aggregation queries are among the most frequently asked queries; they are of the form

SELECT g1, ..., gk, f(A) FROM R(q) GROUP BY g1, ..., gk

where f is one of the standard aggregation operators COUNT(), COUNT(*), SUM(), AVG(), MIN(), MAX(), R(q) is the relation returned by an SPJ query q expressed in SQL, and g1, ..., gk are optional grouping attributes. A scalar aggregation query is an aggregation query without a GROUP BY clause.
What is the semantics of an aggregation query over an inconsistent database? Since an aggregation query may return different answers on different repairs of an inconsistent database, there is typically no consistent answer under the earlier definition of consistent answers. To give meaningful semantics to aggregation queries, Arenas et al. (Arenas et al., 2003b) introduced the range consistent answers.
[Table 1. An inconsistent banking database with relations CUST(CID, CNAME, CITY), ACC(ACCID, TYPE, BAL, CITY), and CUSTACC(CID, ACCID).]
Let q* be a scalar aggregation query and let Σ be a set of integrity constraints. The set of possible answers to q* on an inconsistent instance I w.r.t. Σ is the set of the answers to q* over all repairs of I w.r.t. Σ, i.e., the set {q*(J) : J is a repair of I w.r.t. Σ}. By definition, the range consistent answers to q* on I is the interval [glb, lub] whose endpoints are, respectively, the greatest lower bound (glb) and the least upper bound (lub) of the set of possible answers to q* on I. For example, the range consistent answers to the query
SELECT SUM(ACCOUNTS.BAL) FROM ACCOUNTS, CUSTACC
WHERE ACCOUNTS.ACCID = CUSTACC.ACCID AND CUSTACC.CID = 'C2'
on the instance in Table 1 is the interval [900, 2200]. The meaning is that no matter how the database is repaired, the answer to the query is guaranteed to lie between 900 and 2200.
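To make the range semantics concrete, here is a minimal Python sketch (with hypothetical data, not the contents of Table 1) that enumerates all repairs of a single relation violating its key constraint by brute force and computes the glb- and lub-answers of a scalar SUM query:

```python
from itertools import product

# Toy relation ACC(ACCID, BAL) with key ACCID; "A1" and "A2" each have
# two key-equal facts, so the instance has four subset repairs.
facts = [("A1", 900), ("A1", 1200), ("A2", 1000), ("A2", -100)]

groups = {}
for accid, bal in facts:
    groups.setdefault(accid, []).append((accid, bal))

# Every repair keeps exactly one fact from each key-equal group.
possible = [sum(bal for _, bal in repair)
            for repair in product(*groups.values())]

print([min(possible), max(possible)])  # the interval [glb, lub]
```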
Arenas et al. (Arenas et al., 2003a) focused on scalar aggregation queries only. Fuxman, Fazli, and Miller (Fuxman et al., 2005a) extended the notion of range consistent answers to aggregation queries with grouping (see Section 3).
Range semantics have become the standard semantics of aggregation queries in the framework of database repairs (see (Bertossi, 2011, Section 5.6)). Furthermore, range semantics have been adapted to give semantics to aggregation queries in several other contexts, including data exchange (Afrati and Kolaitis, 2008) and ontologies (Kostylev and Reutter, 2015). Finally, range semantics have been suggested as an alternative way to overcome some of the issues arising from SQL’s handling of null values (Guagliardo and Libkin, 2016).
Earlier Systems for Consistent Query Answering. Several academic prototype systems for consistent query answering have been developed (Arenas et al., 2003a; Barceló and Bertossi, 2003; Chomicki et al., 2004b; Fuxman et al., 2005a, b; Greco et al., 2003; Kolaitis et al., 2013; Manna et al., 2011; Marileo and Bertossi, 2010; Dixit and Kolaitis, 2019). These systems use different approaches, including logic programming (Barceló and Bertossi, 2003; Greco et al., 2003), compact representations of repairs (Chomicki et al., 2004a), and reductions to solvers (Manna et al., 2011; Kolaitis et al., 2013; Dixit and Kolaitis, 2019). In particular, in (Dixit and Kolaitis, 2019), we reported on CAvSAT, a system that at that time was able to compute the consistent answers of unions of SPJ queries w.r.t. denial constraints (which include functional dependencies as a special case) via reductions to SAT solvers. Among all these systems, however, only the ConQuer system by Fuxman et al. (Fuxman et al., 2005a, b) is capable of handling aggregation queries. Actually, ConQuer can only handle a restricted class of aggregation queries, namely, those aggregation queries w.r.t. key constraints for which the underlying SPJ query belongs to the class called Cforest. For such a query q*, the range consistent answers of q* are SQL-rewritable, which means that there is a SQL query q′ such that the range consistent answers of q* on an instance I can be obtained by directly evaluating q′ on I. This leaves out, however, many aggregation queries, including all aggregation queries whose range consistent answers are not SQL-rewritable or are NP-hard to compute. Up to now, no system has supported such queries.
Summary of Contributions. In this paper, we report on and evaluate the performance of AggCAvSAT (Aggregate Consistent Answers via Satisfiability Testing), which is an enhanced version of CAvSAT and is also the first system capable of computing the range consistent answers to all aggregation queries involving the operators SUM(), COUNT(), or COUNT(*), with or without grouping.
We first corroborate the need for a system that goes well beyond ConQuer by showing that there is an aggregation query q* involving SUM() such that the consistent answers of the underlying SPJ query w.r.t. key constraints are SQL-rewritable, but computing the range consistent answers of q* is NP-hard (Theorem 3.2 in Section 3).
The distinctive feature of AggCAvSAT is that it uses polynomial-time reductions from the computation of the range consistent answers of aggregation queries to optimization variants of Boolean Satisfiability (SAT), such as Partial MaxSAT and Weighted Partial MaxSAT. These reductions, described in Sections 4 and 5, are natural, but they are much more sophisticated than the reductions used in (Dixit and Kolaitis, 2019) to reduce the consistent answers of SPJ queries to SAT. After the reductions have been carried out, AggCAvSAT deploys powerful SAT solvers, such as the MaxHS solver (Davies and Bacchus, 2011), to compute the range consistent answers of aggregation queries. Furthermore, AggCAvSAT can handle databases that are inconsistent not only w.r.t. key constraints, but also w.r.t. arbitrary denial constraints, a much broader class of constraints.
An extensive experimental evaluation of AggCAvSAT is reported in Section 6. We carried out a suite of experiments on both synthetic and real-world databases, and for a variety of aggregation queries with and without grouping. The synthetic databases were generated using two different methods: (a) the DBGen tool of TPC-H was used to generate consistent data into which inconsistencies were then injected artificially; (b) the PDBench inconsistent database generator from the probabilistic database management system MayBMS (Antova et al., 2008) was used. The experiments demonstrated the scalability of AggCAvSAT along two dimensions: the size of the data and the degree of inconsistency in the data. AggCAvSAT was also competitive in comparison to ConQuer (especially when the degree of inconsistency was not excessive), even though the latter is tailored to handle only a restricted class of aggregation queries whose range consistent answers are SQL-rewritable.
Consistent Answers vs. Data Cleaning. There is a large body of work on managing inconsistent databases via data cleaning. There are fundamental differences between the framework of the consistent answers and the framework of data cleaning (see (Bertossi, 2011, Section 6)). In particular, the consistent answers provide the guarantee that each such answer will be found no matter on which repair the query at hand is evaluated, while data cleaning provides no similar guarantee. Data cleaning has the attraction that it produces a single consistent instance but the process need not be deterministic and the instance produced need not even be a repair (i.e., it need not be a maximally consistent instance). Recent data cleaning systems, such as HoloClean (Rekatsinas et al., 2017) and Daisy (Giannakopoulou et al., 2020b, a), produce a probabilistic database instance as the output (which again need not be a repair).
It is not clear how to compare query answers over the database returned by a data cleaning system and the (range) consistent answers computed by a consistent query answering system. In fact, no such comparison is given in the HoloClean (Rekatsinas et al., 2017) and Daisy (Giannakopoulou et al., 2020b, a) papers. At the performance level, the data cleaning approaches remove inconsistencies in the data offline, hence the time-consuming tasks are done prior to answering the queries; in contrast, systems for consistent query answering work online. It is an interesting project, left for future research, to develop a methodology and carry out a fair comparison on a level playing field between systems for data cleaning and systems for consistent query answering.
2. Preliminaries
Integrity Constraints and Database Queries
A relational database schema R is a finite collection of relation symbols, each with a fixed positive integer as its arity. The attributes of a relation symbol R are names for its columns; they can be identified with their positions, thus attr(R) denotes the set of attributes of R. An R-instance I is a collection of finite relations, one for each relation symbol in R. An expression of the form R(a1, ..., an) is a fact of the instance I if the tuple (a1, ..., an) belongs to the relation of R in I. A key of R is a minimal subset X of attr(R) such that the functional dependency X → attr(R) holds.
Starting with Codd’s seminal work (Codd, 1970, 1972), first-order logic has been successfully used as a database query language; in fact, it forms the core of SQL. A conjunctive query is expressible by a first-order formula of the form q(y) = ∃z (R1(w1) ∧ ⋯ ∧ Rm(wm)), where each wi is a tuple of variables and constants, y and z are tuples of variables with no variable in common, and each variable in wi appears in exactly one of the tuples y and z. A conjunctive query with no free variables (i.e., all variables are existentially quantified) is a boolean query, while a conjunctive query with k free variables is a k-ary query. Conjunctive queries are also known as select-project-join (SPJ) queries with equi-joins, and are among the most frequently asked queries. For example, on the instance from Table 1, the binary conjunctive query q(x, y) = ∃z, w (CUST(z, y, w) ∧ CUSTACC(z, x)) returns the set of all pairs (a, n) such that a is an account ID of an account owned by the customer named n.
Equivalently, this query can be expressed in SQL as
SELECT CUSTACC.ACCID, CUST.CNAME
FROM CUST, CUSTACC
WHERE CUST.CID = CUSTACC.CID
A union of conjunctive queries is expressible by a disjunction of conjunctive queries, where all conjunctive queries have the same arity. Unions of conjunctive queries are strictly more expressive than conjunctive queries.
Database Repairs and Consistent Answers
Let Σ be a set of integrity constraints on a database schema R. An R-instance I is consistent if I ⊨ Σ, i.e., I satisfies every constraint in Σ; otherwise, I is inconsistent. For example, let I be the instance depicted in Table 1, and let Σ consist of two key constraints, namely, the key CID of CUST and the key ACCID of ACC. Clearly, I is inconsistent, since some facts of CUST and some facts of ACC violate these key constraints.
A repair of an inconsistent instance I w.r.t. Σ is a consistent instance that differs from I in a “minimal” way. Different notions of minimality give rise to different types of repairs (see (Bertossi, 2011) for a comprehensive survey). Here, we focus on subset repairs, the most extensively studied type of repairs. An instance J is a subset repair of an instance I if J is a maximal consistent subinstance of I, that is, J ⊆ I (where I and J are viewed as sets of facts), J ⊨ Σ, and there exists no instance J′ such that J ⊊ J′ ⊆ I and J′ ⊨ Σ. Arenas et al. (Arenas et al., 1999) used repairs to give rigorous semantics to query answering on inconsistent databases. Specifically, assume that q is a query, I is an R-instance, and t is a tuple of values. We say that t is a consistent answer to q on I w.r.t. Σ if t ∈ q(J) for every repair J of I. We write Cons(q, I, Σ) to denote the set of all consistent answers to q on I w.r.t. Σ, i.e., Cons(q, I, Σ) = ⋂ {q(J) : J is a repair of I w.r.t. Σ}.
If Σ is a fixed set of integrity constraints and q is a fixed query, then the main computational problem associated with the consistent answers is: given an instance I, compute Cons(q, I, Σ); we write Cons(q, Σ) to denote this problem. If q is a boolean query, then computing the consistent answers becomes the decision problem CERTAINTY(q, Σ): given an instance I, is q true on every repair of I w.r.t. Σ? When the constraints in Σ are understood from the context, we will write Cons(q) and CERTAINTY(q) in place of Cons(q, Σ) and CERTAINTY(q, Σ), respectively.
Complexity of Consistent Answers
There has been an extensive study of the consistent answers of conjunctive queries (Bertossi, 2011; Fuxman et al., 2005b; Fuxman and Miller, 2007; Wijsen, 2009, 2010a, 2010b; Kolaitis and Pema, 2012; Koutris and Wijsen, 2016, 2017). If Σ is a fixed set of key constraints and q is a boolean conjunctive query, then CERTAINTY(q, Σ) is always in coNP, but, depending on the query and the constraints, it exhibits a variety of behaviors within coNP. The most definitive result to date is a trichotomy theorem by Koutris and Wijsen (Koutris and Wijsen, 2016, 2017); it asserts that if q is a self-join-free (no repeated relation symbols) boolean conjunctive query with one key constraint per relation, then CERTAINTY(q) is either SQL-rewritable, or in P but not SQL-rewritable, or coNP-complete. It is an open problem whether or not this trichotomy extends to arbitrary boolean conjunctive queries and to broader classes of constraints (e.g., denial constraints).
We illustrate the trichotomy theorem with three examples. In what follows, the first attribute of each relation constitutes its key. First, if q1 is the query ∃x∃y∃z (R(x, y) ∧ S(y, z)), then CERTAINTY(q1) is SQL-rewritable (Fuxman and Miller, 2007). Second, if q2 is the query ∃x∃y (R(x, y) ∧ S(y, x)), then CERTAINTY(q2) is in P, but is not SQL-rewritable (Wijsen, 2010b). Third, if q3 is the query ∃x∃y∃z (R(x, y) ∧ S(z, y)), then CERTAINTY(q3) is coNP-complete (Fuxman and Miller, 2007).
Boolean Satisfiability and SAT Solvers
Boolean Satisfiability (SAT) is arguably the prototypical and the most widely studied NP-complete problem. SAT is the following decision problem: given a boolean formula φ, is φ satisfiable? There has been an extensive body of research on different aspects of boolean satisfiability (see the handbook (Biere et al., 2009)). In particular, significant progress has been made on developing SAT solvers, so much so that the advances in this area of research are often referred to as the “SAT Revolution” (Vardi, 2009). Typically, a SAT solver takes a boolean formula φ in conjunctive normal form (CNF) as input and either outputs a satisfying assignment for φ (if one exists) or reports that the formula is unsatisfiable. Recall that a formula is in CNF if it is a conjunction of clauses, where each clause is a disjunction of literals. For example, the formula (x1 ∨ x2) ∧ (¬x1 ∨ ¬x3) ∧ (x3 ∨ x4) has a satisfying assignment x1 = 1, x2 = 0, x3 = 0, and x4 = 1.
At present, SAT solvers are capable of quickly solving SAT instances with millions of clauses and variables. SAT solvers have been widely used in both academia and industry as general-purpose tools. Indeed, many real-world problems from a variety of domains, including scheduling, protocol design, software verification, and model checking, can be naturally encoded as SAT instances and solved quickly using solvers such as Glucose (Audemard and Simon, 2009) and CaDiCaL (cad, [n.d.]). Furthermore, SAT solvers have been used in solving open problems in mathematics (Heule et al., 2016; Oostema et al., 2020). In (Dixit and Kolaitis, 2019), we used SAT solvers to build a prototype system for consistent query answering, which we called CAvSAT. This system can compute the consistent answers to unions of conjunctive queries over relational databases that are inconsistent w.r.t. a fixed set of arbitrary denial constraints.
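As a minimal illustration of the solving workflow that AggCAvSAT builds on, the following Python sketch feeds a small CNF formula to a SAT solver; it assumes the third-party python-sat (PySAT) package, which is not part of AggCAvSAT itself (the system invokes solvers such as MaxHS directly).

```python
from pysat.solvers import Glucose3

# (x1 v x2) AND (not x1 v x3) AND (not x2 v not x3), as DIMACS-style integers.
solver = Glucose3()
solver.add_clause([1, 2])
solver.add_clause([-1, 3])
solver.add_clause([-2, -3])

if solver.solve():
    print(solver.get_model())   # a satisfying assignment, e.g., [1, -2, 3]
else:
    print("unsatisfiable")
solver.delete()
```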
3. Range Consistent Answers
Frequently asked database queries often involve one of the standard aggregation operators COUNT(), COUNT(*), SUM(), AVG(), MIN(), MAX(), and, possibly,
a GROUP BY clause. In what follows, we will use the term
aggregation queries to refer to queries with aggregate operators and with or without a GROUP BY clause. Thus, in full generality, an aggregation query q* can be expressed as

SELECT g1, ..., gk, f(A) FROM R(q) GROUP BY g1, ..., gk

where f is one of the aforementioned aggregate operators and R(q) is the relation returned by a query q, which typically is a conjunctive query or a union of conjunctive queries expressed in SQL. The query q is called the underlying query of q*, the attribute represented by the variable A is called the aggregation attribute, and the attributes represented by g1, ..., gk are called the grouping attributes. A scalar aggregation query is one without a GROUP BY clause.
It is often the case that an aggregation query returns different answers on different repairs of an inconsistent database; thus, even for a scalar aggregation query, there is typically no consistent answer as per the definition of consistent answers given earlier. In fact, to produce an empty set of consistent answers, it suffices to have just two repairs on which a scalar aggregation query returns different answers. Aiming to obtain more meaningful answers to aggregation queries, Arenas et al. (Arenas et al., 2003b) proposed the range consistent answers as an alternative notion of consistent answers.
Let q* be a scalar aggregation query. The set of possible answers to q* on an inconsistent instance I consists of the answers to q* over all repairs of I, i.e., it is the set {q*(J) : J is a repair of I}. By definition, the range consistent answers to q* on I is the interval [glb, lub] whose endpoints are, respectively, the greatest lower bound (glb) and the least upper bound (lub) of the set of possible answers to q* on I.
For example,
the range consistent answers of the query
SELECT SUM(ACCOUNTS.BAL) FROM ACCOUNTS, CUSTACC
WHERE ACCOUNTS.ACCID = CUSTACC.ACCID AND CUSTACC.CID = 'C2'
on the instance in Table 1 is the interval [900, 2200]. The guarantee is that no matter how the database is repaired, the answer to the query lies between 900 and 2200. Note that the glb-answer and the lub-answer are realized in two different repairs of I, determined by which fact of each key-equal group of Table 1 is retained.
Arenas et al. (Arenas et al., 2003a) focused on scalar aggregation queries only. Fuxman, Fazli, and Miller (Fuxman et al., 2005a) extended the notion of range consistent answers to aggregation queries with grouping, i.e., to queries of the form

SELECT g1, ..., gk, f(A) FROM R(q) GROUP BY g1, ..., gk.

For such queries, a tuple (v̄, [glb, lub]) is a range consistent answer to q* on I if the following conditions hold:

(1) For every repair J of I, there exists a value a s.t. (v̄, a) ∈ q*(J) and glb ≤ a ≤ lub.
(2) For some repair J of I, we have that (v̄, glb) ∈ q*(J).
(3) For some repair J of I, we have that (v̄, lub) ∈ q*(J).
If q* is an aggregation query, Range-Cons(q*) denotes the problem: given an instance I, compute the range consistent answers of q* on I.
Complexity of Range Consistent Answers
Arenas et al. (Arenas et al., 2003b) investigated the computational complexity of the range consistent answers for scalar aggregation queries of the form

SELECT f(A) FROM R,

where f is one of the standard aggregation operators and R is a relation over a schema with functional dependencies.
where is one of the standard aggregation operators and is a relational schema with functional dependencies. The main findings in Arenas et al. (Arenas et al., 2003b) can be summarized as follows.
If the relational schema has at most one functional dependency and f is one of the aggregation operators MIN(), MAX(), SUM(), COUNT(*), AVG(), then computing the range consistent answers of the query is in P.
There is a relational schema with one key dependency such that computing the range consistent answers of the query with the COUNT(A) operator is an NP-complete problem.
There is a relational schema with two functional dependencies such that computing the range consistent answers of the query is an NP-complete problem, where the query involves any one of the standard aggregation operators.
It remains an open problem to pinpoint the complexity of the range consistent answers for richer aggregation queries of the form

SELECT f(A) FROM R(q),

where R(q) is the relation returned by a conjunctive query or by a union of conjunctive queries. It can be shown, however, that if computing the consistent answers of the underlying query is a hard problem, then computing the range consistent answers of the aggregation query is a hard problem as well. This gives rise to the following question: what can we say about the complexity of the range consistent answers if computing the consistent answers of the underlying query is an easy problem?
Fuxman and Miller (Fuxman and Miller, 2007) identified a class, called Cforest, of self-join-free conjunctive queries whose consistent answers are SQL-rewritable. In his PhD thesis, Fuxman (Fuxman, 2007) introduced the class Caggforest consisting of all aggregation queries q* such that the aggregation operator is one of MIN(), MAX(), SUM(), COUNT(*), the underlying query q is a conjunctive query in Cforest, and there is one key constraint for each relation in the underlying query q. Fuxman (Fuxman, 2007) showed that the range consistent answers of every query in Caggforest are SQL-rewritable (earlier, similar results for a proper subclass of Caggforest were obtained by Fuxman, Fazli, and Miller).
It is known that there are self-join-free conjunctive queries outside the class Cforest whose consistent answers are SQL-rewritable. In fact, Koutris and Wijsen (Koutris and Wijsen, 2017) have characterized the self-join-free conjunctive queries whose consistent answers are SQL-rewritable. However, the SQL-rewritability of aggregation queries beyond those in Caggforest has not been investigated. In the sequel, we show that there exists a self-join-free conjunctive query whose consistent answers are SQL-rewritable, but this property is not preserved when an aggregation operator is added on top of it. Specifically, we reduce the Maximum Cut problem to the problem of computing the range consistent answers to an aggregation query that involves SUM and whose underlying conjunctive query has SQL-rewritable consistent answers. We begin by recalling the definition of the Maximum Cut problem, a fundamental NP-complete problem (Karp, 1972). We state and prove a helping lemma (Lemma 3.3) as part of the proof of the main result, Theorem 3.2.
Definition 3.1.
Maximum Cut. For an undirected graph G = (V, E), a cut of G is a partition (S, S′) of V, where S ⊆ V and S′ = V \ S. The set of edges with one vertex in S and one vertex in S′ is denoted by E(S, S′), and the size of the cut is |E(S, S′)|.
The Maximum Cut problem asks: Given an undirected graph G and an integer k, is there a cut of G that has size at least k?
Theorem 3.2.
Let R be a schema with three relations R1, R2, and R3, where the first attributes of R1 and R2 are their keys, and the key of R3 consists of all of its attributes. Let q* be the following aggregation query:

SELECT SUM(w) FROM R(q),

where q is the following self-join-free conjunctive query:

q(w) = ∃x ∃y (R1(x, 'red') ∧ R2(y, 'blue') ∧ R3(x, y, w)).

Then the following two statements hold:

(1) Cons(q) is SQL-rewritable.
(2) Range-Cons(q*) is NP-hard.
Proof.
To show that Cons(q) is SQL-rewritable, consider the following first-order query q′:
We will show that for every instance and every value , we have that . Since filters out the tuples from and that participate in the violations of the key constraints, we have that if , then , for every repair of , which means that . In the other direction, we claim that if , then . Indeed, if , then for all and such that , we would have that there is some such that and or there is some such that and . Construct a repair of as follows. First, for every , if ‘red’ is the only value such that is a fact of , then put in ; otherwise, pick an element such that is a fact of and put in . Second, for every , if ‘blue’ is the only value such that is a fact of , then put in ; otherwise, pick an element such that is a fact of and put in . Third, put every tuple of the relation of into . Clearly, is a repair of . Moreover, . Indeed, if , then there are elements and such that . Since , we have that there is some such that and or there is some such that and . In the first case, the construction of implies that is not a fact of , while in the second case, the construction of implies that is not a fact of ; in either case, we have arrived at a contradiction.
To show that Range-Cons(q*) is NP-hard, consider the following reduction from undirected graphs to R-instances, where R is the schema with relations R1, R2, and R3.
Reduction 3.1.
Given an undirected graph G = (V, E), construct an R-instance I_G as follows. Let n = |V|.

• For each vertex v ∈ V, add the tuples R1(v, 'red'), R1(v, 'blue'), R2(v, 'red'), and R2(v, 'blue') to I_G.
• For each vertex v ∈ V, add the tuple R3(v, v, −n) to I_G.
• For each edge {u, v} ∈ E, add the tuples R3(u, v, 1) and R3(v, u, 1) to I_G.
We will show that the preceding Reduction 3.1 reduces Maximum Cut to computing the range consistent answers of the aggregation query q*.
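A small Python sketch of Reduction 3.1, under our reading of the damaged source (the relation names R1, R2, R3 and the vertex-tuple weight −n are reconstructions):

```python
def instance_from_graph(vertices, edges):
    """Build the relations R1, R2, R3 of Reduction 3.1 from a graph."""
    n = len(vertices)
    r1 = [(v, c) for v in vertices for c in ("red", "blue")]  # key: vertex
    r2 = [(v, c) for v in vertices for c in ("red", "blue")]  # key: vertex
    r3 = [(v, v, -n) for v in vertices]        # penalty tuple per vertex
    for u, v in edges:
        r3.append((u, v, 1))                   # one tuple per edge direction
        r3.append((v, u, 1))
    return r1, r2, r3

r1, r2, r3 = instance_from_graph(["a", "b", "c"], [("a", "b"), ("b", "c")])
print(len(r1), len(r2), len(r3))               # 6 6 7
```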
For the rest of this section, let G = (V, E) be an undirected graph and I_G be the database instance constructed from G using Reduction 3.1. We say that a repair J of I_G produces a red-blue coloring of G if for every vertex v ∈ V, the tuples R1(v, 'red') and R2(v, 'red') are either both present in J or both absent from J. We now prove a useful lemma.
Lemma 3.3.
For every repair J of I_G, there exists a repair J′ of I_G (not necessarily different from J) such that J′ produces a red-blue coloring of G and q*(J′) ≥ q*(J).
Proof.
Let J be a repair of I_G. Construct an R-instance J′ from J as follows. For every vertex v ∈ V, if both tuples R1(v, c) and R2(v, c) are present in J for some color c, then add them to J′. Otherwise, add the tuples R1(v, 'red') and R2(v, 'red') to J′. Also, copy all tuples from relation R3 of J to relation R3 of J′. Clearly, J′ is a repair of I_G and produces a red-blue coloring of G. Observe that J′ can be different from J only if there exists at least one vertex v such that either R1(v, 'red'), R2(v, 'blue') ∈ J or R1(v, 'blue'), R2(v, 'red') ∈ J.
We will show that q*(J′) ≥ q*(J) holds.
Case 1: Let v be a node such that R1(v, 'red'), R2(v, 'blue') ∈ J. In this case, while populating the database instance J′, vertex v changes its color in relation R2, i.e., we have that R2(v, 'blue') ∈ J and R2(v, 'red') ∈ J′. Therefore, the summands arising from the tuples of the form R1(u, 'red'), R2(v, 'blue'), and R3(u, v, 1) of J (for some vertex u) do not appear in q*(J′). Notice that each of these summands contributes value 1 to q*(J) and the number of these summands is at most n − 1. At the same time, the summand that contributes value −n to q*(J), arising from the tuples R1(v, 'red'), R2(v, 'blue'), and R3(v, v, −n) of J, also does not appear in q*(J′). Since n > n − 1, it follows that q*(J′) cannot be made smaller than q*(J) on account of such a node v.
Case 2: Let v be a node such that R1(v, 'blue'), R2(v, 'red') ∈ J. In this case, while populating J′, vertex v changes its color in relation R1, i.e., we have that R1(v, 'blue') ∈ J and R1(v, 'red') ∈ J′. Compared to q*(J), this can only increase the number of summands that contribute 1 to the sum, by possibly having new summands arising from the tuples of type R1(v, 'red'), R2(u, 'blue'), and R3(v, u, 1) of J′ (for some vertex u). Moreover, every summand that contributes 1 to q*(J) and is not accounted for in Case 1 also contributes 1 to q*(J′). Hence, q*(J′) cannot be made smaller than q*(J) on account of such a node v.
The preceding analysis implies that q*(J′) ≥ q*(J). ∎
By Lemma 3.3, there exists a repair J of I_G such that J produces a red-blue coloring of G and q*(J) is the lub-answer in Range-Cons(q*) on I_G. We will show that, for a non-negative integer k, there is a cut (S, S′) of G such that |E(S, S′)| ≥ k if and only if there exists a repair J of I_G such that J produces a red-blue coloring of G and q*(J) ≥ k. Once this is shown, it will follow that it is NP-hard to even compute the lub-answer in Range-Cons(q*) on I_G.
Let (S, S′) be a cut of G such that |E(S, S′)| ≥ k. Construct an R-instance J as follows. For each vertex v ∈ S, add the tuples R1(v, 'red') and R2(v, 'red') to J. For each vertex v ∈ S′, add the tuples R1(v, 'blue') and R2(v, 'blue') to J. Add all tuples from relation R3 of I_G to J. Observe that J is a repair of I_G and that J produces a red-blue coloring of G. Also, every edge with one vertex in S and one vertex in S′ is part of a witness that gives rise to a summand contributing 1 to q*(J). Moreover, no summand in q*(J) arises from a tuple of the form R3(v, v, −n), because J contains no pair of tuples R1(v, 'red') and R2(v, 'blue'). Since |E(S, S′)| ≥ k, it must be the case that q*(J) ≥ k. In the other direction, let J be a repair of I_G such that J produces a red-blue coloring of G and q*(J) ≥ k. Construct two sets S and S′ of vertices of G as follows. Let v ∈ S if R1(v, 'red') ∈ J, and let v ∈ S′ if R1(v, 'blue') ∈ J. Clearly, (S, S′) is a cut of G. Every edge {u, v} such that u ∈ S and v ∈ S′ is part of a witness that gives rise to a summand contributing 1 to q*(J), since the tuples R1(u, 'red'), R2(v, 'blue'), and R3(u, v, 1) of J satisfy the underlying conjunctive query q of q*. In fact, since J produces a red-blue coloring of G, every summand that contributes to q*(J) must arise from such tuples. Since q*(J) ≥ k, it must be the case that |E(S, S′)| ≥ k. ∎
4. Consistent Answers via SAT Solving
In this section, we give polynomial-time reductions from computing the range consistent answers of aggregation queries to variants of SAT. The reductions in this section assume that the database schema has one key constraint per relation; in Section 5, we show how these reductions can be extended to schemata with arbitrary denial constraints. Our reductions rely on several well-known optimization variants of SAT that we describe next.
Weighted MaxSAT (or WMaxSAT) is the maximization variant of SAT in which each clause is assigned a positive weight and the goal is to find an assignment that maximizes the sum of the weights of the satisfied clauses. We will write (C, w) to denote a clause C with weight w.
Partial MaxSAT (or PMaxSAT) is the maximization variant of SAT in which some clauses of the formula are assigned infinite weight (hard clauses), while each of the rest is assigned weight one (soft clauses). The goal is to find an assignment that satisfies all hard clauses and the maximum number of soft clauses. If the hard clauses of a PMaxSAT instance are not simultaneously satisfiable, then we say that the instance is unsatisfiable. For simplicity, a hard clause C is denoted as (C, ∞).
Weighted Partial MaxSAT (or WPMaxSAT) is the maximization variant of SAT in which some of the clauses of the formula are assigned infinite weight (hard clauses), while each of the rest is assigned a positive weight (soft clauses). The goal is to find an assignment that satisfies all hard clauses and maximizes the sum of weights of the satisfied soft clauses. Clearly, WPMaxSAT is a common generalization of both WMaxSAT (no hard clauses) and PMaxSAT (each soft clause has weight one).
Modern solvers, such as MaxHS (Davies and Bacchus, 2011), can efficiently solve large instances of these maximization variants of SAT. Note that these maximization problems have dual minimization problems, called WMinSAT, PMinSAT, and WPMinSAT, respectively. For example, in WPMinSAT, the goal is to find an assignment that satisfies all hard clauses and minimizes the sum of the weights of the satisfied soft clauses. These minimization problems are of interest to us because some of the computations of the range consistent answers have natural reductions to such minimization problems. At present, the only existing WPMinSAT solver is MinSatz (Li et al., 2011). Since this solver has certain size limitations, we will deploy Kügel’s technique (Kügel, 2012) to first reduce WPMinSAT to WPMaxSAT, and then use the MaxHS solver in our experiments. This technique uses the concept of CNF-negation (Kügel, 2012; Zhu et al., 2012). By definition, if C = (l1 ∨ l2 ∨ ⋯ ∨ lk) is a clause, then the CNF-negation of C is the CNF-formula (¬l1) ∧ (l1 ∨ ¬l2) ∧ ⋯ ∧ (l1 ∨ l2 ∨ ⋯ ∨ lk−1 ∨ ¬lk). It is easy to verify that the following properties hold: (i) if an assignment does not satisfy C, then it satisfies every clause of the CNF-negation of C; (ii) if an assignment satisfies C, then it satisfies all but one of the clauses of the CNF-negation of C, namely, the clause (l1 ∨ ⋯ ∨ li−1 ∨ ¬li), where i is the smallest index such that li is true under the assignment. It follows that an assignment satisfies C if and only if it falsifies exactly one clause of the CNF-negation of C. For a weighted clause (C, w), the CNF-negation of (C, w) is obtained by assigning the weight w to each clause of the CNF-negation of C.
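The following self-contained Python sketch (names ours) builds the CNF-negation of a clause, represented as a list of DIMACS-style integer literals, and verifies properties (i) and (ii) by brute force:

```python
from itertools import product

def cnf_negation(clause):
    """CNF-negation of l1 v ... v lk:
    (~l1) & (l1 v ~l2) & ... & (l1 v ... v l_{k-1} v ~lk)."""
    return [clause[:i] + [-clause[i]] for i in range(len(clause))]

def satisfies(assignment, clause):
    # assignment maps variable number -> bool
    return any(assignment[abs(l)] == (l > 0) for l in clause)

clause = [1, -2, 3]          # x1 v ~x2 v x3
negation = cnf_negation(clause)

for bits in product([False, True], repeat=3):
    a = {1: bits[0], 2: bits[1], 3: bits[2]}
    sat_count = sum(satisfies(a, c) for c in negation)
    if satisfies(a, clause):
        assert sat_count == len(clause) - 1   # property (ii)
    else:
        assert sat_count == len(clause)       # property (i)
print("properties (i) and (ii) verified")
```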
We also need to recall the notions of key-equal groups and bags of witnesses. Let I be a database instance.
We say that two facts of a relation R of I are key-equal if they agree on the key attributes of R. A set G of facts of I is called a key-equal group of facts if every two facts in G are key-equal, and no fact in G is key-equal to some fact in I \ G.
Let q(y) = ∃z (R1(w1) ∧ ⋯ ∧ Rm(wm)) be a conjunctive query, where each wi is a tuple of variables and constants, and let t be an answer to q on I. Let vars(q) and cons(q) be the sets of variables and constants occurring in q. A function μ from vars(q) ∪ cons(q) to the values occurring in I is a witnessing assignment to t if the following hold: μ(y) = t; if c is a constant in q, then μ(c) = c; and if Ri(wi) is an atom of q, then Ri(μ(wi)) is a fact of I. We say that a set W of facts from I is a witness to t if there is a witnessing assignment μ to t such that W consists of the facts Ri(μ(wi)) for the atoms Ri(wi) of q.
Note that two distinct witnessing assignments to an answer may give rise to the same witness. Thus, we consider the bag of witnesses to an answer, i.e., the bag consisting of the witnesses arising from all witnessing assignments to that answer, where each witness W is accompanied by its multiplicity, an integer denoting the number of witnessing assignments that gave rise to W. Finally, we define the bag of witnesses to a conjunctive query q on I as the bag union of the bags of witnesses over all answers to q on I (in the bag union, the multiplicities of the same set are added). The bag of witnesses to a union of conjunctive queries is the bag union of the bags of witnesses to each conjunctive query in the union. The bag of witnesses will be used in computing the range consistent answers to aggregation queries. In effect, the bag of witnesses corresponds to the provenance polynomials of conjunctive queries and their unions (Green et al., 2007; Karvounarakis and Green, 2012).
It is easy to verify that both the key-equal groups and the bag of witnesses can be computed using SQL queries. In Section 4.1 and Section 4.2, we give reductions that compute the range consistent answers to aggregation queries without grouping, and in Section 4.3, we describe an iterative algorithm that uses these reductions to handle aggregation queries with grouping.
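In AggCAvSAT, the key-equal groups are extracted with SQL queries; the following Python sketch over hypothetical facts shows the same grouping step:

```python
from collections import defaultdict

# Facts of a relation R(A, B, C) whose key is the first attribute A.
facts = [("a1", 1, "x"), ("a1", 2, "y"), ("a2", 3, "x"), ("a3", 4, "z")]
key_positions = (0,)  # positions of the key attributes of R

groups = defaultdict(list)
for fact in facts:
    key = tuple(fact[p] for p in key_positions)
    groups[key].append(fact)

# Key-equal groups; those with more than one fact violate the key.
for key, group in groups.items():
    print(key, group, "VIOLATION" if len(group) > 1 else "")
```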
4.1. Answering Queries with SUM and COUNT
Let R be a database schema with one key constraint per relation, and let q* be the aggregation query

SELECT f(A) FROM R(q),

where f is one of the operators COUNT(*), COUNT(A), SUM(A), and R(q) is a relation expressed as a union of conjunctive queries over R. We now reduce the computation of the range consistent answers of q* to PMaxSAT and to WPMaxSAT.
4.1.1. Reductions to PMaxSAT and WPMaxSAT
Reduction 4.1.
Let q* be an aggregation query as above, where f is one of the operators COUNT(*), COUNT(A), and SUM(A). Let I be an R-instance and K be the set of key-equal groups of facts of I. For each fact fi of I, introduce a boolean variable xi. Let W be the bag of witnesses to the query q on I, where q is the underlying query of q*. Construct a partial CNF-formula (if f is COUNT(*) or COUNT(A)) or a weighted partial CNF-formula (if f is SUM(A)) as follows:

(1) For each key-equal group G ∈ K,
• construct a hard clause α(G), the disjunction of the variables xi of the facts in G;
• for each pair of distinct facts fi, fj in G, construct a hard clause β(fi, fj) = (¬xi ∨ ¬xj).

(2a) If f is COUNT(*) or COUNT(A), then for each witness W ∈ W, construct a soft clause γ(W), the disjunction of the negated variables ¬xi of the facts in W. Construct a partial CNF-instance consisting of all hard α- and β-clauses together with all soft γ-clauses.

(2b) If f is SUM(A), let W+ and W− be the subsets of W such that for each witness W, we have W ∈ W+ iff valA(W) ≥ 0, and W ∈ W− iff valA(W) < 0, where valA(W) denotes the value of the aggregation attribute A on W. Let also w = |valA(W)|, the absolute value of valA(W). Construct a weighted soft clause γ(W) and a conjunction θ(W) of hard clauses as follows. If W ∈ W−, introduce a new variable yW, let γ(W) = (yW, w), and let θ(W) be the CNF encoding of the equivalence of yW with the conjunction of the variables xi of the facts in W; otherwise, let γ(W) be the disjunction of the negated variables ¬xi of the facts in W with weight w, and do not construct θ(W). Construct a weighted partial CNF-instance consisting of all hard α-, β-, and θ-clauses together with all soft γ-clauses.
Purpose of the components of Reduction 4.1:

• Each α-clause encodes the “at-least-one” constraint for a key-equal group G, in the sense that satisfying α(G) requires setting at least one variable corresponding to a fact in G to true. Similarly, each β-clause encodes the “at-most-one” constraint for G. In effect, every assignment that satisfies all α- and β-clauses sets exactly one variable corresponding to the facts from each key-equal group to true, and thus uniquely corresponds to a repair of I.

• Satisfying a γ-clause constructed in Step (2a) requires setting at least one variable corresponding to the facts of a witness W to q on I to false. Thus, if τ is an assignment that satisfies all α- and β-clauses, then γ(W) is satisfied by τ if and only if W is not contained in the repair J corresponding to τ.

• The γ-clauses constructed in Step (2b) serve the same purpose as the ones from Step (2a), but they are constructed in this way only for the witnesses in W+. For the witnesses in W−, the γ-clauses encode the condition that γ(W) is satisfied if and only if all variables corresponding to the facts in W are set to true. The hard θ-clauses are used solely to express this equivalence in conjunctive normal form.

The number of α-clauses and β-clauses is O(n²), where n is the size of the database; the number of γ-clauses and θ-clauses combined is O(n^m), where m is the number of relation symbols in q.
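A minimal Python sketch of Steps (1) and (2a) of Reduction 4.1 for a COUNT(*) query; clauses are lists of DIMACS-style integers, and the toy key-equal groups and witnesses are hypothetical (witness multiplicities can be folded into the soft weights):

```python
from itertools import combinations

# Variable i represents fact fi; facts are grouped by key-equality.
key_equal_groups = [[1, 2], [3], [4, 5]]   # variables of each group
witnesses = [[1, 3], [2, 4]]               # facts of each witness to q

hard, soft = [], []

# Step (1): exactly one fact per key-equal group survives in a repair.
for group in key_equal_groups:
    hard.append(group[:])                          # alpha: at-least-one
    for xi, xj in combinations(group, 2):
        hard.append([-xi, -xj])                    # beta: at-most-one

# Step (2a): one soft clause per witness, satisfied iff the witness is broken.
for facts in witnesses:
    soft.append(([-x for x in facts], 1))

# Feeding (hard, soft) to a MaxSAT solver such as MaxHS yields the
# glb-answer as the total weight of falsified soft clauses (Prop. 4.1).
print(hard, soft)
```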
Proposition 4.1.
Let q* be an aggregation query, where f is one of the operators COUNT(*), COUNT(A), and SUM(A). In a maximum (respectively, minimum) satisfying assignment of the PMaxSAT- or WPMaxSAT-instance constructed using Reduction 4.1, the sum of the weights of the falsified soft clauses is the glb-answer (respectively, the lub-answer) in the range consistent answers to q* on I.
Proof.
Let f be COUNT(*), and let τ be an assignment that satisfies the hard clauses of the formula constructed using Reduction 4.1. Let s(τ) denote the number of soft clauses of the formula satisfied by τ. Construct a database sub-instance J of I such that fi ∈ J if and only if τ(xi) = 1. The hard clauses of the formula constructed in Step (1) of Reduction 4.1 encode the condition that exactly one fact from each key-equal group of facts of I is in J, ensuring that J is a repair of I. Moreover, the soft clauses of the formula falsified by τ are in one-to-one correspondence with the witnesses to q in J. Therefore, we have that q*(J) equals the total number of soft clauses minus s(τ). Since the total number of soft clauses does not depend on τ, the answer q*(J) is minimized (i.e., q*(J) is the glb-answer to q* on I) when τ is a maximum satisfying assignment. Essentially the same argument works for the case where f is COUNT(A). A dual argument proves that a repair of I constructed from a minimum satisfying assignment realizes the lub-answer to q* on I.
Now, let f be SUM(A). Construct a repair J of I from τ by choosing fi ∈ J if and only if τ(xi) = 1. Observe that, for every W ∈ W+, the clause γ(W) is falsified by τ if and only if W ⊆ J. Similarly, for every W ∈ W−, the clause γ(W) is satisfied by τ if and only if W ⊆ J. Therefore, we have that the sum of the weights of the soft clauses falsified by τ equals q*(J) plus the sum of the absolute values |valA(W)| over all W ∈ W−. Since the latter term does not depend on τ, the answer q*(J) is minimized (i.e., q*(J) is the glb-answer to q* on I) when τ is a maximum satisfying assignment. A dual argument proves that a repair of I constructed from a minimum satisfying assignment of the formula realizes the lub-answer to q* on I. ∎
Example 4.2.
Let I be the database instance from Table 1, and let q* be the following aggregation query, which counts the number of customers who have an account in their own city:
SELECT COUNT(*)
FROM CUST, ACC, CUSTACC
WHERE CUST.CID = CUSTACC.CID
AND ACC.ACCID = CUSTACC.ACCID
AND CUST.CITY = ACC.CITY
From Reduction 4.1, we construct the following clauses:
• α-clauses: … ;
• β-clauses: … ;
• γ-clauses: … .
Observe that it is safe to omit the variables corresponding to the facts of CUSTACC, since CUSTACC does not violate any key constraint. A maximum satisfying assignment to the PMaxSAT instance constructed from the above clauses falsifies exactly one γ-clause, while a minimum satisfying assignment falsifies two γ-clauses. Thus, by Proposition 4.1, the range consistent answer to q* is the interval [1, 2].
Example 4.3.
Let us again consider the database instance I from Table 1, and the following aggregation query q*:
SELECT SUM(ACC.BAL)
FROM CUST, ACC, CUSTACC
WHERE CUST.CID = CUSTACC.CID
AND ACC.ACCID = CUSTACC.ACCID
AND CUST.CNAME = 'Mary'
The hard clauses constructed using Reduction 4.1 are the same as the ones from Example 4.2. The rest of the clauses are as follows:
• γ-clauses: … ;
• θ-clauses: … .
The two witnesses whose account balance is −100 belong to W−, so we introduce new variables y1 and y2 for them, respectively, and construct hard θ-clauses as described above. The γ-clauses corresponding to these witnesses are (y1, 100) and (y2, 100). We omit the CUSTACC variables in all of these clauses, since CUSTACC does not violate any key constraint. A maximum satisfying assignment to the WPMaxSAT instance satisfies soft clauses with total weight 3500, since it satisfies two clauses with weight 1200 each, one with weight 1000, and one with weight 100. Thus, by Proposition 4.1, the repair J corresponding to this assignment realizes the glb-answer. Similarly, a minimum satisfying assignment satisfies soft clauses with total weight 2200, namely, one clause with weight 1200 and one with weight 1000, and its corresponding repair realizes the lub-answer.
4.1.2. Handling DISTINCT
Let q* be an aggregation query, where f is either COUNT(DISTINCT A) or SUM(DISTINCT A). Solving a PMaxSAT or a WPMaxSAT instance constructed using Reduction 4.1 may yield incorrect glb- and lub-answers to q* if the database contains multiple witnesses with the same value of attribute A. For example, consider the database instance from Table 1 and the query
SELECT COUNT(DISTINCT ACC.TYPE) FROM ACC.
The correct glb- and lub-answers to this query are both 2, but the solutions to the PMaxSAT and PMinSAT instances constructed using Reduction 4.1 yield both answers as 4. The reason is that two of the soft clauses correspond to the account type Checking, while three of the soft clauses correspond to the account type Saving. The hard clauses in the formula ensure that the variables of one fact from each key-equal group of ACC are true, thus counting both the Checking and the Saving account types exactly twice in every satisfying assignment of the formula. This can be handled by modifying the γ-clauses in Reduction 4.1 as follows.
Let D denote the set of distinct answers to the query q on I. For each answer d ∈ D, let Wd denote the subset of W such that for every witness W ∈ Wd, we have that valA(W) = d. The idea is to use auxiliary variables to construct one soft clause for every distinct answer d, such that the clause is true if and only if no witness in Wd is present in the repair corresponding to the satisfying assignment. First, for every witness W ∈ Wd, we introduce an auxiliary variable pW that is true if and only if W is not present in the repair. Then, we introduce an auxiliary variable rd which is true if and only if all pW-variables with W ∈ Wd are true. These constraints are encoded in the set of clauses returned by Algorithm 1, and they are enforced by making the clauses in this set hard. For every answer d ∈ D, Algorithm 1 also returns one soft γ-clause, which serves the same purpose as the γ-clauses in Reduction 4.1. Now, a PMaxSAT or a WPMaxSAT instance can be constructed by taking in conjunction all hard clauses arising from the key-equal groups, the hard θ-clauses if any, the hard clauses from all the sets returned by Algorithm 1, and all soft γ-clauses. With this, it is easy to see that a maximum (respectively, minimum) satisfying assignment to the PMaxSAT or WPMaxSAT instance gives us the glb-answer (respectively, lub-answer) to q* on I. This is illustrated in Example 4.4.
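The following Python sketch (variable naming ours) illustrates the DISTINCT construction for COUNT(DISTINCT A): one auxiliary variable per witness, one per distinct value, and one soft clause per distinct value, with Tseitin-style hard clauses encoding the equivalences described above.

```python
from itertools import count

fresh = count(100)   # fresh auxiliary variables; fact variables use 1..99

def distinct_clauses(witnesses_by_value):
    """witnesses_by_value: value -> list of witnesses (lists of fact vars).
    Returns (hard, soft) clauses for COUNT(DISTINCT A)."""
    hard, soft = [], []
    for value, witnesses in witnesses_by_value.items():
        p_vars = []
        for w in witnesses:
            p = next(fresh)                       # p <-> witness w is broken
            hard.append([-p] + [-x for x in w])   # p -> some fact of w absent
            hard.extend([x, p] for x in w)        # some fact of w absent -> p
            p_vars.append(p)
        r = next(fresh)                           # r <-> all witnesses broken
        hard.extend([-r, p] for p in p_vars)
        hard.append([r] + [-p for p in p_vars])
        soft.append(([r], 1))                     # one soft clause per value
    return hard, soft

hard, soft = distinct_clauses({"Checking": [[1], [2]], "Saving": [[3], [4], [5]]})
print(len(hard), "hard clauses;", soft)
```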
Example 4.4.
Consider the following aggregation query on the database instance from Table 1:
SELECT COUNT(DISTINCT ACC.TYPE) FROM ACC
We have that D = {Checking, Saving}. Let us denote these two answers by d1 and d2, respectively. Since every witness to the query consists of a single fact, every pW-variable is equivalent to the negation of a single fact variable. As a result, it is unnecessary to introduce any pW-variables at all. Thus, we construct the following clauses from Reduction 4.1 and Algorithm 1:
• α-clauses: … ; β-clauses: … ;
• hard clauses for d1: … ;
• hard clauses for d2: … ;
• γ-clauses: … .
The maximum and minimum satisfying assignments to the PMaxSAT and PMinSAT instances constructed using these clauses each falsify both γ-clauses, since the range consistent answer to q* is the interval [2, 2].
4.2. Answering Queries with MIN and MAX
Let R be a database schema with one key constraint per relation, and let q* be the aggregation query

SELECT f(A) FROM R(q),

where f is one of the operators MIN(A) and MAX(A), and R(q) is a relation expressed as a union of conjunctive queries over R. The semantics of the range consistent answers to aggregation queries with the MIN and MAX operators is similar to that of aggregation queries with the SUM or COUNT operators, but here we need to address one additional special case, which we illustrate using Example 4.5.
Example 4.5.
Consider the database instance from Table 1 and the two aggregation queries q1* and q2* given below.
SELECT SUM(ACCOUNTS.BAL) FROM ACCOUNTS
WHERE ACCOUNTS.CITY = 'SF'
SELECT MIN(ACCOUNTS.BAL) FROM ACCOUNTS
WHERE ACCOUNTS.CITY = 'SF'
It is clear what the range consistent answers to q1* are. The lub-answer of 0 to q1* comes from a repair on which there is no account in the city of SF, so that the SUM function returns 0. For q2*, however, the range consistent answers are unclear, because the MIN function is not defined on the empty set.
In such scenarios, various semantics can be considered. One natural semantics is that if there exists a repair on which the underlying conjunctive query evaluates to the empty set, then there is no consistent answer to the aggregation query. Another is to return the interval [glb, lub] of values that come from the repairs on which the underlying conjunctive query evaluates to a non-empty set of answers, and to additionally return the information about the existence of a repair on which the underlying conjunctive query returns the empty set of answers. The reductions we give in this section can be used regardless of which of the two above-mentioned semantics is chosen.
In what follows, we first show that the glb-answer to an aggregation query with the MIN(A) operator can be computed in time polynomial in the size of the original inconsistent database instance (Proposition 4.6). We then give an iterative SAT-based approach to compute the lub-answer to an aggregation query with the MIN(A) operator. We do not explicitly state methods to obtain the range consistent answers to aggregation queries with the MAX(A) operator, since MAX is the dual of MIN: computing the lub-answer for the MIN(A) operator yields the same result as negating all values of the aggregation attribute in the database, computing the glb-answer for the MAX(A) operator, and negating the result.
Proposition 4.6.
Let R be a database schema, I an R-instance, and q* the aggregation query SELECT MIN(A) FROM R(q). Let q be the underlying union of conjunctive queries and let W be a witness to q on I such that no two facts in W are key-equal, and there is no witness W′ such that valA(W′) < valA(W) and no two facts in W′ are key-equal. Then, valA(W) is the glb-answer to q* on I.
Proof.
For every witness W′ to q on I such that valA(W′) < valA(W), we have that no repair of I contains W′, because W′ contains at least two key-equal facts. Moreover, since no two facts in W are key-equal, there exists a repair J of I such that W ⊆ J. Therefore, valA(W) must be the smallest possible answer to q* on I, i.e., the glb-answer to q* on I. Since the number of witnesses to q is polynomial in the size of I, a desired witness W can be obtained efficiently from the result of evaluating q on I. ∎
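A brief Python sketch of Proposition 4.6 (hypothetical inputs): scan the witnesses, discard those containing two key-equal facts, and return the smallest remaining value.

```python
def glb_min(witnesses, key_of):
    """witnesses: list of (value, facts); key_of(fact) -> (relation, key value).
    Returns the glb-answer for MIN(A), assuming at least one witness survives
    in some repair (the non-empty case of Example 4.5)."""
    surviving = []
    for value, facts in witnesses:
        keys = [key_of(f) for f in facts]
        if len(keys) == len(set(keys)):   # no two facts are key-equal
            surviving.append(value)
    return min(surviving) if surviving else None

# Each fact is (relation, key, payload); witnesses carry their A-value.
ws = [(5, [("R", "a", 1), ("R", "a", 2)]),   # key-equal pair: in no repair
      (7, [("R", "a", 1), ("S", "b", 3)])]
print(glb_min(ws, key_of=lambda f: (f[0], f[1])))  # 7
```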
To compute the range consistent answers to aggregation queries with the MIN(A) and MAX(A) operators, we opt for an iterative SAT-solving approach. In what follows, we formalize the construction of the SAT instance for the first iteration (Construction 4.1) and give Algorithm 2, which computes the lub-answer to q* on I by constructing and solving SAT instances in subsequent iterations.
Construction 4.1.
Given an R-instance I, construct a CNF formula φ as follows. For each fact fi of I, introduce a boolean variable xi. Let K be the set of key-equal groups of facts of I, and let M denote the set of minimal witnesses to the conjunctive query q on I, where q is the underlying query of q*. Assume that the set M = {W1, ..., Wr} is sorted in descending order of the answers, i.e., for i < j, we have that valA(Wi) ≥ valA(Wj).

• For each key-equal group G ∈ K, construct a clause α(G), the disjunction of the variables xi of the facts in G.
• Construct a CNF formula φ consisting of the conjunction of the α-clauses.
Proposition 4.7.
Let q* be an aggregation query with the MIN(A) operator, and let I be a database instance. Algorithm 2 returns the lub-answer to q* on I.
Proof.
The α-clauses of φ make sure that a repair of I can be constructed from every assignment that satisfies the α-clauses, by arbitrarily choosing exactly one fact from each key-equal group G of I among those whose variables are set to true. Let {a1, ..., as} denote the set of distinct answers to the conjunctive query q on I, where a1 is the lub-answer to q on I. For a witness W to q, the clause consisting of the negated variables of the facts in W is satisfied by an assignment if and only if W is not present in any repair constructed from that assignment. At iteration i of the while-loop, when the formula φ is checked for satisfiability (line 8 of Algorithm 2), the formula contains the α-clauses corresponding to the key-equal groups of I in conjunction with all clauses corresponding to the minimal witnesses on which q evaluates to an answer strictly smaller than ai. At this point, if the formula is satisfiable, then there exists a repair J of I on which ai is attained and every answer attained on J is at least ai. On the other hand, if the formula is unsatisfiable, then there exists no such repair of I. Since the clauses are added in the ascending order of the answers, the formula is satisfiable at iteration i if and only if the lub-answer is at least ai. Therefore, if the formula becomes unsatisfiable for the first time at iteration i, it must be the case that the answer of the preceding iteration is the lub-answer to q* on I. ∎
Why not a Binary Search?
In essence, Algorithm 2 works like a linear search over a sorted array. It may sound appealing to perform a binary search instead of the linear search, since the SAT solver would then only have to solve logarithmically many instances of SAT instead of linearly many. The problem with this approach is the following. Observe that, on average, half of the SAT instances that the solver needs to solve in the binary-search approach will be unsatisfiable. In the linear-search approach, however, all but the last instance given to the solver are satisfiable. Typically, the proofs of unsatisfiability produced by SAT solvers are significantly larger than the proofs of satisfiability, since the unsatisfiability of an instance needs to be established with a refutation that can be exponential in the size of the formula, while a single satisfying assignment suffices to prove the satisfiability of an instance. As a result, SAT solvers typically take considerably longer to solve unsatisfiable instances, but they are very quick on most real-world satisfiable instances. Therefore, in practice, the linear search often works better than the binary search.
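A simplified Python sketch of the iterative linear search (our rendering of Algorithm 2, again assuming the python-sat package and the non-empty-answer semantics): clauses forbidding the minimal witnesses are added in ascending order of their values, and the last value for which the formula is still satisfiable is the lub-answer for MIN.

```python
from pysat.solvers import Glucose3

key_equal_groups = [[1, 2], [3, 4]]          # alpha-clauses (hypothetical)
# Minimal witnesses as (value, fact variables), sorted by ascending value.
witnesses = [(5, [1, 3]), (7, [2, 3]), (9, [1, 4])]

solver = Glucose3()
for group in key_equal_groups:
    solver.add_clause(group)

lub = None
for value, facts in witnesses:
    if not solver.solve():                   # no repair avoids all smaller values
        break
    lub = value
    solver.add_clause([-x for x in facts])   # forbid this witness from now on
print("lub-answer for MIN:", lub)
solver.delete()
```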
4.3. Answering Queries with Grouping
Let q* be the aggregation query

SELECT g1, ..., gk, f(A) FROM R(q) GROUP BY g1, ..., gk,

where f is one of COUNT(A), COUNT(*), SUM(A), MIN(A), or MAX(A), and R(q) is a relation expressed by a union of conjunctive queries on R. We refer to the attributes g1, ..., gk as the grouping attributes. For aggregation queries with grouping, it does not seem feasible to reduce the computation to a single PMaxSAT or WPMaxSAT instance, because for each group of consistent answers, the glb-answer and the lub-answer may be realized in different repairs of the inconsistent database. To illustrate this, consider the database from Table 1 and an aggregation query q* that groups the customers by city and counts them. Notice that the glb-answers (LA, 2) and (SF, 1) to q* come from two different repairs of the relation CUST. However, the reductions from the preceding section can be used to compute the bounds for each consistent group of answers independently. For a given aggregation query q* with grouping, we first compute the consistent answers to an underlying conjunctive query that returns the grouping attributes. Then, for each such answer, we compute the glb- and lub-answers to the corresponding scalar aggregation query via PMaxSAT or WPMaxSAT solving, as shown in Algorithm 3.
Let I be an inconsistent database instance, and let q* be an aggregation query of the form SELECT g1, ..., gk, f(A) FROM R(q) GROUP BY g1, ..., gk.
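A brute-force Python sketch of the per-group strategy behind Algorithm 3 (hypothetical tuples): a group is reported only if it appears in every repair, and its glb- and lub-answers are taken over all repairs; AggCAvSAT obtains the same bounds with one PMaxSAT/WPMaxSAT pair per consistent group instead of enumerating repairs.

```python
from itertools import product
from collections import defaultdict

# Toy relation CUST(CID, CNAME, CITY) with key CID.
facts = [("C1", "Ann", "LA"), ("C1", "Ann", "SF"),
         ("C2", "Bob", "LA"), ("C3", "Eve", "SF")]

by_key = defaultdict(list)
for f in facts:
    by_key[f[0]].append(f)

# SELECT CITY, COUNT(*) FROM CUST GROUP BY CITY, over every repair.
per_group = defaultdict(list)
repairs = list(product(*by_key.values()))
for repair in repairs:
    counts = defaultdict(int)
    for _, _, city in repair:
        counts[city] += 1
    for city, c in counts.items():
        per_group[city].append(c)

for city, values in per_group.items():
    if len(values) == len(repairs):          # city appears in every repair
        print(city, (min(values), max(values)))
```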
As noted earlier, the bags of witnesses used in the preceding reductions capture the provenance of unions of conjunctive queries in the provenance polynomials model of (Green et al., 2007; Karvounarakis and Green, 2012). In (Amsterdamer et al., 2011), it was shown that a stronger provenance model is needed to express the provenance of aggregation queries, a model that uses a tensor product combining annotations with values. A future direction of research is to investigate whether this stronger provenance model can be used to produce more direct reductions of the range consistent answers to SAT.
5. Beyond Key Constraints
Key constraints and functional dependencies are important special cases of denial constraints (DCs), which are expressible by first-order formulas of the form ∀x̄ ¬(φ(x̄) ∧ β(x̄)), where φ(x̄) is a conjunction of atomic formulas and β(x̄) is a conjunction of expressions of the form (x op y), with each op a built-in predicate, such as =, ≠, <, or >. In words, a denial constraint prohibits a set of tuples that satisfy certain conditions from appearing together in a database instance. If Σ is a fixed finite set of denial constraints and q* is an aggregation query without grouping, then the following problem is in coNP: given a database instance I and a number a, is a the lub-answer (or the glb-answer) to q* on I w.r.t. Σ? This is so because, to check that a is not the lub-answer (or the glb-answer), we can guess a repair J of I and verify that q*(J) > a (or q*(J) < a). In all preceding reductions, the hard clauses constructed from the key-equal groups capture the inconsistency in the database arising from the key violations, and they enforce that every satisfying assignment uniquely corresponds to a repair of the initial inconsistent database instance. Importantly, these clauses are independent of the input query. In what follows, we provide a way to construct clauses that capture the inconsistency arising from the violations of denial constraints. Thus, replacing the clauses constructed from the key-equal groups in the reductions from Section 4 by the ones provided below allows us to compute consistent answers over databases with a fixed finite set of arbitrary denial constraints. The reduction relies on the notions of minimal violations and near-violations to the set of denial constraints, which we introduce next.
Assume that Σ is a set of denial constraints, I is an R-instance, and V is a sub-instance of I. We say that V is a minimal violation to Σ if V ⊭ Σ and, for every set V′ ⊊ V, we have that V′ ⊨ Σ.
Let Σ be a set of denial constraints, I an R-instance, V a sub-instance of I, and f a fact of I. We say that V is a near-violation w.r.t. Σ and f if f ∉ V and V ∪ {f} is a minimal violation to Σ. As a special case, if {f} itself is a minimal violation to Σ, we say that there is exactly one near-violation w.r.t. f, and it is the singleton {f⊥}, where f⊥ is an auxiliary fact.
Let R be a database schema, Σ a fixed finite set of denial constraints on R, q* an aggregation query without grouping, and I an R-instance.
Reduction 5.1.
Given an R-instance I, compute the sets:

(1) V: the set of minimal violations to Σ on I.
(2) NV(f): the set of near-violations to Σ on I w.r.t. each fact f of I.

For each fact fi of I, introduce a boolean variable xi. For the auxiliary fact f⊥, introduce the constant true, and for each near-violation V ∈ NV(fi), introduce a boolean variable y(fi, V). Then:

(1) For each V ∈ V, construct a clause consisting of the negated variables ¬xi of the facts in V.
(2) For each fact fi of I, construct a clause consisting of xi together with the variables y(fi, V) for all V ∈ NV(fi).
(3) For each variable y(fi, V), construct an expression stating that y(fi, V) is equivalent to the conjunction of the variables xj of the facts in V.
(4) Construct the boolean formula φ that is the conjunction of all clauses and expressions constructed in (1), (2), and (3).
Proposition 5.1.
The boolean formula φ constructed using Reduction 5.1 can be transformed into an equivalent CNF-formula whose size is polynomial in the size of I. The satisfying assignments to φ and the repairs of I w.r.t. Σ are in one-to-one correspondence.
Proof.
Let n be the number of facts of I. Let c be the smallest number such that there exists no denial constraint in Σ whose number of database atoms is bigger than c. Also, let m be the smallest number such that there exists no conjunctive query in q whose number of database atoms is bigger than m. Since Σ and q are not part of the input, the quantities c and m are fixed constants. We also have that |V| = O(n^c) and |NV(fi)| = O(n^{c−1}) for each fact fi. The number of x-variables, y-variables, and witness variables in φ is therefore bounded by n, n · O(n^{c−1}), and O(n^m), respectively. The formula φ contains as many clauses of type (1) as there are minimal violations in V, and the length of each such clause does not exceed c. Similarly, there are at most n clauses of type (2), and the length of each of them does not exceed 1 + O(n^{c−1}). The number of expressions of type (3) is precisely the number of y-variables. Every expression of type (3) is of the form y ↔ (xj1 ∧ ⋯ ∧ xjd), where d is the number of facts in the corresponding near-violation, hence d ≤ c − 1. Each such expression can be equivalently written as a constant number of CNF-clauses of constant length. Thus, one can transform φ into an equivalent CNF-formula with size polynomial in the size of I.
For the second part of Proposition 5.1, consider a satisfying assignment τ to φ and construct a database instance J such that fi ∈ J if and only if τ(xi) = 1. The clauses of type (1) assert that no minimal violation to Σ is present in J, i.e., J is a consistent subinstance of I. The clauses of type (2) and the expressions of type (3) encode the condition that, for every fact fi of I, either fi ∈ J or at least one near-violation w.r.t. Σ and fi is contained in J, making sure that J is indeed a repair of I. In the other direction, one can construct a satisfying assignment τ to φ from a repair J of I by setting τ(xi) = 1 if and only if fi ∈ J. ∎
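A brute-force Python sketch of Reduction 5.1 on a toy denial constraint (hypothetical data and variable numbering; the y-equivalences of step (3) are written out as CNF clauses):

```python
from itertools import combinations

# Facts of a relation R(A, B); variable i+1 represents facts[i] being kept.
facts = [("a", 1), ("a", 2), ("b", 1)]
n = len(facts)

def violates(idxs):
    # Toy denial constraint: two distinct facts must not agree on attribute A.
    return any(facts[i][0] == facts[j][0] for i, j in combinations(idxs, 2))

# Minimal violations: violating sets whose proper subsets are all consistent.
minimal = [set(s) for k in range(1, n + 1) for s in combinations(range(n), k)
           if violates(s) and not any(violates(t) for t in combinations(s, k - 1))]

clauses, next_var = [], n + 1
for v in minimal:                            # step (1): forbid each violation
    clauses.append([-(i + 1) for i in v])

for i in range(n):                           # steps (2) and (3): maximality
    y_vars = []
    for v in minimal:
        if i not in v:
            continue
        nv = sorted(v - {i})                 # a near-violation w.r.t. fact i
        y = next_var
        next_var += 1
        y_vars.append(y)
        for j in nv:
            clauses.append([-y, j + 1])      # y -> every fact of nv is kept
        clauses.append([y] + [-(j + 1) for j in nv])   # all of nv kept -> y
    # keep fact i, or justify its absence by a near-violation inside the repair
    clauses.append([i + 1] + y_vars)
print(clauses)
```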
6. Experimental Evaluation
We evaluate the performance of AggCAvSAT over both synthetic and real-world databases. The first set of experiments includes a comparison of AggCAvSAT with an existing SQL-rewriting-based CQA system, namely ConQuer, over synthetically generated TPC-H databases having one key constraint per relation. This set of experiments is divided into two parts, based on the method used to generate the inconsistent database instances. In the first part, we use the DBGen tool from TPC-H and artificially inject inconsistencies into the generated data; in the second part, we employ the PDBench inconsistent database generator from MayBMS (Antova et al., 2008) (see Section 6.1.1 for details). Next, we assess the scalability of AggCAvSAT by varying the size of the database and the amount of inconsistency present in it. Lastly, to evaluate the performance of the reductions from Section 5, we use a real-world Medigap (med, [n.d.]) dataset that has three functional dependencies and one denial constraint. All experiments were carried out on a machine with an Intel Core i7 2.7 GHz processor and 8 GB of RAM, running 64-bit Ubuntu 16.04. We used Microsoft SQL Server 2017 as the underlying DBMS, and the MaxHS v3.2 solver (Davies and Bacchus, 2011) for solving the WPMaxSAT instances. The AggCAvSAT system is implemented in Java 9.04 and its code is open-sourced at the GitHub repository https://github.com/uccross/cavsat under a BSD-style license. Various features of AggCAvSAT, including its graphical user interface, are presented in a short paper in the demonstration track of the 2021 SIGMOD conference (Dixit and Kolaitis, 2021).
6.1. Experiments with TPC-H Data and Queries
6.1.1. Datasets
For the first part of the experiments, the data is generated using the DBGen data generation tool from the TPC-H Consortium. The TPC-H schema comes with exactly one key constraint per relation, which is ideal for comparing AggCAvSAT against ConQuer (Fuxman et al., 2005a; Fuxman and Miller, 2007), the only existing system for computing the range consistent answers to aggregation queries, because ConQuer supports neither more than one key constraint per relation nor classes of integrity constraints broader than keys. The DBGen tool generates consistent data, so we artificially injected inconsistency by overwriting the key attributes of randomly selected tuples with values taken from existing tuples of the same relation (a sketch of this procedure follows Table 2). The sizes of the key-equal groups that violate the key constraints were uniformly distributed between two and seven. The database instances were generated in such a way that every repair had the specified size. We experimented with degrees of inconsistency ranging from 5% up to 35% of the tuples violating a key constraint, and with repair sizes from 500 MB (4.3 million tuples) up to 5 GB (44 million tuples). For the second part, we employed the PDBench database generator from MayBMS (Antova et al., 2008) to generate four inconsistent database instances with varying degrees of inconsistency (see Table 2). In all four instances, the data is generated in such a way that every repair has size 1 GB.
Table 2. Per-relation inconsistency, database and repair sizes, and largest key-equal groups of the four PDBench instances.

| Table | Inst. 1 | Inst. 2 | Inst. 3 | Inst. 4 |
|---|---|---|---|---|
| CUSTOMER | 4.42% | 8.5% | 16.14% | 29.49% |
| LINEITEM | 6.36% | 12.09% | 22.53% | 39.82% |
| NATION | 7.69% | 0% | 7.69% | 7.69% |
| ORDERS | 3.51% | 6.77% | 12.87% | 23.9% |
| PART | 4.93% | 9.33% | 17.66% | 32.16% |
| PARTSUPP | 1.53% | 2.96% | 5.77% | 11.13% |
| REGION | 0% | 0% | 0% | 0% |
| SUPPLIER | 3.69% | 7.44% | 14.11% | 26.51% |
| Overall inconsistency | 5.36% | 10.25% | 19.29% | 34.72% |
| Database size (GB) | 1.04 | 1.07 | 1.14 | 1.3 |
| Repair size (GB) | 1.00 | 1.01 | 1.02 | 1.02 |
| Largest key-equal group | 8 tuples | 16 tuples | 16 tuples | 32 tuples |
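The inconsistency injection described in Section 6.1.1 can be pictured with the following sketch (our simplification, with generic column names, operating on in-memory rows rather than via SQL UPDATE statements): the key attributes of a random sample of rows are overwritten with the key values of other rows of the same relation, creating key-equal groups.

```python
import random

def inject_key_violations(rows, key_cols, fraction, rng=random.Random(0)):
    """Overwrite the key attributes of a random sample of rows with key
    values taken from other rows of the same relation, so that the
    affected rows join existing key-equal groups.

    rows: list of dicts; key_cols: names of the key attributes;
    fraction: target share of key-violating tuples.
    """
    n = len(rows)
    victims = rng.sample(range(n), max(1, int(fraction * n)))
    for i in victims:
        donor = rng.randrange(n)
        while donor == i:                       # copy from a different tuple
            donor = rng.randrange(n)
        for col in key_cols:
            rows[i][col] = rows[donor][col]     # duplicate an existing key
    return rows

table = [{"ACCID": k, "BAL": 100 * k} for k in range(1, 11)]
inject_key_violations(table, key_cols=["ACCID"], fraction=0.2)
```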
6.1.2. Queries
The standard TPC-H specification comes with 22 queries (constructed using the QGen tool). Here, we focus on queries 1, 3, 4, 5, 6, 10, 12, 14, and 19; the other 13 queries have features, such as nested subqueries, left outer joins, and negation, that are beyond the aggregation queries defined in Section 3. In Section 6.1.3, we describe our results for queries without grouping. Since six of the nine queries under consideration contained the GROUP BY clause, we removed it and added appropriate conditions to the WHERE clause based on the original grouping attributes, obtaining queries without grouping. For example, TPC-H query 4 groups by ORDERS.O_ORDERPRIORITY (see Table 5); removing the GROUP BY clause and adding the condition ORDERS.O_ORDERPRIORITY = ‘1-URGENT’ to the WHERE clause yields the scalar query $Q'_4$. We refer to the resulting queries as $Q'_1$, $Q'_3$, $Q'_4$, $Q'_5$, $Q'_6$, $Q'_{10}$, $Q'_{12}$, $Q'_{14}$, and $Q'_{19}$. The definitions of these queries are given in Table 3.
Table 3. TPC-H-inspired aggregation queries without grouping.

| # | Query |
|---|---|
| $Q'_1$ | SELECT SUM(LINEITEM.L_QUANTITY) FROM LINEITEM WHERE LINEITEM.L_SHIPDATE <= dateadd(dd, -90, cast(‘1998-12-01’ as datetime)) AND LINEITEM.L_RETURNFLAG = ‘N’ AND LINEITEM.L_LINESTATUS = ‘F’ |
| $Q'_3$ | SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM CUSTOMER, ORDERS, LINEITEM WHERE CUSTOMER.C_MKTSEGMENT = ‘BUILDING’ AND CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND ORDERS.O_ORDERDATE < ‘1995-03-15’ AND LINEITEM.L_SHIPDATE > ‘1995-03-15’ AND LINEITEM.L_ORDERKEY = 988226 AND ORDERS.O_ORDERDATE = ‘1995-02-01’ AND ORDERS.O_SHIPPRIORITY = 0 |
| $Q'_4$ | SELECT COUNT(*) FROM ORDERS WHERE ORDERS.O_ORDERDATE >= ‘1993-07-01’ AND ORDERS.O_ORDERDATE < dateadd(mm, 3, cast(‘1993-07-01’ as datetime)) AND ORDERS.O_ORDERPRIORITY = ‘1-URGENT’ |
| $Q'_5$ | SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND LINEITEM.L_SUPPKEY = SUPPLIER.S_SUPPKEY AND CUSTOMER.C_NATIONKEY = SUPPLIER.S_NATIONKEY AND SUPPLIER.S_NATIONKEY = NATION.N_NATIONKEY AND NATION.N_REGIONKEY = REGION.R_REGIONKEY AND REGION.R_NAME = ‘ASIA’ AND ORDERS.O_ORDERDATE >= ‘1994-01-01’ AND ORDERS.O_ORDERDATE < DATEADD(YY, 1, cast(‘1994-01-01’ as datetime)) AND NATION.N_NAME = ‘INDIA’ |
| $Q'_6$ | SELECT SUM(LINEITEM.L_EXTENDEDPRICE*LINEITEM.L_DISCOUNT) FROM LINEITEM WHERE LINEITEM.L_SHIPDATE >= ‘1994-01-01’ AND LINEITEM.L_SHIPDATE < dateadd(yy, 1, cast(‘1994-01-01’ as datetime)) AND LINEITEM.L_DISCOUNT BETWEEN .06 - 0.01 AND .06 + 0.01 AND LINEITEM.L_QUANTITY < 24 |
| $Q'_{10}$ | SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM CUSTOMER, ORDERS, LINEITEM, NATION WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND ORDERS.O_ORDERDATE >= ‘1993-10-01’ AND ORDERS.O_ORDERDATE < dateadd(mm, 3, cast(‘1993-10-01’ as datetime)) AND LINEITEM.L_RETURNFLAG = ‘R’ AND CUSTOMER.C_NATIONKEY = NATION.N_NATIONKEY AND CUSTOMER.C_CUSTKEY = 77296 AND CUSTOMER.C_NAME = ‘Customer#000077296’ AND CUSTOMER.C_ACCTBAL = 1250.65 AND CUSTOMER.C_PHONE = ‘12-248-307-9719’ AND NATION.N_NAME = ‘BRAZIL’ |
| $Q'_{12}$ | SELECT COUNT(*) FROM ORDERS, LINEITEM WHERE ORDERS.O_ORDERKEY = LINEITEM.L_ORDERKEY AND LINEITEM.L_SHIPMODE = ‘MAIL’ AND (ORDERS.O_ORDERPRIORITY = ‘1-URGENT’ OR ORDERS.O_ORDERPRIORITY = ‘2-HIGH’) AND LINEITEM.L_COMMITDATE < LINEITEM.L_RECEIPTDATE AND LINEITEM.L_SHIPDATE < LINEITEM.L_COMMITDATE AND LINEITEM.L_RECEIPTDATE >= ‘1994-01-01’ AND LINEITEM.L_RECEIPTDATE < dateadd(mm, 1, cast(‘1995-09-01’ as datetime)) |
| $Q'_{14}$ | SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM LINEITEM, PART WHERE LINEITEM.L_PARTKEY = PART.P_PARTKEY AND LINEITEM.L_SHIPDATE >= ‘1995-09-01’ AND LINEITEM.L_SHIPDATE < dateadd(mm, 1, ‘1995-09-01’) AND PART.P_TYPE LIKE ‘PROMO%’ |
| $Q'_{19}$ | SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM LINEITEM, PART WHERE (PART.P_PARTKEY = LINEITEM.L_PARTKEY AND PART.P_BRAND = ‘Brand#12’ AND PART.P_CONTAINER IN (‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’) AND LINEITEM.L_QUANTITY >= 1 AND LINEITEM.L_QUANTITY <= 1 + 10 AND PART.P_SIZE BETWEEN 1 AND 5 AND LINEITEM.L_SHIPMODE IN (‘AIR’, ‘AIR REG’) AND LINEITEM.L_SHIPINSTRUCT = ‘DELIVER IN PERSON’) OR (PART.P_PARTKEY = LINEITEM.L_PARTKEY AND PART.P_BRAND = ‘Brand#23’ AND PART.P_CONTAINER IN (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’) AND LINEITEM.L_QUANTITY >= 10 AND LINEITEM.L_QUANTITY <= 10 + 10 AND PART.P_SIZE BETWEEN 1 AND 10 AND LINEITEM.L_SHIPMODE IN (‘AIR’, ‘AIR REG’) AND LINEITEM.L_SHIPINSTRUCT = ‘DELIVER IN PERSON’) OR (PART.P_PARTKEY = LINEITEM.L_PARTKEY AND PART.P_BRAND = ‘Brand#34’ AND PART.P_CONTAINER IN (‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’) AND LINEITEM.L_QUANTITY >= 20 AND LINEITEM.L_QUANTITY <= 20 + 10 AND PART.P_SIZE BETWEEN 1 AND 15 AND LINEITEM.L_SHIPMODE IN (‘AIR’, ‘AIR REG’) AND LINEITEM.L_SHIPINSTRUCT = ‘DELIVER IN PERSON’) |
6.1.3. Results on Queries without Grouping
In the first set of experiments, we computed the range consistent answers of the TPC-H-inspired aggregation queries without grouping via WPMaxSAT solving, over a database instance with 10% inconsistency whose repairs have size 1 GB (8 million tuples). Figure 1 shows that most of AggCAvSAT's evaluation time is spent encoding the CQA instance into a WPMaxSAT instance, while the solver takes comparatively little time to compute the optimal solution. Note that $Q'_{19}$ is not in the class $C_{forest}$, so ConQuer cannot compute its range consistent answers. AggCAvSAT performs better than ConQuer on seven of the remaining eight queries.
Next, we compared the performance of AggCAvSAT and ConQuer on the database instances generated using PDBench. Figure 2 shows that AggCAvSAT performs better than ConQuer on PDBench instances with low inconsistency. As the inconsistency increases, the WPMaxSAT solver requires considerably longer to compute the optimal solutions, especially for the queries that produce the largest CNF formulas. One reason is that the key-equal groups in PDBench instances with higher inconsistency percentages are large, which translates into long clauses in the WPMaxSAT instances, so the solver must work harder to solve them. Moreover, Kügel's reduction (Kügel, 2012) from WPMinSAT to WPMaxSAT significantly increases the size of the CNF formula, resulting in longer times for the lub-answers to the queries.
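To illustrate where this size increase comes from, the following is a sketch of a MinSAT-to-MaxSAT transformation in the spirit of the natural encoding (our rendering, not necessarily Kügel's exact construction): each soft clause receives a fresh auxiliary variable and one hard clause per literal, so the formula grows by that many clauses.

```python
def minsat_to_maxsat(n_vars, hard, soft):
    """Turn a weighted partial MinSAT instance into a WPMaxSAT instance.

    Each soft clause C gets a fresh variable b with hard clauses
    (l -> b) for every literal l of C, plus the soft unit clause (-b).
    At an optimum, b is true exactly when C is satisfied, so minimizing
    the weight of satisfied soft clauses equals the total soft weight
    minus the MaxSAT optimum.  Note the blow-up: one new variable and
    |C| extra hard clauses per soft clause.
    """
    new_hard, new_soft, next_var = list(hard), [], n_vars
    for w, clause in soft:
        next_var += 1
        b = next_var
        for lit in clause:
            new_hard.append([-lit, b])          # lit true forces b true
        new_soft.append((w, [-b]))              # reward b being false
    return next_var, new_hard, new_soft
```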
Table 4(a). Average number of variables (in thousands) vs. inconsistency, for the three queries that exhibited the largest CNF formulas (repair size 1 GB).

| 5% | 15% | 25% | 35% |
|---|---|---|---|
| 10.2 | 34.3 | 60.6 | 95.3 |
| 28.4 | 96.2 | 175.0 | 271.2 |
| 6.4 | 21.1 | 40.6 | 62.3 |

Table 4(b). Average number of clauses (in thousands) vs. inconsistency, for the same three queries.

| 5% | 15% | 25% | 35% |
|---|---|---|---|
| 27.6 | 92.2 | 163.6 | 258.1 |
| 76.8 | 259.9 | 472.9 | 734.0 |
| 15.6 | 51.9 | 101.0 | 156.6 |

Table 4(c). Average number of variables (in thousands) vs. database size (10% inconsistency).

| 1 GB | 3 GB | 5 GB |
|---|---|---|
| 21.3 | 44.1 | 105.6 |
| 60.9 | 127.13 | 304.4 |
| 13.9 | 32.9 | 67.7 |

Table 4(d). Average number of clauses (in thousands) vs. database size (10% inconsistency).

| 1 GB | 3 GB | 5 GB |
|---|---|---|
| 57.7 | 104.1 | 258.8 |
| 165.3 | 300.7 | 823.1 |
| 34.0 | 73.7 | 166.6 |
Next, we varied the inconsistency in the database instances created using the DBGen-based data generator while keeping the size of the database repairs constant at 1 GB. Figure 3 shows that the evaluation time of AggCAvSAT stays well under ten seconds (except for $Q'_1$), even as the inconsistency in the data grows. Tables 4(a) and 4(b) show the average size of the CNF formulas for the three queries that exhibited the largest CNF formulas. The size of the formulas grows nearly linearly with the inconsistency present in the data. The CNF formulas for $Q'_1$ are significantly larger than those of the other queries, since $Q'_1$ has high selectivity and is posed against the single largest relation, LINEITEM, which has over 8.2 million tuples in an instance with 35% inconsistency. This also explains why AggCAvSAT takes more time to compute its range consistent answers (Figure 3). In database instances with low inconsistency, the consistent answers to the queries having low selectivity (e.g., $Q'_3$ and $Q'_{10}$) are sometimes contained in the consistent part of the data, and AggCAvSAT does not need to construct a WPMaxSAT instance at all.
We then evaluated AggCAvSAT's scalability by increasing the sizes of the databases while keeping the inconsistency at a constant 10%. Figure 4 shows that the evaluation time of AggCAvSAT for queries $Q'_1$, $Q'_6$, and $Q'_{12}$ increases faster than that for the other queries. This is because $Q'_1$ and $Q'_6$ are posed against LINEITEM, while $Q'_{12}$ involves a join between LINEITEM and ORDERS, so AggCAvSAT spends more time computing the bags of witnesses to these queries as the size of the database grows. Tables 4(c) and 4(d) show that the size of the CNF formulas grows almost linearly w.r.t. the size of the database. The largest CNF formula consisted of over 304 thousand variables and 823 thousand clauses, and was exhibited by $Q'_1$ on a database of size 5 GB (47 million tuples). The low selectivity of queries such as $Q'_3$ and $Q'_{10}$ resulted in very small CNF formulas, even on large databases.
6.1.4. Results on Queries with Grouping
In this set of experiments, we focus on TPC-H queries 1, 3, 4, 5, 10, and 12 (see Table 5), as queries 6, 14, and 19 do not contain grouping. We evaluated the performance of AggCAvSAT and compared it to ConQuer on a database with 10% inconsistency w.r.t. primary keys and repairs of size 1 GB (Figure 5). AggCAvSAT first computes the consistent answers to the underlying conjunctive query using the reductions from (Dixit and Kolaitis, 2019); these are precisely the consistent groups in the range consistent answers to the aggregation query. For each such group, it then computes the glb-answer and the lub-answer using reductions to WPMaxSAT.
Table 5. TPC-H aggregation queries with grouping and their aggregation operators.

| # | Query | Operator |
|---|---|---|
| $Q_1$ | SELECT LINEITEM.L_RETURNFLAG, LINEITEM.L_LINESTATUS, SUM(LINEITEM.L_QUANTITY) FROM LINEITEM WHERE LINEITEM.L_SHIPDATE <= dateadd(dd, -90, cast(‘1998-12-01’ as datetime)) GROUP BY LINEITEM.L_RETURNFLAG, LINEITEM.L_LINESTATUS | SUM(A) |
| $Q_3$ | SELECT TOP 10 LINEITEM.L_ORDERKEY, SUM(LINEITEM.L_EXTENDEDPRICE), ORDERS.O_ORDERDATE, ORDERS.O_SHIPPRIORITY FROM CUSTOMER, ORDERS, LINEITEM WHERE CUSTOMER.C_MKTSEGMENT = ‘BUILDING’ AND CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND ORDERS.O_ORDERDATE < ‘1995-03-15’ AND LINEITEM.L_SHIPDATE > ‘1995-03-15’ GROUP BY LINEITEM.L_ORDERKEY, ORDERS.O_ORDERDATE, ORDERS.O_SHIPPRIORITY | SUM(A) |
| $Q_4$ | SELECT ORDERS.O_ORDERPRIORITY, COUNT(*) AS O_COUNT FROM ORDERS WHERE ORDERS.O_ORDERDATE >= ‘1993-07-01’ AND ORDERS.O_ORDERDATE < dateadd(mm, 3, cast(‘1993-07-01’ as datetime)) GROUP BY ORDERS.O_ORDERPRIORITY | COUNT(*) |
| $Q_5$ | SELECT NATION.N_NAME, SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) AS REVENUE FROM CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND LINEITEM.L_SUPPKEY = SUPPLIER.S_SUPPKEY AND CUSTOMER.C_NATIONKEY = SUPPLIER.S_NATIONKEY AND SUPPLIER.S_NATIONKEY = NATION.N_NATIONKEY AND NATION.N_REGIONKEY = REGION.R_REGIONKEY AND REGION.R_NAME = ‘ASIA’ AND ORDERS.O_ORDERDATE >= ‘1994-01-01’ AND ORDERS.O_ORDERDATE < DATEADD(YY, 1, cast(‘1994-01-01’ as datetime)) GROUP BY NATION.N_NAME | SUM(A) |
| $Q_{10}$ | SELECT TOP 20 CUSTOMER.C_CUSTKEY, CUSTOMER.C_NAME, SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) AS REVENUE, CUSTOMER.C_ACCTBAL, NATION.N_NAME, CUSTOMER.C_ADDRESS, CUSTOMER.C_PHONE FROM CUSTOMER, ORDERS, LINEITEM, NATION WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND ORDERS.O_ORDERDATE >= ‘1993-10-01’ AND ORDERS.O_ORDERDATE < dateadd(mm, 3, cast(‘1993-10-01’ as datetime)) AND LINEITEM.L_RETURNFLAG = ‘R’ AND CUSTOMER.C_NATIONKEY = NATION.N_NATIONKEY GROUP BY CUSTOMER.C_CUSTKEY, CUSTOMER.C_NAME, CUSTOMER.C_ACCTBAL, CUSTOMER.C_PHONE, NATION.N_NAME, CUSTOMER.C_ADDRESS | SUM(A) |
| $Q_{12}$ | SELECT LINEITEM.L_SHIPMODE, COUNT(*) AS HIGH_LINE_COUNT FROM ORDERS, LINEITEM WHERE ORDERS.O_ORDERKEY = LINEITEM.L_ORDERKEY AND LINEITEM.L_SHIPMODE IN (‘MAIL’, ‘SHIP’) AND (ORDERS.O_ORDERPRIORITY = ‘1-URGENT’ OR ORDERS.O_ORDERPRIORITY = ‘2-HIGH’) AND LINEITEM.L_COMMITDATE < LINEITEM.L_RECEIPTDATE AND LINEITEM.L_SHIPDATE < LINEITEM.L_COMMITDATE AND LINEITEM.L_RECEIPTDATE >= ‘1994-01-01’ AND LINEITEM.L_RECEIPTDATE < dateadd(mm, 1, cast(‘1995-09-01’ as datetime)) GROUP BY LINEITEM.L_SHIPMODE | COUNT(*) |
The overhead of computing the range consistent answers to aggregation queries with grouping is higher than that for aggregation queries without grouping, because, for an aggregation query with grouping, AggCAvSAT needs to construct and solve twice as many WPMaxSAT instances as there are consistent groups, i.e., one for the lub-answer and one for the glb-answer per consistent group (see the sketch below). For queries that involve the SELECT TOP construct of SQL, we chose the top consistent groups ordered by the grouping attributes present in the ORDER BY clause of the query. AggCAvSAT computes the range consistent answers to each query in under ten seconds, except for $Q_1$. It took under three seconds to compute the consistent groups of $Q_1$, but over forty seconds to encode the range consistent answers of the groups and over fifteen minutes to solve the corresponding WPMaxSAT instances. This is because some consistent groups have over 3 million tuples, so the WPMaxSAT instances have over 600 thousand variables and over 1.3 million clauses. ConQuer took slightly over two minutes to compute the range consistent answers to $Q_1$. We did not include $Q_1$ in the experiments with larger databases and higher inconsistency.
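Schematically, the per-group computation looks as follows (our sketch, not AggCAvSAT's actual API; `encode_glb`, `encode_lub`, and `solve` are placeholders for the encodings of Section 5 and a call to a MaxSAT solver such as MaxHS):

```python
def range_consistent_answers(consistent_groups, encode_glb, encode_lub, solve):
    """Two WPMaxSAT instances per consistent group, one for the
    glb-answer and one for the lub-answer, so 2 * len(consistent_groups)
    instances in total."""
    answers = []
    for group in consistent_groups:
        glb = solve(encode_glb(group))   # greatest lower bound for the group
        lub = solve(encode_lub(group))   # least upper bound for the group
        answers.append((group, glb, lub))
    return answers
```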
Figure 6 shows the comparison of AggCAvSAT and ConQuer for aggregation queries with grouping on the PDBench instances. On the database instance with the lowest amount of inconsistency, AggCAvSAT beats ConQuer on all queries, but as the inconsistency grows, AggCAvSAT takes longer to encode and solve for the consistent groups of some of the queries.
In Figure 7, we first plot the evaluation time of AggCAvSAT as the percentage of inconsistency in the data grows from 5% to 35% in the instances generated using the DBGen-based data generator. The size of the database repairs is kept constant at 1 GB (8 million tuples).
Since AggCAvSAT constructs and solves many WPMaxSAT instances of varying sizes for an aggregation query with grouping, we also plot the overall number of SAT calls made by the solver in Figure 7; note that the Y-axis has logarithmic scaling in the second plot of Figure 7. There are ten consistent groups in the answers to $Q_3$, and just five and two consistent groups in the answers to $Q_4$ and $Q_{12}$, respectively. In each consistent group, the aggregation operator is applied over a much larger set of tuples in $Q_4$ and $Q_{12}$ than in $Q_3$. As a result, the evaluation time for $Q_{12}$ is high but its number of SAT calls is comparatively small, while AggCAvSAT makes more SAT calls for $Q_3$ and $Q_4$, even though their consistent answers are computed much faster. The query $Q_{10}$ requires a long time to construct and solve the WPMaxSAT instances for its consistent groups, due to its high selectivity and the presence of joins between four relations. The evaluation time for computing the range consistent answers to aggregation queries with grouping increases almost linearly w.r.t. the size of the database when the percentage of inconsistency is constant (Figure 8). The second plot in Figure 8 depicts the number of SAT calls made by the solver as the size of the database grows. Due to their low selectivity, some of the queries are encoded into small CNF formulas even on databases with high inconsistency or of large size, resulting in fast evaluations.
6.1.5. Discussion
The experiments show that AggCAvSAT performs well across a broad range of queries and databases; it performs worse on queries with high selectivity because, in such cases, very large CNF formulas are generated. AggCAvSAT also slows down on databases with a high degree of inconsistency and with large key-equal groups. These are corner cases that should rarely be encountered in real-world databases.
6.2. Experiments with Real-world Data
6.2.1. Dataset
For this set of experiments, we use the schema and the data from Medigap (med, [n.d.]), an openly available real-world database about Medicare supplement insurance in the United States. We combined the data from 2019 and 2020 to obtain a database with over 61K tuples (Table 6(a)). Since we consider two functional dependencies and one denial constraint on the Medigap schema, as shown in Table 6(b), this set of experiments exercises Reduction 5.1. The actual data was already inconsistent, so no additional inconsistency was injected (a sketch for detecting such violations follows Table 6).
Table 6(a). Relations of the Medigap database.

| Relation | Acronym | # of attributes | # of tuples |
|---|---|---|---|
| OrgsByState | OBS | 5 | 3872 |
| PlansByState | PBS | 18 | 21002 |
| PlansByZip | PBZ | 20 | 4748 |
| PlanType | PT | 4 | 2434 |
| Premiums | PR | 7 | 29148 |
| SimplePlanType | SPT | 4 | 70 |
Table 6(b). Integrity constraints on the Medigap schema.

| Type | Constraint Definition | Inconsistency |
|---|---|---|
| FD | OBS (orgID → orgName) | 2.58% |
| FD | PBS (addr, city, abbrev → zip) | 1.5% |
| DC | OBS (webAddr ≠ ‘’) | 0.15% |
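As a side note, inconsistency percentages such as those in Table 6(b) can be reproduced by flagging the tuples that participate in FD violations. The sketch below uses SQLite and toy data; the column names follow Table 6(b) and are assumptions about the actual Medigap layout.

```python
import sqlite3

# Count tuples violating the FD OBS(orgID -> orgName) by flagging orgIDs
# that are mapped to more than one orgName.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OBS (orgID INT, orgName TEXT, webAddr TEXT)")
conn.executemany("INSERT INTO OBS VALUES (?, ?, ?)",
                 [(1, "Acme", "acme.example"),
                  (1, "ACME Inc", "acme.example"),   # conflicts with row 1
                  (2, "Beta", "")])
violations = conn.execute("""
    SELECT COUNT(*) FROM OBS
    WHERE orgID IN (SELECT orgID FROM OBS
                    GROUP BY orgID
                    HAVING COUNT(DISTINCT orgName) > 1)
""").fetchone()[0]
print(violations)   # 2 tuples participate in an FD violation
```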
6.2.2. Queries
We use twelve natural aggregation queries on the Medigap database that involve the aggregation operators COUNT(*), COUNT(), and SUM(). We refer to these queries as $Q_1, \ldots, Q_{12}$, numbered as in Table 7. The first six queries contain no grouping, while the remaining six do. The definitions of these queries are given in Table 7.
Table 7. Aggregation queries on the Medigap database.

| # | Query |
|---|---|
| $Q_1$ | SELECT COUNT(*) FROM OBS WHERE OBS.Name = ‘Continental General Insurance Company’ |
| $Q_2$ | SELECT COUNT(*) FROM PBZ, SPT WHERE PBZ.Description = SPT.Simple_plantype_name AND SPT.Contract_year = 2020 AND SPT.Simple_plantype = ‘B’ |
| $Q_3$ | SELECT SUM(PBZ.Over65) FROM PBZ WHERE PBZ.State_name = ‘Wisconsin’ AND PBZ.County_name = ‘GREEN LAKE’ |
| $Q_4$ | SELECT SUM(PBZ.Community) FROM PBZ WHERE PBZ.State_name = ‘New York’ |
| $Q_5$ | SELECT COUNT(PR.Premium_range) FROM PR |
| $Q_6$ | SELECT COUNT(PR.Premium_range) FROM PT, PR WHERE PT.State_abbrev = PR.State_abbrev AND PT.Plan_type = PR.Plan_type AND PT.Contract_year = PR.Contract_year AND PT.Contract_year = 2020 AND PT.Simple_plantype = ‘K’ |
| $Q_7$ | SELECT SPT.Contract_year, COUNT(*) FROM SPT GROUP BY SPT.Contract_year ORDER BY SPT.Contract_year DESC |
| $Q_8$ | SELECT PBZ.State_name, COUNT(*) FROM PBZ GROUP BY PBZ.State_name |
| $Q_9$ | SELECT PBZ.Zip, SUM(PBZ.Community) FROM PBZ WHERE PBZ.State_name = ‘New York’ GROUP BY PBZ.Zip |
| $Q_{10}$ | SELECT TOP 10 PBS.State_name, SPT.Contract_year, SUM(PBS.Under65) FROM PBS, SPT WHERE SPT.Simple_plantype_name = PBS.Description AND SPT.Simple_plantype = ‘A’ AND SPT.Language_id = 1 GROUP BY PBS.State_name, SPT.Contract_year ORDER BY PBS.State_name |
| $Q_{11}$ | SELECT PR.Age_category, COUNT(PR.Premium_range) FROM PR GROUP BY PR.Age_category ORDER BY PR.Age_category |
| $Q_{12}$ | SELECT TOP 10 PT.Simple_plantype, COUNT(PR.Premium_range) FROM PT, PR WHERE PT.State_abbrev = PR.State_abbrev AND PT.Plan_type = PR.Plan_type AND PT.Contract_year = PR.Contract_year AND PT.Contract_year = 2020 GROUP BY PT.Simple_plantype ORDER BY PT.Simple_plantype |
6.2.3. Results on Real-world Database
In Figure 9, we plot the overall time taken by AggCAvSAT to compute the range consistent answers to the twelve aggregation queries on the Medigap database. Since the Medigap schema has functional dependencies and a denial constraint, the encoding of CQA into WPMaxSAT instances is based on Reduction 5.1. Consequently, the CNF formulas are much larger than the ones produced by Reduction 4.1, resulting in longer encoding times. For all twelve queries, the encoding time is dominated by the time required to compute the near-violations and, hence, the $\theta$-expressions. This part of the encoding time is the same for all queries, whereas the time to compute the witnesses depends on the query. The solver takes a comparatively minuscule amount of time to compute the consistent answers to the underlying conjunctive query. For the queries with grouping ($Q_7$ through $Q_{12}$), the glb-answer and the lub-answer are encoded and then solved for each consistent group, causing higher overhead. The longest evaluation times are taken by queries $Q_{10}$, $Q_{12}$, and $Q_{11}$, since they consist of 10, 10, and 6 consistent groups, respectively.
For these experiments, we did not first compute the range consistent answers from the consistent part of the data. Thus, for all CNF formulas, the number of variables is equal to the number of tuples in the data (about 61K). The number of clauses, however, varies with the query, as shown in Figure 10. The query $Q_5$ has the highest number of clauses, since all tuples of the Premiums relation are minimal witnesses to its underlying conjunctive query.
The clauses arising from the inconsistency in the data can be constructed independently of the clauses arising from the witnesses to the queries. In the near future, we plan to parallelize their computation to improve AggCAvSAT's performance.
7. Concluding Remarks
First, we showed that computing the range consistent answers to an aggregation query involving the SUM(A) operator can be NP-hard, even if the consistent answers to the underlying conjunctive query are SQL-rewritable. We then designed, implemented, and evaluated AggCAvSAT, a SAT-based system for computing the range consistent answers to aggregation queries involving COUNT(A), COUNT(*), SUM(A), and grouping. It is the first system able to handle aggregation queries whose range consistent answers are not SQL-rewritable. Our experimental evaluation showed that AggCAvSAT is not only competitive with systems such as ConQuer, but also scalable. The experiments on the Medigap data showed that AggCAvSAT can handle real-world databases with integrity constraints beyond primary keys. The next step in this investigation is to delineate the complexity of the range consistent answers to aggregation queries with the AVG() operator, and then to enhance AggCAvSAT so that it can compute the range consistent answers of such queries. Finally, we note that the SAT-based methods used here apply to broader classes of SQL queries, such as queries with nested subqueries, as long as the integrity constraints are denial constraints. If broader classes of constraints are considered, such as universal constraints, then computing the consistent answers of even conjunctive queries becomes $\Pi^p_2$-hard (Arming et al., 2016), hence SAT-based methods are not applicable; in that case, Answer Set Programming solvers, such as DLV (Leone et al., 2006) or Potassco (Gebser et al., 2011), have to be used instead of SAT solvers.
Acknowledgments. Dixit was supported by a Baskin School of Engineering Dissertation-Year Fellowship and by the Center for Research in Open Source Software (CROSS) at UC Santa Cruz. Kolaitis was partially supported by NSF Award IIS-1814152.
References
- cad ([n.d.]) [n.d.]. CaDiCaL Simplified Satisfiability Solver. http://fmv.jku.at/cadical/.
- med ([n.d.]) [n.d.]. Medigap database for Medicare health and drug plans. https://www.medicare.gov/download/downloaddb.asp.
- Afrati and Kolaitis (2008) Foto N. Afrati and Phokion G. Kolaitis. 2008. Answering aggregate queries in data exchange. In Proc. of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, June 9-11, 2008, Vancouver, BC, Canada. ACM, 129–138. https://doi.org/10.1145/1376916.1376936
- Amsterdamer et al. (2011) Yael Amsterdamer, Daniel Deutch, and Val Tannen. 2011. Provenance for aggregate queries. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, Maurizio Lenzerini and Thomas Schwentick (Eds.). ACM, 153–164. https://doi.org/10.1145/1989284.1989302
- Antova et al. (2008) Lyublena Antova, Thomas Jansen, Christoph Koch, and Dan Olteanu. 2008. Fast and Simple Relational Processing of Uncertain Data. In 2008 IEEE 24th International Conference on Data Engineering. 983–992. https://doi.org/10.1109/ICDE.2008.4497507
- Arenas et al. (1999) Marcelo Arenas, Leopoldo Bertossi, and Jan Chomicki. 1999. Consistent Query Answers in Inconsistent Databases. In Proc. of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Philadelphia, Pennsylvania, USA) (PODS ’99). ACM, New York, NY, USA, 68–79. https://doi.org/10.1145/303976.303983
- Arenas et al. (2003b) Marcelo Arenas, Leopoldo Bertossi, Jan Chomicki, Xin He, Vijay Raghavan, and Jeremy Spinrad. 2003b. Scalar aggregation in inconsistent databases. Theoretical Computer Science 296, 3 (2003), 405–434. https://doi.org/10.1016/S0304-3975(02)00737-5 Database Theory.
- Arenas et al. (2003a) Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. 2003a. Answer sets for consistent query answering in inconsistent databases. TPLP 3, 4-5 (2003), 393–424. https://doi.org/10.1017/S1471068403001832
- Arming et al. (2016) Sebastian Arming, Reinhard Pichler, and Emanuel Sallinger. 2016. Complexity of Repair Checking and Consistent Query Answering. In 19th International Conference on Database Theory, ICDT 2016, Bordeaux, France, March 15-18, 2016 (LIPIcs), Wim Martens and Thomas Zeume (Eds.), Vol. 48. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 21:1–21:18. https://doi.org/10.4230/LIPIcs.ICDT.2016.21
- Audemard and Simon (2009) Gilles Audemard and Laurent Simon. 2009. Predicting Learnt Clauses Quality in Modern SAT Solvers. In Proc. of the 21st International Jont Conference on Artifical Intelligence (Pasadena, California, USA) (IJCAI’09). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 399–404.
- Barceló and Bertossi (2003) Pablo Barceló and Leopoldo E. Bertossi. 2003. Logic Programs for Querying Inconsistent Databases. In Practical Aspects of Declarative Languages, 5th International Symposium, PADL 2003, New Orleans, LA, USA, January 13-14, 2003, Proc. 208–222. https://doi.org/10.1007/3-540-36388-2_15
- Bertossi (2011) Leopoldo E. Bertossi. 2011. Database Repairing and Consistent Query Answering. Morgan & Claypool Publishers. https://doi.org/10.2200/S00379ED1V01Y201108DTM020
- Biere et al. (2009) A. Biere, M. Heule, H. van Maaren, and T. Walsh. 2009. Handbook of Satisfiability: Volume 185, Frontiers in Artificial Intelligence and Applications. IOS Press, Amsterdam, The Netherlands.
- Chomicki et al. (2004a) Jan Chomicki, Jerzy Marcinkowski, and Slawomir Staworko. 2004a. Computing Consistent Query Answers Using Conflict Hypergraphs. In Proc. of the Thirteenth ACM International Conference on Information and Knowledge Management (Washington, D.C., USA) (CIKM ’04). ACM, New York, NY, USA, 417–426. https://doi.org/10.1145/1031171.1031254
- Chomicki et al. (2004b) Jan Chomicki, Jerzy Marcinkowski, and Slawomir Staworko. 2004b. Hippo: A System for Computing Consistent Answers to a Class of SQL Queries. In Advances in Database Technology - EDBT 2004. Springer Berlin Heidelberg, Berlin, Heidelberg, 841–844.
- Codd (1970) E. F. Codd. 1970. A Relational Model of Data for Large Shared Data Banks. Commun. ACM 13, 6 (1970), 377–387. https://doi.org/10.1145/362384.362685
- Codd (1972) E. F. Codd. 1972. Relational Completeness of Data Base Sublanguages. Research Report / RJ / IBM / San Jose, California RJ987 (1972).
- Davies and Bacchus (2011) Jessica Davies and Fahiem Bacchus. 2011. Solving MAXSAT by Solving a Sequence of Simpler SAT Instances. In Principles and Practice of Constraint Programming – CP 2011. Springer Berlin Heidelberg, Berlin, Heidelberg, 225–239.
- Dixit and Kolaitis (2021) Akhil Dixit and Phokion G. Kolaitis. 2021. CAvSAT: Answering Aggregation Queries over Inconsistent Databases via SAT Solving. In Proc. of the 2021 ACM International Conference on Management of Data (SIGMOD) - Demonstration Track. To appear.
- Dixit and Kolaitis (2019) Akhil A. Dixit and Phokion G. Kolaitis. 2019. A SAT-Based System for Consistent Query Answering. In Theory and Applications of Satisfiability Testing - SAT 2019 - 22nd International Conference, SAT 2019, Lisbon, Portugal, July 9-12, 2019, Proc. (Lecture Notes in Computer Science), Vol. 11628. Springer, 117–135. https://doi.org/10.1007/978-3-030-24258-9_8
- Fuxman (2007) Ariel Fuxman. 2007. Efficient Query Processing over Inconsistent Databases. Ph.D. Dissertation. Department of Computer Science, University of Toronto.
- Fuxman et al. (2005a) Ariel Fuxman, Elham Fazli, and Renée J. Miller. 2005a. ConQuer: Efficient Management of Inconsistent Databases. In Proc. of the 2005 ACM SIGMOD International Conference on Management of Data (Baltimore, Maryland) (SIGMOD ’05). ACM, New York, NY, USA, 155–166. https://doi.org/10.1145/1066157.1066176
- Fuxman et al. (2005b) Ariel Fuxman, Diego Fuxman, and Renée J. Miller. 2005b. ConQuer: A System for Efficient Querying over Inconsistent Databases. In Proc. of the 31st International Conference on Very Large Data Bases (Trondheim, Norway) (VLDB ’05). VLDB Endowment, 1354–1357. http://dl.acm.org/citation.cfm?id=1083592.1083774
- Fuxman and Miller (2007) Ariel Fuxman and Renée J. Miller. 2007. First-order Query Rewriting for Inconsistent Databases. J. Comput. Syst. Sci. 73, 4 (June 2007), 610–635. https://doi.org/10.1016/j.jcss.2006.10.013
- Gebser et al. (2011) Martin Gebser, Benjamin Kaufmann, Roland Kaminski, Max Ostrowski, Torsten Schaub, and Marius Schneider. 2011. Potassco: The Potsdam Answer Set Solving Collection. AI Commun. 24, 2 (2011), 107–124. https://doi.org/10.3233/AIC-2011-0491
- Giannakopoulou et al. (2020a) Stella Giannakopoulou, Manos Karpathiotakis, and Anastasia Ailamaki. 2020a. Cleaning Denial Constraint Violations through Relaxation. In Proc. of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020. ACM, 805–815. https://doi.org/10.1145/3318464.3389775
- Giannakopoulou et al. (2020b) Stella Giannakopoulou, Manos Karpathiotakis, and Anastasia Ailamaki. 2020b. Query-driven Repair of Functional Dependency Violations. In 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020. IEEE, 1886–1889. https://doi.org/10.1109/ICDE48307.2020.00195
- Greco et al. (2003) Gianluigi Greco, Sergio Greco, and Ester Zumpano. 2003. A Logical Framework for Querying and Repairing Inconsistent Databases. IEEE Trans. on Knowl. and Data Eng. 15, 6 (Nov. 2003), 1389–1408. https://doi.org/10.1109/TKDE.2003.1245280
- Green et al. (2007) Todd J. Green, Gregory Karvounarakis, and Val Tannen. 2007. Provenance semirings. In Proceedings of the Twenty-Sixth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 11-13, 2007, Beijing, China, Leonid Libkin (Ed.). ACM, 31–40. https://doi.org/10.1145/1265530.1265535
- Guagliardo and Libkin (2016) Paolo Guagliardo and Leonid Libkin. 2016. Making SQL Queries Correct on Incomplete Databases: A Feasibility Study. In Proc. of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016. ACM, 211–223. https://doi.org/10.1145/2902251.2902297
- Heule et al. (2016) Marijn J. H. Heule, Oliver Kullmann, and Victor W. Marek. 2016. Solving and Verifying the Boolean Pythagorean Triples Problem via Cube-and-Conquer. In Theory and Applications of Satisfiability Testing - SAT 2016 - 19th Intern. Conference, Bordeaux, France, July 5-8, 2016, Proc. (Lecture Notes in Computer Science), Vol. 9710. Springer, 228–245. https://doi.org/10.1007/978-3-319-40970-2_15
- Karp (1972) Richard M. Karp. 1972. Reducibility among Combinatorial Problems. Springer US, Boston, MA, 85–103. https://doi.org/10.1007/978-1-4684-2001-2_9
- Karvounarakis and Green (2012) Grigoris Karvounarakis and Todd J. Green. 2012. Semiring-annotated data: queries and provenance? SIGMOD Rec. 41, 3 (2012), 5–14. https://doi.org/10.1145/2380776.2380778
- Kolaitis and Pema (2012) Phokion G. Kolaitis and Enela Pema. 2012. A Dichotomy in the Complexity of Consistent Query Answering for Queries with Two Atoms. Inf. Process. Lett. 112, 3 (Jan. 2012), 77–85. https://doi.org/10.1016/j.ipl.2011.10.018
- Kolaitis et al. (2013) Phokion G. Kolaitis, Enela Pema, and Wang-Chiew Tan. 2013. Efficient Querying of Inconsistent Databases with Binary Integer Programming. PVLDB 6, 6 (2013), 397–408. https://doi.org/10.14778/2536336.2536341
- Kostylev and Reutter (2015) Egor V. Kostylev and Juan L. Reutter. 2015. Complexity of answering counting aggregate queries over DL-Lite. J. Web Semant. 33 (2015), 94–111. https://doi.org/10.1016/j.websem.2015.05.003
- Koutris and Wijsen (2016) Paraschos Koutris and Jef Wijsen. 2016. Consistent Query Answering for Primary Keys. SIGMOD Rec. 45, 1 (June 2016), 15–22. https://doi.org/10.1145/2949741.2949746
- Koutris and Wijsen (2017) Paraschos Koutris and Jef Wijsen. 2017. Consistent Query Answering for Self-Join-Free Conjunctive Queries Under Primary Key Constraints. ACM Trans. Database Syst. 42, 2, Article 9 (June 2017), 45 pages. https://doi.org/10.1145/3068334
- Kügel (2012) Adrian Kügel. 2012. Natural Max-SAT Encoding of Min-SAT. In Revised Selected Papers of the 6th International Conference on Learning and Intelligent Optimization - Volume 7219 (Paris, France) (LION 6). Springer-Verlag, Berlin, Heidelberg, 431–436.
- Leone et al. (2006) Nicola Leone, Gerald Pfeifer, Wolfgang Faber, Thomas Eiter, Georg Gottlob, Simona Perri, and Francesco Scarcello. 2006. The DLV system for knowledge representation and reasoning. ACM Trans. Comput. Log. 7, 3 (2006), 499–562. https://doi.org/10.1145/1149114.1149117
- Li et al. (2011) Chu-Min Li, Zhu Zhu, Felip Manyà, and Laurent Simon. 2011. Minimum Satisfiability and Its Applications. In Proc. of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume One (Barcelona, Catalonia, Spain) (IJCAI’11). AAAI Press, 605–610.
- Manna et al. (2011) Marco Manna, Francesco Ricca, and Giorgio Terracina. 2011. Consistent Query Answering via ASP from Different Perspectives: Theory and Practice. CoRR abs/1107.4570 (2011). arXiv:1107.4570 http://arxiv.org/abs/1107.4570
- Marileo and Bertossi (2010) Mónica Caniupán Marileo and Leopoldo E. Bertossi. 2010. The consistency extractor system: Answer set programs for consistent query answering in databases. Data Knowl. Eng. 69, 6 (2010), 545–572. https://doi.org/10.1016/j.datak.2010.01.005
- Oostema et al. (2020) Peter Oostema, Ruben Martins, and Marijn Heule. 2020. Coloring Unit-Distance Strips using SAT. In LPAR 2020: 23rd International Conference on Logic for Programming, Artificial Intelligence and Reasoning, Alicante, Spain, May 22-27, 2020 (EPiC Series in Computing), Vol. 73. EasyChair, 373–389. https://easychair.org/publications/paper/69T4
- Rekatsinas et al. (2017) Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic Data Repairs with Probabilistic Inference. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1190–1201. https://doi.org/10.14778/3137628.3137631
- Vardi (2009) Moshe Y. Vardi. 2009. Symbolic Techniques in Propositional Satisfiability Solving. In Theory and Applications of Satisfiability Testing - SAT 2009. Springer Berlin Heidelberg, Berlin, Heidelberg, 2–3.
- Wijsen (2009) Jef Wijsen. 2009. Consistent Query Answering Under Primary Keys: A Characterization of Tractable Queries. In Proc. of the 12th International Conference on Database Theory (St. Petersburg, Russia) (ICDT ’09). ACM, New York, NY, USA, 42–52. https://doi.org/10.1145/1514894.1514900
- Wijsen (2010a) Jef Wijsen. 2010a. On the First-order Expressibility of Computing Certain Answers to Conjunctive Queries over Uncertain Databases. In Proc. of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Indianapolis, Indiana, USA) (PODS ’10). ACM, New York, NY, USA, 179–190. https://doi.org/10.1145/1807085.1807111
- Wijsen (2010b) Jef Wijsen. 2010b. A remark on the complexity of consistent conjunctive query answering under primary key violations. Inform. Process. Lett. 110, 21 (2010), 950 – 955. https://doi.org/10.1016/j.ipl.2010.07.021
- Zhu et al. (2012) Zhu Zhu, Chu-Min Li, Felip Manyà, and Josep Argelich. 2012. A New Encoding from MinSAT into MaxSAT. In Principles and Practice of Constraint Programming. Springer Berlin Heidelberg, Berlin, Heidelberg, 455–463.