
Consistent Answers of Aggregation Queries using SAT Solvers

Akhil A. Dixit 0000-0003-2138-1319 University of California Santa Cruz akadixit@ucsc.edu  and  Phokion G. Kolaitis University of California Santa Cruz and IBM Research kolaitis@ucsc.edu
Abstract.

The framework of database repairs and consistent answers to queries is a principled approach to managing inconsistent databases. We describe the first system able to compute the consistent answers of general aggregation queries with the COUNT($A$), COUNT(*), SUM($A$), MIN($A$), and MAX($A$) operators, and with or without grouping constructs. Our system uses reductions to optimization versions of Boolean satisfiability (SAT) and then leverages powerful SAT solvers. We carry out an extensive set of experiments on both synthetic and real-world data that demonstrate the usefulness and scalability of this approach.

1. Introduction

The framework of database repairs and consistent query answering, introduced by Arenas, Bertossi, and Chomicki (Arenas et al., 1999), is a principled approach to managing inconsistent databases, i.e., databases that violate one or more integrity constraints on their schema. In this framework, inconsistencies are handled at query time by considering all possible repairs of the inconsistent database, where a repair of an inconsistent database $\mathcal{I}$ is a consistent database $\mathcal{J}$ that differs from $\mathcal{I}$ in a “minimal” way. The consistent answers to a query $q$ on a given database $\mathcal{I}$ are the intersection of the results of $q$ applied to each repair of $\mathcal{I}$. Thus, a consistent answer comes with the guarantee that it will be found no matter on which repair the query is evaluated. Computing the consistent answers can be an intractable problem, because an inconsistent database may have exponentially many repairs. In particular, computing the consistent answers of a fixed Select-Project-Join (SPJ) query can be a coNP-complete problem. By now, there is an extensive body of work on the complexity of consistent answers for SPJ queries (see Section 2).

Range Semantics: Concept and Motivation. Aggregation queries are among the most frequently asked queries; they are of the form

Q := SELECT Z, f(A) FROM T(U,Z,A) GROUP BY Z,

where $f(A)$ is one of the standard aggregation operators COUNT($A$), COUNT(*), SUM($A$), AVG($A$), MIN($A$), MAX($A$), and $T(U,Z,A)$ is the relation returned by an SPJ query $q$ expressed in SQL. A scalar aggregation query is an aggregation query without a GROUP BY clause.

What is the semantics of an aggregation query over an inconsistent database? Since an aggregation query may return different answers on different repairs of an inconsistent database, there is typically no consistent answer as per the earlier definition of consistent answers. To obtain meaningful semantics for aggregation queries, Arenas et al. (Arenas et al., 2003b) introduced the range consistent answers.

Table 1. Running example – an inconsistent database instance $\mathcal{I}$ (primary key attributes are underlined)
CUSTOMER: CID CNAME CITY
C1 John LA $f_1$
C2 Mary LA $f_2$
C2 Mary SF $f_3$
C3 Don SF $f_4$
C4 Jen LA $f_5$
ACCOUNTS: ACCID TYPE CITY BAL
A1 Checking LA 900 $f_6$
A2 Checking LA 1000 $f_7$
A3 Saving SJ 1200 $f_8$
A3 Saving SF -100 $f_9$
A4 Saving SJ 300 $f_{10}$
CUSTACC: CID ACCID
C1 A1 $f_{11}$
C2 A2 $f_{12}$
C2 A3 $f_{13}$
C3 A4 $f_{14}$

Let $Q$ be a scalar aggregation query and let $\Sigma$ be a set of integrity constraints. The set of possible answers to $Q$ on an inconsistent instance $\mathcal{I}$ w.r.t. $\Sigma$ is the set of the answers to $Q$ over all repairs of $\mathcal{I}$ w.r.t. $\Sigma$, i.e., $\text{Poss}(Q,\Sigma)=\{Q(\mathcal{J})\mid\mathcal{J}\text{ is a repair of }\mathcal{I}\text{ w.r.t. }\Sigma\}$. By definition, the range consistent answers to $Q$ on $\mathcal{I}$ form the interval $[\text{glb}(Q,\mathcal{I}),\text{lub}(Q,\mathcal{I})]$, whose endpoints are, respectively, the greatest lower bound (glb) and the least upper bound (lub) of the set $\text{Poss}(Q,\Sigma)$ of possible answers to $Q$ on $\mathcal{I}$. For example, the range consistent answers to the query

SELECT SUM(ACCOUNTS.BAL) FROM ACCOUNTS, CUSTACC
WHERE ACCOUNTS.ACCID = CUSTACC.ACCID AND CUSTACC.CID = ‘C2’

on the instance in Table 1 is the interval $[900, 2200]$. The meaning is that no matter how the database $\mathcal{I}$ is repaired, the answer to the query is guaranteed to lie between 900 and 2200.
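Although computing range consistent answers by enumerating repairs is infeasible in general, on this toy example the semantics can be checked directly by brute force. The sketch below uses our own ad hoc Python encoding of Table 1 (only ACCOUNTS and CUSTACC matter for this query; the key-violating facts $f_8$ and $f_9$ form one group):

```python
from itertools import product

# Toy encoding of ACCOUNTS and CUSTACC from Table 1. ACCOUNTS facts are
# grouped by their key ACCID; a group with >1 fact violates the key.
accounts = {
    "A1": [("Checking", "LA", 900)],
    "A2": [("Checking", "LA", 1000)],
    "A3": [("Saving", "SJ", 1200), ("Saving", "SF", -100)],  # f8, f9 conflict
    "A4": [("Saving", "SJ", 300)],
}
custacc = {("C1", "A1"), ("C2", "A2"), ("C2", "A3"), ("C3", "A4")}  # consistent

def repairs(rel):
    """Every subset repair keeps exactly one fact per key group."""
    keys = list(rel)
    for choice in product(*(rel[k] for k in keys)):
        yield dict(zip(keys, choice))

# Evaluate the SUM query on every repair and collect the possible answers.
possible = set()
for rep in repairs(accounts):
    total = sum(rep[accid][2] for (cid, accid) in custacc if cid == "C2")
    possible.add(total)

glb, lub = min(possible), max(possible)
print(glb, lub)  # → 900 2200
```

The repair keeping $f_8$ yields $1000+1200=2200$, and the one keeping $f_9$ yields $1000-100=900$, matching the interval above.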

Arenas et al. (Arenas et al., 2003a) focused on scalar aggregation queries only. Fuxman, Fazli, and Miller (Fuxman et al., 2005a) extended the notion of range consistent answers to aggregation queries with grouping (see Section 3).

Range semantics have become the standard semantics of aggregation queries in the framework of database repairs (see (Bertossi, 2011, Section 5.6)). Furthermore, range semantics have been adapted to give semantics to aggregation queries in several other contexts, including data exchange (Afrati and Kolaitis, 2008) and ontologies (Kostylev and Reutter, 2015). Finally, range semantics have been suggested as an alternative way to overcome some of the issues arising from SQL’s handling of null values (Guagliardo and Libkin, 2016).

Earlier Systems for Consistent Query Answering. Several academic prototype systems for consistent query answering have been developed (Arenas et al., 2003a; Barceló and Bertossi, 2003; Chomicki et al., 2004b; Fuxman et al., 2005a, b; Greco et al., 2003; Kolaitis et al., 2013; Manna et al., 2011; Marileo and Bertossi, 2010; Dixit and Kolaitis, 2019). These systems use different approaches, including logic programming (Barceló and Bertossi, 2003; Greco et al., 2003), compact representations of repairs (Chomicki et al., 2004a), and reductions to solvers (Manna et al., 2011; Kolaitis et al., 2013; Dixit and Kolaitis, 2019). In particular, in (Dixit and Kolaitis, 2019), we reported on CAvSAT, a system that at that time was able to compute the consistent answers of unions of SPJ queries w.r.t. denial constraints (which include functional dependencies as a special case) via reductions to SAT solvers. Among all these systems, however, only the ConQuer system by Fuxman et al. (Fuxman et al., 2005a, b) is capable of handling aggregation queries. Actually, ConQuer can handle only a restricted class of aggregation queries, namely, those aggregation queries w.r.t. key constraints for which the underlying SPJ query belongs to the class called $C_{\textit{forest}}$. For such a query $Q$, the range consistent answers of $Q$ are SQL-rewritable, which means that there is a SQL query $Q'$ such that the range consistent answers of $Q$ on an instance $\mathcal{I}$ can be obtained by directly evaluating $Q'$ on $\mathcal{I}$. This leaves out, however, many aggregation queries, including all aggregation queries whose range consistent answers are not SQL-rewritable or are NP-hard to compute. Up to now, no system supports such queries.

Summary of Contributions. In this paper, we report on and evaluate the performance of AggCAvSAT (Aggregate Consistent Answers via Satisfiability Testing), which is an enhanced version of CAvSAT and is also the first system capable of computing the range consistent answers to all aggregation queries involving the operators SUM($A$), COUNT($A$), or COUNT(*), with or without grouping.

We first corroborate the need for a system that goes well beyond ConQuer by showing that there is an aggregation query $Q$ involving SUM($A$) such that the consistent answers of the underlying SPJ query $q$ w.r.t. key constraints are SQL-rewritable, but the range consistent answers of $Q$ are NP-hard to compute (Theorem 3.2 in Section 3).

The distinctive feature of AggCAvSAT is that it uses polynomial-time reductions to reduce the computation of the range consistent answers of aggregation queries to optimization variants of Boolean Satisfiability (SAT), such as Partial MaxSAT and Weighted Partial MaxSAT. These reductions, described in Sections 4 and 5, are natural but are much more sophisticated than the reductions used in (Dixit and Kolaitis, 2019) to reduce the consistent answers of SPJ queries to SAT. After the reductions have been carried out, AggCAvSAT deploys powerful SAT solvers, such as the MaxHS solver (Davies and Bacchus, 2011), to compute the range consistent answers of aggregation queries. Furthermore, AggCAvSAT can handle databases that are inconsistent not only w.r.t. key constraints, but also w.r.t. arbitrary denial constraints, a much broader class of constraints.
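For readers unfamiliar with these optimization variants: in (Weighted) Partial MaxSAT, the hard clauses must all be satisfied, while the total weight of the satisfied soft clauses is maximized. The following minimal brute-force illustration is not the AggCAvSAT encoding itself; the clauses and weights are our own toy example:

```python
from itertools import product

# A clause is a list of literals; positive int i means x_i, negative -i means ¬x_i.
hard = [[1, 2], [-1, -2]]      # hard clauses: exactly one of x1, x2 is true
soft = [([1], 3), ([2], 1)]    # soft (clause, weight): prefer x1 (weight 3)

def satisfies(assignment, clause):
    return any((lit > 0) == assignment[abs(lit)] for lit in clause)

def weighted_partial_maxsat(hard, soft, n_vars):
    """Brute-force Weighted Partial MaxSAT over n_vars variables."""
    best = None
    for bits in product([False, True], repeat=n_vars):
        a = dict(enumerate(bits, start=1))
        if not all(satisfies(a, c) for c in hard):
            continue  # hard clauses are mandatory
        weight = sum(w for c, w in soft if satisfies(a, c))
        if best is None or weight > best[0]:
            best = (weight, a)
    return best

weight, assignment = weighted_partial_maxsat(hard, soft, 2)
print(weight, assignment[1], assignment[2])  # → 3 True False
```

Solvers such as MaxHS solve exactly this optimization problem, but at an industrial scale rather than by enumeration.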

An extensive experimental evaluation of AggCAvSAT is reported in Section 6. We carried out a suite of experiments on both synthetic and real-world databases, and for a variety of aggregation queries with and without grouping. The synthetic databases were generated using two different methods: (a) the DBGen tool of TPC-H was used to generate consistent data into which inconsistencies were then injected artificially; (b) the PDBench inconsistent database generator from the probabilistic database management system MayBMS (Antova et al., 2008) was used. The experiments demonstrated the scalability of AggCAvSAT along both the size of the data and the degree of inconsistency in the data. Note that AggCAvSAT was also competitive in comparison to ConQuer (especially when the degree of inconsistency was not excessive), even though the latter is tailored to handle only a restricted class of aggregation queries whose range consistent answers are SQL-rewritable.

Consistent Answers vs. Data Cleaning. There is a large body of work on managing inconsistent databases via data cleaning. There are fundamental differences between the framework of the consistent answers and the framework of data cleaning (see (Bertossi, 2011, Section 6)). In particular, the consistent answers provide the guarantee that each such answer will be found no matter on which repair the query at hand is evaluated, while data cleaning provides no similar guarantee. Data cleaning has the attraction that it produces a single consistent instance but the process need not be deterministic and the instance produced need not even be a repair (i.e., it need not be a maximally consistent instance). Recent data cleaning systems, such as HoloClean (Rekatsinas et al., 2017) and Daisy (Giannakopoulou et al., 2020b, a), produce a probabilistic database instance as the output (which again need not be a repair).

It is not clear how to compare query answers over the database returned by a data cleaning system and the (range) consistent answers computed by a consistent query answering system. In fact, no such comparison is given in the HoloClean (Rekatsinas et al., 2017) and Daisy (Giannakopoulou et al., 2020b, a) papers. At the performance level, the data cleaning approaches remove inconsistencies in the data offline, hence the time-consuming tasks are done prior to answering the queries; in contrast, systems for consistent query answering work online. It is an interesting project, left for future research, to develop a methodology and carry out a fair comparison on a level playing field between systems for data cleaning and systems for consistent query answering.

2. Preliminaries

Integrity Constraints and Database Queries

A relational database schema $\mathcal{R}$ is a finite collection of relation symbols, each with a fixed positive integer as its arity. The attributes of a relation symbol are names for its columns; they can be identified with their positions, thus $Attr(R)=\{1,\ldots,n\}$ denotes the set of attributes of an $n$-ary relation symbol $R$. An $\mathcal{R}$-instance is a collection $\mathcal{I}$ of finite relations $R^{\mathcal{I}}$, one for each relation symbol $R$ in $\mathcal{R}$. An expression of the form $R^{\mathcal{I}}(a_1,\ldots,a_n)$ is a fact of the instance $\mathcal{I}$ if $(a_1,\ldots,a_n)\in R^{\mathcal{I}}$. A key is a minimal subset $X$ of $Attr(R)$ such that the functional dependency $X\rightarrow Attr(R)$ holds.

Starting with Codd’s seminal work (Codd, 1970, 1972), first-order logic has been successfully used as a database query language; in fact, it forms the core of SQL. A conjunctive query is expressible by a first-order formula of the form $q(\mathbf{z}):=\exists\mathbf{w}\,(R_1(\mathbf{x}_1)\land\cdots\land R_m(\mathbf{x}_m))$, where each $\mathbf{x}_i$ is a tuple of variables and constants, $\mathbf{z}$ and $\mathbf{w}$ are tuples of variables with no variable in common, and the variables in $\mathbf{x}_1,\ldots,\mathbf{x}_m$ appear in exactly one of the tuples $\mathbf{z}$ and $\mathbf{w}$. A conjunctive query with no free variables (i.e., all variables are existentially quantified) is a boolean query, while a conjunctive query with $k$ free variables in $\mathbf{z}$ is a $k$-ary query. Conjunctive queries are also known as select-project-join (SPJ) queries with equi-joins, and are among the most frequently asked queries. For example, on the instance $\mathcal{I}$ from Table 1, the binary conjunctive query $q(z,x):=\exists w\exists y\,(\texttt{CUST}(w,x,y)\land\texttt{CUSTACC}(w,z))$ returns the set of all pairs $(z,x)$ such that $z$ is the ID of an account owned by a customer named $x$.

Equivalently, this query can be expressed in SQL as
    SELECT CUSTACC.ACCID, CUST.CNAME
    FROM CUST, CUSTACC
    WHERE CUST.CID = CUSTACC.CID

A union of conjunctive queries is expressible by a disjunction $q(\mathbf{z}):=q_1\vee\cdots\vee q_n$ of conjunctive queries, where all conjunctive queries $q_i$ have the same arity. Unions of conjunctive queries are strictly more expressive than conjunctive queries.

Database Repairs and Consistent Answers

Let $\Sigma$ be a set of integrity constraints on a database schema $\mathcal{R}$. An $\mathcal{R}$-instance $\mathcal{I}$ is consistent if $\mathcal{I}\models\Sigma$, i.e., $\mathcal{I}$ satisfies every constraint in $\Sigma$; otherwise, $\mathcal{I}$ is inconsistent. For example, let $\mathcal{I}$ be the instance depicted in Table 1. There are two key constraints, namely, CID is a key of CUSTOMER and ACCID is a key of ACCOUNTS. Clearly, $\mathcal{I}$ is inconsistent, since the facts $f_2, f_3$ of CUSTOMER and the facts $f_8, f_9$ of ACCOUNTS violate these key constraints.

A repair of an inconsistent instance $\mathcal{I}$ w.r.t. $\Sigma$ is a consistent instance $\mathcal{J}$ that differs from $\mathcal{I}$ in a “minimal” way. Different notions of minimality give rise to different types of repairs (see (Bertossi, 2011) for a comprehensive survey). Here, we focus on subset repairs, the most extensively studied type of repairs. An instance $\mathcal{J}$ is a subset repair of an instance $\mathcal{I}$ if $\mathcal{J}$ is a maximal consistent subinstance of $\mathcal{I}$, that is, $\mathcal{J}\subseteq\mathcal{I}$ (where $\mathcal{I}$ and $\mathcal{J}$ are viewed as sets of facts), $\mathcal{J}\models\Sigma$, and there is no instance $\mathcal{J}'$ such that $\mathcal{J}'\models\Sigma$ and $\mathcal{J}\subset\mathcal{J}'\subseteq\mathcal{I}$. Arenas et al. (Arenas et al., 1999) used repairs to give rigorous semantics to query answering on inconsistent databases. Specifically, assume that $q$ is a query, $\mathcal{I}$ is an $\mathcal{R}$-instance, and $\mathbf{t}$ is a tuple of values. We say that $\mathbf{t}$ is a consistent answer to $q$ on $\mathcal{I}$ w.r.t. $\Sigma$ if $\mathbf{t}\in q(\mathcal{J})$ for every repair $\mathcal{J}$ of $\mathcal{I}$. We write $\textsc{Cons}(q,\mathcal{I},\Sigma)$ to denote the set of all consistent answers to $q$ on $\mathcal{I}$ w.r.t. $\Sigma$, i.e.,

$\textsc{Cons}(q,\mathcal{I},\Sigma)=\bigcap\,\{q(\mathcal{J}):\mathcal{J}\text{ is a repair of }\mathcal{I}\text{ w.r.t. }\Sigma\}.$

If $\Sigma$ is a fixed set of integrity constraints and $q$ is a fixed query, then the main computational problem associated with the consistent answers is: given an instance $\mathcal{I}$, compute $\textsc{Cons}(q,\mathcal{I},\Sigma)$; we write $\textsc{Cons}(q,\Sigma)$ to denote this problem. If $q$ is a boolean query, then computing the consistent answers becomes the decision problem $\textsc{Certainty}(q,\Sigma)$: given an instance $\mathcal{I}$, is $q$ true on every repair $\mathcal{J}$ of $\mathcal{I}$ w.r.t. $\Sigma$? When the constraints in $\Sigma$ are understood from the context, we write $\textsc{Cons}(q)$ and $\textsc{Certainty}(q)$ in place of $\textsc{Cons}(q,\Sigma)$ and $\textsc{Certainty}(q,\Sigma)$, respectively.
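On tiny instances, these definitions can be checked directly: enumerate the repairs and intersect the query results. A brute-force sketch with a hypothetical binary relation $R(X,Y)$ whose first attribute is a key, and the projection query $q(y):=\exists x\,R(x,y)$ (the relation and query are our own illustration, not from the paper):

```python
from itertools import product

# Hypothetical relation R(X, Y) with key X; (1,"a") and (1,"b") violate the key.
facts = [(1, "a"), (1, "b"), (2, "a")]

def repairs(facts):
    """Subset repairs w.r.t. a single key: keep exactly one fact per key value."""
    groups = {}
    for f in facts:
        groups.setdefault(f[0], []).append(f)
    for choice in product(*groups.values()):
        yield set(choice)

def q(instance):
    """The query q(y) := ∃x R(x, y): project the second attribute."""
    return {y for (x, y) in instance}

# Cons(q, I, Σ) is the intersection of q(J) over all repairs J of I.
answers = [q(rep) for rep in repairs(facts)]
cons = set.intersection(*answers)
print(sorted(cons))  # → ['a']
```

Here "a" is a consistent answer because it survives both repairs, whereas "b" appears only in the repair that keeps the fact $(1,\text{"b"})$.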

Complexity of Consistent Answers

There has been an extensive study of the consistent answers of conjunctive queries (Bertossi, 2011; Fuxman et al., 2005b; Fuxman and Miller, 2007; Wijsen, 2009, 2010a, 2010b; Kolaitis and Pema, 2012; Koutris and Wijsen, 2016, 2017). If $\Sigma$ is a fixed set of key constraints and $q$ is a boolean conjunctive query, then $\textsc{Certainty}(q,\Sigma)$ is always in coNP, but, depending on the query and the constraints, $\textsc{Certainty}(q,\Sigma)$ exhibits a variety of behaviors within coNP. The most definitive result to date is a trichotomy theorem by Koutris and Wijsen (Koutris and Wijsen, 2016, 2017); it asserts that if $q$ is a self-join-free (no repeated relation symbols) boolean conjunctive query with one key constraint per relation, then $\textsc{Certainty}(q)$ is either SQL-rewritable, or in P but not SQL-rewritable, or coNP-complete. It is an open problem whether or not this trichotomy extends to arbitrary boolean conjunctive queries and to broader classes of constraints (e.g., denial constraints).

We illustrate the trichotomy theorem with three examples. In what follows, the underlined attributes constitute the keys of the relations. First, if $q_1$ is the query $\exists x,y,z\,(R(\underline{x},z)\land S(\underline{z},y))$, then $\textsc{Certainty}(q_1)$ is SQL-rewritable (Fuxman and Miller, 2007). Second, if $q_2$ is the query $\exists x,y\,(R(\underline{x},y)\land S(\underline{y},x))$, then $\textsc{Certainty}(q_2)$ is in P, but is not SQL-rewritable (Wijsen, 2010b). Third, if $q_3$ is the query $\exists x,y,z\,(R(\underline{x},y)\land S(\underline{z},y))$, then $\textsc{Certainty}(q_3)$ is coNP-complete (Fuxman and Miller, 2007).

Boolean Satisfiability and SAT Solvers

Boolean Satisfiability (SAT) is arguably the prototypical and most widely studied NP-complete problem. SAT is the following decision problem: given a boolean formula $\varphi$, is $\varphi$ satisfiable? There is an extensive body of research on different aspects of boolean satisfiability (see the handbook (Biere et al., 2009)). In particular, significant progress has been made on developing SAT-solvers, so much so that the advances in this area of research are often referred to as the “SAT Revolution” (Vardi, 2009). Typically, a SAT-solver takes a boolean formula $\varphi$ in conjunctive normal form (CNF) as input and outputs a satisfying assignment for $\varphi$ (if one exists) or reports that the formula $\varphi$ is unsatisfiable. Recall that a formula $\varphi$ is in CNF if it is a conjunction of clauses, where each clause is a disjunction of literals. For example, the formula $(x_1\lor x_2\lor\neg x_3)\land(\neg x_2\lor x_3)\land(\neg x_1\lor x_4)$ has a satisfying assignment $x_1=1$, $x_2=0$, $x_3=0$, $x_4=1$.
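The example formula and its satisfying assignment can be checked mechanically. The sketch below encodes literals as signed integers (a common convention, used e.g. by the DIMACS CNF format) and brute-forces all $2^4$ assignments:

```python
from itertools import product

# The CNF formula from the text: (x1 ∨ x2 ∨ ¬x3) ∧ (¬x2 ∨ x3) ∧ (¬x1 ∨ x4).
# Literal i stands for x_i; literal -i stands for ¬x_i.
clauses = [[1, 2, -3], [-2, 3], [-1, 4]]

def satisfies(assignment, clauses):
    """A CNF formula holds iff every clause has at least one true literal."""
    return all(any((lit > 0) == assignment[abs(lit)] for lit in clause)
               for clause in clauses)

# The satisfying assignment given in the text: x1=1, x2=0, x3=0, x4=1.
a = {1: True, 2: False, 3: False, 4: True}
print(satisfies(a, clauses))  # → True

# A brute-force "SAT solver" over all assignments finds every model.
models = [bits for bits in product([False, True], repeat=4)
          if satisfies(dict(enumerate(bits, start=1)), clauses)]
print(len(models) > 0)  # → True
```

Real SAT-solvers, of course, avoid exhaustive search via conflict-driven clause learning and related techniques.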

At present, SAT-solvers are capable of quickly solving SAT-instances with millions of clauses and variables. SAT-solvers have been widely used in both academia and industry as general-purpose tools. Indeed, many real-world problems from a variety of domains, including scheduling, protocol design, software verification, and model checking, can be naturally encoded as SAT-instances and solved quickly using solvers such as Glucose (Audemard and Simon, 2009) and CaDiCaL (cad, [n.d.]). Furthermore, SAT-solvers have been used to settle open problems in mathematics (Heule et al., 2016; Oostema et al., 2020). In (Dixit and Kolaitis, 2019), we used SAT-solvers to build a prototype system for consistent query answering, which we called CAvSAT. This system can compute the consistent answers to unions of conjunctive queries over relational databases that are inconsistent w.r.t. a fixed set of arbitrary denial constraints.

3. Range Consistent Answers

Frequently asked database queries often involve one of the standard aggregation operators COUNT($A$), COUNT(*), SUM($A$), AVG($A$), MIN($A$), MAX($A$), and, possibly, a GROUP BY clause. In what follows, we use the term aggregation queries to refer to queries with aggregation operators, with or without a GROUP BY clause. Thus, in full generality, an aggregation query can be expressed as

Q := SELECT Z, f(A) FROM T(U,Z,A) GROUP BY Z,


where $f(A)$ is one of the aforementioned aggregation operators and $T(U,Z,A)$ is the relation returned by a query $q$, which typically is a conjunctive query or a union of conjunctive queries expressed in SQL. This query $q$ is called the underlying query of $Q$, the attribute represented by the variable $A$ is called the aggregation attribute, and the attributes represented by $Z$ are called the grouping attributes. A scalar aggregation query is one without a GROUP BY clause.

It is often the case that an aggregation query returns different answers on different repairs of an inconsistent database; thus, even for a scalar aggregation query, there is typically no consistent answer as per the definition of consistent answers given earlier. In fact, to produce an empty set of consistent answers, it suffices to have just two repairs on which a scalar aggregation query returns different answers. Aiming to obtain more meaningful answers to aggregation queries, Arenas et al. (Arenas et al., 2003b) proposed the range consistent answers as an alternative notion of consistent answers.

Let $Q$ be a scalar aggregation query. The set of possible answers to $Q$ on an inconsistent instance $\mathcal{I}$ consists of the answers to $Q$ over all repairs of $\mathcal{I}$, i.e., $\text{Poss}(Q,\Sigma)=\{Q(\mathcal{J})\mid\mathcal{J}\text{ is a repair of }\mathcal{I}\text{ w.r.t. }\Sigma\}$. By definition, the range consistent answers to $Q$ on $\mathcal{I}$ form the interval $[\text{glb}(Q,\mathcal{I}),\text{lub}(Q,\mathcal{I})]$, whose endpoints are, respectively, the greatest lower bound (glb) and the least upper bound (lub) of the set $\text{Poss}(Q,\Sigma)$ of possible answers to $Q$ on $\mathcal{I}$.

For example, the range consistent answers of the query
SELECT SUM(ACCOUNTS.BAL) FROM ACCOUNTS, CUSTACC
WHERE ACCOUNTS.ACCID = CUSTACC.ACCID AND CUSTACC.CID = ‘C2’

on the instance in Table 1 is the interval $[900, 2200]$: no matter how the database $\mathcal{I}$ is repaired, the answer to the query is guaranteed to lie between 900 and 2200. Note that the glb-answer comes from a repair of $\mathcal{I}$ that contains the fact $f_9$, while the lub-answer comes from a repair that contains the fact $f_8$.

Arenas et al. (Arenas et al., 2003a) focused on scalar aggregation queries only. Fuxman, Fazli, and Miller (Fuxman et al., 2005a) extended the notion of range consistent answers to aggregation queries with grouping, i.e., to queries

Q := SELECT Z, f(A) FROM T(U,Z,A) GROUP BY Z.

For such queries, a tuple $(T,[\text{glb},\text{lub}])$ is a range consistent answer to $Q$ on $\mathcal{I}$ if the following conditions hold:

$\bullet$  For every repair $\mathcal{J}$ of $\mathcal{I}$, there exists $d$ such that $(T,d)\in Q(\mathcal{J})$ and $\text{glb}\leq d\leq\text{lub}$.

$\bullet$  For some repair $\mathcal{J}$ of $\mathcal{I}$, we have that $(T,\text{glb})\in Q(\mathcal{J})$.

$\bullet$  For some repair $\mathcal{J}$ of $\mathcal{I}$, we have that $(T,\text{lub})\in Q(\mathcal{J})$.
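These three conditions can again be checked by brute force on the running example. The sketch below (our own toy encoding of ACCOUNTS from Table 1, and a hypothetical grouped query SELECT TYPE, SUM(BAL) FROM ACCOUNTS GROUP BY TYPE that is not from the paper) evaluates the query on every repair and assembles the range consistent answers:

```python
from itertools import product

# ACCOUNTS from Table 1, keyed by ACCID; A3 has two conflicting facts (f8, f9).
accounts = {
    "A1": [("Checking", 900)],
    "A2": [("Checking", 1000)],
    "A3": [("Saving", 1200), ("Saving", -100)],
    "A4": [("Saving", 300)],
}

def repairs(rel):
    """Keep exactly one fact per key group."""
    for choice in product(*rel.values()):
        yield list(choice)

# Evaluate SELECT TYPE, SUM(BAL) ... GROUP BY TYPE on each repair.
per_repair = []
for rep in repairs(accounts):
    sums = {}
    for (typ, bal) in rep:
        sums[typ] = sums.get(typ, 0) + bal
    per_repair.append(sums)

# A group T contributes a range consistent answer only if it shows up in
# every repair (condition 1); glb/lub are attained in some repair (2 and 3).
groups = set.intersection(*(set(s) for s in per_repair))
answers = {t: (min(s[t] for s in per_repair), max(s[t] for s in per_repair))
           for t in groups}
print(sorted(answers.items()))
```

Here the repairs differ only in the choice between $f_8$ and $f_9$, giving (Checking, [1900, 1900]) and (Saving, [200, 1500]).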

If $Q$ is an aggregation query, then $\textsc{Cons}(Q)$ denotes the following problem: given an instance $\mathcal{I}$, compute the range consistent answers of $Q$ on $\mathcal{I}$.

Complexity of Range Consistent Answers

Arenas et al. (Arenas et al., 2003b) investigated the computational complexity of the range consistent answers for scalar aggregation queries of the form

SELECT f(A) FROM R(U,A),

where $f(A)$ is one of the standard aggregation operators and $R(U,A)$ is a relational schema with functional dependencies. The main findings in Arenas et al. (Arenas et al., 2003b) can be summarized as follows.

$\bullet$ If the relational schema $R(U,A)$ has at most one functional dependency and $f(A)$ is one of the aggregation operators MIN($A$), MAX($A$), SUM($A$), COUNT(*), AVG($A$), then computing the range consistent answers of the query SELECT $f(A)$ FROM $R(U,A)$ is in P.

$\bullet$ There is a relational schema $R(U,A)$ with one key dependency such that computing the range consistent answers of the query SELECT COUNT($A$) FROM $R(U,A)$ is an NP-complete problem.

$\bullet$ There is a relational schema $R(U,A)$ with two functional dependencies such that computing the range consistent answers of the query SELECT $f(A)$ FROM $R(U,A)$ is an NP-complete problem, where $f(A)$ is any one of the standard aggregation operators.

It remains an open problem to pinpoint the complexity of the range consistent answers for richer aggregation queries of the form

Q := SELECT Z, f(A) FROM T(U,Z,A) GROUP BY Z,

where $T(U,Z,A)$ is the relation returned by a conjunctive query $q$ or by a union $q:=q_1\cup\cdots\cup q_k$ of conjunctive queries. It can be shown, however, that if computing the consistent answers $\textsc{Cons}(q)$ of the underlying query $q$ is a hard problem, then computing the range consistent answers $\textsc{Cons}(Q)$ of the aggregation query $Q$ is a hard problem as well. This gives rise to the following question: what can we say about the complexity of the range consistent answers $\textsc{Cons}(Q)$ if computing the consistent answers $\textsc{Cons}(q)$ of the underlying query is an easy problem?

Fuxman and Miller (Fuxman and Miller, 2007) identified a class, called $C_{\textit{forest}}$, of self-join-free conjunctive queries whose consistent answers are SQL-rewritable. In his PhD thesis, Fuxman (Fuxman, 2007) introduced the class $C_{\textit{aggforest}}$ consisting of all aggregation queries such that the aggregation operator is one of MIN($A$), MAX($A$), SUM($A$), COUNT(*), the underlying query $q$ is a conjunctive query in $C_{\textit{forest}}$, and there is one key constraint for each relation in the underlying query $q$. Fuxman (Fuxman, 2007) showed that the range consistent answers of every query in $C_{\textit{aggforest}}$ are SQL-rewritable (earlier, similar results for a proper subclass of $C_{\textit{aggforest}}$ were obtained by Fuxman, Fazli, and Miller).

It is known that there are self-join-free conjunctive queries outside the class $C_{\textit{forest}}$ whose consistent answers are SQL-rewritable. In fact, Koutris and Wijsen (Koutris and Wijsen, 2017) have characterized the self-join-free conjunctive queries whose consistent answers are SQL-rewritable. However, the SQL-rewritability of aggregation queries beyond those in $C_{\textit{aggforest}}$ has not been investigated. In the sequel, we show that there exists a self-join-free conjunctive query whose consistent answers are SQL-rewritable, but this property is not preserved when an aggregation operator is added on top of it. Specifically, we reduce the Maximum Cut problem to the problem of computing the range consistent answers to an aggregation query that involves SUM and whose underlying conjunctive query has SQL-rewritable consistent answers. We begin by recalling the definition of the Maximum Cut problem, a fundamental NP-complete problem (Karp, 1972). We state and prove a helping lemma (Lemma 3.3) before stating the main result in Theorem 3.2.

Definition 3.1.

Maximum Cut. For an undirected graph $G=(V,E)$, a cut of $G$ is a partition $(S,\overline{S})$ of $V$, where $S\subseteq V$ and $\overline{S}=V\setminus S$. The set of edges with one vertex in $S$ and one vertex in $\overline{S}$ is denoted by $E(S,\overline{S})$, and the size of the cut $(S,\overline{S})$ is $|E(S,\overline{S})|$.

The Maximum Cut problem asks: given an undirected graph $G$ and an integer $k$, is there a cut of $G$ of size at least $k$?
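For intuition, Maximum Cut can be solved by exhaustive search on very small graphs; the brute force below is exponential in $|V|$, and the triangle example is our own illustration:

```python
from itertools import combinations

def max_cut(vertices, edges):
    """Brute-force Maximum Cut: try every subset S of V and count cut edges."""
    best = 0
    vs = list(vertices)
    for r in range(len(vs) + 1):
        for S in combinations(vs, r):
            S = set(S)
            # An edge is cut iff exactly one of its endpoints lies in S.
            size = sum(1 for (u, v) in edges if (u in S) != (v in S))
            best = max(best, size)
    return best

# A triangle: every cut separates one vertex from the other two, cutting 2 edges.
print(max_cut({1, 2, 3}, [(1, 2), (2, 3), (1, 3)]))  # → 2
```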

Theorem 3.2.

Let $\mathcal{R}$ be a schema with three relations $R_1(\underline{A_1},B_1)$, $R_2(\underline{A_2},B_2)$, and $R_3(\underline{A_1,B_1,A_2,B_2,C})$. Let $Q$ be the following aggregation query:

$Q:=$ SELECT SUM($A$) FROM $q(A)$,

where $q(A)$ is the following self-join-free conjunctive query:

$q(A):=\exists x\exists y\;R_1(\underline{x},\text{`red'})\land R_2(\underline{y},\text{`blue'})\land R_3(\underline{x,\text{`red'},y,\text{`blue'},A})$.

Then the following two statements hold.

1. $\textsc{Cons}(q)$ is SQL-rewritable.

2. $\textsc{Cons}(Q)$ is NP-hard.

Proof.

To show that $\textsc{Cons}(q)$ is SQL-rewritable, consider the following first-order query $q'$:

$q'(A):=\exists x\exists y\,(R_1(\underline{x},\text{`red'})\land R_2(\underline{y},\text{`blue'})\land R_3(\underline{x,\text{`red'},y,\text{`blue'},A})\land\forall z\,(R_1(x,z)\rightarrow z=\text{`red'})\land\forall w\,(R_2(y,w)\rightarrow w=\text{`blue'}))$.

We will show that, for every instance $\mathcal{I}$ and every value $a$, we have that $a\in q'(\mathcal{I})\Leftrightarrow a\in\textsc{Cons}(q,\mathcal{I})$. Since $q'$ filters out the tuples from $R_1$ and $R_2$ that participate in violations of the key constraints, we have that if $a\in q'(\mathcal{I})$, then $a\in q(\mathcal{J})$ for every repair $\mathcal{J}$ of $\mathcal{I}$, which means that $a\in\textsc{Cons}(q,\mathcal{I})$. In the other direction, we claim that if $a\in\textsc{Cons}(q,\mathcal{I})$, then $a\in q'(\mathcal{I})$. Indeed, if $a\not\in q'(\mathcal{I})$, then for all $x$ and $y$ such that $R_1(x,\text{`red'})\land R_2(y,\text{`blue'})\land R_3(x,\text{`red'},y,\text{`blue'},a)$, we would have that there is some $z$ such that $R_1(x,z)$ and $z\neq\text{`red'}$, or there is some $w$ such that $R_2(y,w)$ and $w\neq\text{`blue'}$. Construct a repair $\mathcal{J}$ of $\mathcal{I}$ as follows. First, for every $x$, if `red' is the only value $z$ such that $R_1(x,z)$ is a fact of $\mathcal{I}$, then put $R_1(x,\text{`red'})$ in $\mathcal{J}$; otherwise, pick an element $z^*\neq\text{`red'}$ such that $R_1(x,z^*)$ is a fact of $\mathcal{I}$ and put $R_1(x,z^*)$ in $\mathcal{J}$. Second, for every $y$, if `blue' is the only value $w$ such that $R_2(y,w)$ is a fact of $\mathcal{I}$, then put $R_2(y,\text{`blue'})$ in $\mathcal{J}$; otherwise, pick an element $w^*\neq\text{`blue'}$ such that $R_2(y,w^*)$ is a fact of $\mathcal{I}$ and put $R_2(y,w^*)$ in $\mathcal{J}$. Third, put every tuple of the relation $R_3$ of $\mathcal{I}$ into $\mathcal{J}$. Clearly, $\mathcal{J}$ is a repair of $\mathcal{I}$. Moreover, $a\not\in q(\mathcal{J})$.
Indeed, if aq(𝒥)a\in q({\mathcal{J}}), then there are elements xx and yy such that 𝒥R1(x,‘red’)R2(y,‘blue’)R3(x,‘red’,y,‘blue’,a){\mathcal{J}}\models R_{1}(x,\text{`red'})\land R_{2}(y,\text{`blue'})\land R_{3}(x,\text{`red'},y,\text{`blue'},a). Since aq()a\not\in q^{\prime}({\mathcal{I}}), we have that there is some zz^{\prime} such that R1(x,z)R_{1}(x,z^{\prime}) and z‘red’z^{\prime}\not=\text{`red'} or there is some ww^{\prime} such that R2(y,w)R_{2}(y,w^{\prime}) and w‘blue’w^{\prime}\not=\text{`blue'}. In the first case, the construction of 𝒥\mathcal{J} implies that R1(x,’red’)R_{1}(x,\text{'red'}) is not a fact of 𝒥\mathcal{J}, while in the second case, the construction of 𝒥\mathcal{J} implies that R2(y,‘blue’)R_{2}(y,\text{`blue'}) is not a fact of 𝒥\mathcal{J}; in either case, we have arrived at a contradiction.

To show that Cons(Q)\mbox{\sc{Cons}}(Q) is NP-hard, consider the following reduction from undirected graphs to \mathcal{R}-instances, where \mathcal{R} is the schema with relations R1(A1¯,B1)R_{1}(\underline{A_{1}},B_{1}), R2(A2¯,B2)R_{2}(\underline{A_{2}},B_{2}), and R3(A1,B1,A2,B2,C¯)R_{3}(\underline{A_{1},B_{1},A_{2},B_{2},C}).

Reduction 3.1.

Given an undirected graph G=(V,E)G=(V,E), construct an \mathcal{R}-instance \mathcal{I} as follows. Let m=|E|1m=-|E|-1.

  • For each vVv\in V, add tuples R1(v,‘red’)R_{1}(v,\text{`red'}), R1(v,‘blue’)R_{1}(v,\text{`blue'}), R2(v,‘red’)R_{2}(v,\text{`red'}), and R2(v,‘blue’)R_{2}(v,\text{`blue'}) to \mathcal{I}.

  • For each vVv\in V, add a tuple R3(v,‘red’,v,‘blue’,m)R_{3}(v,\text{`red'},v,\text{`blue'},m) to \mathcal{I}.

  • For each edge (u,v)E(u,v)\in E, add tuples R3(u,‘red’,v,‘blue’,1)R_{3}(u,\text{`red'},v,\text{`blue'},1) and R3(v,‘red’,u,‘blue’,1)R_{3}(v,\text{`red'},u,\text{`blue'},1) to \mathcal{I}.
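As a concrete illustration, the construction above can be sketched in Python; the representation of relations as sets of tuples and the function name `build_instance` are choices made here for illustration, not part of the formal reduction.

```python
def build_instance(vertices, edges):
    """Build the R-instance of Reduction 3.1 from an undirected graph.

    Relations are represented as sets of tuples; m = -|E| - 1.
    """
    m = -len(edges) - 1
    R1, R2, R3 = set(), set(), set()
    for v in vertices:
        # Each vertex gets both colors in R1 and R2 (key violations).
        R1.add((v, 'red'))
        R1.add((v, 'blue'))
        R2.add((v, 'red'))
        R2.add((v, 'blue'))
        # The penalty tuple carrying the value m = -|E| - 1.
        R3.add((v, 'red', v, 'blue', m))
    for (u, v) in edges:
        # Each undirected edge contributes both orientations with value 1.
        R3.add((u, 'red', v, 'blue', 1))
        R3.add((v, 'red', u, 'blue', 1))
    return R1, R2, R3
```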

We will show that the preceding Reduction 3.1 reduces Maximum Cut to computing the range consistent answers of the aggregation query QQ.

For the rest of this section, let GG be an undirected graph and \mathcal{I} be the database instance constructed from GG using Reduction 3.1. We say that a repair 𝒥\mathcal{J^{\prime}} of \mathcal{I} produces a red-blue coloring of GG if for every vertex vVv\in V, we have that the tuples R1(v,‘red’)R_{1}(v,\text{`red'}) and R2(v,‘red’)R_{2}(v,\text{`red'}) are either both present in 𝒥\mathcal{J^{\prime}} or both absent in 𝒥\mathcal{J^{\prime}}. We now prove a useful lemma.

Lemma 3.3.

For every repair 𝒥\mathcal{J} of \mathcal{I}, there exists a repair 𝒥\mathcal{J^{\prime}} of \mathcal{I} (not necessarily different from 𝒥\mathcal{J}) such that 𝒥\mathcal{J^{\prime}} produces a red-blue coloring of GG and Q(𝒥)Q(𝒥)Q(\mathcal{J^{\prime}})\geq Q(\mathcal{J}).

Proof.

Let 𝒥\mathcal{J} be a repair of \mathcal{I}. Construct an \mathcal{R}-instance 𝒥\mathcal{J^{\prime}} from 𝒥\mathcal{J} as follows. For every vertex vVv\in V, if both tuples R1(v,x)R_{1}(v,x) and R2(v,x)R_{2}(v,x) are present in 𝒥\mathcal{J} for x{‘red’,‘blue’}x\in\{\text{`red'},\text{`blue'}\}, then add them to 𝒥\mathcal{J^{\prime}}. Otherwise, add the tuples R1(v,‘red’)R_{1}(v,\text{`red'}) and R2(v,‘red’)R_{2}(v,\text{`red'}) to 𝒥\mathcal{J^{\prime}}. Also, copy all tuples from relation R3R_{3} of 𝒥\mathcal{J} to relation R3R_{3} of 𝒥\mathcal{J^{\prime}}. Clearly, 𝒥\mathcal{J^{\prime}} is a repair of \mathcal{I} and 𝒥\mathcal{J^{\prime}} produces a red-blue coloring of GG. Observe that Q(𝒥)Q(\mathcal{J^{\prime}}) can differ from Q(𝒥)Q(\mathcal{J}) only if there exists at least one vertex vVv\in V such that either R1(v,‘red’),R2(v,‘blue’)𝒥R_{1}(v,\text{`red'}),R_{2}(v,\text{`blue'})\in\mathcal{J} or R1(v,‘blue’),R2(v,‘red’)𝒥R_{1}(v,\text{`blue'}),R_{2}(v,\text{`red'})\in\mathcal{J}.

We will show that Q(𝒥)Q(𝒥)Q(\mathcal{J^{\prime}})\geq Q(\mathcal{J}) holds.

Case 1: Let vv be a node such that R1(v,‘red’),R2(v,‘blue’)𝒥R_{1}(v,\text{`red'}),R_{2}(v,\text{`blue'})\in\mathcal{J}. In this case, while populating the database instance 𝒥\mathcal{J^{\prime}}, vertex vv changes its color in relation R2R_{2}, i.e., we have that R2(v,‘red’)𝒥R_{2}(v,\text{`red'})\in\mathcal{J^{\prime}} and R2(v,‘blue’)𝒥R_{2}(v,\text{`blue'})\not\in\mathcal{J^{\prime}}. Therefore, the summands arising from the tuples of the form R1(u,‘red’)R_{1}(u,\text{`red'}), R2(v,‘blue’)R_{2}(v,\text{`blue'}), and R3(u,‘red’,v,‘blue’,1)R_{3}(u,\text{`red'},v,\text{`blue'},1) of 𝒥\mathcal{J} (for some vertex uvVu\neq v\in V) do not appear in Q(𝒥)Q(\mathcal{J^{\prime}}). Notice that each of these summands contributes value 1 to Q(𝒥)Q(\mathcal{J}) and the number of these summands is at most |E||E|. At the same time, the summand that contributes value mm to Q(𝒥)Q(\mathcal{J}) arising from the tuples R1(v,‘red’)R_{1}(v,\text{`red'}), R2(v,‘blue’)R_{2}(v,\text{`blue'}), and R3(v,‘red’,v,‘blue’,m)R_{3}(v,\text{`red'},v,\text{`blue'},m) of 𝒥\mathcal{J} also does not appear in Q(𝒥)Q(\mathcal{J^{\prime}}). Since m=|E|1m=-|E|-1, it follows that Q(𝒥)Q(\mathcal{J^{\prime}}) cannot be made smaller than Q(𝒥)Q(\mathcal{J}) on account of such a node vv.

Case 2: Let vv be a node such that R1(v,‘blue’),R2(v,‘red’)𝒥R_{1}(v,\text{`blue'}),R_{2}(v,\text{`red'})\in\mathcal{J}. In this case, while populating 𝒥\mathcal{J^{\prime}}, vertex vv changes its color in relation R1R_{1}, i.e., we have that R1(v,‘red’)𝒥R_{1}(v,\text{`red'})\in\mathcal{J^{\prime}} and R1(v,‘blue’)𝒥R_{1}(v,\text{`blue'})\not\in\mathcal{J^{\prime}}. Compared to Q(𝒥)Q(\mathcal{J}), this can only increase the number of summands that contribute 1 to Q(𝒥)Q(\mathcal{J^{\prime}}), by possibly having new summands arising from the tuples of type R1(v,‘red’)R_{1}(v,\text{`red'}), R2(w,‘blue’)R_{2}(w,\text{`blue'}), and R3(v,‘red’,w,‘blue’,1)R_{3}(v,\text{`red'},w,\text{`blue'},1) of 𝒥\mathcal{J^{\prime}} (for some vertex wvVw\neq v\in V). Moreover, for every vertex uVu\in V, it is true that, if R1(u,‘red’)𝒥R_{1}(u,\text{`red'})\in\mathcal{J} then R1(u,‘red’)𝒥R_{1}(u,\text{`red'})\in\mathcal{J^{\prime}}; similarly, if R2(u,‘blue’)𝒥R_{2}(u,\text{`blue'})\in\mathcal{J} then R2(u,‘blue’)𝒥R_{2}(u,\text{`blue'})\in\mathcal{J^{\prime}}. Therefore, every summand that contributes 1 to Q(𝒥)Q(\mathcal{J}) also contributes 1 to Q(𝒥)Q(\mathcal{J^{\prime}}). Hence, Q(𝒥)Q(\mathcal{J^{\prime}}) cannot be made smaller than Q(𝒥)Q(\mathcal{J}) on account of such a node vv.

The preceding analysis implies that Q(𝒥)Q(𝒥)Q(\mathcal{J^{\prime}})\geq Q(\mathcal{J}). ∎

By Lemma 3.3, there exists a repair 𝒥\mathcal{J} of \mathcal{I} such that 𝒥\mathcal{J} produces a red-blue coloring of GG and Q(𝒥)Q(\mathcal{J}) is the lublub-answer in Cons(Q,)\mbox{\sc{Cons}}(Q,\mathcal{I}). We will show that, for a non-negative integer kk, there is a cut (S,S¯)(S,\overline{S}) of GG such that |E(S,S¯)|k|E(S,\overline{S})|\geq k if and only if there exists a repair 𝒥\mathcal{J} of \mathcal{I} such that 𝒥\mathcal{J} produces a red-blue coloring of GG and Q(𝒥)kQ(\mathcal{J})\geq k. Once this is shown, it will follow that it is NP-hard to even compute the lublub-answer in Cons(Q,)\mbox{\sc{Cons}}(Q,\mathcal{I}).

Let (S,S¯)(S,\overline{S}) be a cut of GG such that |E(S,S¯)|k|E(S,\overline{S})|\geq k. Construct an \mathcal{R}-instance 𝒥\mathcal{J} as follows. For each vertex vSv\in S, add tuples R1(v,‘red’)R_{1}(v,\text{`red'}) and R2(v,‘red’)R_{2}(v,\text{`red'}) to 𝒥\mathcal{J}. For each vertex vS¯v\in\overline{S}, add tuples R1(v,‘blue’)R_{1}(v,\text{`blue'}) and R2(v,‘blue’)R_{2}(v,\text{`blue'}) to 𝒥\mathcal{J}. Add all tuples from relation R3R_{3} of \mathcal{I} to 𝒥\mathcal{J}. Observe that 𝒥\mathcal{J} is a repair of \mathcal{I} and that 𝒥\mathcal{J} produces a red-blue coloring of GG. Also, every edge (u,v)E(u,v)\in E such that uSu\in S and vS¯v\in\overline{S} is part of a witness to a summand that contributes 1 to Q(𝒥)Q(\mathcal{J}). Moreover, no summand in Q(𝒥)Q(\mathcal{J}) arises from a tuple of the form R3(v,‘red’,v,‘blue’,m)R_{3}(v,\text{`red'},v,\text{`blue'},m) for some vVv\in V. Since we have that |E(S,S¯)|k|E(S,\overline{S})|\geq k, it must be the case that Q(𝒥)kQ(\mathcal{J})\geq k. In the other direction, let 𝒥\mathcal{J} be a repair of \mathcal{I} such that 𝒥\mathcal{J} produces a red-blue coloring of GG and Q(𝒥)kQ(\mathcal{J})\geq k. Construct two sets SS and S¯\overline{S} of vertices of GG as follows. Let vSv\in S if R1(v,‘red’)𝒥R_{1}(v,\text{`red'})\in\mathcal{J}, and let vS¯v\in\overline{S} if R1(v,‘blue’)𝒥R_{1}(v,\text{`blue'})\in\mathcal{J}. Clearly, (S,S¯)(S,\overline{S}) is a cut of GG. Every edge (u,v)E(u,v)\in E such that uSu\in S and vS¯v\in\overline{S} is part of a witness to a summand that contributes 1 to Q(𝒥)Q(\mathcal{J}) since the tuples R1(u,‘red’)R_{1}(u,\text{`red'}), R2(v,‘blue’)R_{2}(v,\text{`blue'}), and R3(u,‘red’,v,‘blue’,1)R_{3}(u,\text{`red'},v,\text{`blue'},1) of 𝒥\mathcal{J} satisfy the underlying conjunctive query of QQ. In fact, since 𝒥\mathcal{J} produces a red-blue coloring of GG, every summand that contributes to Q(𝒥)Q(\mathcal{J}) must arise from such tuples. 
Since Q(𝒥)kQ(\mathcal{J})\geq k, it must be the case that |E(S,S¯)|k|E(S,\overline{S})|\geq k. ∎

4. Consistent Answers via SAT Solving

In this section, we give polynomial-time reductions from computing the range consistent answers of aggregation queries to variants of SAT. The reductions in this section assume that the database schema has one key constraint per relation; in Section 5, we show how these reductions can be extended to schemata with arbitrary denial constraints. Our reductions rely on several well-known optimization variants of SAT that we describe next.

\bullet  Weighted MaxSAT (or WMaxSAT) is the maximization variant of SAT in which each clause is assigned a positive weight and the goal is to find an assignment that maximizes the sum of the weights of the satisfied clauses. We will write (l1lk,w)(l_{1}\lor\cdots\lor l_{k},w) to denote a clause (l1lk)(l_{1}\lor\cdots\lor l_{k}) with weight ww.

\bullet  Partial MaxSAT (or PMaxSAT) is the maximization variant of SAT in which some clauses of the formula are assigned infinite weight (hard clauses), while each of the rest is assigned weight one (soft clauses). The goal is to find an assignment that satisfies all hard clauses and the maximum number of soft clauses. If the hard clauses of a PMaxSAT instance are not simultaneously satisfiable, then we say that the instance is unsatisfiable. For simplicity, a hard clause (l1lk,)(l_{1}\lor\cdots\lor l_{k},\infty) is denoted as (l1lk)(l_{1}\lor\cdots\lor l_{k}).

\bullet  Weighted Partial MaxSAT (or WPMaxSAT) is the maximization variant of SAT in which some of the clauses of the formula are assigned infinite weight (hard clauses), while each of the rest is assigned a positive weight (soft clauses). The goal is to find an assignment that satisfies all hard clauses and maximizes the sum of weights of the satisfied soft clauses. Clearly, WPMaxSAT is a common generalization of both WMaxSAT (no hard clauses) and PMaxSAT (each soft clause has weight one).
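For intuition, all three variants can be captured by a single brute-force evaluator; the (literals, weight) clause representation with `None` marking hard clauses is an encoding chosen here purely for illustration, and enumeration is practical only for tiny formulas.

```python
from itertools import product

def wpmaxsat(n_vars, clauses):
    """Brute-force WPMaxSAT: clauses are (literals, weight) pairs, with
    literals as nonzero integers and weight None marking a hard clause.
    Returns the best total weight of satisfied soft clauses over the
    assignments satisfying all hard clauses, or None if the hard
    clauses are unsatisfiable."""
    best = None
    for bits in product([False, True], repeat=n_vars):
        def sat(lits):
            return any(bits[abs(l) - 1] == (l > 0) for l in lits)
        if any(w is None and not sat(lits) for lits, w in clauses):
            continue  # some hard clause is falsified
        score = sum(w for lits, w in clauses if w is not None and sat(lits))
        best = score if best is None else max(best, score)
    return best
```

WMaxSAT is the special case with no hard clauses, and PMaxSAT the special case where every soft weight is 1.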

Modern solvers, such as MaxHS (Davies and Bacchus, 2011), can efficiently solve large instances of these maximization variants of SAT. Note that these maximization problems have dual minimization problems, called WMinSAT, PMinSAT, and WPMinSAT, respectively. For example, in WPMinSAT, the goal is to find an assignment that satisfies all hard clauses and minimizes the sum of weights of the satisfied soft clauses. These minimization problems are of interest to us, because some of the computations of the range consistent answers have natural reductions to such minimization problems. At present, the only existing WPMinSAT solver is MinSatz (Li et al., 2011). Since this solver has certain size limitations, we will deploy Kügel’s technique (Kügel, 2012) to first reduce WPMinSAT to WPMaxSAT, and then use the MaxHS solver in our experiments. This technique uses the concept of CNF-negation (Kügel, 2012; Zhu et al., 2012). By definition, if C=(l1lk)C=(l_{1}\lor\cdots\lor l_{k}) is a clause, then the CNF-negation CNF(C¯)\text{CNF}(\overline{C}) of CC is the CNF-formula ¬l1(l1¬l2)(l1l2lk1¬lk)\neg l_{1}\land(l_{1}\lor\neg l_{2})\land\cdots\land(l_{1}\lor l_{2}\lor\cdots\lor l_{k-1}\lor\neg l_{k}). It is easy to verify that the following properties hold: (i) if an assignment ss does not satisfy CC, then ss satisfies every clause of CNF(C¯)\text{CNF}(\overline{C}); (ii) if an assignment ss satisfies CC, then ss satisfies all but one of the clauses of CNF(C¯)\text{CNF}(\overline{C}), namely, the clause (l1lj1¬lj)(l_{1}\lor\cdots\lor l_{j-1}\lor\neg l_{j}), where jj is the smallest index such that s(lj)=1s(l_{j})=1. It follows that CC is satisfiable if and only if CNF(C¯)\text{CNF}(\overline{C}) is unsatisfiable. 
For a weighted clause C=(l1lk,w)C=(l_{1}\lor\cdots\lor l_{k},w), the CNF-negation CNF(C¯)\text{CNF}(\overline{C}) of CC is the formula (¬l1,w)(l1¬l2,w)(l1l2lk1¬lk,w)(\neg l_{1},w)\land(l_{1}\lor\neg l_{2},w)\land\cdots\land(l_{1}\lor l_{2}\lor\cdots\lor l_{k-1}\lor\neg l_{k},w).
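The CNF-negation construction is mechanical; a minimal sketch, with clauses as lists of nonzero integer literals (a representation assumed here for illustration):

```python
def cnf_negation(clause):
    """CNF-negation of a clause l1 v ... v lk, i.e., the formula
    (~l1) ^ (l1 v ~l2) ^ ... ^ (l1 v ... v l_{k-1} v ~lk)."""
    return [clause[:j] + [-clause[j]] for j in range(len(clause))]
```

By properties (i) and (ii), an assignment falsifies exactly one clause of the result when it satisfies the original clause, and none otherwise.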

We also need to recall the notions of key-equal groups and bags of witnesses. Let \mathcal{I} be a database instance.

\bullet  We say that two facts of a relation RR of \mathcal{I} are key-equal if they agree on the key attributes of RR. A set SS of facts of \mathcal{I} is called a key-equal group of facts if every two facts in SS are key-equal, and no fact in SS is key-equal to a fact of \mathcal{I} that is outside SS.

\bullet  Let q(𝐳):=𝐰(R1(𝐱1)Rm(𝐱m))q({\bf z}):=\exists{\bf w}\;(R_{1}({\bf x}_{1})\land...\land R_{m}({\bf x}_{m})) be a conjunctive query, where each 𝐱i{\bf x}_{i} is a tuple of variables and constants, and let 𝐚q(){\bf a}\in q(\mathcal{I}) be an answer to qq on \mathcal{I}. Let vars(qq) and cons(qq) be the sets of variables and constants occurring in qq. A function f:vars(q)cons(q)vals()f:vars(q)\cup cons(q)\rightarrow vals({\mathcal{I}}) is a witnessing assignment to 𝐚\bf{a} if the following hold: f(𝐳)=𝐚f({\bf z})={\bf a}; if xjx_{j} is a constant in qq, then f(xj)=xjf(x_{j})=x_{j}; and if Ri(x1,,xn)R_{i}(x_{1},\cdots,x_{n}) is an atom of qq, then Ri(f(x1),,f(xn))R_{i}(f(x_{1}),\cdots,f(x_{n})) is a fact of \mathcal{I}. We say that a set SS of facts from \mathcal{I} is a witness to 𝐚{\bf a} if there is a witnessing assignment ff to 𝐚\bf a such that S={Ri(f(x1),,f(xn)):Ri(x1,,xn)S=\{R_{i}(f(x_{1}),\cdots,f(x_{n})):R_{i}(x_{1},\cdots,x_{n}) is an atom of q}q\}.

Note that two distinct witnessing assignments to an answer may give rise to the same witness. Thus, we consider the bag of witnesses to an answer, i.e., the bag consisting of witnesses arising from all witnessing assignments to that answer, where each witness SS is accompanied by its multiplicity, an integer denoting the number of witnessing assignments that gave rise to SS. Finally, we define the bag of witnesses to a conjunctive query as the bag union of the bags of witnesses over all answers to qq on \mathcal{I} (in the bag union the multiplicities of the same set are added). The bag of witnesses to a union q:=q1qkq:=q_{1}\cup\cdots\cup q_{k} of conjunctive queries is the bag union of the bags of witnesses to each conjunctive query qiq_{i} in qq. The bag of witnesses will be used in computing the range consistent answers to aggregation queries. In effect, the bag of witnesses corresponds to the provenance polynomials of conjunctive queries and their unions (Green et al., 2007; Karvounarakis and Green, 2012).

It is easy to verify that both the key-equal groups and the bag of witnesses can be computed using SQL queries. In Section 4.1 and Section 4.2, we give reductions to compute the range consistent answers to aggregation queries without grouping, and in Section 4.3, we describe an iterative algorithm that uses these reductions to handle aggregation queries with grouping.
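For example, assuming a relation R1(A, B) with key A stored in SQLite, a GROUP BY query over the key attributes returns one row per key-equal group; the table name, column names, and sample rows below are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R1 (A TEXT, B TEXT)")  # key attribute: A
conn.executemany("INSERT INTO R1 VALUES (?, ?)",
                 [("x", "red"), ("x", "blue"), ("y", "red")])

# One row per key-equal group: the key value and the group's size.
groups = conn.execute(
    "SELECT A, COUNT(*) FROM R1 GROUP BY A ORDER BY A"
).fetchall()
# groups == [('x', 2), ('y', 1)]; the group with key 'x' violates the key.
```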

4.1. Answering Queries with SUM and COUNT

Let \mathcal{R} be a database schema with one key constraint per relation, and QQ be the aggregation query

Q:=SELECT f FROM T(U,A),Q:=\;\texttt{SELECT }f\texttt{ FROM }T(U,A),


where ff is one of the operators COUNT(*), COUNT(AA), SUM(AA), and T(U,A)T(U,A) is a relation expressed as a union of conjunctive queries over \mathcal{R}. We now reduce the computation of the range consistent answers Cons(Q)\mbox{\sc{Cons}}(Q) of QQ to PMaxSAT and to WPMaxSAT.

4.1.1. Reductions to PMaxSAT and WPMaxSAT

Reduction 4.1.

Let Q:=SELECT f FROM T(U,A)Q:=\texttt{SELECT $f$ FROM }T(U,A) be an aggregation query, where ff is one of the operators COUNT(*), COUNT(AA), and SUM(AA). Let \mathcal{I} be an \mathcal{R}-instance and 𝒢\mathcal{G} be the set of key-equal groups of facts of \mathcal{I}. For each fact fif_{i} of \mathcal{I}, introduce a boolean variable xix_{i}. Let 𝒲\mathcal{W} be the bag of witnesses to the query qq^{*} on \mathcal{I}, where

q:={UAT(U,A)if f is COUNT(*)UT(U,A)if f is COUNT(A) or SUM(A).q^{*}:=\begin{cases}\exists U\exists A\;T(U,A)&\text{if }f\text{ is }\texttt{COUNT(*)}\\ \exists U\;T(U,A)&\text{if }f\text{ is }\texttt{COUNT($A$)}\text{ or }\texttt{SUM($A$)}.\end{cases}

Construct a partial CNF-formula ϕ\phi (if ff is COUNT(*) or COUNT(AA)) or a weighted partial CNF-formula ϕ\phi (if ff is SUM(AA)) as follows:

  1. (1)

    For each Gj𝒢G_{j}\in\mathcal{G},

    • construct a hard clause αj:=fiGjxi\alpha_{j}:=\underset{f_{i}\in G_{j}}{\lor}x_{i}.

    • for each pair (fm,fn)(f_{m},f_{n}) of facts in GjG_{j} such that mnm\neq n, construct a hard clause αjmn:=(¬xm¬xn)\alpha_{j}^{mn}:=(\neg x_{m}\lor\neg x_{n}).

  2. (2a)

    If ff is COUNT(*) or COUNT(AA), then for each witness Wj𝒲W_{j}\in\mathcal{W}, construct a soft clause βj\beta_{j}, where

    βj=(fiWj¬xi,mj).\beta_{j}=\bigg{(}\underset{f_{i}\in W_{j}}{\lor}\neg x_{i},m_{j}\bigg{)}.

    Construct a partial CNF-instance

    ϕ=(j=1|𝒢|αj)(j=1|𝒢|(fm𝒢jfn𝒢jαjmn))(j=1|𝒲|βj).\phi=\bigg{(}\overset{|\mathcal{G}|}{\underset{j=1}{\land}}\alpha_{j}\bigg{)}\land\bigg{(}\overset{|\mathcal{G}|}{\underset{j=1}{\land}}\bigg{(}\bigwedge_{\begin{subarray}{c}f_{m}\in\mathcal{G}_{j}\\ f_{n}\in\mathcal{G}_{j}\end{subarray}}\alpha^{mn}_{j}\bigg{)}\bigg{)}\land\bigg{(}\overset{|\mathcal{W}|}{\underset{j=1}{\land}}\beta_{j}\bigg{)}.
  3. (2b)

    If ff is SUM(AA), let 𝒲P\mathcal{W}_{P} and 𝒲N\mathcal{W}_{N} be the subsets of 𝒲\mathcal{W} such that for each Wj𝒲W_{j}\in\mathcal{W}, we have Wj𝒲PW_{j}\in\mathcal{W}_{P} iff q(Wj)>0q^{*}(W_{j})>0, and Wj𝒲NW_{j}\in\mathcal{W}_{N} iff q(Wj)<0q^{*}(W_{j})<0. Let also wj=mjq(Wj)w_{j}=m_{j}*||q^{*}(W_{j})||, where q(Wj)||q^{*}(W_{j})|| is the absolute value of q(Wj)q^{*}(W_{j}). Construct a weighted soft clause βj\beta_{j} and a conjunction γj\gamma_{j} of hard clauses as follows. If Wj𝒲𝒩W_{j}\in\mathcal{W_{N}}, introduce a new variable yjy_{j} and let

    βj\displaystyle\beta_{j} =(yj,wj) and\displaystyle=(y_{j},w_{j})\text{ and }
    γj\displaystyle\gamma_{j} =((fiWj¬xi)yj)(fiWj(¬yjxi));\displaystyle=\bigg{(}\Big{(}\underset{f_{i}\in W_{j}}{\lor}\neg x_{i}\Big{)}\lor y_{j}\bigg{)}\land\bigg{(}\underset{f_{i}\in W_{j}}{\land}(\neg y_{j}\lor x_{i})\bigg{)};

    otherwise, let βj=(fiWj¬xi,wj)\beta_{j}=\bigg{(}\underset{f_{i}\in W_{j}}{\lor}\neg x_{i},w_{j}\bigg{)} and do not construct γj\gamma_{j}.
    Construct a weighted partial CNF-instance

    ϕ=(j=1|𝒢|αj)(j=1|𝒢|(fm𝒢jfn𝒢jαjmn))(j=1|𝒲|βj)(Wj𝒲𝒩γj).\phi=\bigg{(}\overset{|\mathcal{G}|}{\underset{j=1}{\land}}\alpha_{j}\bigg{)}\land\bigg{(}\overset{|\mathcal{G}|}{\underset{j=1}{\land}}\bigg{(}\bigwedge_{\begin{subarray}{c}f_{m}\in\mathcal{G}_{j}\\ f_{n}\in\mathcal{G}_{j}\end{subarray}}\alpha^{mn}_{j}\bigg{)}\bigg{)}\land\bigg{(}\overset{|\mathcal{W}|}{\underset{j=1}{\land}}\beta_{j}\bigg{)}\land\bigg{(}{\underset{W_{j}\in\mathcal{W_{N}}}{\land}}\gamma_{j}\bigg{)}.

Purpose of the components of ϕ\phi in Reduction 4.1

  • Each αj\alpha_{j}-clause encodes the “at-least-one” constraint for each key-equal group GjG_{j} in the sense that satisfying αj\alpha_{j} requires setting at least one variable corresponding to a fact in GjG_{j} to true. Similarly, each αjmn\alpha_{j}^{mn}-clause encodes the “at-most-one” constraint for GjG_{j}. In effect, every assignment that satisfies all α\alpha-clauses sets exactly one variable corresponding to the facts from each key-equal group to true, and thus uniquely corresponds to a repair of \mathcal{I}.

  • Satisfying a βj\beta_{j}-clause constructed in Step 2a requires setting at least one variable corresponding to the facts of a witness WjW_{j} to qq^{*} on \mathcal{I} to false. Thus, if ss is an assignment that satisfies all α\alpha-clauses, then βj\beta_{j} is satisfied by ss if and only if Wj𝒥W_{j}\not\in\mathcal{J}, where 𝒥\mathcal{J} is a repair corresponding to ss.

  • The βj\beta_{j}-clauses constructed in Step 2b serve the same purpose as the ones from Step 2a, but here they are constructed only for the witnesses in 𝒲P\mathcal{W}_{P}. For the witnesses in 𝒲N\mathcal{W}_{N}, the βj\beta_{j}-clauses encode the condition that βj\beta_{j} is satisfied if and only if all variables corresponding to the facts in WjW_{j} are set to true. The hard γj\gamma_{j}-clauses are used solely to express the equivalence yj(fiWjxi)y_{j}\leftrightarrow(\underset{f_{i}\in W_{j}}{\land}x_{i}) in conjunctive normal form.

The number of α\alpha-clauses is O(n)O(n), where nn is the size of the database; the number of β\beta-clauses and γ\gamma-clauses combined is O(nk)O(n^{k}), where kk is the number of relation symbols in QQ.
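Steps (1) and (2a) of Reduction 4.1 can be sketched as follows; facts are identified with positive integers, and the (literals, weight) clause representation with `None` marking hard clauses is a convention assumed here for illustration.

```python
from itertools import combinations

def encode_count(groups, witnesses):
    """PMaxSAT encoding of Reduction 4.1 for COUNT(*)/COUNT(A).

    groups: list of lists of fact indices (the key-equal groups).
    witnesses: list of (fact-index list, multiplicity) pairs.
    Returns (literals, weight) clauses; weight None marks a hard clause.
    """
    clauses = []
    for group in groups:
        # "At-least-one" hard clause alpha_j.
        clauses.append((list(group), None))
        # "At-most-one" hard clauses alpha_j^{mn}.
        for m, n in combinations(group, 2):
            clauses.append(([-m, -n], None))
    for facts, mult in witnesses:
        # Soft clause beta_j, weighted by the witness multiplicity.
        clauses.append(([-x for x in facts], mult))
    return clauses
```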

Proposition 4.1.

Let Q:=SELECT f FROM T(U,A)Q:=\texttt{SELECT $f$ FROM }T(U,A) be an aggregation query, where ff is one of the operators COUNT(*), COUNT(AA), and SUM(AA). In a maximum (a minimum) satisfying assignment of the WPMaxSAT-instance ϕ\phi constructed using Reduction 4.1, the sum of weights of the falsified clauses is the glb-answer (lub-answer) in the range consistent answers Cons(Q)\mbox{\sc{Cons}}(Q) on \mathcal{I}.

Proof.

Let Q:=SELECT COUNT(*) FROM T(U,A)Q:=\texttt{SELECT COUNT(*) FROM }T(U,A), and let ss be an assignment of the formula ϕ\phi constructed using Reduction 4.1. Let g(ϕ,s)g(\phi,s) denote the sum of weights of the soft clauses of ϕ\phi satisfied by ss. Construct a database sub-instance 𝒥\mathcal{J} from ss such that fi𝒥f_{i}\in\mathcal{J} if and only if s(xi)=1s(x_{i})=1. The hard clauses of ϕ\phi constructed in Step (1) of Reduction 4.1 encode the condition that exactly one fact from each key-equal group of facts of \mathcal{I} is in 𝒥\mathcal{J}, ensuring that 𝒥\mathcal{J} is a repair of \mathcal{I}. Moreover, the soft clauses of ϕ\phi falsified by ss have a one-to-one correspondence with the witnesses to qq^{*} in 𝒥\mathcal{J}. Therefore, we have that g(ϕ,s)=|𝒲|Q(𝒥)g(\phi,s)=|\mathcal{W}|-Q(\mathcal{J}). Since |𝒲||\mathcal{W}| does not depend on 𝒥\mathcal{J}, the answer Q(𝒥)Q(\mathcal{J}) is minimized (i.e., Q(𝒥)Q(\mathcal{J}) is a glbglb-answer in Cons(Q)\mbox{\sc{Cons}}(Q) on \mathcal{I}) when ss is a maximum satisfying assignment. Essentially the same argument works for the case where Q:=SELECT COUNT(A) FROM T(U,A)Q:=\texttt{SELECT COUNT($A$) FROM }T(U,A). A dual argument to this proves that a repair of \mathcal{I} constructed from a minimum satisfying assignment of ϕ\phi realizes the lublub-answer in Cons(Q)\mbox{\sc{Cons}}(Q) on \mathcal{I}.

Now, let Q:=SELECT SUM(A) FROM T(U,A)Q:=\texttt{SELECT SUM($A$) FROM }T(U,A), and let ss be an assignment of ϕ\phi that satisfies all hard clauses. Construct a repair 𝒥\mathcal{J} of \mathcal{I} from ss by choosing fi𝒥f_{i}\in\mathcal{J} if and only if s(xi)=1s(x_{i})=1. Also, construct a database instance p\mathcal{I}_{p} as follows. For every fact ff\in\mathcal{I}, let fpf\in\mathcal{I}_{p} if and only if fWjf\in W_{j} for some Wj𝒲PW_{j}\in\mathcal{W}_{P}. Thus, Q(p)Q(\mathcal{I}_{p}) is the sum of values of the aggregation attribute evaluated on the witnesses in 𝒲P\mathcal{W}_{P}. Observe that, for every Wj𝒲PW_{j}\in\mathcal{W}_{P}, the clause βj\beta_{j} is falsified by ss if and only if Wj𝒥W_{j}\in\mathcal{J}. Similarly, for every Wj𝒲NW_{j}\in\mathcal{W}_{N}, the clause βj\beta_{j} is satisfied by ss if and only if Wj𝒥W_{j}\in\mathcal{J}. Therefore, we have that

Q(𝒥)\displaystyle Q(\mathcal{J}) =Q(p)ΣWj𝒲Ps(βj)=1(wj)ΣWj𝒲Ns(βj)=1(wj)\displaystyle=Q(\mathcal{I}_{p})-\Sigma_{W_{j}\in\mathcal{W}_{P}\land s(\beta_{j})=1}(w_{j})-\Sigma_{W_{j}\in\mathcal{W}_{N}\land s(\beta_{j})=1}(w_{j})
=Q(p)g(ϕ,s)\displaystyle=Q(\mathcal{I}_{p})-g(\phi,s)

Since Q(p)Q(\mathcal{I}_{p}) does not depend on 𝒥\mathcal{J}, the answer Q(𝒥)Q(\mathcal{J}) is minimized (i.e., Q(𝒥)Q(\mathcal{J}) is a glbglb-answer in Cons(Q)\mbox{\sc{Cons}}(Q) on \mathcal{I}) when ss is a maximum satisfying assignment. A dual argument to this proves that a repair of \mathcal{I} constructed from a minimum satisfying assignment of ϕ\phi realizes the lublub-answer in Cons(Q)\mbox{\sc{Cons}}(Q) on \mathcal{I}. ∎
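On toy instances, Proposition 4.1 can be sanity-checked against the definition of the range consistent answers by enumerating all repairs directly; the following brute-force sketch (for the COUNT case, with facts as integers and witnesses paired with their multiplicities) is feasible only for very small databases.

```python
from itertools import product

def range_count_answers(groups, witnesses):
    """Enumerate all repairs (one fact per key-equal group) and return
    (glb, lub) of the number of surviving witnesses, counted with
    multiplicity."""
    counts = []
    for choice in product(*groups):
        repair = set(choice)
        # A witness survives iff all of its facts survive in the repair.
        counts.append(sum(m for facts, m in witnesses
                          if set(facts) <= repair))
    return min(counts), max(counts)
```

On the key-equal groups and witnesses of Example 4.2, this enumeration returns (1, 2), matching the range consistent answers computed there.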

Example 4.2.

Let \mathcal{I} be the database instance from Table 1, and let QQ be the following aggregation query, which counts the number of customers who have an account in their own city:
    SELECT COUNT(*)
    FROM CUST, ACC, CUSTACC
    WHERE CUST.CID = CUSTACC.CID
     AND ACC.ACCID = CUSTACC.ACCID
     AND CUST.CITY = ACC.CITY
From Reduction 4.1, we construct the following clauses:

  • α\alpha-clauses: x1,(x2x3),x4,x5,x6,x7,(x8x9),x10x_{1},(x_{2}\lor x_{3}),x_{4},x_{5},x_{6},x_{7},(x_{8}\lor x_{9}),x_{10};

  • αmn\alpha^{mn}-clauses: (¬x2¬x3),(¬x8¬x9)(\neg x_{2}\lor\neg x_{3}),(\neg x_{8}\lor\neg x_{9});

  • β\beta-clauses: (¬x1¬x6,1),(¬x2¬x7,1),(¬x3¬x9,1)(\neg x_{1}\lor\neg x_{6},1),(\neg x_{2}\lor\neg x_{7},1),(\neg x_{3}\lor\neg x_{9},1).

Observe that we may omit the variables corresponding to the facts in CUSTACC since CUSTACC does not violate Σ\Sigma. A maximum satisfying assignment to the PMaxSAT instance ϕ\phi constructed from the above clauses is xi=0x_{i}=0 for i{2,9}i\in\{2,9\}, and xi=1x_{i}=1 otherwise. It falsifies one clause, namely, (¬x1¬x6,1)(\neg x_{1}\lor\neg x_{6},1). Similarly, the assignment xi=0x_{i}=0 for i{2,8}i\in\{2,8\}, and xi=1x_{i}=1 otherwise is a minimum satisfying assignment to the PMinSAT instance ϕ\phi, and it falsifies two clauses, namely, (¬x1¬x6,1)(\neg x_{1}\lor\neg x_{6},1) and (¬x3¬x9,1)(\neg x_{3}\lor\neg x_{9},1). Thus, by Proposition 4.1, Cons(Q,)\mbox{\sc{Cons}}(Q,\mathcal{I}) w.r.t. range semantics is [1,2][1,2].

Example 4.3.

Let us again consider the database instance \mathcal{I} from Table 1, and the following aggregation query QQ:
    SELECT SUM(ACC.BAL)
    FROM CUST, ACC, CUSTACC
    WHERE CUST.CID = CUSTACC.CID
     AND ACC.ACCID = CUSTACC.ACCID
     AND CUST.CNAME = ‘Mary’
The hard clauses constructed using Reduction 4.1 are the same as those in Example 4.2. The remaining clauses are as follows:

  • β\beta-clauses: (¬x2¬x7,1000)(\neg x_{2}\lor\neg x_{7},1000), (¬x3¬x7,1000)(\neg x_{3}\lor\neg x_{7},1000), (¬x2¬x8,1200)(\neg x_{2}\lor\neg x_{8},1200), (¬x3¬x8,1200)(\neg x_{3}\lor\neg x_{8},1200), (y1,100)(y_{1},100), (y2,100)(y_{2},100).

  • γ\gamma-clauses: (¬x2¬x9y1)(\neg x_{2}\lor\neg x_{9}\lor y_{1}), (¬y1x2)(\neg y_{1}\lor x_{2}), (¬y1x9)(\neg y_{1}\lor x_{9}), (¬x3¬x9y2)(\neg x_{3}\lor\neg x_{9}\lor y_{2}), (¬y2x3)(\neg y_{2}\lor x_{3}), (¬y2x9)(\neg y_{2}\lor x_{9}).

The witnesses {f2,f9,f13}\{f_{2},f_{9},f_{13}\} and {f3,f9,f13}\{f_{3},f_{9},f_{13}\} belong to 𝒲N\mathcal{W}_{N} because the account balance is -100 in both cases, so we introduce new variables y1y_{1} and y2y_{2} respectively, and construct hard γ\gamma-clauses as described above. The β\beta-clauses corresponding to these witnesses are (y1,100)(y_{1},100) and (y2,100)(y_{2},100). We omit x13x_{13} in all of these clauses since CUSTACC does not violate Σ\Sigma. Note that Q(p)=4400Q(\mathcal{I}_{p})=4400. An assignment in which x8=0x_{8}=0 and x9=1x_{9}=1 is a maximum satisfying assignment to the WPMaxSAT instance ϕ\phi constructed. The sum of the weights of the soft clauses satisfied by this assignment is 3500, since it satisfies two clauses with weight 1200 each, one with weight 1000, and one with weight 100. Thus, by Proposition 4.1, we have that Q(𝒥)=44003500=900Q(\mathcal{J})=4400-3500=900, where 𝒥\mathcal{J} is a repair corresponding to the assignment under consideration. Similarly, setting x8=1x_{8}=1 and x9=0x_{9}=0 yields a minimum satisfying assignment in which the sum of the weights of the satisfied soft clauses is 2200, since it satisfies one clause with weight 1200 and one with weight 1000, indicating that Cons(Q,)=[900,2200]\mbox{\sc{Cons}}(Q,\mathcal{I})=[900,2200].

4.1.2. Handling DISTINCT

Let Q:=SELECT f FROM T(U,A)Q:=\texttt{SELECT $f$ FROM }T(U,A) be an aggregation query, where ff is either COUNT(DISTINCT AA) or SUM(DISTINCT AA). Solving a PMaxSAT or a WPMaxSAT instance constructed using Reduction 4.1 may yield incorrect glbglb and lublub answers to QQ, if the database contains multiple witnesses with the same value for attribute AA. For example, consider the database instance \mathcal{I} from Table 1, and a query

Q:=SELECT COUNT(DISTINCT ACC.TYPE) FROM ACCQ:=\texttt{SELECT COUNT(DISTINCT ACC.TYPE) FROM ACC}


The correct glbglb- and lublub-answers in Cons(Q,)\mbox{\sc{Cons}}(Q,\mathcal{I}) are both 2, but the solutions to the PMaxSAT and PMinSAT instances constructed using Reduction 4.1 yield 4 for both answers. The reason is that the soft clauses ¬x6\neg x_{6} and ¬x7\neg x_{7} both correspond to the account type Checking, and similarly ¬x8\neg x_{8}, ¬x9\neg x_{9}, and ¬x10\neg x_{10} all correspond to the account type Saving. The hard clauses in the formula ensure that x6x_{6}, x7x_{7}, x10x_{10}, and one of x8x_{8} and x9x_{9} are true, thus counting both the Checking and Saving account types exactly twice in every satisfying assignment to the formula. This can be handled by modifying the β\beta-clauses in Reduction 4.1 as follows.

Let 𝒜\mathcal{A} denote the set of distinct answers to the query q(A):=UT(U,A)q^{*}(A):=\exists U\;T(U,A). For each answer b𝒜b\in\mathcal{A}, let 𝒲b\mathcal{W}^{b} denote the subset of 𝒲\mathcal{W} such that for every witness W𝒲bW\in\mathcal{W}^{b}, we have that q(W)=bq^{*}(W)=b. The idea is to use auxiliary variables to construct one soft clause for every distinct answer b𝒜b\in\mathcal{A}, such that it is true if and only if no witness in 𝒲b\mathcal{W}^{b} is present in a repair corresponding to the satisfying assignment. First, for every witness Wjb𝒲bW^{b}_{j}\in\mathcal{W}^{b}, we introduce an auxiliary variable zjbz^{b}_{j} that is true if and only if WjbW^{b}_{j} is not present in the repair. Then, we introduce an auxiliary variable vbv^{b} that is true if and only if all zbz^{b}-variables are true. These constraints are encoded in the set HbH^{b} returned by Algorithm 1, and are enforced by making the clauses in HbH^{b} hard. For every answer b𝒜b\in\mathcal{A}, Algorithm 1 also returns one βb\beta^{b}-clause, which serves the same purpose as the β\beta-clauses in Reduction 4.1. Now, a PMaxSAT or a WPMaxSAT instance can be constructed by taking in conjunction all α\alpha-clauses from the key-equal groups, the hard γ\gamma-clauses if any, the hard clauses from all HbH^{b}-sets, and all soft βb\beta^{b}-clauses. With this, it is easy to see that a maximum (or minimum) satisfying assignment to the PMaxSAT or WPMaxSAT instance gives us the glbglb-answer (or lublub-answer) in Cons(Q)\mbox{\sc{Cons}}(Q). This is illustrated in Example 4.4.

Algorithm 1 Handling DISTINCT
1:procedure handleDistinct(𝒲b\mathcal{W}^{b})
2:  let Hb=H^{b}=\emptyset // empty set of clauses
3:  for Wjb𝒲bW^{b}_{j}\in\mathcal{W}^{b} do
4:   Hb=Hb{(¬zjb(fiWjb¬xi))}H^{b}=H^{b}\bigcup\Big{\{}\Big{(}\neg z^{b}_{j}\lor\Big{(}\underset{f_{i}\in W^{b}_{j}}{\bigvee}\neg x_{i}\Big{)}\Big{)}\Big{\}}
5:   for fiWjbf_{i}\in W^{b}_{j} do
6:     Hb=Hb{(zjbxi)}H^{b}=H^{b}\bigcup\;\{(z^{b}_{j}\lor x_{i})\}      
7:  Hb=Hb{(vb(Wjb𝒲b¬zjb))}H^{b}=H^{b}\bigcup\Big{\{}\Big{(}v^{b}\lor\Big{(}\underset{W^{b}_{j}\in\mathcal{W}^{b}}{\bigvee}\neg z^{b}_{j}\Big{)}\Big{)}\Big{\}}
8:  for Wjb𝒲bW^{b}_{j}\in\mathcal{W}^{b} do
9:   Hb=Hb{(¬vbzjb)}H^{b}=H^{b}\bigcup\;\{(\neg v^{b}\lor z^{b}_{j})\}   
10:  let βb=(vb,1)\beta^{b}=(v^{b},1)
11:  if (ff is SUM(DISTINCT AA)then
12:   βb=(vb,b)\beta^{b}=(v^{b},||b||)
13:   if b<0b<0 then βb=(¬vb,b)\beta^{b}=(\neg v^{b},||b||)      
14:  return Hb,βbH^{b},\beta^{b}
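Algorithm 1 can be transcribed almost line by line. The following Python sketch (an illustration only; the AggCAvSAT system itself is written in Java) builds the hard clauses HbH^{b} and the soft clause βb\beta^{b} in DIMACS-style integer notation. The mapping x_var from facts to variable indices and the supply fresh of unused indices are assumptions managed by the caller:

```python
def handle_distinct(witnesses_b, x_var, fresh, weight=1):
    """Build the hard clauses H^b and the soft clause beta^b for one
    distinct answer b, following Algorithm 1.

    witnesses_b : list of witnesses, each a list of fact ids
    x_var       : dict mapping fact id -> DIMACS variable for x_i
    fresh       : iterator yielding unused positive variable indices
    """
    hard = []
    z_vars = []
    for witness in witnesses_b:
        z = next(fresh)                       # z_j^b: witness j is absent
        z_vars.append(z)
        # z_j -> some fact of the witness is absent
        hard.append([-z] + [-x_var[f] for f in witness])
        # each absent fact forces z_j
        for f in witness:
            hard.append([z, x_var[f]])
    v = next(fresh)                           # v^b: answer b is absent
    # all z_j true -> v  (i.e., the clause v OR some ¬z_j)
    hard.append([v] + [-z for z in z_vars])
    # v -> every z_j
    for z in z_vars:
        hard.append([-v, z])
    soft = ([v], weight)                      # the beta^b soft clause
    return hard, soft
```

Applied to the two singleton witnesses of the answer ‘Checking’ in Example 4.4 (facts f6 and f7), the hard clauses reduce exactly to the clauses of H^{a_1} once the z-variables are eliminated, since each z-variable is then equivalent to a single negated literal.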
Example 4.4.

Consider the following aggregation query QQ on the database instance \mathcal{I} from Table 1:
    SELECT COUNT(DISTINCT ACC.TYPE) FROM ACC
We have that 𝒜={‘Checking’,‘Saving’}\mathcal{A}=\{\text{`Checking'},\text{`Saving'}\}. Let us denote these two answers by a1a_{1} and a2a_{2} respectively. Since every witness to the query consists of a single fact, every yay^{a}-variable is equivalent to a single literal, for example, y1a1¬x6y^{a_{1}}_{1}\leftrightarrow\neg x_{6} and y2a1¬x7y^{a_{1}}_{2}\leftrightarrow\neg x_{7}. As a result, it is unnecessary to introduce any yay^{a}-variables at all. Thus, we construct the following clauses from Reduction 4.1 and Algorithm 1:

  • α\alpha-clauses: x6,x7,(x8x9),x10x_{6},x_{7},(x_{8}\lor x_{9}),x_{10}; αmn\alpha^{mn}-clauses: (¬x8¬x9)(\neg x_{8}\lor\neg x_{9});

  • Ha1:(x6x7va1),(¬va1¬x6),(¬va1¬x7)H^{a_{1}}:(x_{6}\lor x_{7}\lor v^{a_{1}}),(\neg v^{a_{1}}\lor\neg x_{6}),(\neg v^{a_{1}}\lor\neg x_{7});

  • Ha2:(x8x9x10va2),(¬va2¬x8),(¬va2¬x9),(¬va2¬x10)H^{a_{2}}:(x_{8}\lor x_{9}\lor x_{10}\lor v^{a_{2}}),(\neg v^{a_{2}}\lor\neg x_{8}),(\neg v^{a_{2}}\lor\neg x_{9}),(\neg v^{a_{2}}\lor\neg x_{10});

  • β\beta-clauses: (va1,1),(va2,1)(v^{a_{1}},1),(v^{a_{2}},1)

The maximum and minimum satisfying assignments to the PMaxSAT and PMinSAT instances constructed using these clauses falsify both β\beta-clauses, since Cons(Q,)\mbox{\sc{Cons}}(Q,\mathcal{I}) w.r.t. range semantics is [2,2][2,2].
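The range [2,2][2,2] can be confirmed by brute force. The sketch below reconstructs the relevant fragment of the ACC relation from the clauses above (an assumption about Table 1: facts f6 and f7 of type ‘Checking’, facts f8, f9, f10 of type ‘Saving’, with f8 and f9 key-equal) and enumerates all repairs directly:

```python
from itertools import product

# Key-equal groups of the ACC relation, as reconstructed from the
# alpha-clauses of Example 4.4 (an assumption about Table 1):
groups = [[('f6', 'Checking')],
          [('f7', 'Checking')],
          [('f8', 'Saving'), ('f9', 'Saving')],
          [('f10', 'Saving')]]

# A repair keeps exactly one fact from every key-equal group.
counts = [len({typ for _, typ in repair})     # COUNT(DISTINCT ACC.TYPE)
          for repair in product(*groups)]

glb, lub = min(counts), max(counts)
print([glb, lub])   # → [2, 2]
```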

4.2. Answering Queries with MIN and MAX

Let \mathcal{R} be a database schema with one key constraint per relation, and QQ be the aggregation query

Q:=SELECT f FROM T(U,A),Q:=\;\texttt{SELECT }f\texttt{ FROM }T(U,A),


where ff is one of the operators MIN(AA) and MAX(AA), and T(U,A)T(U,A) is a relation expressed as a union of conjunctive queries over \mathcal{R}. The semantics of the range consistent answers to aggregation queries with the MIN and MAX operators are similar to those of aggregation queries with the SUM or COUNT operators, but one additional special case needs to be addressed. We illustrate this special case using Example 4.5.

Example 4.5.

Consider the database instance \mathcal{I} from Table 1 and two aggregation queries Q1Q_{1} and Q2Q_{2} as follows.

Q1:=Q_{1}:= SELECT SUM(ACCOUNTS.BAL) FROM ACCOUNTS
    WHERE ACCOUNTS.CITY = ‘SF’

Q2:=Q_{2}:= SELECT MIN(ACCOUNTS.BAL) FROM ACCOUNTS
    WHERE ACCOUNTS.CITY = ‘SF’

It is clear that the range consistent answer to Q1Q_{1} is Cons(Q1,)=[100,0]\mbox{\sc{Cons}}(Q_{1},\mathcal{I})=[-100,0]. The lub-answer of 0 in Cons(Q1,)\mbox{\sc{Cons}}(Q_{1},\mathcal{I}) comes from a repair on which there is no account in the city of SF, and therefore the SUM function returns 0. For Q2Q_{2}, however, the range consistent answers are unclear because the MIN function is not defined on empty sets.

In such scenarios, several different semantics can be considered. One natural semantics is that if there exists a repair on which the underlying conjunctive query evaluates to the empty set, then the aggregation query has no consistent answer. Another is to return the interval [glb, lub] of values that come from the repairs on which the underlying conjunctive query evaluates to a non-empty set of answers, and to additionally report whether there exists a repair on which the underlying conjunctive query returns the empty set of answers. The reductions we give in this section can be used regardless of which of these two semantics is chosen.

In what follows, we first show that the glb-answer to an aggregation query with the MIN(AA) operator can be computed in polynomial time in the size of the original inconsistent database instance \mathcal{I} (Proposition 4.6). We then give an iterative SAT-based approach to compute the lub-answer to an aggregation query with the MIN(AA) operator. We do not explicitly state methods to obtain the range consistent answers to aggregation queries with the MAX(AA) operator, since MIN and MAX are dual: the lub-answer for the MIN(AA) operator can be obtained by negating all values of the aggregation attribute AA in the database, computing the glb-answer for the MAX(AA) operator, and negating the result (and similarly for the glb-answer for MIN and the lub-answer for MAX).
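This duality can be checked on any set of repairs, abstracted here as the multisets of AA-values they induce (hypothetical data):

```python
# Repairs abstracted as the multisets of A-values they induce
# (hypothetical example data).
repairs = [[3, 7, 2], [5, 2, 8], [4, 9]]

# lub-answer for MIN(A): the largest value of MIN across all repairs
lub_min = max(min(r) for r in repairs)

# Dual route: negate A, take the glb-answer for MAX(A), negate back
neg = [[-a for a in r] for r in repairs]
glb_max_neg = min(max(r) for r in neg)

assert lub_min == -glb_max_neg
print(lub_min)  # → 4
```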

Proposition 4.6.

Let \mathcal{R} be a database schema, \mathcal{I} an \mathcal{R}-instance, and QQ the aggregation query SELECT MIN(A) FROM T(U,A)\texttt{SELECT MIN($A$) FROM }T(U,A). Let q1q_{1} be the union of conjunctive queries q1(A):=UT(U,A)q_{1}(A):=\exists U\;T(U,A), and let WglbW_{\textit{glb}} be a witness to q1q_{1} on \mathcal{I} such that no two facts in WglbW_{\textit{glb}} are key-equal, and there is no witness WW^{\prime} such that q1(W)<q1(Wglb)q_{1}(W^{\prime})<q_{1}(W_{\textit{glb}}) and no two facts in WW^{\prime} are key-equal. Then, q1(Wglb)q_{1}(W_{\textit{glb}}) is the glb-answer in Cons(Q,)\mbox{\sc{Cons}}(Q,\mathcal{I}).

Proof.

For every witness WW^{\prime} to q1q_{1} on \mathcal{I} such that q1(W)<q1(Wglb)q_{1}(W^{\prime})<q_{1}(W_{\textit{glb}}), we have that no repair of \mathcal{I} contains WW^{\prime}, because WW^{\prime} contains at least two key-equal facts. Moreover, since no two facts in WglbW_{\textit{glb}} are key-equal, there exists a repair 𝒥\mathcal{J} of \mathcal{I} such that Wglb𝒥W_{\textit{glb}}\subseteq\mathcal{J}. Therefore, q1(Wglb)q_{1}(W_{\textit{glb}}) must be the smallest possible answer to QQ on \mathcal{I}, i.e., the glb-answer in Cons(Q,)\mbox{\sc{Cons}}(Q,\mathcal{I}). Since the number of witnesses to q1q_{1} is polynomial in the size of \mathcal{I}, a desired witness WglbW_{\textit{glb}} can be obtained efficiently from the result of evaluating q1q_{1} on \mathcal{I}. ∎
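A minimal sketch of the polynomial-time computation in Proposition 4.6, under the assumption that the witnesses and their aggregation values have already been materialized by evaluating q1q_{1}, and that each fact is identified by its (relation, key) pair:

```python
def glb_min(witnesses):
    """glb-answer for MIN(A), per Proposition 4.6.

    witnesses : list of (facts, value) pairs, where facts is a list of
                (relation, key) identifiers. Since a minimal witness is a
                set of facts, a repeated (relation, key) pair signals two
                distinct key-equal facts, so no repair contains the witness.
    Returns None if no witness qualifies (the empty-answer special case).
    """
    best = None
    for facts, value in witnesses:
        keys = [(rel, key) for rel, key in facts]
        if len(keys) == len(set(keys)):       # no two key-equal facts
            if best is None or value < best:
                best = value
    return best
```

For example, a witness joining two key-equal facts is skipped, and the smallest value among the remaining witnesses is returned.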

To compute the range consistent answers to aggregation queries with MIN(AA) and MAX(AA) operators, we opt for an iterative SAT solving approach. In what follows, we formalize the construction of the SAT instance for the first iteration (Construction 4.1) and give Algorithm 2 that computes the lub-answer in Cons(Q,)\mbox{\sc{Cons}}(Q,\mathcal{I}) by constructing and solving SAT instances in subsequent iterations.

Construction 4.1.

Given an \mathcal{R}-instance \mathcal{I}, construct a CNF formula ϕ\phi as follows. For each fact fif_{i} of \mathcal{I}, introduce a boolean variable xix_{i}. Let 𝒢\mathcal{G} be the set of key-equal groups of facts of \mathcal{I}, and 𝒲={W1,,Wm}\mathcal{W}=\{W_{1},\cdots,W_{m}\} denote the set of minimal witnesses to a conjunctive query qq on \mathcal{I}, where q(w):=uT(u,w)q(w):=\exists\vec{u}\;T(\vec{u},w). Assume that the set 𝒲\mathcal{W} is sorted in ascending order of the answers, i.e., for 1i<m1\leq i<m, we have that q(Wi)q(Wi+1)q(W_{i})\leq q(W_{i+1}).

  • For each Gj𝒢G_{j}\in\mathcal{G}, construct a clause αj=fiGjxi\alpha_{j}=\underset{f_{i}\in G_{j}}{\lor}x_{i}.

  • Construct a CNF formula ϕ=j=1|𝒢|αj\phi=\overset{|\mathcal{G}|}{\underset{j=1}{\land}}\alpha_{j}.

Algorithm 2 Computing the lub-answer in Cons(Q,)\mbox{\sc{Cons}}(Q,\mathcal{I}) for MIN via Iterative SAT
1:procedure LubAnswer-IterativeSAT(ϕ,𝒲\phi,\mathcal{W})
2:  let v=q(W1),j=1v=q(W_{1}),j=1
3:  while j|𝒲|j\leq|\mathcal{W}| do
4:   if v=q(Wj)v=q(W_{j}) then
5:     let ϕ=ϕ(fiWj¬xi)\phi=\phi\land\Big{(}{\underset{f_{i}\in W_{j}}{\lor}}\neg x_{i}\Big{)}
6:     let j=j+1j=j+1
7:   else
8:     if UNSAT(ϕ)(\phi) then
9:      return q(Wj1)q(W_{j-1})
10:     else
11:      let v=q(Wj)v=q(W_{j})           
12:  return q(W|𝒲|)q(W_{|\mathcal{W}|})
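The following sketch mirrors Algorithm 2, with the SAT solver replaced by a brute-force satisfiability check so that it is self-contained (a real implementation would call an incremental SAT solver instead). Clauses are DIMACS-style integer lists, and witnesses are (facts, value) pairs sorted in ascending order of their answers:

```python
from itertools import product

def unsat(clauses, n_vars):
    """Brute-force stand-in for a SAT solver (variables are 1..n_vars)."""
    for bits in product([False, True], repeat=n_vars):
        assign = lambda lit: bits[abs(lit) - 1] ^ (lit < 0)
        if all(any(assign(l) for l in c) for c in clauses):
            return False
    return True

def lub_answer_min(alpha, witnesses, n_vars):
    """lub-answer for MIN, per Algorithm 2.

    alpha     : the alpha-clauses of Construction 4.1
    witnesses : list of (facts, value) pairs, ascending by value
    """
    phi = [list(c) for c in alpha]
    v, j = witnesses[0][1], 0
    while j < len(witnesses):
        facts, val = witnesses[j]
        if v == val:
            phi.append([-x for x in facts])   # forbid witness j
            j += 1
        elif unsat(phi, n_vars):
            return witnesses[j - 1][1]        # lub-answer found
        else:
            v = val                           # keep forbidding larger values
    return witnesses[-1][1]
```

For instance, with key-equal groups {x1, x2} and {x3, x4} and single-fact witnesses x3 → 5, x1 → 10, x2 → 20, the repair {x2, x4} avoids all witnesses of value below 20, so the lub-answer is 20.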
Proposition 4.7.

Let Q:=SELECT MIN(A) FROM T(U,A)Q:=\texttt{SELECT MIN($A$) FROM }T(U,A) be an aggregation query, and \mathcal{I} be a database instance. Algorithm 2 returns the lub-answer in Cons(Q,)\mbox{\sc{Cons}}(Q,\mathcal{I}).

Proof.

The α\alpha-clauses of ϕ\phi ensure that a repair 𝒥\mathcal{J} of \mathcal{I} can be constructed from every assignment ss of ϕ\phi that satisfies the α\alpha-clauses, by arbitrarily choosing from each key-equal group of \mathcal{I} exactly one fact fif_{i} such that s(xi)=1s(x_{i})=1. Let 𝒜={A1,,Alub,,A|𝒜|}\mathcal{A}=\{A_{1},\cdots,A_{lub},\cdots,A_{|\mathcal{A}|}\} denote the set of distinct answers to a conjunctive query q(w):=uT(u,w)q(w):=\exists\vec{u}\;T(\vec{u},w) on \mathcal{I}, where AlubA_{lub} is the lub-answer to QQ on \mathcal{I}. For a witness WW to qq, a clause (fiW¬xi)({\underset{f_{i}\in W}{\lor}}\neg x_{i}) is satisfied by an assignment ss if and only if WW is not present in any repair constructed from ss. At iteration jj of the while-loop, if the formula ϕ\phi is checked for satisfiability (line 8 of Algorithm 2), the formula contains the α\alpha-clauses corresponding to the key-equal groups of \mathcal{I} in conjunction with all clauses corresponding to the minimal witnesses on which qq evaluates to an answer strictly smaller than q(Wj)q(W_{j}). At this point, if the formula ϕ\phi is satisfiable, then there exists a repair 𝒥\mathcal{J} of \mathcal{I} such that q(Wj1)q(𝒥)q(W_{j-1})\notin q(\mathcal{J}), and also for all potential answers Aiq(Wj1)A_{i}\leq q(W_{j-1}), we have that Aiq(𝒥)A_{i}\notin q(\mathcal{J}). On the other hand, if the formula is unsatisfiable, then there exists no repair 𝒥\mathcal{J} of \mathcal{I} such that q(Wj1)q(𝒥)q(W_{j-1})\notin q(\mathcal{J}) and Aiq(𝒥)A_{i}\notin q(\mathcal{J}) for all Aiq(Wj1)A_{i}\leq q(W_{j-1}). Since the clauses are added in the ascending order of the answers, we have that ϕ\phi is satisfiable at iteration jj if and only if q(Wj1)<Alubq(W_{j-1})<A_{lub}. Therefore, if ϕ\phi becomes unsatisfiable for the first time at iteration jj, it must be the case that q(Wj1)q(W_{j-1}) is the lub-answer in Cons(Q,)\mbox{\sc{Cons}}(Q,\mathcal{I}). ∎

Why not a Binary Search?

In essence, Algorithm 2 works like a linear search on a sorted array. It may seem appealing to perform a binary search instead of a linear search, since the SAT solver would then have to solve only O(log2|𝒜|)O(log_{2}\;|\mathcal{A}|) instances of SAT instead of O(|𝒜|)O(|\mathcal{A}|) instances. The problem with this approach is the following. Observe that, on average, half of the SAT instances that the solver needs to solve in the binary search approach are unsatisfiable. In the linear search approach, however, all but the last instance given to the solver are satisfiable. The proofs of unsatisfiability produced by SAT solvers are typically significantly larger than the proofs of satisfiability: the unsatisfiability of an instance must be established by a refutation tree that can be exponential in the size of the formula, while a single satisfying assignment suffices to establish satisfiability. As a result, SAT solvers typically take considerably longer to solve unsatisfiable instances, but they are very quick on most real-world satisfiable instances. Therefore, in practice, the linear search often works better than the binary search.

4.3. Answering Queries with Grouping

Let QQ be the aggregation query

Q:=SELECT Z,f FROM T(U,Z,A) GROUP BY Z,Q:=\;\texttt{SELECT }Z,f\texttt{ FROM }T(U,Z,A)\texttt{ GROUP BY }Z,


where ff is one of COUNT()(*), COUNT(A)(A), SUM(A)(A), MIN(A)(A), or MAX(A)(A), and T(U,Z,A)T(U,Z,A) is a relation expressed by a union of conjunctive queries on \mathcal{R}. We refer to the attributes in ZZ as the grouping attributes. For aggregation queries with grouping, it does not seem feasible to reduce Cons(Q)\mbox{\sc{Cons}}(Q) to a single PMaxSAT or WPMaxSAT instance, because for each group of consistent answers, the glb-answer and the lub-answer may be realized in different repairs of the inconsistent database. To illustrate this, consider the database from Table 1 and a query Q:=SELECT COUNT(*) FROM CUST GROUP BY CUST.CITYQ:=\texttt{SELECT COUNT(*) FROM CUST GROUP BY CUST.CITY}. Notice that the glb-answers (LA, 2) and (SF, 1) in Cons(Q)\mbox{\sc{Cons}}(Q) come from two different repairs of the relation CUST, namely, {f1,f3,f4,f5}\{f_{1},f_{3},f_{4},f_{5}\} and {f1,f2,f4,f5}\{f_{1},f_{2},f_{4},f_{5}\}, respectively. However, the reductions from the preceding section can be used to compute the bounds for each consistent group of answers independently. For a given aggregation query QQ with grouping, we first compute the consistent answers to the underlying conjunctive query q(Z):=U,AT(U,Z,A)q(Z):=\exists U,A\;T(U,Z,A). Then, for each answer bb in Cons(q)\mbox{\sc{Cons}}(q), we compute the glb- and lub-answers to the query Q:=SELECT f FROM T(U,Z,A)(Z=b)Q^{\prime}:=\;\texttt{SELECT }f\texttt{ FROM }T(U,Z,A)\land(Z=b) via PMaxSAT or WPMaxSAT solving, as shown in Algorithm 3.

Algorithm 3 Consistent Answers to Queries With Grouping

Let \mathcal{I} be an inconsistent database instance, and QQ be an aggregation query of the form Q:=SELECT Z,f FROM T(U,Z,A) GROUP BY ZQ:=\;\texttt{SELECT }Z,f\texttt{ FROM }T(U,Z,A)\texttt{ GROUP BY }Z.

 

1:procedure ConsAggGrouping(QQ)
2:  let Aans=A_{ans}=\emptyset
3:  let q(Z):=U,AT(U,Z,A)q(Z):=\exists U,A\;T(U,Z,A)
4:  let 𝒜c=Cons(q,)\mathcal{A}_{c}=\mbox{\sc{Cons}}(q,\mathcal{I})
5:  for b𝒜cb\in\mathcal{A}_{c} do
6:   let Q:=SELECT f FROM T(U,Z,A)(Z=b)Q^{\prime}:=\;\texttt{SELECT }f\texttt{ FROM }T(U,Z,A)\land(Z=b)
7:   let [GLBA,LUBA]=Cons(Q,)[GLB_{A},LUB_{A}]=\mbox{\sc{Cons}}(Q^{\prime},\mathcal{I})
8:   Aans=Aans{(b,[GLBA,LUBA])}A_{ans}=A_{ans}\cup\{(b,[GLB_{A},LUB_{A}])\}   
9:  return AansA_{ans}
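For small instances, the decomposition of Algorithm 3 can be illustrated end to end with brute-force repair enumeration standing in for the PMaxSAT/WPMaxSAT reductions. In this sketch (hypothetical data), each fact is a (group-value, A-value) pair:

```python
from itertools import product

def cons_agg_grouping(groups, agg):
    """Per-group range consistent answers for an aggregation query with
    GROUP BY, via the decomposition of Algorithm 3 (brute force).

    groups : key-equal groups; each fact is a (group_value, A_value) pair
    agg    : aggregate over a list of A-values, e.g. len, sum, min, max
    """
    # Every repair keeps exactly one fact per key-equal group.
    repairs = list(product(*groups))
    # Consistent answers of the underlying query q(Z): group values
    # present in every repair.
    cons_z = set.intersection(*({z for z, _ in r} for r in repairs))
    ans = {}
    for b in cons_z:
        vals = [agg([a for z, a in r if z == b]) for r in repairs]
        ans[b] = [min(vals), max(vals)]       # [glb-answer, lub-answer]
    return ans
```

For example, if one fact of a key-equal group belongs to group ‘SF’ while its key-equal sibling belongs to group ‘LA’, then ‘SF’ is not a consistent answer of q(Z)q(Z) and is dropped, while the count for ‘LA’ gets an interval rather than a single value.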

As noted earlier, the bags of witnesses used in the preceding reductions capture the provenance of unions of conjunctive queries in the provenance polynomials model of (Green et al., 2007; Karvounarakis and Green, 2012). In (Amsterdamer et al., 2011), it was shown that a stronger provenance model is needed to express the provenance of aggregation queries, a model that uses a tensor product combining annotations with values. A future direction of research is to investigate whether this stronger provenance model can be used to produce more direct reductions of the range consistent answers to SAT.

5. Beyond Key Constraints

Key constraints and functional dependencies are important special cases of denial constraints (DCs), which are expressible by first-order formulas of the form x1,,xn¬(φ(x1,,xn)ψ(x1,,xn)),\forall x_{1},...,x_{n}\neg(\varphi(x_{1},...,x_{n})\land\psi(x_{1},...,x_{n})), or, equivalently, x1,,xn(φ(x1,,xn)¬ψ(x1,,xn)),\forall x_{1},...,x_{n}(\varphi(x_{1},...,x_{n})\rightarrow\neg\psi(x_{1},...,x_{n})), where φ(x1,,xn)\varphi(x_{1},...,x_{n}) is a conjunction of atomic formulas and ψ(x1,,xn)\psi(x_{1},...,x_{n}) is a conjunction of expressions of the form (xiopxj)(x_{i}\;\mbox{op}\;x_{j}) with each op a built-in predicate, such as =,,<,>,,=,\neq,<,>,\leq,\geq. In words, a denial constraint prohibits a set of tuples that satisfy certain conditions from appearing together in a database instance. If Σ\Sigma is a fixed finite set of denial constraints and QQ is an aggregation query without grouping, then the following problem is in coNP: given a database instance \mathcal{I} and a number tt, is tt the lub-answer (or the glb-answer) in Cons(Q,)\mbox{\sc{Cons}}(Q,\mathcal{I}) w.r.t. Σ\Sigma? This is so because to check that tt is not the lub-answer (or the glb-answer), we can guess a repair 𝒥\mathcal{J} of \mathcal{I} and verify that t<Q(𝒥)t<Q(\mathcal{J}) (or t>Q(𝒥)t>Q(\mathcal{J})). In all preceding reductions, the α\alpha-clauses capture the inconsistency in the database arising from key violations, and they enforce that every satisfying assignment corresponds to a unique repair of the initial inconsistent database instance. Importantly, the α\alpha-clauses are independent of the input query. In what follows, we provide a way to construct clauses that capture the inconsistency arising from violations of denial constraints. Thus, replacing the α\alpha-clauses in the reductions from Section 4 by the ones provided below allows us to compute consistent answers over databases with a fixed finite set of arbitrary denial constraints. 
The reduction relies on the notions of minimal violations and near-violations to the set of denial constraints that we introduce next.

\bullet  Assume that Σ\Sigma is a set of denial constraints, \mathcal{I} is an \mathcal{R}-instance, and SS is a sub-instance of \mathcal{I}. We say that SS is a minimal violation to Σ\Sigma, if S⊧̸ΣS\not\models\Sigma and for every set SSS^{\prime}\subset S, we have that SΣS^{\prime}\models\Sigma.

\bullet  Let Σ\Sigma be a set of denial constraints, \mathcal{I} an \mathcal{R}-instance, SS a sub-instance of \mathcal{I}, and ff a fact of \mathcal{I}. We say that SS is a near-violation w.r.t. Σ\Sigma and ff if SΣS\models\Sigma and S{f}S\cup\{f\} is a minimal violation to Σ\Sigma. As a special case, if {f}\{f\} itself is a minimal violation to Σ\Sigma, we say that there is exactly one near-violation w.r.t. ff, and it is the singleton {ftrue}\{f_{true}\}, where ftruef_{true} is an auxiliary fact.
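Both notions can be computed by brute force when the maximum number of atoms per denial constraint is a fixed constant dd. In the sketch below, Σ\Sigma is abstracted as a predicate violates on sets of facts (an assumption about the interface; a real implementation would detect violating tuple combinations with queries derived from the constraints):

```python
from itertools import combinations

def minimal_violations(facts, violates, d):
    """All subsets of `facts` of size <= d that are minimal violations:
    the subset violates Sigma but every proper subset satisfies it."""
    out = []
    for k in range(1, d + 1):
        for S in combinations(facts, k):
            if violates(set(S)) and not any(
                    violates(set(T)) for m in range(1, k)
                    for T in combinations(S, m)):
                out.append(set(S))
    return out

def near_violations(facts, violates, d, f):
    """All S such that S satisfies Sigma and S ∪ {f} is a minimal
    violation; minimality of the violation already guarantees that
    S itself satisfies Sigma."""
    if violates({f}):
        return [{'f_true'}]          # the auxiliary-fact special case
    return [V - {f}
            for V in minimal_violations(facts, violates, d) if f in V]
```

With a key constraint expressed as a DC (two facts sharing a key form a violation), the minimal violations are exactly the key-equal pairs, and the near-violations of a fact are its key-equal siblings.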

Let \mathcal{R} be a database schema, Σ\Sigma be a fixed finite set of denial constraints on \mathcal{R}, QQ be an aggregation query without grouping, and \mathcal{I} be an \mathcal{R}-instance.

Reduction 5.1.

Given an \mathcal{R}-instance \mathcal{I}, compute the sets:

  1. (1)

    𝒱\mathcal{V}: the set of minimal violations to Σ\Sigma on \mathcal{I}.

  2. (2)

    𝒩i\mathcal{N}^{i}: the set of near-violations to Σ\Sigma, on \mathcal{I}, w.r.t. each fact fiIf_{i}\in I.

For each fact fif_{i} of \mathcal{I}, introduce a boolean variable xix_{i}, 1in1\leq i\leq n. For the auxiliary fact ftruef_{true}, introduce a constant xtrue=truex_{true}=true, and for each Nji𝒩iN^{i}_{j}\in\mathcal{N}^{i}, introduce a boolean variable pjip^{i}_{j}.

  1. (1)

    For each Vj𝒱V_{j}\in\mathcal{V}, construct a clause αj=fiVj¬xi\alpha_{j}=\underset{f_{i}\in V_{j}}{\lor}\neg x_{i}.

  2. (2)

    For each fiIf_{i}\in I, construct a clause γi=xi(Nji𝒩ipji)\gamma_{i}=x_{i}\lor\bigg{(}\underset{N^{i}_{j}\in\mathcal{N}^{i}}{\lor}p^{i}_{j}\bigg{)}.

  3. (3)

    For each variable pjip^{i}_{j}, construct an expression θji=pji(fdNjixd)\theta^{i}_{j}=p^{i}_{j}\leftrightarrow\bigg{(}\underset{f_{d}\in N^{i}_{j}}{\land}x_{d}\bigg{)}.

  4. (4)

    Construct the following boolean formula ϕ\phi:

    ϕ=(i=1|𝒱|αi)(i=1|I|((j=1|𝒩i|θji)γi)){\phi=\bigg{(}\overset{|\mathcal{V}|}{\underset{i=1}{\land}}\alpha_{i}\bigg{)}\land\bigg{(}\overset{|I|}{\underset{i=1}{\land}}\bigg{(}\Big{(}\overset{|\mathcal{N}^{i}|}{\underset{j=1}{\land}}\theta^{i}_{j}\Big{)}\land\gamma_{i}\bigg{)}\bigg{)}}
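The sketch below assembles ϕ\phi for a toy instance and checks, by exhaustive enumeration, the one-to-one correspondence of Proposition 5.1 between the satisfying assignments (projected to the fact variables) and the repairs. Clauses are DIMACS-style integer lists, the θ\theta-expressions are emitted directly in CNF, and the f_true special case is assumed not to arise (no single fact violates Σ\Sigma):

```python
from itertools import product

def build_phi(n, violations, near):
    """CNF of Reduction 5.1. Facts are 1..n (variable x_i = i);
    violations : list of minimal violations (sets of fact ids)
    near       : dict fact id -> list of near-violations (sets of fact ids)
    Returns (clauses, number of variables); p-variables follow the x's."""
    clauses, nxt = [], n + 1
    for V in violations:                       # alpha-clauses
        clauses.append([-i for i in V])
    for i in range(1, n + 1):
        p_vars = []
        for N in near.get(i, []):
            p = nxt; nxt += 1
            p_vars.append(p)
            # theta: p <-> AND of the x_d, written in CNF
            for d in N:
                clauses.append([-p, d])
            clauses.append([p] + [-d for d in N])
        clauses.append([i] + p_vars)           # gamma-clause
    return clauses, nxt - 1

def models_projected(clauses, n_vars, n_facts):
    """Satisfying assignments, projected to the fact variables."""
    out = set()
    for bits in product([False, True], repeat=n_vars):
        val = lambda l: bits[abs(l) - 1] ^ (l < 0)
        if all(any(val(l) for l in c) for c in clauses):
            out.add(frozenset(i for i in range(1, n_facts + 1) if bits[i - 1]))
    return out
```

For three facts where {f1, f2} is the only minimal violation, the projected models are exactly the two repairs {f1, f3} and {f2, f3}.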
Proposition 5.1.

The boolean formula ϕ\phi constructed using Reduction 5.1 can be transformed into an equivalent CNF formula whose size is polynomial in the size of \mathcal{I}. The satisfying assignments to ϕ\phi and the repairs of \mathcal{I} w.r.t. Σ\Sigma are in one-to-one correspondence.

Proof.

Let nn be the number of facts of \mathcal{I}. Let d1d_{1} be the smallest number such that no denial constraint in Σ\Sigma has more than d1d_{1} database atoms. Also, let d2d_{2} be the smallest number such that no conjunctive query in QQ has more than d2d_{2} database atoms. Since Σ\Sigma and QQ are not part of the input to Cons(QQ), the quantities d1d_{1} and d2d_{2} are fixed constants. We also have that |𝒱|nd1|\mathcal{V}|\leq n^{d_{1}}, |𝒩i|nd1|\mathcal{N}^{i}|\leq n^{d_{1}} for 1in1\leq i\leq n, |𝒜|nd2|\mathcal{A}|\leq n^{d_{2}}, and |𝒲l|nd2|\mathcal{W}^{l}|\leq n^{d_{2}} for 1l|𝒜|1\leq l\leq|\mathcal{A}|. The number of xx-, pp-, and yy-variables in ϕ\phi is therefore bounded by nn, nd1+1n^{d_{1}+1}, and nd2n^{d_{2}}, respectively. The formula ϕ\phi contains as many α\alpha-clauses as |𝒱||\mathcal{V}|, and no α\alpha-clause is longer than nn literals. Similarly, there are at most nd2n^{d_{2}} β\beta-clauses, and none of their lengths exceeds d2+1d_{2}+1. The number of γ\gamma-clauses is precisely nn, and each γ\gamma-clause is at most nd1+1+1n^{d_{1}+1}+1 literals long. There are as many θ\theta-expressions as there are pp-variables. Every θ\theta-expression is of the form p(x1xd)p\leftrightarrow(x_{1}\land...\land x_{d}), where dd is the number of facts in the corresponding near-violation, and hence a constant. Each θ\theta-expression can be equivalently written as a constant number of CNF clauses ((¬px1)(¬pxd))(¬x1¬xdp)((\neg p\lor x_{1})\land...\land(\neg p\lor x_{d}))\land(\neg x_{1}\lor...\lor\neg x_{d}\lor p), in which the length of each clause is constant. Thus, one can transform ϕ\phi into an equivalent CNF formula whose size is polynomial in the size of \mathcal{I}.

For the second part of Proposition 5.1, consider a satisfying assignment ss to ϕ\phi and construct a database instance 𝒥\mathcal{J} such that fi𝒥f_{i}\in\mathcal{J} if and only if s(xi)=1s(x_{i})=1. The α\alpha-clauses assert that no minimal violation to Σ\Sigma is present in 𝒥\mathcal{J}, i.e., 𝒥\mathcal{J} is a consistent subset of \mathcal{I}. The γ\gamma-clauses and the θ\theta-expressions encode the condition that, for every fact ff\in\mathcal{I}, either f𝒥f\in\mathcal{J} or at least one near-violation w.r.t. Σ\Sigma and ff is in 𝒥\mathcal{J}, making sure that 𝒥\mathcal{J} is indeed a repair of \mathcal{I}. In the other direction, one can construct a satisfying assignment ss to ϕ\phi from a repair 𝒥\mathcal{J} of \mathcal{I} by setting s(xi)=1s(x_{i})=1 if and only if fi𝒥f_{i}\in\mathcal{J}. ∎

6. Experimental Evaluation

We evaluate the performance of AggCAvSAT over both synthetic and real-world databases. The first set of experiments includes a comparison of AggCAvSAT with an existing SQL-rewriting-based CQA system, namely ConQuer, over synthetically generated TPC-H databases having one key constraint per relation. This set of experiments is divided into two parts, based on the method used to generate the inconsistent database instances. In the first part, we use the DBGen tool from TPC-H and artificially inject inconsistencies in the generated data; in the second part, we employ the PDBench inconsistent database generator from MayBMS (Antova et al., 2008) (see Section 6.1.1 for details). Next, we assess the scalability of AggCAvSAT by varying the size of the database and the amount of inconsistency present in it. Lastly, to evaluate the performance of the reductions from Section 5, we use a real-world Medigap (med, [n.d.]) dataset that has three functional dependencies and one denial constraint. All experiments were carried out on a machine with an Intel Core i7 2.7 GHz processor and 8 GB of RAM, running 64-bit Ubuntu 16.04. We used Microsoft SQL Server 2017 as the underlying DBMS, and the MaxHS v3.2 solver (Davies and Bacchus, 2011) for solving the WPMaxSAT instances. The AggCAvSAT system is implemented in Java 9.04 and its code is open-sourced under a BSD-style license at the GitHub repository https://github.com/uccross/cavsat. Various features of AggCAvSAT, including its graphical user interface, are presented in a short paper in the demonstration track of the 2021 SIGMOD conference (Dixit and Kolaitis, 2021).

6.1. Experiments with TPC-H Data and Queries

6.1.1. Datasets

For the first part of the experiments, the data is generated using the DBGen data generation tool from the TPC-H Consortium. The TPC-H schema comes with exactly one key constraint per relation, which was ideal for comparing AggCAvSAT against ConQuer (Fuxman et al., 2005a; Fuxman and Miller, 2007) (the only existing system for computing the range consistent answers to aggregation queries), because ConQuer does not support more than one key constraint per relation or classes of integrity constraints broader than keys. The DBGen tool generates consistent data, so we artificially injected inconsistency by updating the key attributes of randomly selected tuples from the data with the values taken from existing tuples of the same relation. The sizes of the key-equal groups that violate the key constraints were uniformly distributed between two and seven. The database instances were generated in such a way that every repair had the specified size. We experimented with varied degrees of inconsistency, ranging from 5% up to 35% of the tuples violating a key constraint, and with a variety of repair sizes, starting from 500 MB (4.3 million tuples) up to 5 GB (44 million tuples). For the second part, we employed the PDBench database generator from MayBMS (Antova et al., 2008) to generate four inconsistent database instances with varying degrees of inconsistency (see Table 2). In all four instances, the data is generated in such a way that every repair is of size 1 GB.

Table 2. Percentage of inconsistency in the TPC-H database instances generated using PDBench
Inconsistency
Table Inst. 1 Inst. 2 Inst. 3 Inst. 4
CUSTOMER 4.42% 8.5% 16.14% 29.49%
LINEITEM 6.36% 12.09% 22.53% 39.82%
NATION 7.69% 0% 7.69% 7.69%
ORDERS 3.51% 6.77% 12.87% 23.9%
PART 4.93% 9.33% 17.66% 32.16%
PARTSUPP 1.53% 2.96% 5.77% 11.13%
REGION 0% 0% 0% 0%
SUPPLIER 3.69% 7.44% 14.11% 26.51%
Overall 5.36% 10.25% 19.29% 34.72%
Database size and Repair size (in GB)
1.04 & 1.00 1.07 & 1.01 1.14 & 1.02 1.3 & 1.02
Size of the Largest Key-equal Groups
8 tuples 16 tuples 16 tuples 32 tuples

6.1.2. Queries

The standard TPC-H specification comes with 22 queries (constructed using the QGen tool). Here, we focus on queries 1, 3, 4, 5, 6, 10, 12, 14, and 19; the other 13 queries have features such as nested subqueries, left outer joins, and negation that are beyond the aggregation queries defined in Section 3. In Section 6.1.3, we describe our results for queries without grouping. Since six out of the nine queries under consideration contained the GROUP BY clause, we removed it and added appropriate conditions in the WHERE clause based on the original grouping attributes to obtain queries without grouping. We refer to these queries as Q1,Q3,,Q19Q^{\prime}_{1},Q^{\prime}_{3},\cdots,Q^{\prime}_{19}. The definitions of these queries are given in Table 3.

Table 3. TPC-H-inspired aggregation queries w/o grouping
# Query
Q1Q^{\prime}_{1} SELECT SUM(LINEITEM.L_QUANTITY) FROM LINEITEM WHERE LINEITEM.L_SHIPDATE <= dateadd(dd, -90, cast(‘1998-12-01’ as datetime)) AND LINEITEM.L_RETURNFLAG = ‘N’ AND LINEITEM.L_LINESTATUS = ‘F’
Q3Q^{\prime}_{3} SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM CUSTOMER, ORDERS, LINEITEM WHERE CUSTOMER.C_MKTSEGMENT = ’BUILDING’ AND CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND ORDERS.O_ORDERDATE < ‘1995-03-15’ AND LINEITEM.L_SHIPDATE > ‘1995-03-15’ AND LINEITEM.L_ORDERKEY = 988226 AND ORDERS.O_ORDERDATE = ‘1995-02-01’ AND ORDERS.O_SHIPPRIORITY = 0
Q4Q^{\prime}_{4} SELECT COUNT(*) FROM ORDERS WHERE ORDERS.O_ORDERDATE >= ‘1993-07-01’ AND ORDERS.O_ORDERDATE < dateadd(mm,3, cast(‘1993-07-01’ as datetime)) AND ORDERS.O_ORDERPRIORITY = ‘1-URGENT’
Q5Q^{\prime}_{5} SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND LINEITEM.L_SUPPKEY = SUPPLIER.S_SUPPKEY AND CUSTOMER.C_NATIONKEY = SUPPLIER.S_NATIONKEY AND SUPPLIER.S_NATIONKEY = NATION.N_NATIONKEY AND NATION.N_REGIONKEY = REGION.R_REGIONKEY AND REGION.R_NAME = ‘ASIA’ AND ORDERS.O_ORDERDATE >= ‘1994-01-01’ AND ORDERS.O_ORDERDATE < DATEADD(YY, 1, cast(‘1994-01-01’ as datetime)) AND NATION.N_NAME = ‘INDIA’
Q6Q^{\prime}_{6} SELECT SUM(LINEITEM.L_EXTENDEDPRICE*LINEITEM.L_DISCOUNT) FROM LINEITEM WHERE LINEITEM.L_SHIPDATE >= ‘1994-01-01’ AND LINEITEM.L_SHIPDATE < dateadd(yy, 1, cast(‘1994-01-01’ as datetime)) AND LINEITEM.L_DISCOUNT BETWEEN .06 - 0.01 AND .06 + 0.01 AND LINEITEM.L_QUANTITY < 24
Q10Q^{\prime}_{10} SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM CUSTOMER, ORDERS, LINEITEM, NATION WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND ORDERS.O_ORDERDATE >= ‘1993-10-01’ AND ORDERS.O_ORDERDATE < dateadd(mm, 3, cast(‘1993-10-01’ as datetime)) AND LINEITEM.L_RETURNFLAG = ‘R’ AND CUSTOMER.C_NATIONKEY = NATION.N_NATIONKEY AND CUSTOMER.C_CUSTKEY = 77296 AND CUSTOMER.C_NAME = ‘Customer#000077296’ AND CUSTOMER.C_ACCTBAL = 1250.65 AND CUSTOMER.C_PHONE = ‘12-248-307-9719’ AND NATION.N_NAME = ‘BRAZIL’
Q12Q^{\prime}_{12} SELECT COUNT(*) FROM ORDERS, LINEITEM WHERE ORDERS.O_ORDERKEY = LINEITEM.L_ORDERKEY AND LINEITEM.L_SHIPMODE = ‘MAIL’ AND (ORDERS.O_ORDERPRIORITY = ‘1-URGENT’ OR ORDERS.O_ORDERPRIORITY = ‘2-HIGH’) AND LINEITEM.L_COMMITDATE < LINEITEM.L_RECEIPTDATE AND LINEITEM.L_SHIPDATE < LINEITEM.L_COMMITDATE AND LINEITEM.L_RECEIPTDATE >= ‘1994-01-01’ AND LINEITEM.L_RECEIPTDATE < dateadd(mm, 1, cast(‘1995-09-01’ as datetime))
Q14Q^{\prime}_{14} SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM LINEITEM, PART WHERE LINEITEM.L_PARTKEY = PART.P_PARTKEY AND LINEITEM.L_SHIPDATE >= ‘1995-09-01’ AND LINEITEM.L_SHIPDATE < dateadd(mm, 1, ‘1995-09-01’) AND PART.P_TYPE LIKE ‘PROMO%%’
Q19Q^{\prime}_{19} SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM LINEITEM, PART WHERE (PART.P_PARTKEY = LINEITEM.L_PARTKEY AND PART.P_BRAND = ‘Brand#12’ AND PART.P_CONTAINER IN (‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’) AND LINEITEM.L_QUANTITY >= 1 AND LINEITEM.L_QUANTITY <= 1 + 10 AND PART.P_SIZE BETWEEN 1 AND 5 AND LINEITEM.L_SHIPMODE IN (‘AIR’, ‘AIR REG’) AND LINEITEM.L_SHIPINSTRUCT = ‘DELIVER IN PERSON’) OR (PART.P_PARTKEY = LINEITEM.L_PARTKEY AND PART.P_BRAND =‘Brand#23’ AND PART.P_CONTAINER IN (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’) AND LINEITEM.L_QUANTITY >= 10 AND LINEITEM.L_QUANTITY <= 10 + 10 AND PART.P_SIZE BETWEEN 1 AND 10 AND LINEITEM.L_SHIPMODE IN (‘AIR’, ‘AIR REG’) AND LINEITEM.L_SHIPINSTRUCT = ‘DELIVER IN PERSON’) OR (PART.P_PARTKEY = LINEITEM.L_PARTKEY AND PART.P_BRAND = ‘Brand#34’ AND PART.P_CONTAINER IN ( ‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’) AND LINEITEM.L_QUANTITY >= 20 AND LINEITEM.L_QUANTITY <= 20 + 10 AND PART.P_SIZE BETWEEN 1 AND 15 AND LINEITEM.L_SHIPMODE IN (‘AIR’, ‘AIR REG’) AND LINEITEM.L_SHIPINSTRUCT = ‘DELIVER IN PERSON’)

6.1.3. Results on Queries without Grouping

In the first set of experiments, we computed the range consistent answers of the TPC-H-inspired aggregation queries without grouping via WPMaxSAT solving over a database instance with 10% inconsistency and repairs of size 1 GB (8 million tuples). Figure 1 shows that much of the evaluation time used by AggCAvSAT is consumed in encoding the CQA instance into a WPMaxSAT instance, while the solver takes comparatively little time to compute the optimal solution. Note that Q5Q^{\prime}_{5} is not in the class CaggforestC_{\textit{aggforest}} and thus ConQuer cannot compute its range consistent answers. AggCAvSAT performs better than ConQuer on seven out of the remaining eight queries.

[Bar chart: per-query evaluation times for the TPC-H-inspired aggregation queries without grouping, broken down into WPMaxSAT encoding, WPMaxSAT solving, ConQuer, and original-query execution; one bar extends off-scale to 5.98 seconds.]
Figure 1. AggCAvSAT vs. ConQuer on TPC-H data generated using the DBGen-based tool (10% inconsistency, 1 GB repairs)

Next, we compared the performance of AggCAvSAT and ConQuer on database instances generated using PDBench. Figure 2 shows that AggCAvSAT performs better than ConQuer on PDBench instances with low inconsistency. As the inconsistency increases, the WPMaxSAT solver requires considerably longer to compute the optimal solutions (especially for Q6Q^{\prime}_{6}, Q12Q^{\prime}_{12}, and Q14Q^{\prime}_{14}). One reason is that the key-equal groups in PDBench instances with a higher percentage of inconsistency are large, which translates into large clauses in the WPMaxSAT instances, so the solver must work harder to solve them. Also, Kügel’s reduction (Kügel, 2012) from WPMinSAT to WPMaxSAT significantly increases the size of the CNF formula, resulting in higher evaluation times for the lub-answers to the queries.

[Bar charts for Instances 1–4: evaluation time (seconds) for Q′1–Q′19; series: WPMaxSAT encoding, WPMaxSAT solving, ConQuer, original query; off-scale bars annotated 9.34 sec (Instance 3) and 49.1 sec (Instance 4).]
Figure 2. AggCAvSAT vs. ConQuer on PDBench instances
Table 4. Average size of the CNF formulas for Q1Q^{\prime}_{1}, Q6Q^{\prime}_{6}, and Q14Q^{\prime}_{14}
5% 15% 25% 35%
Q′1 10.2 34.3 60.6 95.3
Q′6 28.4 96.2 175.0 271.2
Q′14 6.4 21.1 40.6 62.3
(a) # of variables (in thousands)
5% 15% 25% 35%
Q′1 27.6 92.2 163.6 258.1
Q′6 76.8 259.9 472.9 734.0
Q′14 15.6 51.9 101.0 156.6
(b) # of clauses (in thousands)
1 GB 3 GB 5 GB
Q′1 21.3 44.1 105.6
Q′6 60.9 127.13 304.4
Q′14 13.9 32.9 67.7
(c) # of variables (in thousands)
1 GB 3 GB 5 GB
Q′1 57.7 104.1 258.8
Q′6 165.3 300.7 823.1
Q′14 34.0 73.7 166.6
(d) # of clauses (in thousands)

Next, we varied the inconsistency in the database instances created using the DBGen-based data generator while keeping the size of the database repairs constant at 1 GB. Figure 3 shows that the evaluation time of AggCAvSAT stays well under ten seconds (except for Q′6), even as the inconsistency in the data grows. Tables 4(a) and 4(b) show the average size of the CNF formulas for the three queries that exhibited the largest formulas; the formula size grows nearly linearly with the inconsistency in the data. The CNF formulas for Q′6 are significantly larger than those for the other queries, since Q′6 has high selectivity and is posed against the single largest relation, LINEITEM, which has over 8.2 million tuples in an instance with 35% inconsistency. This also explains why AggCAvSAT takes more time to compute its range consistent answers (Figure 3). In database instances with low inconsistency, the consistent answers to queries with low selectivity (e.g., Q′3, Q′10) are sometimes contained entirely in the consistent part of the data, in which case AggCAvSAT does not need to construct a WPMaxSAT instance at all.

[Line plot: evaluation time (0–20 seconds) vs. percentage of inconsistency (2–38%) for Q′1–Q′19.]
Figure 3. AggCAvSAT on TPC-H data generated using the DBGen-based tool (varying inconsistency, 1 GB repairs)

We then evaluated AggCAvSAT’s scalability by increasing the size of the database while keeping the inconsistency constant at 10%. Figure 4 shows that the evaluation time for queries Q′1, Q′6, and Q′12 increases faster than for the other queries: Q′1 and Q′6 are posed against LINEITEM, while Q′12 involves a join between LINEITEM and ORDERS, so AggCAvSAT spends more time computing the bags of witnesses to these queries as the database grows. Tables 4(c) and 4(d) show that the size of the CNF formulas grows almost linearly with the size of the database. The largest CNF formula, exhibited by Q′6 on a database of size 5 GB (47 million tuples), had over 304 thousand variables and 823 thousand clauses. The low selectivity of queries Q′3, Q′10, and Q′19 resulted in very small CNF formulas, even on large databases.

[Line plot: evaluation time (0–40 seconds) vs. size of the database repairs (0.5–5 GB) for Q′1–Q′19.]
Figure 4. AggCAvSAT on TPC-H data generated using the DBGen-based tool (varying database sizes, 10% inconsistency)

6.1.4. Results on Queries with Grouping

In this set of experiments, we focus on TPC-H queries 1, 3, 4, 5, 10, and 12 (see Table 5), since queries 6, 14, and 19 contain no grouping. We evaluated the performance of AggCAvSAT and compared it to ConQuer on a database with 10% inconsistency w.r.t. primary keys and repairs of size 1 GB (Figure 5). AggCAvSAT first computes the consistent answers to the underlying conjunctive query using the reductions from (Dixit and Kolaitis, 2019); these answers are precisely the consistent groups in the range consistent answers to the aggregation query. For each of these groups, it then computes the glb-answer and the lub-answer using reductions to WPMaxSAT.
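To make the glb/lub semantics concrete, the following toy sketch computes range consistent answers for a SUM query with grouping by brute force, enumerating all repairs (one tuple per key-equal group). It illustrates the semantics only, not the WPMaxSAT reduction, and the function and data names are ours:

```python
from itertools import product

def range_consistent_sum(key_groups):
    """Brute-force range consistent answers for
    SELECT g, SUM(val) ... GROUP BY g.
    key_groups: list of key-equal groups, each a list of (g, val) pairs;
    a repair picks exactly one tuple from every group."""
    repairs = list(product(*key_groups))
    per_repair = []
    for rep in repairs:
        sums = {}
        for g, val in rep:
            sums[g] = sums.get(g, 0) + val
        per_repair.append(sums)
    # Consistent groups are those appearing in the answers on every repair.
    consistent = set.intersection(*(set(s) for s in per_repair))
    return {g: (min(s[g] for s in per_repair),   # glb-answer
                max(s[g] for s in per_repair))   # lub-answer
            for g in consistent}

# Two key-equal groups: the first is inconsistent (two candidate tuples).
groups = [[("a", 10), ("b", 7)],   # key conflict: ('a', 10) vs ('b', 7)
          [("a", 5)]]              # consistent tuple
print(range_consistent_sum(groups))  # {'a': (5, 15)}
```

Group 'a' appears in every repair (it is a consistent group), while 'b' does not; the glb- and lub-answers for 'a' bound its SUM over all repairs. AggCAvSAT obtains the same bounds via two WPMaxSAT instances per consistent group instead of enumerating repairs.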

Table 5. TPC-H-inspired Aggregation Queries w/ Grouping
# Query Operator
Q1Q_{1} SELECT LINEITEM.L_RETURNFLAG, LINEITEM.L_LINESTATUS, SUM(LINEITEM.L_QUANTITY) FROM LINEITEM WHERE LINEITEM.L_SHIPDATE <= dateadd(dd, -90, cast(‘1998-12-01’ as datetime)) GROUP BY LINEITEM.L_RETURNFLAG, LINEITEM.L_LINESTATUS SUM(A)
Q3Q_{3} SELECT TOP 10 LINEITEM.L_ORDERKEY, SUM(LINEITEM.L_EXTENDEDPRICE), ORDERS.O_ORDERDATE, ORDERS.O_SHIPPRIORITY FROM CUSTOMER, ORDERS, LINEITEM WHERE CUSTOMER.C_MKTSEGMENT = ‘BUILDING’ AND CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND ORDERS.O_ORDERDATE < ‘1995-03-15’ AND LINEITEM.L_SHIPDATE > ‘1995-03-15’ GROUP BY LINEITEM.L_ORDERKEY, ORDERS.O_ORDERDATE, ORDERS.O_SHIPPRIORITY SUM(A)
Q4Q_{4} SELECT ORDERS.O_ORDERPRIORITY, COUNT(*) AS O_COUNT FROM ORDERS WHERE ORDERS.O_ORDERDATE >= ‘1993-07-01’ AND ORDERS.O_ORDERDATE < dateadd(mm,3, cast(‘1993-07-01’ as datetime)) GROUP BY ORDERS.O_ORDERPRIORITY COUNT(*)
Q5Q_{5} SELECT NATION.N_NAME, SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) AS REVENUE FROM CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND LINEITEM.L_SUPPKEY = SUPPLIER.S_SUPPKEY AND CUSTOMER.C_NATIONKEY = SUPPLIER.S_NATIONKEY AND SUPPLIER.S_NATIONKEY = NATION.N_NATIONKEY AND NATION.N_REGIONKEY = REGION.R_REGIONKEY AND REGION.R_NAME = ‘ASIA’ AND ORDERS.O_ORDERDATE >= ‘1994-01-01’ AND ORDERS.O_ORDERDATE < DATEADD(YY, 1, cast(‘1994-01-01’ as datetime)) GROUP BY NATION.N_NAME SUM(A)
Q10Q_{10} SELECT TOP 20 CUSTOMER.C_CUSTKEY, CUSTOMER.C_NAME, SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) AS REVENUE, CUSTOMER.C_ACCTBAL, NATION.N_NAME, CUSTOMER.C_ADDRESS, CUSTOMER.C_PHONE FROM CUSTOMER, ORDERS, LINEITEM, NATION WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND ORDERS.O_ORDERDATE>= ‘1993-10-01’ AND ORDERS.O_ORDERDATE < dateadd(mm, 3, cast(‘1993-10-01’ as datetime)) AND LINEITEM.L_RETURNFLAG = ‘R’ AND CUSTOMER.C_NATIONKEY = NATION.N_NATIONKEY GROUP BY CUSTOMER.C_CUSTKEY, CUSTOMER.C_NAME, CUSTOMER.C_ACCTBAL, CUSTOMER.C_PHONE, NATION.N_NAME, CUSTOMER.C_ADDRESS SUM(A)
Q12Q_{12} SELECT LINEITEM.L_SHIPMODE, COUNT(*) AS HIGH_LINE_COUNT FROM ORDERS, LINEITEM WHERE ORDERS.O_ORDERKEY = LINEITEM.L_ORDERKEY AND LINEITEM.L_SHIPMODE IN (‘MAIL’,‘SHIP’) AND (ORDERS.O_ORDERPRIORITY = ‘1-URGENT’ OR ORDERS.O_ORDERPRIORITY = ‘2-HIGH’) AND LINEITEM.L_COMMITDATE < LINEITEM.L_RECEIPTDATE AND LINEITEM.L_SHIPDATE < LINEITEM.L_COMMITDATE AND LINEITEM.L_RECEIPTDATE >= ‘1994-01-01’ AND LINEITEM.L_RECEIPTDATE < dateadd(mm, 1, cast(‘1995-09-01’ as datetime)) GROUP BY LINEITEM.L_SHIPMODE COUNT(*)
[Bar chart: evaluation time (0–10 seconds) for the TPC-H-inspired aggregation queries with grouping (Q1–Q12); series: encoding underlying CQ, solving underlying CQ, encoding for groups, solving for groups, ConQuer rewriting, original query; two off-scale bars annotated 2 mins and 15 mins.]
Figure 5. AggCAvSAT vs. ConQuer on TPC-H data generated using the DBGen-based tool (10% inconsistency, 1 GB repairs)
[Bar charts for Instances 1–4: evaluation time (seconds) for Q3–Q12; series: encoding underlying CQ, solving underlying CQ, encoding for groups, solving for groups, ConQuer rewriting, original query; off-scale bars annotated 72 secs, 64 secs, 68 secs, and 72 secs, respectively.]
Figure 6. AggCAvSAT vs. ConQuer on PDBench instances

The overhead of computing the range consistent answers to aggregation queries with grouping is higher than for aggregation queries without grouping because, for an aggregation query with grouping, AggCAvSAT must construct and solve twice as many WPMaxSAT instances as there are consistent groups: one for the lub-answer and one for the glb-answer per consistent group. For queries involving SQL’s SELECT TOP k construct, we chose the top k consistent groups ordered by the grouping attributes in the ORDER BY clause of the query. AggCAvSAT computes the range consistent answers to each query in under ten seconds, except for Q1. It took under three seconds to compute the consistent groups of Q1, but over forty seconds to encode the range consistent answers of the groups and over fifteen minutes to solve the corresponding WPMaxSAT instances; some consistent groups contain over 3 million tuples, so the WPMaxSAT instances have over 600 thousand variables and over 1.3 million clauses. ConQuer took slightly over two minutes to compute the range consistent answers to Q1. We did not include Q1 in the experiments with larger databases and higher inconsistency.

Figure 6 compares AggCAvSAT and ConQuer on aggregation queries with grouping over PDBench instances. On the instance with the lowest inconsistency, AggCAvSAT beats ConQuer on all queries, but as the inconsistency grows, AggCAvSAT takes longer to encode and solve for the consistent groups of Q3 and Q10.

In Figure 7, we first plot the evaluation time of AggCAvSAT as the percentage of inconsistency in the data grows from 5% to 35% in the instances generated using the DBGen-based data generator. The size of the database repairs is kept constant at 1 GB (8 million tuples).

[Two line plots vs. percentage of inconsistency (2–38%) for Q3, Q4, Q5, Q10, Q12: evaluation time (0–40 seconds), and number of SAT calls (log scale, 10^1–10^4).]
Figure 7. AggCAvSAT on TPC-H data generated using the DBGen-based tool (varying inconsistency, 1 GB repairs)

Since AggCAvSAT constructs and solves many WPMaxSAT instances of varying sizes for an aggregation query with grouping, we also plot the overall number of SAT calls made by the solver in Figure 7; note that the Y-axis of the second plot has logarithmic scaling. There are ten consistent groups in the answers to Q3, but only five and two consistent groups in the answers to Q5 and Q12, respectively. In each consistent group, the aggregation operator is applied over a much larger set of tuples in Q5 and Q12 than in Q3. As a result, the evaluation time for Q3 is high but its number of SAT calls comparatively small, while AggCAvSAT makes more SAT calls for Q5 and Q12 even though their consistent answers are computed much faster. The query Q10 requires a long time to construct and solve the WPMaxSAT instances for its consistent groups, due to its high selectivity and a join between four relations. When the percentage of inconsistency is held constant, the time to compute the range consistent answers to aggregation queries with grouping increases almost linearly with the size of the database (Figure 8); the second plot in Figure 8 depicts the number of SAT calls made by the solver as the database grows. Due to its low selectivity, the answers to Q4 are encoded into small CNF formulas even on databases with high inconsistency or large sizes, resulting in fast evaluation.

[Two line plots vs. size of the database repairs (0.5–5 GB) for Q3, Q4, Q5, Q10, Q12: evaluation time (0–60 seconds), and number of SAT calls (log scale, 10^1–10^4).]
Figure 8. AggCAvSAT on TPC-H data generated using the DBGen-based tool (varying database sizes, 10% inconsistency)

6.1.5. Discussion

The experiments show that AggCAvSAT performs well across a broad range of queries and databases; it performs worst on queries with high selectivity because, in such cases, very large CNF formulas are generated. AggCAvSAT also slowed down on databases with a high degree of inconsistency (>30%) and with large key-equal groups (of size >15). These are corner cases that should rarely be encountered in real-world databases.

6.2. Experiments with Real-world Data

6.2.1. Dataset

For this set of experiments, we use the schema and the data from Medigap (med, [n.d.]), an openly available real-world database about Medicare supplement insurance in the United States. We combined the data from 2019 and 2020 to obtain a database with over 61K tuples (Table 6(a)). Since we consider two functional dependencies and one denial constraint on the Medigap schema (Table 6(b)), we evaluated the performance of Reduction 5.1. The actual data was already inconsistent, so no additional inconsistency was injected.

Table 6. Medigap real-world database
Relation Acronym # of attributes # of tuples
OrgsByState OBS 5 3872
PlansByState PBS 18 21002
PlansByZip PBZ 20 4748
PlanType PT 4 2434
Premiums PR 7 29148
SimplePlanType SPT 4 70
(a) Medigap schema
Type Constraint Definition Inconsistency
FD OBS (orgID \rightarrow orgName) 2.58%
FD PBS (addr, city, abbrev \rightarrow zip) 1.5%
DC tPBS\forall\;t\in\text{PBS} (tt.webAddr \neq ‘’) 0.15%
(b) Integrity constraints and inconsistency

6.2.2. Queries

We use twelve natural aggregation queries on the Medigap database that involve the aggregation operators COUNT(*), COUNT(A), and SUM(A); we refer to them as Qm1, …, Qm12. The first six queries contain no grouping, while the remaining six do. The definitions of these queries are given in Table 7.

Table 7. Aggregation Queries on Medigap database
# Query
Q1mQ^{m}_{1} SELECT COUNT(*) FROM OBS WHERE OBS.Name = ‘Continental General Insurance Company’
Q2mQ^{m}_{2} SELECT COUNT(*) FROM PBZ, SPT WHERE PBZ.Description = SPT.Simple_plantype_name AND SPT.Contract_year = 2020 AND SPT.Simple_plantype = ‘B’
Q3mQ^{m}_{3} SELECT SUM(PBZ.Over65) FROM PBZ WHERE PBZ.State_name = ‘Wisconsin’ AND PBZ.County_name = ‘GREEN LAKE’
Q4mQ^{m}_{4} SELECT SUM(PBZ.Community) FROM PBZ WHERE PBZ.State_name = ‘New York’
Q5mQ^{m}_{5} SELECT COUNT(PR.Premium_range) FROM PR
Q6mQ^{m}_{6} SELECT COUNT(PR.Premium_range) FROM PT, PR WHERE PT.State_abbrev = PR.State_abbrev AND PT.Plan_type = PR.Plan_type AND PT.Contract_year = PR.Contract_year AND PT.Contract_year = 2020 AND PT.Simple_plantype = ‘K’
Q7mQ^{m}_{7} SELECT SPT.Contract_year, COUNT(*) FROM SPT GROUP BY SPT.Contract_year ORDER BY SPT.Contract_year DESC
Q8mQ^{m}_{8} SELECT PBZ.State_name, COUNT(*) FROM PBZ GROUP BY PBZ.State_name
Q9mQ^{m}_{9} SELECT PBZ.Zip, SUM(PBZ.Community) FROM PBZ WHERE PBZ.State_name = ‘New York’ GROUP BY PBZ.Zip
Q10mQ^{m}_{10} SELECT TOP 10 PBS.State_name, SPT.Contract_year, SUM(PBS.Under65) FROM PBS, SPT WHERE SPT.Simple_plantype_name = PBS.Description AND SPT.Simple_plantype = ‘A’ AND SPT.Language_id = 1 GROUP BY PBS.State_name, SPT.Contract_year ORDER BY PBS.State_name
Q11mQ^{m}_{11} SELECT PR.Age_category, COUNT(PR.Premium_range) FROM PR GROUP BY PR.Age_category ORDER BY PR.Age_category
Q12mQ^{m}_{12} SELECT TOP 10 PT.Simple_plantype, COUNT(PR.Premium_range) FROM PT, PR WHERE PT.State_abbrev = PR.State_abbrev AND PT.Plan_type = PR.Plan_type AND PT.Contract_year = PR.Contract_year and PT.Contract_year = 2020 GROUP BY PT.Simple_plantype ORDER BY PT.Simple_plantype

6.2.3. Results on Real-world Database

[Bar chart: evaluation time (0–40 seconds) for the real-world aggregation queries Qm1–Qm12 on the Medigap database; series: encoding underlying CQ, solving underlying CQ, encoding for groups, solving for groups.]
Figure 9. Evaluation time for computing the range consistent answers to real-world aggregation queries

In Figure 9, we plot the overall time taken by AggCAvSAT to compute the range consistent answers to the twelve aggregation queries on the Medigap database. Since the Medigap schema has functional dependencies and a denial constraint, the encoding of CQA into WPMaxSAT instances is based on Reduction 5.1. Consequently, the CNF formulas are much larger than the ones produced by Reduction 4.1, resulting in longer encoding times. For all twelve queries, the encoding time is dominated by the time required to compute the near-violations and, hence, the γ-clauses. This part of the encoding time is the same for all queries, whereas the time to compute the witnesses depends on the query. The solver takes a comparatively minuscule amount of time to compute the consistent answers to the underlying conjunctive query. For the queries Qm7, …, Qm12, the glb-answer and the lub-answer are encoded and solved for each consistent group, causing high overhead. The longest evaluation times are exhibited by the queries Qm10, Qm12, and Qm6, since they have 10, 10, and 6 consistent groups, respectively.
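The constraints of Table 6(b) give an idea of where such clauses come from. The sketch below shows, in simplified form, how violations of a functional dependency and of the denial constraint on webAddr could be turned into hard clauses over tuple variables; the actual γ-clause construction of Reduction 5.1 handles near-violations more generally, so this is illustrative only, and the function name and toy rows are ours:

```python
from itertools import combinations

def denial_clauses(tuples, fd_lhs, fd_rhs, dc_attr):
    """Hard clauses from constraint violations (a simplified sketch).
    tuples: list of dicts; tuple i gets SAT variable i+1.
    FD: fd_lhs -> fd_rhs; DC: attribute dc_attr must be nonempty."""
    clauses = []
    for i, j in combinations(range(len(tuples)), 2):
        t, u = tuples[i], tuples[j]
        if all(t[a] == u[a] for a in fd_lhs) and t[fd_rhs] != u[fd_rhs]:
            clauses.append([-(i + 1), -(j + 1)])  # both tuples cannot coexist
    for i, t in enumerate(tuples):
        if t[dc_attr] == "":
            clauses.append([-(i + 1)])            # tuple alone violates the DC
    return clauses

rows = [{"orgID": 1, "orgName": "Acme", "webAddr": "a.com"},
        {"orgID": 1, "orgName": "Acme Inc", "webAddr": "a.com"},
        {"orgID": 2, "orgName": "Biz", "webAddr": ""}]
print(denial_clauses(rows, ["orgID"], "orgName", "webAddr"))
# [[-1, -2], [-3]]
```

Because detecting near-violations requires comparing tuples pairwise, this part of the encoding dominates the encoding time regardless of the query being asked.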

[Bar chart: number of clauses (roughly 0.8–1.2 × 10^5) in the CNF formula for each of the queries Qm1–Qm12.]
Figure 10. Number of clauses in a CNF formula capturing the consistent answers to the underlying conjunctive query

For these experiments, we did not first try to compute the range consistent answers from the consistent part of the data. Thus, for all CNF formulas, the number of variables is equal to the number of tuples in the data (about 61K). The number of clauses, however, varies with the query, as shown in Figure 10. The query Qm5 has the highest number of clauses, since every tuple in the Premiums table is a minimal witness to its underlying conjunctive query.
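This relationship between witnesses and clauses can be sketched as follows, assuming the CAvSAT-style convention that each minimal witness contributes one clause over the variables of its tuples (the exact role and weights of these clauses in the reduction are elided; the function name and toy rows are ours):

```python
def witness_clauses(witnesses, var_of):
    """One clause per minimal witness: the negation of the conjunction
    of the witness's tuple variables (a simplified reading of how
    witnesses enter the CNF; weights and soft/hard roles are elided)."""
    return [[-var_of[t] for t in w] for w in witnesses]

pr = [{"id": 1, "Premium_range": "$20-$40"},
      {"id": 2, "Premium_range": None},
      {"id": 3, "Premium_range": "$40-$60"}]
# For a single-relation query like Qm5's underlying CQ, every tuple with a
# non-null Premium_range is a minimal witness on its own, so the number of
# witness clauses tracks the number of tuples (cf. Figure 10).
witnesses = [(t["id"],) for t in pr if t["Premium_range"] is not None]
var_of = {t["id"]: t["id"] for t in pr}
print(witness_clauses(witnesses, var_of))  # [[-1], [-3]]
```

For queries with joins, each witness spans several tuples and the clauses grow accordingly wide, but their number depends on the query's selectivity rather than on the raw table size.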

The clauses arising from the inconsistency in the data can be constructed independently from the clauses arising from the witnesses to the queries. In the near future, we plan to parallelize their computation to improve AggCAvSAT’s performance.

7. Concluding Remarks

First, we showed that computing the range consistent answers to an aggregation query involving the SUM(A) operator can be NP-hard, even when the consistent answers to the underlying conjunctive query are SQL-rewritable. We then designed, implemented, and evaluated AggCAvSAT, a SAT-based system for computing the range consistent answers to aggregation queries involving COUNT(A), COUNT(*), SUM(A), and grouping. It is the first system able to handle aggregation queries whose range consistent answers are not SQL-rewritable. Our experimental evaluation showed that AggCAvSAT is not only competitive with systems such as ConQuer, but also scalable. The experiments on the Medigap data showed that AggCAvSAT can handle real-world databases with integrity constraints beyond primary keys. The next step in this investigation is to delineate the complexity of the range consistent answers to aggregation queries with the AVG(A) operator and then enhance AggCAvSAT to compute the range consistent answers of such queries. Finally, we note that the SAT-based methods used here are applicable to broader classes of SQL queries, such as queries with nested subqueries, as long as only denial constraints are considered. Under broader classes of constraints, such as universal constraints, the consistent answers of even conjunctive queries become Π^p_2-hard to compute (Arming et al., 2016), so SAT-based methods are no longer applicable; in that case, Answer Set Programming solvers, such as DLV (Leone et al., 2006) or Potassco (Gebser et al., 2011), have to be used instead of SAT solvers.

Acknowledgments  Dixit was supported by a Baskin School of Engineering Dissertation-Year Fellowship and by the Center for Research in Open Source Software (CROSS) at UC Santa Cruz. Kolaitis was partially supported by NSF Award IIS: 1814152.

References

  • cad ([n.d.]) [n.d.]. CaDiCaL Simplified Satisfiability Solver. http://fmv.jku.at/cadical/.
  • med ([n.d.]) [n.d.]. Medigap database for Medicare health and drug plans. https://www.medicare.gov/download/downloaddb.asp.
  • Afrati and Kolaitis (2008) Foto N. Afrati and Phokion G. Kolaitis. 2008. Answering aggregate queries in data exchange. In Proc. of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, June 9-11, 2008, Vancouver, BC, Canada. ACM, 129–138. https://doi.org/10.1145/1376916.1376936
  • Amsterdamer et al. (2011) Yael Amsterdamer, Daniel Deutch, and Val Tannen. 2011. Provenance for aggregate queries. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, Maurizio Lenzerini and Thomas Schwentick (Eds.). ACM, 153–164. https://doi.org/10.1145/1989284.1989302
  • Antova et al. (2008) Lyublena Antova, Thomas Jansen, Christoph Koch, and Dan Olteanu. 2008. Fast and Simple Relational Processing of Uncertain Data. In 2008 IEEE 24th International Conference on Data Engineering. 983–992. https://doi.org/10.1109/ICDE.2008.4497507
  • Arenas et al. (1999) Marcelo Arenas, Leopoldo Bertossi, and Jan Chomicki. 1999. Consistent Query Answers in Inconsistent Databases. In Proc. of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Philadelphia, Pennsylvania, USA) (PODS ’99). ACM, New York, NY, USA, 68–79. https://doi.org/10.1145/303976.303983
  • Arenas et al. (2003b) Marcelo Arenas, Leopoldo Bertossi, Jan Chomicki, Xin He, Vijay Raghavan, and Jeremy Spinrad. 2003b. Scalar aggregation in inconsistent databases. Theoretical Computer Science 296, 3 (2003), 405–434. https://doi.org/10.1016/S0304-3975(02)00737-5 Database Theory.
  • Arenas et al. (2003a) Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. 2003a. Answer sets for consistent query answering in inconsistent databases. TPLP 3, 4-5 (2003), 393–424. https://doi.org/10.1017/S1471068403001832
  • Arming et al. (2016) Sebastian Arming, Reinhard Pichler, and Emanuel Sallinger. 2016. Complexity of Repair Checking and Consistent Query Answering. In 19th International Conference on Database Theory, ICDT 2016, Bordeaux, France, March 15-18, 2016 (LIPIcs), Wim Martens and Thomas Zeume (Eds.), Vol. 48. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 21:1–21:18. https://doi.org/10.4230/LIPIcs.ICDT.2016.21
  • Audemard and Simon (2009) Gilles Audemard and Laurent Simon. 2009. Predicting Learnt Clauses Quality in Modern SAT Solvers. In Proc. of the 21st International Jont Conference on Artifical Intelligence (Pasadena, California, USA) (IJCAI’09). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 399–404.
  • Barceló and Bertossi (2003) Pablo Barceló and Leopoldo E. Bertossi. 2003. Logic Programs for Querying Inconsistent Databases. In Practical Aspects of Declarative Languages, 5th International Symposium, PADL 2003, New Orleans, LA, USA, January 13-14, 2003, Proc. 208–222. https://doi.org/10.1007/3-540-36388-2_15
  • Bertossi (2011) Leopoldo E. Bertossi. 2011. Database Repairing and Consistent Query Answering. Morgan & Claypool Publishers. https://doi.org/10.2200/S00379ED1V01Y201108DTM020
  • Biere et al. (2009) A. Biere, A. Biere, M. Heule, H. van Maaren, and T. Walsh. 2009. Handbook of Satisfiability: Volume 185 Frontiers in Artificial Intelligence and Applications. IOS Press, Amsterdam, The Netherlands, The Netherlands.
  • Chomicki et al. (2004a) Jan Chomicki, Jerzy Marcinkowski, and Slawomir Staworko. 2004a. Computing Consistent Query Answers Using Conflict Hypergraphs. In Proc. of the Thirteenth ACM International Conference on Information and Knowledge Management (Washington, D.C., USA) (CIKM ’04). ACM, New York, NY, USA, 417–426. https://doi.org/10.1145/1031171.1031254
  • Chomicki et al. (2004b) Jan Chomicki, Jerzy Marcinkowski, and Slawomir Staworko. 2004b. Hippo: A System for Computing Consistent Answers to a Class of SQL Queries. In Advances in Database Technology - EDBT 2004. Springer Berlin Heidelberg, Berlin, Heidelberg, 841–844.
  • Codd (1970) E. F. Codd. 1970. A Relational Model of Data for Large Shared Data Banks. Commun. ACM 13, 6 (1970), 377–387. https://doi.org/10.1145/362384.362685
  • Codd (1972) E. F. Codd. 1972. Relational Completeness of Data Base Sublanguages. Research Report / RJ / IBM / San Jose, California RJ987 (1972).
  • Davies and Bacchus (2011) Jessica Davies and Fahiem Bacchus. 2011. Solving MAXSAT by Solving a Sequence of Simpler SAT Instances. In Principles and Practice of Constraint Programming – CP 2011. Springer Berlin Heidelberg, Berlin, Heidelberg, 225–239.
  • Dixit and Kolaitis (2021) Akhil Dixit and Phokion G. Kolaitis. 2021. CAvSAT: Answering Aggregation Queries over Inconsistent Databases via SAT Solving. In Proc. of the 2021 ACM International Conference on Management of Data (SIGMOD) - Demonstration Track. To appear.
  • Dixit and Kolaitis (2019) Akhil A. Dixit and Phokion G. Kolaitis. 2019. A SAT-Based System for Consistent Query Answering. In Theory and Applications of Satisfiability Testing - SAT 2019 - 22nd International Conference, SAT 2019, Lisbon, Portugal, July 9-12, 2019, Proc. (Lecture Notes in Computer Science), Vol. 11628. Springer, 117–135. https://doi.org/10.1007/978-3-030-24258-9_8
  • Fuxman (2007) Ariel Fuxman. 2007. Efficient Query Processing over Inconsistent Databases. Ph.D. Dissertation. Department of Computer Science, University of Toronto.
  • Fuxman et al. (2005a) Ariel Fuxman, Elham Fazli, and Renée J. Miller. 2005a. ConQuer: Efficient Management of Inconsistent Databases. In Proc. of the 2005 ACM SIGMOD International Conference on Management of Data (Baltimore, Maryland) (SIGMOD ’05). ACM, New York, NY, USA, 155–166. https://doi.org/10.1145/1066157.1066176
  • Fuxman et al. (2005b) Ariel Fuxman, Diego Fuxman, and Renée J. Miller. 2005b. ConQuer: A System for Efficient Querying over Inconsistent Databases. In Proc. of the 31st International Conference on Very Large Data Bases (Trondheim, Norway) (VLDB ’05). VLDB Endowment, 1354–1357. http://dl.acm.org/citation.cfm?id=1083592.1083774
  • Fuxman and Miller (2007) Ariel Fuxman and Renée J. Miller. 2007. First-order Query Rewriting for Inconsistent Databases. J. Comput. Syst. Sci. 73, 4 (June 2007), 610–635. https://doi.org/10.1016/j.jcss.2006.10.013
  • Gebser et al. (2011) Martin Gebser, Benjamin Kaufmann, Roland Kaminski, Max Ostrowski, Torsten Schaub, and Marius Schneider. 2011. Potassco: The Potsdam Answer Set Solving Collection. AI Commun. 24, 2 (2011), 107–124. https://doi.org/10.3233/AIC-2011-0491
  • Giannakopoulou et al. (2020a) Stella Giannakopoulou, Manos Karpathiotakis, and Anastasia Ailamaki. 2020a. Cleaning Denial Constraint Violations through Relaxation. In Proc. of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020. ACM, 805–815. https://doi.org/10.1145/3318464.3389775
  • Giannakopoulou et al. (2020b) Stella Giannakopoulou, Manos Karpathiotakis, and Anastasia Ailamaki. 2020b. Query-driven Repair of Functional Dependency Violations. In 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020. IEEE, 1886–1889. https://doi.org/10.1109/ICDE48307.2020.00195
  • Greco et al. (2003) Gianluigi Greco, Sergio Greco, and Ester Zumpano. 2003. A Logical Framework for Querying and Repairing Inconsistent Databases. IEEE Trans. on Knowl. and Data Eng. 15, 6 (Nov. 2003), 1389–1408. https://doi.org/10.1109/TKDE.2003.1245280
  • Green et al. (2007) Todd J. Green, Gregory Karvounarakis, and Val Tannen. 2007. Provenance semirings. In Proceedings of the Twenty-Sixth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 11-13, 2007, Beijing, China, Leonid Libkin (Ed.). ACM, 31–40. https://doi.org/10.1145/1265530.1265535
  • Guagliardo and Libkin (2016) Paolo Guagliardo and Leonid Libkin. 2016. Making SQL Queries Correct on Incomplete Databases: A Feasibility Study. In Proc. of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016. ACM, 211–223. https://doi.org/10.1145/2902251.2902297
  • Heule et al. (2016) Marijn J. H. Heule, Oliver Kullmann, and Victor W. Marek. 2016. Solving and Verifying the Boolean Pythagorean Triples Problem via Cube-and-Conquer. In Theory and Applications of Satisfiability Testing - SAT 2016 - 19th Intern. Conference, Bordeaux, France, July 5-8, 2016, Proc. (Lecture Notes in Computer Science), Vol. 9710. Springer, 228–245. https://doi.org/10.1007/978-3-319-40970-2_15
  • Karp (1972) Richard M. Karp. 1972. Reducibility among Combinatorial Problems. Springer US, Boston, MA, 85–103. https://doi.org/10.1007/978-1-4684-2001-2_9
  • Karvounarakis and Green (2012) Grigoris Karvounarakis and Todd J. Green. 2012. Semiring-annotated data: queries and provenance? SIGMOD Rec. 41, 3 (2012), 5–14. https://doi.org/10.1145/2380776.2380778
  • Kolaitis and Pema (2012) Phokion G. Kolaitis and Enela Pema. 2012. A Dichotomy in the Complexity of Consistent Query Answering for Queries with Two Atoms. Inf. Process. Lett. 112, 3 (Jan. 2012), 77–85. https://doi.org/10.1016/j.ipl.2011.10.018
  • Kolaitis et al. (2013) Phokion G. Kolaitis, Enela Pema, and Wang-Chiew Tan. 2013. Efficient Querying of Inconsistent Databases with Binary Integer Programming. PVLDB 6, 6 (2013), 397–408. https://doi.org/10.14778/2536336.2536341
  • Kostylev and Reutter (2015) Egor V. Kostylev and Juan L. Reutter. 2015. Complexity of answering counting aggregate queries over DL-Lite. J. Web Semant. 33 (2015), 94–111. https://doi.org/10.1016/j.websem.2015.05.003
  • Koutris and Wijsen (2016) Paraschos Koutris and Jef Wijsen. 2016. Consistent Query Answering for Primary Keys. SIGMOD Rec. 45, 1 (June 2016), 15–22. https://doi.org/10.1145/2949741.2949746
  • Koutris and Wijsen (2017) Paraschos Koutris and Jef Wijsen. 2017. Consistent Query Answering for Self-Join-Free Conjunctive Queries Under Primary Key Constraints. ACM Trans. Database Syst. 42, 2, Article 9 (June 2017), 45 pages. https://doi.org/10.1145/3068334
  • Kügel (2012) Adrian Kügel. 2012. Natural Max-SAT Encoding of Min-SAT. In Revised Selected Papers of the 6th International Conference on Learning and Intelligent Optimization - Volume 7219 (Paris, France) (LION 6). Springer-Verlag, Berlin, Heidelberg, 431–436.
  • Leone et al. (2006) Nicola Leone, Gerald Pfeifer, Wolfgang Faber, Thomas Eiter, Georg Gottlob, Simona Perri, and Francesco Scarcello. 2006. The DLV system for knowledge representation and reasoning. ACM Trans. Comput. Log. 7, 3 (2006), 499–562. https://doi.org/10.1145/1149114.1149117
  • Li et al. (2011) Chu-Min Li, Zhu Zhu, Felip Manyà, and Laurent Simon. 2011. Minimum Satisfiability and Its Applications. In Proc. of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume One (Barcelona, Catalonia, Spain) (IJCAI’11). AAAI Press, 605–610.
  • Manna et al. (2011) Marco Manna, Francesco Ricca, and Giorgio Terracina. 2011. Consistent Query Answering via ASP from Different Perspectives: Theory and Practice. CoRR abs/1107.4570 (2011). arXiv:1107.4570 http://arxiv.org/abs/1107.4570
  • Marileo and Bertossi (2010) Mónica Caniupán Marileo and Leopoldo E. Bertossi. 2010. The consistency extractor system: Answer set programs for consistent query answering in databases. Data Knowl. Eng. 69, 6 (2010), 545–572. https://doi.org/10.1016/j.datak.2010.01.005
  • Oostema et al. (2020) Peter Oostema, Ruben Martins, and Marijn Heule. 2020. Coloring Unit-Distance Strips using SAT. In LPAR 2020: 23rd International Conference on Logic for Programming, Artificial Intelligence and Reasoning, Alicante, Spain, May 22-27, 2020 (EPiC Series in Computing), Vol. 73. EasyChair, 373–389. https://easychair.org/publications/paper/69T4
  • Rekatsinas et al. (2017) Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic Data Repairs with Probabilistic Inference. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1190–1201. https://doi.org/10.14778/3137628.3137631
  • Vardi (2009) Moshe Y. Vardi. 2009. Symbolic Techniques in Propositional Satisfiability Solving. In Theory and Applications of Satisfiability Testing - SAT 2009. Springer Berlin Heidelberg, Berlin, Heidelberg, 2–3.
  • Wijsen (2009) Jef Wijsen. 2009. Consistent Query Answering Under Primary Keys: A Characterization of Tractable Queries. In Proc. of the 12th International Conference on Database Theory (St. Petersburg, Russia) (ICDT ’09). ACM, New York, NY, USA, 42–52. https://doi.org/10.1145/1514894.1514900
  • Wijsen (2010a) Jef Wijsen. 2010a. On the First-order Expressibility of Computing Certain Answers to Conjunctive Queries over Uncertain Databases. In Proc. of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Indianapolis, Indiana, USA) (PODS ’10). ACM, New York, NY, USA, 179–190. https://doi.org/10.1145/1807085.1807111
  • Wijsen (2010b) Jef Wijsen. 2010b. A remark on the complexity of consistent conjunctive query answering under primary key violations. Inf. Process. Lett. 110, 21 (2010), 950–955. https://doi.org/10.1016/j.ipl.2010.07.021
  • Zhu et al. (2012) Zhu Zhu, Chu-Min Li, Felip Manyà, and Josep Argelich. 2012. A New Encoding from MinSAT into MaxSAT. In Principles and Practice of Constraint Programming. Springer Berlin Heidelberg, Berlin, Heidelberg, 455–463.