Consistent Answers of Aggregation Queries using SAT Solvers
Abstract.
The framework of database repairs and consistent answers to queries is a principled approach to managing inconsistent databases. We describe the first system able to compute the consistent answers of general aggregation queries with the COUNT(), COUNT(*), SUM(), MIN(), and MAX() operators, and with or without grouping constructs. Our system uses reductions to optimization versions of Boolean satisfiability (SAT) and then leverages powerful SAT solvers. We carry out an extensive set of experiments on both synthetic and real-world data that demonstrate the usefulness and scalability of this approach.
1. Introduction
The framework of database repairs and consistent query answering, introduced by Arenas, Bertossi, and Chomicki (Arenas et al., 1999), is a principled approach to managing inconsistent databases, i.e., databases that violate one or more integrity constraints on their schema. In this framework, inconsistencies are handled at query time by considering all possible repairs of the inconsistent database, where a repair of an inconsistent database I is a consistent database that differs from I in a “minimal” way. The consistent answers to a query q on a given database I are the intersection of the results of q applied on each repair of I. Thus, a consistent answer comes with the guarantee that it will be found no matter on which repair the query is evaluated. Computing the consistent answers can be an intractable problem, because an inconsistent database may have exponentially many repairs. In particular, computing the consistent answers of a fixed Select-Project-Join (SPJ) query can be a coNP-complete problem. By now, there is an extensive body of work on the complexity of the consistent answers of SPJ queries (see Section 2).
Range Semantics: Concept and Motivation. Aggregation queries are among the most frequently asked queries; they are of the form

SELECT g1, ..., gk, f(A) FROM R(q) GROUP BY g1, ..., gk

where f is one of the standard aggregation operators COUNT(), COUNT(*), SUM(), AVG(), MIN(), MAX(), R(q) is the relation returned by an SPJ query q expressed in SQL, and g1, ..., gk are optional grouping attributes. A scalar aggregation query is an aggregation query without a GROUP BY clause.
What is the semantics of an aggregation query over an inconsistent database? Since an aggregation query may return different answers on different repairs of an inconsistent database, there is typically no consistent answer under the earlier definition of consistent answers. To give meaningful semantics to aggregation queries, Arenas et al. (Arenas et al., 2003b) introduced the range consistent answers.
[Table 1. An inconsistent banking database with relations CUST(CID, CNAME, CITY), ACC(ACCID, TYPE, BAL, CITY), and CUSTACC(CID, ACCID).]
Let q* be a scalar aggregation query and let Σ be a set of integrity constraints. The set of possible answers to q* on an inconsistent instance I w.r.t. Σ is the set of the answers to q* over all repairs of I w.r.t. Σ, i.e., the set {q*(J) : J is a repair of I w.r.t. Σ}. By definition, the range consistent answers to q* on I is the interval [glb, lub] whose endpoints are, respectively, the greatest lower bound (glb) and the least upper bound (lub) of the set of possible answers to q* on I. For example, the range consistent answers to the query
SELECT SUM(ACCOUNTS.BAL) FROM ACCOUNTS, CUSTACC
WHERE ACCOUNTS.ACCID = CUSTACC.ACCID AND CUSTACC.CID = 'C2'
on the instance in Table 1 is the interval [900, 2200]. The meaning is that no matter how the database is repaired, the answer to the query is guaranteed to lie between 900 and 2200.
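To make the range semantics concrete, here is a minimal Python sketch (with hypothetical data, not the contents of Table 1) that enumerates all repairs of a single relation violating its key constraint by brute force and computes the glb- and lub-answers of a scalar SUM query:

```python
from itertools import product

# Toy relation ACC(ACCID, BAL) with key ACCID; "A1" and "A2" each have
# two key-equal facts, so the instance has four subset repairs.
facts = [("A1", 900), ("A1", 1200), ("A2", 1000), ("A2", -100)]

groups = {}
for accid, bal in facts:
    groups.setdefault(accid, []).append((accid, bal))

# Every repair keeps exactly one fact from each key-equal group.
possible = [sum(bal for _, bal in repair)
            for repair in product(*groups.values())]

print([min(possible), max(possible)])  # the interval [glb, lub]
```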
Arenas et al. (Arenas et al., 2003a) focused on scalar aggregation queries only. Fuxman, Fazli, and Miller (Fuxman et al., 2005a) extended the notion of range consistent answers to aggregation queries with grouping (see Section 3).
Range semantics have become the standard semantics of aggregation queries in the framework of database repairs (see (Bertossi, 2011, Section 5.6)). Furthermore, range semantics have been adapted to give semantics to aggregation queries in several other contexts, including data exchange (Afrati and Kolaitis, 2008) and ontologies (Kostylev and Reutter, 2015). Finally, range semantics have been suggested as an alternative way to overcome some of the issues arising from SQL’s handling of null values (Guagliardo and Libkin, 2016).
Earlier Systems for Consistent Query Answering. Several academic prototype systems for consistent query answering have been developed (Arenas et al., 2003a; Barceló and Bertossi, 2003; Chomicki et al., 2004b; Fuxman et al., 2005a, b; Greco et al., 2003; Kolaitis et al., 2013; Manna et al., 2011; Marileo and Bertossi, 2010; Dixit and Kolaitis, 2019). These systems use different approaches, including logic programming (Barceló and Bertossi, 2003; Greco et al., 2003), compact representations of repairs (Chomicki et al., 2004a), and reductions to solvers (Manna et al., 2011; Kolaitis et al., 2013; Dixit and Kolaitis, 2019). In particular, in (Dixit and Kolaitis, 2019), we reported on CAvSAT, a system that at that time was able to compute the consistent answers of unions of SPJ queries w.r.t. denial constraints (which include functional dependencies as a special case) via reductions to SAT solvers. Among all these systems, however, only the ConQuer system by Fuxman et al. (Fuxman et al., 2005a, b) is capable of handling aggregation queries. Actually, ConQuer can only handle a restricted class of aggregation queries, namely, those aggregation queries w.r.t. key constraints for which the underlying SPJ query belongs to the class called Cforest. For such a query q*, the range consistent answers of q* are SQL-rewritable, which means that there is a SQL query q′ such that the range consistent answers of q* on an instance I can be obtained by directly evaluating q′ on I. This leaves out, however, many aggregation queries, including all aggregation queries whose range consistent answers are not SQL-rewritable or are NP-hard to compute. Up to now, no system has supported such queries.
Summary of Contributions. In this paper, we report on and evaluate the performance of AggCAvSAT (Aggregate Consistent Answers via Satisfiability Testing), which is an enhanced version of CAvSAT and is also the first system capable of computing the range consistent answers to all aggregation queries involving the operators SUM(), COUNT(), or COUNT(*), with or without grouping.
We first corroborate the need for a system that goes well beyond ConQuer by showing that there is an aggregation query q* involving SUM() such that the consistent answers of the underlying SPJ query w.r.t. key constraints are SQL-rewritable, but computing the range consistent answers of q* is NP-hard (Theorem 3.2 in Section 3).
The distinctive feature of AggCAvSAT is that it uses polynomial-time reductions from the computation of the range consistent answers of aggregation queries to optimization variants of Boolean Satisfiability (SAT), such as Partial MaxSAT and Weighted Partial MaxSAT. These reductions, described in Sections 4 and 5, are natural, but they are much more sophisticated than the reductions used in (Dixit and Kolaitis, 2019) to reduce the consistent answers of SPJ queries to SAT. After the reductions have been carried out, AggCAvSAT deploys powerful SAT solvers, such as the MaxHS solver (Davies and Bacchus, 2011), to compute the range consistent answers of aggregation queries. Furthermore, AggCAvSAT can handle databases that are inconsistent not only w.r.t. key constraints, but also w.r.t. arbitrary denial constraints, a much broader class of constraints.
An extensive experimental evaluation of AggCAvSAT is reported in Section 6. We carried out a suite of experiments on both synthetic and real-world databases, and for a variety of aggregation queries with and without grouping. The synthetic databases were generated using two different methods: (a) the DBGen tool of TPC-H was used to generate consistent data into which inconsistencies were then injected artificially; (b) the PDBench inconsistent database generator from the probabilistic database management system MayBMS (Antova et al., 2008) was used. The experiments demonstrated the scalability of AggCAvSAT along two dimensions: the size of the data and the degree of inconsistency in the data. AggCAvSAT was also competitive in comparison to ConQuer (especially when the degree of inconsistency was not excessive), even though the latter is tailored to handle only a restricted class of aggregation queries whose range consistent answers are SQL-rewritable.
Consistent Answers vs. Data Cleaning. There is a large body of work on managing inconsistent databases via data cleaning. There are fundamental differences between the framework of the consistent answers and the framework of data cleaning (see (Bertossi, 2011, Section 6)). In particular, the consistent answers provide the guarantee that each such answer will be found no matter on which repair the query at hand is evaluated, while data cleaning provides no similar guarantee. Data cleaning has the attraction that it produces a single consistent instance but the process need not be deterministic and the instance produced need not even be a repair (i.e., it need not be a maximally consistent instance). Recent data cleaning systems, such as HoloClean (Rekatsinas et al., 2017) and Daisy (Giannakopoulou et al., 2020b, a), produce a probabilistic database instance as the output (which again need not be a repair).
It is not clear how to compare query answers over the database returned by a data cleaning system and the (range) consistent answers computed by a consistent query answering system. In fact, no such comparison is given in the HoloClean (Rekatsinas et al., 2017) and Daisy (Giannakopoulou et al., 2020b, a) papers. At the performance level, the data cleaning approaches remove inconsistencies in the data offline, hence the time-consuming tasks are done prior to answering the queries; in contrast, systems for consistent query answering work online. It is an interesting project, left for future research, to develop a methodology and carry out a fair comparison on a level playing field between systems for data cleaning and systems for consistent query answering.
2. Preliminaries
Integrity Constraints and Database Queries
A relational database schema R is a finite collection of relation symbols, each with a fixed positive integer as its arity. The attributes of a relation symbol R are names for its columns; they can be identified with their positions, thus attr(R) denotes the set of attributes of R. An R-instance I is a collection of finite relations, one for each relation symbol in R. An expression of the form R(a1, ..., an) is a fact of the instance I if the tuple (a1, ..., an) belongs to the relation of R in I. A key of R is a minimal subset X of attr(R) such that the functional dependency X → attr(R) holds.
Starting with Codd’s seminal work (Codd, 1970, 1972), first-order logic has been successfully used as a database query language; in fact, it forms the core of SQL. A conjunctive query is expressible by a first-order formula of the form q(y) = ∃z (R1(w1) ∧ ⋯ ∧ Rm(wm)), where each wi is a tuple of variables and constants, y and z are tuples of variables with no variable in common, and each variable in wi appears in exactly one of the tuples y and z. A conjunctive query with no free variables (i.e., all variables are existentially quantified) is a boolean query, while a conjunctive query with k free variables is a k-ary query. Conjunctive queries are also known as select-project-join (SPJ) queries with equi-joins, and are among the most frequently asked queries. For example, on the instance from Table 1, the binary conjunctive query q(x, y) = ∃z, w (CUST(z, y, w) ∧ CUSTACC(z, x)) returns the set of all pairs (a, n) such that a is an account ID of an account owned by the customer named n.
Equivalently, this query can be expressed in SQL as
SELECT CUSTACC.ACCID, CUST.CNAME
FROM CUST, CUSTACC
WHERE CUST.CID = CUSTACC.CID
A union of conjunctive queries is expressible by a disjunction of conjunctive queries, where all conjunctive queries have the same arity. Unions of conjunctive queries are strictly more expressive than conjunctive queries.
Database Repairs and Consistent Answers
Let Σ be a set of integrity constraints on a database schema R. An R-instance I is consistent if I ⊨ Σ, i.e., I satisfies every constraint in Σ; otherwise, I is inconsistent. For example, let I be the instance depicted in Table 1, and let Σ consist of two key constraints, namely, the key CID of CUST and the key ACCID of ACC. Clearly, I is inconsistent, since some facts of CUST and some facts of ACC violate these key constraints.
A repair of an inconsistent instance I w.r.t. Σ is a consistent instance that differs from I in a “minimal” way. Different notions of minimality give rise to different types of repairs (see (Bertossi, 2011) for a comprehensive survey). Here, we focus on subset repairs, the most extensively studied type of repairs. An instance J is a subset repair of an instance I if J is a maximal consistent subinstance of I, that is, J ⊆ I (where I and J are viewed as sets of facts), J ⊨ Σ, and there exists no instance J′ such that J ⊊ J′ ⊆ I and J′ ⊨ Σ. Arenas et al. (Arenas et al., 1999) used repairs to give rigorous semantics to query answering on inconsistent databases. Specifically, assume that q is a query, I is an R-instance, and t is a tuple of values. We say that t is a consistent answer to q on I w.r.t. Σ if t ∈ q(J) for every repair J of I. We write Cons(q, I, Σ) to denote the set of all consistent answers to q on I w.r.t. Σ, i.e., Cons(q, I, Σ) = ⋂ {q(J) : J is a repair of I w.r.t. Σ}.
If Σ is a fixed set of integrity constraints and q is a fixed query, then the main computational problem associated with the consistent answers is: given an instance I, compute Cons(q, I, Σ); we write Cons(q, Σ) to denote this problem. If q is a boolean query, then computing the consistent answers becomes the decision problem CERTAINTY(q, Σ): given an instance I, is q true on every repair of I w.r.t. Σ? When the constraints in Σ are understood from the context, we will write Cons(q) and CERTAINTY(q) in place of Cons(q, Σ) and CERTAINTY(q, Σ), respectively.
Complexity of Consistent Answers
There has been an extensive study of the consistent answers of conjunctive queries (Bertossi, 2011; Fuxman et al., 2005b; Fuxman and Miller, 2007; Wijsen, 2009, 2010a, 2010b; Kolaitis and Pema, 2012; Koutris and Wijsen, 2016, 2017). If Σ is a fixed set of key constraints and q is a boolean conjunctive query, then CERTAINTY(q, Σ) is always in coNP, but, depending on the query and the constraints, it exhibits a variety of behaviors within coNP. The most definitive result to date is a trichotomy theorem by Koutris and Wijsen (Koutris and Wijsen, 2016, 2017); it asserts that if q is a self-join-free (no repeated relation symbols) boolean conjunctive query with one key constraint per relation, then CERTAINTY(q) is either SQL-rewritable, or in P but not SQL-rewritable, or coNP-complete. It is an open problem whether or not this trichotomy extends to arbitrary boolean conjunctive queries and to broader classes of constraints (e.g., denial constraints).
We illustrate the trichotomy theorem with three examples. In what follows, the first attribute of each relation constitutes its key. First, if q1 is the query ∃x∃y∃z (R(x, y) ∧ S(y, z)), then CERTAINTY(q1) is SQL-rewritable (Fuxman and Miller, 2007). Second, if q2 is the query ∃x∃y (R(x, y) ∧ S(y, x)), then CERTAINTY(q2) is in P, but is not SQL-rewritable (Wijsen, 2010b). Third, if q3 is the query ∃x∃y∃z (R(x, y) ∧ S(z, y)), then CERTAINTY(q3) is coNP-complete (Fuxman and Miller, 2007).
Boolean Satisfiability and SAT Solvers
Boolean Satisfiability (SAT) is arguably the prototypical and the most widely studied NP-complete problem. SAT is the following decision problem: given a boolean formula φ, is φ satisfiable? There has been an extensive body of research on different aspects of boolean satisfiability (see the handbook (Biere et al., 2009)). In particular, significant progress has been made on developing SAT solvers, so much so that the advances in this area of research are often referred to as the “SAT Revolution” (Vardi, 2009). Typically, a SAT solver takes a boolean formula φ in conjunctive normal form (CNF) as input and either outputs a satisfying assignment for φ (if one exists) or reports that the formula is unsatisfiable. Recall that a formula is in CNF if it is a conjunction of clauses, where each clause is a disjunction of literals. For example, the formula (x1 ∨ x2) ∧ (¬x1 ∨ ¬x3) ∧ (x3 ∨ x4) has a satisfying assignment x1 = 1, x2 = 0, x3 = 0, and x4 = 1.
At present, SAT solvers are capable of quickly solving SAT instances with millions of clauses and variables. SAT solvers have been widely used in both academia and industry as general-purpose tools. Indeed, many real-world problems from a variety of domains, including scheduling, protocol design, software verification, and model checking, can be naturally encoded as SAT instances and solved quickly using solvers such as Glucose (Audemard and Simon, 2009) and CaDiCaL (cad, [n.d.]). Furthermore, SAT solvers have been used in solving open problems in mathematics (Heule et al., 2016; Oostema et al., 2020). In (Dixit and Kolaitis, 2019), we used SAT solvers to build a prototype system for consistent query answering, which we called CAvSAT. This system can compute the consistent answers to unions of conjunctive queries over relational databases that are inconsistent w.r.t. a fixed set of arbitrary denial constraints.
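As a minimal illustration of the solving workflow that AggCAvSAT builds on, the following Python sketch feeds a small CNF formula to a SAT solver; it assumes the third-party python-sat (PySAT) package, which is not part of AggCAvSAT itself (the system invokes solvers such as MaxHS directly).

```python
from pysat.solvers import Glucose3

# (x1 v x2) AND (not x1 v x3) AND (not x2 v not x3), as DIMACS-style integers.
solver = Glucose3()
solver.add_clause([1, 2])
solver.add_clause([-1, 3])
solver.add_clause([-2, -3])

if solver.solve():
    print(solver.get_model())   # a satisfying assignment, e.g., [1, -2, 3]
else:
    print("unsatisfiable")
solver.delete()
```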
3. Range Consistent Answers
Frequently asked database queries often involve one of the standard aggregation operators COUNT(), COUNT(*), SUM(), AVG(), MIN(), MAX(), and, possibly,
a GROUP BY clause. In what follows, we will use the term
aggregation queries to refer to queries with aggregate operators and with or without a GROUP BY clause. Thus, in full generality, an aggregation query q* can be expressed as

SELECT g1, ..., gk, f(A) FROM R(q) GROUP BY g1, ..., gk

where f is one of the aforementioned aggregate operators and R(q) is the relation returned by a query q, which typically is a conjunctive query or a union of conjunctive queries expressed in SQL. The query q is called the underlying query of q*, the attribute represented by the variable A is called the aggregation attribute, and the attributes represented by g1, ..., gk are called the grouping attributes. A scalar aggregation query is one without a GROUP BY clause.
It is often the case that an aggregation query returns different answers on different repairs of an inconsistent database; thus, even for a scalar aggregation query, there is typically no consistent answer as per the definition of consistent answers given earlier. In fact, to produce an empty set of consistent answers, it suffices to have just two repairs on which a scalar aggregation query returns different answers. Aiming to obtain more meaningful answers to aggregation queries, Arenas et al. (Arenas et al., 2003b) proposed the range consistent answers as an alternative notion of consistent answers.
Let q* be a scalar aggregation query. The set of possible answers to q* on an inconsistent instance I consists of the answers to q* over all repairs of I, i.e., it is the set {q*(J) : J is a repair of I}. By definition, the range consistent answers to q* on I is the interval [glb, lub] whose endpoints are, respectively, the greatest lower bound (glb) and the least upper bound (lub) of the set of possible answers to q* on I.
For example,
the range consistent answers of the query
SELECT SUM(ACCOUNTS.BAL) FROM ACCOUNTS, CUSTACC
WHERE ACCOUNTS.ACCID = CUSTACC.ACCID AND CUSTACC.CID = 'C2'
on the instance in Table 1 is the interval [900, 2200]. The guarantee is that no matter how the database is repaired, the answer to the query lies between 900 and 2200. Note that the glb-answer and the lub-answer are realized in two different repairs of I, determined by which fact of each key-equal group of Table 1 is retained.
Arenas et al. (Arenas et al., 2003a) focused on scalar aggregation queries only. Fuxman, Fazli, and Miller (Fuxman et al., 2005a) extended the notion of range consistent answers to aggregation queries with grouping, i.e., to queries of the form

SELECT g1, ..., gk, f(A) FROM R(q) GROUP BY g1, ..., gk.

For such queries, a tuple (v̄, [glb, lub]) is a range consistent answer to q* on I if the following conditions hold:

(1) For every repair J of I, there exists a value a s.t. (v̄, a) ∈ q*(J) and glb ≤ a ≤ lub.
(2) For some repair J of I, we have that (v̄, glb) ∈ q*(J).
(3) For some repair J of I, we have that (v̄, lub) ∈ q*(J).
If q* is an aggregation query, Range-Cons(q*) denotes the problem: given an instance I, compute the range consistent answers of q* on I.
Complexity of Range Consistent Answers
Arenas et al. (Arenas et al., 2003b) investigated the computational complexity of the range consistent answers for scalar aggregation queries of the form

SELECT f(A) FROM R,

where f is one of the standard aggregation operators and R is a relation over a schema with functional dependencies.
where is one of the standard aggregation operators and is a relational schema with functional dependencies. The main findings in Arenas et al. (Arenas et al., 2003b) can be summarized as follows.
If the relational schema has at most one functional dependency and f is one of the aggregation operators MIN(), MAX(), SUM(), COUNT(*), AVG(), then computing the range consistent answers of the query is in P.
There is a relational schema with one key dependency such that computing the range consistent answers of the query with the COUNT(A) operator is an NP-complete problem.
There is a relational schema with two functional dependencies such that computing the range consistent answers of the query is an NP-complete problem, where the query involves any one of the standard aggregation operators.
It remains an open problem to pinpoint the complexity of the range consistent answers for richer aggregation queries of the form

SELECT f(A) FROM R(q),

where R(q) is the relation returned by a conjunctive query or by a union of conjunctive queries. It can be shown, however, that if computing the consistent answers of the underlying query is a hard problem, then computing the range consistent answers of the aggregation query is a hard problem as well. This gives rise to the following question: what can we say about the complexity of the range consistent answers if computing the consistent answers of the underlying query is an easy problem?
Fuxman and Miller (Fuxman and Miller, 2007) identified a class, called Cforest, of self-join-free conjunctive queries whose consistent answers are SQL-rewritable. In his PhD thesis, Fuxman (Fuxman, 2007) introduced the class Caggforest consisting of all aggregation queries q* such that the aggregation operator is one of MIN(), MAX(), SUM(), COUNT(*), the underlying query q is a conjunctive query in Cforest, and there is one key constraint for each relation in the underlying query q. Fuxman (Fuxman, 2007) showed that the range consistent answers of every query in Caggforest are SQL-rewritable (earlier, similar results for a proper subclass of Caggforest were obtained by Fuxman, Fazli, and Miller).
It is known that there are self-join-free conjunctive queries outside the class Cforest whose consistent answers are SQL-rewritable. In fact, Koutris and Wijsen (Koutris and Wijsen, 2017) have characterized the self-join-free conjunctive queries whose consistent answers are SQL-rewritable. However, the SQL-rewritability of aggregation queries beyond those in Caggforest has not been investigated. In the sequel, we show that there exists a self-join-free conjunctive query whose consistent answers are SQL-rewritable, but this property is not preserved when an aggregation operator is added on top of it. Specifically, we reduce the Maximum Cut problem to the problem of computing the range consistent answers to an aggregation query that involves SUM and whose underlying conjunctive query has SQL-rewritable consistent answers. We begin by recalling the definition of the Maximum Cut problem, a fundamental NP-complete problem (Karp, 1972). We state and prove a helping lemma (Lemma 3.3) as part of the proof of the main result, Theorem 3.2.
Definition 3.1.
Maximum Cut. For an undirected graph G = (V, E), a cut of G is a partition (S, S′) of V, where S ⊆ V and S′ = V \ S. The set of edges with one vertex in S and one vertex in S′ is denoted by E(S, S′), and the size of the cut is |E(S, S′)|.
The Maximum Cut problem asks: Given an undirected graph G and an integer k, is there a cut of G that has size at least k?
Theorem 3.2.
Let R be a schema with three relations R1, R2, and R3, where the first attributes of R1 and R2 are their keys, and the key of R3 consists of all of its attributes. Let q* be the following aggregation query:

SELECT SUM(w) FROM R(q),

where q is the following self-join-free conjunctive query:

q(w) = ∃x ∃y (R1(x, 'red') ∧ R2(y, 'blue') ∧ R3(x, y, w)).

Then the following two statements hold:

(1) Cons(q) is SQL-rewritable.
(2) Range-Cons(q*) is NP-hard.
Proof.
To show that Cons(q) is SQL-rewritable, consider the following first-order query q′:
We will show that for every instance and every value , we have that . Since filters out the tuples from and that participate in the violations of the key constraints, we have that if , then , for every repair of , which means that . In the other direction, we claim that if , then . Indeed, if , then for all and such that , we would have that there is some such that and or there is some such that and . Construct a repair of as follows. First, for every , if ‘red’ is the only value such that is a fact of , then put in ; otherwise, pick an element such that is a fact of and put in . Second, for every , if ‘blue’ is the only value such that is a fact of , then put in ; otherwise, pick an element such that is a fact of and put in . Third, put every tuple of the relation of into . Clearly, is a repair of . Moreover, . Indeed, if , then there are elements and such that . Since , we have that there is some such that and or there is some such that and . In the first case, the construction of implies that is not a fact of , while in the second case, the construction of implies that is not a fact of ; in either case, we have arrived at a contradiction.
To show that Range-Cons(q*) is NP-hard, consider the following reduction from undirected graphs to R-instances, where R is the schema with relations R1, R2, and R3.
Reduction 3.1.
Given an undirected graph G = (V, E), construct an R-instance I_G as follows. Let n = |V|.

• For each vertex v ∈ V, add the tuples R1(v, 'red'), R1(v, 'blue'), R2(v, 'red'), and R2(v, 'blue') to I_G.
• For each vertex v ∈ V, add the tuple R3(v, v, −n) to I_G.
• For each edge {u, v} ∈ E, add the tuples R3(u, v, 1) and R3(v, u, 1) to I_G.
We will show that the preceding Reduction 3.1 reduces Maximum Cut to computing the range consistent answers of the aggregation query q*.
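A small Python sketch of Reduction 3.1, under our reading of the damaged source (the relation names R1, R2, R3 and the vertex-tuple weight −n are reconstructions):

```python
def instance_from_graph(vertices, edges):
    """Build the relations R1, R2, R3 of Reduction 3.1 from a graph."""
    n = len(vertices)
    r1 = [(v, c) for v in vertices for c in ("red", "blue")]  # key: vertex
    r2 = [(v, c) for v in vertices for c in ("red", "blue")]  # key: vertex
    r3 = [(v, v, -n) for v in vertices]        # penalty tuple per vertex
    for u, v in edges:
        r3.append((u, v, 1))                   # one tuple per edge direction
        r3.append((v, u, 1))
    return r1, r2, r3

r1, r2, r3 = instance_from_graph(["a", "b", "c"], [("a", "b"), ("b", "c")])
print(len(r1), len(r2), len(r3))               # 6 6 7
```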
For the rest of this section, let G = (V, E) be an undirected graph and I_G be the database instance constructed from G using Reduction 3.1. We say that a repair J of I_G produces a red-blue coloring of G if for every vertex v ∈ V, the tuples R1(v, 'red') and R2(v, 'red') are either both present in J or both absent from J. We now prove a useful lemma.
Lemma 3.3.
For every repair J of I_G, there exists a repair J′ of I_G (not necessarily different from J) such that J′ produces a red-blue coloring of G and q*(J′) ≥ q*(J).
Proof.
Let J be a repair of I_G. Construct an R-instance J′ from J as follows. For every vertex v ∈ V, if both tuples R1(v, c) and R2(v, c) are present in J for some color c, then add them to J′. Otherwise, add the tuples R1(v, 'red') and R2(v, 'red') to J′. Also, copy all tuples from relation R3 of J to relation R3 of J′. Clearly, J′ is a repair of I_G and produces a red-blue coloring of G. Observe that J′ can be different from J only if there exists at least one vertex v such that either R1(v, 'red'), R2(v, 'blue') ∈ J or R1(v, 'blue'), R2(v, 'red') ∈ J.
We will show that q*(J′) ≥ q*(J) holds.
Case 1: Let v be a node such that R1(v, 'red'), R2(v, 'blue') ∈ J. In this case, while populating the database instance J′, vertex v changes its color in relation R2, i.e., we have that R2(v, 'blue') ∈ J and R2(v, 'red') ∈ J′. Therefore, the summands arising from the tuples of the form R1(u, 'red'), R2(v, 'blue'), and R3(u, v, 1) of J (for some vertex u) do not appear in q*(J′). Notice that each of these summands contributes value 1 to q*(J) and the number of these summands is at most n − 1. At the same time, the summand that contributes value −n to q*(J), arising from the tuples R1(v, 'red'), R2(v, 'blue'), and R3(v, v, −n) of J, also does not appear in q*(J′). Since n > n − 1, it follows that q*(J′) cannot be made smaller than q*(J) on account of such a node v.
Case 2: Let v be a node such that R1(v, 'blue'), R2(v, 'red') ∈ J. In this case, while populating J′, vertex v changes its color in relation R1, i.e., we have that R1(v, 'blue') ∈ J and R1(v, 'red') ∈ J′. Compared to q*(J), this can only increase the number of summands that contribute 1 to the sum, by possibly having new summands arising from the tuples of type R1(v, 'red'), R2(u, 'blue'), and R3(v, u, 1) of J′ (for some vertex u). Moreover, every summand that contributes 1 to q*(J) and is not accounted for in Case 1 also contributes 1 to q*(J′). Hence, q*(J′) cannot be made smaller than q*(J) on account of such a node v.
The preceding analysis implies that q*(J′) ≥ q*(J). ∎
By Lemma 3.3, there exists a repair J of I_G such that J produces a red-blue coloring of G and q*(J) is the lub-answer in Range-Cons(q*) on I_G. We will show that, for a non-negative integer k, there is a cut (S, S′) of G such that |E(S, S′)| ≥ k if and only if there exists a repair J of I_G such that J produces a red-blue coloring of G and q*(J) ≥ k. Once this is shown, it will follow that it is NP-hard to even compute the lub-answer in Range-Cons(q*) on I_G.
Let (S, S′) be a cut of G such that |E(S, S′)| ≥ k. Construct an R-instance J as follows. For each vertex v ∈ S, add the tuples R1(v, 'red') and R2(v, 'red') to J. For each vertex v ∈ S′, add the tuples R1(v, 'blue') and R2(v, 'blue') to J. Add all tuples from relation R3 of I_G to J. Observe that J is a repair of I_G and that J produces a red-blue coloring of G. Also, every edge with one vertex in S and one vertex in S′ is part of a witness that gives rise to a summand contributing 1 to q*(J). Moreover, no summand in q*(J) arises from a tuple of the form R3(v, v, −n), because J contains no pair of tuples R1(v, 'red') and R2(v, 'blue'). Since |E(S, S′)| ≥ k, it must be the case that q*(J) ≥ k. In the other direction, let J be a repair of I_G such that J produces a red-blue coloring of G and q*(J) ≥ k. Construct two sets S and S′ of vertices of G as follows. Let v ∈ S if R1(v, 'red') ∈ J, and let v ∈ S′ if R1(v, 'blue') ∈ J. Clearly, (S, S′) is a cut of G. Every edge {u, v} such that u ∈ S and v ∈ S′ is part of a witness that gives rise to a summand contributing 1 to q*(J), since the tuples R1(u, 'red'), R2(v, 'blue'), and R3(u, v, 1) of J satisfy the underlying conjunctive query q of q*. In fact, since J produces a red-blue coloring of G, every summand that contributes to q*(J) must arise from such tuples. Since q*(J) ≥ k, it must be the case that |E(S, S′)| ≥ k. ∎
4. Consistent Answers via SAT Solving
In this section, we give polynomial-time reductions from computing the range consistent answers of aggregation queries to variants of SAT. The reductions in this section assume that the database schema has one key constraint per relation; in Section 5, we show how these reductions can be extended to schemata with arbitrary denial constraints. Our reductions rely on several well-known optimization variants of SAT that we describe next.
Weighted MaxSAT (or WMaxSAT) is the maximization variant of SAT in which each clause is assigned a positive weight and the goal is to find an assignment that maximizes the sum of the weights of the satisfied clauses. We will write (C, w) to denote a clause C with weight w.
Partial MaxSAT (or PMaxSAT) is the maximization variant of SAT in which some clauses of the formula are assigned infinite weight (hard clauses), while each of the rest is assigned weight one (soft clauses). The goal is to find an assignment that satisfies all hard clauses and the maximum number of soft clauses. If the hard clauses of a PMaxSAT instance are not simultaneously satisfiable, then we say that the instance is unsatisfiable. For simplicity, a hard clause C is denoted as (C, ∞).
Weighted Partial MaxSAT (or WPMaxSAT) is the maximization variant of SAT in which some of the clauses of the formula are assigned infinite weight (hard clauses), while each of the rest is assigned a positive weight (soft clauses). The goal is to find an assignment that satisfies all hard clauses and maximizes the sum of weights of the satisfied soft clauses. Clearly, WPMaxSAT is a common generalization of both WMaxSAT (no hard clauses) and PMaxSAT (each soft clause has weight one).
Modern solvers, such as MaxHS (Davies and Bacchus, 2011), can efficiently solve large instances of these maximization variants of SAT. Note that these maximization problems have dual minimization problems, called WMinSAT, PMinSAT, and WPMinSAT, respectively. For example, in WPMinSAT, the goal is to find an assignment that satisfies all hard clauses and minimizes the sum of the weights of the satisfied soft clauses. These minimization problems are of interest to us because some of the computations of the range consistent answers have natural reductions to such minimization problems. At present, the only existing WPMinSAT solver is MinSatz (Li et al., 2011). Since this solver has certain size limitations, we will deploy Kügel’s technique (Kügel, 2012) to first reduce WPMinSAT to WPMaxSAT, and then use the MaxHS solver in our experiments. This technique uses the concept of CNF-negation (Kügel, 2012; Zhu et al., 2012). By definition, if C = (l1 ∨ l2 ∨ ⋯ ∨ lk) is a clause, then the CNF-negation of C is the CNF-formula (¬l1) ∧ (l1 ∨ ¬l2) ∧ ⋯ ∧ (l1 ∨ l2 ∨ ⋯ ∨ lk−1 ∨ ¬lk). It is easy to verify that the following properties hold: (i) if an assignment does not satisfy C, then it satisfies every clause of the CNF-negation of C; (ii) if an assignment satisfies C, then it satisfies all but one of the clauses of the CNF-negation of C, namely, the clause (l1 ∨ ⋯ ∨ li−1 ∨ ¬li), where i is the smallest index such that li is true under the assignment. It follows that an assignment satisfies C if and only if it falsifies exactly one clause of the CNF-negation of C. For a weighted clause (C, w), the CNF-negation of (C, w) is obtained by assigning the weight w to each clause of the CNF-negation of C.
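The following self-contained Python sketch (names ours) builds the CNF-negation of a clause, represented as a list of DIMACS-style integer literals, and verifies properties (i) and (ii) by brute force:

```python
from itertools import product

def cnf_negation(clause):
    """CNF-negation of l1 v ... v lk:
    (~l1) & (l1 v ~l2) & ... & (l1 v ... v l_{k-1} v ~lk)."""
    return [clause[:i] + [-clause[i]] for i in range(len(clause))]

def satisfies(assignment, clause):
    # assignment maps variable number -> bool
    return any(assignment[abs(l)] == (l > 0) for l in clause)

clause = [1, -2, 3]          # x1 v ~x2 v x3
negation = cnf_negation(clause)

for bits in product([False, True], repeat=3):
    a = {1: bits[0], 2: bits[1], 3: bits[2]}
    sat_count = sum(satisfies(a, c) for c in negation)
    if satisfies(a, clause):
        assert sat_count == len(clause) - 1   # property (ii)
    else:
        assert sat_count == len(clause)       # property (i)
print("properties (i) and (ii) verified")
```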
We also need to recall the notions of key-equal groups and bags of witnesses. Let I be a database instance.
We say that two facts of a relation R of I are key-equal if they agree on the key attributes of R. A set G of facts of I is called a key-equal group of facts if every two facts in G are key-equal, and no fact in G is key-equal to some fact in I \ G.
Let q(y) = ∃z (R1(w1) ∧ ⋯ ∧ Rm(wm)) be a conjunctive query, where each wi is a tuple of variables and constants, and let t be an answer to q on I. Let vars(q) and cons(q) be the sets of variables and constants occurring in q. A function μ from vars(q) ∪ cons(q) to the values occurring in I is a witnessing assignment to t if the following hold: μ(y) = t; if c is a constant in q, then μ(c) = c; and if Ri(wi) is an atom of q, then Ri(μ(wi)) is a fact of I. We say that a set W of facts from I is a witness to t if there is a witnessing assignment μ to t such that W consists of the facts Ri(μ(wi)) for the atoms Ri(wi) of q.
Note that two distinct witnessing assignments to an answer may give rise to the same witness. Thus, we consider the bag of witnesses to an answer, i.e., the bag consisting of the witnesses arising from all witnessing assignments to that answer, where each witness W is accompanied by its multiplicity, an integer denoting the number of witnessing assignments that gave rise to W. Finally, we define the bag of witnesses to a conjunctive query q on I as the bag union of the bags of witnesses over all answers to q on I (in the bag union, the multiplicities of the same set are added). The bag of witnesses to a union of conjunctive queries is the bag union of the bags of witnesses to each conjunctive query in the union. The bag of witnesses will be used in computing the range consistent answers to aggregation queries. In effect, the bag of witnesses corresponds to the provenance polynomials of conjunctive queries and their unions (Green et al., 2007; Karvounarakis and Green, 2012).
It is easy to verify that both the key-equal groups and the bag of witnesses can be computed using SQL queries. In Section 4.1 and Section 4.2, we give reductions that compute the range consistent answers to aggregation queries without grouping, and in Section 4.3, we describe an iterative algorithm that uses these reductions to handle aggregation queries with grouping.
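In AggCAvSAT, the key-equal groups are extracted with SQL queries; the following Python sketch over hypothetical facts shows the same grouping step:

```python
from collections import defaultdict

# Facts of a relation R(A, B, C) whose key is the first attribute A.
facts = [("a1", 1, "x"), ("a1", 2, "y"), ("a2", 3, "x"), ("a3", 4, "z")]
key_positions = (0,)  # positions of the key attributes of R

groups = defaultdict(list)
for fact in facts:
    key = tuple(fact[p] for p in key_positions)
    groups[key].append(fact)

# Key-equal groups; those with more than one fact violate the key.
for key, group in groups.items():
    print(key, group, "VIOLATION" if len(group) > 1 else "")
```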
4.1. Answering Queries with SUM and COUNT
Let R be a database schema with one key constraint per relation, and let q* be the aggregation query

SELECT f(A) FROM R(q),

where f is one of the operators COUNT(*), COUNT(A), SUM(A), and R(q) is a relation expressed as a union of conjunctive queries over R. We now reduce the computation of the range consistent answers of q* to PMaxSAT and to WPMaxSAT.
4.1.1. Reductions to PMaxSAT and WPMaxSAT
Reduction 4.1.
Let q* be an aggregation query as above, where f is one of the operators COUNT(*), COUNT(A), and SUM(A). Let I be an R-instance and K be the set of key-equal groups of facts of I. For each fact fi of I, introduce a boolean variable xi. Let W be the bag of witnesses to the query q on I, where q is the underlying query of q*. Construct a partial CNF-formula (if f is COUNT(*) or COUNT(A)) or a weighted partial CNF-formula (if f is SUM(A)) as follows:

(1) For each key-equal group G ∈ K,
• construct a hard clause α(G), the disjunction of the variables xi of the facts in G;
• for each pair of distinct facts fi, fj in G, construct a hard clause β(fi, fj) = (¬xi ∨ ¬xj).

(2a) If f is COUNT(*) or COUNT(A), then for each witness W ∈ W, construct a soft clause γ(W), the disjunction of the negated variables ¬xi of the facts in W. Construct a partial CNF-instance consisting of all hard α- and β-clauses together with all soft γ-clauses.

(2b) If f is SUM(A), let W+ and W− be the subsets of W such that for each witness W, we have W ∈ W+ iff valA(W) ≥ 0, and W ∈ W− iff valA(W) < 0, where valA(W) denotes the value of the aggregation attribute A on W. Let also w = |valA(W)|, the absolute value of valA(W). Construct a weighted soft clause γ(W) and a conjunction θ(W) of hard clauses as follows. If W ∈ W−, introduce a new variable yW, let γ(W) = (yW, w), and let θ(W) be the CNF encoding of the equivalence of yW with the conjunction of the variables xi of the facts in W; otherwise, let γ(W) be the disjunction of the negated variables ¬xi of the facts in W with weight w, and do not construct θ(W). Construct a weighted partial CNF-instance consisting of all hard α-, β-, and θ-clauses together with all soft γ-clauses.
Purpose of the components of Reduction 4.1:

• Each α-clause encodes the “at-least-one” constraint for a key-equal group G, in the sense that satisfying α(G) requires setting at least one variable corresponding to a fact in G to true. Similarly, each β-clause encodes the “at-most-one” constraint for G. In effect, every assignment that satisfies all α- and β-clauses sets exactly one variable corresponding to the facts from each key-equal group to true, and thus uniquely corresponds to a repair of I.

• Satisfying a γ-clause constructed in Step (2a) requires setting at least one variable corresponding to the facts of a witness W to q on I to false. Thus, if τ is an assignment that satisfies all α- and β-clauses, then γ(W) is satisfied by τ if and only if W is not contained in the repair J corresponding to τ.

• The γ-clauses constructed in Step (2b) serve the same purpose as the ones from Step (2a), but they are constructed in this way only for the witnesses in W+. For the witnesses in W−, the γ-clauses encode the condition that γ(W) is satisfied if and only if all variables corresponding to the facts in W are set to true. The hard θ-clauses are used solely to express this equivalence in conjunctive normal form.

The number of α-clauses and β-clauses is O(n²), where n is the size of the database; the number of γ-clauses and θ-clauses combined is O(n^m), where m is the number of relation symbols in q.
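A minimal Python sketch of Steps (1) and (2a) of Reduction 4.1 for a COUNT(*) query; clauses are lists of DIMACS-style integers, and the toy key-equal groups and witnesses are hypothetical (witness multiplicities can be folded into the soft weights):

```python
from itertools import combinations

# Variable i represents fact fi; facts are grouped by key-equality.
key_equal_groups = [[1, 2], [3], [4, 5]]   # variables of each group
witnesses = [[1, 3], [2, 4]]               # facts of each witness to q

hard, soft = [], []

# Step (1): exactly one fact per key-equal group survives in a repair.
for group in key_equal_groups:
    hard.append(group[:])                          # alpha: at-least-one
    for xi, xj in combinations(group, 2):
        hard.append([-xi, -xj])                    # beta: at-most-one

# Step (2a): one soft clause per witness, satisfied iff the witness is broken.
for facts in witnesses:
    soft.append(([-x for x in facts], 1))

# Feeding (hard, soft) to a MaxSAT solver such as MaxHS yields the
# glb-answer as the total weight of falsified soft clauses (Prop. 4.1).
print(hard, soft)
```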
Proposition 4.1.
Let q* be an aggregation query, where f is one of the operators COUNT(*), COUNT(A), and SUM(A). In a maximum (respectively, minimum) satisfying assignment of the PMaxSAT- or WPMaxSAT-instance constructed using Reduction 4.1, the sum of the weights of the falsified soft clauses is the glb-answer (respectively, the lub-answer) in the range consistent answers to q* on I.
Proof.
Let f be COUNT(*), and let τ be an assignment that satisfies the hard clauses of the formula constructed using Reduction 4.1. Let s(τ) denote the number of soft clauses of the formula satisfied by τ. Construct a database sub-instance J of I such that fi ∈ J if and only if τ(xi) = 1. The hard clauses of the formula constructed in Step (1) of Reduction 4.1 encode the condition that exactly one fact from each key-equal group of facts of I is in J, ensuring that J is a repair of I. Moreover, the soft clauses of the formula falsified by τ are in one-to-one correspondence with the witnesses to q in J. Therefore, we have that q*(J) equals the total number of soft clauses minus s(τ). Since the total number of soft clauses does not depend on τ, the answer q*(J) is minimized (i.e., q*(J) is the glb-answer to q* on I) when τ is a maximum satisfying assignment. Essentially the same argument works for the case where f is COUNT(A). A dual argument proves that a repair of I constructed from a minimum satisfying assignment realizes the lub-answer to q* on I.
Now, let f be SUM(A). Construct a repair J of I from τ by choosing fi ∈ J if and only if τ(xi) = 1. Observe that, for every W ∈ W+, the clause γ(W) is falsified by τ if and only if W ⊆ J. Similarly, for every W ∈ W−, the clause γ(W) is satisfied by τ if and only if W ⊆ J. Therefore, we have that the sum of the weights of the soft clauses falsified by τ equals q*(J) plus the sum of the absolute values |valA(W)| over all W ∈ W−. Since the latter term does not depend on τ, the answer q*(J) is minimized (i.e., q*(J) is the glb-answer to q* on I) when τ is a maximum satisfying assignment. A dual argument proves that a repair of I constructed from a minimum satisfying assignment of the formula realizes the lub-answer to q* on I. ∎
Example 4.2.
Let I be the database instance from Table 1, and let q* be the following aggregation query, which counts the number of customers who have an account in their own city:
SELECT COUNT(*)
FROM CUST, ACC, CUSTACC
WHERE CUST.CID = CUSTACC.CID
AND ACC.ACCID = CUSTACC.ACCID
AND CUST.CITY = ACC.CITY
From Reduction 4.1, we construct the following clauses:
• α-clauses: … ;
• β-clauses: … ;
• γ-clauses: … .
Observe that it is safe to omit the variables corresponding to the facts of CUSTACC, since CUSTACC does not violate any key constraint. A maximum satisfying assignment to the PMaxSAT instance constructed from the above clauses falsifies exactly one γ-clause, while a minimum satisfying assignment falsifies two γ-clauses. Thus, by Proposition 4.1, the range consistent answer to q* is the interval [1, 2].
Example 4.3.
Let us again consider the database instance I from Table 1, and the following aggregation query q*:
SELECT SUM(ACC.BAL)
FROM CUST, ACC, CUSTACC
WHERE CUST.CID = CUSTACC.CID
AND ACC.ACCID = CUSTACC.ACCID
AND CUST.CNAME = 'Mary'
The hard clauses constructed using Reduction 4.1 are the same as the ones from Example 4.2. The rest of the clauses are as follows:
• γ-clauses: … ;
• θ-clauses: … .
The two witnesses whose account balance is −100 belong to W−, so we introduce new variables y1 and y2 for them, respectively, and construct hard θ-clauses as described above. The γ-clauses corresponding to these witnesses are (y1, 100) and (y2, 100). We omit the CUSTACC variables in all of these clauses, since CUSTACC does not violate any key constraint. A maximum satisfying assignment to the WPMaxSAT instance satisfies soft clauses with total weight 3500, since it satisfies two clauses with weight 1200 each, one with weight 1000, and one with weight 100. Thus, by Proposition 4.1, the repair J corresponding to this assignment realizes the glb-answer. Similarly, a minimum satisfying assignment satisfies soft clauses with total weight 2200, namely, one clause with weight 1200 and one with weight 1000, and its corresponding repair realizes the lub-answer.
4.1.2. Handling DISTINCT
Let q* be an aggregation query, where f is either COUNT(DISTINCT A) or SUM(DISTINCT A). Solving a PMaxSAT or a WPMaxSAT instance constructed using Reduction 4.1 may yield incorrect glb- and lub-answers to q* if the database contains multiple witnesses with the same value of attribute A. For example, consider the database instance from Table 1 and the query
SELECT COUNT(DISTINCT ACC.TYPE) FROM ACC.
The correct glb- and lub-answers to this query are both 2, but the solutions to the PMaxSAT and PMinSAT instances constructed using Reduction 4.1 yield both answers as 4. The reason is that two of the soft clauses correspond to the account type Checking, while three of the soft clauses correspond to the account type Saving. The hard clauses in the formula ensure that the variables of one fact from each key-equal group of ACC are true, thus counting both the Checking and the Saving account types exactly twice in every satisfying assignment of the formula. This can be handled by modifying the γ-clauses in Reduction 4.1 as follows.
Let D denote the set of distinct answers to the query q on I. For each answer d ∈ D, let Wd denote the subset of W such that for every witness W ∈ Wd, we have that valA(W) = d. The idea is to use auxiliary variables to construct one soft clause for every distinct answer d, such that the clause is true if and only if no witness in Wd is present in the repair corresponding to the satisfying assignment. First, for every witness W ∈ Wd, we introduce an auxiliary variable pW that is true if and only if W is not present in the repair. Then, we introduce an auxiliary variable rd which is true if and only if all pW-variables with W ∈ Wd are true. These constraints are encoded in the set of clauses returned by Algorithm 1, and they are enforced by making the clauses in this set hard. For every answer d ∈ D, Algorithm 1 also returns one soft γ-clause, which serves the same purpose as the γ-clauses in Reduction 4.1. Now, a PMaxSAT or a WPMaxSAT instance can be constructed by taking in conjunction all hard clauses arising from the key-equal groups, the hard θ-clauses if any, the hard clauses from all the sets returned by Algorithm 1, and all soft γ-clauses. With this, it is easy to see that a maximum (respectively, minimum) satisfying assignment to the PMaxSAT or WPMaxSAT instance gives us the glb-answer (respectively, lub-answer) to q* on I. This is illustrated in Example 4.4.
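The following Python sketch (variable naming ours) illustrates the DISTINCT construction for COUNT(DISTINCT A): one auxiliary variable per witness, one per distinct value, and one soft clause per distinct value, with Tseitin-style hard clauses encoding the equivalences described above.

```python
from itertools import count

fresh = count(100)   # fresh auxiliary variables; fact variables use 1..99

def distinct_clauses(witnesses_by_value):
    """witnesses_by_value: value -> list of witnesses (lists of fact vars).
    Returns (hard, soft) clauses for COUNT(DISTINCT A)."""
    hard, soft = [], []
    for value, witnesses in witnesses_by_value.items():
        p_vars = []
        for w in witnesses:
            p = next(fresh)                       # p <-> witness w is broken
            hard.append([-p] + [-x for x in w])   # p -> some fact of w absent
            hard.extend([x, p] for x in w)        # some fact of w absent -> p
            p_vars.append(p)
        r = next(fresh)                           # r <-> all witnesses broken
        hard.extend([-r, p] for p in p_vars)
        hard.append([r] + [-p for p in p_vars])
        soft.append(([r], 1))                     # one soft clause per value
    return hard, soft

hard, soft = distinct_clauses({"Checking": [[1], [2]], "Saving": [[3], [4], [5]]})
print(len(hard), "hard clauses;", soft)
```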
Example 4.4.
Consider the following aggregation query on the database instance from Table 1:
SELECT COUNT(DISTINCT ACC.TYPE) FROM ACC
We have that D = {Checking, Saving}. Let us denote these two answers by d1 and d2, respectively. Since every witness to the query consists of a single fact, every pW-variable is equivalent to the negation of a single fact variable. As a result, it is unnecessary to introduce any pW-variables at all. Thus, we construct the following clauses from Reduction 4.1 and Algorithm 1:
• α-clauses: … ; β-clauses: … ;
• hard clauses for d1: … ;
• hard clauses for d2: … ;
• γ-clauses: … .
The maximum and minimum satisfying assignments to the PMaxSAT and PMinSAT instances constructed using these clauses each falsify both γ-clauses, since the range consistent answer to q* is the interval [2, 2].
4.2. Answering Queries with MIN and MAX
Let R be a database schema with one key constraint per relation, and let q* be the aggregation query

SELECT f(A) FROM R(q),

where f is one of the operators MIN(A) and MAX(A), and R(q) is a relation expressed as a union of conjunctive queries over R. The semantics of the range consistent answers to aggregation queries with the MIN and MAX operators is similar to that of aggregation queries with the SUM or COUNT operators, but here we need to address one additional special case, which we illustrate using Example 4.5.
Example 4.5.
Consider the database instance from Table 1 and the two aggregation queries q1* and q2* given below.
SELECT SUM(ACCOUNTS.BAL) FROM ACCOUNTS
WHERE ACCOUNTS.CITY = 'SF'
SELECT MIN(ACCOUNTS.BAL) FROM ACCOUNTS
WHERE ACCOUNTS.CITY = 'SF'
It is clear what the range consistent answers to q1* are. The lub-answer of 0 to q1* comes from a repair on which there is no account in the city of SF, so that the SUM function returns 0. For q2*, however, the range consistent answers are unclear, because the MIN function is not defined on the empty set.
In such scenarios, various semantics can be considered. One natural semantics is that if there exists a repair on which the underlying conjunctive query evaluates to the empty set, then there is no consistent answer to the aggregation query. Another is to return the interval [glb, lub] of values that come from the repairs on which the underlying conjunctive query evaluates to a non-empty set of answers, and to additionally return the information about the existence of a repair on which the underlying conjunctive query returns the empty set of answers. The reductions we give in this section can be used regardless of which of the two above-mentioned semantics is chosen.
In what follows, we first show that the glb-answer to an aggregation query with the MIN(A) operator can be computed in time polynomial in the size of the original inconsistent database instance (Proposition 4.6). We then give an iterative SAT-based approach to compute the lub-answer to an aggregation query with the MIN(A) operator. We do not explicitly state methods to obtain the range consistent answers to aggregation queries with the MAX(A) operator, since MAX is the dual of MIN: computing the lub-answer for the MIN(A) operator yields the same result as negating all values of the aggregation attribute in the database, computing the glb-answer for the MAX(A) operator, and negating the result.
Proposition 4.6.
Let R be a database schema, I an R-instance, and q* the aggregation query SELECT MIN(A) FROM R(q). Let q be the underlying union of conjunctive queries and let W be a witness to q on I such that no two facts in W are key-equal, and there is no witness W′ such that valA(W′) < valA(W) and no two facts in W′ are key-equal. Then, valA(W) is the glb-answer to q* on I.
Proof.
For every witness W′ to q on I such that valA(W′) < valA(W), we have that no repair of I contains W′, because W′ contains at least two key-equal facts. Moreover, since no two facts in W are key-equal, there exists a repair J of I such that W ⊆ J. Therefore, valA(W) must be the smallest possible answer to q* on I, i.e., the glb-answer to q* on I. Since the number of witnesses to q is polynomial in the size of I, a desired witness W can be obtained efficiently from the result of evaluating q on I. ∎
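A brief Python sketch of Proposition 4.6 (hypothetical inputs): scan the witnesses, discard those containing two key-equal facts, and return the smallest remaining value.

```python
def glb_min(witnesses, key_of):
    """witnesses: list of (value, facts); key_of(fact) -> (relation, key value).
    Returns the glb-answer for MIN(A), assuming at least one witness survives
    in some repair (the non-empty case of Example 4.5)."""
    surviving = []
    for value, facts in witnesses:
        keys = [key_of(f) for f in facts]
        if len(keys) == len(set(keys)):   # no two facts are key-equal
            surviving.append(value)
    return min(surviving) if surviving else None

# Each fact is (relation, key, payload); witnesses carry their A-value.
ws = [(5, [("R", "a", 1), ("R", "a", 2)]),   # key-equal pair: in no repair
      (7, [("R", "a", 1), ("S", "b", 3)])]
print(glb_min(ws, key_of=lambda f: (f[0], f[1])))  # 7
```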
To compute the range consistent answers to aggregation queries with the MIN(A) and MAX(A) operators, we opt for an iterative SAT-solving approach. In what follows, we formalize the construction of the SAT instance for the first iteration (Construction 4.1) and give Algorithm 2, which computes the lub-answer to q* on I by constructing and solving SAT instances in subsequent iterations.
Construction 4.1.
Given an R-instance I, construct a CNF formula φ as follows. For each fact fi of I, introduce a boolean variable xi. Let K be the set of key-equal groups of facts of I, and let M denote the set of minimal witnesses to the conjunctive query q on I, where q is the underlying query of q*. Assume that the set M = {W1, ..., Wr} is sorted in descending order of the answers, i.e., for i < j, we have that valA(Wi) ≥ valA(Wj).

• For each key-equal group G ∈ K, construct a clause α(G), the disjunction of the variables xi of the facts in G.
• Construct a CNF formula φ consisting of the conjunction of the α-clauses.
Proposition 4.7.
Let q* be an aggregation query with the MIN(A) operator, and let I be a database instance. Algorithm 2 returns the lub-answer to q* on I.
Proof.
The α-clauses of φ make sure that a repair of I can be constructed from every assignment that satisfies the α-clauses, by arbitrarily choosing exactly one fact from each key-equal group G of I among those whose variables are set to true. Let {a1, ..., as} denote the set of distinct answers to the conjunctive query q on I, where a1 is the lub-answer to q on I. For a witness W to q, the clause consisting of the negated variables of the facts in W is satisfied by an assignment if and only if W is not present in any repair constructed from that assignment. At iteration i of the while-loop, when the formula φ is checked for satisfiability (line 8 of Algorithm 2), the formula contains the α-clauses corresponding to the key-equal groups of I in conjunction with all clauses corresponding to the minimal witnesses on which q evaluates to an answer strictly smaller than ai. At this point, if the formula is satisfiable, then there exists a repair J of I on which ai is attained and every answer attained on J is at least ai. On the other hand, if the formula is unsatisfiable, then there exists no such repair of I. Since the clauses are added in the ascending order of the answers, the formula is satisfiable at iteration i if and only if the lub-answer is at least ai. Therefore, if the formula becomes unsatisfiable for the first time at iteration i, it must be the case that the answer of the preceding iteration is the lub-answer to q* on I. ∎
Why not a Binary Search?
In essence, Algorithm 2 works like a linear search over a sorted array. It may sound appealing to perform a binary search instead of the linear search, since the SAT solver would then only have to solve logarithmically many instances of SAT instead of linearly many. The problem with this approach is the following. Observe that, on average, half of the SAT instances that the solver needs to solve in the binary-search approach will be unsatisfiable. In the linear-search approach, however, all but the last instance given to the solver are satisfiable. Typically, the proofs of unsatisfiability produced by SAT solvers are significantly larger than the proofs of satisfiability, since the unsatisfiability of an instance needs to be established with a refutation that can be exponential in the size of the formula, while a single satisfying assignment suffices to prove the satisfiability of an instance. As a result, SAT solvers typically take considerably longer to solve unsatisfiable instances, but they are very quick on most real-world satisfiable instances. Therefore, in practice, the linear search often works better than the binary search.
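A simplified Python sketch of the iterative linear search (our rendering of Algorithm 2, again assuming the python-sat package and the non-empty-answer semantics): clauses forbidding the minimal witnesses are added in ascending order of their values, and the last value for which the formula is still satisfiable is the lub-answer for MIN.

```python
from pysat.solvers import Glucose3

key_equal_groups = [[1, 2], [3, 4]]          # alpha-clauses (hypothetical)
# Minimal witnesses as (value, fact variables), sorted by ascending value.
witnesses = [(5, [1, 3]), (7, [2, 3]), (9, [1, 4])]

solver = Glucose3()
for group in key_equal_groups:
    solver.add_clause(group)

lub = None
for value, facts in witnesses:
    if not solver.solve():                   # no repair avoids all smaller values
        break
    lub = value
    solver.add_clause([-x for x in facts])   # forbid this witness from now on
print("lub-answer for MIN:", lub)
solver.delete()
```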
4.3. Answering Queries with Grouping
Let q* be the aggregation query

SELECT g1, ..., gk, f(A) FROM R(q) GROUP BY g1, ..., gk,

where f is one of COUNT(A), COUNT(*), SUM(A), MIN(A), or MAX(A), and R(q) is a relation expressed by a union of conjunctive queries on R. We refer to the attributes g1, ..., gk as the grouping attributes. For aggregation queries with grouping, it does not seem feasible to reduce the computation to a single PMaxSAT or WPMaxSAT instance, because for each group of consistent answers, the glb-answer and the lub-answer may be realized in different repairs of the inconsistent database. To illustrate this, consider the database from Table 1 and an aggregation query q* that groups the customers by city and counts them. Notice that the glb-answers (LA, 2) and (SF, 1) to q* come from two different repairs of the relation CUST. However, the reductions from the preceding section can be used to compute the bounds for each consistent group of answers independently. For a given aggregation query q* with grouping, we first compute the consistent answers to an underlying conjunctive query that returns the grouping attributes. Then, for each such answer, we compute the glb- and lub-answers to the corresponding scalar aggregation query via PMaxSAT or WPMaxSAT solving, as shown in Algorithm 3.
Let I be an inconsistent database instance, and let q* be an aggregation query of the form SELECT g1, ..., gk, f(A) FROM R(q) GROUP BY g1, ..., gk.
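A brute-force Python sketch of the per-group strategy behind Algorithm 3 (hypothetical tuples): a group is reported only if it appears in every repair, and its glb- and lub-answers are taken over all repairs; AggCAvSAT obtains the same bounds with one PMaxSAT/WPMaxSAT pair per consistent group instead of enumerating repairs.

```python
from itertools import product
from collections import defaultdict

# Toy relation CUST(CID, CNAME, CITY) with key CID.
facts = [("C1", "Ann", "LA"), ("C1", "Ann", "SF"),
         ("C2", "Bob", "LA"), ("C3", "Eve", "SF")]

by_key = defaultdict(list)
for f in facts:
    by_key[f[0]].append(f)

# SELECT CITY, COUNT(*) FROM CUST GROUP BY CITY, over every repair.
per_group = defaultdict(list)
repairs = list(product(*by_key.values()))
for repair in repairs:
    counts = defaultdict(int)
    for _, _, city in repair:
        counts[city] += 1
    for city, c in counts.items():
        per_group[city].append(c)

for city, values in per_group.items():
    if len(values) == len(repairs):          # city appears in every repair
        print(city, (min(values), max(values)))
```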
As noted earlier, the bags of witnesses used in the preceding reductions capture the provenance of unions of conjunctive queries in the provenance polynomials model of (Green et al., 2007; Karvounarakis and Green, 2012). In (Amsterdamer et al., 2011), it was shown that a stronger provenance model is needed to express the provenance of aggregation queries, a model that uses a tensor product combining annotations with values. A future direction of research is to investigate whether this stronger provenance model can be used to produce more direct reductions of the range consistent answers to SAT.
5. Beyond Key Constraints
Key constraints and functional dependencies are important special cases of denial constraints (DCs), which are expressible by first-order formulas of the form ∀x̄ ¬(φ(x̄) ∧ β(x̄)), where φ(x̄) is a conjunction of atomic formulas and β(x̄) is a conjunction of expressions of the form (x op y), with each op a built-in predicate, such as =, ≠, <, or >. In words, a denial constraint prohibits a set of tuples that satisfy certain conditions from appearing together in a database instance. If Σ is a fixed finite set of denial constraints and q* is an aggregation query without grouping, then the following problem is in coNP: given a database instance I and a number a, is a the lub-answer (or the glb-answer) to q* on I w.r.t. Σ? This is so because, to check that a is not the lub-answer (or the glb-answer), we can guess a repair J of I and verify that q*(J) > a (or q*(J) < a). In all preceding reductions, the hard clauses constructed from the key-equal groups capture the inconsistency in the database arising from the key violations, and they enforce that every satisfying assignment uniquely corresponds to a repair of the initial inconsistent database instance. Importantly, these clauses are independent of the input query. In what follows, we provide a way to construct clauses that capture the inconsistency arising from the violations of denial constraints. Thus, replacing the clauses constructed from the key-equal groups in the reductions from Section 4 by the ones provided below allows us to compute consistent answers over databases with a fixed finite set of arbitrary denial constraints. The reduction relies on the notions of minimal violations and near-violations to the set of denial constraints, which we introduce next.
Assume that Σ is a set of denial constraints, I is an R-instance, and V is a sub-instance of I. We say that V is a minimal violation to Σ if V ⊭ Σ and, for every set V′ ⊊ V, we have that V′ ⊨ Σ.
Let Σ be a set of denial constraints, I an R-instance, V a sub-instance of I, and f a fact of I. We say that V is a near-violation w.r.t. Σ and f if f ∉ V and V ∪ {f} is a minimal violation to Σ. As a special case, if {f} itself is a minimal violation to Σ, we say that there is exactly one near-violation w.r.t. f, and it is the singleton {f⊥}, where f⊥ is an auxiliary fact.
Let R be a database schema, Σ a fixed finite set of denial constraints on R, q* an aggregation query without grouping, and I an R-instance.
Reduction 5.1.
Given an R-instance I, compute the sets:

(1) V: the set of minimal violations to Σ on I.
(2) NV(f): the set of near-violations to Σ on I w.r.t. each fact f of I.

For each fact fi of I, introduce a boolean variable xi. For the auxiliary fact f⊥, introduce the constant true, and for each near-violation V ∈ NV(fi), introduce a boolean variable y(fi, V). Then:

(1) For each V ∈ V, construct a clause consisting of the negated variables ¬xi of the facts in V.
(2) For each fact fi of I, construct a clause consisting of xi together with the variables y(fi, V) for all V ∈ NV(fi).
(3) For each variable y(fi, V), construct an expression stating that y(fi, V) is equivalent to the conjunction of the variables xj of the facts in V.
(4) Construct the boolean formula φ that is the conjunction of all clauses and expressions constructed in (1), (2), and (3).
Proposition 5.1.
The boolean formula φ constructed using Reduction 5.1 can be transformed into an equivalent CNF-formula whose size is polynomial in the size of I. The satisfying assignments to φ and the repairs of I w.r.t. Σ are in one-to-one correspondence.
Proof.
Let n be the number of facts of I. Let c be the smallest number such that there exists no denial constraint in Σ whose number of database atoms is bigger than c. Also, let m be the smallest number such that there exists no conjunctive query in q whose number of database atoms is bigger than m. Since Σ and q are not part of the input, the quantities c and m are fixed constants. We also have that |V| = O(n^c) and |NV(fi)| = O(n^{c−1}) for each fact fi. The number of x-variables, y-variables, and witness variables in φ is therefore bounded by n, n · O(n^{c−1}), and O(n^m), respectively. The formula φ contains as many clauses of type (1) as there are minimal violations in V, and the length of each such clause does not exceed c. Similarly, there are at most n clauses of type (2), and the length of each of them does not exceed 1 + O(n^{c−1}). The number of expressions of type (3) is precisely the number of y-variables. Every expression of type (3) is of the form y ↔ (xj1 ∧ ⋯ ∧ xjd), where d is the number of facts in the corresponding near-violation, hence d ≤ c − 1. Each such expression can be equivalently written as a constant number of CNF-clauses of constant length. Thus, one can transform φ into an equivalent CNF-formula with size polynomial in the size of I.
For the second part of Proposition 5.1, consider a satisfying assignment τ to φ and construct a database instance J such that fi ∈ J if and only if τ(xi) = 1. The clauses of type (1) assert that no minimal violation to Σ is present in J, i.e., J is a consistent subinstance of I. The clauses of type (2) and the expressions of type (3) encode the condition that, for every fact fi of I, either fi ∈ J or at least one near-violation w.r.t. Σ and fi is contained in J, making sure that J is indeed a repair of I. In the other direction, one can construct a satisfying assignment τ to φ from a repair J of I by setting τ(xi) = 1 if and only if fi ∈ J. ∎
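A brute-force Python sketch of Reduction 5.1 on a toy denial constraint (hypothetical data and variable numbering; the y-equivalences of step (3) are written out as CNF clauses):

```python
from itertools import combinations

# Facts of a relation R(A, B); variable i+1 represents facts[i] being kept.
facts = [("a", 1), ("a", 2), ("b", 1)]
n = len(facts)

def violates(idxs):
    # Toy denial constraint: two distinct facts must not agree on attribute A.
    return any(facts[i][0] == facts[j][0] for i, j in combinations(idxs, 2))

# Minimal violations: violating sets whose proper subsets are all consistent.
minimal = [set(s) for k in range(1, n + 1) for s in combinations(range(n), k)
           if violates(s) and not any(violates(t) for t in combinations(s, k - 1))]

clauses, next_var = [], n + 1
for v in minimal:                            # step (1): forbid each violation
    clauses.append([-(i + 1) for i in v])

for i in range(n):                           # steps (2) and (3): maximality
    y_vars = []
    for v in minimal:
        if i not in v:
            continue
        nv = sorted(v - {i})                 # a near-violation w.r.t. fact i
        y = next_var
        next_var += 1
        y_vars.append(y)
        for j in nv:
            clauses.append([-y, j + 1])      # y -> every fact of nv is kept
        clauses.append([y] + [-(j + 1) for j in nv])   # all of nv kept -> y
    # keep fact i, or justify its absence by a near-violation inside the repair
    clauses.append([i + 1] + y_vars)
print(clauses)
```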
6. Experimental Evaluation
We evaluate the performance of AggCAvSAT over both synthetic and real-world databases. The first set of experiments includes a comparison of AggCAvSAT with an existing SQL-rewriting-based CQA system, namely ConQuer, over synthetically generated TPC-H databases having one key constraint per relation. This set of experiments is divided into two parts, based on the method used to generate the inconsistent database instances. In the first part, we use the DBGen tool from TPC-H and artificially inject inconsistencies into the generated data; in the second part, we employ the PDBench inconsistent database generator from MayBMS (Antova et al., 2008) (see Section 6.1.1 for details). Next, we assess the scalability of AggCAvSAT by varying the size of the database and the amount of inconsistency present in it. Lastly, to evaluate the performance of the reductions from Section 5, we use a real-world Medigap (med, [n.d.]) dataset that has three functional dependencies and one denial constraint. All experiments were carried out on a machine with an Intel Core i7 2.7 GHz processor and 8 GB of RAM, running 64-bit Ubuntu 16.04. We used Microsoft SQL Server 2017 as the underlying DBMS, and the MaxHS v3.2 solver (Davies and Bacchus, 2011) for solving the WPMaxSAT instances. The AggCAvSAT system is implemented in Java 9.04 and its code is open-sourced at the GitHub repository https://github.com/uccross/cavsat under a BSD-style license. Various features of AggCAvSAT, including its graphical user interface, are presented in a short paper in the demonstration track of the 2021 SIGMOD conference (Dixit and Kolaitis, 2021).
6.1. Experiments with TPC-H Data and Queries
6.1.1. Datasets
For the first part of the experiments, the data is generated using the DBGen data generation tool from the TPC-H Consortium. The TPC-H schema comes with exactly one key constraint per relation, which is ideal for comparing AggCAvSAT against ConQuer (Fuxman et al., 2005a; Fuxman and Miller, 2007), the only existing system for computing the range consistent answers to aggregation queries, because ConQuer supports neither more than one key constraint per relation nor classes of integrity constraints broader than keys. The DBGen tool generates consistent data, so we artificially injected inconsistency by overwriting the key attributes of randomly selected tuples with values taken from existing tuples of the same relation (a sketch of this procedure follows Table 2). The sizes of the key-equal groups that violate the key constraints were uniformly distributed between two and seven. The database instances were generated in such a way that every repair had the specified size. We experimented with degrees of inconsistency ranging from 5% up to 35% of the tuples violating a key constraint, and with repair sizes from 500 MB (4.3 million tuples) up to 5 GB (44 million tuples). For the second part, we employed the PDBench database generator from MayBMS (Antova et al., 2008) to generate four inconsistent database instances with varying degrees of inconsistency (see Table 2). In all four instances, the data is generated in such a way that every repair has size 1 GB.
Table 2. Per-relation inconsistency, database and repair sizes, and largest key-equal groups of the four PDBench instances.

| Table | Inst. 1 | Inst. 2 | Inst. 3 | Inst. 4 |
|---|---|---|---|---|
| CUSTOMER | 4.42% | 8.5% | 16.14% | 29.49% |
| LINEITEM | 6.36% | 12.09% | 22.53% | 39.82% |
| NATION | 7.69% | 0% | 7.69% | 7.69% |
| ORDERS | 3.51% | 6.77% | 12.87% | 23.9% |
| PART | 4.93% | 9.33% | 17.66% | 32.16% |
| PARTSUPP | 1.53% | 2.96% | 5.77% | 11.13% |
| REGION | 0% | 0% | 0% | 0% |
| SUPPLIER | 3.69% | 7.44% | 14.11% | 26.51% |
| Overall inconsistency | 5.36% | 10.25% | 19.29% | 34.72% |
| Database size (GB) | 1.04 | 1.07 | 1.14 | 1.3 |
| Repair size (GB) | 1.00 | 1.01 | 1.02 | 1.02 |
| Largest key-equal group | 8 tuples | 16 tuples | 16 tuples | 32 tuples |
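The inconsistency injection described in Section 6.1.1 can be pictured with the following sketch (our simplification, with generic column names, operating on in-memory rows rather than via SQL UPDATE statements): the key attributes of a random sample of rows are overwritten with the key values of other rows of the same relation, creating key-equal groups.

```python
import random

def inject_key_violations(rows, key_cols, fraction, rng=random.Random(0)):
    """Overwrite the key attributes of a random sample of rows with key
    values taken from other rows of the same relation, so that the
    affected rows join existing key-equal groups.

    rows: list of dicts; key_cols: names of the key attributes;
    fraction: target share of key-violating tuples.
    """
    n = len(rows)
    victims = rng.sample(range(n), max(1, int(fraction * n)))
    for i in victims:
        donor = rng.randrange(n)
        while donor == i:                       # copy from a different tuple
            donor = rng.randrange(n)
        for col in key_cols:
            rows[i][col] = rows[donor][col]     # duplicate an existing key
    return rows

table = [{"ACCID": k, "BAL": 100 * k} for k in range(1, 11)]
inject_key_violations(table, key_cols=["ACCID"], fraction=0.2)
```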
6.1.2. Queries
The standard TPC-H specification comes with 22 queries (constructed using the QGen tool). Here, we focus on queries 1, 3, 4, 5, 6, 10, 12, 14, and 19; the other 13 queries have features, such as nested subqueries, left outer joins, and negation, that are beyond the aggregation queries defined in Section 3. In Section 6.1.3, we describe our results for queries without grouping. Since six of the nine queries under consideration contained the GROUP BY clause, we removed it and added appropriate conditions to the WHERE clause based on the original grouping attributes, obtaining queries without grouping. For example, TPC-H query 4 groups by ORDERS.O_ORDERPRIORITY (see Table 5); removing the GROUP BY clause and adding the condition ORDERS.O_ORDERPRIORITY = ‘1-URGENT’ to the WHERE clause yields the scalar query $Q'_4$. We refer to the resulting queries as $Q'_1$, $Q'_3$, $Q'_4$, $Q'_5$, $Q'_6$, $Q'_{10}$, $Q'_{12}$, $Q'_{14}$, and $Q'_{19}$. The definitions of these queries are given in Table 3.
Table 3. TPC-H-inspired aggregation queries without grouping.

| # | Query |
|---|---|
| $Q'_1$ | SELECT SUM(LINEITEM.L_QUANTITY) FROM LINEITEM WHERE LINEITEM.L_SHIPDATE <= dateadd(dd, -90, cast(‘1998-12-01’ as datetime)) AND LINEITEM.L_RETURNFLAG = ‘N’ AND LINEITEM.L_LINESTATUS = ‘F’ |
| $Q'_3$ | SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM CUSTOMER, ORDERS, LINEITEM WHERE CUSTOMER.C_MKTSEGMENT = ‘BUILDING’ AND CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND ORDERS.O_ORDERDATE < ‘1995-03-15’ AND LINEITEM.L_SHIPDATE > ‘1995-03-15’ AND LINEITEM.L_ORDERKEY = 988226 AND ORDERS.O_ORDERDATE = ‘1995-02-01’ AND ORDERS.O_SHIPPRIORITY = 0 |
| $Q'_4$ | SELECT COUNT(*) FROM ORDERS WHERE ORDERS.O_ORDERDATE >= ‘1993-07-01’ AND ORDERS.O_ORDERDATE < dateadd(mm, 3, cast(‘1993-07-01’ as datetime)) AND ORDERS.O_ORDERPRIORITY = ‘1-URGENT’ |
| $Q'_5$ | SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND LINEITEM.L_SUPPKEY = SUPPLIER.S_SUPPKEY AND CUSTOMER.C_NATIONKEY = SUPPLIER.S_NATIONKEY AND SUPPLIER.S_NATIONKEY = NATION.N_NATIONKEY AND NATION.N_REGIONKEY = REGION.R_REGIONKEY AND REGION.R_NAME = ‘ASIA’ AND ORDERS.O_ORDERDATE >= ‘1994-01-01’ AND ORDERS.O_ORDERDATE < DATEADD(YY, 1, cast(‘1994-01-01’ as datetime)) AND NATION.N_NAME = ‘INDIA’ |
| $Q'_6$ | SELECT SUM(LINEITEM.L_EXTENDEDPRICE*LINEITEM.L_DISCOUNT) FROM LINEITEM WHERE LINEITEM.L_SHIPDATE >= ‘1994-01-01’ AND LINEITEM.L_SHIPDATE < dateadd(yy, 1, cast(‘1994-01-01’ as datetime)) AND LINEITEM.L_DISCOUNT BETWEEN .06 - 0.01 AND .06 + 0.01 AND LINEITEM.L_QUANTITY < 24 |
| $Q'_{10}$ | SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM CUSTOMER, ORDERS, LINEITEM, NATION WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND ORDERS.O_ORDERDATE >= ‘1993-10-01’ AND ORDERS.O_ORDERDATE < dateadd(mm, 3, cast(‘1993-10-01’ as datetime)) AND LINEITEM.L_RETURNFLAG = ‘R’ AND CUSTOMER.C_NATIONKEY = NATION.N_NATIONKEY AND CUSTOMER.C_CUSTKEY = 77296 AND CUSTOMER.C_NAME = ‘Customer#000077296’ AND CUSTOMER.C_ACCTBAL = 1250.65 AND CUSTOMER.C_PHONE = ‘12-248-307-9719’ AND NATION.N_NAME = ‘BRAZIL’ |
| $Q'_{12}$ | SELECT COUNT(*) FROM ORDERS, LINEITEM WHERE ORDERS.O_ORDERKEY = LINEITEM.L_ORDERKEY AND LINEITEM.L_SHIPMODE = ‘MAIL’ AND (ORDERS.O_ORDERPRIORITY = ‘1-URGENT’ OR ORDERS.O_ORDERPRIORITY = ‘2-HIGH’) AND LINEITEM.L_COMMITDATE < LINEITEM.L_RECEIPTDATE AND LINEITEM.L_SHIPDATE < LINEITEM.L_COMMITDATE AND LINEITEM.L_RECEIPTDATE >= ‘1994-01-01’ AND LINEITEM.L_RECEIPTDATE < dateadd(mm, 1, cast(‘1995-09-01’ as datetime)) |
| $Q'_{14}$ | SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM LINEITEM, PART WHERE LINEITEM.L_PARTKEY = PART.P_PARTKEY AND LINEITEM.L_SHIPDATE >= ‘1995-09-01’ AND LINEITEM.L_SHIPDATE < dateadd(mm, 1, ‘1995-09-01’) AND PART.P_TYPE LIKE ‘PROMO%’ |
| $Q'_{19}$ | SELECT SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) FROM LINEITEM, PART WHERE (PART.P_PARTKEY = LINEITEM.L_PARTKEY AND PART.P_BRAND = ‘Brand#12’ AND PART.P_CONTAINER IN (‘SM CASE’, ‘SM BOX’, ‘SM PACK’, ‘SM PKG’) AND LINEITEM.L_QUANTITY >= 1 AND LINEITEM.L_QUANTITY <= 1 + 10 AND PART.P_SIZE BETWEEN 1 AND 5 AND LINEITEM.L_SHIPMODE IN (‘AIR’, ‘AIR REG’) AND LINEITEM.L_SHIPINSTRUCT = ‘DELIVER IN PERSON’) OR (PART.P_PARTKEY = LINEITEM.L_PARTKEY AND PART.P_BRAND = ‘Brand#23’ AND PART.P_CONTAINER IN (‘MED BAG’, ‘MED BOX’, ‘MED PKG’, ‘MED PACK’) AND LINEITEM.L_QUANTITY >= 10 AND LINEITEM.L_QUANTITY <= 10 + 10 AND PART.P_SIZE BETWEEN 1 AND 10 AND LINEITEM.L_SHIPMODE IN (‘AIR’, ‘AIR REG’) AND LINEITEM.L_SHIPINSTRUCT = ‘DELIVER IN PERSON’) OR (PART.P_PARTKEY = LINEITEM.L_PARTKEY AND PART.P_BRAND = ‘Brand#34’ AND PART.P_CONTAINER IN (‘LG CASE’, ‘LG BOX’, ‘LG PACK’, ‘LG PKG’) AND LINEITEM.L_QUANTITY >= 20 AND LINEITEM.L_QUANTITY <= 20 + 10 AND PART.P_SIZE BETWEEN 1 AND 15 AND LINEITEM.L_SHIPMODE IN (‘AIR’, ‘AIR REG’) AND LINEITEM.L_SHIPINSTRUCT = ‘DELIVER IN PERSON’) |
6.1.3. Results on Queries without Grouping
In the first set of experiments, we computed the range consistent answers of the TPC-H-inspired aggregation queries without grouping via WPMaxSAT solving, over a database instance with 10% inconsistency whose repairs have size 1 GB (8 million tuples). Figure 1 shows that most of AggCAvSAT's evaluation time is spent encoding the CQA instance into a WPMaxSAT instance, while the solver takes comparatively little time to compute the optimal solution. Note that $Q'_{19}$ is not in the class $C_{forest}$, so ConQuer cannot compute its range consistent answers. AggCAvSAT performs better than ConQuer on seven of the remaining eight queries.
Next, we compared the performance of AggCAvSAT and ConQuer on the database instances generated using PDBench. Figure 2 shows that AggCAvSAT performs better than ConQuer on PDBench instances with low inconsistency. As the inconsistency increases, the WPMaxSAT solver requires considerably longer to compute the optimal solutions, especially for the queries that produce the largest CNF formulas. One reason is that the key-equal groups in PDBench instances with higher inconsistency percentages are large, which translates into long clauses in the WPMaxSAT instances, so the solver must work harder to solve them. Moreover, Kügel's reduction (Kügel, 2012) from WPMinSAT to WPMaxSAT significantly increases the size of the CNF formula, resulting in longer times for the lub-answers to the queries.
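To illustrate where this size increase comes from, the following is a sketch of a MinSAT-to-MaxSAT transformation in the spirit of the natural encoding (our rendering, not necessarily Kügel's exact construction): each soft clause receives a fresh auxiliary variable and one hard clause per literal, so the formula grows by that many clauses.

```python
def minsat_to_maxsat(n_vars, hard, soft):
    """Turn a weighted partial MinSAT instance into a WPMaxSAT instance.

    Each soft clause C gets a fresh variable b with hard clauses
    (l -> b) for every literal l of C, plus the soft unit clause (-b).
    At an optimum, b is true exactly when C is satisfied, so minimizing
    the weight of satisfied soft clauses equals the total soft weight
    minus the MaxSAT optimum.  Note the blow-up: one new variable and
    |C| extra hard clauses per soft clause.
    """
    new_hard, new_soft, next_var = list(hard), [], n_vars
    for w, clause in soft:
        next_var += 1
        b = next_var
        for lit in clause:
            new_hard.append([-lit, b])          # lit true forces b true
        new_soft.append((w, [-b]))              # reward b being false
    return next_var, new_hard, new_soft
```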
Table 4(a). Average number of variables (in thousands) vs. inconsistency, for the three queries that exhibited the largest CNF formulas (repair size 1 GB).

| 5% | 15% | 25% | 35% |
|---|---|---|---|
| 10.2 | 34.3 | 60.6 | 95.3 |
| 28.4 | 96.2 | 175.0 | 271.2 |
| 6.4 | 21.1 | 40.6 | 62.3 |

Table 4(b). Average number of clauses (in thousands) vs. inconsistency, for the same three queries.

| 5% | 15% | 25% | 35% |
|---|---|---|---|
| 27.6 | 92.2 | 163.6 | 258.1 |
| 76.8 | 259.9 | 472.9 | 734.0 |
| 15.6 | 51.9 | 101.0 | 156.6 |

Table 4(c). Average number of variables (in thousands) vs. database size (10% inconsistency).

| 1 GB | 3 GB | 5 GB |
|---|---|---|
| 21.3 | 44.1 | 105.6 |
| 60.9 | 127.13 | 304.4 |
| 13.9 | 32.9 | 67.7 |

Table 4(d). Average number of clauses (in thousands) vs. database size (10% inconsistency).

| 1 GB | 3 GB | 5 GB |
|---|---|---|
| 57.7 | 104.1 | 258.8 |
| 165.3 | 300.7 | 823.1 |
| 34.0 | 73.7 | 166.6 |
Next, we varied the inconsistency in the database instances created using the DBGen-based data generator while keeping the size of the database repairs constant at 1 GB. Figure 3 shows that the evaluation time of AggCAvSAT stays well under ten seconds (except for $Q'_1$), even as the inconsistency in the data grows. Tables 4(a) and 4(b) show the average size of the CNF formulas for the three queries that exhibited the largest CNF formulas. The size of the formulas grows nearly linearly with the inconsistency present in the data. The CNF formulas for $Q'_1$ are significantly larger than those of the other queries, since $Q'_1$ has high selectivity and is posed against the single largest relation, LINEITEM, which has over 8.2 million tuples in an instance with 35% inconsistency. This also explains why AggCAvSAT takes more time to compute its range consistent answers (Figure 3). In database instances with low inconsistency, the consistent answers to the queries having low selectivity (e.g., $Q'_3$ and $Q'_{10}$) are sometimes contained in the consistent part of the data, and AggCAvSAT does not need to construct a WPMaxSAT instance at all.
We then evaluated AggCAvSAT's scalability by increasing the sizes of the databases while keeping the inconsistency at a constant 10%. Figure 4 shows that the evaluation time of AggCAvSAT for queries $Q'_1$, $Q'_6$, and $Q'_{12}$ increases faster than that for the other queries. This is because $Q'_1$ and $Q'_6$ are posed against LINEITEM, while $Q'_{12}$ involves a join between LINEITEM and ORDERS, so AggCAvSAT spends more time computing the bags of witnesses to these queries as the size of the database grows. Tables 4(c) and 4(d) show that the size of the CNF formulas grows almost linearly w.r.t. the size of the database. The largest CNF formula consisted of over 304 thousand variables and 823 thousand clauses, and was exhibited by $Q'_1$ on a database of size 5 GB (47 million tuples). The low selectivity of queries such as $Q'_3$ and $Q'_{10}$ resulted in very small CNF formulas, even on large databases.
6.1.4. Results on Queries with Grouping
In this set of experiments, we focus on TPC-H queries 1, 3, 4, 5, 10, and 12 (see Table 5), as queries 6, 14, and 19 do not contain grouping. We evaluated the performance of AggCAvSAT and compared it to ConQuer on a database with 10% inconsistency w.r.t. primary keys and repairs of size 1 GB (Figure 5). AggCAvSAT first computes the consistent answers to the underlying conjunctive query using the reductions from (Dixit and Kolaitis, 2019); these are precisely the consistent groups in the range consistent answers to the aggregation query. For each such group, it then computes the glb-answer and the lub-answer using reductions to WPMaxSAT.
Table 5. TPC-H aggregation queries with grouping and their aggregation operators.

| # | Query | Operator |
|---|---|---|
| $Q_1$ | SELECT LINEITEM.L_RETURNFLAG, LINEITEM.L_LINESTATUS, SUM(LINEITEM.L_QUANTITY) FROM LINEITEM WHERE LINEITEM.L_SHIPDATE <= dateadd(dd, -90, cast(‘1998-12-01’ as datetime)) GROUP BY LINEITEM.L_RETURNFLAG, LINEITEM.L_LINESTATUS | SUM(A) |
| $Q_3$ | SELECT TOP 10 LINEITEM.L_ORDERKEY, SUM(LINEITEM.L_EXTENDEDPRICE), ORDERS.O_ORDERDATE, ORDERS.O_SHIPPRIORITY FROM CUSTOMER, ORDERS, LINEITEM WHERE CUSTOMER.C_MKTSEGMENT = ‘BUILDING’ AND CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND ORDERS.O_ORDERDATE < ‘1995-03-15’ AND LINEITEM.L_SHIPDATE > ‘1995-03-15’ GROUP BY LINEITEM.L_ORDERKEY, ORDERS.O_ORDERDATE, ORDERS.O_SHIPPRIORITY | SUM(A) |
| $Q_4$ | SELECT ORDERS.O_ORDERPRIORITY, COUNT(*) AS O_COUNT FROM ORDERS WHERE ORDERS.O_ORDERDATE >= ‘1993-07-01’ AND ORDERS.O_ORDERDATE < dateadd(mm, 3, cast(‘1993-07-01’ as datetime)) GROUP BY ORDERS.O_ORDERPRIORITY | COUNT(*) |
| $Q_5$ | SELECT NATION.N_NAME, SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) AS REVENUE FROM CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND LINEITEM.L_SUPPKEY = SUPPLIER.S_SUPPKEY AND CUSTOMER.C_NATIONKEY = SUPPLIER.S_NATIONKEY AND SUPPLIER.S_NATIONKEY = NATION.N_NATIONKEY AND NATION.N_REGIONKEY = REGION.R_REGIONKEY AND REGION.R_NAME = ‘ASIA’ AND ORDERS.O_ORDERDATE >= ‘1994-01-01’ AND ORDERS.O_ORDERDATE < DATEADD(YY, 1, cast(‘1994-01-01’ as datetime)) GROUP BY NATION.N_NAME | SUM(A) |
| $Q_{10}$ | SELECT TOP 20 CUSTOMER.C_CUSTKEY, CUSTOMER.C_NAME, SUM(LINEITEM.L_EXTENDEDPRICE*(1-LINEITEM.L_DISCOUNT)) AS REVENUE, CUSTOMER.C_ACCTBAL, NATION.N_NAME, CUSTOMER.C_ADDRESS, CUSTOMER.C_PHONE FROM CUSTOMER, ORDERS, LINEITEM, NATION WHERE CUSTOMER.C_CUSTKEY = ORDERS.O_CUSTKEY AND LINEITEM.L_ORDERKEY = ORDERS.O_ORDERKEY AND ORDERS.O_ORDERDATE >= ‘1993-10-01’ AND ORDERS.O_ORDERDATE < dateadd(mm, 3, cast(‘1993-10-01’ as datetime)) AND LINEITEM.L_RETURNFLAG = ‘R’ AND CUSTOMER.C_NATIONKEY = NATION.N_NATIONKEY GROUP BY CUSTOMER.C_CUSTKEY, CUSTOMER.C_NAME, CUSTOMER.C_ACCTBAL, CUSTOMER.C_PHONE, NATION.N_NAME, CUSTOMER.C_ADDRESS | SUM(A) |
| $Q_{12}$ | SELECT LINEITEM.L_SHIPMODE, COUNT(*) AS HIGH_LINE_COUNT FROM ORDERS, LINEITEM WHERE ORDERS.O_ORDERKEY = LINEITEM.L_ORDERKEY AND LINEITEM.L_SHIPMODE IN (‘MAIL’, ‘SHIP’) AND (ORDERS.O_ORDERPRIORITY = ‘1-URGENT’ OR ORDERS.O_ORDERPRIORITY = ‘2-HIGH’) AND LINEITEM.L_COMMITDATE < LINEITEM.L_RECEIPTDATE AND LINEITEM.L_SHIPDATE < LINEITEM.L_COMMITDATE AND LINEITEM.L_RECEIPTDATE >= ‘1994-01-01’ AND LINEITEM.L_RECEIPTDATE < dateadd(mm, 1, cast(‘1995-09-01’ as datetime)) GROUP BY LINEITEM.L_SHIPMODE | COUNT(*) |
The overhead of computing the range consistent answers to aggregation queries with grouping is higher than that for aggregation queries without grouping, because, for an aggregation query with grouping, AggCAvSAT needs to construct and solve twice as many WPMaxSAT instances as there are consistent groups, i.e., one for the lub-answer and one for the glb-answer per consistent group (see the sketch below). For queries that involve the SELECT TOP construct of SQL, we chose the top consistent groups ordered by the grouping attributes present in the ORDER BY clause of the query. AggCAvSAT computes the range consistent answers to each query in under ten seconds, except for $Q_1$. It took under three seconds to compute the consistent groups of $Q_1$, but over forty seconds to encode the range consistent answers of the groups and over fifteen minutes to solve the corresponding WPMaxSAT instances. This is because some consistent groups have over 3 million tuples, so the WPMaxSAT instances have over 600 thousand variables and over 1.3 million clauses. ConQuer took slightly over two minutes to compute the range consistent answers to $Q_1$. We did not include $Q_1$ in the experiments with larger databases and higher inconsistency.
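Schematically, the per-group computation looks as follows (our sketch, not AggCAvSAT's actual API; `encode_glb`, `encode_lub`, and `solve` are placeholders for the encodings of Section 5 and a call to a MaxSAT solver such as MaxHS):

```python
def range_consistent_answers(consistent_groups, encode_glb, encode_lub, solve):
    """Two WPMaxSAT instances per consistent group, one for the
    glb-answer and one for the lub-answer, so 2 * len(consistent_groups)
    instances in total."""
    answers = []
    for group in consistent_groups:
        glb = solve(encode_glb(group))   # greatest lower bound for the group
        lub = solve(encode_lub(group))   # least upper bound for the group
        answers.append((group, glb, lub))
    return answers
```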
Figure 6 shows the comparison of AggCAvSAT and ConQuer for aggregation queries with grouping on the PDBench instances. On the database instance with the lowest amount of inconsistency, AggCAvSAT beats ConQuer on all queries, but as the inconsistency grows, AggCAvSAT takes longer to encode and solve for the consistent groups of some of the queries.
In Figure 7, we first plot the evaluation time of AggCAvSAT as the percentage of inconsistency in the data grows from 5% to 35% in the instances generated using the DBGen-based data generator. The size of the database repairs is kept constant at 1 GB (8 million tuples).
Since AggCAvSAT constructs and solves many WPMaxSAT instances of varying sizes for an aggregation query with grouping, we also plot the overall number of SAT calls made by the solver in Figure 7; note that the Y-axis has logarithmic scaling in the second plot of Figure 7. There are ten consistent groups in the answers to $Q_3$, and just five and two consistent groups in the answers to $Q_4$ and $Q_{12}$, respectively. In each consistent group, the aggregation operator is applied over a much larger set of tuples in $Q_4$ and $Q_{12}$ than in $Q_3$. As a result, the evaluation time for $Q_{12}$ is high but its number of SAT calls is comparatively small, while AggCAvSAT makes more SAT calls for $Q_3$ and $Q_4$, even though their consistent answers are computed much faster. The query $Q_{10}$ requires a long time to construct and solve the WPMaxSAT instances for its consistent groups, due to its high selectivity and the presence of joins between four relations. The evaluation time for computing the range consistent answers to aggregation queries with grouping increases almost linearly w.r.t. the size of the database when the percentage of inconsistency is constant (Figure 8). The second plot in Figure 8 depicts the number of SAT calls made by the solver as the size of the database grows. Due to their low selectivity, some of the queries are encoded into small CNF formulas even on databases with high inconsistency or of large size, resulting in fast evaluations.
6.1.5. Discussion
The experiments show that AggCAvSAT performs well across a broad range of queries and databases; it performs worse on queries with high selectivity because, in such cases, very large CNF formulas are generated. AggCAvSAT also slows down on databases with a high degree of inconsistency and with large key-equal groups. These are corner cases that should rarely be encountered in real-world databases.
6.2. Experiments with Real-world Data
6.2.1. Dataset
For this set of experiments, we use the schema and the data from Medigap (med, [n.d.]), an openly available real-world database about Medicare supplement insurance in the United States. We combined the data from 2019 and 2020 to obtain a database with over 61K tuples (Table 6(a)). Since we consider two functional dependencies and one denial constraint on the Medigap schema, as shown in Table 6(b), this set of experiments exercises Reduction 5.1. The actual data was already inconsistent, so no additional inconsistency was injected (a sketch for detecting such violations follows Table 6).
Table 6(a). Relations of the Medigap database.

| Relation | Acronym | # of attributes | # of tuples |
|---|---|---|---|
| OrgsByState | OBS | 5 | 3872 |
| PlansByState | PBS | 18 | 21002 |
| PlansByZip | PBZ | 20 | 4748 |
| PlanType | PT | 4 | 2434 |
| Premiums | PR | 7 | 29148 |
| SimplePlanType | SPT | 4 | 70 |
Table 6(b). Integrity constraints on the Medigap schema.

| Type | Constraint Definition | Inconsistency |
|---|---|---|
| FD | OBS (orgID → orgName) | 2.58% |
| FD | PBS (addr, city, abbrev → zip) | 1.5% |
| DC | OBS (webAddr ≠ ‘’) | 0.15% |
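As a side note, inconsistency percentages such as those in Table 6(b) can be reproduced by flagging the tuples that participate in FD violations. The sketch below uses SQLite and toy data; the column names follow Table 6(b) and are assumptions about the actual Medigap layout.

```python
import sqlite3

# Count tuples violating the FD OBS(orgID -> orgName) by flagging orgIDs
# that are mapped to more than one orgName.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OBS (orgID INT, orgName TEXT, webAddr TEXT)")
conn.executemany("INSERT INTO OBS VALUES (?, ?, ?)",
                 [(1, "Acme", "acme.example"),
                  (1, "ACME Inc", "acme.example"),   # conflicts with row 1
                  (2, "Beta", "")])
violations = conn.execute("""
    SELECT COUNT(*) FROM OBS
    WHERE orgID IN (SELECT orgID FROM OBS
                    GROUP BY orgID
                    HAVING COUNT(DISTINCT orgName) > 1)
""").fetchone()[0]
print(violations)   # 2 tuples participate in an FD violation
```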
6.2.2. Queries
We use twelve natural aggregation queries on the Medigap database that involve the aggregation operators COUNT(*), COUNT(), and SUM(). We refer to these queries as $Q_1, \ldots, Q_{12}$, numbered as in Table 7. The first six queries contain no grouping, while the remaining six do. The definitions of these queries are given in Table 7.
Table 7. Aggregation queries on the Medigap database.

| # | Query |
|---|---|
| $Q_1$ | SELECT COUNT(*) FROM OBS WHERE OBS.Name = ‘Continental General Insurance Company’ |
| $Q_2$ | SELECT COUNT(*) FROM PBZ, SPT WHERE PBZ.Description = SPT.Simple_plantype_name AND SPT.Contract_year = 2020 AND SPT.Simple_plantype = ‘B’ |
| $Q_3$ | SELECT SUM(PBZ.Over65) FROM PBZ WHERE PBZ.State_name = ‘Wisconsin’ AND PBZ.County_name = ‘GREEN LAKE’ |
| $Q_4$ | SELECT SUM(PBZ.Community) FROM PBZ WHERE PBZ.State_name = ‘New York’ |
| $Q_5$ | SELECT COUNT(PR.Premium_range) FROM PR |
| $Q_6$ | SELECT COUNT(PR.Premium_range) FROM PT, PR WHERE PT.State_abbrev = PR.State_abbrev AND PT.Plan_type = PR.Plan_type AND PT.Contract_year = PR.Contract_year AND PT.Contract_year = 2020 AND PT.Simple_plantype = ‘K’ |
| $Q_7$ | SELECT SPT.Contract_year, COUNT(*) FROM SPT GROUP BY SPT.Contract_year ORDER BY SPT.Contract_year DESC |
| $Q_8$ | SELECT PBZ.State_name, COUNT(*) FROM PBZ GROUP BY PBZ.State_name |
| $Q_9$ | SELECT PBZ.Zip, SUM(PBZ.Community) FROM PBZ WHERE PBZ.State_name = ‘New York’ GROUP BY PBZ.Zip |
| $Q_{10}$ | SELECT TOP 10 PBS.State_name, SPT.Contract_year, SUM(PBS.Under65) FROM PBS, SPT WHERE SPT.Simple_plantype_name = PBS.Description AND SPT.Simple_plantype = ‘A’ AND SPT.Language_id = 1 GROUP BY PBS.State_name, SPT.Contract_year ORDER BY PBS.State_name |
| $Q_{11}$ | SELECT PR.Age_category, COUNT(PR.Premium_range) FROM PR GROUP BY PR.Age_category ORDER BY PR.Age_category |
| $Q_{12}$ | SELECT TOP 10 PT.Simple_plantype, COUNT(PR.Premium_range) FROM PT, PR WHERE PT.State_abbrev = PR.State_abbrev AND PT.Plan_type = PR.Plan_type AND PT.Contract_year = PR.Contract_year AND PT.Contract_year = 2020 GROUP BY PT.Simple_plantype ORDER BY PT.Simple_plantype |
6.2.3. Results on Real-world Database
In Figure 9, we plot the overall time taken by AggCAvSAT to compute the range consistent answers to the twelve aggregation queries on the Medigap database. Since the Medigap schema has functional dependencies and a denial constraint, the encoding of CQA into WPMaxSAT instances is based on Reduction 5.1. Consequently, the CNF formulas are much larger than the ones produced by Reduction 4.1, resulting in longer encoding times. For all twelve queries, the encoding time is dominated by the time required to compute the near-violations and, hence, the $\theta$-expressions. This part of the encoding time is the same for all queries, whereas the time to compute the witnesses depends on the query. The solver takes a comparatively minuscule amount of time to compute the consistent answers to the underlying conjunctive query. For the queries with grouping ($Q_7$ through $Q_{12}$), the glb-answer and the lub-answer are encoded and then solved for each consistent group, causing higher overhead. The longest evaluation times are taken by queries $Q_{10}$, $Q_{12}$, and $Q_{11}$, since they consist of 10, 10, and 6 consistent groups, respectively.
For these experiments, we did not first compute the range consistent answers from the consistent part of the data. Thus, for all CNF formulas, the number of variables is equal to the number of tuples in the data (about 61K). The number of clauses, however, varies with the query, as shown in Figure 10. The query $Q_5$ has the highest number of clauses, since all tuples of the Premiums relation are minimal witnesses to its underlying conjunctive query.
The clauses arising from the inconsistency in the data can be constructed independently of the clauses arising from the witnesses to the queries. In the near future, we plan to parallelize their computation to improve AggCAvSAT's performance.
7. Concluding Remarks
First, we showed that computing the range consistent answers to an aggregation query involving the SUM(A) operator can be NP-hard, even if the consistent answers to the underlying conjunctive query are SQL-rewritable. We then designed, implemented, and evaluated AggCAvSAT, a SAT-based system for computing the range consistent answers to aggregation queries involving COUNT(A), COUNT(*), SUM(A), and grouping. It is the first system able to handle aggregation queries whose range consistent answers are not SQL-rewritable. Our experimental evaluation showed that AggCAvSAT is not only competitive with systems such as ConQuer, but also scalable. The experiments on the Medigap data showed that AggCAvSAT can handle real-world databases with integrity constraints beyond primary keys. The next step in this investigation is to delineate the complexity of the range consistent answers to aggregation queries with the AVG() operator, and then to enhance AggCAvSAT so that it can compute the range consistent answers of such queries. Finally, we note that the SAT-based methods used here apply to broader classes of SQL queries, such as queries with nested subqueries, as long as the integrity constraints are denial constraints. If broader classes of constraints are considered, such as universal constraints, then computing the consistent answers of even conjunctive queries becomes $\Pi^p_2$-hard (Arming et al., 2016), hence SAT-based methods are not applicable; in that case, Answer Set Programming solvers, such as DLV (Leone et al., 2006) or Potassco (Gebser et al., 2011), have to be used instead of SAT solvers.
Acknowledgments. Dixit was supported by a Baskin School of Engineering Dissertation-Year Fellowship and by the Center for Research in Open Source Software (CROSS) at UC Santa Cruz. Kolaitis was partially supported by NSF Award IIS-1814152.
References
- cad ([n.d.]) [n.d.]. CaDiCaL Simplified Satisfiability Solver. http://fmv.jku.at/cadical/.
- med ([n.d.]) [n.d.]. Medigap database for Medicare health and drug plans. https://www.medicare.gov/download/downloaddb.asp.
- Afrati and Kolaitis (2008) Foto N. Afrati and Phokion G. Kolaitis. 2008. Answering aggregate queries in data exchange. In Proc. of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, June 9-11, 2008, Vancouver, BC, Canada. ACM, 129–138. https://doi.org/10.1145/1376916.1376936
- Amsterdamer et al. (2011) Yael Amsterdamer, Daniel Deutch, and Val Tannen. 2011. Provenance for aggregate queries. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, Maurizio Lenzerini and Thomas Schwentick (Eds.). ACM, 153–164. https://doi.org/10.1145/1989284.1989302
- Antova et al. (2008) Lyublena Antova, Thomas Jansen, Christoph Koch, and Dan Olteanu. 2008. Fast and Simple Relational Processing of Uncertain Data. In 2008 IEEE 24th International Conference on Data Engineering. 983–992. https://doi.org/10.1109/ICDE.2008.4497507
- Arenas et al. (1999) Marcelo Arenas, Leopoldo Bertossi, and Jan Chomicki. 1999. Consistent Query Answers in Inconsistent Databases. In Proc. of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Philadelphia, Pennsylvania, USA) (PODS ’99). ACM, New York, NY, USA, 68–79. https://doi.org/10.1145/303976.303983
- Arenas et al. (2003b) Marcelo Arenas, Leopoldo Bertossi, Jan Chomicki, Xin He, Vijay Raghavan, and Jeremy Spinrad. 2003b. Scalar aggregation in inconsistent databases. Theoretical Computer Science 296, 3 (2003), 405–434. https://doi.org/10.1016/S0304-3975(02)00737-5 Database Theory.
- Arenas et al. (2003a) Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. 2003a. Answer sets for consistent query answering in inconsistent databases. TPLP 3, 4-5 (2003), 393–424. https://doi.org/10.1017/S1471068403001832
- Arming et al. (2016) Sebastian Arming, Reinhard Pichler, and Emanuel Sallinger. 2016. Complexity of Repair Checking and Consistent Query Answering. In 19th International Conference on Database Theory, ICDT 2016, Bordeaux, France, March 15-18, 2016 (LIPIcs), Wim Martens and Thomas Zeume (Eds.), Vol. 48. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 21:1–21:18. https://doi.org/10.4230/LIPIcs.ICDT.2016.21
- Audemard and Simon (2009) Gilles Audemard and Laurent Simon. 2009. Predicting Learnt Clauses Quality in Modern SAT Solvers. In Proc. of the 21st International Jont Conference on Artifical Intelligence (Pasadena, California, USA) (IJCAI’09). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 399–404.
- Barceló and Bertossi (2003) Pablo Barceló and Leopoldo E. Bertossi. 2003. Logic Programs for Querying Inconsistent Databases. In Practical Aspects of Declarative Languages, 5th International Symposium, PADL 2003, New Orleans, LA, USA, January 13-14, 2003, Proc. 208–222. https://doi.org/10.1007/3-540-36388-2_15
- Bertossi (2011) Leopoldo E. Bertossi. 2011. Database Repairing and Consistent Query Answering. Morgan & Claypool Publishers. https://doi.org/10.2200/S00379ED1V01Y201108DTM020
- Biere et al. (2009) A. Biere, M. Heule, H. van Maaren, and T. Walsh. 2009. Handbook of Satisfiability: Volume 185, Frontiers in Artificial Intelligence and Applications. IOS Press, Amsterdam, The Netherlands.
- Chomicki et al. (2004a) Jan Chomicki, Jerzy Marcinkowski, and Slawomir Staworko. 2004a. Computing Consistent Query Answers Using Conflict Hypergraphs. In Proc. of the Thirteenth ACM International Conference on Information and Knowledge Management (Washington, D.C., USA) (CIKM ’04). ACM, New York, NY, USA, 417–426. https://doi.org/10.1145/1031171.1031254
- Chomicki et al. (2004b) Jan Chomicki, Jerzy Marcinkowski, and Slawomir Staworko. 2004b. Hippo: A System for Computing Consistent Answers to a Class of SQL Queries. In Advances in Database Technology - EDBT 2004. Springer Berlin Heidelberg, Berlin, Heidelberg, 841–844.
- Codd (1970) E. F. Codd. 1970. A Relational Model of Data for Large Shared Data Banks. Commun. ACM 13, 6 (1970), 377–387. https://doi.org/10.1145/362384.362685
- Codd (1972) E. F. Codd. 1972. Relational Completeness of Data Base Sublanguages. Research Report / RJ / IBM / San Jose, California RJ987 (1972).
- Davies and Bacchus (2011) Jessica Davies and Fahiem Bacchus. 2011. Solving MAXSAT by Solving a Sequence of Simpler SAT Instances. In Principles and Practice of Constraint Programming – CP 2011. Springer Berlin Heidelberg, Berlin, Heidelberg, 225–239.
- Dixit and Kolaitis (2021) Akhil Dixit and Phokion G. Kolaitis. 2021. CAvSAT: Answering Aggregation Queries over Inconsistent Databases via SAT Solving. In Proc. of the 2021 ACM International Conference on Management of Data (SIGMOD) - Demonstration Track. To appear.
- Dixit and Kolaitis (2019) Akhil A. Dixit and Phokion G. Kolaitis. 2019. A SAT-Based System for Consistent Query Answering. In Theory and Applications of Satisfiability Testing - SAT 2019 - 22nd International Conference, SAT 2019, Lisbon, Portugal, July 9-12, 2019, Proc. (Lecture Notes in Computer Science), Vol. 11628. Springer, 117–135. https://doi.org/10.1007/978-3-030-24258-9_8
- Fuxman (2007) Ariel Fuxman. 2007. Efficient Query Processing over Inconsistent Databases. Ph.D. Dissertation. Department of Computer Science, University of Toronto.
- Fuxman et al. (2005a) Ariel Fuxman, Elham Fazli, and Renée J. Miller. 2005a. ConQuer: Efficient Management of Inconsistent Databases. In Proc. of the 2005 ACM SIGMOD International Conference on Management of Data (Baltimore, Maryland) (SIGMOD ’05). ACM, New York, NY, USA, 155–166. https://doi.org/10.1145/1066157.1066176
- Fuxman et al. (2005b) Ariel Fuxman, Diego Fuxman, and Renée J. Miller. 2005b. ConQuer: A System for Efficient Querying over Inconsistent Databases. In Proc. of the 31st International Conference on Very Large Data Bases (Trondheim, Norway) (VLDB ’05). VLDB Endowment, 1354–1357. http://dl.acm.org/citation.cfm?id=1083592.1083774
- Fuxman and Miller (2007) Ariel Fuxman and Renée J. Miller. 2007. First-order Query Rewriting for Inconsistent Databases. J. Comput. Syst. Sci. 73, 4 (June 2007), 610–635. https://doi.org/10.1016/j.jcss.2006.10.013
- Gebser et al. (2011) Martin Gebser, Benjamin Kaufmann, Roland Kaminski, Max Ostrowski, Torsten Schaub, and Marius Schneider. 2011. Potassco: The Potsdam Answer Set Solving Collection. AI Commun. 24, 2 (2011), 107–124. https://doi.org/10.3233/AIC-2011-0491
- Giannakopoulou et al. (2020a) Stella Giannakopoulou, Manos Karpathiotakis, and Anastasia Ailamaki. 2020a. Cleaning Denial Constraint Violations through Relaxation. In Proc. of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020. ACM, 805–815. https://doi.org/10.1145/3318464.3389775
- Giannakopoulou et al. (2020b) Stella Giannakopoulou, Manos Karpathiotakis, and Anastasia Ailamaki. 2020b. Query-driven Repair of Functional Dependency Violations. In 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020. IEEE, 1886–1889. https://doi.org/10.1109/ICDE48307.2020.00195
- Greco et al. (2003) Gianluigi Greco, Sergio Greco, and Ester Zumpano. 2003. A Logical Framework for Querying and Repairing Inconsistent Databases. IEEE Trans. on Knowl. and Data Eng. 15, 6 (Nov. 2003), 1389–1408. https://doi.org/10.1109/TKDE.2003.1245280
- Green et al. (2007) Todd J. Green, Gregory Karvounarakis, and Val Tannen. 2007. Provenance semirings. In Proceedings of the Twenty-Sixth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 11-13, 2007, Beijing, China, Leonid Libkin (Ed.). ACM, 31–40. https://doi.org/10.1145/1265530.1265535
- Guagliardo and Libkin (2016) Paolo Guagliardo and Leonid Libkin. 2016. Making SQL Queries Correct on Incomplete Databases: A Feasibility Study. In Proc. of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016. ACM, 211–223. https://doi.org/10.1145/2902251.2902297
- Heule et al. (2016) Marijn J. H. Heule, Oliver Kullmann, and Victor W. Marek. 2016. Solving and Verifying the Boolean Pythagorean Triples Problem via Cube-and-Conquer. In Theory and Applications of Satisfiability Testing - SAT 2016 - 19th Intern. Conference, Bordeaux, France, July 5-8, 2016, Proc. (Lecture Notes in Computer Science), Vol. 9710. Springer, 228–245. https://doi.org/10.1007/978-3-319-40970-2_15
- Karp (1972) Richard M. Karp. 1972. Reducibility among Combinatorial Problems. Springer US, Boston, MA, 85–103. https://doi.org/10.1007/978-1-4684-2001-2_9
- Karvounarakis and Green (2012) Grigoris Karvounarakis and Todd J. Green. 2012. Semiring-annotated data: queries and provenance? SIGMOD Rec. 41, 3 (2012), 5–14. https://doi.org/10.1145/2380776.2380778
- Kolaitis and Pema (2012) Phokion G. Kolaitis and Enela Pema. 2012. A Dichotomy in the Complexity of Consistent Query Answering for Queries with Two Atoms. Inf. Process. Lett. 112, 3 (Jan. 2012), 77–85. https://doi.org/10.1016/j.ipl.2011.10.018
- Kolaitis et al. (2013) Phokion G. Kolaitis, Enela Pema, and Wang-Chiew Tan. 2013. Efficient Querying of Inconsistent Databases with Binary Integer Programming. PVLDB 6, 6 (2013), 397–408. https://doi.org/10.14778/2536336.2536341
- Kostylev and Reutter (2015) Egor V. Kostylev and Juan L. Reutter. 2015. Complexity of answering counting aggregate queries over DL-Lite. J. Web Semant. 33 (2015), 94–111. https://doi.org/10.1016/j.websem.2015.05.003
- Koutris and Wijsen (2016) Paraschos Koutris and Jef Wijsen. 2016. Consistent Query Answering for Primary Keys. SIGMOD Rec. 45, 1 (June 2016), 15–22. https://doi.org/10.1145/2949741.2949746
- Koutris and Wijsen (2017) Paraschos Koutris and Jef Wijsen. 2017. Consistent Query Answering for Self-Join-Free Conjunctive Queries Under Primary Key Constraints. ACM Trans. Database Syst. 42, 2, Article 9 (June 2017), 45 pages. https://doi.org/10.1145/3068334
- Kügel (2012) Adrian Kügel. 2012. Natural Max-SAT Encoding of Min-SAT. In Revised Selected Papers of the 6th International Conference on Learning and Intelligent Optimization - Volume 7219 (Paris, France) (LION 6). Springer-Verlag, Berlin, Heidelberg, 431–436.
- Leone et al. (2006) Nicola Leone, Gerald Pfeifer, Wolfgang Faber, Thomas Eiter, Georg Gottlob, Simona Perri, and Francesco Scarcello. 2006. The DLV system for knowledge representation and reasoning. ACM Trans. Comput. Log. 7, 3 (2006), 499–562. https://doi.org/10.1145/1149114.1149117
- Li et al. (2011) Chu-Min Li, Zhu Zhu, Felip Manyà, and Laurent Simon. 2011. Minimum Satisfiability and Its Applications. In Proc. of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume One (Barcelona, Catalonia, Spain) (IJCAI’11). AAAI Press, 605–610.
- Manna et al. (2011) Marco Manna, Francesco Ricca, and Giorgio Terracina. 2011. Consistent Query Answering via ASP from Different Perspectives: Theory and Practice. CoRR abs/1107.4570 (2011). arXiv:1107.4570 http://arxiv.org/abs/1107.4570
- Marileo and Bertossi (2010) Mónica Caniupán Marileo and Leopoldo E. Bertossi. 2010. The consistency extractor system: Answer set programs for consistent query answering in databases. Data Knowl. Eng. 69, 6 (2010), 545–572. https://doi.org/10.1016/j.datak.2010.01.005
- Oostema et al. (2020) Peter Oostema, Ruben Martins, and Marijn Heule. 2020. Coloring Unit-Distance Strips using SAT. In LPAR 2020: 23rd International Conference on Logic for Programming, Artificial Intelligence and Reasoning, Alicante, Spain, May 22-27, 2020 (EPiC Series in Computing), Vol. 73. EasyChair, 373–389. https://easychair.org/publications/paper/69T4
- Rekatsinas et al. (2017) Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic Data Repairs with Probabilistic Inference. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1190–1201. https://doi.org/10.14778/3137628.3137631
- Vardi (2009) Moshe Y. Vardi. 2009. Symbolic Techniques in Propositional Satisfiability Solving. In Theory and Applications of Satisfiability Testing - SAT 2009. Springer Berlin Heidelberg, Berlin, Heidelberg, 2–3.
- Wijsen (2009) Jef Wijsen. 2009. Consistent Query Answering Under Primary Keys: A Characterization of Tractable Queries. In Proc. of the 12th International Conference on Database Theory (St. Petersburg, Russia) (ICDT ’09). ACM, New York, NY, USA, 42–52. https://doi.org/10.1145/1514894.1514900
- Wijsen (2010a) Jef Wijsen. 2010a. On the First-order Expressibility of Computing Certain Answers to Conjunctive Queries over Uncertain Databases. In Proc. of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (Indianapolis, Indiana, USA) (PODS ’10). ACM, New York, NY, USA, 179–190. https://doi.org/10.1145/1807085.1807111
- Wijsen (2010b) Jef Wijsen. 2010b. A remark on the complexity of consistent conjunctive query answering under primary key violations. Inform. Process. Lett. 110, 21 (2010), 950 – 955. https://doi.org/10.1016/j.ipl.2010.07.021
- Zhu et al. (2012) Zhu Zhu, Chu-Min Li, Felip Manyà, and Josep Argelich. 2012. A New Encoding from MinSAT into MaxSAT. In Principles and Practice of Constraint Programming. Springer Berlin Heidelberg, Berlin, Heidelberg, 455–463.