Data Generation for Testing and Grading SQL Queries

Bikash Chandra Bhupesh Chawda

~{}^{*}

Biplab Kar

~{}^{\S}

K. V. Maheshwara Reddy

~{}^{\#}

Shetal Shah S. Sudarshan
IIT Bombay
{bikash, biplabkar11, kvmahesh12, shetals, sudarsha}@cse.iitb.ac.in, bhchawda@in.irl.com

Abstract

Correctness of SQL queries is usually tested by executing the queries on one or more datasets. Erroneous queries are often the results of small changes or mutations of the correct query. A mutation Q’ of a query Q is killed by a dataset D if Q(D) $\neq$ Q’(D). Earlier work on the XData system showed how to generate datasets that kill all mutations in a class of mutations that included join type and comparison operation mutations.

In this paper, we extend the XData data generation techniques to handle a wider variety of SQL queries and a much larger class of mutations. We have also built a system for grading SQL queries using the datasets generated by XData. We present a study of the effectiveness of the datasets generated by the extended XData approach, using a variety of queries including queries submitted by students as part of a database course. We show that the XData datasets outperform predefined datasets as well as manual grading done earlier by teaching assistants, while also avoiding the drudgery of manual correction. Thus, we believe that our techniques will be of great value to database course instructors and TAs, particularly to those of MOOCs. It will also be valuable to database application developers and testers for testing SQL queries.

Keywords:

Mutation Testing, Test Data Generation

^†^†

~{}^{\#}

Currently working at SAP Labs, India

*

Current working at IBM IRL, India

~{}^{\S}

Currently working at Oracle India Pvt. Ltd.

1 Introduction

Queries written in SQL are used in a variety of different applications. An important part of testing these applications is to test the correctness of SQL queries in these applications. The queries are usually tested using multiple ad hoc test cases provided by the programmer or the tester. Queries are run against these test cases and tested by comparing the results with the intended one manually or by automated test cases. However, this approach involves manual effort in terms of test case generation and also does not ensure whether all the relevant test cases have been covered or not. Formal verification techniques involve comparing a specification with an implementation. However, since SQL queries are themselves specifications and do not contain the implementation, formal verification techniques cannot be applied for testing SQL queries.

A closely related problem is grading SQL queries written by students. Grading SQL queries is usually done by executing the query on small datasets and/or by reading the student query and comparing those with the correct query. Manually created datasets, as well as datasets created in a query independent manner, can be incomplete and are likely to miss errors in queries. Manual reading and comparing of queries is difficult, since students may write queries in a variety of different ways, and is prone to errors as graders are likely to miss subtle mistakes. For example, when required to write the query $Q$ below:

SELECT course.id, department.dept_name FROM
course LEFT OUTER JOIN (SELECT * from department 

WHERE department.budget > 70000) d USING (dept_name);

students often write the query $Q_{s}$ :

SELECT course.id, department.dept_name FROM
course LEFT OUTER JOIN department USING (dept_name) 

WHERE department.budget > 70000;

which looks sufficiently similar for a grader to miss the difference. These queries are not equivalent since they give different results on departments with budget less than 70000.

Mutation testing is a well-known approach for checking the adequacy of test cases for a program mutation:testing . Mutation testing involves generating mutants of the original program by modifying the program in a controlled manner. For SQL queries, we consider that a mutation is a single (syntactically correct) change of the original query; a mutant is the result of one of more mutations on the original query. A dataset kills a mutant if the original query and the mutant give different results on the dataset, allowing us to distinguish between the queries. A test suite consisting of multiple datasets kills a mutant if at least one of the datasets kills the mutant.

Consider the query:

SELECT dept_name, COUNT(DISTINCT id) FROM
course LEFT OUTER JOIN takes
USING(course_id) GROUP BY dept_name

One of the mutants obtained by mutating the join condition of the query is:

SELECT dept_name, COUNT(DISTINCT id) FROM
course INNER JOIN takes
USING(course_id) GROUP BY dept_name

Similarly by mutating the aggregation we get the following mutation:

SELECT dept_name, COUNT(id) FROM
course LEFT OUTER JOIN takes
USING(course_id) GROUP BY dept_name

In this paper, we address the problem of generating datasets that can catch commonly occurring errors in a large class of SQL queries. Queries with common errors can be thought of as mutants of the original query. Our goal is to generate (a relatively small number of) datasets so as to kill a wide variety of query mutations. These datasets can be used in two distinct ways:

a)

To check if a given query is what was intended, a tester manually examines the result of the query on each dataset, and checks if the result is what was intended.
b)

To check if a student query is correct, the results of the student query and a given correct query are compared on each dataset. A difference on any dataset indicates that the student query is erroneous (We note that checking query equivalence is possible in limited special cases but is hard or undecidable in generalineq.equiv ; conjuct.equiv ; bag ).

There has been increased interest in the recent years in test data generation for SQL queries including Tuya:2010 ; Riva:2010 ; SQE ; QEX ; olstonCS09 addresses a similar problem in the context of data-flow programs. Our earlier work on the XData system xdata:icde10 ; xdata:icde11 showed how to generate datasets that can distinguish the correct query from some class of query mutations, including join and comparison operator mutations. However, real life SQL queries have a variety of features and mutations that were not handled in xdata:icde10 ; xdata:icde11 . (Related work is described in detail in Section 12.) A few of the techniques described in this paper were sketched in a short workshop paper xdata:dbtest13 , but details were not presented there.

In Sections 4 to 8 we describe techniques to handle different SQL query features. For each feature, we first discuss techniques to handle data generation for that feature, then describe mutations of these features, and finally present techniques to kill these mutations. In Section 9 we describe techniques for killing new classes of mutations for query features that were handled in our earlier work xdata:icde10 ; xdata:icde11 .

Each data generation technique is designed to handle specific query constructs or specific mutations of the query. We combine these techniques to generate datasets for a complete query, with each dataset targeting a specific type of mutation. One dataset is capable of killing one or more mutations. Specifically, we do not generate any mutants at all. Our goal is to generate datasets to kill mutations and not enumerate the possible mutants. Although the number of mutations may be very large, our approach generates a small number of datasets that can kill a much larger number of mutations.

The contributions of this paper are as follows.

1.

We discuss (in Section 4) how to generate test data and kill mutations for queries involving string predicates such as string comparison and the LIKE predicate, using a string solver we have developed.
2.

We support the NULL values and several mutations that may arise because of the presence of NULLs (Section 5).
3.

For queries containing constraints on aggregated results, we describe (in Section 6) a new algorithm to find the number of tuples that need to be generated for each relation to satisfy the aggregation constraints.
4.

We support test data generation and mutation killing for a large class of nested subqueries (Section 7).
5.

We also support data generation and mutation killing for queries containing set operators (Section 8).
6.

We extend the class of mutations considered to include missing or additional join conditions (Section 9.1), missing or additional group by attributes (Section 9.2), and distinct clause mutations (Section 9.3).
7.

The data types supported include floating point numbers, time and date values. The class of queries is extended to include insert, delete, update and parameterized queries as well as view creation statements (Section 10).
8.

We describe (in Section 11) techniques for grading student queries based on the datasets generated by XData. These techniques can be used for grading, as well as in a learning mode where it can give immediate feedback to students.
9.

In Section 13 we present performance results of our techniques. We generate test data for a number of queries involving constrained aggregation and subqueries on the University database dbconcepts2010 as well as queries of the TPC-H benchmark and show that the datasets generated by XData are able to kill most of the non-equivalent mutations. We also test the effectiveness of our grading tool by using as a benchmark a set of assignments given as part of a database course at IIT Bombay. We show that the datasets generated using our techniques catch more errors than the University datasets, provided with dbconcepts2010 , as well as manual grading by the TAs, on all the queries.

We believe the techniques presented in this paper will be of great value to database application developers and testers for testing real life SQL queries. It will also be valuable to database course instructors and TAs by taking the drudgery out of grading and allow SQL query assignments to be properly checked in MOOC setting, where manual grading is not feasible.

2 Background

In our earlier work on XData xdata:icde11 , we presented techniques for generating test data for killing SQL query mutants; we briefly outline that work below.

2.1 Approach to Data Generation

Given an SQL query $Q$ , XDataxdata:icde11 generates multiple datasets. The first dataset is designed to generate non-empty datasets for $Q$ , wherever feasible, which itself kills several mutations that would generate an empty result on that dataset. Each of the remaining datasets is targeted to kill one or more mutations of the query; i.e. on each dataset the given query returns a result that is different from those returned by each of the mutations targeted by that dataset. The number of possible mutations is very large, but the number of datasets generated to kill these mutations is small.

To generate a particular dataset, XData does the following:

1.

It generates a set of constraint variables, where each tuple in the target dataset is represented by a tuple of constraint variables.
2.

It generates a set of constraints between these variables. For example, selection conditions, join conditions, primary key and foreign key conditions are all mapped to constraints on these variables. Different datasets are designed to catch different mutations; the exact set of constraints generated (as also the set of constraints variables) is different for each dataset, as described shortly.
3.

It then invokes a constraint (SMT) solver smt ¹¹1A constraint solver takes as input a set of constraints and produces a result that satisfies the constraints. to solve the constraints; the solution given by the solver assigns values to each constraint variable, thereby defining a specific dataset.

In order to kill mutations, the goal of XData is to generate datasets that produce different results on the query and its mutation. To produce different results, constraints are added in a manner so as to ensure that the mutation in a node of a query tree is reflected above leading to different results for the query and its mutation. For example consider the following query:

Example 1


SELECT course.course_id, COUNT(DISTINCT takes.id)
FROM course INNER JOIN takes USING(course_id)
WHERE course.credits >= 6

This query has two predicates course INNER JOIN takes USING (course_id) and course.credits >= 6. When generating datasets to kill the mutations of join predicates we need to ensure that $course.credits>=6$ is satisfied for the tuple generated for the course table. In case $course.credits>=6$ is not satisfied, both the query and the mutant could give empty results.

2.2 Mutation Space and Datasets

The mutation space considered consisted of the following

1.

Join Type Mutations: A join type mutations involves replacing one of { INNER, LEFT OUTER, RIGHT OUTER } JOIN with another. Consider the mutation from department INNER JOIN course to department LEFT OUTER JOIN course. In order to kill this mutation, we need to ensure that there is a tuple in department relation that does not satisfy the join condition with any tuple in course relation. The INNER JOIN query would not output that tuple in the department relation while the LEFT OUTER JOIN would.

In SQL, a join query can be specified in a join order independent fashion, with many equivalent join orders for a given query. Hence, the number of join type mutations across all these orders is exponential. From the join conditions specified in the query, XData forms equivalence classes of $<$ relation, attribute $>$ pairs such that elements in the same equivalence class need to be assigned the same value to meet (one or more) join conditions. Using these equivalence classes, XData generates a linear number of datasets to kill join type mutations across all join orderings. If a pair of relations involve multiple join conditions XData nullifies each join condition separately.
2.

Selection Predicate Mutations: For selection conditions XData considers mutations of the relational operator where any occurrence of one of $\{=,<>,<,>,\leq,\geq\}$ is replaced by another. For killing mutations for the selection condition $A_{1}$ relop $A_{2}$ , XData generates 3 datasets (1) $A_{1}>A_{2}$ , (2) $A_{1}<A_{2}$ , and (3) $A_{1}=A_{2}$ . These three datasets kill all non-equivalent mutations from one relop to another relop. These datasets also kill mutations because of missing selection conditions.
3.

Unconstrained Aggregation Mutation: Aggregations at the root of the query tree are not constrained to satisfy any condition. The aggregation function can be mutated among MAX, MIN, SUM, AVG, COUNT and their DISTINCT versions. In order to kill these mutations, a dataset with three tuples is generated; two with the same value (non-zero) and another with a different value in the aggregate column.

2.3 Constraint Generation

We now describe our techniques for constraint generation. Our current implementation uses the CVC3 constraint solver CVC3 . We are working on implementing the constraints in the SMT-LIB format smtlib so that we can potentially use several constraint solvers compatible with SMT-LIB.

In CVC3, text attributes are modeled as enumerated types while numeric attributes are modeled as subtypes of integers or rationals. The data type declarations in CVC3 are as follows. For each attribute of each relation, we specify a set of acceptable values, taken from an input database, as datatypes in CVC3. While the input database is not necessary for data generation, its use makes for improved readability and comprehension of the query results. In case an input database is not specified we get the range from the data type of the corresponding column.

A tuple type is created for each relation, where each element is a constraint variable of the specified type. A relation is represented as an array of constraint variables; the size of the array has to be determined before solving the constraints, and constraints have to be specified for each attribute of each tuple.

Consider an input database which has CS-101, BIO-301, CS-312 and PHY-101 as course_id, and credits is an integer constrained to be between 2 and 10. Then, this translates to the following the declarations in CVC3.

DATATYPE
course_id = CS-101 | BIO-301 | CS-312 | PHY-101 END;
credits:TYPE = SUBTYPE (LAMBDA (x: INT):
        x > 1 AND x < 11);
course_tuple_type:TYPE = [course_id,credits];
course: ARRAY INT OF course_tuple_type;

Tuple attributes are referenced by position, not by name; thus, course[2].0 refers to the value of the first attribute, which is course_id, of the second tuple in course.

To ensure a non-empty result for the query in Example 1, we need a tuple in course which matches a tuple in takes on attribute course_id and where the course.credits >= 6. This is done by creating a tuple for each of the relations and adding the following constraints:
ASSERT course[1].0 = takes[1].1;
ASSERT course[1].1 >= 6;

Primary key constraints are enforced by constraints that ensure that if two tuples match on the primary key, then the values of the remaining attributes for those two tuples should also match. Foreign key constraints are enforced by adding extra tuples that satisfy the foreign key condition. Foreign key constraints for the foreign key from takes.course_id to course.course_id are specified as:

ASSERT FORALL(i: takes_index):
EXISTS (j: course_index): takes[i].1 = course[j].0;

where takes_index and course_index give the index range for the takes and course arrays; takes[i].1 stands for dept_name of the i^th tuple of course. In our example an extra tuple would be generated for course for each tuple in takes, although in this case the first tuple of course itself ensures the foreign key constraint is satisfied for the first tuple of takes.

The above constraints are given to CVC3 which generates satisfying values (assuming the constraints are satisfiable).

As explained earlier in this section, to kill a mutation of the inner join to right outer join, we need a value in course.course_id which does not match any value in takes.course_id. To do so we replace the earlier equality constraint
ASSERT course[1].0 = takes[1].1;
with:
ASSERT NOT EXISTS(i:course_index):
(course[i].0 = takes[1].1);
and generate the required dataset using CVC3. Datasets for killing other mutations are generated similarly.

2.4 Disjunctions

Tuya et al. in Riva:2010 presented techniques for killing mutations in the presence of disjunctions.

For killing a where clause mutation of a query, the mutation should be reflected as a change at the root of the query tree. Consider the $where$ clause $P_{1}$ $or$ $P_{2}$ , where $P_{1}$ and $P_{2}$ are conjuncts of selection conditions. If a condition in $P_{1}$ is mutated, $P_{2}$ should be false so that the change in the condition of $P_{1}$ affects the output of the query. For example, let $P_{1}$ be $(a>50$ AND $b=40)$ . If we mutate the first condition in $P_{1}$ to $a<50$ we need to ensure that $b=40$ is satisfied while $P_{2}$ is not satisfied. If $P_{2}$ is satisfied there would be no change in the output of the query. Although not mentioned in Riva:2010 , the above technique not only kills mutations of atomic selection conditions (such as comparisons) but also kills mutations of conjunction operations to disjunctions and vice versa.

The XData system has been extended to implement the above technique for killing selection predicate mutations in the presence of disjunctions.

3 Queries and Mutations Considered

The class of queries considered by XData now includes

a)

Single block queries with join/outer-join operations and predicates in the where-clause, and optionally aggregate operations, corresponding to select / project / join / outer-join queries in relational algebra, with aggregation operations.
b)

Multi-block queries. Our current implementation can deal with subqueries up to a single level of nesting.
c)

Compound queries with set operators UNION(ALL), INTERSECT(ALL) and EXCEPT(ALL).

In this paper, we remove the following assumptions made in xdata:icde11 :

a)

SQL queries do not contain string comparison or string like operators such as like, ilike, etc.
b)

Aggregations are only present at the top of the query tree, and hence they are not constrained.
c)

SQL queries are single block queries with no nested subqueries.
d)

NULL values are not allowed for attribute values.
e)

Selection predicates are conjunctions of simple conditions of the form expr relop expr.

XData now considers a large class of mutations - join type mutations, comparison operator mutations, aggregation mutations, string mutations, NULL mutations, set operator mutations, join condition mutations, group by attribute mutations and distinct mutations. Of these only join type mutations, comparison operator mutations and aggregation mutations were discussed previously in xdata:icde11 .

We retain the following assumptions

a)

The only database constraints are unique, primary key and foreign key constraints.
b)

Queries do not include numeric functions or expressions other than simple arithmetic expressions.
c)

Join predicates are conjunctions of simple conditions.
d)

No user defined functions are used.

We only consider single mutations in a query when generating test datasets, since the space of mutants is much larger with multiple mutations. It is possible that an erroneous query may contain multiple mistakes; queries with multiple mutations are likely, but not always guaranteed, to be killed by the datasets we generate. Completeness guarantees for our data generation techniques are described in Appendix D.

4 Data Generation for String Constraints

SQL queries can have equality and inequality conditions on strings, and pattern matching conditions using the LIKE operator or its variants.
Consider the SQL query,

SELECT * from student WHERE name LIKE ‘Amol%’
AND name LIKE ‘%Pal’ AND tot_cred > 30

In order to generate the first dataset that produces a non-empty result for this query or to kill mutations of the condition tot_cred > 30, we need to generate a tuple for which attribute name satisfies the LIKE conditions ‘Amol%’ and ‘%Pal’. To generate such a value we need to solve the corresponding string constraints. For killing mutations of the LIKE operators also we need to solve similar string constraints.

Since many constraint solvers, including CVC3, do not support string constraints, we solve the string constraints outside of the solver. We describe the types of string constraints considered in Section 4.1 and our approach to solving string constraints in Section 4.2 We then discuss test data generation for killing mutations involving string operators in Section 4.3. Note that for this to work; there should be no dependence between string and other constraints so that the string constraints can be solved independently of other constraints. For example, for constraints like $length(R.a)>R.b$ , where $R.a$ is a string attribute and $R.b$ is an integer attribute, the condition on $R.a$ cannot be solved independently of constraints on $R.b$ if there are other constraints on $R.a$ and $R.b$ . However, if an integrated constraint solver this restriction does not apply.

4.1 Types of String Constraints Considered

For string comparisons, we consider the following class of string constraints: $S_{1}$ relop constant, and $S_{1}$ relop $S_{2}$ , where $S_{1}$ and $S_{2}$ are string variables, and relop operators are $=,<,\leq,>,\geq,<>$ and case-insensitive equality denoted by $\sim=$ . We support LIKE constraints of the form $S$ likeop pattern, where likeop is one of LIKE, ILIKE (case insensitive like), NOT LIKE and NOT ILIKE. We also support strlen( $S$ ) relop constant where relop is one of $=,<,\leq,>,\geq$ or $<>$ . We do not support constraints of the form $S_{1}$ likeop $S_{2}$ , where both $S_{1}$ and $S_{2}$ are variables.

We support the string functions upper and lower in queries where these functions can be rewritten using one of the operators described above; for example upper( $S$ ) = ‘ABC’ can be rewritten as $S\sim=$ ‘ABC’, and similarly upper( $S$ ) LIKE pattern can be replaced by $S$ ILIKE pattern. We rewrite these conditions as a pre-processing step. Conditions like $upper$ ( $S$ ) = constant or $upper$ ( $S$ ) LIKE pattern, where the constant or pattern contains at least one lower case character, cannot be satisfied. Hence for such conditions we do not change the operators but return an empty dataset. If these functions are used on a constant string, we convert the string to upper or lower according to the function.

4.2 Solving String Constraints

There are several available string solvers that we considered, including Hampi HAMPI , Kaluza KAL , SUSHI SUSHI and Rex REX . However, we found that Hampi and Kaluza were rather slow, and while they handled regular expressions and length constraints, they could not handle constraints such as $S1<S2$ , where both $S1$ and $S2$ are variables. Rex and SUSHI, though much faster, could not handle constraints involving multiple string variables. Hence, we built our own solver which is described in Appendix B. Subsequent to the implementation of our string solver the latest version of CVC (CVC4) has also provided some support for solving string constraints cvc4:string , but it has some limitations currently²²2Although there are some limitations in CVC4 currently; in future we may use CVC4 as an integrated solver for both string constraints and other constraints.. Refer the experimental section in Appendix B for details.

Once the values for string variables are obtained we solve the non-string constraints using CVC3 and get an overall solution as follows: enumeration types are created in CVC3 for string variables, with the enumeration names being the (suitably encoded) strings generated by the string solver. For example, consider a query which has a single string constraint: $S_{1}$ like $`Bio\%^{\prime}$ . Let the string that satisfies the constraint be Biology, then the constraint is specified as

ASSERT(table[index].pos = Biology)

in CVC3, where table[index].pos is the corresponding CVC3 variable of $S_{1}$ . We then add constraints in CVC3 equating each string variable to its corresponding enumeration name, add other non-string constraints as described in Section 2 and invoke CVC3 to get a suitable dataset.

If there are disjunctions in the selection predicate, it is not possible to separate the string constraints since not all string constraints may need to be satisfied.

4.3 Killing String Constraint Mutations

There can be different types of string mutations depending on whether the string condition is a comparison condition or a LIKE condition.

String Comparison Mutation

Consider a string constraint of the form $S1$ relop $S2$ , where $S1$ is a variable (attribute name), $S2$ could be another variable or a constant. We consider mutations of relop where any occurrence of one of $\{=,<>,<,>,\leq,\geq\}$ is replaced by another. Three datasets are enough to kill all the relop mutations. These are the datasets generated for (1) $S1=S2$ (2) $S1>S2$ (3) $S1<S2$ . These datasets will also kill the mutation because of missing string selection mutations. In addition, to kill mutations between $=$ and $\sim=$ , we generate an additional dataset, where $S1<>S2$ , but $S1\sim=S2$ .

LIKE Predicate Mutation

We also consider the mutation of the likeop operators where one of $\{$ LIKE, ILIKE, NOT LIKE, NOT ILIKE $\}$ is mutated to another or the operator is missing. For a condition $S1$ likeop pattern, where $S1$ is an attribute name, the three datasets given below are sufficient to kill all mutations among the LIKE operators:
Dataset 1 satisfying the condition $S1$ LIKE pattern.
Dataset 2 satisfying condition $S1$ ILIKE pattern, but not $S1$ LIKE pattern.
Dataset 3 failing both the LIKE and ILIKE conditions.

Mutation to kill	Dataset
LIKE vs. NOT LIKE	1, 2, 3
LIKE vs. ILIKE	2
LIKE vs. NOT ILIKE	1, 3
NOT LIKE vs. ILIKE	1, 3
NOT LIKE vs. NOT ILIKE	2
ILIKE vs. NOT ILIKE	1, 2, 3
Missing LIKE / ILIKE	3
Missing NOT LIKE / NOT ILIKE	1

Table 1: Dataset required to kill like operator mutations

For example, for the condition S1 LIKE ‘bio_’, the conditions in the three cases would be (1) S1 LIKE ‘bio_’, (2) S1 LIKE ‘BIO_ ’, and (3) S1 LIKE ‘CIO_’.

The targeted mutations and the datasets that kill them are shown in Table 1.

LIKE Pattern Mutations

A common error while using the LIKE operator is the specification of an incorrect pattern in the query, for example, specifying $S_{1}$ LIKE ‘Comp_’ or $S_{1}$ LIKE ‘Com%’ in place of $S_{1}$ LIKE ‘Comp%’. There could be a very large number of such patterns to be considered. We handle mutations that involve ‘_’ in place of ‘%’ and vice versa and also missing ‘_’ or ‘%’. Consider the like predicate to be $S$ likeop P.

•

For killing the mutation of ‘%’ to ‘_’ or for missing ‘%’, we generate separate datasets for each occurrence of the ‘%’ replaced with “__”(two underscores) . The pattern with ‘%’ gives a non-empty result while the mutated patterns will give an empty result on the corresponding datasets if the likeop is LIKE or ILIKE. For NOT LIKE and NOT ILIKE the pattern with ‘%’ gives an empty result while the mutated patterns will give a non-empty result.
•

For killing the mutation of ‘_’ to ‘%’ or for missing ‘_’, we generate separate datasets for each occurrence of ‘_’ with that occurrence of ‘_’ removed. If the likeop is LIKE or ILIKE the original pattern gives an empty result while the mutated patterns give non-empty results on the corresponding dataset. For NOT LIKE and NOT ILIKE the pattern with ‘_’ gives a non-empty result while the mutated patterns will give an empty result.

5 Handling NULLs

In our earlier work xdata:icde11 , we could not handle NULLs. In this section, we discuss how we model NULLs using regular non-NULL values; to the best of our knowledge, none of the SMT solvers supports NULL values with SQL NULL value semantics.

To model NULLs for string attributes, we enumerate a few more values in the enumerated type and designate them NULLs. For example, the domain of course_id is modeled in CVC3 as follows:

DATATYPE course_id = CS190 | CS632 | NULL_course_id_1
| NULL_course_id_2 END;

Here, the first two values are regular values from the domain of course_id, while the last two values are used as NULLs. For numeric values, we model NULLs as any integer in a range of negative values that are not part of the given domain of that numeric value.

Next, we define a function which identifies which values are NULL values and which are not. This function is syntactic sugar for dealing with NULLs cleanly and is defined per domain to identify the NULLs in that particular domain. In addition to specifying which values are NULLs, we also explicitly need to state that the other values are NOT NULL. Otherwise, the solver may choose to treat a NON-NULL value as a NULL value. Following is an example of the function in CVC3:

ISNULL_COURSE_ID : COURSE_ID -> BOOLEAN;
ASSERT NOT ISNULL_COURSE_ID(CS190);
ASSERT NOT ISNULL_COURSE_ID(CS632);
ASSERT ISNULL_COURSE_ID(NULL_crse_id_1);
ASSERT ISNULL_COURSE_ID(NULL_crse_id_2);

We also need to enforce another property of nulls, namely, that nulls are not comparable. To do so, we choose different NULL values for different constraint variables that may potentially be assigned a null value, thus implicitly enforcing an inequality between them.

The capability to generate NULLs enables us to handle nullable foreign keys, selection conditions involving IS NULL checks and kill mutations of COUNT to COUNT(*).

5.1 Nullable Foreign Keys

If a foreign key attribute $fk$ , is nullable then the foreign key constraint is encoded in the SMT solver by forcing values of $fk$ to be either values from the corresponding primary key values or NULL values; this allows the SMT solver to assign NULLS to foreign keys if required. Nullable foreign keys allow us to kill more mutants than is possible if the foreign key attribute as not nullable. (Our implementation handles multi-attribute foreign and primary keys.)

5.2 IS NULL / NOT IS NULL Clause

If the query contains a condition $R.a$ IS NULL, we explicitly assign (a different) NULL to attribute $a$ for each tuple $R[i]$ if the query contains only inner joins or only a single relation (provided the attribute is nullable; attributes declared as primary key or as not null cannot be assigned a NULL value).

However in case the query contains an outer join there may be multiple ways to ensure that an attribute has NULL value. Let us consider the join condition E1 $\mathrel{\hbox{${{}_{\raise 3.0pt\hbox{--}}^{\hbox{--}}}$}\hskip-4.41017pt\Join}$ E2. If the IS NULL condition is on an attribute of E1 we need to ensure that the value of that attribute is NULL. If the IS NULL condition is on an attribute on E2 we need to ensure that either (a) that attribute is NULL (which may not be possible if E1 is a relation and the attribute is not nullable) or (b) for that tuple in E1 there does not exist any matching tuple in E2; this can be done by a minor change in the algorithm to handle NOT EXISTS subqueries as described in Section 7.1 (Algorithm 1). We omit details for brevity.

We consider mutation from IS NULL to NOT IS NULL. The first dataset (the one that generates non-empty results on the original query) kills the mutation of IS NULL to NOT IS NULL if the IS NULL condition is present in the form of conjunctions with other conditions. In the presence of disjunctions, we generate a dataset such that the IS NULL condition is satisfied while the conditions present in disjunction with the IS NULL condition are not satisfied. If the query contains an IS NULL then the dataset will give a non-empty result whereas the NOT IS NULL mutant will generate an empty result and vice versa. We also consider the mutation where the mutant query does not contain the IS NULL condition In order to kill this mutation we generate a tuple with the IS NULL condition being replaced by NOT IS NULL (with the conditions present in disjunction with the IS NULL not being satisfied). The original query gives an empty result while the mutant gives a non-empty result.

If the query contains the condition NOT IS NULL the corresponding mutations can be killed in a similar manner.

5.3 NULLs and COUNT(*)

To kill the mutation from COUNT(attr) to COUNT(*), where attr is a set of attributes, we create a dataset such that all tuples in a group have attr as NULL (provided all attributes in attr are nullable and none of them is forced to be non-nullable by selection or join conditions). COUNT(attr) gives a count of 0 while COUNT(*) gives a count of equal to the total number of tuples.

In order to kill mutations of COUNT(*) to COUNT (attr), for any set of attributes attr, we create a dataset such that all nullable columns (columns that can be assigned NULL values and do not have conditions that force them to be not NULL) have NULL values. If any attribute in attr is not nullable, COUNT(*) and COUNT(attr) are equivalent mutations.

6 Constrained Aggregation

In xdata:icde11 we considered aggregates which did not have any constraints on the aggregation result e.g. via a HAVING clause, or in an enclosing SQL query of a subquery with aggregation. In this section, we discuss techniques for data generation for queries which have constrained aggregation. We assume that each aggregate is on a single attribute, not on multiple attributes or expressions. We also assume that aggregation constraints do not involve disjunctions.

Consider the HAVING clause constraint, $SUM(r.a)$ $>$ $20$ . In case the domain of $r.a$ is restricted to [0,5] it is not possible to generate a single tuple for $r$ such that the aggregation constraint is satisfied. Most constraint solvers including CVC3 do not support a relation type where the number of tuples may be left unspecified. Some solvers like Alloy alloy do support a relation type. However, there are other limitations to using Alloy since it is very slow and supports only the integer datatype. We model relations as arrays of tuples with a predefined number of tuples in each relation; such aggregation constraints cannot be translated into SMT solver constraints leaving the number of tuples unspecified. Hence, before generating SMT solver constraints we must (a) estimate the number of tuples $n$ , required to satisfy the aggregation constraints, and (b) in case the input to the aggregate is a join of multiple base relations, translate this number $n$ to appropriate number of tuples for each base relation so that the join result contains exactly $n$ tuples.

In Section 6.1 we discuss how to estimate the number of tuples to satisfy an aggregation constraint. We discuss data generation for constrained aggregation on a single relation in Section 6.2 and for join results in Section 6.3.

6.1 Estimating Number of Tuples per Group

We now consider how to estimate the number of values (tuples), $n$ , needed to satisfy aggregation constraints. For each attribute, $A$ , on which there are aggregate constraints we consider the following for estimating $n$ .

1.
Aggregation Properties: Constraint variables sum_A, min_A, max_A, avg_A, count_A respectively correspond to the results of aggregation operators SUM, MIN, MAX, AVG and COUNT on attribute $A$ . Note that count_A also indicates the number of tuples at the input to the aggregation. We add the following conditions
- •
  
  Since the value of each tuple cannot be less than min_A and greater than max_A, it follows that min_A $*$ count_A $\leq$ sum_A $\leq$ max_A $*$ count_A.
- •
  
  If the domain of $A$ is integer and $A$ is unique,
  min_A + (min_A + 1) + … + (min_A + $count_{A}$ - 1) $\leq$ $sum_{A}$ $\leq$ ( $max_{A}$ - $count_{A}$ + 1) + ( $max_{A}$ - $count_{A}$ + 2) + … + (max_A - $count_{A}$ + ( $count_{A}$ - 1)) +( $max_{A}$ ). We use the simplified form of the above expression for constraint generation.
- •
  
  $($ $avg_{A}$ $*$ $count_{A}$ $)=$ sum_A
2.
Domain Constraints: Constraint variables dmin_A, dmax_A correspond to the minimum and maximum value in the domain of attribute $A$ . We add the following constraints
- •
  
  dmin_A $\leq$ min_A $\leq$ max_A $\leq$ dmax_A. This constraint states that min_A cannot be less than the domain minimum or greater than max_A.
3.

Aggregation Constraints: Aggregation constraints specified by the query (e.g. sum_A $\leq$ $10$ ).
4.

Selection Conditions: If the query contains non-aggregate constraints on any attribute $A$ , we add these to the tuple estimation constraints. For example, consider the query,
```
SELECT dept_name,SUM(credits) FROM
course INNER JOIN dept USING (dept_name)
WHERE credits <= 4 GROUP BY dept_name
HAVING SUM(credits) < 13
```
Here, because credits column has a selection condition on it, its limit is constrained. Hence, max_credits $\leq 4$ is also added to the list of constraints above.

The solver returns a value for the count which satisfies all the constraints above, but the value may not be the minimum. Since we are interested in small datasets, we want the count to be as small as possible. Hence, we run CVC3 with the count fixed to different values, ranging from $1$ to MAX_TUPLES and choose the smallest value of the count for which CVC3 gives a valid answer.³³3Since we are interested in small datasets, we set MAX_TUPLES to $32$ in our experiments. We borrow the idea of calculating the number of tuples, using multiple tries, for the aggregation constraint from RQP RQP . However, note that the problem is different here, since, unlike RQP, we do not know the value of the aggregation in the query result. Note that the above procedure works even in case of multiple aggregates on the same column or on different columns.

Heuristic Extensions

The value with which the aggregate is compared to may be a column (i.e. a variable) e.g. HAVING SUM(R.a) relop S.b. This can happen when S.b is a group by attribute or when the constrained aggregation is in a subquery and S.b is a correlation variable from an outer query. For such cases, we replace the column name by a CVC3 variable when estimating the number of tuples. We also add the domain and selection conditions for that column as constraints on the CVC3 variable. The solver then chooses a value for the number of tuples such that the aggregate is satisfied for some value of the variable in its domain.

If the aggregation has a DISTINCT clause we add constraints to make the corresponding aggregated attribute unique.

Handling constraint aggregation in general for these cases is an area of future work.

6.2 Data Generation for Aggregate on a Single Relation

In case the aggregate is on a single relation the number of tuples estimated is assigned to the only relation. For each result tuple generated by an aggregation operator, we create a tuple of constraint variables where group by attributes are equated to the corresponding values in the inputs and aggregation results are replaced by arithmetic expressions. For example, $sum(r.x)$ is replaced by $R[i].x+R[i+1].x+\ldots+R[i+k].x$ , where R[i] to R[i+k] are the tuples assigned for a particular group. We also add constraints to ensure that no two tuples in R[i]…R[i+k] are the same if the relation has a primary key.

The tuple variables created as above can be used for other operations e.g. selection or join that use the aggregation result as inputs.

In case data generation for multiple groups is required we add constraints to ensure that at least one of the GROUP BY attributes is distinct across groups.

Consider the query,

SELECT id, COUNT(*) FROM takes
WHERE grade = ‘A+’ AND year = 2010
GROUP BY id HAVING COUNT(*) < 3

For this query the number of tuples in the group is estimated to be 1. We assign a single tuple to the takes relation and add constraints to ensure that grade for this tuple is ‘A+’ and year is 2010.

Note that if the XData system generates additional tuples for the takes relation (for example because this query is part of a subquery and there may be other instances of the takes relation outside the subquery or takes is referenced by some other relation and we need to generate additional tuples to satisfy foreign key dependencies) the value of COUNT() in the having clause may change and the constrained aggregation may no longer be satisfied. In order to ensure that the HAVING clause is not affected we need to ensure that no other tuple in the takes relation belongs to the same group.

In general, to ensure that the additional tuples generated do not cause problems we add constraints to ensure that for any additional tuple either has a different value for the GROUP BY attribute and hence belongs to a different group or fails at least one of the selection conditions. In the above example, we assert that either the id is different for the additional tuple or $year\neq 2010$ . In practice, the conditions are generated by Algorithm 5 described in Appendix C, which handles the general case of aggregation on join results, to assert these constraints as described in Section 6.3.2.

6.3 Data Generation for Aggregation on JOIN Results

In case the aggregate is on a join result we need to assign tuples to each of the relations such that the join results in the required number of tuples. In this section, we address this issue.

6.3.1 Estimating Number of Tuples per Relation

We assume here that all join conditions are equijoins. The required number of tuples is denoted by $n$ . Consider a query that involves $R_{i}\Join R_{j}\Join R_{k}$ where we need $n$ tuples for a GROUP BY on $A.a$ . Each of the relations need be assigned a specific number of tuples such that the result of the join produces $n$ tuples.

A naive way is to assign $n$ tuples to a relation, $R_{i}$ and assign the same value to all its joining attributes, $\{R_{i}.a,R_{i}.b\}$ . For relations joining with $R_{i}$ only a single tuple is assigned and the joining attribute(s) are assigned the value to the corresponding join attribute of $R_{i}$ . For all other relations also single tuple is assigned and the joining attributes are equated. It is easy to see that this assignment will lead to $n$ tuples in the output. The assignment, however, does not work in case the joining attribute(s) of $R_{i}$ are unique (either due to primary keys or by inference from other primary keys) or multiple values are required for attributes of some other relations (to satisfy the aggregate constraint).

We define the following types of attribute(s) that are used for assigning cardinality to relations.

1.

uniqueElements: Sets of attributes for which no two tuples in a group can have the same value. These sets of attributes are placed in uniqueElements, where uniqueElements[ $R_{i}$ ] contains sets of unique elements of relation $R_{i}$ . If a relation $R_{i}$ has unique constraints for (a,b) and (a,c) then uniqueElements[ $R_{i}$ ]={{a,b},{a,c}}.
2.

singleValuedAttributes: The attribute(s) which have the same value across all tuples in a group. These attributes are placed in singleValuedAttributes.

Using the uniqueElements, singleValuedAttributes, join conditions and foreign key conditions for each relation under conditions we estimate the number of tuples for each relation. Details for this are provided in the Appendix A.

6.3.2 Data Generation

After getting the tuple assignment for each relation we add CVC3 constraints to fix the number of tuples in a group to the estimated value. For each join condition, constraints are generated depending on the number of tuples assigned. For example, if both relations $R$ and $S$ have $n$ tuples, the constraint $R[i].x=S[i].y$ is generated for all $1\leq i\leq n$ , while if $R$ has $n$ tuples and $S$ has 1 tuple, the constraint $R[i].x=S[1].y$ is generated for all $1\leq i\leq n$ . Constraints variables for the output of the aggregate operator are created as described earlier in Section 6.2. One difference is in handling aggregation for relations that have been assigned one tuple. For example, $sum(r.x)$ is replaced by $R[i].x+R[i+1].x+\ldots+R[i+k].x$ , where R[i] to R[i+k] are the tuples assigned for a particular group, if $R$ has $n$ tuples, otherwise it is replaced by $n*R[i].x$ , where $R[i]$ is the only tuple assigned for a group. Unique constraints are added as pairwise non-equality constraints to ensure that sets of $uniqueElements$ have distinct values.

Constraints to ensure that additional tuples do not alter satisfaction of the aggregate conditions for the group $g$ are generated using Algorithm 5 described in Appendix C. The inputs to the algorithm are (a) $T$ - query tree corresponding to block that contains the constrained aggregate (b) $AT$ - the tuples generated to satisfy the constrained aggregation and (c) $ASel$ - conditions that evaluate each GROUP BY attributes to the corresponding values in $g$ .

Data generation for multiple groups is done by adding constraints to ensure that at least one of the GROUP BY attributes is distinct across groups.

The constraints are then given as input to CVC3, and output of CVC3 gives us the required dataset.

Discussion:

Our tuple assignment techniques always assign either 1 tuple or $n$ tuples to a relation. There could be cases where such an assignment is not possible and a different assignment is required to generate datasets. However, in such an assignment it becomes difficult to assert constraints such that the join of the relations will generate exactly the required number of tuples. Handling tuple assignment for cases where either 1 or $n$ tuples cannot be assigned to all the relations to satisfy the aggregation constraint is an area of future work.

6.4 Constraint Aggregation and Mutant Killing

Techniques for killing aggregation mutations were described in xdata:icde11 (summarized in Section 2.2). A dataset to kill aggregation mutations is generated by creating multiple tuples per group using techniques of constrained aggregation described above. Different mutations of the aggregate operator will produce different values on this dataset. To ensure that the value difference due to aggregate mutation will cause a difference in the constraint aggregate result, we need to ensure that only one of the query or its mutation satisfies the aggregation constraint. For some cases, we have implemented constraints to ensure that there is a difference in the constraint aggregate result. Implementing this in general is an area of future work.

Datasets for killing mutations of comparison operators in aggregation constraint (e.g. having clause) are generated using existing techniques in XData for handling comparison mutations. Killing mutations due to additional and missing group by attributes is discussed in Section 9.2.

7 Where Clause Subqueries

We now consider test data generation for SQL queries involving subqueries. Data generation for subqueries in the FROM clause is discussed in Section 10; in this section we consider data generation and mutation killing for subqueries in the where clause. We initially assume in Section 7.1 that subqueries do not have aggregations. Subqueries with aggregation are discussed in Section 7.2.

7.1 Data Generation for Subqueries Without Aggregation

EXISTS Connective

Consider a query Q with a nested subquery predicate EXISTS( $SQ$ ). To generate a non-empty result for Q we need to ensure that SQ gives a non-empty result. If SQ does not have any correlation variables we treat subquery SQ as a query in itself and add constraints to generate a non-empty dataset for the subquery using our data generation techniques. We then add constraints for Q for predicates other than the subquery. The dataset is then generated based on these constraints.

If SQ has correlation conditions, then for every tuple that is generated for Q, we call a function to generate the constraints for data generation of the subquery, with the correlation variables passed as parameters. The correlation conditions are treated as selections in SQ with the given constraint variables and appropriate constraints are generated for SQ. For example, consider the query
SELECT course_id,title
FROM course INNER JOIN section USING(course_id)
WHERE year = 2010 AND EXISTS (SELECT * FROM prereq
WHERE prereq_id=‘CS-201’ AND
prereq.course_id = course.course_id)

To generate a dataset for the outer query, we generate a single tuple each for the course and section relations. Let the tuples be course[1] and section[1]. We then add constraints to assert section[1].year=2010 and course[1].course_id = section[1].course_id. We pass the correlation variable course[1].course_id as a parameter to the function for generating constraints for the subquery. For this tuple in the outer query block, we generate a tuple in prereq relation, say prereq[1], for which we add constraints to ensure that prereq[1].prereq_id = ‘CS-201’ and prereq[1].course_id = course[1].course_id.

NOT EXISTS Connective

Consider a query Q with a nested subquery predicate NOT EXISTS( $SQ$ ). Here we need to ensure that the number of tuples from SQ is 0.

If SQ has only a single relation R, we add constraints to ensure that every tuple in R fails at least one of the selection conditions. In case, SQ has a join of two or more relations we traverse the tree of SQ, and in a recursive manner add constraints on selections and joins to ensure that no tuple reaches the root of SQ. If the join is an INNER JOIN we need to ensure that there exists no pair of tuples for which the join conditions are satisfied or that one of the inputs to the join is empty. In case the join is LEFT OUTER JOIN, we need to ensure that there is no tuple in the left subtree. Similarly, in case of RIGHT OUTER JOIN we need to ensure that no tuple is projected from the right subquery.

T

= Query tree

0: constraints to ensure no tuple is projected from the subquery

1: constraints

\leftarrow

“ ”

R

T

.ROOT

3: if

R

is a relation then

4: Let the selection conditions on

R

S_{1}

\wedge

S_{2}

\wedge

S_{c}

5: Let the number of tuples in

R

m

6: for i in 1 to m do

7: SC[i]

\leftarrow

NOT(

S_{1}

) OR NOT(

S_{2}

) .. OR NOT(

S_{c}

)

8: end for

9: constraints

\leftarrow

SC[1] AND .. SC[i] AND SC[m]

\forall

i,1\leq i\leq m

10: else if

R

is an aggregate then

11: genConstraintsForNotExists(

R

.CHILD)

12: else if

R

is a LEFT OUTER JOIN then

13: constraints

\leftarrow

genConstraintsForNotExists(

R

.LEFT)

14: else if

R

is a RIGHT OUTER JOIN then

15: constraints

\leftarrow

genConstraintsForNotExists(

R

.RIGHT)

16: else if

R

is an INNER JOIN then

17: JC={}

18: Let the join conditions at R be

J_{1},J_{2}..J_{c}

19: for Each join condition

J_{k}

20: Let

J_{k}

involve relations

R_{1}

and

R_{2}

21: Let the number of tuples in

R_{1}

m

and in

R_{2}

n

22: Let

J_{k}(i,j)

denote the condition corresponding to join of tuples

R_{1}[i]

and

R_{2}[j]

23: for i in 1 to m, j in 1 to n do

24: JC[k]

\leftarrow

JC[k] AND NOT(

J_{k}(i,j)

)

25: end for

26: end for

27: constraints

\leftarrow

JC[1] OR.. JC[k] OR.. JC[c]

\forall k,

1\leq k\leq c

28: constraints

\leftarrow

constraints + “OR” + (genConstraintsForNotExists(

R

.LEFT)) + “OR” + (genConstraintsForNotExists(

R

.RIGHT))

29: end if

30: return constraints

Algorithm 1 : genConstraintsForNotExists

Constraints to ensure that there is no tuple from the NOT EXISTS subquery are added using Algorithm 1. If the subquery contains selections with disjunctions, we may fail to get the selection conditions that involve only $R$ in Step 4 of our algorithm. Our algorithm is currently restricted to NOT EXISTS queries that do not contain any disjunction. At Step 5 we assert negations of the constraints corresponding to the particular selection condition, $S_{i}$ . For example, if $S_{i}$ is a NOT EXISTS subquery we assert constraints corresponding to EXISTS( $S_{i}$ ). Correlation variables in SQ, if present, are treated in the same manner as EXISTS subquery and passed as parameters. Correlation conditions are then treated as selections in Algorithm 1.

IN/NOT IN Connective

We convert subqueries of the IN type to EXISTS type subquery by adding the IN connective as a correlation condition in the WHERE clause of the EXISTS subquery. The same techniques as that of EXISTS are then used. Similarly. subqueries using a NOT IN connective are converted to use the NOT EXISTS connective. For example,
r.a IN (SELECT s.b FROM .. WHERE ..)
is converted to
EXISTS (SELECT s.b FROM .. WHERE .. AND r.a = s.b)

ALL/ANY Connective

Subqueries with ALL and ANY connectives, always appear with one of the comparison operators, for example “ $<$ ALL” or “ $>=$ ANY”. We transform subqueries of the form $relop$ ANY to an EXISTS query with $relop$ condition as a correlation condition in the WHERE clause. Subqueries with $relop$ ALL are transformed to a NOT EXISTS query with a negation of the $relop$ condition as a correlation condition, or either of the correlation variables in the correlation condition as NULL in the WHERE clause. For example,
r.a >ALL (SELECT s.b FROM .. WHERE ..)
is converted to
NOT EXISTS
(SELECT s.b FROM .. WHERE .. AND r.a <= s.b
OR IS NULL(r.a) OR IS NULL(s.b))

Scalar Subqueries

Scalar subqueries are subqueries that return only a single result. We consider scalar subqueries in the where clause which are used in conditions on the form SSQ relop attr/value, where SSQ is a scalar subquery, attr is an attribute from the outer block of query and value is a constant. For scalar subqueries, we generate only a single tuple for the query and assert that the projected attribute satisfies the comparison operator. Correlation conditions, if any, are treated in the same manner as subqueries with the EXISTS connective.

7.2 Data Generation for Subqueries With Aggregation

In this section, we consider subqueries that have aggregation. Constraints involving aggregation can be in the inner query (e.g. HAVING clause) or in outer query (e.g. r.s $<$ (SELECT agg(s.b…)))

Non Scalar Subqueries

The techniques in Section 7.1 can be applied for EXISTS subqueries without constrained aggregation, since we only need to ensure empty / non-empty results for the subquery. For NOT EXISTS Algorithm 1 covers the case of aggregate operators as well.

In case of constrained aggregation in EXISTS subquery (e.g. HAVING clause), we use the techniques described in Section 6 to generate tuples for the subquery; multiple tuples may be generated. In case there is a constrained aggregation in the NOT EXISTS subquery, we assert constraints to ensure that either the constraint aggregation is not satisfied or there is no tuple input to the aggregation constraint.

Subqueries of the IN/NOT IN/ALL/ANY type having an aggregate as the projected attribute can be transformed into EXISTS/NOT EXISTS in a similar manner as shown in Section 7.1. In this case, the projected aggregate is added as a HAVING clause. For example,

r.a NOT IN (SELECT agg(s.b) FROM .. WHERE .. )

is converted to

NOT EXISTS (SELECT agg(s.b) FROM ..
WHERE .. HAVING agg(s.b) = r.a)

The techniques for constrained aggregation in EXISTS/ NOT EXISTS can then be applied.

Scalar Subqueries

Consider the following query involving the relation takes(id, course_id, sec_id, semester, year, grades),

SELECT id FROM takes
WHERE grade < (SELECT MIN(grade)
FROM takes WHERE year = 2010)

To generate datasets for this query we add constraints to generate a tuple, $takes[1]$ for the takes relation in the outer query. The tuple estimation technique for the subquery estimates that one tuple is required to satisfy the comparison operator ( $<MIN(grade)$ ). We add constraints to generate one more tuple, say $takes[2]$ for takes relation corresponding to the subquery and add a constraint to ensure that takes[2].year = 2010 for that tuple. We then add the constraint, takes[1].grade $<$ takes[2].grade to ensure that the grade of the outer query tuple is greater than the grade of the subquery tuple. Since takes[1] does not participate in aggregation we need to ensure that it does not satisfy the conditions of the subquery block. To ensure this, the constraint $takes[1].year<>2010$ is added.

In general, consider a query of the form

SELECT * FROM rel1 JOIN .. WHERE cond1 AND ...
AND attr1 relop (SELECT agg(sqrel1.attr2)
FROM sqrel1 JOIN ... WHERE sqcond1 AND ..)

For such subqueries we need to ensure that the aggregate, agg(sqrel1.attr2) satisfies the condition attr1 relop agg(sqrel1.attr2). In order to do this, we may need to project multiple tuples from the subquery. We use the techniques described in Section 6 to estimate the number of tuples, assign the desired number of tuples to each relation and generate constraints for data generation. In order to ensure that no additional tuple affects the aggregate value, we use the techniques described in Algorithm 5 in Appendix C. The input to the algorithm is the same as described in Section 7.4 below.

Similar to EXISTS subquery, in the presence of correlation conditions, we generate one group of tuples in the subquery for every tuple in the outer query.

7.3 Killing Subquery Connective Mutations

EXISTS/NOT EXISTS, IN/NOT IN Mutation

The dataset generated for the original query will kill the mutation between IN and NOT IN, and between EXISTS and NOT EXISTS if the subquery condition is present in the form of conjunctions with other conditions. In the presence of disjunctions, we generate a dataset such that the subquery condition is satisfied and conditions in disjunction with it are not. The EXISTS clause gives an empty result when NOT EXISTS gives a non-empty result, and vice versa. Similar datasets are generated to kill mutations of IN vs. NOT IN.

Comparison Operator Mutation

For conditions of the form “r.A relop (SSQ)” where SSQ is a scalar subquery, as well as conditions of “r.A relop [ALL/ANY] SQ”, we consider mutations among the different relops. Similar to the approach shown in Section 2.2 we generate data for three cases, with relop replaced by $>$ , $=$ and $<$ .

ANY/ALL Mutation

This mutation involves changing from ANY to ALL or vice versa. Since the ANY subquery has been transformed to EXISTS the mutation from ANY to ALL becomes a double mutation - replacing EXISTS with NOT EXISTS and negating the correlation condition corresponding to the ANY comparison condition. The case for ALL to ANY mutation is similar.

Let the correlation condition added because of transformation of ALL/ANY to EXISTS/NOT EXISTS be $R_{1}.a$ relop $R_{2}.b$ . We generate a dataset with two tuples in the subquery for every tuple in the outer query. We add constraints for relop for one tuple and the negation of relop for the other tuple. The ANY query will produce a non-empty result while the ALL query will produce an empty result.

Missing Subquery Mutation

To kill the mutation of a query with missing EXISTS condition connective we generate a dataset with the EXISTS condition replaced by NOT EXISTS. If the EXISTS condition is missing the mutant query will give a non-empty answer while the original query will give an empty answer. Similarly, for killing mutations with missing subquery connectives in other cases we replace NOT EXISTS with EXISTS, IN with NOT IN and NOT IN with IN.

The datasets generated to kill comparison operator mutation will also kill mutations involving missing scalar/ALL/ANY subqueries. If the subquery is present the original query will give an empty result on at least one of the three datasets while the mutant query will produce a non-empty result on all the three datasets.

7.4 Killing Mutations in a Subquery

We also need to generate test data for killing mutations in subquery blocks. For the EXISTS connective and for scalar subqueries we treat a subquery block as a normal query and generate sets of constraints to kill mutations in the subquery block. For each constraint set, we also add constraints to ensure a non-empty result on the outer query block.

For killing selection (comparison mutations, string mutations, NULL mutations), JOIN, and HAVING clause mutations the techniques described in xdata:icde11 and this paper generate datasets that produce empty result on either the query or the mutant but not both. Thus for these mutations the subquery will satisfy the EXISTS condition or the comparison operator (for scalar subqueries) for either the subquery or its mutation enabling XData to kill the mutation.

Extra tuples may get generated for the subquery if there are relations in the subquery that are repeated in the query or are referenced by other relation through foreign keys. Because of these extra tuples, an empty result may turn into a non-empty result or vice versa. To prevent this, we add constraints using Algorithm 5 described in Appendix C where (a) $T$ - query tree of the subquery (b) $AT$ - tuples created for the subquery (c) $ASel$ - correlation conditions with correlation variables being passed as parameters. The constraints ensure that the extra tuples do not affect the result of the subquery, preventing the extra tuples from turning an empty result into a non-empty result or vice versa.

In case there are disjunctions with the subquery, we add constraints to ensure that other conditions in disjunction with the subquery (e.g. P or EXISTS(Q)) are not satisfied as described in Section 2.4.

Mutations like distinct or aggregation mutation in the project clause of the subquery create equivalent mutants of the query and hence need not be killed.

If the subquery uses the NOT EXISTS connective, we generate the datasets for killing mutations in the subquery treating the NOT EXISTS as an EXISTS conditions. Out of the original query and the mutant, the query that produces empty results on the subquery satisfies the NOT EXISTS conditions and produces non-empty results for the outer query. The query that does not produce empty results does not satisfy the NOT EXISTS condition and produces an empty result in the outer query. Thus, these mutations can be killed.

Subquery connectives IN, NOT IN, ANY and ALL are converted to EXISTS and NOT EXISTS as described earlier. Mutations in the subquery are killed after the conversion.

Dataset	P	Q	UNION	UNION	INTERSECT	INTERSECT	EXCEPT	EXCEPT
				ALL		ALL		ALL
1	$t_{1}$	$t_{1}$	$t_{1}$	$t_{1},t_{1}$	$t_{1}$	$t_{1}$	$\nexists t_{1}$	$\nexists t_{1}$
2	$t_{1}$	$\nexists t_{1}$	$t_{1}$	$t_{1}$	$\nexists t_{1}$	$\nexists t_{1}$	$t_{1}$	$t_{1}$
3	$t_{1}$	$\exists^{>1}t_{1}$	$t_{1}$	$\exists^{>1}t_{1}$	$t_{1}$	$t_{1}$	$\nexists t_{1}$	$\nexists t_{1}$
4	$\nexists t_{1}$	$t_{1}$	$t_{1}$	$t_{1}$	$\nexists t_{1}$	$\nexists t_{1}$	$\nexists t_{1}$	$\nexists t_{1}$
5	$\nexists t_{1}$	$\exists^{>1}t_{1}$	$t_{1}$	$\exists^{>1}t_{1}$	$\nexists t_{1}$	$\nexists t_{1}$	$\nexists t_{1}$	$\nexists t_{1}$
6	$\exists^{>1}t_{1}$	$t_{1}$	$t_{1}$	$\exists^{>1}t_{1}$ (sum)	$t_{1}$	$t_{1}$	$\nexists t_{1}$	$\exists t_{1}$ (diff)
7	$\exists^{>1}t_{1}$	$\nexists t_{1}$	$t_{1}$	$\exists^{>1}t_{1}$	$\nexists t_{1}$	$\nexists t_{1}$	$t_{1}$	$\exists^{>1}t_{1}$
8	$\exists^{>1}t_{1}$	$\exists^{>1}t_{1}$	$t_{1}$	$\exists^{>1}t_{1}$ (sum)	$t_{1}$	$\exists^{>1}t_{1}$ (min)	$\nexists t_{1}$	- (diff)

Table 2: Datasets to kill set operator mutations

8 Set Operators

In this section, we consider data generation and mutation killing for queries that contain one of the following set operators - UNION, UNION ALL, INTERSECT, INTERSECT ALL, EXCEPT, EXCEPT ALL.

8.1 Data generation

Set queries are of the form, P SETOP Q where SETOP is a set operator, and P and Q are queries that may be simple or compound queries themselves.

In order to generate a dataset that produces a non-empty result on this query if the SETOP is UNION(ALL) we add constraints to ensure non-empty results for P or Q or both (P and Q may have conflicting constraints so for both to have non-empty results may not always be possible).

Data generation for INTERSECT(ALL) is done in a similar manner as the EXISTS subquery described in Section 7.1. We treat the query as

SELECT * FROM (P) WHERE EXISTS
    (SELECT * FROM Q WHERE pred)

where predicate pred equates each projected attribute of P to the corresponding attribute of Q. For each tuple in P, we generate a corresponding tuple in Q that satisfies the correlation condition, pred, as described in Section 7.1. Data generation for EXCEPT(ALL) is done in a similar manner using NOT EXISTS instead of EXISTS, using the techniques described earlier for the NOT EXISTS operator.

8.2 Killing Set Operator Mutations

In order to kill the mutations among the different operators (UNION(ALL), INTERSECT(ALL), EXCEPT (ALL)) we generate datasets as described below (summarized in Table 2 along with the results for various set operators).

1.

Generate a dataset that has exactly one tuple $t_{1}$ for P. Add constraints to ensure that one matching tuple exists in Q.
2.

Generate a dataset that has one tuple $t_{1}$ for P. Add constraints to ensure that $t_{1}$ does not exist in Q.
3.

Generate a dataset which has at least two identical tuples $\exists^{>1}t_{1}$ for Q. Add constraints to create one matching tuple $t_{1}$ for P.
4.

Generate a dataset that has one tuple $t_{1}$ for Q. Add constraints to ensure that $t_{1}$ does not exist in P.
5.

Generate a dataset which has at least two identical tuples $\exists^{>1}t_{1}$ for Q. Add constraints to ensure not matching tuples for P.
6.

Generate a dataset that has at least two identical tuples, $\exists^{>1}t_{1}$ for P. Add constraints to ensure that there is exactly one matching tuple $t_{1}$ in Q.
7.

Generate a dataset that has at least two identical tuples, $\exists^{>1}t_{1}$ for P. Add constraints to ensure that $t_{1}$ does not exist in Q.
8.

Generate a dataset that has at least two identical tuples, $\exists^{>1}t_{1}$ for both P and Q.

We call kill a mutation between a pair of set operators if for a dataset the results of the query as shown in Table 2 differ. Note that it may not be possible to generate some datasets because of query/integrity constraints; in particular primary key constraints may prevent generation of datasets with duplicates. It may not be necessary to generate all datasets to kill all mutations. As an optimization we can stop generation of datasets if we have successfully generated at least one of the datasets for killing each of the mutations.

For both P and Q we have three options; either generate no tuple, one tuple or more than one tuple. Table 2 exhaustively covers all combinations (except for the case where both P and Q are empty, since if both P an Q are empty all operators would give an empty result and no mutation would be killed). Hence, these datasets are sufficient to kill all pairs mutations. For example the mutation between INTERSECT and INTERSECT ALL can only be killed if there is more than one matching tuple between P and Q. Dataset 8 covers this case. The only mutation that may be missed is the mutation between EXCEPT ALL and other operators except UNION ALL since for dataset 8, we cannot guarantee whether the result would be $\exists t_{1}$ , $\nexists t_{1}$ or $\exists^{>1}t_{1}$ . Dataset 8 would still be able to kill the mutation between UNION ALL and EXCEPT ALL since UNION ALL would produce more tuples in the result than EXCEPT ALL. Hence, if it is possible to only generate dataset 8, mutations of other operators with EXCEPT ALL may not get killed. In order to provide completeness guarantees for killing mutations involving EXCEPT ALL, we need to generate specific number of tuples for P and Q. This is an area of future work.

To ensure that a tuples of one relation does not exist in the other, constraints are added using the NOT EXISTS technique described in Algorithm 1 of Section 7.1. To ensure that a tuple in one relation exists in the other, we use the EXISTS technique described in Section 7.1.

To create at least two identical tuples in the result of a subquery, we assert constraints to imply that the number of tuples is more than one. Then using the techniques described in Section 6 for constrained aggregation we estimate the required number of tuples for each base relation. We treat the projected attributes in the select clause as the group by attributes in constrained aggregation, which ensures that these have the same value across tuples. Data generation is done using techniques for constrained aggregation described in Section 6.2 and Section 6.3.2.

8.3 Killing Mutations in Input to Set Operators

We also need to kill mutations in the input to the set operator. For this, we need to ensure that the effect of the mutation makes a difference in the result of the set operator.

If the set operator is UNION/UNION ALL and the mutation to the query is in P, we add constraints to ensure that the mutation in P is killed. In addition to ensure that there are no tuples from Q that mask the changes in the result we add constraints similar to NOT EXISTS subquery for Q. Similarly data generation can be done for killing mutations in Q.

We treat INTERSECT and EXCEPT queries as EXISTS and NOT EXISTS respectively as described earlier. Mutations of P can be killed by datasets to kill mutations of the outer query block while the mutations in Q can be killing by killing mutations in the subquery block as described in Section 7.4.

9 Handling Join Condition, Group By Attribute and Distinct Clause Mutations

In this section, we describe our techniques to kill missing or additional joins conditions, group by attributes and DISTINCT keyword. Although our previous work handled joins, group by and distinct clause, these mutations were not considered.

9.1 Missing or Extra Joins Conditions

Consider the tables student (id, name, dept_name), course (course_id, course_name and dept_name) and takes (id, course_id, sec_id, semester, year) from the University schema in dbconcepts2010 . Consider the query,

SELECT course_id,course_name
FROM student INNER JOIN takes ON(id)
INNER JOIN course ON(course_id)
WHERE student.id = 1234

One of the mutations of the query could be because of an additional join condition leading to a mutant query like

SELECT course_id,course_name
FROM student INNER JOIN takes ON(id)
INNER JOIN course ON(course_id, dept_name)
WHERE student.id = 1234

Such errors are common when using natural joins. For example, if natural join was used in place of .. INNER JOIN course ON(course_id) resulting in student.dept_name being equal to course.dept_name.

In order to kill such mutations, we do the following. Let the relations being joined be $R_{i}$ and $R_{j}$ . For every attribute $p\in R_{i}$ such that (a) there is an attribute $q\in R_{j}$ with identical names and (b) there is no join condition involving $p$ and $q$ in the original query, we assert that the values held by the two attributes are not equal. The original query without the join condition would give a non-empty result while the mutation would give an empty result.

Similarly, there could be mutants such that the mutant query contains some missing join conditions. Such mutations can be killed by the datasets that kill join type mutations (INNER / LEFT OUTER / RIGHT OUTER) described in Section 2.2.

9.2 Group By Clause Mutations

In this section, we discuss the mutation of the query due to the presence of additional attributes or absence of some attributes in the group by clause.

9.2.1 Additional Group By Attributes

Consider the following query, $Q$ , to find the number of students taking each course every time it is offered.

SELECT count(id), course_id, semester, year
FROM takes GROUP BY course_id, semester, year

Additional attributes included in the group by clause such as section as shown in the query, $Q_{s}$ , below, could result in an erroneous query.

SELECT count(id), course_id, semester, year
FROM takes GROUP BY course_id, semester, year, section

To catch such mutations, we generate a dataset for each possible additional group by attribute, with more than one tuple in the group, such that the additional attribute (section in this case) has different values for different tuples in the group. This ensures that the incorrect query produces multiple groups while the correct one produces only a single group, thereby killing the mutation. Note that because of some selection conditions resulting in attributes being single-valued, functional dependencies on group by attributes and equality conditions on group by attributes some of the mutations with additional GROUP BY may be equivalent to the original query. We do not consider such attributes.

There are situations where the above approach would not work e.g. if the group by is in an EXISTS subquery. The EXISTS condition is satisfied regardless of one or two groups being present. In such a case if there is no constrained aggregation the mutation would be equivalent but if there are aggregation constraints the mutation may not be equivalent and needs to be killed.

If the group has an aggregation that is constrained, e.g., $SUM(a)>20$ or $SUM(b)\leq 30$ the number of tuples is assigned based on the aggregation constraint. We try to ensure that the data generated is such that the aggregation constraints of one of the queries, i.e., either of the original query or of its mutant are satisfied, resulting in a non-empty result on either the original query or its mutation but not both, hence killing the mutation.

Let the group by attributes be $G$ . For each possible additional group by attribute, $g_{i}$ , we generate up to 2 corresponding datasets. In our first attempt, we try to generate two separate groups, which agree on $G$ but differ in $g_{i}$ , such that each group (when grouped by $G,g_{i}$ ) satisfies the aggregation constraints, but the group containing the union of these tuples (i.e., group by $G$ ) does not. Note that this may not be possible in case the aggregate is of the form SUM(x) $>number$ for values in the positive range or COUNT(x) $>number$ etc. Hence, we also try to generate a dataset such that the combined group satisfies the aggregate but the individual groups do not. If either succeeds, the mutation will be killed.

9.2.2 Missing Group By Attributes

Another common error is to miss specifying some of the group by attributes. For example, if one misses specifying the attribute, semester in the GROUP BY clause but query $Q$ then the resultant query is clearly erroneous. Such erroneous queries can be easily detected if the number of attributes projected out is different.

However, that may not be the case for all queries where a group by attribute has been missed. For instance, in the above example, if semester was not in the projection list, the missing group by mutation would be harder to catch. Although rare, we have found such cases when the group by is in a subquery whose result is an aggregation tuple.

We generate datasets to kill such mutations as follows: Let $g_{1},g_{2},...g_{n}$ be the group by attributes. For missing group by attribute, $g_{i}$ , we treat the original query as the one with the missing group by attribute and its mutation with the additional group by attribute as the original query. Datasets can be generated using the techniques for killing mutations of additional group by attributes.

9.3 Distinct Clause Mutations

Users may erroneously omit the DISTINCT keyword in the projection list of a select clause. For example, consider the following query from dbconcepts2010 that finds the department names of all instructors.
SELECT DISTINCT dept_name FROM instructor

In this query, the absence of the DISTINCT keyword would lead to the same department name being repeated which is not desired. We term mutations that add or delete the DISTINCT keyword to the select as distinct mutations (DISTINCT of aggregates is covered in Section 2.2). To kill such mutations we need a dataset which results in at least two tuples in the output such that these tuples are identical on the projected attributes. We use the technique described in Section 8.2 for generating tuples with identical projected attributes. For such a dataset, the query with the DISTINCT keyword will give only a single tuple as output while the one without, will give at least two tuples.

In case the constraints are not satisfiable, it is not possible to have multiple tuples with the same value of the projected attribute(s). This could happen if one of the projected attributes is a primary key for the input to the DISTINCT clause or if the projected attributes are also used as GROUP BY attributes in the same query block. For such cases, the DISTINCT mutation is equivalent.

10 Other Extensions

From clause subqueries: Our parser turns from clause subqueries into a tree which can be handled using our existing techniques. We do not handle from clause subqueries that project aggregates, if there are constraints on the aggregation result in the enclosing query (other than simple constraints which our techniques handle) or if the query uses the lateral construct. Handling such queries is an area of future work.

Handling Parameterized Queries: When generating datasets for a query with parameters, we assign a variable to every parameter. The solution given by the SMT solver also contains a value for each parameter. It should be noted that since the solver assigns these values, each dataset may potentially have its own values for the parameters.

DATE and TIME: We handle SQL data types related to date and time, namely DATE, TIME and TIMESTAMP by converting them to integers.

Floating and Fixed Point Numbers: CVC3 allows real numbers to be represented as (arbitrary precision) rationals and hence when populating a real type data (floating or fixed precision) from the database or query, we represent it as a fraction in CVC3. When converting values to fixed precision values, supported by SQL, the conversion can in theory cause problems in rare cases, since two rationals generated by CVC3 which are very close to each other may map to the same fixed precision number. We have however not observed this in practice so far.

BETWEEN operator: For queries that contain the BETWEEN operator, say attr BETWEEN a AND b, we convert the BETWEEN operator to $attr>a$ $AND$ $attr<b$ . The datasets for killing selection mutations are also able to catch mutations where the user intended the range to include $a$ or $b$ or both.

Insert/Delete/Update Queries: To handle INSERT queries involving a subquery, and DELETE queries, we convert them to SELECT queries by replacing “INSERT INTO relation” or “DELETE” by “SELECT *”. UPDATE queries are similarly converted by creating a SELECT query whose projection list includes the primary key of the updated table, and the new values for each updated column; the WHERE clause remains unchanged from the UPDATE query. Data generation is then done to catch mutations of the resultant SELECT queries.

When testing queries in an application for correctness, we execute the original INSERT, DELETE or UPDATE queries against the generated datasets. To test student queries against a given correct query, we perform the transformation from INSERT, DELETE and UPDATE queries to SELECT queries as above for both the given student queries and the given correct queries, before comparing them as described in Section 11.

Handling WITH Clause and Views: We syntactically convert a query using a WITH clause or views by performing view expansion. The assumptions we make about the query structure must be satisfied by the resultant expanded query.

ORDER BY clause: ORDER BY clause mutations include missing ORDER BY clause or attributes, additional ORDER BY clause or attributes, using ORDER BY DESC instead of ORDER BY and vice versa. In the absence of any ORDER BY clause, the order of tuples is determined by the query plan. Hence, it is possible for a query without an order by clause or with an incomplete order by clause, to give a result in the same order as a correct query, depending upon the chosen plan. Thus, order by mutations, in general, cannot be caught by comparing results on different datasets, although we can use such comparison as a heuristic. Mutations between ORDER BY and ORDER BY DESC can, however, be caught by generating appropriate datasets. To kill such a mutation we generate a dataset having two distinct values for the order by attributes.

As an alternative to checking results on generated datasets, mutations involving missing or additional ORDER BY clause or attributes can be detected by checking the ORDER BY clauses in the query. However, care should be taken to handle equivalent ORDER BY clauses due to functional dependencies, equality predicates between variables, and equality selection conditions.

11 Grading Student SQL Queries

In xdata:icde15 we describe the XDa-TA grading tool which uses datasets generated by the techniques presented in this paper for checking the correctness of student SQL queries. Here we describe how to efficiently check student queries against given correct queries. For each query in an assignment, a correct SQL query is given to the tool, for which it generates datasets for killing mutants of that query. To check if a student query is correct, the results of the student and correct query are compared on each dataset.

It is to be noted that we do not aim to prove query equivalence of student query and correct query. Query equivalence between two queries $Q_{1}$ and $Q_{2}$ can be proven if we are able to prove that $Q_{1}$ is contained in $Q_{2}$ and vice versa⁴⁴4Query containment can be reduced to equivalence similarly since $Q_{1}\subseteq Q_{2}\equiv Q_{1}\cap Q_{2}=Q_{1}$ . Thus, query equivalence can be modeled in terms of query containment. Under set-semantics, it can be shown that the problem of query containment is NP-complete for conjunctive queries cm:stoc77 , and $\prod_{2}^{p}$ -complete for queries involving inequalities ineq.equiv ; conjuct.equiv . For bag semantics, the complexity of query containment is undecidable for conjunctive queries with inequalities bag . We tried a sufficient condition for query equivalence, namely that both $Q_{i}$ and $Q_{i,j}$ generate the same optimal query plan, but as results in Section 13.4 show, this approach is often unable to establish equivalence of correct queries.

Thus, we only aim to catch common errors and it is possible that a non-equivalent student query may be marked correct. However, in case we mark a student query as incorrect we have a dataset on which the student query and the correct query gives different results and hence guarantee that the queries are not equivalent.

The instructor needs to upload the schema and optionally small sample tables, by providing SQL script files. The instructor can then add assignment questions in text and correct queries for the same. For each correct query, the tool then generates datasets, using the techniques of the XData system. Each dataset is tagged with a label indicating what kind of mutation the dataset was designed to kill. Student queries are submitted directly by the tool or can be uploaded in bulk.

For some assignments, it may be possible to write correct queries using several very different approaches. Datasets generated for a correct query are designed to be used to kill mutations of that query, but may or may not succeed in killing mutations of a different formulation of the query. It could also happen that the question in text set by the instructor was ambiguous and there are multiple ways of interpreting it. For these cases, the instructor mode allows multiple correct queries to be uploaded. Datasets generated from all the correct queries are used while evaluating student queries. The instructor may set whether datasets of all the queries need to be passed or only one query needs to be passed depending on the need. Besides, additional datasets for the query may also be added if desired.

Let $Q_{i,j}$ denote the $j^{th}$ student’s query submission for question $i$ . Let $CQ_{i,m}$ denote the $m^{th}$ correct query for question $i$ and $D_{i,m,k}$ be the $k^{th}$ dataset for the correct query $CQ_{i,m}$ .

To evaluate student queries for a given correct query $CQ_{i,m}$ , for each corresponding dataset $D_{i,m,k}$ , the tool first uploads the dataset to the database, creating appropriate tables. The tables created for this purpose are temporary tables whose view is limited for only a session so that there are no conflicts in case multiple student queries are being evaluated simultaneously. Next to compare the result of each student query $Q_{i,j}$ with that obtained by the correct query, $CQ_{i,m}$ , the tool executes an SQL query of the form
( $Q_{i,j}$ EXCEPT ALL $CQ_{i,m}$ ) UNION ( $CQ_{i,m}$ EXCEPT ALL $Q_{i,j}$ ) on the temporary tables.

If the result of the above query is non-empty for any dataset $D_{i,m,k}$ , the student query $Q_{i,j}$ is marked as incorrect. If the results of the above query are empty for all datasets, query $Q_{i,j}$ is deemed correct for the purpose of grading. The instructor can also decide that the presence of duplicates does not matter and in this case the tool uses EXCEPT instead of EXCEPT ALL in the query above.

An assignment can be marked as a learning assignment or a graded assignment. When the tool is used in student mode, for graded assignments, the tool accepts queries from the student and saves the queries for later grading. Grading can be initiated from by the instructor. For learning assignments, the system executes the queries and displays which datasets the query fails on (this can be done incrementally, one failed dataset at a time). Tagging datasets with the type of mutation that the dataset was intended to kill, as mentioned earlier, helps students understand what the mistake was.

Our approach for checking the correctness of query relies on killing the mutations of the correct query and not of the student query. As a result, we may not catch erroneous student queries that have extra selection conditions. We do catch extra join conditions if the column names are identical but may miss other extra join conditions also. Consider a query condition $x>3$ . We generate datasets for satisfying $x>3$ , $x=3$ and $x<3$ . These datasets will catch the mutations involving a change in the operator and in case the condition is missing. However, if the student query contains $x>3$ $AND$ $x<>2674$ , it may be marked as correct since we may not have any test case to test mutation of $x<>2674$ . Since the additional condition could be any arbitrary condition it is not feasible to generate datasets to catch such errors. One way to deal with this is to generate datasets based on mutations of the student query as well and use these also in grading. These datasets would catch such extra conditions. Since this requires a lot of overhead including constraint generation, constraint solving etc. for all the student queries, we do not implement this currently. We did not find any such student query in our experiment described in Section 13.4.

12 Related Work

The AGENDA toolset can generate test data for an application, given as input the database schema, the application source code and certain sample value files. The data generated is however query agnostic, and may not catch errors if the selection conditions are not satisfied, leading to empty results in all cases. Reverse Query Processing (RQP) RQP takes as input a query $Q$ and a result $O$ , and generates input data $I$ such that $O=Q(I)$ , the result of $Q$ on input $I$ . Since the query result needs to be provided as input, RQP cannot be used to test correctness of SQL queries.

Qex SQE ; QEX is a tool for generating a dataset and parameter values for a given parameterized SQL query using the SMT solver Z3. The goal is to generate data so that the query has a non-empty result. This corresponds to the generation of the first dataset in our case. However, Qex does not address killing of query mutations. Datasets of Qex may not be able to catch errors in join conditions, distinct, aggregate, missing or additional group by attributes as well as missing selection or joins conditions.

Tuya et al. mutation1 describe a number of possible mutations for SQL queries. However, they do not handle test data generation for killing these mutations. They divide the mutations into four classes: mutations of the main SQL clauses (SC), mutations of the operators that are present in conditions and expressions (OR), mutations related to the handling of NULL values (NL), and replacement of identifiers: column references, constants and parameters (IR). We generate dataset for all of SC, OR and NL mutations except for the following: mutations related to arithmetic expressions, some mutations of LIKE patterns, mutations between AND and OR, and some mutations related to three-valued logic. Currently, we do not consider IR mutations. Handling the above mutations is an area of future work. However, we do consider some mutations that are not covered in mutation1 such as join type mutations on alternative join orders and mutations of the LIKE operator.

Riva et al. Riva:2010 introduce rules which they call SQL Full predicate coverage (SQLFpc) rules, which specify conditions that must be satisfied by test cases in order to kill each of a variety of SQL query mutations; further rules to handle a larger class of SQL constructs and mutations are described in their Web tool sqlfpc.web . However, they do not describe how to actually generate data. Tuya:2010 extends Riva:2010 by generating constraints based on SQLFpc and solving the constraints using a constraint solver called Alloy alloy . However, it considers data generation and mutation killing for only numeric selection conditions and joins. Queries involving strings, aggregation, subqueries, group by and updates are not handled.

Pan et al. mutagen:ast13 describe Mutagen which, given a database application, first generates program code mutants and SQL-query mutants by transforming constraints from SQL queries to program code, and then uses PexMutator pex to generate data to kill the mutants. However, they only handle mutations of conditions in the where clause; as far as we can tell from their brief description, the class of mutations they consider is very small, and in particular, they do not handle aggregation, subqueries, join type mutations set operators, distinct mutations and a number of other query features and mutations that we consider.

The work in this paper extends our earlier work on XData xdata:icde11 ; xdata:dbtest13 ; xdata:icde15 ; details of the differences and novel contributions of this paper were described earlier in Sections 1 and 2.

Olston et al. olstonCS09 take a dataflow program and a database and generate an example dataset such that the result of each operator (including intermediate operators) in the program is non-empty. However, they do not handle integrity constraints or check for query correctness.

There have been a number of papers for testing database applications. However, these do not address the problem of testing queries in the applications. Emmi et al. emmi and Pan et al. mutagen:dbtest11 ; pan:2014 describe approaches to testing applications based on creating database states and test inputs, which can ensure code coverage. Kapfhammer and Soffa kapfhammer similarly consider test adequacy of database driven applications.

13 Experimental Results

We implemented the techniques for data generation described in this paper, as extensions to the XData system. We show that our techniques for constrained aggregation (Section 13.1) and subqueries (Section 13.2) are able to generate non-empty datasets and kill mutations in a number of cases. In Section 13.3 we show that our techniques are capable of generating datasets and killing mutations for the queries in the TPC-H benchmark. In Section 13.4 we evaluate our grading tool and show that it is better at catching student query errors than fixed datasets or correction by TAs.

Each of the techniques we describe targets a different query construct or mutation and hence it does not make sense to compare the different techniques that we have proposed with each other.

13.1 Constrained Aggregation

In Section 6.3.1 we described our approach for estimating the number of tuples for the purpose of data generation for queries containing constrained aggregation. In this section, we provide experimental results on the estimation of number of tuples per relation and subsequent data generation for a number of queries containing constrained aggregation. The objective is to see if the tuple assignment technique (Section 6.3.1) assigns tuples in a manner that (a) can produce datasets to generate to non-empty result on the original query (this is the first dataset as mentioned in Section 2) and (b) kill mutations related to the HAVING clause (aggregate mutation and comparison operator mutation of the HAVING clause).

For this experiment, queries which involve constraints on aggregate operators along with one or more GROUP BY attributes were chosen. (The list of queries is provided in Appendix E.) Aggregates in both outer query block and subqueries are considered. We also manually generated non-equivalent mutations by mutating the comparison operator (20 mutations) and aggregate operator (16 mutations) for the chosen queries, to test if the datasets could kill these mutations⁵⁵5We do not use any automated tool to generate mutations. The mutations generated by an automated tool may or may not be equivalent to the original query. If our dataset fails to kill some of the mutations we would not be sure if that was because of the incompleteness of our tool or because of equivalence of mutation and the original query..

The results are shown in Table 3. For each constrained aggregation, the Tuples column shows the number of tuples assigned to each base relation. The columns Comparison Mutations and Aggregate Mutations show if all the non-equivalent mutations of comparison operator in HAVING clause and aggregate mutation respectively were killed by the generated datasets or not. Query CA8 had two constrained aggregations, one in a subquery and one in the outer query block which are labeled as CA8a and CA8b respectively.

The datasets generated by XData was able to produce non-empty results on all queries. In terms of killing mutations, 35 out of the 36 mutations were killed. The mutation from MAX to MIN was not caught for Test Case CA2. For killing mutation on MAX to MIN we need two distinct tuples, one which satisfies the aggregate constraint and one which does not. Our tuple assignment method assigned only one tuple to the relation that had the MAX aggregate value and hence this mutation was not caught. Handling such cases is an area of future work.

Test	Tuples	Comparison	Aggregate
Case		Mutations	Mutations
CA1	1,2	$\surd$	$\surd$
CA2	1,1,2	$\surd$	$\times$
CA3	2,1,2	$\surd$	$\surd$
CA4	3,3,1,3	$\surd$	$\surd$
CA5	1,2,2	$\surd$	$\surd$
CA6	1,2,1,2,2	$\surd$	$\surd$
CA7	1,1,3	$\surd$	$\surd$
CA8a	1,2,2	$\surd$	$\surd$
CA8b	1,2,2	$\surd$	$\surd$
CA9	5,5	$\surd$	$\surd$

Table 3: Tuple estimation for constrained aggregation

13.2 Subqueries

In Section 7 we described various techniques for generating test data and killing mutations for queries containing where clause subqueries. For this experiment, we chose queries involving various subquery connectives both with and without aggregates (The list of all queries is provided in Appendix E) and check to see if XData is able to generate a dataset that produces non-empty result on the original query. We also manually generated non-equivalent mutants by mutating the subquery connective (20 mutations) and the conditions in the subquery (20 mutations) to test if the datasets could kill these mutations⁵.

For all the queries considered XData could generate a dataset that produced non-empty result on the original query. The datasets generated by XData were able to kill all of the 40 query mutations that we considered.

13.3 TPC-H queries

We also tried generating test data and killing mutations for queries from the TPC-H benchmark. We asked a few volunteers (who had not worked on the XData project) to generate specific types of query mutants. We tested to check if the datasets generated by XData could kill these mutations or not. In case, XData was not able to kill the mutations we examined to check if the mutant was equivalent to the original query or not. We only used non-equivalent mutants for measuring the performance of XData.

Since our parser did not support certain query constructs we made minor changes (mainly syntactic) to the queries so that it could be parsed and the datasets could be generated. However, for checking whether the datasets generated a non-empty result or not, and for generation and killing of mutations we used the original queries.

We were able to successfully generate datasets for 17 out of the 22 queries. Of the 5 queries for which our techniques failed to generate correct datasets, 4 queries had query constructs which are not currently handled (subqueries that have aggregates with expressions, aggregate value compared to a subquery and aggregate in a from clause subquery). One query failed because the CVC3 solver crashed while generating datasets for that query. Extending our system to handle these construct and migrating to newer version of CVC or using a different solver such as Z3 is an area of ongoing work.

The number of the different types of mutations killed across all queries is shown in Table 4. In addition to the mutations that our techniques explicitly target, we also tested queries with mutations of arithmetic expressions (replacing one arithmetic operator with another).

Mutation	Mutants	Mutants
Type	Generated	Killed
Selection (Comparison)	10	10
Join Type (INNER / OUTER)	8	8
Aggregation (Distinct/ MIN vs. MAX)	9	9
String Selection (String Comparison)	7	7
String Like	5	5
Missing Joins Conditions	13	13
Having Clause (Comparison Operator)	2	2
Subquery Connective	6	6
Changed Group By	21	20
AND vs. OR	16	16
Arithmetic Operator	13	12
Total	110	108

Table 4: Number of mutants caught on TPC-H queries

Overall XData was able to kill over 95% of the non-equivalent mutants that we obtained. For TPCH query 4, XData could not generate a dataset for killing extra group by attribute mutations and hence the corresponding mutation was not caught. Of the 13 queries with arithmetic operator mutations all but one were killed even though we do not explicitly target these mutations; explicitly targeting them is an area of future work.

Refer to caption — Figure 1: Number of incorrect queries caught

13.4 Grading

We use the tool described in Section 11 to grade student queries. In order to compare grading done by XData to fixed datasets and the grading done by TAs, we used 14 SQL assignments, each of which was answered by students of an undergraduate database course at IIT Bombay. We omit questions which asked students to create DDL statements.

For each question, a correct SQL query $CQi$ was used to generate datasets. The correct SQL queries are shown in Appendix E. For the 9th assignment question, the query could be written in 2 quite different ways which we denote CQ9a and CQ9b; we generate datasets for both query formulations, and the results for CQ9 are using the combined sets of datasets. Query $CQ3$ was assigned at a point in the course where students had not been taught about the DISTINCT clause, and hence we set the testing tool parameters to ignore duplicates in the results of the correct query and the student query.

The time taken for generating all the datasets for these queries (including the time taken by our code and the CVC3 solver) ranged from 11 to 90 seconds, on a computer with an Intel(R) Core(TM) i5-2500K 3.30GHz CPU, and 8 GB of memory, running Ubuntu. The number of datasets generated ranged from 2 for CQ1 to 25 for CQ9a. Each dataset had a very small number of tuples, typically less than 5 per relation. The maximum number of tuples for a relation was 16.

As comparison points, we also tested the queries with two sample University databases provided with the textbook by Silberschatz et al. dbconcepts2010 , and with the result of manual correction by course TAs. The first University database, which we call USm is a small database which was manually created by the authors of dbconcepts2010 to catch common errors; the second larger database, which we call ULg is a larger database. The TAs used a combination of testing against sample databases they created, and their own reading of the queries.

We also tried an alternate way to grade student queries, by comparing the optimal query plans of the correct query with the optimal query plans for the student queries. If the plans match we flag the query as correct. We use PostgreSQL with the VERBOSE flag set to ensure that we get projected attributes of the query as well. Note that equivalent queries may not have identical plans. For example, a condition $x>3$ is equivalent to $x>=2$ when $x$ is an integer, but plans using these alternatives would be considered different. Also, the optimizer could find different plans for different ways of expressing the same query (especially true with subqueries). In our experiments we found that most of the student queries did not have the same plan as the correct query, even if they were correct (verified manually on sample cases). For CQ3 the optimizer chose different join plans and hence most of the queries did not match. Same was the case with CQ7, CQ8, CQ13 and CQ14. For these queries, less than 5% of the queries were marked correct.

The result of the evaluations is shown in Figure 1. Detailed evaluation is shown in Table 7 in Appendix E. For XData, USm and ULg the query is marked as incorrect iff there is a dataset that produces different results on the correct query and the student query. Hence for these methods we can guarantee that a student query is marked incorrect only when it is not equivalent to the correct query. Consequently, the number of queries marked incorrect can be used as a measure of the effectiveness of the technique. We also tried to use the combination of USm and ULg grade queries. The number of incorrect queries caught turned out to be the maximum of the number of incorrect queries caught by USm and ULg.

These results indicate that, overall, the datasets generated by XData were able to catch more incorrect queries than both USm and ULg, the two University datasets from dbconcepts2010 . For CQ5, CQ8 and CQ14, in particular, our tool was significantly more efficient than the University datasets.

As compared to TAs, our datasets performed significantly better on many queries, including CQ3, CQ4, CQ5, CQ6, CQ8 and CQ9. The actual effectiveness of TAs is a little better than what the table indicates, since there were some queries where students made minor errors such as including extra attributes, which the TAs decided to ignore as irrelevant, but which were caught by all the datasets.⁶⁶6If students had been told that their queries would be graded by a tool, they would have probably taken more care to avoid such errors.

For query CQ5 and CQ8, some students had performed joins on the wrong tables, but these queries gave a correct result on datasets created by the TAs for checking the queries, and were marked as correct.

For CQ8, the University dataset did not have any course taught by two different instructors in Spring 2010, and hence a missing distinct keyword in the select clause was not detected. The TAs too did not enforce the check for distinct, which was required for this query.

In contrast, for CQ4, the University dataset USm had a student who had taken CS-101 twice and hence performed as well as XData. Again, the TAs had ignored the absence of a distinct specification. For CQ5, again the University datasets USm and ULg both had some courses with two sections, which caught missing distinct specifications; in this case the TAs did check for the presence of the distinct specification.

For CQ9a a large number of incorrect queries were caught by XData based on missing group by attributes and missing distinct clause. For CQ14, the data generation and mutation killing technique for NOT EXISTS was essential for catching a large number of student query errors.

Discussion:

In order to get a measure of our accuracy or completeness of our techniques on these queries we need an oracle to identify which queries are correct and which are not. This is very difficult for complex queries and doing this for classes with many students is extremely time-consuming. The closest option is human evaluation. However, our tool in its current version outperforms TAs (indicating TAs are not infallible). Hence, it is difficult for us to provide any completeness results for our grading tool.

14 Conclusion

In this paper we have addressed the issue of testing SQL queries and automated testing of SQL student assignments. We used the XData system which we built earlier, to generate test datasets for detecting errors, and realized that there were several limitations that needed to be addressed. We described several novel extensions to address these limitations. We also tested the efficacy of our test generation techniques for grading SQL queries submitted by students, and showed that our techniques outperform fixed (query independent) datasets, as well as TAs in terms of catching errors, while avoiding the drudgery of manual correction. Our XData system has great potential for easing the life of database application developers and testers and also to database course instructors particularly to those of MOOCs.

We have successfully used the grading tool in a UG database course at IIT Bombay to correct student queries. The grading tool is available at http://www.cse.iitb.ac.in/infolab/xdata and can be used by course instructors for grading queries.

Areas of future work include handling some SQL features which we do not currently support, or support only partially, and handling further classes of mutations. These features include handling subqueries within a subquery, arithmetic expressions and mutations involving replacement of identifiers. Another area of future work is to award partial marks to student queries in a way that reflects how close the student query is to some correct query.

Acknowledgements.

We would like to thank Tata Consultancy Services(TCS), India for partially funding this project through a grant and a PhD fellowship. We would also like to thank Amol Bhangadia, Bharath Radhakrishnan and Ankit Shah for their help in some running some experiments.

References

(1) SMT-LIB- The Satisfiability Modulo Theories Library. http://smt-lib.org/
(2) SQLFpc - Generation of Full Predicate Coverage Rules for SQL database queries (MCDC for SQL) (2014). Http://in2test.lsi.uniovi.es/sqlfpc/SQLFpcWeb.aspx
(3) Barrett, C., Conway, C.L., Deters, M., Hadarean, L., Jovanović, D., King, T., Reynolds, A., Tinelli, C.: Cvc4. In: 23rd International Conference on Computer Aided Verification, CAV’11, pp. 171–177. Springer-Verlag, Berlin, Heidelberg (2011)
(4) Barrett, C., Sebastiani, R., Seshia, S.A., Tinelli, C.: Satisfiability modulo theories. In: A. Biere, H. van Maaren, T. Walsh (eds.) Handbook of Satisfiability, vol. 4, chap. 8. IOS Press (2009)
(5) Barrett, C., Tinelli, C.: CVC3. In: Computer Aided Verification (CAV), pp. 298–302 (2007)
(6) Bhangadiya, A., Chandra, B., Kar, B., Radhakrishnan, B., Reddy, K.V.M., Shah, S., Sudarshan, S.: The XDa-TA system for automated grading of SQL query assignments. In: ICDE (2015)
(7) Binnig, C., Kossmann, D., Lo., E.: Reverse query processing. In: ICDE, pp. 506–515 (2007)
(8) Chandra, A.K., Merlin, P.M.: Optimal implementation of conjunctive queries in relational data bases. In: STOC, pp. 77–90 (1977)
(9) Chandra, B., Chawda, B., Shah, S., Sudarshan, S., Shah, A.: Extending XData to kill SQL query mutants in the wild. In: Sixth International Workshop on Testing Database Systems, DBTest ’13, held in conjunction with ACM SIGMOD, pp. 2:1–2:6 (2013)
(10) Emmi, M., Majumdar, R., Sen, K.: Dynamic test input generation for database applications. In: Int’l Symp. on Software Testing and Analysis, pp. 151–162 (2007)
(11) Fu, X., Powell, M., Bantegui, M., Li, C.C.: Simple linear string constraints. Formal Aspects of Computing pp. 847–891 (2013)
(12) Gupta, B.P., Vira, D., Sudarshan, S.: X-Data: Generating Test Data for Killing SQL Mutants. In: ICDE (2010)
(13) Jackson, D.: Alloy: A new technology for software modelling. In: Tools and Algorithms for the Construction and Analysis of Systems (TACAS), LNCS, vol. 2280, pp. 20–20 (2002)
(14) Jayram, T.S., Kolaitis, P.G., Vee, E.: The containment problem for real conjunctive queries with inequalities. In: PODS, pp. 80–89 (2006)
(15) Jia, Y., Harman, M.: An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering pp. 649–678 (2011)
(16) Kapfhammer, G.M., Soffa, M.L.: A family of test adequacy criteria for database-driven applications. SIGSOFT Softw. Eng. Notes pp. 98–107 (2003)
(17) Kiezun, A., Ganesh, V., Guo, P.J., Hooimeijer, P., Ernst, M.D.: HAMPI: a solver for string constraints. In: Intl. Symp. on Software Testing and Analysis, pp. 105–116. ACM, New York, NY, USA (2009)
(18) Klug, A.: On conjunctive queries containing inequalities. J. ACM (1988)
(19) Liang, T., Reynolds, A., Tinelli, C., Barrett, C., Deters, M.: A DPLL (T) theory solver for a theory of strings and regular expressions. In: Computer Aided Verification, pp. 646–662. Springer (2014)
(20) van der Meyden, R.: The complexity of querying indefinite data about linearly ordered domains. In: ACM PODS, pp. 331–345 (1992)
(21) Møller, A.: Automaton pakage dk.brics.automaton. Http://www.brics.dk/automaton/
(22) Olston, C., Chopra, S., Srivastava, U.: Generating example data for dataflow programs. In: SIGMOD Conference, pp. 245–256 (2009)
(23) Pan, K., Wu, X., Xie, T.: Database state generation via dynamic symbolic execution for coverage criteria. In: Fourth International Workshop on Testing Database Systems, DBTest ’11, held in conjunction with ACM SIGMOD, pp. 4:1–4:6 (2011)
(24) Pan, K., Wu, X., Xie, T.: Automatic test generation for mutation testing on database applications. In: Automation of Software Test (AST), 2013 8th International Workshop on, pp. 111–117 (2013)
(25) Pan, K., Wu, X., Xie, T.: Guided test generation for database applications via synthesized database interactions. ACM TOSEM (2014)
(26) de la Riva, C., Suárez-Cabal, M.J., Tuya, J.: Constraint-based test database generation for SQL queries. In: Workshop on Automation of Software Test, AST ’10, pp. 67–74 (2010)
(27) Saxena, P., Akhawe, D., McCamant, S., Song, D.: KALUZA. Http://webblaze.cs.berkeley.edu/2010/kaluza/
(28) Shah, S., Sudarshan, S., Kajbaje, S., Patidar, S., Gupta, B.P., Vira, D.: Generating test data for killing SQL mutants: A constraint-based approach. In: ICDE (2011)
(29) Silberschatz, A., Korth, H.F., Sudarshan, S.: Database System Concepts, 6th edn. McGraw Hill (2010)
(30) Tuya, J., Cabal, M.J.S., de la Riva, C.: Full predicate coverage for testing SQL database queries. Softw. Test., Verif. Reliab. pp. 237–288 (2010)
(31) Tuya, J., Suarez-Cabal, M.J., de la Riva, C.: Mutating database queries. Information and Software Technology pp. 398–417 (2007)
(32) Veanes, M., Grigorenko, P., de Halleux, P., Tillmann, N.: Symbolic query exploration. In: ICFEM, pp. 49–68 (2009)
(33) Veanes, M., de Halleux, P., Tillmann, N.: Rex: Symbolic regular expression explorer. In: ICST, pp. 498–507 (2010)
(34) Veanes, M., Tillmann, N., de Halleux, J.: Qex: Symbolic SQL Query Explorer. In: LPAR, pp. 425–446 (2010)
(35) Zhang, L., Xie, T., Zhang, L., Tillmann, N., De Halleux, J., Mei, H.: Test generation via dynamic symbolic execution for mutation testing. In: IEEE International Conference on Software Maintenance (ICSM), pp. 1–10 (2010)

Appendix

Appendix A Cardinality Estimation for Join Inputs

The tuple estimation for each relation for constrained aggregation on join result is done in 3 steps. First we construct a join graph. Then we infer attributes to be added to uniqueElements and singleValuedAttributes. In the third step, we assign cardinality to each relation such that the resulting number of tuples is $n$ .

Step 1: Construct Join Graph

We construct a join graph G = (R, E), with each relation in the query as a vertex. The join conditions from one table to another are represented by a single edge between the nodes. Figure 2 shows a join graph involving relations $A$ , $B$ and $C$ . There are join conditions between $A$ and $B$ , and between $B$ and $C$ . However there are no join conditions between $A$ and $C$ . Inferred join equalities are also added to the graph. For example, the join conditions A.a = B.b and B.b = C.c imply that A.a = C.c is also a join condition and hence it would be added to the graph. Note that this may introduce a cycle in the graph; our algorithm can work with cyclic join graphs.

Step 2: Infer Attribute Properties

Next we apply the following sets of rules to infer properties of attributes

Rule 1: Every group by attribute is a single valued attribute.

Rule 2: Every set of attributes declared as primary key or unique key, is unique in the group.

Rule 3: Every attribute which appears in conjuncts of the form A.a=constant is a single valued attribute.

Rule 4: If each attribute of any uniqueElements[ $R_{i}$ ] is a single valued attribute then all attributes of that relation are single valued attributes.

Rule 5: If any attribute, $R_{i}.x$ , is a single valued attribute then every attribute of equivalence class (Section 2.2) in which $R_{i}.x$ is present becomes a single valued attribute. For example, if the join condition is A.a = B.a and A.a is single valued, B.a also becomes single valued.

Rule 6: If an attribute of a unique element is single valued then remaining attributes of unique element become unique. We apply this rule recursively on the unique element to get a minimal unique element. We then drop all non-minimal sets from $uniqueElements$ . For example, if (A.a, A.b, A.c) is unique and A.a is single valued then (A.b, A.c) is unique and is added to $uniqueElements[R_{i}]$ . In this case (A.a, A.b, A.c) is dropped from $uniqueElements[R_{i}]$ .

0: joinConds[

R_{i}

R_{j}

]groupByAttributesunique keys and primary keys of relations

\in

query

0: Effective uniqueElements, singleValuedAttributes for the query

1: Build equivalence classes using joinConds[

R_{i},R_{j}

\forall R_{i},R_{j}

\in

R

2: Apply Rule 2 and update uniqueElements

3: Apply Rule 1, Rule 3 and update singleValuedAttributes

4: while change in uniqueElements or singleValuedAttributes do

5: Apply Rule 4, Rule 5 and update singleValuedAttributes

6: Apply Rule 6 and update uniqueElements

7: end while

8: return uniqueElements, singleValuedAttributes

Algorithm 2 : getAttributeInferences()

The rules are applied according to Algorithm 2 to infer which attributes are added to uniqueElements and which to singleValuedAttributes.

Step 3: Assign Cardinality

We define some more terms

•

$joinAttributes[R_{i},R_{j}]$ : attributes of relation $R_{i}$ that are involved in join conditions with relation $R_{j}$ .
•

$unique[R_{i},R_{j}]$ : $\{S_{k}\mid S_{k}\subseteq\mbox{joinAttributes}[R_{i},R_{j}]\wedge S_{k}\in\mbox{uniqueElements}[R_{i}]\}$ .
•

$n_{R_{i}}$ : number of tuples assigned to relation $R_{i}$ .

In order to find the number of tuples for each relation we use the attributes inferred using Algorithm 2 along with the following rules.

Rule 7: If $n_{R_{i}}$ = $n,n>1$ and unique[ $R_{i}$ , $R_{j}$ ] $\neq\emptyset$ then $n_{R_{j}}$ is set to $n$ . We also infer further unique elements as follows. For each $S_{k}\in\textit{unique}[R_{i},R_{j}]$ , let $S^{\prime}_{k}$ be the attributes from $R_{j}$ that are equated to $S_{k}$ . Then add $S^{\prime}_{k}$ to uniqueElements[ $R_{j}$ ].

The intuition behind Rule 7 is as follows. Consider the join of two relations A and B. Let the join condition be $A.a=B.a$ and suppose that $\{A.a\}\in$ uniqueElements $[A]$ . Here joinAttributes [ $A$ , $B$ ]={A.a}, joinAttributes $[B,A]$ ={ $B.a$ }, unique [ $A$ , $B$ ]={ $A.a$ } and unique [ $B$ , $A$ ]= $\emptyset$ . If the cardinality of $A$ is $n$ , since $A.a$ is unique, it must have $n$ different values. The relation $B$ has join condition with $A.a$ which belongs to uniqueElements[A]. So $B$ must contain $n$ tuples with distinct values for the attribute $B.a$ across $n$ tuples and each value matches with the value of $A.a$ for one of the tuples in $R_{i}$ . So the cardinality of $B$ become $n$ and $B.a$ becomes a unique attribute.

Implementation Rule 1: If $n_{R_{i}}$ = $n,n>1$ and $R_{i}$ has a multi attribute unique element, $mu$ , such that every attribute of $mu$ participates in some join conditions but joinAttributes $[R_{i},R_{j}]\subset$ $mu$ for all j, then for at least one relation $R_{k}$ that joins with $R_{i}$ joinAttributes $[R_{i},R_{k}]$ is unique and $n_{R_{k}}=n$ . One such $R_{k}$ is picked and we add joinAttributes $[R_{i},R_{k}]$ to uniqueElements $[R_{i}]$ and joinAttributes $[R_{k},R_{i}]$ to uniqueElements[ $R_{k}$ ].

The intuition is as follows. Consider the join graph shown in Figure 2. Let joinConds[ $A$ , $B$ ]={A.a=B.a}, joinConds[ $B$ , $C$ ]={B.b=C.b}. Let (B.a, B.b) be unique. Here, joinAttributes [ $A$ , $B$ ] = {A.a}, joinAttributes [ $B$ , $A$ ] = { $B.a$ }, joinAttributes [ $B$ , $C$ ]={ $B.b$ }, and joinAttributes [ $C$ , $B$ ]={ $C.b$ }. Further, unique [ $A$ , $B$ ]= $\emptyset$ , unique [ $B$ , $A$ ]= $\emptyset$ , unique [ $B$ , $C$ ]= $\emptyset$ , unique [ $C$ , $B$ ]= $\emptyset$ .

Suppose cardinality of $B$ is $n$ . Since unique[ $B$ , $A$ ] = $\emptyset$ , is possible that $n_{A}$ =1 such that $A.a$ matches with all values of $B.a$ across $n$ tuples. Here $B.a$ contains same value across $n$ tuples. Similarly, we can choose $n_{C}$ = 1 and $B.b$ will have the same across $n$ tuples. Now both $B.a$ and $B.b$ have same values across all $n$ tuples. But (B.a, B.b) must be unique across $n$ tuples. So the assignment of cardinalities is incorrect. Hence at least one of B.a or B.b must be chosen to be unique, and this will cause one $n_{A}$ or $n_{B}$ to be $n$ .

Note that in this example had (B.a, B.b, B.c) been unique, every attribute of $mu$ does not participate in any of the join conditions. In this case, the rule is not applicable and both $A$ and $C$ may have a cardinality of 1. To generate $n$ tuples for $B$ such that the join results in $n$ tuples, B.c can have $n$ distinct values while B.a and B.b have same values corresponding to A.a, C.b respectively.

We differentiate this rule from others since this rule can have several possible outcomes as opposed to the other rules for which the outcome is definite and unique. One outcome is chosen. The choice of which of the joining relations is assigned cardinality as $n$ can be made by the solver or as heuristic the choice can be made arbitrarily; we describe these below.

Cardinality Inference Algorithm

Let the aggregated attribute be $R.a$ . For getting the cardinality of each relation, using the rules and the given join conditions of the relations we can encode the tuple assignment problem in the form of constraints in CVC3. We add the following constraints in CVC3.

•

constraints ascertaining singleValuedAttributes and uniqueElements for each relation
•

for each relation such that all attributes are single valued (Rule 4) constraints to ensure that the number of tuples is 1
•

constraints for Rule 7 and the Implementation Rule 1 for all the relations in the query as applicable
•

constraints to ensure that the final count after joining the tables is n
•

in case n values are required for some attribute R.a to satisfy some aggregate condition we add constraints to ensure that the relation R has n tuples. For example, consider a case where $SUM(R.a)=17$ , where $a$ is an integer attribute and there is a constraint $R.a\leq 5$ , we need at least 4 tuples for the given group of R and they cannot all be the same. It is not possible to satisfy the aggregation condition if we assign a single tuple to R, the join of R with other relations produces 4 tuples for the group. Similar is the case with SUM DISTINCT on an integer attribute.

On solving this set of constraints, we get the number of tuples for each relation.

The constraint approach for tuple generation works well if the number of attributes is not very large. In practice, we use a simple and fast heuristic approach described as follows. If any non-empty set of attributes of a relation forms a unique element and every attribute of that unique element is a single valued attribute then that relation must contain a single tuple (explained in Rule 4). For such relations, the only possible choice of cardinality is $1$ . Of the remaining relations, the heuristic algorithm chooses one relation and assigns to it a cardinality of $n$ , making it the root node. The count of all other nodes of the join graph, $n_{R_{i}}$ is initialized as 1. The root node ( $R_{r}$ ) is then used as a starting relation to calculate the actual cardinality for other relations using Rule 7 and Implementation Rule 1. The procedure for this is described Algorithm 3. If the heuristic fails we use the constraint approach.

G

= (

R

E

): Join graph singleValuedAttributes uniqueElements

R_{s}

: relation chosen as root node (cardinality can be

n

)

0: assigned cardinality

n_{R_{i}}

\forall R_{i}

\in

R

\forall R_{i}

\in

R

, initialize

n_{R_{i}}\leftarrow 1

2: nQueue =

\emptyset

n_{R_{s}}

n

4: nQueue.enqueue(

R_{s}

)

5: while

nQueue\neq\emptyset

R_{i}\leftarrow

nQueue.dequeue() do

6: for each edge

E_{k}

\in

E

from

R_{i}

R_{j}

7: prevCardinality

\leftarrow n_{R_{j}}

8: Apply Rule 7 from

R_{i}

R_{j}

9: if change in

uniqueElements[R_{j}]

or (prevCardinality=1 and

n_{R_{i}}=n

) then

10: nQueue.enqueue(

R_{j}

)

11: end if

12: end for

13: if Implementation Rule is applicable on

R_{i}

then

14: Apply Implementation Rule 1 on

R_{i}

, let

R_{k}

be the relation for which cardinality if to be changed to

n

15: prevCardinality

\leftarrow n_{R_{k}}

16:

n_{R_{k}}\leftarrow n

17: if change in

uniqueElements[R_{k}]

or prevCardinality=1 then

18: nQueue.enqueue(

R_{k}

)

19: end if

20: if change in

uniqueElements[R_{i}]

on Rule 8 then

21: nQueue.enqueue(

R_{i}

)

22: end if

23: end if

24: end while

25: return

n_{R_{i}}

\forall R_{i}

\in

R

Algorithm 3 : getActualCardinalityHeuristic()

Appendix B Solving String Constraints

In this section, we describe our techniques to solve string constraints. We also show some more experimental results comparing our string solver to other available string solvers.

B.1 String Solver

In this section we describe the working of our string solver. To illustrate our method we use the following set of constraints as an example

Example 2


A > B
A like ‘%pqr%’
B ilike ‘_abc’
C >= B
C = ‘Biology’
A = E
E like ‘%abc%’
F >= B
G like ‘Bio%’

In this example for the purpose of simplicity of representation we consider that the strings may take only alphabetical values.

Our solver works as follows.

Step 1: Collect Conditions.

From all the constraints required for generating a dataset for the query, in the first step, we separate and collect the string constraints, i.e., selection conditions on strings, like conditions, and string length conditions.

Step 2: Reduce Number of Conditions.

Next, we reduce the number of string constraints by removing the conditions containing the equality operator as follows:

a) For each condition of the kind $S_{i}$ $=$ $const_{i}$ , where $S_{i}$ is a string variable and $const_{i}$ is a constant, we replace all occurrences of $S_{i}$ with $const_{i}$ . This may lead to constraints of the form $const_{i}$ relop $const_{j}$ or $const_{i}$ likeop $pattern$ . Using string operations, we then verify if such constraints are satisfiable. If they are satisfiable then we remove the equality conditions else we infer that there is no possible solution to the given set of conditions. For example, if the conditions are $A=$ ‘Comp’ and $A$ LIKE ‘Bio%’, replacing the value of $A$ as ‘Comp’ in the latter condition leads to an unsatisfiable constraint.

b) For constraints of the form $S_{i}$ = $S_{j}$ , we replace all occurrences of $S_{i}$ by $S_{j}$ in all constraints and remove the constraint $S_{i}=S_{j}$ from the set. When an instance of $S_{j}$ has been found, after solving the rest of the constraints, we assign the same value to $S_{i}$ .

In Example 2 we assign C= ‘Biology’ and replace all occurrences of C with this value. Replacing C, in C $>=$ B we get ‘Biology’ $>=$ B. We rewrite this as B $<=$ ‘Biology’. Since A=E is a constraint we replace all occurrences of A by E. After this step the constraints are

E > B
E like ‘%pqr%’
E ilike ‘_abc’
B <= ‘Biology’
E like ‘%abc%’
F >= B
G like ‘Bio%’

Step 3: Group Related Variables.

Next, we group variables that depend on each other, i.e., if $S_{i}$ relop $S_{j}$ or $S_{i}$ likeop $S_{j}$ is present in the set of constraints then $S_{i}$ and $S_{j}$ are in the same group. Once these groups are formed, we then solve the constraints for one group at a time. This grouping of variables helps in reducing the number of constraints that need to be solved at a time. In the above example E, B and F are dependent on one another and are hence grouped together in a group while G is put in another group.

For each group, we construct a graph, where the variables form the vertices. Let vertex $V_{i}$ represent the string variable $S_{i}$ . A constraint of the form $S_{i}<S_{j}$ or $S_{i}\leq S_{j}$ is represented by a directed edge from $V_{j}$ to $V_{i}$ in the graph. Constraint $S_{i}<>S_{j}$ is represented by an undirected edge between $V_{i}$ and $V_{j}$ .

The graph for our example case would look like the one shown in Fig. 3 where the dotted edge between F and B implies $\leq$ and the edge between E and B implies $<$ .

Additionally, for each string variable, $S_{i}$ , we store the following information.

•

MaxLength: The maximum allowable length of the string. It is initially assigned a default value. This value is modified based on string length constraints on $S_{i}$ , if any.
•

MinLength: The minimum allowable length of the string. Similar to the MaxLength this also has a default value and is modified based on length constraints.
•

NotEqualLengths: This is a set of values of length values not allowed for $S_{i}$ . This captures constraints of the kind $strlen(S_{i})<>$ constant.
•

Less: list of variables with the value less than $S_{i}$
•

LessEqual: list of variables with the value is less than or equal to $S_{i}$ .
•

NotEqual: list of variables with the value not equal to $S_{i}$ .
•

OtherConstraints: This list contains constraints of the form $S_{i}$ relop constant or $S_{i}$ likeop pattern.

Step 4: Choose the Variables to Solve.

We traverse the graph and first collect all vertices $V_{1},..,V_{k}$ whose outdegree is $0$ . These vertices represent the string variables whose value is the lowest amongst all comparable variables. In our example we choose the variable B.

If we do not find any such variable, it implies that there is a cyclic dependency among variables with each variable being less than (equal) to that some other variables. Essentially, this means that either all the variables in that cycle are equal to each other, if all edges are $\leq$ , or that the given set of constraints is not satisfiable, if at least one of the edges is $<$ . We first solve for these variables (with outdegree 0), one by one, using the function SolveOneVariable (described below) which finds the lexicographically smallest string possible.

After obtaining the solution for a vertex, say $V_{i}$ (and hence string variable $S_{i}$ ), for each vertex $V_{j}$ (string $S_{j}$ ) that has an edge to $V_{i}$ in the graph, we add appropriate constraints, using the solution of $S_{i}$ to the list of constraints for $S_{j}$ . We then remove $V_{i}$ from the graph and solve for the remaining vertices by repeating this step on the modified graph.

We now describe the function SolveOneVariable for finding the solution for vertex, $V_{i}$ . This function consists of two parts a) building an automaton and b) finding the lexicographically smallest string on this automation that satisfies all the constraints.

Step 4a: Building an automaton: We first convert the constraints of the form $S_{i}$ relop constant and $S_{i}$ likeop pattern to $S_{i}$ matches $re$ where $re$ is the corresponding regular expression in Java. This conversion is made by functions written specifically for each LIKE and comparison operator, as illustrated by the following examples of conversion:

S1 $>$ ‘Bio’ $\rightarrow$ ‘[C-z]\w*|B[j-z]\w*|Bi[p-z]\w*|Bio\w+’

S1 LIKE ‘Bio%’ $\rightarrow$ ‘Bio\w*’

S1 LIKE ‘Bio_’ $\rightarrow$ ‘Bio\w’

S1 ILIKE ‘Bio%’ $\rightarrow$ ‘[B|b][I|i][O|o]\w*’
(\w denotes a wild character)

We build an automaton, $A$ for the identity pattern (‘\w*’). Then for every constraint that must be satisfied by $S_{i}$ , we create another automaton, $B$ and modify the automaton $A:=A\cap B$ . We use a slightly modified version of the automaton package dk.bricks.automaton AUTO operation on automata. We use our own methods for converting a given Java compatible regular expression to an automaton. If the number of constraints on a variable is above a certain threshold we minimize the automaton resulting from $A\cap B$ at each step so as to improve the performance.

Algorithm 4 getSmallestString (state, strBuilder, max, min, notEqualLength, length)

1: if length

>=

min

\wedge

length

<=

max

\wedge

state

\in

finalState

\wedge

length

\notin

notEqualLength then

2: Return strBuilder

3: end if

4: if length

>

max then

5: Return null

6: else

7: tran = state.getTransitionsSorted

8: for

\forall

\in

tran do

9: append c to strBuilder {Each transition is on a character}

10: str= getSmallestString(c.to, strBuilder, max, min, notEqualLength, length+1)

11: if str

\neq

null then

12: Return str

13: end if

14: remove last char from strBuilder

15: end for

16: end if

17: Return null

Step 4b: Finding the lexicographically smallest string : Once we have the minimized automaton, $A$ , for a variable, $S_{i}$ , we find the lexicographically smallest possible string within MaxLength and MinLength for that $S_{i}$ . To find such a string, we use a backtracking approach which traverses the automaton graph in a depth-first manner. At each step we check if (a) the current depth $>=$ MinLength and $<=$ MaxLength, (b) the state is a final state, (c) the current depth is not present in NotEqualLengths. If these conditions are satisfied then we return the string obtained by the traversal. If these conditions are not satisfied then even after reaching the dept of MaxLength, we backtrack. If after traversing the entire graph, we do not find a string that satisfies the conditions then we return a null value. Details are provided in Algorithm 4.

For our example, an automaton is created for B using the constraints on B i.e B $<$ ‘Biology’. The smallest possible value for B is found to be ‘A’. We then add the constraint E $>$ ‘A’ to E and F $>=$ ‘A’ to F and remove B from E.Less and F.LessEqual. Now the remaining variables E and F do not have any dependency on each other and can be solved in any order. We create appropriate automata for both the variables and find suitable values using Algorithm 4. Now in order to satisfy the condition A=E after solving the variables B, E and F we put the value of A the same as the one obtained for E.

Constraints containing “ $<>$ ” and “ $\sim=$ ”

:
We handle conditions of the kind $S_{i}\sim=S_{j}$ and $S_{i}<>S_{j}$ , where both $S_{i}$ and $S_{j}$ are string variables, such that one of $S_{i}$ and $S_{j}$ is unconstrained, i.e., there are no other string constraints constraining the value of one of them. For such cases, we first find an assignment to the constrained variable and then assign a value of other variables that satisfies the $<>$ or $\sim=$ constraint as applicable.

B.2 String Solver Performance

The experiment in this section focuses on the performance of our string solver as compared to other solvers in terms of the time taken to solve string constraints. The experiments for HAMPI HAMPI , Kaluza KAL , CVC4 CVC4 , SUSHI SUSHI and XData string solver were run on a virtual machine with 4GB RAM and a dual core CPU running Ubuntu Linux. For Rex REX we used a virtual machine with the same configuration running Windows 7.

Case	Constraints
Test	Constraints
S1	A like ‘Comp__’
S2	A like ‘Mr%’
S3	A ilike ‘%sr%’
S4	A like ‘Comp%’, A like ‘%Sc’
S5	A like ‘Comp%’, A like ‘_Sc’
S6	A $>$ ‘Bio’
S7	A like ‘%Sc’, A like ‘Life%’, A.length $>$ 6
S8	A $<$ B, B like ‘Bio%’, A like ‘CSE%’
S9	A $<$ B, B like ‘Bio%’, B.length $>$ 4, A like ‘%101’
S10	A $>$ B, A like ‘%pqr%’, B ilike ‘_abc’, C $>=$ B, C = ‘Biology’, A = E, E like ‘%abc%’, F $>=$ B, G like ‘Bio%’

Table 5: String solver test cases

For the first experiment, we study the efficiency of the string solvers in a variety of common cases. The test cases for this experiment are listed in Table 5. We include a mix of satisfiable and unsatisfiable cases. The last 3 test cases contain multiple string variables and can only be solved with our string solver. We include these cases to show that the performance does not drop much even when solving for multiple variables. For solvers other than the XData string solver the expressions in the form of A likeop/relop expr etc. were manually converted to regular expressions of the format recognized by the solvers. The running time does not take into account the conversion. For XData string solver we fed constraints in the same form as in the SQL queries and let XData convert these to regular expressions.

The time taken by different string solvers for this experiment is shown in Table 6. The test cases that cannot be solved by a particular solver⁷⁷7HAMPI currently has a known bug because of which it cannot handle more than one constraints on the same variable in some cases. Test cases 4, 5 and 7 failed because of this. is marked with a “-” and cases that ran for a very long time ( $>$ 20 min) but still did not terminate are marked with a “*”. In terms of time taken, CVC4 and the XData solver turn out to the most efficient ones for these cases, but CVC4 cannot handle comparison among multiple variables. ⁸⁸8We tried to encode string comparison as user defined functions in CVC4 but with these functions the execution did not terminate even after 20 min.

Test	HAMPI	Kaluza	SUSHI	CVC4	Rex	XData
Case	HAMPI	Kaluza	SUSHI	CVC4	Rex	solver
S1	150	706	22	6	124	4
S2	136	706	34	6	140	4
S3	139	708	39	9	140	4
S4	-	2444	175	17	168	15
S5	-	671	160	19	156	14
S6	137	380	54	*	256	4
S7	-	653	-	20	-	11
S8	-	-	-	-	-	23
S9	-	-	-	-	-	11
S10	-	-	-	-	-	30

Table 6: Time taken by string solvers (in ms)

We conducted two experiments to test the scalability of the solvers. Scalability can be measured in terms of length of string that can be successfully solved by the solver or by the number of simultaneous constraints it can handle.

For the second experiment, we use the experimental benchmark from Rex REX to measure the performance as the length of the string required in the output varies. The constraint to be satisfied by the string is that must match intersection of regular expressions
\ $w*[a-c]*a[a-c]\{n+1\}$ \ $w*$
and
\ $w*[a-c]*b[a-c]\{n\}$ \ $w*$
$n$ is a parameter which we varied from $0$ to $1000$ . The results for this experiment are shown in Fig 4.

CVC4 and HAMPI failed to generate any result for any value of $n$ and hence could not be included. KALUZA gave the result as UNSAT (cannot be satisfied) for $n>6$ while SUSHI ran out of memory for $n>13$ . Rex and XData solver were able to successfully generate string till $n=1000$ . In terms of time taken, the XData string solver turned out to be the most efficient for most cases.

For the third experiment, we measure the performance in terms of time taken to solve varying number of constraints. For each $n$ the constraint to be satisfied is that the string must match the intersection of regular expressions \ $w*[a-c]*b[a-c]\{i\}$ \ $w*$ , $\forall i$ , $0\leq i\leq n$ . We varied $n$ from $0$ to $15$ . The results are shown in Fig 5.

Here again CVC4 and HAMPI failed to generate any result any value of $n>0$ and hence could not be included. KALUZA gave UNSAT result for $n>7$ . SUSHI and Rex ran out of memory for $n>9$ and $n>6$ respectively. In this experiment also the XData solver turned out to be the most efficient and did not run out of memory even at $n=15$ .

Appendix C Algorithm To Ensure No Extra Tuples

The presence of additional tuples (created, for example, due to repeated relations or foreign key dependencies) may change the intended result of a query on the generated test dataset. For some cases like constrained aggregation and subqueries the additional tuples may prevent the generation of desired tuples, and the killing of mutations may be affected. To avoid the change in the intended result we add constraints preventing additional tuples from altering the result; details are described in Algorithm 5. We assume for now that the query tree has only joins and selections.

The algorithm takes as input (1) the query tree for which we do not intend to generate any additional tuples, (2) the tuples generated for the query tree and (3) additional selection conditions for correlation conditions or group by attributes equated to a particular value.

The first step of the algorithm is to create a list of relations, along with the join and selection conditions for the given query tree, which we call flattening. To flatten the tree we recursively traverse the tree. For INNER JOIN we add both its left and right children to the flattened tree i.e. the function makes calls flatten(left) and flatten(right) and returns the union of the lists along with the join conditions. For the LEFT OUTER JOIN ‘no extra tuples’ can only be ensured if there is no extra tuple from the left input. We consider only left input for flattening i.e. the function calls flatten(left) and returns the list returned by the function. Similarly for RIGHT OUTER join we consider only the right input for flattening. For a relation, flattening returns the relation with its selection conditions. For example, we flatten ( $R_{1}$ $\mathrel{\hbox{${{}_{\raise 3.0pt\hbox{--}}^{\hbox{--}}}$}\hskip-4.41017pt\Join}_{\theta 1}$ $R_{2}$ ) $\Join_{\theta 2}$ ( $R_{3}$ $\Join_{\theta 3}$ $R_{4}$ ) to $\Join_{\theta 2,\theta 3}$ ( $R_{1},R_{3},R_{4}$ ).

In the subsequent steps we take the join conditions present in the flattened tree and assert constraints to ensure that for every combination of tuples such that at least one tuple is not present in the allowed tuple range, at least one of the selection, additional selection or join conditions in the flattened query is not satisfied.

T

= Query tree

AT

= Map of allowed tuples for each relation

ASel

= Additional selection conditions

0: constraints to ensure no tuple is projected from the subquery

FT

= flattenTree(

T

)

2: Let the join conditions for FT be

J_{1},J_{2}..J_{c}

3: Let the relations for the join be

R_{1},R_{2},..R_{m}

4: constraints

\leftarrow

NOT EXISTS

i_{r_{1}}\in R_{1},...i_{r_{n}}\in R_{n}|

(i_{r_{1}}\notin AT[R_{1}]\vee i_{r_{2}}\notin AT[R_{2}]...\vee i_{r_{m}}\notin AT[R_{m}])

\wedge(selCond(R_{1},i_{r_{1}})\wedge selCond(R_{2},i_{r_{2}})...\wedge selCond(R_{m},i_{r_{n}})

\wedge

J_{1}^{\prime}\wedge J_{2}^{\prime}...\wedge J_{c}^{\prime})

,where

J_{k}^{\prime}

is join condition

J_{k}

applied to the tuples identified by

i_{r_{1}}..i_{r_{n}}

5: return constraints

selCond(R,i)

1: Let the selection conditions on

R

S_{1},S_{2},...S_{n}

2: Let the additional selection conditions on

R

ASel_{R1}

ASel_{R2}

AND …

ASel_{Rn}

3: return

S_{1}[i]

AND

S_{2}[i]

… AND

S_{n}[i]

AND

ASel_{R1}[i]

AND

ASel_{R2}[i]

.. AND

ASel_{Rm}[i]

, where

S_{1}[i]

is the constraint for the selection condition

S_{1}

on the

i^{th}

tuple of

R

Algorithm 5 : genConstraintsForNoExtraTuples

We implement the condition, $i_{r_{1}}\in AT[R_{1}]$ by checking that the primary key value is not equal to the primary key of any tuple in $AT[R_{i}]$ . This is because if a tuple outside the allowed tuples range has the same primary key as a tuple in the allowed tuple range, the tuples are identical.

If the query tree contains GROUP BY attributes and aggregations, we consider the input to these operators for flattening. We currently do not handle flatting conditions in subqueries in this algorithm.

In practice, we unfold the expression in Step 5, to remove the NOT EXISTS quantifier and replace it by conditions for each combination of tuples, in the tuple range, for which we generate data. Such unfolding speeds up constraint solving in CVC3 solver as noted earlier in xdata:icde11 .

Appendix D Completeness

Shah et al. in xdata:icde11 present completeness results for join, comparison operator and aggregation mutations on a limited space of queries. In this section, we consider the completeness of our techniques for the wider class of operators and mutations considered in this paper.

D.1 Types of Result Difference

Our techniques for killing mutations generate differences in the result of a mutated operator, which may be classified into several types:

1.

Tuple Existence: A dataset that results in some tuples being present in the result of the original operator, but not in the result of the mutated operator, or vice versa; the tuples in one result are a superset of the tuples in the other.

Tuple existence differences are easy to propagate and relatively easy to generate and are thus the preferred type of difference for our data generation techniques.

Empty Result Difference: This is a stronger version of tuple existence, where one of the results is empty while the other is non-empty. This is needed in case of exists/not-exists subqueries and a few other cases.
2.

Tuple Count: For some cases like DISTINCT clause it is not feasible to kill mutations by tuple existence and we generate datasets that produce different numbers of tuples (which may be duplicates) for the correct query and the mutation.
3.

Value Difference: For some other cases like mutations between different aggregate functions, the above differences cannot be generated, but we instead generate datasets where the correct operator and the mutated operator produce different values for one or more attributes.

D.2 Approach to Showing Completeness

Our approach to showing completeness for a given class of queries, for a given space of mutations, is as follows. The operators we consider are selections, joins, aggregates, projections, subquery, set and GROUP BY. Our proof is in terms of relational algebra tree. For each possible operator $O_{i}$ , we need to show that:

1.

For each non-equivalent mutation, $o_{i}^{\prime}$ of an occurrence $o_{i}$ of $O_{i}$ , we generate at least one set of constraints that would result in a difference in the result of $o_{i}$ compared to $o_{i}^{\prime}$ . We describe the possible differences in the result shortly.
2.

For each dataset that is not targeted at mutations of an instance $o_{i}$ of $O_{i}$ , the constraints generated for $o_{i}$ “propagate” certain differences generated in an input of $o_{i}$ to the result of $o_{i}$ . Note that the difference in the output of $o_{i}$ may not be the same as the difference in the input of $o_{i}$ . To ensure completeness in general, every difference should get propagated, but in several cases our techniques only propagate some of the differences.
3.

For each dataset, we should add only necessary constraints for data generation and mutation killing. Removing some constraints could result in a dataset that is not able to kill the intended mutations. Adding constraints that are not necessary conditions could make the constraints unsatisfiable, even if a solution actually exists for the dataset. For most operators, we only generate necessary conditions. In some cases, we do not achieve this as discussed in Section D.3.

Our arguments for completeness are based on the set of constraints we create for data generation. Note that we use an SMT solver to solve the constraints and generate a dataset, and SMT solvers are, in general, not complete; however, they have been found to work well in practice.

D.3 Completeness for Operators Considered

The operators we consider dataset generation and mutation killing are as follows

1.

Selection Operators

Killing Mutations: For mutations in the selection predicates such as comparison operator mutations, mutations between conjunctions and disjunctions, string mutations, IS NULL mutations and where clause subquery connective mutations, mutation killing is ensured by tuple existence as described in Sections 2.2, 4, 5 and 7.

It can be seen that only necessary conditions are added to kill the mutations.

Propagating Difference: The same values as input to the selection will be propagated up since we assert the selection condition to be true for the tuples that are input to the selection as described in xdata:icde11 . Hence irrespective of the result difference technique used to kill the mutation below the selection operator, the result difference will be propagated up the selection operator.

We assert only necessary constraints to propagate the mutations.
2.

Join Operators
Killing Mutations:
- •
  
  Join Type Mutations: As discussed in xdata:icde11 (summarized in Section 2.2), mutation of INNER JOIN vs. any outer join is killed using tuple existence. Mutations of LEFT OUTER JOIN vs. RIGHT OUTER JOIN are killed by value difference. Mutations to FULL OUTER JOIN are killed by value difference.
  
  It can be seen from xdata:icde11 that only necessary conditions are asserted to kill the join type mutations.
- •
  
  Missing or Additional Join Conditions: As discussed in Section 9.1, mutations of missing or additional join conditions are killed by tuple existence.
  
  It can be seen from Section 9.1 that only necessary conditions are asserted to kill these mutations.
Propagating Difference: Data generation for joins is done by creating matching tuples for input to the join conditions. Hence, the same values as input to the join will be propagated up. Hence, the result difference below the join will be propagated up for all types of result differences. It can be seen from xdata:icde11 that only necessary conditions are asserted to propagate the differences.
3.

Aggregation Operators
Killing Mutations:
- •
  
  Aggregation Operator Mutations: As discussed in xdata:icde11 (summarized in Section 2.2) aggregation mutations are killed by value difference of result of the aggregate as compared to its mutations.
  
  Only necessary conditions are asserted to kill unconstrained aggregation mutations. For constrained aggregation, as described in Section 6.3.2 we may add constraints that are not necessary for aggregates on join results. For aggregation on a single relation, we assert necessary constraints only.
- •
  
  GROUP BY attribute mutations: In case there is no HAVING clause above the GROUP BY attribute, the mutation of changes GROUP BY attributes is killed by tuple count as described in Section 9.2. In case there is a HAVING clause above the GROUP BY, the mutation is killed using tuple existence at the HAVING clause which is also described in Section 9.2.
  
  Only necessary conditions are asserted for killing these mutations.
Propagating Difference: Not all differences due to mutations below are propagated by aggregation operators.
- •
  
  First consider the aggregates SUM (DISTINCT), AVG (DISTINCT), MIN and MAX. If the mutation below the aggregate is killed by tuple existence then the aggregate produces a zero result for the case where the tuples exists and a non-zero result for the cases where the tuple does not exist (aggregated attributes are asserted to be non-zero). Hence, a mutation that is killed by tuple existence below will result in value difference at the aggregate. For mutations below that are killed tuple count or by value difference, the aggregate may produce the same result and the mutation might not get killed. Mutations below killed by value difference will produce a value difference at the aggregate if the value difference at the mutated node is a NULL vs. NOT NULL difference.
- •
  
  Now consider the aggregates COUNT and COUNT DISTINCT. Mutations killed by tuple existence will be killed by COUNT or COUNT DISTINCT. Mutations killed by tuple count will produce a value difference for COUNT and hence the difference will be propagated. For COUNT DISTINCT, mutations killed by tuple count may produce the same values and hence may not get propagated. Mutations below killed by value difference will produce a value difference in COUNT or COUNT DISTINCT only if the value difference at the mutated node is a NULL vs. NOT NULL difference.
For unconstrained aggregation, no constraints are added for data generation. The constraints added for mutation killing are necessary as described in Section 2.2. For constrained aggregation, as described in Section 6.3.2 we may add constraints that are not necessary. For aggregation on a single relation, we do not face this issue and hence only necessary constraints are added.
4.

Projection Operator (non-duplicate removing)

Killing Mutations: Currently we do not target mutations in projections. Mutations due to adding or removing attributes would get caught if present at the top of the query tree. We currently do not generate any datasets to catch projection mutations. Killing projection mutations could be done by asserting attributes in the projection list have different values wherever possible, an area of future work.

Propagating Difference: A mutation below that is killed by tuple existence or by tuple count will be preserved after projection. A value difference will be propagated up only if the attribute whose value is different is present in the projected attributes.

We do not add any constraints for projection and hence trivially only necessary constraints are added.
5.

DISTINCT operator

Killing Mutations: DISTINCT clause mutation is killed by tuple count as shown in Section 9.3.

Only necessary constraints are asserted to kill DISTINCT clause mutations as shown in Section 9.3.

Propagating Difference: If a mutation below the distinct clause is killed by tuple existence or by value difference, then the DISTINCT clause will also preserve the respective property. However if the mutation below the DISTINCT clause is killed by tuple count the DISTINCT clause may not be able to preserve the difference.

We do not add any constraints for the DISTINCT clause and hence trivially only necessary constraints are added.
6.

Subquery Operator

Killing Mutations: Mutations of the subquery connectives (EXISTS, NOT EXISTS, IN, NOT IN, ALL, ANY and scalar subqueries) in the WHERE clause comes under selection mutation and is discussed earlier in the bullet on selection mutation. We currently only handle scalar subqueries of the form SSQ relop attr/value, where SSQ is a scalar subquery, attr is an attribute from the outer block of query and value is a constant.

For subqueries other than scalar subqueries with aggregation, only necessary constraints for killing mutations are asserted. Scalar subqueries with aggregation use constrained aggregation techniques and hence constraints that are not necessary may also be added if the subquery contains more than one relation.

Propagating Difference: For mutations of operators in the subquery, the subquery connective preserves tuple existence by ensuring empty result difference. Other types of mutation killing may not be propagated up; these mutations are equivalent in many but not all cases. Refer Section 7.4 for details.

Only necessary conditions for propagating differences are asserted as can be seen from Section 7.4.
7.

Set Operators

Killing Mutations: Mutations killing for UNION vs UNION ALL, INTERSECT vs. INTERSECT ALL and EXCEPT vs. EXCEPT ALL is done by tuple count. Other mutations are killed by tuple existence as shown in Section 8.2.

Only necessary constraints required for killing set operator mutations are asserted as can be seen from Section 8.2.

Propagating Difference: As explained in Section 8.3 mutations below the set operator are propagated up for all mutations.

Only necessary conditions for propagating differences are asserted as can be seen from Section 8.3.

D.4 Summary

Our data generation techniques are complete for killing a given mutation on a given operator of a given query tree if

1.

The constraint generation technique creates a difference at the mutated operator
2.

The difference at the mutated node is propagated up the query tree to the root i.e. each ancestor node propagates the difference type generated by its child on the path from the mutated node.
3.

Only necessary constraints are added for data generation.

If the above properties are satisfied for all operators in the query and all mutations of the operator in a space of mutations, then our data generation techniques are complete for the query under the space of mutations considered.

Although not complete, in practice our data generation techniques work well. Our experimental results in Section 13 show that we are able to generate test data and kill mutations for a large variety of common queries.

Appendix E Test Cases and Results for Experiments

In this section, we list results of the grading tool and the test cases that were used for the experiments described in Section 13.

E.1 Grading Tool Results

Result of the grading tool experiment is listed in Table 7.

QId	Que-	XData	USm	ULg	TA	Plan
QId	ries	$\times$	$\times$	$\times$	$\times$	$\surd$	$?$
CQ1	55	2	2	2	2	51	4
CQ2	57	1	1	1	1	54	3
CQ3	71	13	12	1	1	3	68
CQ4	78	26	26	3	1	52	26
CQ5	72	23	11	16	13	43	29
CQ6	61	6	6	6	2	55	6
CQ7	77	25	23	3	24	3	74
CQ8	79	33	12	14	16	2	77
CQ9a	80	68	24	70	23	2	78
CQ9b	80	71	24	70	23	3	77
CQ9	80	72	24	70	23	5	75
CQ10	74	1	1	1	0	34	40
CQ11	69	16	16	16	16	51	18
CQ12	70	8	3	7	7	38	32
CQ13	72	9	9	9	7	3	69
CQ14	67	34	14	10	32	2	65

Table 7: Query grading results

The column labeled Queries lists the number of student queries that were submitted. Columns labeled XData, USm, ULg and TA show the number of incorrect queries caught by these techniques. Plan gives the number of queries labeled as correct and ones for which the plan is not able to determine correctness. Wherever our technique and/or some of the datasets find more incorrect queries than others, we have highlighted the results in bold.

E.2 Test Queries for Constrained Aggregation

For the experiment involving constrained aggregation, we used the following set of queries:

CA1:
SELECT c.dept_name, SUM(c.credits) FROM course c INNER JOIN department d ON (c.dept_name = d.dept_name) GROUP BY c.dept_name HAVING SUM(c.credits) $>$ 10 AND COUNT(c.credits)>1
CA2:
SELECT c.dept_name, SUM(i.salary) FROM course c INNER JOIN department d ON (c.dept_name = d.dept_name) INNER JOIN instructor i ON (d.dept_name = i.dept_name) GROUP BY c.dept_name HAVING SUM(i.salary) $>$ 100000 AND MAX(i.salary) $<$ 75000
CA3:
SELECT c.dept_name, SUM(d.budget) FROM course c INNER JOIN department d ON (c.dept_name = d.dept_name) INNER JOIN teaches t ON (c.course_id = t.course_id) GROUP BY c.dept_name HAVING SUM(d.budget) $>$ 100000 AND COUNT(d.budget) $>$ 1
CA4:
SELECT c.dept_name, AVG(i.salary) FROM course c INNER JOIN department d ON (c.dept_name = d.dept_name) INNER JOIN teaches t ON (c.course_id = t.course_id) INNER JOIN instructor i ON (d.dept_name = i.dept_name) GROUP BY c.dept_name HAVING AVG(i.salary) $>$ 50000 AND COUNT(i.salary)=3
CA5:
SELECT t.semester, SUM(c.credits) FROM department d INNER JOIN teaches t ON (d.budget = t.year + 4) INNER JOIN course c ON (c.dept_name = d.dept_name) GROUP BY t.semester HAVING AVG(c.credits) $>$ 2 AND COUNT(d.building)=2
CA6:
SELECT id FROM course NATURAL JOIN department NATURAL JOIN student NATURAL JOIN takes NATURAL JOIN section GROUP BY id,dept_name HAVING COUNT(dept_name) $>$ 1
CA7:
SELECT distinct dept_name FROM course WHERE credits = (SELECT MAX(credits) FROM course NATURAL JOIN department WHERE title=‘CS’ GROUP BY dept_name HAVING COUNT(course_id) $>$ 2)
CA8:
SELECT id,name FROM (SELECT id,time_slot_id,year,semester FROM takes NATURAL JOIN section GROUP BY id,time_slot_id,year,semester HAVING COUNT(time_slot_id) $>$ 1) as s NATURAL JOIN student GROUP BY id, name HAVING COUNT(id) $>$ 1
CA9:
SELECT SUM(T) as su FROM (SELECT year as T FROM teaches NATURAL JOIN instructor GROUP BY year, course_id HAVING COUNT(id) $>$ 4) as temp GROUP BY T

E.3 Test Queries for Subquery

For the experiment involving subqueries, we used the following set of queries:

SQ1:
SELECT * FROM department d WHERE d.dept_name IN (SELECT c.dept_name FROM course c WHERE c.credits $>$ 2)
SQ2:
SELECT * FROM course c WHERE EXISTS (SELECT * FROM department d WHERE c.dept_name = d.dept_name)
SQ3:
SELECT * FROM takes t WHERE NOT EXISTS (SELECT * FROM section WHERE t.year=section.year AND year = 2010)
SQ4:
SELECT * FROM course c WHERE credits $>$ 3 AND EXISTS (SELECT * FROM department d WHERE d.dept_name = c.dept_name)
SQ5:
SELECT course_id, title FROM course NATURAL JOIN section WHERE SEMESTER = ‘Spring’ AND year = 2010 AND course_id IN (SELECT course_id FROM prereq WHERE prereq_id = ‘CS-201’)
SQ6:
SELECT course_id, TITLE FROM course NATURAL JOIN section WHERE SEMESTER = ‘Spring’ AND year = 2010 AND course_id NOT IN (SELECT course_id FROM prereq WHERE prereq_id = ‘CS-201’)
SQ7:
SELECT name FROM instructor WHERE salary $>$ ALL (SELECT salary FROM instructor WHERE dept_name = ‘Biology’)
SQ8:
SELECT name FROM instructor WHERE salary $>$ (SELECT AVG(salary) FROM instructor WHERE dept_name = ‘Physics’)
SQ9:
SELECT * FROM student WHERE tot_cred $>$ (SELECT SUM(credits) FROM takes INNER JOIN course USING(course_id) WHERE student.id=takes.id)
SQ10:
SELECT * FROM student WHERE tot_cred $<$ ALL (SELECT SUM(credits) FROM takes INNER JOIN course USING(course_id) WHERE dept_name=‘History’)

E.4 Correct Queries for Grading Tool

Following are the correct queries that were used in the experiment to grade student queries:

CQ1:
SELECT course_id, title FROM course
CQ2:
SELECT course_id, title FROM course WHERE dept_name $=$ ‘Comp. Sci.’
CQ3:
SELECT DISTINCT course_id, title, id FROM course NATURAL JOIN teaches WHERE teaches.semester $=$ ‘Spring’ AND teaches.year $=$ ‘2010’
CQ4:
SELECT DISTINCT student.id, student.name FROM takes NATURAL JOIN student WHERE course_id $=$ ‘CS-101’
CQ5:
SELECT DISTINCT course.dept_name FROM course NATURAL JOIN section WHERE section.semester $=$ ‘Spring’ AND section.year $=$ ‘2010’
CQ6:
SELECT course_id, title FROM course WHERE credits $>$ 3
CQ7:
SELECT course_id, COUNT(DISTINCT id) FROM course NATURAL LEFT OUTER JOIN takes GROUP BY course_id
CQ8:
SELECT DISTINCT course_id, title FROM course NATURAL JOIN section WHERE semester $=$ ‘Spring’ AND year $=$ 2010 AND course_id NOT IN (SELECT course_id FROM prereq)
CQ9:
1. a)
  WITH s as (SELECT id, time_slot_id, year, semester FROM takes NATURAL JOIN section GROUP BY id, time_slot_id, year, semester HAVING COUNT(time_slot_id) $>$ 1) SELECT DISTINCT id,name FROM s NATURAL JOIN student
2. b)
  SELECT DISTINCT A.id, A.name FROM (SELECT * FROM student NATURAL JOIN takes NATURAL JOIN section) A, (SELECT * from student NATURAL JOIN takes NATURAL JOIN section) B WHERE A.name = B.name AND A.year = B.year AND A.course_id $<>$ B.course_id AND A.semester = B.semester AND A.time_slot_id = B.time_slot_id
CQ10:
SELECT DISTINCT dept_name FROM course WHERE credits $=$ (SELECT MAX(credits) FROM course)
CQ11:
SELECT DISTINCT instructor.id, name, course_id FROM instructor LEFT OUTER JOIN TEACHES ON instructor.id $=$ teaches.id
CQ12:
SELECT student.id, student.name FROM student WHERE lower(student.name) like ‘%sr%’
CQ13:
SELECT id,name FROM student s WHERE NOT EXISTS (SELECT * FROM student t NATURAL JOIN takes WHERE s.id=t.id AND takes.year=2010 AND takes.semester=‘Spring’)
CQ14:
SELECT DISTINCT * FROM takes t WHERE (NOT EXISTS (SELECT id,course_id FROM takes s WHERE grade $!=$ ‘F’ AND t.id $=$ s.id AND t.course_id $=$ s.course_id) AND t.grade IS NOT NULL) OR (t.grade $!=$ ‘F’ AND t.grade IS NOT NULL)