Extending two-variable logic on data trees with order on data values and its automata
Abstract
Data trees are trees in which each node, besides carrying a label from a finite alphabet, also carries a data value from an infinite domain. They have been used as an abstraction model for reasoning tasks on XML and verification. However, most existing approaches consider the case where only equality tests can be performed on the data values.
In this paper we study data trees in which the data values come from a linearly ordered domain, and in addition to equality test, we can test whether the data value in a node is greater than the one in another node. We introduce an automata model for them which we call ordered-data tree automata (ODTA), provide its logical characterisation, and prove that its non-emptiness problem is decidable in 3-NExpTime. We also show that the two-variable logic on unranked data trees, studied by Bojanczyk, Muscholl, Schwentick and Segoufin in 2009, corresponds precisely to a special subclass of this automata model.
Then we define a slightly weaker version of ODTA, which we call weak ODTA, and provide its logical characterisation. The complexity of the non-emptiness problem drops to NP. However, a number of existing formalisms and models studied in the literature can already be captured by weak ODTA. We also show that the definition of ODTA can be easily modified to the case where the data values come from a tree-like partially ordered domain, such as strings.
Categories: F.1.1 Models of Computation: Automata; F.4.1 Mathematical Logic: Computational logic
Keywords: Finite-state automata, Two-variable logic, Data trees, Ordered data values
The extended abstract of this article has been published in the proceedings of LICS 2012, under the title: “An Automata Model for Trees with Ordered Data Values.” The author is FWO Pegasus Marie Curie Fellow.
1 Introduction
Classical automata theory studies words and trees over finite alphabets. Recently there has been a growing interest in the so-called “data” words and trees, that is, words and trees in which each position, besides carrying a label from a finite alphabet, also carries a data value from an infinite domain.
Interest in such structures with data stems from their connection to XML [Alon et al. (2003), Arenas et al. (2008), Björklund et al. (2008), David et al. (2012), Fan and Libkin (2002), Figueira (2009), Neven (2002)], as well as system specifications [Bouyer et al. (2001), Demri et al. (2007), Segoufin and Torunczyk (2011)], where many properties simply cannot be captured by finite alphabets. This has motivated various works on data words [Benedikt et al. (2010), Bojanczyk et al. (11a), Demri and Lazić (2009), Grumberg et al. (2010), Kaminski and Francez (1994), Neven et al. (2004)], as well as on data trees [Björklund and Bojanczyk (2007), Bojanczyk et al. (2009), Figueira (2012a), Figueira and Segoufin (2011), Jurdzinski and Lazic (2011)]. The common feature of these works is the addition of an equality test on the data values to the logic on trees. While for finitely-labeled trees many logical formalisms (e.g., the monadic second-order logic MSO) are decidable by converting formulae to automata, even FO (first-order logic) on data words extended with data-equality is already undecidable. See, e.g., [Bojanczyk et al. (11a), Fan and Libkin (2002), Neven et al. (2004)].
Thus, there is a need for expressive enough, while computationally well-behaved, frameworks to reason about structures with data values. This has been quite a common theme in XML and system specification research. It has largely followed two routes. The first takes a specific reasoning task, or a set of similar tasks, and builds algorithms for them (see, e.g., [Arenas et al. (2008), Figueira (2011), Björklund et al. (2008), Schwentick (2004), Fan and Libkin (2002), Figueira (2009)]). The second looks for sufficiently general automata models that can express reasoning tasks of interest, but are still decidable (see, e.g., [Demri and Lazić (2009), Bojanczyk et al. (2009), Jurdzinski and Lazic (2011), Segoufin and Torunczyk (2011)]).
Both approaches usually assume that data values come from an abstract set equipped only with the equality predicate. This is already sufficient to capture a wide range of interesting applications both in databases and verification. However, it has been advocated in [Deutsch et al. (2009)] that comparisons based on a linear order over the data values could be useful in many scenarios, including data centric applications built on top of a database.
So far, little work has been done in this direction. The few existing works, such as [Manuel (2010), Figueira (2011), Schwentick and Zeume (2010), Segoufin and Torunczyk (2011)], are on words, whereas most applications require trees. Moreover, these works are incomparable to some interesting existing formalisms [Fan and Libkin (2002), Bojanczyk et al. (2009), Arenas et al. (2008), David et al. (2012), Jurdzinski and Lazic (2011), Demri and Lazić (2009), Lazić (2011)] known to be able to capture various interesting scenarios common in practice. On top of that, many useful techniques, notably those introduced in [Fan and Libkin (2002), Bojanczyk et al. (11a), Bojanczyk et al. (2009), Jurdzinski and Lazic (2011)], can deal only with data equality, and are highly dependent on specific combinatorial properties of the formalisms. They are rather hard to adapt to other more specific tasks, let alone to generalise to more relations on data values, and they tend to produce extremely high complexity bounds, such as non-primitive-recursive, or at least as hard as the reachability problem in Petri nets. Furthermore, many known decidability results are lost as soon as we add the order relation on data values. Some exceptions are [Figueira et al. (2010), Figueira (2012a)].
In this paper we study the notion of data trees in which the data values come from a linearly ordered domain, which we call ordered-data trees. In addition to equality tests on the data values, in ordered-data trees we are allowed to test whether the data value in a node is greater than the data value in another node. To the extent it is possible, we aim to unify various ad hoc methods introduced to reason about data trees, and generalise them to ordered-data trees to make them more accessible and applicable in practice. This paper is the first step, where we introduce an automata model for ordered-data trees, provide its logical characterisation, and prove that it has a decidable non-emptiness problem. Moreover, we also show that it can capture various well-known formalisms.
Brief description of the results in this paper
The trees that we consider are unranked trees, where there is no a priori bound on the number of children of a node. Moreover, we also have an order on the children of each node. We consider a natural logic for ordered-data trees, which consists of the following relations.
• The parent relation, which holds for a pair of nodes when the first node is the parent of the second.
• The next-sibling relation, which holds for a pair of nodes when they have the same parent and the second node is the next sibling of the first.
• The labeling predicates, one for each symbol of the alphabet, which hold for a node when it is labeled with that symbol.
• The data equality predicate, which holds for a pair of nodes when they have the same data value.
• The order relation on data, which holds for a pair of nodes when the data value in the first node is less than the one in the second.
• The successive order relation on data, which holds for a pair of nodes when the data value in the second node is the minimal data value in the tree greater than the one in the first.
We introduce an automata model for ordered-data trees, which we call ordered-data tree automata (ODTA), and provide its logical characterisation. Namely, we prove that the class of languages accepted by ODTA corresponds precisely to those expressible by formulas of the form:
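$$\exists X_1 \cdots \exists X_m \ \varphi \wedge \psi \qquad (1)$$
where
• $X_1, \ldots, X_m$ are monadic second-order predicates;
• $\varphi$ is an FO formula restricted to two variables and using only the parent relation, the next-sibling relation and the data equality predicate, as well as the unary labeling predicates and $X_1, \ldots, X_m$;
• $\psi$ is an FO formula using only the data equality predicate, the order relation on data and the successive order relation on data, as well as the unary labeling predicates and $X_1, \ldots, X_m$.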
We show that the logic , first studied in [Bojanczyk et al. (2009)], corresponds precisely to a special subclass of ODTA, where denotes the set of formulas of the form (1) in which is a true formula. We then prove that the non-emptiness problem of ODTA is decidable in 3-NExpTime. Our main idea here is to show how to convert the ordered-data trees back to a string over finite alphabets. (See our notion of string representation of data values in Section 3.) Such conversion enables us to use the classical finite state automata to reason about data values.
Then we define a slightly weaker version of ODTA, which we call weak ODTA. Essentially the only feature of ODTA missing in weak ODTA is the ability to test whether two adjacent nodes have the same data value. Without this simple feature, the complexity of the non-emptiness problem drops, surprisingly, by three exponentials to NP. We provide its logical characterisation by showing that it corresponds precisely to the languages expressible by the formulas of the form (1) where the two-variable formula does not use the data equality predicate. We show that a number of existing formalisms and models can already be captured by weak ODTA, namely those in [Fan and Libkin (2002), David et al. (2012), Manuel (2010)].
We should remark that [David et al. (2012)] studies a formalism which consists of tree automata and a collection of set and linear constraints (we will later define formally what set and linear constraints are). It is shown that the satisfiability problem of such formalism is NP-complete. In fact, it is also shown in [David et al. (2012)] that a single set constraint (without tree automaton and linear constraint) already yields NP-hardness. Weak ODTA are essentially equivalent to the formalism in [David et al. (2012)] extended with the full expressive power of the first-order logic . It is worth noting that despite such an extension, the non-emptiness problem remains in NP.
Finally we also show that the definition of ODTA can be easily modified to the case where the data values come from a partially ordered domain, such as strings. This work can be seen as a generalisation of the works in [David et al. (2010)] and [Kara et al. (2012)]. However, it must be noted that [David et al. (2010), Kara et al. (2012)] deal only with data words, where only equality test is allowed on the data values and there is no order on them.
Related works
Most of the existing works in this area are on data words. In the paper [Bojanczyk et al. (11a)] the model data automata was introduced, and it was shown that it captures the logic , the fragment of existential monadic second order logic in which the first order part uses only two variables and the predicates: the data equality , as well as the order and the successor on the domain.
An important feature of data automata is that their non-emptiness problem is decidable, even for infinite words, but is at least as hard as reachability for Petri nets. It was also shown that the satisfiability problem for the three-variable first order logic is undecidable. Later in [David et al. (2010)] an alternative proof was given for the decidability of the weaker logic . The proof gives a decision procedure with an elementary upper bound for the satisfiability problem of on strings. Recently in [Kara et al. (2012)] an automata model that captures precisely the logic , both on finite and infinite words, is proposed.
Another logical approach is via the so called linear temporal logic with freeze quantifier, introduced in [Demri and Lazić (2009)]. Intuitively, these are LTL formulas equipped with a finite number of registers to store the data values. We denote by , the LTL with freeze quantifier, where denotes the number of registers and the only temporal operators allowed are the neXt operator X and the Until operator U. It was shown that alternating register automata with registers () accept all LTL languages and the non-emptiness problem for alternating is decidable. However, the complexity is non primitive recursive. Hence, the satisfiability problem for LTL is decidable as well. Adding one more register or past time operator to LTL makes the satisfiability problem undecidable. In [Figueira et al. (2010), Figueira (2012a)] it is shown that alternating can be extended to strings with linearly ordered data values, and the emptiness problem is still decidable. In [Lazić (2011)] a weaker version of alternating , called safety alternating , is considered, and the non-emptiness problem is shown to be EXPSPACE-complete.
A model for data words with linearly ordered data values was proposed in [Segoufin and Torunczyk (2011)]. The model consists of an automaton equipped with a finite number of registers, and its transitions are based on constraints on the data values stored in the registers. It is shown that the non-emptiness problem for this model is decidable in PSPACE. However, no logical characterisation is provided for such model.
In [Bojanczyk et al. (11b)] another type of register automata for words was introduced and studied, which is a generalisation of the original register automata introduced by Kaminski and Francez [Kaminski and Francez (1994)], where the data values also can come from a linearly ordered domain. Thus, the order comparison, not just equality, can be performed on data values. The generalisation is done via the notion of monoid for data words, and is incomparable with our model here. In the terminology of the original register automata defined in [Kaminski and Francez (1994)], it is simply register automata extended with testing whether the data value currently read is bigger/smaller than those in the registers.
It is shown in [Manuel (2010)] that the satisfiability problem for over text is decidable. A text is simply a data word in which all the data values are different and they range over the positive integers from to , for some . We will see later that the satisfiability problem for can be reduced to the non-emptiness problem of our model.
In [Schwentick and Zeume (2010)] it is shown that the satisfiability problem of the logic on words is decidable. This logic is incomparable with our model. However, it should be noted that cannot capture the whole class of regular languages.
The work on data trees that we are aware of is in [Bojanczyk et al. (2009), Jurdzinski and Lazic (2011)]. In [Bojanczyk et al. (2009)] it was shown that the satisfiability problem for the logic over unranked trees is decidable in 3-NExpTime. However, no automata model is provided. We will see later how this logic corresponds precisely to a special subclass of ODTA.
In [Jurdzinski and Lazic (2011)] alternating tree register automata were introduced for trees. They are essentially the generalisation of the alternating to the tree case. It was shown that this model captures the forward XPath queries. However, no logical characterisation is provided and the non-emptiness problem, though decidable, is non primitive recursive.
As mentioned earlier, the main idea in this paper is the conversion of the data values from an infinite domain back to a string over a finite alphabet. Roughly speaking, it works as follows. Given an ordered-data tree, we show how to construct a string over a finite alphabet whose domain corresponds precisely to the data values in the tree. We then use a classical finite state automaton to reason about this string, and thus also about the data values in the tree. This idea is the main difference between our paper and the existing works. Most of the existing techniques rely on some specific combinatorial properties of the formalisms considered, which make them highly independent of one another. As we will see later, our model captures quite a few other formalisms without a significant jump in complexity.
Organisation
This paper is organised as follows. In Section 2 we give some preliminary background. In Section 3 we formally define the logic for ordered-data trees and present a few examples as well as notations that we need in this paper. In Section 4 we present two lemmas that we are going to need later on. We prove them in a quite general setting, as we think they are interesting in their own right. We introduce the ordered-data tree automata (ODTA) in Section 5 and weak ODTA in Section 6. In Section 7 we discuss a couple of undecidable extensions of weak ODTA. In Section 8 we describe how to modify the definition of ODTA when the data values are strings, that is, when they come from a partially ordered domain. Finally, we conclude in Section 9.
2 Preliminaries
In this section we review some definitions that we are going to use later on. We usually use $\Gamma$ and $\Sigma$ to denote finite alphabets. We write $2^{\Sigma}$ to denote an alphabet in which each symbol corresponds to a subset of $\Sigma$. In some cases, we may need the alphabet $2^{2^{\Sigma}}$, an alphabet in which each symbol corresponds to a set of subsets of $\Sigma$. We denote the set of natural numbers by $\mathbb{N}$.
Usually we write to denote a language, for both string and tree languages. When it is clear from the context, we use the term language to mean either a string language, or a tree language.
2.1 Finite state automata over strings and commutative regular languages
We usually write to denote a finite state automaton on strings. The language accepted by the automaton is denoted by .
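Let $\Sigma = \{a_1, \ldots, a_k\}$. For a word $w \in \Sigma^*$, the Parikh image of $w$ is $\mathsf{Parikh}(w) = (n_1, \ldots, n_k)$, where $n_i$ is the number of appearances of $a_i$ in $w$. For a vector $\bar{n} \in \mathbb{N}^k$, the inverse of the Parikh image of $\bar{n}$ is $\mathsf{Parikh}^{-1}(\bar{n}) = \{ w \mid \mathsf{Parikh}(w) = \bar{n} \}$.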
For , a vector is called an -base, if and , for all . A language is periodic, if there exist vectors such that and each is an -base and
We denote such language by .
A language $L$ is commutative if it is closed under reordering. That is, if $w = b_1 \cdots b_n \in L$, and $\sigma$ is a permutation on $\{1, \ldots, n\}$, then $b_{\sigma(1)} \cdots b_{\sigma(n)} \in L$.
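The two notions above can be illustrated on small finite examples. The following is only a sketch: the function names `parikh` and `is_commutative` are ours, words are encoded as Python strings, and the commutativity check is a brute-force test on a finite set of words, whereas the languages of interest in this paper are in general infinite.

```python
from collections import Counter
from itertools import permutations


def parikh(word, alphabet):
    """Return the Parikh image of `word`: one count per symbol of `alphabet`."""
    counts = Counter(word)
    return tuple(counts[a] for a in alphabet)


def is_commutative(language):
    """Check that a finite set of words is closed under reordering of letters.

    Brute force over all permutations, so only suitable for tiny examples.
    """
    return all("".join(p) in language
               for w in language
               for p in permutations(w))


if __name__ == "__main__":
    print(parikh("abba", "ab"))
    print(is_commutative({"ab", "ba"}))
```

A commutative language is determined by the Parikh images of its words, which is why periodic sets of vectors suffice to describe it.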
The following result is a kind of folklore and can be proved easily.
Theorem 2.1.
A language is commutative and regular if and only if it is a finite union of periodic languages.
2.2 Unranked trees, tree automata and transducers
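An unranked finite tree domain is a prefix-closed finite subset $D$ of $\mathbb{N}^*$ (words over $\mathbb{N}$) such that $u \cdot i \in D$ implies $u \cdot j \in D$ for all $j < i$ and $u \in \mathbb{N}^*$. Given a finite labeling alphabet $\Sigma$, a $\Sigma$-labeled unranked tree is a structure
$$t = \langle D, E_{\downarrow}, E_{\rightarrow}, \{a(\cdot)\}_{a \in \Sigma} \rangle,$$
where
• $D$ is an unranked tree domain,
• $E_{\downarrow}$ is the child relation: $(u, u \cdot i) \in E_{\downarrow}$ for all $u \cdot i \in D$,
• $E_{\rightarrow}$ is the next-sibling relation: $(u \cdot i, u \cdot (i+1)) \in E_{\rightarrow}$ for all $u \cdot (i+1) \in D$, and
• the $a(\cdot)$’s are labeling predicates, i.e. for each node $u$, exactly one of $a(u)$, with $a \in \Sigma$, is true.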
We write to denote the domain . The label of a node in is denoted by . If , then we say that is an -node.
An unranked tree automaton [Comon et al. (2007), Thatcher (1967)] over -labeled trees is a tuple , where is a finite set of states, is the set of final states, and is a transition function; we require ’s to be regular languages over for all and .
A run of over a tree is a function such that for each node with children , the word is in the language . For a leaf labeled , this means that could be assigned a state if and only if the empty word is in . A run is accepting if , i.e., if the root is assigned a final state. A tree is accepted by if there exists an accepting run of on . The set of all trees accepted by is denoted by .
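The definition of a run can be sketched as a bottom-up computation of the states that some run may assign to each node. The encoding below is an assumption made for illustration only: a tree is a pair `(label, children)`, states are single characters, and the regular language attached to a state and a label is given as a Python regular expression over state names. The example automaton `DELTA` is hypothetical; it accepts the trees over the labels `a` and `b` that contain at least one `b`-node.

```python
import re
from itertools import product


def reachable_states(tree, states, delta):
    """Return the set of states that some run can assign to the root of `tree`."""
    label, children = tree
    child_sets = [reachable_states(c, states, delta) for c in children]
    result = set()
    for q in states:
        pattern = delta.get((q, label))
        if pattern is None:
            continue
        # Try every way of picking one reachable state per child; for a leaf
        # this tests whether the empty word belongs to the language.
        for choice in product(*child_sets):
            if re.fullmatch(pattern, "".join(choice)):
                result.add(q)
                break
    return result


def accepts(tree, states, final, delta):
    """A tree is accepted if some run assigns a final state to the root."""
    return bool(reachable_states(tree, states, delta) & set(final))


# Hypothetical automaton: state "1" means "this subtree contains a b-node".
STATES = ["0", "1"]
FINAL = ["1"]
DELTA = {
    ("0", "a"): "0*",
    ("1", "a"): "[01]*1[01]*",
    ("1", "b"): "[01]*",
}

if __name__ == "__main__":
    print(accepts(("a", [("a", []), ("b", [])]), STATES, FINAL, DELTA))
```

Enumerating all choices of child states is exponential in the number of children; it is used here only because it follows the definition literally.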
An unranked tree (letter-to-letter) transducer with the input alphabet and output alphabet is a tuple , where is a tree automaton with the set of states , and is an output relation. We call such a transducer from to .
Let be a -labeled tree, and a -labeled tree such that . We say that a tree is an output of on , if there is an accepting run of on and for each , it holds that . We call an identity transducer, if for all . We will often view an automaton as an identity transducer.
2.3 Automata with Presburger constraints (APC)
An automaton with Presburger constraints (APC) is a tuple , where is an unranked tree automaton with states and is an existential Presburger formula with free variables . A tree is accepted by , denoted by , if there is an accepting run of on such that is true, where is the number of appearances of in .
Theorem 2.2.
[Seidl et al. (2004), Verma et al. (2005)] The non-emptiness problem for APC is decidable in NP.
It is worth noting also that the class of languages accepted by APC is closed under union and intersection.
Oftentimes, instead of counting the number of states in the accepting run, we need to count the number of occurrences of alphabet symbols in the tree. Since we can easily embed the alphabet symbols inside the states, we always assume that the Presburger formula has the free variables ’s to denote the number of appearances of the symbol in the tree.
As in the word case, we let denote the Parikh image of the tree . We will need the following proposition.
Proposition 2.3.
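[Seidl et al. (2004), Verma et al. (2005)] Given an unranked tree automaton $\mathcal{A}$, one can construct, in polynomial time, an existential Presburger formula $\varphi_{\mathcal{A}}$ such that
• for every tree $t$ accepted by $\mathcal{A}$, the formula $\varphi_{\mathcal{A}}$ holds on the Parikh image of $t$;
• for every vector $\bar{n}$ on which $\varphi_{\mathcal{A}}$ holds, there exists a tree $t$ accepted by $\mathcal{A}$ whose Parikh image is $\bar{n}$.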
3 Ordered-data trees and Their Logic
An ordered-data tree over a finite alphabet is a tree in which each node, besides carrying a label from that alphabet, also carries a data value from the natural numbers. (Here we use the natural numbers as data values just to be concrete. The results in our paper apply trivially to any linearly ordered domain.)
Let be an ordered-data tree over and . We write to denote the data value in the node . The set of all data values in the -nodes in is denoted by . That is, . We write to denote the set of data values found in the tree . We also write to denote the number of -nodes in .
The profile of a node is a triplet , where and indicate that the node has the same data value and different data value as its left sibling, respectively; indicates that does not have a left sibling. Similarly, , , and have the same meaning in relation to the parent of the node , while , , and means the same in relation to the right sibling of the node . For an ordered-data tree over , the profile tree of , denoted by , is a tree over obtained by augmenting to each node of its profile.
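The profile of a node can be computed directly from the data values of its left sibling, parent and right sibling. The sketch below rests on assumptions of our own: a tree is given as a dictionary from positions (tuples of child indices, with the root being the empty tuple) to data values, and the three possible entries of a profile are written as the strings `"same"`, `"diff"` and `"none"`, which are placeholder names rather than the symbols used in this paper.

```python
def profile(data, u):
    """Return the profile of node `u` as (left sibling, parent, right sibling).

    `data` maps positions to data values. The parent of a position u is
    u[:-1], and its siblings differ from u only in the last index.
    """
    def compare(v):
        # "none" when the neighbour does not exist in the tree.
        if v is None or v not in data:
            return "none"
        return "same" if data[v] == data[u] else "diff"

    if not u:  # the root has no siblings and no parent
        return ("none", "none", "none")
    left = u[:-1] + (u[-1] - 1,) if u[-1] > 0 else None
    right = u[:-1] + (u[-1] + 1,)
    return (compare(left), compare(u[:-1]), compare(right))


if __name__ == "__main__":
    example = {(): 5, (0,): 5, (1,): 7, (2,): 7}
    for node in example:
        print(node, profile(example, node))
```

The profile tree is then obtained by attaching this triplet to the label of every node.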
We write to denote the projection of the ordered-data tree , that is, is without the data values. When we say that an ordered-data tree is accepted by an automaton , we mean that is accepted by . An ordered-data tree is an output of a transducer on an ordered-data tree , if is an output of on , and for all , we have .
Figure 1 shows an example of an ordered-data tree over the alphabet with its profile tree. The notation means that the node is labeled with and has data value .
3.1 String representations of data values
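Let $t$ be an ordered-data tree over $\Sigma$, and for each $a \in \Sigma$ let $V_t(a)$ denote the set of data values found in the $a$-nodes of $t$. For a set $S \subseteq \Sigma$, let
$$[S]_t = \bigcap_{a \in S} V_t(a) \;\setminus\; \bigcup_{b \notin S} V_t(b).$$
That is, $[S]_t$ is the set of data values that are found in $a$-positions for all $a \in S$ but are not found in any $b$-position for $b \notin S$. Note that the sets $[S]_t$ are disjoint, and that for each $a \in \Sigma$,
$$V_t(a) = \bigcup_{S \ni a} [S]_t.$$
Moreover, $V_t = \bigcup_{\emptyset \neq S \subseteq \Sigma} [S]_t$, where $V_t$ is the set of all data values found in $t$.
Let $d_1 < \cdots < d_m$ be all the data values found in $t$. The string representation of the data values in $t$, denoted by $\mathcal{V}_{\Sigma}(t)$, is the string $S_1 \cdots S_m$ over the alphabet $2^{\Sigma} \setminus \{\emptyset\}$ of length $m$ such that $d_i \in [S_i]_t$, for each $i = 1, \ldots, m$. The notation $[S]_t$ is already introduced in [David et al. (2010), David et al. (2012)], but not $\mathcal{V}_{\Sigma}(t)$.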
Consider the example of the tree in Figure 1. The data values in are , where
The string is , where , , and .
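The construction of the string representation only depends on which labels occur with which data value, not on the shape of the tree. A minimal sketch, assuming the tree is flattened into a list of `(label, data_value)` pairs and that each symbol of the string is represented as a `frozenset` of labels (the function name is ours):

```python
def string_representation(nodes):
    """Return the string representation of the data values as a list of label sets.

    `nodes` is an iterable of (label, data_value) pairs, one per tree node.
    The i-th entry is the set of labels carried by the nodes that hold the
    i-th smallest data value of the tree.
    """
    labels_of = {}
    for label, value in nodes:
        labels_of.setdefault(value, set()).add(label)
    return [frozenset(labels_of[d]) for d in sorted(labels_of)]


if __name__ == "__main__":
    nodes = [("a", 3), ("b", 1), ("a", 1), ("b", 7)]
    print(string_representation(nodes))
```

In this hypothetical input the data values are 1, 3 and 7: value 1 occurs in both an `a`-node and a `b`-node, value 3 only in an `a`-node, and value 7 only in a `b`-node, so the resulting string has length three.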
3.2 A logic for ordered-data trees
An ordered-data tree over the alphabet can be viewed as a structure
where
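• the tree relations and the labeling predicates are as defined before in Subsection 2.2,
• the data equality relation holds for nodes $u$ and $v$ if they carry the same data value,
• the order relation on data holds for nodes $u$ and $v$ if the data value of $u$ is less than the data value of $v$,
• the successive order relation on data holds for nodes $u$ and $v$ if the data value of $v$ is the minimal data value in the tree greater than the data value of $u$.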
Obviously, can be expressed equivalently as . We include for the sake of convenience. We also assume that we have the predicates , , , and which stand for , , , and , respectively. We also write to denote .
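The three data relations of the structure can be sketched as predicates over a mapping from nodes to data values. The names `same`, `less` and `less_suc` are placeholders of our own, and nodes are identified by arbitrary hashable keys.

```python
def make_data_relations(val):
    """Build the three data relations from `val`, a dict from nodes to data values."""
    values = sorted(set(val.values()))
    # Map each data value to the next larger data value occurring in the tree.
    succ = dict(zip(values, values[1:]))

    def same(u, v):        # data equality
        return val[u] == val[v]

    def less(u, v):        # order on data
        return val[u] < val[v]

    def less_suc(u, v):    # successive order on data
        return succ.get(val[u]) == val[v]

    return same, less, less_suc


if __name__ == "__main__":
    same, less, less_suc = make_data_relations({"u": 1, "v": 4, "w": 9, "x": 4})
    print(less("u", "w"), less_suc("u", "w"), less_suc("u", "v"))
```

Note that the successive order is relative to the data values that actually occur in the tree: in the example, 4 is the successor of 1 even though 2 and 3 lie between them in the domain. Data equality coincides with the conjunction of "not less" in both directions, which is the redundancy remarked on above.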
For , we let stand for the first-order logic with the vocabulary , for its monadic second-order logic (which extends with quantification over sets of nodes), and for its existential monadic second order logic, i.e., formulas of the form , where is an formula over the vocabulary extended with the unary predicates .
We let stand for with two variables, i.e., the set of formulae that only use two variables and . The set of all formulae of the form , where is an formula is denoted by . Note that is equivalent in expressive power to over the usual (without data) trees. That is, it defines precisely the regular tree languages [Thomas (1997)].
As usual, we define as the set of ordered-data trees that satisfy the formula . In such a case, we say that the formula expresses the language . (To avoid confusion, we put the subscript on to denote a language of ordered-data trees. We use the symbol without the subscript to denote the usual language of trees/strings without data.)
The following theorem is well known. It shows how even extending with equality test on data values immediately yields undecidability.
Theorem 3.1.
(See, for example, [Neven et al. (2004)]) The satisfiability problem for the logic is undecidable.
One of the deepest results in this area is the following decidability result for the logic .
Theorem 3.2.
[Bojanczyk et al. (2009)] The satisfiability problem for the logic is decidable.
3.3 A few examples
In this subsection we present a few examples of properties of ordered-data trees. Some of them are special cases of more general techniques that will be used later on.
Example 3.3.
Let . Consider the language of ordered-data trees over where an ordered-data tree if and only if there exist two -nodes and such that is an ancestor of and either or . This language can be expressed with the formula , where states that contains only the node , contains only the node , contains precisely the nodes in the path from to , and or .
Example 3.4.
For a fixed set and an integer , we consider the language such that if and only if .
We pick an arbitrary symbol . The language can be expressed in with the formula of the form , where is a conjunction of the following.
• That the predicates are disjoint and each of them contains exactly one node, which is an -node.
• That the data values found in nodes in are all different.
• That for each , if a data value is found in a node in , then it must also be found in some -node, for every .
• That for each , if a data value is found in a node in , then it must not be found in any -node, for every .
• That for every -node (recall that ) that does not belong to the ’s, either it has the same data value as a node that belongs to one of the ’s, or its data value is not in .
That its data value does not belong to can be stated as the negation of:
– for each , there is a -node with the same data value; and
– the data value cannot be found in any -node, for every .
To express all these intended meanings, it is sufficient that .
Example 3.5.
For a fixed set and an integer , we consider the language such that if and only if .
This language can be expressed in with a formula of the form
where the intended meanings of are as follows. For a node in an ordered-data tree ,
• the number of nodes belonging to is precisely ; and if holds in , then the data value in the node belongs to ;
• holds in if and only if in the subtree rooted in we have ;
• if are all the left-siblings of , and holds, then holds if and only if .
To express all these intended meanings, it is sufficient that .
Example 3.6.
Let . Consider the language of ordered-data trees over where an ordered-data tree if and only if all the -nodes with data values different from the ones in their parents satisfy the following conditions:
• the data values found in these nodes are all different;
• one of these data values must be the largest in the tree .
The language can be expressed in with the following formula:
4 Two Useful Lemmas
In this section we prove two lemmas which will be used later on. The first is combinatorial in nature, and we will use it in our proof of the decidability of ODTA. The second is an Ehrenfeucht-Fraïssé type lemma for ordered-data trees, and we will use it in our proof of the logical characterisation of ODTA.
4.1 A combinatorial lemma
Let be an (undirected and finite) graph. For simplicity, we consider only graphs without self-loops. We denote by the set of vertices in and the set of edges. For a node , we write to denote the degree of the node and to denote .
A data graph over the alphabet is a graph in which each node carries a label from and a data value from . A node is called an -node, if its label is , in which case we write . We denote by the data value found in node , and the set of data values found in -nodes in .
Lemma 4.1.
Let be a data graph over . Suppose for each , we have . Then we can reassign the data values in the nodes in to obtain another data graph such that and and
(1) for each , ;
(2) for each , ;
(3) for each , if , then .
Proof 4.2.
Note that in the lemma the data graph differs from only in the data values on the nodes, where we require that adjacent nodes in have different data values.
In the following we write to denote the number of -nodes in and . First, we perform some partial reassignment of the data values on some nodes. For each , we pick number of -nodes in . Then we assign to these -nodes the data values from . One -node gets one data value. Such assignment can be done since obviously . If , then there will be some -nodes in that do not have data values. We write , if does not have data value. From this step we already obtain that for each .
However, reassigning the data values just like that, there may exist an edge such that and . We call such an edge a conflict edge. We are going to reassign the data values one more time so that there is no conflict edge.
Suppose there exists an edge such that and suppose that is an -node, for some . The data value can only be found in at most nodes in . Since , the neighbours of those nodes (with data value ) are at most nodes. Now , there are at least number of -nodes whose neighbours do not get the data value . Let be such -nodes, where . From these nodes, there exists such that .
We can then swap the data values on the nodes and , and this results in one less conflict edge. We repeat this process until there is no conflict edge. Now it is straightforward that
(1) for each , ;
(2) for each , ;
(3) for each , if and , then .
What is left to do now is to assign data values to the nodes , where . For each -node, where , we pick a data value which is not assigned to any of its neighbours. Such a data value exists since . Such an assignment will not violate condition (3) above, and thus we get the desired data graph . This completes the proof of Lemma 4.1.
4.2 An Ehrenfeucht-Fraïssé type lemma
We need the following notation. A -characteristic function on the alphabet , is a function . Let be the set of all such -characteristic functions on . A function is a -characteristic function for a set , if , for all , and , for all .
An ordered-data set over the alphabet consists of a finite set , in which each element carries a label and a data value . An element is called an -element, if . In other words, an ordered-data set is similar to an ordered-data tree, but without the relations and . It can be viewed as a structure , where
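• for each symbol $a$ of the alphabet and each element $u$, the labeling relation for $a$ holds on $u$ if $u$ is an $a$-element,
• the data equality relation holds for elements $u$ and $v$ if they carry the same data value,
• the order relation on data holds for elements $u$ and $v$ if the data value of $u$ is less than the data value of $v$,
• the successive order relation on data holds for elements $u$ and $v$ if the data value of $v$ is the minimal data value found in the set greater than the data value of $u$.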
Let be an ordered-data set and be the data values found in . The -extended representation of is the string such that and for each and for each ,
1. is a -characteristic function for the set ,
2. if , then there are number of -elements in with data value ,
3. if , then there are at least number of -elements in with data value .
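The idea behind the extended representation is to record, for each data value, how many elements of each label carry it, but only up to a fixed threshold. The sketch below illustrates this counting-up-to-a-threshold idea under assumptions of our own: an ordered-data set is a list of `(label, data_value)` pairs, each symbol of the string is a dictionary from labels to capped counts, and a count equal to the parameter `k` is read as "at least `k`". The exact thresholds used in the definition above are not reproduced here.

```python
from collections import Counter


def extended_representation(elements, alphabet, k):
    """Sketch of an extended representation with counts capped at `k`.

    For the i-th smallest data value, the i-th entry maps each label to the
    number of elements with that label and data value, capped at k.
    """
    counts = Counter(elements)
    values = sorted({value for _, value in counts})
    return [{a: min(counts[(a, d)], k) for a in alphabet} for d in values]


if __name__ == "__main__":
    elements = [("a", 1), ("a", 1), ("a", 1), ("b", 2)]
    print(extended_representation(elements, "ab", 2))
```

Two ordered-data sets with the same capped representation cannot be told apart by counting elements beyond the threshold, which is the intuition exploited by the lemma that follows.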
We assume that in every formula in all the monadic second-order quantifiers precede the first-order part. That is, sentences in are of the form: , where the ’s are monadic second-order variables, the ’s are or and extended with the unary predicates . We call the integer , the MSO quantifier rank of , denoted by , while we write to denote the quantifier rank of , that is the quantifier rank of the first-order part of .
Lemma 4.3.
Let and be ordered-data sets over such that . For any sentence such that and , .
Proof 4.4.
The proof is by an Ehrenfeucht-Fraïssé game for MSO of rounds, with rounds of set-moves and rounds of point-moves. We can assume that the set-moves precede the point-moves. See, for example, [Libkin (2004)] for the definition of the Ehrenfeucht-Fraïssé game.
Before we go to the proof, we need some notation. Let and be ordered-data sets over . For , we write for the set of elements in with label and data value . We can define similarly for .
Let . Let and , for some ordered-data sets and . The mapping is a partial -isomorphism (with equality) from to , if it is a partial isomorphism with regards to the vocabulary , and if , then .
We are going to describe Duplicator’s strategy for winning the Ehrenfeucht-Fraïssé game for MSO of rounds of set-moves, followed by rounds of point moves. We start with the set-moves.
Duplicator’s strategy for set-moves: Suppose that the game is already played for rounds, where and are the sets of positions chosen in and , respectively. For each , define the following set:
Duplicator’s strategy is to preserve the following identity: for every and every
-
•
If the cardinality , then .
-
•
If the cardinality , then also .
Now suppose that on the set-move, Spoiler chooses a set of positions on . Duplicator chooses a set of positions on as follows. For each , there are four cases:
-
1.
and .
Then, , which by induction hypothesis, implies . Duplicator picks number of points from , and declares them “belong to .” The rest of the points from are declared “not belong to .”
Obviously, and . -
2.
and .
In this case, either or . In either case there are number of points from which Duplicator declares as “belong to .” The rest of the points from are declared “not belong to .”
Obviously, and . -
3.
and .
Similar to Case 2. -
4.
and .
Then, , and so . Duplicator declares half of the points in as “belong to ” and the other half as “not belong to .”
Obviously, and .
Now after rounds of set-moves, we have the following identity: for every and every
-
•
If the cardinality , then .
-
•
If the cardinality , then also .
This ends our description of Duplicator’s strategy for set-moves. Now we describe Duplicator’s strategy for point-moves.
Duplicator’s strategy for point-moves: Suppose that the game is now on th step. Let be a partial -isomorphism, where . Suppose Spoiler chooses an element from such that is the largest data value in .
-
•
If , for some , Duplicator chooses from .
-
•
If , Duplicator chooses from such that and and is the largest data value in . Such an element exists, as .
In either case is a partial -isomorphism. This completes the description of Duplicator’s strategy and hence, our proof.
Now we define the -extended representation of an ordered-data tree over the alphabet , denoted by , as the -extended representation of the ordered-data set obtained by ignoring the relations and in . The following corollary is an immediate consequence of Lemma 4.3 above.
Corollary 4.5.
Let and be ordered-data trees over such that . For any sentence such that and , .
Proof 4.6.
Since the predicates and are not used in the formula , we can ignore them in and and view both and as ordered-data sets. Our corollary follows immediately from Lemma 4.3.
5 Automata for Ordered-data Tree
In this section we are going to introduce an automata model for ordered-data trees and study its expressive power.
Definition 5.1.
An ordered-data tree automaton, in short ODTA, over the alphabet is a triplet , where is a letter-to-letter non-deterministic transducer from to the output alphabet ; is an automaton on strings over the alphabet ; and .
An ordered-data tree is accepted by , denoted by , if there exists an ordered-data tree over such that
-
•
on input , the transducer outputs ;
-
•
the automaton accepts the string ; and
-
•
for every , all the -nodes in have different data values.
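The last two acceptance conditions can be checked directly once an output of the transducer is fixed. The sketch below makes several assumptions of ours: the transducer is not modelled, a tree is flattened to a list of (output label, data value) pairs, the string read by the string automaton is taken to be the sequence of label sets ordered by increasing data value, and the automaton is passed as a plain predicate. The markers `p` and `q` are hypothetical and only mimic the flavour of Example 5.2.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Node = Tuple[str, int]  # (output label, data value)
Word = List[FrozenSet[str]]


def value_string(nodes: List[Node]) -> Word:
    """Set of labels found at each data value, in increasing value order."""
    labels_by_value: Dict[int, Set[str]] = {}
    for label, value in nodes:
        labels_by_value.setdefault(value, set()).add(label)
    return [frozenset(labels_by_value[d]) for d in sorted(labels_by_value)]


def distinct_on(nodes: List[Node], gamma0: Set[str]) -> bool:
    """For each label in gamma0, all nodes with that label differ in value."""
    for label in gamma0:
        values = [v for l, v in nodes if l == label]
        if len(values) != len(set(values)):
            return False
    return True


def accepts(nodes: List[Node], automaton: Callable[[Word], bool],
            gamma0: Set[str]) -> bool:
    """Check the string condition and the distinctness condition together."""
    return automaton(value_string(nodes)) and distinct_on(nodes, gamma0)


def marker_order(word: Word) -> bool:
    """Hypothetical string language: the position of p is at most that of q."""
    p = next(i for i, s in enumerate(word) if "p" in s)
    q = next(i for i, s in enumerate(word) if "q" in s)
    return p <= q
```

A full implementation would additionally have to guess the run of the transducer; here the output labelling is simply taken as given.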
We describe a few examples of ODTA that accept the languages described in Examples 3.3, 3.4, 3.5 and 3.6.
Example 5.2.
An ODTA that accepts the language in Example 3.3 can be defined as follows. The output alphabet of the transducer is . On an input tree , the transducer marks the nodes in as follows. There is only one node marked with , one node marked with , and the -node is an ancestor of . The automaton accepts all the strings in which the position labeled with is less than or equal to the position labeled with . (These two positions can be equal, which means .) Finally, .
Example 5.3.
An ODTA that accepts the language in Example 3.4 can be defined as follows. The transducer is an identity transducer. The automaton accepts all the strings in which the symbol appears exactly times, and .
Example 5.4.
An ODTA that accepts the language in Example 3.5 can be defined as follows. The transducer is an identity transducer. The automaton accepts a string in which the number of appearances of the symbol is a multiple of , and .
Example 5.5.
An ODTA that accepts the language in Example 3.6 can be defined as follows. The output alphabet of the transducer is . The transducer marks the nodes as follows. A node is marked with if and only if it is an -node and it has different data value from the one of its parent. All the other nodes are marked with . The automaton accepts a string if and only if the last symbol in contains the symbol , while .
The following proposition states that ODTA languages are closed under union and intersection, but not under negation. We remark that not being closed under negation is rather common for decidable models for data trees. Models that are closed under negation often have an undecidable non-emptiness or satisfiability problem.
Proposition 5.6.
The class of languages accepted by ODTA is closed under union and intersection, but not under negation.
Proof 5.7.
For closure under union and intersection, let and be ODTA. The union is accepted by an ODTA which non-deterministically chooses to simulate either or on the input ordered-data tree. The ODTA for the intersection can be obtained by the standard cross product between and .
We now prove that ODTA languages are not closed under negation. Consider the negation of the language in Example 3.3, whose equivalent ODTA is presented in Example 5.2. Every tree has the following property. If are two -nodes in and is an ancestor of , then .
Now suppose to the contrary that there exists an ODTA that accepts the negation of . Let be the output alphabet of . Let be a data tree with nodes, where each node is labelled with and has at most one child. This implies that the data values in are all different and appear in increasing order from the root node to the leaf node.
Let . Since has nodes, and hence so does , there are two nodes in and in with the same label. Let be a data tree obtained from by swapping the data values between and , so . Since , on input , the transducer can also output , which means that . This contradicts the fact that is the complement of . This completes the proof of Proposition 5.6.
We should remark that in Section 7 we will discuss that extending ODTA with the complement of languages of the form in Example 5.2 will immediately yield undecidability.
Theorems 5.8, 5.9 and 5.10 are the main results in this paper. Theorem 5.8 below provides the ODTA characterisation of the logic and its proof can be found in Subsection 5.1.
Theorem 5.8.
A language is expressible with an formula if and only if it is accepted by an ODTA , where is a commutative language. Moreover, the translation from formulas to ODTA takes triple exponential time, while the translation from ODTA to formulas takes exponential time.
Theorem 5.9 below provides the logical characterisation of ODTA. The proof can be found in Subsection 5.2.
Theorem 5.9.
A language is accepted by an ODTA if and only if it is expressible with a formula of the form: , where is a formula from , and from , both extended with the unary predicates and . Moreover, the translation from ODTA to formula takes polynomial time, while the translation from formula to ODTA is effective but non-elementary.
Finally, Theorem 5.10 states that the non-emptiness problem for ODTA is decidable. The proof can be found in Subsection 5.3.
Theorem 5.10.
The non-emptiness problem for ODTA is decidable in 3-NExpTime.
The best lower bound known to date is NP-hardness; see [Fan and Libkin (2002), David et al. (2012)].
5.1 Proof of Theorem 5.8
In the proof we assume that the ordered-data trees are over the finite alphabet . We will need the following proposition which states that every formula can be syntactically rewritten to a normal form for .
Proposition 5.11.
[Bojanczyk et al. (2009), Proposition 3.8] Every formula can be rewritten into a normal form of exponential size of the form: , where is a conjunction of formulae of the form:
-
(N1)
,
-
(N2)
,
-
(N3)
,
-
(N4)
,
-
(N5)
,
-
(N6)
,
-
(N7)
,
where is a conjunction of some unary predicates and its negations, is either or , and is either or .
We should remark that if is a conjunction of formulae of the forms (N1)–(N5) above, then there exists a tree automaton over the alphabet such that for every ordered-data tree ,
Such a construction is straightforward from classical automata theory; see, for example, [Thomas (1997)]. We divide the proof of Theorem 5.8 into Lemmas 5.12 and 5.14 below.
Lemma 5.12.
For every formula , there exists an ODTA such that and is commutative. Moreover, the construction of is effective and takes triple exponential time in the size of the formula .
Proof 5.13.
Applying Proposition 5.11, we can rewrite the formula in its normal form . Furthermore, we can rewrite the formula into the form , where , and is a conjunction of formulas of the form:
-
(N0′)
are pairwise disjoint, and .
-
(N1′)
,
-
(N2′)
,
-
(N3′)
,
-
(N4′)
,
-
(N5′)
,
-
(N6′)
,
-
(N7′)
,
where are disjunctions of some of the ’s, and and are the same as above. Intuitively, the unary predicates correspond to subsets of .
The ODTA is defined as follows.
-
•
The transducer checks whether the formulas (N0′)–(N5′) are satisfied, with the output alphabet where a node is labeled with if and only if it belongs to .
The construction of such a transducer is straightforward and thus omitted. See, for example, [Thomas (1997)]. -
•
consists of the ’s, where there exists and and a formula of the form (N6′)
in .
-
•
the automaton accepts the language , where
That is commutative is trivial. That accepts precisely the language can be deduced from the following.
-
•
That ensures that formulas N0′–N5′ are satisfied.
-
•
That contains precisely the symbols ’s where all -nodes are supposed to contain different data values.
-
•
That for every ordered-data tree ,
if and only if .
-
•
That for every ordered-data tree ,
if and only if
-
–
for all such that ; and
-
–
for all , , which is captured by the condition imposed by .
-
–
The analysis of the complexity is as follows. The first step, applying Proposition 5.11, induces an exponential blow-up in the size of the input. The second step to construct the formula takes exponential time in , and is exponential in the size of the input. The construction of takes polynomial time in the size of , since (N0′)–(N5′) are already in the “automata transition” format. The construction of takes polynomial time in , while the construction of induces another exponential blow-up in . Altogether the complexity of our constructing is triple exponential time in the size of . This concludes the proof of Lemma 5.12.
For the complexity analysis in Lemma 5.14, we assume that a commutative automaton is given as a set of vectors (in binary format) indicating its Parikh images. That is, is given as a set , where
and each number in the vectors in is written in the standard binary form.
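To make the vector representation concrete, the following sketch computes the Parikh image of a word and tests membership in a set of the form base + h · period for some natural number h. This is a simplification of ours: it handles a single period vector with non-negative entries only, and the names `parikh` and `in_periodic_set` are hypothetical. Python integers are of arbitrary size, so entries written in binary pose no difficulty.

```python
from collections import Counter
from typing import Sequence, Tuple


def parikh(word: Sequence[str], alphabet: Sequence[str]) -> Tuple[int, ...]:
    """Number of occurrences of each alphabet symbol in `word`."""
    counts = Counter(word)
    return tuple(counts[a] for a in alphabet)


def in_periodic_set(vector: Sequence[int], base: Sequence[int],
                    period: Sequence[int]) -> bool:
    """Is vector = base + h * period for some natural number h?

    Assumption (ours): all entries of `period` are non-negative.
    """
    h = None
    for v, b, p in zip(vector, base, period):
        diff = v - b
        if p == 0:
            if diff != 0:
                return False
            continue
        if diff < 0 or diff % p != 0:
            return False
        if h is not None and h != diff // p:
            return False  # coordinates disagree on the multiplier h
        h = diff // p
    return True
```

Because membership of a word in a commutative language depends only on its Parikh image, a test of this kind on `parikh(word, alphabet)` suffices to decide membership once the vectors are known.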
Lemma 5.14.
For every ODTA , where is a commutative language, there exists a formula such that . Moreover, the construction of takes exponential time in the size of .
Proof 5.15.
Let and be the set of states and the output alphabet of the transducer , respectively. Let .
By Theorem 2.1, is a finite union of periodic languages. Let be the finite set of -tuple of -vectors such that
Let and be the enumeration of non-empty subsets of . First, for , we construct an formula where
if and only if |
We denote by the non-zero entry of . This formula is as follows.
where and are the formulas for the languages and in Examples 3.4 and 3.5, respectively.
The desired formula is:
where
-
•
the formula expresses the fact that the data values found under nodes labeled with a symbol from are all different;
-
•
the unary predicates are supposed to represent the states and the output alphabets of , respectively;
-
•
the formula expresses the behaviour of the transducer – that is, a tree satisfies in which for every node , and holds, if there exists an accepting run of on in which the node is labeled with and output ;
-
•
the predicates ’s and the formulas ’s are as in the formula defined above.
The analysis of the complexity is as follows. The size of the formula and are exponential in the size of . Hence, the construction of takes exponential time in the size of . The construction of and takes polynomial time in the size of and , respectively. Hence, the total time to construct the formula is exponential in the size of . This completes the proof of the lemma.
5.2 Proof of Theorem 5.9
In this subsection for every ordered-data tree , we assume that the data values in are precisely the natural numbers in the range , for a positive integer . That is, if are the data values in , then , , , .
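This normalisation only renames data values by their rank, which preserves both equality and order. A minimal sketch, with a function name of our own choosing, is the following.

```python
from typing import List


def normalise_values(values: List[int]) -> List[int]:
    """Replace each data value by its rank (starting at 1) among the
    distinct values, preserving equality and order."""
    rank = {d: i + 1 for i, d in enumerate(sorted(set(values)))}
    return [rank[d] for d in values]


# Data values listed node by node; equal values receive equal ranks.
normalised = normalise_values([40, 7, 40, 100])
```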
We start with the following lemma.
Lemma 5.16.
Let be of quantifier rank . Let be the set of unary predicates used in . There exists a finite state automaton over the alphabet such that the following holds.
-
•
The automaton accepts words of the form
where each .
-
•
For every ordered-data tree , if , then there exists a word in of the form
-
•
For every word , if is
then there exists a tree , where .
Proof 5.17.
Let be of quantifier rank . Let be the set of unary predicates used in . We define the following sentence (that is, over strings) inductively from as follows.
-
•
If is , where , then is
-
•
If is , then is also .
-
•
If is , then states “there is no position in between and labeled with any symbol from .”
-
•
If is , then states “there is at least one position in between and labeled with a symbol from .”
We have the following claim.
Claim 1.
-
(1)
For every ordered-data tree , if , then there exists a word of the form
-
(2)
For every word , if is
then there exists a tree , where .
Proof 5.18.
We first prove item (1). Let be an ordered-data tree over the alphabet and let be its -extended string representation of data values in . Let be the following data string
When is viewed as a data tree (that is, a data string is a data tree in which each node has at most one child), . Hence, by Corollary 4.5,
By straightforward induction on , we can show that for every of the form
there exists a word of the form
Similarly, to prove (2), we can prove by straightforward induction on that for every word of the form
there exists a tree of the form
This completes the proof of our claim.
Let be an automaton over the alphabet that expresses the formula and that it accepts only words of the form
where each . The construction of from the formula is rather standard, but non-elementary. See, for example, [Thomas (1997)]. That the automaton is the desired automaton is immediate. This completes our proof of Lemma 5.16.
Lemma 5.19.
Let be of quantifier rank . Let be the set of unary predicates used in . There exists a finite state automaton over the alphabet such that .
Proof 5.20.
Now we are ready to prove Theorem 5.9. We start with the “if” direction. Let be a formula of the form:
is a formula from and from , both extended with the unary predicates .
By Proposition 5.11, we can rewrite (with additional unary predicates) the formula into a conjunction of formulae of the form N1–N7 as stated in Proposition 5.11. Then we further rewrite it into the form
where and is a formula from and from , both extended with the unary predicates , and that the formula is conjunction of the form:
-
(N0′)
a formula that states that are pairwise disjoint and that
-
(N1′)
,
-
(N2′)
,
-
(N3′)
,
-
(N4′)
,
-
(N5′)
,
-
(N6′)
,
-
(N7′)
,
where are disjunctions of some of the unary predicates .
We will describe the ODTA for the formula , where the transducer expresses the formula N0′–N5′ with the output alphabet , the automaton expresses the formula N6′, N7′ and , and is the set of symbols that appear in formula N6′. Formally, it is defined as follows.
-
•
The output alphabet of is .
-
•
The transducer expresses the formula N0′–N5′ above. In particular, the input and output symbols of each node must satisfy the formula N0′.
This step takes polynomial time, since the formulas N0′–N5′ are already in the transition format.
-
•
The set .
This step takes polynomial time.
-
•
The automaton expresses the formulas N6′, N7′ and , obtained by applying Lemma 5.19.
This step is constructive, but non-elementary due to the conversion from a formula to its finite state automaton.
It is straightforward to show that .
Now we prove the “only if” direction. Let , where , and
-
•
be the states of ;
-
•
be the states of , and is the initial state of ;
-
•
be the output alphabet of .
We denote by the input alphabet of .
The desired formula for is of the form:
where
-
•
the unary predicates are supposed to represent the states, the output alphabets of , and the states of , respectively;
-
•
the formula expresses the behaviour of the transducer – that is, a tree satisfies in which for every node , and holds, if there exists an accepting run of on in which the node is labeled with and output ;
-
•
the formula expresses the behaviour of the automaton ;
-
•
the formula expresses the property that for every , all the nodes belonging to contain different data values, which is
The construction of the formula is rather standard, thus, omitted. We will show the construction of the formula . Let denote the following formula
which states that the data value on the node belongs to . The formula expresses the following properties.
-
•
That every node that contains the minimal data value belongs to . Formally, it can be written as follows.
-
•
That the transition of must be “respected.” Formally, it can be written as follows.
where stands for .
-
•
That every node that contains the maximal data value belongs to one of the final states of , denoted by . Formally, it can be written as follows.
That the construction takes polynomial time is straightforward. This completes our proof of Theorem 5.9.
5.3 Proof of Theorem 5.10
The proof of Theorem 5.10 consists of two main steps.
-
(1)
We prove that for each ODTA , if , then contains a data tree with the “small model property” (Lemma 5.21).
-
(2)
We describe a procedure that, given an ODTA , checks whether contains a data tree with the “small model property,” by converting the ODTA into an APC . Since the non-emptiness of APC is decidable, Theorem 5.10 follows immediately.
The first step (Lemma 5.21) is adapted from the proof of [Bojanczyk et al. (2009), Proposition 3.10]. It is in the second step that our proof differs from [Bojanczyk et al. (2009), Proposition 3.10]. The decision procedure in [Bojanczyk et al. (2009)] relies on an intricate counting argument over the so-called dog and sheep symbols (see [Bojanczyk et al. (2009), page 36]), and it seems that it cannot be generalised to the case of ODTA. On the other hand, our decision procedure relies mainly on Proposition 2.3, Lemma 4.1 and counting the cardinality of each .
We need some terminology. A set of nodes in a data tree is called connected, if it is connected in the graph induced by and . A zone in a data tree is a maximal connected set of nodes with the same data value. The outdegree of a zone is the number of different zones to which there is an edge (either or ) from .
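Zones and their outdegrees can be computed by a graph search that only follows edges between nodes with equal data values. The sketch below assumes, as our own encoding, that an unranked tree is given by ordered child lists, that the two edge relations are parent-to-child and node-to-next-sibling, and that edges are directed as stated; the function names are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple


def tree_edges(children: Dict[Hashable, List[Hashable]]) -> List[Tuple]:
    """Directed edges: parent to child, and each node to its next sibling."""
    edges = []
    for parent, kids in children.items():
        edges += [(parent, k) for k in kids]
        edges += list(zip(kids, kids[1:]))
    return edges


def zones(children, value: Dict[Hashable, int]) -> Dict[Hashable, int]:
    """Map each node to the identifier of its zone."""
    neighbours = {n: [] for n in value}
    for u, v in tree_edges(children):
        if value[u] == value[v]:  # only equal-valued edges connect a zone
            neighbours[u].append(v)
            neighbours[v].append(u)
    zone_of, next_id = {}, 0
    for start in value:
        if start in zone_of:
            continue
        zone_of[start] = next_id
        stack = [start]
        while stack:  # depth-first search inside one zone
            u = stack.pop()
            for w in neighbours[u]:
                if w not in zone_of:
                    zone_of[w] = next_id
                    stack.append(w)
        next_id += 1
    return zone_of


def outdegrees(children, value) -> Dict[int, int]:
    """Number of distinct other zones reached by an edge leaving each zone."""
    zone_of = zones(children, value)
    targets = {z: set() for z in set(zone_of.values())}
    for u, v in tree_edges(children):
        if zone_of[u] != zone_of[v]:
            targets[zone_of[u]].add(zone_of[v])
    return {z: len(t) for z, t in targets.items()}


# Root 0 with children 1, 2, 3; node 1 has the single child 4.
children = {0: [1, 2, 3], 1: [4]}
value = {0: 5, 1: 5, 2: 7, 3: 5, 4: 7}
zone_of = zones(children, value)
out = outdegrees(children, value)
```

In this example the nodes 2 and 4 carry the same data value but are not connected, so they form two different zones.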
Let be an ODTA, where is a transducer from to . Let be the set of states of . For a tree , its extended tree (with respect to the ODTA ) is a tree over the alphabet , where
-
•
the projection of to is ;
-
•
the projection of to is an accepting run of on ;
-
•
the projection of to is an output of on .
The following lemma is simply an adaptation of [Bojanczyk et al. (2009), Proposition 3.10] to the case of ODTA. The proof is via cut-and-paste: given an ordered-data tree over the alphabet , where has “many” zones in which the outdegree is “large,” we can cut some nodes in and paste them in another part of without affecting the sets ’s for each . The aim of such cut-and-paste is to reduce the number of zones in with large outdegree. We give the formal statement below.
Lemma 5.21.
[Compare [Bojanczyk et al. (2009), Proposition 3.10]] For every ODTA over the alphabet , if , then there exists a data tree in which there are at most zones with outdegree , where and is the set of states of and the output alphabet of .
Proof 5.22.
Let be an ODTA over the alphabet , and is the set of states of and the output alphabet of . Suppose that . We will work on the extended tree of . The aim is to convert into another tree over the alphabet such that
-
1.
the number of zones in with outdegree is bounded by ,
-
2.
the projection of is the profile of each node,
-
3.
the projection of is an accepting run of on the projection of and the output is its projection,
-
4.
for each the set of data values found in the -nodes in is the same as the set of those found in -nodes in ,
-
5.
the projection of is accepted by .
Intuitively, the tree is obtained via repeated applications of “pumping lemma” on both - and -directions in the tree .
Below we give a brief summary of the proof adapted from the proof of [Bojanczyk et al. (2009), Proposition 3.10]. We need the following terminology, all of which is taken from [Bojanczyk et al. (2009)].
-
•
Two nodes in a tree are called siblings, if they have the same parent node.
-
•
The set of all children of a node is called a sibling group.
-
•
A contiguous sequence of siblings is called an interval.
We write for an interval in which and are the left-most and right-most nodes, respectively, in the interval. -
•
An interval is complete, if the following holds.
-
–
If a node exists such that , then .
-
–
If a node exists such that , then .
-
–
-
•
An interval is pure, if all of its nodes have the same data value.
-
•
A pure interval with the data value is called a -pure interval.
-
•
If the parent of an interval (or, a sibling group) has data value , then it is called a -parent interval (or a -parent sibling group).
-
•
A zone with the data value is called a -zone.
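As an illustration of pure intervals, the sketch below splits a sibling group, given as the left-to-right list of its data values, into maximal runs of equal values. The function name is ours, and the code does not attempt to reproduce the precise completeness conditions of [Bojanczyk et al. (2009)]; it only computes maximal pure intervals.

```python
from itertools import groupby
from typing import List, Tuple


def pure_intervals(sibling_values: List[int]) -> List[Tuple[int, int, int]]:
    """Maximal runs of equal data values among consecutive siblings.

    Returns triples (left index, right index, data value), indices inclusive.
    """
    intervals = []
    start = 0
    for value, run in groupby(sibling_values):
        length = len(list(run))
        intervals.append((start, start + length - 1, value))
        start += length
    return intervals


# Six siblings: two with value 3, one with value 5, three with value 3.
runs = pure_intervals([3, 3, 5, 3, 3, 3])
```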
The construction of from is as follows.
-
1.
Convert to another tree such that
-
•
for every data value there are at most complete -pure intervals of size more than ;
-
•
, for every ;
-
•
is an extended tree of its projection w.r.t. .
This step is adapted from [Bojanczyk et al. (2009), Proposition 3.12]. The idea is to cut an interval (together with its subtree) and paste it in another interval; and while doing so the data values in the interval remain untouched.
-
•
-
2.
Convert to another tree such that
-
•
for every data value there are at most -parent sibling groups with more than complete pure intervals;
-
•
, for every ;
-
•
is an extended tree of its projection w.r.t. .
This step is adapted from [Bojanczyk et al. (2009), Proposition 3.14]. Again when the cut-and-paste is performed the data values in the sibling groups remain untouched.
-
•
-
3.
Convert to another tree such that
-
•
for every data value there are at most -zones containing a path with more than nodes;
-
•
, for every ;
-
•
is an extended tree of its projection w.r.t. .
This step is adapted from [Bojanczyk et al. (2009), Proposition 3.17]. Again when the cut-and-paste is performed the data values in the zones remain untouched.
-
•
-
4.
Convert to another tree such that
-
•
there are at most complete pure intervals with more than nodes;
-
•
, for every ;
-
•
is an extended tree of its projection w.r.t. .
This step is adapted from [Bojanczyk et al. (2009), Proposition 3.20]. Here actually when the cut-and-paste is performed, the data values in some zones have to be changed. However, those changes are only applied to the safe zones, where a zone is safe if for every node in it there is another node outside the zone with the same label (from ) and the same data value. (See [Bojanczyk et al. (2009), page 23, last paragraph].) More specifically, these changes are done by applying [Bojanczyk et al. (2009), Lemma 3.19] on the safe zones. That it is applied only on safe zones is important so that after changing the data values, constraints such as are still satisfied.
-
•
-
5.
Convert to another tree such that
-
•
there are at most sibling groups containing more than complete pure intervals;
-
•
, for every ;
-
•
is an extended tree of its projection w.r.t. .
This step is adapted from [Bojanczyk et al. (2009), Proposition 3.21]. Here there are also changes of data values when performing cut-and-paste. However, as in the previous step, they are only applied to the safe zones. These changes are also done by applying [Bojanczyk et al. (2009), Lemma 3.19] on the safe zones.
-
•
-
6.
Convert to another tree such that
-
•
there are at most zones containing paths with more than nodes;
-
•
, for every ;
-
•
is an extended tree of its projection w.r.t. .
This step is adapted from [Bojanczyk et al. (2009), Proposition 3.25]. Here there are also changes of data values when performing cut-and-paste. However, as in the previous step, they are only applied to the safe zones. More specifically, these changes are done by applying [Bojanczyk et al. (2009), Lemma 3.24] on the safe zones.
-
•
The extended tree is the desired extended tree. It is a rather straightforward computation that there are at most zones in with outdegree .
To describe the decision procedure for Theorem 5.10, we need some additional terminology. For a data tree over the alphabet , and , an -zone is a zone in which the labels of the nodes are precisely . We write to denote the set of data values found in -zones in . For ,
Suppose are all the data values in . The zonal string representation of the data values in , denoted by , is the string over the alphabet such that for each , .
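The zonal string representation can be sketched as follows, assuming that the zones of the tree have already been computed and that each zone is summarised by the set of labels of its nodes together with its data value. Our reading is that the symbol at each position is the collection of label sets of the zones carrying that data value; the function name is hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Zone = Tuple[Set[str], int]  # (labels occurring in the zone, its data value)


def zonal_string(zones: Iterable[Zone]) -> List[FrozenSet[FrozenSet[str]]]:
    """For each data value, in increasing order, the set of label sets S
    such that some S-zone carries that value."""
    sets_by_value: Dict[int, Set[FrozenSet[str]]] = {}
    for labels, value in zones:
        sets_by_value.setdefault(value, set()).add(frozenset(labels))
    return [frozenset(sets_by_value[d]) for d in sorted(sets_by_value)]


# Two zones with value 5 and two zones with value 7.
example = [({"a", "b"}, 5), ({"a"}, 7), ({"b"}, 7), ({"a"}, 5)]
word = zonal_string(example)
```

Compared with the string representation used for ODTA, each position now records which combinations of labels occur together in a zone, not just which labels occur at the data value.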
A zonal ODTA is , where and are as in the definition of ODTA, and is a finite state automaton over the alphabet . A data tree is accepted by the zonal ODTA , if the following holds.
-
•
is accepted by , yielding an output tree over the alphabet .
-
•
The string is accepted by .
-
•
For each , all the data values found in the -nodes in are different.
Proposition 5.23.
For every ODTA , one can construct in ExpTime its equivalent zonal ODTA.
Proof 5.24.
Let and . Its equivalent zonal ODTA is defined as , where and . It is straightforward to show that .
Note that the only difference between and is the transitions and in and , respectively. The membership can be checked in polynomial time in the size of and . Since there are exponentially many , the exponential time upper bound holds immediately. This completes the proof of Proposition 5.23.
Briefly, our decision procedure for Theorem 5.10 works as follows. Let be the given ODTA, where is the input alphabet of , the output alphabet, and the set of states of . Let . The decision procedure constructs an APC such that accepts an ordered-data tree in which there are at most zones with outdegree if and only if accepts an extended tree of w.r.t. .
Its precise description is given as follows.
-
1.
Compute .
-
2.
Convert into its zonal ODTA .
-
3.
Guess the following items.
-
(a)
A set .
-
(b)
For each , guess an integer and a set of constants . (Footnote: The purpose of the number is the application of Lemma 4.1 later on, where we consider the graph whose nodes are the zones. Each zone is labeled with a symbol from , which is of size . If a zone has outdegree , then it has at most nodes, which means that its degree (the sum of indegree and outdegree) is bounded by . Now is intended to contain all those ’s in which so that we can “guess” some constants as elements of and make sure by the automaton that the same constant is not “assigned” to adjacent zones. For not in , we can apply Lemma 4.1 to make sure the same data value from is not assigned to adjacent zones.)
-
(c)
Two integers such that and a set of constants .
The intuitive meanings of and are the number of zones with outdegree and the number of data values found in them, respectively. We also remark that the constants in may overlap with the constants in some . -
(d)
For each , guess a set .
-
(a)
-
4.
Construct the following automaton over the alphabet .
-
(a)
accepts only the extended trees of in which there are at most zones with outdegree .
-
(b)
The automaton can remember the constants in its states.
-
(c)
For every , for every , the automaton “assigns” the constant in an -zone, for every , but not in any -zone, for every .
-
(d)
The automaton “assigns” every zone with outdegree with a constant from .
-
(e)
For every , for every , the automaton “assigns” the constant in an -zone, for every , but in no -zone, for every .
-
(f)
For each , there is at most one -node in every zone, and for every two zones that contain -nodes, if they are assigned with some constants from ’s and , then these constants must be different.
-
(g)
For every two adjacent zones, if they are assigned with constants from ’s and , then these constants must be different.
The automaton “assigns” a constant to a zone by remembering the constant in the state when is reading the zone.
-
(a)
-
5.
Let be the enumeration of non-empty subsets of .
Applying Proposition 2.3, convert the automaton into its Presburger formula , where the intended meaning of ’s is the number of appearances of the label . -
6.
Let and be the enumeration of non-empty subsets of . Define the formula
(14)
The meaning of is the number of -nodes occurring in the zone not assigned with any constants from ’s and ; and is the number of -zones not assigned with any constants from ’s and . The intuition behind items (2)–(6) is rather clear. The intuition behind item (7) is as follows. Recall that in Step (3), for each , we guess a set . The meaning is that for some . So for every , the number of such that should not exceed . This is precisely what is stated in item (7).
-
7.
Test the non-emptiness of the APC .
Before we proceed to prove its correctness, we first present the analysis of its complexity.
-
•
Step (1) is trivial and Step (2) takes exponential time.
-
•
Step (3) takes non-deterministic exponential time in the size of . The analysis is as follows. Step (3.a) takes non-deterministic exponential time in the size of , which is bounded by the size of in . (Recall that the alphabet in is .) Step (3.b) can guess up to exponentially many constants in each , and there are exponentially many different , hence it takes double exponential time in the size of . Steps (3.c) and (3.d) take non-deterministic exponential time.
-
•
Step (4) takes deterministic triple exponential time and can produce the automaton of size up to triple exponential. The analysis is as follows. The automaton has to remember in its states the outdegree of each zone up to and the number of zones with outdegree . This induces an exponential blow-up in the size of .
The number of constants in guessed in Step (3) is double exponential in the size of . Then has to remember in its states which constant is assigned to which zone (of outdegree ), which induces another exponential blow-up. Altogether the size of can be triple exponential in the size of .
-
•
By Proposition 2.3, Step (5) takes polynomial time in the size , which is of size exponential in the size of the original .
-
•
The length of the formula in step (6) is double exponential in the size of , since the number of constants in can be double exponential in the size of , and hence .
-
•
Step (7) takes non-deterministic polynomial time in the size of , and hence non-deterministic triple exponential time in the size of the input .
The following claim immediately implies the correctness of our algorithm.
Claim 2.
-
1.
For every ordered-data tree , in which there are at most zones with outdegree , there exists an extended tree of which is accepted by the APC .
-
2.
For every , there exists an ordered-data tree such that is an extended tree of w.r.t. .
Proof 5.25.
We prove (1) first. Let be an ordered-data tree in which there are at most zones with outdegree . Let be the output of on so that is accepted by and all nodes in labelled with a symbol in have different data values.
We have the following items guessed in Step 3 in our algorithm above.
-
•
.
-
•
For each , , and .
-
•
be the number of zones in with outdegree and be the number of data values found in these zones.
-
•
,
-
•
For each , is the set such that .
Now let be an extended tree of with respect to , and and be the automaton and formula as constructed in Steps 4–6 above. We are going to show that . Obviously, . To show that the formula is satisfied, we take as witness to . Since , by Proposition 2.3, the formula holds. It is straightforward from the definitions of the items , ’s, , , and ’s that the formula in Step 6 is satisfied with ’s and ’s interpreted as intended.
Now we prove (2). The proof is more delicate than the proof of (1). Suppose . We are going to construct an ordered-data tree from such that is an extended tree of w.r.t. . Let , ’s, ’s, , , and ’s be the items guessed in Step 3 above, and
-
•
for each , let be the number of -nodes in occurring in a zone without any constants from ’s and ;
-
•
for each , let be the number of -zones in without any constants from ’s and .
Let be the witness to such that
By Proposition 2.3, this means that there exists a word such that . For each , we let
We will assign a data value to each node in such that
and . The assignment is done according to three cases below.
- Case 1
-
For the nodes that are assigned with some constants from ’s.
In this case . We define bijections . There is always a bijection from to since they have the same cardinality , due to the following condition in the formula : The data value assignment to the nodes of this case can be done by replacing every constant with .
- Case 2
-
For the nodes that are assigned some constants from .
We define a 1-1 mapping such that , where is the set guessed in Step 3. Such a 1-1 mapping exists because of the following condition in the formula : The data value assignment to the nodes of this case can be done by replacing every constant with .
- Case 3
-
For the nodes that are not assigned any constants from ’s and .
First we assign each such zone in a data value (Footnote: A zone in can be recognised from the profile information in .) such that for each , This step can be done as follows. The number of such -zones in is greater than , due to the condition below in the formula :
Thus, we can simply assign every -zone with a data value from , and make sure every data value from appears in some -zone.
However, by assigning data values in this way, some adjacent zones may get the same data value. Here we apply Lemma 4.1. Since for each , , by the condition below in the formula , the cardinality
The outdegree of such a zone is , hence the number of nodes in the zone is also . Since each node can have indegree at most , the degree of each such zone is . By applying Lemma 4.1, where , we can reassign the data values in such zones so that adjacent zones get different data values.
This completes the proof of our Claim.
6 Weak ODTA
A weak ODTA over is a triplet where is a letter-to-letter transducer from to the output alphabet , and is a finite state automaton over and . An ordered-data tree is accepted by , denoted by , if there exists an ordered-data tree over such that
-
•
on input , the transducer outputs ;
-
•
the automaton accepts the string ; and
-
•
for every , all the -nodes in have different data values.
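The last two acceptance conditions do not depend on the shape of the tree, only on the multiset of (label, data value) pairs of the output tree. The following Python sketch illustrates them under stated assumptions: the string read by the automaton is assumed to be the sequence of label sets, one per data value, listed in increasing order of the data values; the names `data_string`, `keys_respected` and `accepted_given_output` are hypothetical, and the transducer step is not modelled.

```python
from collections import defaultdict

def data_string(nodes):
    """nodes: iterable of (label, data_value) pairs of the output tree.
    Returns one set of labels per data value, in increasing order of values
    (assumed reading of the string given to the automaton)."""
    by_value = defaultdict(set)
    for label, value in nodes:
        by_value[value].add(label)
    return [frozenset(by_value[v]) for v in sorted(by_value)]

def keys_respected(nodes, gamma0):
    """True iff, for every label in gamma0, no two nodes with that label
    carry the same data value."""
    seen = set()
    for label, value in nodes:
        if label in gamma0:
            if (label, value) in seen:
                return False
            seen.add((label, value))
    return True

def accepted_given_output(nodes, automaton_accepts, gamma0):
    """Checks the second and third conditions for a fixed transducer output.
    automaton_accepts is any predicate on lists of label sets."""
    return automaton_accepts(data_string(nodes)) and keys_respected(nodes, gamma0)
```

In this sketch the automaton is passed in as an arbitrary predicate, so any finite-automaton implementation over sets of labels can be plugged in.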
Note that the only difference between weak ODTA and ODTA is the equality test on the data values in neighboring nodes. This difference is the cause of the triple exponential leap in complexity, as stated in the following theorem.
Theorem 6.1.
The non-emptiness problem for weak ODTA is in NP.
Proof 6.2.
Let be a weak ODTA. Let be the input alphabet, set of states and output alphabet of , respectively.
We need the following notation. For a tree , its extended tree (with respect to the weak ODTA ) is a tree over the alphabet , where
-
•
the projection of to is ;
-
•
the projection of to is an accepting run of on such that its output is the projection of to .
The decision procedure for Theorem 6.1 works as follows.
-
1.
Construct an automaton over the alphabet for the extended trees accepted by .
-
2.
Let be the set of symbols used in .
By applying Proposition 2.3, construct the Presburger formula for .
-
3.
Let . Let be the following formula:
-
4.
Test the non-emptiness of APC .
That this procedure works in NP follows directly from the fact that the non-emptiness problem of APC is in NP.
We now show the correctness of our algorithm by showing that if and only if . (For the sake of presentation, we write without its free variables.) We start with the “only if” part. Suppose that . We claim that the extended tree of is accepted by . Obviously, . To show that holds, let be the data tree obtained by projecting to and the data value in each node comes from the same node in . That is, is an output of on . We will show that holds.
-
•
As witness to , we take . Since , by Proposition 2.3, holds.
-
•
As witness to , we take . Now for each , the constraint holds since the number of data values in the -nodes cannot exceed the number of -nodes itself. The constraint holds, for each , since the data values found in -nodes are all different.
Thus, holds, and this concludes our proof of the “only if” part.
Now we prove the “if” part. Suppose that . So . Let and be the - and -projection of , respectively. By the definition of , is an output of on . Now since holds, in particular there exists a witness to such that holds, by Proposition 2.3, there exists a word over the alphabet such that .
We are going to assign data values to the nodes of (thus, also to those of ) such that . The assignment is done as follows. For each , let be the set of positions of labeled with . Now for each , we assign the -nodes in with the data values from such that . This is possible due to the constraint .
With such assignment, we get . Thus, . Moreover, for every , all the data values in -nodes are different, which follows from the constraint . Therefore, the resulting ordered-data tree . This concludes our proof.
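The assignment used in the “if” part can be sketched in Python. This is only an illustration under assumptions: the word is taken to be a list of non-empty sets of labels, positions are numbered from 1 and double as data values, and the two counting constraints are read as "at least as many -nodes as positions containing the label" and "exactly as many for labels in the key set". The function name `assign_values` is hypothetical.

```python
def assign_values(word, labels, gamma0):
    """word: list of non-empty sets of labels (position i is data value i).
    labels: the label of each node of the output tree, in any fixed order.
    gamma0: labels whose nodes must carry pairwise different data values.
    Returns one data value per node, or None if the counts do not fit."""
    positions = {}
    for i, letter in enumerate(word, start=1):
        for a in letter:
            positions.setdefault(a, []).append(i)
    nodes_of = {}
    for idx, a in enumerate(labels):
        nodes_of.setdefault(a, []).append(idx)
    if set(positions) != set(nodes_of):
        return None  # some label occurs in the word but not in the tree, or vice versa
    values = [None] * len(labels)
    for a, idxs in nodes_of.items():
        k = positions[a]
        if len(idxs) < len(k) or (a in gamma0 and len(idxs) != len(k)):
            return None  # counting constraint violated
        for j, idx in enumerate(idxs):
            # the first len(k) nodes cover every position; surplus nodes reuse the last one
            values[idx] = k[j] if j < len(k) else k[-1]
    return values
```

With this assignment every position of the word is used by each of its labels, so reading the label sets in increasing order of data values gives back the word.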
Next, we give the logical characterisation of weak ODTA.
Theorem 6.3.
A language is accepted by a weak ODTA if and only if is expressible with a formula of the form: , where is a formula from , and is a formula from , extended with the unary predicates .
The proof of Theorem 6.3 is the same as the proof of Theorem 5.9. The difference is that to simulate the formula , the profile information is not necessary. The complexity of the translation is still the same as in Theorem 5.9.
6.1 Extending weak ODTA with Presburger constraints
Like in the case of APC, we can extend weak ODTA with Presburger constraints without increasing the complexity of its non-emptiness problem. Let be a weak ODTA, where and are the input and output alphabets of , respectively. Let .
A weak ODTA extended with Presburger constraint is a tuple , where is an existential Presburger formula with the free variables . An ordered-data tree is accepted by if there exists an output of on such that the automaton accepts , for each , all -nodes in have different data values, and holds. We write to denote the language accepted by .
We claim that the non-emptiness problem of weak ODTA extended with Presburger constraint is still decidable in NP. The reason is as follows. The non-emptiness of a weak ODTA is checked by converting into an APC , where expresses linear constraints on the number of nodes labeled with symbols from and as well as those labeled with in the accepting run. The formula can be appropriately “inserted” into , and hence, the non-emptiness of is reducible to non-emptiness of APC, which is in NP.
6.2 Comparison with other known decidable formalisms
We are going to compare the expressiveness of weak ODTA with other known models with decidable non-emptiness.
6.2.1 DTD with integrity constraints
An XML document is typically viewed as a data tree. The most common XML formalism is Document Type Definition (DTD). In short, a DTD is a context free grammar and a tree conforms to a DTD , if it is a derivation tree of a word accepted by the context free grammar.
The most commonly used XML constraints are integrity constraints which are of two types.
-
•
A key constraint is a constraint of the following form:
-
•
An inclusion constraint is a constraint of the following form:
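As an illustration, both kinds of constraints can be checked directly on the (label, data value) pairs of a tree. The sketch below assumes the usual unary reading: a key constraint on a label says that all nodes with that label carry pairwise different data values, and an inclusion constraint from one label to another says that every data value found in nodes of the first label is also found in some node of the second. The function names are hypothetical.

```python
def values_of(nodes, a):
    """Set of data values found in the a-nodes; nodes is a list of (label, value)."""
    return {value for label, value in nodes if label == a}

def satisfies_key(nodes, a):
    """Key constraint on a: no two a-nodes share a data value."""
    count = sum(1 for label, _ in nodes if label == a)
    return count == len(values_of(nodes, a))

def satisfies_inclusion(nodes, a, b):
    """Inclusion constraint from a to b: every value of an a-node occurs in a b-node."""
    return values_of(nodes, a) <= values_of(nodes, b)
```

Note that neither check looks at the tree structure, which is handled separately by the DTD.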
The satisfiability problem of a given DTD and a collection of integrity constraints asks whether there exists an ordered-data tree that conforms to the DTD that satisfies all the constraints in . In [Fan and Libkin (2002)] it is shown that this problem is NP-complete.
Theorem 6.4.
Given a DTD and a collection of integrity constraints, one can construct a weak ODTA such that is precisely the set of ordered-data trees that conform to and satisfy all constraints in .
Proof 6.5.
Let be the alphabet of the given DTD . Consider the following weak ODTA .
-
•
is an identity transducer that checks whether the input tree conforms to DTD .
-
•
is an automaton that accepts , where .
-
•
.
That is the desired ODTA follows immediately from the fact that for every ordered-data tree , if and only if for all where , but .
The size of the automaton , hence the size of , produced by our construction in Theorem 6.4 is exponential. This blow-up is tight, as the following example shows. Consider the case where does not contain inclusion constraints. That is, contains only key constraints. Then any equivalent ODTA will have . Thus, we have an exponential blow-up in the size of . Nevertheless, if we are concerned only with satisfiability, then we can lower the complexity to NP, as stated in the following theorem.
Theorem 6.6.
Given a DTD and a collection of integrity constraints, one can construct a weak ODTA in non-deterministic polynomial time such that if and only if there exists an ordered-data tree that conforms to and satisfies all the constraints in .
Proof 6.7.
Let be the alphabet of the DTD . We non-deterministically construct a weak ODTA as follows.
-
•
is an identity transducer that checks whether the input tree conforms to DTD .
-
•
Guess a sequence of some subsets of such that
-
–
is partitioned into ;
-
–
for every two different symbols , are in the same set if and only if both and are in ;
-
–
if and , then and and .
Intuitively, the sequence tells us the ordering of the elements in that respects the inclusion constraints in , where if both and are in , then and are tied and they must be in the same set .
-
•
Let be such that , where and .
-
•
is a non-deterministic automaton over the alphabet , where the set of states is , all are the initial states and the final states, and the transitions are: for every .
-
•
.
We claim that if and only if there exists an ordered-data tree that conforms to and satisfies all the constraints in .
We start with the “if” direction. Suppose conforms to the DTD and satisfies all the constraints in . For each , let be the number of data values found in the -nodes in . Let be a sequence of some subsets of such that
-
•
is partitioned into ;
-
•
for every two different symbols , are in the same set if and only if ;
-
•
and and if and only if .
Consider the following ordered-data tree over , where is obtained from by reassigning the data values on the nodes in as follows. For each , we assign the set of integers as the data values of -nodes in . Such an assignment is possible since is no more than the number of -nodes in . With such an assignment, still obeys the constraints in , as shown below.
-
•
If , then is precisely the number of -nodes in , thus, also in . Thus, with the data values , the data values on the -nodes in are all different.
-
•
If , then obviously, . Thus, still satisfies the constraint , since the data values in -nodes in are , while those in -nodes are .
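The reassignment step can be sketched as follows. This is a minimal illustration, assuming that the set of integers assigned to the nodes of a label is the initial segment from 1 up to the number of distinct data values originally found in those nodes; the function name `canonical_reassign` is hypothetical.

```python
def canonical_reassign(nodes):
    """nodes: list of (label, value) pairs of the original tree.
    Returns the same nodes, in the same order, where the a-nodes use exactly
    the values 1..n_a, with n_a the number of distinct values on a-nodes."""
    distinct = {}
    for label, value in nodes:
        distinct.setdefault(label, set()).add(value)
    used = {}
    result = []
    for label, _ in nodes:
        n_a = len(distinct[label])
        k = used.get(label, 0)
        # hand out 1, 2, ..., n_a first; surplus a-nodes repeat the value n_a
        result.append((label, min(k + 1, n_a)))
        used[label] = k + 1
    return result
```

If a key constraint holds for a label, the number of its nodes equals the number of its distinct values, so no value is repeated after reassignment. If the values of one label were included in those of another, the first count is at most the second, so the inclusion still holds between the two initial segments. The reassigned tree may satisfy additional inclusions that the original tree did not.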
Now the string is of the form , where , and if , then for some in the sequence . By the definition of , is accepted by . That is accepted by is trivial, and so is the fact that all the data values found in -nodes in are different, for each . Thus, .
For the “only if” direction, it is sufficient to observe that for every sequence that “respects” the inclusion constraints in as explained above, if , then satisfies all the inclusion constraints in . This completes our proof.
6.2.2 Set and linear constraints for data trees
In the paper [David et al. (2012)] the set and linear constraints are introduced for data trees. As argued there, those constraints, together with automata, are able to capture many interesting properties commonly used in XML practice. We review those constraints and show how they can be captured by weak ODTA extended with Presburger constraints.
Data-terms (or just terms) are given by the grammar
The semantics of is defined with respect to a data tree :
Recall that – the set of data values found in the data tree .
A set constraint is either or , where is a term. A data tree satisfies , written as , if and only if (and likewise for ).
A linear constraint over the alphabet is a linear constraint on the variables , for each and , for each . A data tree satisfies , if holds by interpreting as the number of -nodes in , and the cardinality .
Theorem 6.8.
Given a tree automaton and a set of set and linear constraints, there exists a weak ODTA extended with Presburger constraints such that is precisely the set of ordered-data trees accepted by that satisfy all the constraints in . Moreover, the construction of takes exponential time in the size of and .
Proof 6.9.
The proof is simply a restatement of the proof in [David et al. (2012)] into a language of weak ODTA. We need the following notation. For a data term , we define a family of subsets of as follows.
-
•
If , then .
-
•
If , then .
-
•
If , then , where is or .
It follows that for every data tree , we have . Recall that the sets ’s are disjoint.
The desired is defined as follows. The transducer is the identity transducer , and . The automaton accepts a word if and only if
-
C1.
for every set constraint , does not contain any symbol from ;
-
C2.
for every set constraint , contains at least one symbol from .
The formula is the conjunction of all the linear constraints in .
That is indeed precisely the set of ordered-data trees accepted by that satisfy all the constraints in follows immediately from the definition of . The exponential upper bound arises in constructing the automaton , which requires enumerating each element of and checking that both conditions C1 and C2 are satisfied. This completes the proof of Theorem 6.8.
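The family of subsets attached to a data term and the conditions C1 and C2 can be illustrated with a short Python sketch. It rests on assumptions: terms are taken to be built from atoms of the form "values of a label" using complement, intersection and union, encoded here as nested tuples; the family of an atom is the set of all label sets containing that label; and a word is a list of label sets. All names (`powerset`, `family`, `word_satisfies`) are hypothetical. The explicit enumeration of all subsets mirrors the exponential cost mentioned in the proof.

```python
from itertools import combinations

def powerset(gamma):
    """All subsets of the alphabet gamma, as frozensets."""
    items = sorted(gamma)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def family(term, gamma):
    """Family of label sets for a term encoded as ("V", a), ("not", t),
    ("and", t1, t2) or ("or", t1, t2)."""
    kind = term[0]
    if kind == "V":
        return {s for s in powerset(gamma) if term[1] in s}
    if kind == "not":
        return set(powerset(gamma)) - family(term[1], gamma)
    left, right = family(term[1], gamma), family(term[2], gamma)
    return left & right if kind == "and" else left | right

def word_satisfies(word, constraints, gamma):
    """constraints: list of (term, must_be_empty).
    C1: an 'empty' constraint forbids every letter from the term's family.
    C2: a 'non-empty' constraint requires at least one such letter."""
    letters = {frozenset(letter) for letter in word}
    for term, must_be_empty in constraints:
        hit = bool(letters & family(term, gamma))
        if hit == must_be_empty:
            return False
    return True
```

For example, the term encoding "values of a that are not values of b" has the single-element family consisting of the set containing only a, so requiring it to be empty expresses an inclusion constraint from a to b.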
6.2.3 over text
Here we focus our attention on ordered-data words, which can be viewed as trees where each node has at most one child. We write to denote an ordered-data word in which position has label and data value . It is called a text if all the data values are different and the set of data values is precisely .
It is shown in [Manuel (2010)] that the satisfiability problem for over text is decidable. (Footnote: The definition of text in [Manuel (2010)] is slightly different, but it is equivalent to our definition. However, the key lemma proved in [Manuel (2010)] has a gap, which was later filled in [Figueira (2012b)]. The final result is still correct.) The following theorem shows that this decidability can be obtained via weak ODTA.
Theorem 6.10.
For every formula , one can construct effectively a weak ODTA such that
-
•
for every text , if , then ;
-
•
for every ordered-data word , there exists a text such that .
The construction of takes double exponential time in the size of .
Proof 6.11.
In [Manuel (2010)], the decidability is proved by constructing its so called text automata, also defined in [Manuel (2010)]. We review the precise definition here. Let be a text over the alphabet . Therefore, is such that each is a singleton.
We define , the marked string projection of , as the word , where and
A text automaton over the alphabet is a pair , where
-
•
is a non-deterministic letter-to-letter word transducer with the input alphabet and the output alphabet .
-
•
is a non-deterministic finite state automaton over .
A text is accepted by the text automaton , if
-
•
is accepted by , yielding a string ;
-
•
the string is accepted by , where the indexes are such that .
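The second acceptance condition reads the transducer output in the order of the data values rather than in the order of the positions. The following Python sketch illustrates this reordering, together with a check of the text property. It assumes that a text of length n uses exactly the data values 1 to n; the marked string projection and the transducer itself are not modelled, and the names `is_text` and `reorder_by_data` are hypothetical.

```python
def is_text(word):
    """word: list of (label, data_value) pairs.
    True iff the data values are exactly 1..len(word), each used once."""
    values = [d for _, d in word]
    return sorted(values) == list(range(1, len(word) + 1))

def reorder_by_data(output_letters, word):
    """output_letters[i] is the letter produced at position i of word.
    Returns the letters listed in increasing order of the data values."""
    order = sorted(range(len(word)), key=lambda i: word[i][1])
    return [output_letters[i] for i in order]
```

The reordered list is the string that would be handed to the finite state automaton in the definition above.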
It is shown in [Manuel (2010)] that for every , one can construct effectively a text automaton such that for every text , if and only if .
Now we are going to show how to get the desired ODTA . Let be the text automaton as above. On input ordered-data word , performs the following.
-
•
The automaton simulates , by guessing and outputting its -projection, while storing its -projection in its states.
-
•
The automaton is simply .
It is straightforward to see that such is the desired weak ODTA. The analysis of the complexity is as follows. The construction of the text automaton takes double exponential time in the size of . See [Manuel (2010), Lemmas 5 and 6]. The construction of ODTA takes polynomial time in the size of . Altogether, it takes double exponential time to construct from the original formula . This completes the proof of Theorem 6.10.
7 An Undecidable Extension
In this section we would like to remark on an undecidable extension of ODTA. Recall the language in Example 3.3. It has already been noted in the proof of Proposition 5.6 that its complement is not accepted by any ODTA. Formally, the complement of the language in Example 3.3 can be expressed with a formula of the form:
(15) |
where and denotes the transitive closure of . In the following we are going to show that given an ODTA and a collection of formulas of the form (15), it is undecidable to check whether there is an ordered-data tree such that , for all .
The proof is simply an observation that the proof of [Bojanczyk et al. (11a), Proposition 29] can be applied directly here. In [Bojanczyk et al. (11a), Proposition 29] it is proved that the satisfiability of is undecidable. (Footnote: Technically, the undecidability in [Bojanczyk et al. (11a), Proposition 29] is proved on data strings over the logic , which, of course, is equivalent to .) The reduction is from the Post Correspondence Problem (PCP): given an instance of PCP, one can effectively construct a formula of the form , where and is a formula of the form (15). Since can be captured by ODTA, the undecidability of ODTA extended with formulas of the form (15) follows immediately.
At this point we would also like to point out that extending ODTA with operations such as addition on data values immediately yields undecidability. This can be deduced from [Halpern (1991)], where it is shown that addition together with unary predicates yields undecidability.
8 When the Data Values are Strings
In this section we discuss data trees where the data values are strings from , instead of natural numbers. We call such trees string data trees. There are two common kinds of order for strings: the prefix order and the lexicographic order. Strings with the lexicographic order simply form a linearly ordered domain; thus, ODTA can be applied directly in this case.
For the prefix order, we have to modify the definition of ODTA. Consider a string data tree over the alphabet . Let be the set of data values found in . We define as a tree over the alphabet , where
-
•
is ;
-
•
for , is a parent of if is a prefix of and there is no such that is a prefix of and is a prefix of ;
-
•
for the label of is , if ; and root, if .
We call the tree representation of the data values in . Consider an example of a string data tree in Figure 2. We have
So , and
-
•
is the parent of and ;
-
•
is the parent of and ; and
-
•
is the parent of .
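The parent relation of the tree representation can be computed directly from the set of data values: the parent of a string is its longest proper prefix among the data values, or the empty string (the root) if there is none. The Python sketch below illustrates this; the function name `prefix_tree` and the sample strings in the usage note are hypothetical and are not the values of Figure 2.

```python
def prefix_tree(values):
    """values: set of non-empty strings (the data values of the tree).
    Returns a dict mapping each value to its parent; the root is ''."""
    nodes = set(values) | {""}
    parent = {}
    for u in values:
        if u == "":
            continue  # the empty string is the root and has no parent
        # all proper prefixes of u among the nodes; '' always qualifies
        candidates = [v for v in nodes if v != u and u.startswith(v)]
        # the longest one has no other node strictly between it and u
        parent[u] = max(candidates, key=len)
    return parent
```

For instance, on the values a, ab, abc, b and bca, the strings a and b become children of the root, ab is a child of a, abc is a child of ab, and bca is a child of b.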
Now an ODTA for string data trees is , where is a letter-to-letter transducer from to ; is an unranked tree automaton over the alphabet ; . The requirement for acceptance is the same as in Section 5, except that takes a tree over the alphabet as the input.
We observe that in the proof of the decidability of the non-emptiness of ODTA , the automaton is converted in polynomial time into a Presburger formula by applying Proposition 2.3, which actually holds for tree automata. Hence, the decision procedures in Sections 5 and 6 can also be applied to string data trees.
9 Concluding Remarks
In this paper we study data trees in which the data values come from a linearly ordered domain, where in addition to equality test, we can test whether the data value in one node is greater than the other. We introduce ordered-data tree automata (ODTA), provide its logical characterisation, and prove that its non-emptiness problem is decidable. We also show the logic can be captured by ODTA.
Then we define weak ODTA, which essentially are ODTA without the ability to perform equality test on data values on two adjacent nodes. We provide its logical characterisation. We show that a number of existing formalisms and models studied in the literature so far can be captured already by weak ODTA. We also show that the definition of ODTA can be easily modified, to the case where the data values come from a partially ordered domain, such as strings.
We believe that the notion of ODTA provides new techniques to reason about ordered-data values on unranked trees, and thus, can find potential applications in practice. We also prove that ODTA capture various formalisms on data trees studied so far in the literature. As far as we know this is the first formalism for data trees with neat logical and automata characterisations.
Acknowledgement. The author would like to thank FWO for their generous financial support under the scheme FWO Marie Curie fellowship. The author also thanks Egor V. Kostylev for careful proofreading of this paper, as well as Nadime Francis for pointing out the reference [Halpern (1991)]. Finally, the author thanks the anonymous referees of both the conference and the journal versions for their careful reading and comments, which greatly improved the paper.
References
- Alon et al. (2003) Alon, N., Milo, T., Neven, F., Suciu, D., and Vianu, V. 2003. XML with data values: typechecking revisited. Journal of Computer and System Sciences 66, 4, 688–727.
- Arenas et al. (2008) Arenas, M., Fan, W., and Libkin, L. 2008. On the complexity of verifying consistency of XML specifications. SIAM Journal on Computing 38, 3, 841–880.
- Benedikt et al. (2010) Benedikt, M., Ley, C., and Puppis, G. 2010. Automata vs. logics on data words. In CSL.
- Björklund and Bojanczyk (2007) Björklund, H. and Bojanczyk, M. 2007. Bounded depth data trees. In ICALP.
- Björklund et al. (2008) Björklund, H., Martens, W., and Schwentick, T. 2008. Optimizing conjunctive queries over trees using schema information. In MFCS.
- Bojanczyk et al. (11a) Bojanczyk, M., David, C., Muscholl, A., Schwentick, T., and Segoufin, L. 2011a. Two-variable logic on data words. ACM Transactions on Computational Logic 12, 4, 27.
- Bojanczyk et al. (11b) Bojanczyk, M., Klin, B., and Lasota, S. 2011b. Automata with group actions. In LICS. 355–364.
- Bojanczyk et al. (2009) Bojanczyk, M., Muscholl, A., Schwentick, T., and Segoufin, L. 2009. Two-variable logic on data trees and XML reasoning. Journal of the ACM 56, 3.
- Bouyer et al. (2001) Bouyer, P., Petit, A., and Thérien, D. 2001. An algebraic characterization of data and timed languages. In CONCUR.
- Comon et al. (2007) Comon, H., Dauchet, M., Gilleron, R., Löding, C., Jacquemard, F., Lugiez, D., Tison, S., and Tommasi, M. 2007. Tree automata techniques and applications. Available on: http://www.grappa.univ-lille3.fr/tata. release October, 12th 2007.
- David et al. (2010) David, C., Libkin, L., and Tan, T. 2010. On the satisfiability of two-variable logic over data words. In LPAR.
- David et al. (2012) David, C., Libkin, L., and Tan, T. 2012. Efficient reasoning about data trees via integer linear programming. ACM Transactions on Database Systems 37, 3, 19.
- Demri et al. (2007) Demri, S., D’Souza, D., and Gascon, R. 2007. A decidable temporal logic of repeating values. In LFCS.
- Demri and Lazić (2009) Demri, S. and Lazić, R. 2009. LTL with the freeze quantifier and register automata. ACM Transactions on Computational Logic 10, 3.
- Deutsch et al. (2009) Deutsch, A., Hull, R., Patrizi, F., and Vianu, V. 2009. Automatic verification of data-centric business processes. In ICDT.
- Fan and Libkin (2002) Fan, W. and Libkin, L. 2002. On XML integrity constraints in the presence of DTDs. Journal of the ACM 49, 3, 368–406.
- Figueira (2009) Figueira, D. 2009. Satisfiability of downward XPath with data equality tests. In PODS.
- Figueira (2011) Figueira, D. 2011. A decidable two-way logic on data words. In LICS. 365–374.
- Figueira (2012a) Figueira, D. 2012a. Alternating register automata on finite data words and trees. Logical Methods in Computer Science 8, 1.
- Figueira (2012b) Figueira, D. 2012b. Satisfiability for two-variable logic with two successor relations on finite linear orders. http://arxiv.org/abs/1204.2495.
- Figueira et al. (2010) Figueira, D., Hofman, P., and Lasota, S. 2010. Relating timed and register automata. In EXPRESS.
- Figueira and Segoufin (2011) Figueira, D. and Segoufin, L. 2011. Bottom-up automata on data trees and vertical XPath. In STACS.
- Grumberg et al. (2010) Grumberg, O., Kupferman, O., and Sheinvald, S. 2010. Variable automata over infinite alphabets. In LATA.
- Halpern (1991) Halpern, J. 1991. Presburger arithmetic with unary predicates is complete. Journal of Symbolic Logic 56, 2, 637–642.
- Jurdzinski and Lazic (2011) Jurdzinski, M. and Lazic, R. 2011. Alternating automata on data trees and XPath satisfiability. ACM Transactions on Computational Logic 12, 3, 19.
- Kaminski and Francez (1994) Kaminski, M. and Francez, N. 1994. Finite-memory automata. Theoretical Computer Science 134, 2, 329–363.
- Kara et al. (2012) Kara, A., Schwentick, T., and Tan, T. 2012. Feasible automata for two-variable logic with successor on data words. In LATA.
- Lazić (2011) Lazić, R. 2011. Safety alternating automata on data words. ACM Transactions on Computational Logic 12, 2, 10.
- Libkin (2004) Libkin, L. 2004. Elements of Finite Model Theory. Springer.
- Manuel (2010) Manuel, A. D. 2010. Two orders and two variables. In MFCS.
- Neven (2002) Neven, F. 2002. Automata, logic, and XML. In CSL.
- Neven et al. (2004) Neven, F., Schwentick, T., and Vianu, V. 2004. Finite state machines for strings over infinite alphabets. ACM Transactions on Computational Logic 5, 3, 403–435.
- Schwentick (2004) Schwentick, T. 2004. XPath query containment. SIGMOD Record 33, 1, 101–109.
- Schwentick and Zeume (2010) Schwentick, T. and Zeume, T. 2010. Two-variable logic with two order relations. In CSL.
- Segoufin and Torunczyk (2011) Segoufin, L. and Torunczyk, S. 2011. Automata based verification over linearly ordered data domains. In STACS.
- Seidl et al. (2004) Seidl, H., Schwentick, T., Muscholl, A., and Habermehl, P. 2004. Counting in trees for free. In ICALP.
- Thatcher (1967) Thatcher, J. 1967. Characterizing derivation trees of context-free grammars through a generalization of finite automata theory. Journal of Computer and System Sciences 1, 4, 317–322.
- Thomas (1997) Thomas, W. 1997. Languages, automata, and logic. In Handbook of Formal Languages, Vol. 3. 389–455.
- Verma et al. (2005) Verma, K. N., Seidl, H., and Schwentick, T. 2005. On the complexity of equational Horn clauses. In CADE.