

Extending two-variable logic on data trees with order on data values and its automata

Tony Tan
Hasselt University and Transnational University of Limburg
Abstract

Data trees are trees in which each node, besides carrying a label from a finite alphabet, also carries a data value from an infinite domain. They have been used as an abstraction model for reasoning tasks on XML and verification. However, most existing approaches consider the case where only equality test can be performed on the data values.

In this paper we study data trees in which the data values come from a linearly ordered domain, and in addition to equality test, we can test whether the data value in a node is greater than the one in another node. We introduce an automata model for them which we call ordered-data tree automata (ODTA), provide its logical characterisation, and prove that its non-emptiness problem is decidable in 3-NExpTime. We also show that the two-variable logic on unranked data trees, studied by Bojanczyk, Muscholl, Schwentick and Segoufin in 2009, corresponds precisely to a special subclass of this automata model.

Then we define a slightly weaker version of ODTA, which we call weak ODTA, and provide its logical characterisation. The complexity of the non-emptiness problem drops to NP. However, a number of existing formalisms and models studied in the literature can be captured already by weak ODTA. We also show that the definition of ODTA can be easily modified, to the case where the data values come from a tree-like partially ordered domain, such as strings.

Categories: F.1.1 [Models of Computation]: Automata; F.4.1 [Mathematical Logic]: Computational logic
Keywords: Finite-state automata, Two-variable logic, Data trees, Ordered data values
General terms: Languages

The extended abstract of this article has been published in the proceedings of LICS 2012, under the title: “An Automata Model for Trees with Ordered Data Values.” The author is FWO Pegasus Marie Curie Fellow.

1 Introduction

Classical automata theory studies words and trees over finite alphabets. Recently there has been a growing interest in the so-called “data” words and trees, that is, words and trees in which each position, besides carrying a label from a finite alphabet, also carries a data value from an infinite domain.

Interest in such structures with data stems from their connection to XML [Alon et al. (2003), Arenas et al. (2008), Björklund et al. (2008), David et al. (2012), Fan and Libkin (2002), Figueira (2009), Neven (2002)], as well as system specifications [Bouyer et al. (2001), Demri et al. (2007), Segoufin and Torunczyk (2011)], where many properties simply cannot be captured by finite alphabets. This has motivated various works on data words [Benedikt et al. (2010), Bojanczyk et al. (11a), Demri and Lazić (2009), Grumberg et al. (2010), Kaminski and Francez (1994), Neven et al. (2004)], as well as on data trees [Björklund and Bojanczyk (2007), Bojanczyk et al. (2009), Figueira (2012a), Figueira and Segoufin (2011), Jurdzinski and Lazic (2011)]. The common feature of these works is the addition of an equality test on the data values to the logic on trees. While for finitely-labeled trees many logical formalisms (e.g., the monadic second-order logic MSO) are decidable by converting formulae to automata, even FO (first-order logic) on data words extended with data-equality is already undecidable. See, e.g., [Bojanczyk et al. (11a), Fan and Libkin (2002), Neven et al. (2004)].

Thus, there is a need for expressive enough, while computationally well-behaved, frameworks to reason about structures with data values. This has been quite a common theme in XML and system specification research. It has largely followed two routes. The first takes a specific reasoning task, or a set of similar tasks, and builds algorithms for them (see, e.g., [Arenas et al. (2008), Figueira (2011), Björklund et al. (2008), Schwentick (2004), Fan and Libkin (2002), Figueira (2009)]). The second looks for sufficiently general automata models that can express reasoning tasks of interest, but are still decidable (see, e.g., [Demri and Lazić (2009), Bojanczyk et al. (2009), Jurdzinski and Lazic (2011), Segoufin and Torunczyk (2011)]).

Both approaches usually assume that data values come from an abstract set equipped only with the equality predicate. This is already sufficient to capture a wide range of interesting applications both in databases and verification. However, it has been advocated in [Deutsch et al. (2009)] that comparisons based on a linear order over the data values could be useful in many scenarios, including data-centric applications built on top of a database.

So far, little work has been done in this direction. A few works such as [Manuel (2010), Figueira (2011), Schwentick and Zeume (2010), Segoufin and Torunczyk (2011)] are on words, while in most applications we need to consider trees. Moreover, these works are incomparable to some interesting existing formalisms [Fan and Libkin (2002), Bojanczyk et al. (2009), Arenas et al. (2008), David et al. (2012), Jurdzinski and Lazic (2011), Demri and Lazić (2009), Lazić (2011)] known to be able to capture various interesting scenarios common in practice. On top of that, many useful techniques, notably those introduced in [Fan and Libkin (2002), Bojanczyk et al. (11a), Bojanczyk et al. (2009), Jurdzinski and Lazic (2011)], can deal only with data equality, and are highly dependent on specific combinatorial properties of the formalisms. They are rather hard to adapt to other more specific tasks, let alone to generalise to more relations on data values, and they tend to produce extremely high complexity bounds, such as non-primitive-recursive, or at least as hard as the reachability problem in Petri nets. Furthermore, many known decidability results are lost as soon as we add the order relation on data values. Some exceptions are [Figueira et al. (2010), Figueira (2012a)].

In this paper we study the notion of data trees in which the data values come from a linearly ordered domain, which we call ordered-data trees. In addition to equality tests on the data values, in ordered-data trees we are allowed to test whether the data value in a node is greater than the data value in another node. To the extent it is possible, we aim to unify various ad hoc methods introduced to reason about data trees, and generalise them to ordered-data trees to make them more accessible and applicable in practice. This paper is the first step, where we introduce an automata model for ordered-data trees, provide its logical characterisation, and prove that it has a decidable non-emptiness problem. Moreover, we also show that it can capture various well-known formalisms.

Brief description of the results in this paper

The trees that we consider are unranked trees where there is no a priori bound on the number of children of a node. Moreover, we also have an order on the children of each node. We consider a natural logic for ordered-data trees, which consists of the following relations.

  • The parent relation $E_{\downarrow}$, where $E_{\downarrow}(x,y)$ means that node $x$ is the parent of node $y$.

  • The next-sibling relation $E_{\rightarrow}$, where $E_{\rightarrow}(x,y)$ means that nodes $x$ and $y$ have the same parent and $y$ is the next sibling of $x$.

  • The labeling predicates $a(\cdot)$, where $a(x)$ means that node $x$ is labeled with symbol $a$.

  • The data equality predicate $\sim$, where $x\sim y$ means that nodes $x$ and $y$ have the same data value.

  • The order relation on data $\prec$, where $x\prec y$ means that the data value in node $x$ is less than the one in node $y$.

  • The successive order relation on data $\prec_{suc}$, where $x\prec_{suc} y$ means that the data value in node $y$ is the minimal data value in the tree greater than the one in node $x$.

We introduce an automata model for ordered-data trees, which we call ordered-data tree automata (ODTA), and provide its logical characterisation. Namely, we prove that the class of languages accepted by ODTA corresponds precisely to those expressible by formulas of the form:

$$\exists X_{1}\cdots\exists X_{n}\ \varphi\wedge\psi, \qquad (1)$$

where

  • $X_{1},\ldots,X_{n}$ are monadic second-order predicates;

  • $\varphi$ is an FO formula restricted to two variables and using only the predicates $E_{\downarrow}$, $E_{\rightarrow}$, $\sim$, as well as the unary predicates $X_{1},\ldots,X_{n}$ and the labeling predicates $a(\cdot)$;

  • $\psi$ is an FO formula using only the predicates $\sim$, $\prec$, $\prec_{suc}$, as well as the unary predicates $X_{1},\ldots,X_{n}$ and the labeling predicates $a(\cdot)$.

We show that the logic $\exists\textrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$, first studied in [Bojanczyk et al. (2009)], corresponds precisely to a special subclass of ODTA, where $\exists\textrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$ denotes the set of formulas of the form (1) in which $\psi$ is a true formula. We then prove that the non-emptiness problem of ODTA is decidable in 3-NExpTime. Our main idea here is to show how to convert an ordered-data tree back to a string over a finite alphabet. (See our notion of string representation of data values in Section 3.) Such a conversion enables us to use classical finite state automata to reason about data values.

Then we define a slightly weaker version of ODTA, which we call weak ODTA. Essentially the only feature of ODTA missing in weak ODTA is the ability to test whether two adjacent nodes have the same data value. Without this simple feature, the complexity of the non-emptiness problem surprisingly drops three-fold exponentially to NP. We provide its logical characterisation by showing that it corresponds precisely to the languages expressible by the formulas of the form (1) where $\varphi$ does not use the predicate $\sim$. We show that a number of existing formalisms and models can be captured already by weak ODTA, namely those in [Fan and Libkin (2002), David et al. (2012), Manuel (2010)].

We should remark that [David et al. (2012)] studies a formalism which consists of tree automata and a collection of set and linear constraints. (We will later define formally what set and linear constraints are.) It is shown that the satisfiability problem of such a formalism is NP-complete. In fact, it is also shown in [David et al. (2012)] that a single set constraint (without tree automaton and linear constraint) already yields NP-hardness. Weak ODTA are essentially equivalent to the formalism in [David et al. (2012)] extended with the full expressive power of the first-order logic $\textrm{FO}(\sim,\prec,\prec_{suc})$. It is worth noting that despite such an extension, the non-emptiness problem remains in NP.

Finally we also show that the definition of ODTA can be easily modified to the case where the data values come from a partially ordered domain, such as strings. This work can be seen as a generalisation of the works in [David et al. (2010)] and [Kara et al. (2012)]. However, it must be noted that [David et al. (2010), Kara et al. (2012)] deal only with data words, where only equality test is allowed on the data values and there is no order on them.

Related works

Most of the existing works in this area are on data words. In the paper [Bojanczyk et al. (11a)] the model of data automata was introduced, and it was shown that it captures the logic $\exists\textrm{MSO}^{2}(\sim,<,+1)$, the fragment of existential monadic second order logic in which the first order part uses only two variables and the predicates: the data equality $\sim$, as well as the order $<$ and the successor $+1$ on the domain.

An important feature of data automata is that their non-emptiness problem is decidable, even for infinite words, but is at least as hard as reachability for Petri nets. It was also shown that the satisfiability problem for the three-variable first order logic is undecidable. Later in [David et al. (2010)] an alternative proof was given for the decidability of the weaker logic $\exists\textrm{MSO}^{2}(+1,\sim)$. The proof gives a decision procedure with an elementary upper bound for the satisfiability problem of $\exists\textrm{MSO}^{2}(+1,\sim)$ on strings. Recently in [Kara et al. (2012)] an automata model that captures precisely the logic $\exists\textrm{MSO}^{2}(+1,\sim)$, both on finite and infinite words, is proposed.

Another logical approach is via the so-called linear temporal logic with freeze quantifier, introduced in [Demri and Lazić (2009)]. Intuitively, these are LTL formulas equipped with a finite number of registers to store the data values. We denote by $\textrm{LTL}_{n}^{\downarrow}[\texttt{X},\texttt{U}]$ the LTL with freeze quantifier, where $n$ denotes the number of registers and the only temporal operators allowed are the neXt operator $\texttt{X}$ and the Until operator $\texttt{U}$. It was shown that alternating register automata with $n$ registers ($\textrm{RA}_{n}$) accept all $\textrm{LTL}_{n}^{\downarrow}[\texttt{X},\texttt{U}]$ languages and the non-emptiness problem for alternating $\textrm{RA}_{1}$ is decidable. However, the complexity is non-primitive-recursive. Hence, the satisfiability problem for $\textrm{LTL}_{1}^{\downarrow}(\texttt{X},\texttt{U})$ is decidable as well. Adding one more register or the past time operator $\texttt{U}^{-1}$ to $\textrm{LTL}_{1}^{\downarrow}(\texttt{X},\texttt{U})$ makes the satisfiability problem undecidable. In [Figueira et al. (2010), Figueira (2012a)] it is shown that alternating $\textrm{RA}_{1}$ can be extended to strings with linearly ordered data values, and the emptiness problem is still decidable. In [Lazić (2011)] a weaker version of alternating $\textrm{RA}_{1}$, called safety alternating $\textrm{RA}_{1}$, is considered, and the non-emptiness problem is shown to be EXPSPACE-complete.

A model for data words with linearly ordered data values was proposed in [Segoufin and Torunczyk (2011)]. The model consists of an automaton equipped with a finite number of registers, and its transitions are based on constraints on the data values stored in the registers. It is shown that the non-emptiness problem for this model is decidable in PSPACE. However, no logical characterisation is provided for such a model.

In [Bojanczyk et al. (11b)] another type of register automata for words was introduced and studied, which is a generalisation of the original register automata introduced by Kaminski and Francez [Kaminski and Francez (1994)], where the data values also can come from a linearly ordered domain. Thus, the order comparison, not just equality, can be performed on data values. The generalisation is done via the notion of monoid for data words, and is incomparable with our model here. In the terminology of the original register automata defined in [Kaminski and Francez (1994)], it is simply register automata extended with testing whether the data value currently read is bigger/smaller than those in the registers.

It is shown in [Manuel (2010)] that the satisfiability problem for $\textrm{FO}^{2}(+1,\prec_{suc})$ over text is decidable. A text is simply a data word in which all the data values are different and they range over the positive integers from $1$ to $n$, for some $n\geq 1$. We will see later that the satisfiability problem for $\textrm{FO}^{2}(+1,\prec_{suc})$ can be reduced to the non-emptiness problem of our model.

In [Schwentick and Zeume (2010)] it is shown that the satisfiability problem of the logic $\textrm{FO}^{2}(<,\prec)$ on words is decidable. This logic is incomparable with our model. However, it should be noted that $\textrm{FO}^{2}(<)$ cannot capture the whole class of regular languages.

The work on data trees that we are aware of is in [Bojanczyk et al. (2009), Jurdzinski and Lazic (2011)]. In [Bojanczyk et al. (2009)] it was shown that the satisfiability problem for the logic $\exists\textrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$ over unranked trees is decidable in 3-NExpTime. However, no automata model is provided. We will see later how this logic corresponds precisely to a special subclass of ODTA.

In [Jurdzinski and Lazic (2011)] alternating tree register automata were introduced for trees. They are essentially the generalisation of the alternating $\textrm{RA}_{1}$ to the tree case. It was shown that this model captures the forward XPath queries. However, no logical characterisation is provided and the non-emptiness problem, though decidable, is non-primitive-recursive.

As mentioned earlier, the main idea in this paper is the conversion of the data values from an infinite domain back to a string over a finite alphabet. Roughly speaking, it works as follows. Given an ordered-data tree $t$, we show how to construct a string $w$ over a finite alphabet whose domain corresponds precisely to the data values in $t$. We then use a classical finite state automaton to reason about $w$, and thus, also about the data values in $t$. This idea is the main difference between our paper and the existing works. Most of the existing techniques rely on some specific combinatorial properties of the formalisms considered, which make them highly independent of one another. As we will see later, our model captures quite a few other formalisms without a significant jump in complexity.

Organisation

This paper is organised as follows. In Section 2 we give some preliminary background. In Section 3 we formally define the logic for ordered-data trees and present a few examples as well as notations that we need in this paper. In Section 4 we present two lemmas that we are going to need later on. We prove them in a quite general setting, as we think they are interesting in their own right. We introduce the ordered-data tree automata (ODTA) in Section 5 and weak ODTA in Section 6. In Section 7 we discuss a couple of undecidable extensions of weak ODTA. In Section 8 we describe how to modify the definition of ODTA when the data values are strings, that is, when they come from a partially ordered domain. Finally, we end with some concluding remarks in Section 9.

2 Preliminaries

In this section we review some definitions that we are going to use later on. We usually use $\Gamma$ and $\Sigma$ to denote finite alphabets. We write $2^{\Gamma}$ to denote an alphabet in which each symbol corresponds to a subset of $\Gamma$. In some cases, we may need the alphabet $2^{2^{\Gamma}}$, an alphabet in which each symbol corresponds to a set of subsets of $\Gamma$. We denote the set of natural numbers $\{0,1,2,\ldots\}$ by $\mathbb{N}$.

Usually we write $\mathcal{L}$ to denote a language, for both string and tree languages. When it is clear from the context, we use the term language to mean either a string language, or a tree language.

2.1 Finite state automata over strings and commutative regular languages

We usually write $\mathcal{M}$ to denote a finite state automaton on strings. The language accepted by the automaton $\mathcal{M}$ is denoted by $\mathcal{L}(\mathcal{M})$.

Let $\Sigma=\{a_{1},\ldots,a_{\ell}\}$. For a word $w\in\Sigma^{\ast}$, the Parikh image of $w$ is $\textsf{Parikh}(w)=(n_{1},\ldots,n_{\ell})$, where $n_{i}$ is the number of appearances of $a_{i}$ in $w$. For a vector $\bar{n}$, the inverse of the Parikh image of $\bar{n}$ is $\textsf{Parikh}^{-1}(\bar{n})=\{w\mid w\in\Sigma^{\ast}\ \mbox{and}\ \textsf{Parikh}(w)=\bar{n}\}$.
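
To make the definition concrete, the following Python sketch computes the Parikh image of a word with respect to a fixed ordering of the alphabet. The function name `parikh` and the encoding of words as Python strings are assumptions made for this illustration only.

```python
from collections import Counter

def parikh(word, alphabet):
    """Parikh image of `word` w.r.t. the ordered alphabet (a_1, ..., a_l)."""
    counts = Counter(word)
    # A Counter returns 0 for symbols that do not occur in the word.
    return tuple(counts[a] for a in alphabet)

alphabet = ("a", "b", "c")
print(parikh("abcab", alphabet))   # (2, 2, 1)
print(parikh("bacba", alphabet))   # (2, 2, 1): the order of the letters is ignored
```

Both words lie in the same inverse image $\textsf{Parikh}^{-1}((2,2,1))$.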

For $1\leq i\leq\ell$, a vector $\bar{v}=(n_{1},\ldots,n_{\ell})\in\mathbb{N}^{\ell}$ is called an $i$-base, if $n_{i}\neq 0$ and $n_{j}=0$, for all $j\neq i$. A language $\mathcal{L}$ is periodic, if there exist $(\ell+1)$ vectors $\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell}$ such that $\bar{u}\in\mathbb{N}^{\ell}$ and each $\bar{v}_{i}$ is an $i$-base and

$$\mathcal{L}=\bigcup_{h_{1},\ldots,h_{\ell}\geq 0}\textsf{Parikh}^{-1}(\bar{u}+h_{1}\bar{v}_{1}+\cdots+h_{\ell}\bar{v}_{\ell}).$$

We denote such a language $\mathcal{L}$ by $\mathcal{L}(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})$.
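
Because each $\bar{v}_{i}$ is an $i$-base, membership of a word in a periodic language can be checked one coordinate at a time: the $i$-th entry of the Parikh image must equal the $i$-th entry of $\bar{u}$ plus a non-negative multiple of the single non-zero entry of $\bar{v}_{i}$. The sketch below illustrates this observation; the names `parikh` and `in_periodic` and the example vectors are hypothetical.

```python
from collections import Counter

def parikh(word, alphabet):
    counts = Counter(word)
    return tuple(counts[a] for a in alphabet)

def in_periodic(word, alphabet, u, bases):
    """Check whether `word` is in L(u, v_1, ..., v_l).

    bases[i] is assumed to be a base for coordinate i (0-indexed here),
    so bases[i][i] is its only non-zero entry.
    """
    n = parikh(word, alphabet)
    for i, (n_i, u_i) in enumerate(zip(n, u)):
        period = bases[i][i]
        # We need n_i = u_i + h_i * period for some integer h_i >= 0.
        if n_i < u_i or (n_i - u_i) % period != 0:
            return False
    return True

# Words over {a, b} with an odd number of a's and any number of b's.
alphabet = ("a", "b")
u = (1, 0)
bases = [(2, 0), (0, 1)]
print(in_periodic("ab", alphabet, u, bases))    # True
print(in_periodic("aba", alphabet, u, bases))   # False
```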

A language $\mathcal{L}$ is commutative if it is closed under reordering. That is, if $w=b_{1}\cdots b_{m}\in\mathcal{L}$, and $\sigma$ is a permutation on $\{1,\ldots,m\}$, then $b_{\sigma(1)}\cdots b_{\sigma(m)}\in\mathcal{L}$.

The following result is folklore and can be proved easily.

Theorem 2.1.

A language is commutative and regular if and only if it is a finite union of periodic languages.

2.2 Unranked trees, tree automata and transducers

An unranked finite tree domain is a prefix-closed finite subset $D$ of $\mathbb{N}^{*}$ (words over $\mathbb{N}$) such that $u\cdot i\in D$ implies $u\cdot j\in D$ for all $j<i$ and $u\in\mathbb{N}^{*}$. Given a finite labeling alphabet $\Sigma$, a $\Sigma$-labeled unranked tree $t$ is a structure

$$\langle D,E_{\downarrow},E_{\rightarrow},\{a(\cdot)\}_{a\in\Sigma}\rangle,$$

where

  • $D$ is an unranked tree domain,

  • $E_{\downarrow}$ is the child relation: $(u,u\cdot i)\in E_{\downarrow}$ for all $u,u\cdot i\in D$,

  • $E_{\rightarrow}$ is the next-sibling relation: $(u\cdot i,u\cdot(i+1))\in E_{\rightarrow}$ for all $u\cdot i,u\cdot(i+1)\in D$, and

  • the $a(\cdot)$'s are labeling predicates, i.e. for each node $u$, exactly one of $a(u)$, with $a\in\Sigma$, is true.

We write $\textsf{Dom}(t)$ to denote the domain $D$. The label of a node $u$ in $t$ is denoted by $\ell ab_{t}(u)$. If $\ell ab_{t}(u)=a$, then we say that $u$ is an $a$-node.

An unranked tree automaton [Comon et al. (2007), Thatcher (1967)] over $\Sigma$-labeled trees is a tuple $\mathcal{A}=\langle Q,\Sigma,\delta,F\rangle$, where $Q$ is a finite set of states, $F\subseteq Q$ is the set of final states, and $\delta:Q\times\Sigma\to 2^{(Q^{*})}$ is a transition function; we require each $\delta(q,a)$ to be a regular language over $Q$ for all $q\in Q$ and $a\in\Sigma$.

A run of $\mathcal{A}$ over a tree $t$ is a function $\rho_{\mathcal{A}}:\textsf{Dom}(t)\to Q$ such that for each node $u$ with $n$ children $u\cdot 0,\ldots,u\cdot(n-1)$, the word $\rho_{\mathcal{A}}(u\cdot 0)\cdots\rho_{\mathcal{A}}(u\cdot(n-1))$ is in the language $\delta(\rho_{\mathcal{A}}(u),\ell ab_{t}(u))$. For a leaf $u$ labeled $a$, this means that $u$ could be assigned a state $q$ if and only if the empty word $\epsilon$ is in $\delta(q,a)$. A run is accepting if $\rho_{\mathcal{A}}(\epsilon)\in F$, i.e., if the root is assigned a final state. A tree $t$ is accepted by $\mathcal{A}$ if there exists an accepting run of $\mathcal{A}$ on $t$. The set of all trees accepted by $\mathcal{A}$ is denoted by $\mathcal{L}(\mathcal{A})$.
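
The definition of a run can be illustrated with a small brute-force acceptance check. In the Python sketch below, states are single characters, each $\delta(q,a)$ is given as a regular expression over those characters, and a tree is a pair `(label, children)`; these encodings, the helper names, and the example automaton are all assumptions of the sketch, and the enumeration of candidate child words is exponential, so it is meant for illustration rather than as an efficient algorithm.

```python
import re
from itertools import product

def possible_states(tree, states, delta):
    """Set of states q such that some run on `tree` assigns q to its root."""
    label, children = tree
    child_sets = [possible_states(c, states, delta) for c in children]
    result = set()
    for q in states:
        pattern = delta.get((q, label))
        if pattern is None:
            continue  # delta(q, label) is the empty language
        # Try every word q_1 ... q_n with q_i a possible state of the i-th child.
        # For a leaf, product() yields one empty tuple, i.e. the empty word.
        if any(re.fullmatch(pattern, "".join(w)) for w in product(*child_sets)):
            result.add(q)
    return result

def accepts(tree, states, delta, final):
    """True if some run assigns a final state to the root."""
    return bool(possible_states(tree, states, delta) & final)

# Hypothetical automaton over {a, b}: state "1" marks subtrees containing a b-node.
STATES = {"0", "1"}
DELTA = {
    ("0", "a"): "0*",
    ("1", "a"): "[01]*1[01]*",
    ("1", "b"): "[01]*",
}
FINAL = {"1"}

with_b = ("a", [("a", []), ("b", [])])
without_b = ("a", [("a", [])])
print(accepts(with_b, STATES, DELTA, FINAL))      # True
print(accepts(without_b, STATES, DELTA, FINAL))   # False
```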

An unranked tree (letter-to-letter) transducer with the input alphabet $\Sigma$ and output alphabet $\Gamma$ is a tuple $\mathcal{T}=\langle\mathcal{A},\mu\rangle$, where $\mathcal{A}$ is a tree automaton with the set of states $Q$, and $\mu\subseteq Q\times\Sigma\times\Gamma$ is an output relation. We call such $\mathcal{T}$ a transducer from $\Sigma$ to $\Gamma$.

Let $t$ be a $\Sigma$-labeled tree, and $t^{\prime}$ a $\Gamma$-labeled tree such that $\textsf{Dom}(t)=\textsf{Dom}(t^{\prime})$. We say that a tree $t^{\prime}$ is an output of $\mathcal{T}$ on $t$, if there is an accepting run $\rho_{\mathcal{A}}$ of $\mathcal{A}$ on $t$ and for each $u\in\textsf{Dom}(t)$, it holds that $(\rho_{\mathcal{A}}(u),\ell ab_{t}(u),\ell ab_{t^{\prime}}(u))\in\mu$. We call $\mathcal{T}$ an identity transducer, if $\ell ab_{t}(u)=\ell ab_{t^{\prime}}(u)$ for all $u\in\textsf{Dom}(t)$. We will often view an automaton $\mathcal{A}$ as an identity transducer.

2.3 Automata with Presburger constraints (APC)

An automaton with Presburger constraints (APC) is a tuple $\langle\mathcal{A},\xi\rangle$, where $\mathcal{A}$ is an unranked tree automaton with states $q_{0},\ldots,q_{m}$ and $\xi$ is an existential Presburger formula with free variables $x_{0},\ldots,x_{m}$. A tree $t$ is accepted by $\langle\mathcal{A},\xi\rangle$, denoted by $t\in\mathcal{L}(\mathcal{A},\xi)$, if there is an accepting run $\rho_{\mathcal{A}}$ of $\mathcal{A}$ on $t$ such that $\xi(n_{0},\ldots,n_{m})$ is true, where $n_{i}$ is the number of appearances of $q_{i}$ in $\rho_{\mathcal{A}}$.

Theorem 2.2.

[Seidl et al. (2004), Verma et al. (2005)] The non-emptiness problem for APC is decidable in NP.

It is worth noting also that the class of languages accepted by APC is closed under union and intersection.

Oftentimes, instead of counting the number of states in the accepting run, we need to count the number of occurrences of alphabet symbols in the tree. Since we can easily embed the alphabet symbols inside the states, we always assume that the Presburger formula $\xi$ has free variables $x_{a}$, each denoting the number of appearances of the symbol $a$ in the tree.

As in the word case, we let $\textsf{Parikh}(t)$ denote the Parikh image of the tree $t$. We will need the following proposition.

Proposition 2.3.

[Seidl et al. (2004), Verma et al. (2005)] Given an unranked tree automaton $\mathcal{A}$, one can construct, in polynomial time, an existential Presburger formula $\xi_{\mathcal{A}}(x_{1},\ldots,x_{\ell})$ such that

  • for every tree $t\in\mathcal{L}(\mathcal{A})$, $\xi_{\mathcal{A}}(\textsf{Parikh}(t))$ holds;

  • for every $\bar{n}=(n_{1},\ldots,n_{\ell})$ such that $\xi_{\mathcal{A}}(\bar{n})$ holds, there exists a tree $t\in\mathcal{L}(\mathcal{A})$ with $\textsf{Parikh}(t)=\bar{n}$.

3 Ordered-data trees and Their Logic

An ordered-data tree over the alphabet $\Sigma$ is a tree in which each node, besides carrying a label from the finite alphabet $\Sigma$, also carries a data value from $\mathbb{N}=\{0,1,\ldots\}$. (Here we use the natural numbers as data values just to be concrete. The results in our paper apply trivially for any linearly ordered domain.)

Let $t$ be an ordered-data tree over $\Sigma$ and $u\in\textsf{Dom}(t)$. We write $\textit{va}\ell_{t}(u)$ to denote the data value in the node $u$. The set of all data values in the $a$-nodes in $t$ is denoted by $V_{t}(a)$. That is, $V_{t}(a)=\{\textit{va}\ell_{t}(u)\ |\ \ell ab_{t}(u)=a\ \mbox{and}\ u\in\textsf{Dom}(t)\}$. We write $V_{t}$ to denote the set of data values found in the tree $t$. We also write $\#_{t}(a)$ to denote the number of $a$-nodes in $t$.

The profile of a node $u$ is a triplet $(l,p,r)\in\{\top,\bot,*\}\times\{\top,\bot,*\}\times\{\top,\bot,*\}$, where $l=\top$ and $l=\bot$ indicate that the node $u$ has the same data value and a different data value as its left sibling, respectively; $l=*$ indicates that $u$ does not have a left sibling. Similarly, $p=\top$, $p=\bot$, and $p=*$ have the same meaning in relation to the parent of the node $u$, while $r=\top$, $r=\bot$, and $r=*$ mean the same in relation to the right sibling of the node $u$. For an ordered-data tree $t$ over $\Sigma$, the profile tree of $t$, denoted by $\textsf{Profile}(t)$, is a tree over $\Sigma\times\{\top,\bot,*\}^{3}$ obtained by augmenting each node of $t$ with its profile.
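
As an illustration of the definition, the following Python sketch computes the profile of every node of a small tree. The encoding of a node as a triple `(label, value, children)`, the use of the strings `"T"` and `"F"` for $\top$ and $\bot$, and the example tree are all assumptions of this sketch.

```python
def profiles(root):
    """Return (label, (l, p, r), value) for every node, in preorder."""
    out = []

    def mark(value, other):
        # "*" when the neighbour does not exist, otherwise compare data values.
        if other is None:
            return "*"
        return "T" if value == other else "F"

    def visit(node, parent_val, left_val, right_val):
        label, value, children = node
        profile = (mark(value, left_val), mark(value, parent_val), mark(value, right_val))
        out.append((label, profile, value))
        for i, child in enumerate(children):
            left = children[i - 1][1] if i > 0 else None
            right = children[i + 1][1] if i + 1 < len(children) else None
            visit(child, value, left, right)

    visit(root, None, None, None)
    return out

# A hypothetical tree: an a-node with value 2 and four leaf children.
tree = ("a", 2, [("b", 1, []), ("c", 2, []), ("a", 4, []), ("a", 6, [])])
for entry in profiles(tree):
    print(entry)
```

For instance, the second child, labeled $c$ with value 2, receives the profile $(\bot,\top,\bot)$: its value differs from both siblings and equals that of its parent.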

We write $\textsf{Proj}(t)$ to denote the $\Sigma$ projection of the ordered-data tree $t$, that is, $\textsf{Proj}(t)$ is $t$ without the data values. When we say that an ordered-data tree $t$ is accepted by an automaton $\mathcal{A}$, we mean that $\textsf{Proj}(t)$ is accepted by $\mathcal{A}$. An ordered-data tree $t^{\prime}$ is an output of a transducer $\mathcal{T}$ on an ordered-data tree $t$, if $\textsf{Proj}(t^{\prime})$ is an output of $\mathcal{T}$ on $\textsf{Proj}(t)$, and for all $u\in\textsf{Dom}(t^{\prime})$, we have $\textit{va}\ell_{t^{\prime}}(u)=\textit{va}\ell_{t}(u)$.

Figure 1 shows an example of an ordered-data tree $t$ over the alphabet $\{a,b,c\}$ with its profile tree. The notation ${a\choose d}$ means that the node is labeled with $a$ and has data value $d$.

[Figure 1: An example of an ordered-data tree (on the left) and its profile (on the right). Tree drawing omitted; nodes are listed in source order. Tree nodes: ${a\choose 2}$, ${b\choose 1}$, ${c\choose 2}$, ${a\choose 4}$, ${a\choose 6}$, ${b\choose 2}$, ${b\choose 4}$, ${a\choose 7}$, ${c\choose 1}$, ${c\choose 6}$, ${b\choose 7}$. Profile tree nodes: ${a,(*,*,*)\choose 2}$, ${b,(*,\bot,\bot)\choose 1}$, ${c,(\bot,\top,\bot)\choose 2}$, ${a,(\bot,\bot,\bot)\choose 4}$, ${a,(\bot,\bot,*)\choose 6}$, ${b,(*,\top,\bot)\choose 2}$, ${b,(\bot,\bot,\bot)\choose 4}$, ${a,(\bot,\bot,*)\choose 7}$, ${c,(*,\bot,*)\choose 1}$, ${c,(*,\bot,*)\choose 6}$, ${b,(*,\top,*)\choose 7}$.]

3.1 String representations of data values

Let $t$ be an ordered-data tree over $\Gamma$. For a set $S\subseteq\Gamma$, let

$$[S]_{t}=\bigcap_{a\in S}V_{t}(a)\ \cap\ \bigcap_{b\notin S}\overline{V_{t}(b)}.$$

That is, $[S]_{t}$ is the set of data values that are found in $a$-positions for all $a\in S$ but are not found in any $b$-position for $b\not\in S$. Note that the sets $[S]_{t}$ are disjoint, and that for each $a\in\Gamma$,

$$V_{t}(a)=\bigcup_{S\ \mbox{\scriptsize s.t.}\ a\in S}\ [S]_{t}.$$

Moreover, $|V_{t}(a)|=\sum_{S\ \mbox{\scriptsize s.t.}\ a\in S}\ |[S]_{t}|$.

Let $d_{1}<\cdots<d_{m}$ be all the data values found in $t$. The string representation of the data values in $t$, denoted by $\mathcal{V}_{\Gamma}(t)$, is the string $S_{1}\cdots S_{m}$ over the alphabet $2^{\Gamma}-\{\emptyset\}$ of length $m$ such that $d_{i}\in[S_{i}]_{t}$, for each $i=1,\ldots,m$. The notation $[S]_{t}$ is already introduced in [David et al. (2010), David et al. (2012)], but not $\mathcal{V}_{\Gamma}(t)$.

Consider the example of the tree $t$ in Figure 1. The data values in $t$ are $1,2,4,6,7$, where

$$\begin{aligned}
[\{b,c\}]_{t} &= \{1\},\\
[\{a,b,c\}]_{t} &= \{2\},\\
[\{a,b\}]_{t} &= \{4,7\},\\
[\{a,c\}]_{t} &= \{6\},\\
[S]_{t} &= \emptyset,\ \mbox{for all the other}\ S\mbox{'s}.
\end{aligned}$$

The string $\mathcal{V}_{\Gamma}(t)$ is $S_{1}\ S_{2}\ S_{3}\ S_{4}\ S_{5}$, where $S_{1}=\{b,c\}$, $S_{2}=\{a,b,c\}$, $S_{3}=S_{5}=\{a,b\}$ and $S_{4}=\{a,c\}$.
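
The string representation depends only on which labels occur with which data value, not on the shape of the tree. The Python sketch below recomputes the example from the list of (label, value) pairs of the nodes of Figure 1; the function name `string_representation` and the encoding of each symbol $S_{i}$ as a `frozenset` are assumptions of the sketch.

```python
def string_representation(nodes):
    """Return the list S_1, ..., S_m for the (label, value) pairs in `nodes`."""
    labels_of = {}
    for label, value in nodes:
        # Collect, for each data value d, the set S with d in [S]_t.
        labels_of.setdefault(value, set()).add(label)
    # One symbol per distinct data value, in increasing order of the values.
    return [frozenset(labels_of[d]) for d in sorted(labels_of)]

# The (label, value) pairs of the nodes of the tree in Figure 1.
nodes = [("a", 2), ("b", 1), ("c", 2), ("a", 4), ("a", 6), ("b", 2),
         ("b", 4), ("a", 7), ("c", 1), ("c", 6), ("b", 7)]

for s in string_representation(nodes):
    print(sorted(s))
```

The five printed sets are $\{b,c\}$, $\{a,b,c\}$, $\{a,b\}$, $\{a,c\}$ and $\{a,b\}$, in agreement with the string given above.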

3.2 A logic for ordered-data trees

An ordered-data tree $t$ over the alphabet $\Sigma$ can be viewed as a structure

\[ t=\langle D,\{a(\cdot)\}_{a\in\Sigma},E_{\downarrow},E_{\rightarrow},\sim,\prec,\prec_{suc}\rangle, \]

where

  • the relations $\{a(\cdot)\}_{a\in\Sigma}$, $E_{\downarrow}$, $E_{\rightarrow}$ are as defined before in Subsection 2.2,

  • $u\sim v$ holds, if $\textit{va}\ell_{t}(u)=\textit{va}\ell_{t}(v)$,

  • $u\prec v$ holds, if $\textit{va}\ell_{t}(u)<\textit{va}\ell_{t}(v)$,

  • $u\prec_{suc}v$ holds, if $\textit{va}\ell_{t}(v)$ is the minimal data value in $t$ greater than $\textit{va}\ell_{t}(u)$.

Obviously, $x\prec_{suc}y$ can be expressed equivalently as $x\prec y\wedge\forall z(\neg(x\prec z\wedge z\prec y))$. We include $\prec_{suc}$ for the sake of convenience. We also assume that we have the predicates $\mathrm{root}(x)$, $\mathrm{first\text{-}sibling}(x)$, $\mathrm{last\text{-}sibling}(x)$, and $\mathrm{leaf}(x)$, which stand for $\forall y(\neg E_{\downarrow}(y,x))$, $\forall y(\neg E_{\rightarrow}(y,x))$, $\forall y(\neg E_{\rightarrow}(x,y))$, and $\forall y(\neg E_{\downarrow}(x,y))$, respectively. We also write $x\nsim y$ to denote $\neg(x\sim y)$.
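As a sanity check of the equivalence between $\prec_{suc}$ and its first-order definition, the two readings can be compared on a finite set of data values. This is a minimal sketch. The function names are hypothetical, and the relations are evaluated directly on data values rather than on nodes.

```python
def suc_by_definition(values, du, dv):
    """dv is the minimal data value in `values` that is greater than du."""
    larger = [d for d in values if d > du]
    return bool(larger) and dv == min(larger)


def suc_by_formula(values, du, dv):
    """x < y and there is no z with x < z < y."""
    return du < dv and not any(du < dz < dv for dz in values)


values = {1, 2, 4, 6, 7}  # the data values of the tree in Figure 1
agree = all(
    suc_by_definition(values, x, y) == suc_by_formula(values, x, y)
    for x in values
    for y in values
)
print(agree)
```

Note that the first-order formula uses a third variable $z$, whereas $\prec_{suc}$ as an atomic relation needs only two.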

For $\mathcal{O}\subseteq\{E_{\downarrow},E_{\rightarrow},\sim,\prec,\prec_{suc}\}$, we let $\textrm{FO}(\mathcal{O})$ stand for the first-order logic with the vocabulary $\mathcal{O}$, $\textrm{MSO}(\mathcal{O})$ for its monadic second-order logic (which extends $\textrm{FO}(\mathcal{O})$ with quantification over sets of nodes), and $\exists\textrm{MSO}(\mathcal{O})$ for its existential monadic second-order logic, i.e., formulas of the form $\exists X_{1}\ldots\exists X_{m}\ \psi$, where $\psi$ is an $\textrm{FO}(\mathcal{O})$ formula over the vocabulary $\mathcal{O}$ extended with the unary predicates $X_{1},\ldots,X_{m}$.

We let $\textrm{FO}^{2}(\mathcal{O})$ stand for $\textrm{FO}(\mathcal{O})$ with two variables, i.e., the set of $\textrm{FO}(\mathcal{O})$ formulae that only use two variables $x$ and $y$. The set of all formulae of the form $\exists X_{1}\ldots\exists X_{m}\ \psi$, where $\psi$ is an $\textrm{FO}^{2}(\mathcal{O})$ formula, is denoted by $\exists\textrm{MSO}^{2}(\mathcal{O})$. Note that $\exists\textrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow})$ is equivalent in expressive power to $\textrm{MSO}(E_{\downarrow},E_{\rightarrow})$ over the usual trees (without data). That is, it defines precisely the regular tree languages [Thomas (1997)].

As usual, we define $\mathcal{L}_{data}(\varphi)$ as the set of ordered-data trees that satisfy the formula $\varphi$. In such a case, we say that the formula $\varphi$ expresses the language $\mathcal{L}_{data}(\varphi)$. To avoid confusion, we put the subscript $data$ on $\mathcal{L}_{data}$ to denote a language of ordered-data trees. We use the symbol $\mathcal{L}$ without the subscript $data$ to denote the usual language of trees/strings without data.

The following theorem is well known. It shows how even extending $\textrm{FO}(E_{\downarrow},E_{\rightarrow})$ with equality test on data values immediately yields undecidability.

Theorem 3.1.

(See, for example, [Neven et al. (2004)]) The satisfiability problem for the logic $\textrm{FO}(E_{\downarrow},E_{\rightarrow},\sim)$ is undecidable.

One of the deepest results in this area is the following decidability result for the logic $\exists\textrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$.

Theorem 3.2.

[Bojanczyk et al. (2009)] The satisfiability problem for the logic $\exists\textrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$ is decidable.

3.3 A few examples

In this subsection we present a few examples of properties of ordered-data trees. Some of them are special cases of more general techniques that will be used later on.

Example 3.3.

Let $\Sigma=\{a,b\}$. Consider the language $\mathcal{L}_{data}^{a}$ of ordered-data trees over $\Sigma$ where an ordered-data tree $t\in\mathcal{L}_{data}^{a}$ if and only if there exist two $a$-nodes $u$ and $v$ such that $u$ is an ancestor of $v$ and either $v\sim u$ or $v\prec u$. This language can be expressed with the formula $\exists X\exists Y\exists Z\ \varphi$, where $\varphi$ states that $X$ contains only the node $u$, $Y$ contains only the node $v$, $Z$ contains precisely the nodes on the path from $u$ to $v$, and $v\sim u$ or $v\prec u$.

Example 3.4.

For a fixed set $S\subseteq\Sigma$ and an integer $m\geq 1$, we consider the language $\mathcal{L}_{data}^{S,m}$ such that $t\in\mathcal{L}_{data}^{S,m}$ if and only if $|[S]_{t}|=m$.

We pick an arbitrary symbol $a\in S$. The language $\mathcal{L}_{data}^{S,m}$ can be expressed in $\exists\textrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$ with a formula of the form $\exists X_{1}\cdots\exists X_{m}\ \varphi$, where $\varphi$ is a conjunction of the following.

  • That the predicates $X_{1},\ldots,X_{m}$ are disjoint and each of them contains exactly one node, which is an $a$-node.

  • That the data values found in nodes in $X_{1},\ldots,X_{m}$ are all different.

  • That for each $i\in\{1,\ldots,m\}$, if a data value is found in a node in $X_{i}$, then it must also be found in some $b$-node, for every $b\in S$.

  • That for each $i\in\{1,\ldots,m\}$, if a data value is found in a node in $X_{i}$, then it must not be found in any $b$-node, for every $b\notin S$.

  • That every $a$-node (recall that $a\in S$) that does not belong to the $X_{i}$'s either has the same data value as a node belonging to one of the $X_{i}$'s, or has a data value not in $[S]_{t}$.
    That its data value does not belong to $[S]_{t}$ can be stated as the negation of

    • for each $b\in S$, there is a $b$-node with the same data value; and

    • the data value cannot be found in any $b$-node, for every $b\notin S$.

To express all these intended meanings, it is sufficient that $\varphi\in\textrm{FO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$.

Example 3.5.

For a fixed set $S\subseteq\Sigma$ and an integer $m\geq 1$, we consider the language $\mathcal{L}_{data}^{S,\pmod{m}}$ such that $t\in\mathcal{L}_{data}^{S,\pmod{m}}$ if and only if $|[S]_{t}|\equiv 0\pmod{m}$.

This language $\mathcal{L}_{data}^{S,\pmod{m}}$ can be expressed in $\exists\textrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$ with a formula of the form

\[ \exists X_{0}\cdots\exists X_{m-1}\ \exists Y_{0}\cdots\exists Y_{m-1}\ \exists Z\ \psi, \]

where the intended meanings of $X_{0},\ldots,X_{m-1},Y_{0},\ldots,Y_{m-1},Z$ are as follows. For a node $u$ in an ordered-data tree $t\in\mathcal{L}_{data}$,

  • the number of nodes belonging to $Z$ is precisely $|[S]_{t}|$; and if $Z(u)$ holds in $t$, then the data value in the node $u$ belongs to $[S]_{t}$;

  • $X_{i}(u)$ holds in $t$ if and only if in the subtree $t'$ rooted in $u$ we have $|[S]_{t'}|\equiv i\pmod{m}$;

  • if $v_{1},\ldots,v_{k}$ are all the left-siblings of $u$, and $X_{i_{1}}(v_{1}),\ldots,X_{i_{k}}(v_{k})$ hold, then $Y_{i}(u)$ holds if and only if $i_{1}+\cdots+i_{k}\equiv i\pmod{m}$.

To express all these intended meanings, it is sufficient that $\psi\in\textrm{FO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$.

Example 3.6.

Let $\Sigma=\{a,b\}$. Consider the language $\mathcal{L}_{data}^{a\ast}$ of ordered-data trees over $\Sigma$ where an ordered-data tree $t\in\mathcal{L}_{data}^{a\ast}$ if and only if all the $a$-nodes with data values different from the ones in their parents satisfy the following conditions:

  • the data values found in these nodes are all different;

  • one of these data values must be the largest in the tree $t$.

The language $\mathcal{L}_{data}^{a\ast}$ can be expressed in $\exists\textrm{MSO}^{2}(E_{\downarrow},\sim,\prec)$ with the following formula:

\begin{align*}
\exists X\ \Big(\ & \forall x\big(X(x)\iff a(x)\wedge\exists y(E_{\downarrow}(y,x)\wedge y\nsim x)\big)\\
\wedge\ & \forall x\forall y\big(X(x)\wedge X(y)\wedge x\sim y\to x=y\big)\\
\wedge\ & \exists x\big(X(x)\wedge\forall y(y\prec x\vee x\sim y)\big)\Big).
\end{align*}
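The three conjuncts of this formula can be checked directly on a concrete tree. The sketch below is only illustrative: the function name and the encoding of a tree as a list of (label, data value, parent index) triples are assumptions made here. Following the first conjunct, the root is never placed in $X$, because it has no parent.

```python
def in_language_a_star(nodes):
    """Check membership in the language of Example 3.6.

    `nodes` is a list of (label, value, parent) triples, where `parent` is
    the index of the parent node in the list, or None for the root.
    """
    # First conjunct: X holds exactly the a-nodes whose data value differs
    # from the data value of their parent.
    x_values = [
        value
        for label, value, parent in nodes
        if label == "a" and parent is not None and nodes[parent][1] != value
    ]
    # Second conjunct: nodes in X carry pairwise different data values.
    all_different = len(x_values) == len(set(x_values))
    # Third conjunct: some node in X carries the largest data value.
    largest = max(value for _, value, _ in nodes)
    return all_different and largest in x_values


t1 = [("b", 1, None), ("a", 5, 0), ("a", 3, 0), ("b", 3, 2)]
t2 = [("b", 1, None), ("a", 5, 0), ("a", 5, 0)]
t3 = [("a", 9, None), ("a", 5, 0)]
print(in_language_a_star(t1), in_language_a_star(t2), in_language_a_star(t3))
```

Here `t2` fails the second conjunct (two nodes of $X$ share a value) and `t3` fails the third one (the largest value sits at the root, which is not in $X$).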

4 Two Useful Lemmas

In this section we prove two lemmas which will be used later on. The first is combinatorial by nature, and we will use it in our proof of the decidability of ODTA. The second is an Ehrenfeucht-Fraïssé type lemma for ordered-data trees, and we will use it in our proof of the logical characterization of ODTA.

4.1 A combinatorial lemma

Let $G$ be an (undirected and finite) graph. For simplicity, we consider only graphs without self-loops. We denote by $V(G)$ the set of vertices of $G$ and by $E(G)$ its set of edges. For a node $u\in V(G)$, we write $\deg(u)$ to denote the degree of the node $u$, and $\deg(G)$ to denote $\max\{\deg(u)\mid u\in V(G)\}$.

A data graph over the alphabet $\Gamma$ is a graph $G$ in which each node carries a label from $\Gamma$ and a data value from $\mathbb{N}$. A node $u\in V(G)$ is called an $a$-node if its label is $a$, in which case we write $\ell ab_{G}(u)=a$. We denote by $\textit{va}\ell_{G}(u)$ the data value found in node $u$, and by $\textrm{Val}_{G}(a)$ the set of data values found in $a$-nodes in $G$.

Lemma 4.1.

Let $G$ be a data graph over $\Gamma$. Suppose that for each $a\in\Gamma$ we have $|\textrm{Val}_{G}(a)|\geq\deg(G)|\Gamma|+\deg(G)+1$. Then we can reassign the data values in the nodes of $G$ to obtain another data graph $G'$ such that $V(G)=V(G')$, $E(G)=E(G')$, and

  • (1) for each $u\in V(G')$, $\ell ab_{G}(u)=\ell ab_{G'}(u)$;

  • (2) for each $a\in\Gamma$, $\textrm{Val}_{G}(a)=\textrm{Val}_{G'}(a)$;

  • (3) for each $u,v\in V(G)$, if $(u,v)\in E(G')$, then $\textit{va}\ell_{G'}(u)\neq\textit{va}\ell_{G'}(v)$.

Proof 4.2.

Note that in the lemma the data graph $G'$ differs from $G$ only in the data values on the nodes, where we require that adjacent nodes in $G'$ have different data values.

In the following we write $\#_{G}(a)$ to denote the number of $a$-nodes in $G$, and we let $K=\deg(G)$. First, we perform a partial reassignment of the data values on some nodes. For each $a\in\Gamma$, we pick $|\textrm{Val}_{G}(a)|$ of the $a$-nodes in $G'$ and assign to them the data values from $\textrm{Val}_{G}(a)$, one data value per node. Such an assignment can be done since obviously $\#_{G}(a)\geq|\textrm{Val}_{G}(a)|$. If $\#_{G}(a)>|\textrm{Val}_{G}(a)|$, then some $a$-nodes in $G'$ are left without data values. We write $\textit{va}\ell_{G'}(u)=\sharp$ if $u$ does not have a data value. After this step we already have $\textrm{Val}_{G'}(a)=\textrm{Val}_{G}(a)$ for each $a\in\Gamma$.

However, after such a reassignment there may exist an edge $(u,v)$ such that $\textit{va}\ell_{G'}(u)=\textit{va}\ell_{G'}(v)$ and $\textit{va}\ell_{G'}(u),\textit{va}\ell_{G'}(v)\neq\sharp$. We call such an edge a conflict edge. We are going to reassign the data values one more time so that there is no conflict edge.

Suppose there exists an edge $(u,v)\in E(G)$ such that $\textit{va}\ell_{G'}(u)=\textit{va}\ell_{G'}(v)=d$, and suppose that $u$ is an $a$-node, for some $a\in\Gamma$. The data value $d$ is found in at most $|\Gamma|$ nodes in $G'$, since it was assigned to at most one node per label. Since $\deg(G)=K$, the nodes with data value $d$ have at most $K|\Gamma|$ neighbours. Since $|\textrm{Val}_{G}(a)|=|\textrm{Val}_{G'}(a)|\geq K|\Gamma|+K+1$, there are at least $K+1$ $a$-nodes that carry a data value and whose neighbours do not carry the data value $d$. Let $u_{1},\ldots,u_{m}$ be such $a$-nodes, where $m\geq K+1$. These nodes carry pairwise different data values and $u$ has at most $K$ neighbours, so there exists $i$ such that $\textit{va}\ell_{G'}(u_{i})\notin\{\textit{va}\ell_{G'}(x)\mid(u,x)\in E(G)\}$.

We can then swap the data values on the nodes $u$ and $u_{i}$, and this results in at least one less conflict edge. We repeat this process until there is no conflict edge. Now it is straightforward that

  • (1) for each $u\in V(G)$, $\ell ab_{G}(u)=\ell ab_{G'}(u)$;

  • (2) for each $a\in\Gamma$, $\textrm{Val}_{G}(a)=\textrm{Val}_{G'}(a)$;

  • (3) for each $u,v\in V(G)$, if $(u,v)\in E(G)$ and $\textit{va}\ell_{G'}(u),\textit{va}\ell_{G'}(v)\neq\sharp$, then $\textit{va}\ell_{G'}(u)\neq\textit{va}\ell_{G'}(v)$.

What is left to do now is to assign data values to the nodes $u$ with $\textit{va}\ell_{G'}(u)=\sharp$. For each such $a$-node $u$, we pick a data value $d\in\textrm{Val}_{G'}(a)=\textrm{Val}_{G}(a)$ which is not assigned to any of its neighbours. Such a data value exists since $|\textrm{Val}_{G'}(a)|\geq K|\Gamma|+K+1\geq K+1$. Such an assignment does not violate condition (3) above; thus, we get the desired data graph $G'$. This completes the proof of Lemma 4.1.
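The three phases of this proof (partial assignment, removal of conflict edges by swapping, and filling in the nodes marked $\sharp$) can be sketched as a procedure. This is an illustrative sketch only: the function name, the dictionary-based graph encoding and the use of `None` for $\sharp$ are assumptions made here, and the order in which nodes are picked is arbitrary.

```python
def reassign_data_values(labels, values, edges):
    """Follow the proof of Lemma 4.1; returns a dict node -> new data value."""
    nbrs = {u: set() for u in labels}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    K = max(len(n) for n in nbrs.values())
    alphabet = set(labels.values())
    val = {a: sorted({values[u] for u in labels if labels[u] == a})
           for a in alphabet}
    if any(len(vs) < K * len(alphabet) + K + 1 for vs in val.values()):
        raise ValueError("hypothesis of Lemma 4.1 is not satisfied")

    # Phase 1: each value of Val_G(a) goes to exactly one a-node;
    # the remaining a-nodes stay unassigned (None plays the role of sharp).
    new = {u: None for u in labels}
    for a, vs in val.items():
        a_nodes = [u for u in labels if labels[u] == a]
        for u, d in zip(a_nodes, vs):
            new[u] = d

    def find_conflict():
        for u in labels:
            if new[u] is not None and any(new[x] == new[u] for x in nbrs[u]):
                return u
        return None

    # Phase 2: remove conflict edges by swapping with a suitable a-node.
    while True:
        u = find_conflict()
        if u is None:
            break
        d = new[u]
        blocked = {new[x] for x in nbrs[u]}
        for w in labels:
            if (w != u and labels[w] == labels[u] and new[w] is not None
                    and new[w] not in blocked
                    and all(new[x] != d for x in nbrs[w])):
                new[u], new[w] = new[w], d
                break
        else:
            raise RuntimeError("no swap partner found")

    # Phase 3: give each unassigned a-node a value unused by its neighbours.
    for u in labels:
        if new[u] is None:
            used = {new[x] for x in nbrs[u]}
            new[u] = next(d for d in val[labels[u]] if d not in used)
    return new


# A path of 15 nodes with alternating labels; K = 2 and |Gamma| = 2,
# so each label needs at least 2*2 + 2 + 1 = 7 data values.
labels = {i: "ab"[i % 2] for i in range(15)}
values = {i: (i // 2) % 7 + 1 for i in range(15)}
edges = [(i, i + 1) for i in range(14)]
new_values = reassign_data_values(labels, values, edges)
```

In this example the first phase produces several conflict edges, so the swapping phase is exercised; the result can be checked against conditions (2) and (3) of the lemma.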

4.2 An Ehrenfeucht-Fraïssé type lemma

We need the following notation. A $k$-characteristic function on the alphabet $\Gamma$ is a function $f:\Gamma\to\{0,1,2,\ldots,k\}$. Let $\mathcal{F}_{\Gamma,k}$ be the set of all such $k$-characteristic functions on $\Gamma$. A function $f\in\mathcal{F}_{\Gamma,k}$ is a $k$-characteristic function for a set $S\subseteq\Gamma$, if $f(a)\in\{1,2,\ldots,k\}$ for all $a\in S$, and $f(a)=0$ for all $a\notin S$.

An ordered-data set $\mathfrak{U}$ over the alphabet $\Gamma$ consists of a finite set $U$, in which each element $u\in U$ carries a label $\ell ab_{\mathfrak{U}}(u)\in\Gamma$ and a data value $\textit{va}\ell_{\mathfrak{U}}(u)\in\mathbb{N}$. An element $u\in U$ is called an $a$-element, if $\ell ab_{\mathfrak{U}}(u)=a$. In other words, an ordered-data set is similar to an ordered-data tree, but without the relations $E_{\downarrow}$ and $E_{\rightarrow}$. It can be viewed as a structure $\mathfrak{U}=\langle U,\{a(\cdot)\}_{a\in\Gamma},\sim,\prec,\prec_{suc}\rangle$, where

  • for each $a\in\Gamma$ and $u\in U$, the relation $a(u)$ holds if $\ell ab_{\mathfrak{U}}(u)=a$,

  • $u\sim v$ holds, if $\textit{va}\ell_{\mathfrak{U}}(u)=\textit{va}\ell_{\mathfrak{U}}(v)$,

  • $u\prec v$ holds, if $\textit{va}\ell_{\mathfrak{U}}(u)<\textit{va}\ell_{\mathfrak{U}}(v)$,

  • $u\prec_{suc}v$ holds, if $\textit{va}\ell_{\mathfrak{U}}(v)$ is the minimal data value found in $\mathfrak{U}$ greater than $\textit{va}\ell_{\mathfrak{U}}(u)$.

Let $\mathfrak{U}$ be an ordered-data set and $d_{1}<\cdots<d_{m}$ be the data values found in $\mathfrak{U}$. The $k$-extended representation of $\mathfrak{U}$ is the string $\mathcal{V}^{k}_{\Gamma}(\mathfrak{U})=(S_{1},f_{1})\cdots(S_{m},f_{m})$ over the alphabet $2^{\Gamma}\times\mathcal{F}_{\Gamma,k}$ such that $S_{1}\cdots S_{m}=\mathcal{V}_{\Gamma}(\mathfrak{U})$ and for each $i\in\{1,2,\ldots,m\}$ and for each $a\in\Gamma$,

  1. $f_{i}$ is a $k$-characteristic function for the set $S_{i}$,

  2. if $1\leq f_{i}(a)\leq k-1$, then there are exactly $f_{i}(a)$ $a$-elements in $U$ with data value $d_{i}$,

  3. if $f_{i}(a)=k$, then there are at least $k$ $a$-elements in $U$ with data value $d_{i}$.
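In other words, $f_{i}(a)$ counts the $a$-elements with data value $d_{i}$, with the count capped at $k$. A minimal sketch of this computation follows; the function name and the encoding of an ordered-data set as a list of (label, data value) pairs are assumptions made for illustration.

```python
from collections import Counter


def k_extended_representation(elements, alphabet, k):
    """Return the list [(S_1, f_1), ..., (S_m, f_m)] for an ordered-data set.

    `elements` is a list of (label, data_value) pairs, one per element.
    Each f_i is a dict mapping every label to a count capped at k.
    """
    counts = Counter(elements)  # (label, value) -> number of such elements
    rep = []
    for d in sorted({value for _, value in elements}):
        f = {a: min(counts[(a, d)], k) for a in alphabet}
        S = frozenset(a for a in alphabet if f[a] > 0)
        rep.append((S, f))
    return rep


# Three a-elements and one b-element with value 1, one b-element with value 3.
elements = [("a", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 3)]
rep2 = k_extended_representation(elements, "ab", k=2)
print(rep2)
```

With $k=2$ the three $a$-elements of value 1 are recorded only as "at least 2", which is exactly the information the representation is meant to keep.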

We assume that in every formula in $\textrm{MSO}(\sim,\prec,\prec_{suc})$ all the monadic second-order quantifiers precede the first-order part. That is, sentences in $\textrm{MSO}(\sim,\prec,\prec_{suc})$ are of the form $\varphi:=Q_{1}X_{1}\cdots Q_{s}X_{s}\;\psi$, where the $X_{i}$'s are monadic second-order variables, the $Q_{i}$'s are $\exists$ or $\forall$, and $\psi\in\textrm{FO}(\sim,\prec,\prec_{suc})$ extended with the unary predicates $X_{1},\ldots,X_{s}$. We call the integer $s$ the MSO quantifier rank of $\varphi$, denoted by $\textsf{MSO-qr}(\varphi)=s$, while we write $\textsf{FO-qr}(\varphi)$ to denote the quantifier rank of $\psi$, that is, the quantifier rank of the first-order part of $\varphi$.

Lemma 4.3.

Let $\mathfrak{U}_{1}$ and $\mathfrak{U}_{2}$ be ordered-data sets over $\Gamma$ such that $\mathcal{V}^{k2^{s}}_{\Gamma}(\mathfrak{U}_{1})=\mathcal{V}^{k2^{s}}_{\Gamma}(\mathfrak{U}_{2})$. For any $\textrm{MSO}(\sim,\prec,\prec_{suc})$ sentence $\varphi$ such that $\textsf{MSO-qr}(\varphi)\leq s$ and $\textsf{FO-qr}(\varphi)\leq k$, $\mathfrak{U}_{1}\models\varphi$ if and only if $\mathfrak{U}_{2}\models\varphi$.

Proof 4.4.

The proof is by an Ehrenfeucht-Fraïssé game for MSO of $(s+k)$ rounds, with $s$ rounds of set-moves and $k$ rounds of point-moves. We can assume that the set-moves precede the point-moves. See, for example, [Libkin (2004)] for the definition of Ehrenfeucht-Fraïssé games.

Before we go to the proof, we need some notation. Let $\mathfrak{U}_{1}$ and $\mathfrak{U}_{2}$ be ordered-data sets over $\Gamma$. For $(a,d)\in\Gamma\times\mathbb{N}$, we write $P_{\mathfrak{U}_{1}}(a,d)=\{u\mid\ell ab_{\mathfrak{U}_{1}}(u)=a\ \text{and}\ \textit{va}\ell_{\mathfrak{U}_{1}}(u)=d\}$, that is, the set of elements in $U_{1}$ with label $a$ and data value $d$. We define $P_{\mathfrak{U}_{2}}(a,d)$ similarly for $\mathfrak{U}_{2}$.

Let $\mathcal{O}\subseteq\{\sim,\prec,\prec_{suc}\}$. Let $u_{1},\ldots,u_{k}\in U_{1}$ and $v_{1},\ldots,v_{k}\in U_{2}$, for some ordered-data sets $U_{1}$ and $U_{2}$. The mapping $(u_{1},\ldots,u_{k})\mapsto(v_{1},\ldots,v_{k})$ is a partial $\mathcal{O}$-isomorphism (with equality) from $U_{1}$ to $U_{2}$, if it is a partial isomorphism with regard to the vocabulary $\mathcal{O}$, and if $u_{l}=u_{l'}$, then $v_{l}=v_{l'}$.

We are going to describe Duplicator’s strategy for winning the Ehrenfeucht-Fraïssé game for MSO with $s$ rounds of set-moves, followed by $k$ rounds of point-moves. We start with the set-moves.

Duplicator’s strategy for set-moves: Suppose that the game has already been played for $l$ rounds, where $X_{1},\ldots,X_{l}$ and $Y_{1},\ldots,Y_{l}$ are the sets of positions chosen in $U_{1}$ and $U_{2}$, respectively. For each $I\subseteq\{1,\ldots,l\}$, define the following sets:

\begin{align*}
P_{\mathfrak{U}_{1}}(a,d;I) &= P_{\mathfrak{U}_{1}}(a,d)\cap\bigcap_{i\in I}X_{i}\cap\bigcap_{j\notin I}\overline{X_{j}},\\
P_{\mathfrak{U}_{2}}(a,d;I) &= P_{\mathfrak{U}_{2}}(a,d)\cap\bigcap_{i\in I}Y_{i}\cap\bigcap_{j\notin I}\overline{Y_{j}}.
\end{align*}

Duplicator’s strategy is to preserve the following invariant: for every $(a,d)\in\Gamma\times\mathbb{N}$ and every $I\subseteq\{1,\ldots,l\}$,

  • if the cardinality $|P_{\mathfrak{U}_{1}}(a,d;I)|<k2^{s-l}$, then $|P_{\mathfrak{U}_{1}}(a,d;I)|=|P_{\mathfrak{U}_{2}}(a,d;I)|$;

  • if the cardinality $|P_{\mathfrak{U}_{1}}(a,d;I)|\geq k2^{s-l}$, then also $|P_{\mathfrak{U}_{2}}(a,d;I)|\geq k2^{s-l}$.

Now suppose that on the $(l+1)^{\rm th}$ set-move, Spoiler chooses a set $X$ of positions in $U_{1}$. Duplicator chooses a set $Y$ of positions in $U_{2}$ as follows. For each $(a,d)$ and each $I\subseteq\{1,\ldots,l\}$, there are four cases:

  1. $|P_{\mathfrak{U}_{1}}(a,d;I)\cap X|<k2^{s-l-1}$ and $|P_{\mathfrak{U}_{1}}(a,d;I)\cap\overline{X}|<k2^{s-l-1}$.
    Then $|P_{\mathfrak{U}_{1}}(a,d;I)|<k2^{s-l}$, which by the induction hypothesis implies $|P_{\mathfrak{U}_{2}}(a,d;I)|=|P_{\mathfrak{U}_{1}}(a,d;I)|$. Duplicator picks $|P_{\mathfrak{U}_{1}}(a,d;I)\cap X|$ points from $P_{\mathfrak{U}_{2}}(a,d;I)$ and declares that they belong to $Y$. The rest of the points from $P_{\mathfrak{U}_{2}}(a,d;I)$ are declared not to belong to $Y$.
    Obviously, $|P_{\mathfrak{U}_{1}}(a,d;I)\cap X|=|P_{\mathfrak{U}_{2}}(a,d;I)\cap Y|<k2^{s-l-1}$ and $|P_{\mathfrak{U}_{1}}(a,d;I)\cap\overline{X}|=|P_{\mathfrak{U}_{2}}(a,d;I)\cap\overline{Y}|<k2^{s-l-1}$.

  2. $|P_{\mathfrak{U}_{1}}(a,d;I)\cap X|<k2^{s-l-1}$ and $|P_{\mathfrak{U}_{1}}(a,d;I)\cap\overline{X}|\geq k2^{s-l-1}$.
    In this case, either $|P_{\mathfrak{U}_{1}}(a,d;I)|<k2^{s-l}$ or $|P_{\mathfrak{U}_{1}}(a,d;I)|\geq k2^{s-l}$. In either case, by the induction hypothesis, there are $|P_{\mathfrak{U}_{1}}(a,d;I)\cap X|$ points in $P_{\mathfrak{U}_{2}}(a,d;I)$ which Duplicator declares to belong to $Y$. The rest of the points from $P_{\mathfrak{U}_{2}}(a,d;I)$ are declared not to belong to $Y$.
    Obviously, $|P_{\mathfrak{U}_{1}}(a,d;I)\cap X|=|P_{\mathfrak{U}_{2}}(a,d;I)\cap Y|$ and $|P_{\mathfrak{U}_{2}}(a,d;I)\cap\overline{Y}|\geq k2^{s-l-1}$.

  3. $|P_{\mathfrak{U}_{1}}(a,d;I)\cap X|\geq k2^{s-l-1}$ and $|P_{\mathfrak{U}_{1}}(a,d;I)\cap\overline{X}|<k2^{s-l-1}$.
    Similar to Case 2.

  4. $|P_{\mathfrak{U}_{1}}(a,d;I)\cap X|\geq k2^{s-l-1}$ and $|P_{\mathfrak{U}_{1}}(a,d;I)\cap\overline{X}|\geq k2^{s-l-1}$.
    Then $|P_{\mathfrak{U}_{1}}(a,d;I)|\geq k2^{s-l}$, and so $|P_{\mathfrak{U}_{2}}(a,d;I)|\geq k2^{s-l}$. Duplicator declares half of the points in $P_{\mathfrak{U}_{2}}(a,d;I)$ to belong to $Y$ and the other half not to belong to $Y$.
    Obviously, $|P_{\mathfrak{U}_{2}}(a,d;I)\cap Y|\geq k2^{s-l-1}$ and $|P_{\mathfrak{U}_{2}}(a,d;I)\cap\overline{Y}|\geq k2^{s-l-1}$.

Now after $s$ rounds of set-moves, we have the following: for every $(a,d)\in\Gamma\times\mathbb{N}$ and every $I\subseteq\{1,\ldots,s\}$,

  • if the cardinality $|P_{\mathfrak{U}_{1}}(a,d;I)|<k$, then $|P_{\mathfrak{U}_{1}}(a,d;I)|=|P_{\mathfrak{U}_{2}}(a,d;I)|$;

  • if the cardinality $|P_{\mathfrak{U}_{1}}(a,d;I)|\geq k$, then also $|P_{\mathfrak{U}_{2}}(a,d;I)|\geq k$.
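The four cases above only depend on cardinalities, so Duplicator's choice for a single class $P_{\mathfrak{U}_{2}}(a,d;I)$ can be sketched as a function on sizes. This is a hedged illustration, not part of the proof: the function name and parameters are hypothetical, `threshold` stands for the bound $k2^{s-l-1}$ used in the case distinction, and the caller is assumed to supply sizes that satisfy the invariant before the move.

```python
def duplicator_split(x_in, x_out, p2_size, threshold):
    """Return (|P2 in Y|, |P2 not in Y|) for one class P2 = P_{U2}(a,d;I).

    x_in and x_out are the numbers of points of P_{U1}(a,d;I) that Spoiler
    put inside and outside X; p2_size is |P_{U2}(a,d;I)|.
    """
    if x_in < threshold and x_out < threshold:
        # Case 1: both parts are small, so the classes have equal size
        # and Duplicator copies Spoiler's split exactly.
        assert p2_size == x_in + x_out
        return x_in, x_out
    if x_in < threshold:
        # Case 2: copy the small part exactly, the rest stays outside Y.
        return x_in, p2_size - x_in
    if x_out < threshold:
        # Case 3: symmetric to Case 2.
        return p2_size - x_out, x_out
    # Case 4: both parts are large, so split the class in half.
    half = p2_size // 2
    return half, p2_size - half


# With k = 2 and s - l - 1 = 1 the threshold is 4.
splits = [
    duplicator_split(2, 3, 5, 4),    # Case 1
    duplicator_split(3, 10, 9, 4),   # Case 2
    duplicator_split(10, 3, 9, 4),   # Case 3
    duplicator_split(5, 7, 9, 4),    # Case 4
]
print(splits)
```

In each case a part that is below the threshold on Spoiler's side is matched exactly, and a part that is at or above the threshold stays at or above it on Duplicator's side, which is what the invariant for round $l+1$ asks for.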

This ends our description of Duplicator’s strategy for set-moves. Now we describe Duplicator’s strategy for point-moves.

Duplicator’s strategy for point-moves: Suppose that the game is now at the $l$-th step. Let $(u_{1},\ldots,u_{l})\mapsto(v_{1},\ldots,v_{l})$ be a partial $\{\sim,\prec,\prec_{suc}\}$-isomorphism, where $0\leq l\leq k-1$. Suppose Spoiler chooses an element $u_{l+1}$ from $U_{1}$ such that $\textit{va}\ell_{\mathfrak{U}_{1}}(u_{l+1})$ is the $j^{th}$ largest data value in $\mathfrak{U}_{1}$.

  • If $u_{l+1}=u_{l'}$, for some $l'\in\{1,\ldots,l\}$, Duplicator chooses $v_{l+1}=v_{l'}$ from $U_{2}$.

  • If $u_{l+1}\notin\{u_{1},\ldots,u_{l}\}$, Duplicator chooses $v_{l+1}$ from $U_{2}$ such that $v_{l+1}\notin\{v_{1},\ldots,v_{l}\}$ and $\ell ab_{\mathfrak{U}_{1}}(u_{l+1})=\ell ab_{\mathfrak{U}_{2}}(v_{l+1})$ and $\textit{va}\ell_{\mathfrak{U}_{2}}(v_{l+1})$ is the $j^{th}$ largest data value in $\mathfrak{U}_{2}$. Such an element exists, as $\mathcal{V}^{k2^{s}}_{\Gamma}(\mathfrak{U}_{1})=\mathcal{V}^{k2^{s}}_{\Gamma}(\mathfrak{U}_{2})$.

In either case $(u_{1},\ldots,u_{l+1})\mapsto(v_{1},\ldots,v_{l+1})$ is a partial $\{\sim,\prec,\prec_{suc}\}$-isomorphism. This completes the description of Duplicator’s strategy and hence, our proof.

Now, we define the $k$-extended representation of an ordered-data tree $t$ over the alphabet $\Gamma$, denoted by $\mathcal{V}^{k}_{\Gamma}(t)$, as the $k$-extended representation of the ordered-data set $\mathfrak{U}$ obtained by ignoring the relations $E_{\downarrow}$ and $E_{\rightarrow}$ in $t$. The following corollary is an immediate consequence of Lemma 4.3 above.

Corollary 4.5.

Let $t_{1}$ and $t_{2}$ be ordered-data trees over $\Gamma$ such that $\mathcal{V}^{k2^{s}}_{\Gamma}(t_{1})=\mathcal{V}^{k2^{s}}_{\Gamma}(t_{2})$. For any $\textrm{MSO}(\sim,\prec,\prec_{suc})$ sentence $\varphi$ such that $\textsf{MSO-qr}(\varphi)\leq s$ and $\textsf{FO-qr}(\varphi)\leq k$, $t_{1}\models\varphi$ if and only if $t_{2}\models\varphi$.

Proof 4.6.

Since the predicates $E_{\downarrow}$ and $E_{\rightarrow}$ are not used in the formula $\varphi\in\textrm{MSO}(\sim,\prec,\prec_{suc})$, we can ignore them in $t_{1}$ and $t_{2}$ and view both $t_{1}$ and $t_{2}$ as ordered-data sets. Our corollary follows immediately from Lemma 4.3.

5 Automata for Ordered-data Tree

In this section we are going to introduce an automata model for ordered-data trees and study its expressive power.

Definition 5.1.

An ordered-data tree automaton, in short ODTA, over the alphabet $\Sigma$ is a triplet $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$, where $\mathcal{T}$ is a letter-to-letter non-deterministic transducer from $\Sigma\times\{\top,\bot,*\}^{3}$ to the output alphabet $\Gamma$; $\mathcal{M}$ is an automaton on strings over the alphabet $2^{\Gamma}$; and $\Gamma_{0}\subseteq\Gamma$.

An ordered-data tree $t$ is accepted by $\mathcal{S}$, denoted by $t\in\mathcal{L}_{data}(\mathcal{S})$, if there exists an ordered-data tree $t'$ over $\Gamma$ such that

  • on input $\textsf{Profile}(t)$, the transducer $\mathcal{T}$ outputs $t'$;

  • the automaton $\mathcal{M}$ accepts the string $\mathcal{V}_{\Gamma}(t')$; and

  • for every $a\in\Gamma_{0}$, all the $a$-nodes in $t'$ have different data values.

We describe a few examples of ODTA that accept the languages described in Examples 3.3, 3.4, 3.5 and 3.6.

Example 5.2.

An ODTA $\mathcal{S}^{a}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$ that accepts the language $\mathcal{L}_{data}^{a}$ in Example 3.3 can be defined as follows. The output alphabet of the transducer $\mathcal{T}$ is $\Gamma=\{\alpha,\beta,\gamma\}$. On an input tree $t$, the transducer $\mathcal{T}$ marks the nodes in $t$ as follows. Exactly one node is marked with $\alpha$ and exactly one node is marked with $\beta$, all the remaining nodes being marked with $\gamma$; both marked nodes are $a$-nodes, as required in Example 3.3, and the $\alpha$-node is an ancestor of the $\beta$-node. The automaton $\mathcal{M}$ accepts all the strings in which the position labeled with $S\ni\beta$ is less than or equal to the position labeled with $S'\ni\alpha$. (These two positions can be equal, which means $S=S'$.) Finally, $\Gamma_{0}=\emptyset$.

Example 5.3.

An ODTA $\mathcal{S}^{S,m}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$ that accepts the language $\mathcal{L}_{data}^{S,m}$ in Example 3.4 can be defined as follows. The transducer $\mathcal{T}$ is an identity transducer. The automaton $\mathcal{M}$ accepts all the strings in which the symbol $S$ appears exactly $m$ times, and $\Gamma_{0}=\emptyset$.

Example 5.4.

An ODTA $\mathcal{S}^{S,\pmod{m}}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$ that accepts the language $\mathcal{L}_{data}^{S,\pmod{m}}$ in Example 3.5 can be defined as follows. The transducer $\mathcal{T}$ is an identity transducer. The automaton $\mathcal{M}$ accepts a string if and only if the number of appearances of the symbol $S$ in it is a multiple of $m$, and $\Gamma_{0}=\emptyset$.
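Since the transducer is the identity in Examples 5.3 and 5.4, acceptance reduces to counting occurrences of the symbol $S$ in $\mathcal{V}_{\Gamma}(t)$. The following sketch illustrates this under the assumption that a tree is given as a list of (label, data value) pairs; all function names are hypothetical, and the automaton $\mathcal{M}$ is replaced by a direct count.

```python
def string_representation(nodes):
    """V_Gamma(t): label sets ordered by increasing data value."""
    labels_of = {}
    for label, value in nodes:
        labels_of.setdefault(value, set()).add(label)
    return [frozenset(labels_of[d]) for d in sorted(labels_of)]


def accepts_exact(nodes, S, m):
    """Example 5.3: the symbol S occurs exactly m times."""
    return string_representation(nodes).count(frozenset(S)) == m


def accepts_mod(nodes, S, m):
    """Example 5.4: the number of occurrences of S is a multiple of m."""
    return string_representation(nodes).count(frozenset(S)) % m == 0


# Node contents of the tree in Figure 1 (hypothetical encoding).
figure1_nodes = [("a", 2), ("b", 1), ("c", 2), ("a", 4), ("a", 6), ("b", 2),
                 ("b", 4), ("a", 7), ("c", 1), ("c", 6), ("b", 7)]

print(accepts_exact(figure1_nodes, {"a", "b"}, 2))
print(accepts_mod(figure1_nodes, {"a", "c"}, 2))
```

For the tree of Figure 1 we have $[\{a,b\}]_{t}=\{4,7\}$, so the first check succeeds, while $|[\{a,c\}]_{t}|=1$ is not a multiple of 2.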

Example 5.5.

An ODTA $\mathcal{S}^{a\ast}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$ that accepts the language $\mathcal{L}_{data}^{a\ast}$ in Example 3.6 can be defined as follows. The output alphabet of the transducer $\mathcal{T}$ is $\Gamma=\{\alpha,\beta\}$. The transducer $\mathcal{T}$ marks the nodes as follows. A node is marked with $\alpha$ if and only if it is an $a$-node and it has a different data value from the one of its parent. All the other nodes are marked with $\beta$. The automaton $\mathcal{M}$ accepts a string $v$ if and only if the last symbol in $v$ contains the symbol $\alpha$, while $\Gamma_{0}=\{\alpha\}$.
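The three components of this ODTA can be mimicked step by step. The sketch below is illustrative only: the tree encoding as (label, data value, parent index) triples and all function names are assumptions, the transducer is simulated directly on the tree instead of on $\textsf{Profile}(t)$, and the root is assumed to be marked with $\beta$ since it has no parent.

```python
def transduce(nodes):
    """Simulate T: relabel each node with 'alpha' or 'beta'."""
    out = []
    for label, value, parent in nodes:
        is_alpha = (label == "a" and parent is not None
                    and nodes[parent][1] != value)
        out.append(("alpha" if is_alpha else "beta", value))
    return out


def accepted(nodes):
    """Check the three acceptance conditions of Definition 5.1."""
    marked = transduce(nodes)
    # String representation of the output tree t'.
    by_value = {}
    for mark, value in marked:
        by_value.setdefault(value, set()).add(mark)
    rep = [by_value[d] for d in sorted(by_value)]
    # M: the last symbol of the string contains alpha.
    m_accepts = "alpha" in rep[-1]
    # Gamma_0 = {alpha}: all alpha-nodes carry different data values.
    alpha_values = [value for mark, value in marked if mark == "alpha"]
    gamma0_ok = len(alpha_values) == len(set(alpha_values))
    return m_accepts and gamma0_ok


t1 = [("b", 1, None), ("a", 5, 0), ("a", 3, 0), ("b", 3, 2)]
t2 = [("b", 1, None), ("a", 5, 0), ("a", 5, 0)]
t3 = [("a", 9, None), ("a", 5, 0)]
print(accepted(t1), accepted(t2), accepted(t3))
```

The tree `t2` is rejected by the condition on $\Gamma_{0}$, and `t3` is rejected by $\mathcal{M}$, because the largest data value is carried only by a $\beta$-node.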

The following proposition states that ODTA languages are closed under union and intersection, but not under negation. We remark that failing to be closed under negation is rather common for decidable models for data trees: models that are closed under negation often have an undecidable non-emptiness/satisfiability problem.

Proposition 5.6.

The class of languages accepted by ODTA is closed under union and intersection, but not under negation.

Proof 5.7.

For closure under union and intersection, let 𝒮1=𝒯1,1,Γ01\mbox{$\mathcal{S}$}_{1}=\langle\mbox{$\mathcal{T}$}_{1},\mbox{$\mathcal{M}$}_{1},\Gamma_{0}^{1}\rangle and 𝒮2=𝒯2,2,Γ02\mbox{$\mathcal{S}$}_{2}=\langle\mbox{$\mathcal{T}$}_{2},\mbox{$\mathcal{M}$}_{2},\Gamma_{0}^{2}\rangle be ODTA. The union data(𝒮1)data(𝒮2)\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}_{1})\cup\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}_{2}) is accepted by an ODTA which non-deterministically chooses to simulate either 𝒮1\mbox{$\mathcal{S}$}_{1} or 𝒮2\mbox{$\mathcal{S}$}_{2} on the input ordered-data tree. The ODTA for the intersection data(𝒮1)data(𝒮2)\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}_{1})\cap\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}_{2}) can be obtained by the standard cross product between 𝒮1\mbox{$\mathcal{S}$}_{1} and 𝒮2\mbox{$\mathcal{S}$}_{2}.

We now prove that ODTA languages are not closed under negation. Consider the negation of the language in Example 3.3, whose equivalent ODTA $\mathcal{S}^{a}$ is presented in Example 5.2. Every tree $t\notin\mathcal{L}(\mathcal{S}^{a})$ has the following property: if $u,v$ are two $a$-nodes in $t$ and $u$ is an ancestor of $v$, then $u\prec v$.

Now suppose to the contrary that there exists an ODTA 𝒮=𝒯,,Γ0\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle that accepts the negation of (𝒮a)\mbox{$\mathcal{L}$}(\mbox{$\mathcal{S}$}^{a}). Let Γ\Gamma be the output alphabet of 𝒯\mathcal{T}. Let t(𝒮)t\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{S}$}) be a data tree with |Γ|+1|\Gamma|+1 nodes, where each node is labelled with aa and has at most one child. This implies that the data values in tt are all different and appear in increasing order from the root node to the leaf node.

Let $t'\in\mathcal{T}(t)$. Since $t$ has $|\Gamma|+1$ nodes, and hence so does $t'$, there are two nodes $u$ and $v$ in $t'$ with the same label. Let $t''$ be a data tree obtained from $t$ by swapping the data values between $u$ and $v$. After the swap, the ancestor among $u$ and $v$ carries the larger data value, so $t''\in\mathcal{L}(\mathcal{S}^{a})$. Since $\textsf{Profile}(t)=\textsf{Profile}(t'')$, on input $\textsf{Profile}(t'')$, the transducer $\mathcal{T}$ can also output $t'$. Moreover, since $u$ and $v$ carry the same label in $t'$ and all data values are different, the swap changes neither the string read by $\mathcal{M}$ nor the condition imposed by $\Gamma_{0}$, which means that $t''\in\mathcal{L}(\mathcal{S})$. This contradicts the fact that $\mathcal{L}(\mathcal{S})$ is the complement of $\mathcal{L}(\mathcal{S}^{a})$. This completes the proof of Proposition 5.6.
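The swapping argument can be checked on a small instance. The sketch below is illustrative only and rests on several assumptions that are not taken from the paper: a chain-shaped tree is a Python list of triples (input label, output label, data value) from root to leaf, the profile of a chain records only whether a node has the same data value as its parent, and the string read by $\mathcal{M}$ lists, in increasing order of data value, the set of output labels found at that value. All names are hypothetical.

```python
from typing import List, Tuple

# (input label, transducer output label, data value), listed from root to leaf
Node = Tuple[str, str, int]


def chain_profile(chain: List[Node]) -> List[Tuple[str, str]]:
    """Assumed profile of a chain: input label plus parent-equality flag."""
    profile = []
    for i, (label, _out, value) in enumerate(chain):
        if i == 0:
            flag = "*"  # the root has no parent
        else:
            flag = "T" if chain[i - 1][2] == value else "F"
        profile.append((label, flag))
    return profile


def value_string(chain: List[Node]) -> List[frozenset]:
    """Assumed input of M: sets of output labels, ordered by data value."""
    by_value = {}
    for _label, out, value in chain:
        by_value.setdefault(value, set()).add(out)
    return [frozenset(by_value[d]) for d in sorted(by_value)]


def in_language_a(chain: List[Node]) -> bool:
    """True iff some a-node has an a-descendant whose value is not larger."""
    return any(
        chain[i][0] == "a" and chain[j][0] == "a" and chain[j][2] <= chain[i][2]
        for i in range(len(chain))
        for j in range(i + 1, len(chain))
    )


def swap_values(chain: List[Node], i: int, j: int) -> List[Node]:
    """Swap the data values of nodes i and j, keeping all labels in place."""
    swapped = list(chain)
    swapped[i] = (chain[i][0], chain[i][1], chain[j][2])
    swapped[j] = (chain[j][0], chain[j][1], chain[i][2])
    return swapped


# Three a-nodes and two output labels: nodes 0 and 2 share the label "g".
t = [("a", "g", 1), ("a", "h", 2), ("a", "g", 3)]
t2 = swap_values(t, 0, 2)
```

Here `t` has increasing data values, `t2` does not, and yet both have the same profile and the same value string, so any device that only sees those two objects cannot tell them apart.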

We remark that, as discussed in Section 7, extending ODTA with the complement of languages of the form in Example 5.2 immediately yields undecidability.

Theorems 5.8, 5.9 and 5.10 are the main results in this paper. Theorem 5.8 below provides the ODTA characterisation of the logic $\exists\mathrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$, and its proof can be found in Subsection 5.1.

Theorem 5.8.

A language $\mathcal{L}_{data}$ is expressible with an $\exists\mathrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$ formula if and only if it is accepted by an ODTA $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$, where $\mathcal{L}(\mathcal{M})$ is a commutative language. Moreover, the translation from $\exists\mathrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$ formulas to ODTA takes triple exponential time, while the translation from ODTA to $\exists\mathrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$ formulas takes exponential time.

Theorem 5.9 below provides the logical characterisation of ODTA. The proof can be found in Subsection 5.2.

Theorem 5.9.

A language $\mathcal{L}_{data}$ is accepted by an ODTA if and only if it is expressible with a formula of the form $\exists X_{1}\cdots\exists X_{m}\ \varphi\wedge\psi$, where $\varphi$ is a formula from $\mathrm{FO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$ and $\psi$ is a formula from $\mathrm{FO}(\sim,\prec,\prec_{suc})$, both extended with the unary predicates $X_{1},\ldots,X_{m}$ and $a(\cdot)$. Moreover, the translation from ODTA to formulas takes polynomial time, and the translation from formulas to ODTA is effective, but non-elementary.

Finally, Theorem 5.10 shows that the non-emptiness problem for ODTA is decidable. The proof can be found in Subsection 5.3.

Theorem 5.10.

The non-emptiness problem for ODTA is decidable in 3-NExpTime.

The best lower bound known to date is NP-hardness; see [Fan and Libkin (2002), David et al. (2012)].

5.1 Proof of Theorem 5.8

In the proof we assume that the ordered-data trees are over the finite alphabet Σ\Sigma. We will need the following proposition which states that every MSO2(E,E,)\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim) formula can be syntactically rewritten to a normal form for MSO2(E,E,)\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim).

Proposition 5.11.

[Bojanczyk et al. (2009), Proposition 3.8] Every formula ψMSO2(E,E,)\psi\in\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim) can be rewritten into a normal form of exponential size of the form: Y1Ynφ\exists Y_{1}\cdots\exists Y_{n}\ \varphi, where φ\varphi is a conjunction of formulae of the form:

  • (N1)

    $\forall x\forall y\ (\alpha(x)\wedge\delta(x,y)\wedge\xi(x,y)\to\beta(y))$,

  • (N2)

    $\forall x\ (\mathrm{root}(x)\to\alpha(x))$,

  • (N3)

    $\forall x\ (\mathrm{first\text{-}sibling}(x)\to\alpha(x))$,

  • (N4)

    $\forall x\ (\mathrm{last\text{-}sibling}(x)\to\alpha(x))$,

  • (N5)

    $\forall x\ (\mathrm{leaf}(x)\to\alpha(x))$,

  • (N6)

    $\forall x\forall y\ (\alpha(x)\wedge\alpha(y)\wedge x\sim y\to x=y)$,

  • (N7)

    $\forall x\exists y\ (\alpha(x)\to\beta(y)\wedge x\sim y)$,

where $\alpha(x),\beta(x)$ are conjunctions of some unary predicates and their negations, $\delta(x,y)$ is either $E_{\downarrow}(x,y)$ or $E_{\rightarrow}(x,y)$, and $\xi(x,y)$ is either $x\sim y$ or $x\nsim y$.

We should remark that if $\varphi$ is a conjunction of formulae of the forms (N1)–(N5) above, then there exists a tree automaton $\mathcal{A}$ over the alphabet $\Sigma\times\{\top,\bot,*\}^{3}$ such that for every ordered-data tree $t$,

$t\models\varphi\quad\text{if and only if}\quad\textsf{Profile}(t)\ \text{is accepted by}\ \mathcal{A}.$

Such a construction is straightforward using classical automata theory; see, for example, [Thomas (1997)]. We divide the proof of Theorem 5.8 into Lemmas 5.12 and 5.14 below.

Lemma 5.12.

For every formula ΨMSO2(E,E,)\Psi\in\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim), there exists an ODTA 𝒮Ψ=𝒯,,Γ0\mbox{$\mathcal{S}$}_{\Psi}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle such that data(Ψ)=data(𝒮Ψ)\mbox{$\mathcal{L}$}_{data}(\Psi)=\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}_{\Psi}) and ()\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$}) is commutative. Moreover, the construction of 𝒮Ψ\mbox{$\mathcal{S}$}_{\Psi} is effective and takes triple exponential time in the size of the formula Ψ\Psi.

Proof 5.13.

Applying Proposition 5.11, we can rewrite the formula Ψ\Psi in its normal form Y1YnΨ\exists Y_{1}\cdots\exists Y_{n}\Psi^{\prime}. Furthermore, we can rewrite the formula Ψ\Psi into the form X1Xmφ\exists X_{1}\cdots\exists X_{m}\ \varphi, where m=2nm=2^{n}, and φ\varphi is a conjunction of formulas of the form:

  • (N0)

    $X_{1},\ldots,X_{m}$ are pairwise disjoint, and $\bigwedge_{a\in\Sigma}\forall x\,(a(x)\to\alpha'(x))$,

  • (N1)

    $\forall x\forall y\ (\alpha'(x)\wedge\delta(x,y)\wedge\xi(x,y)\to\beta'(y))$,

  • (N2)

    $\forall x\ (\mathrm{root}(x)\to\alpha'(x))$,

  • (N3)

    $\forall x\ (\mathrm{first\text{-}sibling}(x)\to\alpha'(x))$,

  • (N4)

    $\forall x\ (\mathrm{last\text{-}sibling}(x)\to\alpha'(x))$,

  • (N5)

    $\forall x\ (\mathrm{leaf}(x)\to\alpha'(x))$,

  • (N6)

    $\forall x\forall y\ (\alpha'(x)\wedge\alpha'(y)\wedge x\sim y\to x=y)$,

  • (N7)

    $\forall x\exists y\ (\alpha'(x)\to\beta'(y)\wedge x\sim y)$,

where $\alpha'(x),\beta'(x)$ are disjunctions of some of the $X_{i}$’s, and $\delta(x,y)$ and $\xi(x,y)$ are the same as above. Intuitively, the unary predicates $X_{1},\ldots,X_{m}$ correspond to subsets of $\{Y_{1},\ldots,Y_{n}\}$.

The ODTA 𝒮Ψ=𝒯,,Γ0\mbox{$\mathcal{S}$}_{\Psi}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle is defined as follows.

  • The transducer 𝒯\mathcal{T} checks whether the formulas (N0)–(N5) are satisfied, with the output alphabet Γ={X1,,Xm}\Gamma=\{X_{1},\ldots,X_{m}\} where a node is labeled with XiX_{i} if and only if it belongs to XiX_{i}.
    The construction of such a transducer is straightforward and thus omitted. See, for example, [Thomas (1997)].

  • Γ0\Gamma_{0} consists of the XiX_{i}’s, where there exists A{X1,,Xm}A\subseteq\{X_{1},\ldots,X_{m}\} and XiAX_{i}\in A and a formula of the form (N6)

    xy(XjAXj(x)XjAXj(y)xyx=y),\forall x\forall y\ (\bigvee_{X_{j}\in A}X_{j}(x)\wedge\bigvee_{X_{j}\in A}X_{j}(y)\wedge x\sim y\to x=y),

    in φ\varphi.

  • the automaton \mathcal{M} accepts the language (2{X1,,Xm}(𝒫1𝒫2))(2^{\{X_{1},\ldots,X_{m}\}}-(\mbox{$\mathcal{P}$}_{1}\cup\mbox{$\mathcal{P}$}_{2}))^{\ast}, where

    $\mathcal{P}_{1} := \{\, S \mid \text{there exists a formula}\ \forall x\exists y\ (\bigvee_{X\in A}X(x)\to\bigvee_{X\in B}X(y)\wedge x\sim y)\ \text{in}\ \varphi\ \text{such that}\ S\cap A\neq\emptyset\ \text{but}\ S\cap B=\emptyset \,\}$

    $\mathcal{P}_{2} := \{\, S \mid \text{there exists a formula}\ \forall x\forall y\ (\bigvee_{X\in A}X(x)\wedge\bigvee_{X\in A}X(y)\wedge x\sim y\to x=y)\ \text{in}\ \varphi\ \text{such that}\ |S\cap A|\geq 2 \,\}$
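The sets $\mathcal{P}_{1}$ and $\mathcal{P}_{2}$ can be computed by brute force over the subsets of $\{X_{1},\ldots,X_{m}\}$. The following is a minimal sketch under stated assumptions: formulas of the form (N7) are given as pairs of sets `(A, B)`, formulas of the form (N6) as sets `A`, and only non-empty subsets are enumerated, since only those can occur in the string of a tree. The function names and this encoding are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Iterator, List, Sequence, Set, Tuple


def nonempty_subsets(gamma: Iterable[str]) -> Iterator[FrozenSet[str]]:
    """Enumerate all non-empty subsets of gamma."""
    items = sorted(gamma)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def allowed_symbols(
    gamma: Iterable[str],
    n7: List[Tuple[Set[str], Set[str]]],
    n6: List[Set[str]],
) -> Set[FrozenSet[str]]:
    """Return the symbols S that belong neither to P1 nor to P2."""
    allowed = set()
    for s in nonempty_subsets(gamma):
        # P1: S meets A but misses B for some (N7) formula
        in_p1 = any(bool(s & a) and not (s & b) for a, b in n7)
        # P2: S contains at least two predicates of A for some (N6) formula
        in_p2 = any(len(s & a) >= 2 for a in n6)
        if not (in_p1 or in_p2):
            allowed.add(s)
    return allowed


def accepts(word: Sequence[FrozenSet[str]], allowed: Set[FrozenSet[str]]) -> bool:
    """M accepts iff every symbol of the word is allowed."""
    return all(s in allowed for s in word)
```

Since acceptance only depends on which symbols occur, permuting the word does not change the outcome, which is the commutativity of $\mathcal{L}(\mathcal{M})$ used in the proof. The enumeration is exponential in $m$, in line with the blow-up mentioned in the complexity analysis.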

That ()\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$}) is commutative is trivial. That 𝒮\mathcal{S} accepts precisely the language data(Ψ)\mbox{$\mathcal{L}$}_{data}(\Psi) can be deduced from the following.

  • That 𝒯\mathcal{T} ensures that formulas N0–N5 are satisfied.

  • That Γ0\Gamma_{0} contains precisely the symbols XiX_{i}’s where all XiX_{i}-nodes are supposed to contain different data values.

  • That for every ordered-data tree tt,

    txy(XAX(x)XBX(y)xy)t\models\forall x\exists y\ (\bigvee_{X\in A}X(x)\to\bigvee_{X\in B}X(y)\wedge x\sim y)

    if and only if [S]t=for allSsuch thatSAbutSB=[S]_{t}=\emptyset\ \mbox{for all}\ S\ \mbox{such that}\ S\cap A\neq\emptyset\ \mbox{but}\ S\cap B=\emptyset.

  • That for every ordered-data tree tt,

    txy(XAX(x)XAX(y)xyx=y)t\models\forall x\forall y\ (\bigvee_{X\in A}X(x)\wedge\bigvee_{X\in A}X(y)\wedge x\sim y\to x=y)

    if and only if

    • [S]t=[S]_{t}=\emptyset for all SS such that |SA|2|S\cap A|\geq 2; and

    • for all XAX\in A, txy(X(x)X(y)xyx=y)t\models\forall x\forall y\ (X(x)\wedge X(y)\wedge x\sim y\to x=y), which is captured by the condition imposed by Γ0\Gamma_{0}.

The analysis of the complexity is as follows. The first step, applying Proposition 5.11, induces an exponential blow-up in the size of the input. The second step, constructing the formula $\exists X_{1}\cdots\exists X_{m}\ \varphi$, takes exponential time in $n$, and $n$ is exponential in the size of the input. The construction of $\mathcal{T}$ takes polynomial time in the size of $\varphi$, since (N0)–(N5) are already in the “automata transition” format. The construction of $\Gamma_{0}$ takes polynomial time in $m$, while the construction of $\mathcal{M}$ induces another exponential blow-up in $m$. Altogether, constructing $\mathcal{S}_{\Psi}$ takes triple exponential time in the size of $\Psi$. This concludes the proof of Lemma 5.12.

For the complexity analysis in Lemma 5.14, we assume that a commutative automaton \mathcal{M} is given as a set of vectors (in binary format) indicating its Parikh images. That is, \mathcal{M} is given as a set I={(u¯1,v¯1,1,,v¯1,),,(u¯n,v¯n,1,,v¯n,)}I=\{(\bar{u}_{1},\bar{v}_{1,1},\ldots,\bar{v}_{1,\ell}),\ldots,(\bar{u}_{n},\bar{v}_{n,1},\ldots,\bar{v}_{n,\ell})\}, where

(u¯,v¯1,,v¯)I(u¯,v¯1,,v¯)=(),\bigcup_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})\in I}\mbox{$\mathcal{L}$}(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})=\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$}),

and each number in the vectors in II is written in the standard binary form.

Lemma 5.14.

For every ODTA 𝒮=𝒯,,Γ0\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle, where ()\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$}) is a commutative language, there exists a formula φMSO2(E,E,)\varphi\in\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim) such that data(φ)=data(𝒮)\mbox{$\mathcal{L}$}_{data}(\varphi)=\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}). Moreover, the construction of φ\varphi takes exponential time in the size of 𝒮\mathcal{S}.

Proof 5.15.

Let Q𝒯={q0,,qm}Q_{\mbox{\scriptsize$\mathcal{T}$}}=\{q_{0},\ldots,q_{m}\} and Γ={α1,,αk}\Gamma=\{\alpha_{1},\ldots,\alpha_{k}\} be the set of states and the output alphabet of the transducer 𝒯\mathcal{T}, respectively. Let =2|Γ|1\ell=2^{|\Gamma|}-1.

By Theorem 2.1, $\mathcal{L}(\mathcal{M})$ is a finite union of periodic languages. Let $I$ be the finite set of $(\ell+1)$-tuples of $\mathbb{N}^{\ell}$-vectors such that

(u¯,v¯1,,v¯)I(u¯,v¯1,,v¯)=().\bigcup_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})\in I}\mbox{$\mathcal{L}$}(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})=\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$}).

Let I={(u¯1,v¯1,1,,v¯1,),,(u¯n,v¯n,1,,v¯n,)}I=\{(\bar{u}_{1},\bar{v}_{1,1},\ldots,\bar{v}_{1,\ell}),\ldots,(\bar{u}_{n},\bar{v}_{n,1},\ldots,\bar{v}_{n,\ell})\} and S1,,SS_{1},\ldots,S_{\ell} be the enumeration of non-empty subsets of Γ\Gamma. First, for (u¯,v¯1,,v¯)I(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})\in I, we construct an MSO2(E,E,)\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim) formula Ψ(u¯,v¯1,,v¯)\Psi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})} where

$t\models\Psi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})}$ if and only if there exist $h_{1},\ldots,h_{\ell}\geq 0$ such that $(|[S_{1}]_{t}|,\ldots,|[S_{\ell}]_{t}|)=\bar{u}+h_{1}\bar{v}_{1}+\cdots+h_{\ell}\bar{v}_{\ell}$.

We denote by viv_{i} the non-zero entry of v¯i\bar{v}_{i}. This formula Ψ(u¯,v¯1,,v¯)\Psi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})} is as follows.

\[
\begin{array}{l}
\exists W_{1,1}\cdots W_{1,u_{1}}\ \cdots\cdots\ \exists W_{\ell,1}\cdots W_{\ell,u_{\ell}}\\
\quad\exists X_{1,0}\cdots X_{1,v_{1}-1}\ \exists Y_{1,0}\cdots Y_{1,v_{1}-1}\ Z_{1}\\
\quad\quad\ddots\\
\quad\exists X_{\ell,0}\cdots X_{\ell,v_{\ell}-1}\ \exists Y_{\ell,0}\cdots Y_{\ell,v_{\ell}-1}\ Z_{\ell}\\
\quad\quad\bigwedge_{i}W_{i,1},\ldots,W_{i,u_{i}}\cap Z_{i}=\emptyset\\
\quad\quad\wedge\ \bigwedge_{i}\varphi_{S_{i},u_{i}}(W_{i,1},\ldots,W_{i,u_{i}})\\
\quad\quad\wedge\ \bigwedge_{i}\varphi_{S_{i},\pmod{v_{i}}}(X_{i,0},\ldots,X_{i,v_{i}-1},Y_{i,0},\ldots,Y_{i,v_{i}-1},Z_{i})
\end{array}
\]

where $\varphi_{S_{i},u_{i}}(W_{i,1},\ldots,W_{i,u_{i}})$ and $\varphi_{S_{i},\pmod{v_{i}}}(X_{i,0},\ldots,X_{i,v_{i}-1},Y_{i,0},\ldots,Y_{i,v_{i}-1},Z_{i})$ are the formulas for the languages $\mathcal{L}_{data}^{S_{i},u_{i}}$ and $\mathcal{L}_{data}^{S_{i},\pmod{v_{i}}}$ in Examples 3.4 and 3.5, respectively.
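The condition expressed by $\Psi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})}$ can be checked componentwise. The sketch below assumes, as a reading of the text that is not stated explicitly here, that each period vector $\bar{v}_{i}$ is $v_{i}$ times the $i$-th unit vector, so that the condition becomes $|[S_{i}]_{t}|=u_{i}+h_{i}v_{i}$ for each $i$. The function name and the list-based encoding are hypothetical.

```python
from typing import Sequence


def in_periodic_set(
    counts: Sequence[int], base: Sequence[int], periods: Sequence[int]
) -> bool:
    """Check counts = base + h_1 * v_1 * e_1 + ... + h_l * v_l * e_l for some h_i >= 0.

    counts[i]  : the number |[S_i]_t|
    base[i]    : the entry u_i of the base vector
    periods[i] : the non-zero entry v_i of the i-th period vector (0 if none)
    """
    if not (len(counts) == len(base) == len(periods)):
        raise ValueError("all three vectors must have the same length")
    for x, u, v in zip(counts, base, periods):
        if x < u:
            return False  # fewer elements than the base requires
        if v == 0:
            if x != u:
                return False  # no period available in this component
        elif (x - u) % v != 0:
            return False  # the surplus is not a multiple of the period
    return True
```

This mirrors the shape of the formula: the sets $W_{i,1},\ldots,W_{i,u_{i}}$ account for the base $u_{i}$, and the remaining elements are counted modulo $v_{i}$.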

The desired formula φ\varphi is:

\[
\begin{array}{l}
\exists X_{q_{0}}\cdots\exists X_{q_{m}}\ \exists X_{\alpha_{1}}\cdots\exists X_{\alpha_{k}}\ \exists\overline{X}_{(\bar{u}_{1},\bar{v}_{1,1},\ldots,\bar{v}_{1,\ell})}\ \cdots\ \exists\overline{X}_{(\bar{u}_{n},\bar{v}_{n,1},\ldots,\bar{v}_{n,\ell})}\\
\qquad\qquad\varphi_{\Gamma_{0}}\wedge\varphi_{\mathcal{T}}\wedge\bigvee_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})\in I}\varphi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})}
\end{array}
\]

where

  • the formula φΓ0\varphi_{\Gamma_{0}} expresses the fact that the data values found under nodes labeled with a symbol from Γ0\Gamma_{0} are all different;

  • the unary predicates Xq0,,Xqm,Xα1,,XαkX_{q_{0}},\ldots,X_{q_{m}},X_{\alpha_{1}},\ldots,X_{\alpha_{k}} are supposed to represent the states and the output alphabets of 𝒯\mathcal{T}, respectively;

  • the formula $\varphi_{\mathcal{T}}$ expresses the behaviour of the transducer $\mathcal{T}$: a tree $t$ satisfies $\varphi_{\mathcal{T}}$ if the predicates encode an accepting run of $\mathcal{T}$ on $t$, that is, for every node $u\in\textsf{Dom}(t)$, $X_{q_{i}}(u)$ and $X_{\alpha_{j}}(u)$ hold if the node $u$ is labeled with $q_{i}$ and outputs $\alpha_{j}$ in that run;

  • the predicates X¯(ui¯,v¯i,1,,v¯i,)\overline{X}_{(\bar{u_{i}},\bar{v}_{i,1},\ldots,\bar{v}_{i,\ell})}’s and the formulas φ(ui¯,v¯i,1,,v¯i,)\varphi_{(\bar{u_{i}},\bar{v}_{i,1},\ldots,\bar{v}_{i,\ell})}’s are as in the formula Ψ(u¯,v¯1,,v¯)\Psi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})} defined above.

The analysis of the complexity is as follows. The sizes of the formulas $\varphi_{S_{i},u_{i}}$ and $\varphi_{S_{i},\pmod{v_{i}}}$ are exponential in the size of $S_{i},u_{i},v_{i}$. Hence, the construction of $\Psi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})}$ takes exponential time in the size of $(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})$. The construction of $\varphi_{\Gamma_{0}}$ and $\varphi_{\mathcal{T}}$ takes polynomial time in the size of $\Gamma_{0}$ and $\mathcal{T}$, respectively. Hence, the total time to construct the formula $\varphi$ is exponential in the size of $\mathcal{S}$. This completes the proof of the lemma.

5.2 Proof of Theorem 5.9

In this subsection for every ordered-data tree tt, we assume that the data values in tt are precisely the natural numbers in the range [1..m][1..m], for a positive integer m1m\geq 1. That is, if d1<d2<<dmd_{1}<d_{2}<\cdots<d_{m} are the data values in tt, then d1=1d_{1}=1, d2=2d_{2}=2, \ldots, dm=md_{m}=m.

We start with the following lemma.

Lemma 5.16.

Let ψFO(,)\psi\in\mbox{$\textrm{FO}$}(\sim,\prec) be of quantifier rank kk. Let Γ={a1,,a}\Gamma=\{a_{1},\ldots,a_{\ell}\} be the set of unary predicates used in ψ\psi. There exists a finite state automaton CC over the alphabet Γ(2Γ×Γ,k)\Gamma\cup(2^{\Gamma}\times\mbox{$\mathcal{F}$}_{\Gamma,k}) such that the following holds.

  • The automaton CC accepts words of the form

    a1a1f1(a1)aaf1(a)(S1,f1)a1a1fm(a1)aafm(a)(Sm,fm),\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m}),

    where each Si={afi(a)1}S_{i}=\{a\mid f_{i}(a)\geq 1\}.

  • For every ordered-data tree $t\models\psi$, if $\mathcal{V}^{(k)}(t)=(S_{1},f_{1}),\ldots,(S_{m},f_{m})$, then there exists a word in $\mathcal{L}(C)$ of the form

    a1a1f1(a1)aaf1(a)(S1,f1)a1a1fm(a1)aafm(a)(Sm,fm)\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m})
  • For every word w(C)w\in\mbox{$\mathcal{L}$}(C), if ww is

    a1a1f1(a1)aaf1(a)(S1,f1)a1a1fm(a1)aafm(a)(Sm,fm)\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m})

    then there exists a tree $t\models\psi$, where $\mathcal{V}^{(k)}(t)=(S_{1},f_{1})\cdots(S_{m},f_{m})$.

Proof 5.17.

Let $\psi\in\mathrm{FO}(\sim,\prec)$ be of quantifier rank $k$. Let $\Gamma=\{a_{1},\ldots,a_{\ell}\}$ be the set of unary predicates used in $\psi$. We define the sentence $\overline{\psi}\in\mathrm{FO}(<)$ (that is, over strings) inductively from $\psi$ as follows.

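  • If $\psi$ is $\forall x\;\xi$, then $\overline{\psi}$ is $\forall x\;(\bigvee_{a\in\Gamma}a(x)\to\overline{\xi})$; if $\psi$ is $\exists x\;\xi$, then $\overline{\psi}$ is $\exists x\;(\bigvee_{a\in\Gamma}a(x)\wedge\overline{\xi})$. In both cases the quantifier is restricted to positions labeled with a symbol from $\Gamma$.
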
  • If ψ\psi is x=yx=y, then ψ¯\overline{\psi} is also x=yx=y.

  • If ψ\psi is xyx\sim y, then ψ¯\overline{\psi} states “there is no position in between xx and yy labeled with any symbol from 2Γ×Γ,k2^{\Gamma}\times\mbox{$\mathcal{F}$}_{\Gamma,k}.”

  • If ψ\psi is xyx\prec y, then ψ¯\overline{\psi} states “there is at least one position in between xx and yy labeled with a symbol from 2Γ×Γ,k2^{\Gamma}\times\mbox{$\mathcal{F}$}_{\Gamma,k}.”
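The translation of $\sim$ and $\prec$ can be illustrated on a concrete word. The sketch below is a simplification under stated assumptions: a data string is a list of (label, data value) pairs, the separator symbol after each block carries only the set $S_{i}$ (the component $f_{i}$ is omitted), and separators are represented as Python tuples while labels are strings. All names are hypothetical.

```python
from typing import List, Tuple, Union

Separator = Tuple[str, frozenset]
Symbol = Union[str, Separator]


def encode(nodes: List[Tuple[str, int]]) -> List[Symbol]:
    """Group the labels by increasing data value and close each group with a
    separator symbol carrying the set of labels in that group."""
    word: List[Symbol] = []
    for d in sorted({value for _, value in nodes}):
        labels = sorted(label for label, value in nodes if value == d)
        word.extend(labels)
        word.append(("SEP", frozenset(labels)))
    return word


def is_separator(sym: Symbol) -> bool:
    return isinstance(sym, tuple)


def separators_between(word: List[Symbol], x: int, y: int) -> int:
    """Number of separator positions strictly between positions x and y."""
    lo, hi = min(x, y), max(x, y)
    return sum(1 for sym in word[lo + 1 : hi] if is_separator(sym))


def same_value(word: List[Symbol], x: int, y: int) -> bool:
    """Translation of x ~ y: no separator between the two positions."""
    return separators_between(word, x, y) == 0


def smaller_value(word: List[Symbol], x: int, y: int) -> bool:
    """Translation of x < y on data values: x is to the left of y and at
    least one separator lies in between."""
    return x < y and separators_between(word, x, y) >= 1
```

The positions `x` and `y` are meant to be positions labeled with a symbol from $\Gamma$, matching the restriction of quantifiers in the definition of $\overline{\psi}$.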

We have the following claim.

Claim 1.
  1. (1)

    For every ordered-data tree $t\models\psi$, if $\mathcal{V}^{(k)}(t)=(S_{1},f_{1}),\ldots,(S_{m},f_{m})$, then there exists a word $w\models\overline{\psi}$ of the form

    a1a1f1(a1)aaf1(a)(S1,f1)a1a1fm(a1)aafm(a)(Sm,fm)\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m})
  2. (2)

    For every word wψ¯w\models\overline{\psi}, if ww is

    a1a1f1(a1)aaf1(a)(S1,f1)a1a1fm(a1)aafm(a)(Sm,fm)\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m})

    then there exists a tree $t\models\psi$, where $\mathcal{V}^{(k)}(t)=(S_{1},f_{1})\cdots(S_{m},f_{m})$.

Proof 5.18.

We first prove item (1). Let tt be an ordered-data tree over the alphabet Γ\Gamma and let 𝒱k(t)=(S1,f1)(Sm,fm)\mbox{$\mathcal{V}$}^{k}(t)=(S_{1},f_{1})\cdots(S_{m},f_{m}) be its kk-extended string representation of data values in tt. Let tt^{\prime} be the following data string

(a11)(a11)f1(a1)(a1)(a1)f1(a)(a1m)(a1m)fm(a1)(am)(am)fm(a)\overbrace{{a_{1}\choose 1}\cdots{a_{1}\choose 1}}^{f_{1}(a_{1})}\cdots\overbrace{{a_{\ell}\choose 1}\cdots{a_{\ell}\choose 1}}^{f_{1}(a_{\ell})}\cdots\cdots\cdots\cdots\overbrace{{a_{1}\choose m}\cdots{a_{1}\choose m}}^{f_{m}(a_{1})}\cdots\overbrace{{a_{\ell}\choose m}\cdots{a_{\ell}\choose m}}^{f_{m}(a_{\ell})}

When $t'$ is viewed as a data tree (a data string is a data tree in which each node has at most one child), $\mathcal{V}^{(k)}_{\Gamma}(t)=\mathcal{V}^{(k)}(t')$. Hence, by Corollary 4.5,

tψif and only iftψ.t\models\psi\qquad\mbox{if and only if}\qquad t^{\prime}\models\psi.

By straightforward induction on ψ¯\overline{\psi}, we can show that for every tψt^{\prime}\models\psi of the form

(a11)(a11)f1(a1)(a1)(a1)f1(a)(a1m)(a1m)fm(a1)(am)(am)fm(a)\overbrace{{a_{1}\choose 1}\cdots{a_{1}\choose 1}}^{f_{1}(a_{1})}\cdots\overbrace{{a_{\ell}\choose 1}\cdots{a_{\ell}\choose 1}}^{f_{1}(a_{\ell})}\cdots\cdots\cdots\cdots\overbrace{{a_{1}\choose m}\cdots{a_{1}\choose m}}^{f_{m}(a_{1})}\cdots\overbrace{{a_{\ell}\choose m}\cdots{a_{\ell}\choose m}}^{f_{m}(a_{\ell})}

there exists a word wψ¯w\models\overline{\psi} of the form

a1a1f1(a1)aaf1(a)(S1,f1)a1a1fm(a1)aafm(a)(Sm,fm)\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m})

Similarly, to prove (2), we can prove by straightforward induction on ψ¯\overline{\psi} that for every word wψ¯w\models\overline{\psi} of the form

a1a1f1(a1)aaf1(a)(S1,f1)a1a1fm(a1)aafm(a)(Sm,fm),\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m}),

there exists a tree tψt\models\psi of the form

(a11)(a11)f1(a1)(a1)(a1)f1(a)(a1m)(a1m)fm(a1)(am)(am)fm(a)\overbrace{{a_{1}\choose 1}\cdots{a_{1}\choose 1}}^{f_{1}(a_{1})}\cdots\overbrace{{a_{\ell}\choose 1}\cdots{a_{\ell}\choose 1}}^{f_{1}(a_{\ell})}\cdots\cdots\cdots\cdots\overbrace{{a_{1}\choose m}\cdots{a_{1}\choose m}}^{f_{m}(a_{1})}\cdots\overbrace{{a_{\ell}\choose m}\cdots{a_{\ell}\choose m}}^{f_{m}(a_{\ell})}

This completes the proof of our claim.

Let $C$ be an automaton over the alphabet $\Gamma\cup(2^{\Gamma}\times\mathcal{F}_{\Gamma,k})$ that expresses the formula $\overline{\psi}$ and accepts only words of the form

a1a1f1(a1)aaf1(a)(S1,f1)a1a1fm(a1)aafm(a)(Sm,fm),\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m}),

where each Si={afi(a)1}S_{i}=\{a\mid f_{i}(a)\geq 1\}. The construction of CC from the formula ψ¯\overline{\psi} is rather standard, but non-elementary. See, for example, [Thomas (1997)]. That the automaton CC is the desired automaton is immediate. This completes our proof of Lemma 5.16.

Lemma 5.19.

Let ψFO(,)\psi\in\mbox{$\textrm{FO}$}(\sim,\prec) be of quantifier rank kk. Let Γ={a1,,a}\Gamma=\{a_{1},\ldots,a_{\ell}\} be the set of unary predicates used in ψ\psi. There exists a finite state automaton \mathcal{M} over the alphabet 2Γ×Γ,k2^{\Gamma}\times\mbox{$\mathcal{F}$}_{\Gamma,k} such that ()={𝒱Γ,k(k)(t)tψ}\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})=\{\mbox{$\mathcal{V}$}^{(k)}_{\Gamma,k}(t)\mid t\models\psi\}.

Proof 5.20.

Let $C$ be the automaton obtained by applying Lemma 5.16 to the formula $\psi$. Let $\mathcal{M}$ be the automaton obtained from $C$ by projecting every symbol from $\Gamma$ to the empty string. The automaton $\mathcal{M}$ is the desired automaton, and this completes our proof of Lemma 5.19.
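Projecting symbols to the empty string amounts to treating their transitions as silent moves. The following is a minimal sketch, assuming that $C$ is given as a nondeterministic automaton whose transitions form a Python dictionary from (state, symbol) pairs to sets of states; this encoding and the function name are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Sequence, Set, Tuple

Transitions = Dict[Tuple[Hashable, Hashable], Set[Hashable]]


def accepts_projected(
    transitions: Transitions,
    initial: Hashable,
    finals: Set[Hashable],
    erased: Iterable[Hashable],
    word: Sequence[Hashable],
) -> bool:
    """Return True iff the automaton accepts some word whose projection,
    obtained by deleting all symbols in `erased`, equals `word`."""
    erased = set(erased)

    def closure(states: Set[Hashable]) -> Set[Hashable]:
        # Follow transitions on erased symbols as if they were silent moves.
        seen = set(states)
        stack = list(states)
        while stack:
            q = stack.pop()
            for a in erased:
                for r in transitions.get((q, a), ()):
                    if r not in seen:
                        seen.add(r)
                        stack.append(r)
        return seen

    current = closure({initial})
    for sym in word:
        step: Set[Hashable] = set()
        for q in current:
            step |= set(transitions.get((q, sym), ()))
        current = closure(step)
    return bool(current & finals)
```

For example, if $C$ accepts only the word `a S1 a a S2` and the symbol `a` is erased, the projected automaton accepts exactly `S1 S2`.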

Now we are ready to prove Theorem 5.9. We start with the “if” direction. Let Ψ\Psi be a formula of the form:

Y1Ynφψ,\exists Y_{1}\cdots\exists Y_{n}\ \varphi\wedge\psi,

where $\varphi$ is a formula from $\mathrm{FO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$ and $\psi$ is a formula from $\mathrm{FO}(\sim,\prec)$, both extended with the unary predicates $Y_{1},\ldots,Y_{n}$.

By Proposition 5.11, we can rewrite (with additional unary predicates) the formula φ\varphi into a conjunction of formulae of the form N1–N7 as stated in Proposition 5.11. Then we further rewrite it into the form

X1Xmφψ,\exists X_{1}\cdots\exists X_{m}\ \varphi^{\prime}\wedge\psi^{\prime},

where $m=2^{n}$, $\varphi'$ is a formula from $\mathrm{FO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$ and $\psi'$ is a formula from $\mathrm{FO}(\sim,\prec)$, both extended with the unary predicates $X_{1},\ldots,X_{m}$, and the formula $\varphi'$ is a conjunction of formulas of the form:

  • (N0)

    a formula $\xi$ that states that $X_{1},\ldots,X_{m}$ are pairwise disjoint and that

    $\bigwedge_{a\in\Sigma}\forall x\ (a(x)\to\alpha(x))$,

  • (N1)

    $\forall x\forall y\ (\alpha(x)\wedge\delta(x,y)\wedge\xi(x,y)\to\beta(y))$,

  • (N2)

    $\forall x\ (\mathrm{root}(x)\to\alpha(x))$,

  • (N3)

    $\forall x\ (\mathrm{first\text{-}sibling}(x)\to\alpha(x))$,

  • (N4)

    $\forall x\ (\mathrm{last\text{-}sibling}(x)\to\alpha(x))$,

  • (N5)

    $\forall x\ (\mathrm{leaf}(x)\to\alpha(x))$,

  • (N6)

    $\forall x\forall y\ (\alpha(x)\wedge\alpha(y)\wedge x\sim y\to x=y)$,

  • (N7)

    $\forall x\exists y\ (\alpha(x)\to\beta(y)\wedge x\sim y)$,

where α(x),β(x)\alpha(x),\beta(x) are disjunctions of some of the unary predicates X1,,XmX_{1},\ldots,X_{m}.

We will describe the ODTA $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$ for the formula $\Psi$, where the transducer $\mathcal{T}$ expresses the formulas N0–N5 with the output alphabet $\Gamma=\{X_{1},\ldots,X_{m}\}$, the automaton $\mathcal{M}$ expresses the formulas N6, N7 and $\psi'$, and $\Gamma_{0}$ is the set of symbols that appear in the formulas N6. Formally, it is defined as follows.

  • The output alphabet of 𝒯\mathcal{T} is Γ={X1,,Xm}\Gamma=\{X_{1},\ldots,X_{m}\}.

  • The transducer expresses the formulas N0–N5 above. In particular, the input and output symbols of each node must satisfy the formula N0.

    This step takes polynomial time, since the formulas N0–N5 are already in the transition format.

  • The set Γ0={XiXiappears inN6}\Gamma_{0}=\{X_{i}\mid X_{i}\ \mbox{appears in}\ \mbox{N6}^{\prime}\}.

    This step takes polynomial time.

  • The automaton \mathcal{M} expresses the formulas N6, N7 and ψ\psi^{\prime}, obtained by applying Lemma 5.19.

    This step is constructive, but non-elementary due to the conversion from a formula to its finite state automaton.

It is straightforward to show that data(𝒮)={ttΨ}\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})=\{t\mid t\models\Psi\}.

Now we prove the “only if” direction. Let $\mathcal{L}=\mathcal{L}_{data}(\mathcal{S})$, where $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$, and let

  • $Q=\{q_{1},\ldots,q_{n}\}$ be the states of $\mathcal{T}$;

  • $P=\{p_{1},\ldots,p_{s}\}$ be the states of $\mathcal{M}$, where $p_{1}$ is the initial state of $\mathcal{M}$;

  • $\Gamma=\{\alpha_{1},\ldots,\alpha_{\ell}\}$ be the output alphabet of $\mathcal{T}$.

We denote by $\Sigma$ the input alphabet of $\mathcal{T}$.

The desired formula for $\mathcal{L}$ is of the form:

\[
\exists X_{q_{1}}\cdots\exists X_{q_{n}}\ \exists X_{\alpha_{1}}\cdots\exists X_{\alpha_{\ell}}\ \exists X_{p_{1}}\cdots\exists X_{p_{s}}\ \Psi_{\mathcal{T}}\wedge\Psi_{\mathcal{M}}\wedge\Psi_{\Gamma_{0}}
\]

where

  • the unary predicates $X_{q_{1}},\ldots,X_{q_{n}}$, $X_{\alpha_{1}},\ldots,X_{\alpha_{\ell}}$ and $X_{p_{1}},\ldots,X_{p_{s}}$ are intended to represent the states of $\mathcal{T}$, the output alphabet of $\mathcal{T}$, and the states of $\mathcal{M}$, respectively;

  • the formula $\Psi_{\mathcal{T}}$ expresses the behaviour of the transducer $\mathcal{T}$: a tree $t$ satisfies $\Psi_{\mathcal{T}}$, with $X_{q_{i}}(u)$ and $X_{\alpha_{j}}(u)$ holding for a node $u\in\textsf{Dom}(t)$, if there exists an accepting run of $\mathcal{T}$ on $t$ in which the node $u$ is labelled with the state $q_{i}$ and the output symbol $\alpha_{j}$;

  • the formula $\Psi_{\mathcal{M}}$ expresses the behaviour of the automaton $\mathcal{M}$;

  • the formula $\Psi_{\Gamma_{0}}$ expresses the property that for every $\alpha_{i}\in\Gamma_{0}$, all the nodes belonging to $X_{\alpha_{i}}$ carry pairwise different data values, namely

    $\bigwedge_{\alpha\in\Gamma_{0}}\forall x\forall y\,(X_{\alpha}(x)\wedge X_{\alpha}(y)\wedge x\sim y\to x=y).$

The construction of the formula $\Psi_{\mathcal{T}}$ is standard and thus omitted. We show the construction of the formula $\Psi_{\mathcal{M}}$. Let $\Phi_{[S]}(x)$ denote the following formula

\[
\bigvee_{\alpha_{i}\in S}X_{\alpha_{i}}(x)\ \wedge\ \bigwedge_{\alpha_{i}\in S}\exists y\,(X_{\alpha_{i}}(y)\wedge x\sim y)\ \wedge\ \bigwedge_{\alpha_{j}\notin S}\forall y\,(X_{\alpha_{j}}(y)\to x\nsim y),
\]

which states that the data value on the node $x$ belongs to $[S]$, that is, the symbols labelling the nodes that carry this data value are exactly those in $S$. The formula $\Psi_{\mathcal{M}}$ expresses the following properties.

  • Every node that contains the minimal data value belongs to $X_{p_{1}}$. Formally, this can be written as follows.

    $\forall x\,(\forall y\,(x\prec y\vee x\sim y)\to X_{p_{1}}(x))$

  • The transitions $\mu$ of $\mathcal{M}$ must be “respected.” Formally, this can be written as follows.

    $\bigwedge_{(p_{i},S,p_{j})\in\mu}\Big(\forall x\forall y\,(X_{p_{i}}(x)\wedge\Phi_{[S]}(x)\wedge x\prec_{suc}y\to X_{p_{j}}(y))\Big),$

    where $x\prec_{suc}y$ stands for $x\prec y\wedge\forall z\,(\neg(x\prec z\wedge z\prec y))$.

  • Every node that contains the maximal data value belongs to one of the final states of $\mathcal{M}$; the set of final states is denoted by $F$. Formally, this can be written as follows.

    $\forall x\,(\forall y\,(y\prec x\vee y\sim x)\to\bigvee_{p_{i}\in F}X_{p_{i}}(x)).$

That the construction takes polynomial time is straightforward. This completes our proof of Theorem 5.9.

5.3 Proof of Theorem 5.10

The proof of Theorem 5.10 consists of two main steps.

  1. (1)

    We prove that for each ODTA $\mathcal{S}$, if $\mathcal{L}_{data}(\mathcal{S})\neq\emptyset$, then $\mathcal{L}_{data}(\mathcal{S})$ contains a data tree with the “small model property” (Lemma 5.21).

  2. (2)

    We describe a procedure that, given an ODTA $\mathcal{S}$, checks whether $\mathcal{L}_{data}(\mathcal{S})$ contains a data tree with the “small model property,” by converting the ODTA $\mathcal{S}$ into an APC $(\mathcal{A},\xi)$. Since the non-emptiness of APC is decidable, Theorem 5.10 follows immediately.

The first step (Lemma 5.21) is adapted from the proof of [Bojanczyk et al. (2009), Proposition 3.10]. It is in the second step that our proof differs from [Bojanczyk et al. (2009), Proposition 3.10]. The decision procedure in [Bojanczyk et al. (2009)] relies on an intricate counting argument over the so-called dog and sheep symbols (see [Bojanczyk et al. (2009), page 36]), and it seems that it cannot be generalised to the case of ODTA. In contrast, our decision procedure relies mainly on Proposition 2.3, Lemma 4.1 and counting the cardinality of each $[S]$.

We need some terminology. A set of nodes in a data tree $t$ is called connected, if it is connected in the graph induced by $E_{\downarrow}$ and $E_{\rightarrow}$. A zone in a data tree $t$ is a maximal connected set of nodes with the same data value. The outdegree of a zone $Z$ is the number of different zones to which there is an edge (either $E_{\downarrow}$ or $E_{\rightarrow}$) from $Z$.
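To make the notions of zone and outdegree concrete, the following Python sketch computes both for a small tree. The encoding of a data tree as three dictionaries (`parent`, `next_sibling`, `data`) and the function name `zones_and_outdegrees` are assumptions made for this illustration only; they are not part of the formal development.

```python
from collections import defaultdict

def zones_and_outdegrees(parent, next_sibling, data):
    """parent[v]: parent of v (None for the root); next_sibling[v]: right
    sibling of v (None if there is none); data[v]: data value of v."""
    # Directed edges: E_down (parent -> child) and E_right (node -> next sibling).
    edges = [(p, v) for v, p in parent.items() if p is not None]
    edges += [(v, s) for v, s in next_sibling.items() if s is not None]

    # Union-find: merge the endpoints of every edge whose data values agree.
    root = {v: v for v in data}

    def find(v):
        while root[v] != v:
            root[v] = root[root[v]]
            v = root[v]
        return v

    for u, v in edges:
        if data[u] == data[v]:
            root[find(u)] = find(v)

    zones = defaultdict(set)
    for v in data:
        zones[find(v)].add(v)

    # Outdegree: number of *different* zones reached by an edge leaving the zone.
    targets = defaultdict(set)
    for u, v in edges:
        zu, zv = find(u), find(v)
        if zu != zv:
            targets[zu].add(zv)
    return [(sorted(nodes), len(targets[z])) for z, nodes in zones.items()]

# Root r with children a, b, c (left to right); b carries a different data value.
parent = {"r": None, "a": "r", "b": "r", "c": "r"}
next_sibling = {"r": None, "a": "b", "b": "c", "c": None}
data = {"r": 1, "a": 1, "b": 2, "c": 1}
result = sorted(zones_and_outdegrees(parent, next_sibling, data))
```

In this example the nodes `r`, `a` and `c` form one zone even though `a` and `c` are not adjacent siblings, because both are connected to `r` by an $E_{\downarrow}$ edge.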

Let $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$ be an ODTA, where $\mathcal{T}$ is a transducer from $\Sigma$ to $\Gamma$. Let $Q$ be the set of states of $\mathcal{T}$. For a tree $t\in\mathcal{L}_{data}(\mathcal{S})$, its extended tree $\tilde{t}$ (with respect to the ODTA $\mathcal{S}$) is a tree over the alphabet $\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$, where

  • the projection of $\tilde{t}$ to $\Sigma\times\{\top,\bot,*\}^{3}$ is $\textsf{Profile}(t)$;

  • the projection of $\tilde{t}$ to $Q$ is an accepting run of $\mathcal{T}$ on $t$;

  • the projection of $\tilde{t}$ to $\Gamma$ is an output of $\mathcal{T}$ on $t$.

The following lemma is simply an adaptation of [Bojanczyk et al. (2009), Proposition 3.10] to the case of ODTA. The proof is via cut-and-paste: given an ordered-data tree $t$ over the alphabet $\Sigma$ that has “many” zones with “large” outdegree, we can cut some nodes in $t$ and paste them in another part of $t$ without affecting the sets $V_{t}(a)$, for each $a\in\Sigma$. The aim of such cut-and-paste is to reduce the number of zones in $t$ with large outdegree. We give the formal statement below.

Lemma 5.21.

[Compare [Bojanczyk et al. (2009), Proposition 3.10]] For every ODTA $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$ over the alphabet $\Sigma$, if $\mathcal{L}_{data}(\mathcal{S})\neq\emptyset$, then there exists a data tree $t\in\mathcal{L}_{data}(\mathcal{S})$ in which there are at most $K^{O(K^{2})}$ zones with outdegree $\geq K^{(K^{3})}$, where $K=O(|\Sigma|\cdot|Q|\cdot|\Gamma|)$, $Q$ is the set of states of $\mathcal{T}$, and $\Gamma$ is the output alphabet of $\mathcal{T}$.

Proof 5.22.

Let $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$ be an ODTA over the alphabet $\Sigma$, let $Q$ be the set of states of $\mathcal{T}$ and $\Gamma$ the output alphabet of $\mathcal{T}$. Suppose that $t_{0}\in\mathcal{L}_{data}(\mathcal{S})$. We will work on the extended tree $\tilde{t}_{0}$ of $t_{0}$. The aim is to convert $\tilde{t}_{0}$ into another tree $\tilde{t}$ over the alphabet $\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$ such that

  1. 1.

    the number of zones in $\tilde{t}$ with outdegree $\geq K^{(K^{3})}$ is bounded by $K^{O(K^{2})}$,

  2. 2.

    the $\{\top,\bot,*\}^{3}$ projection of $\tilde{t}$ gives the profile of each node,

  3. 3.

    the $Q$ projection of $\tilde{t}$ is an accepting run of $\mathcal{T}$ on the $\Sigma\times\{\top,\bot,*\}^{3}$ projection of $\tilde{t}$, and the output is its $\Gamma$ projection,

  4. 4.

    for each $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$, the set of data values found in the $(a,(l,p,r),q,b)$-nodes in $\tilde{t}_{0}$ is the same as the set of those found in the $(a,(l,p,r),q,b)$-nodes in $\tilde{t}$,

  5. 5.

    the $\Sigma$ projection of $\tilde{t}$ is accepted by $\mathcal{S}$.

Intuitively, the tree $\tilde{t}$ is obtained via repeated applications of a “pumping lemma” in both the $E_{\downarrow}$- and $E_{\rightarrow}$-directions in the tree $t$.

Below we give a brief summary of the proof, adapted from the proof of [Bojanczyk et al. (2009), Proposition 3.10]. We need the following terminology, all of which is from [Bojanczyk et al. (2009)].

  • Two nodes in a tree are called siblings, if they have the same parent node.

  • The set of all children of a node is called a sibling group.

  • A contiguous sequence of siblings is called an interval.
    We write $[u,v]$ for an interval in which $u$ and $v$ are the left-most and right-most nodes, respectively, in the interval.

  • An interval $[u,v]$ is complete, if the following holds.

    • If a node $u^{\prime}$ exists such that $E_{\rightarrow}(u^{\prime},u)$, then $u^{\prime}\nsim u$.

    • If a node $v^{\prime}$ exists such that $E_{\rightarrow}(v,v^{\prime})$, then $v^{\prime}\nsim v$.

  • An interval is pure, if all of its nodes have the same data value.

  • A pure interval with the data value $d$ is called a $d$-pure interval.

  • If the parent of an interval (or a sibling group) has data value $d$, then it is called a $d$-parent interval (or a $d$-parent sibling group).

  • A zone with the data value $d$ is called a $d$-zone.

The construction of $\tilde{t}$ from $\tilde{t}_{0}$ is as follows.

  1. 1.

    Convert $\tilde{t}_{0}$ to another tree $\tilde{t}_{1}$ such that

    • for every data value $d\in V_{\tilde{t}_{1}}$ there are at most $O(K)$ complete $d$-pure intervals of size more than $O(K)$;

    • $V_{\tilde{t}_{1}}(a,(l,p,r),q,b)=V_{\tilde{t}_{0}}(a,(l,p,r),q,b)$, for every $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$;

    • $\tilde{t}_{1}$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$.

    This step is adapted from [Bojanczyk et al. (2009), Proposition 3.12]. The idea is to cut an interval (together with its subtrees) and paste it in another interval; while doing so, the data values in the interval remain untouched.

  2. 2.

    Convert $\tilde{t}_{1}$ to another tree $\tilde{t}_{2}$ such that

    • for every data value $d\in V_{\tilde{t}_{2}}$ there are at most $O(K)$ $d$-parent sibling groups with more than $K^{O(K)}$ complete pure intervals;

    • $V_{\tilde{t}_{2}}(a,(l,p,r),q,b)=V_{\tilde{t}_{1}}(a,(l,p,r),q,b)$, for every $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$;

    • $\tilde{t}_{2}$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$.

    This step is adapted from [Bojanczyk et al. (2009), Proposition 3.14]. Again, when the cut-and-paste is performed, the data values in the sibling groups remain untouched.

  3. 3.

    Convert $\tilde{t}_{2}$ to another tree $\tilde{t}_{3}$ such that

    • for every data value $d\in V_{\tilde{t}_{3}}$ there are at most $O(K)$ $d$-zones containing a path with more than $O(K)$ nodes;

    • $V_{\tilde{t}_{3}}(a,(l,p,r),q,b)=V_{\tilde{t}_{2}}(a,(l,p,r),q,b)$, for every $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$;

    • $\tilde{t}_{3}$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$.

    This step is adapted from [Bojanczyk et al. (2009), Proposition 3.17]. Again, when the cut-and-paste is performed, the data values in the zones remain untouched.

  4. 4.

    Convert $\tilde{t}_{3}$ to another tree $\tilde{t}_{4}$ such that

    • there are at most $K^{O(K^{2})}$ complete pure intervals with more than $O(K^{2})$ nodes;

    • $V_{\tilde{t}_{3}}(a,(l,p,r),q,b)=V_{\tilde{t}_{4}}(a,(l,p,r),q,b)$, for every $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$;

    • $\tilde{t}_{4}$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$.

    This step is adapted from [Bojanczyk et al. (2009), Proposition 3.20]. Here, when the cut-and-paste is performed, the data values in some zones do have to be changed. However, those changes are only applied to the safe zones, where a zone is safe if for every node in it there is another node outside the zone with the same label (from $\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$) and the same data value. (See [Bojanczyk et al. (2009), page 23, last paragraph].) More specifically, these changes are done by applying [Bojanczyk et al. (2009), Lemma 3.19] on the safe zones. It is important that the changes are applied only to safe zones, so that after changing the data values, constraints such as $\forall x\exists y\,(a(x)\to x\sim y\wedge b(y))$ are still satisfied.

  5. 5.

    Convert $\tilde{t}_{4}$ to another tree $\tilde{t}_{5}$ such that

    • there are at most $K^{O(K^{2})}$ sibling groups containing more than $K^{O(K)}$ complete pure intervals;

    • $V_{\tilde{t}_{4}}(a,(l,p,r),q,b)=V_{\tilde{t}_{5}}(a,(l,p,r),q,b)$, for every $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$;

    • $\tilde{t}_{5}$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$.

    This step is adapted from [Bojanczyk et al. (2009), Proposition 3.21]. Here there are also changes of data values when performing cut-and-paste. However, as in the previous step, they are only applied to the safe zones. These changes are also done by applying [Bojanczyk et al. (2009), Lemma 3.19] on the safe zones.

  6. 6.

    Convert $\tilde{t}_{5}$ to another tree $\tilde{t}_{6}$ such that

    • there are at most $K^{O(K^{2})}$ zones containing paths with more than $O(K^{2})$ nodes;

    • $V_{\tilde{t}_{5}}(a,(l,p,r),q,b)=V_{\tilde{t}_{6}}(a,(l,p,r),q,b)$, for every $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$;

    • $\tilde{t}_{6}$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$.

    This step is adapted from [Bojanczyk et al. (2009), Proposition 3.25]. Here there are also changes of data values when performing cut-and-paste. However, as in the previous step, they are only applied to the safe zones. More specifically, these changes are done by applying [Bojanczyk et al. (2009), Lemma 3.24] on the safe zones.

The extended tree $\tilde{t}_{6}$ is the desired extended tree. A straightforward computation shows that there are at most $K^{O(K^{2})}$ zones in $\tilde{t}_{6}$ with outdegree $\geq K^{(K^{3})}$.

To describe the decision procedure for Theorem 5.10, we need some additional terminology. For a data tree $t$ over the alphabet $\Gamma$, and $S\subseteq\Gamma$, an $S$-zone is a zone in which the labels of the nodes are precisely $S$. We write $V^{zone}_{t}(S)$ to denote the set of data values found in $S$-zones in $t$. For $P\subseteq 2^{\Gamma}$,

\[
[P]^{zone}_{t}=\bigcap_{S\in P}V^{zone}_{t}(S)\ \cap\ \bigcap_{R\notin P}\overline{V^{zone}_{t}(R)}.
\]

Suppose $d_{1}<\cdots<d_{m}$ are all the data values in $t$. The zonal string representation of the data values in $t$, denoted by $\mathcal{V}^{zone}_{\Gamma}(t)$, is the string $P_{1}\cdots P_{m}$ over the alphabet $2^{2^{\Gamma}}$ such that for each $i\in\{1,\ldots,m\}$, $d_{i}\in[P_{i}]^{zone}_{t}$.
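The zonal string representation can be computed directly from the list of zones, since a data value $d$ lies in $[P]^{zone}_{t}$ exactly when $P$ is the set of all $S$ with $d\in V^{zone}_{t}(S)$. The Python sketch below illustrates this; the input format (one `(labels, value)` pair per zone) and the name `zonal_string` are assumptions made for this illustration.

```python
def zonal_string(zones):
    """zones: iterable of (labels, value) pairs, one per zone, where `labels`
    is the set of symbols occurring in the zone and `value` its data value.
    Returns the letters P_1, ..., P_m in increasing order of data value."""
    by_value = {}
    for labels, value in zones:
        # Record that `value` belongs to V^zone_t(labels).
        by_value.setdefault(value, set()).add(frozenset(labels))
    # The letter of d is P = {S : d in V^zone_t(S)}, so that d is in [P]^zone_t.
    return [frozenset(by_value[d]) for d in sorted(by_value)]

# Data value 1 occurs in an {a}-zone and an {a,b}-zone; data value 2 in a {b}-zone.
example = zonal_string([({"a"}, 1), ({"a", "b"}, 1), ({"b"}, 2)])
```

Each letter is a set of label sets, which is why the alphabet of the zonal string is $2^{2^{\Gamma}}$ rather than $2^{\Gamma}$.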

A zonal ODTA is $\mathcal{S}^{\prime}=\langle\mathcal{T},\mathcal{M}^{\prime},\Gamma_{0}\rangle$, where $\mathcal{T}$ and $\Gamma_{0}$ are as in the definition of ODTA, and $\mathcal{M}^{\prime}$ is a finite state automaton over the alphabet $2^{2^{\Gamma}}$. A data tree $t$ is accepted by the zonal ODTA $\mathcal{S}^{\prime}$, if the following holds.

  • $\textsf{Profile}(t)$ is accepted by $\mathcal{T}$, yielding an output tree $t^{\prime}$ over the alphabet $\Gamma$.

  • The string $\mathcal{V}^{zone}_{\Gamma}(t^{\prime})$ is accepted by $\mathcal{M}^{\prime}$.

  • For each $a\in\Gamma_{0}$, all the data values found in the $a$-nodes in $t^{\prime}$ are different.

Proposition 5.23.

For every ODTA $\mathcal{S}$, one can construct in ExpTime its equivalent zonal ODTA.

Proof 5.24.

Let $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$ and $\mathcal{M}=\langle Q,q_{0},\delta,F\rangle$. Its equivalent zonal ODTA is defined as $\mathcal{S}^{\prime}=\langle\mathcal{T},\mathcal{M}^{\prime},\Gamma_{0}\rangle$, where $\mathcal{M}^{\prime}=\langle Q,q_{0},\delta^{\prime},F\rangle$ and $\delta^{\prime}=\{(q,P,q^{\prime})\in Q\times 2^{2^{\Gamma}}\times Q\mid\exists(q,S,q^{\prime})\in\delta\ \mbox{such that}\ \bigcup_{R\in P}R=S\}$. It is straightforward to show that $\mathcal{L}_{data}(\mathcal{S}^{\prime})=\mathcal{L}_{data}(\mathcal{S})$.

Note that the only difference between $\mathcal{S}$ and $\mathcal{S}^{\prime}$ is the transitions $\delta$ and $\delta^{\prime}$ in $\mathcal{M}$ and $\mathcal{M}^{\prime}$, respectively. The membership $(q,P,q^{\prime})\in\delta^{\prime}$ can be checked in polynomial time in the size of $(q,P,q^{\prime})$ and $\delta$. Since there are exponentially many $(q,P,q^{\prime})$, the exponential time upper bound follows immediately. This completes the proof of Proposition 5.23.
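The construction of $\delta^{\prime}$ can be sketched in a few lines of Python. This is only an illustration under assumed encodings: transitions are triples `(q, S, q2)` with `S` a `frozenset`, and the helper names `subsets` and `zonal_transitions` are hypothetical. The sketch enumerates every $P\subseteq 2^{\Gamma}$ literally, so it is usable only for a very small `gamma`.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items`, each returned as a frozenset."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def zonal_transitions(delta, gamma):
    """delta: set of triples (q, S, q2), S a frozenset over gamma.
    Returns {(q, P, q2) : (q, S, q2) in delta and the union of P equals S}."""
    letters = subsets(subsets(gamma))   # every P that is a subset of 2^gamma
    delta_prime = set()
    for P in letters:
        union = frozenset().union(*P)   # union of all R in P (empty if P is empty)
        for (q, S, q2) in delta:
            if S == union:
                delta_prime.add((q, P, q2))
    return delta_prime

delta = {("q0", frozenset({"a", "b"}), "q1")}
delta_prime = zonal_transitions(delta, {"a", "b"})
```

For instance, the single transition on $\{a,b\}$ above gives rise, among others, to a transition on the letter $\{\{a\},\{b\}\}$, which corresponds to a data value that occurs in an $\{a\}$-zone and in a $\{b\}$-zone but in no other kind of zone.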

Briefly, our decision procedure for Theorem 5.10 works as follows. Let $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$ be the given ODTA, where $\Sigma$ is the input alphabet of $\mathcal{T}$, $\Gamma$ the output alphabet, and $Q$ the set of states of $\mathcal{T}$. Let $K=27\cdot|\Sigma|\cdot|Q|\cdot|\Gamma|$. The decision procedure constructs an APC $(\mathcal{A},\xi)$ such that $\mathcal{S}$ accepts an ordered-data tree $t$ in which there are at most $K^{O(K^{2})}$ zones with outdegree $\geq K^{(K^{3})}$ if and only if $(\mathcal{A},\xi)$ accepts an extended tree of $t$ w.r.t. $\mathcal{S}$.
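To give a feeling for the magnitudes involved, the following Python sketch evaluates $K$, the outdegree threshold $K^{(K^{3})}$, and the quantity $2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{K^{3}}+1$ that the procedure uses as a cardinality bound when guessing constants. The factor 27 is the size of $\{\top,\bot,*\}^{3}$, so $K$ is the size of the alphabet of extended trees. The function name `thresholds` is an assumption made for this illustration.

```python
def thresholds(sigma_size, q_size, gamma_size):
    """Return (K, K^(K^3), 2*K^(K^3)*2^K + 2*K^(K^3) + 1) as exact integers."""
    K = 27 * sigma_size * q_size * gamma_size   # 27 = |{top, bot, *}^3|
    big_outdegree = K ** (K ** 3)               # threshold for a "large" outdegree
    guess_bound = 2 * big_outdegree * 2 ** K + 2 * big_outdegree + 1
    return K, big_outdegree, guess_bound

# Smallest possible instance: one input symbol, one state, one output symbol.
K, big_outdegree, guess_bound = thresholds(1, 1, 1)
bits = big_outdegree.bit_length()               # size of K^(K^3) in bits
```

Even in this smallest instance, $K^{(K^{3})}=27^{19683}$ is a number with tens of thousands of bits, which illustrates why the bounds are handled symbolically in the complexity analysis rather than enumerated.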

Its precise description is given as follows.

  1. 1.

    Compute $K=27\cdot|\Sigma|\cdot|Q|\cdot|\Gamma|$.

  2. 2.

    Convert $\mathcal{S}$ into its zonal ODTA $\mathcal{S}^{\prime}=\langle\mathcal{T},\mathcal{M}^{\prime},\Gamma_{0}\rangle$.

  3. 3.

    Guess the following items.

    1. (a)

      A set $\mathcal{P}\subseteq 2^{2^{\Gamma}}$.

    2. (b)

      For each $P\in\mathcal{P}$, guess an integer $M_{P}\leq 2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{K^{3}}+1$ and a set of $M_{P}$ constants $\mathcal{C}_{P}=\{c_{1},\ldots,c_{M_{P}}\}$.

      The purpose of the number $2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{K^{3}}$ is the application of Lemma 4.1 later on, where we consider the graph whose nodes are the zones. Each zone is labeled with a symbol from $2^{\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma}$, which is of size $2^{K}$. If a zone has outdegree $\leq K^{(K^{3})}$, then it has at most $K^{(K^{3})}$ nodes, which means that its degree (the sum of indegree and outdegree) is bounded by $2\cdot K^{K^{3}}$. Now $\mathcal{P}$ is intended to contain all those $P$’s for which $|[P]^{zone}_{t}|\leq 2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{K^{3}}+1$, so that we can “guess” some constants as elements of $[P]^{zone}_{t}$ and ensure by the automaton that the same constant is not “assigned” to adjacent zones. For $P$ not in $\mathcal{P}$, we can apply Lemma 4.1 to ensure that the same data value from $[P]^{zone}_{t}$ is not assigned to adjacent zones.

    3. (c)

      Two integers $N,N^{\prime}$ such that $N^{\prime}\leq N\leq K^{O(K^{2})}$ and a set of $N^{\prime}$ constants $\mathcal{D}=\{d_{1},\ldots,d_{N^{\prime}}\}$.
      The intended meaning of $N$ and $N^{\prime}$ is the number of zones with outdegree $\geq K^{(K^{3})}$ and the number of data values found in them, respectively. We also remark that the constants in $\mathcal{D}$ may overlap with the constants in some $\mathcal{C}_{P}$.

    4. (d)

      For each $d\in\mathcal{D}$, guess a set $P_{d}\subseteq 2^{\Gamma}$.

  4. 4.

    Construct the following automaton $\mathcal{A}$ over the alphabet $\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$.

    1. (a)

      $\mathcal{A}$ accepts only the extended trees of $\mathcal{L}(\mathcal{T})$ in which there are at most $N$ zones with outdegree $\geq K^{(K^{3})}$.

    2. (b)

      The automaton $\mathcal{A}$ can remember the constants in its states.

    3. (c)

      For every $P\in\mathcal{P}$ and every $c\in\mathcal{C}_{P}$, the automaton $\mathcal{A}$ “assigns” the constant $c$ in an $S$-zone, for every $S\in P$, but not in any $R$-zone, for every $R\notin P$.

    4. (d)

      The automaton $\mathcal{A}$ “assigns” to every zone with outdegree $\geq K^{(K^{3})}$ a constant from $\mathcal{D}$.

    5. (e)

      For every $d\in\mathcal{D}$, the automaton $\mathcal{A}$ “assigns” the constant $d$ in an $S$-zone, for every $S\in P_{d}$, but in no $R$-zone, for every $R\notin P_{d}$.

    6. (f)

      For each $a\in\Gamma_{0}$, there is at most one $a$-node in every zone, and for every two zones that contain $a$-nodes, if they are assigned constants from the $\mathcal{C}_{P}$’s and $\mathcal{D}$, then these constants must be different.

    7. (g)

      For every two adjacent zones, if they are assigned constants from the $\mathcal{C}_{P}$’s and $\mathcal{D}$, then these constants must be different.

    The automaton $\mathcal{A}$ “assigns” a constant to a zone by remembering the constant in its state while $\mathcal{A}$ is reading the zone.

  5. 5.

    Let $P_{1},\ldots,P_{m}$ be the enumeration of non-empty subsets of $2^{\Gamma}$.
    Applying Proposition 2.3, convert the automaton $\mathcal{M}^{\prime}$ into its Presburger formula $\xi_{\mathcal{M}^{\prime}}(z_{P_{1}},\ldots,z_{P_{m}})$, where the intended meaning of each $z_{P_{i}}$ is the number of appearances of the label $P_{i}$.

  6. 6.

    Let $\Gamma=\{a_{1},\ldots,a_{\ell}\}$ and $S_{1},\ldots,S_{k}$ be the enumeration of non-empty subsets of $\Gamma$. Define the formula $\xi(x_{a_{1}},\ldots,x_{a_{\ell}},x_{S_{1}},\ldots,x_{S_{k}})$:

    \begin{align}
    \exists z_{P_{1}}\cdots\exists z_{P_{m}}\ & \xi_{\mathcal{M}^{\prime}}(z_{P_{1}},\ldots,z_{P_{m}}) \tag{14}\\
    & \wedge\ \bigwedge_{P_{i}\in\mathcal{P}}z_{P_{i}}=M_{P_{i}} \nonumber\\
    & \wedge\ \bigwedge_{P_{i}\notin\mathcal{P}}z_{P_{i}}\geq 2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{(K^{3})}+1 \nonumber\\
    & \wedge\ \bigwedge_{S\subseteq\Gamma}\Big(x_{S}\ \geq\ \sum_{P_{i}\ni S\ \textrm{and}\ P_{i}\notin\mathcal{P}}z_{P_{i}}\Big) \nonumber\\
    & \wedge\ \bigwedge_{a\in\Gamma_{0}}\Big(x_{a}\ =\ \sum_{\substack{\textrm{there exists}\ S\ \textrm{such that}\\ a\in S\ \textrm{and}\ S\in P_{i}\ \textrm{and}\ P_{i}\notin\mathcal{P}}}z_{P_{i}}\Big) \nonumber\\
    & \wedge\ \bigwedge_{P_{i}\notin\mathcal{P}}z_{P_{i}}\geq|\{d\in\mathcal{D}\mid P_{d}=P_{i}\}| \nonumber
    \end{align}

    The meaning of $x_{a}$ is the number of $a$-nodes occurring in the zones not assigned any constants from the $\mathcal{C}_{P}$’s and $\mathcal{D}$; and $x_{S}$ is the number of $S$-zones not assigned any constants from the $\mathcal{C}_{P}$’s and $\mathcal{D}$. The intuition behind items (2)–(6) is rather clear. The intuition behind item (7) is as follows. Recall that in Step (3), for each $d\in\mathcal{D}$, we guess a set $P_{d}$. The meaning is that $d\in[P_{d}]^{zone}_{t}$ for some $t\in\mathcal{L}_{data}(\mathcal{S})$. So for every $P_{i}\notin\mathcal{P}$, the number of $d$ such that $P_{d}=P_{i}$ should not exceed $z_{P_{i}}$. This is precisely what is stated in item (7).

  7. 7.

    Test the non-emptiness of the APC (𝒜,ξ)(\mbox{$\mathcal{A}$},\xi).
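The arithmetic conjuncts of $\xi$ other than $\xi_{\mathcal{M}^{\prime}}$ can be read as plain checks on candidate values for the $z_{P_{i}}$’s, $x_{S}$’s and $x_{a}$’s. The Python sketch below spells these checks out for a given assignment. It is an illustration only: the function name, the dictionary encodings, and the parameter `threshold` (a stand-in for $2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{K^{3}}+1$, set to a small number in the example) are assumptions, and the sketch neither checks $\xi_{\mathcal{M}^{\prime}}$ nor searches for a witness.

```python
def side_conditions_hold(z, x_zone, x_label, cal_P, M, P_of_d, gamma0, threshold):
    """z: dict letter P -> z_P (P is a frozenset of frozensets);
    x_zone: dict label set S -> x_S; x_label: dict symbol a -> x_a;
    cal_P: guessed set of letters; M: dict P -> M_P for P in cal_P;
    P_of_d: dict constant d -> guessed P_d; gamma0: the set Gamma_0."""
    free = [P for P in z if P not in cal_P]
    # z_P = M_P for the guessed letters.
    if any(z[P] != M[P] for P in z if P in cal_P):
        return False
    # z_P >= threshold for all other letters.
    if any(z[P] < threshold for P in free):
        return False
    # x_S >= sum of z_P over the free letters P that contain S.
    for S in {S for P in free for S in P}:
        if x_zone.get(S, 0) < sum(z[P] for P in free if S in P):
            return False
    # x_a = sum of z_P over the free letters P having some S with a in S.
    for a in gamma0:
        if x_label.get(a, 0) != sum(z[P] for P in free if any(a in S for S in P)):
            return False
    # z_P is at least the number of constants d in D with P_d = P.
    guessed = list(P_of_d.values())
    return all(z[P] >= guessed.count(P) for P in free)

A, B = frozenset({"a"}), frozenset({"b"})
P1, P2 = frozenset({A}), frozenset({A, B})
ok = side_conditions_hold(
    z={P1: 1, P2: 3}, x_zone={A: 3, B: 4}, x_label={"a": 3},
    cal_P={P1}, M={P1: 1}, P_of_d={"d1": P2, "d2": P2},
    gamma0={"a"}, threshold=2)
```

In the example, `P1` is a guessed letter with $M_{P_{1}}=1$, while `P2` is treated as a letter outside $\mathcal{P}$ and therefore contributes to the sums for $x_{S}$ and $x_{a}$.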

Before we proceed to prove its correctness, we first present the analysis of its complexity.

  • Step (1) is trivial and Step (2) takes exponential time.

  • Step (3) takes non-deterministic exponential time in the size of $\mathcal{S}$. The analysis is as follows. Step (3.a) takes non-deterministic exponential time in the size of $2^{\Gamma}$, which is bounded by the size of $\mathcal{M}$ in $\mathcal{S}$. (Recall that the alphabet of $\mathcal{M}$ is $2^{\Gamma}$.) Step (3.b) can guess up to exponentially many constants in each $\mathcal{C}_{P}$, and there are exponentially many different $\mathcal{C}_{P}$, hence it takes double exponential time in the size of $2^{\Gamma}$. Steps (3.c) and (3.d) take non-deterministic exponential time.

  • Step (4) takes deterministic triple exponential time and can produce the automaton $\mathcal{A}$ of size up to triple exponential. The analysis is as follows. The automaton $\mathcal{A}$ has to remember in its states the outdegree of each zone up to $K^{(K^{3})}$ and the number of zones with outdegree $\geq K^{(K^{3})}$. This induces an exponential blow-up in the size of $\mathcal{T}$.

    The number of constants guessed in Step (3) is double exponential in the size of $\mathcal{T}$. Then $\mathcal{A}$ has to remember in its states which constant is assigned to which zone (of outdegree $\geq K^{(K^{3})}$), which induces another exponential blow-up. Altogether the size of $\mathcal{A}$ can be triple exponential in the size of $\mathcal{T}$.

  • By Proposition 2.3, Step (5) takes polynomial time in the size of $\mathcal{M}^{\prime}$, which is of size exponential in the size of the original $\mathcal{M}$.

  • The length of the formula in Step (6) is double exponential in the size of $\mathcal{S}$, since the number of constants in $\mathcal{D}$ can be double exponential in the size of $2^{\Gamma}$, and hence of $\mathcal{S}$.

  • Step (7) takes non-deterministic polynomial time in the size of $(\mathcal{A},\xi)$, and hence non-deterministic triple exponential time in the size of the input $\mathcal{S}$.

The following claim immediately implies the correctness of our algorithm.

Claim 2.
  1. 1.

    For every ordered-data tree $t\in\mathcal{L}_{data}(\mathcal{S})$ in which there are at most $K^{O(K^{2})}$ zones with outdegree $\geq K^{(K^{3})}$, there exists an extended tree of $t$ which is accepted by the APC $(\mathcal{A},\xi)$.

  2. 2.

    For every $t^{\prime}\in\mathcal{L}(\mathcal{A},\xi)$, there exists an ordered-data tree $t\in\mathcal{L}_{data}(\mathcal{S})$ such that $t^{\prime}$ is an extended tree of $t$ w.r.t. $\mathcal{S}$.

Proof 5.25.

We prove (1) first. Let $t\in\mathcal{L}_{data}(\mathcal{S})$ be an ordered-data tree in which there are at most $K^{O(K^{2})}$ zones with outdegree $\geq K^{(K^{3})}$. Let $t_{0}$ be the output of $\mathcal{T}$ on $t$ so that $\mathcal{V}^{zone}(t_{0})$ is accepted by $\mathcal{M}^{\prime}$ and all nodes in $t_{0}$ labelled with a symbol in $\Gamma_{0}$ have different data values.

We take the following items as the guesses in Step 3 of our algorithm above.

  • $\mathcal{P}=\{P\mid|[P]^{zone}_{t}|\leq 2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{(K^{3})}+1\}$.

  • For each $P\in\mathcal{P}$, $\mathcal{C}_{P}=[P]^{zone}_{t_{0}}$, and $M_{P}=|\mathcal{C}_{P}|$.

  • $N$ is the number of zones in $t$ with outdegree $\geq K^{(K^{3})}$ and $N^{\prime}$ is the number of data values found in these zones.

  • $\mathcal{D}=\{d\mid d\ \mbox{is found in a zone with outdegree}\ \geq K^{(K^{3})}\}$.

  • For each $d\in\mathcal{D}$, $P_{d}$ is the set such that $d\in[P_{d}]^{zone}_{t_{0}}$.

Now let $t^{\prime}$ be an extended tree of $t$ with respect to $\mathcal{S}$, and let $\mathcal{A}$ and $\xi$ be the automaton and formula as constructed in Steps 4–6 above. We are going to show that $t^{\prime}\in\mathcal{L}(\mathcal{A},\xi)$. Obviously, $t^{\prime}\in\mathcal{L}(\mathcal{A})$. To show that the formula $\xi$ is satisfied, we take $\textsf{Parikh}(\mathcal{V}^{zone}(t_{0}))$ as witness to $(z_{P_{1}},\ldots,z_{P_{m}})$. Since $\mathcal{V}^{zone}(t_{0})\in\mathcal{L}(\mathcal{M}^{\prime})$, by Proposition 2.3, the formula $\xi_{\mathcal{M}^{\prime}}(\textsf{Parikh}(\mathcal{V}^{zone}(t_{0})))$ holds. It is straightforward from the definitions of the items $\mathcal{P}$, $M_{P}$’s, $N$, $N^{\prime}$, $\mathcal{D}$ and $P_{d}$’s that the formula $\xi$ in Step 6 is satisfied with the $x_{a}$’s and $x_{S}$’s interpreted as intended.

Now we prove (2). The proof is more delicate than the proof of (1). Suppose $t^{\prime}\in\mathcal{L}(\mathcal{A},\xi)$. We are going to construct an ordered-data tree $t$ from $t^{\prime}$ such that $t^{\prime}$ is an extended tree of $t$ w.r.t. $\mathcal{S}$. Let $\mathcal{P}$, $M_{P}$’s, $\mathcal{C}_{P}$’s, $N$, $N^{\prime}$, $\mathcal{D}$ and $P_{d}$’s be the items as guessed in Step 3 above, and

  • for each $a_{i}\in\Gamma$, let $n_{a_{i}}$ be the number of $a_{i}$-nodes in $t^{\prime}$ occurring in a zone without any constants from the $\mathcal{C}_{P}$’s and $\mathcal{D}$;

  • for each $S_{i}\subseteq\Gamma$, let $n_{S_{i}}$ be the number of $S_{i}$-zones in $t^{\prime}$ without any constants from the $\mathcal{C}_{P}$’s and $\mathcal{D}$.

Let $(k_{P_{1}},\ldots,k_{P_{m}})$ be the witness to $z_{P_{1}},\ldots,z_{P_{m}}$ such that

\[
\xi(n_{a_{1}},\ldots,n_{a_{\ell}},n_{S_{1}},\ldots,n_{S_{k}})\qquad\mbox{holds}.
\]

By Proposition 2.3, this means that there exists a word $w\in\mathcal{L}(\mathcal{M}^{\prime})$ such that $\textsf{Parikh}(w)=(k_{P_{1}},\ldots,k_{P_{m}})$. For each $P_{i}$, we let

\[
\mathcal{N}_{P_{i}}=\{j\mid\mbox{position}\ j\ \mbox{in}\ w\ \mbox{is labeled}\ P_{i}\}.
\]

We will assign a data value to each node in $t$ such that

\[
[P_{i}]^{zone}_{t}=\mathcal{N}_{P_{i}},
\]

and $\mathcal{V}^{zone}(t)=w$. The assignment is done according to the three cases below.

Case 1

For the nodes that are assigned some constants from the $\mathcal{C}_{P_{i}}$’s.
In this case $P_{i}\in\mathcal{P}$. We define bijections $f_{P_{i}}:\mathcal{C}_{P_{i}}\to\mathcal{N}_{P_{i}}$. There is always a bijection from $\mathcal{C}_{P_{i}}$ to $\mathcal{N}_{P_{i}}$ since they have the same cardinality $M_{P_{i}}$, due to the following condition in the formula $\xi$:

\[
\bigwedge_{P_{i}\in\mathcal{P}}z_{P_{i}}=M_{P_{i}}.
\]

The data value assignment to the nodes of this case can be done by replacing every constant $c\in\mathcal{C}_{P_{i}}$ with $f_{P_{i}}(c)$.

Case 2

For the nodes that are assigned some constants from $\mathcal{D}$.
We define a 1-1 mapping $f:\mathcal{D}\to\{1,\ldots,|w|\}$ such that $f(d)\in\mathcal{N}_{P_d}$, where $P_d$ is the set guessed in Step 3. Such a 1-1 mapping exists because of the following condition in the formula $\xi$:

\[ \bigwedge_{P_i\notin\mathcal{P}}z_{P_i}\geq|\{d'\in\mathcal{D}\mid P_{d'}=P_i\}|. \]

The data value assignment to the nodes of this case can be done by replacing every constant $d\in\mathcal{D}$ with $f(d)$.

Case 3

For the nodes that are not assigned any constants from the $\mathcal{C}_P$'s and $\mathcal{D}$.
First we assign to each such zone in $t$ a data value (a zone in $t$ can be recognised from the profile information in $t'$) such that for each $S\subseteq\Gamma$,

\[ V^{zone}_t(S)=\bigcup_{P_i\ni S\ \textrm{and}\ P_i\notin\mathcal{P}}\mathcal{N}_{P_i}. \]

This step can be done as follows. The number of such $S$-zones in $t$ is at least $\sum_{P_i\ni S\ \textrm{and}\ P_i\notin\mathcal{P}}|\mathcal{N}_{P_i}|$, due to the condition below in the formula $\xi$:

\[ x_S\ \geq\ \sum_{P_i\ni S\ \textrm{and}\ P_i\notin\mathcal{P}}z_{P_i}. \]

Thus, we can simply assign to every $S$-zone a data value from $\bigcup_{P_i\ni S\ \textrm{and}\ P_i\notin\mathcal{P}}\mathcal{N}_{P_i}$, making sure that every data value from $\bigcup_{P_i\ni S\ \textrm{and}\ P_i\notin\mathcal{P}}\mathcal{N}_{P_i}$ appears in some $S$-zone.
However, with such an assignment some adjacent zones may get the same data value. Here we apply Lemma 4.1. For each $P_i\notin\mathcal{P}$, we have $|\mathcal{N}_{P_i}|\geq 2\cdot K^{K^3}\cdot 2^K+2\cdot K^{(K^3)}+1$, by the condition below in the formula $\xi$:

\[ \bigwedge_{P_i\notin\mathcal{P}}z_{P_i}\geq 2\cdot K^{K^3}\cdot 2^K+2\cdot K^{(K^3)}+1. \]

Hence the cardinality satisfies

\[ \Big|\bigcup_{P_i\ni S\ \textrm{and}\ P_i\notin\mathcal{P}}\mathcal{N}_{P_i}\Big|=\sum_{P_i\ni S\ \textrm{and}\ P_i\notin\mathcal{P}}|\mathcal{N}_{P_i}|\geq 2\cdot K^{K^3}\cdot 2^K+2\cdot K^{(K^3)}+1. \]

The outdegree of such a zone is $\leq K^{(K^3)}$, hence the number of nodes in the zone is also $\leq K^{(K^3)}$. Since each node has indegree at most $1$, the degree of each such zone is $\leq 2\cdot K^{(K^3)}$. By applying Lemma 4.1 with $\deg(G)=2\cdot K^{(K^3)}$, we can reassign the data values of such zones so that adjacent zones get different data values.

This completes the proof of our Claim.

6 Weak ODTA

A weak ODTA over $\Sigma$ is a triple $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_0\rangle$, where $\mathcal{T}$ is a letter-to-letter transducer from $\Sigma$ to the output alphabet $\Gamma$, $\mathcal{M}$ is a finite state automaton over $2^\Gamma$, and $\Gamma_0\subseteq\Gamma$. An ordered-data tree $t$ is accepted by $\mathcal{S}$, denoted by $t\in\mathcal{L}_{data}(\mathcal{S})$, if there exists an ordered-data tree $t'$ over $\Gamma$ such that

  • on input $\textsf{Proj}(t)$, the transducer $\mathcal{T}$ outputs $t'$;

  • the automaton $\mathcal{M}$ accepts the string $\mathcal{V}_\Gamma(t')$; and

  • for every $a\in\Gamma_0$, all the $a$-nodes in $t'$ have different data values.
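The last two acceptance conditions depend only on the multiset of (label, data value) pairs of $t'$, not on its tree shape. The following Python sketch illustrates them under the assumption, taken from the earlier sections, that $\mathcal{V}_\Gamma(t')$ lists, for the data values of $t'$ in increasing order, the set of labels of the nodes carrying that value. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def profile_string(nodes):
    """nodes: iterable of (label, data_value) pairs of the output tree t'.

    Returns V(t') as a list of frozensets: entry i is the set of labels
    of the nodes that carry the i-th smallest data value (assumed reading
    of the definition from the earlier sections).
    """
    labels_of = defaultdict(set)
    for label, value in nodes:
        labels_of[value].add(label)
    return [frozenset(labels_of[v]) for v in sorted(labels_of)]


def keys_respected(nodes, gamma0):
    """Third condition: for every a in gamma0, no two a-nodes share a value."""
    seen = set()
    for label, value in nodes:
        if label not in gamma0:
            continue
        if (label, value) in seen:
            return False
        seen.add((label, value))
    return True


# Hypothetical output tree t', flattened to its (label, value) pairs.
nodes = [("a", 3), ("b", 3), ("a", 1), ("c", 7)]
```

Feeding the result of `profile_string` to a word automaton for $\mathcal{M}$ would complete the check of the second condition.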

Note that the only difference between weak ODTA and ODTA is that a weak ODTA cannot test equality of the data values in neighboring nodes. This difference is the cause of the triple exponential leap in complexity, as stated in the following theorem.

Theorem 6.1.

The non-emptiness problem for weak ODTA is in NP.

Proof 6.2.

Let $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_0\rangle$ be a weak ODTA. Let $\Sigma$, $Q$, $\Gamma$ be the input alphabet, set of states and output alphabet of $\mathcal{T}$, respectively.

We need the following notation. For a tree $t\in\mathcal{L}_{data}(\mathcal{S})$, its extended tree $\tilde{t}$ (with respect to the weak ODTA $\mathcal{S}$) is a tree over the alphabet $\Sigma\times Q\times\Gamma$, where

  • the projection of $\tilde{t}$ to $\Sigma$ is $t$;

  • the projection of $\tilde{t}$ to $Q$ is an accepting run of $\mathcal{T}$ on $t$ whose output is the projection of $\tilde{t}$ to $\Gamma$.

The decision procedure for Theorem 6.1 works as follows.

  1. Construct an automaton $\mathcal{A}$ over the alphabet $\Sigma\times Q\times\Gamma$ for the extended trees accepted by $\mathcal{T}$.

  2. Let $\mathcal{P}=\{S_1,\ldots,S_m\}\subseteq 2^\Gamma$ be the set of symbols used in $\mathcal{M}$. By applying Proposition 2.3, construct the Presburger formula $\xi_{\mathcal{M}}(x_{S_1},\ldots,x_{S_m})$ for $\mathcal{M}$.

  3. Let $\Sigma\times Q\times\Gamma=\{(a_1,q_1,\alpha_1),\ldots,(a_k,q_n,\alpha_\ell)\}$. Let $\varphi(x_{(a_1,q_1,\alpha_1)},\ldots,x_{(a_k,q_n,\alpha_\ell)})$ be the following formula:

    \[
    \begin{array}{l}
    \exists x_{\alpha_1}\cdots\exists x_{\alpha_\ell}\ \exists x_{S_1}\cdots\exists x_{S_m}\ \xi_{\mathcal{M}}(x_{S_1},\ldots,x_{S_m})\\
    \quad\wedge\ \bigwedge_{\alpha_i\in\Gamma}\Big(x_{\alpha_i}=\sum_{a_j\in\Sigma,\,q_h\in Q}x_{(a_j,q_h,\alpha_i)}\Big)\\
    \quad\wedge\ \bigwedge_{\alpha_i\in\Gamma}\Big(x_{\alpha_i}\geq\sum_{\alpha_i\in S_j}x_{S_j}\Big)\ \wedge\ \bigwedge_{\alpha_i\in\Gamma_0}\Big(x_{\alpha_i}=\sum_{\alpha_i\in S_j}x_{S_j}\Big).
    \end{array}
    \]

  4. Test the non-emptiness of the APC $(\mathcal{A},\varphi(x_{(a_1,q_1,\alpha_1)},\ldots,x_{(a_k,q_n,\alpha_\ell)}))$.

That this procedure works in NP follows directly from the fact that the non-emptiness problem of APC is in NP.
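To make the counting constraints of Step 3 concrete, the following Python sketch evaluates the linear part of $\varphi$ for given candidate counts. It is only an illustration: the conjunct $\xi_{\mathcal{M}}$ is deliberately left out, and the function name and the dictionary encoding of the variables are assumptions made here.

```python
def linear_part_of_phi(x_node, x_S, gamma, gamma0):
    """Check the counting constraints of phi, ignoring xi_M.

    x_node: dict mapping (a, q, alpha) to the number of nodes with that
            triple in the extended tree.
    x_S:    dict mapping frozenset S (a symbol of M) to the value chosen
            for the existentially quantified variable x_S.
    """
    # x_alpha is the sum of x_(a, q, alpha) over all a and q.
    x_alpha = {alpha: 0 for alpha in gamma}
    for (_, _, alpha), count in x_node.items():
        x_alpha[alpha] += count

    for alpha in gamma:
        # Number of data values that must occur in alpha-nodes.
        needed = sum(c for S, c in x_S.items() if alpha in S)
        if x_alpha[alpha] < needed:
            return False
        if alpha in gamma0 and x_alpha[alpha] != needed:
            return False
    return True


# Hypothetical counts over Gamma = {a, b} with Gamma_0 = {a}.
gamma, gamma0 = {"a", "b"}, {"a"}
x_node = {("s", "q", "a"): 2, ("s", "q", "b"): 3}
x_S = {frozenset({"a"}): 1, frozenset({"a", "b"}): 1}
```

In the actual procedure the values of `x_S` are not given but guessed as part of solving the existential Presburger formula.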

We now show the correctness of our algorithm by showing that $\mathcal{L}_{data}(\mathcal{S})\neq\emptyset$ if and only if $\mathcal{L}(\mathcal{A},\varphi)\neq\emptyset$. (For the sake of presentation, we write $\varphi$ without its free variables.) We start with the “only if” part. Suppose that $t\in\mathcal{L}_{data}(\mathcal{S})$. We claim that the extended tree $\tilde{t}$ of $t$ is accepted by $(\mathcal{A},\varphi)$. Obviously, $\tilde{t}\in\mathcal{L}(\mathcal{A})$. Let $t'$ be the data tree obtained by projecting $\tilde{t}$ to $\Gamma$, where the data value in each node is taken from the same node in $t$. That is, $t'$ is an output of $\mathcal{T}$ on $t$. We show that $\varphi(\textsf{Parikh}(\tilde{t}))$ holds.

  • As witness to $x_{S_1},\ldots,x_{S_m}$, we take $\textsf{Parikh}(\mathcal{V}(t'))$. Since $\mathcal{V}(t')\in\mathcal{L}(\mathcal{M})$, by Proposition 2.3, $\xi_{\mathcal{M}}(\textsf{Parikh}(\mathcal{V}(t')))$ holds.

  • As witness to $x_{\alpha_1},\ldots,x_{\alpha_\ell}$, we take $\textsf{Parikh}(t')$. For each $\alpha_i\in\Gamma$, the constraint $x_{\alpha_i}\geq\sum_{\alpha_i\in S_j}x_{S_j}$ holds since the number of data values in the $\alpha_i$-nodes cannot exceed the number of $\alpha_i$-nodes itself. The constraint $x_{\alpha_i}=\sum_{\alpha_i\in S_j}x_{S_j}$ holds for each $\alpha_i\in\Gamma_0$, since the data values found in the $\alpha_i$-nodes are all different.

Thus, $\varphi(\textsf{Parikh}(\tilde{t}))$ holds, and this concludes our proof of the “only if” part.

Now we prove the “if” part. Suppose that $\tilde{t}\in\mathcal{L}(\mathcal{A},\varphi)$. So $\tilde{t}\in\mathcal{L}(\mathcal{A})$. Let $t$ and $t'$ be the $\Sigma$- and $\Gamma$-projections of $\tilde{t}$, respectively. By the definition of $\mathcal{A}$, $t'$ is an output of $\mathcal{T}$ on $t$. Since $\varphi(\textsf{Parikh}(\tilde{t}))$ holds, there exists a witness $\bar{M}=(M_1,\ldots,M_m)$ to $x_{S_1},\ldots,x_{S_m}$ such that $\xi_{\mathcal{M}}(\bar{M})$ holds. By Proposition 2.3, there exists a word $w\in\mathcal{L}(\mathcal{M})$ over the alphabet $2^\Gamma$ such that $\textsf{Parikh}(w)=\bar{M}$.

We are going to assign data values to the nodes of $t'$ (thus, also to those of $t$) such that $t\in\mathcal{L}_{data}(\mathcal{S})$. The assignment is done as follows. For each $S\subseteq\Gamma$, let $V_w(S)$ be the set of positions of $w$ labeled with $S$. Now for each $\alpha\in\Gamma$, we assign to the $\alpha$-nodes in $t'$ data values from $\bigcup_{\alpha\in S}V_w(S)$ such that $V_{t'}(\alpha)=\bigcup_{\alpha\in S}V_w(S)$. This is possible due to the constraint $x_\alpha\geq\sum_{\alpha\in S}x_S$.

With such an assignment, we get $\mathcal{V}(t')=w$. Thus, $\mathcal{V}(t')\in\mathcal{L}(\mathcal{M})$. Moreover, for every $\alpha\in\Gamma_0$, all the data values in the $\alpha$-nodes are different, which follows from the constraint $x_\alpha=\sum_{\alpha\in S}x_S$. Therefore, the resulting ordered-data tree $t$ is in $\mathcal{L}_{data}(\mathcal{S})$. This concludes our proof.
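The assignment in the “if” part can be sketched in Python as follows. This is a minimal illustration, not the construction of the paper verbatim: the nodes of $t'$ are given only by their labels, data values are the 1-based positions of $w$, and surplus $\alpha$-nodes simply repeat the last available value. The function name is hypothetical, and the sketch raises an error when a label has nodes but no position of $w$ mentions it, a situation it does not try to handle.

```python
def assign_values(node_labels, w):
    """Assign to each node a position of w so that V_{t'}(alpha) equals
    the set of positions of w whose symbol contains alpha.

    node_labels: list with the label of each node of t'.
    w:           list of frozensets (the word accepted by M).
    """
    # pool[alpha] = positions of w (1-based) whose symbol contains alpha.
    pool = {}
    for pos, symbol in enumerate(w, start=1):
        for alpha in symbol:
            pool.setdefault(alpha, []).append(pos)

    used = {}
    values = []
    for alpha in node_labels:
        positions = pool.get(alpha)
        if not positions:
            raise ValueError(f"no data value available for label {alpha!r}")
        k = used.get(alpha, 0)
        # The first len(positions) alpha-nodes use each value once;
        # any further alpha-node repeats the last value.
        values.append(positions[min(k, len(positions) - 1)])
        used[alpha] = k + 1

    # The constraint x_alpha >= sum of x_S guarantees every value is used.
    for alpha, positions in pool.items():
        if used.get(alpha, 0) < len(positions):
            raise ValueError(f"too few {alpha!r}-nodes to use every value")
    return values


# Hypothetical instance: four nodes and a word of length two.
labels = ["a", "a", "b", "a"]
word = [frozenset({"a"}), frozenset({"a", "b"})]
```

When $x_\alpha=\sum_{\alpha\in S}x_S$ holds, as required for $\alpha\in\Gamma_0$, no value is ever repeated among the $\alpha$-nodes.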

Next, we give the logical characterisation of weak ODTA.

Theorem 6.3.

A language $\mathcal{L}$ is accepted by a weak ODTA if and only if $\mathcal{L}$ is expressible with a formula of the form $\exists X_1\cdots\exists X_m\ \varphi\wedge\psi$, where $\varphi$ is a formula from $\textrm{FO}^2(E_\downarrow,E_\rightarrow)$, and $\psi$ is a formula from $\textrm{FO}(\sim,\prec,\prec_{suc})$, extended with the unary predicates $X_1,\ldots,X_m$.

The proof of Theorem 6.3 is the same as the proof of Theorem 5.9. The only difference is that the profile information is not needed to simulate the $\textrm{FO}^2(E_\downarrow,E_\rightarrow)$ formula $\varphi$. The complexity of the translation is the same as in Theorem 5.9.

6.1 Extending weak ODTA with Presburger constraints

As in the case of APC, we can extend weak ODTA with Presburger constraints without increasing the complexity of its non-emptiness problem. Let $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_0\rangle$ be a weak ODTA, where $\Sigma$ and $\Gamma$ are the input and output alphabets of $\mathcal{T}$, respectively. Let $\Gamma=\{\alpha_1,\ldots,\alpha_\ell\}$.

A weak ODTA $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_0\rangle$ extended with a Presburger constraint is a tuple $\langle\mathcal{S},\xi\rangle$, where $\xi(x_1,\ldots,x_\ell,y_1,\ldots,y_{2^\ell-1})$ is an existential Presburger formula with the free variables $x_1,\ldots,x_\ell,y_1,\ldots,y_{2^\ell-1}$. An ordered-data tree $t$ is accepted by $\langle\mathcal{S},\xi\rangle$ if there exists an output $t'$ of $\mathcal{T}$ on $t$ such that the automaton $\mathcal{M}$ accepts $\mathcal{V}_\Gamma(t')$, for each $a\in\Gamma_0$ all $a$-nodes in $t'$ have different data values, and $\xi(\textsf{Parikh}(t'),\textsf{Parikh}(\mathcal{V}_\Gamma(t')))$ holds. We write $\mathcal{L}_{data}(\mathcal{S},\xi)$ to denote the set of ordered-data trees accepted by $\langle\mathcal{S},\xi\rangle$.

We claim that the non-emptiness problem of weak ODTA extended with a Presburger constraint is still decidable in NP. The reason is as follows. The non-emptiness of a weak ODTA $\mathcal{S}$ is checked by converting $\mathcal{S}$ into an APC $(\mathcal{A},\varphi)$, where $\varphi$ expresses linear constraints on the number of nodes labeled with symbols from $\Sigma$ and $\Gamma$ as well as those labeled with $Q$ in the accepting run. The formula $\xi$ can be appropriately “inserted” into $\varphi$, and hence the non-emptiness of $\langle\mathcal{S},\xi\rangle$ is reducible to the non-emptiness of APC, which is in NP.

6.2 Comparison with other known decidable formalisms

We are going to compare the expressiveness of weak ODTA with other known models with decidable non-emptiness.

6.2.1 DTD with integrity constraints

An XML document is typically viewed as a data tree. The most common XML formalism is Document Type Definition (DTD). In short, a DTD is a context free grammar and a tree tt conforms to a DTD DD, if it is a derivation tree of a word accepted by the context free grammar.

The most commonly used XML constraints are integrity constraints which are of two types.

  • The key constraint $key(a)$ is the following constraint:

    \[ \forall x\forall y\,(a(x)\wedge a(y)\wedge x\sim y\to x=y). \]

  • The inclusion constraint $V(a)\subseteq V(b)$ is the following constraint:

    \[ \forall x\exists y\,(a(x)\to b(y)\wedge x\sim y). \]

The satisfiability problem for a given DTD $D$ and a collection $\mathcal{C}$ of integrity constraints asks whether there exists an ordered-data tree $t$ that conforms to $D$ and satisfies all the constraints in $\mathcal{C}$. In [Fan and Libkin (2002)] it is shown that this problem is NP-complete.

Theorem 6.4.

Given a DTD $D$ and a collection $\mathcal{C}$ of integrity constraints, one can construct a weak ODTA $\mathcal{S}$ such that $\mathcal{L}_{data}(\mathcal{S})$ is precisely the set of ordered-data trees that conform to $D$ and satisfy all constraints in $\mathcal{C}$.

Proof 6.5.

Let $\Sigma$ be the alphabet of the given DTD $D$. Consider the following weak ODTA $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Sigma_0\rangle$.

  • $\mathcal{T}$ is an identity transducer that checks whether the input tree conforms to the DTD $D$.

  • $\mathcal{M}$ is an automaton that accepts $\mathcal{P}^\ast$, where $\mathcal{P}=2^\Sigma-\{S\mid a\in S\ \mbox{and}\ b\notin S\ \mbox{for some}\ V(a)\subseteq V(b)\in\mathcal{C}\}$.

  • $\Sigma_0=\{a\mid key(a)\in\mathcal{C}\}$.

That $\mathcal{S}$ is the desired weak ODTA follows immediately from the fact that for every ordered-data tree $t$, $V_t(a)\subseteq V_t(b)$ if and only if $[S]_t=\emptyset$ for all $S$ where $a\in S$ but $b\notin S$.
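As an illustration of the alphabet $\mathcal{P}$ used by $\mathcal{M}$, the following Python sketch enumerates the subsets of $\Sigma$ that violate no inclusion constraint. The function name and the encoding of $V(a)\subseteq V(b)$ as a pair `(a, b)` are assumptions made for this example. The enumeration is exponential in $|\Sigma|$, which matches the size of the construction.

```python
from itertools import combinations


def allowed_symbols(sigma, inclusions):
    """Return all S in 2^sigma such that no constraint V(a) ⊆ V(b),
    given as a pair (a, b), has a in S and b not in S."""
    letters = sorted(sigma)
    allowed = []
    for size in range(len(letters) + 1):
        for combo in combinations(letters, size):
            S = frozenset(combo)
            if all(not (a in S and b not in S) for a, b in inclusions):
                allowed.append(S)
    return allowed


# Hypothetical instance: Sigma = {a, b} with the constraint V(a) ⊆ V(b).
P = allowed_symbols({"a", "b"}, [("a", "b")])
```

The empty set is kept in the result, following the formula for $\mathcal{P}$ literally. It never occurs in $\mathcal{V}(t)$, since every data value of $t$ is carried by at least one node.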

The automaton $\mathcal{M}$, and hence $\mathcal{S}$, produced by our construction in Theorem 6.4 is of exponential size. This blow-up is tight, as the following example shows. Consider the case where $\mathcal{C}$ contains only key constraints and no inclusion constraints. Then any equivalent weak ODTA $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Sigma_0\rangle$ will have $\mathcal{L}(\mathcal{M})=(2^\Sigma-\{\emptyset\})^\ast$. Thus, we have an exponential blow-up in the size of $\mathcal{M}$. Nevertheless, if we are concerned only with satisfiability, then we can lower the complexity to NP, as stated in the following theorem.

Theorem 6.6.

Given a DTD $D$ and a collection $\mathcal{C}$ of integrity constraints, one can construct a weak ODTA $\mathcal{S}$ in non-deterministic polynomial time such that $\mathcal{L}_{data}(\mathcal{S})\neq\emptyset$ if and only if there exists an ordered-data tree $t$ that conforms to $D$ and satisfies all the constraints in $\mathcal{C}$.

Proof 6.7.

Let $\Sigma$ be the alphabet of the DTD $D$. We non-deterministically construct a weak ODTA $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Sigma_0\rangle$ as follows.

  • $\mathcal{T}$ is an identity transducer that checks whether the input tree conforms to the DTD $D$.

  • Guess a sequence $(H_1,\ldots,H_k)$ of subsets of $\Sigma$ such that

    • $\Sigma$ is partitioned into $H_1\cup\cdots\cup H_k$;

    • for every two different symbols $a,b\in\Sigma$, $a$ and $b$ are in the same set $H_i$ if and only if both $V(a)\subseteq V(b)$ and $V(b)\subseteq V(a)$ are in $\mathcal{C}$;

    • if $V(a)\subseteq V(b)\in\mathcal{C}$ and $V(b)\subseteq V(a)\notin\mathcal{C}$, then $a\in H_i$ and $b\in H_j$ with $i\leq j$.

    Intuitively, the sequence $(H_1,\ldots,H_k)$ gives an ordering of the elements of $\Sigma$ that respects the inclusion constraints in $\mathcal{C}$: if both $V(a)\subseteq V(b)$ and $V(b)\subseteq V(a)$ are in $\mathcal{C}$, then $a$ and $b$ are tied and must be in the same set $H_i$.

  • Let $S_1,\ldots,S_k\subseteq\Sigma$ be such that $S_i=\Sigma-(H_1\cup\cdots\cup H_{i-1})$; in particular, $S_1=\Sigma$ and $S_k=H_k$.

  • $\mathcal{M}$ is a non-deterministic automaton over the alphabet $\{S_1,\ldots,S_k\}$, where the set of states is $\{q_1,\ldots,q_k\}$, all of $q_1,\ldots,q_k$ are both initial and final states, and the transitions are $(q_i,S_j,q_j)$ for every $1\leq i\leq j\leq k$.

  • $\Sigma_0=\{a\mid key(a)\in\mathcal{C}\}$.
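The automaton $\mathcal{M}$ built from a guessed sequence $(H_1,\ldots,H_k)$ accepts exactly the words $S_{j_1}S_{j_2}\cdots$ whose indices never decrease, since every state is initial and final and a transition on $S_j$ is available from $q_i$ precisely when $i\leq j$. The following Python sketch illustrates this under the assumption that all $H_i$ are non-empty, so that the symbols $S_i$ are pairwise distinct. The function names are hypothetical.

```python
def chain_symbols(H):
    """Given the guessed sequence H = [H_1, ..., H_k] (non-empty, disjoint
    sets), return [S_1, ..., S_k] with S_i = Sigma - (H_1 ∪ ... ∪ H_{i-1})."""
    sigma = set().union(*H)
    symbols, removed = [], set()
    for block in H:
        symbols.append(frozenset(sigma - removed))
        removed |= block
    return symbols


def accepts(word, symbols):
    """Simulate M: the word is accepted iff each letter is some S_j and
    the indices j never decrease along the word."""
    index = {S: j for j, S in enumerate(symbols)}
    current = 0
    for letter in word:
        j = index.get(letter)
        if j is None or j < current:
            return False
        current = j
    return True


# Hypothetical guess over Sigma = {a, b, c}.
S = chain_symbols([{"a"}, {"b", "c"}])
```

Here `S[0]` is $\{a,b,c\}$ and `S[1]` is $\{b,c\}$, so an accepted word is a non-increasing chain of sets, which is the shape of $\mathcal{V}(t')$ used in the proof.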

We claim that $\mathcal{L}_{data}(\mathcal{S})\neq\emptyset$ if and only if there exists an ordered-data tree $t$ that conforms to $D$ and satisfies all the constraints in $\mathcal{C}$.

We start with the “if” direction. Suppose $t$ conforms to the DTD $D$ and satisfies all the constraints in $\mathcal{C}$. For each $a\in\Sigma$, let $N_a$ be the number of distinct data values found in the $a$-nodes in $t$. Let $(H_1,\ldots,H_k)$ be a sequence of subsets of $\Sigma$ such that

  • $\Sigma$ is partitioned into $H_1\cup\cdots\cup H_k$;

  • for every two different symbols $a,b\in\Sigma$, $a$ and $b$ are in the same set $H_i$ if and only if $N_a=N_b$;

  • for $a\in H_i$ and $b\in H_j$, $i\leq j$ if and only if $N_a\leq N_b$.

Consider the following ordered-data tree $t'$ over $\Sigma$, obtained from $t$ by reassigning the data values on the nodes of $t$ as follows. For each $a\in\Sigma$, we assign the set of integers $\{d\mid 1\leq d\leq N_a\}$ as the data values of the $a$-nodes in $t'$. Such an assignment is possible since $N_a$ is no more than the number of $a$-nodes in $t'$. With such an assignment $t'$ still obeys the constraints in $\mathcal{C}$, as shown below.

  • If $key(a)\in\mathcal{C}$, then $N_a$ is precisely the number of $a$-nodes in $t$, thus also in $t'$. Thus, with the data values $\{1,\ldots,N_a\}$, the data values on the $a$-nodes in $t'$ are all different.

  • If $V(a)\subseteq V(a')\in\mathcal{C}$, then obviously $N_a\leq N_{a'}$. Thus, $t'$ still satisfies the constraint $V(a)\subseteq V(a')$, since the data values in the $a$-nodes in $t'$ are $\{1,2,\ldots,N_a\}$, while those in the $a'$-nodes are $\{1,2,\ldots,N_{a'}\}$.

Now the string $\mathcal{V}(t')$ is of the form $R_1\cdots R_m$, where $m=\max_{a\in\Sigma}(N_a)$ and $R_1\supseteq R_2\supseteq\cdots\supseteq R_m$, and if $R_i\neq R_{i+1}$, then $R_i-R_{i+1}=H_j$ for some $H_j$ in the sequence $(H_1,\ldots,H_k)$. By the definition of $\mathcal{M}$, $\mathcal{V}(t')$ is accepted by $\mathcal{M}$. That $t$ is accepted by $\mathcal{T}$ is trivial, and so is the fact that, for each $a\in\Sigma_0$, all the data values found in the $a$-nodes in $t'$ are different. Thus, $t'\in\mathcal{L}_{data}(\mathcal{S})$.

For the “only if” direction, it is sufficient to observe that for every sequence $(H_1,\ldots,H_k)$ that “respects” the inclusion constraints in $\mathcal{C}$ as explained above, if $\mathcal{V}(t)\in\mathcal{L}(\mathcal{M})$, then $t$ satisfies all the inclusion constraints in $\mathcal{C}$. This completes our proof.

6.2.2 Set and linear constraints for data trees

In the paper [David et al. (2012)] the set and linear constraints are introduced for data trees. As argued there, those constraints, together with automata, are able to capture many interesting properties commonly used in XML practice. We review those constraints and show how they can be captured by weak ODTA extended with Presburger constraints.

Data-terms (or just terms) are given by the grammar

\[ \tau:=V(a)\ |\ \tau\cup\tau\ |\ \tau\cap\tau\ |\ \overline{\tau}\qquad\mbox{for}\ a\in\Sigma. \]

The semantics of $\tau$ is defined with respect to a data tree $t$:

\[
\begin{array}{lll}
\llbracket V(a)\rrbracket_t=V_t(a) && \llbracket\overline{\tau}\rrbracket_t=V_t-\llbracket\tau\rrbracket_t\\
\llbracket\tau_1\cap\tau_2\rrbracket_t=\llbracket\tau_1\rrbracket_t\cap\llbracket\tau_2\rrbracket_t && \llbracket\tau_1\cup\tau_2\rrbracket_t=\llbracket\tau_1\rrbracket_t\cup\llbracket\tau_2\rrbracket_t
\end{array}
\]

Recall that $V_t=\bigcup_{a\in\Sigma}V_t(a)$ is the set of data values found in the data tree $t$.

A set constraint is either $\tau=\emptyset$ or $\tau\neq\emptyset$, where $\tau$ is a term. A data tree $t$ satisfies $\tau=\emptyset$, written as $t\models\tau=\emptyset$, if and only if $\llbracket\tau\rrbracket_t=\emptyset$ (and likewise for $\tau\neq\emptyset$).

A linear constraint $\xi$ over the alphabet $\Sigma$ is a linear constraint on the variables $x_a$, for each $a\in\Sigma$, and $z_S$, for each $S\subseteq\Sigma$. A data tree $t$ satisfies $\xi$ if $\xi$ holds when $x_a$ is interpreted as the number of $a$-nodes in $t$, and $z_S$ as the cardinality $|[S]_t|$.

Theorem 6.8.

Given a tree automaton $\mathcal{A}$ and a set $\mathcal{C}$ of set and linear constraints, there exists a weak ODTA $\langle\mathcal{S},\varphi\rangle$ extended with Presburger constraints such that $\mathcal{L}_{data}(\mathcal{S},\varphi)$ is precisely the set of ordered-data trees accepted by $\mathcal{A}$ that satisfy all the constraints in $\mathcal{C}$. Moreover, the construction of $\langle\mathcal{S},\varphi\rangle$ takes exponential time in the size of $\mathcal{A}$ and $\mathcal{C}$.

Proof 6.9.

The proof is simply a restatement of the proof in [David et al. (2012)] in the language of weak ODTA. We need the following notation. For a data term $\tau$, we define a family $\mathbb{S}(\tau)$ of subsets of $\Sigma$ as follows.

  • If $\tau=V(a)$, then $\mathbb{S}(\tau)=\{S\mid a\in S\ \mbox{and}\ S\subseteq\Sigma\}$.

  • If $\tau=\overline{\tau}_1$, then $\mathbb{S}(\tau)=2^\Sigma-\mathbb{S}(\tau_1)$.

  • If $\tau=\tau_1\star\tau_2$, then $\mathbb{S}(\tau)=\mathbb{S}(\tau_1)\star\mathbb{S}(\tau_2)$, where $\star$ is $\cap$ or $\cup$.

It follows that for every data tree $t$, we have $\llbracket\tau\rrbracket_t=\bigcup_{S\in\mathbb{S}(\tau)}[S]_t$. Recall that the sets $[S]_t$ are disjoint.
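The inductive definition of $\mathbb{S}(\tau)$ translates directly into a recursive function. The Python sketch below is an illustration only: the encoding of terms as nested tuples (`("V", a)`, `("not", t)`, `("and", t1, t2)`, `("or", t1, t2)`) and the function names are assumptions made here, not notation from [David et al. (2012)].

```python
from itertools import combinations


def powerset(sigma):
    """All subsets of sigma, as a set of frozensets."""
    letters = sorted(sigma)
    return {
        frozenset(combo)
        for size in range(len(letters) + 1)
        for combo in combinations(letters, size)
    }


def s_family(term, sigma):
    """Compute the family S(term) of subsets of sigma."""
    kind = term[0]
    if kind == "V":        # term = V(a)
        return {S for S in powerset(sigma) if term[1] in S}
    if kind == "not":      # term = complement of term[1]
        return powerset(sigma) - s_family(term[1], sigma)
    if kind == "and":      # term = term[1] ∩ term[2]
        return s_family(term[1], sigma) & s_family(term[2], sigma)
    if kind == "or":       # term = term[1] ∪ term[2]
        return s_family(term[1], sigma) | s_family(term[2], sigma)
    raise ValueError(f"unknown term constructor: {kind!r}")


# Hypothetical term V(a) ∩ complement(V(b)) over Sigma = {a, b}.
example = s_family(("and", ("V", "a"), ("not", ("V", "b"))), {"a", "b"})
```

For this term the only member of the family is $\{a\}$: the data values that occur in $a$-nodes but in no $b$-node are exactly those in $[\{a\}]_t$.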

The desired $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Sigma_0\rangle$ is defined as follows. The transducer $\mathcal{T}$ is the identity transducer that accepts exactly the trees accepted by $\mathcal{A}$, and $\Sigma_0=\emptyset$. The automaton $\mathcal{M}$ accepts a word $v\in(2^\Sigma)^\ast$ if and only if

  • C1. for every set constraint $\tau=\emptyset$, $v$ does not contain any symbol from $\mathbb{S}(\tau)$;

  • C2. for every set constraint $\tau\neq\emptyset$, $v$ contains at least one symbol from $\mathbb{S}(\tau)$.

The formula $\xi$ is the conjunction of all the linear constraints in $\mathcal{C}$.

That $\mathcal{L}_{data}(\mathcal{S},\xi)$ is indeed precisely the set of ordered-data trees accepted by $\mathcal{A}$ that satisfy all the constraints in $\mathcal{C}$ follows immediately from the definition of $\mathbb{S}$. The exponential upper bound arises in the construction of the automaton $\mathcal{M}$, which requires enumerating each element of $2^\Sigma$ and checking that both conditions C1 and C2 are satisfied. This completes the proof of Theorem 6.8.

6.2.3 $\textrm{FO}^2(+1,\prec_{suc})$ over text

Here we focus our attention on ordered-data words, which can be viewed as trees where each node has at most one child. We write $w={a_1\choose d_1}\cdots{a_n\choose d_n}$ to denote an ordered-data word in which position $i$ has label $a_i$ and data value $d_i$. It is called a text if all the data values are different and the set of data values $\{d_1,\ldots,d_n\}$ is precisely $\{1,\ldots,n\}$.

It is shown in [Manuel (2010)] that the satisfiability problem for $\textrm{FO}^2(+1,\prec_{suc})$ over text is decidable. (The definition of text in [Manuel (2010)] is slightly different, but it is equivalent to our definition. However, it turns out that the key lemma proved in [Manuel (2010)] has a gap, which is filled later on in [Figueira (2012b)]. The final result is still correct though.) The following theorem shows that this decidability can be obtained via weak ODTA.

Theorem 6.10.

For every formula $\varphi\in\textrm{FO}^2(+1,\prec_{suc})$, one can effectively construct a weak ODTA $\mathcal{S}$ such that

  • for every text $w$, if $w\in\mathcal{L}_{data}(\varphi)$, then $w\in\mathcal{L}_{data}(\mathcal{S})$;

  • for every ordered-data word $w\in\mathcal{L}_{data}(\mathcal{S})$, there exists a text $w'\in\mathcal{L}_{data}(\varphi)$ such that $\textsf{Proj}(w)=\textsf{Proj}(w')$.

The construction of $\mathcal{S}$ takes double exponential time in the size of $\varphi$.

Proof 6.11.

In [Manuel (2010)], the decidability is proved by constructing so-called text automata, also defined in [Manuel (2010)]. We review the precise definition here. Let $w={a_1\choose d_1}\cdots{a_n\choose d_n}$ be a text over the alphabet $\Sigma$. Since all the data values in a text are different, $\mathcal{V}(w)=S_1\cdots S_n$ is such that each $S_i$ is a singleton.

We define $msp(w)$, the marked string projection of $w$, as the word $(a_1,b_1)\cdots(a_n,b_n)$, where $b_i\in\{-1,1,*\}$ and

\[
b_i=\left\{\begin{array}{ll}
-1 & \mbox{if}\ 1\leq i<n\ \mbox{and}\ d_{i+1}+1=d_i\\
1 & \mbox{if}\ 1\leq i<n\ \mbox{and}\ d_i+1=d_{i+1}\\
* & \mbox{otherwise}
\end{array}\right.
\]
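As a small illustration of the marked string projection, the following Python sketch computes $msp(w)$ for a text given as a list of (label, data value) pairs. The function name and the use of the string `"*"` for the third marker are choices made for this example.

```python
def msp(text):
    """Marked string projection of a text.

    text: list of (label, data_value) pairs, positions in word order.
    Returns a list of (label, marker) pairs with marker in {-1, 1, "*"}.
    """
    n = len(text)
    result = []
    for i, (label, value) in enumerate(text):
        if i < n - 1 and text[i + 1][1] + 1 == value:
            marker = -1      # the next position holds the predecessor value
        elif i < n - 1 and value + 1 == text[i + 1][1]:
            marker = 1       # the next position holds the successor value
        else:
            marker = "*"     # last position, or values are not consecutive
        result.append((label, marker))
    return result


# Hypothetical text over {a, b} with data values {1, 2, 3}.
example = msp([("a", 2), ("b", 3), ("a", 1)])
```

In this example the first position is marked `1` because the data value 3 follows 2, while the other two positions are marked `"*"`.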

A text automaton over the alphabet $\Sigma$ is a pair $(T_1,T_2)$, where

  • $T_1$ is a non-deterministic letter-to-letter word transducer with the input alphabet $\Sigma\times\{-1,1,*\}$ and the output alphabet $\Gamma$.

  • $T_2$ is a non-deterministic finite state automaton over $\Gamma$.

A text $w={a_1\choose d_1}\cdots{a_n\choose d_n}$ is accepted by the text automaton $(T_1,T_2)$ if

  • $msp(w)$ is accepted by $T_1$, yielding a string $\alpha_1\cdots\alpha_n$;

  • the string $\alpha_{i_1}\cdots\alpha_{i_n}$ is accepted by $T_2$, where the indexes $i_1,\ldots,i_n$ are such that $1=d_{i_1}<d_{i_2}<\cdots<d_{i_n}=n$.

It is shown in [Manuel (2010)] that for every $\varphi\in\textrm{FO}^2(+1,\prec_{suc})$, one can effectively construct a text automaton $\mathcal{A}$ such that for every text $w$, $w\in\mathcal{L}_{data}(\varphi)$ if and only if $w\in\mathcal{L}_{data}(\mathcal{A})$.

Now we show how to get the desired weak ODTA $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma\rangle$. Let $(T_1,T_2)$ be the text automaton as above. On an input ordered-data word $w={a_1\choose d_1}\cdots{a_n\choose d_n}$, $\mathcal{S}$ performs the following.

  • The transducer $\mathcal{T}$ simulates $T_1$ by guessing $msp(w)$, and outputs its $\Gamma$-projection while storing its $\{-1,1,*\}$-projection in its states.

  • The automaton $\mathcal{M}$ is simply $T_2$.

It is straightforward to see that such $\mathcal{S}$ is the desired weak ODTA. The analysis of the complexity is as follows. The construction of the text automaton $(T_1,T_2)$ takes double exponential time in the size of $\varphi$; see [Manuel (2010), Lemmas 5 and 6]. The construction of the weak ODTA $\mathcal{S}$ takes polynomial time in the size of $(T_1,T_2)$. Altogether, it takes double exponential time to construct $\mathcal{S}$ from the original formula $\varphi$. This completes the proof of Theorem 6.10.

7 An Undecidable Extension

In this section we would like to remark on an undecidable extension of ODTA. Recall the language in Example 3.3. It has already noted in the proof of Proposition 5.6 that its complement is not accepted by any ODTA. Formally, the complement of the language in Example 3.3 can be expressed with formula of the form:

xyaΣ0a(x)aΣ0a(y)E(x,y)xy,\forall x\;\forall y\;\bigvee_{a\in\Sigma_{0}}a(x)\wedge\bigvee_{a\in\Sigma_{0}}a(y)\wedge\mbox{$E_{\downarrow}$}^{*}(x,y)\to x\prec y, (15)

where Σ0Σ\Sigma_{0}\subseteq\Sigma and E\mbox{$E_{\downarrow}$}^{*} denotes the transitive closure of EE_{\downarrow}. In the following we are going to show that given an ODTA and a collection 𝒞\mathcal{C} of formulas of the form (15), it is undecidable to check whether there is an ordered-data tree tdata(𝒮)t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}) such that tψt\models\psi, for all ψ𝒞\psi\in\mbox{$\mathcal{C}$}.

The proof is simply an observation that the proof of [Bojanczyk et al. (11a), Proposition 29] can be applied directly here. In [Bojanczyk et al. (11a), Proposition 29] it is proved that the satisfiability of FO2(E,E,,)\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\downarrow}$}^{*},\sim,\prec) is undecidable.††††††Technically, the undecidability in [Bojanczyk et al. (11a), Proposition 29] is proved on data strings over the logic FO2(+1,<,,)\mbox{$\textrm{FO}$}^{2}(+1,<,\sim,\prec), which of course, is equivalent to FO2(E,E,,)\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\downarrow}$}^{*},\sim,\prec). The reduction is from Post Correspondence Problem (PCP), where given an instance of PCP, one can effectively construct a formula of the form φψ\varphi\wedge\psi, where φFO2(E,E,)\varphi\in\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\downarrow}$}^{*},\sim) and ψ\psi is a formula of the form (15). Since φ\varphi can be captured by ODTA, the undecidability of ODTA extended with formulas of the form (15) follows immediately.

At this point we would also like to point out that extending ODTA with operations such as addition on data values immediately yields undecidability. This follows from [Halpern (1991)], which shows that Presburger arithmetic extended with unary predicates is undecidable.

8 When the Data Values are Strings

In this section we discuss data trees where the data values are strings from $\{0,1\}^{\ast}$, instead of natural numbers. We call such trees string data trees. There are two common kinds of order for strings: the prefix order and the lexicographic order. Strings with the lexicographic order form a linearly ordered domain; thus, ODTA can be applied directly in this case.
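The difference between the two orders can be seen in a small sketch. The sample values and helper names below are illustrative assumptions. The sketch relies on the fact that Python compares strings over `0` and `1` lexicographically, with a proper prefix ordered before its extensions.

```python
from itertools import combinations


def is_prefix(u: str, v: str) -> bool:
    """True if u is a (not necessarily proper) prefix of v."""
    return v.startswith(u)


def comparable_by_prefix(u: str, v: str) -> bool:
    """True if u and v are related by the prefix order in either direction."""
    return is_prefix(u, v) or is_prefix(v, u)


# Illustrative data values from {0,1}*.
values = ["01", "0100", "0101", "01011"]

# Pairs that the prefix order leaves unrelated: the order is only partial.
incomparable = [
    (u, v) for u, v in combinations(values, 2) if not comparable_by_prefix(u, v)
]
print(incomparable)  # [('0100', '0101'), ('0100', '01011')]

# Under the lexicographic order every pair of distinct strings is comparable.
lex_total = all(u < v or v < u for u, v in combinations(values, 2))
print(lex_total)  # True
```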

For the prefix order, we have to modify the definition of ODTA. Consider a string data tree $t$ over the alphabet $\Sigma$. Let $V_{t}$ be the set of data values found in $t$. We define $\mathcal{V}_{\Sigma}(t)$ as a tree over the alphabet $2^{\Sigma}$, where

  • $\textsf{Dom}(\mathcal{V}_{\Sigma}(t))$ is $V_{t}\cup\{\epsilon\}$;

  • for $u,v\in\textsf{Dom}(\mathcal{V}_{\Sigma}(t))$, $u$ is the parent of $v$ if $u$ is a proper prefix of $v$ and there is no $w\in\textsf{Dom}(\mathcal{V}_{\Sigma}(t))$, different from $u$ and $v$, such that $u$ is a prefix of $w$ and $w$ is a prefix of $v$;

  • for $u\in\textsf{Dom}(\mathcal{V}_{\Sigma}(t))$, the label of $u$ is $S$ if $u\in[S]_{t}$, and root if $u=\epsilon$.

We call $\mathcal{V}_{\Sigma}(t)$ the tree representation of the data values in $t$. Consider the example of a string data tree in Figure 2. We have

\[
\begin{array}{lll}
~[\{a\}]_{t}=\{0101\}&&[\{b\}]_{t}=\{0100\}\\
~[\{c\}]_{t}=\{01011\}&&[\{a,b\}]_{t}=\{01\}\\
~[\{b,c\}]_{t}=\{010000\}&&[\{a,b,c\}]_{t}=\{010011\}.
\end{array}
\]

So $\textsf{Dom}(\mathcal{V}_{\Sigma}(t))=\{\epsilon,01,0100,0101,010011,010000,01011\}$, and

  • $\epsilon$ is the parent of $01$;

  • $01$ is the parent of $0100$ and $0101$;

  • $0100$ is the parent of $010011$ and $010000$; and

  • $0101$ is the parent of $01011$.

[Figure 2 not reproduced. Nodes of the string data tree, as label/data-value pairs: (a, 01), (b, 0100), (c, 01011), (a, 010011), (a, 0101), (b, 01), (b, 010011), (a, 0101), (c, 010011), (c, 010000), (b, 010000). Node labels of the tree representation: root, {a,b}, {b}, {a}, {a,b,c}, {b,c}, {c}.]
Figure 2: An example of a string data tree (on the left) and the tree representation of its data values (on the right).
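The construction of the tree representation can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `tree_representation` is hypothetical, the empty string stands for $\epsilon$, and the input is a flat list of (label, data value) pairs, since the shape of $t$ plays no role in the definition of $\mathcal{V}_{\Sigma}(t)$.

```python
from collections import defaultdict


def tree_representation(nodes):
    """Build the tree representation of the data values of a string data tree.

    nodes: iterable of (label, value) pairs, one per node of t.
    Returns (labels, parent): labels maps each element of V_t ∪ {""} to the
    set S with value in [S]_t (or "root" for the empty string), and parent
    maps every non-root element to its parent.
    """
    label_sets = defaultdict(set)
    for label, value in nodes:
        label_sets[value].add(label)

    dom = set(label_sets) | {""}  # V_t together with epsilon
    labels = {
        u: "root" if u == "" else frozenset(label_sets[u]) for u in dom
    }

    parent = {}
    for v in dom:
        if v == "":
            continue
        # The parent is the longest proper prefix of v that lies in dom;
        # the empty string guarantees that at least one candidate exists.
        parent[v] = max(
            (u for u in dom if u != v and v.startswith(u)), key=len
        )
    return labels, parent


# The label/value pairs of the string data tree in Figure 2.
figure2 = [
    ("a", "01"), ("b", "0100"), ("c", "01011"), ("a", "010011"),
    ("a", "0101"), ("b", "01"), ("b", "010011"), ("a", "0101"),
    ("c", "010011"), ("c", "010000"), ("b", "010000"),
]
labels, parent = tree_representation(figure2)
print(parent["010011"])          # 0100
print(sorted(labels["010000"]))  # ['b', 'c']
```

On this input the sketch yields the six sets $[S]_{t}$ listed above and the parent relation of the tree representation in Figure 2.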

Now an ODTA for string data trees is $\mathcal{S}=\langle\mathcal{T},\mathcal{A},\Gamma_{0}\rangle$, where $\mathcal{T}$ is a letter-to-letter transducer from $\Sigma\times\{\top,\bot,*\}^{3}$ to $\Gamma$; $\mathcal{A}$ is an unranked tree automaton over the alphabet $2^{\Gamma}$; and $\Gamma_{0}\subseteq\Gamma$. The requirement for acceptance is the same as in Section 5, except that $\mathcal{A}$ takes a tree over the alphabet $2^{\Gamma}$ as the input.

We observe that in the proof of the decidability of the non-emptiness of ODTA $\mathcal{S}=\langle\mathcal{T},\mathcal{M},\Gamma_{0}\rangle$, the automaton $\mathcal{M}$ is converted in polynomial time into a Presburger formula by applying Proposition 2.3, which actually holds for tree automata. Hence, the decision procedures in Sections 5 and 6 can also be applied to string data trees.

9 Concluding Remarks

In this paper we study data trees in which the data values come from a linearly ordered domain, where in addition to the equality test, we can test whether the data value in one node is greater than that in another. We introduce ordered-data tree automata (ODTA), provide their logical characterisation, and prove that their non-emptiness problem is decidable in 3-NExpTime. We also show that the logic $\exists\textrm{MSO}^{2}(E_{\downarrow},E_{\rightarrow},\sim)$ can be captured by ODTA.

Then we define weak ODTA, which are essentially ODTA without the ability to perform equality tests on the data values of two adjacent nodes. We provide their logical characterisation and show that a number of existing formalisms and models studied in the literature can already be captured by weak ODTA. We also show that the definition of ODTA can be easily modified to the case where the data values come from a tree-like partially ordered domain, such as strings.

We believe that the notion of ODTA provides new techniques for reasoning about ordered data values on unranked trees and can thus find applications in practice. We also prove that ODTA capture various formalisms on data trees studied so far in the literature. As far as we know, this is the first formalism for data trees with neat logical and automata characterisations.


Acknowledgement. The author would like to thank FWO for their generous financial support under the FWO Marie Curie fellowship scheme. The author also thanks Egor V. Kostylev for careful proofreading of this paper, as well as Nadime Francis for pointing out the reference [Halpern (1991)]. Finally, the author thanks the anonymous referees of both the conference and the journal versions for their careful reading and comments, which greatly improved the paper.


References

  • Alon et al. (2003) Alon, N., Milo, T., Neven, F., Suciu, D., and Vianu, V. 2003. XML with data values: typechecking revisited. Journal of Computer and System Sciences 66, 4, 688–727.
  • Arenas et al. (2008) Arenas, M., Fan, W., and Libkin, L. 2008. On the complexity of verifying consistency of XML specifications. SIAM Journal on Computing 38, 3, 841–880.
  • Benedikt et al. (2010) Benedikt, M., Ley, C., and Puppis, G. 2010. Automata vs. logics on data words. In CSL.
  • Björklund and Bojanczyk (2007) Björklund, H. and Bojanczyk, M. 2007. Bounded depth data trees. In ICALP.
  • Björklund et al. (2008) Björklund, H., Martens, W., and Schwentick, T. 2008. Optimizing conjunctive queries over trees using schema information. In MFCS.
  • Bojanczyk et al. (11a) Bojanczyk, M., David, C., Muscholl, A., Schwentick, T., and Segoufin, L. 2011a. Two-variable logic on data words. ACM Transactions on Computational Logic 12, 4, 27.
  • Bojanczyk et al. (11b) Bojanczyk, M., Klin, B., and Lasota, S. 2011b. Automata with group actions. In LICS. 355–364.
  • Bojanczyk et al. (2009) Bojanczyk, M., Muscholl, A., Schwentick, T., and Segoufin, L. 2009. Two-variable logic on data trees and XML reasoning. Journal of the ACM 56, 3.
  • Bouyer et al. (2001) Bouyer, P., Petit, A., and Thérien, D. 2001. An algebraic characterization of data and timed languages. In CONCUR.
  • Comon et al. (2007) Comon, H., Dauchet, M., Gilleron, R., Löding, C., Jacquemard, F., Lugiez, D., Tison, S., and Tommasi, M. 2007. Tree automata techniques and applications. Available at http://www.grappa.univ-lille3.fr/tata. Release of October 12th, 2007.
  • David et al. (2010) David, C., Libkin, L., and Tan, T. 2010. On the satisfiability of two-variable logic over data words. In LPAR.
  • David et al. (2012) David, C., Libkin, L., and Tan, T. 2012. Efficient reasoning about data trees via integer linear programming. ACM Transactions on Database Systems 37, 3, 19.
  • Demri et al. (2007) Demri, S., D’Souza, D., and Gascon, R. 2007. A decidable temporal logic of repeating values. In LFCS.
  • Demri and Lazić (2009) Demri, S. and Lazić, R. 2009. LTL with the freeze quantifier and register automata. ACM Transactions on Computational Logic 10, 3.
  • Deutsch et al. (2009) Deutsch, A., Hull, R., Patrizi, F., and Vianu, V. 2009. Automatic verification of data-centric business processes. In ICDT.
  • Fan and Libkin (2002) Fan, W. and Libkin, L. 2002. On XML integrity constraints in the presence of DTDs. Journal of the ACM 49, 3, 368–406.
  • Figueira (2009) Figueira, D. 2009. Satisfiability of downward XPath with data equality tests. In PODS.
  • Figueira (2011) Figueira, D. 2011. A decidable two-way logic on data words. In LICS. 365–374.
  • Figueira (2012a) Figueira, D. 2012a. Alternating register automata on finite data words and trees. Logical Methods in Computer Science 8, 1.
  • Figueira (2012b) Figueira, D. 2012b. Satisfiability for two-variable logic with two successor relations on finite linear orders. http://arxiv.org/abs/1204.2495.
  • Figueira et al. (2010) Figueira, D., Hofman, P., and Lasota, S. 2010. Relating timed and register automata. In EXPRESS.
  • Figueira and Segoufin (2011) Figueira, D. and Segoufin, L. 2011. Bottom-up automata on data trees and vertical XPath. In STACS.
  • Grumberg et al. (2010) Grumberg, O., Kupferman, O., and Sheinvald, S. 2010. Variable automata over infinite alphabets. In LATA.
  • Halpern (1991) Halpern, J. 1991. Presburger arithmetic with unary predicates is $\Pi_{1}^{1}$ complete. Journal of Symbolic Logic 56, 2, 637–642.
  • Jurdzinski and Lazic (2011) Jurdzinski, M. and Lazic, R. 2011. Alternating automata on data trees and XPath satisfiability. ACM Transactions on Computational Logic 12, 3, 19.
  • Kaminski and Francez (1994) Kaminski, M. and Francez, N. 1994. Finite-memory automata. Theoretical Computer Science 134, 2, 329–363.
  • Kara et al. (2012) Kara, A., Schwentick, T., and Tan, T. 2012. Feasible automata for two-variable logic with successor on data words. In LATA.
  • Lazić (2011) Lazić, R. 2011. Safety alternating automata on data words. ACM Transactions on Computational Logic 12, 2, 10.
  • Libkin (2004) Libkin, L. 2004. Elements of Finite Model Theory. Springer.
  • Manuel (2010) Manuel, A. D. 2010. Two orders and two variables. In MFCS.
  • Neven (2002) Neven, F. 2002. Automata, logic, and XML. In CSL.
  • Neven et al. (2004) Neven, F., Schwentick, T., and Vianu, V. 2004. Finite state machines for strings over infinite alphabets. ACM Transactions on Computational Logic 5, 3, 403–435.
  • Schwentick (2004) Schwentick, T. 2004. XPath query containment. SIGMOD Record 33, 1, 101–109.
  • Schwentick and Zeume (2010) Schwentick, T. and Zeume, T. 2010. Two-variable logic with two order relations. In CSL.
  • Segoufin and Torunczyk (2011) Segoufin, L. and Torunczyk, S. 2011. Automata based verification over linearly ordered data domains. In STACS.
  • Seidl et al. (2004) Seidl, H., Schwentick, T., Muscholl, A., and Habermehl, P. 2004. Counting in trees for free. In ICALP.
  • Thatcher (1967) Thatcher, J. 1967. Characterizing derivation trees of context-free grammars through a generalization of finite automata theory. Journal of Computer and System Sciences 1, 4, 317–322.
  • Thomas (1997) Thomas, W. 1997. Languages, automata, and logic. In Handbook of Formal Languages, Vol. 3. 389–455.
  • Verma et al. (2005) Verma, K. N., Seidl, H., and Schwentick, T. 2005. On the complexity of equational Horn clauses. In CADE.