Extending two-variable logic on data trees with order on data values and its automata

Tony Tan
Hasselt University and Transnational University of Limburg

Abstract

Data trees are trees in which each node, besides carrying a label from a finite alphabet, also carries a data value from an infinite domain. They have been used as an abstraction model for reasoning tasks on XML and verification. However, most existing approaches consider the case where only equality tests can be performed on the data values.

In this paper we study data trees in which the data values come from a linearly ordered domain, and in addition to equality test, we can test whether the data value in a node is greater than the one in another node. We introduce an automata model for them which we call ordered-data tree automata (ODTA), provide its logical characterisation, and prove that its non-emptiness problem is decidable in 3-NExpTime. We also show that the two-variable logic on unranked data trees, studied by Bojanczyk, Muscholl, Schwentick and Segoufin in 2009, corresponds precisely to a special subclass of this automata model.

Then we define a slightly weaker version of ODTA, which we call weak ODTA, and provide its logical characterisation. For weak ODTA, the complexity of the non-emptiness problem drops to NP. Nevertheless, a number of existing formalisms and models studied in the literature can already be captured by weak ODTA. We also show that the definition of ODTA can be easily modified to the case where the data values come from a tree-like partially ordered domain, such as strings.

Categories and Subject Descriptors: F.1.1 [Models of Computation]: Automata; F.4.1 [Mathematical Logic]: Computational logic

Keywords: Finite-state automata, Two-variable logic, Data trees, Ordered data values

General Terms: Languages

An extended abstract of this article was published in the proceedings of LICS 2012 under the title “An Automata Model for Trees with Ordered Data Values.” The author is an FWO Pegasus Marie Curie Fellow.

1 Introduction

Classical automata theory studies words and trees over finite alphabets. Recently there has been a growing interest in the so-called “data” words and trees, that is, words and trees in which each position, besides carrying a label from a finite alphabet, also carries a data value from an infinite domain.

Interest in such structures with data stems from their connection to XML [Alon et al. (2003), Arenas et al. (2008), Björklund et al. (2008), David et al. (2012), Fan and Libkin (2002), Figueira (2009), Neven (2002)], as well as to system specifications [Bouyer et al. (2001), Demri et al. (2007), Segoufin and Torunczyk (2011)], where many properties simply cannot be captured by finite alphabets. This has motivated various works on data words [Benedikt et al. (2010), Bojanczyk et al. (11a), Demri and Lazić (2009), Grumberg et al. (2010), Kaminski and Francez (1994), Neven et al. (2004)], as well as on data trees [Björklund and Bojanczyk (2007), Bojanczyk et al. (2009), Figueira (2012a), Figueira and Segoufin (2011), Jurdzinski and Lazic (2011)]. The common feature of these works is the addition of equality tests on the data values to the logic on trees. While for finitely-labeled trees many logical formalisms (e.g., monadic second-order logic, MSO) are decidable by converting formulae to automata, even FO (first-order logic) on data words extended with data equality is already undecidable; see, e.g., [Bojanczyk et al. (11a), Fan and Libkin (2002), Neven et al. (2004)].

Thus, there is a need for frameworks that are expressive enough, yet computationally well-behaved, to reason about structures with data values. This has been a common theme in XML and system specification research, which has largely followed two routes. The first takes a specific reasoning task, or a set of similar tasks, and builds algorithms for them (see, e.g., [Arenas et al. (2008), Figueira (2011), Björklund et al. (2008), Schwentick (2004), Fan and Libkin (2002), Figueira (2009)]). The second looks for sufficiently general automata models that can express reasoning tasks of interest, but are still decidable (see, e.g., [Demri and Lazić (2009), Bojanczyk et al. (2009), Jurdzinski and Lazic (2011), Segoufin and Torunczyk (2011)]).

Both approaches usually assume that data values come from an abstract set equipped only with the equality predicate. This is already sufficient to capture a wide range of interesting applications both in databases and verification. However, it has been advocated in [Deutsch et al. (2009)] that comparisons based on a linear order over the data values could be useful in many scenarios, including data centric applications built on top of a database.

So far, little work has been done in this direction. A few works such as [Manuel (2010), Figueira (2011), Schwentick and Zeume (2010), Segoufin and Torunczyk (2011)] are on words, while in most applications we need to consider trees. Moreover, these works are incomparable to some interesting existing formalisms [Fan and Libkin (2002), Bojanczyk et al. (2009), Arenas et al. (2008), David et al. (2012), Jurdzinski and Lazic (2011), Demri and Lazić (2009), Lazić (2011)] known to be able to capture various interesting scenarios common in practice. On top of that, many useful techniques, notably those introduced in [Fan and Libkin (2002), Bojanczyk et al. (11a), Bojanczyk et al. (2009), Jurdzinski and Lazic (2011)], can deal only with data equality, and are highly dependent on specific combinatorial properties of the formalisms. They are rather hard to adapt to other, more specific tasks, let alone to generalise to more relations on data values, and they tend to produce extremely high complexity bounds, such as non-primitive-recursive, or at least as hard as the reachability problem in Petri nets. Furthermore, many known decidability results are lost as soon as we add the order relation on data values. Some exceptions are [Figueira et al. (2010), Figueira (2012a)].

In this paper we study data trees in which the data values come from a linearly ordered domain, which we call ordered-data trees. In addition to equality tests on the data values, in ordered-data trees we are allowed to test whether the data value in a node is greater than the data value in another node. To the extent possible, we aim to unify the various ad hoc methods introduced to reason about data trees, and to generalise them to ordered-data trees to make them more accessible and applicable in practice. This paper is the first step: we introduce an automata model for ordered-data trees, provide its logical characterisation, and prove that its non-emptiness problem is decidable. Moreover, we show that it can capture various well-known formalisms.

Brief description of the results in this paper

The trees that we consider are unranked trees, where there is no a priori bound on the number of children of a node. Moreover, the children of each node are ordered. We consider a natural logic for ordered-data trees, which consists of the following relations.

•

The parent relation $E_{\downarrow}$ , where $\mbox{$E_{\downarrow}$}(x,y)$ means that node $x$ is the parent of node $y$ .
•

The next-sibling relation $E_{\rightarrow}$ , where $\mbox{$E_{\rightarrow}$}(x,y)$ means that nodes $x$ and $y$ have the same parent and $y$ is the next sibling of $x$ .
•

The labeling predicates $a(\cdot)$ ’s, where $a(x)$ means that node $x$ is labeled with symbol $a$ .
•

The data equality predicate $\sim$ , where $x\sim y$ means that nodes $x$ and $y$ have the same data value.
•

The order relation on data $\prec$ , where $x\prec y$ means that the data value in node $x$ is less than the one in node $y$ .
•

The successive order relation on data $\prec_{suc}$ , where $x\mbox{$\prec_{suc}$}\ y$ means that the data value in node $y$ is the minimal data value in the tree greater than the one in node $x$ .

We introduce an automata model for ordered-data trees, which we call ordered-data tree automata (ODTA), and provide its logical characterisation. Namely, we prove that the class of languages accepted by ODTA corresponds precisely to those expressible by formulas of the form:

\exists X_{1}\cdots\exists X_{n}\ \varphi\wedge\psi,

(1)

where

•

$X_{1},\ldots,X_{n}$ are monadic second-order predicates;
•

$\varphi$ is an FO formula restricted to two variables and using only the predicates $E_{\downarrow}$ , $E_{\rightarrow}$ , $\sim$ , as well as the unary predicates $X_{1},\ldots,X_{n}$ and $a$ ’s.
•

$\psi$ is an FO formula using only the predicates $\sim$ , $\prec$ , $\prec_{suc}$ , as well as the unary predicates $X_{1},\ldots,X_{n}$ and $a$ ’s.
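As a small illustration of the shape (1), the following is a hypothetical formula with a single set variable $X$ (so $n=1$) over labels $a$ and $b$; it is our own example and is not taken from the paper. Here $\varphi$ uses only the two variables $x,y$ and the predicates $E_{\downarrow}$, $\sim$, $X$, $a$, while $\psi$ uses only $\prec$, $X$, $b$.

```latex
\begin{align*}
&\exists X\ \varphi\wedge\psi, \quad\text{where}\\
&\varphi \;=\; \forall x\,\bigl(a(x)\rightarrow\exists y\,(E_{\downarrow}(x,y)\wedge x\sim y\wedge X(y))\bigr),\\
&\psi \;=\; \forall x\,\forall y\,\bigl((X(x)\wedge b(y))\rightarrow x\prec y\bigr).
\end{align*}
```

Read informally: there is a set $X$ of nodes such that every $a$-node has a child in $X$ carrying the same data value, and every node in $X$ carries a data value smaller than that of every $b$-node.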

We show that the logic $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ , first studied in [Bojanczyk et al. (2009)], corresponds precisely to a special subclass of ODTA, where $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ denotes the set of formulas of the form (1) in which $\psi$ is a true formula. We then prove that the non-emptiness problem of ODTA is decidable in 3-NExpTime. Our main idea here is to show how to convert an ordered-data tree back to a string over a finite alphabet (see our notion of string representation of data values in Section 3). Such a conversion enables us to use classical finite state automata to reason about data values.

Then we define a slightly weaker version of ODTA, which we call weak ODTA. Essentially the only feature of ODTA missing in weak ODTA is the ability to test whether two adjacent nodes have the same data value. Without this simple feature, the complexity of the non-emptiness problem surprisingly drops by three exponentials, from 3-NExpTime to NP. We provide its logical characterisation by showing that it corresponds precisely to the languages expressible by the formulas of the form (1) where $\varphi$ does not use the predicate $\sim$ . We show that a number of existing formalisms and models can already be captured by weak ODTA, namely those in [Fan and Libkin (2002), David et al. (2012), Manuel (2010)].

We should remark that [David et al. (2012)] studies a formalism which consists of tree automata and a collection of set and linear constraints. (Footnote: We will later define formally what set and linear constraints are.) It is shown there that the satisfiability problem for this formalism is NP-complete. In fact, it is also shown in [David et al. (2012)] that a single set constraint (without tree automaton and linear constraint) already yields NP-hardness. Weak ODTA are essentially equivalent to the formalism in [David et al. (2012)] extended with the full expressive power of the first-order logic $\mbox{$\textrm{FO}$}(\sim,\prec,\mbox{$\prec_{suc}$})$ . It is worth noting that despite this extension, the non-emptiness problem remains in NP.

Finally we also show that the definition of ODTA can be easily modified to the case where the data values come from a partially ordered domain, such as strings. This work can be seen as a generalisation of the works in [David et al. (2010)] and [Kara et al. (2012)]. However, it must be noted that [David et al. (2010), Kara et al. (2012)] deal only with data words, where only equality test is allowed on the data values and there is no order on them.

Related works

Most of the existing works in this area are on data words. In [Bojanczyk et al. (11a)] the model of data automata was introduced, and it was shown that it captures the logic $\exists\mbox{$\textrm{MSO}$}^{2}(\sim,<,+1)$ , the fragment of existential monadic second-order logic in which the first-order part uses only two variables and the predicates: the data equality $\sim$ , as well as the order $<$ and the successor $+1$ on the domain.

An important feature of data automata is that their non-emptiness problem is decidable, even for infinite words, but is at least as hard as reachability for Petri nets. It was also shown that the satisfiability problem for the three-variable first order logic is undecidable. Later in [David et al. (2010)] an alternative proof was given for the decidability of the weaker logic $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(+1,\sim)$ . The proof gives a decision procedure with an elementary upper bound for the satisfiability problem of $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(+1,\sim)$ on strings. Recently in [Kara et al. (2012)] an automata model that captures precisely the logic $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(+1,\sim)$ , both on finite and infinite words, is proposed.

Another logical approach is via the so-called linear temporal logic with freeze quantifier, introduced in [Demri and Lazić (2009)]. Intuitively, these are LTL formulas equipped with a finite number of registers to store the data values. We denote by $\mbox{$\textrm{LTL}$}_{n}^{\downarrow}[\texttt{X},\texttt{U}]$ the LTL with freeze quantifier, where $n$ denotes the number of registers and the only temporal operators allowed are the neXt operator X and the Until operator U. It was shown that alternating register automata with $n$ registers ( $\mbox{$\textrm{RA}$}_{n}$ ) accept all $\mbox{$\textrm{LTL}$}_{n}^{\downarrow}[\texttt{X},\texttt{U}]$ languages and that the non-emptiness problem for alternating $\mbox{$\textrm{RA}$}_{1}$ is decidable, although its complexity is non-primitive-recursive. Hence, the satisfiability problem for $\mbox{$\textrm{LTL}$}_{1}^{\downarrow}[\texttt{X},\texttt{U}]$ is decidable as well. Adding one more register or the past-time operator $\texttt{U}^{-1}$ to $\mbox{$\textrm{LTL}$}_{1}^{\downarrow}[\texttt{X},\texttt{U}]$ makes the satisfiability problem undecidable. In [Figueira et al. (2010), Figueira (2012a)] it is shown that alternating $\mbox{$\textrm{RA}$}_{1}$ can be extended to strings with linearly ordered data values, and the emptiness problem is still decidable. In [Lazić (2011)] a weaker version of alternating $\mbox{$\textrm{RA}$}_{1}$ , called safety alternating $\mbox{$\textrm{RA}$}_{1}$ , is considered, and the non-emptiness problem is shown to be EXPSPACE-complete.

A model for data words with linearly ordered data values was proposed in [Segoufin and Torunczyk (2011)]. The model consists of an automaton equipped with a finite number of registers, and its transitions are based on constraints on the data values stored in the registers. It is shown that the non-emptiness problem for this model is decidable in PSPACE. However, no logical characterisation is provided for this model.

In [Bojanczyk et al. (11b)] another type of register automata for words was introduced and studied, which is a generalisation of the original register automata introduced by Kaminski and Francez [Kaminski and Francez (1994)], where the data values also can come from a linearly ordered domain. Thus, the order comparison, not just equality, can be performed on data values. The generalisation is done via the notion of monoid for data words, and is incomparable with our model here. In the terminology of the original register automata defined in [Kaminski and Francez (1994)], it is simply register automata extended with testing whether the data value currently read is bigger/smaller than those in the registers.

It is shown in [Manuel (2010)] that the satisfiability problem for $\mbox{$\textrm{FO}$}^{2}(+1,\mbox{$\prec_{suc}$})$ over text is decidable. A text is simply a data word in which all the data values are different and they range over the positive integers from $1$ to $n$ , for some $n\geq 1$ . We will see later that the satisfiability problem for $\mbox{$\textrm{FO}$}^{2}(+1,\mbox{$\prec_{suc}$})$ can be reduced to the non-emptiness problem of our model.

In [Schwentick and Zeume (2010)] it is shown that the satisfiability problem of the logic $\mbox{$\textrm{FO}$}^{2}(<,\prec)$ on words is decidable. This logic is incomparable with our model. However, it should be noted that $\mbox{$\textrm{FO}$}^{2}(<)$ cannot capture the whole class of regular languages.

The work on data trees that we are aware of is in [Bojanczyk et al. (2009), Jurdzinski and Lazic (2011)]. In [Bojanczyk et al. (2009)] it was shown that the satisfiability problem for the logic $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ over unranked trees is decidable in 3-NExpTime. However, no automata model is provided. We will see later how this logic corresponds precisely to a special subclass of ODTA.

In [Jurdzinski and Lazic (2011)] alternating tree register automata were introduced. They are essentially the generalisation of alternating $\mbox{$\textrm{RA}$}_{1}$ to the tree case. It was shown that this model captures the forward XPath queries. However, no logical characterisation is provided and the non-emptiness problem, though decidable, is non-primitive-recursive.

As mentioned earlier, the main idea in this paper is the conversion of the data values from an infinite domain back to a string over a finite alphabet. Roughly speaking, it works as follows. Given an ordered-data tree $t$ , we show how to construct a string $w$ over a finite alphabet whose domain corresponds precisely to the data values in $t$ . We then use a classical finite state automaton to reason about $w$ , and thus also about the data values in $t$ . This idea is the main difference between our paper and the existing works. Most of the existing techniques rely on some specific combinatorial properties of the formalisms considered, which make them highly independent of one another. As we will see later, our model captures quite a few other formalisms without a significant jump in complexity.

Organisation

This paper is organised as follows. In Section 2 we give some preliminary background. In Section 3 we formally define the logic for ordered-data trees and present a few examples as well as notations that we need in this paper. In Section 4 we present two lemmas that we are going to need later on. We prove them in a quite general setting, as we think they are interesting in their own right. We introduce the ordered-data tree automata (ODTA) in Section 5 and weak ODTA in Section 6. In Section 7 we discuss a couple of undecidable extensions of weak ODTA. In Section 8 we describe how to modify the definition of ODTA when the data values are strings, that is, when they come from a partially ordered domain. Finally, we conclude with some remarks in Section 9.

2 Preliminaries

In this section we review some definitions that we are going to use later on. We usually use $\Gamma$ and $\Sigma$ to denote finite alphabets. We write $2^{\Gamma}$ to denote an alphabet in which each symbol corresponds to a subset of $\Gamma$ . In some cases, we may need the alphabet $2^{2^{\Gamma}}$ – an alphabet in which each symbol corresponds to a set of subsets of $\Gamma$ . We denote the set of natural numbers $\{0,1,2,\ldots\}$ by ${\mathbb{N}}$ .

Usually we write $\mathcal{L}$ to denote a language, for both string and tree languages. When it is clear from the context, we use the term language to mean either a string language, or a tree language.

2.1 Finite state automata over strings and commutative regular languages

We usually write $\mathcal{M}$ to denote a finite state automaton on strings. The language accepted by the automaton $\mathcal{M}$ is denoted by $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})$ .

Let $\Sigma=\{a_{1},\ldots,a_{\ell}\}$ . For a word $w\in\Sigma^{\ast}$ , the Parikh image of $w$ is $\mbox{$\textsf{Parikh}$}(w)=(n_{1},\ldots,n_{\ell})$ , where $n_{i}$ is the number of appearances of $a_{i}$ in $w$ . For a vector $\bar{n}$ , the inverse of the Parikh image of $\bar{n}$ is $\mbox{$\textsf{Parikh}$}^{-1}(\bar{n})=\{w\mid w\in\Sigma^{\ast}\ \mbox{and}\ \mbox{$\textsf{Parikh}$}(w)=\bar{n}\}$ .
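The Parikh image simply counts letters, so it can be sketched in a few lines. The snippet below is a minimal illustration, assuming that words are Python strings and that the alphabet is given as an ordered sequence of one-character symbols; the function names `parikh` and `in_inverse_parikh` are ours.

```python
from collections import Counter


def parikh(word, alphabet):
    """Return the Parikh image of `word` as a tuple ordered like `alphabet`."""
    counts = Counter(word)
    extra = set(counts) - set(alphabet)
    if extra:
        raise ValueError(f"symbols outside the alphabet: {sorted(extra)}")
    # Counter returns 0 for symbols that do not occur in the word.
    return tuple(counts[a] for a in alphabet)


def in_inverse_parikh(word, vector, alphabet):
    """Check membership of `word` in the inverse Parikh image of `vector`."""
    return parikh(word, alphabet) == tuple(vector)
```

For instance, `parikh("abcab", "abc")` counts two occurrences of `a`, two of `b` and one of `c`, and any reordering of that word has the same image.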

For $1\leq i\leq\ell$ , a vector $\bar{v}=(n_{1},\ldots,n_{\ell})\in\mbox{${\mathbb{N}}$}^{\ell}$ is called an $i$ -base, if $n_{i}\neq 0$ and $n_{j}=0$ , for all $j\neq i$ . A language $\mathcal{L}$ is periodic, if there exist $(\ell+1)$ vectors $\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell}$ such that $\bar{u}\in\mbox{${\mathbb{N}}$}^{\ell}$ and each $\bar{v}_{i}$ is an $i$ -base and

\mbox{$\mathcal{L}$}=\bigcup_{h_{1},\ldots,h_{\ell}\geq 0}\mbox{$\textsf{Parikh}$}^{-1}(\bar{u}+h_{1}\bar{v}_{1}+\cdots+h_{\ell}\bar{v}_{\ell}).

We denote such a language $\mathcal{L}$ by $\mbox{$\mathcal{L}$}(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})$ .
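Since each $\bar{v}_{i}$ is an $i$-base, membership in a periodic language can be decided letter by letter: the count of $a_{i}$ must be at least the $i$-th entry of $\bar{u}$ and exceed it by a multiple of the non-zero entry of $\bar{v}_{i}$. The sketch below makes this explicit, assuming words are Python strings over one-character symbols and that `periods[i]` holds the non-zero entry of the $i$-th base; the name `in_periodic` is hypothetical.

```python
def in_periodic(word, alphabet, u, periods):
    """Decide whether `word` lies in the periodic language L(u, v_1, ..., v_l).

    `u` is the offset vector and `periods[i]` is the single non-zero
    (hence positive) entry of the i-base v_i.
    """
    counts = [word.count(a) for a in alphabet]
    return all(
        n >= base and (n - base) % period == 0
        for n, base, period in zip(counts, u, periods)
    )
```

With `alphabet = "ab"`, `u = (1, 0)` and `periods = (2, 1)`, the function accepts exactly the words with an odd number of `a`'s, regardless of the order of the letters.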

A language $\mathcal{L}$ is commutative if it is closed under reordering. That is, if $w=b_{1}\cdots b_{m}\in\mbox{$\mathcal{L}$}$ , and $\sigma$ is a permutation on $\{1,\ldots,m\}$ , then $b_{\sigma(1)}\cdots b_{\sigma(m)}\in\mbox{$\mathcal{L}$}$ .

The following result is folklore and can be proved easily.

Theorem 2.1.

A language is commutative and regular if and only if it is a finite union of periodic languages.

2.2 Unranked trees, tree automata and transducers

An unranked finite tree domain is a prefix-closed finite subset $D$ of $\mbox{${\mathbb{N}}$}^{*}$ (words over ${\mathbb{N}}$ ) such that $u\cdot i\in D$ implies $u\cdot j\in D$ for all $j<i$ and $u\in\mbox{${\mathbb{N}}$}^{*}$ . Given a finite labeling alphabet $\Sigma$ , a $\Sigma$ -labeled unranked tree $t$ is a structure

\langle D,\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\{a(\cdot)\}_{a\in\Sigma}\rangle,

where

•

$D$ is an unranked tree domain,
•

$E_{\downarrow}$ is the child relation: $(u,u\cdot i)\in\mbox{$E_{\downarrow}$}$ for all $u,u\cdot i\in D$ ,
•

$E_{\rightarrow}$ is the next-sibling relation: $(u\cdot i,u\cdot(i+1))\in\mbox{$E_{\rightarrow}$}$ for all $u\cdot i,u\cdot(i+1)\in D$ , and
•

the $a(\cdot)$ ’s are labeling predicates, i.e. for each node $u$ , exactly one of $a(u)$ , with $a\in\Sigma$ , is true.

We write $\mbox{$\textsf{Dom}$}(t)$ to denote the domain $D$ . The label of a node $u$ in $t$ is denoted by $\mbox{$\ell ab$}_{t}(u)$ . If $\mbox{$\ell ab$}_{t}(u)=a$ , then we say that $u$ is an $a$ -node.
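The definition of a tree domain and of the two edge relations can be checked mechanically. The following is a minimal sketch, assuming that nodes are represented as Python tuples of natural numbers (the root is the empty tuple); the function names are ours.

```python
def is_tree_domain(nodes):
    """Check the two closure conditions of an unranked tree domain."""
    dom = set(nodes)
    for u in dom:
        if u == ():
            continue
        parent, i = u[:-1], u[-1]
        # Prefix-closed: checking the immediate prefix of every node suffices.
        if parent not in dom:
            return False
        # Closed under left siblings: checking index i - 1 suffices.
        if i > 0 and parent + (i - 1,) not in dom:
            return False
    # Note: the empty set passes vacuously.
    return True


def child_edges(dom):
    """The relation E_down: pairs (u, u.i) with both nodes in the domain."""
    return {(u[:-1], u) for u in dom if u}


def sibling_edges(dom):
    """The relation E_right: pairs (u.i, u.(i+1)) with both nodes in the domain."""
    return {(u[:-1] + (u[-1] - 1,), u) for u in dom if u and u[-1] > 0}
```

For example, the set `{(), (0,), (1,), (1, 0)}` is a tree domain whose root has two children, while `{(), (1,)}` is not, because the left sibling `(0,)` is missing.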

An unranked tree automaton [Comon et al. (2007), Thatcher (1967)] over $\Sigma$ -labeled trees is a tuple $\mbox{$\mathcal{A}$}=\langle Q,\Sigma,\delta,F\rangle$ , where $Q$ is a finite set of states, $F\subseteq Q$ is the set of final states, and $\delta:Q\times\Sigma\to 2^{(Q^{*})}$ is a transition function; we require $\delta(q,a)$ ’s to be regular languages over $Q$ for all $q\in Q$ and $a\in\Sigma$ .

A run of $\mathcal{A}$ over a tree $t$ is a function $\rho_{\mbox{\scriptsize$\mathcal{A}$}}:\mbox{$\textsf{Dom}$}(t)\to Q$ such that for each node $u$ with $n$ children $u\cdot 0,\ldots,u\cdot(n-1)$ , the word $\rho_{\mbox{\scriptsize$\mathcal{A}$}}(u\cdot 0)\cdots\rho_{\mbox{\scriptsize$\mathcal{A}$}}(u\cdot(n-1))$ is in the language $\delta(\rho_{\mbox{\scriptsize$\mathcal{A}$}}(u),\mbox{$\ell ab$}_{t}(u))$ . For a leaf $u$ labeled $a$ , this means that $u$ could be assigned a state $q$ if and only if the empty word $\epsilon$ is in $\delta(q,a)$ . A run is accepting if $\rho_{\mbox{\scriptsize$\mathcal{A}$}}(\epsilon)\in F$ , i.e., if the root is assigned a final state. A tree $t$ is accepted by $\mathcal{A}$ if there exists an accepting run of $\mathcal{A}$ on $t$ . The set of all trees accepted by $\mathcal{A}$ is denoted by $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{A}$})$ .
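Acceptance can be illustrated by a naive bottom-up computation of the states that some run may assign to each node. The sketch below rests on several assumptions of ours: states are single characters, each language $\delta(q,a)$ is given as a Python regular expression over those characters, a tree is a dictionary from nodes (tuples) to labels, and the choice of child states is brute-forced, which is exponential and only meant for small examples.

```python
import re
from itertools import product


def possible_states(tree, delta, states, u=()):
    """States q such that some run on the subtree rooted at u assigns q to u."""
    children = []
    i = 0
    while u + (i,) in tree:
        children.append(possible_states(tree, delta, states, u + (i,)))
        i += 1
    result = set()
    for q in states:
        pattern = delta.get((q, tree[u]))
        if pattern is None:
            continue  # delta(q, a) is treated as the empty language
        # For a leaf, product() yields one empty choice, i.e. the empty word.
        if any(re.fullmatch(pattern, "".join(choice))
               for choice in product(*children)):
            result.add(q)
    return result


def accepts(tree, delta, states, final):
    """A tree is accepted if the root can be assigned a final state."""
    return bool(possible_states(tree, delta, states) & set(final))


# Hypothetical automaton: state "1" means "the subtree contains a b-node".
STATES = "01"
DELTA = {("1", "b"): "[01]*", ("1", "a"): "[01]*1[01]*", ("0", "a"): "0*"}
WITH_B = {(): "a", (0,): "a", (1,): "b"}
WITHOUT_B = {(): "a", (0,): "a"}
```

With final state `"1"`, the call `accepts(WITH_B, DELTA, STATES, "1")` succeeds, whereas the tree `WITHOUT_B` is rejected.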

An unranked tree (letter-to-letter) transducer with the input alphabet $\Sigma$ and output alphabet $\Gamma$ is a tuple $\mbox{$\mathcal{T}$}=\langle\mbox{$\mathcal{A}$},\mu\rangle$ , where $\mathcal{A}$ is a tree automaton with the set of states $Q$ , and $\mu\subseteq Q\times\Sigma\times\Gamma$ is an output relation. We call such $\mathcal{T}$ a transducer from $\Sigma$ to $\Gamma$ .

Let $t$ be a $\Sigma$ -labeled tree, and $t^{\prime}$ a $\Gamma$ -labeled tree such that $\mbox{$\textsf{Dom}$}(t)=\mbox{$\textsf{Dom}$}(t^{\prime})$ . We say that a tree $t^{\prime}$ is an output of $\mathcal{T}$ on $t$ , if there is an accepting run $\rho_{\mbox{\scriptsize$\mathcal{A}$}}$ of $\mathcal{A}$ on $t$ and for each $u\in\mbox{$\textsf{Dom}$}(t)$ , it holds that $(\rho_{\mbox{\scriptsize$\mathcal{A}$}}(u),\mbox{$\ell ab$}_{t}(u),\mbox{$\ell ab$}_{t^{\prime}}(u))\in\mu$ . We call $\mathcal{T}$ an identity transducer, if $\mbox{$\ell ab$}_{t}(u)=\mbox{$\ell ab$}_{t^{\prime}}(u)$ for all $u\in\mbox{$\textsf{Dom}$}(t)$ . We will often view an automaton $\mathcal{A}$ as an identity transducer.

2.3 Automata with Presburger constraints (APC)

An automaton with Presburger constraints (APC) is a tuple $\langle\mbox{$\mathcal{A}$},\xi\rangle$ , where $\mathcal{A}$ is an unranked tree automaton with states $q_{0},\ldots,q_{m}$ and $\xi$ is an existential Presburger formula with free variables $x_{0},\ldots,x_{m}$ . A tree $t$ is accepted by $\langle\mbox{$\mathcal{A}$},\xi\rangle$ , denoted by $t\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{A}$},\xi)$ , if there is an accepting run $\rho_{\mbox{\scriptsize$\mathcal{A}$}}$ of $\mathcal{A}$ on $t$ such that $\xi(n_{0},\ldots,n_{m})$ is true, where $n_{i}$ is the number of appearances of $q_{i}$ in $\rho_{\mbox{\scriptsize$\mathcal{A}$}}$ .
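The counting part of this definition can be sketched as follows. This is only an illustration under our own assumptions: a run is given as a dictionary from nodes to states, and the constraint $\xi$ is modelled as an ordinary Python predicate on the state counts rather than as a Presburger formula. The sketch checks the counting condition only; whether the run is an accepting run of the automaton has to be verified separately.

```python
from collections import Counter


def state_counts(run, states):
    """Return (n_0, ..., n_m), where n_i counts the nodes assigned states[i]."""
    counts = Counter(run.values())
    return tuple(counts[q] for q in states)


def satisfies(run, states, xi):
    """Evaluate the constraint xi on the state counts of the given run."""
    return bool(xi(*state_counts(run, states)))


# Hypothetical run on a three-node tree, and the constraint x_0 = 2 * x_1.
STATES = ("q0", "q1")
RUN = {(): "q1", (0,): "q0", (1,): "q0"}
```

Here `satisfies(RUN, STATES, lambda x0, x1: x0 == 2 * x1)` holds, since the run uses `q0` twice and `q1` once.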

Theorem 2.2.

[Seidl et al. (2004), Verma et al. (2005)] The non-emptiness problem for APC is decidable in NP.

It is worth noting also that the class of languages accepted by APC is closed under union and intersection.

Oftentimes, instead of counting the number of states in the accepting run, we need to count the number of occurrences of alphabet symbols in the tree. Since we can easily embed the alphabet symbols inside the states, we always assume that the Presburger formula $\xi$ has the free variables $x_{a}$ ’s to denote the number of appearances of the symbol $a$ in the tree.

As in the word case, we let $\mbox{$\textsf{Parikh}$}(t)$ denote the Parikh image of the tree $t$ . We will need the following proposition.

Proposition 2.3.

[Seidl et al. (2004), Verma et al. (2005)] Given an unranked tree automaton $\mathcal{A}$ , one can construct, in polynomial time, an existential Presburger formula $\xi_{\mbox{\scriptsize$\mathcal{A}$}}(x_{1},\ldots,x_{\ell})$ such that

•

for every tree $t\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{A}$})$ , $\xi_{\mbox{\scriptsize$\mathcal{A}$}}(\mbox{$\textsf{Parikh}$}(t))$ holds;
•

for every $\bar{n}=(n_{1},\ldots,n_{\ell})$ such that $\xi_{\mbox{\scriptsize$\mathcal{A}$}}(\bar{n})$ holds, there exists a tree $t\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{A}$})$ with $\mbox{$\textsf{Parikh}$}(t)=\bar{n}$ .

3 Ordered-data trees and Their Logic

An ordered-data tree over the alphabet $\Sigma$ is a tree in which each node, besides carrying a label from the finite alphabet $\Sigma$ , also carries a data value from $\mbox{$\mathbb{N}$}=\{0,1,\ldots\}$ . (Footnote: Here we use the natural numbers as data values just to be concrete. The results in our paper apply trivially for any linearly ordered domain.)

Let $t$ be an ordered-data tree over $\Sigma$ and $u\in\mbox{$\textsf{Dom}$}(t)$ . We write $\mbox{$\textrm{\em va}\ell$}_{t}(u)$ to denote the data value in the node $u$ . The set of all data values in the $a$ -nodes in $t$ is denoted by $V_{t}(a)$ . That is, $V_{t}(a)=\{\mbox{$\textrm{\em va}\ell$}_{t}(u)\ |\ \mbox{$\ell ab$}_{t}(u)=a\ \mbox{and}\ u\in\mbox{$\textsf{Dom}$}(t)\}$ . We write $V_{t}$ to denote the set of data values found in the tree $t$ . We also write $\#_{t}(a)$ to denote the number of $a$ -nodes in $t$ .

The profile of a node $u$ is a triplet $(l,p,r)\in\{\top,\bot,*\}\times\{\top,\bot,*\}\times\{\top,\bot,*\}$ , where $l=\top$ and $l=\bot$ indicate that the node $u$ has the same data value as, respectively a different data value from, its left sibling; $l=*$ indicates that $u$ does not have a left sibling. Similarly, $p=\top$ , $p=\bot$ , and $p=*$ have the same meaning in relation to the parent of the node $u$ , while $r=\top$ , $r=\bot$ , and $r=*$ mean the same in relation to the right sibling of the node $u$ . For an ordered-data tree $t$ over $\Sigma$ , the profile tree of $t$ , denoted by $\mbox{$\textsf{Profile}$}(t)$ , is a tree over $\Sigma\times\{\top,\bot,*\}^{3}$ obtained by attaching to each node of $t$ its profile.
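The profile of a node depends only on its own data value and on those of its left sibling, parent and right sibling, so it is easy to compute. The sketch below assumes that an ordered-data tree is a dictionary from nodes (tuples of naturals) to pairs `(label, value)`, and writes `"top"`, `"bot"` and `"*"` for $\top$, $\bot$ and $*$; the names are ours.

```python
def _compare(tree, u, v):
    """Return "*" if v is absent, else "top"/"bot" for equal/different values."""
    if v not in tree:
        return "*"
    return "top" if tree[u][1] == tree[v][1] else "bot"


def profile(tree, u):
    """Return the profile (l, p, r) of node u."""
    if u == ():
        return ("*", "*", "*")  # the root has neither siblings nor a parent
    parent, i = u[:-1], u[-1]
    left = parent + (i - 1,) if i > 0 else None
    right = parent + (i + 1,)
    return (_compare(tree, u, left),
            _compare(tree, u, parent),
            _compare(tree, u, right))


def profile_tree(tree):
    """Attach to every node its profile, keeping the label."""
    return {u: (tree[u][0], profile(tree, u)) for u in tree}


# Hypothetical example: a root with three children.
TREE = {(): ("a", 3), (0,): ("b", 3), (1,): ("a", 5), (2,): ("c", 5)}
```

In this example the middle child has profile `("bot", "bot", "top")`: its value 5 differs from that of its left sibling and of its parent, and equals that of its right sibling.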

We write $\mbox{$\textsf{Proj}$}(t)$ to denote the $\Sigma$ projection of the ordered-data tree $t$ , that is, $\mbox{$\textsf{Proj}$}(t)$ is $t$ without the data values. When we say that an ordered-data tree $t$ is accepted by an automaton $\mathcal{A}$ , we mean that $\mbox{$\textsf{Proj}$}(t)$ is accepted by $\mathcal{A}$ . An ordered-data tree $t^{\prime}$ is an output of a transducer $\mathcal{T}$ on an ordered-data tree $t$ , if $\mbox{$\textsf{Proj}$}(t^{\prime})$ is an output of $\mathcal{T}$ on $\mbox{$\textsf{Proj}$}(t)$ , and for all $u\in\mbox{$\textsf{Dom}$}(t^{\prime})$ , we have $\mbox{$\textrm{\em va}\ell$}_{t^{\prime}}(u)=\mbox{$\textrm{\em va}\ell$}_{t}(u)$ .

Figure 1 shows an example of an ordered-data tree $t$ over the alphabet $\{a,b,c\}$ with its profile tree. The notation ${a\choose d}$ means that the node is labeled with $a$ and has data value $d$ .

Figure 1: An example of an ordered-data tree (on the left) and its profile (on the right).

3.1 String representations of data values

Let $t$ be an ordered-data tree over $\Gamma$ . For a set $S\subseteq\Gamma$ , let

[S]_{t}=\bigcap_{a\in S}V_{t}(a)\cap\bigcap_{b\notin S}\overline{V_{t}(b)}.

That is, $[S]_{t}$ is the set of data values that are found in $a$ -positions for all $a\in S$ but are not found in any $b$ -position for $b\not\in S$ . Note that the sets $[S]_{t}$ ’s are disjoint, and that for each $a\in\Gamma$ ,

V_{t}(a)=\bigcup_{S\ \mbox{\scriptsize s.t.}\ a\in S}\ [S]_{t}.

Moreover, $|V_{t}(a)|=\sum_{S\ \mbox{\scriptsize s.t.}\ a\in S}\ |[S]_{t}|$ .

Let $d_{1}<\cdots<d_{m}$ be all the data values found in $t$ . The string representation of the data values in $t$ , denoted by $\mbox{$\mathcal{V}$}_{\Gamma}(t)$ , is the string $S_{1}\cdots S_{m}$ over the alphabet $2^{\Gamma}-\{\emptyset\}$ of length $m$ such that $d_{i}\in[S_{i}]_{t}$ , for each $i=1,\ldots,m$ . The notation $[S]_{t}$ was already introduced in [David et al. (2010), David et al. (2012)], but $\mbox{$\mathcal{V}$}_{\Gamma}(t)$ was not.

Consider the example of the tree $t$ in Figure 1. The data values in $t$ are $1,2,4,6,7$ , where

\begin{aligned}
{}[\{b,c\}]_{t} &= \{1\},\\
{}[\{a,b,c\}]_{t} &= \{2\},\\
{}[\{a,b\}]_{t} &= \{4,7\},\\
{}[\{a,c\}]_{t} &= \{6\},\\
{}[S]_{t} &= \emptyset,\ \mbox{for all the other}\ S\mbox{'s}.
\end{aligned}

The string $\mbox{$\mathcal{V}$}_{\Gamma}(t)$ is $S_{1}\ S_{2}\ S_{3}\ S_{4}\ S_{5}$ , where $S_{1}=\{b,c\}$ , $S_{2}=\{a,b,c\}$ , $S_{3}=S_{5}=\{a,b\}$ and $S_{4}=\{a,c\}$ .
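The string representation depends only on which labels occur with which data values, not on the shape of the tree. The sketch below computes it from a list of `(label, value)` pairs. The list `NODES` is a hypothetical set of pairs chosen by us to be consistent with the sets $[S]_{t}$ listed above; it is not a transcription of Figure 1.

```python
def string_representation(nodes):
    """Return V(t) as a list of label sets, one per data value in increasing order."""
    labels_of = {}
    for label, value in nodes:
        labels_of.setdefault(value, set()).add(label)
    return [frozenset(labels_of[d]) for d in sorted(labels_of)]


def classes(nodes):
    """Return the non-empty sets [S]_t as a dictionary from S to data values."""
    labels_of = {}
    for label, value in nodes:
        labels_of.setdefault(value, set()).add(label)
    result = {}
    for value, labels in labels_of.items():
        result.setdefault(frozenset(labels), set()).add(value)
    return result


# Hypothetical (label, value) pairs matching the sets [S]_t given in the text.
NODES = [("b", 1), ("c", 1), ("a", 2), ("b", 2), ("c", 2),
         ("a", 4), ("b", 4), ("a", 7), ("b", 7), ("a", 6), ("c", 6)]
```

On `NODES`, the data values in increasing order are 1, 2, 4, 6, 7, and `string_representation(NODES)` yields the five label sets $S_{1},\ldots,S_{5}$ described in the text.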

3.2 A logic for ordered-data trees

An ordered-data tree $t$ over the alphabet $\Sigma$ can be viewed as a structure

t\ =\langle D,\{a(\cdot)\}_{a\in\Sigma},\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim,\prec,\mbox{$\prec_{suc}$}\rangle,

where

•

the relations $\{a(\cdot)\}_{a\in\Sigma},\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$}$ are as defined before in Subsection 2.2,
•

$u\sim v$ holds, if $\mbox{$\textrm{\em va}\ell$}_{t}(u)=\mbox{$\textrm{\em va}\ell$}_{t}(v)$ ,
•

$u\prec v$ holds, if $\mbox{$\textrm{\em va}\ell$}_{t}(u)<\mbox{$\textrm{\em va}\ell$}_{t}(v)$ ,
•

$u\mbox{$\prec_{suc}$}\ v$ holds, if $\mbox{$\textrm{\em va}\ell$}_{t}(v)$ is the minimal data value in $t$ greater than $\mbox{$\textrm{\em va}\ell$}_{t}(u)$ .

Obviously, $x\mbox{$\prec_{suc}$}\ y$ can be expressed equivalently as $x\prec y\wedge\forall z(\neg(x\prec z\wedge z\prec y))$ . We include $\prec_{suc}$ for the sake of convenience. We also assume that we have the predicates $\mbox{{\rm root}}(x)$ , $\mbox{{\rm first-sibling}}(x)$ , $\mbox{{\rm last-sibling}}(x)$ , and $\mbox{{\rm leaf}}(x)$ which stand for $\forall y(\neg\mbox{$E_{\downarrow}$}(y,x))$ , $\forall y(\neg\mbox{$E_{\rightarrow}$}(y,x))$ , $\forall y(\neg\mbox{$E_{\rightarrow}$}(x,y))$ , and $\forall y(\neg\mbox{$E_{\downarrow}$}(x,y))$ , respectively. We also write $x\nsim y$ to denote $\neg(x\sim y)$ .
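The equivalence between $\prec_{suc}$ and its first-order definition can be checked directly on small examples. The sketch below assumes that the data values of a tree are given as a dictionary from nodes to naturals; `prec_suc` follows the definition via the minimal greater value, and `prec_suc_fo` follows the formula $x\prec y\wedge\forall z(\neg(x\prec z\wedge z\prec y))$. Both names are ours.

```python
def prec_suc(values, u, v):
    """val(v) is the minimal data value in the tree greater than val(u)."""
    greater = [d for d in values.values() if d > values[u]]
    return bool(greater) and values[v] == min(greater)


def prec_suc_fo(values, u, v):
    """The first-order definition: u < v and no node z lies strictly between."""
    return values[u] < values[v] and not any(
        values[u] < values[z] < values[v] for z in values
    )


# Hypothetical data values on a root with three children.
VALUES = {(): 4, (0,): 1, (1,): 7, (2,): 4}
```

In this example the value 1 is succeeded by 4 and the value 4 by 7, so `prec_suc(VALUES, (0,), ())` holds while `prec_suc(VALUES, (0,), (1,))` does not.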

For $\mbox{$\mathcal{O}$}\subseteq\{\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim,\prec,\mbox{$\prec_{suc}$}\}$ , we let $\mbox{$\textrm{FO}$}(\mbox{$\mathcal{O}$})$ stand for the first-order logic with the vocabulary $\mathcal{O}$ , $\mbox{$\textrm{MSO}$}(\mbox{$\mathcal{O}$})$ for its monadic second-order logic (which extends $\mbox{$\textrm{FO}$}(\mbox{$\mathcal{O}$})$ with quantification over sets of nodes), and $\mbox{$\exists\mbox{$\textrm{MSO}$}$}(\mbox{$\mathcal{O}$})$ for its existential monadic second order logic, i.e., formulas of the form $\exists X_{1}\ldots\exists X_{m}\ \psi$ , where $\psi$ is an $\mbox{$\textrm{FO}$}(\mbox{$\mathcal{O}$})$ formula over the vocabulary $\mathcal{O}$ extended with the unary predicates $X_{1},\ldots,X_{m}$ .

We let $\mbox{$\textrm{FO}$}^{2}(\mbox{$\mathcal{O}$})$ stand for $\mbox{$\textrm{FO}$}(\mbox{$\mathcal{O}$})$ with two variables, i.e., the set of $\mbox{$\textrm{FO}$}(\mbox{$\mathcal{O}$})$ formulae that only use two variables $x$ and $y$ . The set of all formulae of the form $\exists X_{1}\ldots\exists X_{m}\ \psi$ , where $\psi$ is an $\mbox{$\textrm{FO}$}^{2}(\mbox{$\mathcal{O}$})$ formula is denoted by $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$\mathcal{O}$})$ . Note that $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$})$ is equivalent in expressive power to $\mbox{$\textrm{MSO}$}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$})$ over the usual (without data) trees. That is, it defines precisely the regular tree languages [Thomas (1997)].

As usual, we define $\mbox{$\mathcal{L}$}_{data}(\varphi)$ as the set of ordered-data trees that satisfy the formula $\varphi$ . In such a case, we say that the formula $\varphi$ expresses the language $\mbox{$\mathcal{L}$}_{data}(\varphi)$ . (Footnote: To avoid confusion, we put the subscript $data$ on $\mbox{$\mathcal{L}$}_{data}$ to denote a language of ordered-data trees. We use the symbol $\mathcal{L}$ without the subscript $data$ to denote the usual language of trees/strings without data.)

The following theorem is well known. It shows that even extending $\mbox{$\textrm{FO}$}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$})$ with an equality test on data values immediately yields undecidability.

Theorem 3.1.

(See, for example, [Neven et al. (2004)]) The satisfiability problem for the logic $\mbox{$\textrm{FO}$}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ is undecidable.

One of the deepest results in this area is the following decidability result for the logic $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ .

Theorem 3.2.

[Bojanczyk et al. (2009)] The satisfiability problem for the logic $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ is decidable.

3.3 A few examples

In this subsection we present a few examples of properties of ordered-data trees. Some of them are special cases of more general techniques that will be used later on.

Example 3.3.

Let $\Sigma=\{a,b\}$ . Consider the language $\mbox{$\mathcal{L}$}_{data}^{a}$ of ordered-data trees over $\Sigma$ where an ordered-data tree $t\in\mbox{$\mathcal{L}$}_{data}^{a}$ if and only if there exist two $a$ -nodes $u$ and $v$ such that $u$ is an ancestor of $v$ and either $v\sim u$ or $v\prec u$ . This language can be expressed with the formula $\exists X\exists Y\exists Z\ \varphi$ , where $\varphi$ states that $X$ contains only the node $u$ , $Y$ contains only the node $v$ , $Z$ contains precisely the nodes in the path from $u$ to $v$ , and $v\sim u$ or $v\prec u$ .

Example 3.4.

For a fixed set $S\subseteq\Sigma$ and an integer $m\geq 1$ , we consider the language $\mbox{$\mathcal{L}$}_{data}^{S,m}$ such that $t\in\mbox{$\mathcal{L}$}_{data}^{S,m}$ if and only if $|[S]_{t}|=m$ .

We pick an arbitrary symbol $a\in S$ . The language $\mbox{$\mathcal{L}$}_{data}^{S,m}$ can be expressed in $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ with a formula of the form $\exists X_{1}\ \cdots\exists X_{m}\ \varphi$ , where $\varphi$ is a conjunction of the following.

•

That the predicates $X_{1},\ldots,X_{m}$ are disjoint and each of them contains exactly one node, which is an $a$ -node.
•

That the data values found in nodes in $X_{1},\ldots,X_{m}$ are all different.
•

That for each $i\in\{1,\ldots,m\}$ , if a data value is found in a node in $X_{i}$ , then it must also be found in some $b$ -node, for every $b\in S$ .
•

That for each $i\in\{1,\ldots,m\}$ , if a data value is found in a node in $X_{i}$ , then it must not be found in any $b$ -node, for every $b\notin S$ .
•
That every $a$ -node (recall that $a\in S$ ) that does not belong to any of the $X_{i}$ ’s either has the same data value as a node belonging to one of the $X_{i}$ ’s, or has a data value not in $[S]_{t}$ .
That its data value does not belong to $[S]_{t}$ can be stated as the negation of the conjunction of the following two conditions:
- –
  
  for each $b\in S$ , there is a $b$ -node with the same data value; and
- –
  
  the data value cannot be found in any $b$ -node, for every $b\notin S$ .

To express all these intended meanings, it is sufficient that $\varphi\in\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ .

Example 3.5.

For a fixed set $S\subseteq\Sigma$ and an integer $m\geq 1$ , we consider the language $\mbox{$\mathcal{L}$}_{data}^{S,\pmod{m}}$ such that $t\in\mbox{$\mathcal{L}$}_{data}^{S,\pmod{m}}$ if and only if $|[S]_{t}|\equiv 0\pmod{m}$ .

This language $\mbox{$\mathcal{L}$}_{data}^{S,\pmod{m}}$ can be expressed in $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ with a formula of the form

\exists X_{0}\cdots\exists X_{m-1}\exists Y_{0}\cdots\exists Y_{m-1}\exists Z\ \psi,

where the intended meanings of $X_{0},\ldots,X_{m-1},Y_{0},\ldots,Y_{m-1},Z$ are as follows. For a node $u$ in an ordered-data tree $t\in\mbox{$\mathcal{L}$}_{data}$ ,

•

the number of nodes belonging to $Z$ is precisely $|[S]_{t}|$ ; and if $Z(u)$ holds in $t$ , then the data value in the node $u$ belongs to $[S]_{t}$ ;
•

$X_{i}(u)$ holds in $t$ if and only if in the subtree $t^{\prime}$ rooted in $u$ we have $|[S]_{t^{\prime}}|\equiv i\pmod{m}$ ;
•

if $v_{1},\ldots,v_{k}$ are all the left-siblings of $u$ , and $X_{i_{1}}(v_{1}),\ldots,X_{i_{k}}(v_{k})$ hold, then $Y_{i}(u)$ holds if and only if $i_{1}+\cdots+i_{k}\equiv i\pmod{m}$ .

To express all these intended meanings, it is sufficient that $\psi\in\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ .

Example 3.6.

Let $\Sigma=\{a,b\}$ . Consider the language $\mbox{$\mathcal{L}$}_{data}^{a\ast}$ of ordered-data trees over $\Sigma$ where an ordered-data tree $t\in\mbox{$\mathcal{L}$}_{data}^{a\ast}$ if and only if all the $a$ -nodes with data values different from the ones in their parents satisfy the following conditions:

•

the data values found in these nodes are all different;
•

one of these data values must be the largest in the tree $t$ .

The language $\mbox{$\mathcal{L}$}_{data}^{a\ast}$ can be expressed in $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\sim,\prec)$ with the following formula:

$\exists X\ \Big(\ \forall x\big(X(x)\iff a(x)\wedge\exists y(\mbox{$E_{\downarrow}$}(y,x)\wedge y\nsim x)\big)$
$\qquad\wedge\ \forall x\forall y(X(x)\wedge X(y)\wedge x\sim y\to x=y)$
$\qquad\wedge\ \exists x\big(X(x)\wedge\forall y(y\prec x\vee x\sim y)\big)\ \Big).$
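The three conjuncts of this formula can be evaluated directly on a concrete tree. The sketch below assumes a hypothetical encoding in which a non-empty tree is a list of (label, data value, parent index) triples, with `None` as the parent of the root. The function name is illustrative.

```python
def in_L_a_star(tree):
    """Check the three conjuncts of the formula on a non-empty tree."""
    # First conjunct: X holds the a-nodes whose parent has a different data value.
    X = [i for i, (lab, val, par) in enumerate(tree)
         if lab == "a" and par is not None and tree[par][1] != val]
    x_values = [tree[i][1] for i in X]
    # Second conjunct: the data values found in X are pairwise different.
    all_distinct = len(set(x_values)) == len(x_values)
    # Third conjunct: some node of X carries the largest data value of the tree.
    largest = max(val for _, val, _ in tree)
    return all_distinct and largest in x_values

t1 = [("b", 1, None), ("a", 5, 0), ("a", 3, 0), ("b", 3, 2)]
t2 = [("b", 1, None), ("a", 3, 0), ("a", 3, 0)]
t3 = [("b", 9, None), ("a", 3, 0)]
print(in_L_a_star(t1), in_L_a_star(t2), in_L_a_star(t3))
```

In this sketch `t2` fails the second conjunct and `t3` fails the third one, since the largest data value of `t3` sits in the root.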

4 Two Useful Lemmas

In this section we prove two lemmas which will be used later on. The first is combinatorial by nature, and we will use it in our proof of the decidability of ODTA. The second is an Ehrenfeucht-Fraïssé type lemma for ordered-data trees, and we will use it in our proof of the logical characterization of ODTA.

4.1 A combinatorial lemma

Let $G$ be an (undirected and finite) graph. For simplicity, we consider only graphs without self-loops. We denote by $V(G)$ the set of vertices in $G$ and $E(G)$ the set of edges. For a node $u\in V(G)$ , we write $\deg(u)$ to denote the degree of the node $u$ and $\deg(G)$ to denote $\max\{\deg(u)\mid u\in V(G)\}$ .

A data graph over the alphabet $\Gamma$ is a graph $G$ in which each node carries a label from $\Gamma$ and a data value from ${\mathbb{N}}$ . A node $u\in V(G)$ is called an $a$ -node, if its label is $a$ , in which case we write $\mbox{$\ell ab$}_{G}(u)=a$ . We denote by $\mbox{$\textrm{\em va}\ell$}_{G}(u)$ the data value found in node $u$ , and $\mbox{$\textrm{Val}$}_{G}(a)$ the set of data values found in $a$ -nodes in $G$ .

Lemma 4.1.

Let $G$ be a data graph over $\Gamma$ . Suppose for each $a\in\Gamma$ , we have $|\mbox{$\textrm{Val}$}_{G}(a)|\geq\deg(G)|\Gamma|+\deg(G)+1$ . Then we can reassign the data values in the nodes in $G$ to obtain another data graph $G^{\prime}$ such that $V(G)=V(G^{\prime})$ and $E(G)=E(G^{\prime})$ and

(1)

for each $u\in V(G^{\prime})$ , $\mbox{$\ell ab$}_{G}(u)=\mbox{$\ell ab$}_{G^{\prime}}(u)$ ;
(2)

for each $a\in\Gamma$ , $\mbox{$\textrm{Val}$}_{G}(a)=\mbox{$\textrm{Val}$}_{G^{\prime}}(a)$ ;
(3)

for each $u,v\in V(G)$ , if $(u,v)\in E(G^{\prime})$ , then $\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(u)\neq\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(v)$ .

Proof 4.2.

Note that in the lemma the data graph $G^{\prime}$ differs from $G$ only in the data values on the nodes, where we require that adjacent nodes in $G^{\prime}$ have different data values.

In the following we write $\#_{G}(a)$ to denote the number of $a$ -nodes in $G$ and $K=\deg(G)$ . First, we perform some partial reassignment of the data values on some nodes. For each $a\in\Gamma$ , we pick $|\mbox{$\textrm{Val}$}_{G}(a)|$ number of $a$ -nodes in $G^{\prime}$ . Then we assign to these $a$ -nodes the data values from $\mbox{$\textrm{Val}$}_{G}(a)$ . One $a$ -node gets one data value. Such assignment can be done since obviously $\#_{G}(a)\geq|\mbox{$\textrm{Val}$}_{G}(a)|$ . If $\#_{G}(a)>|\mbox{$\textrm{Val}$}_{G}(a)|$ , then there will be some $a$ -nodes in $G^{\prime}$ that do not have data values. We write $\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(u)=\sharp$ , if $u$ does not have data value. From this step we already obtain that $\mbox{$\textrm{Val}$}_{G^{\prime}}(a)=\mbox{$\textrm{Val}$}_{G}(a)$ for each $a\in\Gamma$ .

However, reassigning the data values just like that, there may exist an edge $(u,v)$ such that $\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(u)=\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(v)$ and $\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(u),\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(v)\neq\sharp$ . We call such an edge a conflict edge. We are going to reassign the data values one more time so that there is no conflict edge.

Suppose there exists an edge $(u,v)\in E$ such that $\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(u)=\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(v)=d$ and suppose that $u$ is an $a$ -node, for some $a\in\Gamma$ . The data value $d$ can be found in at most $|\Gamma|$ nodes in $G^{\prime}$ , since for each label it is assigned to at most one node. Since $\deg(G)=K$ , those nodes (with data value $d$ ) have at most $K|\Gamma|$ neighbours. Since $|\mbox{$\textrm{Val}$}_{G}(a)|=|\mbox{$\textrm{Val}$}_{G^{\prime}}(a)|\geq K|\Gamma|+K+1$ , there are at least $K+1$ $a$ -nodes that carry a data value and none of whose neighbours gets the data value $d$ . Let $u_{1},\ldots,u_{m}$ be such $a$ -nodes, where $m\geq K+1$ . These nodes carry pairwise different data values, while $u$ has at most $K$ neighbours. Hence, there exists $i$ such that $\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(u_{i})\notin\{\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(x)\mid(u,x)\in E\}$ .

We can then swap the data values on the nodes $u$ and $u_{i}$ , and this results in one less conflict edge. We repeat this process until there is no conflict edge. Now it is straightforward that

(1)

for each $u\in V$ , $\mbox{$\ell ab$}_{G}(u)=\mbox{$\ell ab$}_{G^{\prime}}(u)$ ;
(2)

for each $a\in\Gamma$ , $\mbox{$\textrm{Val}$}_{G}(a)=\mbox{$\textrm{Val}$}_{G^{\prime}}(a)$ ;
(3)

for each $u,v\in V$ , if $(u,v)\in E$ and $\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(u),\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(v)\neq\sharp$ , then $\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(u)\neq\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(v)$ .

What is left to do now is to assign data values to the nodes $u$ , where $\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(u)=\sharp$ . For each $a$ -node $u$ , where $\mbox{$\textrm{\em va}\ell$}_{G^{\prime}}(u)=\sharp$ , we pick a data value $d\in\mbox{$\textrm{Val}$}_{G^{\prime}}(a)=\mbox{$\textrm{Val}$}_{G}(a)$ which is not assigned to any of its neighbours. Such a data value exists since $|\mbox{$\textrm{Val}$}_{G^{\prime}}(a)|\geq K|\Gamma|+K+1\geq K+1$ . Such an assignment will not violate condition (3) above, thus, we get the desired data graph $G^{\prime}$ . This completes the proof of Lemma 4.1.
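The hypothesis and the conclusion of Lemma 4.1 can be checked mechanically on small instances. The sketch below does not implement the reassignment procedure of the proof. It only verifies a candidate $G^{\prime}$ , under the assumption that a data graph is encoded as a list of labels, a list of data values and a list of edges over node indices. All names are illustrative.

```python
def max_degree(n, edges):
    """deg(G) for a graph on nodes 0..n-1 without self-loops."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return max(deg, default=0)

def val_set(labels, values, a):
    """Val_G(a): the set of data values found in a-nodes."""
    return {values[u] for u in range(len(labels)) if labels[u] == a}

def hypothesis_holds(gamma, labels, values, edges):
    """|Val_G(a)| >= deg(G)|Gamma| + deg(G) + 1 for every a in Gamma."""
    K = max_degree(len(labels), edges)
    bound = K * len(gamma) + K + 1
    return all(len(val_set(labels, values, a)) >= bound for a in gamma)

def is_valid_reassignment(gamma, labels, old, new, edges):
    """Conditions (2) and (3); condition (1) holds since labels are shared."""
    same_sets = all(val_set(labels, old, a) == val_set(labels, new, a)
                    for a in gamma)
    no_conflict = all(new[u] != new[v] for u, v in edges)
    return same_sets and no_conflict

# Hypothetical instance: four a-nodes, two disjoint edges, so deg(G) = 1.
gamma = {"a"}
labels = ["a", "a", "a", "a"]
edges = [(0, 1), (2, 3)]
old = [1, 1, 2, 3]      # the edge (0, 1) joins two nodes with data value 1
new = [1, 2, 3, 1]      # a candidate reassignment

print(hypothesis_holds(gamma, labels, old, edges))
print(is_valid_reassignment(gamma, labels, old, new, edges))
```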

4.2 An Ehrenfeucht-Fraïssé type lemma

We need the following notation. A $k$ -characteristic function on the alphabet $\Gamma$ , is a function $f:\Gamma\to\{0,1,2,\ldots,k\}$ . Let $\mbox{$\mathcal{F}$}_{\Gamma,k}$ be the set of all such $k$ -characteristic functions on $\Gamma$ . A function $f\in\mbox{$\mathcal{F}$}_{\Gamma,k}$ is a $k$ -characteristic function for a set $S\subseteq\Gamma$ , if $f(a)\in\{1,2,\ldots,k\}$ , for all $a\in S$ , and $f(a)=0$ , for all $a\notin S$ .

An ordered-data set $\mathfrak{U}$ over the alphabet $\Gamma$ consists of a finite set $U$ , in which each element $u\in U$ carries a label $\mbox{$\ell ab$}_{\mbox{\scriptsize$\mathfrak{U}$}}(u)\in\Gamma$ and a data value $\mbox{$\textrm{\em va}\ell$}_{\mbox{\scriptsize$\mathfrak{U}$}}(u)\in\mbox{$\mathbb{N}$}$ . An element $u\in U$ is called an $a$ -element, if $\mbox{$\ell ab$}_{\mbox{\scriptsize$\mathfrak{U}$}}(u)=a$ . In other words, an ordered-data set is similar to an ordered-data tree, but without the relations $E_{\downarrow}$ and $E_{\rightarrow}$ . It can be viewed as a structure $\mbox{$\mathfrak{U}$}=\langle U,\{a(\cdot)\}_{a\in\Gamma},\sim,\prec,\mbox{$\prec_{suc}$}\rangle$ , where

•

for each $a\in\Gamma$ and $u\in U$ , the relation $a(u)$ holds if $\mbox{$\ell ab$}_{\mbox{\scriptsize$\mathfrak{U}$}}(u)=a$ ,
•

$u\sim v$ holds, if $\mbox{$\textrm{\em va}\ell$}_{\mbox{\scriptsize$\mathfrak{U}$}}(u)=\mbox{$\textrm{\em va}\ell$}_{\mbox{\scriptsize$\mathfrak{U}$}}(v)$ ,
•

$u\prec v$ holds, if $\mbox{$\textrm{\em va}\ell$}_{\mbox{\scriptsize$\mathfrak{U}$}}(u)<\mbox{$\textrm{\em va}\ell$}_{\mbox{\scriptsize$\mathfrak{U}$}}(v)$ ,
•

$u\mbox{$\prec_{suc}$}\ v$ holds, if $\mbox{$\textrm{\em va}\ell$}_{\mbox{\scriptsize$\mathfrak{U}$}}(v)$ is the minimal data value found in $\mathfrak{U}$ greater than $\mbox{$\textrm{\em va}\ell$}_{\mbox{\scriptsize$\mathfrak{U}$}}(u)$ .

Let $\mathfrak{U}$ be an ordered-data set and $d_{1}<\cdots<d_{m}$ be the data values found in $\mathfrak{U}$ . The $k$ -extended representation of $\mathfrak{U}$ is the string $\mbox{$\mathcal{V}$}^{k}_{\Gamma}(\mbox{$\mathfrak{U}$})=(S_{1},f_{1})\cdots(S_{m},f_{m})\in(2^{\Gamma}\times\mbox{$\mathcal{F}$}_{\Gamma,k})^{*}$ such that $S_{1}\cdots S_{m}=\mbox{$\mathcal{V}$}_{\Gamma}(\mbox{$\mathfrak{U}$})$ and for each $i\in\{1,2,\ldots,m\}$ and for each $a\in\Gamma$ ,

1.

$f_{i}$ is a $k$ -characteristic function for the set $S_{i}$ ,
2.

if $1\leq f_{i}(a)\leq k-1$ , then there are $f_{i}(a)$ number of $a$ -elements in $U$ with data value $d_{i}$ ,
3.

if $f_{i}(a)=k$ , then there are at least $k$ number of $a$ -elements in $U$ with data value $d_{i}$ .
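The three conditions above amount to counting the $a$ -elements per data value and capping the count at $k$ . A minimal sketch, assuming an ordered-data set is given as a list of (label, data value) pairs and a characteristic function is stored as a dictionary (both encodings and the function name are illustrative):

```python
from collections import Counter

def k_extended_representation(elements, gamma, k):
    """Return [(S_1, f_1), ..., (S_m, f_m)], ordered by increasing data value."""
    counts = Counter(elements)                  # (label, value) -> multiplicity
    result = []
    for d in sorted({value for _, value in elements}):
        # f_i(a) is the number of a-elements with data value d, capped at k.
        f = {a: min(counts[(a, d)], k) for a in gamma}
        S = frozenset(a for a in gamma if f[a] > 0)
        result.append((S, f))
    return result

gamma = {"a", "b"}
set1 = [("a", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 5)]
set2 = [("a", 10)] * 4 + [("b", 10), ("b", 20)]
set3 = [("a", 1), ("b", 1), ("b", 5)]

rep1 = k_extended_representation(set1, gamma, 2)
rep2 = k_extended_representation(set2, gamma, 2)
rep3 = k_extended_representation(set3, gamma, 2)
print(rep1 == rep2, rep1 == rep3)
```

With $k=2$ the sets `set1` and `set2` have the same representation although their data values and multiplicities differ, whereas `set3` is distinguished because it has exactly one $a$ -element with the smallest data value.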

We assume that in every formula in $\mbox{$\textrm{MSO}$}(\sim,\prec,\mbox{$\prec_{suc}$})$ all the monadic second-order quantifiers precede the first-order part. That is, sentences in $\mbox{$\textrm{MSO}$}(\sim,\prec,\mbox{$\prec_{suc}$})$ are of the form: $\varphi:=Q_{1}X_{1}\cdots Q_{s}X_{s}\;\psi$ , where the $X_{i}$ ’s are monadic second-order variables, the $Q_{i}$ ’s are $\exists$ or $\forall$ and $\psi\in\mbox{$\textrm{FO}$}(\sim,\prec,\mbox{$\prec_{suc}$})$ extended with the unary predicates $X_{1},\ldots,X_{s}$ . We call the integer $s$ , the MSO quantifier rank of $\varphi$ , denoted by $\mbox{$\textsf{MSO-qr}$}(\varphi)=s$ , while we write $\mbox{$\textsf{FO-qr}$}(\varphi)$ to denote the quantifier rank of $\psi$ , that is the quantifier rank of the first-order part of $\varphi$ .

Lemma 4.3.

Let $\mbox{$\mathfrak{U}$}_{1}$ and $\mbox{$\mathfrak{U}$}_{2}$ be ordered-data sets over $\Gamma$ such that $\mbox{$\mathcal{V}$}^{k2^{s}}_{\Gamma}(\mbox{$\mathfrak{U}$}_{1})=\mbox{$\mathcal{V}$}^{k2^{s}}_{\Gamma}(\mbox{$\mathfrak{U}$}_{2})$ . For any $\mbox{$\textrm{MSO}$}(\sim,\prec,\mbox{$\prec_{suc}$})$ sentence $\varphi$ such that $\mbox{$\textsf{MSO-qr}$}(\varphi)\leq s$ and $\mbox{$\textsf{FO-qr}$}(\varphi)\leq k$ , $\mbox{$\mathfrak{U}$}_{1}\models\varphi\quad\mbox{if and only if}\quad\mbox{$\mathfrak{U}$}_{2}\models\varphi$ .

Proof 4.4.

The proof is by Ehrenfeucht-Fraïssé game for MSO of $(s+k)$ rounds, with $s$ rounds of set-moves and $k$ rounds of point-moves. We can assume that the set-moves precede the point-moves. See, for example, [Libkin (2004)], for the definition of Ehrenfeucht-Fraïssé game.

Before we go to the proof, we need a few notations. Let $\mbox{$\mathfrak{U}$}_{1}$ and $\mbox{$\mathfrak{U}$}_{2}$ be ordered-data sets over $\Gamma$ . For $(a,d)\in\Gamma\times\mbox{$\mathbb{N}$}$ , we write $P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d)=\{u\mid\mbox{$\ell ab$}_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(u)=a\ \mbox{and}\ \mbox{$\textrm{\em va}\ell$}_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(u)=d\}$ – the set of elements in $U_{1}$ with label $a$ and data value $d$ . We can define similarly $P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d)$ for $\mbox{$\mathfrak{U}$}_{2}$ .

Let $\mbox{$\mathcal{O}$}\subseteq\{\sim,\prec,\mbox{$\prec_{suc}$}\}$ . Let $u_{1},\ldots,u_{k}\in U_{1}$ and $v_{1},\ldots,v_{k}\in U_{2}$ , for some ordered-data sets $U_{1}$ and $U_{2}$ . The mapping $(u_{1},\ldots,u_{k})\mapsto(v_{1},\ldots,v_{k})$ is a partial $\mathcal{O}$ -isomorphism (with equality) from $U_{1}$ to $U_{2}$ , if it is a partial isomorphism with regards to the vocabulary $\mathcal{O}$ , and if $u_{l}=u_{l^{\prime}}$ , then $v_{l}=v_{l^{\prime}}$ .

We are going to describe Duplicator’s strategy for winning the Ehrenfeucht-Fraïssé game for MSO of $s$ rounds of set-moves, followed by $k$ rounds of point moves. We start with the set-moves.

Duplicator’s strategy for set-moves: Suppose that the game is already played for $l$ rounds, where $X_{1},\ldots,X_{l}$ and $Y_{1},\ldots,Y_{l}$ are the sets of positions chosen in $U_{1}$ and $U_{2}$ , respectively. For each $I\subseteq\{1,\ldots,l\}$ , define the following set:

	$\displaystyle P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)$	$\displaystyle=$	$\displaystyle P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d)\cap\bigcap_{i\in I}X_{i}\cap\bigcap_{j\notin I}\overline{X_{j}}$
	$\displaystyle P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)$	$\displaystyle=$	$\displaystyle P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d)\cap\bigcap_{i\in I}Y_{i}\cap\bigcap_{j\notin I}\overline{Y_{j}}$

Duplicator’s strategy is to preserve the following identity: for every $(a,d)\in\Gamma\times\mbox{$\mathbb{N}$}$ and every $I\subseteq\{1,\ldots,l\}$

•

If the cardinality $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)|<k2^{s-l}$ , then $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)|=|P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)|$ .
•

If the cardinality $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)|\geq k2^{s-l}$ , then also $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)|\geq k2^{s-l}$ .

Now suppose that on the $(l+1)^{\rm th}$ set-move, Spoiler chooses a set $X$ of positions on $U_{1}$ . Duplicator chooses a set $Y$ of positions on $U_{2}$ as follows. For each $(a,d)\in\Gamma\times\mbox{$\mathbb{N}$}$ and each $I\subseteq\{1,\ldots,l\}$ , there are four cases:

1.

$|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap X|<k2^{s-l-1}$ and $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap\overline{X}|<k2^{s-l-1}$ .
Then, $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)|<k2^{s-l}$ , which by the induction hypothesis implies $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)|=|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)|$ . Duplicator picks $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap X|$ points from $P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)$ and declares that they belong to $Y$ . The remaining points of $P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)$ are declared not to belong to $Y$ .
Obviously, $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap X|=|P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)\cap Y|<k2^{s-l-1}$ and $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap\overline{X}|=|P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)\cap\overline{Y}|<k2^{s-l-1}$ .
2.

$|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap X|<k2^{s-l-1}$ and $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap\overline{X}|\geq k2^{s-l-1}$ .
In this case, either $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)|<k2^{s-l}$ or $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)|\geq k2^{s-l}$ . In either case, by the induction hypothesis, $P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)$ contains at least $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap X|$ points. Duplicator picks $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap X|$ of them and declares that they belong to $Y$ . The remaining points of $P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)$ are declared not to belong to $Y$ .
Obviously, $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap X|=|P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)\cap Y|$ and $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)\cap\overline{Y}|\geq k2^{s-l-1}$ .
3.

$|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap X|\geq k2^{s-l-1}$ and $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap\overline{X}|<k2^{s-l-1}$ .
Similar to Case 2.
4.

$|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap X|\geq k2^{s-l-1}$ and $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)\cap\overline{X}|\geq k2^{s-l-1}$ .
Then, $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)|\geq k2^{s-l}$ , and so $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)|\geq k2^{s-l}$ . Duplicator declares half of the points in $P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)$ to belong to $Y$ and the other half not to belong to $Y$ .
Obviously, $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)\cap Y|\geq k2^{s-l-1}$ and $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)\cap\overline{Y}|\geq k2^{s-l-1}$ .

Now after $s$ rounds of set-moves, we have the following identity: for every $(a,d)\in\Gamma\times\mbox{$\mathbb{N}$}$ and every $I\subseteq\{1,\ldots,s\}$

•

If the cardinality $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)|<k$ , then $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)|=|P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)|$ .
•

If the cardinality $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(a,d;I)|\geq k$ , then also $|P_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(a,d;I)|\geq k$ .

This ends our description of Duplicator’s strategy for set-moves. Now we describe Duplicator’s strategy for point-moves.
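The four cases of Duplicator's set-move can be summarised at the level of cardinalities. The sketch below is a simplification that works on the sizes of one cell $P(a,d;I)$ only, not on actual sets of positions. The parameter `half` stands for the threshold of the next round, which is half of the current threshold, and the function names are illustrative.

```python
def duplicator_split(x_in, x_out, p2, half):
    """Sizes (|P2 ∩ Y|, |P2 minus Y|) chosen by Duplicator for one cell.

    x_in and x_out are the sizes of P1 ∩ X and P1 minus X chosen by Spoiler,
    and p2 is the size of the corresponding cell P2.
    """
    if x_in < half and x_out < half:   # Case 1: here p2 == x_in + x_out
        return x_in, p2 - x_in
    if x_in < half:                    # Case 2: copy the small side exactly
        return x_in, p2 - x_in
    if x_out < half:                   # Case 3: symmetric to Case 2
        return p2 - x_out, x_out
    return p2 // 2, p2 - p2 // 2       # Case 4: both sides are large

def invariant(n1, n2, threshold):
    """Equal sizes below the threshold, both at least the threshold otherwise."""
    return n1 == n2 if n1 < threshold else n2 >= threshold

# Hypothetical numbers: current threshold 8, next threshold 4.
half = 4
case2 = duplicator_split(3, 10, 9, half)   # |P1| = 13 >= 8 and |P2| = 9 >= 8
case4 = duplicator_split(5, 7, 8, half)    # |P1| = 12 >= 8 and |P2| = 8 >= 8
print(case2, case4)
```

In both calls the invariant with the halved threshold holds for the part inside the chosen set and for the part outside it, which is what the induction step requires.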

Duplicator’s strategy for point-moves: Suppose that the game is now on $l$ th step. Let $(u_{1},\ldots,u_{l})\mapsto(v_{1},\ldots,v_{l})$ be a partial $\{\sim,\prec,\mbox{$\prec_{suc}$}\}$ -isomorphism, where $0\leq l\leq k-1$ . Suppose Spoiler chooses an element $u_{l+1}$ from $U_{1}$ such that $\mbox{$\textrm{\em va}\ell$}_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(u_{l+1})$ is the $j^{th}$ largest data value in $\mbox{$\mathfrak{U}$}_{1}$ .

•

If $u_{l+1}=u_{l^{\prime}}$ , for some $l^{\prime}\in\{1,\ldots,l\}$ , Duplicator chooses $v_{l+1}=v_{l^{\prime}}$ from $U_{2}$ .
•

If $u_{l+1}\notin\{u_{1},\ldots,u_{l}\}$ , Duplicator chooses $v_{l+1}$ from $U_{2}$ such that $v_{l+1}\notin\{v_{1},\ldots,v_{l}\}$ and $\mbox{$\ell ab$}_{\mbox{\scriptsize$\mathfrak{U}$}_{1}}(u_{l+1})=\mbox{$\ell ab$}_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(v_{l+1})$ and $\mbox{$\textrm{\em va}\ell$}_{\mbox{\scriptsize$\mathfrak{U}$}_{2}}(v_{l+1})$ is the $j^{th}$ largest data value in $\mbox{$\mathfrak{U}$}_{2}$ . Such an element exists, as $\mbox{$\mathcal{V}$}^{k2^{s}}_{\Gamma}(\mbox{$\mathfrak{U}$}_{1})=\mbox{$\mathcal{V}$}^{k2^{s}}_{\Gamma}(\mbox{$\mathfrak{U}$}_{2})$ .

In either case $(u_{1},\ldots,u_{l+1})\mapsto(v_{1},\ldots,v_{l+1})$ is a partial $\{\sim,\prec,\mbox{$\prec_{suc}$}\}$ -isomorphism. This completes the description of Duplicator’s strategy and hence, our proof.

Now, we define the $k$ -extended representation of an ordered-data tree $t$ over the alphabet $\Gamma$ , denoted by $\mbox{$\mathcal{V}$}^{k}_{\Gamma}(t)$ , as the $k$ -extended representation of the ordered-data set $\mathfrak{U}$ obtained by ignoring the relations $E_{\downarrow}$ and $E_{\rightarrow}$ in $t$ . The following corollary is an immediate consequence of Lemma 4.3 above.

Corollary 4.5.

Let $t_{1}$ and $t_{2}$ be ordered-data trees over $\Gamma$ such that $\mbox{$\mathcal{V}$}^{k2^{s}}_{\Gamma}(t_{1})=\mbox{$\mathcal{V}$}^{k2^{s}}_{\Gamma}(t_{2})$ . For any $\mbox{$\textrm{MSO}$}(\sim,\prec,\mbox{$\prec_{suc}$})$ sentence $\varphi$ such that $\mbox{$\textsf{MSO-qr}$}(\varphi)\leq s$ and $\mbox{$\textsf{FO-qr}$}(\varphi)\leq k$ , $t_{1}\models\varphi\quad\mbox{if and only if}\quad t_{2}\models\varphi$ .

Proof 4.6.

Since the predicates $E_{\downarrow}$ and $E_{\rightarrow}$ are not used in the formula $\varphi\in\mbox{$\textrm{MSO}$}(\sim,\prec,\mbox{$\prec_{suc}$})$ , we can ignore them in $t_{1}$ and $t_{2}$ and view both $t_{1}$ and $t_{2}$ as ordered-data sets. Our corollary follows immediately from Lemma 4.3.

5 Automata for Ordered-data Tree

In this section we are going to introduce an automata model for ordered-data trees and study its expressive power.

Definition 5.1.

An ordered-data tree automaton, in short ODTA, over the alphabet $\Sigma$ is a triplet $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ , where $\mathcal{T}$ is a letter-to-letter non-deterministic transducer from $\Sigma\times\{\top,\bot,*\}^{3}$ to the output alphabet $\Gamma$ ; $\mathcal{M}$ is an automaton on strings over the alphabet $2^{\Gamma}$ ; and $\Gamma_{0}\subseteq\Gamma$ .

An ordered-data tree $t$ is accepted by $\mathcal{S}$ , denoted by $t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ , if there exists an ordered-data tree $t^{\prime}$ over $\Gamma$ such that

•

on input $\mbox{$\textsf{Profile}$}(t)$ , the transducer $\mathcal{T}$ outputs $t^{\prime}$ ;
•

the automaton $\mathcal{M}$ accepts the string $\mbox{$\mathcal{V}$}_{\Gamma}(t^{\prime})$ ; and
•

for every $a\in\Gamma_{0}$ , all the $a$ -nodes in $t^{\prime}$ have different data values.

We describe a few examples of ODTA that accept the languages described in Examples 3.3, 3.4, 3.5 and 3.6.

Example 5.2.

An ODTA $\mbox{$\mathcal{S}$}^{a}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ that accepts the language $\mbox{$\mathcal{L}$}_{data}^{a}$ in Example 3.3 can be defined as follows. The output alphabet of the transducer $\mathcal{T}$ is $\Gamma=\{\alpha,\beta,\gamma\}$ . On an input tree $t$ , the transducer $\mathcal{T}$ marks the nodes in $t$ as follows. Exactly one node is marked with $\alpha$ and exactly one node is marked with $\beta$ ; both are $a$ -nodes, and the $\alpha$ -node is an ancestor of the $\beta$ -node. All the other nodes are marked with $\gamma$ . The automaton $\mathcal{M}$ accepts all the strings in which the position labeled with $S\ni\beta$ is less than or equal to the position labeled with $S^{\prime}\ni\alpha$ . (These two positions can be equal, which means $S=S^{\prime}$ .) Finally, $\Gamma_{0}=\emptyset$ .

Example 5.3.

An ODTA $\mbox{$\mathcal{S}$}^{S,m}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ that accepts the language $\mbox{$\mathcal{L}$}_{data}^{S,m}$ in Example 3.4 can be defined as follows. The transducer $\mathcal{T}$ is an identity transducer. The automaton $\mathcal{M}$ accepts all the strings in which the symbol $S$ appears exactly $m$ times, and $\Gamma_{0}=\emptyset$ .

Example 5.4.

An ODTA $\mbox{$\mathcal{S}$}^{S,\pmod{m}}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ that accepts the language $\mbox{$\mathcal{L}$}_{data}^{S,\pmod{m}}$ in Example 3.5 can be defined as follows. The transducer $\mathcal{T}$ is an identity transducer. The automaton $\mathcal{M}$ accepts a string in which the number of appearances of the symbol $S$ is a multiple of $m$ , and $\Gamma_{0}=\emptyset$ .
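The string automata of Examples 5.3 and 5.4 only count the occurrences of the symbol $S$ . A minimal sketch, assuming a string over $2^{\Gamma}$ is encoded as a list of frozensets (the function names are illustrative, and the sample word is the string $\mbox{$\mathcal{V}$}_{\Gamma}(t)$ from the example in Section 3):

```python
def accepts_exactly(word, S, m):
    """Example 5.3: the symbol S appears exactly m times."""
    return sum(1 for symbol in word if symbol == frozenset(S)) == m

def accepts_multiple(word, S, m):
    """Example 5.4: simulate a counter modulo m with states 0, ..., m-1."""
    state = 0
    for symbol in word:
        if symbol == frozenset(S):
            state = (state + 1) % m
    return state == 0

word = [frozenset("bc"), frozenset("abc"), frozenset("ab"),
        frozenset("ac"), frozenset("ab")]
print(accepts_exactly(word, {"a", "b"}, 2))
print(accepts_multiple(word, {"a", "b"}, 2))
print(accepts_multiple(word, {"a", "b"}, 3))
```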

Example 5.5.

An ODTA $\mbox{$\mathcal{S}$}^{a\ast}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ that accepts the language $\mbox{$\mathcal{L}$}_{data}^{a\ast}$ in Example 3.6 can be defined as follows. The output alphabet of the transducer $\mathcal{T}$ is $\Gamma=\{\alpha,\beta\}$ . The transducer $\mathcal{T}$ marks the nodes as follows. A node is marked with $\alpha$ if and only if it is an $a$ -node and it has different data value from the one of its parent. All the other nodes are marked with $\beta$ . The automaton $\mathcal{M}$ accepts a string $v$ if and only if the last symbol in $v$ contains the symbol $\alpha$ , while $\Gamma_{0}=\{\alpha\}$ .
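The last two acceptance conditions of Definition 5.1 depend only on the labels and data values of the transducer's output $t^{\prime}$ . The sketch below checks them for the automaton of this example, assuming that $t^{\prime}$ is given as a hypothetical list of (label, data value) pairs with the labels written as the strings "alpha" and "beta". The transducer itself is not modelled.

```python
def string_of(t_prime):
    """V_Gamma(t'): the label sets, listed by increasing data value."""
    by_value = {}
    for label, value in t_prime:
        by_value.setdefault(value, set()).add(label)
    return [frozenset(by_value[d]) for d in sorted(by_value)]

def accepted_after_transduction(t_prime, accepts_string, gamma0):
    """Check that M accepts V_Gamma(t') and that Gamma_0 labels are distinct."""
    for a in gamma0:
        a_values = [value for label, value in t_prime if label == a]
        if len(a_values) != len(set(a_values)):
            return False        # two a-nodes share a data value
    return accepts_string(string_of(t_prime))

def last_symbol_has_alpha(word):
    """The automaton M of this example."""
    return bool(word) and "alpha" in word[-1]

out1 = [("beta", 1), ("alpha", 5), ("alpha", 3), ("beta", 3)]
out2 = [("beta", 9), ("alpha", 3)]
out3 = [("beta", 1), ("alpha", 3), ("alpha", 3)]
for out in (out1, out2, out3):
    print(accepted_after_transduction(out, last_symbol_has_alpha, {"alpha"}))
```

Here `out2` is rejected by $\mathcal{M}$ because the largest data value is found only in a $\beta$ -node, and `out3` is rejected by the condition on $\Gamma_{0}$ .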

The following proposition states that ODTA languages are closed under union and intersection, but not under negation. We would like to remark that being not closed under negation is rather common for decidable models for data trees. Often models that are closed under negation have undecidable non-emptiness/satisfiability problem.

Proposition 5.6.

The class of languages accepted by ODTA is closed under union and intersection, but not under negation.

Proof 5.7.

For closure under union and intersection, let $\mbox{$\mathcal{S}$}_{1}=\langle\mbox{$\mathcal{T}$}_{1},\mbox{$\mathcal{M}$}_{1},\Gamma_{0}^{1}\rangle$ and $\mbox{$\mathcal{S}$}_{2}=\langle\mbox{$\mathcal{T}$}_{2},\mbox{$\mathcal{M}$}_{2},\Gamma_{0}^{2}\rangle$ be ODTA. The union $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}_{1})\cup\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}_{2})$ is accepted by an ODTA which non-deterministically chooses to simulate either $\mbox{$\mathcal{S}$}_{1}$ or $\mbox{$\mathcal{S}$}_{2}$ on the input ordered-data tree. The ODTA for the intersection $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}_{1})\cap\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}_{2})$ can be obtained by the standard cross product between $\mbox{$\mathcal{S}$}_{1}$ and $\mbox{$\mathcal{S}$}_{2}$ .

We now prove that ODTA languages are not closed under negation. Consider the negation of the language in Example 3.3, whose equivalent ODTA $\mbox{$\mathcal{S}$}^{a}$ is presented in Example 5.2. Every tree $t\notin\mbox{$\mathcal{L}$}(\mbox{$\mathcal{S}$}^{a})$ has the following property. If $u,v$ are two $a$ -nodes in $t$ and $u$ is an ancestor of $v$ , then $u\prec v$ .

Now suppose to the contrary that there exists an ODTA $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ that accepts the negation of $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{S}$}^{a})$ . Let $\Gamma$ be the output alphabet of $\mathcal{T}$ . Let $t\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{S}$})$ be a data tree with $|\Gamma|+1$ nodes, where each node is labelled with $a$ and has at most one child. This implies that the data values in $t$ are all different and appear in increasing order from the root node to the leaf node.

Let $t^{\prime}\in\mbox{$\mathcal{T}$}(t)$ . Since $t$ has $|\Gamma|+1$ nodes, and hence so does $t^{\prime}$ , there are two nodes $u$ and $v$ in $t^{\prime}$ with the same label. Let $t^{\prime\prime}$ be a data tree obtained from $t$ by swapping the data values between $u$ and $v$ , so $t^{\prime\prime}\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{S}$}^{a})$ . Since $\mbox{$\textsf{Profile}$}(t)=\mbox{$\textsf{Profile}$}(t^{\prime\prime})$ , on input $\mbox{$\textsf{Profile}$}(t^{\prime\prime})$ , the transducer $\mathcal{T}$ can also output the labels of $t^{\prime}$ . Because $u$ and $v$ carry the same label in $t^{\prime}$ , swapping their data values changes neither the string $\mbox{$\mathcal{V}$}_{\Gamma}(t^{\prime})$ nor the data values found in the nodes of each label, so the conditions on $\mathcal{M}$ and $\Gamma_{0}$ still hold. This means that $t^{\prime\prime}\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{S}$})$ . This contradicts the fact that $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{S}$})$ is the complement of $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{S}$}^{a})$ . This completes the proof of Proposition 5.6.

We remark that, as discussed in Section 7, extending ODTA with the complements of languages of the form in Example 5.2 immediately yields undecidability.

Theorems 5.8, 5.9 and 5.10 are the main results in this paper. Theorem 5.8 below provides the ODTA characterisation of the logic $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ and its proof can be found in Subsection 5.1.

Theorem 5.8.

A language $\mbox{$\mathcal{L}$}_{data}$ is expressible with an $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ formula if and only if it is accepted by an ODTA $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ , where $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})$ is a commutative language. Moreover, the translation from $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ formulas to ODTA takes triple exponential time, while the translation from ODTA to $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ formulas takes exponential time.

Theorem 5.9 below provides the logical characterisation of ODTA. The proof can be found in Subsection 5.2.

Theorem 5.9.

A language $\mbox{$\mathcal{L}$}_{data}$ is accepted by an ODTA if and only if it is expressible with a formula of the form: $\exists X_{1}\cdots\exists X_{m}\ \varphi\wedge\psi$ , where $\varphi$ is a formula from $\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ , and $\psi$ from $\mbox{$\textrm{FO}$}(\sim,\prec,\mbox{$\prec_{suc}$})$ , both extended with the unary predicates $X_{1},\ldots,X_{m}$ and $a(\cdot)$ . Moreover, the translation from ODTA to formulas takes polynomial time, while the translation from formulas to ODTA is effective but non-elementary.

Finally, Theorem 5.10 states that the non-emptiness problem for ODTA is decidable. The proof can be found in Subsection 5.3.

Theorem 5.10.

The non-emptiness problem for ODTA is decidable in 3-NExpTime.

The best lower bound known to date is NP-hardness. See [Fan and Libkin (2002), David et al. (2012)].

5.1 Proof of Theorem 5.8

In the proof we assume that the ordered-data trees are over the finite alphabet $\Sigma$ . We will need the following proposition which states that every $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ formula can be syntactically rewritten to a normal form for $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ .

Proposition 5.11.

[Bojanczyk et al. (2009), Proposition 3.8] Every formula $\psi\in\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ can be rewritten into a normal form of exponential size of the form: $\exists Y_{1}\cdots\exists Y_{n}\ \varphi$ , where $\varphi$ is a conjunction of formulae of the form:

(N1)

$\forall x\forall y\ (\alpha(x)\wedge\delta(x,y)\wedge\xi(x,y)\to\beta(y))$ ,
(N2)

$\forall x\ (\mbox{{\rm root}}(x)\to\alpha(x))$ ,
(N3)

$\forall x\ (\mbox{{\rm first-sibling}}(x)\to\alpha(x))$ ,
(N4)

$\forall x\ (\mbox{{\rm last-sibling}}(x)\to\alpha(x))$ ,
(N5)

$\forall x\ (\mbox{{\rm leaf}}(x)\to\alpha(x))$ ,
(N6)

$\forall x\forall y\ (\alpha(x)\wedge\alpha(y)\wedge x\sim y\to x=y)$ ,
(N7)

$\forall x\exists y\ (\alpha(x)\to\beta(y)\wedge x\sim y)$ ,

where $\alpha(x),\beta(x)$ are conjunctions of some unary predicates and their negations, $\delta(x,y)$ is either $\mbox{$E_{\downarrow}$}(x,y)$ or $\mbox{$E_{\rightarrow}$}(x,y)$ , and $\xi(x,y)$ is either $x\sim y$ or $x\nsim y$ .

We should remark that if $\varphi$ is a conjunction of formulae of the forms (N1)–(N5) above, then there exists a tree automaton $\mathcal{A}$ over the alphabet $\Sigma\times\{\top,\bot,*\}^{3}$ such that for every ordered-data tree $t$ ,

t\models\varphi\quad\mbox{if and only if}\quad\mbox{$\textsf{Profile}$}(t)\ \mbox{is accepted by}\ \mbox{$\mathcal{A}$}.

Such a construction is straightforward using classical automata theory. See, for example, [Thomas (1997)]. We divide the proof of Theorem 5.8 into Lemmas 5.12 and 5.14 below.

Lemma 5.12.

For every formula $\Psi\in\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ , there exists an ODTA $\mbox{$\mathcal{S}$}_{\Psi}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ such that $\mbox{$\mathcal{L}$}_{data}(\Psi)=\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}_{\Psi})$ and $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})$ is commutative. Moreover, the construction of $\mbox{$\mathcal{S}$}_{\Psi}$ is effective and takes triple exponential time in the size of the formula $\Psi$ .

Proof 5.13.

Applying Proposition 5.11, we can rewrite the formula $\Psi$ into its normal form $\exists Y_{1}\cdots\exists Y_{n}\Psi^{\prime}$ . Furthermore, we can rewrite it into the form $\exists X_{1}\cdots\exists X_{m}\ \varphi$ , where $m=2^{n}$ and $\varphi$ is a conjunction of formulas of the form:

(N0^′)

$X_{1},\ldots,X_{m}$ are pairwise disjoint, and $\bigwedge_{a\in\Sigma}\forall x(a(x)\to\alpha^{\prime}(x))$ .
(N1^′)

$\forall x\forall y\ (\alpha^{\prime}(x)\wedge\delta(x,y)\wedge\xi(x,y)\to\beta^{\prime}(y))$ ,
(N2^′)

$\forall x\ (\mbox{{\rm root}}(x)\to\alpha^{\prime}(x))$ ,
(N3^′)

$\forall x\ (\mbox{{\rm first-sibling}}(x)\to\alpha^{\prime}(x))$ ,
(N4^′)

$\forall x\ (\mbox{{\rm last-sibling}}(x)\to\alpha^{\prime}(x))$ ,
(N5^′)

$\forall x\ (\mbox{{\rm leaf}}(x)\to\alpha^{\prime}(x))$ ,
(N6^′)

$\forall x\forall y\ (\alpha^{\prime}(x)\wedge\alpha^{\prime}(y)\wedge x\sim y\to x=y)$ ,
(N7^′)

$\forall x\exists y\ (\alpha^{\prime}(x)\to\beta^{\prime}(y)\wedge x\sim y)$ ,

where $\alpha^{\prime}(x),\beta^{\prime}(x)$ are disjunctions of some of the $X_{i}$ ’s, and $\delta(x,y)$ and $\xi(x,y)$ are the same as above. Intuitively, the unary predicates $X_{1},\ldots,X_{m}$ correspond to subsets of $\{Y_{1},\ldots,Y_{n}\}$ .

The ODTA $\mbox{$\mathcal{S}$}_{\Psi}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ is defined as follows.

•

The transducer $\mathcal{T}$ checks whether the formulas (N0^′)–(N5^′) are satisfied, with the output alphabet $\Gamma=\{X_{1},\ldots,X_{m}\}$ , where a node is labeled with $X_{i}$ if and only if it belongs to $X_{i}$ .
The construction of such a transducer is straightforward and thus omitted. See, for example, [Thomas (1997)].
•

$\Gamma_{0}$ consists of those $X_{i}$ for which there exist a set $A\subseteq\{X_{1},\ldots,X_{m}\}$ with $X_{i}\in A$ and a formula of the form (N6^′)

$\forall x\forall y\ (\bigvee_{X_{j}\in A}X_{j}(x)\wedge\bigvee_{X_{j}\in A}X_{j}(y)\wedge x\sim y\to x=y),$

in $\varphi$ .

•

The automaton $\mathcal{M}$ accepts the language $(2^{\{X_{1},\ldots,X_{m}\}}-(\mbox{$\mathcal{P}$}_{1}\cup\mbox{$\mathcal{P}$}_{2}))^{\ast}$ , where

	$\displaystyle\mbox{$\mathcal{P}$}_{1}$	$\displaystyle:=$	$\displaystyle\left\{\,S\ \middle|\ \begin{array}[]{l}\mbox{there exists a formula}\\ \qquad\forall x\exists y\ (\bigvee_{X\in A}X(x)\to\bigvee_{X\in B}X(y)\wedge x\sim y)\\ \mbox{in}\ \varphi\ \mbox{such that}\ S\cap A\neq\emptyset\ \mbox{but}\ S\cap B=\emptyset\end{array}\right\}$
	$\displaystyle\mbox{$\mathcal{P}$}_{2}$	$\displaystyle:=$	$\displaystyle\left\{\,S\ \middle|\ \begin{array}[]{l}\mbox{there exists a formula}\\ \qquad\forall x\forall y\ (\bigvee_{X\in A}X(x)\wedge\bigvee_{X\in A}X(y)\wedge x\sim y\to x=y)\\ \mbox{in}\ \varphi\ \mbox{such that}\ |S\cap A|\geq 2\end{array}\right\}$

That $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})$ is commutative is trivial. That $\mathcal{S}$ accepts precisely the language $\mbox{$\mathcal{L}$}_{data}(\Psi)$ can be deduced from the following.

•

That $\mathcal{T}$ ensures that formulas N0^′–N5^′ are satisfied.
•

That $\Gamma_{0}$ contains precisely the symbols $X_{i}$ ’s where all $X_{i}$ -nodes are supposed to contain different data values.
•

That for every ordered-data tree $t$ ,

$t\models\forall x\exists y\ (\bigvee_{X\in A}X(x)\to\bigvee_{X\in B}X(y)\wedge x\sim y)$

if and only if $[S]_{t}=\emptyset\ \mbox{for all}\ S\ \mbox{such that}\ S\cap A\neq\emptyset\ \mbox{but}\ S\cap B=\emptyset$ .
•
That for every ordered-data tree $t$ ,

$t\models\forall x\forall y\ (\bigvee_{X\in A}X(x)\wedge\bigvee_{X\in A}X(y)\wedge x\sim y\to x=y)$

if and only if
- –
  
  $[S]_{t}=\emptyset$ for all $S$ such that $|S\cap A|\geq 2$ ; and
- –
  
  for all $X\in A$ , $t\models\forall x\forall y\ (X(x)\wedge X(y)\wedge x\sim y\to x=y)$ , which is captured by the condition imposed by $\Gamma_{0}$ .

The analysis of the complexity is as follows. The first step, applying Proposition 5.11, induces an exponential blow-up in the size of the input. The second step, constructing the formula $\exists X_{1}\cdots\exists X_{m}\ \varphi$ , takes exponential time in $n$ , and $n$ is exponential in the size of the input. The construction of $\mathcal{T}$ takes polynomial time in the size of $\varphi$ , since (N0^′)–(N5^′) are already in the “automata transition” format. The construction of $\Gamma_{0}$ takes polynomial time in $m$ , while the construction of $\mathcal{M}$ induces another exponential blow-up in $m$ . Altogether, constructing $\mbox{$\mathcal{S}$}_{\Psi}$ takes triple exponential time in the size of $\Psi$ . This concludes the proof of Lemma 5.12.
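The alphabet of $\mathcal{M}$ used in the proof above, namely the letters $S$ outside $\mbox{$\mathcal{P}$}_{1}\cup\mbox{$\mathcal{P}$}_{2}$ , can be enumerated mechanically. The sketch below is illustrative only: the function names and the input encoding (a list of pairs $(A,B)$ for formulas of the form (N7^′) and a list of sets $A$ for formulas of the form (N6^′)) are our assumptions, and only non-empty letters are enumerated, since every data value occurring in a tree carries at least one symbol.

```python
from itertools import combinations

# Illustrative sketch; names and input encoding are hypothetical.

def nonempty_subsets(symbols):
    """Yield all non-empty subsets of `symbols` as frozensets."""
    ordered = sorted(symbols)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


def allowed_letters(symbols, n7_formulas, n6_formulas):
    """Letters S that belong to neither P1 nor P2.

    n7_formulas: list of pairs (A, B), one per formula of the form (N7').
    n6_formulas: list of sets A, one per formula of the form (N6').
    """
    allowed = []
    for s in nonempty_subsets(symbols):
        # P1: S meets A but misses B for some (N7') formula.
        in_p1 = any((s & a) and not (s & b) for a, b in n7_formulas)
        # P2: S contains at least two symbols of A for some (N6') formula.
        in_p2 = any(len(s & a) >= 2 for a in n6_formulas)
        if not (in_p1 or in_p2):
            allowed.append(s)
    return allowed


letters = allowed_letters(
    {"X1", "X2", "X3"},
    n7_formulas=[({"X1"}, {"X2"})],
    n6_formulas=[{"X2", "X3"}],
)
```

The number of candidate letters is $2^{m}-1$ , which matches the exponential blow-up in $m$ attributed to the construction of $\mathcal{M}$ in the complexity analysis.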

For the complexity analysis in Lemma 5.14, we assume that a commutative automaton $\mathcal{M}$ is given as a set of vectors (in binary format) indicating its Parikh images. That is, $\mathcal{M}$ is given as a set $I=\{(\bar{u}_{1},\bar{v}_{1,1},\ldots,\bar{v}_{1,\ell}),\ldots,(\bar{u}_{n},\bar{v}_{n,1},\ldots,\bar{v}_{n,\ell})\}$ , where

\bigcup_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})\in I}\mbox{$\mathcal{L}$}(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})=\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$}),

and each number in the vectors in $I$ is written in the standard binary form.

Lemma 5.14.

For every ODTA $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ , where $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})$ is a commutative language, there exists a formula $\varphi\in\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ such that $\mbox{$\mathcal{L}$}_{data}(\varphi)=\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ . Moreover, the construction of $\varphi$ takes exponential time in the size of $\mathcal{S}$ .

Proof 5.15.

Let $Q_{\mbox{\scriptsize$\mathcal{T}$}}=\{q_{0},\ldots,q_{m}\}$ and $\Gamma=\{\alpha_{1},\ldots,\alpha_{k}\}$ be the set of states and the output alphabet of the transducer $\mathcal{T}$ , respectively. Let $\ell=2^{|\Gamma|}-1$ .

By Theorem 2.1, $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})$ is a finite union of periodic languages. Let $I$ be the finite set of $(\ell+1)$ -tuples of $\mbox{$\mathbb{N}$}^{\ell}$ -vectors such that

\bigcup_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})\in I}\mbox{$\mathcal{L}$}(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})=\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$}).

Let $I=\{(\bar{u}_{1},\bar{v}_{1,1},\ldots,\bar{v}_{1,\ell}),\ldots,(\bar{u}_{n},\bar{v}_{n,1},\ldots,\bar{v}_{n,\ell})\}$ and $S_{1},\ldots,S_{\ell}$ be the enumeration of non-empty subsets of $\Gamma$ . First, for $(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})\in I$ , we construct an $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ formula $\Psi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})}$ where

\displaystyle t\models\Psi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})}

if and only if

\displaystyle\left[\begin{array}[]{l}\mbox{there exists}\ h_{1},\ldots,h_{\ell}\geq 0\ \mbox{such that}\\ (|[S_{1}]_{t}|,\ldots,|[S_{\ell}]_{t}|)=\bar{u}+h_{1}\bar{v}_{1}+\cdots+h_{\ell}\bar{v}_{\ell}\end{array}\right]

We denote by $v_{i}$ the non-zero entry of $\bar{v}_{i}$ . This formula $\Psi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})}$ is as follows.

	$\displaystyle\exists W_{1,1}\cdots W_{1,u_{1}}\ \cdots\cdots\ \exists W_{\ell,1}\cdots W_{\ell,u_{\ell}}$
	$\displaystyle\quad\quad\exists X_{1,0}\cdots X_{1,v_{1}-1}\ \exists Y_{1,0}\cdots Y_{1,v_{1}-1}\ Z_{1}$
	$\displaystyle\quad\quad\quad\quad\quad\ddots$
	$\displaystyle\quad\quad\quad\quad\quad\exists X_{\ell,0}\cdots X_{\ell,v_{\ell}-1}\ \exists Y_{\ell,0}\cdots Y_{\ell,v_{\ell}-1}\ Z_{\ell}$
	$\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\bigwedge_{i}W_{i,1},\ldots,W_{i,u_{i}}\cap Z_{i}=\emptyset$
	$\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\wedge\ \bigwedge_{i}\varphi_{\|[S_{i}]\|=u_{i}}(W_{i,1},\ldots,W_{i,u_{i}})$
	$\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\wedge\ \bigwedge_{i}\varphi_{\|[S_{i}]\|\equiv v_{i}\pmod{m}}(X_{i,0},\ldots,X_{i,v_{i}-1},Y_{i,0},\ldots,Y_{i,v_{i}-1},Z_{i})$

where $\varphi_{S_{i},u_{i}}(W_{i,1},\ldots,W_{i,u_{i}})$ and $\varphi_{S_{i},\pmod{v_{i}}}(X_{i,0},\ldots,X_{i,v_{i}-1},Y_{i,0},\ldots,Y_{i,v_{i}-1},Z_{i})$ are the formulas for the languages $\mbox{$\mathcal{L}$}_{data}^{S_{i},u_{i}}$ and $\mbox{$\mathcal{L}$}_{data}^{S_{i},\pmod{v_{i}}}$ in Examples 3.4 and 3.5, respectively.

The desired formula $\varphi$ is:

	$\displaystyle\exists X_{q_{0}}\cdots\exists X_{q_{m}}\ \exists X_{\alpha_{1}}\cdots\exists X_{\alpha_{k}}\ \exists\overline{X}_{(\bar{u_{1}},\bar{v}_{1,1},\ldots,\bar{v}_{1,\ell})}\ \cdots\ \exists\overline{X}_{(\bar{u_{n}},\bar{v}_{n,1},\ldots,\bar{v}_{n,\ell})}$
	$\displaystyle\qquad\qquad\qquad\qquad\qquad\varphi_{\Gamma_{0}}\wedge\varphi_{\mbox{\scriptsize$\mathcal{T}$}}\wedge\bigvee_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})\in I}\varphi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})}$

where

•

the formula $\varphi_{\Gamma_{0}}$ expresses the fact that the data values found under nodes labeled with a symbol from $\Gamma_{0}$ are all different;
•

the unary predicates $X_{q_{0}},\ldots,X_{q_{m}},X_{\alpha_{1}},\ldots,X_{\alpha_{k}}$ are supposed to represent the states and the output alphabet of $\mathcal{T}$ , respectively;
•

the formula $\varphi_{\mbox{\scriptsize$\mathcal{T}$}}$ expresses the behaviour of the transducer $\mathcal{T}$ – that is, a tree $t$ satisfies $\varphi_{\mbox{\scriptsize$\mathcal{T}$}}$ if there exists an accepting run of $\mathcal{T}$ on $t$ such that, for every node $u\in\mbox{$\textsf{Dom}$}(t)$ , $X_{q_{i}}(u)$ and $X_{\alpha_{j}}(u)$ hold whenever the run labels $u$ with $q_{i}$ and outputs $\alpha_{j}$ at $u$ ;
•

the predicates $\overline{X}_{(\bar{u_{i}},\bar{v}_{i,1},\ldots,\bar{v}_{i,\ell})}$ ’s and the formulas $\varphi_{(\bar{u_{i}},\bar{v}_{i,1},\ldots,\bar{v}_{i,\ell})}$ ’s are as in the formula $\Psi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})}$ defined above.

The analysis of the complexity is as follows. The sizes of the formulas $\varphi_{S_{i},u_{i}}$ and $\varphi_{S_{i},\pmod{v_{i}}}$ are exponential in the size of $S_{i},u_{i},v_{i}$ . Hence, the construction of $\Psi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})}$ takes exponential time in the size of $(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})$ . The construction of $\varphi_{\Gamma_{0}}$ and $\varphi_{\mbox{\scriptsize$\mathcal{T}$}}$ takes polynomial time in the size of $\Gamma_{0}$ and $\mathcal{T}$ , respectively. Hence, the total time to construct the formula $\varphi$ is exponential in the size of $\mathcal{S}$ . This completes the proof of the lemma.
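The counting condition that $\Psi_{(\bar{u},\bar{v}_{1},\ldots,\bar{v}_{\ell})}$ expresses in the proof above can be checked directly on the vector of cardinalities $(|[S_{1}]_{t}|,\ldots,|[S_{\ell}]_{t}|)$ . The sketch below is only an illustration under an assumption of ours: the single non-zero entry $v_{i}$ of $\bar{v}_{i}$ sits at coordinate $i$ , so the condition decomposes coordinate by coordinate. The function names are hypothetical.

```python
# Illustrative sketch; names are hypothetical.
# Assumption: the i-th period vector has its only non-zero entry v_i at
# coordinate i (a period of 0 means that coordinate cannot grow).

def in_linear_set(counts, base, periods):
    """Is counts = base + h_1 * v_1 + ... + h_l * v_l for some h_i >= 0?"""
    for count, u_i, v_i in zip(counts, base, periods):
        if count < u_i:
            return False
        if v_i == 0:
            if count != u_i:
                return False
        elif (count - u_i) % v_i != 0:
            return False
    return True


def in_union(counts, linear_sets):
    """Membership in a finite union of such sets, given as (base, periods)."""
    return any(in_linear_set(counts, u, p) for u, p in linear_sets)
```

For instance, with base $(1,2)$ and periods $(2,3)$ the vector $(3,5)$ is accepted with $h_{1}=h_{2}=1$ , while $(2,5)$ is rejected.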

5.2 Proof of Theorem 5.9

In this subsection for every ordered-data tree $t$ , we assume that the data values in $t$ are precisely the natural numbers in the range $[1..m]$ , for a positive integer $m\geq 1$ . That is, if $d_{1}<d_{2}<\cdots<d_{m}$ are the data values in $t$ , then $d_{1}=1$ , $d_{2}=2$ , $\ldots$ , $d_{m}=m$ .

We start with the following lemma.

Lemma 5.16.

Let $\psi\in\mbox{$\textrm{FO}$}(\sim,\prec)$ be of quantifier rank $k$ . Let $\Gamma=\{a_{1},\ldots,a_{\ell}\}$ be the set of unary predicates used in $\psi$ . There exists a finite state automaton $C$ over the alphabet $\Gamma\cup(2^{\Gamma}\times\mbox{$\mathcal{F}$}_{\Gamma,k})$ such that the following holds.

•

The automaton $C$ accepts words of the form

\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m}),

where each $S_{i}=\{a\mid f_{i}(a)\geq 1\}$ .

•

For every ordered-data tree $t\models\psi$ , if $\mbox{$\mathcal{V}$}^{(k)}(t)=(S_{1},f_{1}),\ldots,(S_{m},f_{m})$ , then there exists a word in $\mbox{$\mathcal{L}$}(C)$ of the form

\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m})

•

For every word $w\in\mbox{$\mathcal{L}$}(C)$ , if $w$ is

\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m})

then there exists a tree $t\models\psi$ , where $\mbox{$\mathcal{V}$}^{(k)}(t)=(S_{1},f_{1})\cdots(S_{m},f_{m})$ .

Proof 5.17.

Let $\psi\in\mbox{$\textrm{FO}$}(\sim,\prec)$ be of quantifier rank $k$ . Let $\Gamma=\{a_{1},\ldots,a_{\ell}\}$ be the set of unary predicates used in $\psi$ . We define the sentence $\overline{\psi}\in\mbox{$\textrm{FO}$}(<)$ (that is, over strings) inductively from $\psi$ as follows.

•

If $\psi$ is $Qx\;\xi$ , where $Q\in\{\forall,\exists\}$ , then $\overline{\psi}$ is

$Qx\;\bigvee_{a\in\Gamma}a(x)\to\overline{\xi}.$
•

If $\psi$ is $x=y$ , then $\overline{\psi}$ is also $x=y$ .
•

If $\psi$ is $x\sim y$ , then $\overline{\psi}$ states “there is no position in between $x$ and $y$ labeled with any symbol from $2^{\Gamma}\times\mbox{$\mathcal{F}$}_{\Gamma,k}$ .”
•

If $\psi$ is $x\prec y$ , then $\overline{\psi}$ states “there is at least one position in between $x$ and $y$ labeled with a symbol from $2^{\Gamma}\times\mbox{$\mathcal{F}$}_{\Gamma,k}$ .”

We have the following claim.

Claim 1.

(1)

For every ordered-data tree $t\models\psi$ , if $\mbox{$\mathcal{V}$}^{(k)}(t)=(S_{1},f_{1}),\ldots,(S_{m},f_{m})$ , then there exists a word $w\models\overline{\psi}$ of the form

\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m})

(2)

For every word $w\models\overline{\psi}$ , if $w$ is

\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m})

then there exists a tree $t\models\psi$ , where $\mbox{$\mathcal{V}$}^{(k)}(t)=(S_{1},f_{1})\cdots(S_{m},f_{m})$ .

Proof 5.18.

We first prove item (1). Let $t$ be an ordered-data tree over the alphabet $\Gamma$ and let $\mbox{$\mathcal{V}$}^{(k)}(t)=(S_{1},f_{1})\cdots(S_{m},f_{m})$ be the $k$ -extended string representation of the data values in $t$ . Let $t^{\prime}$ be the following data string

\overbrace{{a_{1}\choose 1}\cdots{a_{1}\choose 1}}^{f_{1}(a_{1})}\cdots\overbrace{{a_{\ell}\choose 1}\cdots{a_{\ell}\choose 1}}^{f_{1}(a_{\ell})}\cdots\cdots\cdots\cdots\overbrace{{a_{1}\choose m}\cdots{a_{1}\choose m}}^{f_{m}(a_{1})}\cdots\overbrace{{a_{\ell}\choose m}\cdots{a_{\ell}\choose m}}^{f_{m}(a_{\ell})}

When $t^{\prime}$ is viewed as a data tree (that is, a data string is a data tree in which each node has at most one child), $\mbox{$\mathcal{V}$}^{(k)}_{\Gamma}(t)=\mbox{$\mathcal{V}$}^{(k)}(t^{\prime})$ . Hence, by Corollary 4.5,

t\models\psi\qquad\mbox{if and only if}\qquad t^{\prime}\models\psi.

By straightforward induction on $\overline{\psi}$ , we can show that for every $t^{\prime}\models\psi$ of the form

\overbrace{{a_{1}\choose 1}\cdots{a_{1}\choose 1}}^{f_{1}(a_{1})}\cdots\overbrace{{a_{\ell}\choose 1}\cdots{a_{\ell}\choose 1}}^{f_{1}(a_{\ell})}\cdots\cdots\cdots\cdots\overbrace{{a_{1}\choose m}\cdots{a_{1}\choose m}}^{f_{m}(a_{1})}\cdots\overbrace{{a_{\ell}\choose m}\cdots{a_{\ell}\choose m}}^{f_{m}(a_{\ell})}

there exists a word $w\models\overline{\psi}$ of the form

\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m})

Similarly, to prove (2), we can prove by straightforward induction on $\overline{\psi}$ that for every word $w\models\overline{\psi}$ of the form

\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m}),

there exists a tree $t\models\psi$ of the form

\overbrace{{a_{1}\choose 1}\cdots{a_{1}\choose 1}}^{f_{1}(a_{1})}\cdots\overbrace{{a_{\ell}\choose 1}\cdots{a_{\ell}\choose 1}}^{f_{1}(a_{\ell})}\cdots\cdots\cdots\cdots\overbrace{{a_{1}\choose m}\cdots{a_{1}\choose m}}^{f_{m}(a_{1})}\cdots\overbrace{{a_{\ell}\choose m}\cdots{a_{\ell}\choose m}}^{f_{m}(a_{\ell})}

This completes the proof of our claim.

Let $C$ be an automaton over the alphabet $\Gamma\cup(2^{\Gamma}\times\mbox{$\mathcal{F}$}_{\Gamma,k})$ that expresses the formula $\overline{\psi}$ and accepts only words of the form

\overbrace{a_{1}\cdots a_{1}}^{f_{1}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{1}(a_{\ell})}\ (S_{1},f_{1})\ \cdots\cdots\ \overbrace{a_{1}\cdots a_{1}}^{f_{m}(a_{1})}\cdots\overbrace{a_{\ell}\cdots a_{\ell}}^{f_{m}(a_{\ell})}\ (S_{m},f_{m}),

where each $S_{i}=\{a\mid f_{i}(a)\geq 1\}$ . The construction of $C$ from the formula $\overline{\psi}$ is rather standard, but non-elementary. See, for example, [Thomas (1997)]. That the automaton $C$ is the desired automaton is immediate. This completes our proof of Lemma 5.16.

Lemma 5.19.

Let $\psi\in\mbox{$\textrm{FO}$}(\sim,\prec)$ be of quantifier rank $k$ . Let $\Gamma=\{a_{1},\ldots,a_{\ell}\}$ be the set of unary predicates used in $\psi$ . There exists a finite state automaton $\mathcal{M}$ over the alphabet $2^{\Gamma}\times\mbox{$\mathcal{F}$}_{\Gamma,k}$ such that $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})=\{\mbox{$\mathcal{V}$}^{(k)}_{\Gamma,k}(t)\mid t\models\psi\}$ .

Proof 5.20.

Let $C$ be the automaton obtained by applying Lemma 5.16 to the formula $\psi$ . Then let $\mathcal{M}$ be the automaton obtained from $C$ , where every symbol from $\Gamma$ is projected to the empty string. The automaton $\mathcal{M}$ is the desired automaton, and this completes our proof of Lemma 5.19.
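The word encoding used in Lemmas 5.16 and 5.19 can be made concrete as follows. This is only a sketch under assumptions of ours: each $f_{i}$ is represented as a dictionary of plain occurrence counts (we do not model how $\mbox{$\mathcal{F}$}_{\Gamma,k}$ bounds them), separators $(S_{i},f_{i})$ are Python tuples, and the test for $x\prec y$ additionally requires that $x$ occurs before $y$ in the word. All function names are hypothetical.

```python
# Illustrative sketch; names and the representation of f_i are hypothetical.

def encode(blocks, alphabet):
    """Build a^{f_1(a_1)} ... a^{f_1(a_l)} (S_1, f_1) ... (S_m, f_m).

    blocks: one dict per data value (in increasing order), mapping a symbol
    to the number of nodes with that symbol and that data value.
    """
    word = []
    for f in blocks:
        for a in alphabet:
            word.extend([a] * f.get(a, 0))
        s = frozenset(a for a in alphabet if f.get(a, 0) >= 1)
        word.append((s, tuple(f.get(a, 0) for a in alphabet)))
    return word


def is_separator(symbol):
    """Separators (S_i, f_i) are tuples; symbols of Gamma are strings."""
    return isinstance(symbol, tuple)


def separators_between(word, x, y):
    lo, hi = sorted((x, y))
    return sum(1 for p in range(lo + 1, hi) if is_separator(word[p]))


def same_value(word, x, y):
    """Translation of x ~ y: no separator strictly between x and y."""
    return separators_between(word, x, y) == 0


def smaller_value(word, x, y):
    """Translation of x < y on data values: a separator lies between them."""
    return x < y and separators_between(word, x, y) >= 1


def project(word):
    """Lemma 5.19: erase every symbol of Gamma, keeping the separators."""
    return [symbol for symbol in word if is_separator(symbol)]


word = encode([{"a": 2}, {"a": 1, "b": 1}], ["a", "b"])
```

Here `word` has six positions: two `a`-positions for the first data value, a separator, one `a`- and one `b`-position for the second data value, and a final separator.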

Now we are ready to prove Theorem 5.9. We start with the “if” direction. Let $\Psi$ be a formula of the form:

\exists Y_{1}\cdots\exists Y_{n}\ \varphi\wedge\psi,

where $\varphi$ is a formula from $\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ and $\psi$ from $\mbox{$\textrm{FO}$}(\sim,\prec)$ , both extended with the unary predicates $Y_{1},\ldots,Y_{n}$ .

By Proposition 5.11, we can rewrite (with additional unary predicates) the formula $\varphi$ into a conjunction of formulae of the form N1–N7 as stated in Proposition 5.11. Then we further rewrite it into the form

\exists X_{1}\cdots\exists X_{m}\ \varphi^{\prime}\wedge\psi^{\prime},

where $m=2^{n}$ , $\varphi^{\prime}$ is a formula from $\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ and $\psi^{\prime}$ is from $\mbox{$\textrm{FO}$}(\sim,\prec)$ , both extended with the unary predicates $X_{1},\ldots,X_{m}$ , and the formula $\varphi^{\prime}$ is a conjunction of formulas of the form:

(N0^′)

a formula $\xi$ that states that $X_{1},\ldots,X_{m}$ are pairwise disjoint and that

$\bigwedge_{a\in\Sigma}\forall x\ (a(x)\to\alpha(x)),$
(N1^′)

$\forall x\forall y\ (\alpha(x)\wedge\delta(x,y)\wedge\xi(x,y)\to\beta(y))$ ,
(N2^′)

$\forall x\ (\mbox{{\rm root}}(x)\to\alpha(x))$ ,
(N3^′)

$\forall x\ (\mbox{{\rm first-sibling}}(x)\to\alpha(x))$ ,
(N4^′)

$\forall x\ (\mbox{{\rm last-sibling}}(x)\to\alpha(x))$ ,
(N5^′)

$\forall x\ (\mbox{{\rm leaf}}(x)\to\alpha(x))$ ,
(N6^′)

$\forall x\forall y\ (\alpha(x)\wedge\alpha(y)\wedge x\sim y\to x=y)$ ,
(N7^′)

$\forall x\exists y\ (\alpha(x)\to\beta(y)\wedge x\sim y)$ ,

where $\alpha(x),\beta(x)$ are disjunctions of some of the unary predicates $X_{1},\ldots,X_{m}$ .

We will describe the ODTA $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ for the formula $\Psi$ , where the transducer $\mathcal{T}$ expresses the formulas N0^′–N5^′ with the output alphabet $\Gamma=\{X_{1},\ldots,X_{m}\}$ , the automaton $\mathcal{M}$ expresses the formulas N6^′, N7^′ and $\psi^{\prime}$ , and $\Gamma_{0}$ is the set of symbols that appear in the formulas of the form N6^′. Formally, it is defined as follows.

•

The output alphabet of $\mathcal{T}$ is $\Gamma=\{X_{1},\ldots,X_{m}\}$ .
•

The transducer expresses the formulas N0^′–N5^′ above. In particular, the input and output symbols of each node must satisfy the formula N0^′.

This step takes polynomial time, since the formulas N0^′–N5^′ are already in the transition format.
•

The set $\Gamma_{0}=\{X_{i}\mid X_{i}\ \mbox{appears in}\ \mbox{N6}^{\prime}\}$ .

This step takes polynomial time.
•

The automaton $\mathcal{M}$ expresses the formulas N6^′, N7^′ and $\psi^{\prime}$ , obtained by applying Lemma 5.19.

This step is constructive, but non-elementary due to the conversion from a formula to its finite state automaton.

It is straightforward to show that $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})=\{t\mid t\models\Psi\}$ .

Now we prove the “only if” direction. Let $\mbox{$\mathcal{L}$}=\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ , where $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ , and let

•

$Q=\{q_{1},\ldots,q_{n}\}$ be the states of $\mathcal{T}$ ;
•

$P=\{p_{1},\ldots,p_{s}\}$ be the states of $\mathcal{M}$ , and $p_{1}$ is the initial state of $\mathcal{M}$ ;
•

$\Gamma=\{\alpha_{1},\ldots,\alpha_{\ell}\}$ be the output alphabet of $\mathcal{T}$ .

We denote by $\Sigma$ the input alphabet of $\mathcal{T}$ .

The desired formula for $\mathcal{L}$ is of the form:

\exists X_{q_{1}}\cdots\exists X_{q_{n}}\ \exists X_{\alpha_{1}}\cdots\exists X_{\alpha_{\ell}}\ \exists X_{p_{1}}\cdots\exists X_{p_{s}}\ \Psi_{\mbox{\scriptsize$\mathcal{T}$}}\wedge\Psi_{\mbox{\scriptsize$\mathcal{M}$}}\wedge\Psi_{\Gamma_{0}}

where

•

the unary predicates $X_{q_{1}},\ldots,X_{q_{n}},X_{\alpha_{1}},\ldots,X_{\alpha_{\ell}},X_{p_{1}},\ldots,X_{p_{s}}$ are supposed to represent the states of $\mathcal{T}$ , the output alphabet of $\mathcal{T}$ , and the states of $\mathcal{M}$ , respectively;
•

the formula $\Psi_{\mbox{\scriptsize$\mathcal{T}$}}$ expresses the behaviour of the transducer $\mathcal{T}$ – that is, a tree $t$ satisfies $\Psi_{\mbox{\scriptsize$\mathcal{T}$}}$ if there exists an accepting run of $\mathcal{T}$ on $t$ such that, for every node $u\in\mbox{$\textsf{Dom}$}(t)$ , $X_{q_{i}}(u)$ and $X_{\alpha_{j}}(u)$ hold whenever the run labels $u$ with $q_{i}$ and outputs $\alpha_{j}$ at $u$ ;
•

the formula $\Psi_{\mbox{\scriptsize$\mathcal{M}$}}$ expresses the behaviour of the automaton $\mathcal{M}$ ;
•

the formula $\Psi_{\Gamma_{0}}$ expresses the property that for every $\alpha_{i}\in\Gamma_{0}$ , all the nodes belonging to $X_{\alpha_{i}}$ contain different data values, which is

$\bigwedge_{\alpha\in\Gamma_{0}}\forall x\forall y(X_{\alpha}(x)\wedge X_{\alpha}(y)\wedge x\sim y\to x=y).$

The construction of the formula $\Psi_{\mbox{\scriptsize$\mathcal{T}$}}$ is rather standard, thus, omitted. We will show the construction of the formula $\Psi_{\mbox{\scriptsize$\mathcal{M}$}}$ . Let $\Phi_{[S]}(x)$ denote the following formula

\bigvee_{\alpha_{i}\in S}X_{\alpha_{i}}(x)\wedge\bigwedge_{\alpha_{i}\in S}\exists y(X_{\alpha_{i}}(y)\wedge x\sim y)\wedge\bigwedge_{\alpha_{j}\notin S}\forall y(X_{\alpha_{j}}(y)\to x\nsim y),

which states that the data value on the node $x$ belongs to $[S]$ . The formula $\Psi_{\mbox{\scriptsize$\mathcal{M}$}}$ expresses the following properties.

•

That every node containing the minimal data value belongs to $X_{p_{1}}$ . Formally, it can be written as follows.

$\forall x(\forall y\ (x\prec y\vee x\sim y)\to X_{p_{1}}(x))$

•

That the transition $\mu$ of $\mathcal{M}$ must be “respected.” Formally, it can be written as follows.

\bigwedge_{(p_{i},S,p_{j})\in\mu}\Big{(}\forall x\forall y(X_{p_{i}}(x)\wedge\Phi_{[S]}(x)\wedge x\mbox{$\prec_{suc}$}y\to X_{p_{j}}(y))\Big{)},

where $x\mbox{$\prec_{suc}$}\ y$ stands for $x\prec y\wedge\forall z(\neg(x\prec z\wedge z\prec y))$ .

•

That every node containing the maximal data value belongs to $X_{p_{i}}$ for one of the final states $p_{i}$ of $\mathcal{M}$ , whose set is denoted by $F$ . Formally, it can be written as follows.

$\forall x(\forall y\ (y\prec x\vee y\sim x)\to\bigvee_{p_{i}\in F}X_{p_{i}}(x)).$

That the construction takes polynomial time is straightforward. This completes our proof of Theorem 5.9.
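To make the role of $\mathcal{M}$ in the construction above more tangible, the sketch below runs a finite state automaton over the sequence of sets $S$ , one per data value, read in increasing order of the data values. It is an illustration only: it uses the standard acceptance condition of a non-deterministic automaton rather than a literal evaluation of $\Psi_{\mbox{\scriptsize$\mathcal{M}$}}$ , and all names and the input encoding are our assumptions.

```python
# Illustrative sketch; names and input encoding are hypothetical.

def string_representation(nodes):
    """nodes: iterable of (output symbol, data value) pairs.

    Returns, for each data value in increasing order, the set S of output
    symbols that occur with that data value.
    """
    by_value = {}
    for symbol, value in nodes:
        by_value.setdefault(value, set()).add(symbol)
    return [frozenset(by_value[d]) for d in sorted(by_value)]


def accepts(transitions, initial, finals, word):
    """Standard NFA run; `transitions` is a set of triples (p, S, p')."""
    current = {initial}
    for letter in word:
        current = {q for (p, s, q) in transitions if p in current and s == letter}
    return bool(current & finals)


nodes = [("a", 1), ("b", 1), ("a", 2), ("b", 3)]
mu = {
    ("p1", frozenset({"a", "b"}), "p2"),
    ("p2", frozenset({"a"}), "p2"),
    ("p2", frozenset({"b"}), "p3"),
}
```

In this example the data values $1,2,3$ yield the sequence $\{a,b\},\{a\},\{b\}$ , which drives the automaton from `p1` to `p3`.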

5.3 Proof of Theorem 5.10

The proof of Theorem 5.10 consists of two main steps.

(1)

We prove that for each ODTA $\mathcal{S}$ , if $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})\neq\emptyset$ , then $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ contains a data tree with the “small model property” (Lemma 5.21).
(2)

We describe a procedure that, given an ODTA $\mathcal{S}$ , checks whether $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{S}$})$ contains a data tree with the “small model property,” by converting the ODTA $\mathcal{S}$ into an APC $(\mbox{$\mathcal{A}$},\xi)$ . Since the non-emptiness of APC is decidable, Theorem 5.10 follows immediately.

The first step (Lemma 5.21) is adapted from the proof of [Bojanczyk et al. (2009), Proposition 3.10]. It is in the second step that our proof differs from [Bojanczyk et al. (2009), Proposition 3.10]. The decision procedure in [Bojanczyk et al. (2009)] relies on an intricate counting argument involving the so-called dog and sheep symbols (see [Bojanczyk et al. (2009), page 36]), and it seems that it cannot be generalised to the case of ODTA. On the other hand, our decision procedure relies mainly on Proposition 2.3, Lemma 4.1 and counting the cardinality of each $[S]$ .

We need some terminology. A set of nodes in a data tree $t$ is called connected, if it is connected in the graph induced by $E_{\downarrow}$ and $E_{\rightarrow}$ . A zone in a data tree $t$ is a maximal connected set of nodes with the same data value. The outdegree of a zone $Z$ is the number of different zones to which there is an edge (either $E_{\downarrow}$ or $E_{\rightarrow}$ ) from $Z$ .
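Zones and their outdegrees can be computed with a union-find pass over the edges. The sketch below is illustrative only: the names and the input encoding (a dictionary from nodes to data values and a list of directed $E_{\downarrow}$ and $E_{\rightarrow}$ edges) are our assumptions.

```python
# Illustrative sketch; names and input encoding are hypothetical.

def zones(values, edges):
    """Map every node to a representative of its zone.

    values: dict node -> data value.
    edges: list of directed pairs (u, v) from the child and next-sibling
    relations.
    """
    parent = {u: u for u in values}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for u, v in edges:
        if values[u] == values[v]:
            parent[find(u)] = find(v)
    return {u: find(u) for u in values}


def outdegrees(values, edges):
    """Number of distinct zones reached by an edge leaving each zone."""
    zone = zones(values, edges)
    reached = {z: set() for z in set(zone.values())}
    for u, v in edges:
        if zone[u] != zone[v]:
            reached[zone[u]].add(zone[v])
    return {z: len(targets) for z, targets in reached.items()}


# Root r with children c1, c2, c3 (left to right); c2 has a child g.
values = {"r": 1, "c1": 1, "c2": 2, "c3": 1, "g": 2}
edges = [("r", "c1"), ("r", "c2"), ("r", "c3"), ("c2", "g"),  # child edges
         ("c1", "c2"), ("c2", "c3")]                           # sibling edges
```

In this example the nodes `r`, `c1`, `c3` form one zone and `c2`, `g` form another, and each of the two zones has outdegree 1.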

Let $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ be an ODTA, where $\mathcal{T}$ is a transducer from $\Sigma$ to $\Gamma$ . Let $Q$ be the set of states of $\mathcal{T}$ . For a tree $t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ , its extended tree $\tilde{t}$ (with respect to the ODTA $\mathcal{S}$ ) is a tree over the alphabet $\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$ , where

•

the projection of $\tilde{t}$ to $\Sigma\times\{\top,\bot,*\}^{3}$ is $\mbox{$\textsf{Profile}$}(t)$ ;
•

the projection of $\tilde{t}$ to $Q$ is an accepting run of $\mathcal{T}$ on $t$ ;
•

the projection of $\tilde{t}$ to $\Gamma$ is an output of $\mathcal{T}$ on $t$ .

The following lemma is simply an adaptation of [Bojanczyk et al. (2009), Proposition 3.10] to the case of ODTA. The proof is via cut-and-paste: given an ordered-data tree $t$ over the alphabet $\Sigma$ with “many” zones of “large” outdegree, we can cut some nodes in $t$ and paste them in another part of $t$ without affecting the sets $V_{t}(a)$ for each $a\in\Sigma$ . The aim of such cut-and-paste is to reduce the number of zones in $t$ with large outdegree. We give the formal statement below.

Lemma 5.21.

[Compare [Bojanczyk et al. (2009), Proposition 3.10]] For every ODTA $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ over the alphabet $\Sigma$ , if $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})\neq\emptyset$ , then there exists a data tree $t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ in which there are at most $K^{O(K^{2})}$ zones with outdegree $\geq K^{(K^{3})}$ , where $K=O(|\Sigma|\cdot|Q|\cdot|\Gamma|)$ and $Q$ is the set of states of $\mathcal{T}$ and $\Gamma$ the output alphabet of $\mathcal{T}$ .

Proof 5.22.

Let $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ be an ODTA over the alphabet $\Sigma$ , and $Q$ is the set of states of $\mathcal{T}$ and $\Gamma$ the output alphabet of $\mathcal{T}$ . Suppose that $t_{0}\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ . We will work on the extended tree $\tilde{t}_{0}$ of $t_{0}$ . The aim is to convert $\tilde{t}_{0}$ into another tree $\tilde{t}$ over the alphabet $\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$ such that

1.

the number of zones in $\tilde{t}$ with outdegree $\geq K^{(K^{3})}$ is bounded by $K^{O(K^{2})}$ ,
2.

the $\{\top,\bot,*\}^{3}$ projection of $\tilde{t}$ is the profile of each node,
3.

the $Q$ projection of $\tilde{t}$ is an accepting run of $\mathcal{T}$ on the $\Sigma\times\{\top,\bot,*\}^{3}$ projection of $\tilde{t}$ and the output is its $\Gamma$ projection,
4.

for each $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$ the set of data values found in the $(a,(l,p,r),q,b)$ -nodes in $\tilde{t}_{0}$ is the same as the set of those found in $(a,(l,p,r),q,b)$ -nodes in $\tilde{t}$ ,
5.

the $\Sigma$ projection of $\tilde{t}$ is accepted by $\mathcal{S}$ .

Intuitively, the tree $\tilde{t}$ is obtained via repeated applications of “pumping lemma” on both $E_{\downarrow}$ - and $E_{\rightarrow}$ -directions in the tree $t$ .

Below we give a brief summary of the proof, adapted from the proof of [Bojanczyk et al. (2009), Proposition 3.10]. We need the following terminology, all of which is from [Bojanczyk et al. (2009)].

•

Two nodes in a tree are called siblings, if they have the same parent node.
•

The set of all children of a node is called a sibling group.
•

A contiguous sequence of siblings is called an interval.
We write $[u,v]$ for an interval in which $u$ and $v$ are the left-most and right-most nodes, respectively, in the interval.
•
An interval $[u,v]$ is complete, if the following holds.
- –
  
  If a node $u^{\prime}$ exists such that $\mbox{$E_{\rightarrow}$}(u^{\prime},u)$ , then $u^{\prime}\nsim u$ .
- –
  
  If a node $v^{\prime}$ exists such that $\mbox{$E_{\rightarrow}$}(v,v^{\prime})$ , then $v^{\prime}\nsim v$ .
•

An interval is pure, if all of its nodes have the same data value.
•

A pure interval with the data value $d$ is called a $d$ -pure interval.
•

If the parent of an interval (or, a sibling group) has data value $d$ , then it is called a $d$ -parent interval (or a $d$ -parent sibling group).
•

A zone with the data value $d$ is called a $d$ -zone.
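As an illustration of the interval terminology, the following sketch splits a sibling group into its complete pure intervals. It assumes a hypothetical encoding in which the sibling group is given as the list of its data values from left to right, and it reads completeness as requiring that the left neighbour of $u$ and the right neighbour of $v$ , when they exist, carry a data value different from that of the interval. Under this reading the complete pure intervals are exactly the maximal runs of equal data values.

```python
from itertools import groupby


def complete_pure_intervals(sibling_values):
    """Return the maximal runs of equal data values in a sibling group.

    Each run is reported as (first_index, last_index, data_value); it is
    pure (a single data value) and complete (it cannot be extended to the
    left or to the right without changing the data value).
    """
    intervals = []
    start = 0
    for value, run in groupby(sibling_values):
        length = len(list(run))
        intervals.append((start, start + length - 1, value))
        start += length
    return intervals
```

A sibling group with data values `[5, 5, 7, 5, 5, 5]` has three complete pure intervals, two of which are $5$ -pure.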

The construction of $\tilde{t}$ from $\tilde{t}_{0}$ is as follows.

1.
Convert $\tilde{t}_{0}$ to another tree $\tilde{t}_{1}$ such that
- •
  
  for every data value $d\in V_{\tilde{t}_{1}}$ there are at most $O(K)$ complete $d$ -pure intervals of size more than $O(K)$ ;
- •
  
  $V_{\tilde{t}_{1}}(a,(l,p,r),q,b)=V_{\tilde{t}_{0}}(a,(l,p,r),q,b)$ , for every $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$ ;
- •
  
  $\tilde{t}_{1}$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$ .
This step is adapted from [Bojanczyk et al. (2009), Proposition 3.12]. The idea is to cut an interval (together with its subtree) and paste it in another interval; and while doing so the data values in the interval remain untouched.
2.
Convert $\tilde{t}_{1}$ to another tree $\tilde{t}_{2}$ such that
- •
  
  for every data value $d\in V_{\tilde{t}_{2}}$ there are at most $O(K)$ $d$ -parent sibling groups with more than $K^{O(K)}$ complete pure intervals;
- •
  
  $V_{\tilde{t}_{2}}(a,(l,p,r),q,b)=V_{\tilde{t}_{1}}(a,(l,p,r),q,b)$ , for every $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$ ;
- •
  
  $\tilde{t}_{2}$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$ .
This step is adapted from [Bojanczyk et al. (2009), Proposition 3.14]. Again when the cut-and-paste is performed the data values in the sibling groups remain untouched.
3.
Convert $\tilde{t}_{2}$ to another tree $\tilde{t}_{3}$ such that
- •
  
  for every data value $d\in V_{\tilde{t}_{3}}$ there are at most $O(K)$ $d$ -zones containing a path with more than $O(K)$ nodes;
- •
  
  $V_{\tilde{t}_{3}}(a,(l,p,r),q,b)=V_{\tilde{t}_{2}}(a,(l,p,r),q,b)$ , for every $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$ ;
- •
  
  $\tilde{t}_{3}$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$ .
This step is adapted from [Bojanczyk et al. (2009), Proposition 3.17]. Again when the cut-and-paste is performed the data values in the zones remain untouched.
4.
Convert $\tilde{t}_{3}$ to another tree $\tilde{t}_{4}$ such that
- •
  
  there are at most $K^{O(K^{2})}$ complete pure intervals with more than $O(K^{2})$ nodes;
- •
  
  $V_{\tilde{t}_{3}}(a,(l,p,r),q,b)=V_{\tilde{t}_{4}}(a,(l,p,r),q,b)$ , for every $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$ ;
- •
  
  $\tilde{t}_{4}$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$ .
This step is adapted from [Bojanczyk et al. (2009), Proposition 3.20]. Here, when the cut-and-paste is performed, the data values in some zones actually have to be changed. However, those changes are only applied to the safe zones, where a zone is safe if for every node in it there is another node outside the zone with the same label (from $\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$ ) and the same data value. (See [Bojanczyk et al. (2009), page 23, last paragraph].) More specifically, these changes are done by applying [Bojanczyk et al. (2009), Lemma 3.19] on the safe zones. It is important that the changes are applied only to safe zones, so that after changing the data values, constraints such as $\forall x\exists y(a(x)\to x\sim y\wedge b(y))$ are still satisfied.
5.
Convert $\tilde{t}_{4}$ to another tree $\tilde{t}_{5}$ such that
- •
  
  there are at most $K^{O(K^{2})}$ sibling groups containing more than $K^{O(K)}$ complete pure intervals;
- •
  
  $V_{\tilde{t}_{4}}(a,(l,p,r),q,b)=V_{\tilde{t}_{5}}(a,(l,p,r),q,b)$ , for every $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$ ;
- •
  
  $\tilde{t}_{5}$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$ .
This step is adapted from [Bojanczyk et al. (2009), Proposition 3.21]. Here there are also changes of data values when performing cut-and-paste. However, as in the previous step, they are only applied to the safe zones. These changes are also done by applying [Bojanczyk et al. (2009), Lemma 3.19] on the safe zones.
6.
Convert $\tilde{t}_{5}$ to another tree $\tilde{t}_{6}$ such that
- •
  
  there are at most $K^{O(K^{2})}$ zones containing paths with more than $O(K^{2})$ nodes;
- •
  
  $V_{\tilde{t}_{5}}(a,(l,p,r),q,b)=V_{\tilde{t}_{6}}(a,(l,p,r),q,b)$ , for every $(a,(l,p,r),q,b)\in\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$ ;
- •
  
  $\tilde{t}_{6}$ is an extended tree of its $\Sigma$ projection w.r.t. $\mathcal{S}$ .
This step is adapted from [Bojanczyk et al. (2009), Proposition 3.25]. Here there are also changes of data values when performing cut-and-paste. However, as in the previous step, they are only applied to the safe zones. More specifically, these changes are done by applying [Bojanczyk et al. (2009), Lemma 3.24] on the safe zones.

The extended tree $\tilde{t}_{6}$ is the desired extended tree. It is a rather straightforward computation that there are at most $K^{O(K^{2})}$ zones in $\tilde{t}_{6}$ with outdegree $\geq K^{(K^{3})}$ .

To describe the decision procedure for Theorem 5.10, we need some additional terminology. For a data tree $t$ over the alphabet $\Gamma$ , and $S\subseteq\Gamma$ , an $S$ -zone is a zone in which the labels of the nodes are precisely $S$ . We write $V^{zone}_{t}(S)$ to denote the set of data values found in $S$ -zones in $t$ . For $P\subseteq 2^{\Gamma}$ ,

[P]^{zone}_{t}=\bigcap_{S\in P}V^{zone}_{t}(S)\cap\bigcap_{R\notin P}\overline{V^{zone}_{t}(R)}

Suppose $d_{1}<\cdots<d_{m}$ are all the data values in $t$ . The zonal string representation of the data values in $t$ , denoted by $\mbox{$\mathcal{V}$}^{zone}_{\Gamma}(t)$ , is the string $P_{1}\cdots P_{m}$ over the alphabet $2^{2^{\Gamma}}$ such that for each $i\in\{1,\ldots,m\}$ , $d_{i}\in[P_{i}]^{zone}_{t}$ .
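The zonal string representation can be computed directly from the list of zones. The sketch below assumes a hypothetical input format, not taken from the paper: each zone is given as a pair consisting of its set of labels and its data value. For every data value $d$ it collects the set $P$ of label sets $S$ such that $d$ occurs in an $S$ -zone, which is the unique $P$ with $d\in[P]^{zone}_{t}$ , and lists these sets in increasing order of the data values.

```python
from collections import defaultdict


def zonal_string(zones):
    """Compute the zonal string P_1 ... P_m of a data tree.

    zones: iterable of (labels, data_value) pairs, one pair per zone.
    Returns a list of frozensets of frozensets, ordered by data value.
    """
    label_sets_of = defaultdict(set)
    for labels, value in zones:
        label_sets_of[value].add(frozenset(labels))
    return [frozenset(label_sets_of[d]) for d in sorted(label_sets_of)]
```

For example, if the data value 3 occurs both in an $\{a,b\}$ -zone and in a $\{b\}$ -zone, its letter in the zonal string is $\{\{a,b\},\{b\}\}$ .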

A zonal ODTA is $\mbox{$\mathcal{S}$}^{\prime}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$}^{\prime},\Gamma_{0}\rangle$ , where $\mathcal{T}$ and $\Gamma_{0}$ are as in the definition of ODTA, and $\mbox{$\mathcal{M}$}^{\prime}$ is a finite state automaton over the alphabet $2^{2^{\Gamma}}$ . A data tree $t$ is accepted by the zonal ODTA $\mbox{$\mathcal{S}$}^{\prime}$ , if the following holds.

•

$\mbox{$\textsf{Profile}$}(t)$ is accepted by $\mathcal{T}$ , yielding an output tree $t^{\prime}$ over the alphabet $\Gamma$ .
•

The string $\mbox{$\mathcal{V}$}^{zone}_{\Gamma}(t^{\prime})$ is accepted by $\mbox{$\mathcal{M}$}^{\prime}$ .
•

For each $a\in\Gamma_{0}$ , all the data values found in the $a$ -nodes in $t^{\prime}$ are different.

Proposition 5.23.

For every ODTA $\mathcal{S}$ , one can construct in ExpTime its equivalent zonal ODTA.

Proof 5.24.

Let $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ and $\mbox{$\mathcal{M}$}=\langle Q,q_{0},\delta,F\rangle$ . Its equivalent zonal ODTA is defined as $\mbox{$\mathcal{S}$}^{\prime}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$}^{\prime},\Gamma_{0}\rangle$ , where $\mbox{$\mathcal{M}$}^{\prime}=\langle Q,q_{0},\delta^{\prime},F\rangle$ and $\delta^{\prime}=\{(q,P,q^{\prime})\in Q\times 2^{2^{\Gamma}}\times Q\mid\exists(q,S,q^{\prime})\in\delta\ \mbox{such that}\ \bigcup_{R\in P}R=S\}$ . It is straightforward to show that $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$}^{\prime})=\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ .

Note that the only difference between $\mathcal{S}$ and $\mbox{$\mathcal{S}$}^{\prime}$ is the transitions $\delta$ and $\delta^{\prime}$ in $\mathcal{M}$ and $\mbox{$\mathcal{M}$}^{\prime}$ , respectively. The membership $(q,P,q^{\prime})\in\delta^{\prime}$ can be checked in polynomial time in the size of $(q,P,q^{\prime})$ and $\delta$ . Since there are exponentially many $(q,P,q^{\prime})$ , the exponential time upper bound holds immediately. This completes the proof of Proposition 5.23.
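The construction of $\delta^{\prime}$ can be sketched as follows. The encoding is an assumption made for illustration: transitions of $\mathcal{M}$ are triples `(q, S, q2)` with `S` a frozenset of labels, and only non-empty label sets and non-empty letters $P$ are enumerated. The enumeration is doubly exponential in $|\Gamma|$ , so the sketch is only meant for very small alphabets.

```python
from itertools import combinations


def nonempty_subsets(items):
    """All non-empty subsets of items, as frozensets."""
    items = list(items)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def zonal_transitions(delta, gamma):
    """Build delta' = {(q, P, q2) : (q, S, q2) in delta and the union of P is S}."""
    label_sets = nonempty_subsets(gamma)    # candidate zone label sets S
    letters = nonempty_subsets(label_sets)  # candidate letters P of the zonal alphabet
    delta_prime = set()
    for P in letters:
        union = frozenset().union(*P)
        for (q, S, q2) in delta:
            if frozenset(S) == union:
                delta_prime.add((q, P, q2))
    return delta_prime
```

With $\Gamma=\{a,b\}$ and a single transition on the letter $\{a,b\}$ , the sketch produces five transitions, one for each $P$ whose members together cover exactly $\{a,b\}$ .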

Briefly our decision procedure for Theorem 5.10 works as follows. Let $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ be the given ODTA, where $\Sigma$ is the input alphabet of $\mathcal{T}$ , $\Gamma$ the output alphabet, and $Q$ the set of states of $\mathcal{T}$ . Let $K=27\cdot|\Sigma|\cdot|Q|\cdot|\Gamma|$ . The decision procedure constructs an APC $(\mbox{$\mathcal{A}$},\xi)$ such that $\mathcal{S}$ accepts an ordered-data tree $t$ in which there are at most $K^{O(K^{2})}$ zones with outdegree $\geq K^{(K^{3})}$ if and only if $(\mbox{$\mathcal{A}$},\xi)$ accepts an extended tree of $t$ w.r.t. $\mathcal{S}$ .

Its precise description is given as follows.

1.

Compute $K=27\cdot|\Sigma|\cdot|Q|\cdot|\Gamma|$ .
2.

Convert $\mathcal{S}$ into its zonal ODTA $\mbox{$\mathcal{S}$}^{\prime}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$}^{\prime},\Gamma_{0}\rangle$ .
3.
Guess the following items.
1. (a)
  
  A set $\mbox{$\mathcal{P}$}\subseteq 2^{2^{\Gamma}}$ .
2. (b)
  
  For each $P\in\mbox{$\mathcal{P}$}$ , guess an integer $M_{P}\leq 2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{K^{3}}+1$ and a set of $M_{P}$ constants $\mbox{$\mathcal{C}$}_{P}=\{c_{1},\ldots,c_{M_{P}}\}$ .
  (Footnote: The purpose of the number $2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{K^{3}}$ is the application of Lemma 4.1 later on, where we consider the graph where the nodes are the zones. Each zone is labeled with a symbol from $2^{\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma}$ , which is of size $2^{K}$ . If a zone has outdegree $\leq K^{(K^{3})}$ , then it has only at most $K^{(K^{3})}$ nodes, which means that its degree (the sum of indegree and outdegree) is bounded by $2\cdot K^{K^{3}}$ . Now $\mathcal{P}$ is intended to contain all those $P$ ’s in which $|[P]^{zone}_{t}|\leq 2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{K^{3}}+1$ so that we can “guess” some constants as elements of $[P]^{zone}_{t}$ and make sure by automaton that the same constant is not “assigned” to adjacent zones. For $P$ not in $\mathcal{P}$ , we can apply Lemma 4.1 to make sure the same data value from $[P]^{zone}_{t}$ is not assigned to adjacent zones.)
3. (c)
  
  Two integers $N,N^{\prime}$ such that $N^{\prime}\leq N\leq K^{O(K^{2})}$ and a set of $N^{\prime}$ constants $\mbox{$\mathcal{D}$}=\{d_{1},\ldots,d_{N^{\prime}}\}$ .
  The intuitive meanings of $N$ and $N^{\prime}$ are the number of zones with outdegree $\geq K^{(K^{3})}$ and the number of data values found in them, respectively. We also remark that the constants in $\mathcal{D}$ may overlap with the constants in some $\mbox{$\mathcal{C}$}_{P}$ .
4. (d)
  
  For each $d\in\mbox{$\mathcal{D}$}$ , guess a set $P_{d}\subseteq 2^{\Gamma}$ .
4.
Construct the following automaton $\mathcal{A}$ over the alphabet $\Sigma\times\{\top,\bot,*\}^{3}\times Q\times\Gamma$ .
1. (a)
  
  $\mathcal{A}$ accepts only the extended trees of $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{T}$})$ in which there are at most $N$ zones with outdegree $\geq K^{(K^{3})}$ .
2. (b)
  
  The automaton $\mathcal{A}$ can remember the constants in its states.
3. (c)
  
  For every $P\in\mbox{$\mathcal{P}$}$ , for every $c\in\mbox{$\mathcal{C}$}_{P}$ , the automaton $\mathcal{A}$ “assigns” the constant $c$ in an $S$ -zone, for every $S\in P$ , but not in any $R$ -zone, for every $R\notin P$ .
4. (d)
  
  The automaton $\mathcal{A}$ “assigns” every zone with outdegree $\geq K^{(K^{3})}$ with a constant from $\mathcal{D}$ .
5. (e)
  
  For every $d\in\mbox{$\mathcal{D}$}$ , the automaton $\mathcal{A}$ “assigns” the constant $d$ in an $S$ -zone, for every $S\in P_{d}$ , but in no $R$ -zone, for every $R\notin P_{d}$ .
6. (f)
  
  For each $a\in\Gamma_{0}$ , there is at most one $a$ -node in every zone, and for every two zones that contain $a$ -nodes, if they are assigned constants from the $\mbox{$\mathcal{C}$}_{P}$ ’s and $\mathcal{D}$ , then these constants must be different.
7. (g)
  
  For every two adjacent zones, if they are assigned with constants from $\mbox{$\mathcal{C}$}_{P}$ ’s and $\mathcal{D}$ , then these constants must be different.
The automaton $\mathcal{A}$ “assigns” a constant to a zone by remembering the constant in the state when $\mathcal{A}$ is reading the zone.
5.

Let $P_{1},\ldots,P_{m}$ be the enumeration of non-empty subsets of $2^{\Gamma}$ .
Applying Proposition 2.3, convert the automaton $\mbox{$\mathcal{M}$}^{\prime}$ into its Presburger formula $\xi_{\mbox{\scriptsize$\mathcal{M}$}^{\prime}}(z_{P_{1}},\ldots,z_{P_{m}})$ , where the intended meaning of $z_{P_{i}}$ ’s is the number of appearances of the label $P_{i}$ .

6.

Let $\Gamma=\{a_{1},\ldots,a_{\ell}\}$ and $S_{1},\ldots,S_{k}$ be the enumeration of non-empty subsets of $\Gamma$ . Define the formula $\xi(x_{a_{1}},\ldots,x_{a_{\ell}},x_{S_{1}},\ldots,x_{S_{k}}):$

\begin{aligned}
\exists z_{P_{1}}\cdots\exists z_{P_{m}}\ \ &\xi_{\mbox{\scriptsize$\mathcal{M}$}^{\prime}}(z_{P_{1}},\ldots,z_{P_{m}})\qquad(14)\\
&\wedge\ \bigwedge_{P_{i}\in\mbox{\scriptsize$\mathcal{P}$}}z_{P_{i}}=M_{P_{i}}\\
&\wedge\ \bigwedge_{P_{i}\notin\mbox{\scriptsize$\mathcal{P}$}}z_{P_{i}}\geq 2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{(K^{3})}+1\\
&\wedge\ \bigwedge_{S\subseteq\Gamma}\Big{(}x_{S}\ \geq\ \sum_{P_{i}\ni S\ \textrm{and}\ P_{i}\notin\mbox{\scriptsize$\mathcal{P}$}}z_{P_{i}}\Big{)}\\
&\wedge\ \bigwedge_{a\in\Gamma_{0}}\Big{(}x_{a}\ =\ \sum_{\scriptsize\begin{array}[]{c}\textrm{there exists}\ S\ \textrm{such that}\\ a\in S\ \textrm{and}\ S\in P_{i}\ \textrm{and}\ P_{i}\notin\mbox{$\mathcal{P}$}\end{array}}z_{P_{i}}\Big{)}\\
&\wedge\ \bigwedge_{\scriptsize P_{i}\notin\mbox{$\mathcal{P}$}}z_{P_{i}}\geq|\{d\in\mbox{$\mathcal{D}$}\mid P_{d}=P_{i}\}|
\end{aligned}

The meaning of $x_{a}$ is the number of $a$ -nodes occurring in zones not assigned any constant from the $\mbox{$\mathcal{C}$}_{P}$ ’s and $\mathcal{D}$ ; and $x_{S}$ is the number of $S$ -zones not assigned any constant from the $\mbox{$\mathcal{C}$}_{P}$ ’s and $\mathcal{D}$ . The intuition behind the first five conjuncts is rather clear. The intuition behind the last conjunct is as follows. Recall that in Step (3), for each $d\in\mbox{$\mathcal{D}$}$ , we guess a set $P_{d}$ . The meaning is that $d\in[P_{d}]^{zone}_{t}$ for some $t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ . So for every $P_{i}\notin\mbox{$\mathcal{P}$}$ , the number of $d$ such that $P_{d}=P_{i}$ should not exceed $z_{P_{i}}$ . This is precisely what the last conjunct states.

7.

Test the non-emptiness of the APC $(\mbox{$\mathcal{A}$},\xi)$ .

Before we proceed to prove its correctness, we first present the analysis of its complexity.

•

Step (1) is trivial and Step (2) takes exponential time.
•

Step (3) takes non-deterministic exponential time in the size of $\mathcal{S}$ . The analysis is as follows. Step (3.a) takes non-deterministic exponential time in the size of $2^{\Gamma}$ , which is bounded by the size of $\mathcal{M}$ in $\mathcal{S}$ . (Recall that the alphabet of $\mathcal{M}$ is $2^{\Gamma}$ .) Step (3.b) can guess up to exponentially many constants in each $\mbox{$\mathcal{C}$}_{P}$ , and there are exponentially many different $\mbox{$\mathcal{C}$}_{P}$ , hence it takes double exponential time in the size of $2^{\Gamma}$ . Steps (3.c) and (3.d) take non-deterministic exponential time.
•

Step (4) takes deterministic triple exponential time and can produce the automaton $\mathcal{A}$ of size up to triple exponential. The analysis is as follows. The automaton $\mathcal{A}$ has to remember in its states the outdegree of each zone up to $K^{(K^{3})}$ and the number of zones with outdegree $\geq K^{(K^{3})}$ . This induces an exponential blow-up in the size of $\mathcal{T}$ .

The number of constants guessed in Step (3) is double exponential in the size of $\mathcal{T}$ . Then $\mathcal{A}$ has to remember in its states which constant is assigned to which zone (of outdegree $\geq K^{(K^{3})}$ ), which induces another exponential blow-up. Altogether the size of $\mathcal{A}$ can be triple exponential in the size of $\mathcal{T}$ .
•

By Proposition 2.3, Step (5) takes polynomial time in the size of $\mbox{$\mathcal{M}$}^{\prime}$ , which is of size exponential in the size of the original $\mathcal{M}$ .
•

The length of the formula in step (6) is double exponential in the size of $\mathcal{S}$ , since the number of constants in $\mathcal{D}$ can be double exponential in the size of $2^{\Gamma}$ , and hence $\mathcal{S}$ .
•

Step (7) takes non-deterministic polynomial time in the size of $(\mbox{$\mathcal{A}$},\xi)$ , and hence non-deterministic triple exponential time in the size of the input $\mathcal{S}$ .

The following claim immediately implies the correctness of our algorithm.

Claim 2.

1.

For every ordered-data tree $t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ , in which there are at most $K^{O(K^{2})}$ zones with outdegree $\geq K^{(K^{3})}$ , there exists an extended tree of $t$ which is accepted by the APC $(\mbox{$\mathcal{A}$},\xi)$ .
2.

For every $t^{\prime}\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{A}$},\xi)$ , there exists an ordered-data tree $t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ such that $t^{\prime}$ is an extended tree of $t$ w.r.t. $\mathcal{S}$ .

Proof 5.25.

We prove (1) first. Let $t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ be an ordered-data tree in which there are at most $K^{O(K^{2})}$ zones with outdegree $\geq K^{(K^{3})}$ . Let $t_{0}$ be the output of $\mathcal{T}$ on $t$ so that $\mbox{$\mathcal{V}$}^{zone}(t_{0})$ is accepted by $\mbox{$\mathcal{M}$}^{\prime}$ and all nodes in $t_{0}$ labelled with a symbol in $\Gamma_{0}$ have different data values.

We have the following items guessed in Step 3 in our algorithm above.

•

$\mbox{$\mathcal{P}$}=\{P\mid|[P]^{zone}_{t_{0}}|\leq 2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{(K^{3})}+1\}$ .
•

For each $P\in\mbox{$\mathcal{P}$}$ , $\mbox{$\mathcal{C}$}_{P}=[P]^{zone}_{t_{0}}$ , and $M_{P}=|\mbox{$\mathcal{C}$}_{P}|$ .
•

$N$ is the number of zones in $t$ with outdegree $\geq K^{(K^{3})}$ and $N^{\prime}$ is the number of data values found in these zones.
•

$\mbox{$\mathcal{D}$}=\{d\mid d\ \mbox{is found in a zone with outdegree}\ \geq K^{(K^{3})}\}$ .
•

For each $d\in\mbox{$\mathcal{D}$}$ , $P_{d}$ is the set such that $d\in[P_{d}]^{zone}_{t_{0}}$ .

Now let $t^{\prime}$ be an extended tree of $t$ with respect to $\mathcal{S}$ , and $\mathcal{A}$ and $\xi$ be the automaton and formula as constructed in Steps 4–6 above. We are going to show that $t^{\prime}\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{A}$},\xi)$ . Obviously, $t^{\prime}\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{A}$})$ . To show that the formula $\xi$ is satisfied, we take $\mbox{$\textsf{Parikh}$}(\mbox{$\mathcal{V}$}^{zone}(t_{0}))$ as witness to $(z_{P_{1}},\ldots,z_{P_{m}})$ . Since $\mbox{$\mathcal{V}$}^{zone}(t_{0})\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$}^{\prime})$ , by Proposition 2.3, the formula $\xi_{\mbox{\scriptsize$\mathcal{M}$}^{\prime}}(\mbox{$\textsf{Parikh}$}(\mbox{$\mathcal{V}$}^{zone}(t_{0})))$ holds. It is straightforward from the definitions of the items $\mathcal{P}$ , $M_{P}$ ’s, $N$ , $N^{\prime}$ , $\mathcal{D}$ and $P_{d}$ ’s that the formula $\xi$ in Step 6 is satisfied with $x_{a}$ ’s and $x_{S}$ ’s interpreted as intended.

Now we prove (2). The proof is more delicate than the proof of (1). Suppose $t^{\prime}\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{A}$},\xi)$ . We are going to construct an ordered-data tree $t$ from $t^{\prime}$ such that $t^{\prime}$ is an extended tree of $t$ w.r.t. $\mathcal{S}$ . Let $\mathcal{P}$ , $M_{P}$ ’s, $\mbox{$\mathcal{C}$}_{P}$ ’s, $N$ , $N^{\prime}$ , $\mathcal{D}$ and $P_{d}$ ’s be the items guessed in Step 3 above and

•

for each $a_{i}\in\Gamma$ , let $n_{a_{i}}$ be the number of $a_{i}$ -nodes in $t^{\prime}$ occurring in a zone without any constants from $\mbox{$\mathcal{C}$}_{P}$ ’s and $\mathcal{D}$ ;
•

for each $S_{i}\subseteq\Gamma$ , let $n_{S_{i}}$ be the number of $S_{i}$ -zones in $t^{\prime}$ without any constants from $\mbox{$\mathcal{C}$}_{P}$ ’s and $\mathcal{D}$ .

Let $(k_{P_{1}},\ldots,k_{P_{m}})$ be a witness to $z_{P_{1}},\ldots,z_{P_{m}}$ such that

\xi(n_{a_{1}},\ldots,n_{a_{\ell}},n_{S_{1}},\ldots,n_{S_{k}})\qquad\mbox{holds}.

By Proposition 2.3, this means that there exists a word $w\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$}^{\prime})$ such that $\mbox{$\textsf{Parikh}$}(w)=(k_{P_{1}},\ldots,k_{P_{m}})$ . For each $P_{i}$ , we let

\mbox{$\mathcal{N}$}_{P_{i}}=\{j\mid\mbox{position}\ j\ \mbox{in}\ w\ \mbox{is labeled}\ P_{i}\}.

We will assign a data value to each node in $t$ such that

[P_{i}]^{zone}_{t}=\mbox{$\mathcal{N}$}_{P_{i}},

and $\mbox{$\mathcal{V}$}^{zone}(t)=w$ . The assignment is done according to three cases below.

Case 1

For the nodes that are assigned with some constants from $\mbox{$\mathcal{C}$}_{P_{i}}$ ’s.
In this case $P_{i}\in\mbox{$\mathcal{P}$}$ . We define bijections $f_{P_{i}}:\mbox{$\mathcal{C}$}_{P_{i}}\mapsto\mbox{$\mathcal{N}$}_{P_{i}}$ . There is always a bijection from $\mbox{$\mathcal{C}$}_{P_{i}}$ to $\mbox{$\mathcal{N}$}_{P_{i}}$ since they have the same cardinality $M_{P_{i}}$ , due to the following condition in the formula $\xi$ :

\bigwedge_{P_{i}\in\mbox{\scriptsize$\mathcal{P}$}}z_{P_{i}}=M_{P_{i}}.

The data value assignment to nodes of this case can be done by replacing every constant $c\in\mbox{$\mathcal{C}$}_{P_{i}}$ with $f_{P_{i}}(c)$ .

Case 2

For the nodes that are assigned some constants from $\mathcal{D}$ .
We define a 1-1 mapping $f:\mbox{$\mathcal{D}$}\mapsto\{1,\ldots,|w|\}$ such that $f(d)\in\mbox{$\mathcal{N}$}_{P_{d}}$ , where $P_{d}$ is the set guessed in Step 3. Such a 1-1 mapping exists because of the following condition in the formula $\xi$ :

\bigwedge_{\scriptsize P_{i}\notin\mbox{$\mathcal{P}$}}z_{P_{i}}\geq|\{d^{\prime}\in\mbox{$\mathcal{D}$}\mid P_{d^{\prime}}=P_{i}\}|

The data value assignment to nodes of this case can be done by replacing every constant $d\in\mbox{$\mathcal{D}$}$ with $f(d)$ .

Case 3

For the nodes that are not assigned any constants from $\mbox{$\mathcal{C}$}_{P}$ ’s and $\mathcal{D}$ .
First we assign each such zone in $t$ (a zone in $t$ can be recognised from the profile information in $t^{\prime}$ ) a data value such that for each $S\subseteq\Gamma$ ,

V^{zone}_{t}(S)=\bigcup_{P_{i}\ni S\ \textrm{and}\ P_{i}\notin\mbox{\scriptsize$\mathcal{P}$}}\mbox{$\mathcal{N}$}_{P_{i}}

This step can be done as follows. The number of such $S$ -zones in $t$ is at least $\sum_{P_{i}\ni S\ \textrm{and}\ P_{i}\notin\mbox{\scriptsize$\mathcal{P}$}}|\mbox{$\mathcal{N}$}_{P_{i}}|$ , due to the condition below in the formula $\xi$ :

x_{S}\ \geq\ \sum_{P_{i}\ni S\ \textrm{and}\ P_{i}\notin\mbox{\scriptsize$\mathcal{P}$}}z_{P_{i}}.

Thus, we can simply assign every $S$ -zone with a data value from $\bigcup_{P_{i}\ni S\ \textrm{and}\ P_{i}\notin\mbox{\scriptsize$\mathcal{P}$}}\mbox{$\mathcal{N}$}_{P_{i}}$ , and make sure every data value from $\bigcup_{P_{i}\ni S\ \textrm{and}\ P_{i}\notin\mbox{\scriptsize$\mathcal{P}$}}\mbox{$\mathcal{N}$}_{P_{i}}$ appears in some $S$ -zone.
However, by assigning data values like that, some adjacent zones may get the same data values. Here we apply Lemma 4.1. Since for each $P_{i}\notin\mbox{$\mathcal{P}$}$ , $|\mbox{$\mathcal{N}$}_{P_{i}}|\geq 2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{(K^{3})}+1$ , by the condition below in the formula $\xi$

\bigwedge_{P_{i}\notin\mbox{\scriptsize$\mathcal{P}$}}z_{P_{i}}\geq 2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{(K^{3})}+1,

the cardinality

\Big{|}\bigcup_{P_{i}\ni S\ \textrm{and}\ P_{i}\notin\mbox{\scriptsize$\mathcal{P}$}}\mbox{$\mathcal{N}$}_{P_{i}}\Big{|}=\sum_{P_{i}\ni S\ \textrm{and}\ P_{i}\notin\mbox{\scriptsize$\mathcal{P}$}}|\mbox{$\mathcal{N}$}_{P_{i}}|\geq 2\cdot K^{K^{3}}\cdot 2^{K}+2\cdot K^{(K^{3})}+1.

The outdegree of such a zone is $\leq K^{(K^{3})}$ , hence the number of nodes in the zone is also $\leq K^{(K^{3})}$ . Since each node can have indegree at most $1$ , the degree of each such zone is $\leq 2\cdot K^{(K^{3})}$ . By applying Lemma 4.1, where $\deg(G)=2\cdot K^{(K^{3})}$ , we can reassign the data values in such zones so that adjacent zones get different data values.
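The role of the degree bound can be illustrated with a plain greedy colouring. This is not Lemma 4.1 itself: the sketch only shows why a palette with more values than the maximum degree always allows adjacent zones to receive different values, and it does not guarantee that every value of the palette is eventually used, which the proof above additionally requires. The graph encoding (a dictionary from zone to the list of its adjacent zones) is an assumption made for illustration.

```python
def greedy_assign(adjacency, palette):
    """Give every zone a value from palette that differs from its neighbours.

    adjacency: dict mapping each zone to the list of its adjacent zones.
    palette:   list of available data values; it must contain more values
               than the maximum degree of the graph.
    """
    assignment = {}
    for zone in adjacency:
        used = {assignment[n] for n in adjacency[zone] if n in assignment}
        # At most deg(zone) values are blocked, so a free value always exists.
        assignment[zone] = next(v for v in palette if v not in used)
    return assignment
```

With three pairwise adjacent zones (maximum degree 2) and a palette of three values, every zone receives its own value.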

This completes the proof of our Claim.

6 Weak ODTA

A weak ODTA over $\Sigma$ is a triplet $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ where $\mathcal{T}$ is a letter-to-letter transducer from $\Sigma$ to the output alphabet $\Gamma$ , and $\mathcal{M}$ is a finite state automaton over $2^{\Gamma}$ and $\Gamma_{0}\subseteq\Gamma$ . An ordered-data tree $t$ is accepted by $\mathcal{S}$ , denoted by $t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ , if there exists an ordered-data tree $t^{\prime}$ over $\Gamma$ such that

•

on input $\mbox{$\textsf{Proj}$}(t)$ , the transducer $\mathcal{T}$ outputs $t^{\prime}$ ;
•

the automaton $\mathcal{M}$ accepts the string $\mbox{$\mathcal{V}$}_{\Gamma}(t^{\prime})$ ; and
•

for every $a\in\Gamma_{0}$ , all the $a$ -nodes in $t^{\prime}$ have different data values.
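The last two acceptance conditions can be checked mechanically once an output tree $t^{\prime}$ is given. The sketch below makes several assumptions that are not part of the paper: $t^{\prime}$ is passed as a list of (label, data value) pairs, $\mathcal{V}_{\Gamma}(t^{\prime})$ is taken to be the string that lists, for the data values in increasing order, the set of labels of the nodes carrying each value, and $\mathcal{M}$ is a non-deterministic automaton given by a hypothetical triple encoding of its transitions. The transducer $\mathcal{T}$ is not modelled.

```python
from collections import defaultdict


def string_representation(nodes):
    """V_Gamma(t'): one letter (a set of labels) per data value, in increasing order."""
    labels_of = defaultdict(set)
    for label, value in nodes:
        labels_of[value].add(label)
    return [frozenset(labels_of[d]) for d in sorted(labels_of)]


def nfa_accepts(word, initial, delta, final):
    """Run an NFA whose transitions are triples (state, letter, next_state)."""
    current = {initial}
    for letter in word:
        current = {q2 for (q, s, q2) in delta if q in current and s == letter}
    return bool(current & set(final))


def distinct_on(nodes, gamma0):
    """Check that, for each label in gamma0, its nodes carry pairwise different values."""
    seen = set()
    for label, value in nodes:
        if label in gamma0:
            if (label, value) in seen:
                return False
            seen.add((label, value))
    return True


def accepts_output(nodes, initial, delta, final, gamma0):
    """Second and third acceptance conditions of a weak ODTA on the output tree."""
    word = string_representation(nodes)
    return nfa_accepts(word, initial, delta, final) and distinct_on(nodes, gamma0)
```

Note that the tree structure plays no role in these two conditions, which is what separates weak ODTA from ODTA.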

Note that the only difference between weak ODTA and ODTA is the equality test on the data values in neighboring nodes. Such difference is the cause of the triple exponential leap in complexity, as stated in the following theorem.

Theorem 6.1.

The non-emptiness problem for weak ODTA is in NP.

Proof 6.2.

Let $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ be a weak ODTA. Let $\Sigma,Q,\Gamma$ be the input alphabet, set of states and output alphabet of $\mathcal{T}$ , respectively.

We need the following notation. For a tree $t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ , its extended tree $\tilde{t}$ (with respect to the weak ODTA $\mathcal{S}$ ) is a tree over the alphabet $\Sigma\times Q\times\Gamma$ , where

•

the projection of $\tilde{t}$ to $\Sigma$ is $t$ ;
•

the projection of $\tilde{t}$ to $Q$ is an accepting run of $\mathcal{T}$ on $t$ such that its output is the projection of $\tilde{t}$ to $\Gamma$ .

The decision procedure for Theorem 6.1 works as follows.

1.

Construct an automaton $\mathcal{A}$ over the alphabet $\Sigma\times Q\times\Gamma$ for the extended trees accepted by $\mathcal{T}$ .
2.

Let $\mbox{$\mathcal{P}$}=\{S_{1},\ldots,S_{m}\}\subseteq 2^{\Gamma}$ be the set of symbols used in $\mathcal{M}$ .
By applying Proposition 2.3, construct the Presburger formula $\xi_{\mbox{\scriptsize$\mathcal{M}$}}(x_{S_{1}},\ldots,x_{S_{m}})$ for $\mathcal{M}$ .

3.

Let $\Sigma\times Q\times\Gamma=\{(a_{1},q_{1},\alpha_{1}),\ldots,(a_{k},q_{n},\alpha_{\ell})\}$ . Let $\varphi(x_{(a_{1},q_{1},\alpha_{1})},\ldots,x_{(a_{k},q_{n},\alpha_{\ell})})$ be the following formula:

\begin{aligned}
&\exists x_{\alpha_{1}}\cdots\exists x_{\alpha_{\ell}}\ \exists x_{S_{1}}\cdots\exists x_{S_{m}}\ \xi_{\mbox{\scriptsize$\mathcal{M}$}}(x_{S_{1}},\ldots,x_{S_{m}})\\
&\quad\wedge\ \bigwedge_{\alpha_{i}\in\Gamma}\Big{(}x_{\alpha_{i}}=\sum_{a_{j}\in\Sigma,q_{h}\in Q}x_{(a_{j},q_{h},\alpha_{i})}\Big{)}\\
&\quad\wedge\ \bigwedge_{\alpha_{i}\in\Gamma}\Big{(}x_{\alpha_{i}}\geq\sum_{\alpha_{i}\in S_{j}}x_{S_{j}}\Big{)}\ \wedge\ \bigwedge_{\alpha_{i}\in\Gamma_{0}}\Big{(}x_{\alpha_{i}}=\sum_{\alpha_{i}\in S_{j}}x_{S_{j}}\Big{)}.
\end{aligned}

4.

Test the non-emptiness of APC $(\mbox{$\mathcal{A}$},\varphi(x_{(a_{1},q_{1},\alpha_{1})},\ldots,x_{(a_{k},q_{n},\alpha_{\ell})}))$ .

That this procedure works in NP follows directly from the fact that the non-emptiness problem of APC is in NP.
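The linear part of $\varphi$ is easy to evaluate once a candidate witness for the $x_{S_{j}}$ ’s is fixed. The following sketch checks only the three families of counting constraints and leaves out $\xi_{\mbox{\scriptsize$\mathcal{M}$}}$ , which depends on the automaton. The dictionary encodings of the Parikh vector and of the witness are assumptions made for illustration.

```python
def phi_linear_part(triple_counts, set_counts, gamma, gamma0):
    """Check the counting constraints of phi for a given witness.

    triple_counts: dict (a, q, alpha) -> number of nodes with that label.
    set_counts:    dict S (frozenset of labels) -> candidate value of x_S.
    gamma, gamma0: output alphabet and its distinguished subset.
    """
    # x_alpha is the sum of x_(a, q, alpha) over all a and q.
    x = {alpha: 0 for alpha in gamma}
    for (_a, _q, alpha), n in triple_counts.items():
        x[alpha] += n
    for alpha in gamma:
        total = sum(n for S, n in set_counts.items() if alpha in S)
        if x[alpha] < total:                      # x_alpha >= sum of x_S with alpha in S
            return False
        if alpha in gamma0 and x[alpha] != total:  # equality for labels in Gamma_0
            return False
    return True
```

For instance, three $a$ -nodes and one $b$ -node together with $x_{\{a,b\}}=x_{\{a\}}=1$ satisfy the constraints when $\Gamma_{0}=\{b\}$ , but not when $\Gamma_{0}=\{a\}$ , since three $a$ -nodes cannot carry only two distinct data values without repetition.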

We now show the correctness of our algorithm by showing that $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})\neq\emptyset$ if and only if $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{A}$},\varphi)\neq\emptyset$ . (For the sake of presentation, we write $\varphi$ without its free variables.) We start with the “only if” part. Suppose that $t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ . We claim that the extended tree $\tilde{t}$ of $t$ is accepted by $(\mbox{$\mathcal{A}$},\varphi)$ . Obviously, $\tilde{t}\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{A}$})$ . To show that $\varphi(\mbox{$\textsf{Parikh}$}(\tilde{t}))$ holds, let $t^{\prime}$ be the data tree obtained by projecting $\tilde{t}$ to $\Gamma$ and the data value in each node comes from the same node in $t$ . That is, $t^{\prime}$ is an output of $\mathcal{T}$ on $t$ . We will show that $\varphi(\mbox{$\textsf{Parikh}$}(\tilde{t}))$ holds.

•

As witness to $x_{S_{1}},\ldots,x_{S_{m}}$ , we take $\mbox{$\textsf{Parikh}$}(\mbox{$\mathcal{V}$}(t^{\prime}))$ . Since $\mbox{$\mathcal{V}$}(t^{\prime})\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})$ , by Proposition 2.3, $\xi_{\mbox{\scriptsize$\mathcal{M}$}}(\mbox{$\textsf{Parikh}$}(\mbox{$\mathcal{V}$}(t^{\prime})))$ holds.
•

As witness to $x_{\alpha_{1}},\ldots,x_{\alpha_{\ell}}$ , we take $\mbox{$\textsf{Parikh}$}(t^{\prime})$ . Now for each $\alpha_{i}\in\Gamma$ , the constraint $x_{\alpha_{i}}\geq\sum_{\alpha_{i}\in S_{j}}x_{S_{j}}$ holds since the number of data values in the $\alpha_{i}$ -nodes cannot exceed the number of $\alpha_{i}$ -nodes itself. The constraint $x_{\alpha_{i}}=\sum_{\alpha_{i}\in S_{j}}x_{S_{j}}$ holds for each $\alpha_{i}\in\Gamma_{0}$ , since the data values found in $\alpha_{i}$ -nodes are all different.

Thus, $\varphi(\mbox{$\textsf{Parikh}$}(\tilde{t}))$ holds, and this concludes our proof of the “only if” part.
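The counting argument above can be checked on small instances. The following Python sketch is illustrative only: it assumes that a data tree is flattened to a list of `(label, value)` pairs (the tree shape plays no role in these constraints) and that $x_{S}$ counts the data values whose set of labels is exactly $S$ , as the surrounding text suggests. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def label_sets(nodes):
    """Map each data value to the set of labels of the nodes carrying it."""
    labels_of = defaultdict(set)
    for label, value in nodes:
        labels_of[value].add(label)
    return {value: frozenset(labels) for value, labels in labels_of.items()}


def check_counting_constraints(nodes, keys):
    """Check x_a >= sum_{a in S} x_S for every label a, with equality if a is in keys."""
    x_label = Counter(label for label, _ in nodes)   # x_a: number of a-nodes
    x_set = Counter(label_sets(nodes).values())      # x_S: number of values with label set S
    for a, node_count in x_label.items():
        distinct_values = sum(n for s, n in x_set.items() if a in s)
        if node_count < distinct_values:
            return False          # cannot happen for a real tree; kept for clarity
        if a in keys and node_count != distinct_values:
            return False          # two a-nodes share a data value
    return True
```

For instance, on the nodes `[("a", 1), ("a", 1), ("b", 1), ("b", 2)]` the check succeeds when only `b` is required to carry pairwise different data values, and fails when `a` is.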

Now we prove the “if” part. Suppose that $\tilde{t}\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{A}$},\varphi)$ . So $\tilde{t}\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{A}$})$ . Let $t$ and $t^{\prime}$ be the $\Sigma$ - and $\Gamma$ -projection of $\tilde{t}$ , respectively. By the definition of $\mathcal{A}$ , $t^{\prime}$ is an output of $\mathcal{T}$ on $t$ . Now since $\varphi(\mbox{$\textsf{Parikh}$}(\tilde{t}))$ holds, in particular there exists a witness $\bar{M}=(M_{1},\ldots,M_{m})$ to $x_{S_{1}},\ldots,x_{S_{m}}$ such that $\xi_{\mbox{\scriptsize$\mathcal{M}$}}(\bar{M})$ holds. By Proposition 2.3, there exists a word $w\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})$ over the alphabet $2^{\Gamma}$ such that $\mbox{$\textsf{Parikh}$}(w)=\bar{M}$ .

We are going to assign data values to the nodes of $t^{\prime}$ (thus, also to those of $t$ ) such that $t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ . The assignment is done as follows. For each $S\subseteq\Gamma$ , let $V_{w}(S)$ be the set of positions of $w$ labeled with $S$ . Now for each $\alpha\in\Gamma$ , we assign the $\alpha$ -nodes in $t^{\prime}$ with the data values from $\bigcup_{\alpha\in S}V_{w}(S)$ such that $V_{t^{\prime}}(\alpha)=\bigcup_{\alpha\in S}V_{w}(S)$ . This is possible due to the constraint $x_{\alpha}\geq\sum_{\alpha\in S}x_{S}$ .

With such an assignment, we get $\mbox{$\mathcal{V}$}(t^{\prime})=w$ . Thus, $\mbox{$\mathcal{V}$}(t^{\prime})\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})$ . Moreover, for every $\alpha\in\Gamma_{0}$ , all the data values in $\alpha$ -nodes are different, which follows from the constraint $x_{\alpha}=\sum_{\alpha\in S}x_{S}$ . Therefore, the resulting ordered-data tree $t$ is in $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ . This concludes our proof.
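The assignment of data values used in the “if” part can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure verbatim: the nodes of $t^{\prime}$ are given only by their labels, the word $w$ is a list of non-empty sets of labels, and the positions $1,\ldots,|w|$ of $w$ serve as the data values. The function name is hypothetical.

```python
def assign_data_values(node_labels, word):
    """Assign positions of `word` as data values so that each position i is used
    by exactly the labels in word[i-1]. Returns a list of values, or None if
    no such assignment exists."""
    if any(len(symbol) == 0 for symbol in word):
        return None  # a data value must occur in at least one node
    # positions[a]: the data values that a-nodes must cover
    positions = {}
    for i, symbol in enumerate(word, start=1):
        for a in symbol:
            positions.setdefault(a, []).append(i)
    values, used = [], {}
    for a in node_labels:
        pool = positions.get(a)
        if not pool:
            return None  # an a-node exists but no position of the word contains a
        k = used.get(a, 0)
        # hand out each required value once, then repeat the last one
        values.append(pool[k] if k < len(pool) else pool[-1])
        used[a] = k + 1
    # x_a >= sum_{a in S} x_S: there must be enough a-nodes to cover the pool
    if any(used.get(a, 0) < len(pool) for a, pool in positions.items()):
        return None
    return values
```

When the result is not `None`, the set of data values found in the `a`-nodes is exactly the set of positions of the word whose symbol contains `a`, which is the property the proof relies on.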

Next, we give the logical characterisation of weak ODTA.

Theorem 6.3.

A language $\mathcal{L}$ is accepted by a weak ODTA if and only if $\mathcal{L}$ is expressible with a formula of the form: $\exists X_{1}\cdots\exists X_{m}\ \varphi\wedge\psi$ , where $\varphi$ is a formula from $\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$})$ , and $\psi$ is a formula from $\mbox{$\textrm{FO}$}(\sim,\prec,\mbox{$\prec_{suc}$})$ , extended with the unary predicates $X_{1},\ldots,X_{m}$ .

The proof of Theorem 6.3 is the same as the proof of Theorem 5.9. The difference is that to simulate the $\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$})$ formula $\varphi$ , the profile information is not necessary. The complexity of the translation is still the same as in Theorem 5.9.

6.1 Extending weak ODTA with Presburger constraints

Like in the case of APC, we can extend weak ODTA with Presburger constraints without increasing the complexity of its non-emptiness problem. Let $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ be a weak ODTA, where $\Sigma$ and $\Gamma$ are the input and output alphabets of $\mathcal{T}$ , respectively. Let $\Gamma=\{\alpha_{1},\ldots,\alpha_{\ell}\}$ .

A weak ODTA $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ extended with Presburger constraints is a tuple $\langle\mbox{$\mathcal{S}$},\xi\rangle$ , where $\xi(x_{1},\ldots,x_{\ell},y_{1},\ldots,y_{2^{\ell}-1})$ is an existential Presburger formula with the free variables $x_{1},\ldots,x_{\ell},y_{1},\ldots,y_{2^{\ell}-1}$ . An ordered-data tree $t$ is accepted by $\langle\mbox{$\mathcal{S}$},\xi\rangle$ if there exists an output $t^{\prime}$ of $\mathcal{T}$ on $t$ such that the automaton $\mathcal{M}$ accepts $\mbox{$\mathcal{V}$}_{\Gamma}(t^{\prime})$ , for each $a\in\Gamma_{0}$ all $a$ -nodes in $t^{\prime}$ have different data values, and $\xi(\mbox{$\textsf{Parikh}$}(t^{\prime}),\mbox{$\textsf{Parikh}$}(\mbox{$\mathcal{V}$}_{\Gamma}(t^{\prime})))$ holds. We write $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$},\xi)$ to denote the set of ordered-data trees accepted by $\langle\mbox{$\mathcal{S}$},\xi\rangle$ .

We claim that the non-emptiness problem of weak ODTA extended with Presburger constraints is still decidable in NP. The reason is as follows. The non-emptiness of a weak ODTA $\mathcal{S}$ is checked by converting $\mathcal{S}$ into an APC $(\mbox{$\mathcal{A}$},\varphi)$ , where $\varphi$ expresses linear constraints on the number of nodes labeled with symbols from $\Sigma$ and $\Gamma$ as well as those labeled with $Q$ in the accepting run. The formula $\xi$ can be appropriately “inserted” into $\varphi$ , and hence, the non-emptiness of $\langle\mbox{$\mathcal{S}$},\xi\rangle$ is reducible to non-emptiness of APC, which is in NP.

6.2 Comparison with other known decidable formalisms

We are going to compare the expressiveness of weak ODTA with other known models with decidable non-emptiness.

6.2.1 DTD with integrity constraints

An XML document is typically viewed as a data tree. The most common XML formalism is Document Type Definition (DTD). In short, a DTD is a context-free grammar, and a tree $t$ conforms to a DTD $D$ if it is a derivation tree of a word accepted by the context-free grammar.

The most commonly used XML constraints are integrity constraints which are of two types.

•

The key constraint $key(a)$ is the following constraint:

$\forall x\forall y(a(x)\wedge a(y)\wedge x\sim y\to x=y).$
•

The inclusion constraint $V(a)\subseteq V(b)$ is the following constraint:

$\forall x\exists y(a(x)\to b(y)\wedge x\sim y).$

The satisfiability problem of a given DTD $D$ and a collection $\mathcal{C}$ of integrity constraints asks whether there exists an ordered-data tree $t$ that conforms to $D$ and satisfies all the constraints in $\mathcal{C}$ . In [Fan and Libkin (2002)] it is shown that this problem is NP-complete.

Theorem 6.4.

Given a DTD $D$ and a collection $\mathcal{C}$ of integrity constraints, one can construct a weak ODTA $\mathcal{S}$ such that $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ is precisely the set of ordered-data trees that conform to $D$ and satisfy all constraints in $\mathcal{C}$ .

Proof 6.5.

Let $\Sigma$ be the alphabet of the given DTD $D$ . Consider the following weak ODTA $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Sigma_{0}\rangle$ .

•

$\mathcal{T}$ is an identity transducer that checks whether the input tree conforms to DTD $D$ .
•

$\mathcal{M}$ is an automaton that accepts $\mbox{$\mathcal{P}$}^{\ast}$ , where $\mbox{$\mathcal{P}$}=2^{\Sigma}-\{S\mid a\in S\ \mbox{and}\ b\notin S\ \mbox{for some}\ V(a)\subseteq V(b)\in\mbox{$\mathcal{C}$}\}$ .
•

$\Sigma_{0}=\{a\mid key(a)\in\mbox{$\mathcal{C}$}\}$ .

That $\mathcal{S}$ is the desired ODTA follows immediately from the fact that for every ordered-data tree $t$ , $V_{t}(a)\subseteq V_{t}(b)$ if and only if $[S]_{t}=\emptyset$ for all $S$ where $a\in S$ , but $b\notin S$ .
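The alphabet $\mathcal{P}$ of allowed symbols and the equivalence used in this proof can be illustrated on small instances. The sketch below is only an illustration under assumptions: a data tree is flattened to a list of `(label, value)` pairs, inclusion constraints $V(a)\subseteq V(b)$ are given as pairs `(a, b)`, and the string of label sets is read off in increasing order of data values. All function names are hypothetical.

```python
from itertools import combinations


def powerset(sigma):
    """All subsets of sigma as frozensets (exponentially many)."""
    items = sorted(sigma)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def allowed_symbols(sigma, inclusions):
    """P = 2^Sigma minus the sets S with a in S and b not in S for some (a, b)."""
    return {s for s in powerset(sigma)
            if not any(a in s and b not in s for a, b in inclusions)}


def value_word(nodes):
    """Label sets of the data values, listed in increasing order of the values."""
    labels_of = {}
    for label, value in nodes:
        labels_of.setdefault(value, set()).add(label)
    return [frozenset(labels_of[d]) for d in sorted(labels_of)]


def satisfies_inclusions(nodes, sigma, inclusions):
    """True iff every symbol of the value word belongs to P."""
    allowed = allowed_symbols(sigma, inclusions)
    return all(symbol in allowed for symbol in value_word(nodes))
```

Enumerating `powerset(sigma)` makes the exponential size of the alphabet of $\mathcal{M}$ in this construction explicit.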

The automaton $\mathcal{M}$ , and hence $\mathcal{S}$ , produced by our construction in Theorem 6.4 is of exponential size. This blow-up is tight, as the following example shows. Consider the case where $\mathcal{C}$ does not contain inclusion constraints. That is, $\mathcal{C}$ contains only key constraints. Then any equivalent ODTA $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Sigma_{0}\rangle$ will have $\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})=(2^{\Sigma}-\{\emptyset\})^{\ast}$ . Thus, we have an exponential blow-up in the size of $\mathcal{M}$ . Nevertheless, if we are concerned only with satisfiability, then we can lower the complexity to NP as stated in the following theorem.

Theorem 6.6.

Given a DTD $D$ and a collection $\mathcal{C}$ of integrity constraints, one can construct a weak ODTA $\mathcal{S}$ in non-deterministic polynomial time such that $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})\neq\emptyset$ if and only if there exists an ordered-data tree $t$ that conforms to $D$ and satisfies all the constraints in $\mathcal{C}$ .

Proof 6.7.

Let $\Sigma$ be the alphabet of the DTD $D$ . We non-deterministically construct a weak ODTA $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Sigma_{0}\rangle$ as follows.

•

$\mathcal{T}$ is an identity transducer that checks whether the input tree conforms to DTD $D$ .
•
Guess a sequence $(H_{1},\ldots,H_{k})$ of some subsets of $\Sigma$ such that
- –
  
  $\Sigma$ is partitioned into $H_{1}\cup\cdots\cup H_{k}$ ;
- –
  
  for every two different symbols $a,b\in\Sigma$ , $a,b$ are in the same set $H_{i}$ if and only if both $V(a)\subseteq V(b)$ and $V(b)\subseteq V(a)$ are in $\mathcal{C}$ ;
- –
  
  if $V(a)\subseteq V(b)\in\mbox{$\mathcal{C}$}$ and $V(b)\subseteq V(a)\not\in\mbox{$\mathcal{C}$}$ , then $a\in H_{i}$ and $b\in H_{j}$ and $i\leq j$ .
Intuitively, the sequence $(H_{1},\ldots,H_{k})$ tells us an ordering of the elements in $\Sigma$ that respects the inclusion constraints in $\mathcal{C}$ , where if both $V(a)\subseteq V(b)$ and $V(b)\subseteq V(a)$ are in $\mathcal{C}$ , then $a$ and $b$ are tied and they must be in the same set $H_{i}$ .
•

Let $S_{1},\ldots,S_{k}\subseteq\Sigma$ be such that $S_{i}=\Sigma-(H_{1}\cup\cdots\cup H_{i-1})$ , where $S_{1}=\Sigma$ and $S_{k}=H_{k}$ .
•

$\mathcal{M}$ is a non-deterministic automaton over the alphabet $\{S_{1},\ldots,S_{k}\}$ , where the set of states is $\{q_{1},\ldots,q_{k}\}$ , all $q_{1},\ldots,q_{k}$ are the initial states and the final states, and the transitions are: $(q_{i},S_{j},q_{j})$ for every $1\leq i\leq j\leq k$ .
•

$\Sigma_{0}=\{a\mid key(a)\in\mbox{$\mathcal{C}$}\}$ .

We claim that $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})\neq\emptyset$ if and only if there exists an ordered-data tree $t$ that conforms to $D$ and satisfies all the constraints in $\mathcal{C}$ .

We start with the “if” direction. Suppose $t$ conforms to the DTD $D$ and satisfies all the constraints in $\mathcal{C}$ . For each $a\in\Sigma$ , let $N_{a}$ be the number of data values found in the $a$ -nodes in $t$ . Let $(H_{1},\ldots,H_{k})$ be a sequence of some subsets of $\Sigma$ such that

•

$\Sigma$ is partitioned into $H_{1}\cup\cdots\cup H_{k}$ ;
•

for every two different symbols $a,b\in\Sigma$ , $a,b$ are in the same set $H_{i}$ if and only if $N_{a}=N_{b}$ ;
•

$a\in H_{i}$ and $b\in H_{j}$ and $i\leq j$ if and only if $N_{a}\leq N_{b}$ .

Consider the following ordered-data tree $t^{\prime}$ over $\Sigma$ , where $t^{\prime}$ is obtained from $t$ by reassigning the data values on the nodes in $t$ as follows. For each $a\in\Sigma$ , we assign the set of integers $\{d\mid 1\leq d\leq N_{a}\}$ as the data values of $a$ -nodes in $t^{\prime}$ . Such an assignment is possible since $N_{a}$ is no more than the number of $a$ -nodes in $t^{\prime}$ . With such an assignment, $t^{\prime}$ still obeys the constraints in $\mathcal{C}$ , as shown below.

•

If $key(a)\in\mbox{$\mathcal{C}$}$ , then $N_{a}$ is precisely the number of $a$ -nodes in $t$ , thus, also in $t^{\prime}$ . Thus, with the data values $\{1,\ldots,N_{a}\}$ , the data values on the $a$ -nodes in $t^{\prime}$ are all different.
•

If $V(a)\subseteq V(a^{\prime})\in\mbox{$\mathcal{C}$}$ , then obviously, $N_{a}\leq N_{a^{\prime}}$ . Thus, $t^{\prime}$ still satisfies the constraint $V(a)\subseteq V(a^{\prime})$ , since the data values in $a$ -nodes in $t^{\prime}$ are $\{1,2,\ldots,N_{a}\}$ , while those in $a^{\prime}$ -nodes are $\{1,2,\ldots,N_{a^{\prime}}\}$ .

Now the string $\mbox{$\mathcal{V}$}(t^{\prime})$ is of the form $R_{1}\cdots R_{m}$ , where $m=\max_{a\in\Sigma}(N_{a})$ and $R_{1}\supseteq R_{2}\supseteq\cdots\supseteq R_{m}$ , and if $R_{i}\neq R_{i+1}$ , then $R_{i}-R_{i+1}=H_{j}$ for some $H_{j}$ in the sequence $(H_{1},\ldots,H_{k})$ . By the definition of $\mathcal{M}$ , $\mbox{$\mathcal{V}$}(t^{\prime})$ is accepted by $\mathcal{M}$ . That $t^{\prime}$ is accepted by $\mathcal{T}$ is trivial, and so is the fact that, for each $a\in\Sigma_{0}$ , all the data values found in $a$ -nodes in $t^{\prime}$ are different. Thus, $t^{\prime}\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ .
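The reassignment of data values used in this direction can be tried out on a small example. The sketch below is an illustration under assumptions: a data tree is flattened to a list of `(label, value)` pairs, and the particular reassignment chosen (the $r$ -th smallest value among the $a$ -nodes becomes $r$ ) is just one way of giving the $a$ -nodes the values $\{1,\ldots,N_{a}\}$ . The function names are hypothetical.

```python
def canonical_reassignment(nodes):
    """Give the a-nodes the data values 1..N_a, where N_a is the number of
    distinct data values among the a-nodes; equal values stay equal."""
    distinct = {}
    for label, value in nodes:
        distinct.setdefault(label, set()).add(value)
    # rank[a][v] = r if v is the r-th smallest value found in a-nodes
    rank = {
        label: {v: r for r, v in enumerate(sorted(values), start=1)}
        for label, values in distinct.items()
    }
    return [(label, rank[label][value]) for label, value in nodes]


def value_word(nodes):
    """Label sets of the data values, listed in increasing order of the values."""
    labels_of = {}
    for label, value in nodes:
        labels_of.setdefault(value, set()).add(label)
    return [frozenset(labels_of[d]) for d in sorted(labels_of)]
```

After the reassignment, the label set at data value $d$ is $\{a\mid N_{a}\geq d\}$ , so the resulting string of label sets is a chain that is decreasing with respect to inclusion.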

For the “only if” direction, it is sufficient to observe that for every sequence $(H_{1},\ldots,H_{k})$ that “respects” the inclusion constraints in $\mathcal{C}$ as explained above, if $\mbox{$\mathcal{V}$}(t)\in\mbox{$\mathcal{L}$}(\mbox{$\mathcal{M}$})$ , then $t$ satisfies all the inclusion constraints in $\mathcal{C}$ . This completes our proof.
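The automaton $\mathcal{M}$ built from a guessed sequence $(H_{1},\ldots,H_{k})$ is small enough to simulate directly. The following sketch is a hedged illustration: it assumes the sets $H_{i}$ are non-empty and pairwise disjoint, represents symbols as frozensets, and uses 0-based indices for the states. The function names are hypothetical.

```python
def suffix_sets(h_sequence):
    """S_i = Sigma - (H_1 u ... u H_{i-1}), for i = 1..k."""
    sigma = frozenset().union(*h_sequence)
    result, removed = [], set()
    for h in h_sequence:
        result.append(frozenset(sigma - removed))
        removed |= set(h)
    return result


def accepts(h_sequence, word):
    """Simulate M: transitions (q_i, S_j, q_j) for i <= j, with every state
    both initial and final. A word is accepted iff every symbol is some S_j
    and the indices j never decrease."""
    index = {s: j for j, s in enumerate(suffix_sets(h_sequence))}
    current = 0  # smallest index still allowed; any first symbol S_j is fine
    for symbol in word:
        j = index.get(frozenset(symbol))
        if j is None or j < current:
            return False
        current = j
    return True
```

Since $\mathcal{M}$ has $k\leq|\Sigma|$ states and at most $k^{2}$ transitions, this part of the construction is polynomial once the sequence has been guessed.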

6.2.2 Set and linear constraints for data trees

In the paper [David et al. (2012)] the set and linear constraints are introduced for data trees. As argued there, those constraints, together with automata, are able to capture many interesting properties commonly used in XML practice. We review those constraints and show how they can be captured by weak ODTA extended with Presburger constraints.

Data-terms (or just terms) are given by the grammar

$\tau:=V(a)\ |\ \tau\cup\tau\ |\ \tau\cap\tau\ |\ \overline{\tau}\quad\quad\mbox{for}\ a\in\Sigma.$

The semantics of $\tau$ is defined with respect to a data tree $t$ :

$\begin{array}{lll}{\llbracket V(a)\rrbracket}_{t}=V_{t}(a)&&{\llbracket\overline{\tau}\rrbracket}_{t}=V_{t}-{\llbracket\tau\rrbracket}_{t}\\ {\llbracket\tau_{1}\cap\tau_{2}\rrbracket}_{t}={\llbracket\tau_{1}\rrbracket}_{t}\cap{\llbracket\tau_{2}\rrbracket}_{t}&&{\llbracket\tau_{1}\cup\tau_{2}\rrbracket}_{t}={\llbracket\tau_{1}\rrbracket}_{t}\cup{\llbracket\tau_{2}\rrbracket}_{t}\end{array}$

Recall that $V_{t}=\bigcup_{a\in\Sigma}V_{t}(a)$ – the set of data values found in the data tree $t$ .

A set constraint is either $\tau=\emptyset$ or $\tau\neq\emptyset$ , where $\tau$ is a term. A data tree $t$ satisfies $\tau=\emptyset$ , written as $t\models\tau=\emptyset$ , if and only if ${\llbracket\tau\rrbracket}_{t}=\emptyset$ (and likewise for $\tau\neq\emptyset$ ).

A linear constraint $\xi$ over the alphabet $\Sigma$ is a linear constraint on the variables $x_{a}$ , for each $a\in\Sigma$ and $z_{S}$ , for each $S\subseteq\Sigma$ . A data tree $t$ satisfies $\xi$ , if $\xi$ holds by interpreting $x_{a}$ as the number of $a$ -nodes in $t$ , and $z_{S}$ the cardinality $|[S]_{t}|$ .

Theorem 6.8.

Given a tree automaton $\mathcal{A}$ and a set $\mathcal{C}$ of set and linear constraints, there exists a weak ODTA $\langle\mbox{$\mathcal{S}$},\xi\rangle$ extended with Presburger constraints such that $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$},\xi)$ is precisely the set of ordered-data trees accepted by $\mathcal{A}$ that satisfy all the constraints in $\mathcal{C}$ . Moreover, the construction of $\langle\mbox{$\mathcal{S}$},\xi\rangle$ takes exponential time in the size of $\mathcal{A}$ and $\mathcal{C}$ .

Proof 6.9.

The proof is simply a restatement of the proof in [David et al. (2012)] into a language of weak ODTA. We need the following notation. For a data term $\tau$ , we define a family $\mbox{$\mathbb{S}$}(\tau)$ of subsets of ${\Sigma}$ as follows.

•

If $\tau=V(a)$ , then $\mbox{$\mathbb{S}$}(\tau)=\{S\mid a\in S\ \mbox{and}\ S\subseteq\Sigma\}$ .
•

If $\tau=\overline{\tau}_{1}$ , then $\mbox{$\mathbb{S}$}(\tau)=2^{\Sigma}-\mbox{$\mathbb{S}$}(\tau_{1})$ .
•

If $\tau=\tau_{1}\star\tau_{2}$ , then $\mbox{$\mathbb{S}$}(\tau)=\mbox{$\mathbb{S}$}(\tau_{1})\star\mbox{$\mathbb{S}$}(\tau_{2})$ , where $\star$ is $\cap$ or $\cup$ .

It follows that for every data tree $t$ , we have ${\llbracket\tau\rrbracket}_{t}=\bigcup_{S\in\mbox{\scriptsize$\mathbb{S}$}(\tau)}[S]_{t}$ . Recall that the sets $[S]_{t}$ ’s are disjoint.
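The identity between the semantics of a data term and the union of the classes $[S]_{t}$ over $\mbox{$\mathbb{S}$}(\tau)$ can be checked mechanically. The sketch below is illustrative and rests on assumptions: terms are encoded as nested tuples `("V", a)`, `("not", t)`, `("and", t1, t2)`, `("or", t1, t2)`, a data tree is flattened to a list of `(label, value)` pairs, and $[S]_{t}$ is taken to be the set of data values whose set of labels is exactly $S$ . All names are hypothetical.

```python
from itertools import combinations


def powerset(sigma):
    items = sorted(sigma)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def eval_term(term, nodes):
    """Direct semantics of a data term: a set of data values."""
    op = term[0]
    if op == "V":
        return {v for label, v in nodes if label == term[1]}
    if op == "not":
        return {v for _, v in nodes} - eval_term(term[1], nodes)
    left, right = eval_term(term[1], nodes), eval_term(term[2], nodes)
    return left & right if op == "and" else left | right


def family(term, sigma):
    """The family S(tau) of subsets of sigma, by induction on the term."""
    op = term[0]
    if op == "V":
        return {s for s in powerset(sigma) if term[1] in s}
    if op == "not":
        return set(powerset(sigma)) - family(term[1], sigma)
    left, right = family(term[1], sigma), family(term[2], sigma)
    return left & right if op == "and" else left | right


def classes(nodes):
    """Map each label set S to [S]_t, the values whose label set is exactly S."""
    labels_of = {}
    for label, v in nodes:
        labels_of.setdefault(v, set()).add(label)
    result = {}
    for v, labels in labels_of.items():
        result.setdefault(frozenset(labels), set()).add(v)
    return result


def eval_via_family(term, nodes, sigma):
    """Union of [S]_t over all S in S(tau)."""
    by_set = classes(nodes)
    out = set()
    for s in family(term, sigma):
        out |= by_set.get(s, set())
    return out
```

On any input, `eval_term` and `eval_via_family` should return the same set; the family computation is where the exponential cost in $|\Sigma|$ shows up.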

The desired $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Sigma_{0}\rangle$ is defined as follows. The transducer $\mathcal{T}$ is the identity transducer obtained from $\mathcal{A}$ , that is, it outputs its input tree unchanged and accepts exactly the trees accepted by $\mathcal{A}$ , and $\Sigma_{0}=\emptyset$ . The automaton $\mathcal{M}$ accepts a word $v\in(2^{\Sigma})^{\ast}$ if and only if

C1.

for every set constraint $\tau=\emptyset$ , $v$ does not contain any symbol from $\mbox{$\mathbb{S}$}(\tau)$ ;
C2.

for every set constraint $\tau\neq\emptyset$ , $v$ contains at least one symbol from $\mbox{$\mathbb{S}$}(\tau)$ .

The formula $\xi$ is the conjunction of all the linear constraints in $\mathcal{C}$ .

That $\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$},\xi)$ is indeed precisely the set of ordered-data trees accepted by $\mathcal{A}$ that satisfy all the constraints in $\mathcal{C}$ follows immediately from the definition of $\mathbb{S}$ . The exponential upper bound arises in constructing the automaton $\mathcal{M}$ , which requires enumerating each element of $2^{\Sigma}$ and checking whether both conditions C1 and C2 are satisfied. This completes the proof of Theorem 6.8.

6.2.3 $\mbox{$\textrm{FO}$}^{2}(+1,\mbox{$\prec_{suc}$})$ over text

Here we focus our attention on ordered-data words, which can be viewed as trees where each node has at most one child. We write $w={a_{1}\choose d_{1}}\cdots{a_{n}\choose d_{n}}$ to denote the ordered-data word in which position $i$ has label $a_{i}$ and data value $d_{i}$ . It is called a text if all the data values are different and the set of data values $\{d_{1},\ldots,d_{n}\}$ is precisely $\{1,\ldots,n\}$ .

It is shown in [Manuel (2010)] that the satisfiability problem for $\mbox{$\textrm{FO}$}^{2}(+1,\mbox{$\prec_{suc}$})$ over text is decidable. (The definition of text in [Manuel (2010)] is slightly different, but it is equivalent to our definition. However, it turns out that the key lemma proved in [Manuel (2010)] has a gap which is filled later on in [Figueira (2012b)]. The final result is still correct though.) The following theorem shows that this decidability can be obtained via weak ODTA.

Theorem 6.10.

For every formula $\varphi\in\mbox{$\textrm{FO}$}^{2}(+1,\mbox{$\prec_{suc}$})$ , one can construct effectively a weak ODTA $\mathcal{S}$ such that

•

for every text $w$ , if $w\in\mbox{$\mathcal{L}$}_{data}(\varphi)$ , then $w\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ ;
•

for every ordered-data word $w\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ , there exists a text $w^{\prime}\in\mbox{$\mathcal{L}$}_{data}(\varphi)$ such that $\mbox{$\textsf{Proj}$}(w)=\mbox{$\textsf{Proj}$}(w^{\prime})$ .

The construction of $\mathcal{S}$ takes double exponential time in the size of $\varphi$ .

Proof 6.11.

In [Manuel (2010)], the decidability is proved by constructing its so called text automata, also defined in [Manuel (2010)]. We review the precise definition here. Let $w={a_{1}\choose d_{1}}\cdots{a_{n}\choose d_{n}}$ be a text over the alphabet $\Sigma$ . Therefore, $\mbox{$\mathcal{V}$}(w)=S_{1}\cdots S_{n}$ is such that each $S_{i}$ is a singleton.

We define $msp(w)$ , the marked string projection of $w$ , as the word $(a_{1},b_{1})\cdots(a_{n},b_{n})$ , where $b_{i}\in\{-1,1,*\}$ and

$b_{i}=\left\{\begin{array}{ll}-1&\mbox{if}\ 1\leq i<n\ \mbox{and}\ d_{i+1}+1=d_{i}\\ 1&\mbox{if}\ 1\leq i<n\ \mbox{and}\ d_{i}+1=d_{i+1}\\ {*}&\mbox{otherwise}\end{array}\right.$

A text automaton over the alphabet $\Sigma$ is a pair $(T_{1},T_{2})$ , where

•

$T_{1}$ is a non-deterministic letter-to-letter word transducer with the input alphabet $\Sigma\times\{-1,1,*\}$ and the output alphabet $\Gamma$ .
•

$T_{2}$ is a non-deterministic finite state automaton over $\Gamma$ .

A text $w={a_{1}\choose d_{1}}\cdots{a_{n}\choose d_{n}}$ is accepted by the text automaton $(T_{1},T_{2})$ , if

•

$msp(w)$ is accepted by $T_{1}$ , yielding a string $\alpha_{1}\cdots\alpha_{n}$ ;
•

the string $\alpha_{i_{1}}\cdots\alpha_{i_{n}}$ is accepted by $T_{2}$ , where the indexes $i_{1},\ldots,i_{n}$ are such that $1=d_{i_{1}}<d_{i_{2}}<\cdots<d_{i_{n}}=n$ .
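The two ingredients of this acceptance condition, the marked string projection and the reordering of the transducer output by increasing data value, can be sketched in a few lines. The code below is illustrative only: a text is a list of `(label, value)` pairs, positions are 0-indexed in the code, the marker `*` is represented by the string `"*"`, and the transducer output is passed in as a plain list. The function names are hypothetical.

```python
def is_text(word):
    """All data values are different and they are exactly 1..n."""
    return sorted(d for _, d in word) == list(range(1, len(word) + 1))


def msp(word):
    """Marked string projection: pair each label with -1, 1 or "*" depending on
    whether the next data value is the predecessor, the successor, or neither."""
    marked = []
    for i, (a, d) in enumerate(word):
        has_next = i + 1 < len(word)
        if has_next and word[i + 1][1] + 1 == d:
            mark = -1
        elif has_next and d + 1 == word[i + 1][1]:
            mark = 1
        else:
            mark = "*"
        marked.append((a, mark))
    return marked


def reorder_by_data(word, outputs):
    """List the output symbols in increasing order of the data values."""
    order = sorted(range(len(word)), key=lambda i: word[i][1])
    return [outputs[i] for i in order]
```

A text is then accepted when the first transducer maps `msp(word)` to some `outputs` and the second automaton accepts `reorder_by_data(word, outputs)`.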

It is shown in [Manuel (2010)] that for every $\varphi\in\mbox{$\textrm{FO}$}^{2}(+1,\mbox{$\prec_{suc}$})$ , one can construct effectively a text automaton $\mathcal{A}$ such that for every text $w$ , $w\in\mbox{$\mathcal{L}$}_{data}(\varphi)$ if and only if $w\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{A}$})$ .

Now we are going to show how to get the desired weak ODTA $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma\rangle$ . Let $(T_{1},T_{2})$ be the text automaton as above. On an input ordered-data word $w={a_{1}\choose d_{1}}\cdots{a_{n}\choose d_{n}}$ , $\mathcal{S}$ performs the following.

•

The automaton $\mathcal{T}$ simulates $T_{1}$ by guessing $msp(w)$ and outputting its $\Gamma$ -projection, while storing its $\{-1,1,*\}$ -projection in its states.
•

The automaton $\mathcal{M}$ is simply $T_{2}$ .

It is straightforward to see that such $\mathcal{S}$ is the desired weak ODTA. The analysis of the complexity is as follows. The construction of the text automaton $(T_{1},T_{2})$ takes double exponential time in the size of $\varphi$ . See [Manuel (2010), Lemmas 5 and 6]. The construction of ODTA $\mathcal{S}$ takes polynomial time in the size of $(T_{1},T_{2})$ . Altogether, it takes double exponential time to construct $\mathcal{S}$ from the original formula $\varphi$ . This completes the proof of Theorem 6.10.

7 An Undecidable Extension

In this section we would like to remark on an undecidable extension of ODTA. Recall the language in Example 3.3. It has already been noted in the proof of Proposition 5.6 that its complement is not accepted by any ODTA. Formally, the complement of the language in Example 3.3 can be expressed with a formula of the form:

$\forall x\;\forall y\;\bigvee_{a\in\Sigma_{0}}a(x)\wedge\bigvee_{a\in\Sigma_{0}}a(y)\wedge\mbox{$E_{\downarrow}$}^{*}(x,y)\to x\prec y,\qquad(15)$

where $\Sigma_{0}\subseteq\Sigma$ and $\mbox{$E_{\downarrow}$}^{*}$ denotes the transitive closure of $E_{\downarrow}$ . In the following we are going to show that given an ODTA $\mathcal{S}$ and a collection $\mathcal{C}$ of formulas of the form (15), it is undecidable to check whether there is an ordered-data tree $t\in\mbox{$\mathcal{L}$}_{data}(\mbox{$\mathcal{S}$})$ such that $t\models\psi$ , for all $\psi\in\mbox{$\mathcal{C}$}$ .

The proof is simply an observation that the proof of [Bojanczyk et al. (11a), Proposition 29] can be applied directly here. In [Bojanczyk et al. (11a), Proposition 29] it is proved that the satisfiability of $\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\downarrow}$}^{*},\sim,\prec)$ is undecidable. (Technically, the undecidability in [Bojanczyk et al. (11a), Proposition 29] is proved on data strings over the logic $\mbox{$\textrm{FO}$}^{2}(+1,<,\sim,\prec)$ , which of course, is equivalent to $\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\downarrow}$}^{*},\sim,\prec)$ .) The reduction is from Post Correspondence Problem (PCP), where given an instance of PCP, one can effectively construct a formula of the form $\varphi\wedge\psi$ , where $\varphi\in\mbox{$\textrm{FO}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\downarrow}$}^{*},\sim)$ and $\psi$ is a formula of the form (15). Since $\varphi$ can be captured by ODTA, the undecidability of ODTA extended with formulas of the form (15) follows immediately.

At this point we would also like to point out that extending ODTA with operations such as addition on data values immediately yields undecidability. This can be deduced from [Halpern (1991)], where it is shown that addition together with unary predicates yields undecidability.

8 When the Data Values are Strings

In this section we discuss data trees where the data values are strings from $\{0,1\}^{\ast}$ , instead of natural numbers. We call such trees string data trees. There are two common kinds of order for strings: the prefix order and the lexicographic order. Strings with the lexicographic order simply form a linearly ordered domain, thus, ODTA can be applied directly in such a case.

For the prefix order, we have to modify the definition of ODTA. Consider a string data tree $t$ over the alphabet $\Sigma$ . Let $V_{t}$ be the set of data values found in $t$ . We define $\mbox{$\mathcal{V}$}_{\Sigma}(t)$ as a tree over the alphabet $2^{\Sigma}$ , where

•

$\mbox{$\textsf{Dom}$}(\mbox{$\mathcal{V}$}_{\Sigma}(t))$ is $V_{t}\cup\{\epsilon\}$ ;
•

for $u,v\in\mbox{$\textsf{Dom}$}(\mbox{$\mathcal{V}$}_{\Sigma}(t))$ , $u$ is a parent of $v$ if $u$ is a proper prefix of $v$ and there is no $w\in\mbox{$\textsf{Dom}$}(\mbox{$\mathcal{V}$}_{\Sigma}(t))$ , different from $u$ and $v$ , such that $u$ is a prefix of $w$ and $w$ is a prefix of $v$ ;
•

for $u\in\mbox{$\textsf{Dom}$}(\mbox{$\mathcal{V}$}_{\Sigma}(t))$ the label of $u$ is $S$ , if $u\in[S]_{t}$ ; and root, if $u=\epsilon$ .

We call $\mbox{$\mathcal{V}$}_{\Sigma}(t)$ the tree representation of the data values in $t$ . Consider an example of a string data tree in Figure 2. We have

$\begin{array}{lll}~[\{a\}]_{t}=\{0101\}&&[\{b\}]_{t}=\{0100\}\\ ~[\{c\}]_{t}=\{01011\}&&[\{a,b\}]_{t}=\{01\}\\ ~[\{b,c\}]_{t}=\{010000\}&&[\{a,b,c\}]_{t}=\{010011\}.\end{array}$

So $\mbox{$\textsf{Dom}$}(\mbox{$\mathcal{V}$}_{\Sigma}(t))=\{\epsilon,01,0100,0101,010011,010000,01011\}$ , and

•

$\epsilon$ , labeled root, is the parent of $01$ ;
•

$01$ is the parent of $0100$ and $0101$ ;
•

$0100$ is the parent of $010011$ and $010000$ ; and
•

$0101$ is the parent of $01011$ .

Figure 2: An example of a string data tree (on the left) and the tree representation of its data values (on the right).
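The parent relation of the tree representation can be computed directly from the set of data values. The sketch below is a minimal illustration under assumptions: data values are Python strings over `0` and `1`, the empty string stands for the root $\epsilon$ , and the parent of a value is taken to be its longest proper prefix in the domain, which is how we read the definition above. The function name is hypothetical.

```python
def tree_representation(values):
    """Return the parent map of Dom = values plus the empty string (the root),
    ordered by the prefix relation."""
    dom = set(values) | {""}
    parent = {}
    for v in dom:
        if v == "":
            continue  # the root has no parent
        # scan the proper prefixes of v from longest to shortest
        for k in range(len(v) - 1, -1, -1):
            if v[:k] in dom:
                parent[v] = v[:k]
                break
    return parent
```

Applied to the six data values listed in the domain above, the function reproduces the stated parent relation, with the root as the parent of `01`.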

Now an ODTA for string data trees is $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{A}$},\Gamma_{0}\rangle$ , where $\mathcal{T}$ is a letter-to-letter transducer from $\Sigma\times\{\top,\bot,*\}^{3}$ to $\Gamma$ ; $\mathcal{A}$ is an unranked tree automaton over the alphabet $2^{\Gamma}$ ; $\Gamma_{0}\subseteq\Gamma$ . The requirement for acceptance is the same as in Section 5, except that $\mathcal{A}$ takes a tree over the alphabet $2^{\Gamma}$ as the input.

We observe that in the proof of the decidability of the non-emptiness of ODTA $\mbox{$\mathcal{S}$}=\langle\mbox{$\mathcal{T}$},\mbox{$\mathcal{M}$},\Gamma_{0}\rangle$ , the automaton $\mathcal{M}$ is converted in polynomial time into a Presburger formula by applying Proposition 2.3, which actually holds for tree automata. Hence, the decision procedures in Sections 5 and 6 can also be applied to string data trees.

9 Concluding Remarks

In this paper we study data trees in which the data values come from a linearly ordered domain, where in addition to equality test, we can test whether the data value in one node is greater than the one in another node. We introduce ordered-data tree automata (ODTA), provide its logical characterisation, and prove that its non-emptiness problem is decidable. We also show that the logic $\mbox{$\exists\mbox{$\textrm{MSO}$}$}^{2}(\mbox{$E_{\downarrow}$},\mbox{$E_{\rightarrow}$},\sim)$ can be captured by ODTA.

Then we define weak ODTA, which essentially are ODTA without the ability to perform equality test on data values on two adjacent nodes. We provide its logical characterisation. We show that a number of existing formalisms and models studied in the literature so far can be captured already by weak ODTA. We also show that the definition of ODTA can be easily modified, to the case where the data values come from a partially ordered domain, such as strings.

We believe that the notion of ODTA provides new techniques to reason about ordered-data values on unranked trees, and thus, can find potential applications in practice. We also prove that ODTA capture various formalisms on data trees studied so far in the literature. As far as we know this is the first formalism for data trees with neat logical and automata characterisations.

Acknowledgement. The author would like to thank FWO for their generous financial support under the scheme FWO Marie Curie fellowship. The author also thanks Egor V. Kostylev for careful proofreading of this paper, as well as Nadime Francis for pointing out the reference [Halpern (1991)]. Finally, the author thanks the anonymous referees of both the conference and the journal versions for their careful reading and comments, which greatly improved the paper.

References

Alon et al. (2003) Alon, N., Milo, T., Neven, F., Suciu, D., and Vianu, V. 2003. XML with data values: typechecking revisited. Journal of Computer and System Sciences 66, 4, 688–727.
Arenas et al. (2008) Arenas, M., Fan, W., and Libkin, L. 2008. On the complexity of verifying consistency of XML specifications. SIAM Journal on Computing 38, 3, 841–880.
Benedikt et al. (2010) Benedikt, M., Ley, C., and Puppis, G. 2010. Automata vs. logics on data words. In CSL.
Björklund and Bojanczyk (2007) Björklund, H. and Bojanczyk, M. 2007. Bounded depth data trees. In ICALP.
Björklund et al. (2008) Björklund, H., Martens, W., and Schwentick, T. 2008. Optimizing conjunctive queries over trees using schema information. In MFCS.
Bojanczyk et al. (11a) Bojanczyk, M., David, C., Muscholl, A., Schwentick, T., and Segoufin, L. ’11a. Two-variable logic on data words. ACM Transactions on Computational Logic 12, 4, 27.
Bojanczyk et al. (11b) Bojanczyk, M., Klin, B., and Lasota, S. ’11b. Automata with group actions. In LICS. 355–364.
Bojanczyk et al. (2009) Bojanczyk, M., Muscholl, A., Schwentick, T., and Segoufin, L. 2009. Two-variable logic on data trees and XML reasoning. Journal of the ACM 56, 3.
Bouyer et al. (2001) Bouyer, P., Petit, A., and Thérien, D. 2001. An algebraic characterization of data and timed languages. In CONCUR.
Comon et al. (2007) Comon, H., Dauchet, M., Gilleron, R., Löding, C., Jacquemard, F., Lugiez, D., Tison, S., and Tommasi, M. 2007. Tree automata techniques and applications. Available on: http://www.grappa.univ-lille3.fr/tata. release October, 12th 2007.
David et al. (2010) David, C., Libkin, L., and Tan, T. 2010. On the satisfiability of two-variable logic over data words. In LPAR.
David et al. (2012) David, C., Libkin, L., and Tan, T. 2012. Efficient reasoning about data trees via integer linear programming. ACM Transactions on Database Systems 37, 3, 19.
Demri et al. (2007) Demri, S., D’Souza, D., and Gascon, R. 2007. A decidable temporal logic of repeating values. In LFCS.
Demri and Lazić (2009) Demri, S. and Lazić, R. 2009. LTL with the freeze quantifier and register automata. ACM Transactions on Computational Logic 10, 3.
Deutsch et al. (2009) Deutsch, A., Hull, R., Patrizi, F., and Vianu, V. 2009. Automatic verification of data-centric business processes. In ICDT.
Fan and Libkin (2002) Fan, W. and Libkin, L. 2002. On XML integrity constraints in the presence of DTDs. Journal of the ACM 49, 3, 368–406.
Figueira (2009) Figueira, D. 2009. Satisfiability of downward XPath with data equality tests. In PODS.
Figueira (2011) Figueira, D. 2011. A decidable two-way logic on data words. In LICS. 365–374.
Figueira (2012a) Figueira, D. 2012a. Alternating register automata on finite data words and trees. Logical Methods in Computer Science 8, 1.
Figueira (2012b) Figueira, D. 2012b. Satisfiability for two-variable logic with two successor relations on finite linear orders. http://arxiv.org/abs/1204.2495.
Figueira et al. (2010) Figueira, D., Hofman, P., and Lasota, S. 2010. Relating timed and register automata. In EXPRESS.
Figueira and Segoufin (2011) Figueira, D. and Segoufin, L. 2011. Bottom-up automata on data trees and vertical XPath. In STACS.
Grumberg et al. (2010) Grumberg, O., Kupferman, O., and Sheinvald, S. 2010. Variable automata over infinite alphabets. In LATA.
Halpern (1991) Halpern, J. 1991. Presburger arithmetic with unary predicates is $\Pi_{1}^{1}$ complete. Journal of Symbolic Logic 56, 2, 637–642.
Jurdzinski and Lazic (2011) Jurdzinski, M. and Lazic, R. 2011. Alternating automata on data trees and XPath satisfiability. ACM Transactions on Computational Logic 12, 3, 19.
Kaminski and Francez (1994) Kaminski, M. and Francez, N. 1994. Finite-memory automata. Theoretical Computer Science 134, 2, 329–363.
Kara et al. (2012) Kara, A., Schwentick, T., and Tan, T. 2012. Feasible automata for two-variable logic with successor on data words. In LATA.
Lazić (2011) Lazić, R. 2011. Safety alternating automata on data words. ACM Transactions on Computational Logic 12, 2, 10.
Libkin (2004) Libkin, L. 2004. Elements of Finite Model Theory. Springer.
Manuel (2010) Manuel, A. D. 2010. Two orders and two variables. In MFCS.
Neven (2002) Neven, F. 2002. Automata, logic, and XML. In CSL.
Neven et al. (2004) Neven, F., Schwentick, T., and Vianu, V. 2004. Finite state machines for strings over infinite alphabets. ACM Transactions on Computational Logic 5, 3, 403–435.
Schwentick (2004) Schwentick, T. 2004. XPath query containment. SIGMOD Record 33, 1, 101–109.
Schwentick and Zeume (2010) Schwentick, T. and Zeume, T. 2010. Two-variable logic with two order relations. In CSL.
Segoufin and Torunczyk (2011) Segoufin, L. and Torunczyk, S. 2011. Automata based verification over linearly ordered data domains. In STACS.
Seidl et al. (2004) Seidl, H., Schwentick, T., Muscholl, A., and Habermehl, P. 2004. Counting in trees for free. In ICALP.
Thatcher (1967) Thatcher, J. 1967. Characterizing derivation trees of context-free grammars through a generalization of finite automata theory. Journal of Computer and System Sciences 1, 4, 317–322.
Thomas (1997) Thomas, W. 1997. Languages, automata, and logic. In Handbook of Formal Languages, Vol. 3. 389–455.
Verma et al. (2005) Verma, K. N., Seidl, H., and Schwentick, T. 2005. On the complexity of equational Horn clauses. In CADE.