
University of Texas at Austin, Austin TX 78712, USA
{ehsiung,joydeepb,swarat}@cs.utexas.edu

Automata Learning from Preference and Equivalence Queries

Eric Hsiung (0000-0002-4188-4127)    Joydeep Biswas (0000-0002-1211-1731)    Swarat Chaudhuri (0000-0002-6859-1391)
Abstract

Active automata learning from membership and equivalence queries is a foundational problem with numerous applications. We propose a novel variant of the active automata learning problem: actively learn finite automata using preference queries (i.e., queries about the relative position of two sequences in a total preorder) instead of membership queries. Our solution is Remap, a novel algorithm which leverages a symbolic observation table along with unification and constraint solving to navigate a space of symbolic hypotheses (each representing a set of automata), and uses satisfiability solving to construct a concrete automaton (specifically a Moore machine) from a symbolic hypothesis. Remap is guaranteed to correctly infer the minimal automaton with polynomial query complexity under exact equivalence queries, and achieves PAC-identification (ε-approximate, with high probability) of the minimal automaton using sampling-based equivalence queries. Our empirical evaluations of Remap on the task of learning reward machines for two reinforcement learning domains indicate that Remap scales to large automata and is effective at learning correct automata from consistent teachers, under both exact and sampling-based equivalence queries.

1 Introduction

Active automata learning has applications from software engineering [37, 1] and verification [29] to interpretable machine learning [44] and learning reward machines [39, 20, 46, 16]. The classical problem formulation involves a teacher with access to a regular language and a learner which asks membership and equivalence queries to infer a finite automaton describing the regular language [5, 22].

Consider an alternative formulation: learning a finite automaton from preference and equivalence queries. (Shah et al. [38] investigate choosing between membership and preference queries.) A preference query resolves the relative position of two sequences in a total preorder available to the teacher. The motivation for learning from preferences stems from leveraging human preferences as a rich source of information. Indeed, comparative feedback such as preferences is a modality humans are more adept at providing than specific numerical values, as shown by MacGlashan et al. [30] and Christiano et al. [12], likely due to choice overload [26]. But preferences need not be obtained directly from individual humans: preferences can also be obtained from automated systems in a frequency-based sense. For example, consider an obstacle avoidance scenario of a vehicle avoiding large debris in the middle of a roadway, and a dataset of a population of driver responses (vehicle trajectories) to such a scenario, procured from dashcam or traffic camera footage. The population preference for a given driving response can be determined automatically by ranking each response by its frequency of occurrence in the dataset. Learning from human preference data has applications in fine-tuning language models [34], learning conditional preference networks [28, 23], inferring reinforcement learning policies [13], and learning Markovian reward functions [11, 36, 10, 27]. Possible applications for learning finite automata from preferences over sequences include inferring sequence classifications (e.g., program executions [21], vehicle maneuvers, human-robot interactions) using ordered classes (e.g., safe, risky, dangerous, fatal) with each automaton state labeled by a class, distilling interpretable preference models [44], and inferring reward machines.
However, no method currently exists for learning finite automata from preferences with the termination and correctness guarantees enjoyed by classical automata learning algorithms such as L* [5].

Figure 1: REMAP Algorithm Overview

Unfortunately, adapting L* to the preference-based setting is challenging. Preference queries do not directly provide the concrete observations available from membership queries, as required by L*, so our solution Remap addresses this challenge through a symbolic approach to L*.

Remap (Figure 1; code and supplementary material available at https://eric-hsiung.github.io/remap/) features termination and correctness guarantees for exact and probably approximately correct [43] (PAC) identification [6] of the desired automaton, with a strong learner capable of symbolic reasoning and constraint solving to offset the weaker preference-based signals from the teacher. While approximate learners have been proposed to learn from a combination of membership and preference queries [38], Remap is the first exact learner with formal guarantees of minimalism, correctness, and query complexity. By using unification (symbolically obtaining sets of equivalent variables from sets of equations, and substituting variables in the equations with their representatives; see Section 4.2) to navigate the symbolic space of hypotheses, and constraint solving to construct a concrete automaton from a symbolic hypothesis, Remap identifies, in a polynomial number of queries, the minimal Moore machine isomorphic to one describing the teacher's total preorder over input sequences when using exact equivalence queries. However, exact equivalence queries may be infeasible if the learner and teacher lack a common representation (either a pair of identical representations, or representations between which a translation exists). This motivates sampling-based equivalence queries under Remap, which achieve PAC-identification (see Definition 10 for PAC-identification and the related Theorems 5.4 and 5.5). Our empirical evaluations apply Remap to learning reward machines for sequential decision-making domains from the reward machine literature. We measure query complexity in the exact and PAC-identification settings, and empirical correctness in the PAC-identification setting.

We contribute (a) Remap, a novel L*-style algorithm for learning Moore machines using preference and equivalence queries; under exact and PAC-identification settings we provide (b) theoretical analysis of query complexity, correctness, and minimalism, and (c) supporting empirical results, demonstrating the algorithm's ability to learn reward machines from preference and equivalence queries.

2 Background

Prior to introducing Remap, we provide preliminaries on orders and finite automata, followed by a discussion of Angluin's L* algorithm in order to highlight how Remap differs from L*.

Definition 1 (Total Preorder)

The binary relation ≾ on a set A is a total preorder if ≾ satisfies: a ≾ a for all a ∈ A (reflexivity); a ≾ b ∧ b ≾ c ⟹ a ≾ c for all a, b, c ∈ A (transitivity); and a ≾ b ∨ b ≾ a for all a, b ∈ A (totality).
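As a sanity check, the axioms in Definition 1 can be verified mechanically on any finite set. The sketch below is illustrative (the names `is_total_preorder` and `by_length` are ours, not the paper's); comparing strings by length gives a total preorder that is not a total order, since antisymmetry fails.

```python
from itertools import product

def is_total_preorder(elems, leq):
    """Check the total-preorder axioms over a finite set.

    leq(a, b) is a stand-in for the relation a <= b (i.e., a precsim b).
    """
    reflexive = all(leq(a, a) for a in elems)
    transitive = all(
        (not (leq(a, b) and leq(b, c))) or leq(a, c)
        for a, b, c in product(elems, repeat=3)
    )
    total = all(leq(a, b) or leq(b, a) for a, b in product(elems, repeat=2))
    return reflexive and transitive and total

# Comparing strings by length: "ab" <= "cd" and "cd" <= "ab" both hold,
# yet "ab" and "cd" are distinct elements, so antisymmetry fails and
# this is a total preorder but not a total order.
by_length = lambda a, b: len(a) <= len(b)
```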

Definition 2 (Total Order)

The binary relation ≾ on a set A is a total order if ≾ satisfies the total preorder conditions, and a ≾ b ∧ b ≾ a ⟹ a = b for all a, b ∈ A (antisymmetry); a = b means a and b are the same element of A.

Finite Automata  Automata describe sets of sequences. An alphabet Σ is a set whose elements can be used to construct sequences; Σ* denotes the set of sequences of any length formed from elements of Σ. A sequence s ∈ Σ* has integer length |s| ≥ 0; if |s| = 0, then s is the empty sequence ε. An element σ ∈ Σ has length 1; if s, t ∈ Σ*, then s·t denotes s concatenated with t, with length |s| + |t|. Different types of finite automata have different semantics. Deterministic automata feature a deterministic transition function δ defined over a set of states Q and an input alphabet Σ^I. An output alphabet Σ^O may be present to label states or transitions using a labeling function L. The tuple ⟨Q, q_0, Σ^I, Σ^O, δ, L⟩ is a Moore machine [33], where q_0 ∈ Q is the initial state, δ: Q × Σ^I → Q describes transitions, and L: Q → Σ^O associates outputs with states. Extended to sequences, δ: Q × (Σ^I)* → Q, where δ(q, ε) = q and δ(q, σ·s) = δ(δ(q, σ), s). In Mealy machines [32], outputs are instead associated with transitions, so L: Q × Σ^I → Σ^O. Mealy and Moore machines are equivalent [19] and can be converted into one another. Reward machines are an application of Mealy machines in reinforcement learning, used to express a class of non-Markovian reward functions.

Active Automaton Learning  Consider the problem of actively learning a Moore machine ⟨Q, q_0, Σ^I, Σ^O, δ, L⟩ to exactly model a function f: (Σ^I)* → Σ^O, where Σ^I and Σ^O are input and output alphabets of finite size known to both teacher and learner. We desire a learner which learns a model f̂ of f exactly; that is, for all s ∈ (Σ^I)*, we require f̂(s) = f(s), where f̂(s) = L(δ(q_0, s)), with the assistance of a teacher 𝒯 that can answer questions about f.

In Angluin's seminal active learning algorithm, L* [5], the learner learns f̂ as a binary classifier (|Σ^O| = 2) determining sequence membership in a regular language by querying 𝒯 with: (i) membership queries, where the learner asks 𝒯 for the value of f(s) for a particular sequence s, and (ii) equivalence queries, where the learner asks 𝒯 to evaluate whether f̂(s) = f(s) for all s ∈ (Σ^I)*. For the latter query, 𝒯 returns True if the statement holds; otherwise it returns a counterexample c for which f̂(c) ≠ f(c). An observation table ⟨S, E, T⟩ records the concrete observations acquired by the learner's queries (see Figure 2). Here, S is a set of prefixes, E is a set of suffixes, and T is the empirical observation function mapping sequences to output values, T: (S ∪ (S·Σ^I))·E → Σ^O. The observation table is a two-dimensional array: s ∈ S ∪ (S·Σ^I) indexes rows, e ∈ E indexes columns, and entries are given by T(s·e). Any proposed hypothesis f̂ must be consistent with T. The algorithm operates by construction: it must find a deterministic transition function that is consistent (deterministic transitions) and closed over the set of states. If the consistency or closure requirements are violated, membership queries are executed to expand the observation table. Once a suitable transition function is found, a hypothesis f̂ can be made and checked via an equivalence query. The algorithm terminates if f̂(s) = f(s) for all s ∈ (Σ^I)*; otherwise L* continues by adding counterexample c and all its prefixes to the table, then finding another transition function satisfying the consistency and closure requirements.
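To make the closure and consistency requirements concrete, here is a minimal sketch of the two checks L* performs on its observation table, under the assumption that sequences are strings and T is a dictionary; the helper names are ours, not Angluin's.

```python
def row(T, s, E):
    """Row of the observation table for prefix s: the outputs T[s+e] for each suffix e."""
    return tuple(T[s + e] for e in E)

def is_closed(S, E, T, sigma_in):
    """Closed: every one-symbol extension's row already appears among the rows of S."""
    rows_S = {row(T, s, E) for s in S}
    return all(row(T, s + a, E) in rows_S for s in S for a in sigma_in)

def is_consistent(S, E, T, sigma_in):
    """Consistent: prefixes with equal rows keep equal rows after any one symbol."""
    return all(
        row(T, s1 + a, E) == row(T, s2 + a, E)
        for s1 in S for s2 in S
        if row(T, s1, E) == row(T, s2, E)
        for a in sigma_in
    )
```

For example, with S = {ε}, E = {ε}, and observations T(ε) = 0, T(a) = 1, T(b) = 0, the table is not closed because row(a) differs from every row of S; adding a to S (and querying aa, ab) closes it.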

Figure 2: L* example. Figure 3 illustrates Remap. L* records concrete values from membership queries in the observation table. Green symbolizes new information. Colors in (d) visually highlight transition inconsistencies.

Consequently, we consider how L*-style learning can be used to learn Moore machines from preference queries over sequences, as a foray into understanding how finite automaton structure can be learned from comparison information.

3 Problem Statement

We consider how a Moore machine ⟨Q, q_0, Σ^I, Σ^O, δ, L⟩ can be actively learned from preference queries over (Σ^I)*. We focus on the case of finite Σ^I and Σ^O known to both teacher and learner. With a preference query, the learner asks the teacher 𝒯 which of two sequences s_1 and s_2 is preferred, or whether both are equally preferable. This requires a preference model, which we assume represents a total preorder over (Σ^I)*; we also assume Σ^O is totally ordered. Thus, we consider a preference model for 𝒯 in which f: (Σ^I)* → Σ^O is consistent with both orderings: 𝒯 prefers s_1 over s_2 if f(s_1) > f(s_2), and otherwise has equal preference if f(s_1) = f(s_2).
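A teacher with such a preference model can be sketched as follows; the class, its interface, and the particular hidden f are illustrative assumptions rather than the paper's formal definition.

```python
class PreferenceTeacher:
    """Teacher whose preferences over sequences are induced by a hidden
    function f: (Sigma_I)* -> Sigma_O, with Sigma_O totally ordered."""

    def __init__(self, f):
        self._f = f  # hidden from the learner

    def prefer(self, s1, s2):
        """Preference query: '>' if s1 is preferred, '<' if s2 is preferred,
        '=' if the two sequences are equally preferable."""
        v1, v2 = self._f(s1), self._f(s2)
        return ">" if v1 > v2 else "<" if v1 < v2 else "="

# Example hidden f: number of 'a' symbols seen, capped at 2 (outputs 0, 1, 2).
teacher = PreferenceTeacher(lambda s: min(s.count("a"), 2))
```

Note the learner only ever sees the comparison outcome, never the values f(s_1) and f(s_2) themselves; this is precisely the information gap Remap bridges with symbolic variables.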

Several options for evaluating hypothesis equivalence (is f̂ ≡ f?) can be defined for this problem formulation. We first review the definition of exact equivalence, followed by an alternative notion of equivalence which respects ordering:

Definition 3 (Exact Equivalence)

Given a hypothesis f̂ and a reference f, f̂ is exactly equivalent to f if f̂(s) = f(s) for all s ∈ (Σ^I)*.

Definition 4 (Order-Respecting Equivalence)

Given a hypothesis f̂ and a reference f, f̂ is order-respecting equivalent to f if, for all s, t ∈ (Σ^I)*, there exists a relation R_{s,t} ∈ {=, >, <} such that f̂(s) R_{s,t} f̂(t) ⟺ f(s) R_{s,t} f(t).
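On a finite sample of sequences, Definition 4 can be checked directly; the helper below is a sketch with hypothetical names. It also illustrates why the two notions differ: any strictly monotone relabeling of outputs (here, doubling) preserves order-respecting equivalence without exact equivalence.

```python
def order_respecting_on(sample, f_hat, f):
    """Check order-respecting equivalence (Definition 4) on a finite sample:
    f_hat and f must induce the same <, >, = relation on every pair."""
    def rel(g, s, t):
        return (g(s) > g(t)) - (g(s) < g(t))  # +1, -1, or 0
    return all(rel(f_hat, s, t) == rel(f, s, t) for s in sample for t in sample)

# f_hat = 2*f orders every pair exactly as f does, yet f_hat(s) != f(s)
# whenever f(s) != 0, so it is order-respecting but not exactly equivalent.
f = lambda s: len(s) % 3
f_hat = lambda s: 2 * (len(s) % 3)
```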

Since a hypothesis f̂ satisfying exact equivalence is also always order-respecting, we focus on exact equivalence queries, where 𝒯 returns feedback along with counterexample c.

Definition 5 (Exact Equivalence Query with Feedback)

Given a hypothesis f̂ and a reference f, an exact equivalence query EQ returns the following triple:

EQ(f̂) = ⟨ [∀s ∈ (Σ^I)*: f̂(s) = f(s)],  c s.t. f̂(c) ≠ f(c),  φ(c) ⟩

where c is a counterexample if one exists, and φ(c) is feedback associated with the counterexample, interpreted as a constraint.

The strength of feedback φ directly impacts how many hypotheses the learner can eliminate per counterexample: returning φ(c) := f̂(c) ≠ f(c), meaning the value of f(c) is not f̂(c), is weak feedback compared to returning φ(c) := f̂(c) = f(c), meaning the value of f̂(c) should be f(c), or returning φ(c) := f̂(c) ∈ X for X ⊂ Σ^O, meaning the value of f̂(c) should lie in a subset of Σ^O. Although Remap outputs a concrete Moore machine, it navigates the space of symbolic Moore machines (i.e., Moore machines with symbolic values as outputs) using solely preference information, while concrete information assists in selecting the concrete hypothesis. Thus, our theoretical analysis considers equivalence queries that provide counterexamples with strong feedback: φ(c) := f̂(c) = f(c).

4 The Remap Algorithm

Remap is an L*-based algorithm employing preference and equivalence queries to gather constraints. It first leverages unification to navigate the symbolic hypothesis space of Moore machines; solving the accumulated constraints then yields a concrete Moore machine. In particular, Remap (Algorithm 1) learns a Moore machine ⟨Q̂, q̂_0, Σ^I, Σ^O, δ̂, L̂⟩ representing a multiclass classifier f̂(s) = L̂(δ̂(q̂_0, s)).

Figure 3: Remap example. Figure 2 illustrates L*. Remap performs preference queries and records variables in the symbolic observation table ⟨S, E, T; 𝒞, Γ⟩ within a SymbolicFill. (a) initializes ⟨S, E, T; 𝒞, Γ⟩; (b) expands S with sequence b, followed by a SymbolicFill, yielding a unified, closed, and consistent ⟨S, E, T; 𝒞, Γ⟩; (c) MakeHypothesis yields concrete hypothesis h_1 from the symbolic hypothesis and the constraints in 𝒞; (d) submits h_1 via an equivalence query, receives and processes counterexample bab with feedback f(bab) = 0, adding bab, ba, and b to S, performs a SymbolicFill, and sets the value of the equivalence class of T(bab) to f(bab), yielding an inconsistent table. (e) shows the resulting consistent table. Figure 2f shows the ground-truth Moore machine with Σ^I = {a, b}, Σ^O = {0, 1}. Green symbolizes new information. Colors in (d) visually highlight transition inconsistencies.

Central to Remap are four core components: (1) a new construct called a symbolic observation table (Section 4.1), shown in Figures 3a, 3b, 3d, and 3e; (2) an associated unification algorithm [35] (Section 4.2), inspired by Martelli and Montanari [31], to contain the fresh-variable explosion; together these enable the learner to (3) generate symbolic hypotheses (Section 4.3) purely from observed symbolic constraints; along with (4) a constraint solver for obtaining a concrete hypothesis (Section 4.4). We first discuss the four components and how they fit together in Remap (Section 4.5). Afterwards, we present the correctness and termination guarantees.

4.1 Symbolic Observation Tables

Recall that in classic L*, the observation table entries are concrete values obtained from membership queries (Figure 2a). However, observations obtained from preference queries are constraints, rather than concrete values, indicating that for a pair of sequences s_1 and s_2, one of f(s_1) > f(s_2), f(s_1) < f(s_2), or f(s_1) = f(s_2) holds. We therefore introduce the symbolic observation table ⟨S, E, T; 𝒞, Γ⟩, where S is a set of prefixes, E is a set of suffixes, 𝒞 is the set of known constraints, and Γ is a context which uniquely maps sequences to variables, with the set of variables 𝒱 in the context Γ given by 𝒱 = {Γ[s·e] | s·e ∈ (S ∪ (S·Σ^I))·E}, and T: (S ∪ (S·Σ^I))·E → 𝒱 maps queried sequences to variables. Thus, in a symbolic observation table, the entry for each prefix-suffix pair is a variable rather than a concrete value, and the constraints over those variables are stored in 𝒞 (Figures 3a, 3b, 3d, 3e). The preferences of 𝒯 over f regarding s_1 and s_2 correspond to the constraints T(s_1) > T(s_2), T(s_1) < T(s_2), or T(s_1) = T(s_2), respectively.
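The bookkeeping above can be sketched as a small data structure; the class name, variable-naming scheme, and constraint encoding below are our own illustrative choices, not the paper's implementation.

```python
from itertools import count

class SymbolicTable:
    """Minimal sketch of <S, E, T; C, Gamma>: table entries are variables."""

    def __init__(self):
        self.S, self.E = {""}, {""}
        self.gamma = {}           # Gamma: sequence -> variable name (unique)
        self.T = {}               # T: queried sequence -> variable
        self.constraints = []     # C: tuples like ('=', v1, v2) or ('>', v1, v2)
        self._fresh = count()

    def var(self, seq):
        """Return Gamma[seq], creating a fresh variable on first use."""
        if seq not in self.gamma:
            self.gamma[seq] = f"v{next(self._fresh)}"
            self.T[seq] = self.gamma[seq]
        return self.gamma[seq]

    def record(self, rel, s1, s2):
        """Record a preference-query observation as a constraint over variables."""
        self.constraints.append((rel, self.var(s1), self.var(s2)))
```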

Definition 6

An equivalence class ℂ of variables is a set with the property that all members of ℂ are equivalent to each other. The representative ⟦ℂ⟧ of ℂ is a deterministically elected member of ℂ. The set of variables 𝒱 can be partitioned into disjoint equivalence classes. The set of equivalence classes 𝒞_EC corresponds to the partitioning of 𝒱 into the smallest possible number of equivalence classes consistent with the equality constraints. Let ℛ = {⟦ℂ⟧ | ℂ ∈ 𝒞_EC} be the set of representatives.

4.2 Unification and Constraints

Since preference queries return observations comparing the values of two sequences, we leverage a simple unification algorithm to ensure the number of unique variables in the table remains bounded by |Σ^O|. As a reminder, unification involves symbolically solving for representatives and rewriting all constraints and variables in terms of those representatives. Whenever we observe the constraint f(s_1) = f(s_2), this implies T(s_1) = T(s_2), so we add Γ[s_1] and Γ[s_2] to an equivalence class ℂ and elect a representative of ℂ. The unification algorithm is presented in Appendix Algorithm 9.

If Γ[s_1] already belongs to an equivalence class ℂ ∈ 𝒞_EC, but Γ[s_2] is an orphan variable (belonging to no class in 𝒞_EC), then Γ[s_2] is merged into ℂ; the symmetric case swaps s_1 and s_2. If Γ[s_1] and Γ[s_2] belong to separate classes ℂ_1 and ℂ_2, with ℂ_1 ≠ ℂ_2, then ℂ_1 and ℂ_2 are merged into one via ℂ ← ℂ_1 ∪ ℂ_2, and one of ⟦ℂ_1⟧ and ⟦ℂ_2⟧ is deterministically elected as the representative ⟦ℂ⟧.

When this unification process is applied to a symbolic observation table ⟨S, E, T; 𝒞, Γ⟩, we unify 𝒱 (the set of variables in the context Γ) according to the set of known equivalence constraints 𝒞_EQ and equivalence classes 𝒞_EC available in 𝒞, and each variable in the table is replaced with its equivalence class representative (see Figures 3a, 3b, 3d, and 3e after the large right curly brace). That is, for all s ∈ (S ∪ S·Σ^I)·E, we substitute T(s) ← ⟦ℂ⟧ if T(s) ∈ ℂ. In the resulting unified symbolic observation table, T maps queried sequences to the set of representatives ℛ. Note |ℛ| = |𝒞_EC| ≤ |Σ^O|.
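Class merging with deterministic election is exactly what a union-find structure provides; the sketch below (our own rendering, electing the lexicographically smallest member as representative) shows how equality constraints collapse variables into representatives.

```python
class Unifier:
    """Union-find sketch of the unification step: each equality constraint
    merges two equivalence classes, and the representative is elected
    deterministically (here: the lexicographically smallest member)."""

    def __init__(self):
        self.parent = {}

    def find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def union(self, v1, v2):
        r1, r2 = self.find(v1), self.find(v2)
        if r1 != r2:
            keep, drop = min(r1, r2), max(r1, r2)  # deterministic election
            self.parent[drop] = keep

def unify(equalities):
    """Return a substitution mapping every variable to its class representative."""
    u = Unifier()
    for v1, v2 in equalities:
        u.union(v1, v2)
    return {v: u.find(v) for v in u.parent}
```

Applying the substitution to every table entry and every constraint leaves both expressed purely over representatives, as required for the table to be unified.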

Besides constraints from preference queries, the learner obtains constraints about the value of T(c) from equivalence queries (Figure 3c to 3d) and adds them to 𝒞. When first obtained, these constraints may be expressed in terms of orphan variables, but during unification each orphan variable joins an equivalence class and is then replaced by its equivalence class representative in the constraint. Thus, after unification, all constraints in 𝒞 are expressed in terms of equivalence class representatives.

Finally, unifying the symbolic observation table is critical for making a symbolic hypothesis without knowledge of concrete value assignments: unification permits ⟨S, E, T; 𝒞, Γ⟩ to become closed and consistent, prerequisites for generating a symbolic hypothesis.

Definition 7

Let rows(S) = {row(s) | s ∈ S}, where row(s) is the row of ⟨S, E, T; 𝒞, Γ⟩ indexed by s. ⟨S, E, T; 𝒞, Γ⟩ is closed if rows(S·Σ^I) ⊆ rows(S).

Definition 8

⟨S, E, T; 𝒞, Γ⟩ is consistent if for all sequence pairs s_1 and s_2 with row(s_1) ≡ row(s_2), all their transitions also remain equivalent with each other: row(s_1·σ) ≡ row(s_2·σ) for all σ ∈ Σ^I.

Definition 9

Table ⟨S, E, T; 𝒞, Γ⟩ is unified if T(s·e) ∈ ℛ for all s·e ∈ (S ∪ (S·Σ^I))·E.

4.3 Making a Symbolic Hypothesis

If the symbolic observation table ⟨S, E, T; 𝒞, Γ⟩ is unified, closed, and consistent, a symbolic hypothesis can be made (Figure 3c). The construction is identical to that of L*, except that the outputs are symbolic:

Q̂ = {row(s) | s ∈ S} is the set of states
q̂_0 = row(ε) is the initial state
δ̂(row(s), σ) = row(s·σ) for all s ∈ S and σ ∈ Σ^I
L̂(row(s)) = T(s·ε) is the output labeling function
⟨Q̂, q̂_0, Σ^I, Σ^O, δ̂, L̂⟩ is a symbolic hypothesis.
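The construction above can be sketched directly in code, assuming sequences are strings and T is a dictionary mapping queried sequences to representatives (the function name and encoding are illustrative):

```python
def make_hypothesis(S, E, T, sigma_in):
    """Build the symbolic Moore machine from a unified, closed, and
    consistent table: states are distinct rows, transitions follow
    one-symbol extensions, and labels remain symbolic representatives."""
    def row(s):
        return tuple(T[s + e] for e in E)
    states = {row(s) for s in S}
    q0 = row("")                                              # row(epsilon)
    delta = {(row(s), a): row(s + a) for s in S for a in sigma_in}
    labels = {row(s): T[s] for s in S}                        # L(row(s)) = T(s)
    return states, q0, delta, labels
```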

4.4 Making a Concrete Hypothesis

The learner finds a satisfying solution Λ to the set of constraints 𝒞, subject to the global constraint that the value of each representative lie in Σ^O (Figure 3c). Thus, L̂ becomes concrete.

Λ ← FindSolution(⟨S, E, T; 𝒞, Γ⟩, Σ^O)   (1)
L̂(row(s)) = Λ[T(s·ε)]   (2)

In particular, Λ assigns satisfying values to each member of ℛ, the set of equivalence class representatives. Since ⟨S, E, T; 𝒞, Γ⟩ is unified, T is guaranteed to map queried sequences to ℛ; hence Λ[T(s·ε)] is guaranteed to resolve to a concrete value as long as the teacher provides consistent preferences.
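FindSolution can be realized with any constraint solver; the sketch below simply brute-forces assignments of Σ^O values to representatives, which is feasible here because |ℛ| ≤ |Σ^O|. The function signature and the tuple encoding of constraints are our own assumptions; a real implementation might instead hand 𝒞 to an SMT solver.

```python
from itertools import product

def find_solution(representatives, constraints, sigma_out):
    """Search for an assignment of Sigma_O values to representatives
    satisfying every ordering constraint in C; None if unsatisfiable."""
    ops = {"=": lambda a, b: a == b,
           ">": lambda a, b: a > b,
           "<": lambda a, b: a < b}
    for values in product(sigma_out, repeat=len(representatives)):
        assign = dict(zip(representatives, values))
        if all(ops[op](assign[v1], assign[v2]) for op, v1, v2 in constraints):
            return assign
    return None  # only possible if the teacher's answers were inconsistent
```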

Algorithm 1 Remap

Input: Alphabets Σ^I (input) and Σ^O (output), teacher 𝒯
Output: Moore machine ℋ = ⟨Q̂, Σ^I, Σ^O, q̂_0, δ̂, L̂⟩

1:  Initialize 𝒪 = ⟨S, E, T; 𝒞, Γ⟩ with S = {ε}, E = {ε}, 𝒞 = {}, Γ = ∅
2:  𝒪 ← SymbolicFill(𝒪 | 𝒯)
3:  repeat
4:     𝒪 ← MakeClosedAndConsistent(𝒪 | 𝒯)
5:     ℋ ← MakeHypothesis(𝒪, Σ^I, Σ^O)
6:     result ← EquivalenceQuery(ℋ | 𝒯)
7:     𝒪 ← ProcessCex(𝒪, result) if result ≠ correct
8:  until result == correct
9:  return ℋ
Algorithm 2 A Query-Efficient Symbolic Fill Procedure

procedure SymbolicFill(⟨S, E, T; 𝒞, Γ⟩ | 𝒯)

1:  seqs = {}; let 𝒪 = ⟨S, E, T; 𝒞, Γ⟩; let oldsortedseqs be a sorted list of sequences.
2:  seqs = PopulateMissingFreshVars(𝒪)
3:  𝒪 ← PrefQsByRandomizedQuicksortFollowedByLinearMerge(seqs, oldsortedseqs, 𝒪 | 𝒯)
4:  𝒪 ← Unification(𝒪)
5:  return 𝒪

4.5 Remap

We now describe Remap (Algorithm 1) in terms of the previously discussed components. In order to make a symbolic hypothesis, Remap must first obtain a unified, closed, and consistent ⟨S, E, T; 𝒞, Γ⟩. To perform the closedness and consistency checks, the table must be unified. Therefore, Remap must symbolically fill ⟨S, E, T; 𝒞, Γ⟩ by asking preference queries and performing unification to obtain a unified table. If the unified table is not closed or not consistent, the table is alternately expanded and symbolically filled until it becomes unified, closed, and consistent.

SymbolicFill  A symbolic fill (Algorithm 2; Appendix Algorithms 4 and 5) (i) creates fresh variables for empty entries in the table, (ii) asks preference queries, and (iii) performs unification. If a sequence s·e ∈ (S ∪ (S·Σ^I))·E does not have an associated variable in the context Γ, then a fresh variable Γ[s·e] is created. Preference queries are executed to obtain the total preorder over (S ∪ (S·Σ^I))·E. In our implementation, we use preference queries in place of comparisons in randomized quicksort and linear merge. Once every sequence in (S ∪ (S·Σ^I))·E has a variable, and the preference queries have been completed, unification is performed (Appendix Algorithm 9). SymbolicFill is called on line 2 of Algorithm 1, as well as within MakeClosedAndConsistent (line 4) and ProcessCex (line 7).
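Using preference queries as the comparator in a sorting routine can be sketched as follows; `prefer` is a hypothetical teacher callback returning '>', '<', or '='. Remap uses randomized quicksort plus a linear merge, but any comparison sort needs only O(n log n) preference queries, and sequences that compare equal end up adjacent, ready to be unified.

```python
from functools import cmp_to_key

def sort_by_preference(seqs, prefer):
    """Totally preorder the sequences using preference queries as comparisons.
    prefer(s1, s2) returns '>', '<', or '=' (a teacher callback)."""
    def cmp(s1, s2):
        return {">": 1, "<": -1, "=": 0}[prefer(s1, s2)]
    return sorted(seqs, key=cmp_to_key(cmp))
```

After sorting, scanning adjacent pairs yields the equality constraints ('=' outcomes) that drive unification and the strict inequalities that populate 𝒞.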

Ensuring Consistency   If the unified table is not consistent, then there exists a pair s_1, s_2 ∈ S and σ ∈ Σ^I for which row(s_1) ≡ row(s_2) but row(s_1·σ) ≢ row(s_2·σ), implying there is an e ∈ E such that T(s_1·σ·e) ≢ T(s_2·σ·e). To attempt to make the table consistent, add σ·e to E, and then perform a symbolic fill. Figure 3d shows an inconsistency, and Figure 3e shows a table made consistent through expansion of the suffix set.

Ensuring Closedness   If the unified table is not closed, then rows(S·Σ^I) ⊈ rows(S). To attempt to make the table closed, find a row row(s′) in rows(S·Σ^I) but not in rows(S), add s′ to S, update S·Σ^I, then fill symbolically. Figures 3a to 3b show the closure process.

The consistency and closedness checks occur in a loop (consistency first, closedness second) inside MakeClosedAndConsistent until the table becomes unified, closed, and consistent. Then hypothesis h = ⟨Q̂, q̂_0, Σ^I, Σ^O, δ̂, L̂⟩ is generated by MakeHypothesis (Figure 3c) and sent to the teacher via EquivalenceQuery (Figure 3c to 3d). If h is wrong, a counterexample c is returned, along with feedback φ(c), which is interpreted as a new constraint on the value of f̂(c). The counterexample c and all its prefixes are added to S, and a symbolic fill is performed. Then the constraint on the value of f̂(c) is added to 𝒞 as a constraint on the value of the representative at T(c).

5 Theoretical Guarantees of Remap

We now cover the algorithmic guarantees of Remap when 𝒯 answers exact equivalence queries, and show how sampling-based equivalence queries achieve PAC-identification. We first detail how Remap guarantees termination and yields a correct, minimal Moore machine that classifies sequences equivalently to f. If Remap terminates, then the final hypothesis must be correct, since termination occurs only if no counterexample exists for the final hypothesis; that is, only if the hypothesized Moore machine classifies all sequences correctly according to the teacher. Thus, proving termination implies correctness. See the Appendix for proof sketches and full proofs. Here, we assume the teacher provides feedback f̂(c) = f(c) with counterexample c.

Theorem 5.1

If ⟨S, E, T; 𝒞, Γ⟩ is unified, closed, and consistent, and the range of MakeHypothesis(⟨S, E, T; 𝒞, Γ⟩) is ℋ, then every hypothesis h ∈ ℋ is consistent with the constraints 𝒞. Any other hypothesis consistent with 𝒞, but not contained in ℋ, must have more states.

Proof

(Sketch) A given unified, closed, and consistent symbolic observation table S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle corresponds to (𝒮,,𝒞)(\mathcal{S},\mathcal{R},\mathcal{C}), where 𝒮\mathcal{S} is a symbolic hypothesis, \mathcal{R} is the set of representatives used in the table, and 𝒞\mathcal{C} are the constraints expressed over \mathcal{R}. All hypotheses in \mathcal{H} have states and transitions identical to 𝒮\mathcal{S}. Each satisfying solution Λ\Lambda to 𝒞\mathcal{C} corresponds to a unique concrete hypothesis in \mathcal{H}. Therefore every concrete hypothesis in \mathcal{H} is consistent with 𝒞\mathcal{C}. Let |h||h| represent the number of states in hh. We know for all hh\in\mathcal{H}, |h|=|𝒮||h|=|\mathcal{S}|. Let ¯\overline{\mathcal{H}} be the set of concrete hypotheses not in \mathcal{H}. Note ¯\overline{\mathcal{H}} can be partitioned into three sets: concrete hypotheses with (a) fewer states than 𝒮\mathcal{S}, (b) more states than 𝒮\mathcal{S}, and (c) the same number of states as 𝒮\mathcal{S} but inconsistent with 𝒞\mathcal{C}. We ignore (c) because we care only about hypotheses consistent with 𝒞\mathcal{C}. Consider any concrete hypothesis hh in \mathcal{H} and its corresponding satisfying solution Λ\Lambda, and suppose another hypothesis hh^{\prime} is required to be consistent with hh. If |h|<|h||h^{\prime}|<|h|, then hh^{\prime} cannot be consistent with hh, because at least one sequence would be misclassified. Therefore, consistency with hh requires |h||h||h^{\prime}|\geq|h|. Thus, any other hypothesis consistent with 𝒞\mathcal{C}, but not in \mathcal{H}, must have more states.∎

Theorem 5.1 establishes that the output of Remap will be the smallest Moore machine consistent with all the constraints in 𝒞\mathcal{C}. This fact is necessary for proving termination.

Lemma 1

Whenever a counterexample cc is processed, the number of known representative values increases by either 0 or 1.

Theorem 5.2

Suppose S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle is unified, closed, and consistent. Let h^=MakeHypothesis(S,E,T;𝒞,Γ)\hat{h}=\textsc{MakeHypothesis}(\langle S,E,T;\mathcal{C},\Gamma\rangle) be the hypothesis induced by Λ\Lambda, a satisfying solution to 𝒞\mathcal{C}. If the teacher returns a counterexample cc as the result of an equivalence query on h^\hat{h}, then at least one of the following is true about h^\hat{h}: (a) h^\hat{h} contains too few states, or (b) the satisfying solution Λ\Lambda inducing h^\hat{h} is either incomplete or incorrect.

Corollary 1

Remap must terminate when the number of states and number of known representative values in a concrete hypothesis reach their respective upper bounds.

Proof

(Sketch) This sketch jointly covers Lemma 1, Theorem 5.2, and Corollary 1, which concern termination. Consider the sequence ,hk1,hk,\dots,h_{k-1},h_{k},\dots of hypotheses that Remap makes. For a given pair of consecutive hypotheses (hk1,hk)(h_{k-1},h_{k}), consider how the number of states nn and the number of known representative values nn_{\bullet} change. Let nn^{*} be the number of states of the minimal Moore machine correctly classifying all sequences. Let V|ΣO|V^{*}\leq|\Sigma^{O}| be the upper bound on |||\mathcal{R}|. Note that 0n||V|ΣO|0\leq n_{\bullet}\leq|\mathcal{R}|\leq V^{*}\leq|\Sigma^{O}| always holds. Through detailed case analysis on returned counterexamples, we can show that the change in nn_{\bullet}, denoted by Δn\Delta n_{\bullet}, must always be either 0 or 1, and furthermore, if Δn=0\Delta n_{\bullet}=0, then we must have Δn1\Delta n\geq 1. By the case analysis and tracking nn and nn_{\bullet}, observe that if a counterexample cc is received from the teacher due to hypothesis hh, then at least one of (a) n<nn<n^{*} or (b) n<Vn_{\bullet}<V^{*} must be true. Since Δn\Delta n_{\bullet} and Δn\Delta n cannot simultaneously be 0, whenever a new hypothesis is made, progress must be made towards the upper bound (n,V)(n^{*},V^{*}). Once the upper bound is reached, the algorithm must terminate, since it is impossible to progress beyond the point (n,V)(n^{*},V^{*}).∎

Theorem 5.3 (Query Complexity)

If nn is the number of states of the minimal automaton isomorphic to the target automaton, and mm is the maximum length of any counterexample sequence that the teacher returns, then (a) Remap executes at most n+|ΣO|1n+|\Sigma^{O}|-1 equivalence queries, and (b) the preference query complexity is 𝒪(mn2ln(mn2))\mathcal{O}(mn^{2}\ln(mn^{2})), which is quasilinear in the number of unique sequences queried.

Proof

By Theorem 5.2, the maximum number of equivalence queries is the taxi distance from the point (1,0)(1,0) to (n,|ΣO|)(n,|\Sigma^{O}|), which is n+|ΣO|1n+|\Sigma^{O}|-1. From counterexample processing, at most m(n+|ΣO|1)m(n+|\Sigma^{O}|-1) sequences are added to the prefix set SS, since a counterexample cc of length mm results in at most mm sequences added to SS. The table can be found inconsistent at most n1n-1 times, since there can be at most nn states and the learner starts with 1 state; each inconsistency adds one sequence to the suffix set EE and increases the maximum length of sequences in EE by at most 1, so the maximum sequence length in EE is n1n-1 and EE contains at most nn sequences. Similarly, closure operations, which add sequences to SS, can be performed at most n1n-1 times, so the maximum number of sequences in SS is n+m(n+|ΣO|1)n+m(n+|\Sigma^{O}|-1). The maximum number of unique sequences queried in the table is the maximum cardinality of (SSΣI)E(S\cup S\cdot\Sigma^{I})\cdot E, which is

(n+m(n+|ΣO|1))(1+|ΣI|)n=𝒪(mn2).(n+m(n+|\Sigma^{O}|-1))(1+|\Sigma^{I}|)n=\mathcal{O}(mn^{2}).

Therefore, the preference query complexity of Remap is 𝒪(mn2ln(mn2))\mathcal{O}(mn^{2}\ln(mn^{2})) due to randomized quicksort.∎
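To illustrate where the logarithmic factor comes from, preference queries can be treated as pairwise comparisons driving a comparison sort over the unique sequences. In the sketch below, Python's built-in sort stands in for the randomized quicksort of the proof, and the stand-in classifier f is hypothetical:

```python
from functools import cmp_to_key

query_count = 0

def preference_query(s1, s2, f):
    """One preference query: which of s1, s2 sits higher in the total preorder?"""
    global query_count
    query_count += 1
    return (f(s1) > f(s2)) - (f(s1) < f(s2))

# Stand-in classifier (hypothetical): prefer longer sequences.
f = len
seqs = ["", "a", "ab", "abc", "b", "ba"]
# Timsort stands in for randomized quicksort; both make O(N log N) expected
# comparisons, i.e. O(N log N) preference queries over N unique sequences.
ordered = sorted(seqs, key=cmp_to_key(lambda x, y: preference_query(x, y, f)))
```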

Lemma 1, Theorem 5.2, and Corollary 1 imply Remap makes progress towards termination with every hypothesis, and that termination occurs when specific conditions are satisfied; therefore its output must be correct. Theorem 5.3 indicates that Remap learns the correct minimal automaton isomorphic to the target automaton with polynomially many queries. Next, we show how Remap achieves PAC–identification when sampling-based equivalence queries are used.

Definition 10 (Probably Approximately Correct Identification)

Given Moore machine M=Q,q0,ΣI,ΣO,δ,LM=\langle Q,q_{0},\Sigma^{I},\Sigma^{O},\delta,L\rangle, let the classification function f:(ΣI)ΣOf:(\Sigma^{I})^{*}\rightarrow\Sigma^{O} be represented by f(s)=L(δ(q0,s))f(s)=L(\delta(q_{0},s)) for all s(ΣI)s\in(\Sigma^{I})^{*}. Let 𝒟\mathcal{D} be any probability distribution over (ΣI)(\Sigma^{I})^{*}. An algorithm 𝒜\mathcal{A} probably approximately correctly identifies ff if and only if for any choice of 0<ϵ10<\epsilon\leq 1 and 0<d<10<d<1, 𝒜\mathcal{A} always terminates and outputs an ϵ\epsilon-approximate sequence classifier f^:(ΣI)ΣO\hat{f}:(\Sigma^{I})^{*}\rightarrow\Sigma^{O}, such that with probability at least 1d1-d, the probability of misclassification is P(f^(s)f(s))ϵP(\hat{f}(s)\neq f(s))\leq\epsilon when ss is drawn according to the distribution 𝒟\mathcal{D}.

Theorem 5.4

Remap achieves probably approximately correct identification of any Moore machine when the teacher 𝒯\mathcal{T} uses sampling-based equivalence queries with at least mk1ϵ(ln1d+kln2)m_{k}\geq\left\lceil\frac{1}{\epsilon}\left(\ln\frac{1}{d}+k\ln 2\right)\right\rceil samples drawn i.i.d. from 𝒟\mathcal{D} for the kkth equivalence query.
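The sample-size schedule of Theorem 5.4 can be computed directly; the parameter values below are arbitrary examples:

```python
# Number of i.i.d. samples the teacher draws for the k-th equivalence query.
from math import ceil, log

def samples_for_query(k, epsilon, d):
    """m_k >= ceil((1/epsilon) * (ln(1/d) + k * ln 2))."""
    return ceil((1.0 / epsilon) * (log(1.0 / d) + k * log(2.0)))

# e.g. epsilon = 0.05, d = 0.01: the schedule grows linearly in k
schedule = [samples_for_query(k, epsilon=0.05, d=0.01) for k in range(1, 6)]
```

Later equivalence queries require more samples because the union bound over all queries must hold simultaneously.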

Proof

(Sketch) Let ϵk\epsilon_{k} be the probability that the learner's kkth hypothesis misclassifies a sequence drawn from an arbitrary distribution 𝒟\mathcal{D} over (ΣI)(\Sigma^{I})^{*}. The complementary probability 1ϵk1-\epsilon_{k} depends upon the distribution and on the intersections of the sets of sequences to which the teacher and the learner's kkth hypothesis assign the same classification values. If the teacher draws mkm_{k} samples for the kkth equivalence query, then the probability that no misclassification is detected even though ϵk>ϵ\epsilon_{k}>\epsilon can be upper bounded for a given ϵ\epsilon. Since Remap executes at most n+|ΣO|1n+|\Sigma^{O}|-1 equivalence queries, one can upper bound the probability that Remap terminates with an error by summing the probabilities of the events that the teacher fails to detect an error in each of the at most n+|ΣO|1n+|\Sigma^{O}|-1 equivalence queries. This bound decays exponentially, and a lower bound for mkm_{k} follows in terms of ϵ,d,\epsilon,d, and kk.∎

Theorem 5.5

Given parameters ϵ\epsilon and dd, if ff can be represented by a minimal Moore machine with nn states and |ΣO||\Sigma^{O}| output classes, then to achieve PAC-identification under Remap, teacher 𝒯\mathcal{T} needs to sample at least

𝒪(n+|ΣO|+1ϵ((n+|ΣO|)ln1d+(n+|ΣO|)2))\mathcal{O}\left(n+|\Sigma^{O}|+\frac{1}{\epsilon}\left((n+|\Sigma^{O}|)\ln\frac{1}{d}+(n+|\Sigma^{O}|)^{2}\right)\right)

sequences i.i.d. from 𝒟\mathcal{D} over the entire run of Remap.

Proof

For the kkth equivalence query, the teacher must sample at least mk1ϵ(ln1d+kln2)m_{k}\geq\left\lceil\frac{1}{\epsilon}(\ln\frac{1}{d}+k\ln 2)\right\rceil sequences in order to achieve PAC-identification. Hence, to minimize the total number of samples while still achieving PAC-identification, the teacher samples a quantity of sequences i.i.d. from 𝒟\mathcal{D} equal to the following total

k=1n+|ΣO|1[1ϵ(ln1d+kln2)+1]\sum_{k=1}^{n+|\Sigma^{O}|-1}\left[\frac{1}{\epsilon}(\ln\frac{1}{d}+k\ln 2)+1\right]
=n+|ΣO|1+1ϵ[(ln1d)(n+|ΣO|1)+ln2k=1n+|ΣO|1k]=n+|\Sigma^{O}|-1+\frac{1}{\epsilon}\left[(\ln\frac{1}{d})(n+|\Sigma^{O}|-1)+\ln 2\sum_{k=1}^{n+|\Sigma^{O}|-1}k\right]
=𝒪((n+|ΣO|)+1ϵ((n+|ΣO|)ln1d+(n+|ΣO|)2))=\mathcal{O}\left((n+|\Sigma^{O}|)+\frac{1}{\epsilon}((n+|\Sigma^{O}|)\ln\frac{1}{d}+(n+|\Sigma^{O}|)^{2})\right)∎
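As a numerical sanity check (not part of the proof), the summed schedule can be compared against the closed form before the ceiling and big-O are applied; the parameter values are arbitrary:

```python
# Check: sum over k of (1/eps)(ln(1/d) + k ln 2) + 1 equals
# K + (1/eps)(K ln(1/d) + ln 2 * K(K+1)/2), with K = n + |Sigma_O| - 1.
from math import log

def total_exact(n, sigma_o, eps, d):
    K = n + sigma_o - 1
    return sum((1 / eps) * (log(1 / d) + k * log(2)) + 1 for k in range(1, K + 1))

def total_closed_form(n, sigma_o, eps, d):
    K = n + sigma_o - 1
    return K + (1 / eps) * (log(1 / d) * K + log(2) * K * (K + 1) / 2)

assert abs(total_exact(7, 3, 0.1, 0.05) - total_closed_form(7, 3, 0.1, 0.05)) < 1e-9
```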

Theorem 5.4 and Theorem 5.5 imply Remap achieves PAC–identification for any choice of ϵ\epsilon and dd as long as the teacher samples sufficiently many sequences per equivalence query. In particular, Theorem 5.5 indicates the total quantity of sequences sampled by the teacher to achieve PAC-identification depends on both the number of states nn and the number of output classes |ΣO||\Sigma^{O}|. This contrasts with the result for PAC-identification of DFAs (Theorem 7 of [5]), which depends only on nn. Finally, since Remap outputs a Moore machine, one can leverage Moore and Mealy machine equivalence to convert the final hypothesis into a reward machine, as defined and covered in the next section.

6 Learning Reward Machines from Preferences

We consider applying Remap to learn reward machines from preferences. Reward machines are Mealy machines with propositional and reward semantics. Equivalence between Mealy and Moore machines allows the output of Remap to be converted to a reward machine. We first review reinforcement learning and reward machine semantics.

Markov Decision Processes  Decision making problems are often modeled by a Markov Decision Process (MDP), which is a tuple 𝒮,𝒜,P,R,γ\langle\mathcal{S},\mathcal{A},P,R,\gamma\rangle where 𝒮\mathcal{S} is the set of states, 𝒜\mathcal{A} is the set of actions, P:𝒮×𝒜×𝒮[0,1]P:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1] represents the transition probability from state ss to ss^{\prime} via action aa. The reward function R:𝒮×𝒜×𝒮R:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R} provides the associated scalar reward, and 0γ10\leq\gamma\leq 1 is a discount factor. In MDPs, the Markovian assumption is that transitions and rewards depend only upon the current state-action pair and the next state. However, not all tasks are expressible using Markovian reward [2].

Non-Markovian Reward  Non-Markovian Reward Decision Processes are identical to MDPs, except that R:(𝒮×𝒜)R:(\mathcal{S}\times\mathcal{A})^{*}\rightarrow\mathbb{R} is a non-Markovian reward function that depends on the state-action history. Reward machines model a class of such non-Markovian reward functions.
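To make the distinction concrete, the following sketch on a hypothetical domain contrasts a Markovian reward with a history-dependent one:

```python
# A Markovian reward is a function of a single transition (s, a, s'), while
# a non-Markovian reward may depend on the entire state-action history.
# "Reach the office only after getting coffee" cannot be expressed as any
# R(s, a, s').

def markovian_reward(s, a, s_next):
    # depends only on the current transition
    return 1.0 if s_next == "office" else 0.0

def non_markovian_reward(history):
    # history is a sequence of (state, action) pairs
    states = [s for s, _ in history]
    return 1.0 if "coffee" in states[:-1] and states[-1] == "office" else 0.0

good = [("start", "move"), ("coffee", "pick"), ("office", "drop")]
bad = [("start", "move"), ("office", "drop")]
# non_markovian_reward distinguishes the two histories; markovian_reward,
# seeing only the final transition into "office", cannot
```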

Reward Machines  A reward machine (RM) is a Mealy machine where ΣI=2𝒫\Sigma^{I}=2^{\mathcal{P}}, ΣO\Sigma^{O} is a set of reward-emitting objects, and 𝒫\mathcal{P} is a set of propositions describing states and actions. A labeling function 𝕃:𝒮×𝒜×𝒮2𝒫\mathbb{L}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow 2^{\mathcal{P}} with 𝕃(sk1,ak,sk)=lk\mathbb{L}(s_{k-1},a_{k},s_{k})=l_{k} labels a state-action sequence s0a1s1a2s2ansns_{0}a_{1}s_{1}a_{2}s_{2}\dots a_{n}s_{n} with label sequence l1l2lnl_{1}l_{2}\dots l_{n}. Thus, reward machines operate over label sequences.

A single transition labeled by a disjunctive normal form (DNF) formula can summarize multiple transitions with identical ΣO\Sigma^{O} outputs connecting a pair of states, since the elements of 2𝒫2^{\mathcal{P}} are sets of propositions. Reward machines map label sequences to reward outputs and can be represented as f:(2𝒫)ΣOf:(2^{\mathcal{P}})^{*}\rightarrow\Sigma^{O}, so Remap learns a reward machine by converting its output Moore machine into an equivalent reward machine.
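A minimal sketch of running a reward machine over a label sequence follows, for a hypothetical "see coffee, then see office" task; the dictionary encoding of transitions is illustrative, not the paper's representation:

```python
# A two-step reward machine: state 0 waits for "coffee", state 1 waits for
# "office"; the transition into the final state emits reward 1.0.

def run_reward_machine(labels):
    """Map a label sequence in (2^P)* to the last emitted reward."""
    delta = {  # (state, proposition) -> (next state, reward)
        (0, "coffee"): (1, 0.0),
        (1, "office"): (2, 1.0),
    }
    q, reward = 0, 0.0
    for label in labels:  # each label is a set of propositions
        for p in label:
            if (q, p) in delta:
                q, reward = delta[(q, p)]
                break
        else:
            reward = 0.0  # self-loop with zero reward on unmatched labels
    return reward

# coffee, then an empty label, then office: the final transition emits 1.0
assert run_reward_machine([{"coffee"}, set(), {"office"}]) == 1.0
```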

Sequential Tasks in OfficeWorld and CraftWorld  Icarte et al. [25] and Andreas et al. [4] introduced the OfficeWorld and CraftWorld gridworld domains, respectively; both feature sequential tasks encoded as reward machines. OfficeWorld features 4 sequential tasks across several rooms with various objects available for an agent to interact with. Example tasks include (1) picking up coffee and mail and delivering them to a certain room, or (2) continuously patrolling between a set of rooms. CraftWorld is a 2D version of Minecraft, where the 10 sequential tasks involve the agent collecting materials and constructing tools or objects in a certain order while avoiding hazards.

7 Empirical Results

We evaluate the exact and PAC–identification (PAC-ID) versions of Remap and consider two questions. First, how often is PAC-ID Remap correct? (Exact Remap is guaranteed to be correct.) To answer this, we apply PAC-ID Remap to learn reward machines (RMs), converting each output Moore machine into an RM, and measure empirical correctness via the empirical probability of isomorphism and average regret. Second, how do exact and PAC-ID Remap scale? We measure preference query complexity as a function of the number of unique sequences queried, and present an example phase diagram of algorithm execution. We use Z3 [15] as the constraint solver.

Setup  We investigate these questions on 14 sequential tasks in the OfficeWorld and CraftWorld domains; Appendix 0.A.1 contains domain-specific details. We implement exact equivalence queries (Definition 5) using a variant of the Hopcroft-Karp algorithm [3, 24]. We implement i.i.d. sequence sampling in sampling-based equivalence queries with the following process per sample: sample a length LL from a geometric distribution; then construct an LL-length sequence by drawing LL elements i.i.d. from a uniform distribution over ΣI\Sigma^{I}. In both the exact and sampling-based equivalence queries, strong feedback is used.
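The sampling procedure just described can be sketched directly; the geometric parameter p is a free choice not fixed in the text:

```python
# Draw a length L from a geometric distribution, then build an L-length
# sequence by drawing elements i.i.d. uniformly from the input alphabet.
import random

def sample_sequence(sigma_in, p=0.25, rng=random):
    L = 0
    while rng.random() >= p:  # number of failures before the first success
        L += 1
    return tuple(rng.choice(sigma_in) for _ in range(L))

rng = random.Random(0)
samples = [sample_sequence(["a", "b", "c"], p=0.25, rng=rng) for _ in range(5)]
# every element of every sampled sequence lies in the input alphabet
```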

7.1 PAC–Identification Correctness Experiments

Reproducibility  PAC-ID Remap was run 100 times per ground truth reward machine. We measure correctness based on (1) empirical probability that the learned RM is isomorphic to the ground truth RM, based on classification accuracy, and (2) average policy regret between the learned RM policy and the ground truth RM policy.

Figure 4: PAC–identification Remap: (left) empirical isomorphism probability, (right) average regret as functions of the number of samples per equivalence query for 4 OfficeWorld tasks (O-T1-4) and 10 CraftWorld tasks (C-T1-10).
Figure 5: Query Complexity. Top left: Exact Remap preference query complexity. Mean of 100 trials per ground truth reward machine (blue dots) ±1\pm 1 standard deviation (orange, grey bars). Top right: PAC-ID Remap, preference query complexity is 𝒪(nlnn)\mathcal{O}(n\ln n) in the number of unique sequences in the table. Bottom left: Example termination phase diagram. Bottom right: CraftWorld environment depiction.

Empirical Probability of Isomorphism is the fraction of learned RMs with 100% classification accuracy, where classification accuracy is defined as the fraction of a test set of sequences that are identically classified by the learned and ground truth RMs. As the number of sample sequences tested by the teacher per equivalence query increases, the probability that the learner outputs an RM isomorphic to the ground truth RM upon termination goes to 1 (Figure 4, left column). The Appendix describes the distribution over (ΣI)(\Sigma^{I})^{*} from which the test set is drawn.

Average Policy Regret  We employ Q-learning with counterfactual experiences for reward machines (CRM) [41] to obtain optimal policies for ground truth and learned RMs, and measure the empirical expected return of the optimal policies learned from each type of RM. Average regret for a given task is measured as the difference between the empirical return under the ground truth RM for that task (averaged over 100 CRM trials) and the empirical return under the learned RMs (10 CRM trials per learned RM, averaged over all 100×10=1000100\times 10=1000 trials). Regret goes to 0 as the number of samples tested by the teacher per equivalence query increases (Figure 4, right column). Appendix 0.A.1.6 describes regret computation details.

Correctness Conclusion  Exact Remap learns the correct automaton 100% of the time. Additionally, PAC-ID Remap is more likely to be correct as the number of samples per equivalence query increases: isomorphism probability goes to 1 and regret goes to 0 for all tasks in both domains.

7.2 Scaling Experiments

Figure 5 shows query complexity results. We measure preference query complexity of exact and PAC-ID Remap as a function of the number of unique sequences stored in the table upon termination. Exact Remap (upper left) displays a trendline of C=0.2114NlnNC=0.2114N\ln N with R2=0.99268R^{2}=0.99268, where CC is the number of queries, and NN is the number of unique sequences in the observation table. PAC-ID Remap (upper right) tends to make significantly more preference queries about unique sequences compared to exact Remap due to the sampling process. However, the number of preference queries is still 𝒪(NlnN)\mathcal{O}(N\ln N) due to randomized quicksort and linear merge comparison complexity. In comparison, an L\textrm{L}^{*}-based approach would use exactly NN membership queries (linear in NN with a coefficient of 1).

The maximum number of equivalence queries Remap makes (Theorem 5.3) is the taxi distance from (0,1)(0,1) to (|ΣO|,n)(|\Sigma^{O}|,n) in the termination phase diagram of Figure 5. Progress (Lemma 1 and Theorem 5.2) towards termination (Corollary 1) occurs whenever a new hypothesis is made. Remap can terminate early, once all variables have correct values and the required number of states is reached.

8 Related Work

Active approaches for learning automata are variations or improvements of Angluin’s seminal L\textrm{L}^{*} algorithm [5], featuring membership and equivalence queries. We consider an alternative formulation: actively learning automata from preference and equivalence queries featuring feedback. We first discuss adaptations of L\textrm{L}^{*} for learning variants of finite automata, including reward machine variants.

Learning Finite Automata  Angluin [5] introduced L\textrm{L}^{*} to learn deterministic finite automata (DFAs). Remap has theoretical guarantees similar to those of L\textrm{L}^{*}, but utilizes a symbolic observation table rather than an evidence-based one. Other algorithms adopt the evidence-based table of L\textrm{L}^{*} to learn: symbolic automata [17, 7], where Boolean predicates summarize state transitions; weighted automata [9, 8], which feature valuation semantics for sequences on non-deterministic automata; and probabilistic DFAs [44], weighted automata that model distributions over sequences. None of these approaches uses preference queries.

However, Shah et al. [38] consider active, cost-based selection between membership and preference queries to learn DFAs, relying on a satisfiability encoding of the problem. They assume a fixed hypothesis space and provide probabilistic guarantees for termination and correctness. Remap, through unification, navigates a sequence of hypothesis spaces, each guaranteed to contain a concrete hypothesis satisfying the current constraints, and has theoretical guarantees of correctness, minimality, and termination under exact and PAC–identification settings.

Furthermore, learning finite automata from preference information relates to the novel problem of learning reward machines [25] from preferences. Learning Markovian reward functions from preferences has been studied extensively using neural [11, 36] and interpretable decision tree [10, 27] representations, but approaches for learning reward machines primarily adapt evidence-based finite automata learning approaches.

Reward Machine Variants  Several reward machine (RM) variants have been proposed. Classical RMs [25, 42] have deterministic transitions and rewards; probabilistic RMs [16] model probabilistic transitions and deterministic rewards; and stochastic RMs [14] pair deterministic transitions with stochastic rewards. Symbolic RMs [47] are deterministic like classical RMs, but feature symbolic reward values in place of concrete values. Zhou and Li [47] apply Bayesian inverse reinforcement learning (BIRL) to infer optimal reward values and actualize symbolic RMs into classical RMs, and require a symbolic RM sketch. Remap requires no sketch, since it navigates over a hypothesis space of symbolic RMs and outputs a concrete classical RM upon termination.

Learning Reward Machines  Many RM learning algorithms assume access to explicit reward samples via environment interaction. Given a maximum RM size, Toro Icarte et al. [42] apply discrete optimization to arrive at a perfect classical RM. Xu et al. [45] learn a minimal classical RM by combining regular positive negative inference [18] with Q-learning for RMs [25], and apply constraint solving to ensure each hypothesis RM is consistent with observed reward samples. Corazza et al. [14] extend the method to learn stochastic RMs. Topper et al. [40] extend BIRL to learn classical RMs using simulated annealing, but need the number of states to be supplied and require empirical tuning of hyperparameters. L\textrm{L}^{*}-based approaches have also been used to learn classical [39, 20, 46] and probabilistic [16] RMs, relying on concrete observation tables. Gaon and Brafman [20] and Xu et al. [46] use a binary observation table, while Tappler et al. [39] and Dohmen et al. [16] record empirical reward distribution table entries.

In contrast, Remap uses a symbolic observation table and preference information in place of explicit reward values; it navigates a symbolic hypothesis space, with constraint solving yielding a concrete classical RM.

9 Conclusion

We introduce the problem of learning Moore machines from preferences and propose Remap, an L\textrm{L}^{*}-based algorithm wherein a strong learner with access to a constraint solver is paired with a weak teacher capable of answering preference queries and providing counterexample feedback in the form of a constraint. Unification applied to a symbolic observation table permits symbolic hypothesis space navigation; the constraint solver yields concrete hypotheses. Remap has theoretical guarantees of correctness, termination, and minimality under both exact and PAC–identification settings, and it has been empirically verified in both settings when applied to learning reward machines. Future work will explore more realistic preference models, variable-strength feedback, and inconsistent teachers.

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

References

  • Aarts et al. [2012] Aarts, F., Kuppens, H., Tretmans, J., Vaandrager, F.W., Verwer, S.: Learning and testing the bounded retransmission protocol. In: International Conference on Grammatical Inference (2012), URL https://api.semanticscholar.org/CorpusID:2641499
  • Abel et al. [2021] Abel, D., Dabney, W., Harutyunyan, A., Ho, M.K., Littman, M., Precup, D., Singh, S.: On the expressivity of markov reward. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 7799–7812, Curran Associates, Inc. (2021), URL https://proceedings.neurips.cc/paper_files/paper/2021/file/4079016d940210b4ae9ae7d41c4a2065-Paper.pdf
  • Almeida et al. [2009] Almeida, M., Moreira, N., Reis, R.: Testing the equivalence of regular languages. In: Workshop on Descriptional Complexity of Formal Systems (2009), URL https://api.semanticscholar.org/CorpusID:9014414
  • Andreas et al. [2017] Andreas, J., Klein, D., Levine, S.: Modular multitask reinforcement learning with policy sketches. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp. 166–175, PMLR (06–11 Aug 2017), URL https://proceedings.mlr.press/v70/andreas17a.html
  • Angluin [1987] Angluin, D.: Learning regular sets from queries and counterexamples. Inf. Comput. 75(2), 87–106 (1987), doi:10.1016/0890-5401(87)90052-6, URL https://doi.org/10.1016/0890-5401(87)90052-6
  • Angluin [1988] Angluin, D.: Queries and concept learning. Machine learning 2, 319–342 (1988)
  • Argyros and D’antoni [2018] Argyros, G., D’antoni, L.: The learnability of symbolic automata. In: International Conference on Computer Aided Verification (2018)
  • Balle and Mohri [2015] Balle, B., Mohri, M.: Learning weighted automata. In: Conference on Algebraic Informatics (2015)
  • Bergadano and Varricchio [1994] Bergadano, F., Varricchio, S.: Learning behaviors of automata from multiplicity and equivalence queries. SIAM J. Comput. 25, 1268–1280 (1994)
  • Bewley and Lécué [2021] Bewley, T., Lécué, F.: Interpretable preference-based reinforcement learning with tree-structured reward functions. ArXiv abs/2112.11230 (2021), URL https://api.semanticscholar.org/CorpusID:245353680
  • Bıyık et al. [2022] Bıyık, E., Talati, A., Sadigh, D.: Aprel: A library for active preference-based reward learning algorithms (2022)
  • Christiano et al. [2017a] Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, Curran Associates, Inc. (2017a), URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf
  • Christiano et al. [2017b] Christiano, P.F., Leike, J., Brown, T.B., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences. ArXiv abs/1706.03741 (2017b), URL https://api.semanticscholar.org/CorpusID:4787508
  • Corazza et al. [2022] Corazza, J., Gavran, I., Neider, D.: Reinforcement learning with stochastic reward machines. In: Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pp. 6429–6436, AAAI Press (2022), URL https://ojs.aaai.org/index.php/AAAI/article/view/20594
  • De Moura and Bjørner [2008] De Moura, L., Bjørner, N.: Z3: An efficient smt solver. In: Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, p. 337–340, TACAS’08/ETAPS’08, Springer-Verlag, Berlin, Heidelberg (2008), ISBN 3540787992
  • Dohmen et al. [2022] Dohmen, T., Topper, N., Atia, G.K., Beckus, A., Trivedi, A., Velasquez, A.: Inferring probabilistic reward machines from non-markovian reward signals for reinforcement learning. In: Kumar, A., Thiébaux, S., Varakantham, P., Yeoh, W. (eds.) Proceedings of the Thirty-Second International Conference on Automated Planning and Scheduling, ICAPS 2022, Singapore (virtual), June 13-24, 2022, pp. 574–582, AAAI Press (2022), URL https://ojs.aaai.org/index.php/ICAPS/article/view/19844
  • Drews and D’antoni [2017] Drews, S., D’antoni, L.: Learning symbolic automata. In: International Conference on Tools and Algorithms for Construction and Analysis of Systems (2017)
  • Dupont [1994] Dupont, P.: Regular grammatical inference from positive and negative samples by genetic search: the GIG method. In: Carrasco, R.C., Oncina, J. (eds.) Grammatical Inference and Applications, Second International Colloquium, ICGI-94, Alicante, Spain, September 21-23, 1994, Proceedings, Lecture Notes in Computer Science, vol. 862, pp. 236–245, Springer (1994), doi:10.1007/3-540-58473-0\_152, URL https://doi.org/10.1007/3-540-58473-0_152
  • Fleischner [1977] Fleischner, H.: On the equivalence of mealy-type and moore-type automata and a relation between reducibility and moore-reducibility. Journal of Computer and System Sciences 14(1), 1–16 (1977), ISSN 0022-0000, doi:https://doi.org/10.1016/S0022-0000(77)80038-X, URL https://www.sciencedirect.com/science/article/pii/S002200007780038X
  • Gaon and Brafman [2020] Gaon, M., Brafman, R.I.: Reinforcement learning with non-markovian rewards. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 3980–3987, AAAI Press (2020)
  • Giannakopoulou et al. [2012] Giannakopoulou, D., Rakamaric, Z., Raman, V.: Symbolic learning of component interfaces. In: Static Analysis Symposium (2012), URL https://api.semanticscholar.org/CorpusID:1449946
  • Gold [1978] Gold, E.M.: Complexity of automaton identification from given data. Inf. Control. 37, 302–320 (1978), URL https://api.semanticscholar.org/CorpusID:8943792
  • Guerin et al. [2013] Guerin, J.T., Allen, T.E., Goldsmith, J.: Learning cp-net preferences online from user queries. In: AAAI Conference on Artificial Intelligence (2013), URL https://api.semanticscholar.org/CorpusID:15976671
  • Hopcroft and Karp [1971] Hopcroft, J.E., Karp, R.M.: A linear algorithm for testing equivalence of finite automata. (1971), URL https://api.semanticscholar.org/CorpusID:120207847
  • Icarte et al. [2018] Icarte, R.T., Klassen, T., Valenzano, R., McIlraith, S.: Using reward machines for high-level task specification and decomposition in reinforcement learning. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80, pp. 2107–2116, PMLR (10–15 Jul 2018), URL https://proceedings.mlr.press/v80/icarte18a.html
  • Iyengar and Lepper [2000] Iyengar, S.S., Lepper, M.R.: When choice is demotivating: Can one desire too much of a good thing? Journal of personality and social psychology 79(6), 995 (2000)
  • Kalra and Brown [2023] Kalra, A., Brown, D.S.: Can differentiable decision trees learn interpretable reward functions? ArXiv abs/2306.13004 (2023), URL https://api.semanticscholar.org/CorpusID:259224487
  • Koriche and Zanuttini [2009] Koriche, F., Zanuttini, B.: Learning conditional preference networks with queries. Artif. Intell. 174, 685–703 (2009), URL https://api.semanticscholar.org/CorpusID:3060370
  • Lin et al. [2014] Lin, S.W., André, É., Liu, Y., Sun, J., Dong, J.S.: Learning assumptions for compositional verification of timed systems. IEEE Transactions on Software Engineering 40(2), 137–153 (2014)
  • MacGlashan et al. [2017] MacGlashan, J., Ho, M.K., Loftin, R., Peng, B., Wang, G., Roberts, D.L., Taylor, M.E., Littman, M.L.: Interactive learning from policy-dependent human feedback. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp. 2285–2294, PMLR (06–11 Aug 2017), URL https://proceedings.mlr.press/v70/macglashan17a.html
  • Martelli and Montanari [1976] Martelli, A., Montanari, U.: Unification in linear time and space: a structured presentation. Tech. rep., Istituto di Elaborazione della Informazione, Pisa (1976)
  • Mealy [1955] Mealy, G.H.: A method for synthesizing sequential circuits. The Bell System Technical Journal 34(5), 1045–1079 (1955), doi:10.1002/j.1538-7305.1955.tb03788.x
  • Moore [1956] Moore, E.F.: Gedanken-experiments on sequential machines. In: Shannon, C., McCarthy, J. (eds.) Automata Studies, pp. 129–153, Princeton University Press, Princeton, NJ (1956)
  • Ouyang et al. [2022] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L.E., Simens, M., Askell, A., Welinder, P., Christiano, P.F., Leike, J., Lowe, R.J.: Training language models to follow instructions with human feedback. ArXiv abs/2203.02155 (2022), URL https://api.semanticscholar.org/CorpusID:246426909
  • Robinson [1965] Robinson, J.A.: A machine-oriented logic based on the resolution principle. J. ACM 12, 23–41 (1965)
  • Sadigh et al. [2017] Sadigh, D., Dragan, A.D., Sastry, S., Seshia, S.A.: Active preference-based learning of reward functions. In: Robotics: Science and Systems (2017)
  • Schuts et al. [2016] Schuts, M., Hooman, J., Vaandrager, F.: Refactoring of Legacy Software Using Model Learning and Equivalence Checking: An Industrial Experience Report. Springer International Publishing (2016)
  • Shah et al. [2023] Shah, A., Vazquez-Chanlatte, M., Junges, S., Seshia, S.A.: Learning formal specifications from membership and preference queries (2023)
  • Tappler et al. [2019] Tappler, M., Aichernig, B.K., Bacci, G., Eichlseder, M., Larsen, K.G.: L*-based learning of Markov decision processes. In: International Symposium on Formal Methods, pp. 651–669, Springer (2019)
  • Topper et al. [2024] Topper, N., Velasquez, A., Atia, G.: Bayesian inverse reinforcement learning for non-markovian rewards (2024)
  • Toro Icarte et al. [2022] Toro Icarte, R., Klassen, T.Q., Valenzano, R.A., McIlraith, S.A.: Reward machines: Exploiting reward function structure in reinforcement learning. Journal of Artificial Intelligence Research (JAIR) 73, 173–208 (2022), URL https://doi.org/10.1613/jair.1.12440
  • Toro Icarte et al. [2019] Toro Icarte, R., Waldie, E., Klassen, T., Valenzano, R., Castro, M., McIlraith, S.: Learning reward machines for partially observable reinforcement learning. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32, Curran Associates, Inc. (2019), URL https://proceedings.neurips.cc/paper/2019/file/532435c44bec236b471a47a88d63513d-Paper.pdf
  • Valiant [1984] Valiant, L.G.: A theory of the learnable. Commun. ACM 27(11), 1134–1142 (nov 1984), ISSN 0001-0782, doi:10.1145/1968.1972, URL https://doi.org/10.1145/1968.1972
  • Weiss et al. [2019] Weiss, G., Goldberg, Y., Yahav, E.: Learning deterministic weighted automata with queries and counterexamples. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32, Curran Associates, Inc. (2019), URL https://proceedings.neurips.cc/paper_files/paper/2019/file/d3f93e7766e8e1b7ef66dfdd9a8be93b-Paper.pdf
  • Xu et al. [2020] Xu, Z., Gavran, I., Ahmad, Y., Majumdar, R., Neider, D., Topcu, U., Wu, B.: Joint inference of reward machines and policies for reinforcement learning. Proceedings of the International Conference on Automated Planning and Scheduling 30(1), 590–598 (Jun 2020), doi:10.1609/icaps.v30i1.6756, URL https://ojs.aaai.org/index.php/ICAPS/article/view/6756
  • Xu et al. [2021] Xu, Z., Wu, B., Ojha, A., Neider, D., Topcu, U.: Active finite reward automaton inference and reinforcement learning using queries and counterexamples. In: Machine Learning and Knowledge Extraction: 5th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2021, Virtual Event, August 17–20, 2021, Proceedings, p. 115–135, Springer-Verlag, Berlin, Heidelberg (2021), ISBN 978-3-030-84059-4, doi:10.1007/978-3-030-84060-0\_8, URL https://doi.org/10.1007/978-3-030-84060-0_8
  • Zhou and Li [2022] Zhou, W., Li, W.: A hierarchical Bayesian approach to inverse reinforcement learning with symbolic reward machines. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S. (eds.) Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 162, pp. 27159–27178, PMLR (17–23 Jul 2022), URL https://proceedings.mlr.press/v162/zhou22b.html

Appendix 0.A Technical Appendix

In this technical appendix, we present empirical data, examples, experimental setup information, termination plots, detailed algorithms, and proofs. We also present the reproducibility checklist at the end of the appendix.

0.A.1 Experimental Details

0.A.1.1 Hardware

Experiments were run on a server with 512GB of memory and 2 Intel(R) Xeon(R) Gold 6248R CPUs. Each experimental run was executed on a single core. The operating system was Ubuntu 22.04.4. The relevant software libraries are listed in the code appendix remap/README file.

0.A.1.2 OfficeWorld and CraftWorld Domains

The computational experiments in this paper involved the OfficeWorld and CraftWorld domains. Both are gridworld environments that have been used in the reward machine literature. They feature sequential tasks which can be represented as a reward machine. Figure 6 illustrates an example OfficeWorld domain, along with a corresponding reward machine representing a task where the agent must bring paper to the desk while avoiding obstacles.

0.A.1.3 Handling Reward Machine Incompleteness.

The classical reward machines specified by Icarte et al. [25] in the OfficeWorld and CraftWorld domains are incomplete automata, in that not all states have transitions defined. Specifically, classical reward machines have a terminal state from which no transitions can occur. However, an incomplete reward machine can be converted into a complete one by adding transitions from all terminal states to a single, special absorbing “HALT” state.

Therefore, to handle reward machine incompleteness, we first convert the incomplete reward machines into complete Mealy machines with terminal states, and then convert the Mealy machine to a complete Moore machine with a single absorbing HALT state which all terminal states transition to. This latter Moore machine is used by the teacher as the ground truth non-Markovian reward function. Next, Remap is run, resulting in a learned Moore machine. The learned Moore machine is converted to a Mealy machine, and the absorbing state is identified and removed.
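This completion step can be sketched in a few lines of Python. The representation below (transition dictionaries keyed by state–symbol pairs, and the name `complete_machine`) is our illustration, not the released implementation; per the paper, terminal states are the only states lacking outgoing transitions.

```python
HALT = "HALT"  # the single, special absorbing state

def complete_machine(alphabet, delta, terminal_states):
    """Complete a partial transition function.

    delta: dict mapping (state, symbol) -> next state; terminal states
    have no outgoing transitions. All missing transitions from terminal
    states are routed to an absorbing HALT state.
    """
    completed = dict(delta)
    for q in terminal_states:
        for a in alphabet:
            completed.setdefault((q, a), HALT)
    for a in alphabet:
        completed[(HALT, a)] = HALT  # HALT is absorbing
    return completed
```

Running Remap on the completed machine and then deleting the absorbing state (and its incoming transitions) recovers an incomplete reward machine of the original form.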

We do not merge terminal states, in order to remain consistent with the original reward machine implementation, but we do collapse pairs of transitions into a single summary transition when they share the same start state, end state, and output value. This is accomplished by constructing a truth table, deriving a disjunctive normal form (DNF) formula from the truth table, simplifying the DNF, and using the result as the summarizing transition label. Once this process is complete, the result is a learned reward machine, which can then be evaluated.
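The truth-table-to-DNF step can be sketched as follows. This illustrative snippet (our own, with hypothetical names) only builds the unsimplified DNF, one conjunctive term per true row; the implementation described above additionally simplifies the resulting formula.

```python
def dnf_label(props, accepted_subsets):
    """Summarize collapsed parallel transitions by one DNF label.

    props: ordered list of proposition names.
    accepted_subsets: the input symbols (subsets of props) on which the
    collapsed transitions fire; each subset is one true row of the truth
    table and contributes one conjunctive term to the DNF.
    """
    terms = []
    for subset in accepted_subsets:
        lits = [p if p in subset else "!" + p for p in props]
        terms.append("(" + " & ".join(lits) + ")")
    return " | ".join(terms)
```

For example, transitions firing on the symbols {a} and {a, b} would be summarized by the label `(a & !b) | (a & b)`, which a simplifier would reduce to `a`.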

0.A.1.4 Handling Reward Machine Nondeterminism.

Classical reward machines have a deterministic definition. However, some of the classical reward machines specified by Icarte et al. [25] are nondeterministic, in that from a given state, multiple transitions can be satisfied under a given set of true propositions. This was the case for the reward machines for tasks 5 through 10 in the CraftWorld domain.

The common nondeterminism in those reward machines was the following type: assume the proposition set is 𝒫={a,b}\mathcal{P}=\{a,b\}, and the transitions have been summarized according to some set of Boolean formulae for which a subset is {ϕa,ϕb}\{\phi_{a},\phi_{b}\}, where ϕa\phi_{a} is satisfied whenever aa holds, and ϕb\phi_{b} is satisfied whenever bb holds. Suppose we have a state q1q_{1}, and two of its summarized transitions are the following: q2=δ(q1,ϕa)q_{2}=\delta(q_{1},\phi_{a}) and q3=δ(q1,ϕb)q_{3}=\delta(q_{1},\phi_{b}), meaning that a transition from q1q_{1} to q2q_{2} occurs if ϕa\phi_{a} is satisfied, and a transition from q1q_{1} to q3q_{3} occurs if ϕb\phi_{b} is satisfied. Clearly, ϕa\phi_{a} and ϕb\phi_{b} can simultaneously be satisfied if aba\land b holds, implying nondeterminism. Additionally, all the nondeterministic reward machine specifications also contained the following style of transitions: q4=δ(q1,ϕaϕb)=δ(q1,ϕbϕa)q_{4}=\delta(q_{1},\phi_{a}\phi_{b})=\delta(q_{1},\phi_{b}\phi_{a}), where ϕaϕb\phi_{a}\phi_{b} and ϕbϕa\phi_{b}\phi_{a} are sequences of length 2. This type of nondeterminism can be corrected by adding an additional state q5q_{5} and modifying the transitions from q1q_{1} to q2q_{2} and q3q_{3}, while still maintaining the intended behavior of reaching state q4q_{4}. Specifically, the conversion is as follows: given

Refer to caption
Refer to caption
Figure 6: Example OfficeWorld domain (top) and a reward machine (bottom) encoding the sequential task “bring the paper to the desk without running into any obstacles.” A transition from state uiu_{i} to uju_{j}, labeled by propositional formula ϕ\phi and scalar reward rr as ϕ,r\langle\phi,r\rangle, occurs only if ϕ\phi is satisfied; reward rr is emitted upon transition. Propositions p,tp,t, and dd denote, respectively: the agent possessing paper, the agent running into an obstacle, and the agent being located at the desk.
q1\displaystyle q_{1} =δ(q1,ε)=δ(q1,¬ϕa¬ϕb)\displaystyle=\delta(q_{1},\varepsilon)=\delta(q_{1},\neg\phi_{a}\land\neg\phi_{b})
q2\displaystyle q_{2} =δ(q1,ϕa)=δ(q2,¬ϕb)\displaystyle=\delta(q_{1},\phi_{a})=\delta(q_{2},\neg\phi_{b})
q3\displaystyle q_{3} =δ(q1,ϕb)=δ(q3,¬ϕa)\displaystyle=\delta(q_{1},\phi_{b})=\delta(q_{3},\neg\phi_{a})
q4\displaystyle q_{4} =δ(q1,ϕaϕb)=δ(q1,ϕbϕa),\displaystyle=\delta(q_{1},\phi_{a}\phi_{b})=\delta(q_{1},\phi_{b}\phi_{a}),

we can make the following adjustments and additions:

q2=δ(q1,ϕa)\displaystyle q_{2}=\delta(q_{1},\phi_{a}) q2=δ(q1,ϕa¬ϕb)\displaystyle\Longrightarrow q_{2}=\delta(q_{1},\phi_{a}\land\neg\phi_{b})
q3=δ(q1,ϕb)\displaystyle q_{3}=\delta(q_{1},\phi_{b}) q3=δ(q1,ϕb¬ϕa)\displaystyle\Longrightarrow q_{3}=\delta(q_{1},\phi_{b}\land\neg\phi_{a})
Add new state q5=δ(q1,ϕaϕb)\displaystyle\Longrightarrow q_{5}=\delta(q_{1},\phi_{a}\land\phi_{b})
=δ(q5,¬(ϕaϕb))\displaystyle\phantom{{}\Longrightarrow q_{5}}=\delta(q_{5},\neg(\phi_{a}\lor\phi_{b}))
Add new transition q4=δ(q5,ϕaϕb).\displaystyle\Longrightarrow q_{4}=\delta(q_{5},\phi_{a}\lor\phi_{b}).

These changes make the reward machine deterministic while still maintaining desired behavior by explicitly providing three different transitions away from state q1q_{1} for processing the input proposition sets {a},{b},\{a\},\{b\}, and {a,b}\{a,b\} separately.
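To make the rewriting concrete, the following sketch (our illustration, not the authors' code) encodes the guards as Python predicates over the current set of true propositions, and checks that the original guards overlap on the input {a,b}\{a,b\} while the adjusted guards enable exactly one transition per input:

```python
# ground propositions a and b, as in the example above
def phi_a(ps): return "a" in ps
def phi_b(ps): return "b" in ps

# original summarized guards out of q1: both fire when a and b hold
old_guards = {"q2": phi_a, "q3": phi_b}

# adjusted guards: mutually exclusive, with the new state q5 handling a & b
new_guards = {
    "q2": lambda ps: phi_a(ps) and not phi_b(ps),
    "q3": lambda ps: phi_b(ps) and not phi_a(ps),
    "q5": lambda ps: phi_a(ps) and phi_b(ps),
}

def enabled(guards, ps):
    """Return the successor states whose guard is satisfied by input ps."""
    return [q for q, g in guards.items() if g(ps)]
```

On the input {a,b}\{a,b\}, `enabled(old_guards, ...)` returns two successors (nondeterminism), whereas `enabled(new_guards, ...)` returns only the new state q5q_{5}.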

Refer to caption
Figure 7: Learning a Moore machine with Remap (top) vs L\textrm{L}^{*} (bottom). L\textrm{L}^{*} employs concrete values in its observation table, whereas Remap uses a symbolic observation table S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle. The function f(s)f(s) returns 11 if ss is in aba^{*}b and returns 0 otherwise. (a) initializing S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle and performing a SymbolicFill, unclosed table; (b) expand SS with sequence bb to close the table, followed by a SymbolicFill yielding a unified, closed, and consistent S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle; (c) MakeHypothesis yields concrete hypothesis h1h_{1} from the symbolic hypothesis and solving constraints in 𝒞\mathcal{C}; (d) submit h1h_{1} via equivalence query, receive and process the counterexample babbab with feedback f(bab)=0f(bab)=0 from the teacher by adding babbab, baba, and bb to SS, perform a SymbolicFill, set the value equivalence class of T(bab)T(bab) to f(bab)f(bab), yielding an inconsistent table; (e) expand suffixes EE with bb and perform a SymbolicFill, yielding a unified, closed, and consistent table; (f) MakeHypothesis yields concrete hypothesis h2h_{2} for an equivalence query, wherein the teacher establishes h2h_{2} is correct, and the algorithm terminates. Learning the target Moore machine with L\textrm{L}^{*} is shown in parts (g)-(l). (g) shows using membership queries to populate the table with concrete values, resulting in an unclosed table, which is then made closed in (h) by adding bb to SS. Since the result is closed and consistent, a concrete hypothesis can be made in part (i). Sending this hypothesis to the teacher via an equivalence query results in the teacher sending a counterexample babbab back, which must be processed in (j). Here, babbab and all its prefixes are added to SS, resulting in an inconsistent table. Consistency is resolved by adding bb to EE, resulting in a closed and consistent table. 
This allows the final hypothesis to be made in part (l)(l).

0.A.1.5 Test Set Generation for Classification Accuracy

Here, we describe the process inducing the distribution 𝒟\mathcal{D} over (ΣI)(\Sigma^{I})^{*} from which the test set is generated. To generate the test set, we follow the procedure described for sampling-based equivalence queries: we first sample a random variable LL from a geometric distribution to represent desired sequence length, and then populate each of the LL sequence elements i.i.d. from the uniform distribution over ΣI\Sigma^{I}. The geometric distribution we use is Pr(L=l)=(1p)l1p\text{Pr}(L=l)=(1-p)^{l-1}p, where p=0.2p=0.2. Thus, the average sampled sequence length is 5.
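The length-sampling step can be sketched as below; the function name and representation of alphabet symbols are our own illustration of the stated procedure.

```python
import random

def sample_test_sequence(alphabet, p=0.2, rng=random):
    """Draw one test sequence: length L ~ Geometric(p), i.e.
    Pr(L = l) = (1 - p)**(l - 1) * p, with mean 1/p = 5 for p = 0.2;
    each of the L elements is drawn i.i.d. uniformly from the alphabet."""
    length = 1
    while rng.random() >= p:  # each failure (prob 1 - p) extends the sequence
        length += 1
    return [rng.choice(alphabet) for _ in range(length)]
```

Averaged over many draws, the sequence length concentrates around 1/p=51/p=5, matching the stated mean.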

Next, we augment this sampled sequence set with the set of sequences guaranteed to be composed only from explicitly specified transitions in the incomplete ground truth reward machine. This latter sequence sample set is generated by finding all sequences of Boolean formulae from the initial state to all other states via iterative deepening search, resulting in all sequences with length at most the number of states in the reward machine. Each Boolean formula sequence generates multiple sequences with elements from 2𝒫2^{\mathcal{P}}, by uniformly sampling elements from 2𝒫2^{\mathcal{P}} that make the formula true. Explicitly, for a given path of length dd from the start state to depth dd, where dd ranges from 22 to NN, the number of states in the automaton, we generate d|2𝒫|sd|2^{\mathcal{P}}|s sequences, where ss is a positive integer. Thus, for a given automaton with NN states, we have at most 𝒪(|2𝒫|N)\mathcal{O}(|2^{\mathcal{P}}|^{N}) paths, so we generate a test set of size 𝒪(|2𝒫|N+1ds)\mathcal{O}(|2^{\mathcal{P}}|^{N+1}ds). Each of the elements from the set {25,50,100,200,300,400,500,1000,2000,5000}\{25,50,100,200,300,400,500,1000,2000,5000\} was used as the value of ss.

For tasks 1 through 4 of OfficeWorld and CraftWorld, this corresponded to between 35.4k to 36.2k samples. For tasks 5 through 10 of CraftWorld, this corresponded to 1.4m to 24.6m samples.

We evaluate each ground truth and learned reward machine pair with these sequences. Classification accuracy is the fraction of sequences in the sample set identically classified by the learned and ground truth reward machines.
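The accuracy computation itself is a one-liner; in this sketch (our own framing), each machine is abstracted as a callable mapping an input sequence to its output value.

```python
def classification_accuracy(ground_truth, learned, test_set):
    """Fraction of test sequences classified identically by the learned
    and ground-truth reward machines; each machine is passed in as a
    callable mapping an input sequence to its output value."""
    agree = sum(1 for seq in test_set if ground_truth(seq) == learned(seq))
    return agree / len(test_set)
```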

0.A.1.6 Regret Computation Details

For the regret experiments, we needed to train and evaluate policies under the ground truth and learned reward machines. For each ground truth reward machine, Remap was run 100 times, producing a set of 100 learned reward machines per ground truth. Any differences among the learned reward machines stemmed from how the teacher sampled test sequences and from the order and length of the counterexamples presented.

To compute regret, we computed empirical regret between the optimal policy under the ground truth reward machine and the optimal policy of each of the learned reward machines.

We utilized Q-learning with counterfactual experiences for reward machines (CRM) [41] to obtain optimal policies for each reward machine. We set the discount factor to γ=0.9\gamma=0.9, and for the OfficeWorld reward machines, each policy was trained for a total of 2×1052\times 10^{5} steps, and 2×1062\times 10^{6} steps for CraftWorld reward machines. Each reward machine was trained for at least 10 seeds. Observing the return curves, we concluded that by 1×1051\times 10^{5} steps, the policy was already optimal for OfficeWorld domains, and by 1×1061\times 10^{6} steps for CraftWorld domains. We compute the average reward per step of the optimal policy by summing the total return of the policy over the last 10510^{5} (OfficeWorld) or 10610^{6} (CraftWorld) steps:

Average Reward per Step for Seed kk =1Δsss+Δsrk,t𝑑t\displaystyle=\frac{1}{\Delta s}\int_{s}^{s+\Delta s}r_{k,t}dt
=R^k\displaystyle=\hat{R}_{k}
Empirical Average Reward per Step =1Nk=1NR^k\displaystyle=\frac{1}{N}\sum_{k=1}^{N}\hat{R}_{k}
=𝔼^[R^],\displaystyle=\hat{\mathbb{E}}[\hat{R}],

where rk,tr_{k,t} represents the reward received for seed kk at step tt, and ss and s+Δss+\Delta s represent the interval of steps the average is taken over, and NN represents the total number of seeds. Then, the average regret plotted in the paper was the difference between the empirical average reward per step of the optimal policy induced by the ground truth reward machine, and the empirical average reward per step of the optimal policies from the corresponding learned reward machines. Sample variance was computed for the ground truth reward machines via

Var(R^)=1N1k=1N(R^k𝔼^[R^])2\text{Var}(\hat{R})=\frac{1}{N-1}\sum_{k=1}^{N}(\hat{R}_{k}-\hat{\mathbb{E}}[\hat{R}])^{2}

with standard deviation computed via taking the square root of the sample variance.
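The quantities above translate directly into code. The following sketch (illustrative names, with the integral replaced by its discrete per-step sum) computes the empirical average reward per step, the regret between ground-truth and learned policies, and the unbiased sample variance:

```python
def avg_reward_per_step(rewards):
    # R-hat_k: empirical average reward per step over the final
    # evaluation window of one seed's per-step reward trace
    return sum(rewards) / len(rewards)

def sample_variance(xs):
    # unbiased sample variance with the 1/(N - 1) normalization
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def empirical_regret(ground_truth_runs, learned_runs):
    """Empirical average reward per step of the optimal policy under the
    ground-truth machine minus that under the learned machines; each run
    is a list of per-step rewards from one seed."""
    gt = sum(map(avg_reward_per_step, ground_truth_runs)) / len(ground_truth_runs)
    learned = sum(map(avg_reward_per_step, learned_runs)) / len(learned_runs)
    return gt - learned
```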

0.A.2 Example Comparing Remap and L\textrm{L}^{*}

Figures 2 and 3 of the main paper contained an abridged example comparing Remap and L\textrm{L}^{*}. The full example appears in Figure 7 of the Appendix, which illustrates how Remap and L\textrm{L}^{*} learn the Moore machine shown in part (l).

If the ground truth is to be interpreted as the Moore equivalent of a reward machine, then the ground truth Moore machine has 3 states, with qbq_{b} as the terminal state, qbaq_{ba} as the absorbing HALT state, and qεq_{\varepsilon} as the initial state; the corresponding ground truth reward machine would have only two states—qεq_{\varepsilon} as the initial state, and qbq_{b} as the terminal state, with no transitions out of qbq_{b}; there would be no absorbing qbaq_{ba} state. Additionally, the transition from qεq_{\varepsilon} to qbq_{b} via bb would have a reward of 11 associated with it, and the self-transition qεq_{\varepsilon} to qεq_{\varepsilon} via aa would have a reward of 0 associated with it; the states would no longer have rewards associated with them.

If the input alphabet is interpreted as the powerset of a proposition set 𝒫={p}\mathcal{P}=\{p\}, with 2𝒫=ΣI={{},{p}}2^{\mathcal{P}}=\Sigma^{I}=\{\{\},\{p\}\}, then we can use a={}a=\{\} and b={p}b=\{p\} for convenience, with ΣO={0,1}\Sigma^{O}=\{0,1\}. As shown in the example, Remap uses a symbolic observation table and performs preference queries, and its closedness and consistency tests are evaluated with respect to a unified table. L\textrm{L}^{*} uses a concrete observation table and membership queries.

Refer to caption
Refer to caption
Refer to caption
Figure 8: Empirical Scaling Plots. Left to right: Under an inexact PAC-identification teacher, empirical distributions of how the total number of sequences in the table upon termination of Remap depends on alphabet size, number of states, and maximum counterexample length (color coded using data from CraftWorld tasks C-T5 through C-T10, see Figure 4 legend).

0.A.3 Additional Scaling Measurements

Since we measured query complexity as a function of the number of unique sequences present in the table upon termination, we also measured the number of unique sequences as functions of input alphabet size, number of states, and length of the maximum counterexample received. These results are shown in Figure 8.

0.A.4 Termination Plots

We present a full set of termination phase diagrams for learning the CraftWorld reward machines for tasks 5 through 10 in Figures 9, 10, 11, and 12. Columns correspond to tasks, while rows correspond to the number of samples the teacher makes per equivalence query. Each plot contains 100 paths through the termination phase space, where the xx-axis is the number of known representative values, and the yy-axis is the number of states in the hypothesis. Each node along a path corresponds to an event in Remap. Each path starts at (0,1)(0,1), since there is always an initial state, but no known representatives. Green XX's represent when an equivalence query is made. Blue circles represent the immediate result of a closure operation, while orange squares represent the immediate result of a consistency operation. The red star represents the upper bound on the number of states and the number of known representative values. Observe that between consecutive equivalence queries, at least one of the number of states or the number of known representative values must increase: in particular, if the number of known representative values does not increase as the result of an equivalence query, the number of states must increase. Additionally, it is possible for Remap to terminate early, before the number of states reaches the upper bound, when the satisfying solution to the constraints already has all correct values.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption

  Refer to caption Refer to caption Refer to caption
Refer to caption Refer to caption Refer to caption

Figure 9: Phase diagram termination plots for CraftWorld tasks 5 through 10. Teacher tests 25 (top section) and 50 (bottom section) samples per equivalence query. Within each section: left to right (top row) Task 5, 6, 7; left to right (bottom row) Task 8, 9, 10. The X-axis is the number of known representative values in the hypothesis, the Y-axis is the number of states in the hypothesis. Legend: blue circle is a closure operation, orange square is a consistency operation, green x is an equivalence query, and the red star represents the upper bound termination condition. We observe it is possible to terminate early, prior to reaching the upper bound. Each plot contains 100 individual paths through phase space.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption

  Refer to caption Refer to caption Refer to caption
Refer to caption Refer to caption Refer to caption

Figure 10: Phase diagram termination plots for CraftWorld tasks 5 through 10. Teacher tests 100 (top section) and 200 (bottom section) samples per equivalence query. Within each section: left to right (top row) Task 5, 6, 7; left to right (bottom row) Task 8, 9, 10. The X-axis is the number of known representative values in the hypothesis, the Y-axis is the number of states in the hypothesis. Legend: blue circle is a closure operation, orange square is a consistency operation, green x is an equivalence query, and the red star represents the upper bound termination condition. We observe it is possible to terminate early, prior to reaching the upper bound. Each plot contains 100 individual paths through phase space.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption

  Refer to caption Refer to caption Refer to caption
Refer to caption Refer to caption Refer to caption

Figure 11: Phase diagram termination plots for CraftWorld tasks 5 through 10. Teacher tests 300 (top section) and 400 (bottom section) samples per equivalence query. Within each section: left to right (top row) Task 5, 6, 7; left to right (bottom row) Task 8, 9, 10. The X-axis is the number of known representative values in the hypothesis, the Y-axis is the number of states in the hypothesis. Legend: blue circle is a closure operation, orange square is a consistency operation, green x is an equivalence query, and the red star represents the upper bound termination condition. We observe it is possible to terminate early, prior to reaching the upper bound. Each plot contains 100 individual paths through phase space.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 12: Phase diagram termination plots for CraftWorld tasks 5 through 10. Teacher tests 500 samples per equivalence query. Left to right (top row) Task 5, 6, 7; left to right (bottom row) Task 8, 9, 10. The X-axis is the number of known representative values in the hypothesis, the Y-axis is the number of states in the hypothesis. Legend: blue circle is a closure operation, orange square is a consistency operation, green x is an equivalence query, and the red star represents the upper bound termination condition. We observe it is possible to terminate early, prior to reaching the upper bound. Each plot contains 100 individual paths through phase space.

0.A.5 Remap Algorithms

In this section, we present the algorithms used in Remap. Algorithm 1 presented in the body of the paper is an abbreviated version of Algorithm 3 (Remap). In Algorithm 3, we expand the function MakeClosedAndConsistent into a loop that makes the symbolic observation table unified, closed, and consistent. We also present the Symbolic Fill procedure in Algorithms 4 and 5, which is responsible for (1) creating fresh variables for sequences which have not been queried, (2) performing preference queries, and (3) performing unification. Algorithm 6 illustrates how preference queries are performed as comparisons, and also shows how the constraint set is updated with the returned information. Algorithm 7 is responsible for constructing a hypothesis from a unified, closed, and consistent observation table; it includes the FindSatisfyingSolution procedure (abbreviated as FindSolution in the main paper), which encodes all the collected constraints, known representative values, and global constraints, and sends them to the solver. Algorithm 8 illustrates the generic equivalence query used by the teacher for probably approximately correct (PAC) identification. Note that in practice, for both the equivalence query and for obtaining classification accuracy, we collect all sampled sequences into a set first, then perform evaluation over the set. For the equivalence query, the first counterexample encountered is returned (and the remainder is untested). For classification accuracy, we evaluate all the sampled sequences. Finally, in Algorithm 9, we present the unification algorithm used in Remap. 
Overall, the unification algorithm has three sections: (1) creating and merging equivalence classes and electing representatives, which converts all collected equality relations into equivalence classes (sets of equivalent variables); (2) unifying the collection of inequalities, performed by substituting each variable with its elected representative; and (3) replacing each variable in the symbolic observation table with the elected representative of that variable's equivalence class.
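The equivalence-class bookkeeping in steps (1) and (2) is essentially a union-find computation. The following minimal Python sketch (our illustration, not a transcription of Algorithm 9) merges equality constraints, elects the root of each class as its representative, and rewrites the inequality constraints over representatives:

```python
def unify(equalities, inequalities):
    """Union-find sketch of the unification step.

    equalities: iterable of (var, var) pairs collected as equality
    constraints. inequalities: iterable of (var, op, var) triples.
    Returns the representative-lookup function and the inequalities
    rewritten in terms of elected representatives.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for a, b in equalities:
        union(a, b)
    rewritten = {(find(a), op, find(b)) for a, op, b in inequalities}
    return find, rewritten
```

Replacing every table entry T(s·e) by `find(T(s·e))` then corresponds to step (3).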

Algorithm 3 Remap

Input: Input alphabet ΣI\Sigma^{I}, output alphabet ΣO\Sigma^{O}, and a teacher 𝒯\mathcal{T}
Output: Moore Machine =Q^,ΣI,ΣO,q^0,δ^,L^\mathcal{H}=\langle\hat{Q},\Sigma^{I},\Sigma^{O},\hat{q}_{0},\hat{\delta},\hat{L}\rangle

1:  Initialize observation table (S,E,T)(\langle S,E,T\rangle), constraint set 𝒞\mathcal{C}, and context Γ\Gamma as S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle with S={ε},E={ε}S=\{\varepsilon\},E=\{\varepsilon\}, 𝒞={}\mathcal{C}=\{\}, Γ=\Gamma=\emptyset
2:  S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle\longleftarrow SymbolicFill(S,E,T;𝒞,Γ|𝒯)\left(\langle S,E,T;\mathcal{C},\Gamma\rangle|\mathcal{T}\right)
3:  repeat
4:     while S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle is not closed and consistent do
5:        if S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle not consistent then
6:           Find s1,s2Ss_{1},s_{2}\in S, σΣI\sigma\in\Sigma^{I}, and eEe\in Es.t. row(s1)row(s2)\textbf{{row}}(s_{1})\equiv\textbf{{row}}(s_{2}) and row(s1σ)row(s2σ)\textbf{{row}}(s_{1}\cdot\sigma)\not\equiv\textbf{{row}}(s_{2}\cdot\sigma) and T(s1σe)T(s2σe)T(s_{1}\cdot\sigma\cdot e)\not\equiv T(s_{2}\cdot\sigma\cdot e)
7:           Add σe\sigma\cdot e to EE via E:=E{σe}E:=E\cup\{\sigma\cdot e\}
8:        end if
9:        if S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle not closed then
10:           Find sSs\in S and σΣI\sigma\in\Sigma^{I} s.t. row(sσ)rows(SΣI)\textbf{{row}}(s\cdot\sigma)\in\textbf{{rows}}(S\cdot\Sigma^{I}) and row(sσ)rows(S)\textbf{{row}}(s\cdot\sigma)\notin\textbf{{rows}}(S)
11:           Add sσs\cdot\sigma to SS via S:=S{sσ}S:=S\cup\{s\cdot\sigma\}
12:        end if
13:        S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle\longleftarrow SymbolicFill(S,E,T;𝒞,Γ|𝒯)\left(\langle S,E,T;\mathcal{C},\Gamma\rangle|\mathcal{T}\right)
14:     end while
15:     \mathcal{H}\longleftarrowMakeHypothesis(S,E,T;𝒞,Γ,ΣI,ΣO)\left(\langle S,E,T;\mathcal{C},\Gamma\rangle,\Sigma^{I},\Sigma^{O}\right)
16:     result \longleftarrow EquivalenceQuery(|𝒯)(\mathcal{H}|\mathcal{T})
17:     if result \neq correct then
18:        (s,r)(s^{\prime},r)\longleftarrow result
19:        for all tprefixes(s)t\in\text{{{prefixes}}}(s^{\prime}) do
20:           Add tt to SS via S:=S{t}S:=S\cup\{t\}
21:        end for
22:        S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle\longleftarrow SymbolicFill(S,E,T;𝒞,Γ|𝒯)\left(\langle S,E,T;\mathcal{C},\Gamma\rangle|\mathcal{T}\right)
23:        Update 𝒞\mathcal{C} to include constraint via 𝒞:=𝒞{GetVar(s,S,E,T;𝒞,Γ)=r}\mathcal{C}:=\mathcal{C}\cup\{\text{{GetVar}$(s^{\prime},\langle S,E,T;\mathcal{C},\Gamma\rangle)$}=r\}
24:     end if
25:  until result == correct
26:  return  \mathcal{H}
Algorithm 4 A Query Efficient Symbolic Fill Procedure

procedure SymbolicFill(S,E,T;𝒞,Γ|𝒯)\left(\langle S,E,T;\mathcal{C},\Gamma\rangle|\mathcal{T}\right)

1:  seqs ={}=\{\}; Let 𝒪=S,E,T;𝒞,Γ\mathcal{O}=\langle S,E,T;\mathcal{C},\Gamma\rangle; let oldsortedseqs be a sorted list of sequences.
2:  for all se(S(SΣI))Es\cdot e\in(S\cup(S\cdot\Sigma^{I}))\cdot E do
3:     if seΓs\cdot e\notin\Gamma then
4:        Γ[se],seqs\Gamma\left[s\cdot e\right],seqs\longleftarrow FreshVar(), seqs{se}\text{{seqs}}\cup\{s\cdot e\}
5:        T(se)T(s\cdot e)\longleftarrow Γ[se]\Gamma\left[s\cdot e\right]
6:     end if
7:  end for
8:  sortedseqs, 𝒪\mathcal{O} \longleftarrow PrefQsViaRandQuicksort(seqs, 𝒪|𝒯\mathcal{O}|\mathcal{T})
9:  oldsortedseqs, 𝒪\mathcal{O}\longleftarrow PrefQsViaLinearMerge(sortedseqs, oldsortedseqs, 𝒪|𝒯\mathcal{O}|\mathcal{T})
10:  𝒪\mathcal{O}\longleftarrow Unification(𝒪)(\mathcal{O})
11:  return  𝒪\mathcal{O}
Algorithm 5 A Query Inefficient Symbolic Fill Procedure

procedure SymbolicFill(S,E,T;𝒞,Γ|𝒯)\left(\langle S,E,T;\mathcal{C},\Gamma\rangle|\mathcal{T}\right)

1:  newentries, oldentries ={},{}=\{\},\{\}
2:  for all se(S(SΣI))Es\cdot e\in(S\cup(S\cdot\Sigma^{I}))\cdot E do
3:     if seΓs\cdot e\notin\Gamma then
4:        Γ[se]\Gamma\left[s\cdot e\right]\longleftarrow FreshVar()
5:        T(se)Γ[se]T(s\cdot e)\longleftarrow\Gamma\left[s\cdot e\right]
6:        newentriesnewentries{se}\text{{newentries}}\longleftarrow\text{{newentries}}\cup\{s\cdot e\}
7:     else
8:        oldentriesoldentries{se}\text{{oldentries}}\longleftarrow\text{{oldentries}}\cup\{s\cdot e\}
9:     end if
10:  end for
11:  for all (s1e1),(s2e2)(s_{1}\cdot e_{1}),(s_{2}\cdot e_{2})\in PairCombinations(newentries, newentries \cup oldentriesdo
12:     pp\longleftarrow PrefQuery(s1e1,s2e2|𝒯)(s_{1}\cdot e_{1},s_{2}\cdot e_{2}|\mathcal{T})
13:     𝒞\mathcal{C}\longleftarrow UpdateConstraintSet(p,s1e1,s2e2;S,E,T;𝒞,Γ)(p,s_{1}\cdot e_{1},s_{2}\cdot e_{2};\langle S,E,T;\mathcal{C},\Gamma\rangle)
14:  end for
15:  S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle\longleftarrow Unification(S,E,T;𝒞,Γ)(\langle S,E,T;\mathcal{C},\Gamma\rangle)
16:  return  S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle
Algorithm 6 Preference Query Procedure and Constraints Update Procedure

The teacher answers a preference query by executing each sequence with the ground-truth transition function δ\delta and comparing the resulting output values under LL.

procedure PrefQuery(s1,s2|𝒯)(s_{1},s_{2}|\mathcal{T})

1:  if L(δ(q0,s1))=L(δ(q0,s2))L(\delta(q_{0},s_{1}))=L(\delta(q_{0},s_{2})) then
2:     return  0
3:  else if L(δ(q0,s1))<L(δ(q0,s2))L(\delta(q_{0},s_{1}))<L(\delta(q_{0},s_{2})) then
4:     return  1-1
5:  else
6:     return  +1+1
7:  end if

procedure UpdateConstraintSet(p,s1e1,s2e2;S,E,T;𝒞,Γ)(p,s_{1}\cdot e_{1},s_{2}\cdot e_{2};\langle S,E,T;\mathcal{C},\Gamma\rangle)

1:  if p=0p=0 then
2:     𝒞:=𝒞{T(s1e1)=T(s2e2)}\mathcal{C}:=\mathcal{C}\cup\left\{T(s_{1}\cdot e_{1})=T(s_{2}\cdot e_{2})\right\}
3:  else if p=1p=-1 then
4:     𝒞:=𝒞{T(s1e1)<T(s2e2)}\mathcal{C}:=\mathcal{C}\cup\left\{T(s_{1}\cdot e_{1})<T(s_{2}\cdot e_{2})\right\}
5:  else
6:     𝒞:=𝒞{T(s1e1)>T(s2e2)}\mathcal{C}:=\mathcal{C}\cup\left\{T(s_{1}\cdot e_{1})>T(s_{2}\cdot e_{2})\right\}
7:  end if
8:  return  𝒞\mathcal{C}
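A minimal sketch of how preference answers become symbolic constraints, assuming a toy encoding where constraints are ('eq' | 'lt' | 'gt', var, var) triples and the ground-truth outputs are given as a dictionary; both choices are ours, not the paper's representation.

```python
# Constraint encoding: ('eq' | 'lt' | 'gt', var1, var2) triples (our choice).
def pref_query(s1, s2, outputs):
    """Teacher: three-way comparison of the ground-truth outputs."""
    r1, r2 = outputs[s1], outputs[s2]
    return 0 if r1 == r2 else (-1 if r1 < r2 else +1)

def update_constraint_set(p, v1, v2, constraints):
    """Record T(s1.e1) {=, <, >} T(s2.e2) according to the answer p."""
    op = {0: "eq", -1: "lt", +1: "gt"}[p]
    constraints.add((op, v1, v2))
    return constraints

# Toy ground truth and table-entry variables (illustrative values).
outputs = {"": 0, "a": 1, "b": 0}
T = {"": "v0", "a": "v1", "b": "v2"}

C = set()
for s1, s2 in [("", "a"), ("", "b"), ("a", "b")]:
    C = update_constraint_set(pref_query(s1, s2, outputs), T[s1], T[s2], C)
```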
Algorithm 7 Make Hypothesis from Observation Table Procedure

procedure MakeHypothesis(S,E,T;𝒞,Γ,ΣI,ΣO)\left(\langle S,E,T;\mathcal{C},\Gamma\rangle,\Sigma^{I},\Sigma^{O}\right)

1:  Q^={row(s)|sS}\hat{Q}=\{\textbf{{row}}(s)|\forall s\in S\} is the set of states
2:  q^0=row(ε)\hat{q}_{0}=\textbf{{row}}(\varepsilon) is the initial state
3:  δ^(row(s),σ)=row(sσ)\hat{\delta}(\textbf{{row}}(s),\sigma)=\textbf{{row}}(s\cdot\sigma) for all sSs\in S and σΣI\sigma\in\Sigma^{I}
4:  Λ\Lambda\longleftarrowFindSatisfyingSolution(S,E,T;𝒞,Γ,ΣO)(\langle S,E,T;\mathcal{C},\Gamma\rangle,\Sigma^{O})
5:  L^(row(s))=Λ[T(sε)]\hat{L}(\textbf{{row}}(s))=\Lambda[T(s\cdot\varepsilon)] is the sequence to output function
6:  return  Q^,ΣI,ΣO,q^0,δ^,L^\langle\hat{Q},\Sigma^{I},\Sigma^{O},\hat{q}_{0},\hat{\delta},\hat{L}\rangle

procedure FindSatisfyingSolution(S,E,T;𝒞,Γ,ΣO)(\langle S,E,T;\mathcal{C},\Gamma\rangle,\Sigma^{O})

1:  𝒞val:=\mathcal{C}_{val}:= Select(𝒞,VarValue)(\mathcal{C},Var\rightarrow Value)
2:  D={rΣOT(se)=r|se(S(SΣI))E such that 𝒞val[T(se)]}D=\left\{\left.\displaystyle\bigvee_{r\in\Sigma^{O}}T(s\cdot e)=r\right|\forall s\cdot e\in(S\cup(S\cdot\Sigma^{I}))\cdot E\text{ such that }\mathcal{C}_{val}[T(s\cdot e)]\equiv\bot\right\}
3:  Submit to an SMT solver the constraint set 𝒞D\mathcal{C}\cup D
4:  return  The model Λ\Lambda which maps variables to values.
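FindSatisfyingSolution can be prototyped without an SMT solver when the output alphabet is small: since the disjunctions DD force every variable to take a value in the output alphabet, brute-force enumeration of assignments suffices. A sketch under that assumption; the variable names and tuple constraint encoding are illustrative.

```python
from itertools import product

# Brute-force stand-in for the SMT call: every variable ranges over the
# finite output alphabet (the disjunctions D), so we can enumerate models.
def find_satisfying_solution(variables, sigma_out, constraints):
    ops = {"eq": lambda a, b: a == b,
           "lt": lambda a, b: a < b,
           "gt": lambda a, b: a > b}
    for values in product(sigma_out, repeat=len(variables)):
        model = dict(zip(variables, values))  # candidate solution
        if all(ops[op](model[v1], model[v2]) for op, v1, v2 in constraints):
            return model  # a model mapping variables to values
    return None  # constraints unsatisfiable over sigma_out

C = [("lt", "v0", "v1"), ("eq", "v0", "v2")]
model = find_satisfying_solution(["v0", "v1", "v2"], [0, 1, 2], C)
```

Enumeration is exponential in the number of variables, which is why the paper delegates this step to an SMT solver.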
Algorithm 8 Sampling-based Equivalence Query

Equivalence is checked through sampling by the teacher. We assume a probability distribution 𝒟\mathcal{D} over all possible sequences (ΣI)\left(\Sigma^{I}\right)^{*}, and that the teacher can sample random sequences s𝒟s\sim\mathcal{D}.

procedure EquivalenceQuery(Q^,ΣI,ΣO,q^0,δ^,L^|𝒯)\left(\langle\hat{Q},\Sigma^{I},\Sigma^{O},\hat{q}_{0},\hat{\delta},\hat{L}\rangle|\mathcal{T}\right)

1:  repeat
2:     Sample a random sequence s𝒟s\sim\mathcal{D}.
3:     if L(δ(q0,s))L^(δ^(q0^,s))L\left(\delta\left(q_{0},s\right)\right)\neq\hat{L}\left(\hat{\delta}\left(\hat{q_{0}},s\right)\right) then
4:        return  (s,L(δ(q0,s)))\left(s,L\left(\delta\left(q_{0},s\right)\right)\right)
5:     end if
6:  until up to rr times
7:  return  correct
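A runnable sketch of the sampling-based equivalence query, assuming Moore machines encoded as (initial state, transition table, output table) triples and taking the distribution to be uniform over short sequences; both the encoding and the distribution are our assumptions.

```python
import random

# Moore machine encoding (our assumption): (q0, delta, L) where
# delta[state][symbol] -> state and L[state] -> output.
def run(q0, delta, L, seq):
    q = q0
    for sym in seq:
        q = delta[q][sym]
    return L[q]

def equivalence_query(truth, hyp, alphabet, r=200, max_len=4, seed=0):
    """Sample up to r sequences from a uniform distribution over short
    sequences; return a counterexample with its true label, or 'correct'."""
    rng = random.Random(seed)
    for _ in range(r):
        seq = [rng.choice(alphabet) for _ in range(rng.randint(0, max_len))]
        if run(*truth, seq) != run(*hyp, seq):
            return seq, run(*truth, seq)
    return "correct"

# Ground truth tracks the parity of 'a's; the hypothesis always outputs 0.
truth = ("even",
         {"even": {"a": "odd", "b": "even"}, "odd": {"a": "even", "b": "odd"}},
         {"even": 0, "odd": 1})
hyp = ("q", {"q": {"a": "q", "b": "q"}}, {"q": 0})
result = equivalence_query(truth, hyp, ["a", "b"])
```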
Algorithm 9 Unification Procedure

Unification is performed by computing equivalence classes of variables induced by the equality constraints in 𝒞\mathcal{C}.

procedure Unification(S,E,T;𝒞,Γ)\left(\langle S,E,T;\mathcal{C},\Gamma\rangle\right)

1:  𝒞ec:=\mathcal{C}_{ec}:= Select(𝒞,VarEquivalenceClass)(\mathcal{C},Var\rightarrow EquivalenceClass)
2:  𝒞val:=\mathcal{C}_{val}:= Select(𝒞,VarValue)(\mathcal{C},Var\rightarrow Value)
3:  𝒞eq:=\mathcal{C}_{eq}:= Select(𝒞,Var=Var)(\mathcal{C},Var=Var)
4:  𝒞in:=\mathcal{C}_{in}:= Select(𝒞,Var<Var)Select(𝒞,Var>Var)(\mathcal{C},Var<Var)\cup\text{{Select}}(\mathcal{C},Var>Var)
5:  while |𝒞eq|>0|\mathcal{C}_{eq}|>0 do
6:     match Pop(𝒞eq)(\mathcal{C}_{eq}) with L=RL=R
7:     if L𝒞ecR𝒞ecL\in\mathcal{C}_{ec}\land R\in\mathcal{C}_{ec} then
8:        if 𝒞ec[L]𝒞ec[R]\mathcal{C}_{ec}\left[L\right]\not\equiv\mathcal{C}_{ec}[R] then
9:           LrepL_{rep}\longleftarrow GetRepresentative(𝒞ec[L])(\mathcal{C}_{ec}\left[L\right])
10:           RrepR_{rep}\longleftarrow GetRepresentative(𝒞ec[R])(\mathcal{C}_{ec}\left[R\right])
11:           Update 𝒞ec[L]:=𝒞ec[L]𝒞ec[R]\mathcal{C}_{ec}\left[L\right]:=\mathcal{C}_{ec}\left[L\right]\cup\mathcal{C}_{ec}\left[R\right]
12:           for all v𝒞ec[L]v\in\mathcal{C}_{ec}\left[L\right] do
13:              Set 𝒞ec[v]:=𝒞ec[L]\mathcal{C}_{ec}\left[v\right]:=\mathcal{C}_{ec}\left[L\right]
14:           end for
15:           if LrepRrepRrep𝒞valL_{rep}\not\equiv R_{rep}\land R_{rep}\in\mathcal{C}_{val} then
16:              Remove redundant representative via Del(𝒞val[Rrep])(\mathcal{C}_{val}\left[R_{rep}\right])
17:           end if
18:        end if
19:     else if L𝒞ecR𝒞ecL\in\mathcal{C}_{ec}\land R\not\in\mathcal{C}_{ec} then
20:        Update 𝒞ec[L]:=𝒞ec[L]{R}\mathcal{C}_{ec}\left[L\right]:=\mathcal{C}_{ec}\left[L\right]\cup\left\{R\right\} and set 𝒞ec[R]:=𝒞ec[L]\mathcal{C}_{ec}\left[R\right]:=\mathcal{C}_{ec}\left[L\right]
21:     else if L𝒞ecR𝒞ecL\not\in\mathcal{C}_{ec}\land R\in\mathcal{C}_{ec} then
22:        Update 𝒞ec[R]:=𝒞ec[R]{L}\mathcal{C}_{ec}\left[R\right]:=\mathcal{C}_{ec}\left[R\right]\cup\left\{L\right\} and set 𝒞ec[L]:=𝒞ec[R]\mathcal{C}_{ec}\left[L\right]:=\mathcal{C}_{ec}\left[R\right]
23:     else
24:        Set 𝒞ec[L]\mathcal{C}_{ec}\left[L\right]\longleftarrow EquivalenceClass({L,R})(\{L,R\}) and 𝒞ec[R]:=𝒞ec[L]\mathcal{C}_{ec}\left[R\right]:=\mathcal{C}_{ec}\left[L\right]
25:        Set 𝒞val[GetRepresentative(𝒞ec[L])]\mathcal{C}_{val}\left[\text{{GetRepresentative}}\left(\mathcal{C}_{ec}\left[L\right]\right)\right]\longleftarrow\bot
26:     end if
27:  end while
28:  subineqs {}\longleftarrow\{\}
29:  while |𝒞in|>0|\mathcal{C}_{in}|>0 do
30:     constraint \longleftarrow Pop(𝒞in)(\mathcal{C}_{in})
31:     for all vv\in GetVars(constraint)\left(\text{{constraint}}\right) do
32:        if v𝒞ecv\not\in\mathcal{C}_{ec} then
33:           Set 𝒞ec[v]\mathcal{C}_{ec}\left[v\right]\longleftarrow EquivalenceClass({v})(\{v\})
34:           Set 𝒞val[GetRepresentative(𝒞ec[v])]\mathcal{C}_{val}\left[\text{{GetRepresentative}}\left(\mathcal{C}_{ec}\left[v\right]\right)\right]\longleftarrow\bot
35:        end if
36:        Substitution via constraint :=:= constraint[GetRepresentative(𝒞ec[v])/v][\text{{GetRepresentative}}\left(\mathcal{C}_{ec}\left[v\right]\right)/v]
37:     end for
38:     subineqs :=:= subineqs \cup {constraint}\{\text{{constraint}}\}
39:  end while
40:  𝒞in:=\mathcal{C}_{in}:= subineqs
41:  for all se(S(SΣI))Es\cdot e\in(S\cup(S\cdot\Sigma^{I}))\cdot E do
42:     Update T[se]T[s\cdot e] via T[se]:=GetRepresentative(𝒞ec[T[se]])T[s\cdot e]:=\text{{GetRepresentative}}\left(\mathcal{C}_{ec}\left[T[s\cdot e]\right]\right)
43:  end for
44:  for all seΓs\cdot e\in\Gamma do
45:     Update Γ[se]\Gamma[s\cdot e] via Γ[se]:=GetRepresentative(𝒞ec[Γ[se]])\Gamma[s\cdot e]:=\text{{GetRepresentative}}\left(\mathcal{C}_{ec}\left[\Gamma[s\cdot e]\right]\right)
46:  end for
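The equivalence-class bookkeeping above is essentially union-find. A compact sketch, with illustrative variable names, that merges classes from equality constraints and then rewrites inequalities over class representatives:

```python
# Union-find over symbolic variables: equalities merge classes, then
# inequalities are rewritten over class representatives (names illustrative).
parent = {}

def find(v):
    parent.setdefault(v, v)
    while parent[v] != v:
        parent[v] = parent[parent[v]]  # path halving
        v = parent[v]
    return v

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra  # ra stays as the class representative

equalities = [("v1", "v2"), ("v2", "v3"), ("v4", "v5")]
inequalities = [("lt", "v3", "v5")]

for a, b in equalities:
    union(a, b)

# Substitute representatives into the inequality constraints.
sub_ineqs = {(op, find(a), find(b)) for op, a, b in inequalities}
```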

0.A.6 Technical Proofs

We first establish the correctness of constructing a hypothesis Moore machine from a symbolic observation table via Algorithm 7, in Theorem 1, via Propositions 1 through 5. The proof strategies for Propositions 3 through 5 generally follow those of Angluin [5], with appropriate adjustments to account for a set of possible hypotheses rather than a single hypothesis.

Definition 11

Let hk=Qk,q0,k,ΣI,ΣO,δk,Lkh_{k}=\langle Q_{k},q_{0,k},\Sigma^{I},\Sigma^{O},\delta_{k},L_{k}\rangle be a Moore machine for some integer kk. Two Moore machines h1h_{1} and h2h_{2} are structurally identical if a bijection B:Q1Q2B:Q_{1}\rightarrow Q_{2} exists, and the transition functions δ1\delta_{1} and δ2\delta_{2} are consistent with BB. That is, B(δ1(q0,1,s))=δ2(q0,2,s)B(\delta_{1}(q_{0,1},s))=\delta_{2}(q_{0,2},s) for all s(ΣI)s\in(\Sigma^{I})^{*} where δ1\delta_{1} and δ2\delta_{2} are the extended transition functions.
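Structural identity can also be decided algorithmically: a parallel traversal from the two initial states either constructs the bijection BB or uncovers a mismatch. A sketch, assuming machines are given as a start state plus a nested transition dictionary (our encoding):

```python
from collections import deque

# Decide structural identity by building the bijection B over the states
# reachable from the initial states with a parallel BFS (our encoding:
# a start state plus a nested transition dictionary).
def structurally_identical(q0_1, delta1, q0_2, delta2, alphabet):
    B = {q0_1: q0_2}
    queue = deque([q0_1])
    while queue:
        q = queue.popleft()
        for sym in alphabet:
            n1, n2 = delta1[q][sym], delta2[B[q]][sym]
            if n1 in B:
                if B[n1] != n2:  # transitions disagree with B
                    return False
            else:
                B[n1] = n2
                queue.append(n1)
    return len(set(B.values())) == len(B)  # B must be injective

d1 = {"p": {"a": "q"}, "q": {"a": "p"}}
d2 = {"x": {"a": "y"}, "y": {"a": "x"}}
same = structurally_identical("p", d1, "x", d2, ["a"])
```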

Proposition 1

Let \mathcal{H} be the set of all hypotheses that can be returned from
MakeHypothesis(S,E,T;𝒞,Γ)(\langle S,E,T;\mathcal{C},\Gamma\rangle). All hypotheses in \mathcal{H} are structurally identical to each other, and differ only by their labeling function.

Proof

We first show that all hypotheses in \mathcal{H} must be structurally identical by construction, differing only in their L^\hat{L} functions. Identical structure follows because lines 1-3 of Algorithm 7 produce the same result for every hypothesis in \mathcal{H} given the input S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle. Line 1 constructs the set of states; this construction is identical for all hypotheses in \mathcal{H}, so all hypotheses share the same state set, and the identity function serves as a bijection between the state sets of any pair of hypotheses. Furthermore, the transition function construction specified on Line 3 yields identical transition functions for all hypotheses within \mathcal{H}. Since the transition functions are identical, and since the bijection between states is the identity function, we have B(δk(q0,k,s))=δk(q0,k,s)B(\delta_{k}(q_{0,k},s))=\delta_{k}(q_{0,k},s), and furthermore δk(q0,k,s)=δk(q0,k,s)\delta_{k}(q_{0,k},s)=\delta_{k^{\prime}}(q_{0,k^{\prime}},s) by the Line 3 construction, since δk\delta_{k} and δk\delta_{k^{\prime}} are identical. This shows that all hypotheses in \mathcal{H} are structurally identical.

Next, we show that each hypothesis in \mathcal{H} has a unique labeling function. Note that in constructing the hypotheses, the only differences in the output hypothesis are due to lines 4 and 5, which together select a satisfying solution for the free variables in S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle, subject to the constraints 𝒞\mathcal{C}. Therefore, each unique hypothesis in \mathcal{H} corresponds to a unique satisfying solution; the mapping between unique satisfying solutions and unique hypotheses is bijective.∎

Proposition 2

States in a Moore machine are represented by sets of sequences. Specifically, given an initial state, a bijective mapping can always be constructed from sets of sequences to states.

Proof

We proceed by showing that a bijection always exists, simply by proposing a valid bijection for every case in an inductive argument. Suppose we have a finite alphabet Σ\Sigma, and a Moore machine with set of states QQ, initial state q0Qq_{0}\in Q, and transition function δ:Q×ΣQ\delta:Q\times\Sigma\rightarrow Q. We define the initial state q0q_{0} of the Moore machine to correspond with the empty sequence ε\varepsilon, which has length 0, so in fact we have q0=δ(q0,ε)q_{0}=\delta(q_{0},\varepsilon), and for any qQq\in Q, q=δ(q,ε)q=\delta(q,\varepsilon). The transition function can be extended recursively to δ:Q×(Σ)Q\delta:Q\times\left(\Sigma^{*}\right)\rightarrow Q by observing that δ(q0,σω)=δ(δ(q0,σ),ω)\delta(q_{0},\sigma\cdot\omega)=\delta(\delta(q_{0},\sigma),\omega), where σ\sigma is a sequence of length 1, and ω\omega is a sequence of length at least 1. Similarly, δ(q0,ωσ)=δ(δ(q0,ω),σ)\delta(q_{0},\omega\cdot\sigma)=\delta(\delta(q_{0},\omega),\sigma). Based on this extended transition function, we can now consider the set Q={δ(q0,σ)|σΣ}Q^{\prime}=\{\delta(q_{0},\sigma)|\sigma\in\Sigma\} to be a subset of QQ.

First, we consider the case where |Q|=|Σ||Q^{\prime}|=|\Sigma|, where all transitions from q0q_{0} have led to unique states. We can therefore construct a mapping: M:2ΣQM:2^{\Sigma}\rightarrow Q^{\prime}. We know that |2Σ|=2|Σ|>|Q||2^{\Sigma}|=2^{|\Sigma|}>|Q^{\prime}| if |Q|=|Σ|1|Q^{\prime}|=|\Sigma|\geq 1, so we can consider a subset K2ΣK\subset 2^{\Sigma} such that MM is bijective. By constructing KK, we can show that the elements of KK uniquely correspond to the elements of QQ^{\prime} because M:KQM:K\rightarrow Q^{\prime} will be bijective. We note that if we construct K={{σ}|σΣ}K=\{\{\sigma\}|\sigma\in\Sigma\}, then clearly |K|=|Σ|=|Q||K|=|\Sigma|=|Q^{\prime}|, satisfying the current case. Furthermore, the mapping M({σ})=δ(q0,σ)σΣM(\{\sigma\})=\delta(q_{0},\sigma)\forall\sigma\in\Sigma is clearly a bijective mapping between KK and QQ^{\prime}.

Next, we consider the case where |Q|<|Σ||Q^{\prime}|<|\Sigma|, which implies that there exists at least one pair σi,σjΣ\sigma_{i},\sigma_{j}\in\Sigma which lead to the same state (via the pigeonhole principle). That is, δ(q0,σi)=δ(q0,σj)\delta(q_{0},\sigma_{i})=\delta(q_{0},\sigma_{j}). If this is the case, then in order to make MM bijective, let K={k|k=i,j{σi}{σj}(σi,σj)Σ×Σ such that δ(q0,σi)=δ(q0,σj)}K=\{k|k=\bigcup_{i,j}\{\sigma_{i}\}\cup\{\sigma_{j}\}\forall(\sigma_{i},\sigma_{j})\in\Sigma\times\Sigma\text{ such that }\delta(q_{0},\sigma_{i})=\delta(q_{0},\sigma_{j})\}. The following mapping for MM is bijective: M(k)=δ(q0,σ)kK and σkM(k)=\delta(q_{0},\sigma)\forall k\in K\text{ and }\forall\sigma\in k, and note that K2ΣK\subseteq 2^{\Sigma}. Note each element of KK is a set of sequences of length 1.

Now, consider Qn={δ(q0,ω)|ω(Σ)n}Q^{\prime}_{n}=\{\delta(q_{0},\omega)|\omega\in(\Sigma)^{n}\} where (Σ)n(\Sigma)^{n} denotes sequences of length at most nn. Assume we can construct a bijective mapping M:2(Σ)nQnM:2^{(\Sigma)^{n}}\longrightarrow Q^{\prime}_{n} for all 1nN1\leq n\leq N for some fixed NN. We have already shown this for N=1N=1 above. We now proceed to show that we can also construct a bijective mapping M:2(Σ)N+1QN+1M:2^{(\Sigma)^{N+1}}\longrightarrow Q^{\prime}_{N+1}. Similar to the N=1N=1 case, consider the case for when all sequences ω(Σ)N+1\omega\in(\Sigma)^{N+1} lead to unique states—that is, when |QN+1|=|Σ|N+1|Q^{\prime}_{N+1}|=|\Sigma|^{N+1}. Then the bijective mapping in this case is M:KQN+1M:K\longrightarrow Q^{\prime}_{N+1} where K={{ω}|ω(Σ)N+1}K=\{\{\omega\}|\omega\in(\Sigma)^{N+1}\}, and where M({ω})=δ(q0,ω)M(\{\omega\})=\delta(q_{0},\omega) for all ω(Σ)N+1\omega\in(\Sigma)^{N+1}, and K2(Σ)N+1K\subset 2^{(\Sigma)^{N+1}}.

Now, if |QN+1|<|Σ|N+1|Q^{\prime}_{N+1}|<|\Sigma|^{N+1}, then by the pigeonhole principle, multiple ω\omega must lead to the same state. Hence, let K={k|k=i,j{ωi}{ωj}(ωi,ωj)(Σ)N+1×(Σ)N+1 such that δ(q0,ωi)=δ(q0,ωj)}K=\{k|k=\bigcup_{i,j}\{\omega_{i}\}\cup\{\omega_{j}\}\forall(\omega_{i},\omega_{j})\in(\Sigma)^{N+1}\times(\Sigma)^{N+1}\text{ such that }\delta(q_{0},\omega_{i})=\delta(q_{0},\omega_{j})\}. Then the bijective function is M:KQN+1M:K\longrightarrow Q^{\prime}_{N+1}, where M(k)=δ(q0,ω)M(k)=\delta(q_{0},\omega) for all kKk\in K and for all ωk\omega\in k, and note that K2(Σ)N+1K\subseteq 2^{(\Sigma)^{N+1}}.

Since we can construct bijective mappings from M:2(Σ)nQnM:2^{(\Sigma)^{n}}\longrightarrow Q^{\prime}_{n} for all 1nN1\leq n\leq N, and we have also shown this is the case for n=N+1n=N+1, this must now also hold for all n=N+dn=N+d, for integer d0d\geq 0, and thus this holds as nn tends towards infinity. Thus, states in a Moore machine are represented by sets of sequences, where each state corresponds to a set in KK. Each state is therefore synonymous with an equivalence class of sequences—a set of sequences that are equivalent according to the extended transition function.∎
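The construction in this proof can be made concrete: grouping all sequences up to a given length by the state they reach yields exactly the sets in KK. A small sketch on a toy two-state parity machine (our example):

```python
from itertools import product

# Partition all sequences of length at most n by the state they reach;
# each block of the partition is one element of the set K in the proof.
def sequence_classes(q0, delta, alphabet, n):
    classes = {}
    for length in range(n + 1):
        for seq in product(alphabet, repeat=length):
            q = q0
            for sym in seq:
                q = delta[q][sym]
            classes.setdefault(q, set()).add(seq)
    return classes

# Toy two-state machine tracking the parity of 'a's (illustrative).
delta = {"even": {"a": "odd"}, "odd": {"a": "even"}}
K = sequence_classes("even", delta, ["a"], 4)
```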

Proposition 3

The rows of a unified, closed, and consistent S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle represent the states in a Moore machine consistent with the constraints 𝒞\mathcal{C}. The hypothesis MakeHypothesis(S,E,T;𝒞,Γ)(\langle S,E,T;\mathcal{C},\Gamma\rangle) satisfies δ(q0,s)=row(s)\delta(q_{0},s)=\textbf{row}(s) for all s(S(SΣ))s\in(S\cup(S\cdot\Sigma)).

Proof

We utilize the inductive proof of Lemma 2 from Angluin [5]. For the case of s=εs=\varepsilon, with length 0, we have δ(q0,ε)=q0=row(ε)\delta(q_{0},\varepsilon)=q_{0}=\textbf{row}(\varepsilon), which is true by definition. Now, let us assume that δ(q0,s)=row(s)\delta(q_{0},s)=\textbf{row}(s) holds for all s(S(SΣ))s\in(S\cup(S\cdot\Sigma)) with lengths no greater than kk. Let s(S(SΣ))s^{\prime}\in(S\cup(S\cdot\Sigma)) be a sequence of length k+1k+1 such that s=sσs^{\prime}=s\cdot\sigma. If sSs^{\prime}\in S, then sSs\in S because SS is prefix closed. If sSΣs^{\prime}\in S\cdot\Sigma, then sSs\in S. We can then show that δ(q0,s)=δ(q0,sσ)=δ(δ(q0,s),σ)=δ(row(s),σ)=row(sσ)=row(s)\delta(q_{0},s^{\prime})=\delta(q_{0},s\cdot\sigma)=\delta(\delta(q_{0},s),\sigma)=\delta(\textbf{row}(s),\sigma)=\textbf{row}(s\cdot\sigma)=\textbf{row}(s^{\prime}). This sequence of equalities follows as specified in Lemma 2 from Angluin [5].∎

Proposition 4

The entries of a unified, closed, and consistent symbolic observation table correspond to sequence classification consistent with constraints 𝒞\mathcal{C}. Specifically, for a unified, closed, and consistent S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle, let Σ\Sigma^{*} be partitioned into at most k|ΣO|k\leq|\Sigma^{O}| disjoint sets F1,F2,,FkF_{1},F_{2},...,F_{k}, and let ΣkO\Sigma^{O}_{k} be a specific subset of kk distinct elements of ΣO\Sigma^{O}. A sequence sΣs\in\Sigma^{*} is a member of FjF_{j} if and only if it is classified as σjΣkO\sigma_{j}\in\Sigma^{O}_{k}. The hypothesis output by MakeHypothesis(S,E,T;𝒞,Γ)(\langle S,E,T;\mathcal{C},\Gamma\rangle) satisfies δ(q0,se)Fj\delta(q_{0},s\cdot e)\in F_{j} if and only if T(se)=σjT(s\cdot e)=\sigma_{j} for every s(S(SΣ)) and every eEs\in(S\cup(S\cdot\Sigma))\text{ and every }e\in E.

Proof

We adapt the inductive proof of Lemma 3 from Angluin [5], but make appropriate adjustments for sequence classification. Let s(S(SΣ))s\in(S\cup(S\cdot\Sigma)) and e=εe=\varepsilon be the base case. Clearly, δ(q0,sε)=δ(q0,s)=row(s)\delta(q_{0},s\cdot\varepsilon)=\delta(q_{0},s)=\textbf{row}(s) as shown previously. If sSs\in S, then row(s)Fj\textbf{row}(s)\in F_{j} if and only if T(s)=σjT(s)=\sigma_{j}. If sSΣs\in S\cdot\Sigma, then row(s)Q^\textbf{row}(s)\in\hat{Q}, since the symbolic observation table is closed. A row(s)Q^\textbf{row}(s^{\prime})\in\hat{Q} is in FjF_{j} if and only if T(s)=σjT(s^{\prime})=\sigma_{j}.

Next, without loss of generality, assume that δ(q0,se)Fj\delta(q_{0},s\cdot e)\in F_{j} if and only if T(se)=σjT(s\cdot e)=\sigma_{j} for all sequences ee with length at most kk. Let eEe^{\prime}\in E be a sequence with length k+1k+1 and let sS(SΣ)s\in S\cup(S\cdot\Sigma). Because EE is suffix-closed, we can write e=σe0e^{\prime}=\sigma\cdot e_{0} for some e0Ee_{0}\in E of length kk and some σΣ\sigma\in\Sigma. Furthermore, there is a sequence s0Ss_{0}\in S such that row(s)=row(s0)\textbf{row}(s)=\textbf{row}(s_{0}) because the symbolic observation table is closed. Next, we show that if two sequences share a common suffix, but have different prefixes that end at the same state, then the two sequences also end at the same state. We observe that δ(q0,se)=δ(q0,sσe0)=δ(δ(q0,s),σe0)=δ(row(s),σe0)=δ(row(s0),σe0)=δ(δ(row(s0),σ),e0)=δ(row(s0σ),e0)=δ(δ(q0,s0σ),e0)=δ(q0,s0σe0)\delta(q_{0},s\cdot e^{\prime})=\delta(q_{0},s\cdot\sigma\cdot e_{0})=\delta(\delta(q_{0},s),\sigma\cdot e_{0})=\delta(\textbf{row}(s),\sigma\cdot e_{0})=\delta(\textbf{row}(s_{0}),\sigma\cdot e_{0})=\delta(\delta(\textbf{row}(s_{0}),\sigma),e_{0})=\delta(\textbf{row}(s_{0}\cdot\sigma),e_{0})=\delta(\delta(q_{0},s_{0}\cdot\sigma),e_{0})=\delta(q_{0},s_{0}\cdot\sigma\cdot e_{0}). Thus the sequences ses\cdot e^{\prime} and s0es_{0}\cdot e^{\prime} share a common suffix and, although their prefixes may differ, both end at the same state because their prefixes end at the same state. Since e0e_{0} has length kk, and s0σS(SΣ)s_{0}\cdot\sigma\in S\cup(S\cdot\Sigma), we can apply the inductive hypothesis: δ(q0,s0σe0)Fj\delta(q_{0},s_{0}\cdot\sigma\cdot e_{0})\in F_{j} if and only if T(s0σe0)=σjT(s_{0}\cdot\sigma\cdot e_{0})=\sigma_{j}. Because row(s)=row(s0)\textbf{row}(s)=\textbf{row}(s_{0}), we have row(s)(e)=row(s0)(e)\textbf{row}(s)(e)=\textbf{row}(s_{0})(e) for all eEe\in E. 
Note that by definition, T(se)=row(s)(e)T(s\cdot e^{\prime})=\textbf{row}(s)(e^{\prime}), so row(s0)(σe0)=row(s)(σe0)\textbf{row}(s_{0})(\sigma\cdot e_{0})=\textbf{row}(s)(\sigma\cdot e_{0}) implies T(s0σe0)=T(sσe0)=T(se)T(s_{0}\cdot\sigma\cdot e_{0})=T(s\cdot\sigma\cdot e_{0})=T(s\cdot e^{\prime}), which means δ(q0,se)Fj\delta(q_{0},s\cdot e^{\prime})\in F_{j} if and only if T(se)=σjT(s\cdot e^{\prime})=\sigma_{j}.∎

Proposition 5

Suppose S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle is a unified, closed, and consistent symbolic observation table. If the hypothesis h^=Q^,ΣI,ΣO,q^0,δ^,L^\hat{h}=\langle\hat{Q},\Sigma^{I},\Sigma^{O},\hat{q}_{0},\hat{\delta},\hat{L}\rangle from the function MakeHypothesis(S,E,T;𝒞,Γ)(\langle S,E,T;\mathcal{C},\Gamma\rangle) has nn states, and h^\hat{h} was generated using a satisfying solution Λ\Lambda to constraints 𝒞\mathcal{C}, then any other Moore machine h=Q,ΣI,ΣO,q0,δ,Lh=\langle Q,\Sigma^{I},\Sigma^{O},q_{0},\delta,L\rangle with nn or fewer states that is also consistent with Λ\Lambda and TT is isomorphic to h^\hat{h}.

Proof

We make additions and appropriate adjustments to the proof of Lemma 4 from Angluin [5]. First, let us partition Q^\hat{Q} into k=|ΣO|k=|\Sigma^{O}| disjoint sets F^1,,F^k\hat{F}_{1},...,\hat{F}_{k} such that for all σjΣO\sigma_{j}\in\Sigma^{O}, q^F^j\hat{q}\in\hat{F}_{j} if and only if L^(q^)=σj\hat{L}(\hat{q})=\sigma_{j}. Note that it is possible for some of the F^j\hat{F}_{j} to be empty. Similarly, let us also partition QQ into kk disjoint sets F1,,FkF_{1},...,F_{k} such that for all σjΣO\sigma_{j}\in\Sigma^{O}, qFjq\in F_{j} if and only if L(q)=σjL(q)=\sigma_{j}. Next, we can define for all qQq\in Q and for all σjΣO\sigma_{j}\in\Sigma^{O} the function row(q):EΣO\textbf{row}^{\wedge}(q):E\rightarrow\Sigma^{O} such that row(q)(e)=σj\textbf{row}^{\wedge}(q)(e)=\sigma_{j} if and only if δ(q,e)Fj\delta(q,e)\in F_{j}.

Because h=Q,ΣI,ΣO,q0,δ,Lh=\langle Q,\Sigma^{I},\Sigma^{O},q_{0},\delta,L\rangle is consistent with Λ\Lambda and TT, we know that for each sS(SΣ)s\in S\cup(S\cdot\Sigma) and for each eEe\in E, δ(q0,se)Fj\delta(q_{0},s\cdot e)\in F_{j} if and only if T(se)=σjT(s\cdot e)=\sigma_{j}. Therefore, for all eEe\in E, row(q)(e)=row(s)(e)\textbf{row}^{\wedge}(q)(e)=\textbf{row}(s)(e) whenever q=δ(q0,s)q=\delta(q_{0},s), which implies row(δ(q0,s))row(s)\textbf{row}^{\wedge}(\delta(q_{0},s))\equiv\textbf{row}(s). Next, because by definition Q^={row(s)|sS}\hat{Q}=\{\textbf{row}(s)|\forall s\in S\}, we have via substitution Q^={row(δ(q0,s))|sS}\hat{Q}=\{\textbf{row}^{\wedge}(\delta(q_{0},s))|\forall s\in S\}, which implies that |Q|n=|Q^||Q|\geq n=|\hat{Q}|. This is because in order for {row(δ(q0,s))|sS}\{\textbf{row}^{\wedge}(\delta(q_{0},s))|\forall s\in S\} to contain nn elements, δ(q0,s)\delta(q_{0},s) must range over at least nn states as ss ranges over SS. However, because we have assumed that hh contains nn or fewer states, hh must contain exactly n=|Q^|=|Q|n=|\hat{Q}|=|Q| states.

Next, we can consider bijective mappings between Q^\hat{Q} and QQ. Since QQ and Q^\hat{Q} have the same cardinality, and because row(δ(q0,s))row(s)\textbf{row}^{\wedge}(\delta(q_{0},s))\equiv\textbf{row}(s), we know that every sSs\in S corresponds to a unique qQq\in Q, specifically, q=δ(q0,s)q=\delta(q_{0},s). We can define the bijective mapping row:Q^Q\textbf{row}^{-\wedge}:\hat{Q}\rightarrow Q by row(row(s))=δ(q0,s)\textbf{row}^{-\wedge}(\textbf{row}(s))=\delta(q_{0},s) for all sSs\in S. From this mapping, we observe that row(row(ε))=row(q^0)=δ(q0,ε)=q0\textbf{row}^{-\wedge}(\textbf{row}(\varepsilon))=\textbf{row}^{-\wedge}(\hat{q}_{0})=\delta(q_{0},\varepsilon)=q_{0}. Additionally, for all sSs\in S and for all σΣ\sigma\in\Sigma, row(δ^(row(s),σ))=row(row(sσ))=δ(q0,sσ)=δ(δ(q0,s),σ)=δ(row(row(s)),σ)\textbf{row}^{-\wedge}(\hat{\delta}(\textbf{row}(s),\sigma))=\textbf{row}^{-\wedge}(\textbf{row}(s\cdot\sigma))=\delta(q_{0},s\cdot\sigma)=\delta(\delta(q_{0},s),\sigma)=\delta(\textbf{row}^{-\wedge}(\textbf{row}(s)),\sigma), which implies for all sSs\in S and σΣ\sigma\in\Sigma, row(δ^(row(s),σ))=δ(row(row(s)),σ)\textbf{row}^{-\wedge}(\hat{\delta}(\textbf{row}(s),\sigma))=\delta(\textbf{row}^{-\wedge}(\textbf{row}(s)),\sigma).

Finally, we show that for every ii with 1i|ΣO|1\leq i\leq|\Sigma^{O}|, row:F^iFi\textbf{row}^{-\wedge}:\hat{F}_{i}\rightarrow F_{i}. Specifically, if sSs\in S has row(s)F^i\textbf{row}(s)\in\hat{F}_{i}, this means that row(s)(ε)=σi\textbf{row}(s)(\varepsilon)=\sigma_{i}, and therefore T(s)=T(sε)=σiT(s)=T(s\cdot\varepsilon)=\sigma_{i}. Furthermore, suppose row(row(s))=δ(q0,s)=q\textbf{row}^{-\wedge}(\textbf{row}(s))=\delta(q_{0},s)=q, and therefore row(q)=row(δ(q0,s))row(s)\textbf{row}^{\wedge}(q)=\textbf{row}^{\wedge}(\delta(q_{0},s))\equiv\textbf{row}(s), which implies that row(q)(ε)=row(s)(ε)=T(sε)=T(s)=σi\textbf{row}^{\wedge}(q)(\varepsilon)=\textbf{row}(s)(\varepsilon)=T(s\cdot\varepsilon)=T(s)=\sigma_{i}. This means that qFiq\in F_{i} and row(s)F^i\textbf{row}(s)\in\hat{F}_{i}, and that each element of F^i\hat{F}_{i} bijectively maps to an element of FiF_{i}.∎

Proposition 6

Suppose S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle is a unified, closed, and consistent symbolic observation table. Let nn be the number of states in the hypothesis returned from MakeHypothesis(S,E,T;𝒞,Γ)(\langle S,E,T;\mathcal{C},\Gamma\rangle). Any Moore machine consistent with Γ\Gamma and 𝒞\mathcal{C} must have at least nn states.

Proof

Let h=Q,ΣI,ΣO,q0,δ,Lh=\langle Q,\Sigma^{I},\Sigma^{O},q_{0},\delta,L\rangle be any Moore machine consistent with T,ΛT,\Lambda, and 𝒞\mathcal{C}, for the satisfying solution Λ\Lambda that was used to generate MakeHypothesis(S,E,T;𝒞,Γ)(\langle S,E,T;\mathcal{C},\Gamma\rangle). Then, δ(q0,s)Fi\delta(q_{0},s)\in F_{i} if and only if T(s)=σiT(s)=\sigma_{i}. Let s1s_{1} and s2s_{2} be distinct elements of SS such that row(s1)row(s2)\textbf{{row}}(s_{1})\neq\textbf{{row}}(s_{2}). Then there exists an eEe\in E such that T(s1e)T(s2e)T(s_{1}\cdot e)\neq T(s_{2}\cdot e); in particular, if T(s1e)=σiT(s_{1}\cdot e)=\sigma_{i}, then T(s2e)σiT(s_{2}\cdot e)\neq\sigma_{i}. Because hh is consistent with TT, δ(q0,s1e)Fi\delta(q_{0},s_{1}\cdot e)\in F_{i} and δ(q0,s2e)Fi\delta(q_{0},s_{2}\cdot e)\not\in F_{i}, so δ(q0,s1e)\delta(q_{0},s_{1}\cdot e) and δ(q0,s2e)\delta(q_{0},s_{2}\cdot e) must be distinct since they belong to different classes. Because δ(q0,s1e)δ(q0,s2e)\delta(q_{0},s_{1}\cdot e)\neq\delta(q_{0},s_{2}\cdot e), and because the transition function must be consistent, δ(q0,s1)\delta(q_{0},s_{1}) and δ(q0,s2)\delta(q_{0},s_{2}) must be distinct states: if δ(q0,s1)\delta(q_{0},s_{1}) and δ(q0,s2)\delta(q_{0},s_{2}) were the same state, it would be impossible to transition to two different states using the same sequence ee. Since MakeHypothesis(S,E,T;𝒞,Γ)(\langle S,E,T;\mathcal{C},\Gamma\rangle) contains nn states, there is at least one subset PSP\subseteq S with P={s1,,sn}P=\{s_{1},...,s_{n}\} containing nn elements such that all the elements of P={row(s1),,row(sn)}P^{\prime}=\{\textbf{{row}}(s_{1}),...,\textbf{{row}}(s_{n})\} are distinct from one another. Since every pair row(si)row(sj)\textbf{{row}}(s_{i})\neq\textbf{{row}}(s_{j}) for iji\neq j in PP^{\prime}, it follows that each element of {δ(q0,s1),,δ(q0,sn)}\{\delta(q_{0},s_{1}),...,\delta(q_{0},s_{n})\} must be distinct to remain consistent with TT. 
However, for a given TT, there is a corresponding satisfying solution Λ\Lambda that specifies the values for TT. This means that the mapping from Λ\Lambda to TT is one-to-one, and so for a given Moore machine to be consistent with a (Λ,T)(\Lambda,T) pair, the Moore machine must contain at least nn states. But since Λ\Lambda can be any satisfying solution to 𝒞\mathcal{C} and Γ\Gamma, the above statement about Moore machines must be true for any Λ\Lambda that satisfies 𝒞\mathcal{C} and Γ\Gamma. Thus, any Moore machine consistent with 𝒞\mathcal{C} and Γ\Gamma must have at least nn states.∎

Theorem 0.A.1

If the symbolic observation table S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle is unified, closed, and consistent, and \mathcal{H} is the set of all hypotheses that can be returned from MakeHypothesis(S,E,T;𝒞,Γ)(\langle S,E,T;\mathcal{C},\Gamma\rangle), then every hypothesis hh\in\mathcal{H} is consistent with constraints 𝒞\mathcal{C}. Any other Moore machine consistent with 𝒞\mathcal{C}, but not contained in \mathcal{H}, must have more states.

We first provide a sketch, followed by the proof.

Proof

(Sketch) A given unified, closed, and consistent symbolic observation table S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle corresponds to (𝒮,,𝒞)(\mathcal{S},\mathcal{R},\mathcal{C}), where 𝒮\mathcal{S} is a symbolic hypothesis, \mathcal{R} is the set of representatives used in the table, and 𝒞\mathcal{C} are the constraints expressed over \mathcal{R}. All hypotheses in \mathcal{H} have states and transitions identical to 𝒮\mathcal{S}. Each satisfying solution Λ\Lambda to 𝒞\mathcal{C} corresponds to a unique concrete hypothesis in \mathcal{H}. Therefore every concrete hypothesis in \mathcal{H} is consistent with 𝒞\mathcal{C}. Let |h||h| represent the number of states in hh. We know for all hh\in\mathcal{H}, |h|=|𝒮||h|=|\mathcal{S}|. Let ¯\overline{\mathcal{H}} be the set of concrete hypotheses not in \mathcal{H}. Note ¯\overline{\mathcal{H}} can be partitioned into three sets—concrete hypotheses with (a) fewer states than 𝒮\mathcal{S}, (b) more states than 𝒮\mathcal{S}, and (c) same number of states as 𝒮\mathcal{S} but inconsistent with 𝒞\mathcal{C}. We ignore (c) because we care only about hypotheses consistent with 𝒞\mathcal{C}. Consider any concrete hypothesis hh in \mathcal{H} and its corresponding satisfying solution Λ\Lambda. Suppose we desire another hypothesis hh^{\prime} to be consistent with hh. If |h|<|h||h^{\prime}|<|h|, then hh^{\prime} cannot be consistent with hh because at least one sequence will be misclassified. Therefore, if hh^{\prime} must be consistent with hh, then we require |h||h||h^{\prime}|\geq|h|. Thus, any other hypothesis consistent with 𝒞\mathcal{C}, but not in \mathcal{H}, must have more states.∎

Proof

Proposition 1 establishes that all hypotheses in \mathcal{H} have equivalent states and equivalent transitions, which implies that every hypothesis in \mathcal{H} has the same number of states. Propositions 2 and 4 together establish that in a hypothesis, if a specific sequence ss is classified correctly, then all other sequences which start at the same initial state as ss and end in the same state as ss will be classified the same as ss. Thus, if a state is classified correctly, then all sequences terminating at that state are classified correctly. Finally, Proposition 5 establishes that any Moore machine dd consistent with Λ\Lambda and TT from h^=MakeHypothesis(S,E,T;𝒞,Γ)\hat{h}=\textsc{MakeHypothesis}(\langle S,E,T;\mathcal{C},\Gamma\rangle) must be isomorphic to h^\hat{h}; otherwise dd must contain at least one more state than h^\hat{h}. This implies that h^\hat{h} is the smallest Moore machine consistent with Λ\Lambda and TT, and because Λ\Lambda is taken from all possible satisfying solutions to 𝒞\mathcal{C}, then by Proposition 1, every hypothesis hh\in\mathcal{H} consistent with 𝒞\mathcal{C} has that same smallest number of states. Therefore, any other Moore machine not in \mathcal{H}, but still consistent with 𝒞\mathcal{C}, must have more states.∎

Theorem 1 establishes that the output of Algorithm 7 will be consistent with all the constraints in 𝒞\mathcal{C}. In order to establish correctness and termination, we will present a few additional lemmas, and then the main result.

Proposition 7

The number of unique constraints in 𝒞\mathcal{C} is finite, and is upper bounded by (|Γ|2)\binom{|\Gamma|}{2}.

Proof

We will show that the number of unique constraints in 𝒞\mathcal{C} is finite. In fact, the upper bound on the total number of constraints in 𝒞\mathcal{C} is a function of |Γ||\Gamma|, the total number of unique sequences recorded in the observation table. The total number of preference queries that are executed is (|Γ|2)\binom{|\Gamma|}{2}, which is the maximum number of unique constraints in 𝒞\mathcal{C}.

This is because whenever a SymbolicFill is performed at some time tt, there will be NN new, unique sequences in the table which have never been used in a preference query, and there are OO old sequences, representing sequences which have already been queried. At all times, |Γ|=N+O|\Gamma|=N+O, and therefore, during a SymbolicFill, (N2)+NO\binom{N}{2}+NO preference queries are performed. Now, consider rr to represent the round number—the rrth time that a SymbolicFill has been performed. Let nrn_{r} represent the number of new, unique sequences in round rr. Let or+1o_{r+1} represent all sequences in Γ\Gamma which have been queried over the past rr rounds; that is,

or+1=k=0rnk,o_{r+1}=\sum_{k=0}^{r}n_{k},

where the initial conditions are o0=0o_{0}=0, and n01n_{0}\geq 1 is the initial size of Γ\Gamma. Suppose that RR rounds have occurred. How many preference queries have occurred? Clearly, the total number of unique sequences after RR rounds is n0+n1++nR=|Γ|n_{0}+n_{1}+\cdots+n_{R}=|\Gamma|, which is the same as oR+nR=|Γ|o_{R}+n_{R}=|\Gamma|. We can show that

\binom{n_{0}+n_{1}+\cdots+n_{R}}{2}=\sum_{r=0}^{R}\left[\binom{n_{r}}{2}+n_{r}o_{r}\right].

Clearly, the RHS is justified, because in round rr, (nr2)+nror\binom{n_{r}}{2}+n_{r}o_{r} preference queries are performed. Therefore, we just need to show the LHS is equivalent to the RHS via some algebra:

(n0++nR2)\displaystyle\binom{n_{0}+\cdots+n_{R}}{2} =(oR+nR2)\displaystyle=\binom{o_{R}+n_{R}}{2}
=(oR+nR)(oR+nR1)2\displaystyle=\frac{(o_{R}+n_{R})(o_{R}+n_{R}-1)}{2}
=(oR+nR)2(oR+nR)2\displaystyle=\frac{(o_{R}+n_{R})^{2}-(o_{R}+n_{R})}{2}
=oR2+nR2+2oRnRoRnR2\displaystyle=\frac{o_{R}^{2}+n_{R}^{2}+2o_{R}n_{R}-o_{R}-n_{R}}{2}
=(oR2oR)+(nR2nR)+2oRnR2\displaystyle=\frac{(o_{R}^{2}-o_{R})+(n_{R}^{2}-n_{R})+2o_{R}n_{R}}{2}
(oR+nR2)\displaystyle\binom{o_{R}+n_{R}}{2} =(oR2)+(nR2)+oRnR\displaystyle=\binom{o_{R}}{2}+\binom{n_{R}}{2}+o_{R}n_{R}
(oR+nR2)\displaystyle\binom{o_{R}+n_{R}}{2} =(oR1+nR12)+(nR2)+oRnR\displaystyle=\binom{o_{R-1}+n_{R-1}}{2}+\binom{n_{R}}{2}+o_{R}n_{R}
=[(oR2+nR22)+oR1nR1\displaystyle=\left[\binom{o_{R-2}+n_{R-2}}{2}+o_{R-1}n_{R-1}\right.
+(nR12)]+(nR2)+oRnR\displaystyle\phantom{{}=}\left.+\binom{n_{R-1}}{2}\right]+\binom{n_{R}}{2}+o_{R}n_{R}

Now, note

(oRd+nRd2)\displaystyle\binom{o_{R-d}+n_{R-d}}{2} =(oRd1+nRd12)\displaystyle=\binom{o_{R-d-1}+n_{R-d-1}}{2}
+(nRd2)+oRdnRd\displaystyle\phantom{{}=}+\binom{n_{R-d}}{2}+o_{R-d}n_{R-d}

for any integer dd with 0dR10\leq d\leq R-1. Hence, by recursively splitting the above binomial term, we end up with the terms o0n0++oRnRo_{0}n_{0}+\cdots+o_{R}n_{R}, (n02)++(nR2)\binom{n_{0}}{2}+\cdots+\binom{n_{R}}{2}, and (o02)\binom{o_{0}}{2} in the expression, and since o0=0o_{0}=0, we have:

(oR+nR2)\displaystyle\binom{o_{R}+n_{R}}{2} =(o02)+[(n02)++(nR2)]\displaystyle=\binom{o_{0}}{2}+\left[\binom{n_{0}}{2}+\cdots+\binom{n_{R}}{2}\right]
+(o0n0++oRnR)\displaystyle\phantom{{}=}+(o_{0}n_{0}+\cdots+o_{R}n_{R})
\displaystyle=\binom{o_{0}}{2}+\sum_{r=0}^{R}\left[\binom{n_{r}}{2}+o_{r}n_{r}\right]
\displaystyle=\sum_{r=0}^{R}\left[\binom{n_{r}}{2}+o_{r}n_{r}\right]

The number of unique sequences in the observation table is a function of the number of table closure and table consistency operations performed; but eventually, when the table becomes closed and consistent, a finite number of prefixes and suffixes will exist in the table. This implies there is only a finite number of unique possible sequences that can be created from the prefixes and suffixes in the table. The upper bound on the number of unique constraints is (|Γ|2)\binom{|\Gamma|}{2}.∎
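As an illustrative sanity check (not part of the formal argument), the counting identity above can be verified numerically: tallying (nr2)+nror\binom{n_{r}}{2}+n_{r}o_{r} round by round must equal (|Γ|2)\binom{|\Gamma|}{2}. The function name and round sizes below are hypothetical:

```python
from math import comb
import random

def total_preference_queries(new_per_round):
    """Tally the per-round preference queries C(n_r, 2) + n_r * o_r,
    where o_r counts the previously queried ("old") sequences."""
    total, old = 0, 0
    for n_r in new_per_round:
        total += comb(n_r, 2) + n_r * old  # queries issued in round r
        old += n_r                         # new sequences become old
    return total

# The identity: the round-by-round total equals C(|Gamma|, 2), the
# number of unordered pairs over all |Gamma| unique sequences.
random.seed(0)
rounds = [random.randint(1, 10) for _ in range(8)]
assert total_preference_queries(rounds) == comb(sum(rounds), 2)
```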

Proposition 8

The number of representatives, |||\mathcal{R}|, in a unified, closed, and consistent symbolic observation table S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle is finite, and is bounded by 1|||ΣO|1\leq|\mathcal{R}|\leq|\Sigma^{O}|. Also, the number of unique inequalities in 𝒞\mathcal{C} is (||2)\binom{|\mathcal{R}|}{2}.

Proof

Since the size of the output alphabet ΣO\Sigma^{O} is finite, this means there is an upper bound of |ΣO||\Sigma^{O}| possible classification classes for all the sequences. If all sequences from (S(SΣI))E(S\cup(S\cdot\Sigma^{I}))\cdot E must be classified into ΣO\Sigma^{O}, then there are at most |ΣO||\Sigma^{O}| equivalence classes of sequences.

Because each entry of a unified, closed, and consistent symbolic observation table S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle is a variable, and that variable must have as its value one of the elements in ΣO\Sigma^{O}, the number of representatives |||\mathcal{R}| is bounded via 1|||ΣO|1\leq|\mathcal{R}|\leq|\Sigma^{O}|. We define unique representatives to be representatives that are known to have distinct values; in other words, the number of unique representatives at any given point in time is the same as the number of equivalence classes over variables at that point in time; the unique representatives are the representatives of the equivalence classes of variables. Thus, if there are |||\mathcal{R}| unique representatives, and they satisfy a total ordering, then 𝒞\mathcal{C} will eventually contain a total of (||2)\binom{|\mathcal{R}|}{2} unique inequalities. This total number of unique inequalities is populated in 𝒞\mathcal{C} via preference queries over pairs of unique sequences in the table. The quantity |Γ||\Gamma| of unique sequences in the table is increased by prefix expansions (via closure tests and counterexamples), and suffix expansions (via consistency tests). Furthermore, during an equivalence query, whenever a counterexample cc is presented, cc and all its prefixes are added to the prefix set.∎
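The maintenance of equivalence classes of variables and their representatives can be pictured with a standard union-find structure: each equality constraint merges two classes, and each class keeps one representative. The following is an illustrative sketch only (class and variable names are hypothetical, not Remap's actual implementation):

```python
class Unifier:
    """Minimal union-find sketch: equality constraints collapse table
    variables into classes, each with a single representative."""
    def __init__(self):
        self.parent = {}

    def find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:                       # path-halving find
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra  # ra becomes the class representative

# Equality constraints from preference queries merge variables:
u = Unifier()
u.union("v1", "v2")   # a query revealed v1 == v2
u.union("v2", "v3")
reps = {u.find(v) for v in ("v1", "v2", "v3", "v4")}
assert len(reps) == 2  # two classes: {v1, v2, v3} and {v4}
```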

We now define properties of a satisfying solution Λ\Lambda to the constraint set 𝒞\mathcal{C} involving |||\mathcal{R}| representatives, according to how many of those representatives have correct assignments. If ||<V|\mathcal{R}|<V^{*} for some upper bound V|ΣO|V^{*}\leq|\Sigma^{O}|, then we say that Λ\Lambda is incomplete. If Λ\Lambda contains at least one incorrect representative-value assignment, then we say Λ\Lambda is incorrect. If Λ\Lambda has assigned correct values to all |||\mathcal{R}| representatives, then we say Λ\Lambda is correct. If Λ\Lambda is both incomplete and correct, then we say Λ\Lambda is partially correct.

Lemma 1

Whenever a counterexample cc is processed, either 00 or 11 additional representative value becomes known.

Proof

This is proven in the proof for Theorem 2.∎

Theorem 0.A.2

Suppose S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle is a unified, closed, and consistent symbolic observation table. Let h^=MakeHypothesis(S,E,T;𝒞,Γ)\hat{h}=\textsc{MakeHypothesis}(\langle S,E,T;\mathcal{C},\Gamma\rangle) be the hypothesis induced by Λ\Lambda, a satisfying solution to 𝒞\mathcal{C}. If the teacher returns a counterexample cc as the result of an equivalence query on h^\hat{h}, then at least one of the following statements about hypothesis h^\hat{h} must be true: (a) h^\hat{h} contains too few states, or (b) the satisfying solution Λ\Lambda inducing h^\hat{h} is either incomplete or incorrect.

We first present a sketch, then a proof.

Proof

(Sketch) This sketch applies to the above lemma, theorem, and below corollary about termination. Consider the sequence ,hk1,hk,\dots,h_{k-1},h_{k},\dots of hypotheses that Remap makes. For a given pair of consecutive hypotheses (hk1,hk)(h_{k-1},h_{k}), consider how the number of states nn, and the number of known representative values nn_{\bullet} changes. Let nn^{*} be the number of states of the minimal Moore machine correctly classifying all sequences. Let V|ΣO|V^{*}\leq|\Sigma^{O}| be the upper bound on |||\mathcal{R}|. Note that 0n||V|ΣO|0\leq n_{\bullet}\leq|\mathcal{R}|\leq V^{*}\leq|\Sigma^{O}| always holds. Through detailed case analysis on returned counterexamples, we can show that the change in nn_{\bullet}, denoted by Δn\Delta n_{\bullet}, must always be either 0 or 11, and furthermore, if Δn=0\Delta n_{\bullet}=0, then we must have Δn1\Delta n\geq 1. By the case analysis and tracking nn and nn_{\bullet}, observe that if a counterexample cc is received from the teacher due to hypothesis hh, then at least one of (a) n<nn<n^{*} or (b) n<Vn_{\bullet}<V^{*} must be true. Since Δn\Delta n_{\bullet} and Δn\Delta n cannot simultaneously be 0, whenever a new hypothesis is made, progress must be made towards the upper bound of (n,V)(n^{*},V^{*}). If the upper bound is reached, then the algorithm must terminate, since it is impossible to progress from the point (n,V)(n^{*},V^{*}).∎

Proof

Our strategy for this proof is to show that the number of known representative values and number of states in the hypothesis will increase up to their upper limits, at which point the algorithm must terminate. Let 𝔼[v]\mathbb{EC}[v] denote the set of variables in Γ\Gamma which are known to be equivalent to vv according to the equality constraints that were gathered so far in 𝒞\mathcal{C}. Let |||\mathcal{R}| denote the number of representatives present in the unified, closed, and consistent observation table S,E,T;𝒞,Γ\langle S,E,T;\mathcal{C},\Gamma\rangle. That is, |||\mathcal{R}| is the number of elements in the set {𝔼[Γ(se)]|se(S(SΣ))E}\{\mathbb{EC}[\Gamma(s\cdot e)]|\forall s\cdot e\in(S\cup(S\cdot\Sigma))\cdot E\}. Let nn_{\circ} represent the number of unknown representative values, and let nn_{\bullet} represent the number of known representative values. We have, at all times, the invariant ||=n+n|\mathcal{R}|=n_{\circ}+n_{\bullet}. Furthermore, at the very beginning of the algorithm, they have the initial values of ||=1|\mathcal{R}|=1, n=||n_{\circ}=|\mathcal{R}| and n=0n_{\bullet}=0. Finally, let VV^{*} represent the ground truth number of classes that sequences can be classified into, according to the teacher. We will show that if the teacher returns a counterexample, then at least one of the following two scenarios must be true about the learner’s current hypothesis h^=MakeHypothesis(S,E,T;𝒞,Γ)\hat{h}=\textsc{MakeHypothesis}(\langle S,E,T;\mathcal{C},\Gamma\rangle): (A) h^\hat{h} contains too few states, or (B) the satisfying solution Λ\Lambda inducing h^\hat{h} is incorrect. We enumerate cases by describing the pre-conditions and post-conditions when executing the equivalence query.

Case 1

Prior to the equivalence query, c(S(SΣI))Ec\in(S\cup(S\cdot\Sigma^{I}))\cdot E. This means there exists s1(S(SΣI))s_{1}\in(S\cup(S\cdot\Sigma^{I})) and e1Ee_{1}\in E such that c=s1e1c=s_{1}\cdot e_{1} and the variable located at T(c)T(c) was assigned an incorrect value ww. This also means that for every se(S(SΣ))Es\cdot e\in(S\cup(S\cdot\Sigma))\cdot E such that Γ[se]𝔼[T(c)]\Gamma[s\cdot e]\in\mathbb{EC}[T(c)], it follows that each T(se)T(s\cdot e) is also the same incorrect ww. However, after the equivalence query, the teacher returns the counterexample cc with feedback f(c)=rf(c)=r, where rwr\neq~{}w. Then, for each variable v𝔼[T(c)]v\in\mathbb{EC}[T(c)], the learner will assign vv the value rr. If cSc\in S, then all of its prefixes are already in SS because SS is prefixed closed. If cSΣc\in S\cdot\Sigma, then all of its prefixes are already in SS. If cc is not in SΣS\cdot\Sigma, then the prefixes of cc which are not in SS will be added to SS. In this latter case, there is a possibility that the number of unique representatives |||\mathcal{R}| will be increased by kk, where 0kV||0\leq k\leq V^{*}-|\mathcal{R}|. However, the number of known variables will always increase by exactly 1. Thus, under this case, we have the following transformation: (||,n,n)(||+k,n1+k,n+1)(|\mathcal{R}|,n_{\circ},n_{\bullet})\longrightarrow(|\mathcal{R}|+k,n_{\circ}-1+k,n_{\bullet}+1), where 0kV||0\leq k\leq V^{*}-|\mathcal{R}|.

Case 2

Prior to the equivalence query, c(S(SΣI))Ec\not\in(S\cup(S\cdot\Sigma^{I}))\cdot E, which means T(c)T(c) does not exist in the symbolic observation table. If cc was returned as a counterexample, then the learner's hypothesis misclassified cc. The post-conditions for this case can be broken into three possibilities:

(a)

    There is some se(S(SΣI))Es\cdot e\in(S\cup(S\cdot\Sigma^{I}))\cdot E for which 𝔼[Γ[se]]=valr\mathbb{EC}\left[\Gamma[s\cdot e]\right]\overset{\mathrm{val}}{=}r according to Λ\Lambda, and |𝔼[Γ[se]]|1\left|\mathbb{EC}[\Gamma[s\cdot e]]\right|\geq 1. The new variable T(c)T(c) is added to 𝔼[Γ[se]]\mathbb{EC}[\Gamma[s\cdot e]], and subsequently, 𝔼[T(c)]=𝔼[Γ[se]]\mathbb{EC}[T(c)]=\mathbb{EC}[\Gamma[s\cdot e]] and |𝔼[T(c)]|>1|\mathbb{EC}[T(c)]|>1, and all the variables in that class have a value rr. While cc itself does not induce |||\mathcal{R}| to increase, it is possible that its prefixes might induce an increase in |||\mathcal{R}|. So while the number of unknown representative values might increase, the number of known representative values stays constant. Therefore, we have the following transformation: (||,n,n)(||+k,n+k,n)(|\mathcal{R}|,n_{\circ},n_{\bullet})\longrightarrow(|\mathcal{R}|+k,n_{\circ}+k,n_{\bullet}), where 0kV||0\leq k\leq V^{*}-|\mathcal{R}|.

(b)

    There is some se(S(SΣI))Es\cdot e\in(S\cup(S\cdot\Sigma^{I}))\cdot E for which 𝔼[Γ[se]]valr\mathbb{EC}\left[\Gamma[s\cdot e]\right]\overset{\mathrm{val}}{\neq}r according to Λ\Lambda and |𝔼[Γ[se]]|1\left|\mathbb{EC}[\Gamma[s\cdot e]]\right|\geq 1. The new variable T(c)T(c) is added to 𝔼[Γ[se]]\mathbb{EC}[\Gamma[s\cdot e]], and subsequently, 𝔼[T(c)]=𝔼[Γ[se]]\mathbb{EC}[T(c)]=\mathbb{EC}[\Gamma[s\cdot e]] and |𝔼[T(c)]|>1|\mathbb{EC}[T(c)]|>1, and 𝔼[T(c)]=val𝔼[Γ[se]]=valr\mathbb{EC}\left[T(c)\right]\overset{\mathrm{val}}{=}\mathbb{EC}\left[\Gamma[s\cdot e]\right]\overset{\mathrm{val}}{=}r . Again, the number of unknown representative values might increase, but this time, the number of known representative values increases by 1. Therefore, we have the following transformation: (||,n,n)(||+k,n1+k,n+1)(|\mathcal{R}|,n_{\circ},n_{\bullet})\longrightarrow(|\mathcal{R}|+k,n_{\circ}-1+k,n_{\bullet}+1), where 0kV||0\leq k\leq V^{*}-|\mathcal{R}|.

(c)

    The variable T(c)T(c) is not added to any pre-existing equivalence class; therefore it is added to its own equivalence class, and its value is known to be rr. Thus, we have the following transformation: (||,n,n)(||+1+k,n+k,n+1)(|\mathcal{R}|,n_{\circ},n_{\bullet})\longrightarrow(|\mathcal{R}|+1+k,n_{\circ}+k,n_{\bullet}+1), where 0kV||10\leq k\leq V^{*}-|\mathcal{R}|-1.

We note that cases (1), (2b), and (2c) can occur a finite number of times. This is because nn_{\bullet} increases by exactly 11 each time one of those cases occurs. The maximum value nn_{\bullet} can take on is |||\mathcal{R}|; the upper bound on |||\mathcal{R}| is VV^{*}, and the upper bound on VV^{*} is |ΣO||\Sigma^{O}|. Furthermore, cases (1), (2b), and (2c) cannot occur if n=Vn_{\bullet}=V^{*}. Thus, if case (1), (2b), or (2c) occur, they increase the number of known variables by 1, and they will cease to occur once n=Vn_{\bullet}=V^{*}. These cases fall under the umbrella of an incorrect satisfying solution Λ\Lambda.

Clearly, case (1) falls under Scenario (B), an incorrect satisfying solution Λ\Lambda, since the value of T(c)T(c) for a known cc in the table is incorrect. Case (2b) also falls under Scenario (B), because δ^(q^0,c)=δ^(q^0,s)\hat{\delta}(\hat{q}_{0},c)=\hat{\delta}(\hat{q}_{0},s^{\prime}) for some sSs^{\prime}\in S was incorrectly classified, implying that the value of T(s)T(s^{\prime}) was incorrect.

For cases (2a) and (2c), it is not necessarily true that the satisfying solution Λ\Lambda used to generate h^\hat{h} was incorrect, with respect to the current number of representatives |||\mathcal{R}|. It is possible that Λ\Lambda was indeed incorrect, where at least one of the |||\mathcal{R}| representatives was assigned the incorrect value, in which case Scenario (B) is true. It is also possible that Λ\Lambda was incomplete, but partially correct, where all existing known representatives had correct values, but where ||<V|\mathcal{R}|<V^{*}. If a hypothesis h^\hat{h} is partially correct, then 𝒞\mathcal{C} contains only |||\mathcal{R}| representatives, with ||<V|\mathcal{R}|<V^{*}, n=||n_{\bullet}=|\mathcal{R}|, and n=0n_{\circ}=0. This means that h^=MakeHypothesis(S,E,T;𝒞,Γ)\hat{h}=\textsc{MakeHypothesis}(\langle S,E,T;\mathcal{C},\Gamma\rangle) is the only partially correct hypothesis the algorithm could have generated under these conditions (any others must be incorrect), and is therefore unique; the size of \mathcal{H} must be 1. Then, by Theorem 1, any other Moore machine consistent with 𝒞\mathcal{C}, but not equivalent to h^\hat{h} must contain more states; hence Scenario (A) is true. Thus, in our case work, we have shown that Scenario (A) or Scenario (B) holds.∎

Corollary 1 (Termination)

Remap must terminate when the number of states and number of known representative values in a concrete hypothesis reach their respective upper bounds.

Proof

Since cases (1-2c) cover all the cases, and in each case at least one of Scenario (A) and Scenario (B) is true, then for each pair of consecutive hypotheses h^j1\hat{h}_{j-1} and h^j\hat{h}_{j} generated by consecutive equivalence queries in the algorithm, one of the following is true: (a) the latter hypothesis h^j\hat{h}_{j} contains at least one more state than the prior hypothesis h^j1\hat{h}_{j-1}, (b) the latter hypothesis h^j\hat{h}_{j} contains at least one more known variable value compared to the prior hypothesis h^j1\hat{h}_{j-1}, or (c) both (a) and (b) are true. Thus, in a sequence of hypotheses generated by equivalence queries, both the number of states and the number of known representatives increase monotonically, and at least one of them increases strictly with each counterexample. There can be at most VV^{*} discoveries of representative values. If nn^{*} is the number of states in the minimal Moore machine which correctly classifies all sequences, then the number of states nn in each hypothesis will increase monotonically to nn^{*}. Then, clearly, the algorithm must terminate when n=nn=n^{*} and ||=V|\mathcal{R}|=V^{*}.∎

Corollary 1 indicates that Remap must terminate, since every hypothesis made makes progress towards the upper bound.

Theorem 0.A.3 (Query Complexity)

If nn is the number of states of the minimal automaton isomorphic to the target automaton to be learned, and mm is the maximum length of any counterexample sequence that the teacher returns, then (a) the upper bound on the number of equivalence queries that Remap executes is n+|ΣO|1n+|\Sigma^{O}|-1, and (b) the preference query complexity is 𝒪(mn2ln(mn2))\mathcal{O}(mn^{2}\ln(mn^{2})), which is polynomial in the number of unique sequences that the learner performs queries on.

Proof

Based on Theorem 2, we know that the maximum number of equivalence queries is the taxicab distance from the point (1,0)(1,0) to (n,|ΣO|)(n,|\Sigma^{O}|), which is n+|ΣO|1n+|\Sigma^{O}|-1. From counterexample processing, we know there will be at most m(n+|ΣO|1)m(n+|\Sigma^{O}|-1) sequences added to the prefix set SS, since a counterexample cc of length mm results in at most mm sequences added to the prefix set SS. The maximum number of times the table can be found inconsistent is at most n1n-1 times, since there can be at most nn states, and the learner starts with 11 state. Whenever a sequence is added to the suffix set EE, the maximum length of sequences in EE increases by at most 11, implying the maximum sequence length in EE is n1n-1. Similarly, closure operations can be performed at most n1n-1 times, so the total number of sequences in EE is at most nn; the maximum number of sequences in SS is n+m(n+|ΣO|1)n+m(n+|\Sigma^{O}|-1). The maximum number of unique sequences queried in the table is the maximum cardinality of (SSΣI)E(S\cup S\cdot\Sigma^{I})\cdot E, which is

(n+m(n+|ΣO|1))(1+|ΣI|)n=𝒪(mn2).(n+m(n+|\Sigma^{O}|-1))(1+|\Sigma^{I}|)n=\mathcal{O}(mn^{2}).

Therefore, the preference query complexity of Remap is 𝒪(mn2ln(mn2))\mathcal{O}(mn^{2}\ln(mn^{2})) due to randomized quicksort.∎
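For concreteness, the bounds in the proof can be transcribed directly into a small calculator (all names are hypothetical; this computes the stated upper bounds, not measured query counts):

```python
import math

def remap_query_bounds(n, m, sigma_in, sigma_out):
    """Upper bounds from the argument above: equivalence queries,
    unique table sequences, and preference-query comparisons."""
    eq = n + sigma_out - 1                    # taxicab distance (1,0) -> (n, |Sigma_O|)
    prefixes = n + m * eq                     # bound on |S|
    table = prefixes * (1 + sigma_in) * n     # bound on |(S u S.Sigma_I).E|, O(m n^2)
    pq = math.ceil(table * math.log(table))   # O(m n^2 ln(m n^2)) sorting comparisons
    return eq, table, pq

eq, table, pq = remap_query_bounds(n=5, m=10, sigma_in=2, sigma_out=3)
assert eq == 7 and table == (5 + 10 * 7) * 3 * 5
```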

Next, we show that Remap probably approximately correctly identifies the minimal automaton isomorphic to the target.

Definition 12 (Probably Approximately Correct Identification)

Given an arbitrary Moore machine M=Q,q0,ΣI,ΣO,δ,LM=\langle Q,q_{0},\Sigma^{I},\Sigma^{O},\delta,L\rangle, let the regular language classification function f:(ΣI)ΣOf:(\Sigma^{I})^{*}\rightarrow\Sigma^{O} be represented by f(s)=L(δ(q0,s))f(s)=L(\delta(q_{0},s)) for all s(ΣI)s\in(\Sigma^{I})^{*}. Let 𝒟\mathcal{D} be any probability distribution over (ΣI)(\Sigma^{I})^{*}. An algorithm 𝒜\mathcal{A} probably approximately correctly identifies ff if and only if for any choice of 0<ϵ<10<\epsilon<1 and 0<d<10<d<1, 𝒜\mathcal{A} always terminates and outputs an ϵ\epsilon-approximate sequence classifier f^:(ΣI)ΣO\hat{f}:(\Sigma^{I})^{*}\rightarrow\Sigma^{O}, such that with probability at least 1d1-d, the probability of misclassification is P(f^(s)f(s))ϵP(\hat{f}(s)\neq f(s))\leq\epsilon when ss is drawn according to the distribution 𝒟\mathcal{D}.
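A minimal sketch of the classifier in Definition 12, using a hypothetical two-state Moore machine over {a,b}\{a,b\} whose label is 11 exactly when the sequence ends in bb (machine and names are illustrative, not from the paper's experiments):

```python
class Moore:
    """Moore machine sketch: f(s) = L(delta(q0, s))."""
    def __init__(self, q0, delta, label):
        self.q0, self.delta, self.label = q0, delta, label

    def classify(self, s):
        q = self.q0
        for a in s:                  # run delta over the sequence
            q = self.delta[(q, a)]
        return self.label[q]         # label of the final state

# Hypothetical target: end in state 1 iff the last symbol was 'b'.
target = Moore(0, {(0, 'a'): 0, (0, 'b'): 1,
                   (1, 'a'): 0, (1, 'b'): 1}, {0: 0, 1: 1})
assert target.classify("aab") == 1
assert target.classify("") == 0   # empty sequence stays at q0
```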

Theorem 0.A.4

Remap achieves probably approximately correct identification of any Moore machine when the teacher 𝒯\mathcal{T} uses sampling-based equivalence queries with at least mk1ϵ(ln1d+kln2)m_{k}\geq\left\lceil\frac{1}{\epsilon}\left(\ln\frac{1}{d}+k\ln 2\right)\right\rceil samples drawn i.i.d. from 𝒟\mathcal{D} for the kkth equivalence query.

We first present a proof sketch, then the full proof.

Proof

(Sketch) The probability 1ϵk1-\epsilon_{k} that a sequence sampled from an arbitrary distribution 𝒟\mathcal{D} over (ΣI)(\Sigma^{I})^{*} is classified correctly by the learner's kkth hypothesis depends on the distribution and on the intersections of the sets of sequences to which the teacher and the kkth hypothesis assign the same classification values. The probability that the kkth hypothesis misclassifies a sequence is ϵk\epsilon_{k}. If the teacher draws mkm_{k} samples for the kkth equivalence query, then an upper bound can be established for the case when ϵkϵ\epsilon_{k}\leq\epsilon for a given ϵ\epsilon. Since we know Remap executes at most n+|ΣO|1n+|\Sigma^{O}|-1 equivalence queries, one can upper bound the probability that Remap terminates with an error by summing the probabilities of the events that the teacher fails to detect an error in each of the at most n+|ΣO|1n+|\Sigma^{O}|-1 equivalence queries. An exponentially decaying upper bound can be found, and a lower bound for mkm_{k} can be derived in terms of ϵ,d,\epsilon,d, and kk.∎

Proof

Suppose the teacher 𝒯\mathcal{T} has Moore machine Q,q0,ΣI,ΣO,δ,L\langle Q,q_{0},\Sigma^{I},\Sigma^{O},\delta,L\rangle representing ff which classifies all sequences in the set (ΣI)(\Sigma^{I})^{*} into |ΣO||\Sigma^{O}| disjoint sets {Lc|cΣO}\{L_{c}|c\in\Sigma^{O}\} such that for all cΣO,sLc,f(s)=L(δ(q0,s))=cc\in\Sigma^{O},s\in L_{c},f(s)=L(\delta(q_{0},s))=c, we have:

(ΣI)=cΣOLc.\displaystyle(\Sigma^{I})^{*}=\bigcup_{c\in\Sigma^{O}}L_{c}.

Similarly, suppose the learner has proposed a Moore machine Q^,q^0,ΣI,ΣO,δ^,L^\langle\hat{Q},\hat{q}_{0},\Sigma^{I},\Sigma^{O},\hat{\delta},\hat{L}\rangle representing the kkth hypothesis f^k\hat{f}_{k} which classifies all sequences in the set (ΣI)(\Sigma^{I})^{*} into |CLk||C^{k}_{L}| disjoint subsets {Lck|cCLk}\{L^{\prime k}_{c}|c\in C^{k}_{L}\} where CLkΣOC^{k}_{L}\subseteq\Sigma^{O} such that for all cCLkc\in C^{k}_{L} and for all sLck,f^(s)=L^(δ^(q^0,s))=cs\in L^{\prime k}_{c},\hat{f}(s)=\hat{L}(\hat{\delta}(\hat{q}_{0},s))=c we have:

(ΣI)=cCLkLck\displaystyle(\Sigma^{I})^{*}=\bigcup_{c\in C^{k}_{L}}L^{\prime k}_{c}

If 𝒟\mathcal{D} is a distribution over (ΣI)(\Sigma^{I})^{*}, and if S𝒟S\sim\mathcal{D} is a random variable representing the sequence ss drawn according to 𝒟\mathcal{D}, and if the set of intersections IkI_{k} is defined by

Ik=cCLkΣOLcLckI_{k}=\displaystyle\bigcup_{c\in C^{k}_{L}\cap\Sigma^{O}}L_{c}\cap L^{\prime k}_{c}

then the probability that the kkth hypothesis f^k\hat{f}_{k} classifies ss correctly is

P(f^k(S)=f(S))=sIkpS(s)P(\hat{f}_{k}(S)=f(S))=\displaystyle\sum_{s\in I_{k}}p_{S}(s)

and therefore the probability ϵk\epsilon_{k} that f^k(s)f(s)\hat{f}_{k}(s)\neq f(s) is

ϵk=P(f^k(S)f(S))=1sIkpS(s).\epsilon_{k}=P(\hat{f}_{k}(S)\neq f(S))=1-\displaystyle\sum_{s\in I_{k}}p_{S}(s).

Suppose for the kkth equivalence query, the teacher 𝒯\mathcal{T} samples mkm_{k} sequences i.i.d. according to the distribution 𝒟\mathcal{D} over (ΣI)(\Sigma^{I})^{*}. The probability 𝒯\mathcal{T} accepts f^k\hat{f}_{k} because it detects no misclassification for any of the mkm_{k} samples (represented by the random variables S1,,Smk𝒟S_{1},\cdots,S_{m_{k}}\sim\mathcal{D}) is pkp_{k}, given by

pk=P(i=1mk(f^k(Si)f(Si))=0)=(1ϵk)mk,p_{k}=P(\sum_{i=1}^{m_{k}}(\hat{f}_{k}(S_{i})-f(S_{i}))=0)=(1-\epsilon_{k})^{m_{k}},

so if ϵkϵ\epsilon_{k}\leq\epsilon for a chosen ϵ\epsilon, then

P(i=1mk(f^k(Si)f(Si))=0|ϵkϵ)(1ϵ)mkP(\sum_{i=1}^{m_{k}}(\hat{f}_{k}(S_{i})-f(S_{i}))=0|\epsilon_{k}\leq\epsilon)\geq(1-\epsilon)^{m_{k}}

and if ϵkϵ\epsilon_{k}\geq\epsilon, then

P(i=1mk(f^k(Si)f(Si))=0|ϵkϵ)(1ϵ)mk.P(\sum_{i=1}^{m_{k}}(\hat{f}_{k}(S_{i})-f(S_{i}))=0|\epsilon_{k}\geq\epsilon)\leq(1-\epsilon)^{m_{k}}.

We know that Remap will execute at most n+|ΣO|1n+|\Sigma^{O}|-1 equivalence queries, so the probability that Remap terminates with an error in the set {ϵ1,,ϵn+|ΣO|1}\{\epsilon_{1},\cdots,\epsilon_{n+|\Sigma^{O}|-1}\} is given by

p1+(1p1)p2++pn+|ΣO|1k=1n+|ΣO|2(1pk)\displaystyle p_{1}+(1-p_{1})p_{2}+\cdots+p_{n+|\Sigma^{O}|-1}\prod_{k=1}^{n+|\Sigma^{O}|-2}(1-p_{k})
=k=1n+|ΣO|1pki=0k1(1pi)\displaystyle=\sum_{k=1}^{n+|\Sigma^{O}|-1}p_{k}\prod_{i=0}^{k-1}(1-p_{i})

where p0=0p_{0}=0. Each term in the summation on the right hand side is less than or equal to pkp_{k}, so

k=1n+|ΣO|1pki=0k1(1pi)k=1n+|ΣO|1pk.\sum_{k=1}^{n+|\Sigma^{O}|-1}p_{k}\prod_{i=0}^{k-1}(1-p_{i})\leq\sum_{k=1}^{n+|\Sigma^{O}|-1}p_{k}.

Furthermore, we know that ϵk\epsilon_{k} monotonically decreases as kk increases because |CLkΣO||C_{L}^{k}\cap\Sigma^{O}| monotonically increases. Considering the progress transformation from Theorem 2, if |CLkΣO||C_{L}^{k}\cap\Sigma^{O}| monotonically increases, then the cardinality of the set {Lck|cCLk}\{{L^{\prime}}_{c}^{k}|\forall c\in C_{L}^{k}\} monotonically increases. If any sequence in any of the intersections LcLckL_{c}\cap{L^{\prime}}_{c}^{k} is misclassified, then the feedback from the counterexample will move the misclassified sequence to the correct class, which means the error decreases monotonically. Specifically, by Theorem 2, the following events can occur: (1) an existing state (or a set of existing equivalently valued states) obtains its correct value; (2) at least one new state is created, increasing the size of one of the Lck{L^{\prime}}_{c}^{k}; or (3) the cardinality |CLkΣO||C_{L}^{k}\cap\Sigma^{O}| increases and an additional Lck{L^{\prime}}_{c}^{k} set is created (corresponding to at least one new state and one new value). Other non-counterexample sequences which were previously misclassified either stay misclassified (perhaps with a different, but still incorrect, value), or they become classified correctly. A sequence which was already known to be classified correctly will not become misclassified. Thus, all three possibilities monotonically decrease the error. This means ϵjϵj+1\epsilon_{j}\geq\epsilon_{j+1}, so the terminal hypothesis will have an error of at most the error of the previous hypothesis. If we choose a desired error ϵ\epsilon, then the probability that the error of the terminal hypothesis is greater than ϵ\epsilon is at most

k=1n+|ΣO|1pkk=1n+|ΣO|1(1ϵ)mkk=1n+|ΣO|1eϵmk\sum_{k=1}^{n+|\Sigma^{O}|-1}p_{k}\leq\sum_{k=1}^{n+|\Sigma^{O}|-1}(1-\epsilon)^{m_{k}}\leq\sum_{k=1}^{n+|\Sigma^{O}|-1}e^{-\epsilon m_{k}}

since

P(i=1mk(f^k(Si)f(Si))=0|ϵkϵ)(1ϵ)mk,P(\sum_{i=1}^{m_{k}}(\hat{f}_{k}(S_{i})-f(S_{i}))=0|\epsilon_{k}\geq\epsilon)\leq(1-\epsilon)^{m_{k}},

and since 1+xex1+x\leq e^{x} for all real xx. If we take each term in the rightmost summation and require it to be at most d2k\frac{d}{2^{k}} for some chosen value of 0<d<10<d<1, then we have a lower bound on the number of samples for the kkth equivalence query for k=1,,(n+|ΣO|1)k=1,\cdots,(n+|\Sigma^{O}|-1)

mk1ϵ(ln1d+kln2)m_{k}\geq\frac{1}{\epsilon}(\ln\frac{1}{d}+k\ln 2)

and

k=1n+|ΣO|1eϵmkk=1n+|ΣO|1d2kk=1d2kd\sum_{k=1}^{n+|\Sigma^{O}|-1}e^{-\epsilon m_{k}}\leq\sum_{k=1}^{n+|\Sigma^{O}|-1}\frac{d}{2^{k}}\leq\sum_{k=1}^{\infty}\frac{d}{2^{k}}\leq d

which shows that the probability that Remap terminates with a hypothesis with error at least ϵ\epsilon is at most dd.∎
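The sample schedule can be computed directly from the bound (a sketch; the function name is hypothetical). The assertion checks the geometric-series step keϵmkkd/2kd\sum_{k}e^{-\epsilon m_{k}}\leq\sum_{k}d/2^{k}\leq d numerically:

```python
from math import ceil, exp, log

def sample_schedule(eps, d, n, sigma_out):
    """m_k >= ceil((ln(1/d) + k ln 2) / eps), k = 1, ..., n + |Sigma_O| - 1."""
    K = n + sigma_out - 1
    return [ceil((log(1 / d) + k * log(2)) / eps) for k in range(1, K + 1)]

eps, d = 0.05, 0.1
ms = sample_schedule(eps, d, n=4, sigma_out=3)
assert sum(exp(-eps * m) for m in ms) <= d           # total failure prob below d
assert all(m1 <= m2 for m1, m2 in zip(ms, ms[1:]))   # later queries need more samples
```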

Theorem 0.A.5

To achieve PAC-identification under Remap, if the teacher 𝒯\mathcal{T} has chosen parameters ϵ\epsilon and dd, and if ff can be represented by a minimal Moore machine with nn states and |ΣO||\Sigma^{O}| classes, then 𝒯\mathcal{T} needs to sample at least

𝒪(n+|ΣO|+1ϵ((n+|ΣO|)ln1d+(n+|ΣO|)2))\mathcal{O}(n+|\Sigma^{O}|+\frac{1}{\epsilon}((n+|\Sigma^{O}|)\ln\frac{1}{d}+(n+|\Sigma^{O}|)^{2}))

sequences i.i.d. from 𝒟\mathcal{D} over the entire run of Remap.

Proof

Since for the kkth equivalence query, the teacher must sample at least mk1ϵ(ln1d+kln2)m_{k}\geq\left\lceil\frac{1}{\epsilon}(\ln\frac{1}{d}+k\ln 2)\right\rceil sequences in order to achieve PAC-identification, if the total number of samples is to be minimized while still achieving PAC-identification, then the teacher can just sample a total of

k=1n+|ΣO|1[1ϵ(ln1d+kln2)+1]\displaystyle\sum_{k=1}^{n+|\Sigma^{O}|-1}\left[\frac{1}{\epsilon}(\ln\frac{1}{d}+k\ln 2)+1\right]
=\displaystyle= n+|ΣO|1\displaystyle n+|\Sigma^{O}|-1
+1ϵ[(ln1d)(n+|ΣO|1)+ln2k=1n+|ΣO|1k]\displaystyle+\frac{1}{\epsilon}\left[(\ln\frac{1}{d})(n+|\Sigma^{O}|-1)+\ln 2\sum_{k=1}^{n+|\Sigma^{O}|-1}k\right]
=\displaystyle= 𝒪((n+|ΣO|)+1ϵ((n+|ΣO|)ln1d+(n+|ΣO|)2))\mathcal{O}\left((n+|\Sigma^{O}|)+\frac{1}{\epsilon}((n+|\Sigma^{O}|)\ln\frac{1}{d}+(n+|\Sigma^{O}|)^{2})\right)

sequences.∎

0.A.7 Miscellaneous Experiments

In addition to learning reward machines, we also evaluated Remap on sequence classification problems where the classification model can be represented as a Moore machine. Since reward machines are Mealy machines, and Moore machines and Mealy machines are equivalent, each of these Moore machines can be converted into a reward machine. We specified the reference Moore machines using regular expressions. The regular expressions we used to construct the Moore machines were: aba^{*}b represents sequences of zero or more aa's followed by a single bb; bab^{*}a represents sequences of zero or more bb's followed by a single aa; (a|b)(a|b)^{*} represents sequences with zero or more elements, where each element is either aa or bb; (ab)(ab)^{*} represents sequences of even (possibly zero) length, where every aa is immediately followed by a bb, and non-zero length sequences start with an aa; (ba)(ba)^{*} represents sequences of even (possibly zero) length, where every bb is immediately followed by an aa, and non-zero length sequences start with a bb.

These components were OR’ed together. If MiM_{i} represents a regular expression component, and

M=|Mii=1N=M1|M2||MNM=\displaystyle\left|{}_{i=1}^{N}M_{i}\right.=M_{1}|M_{2}|\cdots|M_{N}

represents NN distinct regular expression components OR’ed together, then the classifier ff classifies a sequence ss in the following manner: f(s)=kf(s)=k if sMks\in M_{k}, and is 0 otherwise. If ss happens to be contained in multiple Mk1M_{k_{1}},Mk2,M_{k_{2}},..., then the lowest matching kk value prevails. This structure implies the size of the output alphabet is 1+N1+N.
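The lowest-match-wins classifier described above can be sketched directly with Python's re module; the component list here is one illustrative combination from the experiments, and full-sequence matching is done with fullmatch:

```python
import re

# Components M_1, ..., M_N in priority order; f(s) = lowest matching k, else 0.
components = [r"a*b", r"b*a", r"(a|b)*", r"(ab)*"]

def classify(s):
    for k, pattern in enumerate(components, start=1):
        if re.fullmatch(pattern, s):  # entire sequence must match M_k
            return k
    return 0  # no component matches

assert classify("aaab") == 1   # a*b
assert classify("ba") == 2     # b*a, not a*b
assert classify("abab") == 3   # (a|b)* matches with lower k than (ab)*
assert classify("c") == 0      # outside the alphabet: class 0
```

Note that because (a|b)(a|b)^{*} contains (ab)(ab)^{*}, the lowest-match rule makes class 4 unreachable in this particular combination, consistent with the rule that the lowest matching kk prevails.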

We measured accuracy, number of preference queries, number of equivalence queries, and number of unique sequences as functions of the number of samples per equivalence query (Figures 13 and 14), and as functions of alphabet size (Figure 15). Finally, we also created termination phase diagrams (Figure 16).

[Figure 13 is a grid of violin plots: one row per regex ($(a^{*}b)$; $(a^{*}b)|(b^{*}a)$; $(a^{*}b)|(b^{*}a)|(a|b)^{*}$; $(a^{*}b)|(b^{*}a)|(ab)^{*}$; $(a^{*}b)|(b^{*}a)|(ab)^{*}|(ba)^{*}$), with columns for Accuracy, Number of EQs, Number of PQs, and Number of Sequences.]
Figure 13: Violin plots of terminal hypothesis accuracy, number of equivalence queries executed, number of preference queries executed, and number of unique sequences tested, as a function of the maximum number $N$ of random tests performed per equivalence query, for learning a variety of regexes over a two-letter alphabet. During an equivalence query, the teacher samples $N$ random sequences, with the length of each sequence drawn i.i.d. from a geometric distribution with termination probability $0.2$. Given a sequence length, each element of the sequence is chosen i.i.d. from a uniform distribution over the alphabet. To compute the terminal hypothesis accuracy, the terminal hypothesis is tested on $200$ random sequences per regex class. A total of $2000$ trials were conducted per $N$ value; the plots show the resulting empirical distributions of those trials.
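The teacher's sampling procedure described in the caption can be sketched as follows; this is our own illustrative rendering (names and structure are ours), with sequence lengths drawn from a geometric distribution with termination probability $p=0.2$ and elements drawn uniformly from the alphabet:

```python
import random

def sample_sequence(alphabet=("a", "b"), p=0.2, rng=random):
    """Draw one test sequence: geometric length (P(len = k) = (1-p)^k * p),
    then each element chosen uniformly at random from the alphabet."""
    length = 0
    while rng.random() >= p:  # count failures before the terminating success
        length += 1
    return "".join(rng.choice(alphabet) for _ in range(length))

def sample_eq_tests(n_samples=200, alphabet=("a", "b"), p=0.2):
    """Draw the full test set for one sampling-based equivalence query."""
    return [sample_sequence(alphabet, p) for _ in range(n_samples)]

tests = sample_eq_tests(n_samples=5)
print(tests)  # five random sequences over {a, b}, possibly empty
```

With $p=0.2$ the expected sequence length is $(1-p)/p = 4$, so most sampled tests are short; this is one reason sampling-based equivalence queries yield PAC-style rather than exact guarantees.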
Figure 14: Plots of the same data as in Figure 13. Mean values are represented by thick lines, median values by thin lines, and values lying between the $20^{th}$ and $80^{th}$ percentiles are shaded.
Figure 15: Plots of accuracy, number of unique sequences tested, number of equivalence queries performed, and number of preference queries made, as a function of alphabet size, with $200$ samples per equivalence query. Mean values are represented by thick lines, median values by thin lines, and values lying between the $20^{th}$ and $80^{th}$ percentiles are shaded.
Figure 16: Termination plots illustrating that the number of states and the number of known variables increase monotonically. Blue circles represent closure operations, orange squares represent consistency operations, and green Xs represent equivalence queries. The number of states in the automata, from left to right, is 3, 5, 8, 11, and 14. We observe that termination requires the hypothesized number of states to be correct. The number of explicitly known values need not match the number of classes for termination, however, because the learner can simply "guess" the correct values and terminate early.