
Learning Phonotactics from Linguistic Informants

Canaan Breiss*^a  Alexis Ross*^b
Amani Maina-Kilaas^b  Roger Levy^b  Jacob Andreas^b
^a University of Southern California  ^b Massachusetts Institute of Technology
cbreiss@usc.edu  {alexisro, amanirmk, rplevy, jda}@mit.edu
* Both authors contributed equally to this work.
Abstract

We propose an interactive approach to language learning that utilizes linguistic acceptability judgments from an informant (a competent language user) to learn a grammar. Given a grammar formalism and a framework for synthesizing data, our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies, asks the informant for a binary judgment, and updates its own parameters in preparation for the next query. We demonstrate the effectiveness of our model in the domain of phonotactics, the rules governing what kinds of sound-sequences are acceptable in a language, and carry out two experiments, one with typologically-natural linguistic data and another with a range of procedurally-generated languages. We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, and sometimes greater than, fully supervised approaches.

1 Introduction

In recent years, natural language processing has made remarkable progress toward models that can (explicitly or implicitly) predict and use representations of linguistic structure from phonetics to syntax (Mohamed et al., 2022; Hewitt and Manning, 2019). These models play a prominent role in contemporary computational linguistics research. But the data required to train them is of a vastly larger scale, and features less controlled coverage of important phenomena, than data gathered in the course of linguistic research, e.g. during language documentation with native speaker informants. How can we build computational models that learn more like linguists—from targeted inquiry rather than large-scale corpus data?

We describe a paradigm in which language-learning agents interactively select examples to learn from by querying an informant, with the goal of learning about a language as data-efficiently as possible, rather than relying on large-scale corpora to capture attested-but-rare phenomena. This approach has two important features. First, rather than relying on existing data to learn, our model performs data synthesis to explore the space of useful possible data-points. But second, our model can also leverage corpus data as part of its learning procedure by trading off between interactive elicitation and ordinary supervised learning, making it useful both ab initio and in scenarios where seed data is available to bootstrap a full grammar.

Figure 1: Overview of our approach. Instead of learning a model from a static set of well-formed word forms (left), we interactively elicit acceptability judgments from a knowledgeable language user (right), using ideas from active learning and optimal experiment design. On a family of phonotactic grammar learning problems, active example selection is sometimes more sample-efficient than supervised learning or elicitation of judgments about random word forms.

We evaluate the capabilities of our methods in two experiments on learning phonotactic grammars, in which the goal is to learn the constraints on sequences of permissible sounds in the words of a language. Applied to the problem of learning a vowel harmony system inspired by natural language typology, we show that our approach succeeds in recovering the generalizations governing the distribution of vowels. Using an additional set of procedurally-generated synthetic languages, our approach also succeeds in recovering relevant phonotactic generalizations, demonstrating that model performance is robust to whether the target pattern is typologically common or not. We find that our approach is more sample-efficient than ordinary supervised learning or random queries to the informant.

Our methods have the potential to be deployed as an aid to learners acquiring a second language or to linguists doing elicitation work with speakers of a language that has not previously been documented. Further, the development of more data-efficient computational models can help redress social inequalities which flow from the asymmetrical distribution of training data types available for present models (Bender et al., 2021).

2 Problem Formulation and Method

Preliminaries

We aim to learn a language $L$ comprising a set of strings $x$, each of which is a concatenation of symbols from some inventory $\Sigma$ (so $L \subseteq \Sigma^+$). (In phonotactics, for example, $\Sigma$ might be the set of phonemes, and $L$ the set of word forms that speakers judge phonotactically acceptable.) A learned model of a language is a discriminative function that maps from elements $x \in \Sigma^+$ to values in $\{0,1\}$, where 1 indicates that $x \in L$ and 0 indicates that $x \notin L$. In this paper, we will generalize this to graded models of language membership $f: \Sigma^+ \to [0,1]$, in which higher values assigned to strings $x \in \Sigma^+$ correspond to greater confidence that $x \in L$ (cf. Albright, 2009, for data and argumentation in favor of a gradient model of phonotactic acceptability in humans).

We may then characterize the language learning problem as one of acquiring a collection of pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, where $x_i \in \Sigma^+$ and $y_i \in \{0,1\}$ corresponds to an acceptability judgment about whether $x_i \in L$. Given this data, a learner's job is to identify a language consistent with these pairs. Importantly, in this setting, learners may have access to both positive and negative evidence.

Approach

In our problem characterization, the data acquisition process takes place over a series of time steps. At each time step $t$, the learner uses a policy $\pi$ to select a new string $x_t \in \mathcal{X}$; here $\mathcal{X}$ is some set of possible strings, with $L \subset \mathcal{X} \subset \Sigma^+$. The chosen string is then passed to an informant that provides the learner a value $y_t \in \{0,1\}$ indicating whether $x_t$ is in $L$. The new datum $(x_t, y_t)$ is then appended to a running collection of (string, judgment) pairs $(\underline{x}, \underline{y})$, after which the learning process proceeds to the next time step. This procedure is summarized in Algorithm 1.

Conceptually, there are two ways in which a learner might gather information about a new language. One possibility is to gather examples of well-formed strings already produced by users of the language (e.g., by listening to a conversation, or collecting a text corpus), similar to an "immersion" approach to learning a new language. In this case, the learner does not have control over the specific selected string $x_t$, but it is guaranteed that the selected string is part of the language: $x_t \in L$ and thus $y_t = 1$.

The other way of collecting information is to select some string $x_t$ from $\mathcal{X}$ and directly elicit a judgment $y_t$ from a knowledgeable informant. This approach is often pursued by linguists working with language informants in a documentation setting, where queries stem from hypotheses about the structural principles of the language. Here, examples can be chosen to be maximally informative, and negative evidence can be gathered directly. In practice, learners might also use "hybrid" policies that determine which of multiple basic policies (passive observation, active inquiry) is expected to yield a new datum that optimally improves the learner's knowledge state. Each of these strategies is described in more detail below.

Input: policy $\pi$, total timesteps $T$
$(\underline{x}, \underline{y}) \leftarrow [\;]$; $t \leftarrow 0$
while $t < T$ do
    $x_t \leftarrow \pi(x \mid \underline{x}, \underline{y})$
    $y_t \leftarrow \text{informant}(x_t)$
    append $(x_t, y_t)$ to $(\underline{x}, \underline{y})$
    $t \leftarrow t + 1$
end while
Algorithm 1: Iterative Query Procedure
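
For concreteness, a minimal Python sketch of this loop; the `policy`, `informant`, and `update_model` callables are generic stand-ins for the components described above, not part of our released implementation:

from typing import Callable, List, Tuple

def iterative_query(
    policy: Callable[[List[str], List[int]], str],          # selects the next query x_t given data so far
    informant: Callable[[str], int],                        # returns a binary acceptability judgment y_t
    update_model: Callable[[List[str], List[int]], None],   # refits the learner on the running collection
    total_steps: int,
) -> Tuple[List[str], List[int]]:
    """Run the iterative query procedure of Algorithm 1."""
    xs: List[str] = []
    ys: List[int] = []
    for _ in range(total_steps):
        x_t = policy(xs, ys)      # choose a query given the data gathered so far
        y_t = informant(x_t)      # elicit a binary acceptability judgment
        xs.append(x_t)
        ys.append(y_t)
        update_model(xs, ys)      # update parameters in preparation for the next query
    return xs, ys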

Model assumptions

To characterize the learning policies, we make the following assumptions regarding the model trained from available data $(\underline{x}, \underline{y})$. We assume that the function $f: \Sigma^+ \to [0,1]$ acquired from $(\underline{x}, \underline{y})$ can be interpreted as a conditional probability of the form $p(y \mid x, \underline{x}, \underline{y})$. We further assume that this conditional probability is determined by a set of parameters $\bm{\theta}$ for which a(n approximate) posterior distribution $P(\bm{\theta} \mid \underline{x}, \underline{y})$ is maintained, with $p(y \mid x, \underline{x}, \underline{y}) = \int_{\bm{\theta}} P(y \mid x, \bm{\theta}) \, P(\bm{\theta} \mid \underline{x}, \underline{y}) \, d\bm{\theta}$.

3 Query policies

In the framework described in Section 2, how should a learner choose which questions to ask the informant? Below, we describe a family of different policies for learning.

3.1 Basic policies

Train

The first basic policy, $\pi_{\text{train}}(x \mid \underline{x}, \underline{y})$, corresponds to observing and recording an utterance by a speaker. For simplicity we model this as uniform sampling (without replacement) over $L$:

$$\pi_{\text{train}}(x \mid \underline{x}, \underline{y}) \sim U(\{x \in L - \underline{x}\}).$$

Uniform

The second basic policy, $\pi_{\text{unif}}(x \mid \underline{x}, \underline{y})$, samples a string uniformly from $\mathcal{X}$ and presents it to the informant for an acceptability judgment:

$$\pi_{\text{unif}}(x \mid \underline{x}, \underline{y}) \sim U(\{x \in \mathcal{X}\}).$$

Label Entropy

The $\pi_{\text{label-ent}}(x \mid \underline{x}, \underline{y})$ policy selects the string $x^*$ with the maximum entropy $\mathcal{H}$ over labels $y$ under the current model state:

$$x^* = \operatorname*{arg\,max}_{x \in \mathcal{X}} \mathcal{H}(y \mid x, \underline{x}, \underline{y}).$$
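
As an illustration, a minimal Python sketch of this rule; the `predict_prob` helper, returning $p(y{=}1 \mid x, \underline{x}, \underline{y})$ under the current model, is an assumed interface rather than part of our codebase:

import math
from typing import Callable, List, Sequence

def select_label_entropy_query(
    candidates: Sequence[str],
    xs: List[str],
    ys: List[int],
    predict_prob: Callable[[str, List[str], List[int]], float],  # p(y=1 | x, data)
) -> str:
    """pi_label-ent: pick the candidate whose predicted label is most uncertain."""
    def label_entropy(x: str) -> float:
        p = predict_prob(x, xs, ys)
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)
    return max(candidates, key=label_entropy)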

Expected Information Gain

The $\pi_{\text{eig}}(x \mid \underline{x}, \underline{y})$ policy selects the candidate that, if observed, would yield the greatest expected reduction in entropy over the posterior distribution of the model parameters $\bm{\theta}$. This quantity is often called the information gain (MacKay, 1992); we denote the change in entropy as $V_{\text{IG}}(x, y; \underline{x}, \underline{y})$:

$$V_{\text{IG}}(x, y; \underline{x}, \underline{y}) = \mathcal{H}(\bm{\theta} \mid \underline{x}, \underline{y}) - \mathcal{H}(\bm{\theta} \mid x, y, \underline{x}, \underline{y}). \quad (1)$$

The expected information gain policy selects the $x^*$ that maximizes $\mathbb{E}_{y \in \{0,1\}} V_{\text{IG}}(x, y; \underline{x}, \underline{y})$, i.e.,

$$x^* = \operatorname*{arg\,max}_{x \in \mathcal{X}} \; V_{\text{IG}}(x, y{=}1; \underline{x}, \underline{y}) \cdot p(y{=}1 \mid x, \underline{x}, \underline{y}) + V_{\text{IG}}(x, y{=}0; \underline{x}, \underline{y}) \cdot p(y{=}0 \mid x, \underline{x}, \underline{y}),$$
$$\pi_{\text{eig}}(x \mid \underline{x}, \underline{y}) = \delta(x^*),$$

where $\delta(x)$ denotes the probability distribution that places all of its mass on $x$.
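
To make the computation concrete, the following Python sketch scores candidates by expected information gain; the `posterior_entropy` and `predict_prob` helpers are hypothetical stand-ins for a model that can report $\mathcal{H}(\bm{\theta} \mid \cdot)$ after refitting on augmented data and $p(y{=}1 \mid x, \underline{x}, \underline{y})$:

from typing import Callable, List, Sequence

def expected_information_gain(
    x: str,
    xs: List[str],
    ys: List[int],
    posterior_entropy: Callable[[List[str], List[int]], float],  # H(theta | data)
    predict_prob: Callable[[str, List[str], List[int]], float],  # p(y=1 | x, data)
) -> float:
    """E_y[ V_IG(x, y; data) ] for a single candidate query x."""
    h_current = posterior_entropy(xs, ys)
    p1 = predict_prob(x, xs, ys)
    ig_if_accept = h_current - posterior_entropy(xs + [x], ys + [1])
    ig_if_reject = h_current - posterior_entropy(xs + [x], ys + [0])
    return p1 * ig_if_accept + (1.0 - p1) * ig_if_reject

def select_eig_query(candidates: Sequence[str], xs, ys, posterior_entropy, predict_prob) -> str:
    """pi_eig: choose the candidate with the highest expected information gain."""
    return max(
        candidates,
        key=lambda x: expected_information_gain(x, xs, ys, posterior_entropy, predict_prob),
    )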

3.2 Hybrid Policies

Hybrid policies dynamically choose at each time step among a set of basic policies $\Pi$ based on some metric $V$. At each step, the hybrid policy estimates the expected value of $V$ for each basic policy $\pi \in \Pi$, chooses the policy $\pi^*$ with the highest expected value, and then samples $x \in \Sigma^+$ according to $\pi^*$. Here, we study one such policy: $\Pi = [\pi_{\text{train}}, \pi_{\text{eig}}]$, with metric $V = V_{\text{IG}}$. We refer to the non-train policy as $\hat{\pi}$ and to the metric used to select $\pi^* \in [\hat{\pi}, \pi_{\text{train}}]$ at each step as $V$.

We explore two general methods for estimating the expected value of $V$ for each basic policy $\pi \in \Pi$: history-based and model-based. We also explore a mixed approach that uses a history-based method for $\pi_{\text{train}}$ and a model-based method for $\hat{\pi}$.

History

In the history-based approach, the model keeps a running average of the empirical values of $V$ for candidates previously selected by $\pi_{\text{train}}$ and $\hat{\pi}$.

For instance, for the history-based hybrid policy $\pi_{\text{eig-history}}(x \mid \underline{x}, \underline{y})$, $V = V_{\text{IG}}$ (see Table 1(b)). Suppose that at a particular step, the basic policy $\pi^*$ selected by $\pi_{\text{eig-history}}$ chose query $x$, which received label $y$ from the informant. The history-based $\pi_{\text{eig-history}}$ would then store the empirical information gain between $p(\bm{\theta} \mid \underline{x}, \underline{y})$ and $p(\bm{\theta} \mid x, y, \underline{x}, \underline{y})$ for the chosen $\pi^*$; in future steps, it would select the $\pi^* \in [\pi_{\text{train}}, \hat{\pi}]$ with the highest empirical mean of $V$ (in this case, the empirical mean information gain) over candidates queried by each basic policy.

More formally, let $S^{\text{EMP}}(\pi; \underline{x}, \underline{y})$ refer to the mean of the observed values of $V$ for candidates $x_i$ selected by $\pi$ before step $t$, where $\pi \in [\pi_{\text{train}}, \hat{\pi}]$:

$$S^{\text{EMP}}(\pi; \underline{x}, \underline{y}) = \frac{\sum_{i \in I_\pi} V(x_i, y_i; \underline{x}_{<i})}{|I_\pi|}, \quad \text{where } I_\pi = \{i \mid x_i \text{ was selected by } \pi,\ i < t\}.$$

$V(x_i, y_i; \underline{x}_{<i})$ denotes $V$'s score for the $i$-th data-point $x_i$ selected by $\pi$, under a model that has observed data $\underline{x}_{<i}, \underline{y}_{<i}$.

Then at step $t$, the history-based hybrid policies select $\pi^*$ according to:

$$\pi^* = \operatorname*{arg\,max}_{\pi \in [\hat{\pi}, \pi_{\text{train}}]} S^{\text{EMP}}(\pi; \underline{x}, \underline{y}).$$

For $t < 2$, we automatically select $\pi_{\text{train}}$ and $\hat{\pi}$ once each, in a random order, to ensure that we have empirical means for both policies.
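
A minimal sketch of this selection rule, with the bookkeeping of per-policy observed $V$ values held in a hypothetical `empirical_scores` dictionary:

import random
from statistics import mean
from typing import Dict, List

def select_policy_history(empirical_scores: Dict[str, List[float]], step: int) -> str:
    """Choose between the basic policies by the highest empirical mean of V.

    For t < 2, each policy is tried once (in a random order) so that both have
    an empirical mean before the argmax comparison is used.
    """
    policies = ("train", "eig")
    if step < 2:
        untried = [p for p in policies if not empirical_scores.get(p)]
        if untried:
            return random.choice(untried)
    return max(policies, key=lambda p: mean(empirical_scores[p]))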

Model

The model-based approach is prospective and involves using the current posterior distribution over model parameters to compute an expected value for the target metric under the policy. We define two ways of computing these expectations.

$S^{\text{EXP}(y)}$ computes an expectation over possible labels $y$ for the candidate $x^*$ that will be chosen by policy $\pi$. We use $S^{\text{EXP}(y)}$ to score non-train basic policies $\hat{\pi}$ because they select $x^*$ deterministically given $\mathcal{X}$, i.e., they select the inputs that maximize the objectives described in §3.1. More formally:

$$S^{\text{EXP}(y)}(\hat{\pi}; \underline{x}, \underline{y}) = \mathbb{E}_{y \in \{0,1\}} \, V(x^*, y; \underline{x}, \underline{y}), \quad x^* \sim \hat{\pi}.$$

$S^{\text{EXP}(x)}$ computes an expectation over possible inputs $x \in L$ and assumes a fixed label ($y = 1$). We score the train basic policy $\pi_{\text{train}}$ with $S^{\text{EXP}(x)}$ because the randomness for $\pi_{\text{train}}$ is over forms in the lexicon that could be sampled by $\pi_{\text{train}}$, and labels are always 1. More formally:

$$S^{\text{EXP}(x)}(\pi_{\text{train}}; \underline{x}, \underline{y}) = \mathbb{E}_{x \in L} \, V(x, y{=}1; \underline{x}, \underline{y}).$$

In practice, however, we approximate this expectation with samples from $\mathcal{X}$, since we do not assume that the model has access to the lexicon used by the informant. In particular, we model the probability that a form $x$ is in the lexicon as $p(y{=}1 \mid x; \underline{x}, \underline{y})$.

Using the policy-specific expectations defined above, the model-based approach selects the policy $\pi^*$ according to:

$$\pi^* = \operatorname*{arg\,max}_{\pi \in [\hat{\pi}, \pi_{\text{train}}]} S(\pi; \underline{x}, \underline{y}).$$
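
A minimal sketch of the model-based scores for the $\Pi = [\pi_{\text{train}}, \pi_{\text{eig}}]$ hybrid studied here; the `value` and `prob` callables (for $V_{\text{IG}}$ and $p(y{=}1 \mid x, \underline{x}, \underline{y})$) are assumed interfaces, and the normalized weighting used to approximate the expectation over lexical forms is our reading of the sample-based approximation described above:

from typing import Callable, List, Sequence

# V(x, y; data): the information-gain metric V_IG, supplied by the caller.
ValueFn = Callable[[str, int, List[str], List[int]], float]
# p(y=1 | x, data): the model's predicted probability that x is acceptable.
ProbFn = Callable[[str, List[str], List[int]], float]

def score_eig_policy(candidates: Sequence[str], xs: List[str], ys: List[int],
                     value: ValueFn, prob: ProbFn) -> float:
    """S^EXP(y): expected value of V for the candidate that pi_eig would select."""
    def expected_v(x: str) -> float:
        p1 = prob(x, xs, ys)
        return p1 * value(x, 1, xs, ys) + (1.0 - p1) * value(x, 0, xs, ys)
    return max(expected_v(x) for x in candidates)

def score_train_policy(candidates: Sequence[str], xs: List[str], ys: List[int],
                       value: ValueFn, prob: ProbFn) -> float:
    """S^EXP(x): expectation over forms with fixed label y=1, approximated by weighting
    each sampled candidate by the model's probability that it belongs to the lexicon."""
    weights = [prob(x, xs, ys) for x in candidates]
    values = [value(x, 1, xs, ys) for x in candidates]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total if total > 0 else 0.0

def select_policy_model(candidates: Sequence[str], xs: List[str], ys: List[int],
                        value: ValueFn, prob: ProbFn) -> str:
    """Model-based hybrid: choose the basic policy with the higher prospective score."""
    s_hat = score_eig_policy(candidates, xs, ys, value, prob)
    s_train = score_train_policy(candidates, xs, ys, value, prob)
    return "eig" if s_hat >= s_train else "train"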

Mixed

Finally, the mixed policies combine the retrospective evaluation of the history-based method and the prospective evaluation of the model-based method. In particular, we use the model-based approach for the non-train policy $\hat{\pi}$ (i.e., scoring with $S^{\text{EXP}(y)}$) and the history-based approach for the train policy $\pi_{\text{train}}$ (i.e., scoring with $S^{\text{EMP}}$):

$$S(\hat{\pi}; \underline{x}, \underline{y}) = S^{\text{EXP}(y)}(\hat{\pi}; \underline{x}, \underline{y}),$$
$$S(\pi_{\text{train}}; \underline{x}, \underline{y}) = S^{\text{EMP}}(\pi_{\text{train}}; \underline{x}, \underline{y}),$$
$$\pi^* = \operatorname*{arg\,max}_{\pi \in [\hat{\pi}, \pi_{\text{train}}]} S(\pi; \underline{x}, \underline{y}).$$

For $t = 0$, we always select $\pi_{\text{train}}$ to ensure that we have an empirical mean for $\pi_{\text{train}}$. Table 1 provides an overview of the query policies described in the preceding sections.

Basic Policy               Quantity Maximized
$\pi_{\text{train}}$       —
$\pi_{\text{unif}}$        —
$\pi_{\text{label-ent}}$   Label entropy
$\pi_{\text{eig}}$         Expected info gain
(a) Basic policies (§3.1).

Hybrid Policy                Basic Policy Choices $\Pi$                 Selection Method   Metric $V$                           $\pi_{\text{train}}$ score   Non-train score
$\pi_{\text{eig-history}}$   $\pi_{\text{train}}$, $\pi_{\text{eig}}$   History            Info gain ($V_{\text{IG}}$, Eq. 1)   $S^{\text{EMP}}$             $S^{\text{EMP}}$
$\pi_{\text{eig-model}}$     $\pi_{\text{train}}$, $\pi_{\text{eig}}$   Model              Info gain ($V_{\text{IG}}$, Eq. 1)   $S^{\text{EXP}(x)}$          $S^{\text{EXP}(y)}$
$\pi_{\text{eig-mixed}}$     $\pi_{\text{train}}$, $\pi_{\text{eig}}$   Mixed              Info gain ($V_{\text{IG}}$, Eq. 1)   $S^{\text{EMP}}$             $S^{\text{EXP}(y)}$
(b) Hybrid policies (§3.2).

Table 1: Summary of query policies (§3). $S^{\text{EMP}}$ refers to the empirical mean. $S^{\text{EXP}(y)}$ and $S^{\text{EXP}(x)}$ refer to the expectation metrics for the non-train ($\hat{\pi}$) and train ($\pi_{\text{train}}$) strategies, respectively. Basic policies select inputs to query the informant. Hybrid policies choose between a set of basic policies $\Pi$ by scoring them with a metric $V$ and one of the scoring functions.

4 A Grammatical Model for Phonotactics

We implement and test our approach for a simple categorical model of phonotactics. The grammar consists of two components. First, a finite set of phonological feature functions $\{\phi_i\}: \Sigma^+ \to \{0,1\}$; if $\phi_i(x) = 1$ we say that feature $i$ is active for string $x$. This set is taken to be universal and available to the learner before any data are observed. Second, a set of binary values $\bm{\theta} = \{\theta_i\}$, one for each feature function; if $\theta_i = 1$ then feature $i$ is penalized. In our simple categorical model, a string is grammatical if and only if no feature active for it is penalized. $\bm{\theta}$ thus determines the language: $L = \{x : \sum_i \theta_i \phi_i(x) = 0\}$. We assume a factorizable prior distribution over which features are penalized: $p(\bm{\theta}) = \prod_{\theta_j \in \bm{\theta}} p(\theta_j)$. To enable mathematical tractability, we also incorporate a noise term $\alpha$, which causes the learner to perceive a judgment from the informant as noisy (reversed) with probability $1 - \alpha$.
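
A minimal sketch of this categorical model's label probability, with feature functions represented as sets of active feature indices (the function names are illustrative, not those of our implementation):

from typing import Sequence, Set

def is_grammatical(active_features: Set[int], theta: Sequence[int]) -> bool:
    """A string is grammatical iff none of its active features is penalized (theta_i = 1)."""
    return all(theta[i] == 0 for i in active_features)

def label_probability(y: int, active_features: Set[int], theta: Sequence[int], alpha: float) -> float:
    """p(y | x, theta): the informant's true judgment, perceived as reversed with probability 1 - alpha."""
    true_label = 1 if is_grammatical(active_features, theta) else 0
    return alpha if y == true_label else 1.0 - alpha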

This model is based on a decades-long research tradition in theoretical and experimental phonology into what determines the range and frequency of possible word forms in a language. A consensus view of the topic is that speakers have fine-grained judgments about the acceptability of nonwords (for example, most speakers judge blick to be more acceptable than bnick; Chomsky and Halle, 1968), and that this knowledge can be decomposed into the independent, additive effects of multiple prohibitions on specific sequences of sounds (in phonological theory, termed Markedness constraints). Further, speakers form these generalizations at the level of the phonological feature, since they exhibit structured judgments that distinguish between different unattested forms: speakers systematically rate bnick as less English-like than bzick, despite no attested words having initial bn- or bz-. We reflect this knowledge in our generative model: to determine the distribution of licit strings in a language, we first sample some parameters which govern subsequences of features which are penalized in the language.

In our model we take $\{\phi_i\}$ to be a collection of phonological feature trigrams: ordered triples of phonological features with values that pick out some class of trigrams of segments in the language (see §5.1 for more details and examples). Since our phonotactics are variants on vowel harmony, these featural trigrams are henceforth assumed to be relativized to the vowel tier, regulating vowel qualities in three adjacent syllables. In order to capture generalizations that may hold differently in edge-adjacent vs. word-medial position, we pad the representation of each word treated by the model with a boundary symbol "#" (generally omitted in this paper for simplicity), which bears the [+ word boundary] feature that the trigram constraints can refer to (following the practice of Hayes and Wilson, 2008, inspired by Chomsky and Halle, 1968).

4.1 Implementation details

Our general approach and specific model create several computational tractability issues, which we address here. First, all policies aside from $\pi_{\text{train}}$ and $\pi_{\text{unif}}$ in principle require a search for an optimal string $x$ within $\mathcal{X}$. In practice, we take $\mathcal{X}$ to be the set of strings with 2-5 syllables. The resulting set is still very large, so we approximate the search over $\mathcal{X}$ by uniformly sampling a set of $k$ candidates and choosing the best according to $V$. We sample candidates by uniformly sampling a length, then uniformly sampling each syllable from the inventory of possible onset-vowel combinations in the language (with replacement). We then de-duplicate candidates and filter against $\underline{x}$, excluding previously observed sequences and any that were accidental duplicates of items in the test set.
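
A minimal sketch of this candidate-generation step; the syllable inventory and the exclusion set are assumed inputs, and the exact filtering in our implementation may differ in detail:

import random
from typing import List, Sequence, Set

def sample_candidates(
    syllables: Sequence[str],   # inventory of licit onset-vowel combinations, e.g. ["pi", "te", ...]
    k: int,
    seen: Set[str],             # previously queried forms and held-out test items to exclude
    min_syll: int = 2,
    max_syll: int = 5,
) -> List[str]:
    """Uniformly sample k candidate strings of 2-5 syllables, then de-duplicate and filter."""
    raw = []
    for _ in range(k):
        length = random.randint(min_syll, max_syll)
        raw.append("".join(random.choice(syllables) for _ in range(length)))
    # de-duplicate (preserving order) and drop previously observed or held-out forms
    return [w for w in dict.fromkeys(raw) if w not in seen]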

Second, although the model parameters $\bm{\theta}$ are independent in the prior, conditioning on data renders them conditionally dependent, and computing with the true posterior is in general intractable. To deal with this, we use mean-field variational Bayes to approximate the posterior as $p(\bm{\theta} \mid \underline{x}, \underline{y}) \approx \prod_{\theta_j \in \bm{\theta}} q(\theta_j \mid \underline{x}, \underline{y})$. We use this approximation both to estimate the model's posterior (used by $\pi_{\text{label-ent}}$ and $\pi_{\text{eig}}$) and to make predictions about individual new examples. See Appendix D for details.

5 Experiments

We now describe our experiments for evaluating the different query policies. We evaluate on two types of languages. We call the first the ATR Vowel Harmony language (§5.1); its grammar regulates the distribution of types of vowels and is inspired by patterns found in many languages of the world. The purpose of evaluating on this language is to assess how well our new approach, and specifically the various non-baseline query policies, works on naturalistic data. We also evaluate on a set of procedurally-generated languages (§5.2) that are matched in their statistics to ATR Vowel Harmony, i.e., they have the same number of penalized feature trigrams but differ in which ones are penalized. This second set of evaluations aims to determine how robust our model is to typologically-unusual languages, so we can be confident that any success in learning ATR Vowel Harmony is attributable to our procedure rather than to a coincidence of the typologically-natural vowel harmony pattern.

These experiments lead to three sets of analyses: in the first (§5.4), we both select hyperparameters and evaluate on procedurally-generated languages through k-fold cross validation. These results can be interpreted as an in-distribution analysis of the query policies. In the second set of results (§5.5), we evaluate the policies out-of-distribution by selecting hyperparameters on the procedurally-generated languages and evaluating on the ATR Vowel Harmony language. In the last analysis (§5.6), we evaluate the upper bound of policy performance by selecting hyperparameters and evaluating on the same language, ATR Vowel Harmony.

5.1 ATR Vowel Harmony

We created a model language whose words are governed by a small set of known phonological principles. Loosely inspired by harmony systems common among Niger-Congo and Nilo-Saharan languages spoken throughout Africa, the vowels in this language can be divided into two classes, defined with respect to the phonological feature Advanced Tongue Root (ATR); for typological data, see Casali (2003, 2008, 2016) and Rose (2018), among others. In this language, the vowels that are [+ATR] are {i, e}, and have pronunciations that are more peripheral in the vowel space; those that are [-ATR] are {ɪ, ɛ}, and are more phonetically centralized. For the sake of simplicity, we restrict the simulated language to front vowels only. A fifth vowel in the system, [a], is not specified for ATR. This language has consonants {p, t, k, q}, which are distributed freely with respect to one another and to vowels, with the exception that syllables must begin with exactly one consonant and must contain exactly one vowel, a typologically common restriction. Since consonants are not regulated by the grammar we are working with, the three binary features (leaving out [word boundary]) create a set of 512 possible feature trigrams, which characterize the space of all possible strings in the language. The syllable counts of words follow a Poisson distribution with $\lambda = 2$.

The single rule active in this language governs the distribution of vowels specified for the feature [ATR]: vowels in adjacent syllables must have the same [ATR] specification. This means that the vowel sequences in a word can be drawn entirely from the [+ATR] set (e.g., [i…e]) or entirely from the [-ATR] set, but may not mix the two classes across adjacent syllables. Since [a] is not specified for ATR, it creates boundaries that allow different ATR values to exist on either side of it: a pair of ATR-distinct vowels that would be prohibited in adjacent syllables is allowed when the unspecified [a] intervenes. This yielded sample licit words like [katipe], [tp], and [qekat], and illicit ones like [kkiqa], [ttaqik], and [qiqka].

Feature trigrams were composed of triples of the features and specifications shown in Appendix Table 3, any one of which picks out a certain set of vowel trigrams in adjacent syllables.

Data

We sampled 157 unique words as the lexicon $L$, and a set of 1,010 random words, roughly balanced for length, as a test set. The model was provided with the set of features in Appendix Table 3 and with the restrictions on syllable structure for use in the proposal distribution.

Informant

The informant was configured to reject any word that contained vowels in adjacent syllables that differed in ATR specification (like [pekit] or [qetatkipe]), and accept all others.

5.2 Procedurally-Generated Languages

We also experimented with languages that share the same feature space, and thus the same set of 512 feature trigrams, as ATR Vowel Harmony (§5.1), but were procedurally generated by sampling 16 of the 512 total feature trigrams to be "on" (i.e., penalized) and setting all others to be off, creating languages with different restrictions on licit vowel sequences in adjacent syllables.

Data

For each "language" (i.e., each set of sampled feature trigrams to be penalized), we carried out a procedure to sample the lexicon $L$ as well as evaluation datasets. For each set of 16 $\theta$ values representing penalized phonological feature trigrams, we created random strings as in Experiment 1, filtering them to ensure that the train and test sets are of equal size and that the test set is balanced for word length and composed of half acceptable and half unacceptable forms.

5.3 Experimental Set-Up

Hyperparameters

The model has several free parameters: a noise parameter $\alpha$ that represents the probability that an observed label is correct (versus noisy), and $\theta_{\text{prior}}$, the prior probability of a feature being on (penalized), i.e., $p_{\text{prior}}(\theta_j = 1)$. There are also hyperparameters governing the optimization of the model: we denote by $s$ the number of optimization steps in the variational update.[1] When $s = \infty$, we optimize until the magnitude of the change in $\bm{\theta}$ is less than or equal to an error threshold $\epsilon = 2e^{-7}$. We also experiment with $s = 1$, in which case we perform a single update.

[1] These optimization parameters govern both the model's learning and the evaluation of candidate queries for prospective strategies, i.e., $\pi_{\text{eig}}$ and the hybrid strategies.

We ran a grid search over the parameter space of $\log(\alpha) \in \{0.1, 0.25, 0.5, 1, 2, 4, 8\}$, $\theta_{\text{prior}} \in \{0.001, 0.025, 0.05, 0.1, 0.2, 0.35\}$, and $s \in \{1, \infty\}$. We ran 10 random seeds (9 for the procedurally-generated languages)[2] and all query policies in Table 1 for each hyperparameter setting. Each experiment was run for 150 steps.

[2] For the generated languages, the seed also governed the "language," i.e., which phonological feature trigrams were sampled as "on."

For non-train policies, we generated $k = 100$ candidates from $\mathcal{X}$.

Evaluation

At each step, we compute the AUC (area under the ROC curve) on the test set. We then compute the mean AUC value across steps, which we refer to as the mean-AUC; a higher mean-AUC indicates more efficient learning. We report the median of the mean-AUC values over seeds.
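
A minimal sketch of this evaluation metric, assuming scikit-learn is available for the per-step AUC computation:

from sklearn.metrics import roc_auc_score  # assumes scikit-learn is installed

def mean_auc(per_step_scores, y_true):
    """per_step_scores[t] holds the model's graded judgments for the test items at step t;
    the mean over steps of the per-step AUC is the sample-efficiency summary we report."""
    aucs = [roc_auc_score(y_true, scores) for scores in per_step_scores]
    return sum(aucs) / len(aucs)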


Figure 2: We report three analyses of the toy ATR Vowel Harmony language and our procedurally-generated languages: in-distribution (left column, see §5.4), out-of-distribution (center column, see §5.5), and an upper-bound assessment (right column, see §5.6). For each, we report the median and standard error of the mean-AUC over steps aggregated across runs (top row; numerical values and hyperparameters reported in Appendix Table 2), average AUC at each step aggregated across runs (middle row), and at each step the proportion of runs where the basic train strategy was selected by the hybrid strategies (bottom row). Results: In terms of median mean-AUC (top row), our query strategies are numerically on par with, if not beating, the stronger of the two baseline conditions; statistically, only the difference between Info. gain / train (model) and uniform was significant in the in-distribution analysis (top left). Average AUC over time (middle row) shows a similar pattern across all three analyses, with the non-baseline strategies rising faster and asymptoting sooner than baseline strategies, but usually with greater variance. Finally, though all hybrid strategies prefer non-train some portion of the time, the Info. gain / train (model) exhibits an interpretable shift from early preference for train data to later preference for its own synthesized queries in all three analyses.

5.4 In-Distribution Results

Assessing the in-distribution results, shown in the left column of Figure 2, we see that interactive elicitation is on par with, if not higher than, baseline strategies (top left plot). The difference between the train and uniform baselines was not significant according to a two-sided paired t-test, and the only strategy that performed significantly better than train after correcting for multiple comparisons was Info. gain / train (model). This difference is more visually striking in the plot of average AUC over time (middle left plot), where Info. gain / train (model) both ascends faster, and asymptotes earlier, than train, although with greater variance across runs. In the bottom left plot of Figure 2, we see that the numerically-best-performing Info. gain / train (model) strategy moves rather smoothly from an initial train preference to an Info. gain preference as learning progresses. That is, information in known-good words is initially helpful, but quickly becomes less useful as the model learns more of the language and can generate more targeted queries.

5.5 Out-Of-Distribution Results

The out-of-distribution analysis on the ATR Vowel Harmony language found greater variance of median mean-AUC between strategies, and also greater variance within strategies across seeds (top center plot). We note that this performance is lower than what is found in the upper-bound analysis, since the hyperparameters (listed in Appendix Table 2) were chosen based on the pooled results of the procedurally-generated languages. As in the in-distribution analysis, we found no statistical difference between the two baselines, nor between the Info. gain strategy and uniform, although Info. gain performed numerically better. In terms of average AUC over time (middle center plot), we find again that the top two non-baseline strategies rise faster and peak earlier than uniform, but exhibit greater variance.

5.6 Upper Bound Results

Greedily selecting for the best test performance in a hyperparameter search conducted on ATR Vowel Harmony yields superior performance compared to the out-of-distribution analysis hyperparameters, as seen in the top right plot in Figure 2. Appendix Table 2 lists the hyperparameter values used. However, we found no significant difference between the stronger baseline (uniform) and any other strategy after correcting for multiple comparisons.

6 Related Work

The goal of active learning is to improve learning efficiency by allowing models to choose which data to query an oracle about (Zhang et al., 2022). Uncertainty sampling methods (Lewis and Gale, 1994) select queries for which model uncertainty is highest. Most closely related to our approach are uncertainty sampling methods for probabilistic models, including least-confidence (Culotta and McCallum, 2005), margin sampling (Scheffer et al., 2001), and entropy-based methods.

Disagreement-based strategies query instances that maximize disagreement among a group of models (Seung et al., 1992). The distribution over a single model's parameters can also be treated as this "group" of distinct models, as has been done for neural models (Gal et al., 2017). Such methods are closely related to the feature entropy querying policy that we explore.

Another class of forward-looking methods incorporates information about how models would change if a given data-point were observed. Previous work includes methods that sample instances based on expected loss reduction (Roy and McCallum, 2001), expected information gain (MacKay, 1992), and expected gradient length (Settles et al., 2007). These methods are closely related to the policies based on information-gain that we explore.

Our hybrid policies are also related to previous work on dynamic selection between multiple active learning policies, such as DUAL (Donmez et al., 2007), which dynamically switches between density and uncertainty-based strategies.

The model we propose is also related to a body of work in computational and theoretical linguistics focused on phonotactic learning. Much of this work, largely inspired by Hayes and Wilson (2008), seeks to discover and/or parameterize models of phonotactic acceptability on the basis of only positive data, in line with common assumptions about infant language acquisition (Albright, 2009; Adriaans and Kager, 2010; Linzen and O'Donnell, 2015; Futrell et al., 2017; Mirea and Bicknell, 2019; Gouskova and Gallagher, 2020; Mayer and Nelson, 2020; Dai et al., 2023; Dai, to appear). Our work differs from these in that we are explicitly not seeking to model phonotactic learning from the infant's point of view, instead drawing inspiration from the strategy of a linguist working with a competent native speaker to discover linguistic structure via iterated querying. Practically, this means that our model can make use of both positive and negative data, and also takes an active role in seeking out the data it will learn from.

7 Conclusion

We have described a method for parameterizing a formal model of language via efficient, iterative querying of a black-box agent. We demonstrated that on an in-distribution set of toy languages, our query policies consistently outperform baselines numerically, including a statistically significant improvement for the most effective policy. The model struggles more on out-of-distribution languages, though in all cases the query policies are numerically comparable to the best baseline. We note that one factor making it difficult for the query policies to achieve consistently and significantly higher performance than the baselines is the small number of seeds, across which performance exhibits nontrivial variance, particularly for the hybrid policies. Future work may address this with more robust experiments.

Acknowledgements

Thanks to members of the audience at Interactions between Formal and Computational Linguistics (IFLG) Seminar hosted by the Linguistique Informatique, Formelle et de Terrain group, as well as two anonymous SCiL reviewers, for helpful questions and comments.

We acknowledge the following funding sources: MIT-IBM Watson AI Lab (CB, RL), NSF GRFP grant number 2023357727 (AR), an MIT Shillman Fellowship (AR), an MIT Dean of Science Fellowship (AM), and NSF IIS-2212310 (JA, AR).

References

  • Adriaans and Kager (2010) Frans Adriaans and René Kager. 2010. Adding generalization to statistical learning: The induction of phonotactics from continuous speech. Journal of Memory and Language, 62(3):311–331.
  • Albright (2009) Adam Albright. 2009. Feature-based generalisation as a source of gradient acceptability. Phonology, 26(1):9–41.
  • Bender et al. (2021) Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623.
  • Casali (2003) Roderic F Casali. 2003. [ATR] value asymmetries and underlying vowel inventory structure in Niger-Congo and Nilo-Saharan.
  • Casali (2008) Roderic F Casali. 2008. ATR harmony in African languages. Language and Linguistics Compass, 2(3):496–549.
  • Casali (2016) Roderic F Casali. 2016. Some inventory-related asymmetries in the patterning of tongue root harmony systems. Studies in African Linguistics, pages 96–99.
  • Chomsky and Halle (1968) Noam Chomsky and Morris Halle. 1968. The sound pattern of English. Harper & Row New York.
  • Culotta and McCallum (2005) Aron Culotta and Andrew McCallum. 2005. Reducing labeling effort for structured prediction tasks. In Proceedings of the 20th National Conference on Artificial Intelligence - Volume 2, AAAI'05, page 746–751. AAAI Press.
  • Dai (to appear) Huteng Dai. to appear. An exception-filtering approach to phonotactic learning. Phonology.
  • Dai et al. (2023) Huteng Dai, Connor Mayer, and Richard Futrell. 2023. Rethinking representations: A log-bilinear model of phonotactics. Proceedings of the Society for Computation in Linguistics, 6(1):259–268.
  • Donmez et al. (2007) Pinar Donmez, Jaime G. Carbonell, and Paul N. Bennett. 2007. Dual strategy active learning. In Proceedings of the 18th European Conference on Machine Learning, ECML '07, page 116–127, Berlin, Heidelberg. Springer-Verlag.
  • Futrell et al. (2017) Richard Futrell, Adam Albright, Peter Graff, and Timothy J O’Donnell. 2017. A generative model of phonotactics. Transactions of the Association for Computational Linguistics, 5:73–86.
  • Gal et al. (2017) Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, page 1183–1192. JMLR.org.
  • Gouskova and Gallagher (2020) Maria Gouskova and Gillian Gallagher. 2020. Inducing nonlocal constraints from baseline phonotactics. Natural Language & Linguistic Theory, 38:77–116.
  • Hayes and Wilson (2008) Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.
  • Hewitt and Manning (2019) John Hewitt and Christopher D Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.
  • Lewis and Gale (1994) David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In SIGIR '94, pages 3–12, London. Springer London.
  • Linzen and O’Donnell (2015) Tal Linzen and Timothy O’Donnell. 2015. A model of rapid phonotactic generalization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1126–1131.
  • MacKay (1992) David J. C. MacKay. 1992. Information-Based Objective Functions for Active Data Selection. Neural Computation, 4(4):590–604.
  • Mayer and Nelson (2020) Connor Mayer and Max Nelson. 2020. Phonotactic learning with neural language models. Society for Computation in Linguistics, 3(1).
  • Mirea and Bicknell (2019) Nicole Mirea and Klinton Bicknell. 2019. Using LSTMs to assess the obligatoriness of phonological distinctive features for phonotactic learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1595–1605.
  • Mohamed et al. (2022) Abdelrahman Mohamed, Hung-yi Lee, Lasse Borgholt, Jakob D Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, et al. 2022. Self-supervised speech representation learning: A review. IEEE Journal of Selected Topics in Signal Processing.
  • Rose (2018) Sharon Rose. 2018. ATR vowel harmony: new patterns and diagnostics. In Proceedings of the Annual Meetings on Phonology, volume 5.
  • Roy and McCallum (2001) Nicholas Roy and Andrew McCallum. 2001. Toward optimal active learning through Monte Carlo estimation of error reduction. In International Conference on Machine Learning.
  • Scheffer et al. (2001) Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active hidden Markov models for information extraction. In Advances in Intelligent Data Analysis, pages 309–318, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Settles et al. (2007) Burr Settles, Mark Craven, and Soumya Ray. 2007. Multiple-instance active learning. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc.
  • Seung et al. (1992) H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, page 287–294, New York, NY, USA. Association for Computing Machinery.
  • Zhang et al. (2022) Zhisong Zhang, Emma Strubell, and Eduard Hovy. 2022. A survey of active learning for natural language processing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6166–6190, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Out-of-distribution analysis
Policy                         log(α)   prior    s     Median mean-AUC   Std. err.
Info. gain / train (model)     0.5      0.1      ∞     0.973             0.004
Info. gain / train (history)   1        0.1      ∞     0.970             0.006
Info. gain / train (mixed)     2        0.2      ∞     0.969             0.005
Information gain               0.25     0.025    ∞     0.966             0.004
Label entropy                  0.1      0.1      ∞     0.964             0.009
Train (baseline)               1        0.1      ∞     0.947             0.007
Uniform (baseline)             1        0.1      1     0.940             0.008

Upper-bound analysis
Policy                         log(α)   prior    s     Median mean-AUC   Std. err.
Info. gain / train (mixed)     0.25     0.1      ∞     0.977             0.010
Information gain               0.1      0.025    ∞     0.975             0.002
Info. gain / train (history)   0.1      0.05     ∞     0.974             0.013
Info. gain / train (model)     1        0.001    1     0.973             0.009
Label entropy                  0.5      0.05     1     0.968             0.011
Uniform (baseline)             0.5      0.025    1     0.958             0.010
Train (baseline)               8        0.35     1     0.932             0.003

Table 2: Hyperparameters for the out-of-distribution analysis (§5.5) and upper-bound analysis (§5.6).

Appendix A Phonological features for Toy Languages

As described in §5.1, the ATR Vowel Harmony language is based on the categorization of vowels as [+ATR], [-ATR], or unspecified. The features [high] and [low] also serve to distinguish vowels in the language, but are not governed by a phonotactic. In contrast, any of the 512 logically possible trigrams of specified phonological features may be penalized for the procedurally-generated languages. Table 3 displays the phonological features for each of the vowels in the languages.

Appendix B Hyperparameters for out-of-distribution and upper-bound analyses

In §5.3, we described the hyperparameters of our grammatical model and the process by which values were selected for the out-of-distribution analysis. These selected hyperparameter values are presented in Table 2.

      [high]   [low]   [ATR]
i     +        -       +
ɪ     +        -       -
e     -        -       +
ɛ     -        -       -
a     -        +       0
Table 3: Phonological features for vowels used in the toy languages. The feature [word boundary] is omitted for simplicity, as it has the value '-' for all segments.

Appendix C Query Policy Implementation

We now revisit the query strategies introduced in §3 and describe how they are implemented for the model described in §4. In particular, under the described generative model, $p(y{=}1 \mid x, \underline{x}, \underline{y}) = \prod_{j \in \phi(x)} q(\theta_j = 0 \mid x, \underline{x}, \underline{y})$, as described above.

Let $q_y = \prod_{j \in \phi(x)} q(\theta_j = 0 \mid x, \underline{x}, \underline{y})$, i.e., $q_y$ is the probability of label $y = 1$ for input $x$ under the variational posterior; this is equivalent to the probability of all features in $\phi(x)$ being "off". Let $q_{\theta_j} = q(\theta_j = 1 \mid \underline{x}, \underline{y})$ denote the probability of parameter $\theta_j$ being "on" (i.e., penalized) under the current variational posterior $q(\bm{\theta})$. For this model, the quantities used by the query policies in §3 are computed as follows:

Label Entropy

Policy $\pi_{\text{label-ent}}$ selects $x^*$ according to:

$$x^* = \operatorname*{arg\,max}_{x \in \mathcal{X}} H(q_y), \quad \text{where} \quad H(p(y \mid x, \underline{x}, \underline{y})) = -q_y \log q_y - (1 - q_y) \log(1 - q_y).$$

Expected Information Gain

Policy $\pi_{\text{eig}}$ selects $x^*$ according to:

$$x^* = \operatorname*{arg\,max}_{x \in \mathcal{X}} \; V_{\text{IG}}(x, y{=}1; \underline{x}, \underline{y}) \cdot q_y + V_{\text{IG}}(x, y{=}0; \underline{x}, \underline{y}) \cdot (1 - q_y),$$

where $V_{\text{IG}}$ is given by

$$V_{\text{IG}}(x, y; \underline{x}, \underline{y}) = \sum_{j=1}^{|\bm{\theta}|} \Big( H(q(\theta_j \mid \underline{x}, \underline{y})) - H(q(\theta_j \mid x, y, \underline{x}, \underline{y})) \Big),$$

and $H$ is given by

$$H(q(\theta_j)) = -q_{\theta_j} \log q_{\theta_j} - (1 - q_{\theta_j}) \log(1 - q_{\theta_j}).$$
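
A minimal sketch of these model-specific scores; `update_posterior`, which reruns the variational update of Appendix D on the data augmented with $(x, y)$, is a hypothetical placeholder:

import math
from typing import Callable, Dict, Set

def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli variable with parameter p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def prob_acceptable(active: Set[int], q_theta: Dict[int, float]) -> float:
    """q_y: the product over active features j of q(theta_j = 0)."""
    q_y = 1.0
    for j in active:
        q_y *= 1.0 - q_theta[j]
    return q_y

def label_entropy_score(active: Set[int], q_theta: Dict[int, float]) -> float:
    """Quantity maximized by pi_label-ent: entropy of the predicted label."""
    return binary_entropy(prob_acceptable(active, q_theta))

def eig_score(
    x: str,
    active: Set[int],
    q_theta: Dict[int, float],
    update_posterior: Callable[[str, int], Dict[int, float]],  # hypothetical: q(.) after observing (x, y)
) -> float:
    """Quantity maximized by pi_eig: expected drop in summed per-parameter entropy."""
    h_before = sum(binary_entropy(q) for q in q_theta.values())
    q_y = prob_acceptable(active, q_theta)
    score = 0.0
    for y, p_y in ((1, q_y), (0, 1.0 - q_y)):
        q_after = update_posterior(x, y)
        h_after = sum(binary_entropy(q) for q in q_after.values())
        score += p_y * (h_before - h_after)
    return score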

Appendix D Derivation of the Update Rule

We want to compute the posterior $p(\theta \mid y, x, \alpha)$, which is intractable. Thus, we approximate it with a variational posterior composed of binomial distributions, one for each $\theta_i$. We further assume that the individual dimensions of the posterior (the individual components of $\theta$) have values that are not correlated. This allows us to perform coordinate ascent on each dimension of the posterior separately; thus, we express the following derivation in terms of $q(\theta_i)$, where $i$ is the index in the feature n-gram vector.

The variational posterior is optimized to minimize the KL divergence between the true posterior $p(\bm{\theta} \mid X, Y, \alpha)$ and $q(\bm{\theta})$; we do this by maximizing the ELBO.

The coordinate ascent update rule for each dimension of the posterior, that is, for each latent variable, is:

$$q(\theta_i) \propto \exp\big[\mathbb{E}_{q_{\neg i}} \log p(\theta_i, \theta_{\neg i}, y, x)\big].$$

Given the generative process, we can rewrite:

$$p(\theta_i, \theta_{\neg i}, y, x) = p(\theta_i) \cdot p(\theta_{\neg i}) \cdot p(y \mid x, \theta_i, \theta_{\neg i}).$$

$\mathbb{E}_{q_{\neg i}} \log p(\theta_{\neg i})$ is assumed to be constant across values of $\theta_i$ (expressing the lack of dependence between parameters), so we can rewrite the update rule as:

$$q(\theta_i) \propto \exp\big[\mathbb{E}_{q_{\neg i}}[\log p(\theta_i) + \log p(y \mid x, \theta_i, \theta_{\neg i})]\big].$$

Further, since $\log p(\theta_i)$ is constant across values of $q_{\neg i}$, we can rewrite it once more:

$$q(\theta_i) \propto \exp\big[\log p(\theta_i) + \mathbb{E}_{q_{\neg i}} \log p(y \mid x, \theta_i, \theta_{\neg i})\big].$$

Since our approximating distribution is binomial, we describe in turn the treatment of each of the two possible values of $\theta_i$. First, we derive the update rule for the case where the label $y$ is acceptable ($y = 1$).

We know that there are two subsets of $q_{\neg i}$ cases where this can happen. In an $\alpha$ proportion of them, $y$ is a correct label, which can only happen when $\theta_j = 0$ for all $j \neq i \in \phi(x)$. This occurs with probability $p_{\text{all\_off}} = \prod_{j \neq i \in \phi(x)} q(\theta_j = 0)$. There is also, then, the $1 - \alpha$ proportion of cases in which $y$ is an incorrect label, and the true judgment is unacceptable. Under this assumption, at least one feature is on, which occurs with probability $1 - p_{\text{all\_off}}$.

We can rewrite the expectation term to get approximate probabilities for both the $\theta_i = 0$ and $\theta_i = 1$ cases when $y = 1$:

$$q(\theta_i = 0) \propto \exp\big[\log p(\theta_i = 0) + \big(p_{\text{all\_off}} \cdot \log\alpha + (1 - p_{\text{all\_off}}) \cdot \log(1 - \alpha)\big)\big].$$

If $\theta_i = 1$, we know that $\log p(y \mid x, \theta_i, \theta_{\neg i}) = \log(1 - \alpha)$ for all $q_{\neg i}$, since we know that $y$ must be a noisy label. Thus:

$$q(\theta_i = 1) \propto \exp\big[\log p(\theta_i = 1) + \log(1 - \alpha)\big].$$

We can normalize these quantities to get a proper probability distribution, i.e., we can set $q(\theta_i = 1)$ to the following quantity:

$$q(\theta_i = 1) := \frac{q(\theta_i = 1)}{q(\theta_i = 1) + q(\theta_i = 0)}.$$

Using the expression $q(\theta_i)$ as shorthand for $q(\theta_i = 1)$, this results in the following update rule:

$$q(\theta_i = 1) = \sigma\Big(\log p(\theta_i) - \log(1 - p(\theta_i)) - p_{\text{all\_off}} \cdot \log\frac{\alpha}{1 - \alpha}\Big).$$

In practice, we update over batches of inputs and outputs rather than single datapoints, i.e.,

$$\mathbf{m}_{i,j} = \sum_{j' \neq j \in \phi(x_i)} \log(1 - p(\theta_{j'})) + \log\log\frac{\alpha}{1 - \alpha},$$
$$q(\theta_j) = \sigma\Big(\log p(\theta_j) - \log(1 - p(\theta_j)) - \sum_{i < t} y_i \cdot \exp(\mathbf{m}_{i,j})\Big).$$

We update each $q(\theta_j)$ either for a fixed number of steps $s$, or until convergence, i.e., when

$$\Big|\sum_{j=1}^{|\bm{\theta}|} \big(q^{\delta+1}_j - q^{\delta}_j\big)\Big| < \epsilon,$$

where $\epsilon$ is an error threshold.
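
A minimal sketch of this batched update and stopping rule; using the current variational values $q(\theta_{j'})$ for the other parameters when computing $p_{\text{all\_off}}$, skipping datapoints whose active features do not include $j$, and reading the threshold as $\epsilon = 2 \times 10^{-7}$ are our assumptions, not guaranteed details of the released implementation:

import math
from typing import Dict, List, Set, Tuple

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def mean_field_sweep(
    q_theta: Dict[int, float],          # current q(theta_j = 1) for each feature trigram j
    prior: float,                       # p(theta_j = 1): prior probability that a feature is penalized
    alpha: float,                       # probability that an observed label is correct
    data: List[Tuple[Set[int], int]],   # (active features phi(x_i), label y_i) pairs observed so far
) -> Dict[int, float]:
    """One coordinate-ascent sweep of the batched update rule above."""
    log_odds_alpha = math.log(alpha / (1.0 - alpha))
    logit_prior = math.log(prior) - math.log(1.0 - prior)
    new_q = dict(q_theta)
    for j in new_q:
        evidence = 0.0
        for active, y in data:
            if j not in active:
                continue  # assumption: a datum whose active features exclude j does not constrain theta_j
            # p_all_off: probability that every *other* active feature of this word is off
            p_all_off = 1.0
            for jp in active:
                if jp != j:
                    p_all_off *= 1.0 - new_q[jp]
            evidence += y * p_all_off * log_odds_alpha
        new_q[j] = sigmoid(logit_prior - evidence)
    return new_q

def fit_posterior(q_theta, prior, alpha, data, s=None, epsilon=2e-7):
    """Run sweeps for a fixed number of steps s, or (if s is None, i.e. s = infinity) until convergence."""
    step = 0
    while s is None or step < s:
        updated = mean_field_sweep(q_theta, prior, alpha, data)
        delta = abs(sum(updated[j] - q_theta[j] for j in q_theta))
        q_theta = updated
        step += 1
        if s is None and delta < epsilon:
            break
    return q_theta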