
Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters

Takashi Morita (tmorita@alum.mit.edu), Academy of Emerging Sciences, Chubu University; Institute for Advanced Research, Nagoya University. Timothy J. O'Donnell (timothy.odonnell@mcgill.ca), Department of Linguistics, McGill University; Canada CIFAR AI Chair, Mila — Quebec AI Institute.
Abstract

Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a certain syntactic structure is exclusive to Germanic verbs. When viewed as a cognitive model, however, such etymology-based generalizations face challenges in terms of learnability, since the historical origins of words are presumably inaccessible to general language learners. In this study, we present computational evidence that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature regarding the corresponding etymological classes. Our findings additionally uncovered previously unrecognized features of the quasi-etymological clusters, offering novel hypotheses for future experimental studies.

1 Introduction

Discovering appropriate groups of observations without access to correct answers (i.e., unsupervised class learning/clustering) is a fundamental challenge in the computational modeling of language acquisition. For example, a plausible model of spoken-language learners must be able to identify the vowel and consonant inventories of the target language solely from the acoustic properties of speech inputs (without reference to text transcriptions, unlike industrial speech recognition systems). This phonetic learning is particularly challenging due to the considerable individual and contextual variations in the acoustic data (Vallabha et al., 2007; Feldman et al., 2009; Dunbar et al., 2017).

In addition to categorizing individual speech sounds (phonemes), language learners must also acquire knowledge of their discrete sequential patterns (phonology). And this phonological learning also involves a classification task, since the lexicon of a single language comprises multiple groups of words that adhere to different phonological rules and constraints. For example, nouns and verbs are often governed by separate phonological rules/constraints in various languages (Goodenough and Sugita, 1980/1990; Kelly and Bock, 1988; Bobaljik, 1997; Meyers, 1997; Smith, 1999, 2016). Likewise, semantically distinguished classes of words may also exhibit unique phonological patterns, differing from the rest of the lexicon within the same language; for instance, onomatopoeic expressions (ideophones) in Japanese are formed through the repetition of a bimoraic morpheme (e.g., [kiɾa-kiɾa], meaning ``shining''; Ito and Mester, 1995b, 1999), and similar exceptionalities of this word class have been documented across languages (see Dingemanse, 2012, for a review).

Similarly, words of different etymological origins exhibit different phonological patterns. For example, English words are typically categorized according to their Germanic vs. Latinate origins, and this distinction correlates with two different stress patterns found in verbs (Grimshaw, 1985; Grimshaw and Prince, 1986). This etymological classification also serves as a crucial framework for analyzing morphological structures in English, wherein Latinate suffixes predominantly attach to Latinate roots, thereby maintaining etymological consistency throughout entire words (Anshen et al., 1986; Fabb, 1988; O'Donnell, 2015). Comparable etymology-based generalizations are found in other languages as well (see §2 for details; Trubetzkoy, 1939/1967; Fries and Pike, 1949; Lees, 1961; Postal, 1969; Zimmer, 1969; Chung, 1983; Ito and Mester, 1995b).

However, modeling the acquisition of such etymology-dependent phonology is exceptionally challenging compared to that of other word classes; the etymological origin of words is not directly observable by general language learners, in contrast to the rich syntactic or semantic information embedded in word-sequence data (Mikolov et al., 2013a, b; Radford et al., 2019; Brown et al., 2020; Ouyang et al., 2022). A recent study by Morita and O'Donnell (2022; see also Morita, 2018) introduced a computational framework to investigate the learnability of etymological distinctions in the absence of explicit cues. Taking Japanese as a case study, they demonstrated that the etymological word classes of the language can be inferred solely from phonological information. Specifically, they performed unsupervised clustering on existing Japanese words, represented as strings of phonetic symbols, and their learning model identified word clusters that were significantly well-aligned with the etymologically defined classes. Moreover, the model also recovered the etymology-based phonological generalizations proposed in the previous literature. These findings offer an empirical justification for etymology-based linguistic analyses, replacing arbitrarily defined word classes with psychologically plausible and learnable word clusters.

In the present study, we apply the same clustering algorithm to English words, and demonstrate that the distinction between Germanic and Latinate origins is also learnable from sequential patterns of phonetic symbols appearing in the words (i.e., segmental phonotactics). Our contributions can be summarized as follows.

  • We present empirical evidence for the unsupervised learnability of the Germanic and Latinate word clusters based solely on phonological information, or segmental phonotactics in particular.

  • We demonstrate that the identified word clusters recover various linguistic properties of Germanic and Latinate words as proposed in the previous literature.

  • We highlight several hitherto unnoticed linguistic properties of the discovered word clusters, which can guide future experimental studies.

  • In conjunction with the findings from the previous study on Japanese (Morita and O'Donnell, 2022), our present study supports the cross-linguistic validity of the proposed learning framework.

The remainder of this paper is organized as follows. In §2, we commence with a review of related studies concerning etymology-based generalizations of morpho-phonological patterns in English and other languages. Then, §3 outlines our model for learning etymological classes (or their proxies) exclusively from phonotactic information, alongside an explanation of the dataset employed for the learning simulation. §4 reports the basic results of the word clustering, including the alignment between the model-detected clusters and the ground-truth etymology. §5–§7 delve into the linguistic properties of the identified clusters, recovering various etymology-based generalizations proposed in the previous literature. Finally, §8 provides a comprehensive discussion of our findings, outlines potential avenues for future research, and offers concluding remarks.

2 Background

2.1 Cross-Linguistic Ubiquity of Etymology-Based Phonology

Etymologically-defined lexical subclasses have been extensively documented for various languages, most typically distinguishing between native words and loanwords (see Ito and Mester, 1995b, for reviews). For example:

  • In Chamorro, mid vowels are absent in native words but present in Spanish loans (Chung, 1983).

  • In Mohawk, labial consonants [m,b,p] are found in French loans but not in native words (Postal, 1969).

  • In Mazateco, postnasal stops are systematically voiced in native words but can be voiceless in loans (Fries and Pike, 1949).

  • German native words do not start with [s], whereas this constraint does not apply to loans (Trubetzkoy, 1939/1967).

  • In Turkish, high vowels are rounded after labial consonants in native words but not necessarily in loans (Lees, 1961; Zimmer, 1969).

Etymological distinctions are not necessarily binary. For instance, Japanese has ternary distinctions in its morpho-phonology; words are divided into native words, loanwords from Old Chinese, and more recent loans primarily from English (Ito and Mester, 1995a, b, 1999, 2003, 2008; Fukazawa, 1998; Fukazawa et al., 1998; Moreton and Amano, 1999; Ota, 2004; Gelbart and Kawahara, 2007; Frellesvig, 2010).

2.2 Etymology-Based Generalizations in English

Like other languages, English exhibits linguistic generalizations rooted in the Germanic-Latinate distinction. One such generalization involves the etymological consistency of morphemes within a word: Latinate bases tend to be suffixed with Latinate morphemes (Anshen et al., 1986; Fabb, 1988; O'Donnell, 2015). Another well-known example is the difference in stress patterns: Germanic verbs bear initial stress, whereas the initial syllable of Latinate verbs is typically unstressed (Grimshaw, 1985; Grimshaw and Prince, 1986).

The Germanic-Latinate distinction in English has also been utilized in syntactic analyses. Most famously, Germanic and Latinate verbs differ in their tolerance of the double-object construction. On the one hand, the dative argument of Germanic verbs (e.g., offer) can be expressed either as a prepositional phrase or as an indirect object (the examples below are cited from Yang and Montrul, 2017):

  • John offered fifty dollars to the drivers.
    (prepositional construction)

  • John offered the drivers fifty dollars.
    (double-object construction)

On the other hand, Latinate verbs are said not to allow the double-object construction (Gropen et al., 1989; Levin, 1993). For example, the dative argument of the verb donate—despite its semantic similarity to offer—can only be expressed as a prepositional phrase, not as an indirect object:

  • John donated fifty dollars to the drivers.
    (prepositional construction)

  • *John donated the drivers fifty dollars.
    (double-object construction)

This syntactic pattern extends to newly formed words as well; Gropen et al. (1989) experimentally investigated the availability of the double-object construction with quasi-Germanic and quasi-Latinate nonce verbs, phonologically characterized by monosyllabicity vs. polysyllabicity, respectively.

3 Materials and Methods

This section provides a high-level explanation of our learning model (§3.1) and the data used for the learning (§3.2).

3.1 Overview of the Learning Model

We model the learning of etymological lexical classes in the framework of unsupervised word clustering. The learning model receives English words—represented as strings of phonetic symbols—as its inputs and groups them into an optimal number of clusters following a certain statistical policy (explained below). Most importantly, the model has no access to ground-truth etymological information (such as ``Germanic origin'' or ``Latinate origin'') during its learning process, making it unsupervised.

Our approach to word clustering is grounded in Bayesian inference: We define a prior probability of word partitions (i.e., learning hypotheses) as well as their likelihood with respect to the data. Then, the optimal clustering is determined by the maximization of the posterior probability, which is proportional to the product of the prior and the likelihood.

In the remainder of this section, we focus on a high-level explanation of the model components, abstracting away from mathematical details; interested readers are referred to Morita and O'Donnell (2022). In addition, the Python code used for this study is publicly available at https://github.com/tkc-morita/variational_inference_DP_mix_HDP_topic_ngram. The hyperparameter values are reported in Appendix A.

3.1.1 Prior

We adopt a non-parametric prior distribution on cluster assignments known as the Dirichlet process (DP; Ferguson, 1973; Antoniak, 1974; Sethuraman, 1994). The DP prior prioritizes grouping words into the fewest possible clusters; it assigns a geometrically smaller probability (in expectation) to assignments spread over a greater number of hypothesized clusters. In linguistics and other fields, the DP has been widely used as a prior over lexica and other similar inventories (e.g., Anderson, 1990; Kemp et al., 2007; Goldwater et al., 2006, 2009; Teh, 2006; Teh et al., 2006; Feldman et al., 2013; O'Donnell, 2015).
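To make this rich-get-richer behavior of the DP prior concrete, the following minimal sketch (not the authors' implementation) samples cluster assignments from the Chinese restaurant process, the sequential view of the DP; the concentration parameter alpha and the word count are illustrative.

    import random

    def crp_assignments(n_words, alpha, seed=0):
        """Sample cluster assignments from a Chinese restaurant process,
        the sequential view of the Dirichlet process prior."""
        rng = random.Random(seed)
        counts = []  # counts[k] = number of words already assigned to cluster k
        labels = []
        for i in range(n_words):
            # An existing cluster k is chosen with probability counts[k] / (i + alpha);
            # a brand-new cluster is opened with probability alpha / (i + alpha).
            weights = counts + [alpha]
            k = rng.choices(range(len(weights)), weights=weights)[0]
            if k == len(counts):
                counts.append(0)
            counts[k] += 1
            labels.append(k)
        return labels, counts

    labels, counts = crp_assignments(n_words=10_000, alpha=1.0)
    print(len(counts), "clusters; largest sizes:", sorted(counts, reverse=True)[:5])

With a small alpha, most of the probability mass concentrates on a handful of large clusters, which is precisely the pressure toward few clusters described above.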

3.1.2 Likelihood

We assume that each word is sampled from a probability distribution whose parameters are associated with the cluster to which the word belongs (these cluster-specific parameters are optimized during the learning process together with the cluster assignments). The likelihood of a word partition is then defined as the product of the probabilities of all words in the dataset given the cluster assignments specified by the partition.

Clusters with fewer words exhibit less phonotactic variability, so the distribution associated with such a cluster can be more sharply peaked around a smaller set of phonotactic patterns, assigning a higher probability to each of its words. For this reason, the likelihood favors finer-grained word partitions; in the extreme case, the likelihood is maximized when each individual word forms its own cluster, specialized to generate just that particular word. The likelihood and the prior thus have opposing preferences over word partitions, and learning amounts to balancing the tradeoff between these opposing pressures.

For the specific implementation of the likelihood, we utilize a trigram model of phoneme strings (with hierarchical backoff smoothing; Goldwater et al., 2006; Teh, 2006). This model defines the probability of each phoneme conditioned on its two closest predecessors within the word, and the overall word probability is the product of these phoneme probabilities. While trigram models may not capture all phonotactic patterns in the world's languages (Hansson, 2001; Rose and Walker, 2004; Heinz, 2010), they effectively express local phonotactic dependencies among segments and account for a major portion of attested phonotactics (Gafos, 1999; Ní Chiosáin and Padgett, 2001; Hayes and Wilson, 2008). (A greater likelihood could be achieved with an artificial neural network capable of modeling longer dependencies on preceding elements; indeed, the current state-of-the-art model of time-series data, the Transformer, can be mathematically interpreted as an extended version of the n-gram model, permitting a very large Markov order of n−1 (Vaswani et al., 2017). Nevertheless, this study adopts the simpler trigram model, as it can be combined more easily with the DP prior in our implementation of Bayesian inference.)
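As a concrete illustration of this likelihood, the sketch below scores a word under a trigram model with word-boundary symbols. The probability table here is a hypothetical placeholder; in the actual model, the per-cluster trigram probabilities are inferred (with hierarchical backoff smoothing) jointly with the cluster assignments.

    import math
    from collections import defaultdict

    BOS, EOS = "<w>", "</w>"  # word-initial and word-final boundary symbols

    def word_log_prob(segments, trigram_prob):
        """Log probability of a word (a list of phonetic segments) under a
        trigram model: each segment is conditioned on its two predecessors."""
        padded = [BOS, BOS] + list(segments) + [EOS]
        logp = 0.0
        for i in range(2, len(padded)):
            context, seg = (padded[i - 2], padded[i - 1]), padded[i]
            logp += math.log(trigram_prob[context][seg])
        return logp

    # Hypothetical uniform placeholder table (every trigram gets probability 0.01):
    trigram_prob = defaultdict(lambda: defaultdict(lambda: 0.01))
    print(word_log_prob(["h", "u", "f"], trigram_prob))  # e.g., "hoof"

The log likelihood of a whole partition is then the sum of word_log_prob over all words, each scored with its own cluster's trigram table.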

3.1.3 Bayesian Inference

The optimal word clustering is formalized as the computation of the posterior probability, proportional to the product of the prior and likelihood by Bayes' theorem. A similar approach to balancing between simplicity and fit to the data is inherent in various theories of inductive inference (Rissanen and Ristad, 1994; Li and Vitányi, 2008; Grünwald, 2007; Shalev-Shwartz and Ben-David, 2014; Vapnik, 1998; Clark and Lappin, 2011; Jain et al., 1999; Bernardo and Smith, 1994).

A major challenge in this Bayesian inference is that the exact computation of the posterior probabilities is typically computationally intractable. Accordingly, we resort to variational approximation (specifically, the mean-field approximation) of the posterior to effectively handle the computational complexity and obtain practical solutions (Bishop, 2006; Blei and Jordan, 2006; Wainwright and Jordan, 2008a, b; Wang et al., 2011; Blei et al., 2017).
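Although the actual inference is variational, the objective being approximated can be sketched compactly: the (unnormalized) log posterior of a partition is the log CRP prior plus the per-cluster log likelihoods. The closed-form prior below follows from the standard CRP partition probability; log_likelihood_fn stands for any cluster-level likelihood, such as the trigram score sketched above.

    from math import lgamma, log

    def log_crp_prior(cluster_sizes, alpha):
        """Log probability of a partition under the CRP/DP prior:
        alpha^K * prod_k (n_k - 1)! * Gamma(alpha) / Gamma(alpha + n)."""
        n = sum(cluster_sizes)
        return (len(cluster_sizes) * log(alpha)
                + sum(lgamma(n_k) for n_k in cluster_sizes)
                + lgamma(alpha) - lgamma(alpha + n))

    def log_posterior_score(partition, log_likelihood_fn, alpha):
        """Unnormalized log posterior = log prior + log likelihood;
        MAP clustering maximizes this score over partitions."""
        sizes = [len(cluster) for cluster in partition]
        return log_crp_prior(sizes, alpha) + sum(
            log_likelihood_fn(cluster) for cluster in partition)

    # e.g., the prior's score for a partition with the cluster sizes of Table 1:
    print(log_crp_prior([23354, 15174, 203], alpha=1.0))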

3.2 Data

The clustering method introduced above was applied to the (British) English portion of the CELEX lexical database (Baayen et al., 1995). We adopted the dataset's original phonetic transcription (called DISC; see Appendix B for details) to represent the input.

Our model relies solely on the segmental information in words; accordingly, prosodic information—specifically, stress and syllable-boundary markers—was removed from the transcriptions. (There are several possible ways to extend our trigram model to incorporate the prosodic information of words: one is to build a hierarchical model representing syllable-like units (cf. Lee et al., 2015); another is to capture stress patterns by adopting a tier-based model that allows vowel-to-vowel interactions as well as local segmental interactions (Futrell et al., 2017).)

Our data only distinguished lemmas, ignoring spelling and inflectional variations among words (such as singular-plural distinctions). Some lemmas had more than one possible pronunciation; in such cases, we adopted the first (leftmost) entry. We further limited the data to lemmas with positive frequency in the Collins Birmingham University International Language Database (COBUILD; Sinclair, 1987). After filtering, 38,731 words remained in the input data.
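For concreteness, a preprocessing sketch along these lines is given below. The tab-separated file layout is hypothetical (the actual CELEX release stores lemma, pronunciation, and frequency information in differently structured files), but the filtering steps mirror those just described.

    def load_lemmas(path):
        """Read a simplified, tab-separated lemma file with columns
        (lemma, DISC pronunciations separated by ';', COBUILD frequency).
        This layout is hypothetical; the real CELEX files differ."""
        words = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                lemma, prons, freq = line.rstrip("\n").split("\t")
                if int(freq) <= 0:
                    continue  # keep only lemmas with positive COBUILD frequency
                pron = prons.split(";")[0]  # first (leftmost) pronunciation
                # strip stress marks and syllable boundaries, keeping segments only
                for marker in ("'", '"', "-"):
                    pron = pron.replace(marker, "")
                words.append(pron)
        return words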

4 Clustering Results

Cluster Name       #Words   Proportion
Sublex≈Germanic    23,354   60.3%
Sublex≈Latinate    15,174   39.2%
Sublex≈-ity           203    0.5%
Table 1: Clustering results based on the maximum-a-posteriori (MAP) prediction of the model.

The unsupervised clustering revealed two primary sublexical clusters, labeled Sublex≈Germanic and Sublex≈Latinate (Table 1). Additionally, a small cluster, labeled Sublex≈-ity, was identified, consisting of words that end with the suffix -ity (all but one of its 203 words were singular, the sole exception being susceptibilities). The emergence of this minor cluster suggests that a significant proportion of English words are formed through the suffixation of -ity, indicating the exceptional productivity of this suffix. Given that our model operates without awareness of morphological structures, there remains little else to discuss regarding Sublex≈-ity. The remainder of this paper is therefore devoted to the linguistic properties of the two major clusters, Sublex≈Germanic and Sublex≈Latinate.

Figure 1: Alignment between the model-discovered clusters (columns, MAP classification) and the etymological origins according to Wikipedia (rows). Each cell of the heatmap is annotated with the word count of the corresponding cluster-etymology intersection, followed by its relative frequency (in parentheses) per etymological origin (i.e., normalized across the columns within each row). The darkness of the heatmap also represents this relative frequency. The etymological origins are grouped into Germanic and Latinate by blue and orange dashed lines, respectively. The rows labeled with multiple origins (e.g., ``AngloSaxon/OldNorse'') represent words duplicated in the Wikipedia lists of the corresponding origins.

Figure 1 illustrates the alignment between the discovered clusters (columns) and the ground-truth etymological origins (rows). Due to the absence of etymological information in the CELEX dataset, we evaluated only the subset of the data (3,535 Germanic and 10,637 Latinate words) whose etymological origin was identified in Wikipedia articles (see Appendix C for details).

Germanic words—of Anglo-Saxon, Old Norse, or Dutch origin—were closely aligned with the discovered Sublex≈Germanic cluster. By contrast, the alignment between the discovered Sublex≈Latinate cluster and words of etymologically Latinate origin was less consistent; while Latin-derived words predominantly clustered into Sublex≈Latinate, those of French origin were almost evenly split between the two clusters. However, this imperfect alignment of the model predictions with the ground-truth etymology is not necessarily a disappointing result; in §7, we will see that our model's ``misclassifications'' exhibit stronger correlations with the grammaticality of double-object constructions than the ground-truth etymology does, thus providing a more effective account of the ``exceptions'' to the previous generalizations.

To quantitatively assess the significance of the alignment between our unsupervised clustering and the ground-truth etymology, we employed the V-measure metric (Rosenberg and Hirschberg, 2007). The V-measure evaluates the similarity between a predicted clustering and a ground-truth classification based on two desiderata:

  • Homogeneity: each of the predicted clusters should contain only members of a single ground-truth class.

  • Completeness: the members of each ground-truth class should be grouped into the same cluster.

Homogeneity and completeness are formally defined based on a normalized variant of the conditional entropy, both falling on a scale of 0 (worst) to 1 (best). The V-measure score is their harmonic mean (analogous to the F1-score).

The V-measure score of our clustering result was 0.198, significantly greater than the chance-level baseline obtained by randomly shuffling the ground-truth word origins (p < 10⁻⁵). The p-value was estimated by a Monte Carlo method: we sampled 100,000 random permutations of the ground-truth classifications and defined the p-value as the proportion of permutations whose V-measure score exceeded the model's performance (Ewens, 2003). None of the 100,000 random permutations achieved a V-measure as great as the model's, yielding p < 10⁻⁵.
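The sketch below shows how such an evaluation can be run with off-the-shelf tools; only the label vectors are assumed, since the clustering itself comes from the model.

    import numpy as np
    from sklearn.metrics import v_measure_score

    def v_measure_permutation_test(true_labels, predicted, n_perm=100_000, seed=0):
        """Monte Carlo p-value: the proportion of random relabelings whose
        V-measure reaches the observed score."""
        true_labels = np.asarray(true_labels)
        observed = v_measure_score(true_labels, predicted)
        rng = np.random.default_rng(seed)
        exceed = 0
        for _ in range(n_perm):
            if v_measure_score(rng.permutation(true_labels), predicted) >= observed:
                exceed += 1
        # if no permutation reaches the observed score, report the bound 1 / n_perm
        return observed, max(exceed, 1) / n_perm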

5 Phonotactic Characterization of the Discovered Word Clusters

In this section, we investigate the phonotactic properties of the model-detected clusters, Sublex≈Germanic and Sublex≈Latinate, examining whether they are consistent with the observations made in the previous literature (§5.2). We also conduct a data-driven exploration of the phonotactic features of these clusters, aiming to uncover previously unrecognized patterns (§5.3).

5.1 Metric of Representativeness

Following Morita and O'Donnell (2022), our analysis of the phonotactic properties of the identified clusters is grounded in a metric of representativeness of phonetic segments (Tenenbaum and Griffiths, 2001). This metric assesses the relative likelihood that a sequence of phonetic segments comes from a particular cluster compared to the other(s). Essentially, representativeness is highest for patterns that are highly probable in the target cluster but improbable in the other(s). Consequently, it helps us identify (strings of) segments that provide informative cues for classification.

Formally, the representativeness R(\mathbf{x}, k) of a string of phonetic segments \mathbf{x} := (x_1, \dots, x_m) with respect to the word cluster k is defined as the logarithmic ratio of the posterior predictive probability of \mathbf{x} appearing somewhere in a word belonging to cluster k to the average posterior predictive probability of \mathbf{x} over all the other clusters:

R(\mathbf{x}, k) := \log \frac{p(\dots\mathbf{x}\dots \mid k, \mathbf{d})}{\sum_{k' \neq k} p(\dots\mathbf{x}\dots \mid k', \mathbf{d}) \, p(k' \mid \mathbf{d}, k' \neq k)}    (1)

where \mathbf{d} denotes the training data of the clustering. For a detailed explanation of how the representativeness is computed, we refer interested readers to Morita and O'Donnell (2022).
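In implementation terms, Eq. 1 reduces to a few arithmetic operations once the posterior predictive probabilities are available; the numbers below are hypothetical placeholders for those quantities.

    import math

    def representativeness(p_x_given_cluster, p_cluster, k):
        """Eq. 1: log ratio of the probability of substring x under cluster k
        to its weighted average probability under the other clusters."""
        others = [c for c in p_cluster if c != k]
        z = sum(p_cluster[c] for c in others)  # renormalizes p(k' | d, k' != k)
        denom = sum(p_x_given_cluster[c] * p_cluster[c] / z for c in others)
        return math.log(p_x_given_cluster[k] / denom)

    # Hypothetical posterior predictives of some substring x in each cluster:
    p_x = {"Germanic": 3e-4, "Latinate": 2e-5, "-ity": 1e-6}
    p_k = {"Germanic": 0.603, "Latinate": 0.392, "-ity": 0.005}
    print(representativeness(p_x, p_k, "Germanic"))  # positive: x favors Germanic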

5.2 Recovery of Previous Generalization Regarding Stress Patterns

We begin our phonotactic analyses by showing that the model recovers generalizations proposed in the previous literature. Specifically, we examine the prosodic characterization that Germanic verbs bear initial stress whereas Latinate verbs have unstressed initial syllables (Grimshaw, 1985; Grimshaw and Prince, 1986).

It is important to note that the training data for our model did not explicitly include the prosodic information of words, such as syllable boundaries or stress patterns. Thus, the model cannot make direct predictions about the stress of word-initial syllables. Nonetheless, it can still make indirect predictions about prosodic patterns, exploiting the fact that most English vowels are reduced to schwa [ə] in unstressed positions. Specifically, we can evaluate which English vowels are most representative of Sublex≈Latinate and Sublex≈Germanic when they appear as the first vowel of a word. If schwa is highly representative of Sublex≈Latinate in this position, it indicates that an unstressed word-initial syllable is representative of that cluster.

We focus our analysis on the initial vowels of polysyllabic words, simply because monosyllabic words always bear initial stress. (To compute the representativeness of a vowel V1 appearing in the first syllable of a polysyllabic word, we replace the posterior predictive probabilities in Eq. 1 with p(V1 | C1*_C2*V2, k, d) and its counterparts for the other clusters k′: that is, the posterior predictive probability of V1 conditioned on the context C1*_C2*V2, in which the underscore marks the position of V1, C1* and C2* stand for any number of consonants (including none), and V2 stands for a second vowel in the word. In practice, we limit C1* to at most three consonants and C2* to at most five, matching the maximum lengths of the word-initial and word-internal consonant clusters in the CELEX data.) Moreover, Germanic words are more likely to be monosyllabic than Latinate words (Gropen et al., 1989); thus, without the requirement of polysyllabicity, initial unstressed vowels could appear representative of Sublex≈Latinate merely because of the greater expected word length, derailing our primary interest in stress patterns. (In our trigram model, the expected length of words is reflected in the probability of a special symbol marking the word-final position.)

Table 2 reports the representativeness scores of all English vowels with respect to Sublex≈Germanic and Sublex≈Latinate when occurring in the first syllable of polysyllabic words. (Note that a vowel can have negative representativeness for both Sublex≈Germanic and Sublex≈Latinate when it has positive representativeness with respect to Sublex≈-ity.) Schwa [ə] scored the lowest in Sublex≈Germanic and the highest in Sublex≈Latinate. This finding aligns with the previous generalization that initial stress (i.e., a non-schwa initial vowel) is a hallmark of Germanic words.

Vowel   Sublex≈Germanic   Sublex≈Latinate
ə       -1.832312          1.720911
        -1.173901          1.136778
        -0.819426          0.736090
        -0.704950          0.691387
        -0.291054          0.292225
æ       -0.162233          0.160080
        -0.127456          0.135075
~       -0.470889         -0.043847
         0.063449         -0.074478
æ̃       -0.570518         -0.184845
u        0.343961         -0.341091
         0.342331         -0.342073
æ̃       -0.314203         -0.396665
~       -0.277864         -0.426937
a        0.529269         -0.534647
         0.545830         -0.537470
         0.584508         -0.576595
         0.555049         -0.587985
         0.608208         -0.597709
i        0.849564         -0.840888
         1.001237         -0.998456
         1.141281         -1.181892
e        1.172851         -1.183918
a        1.697911         -1.707058
Table 2: Representativeness of the first vowels in polysyllabic words (with zero to three initial consonants and zero to five internal consonants between the first and second vowels) with respect to Sublex≈Germanic and Sublex≈Latinate. The phonetic transcription (of British English) was translated from DISC to IPA for readability.

5.3 Data-Driven Investigation of Representative Phonotactic Patterns

Sublex≈Germanic vs. Sublex≈Latinate (substring, representativeness score, examples):
Unigram 1: [ð] 1.5379 (bathe, mother) | [n] 1.9389 (essence, nation)
Unigram 2: [aʊ] 1.4040 (about, loud) | [ʊə] 1.4407 (cure, duration)
Unigram 3: [w] 1.3100 (work, wound) | [j] 1.2592 (accuse, use)
Unigram 4: [ŋ] 1.2497 (blink, swing) | [ɪə] 0.9589 (rear, serial)
Unigram 5: [h] 1.2479 (hand, hole) | [v] 0.8201 (survive, vacation)
Bigram 1: [h u] 4.7799 (hoof, who) | [ʃ n̩] 5.5336 (ocean, sufficient)
Bigram 2: [h ʊ] 4.7651 (hook, likelihood) | [tʃ ʊ] 4.4053 (actual, virtuous)
Bigram 3: [f ʊ] 4.3186 (careful, foot) | [eɪ ʃ] 4.3633 (facial, ratio)
Bigram 4: [ɛə *] 4.1726 (hair, there) | [j ʊ] 3.8582 (occupy, popular)
Bigram 5: [r aʊ] 4.0835 (brown, ground) | [ʒ n̩] 3.7467 (decision, vision)
Trigram 1: [ŋ ɡ l̩] 9.0861 (mingle, wrangle) | [eɪ ʃ n̩] 8.8990 (education, patient)
Trigram 2: [h aʊ s] 8.1806 (house, warehouse) | [j ʊ r] 8.6861 (accurate, mercury)
Trigram 3: [i p END] 7.8941 (deep, sleep) | [p tʃ ʊ] 7.8682 (conceptual, sumptuous)
Trigram 4: [ʌ ʃ END] 7.8941 (flush, rush) | [k ʃ n̩] 7.5597 (action, section)
Trigram 5: [k ʊ k] 7.3461 (cook, cookie) | [ʃ n̩ ə] 7.5533 (stationary, nationalism)
Table 3: Uni- to trigram substrings yielding the greatest representativeness with respect to each cluster (Sublex≈Germanic on the left, Sublex≈Latinate on the right of each row). The phonetic transcription (of British English) was translated from DISC to IPA for readability. The non-IPA tokens END and * (asterisk) represent the word-final position and linking r, respectively.

In addition to recovering the previous generalization of Germanic and Latinate phonology, our model can also be used for a data-driven investigation to identify class-specific phonotactic patterns. Specifically, ranking phonotactic patterns by their representativeness with respect to each cluster can unveil previously unnoticed characteristic patterns, suggesting new hypotheses for future experimental studies.

It is important to note that the representativeness-based analysis does not eliminate ungrammatical phonotactic patterns that never appear in English words. This is because our trigram phonotactic model is smoothed and assigns non-zero probabilities to unobserved patterns; consequently, a string of segments that is extremely improbable across all clusters can still be representative of one cluster if its probability in that cluster is relatively greater than in the others. Given the challenge of interpreting such ungrammatical (and rare) patterns, we limit our ranking to substrings with a minimum frequency of ten (cf. Morita and O'Donnell, 2022).
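The procedure just described can be approximated with raw relative frequencies, as sketched below; the actual rankings in Table 3 use the model's smoothed posterior predictive probabilities instead of frequency estimates.

    import math
    from collections import Counter

    def ngrams(word, n):
        return [tuple(word[i:i + n]) for i in range(len(word) - n + 1)]

    def rank_substrings(words_by_cluster, k, n, p_cluster, top=5, min_freq=10):
        """Rank n-gram substrings by (approximate) representativeness for
        cluster k, keeping only substrings attested >= min_freq times."""
        counts = {c: Counter(g for w in ws for g in ngrams(w, n))
                  for c, ws in words_by_cluster.items()}
        totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
        overall = Counter()
        for cnt in counts.values():
            overall.update(cnt)
        others = [c for c in counts if c != k]
        z = sum(p_cluster[c] for c in others)
        scored = []
        for g, freq in overall.items():
            if freq < min_freq:
                continue  # skip rare (possibly ungrammatical) patterns
            p_in = counts[k][g] / totals[k]
            p_out = sum((counts[c][g] / totals[c]) * p_cluster[c] / z
                        for c in others)
            if p_in > 0 and p_out > 0:
                scored.append((math.log(p_in / p_out), g))
        return sorted(scored, reverse=True)[:top]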

Table 3 presents the top-five uni- to trigram substrings ranked by their representativeness. These substrings are largely consistent with our intuition. For instance, many of the high-ranking bigrams and trigrams for Sublex≈Latinate correspond to the Latinate suffix -(at)ion, as exemplified by [ʃ n̩] and [eɪ ʃ n̩].

Similarly, it is reasonable that the word-final [i p] (represented as the trigram [i p END] in our model) is characteristic of Sublex≈Germanic, given that most words exhibiting this phonotactic pattern are Germanic, such as creep, deep, heap, leap, sleep, sheep, steep, and sweep (Stevenson and Lindberg, 2010). (The only possible counterexample to this generalization is cheap, which was built on the Latin word caupo ‘small trader, innkeeper’; this Latin word, however, was adopted in early Proto-Germanic (Hoad, 1986/2003) and is thus considered well adapted to the native phonotactics.) Likewise, the most representative bigram [h u] appears exclusively in Germanic words like hoof, hoop, and who(m). Despite their plausibility, however, these phonetic characterizations of Germanic words had, to our knowledge, never been documented previously, indicating the effectiveness of data-driven investigation for identifying novel patterns.

6 Word-Internal Consistency of Morpheme Etymology

In the previous section, we phonotactically characterized the word clusters identified by our model. Conversely, our model infers these clusters based on the word-internal correlations among such phonotactic patterns; frequent cooccurrences of different substrings within words are better explained by a mixture of multiple distributions—each fitting to specific cooccurring patterns—rather than by a single trigram distribution, which assumes i.i.d. sampling (or mere coincidence) of the cooccurring substrings. Consequently, the presence of long words is essential for our model to observe sufficient cooccurrences.

Long words are typically formed through the affixation of morphemes. Therefore, successful learning of word clusters presupposes the word-internal consistency of phonotactic distributions across morphemes. In other words, our model exploits the etymological consistency across morphemes (e.g., Latinate suffixes attach to Latinate bases; Anshen et al., 1986; Fabb, 1988; O'Donnell, 2015). In this section, we demonstrate that this word-internal etymological consistency is indeed reflected in our model, by showing that it classifies the different base morphemes of a common affix into the same cluster.
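Before turning to the results, a deliberately minimal numerical illustration of this learning pressure may help: each toy word below is a (base, suffix) pair whose class features always match within a word. A single distribution treating the two positions as independent draws wastes probability on unattested mixed combinations, whereas a two-cluster mixture does not.

    import math

    # 60 all-'G' words and 40 all-'L' words; classes never mix within a word.
    words = [("G", "G")] * 60 + [("L", "L")] * 40

    def loglik_single(words):
        """One distribution; base and suffix treated as independent draws."""
        n = len(words)
        p = {s: sum((b == s) + (x == s) for b, x in words) / (2 * n)
             for s in ("G", "L")}
        return sum(math.log(p[b]) + math.log(p[x]) for b, x in words)

    def loglik_mixture(words):
        """Two clusters, each generating only its own class of morphemes
        (emission probability 1), weighted by the cluster probability."""
        n = len(words)
        w_g = sum(b == "G" for b, _ in words) / n
        return sum(math.log(w_g if b == "G" else 1 - w_g) for b, _ in words)

    print(loglik_single(words))   # approx. -134.6
    print(loglik_mixture(words))  # approx. -67.3: the mixture fits far better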

Figure 2: The proportion of the MAP cluster assignments given to the bases of the thirty most type-frequent suffixes. The vertical dashed lines indicate the proportion of cluster assignments expected from the overall ratio (i.e., under the null hypothesis; Sublex≈Germanic: 59.33%, Sublex≈Latinate: 40.13%, Sublex≈-ity: 0.53%). The asterisks to the right of the bars represent the statistical significance of the deviation from the null hypothesis according to the multinomial test (*: p < 0.05, **: p < 0.01, ***: p < 0.001). The change of grammatical category effected by each suffix is annotated in parentheses (e.g., ``A → Adv'' represents the derivation of adverbs from adjectives). The suffix -ally, identified as a single suffix in the CELEX, was parsed as a concatenation of Latinate (-al) and Germanic (-ly) suffixes, and is thus labeled ``Mixed-Origin''.

Figure 2 illustrates the proportion of cluster assignments given to the bases of the thirty most productive suffixes, ranked by the number of words derived through suffixation (i.e., type frequency) as documented in the CELEX dataset. Eleven of these suffixes were of Germanic origin and eighteen of Latinate origin (Stevenson and Lindberg, 2010); the remaining suffix, -ally, was analyzed as the concatenation of the Latinate suffix -al and the Germanic suffix -ly—although the CELEX treats it as a single morpheme—and was thus categorized separately as a ``mixed-origin'' suffix. The suffixes and their corresponding bases were identified according to the morphological structures provided in the CELEX, and the clustering was based on the freestanding forms of the bases, which were also identified in the CELEX (bases lacking freestanding forms were excluded from the analysis).

The vertical dashed lines in the figure indicate the overall proportion of cluster assignments over the entire CELEX dataset (Sublex≈Germanic: 59.33%, Sublex≈Latinate: 40.13%, Sublex≈-ity: 0.53%). Using this overall ratio as the null-hypothetical parameters of the multinomial test (we used the R function multinomial.test in the EMT package and executed the exact test), we assessed the statistical significance of the suffix-dependent tendencies of base clustering towards Sublex≈Germanic or Sublex≈Latinate.
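In Python, the same test can be approximated by Monte Carlo simulation from the null multinomial distribution (the exact enumeration of the R EMT package is abstracted away here); the example counts at the bottom are hypothetical.

    import numpy as np
    from scipy.stats import multinomial

    def multinomial_test_mc(observed, null_probs, n_sim=100_000, seed=0):
        """Monte Carlo approximation of the exact multinomial goodness-of-fit
        test: the p-value is the null probability of drawing a table at most
        as likely as the observed one."""
        observed = np.asarray(observed)
        null_probs = np.asarray(null_probs, dtype=float)
        null_probs = null_probs / null_probs.sum()  # ensure exact normalization
        n = observed.sum()
        p_obs = multinomial.pmf(observed, n, null_probs)
        rng = np.random.default_rng(seed)
        sims = rng.multinomial(n, null_probs, size=n_sim)
        return float((multinomial.pmf(sims, n, null_probs) <= p_obs).mean())

    # Hypothetical base-cluster counts for one suffix vs. the overall ratio:
    print(multinomial_test_mc([80, 15, 1], [0.5933, 0.4013, 0.0053]))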

Overall, our model systematically classified the bases of the Germanic and Latinate suffixes into Sublex≈Germanic and Sublex≈Latinate, respectively, recovering the generalization made in the previous literature with statistical significance (Anshen et al., 1986; Fabb, 1988; O'Donnell, 2015). The exceptions were the bases of the Germanic suffixes -ly and -s, which tended to be classified into Sublex≈Latinate. In addition, two Latinate suffixes, -ism (for noun-to-noun derivation) and -ize (for noun-to-verb derivation), did not show a statistically significant deviation from the null-hypothetical clustering ratio.

7 Syntactic Predictions from the Clustering Results

Finally, we assess our model's capability to predict the syntactic grammaticality of double-object constructions (hereinafter abbreviated as DOC) of dative verbs. It should be noted that DOC can be ungrammatical for various reasons; for example, DOC is not permitted with verbs that have certain types of semantics, such as communication of propositions and propositional attitudes (e.g., announce, report) and manner of speaking (e.g., scream, whisper; Levin, 1993, see also Bresnan, 2007; Bresnan et al., 2007 for other factors). Again, our model relies purely on phonotactic information and cannot make any syntactic predictions per se. However, our model can indirectly infer the grammaticality by clustering dative verbs based on phonotactics and substituting the etymology-based generalizations with these model-detected clusters (cf. Gropen et al., 1989).

Specifically, we evaluate the alignment of our model's clustering with the following distinction:

  • DOC verbs:
    Dative verbs that permit DOC (and the prepositional construction).

  • *DOC-Lat verbs:
    Dative verbs that are said to disallow DOC solely due to their Latinateness (i.e., no other factors can distinguish them from DOC verbs; Levin, 1993).

The test data for this assessment were derived from Levin (1993): DOC verbs are listed in her §2.1, Ex. 115, and *DOC-Lat verbs in §2.1, Ex. 118a. (Levin's list of *DOC-Lat verbs includes broadcast, but its components, broad and cast, are in fact both of Germanic origin (Stevenson and Lindberg, 2010); accordingly, we excluded it from our main analysis. As a side note, this verb was also classified into Sublex≈Germanic, so it did not contribute to the adjudication between our model and the true etymology-based account, as both failed to predict the ungrammaticality of its DOC.) A critical aspect of these data is that, in fact, not all DOC verbs are etymologically Germanic; there are exceptional Latinate verbs that allow DOC. (The etymological origins of the DOC verbs were identified by referring to Wikipedia articles (see Appendix C) and the New Oxford American Dictionary (Stevenson and Lindberg, 2010). We excluded etymologically ambiguous words, listed as ``Latinates of Germanic origin'' in Wikipedia, from the analysis; we also excluded netmail and telex because they are compounds/blends of Germanic and Latinate morphemes.) This empirical fact leaves room for our model to outperform the etymology-based generalization of the previous literature by predicting these exceptions.

Figure 3: The accuracy scores of DOC grammatical prediction by the phonotactics-based clustering and the true etymology. Error bars indicate 95% confidence intervals, estimated from 1000 bootstrap samples. The statistical significance of the difference between the model- and etymology-based predictions was assessed using the exact McNemar's test.

Figure 3 reports the accuracy of distinguishing DOC from *DOC-Lat verbs based on our clustering results and on the true etymological classifications. Remarkably, the model predictions (0.8681) surpassed those based on the true etymology (0.7014). As noted above, this advantage is due to the fact that some Latinate verbs exceptionally permit DOC (termed DOC-Lat hereinafter, as opposed to the non-Latinate DOC verbs, termed DOC-non-Lat), and our model ``correctly misclassified'' these exceptions into Sublex≈Germanic based on their phonotactic patterns (Figure 4; see Appendix D for the clustering results of individual verbs). This outcome suggests that the grammaticality of DOC is more accurately generalized by phonotactic patterns than by etymological origins. Indeed, this finding aligns with the experimental study by Gropen et al. (1989), who demonstrated the productivity of DOC using monosyllabic and polysyllabic nonce words, characteristic of Germanic and Latinate verbs, respectively.
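A sketch of this evaluation pipeline is given below, assuming two boolean vectors recording whether each verb was classified correctly by the model and by the true etymology; statsmodels provides the exact McNemar's test.

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    def bootstrap_ci(correct, rng, n_boot=1000):
        """95% bootstrap confidence interval for an accuracy score."""
        idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
        return np.percentile(correct[idx].mean(axis=1), [2.5, 97.5])

    def compare_predictors(model_correct, etym_correct, seed=0):
        model_correct = np.asarray(model_correct, dtype=bool)
        etym_correct = np.asarray(etym_correct, dtype=bool)
        rng = np.random.default_rng(seed)
        cis = (bootstrap_ci(model_correct, rng), bootstrap_ci(etym_correct, rng))
        # 2x2 table of agreements/disagreements between the two predictors
        table = [[np.sum(model_correct & etym_correct),
                  np.sum(model_correct & ~etym_correct)],
                 [np.sum(~model_correct & etym_correct),
                  np.sum(~model_correct & ~etym_correct)]]
        test = mcnemar(table, exact=True)  # binomial test on the disagreements
        return (model_correct.mean(), etym_correct.mean()), cis, test.pvalue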

Figure 4: Alignment between the model-discovered clusters (columns, MAP classification) and the DOC grammaticality patterns (rows). Each cell of the heatmap is annotated with the word count of the corresponding cluster-grammaticality intersection, followed by its relative frequency (in parentheses) per grammaticality pattern (i.e., normalized across the columns within each row). The darkness of the heatmap also represents this relative frequency.

8 General Discussion and Concluding Remarks

In this study, we demonstrated that the Germanic-Latinate distinction within the English lexicon can be learned unsupervisedly in the form of phonotactically characterized word clusters. Specifically, we showed that the model-discovered clusters:

  • Aligned with the ground-truth etymological classification (§4),

  • Recovered phonotactic generalizations (stress patterns; §5.2),

  • Revealed hitherto unnoticed phonotactic properties (e.g., the Germanic representativeness of [i p] and [h u]; §5.3),

  • Captured the etymological consistency of morphemes within words (§6), and

  • Predicted the grammaticality of DOC (§7).

By empirically demonstrating the learnability of Latinate- and Germanic-like word clusters, the present study supports the psychological reality of existing linguistic generalizations based on the etymological distinction. Moreover, the clusters identified by our model may offer better generalizations of linguistic phenomena, as evidenced by the improved DOC predictions (§7).

The predictions made by our model also offer opportunities for further experimental investigation of the class-dependent linguistic properties. For example, researchers can test the psychological reality of the uni- to trigrams that our model identified as representative of the Latinate- and Germanic-like word clusters (§5), adopting experimental methods similar to those employed in previous studies (e.g., Moreton and Amano, 1999). Similarly, the clustering results can be used to experimentally test the phonotactics-based predictions of DOC grammaticality (cf. Gropen et al., 1989).

Finally, our findings suggest the cross-linguistic effectiveness of the proposed learning framework. Morita and O'Donnell (2022) demonstrated that the etymological word classes in Japanese could also be learned from phonotactic information using the same model, while in the previous literature, such learning was thought to require Japanese-specific information like the orthographic differences among the word classes (Gelbart and Kawahara, 2007). The phonotactics-based approach to learning word classes is applicable to any other language with phonetic transcription of its words, leaving room for further investigation of its universality in future studies.

Acknowledgments

This study was supported by JST AIP Accelerated Program (JPMJCR25U6), ACT-X (JPMJAX21AN), and CREST (JPMJCR22P5); JSPS Grant-in-Aid for Early-Career Scientists (JP21K17805) and for Scientific Research A (JP24H00774), B (JP22H03914), and C (JP24K15087); and Kayamori Foundation of Informational Science Advancement (K35XXVIII620). We also gratefully acknowledge the support of the Canada CIFAR AI Chairs Program and the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

  • Anderson (1990) Anderson, J.R., 1990. The adaptive character of thought. Studies in cognition, L. Erlbaum Associates, Hillsdale, NJ.
  • Anshen et al. (1986) Anshen, F., Aronoff, M., Byrd, R., Klavans, J., 1986. The role of etymology and word-length in English word formation. Unpublished manuscript.
  • Antoniak (1974) Antoniak, C.E., 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics 2, 1152–1174.
  • Baayen et al. (1995) Baayen, H.R., Piepenbrock, R., Gulikers, L., 1995. The CELEX Lexical Database. Release 2. Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania.
  • Bernardo and Smith (1994) Bernardo, J.M., Smith, A.F.M., 1994. Bayesian Theory. John Wiley and Sons, Chichester.
  • Bishop (2006) Bishop, C.M., 2006. Pattern recognition and machine learning. Information science and statistics, Springer, New York.
  • Blei and Jordan (2006) Blei, D.M., Jordan, M.I., 2006. Variational inference for Dirichlet process mixtures. Bayesian Analysis 1, 121–144.
  • Blei et al. (2017) Blei, D.M., Kucukelbir, A., McAuliffe, J.D., 2017. Variational inference: A review for statisticians. Journal of the American Statistical Association 112, 859–877. doi:10.1080/01621459.2017.1285773.
  • Bobaljik (1997) Bobaljik, J.D., 1997. Mostly predictable: Cyclicity and the distribution of schwa in Itelmen, in: Proceedings of the Twenty-Sixth Western Conference on Linguistics (WECOL), pp. 14–28.
  • Bresnan (2007) Bresnan, J., 2007. Is syntactic knowledge probabilistic? experiments with the English dative alternation, in: Featherston, S., Sternefeld, W. (Eds.), Roots: Linguistics in Search of Its Evidential Base. Mouton de Gruyter, Berlin, pp. 77–96.
  • Bresnan et al. (2007) Bresnan, J., Cueni, A., Nikitina, T., Baayen, H., 2007. Predicting the dative alternation, in: Bouma, G., Krämer, I., Zwarts, J. (Eds.), Cognitive Foundations of Interpretation, Edita-the Publishing House of the Royal, Amsterdam. pp. 69–94.
  • Brown et al. (2020) Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D., 2020. Language models are few-shot learners, in: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc.. pp. 1877–1901. arXiv:2005.14165.
  • Chung (1983) Chung, S., 1983. Transderivational relationships in Chamorro phonology. Language 59, 35–66.
  • Clark and Lappin (2011) Clark, A., Lappin, S., 2011. Linguistic Nativism and the Poverty of the Stimulus. Wiley-Blackwell.
  • Dingemanse (2012) Dingemanse, M., 2012. Advances in the cross-linguistic study of ideophones. Language and Linguistics Compass 6, 654–672. doi:10.1002/lnc3.361.
  • Dunbar et al. (2017) Dunbar, E., Cao, X.N., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X., Dupoux, E., 2017. The zero resource speech challenge 2017, in: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 323–330.
  • Ewens (2003) Ewens, W.J., 2003. On estimating p values by the Monte Carlo method. American Journal of Human Genetics 72, 496–498.
  • Fabb (1988) Fabb, N., 1988. English suffixation is constrained only by selectional restrictions. Natural Language and Linguistic Theory 6, 527–539.
  • Feldman et al. (2013) Feldman, N.H., Goldwater, S., Griffiths, T.L., Morgan, J.L., 2013. A role for the developing lexicon in phonetic category acquisition. Psychological Review 120, 751–778.
  • Feldman et al. (2009) Feldman, N.H., Griffiths, T.L., Morgan, J.L., 2009. The influence of categories on perception: Explaining the perceptual magnet effect as optimal statistical inference. Psychological review 116, 752–782.
  • Ferguson (1973) Ferguson, T.S., 1973. A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1, 209–230.
  • Frellesvig (2010) Frellesvig, B., 2010. A History of the Japanese Language. Cambridge University Press, Cambridge. doi:10.1017/CBO9780511778322.
  • Fries and Pike (1949) Fries, C.C., Pike, K.L., 1949. Coexistent phonemic systems. Language 25, 29–50.
  • Fukazawa (1998) Fukazawa, H., 1998. Multiple input-output faithfulness relations in Japanese. Rutgers Optimality Archive ROA-260-0698.
  • Fukazawa et al. (1998) Fukazawa, H., Kitahara, M., Ota, M., 1998. Lexical stratification and ranking invariance in constraint-based grammars, in: CLS 34: The Panels, pp. 47–62.
  • Futrell et al. (2017) Futrell, R., Albright, A., Graff, P., O'Donnell, T.J., 2017. A generative model of phonotactics. Transactions of the Association for Computational Linguistics 5, 73–86.
  • Gafos (1999) Gafos, A.I., 1999. The Articulatory Basis of Locality in Phonology. Outstanding Dissertations in Linguistics, Routledge.
  • Gelbart and Kawahara (2007) Gelbart, B., Kawahara, S., 2007. Lexical cues to foreignness in Japanese, in: Miyamoto, Y., Ochi, M. (Eds.), Formal Approaches to Japanese Linguistics (FALJ 4), MIT Working Papers in Linguistics, Cambridge, MA. pp. 49–60.
  • Goldwater et al. (2006) Goldwater, S., Griffiths, T.L., Johnson, M., 2006. Interpolating between types and tokens by estimating power-law generators, in: Weiss, Y., Schölkopf, B., Platt, J.C. (Eds.), Advances in Neural Information Processing Systems 18, MIT Press, Cambridge, MA. pp. 459–466.
  • Goldwater et al. (2009) Goldwater, S., Griffiths, T.L., Johnson, M., 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition 112, 21–54.
  • Goodenough and Sugita (1980/1990) Goodenough, W.H., Sugita, H., 1980/1990. Trukese-English Dictionary. American Philosophical Society.
  • Grimshaw (1985) Grimshaw, J., 1985. Remarks on dative verbs and universal grammar. Presented at the 10th Annual Boston University Conference on Language Development.
  • Grimshaw and Prince (1986) Grimshaw, J., Prince, A., 1986. A prosodic account of the to-dative alternation. Unpublished manuscript.
  • Gropen et al. (1989) Gropen, J., Pinker, S., Hollander, M., Goldberg, R., Wilson, R., 1989. The learnability and acquisition of the dative alternation in english. Language 65, 203–257.
  • Grünwald (2007) Grünwald, P.D., 2007. The Minimum Description Length Principle. The MIT Press, Cambridge, MA.
  • Hansson (2001) Hansson, G., 2001. Theoretical and Typological Issues in Consonant Harmony. Ph.D. thesis. University of California, Berkeley.
  • Hayes and Wilson (2008) Hayes, B., Wilson, C., 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry 39, 379–440. doi:10.1162/ling.2008.39.3.379.
  • Heinz (2010) Heinz, J., 2010. Learning long-distance phonotactics. Linguistic Inquiry 41, 623–661.
  • Hoad (1986/2003) Hoad, T.F., 1986/2003. The Concise Oxford Dictionary of English Etymology. Oxford University Press, Oxford; New York.
  • Ito and Mester (1995a) Ito, J., Mester, A., 1995a. The core-periphery structure of the lexicon and constraints on reranking, in: Beckman, J., Urbanczyk, S., Walsh, L. (Eds.), Papers in Optimality Theory III: University of Massachusetts Occasional Papers 32. GLSA Publications, Amherst, MA, pp. 181–210.
  • Ito and Mester (1995b) Ito, J., Mester, A., 1995b. Japanese phonology, in: Goldsmith, J. (Ed.), A Handbook of Phonological Theory. Blackwell, Oxford, pp. 817–838.
  • Ito and Mester (1999) Ito, J., Mester, A., 1999. The phonological lexicon, in: Tsujimura, N. (Ed.), The handbook of Japanese linguistics. Blackwell, Oxford, pp. 62–100.
  • Ito and Mester (2003) Ito, J., Mester, A., 2003. Lexical and postlexical phonology in Optimality Theory: evidence from Japanese. Linguistische Berichte 11, 183–207.
  • Ito and Mester (2008) Ito, J., Mester, A., 2008. Lexical classes in phonology, in: Miyagawa, S., Saito, M. (Eds.), The Oxford handbook of Japanese linguistics. Oxford University Press, Oxford. chapter 4, pp. 84–106.
  • Jain et al. (1999) Jain, S., Osherson, D., Royer, J.S., Sharma, A., 1999. Systems that Learn. The MIT Press, Cambridge, MA.
  • Kelly and Bock (1988) Kelly, M.H., Bock, J.K., 1988. Stress in time. Journal of Experimental Psychology: Human Perception and Performance 14, 389–403. doi:10.1037/0096-1523.14.3.389.
  • Kemp et al. (2007) Kemp, C., Perfors, A., Tenenbaum, J., 2007. Learning overhypotheses with hierarchical Bayesian models. Developmental Science 10, 307–321.
  • Lee et al. (2015) Lee, C.y., O'Donnell, T.J., Glass, J., 2015. Unsupervised lexicon discovery from acoustic input. Transactions of the Association for Computational Linguistics 3, 389–403.
  • Lees (1961) Lees, R.B., 1961. The phonology of modern standard Turkish. Uralic and Altaic series, Indiana University publications.
  • Levin (1993) Levin, B., 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago; London.
  • Li and Vitányi (2008) Li, M., Vitányi, P.M., 2008. An Introduction to Kolmogorov Complexity and Its Applications. Texts in Computer Science, Springer, New York, NY, USA.
  • Meyers (1997) Meyers, S., 1997. OCP effects in Optimality Theory. Natural Language & Linguistic Theory 15, 847–892. doi:10.1023/A:1005875608905.
  • Mikolov et al. (2013a) Mikolov, T., Chen, K., Corrado, G.S., Dean, J., 2013a. Efficient estimation of word representations in vector space, in: Bengio, Y., LeCun, Y. (Eds.), Proceedings of the 1st International Conference on Learning Representations (ICLR), Scottsdale, Arizona.
  • Mikolov et al. (2013b) Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013b. Distributed representations of words and phrases and their compositionality, in: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc.
  • Moreton and Amano (1999) Moreton, E., Amano, S., 1999. Phonotactics in the perception of Japanese vowel length: Evidence for long-distance dependencies, in: Proceedings of the 6th European Conference on Speech Communication and Technology, pp. 2679–2682.
  • Morita (2018) Morita, T., 2018. Unsupervised Learning of Lexical Subclasses from Phonotactics. Ph.D. thesis. Massachusetts Institute of Technology. Cambridge, MA.
  • Morita and O'Donnell (2022) Morita, T., O'Donnell, T.J., 2022. Statistical evidence for learnable lexical subclasses in Japanese. Linguistic Inquiry 53, 87–120. doi:10.1162/ling\_a\_00401.
  • Ní Chiosáin and Padgett (2001) Ní Chiosáin, M., Padgett, J., 2001. Markedness, segment realization, and locality in spreading, in: Lombardi, L. (Ed.), Constraints and Representations: Segmental Phonology in Optimality Theory. Cambridge University Press, Cambridge, pp. 118–156.
  • O'Donnell (2015) O'Donnell, T.J., 2015. Productivity and reuse in language : a theory of linguistic computation and storage. MIT Press, Cambridge, MA; London, England.
  • Ota (2004) Ota, M., 2004. The learnability of the stratified phonological lexicon. Journal of Japanese Linguistics 20, 19–40.
  • Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., Lowe, R., 2022. Training language models to follow instructions with human feedback. doi:10.48550/ARXIV.2203.02155.
  • Postal (1969) Postal, P.M., 1969. Mohawk vowel doubling. International Journal of American Linguistics 35, 291–298.
  • Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019. Language Models are Unsupervised Multitask Learners. Technical Report. OpenAI. San Francisco, CA, USA.
  • Rissanen and Ristad (1994) Rissanen, J., Ristad, E.S., 1994. Language acquisition in the MDL framework, in: Language Computations: DIMACS Workshop on Human Language, American Mathematical Society, Philadelphia. pp. 149–166.
  • Rose and Walker (2004) Rose, S., Walker, R., 2004. A typology of consonant agreement as correspondence. Language 80, 475–531.
  • Rosenberg and Hirschberg (2007) Rosenberg, A., Hirschberg, J., 2007. V-measure: A conditional entropy-based external cluster evaluation measure, in: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 410–420.
  • Sethuraman (1994) Sethuraman, J., 1994. A constructive definition of Dirichlet priors. Statistica Sinica , 639–650.
  • Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S., Ben-David, S., 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge, England.
  • Sinclair (1987) Sinclair, J. (Ed.), 1987. Looking Up: An Account of the COBUILD Project in Lexical Computing. Collins Cobuild dictionaries, Collins COBUILD, London, England.
  • Smith (1999) Smith, J., 1999. Noun faithfulness and accent in Fukuoka Japanese, in: Bird, S., Carnie, A., Haugen, J.D., Norquest, P. (Eds.), Proceedings of the 18th West Coast Conference on Formal Linguistics (WCCFL), Cascadilla Press, Somerville, MA. pp. 519–531.
  • Smith (2016) Smith, J., 2016. Segmental noun/verb phonotactic differences are productive too. Proceedings of the Linguistic Society of America 1, 16:1–15. doi:10.3765/plsa.v1i0.3717.
  • Stevenson and Lindberg (2010) Stevenson, A., Lindberg, C.A. (Eds.), 2010. New Oxford American Dictionary 3rd Edition. Oxford University Press.
  • Teh (2006) Teh, Y.W., 2006. A hierarchical Bayesian language model based on Pitman-Yor processes, in: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, PA, USA. pp. 985–992. doi:10.3115/1220175.1220299.
  • Teh et al. (2006) Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M., 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association 101, 1566–1581.
  • Tenenbaum and Griffiths (2001) Tenenbaum, J.B., Griffiths, T.L., 2001. The rational basis of representativeness, in: 23rd Annual Conference of the Cognitive Science Society, pp. 84–98.
  • Trubetzkoy (1939/1967) Trubetzkoy, N.S., 1939/1967. Grundzüge der Phonologie. Vandenhoeck and Ruprecht, Göttingen.
  • Vallabha et al. (2007) Vallabha, G.K., McClelland, J.L., Pons, F., Werker, J.F., Amano, S., 2007. Unsupervised learning of vowel categories from infant-directed speech. PNAS 104, 13273–13278.
  • Vapnik (1998) Vapnik, V.N., 1998. Statistical Learning Theory. John Wiley and Sons.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need, in: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pp. 5998–6008.
  • Wainwright and Jordan (2008a) Wainwright, M.J., Jordan, M.I., 2008a. Graphical models, exponential families, and variational inference. Now Publishers, Boston, MA.
  • Wainwright and Jordan (2008b) Wainwright, M.J., Jordan, M.I., 2008b. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1, 1–305. doi:10.1561/2200000001.
  • Wang et al. (2011) Wang, C., Paisley, J., Blei, D., 2011. Online variational inference for the hierarchical Dirichlet process, in: Gordon, G., Dunson, D., Dudík, M. (Eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL. pp. 752–760.
  • Yang and Montrul (2017) Yang, C., Montrul, S., 2017. Learning datives: The tolerance principle in monolingual and bilingual acquisition. Second Language Research 33, 119–144. doi:10.1177/0267658316673686.
  • Zimmer (1969) Zimmer, K.E., 1969. Psychological correlates of some Turkish morpheme structure conditions. Language 45, 309–321.

Appendix A Hyperparameters

Our learning simulation of English word classes adopted exactly the same model that Morita and O'Donnell (2022) used for Japanese word clusters, namely, the trigram model with backoff smoothing based on the hierarchical DP (HDP; Goldwater et al., 2006; Teh et al., 2006; Futrell et al., 2017), including the hyperparameter values. Specifically, the concentration parameters of the cluster-assignment DP and the backoff HDP followed the gamma prior distribution $\mathrm{Gamma}(10, 10^{-1})$ (parameterized by shape and scale), the standard setting also adopted by Goldwater et al. (2006), Teh et al. (2006), and Futrell et al. (2017). The top-level unigram prior on the segments was uniform.

The parameters of the variational approximation for the posterior inference were also set in the same way as in Morita and O'Donnell (2022). Specifically, the upper bound on the number of word clusters was set to six, and the upper bound on the number of segment-generator distributions per backoff layer was set to twice the number of possible symbols: 52 phonetic segments plus two special symbols indicating the word-initial and word-final positions. We optimized the variational approximating distributions using the coordinate-ascent algorithm (Bishop, 2006; Blei and Jordan, 2006), and we report the best approximation among 1,000 runs with random initialization. Each run was terminated either when the per-iteration improvement in the optimization objective (the evidence lower bound, or ELBO) fell below 0.1 or when the maximum number of iterations (5,000) was reached.
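For concreteness, the following Python sketch mirrors the outer structure of this optimization under the settings just described. The functions init, update, and elbo are placeholders for the model-specific routines (random initialization of the variational parameters, one full sweep of coordinate-ascent updates, and ELBO evaluation); this is a sketch of the outer loop, not the code of our actual implementation.

```python
import math

# Settings reported above; the gamma prior is parameterized by (shape, scale).
CONCENTRATION_PRIOR = (10.0, 1e-1)        # Gamma(10, 10^-1) on concentrations
MAX_WORD_CLUSTERS = 6                     # truncation: word clusters
N_SYMBOLS = 52 + 2                        # segments + word-boundary symbols
MAX_GENERATORS_PER_LAYER = 2 * N_SYMBOLS  # truncation: generators per layer
N_RESTARTS = 1000
MAX_ITERS = 5000
ELBO_TOL = 0.1

def fit_with_restarts(data, init, update, elbo):
    """Coordinate ascent with random restarts; returns the best variational fit."""
    best_q, best_elbo = None, -math.inf
    for _ in range(N_RESTARTS):
        q, prev = init(), -math.inf
        for _ in range(MAX_ITERS):
            q = update(q, data)            # one sweep of coordinate-ascent updates
            curr = elbo(q, data)
            improvement, prev = curr - prev, curr
            if improvement < ELBO_TOL:     # per-iteration gain below threshold
                break
        if prev > best_elbo:               # keep the run with the highest ELBO
            best_q, best_elbo = q, prev
    return best_q
```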

Appendix B Supplementary Information about the Data Format

As noted in §3.2, we trained our clustering model on the CELEX dataset (Baayen et al., 1995). Specifically, the training data were extracted from the epl.cd file of the dataset, filtering out lemmas whose corpus frequency (reported in the ``Cob'' column) was zero. The phonetic transcriptions in this dataset are coded in a special format called DISC, which represents each distinct segment with a single ASCII letter, and our $n$-gram model treated each such letter as a unit symbol. Specifically, we adopted the ``PhonStrsDISC'' column of the epl.cd file, although the phonetic transcriptions in this paper have all been translated into IPA for better readability.
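The extraction step can be sketched as follows. The column indices and the stress/syllable markers stripped below are assumptions for illustration and should be checked against the CELEX documentation; only the backslash-separated format and the use of the ``Cob'' and ``PhonStrsDISC'' columns are as described above.

```python
COB_COL = 2    # assumed 0-based index of the ``Cob'' frequency column
PHON_COL = 5   # assumed index of the ``PhonStrsDISC'' column

def load_celex_lemmas(path="epl.cd"):
    """Extract DISC segment sequences from CELEX, dropping zero-frequency lemmas."""
    words = []
    with open(path, encoding="ascii") as f:
        for line in f:
            fields = line.rstrip("\n").split("\\")
            if int(fields[COB_COL]) == 0:  # zero corpus frequency: filter out
                continue
            # DISC encodes one segment per ASCII letter; strip the (assumed)
            # stress marks and syllable separators so that each remaining
            # character is one unit symbol for the n-gram model.
            segments = [c for c in fields[PHON_COL] if c not in "'\"-"]
            words.append(segments)
    return words
```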

Appendix C Identification of the Etymological Origin

The CELEX database does not provide information on the etymological origin of words. Accordingly, to evaluate the alignment of our discovered clusters with the ground-truth etymology, we made use of the subset of words whose origin was identifiable in Wikipedia. Specifically, we considered words of Anglo-Saxon (https://en.wikipedia.org/wiki/List_of_English_words_of_Anglo-Saxon_origin), Old Norse (https://en.wikipedia.org/wiki/List_of_English_words_of_Old_Norse_origin), Dutch (https://en.wikipedia.org/wiki/List_of_English_words_of_Dutch_origin), Latin (https://en.wikipedia.org/wiki/List_of_Latin_words_with_English_derivatives), and French origin (https://en.wikipedia.org/wiki/List_of_English_words_of_French_origin_(A-C), https://en.wikipedia.org/wiki/List_of_English_words_of_French_origin_(D-I), https://en.wikipedia.org/wiki/List_of_English_words_of_French_origin_(J-R), https://en.wikipedia.org/wiki/List_of_English_words_of_French_origin_(S-Z)); all of these pages were accessed on 5 April 2019. Words with multiple origins were included in the data only if the origins were either all Germanic (Anglo-Saxon, Old Norse, or Dutch) or all Latinate (Latin or French); words reported as ``Latinates of Germanic origin'' (https://en.wikipedia.org/wiki/List_of_English_Latinates_of_Germanic_origin, also accessed on 5 April 2019) were excluded from the analysis, as were words of ambiguous origin. The resulting data amounted to 14,172 words (3,535 Germanic and 10,637 Latinate). A sketch of this inclusion rule appears below.
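A minimal sketch of the inclusion rule, assuming that each word's listed origins have already been scraped from the Wikipedia pages into a set of labels (a hypothetical format):

```python
GERMANIC = {"Anglo-Saxon", "Old Norse", "Dutch"}
LATINATE = {"Latin", "French"}

def etymology_label(origins):
    """Map a word's set of listed origins to a single label, or None.

    A word is kept only if all of its listed origins fall on one side of
    the Germanic/Latinate divide; mixed, ``Latinate of Germanic origin'',
    and ambiguous cases are excluded (returned as None).
    """
    if origins and origins <= GERMANIC:
        return "Germanic"
    if origins and origins <= LATINATE:
        return "Latinate"
    return None
```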

Appendix D Predictions of DOC-Grammaticality for Individual Verbs

In §7, we reported our model's predictions regarding the DOC grammaticality of dative verbs in the form of word counts per grammaticality/etymology type $\times$ model-discovered cluster. Here, we provide more detailed results, reporting the cluster-assignment probability of each individual verb.

Figure 5 presents the posterior cluster-assignment probabilities of the non-Latinate dative verbs that allow the DOC (DOC-$\overline{\textsc{Lat}}$). We can see that the vast majority of them have a greater assignment probability to Sublex$\approx$Germanic (represented by the blue portions in the figure).

Latinate dative verbs prohibiting (*DOC-Lat) and permitting (DOC-Lat) the DOC are listed in Figure 6 with their cluster-assignment probabilities. Most of the *DOC-Lat verbs had a greater probability of assignment to Sublex$\approx$Latinate; by contrast, most of the DOC-Lat verbs were MAP-classified into Sublex$\approx$Germanic. It is these ``correct misclassifications of the exceptions'' that made our model a better predictor of DOC grammaticality than the true etymology.
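As a minimal illustration of how these MAP classifications are read off the posterior, the following hypothetical snippet assigns each verb to its highest-probability cluster; because the mass on Sublex$\approx$-ity is tiny, this effectively reduces to the 0.5 decision boundary shown in the figures below.

```python
def map_cluster(probs):
    """Return the MAP cluster for one verb.

    `probs` is a hypothetical mapping from cluster names to posterior
    assignment probabilities, e.g. {"Germanic": 0.93, "Latinate": 0.06,
    "-ity": 0.01}. With negligible mass on the third cluster, the argmax
    amounts to checking which main cluster exceeds 0.5.
    """
    return max(probs, key=probs.get)
```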

Figure 5: Cluster-assignment probabilities of DOC-$\overline{\textsc{Lat}}$ dative verbs. The vertical dashed lines indicate 0.5, which serves as an approximate decision boundary, ignoring the tiny probability mass on Sublex$\approx$-ity.
Figure 6: Cluster-assignment probabilities of *DOC-Lat (left) and DOC-Lat (right) verbs. The vertical dashed lines indicate 0.5, which serves as an approximate decision boundary, ignoring the tiny probability mass on Sublex$\approx$-ity.