
Precision, Stability, and Generalization: A Comprehensive Assessment of the Learnability of RNNs for Classifying Counter and Dyck Languages

Neisarg Dave, The Pennsylvania State University, nud83@psu.edu
Daniel Kifer, The Pennsylvania State University, duk17@psu.edu
C. Lee Giles, The Pennsylvania State University, clg20@psu.edu
Ankur Mali, University of South Florida, ankurarjunmali@usf.edu
Abstract

This study investigates the learnability of Recurrent Neural Networks (RNNs) in classifying structured formal languages, focusing on counter and Dyck languages. Traditionally, both first-order (LSTM) and second-order (O2RNN) RNNs have been considered effective for such tasks, primarily based on their theoretical expressiveness within the Chomsky hierarchy. However, our research challenges this notion by demonstrating that RNNs primarily operate as state machines, where their linguistic capabilities are heavily influenced by the precision of their embeddings and the strategies used for sampling negative examples. Our experiments revealed that performance declines significantly as the structural similarity between positive and negative examples increases. Remarkably, even a basic single-layer classifier using RNN embeddings performed better than chance. To evaluate generalization, we trained models on strings of length up to 40 and tested them on strings of lengths 41 to 500, using 10 unique seeds to ensure statistical robustness. Stability comparisons between LSTM and O2RNN models showed that O2RNNs generally offer greater stability across various scenarios. We further explore the impact of different initialization strategies, showing that our hypothesis holds consistently across RNN variants. Overall, this research questions established beliefs about RNNs’ computational capabilities, highlighting the importance of data structure and sampling techniques in assessing neural networks’ potential for language classification tasks. It emphasizes that stronger constraints on expressivity are crucial for understanding true learnability, as mere expressivity does not capture the essence of learning.

1 Introduction

Recurrent neural networks (RNNs) are experiencing a resurgence, spurring significant research aimed at establishing theoretical bounds on their expressivity. As natural neural analogs to state machines described by the Chomsky hierarchy, RNNs offer a robust framework for examining learnability, stability, and generalization—core aspects for advancing the development of memory-augmented models.

While conventional RNN architectures typically approximate finite state automata, LSTMs have shown the capacity to learn and generalize non-regular grammars, including counter languages and Dyck languages. These non-regular grammars demand a state machine enhanced with a memory component, and LSTM cell states have been demonstrated to possess sufficient expressivity to mimic dynamic counters. However, understanding whether these dynamics are stable and reliably learnable is crucial, particularly as the stability of learned fixed points directly impacts the generalization and reliability of these networks.

In this work, we extend the investigation into the expressiveness of RNNs by focusing on the empirical evidence for learnability and generalization in complex languages, with specific attention to Dyck and counter languages. Our analysis reveals the following key insights:

  1. The counter dynamics learned by LSTM cell states are prone to collapse, leading to unstable behavior.

  2. When positive and negative samples are not topologically proximal, the classifier can obscure the collapse of counter dynamics, creating an illusion of generalization.

  3. The initial weight distribution has little effect on the eventual collapse of counter dynamics, suggesting inherent instability.

  4. Second-order RNNs, although stable in approximating fixed states, lack the mechanism required for counting, underscoring the limitations of their expressivity.

Our analysis builds upon results from [1] regarding the behavior of the sigmoid activation function, extending this understanding to the fixed points of the tanh function used in LSTM cell and hidden state updates. Drawing from parallels noted by [2] between LSTMs and counter machines, we show that while LSTM cell states exhibit the expressivity needed for counting, this capability is not reliably captured in the hidden state. As a result, when the difference between successive hidden states falls below the precision threshold of the decoder, the classifier can no longer accurately represent the counter, leading to generalization failure. Additionally, we explore how input and forget gates within the LSTM clear the counter dynamics as state changes accumulate, resulting in an eventual collapse of dynamic behavior.

Further, we extend our exploration to analyze the learnability of classification layers when the encoding RNN is initialized randomly and not trained. This setup allows us to assess the extent of instability induced by the collapse of counter dynamics in the LSTM cell state and the role of numeric precision in the hidden state that supports the classification layer’s performance.

It is crucial to recognize that most prior studies demonstrating the learnability of RNNs on counter languages such as a^{n}b^{n}, a^{n}b^{n}c^{n}, and a^{n}b^{n}c^{n}d^{n} have overlooked the significance of topological distance between positive and negative samples. Such sampling considerations are vital for a thorough understanding of RNN trainability. To address this gap, we incorporate three sampling strategies with varying levels of topological proximity between positive and negative samples, thereby challenging the RNNs to genuinely learn the counting mechanism.

By focusing on stability and fixed-point dynamics, our work offers a plausible lens through which the learnability of complex grammars in recurrent architectures can be better understood. We argue that stability, as characterized by the persistence of fixed points, is a critical factor in determining whether these models can generalize and reliably encode non-regular languages, shedding light on the inherent limitations and potentials of RNNs in such tasks.

2 Related Work

The relationship between Recurrent Neural Networks (RNNs), automata theory, and formal methods has been a focal point in understanding the computational power and limitations of neural architectures. Early studies have shown that RNNs can approximate the behavior of various automata and formal language classes, providing insights into their expressivity and learnability. [3] was one of the first to demonstrate that RNNs are capable of learning finite automata. They extracted finite state machines from trained RNNs, showing that the learned rules could be represented as deterministic automata. This foundational work laid the groundwork for subsequent studies on how RNNs encode and process state-based structures. Expanding on this, [4] showed that second-order recurrent networks, which include multiplicative interactions between inputs and hidden states, are superior state approximators compared to standard first-order RNNs. This enhancement in state representation significantly improved the networks’ ability to learn and represent complex sequences.

The analysis of RNNs’ functional capacity continued with [1, 5], who investigated the discriminant functions underlying first-order and second-order RNNs. Their results provided a deeper understanding of how these architectures utilize hidden state dynamics to implement decision boundaries and process temporal patterns. Meanwhile, the theoretical limits of RNNs were formalized by [6], who proved that RNNs are Turing Complete when equipped with infinite precision. This result implies that RNNs, in principle, can simulate any computable function, positioning them as universal function approximators.

Building on these foundational insights, recent research has aimed to identify the practical scenarios under which RNNs can achieve such theoretical expressiveness. [5] extended Turing completeness results to a second-order RNN, demonstrating that it can achieve Turing completeness in bounded time, making it relevant for real-world applications where resources are constrained. This shift towards practical expressivity has opened new avenues for applying RNNs to complex language tasks. Moving towards specific language modeling tasks, [7] explored how Long Short-Term Memory (LSTM) networks can learn context-free and context-sensitive grammars, such as a^{n}b^{n} and a^{n}b^{n}c^{n}. Their results showed that LSTMs could successfully learn these patterns, albeit with limitations in scaling to larger sequence lengths. Extending these findings, [2] established a formal hierarchy categorizing RNN variants based on their expressivity, placing LSTMs in a higher class due to their ability to simulate counter machines. This formal classification aligns with the observation that LSTMs can implement complex counting and state-tracking mechanisms, making them suitable for tasks involving nested dependencies. Further investigations into RNNs’ capacity to recognize complex languages have revealed both strengths and limitations. [8] analyzed the ability of LSTMs to learn the Dyck-1 language, which models balanced parentheses, and found that while a single LSTM neuron could learn Dyck-1, it failed to generalize to Dyck-2, a more complex language with nested dependencies. Their follow-up work [9] studied generalization on a^{n}b^{n}, a^{n}b^{n}c^{n}, and a^{n}b^{n}c^{n}d^{n} grammars, showing that performance varied significantly with sequence length sampling strategies. These findings highlight that while RNNs can theoretically represent these languages, practical training limitations impede their learnability.

Beyond traditional RNNs, the role of specific activation functions in enhancing expressivity has also been studied. [10] showed that RNNs with ReLU activations are strictly more powerful than those using standard sigmoid or tanh activations when it comes to counting tasks. This observation suggests that architectural modifications can significantly alter the network’s functional capacity. In a similar vein, [11] proposed neural network pushdown automata and neural network Turing machines, establishing a theoretical framework for integrating stacks into neural architectures, thereby enabling them to simulate complex computational models like pushdown automata and Turing machines. On the stability and generalization front, [12] compared the stability of states learned by first-order and second-order RNNs when trained on Tomita and Dyck grammars. Their results indicate that second-order RNNs are better suited for maintaining stable state representations across different grammatical tasks, which is critical for ensuring that the learned model captures the true structure of the language. Their work also explored methods for extracting deterministic finite automata (DFA) from trained networks, evaluating the effectiveness of extraction techniques like those proposed by [13] and [14]. This line of research is pivotal in understanding how well trained RNNs can be interpreted and how their internal state representations correspond to formal structures.

In terms of language generation and hierarchical structure learning, [15] demonstrated that LSTMs, when trained as language generators, can learn Dyck-(k,m) languages, which involve hierarchical and nested dependencies, drawing parallels between these formal languages and syntactic structures in natural languages. Finally, several studies have shown that the choice of objective functions and learning algorithms significantly affects RNNs’ ability to stably learn complex grammars. For instance, [16] and [17] demonstrated that specialized loss functions, such as minimum description length, lead to more stable convergence and better generalization on formal language tasks. In light of the diverse findings from the aforementioned studies, our work systematically analyzes the divergence between the theoretical expressivity of RNNs and their empirical generalization capabilities through the lens of fixed-point theory. Specifically, we investigate how different RNN architectures capture and maintain stable state representations when learning complex grammars, focusing on the role of numerical precision, learning dynamics, and model stability. By leveraging theoretical results on fixed points and state stability, we provide a unified framework to evaluate the strengths and limitations of various RNN architectures.

The remainder of the paper is organized as follows: In Section 3, we present theoretical results on the fixed points of discriminant functions, providing foundational insights into the stability properties of RNNs. Section 4 extends these results by introducing a formal framework for analyzing RNNs as counter machines, highlighting how cell update mechanisms contribute to language recognition. Section 5 focuses on the impact of numerical precision on learning dynamics and convergence. Section 6 describes the experimental setup and empirical evaluation on complex formal languages, including Dyck and Tomita grammars. Finally, Section 7 discusses the broader implications of our findings and suggests future research directions.

Figure 1: The fixed points of the discriminant function f(wx+b) are the intersection points with the line g(x)=x (solid black curve). Panels (a)-(d) show the existence of fixed points for b in the range [-8,-4] and w=13 for sigmoid (a and b) and tanh (c and d). We can observe that, in the given range, as w increases from 5 to 13, the number of fixed points increases from 1 to 3.

3 Fixed Points of Discriminant Functions

In this section, we focus on two prominent discriminant functions, sigmoid and tanh, both of which are extensively utilized in widely adopted RNN cells such as the LSTM and O2RNN.

Theorem 3.1.

BROUWER’S FIXED POINT THEOREM [18]: For any continuous mapping f: Z \rightarrow Z, where Z is a compact, non-empty convex set, there exists z_{f} \in Z such that f(z_{f}) = z_{f}.

Corollary 3.1.1.

Let f: \mathbb{R} \rightarrow \mathbb{R} be a continuous, monotonic function with a non-empty, bounded, and convex co-domain \mathbb{D} \subset \mathbb{R}. Then f has at least one fixed point, i.e., there exists some c \in \mathbb{R} such that f(c) = c.

Proof.

Since \mathbb{D} \subset \mathbb{R} is a non-empty, bounded, and convex set, let the co-domain of f be denoted as \mathbb{D} = [a,b] for some a, b \in \mathbb{R} with a < b. Consider the identity function g(x) = x, which is continuous on \mathbb{R}. The fixed points of f correspond to the points of intersection between f(x) and g(x), i.e., the solutions to the equation f(x) = g(x).

Next, observe the behavior of f outside its co-domain \mathbb{D} = [a,b]:

  • For any x < a, we have f(x) \geq a > x (since f(x) \in [a,b]), implying that f(x) > x.

  • For any x > b, we have f(x) \leq b < x, implying that f(x) < x.

By the Intermediate Value Theorem applied to the continuous function f(x) - x, since f(x) > x for some x < a and f(x) < x for some x > b, there must exist a point c \in [a,b] such that f(c) = c.

Thus, the function f has at least one fixed point in the interval [a,b].

Corollary 3.1.2.

A parameterized sigmoid function of the form \sigma(x) = \frac{1}{1+e^{-(wx+b)}}, where w, b \in \mathbb{R}, has at least one fixed point, i.e., there exists some c \in \mathbb{R} such that \sigma(c) = c.

Proof.

Consider the function \sigma(x) = \frac{1}{1+e^{-(wx+b)}}; without loss of generality, assume w > 0 (the case w \leq 0 is analogous). We want to show that \sigma(x) has at least one fixed point. A fixed point is a value c \in \mathbb{R} such that \sigma(c) = c.

First, observe that the sigmoid function \sigma(x) is continuous and strictly increasing for all x \in \mathbb{R}. The co-domain of \sigma(x) is the interval [0,1], i.e., \sigma(x) \in [0,1] for all x \in \mathbb{R}. The fixed points of \sigma are the intersections of \sigma(x) with the identity line g(x) = x.

Next, let us analyze the behavior of \sigma(x) - x as x \rightarrow -\infty and x \rightarrow +\infty:

  • As x \rightarrow -\infty, we have e^{-(wx+b)} \rightarrow \infty, which implies \sigma(x) \rightarrow 0. Therefore, \sigma(x) - x \rightarrow -x \rightarrow +\infty as x \rightarrow -\infty.

  • As x \rightarrow +\infty, we have e^{-(wx+b)} \rightarrow 0, which implies \sigma(x) \rightarrow 1. Therefore, \sigma(x) - x \rightarrow 1 - x \rightarrow -\infty as x \rightarrow +\infty.

Since \sigma(x) - x is a continuous function on \mathbb{R} and changes sign (from positive to negative) as x varies from -\infty to +\infty, by the Intermediate Value Theorem, there must exist some c \in \mathbb{R} such that:

\sigma(c) - c = 0 \quad\Rightarrow\quad \sigma(c) = c.

Hence, σ(x)\sigma(x) has at least one fixed point.

Corollary 3.1.3.

A parameterized \tanh function of the form \gamma(x) = \tanh(wx+b), where w, b \in \mathbb{R}, has at least one fixed point, i.e., there exists some c \in \mathbb{R} such that \gamma(c) = c.

Proof.

Consider the function \gamma(x) = \tanh(wx+b); as before, assume w > 0 (the case w \leq 0 is analogous). We want to show that \gamma(x) has at least one fixed point. A fixed point is a value c \in \mathbb{R} such that \gamma(c) = c.

Step 1: Properties of the Function \gamma(x). The hyperbolic tangent function, \tanh(x), is a continuous and strictly increasing function for all x \in \mathbb{R}. For any real value y, the function \tanh(y) is bounded and satisfies -1 < \tanh(y) < 1. Thus, the co-domain of \gamma(x) = \tanh(wx+b) is also bounded within [-1,1], i.e., \gamma(x) \in [-1,1] for all x \in \mathbb{R}.

Furthermore, since \tanh(x) is strictly increasing and w > 0, the function \gamma(x) = \tanh(wx+b) is also strictly increasing in x. This implies that \gamma(x) is one-to-one and continuous over \mathbb{R}.

Step 2: Analysis of \gamma(x) - x. Consider the function:

f(x) = \gamma(x) - x = \tanh(wx+b) - x.

We want to show that f(x) = 0 has at least one solution, i.e., there exists some c \in \mathbb{R} such that \tanh(wc+b) = c. To analyze the existence of such a c, let us examine the behavior of f(x) as x \rightarrow \pm\infty:

- As x \rightarrow -\infty: We have wx+b \rightarrow -\infty. Thus, \tanh(wx+b) \rightarrow -1. Therefore:

f(x) = \tanh(wx+b) - x \rightarrow -1 - x \rightarrow \infty.

- As x \rightarrow +\infty: We have wx+b \rightarrow +\infty. Thus, \tanh(wx+b) \rightarrow 1. Therefore:

f(x) = \tanh(wx+b) - x \rightarrow 1 - x \rightarrow -\infty.

Since f(x) is continuous on \mathbb{R} and changes sign from positive (as x \rightarrow -\infty) to negative (as x \rightarrow +\infty), by the Intermediate Value Theorem, there must exist some c \in \mathbb{R} such that:

f(c) = \tanh(wc+b) - c = 0.

This implies that \gamma(c) = c, i.e., \gamma(x) has at least one fixed point.

Prior work [1] has shown that the parameterized sigmoid function \sigma(x) = \frac{1}{1+e^{-(wx+b)}} has three fixed points for a given b \in ]b^{-}, b^{+}[ and w > w_{b}, for some b^{-}, b^{+}, w_{b} \in \mathbb{R} with b^{-} < b^{+}. Further, they showed that two of these fixed points are stable. In this work, we go beyond the sigmoid and show that tanh also has three fixed points.

Theorem 3.2.

A parameterized \tanh function \gamma(x) = \tanh(wx+b) has three fixed points for a given b \in ]b^{-}, b^{+}[ and w > w_{b}, for some b^{-}, b^{+}, w_{b} \in \mathbb{R} with b^{-} < b^{+}.

Proof.

We start by defining a fixed point of the function \gamma(x) = \tanh(wx+b). A fixed point x satisfies the equation:

\gamma(x) = x \quad\Rightarrow\quad \tanh(wx+b) = x.

Let us define a new function to analyze the fixed points:

f(x) = \tanh(wx+b) - x.

The fixed points of \gamma(x) are the solutions to the equation f(x) = 0. We will analyze f(x) in detail to determine the number of solutions.

Step 1: Properties of f(x)
The function f(x) = \tanh(wx+b) - x is continuous and differentiable. We start by computing its derivative:

f^{\prime}(x) = \frac{d}{dx}\left(\tanh(wx+b) - x\right) = w \cdot \text{sech}^{2}(wx+b) - 1,

where \text{sech}(y) = \frac{2}{e^{y}+e^{-y}} is the hyperbolic secant function. The value of \text{sech}^{2}(wx+b) satisfies 0 < \text{sech}^{2}(y) \leq 1. Thus:

f^{\prime}(x) = w \cdot \text{sech}^{2}(wx+b) - 1.

Step 2: Critical Points of f(x)
The critical points occur when f^{\prime}(x) = 0:

w \cdot \text{sech}^{2}(wx+b) - 1 = 0 \quad\Rightarrow\quad \text{sech}^{2}(wx+b) = \frac{1}{w}.

Since 0 < \text{sech}^{2}(wx+b) \leq 1, the above equation has a real solution if and only if:

w > 1.

For w > 1, there are exactly two critical points, x_{1} and x_{2}, such that x_{1} < x_{2}.

Step 3: Behavior of f(x) as x \rightarrow \pm\infty
As x \rightarrow \infty, wx+b \rightarrow \infty for w > 0. Thus, \tanh(wx+b) \rightarrow 1. Hence:

f(x) = \tanh(wx+b) - x \rightarrow 1 - x \quad\text{as}\quad x \rightarrow \infty.

Therefore, f(x) \rightarrow -\infty as x \rightarrow \infty.

As x \rightarrow -\infty, wx+b \rightarrow -\infty for w > 0. Thus, \tanh(wx+b) \rightarrow -1. Hence:

f(x) = \tanh(wx+b) - x \rightarrow -1 - x \quad\text{as}\quad x \rightarrow -\infty.

Therefore, f(x) \rightarrow \infty as x \rightarrow -\infty.

Step 4: Intermediate Value Theorem
The Intermediate Value Theorem tells us that since f(x) is continuous and changes sign from \infty to -\infty, it must have at least one root. Thus, there is at least one fixed point for \gamma(x).

Step 5: Conditions for Three Fixed Points
We want to show that for specific values of b and w, the function f(x) = \tanh(wx+b) - x has exactly three roots. To do so, we analyze f(x) in detail around its critical points.

1. Critical Points Analysis:

Recall that the critical points of f(x) are given by:

w \cdot \text{sech}^{2}(wx+b) - 1 = 0 \quad\Rightarrow\quad \text{sech}^{2}(wx+b) = \frac{1}{w}.

Let y = wx+b. Then the critical points y_{1} and y_{2} satisfy:

\text{sech}^{2}(y_{1}) = \frac{1}{w}, \quad \text{sech}^{2}(y_{2}) = \frac{1}{w}.

Solving for y, we get:

y_{1} = -\cosh^{-1}(\sqrt{w}), \quad y_{2} = +\cosh^{-1}(\sqrt{w}) = -y_{1}.

Converting back to x:

x_{1} = \frac{y_{1}-b}{w}, \quad x_{2} = \frac{y_{2}-b}{w}.

2. Local Minima and Maxima Analysis:

At these critical points, the second derivative f^{\prime\prime}(x) determines whether f(x) has a local minimum or maximum:

f^{\prime\prime}(x) = -2w^{2} \cdot \text{sech}^{2}(wx+b)\tanh(wx+b).

Analyzing f^{\prime\prime}(x_{1}) and f^{\prime\prime}(x_{2}), we find that f^{\prime\prime}(x_{1}) > 0 and f^{\prime\prime}(x_{2}) < 0, so x_{1} corresponds to a local minimum and x_{2} to a local maximum.

3. Behavior of f(x) in the Range ]b^{-}, b^{+}[:

For b \in ]b^{-}, b^{+}[ and w > w_{b}, the local minimum value f(x_{1}) is negative and the local maximum value f(x_{2}) is positive, so f(x) changes sign three times, indicating three distinct zeros.

Thus, for w > w_{b} and b \in ]b^{-}, b^{+}[, the function \gamma(x) = \tanh(wx+b) has exactly three fixed points.

This can be visualized in Figure 1(c, d). Let b \in [-8,-4]; we can observe that \gamma(x) meets g(x) = x three times for w = 13, while it meets g(x) = x only once for w = 5. Since the transition of \gamma only becomes steeper as w increases, \gamma(x) retains 3 fixed points for all w \geq 13 in this range.
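As a sanity check, the fixed points of \gamma(x) = \tanh(wx+b) can be counted numerically by locating the sign changes of f(x) = \tanh(wx+b) - x. The following small sketch is ours (not code from the paper) and uses the assumed value b = -6, inside the range [-8,-4] of Figure 1; it reproduces one fixed point for w = 5 and three for w = 13.

```python
import numpy as np

def tanh_fixed_points(w, b, lo=-1.5, hi=1.5, num=2_000_001):
    """Approximate the fixed points of tanh(w*x + b) via sign changes of f(x) = tanh(w*x+b) - x."""
    grid = np.linspace(lo, hi, num)
    f = np.tanh(w * grid + b) - grid
    brackets = np.where(np.diff(np.sign(f)) != 0)[0]      # indices where f changes sign
    roots = []
    for i in brackets:                                     # refine each bracket by bisection
        a, c = grid[i], grid[i + 1]
        for _ in range(60):
            m = 0.5 * (a + c)
            if (np.tanh(w * a + b) - a) * (np.tanh(w * m + b) - m) <= 0:
                c = m
            else:
                a = m
        roots.append(0.5 * (a + c))
    return np.array(roots)

for w in (5, 13):
    fps = tanh_fixed_points(w, b=-6.0)
    print(f"w = {w:2d}, b = -6: {len(fps)} fixed point(s) near {np.round(fps, 4)}")
```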

Next, we show that of the three fixed points of tanh, two are stable.

Theorem 3.3.

If a parameterized \tanh function \gamma(x) = \tanh(wx+b) has three fixed points \xi^{-}, \xi^{0}, \xi^{+} such that -1 < \xi^{-} < \xi^{0} < \xi^{+} < 1, then \xi^{-} and \xi^{+} are stable fixed points.

Proof.

Let us start by defining a fixed point of the function \gamma(x) = \tanh(wx+b). A point x is a fixed point if:

\gamma(x) = x \quad\Rightarrow\quad \tanh(wx+b) = x.

We are given that there are three fixed points \xi^{-}, \xi^{0}, \xi^{+} such that:

-1 < \xi^{-} < \xi^{0} < \xi^{+} < 1.

Step 1: Stability Criterion for Fixed Points. A fixed point \xi is considered stable if the magnitude of the derivative of \gamma(x) at \xi is less than 1, i.e.,

|\gamma^{\prime}(\xi)| < 1.

Conversely, a fixed point is unstable if:

|\gamma^{\prime}(\xi)| > 1.

Step 2: Derivative of the Function \gamma(x). We compute the derivative of \gamma(x) = \tanh(wx+b):

\gamma^{\prime}(x) = \frac{d}{dx}\left(\tanh(wx+b)\right).

Recall that the derivative of the hyperbolic tangent function is:

\frac{d}{dx}\tanh(x) = \text{sech}^{2}(x),

where \text{sech}(x) = \frac{2}{e^{x}+e^{-x}}. Using the chain rule, we obtain:

\gamma^{\prime}(x) = w \cdot \text{sech}^{2}(wx+b).

Thus, at a fixed point \xi, the derivative is:

\gamma^{\prime}(\xi) = w \cdot \text{sech}^{2}(w\xi+b).

Step 3: Stability Analysis at Each Fixed Point. We will now analyze the derivative at each of the three fixed points to determine their stability. Recall (cf. Step 3 in the proof of Theorem 3.2) that f(x) = \gamma(x) - x is positive as x \rightarrow -\infty and negative as x \rightarrow +\infty, so across the three roots f changes sign from positive to negative at \xi^{-}, from negative to positive at \xi^{0}, and from positive to negative again at \xi^{+}.

1. Middle Fixed Point \xi^{0}:

At \xi^{0} the curve \gamma(x) crosses the line y = x from below to above, so \gamma^{\prime}(\xi^{0}) - 1 > 0. Intuitively, w\xi^{0}+b lies in the steep central region of tanh, where \text{sech}^{2}(w\xi^{0}+b) > \frac{1}{w}, making |\gamma^{\prime}(\xi^{0})| > 1. Thus, \xi^{0} is an unstable fixed point.

2. Leftmost Fixed Point \xi^{-}:

Consider the derivative at the leftmost fixed point \xi^{-}:

\gamma^{\prime}(\xi^{-}) = w \cdot \text{sech}^{2}(w\xi^{-}+b).

At \xi^{-} the curve crosses the line y = x from above to below, so \gamma^{\prime}(\xi^{-}) - 1 < 0; equivalently, w\xi^{-}+b lies in the saturated tail of tanh, where \text{sech}^{2}(w\xi^{-}+b) < \frac{1}{w}. Since \gamma^{\prime} is also positive, we have:

|\gamma^{\prime}(\xi^{-})| < 1.

This implies that the fixed point \xi^{-} is stable.

3. Rightmost Fixed Point \xi^{+}:

Similarly, for the rightmost fixed point \xi^{+}:

\gamma^{\prime}(\xi^{+}) = w \cdot \text{sech}^{2}(w\xi^{+}+b).

At \xi^{+} the curve again crosses the line y = x from above to below, so \text{sech}^{2}(w\xi^{+}+b) < \frac{1}{w}, leading to:

|\gamma^{\prime}(\xi^{+})| < 1.

This means that the fixed point \xi^{+} is stable.

Thus we have shown that for the three fixed points \xi^{-}, \xi^{0}, and \xi^{+} of the function \gamma(x) = \tanh(wx+b):

\xi^{0} is an unstable fixed point because |\gamma^{\prime}(\xi^{0})| > 1, while \xi^{-} and \xi^{+} are stable fixed points because |\gamma^{\prime}(\xi^{-})| < 1 and |\gamma^{\prime}(\xi^{+})| < 1.

Thus, the theorem is proven. ∎

We can make the following observations about the fixed points of sigmoid and tanh functions:

  • If one fixed point exists, then it is a stable fixed point.

  • If two fixed points exist, then one is stable and the other is unstable.

  • If three fixed points exist, then two are stable and one is unstable.

More details regarding these observations can be found in the Appendix.
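These observations can also be checked numerically. The following sketch is ours (not from the paper) and uses the assumed parameters w = 13 and b = -6 from the three-fixed-point regime of Figure 1(d); it classifies each fixed point via the criterion |\gamma^{\prime}(\xi)| = w\,\text{sech}^{2}(w\xi+b).

```python
import numpy as np

w, b = 13.0, -6.0  # assumed parameters in the three-fixed-point regime (cf. Figure 1d)

# Locate fixed points of gamma(x) = tanh(w*x + b) as sign changes of tanh(w*x+b) - x.
xs = np.linspace(-1.5, 1.5, 2_000_001)
f = np.tanh(w * xs + b) - xs
idx = np.where(np.diff(np.sign(f)) != 0)[0]
fixed_points = 0.5 * (xs[idx] + xs[idx + 1])   # midpoint of each bracketing interval

# Stability test: |gamma'(xi)| = w * sech^2(w*xi + b) < 1  =>  stable.
for xi in fixed_points:
    slope = w / np.cosh(w * xi + b) ** 2
    kind = "stable" if abs(slope) < 1 else "unstable"
    print(f"xi = {xi:+.4f}   |gamma'(xi)| = {slope:.4f}   -> {kind}")
```

Running this prints two stable fixed points near the boundaries and one unstable fixed point in between, matching the third observation above.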

4 Counter Machines

Counter machines [19] are abstract machines composed of finite state automata controlling one or more counters. A counter can either increment (+1), decrement (-1, if > 0), clear (\times 0), or do nothing (+0). A counter machine can be formally defined as:

Definition 4.1.

A counter machine (CM) is a 7-tuple (Q, \Sigma, q_{0}, F, \delta, \gamma, \mathbf{1}_{=0}) where

  • \Sigma is a finite alphabet

  • Q is a set of states with q_{0} \in Q as the initial state

  • F \subset Q is a set of accepting states.

  • \mathbf{1}_{=0} checks the state of the counter and returns 1 if the counter is zero, else returns 0

  • \gamma is the counter update function defined as:

    \gamma: \Sigma \times Q \times \mathbf{1}_{=0} \rightarrow \{\times 0, -1, +0, +1\} (1)

  • \delta is the state transition function defined as:

    \delta: \Sigma \times Q \times \mathbf{1}_{=0} \rightarrow Q (2)

A counter machine accepts a string if, at the end of the input, either the final state is in F or the counter is 0.
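For illustration, a minimal single-counter machine of this kind, specialized to a^{n}b^{n}, can be written as follows (our own sketch following Definition 4.1, with acceptance taken as "counter is zero at the end of the input"):

```python
def accepts_anbn(s: str) -> bool:
    """Single-counter machine for a^n b^n (n >= 1): +1 on 'a', -1 on 'b',
    reject on any out-of-order symbol, accept iff the counter ends at zero."""
    counter = 0
    state = "reading_a"          # finite control: reading_a -> reading_b
    for ch in s:
        if state == "reading_a":
            if ch == "a":
                counter += 1     # counter update +1
            elif ch == "b" and counter > 0:
                counter -= 1     # counter update -1 (only allowed if counter > 0)
                state = "reading_b"
            else:
                return False
        else:  # state == "reading_b"
            if ch == "b" and counter > 0:
                counter -= 1
            else:
                return False     # an 'a' after a 'b', or too many 'b's
    return state == "reading_b" and counter == 0

assert accepts_anbn("aaabbb") and not accepts_anbn("aabbb") and not accepts_anbn("abab")
```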

5 Learning to accept a^{n}b^{n} with LSTM

The LSTM [20] is a gated RNN cell. The LSTM state is a tuple (h, c), where h is popularly known as the hidden state and c is known as the cell state.

i_{t} = \sigma(W_{i}x_{t} + U_{i}h_{t-1} + b_{i}) (3)
f_{t} = \sigma(W_{f}x_{t} + U_{f}h_{t-1} + b_{f}) (4)
o_{t} = \sigma(W_{o}x_{t} + U_{o}h_{t-1} + b_{o}) (5)
\tilde{c}_{t} = \tanh(W_{c}x_{t} + U_{c}h_{t-1} + b_{c}) (6)
c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \tilde{c}_{t} (7)
h_{t} = o_{t} \odot \tanh(c_{t}) (8)
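For concreteness, Eqs. (3)-(8) amount to the following single-step update (a NumPy sketch of ours with illustrative shapes and randomly drawn parameters; torch.nn.LSTMCell implements the same equations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update following Eqs. (3)-(8). params holds (W_*, U_*, b_*) per gate."""
    i = sigmoid(params["Wi"] @ x + params["Ui"] @ h_prev + params["bi"])   # input gate, Eq. (3)
    f = sigmoid(params["Wf"] @ x + params["Uf"] @ h_prev + params["bf"])   # forget gate, Eq. (4)
    o = sigmoid(params["Wo"] @ x + params["Uo"] @ h_prev + params["bo"])   # output gate, Eq. (5)
    c_tilde = np.tanh(params["Wc"] @ x + params["Uc"] @ h_prev + params["bc"])  # Eq. (6)
    c = f * c_prev + i * c_tilde        # Eq. (7): cell state accumulates the "count"
    h = o * np.tanh(c)                  # Eq. (8): hidden state is bounded in (-1, 1)
    return h, c

# Toy usage with random parameters (hidden size d = 4, input size m = 3).
rng = np.random.default_rng(0)
d, m = 4, 3
params = {k: rng.normal(0, 0.1, size=(d, m) if k[0] == "W" else ((d, d) if k[0] == "U" else d))
          for k in ["Wi", "Ui", "bi", "Wf", "Uf", "bf", "Wo", "Uo", "bo", "Wc", "Uc", "bc"]}
h, c = lstm_step(rng.normal(size=m), np.zeros(d), np.zeros(d), params)
```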

A typical binary classification network with an LSTM cell is composed of two parts:

  1. The encoder recurrent network:

     (h,c)_{t+1} = LSTM(x, (h,c)_{t}) (9)

  2. A classification layer, usually a single perceptron layer followed by a sigmoid:

     p = \sigma(W h_{\tau} + b) (10)

     In the case of multiclass classification, \sigma is replaced by the softmax function.

Following the construction of [2], we can draw parallels between the workings of a counter machine and the LSTM cell. Here i_{t} decides whether to execute +0, while \{+1, -1\} are decided by \tilde{c}_{t}. To execute \times 0, both f_{t} and i_{t} need to be 0.

Also, the cell state and hidden state of the LSTM have tanh as the discriminant function. In the case of a^{n}b^{n}, a continuous stream of a is followed by an equal number of b, which creates an iterative execution of the LSTM cell, pushing the output of the discriminant functions closer to their fixed points. For maximum learnability, we can assume \tilde{c}_{t} \in \{\xi^{-}, \xi^{+}\}. Thus the maximum final state values for a^{n}b^{n} will be:

c_{a^{n}b^{n}} = n(\xi^{+} + \xi^{-}) (11)
h_{a^{n}b^{n}} = \tanh(n(\xi^{+} + \xi^{-})) (12)

From the above equations we can see that, while the cell state is unbounded, the hidden state is bounded in the range ]-1, 1[. In our experiments we see that the hidden state saturates to boundary values faster than is required to maintain the count. Formally, for some fairly moderate \alpha and \beta we can reach a point where |\tanh(\alpha\xi^{+} + \beta\xi^{-}) - \tanh(\alpha\xi^{+} + (\beta+1)\xi^{-})| < \epsilon, where \epsilon is the precision of the classification layer.

Saturation of the hidden state is desirable from the perspective of consistent calculation of counter updates: in the LSTM cell, a saturated hidden state means more stable gates, which in turn leads to a more consistent cell state. However, from the perspective of the classification layer, a saturated hidden state does not offer much information for robust classification.
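A quick numerical illustration of this precision argument (our own toy example; \xi = 0.98 is an assumed per-symbol increment rather than a learned fixed point, and float32 machine epsilon stands in for the classifier's precision \epsilon):

```python
import numpy as np

xi = 0.98                                 # assumed stable fixed-point value pushed per 'a'
eps32 = np.finfo(np.float32).eps          # ~1.19e-07: resolution around 1.0 in float32

n = 1
while abs(np.tanh(n * xi) - np.tanh((n + 1) * xi)) >= eps32:
    n += 1
print(f"tanh(n*xi) and tanh((n+1)*xi) collide at float32 precision for n = {n}")
# -> roughly n = 9 here: past that count, the bounded hidden state can no longer
#    distinguish a count of n from n + 1, even though the cell state keeps growing.
```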

6 Precision of Neural Network

Numerical precision plays an important role in the partition of the feature space by the classifier network, especially when the final hidden state from the RNN either collapses towards 0 or saturates asymptotically to the boundary values.

Theorem 6.1.

Given a neural network layer with an input vector \mathbf{H} \in \mathbb{R}^{n}, a weight matrix \mathbf{W} \in \mathbb{R}^{n \times m}, a bias vector \mathbf{b} \in \mathbb{R}^{m}, and a sigmoid activation function \sigma, the output of the layer is defined by f(\mathbf{H}) = \sigma(\mathbf{H}\cdot\mathbf{W} + \mathbf{b}). The capacity of this layer to encode information is influenced by both the precision of the floating-point representation and the dynamical properties of the sigmoid function.

Let \epsilon be the machine epsilon, which represents the difference between 1 and the least value greater than 1 that is representable in the floating-point system used by the network. Assume the elements of \mathbf{W} and \mathbf{b} are drawn from a Gaussian distribution and are fixed post-initialization.

Then, the following bounds hold for the output of the network layer:

  1. The granularity of the output is limited by \epsilon, such that for any element h_{i} in \mathbf{H} and corresponding weight w_{ij} in \mathbf{W}, the difference in the layer’s output due to a change in h_{i} or w_{ij} of less than \epsilon may be imperceptible.

  2. For z = \mathbf{H}\cdot\mathbf{W} + \mathbf{b}, the sigmoid function \sigma(z) saturates to 1 as z \to \infty and to 0 as z \to -\infty. The saturation points occur approximately at z > \log(\frac{1}{\epsilon}) and z < -\log(\frac{1}{\epsilon}), respectively.

  3. The precision of the network’s output is governed by the stable fixed points of the sigmoid function, which occur when \sigma(z) stabilizes at values near 0 or 1. If the dynamics of the network converge to one or more stable fixed points, the effective precision is reduced because minor variations in the input will not significantly alter the output.

  4. When three fixed points exist for the sigmoid function, two stable and one unstable, information encoding can become confined to the stable fixed points. This behavior causes the network to collapse to a discrete set of values, reducing its effective resolution.

  5. Therefore, the maximum discrimination in the output is not only limited by \epsilon but also by the attraction of the stable fixed points. The effective precision is bounded by both 1 - 2\epsilon and the dynamics that collapse the output towards these stable points.

The detailed proof can be found in the appendix.
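As a rough numerical companion to points 1 and 2 above (our sketch, not the paper's code), machine epsilon and the corresponding saturation threshold \log(1/\epsilon) can be computed directly:

```python
import numpy as np

for dtype in (np.float32, np.float64):
    eps = np.finfo(dtype).eps
    z_sat = np.log(1.0 / eps)              # point 2: sigma(z) is within eps of 1 for z > log(1/eps)
    print(f"{dtype.__name__}: eps = {eps:.2e}, saturation threshold |z| ~ {z_sat:.1f}")
    # e.g. float32: for |z| above ~15.9, sigma(z) differs from 0 or 1 by less than eps,
    # so the classifier effectively sees a saturated, near-binary output.
```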

Theorem 6.2.

Consider a recurrent neural network (RNN) with fixed weights and the hidden state update rule given by:

\mathbf{h}_{t+1} = \tanh(\mathbf{W}\mathbf{h}_{t} + \mathbf{U}\mathbf{x}_{t} + \mathbf{b}),

where \mathbf{W} \in \mathbb{R}^{d \times d}, \mathbf{U} \in \mathbb{R}^{d \times m}, \mathbf{b} \in \mathbb{R}^{d}, and \mathbf{x}_{t} represents the input symbol. Given a bounded sequence length N, the RNN can encode sequences of the form a^{n}b^{n}c^{n} by exploiting state dynamics that converge to distinct, stable fixed points in the hidden state space for each symbol. The expressivity of the RNN, equivalent to a deterministic finite automaton (DFA), enables the encoding of such grammars purely through state dynamics.

The detailed proof can be found in the appendix.

Theorem 6.3.

An RNN with fixed random weights and a trainable sigmoid layer has sufficient capacity to encode complex grammars. Despite the randomness of the recurrent layer, the network can still classify sequences of the form a^{n}b^{n}c^{n} by leveraging the distinct distributions of the hidden states induced by the input symbols. The classification layer learns to map the hidden states to the correct sequence class, even for bounded sequence lengths N.

Proof.

The proof proceeds in three steps: (1) analyzing the hidden state dynamics in the presence of random fixed weights, (2) demonstrating that distinct classes (e.g., a^{n}b^{n}c^{n}) can still be linearly separable based on the hidden states, and (3) showing that the classification layer can be trained to distinguish these hidden state patterns.

1. Hidden State Dynamics with Fixed Random Weights:

Consider an RNN with hidden state \mathbf{h}_{t} \in \mathbb{R}^{d} updated as:

\mathbf{h}_{t+1} = \tanh(\mathbf{W}\mathbf{h}_{t} + \mathbf{U}\mathbf{x}_{t} + \mathbf{b}),

where \mathbf{W} \in \mathbb{R}^{d \times d} and \mathbf{U} \in \mathbb{R}^{d \times m} are randomly initialized and fixed. The hidden state dynamics in this case are governed by the random projections imposed by \mathbf{W} and \mathbf{U}.

Although the weights are random, the hidden state \mathbf{h}_{t} still carries information about the input sequence. Specifically, different sequences (e.g., a^{n}, b^{n}, and c^{n}) induce distinct trajectories in the hidden state space. These trajectories are not arbitrary but depend on the input symbols, even under random weights.

2. Distinguishability of Hidden States for Different Sequence Classes:

Despite the randomness of the weights, the hidden state distributions for different sequences remain distinguishable. For example, the hidden states after processing a^{n} tend to cluster in a specific region of the state space, forming a characteristic distribution; similarly, the hidden states after processing b^{n} and c^{n} will occupy different regions.

These clusters may not correspond to single fixed points as in the trained RNN case, but they still form distinct, linearly separable patterns in the high-dimensional space.

3. Training the Classification Layer:

The classification layer is a fully connected layer that maps the final hidden state \mathbf{h}_{N} to the output class (e.g., “class 1” for a^{n}b^{n}c^{n}). The classification layer is trained using a supervised learning approach, typically minimizing a cross-entropy loss.

Because the hidden states exhibit distinct distributions for different sequences, the classification layer can learn to separate these distributions. In high-dimensional spaces, even random projections (as induced by the random recurrent weights) create enough separation for the classification layer to distinguish between different classes.

Thus, even with random fixed weights, the hidden state dynamics create distinguishable patterns for different input sequences. The classification layer, which is the only trained component, leverages these patterns to correctly classify sequences like a^{n}b^{n}c^{n}. This demonstrates that the RNN’s expressivity remains sufficient for the classification task, despite the randomness in the recurrent layer.
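A minimal PyTorch sketch of this "classifier-only" setting (our construction; the hyperparameters are illustrative and not the paper's): the embedding and recurrent weights are frozen at their random initialization, and only the final linear/sigmoid layer receives gradients.

```python
import torch
import torch.nn as nn

class FrozenRNNClassifier(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size, nonlinearity="tanh", batch_first=True)
        self.classifier = nn.Linear(hidden_size, 1)    # Eq. (10); sigmoid applied inside the loss
        # Freeze everything except the classification layer.
        for p in list(self.embed.parameters()) + list(self.rnn.parameters()):
            p.requires_grad_(False)

    def forward(self, tokens):                          # tokens: (batch, seq_len) int64 symbol ids
        _, h_n = self.rnn(self.embed(tokens))
        return self.classifier(h_n[-1]).squeeze(-1)     # logits; pair with BCEWithLogitsLoss

model = FrozenRNNClassifier(vocab_size=3, hidden_size=6)
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

x = torch.randint(0, 3, (128, 12))                      # toy batch of symbol ids
y = torch.randint(0, 2, (128,)).float()                 # toy binary labels
loss = loss_fn(model(x), y)
loss.backward()                                         # gradients flow only into the classifier
opt.step()
```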

7 Experiment Setup

| grammars | sdim | lstm (all layers) | lstm (classifier-only) | o2rnn (all layers) | o2rnn (classifier-only) |
|---|---|---|---|---|---|
| Dyck-1 | 2 | 85.95 / 82.88 ± 2.54 | 73.89 / 72.3 ± 1.28 | 83.38 / 80.57 ± 2.37 | 63.88 / 63.85 ± 0.02 |
| Dyck-2 | 4 | 98.65 / 87.35 ± 8.3 | 72.99 / 70.05 ± 3.35 | 86.85 / 82.57 ± 2.11 | 63.67 / 60.75 ± 2.41 |
| Dyck-4 | 8 | 99.2 / 86.57 ± 8.01 | 71.93 / 69.58 ± 3.71 | 99.11 / 94.55 ± 0.85 | 63.71 / 59.88 ± 2.44 |
| Dyck-6 | 12 | 97.45 / 87.34 ± 8.85 | 72.27 / 68.64 ± 2.24 | 99.54 / 98.4 ± 1.42 | 62.74 / 60.33 ± 1.79 |
| a^{n}b^{n}c^{n} | 6 | 98.13 / 90.17 ± 15.30 | 81.09 / 78.60 ± 1.54 | 97.86 / 97.27 ± 0.35 | 69.71 / 69.66 ± 0.03 |
| a^{n}b^{n}c^{n}d^{n} | 8 | 98.33 / 90.45 ± 14.22 | 80.95 / 79.59 ± 0.97 | 97.24 / 96.11 ± 2.43 | 71.8 / 71.74 ± 0.03 |
| a^{n}b^{m}a^{m}b^{n} | 8 | 99.83 / 98.05 ± 4.65 | 70.83 / 69.69 ± 0.74 | 99.76 / 99.58 ± 0.19 | 58.7 / 58.08 ± 0.48 |
| a^{n}b^{m}a^{m}b^{m} | 8 | 99.93 / 99.64 ± 0.56 | 73.25 / 70.59 ± 1.18 | 99.43 / 99.13 ± 0.38 | 58.67 / 56.95 ± 2.58 |

Each cell reports max / mean ± std test accuracy over 10 seeds.

Table 1: Performance comparison of RNNs trained with all layers and when trained with all weights frozen except the classifier, with hard 1 negative string sampling.

Models

We evaluate the performance of two types of Recurrent Neural Networks (RNNs): Long Short-Term Memory (LSTM) networks [20] and Second-Order Recurrent Neural Networks (O2RNNs) [21]. The LSTM is considered a first-order RNN since its weight tensors are second-order matrices, whereas the O2RNN utilizes third-order weight tensors for state transitions, making it a second-order RNN. The state update for the O2RNN is defined as follows:

h_{i}^{(t)} = \sum_{j,k} w_{ijk} x^{(t)}_{j} h^{(t-1)}_{k} + b_{i},

where w_{ijk} is a third-order tensor that models the interactions between the input vector x^{(t)} and the previous hidden state h^{(t-1)}, and b_{i} is the bias term. All models consist of a single recurrent layer followed by a sigmoid activation layer for binary classification, as defined in Equation 10.
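A minimal PyTorch sketch of this second-order cell (our own construction following the state update above; the N(0, 0.1) weight and 0.01 bias values mirror the initialization described later in this section, and implementations often wrap the update in a squashing activation):

```python
import torch
import torch.nn as nn

class O2RNNCell(nn.Module):
    """Second-order recurrent cell: h_i(t) = sum_{j,k} w_{ijk} x_j(t) h_k(t-1) + b_i."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(hidden_size, input_size, hidden_size) * 0.1)  # ~N(0, 0.1)
        self.b = nn.Parameter(torch.full((hidden_size,), 0.01))                          # bias = 0.01

    def forward(self, x, h_prev):
        # x: (batch, input_size), h_prev: (batch, hidden_size) -> (batch, hidden_size)
        return torch.einsum("ijk,bj,bk->bi", self.W, x, h_prev) + self.b

cell = O2RNNCell(input_size=3, hidden_size=8)
h = torch.zeros(4, 8)
for x_t in torch.eye(3)[torch.randint(0, 3, (5, 4))]:   # 5 steps of one-hot inputs, batch of 4
    h = cell(x_t, h)
```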

Datasets

We conduct experiments on eight different formal languages, divided into two categories: Dyck languages and counter languages. The Dyck languages include Dyck-1, Dyck-2, Dyck-4, and Dyck-6, which vary in the complexity and depth of nested dependencies. The counter languages include a^{n}b^{n}c^{n}, a^{n}b^{n}c^{n}d^{n}, a^{n}b^{m}a^{m}b^{n}, and a^{n}b^{m}a^{m}b^{m}. Each language requires the network to learn specific counting or hierarchical patterns, posing unique challenges for generalization.

The number of neurons used in the hidden state for each RNN configuration is summarized in Table 1. To ensure robustness and a fair comparison, all models were trained on sequences with lengths ranging from 1 to 40 and tested on sequences of lengths ranging from 41 to 500, thereby evaluating their generalization capability on longer and more complex sequences.

Training and Testing Methodology

Since the number of possible sequences grows exponentially with length (for a sequence of length l, there are 2^{l} possible combinations), we sampled sequences using an inverse exponential distribution over length, ensuring a balanced representation of short and long strings during training. Each model was trained to predict whether a given sequence is a positive example (belongs to the target language) or a negative example (does not follow the grammatical rules of the language).
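One plausible reading of this length-sampling scheme (an assumption on our part; the decay rate lam below is arbitrary and not taken from the paper) is to weight each training length l in [1, 40] by an exponentially decaying factor and then sample lengths from the normalized distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, lam = 40, 0.05                        # lam is an assumed decay rate, not from the paper
lengths = np.arange(1, max_len + 1)
probs = np.exp(-lam * lengths)                 # inverse-exponential weighting over length
probs /= probs.sum()

sampled = rng.choice(lengths, size=10_000, p=probs)
print("mean sampled length:", sampled.mean())  # shorter strings are drawn more often than longer ones
```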

For all eight languages, positive examples are inherently sparse in the overall sample space. This sparsity makes the generation of negative samples crucial to ensure a challenging and informative training set. We generated three different datasets, each using a distinct strategy for sampling negative examples:

  1. Hard 0 (Random Sampling): Negative samples were randomly generated from the sample space without any structural similarity to positive samples. This method creates a broad variety of negatives, but many of these are trivially distinguishable, providing limited learning value for more sophisticated models.

  2. Hard 1 (Edit Distance Sampling): Negative samples were constructed based on their string edit distance from positive examples. Specifically, for a sequence of length l, we generated negative strings that have a maximum edit distance of 0.25l. This approach ensures that negative samples are structurally similar to positive ones, making it challenging for the model to differentiate them based solely on surface-level patterns.

  3. Hard 2 (Topological Proximity Sampling): Negative samples were generated using topological proximity to positive strings, based on the structural rules of the language. For instance, in the counter language a^{n}b^{n}c^{n}, a potential negative string could be a^{n-1}b^{n+1}c^{n}, which maintains a similar overall structure but violates the language’s grammatical constraints. This method ensures that the negative samples are more nuanced, requiring the model to maintain precise state transitions and counters to correctly classify them. (An illustrative sketch of all three strategies is given after this list.)
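To make the three strategies concrete, here is an illustrative generator for a^{n}b^{n}c^{n} (our own simplified sketch; the paper's actual edit-distance and proximity generators may differ in detail):

```python
import random

def positive(n):                       # a^n b^n c^n
    return "a" * n + "b" * n + "c" * n

def hard0_negative(length, alphabet="abc"):
    """Hard 0: random string over the alphabet (rejected if it happens to be positive)."""
    while True:
        s = "".join(random.choice(alphabet) for _ in range(length))
        if not (len(s) % 3 == 0 and s == positive(len(s) // 3)):
            return s

def hard1_negative(n):
    """Hard 1: corrupt a positive string with up to 0.25*l random single-character edits."""
    s = list(positive(n))
    max_edits = max(1, len(s) // 4)
    for _ in range(random.randint(1, max_edits)):
        s[random.randrange(len(s))] = random.choice("abc")
    return "".join(s)   # (may occasionally reproduce the positive string; a real generator would reject it)

def hard2_negative(n):
    """Hard 2: perturb one block count by one, e.g. a^{n-1} b^{n+1} c^n."""
    i, j = random.sample([0, 1, 2], 2)          # which block loses / gains one symbol
    counts = [n, n, n]
    counts[i] -= 1
    counts[j] += 1
    return "a" * counts[0] + "b" * counts[1] + "c" * counts[2]

print(positive(3), hard0_negative(9), hard1_negative(3), hard2_negative(3), sep="\n")
```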

Training Details: For reproducibility and stability, we train each model over 10 seeds and report the mean, standard deviation, and maximum accuracy over the test set. We use a stochastic gradient descent optimizer for a maximum of 100,000 iterations. We employ a batch size of 128 and a learning rate of 0.01. Validation is run every 100 iterations, and training is stopped if the validation loss does not improve for 7000 consecutive iterations. All models use binary cross entropy loss as the optimization function.

We use uniform random initialization (\mathcal{U}(-\sqrt{k}, \sqrt{k}), where k is the hidden size) for LSTM weights and normal initialization (\mathcal{N}(0, 0.1)) for the O2RNN in all experiments, except Figure 4, which compares the performance of LSTM and O2RNN on a^{n}b^{n}c^{n}d^{n} with the following initialization strategies:

  1. uniform initialization: w \sim \mathcal{U}(-0.1, 0.1)

  2. orthogonal initialization: gain = 1

  3. sparse initialization: sparsity = 0.1, all non-zero w_{ij} are sampled from \mathcal{N}(0, 0.01)

All the biases are initialized with a constant value of 0.01.
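In PyTorch terms, these strategies correspond roughly to the following initializers (our mapping, given as an assumption; note that nn.init.sparse_ interprets sparsity as the fraction of elements zeroed per column, and the standard deviation is taken here as 0.01 following the text):

```python
import torch
import torch.nn as nn

hidden = 8
w_uniform, w_orth, w_sparse = (torch.empty(hidden, hidden) for _ in range(3))

nn.init.uniform_(w_uniform, a=-0.1, b=0.1)          # 1. uniform: w ~ U(-0.1, 0.1)
nn.init.orthogonal_(w_orth, gain=1.0)               # 2. orthogonal initialization, gain = 1
nn.init.sparse_(w_sparse, sparsity=0.1, std=0.01)   # 3. sparse: 10% of each column zeroed,
                                                    #    non-zeros drawn from N(0, 0.01)
bias = torch.empty(hidden)
nn.init.constant_(bias, 0.01)                       # all biases initialized to 0.01
```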

Tables 1 and 2 show results comparing models trained in two different ways: (1) all layers: all layers of the model are trained, and (2) classifier-only: the weights of the RNN cells are frozen after random initialization and only the classifier is trained.

We use Nvidia 2080ti GPUs to run our experiments, with training times varying from under 15 minutes for a simpler dataset like Dyck-1 on the O2RNN, to over 60 minutes for a counter language on the LSTM. In total, we train 700 models for our main results, with over 400 hours of cumulative GPU training time.

8 Results and Discussion

| grammar | neg set | lstm (all layers) | lstm (classifier-only) | o2rnn (all layers) | o2rnn (classifier-only) |
|---|---|---|---|---|---|
| a^{n}b^{n}c^{n} | hard 0 | 99.92 / 99.28 ± 0.29 | 96.13 / 93.14 ± 2.53 | 99.61 / 99.27 ± 0.48 | 83.37 / 83.32 ± 0.03 |
| a^{n}b^{n}c^{n} | hard 1 | 98.13 / 90.17 ± 15.30 | 81.09 / 78.60 ± 1.54 | 97.86 / 97.27 ± 0.35 | 69.71 / 69.66 ± 0.03 |
| a^{n}b^{n}c^{n} | hard 2 | 87.49 / 74.35 ± 13.23 | 75.64 / 74.48 ± 0.73 | 86.42 / 82.10 ± 3.3 | 69.94 / 69.85 ± 0.07 |
| a^{n}b^{n}c^{n}d^{n} | hard 0 | 99.59 / 99.36 ± 0.19 | 98.1 / 95.94 ± 1.49 | 99.48 / 98.91 ± 1.17 | 87.53 / 87.5 ± 0.02 |
| a^{n}b^{n}c^{n}d^{n} | hard 1 | 98.33 / 90.45 ± 14.22 | 80.95 / 79.59 ± 0.97 | 97.24 / 96.11 ± 2.43 | 71.8 / 71.74 ± 0.03 |
| a^{n}b^{n}c^{n}d^{n} | hard 2 | 85.81 / 71.66 ± 12.21 | 75.33 / 74.47 ± 1.29 | 85.61 / 80.84 ± 3.68 | 70.72 / 70.66 ± 0.03 |

Each cell reports max / mean ± std test accuracy over 10 seeds.

Table 2: Performance of RNNs declines when negative strings closer to positive strings are sampled for training.
Figure 2: Transitions of hidden and cell states of the LSTM for the Dyck-1 grammar. Panels: (a) LSTM hidden state values for the string (((((()()()))))); (b) LSTM hidden state values for the string (((()))); (c) LSTM cell state values for the string (((((()()()))))); (d) LSTM cell state values for the string (((()))).
Figure 3: Transitions of hidden states of the O2RNN for the Dyck-1 grammar. Panels: (a) O2RNN state values for the string (((((()()()))))); (b) O2RNN state values for the string (((()))).
Figure 4: Performance of LSTM and O2RNN with various weight initialization strategies on a^{n}b^{n}c^{n}d^{n}. Panels: (a) uniform initialization \mathcal{U}(-0.1, 0.1); (b) orthogonal initialization, gain = 1; (c) sparse initialization, sparsity = 0.1.

Learnability of Dyck and Counter Languages

The results from Table 2 for negative set hard 0 confirm prior findings on the expressivity of LSTMs and RNNs on counter, context-free, and context-sensitive languages. A one-layer LSTM is theoretically capable of representing all classes of counter languages, indicating that its expressivity is sufficient to model non-regular grammars. However, the results for negative sets hard 1 and hard 2 indicate that this expressivity does not necessarily translate to practical learnability. The observed performance drop on these harder negative sets suggests that, despite the LSTM’s capacity to model such languages, its ability to generalize correctly under realistic training conditions is limited. This discrepancy between expressivity and learnability calls for a deeper understanding of how the network’s internal dynamics align with the objective function during training.

In particular, the sparsity of positive samples combined with naively sampled negative examples (as in hard 0) allows the classifier to partition the feature space even when the internal feature encodings are not well-structured. This may give an inflated impression of the LSTM’s practical learnability. Tables 1 and 2 compare fully trained models and classifier-only trained models, showing that the latter can achieve above-chance accuracy, even with minimal feature encoding. When negative samples are sampled closer to positive ones, as in hard 1 and hard 2, the classifier struggles to maintain robust partitions, highlighting that the underlying feature encodings are not sufficiently aligned with the grammar structure. Future work can leverage fixed-point theory and expressivity analysis to establish better learnability bounds, offering a more principled approach to bridge the gap between theoretical capacity and empirical generalization.

Stability of Feature Encoding in LSTM

The stability of LSTM feature encodings is heavily influenced by the precision of the network’s internal dynamics. Across 10 random seeds, the standard deviation of accuracy for fully trained LSTMs is significantly higher compared to classifier-only models, particularly for challenging sampling strategies. For example, a fully trained LSTM on a^{n}b^{n}c^{n} shows a standard deviation of 15.30% compared to only 1.54% for the classifier-only network using hard 1 sampling. This difference is less pronounced for hard 0 (0.29%) but becomes more severe for hard 2 (13.23%), indicating that instability in the learned feature encodings increases as the negative examples become structurally closer to positive ones. This instability is due to the LSTM’s reliance on its cell state to encode dynamic counters, which may not align precisely with the hidden state used for classification. As a result, slight deviations in internal dynamics cause substantial fluctuations in performance, suggesting a lack of robust fixed-point behavior in the cell state.

Stability of Second-Order RNNs

In contrast, the O2RNN, which utilizes a third-order weight tensor, demonstrates more consistent performance across different training strategies and random seeds. In all configurations, the O2RNN exhibits a standard deviation of less than 3%, as shown in Tables 1 and 2. This stability is attributed to the higher-order interactions in the weight tensor, which drive the activation dynamics towards more stable fixed points. The convergence to these stable fixed points results in more robust internal state representations, making the O2RNN less sensitive to variations in training data and initialization. These findings are consistent with observations by [12] for regular and Dyck languages, suggesting that higher-order tensor interactions inherently stabilize the internal dynamics, improving the alignment between the learned state transitions and the theoretical expressive capacity.

Dynamic Counting and Fixed Points

The LSTM’s ability to perform dynamic counting is closely tied to the stability of its cell state, which relies on the fixed points of the tanh activation function, as shown in Equation 11. Figures 2(c) and 2(d) provide evidence of dynamic counting when the LSTM encounters consecutive open brackets, as indicated by the solid blue curve that decreases monotonically. This behavior is in accordance with Equation 12, where the hidden state saturates to -1. However, when the network encounters a closing bracket, the cell state counter collapses, causing the hidden and cell states to start mirroring each other. This collapse occurs due to a mismatch between the counter dynamics and the LSTM’s training objective, which primarily optimizes for hidden state changes rather than directly influencing the cell state’s stability.

The root cause lies in the misalignment between the counter dynamics and the classification objective. Since the classification layer only uses the hidden state as input, any instability in the cell dynamics propagates through the hidden state, making it difficult for the network to maintain precise counter updates. In contrast, the O2RNN’s pure state approximation mechanism, as illustrated in Figure 3, shows smoother transitions and stable dynamics, indicating that the network’s internal states are better aligned with its expressivity requirements.

Effect of Initialization on Fixed-Point Stability

The choice of initialization strategy significantly influences the stability of fixed points in RNNs. Figure 4 shows that the performance of both LSTMs and O2RNNs declines from hard 0 to hard 1 for all initialization strategies. However, we observe that the O2RNN is particularly sensitive to sparse initialization, while being more stable for the other two initialization methods. This sensitivity reflects the network’s reliance on precise weight configurations to drive its activation dynamics towards stable fixed points. In contrast, the LSTM’s performance is relatively invariant to initialization strategies, as the collapse of its counting dynamics is more directly influenced by interactions between its gates rather than by initial weight values. Understanding the role of initialization in achieving stable fixed-point dynamics is crucial for designing networks that can consistently maintain dynamic behaviors throughout training.

9 Conclusion

Our framework analyzed models based on the fixed-point theory of activation functions and the precision of classification, providing a unified approach to study the stability and learnability of recurrent networks. By leveraging this framework, we identified critical gaps between the theoretical expressivity and the empirical learnability of LSTMs on Dyck and counter languages. While the LSTM cell state theoretically has the capacity to implement dynamic counting, we observed that misalignment between the training objective and the network’s internal state dynamics often causes a collapse of the counter mechanism. This collapse leads the LSTM to lose its counting capacity, resulting in unstable feature encodings in its final state representations. Additionally, our analysis showed that this instability is masked in standard training setups due to the power of the classifier to partition the feature space effectively. However, when the dataset includes closely related positive and negative samples, this instability prevents the network from maintaining clear separations between similar classes, ultimately resulting in a decline in performance. These findings underscore that, despite LSTMs’ theoretical capability for complex pattern recognition, their practical performance is hindered by internal instability and sensitivity to training configurations. To address this gap, our fixed-point analysis focused on understanding the stability of activation functions, offering a mathematical framework that connects theoretical properties to empirical behaviors. This approach provides new insights into how activation stability can influence the overall learnability of a system, enabling us to better align theory and practice. Our results emphasize that improving the stability of counter dynamics in LSTMs can lead to more robust, generalizable memory-augmented networks. Ultimately, this work contributes to a deeper understanding of the learnability of LSTMs and other recurrent networks, paving the way for future research that bridges the divide between theoretical expressivity and practical generalization.

References

  • [1] C. W. Omlin and C. L. Giles, “Constructing deterministic finite-state automata in recurrent neural networks,” J. ACM, vol. 43, p. 937–972, Nov. 1996.
  • [2] W. Merrill, G. Weiss, Y. Goldberg, R. Schwartz, N. A. Smith, and E. Yahav, “A formal hierarchy of RNN architectures,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, eds.), (Online), pp. 443–459, Association for Computational Linguistics, July 2020.
  • [3] C. L. Giles and C. W. Omlin, “Extraction, insertion and refinement of symbolic rules in dynamically driven recurrent neural networks,” Connection Science, vol. 5, no. 3-4, pp. 307–337, 1993.
  • [4] C. W. Omlin and C. L. Giles, “Training second-order recurrent neural networks using hints,” in Machine Learning Proceedings 1992 (D. Sleeman and P. Edwards, eds.), pp. 361–366, San Francisco (CA): Morgan Kaufmann, 1992.
  • [5] A. Mali, A. Ororbia, D. Kifer, and L. Giles, “On the computational complexity and formal hierarchy of second order recurrent neural networks,” arXiv preprint arXiv:2309.14691, 2023.
  • [6] H. T. Siegelmann and E. D. Sontag, “On the computational power of neural nets,” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, (New York, NY, USA), pp. 440–449, Association for Computing Machinery, 1992.
  • [7] F. Gers and E. Schmidhuber, “LSTM recurrent networks learn simple context-free and context-sensitive languages,” IEEE Transactions on Neural Networks, vol. 12, no. 6, pp. 1333–1340, 2001.
  • [8] M. Suzgun, Y. Belinkov, S. Shieber, and S. Gehrmann, “LSTM networks can perform dynamic counting,” in Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges (J. Eisner, M. Gallé, J. Heinz, A. Quattoni, and G. Rabusseau, eds.), (Florence), pp. 44–54, Association for Computational Linguistics, Aug. 2019.
  • [9] M. Suzgun, Y. Belinkov, and S. M. Shieber, “On evaluating the generalization of LSTM models in formal languages,” in Proceedings of the Society for Computation in Linguistics (SCiL) 2019 (G. Jarosz, M. Nelson, B. O’Connor, and J. Pater, eds.), pp. 277–286, 2019.
  • [10] G. Weiss, Y. Goldberg, and E. Yahav, “On the practical computational power of finite precision RNNs for language recognition,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 740–745, 2018.
  • [11] J. Stogin, A. Mali, and C. L. Giles, “A provably stable neural network turing machine with finite precision and time,” Information Sciences, vol. 658, p. 120034, 2024.
  • [12] N. Dave, D. Kifer, C. L. Giles, and A. Mali, “Stability analysis of various symbolic rule extraction methods from recurrent neural network,” arXiv preprint arXiv:2402.02627, 2024.
  • [13] G. Weiss, Y. Goldberg, and E. Yahav, “Extracting automata from recurrent neural networks using queries and counterexamples,” in International Conference on Machine Learning, pp. 5247–5256, PMLR, 2018.
  • [14] Q. Wang, K. Zhang, X. Liu, and C. L. Giles, “Verification of recurrent neural networks through rule extraction,” arXiv preprint arXiv:1811.06029, 2018.
  • [15] J. Hewitt, M. Hahn, S. Ganguli, P. Liang, and C. D. Manning, “RNNs can generate bounded hierarchical languages with optimal memory,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (B. Webber, T. Cohn, Y. He, and Y. Liu, eds.), (Online), pp. 1978–2010, Association for Computational Linguistics, Nov. 2020.
  • [16] N. Lan, M. Geyer, E. Chemla, and R. Katzir, “Minimum description length recurrent neural networks,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 785–799, 2022.
  • [17] A. Mali, A. Ororbia, D. Kifer, and L. Giles, “Investigating backpropagation alternatives when learning to dynamically count with recurrent neural networks,” in International Conference on Grammatical Inference, pp. 154–175, PMLR, 2021.
  • [18] W. M. Boothby, “On two classical theorems of algebraic topology,” The American Mathematical Monthly, vol. 78, no. 3, pp. 237–249, 1971.
  • [19] P. C. Fischer, A. R. Meyer, and A. L. Rosenberg, “Counter machines and counter languages,” Mathematical systems theory, vol. 2, pp. 265–283, Sept. 1968.
  • [20] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, pp. 1735–1780, Nov. 1997.
  • [21] C. W. Omlin and C. L. Giles, “Training second-order recurrent neural networks using hints,” in Machine Learning Proceedings 1992 (D. Sleeman and P. Edwards, eds.), pp. 361–366, San Francisco (CA): Morgan Kaufmann, 1992.

Appendix A: Additional Results and Discussion

A.1 Generalization Results

Figure 5 shows the generalization plots for the LSTM and O2RNN under both training strategies, i.e., all layers trained and classifier-only trained. These networks were trained on string lengths 2–40 and tested on lengths 41–500. The plots show the distribution of performance across the test sequence lengths. Both RNNs maintain their accuracy across the test range, indicating that the results generalize.
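For reference, the sketch below shows one way positive strings of a counter language such as $a^{n}b^{n}c^{n}$ could be bucketed into the training range (lengths 2–40) and the generalization range (lengths 41–500); the helper is hypothetical and is not the data pipeline used in our experiments.

```python
# Minimal sketch of length-based train/test bucketing for a^n b^n c^n strings.
def anbncn(n: int) -> str:
    return "a" * n + "b" * n + "c" * n

# Training strings have total length at most 40; test strings cover 41-500.
train_strings = [anbncn(n) for n in range(1, 167) if 2 <= 3 * n <= 40]
test_strings = [anbncn(n) for n in range(1, 167) if 41 <= 3 * n <= 500]

print(len(train_strings), "training strings, max length", max(map(len, train_strings)))
print(len(test_strings), "test strings, max length", max(map(len, test_strings)))
```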

A.2 Results with Transformers

To examine the capacity of the transformer encoder architecture and compare it with our results on RNNs, we train a one-layer transformer encoder. For binary classification of counter languages, we adopt two different embedding strategies as input to the classifier (both are sketched in code at the end of this subsection):

  1. transformer-avg: the classification layer receives the mean of all output embeddings generated by the transformer encoder as its input feature.

  2. transformer-cls: the classification layer receives the output embedding of the [CLS] token as its input feature.

We train a single-layer transformer encoder network on two counter languages, $a^{n}b^{n}c^{n}$ and $a^{n}b^{n}c^{n}d^{n}$, using an embedding dimension of 8 with 4 attention heads. Table 5 and Figure 6 show that the one-layer transformer encoder fails to learn these counter languages. Of the two classification strategies, transformer-cls shows a higher standard deviation in performance than transformer-avg across 10 seeds. On some seeds, the transformer-cls model reached accuracies as high as 64% on the $a^{n}b^{n}c^{n}d^{n}$ grammar, but its mean performance across 10 seeds remained near 50%. The transformer-cls model also shows no signs of training (Table 4) under the weight initialization strategies used to compare RNNs; for most seeds, the network stays at 50% accuracy.
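A minimal sketch of the two classifier inputs, assuming a PyTorch-style one-layer encoder with the 8-dimensional embeddings and 4 heads described above; the module wiring, vocabulary, and variable names are illustrative assumptions rather than our exact training code.

```python
import torch
import torch.nn as nn

# Toy vocabulary with a [CLS] token prepended to the input string.
vocab = {"[CLS]": 0, "a": 1, "b": 2, "c": 3, "d": 4}
d_model, n_heads = 8, 4

embed = nn.Embedding(len(vocab), d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True),
    num_layers=1,
)
classifier = nn.Linear(d_model, 2)
encoder.eval()  # disable dropout for a deterministic forward pass

tokens = torch.tensor([[vocab["[CLS]"]] + [vocab[s] for s in "aabbcc"]])
hidden = encoder(embed(tokens))            # (batch, seq_len, d_model)

cls_feature = hidden[:, 0, :]              # transformer-cls: [CLS] output embedding
avg_feature = hidden.mean(dim=1)           # transformer-avg: mean of all output embeddings

print(classifier(cls_feature).shape, classifier(avg_feature).shape)
```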

A.3 Results on Penn Tree Bank dataset

Table 3 compares the O2RNN, LSTM, and a one-layer transformer encoder network on the PTB dataset. The O2RNN and LSTM are trained with a hidden state size of 8 for character-level training and 256 for word-level training. For the transformer encoder model we use matching embedding dimensions: 8 for character-level training and 256 for word-level training.

dataset    model                all layers   classifier-only
ptb-char   lstm                 3.1243       7.7886
           o2rnn                3.2911       8.4865
           transformer-encoder  4.4389       9.6622
ptb-word   lstm                 160.3073     403.9483
           o2rnn                283.5615     356.4486
           transformer-encoder  196.8097     318.425
Table 3: Perplexity of LSTM, O2RNN, and transformer-encoder models on PTB dataset with all layers trained and classifier-only training.
                                                    all layers               classifier
initialization  negative samples  model            max     mean ± std       max     mean ± std
uniform         hard 0            lstm             99.89   92.96 ± 13.2     95.54   94.03 ± 1.48
                                  o2rnn            99.12   99.1 ± 0.01      87.54   87.5 ± 0.02
                                  transformer-cls  50      50.00 ± 0.00     50      49.99 ± 0.02
                hard 2            lstm             86.96   79.33 ± 10.15    74.74   74.41 ± 0.31
                                  o2rnn            85.5    82.6 ± 2.93      70.72   70.66 ± 0.03
                                  transformer-cls  50      50.00 ± 0.00     50      50.00 ± 0.00
orthogonal      hard 0            lstm             99.99   98.64 ± 1.76     96.88   96.13 ± 0.5
                                  o2rnn            99.12   99.10 ± 0.01     87.54   87.03 ± 1.41
                                  transformer-cls  53.05   50.58 ± 1.0      50.22   49.99 ± 0.13
                hard 2            lstm             86.55   67.85 ± 14.15    75.24   74.83 ± 0.27
                                  o2rnn            85.41   80.34 ± 3.36     70.72   70.25 ± 1.23
                                  transformer-cls  51.23   50.15 ± 0.48     50.08   49.72 ± 0.61
sparse          hard 0            lstm             99.59   99.27 ± 0.16     96.55   95.85 ± 0.39
                                  o2rnn            99.12   99.10 ± 0.01     87.54   83.75 ± 11.25
                                  transformer-cls  50      50.00 ± 0.00     50      50.00 ± 0.00
                hard 2            lstm             86.26   76.96 ± 13.73    76.39   72.87 ± 7.63
                                  o2rnn            85.66   80.50 ± 3.48     70.72   58.26 ± 10.12
                                  transformer-cls  50      50.00 ± 0.00     50      50.00 ± 0.00
Table 4: Effect of weight initialization strategies on the networks' ability to respond to topologically close positive and negative strings.
[Figure 5: Generalization plots for LSTM and O2RNN. Panels: (a) LSTM on hard 0, (b) LSTM on hard 1, (c) LSTM on hard 2, (d) O2RNN on hard 0, (e) O2RNN on hard 1, (f) O2RNN on hard 2.]
[Figure 6: Generalization plots for the transformer-cls network with the hard 0 negative sampling strategy. Panels: (a) transformer-cls on $a^{n}b^{n}c^{n}$, (b) transformer-cls on $a^{n}b^{n}c^{n}d^{n}$.]
grammar                feature    layers trained   max     mean ± std
$a^{n}b^{n}c^{n}$      cls        all layers       55.45   51.70 ± 2.37
                                  classifier-only  60.22   49.88 ± 7.40
                       avg-pool   all layers       58.58   51.53 ± 2.44
                                  classifier-only  59.15   51.60 ± 2.92
$a^{n}b^{n}c^{n}d^{n}$ cls        all layers       64.25   51.99 ± 10.59
                                  classifier-only  60.47   50.22 ± 9.38
                       avg-pool   all layers       55.79   49.35 ± 3.01
                                  classifier-only  52.36   49.41 ± 1.27
Table 5: One-layer transformer encoder networks do not learn counter languages like $a^{n}b^{n}c^{n}$ and $a^{n}b^{n}c^{n}d^{n}$ with the hard 0 negative sampling strategy.

A.4 Stable and Unstable Fixed Points

$\sigma(wx+b)$ and $\tanh(wx+b)$ are monotonic functions with bounded co-domains. For $w>0$, both functions are non-decreasing. Let $f:\mathbb{R}\rightarrow\mathbb{R}$ be a monotonic, non-decreasing function with a bounded co-domain, and let $g(x)=x$ for $x\in\mathbb{R}$. Then,

  • If one fixed point exists, then it is a stable fixed point.
    Let $x>z_{f}$, where $z_{f}$ is the only fixed point. Then $f(x)<g(x)$, so iterating $x_{i+1}=f(x_{i})$ gives $x_{i+1}\leq x_{i}$, with equality occurring only at $x_{i}=z_{f}$. Similarly, for $x<z_{f}$ we can show that each iterative application of $f$ moves $x$ towards $z_{f}$.

  • If two fixed points exist, then one fixed point is stable and the other is unstable.
    If there are two fixed points, then at one fixed point ($z_{t}$) $g(x)$ is tangent to $f(x)$. For $x\neq z_{t}$, $f(x)<g(x)$, making that fixed point unstable.

  • If three fixed points exist, then two fixed points are stable and one is unstable.
    This is already shown in Theorems 3.2 and 3.3 (see the numerical sketch after this list).
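The convergence behavior described in the list above can be illustrated numerically; the sketch below iterates $f(x)=\tanh(wx+b)$ for an arbitrary choice of $w>0$ that yields three fixed points (the specific values are assumptions for illustration).

```python
import numpy as np

def iterate(f, x0, steps=50):
    """Repeatedly apply f starting from x0 and return the final value."""
    x = x0
    for _ in range(steps):
        x = f(x)
    return x

w, b = 2.0, 0.0                     # f(x) = tanh(2x): fixed points near -0.957, 0, +0.957
f = lambda x: np.tanh(w * x + b)

for x0 in (-2.0, -0.1, 0.1, 2.0):
    print(f"x0 = {x0:+.1f} -> converges to {iterate(f, x0):+.4f}")
# Starting points on either side of the unstable fixed point at 0 are driven
# to one of the two stable fixed points, matching the three-fixed-point case.
```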

Appendix B: Estimation Methodology based on Machine Precision

Given the constraints of the RNN model and the precision limits of float32, we aim to calculate the maximum distinguishable count $N$ for each symbol in the sequence.

Assumptions

  • The $\tanh$ activation function is used in the RNN, bounding the hidden state outputs within $(-1,1)$.

  • The machine epsilon ($\epsilon$) for float32 is approximately $1.19\times 10^{-7}$, indicating the smallest representable change for values around 1.

  • A conservative approach is adopted, considering a dynamic range of interest for $\tanh$ outputs from $-0.9$ to $0.9$ to avoid saturation effects.

Dynamic Range and Minimum Noticeable Change. The effective dynamic range for $\tanh$ outputs is set to avoid saturation, calculated as:

\[\text{Dynamic Range}=0.9-(-0.9)=1.8.\]

Assuming a minimum noticeable change in the hidden state of $10\times\epsilon$, to ensure distinguishability within the SGD training process, we have:

\[\Delta h_{\min}=10\times 1.19\times 10^{-7}.\]

Number of Distinguishable Steps. The total number of distinguishable steps within the dynamic range can be estimated as:

\[\text{Steps}=\frac{\text{Dynamic Range}}{\Delta h_{\min}}.\]

Given that the usable capacity for encoding is potentially less than the total dynamic range, because the RNN must also represent sequence information beyond mere counts, a conservative factor ($f$) is applied:

\[N_{\max}=f\times\text{Steps}.\]

Conservative Factor and Final Estimation. Applying the conservative factor ($f$) to account for practical limitations in encoding and sequence discrimination, we estimate $N_{\max}$ without dividing by 3, contrary to the previous incorrect interpretation. This factor reflects the assumption that not all distinguishable steps are equally usable for encoding sequences, due to the complexity of sequential dependencies and the potential for error accumulation.

\[N_{\max}=f\times\frac{1.8}{10\times 1.19\times 10^{-7}}.\]

Thus, this estimation provides a mathematical framework for understanding the maximum count $N$ that can be distinguished by a simple RNN model with fixed weights and a trainable classification layer, under idealized assumptions about floating-point precision and the behavior of the $\tanh$ activation function. The actual capacity for sequence discrimination may vary with the specifics of the network architecture, weight initialization, and training methodology.
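A minimal sketch of the estimate above; the conservative factor $f$ is left as a free parameter because the text does not commit to a single value, and the values looped over are purely illustrative.

```python
# Precision-based estimate of the maximum distinguishable count N_max.
EPSILON_F32 = 1.19e-7                  # machine epsilon for float32 near 1
DYNAMIC_RANGE = 0.9 - (-0.9)           # usable tanh range, avoiding saturation
DELTA_H_MIN = 10 * EPSILON_F32         # minimum noticeable hidden-state change

steps = DYNAMIC_RANGE / DELTA_H_MIN    # number of distinguishable steps

for f in (1.0, 0.5, 0.1):              # illustrative conservative factors
    n_max = f * steps
    print(f"f = {f:>4}: N_max is approximately {n_max:,.0f}")
```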

Appendix C: Complete Proofs of the Precision Theorems

We provide a detailed proof of Theorem 6.1 presented in the main paper.

Proof.

The proof proceeds in three steps: (1) analyzing the hidden state dynamics in the presence of random fixed weights, (2) demonstrating that distinct classes (e.g., $a^{n}b^{n}c^{n}$) can still be linearly separated based on the hidden states, and (3) showing that the classification layer can be trained to distinguish these hidden state patterns.

1. Hidden State Dynamics with Fixed Random Weights:

Consider an RNN with hidden state $\mathbf{h}_{t}\in\mathbb{R}^{d}$ updated as:

\[\mathbf{h}_{t+1}=\tanh(\mathbf{W}\mathbf{h}_{t}+\mathbf{U}\mathbf{x}_{t}+\mathbf{b}),\]

where $\mathbf{W}\in\mathbb{R}^{d\times d}$ and $\mathbf{U}\in\mathbb{R}^{d\times m}$ are randomly initialized and fixed. The hidden state dynamics in this case are governed by the random projections imposed by $\mathbf{W}$ and $\mathbf{U}$.

Although the weights are random, the hidden state $\mathbf{h}_{t}$ still carries information about the input sequence. Specifically, different sequences (e.g., $a^{n}$, $b^{n}$, and $c^{n}$) induce distinct trajectories in the hidden state space. These trajectories are not arbitrary but depend on the input symbols, even under random weights.

2. Distinguishability of Hidden States for Different Sequence Classes:

Despite the randomness of the weights, the hidden state distributions for different sequences remain distinguishable. For example, the hidden states after processing $a^{n}$ tend to cluster in a specific region of the state space, forming a characteristic distribution, while the hidden states after processing $b^{n}$ and $c^{n}$ occupy different regions.

These clusters may not correspond to single fixed points as in the trained RNN case, but they still form distinct, linearly separable patterns in the high-dimensional space.

3. Training the Classification Layer:

The classification layer is a fully connected layer that maps the final hidden state $\mathbf{h}_{N}$ to the output class (e.g., "class 1" for $a^{n}b^{n}c^{n}$). The classification layer is trained using a supervised learning approach, typically minimizing a cross-entropy loss.

Because the hidden states exhibit distinct distributions for different sequences, the classification layer can learn to separate these distributions. In high-dimensional spaces, even random projections (as induced by the random recurrent weights) create enough separation for the classification layer to distinguish between different classes.

The key insight from the above analysis is that, even with random fixed weights, the hidden state dynamics create distinguishable patterns for different input sequences. The classification layer, which is the only trained component, leverages these patterns to correctly classify sequences such as $a^{n}b^{n}c^{n}$. This demonstrates that the RNN's expressivity remains sufficient for the classification task, despite the randomness in the recurrent layer.
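The argument can be illustrated with a small numerical sketch, assuming frozen Gaussian weights, one-hot inputs, and a logistic-regression classifier trained on the final hidden states; none of these choices are claimed to match our experimental setup, and the negative strings below are a simple corruption rather than our hard negative sampling strategies.

```python
import numpy as np

rng = np.random.default_rng(0)
d, symbols = 16, {"a": 0, "b": 1, "c": 2}
W = rng.normal(0, 0.5, (d, d))     # fixed random recurrent weights
U = rng.normal(0, 0.5, (d, 3))     # fixed random input weights

def final_state(s):
    """Run the frozen random RNN over a string and return the final hidden state."""
    h = np.zeros(d)
    for ch in s:
        x = np.eye(3)[symbols[ch]]
        h = np.tanh(W @ h + U @ x)
    return h

pos = ["a" * n + "b" * n + "c" * n for n in range(1, 12)]
neg = ["a" * n + "b" * (n + 1) + "c" * n for n in range(1, 12)]
X = np.stack([final_state(s) for s in pos + neg])
y = np.array([1] * len(pos) + [0] * len(neg))

# Train only a linear classifier (logistic regression via gradient descent).
w, b = np.zeros(d), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * X.T @ grad / len(y)
    b -= 0.1 * grad.mean()

print("training accuracy:", ((X @ w + b > 0) == y).mean())
```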

We now provide a detailed proof of Theorem 6.2 presented in the main paper.

Proof.

The proof is divided into three parts: (1) establishing the existence of stable fixed points for each input symbol, (2) analyzing the convergence of state dynamics to these fixed points, and (3) demonstrating how the RNN encodes the sequence $a^{n}b^{n}c^{n}$ using these fixed points.

1. Existence of Stable Fixed Points for Each Input Symbol:

Let the hidden state $\mathbf{h}_{t}\in\mathbb{R}^{d}$ at time $t$ be updated according to:

\[\mathbf{h}_{t+1}=\tanh(\mathbf{W}\mathbf{h}_{t}+\mathbf{U}\mathbf{x}_{t}+\mathbf{b}),\]

where $\mathbf{x}_{t}\in\{a,b,c\}$ represents the input symbol. For a fixed input symbol $\mathbf{x}$, we analyze the fixed points of the hidden state dynamics.

The fixed points satisfy:

\[\mathbf{h}^{*}=\tanh(\mathbf{W}\mathbf{h}^{*}+\mathbf{U}\mathbf{x}+\mathbf{b}).\]

Assume that the system has distinct stable fixed points $\xi_{a}^{-},\xi_{b}^{-},\xi_{c}^{-}$ for inputs $\mathbf{x}=a$, $\mathbf{x}=b$, and $\mathbf{x}=c$, respectively. These fixed points are stable under small perturbations, meaning that for each symbol the hidden state dynamics tend to converge to the corresponding fixed point.

2. Convergence of State Dynamics to Fixed Points:

For a sufficiently long subsequence of identical symbols, such as $a^{n}$, the hidden state converges to $\mathbf{h}\approx\xi_{a}^{-}$ as $t$ increases. This convergence is governed by the stability of the fixed point $\xi_{a}^{-}$. The same holds for the subsequences $b^{n}$ and $c^{n}$, where the hidden state converges to $\xi_{b}^{-}$ and $\xi_{c}^{-}$, respectively.

Mathematically, this convergence is characterized by the eigenvalues of the Jacobian matrix $\mathbf{J}$ at the fixed point $\xi_{a}^{-}$:

\[\mathbf{J}=\frac{\partial}{\partial\mathbf{h}}\left[\tanh(\mathbf{W}\mathbf{h}+\mathbf{U}\mathbf{a}+\mathbf{b})\right]\bigg|_{\mathbf{h}=\xi_{a}^{-}}.\]

If the eigenvalues satisfy $|\lambda_{i}|<1$ for all $i$, the fixed point is stable, ensuring that the hidden state dynamics converge to $\xi_{a}^{-}$ over time.
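A short numerical sketch of this stability check, assuming small random Gaussian weights and a one-hot encoding of the symbol $a$; the approximate fixed point is obtained by iterating the update under a constant input, and the Jacobian is evaluated there.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W = rng.normal(0, 0.3, (d, d))
U = rng.normal(0, 0.3, (d, 3))
b = np.zeros(d)
x_a = np.eye(3)[0]                      # one-hot encoding of the symbol 'a'

# Approximate the fixed point xi_a by iterating h -> tanh(W h + U x_a + b).
h = np.zeros(d)
for _ in range(200):
    h = np.tanh(W @ h + U @ x_a + b)

# Jacobian of the update at the approximate fixed point:
# J = diag(1 - tanh^2(W h* + U x_a + b)) @ W
pre = W @ h + U @ x_a + b
J = np.diag(1 - np.tanh(pre) ** 2) @ W
eigs = np.linalg.eigvals(J)
print("max |eigenvalue| at xi_a:", np.abs(eigs).max())   # < 1 indicates a stable fixed point
```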

3. Encoding the Sequence $a^{n}b^{n}c^{n}$ via Fixed Points:

Given a bounded sequence length $N$, the RNN can encode the sequence $a^{n}b^{n}c^{n}$ by leveraging the stable fixed points $\xi_{a}^{-}$, $\xi_{b}^{-}$, and $\xi_{c}^{-}$ as follows:

  1. After processing the subsequence $a^{n}$, the hidden state converges to $\mathbf{h}\approx\xi_{a}^{-}$.

  2. Upon receiving the input symbol $b$, the hidden state begins to transition from $\xi_{a}^{-}$ to $\xi_{b}^{-}$. As the network processes $b^{n}$, the hidden state stabilizes at $\xi_{b}^{-}$.

  3. Similarly, the hidden state transitions to $\xi_{c}^{-}$ after processing $c^{n}$, representing the final part of the sequence.

4. Expressivity of RNNs and DFA Equivalence:

The expressivity of RNNs, and even of the O2RNN, is equivalent to that of deterministic finite automata (DFA) [2, 5]. In this context, the RNN's behavior mirrors that of a DFA, with distinct stable fixed points representing states for each input symbol. The transitions between these states are governed by the input sequence and the corresponding hidden state dynamics, which collapse to stable fixed points. This allows the RNN to encode complex grammars like $a^{n}b^{n}c^{n}$ purely through its internal state dynamics.

Thus, the RNN can encode the sequence $a^{n}b^{n}c^{n}$ by relying on the convergence of its state dynamics to stable fixed points. The bounded sequence length $N$ ensures that the hidden states have sufficient time to converge to these fixed points, enabling the network to express such grammars within its capacity. The expressivity of the RNN, akin to a DFA, underlines that the encoding is achieved purely through state dynamics, which is especially true for the O2RNN.