
On Non-asymptotic Theory of Recurrent Neural Networks in Temporal Point Processes

Zhiheng Chen1,   Guanhua Fang2∗,   Wen Yu2
1 Shanghai Center for Mathematical Sciences, Fudan University
2 Department of Statistics and Data Science, Fudan University
fanggh@fudan.edu.cn
Abstract

The temporal point process (TPP) is an important tool for modeling and predicting irregularly timed events across various domains. Recently, recurrent neural network (RNN)-based TPPs have shown practical advantages over traditional parametric TPP models. However, the theoretical understanding of neural TPPs remains nascent. In this paper, we establish excess risk bounds for RNN-TPPs under many well-known TPP settings. In particular, we show that an RNN-TPP with no more than four layers can achieve vanishing generalization errors. Our technical contributions include the characterization of the complexity of the multi-layer RNN class, the construction of $\tanh$ neural networks for approximating dynamic event intensity functions, and a truncation technique for alleviating the issue of unbounded event sequences. Our results bridge the gap between TPP applications and neural network theory.

1 Introduction

Temporal point process (TPP) (Daley et al., 2003; Daley and Vere-Jones, 2008) is an important mathematical framework that provides tools for analyzing and predicting the timing and patterns of events in continuous time. TPP particularly deals with event streaming data where the events occur at irregular time stamps, which is different from classical time series analysis that often assumes a regular time spacing between data points. In real world applications, the events could be anything from transactions in financial markets (Bauwens and Hautsch, 2009; Hawkes, 2018) to user activities in online social network platforms (Farajtabar et al., 2017; Fang et al., 2023), earthquakes in seismology (Wang et al., 2012; Laub et al., 2021), neural spikes in biological experiments (Perkel et al., 1967; Williams et al., 2020), or failure times in survival analysis (Aalen et al., 2008; Fleming and Harrington, 2013).

With the advent of artificial intelligence in recent decades, the neural network (McCulloch and Pitts, 1943) has proved to be a powerful architecture that can be adapted to different applications with distinct purposes. In modern machine learning, researchers have also incorporated deep neural networks into TPPs to handle complex patterns and dependencies in event data, leading to advances in many areas such as recommendation systems (Du et al., 2015; Hosseini et al., 2017), social network analysis (Du et al., 2016; Zhang et al., 2021), healthcare analytics (Li et al., 2018; Enguehard et al., 2020), etc. Many new TPP models have been proposed in the recent literature, including but not limited to the recurrent temporal point process (Du et al., 2016), the fully neural network TPP model (Omi et al., 2019), and the transformer Hawkes process (Zuo et al., 2020); see Shchur et al. (2021); Lin et al. (2022) and the references therein for a more comprehensive review.

Despite the recent progress in TPP applications mentioned above, there is a lack of theoretical understanding of neural TPPs. A fundamental question remains: can a neural network-based TPP provably achieve a small generalization error? In this paper, we provide an affirmative answer to this question for recurrent neural network (RNN, Medsker and Jain (1999))-based TPPs. To be specific, we establish non-asymptotic rates for generalization error bounds under mild model assumptions and provide constructions of RNN architectures that can approximate many widely-used TPPs, including the homogeneous Poisson process, the non-homogeneous Poisson process, the self-exciting process, etc.

There are a few challenges in developing the theory of RNN-based TPPs. (a) Characterization of the functional space. In machine learning theory, it is necessary to specify the model space before deriving any generalization errors. In our setting, matters become more complicated since the model should be data-dependent (i.e., adapt to the past events). Otherwise, the model could not capture the information in the event history and would fail to provide a good fit. (b) Expressive power of the RNN architecture. RNN is the most widely adopted neural architecture in TPP modelling. However, it remains questionable whether RNNs can approximate most well-known temporal point processes. If the answer is yes, it would be of great interest to know how many hidden layers and how large hidden dimensions are sufficient for the approximation. (c) Expressive power of the activation function. In modern neural networks, the activation function is chosen to be a simple non-linear function for the sake of computational feasibility. In RNNs, it is taken to be $\tanh$ by default. It is therefore important to understand the approximation power of $\tanh$ activation functions. (d) Variable length of event sequences. Unlike standard RNN modelling where each sample is assumed to have the same number of observations (events), the event sequences in our setting may vary from one to another. In addition, their lengths are potentially unbounded. These issues add difficulty to computing the complexity of the model space.

To overcome the above challenges, we adopt the following approaches. (a) In TPPs, the intensity function is the core object. We recursively construct the multi-layer hidden cells through RNNs to store the event information and adopt a suitable output layer to compute the intensity value. Equipped with suitable input embeddings, our construction can capture the information of the event history and adapt to variable lengths of event sequences. (b) For four main categories of TPPs, the homogeneous Poisson process, the non-homogeneous Poisson process, the self-exciting process, and the self-correcting process, we carefully study their intensity formulas. We can decompose the intensity function into different parts and approximate them component-wise. Our construction explicitly gives the upper bounds on the model depth, the width of the hidden layers, and the parameter weights of the RNN architecture needed to achieve a certain level of approximation accuracy. (c) We use the results in a recent work (De Ryck et al., 2021), which provides the approximation ability of one- and two-layer $\tanh$ neural networks. We adapt such results to our specific RNN structure and give universal approximation results for each of the intensity components. (d) Thanks to the exponential decay property of the tail probability of the sequence length, we are able to use a truncation technique to decouple the randomness of independent and identically distributed (i.i.d.) samples and the lengths of event sequences. For the space of truncated loss functions, the space complexity can be obtained by calculating the covering number. The classical chaining methods in empirical process theory can hence be applied as well.

Our main technical contributions can be summarized as follows.

(i) In the analysis of the stochastic error in the excess risk of RNN-based TPPs, we provide a truncation technique to decompose the randomness into a bounded component and a tail component. By carefully balancing the two parts, we establish a nearly optimal stochastic error bound. Additionally, we derive the complexity of the multi-layer RNN-based TPP class, where we precisely analyze and compute the Lipschitz constant of the RNN architecture. This extends the existing result in Chen et al. (2020), which only gives the Lipschitz constant of a single-layer RNN. Therefore, our truncation technique and the Lipschitz result for multi-layer RNNs can be useful and of independent interest for many other related problems.

(ii) We establish approximation error bounds for the intensity functions of TPPs of four main categories. To the best of our knowledge, there is very little work (De Ryck et al., 2021) studying the approximation properties of the $\tanh$ activation function. Our work is the first to provide approximation results for RNN-based statistical models. Our construction procedure largely depends on the Markov nature (Laub et al., 2021) of self-exciting processes, which allows us to design hidden cells that store sufficient information about past events. Moreover, we decompose the excitation function into different parts, each of which is a simple smooth function (i.e., either an exponential or a trigonometric function) that can be well approximated by a single-layer $\tanh$ network. Our construction method can be viewed as a useful tool for analyzing other sequential-type neural networks.

(iii) We illustrate the differences between the architectures of classical RNNs and RNN-based TPPs. Note that the observed events happen at discrete time grids, while TPP models must take into account the continuous time domain. Therefore, the interpolation of the values in the hidden cells at each time point is important and necessary. We show that improper interpolation mechanisms (e.g., constant, linear, or exponential-decay interpolation) may fail to endow the RNN-based TPP with the universal approximation ability. Our result indicates that the input embedding plays an important role in interpolating the hidden states.

The rest of the paper is organized as follows. In Section 2, the background of TPPs, the formulation of RNN-based TPPs, and useful notations are introduced. The main theories, along with high-level explanations, are given in Section 3. The technical tools for analyzing the stochastic error are provided in Section 4. The construction procedures for approximating different types of intensity functions are given in Section 5. In Section 6, we explain how improper interpolation of the hidden states in RNN-TPPs may lead to unsatisfactory approximation results. The concluding remarks are given in Section 7.

2 Preliminaries

2.1 Framework Specification

We observe a set of $n$ irregular event time sequences,

\mathbf{D}_{train}:=\{S_{i};i=1,...,n\}=\{(t_{i,1},...,t_{i,N_{ei}});i=1,...,n\}, (1)

where $0<t_{i,1}<...<t_{i,j}<...<t_{i,N_{ei}}\leq T$ with $T$ being the end time point, and $N_{ei}$ is the number of events in the $i$-th sequence $S_{i}$. It is assumed that each $S_{i}$ is independently generated from a TPP model with an unknown intensity function $\lambda^{\ast}(t)$ defined on $[0,T]$. That is,

\lambda^{\ast}(t):=\lim_{dt\rightarrow 0}\frac{\mathbb{E}[N[t,t+dt)|\mathcal{H}_{t}]}{dt},

where $N[t,t+dt):=N(t+dt)-N(t)$ with $N(t):=\sharp\{i:t_{i}\leq t\}$ being the number of events observed up to time $t$, and $\mathcal{H}_{t}:=\sigma(\{N(s);s<t\})$ is the history filtration before time $t$.

In the literature on TPP learning (Shchur et al., 2021), the primary goal is to estimate $\lambda^{\ast}(t)$ based on $\mathbf{D}_{train}$. Throughout the current work, we adopt the negative log-likelihood function as our objective. To be specific, for any event time sequence $S=(t_{1},...,t_{N_{e}})$, we define

\text{loss}(\lambda,S):=-\left\{\sum_{j=1}^{N_{e}}\log\lambda(t_{j})-\int_{0}^{T}\lambda(t)\mathrm{d}t\right\}. (2)

Then the estimator can be defined as

\hat{\lambda}:=\arg\min_{\lambda\in\mathcal{F}}\text{loss}(\lambda):=\arg\min_{\lambda\in\mathcal{F}}\left\{\frac{1}{n}\sum_{i=1}^{n}\text{loss}(\lambda,S_{i})\right\}, (3)

where $\mathcal{F}$ is a user-specified functional space. For example, in the existing works, $\mathcal{F}$ can be taken as any space of parametric models (Schoenberg, 2005; Laub et al., 2021), nonparametric models (Cai et al., 2022; Fang et al., 2023), or neural network models (Du et al., 2016; Mei and Eisner, 2017).
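Concretely, the loss in (2) and the empirical risk in (3) can be evaluated numerically for any candidate intensity. The sketch below is our own illustration (not the authors' code); it assumes a hypothetical `intensity(t, history)` interface and uses a simple trapezoidal rule for the compensator integral.

```python
import numpy as np

def tpp_nll(intensity, events, T, n_grid=1000):
    """Negative log-likelihood (2): -(sum_j log lam(t_j) - int_0^T lam(t) dt)."""
    log_term = np.sum(np.log([intensity(t, events[events < t]) for t in events]))
    grid = np.linspace(0.0, T, n_grid)
    lam_grid = np.array([intensity(t, events[events < t]) for t in grid])
    compensator = np.trapz(lam_grid, grid)   # numerical integral of lam over [0, T]
    return -(log_term - compensator)

def empirical_risk(intensity, sequences, T):
    """Average loss over the n training sequences, as in (3)."""
    return np.mean([tpp_nll(intensity, np.asarray(S), T) for S in sequences])

# Example with a homogeneous Poisson intensity lam(t) = 1.5 (arbitrary value).
const_intensity = lambda t, history: 1.5
print(empirical_risk(const_intensity, [[0.4, 1.1, 2.3], [0.7, 1.9]], T=3.0))
```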

In the language of deep learning, $\mathbf{D}_{train}$ is also called a training data set, $\text{loss}(\lambda)$ is known as the loss function of the predictor $\lambda$, and $\hat{\lambda}$ defined in (3) is the empirical risk minimizer (ERM). To evaluate the performance of $\hat{\lambda}$, a common practice in machine (deep) learning is to use the excess risk (Hastie et al., 2009; James et al., 2013; Vidyasagar, 2013; Shalev-Shwartz and Ben-David, 2014). To be mathematically formal, we define

\text{ER}(\hat{\lambda}):=\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})], (4)

where $S_{test}$ is a testing sample, i.e., a new event time sequence, which is independent of $\mathbf{D}_{train}$ and also follows the intensity $\lambda^{\ast}(t)$. The expectation here is taken with respect to the new testing data. We give a proof of $\text{ER}(\hat{\lambda})\geq 0$ in the supplementary material. As a result, (4) is a well-defined excess risk under our model setup.

2.2 RNN Structure

Throughout this paper, we consider $\mathcal{F}$ to be a space of RNN-based TPP models. An arbitrary intensity function $\lambda$ in $\mathcal{F}$, indexed by the parameter $\theta$, is defined through the following recursive formula,

\lambda_{\theta}(t;S):=f\left(W_{x}^{(L+1)}h^{(L)}(t;S)+b^{(L+1)}\right)\in\mathbb{R}^{1},~~\text{for}~t\in(t_{j},t_{j+1}], (5)

where the hidden vector function $h^{(L)}(t;S)$ has the following hierarchical form,

h^{(1)}(t;S)=\sigma\left(W_{x}^{(1)}x(t;S)+W_{h}^{(1)}h_{j}^{(1)}+b^{(1)}\right),
h^{(2)}(t;S)=\sigma\left(W_{x}^{(2)}h^{(1)}(t;S)+W_{h}^{(2)}h_{j}^{(2)}+b^{(2)}\right),
\vdots
h^{(L)}(t;S)=\sigma\left(W_{x}^{(L)}h^{(L-1)}(t;S)+W_{h}^{(L)}h_{j}^{(L)}+b^{(L)}\right),~~\text{for}~t\in(t_{j},t_{j+1}], (6)

with

h_{j}^{(1)}=\sigma\left(W_{x}^{(1)}x(t_{j};S)+W_{h}^{(1)}h_{j-1}^{(1)}+b^{(1)}\right),
h_{j}^{(2)}=\sigma\left(W_{x}^{(2)}h_{j}^{(1)}+W_{h}^{(2)}h_{j-1}^{(2)}+b^{(2)}\right),
\vdots
h_{j}^{(L)}=\sigma\left(W_{x}^{(L)}h_{j}^{(L-1)}+W_{h}^{(L)}h_{j-1}^{(L)}+b^{(L)}\right),~~\text{for}~j\in\{1,...,N_{e}\}. (7)

Here $\sigma$ and $f$ are two known activation functions for the hidden layers and the output layer, respectively. Both of them are pre-determined by the user. We specifically take $\sigma(x)=\tanh(x)=(\exp(x)-\exp(-x))/(\exp(x)+\exp(-x))$ and $f(x)=\min\{\max\{x,l_{f}\},u_{f}\}$, where $l_{f}$ and $u_{f}$ are two fixed positive constants. The input embedding vector function $x(t;S)$ is also known to the user before training. In the current work, we particularly take $x(t;S)=(t,t-F_{S}(t))^{\top}$, where $F_{S}(t)=t-t_{j}$ for $t\in(t_{j},t_{j+1}]$, $\forall j\leq N_{e}$. The model parameters consist of $W_{x}^{(l)}$, $W_{h}^{(l)}$, and $b^{(l)}$ ($1\leq l\leq L+1$). For notational simplicity, we concatenate all parameter matrices and vectors and write $\theta=\{W_{x}^{(l)},W_{h}^{(l)},b^{(l)};1\leq l\leq L+1\}$, where $W_{h}^{(L+1)}\equiv\mathbf{0}$. By default, we take the initial values $t_{0}\equiv 0$ and $h_{0}^{(l)}\equiv\mathbf{0}$ for $1\leq l\leq L$. The last time grid is $t_{N_{e}+1}\equiv T$. We call the model defined through equations (5)-(7) the RNN-TPP.
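To make the recursion (5)-(7) concrete, the following sketch (our own illustration, not the authors' code; layer widths and random parameter values are arbitrary) evaluates $\lambda_{\theta}(t;S)$ for a two-layer RNN-TPP with the input embedding $x(t;S)=(t,t-F_{S}(t))^{\top}$.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_in, d_hid = 2, 2, 8                      # two hidden layers of width 8 (arbitrary)
Wx = [rng.normal(size=(d_hid, d_in))] + [rng.normal(size=(d_hid, d_hid)) for _ in range(L - 1)]
Wh = [rng.normal(size=(d_hid, d_hid)) for _ in range(L)]
b = [rng.normal(size=d_hid) for _ in range(L)]
Wout, bout = rng.normal(size=d_hid), 0.1      # output layer W_x^{(L+1)}, b^{(L+1)}
lf, uf = 0.1, 10.0                            # clipping constants of f

def embed(t, t_last):
    # x(t;S) = (t, t - F_S(t)) with F_S(t) = t - t_j, so the second coordinate is t_j
    return np.array([t, t_last])

def intensity(t, events):
    """lambda_theta(t;S): recursion (7) over past events, then interpolation (6) at time t."""
    h = [np.zeros(d_hid) for _ in range(L)]   # h_0^{(l)} = 0
    t_last = 0.0
    for tj in [s for s in events if s < t]:   # update hidden states at each event time, eq. (7)
        inp = embed(tj, t_last)
        for l in range(L):
            inp = np.tanh(Wx[l] @ inp + Wh[l] @ h[l] + b[l])
            h[l] = inp
        t_last = tj
    inp = embed(t, t_last)                    # interpolate between events, eq. (6)
    for l in range(L):
        inp = np.tanh(Wx[l] @ inp + Wh[l] @ h[l] + b[l])
    return float(np.clip(Wout @ inp + bout, lf, uf))   # output layer, eq. (5)

print(intensity(2.5, [0.4, 1.1, 2.3]))
```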

Figure 1: Left: the classical RNN architecture. Right: the RNN-TPP architecture given in (5)-(7). The blue box represents the interpolation of hidden states.

Moreover, we define the maximum hidden size $D:=\max\{d_{1},d_{2},\cdots,d_{L}\}$, where $d_{l}$ is the dimension of the $l$-th hidden layer, and the parameter norm

\|\theta\|:=\max\left\{\|W_{x}^{(l)}\|_{2},\|W_{h}^{(l)}\|_{2},\|b^{(l)}\|_{2};1\leq l\leq L+1\right\}.

Then the RNN-TPP class $\mathcal{F}$ is described by

\mathcal{F}=\mathcal{F}_{L,D,B_{m},l_{f},u_{f}}:=\{\lambda_{\theta};~\|\theta\|\leq B_{m}\}, (8)

where $B_{m}$ may depend on the hidden size $D$ and the sample size $n$. To help readers gain more intuition, a graphical illustration of the network structure is given in Figure 1.

Remark 1.

The default choice (De Ryck et al., 2021) of activation function $\sigma(x)$ in RNNs is $\tanh(x)$. In practice, the number of layers $L$ is usually no more than 4.

Remark 2.

By the constructions (5)-(7), it is not hard to see that the intensity $\lambda_{\theta}(t;S)$ is a left-continuous function of $t$. In other words, it is a well-defined predictable function with respect to the information filtration generated by the event sequence $S$.

Remark 3.

In the standard application of RNN models, the training data usually consist of discrete-time sequences (e.g., sequences of tokens in natural language processing (NLP) (Yin et al., 2017; Tarwani and Edem, 2017); time series in financial market forecasting (Cao et al., 2019; Chimmula and Zhang, 2020)). Therefore, the classical (single-layer) RNN architecture is defined only on the discrete time grids. That is, the hidden vector at the $j$-th grid is

h_{j}=\sigma\left(W_{x}x_{j}+W_{h}h_{j-1}+b_{h}\right),

where $x_{j}$ is the corresponding embedding input. The prediction at time step $j$ is given by $y_{j}=f(W_{y}h_{j}+b_{y})\in\mathbb{R}$. In contrast, the RNN-based TPP model must take into account any time point $t$ between the grids $t_{j}$ and $t_{j+1}$. Hence the interpolation of $h^{(l)}(t;S)$ between $h_{j}^{(l)}$ and $h_{j+1}^{(l)}$ is heuristically necessary to give reasonable model predictions over the entire time interval $(t_{j},t_{j+1}]$.

Remark 4.

In the literature, there exist a few methods to interpolate the hidden embedding between $h_{j}^{(L)}$ and $h_{j+1}^{(L)}$. In Du et al. (2016), a constant embedding mechanism is used, i.e., $h^{(l)}(t;S)\equiv h_{j}^{(l)}$ for $t\in(t_{j},t_{j+1}]$ and any $j$ and $l$. In Mei and Eisner (2017), the authors adopted an exponential decay method to encode the hidden representations under an extended RNN architecture, the Long Short-Term Memory (LSTM) network. More recently, Rubanova et al. (2019) used the neural ordinary differential equation (ODE) method for solving the intermediate hidden state $h^{(l)}(t;S)$.

It can be shown that the first two interpolation methods are unable to precisely capture the true intensity in the sense of excess risk. We will give the explanation in Section 6; see Theorem 8.

Remark 5.

Our result still holds if $\tanh$ is replaced with other sigmoidal-type activation functions (Cybenko, 1989) (e.g., ReLU (Fukushima, 1969)). In the literature on TPP modelling, the most common choice of $f(x)$ is the Softplus function (Dugas et al., 2001; Zhou et al., 2022), $\log(1+\exp(x))$, which ensures that $\lambda_{\theta}(t;S)$ is positive and differentiable. Our result also holds if we take $f(x)$ to be $\min\{\max\{\log(1+\exp(x)),l_{f}\},u_{f}\}$ with $0<l_{f}<u_{f}$. Introducing $l_{f}$ and $u_{f}$ serves only a technical purpose, i.e., keeping the predicted intensity value bounded from above and below.

2.3 Classical TPPs

In the statistical literature, TPPs can be categorized into several types based on the nature of the intensity functions. Four main categories are summarized as follows.

Homogeneous Poisson process (Kingman, 1992). It is the simplest type, where events occur completely independently of one another and the intensity function is constant, i.e., $\lambda^{\ast}(t)\equiv\lambda$, where $\lambda$ is unknown and needs to be estimated.

Non-homogeneous Poisson process (Kingman, 1992; Daley et al., 2003). In this model, the intensity function varies over time but is still independent of past events. That is, $\lambda^{\ast}(t)$ is a non-constant unknown function that is usually estimated via certain nonparametric methods.

Self-exciting process (Hawkes and Oakes, 1974). Future events are influenced by past events, which can lead to clustering of events in time. A well-known example is the Hawkes process (Hawkes, 1971; Hawkes and Oakes, 1974), where the intensity function takes the form

\lambda^{\ast}(t)=\lambda_{0}(t)+\sum_{j:t_{j}<t}\mu(t-t_{j}), (9)

where $\lambda_{0}(t)$ and $\mu(t)$ are positive functions called the background intensity and the excitation/impact function, respectively. In many applications (Laub et al., 2021), the excitation function takes the exponential form $\mu(t)=\alpha\exp(-\beta t)$, which allows efficient computation. The model defined in (9) is also known as the linear self-exciting process since the intensity is an additive form of different components. More generally, the non-linear self-exciting process (Brémaud and Massoulié, 1996)

\lambda^{\ast}(t)=\Psi\left(\lambda_{0}(t)+\sum_{j:t_{j}<t}\mu(t-t_{j})\right), (10)

is also considered in the literature, where $\Psi$ is a non-linear function.

Self-correcting process (Isham and Westcott, 1979; Ogata and Vere-Jones, 1984). The occurrence of an event decreases the likelihood of future events for some time period. To be mathematically formal, the intensity takes the form

\lambda^{\ast}(t)=\Psi\left(\mu t-\sum_{j:t_{j}<t}\alpha\right), (11)

where both $\mu$ and $\alpha$ are positive and $\Psi$ may be a non-linear function.
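For illustration, the four intensity types above can be written in a few lines; the sketch below is our own example with arbitrary parameter values and evaluates each intensity at a time $t$ given the history of past event times.

```python
import numpy as np

def poisson(t, history, lam=1.5):
    return lam                                         # homogeneous Poisson: constant

def nonhomogeneous(t, history, a=1.0, b=0.5):
    return a + b * np.sin(t)                           # non-constant but history-free

def hawkes(t, history, lam0=0.5, alpha=0.8, beta=1.2):
    past = history[history < t]
    return lam0 + alpha * np.sum(np.exp(-beta * (t - past)))   # linear self-exciting, eq. (9)

def self_correcting(t, history, mu=1.0, alpha=0.3):
    past = history[history < t]
    return np.exp(mu * t - alpha * len(past))          # eq. (11) with Psi = exp

hist = np.array([0.4, 1.1, 2.3])
print(hawkes(2.5, hist), self_correcting(2.5, hist))
```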

2.4 Notations

Let $a\wedge b=\min\{a,b\}$ and $a\vee b=\max\{a,b\}$. We use $\mathbb{N}$ and $\mathbb{Z}$ to denote the set of nonnegative integers and the set of all integers, respectively. Denote $[n]=\{1,2,\cdots,n\}$ for a positive integer $n$. Let $\lceil a\rceil=\min\{b\in\mathbb{Z},b\geq a\}$. For a set $A$, denote $\#(A)$ to be its cardinality. For a vector $x=(x_{1},\cdots,x_{d})^{\top}\in\mathbb{R}^{d}$, denote its Euclidean norm as $\|x\|_{2}=\sqrt{\sum_{i=1}^{d}x_{i}^{2}}$. Write $a_{N}\lesssim b_{N}$ if there exists some constant $C>0$ such that $a_{N}\leq Cb_{N}$ for all indices $N$, where the range of $N$ may be defined case by case. For a function $f$ defined on some domain, denote $\|f\|_{L^{\infty}}$ as its essential upper bound. For $s\in\mathbb{N}$, the Sobolev norm $\|f\|_{W^{s,\infty}([0,T])}$ is defined as $\|f\|_{W^{s,\infty}([0,T])}=\max_{0\leq|\alpha|\leq s}\|D^{\alpha}f\|_{L^{\infty}([0,T])}$. For a constant $B_{0}>0$, the $B_{0}$-ball of the Sobolev space $W^{s,\infty}([0,T])$ is defined as

W^{s,\infty}([0,T],B_{0}):=\left\{f\in W^{s,\infty}([0,T]),\|f\|_{W^{s,\infty}([0,T])}\leq B_{0}\right\}.

For a constant $C_{0}>0$, the ball $C^{s,\infty}([0,T],C_{0})$ is the subset of $W^{s,\infty}([0,T],C_{0})$ that contains all $s$-order smooth functions. We use $O(\cdot)$ to hide all constants and use $\tilde{O}(\cdot)$ to denote $O(\cdot)$ with hidden log factors. Throughout this paper, $\alpha$, $\beta$, $\gamma$, $\mathcal{C}$, and $\mathcal{C}_{1}$ are positive real numbers and may be defined case by case.

3 Main Results

Recent applications in event stream analyses have witnessed the usefulness of TPPs with the incorporation of RNNs. However, there is no study in the existing literature to explain why the RNN structure in TPP modeling is so useful from the theoretical perspective. We attempt to answer the question of whether the RNN-TPPs can provably have small generalization error or excess risk. Our answer is positive! When the event data are generated according to the classical models described in Section 2.3, we show that the RNN-TPPs can perfectly generalize such data.

To make our presentation easier, we focus on the self-exciting processes; the homogeneous Poisson, non-homogeneous Poisson, and self-correcting processes can be treated similarly, for the following reasons. If we take $\mu(t)\equiv 0$ in (9), the linear self-exciting process reduces to the homogeneous or non-homogeneous Poisson process. In the RNN-TPP architecture, we can take the input embedding function $x(t;S)=(t,t-F_{S}(t),N(t-))$, i.e., use an additional input dimension to store the number of past events; then establishing the excess risk of the self-correcting process is technically equivalent to that of the non-homogeneous Poisson process. To start with, we first consider the linear case (9).

Some regularity assumptions should be stated before we present the main theorem.

(A1) There exists a constant $B_{0}>0$ such that $\lambda_{0}\in W^{s,\infty}([0,T],B_{0})$, where $s\geq 1$, $s\in\mathbb{N}$.

(A2) $\int_{0}^{T}\mu(t)\mathrm{d}t:=c_{\mu}<1$.

(A3) There exists a positive constant $B_{1}$ such that $\inf_{t\in[0,T]}\lambda_{0}(t)\geq B_{1}$.

Assumption (A1) imposes the boundedness of the background intensity, which is also common in neural network approximation studies. Assumption (A2) is standard in the Hawkes process literature and guarantees the existence of a stationary version of the process when $\lambda_{0}(t)$ is constant. Assumption (A3) is a lower bound assumption, which ensures that sufficient intensity exists in any subdomain of $[0,T]$.

Now we can present the results on the non-asymptotic bound of excess risk (4) under model (9).

Theorem 1.

Under model (9) and the RNN-TPP class $\mathcal{F}=\mathcal{F}_{L,D,B_{m},l_{f},u_{f}}$ defined as (8), suppose that assumptions (A1)-(A3) hold. Then, for $n$ i.i.d. sample sequences $\{S_{i},i\in[n]\}$, with probability at least $1-\delta$, the excess risk (4) of the ERM (3) satisfies:

(i) (Poisson case) If $\mu\equiv 0$, for $L=2$, $D=\tilde{O}(n^{\frac{1}{2(s+1)}})$, $B_{m}=\tilde{O}(n^{\frac{s+1}{4}})$, $l_{f}=B_{1}\wedge 1$, and $u_{f}=B_{0}$,

\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\leq\tilde{O}\left(n^{-\frac{s}{2(s+1)}}\right); (12)

(ii) (Vanilla Hawkes case) If $\mu(t)=\alpha\exp(-\beta t)$, for $L=2$, $D=\tilde{O}(n^{\frac{1}{2(s+1)}})$, $B_{m}=\tilde{O}((\log n)^{3s^{2}\log^{2}n})$, $l_{f}=B_{1}\wedge 1$, and $u_{f}=B_{0}+O(\log n)$,

\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\leq\tilde{O}\left(n^{-\frac{s}{2(s+1)}}\right); (13)

(iii) (General case) If $\mu\in C^{k,\infty}([0,T],C_{0})$, $k\geq 2$, $k\in\mathbb{N}$, for $L=2$, $D=\tilde{O}(n^{\frac{1}{2}\left(\frac{1}{s+1}\vee\frac{5}{k+4}\right)})$, $B_{m}=\tilde{O}((\log n)^{3s^{2}\log^{2}n})$, $l_{f}=B_{1}\wedge 1$, and $u_{f}=B_{0}+O(\log n)$,

\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\leq\tilde{O}\left(n^{-\frac{1}{2}\left(\frac{s}{s+1}\wedge\frac{k-1}{k+4}\right)}\right). (14)

As suggested by Theorem 1, there exists a two-layer RNN-TPP model whose excess risk vanishes as the size of the training set goes to infinity. The width of such a network grows with the sample size, while the depth remains two.

Remark 6.

Here we require the depth of the RNN-TPP to be $L=2$ due to the fact that $\lambda_{0}\in W^{s,\infty}([0,T],B_{0})$. However, if we allow $\lambda_{0}$ to be sufficiently smooth (i.e., $\lambda_{0}\in C^{\infty}([0,T])$), we only need a one-layer $\tanh$ neural network to approximate $\lambda_{0}$. As a result, the number of layers of the RNN-TPP can be reduced to one.

Now we consider the true model to be a non-linear Hawkes process, which is given in (10). For simplicity, we only consider the case $\mu(t)=\alpha\exp(-\beta t)$, which is

\lambda^{\ast}(t)=\Psi\left(\lambda_{0}(t)+\sum_{t_{i}<t}\alpha\exp(-\beta(t-t_{i}))\right). (15)

The regularity of $\Psi$ is presented as Assumption (A4).

(A4) The function $\Psi$ is $L$-Lipschitz, positive, and bounded. In other words, there exist $\tilde{B_{1}},\tilde{B_{0}}>0$ such that $\tilde{B_{1}}\leq\Psi\leq\tilde{B_{0}}$ and $|\Psi(x_{1})-\Psi(x_{2})|\leq L|x_{1}-x_{2}|$ for any $x_{1},x_{2}$.

We have a similar bound of excess risk (4) under model (15).

Theorem 2.

(Nonlinear Hawkes Case) Under model (15) and the RNN-TPP class $\mathcal{F}=\mathcal{F}_{L,D,B_{m},l_{f},u_{f}}$ defined as (8), suppose that assumptions (A1) and (A4) hold. Then, for $n$ i.i.d. sample sequences $\{S_{i},i\in[n]\}$, with probability at least $1-\delta$, for $L=4$, $D=\tilde{O}(n^{\frac{1}{4}})$, $B_{m}=\tilde{O}((\log n)^{3s^{2}\log^{2}n})$, $l_{f}=\tilde{B}_{1}\wedge 1$, and $u_{f}=\tilde{B}_{0}$, the excess risk (4) of the ERM (3) satisfies:

\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\leq\tilde{O}\left(n^{-\frac{1}{4}}\right). (16)

For the non-linear case, as indicated by Theorem 2, we require a deeper RNN-TPP with four layers to achieve a vanishing excess risk. Under the Lipschitz assumption on $\Psi$, the width of the hidden layers is of order $n^{1/4}$. When $\Psi$ is allowed to have higher-order smoothness, the width can be reduced to that of the vanilla Hawkes case.

Remark 7.

(i) Two additional RNN layers are required for the approximation of an arbitrary non-linear Lipschitz continuous function $\Psi$. (ii) For the model $\lambda^{\ast}(t)=\Psi\left(\lambda_{0}(t)+\sum_{t_{i}<t}\mu(t-t_{i})\right)$ with a general excitation function $\mu$, we can obtain a similar excess risk bound using the same techniques as in the proof of Theorem 1.

To better explain the excess risk bounds obtained in Theorems 1-2, we rely on the following decomposition lemma.

Lemma 1.

Let $\check{\lambda}^{\ast}=\arg\min_{\lambda\in\mathcal{F}}\mathbb{E}[\text{loss}(\lambda,S_{test})]$. For any random sample $\{S_{i},i\in[n]\}$, the excess risk of the ERM (3) satisfies

\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\leq\underbrace{2\sup_{\lambda\in\mathcal{F}}\Big{|}\mathbb{E}[\text{loss}(\lambda,S_{test})]-\frac{1}{n}\sum_{i\in[n]}\text{loss}(\lambda,S_{i})\Big{|}}_{\text{stochastic error}}+\underbrace{\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]}_{\text{approximation error}}. (17)

By Lemma 1, the excess risk of the ERM is bounded by the sum of two terms, the stochastic error $2\sup_{\lambda\in\mathcal{F}}|\mathbb{E}[\text{loss}(\lambda,S_{test})]-n^{-1}\sum_{i\in[n]}\text{loss}(\lambda,S_{i})|$ and the approximation error $\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]$. The first term can be bounded by the complexity of the function class $\mathcal{F}$ using empirical process theory, where the unboundedness of the loss function needs to be handled carefully; we present the details in section 4. The second term characterizes the approximation ability of the RNN function class $\mathcal{F}$ to the true intensity $\lambda^{\ast}$ under the measure of the expectation of the negative log-likelihood loss. In order to bound this term, we need to carefully construct a suitable RNN which approximates $\lambda^{\ast}$ well. This has not been studied yet in the literature; see section 5 for the details.
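For completeness, Lemma 1 follows from the standard ERM decomposition; a short derivation (ours, using only the definitions of $\hat{\lambda}$ and $\check{\lambda}^{\ast}$, and writing $\mathbb{E}[\text{loss}(\lambda)]$ for $\mathbb{E}[\text{loss}(\lambda,S_{test})]$) reads

\begin{aligned}
\mathbb{E}[\text{loss}(\hat{\lambda})]-\mathbb{E}[\text{loss}(\lambda^{\ast})]
&=\Big(\mathbb{E}[\text{loss}(\hat{\lambda})]-\tfrac{1}{n}\textstyle\sum_{i}\text{loss}(\hat{\lambda},S_{i})\Big)
+\Big(\tfrac{1}{n}\textstyle\sum_{i}\text{loss}(\hat{\lambda},S_{i})-\tfrac{1}{n}\textstyle\sum_{i}\text{loss}(\check{\lambda}^{\ast},S_{i})\Big)\\
&\quad+\Big(\tfrac{1}{n}\textstyle\sum_{i}\text{loss}(\check{\lambda}^{\ast},S_{i})-\mathbb{E}[\text{loss}(\check{\lambda}^{\ast})]\Big)
+\Big(\mathbb{E}[\text{loss}(\check{\lambda}^{\ast})]-\mathbb{E}[\text{loss}(\lambda^{\ast})]\Big),
\end{aligned}

where the second bracket is nonpositive because $\hat{\lambda}$ minimizes the empirical risk over $\mathcal{F}$, and the first and third brackets are each bounded by $\sup_{\lambda\in\mathcal{F}}|\mathbb{E}[\text{loss}(\lambda,S_{test})]-n^{-1}\sum_{i}\text{loss}(\lambda,S_{i})|$, which yields (17).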

Based on Lemma 1, the results in Theorem 1 admit the following form,

\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\leq O\left(\frac{C(N)}{\sqrt{n}}+\frac{1}{R(N)}\right),

where $C(N)/\sqrt{n}$ is the stochastic error and $1/R(N)$ is the approximation error. $C(N)$ is the complexity of the RNN function class $\mathcal{F}$ and $R(N)$ is the corresponding approximation rate, where $N$ is a tuning parameter. For the Poisson case, we can construct a two-layer RNN-TPP with $O(N)$ width to achieve $O(N^{-s})$ approximation error. Hence $C(N)=O(N)$, $R(N)=O(N^{s})$, and the final excess risk bound is $\tilde{O}(n^{-\frac{s}{2(s+1)}})$ in (12). For the vanilla Hawkes case, since the exponential function is $C^{\infty}$-smooth, we only need extra $O(\text{Poly}(\log N))$ hidden cells in each layer to obtain an $\tilde{O}(N^{-s})$ approximation error, and then we have the same order excess risk bound. For the general case, motivated by the vanilla Hawkes case, we decompose $\mu\in C^{k,\infty}([0,T],C_{0})$ into two parts. One part is a polynomial of exponential functions which can be well approximated by an $O(\text{Poly}(\log N))$-width $\tanh$ neural network. The other part is a function $\tilde{\mu}\in C^{k,\infty}([0,T],\tilde{C_{0}})$ satisfying $\tilde{\mu}^{(j)}(0+)=\tilde{\mu}^{(j)}(T-)$, $j=0,1,\cdots,k-1$. It is easy to check that the $r$-th Fourier coefficients of $\tilde{\mu}$, $\hat{\mu}_{r}$, decay at the rate $r^{-k}$. Then it is sufficient to approximate the first $N$ functions in the Fourier expansion of $\tilde{\mu}$ to get an $\tilde{O}(N^{-(k-1)})$ approximation error, which additionally costs $\tilde{O}(N^{5})$ complexity (see section 5.3 for details). Combining this with the approximation result for $\lambda_{0}$, we get the final bound $\tilde{O}(n^{-\frac{1}{2}\left(\frac{s}{s+1}\wedge\frac{k-1}{k+4}\right)})$. Similarly, for the nonlinear Hawkes case, we need $\tilde{O}(N)$ complexity to obtain an $\tilde{O}(N^{-1})$ approximation error, which leads to the $\tilde{O}(n^{-\frac{1}{4}})$ excess risk bound.
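As a sanity check of the rates, the trade-off in the Poisson case can be balanced explicitly (the following calculation is ours and ignores logarithmic factors):

\frac{C(N)}{\sqrt{n}}+\frac{1}{R(N)}\asymp\frac{N}{\sqrt{n}}+N^{-s},\qquad N\asymp n^{\frac{1}{2(s+1)}}\ \Longrightarrow\ \frac{N}{\sqrt{n}}\asymp N^{-s}\asymp n^{-\frac{s}{2(s+1)}},

which recovers the exponent in (12). The same balancing with the extra width $N_{\mu}^{5}$ against the Fourier truncation error $N_{\mu}^{-(k-1)}$ gives $N_{\mu}\asymp n^{\frac{1}{2(k+4)}}$ and hence the exponent $\frac{1}{2}\left(\frac{s}{s+1}\wedge\frac{k-1}{k+4}\right)$ in (14).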

As we emphasized in the above remarks, the number of layers depends on the smoothness of $\lambda_{0}$. If $\lambda_{0}\in C^{\infty}([0,T])$ and $\|\lambda_{0}\|_{W^{s,\infty}}\leq C^{s}$, we only need a one-layer $\tanh$ neural network to approximate $\lambda_{0}$; hence the number of layers in the RNN-TPP can be reduced to one.

4 Stochastic Error

In this section, we focus on the stochastic error in (17). This type of stochastic error for the RNN function class has been studied in the recent literature, such as Chen et al. (2020) and Tu et al. (2020). However, they only consider the case where the lengths of the input sequences are bounded, which is not applicable under the TPP setting. Here we establish an upper bound of the stochastic error in (17) by a novel decoupling technique to make the classical results applicable. This technique can be used in many other related problems.

4.1 Main Variance Term

We first state some mild assumptions on the RNN-TPP function class $\mathcal{F}$ under a more general framework.

(B1) The embedding function $x(\cdot)$ is bounded by a constant $B_{in}(T)$ on the time domain $[0,T]$, i.e., $\|x(\cdot)\|_{2}\leq B_{in}(T)$.

(B2) The parameter $\theta$ lies in a bounded domain $\Theta$. More precisely, we assume that the spectral norms of the weight matrices (vectors) and the other parameters are bounded respectively, i.e., $\|W_{x}^{(l)}\|_{2}\leq B_{x}$, $\|W_{h}^{(l)}\|_{2}\leq B_{h}$, $\|b^{(l)}\|_{2}\leq B_{b}$, $1\leq l\leq L+1$, and $B_{m}=\max\{B_{b},B_{h},B_{x}\}$.

(B3) The activation functions $\sigma$ and $f$ are Lipschitz continuous with parameters $\rho_{\sigma}$ and $\rho_{f}$ respectively, $\sigma(0)=0$, and there exists $|b_{0}|\leq B_{b}$ such that $f(b_{0})=1$. Additionally, $\sigma$ is entrywise bounded by $B_{\sigma}$, and $f$ satisfies $l_{f}\leq\|f\|_{L^{\infty}}\leq u_{f}$.

Now we consider the first term of (17). For convenience, we denote $X_{\theta}=\mathbb{E}[\text{loss}(\lambda_{\theta},S_{test})]-n^{-1}\sum_{i=1}^{n}\text{loss}(\lambda_{\theta},S_{i})$.

Theorem 3.

Under assumptions (B1)-(B3), suppose the event number $N_{e}$ satisfies the tail condition

\mathbb{P}(N_{e}\geq s)\leq a_{N}\exp(-c_{N}s),~s\in\mathbb{N}.

Then, with probability at least $1-\delta$, we have

\sup_{\theta\in\Theta}|X_{\theta}|\leq\frac{192}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s_{0}+1)u_{f}\Bigg{(}\sqrt{\log\left(\frac{4}{\delta}\right)}+D\sqrt{(3L+2)}\left(\sqrt{\log\left(1+M(s_{0})\right)}+1\right)+\frac{1}{(1-\exp(-c_{N}))^{2}}\Bigg{)}~.

Thus

\sup_{\|\theta\|\leq B_{m}}|X_{\theta}|\leq\tilde{O}\left(\sqrt{\frac{D^{2}L^{2}}{n}}\right)~, (18)

where $s_{0}=\lceil{c_{N}}^{-1}\left(\log\left(2a_{N}n/\delta\right)-1\right)\rceil$, $M(s)=\rho_{f}B_{m}\sqrt{D}(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1)(\gamma^{L}\vee 1)(s+1)^{L-1}(\beta^{s+1}-1)/(\beta-1)$, $\gamma=\rho_{\sigma}B_{x}$, and $\beta=\rho_{\sigma}B_{h}$.

Remark 8.

There exist constants $a_{N},c_{N}$ such that the tail condition $\mathbb{P}(N_{e}\geq s)\leq a_{N}\exp(-c_{N}s)$, $s\in\mathbb{N}$, always holds for (non-)homogeneous Poisson processes, linear and nonlinear Hawkes processes, and self-correcting processes under weak assumptions. To be more concrete, Lemma 2 in the following section gives a result for the linear case.

Remark 9.

For a one-layer RNN with width $D$ and bounded sequence length $T$, Chen et al. (2020) gives an $\tilde{O}\big{(}\sqrt{{D^{3}T}/{n}}\big{)}$-type stochastic error bound. Our bound reduces the term $D^{3}$ to $D^{2}$, thanks to the bounded output layer, i.e., $f(x)=\min\{\max\{x,l_{f}\},u_{f}\}$. The term $D^{2}$ is also order-optimal, noticing that the number of free parameters in a single-layer RNN is at least $D^{2}$.

The stochastic error bound in Theorem 3 is mainly determined by the complexity of the RNN function class $\mathcal{F}$, which will be discussed in the following subsection. To obtain this bound, we need to handle the unboundedness of the event number. We use a truncation technique to decouple the randomness of the tail of $N_{e}$, which allows us to use classical empirical process theory to derive the upper bound. Our computation is motivated by Chen et al. (2020), which gives the generalization error bound of a single-layer RNN function class.

4.2 Key Techniques

To be reader-friendly, the main techniques for proving Theorem 3 are summarized as follows.

4.2.1 Probability Bound of the Event Number

Define $N_{e(n)}:=\max\{N_{ei},1\leq i\leq n\}$. The following lemma characterizes the tails of the event numbers $N_{e}$ and $N_{e(n)}$ under model (9) and assumptions (A1) and (A2) (for assumption (A1), we only need $\lambda_{0}\leq B_{0}$ in this section). The proof is similar to Proposition 2 in Hansen et al. (2015); see the supplementary material for the details.

Lemma 2.

For model (9), under assumptions (A1) and (A2), with probability at least $1-\delta$, we have

N_{e(n)}<\frac{1}{1-c_{\mu}\eta}\left(\frac{2}{\log(\eta)}\log\left(\frac{2n\sqrt{B_{0}T}}{\delta(1-c_{\mu})}\right)+\eta(B_{0}T)\right).

Hence

\mathbb{P}\left(N_{e}=s\right)\leq\mathbb{P}\left(N_{e}\geq s\right)\leq\frac{2\sqrt{B_{0}T}}{1-c_{\mu}}\exp\left(\frac{\log(\eta)}{2}\left[\eta(B_{0}T)-(1-c_{\mu}\eta)s\right]\right),

where $\eta\in\left(1,c_{\mu}^{-1}\right)$. Let $a_{N}=2\sqrt{B_{0}T}\exp(\log(\eta_{0})\eta_{0}(B_{0}T)/2)/(1-c_{\mu})$ and $c_{N}=\log(\eta_{0})(1-c_{\mu}\eta_{0})/2$ with $\eta_{0}\in\left(1,c_{\mu}^{-1}\right)$ being fixed. Then

\mathbb{P}\left(N_{e}=s\right)\leq\mathbb{P}\left(N_{e}\geq s\right)\leq a_{N}\exp(-c_{N}s). (19)

Our result is more refined than Proposition 2 in Hansen et al. (2015), in that we compute all the constants explicitly and introduce a tuning parameter to control the probability bound.
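As a quick illustration (our own sketch, with arbitrary values of $B_0$, $T$, $c_\mu$, and $\eta_0$), the constants in Lemma 2 and the resulting tail bound (19) can be evaluated directly:

```python
import numpy as np

def tail_constants(B0, T, c_mu, eta0):
    """Constants a_N, c_N of Lemma 2 for a fixed eta0 in (1, 1/c_mu)."""
    a_N = 2 * np.sqrt(B0 * T) * np.exp(np.log(eta0) * eta0 * B0 * T / 2) / (1 - c_mu)
    c_N = np.log(eta0) * (1 - c_mu * eta0) / 2
    return a_N, c_N

a_N, c_N = tail_constants(B0=1.0, T=2.0, c_mu=0.5, eta0=1.5)
# Bound (19) on P(N_e >= s); it decays exponentially in s.
for s in (10, 20, 40):
    print(s, a_N * np.exp(-c_N * s))
```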

For the nonlinear case (10), under Assumption (A4), we can obtain results similar to the non-homogeneous Poisson case, which are included in the above Lemma.

4.2.2 From Unboundedness to Boundedness

The following lemma is the key to handling the unboundedness of $X_{\theta}$, i.e., the unboundedness of the loss function. For any $s\in\mathbb{N}$, we let $X_{\theta}(s)=\mathbb{E}\left[\text{loss}(\lambda_{\theta},S_{test})\mathbbm{1}_{\{N_{e}\leq s\}}\right]-n^{-1}\sum_{i=1}^{n}\text{loss}(\lambda_{\theta},S_{i})\mathbbm{1}_{\{N_{ei}\leq s\}}$ and $E_{\theta}(s)=\mathbb{E}\left[\text{loss}(\lambda_{\theta},S_{test})\mathbbm{1}_{\{N_{e}>s\}}\right]$.

Lemma 3.

For any $s\in\mathbb{N}$ and nonempty parameter set $\Theta$, we have

\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t\right)\leq\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s)|+\sup_{\theta\in\Theta}|E_{\theta}(s)|>t\right)+\mathbb{P}(N_{e(n)}>s). (20)
Proof of Lemma 3.

For any $\omega\in\{N_{e(n)}\leq s\}$, we have

X_{\theta}(\omega)=\mathbb{E}[\text{loss}(\lambda_{\theta},S_{test})]-\frac{1}{n}\sum_{i=1}^{n}\text{loss}(\lambda_{\theta},S_{i})(\omega)
=\mathbb{E}[\text{loss}(\lambda_{\theta},S_{test})]-\frac{1}{n}\sum_{i=1}^{n}\text{loss}(\lambda_{\theta},S_{i})\mathbbm{1}_{\{N_{ei}\leq s\}}(\omega)
=X_{\theta}(s)(\omega)+E_{\theta}(s)(\omega).

Hence, under the condition $N_{e(n)}\leq s$, we have $|X_{\theta}|\leq|X_{\theta}(s)|+|E_{\theta}(s)|$, and thus $\sup_{\theta\in\Theta}|X_{\theta}|\leq\sup_{\theta\in\Theta}|X_{\theta}(s)|+\sup_{\theta\in\Theta}|E_{\theta}(s)|$. Then

\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t\right)=\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t,~N_{e(n)}\leq s\right)+\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t,~N_{e(n)}>s\right)
\leq\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s)|+\sup_{\theta\in\Theta}|E_{\theta}(s)|>t,~N_{e(n)}\leq s\right)+\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t,~N_{e(n)}>s\right)
\leq\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s)|+\sup_{\theta\in\Theta}|E_{\theta}(s)|>t\right)+\mathbb{P}\left(N_{e(n)}>s\right).

The consequence of this lemma is to decompose $\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t\right)$ into two parts. The first part, $\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s)|+\sup_{\theta\in\Theta}|E_{\theta}(s)|>t\right)$, is the tail probability of the supremum of a set of bounded variables and can therefore be handled by standard empirical process theory. The second part, $\mathbb{P}(N_{e(n)}>s)$, is the tail probability of $N_{e(n)}$. Thanks to Lemma 2, this term can be controlled by the exponential decay property of the sub-critical point process. By choosing a suitable $s$, we can make (20) sharper. This result plays a key role in the stochastic error calculations.

4.2.3 Complexity of the RNN-TPP Class

To get the result in Theorem 3, we need to compute the complexity of the RNN function class specified in section 2.2. There are many possible complexity measures in deep learning theory (Suh and Cheng, 2024); here we choose the covering number, which can be computed explicitly for the RNN function class. In our setup, the key to the computation of the covering number is finding the Lipschitz continuity constant of RNN-TPPs, which separates the spectral norms of the weight matrices from the total number of parameters (Chen et al., 2020).

Consider two different sets of parameters $\theta_{1}=\{W_{x,1}^{(l)},W_{h,1}^{(l)},b_{1}^{(l)};1\leq l\leq L+1\}$ and $\theta_{2}=\{W_{x,2}^{(l)},W_{h,2}^{(l)},b_{2}^{(l)};1\leq l\leq L+1\}$. Denote $\Delta_{b}^{l}=\|b_{1}^{(l)}-b_{2}^{(l)}\|_{2}$, $\Delta_{h}^{l}=\|W_{h,1}^{(l)}-W_{h,2}^{(l)}\|_{2}$, $\Delta_{x}^{l}=\|W_{x,1}^{(l)}-W_{x,2}^{(l)}\|_{2}$, $1\leq l\leq L+1$ ($\Delta_{h}^{L+1}\equiv 0$). The following lemma characterizes the Lipschitz constant of $\lambda_{\theta}$.

Lemma 4.

Under assumptions (B1)-(B3), given an input sequence of length $N_{S}$, $S=\{t_{i}\}_{i=1}^{N_{S}}\subset[0,T]$ (here we set $t_{N_{S}+1}=T$), for $t\in(t_{i},t_{i+1}]$, $1\leq i\leq N_{S}$, and $\theta_{1},\theta_{2}\in\Theta$, we have

\left|\lambda_{\theta_{1}}(t;S)-\lambda_{\theta_{2}}(t;S)\right|\leq\rho_{f}\gamma\left(\sum_{l=0}^{L-1}\gamma^{l}S_{i}^{l}\Delta_{b}^{L-l}+B_{\sigma}\sqrt{D}\sum_{l=0}^{L-2}\gamma^{l}S_{i}^{l}\Delta_{x}^{L-l}+B_{in}(T)\gamma^{L-1}S_{i}^{L-1}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{l=0}^{L-1}\gamma^{l}S_{i-1}^{l}\Delta_{h}^{L-l}\right)+\rho_{f}\Delta_{b}^{L+1}+\rho_{f}B_{\sigma}\sqrt{D}\Delta_{x}^{L+1}, (21)

where $\beta=\rho_{\sigma}B_{h}$, $\gamma=\rho_{\sigma}B_{x}$, $S_{i}^{l}=\sum_{j=0}^{i}\tbinom{j+l}{l}\beta^{j}$ ($S_{-1}^{l}=0$), and $d=\max\{d_{l}\,|\,1\leq l\leq L+1\}$. We set $\sum_{l=a}^{b}A_{l}=0$ if $a>b$.

The proof of Lemma 4 is based on induction; the full proof is given in the supplementary material. Our result is an extension of Lemma 2 in Chen et al. (2020), which only considers the family of one-layer RNN models. Lemma 4 is of independent interest and can be useful in other problems regarding RNN-based modeling. Using Lemma 4, we can establish a covering number bound for $\mathcal{F}$ under a “truncated” distance.

Denote $\mathcal{N}\left(\mathcal{F},\epsilon,d(\cdot,\cdot)\right)$ as the covering number of the metric space $\mathcal{F}$, i.e., the minimal cardinality of a subset $\mathcal{C}\subset\mathcal{F}$ that covers $\mathcal{F}$ at scale $\epsilon$ with respect to the metric $d(\cdot,\cdot)$. Given a fixed integer $N_{0}$, we define a truncated distance,

d_{N_{0}}(\lambda_{\theta_{1}},\lambda_{\theta_{2}})=\sup_{\#(S)\leq{N_{0}}}\left\|\lambda_{\theta_{1}}(t;S)-\lambda_{\theta_{2}}(t;S)\right\|_{L^{\infty}[0,T]}~.

The following lemma gives an upper bound on $\mathcal{N}\left(\mathcal{F},\epsilon,d_{N_{0}}(\cdot,\cdot)\right)$.

Lemma 5.

Under assumptions (B1)-(B3), for any $\epsilon>0$ and $\mathcal{F}=\mathcal{F}_{L,D,B_{m},l_{f},u_{f}}$ defined as (8), the covering number $\mathcal{N}\left(\mathcal{F},\epsilon,d_{N_{0}}(\cdot,\cdot)\right)$ is bounded by

\mathcal{N}\left(\mathcal{F},\epsilon,d_{N_{0}}(\cdot,\cdot)\right)\leq\left(1+\frac{C({N_{0}})(3L+2)B_{m}\sqrt{D}}{\epsilon}\right)^{D^{2}(3L+2)},

where $C(N_{0})=\rho_{f}(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1)(\gamma^{L}\vee 1)({N_{0}}+1)^{L-1}(\beta^{{N_{0}}+1}-1)/(\beta-1)$, $\gamma=\rho_{\sigma}B_{x}$, and $\beta=\rho_{\sigma}B_{h}$.

By Lemma 5, taking $N_{0}=s$, we can get a non-asymptotic bound on $X_{\theta}(s)$, which is an important step toward obtaining the first part of (20).

5 Approximation Error

In this section, we focus on the approximation error, i.e., the second part of (17). The approximation error of deep neural networks has been broadly studied in the literature (Schmidt-Hieber, 2020; Shen et al., 2019; Jiao et al., 2023; Lu et al., 2021). However, most of these works only consider the ReLU activation, which is different from $\tanh$, the activation function usually chosen for RNNs. Recently, De Ryck et al. (2021) studied the approximation properties of shallow $\tanh$ neural networks, which provides a technical tool for our analysis. To the best of our knowledge, the approximation ability of RNN-type networks has not been fully studied in the literature. Here we provide a family of approximation results for the intensities of the various TPP models stated in section 2.3.

5.1 Poisson Case

We start with the approximation of the (non-homogeneous) Poisson process, whose intensity is independent of the event history, i.e., $\lambda^{\ast}(t)=\lambda_{0}(t)$, where $\lambda_{0}(t)$ is an unknown function. In this case, we do not need to take into account the transfer of information in the time domain. To be precise, we can take $W_{h}^{l}=0$ for $l\in[L]$. Then the problem degenerates to a standard neural network approximation problem. Using the approximation results for $\tanh$ neural networks in De Ryck et al. (2021), we can get the following approximation result.

Theorem 4.

(Approximation for the Poisson process) Under the model $\lambda^{\ast}(t)=\lambda_{0}(t)$ and assumptions (A1) and (A3), for $N\geq 5$, $N\in\mathbb{N}$, there exists an RNN-TPP $\hat{\lambda}^{N}$ as stated in section 2.2 with $L=2$, $l_{f}=B_{1}$, $u_{f}=B_{0}$, and input function $x(t;S)=t$ such that

|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\leq 15\exp\left({2B_{0}T}\right)(T+2B_{1}^{-1})\frac{\mathcal{C}T^{s}}{N^{s}}, (22)

where $\mathcal{C}=\sqrt{2s}5^{s}/(s-1)!$. Moreover, the width of $\hat{\lambda}^{N}$ satisfies $D\leq 3\lceil s/2\rceil+6N$ and the weights of $\hat{\lambda}^{N}$ are less than

\mathcal{C}_{1}\left[\frac{\sqrt{2s}5^{s}}{(s-1)!}B_{0}T^{s}\right]^{-\frac{s}{2}}N^{\frac{1+s^{2}}{2}}(s(s+2))^{3s(s+2)},

where $\mathcal{C}_{1}$ is a universal constant.

A graphical representation of the RNN approximation is given in Figure 2. For non-homogeneous Poisson models, the RNN-TPP $\hat{\lambda}^{N}$ in Theorem 4 is indeed a two-layer neural network. From Theorem 4, we need an RNN-TPP with $O(N)$ width and $B_{m}=O(N^{\frac{s^{2}+2}{2}})$ to obtain an $O(N^{-s})$ approximation error. Combining this with Theorem 3, we can get part (i) of Theorem 1.

Figure 2: The construction of RNN-TPP for the case of Poisson processes.

5.2 Vanilla Hawkes Case

Recall that the intensity of the vanilla Hawkes process has the form

\lambda^{\ast}(t)=\lambda_{0}(t)+\sum_{j:t_{j}<t}\alpha\exp\{-\beta(t-t_{j})\}. (23)

Different from the Poisson process, the intensity of the vanilla Hawkes process depends on historical events. Hence it cannot be approximated by a simple feed-forward neural network and requires the recurrent structure. We construct an RNN-TPP to approximate the intensity using the Markov property of (23). Specifically, note that if we have observed the first $k$ event times $\{t_{1},\cdots,t_{k}\}$, then for any $t$ satisfying $t_{k}<t\leq t_{k+1}$, we have

\lambda^{\ast}(t)-\lambda_{0}(t)=\sum_{j:t_{j}<t}\alpha\exp\{-\beta(t-t_{j})\}
=\exp(-\beta(t-t_{k}))\sum_{j:t_{j}\leq t_{k}}\alpha\exp\{-\beta(t_{k}-t_{j})\}
=(\lambda^{\ast}(t_{k})-\lambda_{0}(t_{k})+\alpha)\exp(-\beta(t-t_{k})).

Therefore, we can use the hidden layers in the RNN-TPP to store the information of $\lambda^{\ast}(t_{k})-\lambda_{0}(t_{k})$ and then compute $\lambda^{\ast}(t)-\lambda_{0}(t)$ with the help of the input $t-t_{k}$. Together with the approximation of $\lambda_{0}$, we can obtain the final approximation result. A graphical illustration of the above construction procedure is given in Figure 3.
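The Markov update above can be checked numerically; the short sketch below (ours, with arbitrary parameter values) maintains the single scalar $A_{k}=\lambda^{\ast}(t_{k})-\lambda_{0}(t_{k})+\alpha$ after each event and reproduces the brute-force sum over the whole history.

```python
import numpy as np

alpha, beta = 0.8, 1.2
events = np.array([0.4, 1.1, 2.3])

def excitation_direct(t):
    past = events[events < t]
    return alpha * np.sum(np.exp(-beta * (t - past)))   # brute-force sum over history

def excitation_markov(t):
    # maintain A_k = lambda*(t_k) - lambda_0(t_k) + alpha via A <- A*exp(-beta*dt) + alpha
    A, t_prev = 0.0, 0.0
    for tk in events[events < t]:
        A = A * np.exp(-beta * (tk - t_prev)) + alpha
        t_prev = tk
    return A * np.exp(-beta * (t - t_prev))

print(excitation_direct(2.5), excitation_markov(2.5))   # the two values coincide
```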

Theorem 5.

(Approximation for the vanilla Hawkes process) Under model (23), assumptions (A1), (A3), and $\alpha/\beta<1$, for $N\geq 5$, $N\in\mathbb{N}$, there exists an RNN-TPP $\hat{\lambda}^{N}$ as stated in section 2.2 with $L=2$, $l_{f}=B_{1}$, $u_{f}=B_{0}+O(\log N)$, and input function $x(t;S)=(t,t-F_{S}(t))^{\top}$ such that

|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{(\log N)^{2}}{N^{s}}. (24)

Moreover, the width of $\hat{\lambda}^{N}$ satisfies $D=O(N)$ and the weights of $\hat{\lambda}^{N}$ are less than

\mathcal{C}_{1}(\log(N))^{12s^{2}(\log(N))^{2}}~,

where $\mathcal{C}_{1}$ is a constant related to $s,B_{0},\beta$, and $T$.

Due to the smoothness of the exponential function, the approximation rate in Theorem 5 only adds a $\log(N)$ factor compared with the result in Theorem 4. Similarly, combining this with Theorem 3, we can easily get part (ii) of Theorem 1.

Figure 3: The construction of RNN-TPP for the case of the vanilla Hawkes process.

5.3 Linear Hawkes Case

Now we consider the general linear Hawkes process, i.e., (9) in section 2.3. Motivated by the approximation construction for the vanilla Hawkes process, we want to find a decomposition of the general $\mu$ in which each term has the “Markov property”, so that we can construct the corresponding RNN structure. Precisely, for $\mu\in C^{k,\infty}([0,T],C_{0})$, $k\geq 2$, $k\in\mathbb{N}$, we can decompose $\mu$ into two parts,

\mu(t)=\underbrace{\tilde{\mu}(t)}_{\text{part}_{1}}+\underbrace{\sum_{j=1}^{k}\alpha_{j}\exp(-\beta_{j}t)}_{\text{part}_{2}},~~t\in[0,T],

where $\tilde{\mu}$ satisfies the boundary condition $\tilde{\mu}^{(j)}(0+)=\tilde{\mu}^{(j)}(T-)$, $0\leq j\leq k-1$, $j\in\mathbb{N}$, and $\beta_{j}=j/k$, $j\in[k]$. The term $\sum_{j=1}^{k}\alpha_{j}\exp(-\beta_{j}t)$ can be handled similarly to the vanilla Hawkes process. For $\tilde{\mu}$, we consider its Fourier expansion,

\tilde{\mu}(t)=\frac{\hat{\mu}_{0}}{2}+\sum_{l=1}^{\infty}\left(\hat{\mu}_{l}\cos\Big{(}\frac{2l\pi}{T}t\Big{)}+\hat{\nu}_{l}\sin\Big{(}\frac{2l\pi}{T}t\Big{)}\right).

Thanks to the boundary condition, $\tilde{\mu}(t)$ can be well approximated by a finite sum of its Fourier series. Then we can use the “Markov property” of the trigonometric function pairs $\cos(2l\pi t/T)$ and $\sin(2l\pi t/T)$ to construct the RNN-TPP. The construction is similar to that for the exponential function but requires more involved calculations. Combining all the approximation parts, we can get the approximation theorem for (9). The above ideas are visualized in Figure 4.
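To see the “Markov property” of the trigonometric pairs, note that the history sums $C_{l}(t)=\sum_{t_{j}<t}\cos(2l\pi(t-t_{j})/T)$ and $S_{l}(t)=\sum_{t_{j}<t}\sin(2l\pi(t-t_{j})/T)$ can be propagated between events by a rotation (angle-addition formulas), analogous to the exponential case. The small numerical check below is our own sketch with arbitrary values of $T$, $l$, and the event times.

```python
import numpy as np

T, l = 3.0, 2
w = 2 * np.pi * l / T
events = np.array([0.4, 1.1, 2.3])

def trig_sums_direct(t):
    past = events[events < t]
    return np.sum(np.cos(w * (t - past))), np.sum(np.sin(w * (t - past)))

def trig_sums_markov(t):
    # propagate (C, S) between events by a rotation of angle w*dt,
    # then add the new event's own contribution (cos 0, sin 0) = (1, 0)
    C, S, t_prev = 0.0, 0.0, 0.0
    for tk in events[events < t]:
        dt = tk - t_prev
        C, S = C * np.cos(w * dt) - S * np.sin(w * dt) + 1.0, C * np.sin(w * dt) + S * np.cos(w * dt)
        t_prev = tk
    dt = t - t_prev
    return C * np.cos(w * dt) - S * np.sin(w * dt), C * np.sin(w * dt) + S * np.cos(w * dt)

print(trig_sums_direct(2.5), trig_sums_markov(2.5))   # the two pairs coincide
```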

Figure 4: The construction of RNN-TPP for the case of general linear Hawkes processes.
Theorem 6.

(Approximation for linear Hawkes process) Under model (9), assumptions (A1)-(A3), and μCk,([0,T],C0)\mu\in C^{k,\infty}([0,T],C_{0}), k2k\geq 2, kk\in\mathbb{N}, for N5N\geq 5, NN\in\mathbb{N}, there exists an RNN-TPP λ^N,Nμ\hat{\lambda}^{N,N_{\mu}} as stated in section 2.2 with L=2L=2, lf=B1l_{f}=B_{1}, uf=B0+O(logN)u_{f}=B_{0}+O(\log N), and input function x(t;S)=(t,tFS(t))x(t;S)=(t,t-F_{S}(t))^{\top} such that

|𝔼[loss(λ^N,Nμ,Stest)]𝔼[loss(λ,Stest)]|(logN)2Ns+logNNμk1.\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N,N_{\mu}},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{(\log N)^{2}}{N^{s}}+\frac{\log N}{N_{\mu}^{k-1}}. (25)

Moreover, the width of λ^N,Nμ\hat{\lambda}^{N,N_{\mu}} satisfies D=O(N+Nμ5(logN)4)D=O(N+N_{\mu}^{5}(\log N)^{4}) and the weights of λ^N,Nμ\hat{\lambda}^{N,N_{\mu}} are less than

𝒞1(log(NNμ))12s2(log(NNμ))2,\displaystyle\mathcal{C}_{1}(\log(NN_{\mu}))^{12s^{2}(\log(NN_{\mu}))^{2}}~{},

where 𝒞1\mathcal{C}_{1} is a constant related to s,k,B0,C0,cμs,k,B_{0},C_{0},c_{\mu}, and TT.

We now make a few comments on Theorem 6. There are two tuning parameters in \hat{\lambda}^{N,N_{\mu}}: N controls the approximation error of \lambda_{0}, of \sum_{j=1}^{k}\alpha_{j}\exp(-\beta_{j}t), and of the finite partial sum of the Fourier series, while N_{\mu} controls the number of Fourier terms entering the RNN-TPP. The term (\log N)^{2}/N^{s} is obtained similarly to the vanilla Hawkes process case, and the term \log N/N_{\mu}^{k-1} is the error caused by truncating the Fourier series to its first N_{\mu} terms. Moreover, the O(N_{\mu}^{5}(\log N)^{4}) term in the width of the RNN-TPP comes from the approximation construction of the first N_{\mu} Fourier terms. Finally, combining with Theorem 3, we obtain part (iii) of Theorem 1.
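
As a simple illustration of how the two tuning parameters interact (this balancing is not part of the theorem statement), equating the two error terms in (25) up to logarithmic factors suggests taking

N_{\mu}^{k-1}\asymp\frac{N^{s}}{\log N},\qquad\text{i.e.,}\qquad N_{\mu}\asymp\Big{(}\frac{N^{s}}{\log N}\Big{)}^{\frac{1}{k-1}},

in which case the right-hand side of (25) is of order (\log N)^{2}/N^{s}, at the price of a width of order N_{\mu}^{5}(\log N)^{4}\asymp N^{5s/(k-1)}(\log N)^{4-5/(k-1)}.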

5.4 Nonlinear Hawkes Case

Finally, we consider the nonlinear Hawkes process defined in (10) in section 2.3. To simplify the statement, we only consider the case \mu(t)=\alpha\exp(-\beta t); the results for general \mu can be obtained similarly. Compared to the vanilla Hawkes case, the additional challenge here is the presence of the nonlinear function \Phi. With two additional layers, we can approximate \Phi well. Together with the results for the vanilla Hawkes process, we obtain the desired RNN-TPP architecture. For clarity, we also provide a graphical illustration in Figure 5.

Theorem 7.

(Approximation for nonlinear Hawkes process) Under model (15), assumptions (A1) and (A4), for Nmax{5,(2𝒞B0Ts+1)1s}N\geq\max\{5,(2\mathcal{C}B_{0}T^{s}+1)^{\frac{1}{s}}\} with 𝒞=2s5s/(s1)!\mathcal{C}=\sqrt{2s}5^{s}/(s-1)!, there exists an RNN-TPP λ^N\hat{\lambda}^{N} as stated in section 2.2 with L=4L=4, lf=B~1l_{f}=\tilde{B}_{1}, uf=B~0u_{f}=\tilde{B}_{0}, and input function x(t;S)=(t,tFS(t))x(t;S)=(t,t-F_{S}(t))^{\top} such that

|𝔼[loss(λ^N,Stest)]𝔼[loss(λ,Stest)]|logNN.\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{\log N}{N}. (26)

Moreover, the width of λ^N\hat{\lambda}^{N} satisfies D=O(N)D=O(N) and the weights of λ^N\hat{\lambda}^{N} are less than

𝒞1(logN)12s2(logN)2,\displaystyle\mathcal{C}_{1}(\log N)^{12s^{2}(\log N)^{2}}~{},

where 𝒞1\mathcal{C}_{1} is a constant related to s,B~0,α,β,Ts,\tilde{B}_{0},\alpha,\beta,T, and LL.

Since \Phi is only assumed to be Lipschitz continuous, we can only obtain an \tilde{O}(N^{-1}) approximation error. The rate can be improved if \Phi has higher-order smoothness. Again, combining with Theorem 3, we arrive at Theorem 2.

Refer to caption
Figure 5: The construction of RNN-TPP for the case of nonlinear Hawkes process.
Remark 10.

The universal approximation properties of one-layer RNNs are studied in Schäfer and Zimmermann (2007). Our results differ from theirs in the following sense. (i) The RNN-TPP is defined over the continuous time domain [0,T], while the standard RNN only considers discrete time points; in other words, our approximation results hold uniformly over all t\in[0,T]. (ii) Schäfer and Zimmermann (2007) do not give explicit formulas for the widths of the hidden layers or the parameter weights in their construction of the RNN approximator, so their results cannot be directly used to compute the approximation error.

6 Usefulness of Interpolation of Hidden States

As mentioned in Remark 3, the RNN-TPP needs to account for any continuous time point t between observed time grids t_{j} and t_{j+1}. The interpolation of the hidden state h^{(l)}(t;S) between h_{j}^{(l)} and h_{j+1}^{(l)} is therefore essential in the construction of RNN-TPPs.

In this section, we give a counter-example illustrating that an RNN-TPP with a linear interpolation of hidden states is unable to precisely capture the true intensity in terms of the excess risk (4). For simplicity, we only consider the single-layer RNN-TPP; the argument is the same for multi-layer RNN-TPPs.

We consider a (single-layer) RNN-TPP which admits the following model structure,

hj\displaystyle h_{j} =σ(Wxx(tj;S)+Whhj1),\displaystyle=\sigma(W_{x}x(t_{j};S)+W_{h}h_{j-1}),
λ^ne(t)\displaystyle\hat{\lambda}_{ne}(t) =f(α(ttj)+Wyhj+b),t(tj,tj+1],\displaystyle=f(\alpha(t-t_{j})+W_{y}h_{j}+b)\in\mathbb{R},~{}~{}t\in(t_{j},t_{j+1}], (27)

where x(t_{j};S) is the embedding for the j-th event, h_{0}=\mathbf{0}, \sigma(x)=\tanh(x), f(x)=(x\vee l_{f})\wedge u_{f}, and l_{f} and u_{f} will be determined from the true intensity. If we take \alpha=0, the model reduces to one with a constant hidden-state interpolation mechanism, as in Du et al. (2016); in other words, h^{(1)}(t;S)\equiv h_{j} for all t satisfying t_{j}\leq t<t_{j+1} when \alpha=0.
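
For concreteness, the following minimal Python sketch (with arbitrary placeholder weights; only the structure matters) evaluates the interpolated intensity in (27). Between two events the pre-activation is affine in t, so after the clamp f the intensity on each inter-event interval is piecewise linear in t, which is exactly what Theorem 8 below exploits.

import numpy as np

# Minimal sketch of the single-layer RNN-TPP with linear hidden-state interpolation, cf. (27).
# All weights below are arbitrary placeholders, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
d_h = 4
W_x, W_h = rng.normal(size=(d_h, 1)), rng.normal(size=(d_h, d_h))
W_y, b = rng.normal(size=(1, d_h)), 0.5
alpha_slope = 0.3                    # the scalar alpha in (27)
l_f, u_f = 1.0, 4.0                  # clamp levels, to be set from the true intensity

def intensity(t, events):
    # lambda_hat_ne(t) for t in (t_j, t_{j+1}]; here x(t_j; S) is simply taken to be t_j.
    h = np.zeros((d_h, 1))
    t_j = 0.0
    for t_e in events:
        if t_e >= t:
            break
        h = np.tanh(W_x * t_e + W_h @ h)                  # hidden-state update at event t_e
        t_j = t_e
    pre = alpha_slope * (t - t_j) + (W_y @ h).item() + b  # affine in t on (t_j, t_{j+1}]
    return float(np.clip(pre, l_f, u_f))                  # f(x) = (x v l_f) ^ u_f

print(intensity(1.7, events=[0.4, 1.1, 2.3]))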

Theorem 8.

Suppose the true model intensity on [0,T][0,T] has the following form,

λ(t)\displaystyle\lambda^{\ast}(t) =\displaystyle= {T,t[0,T/3]9Tt2,t(T/3,2T/3)4T,t[2T/3,T].\displaystyle\left\{\begin{aligned} &T&,~{}&t\in[0,T/3]\\ &\frac{9}{T}t^{2}&,~{}&t\in(T/3,2T/3)\\ &4T&,~{}&t\in[2T/3,T]\end{aligned}\right.\quad.

Hence we can take lf=Tl_{f}=T and uf=4Tu_{f}=4T, and then there exists a constant C>0C>0 such that

\displaystyle\min_{\hat{\lambda}_{ne}\text{ as }(27)}\mathbb{E}[\text{loss}(\hat{\lambda}_{ne},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\geq C>0. (28)

Theorem 8 tells us that RNN-TPPs with an improper hidden-state interpolation may fail to offer a good approximation, even under a very simple non-homogeneous Poisson model. Therefore, the user-specified input embedding function x(t;S) plays an important role in interpolating the hidden states; it should be carefully chosen so that x(t;S) summarizes the information of the past event history to a sufficient extent.

Remark 11.

One can substitute the linear interpolation mechanism in (27) with the exponentially decaying mechanism given in Mei and Eisner (2017); Theorem 8 still holds.

Remark 12.

For other types of f (e.g., Softplus) in the output layer, the failure of the linear interpolation mechanism can be shown similarly.

7 Discussion

In this paper, we give a positive answer to the question of whether RNN-TPPs can provably have small excess risks when estimating well-known TPPs. We establish excess risk bounds under the homogeneous Poisson process, non-homogeneous Poisson process, self-exciting process, and self-correcting process frameworks. Our analysis consists of two parts, the stochastic error and the approximation error. For the stochastic error, we use a novel truncation technique to decouple the randomness and make the classical empirical process theory applicable; we also carefully compute the Lipschitz constant of multi-layer RNNs, which is a useful intermediate result for future RNN-related work. For the approximation error, we construct a series of RNNs to approximate the intensities of different TPPs, providing explicit network depths, widths, and parameter weights. To the best of our knowledge, our work is the first to study the approximation ability of multi-layer RNNs over a continuous time domain. We believe the results in the current work add value to both the learning theory and neural network fields.

There are several possible extensions along the research line of neural network-based TPPs. First, it is not clear whether the approximation rate can be improved by a more refined RNN construction (with possibly fewer layers and smaller width) or by other approaches. Second, we only consider the “large n” setting, where the event sequences are observed on a bounded time domain [0,T] with n repeated samples; it is interesting to extend our results to the “large T” setting, where the end time T goes to infinity but the number of event sequences n remains fixed. Third, in the current work we do not take into account different event types, and it may be useful to extend our results to marked TPP settings. Moreover, it is also worth investigating the theoretical performance of other neural network architectures (e.g., Transformer-TPPs) that have performed well in recent empirical applications.

Supplementary Material for "On Non-asymptotic Theory of Recurrent Neural Networks in Temporal Point Processes"

Additional Notations in the Supplementary: For two random variables X and Y, we write X\leq_{s.t.}Y if \mathbb{P}(X>t)\leq\mathbb{P}(Y>t) for any t\in\mathbb{R}. We use \mathbb{N}_{+} to denote the set of positive integers.

8 Proofs in Sections 3 and 4

8.1 Proof of Lemma 1

By the definition of λˇ\check{\lambda}^{\ast} and λ^\hat{\lambda}, we have

𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]]\displaystyle\quad\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]]
=𝔼[loss(λ^,Stest)]𝔼[loss(λˇ,Stest)]+𝔼[loss(λˇ,Stest)]𝔼[loss(λ,Stest)]\displaystyle=\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]+\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]
𝔼[loss(λ^,Stest)]1ni[n]loss(λ^,Si)+1ni[n]loss(λˇ,Si)0𝔼[loss(λˇ,Stest)]\displaystyle\leq\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]\underbrace{-\frac{1}{n}\sum_{i\in[n]}\text{loss}(\hat{\lambda},S_{i})+\frac{1}{n}\sum_{i\in[n]}\text{loss}(\check{\lambda}^{\ast},S_{i})}_{\geq 0}-\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]
+𝔼[loss(λˇ,Stest)]𝔼[loss(λ,Stest)]\displaystyle\quad+\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]
2supλ|𝔼[loss(λ,Stest)]1ni[n]loss(λ,Si)|\displaystyle\leq 2\sup_{\lambda\in\mathcal{F}}\Big{|}\mathbb{E}[\text{loss}(\lambda,S_{test})]-\frac{1}{n}\sum_{i\in[n]}\text{loss}(\lambda,S_{i})\Big{|}
+𝔼[loss(λˇ,Stest)]𝔼[loss(λ,Stest)].\displaystyle\quad+\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})].

8.2 Proof of Lemma 2

From model assumptions (A1) and (A2), we have \lambda^{*}(t)=\lambda_{0}(t)+\sum_{j:t_{j}<t}\mu(t-t_{j}), \int_{0}^{T}\mu(t)\mathrm{d}t\leq c_{\mu}<1, and \lambda_{0}(t)\leq B_{0}. Following the notation in the paper, we denote by N_{e} the number of events of \lambda^{*} in [0,T]. Consider another intensity \overline{\lambda}(t)=B_{0}+\sum_{j:t_{j}<t}\mu(t-t_{j}) and similarly denote by \overline{N}_{e} the number of events of \overline{\lambda} in [0,T]. Then for any fixed event sequence S=\{t_{j}\}, \lambda^{*}(t;S)\leq\overline{\lambda}(t;S), and thus N_{e}\leq_{s.t.}\overline{N}_{e}. By a formulation similar to that in Daley et al. (2003), the point process with intensity \overline{\lambda} is equivalent to a birth-immigration process with immigration intensity B_{0} and birth intensity \mu(t). Hence

N¯e=N¯0+i=1N¯i,\displaystyle\overline{N}_{e}=\overline{N}_{0}+\sum_{i=1}^{\infty}\overline{N}_{i},

where \overline{N}_{0}\sim\operatorname{Poisson}(B_{0}T) and \overline{N}_{k} is the number of events in generation k, i.e., the children of the events in generation k-1.
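
As a sanity check on this branching (birth-immigration) representation, the following Python sketch (a simulation aid only, not used in the proof; kernel and horizon values are hypothetical) draws the immigrants from a Poisson process with rate B_{0} on [0,T] and lets every event generate children on (t,T] with rate \mu(\cdot-t), so that the total count has the same law as \overline{N}_{e}.

import numpy as np

# Illustrative simulation of the birth-immigration (cluster) representation of N_bar_e.
rng = np.random.default_rng(1)
B0, T = 1.0, 2.0
alpha, beta = 0.5, 2.0                        # mu(t) = alpha * exp(-beta * t), so c_mu <= alpha / beta < 1

def children(t_parent):
    # Offspring of an event at t_parent: inhomogeneous Poisson with rate mu(t - t_parent) on (t_parent, T],
    # simulated by thinning a homogeneous Poisson process with dominating rate alpha.
    kids, t = [], t_parent
    while True:
        t += rng.exponential(1.0 / alpha)
        if t > T:
            return kids
        if rng.uniform() < np.exp(-beta * (t - t_parent)):
            kids.append(t)

def total_count():
    gen = list(rng.uniform(0, T, rng.poisson(B0 * T)))    # generation 0: immigrants, N_bar_0 ~ Poisson(B0*T)
    n = len(gen)
    while gen:                                            # generations 1, 2, ... until extinction
        gen = [c for parent in gen for c in children(parent)]
        n += len(gen)
    return n

counts = np.array([total_count() for _ in range(2000)])
print(counts.mean(), (counts >= 8).mean())                # empirical mean and a tail frequency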

For t1<t2t_{1}<t_{2}, let μt1t2=t1t2μ(tt1)dt\mu_{t_{1}}^{t_{2}}=\int_{t_{1}}^{t_{2}}\mu(t-t_{1})\mathrm{d}t. We have

𝔼[exp(sN¯0)]=exp(B0T(exp(s)1)),\displaystyle\mathbb{E}\left[\exp\left(s\overline{N}_{0}\right)\right]=\exp\left(B_{0}T\left(\exp(s)-1\right)\right),

and

𝔼[exp(sN¯k+1)]\displaystyle\mathbb{E}\left[\exp\left(s\overline{N}_{k+1}\right)\right] =𝔼[𝔼[exp(sN¯k+1)|{tj(k)}j=1N¯k]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\exp\left(s\overline{N}_{k+1}\right)\Big{|}\left\{t_{j}^{(k)}\right\}_{j=1}^{\overline{N}_{k}}\right]\right]
=𝔼[j=1N¯kexp(μtj(k)T(exp(s)1))]\displaystyle=\mathbb{E}\left[\prod_{j=1}^{\overline{N}_{k}}\exp\left(\mu_{t_{j}^{(k)}}^{T}\left(\exp(s)-1\right)\right)\right]
𝔼[exp(cμN¯k(exp(s)1))],\displaystyle\leq\mathbb{E}\left[\exp\left(c_{\mu}\overline{N}_{k}\left(\exp(s)-1\right)\right)\right],

for any s>0s>0. Since cμ<1c_{\mu}<1, for any fixed c1(cμ,1]c_{1}\in(c_{\mu},1] and any s(0,log(c1/cμ)]s\in(0,\log(c_{1}/c_{\mu})], we have

𝔼[exp(sN¯k)]𝔼[exp(cμN¯k1(exp(s)1))]𝔼[exp(c1sN¯k1)]𝔼[exp(c1ksN¯0)],\displaystyle\mathbb{E}\left[\exp\left(s\overline{N}_{k}\right)\right]\leq\mathbb{E}\left[\exp\left(c_{\mu}\overline{N}_{k-1}\left(\exp(s)-1\right)\right)\right]\leq{\mathbb{E}}\left[\exp\left(c_{1}s\overline{N}_{k-1}\right)\right]\leq\cdots\leq{\mathbb{E}}\left[\exp\left(c_{1}^{k}s\overline{N}_{0}\right)\right],

i.e.

𝔼[exp(sN¯k)]𝔼[exp(c1ksN¯0)]=exp(B0T(exp(c1ks)1))exp(c1k+1cμ(B0T)s)\displaystyle\mathbb{E}\left[\exp\left(s\overline{N}_{k}\right)\right]\leq\mathbb{E}\left[\exp\left(c_{1}^{k}s\overline{N}_{0}\right)\right]=\exp\left(B_{0}T\left(\exp\left(c_{1}^{k}s\right)-1\right)\right)\leq\exp\left(\frac{c_{1}^{k+1}}{c_{\mu}}(B_{0}T)s\right)

for any kk\in\mathbb{N}.

Since \overline{N}_{k} can only take integer values, we have \mathbb{P}(\overline{N}_{k}=0)+e^{s}\mathbb{P}(\overline{N}_{k}\neq 0)\leq\mathbb{E}\left[\exp(s\overline{N}_{k})\right]. Thus

(N¯k0)𝔼[exp(sN¯k)]1exp(s)1c1k+2cμ2(B0T),s(0,min{cμc1k+1(B0T),1}log(c1cμ)].\displaystyle\mathbb{P}\left(\overline{N}_{k}\neq 0\right)\leq\frac{\mathbb{E}\left[\exp\left(s\overline{N}_{k}\right)\right]-1}{\exp(s)-1}\leq\frac{c_{1}^{k+2}}{c_{\mu}^{2}}(B_{0}T),~{}\forall s\in\left(0,\min\left\{\frac{c_{\mu}}{c_{1}^{k+1}(B_{0}T)},1\right\}\log\left(\frac{c_{1}}{c_{\mu}}\right)\right].

Setting c1cμc_{1}\searrow c_{\mu}, we get

(N¯k0)cμk(B0T).\displaystyle\mathbb{P}\left(\overline{N}_{k}\neq 0\right)\leq c_{\mu}^{k}(B_{0}T).

Now take c_{1}\in(c_{\mu},1), so that c_{1}^{-1}(1-c_{1})\sum_{k=1}^{\infty}c_{1}^{k}=1. By Boole's inequality, we have

(k=0N¯kN)\displaystyle\mathbb{P}\left(\sum_{k=0}^{\infty}\overline{N}_{k}\geq N\right) k=0(N¯k1c1c1c1k+1N)\displaystyle\leq\sum_{k=0}^{\infty}\mathbb{P}\left(\overline{N}_{k}\geq\frac{1-c_{1}}{c_{1}}c_{1}^{k+1}N\right)
k=0K01(N¯k1c1c1c1k+1N)+k=K0(N¯k0).\displaystyle\leq\sum_{k=0}^{K_{0}-1}\mathbb{P}\left(\overline{N}_{k}\geq\frac{1-c_{1}}{c_{1}}c_{1}^{k+1}N\right)+\sum_{k=K_{0}}^{\infty}\mathbb{P}\left(\overline{N}_{k}\neq 0\right).

For the second term, \sum_{k=K_{0}}^{\infty}\mathbb{P}\left(\overline{N}_{k}\neq 0\right)\leq\sum_{k=K_{0}}^{\infty}c_{\mu}^{k}(B_{0}T)=c_{\mu}^{K_{0}}(B_{0}T)/(1-c_{\mu}). We require c_{\mu}^{K_{0}}(B_{0}T)/(1-c_{\mu})\leq\delta/(2n), for which it suffices that

K0log(2nB0T/[δ(1cμ)])log(1/cμ).\displaystyle K_{0}\geq\frac{\log\left(2nB_{0}T/[\delta(1-c_{\mu})]\right)}{\log\left(1/c_{\mu}\right)}.

For the first term, we have

k=0K01(N¯k1c1c1c1k+1N)\displaystyle\sum_{k=0}^{K_{0}-1}\mathbb{P}\left(\overline{N}_{k}\geq\frac{1-c_{1}}{c_{1}}c_{1}^{k+1}N\right) k=0K01exp(s(1c1c1c1k+1N))𝔼[exp(sN¯k)]\displaystyle\leq\sum_{k=0}^{K_{0}-1}\exp\left(-s\left(\frac{1-c_{1}}{c_{1}}c_{1}^{k+1}N\right)\right)\mathbb{E}\left[\exp(s\overline{N}_{k})\right]
k=0K01exp(c1k+1s(B0Tcμ1c1c1N)),\displaystyle\leq\sum_{k=0}^{K_{0}-1}\exp\left(c_{1}^{k+1}s\left(\frac{B_{0}T}{c_{\mu}}-\frac{1-c_{1}}{c_{1}}N\right)\right),

where s(0,log(c1/cμ)]s\in(0,\log(c_{1}/c_{\mu})]. We can take c1s(B0T/cμ(1c1)N/c1)log(δ/(2nK0))c_{1}s\left(B_{0}T/c_{\mu}-(1-c_{1})N/c_{1}\right)\leq\log(\delta/(2nK_{0})) so that k=0K01exp(c1k+1s(B0T/cμ(1c1)N/c1))δ/(2n)\sum_{k=0}^{K_{0}-1}\exp\left(c_{1}^{k+1}s\left(B_{0}T/c_{\mu}-(1-c_{1})N/c_{1}\right)\right)\leq\delta/(2n). Then

N11c1(1slog(2nK0δ)+c1cμ(B0T)).\displaystyle N\geq\frac{1}{1-c_{1}}\left(\frac{1}{s}\log\left(\frac{2nK_{0}}{\delta}\right)+\frac{c_{1}}{c_{\mu}}(B_{0}T)\right).

Now let η=c1/cμ(1,1/cμ)\eta=c_{1}/c_{\mu}\in(1,1/c_{\mu}), s=log(c1/cμ)=log(η)s=\log(c_{1}/c_{\mu})=\log(\eta), and N[log(2nK0/δ)/log(η)+η(B0T)]/(1cμη)N\geq\left[\log\left(2nK_{0}/\delta\right)/\log(\eta)+\eta(B_{0}T)\right]/(1-c_{\mu}\eta). Taking K0=log(2nB0T/[δ(1cμ)])/log(1/cμ)K_{0}=\lceil\log\left(2nB_{0}T/[\delta(1-c_{\mu})]\right)/\log\left(1/c_{\mu}\right)\rceil and
N=[log(2nK0/δ)/log(η)+η(B0T)]/(1cμη)N=\left[\log\left(2nK_{0}/\delta\right)/\log(\eta)+\eta(B_{0}T)\right]/(1-c_{\mu}\eta), we have (NeN)(N¯eN)δ/n\mathbb{P}\left(N_{e}\geq N\right)\leq\mathbb{P}\left(\overline{N}_{e}\geq N\right)\leq\delta/n. Since

(Ne(n)N)=1(Ne(n)<N)=1i=1n(Ne<N)1(1δn)nδ,\displaystyle\mathbb{P}(N_{e(n)}\geq N)=1-\mathbb{P}(N_{e(n)}<N)=1-\prod_{i=1}^{n}\mathbb{P}\left(N_{e}<N\right)\leq 1-\left(1-\frac{\delta}{n}\right)^{n}\leq\delta,

we get that with probability at least 1δ1-\delta,

Ne(n)<N11cμη[1log(η)log(2nKn,δδ)+η(B0T)],\displaystyle N_{e(n)}<N\leq\frac{1}{1-c_{\mu}\eta}\left[\frac{1}{\log(\eta)}\log\left(\frac{2nK_{n,\delta}}{\delta}\right)+\eta(B_{0}T)\right],

where η(1,1/cμ)\eta\in(1,1/c_{\mu}), and Kn,δ=log(2nB0T/δ(1cμ))/log(1/cμ)+1K_{n,\delta}=\log\left(2nB_{0}T/\delta(1-c_{\mu})\right)/\log\left(1/c_{\mu}\right)+1. Since 11/xlog(x)x11-1/x\leq\log(x)\leq x-1, we have Kn,δ2nB0T/[δ(1cμ)2]K_{n,\delta}\leq 2nB_{0}T/[\delta(1-c_{\mu})^{2}]. Thus with probability at least 1δ1-\delta,

Ne(n)<11cμη[2log(η)log(2nB0Tδ(1cμ))+η(B0T)].\displaystyle N_{e(n)}<\frac{1}{1-c_{\mu}\eta}\left[\frac{2}{\log(\eta)}\log\left(\frac{2n\sqrt{B_{0}T}}{\delta(1-c_{\mu})}\right)+\eta(B_{0}T)\right].

Taking n=1 and \left[2\log\left(2\sqrt{B_{0}T}/[\delta(1-c_{\mu})]\right)/\log(\eta)+\eta(B_{0}T)\right]/(1-c_{\mu}\eta)=s, we have \delta=2\sqrt{B_{0}T}\exp\left(\log(\eta)\left[\eta(B_{0}T)-(1-c_{\mu}\eta)s\right]/2\right)/(1-c_{\mu}). Then

(Ne=s)(Nes)2B0T1cμexp(log(η)2[η(B0T)(1cμη)s]).\displaystyle\mathbb{P}\left(N_{e}=s\right)\leq\mathbb{P}\left(N_{e}\geq s\right)\leq\frac{2\sqrt{B_{0}T}}{1-c_{\mu}}\exp\left(\frac{\log(\eta)}{2}\left[\eta(B_{0}T)-(1-c_{\mu}\eta)s\right]\right).

8.3 Proof of Lemma 4

The proof is based on induction. Using the same notation, we first establish two claims.

Claim 1.

For any 1\leq l\leq L and 1\leq i\leq N, \|h_{i,1}^{(l)}-h_{i,2}^{(l)}\|_{2} is bounded by

hi,1(l)hi,2(l)2ρσ(r=0l1γrSi1rΔblr+BσDr=0l2γrSi1rΔxlr+Bin(T)γl1Si1l1Δx1+BσDr=0l1γrSi2rΔhlr).\displaystyle\left\|h_{i,1}^{(l)}-h_{i,2}^{(l)}\right\|_{2}\leq\rho_{\sigma}\left(\sum_{r=0}^{l-1}\gamma^{r}S_{i-1}^{r}\Delta_{b}^{l-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l-2}\gamma^{r}S_{i-1}^{r}\Delta_{x}^{l-r}+B_{in}(T)\gamma^{l-1}S_{i-1}^{l-1}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l-1}\gamma^{r}S_{i-2}^{r}\Delta_{h}^{l-r}\right). (29)
Proof of Claim 1.

When i=1i=1, we have

h1,1(l)h1,2(l)2\displaystyle\left\|h_{1,1}^{(l)}-h_{1,2}^{(l)}\right\|_{2} =σ(Wx,1(l)h1(l1)+b1(l))σ(Wx,2(l)h2(l1)+b2(l))2\displaystyle=\left\|\sigma\left(W_{x,1}^{(l)}h_{1}^{(l-1)}+b_{1}^{(l)}\right)-\sigma\left(W_{x,2}^{(l)}h_{2}^{(l-1)}+b_{2}^{(l)}\right)\right\|_{2}
ρσ(Wx,1(l)h1(l1)Wx,2(l)h2(l1)2+b1(l)b2(l)2)\displaystyle\leq\rho_{\sigma}\left(\left\|W_{x,1}^{(l)}h_{1}^{(l-1)}-W_{x,2}^{(l)}h_{2}^{(l-1)}\right\|_{2}+\left\|b_{1}^{(l)}-b_{2}^{(l)}\right\|_{2}\right)
ρσ(BσDΔxl+Bxh1,1(l1)h1,2(l1)2+Δbl).\displaystyle\leq\rho_{\sigma}\left(B_{\sigma}\sqrt{D}\Delta_{x}^{l}+B_{x}\left\|h_{1,1}^{(l-1)}-h_{1,2}^{(l-1)}\right\|_{2}+\Delta_{b}^{l}\right).

Repeating this derivation recursively, we get

h1,1(l)h1,2(l)2\displaystyle\left\|h_{1,1}^{(l)}-h_{1,2}^{(l)}\right\|_{2} ρσ(BσDΔxl+Bxh1,1(l1)h1,2(l1)2+Δbl)\displaystyle\leq\rho_{\sigma}\left(B_{\sigma}\sqrt{D}\Delta_{x}^{l}+B_{x}\left\|h_{1,1}^{(l-1)}-h_{1,2}^{(l-1)}\right\|_{2}+\Delta_{b}^{l}\right)
ρσΔbl+ρσBσDΔxl+γ(ρσΔbl+ρσBσDΔxl+γh1,1(l2)h1,2(l2)2)\displaystyle\leq\rho_{\sigma}\Delta_{b}^{l}+\rho_{\sigma}B_{\sigma}\sqrt{D}\Delta_{x}^{l}+\gamma\left(\rho_{\sigma}\Delta_{b}^{l}+\rho_{\sigma}B_{\sigma}\sqrt{D}\Delta_{x}^{l}+\gamma\left\|h_{1,1}^{(l-2)}-h_{1,2}^{(l-2)}\right\|_{2}\right)
\displaystyle\leq\cdots\cdots
ρσ(r=0l1γrΔblr+BσDr=0l2γrΔxlr+Bin(T)γl1Δx1).\displaystyle\leq\rho_{\sigma}\left(\sum_{r=0}^{l-1}\gamma^{r}\Delta_{b}^{l-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l-2}\gamma^{r}\Delta_{x}^{l-r}+B_{in}(T)\gamma^{l-1}\Delta_{x}^{1}\right).

When l=1l=1, we have

hi,1(1)hi,2(1)2\displaystyle\left\|h_{i,1}^{(1)}-h_{i,2}^{(1)}\right\|_{2} =σ(Wx,1(1)x(ti;ti1)+Wh,1(1)hi1,1(1)+b1(1))σ(Wx,2(1)x(ti;ti1)+Wh,2(1)hi1,2(1)+b2(1))2\displaystyle=\left\|\sigma\left(W_{x,1}^{(1)}x(t_{i};t_{i-1})+W_{h,1}^{(1)}h_{i-1,1}^{(1)}+b_{1}^{(1)}\right)-\sigma\left(W_{x,2}^{(1)}x(t_{i};t_{i-1})+W_{h,2}^{(1)}h_{i-1,2}^{(1)}+b_{2}^{(1)}\right)\right\|_{2}
ρσ(Bin(T)Wx,1(1)Wx,2(1)2+Wh,1(1)hi1,1(1)Wh,2(1)hi1,2(1)2+b1(1)b2(1)2)\displaystyle\leq\rho_{\sigma}\left(B_{in}(T)\left\|W_{x,1}^{(1)}-W_{x,2}^{(1)}\right\|_{2}+\left\|W_{h,1}^{(1)}h_{i-1,1}^{(1)}-W_{h,2}^{(1)}h_{i-1,2}^{(1)}\right\|_{2}+\left\|b_{1}^{(1)}-b_{2}^{(1)}\right\|_{2}\right)
ρσ(Bin(T)Δx1+BσDΔh1+Bhhi1,1(1)hi1,2(1)2+Δb1).\displaystyle\leq\rho_{\sigma}\left(B_{in}(T)\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\Delta_{h}^{1}+B_{h}\left\|h_{i-1,1}^{(1)}-h_{i-1,2}^{(1)}\right\|_{2}+\Delta_{b}^{1}\right).

Repeating it recursively again, we get

hi,1(1)hi,2(1)2\displaystyle\quad\left\|h_{i,1}^{(1)}-h_{i,2}^{(1)}\right\|_{2}
ρσ(Bin(T)Δx1+BσDΔh1+Bhhi1,1(1)hi1,2(1)2+Δb1)\displaystyle\leq\rho_{\sigma}\left(B_{in}(T)\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\Delta_{h}^{1}+B_{h}\left\|h_{i-1,1}^{(1)}-h_{i-1,2}^{(1)}\right\|_{2}+\Delta_{b}^{1}\right)
ρσ(Bin(T)Δx1+BσDΔh1+Δb1)+β(ρσ(Bin(T)Δx1+BσDΔh1+Δb1)+βhi2,1(1)hi2,2(1)2)\displaystyle\leq\rho_{\sigma}\left(B_{in}(T)\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\Delta_{h}^{1}+\Delta_{b}^{1}\right)+\beta\left(\rho_{\sigma}\left(B_{in}(T)\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\Delta_{h}^{1}+\Delta_{b}^{1}\right)+\beta\left\|h_{i-2,1}^{(1)}-h_{i-2,2}^{(1)}\right\|_{2}\right)
\displaystyle\leq\cdots\cdots
ρσ(Si10Δb1+Bin(T)Si10Δx1+BσDSi20Δh1).\displaystyle\leq\rho_{\sigma}\left(S_{i-1}^{0}\Delta_{b}^{1}+B_{in}(T)S_{i-1}^{0}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}S_{i-2}^{0}\Delta_{h}^{1}\right).

Now suppose that (29) holds for all i<i_{0} and l<l_{0}. Considering the case i=i_{0}, l=l_{0}, we have

hi0,1(l0)hi0,2(l0)2\displaystyle\quad\left\|h_{i_{0},1}^{(l_{0})}-h_{i_{0},2}^{(l_{0})}\right\|_{2}
=σ(Wx,1(l0)hi0,1(l01)+Wh,1(l0)hi01,1(l0)+b1(l0))σ(Wx,2(l0)hi0,2(l01)+Wh,2(l0)hi01,2(l0)+b2(l0))2\displaystyle=\left\|\sigma\left(W_{x,1}^{(l_{0})}h_{i_{0},1}^{(l_{0}-1)}+W_{h,1}^{(l_{0})}h_{i_{0}-1,1}^{(l_{0})}+b_{1}^{(l_{0})}\right)-\sigma\left(W_{x,2}^{(l_{0})}h_{i_{0},2}^{(l_{0}-1)}+W_{h,2}^{(l_{0})}h_{i_{0}-1,2}^{(l_{0})}+b_{2}^{(l_{0})}\right)\right\|_{2}
ρσ(Wx,1(l0)hi0,1(l01)Wx,2(l0)hi0,2(l01)2+Wh,1(l0)hi01,1(l0)Wh,2(l0)hi01,2(l0)2+b1(l0)b2(l0)2)\displaystyle\leq\rho_{\sigma}\left(\left\|W_{x,1}^{(l_{0})}h_{i_{0},1}^{(l_{0}-1)}-W_{x,2}^{(l_{0})}h_{i_{0},2}^{(l_{0}-1)}\right\|_{2}+\left\|W_{h,1}^{(l_{0})}h_{i_{0}-1,1}^{(l_{0})}-W_{h,2}^{(l_{0})}h_{i_{0}-1,2}^{(l_{0})}\right\|_{2}+\left\|b_{1}^{(l_{0})}-b_{2}^{(l_{0})}\right\|_{2}\right)
ρσ(Bxhi0,1(l01)hi0,2(l01)2+BσDΔxl0+Bhhi01,1(l0)hi01,2(l0)2+BσDΔxl0+Δbl0)\displaystyle\leq\rho_{\sigma}\left(B_{x}\left\|h_{i_{0},1}^{(l_{0}-1)}-h_{i_{0},2}^{(l_{0}-1)}\right\|_{2}+B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}+B_{h}\left\|h_{i_{0}-1,1}^{(l_{0})}-h_{i_{0}-1,2}^{(l_{0})}\right\|_{2}+B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}+\Delta_{b}^{l_{0}}\right)
ρσβ(r=0l01γrSi02rΔbl0r+BσDr=0l02γrSi02rΔxl0r+Bin(T)γl01Si02l01Δx1\displaystyle\leq\rho_{\sigma}\beta\left(\sum_{r=0}^{l_{0}-1}\gamma^{r}S_{i_{0}-2}^{r}\Delta_{b}^{l_{0}-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i_{0}-2}^{r}\Delta_{x}^{l_{0}-r}+B_{in}(T)\gamma^{l_{0}-1}S_{i_{0}-2}^{l_{0}-1}\Delta_{x}^{1}\right.
+BσDr=0l01γrSi03rΔhl0r)+ρσγ(r=0l02γrSi01rΔbl01r+BσDr=0l03γrSi01rΔxl01r\displaystyle\quad\quad\quad\left.+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-1}\gamma^{r}S_{i_{0}-3}^{r}\Delta_{h}^{l_{0}-r}\right)+\rho_{\sigma}\gamma\left(\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i_{0}-1}^{r}\Delta_{b}^{l_{0}-1-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-3}\gamma^{r}S_{i_{0}-1}^{r}\Delta_{x}^{l_{0}-1-r}\right.
Bin(T)γl02Si01l02Δx1+BσDr=0l02γrSi02rΔhl01r)+ρσ(BσDΔxl0+BσDΔxl0+Δbl0)\displaystyle\quad\quad\quad\left.B_{in}(T)\gamma^{l_{0}-2}S_{i_{0}-1}^{l_{0}-2}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i_{0}-2}^{r}\Delta_{h}^{l_{0}-1-r}\right)+\rho_{\sigma}\left(B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}+B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}+\Delta_{b}^{l_{0}}\right)
ρσ(r=1l01γr(βSi02r+Si01r1)Δbl0r+BσDr=1l02γr(βSi02r+Si01r1)Δxl0r\displaystyle\leq\rho_{\sigma}\left(\sum_{r=1}^{l_{0}-1}\gamma^{r}(\beta S_{i_{0}-2}^{r}+S_{i_{0}-1}^{r-1})\Delta_{b}^{l_{0}-r}+B_{\sigma}\sqrt{D}\sum_{r=1}^{l_{0}-2}\gamma^{r}(\beta S_{i_{0}-2}^{r}+S_{i_{0}-1}^{r-1})\Delta_{x}^{l_{0}-r}\right.
+Bin(T)γl01(βSi02l01+Si01l02)Δx1+BσDr=1l01γr(βSi03r+Si02r1)Δhl0r)\displaystyle\quad\quad\quad\left.+B_{in}(T)\gamma^{l_{0}-1}(\beta S_{i_{0}-2}^{l_{0}-1}+S_{i_{0}-1}^{l_{0}-2})\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{r=1}^{l_{0}-1}\gamma^{r}(\beta S_{i_{0}-3}^{r}+S_{i_{0}-2}^{r-1})\Delta_{h}^{l_{0}-r}\right)
+ρσ((1+βSi020)(Δbl0+BσDΔxl0)+(1+βSi030)BσDΔhl0).\displaystyle\quad+\rho_{\sigma}\left((1+\beta S_{i_{0}-2}^{0})\left(\Delta_{b}^{l_{0}}+B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}\right)+(1+\beta S_{i_{0}-3}^{0})B_{\sigma}\sqrt{D}\Delta_{h}^{l_{0}}\right).

Using the fact that 1+βSi10=Si01+\beta S_{i-1}^{0}=S_{i}^{0} and

βSi1r+Sir1\displaystyle\beta S_{i-1}^{r}+S_{i}^{r-1} =βj=0i1(j+rr)βj+j=0i(j+r1r1)βj=1+j=1i((j+r1r)+(j+r1r1))βj\displaystyle=\beta\sum_{j=0}^{i-1}\tbinom{j+r}{r}\beta^{j}+\sum_{j=0}^{i}\tbinom{j+r-1}{r-1}\beta^{j}=1+\sum_{j=1}^{i}\left(\tbinom{j+r-1}{r}+\tbinom{j+r-1}{r-1}\right)\beta^{j}
=j=0i(j+rr)βj=Sir,\displaystyle=\sum_{j=0}^{i}\tbinom{j+r}{r}\beta^{j}=S_{i}^{r},

(29) is proved. ∎

Claim 2.

For any 1\leq l\leq L, 1\leq i\leq N, and t\in(t_{i},t_{i+1}], \|h_{1}^{(l)}(t;S)-h_{2}^{(l)}(t;S)\|_{2} is bounded by

h1(l)(t;S)h2(l)(t;S)2\displaystyle\left\|h_{1}^{(l)}(t;S)-h_{2}^{(l)}(t;S)\right\|_{2} ρσ(r=0l1γrSirΔblr+BσDr=0l2γrSirΔxlr\displaystyle\leq\rho_{\sigma}\left(\sum_{r=0}^{l-1}\gamma^{r}S_{i}^{r}\Delta_{b}^{l-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l-2}\gamma^{r}S_{i}^{r}\Delta_{x}^{l-r}\right. (30)
+Bin(T)γl1Sil1Δx1+BσDr=0l1γrSi1rΔhlr).\displaystyle\left.+B_{in}(T)\gamma^{l-1}S_{i}^{l-1}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l-1}\gamma^{r}S_{i-1}^{r}\Delta_{h}^{l-r}\right). (31)
Proof of Claim 2.

When l=1l=1, by the definition of h(1)(t;S)h^{(1)}(t;S) and (29), for any 1iN1\leq i\leq N and t(ti,ti+1]t\in(t_{i},t_{i+1}], we have

h1(1)(t;S)h2(1)(t;S)2\displaystyle\left\|h_{1}^{(1)}(t;S)-h_{2}^{(1)}(t;S)\right\|_{2} =σ(Wx,1(1)x(t;ti)+Wh,1(1)hi,1(1)+b1(1))σ(Wx,2(1)x(t;ti)+Wh,2(1)hi,2(1)+b2(1))2\displaystyle=\left\|\sigma\left(W_{x,1}^{(1)}x(t;t_{i})+W_{h,1}^{(1)}h_{i,1}^{(1)}+b_{1}^{(1)}\right)-\sigma\left(W_{x,2}^{(1)}x(t;t_{i})+W_{h,2}^{(1)}h_{i,2}^{(1)}+b_{2}^{(1)}\right)\right\|_{2}
ρσ(Bin(T)Wx,1(1)Wx,2(1)2+Wh,1(1)hi,1(1)Wh,2(1)hi,2(1)2+b1(1)b2(1)2)\displaystyle\leq\rho_{\sigma}\left(B_{in}(T)\left\|W_{x,1}^{(1)}-W_{x,2}^{(1)}\right\|_{2}+\left\|W_{h,1}^{(1)}h_{i,1}^{(1)}-W_{h,2}^{(1)}h_{i,2}^{(1)}\right\|_{2}+\left\|b_{1}^{(1)}-b_{2}^{(1)}\right\|_{2}\right)
ρσ(Bin(T)Δx1+BσDΔh1+Bhhi,1(1)hi,2(1)2+Δb1)\displaystyle\leq\rho_{\sigma}\left(B_{in}(T)\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\Delta_{h}^{1}+B_{h}\left\|h_{i,1}^{(1)}-h_{i,2}^{(1)}\right\|_{2}+\Delta_{b}^{1}\right)
ρσβ(Si10Δb1+BσDSi10Δx1+BσDSi20Δh1)\displaystyle\leq\rho_{\sigma}\beta\left(S_{i-1}^{0}\Delta_{b}^{1}+B_{\sigma}\sqrt{D}S_{i-1}^{0}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}S_{i-2}^{0}\Delta_{h}^{1}\right)
+ρσ(Bin(T)Δx1+BσDΔh1+Δb1)\displaystyle\quad+\rho_{\sigma}\left(B_{in}(T)\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\Delta_{h}^{1}+\Delta_{b}^{1}\right)
ρσ(Si0Δb1+Bin(T)Si0Δx1+BσDSi10Δh1).\displaystyle\leq\rho_{\sigma}\left(S_{i}^{0}\Delta_{b}^{1}+B_{in}(T)S_{i}^{0}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}S_{i-1}^{0}\Delta_{h}^{1}\right).

Now suppose that (30) holds for all l<l_{0}, any 1\leq i\leq N, and t\in(t_{i},t_{i+1}]. Considering the case l=l_{0}, for any 1\leq i\leq N and t\in(t_{i},t_{i+1}], we have

h1(l0)(t;S)h2(l0)(t;S)2\displaystyle\quad\left\|h_{1}^{(l_{0})}(t;S)-h_{2}^{(l_{0})}(t;S)\right\|_{2}
=σ(Wx,1(l0)h1(l01)(t;S)+Wh,1(l0)hi,1(l0)+b1(l0))σ(Wx,2(l0)h2(l01)(t;S)+Wh,2(l0)hi,2(l0)+b2(l0))2\displaystyle=\left\|\sigma\left(W_{x,1}^{(l_{0})}h_{1}^{(l_{0}-1)}(t;S)+W_{h,1}^{(l_{0})}h_{i,1}^{(l_{0})}+b_{1}^{(l_{0})}\right)-\sigma\left(W_{x,2}^{(l_{0})}h_{2}^{(l_{0}-1)}(t;S)+W_{h,2}^{(l_{0})}h_{i,2}^{(l_{0})}+b_{2}^{(l_{0})}\right)\right\|_{2}
ρσ(Wx,1(l0)h1(l01)(t;S)Wx,2(l0)h2(l01)(t;S)2+Wh,1(l0)hi,1(l0)Wh,2(l0)hi,2(l0)2+b1(l0)b2(l0)2)\displaystyle\leq\rho_{\sigma}\left(\left\|W_{x,1}^{(l_{0})}h_{1}^{(l_{0}-1)}(t;S)-W_{x,2}^{(l_{0})}h_{2}^{(l_{0}-1)}(t;S)\right\|_{2}+\left\|W_{h,1}^{(l_{0})}h_{i,1}^{(l_{0})}-W_{h,2}^{(l_{0})}h_{i,2}^{(l_{0})}\right\|_{2}+\left\|b_{1}^{(l_{0})}-b_{2}^{(l_{0})}\right\|_{2}\right)
ρσ(Δbl0+BσDΔxl0+Bxh1(l01)(t;S)h2(l01)(t;S)2+BσDΔhl0+Bhhi,1(l0)hi,2(l0)2)\displaystyle\leq\rho_{\sigma}\left(\Delta_{b}^{l_{0}}+B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}+B_{x}\left\|h_{1}^{(l_{0}-1)}(t;S)-h_{2}^{(l_{0}-1)}(t;S)\right\|_{2}+B_{\sigma}\sqrt{D}\Delta_{h}^{l_{0}}+B_{h}\left\|h_{i,1}^{(l_{0})}-h_{i,2}^{(l_{0})}\right\|_{2}\right)
ρσγ(r=0l02γrSirΔbl01r+BσDr=0l03γrSirΔxl01r+Bin(T)γl02Sil02Δx1\displaystyle\leq\rho_{\sigma}\gamma\left(\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i}^{r}\Delta_{b}^{l_{0}-1-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-3}\gamma^{r}S_{i}^{r}\Delta_{x}^{l_{0}-1-r}+B_{in}(T)\gamma^{l_{0}-2}S_{i}^{l_{0}-2}\Delta_{x}^{1}\right.
+BσDr=0l02γrSi1rΔhl01r)+ρσβ(r=0l01γrSi1rΔbl0r+BσDr=0l02γrSi1rΔxl0r\displaystyle\quad\quad\quad\left.+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i-1}^{r}\Delta_{h}^{l_{0}-1-r}\right)+\rho_{\sigma}\beta\left(\sum_{r=0}^{l_{0}-1}\gamma^{r}S_{i-1}^{r}\Delta_{b}^{l_{0}-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i-1}^{r}\Delta_{x}^{l_{0}-r}\right.
+Bin(T)γl01Si1l01Δx1+BσDl=0l01γrSi2rΔhl0r)+ρσ(Δbl0+BσDΔxl0+BσDΔhl0)\displaystyle\quad\quad\quad\left.+B_{in}(T)\gamma^{l_{0}-1}S_{i-1}^{l_{0}-1}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{l=0}^{l_{0}-1}\gamma^{r}S_{i-2}^{r}\Delta_{h}^{l_{0}-r}\right)+\rho_{\sigma}\left(\Delta_{b}^{l_{0}}+B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}+B_{\sigma}\sqrt{D}\Delta_{h}^{l_{0}}\right)
ρσ(r=1l01γr(βSi1r+Sir1)Δbl0r+BσDr=1l02γr(βSi1r+Sir1)Δxl0r\displaystyle\leq\rho_{\sigma}\left(\sum_{r=1}^{l_{0}-1}\gamma^{r}(\beta S_{i-1}^{r}+S_{i}^{r-1})\Delta_{b}^{l_{0}-r}+B_{\sigma}\sqrt{D}\sum_{r=1}^{l_{0}-2}\gamma^{r}(\beta S_{i-1}^{r}+S_{i}^{r-1})\Delta_{x}^{l_{0}-r}\right.
+Bin(T)γl01(βSi1l01+Sil02)Δx1+BσDr=1l01γr(βSi2r+Si1r1)Δhl0r)\displaystyle\quad\quad\quad\left.+B_{in}(T)\gamma^{l_{0}-1}(\beta S_{i-1}^{l_{0}-1}+S_{i}^{l_{0}-2})\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{r=1}^{l_{0}-1}\gamma^{r}(\beta S_{i-2}^{r}+S_{i-1}^{r-1})\Delta_{h}^{l_{0}-r}\right)
+ρσ((1+βSi20)(Δbl0+(BσDBin(T))Δxl0)+(1+βSi30)BσDΔhl0)\displaystyle\quad+\rho_{\sigma}\left((1+\beta S_{i-2}^{0})\left(\Delta_{b}^{l_{0}}+(B_{\sigma}\sqrt{D}\vee B_{in}(T))\Delta_{x}^{l_{0}}\right)+(1+\beta S_{i-3}^{0})B_{\sigma}\sqrt{D}\Delta_{h}^{l_{0}}\right)
ρσ(r=0l01γrSirΔbl0r+BσDr=0l02γrSirΔxl0r+Bin(T)γl01Sil01Δx1+BσDr=0l01γrSi1rΔhl0r).\displaystyle\leq\rho_{\sigma}\left(\sum_{r=0}^{l_{0}-1}\gamma^{r}S_{i}^{r}\Delta_{b}^{l_{0}-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i}^{r}\Delta_{x}^{l_{0}-r}+B_{in}(T)\gamma^{l_{0}-1}S_{i}^{l_{0}-1}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-1}\gamma^{r}S_{i-1}^{r}\Delta_{h}^{l_{0}-r}\right).

Hence (31) is proved. ∎

Now we prove Lemma 4. For t(ti,ti+1]t\in(t_{i},t_{i+1}], we have

|λθ1(t;S)λθ2(t;S)|\displaystyle\left|\lambda_{\theta_{1}}(t;S)-\lambda_{\theta_{2}}(t;S)\right| =|f(Wx,1(L+1)h(L)(t;S)+b1(L+1))f(Wx,2(L+1)h(L)(t;S)+b2(L+1))|\displaystyle=\left|f\left(W_{x,1}^{(L+1)}h^{(L)}(t;S)+b_{1}^{(L+1)}\right)-f\left(W_{x,2}^{(L+1)}h^{(L)}(t;S)+b_{2}^{(L+1)}\right)\right|
ρf(b1(L+1)b2(L+1)2+Wx,1(L+1)h(L)(t;S)Wx,2(L+1)h(L)(t;S)2)\displaystyle\leq\rho_{f}\left(\left\|b_{1}^{(L+1)}-b_{2}^{(L+1)}\right\|_{2}+\left\|W_{x,1}^{(L+1)}h^{(L)}(t;S)-W_{x,2}^{(L+1)}h^{(L)}(t;S)\right\|_{2}\right)
ρf(ΔbL+1+BσDΔxL+1+Bxh1(L)(t;S)h2(L)(t;S)2)\displaystyle\leq\rho_{f}\left(\Delta_{b}^{L+1}+B_{\sigma}\sqrt{D}\Delta_{x}^{L+1}+B_{x}\left\|h_{1}^{(L)}(t;S)-h_{2}^{(L)}(t;S)\right\|_{2}\right)
ρfγ(l=0L1γlSilΔbLl+BσDl=0L2γlSilΔxLl+Bin(T)γL1SiL1Δx1\displaystyle\leq\rho_{f}\gamma\left(\sum_{l=0}^{L-1}\gamma^{l}S_{i}^{l}\Delta_{b}^{L-l}+B_{\sigma}\sqrt{D}\sum_{l=0}^{L-2}\gamma^{l}S_{i}^{l}\Delta_{x}^{L-l}+B_{in}(T)\gamma^{L-1}S_{i}^{L-1}\Delta_{x}^{1}\right.
+BσDl=0L1γlSi1lΔhLl)+ρfΔbL+1+ρfBσDΔxL+1.\displaystyle\quad\quad\quad\left.+B_{\sigma}\sqrt{D}\sum_{l=0}^{L-1}\gamma^{l}S_{i-1}^{l}\Delta_{h}^{L-l}\right)+\rho_{f}\Delta_{b}^{L+1}+\rho_{f}B_{\sigma}\sqrt{D}\Delta_{x}^{L+1}.

8.4 Proof of Lemma 5

From Lemma 4, for any \lambda_{\theta_{1}},\lambda_{\theta_{2}}\in\mathcal{F}, we have

dN(λθ1,λθ2)\displaystyle\quad d_{N}(\lambda_{\theta_{1}},\lambda_{\theta_{2}})
ρfγ(l=0L1γlSNlΔbLl+BσDl=0L2γlSNlΔxLl+Bin(T)γL1SNL1Δx1+BσDl=0L1γlSN1lΔhLl)\displaystyle\leq\rho_{f}\gamma\left(\sum_{l=0}^{L-1}\gamma^{l}S_{N}^{l}\Delta_{b}^{L-l}+B_{\sigma}\sqrt{D}\sum_{l=0}^{L-2}\gamma^{l}S_{N}^{l}\Delta_{x}^{L-l}+B_{in}(T)\gamma^{L-1}S_{N}^{L-1}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{l=0}^{L-1}\gamma^{l}S_{N-1}^{l}\Delta_{h}^{L-l}\right)
+ρfΔbL+1+ρfBσDΔxL+1\displaystyle\quad+\rho_{f}\Delta_{b}^{L+1}+\rho_{f}B_{\sigma}\sqrt{D}\Delta_{x}^{L+1}
ρf(BσDBin(T)1)(γL1)SNL1Δθ\displaystyle\leq\rho_{f}\left(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1\right)\left(\gamma^{L}\vee 1\right)S_{N}^{L-1}\Delta_{\theta}
ρf(BσDBin(T)1)(γL1)(N+1)L1βN+11β1Δθ,\displaystyle\leq\rho_{f}\left(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1\right)\left(\gamma^{L}\vee 1\right)(N+1)^{L-1}\frac{\beta^{N+1}-1}{\beta-1}\Delta_{\theta},

where Δθl=0L+1(Δbl+Δxl+Δhl)\Delta_{\theta}\triangleq\sum_{l=0}^{L+1}\left(\Delta_{b}^{l}+\Delta_{x}^{l}+\Delta_{h}^{l}\right).

Define C(N)\triangleq\rho_{f}\left(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1\right)\left(\gamma^{L}\vee 1\right)(N+1)^{L-1}(\beta^{N+1}-1)/(\beta-1). Using Lemma 11 and the fact that \|\cdot\|_{2}\leq\|\cdot\|_{F}, we can get

𝒩(Θ,ϵ,dNλ(,))\displaystyle\mathcal{N}\left(\mathcal{F}_{\Theta}^{\mathcal{B}},\epsilon,d_{N}^{\lambda}(\cdot,\cdot)\right) l=1L+1𝒩(Wx(l),ϵC(N)(3L+2),F)l=1L𝒩(Wh(l),ϵC(N)(3L+2),F)\displaystyle\leq\prod_{l=1}^{L+1}\mathcal{N}\left(W_{x}^{(l)},\frac{\epsilon}{C(N)(3L+2)},\|\cdot\|_{F}\right)~{}\prod_{l=1}^{L}\mathcal{N}\left(W_{h}^{(l)},\frac{\epsilon}{C(N)(3L+2)},\|\cdot\|_{F}\right)
l=1L+1𝒩(b(l),ϵC(N)(3L+2),2)\displaystyle\quad~{}\prod_{l=1}^{L+1}\mathcal{N}\left(b^{(l)},\frac{\epsilon}{C(N)(3L+2)},\|\cdot\|_{2}\right)
(1+C(N)(3L+2)BmDϵ)D2(3L+2),\displaystyle\leq\left(1+\frac{C(N)(3L+2)B_{m}\sqrt{D}}{\epsilon}\right)^{D^{2}(3L+2)}~{},

where Bm=max{Bb,Bh,Bx}B_{m}=\max\{B_{b},B_{h},B_{x}\}.
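
As a rough illustration of how the resulting metric-entropy bound scales, the short Python sketch below evaluates C(N) and the logarithm of the covering-number bound for some hypothetical constants; all numerical values are placeholders and the expressions simply transcribe the bounds above.

import numpy as np

# Numerical illustration of the covering-number bound (all constants are placeholders).
rho_f, B_sigma, B_in, B_m = 1.0, 1.0, 2.0, 1.0
gamma, beta = 1.2, 1.1            # gamma = rho_sigma * B_x, beta = rho_sigma * B_h
L, D = 2, 16

def C(N):
    return rho_f * max(B_sigma * np.sqrt(D), B_in, 1.0) * max(gamma**L, 1.0) \
           * (N + 1)**(L - 1) * (beta**(N + 1) - 1) / (beta - 1)

def log_covering_bound(N, eps):
    return D**2 * (3 * L + 2) * np.log(1 + C(N) * (3 * L + 2) * B_m * np.sqrt(D) / eps)

for N in (5, 10, 20):
    # The log covering number grows roughly like D^2*L*(N*log(beta) + (L-1)*log(N)) for small eps.
    print(N, round(log_covering_bound(N, eps=0.1), 1))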

8.5 Proof of Theorem 3

Lemma 6.

Under assumptions (B1)-(B3), for fixed ss\in\mathbb{N}, with probability at least 1δ1-\delta, we have

supθΘ|Xθ(s)|\displaystyle\sup_{\theta\in\Theta}|X_{\theta}(s)| 48n(T+1lf)(s+1){4uf(log(2δ)+D(3L+2)log(1+M(s)))+D3L+2}.\displaystyle\leq\frac{48}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s+1)\left\{4u_{f}\left(\sqrt{\log\left(\frac{2}{\delta}\right)}+D\sqrt{(3L+2)\log\left(1+M(s)\right)}~{}\right)+D\sqrt{3L+2}\right\}.

Hence

supθBm|Xθ(s)|O~(D2L2s3n),\displaystyle\sup_{\|\theta\|\leq B_{m}}|X_{\theta}(s)|\leq\tilde{O}\left(\sqrt{\frac{{D^{2}L^{2}s^{3}}}{n}}\right),

where M(s)=ρfBmD(BσDBin(T)1)(γL1)(s+1)L1(βs+11)/(β1)M(s)=\rho_{f}B_{m}\sqrt{D}\left(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1\right)\left(\gamma^{L}\vee 1\right)(s+1)^{L-1}(\beta^{s+1}-1)/(\beta-1), Bm=max{Bb,Bh,Bx}B_{m}=\max\{B_{b},B_{h},B_{x}\}, γ=ρσBx\gamma=\rho_{\sigma}B_{x}, and β=ρσBh\beta=\rho_{\sigma}B_{h}.

Proof of Lemma 6.

For 1kn1\leq k\leq n, denote Xθ,k(s)=𝔼[loss(λθ,Stest)𝟙{Nes}]loss(λθ,Sk)𝟙{Neks}X_{\theta,k}(s)=\mathbb{E}\left[\text{loss}(\lambda_{\theta},S_{test})\mathbbm{1}_{\{N_{e}\leq s\}}\right]-\text{loss}(\lambda_{\theta},S_{k})\mathbbm{1}_{\{N_{ek}\leq s\}}. Then Xθ(s)=n1k=1nXθ,k(s)X_{\theta}(s)=n^{-1}\sum_{k=1}^{n}X_{\theta,k}(s). For two parameters θ1\theta_{1} and θ2\theta_{2}, we have

|loss(λθ1,Sk)𝟙{Neks}loss(λθ2,Sk)𝟙{Neks}|\displaystyle~{}~{}~{}~{}\left|\text{loss}(\lambda_{\theta_{1}},S_{k})\mathbbm{1}_{\{N_{ek}\leq s\}}-\text{loss}(\lambda_{\theta_{2}},S_{k})\mathbbm{1}_{\{N_{ek}\leq s\}}\right|
|i=1Nk(logλθ1(ti)logλθ2(ti))|+|0T(λθ1(t)λθ2(t))dt|\displaystyle\leq\Big{|}\sum_{i=1}^{N_{k}}(\log\lambda_{\theta_{1}}(t_{i})-\log\lambda_{\theta_{2}}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\lambda_{\theta_{1}}(t)-\lambda_{\theta_{2}}(t)\right)\mathrm{dt}\Big{|}
1lfi=1Nk|λθ1(ti)λθ2(ti))|+0T|λθ1(t)λθ2(t)|dt\displaystyle\leq\frac{1}{l_{f}}\sum_{i=1}^{N_{k}}|\lambda_{\theta_{1}}(t_{i})-\lambda_{\theta_{2}}(t_{i}))|+\int_{0}^{T}\left|\lambda_{\theta_{1}}(t)-\lambda_{\theta_{2}}(t)\right|\mathrm{dt}
(T+Nklf)dNk(λθ1,λθ2)\displaystyle\leq\left(T+\frac{N_{k}}{l_{f}}\right)d_{N_{k}}(\lambda_{\theta_{1}},\lambda_{\theta_{2}})
(T+1lf)(s+1)ds(λθ1,λθ2),\displaystyle\leq\left(T+\frac{1}{l_{f}}\right)(s+1)d_{s}(\lambda_{\theta_{1}},\lambda_{\theta_{2}}),

and similarly,

|𝔼[loss(λθ1,Stest)𝟙{Nes}]𝔼[loss(λθ2,Stest)𝟙{Nes}]|(T+1lf)(s+1)ds(λθ1,λθ2).\displaystyle\left|\mathbb{E}\left[\text{loss}(\lambda_{\theta_{1}},S_{test})\mathbbm{1}_{\{N_{e}\leq s\}}\right]-\mathbb{E}\left[\text{loss}(\lambda_{\theta_{2}},S_{test})\mathbbm{1}_{\{N_{e}\leq s\}}\right]\right|\leq\left(T+\frac{1}{l_{f}}\right)(s+1)d_{s}(\lambda_{\theta_{1}},\lambda_{\theta_{2}}).

Hence

|Xθ1,k(s)Xθ2,k(s)|2(T+1lf)(s+1)ds(λθ1,λθ2).\displaystyle\left|X_{\theta_{1},k}(s)-X_{\theta_{2},k}(s)\right|\leq 2\left(T+\frac{1}{l_{f}}\right)(s+1)d_{s}(\lambda_{\theta_{1}},\lambda_{\theta_{2}}).

By the property of bounded random variables, X_{\theta_{1},k}(s)-X_{\theta_{2},k}(s) is 2\left(T+1/l_{f}\right)(s+1)d_{s}(\lambda_{\theta_{1}},\lambda_{\theta_{2}})-sub-Gaussian. Since \{X_{\theta_{1},k}(s)-X_{\theta_{2},k}(s)\}_{k=1}^{n} are mutually independent, X_{\theta_{1}}(s)-X_{\theta_{2}}(s) is 2\left(T+1/l_{f}\right)(s+1)d_{s}(\lambda_{\theta_{1}},\lambda_{\theta_{2}})/\sqrt{n}-sub-Gaussian. From assumptions (B2) and (B3), there exists \theta_{0} with \|\theta_{0}\|\leq B_{m} such that \lambda_{\theta_{0}}\equiv 1, implying X_{\theta_{0}}(s)=0.

The diameter of \mathcal{F} under the distance ds(,)d_{s}(\cdot,\cdot) can be bounded by

diam(|ds)\displaystyle\text{diam}\left(\mathcal{F}|d_{s}\right) supθ1,θ2Θdsλ(λθ,λθ0)supθ1,θ2Θsup#Ssλθ1(t;S)λθ2(t;S)L\displaystyle\leq\sup_{\theta_{1},\theta_{2}\in\Theta}d^{\lambda}_{s}(\lambda_{\theta},\lambda_{\theta_{0}})\leq\sup_{\theta_{1},\theta_{2}\in\Theta}\sup_{\#S\leq s}\|\lambda_{\theta_{1}}(t;S)-\lambda_{\theta_{2}}(t;S)\|_{L^{\infty}}
2uf.\displaystyle\leq 2u_{f}~{}. (32)

By Lemma 5, we get

log𝒩(,ϵ,ds(,))D2(3L+2)log(1+C(s)(3L+2)BmDϵ),\displaystyle\log\mathcal{N}\left(\mathcal{F},\epsilon,d_{s}(\cdot,\cdot)\right)\leq D^{2}(3L+2)\log\left(1+\frac{C(s)(3L+2)B_{m}\sqrt{D}}{\epsilon}\right)~{}, (33)

where C(s)=ρf(BσDBin(T)1)(γL1)(s+1)L1(βs+11)/(β1)C(s)=\rho_{f}\left(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1\right)\left(\gamma^{L}\vee 1\right)(s+1)^{L-1}(\beta^{s+1}-1)/(\beta-1), Bm=max{Bb,Bh,Bx}B_{m}=\max\{B_{b},B_{h},B_{x}\}. Denote M(s)=C(s)(3L+2)BmDM(s)=C(s)(3L+2)B_{m}\sqrt{D}, 𝒟=diam(Θ|dsλ)\mathcal{D}=\text{diam}\left(\mathcal{F}_{\Theta}|d^{\lambda}_{s}\right). We have

02𝒟log(1+M(s)ϵ)dϵ\displaystyle\int_{0}^{2\mathcal{D}}\sqrt{\log\left(1+\frac{M(s)}{\epsilon}\right)}\mathrm{d}\epsilon (0a+a2𝒟)log(1+M(s)ϵ)dϵ(0a2𝒟)\displaystyle\leq\left(\int_{0}^{a}+\int_{a}^{2\mathcal{D}}\right)\sqrt{\log\left(1+\frac{M(s)}{\epsilon}\right)}\mathrm{d}\epsilon~{}~{}(\forall 0\leq a\leq 2\mathcal{D})
inf0a2𝒟{0aM(s)ϵdϵ+a2𝒟log(1+M(s)ϵ)dϵ}\displaystyle\leq\inf_{0\leq a\leq 2\mathcal{D}}\left\{\int_{0}^{a}\sqrt{\frac{M(s)}{\epsilon}}\mathrm{d}\epsilon+\int_{a}^{2\mathcal{D}}\sqrt{\log\left(1+\frac{M(s)}{\epsilon}\right)}\mathrm{d}\epsilon\right\}
inf0a2𝒟{2M(s)a+2𝒟log(1+M(s)a)}\displaystyle\leq\inf_{0\leq a\leq 2\mathcal{D}}\left\{2\sqrt{M(s)a}+2\mathcal{D}\sqrt{\log\left(1+\frac{M(s)}{a}\right)}\right\}
2+2𝒟log(1+M(s)2)(takea=M(s)1)\displaystyle\leq 2+2\mathcal{D}\sqrt{\log\left(1+{M(s)}^{2}\right)}~{}~{}(\text{take}~{}a={M(s)}^{-1})
2+4𝒟log(1+M(s)),\displaystyle\leq 2+4\mathcal{D}\sqrt{\log\left(1+M(s)\right)}, (34)

where we need 2𝒟M(s)12\mathcal{D}M(s)\geq 1. If 2𝒟M(s)<12\mathcal{D}M(s)<1, (34) is obvious since the integral is less than 22.

Combining (32), (33), (34) and using Lemma 12, we have

supθΘ|Xθ(s)|\displaystyle\sup_{\theta\in\Theta}|X_{\theta}(s)| 24n(T+1lf)(s+1)(𝒟(4log(2δ)+4D(3L+2)log(1+M(s)))+2D3L+2)\displaystyle\leq\frac{24}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s+1)\left(\mathcal{D}\left(4\sqrt{\log\left(\frac{2}{\delta}\right)}+4D\sqrt{(3L+2)\log(1+M(s))}\right)+2D\sqrt{3L+2}\right)
48n(T+1lf)(s+1){4uf(log(2δ)+D(3L+2)log(1+M(s)))+D3L+2}.\displaystyle\leq\frac{48}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s+1)\left\{4u_{f}\left(\sqrt{\log\left(\frac{2}{\delta}\right)}+D\sqrt{(3L+2)\log\left(1+M(s)\right)}~{}\right)+D\sqrt{3L+2}\right\}~{}.

Lemma 7.

Suppose the event number NeN_{e} satisfies the tail condition

(Nes)aNexp(cNs),s.\mathbb{P}(N_{e}\geq s)\leq a_{N}\exp(-c_{N}s),~{}s\in\mathbb{N}.

Under assumptions (B1)-(B3), for fixed ss\in\mathbb{N}, we have

supθΘ|Eθ(s)|(T+1lf)(uf+2)aN(s+2)(1exp(cN))2exp(cN(s+1)).\displaystyle\sup_{\theta\in\Theta}|E_{\theta}(s)|\leq\left(T+\frac{1}{l_{f}}\right)(u_{f}+2)\frac{a_{N}(s+2)}{(1-\exp(-c_{N}))^{2}}\exp(-c_{N}(s+1))~{}.
Proof of Lemma 7.

From assumptions (B2) and (B3), there exists θ0Θ\theta_{0}\in\Theta such that λθ01\lambda_{\theta_{0}}\equiv 1. Then

|Eθ(s)|\displaystyle|E_{\theta}(s)| =|𝔼[loss(λθ,Stest)𝟙{Ne>s}]|𝔼|loss(λθ,Stest)|𝟙{Ne>s}\displaystyle=\left|\mathbb{E}\left[\text{loss}(\lambda_{\theta},S_{test})\mathbbm{1}_{\{N_{e}>s\}}\right]\right|\leq\mathbb{E}\left|\text{loss}(\lambda_{\theta},S_{test})\right|\mathbbm{1}_{\{N_{e}>s\}}
𝔼|loss(λθ,Stest)loss(λθ0,Stest)|𝟙{Ne>s}+𝔼|loss(λθ0,Stest)|𝟙{Ne>s}\displaystyle\leq\mathbb{E}\left|\text{loss}(\lambda_{\theta},S_{test})-\text{loss}(\lambda_{\theta_{0}},S_{test})\right|\mathbbm{1}_{\{N_{e}>s\}}+\mathbb{E}\left|\text{loss}(\lambda_{\theta_{0}},S_{test})\right|\mathbbm{1}_{\{N_{e}>s\}}
𝔼[(T+1lf)(Ne+1)dNe(λθ,λθ0)]𝟙{Ne>s}+T(Ne>s)\displaystyle\leq\mathbb{E}\left[\left(T+\frac{1}{l_{f}}\right)(N_{e}+1)d_{N_{e}}(\lambda_{\theta},\lambda_{\theta_{0}})\right]\mathbbm{1}_{\{N_{e}>s\}}+T\mathbb{P}(N_{e}>s)
(T+1lf)(uf+1)𝔼[(Ne+1)𝟙{Ne>s}]+T(Ne>s)\displaystyle\leq\left(T+\frac{1}{l_{f}}\right)(u_{f}+1)\mathbb{E}[(N_{e}+1)\mathbbm{1}_{\{N_{e}>s\}}]+T\mathbb{P}(N_{e}>s)

By the tail condition (Nes)aNexp(cNs),s\mathbb{P}(N_{e}\geq s)\leq a_{N}\exp(-c_{N}s),~{}s\in\mathbb{N}, we have

|Eθ(s)|\displaystyle|E_{\theta}(s)| (T+1lf)(uf+1)aN(s+1)(1exp(cN))2exp(cN(s+1))\displaystyle\leq\left(T+\frac{1}{l_{f}}\right)(u_{f}+1)\frac{a_{N}(s+1)}{(1-\exp(-c_{N}))^{2}}\exp(-c_{N}(s+1))
+(T+1lf)(uf+2)aNexp(cN(s+1))\displaystyle\quad+\left(T+\frac{1}{l_{f}}\right)(u_{f}+2)a_{N}\exp(-c_{N}(s+1))
(T+1lf)(uf+2)aN(s+2)(1exp(cN))2exp(cN(s+1)).\displaystyle\leq\left(T+\frac{1}{l_{f}}\right)(u_{f}+2)\frac{a_{N}(s+2)}{(1-\exp(-c_{N}))^{2}}\exp(-c_{N}(s+1)).

Now we prove Theorem 3. From Lemma 2, we have

(supθΘ|Xθ|>t)(supθΘ|Xθ(s)|+supθΘ|Eθ(s)|>t)+(Ne(n)>s).\displaystyle\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t\right)\leq\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s)|+\sup_{\theta\in\Theta}|E_{\theta}(s)|>t\right)+\mathbb{P}(N_{e(n)}>s).

Since

(Ne(n)>s)n(Ne>s)naNexp(cNs),\displaystyle\mathbb{P}(N_{e(n)}>s)\leq n\mathbb{P}(N_{e}>s)\leq na_{N}\exp(-c_{N}s),

we can take s_{0}=\lceil\left(\log\left(2a_{N}n/\delta\right)-1\right)/c_{N}\rceil such that na_{N}\exp(-c_{N}s_{0})\leq\delta/2, so we only need to find t>0 such that

(supθΘ|Xθ(s0)|+supθΘ|Eθ(s0)|>t)δ2.\displaystyle\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s_{0})|+\sup_{\theta\in\Theta}|E_{\theta}(s_{0})|>t\right)\leq\frac{\delta}{2}~{}.

From Lemma 7, we have

supθΘ|Eθ(s0)|\displaystyle\sup_{\theta\in\Theta}|E_{\theta}(s_{0})| (T+1lf)(uf+2)aN(s0+2)(1exp(cN))2exp(cN(s0+1)):=B(s0).\displaystyle\leq\left(T+\frac{1}{l_{f}}\right)(u_{f}+2)\frac{a_{N}(s_{0}+2)}{(1-\exp(-c_{N}))^{2}}\exp(-c_{N}(s_{0}+1)):=B(s_{0}).

By the definition of s0s_{0}, B(s0)(T+1/lf)(uf+2)(s0+2)δ/[2n(1exp(cN))2]B(s_{0})\leq\left(T+1/l_{f}\right)(u_{f}+2)(s_{0}+2)\delta/[2n(1-\exp(-c_{N}))^{2}]. Thus we only need to solve t>0t>0 such that

(supθΘ|Xθ(s0)|+supθΘ|Eθ(s0)|>t)(supθΘ|Xθ(s0)|>tB(s0))δ2.\displaystyle\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s_{0})|+\sup_{\theta\in\Theta}|E_{\theta}(s_{0})|>t\right)\leq\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s_{0})|>t-B(s_{0})\right)\leq\frac{\delta}{2}~{}.

From Lemma 6, we can choose

t0\displaystyle t_{0} =48n(T+1lf)(s0+1){4uf(log(4δ)+D(3L+2)log(1+M(s0)))+D3L+2}+B(s0)\displaystyle=\frac{48}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s_{0}+1)\left\{4u_{f}\left(\sqrt{\log\left(\frac{4}{\delta}\right)}+D\sqrt{(3L+2)\log\left(1+M(s_{0})\right)}\right)+D\sqrt{3L+2}\right\}+B(s_{0})
48n(T+1lf)(s0+1){4uf(log(4δ)+D(3L+2)log(1+M(s0)))+D3L+2}\displaystyle\leq\frac{48}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s_{0}+1)\left\{4u_{f}\left(\sqrt{\log\left(\frac{4}{\delta}\right)}+D\sqrt{(3L+2)\log\left(1+M(s_{0})\right)}~{}\right)+D\sqrt{3L+2}\right\}
+(T+1lf)(uf+2)s0+2(1exp(cN))2δ2n\displaystyle\quad+\left(T+\frac{1}{l_{f}}\right)(u_{f}+2)\frac{s_{0}+2}{(1-\exp(-c_{N}))^{2}}\frac{\delta}{2n}
192n(T+1lf)(s0+1)uf(log(4δ)+D(3L+2)(log(1+M(s0))+1)+1(1exp(cN))2).\displaystyle\leq\frac{192}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s_{0}+1)u_{f}\left(\sqrt{\log\left(\frac{4}{\delta}\right)}+D\sqrt{(3L+2)}(\sqrt{\log\left(1+M(s_{0})\right)}+1)+\frac{1}{(1-\exp(-c_{N}))^{2}}~{}\right)~{}.

such that \mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t_{0}\right)\leq\delta. Hence the theorem is proved.

9 Proofs in Sections 5 and 6

9.1 Proof of Theorem 4

For λ(t)=λ0(t)Ws,([0,T],B0)\lambda^{\ast}(t)=\lambda_{0}(t)\in W^{s,\infty}([0,T],B_{0}), δ=1/2\delta=1/2, and N5N\geq 5, by Lemma 13, there exists a two-layer NN f^N\hat{f}^{N} such that

|f^N(x)λ(Tx)|3𝒞B0Ts2Ns,0x1,\displaystyle\left|\hat{f}^{N}(x)-\lambda^{\ast}(Tx)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}},~{}0\leq x\leq 1,

where 𝒞=2s5s/(s1)!\mathcal{C}=\sqrt{2s}5^{s}/(s-1)!.

Then we have

|f^N(tT)λ(t)|3𝒞B0Ts2Ns,0tT.\displaystyle\left|\hat{f}^{N}(\frac{t}{T})-\lambda^{\ast}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}},~{}0\leq t\leq T.

Since B1λ(t)B0B_{1}\leq\lambda^{*}(t)\leq B_{0}, taking lf=B1l_{f}=B_{1}, uf=B0u_{f}=B_{0} and λ^N(t)=f(f^N(t/T))\hat{\lambda}^{N}(t)=f(\hat{f}^{N}(t/T)), we have

\left|\hat{\lambda}^{N}(t)-\lambda^{\ast}(t)\right|\leq\left|\hat{f}^{N}\Big{(}\frac{t}{T}\Big{)}-\lambda^{\ast}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}},~{}\forall 0\leq t\leq T.

Then

|𝔼[loss(λ^N,Stest)]𝔼[loss(λ,Stest)]|\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]| 𝔼|loss(λ^N,Stest)loss(λ,Stest)|\displaystyle\leq\mathbb{E}\left|\text{loss}(\hat{\lambda}^{N},S_{test})-\text{loss}(\lambda^{\ast},S_{test})\right|
𝔼(|i=1Ne(logλ~N(ti)logλ(ti))|+|0T(λ~N(t)λ(t))dt|)\displaystyle\leq\mathbb{E}\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\tilde{\lambda}^{N}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\tilde{\lambda}^{N}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)
𝔼(T+NeB1)λ~NλL[0,T]\displaystyle\leq\mathbb{E}\left(T+\frac{N_{e}}{B_{1}}\right)\left\|\tilde{\lambda}^{N}-\lambda^{\ast}\right\|_{L^{\infty}[0,T]}
(T+1B1)3𝒞B0Ts2Ns𝔼(Ne+1).\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}}\mathbb{E}(N_{e}+1). (35)

Since λ0B0\lambda_{0}\leq B_{0}, μ0\mu\equiv 0, taking c=B0c=B_{0}, c0=0c_{0}=0, and η=e\eta=e in Lemma 2 , we have

(Nes)2B0Texp(eB0Ts2).\displaystyle\mathbb{P}(N_{e}\geq s)\leq 2\sqrt{B_{0}T}\exp\left(\frac{eB_{0}T-s}{2}\right).

Thus

𝔼(Ne+1)1+s=1(Nes)1+2B0T1exp(1/2)exp(eB0T12)5B0T+1exp(3B0T2).\displaystyle\mathbb{E}(N_{e}+1)\leq 1+\sum_{s=1}^{\infty}\mathbb{P}(N_{e}\geq s)\leq 1+\frac{2\sqrt{B_{0}T}}{1-\exp(-1/2)}\exp\left(\frac{eB_{0}T-1}{2}\right)\leq{5\sqrt{B_{0}T+1}}\exp\left(\frac{3B_{0}T}{2}\right). (36)

Combining (35) and (36), we get

|𝔼[loss(λ^N,Stest)]𝔼[loss(λ,Stest)]|\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]| 5B0T+1exp(3B0T2)(T+1B1)3𝒞B0Ts2Ns\displaystyle\leq{5\sqrt{B_{0}T+1}}\exp\left(\frac{3B_{0}T}{2}\right)(T+\frac{1}{B_{1}})\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}}
15exp(2B0T)(T+1B1)𝒞B0TsNs,\displaystyle\leq 15\exp\left({2B_{0}T}\right)(T+\frac{1}{B_{1}})\frac{\mathcal{C}B_{0}T^{s}}{N^{s}},

where 𝒞=2s5s/(s1)!\mathcal{C}=\sqrt{2s}5^{s}/(s-1)! .

\hat{\lambda}^{N} can be naturally viewed as an RNN by taking W_{h}^{l}=0, l=1,2. The width and weight bounds can be obtained directly from Lemma 13 and Remark 14.

9.2 Proof of Theorem 5

The proof is divided into several steps. Let S=\{t_{i}\}_{i=1}^{N_{e}}. Here we agree on t_{0}=0 and t_{N_{e}+1}=T. To be concise, we denote S(t)=\sum_{t_{i}<t}\exp(-\beta(t-t_{i}))+1 and S_{i}=\sum_{0<j<i}\exp(-\beta(t_{i}-t_{j}))+1, i\in\mathbb{N}_{+}, with S_{0}=0 by default. Hence \lambda^{\ast}(t)=\lambda_{0}(t)+\alpha(S(t)-1), S_{i+1}=S_{i}\exp(-\beta(t_{i+1}-t_{i}))+1, and S(t)=S_{i}\exp(-\beta(t-t_{i}))+1 for t\in(t_{i},t_{i+1}].
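
For completeness, the first recursion can be checked in one line (a direct verification, stated here only as a reading aid):

S_{i+1}=\sum_{0<j\leq i}\exp(-\beta(t_{i+1}-t_{j}))+1=\exp(-\beta(t_{i+1}-t_{i}))\Big{(}\sum_{0<j<i}\exp(-\beta(t_{i}-t_{j}))+1\Big{)}+1=S_{i}\exp(-\beta(t_{i+1}-t_{i}))+1,

and the identity S(t)=S_{i}\exp(-\beta(t-t_{i}))+1 on (t_{i},t_{i+1}] follows in the same way.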

We first fix s0+s_{0}\in\mathbb{N}_{+}.

Step 1. Construct the approximation of g(x,y)=x\exp(-\beta y)+1, where g\in C^{\infty}\left([-(s_{0}+1),2(s_{0}+1)]\times[0,T]\right).

Let g~(x,y)=g((3x1)(s0+1),Ty)\tilde{g}(x,y)=g((3x-1)(s_{0}+1),Ty), then g~C([0,1]2)\tilde{g}\in C^{\infty}\left([0,1]^{2}\right). By simple computation, we have

g~Wk,([0,1]2)3(s0+1)(βT1)k.\displaystyle\|\tilde{g}\|_{W^{k,\infty}\left([0,1]^{2}\right)}\leq 3(s_{0}+1)(\beta T\vee 1)^{k}.

Applying Lemma 14 to g~/[3(s0+1)]\tilde{g}/[3(s_{0}+1)], for any 𝒩+\mathcal{N}\in\mathbb{N}_{+}, there exists a tanh neural network g~𝒩\tilde{g}^{\mathcal{N}} with only one hidden layer and width 3𝒩+10(βT1)2(𝒩+10(βT1)+22)3\lceil\frac{\mathcal{N}+10(\beta T\vee 1)}{2}\rceil\binom{\mathcal{N}+10(\beta T\vee 1)+2}{2} such that

|g~(x,y)g~𝒩(x,y)|3(s0+1)exp(𝒩),(x,y)[0,1]2.\displaystyle\left|\tilde{g}(x,y)-\tilde{g}^{\mathcal{N}}(x,y)\right|\leq 3(s_{0}+1)\exp(-\mathcal{N}),~{}(x,y)\in[0,1]^{2}.

By coordinate transformation, we get

|g(x,y)g~𝒩(13(s0+1)x+13,1Ty)|3(s0+1)exp(𝒩),(x,y)[(s0+1),2(s0+1)]×[0,T].\displaystyle\left|g(x,y)-\tilde{g}^{\mathcal{N}}(\frac{1}{3(s_{0}+1)}x+\frac{1}{3},\frac{1}{T}y)\right|\leq 3(s_{0}+1)\exp(-\mathcal{N}),~{}(x,y)\in[-(s_{0}+1),2(s_{0}+1)]\times[0,T].

Define g^𝒩(x,y)=g~𝒩(x/[3(s0+1)]+1/3,y/T)\hat{g}^{\mathcal{N}}(x,y)=\tilde{g}^{\mathcal{N}}(x/[3(s_{0}+1)]+1/3,y/T). Then

|g(x,y)g^𝒩(x,y)|3(s0+1)exp(𝒩),(x,y)[(s0+1),2(s0+1)]×[0,T].\displaystyle\left|g(x,y)-\hat{g}^{\mathcal{N}}(x,y)\right|\leq 3(s_{0}+1)\exp(-\mathcal{N}),~{}(x,y)\in[-(s_{0}+1),2(s_{0}+1)]\times[0,T].

From Lemma 14 and Remark 15, the weights of g^𝒩\hat{g}^{\mathcal{N}} are bounded by

O((s0+1)exp(𝒩2+𝒩3Cd𝒩2)(𝒩(𝒩+2))3𝒩(𝒩+2)),\displaystyle O\left((s_{0}+1)\exp(\frac{{\mathcal{N}^{\prime}}^{2}+\mathcal{N}^{\prime}-3Cd\mathcal{N}^{\prime}}{2})(\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2))^{3\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2)}\right), (37)

where 𝒩=𝒩+10(βT1)\mathcal{N}^{\prime}=\mathcal{N}+10(\beta T\vee 1). Taking 𝒩𝒩+log(3(s0+1))\mathcal{N}\leftarrow\mathcal{N}+\lceil\log(3(s_{0}+1))\rceil, we have

|g(x,y)g^𝒩(x,y)|exp(𝒩),(x,y)[(s0+1),2(s0+1)]×[0,T].\displaystyle\left|g(x,y)-\hat{g}^{\mathcal{N}}(x,y)\right|\leq\exp(-\mathcal{N}),~{}(x,y)\in[-(s_{0}+1),2(s_{0}+1)]\times[0,T]. (38)

In particular, \left|g(x,y)-\hat{g}^{\mathcal{N}}(x,y)\right|\leq 1. Since \hat{g}^{\mathcal{N}} is real-valued, after a minor modification (precisely, increasing the width by 1), we may assume that \hat{g}^{\mathcal{N}} has the following structure:

g^𝒩(x,y)=V1σ((WB)(xy)+b0).\displaystyle\hat{g}^{\mathcal{N}}(x,y)=V_{1}\sigma\left(\begin{pmatrix}W&B\end{pmatrix}\begin{pmatrix}x\\ y\end{pmatrix}+b_{0}\right).

Step 2. Construct the approximation of SiS_{i} and S(t)S(t) under the event {Nes0}\{N_{e}\leq s_{0}\}.

Let h_{0}=0 and \overline{S}_{0}=0. For 1\leq i\leq s_{0}, we construct h_{i}^{\mathcal{N}} and \overline{S}_{i}^{\mathcal{N}} recursively by

\left\{\begin{aligned} h_{i}^{\mathcal{N}}&=\sigma\left(\begin{pmatrix}W&B\end{pmatrix}\begin{pmatrix}V_{1}h_{i-1}^{\mathcal{N}}\\ t_{i}-t_{i-1}\end{pmatrix}+b_{0}\right),\\ \overline{S}_{i}^{\mathcal{N}}&=V_{1}h_{i}^{\mathcal{N}}.\end{aligned}\right.

Hence \overline{S}_{i}^{\mathcal{N}}=\hat{g}^{\mathcal{N}}(\overline{S}_{i-1}^{\mathcal{N}},t_{i}-t_{i-1}) for 1\leq i\leq s_{0}, where t_{0}=0.

Similarly, we can define S¯𝒩(t),t(ti1,ti]\overline{S}^{\mathcal{N}}(t),t\in(t_{i-1},t_{i}] by

\left\{\begin{aligned} h^{\mathcal{N}}(t)&=\sigma\left(\begin{pmatrix}W&B\end{pmatrix}\begin{pmatrix}V_{1}h_{i-1}^{\mathcal{N}}\\ t-t_{i-1}\end{pmatrix}+b_{0}\right),\\ \overline{S}^{\mathcal{N}}(t)&=V_{1}h^{\mathcal{N}}(t).\end{aligned}\right.

Hence S¯𝒩(t)=g^𝒩(S¯i1𝒩,tti1),t(ti1,ti]\overline{S}^{\mathcal{N}}(t)=\hat{g}^{\mathcal{N}}(\overline{S}_{i-1}^{\mathcal{N}},t-t_{i-1}),t\in(t_{i-1},t_{i}]. The approximation error can be bounded by

|S(t)-\overline{S}^{\mathcal{N}}(t)| =\left|g(S_{i-1},t-t_{i-1})-\hat{g}^{\mathcal{N}}(\overline{S}_{i-1}^{\mathcal{N}},t-t_{i-1})\right|
\leq\left|g(S_{i-1},t-t_{i-1})-g(\overline{S}_{i-1}^{\mathcal{N}},t-t_{i-1})\right|+\left|g(\overline{S}_{i-1}^{\mathcal{N}},t-t_{i-1})-\hat{g}^{\mathcal{N}}(\overline{S}_{i-1}^{\mathcal{N}},t-t_{i-1})\right|
\leq\left|S_{i-1}-\overline{S}_{i-1}^{\mathcal{N}}\right|+\left\|g-\hat{g}^{\mathcal{N}}\right\|_{\infty}~{}~{}(\text{since }|\partial_{x}g(x,y)|=\exp(-\beta y)\leq 1)
\leq\cdots
\leq i\left\|g-\hat{g}^{\mathcal{N}}\right\|_{\infty},~{}t\in(t_{i-1},t_{i}].

Under the event {Nes0}\{N_{e}\leq s_{0}\}, we have

|S(t)S¯𝒩(t)|(s0+1)gg^𝒩.\displaystyle\left|S(t)-\overline{S}^{\mathcal{N}}(t)\right|\leq(s_{0}+1)\left\|g-\hat{g}^{\mathcal{N}}\right\|_{\infty}. (39)
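
The telescoping argument above can also be mimicked numerically: if the one-step map g is replaced by any surrogate \hat{g} with a uniform error \varepsilon, the error after i recursive applications is at most i\varepsilon, because g is 1-Lipschitz in its first argument. A minimal Python sketch (with a synthetic perturbation standing in for the tanh network \hat{g}^{\mathcal{N}}) is given below.

import numpy as np

# Error accumulation under the recursion S_i = g(S_{i-1}, t_i - t_{i-1}) (illustrative values).
beta, eps = 1.5, 1e-3
gaps = np.array([0.3, 0.6, 0.5, 0.9, 0.4])        # hypothetical inter-event times t_i - t_{i-1}

def g(x, y):
    return x * np.exp(-beta * y) + 1.0

def g_hat(x, y):
    return g(x, y) + eps * np.sin(7 * x + 3 * y)  # surrogate with |g_hat - g| <= eps everywhere

S, S_hat = 0.0, 0.0
for i, dt in enumerate(gaps, start=1):
    S, S_hat = g(S, dt), g_hat(S_hat, dt)
    assert abs(S - S_hat) <= i * eps + 1e-12      # matches the bound |S_i - S_bar_i| <= i * ||g - g_hat||_inf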

Step 3. Construct the approximation of identity.

By Lemma 3.1 of De Ryck et al. (2021), for any h>0, there exists a one-layer tanh neural network \psi_{h} such that

|xψh(x)|(6M)4h2,x[M,M].\displaystyle|x-\psi_{h}(x)|\leq(6M)^{4}h^{2},~{}x\in[-M,M]. (40)

Actually, ψh\psi_{h} can be represented as

\psi_{h}(x)=\frac{1}{\sigma^{{}^{\prime}}(0)h}\left[\sigma\left(\frac{hx}{2}\right)-\sigma\left(-\frac{hx}{2}\right)\right]=\frac{2}{\sigma^{{}^{\prime}}(0)h}\sigma\left(\frac{hx}{2}\right).
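
For intuition on why \psi_{h} approximates the identity (this is simply a Taylor expansion and is consistent with the bound in (40)): since \sigma=\tanh is odd and smooth with \sigma^{\prime}(0)=1,

\psi_{h}(x)=\frac{2}{h}\left(\frac{hx}{2}-\frac{1}{3}\Big{(}\frac{hx}{2}\Big{)}^{3}+O(h^{5}x^{5})\right)=x-\frac{h^{2}x^{3}}{12}+O(h^{4}x^{5}),

so \sup_{|x|\leq M}|x-\psi_{h}(x)| is of order M^{3}h^{2} for small h, which is consistent with the (6M)^{4}h^{2} bound stated in (40).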

Step 4. Construct the approximation of λ(t)\lambda^{\ast}(t) under the event {Nes0}\{N_{e}\leq s_{0}\}.

Since λ0Ws,([0,T],B0)\lambda_{0}\in W^{s,\infty}([0,T],B_{0}), from the proof of Theorem 4 , there exists a two-layer tanh neural network λ¯0N\overline{\lambda}_{0}^{N} with width less than 3s/2+6N3\lceil s/2\rceil+6N such that

|λ¯0N(t)λ0(t)|3𝒞B0Ts2Ns,t[0,T].\displaystyle\left|\overline{\lambda}_{0}^{N}(t)-\lambda_{0}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}},~{}t\in[0,T]. (41)

Moreover, the weights of λ¯0N\overline{\lambda}_{0}^{N} can be bounded by

O([2s5s(s1)!B0Ts]s/2N(1+s2)/2(s(s+2))3s(s+2)).\displaystyle O\left(\left[\frac{\sqrt{2s}5^{s}}{(s-1)!}B_{0}T^{s}\right]^{-s/2}N^{(1+s^{2})/2}(s(s+2))^{3s(s+2)}\right).

Here we assume λ¯0N(t)\overline{\lambda}_{0}^{N}(t) has the following structure:

λ¯0N(t)=V2σ(V1σ(Bt+b0)+b1)+b2.\displaystyle\overline{\lambda}_{0}^{N}(t)=V_{2}^{{}^{\prime}}\sigma\left(V_{1}^{{}^{\prime}}\sigma\left(B^{{}^{\prime}}t+b_{0}^{{}^{\prime}}\right)+b_{1}^{{}^{\prime}}\right)+b_{2}^{{}^{\prime}}.

Since λ(t)=λ0(t)+α(S(t)1)\lambda^{\ast}(t)=\lambda_{0}(t)+\alpha(S(t)-1), we can construct its approximation by

hi(1)\displaystyle h_{i}^{(1)} =\displaystyle= σ((WV1000)hi1+(B00B)(titi1ti)+(b0b0)),1is0,\displaystyle\sigma\left(\begin{pmatrix}WV_{1}&0\\ 0&0\end{pmatrix}h_{i-1}+\begin{pmatrix}B&0\\ 0&B^{{}^{\prime}}\end{pmatrix}\begin{pmatrix}t_{i}-t_{i-1}\\ t_{i}\end{pmatrix}+\begin{pmatrix}b_{0}\\ b_{0}^{{}^{\prime}}\end{pmatrix}\right),~{}1\leq i\leq s_{0},

and

h(1)(t;S)\displaystyle h^{(1)}(t;S) =\displaystyle= σ((WV1000)hi(1)+(B00B)(ttit)+(b0b0)),\displaystyle\sigma\left(\begin{pmatrix}WV_{1}&0\\ 0&0\end{pmatrix}h_{i}^{(1)}+\begin{pmatrix}B&0\\ 0&B^{{}^{\prime}}\end{pmatrix}\begin{pmatrix}t-t_{i}\\ t\end{pmatrix}+\begin{pmatrix}b_{0}\\ b_{0}^{{}^{\prime}}\end{pmatrix}\right),
h(2)(t;S)\displaystyle h^{(2)}(t;S) =\displaystyle= σ((h2V100V1)h(1)(t;S)+(0b1)),\displaystyle\sigma\left(\begin{pmatrix}\frac{h}{2}V_{1}&0\\ 0&V_{1}^{{}^{\prime}}\end{pmatrix}h^{(1)}(t;S)+\begin{pmatrix}0\\ b_{1}^{{}^{\prime}}\end{pmatrix}\right),
λ^(t;S)\displaystyle\hat{\lambda}(t;S) =\displaystyle= f((2ασ(0)hV2)h(2)(t;S)+(b2α))1,t(ti,ti+1].\displaystyle f\left(\begin{pmatrix}\frac{2\alpha}{\sigma^{{}^{\prime}}(0)h}&V_{2}^{{}^{\prime}}\end{pmatrix}h^{(2)}(t;S)+\left(b_{2}^{{}^{\prime}}-\alpha\right)\right)\in\mathbb{R}^{1},~{}t\in(t_{i},t_{i+1}]. (42)

Under the event {Nes0}\{N_{e}\leq s_{0}\}, we have B1λ(t)B0+αs0B_{1}\leq\lambda^{\ast}(t)\leq B_{0}+\alpha s_{0}. Recall that f(x)=min{max{x,lf},uf}f(x)=\min\{\max\{x,l_{f}\},u_{f}\}. Here we can take lf=B1l_{f}=B_{1}, uf=B0+αs0u_{f}=B_{0}+\alpha s_{0}.

Step 5. Estimate the approximation error under the event {Nes0}\{N_{e}\leq s_{0}\}.

We rewrite (42) as λ^(t;S)=f(λ¯(t;S))\hat{\lambda}(t;S)=f(\overline{\lambda}(t;S)). Under the event {Nes0}\{N_{e}\leq s_{0}\} and the construction of ff, we have

λλ^Lλλ¯L.\displaystyle\|\lambda^{\ast}-\hat{\lambda}\|_{L^{\infty}}\leq\|\lambda^{\ast}-\overline{\lambda}\|_{L^{\infty}}~{}.

From the construction of λ¯\overline{\lambda}, we get

λ¯(t)=λ¯0N(t)+αψh(S¯𝒩(t))α,\displaystyle\overline{\lambda}(t)=\overline{\lambda}_{0}^{N}(t)+\alpha\psi_{h}(\overline{S}^{\mathcal{N}}(t))-\alpha~{}, (43)

then

|λ(t)λ¯(t)||λ0(t)λ¯0N(t)|+α|S(t)ψh(S¯𝒩(t))|.\displaystyle\left|\lambda^{\ast}(t)-\overline{\lambda}(t)\right|\leq\left|\lambda_{0}(t)-\overline{\lambda}_{0}^{N}(t)\right|+\alpha\left|S(t)-\psi_{h}(\overline{S}^{\mathcal{N}}(t))\right|~{}. (44)

From (38), (39), and (40), under the event {Nes0}\{N_{e}\leq s_{0}\}, we have

|S(t)ψh(S¯𝒩(t))|\displaystyle\left|S(t)-\psi_{h}(\overline{S}^{\mathcal{N}}(t))\right| |S(t)S¯𝒩(t)|+|S¯𝒩(t)ψh(S¯𝒩(t))|\displaystyle\leq\left|S(t)-\overline{S}^{\mathcal{N}}(t)\right|+\left|\overline{S}^{\mathcal{N}}(t)-\psi_{h}(\overline{S}^{\mathcal{N}}(t))\right|
(s0+1)gg^𝒩+(6M)4h2\displaystyle\leq(s_{0}+1)\left\|g-\hat{g}^{\mathcal{N}}\right\|_{\infty}+(6M)^{4}h^{2}
(s0+1)exp(𝒩)+(12(s0+1))4h2,t[0,T],\displaystyle\leq(s_{0}+1)\exp(-\mathcal{N})+(12(s_{0}+1))^{4}h^{2},~{}t\in[0,T],

where we take M=2(s0+1)M=2(s_{0}+1) to ensure that S¯𝒩(t)\overline{S}^{\mathcal{N}}(t) can be well approximated by ψh(S¯𝒩(t))\psi_{h}(\overline{S}^{\mathcal{N}}(t)). On the other hand, (41) shows that

|λ¯0N(t)λ0(t)|3𝒞B0Ts2Ns,t[0,T].\displaystyle\left|\overline{\lambda}_{0}^{N}(t)-\lambda_{0}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}},~{}t\in[0,T].

To trade off the two error terms in (44), let exp(𝒩)Ns\exp(-\mathcal{N})\asymp N^{-s}, and then we can take 𝒩=slog(N)\mathcal{N}=\lceil s\log(N)\rceil. Moreover, take 𝒩𝒩+log(s0+1)\mathcal{N}\leftarrow\mathcal{N}+\lceil\log(s_{0}+1)\rceil and h=(12(s0+1))2Ns/2h=(12(s_{0}+1))^{-2}N^{-s/2}. Hence, under {Nes0}\{N_{e}\leq s_{0}\} , we have

|λ(t)λ^(t)||λ(t)λ¯(t)|3𝒞B0Ts+42Ns,t[0,T].\displaystyle\left|\lambda^{\ast}(t)-\hat{\lambda}(t)\right|\leq\left|\lambda^{\ast}(t)-\overline{\lambda}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}+4}{2N^{s}},~{}t\in[0,T]. (45)

Step 6. Estimate the final approximation error.

Similar to (35), we have

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle\quad|\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|
𝔼|loss(λ^,Stest)loss(λ,Stest)|\displaystyle\leq\mathbb{E}\left|\text{loss}(\hat{\lambda},S_{test})-\text{loss}(\lambda^{\ast},S_{test})\right|
𝔼(|i=1Ne(logλ^(ti)logλ(ti))|+|0T(λ^(t)λ(t))dt|)\displaystyle\leq\mathbb{E}\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\hat{\lambda}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\hat{\lambda}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)
𝔼[(|i=1Ne(logλ^(ti)logλ(ti))|+|0T(λ^(t)λ(t))dt|)𝟙{Nes0}\displaystyle\leq\mathbb{E}\left[\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\hat{\lambda}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\hat{\lambda}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right.
+(|i=1Ne(logλ^(ti)logλ(ti))|+|0T(λ^(t)λ(t))dt|)𝟙{Ne>s0}]\displaystyle\quad\quad+\left.\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\hat{\lambda}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\hat{\lambda}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
𝔼[(T+NeB1)λ^λL𝟙{Nes0}]+𝔼[(T+NeB1)λ^λL𝟙{Ne>s0}]\displaystyle\leq\mathbb{E}\left[(T+\frac{N_{e}}{B_{1}})\left\|\hat{\lambda}-\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right]+\mathbb{E}\left[(T+\frac{N_{e}}{B_{1}})\left\|\hat{\lambda}-\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
:=𝕀1+𝕀2.\displaystyle:=\mathbb{I}_{1}+\mathbb{I}_{2}~{}. (46)

Since λ0(t)B0\lambda_{0}(t)\leq B_{0} and μ(t)=αexp(βt)\mu(t)=\alpha\exp(-\beta t), taking cμ=α/βc_{\mu}=\alpha/\beta and η=(α+β)/(2α)\eta=(\alpha+\beta)/(2\alpha) in Lemma 2, we have

(Nes)\displaystyle\mathbb{P}\left(N_{e}\geq s\right) 2B0T1cμexp(log(η)2[η(B0T)(1cμη)s])\displaystyle\leq\frac{2\sqrt{B_{0}T}}{1-c_{\mu}}\exp\left(\frac{\log(\eta)}{2}\left[\eta(B_{0}T)-(1-c_{\mu}\eta)s\right]\right)
2βB0Tβαexp(log(α+β2α)2[α+β2α(B0T)βα2βs])\displaystyle\leq\frac{2\beta\sqrt{B_{0}T}}{\beta-\alpha}\exp\left(\frac{\log\left(\frac{\alpha+\beta}{2\alpha}\right)}{2}\left[\frac{\alpha+\beta}{2\alpha}(B_{0}T)-\frac{\beta-\alpha}{2\beta}s\right]\right)
:=aeexp(ces).\displaystyle:=a_{e}\exp\left(-c_{e}s\right).
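The following simulation sketch is illustrative only and not used in the proof; it assumes, for simplicity, a constant baseline λ0 and simulates the exponential Hawkes process by Ogata's thinning to eyeball the exponential tail of N_e that Lemma 2 guarantees.

```python
import numpy as np

# Simulation sketch (assumed constant baseline lambda_0 for simplicity):
# Ogata's thinning for the Hawkes process with kernel alpha * exp(-beta * u),
# used to inspect the empirical tail probabilities P(N_e >= s).
rng = np.random.default_rng(0)
lam0, alpha, beta, T = 1.0, 0.5, 1.5, 5.0          # stationarity: alpha / beta < 1

def intensity(t, events):
    events = np.asarray(events, dtype=float)
    past = events[events < t]
    return lam0 + alpha * np.exp(-beta * (t - past)).sum()

def num_events():
    t, events = 0.0, []
    while True:
        lam_bar = intensity(t, events) + alpha     # upper bound until the next event
        t += rng.exponential(1.0 / lam_bar)
        if t > T:
            return len(events)
        if rng.uniform() <= intensity(t, events) / lam_bar:
            events.append(t)

counts = np.array([num_events() for _ in range(2000)])
for s in [5, 10, 15, 20]:
    print(s, (counts >= s).mean())                 # empirical tail; drops off rapidly in s
```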

By (45),

𝕀1\displaystyle\mathbb{I}_{1} (T+1B1)3𝒞B0Ts+42Ns𝔼[(Ne+1)𝟙{Nes0}]\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\frac{3\mathcal{C}B_{0}T^{s}+4}{2N^{s}}\mathbb{E}\left[(N_{e}+1)\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right]
(T+1B1)3𝒞B0Ts+42Ns𝔼[(Ne+1)]\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\frac{3\mathcal{C}B_{0}T^{s}+4}{2N^{s}}\mathbb{E}\left[(N_{e}+1)\right]
=(T+1B1)3𝒞B0Ts+42Ns(1+s=1(Nes))\displaystyle=\left(T+\frac{1}{B_{1}}\right)\frac{3\mathcal{C}B_{0}T^{s}+4}{2N^{s}}\left(1+\sum_{s=1}^{\infty}\mathbb{P}(N_{e}\geq s)\right)
(T+1B1)(1+aeexp(ce)1exp(ce))3𝒞B0Ts+42Ns\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\left(1+\frac{a_{e}\exp(-c_{e})}{1-\exp(-c_{e})}\right)\frac{3\mathcal{C}B_{0}T^{s}+4}{2N^{s}} (47)

On the other hand, from λ^LB0+αs0\|\hat{\lambda}\|_{L^{\infty}}\leq B_{0}+\alpha s_{0} and λLB0+αNe\|\lambda^{\ast}\|_{L^{\infty}}\leq B_{0}+\alpha N_{e}, we have

𝕀2\displaystyle\mathbb{I}_{2} 𝔼[(T+NeB1)λ^L𝟙{Ne>s0}]+𝔼[(T+NeB1)λL𝟙{Ne>s0}]\displaystyle\leq\mathbb{E}\left[\left(T+\frac{N_{e}}{B_{1}}\right)\|\hat{\lambda}\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]+\mathbb{E}\left[\left(T+\frac{N_{e}}{B_{1}}\right)\|\lambda^{\ast}\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
(T+1B1)(B0+αs0)𝔼[(Ne+1)𝟙{Ne>s0}]+(T+1B1)E[(Ne+1)(B0+αNe)𝟙{Ne>s0}]\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)(B_{0}+\alpha s_{0})\mathbb{E}\left[(N_{e}+1)\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]+\left(T+\frac{1}{B_{1}}\right)E\left[(N_{e}+1)(B_{0}+\alpha N_{e})\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
(T+1B1)(B0+αs0)((s0+1)(Nes0+1)+s=s0+1(Nes))\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)(B_{0}+\alpha s_{0})\left((s_{0}+1)\mathbb{P}(N_{e}\geq s_{0}+1)+\sum_{s=s_{0}+1}^{\infty}\mathbb{P}(N_{e}\geq s)\right)
+(T+1B1)((s0+1)(B0+αs0)(Nes0+1)+s=s0+1(2αs+B0)(Nes))\displaystyle\quad+\left(T+\frac{1}{B_{1}}\right)\left((s_{0}+1)(B_{0}+\alpha s_{0})\mathbb{P}(N_{e}\geq s_{0}+1)+\sum_{s=s_{0}+1}^{\infty}(2\alpha s+B_{0})\mathbb{P}(N_{e}\geq s)\right)
(T+1B1)(B0+αs0)aeexp(ce(s0+1))((s0+1)+11exp(ce))\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)(B_{0}+\alpha s_{0})a_{e}\exp(-c_{e}(s_{0}+1))\left((s_{0}+1)+\frac{1}{1-\exp(-c_{e})}\right)
+(T+1B1)aeexp(ce(s0+1))((s0+1)(B0+αs0)+2α(s0+1)+B0(1exp(ce))2)\displaystyle\quad+\left(T+\frac{1}{B_{1}}\right)a_{e}\exp(-c_{e}(s_{0}+1))\left((s_{0}+1)(B_{0}+\alpha s_{0})+\frac{2\alpha(s_{0}+1)+B_{0}}{(1-\exp(-c_{e}))^{2}}\right)
(T+1B1)aeexp(ce(s0+1))(2(s0+1)(B0+αs0)+3α(s0+1)+2B0(1exp(ce))2).\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)a_{e}\exp(-c_{e}(s_{0}+1))\left(2(s_{0}+1)(B_{0}+\alpha s_{0})+\frac{3\alpha(s_{0}+1)+2B_{0}}{(1-\exp(-c_{e}))^{2}}\right). (48)

Combining (46), (47), and (48), we have

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle\quad|\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|
(T+1B1)aeexp(ce(s0+1))(2(s0+1)(B0+αs0)+3α(s0+1)+2B0(1exp(ce))2)\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)a_{e}\exp(-c_{e}(s_{0}+1))\left(2(s_{0}+1)(B_{0}+\alpha s_{0})+\frac{3\alpha(s_{0}+1)+2B_{0}}{(1-\exp(-c_{e}))^{2}}\right)
+(T+1B1)(1+aeexp(ce)1exp(ce))3𝒞B0Ts+42Ns.\displaystyle\quad+\left(T+\frac{1}{B_{1}}\right)\left(1+\frac{a_{e}\exp(-c_{e})}{1-\exp(-c_{e})}\right)\frac{3\mathcal{C}B_{0}T^{s}+4}{2N^{s}}.

Let s0=slog(N)/ces_{0}=\lceil s\log(N)/c_{e}\rceil, and denote λ^N=λ^\hat{\lambda}^{N}=\hat{\lambda}. We have

|𝔼[loss(λ^N,Stest)]𝔼[loss(λ,Stest)]|(logN)2Ns.\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{(\log N)^{2}}{N^{s}}~{}.

Step 7. Bound the sizes of the network width and weights.

From Steps 1-6, the width of the network is less than

3𝒩~2(𝒩~+22)+3s2+6N+2,\displaystyle 3\left\lceil\frac{\tilde{\mathcal{N}}}{2}\right\rceil\binom{\tilde{\mathcal{N}}+2}{2}+3\left\lceil\frac{s}{2}\right\rceil+6N+2~{},

where 𝒩~=𝒩+10(βT1)+2log(3(s0+1))\tilde{\mathcal{N}}=\mathcal{N}+10(\beta T\vee 1)+2\lceil\log(3(s_{0}+1))\rceil. Since s0=scelog(N)s_{0}=\lceil\frac{s}{c_{e}}\log(N)\rceil and 𝒩=slog(N)\mathcal{N}=\lceil s\log(N)\rceil, we have D=O(N)D=O(N).
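As a numerical aside (with assumed sample values of s, β, T, and c_e that are not taken from the paper), the sketch below evaluates the width bound and shows that the combinatorial term grows only polylogarithmically in N, so the 6N term dominates for large N.

```python
from math import ceil, comb, log

# Numeric sketch with assumed constants; N_tilde is rounded up so that the
# binomial coefficient is well defined.  The ratio width / N decreases toward
# the constant 6 as N grows, illustrating D = O(N).
s, beta, T, c_e = 2, 1.5, 5.0, 0.5
for N in [10**2, 10**4, 10**6, 10**8]:
    s0 = ceil(s * log(N) / c_e)
    Ncal = ceil(s * log(N))
    Ntil = Ncal + ceil(10 * max(beta * T, 1.0)) + 2 * ceil(log(3 * (s0 + 1)))
    width = 3 * ceil(Ntil / 2) * comb(Ntil + 2, 2) + 3 * ceil(s / 2) + 6 * N + 2
    print(N, Ntil, width, width / N)
```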

From the construction of g^𝒩\hat{g}^{\mathcal{N}}, ψh\psi_{h}, and λ¯0N\overline{\lambda}_{0}^{N}, the weights of the network are less than

O\displaystyle O (max{2σ(0)hexp(𝒩~2+𝒩~3Cd𝒩~2)(𝒩~(𝒩~+2))3𝒩~(𝒩~+2),\displaystyle\left(\max\left\{\frac{2}{\sigma^{{}^{\prime}}(0)h}\exp(\frac{{\tilde{\mathcal{N}}}^{2}+\tilde{\mathcal{N}}-3Cd\tilde{\mathcal{N}}}{2})(\tilde{\mathcal{N}}(\tilde{\mathcal{N}}+2))^{3\tilde{\mathcal{N}}(\tilde{\mathcal{N}}+2)},\right.\right.
[2s5s(s1)!B0Ts]s/2N(1+s2)/2(s(s+2))3s(s+2)}),\displaystyle\quad\quad\quad\left.\left.\left[\frac{\sqrt{2s}5^{s}}{(s-1)!}B_{0}T^{s}\right]^{-s/2}N^{(1+s^{2})/2}(s(s+2))^{3s(s+2)}\right\}\right),

where 𝒩~=𝒩+10(βT1)+2log(3(s0+1))\tilde{\mathcal{N}}=\mathcal{N}+10(\beta T\vee 1)+2\lceil\log(3(s_{0}+1))\rceil. Since s0=scelog(N)s_{0}=\lceil\frac{s}{c_{e}}\log(N)\rceil, h=(12(s0+1))2Ns/2h=(12(s_{0}+1))^{-2}N^{-s/2}, 𝒩=slog(N)\mathcal{N}=\lceil s\log(N)\rceil, the weights are less than

𝒞1(log(N))12s2(log(N))2,\displaystyle\mathcal{C}_{1}(\log(N))^{12s^{2}(\log(N))^{2}}~{},

where 𝒞1\mathcal{C}_{1} is a constant related to s,B0,α,βs,B_{0},\alpha,\beta, and TT.

9.3 Proof of Theorem 6

Lemma 8.

Suppose μCk,([0,T],C0)\mu\in C^{k,\infty}([0,T],C_{0}), k2k\geq 2, kk\in\mathbb{N}. The Fourier series of μ\mu is given by

S(t)=μ^02+l=1(μ^lcos(2lπTt)+ν^lsin(2lπTt)),\displaystyle S_{\infty}(t)=\frac{\hat{\mu}_{0}}{2}+\sum_{l=1}^{\infty}\left(\hat{\mu}_{l}\cos\left(\frac{2l\pi}{T}t\right)+\hat{\nu}_{l}\sin\left(\frac{2l\pi}{T}t\right)\right), (49)

where μ^l=20Tμ(t)cos(2lπt/T)dt/T\hat{\mu}_{l}=2\int_{0}^{T}\mu(t)\cos(2l\pi t/T)\mathrm{d}t/T, ν^l=20Tμ(t)sin(2lπt/T)dt\hat{\nu}_{l}=2\int_{0}^{T}\mu(t)\sin(2l\pi t/T)\mathrm{d}t/T, l0l\geq 0. If μ(j)(0+)=μ(j)(T)\mu^{(j)}(0+)=\mu^{(j)}(T-), 0jk10\leq j\leq k-1, then

|μ^l|2C0Tk(2lπ)k,|ν^l|2C0Tk(2lπ)k\displaystyle|\hat{\mu}_{l}|\leq\frac{2C_{0}T^{k}}{(2l\pi)^{k}},|\hat{\nu}_{l}|\leq\frac{2C_{0}T^{k}}{(2l\pi)^{k}}

and S(t)=μ(t)S_{\infty}(t)=\mu(t) on t[0,T]t\in[0,T]. Moreover, denoting the partial sum of S(t)S_{\infty}(t) by SNμ(t)=μ^0/2+l=1Nμ(μ^lcos(2lπt/T)+ν^lsin(2lπt/T))S_{N_{\mu}}(t)=\hat{\mu}_{0}/2+\sum_{l=1}^{N_{\mu}}\left(\hat{\mu}_{l}\cos(2l\pi t/T)+\hat{\nu}_{l}\sin(2l\pi t/T)\right), we have

|μ(t)SNμ(t)|2C0Tk+1(k1)(2π)kNμk1,t[0,T].\displaystyle\left|\mu(t)-S_{N_{\mu}}(t)\right|\leq\frac{2C_{0}T^{k+1}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}},~{}t\in[0,T].
Proof of Lemma 8.

The proof is a standard Fourier analysis exercise and we omit it. ∎
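As an illustrative aside (not part of the proof), the following sketch computes the partial Fourier sums of an assumed toy kernel satisfying the boundary condition and shows the uniform truncation error decaying in N_μ, in line with Lemma 8.

```python
import numpy as np

# Numerical illustration of Lemma 8 with the assumed toy kernel
# mu(t) = exp(sin(2*pi*t/T)); it is T-periodic and smooth, so the boundary
# condition mu^(j)(0+) = mu^(j)(T-) holds for every k.
T = 2.0
t = np.linspace(0.0, T, 4096, endpoint=False)
mu = np.exp(np.sin(2 * np.pi * t / T))

def coeff(l):
    # Fourier coefficients mu_hat_l, nu_hat_l via a Riemann sum over one period
    c = 2.0 * np.mean(mu * np.cos(2 * l * np.pi * t / T))
    s = 2.0 * np.mean(mu * np.sin(2 * l * np.pi * t / T))
    return c, s

for N_mu in [1, 2, 4, 8]:
    partial = 0.5 * coeff(0)[0] * np.ones_like(t)
    for l in range(1, N_mu + 1):
        c, s = coeff(l)
        partial += c * np.cos(2 * l * np.pi * t / T) + s * np.sin(2 * l * np.pi * t / T)
    print(N_mu, np.max(np.abs(mu - partial)))     # sup-norm truncation error drops fast
```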

Theorem 9.

Under model assumption 5 and μ(j)(0+)=μ(j)(T)\mu^{(j)}(0+)=\mu^{(j)}(T-), 0jk10\leq j\leq k-1 , for N5N\geq 5, there exists an RNN structure λ^N,Nμ\hat{\lambda}^{N,N_{\mu}} as stated in section 2.2 with L=2L=2, lf=B1l_{f}=B_{1}, uf=B0+O(logN)u_{f}=B_{0}+O(\log N), and input function x(t;S)=(t,tFS(t))x(t;S)=(t,t-F_{S}(t))^{\top} such that

|𝔼[loss(λ^N,Nμ,Stest)]𝔼[loss(λ,Stest)]|11cμexp(2B0Tcμ2)(Ts+log2NNs+TklogNNμk1).\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N,N_{\mu}},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{1}{1-c_{\mu}}\exp\left(\frac{2B_{0}T}{c_{\mu}^{2}}\right)\left(\frac{T^{s}+\log^{2}N}{N^{s}}+\frac{T^{k}\log N}{N_{\mu}^{k-1}}\right)~{}.

Moreover, the width of λ^N,Nμ\hat{\lambda}^{N,N_{\mu}} satisfies DN+Nμ5log4ND\lesssim N+N_{\mu}^{5}\log^{4}N and the weights of λ^N,Nμ\hat{\lambda}^{N,N_{\mu}} are less than

𝒞1(log(NNμ))12s2(log(NNμ))2,\displaystyle\mathcal{C}_{1}(\log(NN_{\mu}))^{12s^{2}(\log(NN_{\mu}))^{2}}~{},

where 𝒞1\mathcal{C}_{1} is a constant related to s,B0,C0,cμs,B_{0},C_{0},c_{\mu}, and TT.

Proof of Theorem 9.

Similar to the proof of Theorem 5 , the proof is divided into several steps. Denote wl=2lπ/Tw_{l}=2l\pi/T, gl,1(x,t)=x1coswlt+x2sinwltg_{l,1}(x,t)=x_{1}\cos w_{l}t+x_{2}\sin w_{l}t, gl,2(x,t)=x1sinwlt+x2coswlt+1g_{l,2}(x,t)=-x_{1}\sin w_{l}t+x_{2}\cos w_{l}t+1, gl(x,t)=(gl,1(x,t),gl,2(x,t))2g_{l}(x,t)=(g_{l,1}(x,t),g_{l,2}(x,t))^{\top}\in\mathbb{R}^{2}, where x2x\in\mathbb{R}^{2}, l+l\in\mathbb{N}_{+}. For l+l\in\mathbb{N}_{+}, define

Sl(t)=ti<t(sinwl(tti)coswl(tti))+(01),\displaystyle S_{l}(t)=\sum_{t_{i}<t}\begin{pmatrix}\sin w_{l}(t-t_{i})\\ \cos w_{l}(t-t_{i})\end{pmatrix}+\begin{pmatrix}0\\ 1\end{pmatrix},

and

Sl,i=0<j<i(sinwl(titj)coswl(titj))+(01).\displaystyle S_{l,i}=\sum_{0<j<i}\begin{pmatrix}\sin w_{l}(t_{i}-t_{j})\\ \cos w_{l}(t_{i}-t_{j})\end{pmatrix}+\begin{pmatrix}0\\ 1\end{pmatrix}.

Hence we have

Sl,i+1=(coswl(ti+1ti)sinwl(ti+1ti)sinwl(ti+1ti)coswl(ti+1ti))Sl,i+(01)=gl(Sl,i,ti+1ti)\displaystyle S_{l,i+1}=\begin{pmatrix}\cos w_{l}(t_{i+1}-t_{i})&\sin w_{l}(t_{i+1}-t_{i})\\ -\sin w_{l}(t_{i+1}-t_{i})&\cos w_{l}(t_{i+1}-t_{i})\end{pmatrix}S_{l,i}+\begin{pmatrix}0\\ 1\end{pmatrix}=g_{l}(S_{l,i},t_{i+1}-t_{i})

and

Sl(t)=gl(Sl,i,tti),t(ti,ti+1].\displaystyle S_{l}(t)=g_{l}(S_{l,i},t-t_{i}),~{}t\in(t_{i},t_{i+1}].

where we agree on Sl,0=𝟎S_{l,0}=\mathbf{0}. Define S0(t)=#{i:ti<t}S_{0}(t)=\#\{i:t_{i}<t\}. If we assume t1>0t_{1}>0, the true intensity can be rewritten as

λ(t)=λ0(t)+μ^02(S0(t)1)+l=1(ν^l,μ^l)(Sl(t)(01)),t[0,T],\displaystyle\lambda^{\ast}(t)=\lambda_{0}(t)+\frac{\hat{\mu}_{0}}{2}(S_{0}(t)-1)+\sum_{l=1}^{\infty}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(S_{l}(t)-\begin{pmatrix}0\\ 1\end{pmatrix}\right),~{}t\in[0,T], (50)

where aba\cdot b refers to the standard inner product of vectors aa and bb.

We first fix s0+s_{0}\in\mathbb{N}_{+}. Since (t1=0)=0\mathbb{P}(t_{1}=0)=0, we may assume t1>0t_{1}>0 so that (50) holds.
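Before Step 1, the following sketch (illustrative only) verifies numerically that the rotation recursion above indeed propagates S_l from event to event; this is the mechanism that the constant-width RNN constructed below exploits.

```python
import numpy as np

# Sanity check of the rotation recursion: the state
# S_l(t) = sum_{t_i < t} (sin w_l(t - t_i), cos w_l(t - t_i))^T + (0, 1)^T
# is propagated event to event by a 2x2 rotation plus the constant (0, 1)^T,
# which is the map g_l that the tanh RNN approximates.
T, l = 2.0, 3
w = 2 * l * np.pi / T
events = np.array([0.3, 0.8, 1.1, 1.6])

def S_direct(t):
    past = events[events < t]
    return np.array([np.sin(w * (t - past)).sum(),
                     np.cos(w * (t - past)).sum() + 1.0])

def g_l(state, dt):
    R = np.array([[np.cos(w * dt),  np.sin(w * dt)],
                  [-np.sin(w * dt), np.cos(w * dt)]])
    return R @ state + np.array([0.0, 1.0])

state, t_prev = np.zeros(2), 0.0                  # S_{l,0} = 0, t_0 = 0
for t_i in events:
    state = g_l(state, t_i - t_prev)              # S_{l,i} = g_l(S_{l,i-1}, t_i - t_{i-1})
    assert np.allclose(state, S_direct(t_i))
    t_prev = t_i

t = 1.9                                           # a time after the last event
print(g_l(state, t - t_prev), S_direct(t))        # S_l(t) between events also matches
```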

Step 1. Construct the approximation of gl(x,t)=(gl,1(x,t),gl,2(x,t))2g_{l}(x,t)=(g_{l,1}(x,t),g_{l,2}(x,t))^{\top}\in\mathbb{R}^{2}, where gl,1,gl,2C([3(s0+1),3(s0+1)]2×[0,T])g_{l,1},g_{l,2}\in C^{\infty}\left([-3({s_{0}+1}),3({s_{0}+1})]^{2}\times[0,T]\right) . Here x2x\in\mathbb{R}^{2}.

Let g~l,i(x,t)=gl,i(3(s0+1)(2x1),Tt)\tilde{g}_{l,i}(x,t)=g_{l,i}(3(s_{0}+1)(2x-1),Tt), i=1,2i=1,2. Then g~l,iC([0,1]3)\tilde{g}_{l,i}\in C^{\infty}\left([0,1]^{3}\right). By simple computation, we have

g~l,iWk,([0,1]3)6(s0+1)(wlT)k.\displaystyle\|\tilde{g}_{l,i}\|_{W^{k,\infty}\left([0,1]^{3}\right)}\leq 6({s_{0}+1})(w_{l}T)^{k}.

Applying Lemma 14 to g~l,i/[6(s0+1)]\tilde{g}_{l,i}/[6({s_{0}+1})], for any 𝒩+\mathcal{N}\in\mathbb{N}_{+}, there exists a tanh neural network g~l,i𝒩\tilde{g}_{l,i}^{\mathcal{N}} with only one hidden layer and width 3(𝒩+15wlT)/2(𝒩+15wlT+33)3\lceil(\mathcal{N}+15w_{l}T)/2\rceil\binom{\mathcal{N}+15w_{l}T+3}{3} such that

|g~l,i(x,t)g~l,i𝒩(x,t)|6(s0+1)exp(𝒩),(x,t)[0,1]3.\displaystyle\left|\tilde{g}_{l,i}(x,t)-\tilde{g}_{l,i}^{\mathcal{N}}(x,t)\right|\leq 6({s_{0}+1})\exp(-\mathcal{N}),~{}(x,t)\in[0,1]^{3}.

By coordinate transformation, we get

|gl,i(x,t)g~l,i𝒩(x6(s0+1)+12,tT)|6(s0+1)exp(𝒩),(x,t)[3(s0+1),3(s0+1)]2×[0,T].\displaystyle\left|g_{l,i}(x,t)-\tilde{g}_{l,i}^{\mathcal{N}}\left(\frac{x}{6({s_{0}+1})}+\frac{1}{2},\frac{t}{T}\right)\right|\leq 6({s_{0}+1})\exp(-\mathcal{N}),~{}(x,t)\in[-3({s_{0}+1}),3({s_{0}+1})]^{2}\times[0,T].

Define g^l,i𝒩(x,t)=g~l,i𝒩(x/[6(s0+1)]+1/2,t/T)\hat{g}_{l,i}^{\mathcal{N}}(x,t)=\tilde{g}_{l,i}^{\mathcal{N}}(x/[6({s_{0}+1})]+1/2,t/T), then

|gl,i(x,t)g^l,i𝒩(x,t)|6(s0+1)exp(𝒩),(x,t)[3(s0+1),3(s0+1)]2×[0,T].\displaystyle\left|g_{l,i}(x,t)-\hat{g}_{l,i}^{\mathcal{N}}(x,t)\right|\leq 6({s_{0}+1})\exp(-\mathcal{N}),~{}(x,t)\in[-3({s_{0}+1}),3({s_{0}+1})]^{2}\times[0,T].

Taking 𝒩𝒩+log(6(s0+1))\mathcal{N}\leftarrow\mathcal{N}+\lceil\log(6({s_{0}+1}))\rceil, we have

|gl,i(x,t)g^l,i𝒩(x,t)|exp(𝒩),(x,t)[3(s0+1),3(s0+1)]2×[0,T].\displaystyle\left|g_{l,i}(x,t)-\hat{g}_{l,i}^{\mathcal{N}}(x,t)\right|\leq\exp(-\mathcal{N}),~{}(x,t)\in[-3({s_{0}+1}),3({s_{0}+1})]^{2}\times[0,T].

In particular, |gl,i(x,t)g^l,i𝒩(x,t)|1\left|g_{l,i}(x,t)-\hat{g}_{l,i}^{\mathcal{N}}(x,t)\right|\leq 1. The width of this NN is bounded by 3𝒩/2(𝒩+33)3\lceil\mathcal{N}^{\prime}/2\rceil\binom{\mathcal{N}^{\prime}+3}{3}, with 𝒩\mathcal{N}^{\prime} defined below. From Lemma 14 and Remark 15, the weights of g^l,i𝒩\hat{g}_{l,i}^{\mathcal{N}} are bounded by

O((s0+1)exp(𝒩2+𝒩3Cd𝒩2)(𝒩(𝒩+2))3𝒩(𝒩+2)),\displaystyle O\left(({s_{0}+1})\exp(\frac{{\mathcal{N}^{\prime}}^{2}+\mathcal{N}^{\prime}-3Cd\mathcal{N}^{\prime}}{2})(\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2))^{3\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2)}\right),

where 𝒩=𝒩+log(6(s0+1))+15wlT\mathcal{N}^{\prime}=\mathcal{N}+\lceil\log(6({s_{0}+1}))\rceil+15w_{l}T. Since g^l,i𝒩\hat{g}_{l,i}^{\mathcal{N}}\in\mathbb{R}, after a minor modification (precisely, increasing the width by 1), we can assume g^l,i𝒩\hat{g}_{l,i}^{\mathcal{N}} has the following structure:

g^l,i𝒩(x,t)=Vl,iσ((Wl,iBl,i)(xt)+bl,i).\displaystyle\hat{g}_{l,i}^{\mathcal{N}}(x,t)=V_{l,i}\sigma\left(\begin{pmatrix}W_{l,i}&B_{l,i}\end{pmatrix}\begin{pmatrix}x\\ t\end{pmatrix}+b_{l,i}\right).

Denote g^l𝒩(x,t)=(g^l,1𝒩(x,t),g^l,2𝒩(x,t))\hat{g}_{l}^{\mathcal{N}}(x,t)=\left(\hat{g}_{l,1}^{\mathcal{N}}(x,t),\hat{g}_{l,2}^{\mathcal{N}}(x,t)\right)^{\top}.

Step 1 (continued). Construct the approximation of the identity and of g0(x)=x+1g_{0}(x)=x+1, x[(s0+1),2(s0+1)]x\in[-(s_{0}+1),2(s_{0}+1)]. Here xx\in\mathbb{R}.

Similarly to Step 3 in the proof of Theorem 5, taking ψh(x)=2σ(hx/2)/[σ(0)h]\psi_{h}(x)=2\sigma\left(hx/2\right)/[\sigma^{{}^{\prime}}(0)h], we have

|xψh(x)|(6M)4h2,x[M,M].\displaystyle|x-\psi_{h}(x)|\leq(6M)^{4}h^{2},~{}x\in[-M,M].

For g0(x)=x+1g_{0}(x)=x+1, x[(s0+1),2(s0+1)]x\in[-(s_{0}+1),2(s_{0}+1)], we can construct a similar approximation as in the proof of Theorem 5. There exists a tanh neural network g^0𝒩\hat{g}_{0}^{\mathcal{N}} with only one hidden layer and width 3(𝒩′′+5)/23\lceil(\mathcal{N}^{\prime\prime}+5)/2\rceil such that

|g0(x)g^0𝒩(x)|exp(𝒩),x[(s0+1),2(s0+1)],\displaystyle\left|g_{0}(x)-\hat{g}_{0}^{\mathcal{N}}(x)\right|\leq\exp(-\mathcal{N}),~{}x\in[-(s_{0}+1),2(s_{0}+1)],

where 𝒩′′=𝒩+(s0+3)log2\mathcal{N}^{\prime\prime}=\mathcal{N}+\lceil(s_{0}+3)\log 2\rceil. The weight of g^0𝒩\hat{g}_{0}^{\mathcal{N}} is bounded by

O((s0+1)exp(𝒩′′2+𝒩′′3Cd𝒩′′2)[𝒩′′(𝒩′′+2)]3𝒩′′(𝒩′′+2)).\displaystyle O\left(({s_{0}+1})\exp\left(\frac{{\mathcal{N}^{\prime\prime}}^{2}+\mathcal{N}^{\prime\prime}-3Cd\mathcal{N}^{\prime\prime}}{2}\right)\left[\mathcal{N}^{\prime\prime}(\mathcal{N}^{\prime\prime}+2)\right]^{3\mathcal{N}^{\prime\prime}(\mathcal{N}^{\prime\prime}+2)}\right).

Step 2. Construct the approximation of Sl,iS_{l,i} and Sl(t)S_{l}(t) under the event {Nes0}\{N_{e}\leq s_{0}\}.

Let hl,0𝒩=(hl,0,1𝒩,hl,0,2𝒩)=𝟎h_{l,0}^{\mathcal{N}}=(h_{l,0,1}^{\mathcal{N}},h_{l,0,2}^{\mathcal{N}})^{\top}=\mathbf{0} and S¯l,0𝒩=(S¯l,0,1𝒩,S¯l,0,2𝒩)=𝟎\overline{S}_{l,0}^{\mathcal{N}}=(\overline{S}_{l,0,1}^{\mathcal{N}},\overline{S}_{l,0,2}^{\mathcal{N}})^{\top}=\mathbf{0}. For 1is01\leq i\leq s_{0}, we construct hl,i𝒩h_{l,i}^{\mathcal{N}} and S¯l,i𝒩\overline{S}_{l,i}^{\mathcal{N}} recursively by

{hl,i𝒩=σ((Wl,1(Vl,100Vl,2)Wl,2(Vl,100Vl,2))hl,i1𝒩+(Bl,1Bl,2)(titi1)+(bl,1bl,2)),S¯l,i𝒩=(Vl,100Vl,2)hl,i𝒩.\left\{\begin{aligned} h_{l,i}^{\mathcal{N}}&=\sigma\left(\begin{pmatrix}W_{l,1}\begin{pmatrix}V_{l,1}&0\\ 0&V_{l,2}\end{pmatrix}\\ W_{l,2}\begin{pmatrix}V_{l,1}&0\\ 0&V_{l,2}\end{pmatrix}\end{pmatrix}h_{l,i-1}^{\mathcal{N}}+\begin{pmatrix}B_{l,1}\\ B_{l,2}\end{pmatrix}\begin{pmatrix}t_{i}-t_{i-1}\end{pmatrix}+\begin{pmatrix}b_{l,1}\\ b_{l,2}\end{pmatrix}\right),\\ \overline{S}_{l,i}^{\mathcal{N}}&=\begin{pmatrix}V_{l,1}&0\\ 0&V_{l,2}\end{pmatrix}h_{l,i}^{\mathcal{N}}.\end{aligned}\right.

Hence S¯l,i𝒩=g^l𝒩(S¯l,i1𝒩,titi1),1is0\overline{S}_{l,i}^{\mathcal{N}}=\hat{g}_{l}^{\mathcal{N}}(\overline{S}_{l,i-1}^{\mathcal{N}},t_{i}-t_{i-1}),1\leq i\leq s_{0}. Here we agree on t0=0t_{0}=0.

Similarly, we can define S¯l𝒩(t),t(ti1,ti]\overline{S}_{l}^{\mathcal{N}}(t),t\in(t_{i-1},t_{i}] by

{hl𝒩(t)=σ((Wl,1(Vl,100Vl,2)Wl,2(Vl,100Vl,2))hl,i1𝒩+(Bl,1Bl,2)(tti1)+(bl,1bl,2)),S¯l𝒩(t)=(Vl,100Vl,2)hl𝒩(t).\left\{\begin{aligned} h_{l}^{\mathcal{N}}(t)&=\sigma\left(\begin{pmatrix}W_{l,1}\begin{pmatrix}V_{l,1}&0\\ 0&V_{l,2}\end{pmatrix}\\ W_{l,2}\begin{pmatrix}V_{l,1}&0\\ 0&V_{l,2}\end{pmatrix}\end{pmatrix}h_{l,i-1}^{\mathcal{N}}+\begin{pmatrix}B_{l,1}\\ B_{l,2}\end{pmatrix}\begin{pmatrix}t-t_{i-1}\end{pmatrix}+\begin{pmatrix}b_{l,1}\\ b_{l,2}\end{pmatrix}\right),\\ \overline{S}_{l}^{\mathcal{N}}(t)&=\begin{pmatrix}V_{l,1}&0\\ 0&V_{l,2}\end{pmatrix}h_{l}^{\mathcal{N}}(t).\end{aligned}\right.

Hence S¯l𝒩(t)=g^l𝒩(S¯l,i1𝒩,tti1),t(ti1,ti]\overline{S}_{l}^{\mathcal{N}}(t)=\hat{g}_{l}^{\mathcal{N}}(\overline{S}_{l,i-1}^{\mathcal{N}},t-t_{i-1}),t\in(t_{i-1},t_{i}]. The approximation error can be bounded by

Sl(t)S¯l𝒩(t)2\displaystyle\quad\left\|S_{l}(t)-\overline{S}_{l}^{\mathcal{N}}(t)\right\|_{2}
=gl(Sl,i1,tti1)g^l𝒩(S¯l,i1𝒩,tti1)2\displaystyle=\left\|g_{l}(S_{l,i-1},t-t_{i-1})-\hat{g}_{l}^{\mathcal{N}}(\overline{S}_{l,i-1}^{\mathcal{N}},t-t_{i-1})\right\|_{2}
gl(Sl,i1,tti1)gl(S¯l,i1𝒩,tti1)2+gl(S¯l,i1𝒩,tti1)g^l𝒩(S¯l,i1𝒩,tti1)2\displaystyle\leq\left\|g_{l}(S_{l,i-1},t-t_{i-1})-g_{l}(\overline{S}_{l,i-1}^{\mathcal{N}},t-t_{i-1})\right\|_{2}+\left\|g_{l}(\overline{S}_{l,i-1}^{\mathcal{N}},t-t_{i-1})-\hat{g}_{l}^{\mathcal{N}}(\overline{S}_{l,i-1}^{\mathcal{N}},t-t_{i-1})\right\|_{2}
Sl,i1S¯l,i1𝒩2+2max{gl,1g^l,1𝒩gl,2g^l,2𝒩}\displaystyle\leq\left\|S_{l,i-1}-\overline{S}_{l,i-1}^{\mathcal{N}}\right\|_{2}+\sqrt{2}\max\left\{\left\|g_{l,1}-\hat{g}_{l,1}^{\mathcal{N}}\right\|_{\infty}\vee\left\|g_{l,2}-\hat{g}_{l,2}^{\mathcal{N}}\right\|_{\infty}\right\}
Sl,i1S¯l,i1𝒩2+2exp(𝒩)\displaystyle\leq\left\|S_{l,i-1}-\overline{S}_{l,i-1}^{\mathcal{N}}\right\|_{2}+\sqrt{2}\exp(-\mathcal{N}) (51)
\displaystyle\leq\cdots
2iexp(𝒩),t(ti1,ti].\displaystyle\leq\sqrt{2}i\exp(-\mathcal{N}),~{}t\in(t_{i-1},t_{i}].

Under the event {Nes0}\{N_{e}\leq s_{0}\}, we have

Sl(t)S¯l𝒩(t)22(s0+1)exp(𝒩).\displaystyle\left\|S_{l}(t)-\overline{S}_{l}^{\mathcal{N}}(t)\right\|_{2}\leq\sqrt{2}(s_{0}+1)\exp(-\mathcal{N}).

Moreover, S¯l,i𝒩2Sl,iS¯l,i𝒩2+Sl,i22(s0+1)+(s0+1)3(s0+1)\left\|\overline{S}_{l,i}^{\mathcal{N}}\right\|_{2}\leq\left\|S_{l,i}-\overline{S}_{l,i}^{\mathcal{N}}\right\|_{2}+\left\|S_{l,i}\right\|_{2}\leq\sqrt{2}(s_{0}+1)+(s_{0}+1)\leq 3(s_{0}+1), is0i\leq s_{0}, then S¯l,i𝒩[3(s0+1),3(s0+1)]2\overline{S}_{l,i}^{\mathcal{N}}\in[-3(s_{0}+1),3(s_{0}+1)]^{2} and (51) can be verified by induction under the event {Nes0}\{N_{e}\leq s_{0}\}.

For the approximation of S0(t)S_{0}(t), we can similarly construct a simple RNN such that S¯0,i𝒩=g^0𝒩(S¯0,i1𝒩)\overline{S}_{0,i}^{\mathcal{N}}=\hat{g}_{0}^{\mathcal{N}}(\overline{S}_{0,i-1}^{\mathcal{N}}) and |S0(t)S¯0𝒩(t)|(s0+1)exp(𝒩)\left|S_{0}(t)-\overline{S}_{0}^{\mathcal{N}}(t)\right|\leq(s_{0}+1)\exp(-\mathcal{N}).

Step 3. Construct the approximation of λ(t)\lambda^{\ast}(t) under the event {Nes0}\{N_{e}\leq s_{0}\}.

Since λ0Ws,([0,T],B0)\lambda_{0}\in W^{s,\infty}([0,T],B_{0}), from the proof of Theorem 4, there exists a two-layer tanh neural network λ¯0N\overline{\lambda}_{0}^{N} with width less than 3s/2+6N3\lceil s/2\rceil+6N such that

|λ¯0N(t)λ0(t)|3𝒞B0Ts2Ns,0tT.\displaystyle\left|\overline{\lambda}_{0}^{N}(t)-\lambda_{0}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}},~{}\forall 0\leq t\leq T. (52)

Moreover, the weights of λ¯0N\overline{\lambda}_{0}^{N} can be bounded by

O([2s5s(s1)!B0Ts]s2N(1+s2)/2(s(s+2))3s(s+2)).\displaystyle O\left(\left[\frac{\sqrt{2s}5^{s}}{(s-1)!}B_{0}T^{s}\right]^{-\frac{s}{2}}N^{(1+s^{2})/2}(s(s+2))^{3s(s+2)}\right)~{}.

Here we assume λ¯0N(t)\overline{\lambda}_{0}^{N}(t) have the following structure

λ¯0N(t)=V2σ(V1σ(Bt+b0)+b1)+b2.\displaystyle\overline{\lambda}_{0}^{N}(t)=V_{2}^{{}^{\prime}}\sigma\left(V_{1}^{{}^{\prime}}\sigma\left(B^{{}^{\prime}}t+b_{0}^{{}^{\prime}}\right)+b_{1}^{{}^{\prime}}\right)+b_{2}^{{}^{\prime}}~{}.

Since λ(t)=λ0(t)+μ^0(S0(t)1)/2+l=1(ν^l,μ^l)(Sl(t)(01))\lambda^{\ast}(t)=\lambda_{0}(t)+\hat{\mu}_{0}(S_{0}(t)-1)/2+\sum_{l=1}^{\infty}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(S_{l}(t)-\begin{pmatrix}0\\ 1\end{pmatrix}\right), we can construct its (finite sum) approximation by

λ¯(t)=λ¯0N(t)+μ^02(ψh(S0¯𝒩(t))1)+l=1Nμ(ν^l,μ^l)(ψh(Sl¯𝒩(t))(01)).\displaystyle\overline{\lambda}(t)=\overline{\lambda}_{0}^{N}(t)+\frac{\hat{\mu}_{0}}{2}(\psi_{h}(\overline{S_{0}}^{\mathcal{N}}(t))-1)+\sum_{l=1}^{N_{\mu}}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(\psi_{h}(\overline{S_{l}}^{\mathcal{N}}(t))-\begin{pmatrix}0\\ 1\end{pmatrix}\right)~{}.

It can be viewed as (Nμ+2)(N_{\mu}+2) copies of the RNNs defined above running in parallel; the sketch below illustrates this assembly.
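The following sketch is illustrative only; it uses an assumed band-limited toy kernel (so that the truncation at N_μ = 2 is exact) to show how the excitation sum is reassembled from the per-frequency states S_l(t) and the event counter S_0(t).

```python
import numpy as np

# Toy kernel with mu_hat_0 = 1.0, mu_hat_1 = 0.3, nu_hat_2 = 0.2 and all other
# Fourier coefficients zero, so the finite sum with N_mu = 2 recovers it exactly.
T = 2.0
mu_hat = {0: 1.0, 1: 0.3}
nu_hat = {2: 0.2}

def mu(u):
    return (mu_hat[0] / 2 + mu_hat[1] * np.cos(2 * np.pi * u / T)
            + nu_hat[2] * np.sin(4 * np.pi * u / T))

events = np.array([0.3, 0.8, 1.1, 1.6])
t = 1.9
past = events[events < t]

def S_l(l):                                       # S_l(t) as defined in the proof
    w = 2 * l * np.pi / T
    return np.array([np.sin(w * (t - past)).sum(),
                     np.cos(w * (t - past)).sum() + 1.0])

S0 = len(past) + 1                                # counts the dummy event t_0 = 0, as in (50)
recon = (mu_hat[0] / 2 * (S0 - 1)
         + mu_hat[1] * (S_l(1)[1] - 1.0)          # mu_hat_l pairs with the cosine component
         + nu_hat[2] * S_l(2)[0])                 # nu_hat_l pairs with the sine component
print(recon, mu(t - past).sum())                  # the two values coincide
```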

Under the event {Nes0}\{N_{e}\leq s_{0}\}, we have B1λ(t)B0+C0s0B_{1}\leq\lambda^{\ast}(t)\leq B_{0}+C_{0}s_{0}. Recall that f(x)=min{max{x,lf},uf}f(x)=\min\{\max\{x,l_{f}\},u_{f}\}. Here we can take lf=B1l_{f}=B_{1}, uf=B0+C0s0u_{f}=B_{0}+C_{0}s_{0}. The final output is λ^(t;S)=f(λ¯(t;S))\hat{\lambda}(t;S)=f(\overline{\lambda}(t;S)).

Step 4. Compute the approximation error under the event {Nes0}\{N_{e}\leq s_{0}\}.

Under the event {Nes0}\{N_{e}\leq s_{0}\} and the construction of ff, we have

λλ^Lλλ¯L.\displaystyle\|\lambda^{\ast}-\hat{\lambda}\|_{L^{\infty}}\leq\|\lambda^{\ast}-\overline{\lambda}\|_{L^{\infty}}~{}. (53)

By the construction of λ¯\overline{\lambda},

|λ(t)λ¯(t)|\displaystyle\left|\lambda^{\ast}(t)-\overline{\lambda}(t)\right|\leq |λ0(t)λ¯0N(t)|+μ^02|S0(t)ψh(S¯0𝒩(t))|+|l=1Nμ(ν^l,μ^l)(Sl(t)ψh(Sl¯𝒩(t)))|\displaystyle\left|\lambda_{0}(t)-\overline{\lambda}_{0}^{N}(t)\right|+\frac{\hat{\mu}_{0}}{2}\left|S_{0}(t)-\psi_{h}(\overline{S}_{0}^{\mathcal{N}}(t))\right|+\left|\sum_{l=1}^{N_{\mu}}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(S_{l}(t)-\psi_{h}(\overline{S_{l}}^{\mathcal{N}}(t))\right)\right|
+|l>Nμ(ν^l,μ^l)(Sl(t)(01))|.\displaystyle+\left|\sum_{l>N_{\mu}}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(S_{l}(t)-\begin{pmatrix}0\\ 1\end{pmatrix}\right)\right|~{}. (54)

Under the event {Nes0}\{N_{e}\leq s_{0}\}, for the second term, we have

|S0(t)ψh(S¯0𝒩(t))|\displaystyle\left|S_{0}(t)-\psi_{h}(\overline{S}_{0}^{\mathcal{N}}(t))\right| |S0(t)S¯0𝒩(t)|+|S¯0𝒩(t)ψh(S¯0𝒩(t))|\displaystyle\leq\left|S_{0}(t)-\overline{S}_{0}^{\mathcal{N}}(t)\right|+\left|\overline{S}_{0}^{\mathcal{N}}(t)-\psi_{h}(\overline{S}_{0}^{\mathcal{N}}(t))\right|
(s0+1)g0g^0𝒩+(6M)4h2\displaystyle\leq(s_{0}+1)\left\|g_{0}-\hat{g}_{0}^{\mathcal{N}}\right\|_{\infty}+(6M)^{4}h^{2}
(s0+1)exp(𝒩)+(18(s0+1))4h2,0tT.\displaystyle\leq(s_{0}+1)\exp(-\mathcal{N})+(18(s_{0}+1))^{4}h^{2},~{}0\leq t\leq T. (55)

For the third term, similarly,

|l=1Nμ(ν^l,μ^l)(Sl(t)ψh(Sl¯𝒩(t)))|\displaystyle\left|\sum_{l=1}^{N_{\mu}}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(S_{l}(t)-\psi_{h}(\overline{S_{l}}^{\mathcal{N}}(t))\right)\right| l=1Nμ(ν^l,μ^l)2Sl(t)ψh(Sl¯𝒩(t))2\displaystyle\leq\sum_{l=1}^{N_{\mu}}\left\|(\hat{\nu}_{l},\hat{\mu}_{l})^{\top}\right\|_{2}\left\|S_{l}(t)-\psi_{h}(\overline{S_{l}}^{\mathcal{N}}(t))\right\|_{2}
l=1Nμ2C0(Sl(t)Sl¯𝒩(t)2+Sl¯𝒩(t)ψh(Sl¯𝒩(t))2)\displaystyle\leq\sum_{l=1}^{N_{\mu}}\sqrt{2}C_{0}\left(\left\|S_{l}(t)-\overline{S_{l}}^{\mathcal{N}}(t)\right\|_{2}+\left\|\overline{S_{l}}^{\mathcal{N}}(t)-\psi_{h}(\overline{S_{l}}^{\mathcal{N}}(t))\right\|_{2}\right)
2C0Nμ((s0+1)exp(𝒩)+(18(s0+1))4h2),0tT,\displaystyle\leq 2C_{0}N_{\mu}\left((s_{0}+1)\exp(-\mathcal{N})+(18(s_{0}+1))^{4}h^{2}\right),~{}0\leq t\leq T, (56)

where we take M=3(s0+1)M=3(s_{0}+1) in (55) and (56) to ensure that S¯0𝒩(t)\overline{S}_{0}^{\mathcal{N}}(t) and S¯l𝒩(t)\overline{S}_{l}^{\mathcal{N}}(t) can be well approximated by ψh(S¯0𝒩(t))\psi_{h}(\overline{S}_{0}^{\mathcal{N}}(t)) and ψh(S¯l𝒩(t))\psi_{h}(\overline{S}_{l}^{\mathcal{N}}(t)).

For the fourth term, using Lemma 8,

|l>Nμ(ν^l,μ^l)(Sl(t)(01))|\displaystyle\left|\sum_{l>N_{\mu}}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(S_{l}(t)-\begin{pmatrix}0\\ 1\end{pmatrix}\right)\right| =ti<t|μ(tti)SNμ(tti)|\displaystyle=\sum_{t_{i}<t}\left|\mu(t-t_{i})-S_{N_{\mu}}(t-t_{i})\right|
s04C0Tk(k1)(2π)kNμk1.\displaystyle\leq s_{0}\frac{4C_{0}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}}.

Here SNμ(t)S_{N_{\mu}}(t) is the partial sum of the Fourier series defined in Lemma 8.

Finally, by (52), we have |λ¯0N(t)λ0(t)|3𝒞B0Ts/(2Ns),0tT|\overline{\lambda}_{0}^{N}(t)-\lambda_{0}(t)|\leq 3\mathcal{C}B_{0}T^{s}/(2N^{s}),~{}0\leq t\leq T. To trade off the error terms in (54), take 𝒩=log((s0+1)NsNμ)\mathcal{N}=\lceil\log((s_{0}+1)N^{s}N_{\mu})\rceil and h=(18(s0+1))2Ns/2Nμ1/2h=(18(s_{0}+1))^{-2}N^{-s/2}N_{\mu}^{-1/2}. Then under the event {Nes0}\{N_{e}\leq s_{0}\}, we have

|λ(t)λ¯(t)|\displaystyle\left|\lambda^{\ast}(t)-\overline{\lambda}(t)\right| 2C0Nμ+1NsNμ+s04C0Tk(k1)(2π)kNμk1+3𝒞B0Ts2Ns\displaystyle\leq\frac{2C_{0}N_{\mu}+1}{N^{s}N_{\mu}}+s_{0}\frac{4C_{0}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}}+\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}}
3𝒞B0Ts+4C0+22Ns+s04C0Tk(k1)(2π)kNμk1,t[0,T].\displaystyle\leq\frac{3\mathcal{C}B_{0}T^{s}+4C_{0}+2}{2N^{s}}+s_{0}\frac{4C_{0}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}},~{}t\in[0,T]. (57)

Step 5. Compute the final approximation error.

By (46),

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle\quad|\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|
𝔼[(T+NeB1)λ^λL𝟙{Nes0}]+𝔼[(T+NeB1)λ^λL𝟙{Ne>s0}]\displaystyle\leq\mathbb{E}\left[(T+\frac{N_{e}}{B_{1}})\left\|\hat{\lambda}-\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right]+\mathbb{E}\left[(T+\frac{N_{e}}{B_{1}})\left\|\hat{\lambda}-\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
:=𝕀1+𝕀2.\displaystyle:=\mathbb{I}_{1}+\mathbb{I}_{2}~{}. (58)

Taking η=(cμ+1)/(2cμ)\eta=(c_{\mu}+1)/(2c_{\mu}) in Lemma 2 , we have

(Nes)\displaystyle\mathbb{P}\left(N_{e}\geq s\right) 2B0T1cμexp(log(η)2[η(B0T)(1cμη)s])\displaystyle\leq\frac{2\sqrt{B_{0}T}}{1-c_{\mu}}\exp\left(\frac{\log(\eta)}{2}\left[\eta(B_{0}T)-(1-c_{\mu}\eta)s\right]\right)
2B0T1cμexp(log(cμ+12cμ)2[cμ+12cμ(B0T)1cμ2s])\displaystyle\leq\frac{2\sqrt{B_{0}T}}{1-c_{\mu}}\exp\left(\frac{\log\left(\frac{c_{\mu}+1}{2c_{\mu}}\right)}{2}\left[\frac{c_{\mu}+1}{2c_{\mu}}(B_{0}T)-\frac{1-c_{\mu}}{2}s\right]\right)
:=aeexp(ces).\displaystyle:=a_{e}\exp\left(-c_{e}s\right)~{}.

By (53) and (57),

𝕀1\displaystyle\mathbb{I}_{1} (T+1B1)λλ^L𝔼[(Ne+1)𝟙{Nes0}]\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\|\lambda^{\ast}-\hat{\lambda}\|_{L^{\infty}}\mathbb{E}\left[(N_{e}+1)\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right]
(T+1B1)λλ¯L𝔼[(Ne+1)]\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\|\lambda^{\ast}-\overline{\lambda}\|_{L^{\infty}}\mathbb{E}\left[(N_{e}+1)\right]
=(T+1B1)λλ¯L(1+s=1(Nes))\displaystyle=\left(T+\frac{1}{B_{1}}\right)\|\lambda^{\ast}-\overline{\lambda}\|_{L^{\infty}}\left(1+\sum_{s=1}^{\infty}\mathbb{P}(N_{e}\geq s)\right)
(T+1B1)(1+aeexp(ce)1exp(ce))(3𝒞B0Ts+4C0+22Ns+s04C0Tk(k1)(2π)kNμk1).\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\left(1+\frac{a_{e}\exp(-c_{e})}{1-\exp(-c_{e})}\right)\left(\frac{3\mathcal{C}B_{0}T^{s}+4C_{0}+2}{2N^{s}}+s_{0}\frac{4C_{0}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}}\right). (59)

Since λ^LB0+C0s0\|\hat{\lambda}\|_{L^{\infty}}\leq B_{0}+C_{0}s_{0} and λLB0+C0Ne\|\lambda^{\ast}\|_{L^{\infty}}\leq B_{0}+C_{0}N_{e}, similar to (48), we have

𝕀2\displaystyle\mathbb{I}_{2} 𝔼[(T+NeB1)λ^L𝟙{Ne>s0}]+𝔼[(T+NeB1)λL𝟙{Ne>s0}]\displaystyle\leq\mathbb{E}\left[\left(T+\frac{N_{e}}{B_{1}}\right)\|\hat{\lambda}\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]+\mathbb{E}\left[\left(T+\frac{N_{e}}{B_{1}}\right)\left\|\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
(T+1B1)(B0+C0s0)𝔼[(Ne+1)𝟙{Ne>s0}]+(T+1B1)E[(Ne+1)(B0+C0Ne)𝟙{Ne>s0}]\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)(B_{0}+C_{0}s_{0})\mathbb{E}\left[(N_{e}+1)\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]+\left(T+\frac{1}{B_{1}}\right)E\left[(N_{e}+1)(B_{0}+C_{0}N_{e})\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
(T+1B1)aeexp(ce(s0+1))(2(s0+1)(B0+C0s0)+3C0(s0+1)+2B0(1exp(ce))2).\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)a_{e}\exp(-c_{e}(s_{0}+1))\left(2(s_{0}+1)(B_{0}+C_{0}s_{0})+\frac{3C_{0}(s_{0}+1)+2B_{0}}{(1-\exp(-c_{e}))^{2}}\right). (60)

Combining (58), (59), and (60), we have

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle\quad|\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|
(T+1B1)aeexp(ce(s0+1))(2(s0+1)(B0+C0s0)+3C0(s0+1)+2B0(1exp(ce))2)\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)a_{e}\exp(-c_{e}(s_{0}+1))\left(2(s_{0}+1)(B_{0}+C_{0}s_{0})+\frac{3C_{0}(s_{0}+1)+2B_{0}}{(1-\exp(-c_{e}))^{2}}\right)
+(T+1B1)(1+aeexp(ce)1exp(ce))(3𝒞B0Ts+4C0+22Ns+s04C0Tk(k1)(2π)kNμk1).\displaystyle\quad+\left(T+\frac{1}{B_{1}}\right)\left(1+\frac{a_{e}\exp(-c_{e})}{1-\exp(-c_{e})}\right)\left(\frac{3\mathcal{C}B_{0}T^{s}+4C_{0}+2}{2N^{s}}+s_{0}\frac{4C_{0}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}}\right).

Let s0=slog(N)/ces_{0}=\lceil s\log(N)/c_{e}\rceil and denote λ^N,Nμ=λ^\hat{\lambda}^{N,N_{\mu}}=\hat{\lambda}. We have

|𝔼[loss(λ^N,Nμ,Stest)]𝔼[loss(λ,Stest)]|11cμexp(2B0Tcμ2)(Ts+log2NNs+TklogNNμk1).\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N,N_{\mu}},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{1}{1-c_{\mu}}\exp\left(\frac{2B_{0}T}{c_{\mu}^{2}}\right)\left(\frac{T^{s}+\log^{2}N}{N^{s}}+\frac{T^{k}\log N}{N_{\mu}^{k-1}}\right).

Step 6. Bound the sizes of the network width and weights.

From Steps 1-5, the width of the network is less than

(3𝒩2(𝒩+33))2Nμ+3s2+6N+3𝒩′′+52\displaystyle\left(3\lceil\frac{\mathcal{N}^{\prime}}{2}\rceil\binom{\mathcal{N}^{\prime}+3}{3}\right)2N_{\mu}+3\Big{\lceil}\frac{s}{2}\Big{\rceil}+6N+3\lceil\frac{\mathcal{N}^{\prime\prime}+5}{2}\rceil

where 𝒩=𝒩+log(6(s0+1))+15wNμT\mathcal{N}^{\prime}=\mathcal{N}+\lceil\log(6({s_{0}+1}))\rceil+15w_{N_{\mu}}T, 𝒩′′=𝒩+(s0+3)log2\mathcal{N}^{\prime\prime}=\mathcal{N}+\lceil(s_{0}+3)\log 2\rceil, 𝒩=log((s0+1)NsNμ)\mathcal{N}=\lceil\log((s_{0}+1)N^{s}N_{\mu})\rceil, s0=slog(N)/ces_{0}=\lceil s\log(N)/c_{e}\rceil. Hence

DN+Nμ5log4N.\displaystyle D\lesssim N+N_{\mu}^{5}\log^{4}N~{}.

From the construction of g^l,i𝒩\hat{g}_{l,i}^{\mathcal{N}}, g^0𝒩\hat{g}_{0}^{\mathcal{N}}, ψh\psi_{h}, and λ¯0N\overline{\lambda}_{0}^{N}, the weights of the network are less than

𝒞1max\displaystyle\mathcal{C}_{1}^{\prime}\max {((s0+1)exp(𝒩2+𝒩3Cd𝒩2)(𝒩(𝒩+2))3𝒩(𝒩+2)),\displaystyle\left\{\left(({s_{0}+1})\exp(\frac{{\mathcal{N}^{\prime}}^{2}+\mathcal{N}^{\prime}-3Cd\mathcal{N}^{\prime}}{2})(\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2))^{3\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2)}\right),\right.
((s0+1)exp(𝒩′′2+𝒩′′3Cd𝒩′′2)(𝒩′′(𝒩′′+2))3𝒩′′(𝒩′′+2)),2σ(0)h,\displaystyle\quad\left(({s_{0}+1})\exp(\frac{{\mathcal{N}^{\prime\prime}}^{2}+\mathcal{N}^{\prime\prime}-3Cd\mathcal{N}^{\prime\prime}}{2})(\mathcal{N}^{\prime\prime}(\mathcal{N}^{\prime\prime}+2))^{3\mathcal{N}^{\prime\prime}(\mathcal{N}^{\prime\prime}+2)}\right),\frac{2}{\sigma^{{}^{\prime}}(0)h},
([2s5s(s1)!B0Ts]s/2N(1+s2)/2(s(s+2))3s(s+2))},\displaystyle\quad\left.\left(\left[\frac{\sqrt{2s}5^{s}}{(s-1)!}B_{0}T^{s}\right]^{-s/2}N^{(1+s^{2})/2}(s(s+2))^{3s(s+2)}\right)\right\},

where h=(18(s0+1))2Ns/2Nμ1/2h=(18(s_{0}+1))^{-2}N^{-s/2}N_{\mu}^{-1/2}. Hence the weights of the network are less than

𝒞1(log(NNμ))12s2(log(NNμ))2,\displaystyle\mathcal{C}_{1}(\log(NN_{\mu}))^{12s^{2}(\log(NN_{\mu}))^{2}}~{},

where 𝒞1\mathcal{C}_{1} is a constant related to s,B0,C0,cμs,B_{0},C_{0},c_{\mu}, and TT. ∎

Lemma 9.

Let δj=jk\delta_{j}=\frac{j}{k}, 1jk1\leq j\leq k, δ=(δ1,δ2,,δk)\delta=(\delta_{1},\delta_{2},\cdots,\delta_{k})^{\top} and

Vδ=(111δ1δ2δkδ12δ22δk2δ1k1δ2k1δkk1)\displaystyle V_{\delta}=\begin{pmatrix}1&1&\cdots&1\\ \delta_{1}&\delta_{2}&\cdots&\delta_{k}\\ \delta_{1}^{2}&\delta_{2}^{2}&\cdots&\delta_{k}^{2}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{1}^{k-1}&\delta_{2}^{k-1}&\cdots&\delta_{k}^{k-1}\end{pmatrix}

then VδV_{\delta} is invertible and Vδ1C8k\|V_{\delta}^{-1}\|_{\infty}\leq C\cdot 8^{k}, where CC is a universal constant.

Proof of Lemma 9.

See Gautschi (1990). ∎
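As a numerical aside (not a proof), the sketch below evaluates the infinity-norm of the inverse Vandermonde matrix for the equispaced nodes δ_j = j/k and checks that its ratio to 8^k stays bounded.

```python
import numpy as np

# Numerical illustration of Lemma 9: ||V_delta^{-1}||_inf grows at most like
# a constant times 8^k for the nodes delta_j = j / k.
for k in range(2, 11):
    delta = np.arange(1, k + 1) / k
    V = np.vander(delta, k, increasing=True).T     # V[i, j] = delta_j ** i
    inv_norm = np.linalg.norm(np.linalg.inv(V), ord=np.inf)
    print(k, inv_norm, inv_norm / 8.0**k)          # the ratio stays bounded
```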

Lemma 10.

For μCk,([0,T],C0)\mu\in C^{k,\infty}([0,T],C_{0}) and δ=(δ1,δ2,,δk)\delta=(\delta_{1},\delta_{2},\cdots,\delta_{k})^{\top} defined in Lemma 9, there exists α=(α1,α2,,αk)\alpha=(\alpha_{1},\alpha_{2},\cdots,\alpha_{k})^{\top} such that

μ~(t):=μ(t)+j=1kαjexp(δjt)\displaystyle\tilde{\mu}(t):=\mu(t)+\sum_{j=1}^{k}\alpha_{j}\exp(-\delta_{j}t) (61)

satisfying α2C0C8k/(1exp(T))\|\alpha\|_{\infty}\leq 2C_{0}C8^{k}/(1-\exp(-T)), μ~Ck,([0,T],C0+kα)\tilde{\mu}\in C^{k,\infty}([0,T],C_{0}+k\|\alpha\|_{\infty}), and μ~(j)(0+)=μ~(j)(T)\tilde{\mu}^{(j)}(0+)=\tilde{\mu}^{(j)}(T-), 0jk10\leq j\leq k-1, where the constant CC is defined in Lemma 9.

Proof of Lemma 10.

We only need to solve the following equations:

μ~(j)(0+)=μ~(j)(T),0jk1.\displaystyle\tilde{\mu}^{(j)}(0+)=\tilde{\mu}^{(j)}(T-),~{}0\leq j\leq k-1. (62)

In matrix form,

(1eδ1T1eδ2T1eδkT(δ1)(1eδ1T)(δ2)(1eδ2T)(δk)(1eδkT)(δ1)k1(1eδ1T)(δ2)k1(1eδ2T)(δk)k1(1eδkT))(α1α2αk)\displaystyle~{}~{}~{}~{}\begin{pmatrix}1-e^{-\delta_{1}T}&1-e^{-\delta_{2}T}&\cdots&1-e^{-\delta_{k}T}\\ (-\delta_{1})(1-e^{-\delta_{1}T})&(-\delta_{2})(1-e^{-\delta_{2}T})&\cdots&(-\delta_{k})(1-e^{-\delta_{k}T})\\ \vdots&\vdots&\ddots&\vdots\\ (-\delta_{1})^{k-1}(1-e^{-\delta_{1}T})&(-\delta_{2})^{k-1}(1-e^{-\delta_{2}T})&\cdots&(-\delta_{k})^{k-1}(1-e^{-\delta_{k}T})\end{pmatrix}\begin{pmatrix}\alpha_{1}\\ \alpha_{2}\\ \vdots\\ \alpha_{k}\end{pmatrix}
=(μ(T)μ(0+)μ(1)(T)μ(1)(0+)μ(k1)(T)μ(k1)(0+)).\displaystyle=\begin{pmatrix}{\mu}(T-)-{\mu}(0+)\\ {\mu}^{(1)}(T-)-{\mu}^{(1)}(0+)\\ \vdots\\ {\mu}^{(k-1)}(T-)-{\mu}^{(k-1)}(0+)\end{pmatrix}. (63)

Rewrite (63) as

DVδΛδα=Δμ,\displaystyle DV_{\delta}\Lambda_{\delta}\alpha=\Delta_{\mu},

where D=diag{1,1,,(1)k1}D=\mathrm{diag}\{1,-1,\cdots,(-1)^{k-1}\}, Λδ=diag{1eδ1T,1eδ2T,,1eδkT}\Lambda_{\delta}=\mathrm{diag}\{1-e^{-\delta_{1}T},1-e^{-\delta_{2}T},\cdots,1-e^{-\delta_{k}T}\}, Δμ=(μ(T)μ(0+),μ(1)(T)μ(1)(0+),,μ(k1)(T)μ(k1)(0+))\Delta_{\mu}=({\mu}(T-)-{\mu}(0+),{\mu}^{(1)}(T-)-{\mu}^{(1)}(0+),\cdots,{\mu}^{(k-1)}(T-)-{\mu}^{(k-1)}(0+))^{\top}, and VδV_{\delta} is defined in Lemma 9. By Lemma 9 and δj=j/k\delta_{j}=j/k, 1jk1\leq j\leq k, the matrix DVδΛδDV_{\delta}\Lambda_{\delta} is invertible and

α\displaystyle\|\alpha\|_{\infty} D1Vδ1Λδ1Δμ\displaystyle\leq\|D^{-1}\|_{\infty}\|V_{\delta}^{-1}\|_{\infty}\|\Lambda_{\delta}^{-1}\|_{\infty}\|\Delta_{\mu}\|_{\infty}
(C8k)1(1exp(T))(2C0)=2C0C8k1exp(T),\displaystyle\leq(C*8^{k})\frac{1}{(1-\exp(-T))}(2C_{0})=\frac{2C_{0}C8^{k}}{1-\exp(-T)},

where the constant CC is defined in Lemma 9. By (61), we have μ~Ck,([0,T],C0+kα)\tilde{\mu}\in C^{k,\infty}([0,T],C_{0}+k\|\alpha\|_{\infty}). ∎
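The following sketch gives a worked instance of Lemma 10 with an assumed toy kernel μ(t) = exp(-t): it solves the linear system (63) for α and verifies that the corrected kernel μ̃ matches its derivatives of order 0, ..., k-1 at 0 and T.

```python
import numpy as np

# Worked instance of Lemma 10 (illustrative, with the assumed kernel exp(-t)):
# solve D * V_delta * Lambda_delta * alpha = Delta_mu, then check the matching
# boundary derivatives of mu_tilde = mu + sum_j alpha_j * exp(-delta_j * t).
k, T = 3, 2.0
delta = np.arange(1, k + 1) / k

def mu_deriv(j, t):                                # j-th derivative of exp(-t)
    return (-1.0) ** j * np.exp(-t)

D = np.diag([(-1.0) ** j for j in range(k)])
V = np.vander(delta, k, increasing=True).T         # V[i, j] = delta_j ** i
Lam = np.diag(1.0 - np.exp(-delta * T))
Delta_mu = np.array([mu_deriv(j, T) - mu_deriv(j, 0.0) for j in range(k)])

alpha = np.linalg.solve(D @ V @ Lam, Delta_mu)

def mu_tilde_deriv(j, t):
    corr = sum(a * (-d) ** j * np.exp(-d * t) for a, d in zip(alpha, delta))
    return mu_deriv(j, t) + corr

for j in range(k):
    print(j, mu_tilde_deriv(j, 0.0), mu_tilde_deriv(j, T))   # the two columns agree
```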

Now we prove Theorem 6. The proof is based on Theorem 5, Theorem 9, and Lemma 10. From Lemma 10, for μCk,([0,T],C0)\mu\in C^{k,\infty}([0,T],C_{0}), there exists α=(α1,α2,,αk)k\alpha=(\alpha_{1},\alpha_{2},\cdots,\alpha_{k})^{\top}\in\mathbb{R}^{k} such that μ~(t):=μ(t)+j=1kαjexp(δjt)\tilde{\mu}(t):=\mu(t)+\sum_{j=1}^{k}\alpha_{j}\exp(-\delta_{j}t) satisfying the boundary condition μ~(j)(0+)=μ~(j)(T)\tilde{\mu}^{(j)}(0+)=\tilde{\mu}^{(j)}(T-), 0jk10\leq j\leq k-1, and we have μ~Ck,([0,T],C0+k2C0C8k1exp(T))\tilde{\mu}\in C^{k,\infty}([0,T],C_{0}+k\frac{2C_{0}C8^{k}}{1-\exp(-T)}). Define ν~(t):=μ(t)μ~(t)=j=1kαjexp(δjt)\tilde{\nu}(t):=\mu(t)-\tilde{\mu}(t)=-\sum_{j=1}^{k}\alpha_{j}\exp(-\delta_{j}t). Denote

λ1(t)\displaystyle\lambda^{\ast}_{1}(t) :=λ0(t)+ti<tμ~(tti),\displaystyle:=\lambda_{0}(t)+\sum_{t_{i}<t}\tilde{\mu}(t-t_{i}),
λ2(t)\displaystyle\lambda^{\ast}_{2}(t) :=ti<tν~(tti)=j=1kti<t(αj)exp(δj(tti)):=j=1kλ2j(t),\displaystyle:=\sum_{t_{i}<t}\tilde{\nu}(t-t_{i})=\sum_{j=1}^{k}\sum_{t_{i}<t}(-\alpha_{j})\exp(-\delta_{j}(t-t_{i})):=\sum_{j=1}^{k}\lambda_{2j}^{\ast}(t),

and then λ(t)=λ1(t)+λ2(t)\lambda^{\ast}(t)=\lambda^{\ast}_{1}(t)+\lambda^{\ast}_{2}(t).

Fix s0+s_{0}\in\mathbb{N}_{+}. By the proof of Theorem 9, under the event {Nes0}\{N_{e}\leq s_{0}\}, there exists an RNN (without the output layer) λ¯1(t)\overline{\lambda}_{1}(t) such that

|λ1(t)λ¯1(t)|3𝒞B0Ts+4C0~+22Ns+s04C0~Tk(k1)(2π)kNμk1,t[0,T],\displaystyle\left|\lambda_{1}^{\ast}(t)-\overline{\lambda}_{1}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}+4\tilde{C_{0}}+2}{2N^{s}}+s_{0}\frac{4\tilde{C_{0}}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}},~{}t\in[0,T],

where C0~=C0+2kC0C8k/(1exp(T))\tilde{C_{0}}=C_{0}+2kC_{0}C8^{k}/(1-\exp(-T)).

By the proof of Theorem 5, under the event {Nes0}\{N_{e}\leq s_{0}\}, for 1jk1\leq j\leq k, there exists an RNN (without the output layer) λ¯2j(t)\overline{\lambda}_{2j}(t) such that

|λ2j(t)λ¯2j(t)|2|αj|Ns4C0C8k(1exp(T))Ns,t[0,T].\displaystyle\left|\lambda_{2j}^{\ast}(t)-\overline{\lambda}_{2j}(t)\right|\leq\frac{2|\alpha_{j}|}{N^{s}}\leq\frac{4C_{0}C8^{k}}{(1-\exp(-T))N^{s}},~{}t\in[0,T].

Let λ¯2(t)=j=1kλ¯2j(t)\overline{\lambda}_{2}(t)=\sum_{j=1}^{k}\overline{\lambda}_{2j}(t). We have

|λ2(t)λ¯2(t)|2(C~0C0)Ns,t[0,T].\displaystyle\left|\lambda_{2}^{\ast}(t)-\overline{\lambda}_{2}(t)\right|\leq\frac{2(\tilde{C}_{0}-C_{0})}{N^{s}},t\in[0,T].

Let λ¯(t)=λ¯1(t)+λ¯2(t)\overline{\lambda}(t)=\overline{\lambda}_{1}(t)+\overline{\lambda}_{2}(t); then

|λ(t)λ¯(t)||λ1(t)λ¯1(t)|+|λ2(t)λ¯2(t)|3𝒞B0Ts+8C0~+22Ns+s04C0~Tk(k1)(2π)kNμk1.\displaystyle\left|\lambda^{\ast}(t)-\overline{\lambda}(t)\right|\leq\left|\lambda_{1}^{\ast}(t)-\overline{\lambda}_{1}(t)\right|+\left|\lambda_{2}^{\ast}(t)-\overline{\lambda}_{2}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}+8\tilde{C_{0}}+2}{2N^{s}}+s_{0}\frac{4\tilde{C_{0}}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}}.

Under the event {Nes0}\{N_{e}\leq s_{0}\}, B1λB0+C0s0B_{1}\leq\lambda^{\ast}\leq B_{0}+C_{0}s_{0}. Hence we can take lf=B1l_{f}=B_{1} and uf=B0+C0s0u_{f}=B_{0}+C_{0}s_{0} and denote λ^(t)=f(λ¯(t))\hat{\lambda}(t)=f(\overline{\lambda}(t)). Then λλ^λλ¯\|\lambda^{\ast}-\hat{\lambda}\|_{\infty}\leq\|\lambda^{\ast}-\overline{\lambda}\|_{\infty}. By similar arguments in Theorem 9, we have

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle\quad|\mathbb{E}[{\text{loss}}(\hat{\lambda},S_{test})]-\mathbb{E}[{\text{loss}}(\lambda^{\ast},S_{test})]|
(T+1B1)aeexp(ce(s0+1))(2(s0+1)(B0+C0s0)+3C0(s0+1)+2B0(1exp(ce))2)\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)a_{e}\exp(-c_{e}(s_{0}+1))\left(2(s_{0}+1)(B_{0}+C_{0}s_{0})+\frac{3C_{0}(s_{0}+1)+2B_{0}}{(1-\exp(-c_{e}))^{2}}\right)
+(T+1B1)(1+aeexp(ce)1exp(ce))(3𝒞B0Ts+8C~02Ns+s04C~0Tk(k1)(2π)kNμk1).\displaystyle\quad+\left(T+\frac{1}{B_{1}}\right)\left(1+\frac{a_{e}\exp(-c_{e})}{1-\exp(-c_{e})}\right)\left(\frac{3\mathcal{C}B_{0}T^{s}+8\tilde{C}_{0}}{2N^{s}}+s_{0}\frac{4\tilde{C}_{0}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}}\right)~{}.

Let s0=slog(N)/ces_{0}=\lceil s\log(N)/c_{e}\rceil and denote λ^N,Nμ=λ^\hat{\lambda}^{N,N_{\mu}}=\hat{\lambda}. We have

|𝔼[loss(λ^N,Nμ,Stest)]𝔼[loss(λ,Stest)]|log2NNs+logNNμk1.\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N,N_{\mu}},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{\log^{2}N}{N^{s}}+\frac{\log N}{N_{\mu}^{k-1}}~{}.

The bounds on the width and on the weights can also be obtained similarly to the proof of Theorem 9.

9.4 Proof of Theorem 7

Denote λ1(t)=λ0(t)+ti<tαexp(β(tti))\lambda_{1}^{*}(t)=\lambda_{0}(t)+\sum_{t_{i}<t}\alpha\exp(-\beta(t-t_{i})). Then λ(t)=Ψ(λ1(t))\lambda^{*}(t)=\Psi\left(\lambda_{1}^{*}(t)\right). Fix s0+s_{0}\in\mathbb{N}_{+}. From the proof of Theorem 5, under the event {Nes0}\{N_{e}\leq s_{0}\}, there exists a two-layer recurrent neural network λ¯1(t)\overline{\lambda}_{1}(t) of the form (43) such that

|λ¯1(t)λ1(t)|3𝒞B0Ts+22N1s,t[0,T].\displaystyle\left|\overline{\lambda}_{1}(t)-\lambda_{1}^{*}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}+2}{2N_{1}^{s}},~{}\forall t\in[0,T]. (64)

Moreover, the width of λ¯1(t)\overline{\lambda}_{1}(t) satisfies DN1D\lesssim N_{1} and the weights of λ¯1(t)\overline{\lambda}_{1}(t) are bounded by

O((logN1)12s2(logN1)2).\displaystyle O\left((\log N_{1})^{12s^{2}(\log N_{1})^{2}}\right)~{}.

Under the event {Nes0}\{N_{e}\leq s_{0}\}, the function λ1(t)\lambda_{1}^{*}(t) satisfies 0λ1B0+αs00\leq\lambda_{1}^{*}\leq B_{0}+\alpha s_{0}. Using (64) and taking (3𝒞B0Ts+2)/2N1s1(3\mathcal{C}B_{0}T^{s}+2)/2N_{1}^{s}\leq 1, we have λ¯1[1,B0+αs0+1]\overline{\lambda}_{1}\in[-1,B_{0}+\alpha s_{0}+1]. Hence we need to construct an approximation of Ψ\Psi on [1,B0+αs0+1][-1,B_{0}+\alpha s_{0}+1]. Let Ψ~(x)=Ψ(ρx1)\tilde{\Psi}(x)=\Psi(\rho x-1), where ρ=B0+αs0+2\rho={B_{0}+\alpha s_{0}+2}. Then Ψ(x)=Ψ~((x+1)/ρ)\Psi(x)=\tilde{\Psi}((x+1)/\rho).

Since Ψ\Psi is LL-Lipschitz, Ψ~\tilde{\Psi} is defined on [0,1][0,1] and is ρL\rho L-Lipschitz. By Corollary 5.4 of De Ryck et al. (2021), there exists a tanh neural network Ψ~N2\tilde{\Psi}^{N_{2}} with 2 hidden layers such that

Ψ~Ψ~N2L[0,1]7(ρLB~0)N2.\displaystyle\left\|\tilde{\Psi}-\tilde{\Psi}^{N_{2}}\right\|_{L^{\infty}[0,1]}\leq\frac{7(\rho L\vee\tilde{B}_{0})}{N_{2}}.

Let ΨN2(x)=Ψ~N2((x+1)/ρ)\Psi^{N_{2}}(x)=\tilde{\Psi}^{N_{2}}((x+1)/\rho). Then

|Ψ(x)ΨN2(x)|7(ρLB~0)N2,x[1,B0+αs0+1].\displaystyle\left|\Psi(x)-\Psi^{N_{2}}(x)\right|\leq\frac{7(\rho L\vee\tilde{B}_{0})}{N_{2}},~{}x\in[-1,B_{0}+\alpha s_{0}+1].

Then under the event {Nes0}\{N_{e}\leq s_{0}\}, we have

|Ψ(λ1(t))ΨN2(λ¯1(t))|\displaystyle\left|\Psi(\lambda_{1}^{*}(t))-\Psi^{N_{2}}(\overline{\lambda}_{1}(t))\right| |Ψ(λ1(t))Ψ(λ¯1(t))|+|Ψ(λ¯1(t))ΨN2(λ¯1(t))|\displaystyle\leq\left|\Psi(\lambda_{1}^{*}(t))-\Psi(\overline{\lambda}_{1}(t))\right|+\left|\Psi(\overline{\lambda}_{1}(t))-\Psi^{N_{2}}(\overline{\lambda}_{1}(t))\right|
L|λ¯1(t)λ1(t)|+ΨΨN2L\displaystyle\leq L\left|\overline{\lambda}_{1}(t)-\lambda_{1}^{*}(t)\right|+\left\|\Psi-\Psi^{N_{2}}\right\|_{L^{\infty}}
L3𝒞B0Ts+22N1s+7(ρLB~0)N2.\displaystyle\leq L\frac{3\mathcal{C}B_{0}T^{s}+2}{2N_{1}^{s}}+\frac{7(\rho L\vee\tilde{B}_{0})}{N_{2}}. (65)

Recall that f(x)=min{max{x,lf},uf}f(x)=\min\{\max\{x,l_{f}\},u_{f}\}. Since B~1ΨB~0\tilde{B}_{1}\leq\Psi\leq\tilde{B}_{0}, we can take lf=B~1l_{f}=\tilde{B}_{1} and uf=B~0u_{f}=\tilde{B}_{0}. Define λ^(t)=f(ΨN2(λ¯1(t)))\hat{\lambda}(t)=f\left(\Psi^{N_{2}}(\overline{\lambda}_{1}(t))\right). We have

|λ(t)λ^(t)||Ψ(λ1(t))ΨN2(λ¯1(t))|,t[0,T].\displaystyle\left|\lambda^{*}(t)-\hat{\lambda}(t)\right|\leq\left|\Psi(\lambda_{1}^{*}(t))-\Psi^{N_{2}}(\overline{\lambda}_{1}(t))\right|,~{}\forall t\in[0,T]. (66)

Similar to (46), we have

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle\quad|\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|
𝔼|loss(λ^,Stest)loss(λ,Stest)|\displaystyle\leq\mathbb{E}\left|\text{loss}(\hat{\lambda},S_{test})-\text{loss}(\lambda^{\ast},S_{test})\right|
𝔼(|i=1Ne(logλ^(ti)logλ(ti))|+|0T(λ^(t)λ(t))dt|)\displaystyle\leq\mathbb{E}\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\hat{\lambda}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\hat{\lambda}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)
𝔼[(|i=1Ne(logλ^(ti)logλ(ti))|+|0T(λ^(t)λ(t))dt|)𝟙{Nes0}\displaystyle\leq\mathbb{E}\left[\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\hat{\lambda}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\hat{\lambda}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right.
+(|i=1Ne(logλ^(ti)logλ(ti))|+|0T(λ^(t)λ(t))dt|)𝟙{Ne>s0}]\displaystyle\quad\quad+\left.\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\hat{\lambda}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\hat{\lambda}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
𝔼[(T+NeB1~)λ^λL𝟙{Nes0}]+𝔼[(T+NeB1~)λ^λL𝟙{Ne>s0}]\displaystyle\leq\mathbb{E}\left[(T+\frac{N_{e}}{\tilde{B_{1}}})\left\|\hat{\lambda}-\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right]+\mathbb{E}\left[(T+\frac{N_{e}}{\tilde{B_{1}}})\left\|\hat{\lambda}-\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
:=𝕀1+𝕀2.\displaystyle:=\mathbb{I}_{1}+\mathbb{I}_{2}~{}. (67)

Since ΨB0~\Psi\leq\tilde{B_{0}}, arguing as before and taking η=e\eta=e in Lemma 2, we have

(Nes)2B0~Texp(eB0~Ts2),\displaystyle\mathbb{P}(N_{e}\geq s)\leq 2\sqrt{\tilde{B_{0}}T}\exp\left(\frac{e\tilde{B_{0}}T-s}{2}\right),

and similar to (36), we have

𝔼(Ne+1)1+s=1(Nes)1+2B0~T1exp(1/2)exp(eB0~T12)5B0~T+1exp(3B0~T2).\displaystyle\mathbb{E}(N_{e}+1)\leq 1+\sum_{s=1}^{\infty}\mathbb{P}(N_{e}\geq s)\leq 1+\frac{2\sqrt{\tilde{B_{0}}T}}{1-\exp(-1/2)}\exp\left(\frac{e\tilde{B_{0}}T-1}{2}\right)\leq{5\sqrt{\tilde{B_{0}}T+1}}\exp\left(\frac{3\tilde{B_{0}}T}{2}\right). (68)

By (65), (66), and (68),

𝕀1\displaystyle\mathbb{I}_{1} (T+1B1~)λλ^L𝔼[(Ne+1)𝟙{Nes0}]\displaystyle\leq\left(T+\frac{1}{\tilde{B_{1}}}\right)\|\lambda^{\ast}-\hat{\lambda}\|_{L^{\infty}}\mathbb{E}\left[(N_{e}+1)\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right]
(T+1B1~)λλ^L𝔼[(Ne+1)]\displaystyle\leq\left(T+\frac{1}{\tilde{B_{1}}}\right)\|\lambda^{\ast}-\hat{\lambda}\|_{L^{\infty}}\mathbb{E}\left[(N_{e}+1)\right]
(T+1B1~)5B0~T+1exp(3B0~T2)(L3𝒞B0Ts+22N1s+7(ρLB~0)N2)\displaystyle\leq\left(T+\frac{1}{\tilde{B_{1}}}\right){5\sqrt{\tilde{B_{0}}T+1}}\exp\left(\frac{3\tilde{B_{0}}T}{2}\right)\left(L\frac{3\mathcal{C}B_{0}T^{s}+2}{2N_{1}^{s}}+\frac{7(\rho L\vee\tilde{B}_{0})}{N_{2}}\right)
5Lexp(2B0~T)(T+1B1~)(3𝒞B0Ts+22N1s+7(ρ(B~0/L))N2).\displaystyle\leq 5L\exp\left({2\tilde{B_{0}}T}\right)\left(T+\frac{1}{\tilde{B_{1}}}\right)\left(\frac{3\mathcal{C}B_{0}T^{s}+2}{2N_{1}^{s}}+\frac{7(\rho\vee(\tilde{B}_{0}/L))}{N_{2}}\right). (69)

On the other hand, since λ^LB0~\|\hat{\lambda}\|_{L^{\infty}}\leq\tilde{B_{0}} and λLB0~\|\lambda^{\ast}\|_{L^{\infty}}\leq\tilde{B_{0}}, we have

𝕀2\displaystyle\mathbb{I}_{2} 𝔼[(T+NeB1~)λ^L𝟙{Ne>s0}]+𝔼[(T+NeB1~)λL𝟙{Ne>s0}]\displaystyle\leq\mathbb{E}\left[\left(T+\frac{N_{e}}{\tilde{B_{1}}}\right)\|\hat{\lambda}\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]+\mathbb{E}\left[\left(T+\frac{N_{e}}{\tilde{B_{1}}}\right)\left\|\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
2(T+1B1~)B0~𝔼[(Ne+1)𝟙{Ne>s0}]\displaystyle\leq 2\left(T+\frac{1}{\tilde{B_{1}}}\right)\tilde{B_{0}}\mathbb{E}\left[(N_{e}+1)\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
2(T+1B1~)B0~((s0+1)(Nes0+1)+s=s0+1(Nes))\displaystyle\leq 2\left(T+\frac{1}{\tilde{B_{1}}}\right)\tilde{B_{0}}\left((s_{0}+1)\mathbb{P}(N_{e}\geq s_{0}+1)+\sum_{s=s_{0}+1}^{\infty}\mathbb{P}(N_{e}\geq s)\right)
4(T+1B1~)B0~B0~Texp(eB0~T(s0+1)2)((s0+1)+11e12)\displaystyle\leq 4\left(T+\frac{1}{\tilde{B_{1}}}\right)\tilde{B_{0}}\sqrt{\tilde{B_{0}}T}\exp\left(\frac{e\tilde{B_{0}}T-(s_{0}+1)}{2}\right)\left((s_{0}+1)+\frac{1}{1-e^{-\frac{1}{2}}}\right)
4(T+1B1~)B0~B0~Texp(3B0~T(s0+1)2)(s0+4)\displaystyle\leq 4\left(T+\frac{1}{\tilde{B_{1}}}\right)\tilde{B_{0}}\sqrt{\tilde{B_{0}}T}\exp\left(\frac{3\tilde{B_{0}}T-(s_{0}+1)}{2}\right)\left(s_{0}+4\right)
4(T+1B1~)B0~exp(2B0~T)(s0+4)exp(s0+12).\displaystyle\leq 4\left(T+\frac{1}{\tilde{B_{1}}}\right)\tilde{B_{0}}\exp\left({2\tilde{B_{0}}T}\right)\left(s_{0}+4\right)\exp\left(-\frac{s_{0}+1}{2}\right). (70)

Combining (67), (69), and (70), we have

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]| 5Lexp(2B0~T)(T+1B1~)(3𝒞B0Ts+22N1s+7(ρ(B~0/L))N2)\displaystyle\leq 5L\exp\left({2\tilde{B_{0}}T}\right)\left(T+\frac{1}{\tilde{B_{1}}}\right)\left(\frac{3\mathcal{C}B_{0}T^{s}+2}{2N_{1}^{s}}+\frac{7(\rho\vee(\tilde{B}_{0}/L))}{N_{2}}\right)
+4(T+1B1~)B0~exp(2B0~T)(s0+4)exp(s0+12).\displaystyle\quad+4\left(T+\frac{1}{\tilde{B_{1}}}\right)\tilde{B_{0}}\exp\left({2\tilde{B_{0}}T}\right)\left(s_{0}+4\right)\exp\left(-\frac{s_{0}+1}{2}\right).

Let s0=2logNs_{0}=\lceil 2\log N\rceil, N1=N2=NN_{1}=N_{2}=N and denote λ^N=λ^\hat{\lambda}^{N}=\hat{\lambda}. We have

|𝔼[loss(λ^N,Stest)]𝔼[loss(λ,Stest)]|logNN.\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{\log N}{N}.

Similar to the proof of Theorem 5, we can bound the width of the network by

max{3𝒩~2(𝒩~+22)+3s2+6N+2,6N},\displaystyle\max\left\{3\left\lceil\frac{\tilde{\mathcal{N}}}{2}\right\rceil\binom{\tilde{\mathcal{N}}+2}{2}+3\left\lceil\frac{s}{2}\right\rceil+6N+2,6N\right\},

where 𝒩~=slog(N)+10(δT1)+2log(3(s0+1))\tilde{\mathcal{N}}=\lceil s\log(N)\rceil+10(\delta T\vee 1)+2\lceil\log(3(s_{0}+1))\rceil. Hence we have DND\lesssim N.

Moreover, from the construction of λ^\hat{\lambda}, the weights of the network is less than

𝒞1max{(log(N))12s2(log(N))2,NρρL},\displaystyle\mathcal{C}_{1}^{\prime}\max\left\{(\log(N))^{12s^{2}(\log(N))^{2}},\frac{N}{\rho\sqrt{\rho L}}\right\},

where ρ=B0+αs0+2=B0+α2logN+2\rho=B_{0}+\alpha s_{0}+2=B_{0}+\alpha\lceil 2\log N\rceil+2, 𝒞1\mathcal{C}_{1}^{\prime} is a constant related to s,B0,α,δ,T,B~0s,B_{0},\alpha,\delta,T,\tilde{B}_{0}, and LL. Then the weights of the network can be bounded by

𝒞1(log(N))12s2(log(N))2,\displaystyle\mathcal{C}_{1}(\log(N))^{12s^{2}(\log(N))^{2}},

where 𝒞1\mathcal{C}_{1} is a constant related to s,B0,α,δ,T,B~0s,B_{0},\alpha,\delta,T,\tilde{B}_{0}, and LL.

9.5 Proof of Theorem 8

Without loss of generality, we denote t1=T/3t_{1}=T/3, t2=2T/3t_{2}=2T/3 for simplicity. Since the compensator of N(t)N(t) is Λ(t)=0tλ(s)ds\Lambda(t)=\int_{0}^{t}\lambda^{\ast}(s)\mathrm{d}s, for a predictable stochastic process λ(t),t[0,T]\lambda(t),t\in[0,T], we have

𝔼[loss(λ,Stest)]\displaystyle\mathbb{E}[\text{loss}(\lambda,S_{test})] =𝔼[ti<Tlogλ(ti)+0Tλ(t)dt]\displaystyle=\mathbb{E}\left[-\sum_{t_{i}<T}\log\lambda(t_{i})+\int_{0}^{T}\lambda(t)\mathrm{d}t\right]
=𝔼[0Tlogλ(t)dN(t)+0Tλ(t)dt]\displaystyle=\mathbb{E}\left[-\int_{0}^{T}\log\lambda(t)\mathrm{d}N(t)+\int_{0}^{T}\lambda(t)\mathrm{d}t\right]
=𝔼[0T(λ(t)logλ(t)λ(t))dt].\displaystyle=\mathbb{E}\left[\int_{0}^{T}\left(\lambda(t)-\log\lambda(t)\cdot\lambda^{\ast}(t)\right)\mathrm{d}t\right].

Since both λ\lambda^{\ast} and λ^ne\hat{\lambda}_{ne} are predictable, we have

𝔼[loss(λ^ne,Stest)]𝔼[loss(λ,Stest)]\displaystyle\mathbb{E}[\text{loss}(\hat{\lambda}_{ne},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]
=\displaystyle= 𝔼[0T(λ^ne(t)logλ^ne(t)λ(t))dt]𝔼[0T(λ(t)logλ(t)λ(t))dt]\mathbb{E}\left[\int_{0}^{T}\left(\hat{\lambda}_{ne}(t)-\log\hat{\lambda}_{ne}(t)\cdot\lambda^{\ast}(t)\right)\mathrm{d}t\right]-\mathbb{E}\left[\int_{0}^{T}\left(\lambda^{\ast}(t)-\log\lambda^{\ast}(t)\cdot\lambda^{\ast}(t)\right)\mathrm{d}t\right]
:=\displaystyle:= 𝔼[0T(g(λ^ne(t),λ(t))g(λ(t),λ(t)))dt],\displaystyle\mathbb{E}\left[\int_{0}^{T}\left(g(\hat{\lambda}_{ne}(t),\lambda^{\ast}(t))-g(\lambda^{\ast}(t),\lambda^{\ast}(t))\right)\mathrm{d}t\right],

where g(x,y)=x-y\log x\geq y-y\log y=g(y,y) for all x,y>0, with equality if and only if x=y. Thus

𝔼[loss(λ^ne,Stest)]𝔼[loss(λ,Stest)]=𝔼[0T(g(λ^ne(t),λ(t))g(λ(t),λ(t)))dt]0.\displaystyle\mathbb{E}[\text{loss}(\hat{\lambda}_{ne},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]=\mathbb{E}\left[\int_{0}^{T}\left(g(\hat{\lambda}_{ne}(t),\lambda^{\ast}(t))-g(\lambda^{\ast}(t),\lambda^{\ast}(t))\right)\mathrm{d}t\right]\geq 0.
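The elementary inequality g(x,y)\geq g(y,y) holds because, for fixed y>0, the map x\mapsto x-y\log x is strictly convex with derivative 1-y/x, which vanishes only at x=y. A minimal numerical check of this fact (Python, illustrative only):

```python
import numpy as np

def g(x, y):
    # g(x, y) = x - y * log(x), as in the proof
    return x - y * np.log(x)

rng = np.random.default_rng(0)
x = np.linspace(1e-3, 20.0, 200_001)
for y in rng.uniform(0.1, 10.0, size=5):
    gap = g(x, y) - g(y, y)
    assert gap.min() >= -1e-9                  # g(x, y) >= g(y, y), up to floating-point error
    assert abs(x[gap.argmin()] - y) < 1e-3     # the minimizer is (numerically) x = y
print("checked: for each y > 0, x -> g(x, y) is minimized at x = y")
```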

Denote \mathcal{E}=\{\text{there is no event in }[0,2T/3]\}; note that \mathbb{P}(\mathcal{E})>0. Denote I_{0}=[T/3,2T/3]. Since the integrand above is nonnegative, restricting to I_{0} and to the event \mathcal{E} yields

𝔼[loss(λ^ne,Stest)]𝔼[loss(λ,Stest)]𝔼[(I0(g(λ^ne(t),λ(t))g(λ(t),λ(t)))dt)𝟙].\displaystyle\mathbb{E}[\text{loss}(\hat{\lambda}_{ne},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\geq\mathbb{E}\left[\left(\int_{I_{0}}\left(g(\hat{\lambda}_{ne}(t),\lambda^{\ast}(t))-g(\lambda^{\ast}(t),\lambda^{\ast}(t))\right)\mathrm{d}t\right)\mathbbm{1}_{\mathcal{E}}\right]. (71)

Under the event \mathcal{E},

\hat{\lambda}_{ne}(t)=f(\alpha t+b)=\begin{cases}T,&\alpha t+b<T,\\ \alpha t+b,&T\leq\alpha t+b\leq 4T,\\ 4T,&\alpha t+b>4T,\end{cases}\qquad t\in I_{0},

and

𝔼[(I0(g(λ^ne(t),λ(t))g(λ(t),λ(t)))dt)𝟙]\displaystyle\mathbb{E}\left[\left(\int_{I_{0}}\left(g\left(\hat{\lambda}_{ne}(t),\lambda^{\ast}(t)\right)-g\left(\lambda^{\ast}(t),\lambda^{\ast}(t)\right)\right)\mathrm{d}t\right)\mathbbm{1}_{\mathcal{E}}\right]
=\displaystyle= [I0(g(f(αt+b),9Tt2)g(9Tt2,9Tt2))dt]()\displaystyle\left[\int_{I_{0}}\left(g\left(f(\alpha t+b),\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t\right]\mathbb{P}(\mathcal{E})
:=F(\alpha,b)\mathbb{P}(\mathcal{E}). (72)

Then we only need to show

infα,bF(α,b)>0.\displaystyle\inf_{\alpha\in\mathbb{R},b\in\mathbb{R}}F(\alpha,b)>0.

Case 1. |α|>18|\alpha|>18, bb\in\mathbb{R}.

Since \{t:T\leq\alpha t+b\leq 4T\} is an interval of length at most 3T/|\alpha|<T/6, it cannot intersect both I_{1}:=[T/3,5T/12] and I_{2}:=[7T/12,2T/3]; hence, by continuity of t\mapsto\alpha t+b, \hat{\lambda}_{ne} is identically T or identically 4T on at least one of I_{1} and I_{2}. Since g(x,y)\geq g(y,y) for all x,y>0,

inf|α|>18,bF(α,b)min\displaystyle\inf_{|\alpha|>18,b\in\mathbb{R}}F(\alpha,b)\geq\min {I1(g(T,9Tt2)g(9Tt2,9Tt2))dt,\displaystyle\left\{\int_{I_{1}}\left(g\left(T,\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t,\right.
I2(g(T,9Tt2)g(9Tt2,9Tt2))dt,\displaystyle\int_{I_{2}}\left(g\left(T,\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t,
I1(g(4T,9Tt2)g(9Tt2,9Tt2))dt,\displaystyle\int_{I_{1}}\left(g\left(4T,\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t,
I2(g(4T,9Tt2)g(9Tt2,9Tt2))dt}\displaystyle\left.\int_{I_{2}}\left(g\left(4T,\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t\right\}
:=C1\displaystyle:=C_{1} >0.\displaystyle>0. (73)

Case 2. |α|18|\alpha|\leq 18, |b|>16T|b|>16T.

In this case, |\alpha t|\leq 12T for all t\in I_{0}, so \alpha t+b>4T on I_{0} when b>16T and \alpha t+b<T on I_{0} when b<-16T. Hence \{t:T\leq\alpha t+b\leq 4T\}\cap I_{0}=\emptyset and \hat{\lambda}_{ne} is identically T or identically 4T on I_{0}. Therefore

inf|α|18,|b|>16TF(α,b)min\displaystyle\inf_{|\alpha|\leq 18,|b|>16T}F(\alpha,b)\geq\min {I0(g(T,9Tt2)g(9Tt2,9Tt2))dt,\displaystyle\left\{\int_{I_{0}}\left(g\left(T,\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t,\right.
I0(g(4T,9Tt2)g(9Tt2,9Tt2))dt}\displaystyle\left.\int_{I_{0}}\left(g\left(4T,\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t\right\}
:=C2\displaystyle:=C_{2} >0.\displaystyle>0. (74)

Case 3. |α|18|\alpha|\leq 18, |b|16T|b|\leq 16T.

By (72), F is continuous with respect to (\alpha,b). For each fixed (\alpha,b), the piecewise linear function t\mapsto f(\alpha t+b) cannot coincide with the strictly convex function \frac{9}{T}t^{2} on all of I_{0}, so F(\alpha,b)>0. Since \{|\alpha|\leq 18,|b|\leq 16T\} is a compact set in \mathbb{R}^{2} and F is continuous, there exists C_{3}>0 such that

inf|α|18,|b|16TF(α,b)C3>0.\displaystyle\inf_{|\alpha|\leq 18,|b|\leq 16T}F(\alpha,b)\geq C_{3}>0. (75)

By (71), (72), (73), (74), and (75),

\mathbb{E}[\text{loss}(\hat{\lambda}_{ne},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\geq\min\{C_{1},C_{2},C_{3}\}\mathbb{P}(\mathcal{E}):=C>0.

Hence Theorem 8 is proved.
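As an illustrative sanity check (not a substitute for the compactness argument in Case 3), one can evaluate F(\alpha,b) on a grid over the region \{|\alpha|\leq 18,|b|\leq 16T\} and confirm that its minimum is strictly positive; the horizon T=1 and the grid resolutions below are arbitrary choices, and a Riemann sum stands in for the integral over I_{0}.

```python
import numpy as np

T = 1.0                                      # arbitrary horizon for illustration
t = np.linspace(T / 3, 2 * T / 3, 1001)      # discretization of I_0 = [T/3, 2T/3]
dt = t[1] - t[0]
lam_star = 9 * t**2 / T                      # true intensity 9 t^2 / T on I_0

def g(x, y):
    return x - y * np.log(x)

def F(alpha, b):
    lam_hat = np.clip(alpha * t + b, T, 4 * T)   # f clips alpha * t + b to [T, 4T]
    return np.sum(g(lam_hat, lam_star) - g(lam_star, lam_star)) * dt

vals = [F(a, b) for a in np.linspace(-18, 18, 73)
                for b in np.linspace(-16 * T, 16 * T, 129)]
print("minimum of F over the grid:", min(vals))   # strictly positive
```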

Remark 13.

Note that in the proof of Theorem 8 we have shown that the excess risk

𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]\displaystyle\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})] (76)

is nonnegative and is strictly positive whenever \hat{\lambda}\neq\lambda^{\ast}. Thus (76) is a well-defined excess risk.

10 Supporting Lemmas

Lemma 11.

(Lemma 8 in Chen et al. (2020))   Let 𝒢={Ad1×d2:A2λ}\mathcal{G}=\{A\in\mathbb{R}^{d_{1}\times d_{2}}:\|A\|_{2}\leq\lambda\} be the set of matrices with bounded spectral norm and ϵ>0\epsilon>0 be given. The covering number 𝒩(𝒢,ϵ,F)\mathcal{N}(\mathcal{G},\epsilon,\|\cdot\|_{F}) is bounded above by

𝒩(𝒢,ϵ,F)(1+(d1d2)λϵ)d1d2.\displaystyle\mathcal{N}(\mathcal{G},\epsilon,\|\cdot\|_{F})\leq\left(1+\frac{(\sqrt{d_{1}}\wedge\sqrt{d_{2}})\lambda}{\epsilon}\right)^{d_{1}d_{2}}.
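For a quick sense of scale (the values of d_{1}, d_{2}, \lambda, and \epsilon below are arbitrary illustrative choices), the metric-entropy bound implied by Lemma 11, namely \log\mathcal{N}\leq d_{1}d_{2}\log\left(1+(\sqrt{d_{1}}\wedge\sqrt{d_{2}})\lambda/\epsilon\right), can be evaluated directly:

```python
from math import log, sqrt

def log_covering_bound(d1, d2, lam, eps):
    """Upper bound on log N(G, eps, ||.||_F) from Lemma 11 (illustrative)."""
    return d1 * d2 * log(1 + min(sqrt(d1), sqrt(d2)) * lam / eps)

# e.g. 64 x 64 weight matrices with spectral norm at most 5, at resolution 0.01
print(log_covering_bound(64, 64, 5.0, 0.01))
```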

The following definition and lemma provide a bridge between the covering number and a high-probability bound on the supremum of a sub-gaussian process.

Definition 1.

A stochastic process {Xh}hH\{X_{h}\}_{h\in H} is called a sub-gaussian process for metric d(,)d(\cdot,\cdot) on HH if

𝔼[exp(λ(Xh1Xh2))]exp(λ2d(h1,h2)22) for λ,h1,h2H.\mathbb{E}\left[\exp\left(\lambda\left(X_{h_{1}}-X_{h_{2}}\right)\right)\right]\leq\exp\left(\frac{\lambda^{2}d(h_{1},h_{2})^{2}}{2}\right)~{}\text{ for }\lambda\in\mathbb{R},~{}h_{1},h_{2}\in H.

A stochastic process {Xh}hH\{X_{h}\}_{h\in H} is called a centered sub-gaussian process for metric d(,)d(\cdot,\cdot) on HH if {Xh}hH\{X_{h}\}_{h\in H} is a sub-gaussian process for metric d(,)d(\cdot,\cdot) and 𝔼[Xh]=0,hH\mathbb{E}[X_{h}]=0,~{}\forall h\in H.
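For instance, a centered Gaussian process is a centered sub-gaussian process for its canonical metric: since X_{h_{1}}-X_{h_{2}} is a centered Gaussian random variable,

\mathbb{E}\left[\exp\left(\lambda\left(X_{h_{1}}-X_{h_{2}}\right)\right)\right]=\exp\left(\frac{\lambda^{2}\operatorname{Var}\left(X_{h_{1}}-X_{h_{2}}\right)}{2}\right)\quad\text{ for }\lambda\in\mathbb{R},

so Definition 1 is satisfied with d(h_{1},h_{2})=\sqrt{\operatorname{Var}\left(X_{h_{1}}-X_{h_{2}}\right)}.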

Lemma 12.

Suppose {Xh}hH\{X_{h}\}_{h\in H} is a centered sub-gaussian process for metric Kd(,)K\cdot d(\cdot,\cdot) on metric space HH, where the diameter of HH is finite, i.e. diam(H)=suph1,h2Hd(h1,h2)<+\operatorname{diam}(H)=\sup_{h_{1},h_{2}\in H}d(h_{1},h_{2})<+\infty. Then with probability at least 1δ1-\delta, for any fixed h0Hh_{0}\in H, we have

suphH|XhXh0|6K(8diam(H)log(2δ)+k=κ2klog𝒩(H,d,2k))\displaystyle\sup_{h\in H}|X_{h}-X_{h_{0}}|\leq 6K\left(8\operatorname{diam}(H)\sqrt{\log\left(\frac{2}{\delta}\right)}+\sum_{k=-\kappa}^{\infty}2^{-k}\sqrt{\log\mathcal{N}\left(H,d,2^{-k}\right)}\right)

and

suphH|XhXh0|12K(4diam(H)log(2δ)+02diam(H)log𝒩(H,d,ϵ)dϵ),\displaystyle\sup_{h\in H}|X_{h}-X_{h_{0}}|\leq 12K\left(4\operatorname{diam}(H)\sqrt{\log\left(\frac{2}{\delta}\right)}+\int_{0}^{2\operatorname{diam}(H)}\sqrt{\log\mathcal{N}\left(H,d,\epsilon\right)}~{}\mathrm{d}\epsilon\right),

where κ+\kappa\in\mathbb{Z}_{+} satisfies 2κ1<diam(H)2κ2^{\kappa-1}<\operatorname{diam}(H)\leq 2^{\kappa}.
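To make the second bound of Lemma 12 concrete, the sketch below evaluates its right-hand side for the toy index set H=[0,1] equipped with the Euclidean metric, for which \operatorname{diam}(H)=1 and \mathcal{N}(H,d,\epsilon)\leq\lceil 1/(2\epsilon)\rceil\vee 1; the values of K and \delta and the discretization of the entropy integral are arbitrary illustrative choices.

```python
import numpy as np

K, delta, diam = 1.0, 0.05, 1.0        # H = [0, 1] with the Euclidean metric

def log_covering(eps):
    # N(H, d, eps) <= max(ceil(1 / (2 eps)), 1) for the unit interval
    return np.log(np.maximum(np.ceil(1.0 / (2.0 * eps)), 1.0))

# Entropy integral int_0^{2 diam} sqrt(log N(H, d, eps)) d eps, via a Riemann sum
eps = np.linspace(1e-6, 2 * diam, 200_000)
entropy_integral = np.mean(np.sqrt(log_covering(eps))) * (2 * diam - 1e-6)

bound = 12 * K * (4 * diam * np.sqrt(np.log(2 / delta)) + entropy_integral)
print("high-probability bound on sup_h |X_h - X_{h_0}|:", bound)
```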

Proof of Lemma 12.

Let \kappa\in\mathbb{Z}_{+} satisfy 2^{\kappa-1}<\operatorname{diam}(H)\leq 2^{\kappa}. Define \epsilon_{k}=2^{-k} for k\in\mathbb{Z}, k\geq-\kappa. Let H_{k} be an \epsilon_{k}-net of H with respect to the metric d(\cdot,\cdot), i.e., H_{k}\subset H covers H at scale \epsilon_{k}. Since \epsilon_{-\kappa}=2^{\kappa}\geq\operatorname{diam}(H), we may take H_{-\kappa}=\{h_{0}\}. Define \pi_{k}(h) as the element of H_{k} closest to h under d(\cdot,\cdot). Then for every h\in H, we have

XhXh0=k=κ+1(Xπk(h)Xπk1(h))a.s..\displaystyle X_{h}-X_{h_{0}}=\sum_{k=-\kappa+1}^{\infty}\left(X_{\pi_{k}(h)}-X_{\pi_{k-1}(h)}\right)\quad a.s.~{}.

Thus

suphH|XhXh0|k=κ+1suphH|Xπk(h)Xπk1(h)|a.s..\displaystyle\sup_{h\in H}|X_{h}-X_{h_{0}}|\leq\sum_{k=-\kappa+1}^{\infty}\sup_{h\in H}\left|X_{\pi_{k}(h)}-X_{\pi_{k-1}(h)}\right|\quad a.s.~{}.

Consider P_{k}=\{X_{\pi_{k}(h)}-X_{\pi_{k-1}(h)}:h\in H\}. Then |P_{k}|\leq|H_{k-1}||H_{k}|\leq|H_{k}|^{2}, and every element of P_{k} is K(\epsilon_{k}+\epsilon_{k-1})-sub-gaussian since d(\pi_{k}(h),\pi_{k-1}(h))\leq\epsilon_{k}+\epsilon_{k-1}. By Hoeffding’s inequality and a union bound argument, we have

(supXPk|X|t)\displaystyle\mathbb{P}\left(\sup_{X\in P_{k}}|X|\geq t\right) =(XPk{|X|t})\displaystyle=\mathbb{P}\left(\bigcup_{X\in P_{k}}\left\{|X|\geq t\right\}\right)
XPk(|X|t)\displaystyle\leq\sum_{X\in P_{k}}\mathbb{P}\left(|X|\geq t\right)
2|Pk|exp(t22K2(ϵk1+ϵk)2)\displaystyle\leq 2|P_{k}|\exp\left(-\frac{t^{2}}{2K^{2}(\epsilon_{k-1}+\epsilon_{k})^{2}}\right)
2|Pk|exp(t218K2ϵk2).\displaystyle\leq 2|P_{k}|\exp\left(-\frac{t^{2}}{18K^{2}\epsilon_{k}^{2}}\right).

For \delta_{k}\leq 1/2, set 2|P_{k}|\exp\left(-t^{2}/(18K^{2}\epsilon_{k}^{2})\right)=\delta_{k}, i.e., t=\sqrt{18}K\epsilon_{k}\sqrt{\log(|P_{k}|)+\log(2/\delta_{k})}\leq 3\sqrt{2}K\epsilon_{k}\left(\sqrt{\log(|P_{k}|)}+\sqrt{\log(2/\delta_{k})}\right). Then with probability at least 1-\delta_{k}, we have

supXPk|X|\displaystyle\sup_{X\in P_{k}}|X| 32Kϵk(log(|Pk|)+log(2/δk))\displaystyle\leq 3\sqrt{2}K\epsilon_{k}\left(\sqrt{\log(|P_{k}|)}+\sqrt{\log(2/\delta_{k})}\right)
6Kϵk(log(|Hk|)+log(1/δk)).\displaystyle\leq 6K\epsilon_{k}\left(\sqrt{\log(|H_{k}|)}+\sqrt{\log(1/\delta_{k})}\right).

Thus, with probability at least 1k=κ+δk1-\sum_{k=-\kappa}^{+\infty}\delta_{k}, we get

\sup_{h\in H}|X_{h}-X_{h_{0}}|\leq 6K\sum_{k=-\kappa}^{\infty}2^{-k}\left(\sqrt{\log\mathcal{N}\left(H,d,2^{-k}\right)}+\sqrt{\log\left(1/\delta_{k}\right)}\right).

Let δk=δ/2k+κ+1\delta_{k}=\delta/2^{k+\kappa+1}. Then k=κδk=δ\sum_{k=-\kappa}^{\infty}\delta_{k}=\delta. We have

k=κ2klog(1/δk)\displaystyle\sum_{k=-\kappa}^{\infty}2^{-k}\sqrt{\log\left(1/\delta_{k}\right)} =k=κ2klog(2k+κ+1/δ)\displaystyle=\sum_{k=-\kappa}^{\infty}2^{-k}\sqrt{\log\left(2^{k+\kappa+1}/\delta\right)}
k=κ2kk+κ+1log(2/δ)\displaystyle\leq\sum_{k=-\kappa}^{\infty}2^{-k}\sqrt{k+\kappa+1}\sqrt{\log\left(2/\delta\right)}
8diam(H)log(2/δ).\displaystyle\leq 8\operatorname{diam}(H)\sqrt{\log\left(2/\delta\right)}~{}.

Thus,

suphH|XhXh0|6K(8diam(H)log(2δ)+k=κ2klog𝒩(H,d,2k)).\displaystyle\sup_{h\in H}|X_{h}-X_{h_{0}}|\leq 6K\left(8\operatorname{diam}(H)\sqrt{\log\left(\frac{2}{\delta}\right)}+\sum_{k=-\kappa}^{\infty}2^{-k}\sqrt{\log\mathcal{N}\left(H,d,2^{-k}\right)}\right).

Since

k=κ2klog𝒩(H,d,2k)202κlog𝒩(H,d,ϵ)dϵ202diam(H)log𝒩(H,d,ϵ)dϵ,\displaystyle\sum_{k=-\kappa}^{\infty}2^{-k}\sqrt{\log\mathcal{N}\left(H,d,2^{-k}\right)}\leq 2\int_{0}^{2^{\kappa}}\sqrt{\log\mathcal{N}\left(H,d,\epsilon\right)}~{}\mathrm{d}\epsilon\leq 2\int_{0}^{2\operatorname{diam}(H)}\sqrt{\log\mathcal{N}\left(H,d,\epsilon\right)}~{}\mathrm{d}\epsilon~{},

the lemma is proved. ∎

Lemma 13.

(Theorem 5.1 in De Ryck et al. (2021))   Let d,s+d,s\in\mathbb{N}_{+}, δ>0\delta>0 and fWs,([0,1]d)f\in W^{s,\infty}([0,1]^{d}). There exist constants 𝒞(d,s,f)\mathcal{C}(d,s,f) and N0(d)>0N_{0}(d)>0 such that for every integer N>N0(d)N>N_{0}(d), there exists a tanh neural network f^N\hat{f}^{N} with two hidden layers, with one width at most 3s/2(s+d1d)+d(N1)3\lceil s/2\rceil\tbinom{s+d-1}{d}+d(N-1) and the other width at most 3(d+2)/2(2d+1d)Nd3\lceil(d+2)/2\rceil\tbinom{2d+1}{d}N^{d} (or 3s/2+N13\lceil s/2\rceil+N-1 and 6N6N for d=1d=1), such that

ff^NL([0,1]d)(1+δ)𝒞(d,s,f)Ns.\displaystyle\left\|f-\hat{f}^{N}\right\|_{L^{\infty}([0,1]^{d})}\leq(1+\delta)\frac{\mathcal{C}(d,s,f)}{N^{s}}~{}.

If fCs([0,1]d)f\in C^{s}([0,1]^{d}), then it holds that

𝒞(d,s,f)=(3d)ss!2sfWs,([0,1]d),N0(d)=3d2,\displaystyle\mathcal{C}(d,s,f)=\frac{(3d)^{s}}{s!2^{s}}\|f\|_{W^{s,\infty}([0,1]^{d})},\quad N_{0}(d)=\frac{3d}{2},

and else, it holds that

𝒞(d,s,f)=π1/4s(5d)s(s1)!fWs,([0,1]d),N0(d)=5d2.\displaystyle\mathcal{C}(d,s,f)=\frac{\pi^{1/4}\sqrt{s}(5d)^{s}}{(s-1)!}\|f\|_{W^{s,\infty}([0,1]^{d})},\quad N_{0}(d)=5d^{2}.

Moreover, the weights of f^N\hat{f}^{N} scale as O(𝒞(d,s,f)s/2Nd(d+s2)/2(s(s+2))3s(s+2))O(\mathcal{C}(d,s,f)^{-s/2}N^{d(d+s^{2})/2}(s(s+2))^{3s(s+2)}).
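As an empirical companion to Lemma 13 (and not the constructive two-hidden-layer network of De Ryck et al. (2021)), the sketch below fits only the output layer of a one-hidden-layer tanh network with randomly drawn hidden weights to a smooth function on [0,1] by least squares, and reports the resulting sup-norm error for several widths; the target function, the weight scale, and the widths are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)              # a smooth target on [0, 1]
x_train = np.linspace(0.0, 1.0, 512)[:, None]
x_test = np.linspace(0.0, 1.0, 2048)[:, None]

for width in (4, 8, 16, 32, 64):
    W = rng.normal(scale=10.0, size=(1, width))  # random hidden-layer weights
    b = rng.uniform(-10.0, 10.0, size=width)     # random hidden-layer biases
    Phi = np.tanh(x_train @ W + b)               # tanh features on the training grid
    coef, *_ = np.linalg.lstsq(Phi, f(x_train).ravel(), rcond=None)
    err = np.max(np.abs(np.tanh(x_test @ W + b) @ coef - f(x_test).ravel()))
    print(f"width = {width:3d}   sup-norm error on the test grid = {err:.3e}")
```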

Remark 14.

By Lemma 13, there exists a constant C(\delta) depending only on \delta such that

|the weights of f^N|C(δ)𝒞(d,s,f)s/2Nd(d+s2)/2(s(s+2))3s(s+2).\displaystyle|\text{the weights of $\hat{f}^{N}$}|\leq C(\delta)\mathcal{C}(d,s,f)^{-s/2}N^{d(d+s^{2})/2}(s(s+2))^{3s(s+2)}.
Lemma 14.

(Corollary 5.8 in De Ryck et al. (2021)) Let d+d\in\mathbb{N}_{+}, Ωd\Omega\subset\mathbb{R}^{d} open with [0,1]dΩ[0,1]^{d}\subset\Omega and let ff be analytic on Ω\Omega. If, for some C>0C>0, ff satisfies that fWs,([0,1]d)Cs\|f\|_{W^{s,\infty}([0,1]^{d})}\leq C^{s} for all ss\in\mathbb{N}, then for any 𝒩+\mathcal{N}\in\mathbb{N}_{+}, there exists a one-layer tanh\tanh neural network f^𝒩\hat{f}^{\mathcal{N}} of width 3(𝒩+5Cd)/2(𝒩+(5C+1)dd)3\lceil(\mathcal{N}+5Cd)/2\rceil\tbinom{\mathcal{N}+(5C+1)d}{d} (or 3𝒩/23\lceil\mathcal{N}/2\rceil for d=1d=1) such that

\left\|f-\hat{f}^{\mathcal{N}}\right\|_{L^{\infty}([0,1]^{d})}\leq\exp(-\mathcal{N})~{}.
Remark 15.

In De Ryck et al. (2021), the construction of \hat{f}^{\mathcal{N}} in Lemma 14 uses Lemma 13 directly. Hence a bound on the weights of \hat{f}^{\mathcal{N}} can be derived from Lemma 13: there exists a constant \tilde{C} such that

|the weights of f^𝒩|C~exp(𝒩2+𝒩3Cd𝒩2)(𝒩(𝒩+2))3𝒩(𝒩+2),\displaystyle|\text{the weights of $\hat{f}^{\mathcal{N}}$}|\leq\tilde{C}\exp(\frac{{\mathcal{N}^{\prime}}^{2}+\mathcal{N}^{\prime}-3Cd\mathcal{N}^{\prime}}{2})(\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2))^{3\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2)},

where \mathcal{N}^{\prime}=\mathcal{N}+5Cd. We emphasize that De Ryck et al. (2021) do not state this bound explicitly, but it follows from Lemma 13 by a direct calculation.

References

  • Aalen et al. [2008] Odd Aalen, Ornulf Borgan, and Hakon Gjessing. Survival and event history analysis: a process point of view. Springer Science & Business Media, 2008.
  • Bauwens and Hautsch [2009] Luc Bauwens and Nikolaus Hautsch. Modelling financial high frequency data using point processes. In Handbook of financial time series, pages 953–979. Springer, 2009.
  • Brémaud and Massoulié [1996] Pierre Brémaud and Laurent Massoulié. Stability of nonlinear Hawkes processes. The Annals of Probability, pages 1563–1588, 1996.
  • Cai et al. [2022] Biao Cai, Jingfei Zhang, and Yongtao Guan. Latent network structure learning from high-dimensional multivariate point processes. Journal of the American Statistical Association, pages 1–14, 2022.
  • Cao et al. [2019] Jian Cao, Zhi Li, and Jian Li. Financial time series forecasting model based on CEEMDAN and LSTM. Physica A: Statistical Mechanics and its Applications, 519:127–139, 2019.
  • Chen et al. [2020] Minshuo Chen, Xingguo Li, and Tuo Zhao. On generalization bounds of a family of recurrent neural networks. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pages 1233–1243. PMLR, 2020.
  • Chimmula and Zhang [2020] Vinay Kumar Reddy Chimmula and Lei Zhang. Time series forecasting of COVID-19 transmission in Canada using LSTM networks. Chaos, Solitons & Fractals, 135:109864, 2020.
  • Cybenko [1989] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • Daley and Vere-Jones [2008] Daryl J Daley and David Vere-Jones. An introduction to the theory of point processes: volume II: general theory and structure. Springer, 2008.
  • Daley et al. [2003] Daryl J Daley, David Vere-Jones, et al. An introduction to the theory of point processes: volume I: elementary theory and methods. Springer, 2003.
  • De Ryck et al. [2021] Tim De Ryck, Samuel Lanthaler, and Siddhartha Mishra. On the approximation of functions by tanh neural networks. Neural Networks, 143:732–750, 2021.
  • Du et al. [2015] Nan Du, Yichen Wang, Niao He, Jimeng Sun, and Le Song. Time-sensitive recommendation from recurrent user activities. Advances in neural information processing systems, 28, 2015.
  • Du et al. [2016] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1555–1564, 2016.
  • Dugas et al. [2001] Michel J Dugas, Patrick Gosselin, and Robert Ladouceur. Intolerance of uncertainty and worry: Investigating specificity in a nonclinical sample. Cognitive therapy and Research, 25:551–558, 2001.
  • Enguehard et al. [2020] Joseph Enguehard, Dan Busbridge, Adam Bozson, Claire Woodcock, and Nils Hammerla. Neural temporal point processes for modelling electronic health records. In Machine Learning for Health, pages 85–113. PMLR, 2020.
  • Fang et al. [2023] Guanhua Fang, Ganggang Xu, Haochen Xu, Xuening Zhu, and Yongtao Guan. Group network Hawkes process. Journal of the American Statistical Association, pages 1–17, 2023.
  • Farajtabar et al. [2017] Mehrdad Farajtabar, Yichen Wang, Manuel Gomez-Rodriguez, Shuang Li, Hongyuan Zha, and Le Song. Coevolve: A joint point process model for information diffusion and network evolution. Journal of Machine Learning Research, 18(41):1–49, 2017.
  • Fleming and Harrington [2013] Thomas R Fleming and David P Harrington. Counting processes and survival analysis, volume 625. John Wiley & Sons, 2013.
  • Fukushima [1969] Kunihiko Fukushima. Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics, 5(4):322–333, 1969.
  • Gautschi [1990] Walter Gautschi. How (un)stable are Vandermonde systems? Asymptotic and Computational Analysis, 1990. URL https://api.semanticscholar.org/CorpusID:18896588.
  • Hansen et al. [2015] Niels Richard Hansen, Patricia Reynaud-Bouret, and Vincent Rivoirard. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 2015.
  • Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  • Hawkes [1971] Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
  • Hawkes [2018] Alan G Hawkes. Hawkes processes and their applications to finance: a review. Quantitative Finance, 18(2):193–198, 2018.
  • Hawkes and Oakes [1974] Alan G Hawkes and David Oakes. A cluster process representation of a self-exciting process. Journal of applied probability, 11(3):493–503, 1974.
  • Hosseini et al. [2017] Seyed Abbas Hosseini, Keivan Alizadeh, Ali Khodadadi, Ali Arabzadeh, Mehrdad Farajtabar, Hongyuan Zha, and Hamid R Rabiee. Recurrent Poisson factorization for temporal recommendation. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 847–855, 2017.
  • Isham and Westcott [1979] Valerie Isham and Mark Westcott. A self-correcting point process. Stochastic processes and their applications, 8(3):335–347, 1979.
  • James et al. [2013] Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. An introduction to statistical learning, volume 112. Springer, 2013.
  • Jiao et al. [2023] Yuling Jiao, Guohao Shen, Yuanyuan Lin, and Jian Huang. Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of Statistics, 51(2):691–716, 2023.
  • Kingman [1992] John Frank Charles Kingman. Poisson processes, volume 3. Clarendon Press, 1992.
  • Laub et al. [2021] Patrick J Laub, Young Lee, and Thomas Taimre. The elements of Hawkes processes. Springer, 2021.
  • Li et al. [2018] Shuang Li, Shuai Xiao, Shixiang Zhu, Nan Du, Yao Xie, and Le Song. Learning temporal point processes via reinforcement learning. Advances in neural information processing systems, 31, 2018.
  • Lin et al. [2022] Haitao Lin, Lirong Wu, Guojiang Zhao, Pai Liu, and Stan Z Li. Exploring generative neural temporal point process. arXiv preprint arXiv:2208.01874, 2022.
  • Lu et al. [2021] Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021.
  • McCulloch and Pitts [1943] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5:115–133, 1943.
  • Medsker and Jain [1999] Larry Medsker and Lakhmi C Jain. Recurrent neural networks: design and applications. CRC press, 1999.
  • Mei and Eisner [2017] Hongyuan Mei and Jason M Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. Advances in neural information processing systems, 30, 2017.
  • Ogata and Vere-Jones [1984] Yosihiko Ogata and David Vere-Jones. Inference for earthquake models: a self-correcting model. Stochastic processes and their applications, 17(2):337–347, 1984.
  • Omi et al. [2019] Takahiro Omi, Kazuyuki Aihara, et al. Fully neural network based model for general temporal point processes. Advances in neural information processing systems, 32, 2019.
  • Perkel et al. [1967] Donald H Perkel, George L Gerstein, and George P Moore. Neuronal spike trains and stochastic point processes: I. the single spike train. Biophysical journal, 7(4):391–418, 1967.
  • Rubanova et al. [2019] Yulia Rubanova, Ricky TQ Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. Advances in neural information processing systems, 32, 2019.
  • Schäfer and Zimmermann [2007] Anton Maximilian Schäfer and Hans-Georg Zimmermann. Recurrent neural networks are universal approximators. International journal of neural systems, 17(04):253–263, 2007.
  • Schmidt-Hieber [2020] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4), 2020. doi: 10.1214/19-AOS1875. URL https://doi.org/10.1214/19-AOS1875.
  • Schoenberg [2005] Frederic Paik Schoenberg. Consistent parametric estimation of the intensity of a spatial–temporal point process. Journal of Statistical Planning and Inference, 128(1):79–93, 2005.
  • Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Shchur et al. [2021] Oleksandr Shchur, Ali Caner Türkmen, Tim Januschowski, and Stephan Günnemann. Neural temporal point processes: A review. arXiv preprint arXiv:2104.03528, 2021.
  • Shen et al. [2019] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. arXiv preprint arXiv:1906.05497, 2019.
  • Suh and Cheng [2024] Namjoon Suh and Guang Cheng. A survey on statistical theory of deep learning: Approximation, training dynamics, and generative models. arXiv preprint arXiv:2401.07187, 2024.
  • Tarwani and Edem [2017] Kanchan M Tarwani and Swathi Edem. Survey on recurrent neural network in natural language processing. Int. J. Eng. Trends Technol, 48(6):301–304, 2017.
  • Tu et al. [2020] Zhuozhuo Tu, Fengxiang He, and Dacheng Tao. Understanding generalization in recurrent neural networks. In International Conference on Learning Representations, 2020. URL https://api.semanticscholar.org/CorpusID:214346647.
  • Vidyasagar [2013] Mathukumalli Vidyasagar. Learning and generalisation: with applications to neural networks. Springer Science & Business Media, 2013.
  • Wang et al. [2012] Ting Wang, Mark Bebbington, and David Harte. Markov-modulated Hawkes process with stepwise decay. Annals of the Institute of Statistical Mathematics, 64:521–544, 2012.
  • Williams et al. [2020] Alex Williams, Anthony Degleris, Yixin Wang, and Scott Linderman. Point process models for sequence detection in high-dimensional neural spike trains. Advances in neural information processing systems, 33:14350–14361, 2020.
  • Yin et al. [2017] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923, 2017.
  • Zhang et al. [2021] Yizhou Zhang, Karishma Sharma, and Yan Liu. Vigdet: Knowledge informed neural temporal point process for coordination detection on social media. Advances in Neural Information Processing Systems, 34:3218–3231, 2021.
  • Zhou et al. [2022] Zihao Zhou, Xingyi Yang, Ryan Rossi, Handong Zhao, and Rose Yu. Neural point process for learning spatiotemporal event dynamics. In Learning for Dynamics and Control Conference, pages 777–789. PMLR, 2022.
  • Zuo et al. [2020] Simiao Zuo, Haoming Jiang, Zichong Li, Tuo Zhao, and Hongyuan Zha. Transformer Hawkes process. In International conference on machine learning, pages 11692–11702. PMLR, 2020.