
On Non-asymptotic Theory of Recurrent Neural Networks in Temporal Point Processes

Zhiheng Chen1,   Guanhua Fang2∗,   Wen Yu2
1 Shanghai Center for Mathematical Sciences, Fudan University
2 Department of Statistics and Data Science, Fudan University
fanggh@fudan.edu.cn
Abstract

The temporal point process (TPP) is an important tool for modeling and predicting irregularly timed events across various domains. Recently, recurrent neural network (RNN)-based TPPs have shown practical advantages over traditional parametric TPP models. However, the theoretical understanding of neural TPPs remains nascent. In this paper, we establish excess risk bounds for RNN-TPPs under many well-known TPP settings. In particular, we show that an RNN-TPP with no more than four layers can achieve vanishing generalization errors. Our technical contributions include the characterization of the complexity of the multi-layer RNN class, the construction of $\tanh$ neural networks for approximating dynamic event intensity functions, and a truncation technique for alleviating the issue of unbounded event sequences. Our results bridge the gap between TPP applications and neural network theory.

1 Introduction

Temporal point process (TPP) (Daley et al., 2003; Daley and Vere-Jones, 2008) is an important mathematical framework that provides tools for analyzing and predicting the timing and patterns of events in continuous time. TPP particularly deals with event streaming data where the events occur at irregular time stamps, which is different from classical time series analysis that often assumes a regular time spacing between data points. In real world applications, the events could be anything from transactions in financial markets (Bauwens and Hautsch, 2009; Hawkes, 2018) to user activities in online social network platforms (Farajtabar et al., 2017; Fang et al., 2023), earthquakes in seismology (Wang et al., 2012; Laub et al., 2021), neural spikes in biological experiments (Perkel et al., 1967; Williams et al., 2020), or failure times in survival analysis (Aalen et al., 2008; Fleming and Harrington, 2013).

With the advent of artificial intelligence in recent decades, the neural network (McCulloch and Pitts, 1943) has proved to be a powerful architecture that can be adapted to different applications with distinct purposes. In modern machine learning, researchers have also incorporated deep neural networks into TPPs to handle complex patterns and dependencies in event data, leading to advances in many areas such as recommendation systems (Du et al., 2015; Hosseini et al., 2017), social network analysis (Du et al., 2016; Zhang et al., 2021), healthcare analytics (Li et al., 2018; Enguehard et al., 2020), etc. Many new TPP models have been proposed in the recent literature, including but not limited to the recurrent temporal point process (Du et al., 2016), the fully neural network TPP model (Omi et al., 2019), and the transformer Hawkes process (Zuo et al., 2020); see Shchur et al. (2021); Lin et al. (2022) and the references therein for a more comprehensive review.

Despite the recent progress in TPP applications mentioned above, there is a lack of theoretical understanding of neural TPPs. A fundamental question remains: can a neural network-based TPP provably achieve a small generalization error? In this paper, we provide an affirmative answer to this question for recurrent neural network (RNN, Medsker and Jain (1999))-based TPPs. To be specific, we establish non-asymptotic rates for generalization error bounds under mild model assumptions and provide constructions of RNN architectures that can approximate many widely-used TPPs, including the homogeneous Poisson process, the non-homogeneous Poisson process, the self-exciting process, etc.

There are a few challenges in developing the theory of RNN-based TPPs. (a) Characterization of the functional space. In machine learning theory, it is necessary to specify the model space before deriving any generalization errors. In our setting, matters become more complicated since the model should be data-dependent (i.e., adapt to the past events). Otherwise, the model could not capture the information in the event history and would fail to provide a good fit. (b) Expressive power of the RNN architecture. RNN is the most widely adopted neural architecture in TPP modelling. However, it remains questionable whether RNNs can approximate most well-known temporal point processes. If the answer is yes, it would be of great interest to know how many hidden layers and how large hidden dimensions are sufficient for the approximation. (c) Expressive power of the activation function. In modern neural networks, the activation function is chosen to be a simple non-linear function for the sake of computational feasibility. In RNNs, it is taken to be $\tanh$ by default. It is therefore important to understand the approximation power of $\tanh$ activation functions. (d) Variable length of event sequences. Unlike standard RNN modelling where each sample is assumed to have the same number of observations (events), the event sequences in our setting may vary from one to another. In addition, their lengths are potentially unbounded. These issues add difficulty to computing the complexity of the model space.

To overcome the above challenges, we adopt the following approaches. (a) In TPPs, the intensity function is the core object. We recursively construct the multi-layer hidden cells through RNNs to store the event information and adopt a suitable output layer to compute the intensity value. Equipped with suitable input embeddings, our construction can capture the information of the event history and adapt to variable lengths of event sequences. (b) For four main categories of TPPs, the homogeneous Poisson process, the non-homogeneous Poisson process, the self-exciting process, and the self-correcting process, we carefully study their intensity formulas. We can decompose the intensity function into different parts and approximate them component-wise. Our construction explicitly gives the upper bounds on the model depth, the width of the hidden layers, and the parameter weights of the RNN architecture needed to achieve a certain level of approximation accuracy. (c) We use the results in a recent work (De Ryck et al., 2021), which provides the approximation ability of one- and two-layer $\tanh$ neural networks. We adapt such results to our specific RNN structure and give universal approximation results for each of the intensity components. (d) Thanks to the exponential decay property of the tail probability of the sequence length, we are able to use a truncation technique to decouple the randomness of independent and identically distributed (i.i.d.) samples and the lengths of event sequences. For the space of truncated loss functions, the space complexity can be obtained by calculating the covering number. The classical chaining methods in empirical process theory can hence be applied as well.

Our main technical contributions can be summarized as follows.

(i) In the analysis of the stochastic error in the excess risk of RNN-based TPPs, we provide a truncation technique to decompose the randomness into a bounded component and a tail component. By carefully balancing the two parts, we establish a nearly optimal stochastic error bound. Additionally, we derive the complexity of the multi-layer RNN-based TPP class, where we precisely analyze and compute the Lipschitz constant of the RNN architecture. This extends the existing result in Chen et al. (2020), which only gives the Lipschitz constant of a single-layer RNN. Therefore, our truncation technique and the Lipschitz result for multi-layer RNNs can be useful and of independent interest for many other related problems.

(ii) We establish approximation error bounds for the intensity functions of TPPs of four main categories. To the best of our knowledge, there is very little work (De Ryck et al., 2021) studying the approximation properties of the $\tanh$ activation function. Our work is the first to provide approximation results for RNN-based statistical models. Our construction procedure largely depends on the Markov nature (Laub et al., 2021) of self-exciting processes, which allows us to design hidden cells that store sufficient information about past events. Moreover, we decompose the excitation function into different parts, each of which is a simple smooth function (i.e., either an exponential or a trigonometric function) that can be well approximated by a single-layer $\tanh$ network. Our construction method can be viewed as a useful tool for analyzing other sequential-type neural networks.

(iii) We illustrate the differences between the architectures of classical RNNs and RNN-based TPPs. Note that the observed events happen at discrete time grids, while TPP models must take into account the continuous time domain. Therefore, the interpolation of the values in the hidden cells at each time point is important and necessary. We show that improper interpolation mechanisms (e.g., constant, linear, or exponential-decay interpolation) may fail to endow the RNN-based TPP with the universal approximation ability. Our result indicates that the input embedding plays an important role in interpolating the hidden states.

The rest of the paper is organized as follows. In Section 2, the background of TPPs, the formulation of RNN-based TPPs, and useful notations are introduced. The main theories, along with high-level explanations, are given in Section 3. The technical tools for analyzing the stochastic error are provided in Section 4. The construction procedures for approximating different types of intensity functions are given in Section 5. In Section 6, we explain how improper interpolation of the hidden states in RNN-TPPs may lead to unsatisfactory approximation results. The concluding remarks are given in Section 7.

2 Preliminaries

2.1 Framework Specification

We observe a set of $n$ irregular event time sequences,

\mathbf{D}_{train}:=\{S_{i};i=1,...,n\}=\{(t_{i,1},...,t_{i,N_{ei}});i=1,...,n\}, (1)

where $0<t_{i,1}<...<t_{i,j}<...<t_{i,N_{ei}}\leq T$ with $T$ being the end time point, and $N_{ei}$ is the number of events in the $i$-th sequence $S_{i}$. It is assumed that each $S_{i}$ is independently generated from a TPP model with an unknown intensity function $\lambda^{\ast}(t)$ defined on $[0,T]$. That is,

\lambda^{\ast}(t):=\lim_{dt\rightarrow 0}\frac{\mathbb{E}[N[t,t+dt)|\mathcal{H}_{t}]}{dt},

where $N[t,t+dt):=N(t+dt)-N(t)$ with $N(t):=\sharp\{i:t_{i}\leq t\}$ being the number of events observed up to time $t$, and $\mathcal{H}_{t}:=\sigma(\{N(s);s<t\})$ is the history filtration before time $t$.

In the literature on TPP learning (Shchur et al., 2021), the primary goal is to estimate $\lambda^{\ast}(t)$ based on $\mathbf{D}_{train}$. Throughout the current work, we adopt the negative log-likelihood function as our objective. To be specific, for any event time sequence $S=(t_{1},...,t_{N_{e}})$, we define

\text{loss}(\lambda,S):=-\left\{\sum_{j=1}^{N_{e}}\log\lambda(t_{j})-\int_{0}^{T}\lambda(t)\mathrm{d}t\right\}. (2)

Then the estimator can be defined as

\hat{\lambda}:=\arg\min_{\lambda\in\mathcal{F}}\text{loss}(\lambda):=\arg\min_{\lambda\in\mathcal{F}}\left\{\frac{1}{n}\sum_{i=1}^{n}\text{loss}(\lambda,S_{i})\right\}, (3)

where $\mathcal{F}$ is a user-specified functional space. For example, in the existing works, $\mathcal{F}$ can be taken as any space of parametric models (Schoenberg, 2005; Laub et al., 2021), nonparametric models (Cai et al., 2022; Fang et al., 2023), or neural network models (Du et al., 2016; Mei and Eisner, 2017).
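Concretely, the loss in (2) and the empirical risk in (3) can be evaluated numerically for any candidate intensity. The sketch below is our own illustration (not the authors' code); it assumes a hypothetical `intensity(t, history)` interface and uses a simple trapezoidal rule for the compensator integral.

```python
import numpy as np

def tpp_nll(intensity, events, T, n_grid=1000):
    """Negative log-likelihood (2): -(sum_j log lam(t_j) - int_0^T lam(t) dt)."""
    log_term = np.sum(np.log([intensity(t, events[events < t]) for t in events]))
    grid = np.linspace(0.0, T, n_grid)
    lam_grid = np.array([intensity(t, events[events < t]) for t in grid])
    compensator = np.trapz(lam_grid, grid)   # numerical integral of lam over [0, T]
    return -(log_term - compensator)

def empirical_risk(intensity, sequences, T):
    """Average loss over the n training sequences, as in (3)."""
    return np.mean([tpp_nll(intensity, np.asarray(S), T) for S in sequences])

# Example with a homogeneous Poisson intensity lam(t) = 1.5 (arbitrary value).
const_intensity = lambda t, history: 1.5
print(empirical_risk(const_intensity, [[0.4, 1.1, 2.3], [0.7, 1.9]], T=3.0))
```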

In the language of deep learning, $\mathbf{D}_{train}$ is also called a training data set, $\text{loss}(\lambda)$ is known as the loss function of the predictor $\lambda$, and $\hat{\lambda}$ defined in (3) is the empirical risk minimizer (ERM). To evaluate the performance of $\hat{\lambda}$, a common practice in machine (deep) learning is to use the excess risk (Hastie et al., 2009; James et al., 2013; Vidyasagar, 2013; Shalev-Shwartz and Ben-David, 2014). To be mathematically formal, we define

\text{ER}(\hat{\lambda}):=\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})], (4)

where $S_{test}$ is a testing sample, i.e., a new event time sequence, which is independent of $\mathbf{D}_{train}$ and also follows the intensity $\lambda^{\ast}(t)$. The expectation here is taken with respect to the new testing data. We give a proof of $\text{ER}(\hat{\lambda})\geq 0$ in the supplementary material. As a result, (4) is a well-defined excess risk under our model setup.

2.2 RNN Structure

Throughout this paper, we consider $\mathcal{F}$ to be a space of RNN-based TPP models. An arbitrary intensity function $\lambda$ in $\mathcal{F}$, indexed by the parameter $\theta$, is defined through the following recursive formula,

\lambda_{\theta}(t;S):=f\left(W_{x}^{(L+1)}h^{(L)}(t;S)+b^{(L+1)}\right)\in\mathbb{R}^{1},~~\text{for}~t\in(t_{j},t_{j+1}], (5)

where the hidden vector function $h^{(L)}(t;S)$ has the following hierarchical form,

h^{(1)}(t;S)=\sigma\left(W_{x}^{(1)}x(t;S)+W_{h}^{(1)}h_{j}^{(1)}+b^{(1)}\right),
h^{(2)}(t;S)=\sigma\left(W_{x}^{(2)}h^{(1)}(t;S)+W_{h}^{(2)}h_{j}^{(2)}+b^{(2)}\right),
\vdots
h^{(L)}(t;S)=\sigma\left(W_{x}^{(L)}h^{(L-1)}(t;S)+W_{h}^{(L)}h_{j}^{(L)}+b^{(L)}\right),~~\text{for}~t\in(t_{j},t_{j+1}], (6)

with

h_{j}^{(1)}=\sigma\left(W_{x}^{(1)}x(t_{j};S)+W_{h}^{(1)}h_{j-1}^{(1)}+b^{(1)}\right),
h_{j}^{(2)}=\sigma\left(W_{x}^{(2)}h_{j}^{(1)}+W_{h}^{(2)}h_{j-1}^{(2)}+b^{(2)}\right),
\vdots
h_{j}^{(L)}=\sigma\left(W_{x}^{(L)}h_{j}^{(L-1)}+W_{h}^{(L)}h_{j-1}^{(L)}+b^{(L)}\right),~~\text{for}~j\in\{1,...,N_{e}\}. (7)

Here $\sigma$ and $f$ are two known activation functions for the hidden layers and the output layer, respectively. Both of them are pre-determined by the user. We specifically take $\sigma(x)=\tanh(x)=(\exp(x)-\exp(-x))/(\exp(x)+\exp(-x))$ and $f(x)=\min\{\max\{x,l_{f}\},u_{f}\}$, where $l_{f}$ and $u_{f}$ are two fixed positive constants. The input embedding vector function $x(t;S)$ is also known to the user before training. In the current work, we particularly take $x(t;S)=(t,t-F_{S}(t))^{\top}$, where $F_{S}(t)=t-t_{j}$ for $t\in(t_{j},t_{j+1}]$, $\forall j\leq N_{e}$. The model parameters consist of $W_{x}^{(l)}$, $W_{h}^{(l)}$, and $b^{(l)}$ ($1\leq l\leq L+1$). For notational simplicity, we concatenate all parameter matrices and vectors and write $\theta=\{W_{x}^{(l)},W_{h}^{(l)},b^{(l)};1\leq l\leq L+1\}$, where $W_{h}^{(L+1)}\equiv\mathbf{0}$. By default, we take the initial values $t_{0}\equiv 0$ and $h_{0}^{(l)}\equiv\mathbf{0}$ for $1\leq l\leq L$. The last time grid is $t_{N_{e}+1}\equiv T$. We call the model defined through equations (5)-(7) the RNN-TPP.
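To make the recursion (5)-(7) concrete, the following sketch (our own illustration, not the authors' code; layer widths and random parameter values are arbitrary) evaluates $\lambda_{\theta}(t;S)$ for a two-layer RNN-TPP with the input embedding $x(t;S)=(t,t-F_{S}(t))^{\top}$.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_in, d_hid = 2, 2, 8                      # two hidden layers of width 8 (arbitrary)
Wx = [rng.normal(size=(d_hid, d_in))] + [rng.normal(size=(d_hid, d_hid)) for _ in range(L - 1)]
Wh = [rng.normal(size=(d_hid, d_hid)) for _ in range(L)]
b = [rng.normal(size=d_hid) for _ in range(L)]
Wout, bout = rng.normal(size=d_hid), 0.1      # output layer W_x^{(L+1)}, b^{(L+1)}
lf, uf = 0.1, 10.0                            # clipping constants of f

def embed(t, t_last):
    # x(t;S) = (t, t - F_S(t)) with F_S(t) = t - t_j, so the second coordinate is t_j
    return np.array([t, t_last])

def intensity(t, events):
    """lambda_theta(t;S): recursion (7) over past events, then interpolation (6) at time t."""
    h = [np.zeros(d_hid) for _ in range(L)]   # h_0^{(l)} = 0
    t_last = 0.0
    for tj in [s for s in events if s < t]:   # update hidden states at each event time, eq. (7)
        inp = embed(tj, t_last)
        for l in range(L):
            inp = np.tanh(Wx[l] @ inp + Wh[l] @ h[l] + b[l])
            h[l] = inp
        t_last = tj
    inp = embed(t, t_last)                    # interpolate between events, eq. (6)
    for l in range(L):
        inp = np.tanh(Wx[l] @ inp + Wh[l] @ h[l] + b[l])
    return float(np.clip(Wout @ inp + bout, lf, uf))   # output layer, eq. (5)

print(intensity(2.5, [0.4, 1.1, 2.3]))
```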

Figure 1: Left: the classical RNN architecture. Right: the RNN-TPP architecture given in (5)-(7). The blue box represents the interpolation of hidden states.

Moreover, we define the maximum hidden size $D:=\max\{d_{1},d_{2},\cdots,d_{L}\}$, where $d_{l}$ is the dimension of the $l$-th hidden layer, and the parameter norm

\|\theta\|:=\max\left\{\|W_{x}^{(l)}\|_{2},\|W_{h}^{(l)}\|_{2},\|b^{(l)}\|_{2};1\leq l\leq L+1\right\}.

Then the RNN-TPP class $\mathcal{F}$ is described by

\mathcal{F}=\mathcal{F}_{L,D,B_{m},l_{f},u_{f}}:=\{\lambda_{\theta};~\|\theta\|\leq B_{m}\}, (8)

where $B_{m}$ may depend on the hidden size $D$ and the sample size $n$. To help readers gain more intuition, a graphical illustration of the network structure is given in Figure 1.

Remark 1.

The default choice (De Ryck et al., 2021) of activation function $\sigma(x)$ in RNNs is $\tanh(x)$. In practice, the number of layers $L$ is usually no more than 4.

Remark 2.

By the constructions (5)-(7), it is not hard to see that the intensity $\lambda_{\theta}(t;S)$ is a left-continuous function of $t$. In other words, it is a well-defined predictable function with respect to the information filtration generated by the event sequence $S$.

Remark 3.

In the standard application of RNN models, the training data usually consist of discrete-time sequences (e.g., sequences of tokens in natural language processing (NLP) (Yin et al., 2017; Tarwani and Edem, 2017); time series in financial market forecasting (Cao et al., 2019; Chimmula and Zhang, 2020)). Therefore, the classical (single-layer) RNN architecture is defined only on the discrete time grids. That is, the hidden vector at the $j$-th grid is

h_{j}=\sigma\left(W_{x}x_{j}+W_{h}h_{j-1}+b_{h}\right),

where $x_{j}$ is the corresponding embedding input. The prediction at time step $j$ is given by $y_{j}=f(W_{y}h_{j}+b_{y})\in\mathbb{R}$. In contrast, the RNN-based TPP model must take into account any time point $t$ between the grids $t_{j}$ and $t_{j+1}$. Hence the interpolation of $h^{(l)}(t;S)$ between $h_{j}^{(l)}$ and $h_{j+1}^{(l)}$ is heuristically necessary to give reasonable model predictions over the entire time interval $(t_{j},t_{j+1}]$.

Remark 4.

In the literature, there exist a few methods to interpolate the hidden embedding between $h_{j}^{(L)}$ and $h_{j+1}^{(L)}$. In Du et al. (2016), a constant embedding mechanism is used, i.e., $h^{(l)}(t;S)\equiv h_{j}^{(l)}$ for $t\in(t_{j},t_{j+1}]$ and any $j$ and $l$. In Mei and Eisner (2017), the authors adopted an exponential decay method to encode the hidden representations under an extended RNN architecture, the Long Short-Term Memory (LSTM) network. More recently, Rubanova et al. (2019) used the neural ordinary differential equation (ODE) method for solving the intermediate hidden state $h^{(l)}(t;S)$.

It can be shown that the first two interpolation methods are unable to precisely capture the true intensity in the sense of excess risk. We will give the explanation in Section 6; see Theorem 8.

Remark 5.

Our result still holds if $\tanh$ is replaced with other sigmoidal-type activation functions (Cybenko, 1989) (e.g., ReLU (Fukushima, 1969)). In the literature on TPP modelling, the most common choice of $f(x)$ is the Softplus function (Dugas et al., 2001; Zhou et al., 2022), $\log(1+\exp(x))$, which ensures that $\lambda_{\theta}(t;S)$ is positive and differentiable. Our result also holds if we take $f(x)$ to be $\min\{\max\{\log(1+\exp(x)),l_{f}\},u_{f}\}$ with $0<l_{f}<u_{f}$. Introducing $l_{f}$ and $u_{f}$ serves only a technical purpose, i.e., keeping the predicted intensity value bounded from above and below.

2.3 Classical TPPs

In the statistical literature, TPPs can be categorized into several types based on the nature of the intensity functions. Four main categories are summarized as follows.

Homogeneous Poisson process (Kingman, 1992). It is the simplest type, where events occur completely independently of one another and the intensity function is constant, i.e., $\lambda^{\ast}(t)\equiv\lambda$, where $\lambda$ is unknown and needs to be estimated.

Non-homogeneous Poisson process (Kingman, 1992; Daley et al., 2003). In this model, the intensity function varies over time but is still independent of past events. That is, $\lambda^{\ast}(t)$ is a non-constant unknown function that is usually estimated via certain nonparametric methods.

Self-exciting process (Hawkes and Oakes, 1974). Future events are influenced by past events, which can lead to clustering of events in time. A well-known example is the Hawkes process (Hawkes, 1971; Hawkes and Oakes, 1974), where the intensity function takes the form

\lambda^{\ast}(t)=\lambda_{0}(t)+\sum_{j:t_{j}<t}\mu(t-t_{j}), (9)

where $\lambda_{0}(t)$ and $\mu(t)$ are positive functions called the background intensity and the excitation/impact function, respectively. In many applications (Laub et al., 2021), the excitation function takes the exponential form $\mu(t)=\alpha\exp(-\beta t)$, which allows efficient computation. The model defined in (9) is also known as the linear self-exciting process since the intensity is an additive form of different components. More generally, the non-linear self-exciting process (Brémaud and Massoulié, 1996)

\lambda^{\ast}(t)=\Psi\left(\lambda_{0}(t)+\sum_{j:t_{j}<t}\mu(t-t_{j})\right), (10)

is also considered in the literature, where $\Psi$ is a non-linear function.

Self-correcting process (Isham and Westcott, 1979; Ogata and Vere-Jones, 1984). The occurrence of an event decreases the likelihood of future events for some time period. To be mathematically formal, the intensity takes the form

\lambda^{\ast}(t)=\Psi\left(\mu t-\sum_{j:t_{j}<t}\alpha\right), (11)

where both $\mu$ and $\alpha$ are positive and $\Psi$ may be a non-linear function.
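For illustration, the four intensity types above can be written in a few lines; the sketch below is our own example with arbitrary parameter values and evaluates each intensity at a time $t$ given the history of past event times.

```python
import numpy as np

def poisson(t, history, lam=1.5):
    return lam                                         # homogeneous Poisson: constant

def nonhomogeneous(t, history, a=1.0, b=0.5):
    return a + b * np.sin(t)                           # non-constant but history-free

def hawkes(t, history, lam0=0.5, alpha=0.8, beta=1.2):
    past = history[history < t]
    return lam0 + alpha * np.sum(np.exp(-beta * (t - past)))   # linear self-exciting, eq. (9)

def self_correcting(t, history, mu=1.0, alpha=0.3):
    past = history[history < t]
    return np.exp(mu * t - alpha * len(past))          # eq. (11) with Psi = exp

hist = np.array([0.4, 1.1, 2.3])
print(hawkes(2.5, hist), self_correcting(2.5, hist))
```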

2.4 Notations

Let $a\wedge b=\min\{a,b\}$ and $a\vee b=\max\{a,b\}$. We use $\mathbb{N}$ and $\mathbb{Z}$ to denote the set of nonnegative integers and the set of all integers, respectively. Denote $[n]=\{1,2,\cdots,n\}$ for a positive integer $n$. Let $\lceil a\rceil=\min\{b\in\mathbb{Z},b\geq a\}$. For a set $A$, denote $\#(A)$ to be its cardinality. For a vector $x=(x_{1},\cdots,x_{d})^{\top}\in\mathbb{R}^{d}$, denote its Euclidean norm as $\|x\|_{2}=\sqrt{\sum_{i=1}^{d}x_{i}^{2}}$. Write $a_{N}\lesssim b_{N}$ if there exists some constant $C>0$ such that $a_{N}\leq Cb_{N}$ for all indices $N$, where the range of $N$ may be defined case by case. For a function $f$ defined on some domain, denote $\|f\|_{L^{\infty}}$ as its essential upper bound. For $s\in\mathbb{N}$, the Sobolev norm $\|f\|_{W^{s,\infty}([0,T])}$ is defined as $\|f\|_{W^{s,\infty}([0,T])}=\max_{0\leq|\alpha|\leq s}\|D^{\alpha}f\|_{L^{\infty}([0,T])}$. For a constant $B_{0}>0$, the $B_{0}$-ball of the Sobolev space $W^{s,\infty}([0,T])$ is defined as

W^{s,\infty}([0,T],B_{0}):=\left\{f\in W^{s,\infty}([0,T]),\|f\|_{W^{s,\infty}([0,T])}\leq B_{0}\right\}.

For a constant $C_{0}>0$, the ball $C^{s,\infty}([0,T],C_{0})$ is the subset of $W^{s,\infty}([0,T],C_{0})$ that contains all $s$-order smooth functions. We use $O(\cdot)$ to hide all constants and use $\tilde{O}(\cdot)$ to denote $O(\cdot)$ with hidden log factors. Throughout this paper, $\alpha$, $\beta$, $\gamma$, $\mathcal{C}$, and $\mathcal{C}_{1}$ are positive real numbers and may be defined case by case.

3 Main Results

Recent applications in event stream analyses have witnessed the usefulness of TPPs with the incorporation of RNNs. However, there is no study in the existing literature to explain why the RNN structure in TPP modeling is so useful from the theoretical perspective. We attempt to answer the question of whether the RNN-TPPs can provably have small generalization error or excess risk. Our answer is positive! When the event data are generated according to the classical models described in Section 2.3, we show that the RNN-TPPs can perfectly generalize such data.

To make our presentation easier, we focus on the self-exciting processes; the homogeneous Poisson, non-homogeneous Poisson, and self-correcting processes can be treated similarly, for the following reasons. If we take $\mu(t)\equiv 0$ in (9), the linear self-exciting process reduces to the homogeneous or non-homogeneous Poisson process. In the RNN-TPP architecture, we can take the input embedding function $x(t;S)=(t,t-F_{S}(t),N(t-))$, i.e., use an additional input dimension to store the number of past events; then establishing the excess risk of the self-correcting process is technically equivalent to that of the non-homogeneous Poisson process. To start with, we first consider the linear case (9).

Some regularity assumptions should be stated before we present the main theorem.

(A1) There exists a constant $B_{0}>0$ such that $\lambda_{0}\in W^{s,\infty}([0,T],B_{0})$, where $s\geq 1$, $s\in\mathbb{N}$.

(A2) $\int_{0}^{T}\mu(t)\mathrm{d}t:=c_{\mu}<1$.

(A3) There exists a positive constant $B_{1}$ such that $\inf_{t\in[0,T]}\lambda_{0}(t)\geq B_{1}$.

Assumption (A1) imposes the boundedness of the background intensity, which is also common in neural network approximation studies. Assumption (A2) is standard in the Hawkes process literature and guarantees the existence of a stationary version of the process when $\lambda_{0}(t)$ is constant. Assumption (A3) is a lower bound assumption, which ensures that sufficient intensity exists in any subdomain of $[0,T]$.

Now we can present the results on the non-asymptotic bound of excess risk (4) under model (9).

Theorem 1.

Under model (9) and the RNN-TPP class $\mathcal{F}=\mathcal{F}_{L,D,B_{m},l_{f},u_{f}}$ defined as (8), suppose that assumptions (A1)-(A3) hold. Then, for $n$ i.i.d. sample sequences $\{S_{i},i\in[n]\}$, with probability at least $1-\delta$, the excess risk (4) of the ERM (3) satisfies:

(i) (Poisson case) If $\mu\equiv 0$, for $L=2$, $D=\tilde{O}(n^{\frac{1}{2(s+1)}})$, $B_{m}=\tilde{O}(n^{\frac{s+1}{4}})$, $l_{f}=B_{1}\wedge 1$, and $u_{f}=B_{0}$,

\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\leq\tilde{O}\left(n^{-\frac{s}{2(s+1)}}\right); (12)

(ii) (Vanilla Hawkes case) If $\mu(t)=\alpha\exp(-\beta t)$, for $L=2$, $D=\tilde{O}(n^{\frac{1}{2(s+1)}})$, $B_{m}=\tilde{O}((\log n)^{3s^{2}\log^{2}n})$, $l_{f}=B_{1}\wedge 1$, and $u_{f}=B_{0}+O(\log n)$,

\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\leq\tilde{O}\left(n^{-\frac{s}{2(s+1)}}\right); (13)

(iii) (General case) If $\mu\in C^{k,\infty}([0,T],C_{0})$, $k\geq 2$, $k\in\mathbb{N}$, for $L=2$, $D=\tilde{O}(n^{\frac{1}{2}\left(\frac{1}{s+1}\vee\frac{5}{k+4}\right)})$, $B_{m}=\tilde{O}((\log n)^{3s^{2}\log^{2}n})$, $l_{f}=B_{1}\wedge 1$, and $u_{f}=B_{0}+O(\log n)$,

\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\leq\tilde{O}\left(n^{-\frac{1}{2}\left(\frac{s}{s+1}\wedge\frac{k-1}{k+4}\right)}\right). (14)

As suggested by Theorem 1, there exists a two-layer RNN-TPP model whose excess risk vanishes as the size of the training set goes to infinity. The width of such a network grows with the sample size, while the depth remains two.

Remark 6.

Here we require the depth of the RNN-TPP to be $L=2$ due to the fact that $\lambda_{0}\in W^{s,\infty}([0,T],B_{0})$. However, if we allow $\lambda_{0}$ to be sufficiently smooth (i.e., $\lambda_{0}\in C^{\infty}([0,T])$), we only need a one-layer $\tanh$ neural network to approximate $\lambda_{0}$. As a result, the number of layers of the RNN-TPP can be reduced to one.

Now we consider the true model to be a non-linear Hawkes process, which is given in (10). For simplicity, we only consider the case $\mu(t)=\alpha\exp(-\beta t)$, which is

\lambda^{\ast}(t)=\Psi\left(\lambda_{0}(t)+\sum_{t_{i}<t}\alpha\exp(-\beta(t-t_{i}))\right). (15)

The regularity of $\Psi$ is presented as Assumption (A4).

(A4) The function $\Psi$ is $L$-Lipschitz, positive, and bounded. In other words, there exist $\tilde{B_{1}},\tilde{B_{0}}>0$ such that $\tilde{B_{1}}\leq\Psi\leq\tilde{B_{0}}$ and $|\Psi(x_{1})-\Psi(x_{2})|\leq L|x_{1}-x_{2}|$ for any $x_{1},x_{2}$.

We have a similar bound of excess risk (4) under model (15).

Theorem 2.

(Nonlinear Hawkes Case) Under model (15) and the RNN-TPP class $\mathcal{F}=\mathcal{F}_{L,D,B_{m},l_{f},u_{f}}$ defined as (8), suppose that assumptions (A1) and (A4) hold. Then, for $n$ i.i.d. sample sequences $\{S_{i},i\in[n]\}$, with probability at least $1-\delta$, for $L=4$, $D=\tilde{O}(n^{\frac{1}{4}})$, $B_{m}=\tilde{O}((\log n)^{3s^{2}\log^{2}n})$, $l_{f}=\tilde{B}_{1}\wedge 1$, and $u_{f}=\tilde{B}_{0}$, the excess risk (4) of the ERM (3) satisfies:

\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\leq\tilde{O}\left(n^{-\frac{1}{4}}\right). (16)

For the non-linear case, as indicated by Theorem 2, we require a deeper RNN-TPP with four layers to achieve a vanishing excess risk. Under the Lipschitz assumption on $\Psi$, the width of the hidden layers is of order $n^{1/4}$. When $\Psi$ is allowed to have higher-order smoothness, the width can be reduced to that of the vanilla Hawkes case.

Remark 7.

(i) Two additional RNN layers are required for the approximation of an arbitrary non-linear Lipschitz continuous function $\Psi$. (ii) For the model $\lambda^{\ast}(t)=\Psi\left(\lambda_{0}(t)+\sum_{t_{i}<t}\mu(t-t_{i})\right)$ with a general excitation function $\mu$, we can obtain a similar excess risk bound using the same techniques as in the proof of Theorem 1.

To better explain the excess risk bounds obtained in Theorems 1-2, we rely on the following decomposition lemma.

Lemma 1.

Let $\check{\lambda}^{\ast}=\arg\min_{\lambda\in\mathcal{F}}\mathbb{E}[\text{loss}(\lambda,S_{test})]$. For any random sample $\{S_{i},i\in[n]\}$, the excess risk of the ERM (3) satisfies

\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\leq\underbrace{2\sup_{\lambda\in\mathcal{F}}\Big{|}\mathbb{E}[\text{loss}(\lambda,S_{test})]-\frac{1}{n}\sum_{i\in[n]}\text{loss}(\lambda,S_{i})\Big{|}}_{\text{stochastic error}}+\underbrace{\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]}_{\text{approximation error}}. (17)

By Lemma 1, the excess risk of the ERM is bounded by the sum of two terms, the stochastic error $2\sup_{\lambda\in\mathcal{F}}|\mathbb{E}[\text{loss}(\lambda,S_{test})]-n^{-1}\sum_{i\in[n]}\text{loss}(\lambda,S_{i})|$ and the approximation error $\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]$. The first term can be bounded by the complexity of the function class $\mathcal{F}$ using empirical process theory, where the unboundedness of the loss function needs to be handled carefully; we present the details in section 4. The second term characterizes the approximation ability of the RNN function class $\mathcal{F}$ to the true intensity $\lambda^{\ast}$ under the measure of the expectation of the negative log-likelihood loss. In order to bound this term, we need to carefully construct a suitable RNN which approximates $\lambda^{\ast}$ well. This has not been studied yet in the literature; see section 5 for the details.
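For completeness, Lemma 1 follows from the standard ERM decomposition; a short derivation (ours, using only the definitions of $\hat{\lambda}$ and $\check{\lambda}^{\ast}$, and writing $\mathbb{E}[\text{loss}(\lambda)]$ for $\mathbb{E}[\text{loss}(\lambda,S_{test})]$) reads

\begin{aligned}
\mathbb{E}[\text{loss}(\hat{\lambda})]-\mathbb{E}[\text{loss}(\lambda^{\ast})]
&=\Big(\mathbb{E}[\text{loss}(\hat{\lambda})]-\tfrac{1}{n}\textstyle\sum_{i}\text{loss}(\hat{\lambda},S_{i})\Big)
+\Big(\tfrac{1}{n}\textstyle\sum_{i}\text{loss}(\hat{\lambda},S_{i})-\tfrac{1}{n}\textstyle\sum_{i}\text{loss}(\check{\lambda}^{\ast},S_{i})\Big)\\
&\quad+\Big(\tfrac{1}{n}\textstyle\sum_{i}\text{loss}(\check{\lambda}^{\ast},S_{i})-\mathbb{E}[\text{loss}(\check{\lambda}^{\ast})]\Big)
+\Big(\mathbb{E}[\text{loss}(\check{\lambda}^{\ast})]-\mathbb{E}[\text{loss}(\lambda^{\ast})]\Big),
\end{aligned}

where the second bracket is nonpositive because $\hat{\lambda}$ minimizes the empirical risk over $\mathcal{F}$, and the first and third brackets are each bounded by $\sup_{\lambda\in\mathcal{F}}|\mathbb{E}[\text{loss}(\lambda,S_{test})]-n^{-1}\sum_{i}\text{loss}(\lambda,S_{i})|$, which yields (17).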

Based on Lemma 1, the results in Theorem 1 admit the following form,

\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\leq O\left(\frac{C(N)}{\sqrt{n}}+\frac{1}{R(N)}\right),

where $C(N)/\sqrt{n}$ is the stochastic error and $1/R(N)$ is the approximation error. $C(N)$ is the complexity of the RNN function class $\mathcal{F}$ and $R(N)$ is the corresponding approximation rate, where $N$ is a tuning parameter. For the Poisson case, we can construct a two-layer RNN-TPP with $O(N)$ width to achieve $O(N^{-s})$ approximation error. Hence $C(N)=O(N)$, $R(N)=O(N^{s})$, and the final excess risk bound is $\tilde{O}(n^{-\frac{s}{2(s+1)}})$ in (12). For the vanilla Hawkes case, since the exponential function is $C^{\infty}$-smooth, we only need extra $O(\text{Poly}(\log N))$ hidden cells in each layer to obtain an $\tilde{O}(N^{-s})$ approximation error, and then we have the same order excess risk bound. For the general case, motivated by the vanilla Hawkes case, we decompose $\mu\in C^{k,\infty}([0,T],C_{0})$ into two parts. One part is a polynomial of exponential functions which can be well approximated by an $O(\text{Poly}(\log N))$-width $\tanh$ neural network. The other part is a function $\tilde{\mu}\in C^{k,\infty}([0,T],\tilde{C_{0}})$ satisfying $\tilde{\mu}^{(j)}(0+)=\tilde{\mu}^{(j)}(T-)$, $j=0,1,\cdots,k-1$. It is easy to check that the $r$-th Fourier coefficients of $\tilde{\mu}$, $\hat{\mu}_{r}$, decay at the rate $r^{-k}$. Then it is sufficient to approximate the first $N$ functions in the Fourier expansion of $\tilde{\mu}$ to get an $\tilde{O}(N^{-(k-1)})$ approximation error, which additionally costs $\tilde{O}(N^{5})$ complexity (see section 5.3 for details). Combining this with the approximation result for $\lambda_{0}$, we get the final bound $\tilde{O}(n^{-\frac{1}{2}\left(\frac{s}{s+1}\wedge\frac{k-1}{k+4}\right)})$. Similarly, for the nonlinear Hawkes case, we need $\tilde{O}(N)$ complexity to obtain an $\tilde{O}(N^{-1})$ approximation error, which leads to the $\tilde{O}(n^{-\frac{1}{4}})$ excess risk bound.
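As a sanity check of the rates, the trade-off in the Poisson case can be balanced explicitly (the following calculation is ours and ignores logarithmic factors):

\frac{C(N)}{\sqrt{n}}+\frac{1}{R(N)}\asymp\frac{N}{\sqrt{n}}+N^{-s},\qquad N\asymp n^{\frac{1}{2(s+1)}}\ \Longrightarrow\ \frac{N}{\sqrt{n}}\asymp N^{-s}\asymp n^{-\frac{s}{2(s+1)}},

which recovers the exponent in (12). The same balancing with the extra width $N_{\mu}^{5}$ against the Fourier truncation error $N_{\mu}^{-(k-1)}$ gives $N_{\mu}\asymp n^{\frac{1}{2(k+4)}}$ and hence the exponent $\frac{1}{2}\left(\frac{s}{s+1}\wedge\frac{k-1}{k+4}\right)$ in (14).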

As we emphasized in the above remarks, the number of layers depends on the smoothness of $\lambda_{0}$. If $\lambda_{0}\in C^{\infty}([0,T])$ and $\|\lambda_{0}\|_{W^{s,\infty}}\leq C^{s}$, we only need a one-layer $\tanh$ neural network to approximate $\lambda_{0}$; hence the number of layers in the RNN-TPP can be reduced to one.

4 Stochastic Error

In this section, we focus on the stochastic error in (17). This type of stochastic error for the RNN function class has been studied in the recent literature, such as Chen et al. (2020) and Tu et al. (2020). However, they only consider the case where the lengths of the input sequences are bounded, which is not applicable under the TPP setting. Here we establish an upper bound of the stochastic error in (17) by a novel decoupling technique to make the classical results applicable. This technique can be used in many other related problems.

4.1 Main Variance Term

We first state some mild assumptions on the RNN-TPP function class $\mathcal{F}$ under a more general framework.

(B1) The embedding function $x(\cdot)$ is bounded by a constant $B_{in}(T)$ on the time domain $[0,T]$, i.e., $\|x(\cdot)\|_{2}\leq B_{in}(T)$.

(B2) The parameter $\theta$ lies in a bounded domain $\Theta$. More precisely, we assume that the spectral norms of the weight matrices (vectors) and the other parameters are bounded respectively, i.e., $\|W_{x}^{(l)}\|_{2}\leq B_{x}$, $\|W_{h}^{(l)}\|_{2}\leq B_{h}$, $\|b^{(l)}\|_{2}\leq B_{b}$, $1\leq l\leq L+1$, and $B_{m}=\max\{B_{b},B_{h},B_{x}\}$.

(B3) The activation functions $\sigma$ and $f$ are Lipschitz continuous with parameters $\rho_{\sigma}$ and $\rho_{f}$ respectively, $\sigma(0)=0$, and there exists $|b_{0}|\leq B_{b}$ such that $f(b_{0})=1$. Additionally, $\sigma$ is entrywise bounded by $B_{\sigma}$, and $f$ satisfies $l_{f}\leq\|f\|_{L^{\infty}}\leq u_{f}$.

Now we consider the first term of (17). For convenience, we denote $X_{\theta}=\mathbb{E}[\text{loss}(\lambda_{\theta},S_{test})]-n^{-1}\sum_{i=1}^{n}\text{loss}(\lambda_{\theta},S_{i})$.

Theorem 3.

Under assumptions (B1)-(B3), suppose the event number $N_{e}$ satisfies the tail condition

\mathbb{P}(N_{e}\geq s)\leq a_{N}\exp(-c_{N}s),~s\in\mathbb{N}.

Then, with probability at least $1-\delta$, we have

\sup_{\theta\in\Theta}|X_{\theta}|\leq\frac{192}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s_{0}+1)u_{f}\Bigg{(}\sqrt{\log\left(\frac{4}{\delta}\right)}+D\sqrt{(3L+2)}\left(\sqrt{\log\left(1+M(s_{0})\right)}+1\right)+\frac{1}{(1-\exp(-c_{N}))^{2}}\Bigg{)}~.

Thus

\sup_{\|\theta\|\leq B_{m}}|X_{\theta}|\leq\tilde{O}\left(\sqrt{\frac{D^{2}L^{2}}{n}}\right)~, (18)

where $s_{0}=\lceil{c_{N}}^{-1}\left(\log\left(2a_{N}n/\delta\right)-1\right)\rceil$, $M(s)=\rho_{f}B_{m}\sqrt{D}(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1)(\gamma^{L}\vee 1)(s+1)^{L-1}(\beta^{s+1}-1)/(\beta-1)$, $\gamma=\rho_{\sigma}B_{x}$, and $\beta=\rho_{\sigma}B_{h}$.

Remark 8.

There exist constants $a_{N},c_{N}$ such that the tail condition $\mathbb{P}(N_{e}\geq s)\leq a_{N}\exp(-c_{N}s)$, $s\in\mathbb{N}$, always holds for (non-)homogeneous Poisson processes, linear and nonlinear Hawkes processes, and self-correcting processes under weak assumptions. To be more concrete, Lemma 2 in the following section gives a result for the linear case.

Remark 9.

For a one-layer RNN with width $D$ and bounded sequence length $T$, Chen et al. (2020) gives an $\tilde{O}\big{(}\sqrt{{D^{3}T}/{n}}\big{)}$-type stochastic error bound. Our bound reduces the term $D^{3}$ to $D^{2}$, thanks to the bounded output layer, i.e., $f(x)=\min\{\max\{x,l_{f}\},u_{f}\}$. The term $D^{2}$ is also order-optimal, noticing that the number of free parameters in a single-layer RNN is at least $D^{2}$.

The stochastic error bound in Theorem 3 is mainly determined by the complexity of the RNN function class $\mathcal{F}$, which will be discussed in the following subsection. To obtain this bound, we need to handle the unboundedness of the event number. We use a truncation technique to decouple the randomness of the tail of $N_{e}$, which allows us to use classical empirical process theory to derive the upper bound. Our computation is motivated by Chen et al. (2020), which gives the generalization error bound of a single-layer RNN function class.

4.2 Key Techniques

To be reader-friendly, the main techniques for proving Theorem 3 are summarized as follows.

4.2.1 Probability Bound of the Event Number

Define $N_{e(n)}:=\max\{N_{ei},1\leq i\leq n\}$. The following lemma characterizes the tails of the event numbers $N_{e}$ and $N_{e(n)}$ under model (9) and assumptions (A1) and (A2) (for assumption (A1), we only need $\lambda_{0}\leq B_{0}$ in this section). The proof is similar to Proposition 2 in Hansen et al. (2015); see the supplementary material for the details.

Lemma 2.

For model (9), under assumptions (A1) and (A2), with probability at least $1-\delta$, we have

N_{e(n)}<\frac{1}{1-c_{\mu}\eta}\left(\frac{2}{\log(\eta)}\log\left(\frac{2n\sqrt{B_{0}T}}{\delta(1-c_{\mu})}\right)+\eta(B_{0}T)\right).

Hence

\mathbb{P}\left(N_{e}=s\right)\leq\mathbb{P}\left(N_{e}\geq s\right)\leq\frac{2\sqrt{B_{0}T}}{1-c_{\mu}}\exp\left(\frac{\log(\eta)}{2}\left[\eta(B_{0}T)-(1-c_{\mu}\eta)s\right]\right),

where $\eta\in\left(1,c_{\mu}^{-1}\right)$. Let $a_{N}=2\sqrt{B_{0}T}\exp(\log(\eta_{0})\eta_{0}(B_{0}T)/2)/(1-c_{\mu})$ and $c_{N}=\log(\eta_{0})(1-c_{\mu}\eta_{0})/2$ with $\eta_{0}\in\left(1,c_{\mu}^{-1}\right)$ being fixed. Then

\mathbb{P}\left(N_{e}=s\right)\leq\mathbb{P}\left(N_{e}\geq s\right)\leq a_{N}\exp(-c_{N}s). (19)

Our result is more refined than Proposition 2 in Hansen et al. (2015), in that we compute all the constants explicitly and introduce a tuning parameter to control the probability bound.
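As a quick illustration (our own sketch, with arbitrary values of $B_0$, $T$, $c_\mu$, and $\eta_0$), the constants in Lemma 2 and the resulting tail bound (19) can be evaluated directly:

```python
import numpy as np

def tail_constants(B0, T, c_mu, eta0):
    """Constants a_N, c_N of Lemma 2 for a fixed eta0 in (1, 1/c_mu)."""
    a_N = 2 * np.sqrt(B0 * T) * np.exp(np.log(eta0) * eta0 * B0 * T / 2) / (1 - c_mu)
    c_N = np.log(eta0) * (1 - c_mu * eta0) / 2
    return a_N, c_N

a_N, c_N = tail_constants(B0=1.0, T=2.0, c_mu=0.5, eta0=1.5)
# Bound (19) on P(N_e >= s); it decays exponentially in s.
for s in (10, 20, 40):
    print(s, a_N * np.exp(-c_N * s))
```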

For the nonlinear case (10), under Assumption (A4), we can obtain results similar to the non-homogeneous Poisson case, which are included in the above Lemma.

4.2.2 From Unboundedness to Boundedness

The following lemma is the key to handling the unboundedness of $X_{\theta}$, i.e., the unboundedness of the loss function. For any $s\in\mathbb{N}$, we let $X_{\theta}(s)=\mathbb{E}\left[\text{loss}(\lambda_{\theta},S_{test})\mathbbm{1}_{\{N_{e}\leq s\}}\right]-n^{-1}\sum_{i=1}^{n}\text{loss}(\lambda_{\theta},S_{i})\mathbbm{1}_{\{N_{ei}\leq s\}}$ and $E_{\theta}(s)=\mathbb{E}\left[\text{loss}(\lambda_{\theta},S_{test})\mathbbm{1}_{\{N_{e}>s\}}\right]$.

Lemma 3.

For any $s\in\mathbb{N}$ and nonempty parameter set $\Theta$, we have

\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t\right)\leq\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s)|+\sup_{\theta\in\Theta}|E_{\theta}(s)|>t\right)+\mathbb{P}(N_{e(n)}>s). (20)
Proof of Lemma 3.

For any $\omega\in\{N_{e(n)}\leq s\}$, we have

X_{\theta}(\omega)=\mathbb{E}[\text{loss}(\lambda_{\theta},S_{test})]-\frac{1}{n}\sum_{i=1}^{n}\text{loss}(\lambda_{\theta},S_{i})(\omega)
=\mathbb{E}[\text{loss}(\lambda_{\theta},S_{test})]-\frac{1}{n}\sum_{i=1}^{n}\text{loss}(\lambda_{\theta},S_{i})\mathbbm{1}_{\{N_{ei}\leq s\}}(\omega)
=X_{\theta}(s)(\omega)+E_{\theta}(s)(\omega).

Hence, under the condition $N_{e(n)}\leq s$, we have $|X_{\theta}|\leq|X_{\theta}(s)|+|E_{\theta}(s)|$, and thus $\sup_{\theta\in\Theta}|X_{\theta}|\leq\sup_{\theta\in\Theta}|X_{\theta}(s)|+\sup_{\theta\in\Theta}|E_{\theta}(s)|$. Then

\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t\right)=\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t,~N_{e(n)}\leq s\right)+\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t,~N_{e(n)}>s\right)
\leq\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s)|+\sup_{\theta\in\Theta}|E_{\theta}(s)|>t,~N_{e(n)}\leq s\right)+\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t,~N_{e(n)}>s\right)
\leq\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s)|+\sup_{\theta\in\Theta}|E_{\theta}(s)|>t\right)+\mathbb{P}\left(N_{e(n)}>s\right).

The consequence of this lemma is to decompose $\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t\right)$ into two parts. The first part, $\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s)|+\sup_{\theta\in\Theta}|E_{\theta}(s)|>t\right)$, is the tail probability of the supremum of a set of bounded variables and can therefore be handled by standard empirical process theory. The second part, $\mathbb{P}(N_{e(n)}>s)$, is the tail probability of $N_{e(n)}$. Thanks to Lemma 2, this term can be controlled by the exponential decay property of the sub-critical point process. By choosing a suitable $s$, we can make (20) sharper. This result plays a key role in the stochastic error calculations.

4.2.3 Complexity of the RNN-TPP Class

To get the result in Theorem 3, we need to compute the complexity of the RNN function class specified in section 2.2. There are many possible complexity measures in deep learning theory (Suh and Cheng, 2024); here we choose the covering number, which can be computed explicitly for the RNN function class. In our setup, the key to the computation of the covering number is finding the Lipschitz continuity constant of RNN-TPPs, which separates the spectral norms of the weight matrices from the total number of parameters (Chen et al., 2020).

Consider two different sets of parameters $\theta_{1}=\{W_{x,1}^{(l)},W_{h,1}^{(l)},b_{1}^{(l)};1\leq l\leq L+1\}$ and $\theta_{2}=\{W_{x,2}^{(l)},W_{h,2}^{(l)},b_{2}^{(l)};1\leq l\leq L+1\}$. Denote $\Delta_{b}^{l}=\|b_{1}^{(l)}-b_{2}^{(l)}\|_{2}$, $\Delta_{h}^{l}=\|W_{h,1}^{(l)}-W_{h,2}^{(l)}\|_{2}$, $\Delta_{x}^{l}=\|W_{x,1}^{(l)}-W_{x,2}^{(l)}\|_{2}$, $1\leq l\leq L+1$ ($\Delta_{h}^{L+1}\equiv 0$). The following lemma characterizes the Lipschitz constant of $\lambda_{\theta}$.

Lemma 4.

Under assumptions (B1)-(B3), given an input sequence of length $N_{S}$, $S=\{t_{i}\}_{i=1}^{N_{S}}\subset[0,T]$ (here we set $t_{N_{S}+1}=T$), for $t\in(t_{i},t_{i+1}]$, $1\leq i\leq N_{S}$, and $\theta_{1},\theta_{2}\in\Theta$, we have

\left|\lambda_{\theta_{1}}(t;S)-\lambda_{\theta_{2}}(t;S)\right|\leq\rho_{f}\gamma\left(\sum_{l=0}^{L-1}\gamma^{l}S_{i}^{l}\Delta_{b}^{L-l}+B_{\sigma}\sqrt{D}\sum_{l=0}^{L-2}\gamma^{l}S_{i}^{l}\Delta_{x}^{L-l}+B_{in}(T)\gamma^{L-1}S_{i}^{L-1}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{l=0}^{L-1}\gamma^{l}S_{i-1}^{l}\Delta_{h}^{L-l}\right)+\rho_{f}\Delta_{b}^{L+1}+\rho_{f}B_{\sigma}\sqrt{D}\Delta_{x}^{L+1}, (21)

where $\beta=\rho_{\sigma}B_{h}$, $\gamma=\rho_{\sigma}B_{x}$, $S_{i}^{l}=\sum_{j=0}^{i}\tbinom{j+l}{l}\beta^{j}$ ($S_{-1}^{l}=0$), and $d=\max\{d_{l}\,|\,1\leq l\leq L+1\}$. We set $\sum_{l=a}^{b}A_{l}=0$ if $a>b$.

The proof of Lemma 4 is based on induction; the full proof is given in the supplementary material. Our result is an extension of Lemma 2 in Chen et al. (2020), which only considers the family of one-layer RNN models. Lemma 4 is of independent interest and can be useful in other problems regarding RNN-based modeling. Using Lemma 4, we can establish a covering number bound for $\mathcal{F}$ under a “truncated” distance.

Denote $\mathcal{N}\left(\mathcal{F},\epsilon,d(\cdot,\cdot)\right)$ as the covering number of the metric space $\mathcal{F}$, i.e., the minimal cardinality of a subset $\mathcal{C}\subset\mathcal{F}$ that covers $\mathcal{F}$ at scale $\epsilon$ with respect to the metric $d(\cdot,\cdot)$. Given a fixed integer $N_{0}$, we define a truncated distance,

d_{N_{0}}(\lambda_{\theta_{1}},\lambda_{\theta_{2}})=\sup_{\#(S)\leq{N_{0}}}\left\|\lambda_{\theta_{1}}(t;S)-\lambda_{\theta_{2}}(t;S)\right\|_{L^{\infty}[0,T]}~.

The following lemma gives an upper bound on $\mathcal{N}\left(\mathcal{F},\epsilon,d_{N_{0}}(\cdot,\cdot)\right)$.

Lemma 5.

Under assumptions (B1)-(B3), for any $\epsilon>0$ and $\mathcal{F}=\mathcal{F}_{L,D,B_{m},l_{f},u_{f}}$ defined as (8), the covering number $\mathcal{N}\left(\mathcal{F},\epsilon,d_{N_{0}}(\cdot,\cdot)\right)$ is bounded by

\mathcal{N}\left(\mathcal{F},\epsilon,d_{N_{0}}(\cdot,\cdot)\right)\leq\left(1+\frac{C({N_{0}})(3L+2)B_{m}\sqrt{D}}{\epsilon}\right)^{D^{2}(3L+2)},

where $C(N_{0})=\rho_{f}(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1)(\gamma^{L}\vee 1)({N_{0}}+1)^{L-1}(\beta^{{N_{0}}+1}-1)/(\beta-1)$, $\gamma=\rho_{\sigma}B_{x}$, and $\beta=\rho_{\sigma}B_{h}$.

By Lemma 5, taking $N_{0}=s$, we can get a non-asymptotic bound on $X_{\theta}(s)$, which is an important step toward obtaining the first part of (20).

5 Approximation Error

In this section, we focus on the approximation error, i.e., the second part of (17). The approximation error of deep neural networks has been broadly studied in the literature (Schmidt-Hieber, 2020; Shen et al., 2019; Jiao et al., 2023; Lu et al., 2021). However, most of these works only consider the ReLU activation, which is different from $\tanh$, the activation function usually chosen for RNNs. Recently, De Ryck et al. (2021) studied the approximation properties of shallow $\tanh$ neural networks, which provides a technical tool for our analysis. To the best of our knowledge, the approximation ability of RNN-type networks has not been fully studied in the literature. Here we provide a family of approximation results for the intensities of the various TPP models stated in section 2.3.

5.1 Poisson Case

We start with the approximation of the (non-homogeneous) Poisson process, whose intensity is independent of the event history, i.e., $\lambda^{\ast}(t)=\lambda_{0}(t)$, where $\lambda_{0}(t)$ is an unknown function. In this case, we do not need to take into account the transfer of information in the time domain. To be precise, we can take $W_{h}^{l}=0$ for $l\in[L]$. Then the problem degenerates to a standard neural network approximation problem. Using the approximation results for $\tanh$ neural networks in De Ryck et al. (2021), we can get the following approximation result.

Theorem 4.

(Approximation for the Poisson process) Under the model $\lambda^{\ast}(t)=\lambda_{0}(t)$ and assumptions (A1) and (A3), for $N\geq 5$, $N\in\mathbb{N}$, there exists an RNN-TPP $\hat{\lambda}^{N}$ as stated in section 2.2 with $L=2$, $l_{f}=B_{1}$, $u_{f}=B_{0}$, and input function $x(t;S)=t$ such that

|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\leq 15\exp\left({2B_{0}T}\right)(T+2B_{1}^{-1})\frac{\mathcal{C}T^{s}}{N^{s}}, (22)

where $\mathcal{C}=\sqrt{2s}5^{s}/(s-1)!$. Moreover, the width of $\hat{\lambda}^{N}$ satisfies $D\leq 3\lceil s/2\rceil+6N$ and the weights of $\hat{\lambda}^{N}$ are less than

\mathcal{C}_{1}\left[\frac{\sqrt{2s}5^{s}}{(s-1)!}B_{0}T^{s}\right]^{-\frac{s}{2}}N^{\frac{1+s^{2}}{2}}(s(s+2))^{3s(s+2)},

where $\mathcal{C}_{1}$ is a universal constant.

A graphical representation of the RNN approximation is given in Figure 2. For non-homogeneous Poisson models, the RNN-TPP $\hat{\lambda}^{N}$ in Theorem 4 is indeed a two-layer neural network. From Theorem 4, we need an RNN-TPP with $O(N)$ width and $B_{m}=O(N^{\frac{s^{2}+2}{2}})$ to obtain an $O(N^{-s})$ approximation error. Combining this with Theorem 3, we can get part (i) of Theorem 1.

Figure 2: The construction of RNN-TPP for the case of Poisson processes.

5.2 Vanilla Hawkes Case

Recall that the intensity of the vanilla Hawkes process has the form

\lambda^{\ast}(t)=\lambda_{0}(t)+\sum_{j:t_{j}<t}\alpha\exp\{-\beta(t-t_{j})\}. (23)

Different from the Poisson process, the intensity of the vanilla Hawkes process depends on historical events. Hence it cannot be approximated by a simple feed-forward neural network and requires the recurrent structure. We construct an RNN-TPP to approximate the intensity using the Markov property of (23). Specifically, note that if we have observed the first $k$ event times $\{t_{1},\cdots,t_{k}\}$, then for any $t$ satisfying $t_{k}<t\leq t_{k+1}$, we have

\lambda^{\ast}(t)-\lambda_{0}(t)=\sum_{j:t_{j}<t}\alpha\exp\{-\beta(t-t_{j})\}
=\exp(-\beta(t-t_{k}))\sum_{j:t_{j}\leq t_{k}}\alpha\exp\{-\beta(t_{k}-t_{j})\}
=(\lambda^{\ast}(t_{k})-\lambda_{0}(t_{k})+\alpha)\exp(-\beta(t-t_{k})).

Therefore, we can use the hidden layers in the RNN-TPP to store the information of $\lambda^{\ast}(t_{k})-\lambda_{0}(t_{k})$ and then compute $\lambda^{\ast}(t)-\lambda_{0}(t)$ with the help of the input $t-t_{k}$. Together with the approximation of $\lambda_{0}$, we can obtain the final approximation result. A graphical illustration of the above construction procedure is given in Figure 3.
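The Markov update above can be checked numerically; the short sketch below (ours, with arbitrary parameter values) maintains the single scalar $A_{k}=\lambda^{\ast}(t_{k})-\lambda_{0}(t_{k})+\alpha$ after each event and reproduces the brute-force sum over the whole history.

```python
import numpy as np

alpha, beta = 0.8, 1.2
events = np.array([0.4, 1.1, 2.3])

def excitation_direct(t):
    past = events[events < t]
    return alpha * np.sum(np.exp(-beta * (t - past)))   # brute-force sum over history

def excitation_markov(t):
    # maintain A_k = lambda*(t_k) - lambda_0(t_k) + alpha via A <- A*exp(-beta*dt) + alpha
    A, t_prev = 0.0, 0.0
    for tk in events[events < t]:
        A = A * np.exp(-beta * (tk - t_prev)) + alpha
        t_prev = tk
    return A * np.exp(-beta * (t - t_prev))

print(excitation_direct(2.5), excitation_markov(2.5))   # the two values coincide
```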

Theorem 5.

(Approximation for the vanilla Hawkes process) Under model (23), assumptions (A1), (A3), and $\alpha/\beta<1$, for $N\geq 5$, $N\in\mathbb{N}$, there exists an RNN-TPP $\hat{\lambda}^{N}$ as stated in section 2.2 with $L=2$, $l_{f}=B_{1}$, $u_{f}=B_{0}+O(\log N)$, and input function $x(t;S)=(t,t-F_{S}(t))^{\top}$ such that

|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{(\log N)^{2}}{N^{s}}. (24)

Moreover, the width of $\hat{\lambda}^{N}$ satisfies $D=O(N)$ and the weights of $\hat{\lambda}^{N}$ are less than

\mathcal{C}_{1}(\log(N))^{12s^{2}(\log(N))^{2}}~,

where $\mathcal{C}_{1}$ is a constant related to $s,B_{0},\beta$, and $T$.

Due to the smoothness of the exponential function, the approximation rate in Theorem 5 only adds a $\log(N)$ factor compared with the result in Theorem 4. Similarly, combining this with Theorem 3, we can easily get part (ii) of Theorem 1.

Figure 3: The construction of RNN-TPP for the case of the vanilla Hawkes process.

5.3 Linear Hawkes Case

Now we consider the general linear Hawkes process, i.e., (9) in section 2.3. Motivated by the approximation construction for the vanilla Hawkes process, we want to find a decomposition of the general $\mu$ in which each term has the “Markov property”, so that we can construct the corresponding RNN structure. Precisely, for $\mu\in C^{k,\infty}([0,T],C_{0})$, $k\geq 2$, $k\in\mathbb{N}$, we can decompose $\mu$ into two parts,

\mu(t)=\underbrace{\tilde{\mu}(t)}_{\text{part}_{1}}+\underbrace{\sum_{j=1}^{k}\alpha_{j}\exp(-\beta_{j}t)}_{\text{part}_{2}},~~t\in[0,T],

where $\tilde{\mu}$ satisfies the boundary condition $\tilde{\mu}^{(j)}(0+)=\tilde{\mu}^{(j)}(T-)$, $0\leq j\leq k-1$, $j\in\mathbb{N}$, and $\beta_{j}=j/k$, $j\in[k]$. The term $\sum_{j=1}^{k}\alpha_{j}\exp(-\beta_{j}t)$ can be handled similarly to the vanilla Hawkes process. For $\tilde{\mu}$, we consider its Fourier expansion,

\tilde{\mu}(t)=\frac{\hat{\mu}_{0}}{2}+\sum_{l=1}^{\infty}\left(\hat{\mu}_{l}\cos\Big{(}\frac{2l\pi}{T}t\Big{)}+\hat{\nu}_{l}\sin\Big{(}\frac{2l\pi}{T}t\Big{)}\right).

Thanks to the boundary condition, $\tilde{\mu}(t)$ can be well approximated by a finite sum of its Fourier series. Then we can use the “Markov property” of the trigonometric function pairs $\cos(2l\pi t/T)$ and $\sin(2l\pi t/T)$ to construct the RNN-TPP. The construction is similar to that for the exponential function but requires more involved calculations. Combining all the approximation parts, we can get the approximation theorem for (9). The above ideas are visualized in Figure 4.
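To see the “Markov property” of the trigonometric pairs, note that the history sums $C_{l}(t)=\sum_{t_{j}<t}\cos(2l\pi(t-t_{j})/T)$ and $S_{l}(t)=\sum_{t_{j}<t}\sin(2l\pi(t-t_{j})/T)$ can be propagated between events by a rotation (angle-addition formulas), analogous to the exponential case. The small numerical check below is our own sketch with arbitrary values of $T$, $l$, and the event times.

```python
import numpy as np

T, l = 3.0, 2
w = 2 * np.pi * l / T
events = np.array([0.4, 1.1, 2.3])

def trig_sums_direct(t):
    past = events[events < t]
    return np.sum(np.cos(w * (t - past))), np.sum(np.sin(w * (t - past)))

def trig_sums_markov(t):
    # propagate (C, S) between events by a rotation of angle w*dt,
    # then add the new event's own contribution (cos 0, sin 0) = (1, 0)
    C, S, t_prev = 0.0, 0.0, 0.0
    for tk in events[events < t]:
        dt = tk - t_prev
        C, S = C * np.cos(w * dt) - S * np.sin(w * dt) + 1.0, C * np.sin(w * dt) + S * np.cos(w * dt)
        t_prev = tk
    dt = t - t_prev
    return C * np.cos(w * dt) - S * np.sin(w * dt), C * np.sin(w * dt) + S * np.cos(w * dt)

print(trig_sums_direct(2.5), trig_sums_markov(2.5))   # the two pairs coincide
```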

Figure 4: The construction of RNN-TPP for the case of general linear Hawkes processes.
Theorem 6.

(Approximation for linear Hawkes process) Under model (9), assumptions (A1)-(A3), and μCk,([0,T],C0)\mu\in C^{k,\infty}([0,T],C_{0}), k2k\geq 2, kk\in\mathbb{N}, for N5N\geq 5, NN\in\mathbb{N}, there exists an RNN-TPP λ^N,Nμ\hat{\lambda}^{N,N_{\mu}} as stated in section 2.2 with L=2L=2, lf=B1l_{f}=B_{1}, uf=B0+O(logN)u_{f}=B_{0}+O(\log N), and input function x(t;S)=(t,tFS(t))x(t;S)=(t,t-F_{S}(t))^{\top} such that

|𝔼[loss(λ^N,Nμ,Stest)]𝔼[loss(λ,Stest)]|(logN)2Ns+logNNμk1.\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N,N_{\mu}},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{(\log N)^{2}}{N^{s}}+\frac{\log N}{N_{\mu}^{k-1}}. (25)

Moreover, the width of λ^N,Nμ\hat{\lambda}^{N,N_{\mu}} satisfies D=O(N+Nμ5(logN)4)D=O(N+N_{\mu}^{5}(\log N)^{4}) and the weights of λ^N,Nμ\hat{\lambda}^{N,N_{\mu}} are less than

𝒞1(log(NNμ))12s2(log(NNμ))2,\displaystyle\mathcal{C}_{1}(\log(NN_{\mu}))^{12s^{2}(\log(NN_{\mu}))^{2}}~{},

where 𝒞1\mathcal{C}_{1} is a constant related to s,k,B0,C0,cμs,k,B_{0},C_{0},c_{\mu}, and TT.

We now make a few comments on Theorem 6. There are two tuning parameters in \hat{\lambda}^{N,N_{\mu}}: N controls the approximation error of \lambda_{0}, of \sum_{j=1}^{k}\alpha_{j}\exp(-\beta_{j}t), and of the finite partial sum of the Fourier series, while N_{\mu} controls the number of Fourier terms entering the RNN-TPP. The term (\log N)^{2}/N^{s} is obtained similarly to the vanilla Hawkes process case, and the term \log N/N_{\mu}^{k-1} is the error caused by truncating the Fourier series to its first N_{\mu} terms. Moreover, the O(N_{\mu}^{5}(\log N)^{4}) term in the width of the RNN-TPP comes from the approximation construction of the first N_{\mu} Fourier terms. Finally, combining with Theorem 3, we obtain part (iii) of Theorem 1.
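
As a simple illustration of how the two tuning parameters interact (this balancing is not part of the theorem statement), equating the two error terms in (25) up to logarithmic factors suggests taking

N_{\mu}^{k-1}\asymp\frac{N^{s}}{\log N},\qquad\text{i.e.,}\qquad N_{\mu}\asymp\Big{(}\frac{N^{s}}{\log N}\Big{)}^{\frac{1}{k-1}},

in which case the right-hand side of (25) is of order (\log N)^{2}/N^{s}, at the price of a width of order N_{\mu}^{5}(\log N)^{4}\asymp N^{5s/(k-1)}(\log N)^{4-5/(k-1)}.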

5.4 Nonlinear Hawkes Case

Finally, we consider the nonlinear Hawkes process defined in (10) in section 2.3. To simplify the statement, we only consider the case \mu(t)=\alpha\exp(-\beta t); the results for general \mu can be obtained similarly. Compared to the vanilla Hawkes case, the additional challenge here is the presence of the nonlinear function \Phi. With two additional layers, we can approximate \Phi well. Together with the results for the vanilla Hawkes process, we obtain the desired RNN-TPP architecture. For clarity, we also provide a graphical illustration in Figure 5.

Theorem 7.

(Approximation for nonlinear Hawkes process) Under model (15), assumptions (A1) and (A4), for Nmax{5,(2𝒞B0Ts+1)1s}N\geq\max\{5,(2\mathcal{C}B_{0}T^{s}+1)^{\frac{1}{s}}\} with 𝒞=2s5s/(s1)!\mathcal{C}=\sqrt{2s}5^{s}/(s-1)!, there exists an RNN-TPP λ^N\hat{\lambda}^{N} as stated in section 2.2 with L=4L=4, lf=B~1l_{f}=\tilde{B}_{1}, uf=B~0u_{f}=\tilde{B}_{0}, and input function x(t;S)=(t,tFS(t))x(t;S)=(t,t-F_{S}(t))^{\top} such that

|𝔼[loss(λ^N,Stest)]𝔼[loss(λ,Stest)]|logNN.\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{\log N}{N}. (26)

Moreover, the width of λ^N\hat{\lambda}^{N} satisfies D=O(N)D=O(N) and the weights of λ^N\hat{\lambda}^{N} are less than

𝒞1(logN)12s2(logN)2,\displaystyle\mathcal{C}_{1}(\log N)^{12s^{2}(\log N)^{2}}~{},

where 𝒞1\mathcal{C}_{1} is a constant related to s,B~0,α,β,Ts,\tilde{B}_{0},\alpha,\beta,T, and LL.

Since \Phi is only assumed to be Lipschitz continuous, we can only obtain an \tilde{O}(N^{-1}) approximation error. The rate can be improved if \Phi has higher-order smoothness. Again, combining with Theorem 3, we arrive at Theorem 2.

Refer to caption
Figure 5: The construction of RNN-TPP for the case of nonlinear Hawkes process.
Remark 10.

The universal approximation properties of one-layer RNNs are studied in Schäfer and Zimmermann (2007). Our results differ from theirs in the following sense. (i) The RNN-TPP is defined over the continuous time domain [0,T], while the standard RNN only considers discrete time points; in other words, our approximation results hold uniformly over all t\in[0,T]. (ii) Schäfer and Zimmermann (2007) do not give explicit formulas for the widths of the hidden layers or the parameter weights in their construction of the RNN approximator, so their results cannot be directly used to compute the approximation error.

6 Usefulness of Interpolation of Hidden States

As mentioned in Remark 3, the RNN-TPP needs to account for any continuous time point t between observed time grids t_{j} and t_{j+1}. The interpolation of the hidden state h^{(l)}(t;S) between h_{j}^{(l)} and h_{j+1}^{(l)} is therefore essential in the construction of RNN-TPPs.

In this section, we give a counter-example illustrating that an RNN-TPP with a linear interpolation of hidden states is unable to precisely capture the true intensity in terms of the excess risk (4). For simplicity, we only consider the single-layer RNN-TPP; the argument is the same for multi-layer RNN-TPPs.

We consider a (single-layer) RNN-TPP which admits the following model structure,

hj\displaystyle h_{j} =σ(Wxx(tj;S)+Whhj1),\displaystyle=\sigma(W_{x}x(t_{j};S)+W_{h}h_{j-1}),
λ^ne(t)\displaystyle\hat{\lambda}_{ne}(t) =f(α(ttj)+Wyhj+b),t(tj,tj+1],\displaystyle=f(\alpha(t-t_{j})+W_{y}h_{j}+b)\in\mathbb{R},~{}~{}t\in(t_{j},t_{j+1}], (27)

where x(t_{j};S) is the embedding for the j-th event, h_{0}=\mathbf{0}, \sigma(x)=\tanh(x), f(x)=(x\vee l_{f})\wedge u_{f}, and l_{f} and u_{f} will be determined from the true intensity. If we take \alpha=0, the model reduces to one with a constant hidden-state interpolation mechanism, as in Du et al. (2016); in other words, h^{(1)}(t;S)\equiv h_{j} for all t satisfying t_{j}\leq t<t_{j+1} when \alpha=0.
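
For concreteness, the following minimal Python sketch (with arbitrary placeholder weights; only the structure matters) evaluates the interpolated intensity in (27). Between two events the pre-activation is affine in t, so after the clamp f the intensity on each inter-event interval is piecewise linear in t, which is exactly what Theorem 8 below exploits.

import numpy as np

# Minimal sketch of the single-layer RNN-TPP with linear hidden-state interpolation, cf. (27).
# All weights below are arbitrary placeholders, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
d_h = 4
W_x, W_h = rng.normal(size=(d_h, 1)), rng.normal(size=(d_h, d_h))
W_y, b = rng.normal(size=(1, d_h)), 0.5
alpha_slope = 0.3                    # the scalar alpha in (27)
l_f, u_f = 1.0, 4.0                  # clamp levels, to be set from the true intensity

def intensity(t, events):
    # lambda_hat_ne(t) for t in (t_j, t_{j+1}]; here x(t_j; S) is simply taken to be t_j.
    h = np.zeros((d_h, 1))
    t_j = 0.0
    for t_e in events:
        if t_e >= t:
            break
        h = np.tanh(W_x * t_e + W_h @ h)                  # hidden-state update at event t_e
        t_j = t_e
    pre = alpha_slope * (t - t_j) + (W_y @ h).item() + b  # affine in t on (t_j, t_{j+1}]
    return float(np.clip(pre, l_f, u_f))                  # f(x) = (x v l_f) ^ u_f

print(intensity(1.7, events=[0.4, 1.1, 2.3]))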

Theorem 8.

Suppose the true model intensity on [0,T][0,T] has the following form,

λ(t)\displaystyle\lambda^{\ast}(t) =\displaystyle= {T,t[0,T/3]9Tt2,t(T/3,2T/3)4T,t[2T/3,T].\displaystyle\left\{\begin{aligned} &T&,~{}&t\in[0,T/3]\\ &\frac{9}{T}t^{2}&,~{}&t\in(T/3,2T/3)\\ &4T&,~{}&t\in[2T/3,T]\end{aligned}\right.\quad.

Hence we can take lf=Tl_{f}=T and uf=4Tu_{f}=4T, and then there exists a constant C>0C>0 such that

\displaystyle\min_{\hat{\lambda}_{ne}\text{ as }(27)}\mathbb{E}[\text{loss}(\hat{\lambda}_{ne},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\geq C>0. (28)

Theorem 8 tells us that RNN-TPPs with an improper hidden-state interpolation may fail to offer a good approximation, even under a very simple non-homogeneous Poisson model. Therefore, the user-specified input embedding function x(t;S) plays an important role in interpolating the hidden states; it should be carefully chosen so that x(t;S) summarizes the information of the past event history to a sufficient extent.

Remark 11.

One can substitute the linear interpolation mechanism in (27) with the exponentially decaying mechanism given in Mei and Eisner (2017); Theorem 8 still holds.

Remark 12.

For other types of f (e.g., Softplus) in the output layer, the failure of the linear interpolation mechanism can be shown similarly.

7 Discussion

In this paper, we give a positive answer to the question of whether RNN-TPPs can provably have small excess risks when estimating well-known TPPs. We establish excess risk bounds under the homogeneous Poisson process, non-homogeneous Poisson process, self-exciting process, and self-correcting process frameworks. Our analysis consists of two parts, the stochastic error and the approximation error. For the stochastic error, we use a novel truncation technique to decouple the randomness and make the classical empirical process theory applicable; we also carefully compute the Lipschitz constant of multi-layer RNNs, which is a useful intermediate result for future RNN-related work. For the approximation error, we construct a series of RNNs to approximate the intensities of different TPPs, providing explicit network depths, widths, and parameter weights. To the best of our knowledge, our work is the first to study the approximation ability of multi-layer RNNs over a continuous time domain. We believe the results in the current work add value to both the learning theory and neural network fields.

There are several possible extensions along the research line of neural network-based TPPs. First, it is not clear whether the approximation rate can be improved by a more refined RNN construction (with possibly fewer layers and smaller width) or by other approaches. Second, we only consider the “large n” setting, where the event sequences are observed on a bounded time domain [0,T] with n repeated samples; it is interesting to extend our results to the “large T” setting, where the end time T goes to infinity but the number of event sequences n remains fixed. Third, in the current work we do not take into account different event types, and it may be useful to extend our results to marked TPP settings. Moreover, it is also worth investigating the theoretical performance of other neural network architectures (e.g., Transformer-TPPs) that have performed well in recent empirical applications.

Supplementary Material for "On Non-asymptotic Theory of Recurrent Neural Networks in Temporal Point Processes"

Additional Notations in the Supplementary: For two random variables X and Y, we write X\leq_{s.t.}Y if \mathbb{P}(X>t)\leq\mathbb{P}(Y>t) for any t\in\mathbb{R}. We use \mathbb{N}_{+} to denote the set of positive integers.

8 Proofs in Sections 3 and 4

8.1 Proof of Lemma 1

By the definition of λˇ\check{\lambda}^{\ast} and λ^\hat{\lambda}, we have

𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]]\displaystyle\quad\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]]
=𝔼[loss(λ^,Stest)]𝔼[loss(λˇ,Stest)]+𝔼[loss(λˇ,Stest)]𝔼[loss(λ,Stest)]\displaystyle=\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]+\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]
𝔼[loss(λ^,Stest)]1ni[n]loss(λ^,Si)+1ni[n]loss(λˇ,Si)0𝔼[loss(λˇ,Stest)]\displaystyle\leq\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]\underbrace{-\frac{1}{n}\sum_{i\in[n]}\text{loss}(\hat{\lambda},S_{i})+\frac{1}{n}\sum_{i\in[n]}\text{loss}(\check{\lambda}^{\ast},S_{i})}_{\geq 0}-\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]
+𝔼[loss(λˇ,Stest)]𝔼[loss(λ,Stest)]\displaystyle\quad+\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]
2supλ|𝔼[loss(λ,Stest)]1ni[n]loss(λ,Si)|\displaystyle\leq 2\sup_{\lambda\in\mathcal{F}}\Big{|}\mathbb{E}[\text{loss}(\lambda,S_{test})]-\frac{1}{n}\sum_{i\in[n]}\text{loss}(\lambda,S_{i})\Big{|}
+𝔼[loss(λˇ,Stest)]𝔼[loss(λ,Stest)].\displaystyle\quad+\mathbb{E}[\text{loss}(\check{\lambda}^{\ast},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})].

8.2 Proof of Lemma 2

From model assumptions (A1) and (A2), we have \lambda^{*}(t)=\lambda_{0}(t)+\sum_{j:t_{j}<t}\mu(t-t_{j}), \int_{0}^{T}\mu(t)\mathrm{d}t\leq c_{\mu}<1, and \lambda_{0}(t)\leq B_{0}. Following the notation in the paper, we denote by N_{e} the number of events of \lambda^{*} in [0,T]. Consider another intensity \overline{\lambda}(t)=B_{0}+\sum_{j:t_{j}<t}\mu(t-t_{j}) and similarly denote by \overline{N}_{e} the number of events of \overline{\lambda} in [0,T]. Then for any fixed event sequence S=\{t_{j}\}, \lambda^{*}(t;S)\leq\overline{\lambda}(t;S), and thus N_{e}\leq_{s.t.}\overline{N}_{e}. By a formulation similar to that in Daley et al. (2003), the point process with intensity \overline{\lambda} is equivalent to a birth-immigration process with immigration intensity B_{0} and birth intensity \mu(t). Hence

N¯e=N¯0+i=1N¯i,\displaystyle\overline{N}_{e}=\overline{N}_{0}+\sum_{i=1}^{\infty}\overline{N}_{i},

where \overline{N}_{0}\sim\operatorname{Poisson}(B_{0}T) and \overline{N}_{k} is the number of events in generation k, i.e., the children of the events in generation k-1.
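
As a sanity check on this branching (birth-immigration) representation, the following Python sketch (a simulation aid only, not used in the proof; kernel and horizon values are hypothetical) draws the immigrants from a Poisson process with rate B_{0} on [0,T] and lets every event generate children on (t,T] with rate \mu(\cdot-t), so that the total count has the same law as \overline{N}_{e}.

import numpy as np

# Illustrative simulation of the birth-immigration (cluster) representation of N_bar_e.
rng = np.random.default_rng(1)
B0, T = 1.0, 2.0
alpha, beta = 0.5, 2.0                        # mu(t) = alpha * exp(-beta * t), so c_mu <= alpha / beta < 1

def children(t_parent):
    # Offspring of an event at t_parent: inhomogeneous Poisson with rate mu(t - t_parent) on (t_parent, T],
    # simulated by thinning a homogeneous Poisson process with dominating rate alpha.
    kids, t = [], t_parent
    while True:
        t += rng.exponential(1.0 / alpha)
        if t > T:
            return kids
        if rng.uniform() < np.exp(-beta * (t - t_parent)):
            kids.append(t)

def total_count():
    gen = list(rng.uniform(0, T, rng.poisson(B0 * T)))    # generation 0: immigrants, N_bar_0 ~ Poisson(B0*T)
    n = len(gen)
    while gen:                                            # generations 1, 2, ... until extinction
        gen = [c for parent in gen for c in children(parent)]
        n += len(gen)
    return n

counts = np.array([total_count() for _ in range(2000)])
print(counts.mean(), (counts >= 8).mean())                # empirical mean and a tail frequency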

For t1<t2t_{1}<t_{2}, let μt1t2=t1t2μ(tt1)dt\mu_{t_{1}}^{t_{2}}=\int_{t_{1}}^{t_{2}}\mu(t-t_{1})\mathrm{d}t. We have

𝔼[exp(sN¯0)]=exp(B0T(exp(s)1)),\displaystyle\mathbb{E}\left[\exp\left(s\overline{N}_{0}\right)\right]=\exp\left(B_{0}T\left(\exp(s)-1\right)\right),

and

𝔼[exp(sN¯k+1)]\displaystyle\mathbb{E}\left[\exp\left(s\overline{N}_{k+1}\right)\right] =𝔼[𝔼[exp(sN¯k+1)|{tj(k)}j=1N¯k]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\exp\left(s\overline{N}_{k+1}\right)\Big{|}\left\{t_{j}^{(k)}\right\}_{j=1}^{\overline{N}_{k}}\right]\right]
=𝔼[j=1N¯kexp(μtj(k)T(exp(s)1))]\displaystyle=\mathbb{E}\left[\prod_{j=1}^{\overline{N}_{k}}\exp\left(\mu_{t_{j}^{(k)}}^{T}\left(\exp(s)-1\right)\right)\right]
𝔼[exp(cμN¯k(exp(s)1))],\displaystyle\leq\mathbb{E}\left[\exp\left(c_{\mu}\overline{N}_{k}\left(\exp(s)-1\right)\right)\right],

for any s>0s>0. Since cμ<1c_{\mu}<1, for any fixed c1(cμ,1]c_{1}\in(c_{\mu},1] and any s(0,log(c1/cμ)]s\in(0,\log(c_{1}/c_{\mu})], we have

𝔼[exp(sN¯k)]𝔼[exp(cμN¯k1(exp(s)1))]𝔼[exp(c1sN¯k1)]𝔼[exp(c1ksN¯0)],\displaystyle\mathbb{E}\left[\exp\left(s\overline{N}_{k}\right)\right]\leq\mathbb{E}\left[\exp\left(c_{\mu}\overline{N}_{k-1}\left(\exp(s)-1\right)\right)\right]\leq{\mathbb{E}}\left[\exp\left(c_{1}s\overline{N}_{k-1}\right)\right]\leq\cdots\leq{\mathbb{E}}\left[\exp\left(c_{1}^{k}s\overline{N}_{0}\right)\right],

i.e.

𝔼[exp(sN¯k)]𝔼[exp(c1ksN¯0)]=exp(B0T(exp(c1ks)1))exp(c1k+1cμ(B0T)s)\displaystyle\mathbb{E}\left[\exp\left(s\overline{N}_{k}\right)\right]\leq\mathbb{E}\left[\exp\left(c_{1}^{k}s\overline{N}_{0}\right)\right]=\exp\left(B_{0}T\left(\exp\left(c_{1}^{k}s\right)-1\right)\right)\leq\exp\left(\frac{c_{1}^{k+1}}{c_{\mu}}(B_{0}T)s\right)

for any kk\in\mathbb{N}.

Since \overline{N}_{k} can only take integer values, we have \mathbb{P}(\overline{N}_{k}=0)+e^{s}\mathbb{P}(\overline{N}_{k}\neq 0)\leq\mathbb{E}\left[\exp(s\overline{N}_{k})\right]. Thus

(N¯k0)𝔼[exp(sN¯k)]1exp(s)1c1k+2cμ2(B0T),s(0,min{cμc1k+1(B0T),1}log(c1cμ)].\displaystyle\mathbb{P}\left(\overline{N}_{k}\neq 0\right)\leq\frac{\mathbb{E}\left[\exp\left(s\overline{N}_{k}\right)\right]-1}{\exp(s)-1}\leq\frac{c_{1}^{k+2}}{c_{\mu}^{2}}(B_{0}T),~{}\forall s\in\left(0,\min\left\{\frac{c_{\mu}}{c_{1}^{k+1}(B_{0}T)},1\right\}\log\left(\frac{c_{1}}{c_{\mu}}\right)\right].

Setting c1cμc_{1}\searrow c_{\mu}, we get

(N¯k0)cμk(B0T).\displaystyle\mathbb{P}\left(\overline{N}_{k}\neq 0\right)\leq c_{\mu}^{k}(B_{0}T).

Now take c_{1}\in(c_{\mu},1), so that c_{1}^{-1}(1-c_{1})\sum_{k=1}^{\infty}c_{1}^{k}=1. By Boole's inequality, we have

(k=0N¯kN)\displaystyle\mathbb{P}\left(\sum_{k=0}^{\infty}\overline{N}_{k}\geq N\right) k=0(N¯k1c1c1c1k+1N)\displaystyle\leq\sum_{k=0}^{\infty}\mathbb{P}\left(\overline{N}_{k}\geq\frac{1-c_{1}}{c_{1}}c_{1}^{k+1}N\right)
k=0K01(N¯k1c1c1c1k+1N)+k=K0(N¯k0).\displaystyle\leq\sum_{k=0}^{K_{0}-1}\mathbb{P}\left(\overline{N}_{k}\geq\frac{1-c_{1}}{c_{1}}c_{1}^{k+1}N\right)+\sum_{k=K_{0}}^{\infty}\mathbb{P}\left(\overline{N}_{k}\neq 0\right).

For the second term, \sum_{k=K_{0}}^{\infty}\mathbb{P}\left(\overline{N}_{k}\neq 0\right)\leq\sum_{k=K_{0}}^{\infty}c_{\mu}^{k}(B_{0}T)=c_{\mu}^{K_{0}}(B_{0}T)/(1-c_{\mu}). We require c_{\mu}^{K_{0}}(B_{0}T)/(1-c_{\mu})\leq\delta/(2n), for which it suffices that

K0log(2nB0T/[δ(1cμ)])log(1/cμ).\displaystyle K_{0}\geq\frac{\log\left(2nB_{0}T/[\delta(1-c_{\mu})]\right)}{\log\left(1/c_{\mu}\right)}.

For the first term, we have

k=0K01(N¯k1c1c1c1k+1N)\displaystyle\sum_{k=0}^{K_{0}-1}\mathbb{P}\left(\overline{N}_{k}\geq\frac{1-c_{1}}{c_{1}}c_{1}^{k+1}N\right) k=0K01exp(s(1c1c1c1k+1N))𝔼[exp(sN¯k)]\displaystyle\leq\sum_{k=0}^{K_{0}-1}\exp\left(-s\left(\frac{1-c_{1}}{c_{1}}c_{1}^{k+1}N\right)\right)\mathbb{E}\left[\exp(s\overline{N}_{k})\right]
k=0K01exp(c1k+1s(B0Tcμ1c1c1N)),\displaystyle\leq\sum_{k=0}^{K_{0}-1}\exp\left(c_{1}^{k+1}s\left(\frac{B_{0}T}{c_{\mu}}-\frac{1-c_{1}}{c_{1}}N\right)\right),

where s(0,log(c1/cμ)]s\in(0,\log(c_{1}/c_{\mu})]. We can take c1s(B0T/cμ(1c1)N/c1)log(δ/(2nK0))c_{1}s\left(B_{0}T/c_{\mu}-(1-c_{1})N/c_{1}\right)\leq\log(\delta/(2nK_{0})) so that k=0K01exp(c1k+1s(B0T/cμ(1c1)N/c1))δ/(2n)\sum_{k=0}^{K_{0}-1}\exp\left(c_{1}^{k+1}s\left(B_{0}T/c_{\mu}-(1-c_{1})N/c_{1}\right)\right)\leq\delta/(2n). Then

N11c1(1slog(2nK0δ)+c1cμ(B0T)).\displaystyle N\geq\frac{1}{1-c_{1}}\left(\frac{1}{s}\log\left(\frac{2nK_{0}}{\delta}\right)+\frac{c_{1}}{c_{\mu}}(B_{0}T)\right).

Now let η=c1/cμ(1,1/cμ)\eta=c_{1}/c_{\mu}\in(1,1/c_{\mu}), s=log(c1/cμ)=log(η)s=\log(c_{1}/c_{\mu})=\log(\eta), and N[log(2nK0/δ)/log(η)+η(B0T)]/(1cμη)N\geq\left[\log\left(2nK_{0}/\delta\right)/\log(\eta)+\eta(B_{0}T)\right]/(1-c_{\mu}\eta). Taking K0=log(2nB0T/[δ(1cμ)])/log(1/cμ)K_{0}=\lceil\log\left(2nB_{0}T/[\delta(1-c_{\mu})]\right)/\log\left(1/c_{\mu}\right)\rceil and
N=[log(2nK0/δ)/log(η)+η(B0T)]/(1cμη)N=\left[\log\left(2nK_{0}/\delta\right)/\log(\eta)+\eta(B_{0}T)\right]/(1-c_{\mu}\eta), we have (NeN)(N¯eN)δ/n\mathbb{P}\left(N_{e}\geq N\right)\leq\mathbb{P}\left(\overline{N}_{e}\geq N\right)\leq\delta/n. Since

(Ne(n)N)=1(Ne(n)<N)=1i=1n(Ne<N)1(1δn)nδ,\displaystyle\mathbb{P}(N_{e(n)}\geq N)=1-\mathbb{P}(N_{e(n)}<N)=1-\prod_{i=1}^{n}\mathbb{P}\left(N_{e}<N\right)\leq 1-\left(1-\frac{\delta}{n}\right)^{n}\leq\delta,

we get that with probability at least 1δ1-\delta,

Ne(n)<N11cμη[1log(η)log(2nKn,δδ)+η(B0T)],\displaystyle N_{e(n)}<N\leq\frac{1}{1-c_{\mu}\eta}\left[\frac{1}{\log(\eta)}\log\left(\frac{2nK_{n,\delta}}{\delta}\right)+\eta(B_{0}T)\right],

where η(1,1/cμ)\eta\in(1,1/c_{\mu}), and Kn,δ=log(2nB0T/δ(1cμ))/log(1/cμ)+1K_{n,\delta}=\log\left(2nB_{0}T/\delta(1-c_{\mu})\right)/\log\left(1/c_{\mu}\right)+1. Since 11/xlog(x)x11-1/x\leq\log(x)\leq x-1, we have Kn,δ2nB0T/[δ(1cμ)2]K_{n,\delta}\leq 2nB_{0}T/[\delta(1-c_{\mu})^{2}]. Thus with probability at least 1δ1-\delta,

Ne(n)<11cμη[2log(η)log(2nB0Tδ(1cμ))+η(B0T)].\displaystyle N_{e(n)}<\frac{1}{1-c_{\mu}\eta}\left[\frac{2}{\log(\eta)}\log\left(\frac{2n\sqrt{B_{0}T}}{\delta(1-c_{\mu})}\right)+\eta(B_{0}T)\right].

Taking n=1 and \left[2\log\left(2\sqrt{B_{0}T}/[\delta(1-c_{\mu})]\right)/\log(\eta)+\eta(B_{0}T)\right]/(1-c_{\mu}\eta)=s, we have \delta=2\sqrt{B_{0}T}\exp\left(\log(\eta)\left[\eta(B_{0}T)-(1-c_{\mu}\eta)s\right]/2\right)/(1-c_{\mu}). Then

(Ne=s)(Nes)2B0T1cμexp(log(η)2[η(B0T)(1cμη)s]).\displaystyle\mathbb{P}\left(N_{e}=s\right)\leq\mathbb{P}\left(N_{e}\geq s\right)\leq\frac{2\sqrt{B_{0}T}}{1-c_{\mu}}\exp\left(\frac{\log(\eta)}{2}\left[\eta(B_{0}T)-(1-c_{\mu}\eta)s\right]\right).

8.3 Proof of Lemma 4

The proof is based on induction. Using the same notation, we first establish two claims.

Claim 1.

For any 1\leq l\leq L and 1\leq i\leq N, \|h_{i,1}^{(l)}-h_{i,2}^{(l)}\|_{2} is bounded by

hi,1(l)hi,2(l)2ρσ(r=0l1γrSi1rΔblr+BσDr=0l2γrSi1rΔxlr+Bin(T)γl1Si1l1Δx1+BσDr=0l1γrSi2rΔhlr).\displaystyle\left\|h_{i,1}^{(l)}-h_{i,2}^{(l)}\right\|_{2}\leq\rho_{\sigma}\left(\sum_{r=0}^{l-1}\gamma^{r}S_{i-1}^{r}\Delta_{b}^{l-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l-2}\gamma^{r}S_{i-1}^{r}\Delta_{x}^{l-r}+B_{in}(T)\gamma^{l-1}S_{i-1}^{l-1}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l-1}\gamma^{r}S_{i-2}^{r}\Delta_{h}^{l-r}\right). (29)
Proof of Claim 1.

When i=1i=1, we have

h1,1(l)h1,2(l)2\displaystyle\left\|h_{1,1}^{(l)}-h_{1,2}^{(l)}\right\|_{2} =σ(Wx,1(l)h1(l1)+b1(l))σ(Wx,2(l)h2(l1)+b2(l))2\displaystyle=\left\|\sigma\left(W_{x,1}^{(l)}h_{1}^{(l-1)}+b_{1}^{(l)}\right)-\sigma\left(W_{x,2}^{(l)}h_{2}^{(l-1)}+b_{2}^{(l)}\right)\right\|_{2}
ρσ(Wx,1(l)h1(l1)Wx,2(l)h2(l1)2+b1(l)b2(l)2)\displaystyle\leq\rho_{\sigma}\left(\left\|W_{x,1}^{(l)}h_{1}^{(l-1)}-W_{x,2}^{(l)}h_{2}^{(l-1)}\right\|_{2}+\left\|b_{1}^{(l)}-b_{2}^{(l)}\right\|_{2}\right)
ρσ(BσDΔxl+Bxh1,1(l1)h1,2(l1)2+Δbl).\displaystyle\leq\rho_{\sigma}\left(B_{\sigma}\sqrt{D}\Delta_{x}^{l}+B_{x}\left\|h_{1,1}^{(l-1)}-h_{1,2}^{(l-1)}\right\|_{2}+\Delta_{b}^{l}\right).

Repeating this derivation recursively, we get

h1,1(l)h1,2(l)2\displaystyle\left\|h_{1,1}^{(l)}-h_{1,2}^{(l)}\right\|_{2} ρσ(BσDΔxl+Bxh1,1(l1)h1,2(l1)2+Δbl)\displaystyle\leq\rho_{\sigma}\left(B_{\sigma}\sqrt{D}\Delta_{x}^{l}+B_{x}\left\|h_{1,1}^{(l-1)}-h_{1,2}^{(l-1)}\right\|_{2}+\Delta_{b}^{l}\right)
ρσΔbl+ρσBσDΔxl+γ(ρσΔbl+ρσBσDΔxl+γh1,1(l2)h1,2(l2)2)\displaystyle\leq\rho_{\sigma}\Delta_{b}^{l}+\rho_{\sigma}B_{\sigma}\sqrt{D}\Delta_{x}^{l}+\gamma\left(\rho_{\sigma}\Delta_{b}^{l}+\rho_{\sigma}B_{\sigma}\sqrt{D}\Delta_{x}^{l}+\gamma\left\|h_{1,1}^{(l-2)}-h_{1,2}^{(l-2)}\right\|_{2}\right)
\displaystyle\leq\cdots\cdots
ρσ(r=0l1γrΔblr+BσDr=0l2γrΔxlr+Bin(T)γl1Δx1).\displaystyle\leq\rho_{\sigma}\left(\sum_{r=0}^{l-1}\gamma^{r}\Delta_{b}^{l-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l-2}\gamma^{r}\Delta_{x}^{l-r}+B_{in}(T)\gamma^{l-1}\Delta_{x}^{1}\right).

When l=1l=1, we have

hi,1(1)hi,2(1)2\displaystyle\left\|h_{i,1}^{(1)}-h_{i,2}^{(1)}\right\|_{2} =σ(Wx,1(1)x(ti;ti1)+Wh,1(1)hi1,1(1)+b1(1))σ(Wx,2(1)x(ti;ti1)+Wh,2(1)hi1,2(1)+b2(1))2\displaystyle=\left\|\sigma\left(W_{x,1}^{(1)}x(t_{i};t_{i-1})+W_{h,1}^{(1)}h_{i-1,1}^{(1)}+b_{1}^{(1)}\right)-\sigma\left(W_{x,2}^{(1)}x(t_{i};t_{i-1})+W_{h,2}^{(1)}h_{i-1,2}^{(1)}+b_{2}^{(1)}\right)\right\|_{2}
ρσ(Bin(T)Wx,1(1)Wx,2(1)2+Wh,1(1)hi1,1(1)Wh,2(1)hi1,2(1)2+b1(1)b2(1)2)\displaystyle\leq\rho_{\sigma}\left(B_{in}(T)\left\|W_{x,1}^{(1)}-W_{x,2}^{(1)}\right\|_{2}+\left\|W_{h,1}^{(1)}h_{i-1,1}^{(1)}-W_{h,2}^{(1)}h_{i-1,2}^{(1)}\right\|_{2}+\left\|b_{1}^{(1)}-b_{2}^{(1)}\right\|_{2}\right)
ρσ(Bin(T)Δx1+BσDΔh1+Bhhi1,1(1)hi1,2(1)2+Δb1).\displaystyle\leq\rho_{\sigma}\left(B_{in}(T)\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\Delta_{h}^{1}+B_{h}\left\|h_{i-1,1}^{(1)}-h_{i-1,2}^{(1)}\right\|_{2}+\Delta_{b}^{1}\right).

Repeating it recursively again, we get

hi,1(1)hi,2(1)2\displaystyle\quad\left\|h_{i,1}^{(1)}-h_{i,2}^{(1)}\right\|_{2}
ρσ(Bin(T)Δx1+BσDΔh1+Bhhi1,1(1)hi1,2(1)2+Δb1)\displaystyle\leq\rho_{\sigma}\left(B_{in}(T)\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\Delta_{h}^{1}+B_{h}\left\|h_{i-1,1}^{(1)}-h_{i-1,2}^{(1)}\right\|_{2}+\Delta_{b}^{1}\right)
ρσ(Bin(T)Δx1+BσDΔh1+Δb1)+β(ρσ(Bin(T)Δx1+BσDΔh1+Δb1)+βhi2,1(1)hi2,2(1)2)\displaystyle\leq\rho_{\sigma}\left(B_{in}(T)\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\Delta_{h}^{1}+\Delta_{b}^{1}\right)+\beta\left(\rho_{\sigma}\left(B_{in}(T)\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\Delta_{h}^{1}+\Delta_{b}^{1}\right)+\beta\left\|h_{i-2,1}^{(1)}-h_{i-2,2}^{(1)}\right\|_{2}\right)
\displaystyle\leq\cdots\cdots
ρσ(Si10Δb1+Bin(T)Si10Δx1+BσDSi20Δh1).\displaystyle\leq\rho_{\sigma}\left(S_{i-1}^{0}\Delta_{b}^{1}+B_{in}(T)S_{i-1}^{0}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}S_{i-2}^{0}\Delta_{h}^{1}\right).

Now suppose that (29) holds for all i<i_{0} and l<l_{0}. Considering the case i=i_{0}, l=l_{0}, we have

hi0,1(l0)hi0,2(l0)2\displaystyle\quad\left\|h_{i_{0},1}^{(l_{0})}-h_{i_{0},2}^{(l_{0})}\right\|_{2}
=σ(Wx,1(l0)hi0,1(l01)+Wh,1(l0)hi01,1(l0)+b1(l0))σ(Wx,2(l0)hi0,2(l01)+Wh,2(l0)hi01,2(l0)+b2(l0))2\displaystyle=\left\|\sigma\left(W_{x,1}^{(l_{0})}h_{i_{0},1}^{(l_{0}-1)}+W_{h,1}^{(l_{0})}h_{i_{0}-1,1}^{(l_{0})}+b_{1}^{(l_{0})}\right)-\sigma\left(W_{x,2}^{(l_{0})}h_{i_{0},2}^{(l_{0}-1)}+W_{h,2}^{(l_{0})}h_{i_{0}-1,2}^{(l_{0})}+b_{2}^{(l_{0})}\right)\right\|_{2}
ρσ(Wx,1(l0)hi0,1(l01)Wx,2(l0)hi0,2(l01)2+Wh,1(l0)hi01,1(l0)Wh,2(l0)hi01,2(l0)2+b1(l0)b2(l0)2)\displaystyle\leq\rho_{\sigma}\left(\left\|W_{x,1}^{(l_{0})}h_{i_{0},1}^{(l_{0}-1)}-W_{x,2}^{(l_{0})}h_{i_{0},2}^{(l_{0}-1)}\right\|_{2}+\left\|W_{h,1}^{(l_{0})}h_{i_{0}-1,1}^{(l_{0})}-W_{h,2}^{(l_{0})}h_{i_{0}-1,2}^{(l_{0})}\right\|_{2}+\left\|b_{1}^{(l_{0})}-b_{2}^{(l_{0})}\right\|_{2}\right)
ρσ(Bxhi0,1(l01)hi0,2(l01)2+BσDΔxl0+Bhhi01,1(l0)hi01,2(l0)2+BσDΔxl0+Δbl0)\displaystyle\leq\rho_{\sigma}\left(B_{x}\left\|h_{i_{0},1}^{(l_{0}-1)}-h_{i_{0},2}^{(l_{0}-1)}\right\|_{2}+B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}+B_{h}\left\|h_{i_{0}-1,1}^{(l_{0})}-h_{i_{0}-1,2}^{(l_{0})}\right\|_{2}+B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}+\Delta_{b}^{l_{0}}\right)
ρσβ(r=0l01γrSi02rΔbl0r+BσDr=0l02γrSi02rΔxl0r+Bin(T)γl01Si02l01Δx1\displaystyle\leq\rho_{\sigma}\beta\left(\sum_{r=0}^{l_{0}-1}\gamma^{r}S_{i_{0}-2}^{r}\Delta_{b}^{l_{0}-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i_{0}-2}^{r}\Delta_{x}^{l_{0}-r}+B_{in}(T)\gamma^{l_{0}-1}S_{i_{0}-2}^{l_{0}-1}\Delta_{x}^{1}\right.
+BσDr=0l01γrSi03rΔhl0r)+ρσγ(r=0l02γrSi01rΔbl01r+BσDr=0l03γrSi01rΔxl01r\displaystyle\quad\quad\quad\left.+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-1}\gamma^{r}S_{i_{0}-3}^{r}\Delta_{h}^{l_{0}-r}\right)+\rho_{\sigma}\gamma\left(\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i_{0}-1}^{r}\Delta_{b}^{l_{0}-1-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-3}\gamma^{r}S_{i_{0}-1}^{r}\Delta_{x}^{l_{0}-1-r}\right.
Bin(T)γl02Si01l02Δx1+BσDr=0l02γrSi02rΔhl01r)+ρσ(BσDΔxl0+BσDΔxl0+Δbl0)\displaystyle\quad\quad\quad\left.B_{in}(T)\gamma^{l_{0}-2}S_{i_{0}-1}^{l_{0}-2}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i_{0}-2}^{r}\Delta_{h}^{l_{0}-1-r}\right)+\rho_{\sigma}\left(B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}+B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}+\Delta_{b}^{l_{0}}\right)
ρσ(r=1l01γr(βSi02r+Si01r1)Δbl0r+BσDr=1l02γr(βSi02r+Si01r1)Δxl0r\displaystyle\leq\rho_{\sigma}\left(\sum_{r=1}^{l_{0}-1}\gamma^{r}(\beta S_{i_{0}-2}^{r}+S_{i_{0}-1}^{r-1})\Delta_{b}^{l_{0}-r}+B_{\sigma}\sqrt{D}\sum_{r=1}^{l_{0}-2}\gamma^{r}(\beta S_{i_{0}-2}^{r}+S_{i_{0}-1}^{r-1})\Delta_{x}^{l_{0}-r}\right.
+Bin(T)γl01(βSi02l01+Si01l02)Δx1+BσDr=1l01γr(βSi03r+Si02r1)Δhl0r)\displaystyle\quad\quad\quad\left.+B_{in}(T)\gamma^{l_{0}-1}(\beta S_{i_{0}-2}^{l_{0}-1}+S_{i_{0}-1}^{l_{0}-2})\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{r=1}^{l_{0}-1}\gamma^{r}(\beta S_{i_{0}-3}^{r}+S_{i_{0}-2}^{r-1})\Delta_{h}^{l_{0}-r}\right)
+ρσ((1+βSi020)(Δbl0+BσDΔxl0)+(1+βSi030)BσDΔhl0).\displaystyle\quad+\rho_{\sigma}\left((1+\beta S_{i_{0}-2}^{0})\left(\Delta_{b}^{l_{0}}+B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}\right)+(1+\beta S_{i_{0}-3}^{0})B_{\sigma}\sqrt{D}\Delta_{h}^{l_{0}}\right).

Using the fact that 1+βSi10=Si01+\beta S_{i-1}^{0}=S_{i}^{0} and

βSi1r+Sir1\displaystyle\beta S_{i-1}^{r}+S_{i}^{r-1} =βj=0i1(j+rr)βj+j=0i(j+r1r1)βj=1+j=1i((j+r1r)+(j+r1r1))βj\displaystyle=\beta\sum_{j=0}^{i-1}\tbinom{j+r}{r}\beta^{j}+\sum_{j=0}^{i}\tbinom{j+r-1}{r-1}\beta^{j}=1+\sum_{j=1}^{i}\left(\tbinom{j+r-1}{r}+\tbinom{j+r-1}{r-1}\right)\beta^{j}
=j=0i(j+rr)βj=Sir,\displaystyle=\sum_{j=0}^{i}\tbinom{j+r}{r}\beta^{j}=S_{i}^{r},

(29) is proved. ∎

Claim 2.

For any 1\leq l\leq L, 1\leq i\leq N, and t\in(t_{i},t_{i+1}], \|h_{1}^{(l)}(t;S)-h_{2}^{(l)}(t;S)\|_{2} is bounded by

h1(l)(t;S)h2(l)(t;S)2\displaystyle\left\|h_{1}^{(l)}(t;S)-h_{2}^{(l)}(t;S)\right\|_{2} ρσ(r=0l1γrSirΔblr+BσDr=0l2γrSirΔxlr\displaystyle\leq\rho_{\sigma}\left(\sum_{r=0}^{l-1}\gamma^{r}S_{i}^{r}\Delta_{b}^{l-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l-2}\gamma^{r}S_{i}^{r}\Delta_{x}^{l-r}\right. (30)
+Bin(T)γl1Sil1Δx1+BσDr=0l1γrSi1rΔhlr).\displaystyle\left.+B_{in}(T)\gamma^{l-1}S_{i}^{l-1}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l-1}\gamma^{r}S_{i-1}^{r}\Delta_{h}^{l-r}\right). (31)
Proof of Claim 2.

When l=1l=1, by the definition of h(1)(t;S)h^{(1)}(t;S) and (29), for any 1iN1\leq i\leq N and t(ti,ti+1]t\in(t_{i},t_{i+1}], we have

h1(1)(t;S)h2(1)(t;S)2\displaystyle\left\|h_{1}^{(1)}(t;S)-h_{2}^{(1)}(t;S)\right\|_{2} =σ(Wx,1(1)x(t;ti)+Wh,1(1)hi,1(1)+b1(1))σ(Wx,2(1)x(t;ti)+Wh,2(1)hi,2(1)+b2(1))2\displaystyle=\left\|\sigma\left(W_{x,1}^{(1)}x(t;t_{i})+W_{h,1}^{(1)}h_{i,1}^{(1)}+b_{1}^{(1)}\right)-\sigma\left(W_{x,2}^{(1)}x(t;t_{i})+W_{h,2}^{(1)}h_{i,2}^{(1)}+b_{2}^{(1)}\right)\right\|_{2}
ρσ(Bin(T)Wx,1(1)Wx,2(1)2+Wh,1(1)hi,1(1)Wh,2(1)hi,2(1)2+b1(1)b2(1)2)\displaystyle\leq\rho_{\sigma}\left(B_{in}(T)\left\|W_{x,1}^{(1)}-W_{x,2}^{(1)}\right\|_{2}+\left\|W_{h,1}^{(1)}h_{i,1}^{(1)}-W_{h,2}^{(1)}h_{i,2}^{(1)}\right\|_{2}+\left\|b_{1}^{(1)}-b_{2}^{(1)}\right\|_{2}\right)
ρσ(Bin(T)Δx1+BσDΔh1+Bhhi,1(1)hi,2(1)2+Δb1)\displaystyle\leq\rho_{\sigma}\left(B_{in}(T)\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\Delta_{h}^{1}+B_{h}\left\|h_{i,1}^{(1)}-h_{i,2}^{(1)}\right\|_{2}+\Delta_{b}^{1}\right)
ρσβ(Si10Δb1+BσDSi10Δx1+BσDSi20Δh1)\displaystyle\leq\rho_{\sigma}\beta\left(S_{i-1}^{0}\Delta_{b}^{1}+B_{\sigma}\sqrt{D}S_{i-1}^{0}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}S_{i-2}^{0}\Delta_{h}^{1}\right)
+ρσ(Bin(T)Δx1+BσDΔh1+Δb1)\displaystyle\quad+\rho_{\sigma}\left(B_{in}(T)\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\Delta_{h}^{1}+\Delta_{b}^{1}\right)
ρσ(Si0Δb1+Bin(T)Si0Δx1+BσDSi10Δh1).\displaystyle\leq\rho_{\sigma}\left(S_{i}^{0}\Delta_{b}^{1}+B_{in}(T)S_{i}^{0}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}S_{i-1}^{0}\Delta_{h}^{1}\right).

Now suppose that (30) holds for all l<l_{0}, any 1\leq i\leq N, and t\in(t_{i},t_{i+1}]. Considering the case l=l_{0}, for any 1\leq i\leq N and t\in(t_{i},t_{i+1}], we have

h1(l0)(t;S)h2(l0)(t;S)2\displaystyle\quad\left\|h_{1}^{(l_{0})}(t;S)-h_{2}^{(l_{0})}(t;S)\right\|_{2}
=σ(Wx,1(l0)h1(l01)(t;S)+Wh,1(l0)hi,1(l0)+b1(l0))σ(Wx,2(l0)h2(l01)(t;S)+Wh,2(l0)hi,2(l0)+b2(l0))2\displaystyle=\left\|\sigma\left(W_{x,1}^{(l_{0})}h_{1}^{(l_{0}-1)}(t;S)+W_{h,1}^{(l_{0})}h_{i,1}^{(l_{0})}+b_{1}^{(l_{0})}\right)-\sigma\left(W_{x,2}^{(l_{0})}h_{2}^{(l_{0}-1)}(t;S)+W_{h,2}^{(l_{0})}h_{i,2}^{(l_{0})}+b_{2}^{(l_{0})}\right)\right\|_{2}
ρσ(Wx,1(l0)h1(l01)(t;S)Wx,2(l0)h2(l01)(t;S)2+Wh,1(l0)hi,1(l0)Wh,2(l0)hi,2(l0)2+b1(l0)b2(l0)2)\displaystyle\leq\rho_{\sigma}\left(\left\|W_{x,1}^{(l_{0})}h_{1}^{(l_{0}-1)}(t;S)-W_{x,2}^{(l_{0})}h_{2}^{(l_{0}-1)}(t;S)\right\|_{2}+\left\|W_{h,1}^{(l_{0})}h_{i,1}^{(l_{0})}-W_{h,2}^{(l_{0})}h_{i,2}^{(l_{0})}\right\|_{2}+\left\|b_{1}^{(l_{0})}-b_{2}^{(l_{0})}\right\|_{2}\right)
ρσ(Δbl0+BσDΔxl0+Bxh1(l01)(t;S)h2(l01)(t;S)2+BσDΔhl0+Bhhi,1(l0)hi,2(l0)2)\displaystyle\leq\rho_{\sigma}\left(\Delta_{b}^{l_{0}}+B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}+B_{x}\left\|h_{1}^{(l_{0}-1)}(t;S)-h_{2}^{(l_{0}-1)}(t;S)\right\|_{2}+B_{\sigma}\sqrt{D}\Delta_{h}^{l_{0}}+B_{h}\left\|h_{i,1}^{(l_{0})}-h_{i,2}^{(l_{0})}\right\|_{2}\right)
ρσγ(r=0l02γrSirΔbl01r+BσDr=0l03γrSirΔxl01r+Bin(T)γl02Sil02Δx1\displaystyle\leq\rho_{\sigma}\gamma\left(\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i}^{r}\Delta_{b}^{l_{0}-1-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-3}\gamma^{r}S_{i}^{r}\Delta_{x}^{l_{0}-1-r}+B_{in}(T)\gamma^{l_{0}-2}S_{i}^{l_{0}-2}\Delta_{x}^{1}\right.
+BσDr=0l02γrSi1rΔhl01r)+ρσβ(r=0l01γrSi1rΔbl0r+BσDr=0l02γrSi1rΔxl0r\displaystyle\quad\quad\quad\left.+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i-1}^{r}\Delta_{h}^{l_{0}-1-r}\right)+\rho_{\sigma}\beta\left(\sum_{r=0}^{l_{0}-1}\gamma^{r}S_{i-1}^{r}\Delta_{b}^{l_{0}-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i-1}^{r}\Delta_{x}^{l_{0}-r}\right.
+Bin(T)γl01Si1l01Δx1+BσDl=0l01γrSi2rΔhl0r)+ρσ(Δbl0+BσDΔxl0+BσDΔhl0)\displaystyle\quad\quad\quad\left.+B_{in}(T)\gamma^{l_{0}-1}S_{i-1}^{l_{0}-1}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{l=0}^{l_{0}-1}\gamma^{r}S_{i-2}^{r}\Delta_{h}^{l_{0}-r}\right)+\rho_{\sigma}\left(\Delta_{b}^{l_{0}}+B_{\sigma}\sqrt{D}\Delta_{x}^{l_{0}}+B_{\sigma}\sqrt{D}\Delta_{h}^{l_{0}}\right)
ρσ(r=1l01γr(βSi1r+Sir1)Δbl0r+BσDr=1l02γr(βSi1r+Sir1)Δxl0r\displaystyle\leq\rho_{\sigma}\left(\sum_{r=1}^{l_{0}-1}\gamma^{r}(\beta S_{i-1}^{r}+S_{i}^{r-1})\Delta_{b}^{l_{0}-r}+B_{\sigma}\sqrt{D}\sum_{r=1}^{l_{0}-2}\gamma^{r}(\beta S_{i-1}^{r}+S_{i}^{r-1})\Delta_{x}^{l_{0}-r}\right.
+Bin(T)γl01(βSi1l01+Sil02)Δx1+BσDr=1l01γr(βSi2r+Si1r1)Δhl0r)\displaystyle\quad\quad\quad\left.+B_{in}(T)\gamma^{l_{0}-1}(\beta S_{i-1}^{l_{0}-1}+S_{i}^{l_{0}-2})\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{r=1}^{l_{0}-1}\gamma^{r}(\beta S_{i-2}^{r}+S_{i-1}^{r-1})\Delta_{h}^{l_{0}-r}\right)
+ρσ((1+βSi20)(Δbl0+(BσDBin(T))Δxl0)+(1+βSi30)BσDΔhl0)\displaystyle\quad+\rho_{\sigma}\left((1+\beta S_{i-2}^{0})\left(\Delta_{b}^{l_{0}}+(B_{\sigma}\sqrt{D}\vee B_{in}(T))\Delta_{x}^{l_{0}}\right)+(1+\beta S_{i-3}^{0})B_{\sigma}\sqrt{D}\Delta_{h}^{l_{0}}\right)
ρσ(r=0l01γrSirΔbl0r+BσDr=0l02γrSirΔxl0r+Bin(T)γl01Sil01Δx1+BσDr=0l01γrSi1rΔhl0r).\displaystyle\leq\rho_{\sigma}\left(\sum_{r=0}^{l_{0}-1}\gamma^{r}S_{i}^{r}\Delta_{b}^{l_{0}-r}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-2}\gamma^{r}S_{i}^{r}\Delta_{x}^{l_{0}-r}+B_{in}(T)\gamma^{l_{0}-1}S_{i}^{l_{0}-1}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{r=0}^{l_{0}-1}\gamma^{r}S_{i-1}^{r}\Delta_{h}^{l_{0}-r}\right).

Hence (31) is proved. ∎

Now we prove Lemma 4. For t(ti,ti+1]t\in(t_{i},t_{i+1}], we have

|λθ1(t;S)λθ2(t;S)|\displaystyle\left|\lambda_{\theta_{1}}(t;S)-\lambda_{\theta_{2}}(t;S)\right| =|f(Wx,1(L+1)h(L)(t;S)+b1(L+1))f(Wx,2(L+1)h(L)(t;S)+b2(L+1))|\displaystyle=\left|f\left(W_{x,1}^{(L+1)}h^{(L)}(t;S)+b_{1}^{(L+1)}\right)-f\left(W_{x,2}^{(L+1)}h^{(L)}(t;S)+b_{2}^{(L+1)}\right)\right|
ρf(b1(L+1)b2(L+1)2+Wx,1(L+1)h(L)(t;S)Wx,2(L+1)h(L)(t;S)2)\displaystyle\leq\rho_{f}\left(\left\|b_{1}^{(L+1)}-b_{2}^{(L+1)}\right\|_{2}+\left\|W_{x,1}^{(L+1)}h^{(L)}(t;S)-W_{x,2}^{(L+1)}h^{(L)}(t;S)\right\|_{2}\right)
ρf(ΔbL+1+BσDΔxL+1+Bxh1(L)(t;S)h2(L)(t;S)2)\displaystyle\leq\rho_{f}\left(\Delta_{b}^{L+1}+B_{\sigma}\sqrt{D}\Delta_{x}^{L+1}+B_{x}\left\|h_{1}^{(L)}(t;S)-h_{2}^{(L)}(t;S)\right\|_{2}\right)
ρfγ(l=0L1γlSilΔbLl+BσDl=0L2γlSilΔxLl+Bin(T)γL1SiL1Δx1\displaystyle\leq\rho_{f}\gamma\left(\sum_{l=0}^{L-1}\gamma^{l}S_{i}^{l}\Delta_{b}^{L-l}+B_{\sigma}\sqrt{D}\sum_{l=0}^{L-2}\gamma^{l}S_{i}^{l}\Delta_{x}^{L-l}+B_{in}(T)\gamma^{L-1}S_{i}^{L-1}\Delta_{x}^{1}\right.
+BσDl=0L1γlSi1lΔhLl)+ρfΔbL+1+ρfBσDΔxL+1.\displaystyle\quad\quad\quad\left.+B_{\sigma}\sqrt{D}\sum_{l=0}^{L-1}\gamma^{l}S_{i-1}^{l}\Delta_{h}^{L-l}\right)+\rho_{f}\Delta_{b}^{L+1}+\rho_{f}B_{\sigma}\sqrt{D}\Delta_{x}^{L+1}.

8.4 Proof of Lemma 5

From Lemma 4, for any \lambda_{\theta_{1}},\lambda_{\theta_{2}}\in\mathcal{F}, we have

dN(λθ1,λθ2)\displaystyle\quad d_{N}(\lambda_{\theta_{1}},\lambda_{\theta_{2}})
ρfγ(l=0L1γlSNlΔbLl+BσDl=0L2γlSNlΔxLl+Bin(T)γL1SNL1Δx1+BσDl=0L1γlSN1lΔhLl)\displaystyle\leq\rho_{f}\gamma\left(\sum_{l=0}^{L-1}\gamma^{l}S_{N}^{l}\Delta_{b}^{L-l}+B_{\sigma}\sqrt{D}\sum_{l=0}^{L-2}\gamma^{l}S_{N}^{l}\Delta_{x}^{L-l}+B_{in}(T)\gamma^{L-1}S_{N}^{L-1}\Delta_{x}^{1}+B_{\sigma}\sqrt{D}\sum_{l=0}^{L-1}\gamma^{l}S_{N-1}^{l}\Delta_{h}^{L-l}\right)
+ρfΔbL+1+ρfBσDΔxL+1\displaystyle\quad+\rho_{f}\Delta_{b}^{L+1}+\rho_{f}B_{\sigma}\sqrt{D}\Delta_{x}^{L+1}
ρf(BσDBin(T)1)(γL1)SNL1Δθ\displaystyle\leq\rho_{f}\left(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1\right)\left(\gamma^{L}\vee 1\right)S_{N}^{L-1}\Delta_{\theta}
ρf(BσDBin(T)1)(γL1)(N+1)L1βN+11β1Δθ,\displaystyle\leq\rho_{f}\left(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1\right)\left(\gamma^{L}\vee 1\right)(N+1)^{L-1}\frac{\beta^{N+1}-1}{\beta-1}\Delta_{\theta},

where Δθl=0L+1(Δbl+Δxl+Δhl)\Delta_{\theta}\triangleq\sum_{l=0}^{L+1}\left(\Delta_{b}^{l}+\Delta_{x}^{l}+\Delta_{h}^{l}\right).

Define C(N)\triangleq\rho_{f}\left(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1\right)\left(\gamma^{L}\vee 1\right)(N+1)^{L-1}(\beta^{N+1}-1)/(\beta-1). Using Lemma 11 and the fact that \|\cdot\|_{2}\leq\|\cdot\|_{F}, we can get

𝒩(Θ,ϵ,dNλ(,))\displaystyle\mathcal{N}\left(\mathcal{F}_{\Theta}^{\mathcal{B}},\epsilon,d_{N}^{\lambda}(\cdot,\cdot)\right) l=1L+1𝒩(Wx(l),ϵC(N)(3L+2),F)l=1L𝒩(Wh(l),ϵC(N)(3L+2),F)\displaystyle\leq\prod_{l=1}^{L+1}\mathcal{N}\left(W_{x}^{(l)},\frac{\epsilon}{C(N)(3L+2)},\|\cdot\|_{F}\right)~{}\prod_{l=1}^{L}\mathcal{N}\left(W_{h}^{(l)},\frac{\epsilon}{C(N)(3L+2)},\|\cdot\|_{F}\right)
l=1L+1𝒩(b(l),ϵC(N)(3L+2),2)\displaystyle\quad~{}\prod_{l=1}^{L+1}\mathcal{N}\left(b^{(l)},\frac{\epsilon}{C(N)(3L+2)},\|\cdot\|_{2}\right)
(1+C(N)(3L+2)BmDϵ)D2(3L+2),\displaystyle\leq\left(1+\frac{C(N)(3L+2)B_{m}\sqrt{D}}{\epsilon}\right)^{D^{2}(3L+2)}~{},

where Bm=max{Bb,Bh,Bx}B_{m}=\max\{B_{b},B_{h},B_{x}\}.
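
As a rough illustration of how the resulting metric-entropy bound scales, the short Python sketch below evaluates C(N) and the logarithm of the covering-number bound for some hypothetical constants; all numerical values are placeholders and the expressions simply transcribe the bounds above.

import numpy as np

# Numerical illustration of the covering-number bound (all constants are placeholders).
rho_f, B_sigma, B_in, B_m = 1.0, 1.0, 2.0, 1.0
gamma, beta = 1.2, 1.1            # gamma = rho_sigma * B_x, beta = rho_sigma * B_h
L, D = 2, 16

def C(N):
    return rho_f * max(B_sigma * np.sqrt(D), B_in, 1.0) * max(gamma**L, 1.0) \
           * (N + 1)**(L - 1) * (beta**(N + 1) - 1) / (beta - 1)

def log_covering_bound(N, eps):
    return D**2 * (3 * L + 2) * np.log(1 + C(N) * (3 * L + 2) * B_m * np.sqrt(D) / eps)

for N in (5, 10, 20):
    # The log covering number grows roughly like D^2*L*(N*log(beta) + (L-1)*log(N)) for small eps.
    print(N, round(log_covering_bound(N, eps=0.1), 1))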

8.5 Proof of Theorem 3

Lemma 6.

Under assumptions (B1)-(B3), for fixed ss\in\mathbb{N}, with probability at least 1δ1-\delta, we have

supθΘ|Xθ(s)|\displaystyle\sup_{\theta\in\Theta}|X_{\theta}(s)| 48n(T+1lf)(s+1){4uf(log(2δ)+D(3L+2)log(1+M(s)))+D3L+2}.\displaystyle\leq\frac{48}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s+1)\left\{4u_{f}\left(\sqrt{\log\left(\frac{2}{\delta}\right)}+D\sqrt{(3L+2)\log\left(1+M(s)\right)}~{}\right)+D\sqrt{3L+2}\right\}.

Hence

supθBm|Xθ(s)|O~(D2L2s3n),\displaystyle\sup_{\|\theta\|\leq B_{m}}|X_{\theta}(s)|\leq\tilde{O}\left(\sqrt{\frac{{D^{2}L^{2}s^{3}}}{n}}\right),

where M(s)=ρfBmD(BσDBin(T)1)(γL1)(s+1)L1(βs+11)/(β1)M(s)=\rho_{f}B_{m}\sqrt{D}\left(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1\right)\left(\gamma^{L}\vee 1\right)(s+1)^{L-1}(\beta^{s+1}-1)/(\beta-1), Bm=max{Bb,Bh,Bx}B_{m}=\max\{B_{b},B_{h},B_{x}\}, γ=ρσBx\gamma=\rho_{\sigma}B_{x}, and β=ρσBh\beta=\rho_{\sigma}B_{h}.

Proof of Lemma 6.

For 1kn1\leq k\leq n, denote Xθ,k(s)=𝔼[loss(λθ,Stest)𝟙{Nes}]loss(λθ,Sk)𝟙{Neks}X_{\theta,k}(s)=\mathbb{E}\left[\text{loss}(\lambda_{\theta},S_{test})\mathbbm{1}_{\{N_{e}\leq s\}}\right]-\text{loss}(\lambda_{\theta},S_{k})\mathbbm{1}_{\{N_{ek}\leq s\}}. Then Xθ(s)=n1k=1nXθ,k(s)X_{\theta}(s)=n^{-1}\sum_{k=1}^{n}X_{\theta,k}(s). For two parameters θ1\theta_{1} and θ2\theta_{2}, we have

|loss(λθ1,Sk)𝟙{Neks}loss(λθ2,Sk)𝟙{Neks}|\displaystyle~{}~{}~{}~{}\left|\text{loss}(\lambda_{\theta_{1}},S_{k})\mathbbm{1}_{\{N_{ek}\leq s\}}-\text{loss}(\lambda_{\theta_{2}},S_{k})\mathbbm{1}_{\{N_{ek}\leq s\}}\right|
|i=1Nk(logλθ1(ti)logλθ2(ti))|+|0T(λθ1(t)λθ2(t))dt|\displaystyle\leq\Big{|}\sum_{i=1}^{N_{k}}(\log\lambda_{\theta_{1}}(t_{i})-\log\lambda_{\theta_{2}}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\lambda_{\theta_{1}}(t)-\lambda_{\theta_{2}}(t)\right)\mathrm{dt}\Big{|}
1lfi=1Nk|λθ1(ti)λθ2(ti))|+0T|λθ1(t)λθ2(t)|dt\displaystyle\leq\frac{1}{l_{f}}\sum_{i=1}^{N_{k}}|\lambda_{\theta_{1}}(t_{i})-\lambda_{\theta_{2}}(t_{i}))|+\int_{0}^{T}\left|\lambda_{\theta_{1}}(t)-\lambda_{\theta_{2}}(t)\right|\mathrm{dt}
(T+Nklf)dNk(λθ1,λθ2)\displaystyle\leq\left(T+\frac{N_{k}}{l_{f}}\right)d_{N_{k}}(\lambda_{\theta_{1}},\lambda_{\theta_{2}})
(T+1lf)(s+1)ds(λθ1,λθ2),\displaystyle\leq\left(T+\frac{1}{l_{f}}\right)(s+1)d_{s}(\lambda_{\theta_{1}},\lambda_{\theta_{2}}),

and similarly,

|𝔼[loss(λθ1,Stest)𝟙{Nes}]𝔼[loss(λθ2,Stest)𝟙{Nes}]|(T+1lf)(s+1)ds(λθ1,λθ2).\displaystyle\left|\mathbb{E}\left[\text{loss}(\lambda_{\theta_{1}},S_{test})\mathbbm{1}_{\{N_{e}\leq s\}}\right]-\mathbb{E}\left[\text{loss}(\lambda_{\theta_{2}},S_{test})\mathbbm{1}_{\{N_{e}\leq s\}}\right]\right|\leq\left(T+\frac{1}{l_{f}}\right)(s+1)d_{s}(\lambda_{\theta_{1}},\lambda_{\theta_{2}}).

Hence

|Xθ1,k(s)Xθ2,k(s)|2(T+1lf)(s+1)ds(λθ1,λθ2).\displaystyle\left|X_{\theta_{1},k}(s)-X_{\theta_{2},k}(s)\right|\leq 2\left(T+\frac{1}{l_{f}}\right)(s+1)d_{s}(\lambda_{\theta_{1}},\lambda_{\theta_{2}}).

By the property of bounded random variables, X_{\theta_{1},k}(s)-X_{\theta_{2},k}(s) is 2\left(T+1/l_{f}\right)(s+1)d_{s}(\lambda_{\theta_{1}},\lambda_{\theta_{2}})-sub-Gaussian. Since \{X_{\theta_{1},k}(s)-X_{\theta_{2},k}(s)\}_{k=1}^{n} are mutually independent, X_{\theta_{1}}(s)-X_{\theta_{2}}(s) is 2\left(T+1/l_{f}\right)(s+1)d_{s}(\lambda_{\theta_{1}},\lambda_{\theta_{2}})/\sqrt{n}-sub-Gaussian. From assumptions (B2) and (B3), there exists \theta_{0} with \|\theta_{0}\|\leq B_{m} such that \lambda_{\theta_{0}}\equiv 1, implying X_{\theta_{0}}(s)=0.

The diameter of \mathcal{F} under the distance ds(,)d_{s}(\cdot,\cdot) can be bounded by

diam(|ds)\displaystyle\text{diam}\left(\mathcal{F}|d_{s}\right) supθ1,θ2Θdsλ(λθ,λθ0)supθ1,θ2Θsup#Ssλθ1(t;S)λθ2(t;S)L\displaystyle\leq\sup_{\theta_{1},\theta_{2}\in\Theta}d^{\lambda}_{s}(\lambda_{\theta},\lambda_{\theta_{0}})\leq\sup_{\theta_{1},\theta_{2}\in\Theta}\sup_{\#S\leq s}\|\lambda_{\theta_{1}}(t;S)-\lambda_{\theta_{2}}(t;S)\|_{L^{\infty}}
2uf.\displaystyle\leq 2u_{f}~{}. (32)

By Lemma 5, we get

log𝒩(,ϵ,ds(,))D2(3L+2)log(1+C(s)(3L+2)BmDϵ),\displaystyle\log\mathcal{N}\left(\mathcal{F},\epsilon,d_{s}(\cdot,\cdot)\right)\leq D^{2}(3L+2)\log\left(1+\frac{C(s)(3L+2)B_{m}\sqrt{D}}{\epsilon}\right)~{}, (33)

where C(s)=ρf(BσDBin(T)1)(γL1)(s+1)L1(βs+11)/(β1)C(s)=\rho_{f}\left(B_{\sigma}\sqrt{D}\vee B_{in}(T)\vee 1\right)\left(\gamma^{L}\vee 1\right)(s+1)^{L-1}(\beta^{s+1}-1)/(\beta-1), Bm=max{Bb,Bh,Bx}B_{m}=\max\{B_{b},B_{h},B_{x}\}. Denote M(s)=C(s)(3L+2)BmDM(s)=C(s)(3L+2)B_{m}\sqrt{D}, 𝒟=diam(Θ|dsλ)\mathcal{D}=\text{diam}\left(\mathcal{F}_{\Theta}|d^{\lambda}_{s}\right). We have

02𝒟log(1+M(s)ϵ)dϵ\displaystyle\int_{0}^{2\mathcal{D}}\sqrt{\log\left(1+\frac{M(s)}{\epsilon}\right)}\mathrm{d}\epsilon (0a+a2𝒟)log(1+M(s)ϵ)dϵ(0a2𝒟)\displaystyle\leq\left(\int_{0}^{a}+\int_{a}^{2\mathcal{D}}\right)\sqrt{\log\left(1+\frac{M(s)}{\epsilon}\right)}\mathrm{d}\epsilon~{}~{}(\forall 0\leq a\leq 2\mathcal{D})
inf0a2𝒟{0aM(s)ϵdϵ+a2𝒟log(1+M(s)ϵ)dϵ}\displaystyle\leq\inf_{0\leq a\leq 2\mathcal{D}}\left\{\int_{0}^{a}\sqrt{\frac{M(s)}{\epsilon}}\mathrm{d}\epsilon+\int_{a}^{2\mathcal{D}}\sqrt{\log\left(1+\frac{M(s)}{\epsilon}\right)}\mathrm{d}\epsilon\right\}
inf0a2𝒟{2M(s)a+2𝒟log(1+M(s)a)}\displaystyle\leq\inf_{0\leq a\leq 2\mathcal{D}}\left\{2\sqrt{M(s)a}+2\mathcal{D}\sqrt{\log\left(1+\frac{M(s)}{a}\right)}\right\}
2+2𝒟log(1+M(s)2)(takea=M(s)1)\displaystyle\leq 2+2\mathcal{D}\sqrt{\log\left(1+{M(s)}^{2}\right)}~{}~{}(\text{take}~{}a={M(s)}^{-1})
2+4𝒟log(1+M(s)),\displaystyle\leq 2+4\mathcal{D}\sqrt{\log\left(1+M(s)\right)}, (34)

where we need 2𝒟M(s)12\mathcal{D}M(s)\geq 1. If 2𝒟M(s)<12\mathcal{D}M(s)<1, (34) is obvious since the integral is less than 22.

Combining (32), (33), (34) and using Lemma 12, we have

supθΘ|Xθ(s)|\displaystyle\sup_{\theta\in\Theta}|X_{\theta}(s)| 24n(T+1lf)(s+1)(𝒟(4log(2δ)+4D(3L+2)log(1+M(s)))+2D3L+2)\displaystyle\leq\frac{24}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s+1)\left(\mathcal{D}\left(4\sqrt{\log\left(\frac{2}{\delta}\right)}+4D\sqrt{(3L+2)\log(1+M(s))}\right)+2D\sqrt{3L+2}\right)
48n(T+1lf)(s+1){4uf(log(2δ)+D(3L+2)log(1+M(s)))+D3L+2}.\displaystyle\leq\frac{48}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s+1)\left\{4u_{f}\left(\sqrt{\log\left(\frac{2}{\delta}\right)}+D\sqrt{(3L+2)\log\left(1+M(s)\right)}~{}\right)+D\sqrt{3L+2}\right\}~{}.

Lemma 7.

Suppose the event number NeN_{e} satisfies the tail condition

(Nes)aNexp(cNs),s.\mathbb{P}(N_{e}\geq s)\leq a_{N}\exp(-c_{N}s),~{}s\in\mathbb{N}.

Under assumptions (B1)-(B3), for fixed ss\in\mathbb{N}, we have

supθΘ|Eθ(s)|(T+1lf)(uf+2)aN(s+2)(1exp(cN))2exp(cN(s+1)).\displaystyle\sup_{\theta\in\Theta}|E_{\theta}(s)|\leq\left(T+\frac{1}{l_{f}}\right)(u_{f}+2)\frac{a_{N}(s+2)}{(1-\exp(-c_{N}))^{2}}\exp(-c_{N}(s+1))~{}.
Proof of Lemma 7.

From assumptions (B2) and (B3), there exists θ0Θ\theta_{0}\in\Theta such that λθ01\lambda_{\theta_{0}}\equiv 1. Then

|Eθ(s)|\displaystyle|E_{\theta}(s)| =|𝔼[loss(λθ,Stest)𝟙{Ne>s}]|𝔼|loss(λθ,Stest)|𝟙{Ne>s}\displaystyle=\left|\mathbb{E}\left[\text{loss}(\lambda_{\theta},S_{test})\mathbbm{1}_{\{N_{e}>s\}}\right]\right|\leq\mathbb{E}\left|\text{loss}(\lambda_{\theta},S_{test})\right|\mathbbm{1}_{\{N_{e}>s\}}
𝔼|loss(λθ,Stest)loss(λθ0,Stest)|𝟙{Ne>s}+𝔼|loss(λθ0,Stest)|𝟙{Ne>s}\displaystyle\leq\mathbb{E}\left|\text{loss}(\lambda_{\theta},S_{test})-\text{loss}(\lambda_{\theta_{0}},S_{test})\right|\mathbbm{1}_{\{N_{e}>s\}}+\mathbb{E}\left|\text{loss}(\lambda_{\theta_{0}},S_{test})\right|\mathbbm{1}_{\{N_{e}>s\}}
𝔼[(T+1lf)(Ne+1)dNe(λθ,λθ0)]𝟙{Ne>s}+T(Ne>s)\displaystyle\leq\mathbb{E}\left[\left(T+\frac{1}{l_{f}}\right)(N_{e}+1)d_{N_{e}}(\lambda_{\theta},\lambda_{\theta_{0}})\right]\mathbbm{1}_{\{N_{e}>s\}}+T\mathbb{P}(N_{e}>s)
(T+1lf)(uf+1)𝔼[(Ne+1)𝟙{Ne>s}]+T(Ne>s)\displaystyle\leq\left(T+\frac{1}{l_{f}}\right)(u_{f}+1)\mathbb{E}[(N_{e}+1)\mathbbm{1}_{\{N_{e}>s\}}]+T\mathbb{P}(N_{e}>s)

By the tail condition (Nes)aNexp(cNs),s\mathbb{P}(N_{e}\geq s)\leq a_{N}\exp(-c_{N}s),~{}s\in\mathbb{N}, we have

|Eθ(s)|\displaystyle|E_{\theta}(s)| (T+1lf)(uf+1)aN(s+1)(1exp(cN))2exp(cN(s+1))\displaystyle\leq\left(T+\frac{1}{l_{f}}\right)(u_{f}+1)\frac{a_{N}(s+1)}{(1-\exp(-c_{N}))^{2}}\exp(-c_{N}(s+1))
+(T+1lf)(uf+2)aNexp(cN(s+1))\displaystyle\quad+\left(T+\frac{1}{l_{f}}\right)(u_{f}+2)a_{N}\exp(-c_{N}(s+1))
(T+1lf)(uf+2)aN(s+2)(1exp(cN))2exp(cN(s+1)).\displaystyle\leq\left(T+\frac{1}{l_{f}}\right)(u_{f}+2)\frac{a_{N}(s+2)}{(1-\exp(-c_{N}))^{2}}\exp(-c_{N}(s+1)).

Now we prove Theorem 3. From Lemma 2, we have

(supθΘ|Xθ|>t)(supθΘ|Xθ(s)|+supθΘ|Eθ(s)|>t)+(Ne(n)>s).\displaystyle\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t\right)\leq\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s)|+\sup_{\theta\in\Theta}|E_{\theta}(s)|>t\right)+\mathbb{P}(N_{e(n)}>s).

Since

(Ne(n)>s)n(Ne>s)naNexp(cNs),\displaystyle\mathbb{P}(N_{e(n)}>s)\leq n\mathbb{P}(N_{e}>s)\leq na_{N}\exp(-c_{N}s),

we can take s_{0}=\lceil\left(\log\left(2a_{N}n/\delta\right)-1\right)/c_{N}\rceil such that na_{N}\exp(-c_{N}s_{0})\leq\delta/2, so we only need to find t>0 such that

(supθΘ|Xθ(s0)|+supθΘ|Eθ(s0)|>t)δ2.\displaystyle\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s_{0})|+\sup_{\theta\in\Theta}|E_{\theta}(s_{0})|>t\right)\leq\frac{\delta}{2}~{}.

From Lemma 7, we have

supθΘ|Eθ(s0)|\displaystyle\sup_{\theta\in\Theta}|E_{\theta}(s_{0})| (T+1lf)(uf+2)aN(s0+2)(1exp(cN))2exp(cN(s0+1)):=B(s0).\displaystyle\leq\left(T+\frac{1}{l_{f}}\right)(u_{f}+2)\frac{a_{N}(s_{0}+2)}{(1-\exp(-c_{N}))^{2}}\exp(-c_{N}(s_{0}+1)):=B(s_{0}).

By the definition of s0s_{0}, B(s0)(T+1/lf)(uf+2)(s0+2)δ/[2n(1exp(cN))2]B(s_{0})\leq\left(T+1/l_{f}\right)(u_{f}+2)(s_{0}+2)\delta/[2n(1-\exp(-c_{N}))^{2}]. Thus we only need to solve t>0t>0 such that

(supθΘ|Xθ(s0)|+supθΘ|Eθ(s0)|>t)(supθΘ|Xθ(s0)|>tB(s0))δ2.\displaystyle\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s_{0})|+\sup_{\theta\in\Theta}|E_{\theta}(s_{0})|>t\right)\leq\mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}(s_{0})|>t-B(s_{0})\right)\leq\frac{\delta}{2}~{}.

From Lemma 6, we can choose

t0\displaystyle t_{0} =48n(T+1lf)(s0+1){4uf(log(4δ)+D(3L+2)log(1+M(s0)))+D3L+2}+B(s0)\displaystyle=\frac{48}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s_{0}+1)\left\{4u_{f}\left(\sqrt{\log\left(\frac{4}{\delta}\right)}+D\sqrt{(3L+2)\log\left(1+M(s_{0})\right)}\right)+D\sqrt{3L+2}\right\}+B(s_{0})
48n(T+1lf)(s0+1){4uf(log(4δ)+D(3L+2)log(1+M(s0)))+D3L+2}\displaystyle\leq\frac{48}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s_{0}+1)\left\{4u_{f}\left(\sqrt{\log\left(\frac{4}{\delta}\right)}+D\sqrt{(3L+2)\log\left(1+M(s_{0})\right)}~{}\right)+D\sqrt{3L+2}\right\}
+(T+1lf)(uf+2)s0+2(1exp(cN))2δ2n\displaystyle\quad+\left(T+\frac{1}{l_{f}}\right)(u_{f}+2)\frac{s_{0}+2}{(1-\exp(-c_{N}))^{2}}\frac{\delta}{2n}
192n(T+1lf)(s0+1)uf(log(4δ)+D(3L+2)(log(1+M(s0))+1)+1(1exp(cN))2).\displaystyle\leq\frac{192}{\sqrt{n}}\left(T+\frac{1}{l_{f}}\right)(s_{0}+1)u_{f}\left(\sqrt{\log\left(\frac{4}{\delta}\right)}+D\sqrt{(3L+2)}(\sqrt{\log\left(1+M(s_{0})\right)}+1)+\frac{1}{(1-\exp(-c_{N}))^{2}}~{}\right)~{}.

such that \mathbb{P}\left(\sup_{\theta\in\Theta}|X_{\theta}|>t_{0}\right)\leq\delta. Hence the theorem is proved.

9 Proofs in Sections 5 and 6

9.1 Proof of Theorem 4

For λ(t)=λ0(t)Ws,([0,T],B0)\lambda^{\ast}(t)=\lambda_{0}(t)\in W^{s,\infty}([0,T],B_{0}), δ=1/2\delta=1/2, and N5N\geq 5, by Lemma 13, there exists a two-layer NN f^N\hat{f}^{N} such that

|f^N(x)λ(Tx)|3𝒞B0Ts2Ns,0x1,\displaystyle\left|\hat{f}^{N}(x)-\lambda^{\ast}(Tx)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}},~{}0\leq x\leq 1,

where 𝒞=2s5s/(s1)!\mathcal{C}=\sqrt{2s}5^{s}/(s-1)!.

Then we have

|f^N(tT)λ(t)|3𝒞B0Ts2Ns,0tT.\displaystyle\left|\hat{f}^{N}(\frac{t}{T})-\lambda^{\ast}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}},~{}0\leq t\leq T.

Since B1λ(t)B0B_{1}\leq\lambda^{*}(t)\leq B_{0}, taking lf=B1l_{f}=B_{1}, uf=B0u_{f}=B_{0} and λ^N(t)=f(f^N(t/T))\hat{\lambda}^{N}(t)=f(\hat{f}^{N}(t/T)), we have

\left|\hat{\lambda}^{N}(t)-\lambda^{\ast}(t)\right|\leq\left|\hat{f}^{N}\Big{(}\frac{t}{T}\Big{)}-\lambda^{\ast}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}},~{}\forall 0\leq t\leq T.

Then

|𝔼[loss(λ^N,Stest)]𝔼[loss(λ,Stest)]|\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]| 𝔼|loss(λ^N,Stest)loss(λ,Stest)|\displaystyle\leq\mathbb{E}\left|\text{loss}(\hat{\lambda}^{N},S_{test})-\text{loss}(\lambda^{\ast},S_{test})\right|
𝔼(|i=1Ne(logλ~N(ti)logλ(ti))|+|0T(λ~N(t)λ(t))dt|)\displaystyle\leq\mathbb{E}\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\tilde{\lambda}^{N}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\tilde{\lambda}^{N}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)
𝔼(T+NeB1)λ~NλL[0,T]\displaystyle\leq\mathbb{E}\left(T+\frac{N_{e}}{B_{1}}\right)\left\|\tilde{\lambda}^{N}-\lambda^{\ast}\right\|_{L^{\infty}[0,T]}
(T+1B1)3𝒞B0Ts2Ns𝔼(Ne+1).\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}}\mathbb{E}(N_{e}+1). (35)

Since λ0B0\lambda_{0}\leq B_{0}, μ0\mu\equiv 0, taking c=B0c=B_{0}, c0=0c_{0}=0, and η=e\eta=e in Lemma 2 , we have

(Nes)2B0Texp(eB0Ts2).\displaystyle\mathbb{P}(N_{e}\geq s)\leq 2\sqrt{B_{0}T}\exp\left(\frac{eB_{0}T-s}{2}\right).

Thus

𝔼(Ne+1)1+s=1(Nes)1+2B0T1exp(1/2)exp(eB0T12)5B0T+1exp(3B0T2).\displaystyle\mathbb{E}(N_{e}+1)\leq 1+\sum_{s=1}^{\infty}\mathbb{P}(N_{e}\geq s)\leq 1+\frac{2\sqrt{B_{0}T}}{1-\exp(-1/2)}\exp\left(\frac{eB_{0}T-1}{2}\right)\leq{5\sqrt{B_{0}T+1}}\exp\left(\frac{3B_{0}T}{2}\right). (36)

Combining (35) and (36), we get

|𝔼[loss(λ^N,Stest)]𝔼[loss(λ,Stest)]|\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]| 5B0T+1exp(3B0T2)(T+1B1)3𝒞B0Ts2Ns\displaystyle\leq{5\sqrt{B_{0}T+1}}\exp\left(\frac{3B_{0}T}{2}\right)(T+\frac{1}{B_{1}})\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}}
15exp(2B0T)(T+1B1)𝒞B0TsNs,\displaystyle\leq 15\exp\left({2B_{0}T}\right)(T+\frac{1}{B_{1}})\frac{\mathcal{C}B_{0}T^{s}}{N^{s}},

where 𝒞=2s5s/(s1)!\mathcal{C}=\sqrt{2s}5^{s}/(s-1)! .

\hat{\lambda}^{N} can be naturally viewed as an RNN by taking W_{h}^{l}=0, l=1,2. The width and weight bounds can be obtained directly from Lemma 13 and Remark 14.

9.2 Proof of Theorem 5

The proof is divided into several steps. Let S=\{t_{i}\}_{i=1}^{N_{e}}. Here we agree on t_{0}=0 and t_{N_{e}+1}=T. To be concise, we denote S(t)=\sum_{t_{i}<t}\exp(-\beta(t-t_{i}))+1 and S_{i}=\sum_{0<j<i}\exp(-\beta(t_{i}-t_{j}))+1, i\in\mathbb{N}_{+}, with S_{0}=0 by default. Hence \lambda^{\ast}(t)=\lambda_{0}(t)+\alpha(S(t)-1), S_{i+1}=S_{i}\exp(-\beta(t_{i+1}-t_{i}))+1, and S(t)=S_{i}\exp(-\beta(t-t_{i}))+1 for t\in(t_{i},t_{i+1}].
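
For completeness, the first recursion can be checked in one line (a direct verification, stated here only as a reading aid):

S_{i+1}=\sum_{0<j\leq i}\exp(-\beta(t_{i+1}-t_{j}))+1=\exp(-\beta(t_{i+1}-t_{i}))\Big{(}\sum_{0<j<i}\exp(-\beta(t_{i}-t_{j}))+1\Big{)}+1=S_{i}\exp(-\beta(t_{i+1}-t_{i}))+1,

and the identity S(t)=S_{i}\exp(-\beta(t-t_{i}))+1 on (t_{i},t_{i+1}] follows in the same way.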

We first fix s0+s_{0}\in\mathbb{N}_{+}.

Step 1. Construct the approximation of g(x,y)=x\exp(-\beta y)+1, where g\in C^{\infty}\left([-(s_{0}+1),2(s_{0}+1)]\times[0,T]\right).

Let g~(x,y)=g((3x1)(s0+1),Ty)\tilde{g}(x,y)=g((3x-1)(s_{0}+1),Ty), then g~C([0,1]2)\tilde{g}\in C^{\infty}\left([0,1]^{2}\right). By simple computation, we have

g~Wk,([0,1]2)3(s0+1)(βT1)k.\displaystyle\|\tilde{g}\|_{W^{k,\infty}\left([0,1]^{2}\right)}\leq 3(s_{0}+1)(\beta T\vee 1)^{k}.

Applying Lemma 14 to g~/[3(s0+1)]\tilde{g}/[3(s_{0}+1)], for any 𝒩+\mathcal{N}\in\mathbb{N}_{+}, there exists a tanh neural network g~𝒩\tilde{g}^{\mathcal{N}} with only one hidden layer and width 3𝒩+10(βT1)2(𝒩+10(βT1)+22)3\lceil\frac{\mathcal{N}+10(\beta T\vee 1)}{2}\rceil\binom{\mathcal{N}+10(\beta T\vee 1)+2}{2} such that

|g~(x,y)g~𝒩(x,y)|3(s0+1)exp(𝒩),(x,y)[0,1]2.\displaystyle\left|\tilde{g}(x,y)-\tilde{g}^{\mathcal{N}}(x,y)\right|\leq 3(s_{0}+1)\exp(-\mathcal{N}),~{}(x,y)\in[0,1]^{2}.

By coordinate transformation, we get

|g(x,y)g~𝒩(13(s0+1)x+13,1Ty)|3(s0+1)exp(𝒩),(x,y)[(s0+1),2(s0+1)]×[0,T].\displaystyle\left|g(x,y)-\tilde{g}^{\mathcal{N}}(\frac{1}{3(s_{0}+1)}x+\frac{1}{3},\frac{1}{T}y)\right|\leq 3(s_{0}+1)\exp(-\mathcal{N}),~{}(x,y)\in[-(s_{0}+1),2(s_{0}+1)]\times[0,T].

Define g^𝒩(x,y)=g~𝒩(x/[3(s0+1)]+1/3,y/T)\hat{g}^{\mathcal{N}}(x,y)=\tilde{g}^{\mathcal{N}}(x/[3(s_{0}+1)]+1/3,y/T). Then

|g(x,y)g^𝒩(x,y)|3(s0+1)exp(𝒩),(x,y)[(s0+1),2(s0+1)]×[0,T].\displaystyle\left|g(x,y)-\hat{g}^{\mathcal{N}}(x,y)\right|\leq 3(s_{0}+1)\exp(-\mathcal{N}),~{}(x,y)\in[-(s_{0}+1),2(s_{0}+1)]\times[0,T].

From Lemma 14 and Remark 15, the weights of g^𝒩\hat{g}^{\mathcal{N}} are bounded by

O((s0+1)exp(𝒩2+𝒩3Cd𝒩2)(𝒩(𝒩+2))3𝒩(𝒩+2)),\displaystyle O\left((s_{0}+1)\exp(\frac{{\mathcal{N}^{\prime}}^{2}+\mathcal{N}^{\prime}-3Cd\mathcal{N}^{\prime}}{2})(\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2))^{3\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2)}\right), (37)

where 𝒩=𝒩+10(βT1)\mathcal{N}^{\prime}=\mathcal{N}+10(\beta T\vee 1). Taking 𝒩𝒩+log(3(s0+1))\mathcal{N}\leftarrow\mathcal{N}+\lceil\log(3(s_{0}+1))\rceil, we have

|g(x,y)g^𝒩(x,y)|exp(𝒩),(x,y)[(s0+1),2(s0+1)]×[0,T].\displaystyle\left|g(x,y)-\hat{g}^{\mathcal{N}}(x,y)\right|\leq\exp(-\mathcal{N}),~{}(x,y)\in[-(s_{0}+1),2(s_{0}+1)]\times[0,T]. (38)

In particular, \left|g(x,y)-\hat{g}^{\mathcal{N}}(x,y)\right|\leq 1. Since \hat{g}^{\mathcal{N}} is real-valued, after a minor modification (precisely, increasing the width by 1), we may assume that \hat{g}^{\mathcal{N}} has the following structure:

g^𝒩(x,y)=V1σ((WB)(xy)+b0).\displaystyle\hat{g}^{\mathcal{N}}(x,y)=V_{1}\sigma\left(\begin{pmatrix}W&B\end{pmatrix}\begin{pmatrix}x\\ y\end{pmatrix}+b_{0}\right).

Step 2. Construct the approximation of SiS_{i} and S(t)S(t) under the event {Nes0}\{N_{e}\leq s_{0}\}.

Let h_{0}=0 and \overline{S}_{0}=0. For 1\leq i\leq s_{0}, we construct h_{i}^{\mathcal{N}} and \overline{S}_{i}^{\mathcal{N}} recursively by

\left\{\begin{aligned} h_{i}^{\mathcal{N}}&=\sigma\left(\begin{pmatrix}W&B\end{pmatrix}\begin{pmatrix}V_{1}h_{i-1}^{\mathcal{N}}\\ t_{i}-t_{i-1}\end{pmatrix}+b_{0}\right),\\ \overline{S}_{i}^{\mathcal{N}}&=V_{1}h_{i}^{\mathcal{N}}.\end{aligned}\right.

Hence \overline{S}_{i}^{\mathcal{N}}=\hat{g}^{\mathcal{N}}(\overline{S}_{i-1}^{\mathcal{N}},t_{i}-t_{i-1}) for 1\leq i\leq s_{0}, where t_{0}=0.

Similarly, we can define S¯𝒩(t),t(ti1,ti]\overline{S}^{\mathcal{N}}(t),t\in(t_{i-1},t_{i}] by

\left\{\begin{aligned} h^{\mathcal{N}}(t)&=\sigma\left(\begin{pmatrix}W&B\end{pmatrix}\begin{pmatrix}V_{1}h_{i-1}^{\mathcal{N}}\\ t-t_{i-1}\end{pmatrix}+b_{0}\right),\\ \overline{S}^{\mathcal{N}}(t)&=V_{1}h^{\mathcal{N}}(t).\end{aligned}\right.

Hence S¯𝒩(t)=g^𝒩(S¯i1𝒩,tti1),t(ti1,ti]\overline{S}^{\mathcal{N}}(t)=\hat{g}^{\mathcal{N}}(\overline{S}_{i-1}^{\mathcal{N}},t-t_{i-1}),t\in(t_{i-1},t_{i}]. The approximation error can be bounded by

|S(t)-\overline{S}^{\mathcal{N}}(t)| =\left|g(S_{i-1},t-t_{i-1})-\hat{g}^{\mathcal{N}}(\overline{S}_{i-1}^{\mathcal{N}},t-t_{i-1})\right|
\leq\left|g(S_{i-1},t-t_{i-1})-g(\overline{S}_{i-1}^{\mathcal{N}},t-t_{i-1})\right|+\left|g(\overline{S}_{i-1}^{\mathcal{N}},t-t_{i-1})-\hat{g}^{\mathcal{N}}(\overline{S}_{i-1}^{\mathcal{N}},t-t_{i-1})\right|
\leq\left|S_{i-1}-\overline{S}_{i-1}^{\mathcal{N}}\right|+\left\|g-\hat{g}^{\mathcal{N}}\right\|_{\infty}~{}~{}(\text{since }|\partial_{x}g(x,y)|=\exp(-\beta y)\leq 1)
\leq\cdots
\leq i\left\|g-\hat{g}^{\mathcal{N}}\right\|_{\infty},~{}t\in(t_{i-1},t_{i}].

Under the event {Nes0}\{N_{e}\leq s_{0}\}, we have

|S(t)S¯𝒩(t)|(s0+1)gg^𝒩.\displaystyle\left|S(t)-\overline{S}^{\mathcal{N}}(t)\right|\leq(s_{0}+1)\left\|g-\hat{g}^{\mathcal{N}}\right\|_{\infty}. (39)
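
The telescoping argument above can also be mimicked numerically: if the one-step map g is replaced by any surrogate \hat{g} with a uniform error \varepsilon, the error after i recursive applications is at most i\varepsilon, because g is 1-Lipschitz in its first argument. A minimal Python sketch (with a synthetic perturbation standing in for the tanh network \hat{g}^{\mathcal{N}}) is given below.

import numpy as np

# Error accumulation under the recursion S_i = g(S_{i-1}, t_i - t_{i-1}) (illustrative values).
beta, eps = 1.5, 1e-3
gaps = np.array([0.3, 0.6, 0.5, 0.9, 0.4])        # hypothetical inter-event times t_i - t_{i-1}

def g(x, y):
    return x * np.exp(-beta * y) + 1.0

def g_hat(x, y):
    return g(x, y) + eps * np.sin(7 * x + 3 * y)  # surrogate with |g_hat - g| <= eps everywhere

S, S_hat = 0.0, 0.0
for i, dt in enumerate(gaps, start=1):
    S, S_hat = g(S, dt), g_hat(S_hat, dt)
    assert abs(S - S_hat) <= i * eps + 1e-12      # matches the bound |S_i - S_bar_i| <= i * ||g - g_hat||_inf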

Step 3. Construct the approximation of identity.

By Lemma 3.1 of De Ryck et al. (2021), for any h>0, there exists a one-layer tanh neural network \psi_{h} such that

|xψh(x)|(6M)4h2,x[M,M].\displaystyle|x-\psi_{h}(x)|\leq(6M)^{4}h^{2},~{}x\in[-M,M]. (40)

Actually, ψh\psi_{h} can be represented as

\psi_{h}(x)=\frac{1}{\sigma^{{}^{\prime}}(0)h}\left[\sigma\left(\frac{hx}{2}\right)-\sigma\left(-\frac{hx}{2}\right)\right]=\frac{2}{\sigma^{{}^{\prime}}(0)h}\sigma\left(\frac{hx}{2}\right).
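
For intuition on why \psi_{h} approximates the identity (this is simply a Taylor expansion and is consistent with the bound in (40)): since \sigma=\tanh is odd and smooth with \sigma^{\prime}(0)=1,

\psi_{h}(x)=\frac{2}{h}\left(\frac{hx}{2}-\frac{1}{3}\Big{(}\frac{hx}{2}\Big{)}^{3}+O(h^{5}x^{5})\right)=x-\frac{h^{2}x^{3}}{12}+O(h^{4}x^{5}),

so \sup_{|x|\leq M}|x-\psi_{h}(x)| is of order M^{3}h^{2} for small h, which is consistent with the (6M)^{4}h^{2} bound stated in (40).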

Step 4. Construct the approximation of λ(t)\lambda^{\ast}(t) under the event {Nes0}\{N_{e}\leq s_{0}\}.

Since λ0Ws,([0,T],B0)\lambda_{0}\in W^{s,\infty}([0,T],B_{0}), from the proof of Theorem 4 , there exists a two-layer tanh neural network λ¯0N\overline{\lambda}_{0}^{N} with width less than 3s/2+6N3\lceil s/2\rceil+6N such that

|λ¯0N(t)λ0(t)|3𝒞B0Ts2Ns,t[0,T].\displaystyle\left|\overline{\lambda}_{0}^{N}(t)-\lambda_{0}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}},~{}t\in[0,T]. (41)

Moreover, the weights of λ¯0N\overline{\lambda}_{0}^{N} can be bounded by

O([2s5s(s1)!B0Ts]s/2N(1+s2)/2(s(s+2))3s(s+2)).\displaystyle O\left(\left[\frac{\sqrt{2s}5^{s}}{(s-1)!}B_{0}T^{s}\right]^{-s/2}N^{(1+s^{2})/2}(s(s+2))^{3s(s+2)}\right).

Here we assume λ¯0N(t)\overline{\lambda}_{0}^{N}(t) has the following structure:

λ¯0N(t)=V2σ(V1σ(Bt+b0)+b1)+b2.\displaystyle\overline{\lambda}_{0}^{N}(t)=V_{2}^{{}^{\prime}}\sigma\left(V_{1}^{{}^{\prime}}\sigma\left(B^{{}^{\prime}}t+b_{0}^{{}^{\prime}}\right)+b_{1}^{{}^{\prime}}\right)+b_{2}^{{}^{\prime}}.

Since λ(t)=λ0(t)+α(S(t)1)\lambda^{\ast}(t)=\lambda_{0}(t)+\alpha(S(t)-1), we can construct its approximation by

hi(1)\displaystyle h_{i}^{(1)} =\displaystyle= σ((WV1000)hi1+(B00B)(titi1ti)+(b0b0)),1is0,\displaystyle\sigma\left(\begin{pmatrix}WV_{1}&0\\ 0&0\end{pmatrix}h_{i-1}+\begin{pmatrix}B&0\\ 0&B^{{}^{\prime}}\end{pmatrix}\begin{pmatrix}t_{i}-t_{i-1}\\ t_{i}\end{pmatrix}+\begin{pmatrix}b_{0}\\ b_{0}^{{}^{\prime}}\end{pmatrix}\right),~{}1\leq i\leq s_{0},

and

h(1)(t;S)\displaystyle h^{(1)}(t;S) =\displaystyle= σ((WV1000)hi(1)+(B00B)(ttit)+(b0b0)),\displaystyle\sigma\left(\begin{pmatrix}WV_{1}&0\\ 0&0\end{pmatrix}h_{i}^{(1)}+\begin{pmatrix}B&0\\ 0&B^{{}^{\prime}}\end{pmatrix}\begin{pmatrix}t-t_{i}\\ t\end{pmatrix}+\begin{pmatrix}b_{0}\\ b_{0}^{{}^{\prime}}\end{pmatrix}\right),
h(2)(t;S)\displaystyle h^{(2)}(t;S) =\displaystyle= σ((h2V100V1)h(1)(t;S)+(0b1)),\displaystyle\sigma\left(\begin{pmatrix}\frac{h}{2}V_{1}&0\\ 0&V_{1}^{{}^{\prime}}\end{pmatrix}h^{(1)}(t;S)+\begin{pmatrix}0\\ b_{1}^{{}^{\prime}}\end{pmatrix}\right),
λ^(t;S)\displaystyle\hat{\lambda}(t;S) =\displaystyle= f((2ασ(0)hV2)h(2)(t;S)+(b2α))1,t(ti,ti+1].\displaystyle f\left(\begin{pmatrix}\frac{2\alpha}{\sigma^{{}^{\prime}}(0)h}&V_{2}^{{}^{\prime}}\end{pmatrix}h^{(2)}(t;S)+\left(b_{2}^{{}^{\prime}}-\alpha\right)\right)\in\mathbb{R}^{1},~{}t\in(t_{i},t_{i+1}]. (42)

Under the event {Nes0}\{N_{e}\leq s_{0}\}, we have B1λ(t)B0+αs0B_{1}\leq\lambda^{\ast}(t)\leq B_{0}+\alpha s_{0}. Recall that f(x)=min{max{x,lf},uf}f(x)=\min\{\max\{x,l_{f}\},u_{f}\}. Here we can take lf=B1l_{f}=B_{1}, uf=B0+αs0u_{f}=B_{0}+\alpha s_{0}.

Step 5. Estimate the approximation error under the event {Nes0}\{N_{e}\leq s_{0}\}.

We rewrite (42) as λ^(t;S)=f(λ¯(t;S))\hat{\lambda}(t;S)=f(\overline{\lambda}(t;S)). Under the event {Nes0}\{N_{e}\leq s_{0}\} and the construction of ff, we have

λλ^Lλλ¯L.\displaystyle\|\lambda^{\ast}-\hat{\lambda}\|_{L^{\infty}}\leq\|\lambda^{\ast}-\overline{\lambda}\|_{L^{\infty}}~{}.

From the construction of λ¯\overline{\lambda}, we get

λ¯(t)=λ¯0N(t)+αψh(S¯𝒩(t))α,\displaystyle\overline{\lambda}(t)=\overline{\lambda}_{0}^{N}(t)+\alpha\psi_{h}(\overline{S}^{\mathcal{N}}(t))-\alpha~{}, (43)

then

|λ(t)λ¯(t)||λ0(t)λ¯0N(t)|+α|S(t)ψh(S¯𝒩(t))|.\displaystyle\left|\lambda^{\ast}(t)-\overline{\lambda}(t)\right|\leq\left|\lambda_{0}(t)-\overline{\lambda}_{0}^{N}(t)\right|+\alpha\left|S(t)-\psi_{h}(\overline{S}^{\mathcal{N}}(t))\right|~{}. (44)

From (38), (39), and (40), under the event {Nes0}\{N_{e}\leq s_{0}\}, we have

|S(t)ψh(S¯𝒩(t))|\displaystyle\left|S(t)-\psi_{h}(\overline{S}^{\mathcal{N}}(t))\right| |S(t)S¯𝒩(t)|+|S¯𝒩(t)ψh(S¯𝒩(t))|\displaystyle\leq\left|S(t)-\overline{S}^{\mathcal{N}}(t)\right|+\left|\overline{S}^{\mathcal{N}}(t)-\psi_{h}(\overline{S}^{\mathcal{N}}(t))\right|
(s0+1)gg^𝒩+(6M)4h2\displaystyle\leq(s_{0}+1)\left\|g-\hat{g}^{\mathcal{N}}\right\|_{\infty}+(6M)^{4}h^{2}
(s0+1)exp(𝒩)+(12(s0+1))4h2,t[0,T],\displaystyle\leq(s_{0}+1)\exp(-\mathcal{N})+(12(s_{0}+1))^{4}h^{2},~{}t\in[0,T],

where we take M=2(s0+1)M=2(s_{0}+1) to ensure that S¯𝒩(t)\overline{S}^{\mathcal{N}}(t) can be well approximated by ψh(S¯𝒩(t))\psi_{h}(\overline{S}^{\mathcal{N}}(t)). On the other hand, (41) shows that

|λ¯0N(t)λ0(t)|3𝒞B0Ts2Ns,t[0,T].\displaystyle\left|\overline{\lambda}_{0}^{N}(t)-\lambda_{0}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}},~{}t\in[0,T].

To trade off the two error terms in (44), let exp(𝒩)Ns\exp(-\mathcal{N})\asymp N^{-s}, and then we can take 𝒩=slog(N)\mathcal{N}=\lceil s\log(N)\rceil. Moreover, take 𝒩𝒩+log(s0+1)\mathcal{N}\leftarrow\mathcal{N}+\lceil\log(s_{0}+1)\rceil and h=(12(s0+1))2Ns/2h=(12(s_{0}+1))^{-2}N^{-s/2}. Hence, under {Nes0}\{N_{e}\leq s_{0}\} , we have

|λ(t)λ^(t)||λ(t)λ¯(t)|3𝒞B0Ts+42Ns,t[0,T].\displaystyle\left|\lambda^{\ast}(t)-\hat{\lambda}(t)\right|\leq\left|\lambda^{\ast}(t)-\overline{\lambda}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}+4}{2N^{s}},~{}t\in[0,T]. (45)

Step 6. Estimate the final approximation error.

Similar to (35), we have

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle\quad|\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|
𝔼|loss(λ^,Stest)loss(λ,Stest)|\displaystyle\leq\mathbb{E}\left|\text{loss}(\hat{\lambda},S_{test})-\text{loss}(\lambda^{\ast},S_{test})\right|
𝔼(|i=1Ne(logλ^(ti)logλ(ti))|+|0T(λ^(t)λ(t))dt|)\displaystyle\leq\mathbb{E}\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\hat{\lambda}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\hat{\lambda}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)
𝔼[(|i=1Ne(logλ^(ti)logλ(ti))|+|0T(λ^(t)λ(t))dt|)𝟙{Nes0}\displaystyle\leq\mathbb{E}\left[\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\hat{\lambda}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\hat{\lambda}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right.
+(|i=1Ne(logλ^(ti)logλ(ti))|+|0T(λ^(t)λ(t))dt|)𝟙{Ne>s0}]\displaystyle\quad\quad+\left.\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\hat{\lambda}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\hat{\lambda}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
𝔼[(T+NeB1)λ^λL𝟙{Nes0}]+𝔼[(T+NeB1)λ^λL𝟙{Ne>s0}]\displaystyle\leq\mathbb{E}\left[(T+\frac{N_{e}}{B_{1}})\left\|\hat{\lambda}-\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right]+\mathbb{E}\left[(T+\frac{N_{e}}{B_{1}})\left\|\hat{\lambda}-\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
:=𝕀1+𝕀2.\displaystyle:=\mathbb{I}_{1}+\mathbb{I}_{2}~{}. (46)

Since λ0(t)B0\lambda_{0}(t)\leq B_{0} and μ(t)=αexp(βt)\mu(t)=\alpha\exp(-\beta t), taking cμ=α/βc_{\mu}=\alpha/\beta and η=(α+β)/(2α)\eta=(\alpha+\beta)/(2\alpha) in Lemma 2, we have

(Nes)\displaystyle\mathbb{P}\left(N_{e}\geq s\right) 2B0T1cμexp(log(η)2[η(B0T)(1cμη)s])\displaystyle\leq\frac{2\sqrt{B_{0}T}}{1-c_{\mu}}\exp\left(\frac{\log(\eta)}{2}\left[\eta(B_{0}T)-(1-c_{\mu}\eta)s\right]\right)
2βB0Tβαexp(log(α+β2α)2[α+β2α(B0T)βα2βs])\displaystyle\leq\frac{2\beta\sqrt{B_{0}T}}{\beta-\alpha}\exp\left(\frac{\log\left(\frac{\alpha+\beta}{2\alpha}\right)}{2}\left[\frac{\alpha+\beta}{2\alpha}(B_{0}T)-\frac{\beta-\alpha}{2\beta}s\right]\right)
:=aeexp(ces).\displaystyle:=a_{e}\exp\left(-c_{e}s\right).
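The following simulation sketch is illustrative only and not used in the proof; it assumes, for simplicity, a constant baseline λ0 and simulates the exponential Hawkes process by Ogata's thinning to eyeball the exponential tail of N_e that Lemma 2 guarantees.

```python
import numpy as np

# Simulation sketch (assumed constant baseline lambda_0 for simplicity):
# Ogata's thinning for the Hawkes process with kernel alpha * exp(-beta * u),
# used to inspect the empirical tail probabilities P(N_e >= s).
rng = np.random.default_rng(0)
lam0, alpha, beta, T = 1.0, 0.5, 1.5, 5.0          # stationarity: alpha / beta < 1

def intensity(t, events):
    events = np.asarray(events, dtype=float)
    past = events[events < t]
    return lam0 + alpha * np.exp(-beta * (t - past)).sum()

def num_events():
    t, events = 0.0, []
    while True:
        lam_bar = intensity(t, events) + alpha     # upper bound until the next event
        t += rng.exponential(1.0 / lam_bar)
        if t > T:
            return len(events)
        if rng.uniform() <= intensity(t, events) / lam_bar:
            events.append(t)

counts = np.array([num_events() for _ in range(2000)])
for s in [5, 10, 15, 20]:
    print(s, (counts >= s).mean())                 # empirical tail; drops off rapidly in s
```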

By (45),

𝕀1\displaystyle\mathbb{I}_{1} (T+1B1)3𝒞B0Ts+42Ns𝔼[(Ne+1)𝟙{Nes0}]\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\frac{3\mathcal{C}B_{0}T^{s}+4}{2N^{s}}\mathbb{E}\left[(N_{e}+1)\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right]
(T+1B1)3𝒞B0Ts+42Ns𝔼[(Ne+1)]\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\frac{3\mathcal{C}B_{0}T^{s}+4}{2N^{s}}\mathbb{E}\left[(N_{e}+1)\right]
=(T+1B1)3𝒞B0Ts+42Ns(1+s=1(Nes))\displaystyle=\left(T+\frac{1}{B_{1}}\right)\frac{3\mathcal{C}B_{0}T^{s}+4}{2N^{s}}\left(1+\sum_{s=1}^{\infty}\mathbb{P}(N_{e}\geq s)\right)
(T+1B1)(1+aeexp(ce)1exp(ce))3𝒞B0Ts+42Ns\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\left(1+\frac{a_{e}\exp(-c_{e})}{1-\exp(-c_{e})}\right)\frac{3\mathcal{C}B_{0}T^{s}+4}{2N^{s}} (47)

On the other hand, from λ^LB0+αs0\|\hat{\lambda}\|_{L^{\infty}}\leq B_{0}+\alpha s_{0} and λLB0+αNe\|\lambda^{\ast}\|_{L^{\infty}}\leq B_{0}+\alpha N_{e}, we have

𝕀2\displaystyle\mathbb{I}_{2} 𝔼[(T+NeB1)λ^L𝟙{Ne>s0}]+𝔼[(T+NeB1)λL𝟙{Ne>s0}]\displaystyle\leq\mathbb{E}\left[\left(T+\frac{N_{e}}{B_{1}}\right)\|\hat{\lambda}\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]+\mathbb{E}\left[\left(T+\frac{N_{e}}{B_{1}}\right)\|\lambda^{\ast}\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
(T+1B1)(B0+αs0)𝔼[(Ne+1)𝟙{Ne>s0}]+(T+1B1)E[(Ne+1)(B0+αNe)𝟙{Ne>s0}]\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)(B_{0}+\alpha s_{0})\mathbb{E}\left[(N_{e}+1)\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]+\left(T+\frac{1}{B_{1}}\right)E\left[(N_{e}+1)(B_{0}+\alpha N_{e})\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
(T+1B1)(B0+αs0)((s0+1)(Nes0+1)+s=s0+1(Nes))\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)(B_{0}+\alpha s_{0})\left((s_{0}+1)\mathbb{P}(N_{e}\geq s_{0}+1)+\sum_{s=s_{0}+1}^{\infty}\mathbb{P}(N_{e}\geq s)\right)
+(T+1B1)((s0+1)(B0+αs0)(Nes0+1)+s=s0+1(2αs+B0)(Nes))\displaystyle\quad+\left(T+\frac{1}{B_{1}}\right)\left((s_{0}+1)(B_{0}+\alpha s_{0})\mathbb{P}(N_{e}\geq s_{0}+1)+\sum_{s=s_{0}+1}^{\infty}(2\alpha s+B_{0})\mathbb{P}(N_{e}\geq s)\right)
(T+1B1)(B0+αs0)aeexp(ce(s0+1))((s0+1)+11exp(ce))\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)(B_{0}+\alpha s_{0})a_{e}\exp(-c_{e}(s_{0}+1))\left((s_{0}+1)+\frac{1}{1-\exp(-c_{e})}\right)
+(T+1B1)aeexp(ce(s0+1))((s0+1)(B0+αs0)+2α(s0+1)+B0(1exp(ce))2)\displaystyle\quad+\left(T+\frac{1}{B_{1}}\right)a_{e}\exp(-c_{e}(s_{0}+1))\left((s_{0}+1)(B_{0}+\alpha s_{0})+\frac{2\alpha(s_{0}+1)+B_{0}}{(1-\exp(-c_{e}))^{2}}\right)
(T+1B1)aeexp(ce(s0+1))(2(s0+1)(B0+αs0)+3α(s0+1)+2B0(1exp(ce))2).\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)a_{e}\exp(-c_{e}(s_{0}+1))\left(2(s_{0}+1)(B_{0}+\alpha s_{0})+\frac{3\alpha(s_{0}+1)+2B_{0}}{(1-\exp(-c_{e}))^{2}}\right). (48)

Combining (46), (47), and (48), we have

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle\quad|\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|
(T+1B1)aeexp(ce(s0+1))(2(s0+1)(B0+αs0)+3α(s0+1)+2B0(1exp(ce))2)\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)a_{e}\exp(-c_{e}(s_{0}+1))\left(2(s_{0}+1)(B_{0}+\alpha s_{0})+\frac{3\alpha(s_{0}+1)+2B_{0}}{(1-\exp(-c_{e}))^{2}}\right)
+(T+1B1)(1+aeexp(ce)1exp(ce))3𝒞B0Ts+42Ns.\displaystyle\quad+\left(T+\frac{1}{B_{1}}\right)\left(1+\frac{a_{e}\exp(-c_{e})}{1-\exp(-c_{e})}\right)\frac{3\mathcal{C}B_{0}T^{s}+4}{2N^{s}}.

Let s0=slog(N)/ces_{0}=\lceil s\log(N)/c_{e}\rceil, and denote λ^N=λ^\hat{\lambda}^{N}=\hat{\lambda}. We have

|𝔼[loss(λ^N,Stest)]𝔼[loss(λ,Stest)]|(logN)2Ns.\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{(\log N)^{2}}{N^{s}}~{}.

Step 7. Bound the sizes of the network width and weights.

From Steps 1-6, the width of the network is less than

3𝒩~2(𝒩~+22)+3s2+6N+2,\displaystyle 3\left\lceil\frac{\tilde{\mathcal{N}}}{2}\right\rceil\binom{\tilde{\mathcal{N}}+2}{2}+3\left\lceil\frac{s}{2}\right\rceil+6N+2~{},

where 𝒩~=𝒩+10(βT1)+2log(3(s0+1))\tilde{\mathcal{N}}=\mathcal{N}+10(\beta T\vee 1)+2\lceil\log(3(s_{0}+1))\rceil. Since s0=scelog(N)s_{0}=\lceil\frac{s}{c_{e}}\log(N)\rceil and 𝒩=slog(N)\mathcal{N}=\lceil s\log(N)\rceil, we have D=O(N)D=O(N).
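As a numerical aside (with assumed sample values of s, β, T, and c_e that are not taken from the paper), the sketch below evaluates the width bound and shows that the combinatorial term grows only polylogarithmically in N, so the 6N term dominates for large N.

```python
from math import ceil, comb, log

# Numeric sketch with assumed constants; N_tilde is rounded up so that the
# binomial coefficient is well defined.  The ratio width / N decreases toward
# the constant 6 as N grows, illustrating D = O(N).
s, beta, T, c_e = 2, 1.5, 5.0, 0.5
for N in [10**2, 10**4, 10**6, 10**8]:
    s0 = ceil(s * log(N) / c_e)
    Ncal = ceil(s * log(N))
    Ntil = Ncal + ceil(10 * max(beta * T, 1.0)) + 2 * ceil(log(3 * (s0 + 1)))
    width = 3 * ceil(Ntil / 2) * comb(Ntil + 2, 2) + 3 * ceil(s / 2) + 6 * N + 2
    print(N, Ntil, width, width / N)
```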

From the construction of g^𝒩\hat{g}^{\mathcal{N}}, ψh\psi_{h}, and λ¯0N\overline{\lambda}_{0}^{N}, the weights of the network are less than

O\displaystyle O (max{2σ(0)hexp(𝒩~2+𝒩~3Cd𝒩~2)(𝒩~(𝒩~+2))3𝒩~(𝒩~+2),\displaystyle\left(\max\left\{\frac{2}{\sigma^{{}^{\prime}}(0)h}\exp(\frac{{\tilde{\mathcal{N}}}^{2}+\tilde{\mathcal{N}}-3Cd\tilde{\mathcal{N}}}{2})(\tilde{\mathcal{N}}(\tilde{\mathcal{N}}+2))^{3\tilde{\mathcal{N}}(\tilde{\mathcal{N}}+2)},\right.\right.
[2s5s(s1)!B0Ts]s/2N(1+s2)/2(s(s+2))3s(s+2)}),\displaystyle\quad\quad\quad\left.\left.\left[\frac{\sqrt{2s}5^{s}}{(s-1)!}B_{0}T^{s}\right]^{-s/2}N^{(1+s^{2})/2}(s(s+2))^{3s(s+2)}\right\}\right),

where 𝒩~=𝒩+10(βT1)+2log(3(s0+1))\tilde{\mathcal{N}}=\mathcal{N}+10(\beta T\vee 1)+2\lceil\log(3(s_{0}+1))\rceil. Since s0=scelog(N)s_{0}=\lceil\frac{s}{c_{e}}\log(N)\rceil, h=(12(s0+1))2Ns/2h=(12(s_{0}+1))^{-2}N^{-s/2}, 𝒩=slog(N)\mathcal{N}=\lceil s\log(N)\rceil, the weights are less than

𝒞1(log(N))12s2(log(N))2,\displaystyle\mathcal{C}_{1}(\log(N))^{12s^{2}(\log(N))^{2}}~{},

where 𝒞1\mathcal{C}_{1} is a constant related to s,B0,α,βs,B_{0},\alpha,\beta, and TT.

9.3 Proof of Theorem 6

Lemma 8.

Suppose μCk,([0,T],C0)\mu\in C^{k,\infty}([0,T],C_{0}), k2k\geq 2, kk\in\mathbb{N}. The Fourier series of μ\mu is given by

S(t)=μ^02+l=1(μ^lcos(2lπTt)+ν^lsin(2lπTt)),\displaystyle S_{\infty}(t)=\frac{\hat{\mu}_{0}}{2}+\sum_{l=1}^{\infty}\left(\hat{\mu}_{l}\cos\left(\frac{2l\pi}{T}t\right)+\hat{\nu}_{l}\sin\left(\frac{2l\pi}{T}t\right)\right), (49)

where μ^l=20Tμ(t)cos(2lπt/T)dt/T\hat{\mu}_{l}=2\int_{0}^{T}\mu(t)\cos(2l\pi t/T)\mathrm{d}t/T, ν^l=20Tμ(t)sin(2lπt/T)dt\hat{\nu}_{l}=2\int_{0}^{T}\mu(t)\sin(2l\pi t/T)\mathrm{d}t/T, l0l\geq 0. If μ(j)(0+)=μ(j)(T)\mu^{(j)}(0+)=\mu^{(j)}(T-), 0jk10\leq j\leq k-1, then

|μ^l|2C0Tk(2lπ)k,|ν^l|2C0Tk(2lπ)k\displaystyle|\hat{\mu}_{l}|\leq\frac{2C_{0}T^{k}}{(2l\pi)^{k}},|\hat{\nu}_{l}|\leq\frac{2C_{0}T^{k}}{(2l\pi)^{k}}

and S(t)=μ(t)S_{\infty}(t)=\mu(t) on t[0,T]t\in[0,T]. Moreover, denoting the partial sum of S(t)S_{\infty}(t) by SNμ(t)=μ^0/2+l=1Nμ(μ^lcos(2lπt/T)+ν^lsin(2lπt/T))S_{N_{\mu}}(t)=\hat{\mu}_{0}/2+\sum_{l=1}^{N_{\mu}}\left(\hat{\mu}_{l}\cos(2l\pi t/T)+\hat{\nu}_{l}\sin(2l\pi t/T)\right), we have

|μ(t)SNμ(t)|2C0Tk+1(k1)(2π)kNμk1,t[0,T].\displaystyle\left|\mu(t)-S_{N_{\mu}}(t)\right|\leq\frac{2C_{0}T^{k+1}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}},~{}t\in[0,T].
Proof of Lemma 8.

The proof is a standard Fourier analysis exercise and we omit it. ∎
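As an illustrative aside (not part of the proof), the following sketch computes the partial Fourier sums of an assumed toy kernel satisfying the boundary condition and shows the uniform truncation error decaying in N_μ, in line with Lemma 8.

```python
import numpy as np

# Numerical illustration of Lemma 8 with the assumed toy kernel
# mu(t) = exp(sin(2*pi*t/T)); it is T-periodic and smooth, so the boundary
# condition mu^(j)(0+) = mu^(j)(T-) holds for every k.
T = 2.0
t = np.linspace(0.0, T, 4096, endpoint=False)
mu = np.exp(np.sin(2 * np.pi * t / T))

def coeff(l):
    # Fourier coefficients mu_hat_l, nu_hat_l via a Riemann sum over one period
    c = 2.0 * np.mean(mu * np.cos(2 * l * np.pi * t / T))
    s = 2.0 * np.mean(mu * np.sin(2 * l * np.pi * t / T))
    return c, s

for N_mu in [1, 2, 4, 8]:
    partial = 0.5 * coeff(0)[0] * np.ones_like(t)
    for l in range(1, N_mu + 1):
        c, s = coeff(l)
        partial += c * np.cos(2 * l * np.pi * t / T) + s * np.sin(2 * l * np.pi * t / T)
    print(N_mu, np.max(np.abs(mu - partial)))     # sup-norm truncation error drops fast
```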

Theorem 9.

Under model assumption 5 and μ(j)(0+)=μ(j)(T)\mu^{(j)}(0+)=\mu^{(j)}(T-), 0jk10\leq j\leq k-1 , for N5N\geq 5, there exists an RNN structure λ^N,Nμ\hat{\lambda}^{N,N_{\mu}} as stated in section 2.2 with L=2L=2, lf=B1l_{f}=B_{1}, uf=B0+O(logN)u_{f}=B_{0}+O(\log N), and input function x(t;S)=(t,tFS(t))x(t;S)=(t,t-F_{S}(t))^{\top} such that

|𝔼[loss(λ^N,Nμ,Stest)]𝔼[loss(λ,Stest)]|11cμexp(2B0Tcμ2)(Ts+log2NNs+TklogNNμk1).\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N,N_{\mu}},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{1}{1-c_{\mu}}\exp\left(\frac{2B_{0}T}{c_{\mu}^{2}}\right)\left(\frac{T^{s}+\log^{2}N}{N^{s}}+\frac{T^{k}\log N}{N_{\mu}^{k-1}}\right)~{}.

Moreover, the width of λ^N,Nμ\hat{\lambda}^{N,N_{\mu}} satisfies DN+Nμ5log4ND\lesssim N+N_{\mu}^{5}\log^{4}N and the weights of λ^N,Nμ\hat{\lambda}^{N,N_{\mu}} are less than

𝒞1(log(NNμ))12s2(log(NNμ))2,\displaystyle\mathcal{C}_{1}(\log(NN_{\mu}))^{12s^{2}(\log(NN_{\mu}))^{2}}~{},

where 𝒞1\mathcal{C}_{1} is a constant related to s,B0,C0,cμs,B_{0},C_{0},c_{\mu}, and TT.

Proof of Theorem 9.

Similar to the proof of Theorem 5 , the proof is divided into several steps. Denote wl=2lπ/Tw_{l}=2l\pi/T, gl,1(x,t)=x1coswlt+x2sinwltg_{l,1}(x,t)=x_{1}\cos w_{l}t+x_{2}\sin w_{l}t, gl,2(x,t)=x1sinwlt+x2coswlt+1g_{l,2}(x,t)=-x_{1}\sin w_{l}t+x_{2}\cos w_{l}t+1, gl(x,t)=(gl,1(x,t),gl,2(x,t))2g_{l}(x,t)=(g_{l,1}(x,t),g_{l,2}(x,t))^{\top}\in\mathbb{R}^{2}, where x2x\in\mathbb{R}^{2}, l+l\in\mathbb{N}_{+}. For l+l\in\mathbb{N}_{+}, define

Sl(t)=ti<t(sinwl(tti)coswl(tti))+(01),\displaystyle S_{l}(t)=\sum_{t_{i}<t}\begin{pmatrix}\sin w_{l}(t-t_{i})\\ \cos w_{l}(t-t_{i})\end{pmatrix}+\begin{pmatrix}0\\ 1\end{pmatrix},

and

Sl,i=0<j<i(sinwl(titj)coswl(titj))+(01).\displaystyle S_{l,i}=\sum_{0<j<i}\begin{pmatrix}\sin w_{l}(t_{i}-t_{j})\\ \cos w_{l}(t_{i}-t_{j})\end{pmatrix}+\begin{pmatrix}0\\ 1\end{pmatrix}.

Hence we have

Sl,i+1=(coswl(ti+1ti)sinwl(ti+1ti)sinwl(ti+1ti)coswl(ti+1ti))Sl,i+(01)=gl(Sl,i,ti+1ti)\displaystyle S_{l,i+1}=\begin{pmatrix}\cos w_{l}(t_{i+1}-t_{i})&\sin w_{l}(t_{i+1}-t_{i})\\ -\sin w_{l}(t_{i+1}-t_{i})&\cos w_{l}(t_{i+1}-t_{i})\end{pmatrix}S_{l,i}+\begin{pmatrix}0\\ 1\end{pmatrix}=g_{l}(S_{l,i},t_{i+1}-t_{i})

and

Sl(t)=gl(Sl,i,tti),t(ti,ti+1].\displaystyle S_{l}(t)=g_{l}(S_{l,i},t-t_{i}),~{}t\in(t_{i},t_{i+1}].

where we agree on Sl,0=𝟎S_{l,0}=\mathbf{0}. Define S0(t)=#{i:ti<t}S_{0}(t)=\#\{i:t_{i}<t\}. If we assume t1>0t_{1}>0, the true intensity can be rewritten as

λ(t)=λ0(t)+μ^02(S0(t)1)+l=1(ν^l,μ^l)(Sl(t)(01)),t[0,T],\displaystyle\lambda^{\ast}(t)=\lambda_{0}(t)+\frac{\hat{\mu}_{0}}{2}(S_{0}(t)-1)+\sum_{l=1}^{\infty}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(S_{l}(t)-\begin{pmatrix}0\\ 1\end{pmatrix}\right),~{}t\in[0,T], (50)

where aba\cdot b refers to the standard inner product of vectors aa and bb.

We first fix s0+s_{0}\in\mathbb{N}_{+}. Since (t1=0)=0\mathbb{P}(t_{1}=0)=0, we may assume t1>0t_{1}>0 so that (50) holds.
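Before Step 1, the following sketch (illustrative only) verifies numerically that the rotation recursion above indeed propagates S_l from event to event; this is the mechanism that the constant-width RNN constructed below exploits.

```python
import numpy as np

# Sanity check of the rotation recursion: the state
# S_l(t) = sum_{t_i < t} (sin w_l(t - t_i), cos w_l(t - t_i))^T + (0, 1)^T
# is propagated event to event by a 2x2 rotation plus the constant (0, 1)^T,
# which is the map g_l that the tanh RNN approximates.
T, l = 2.0, 3
w = 2 * l * np.pi / T
events = np.array([0.3, 0.8, 1.1, 1.6])

def S_direct(t):
    past = events[events < t]
    return np.array([np.sin(w * (t - past)).sum(),
                     np.cos(w * (t - past)).sum() + 1.0])

def g_l(state, dt):
    R = np.array([[np.cos(w * dt),  np.sin(w * dt)],
                  [-np.sin(w * dt), np.cos(w * dt)]])
    return R @ state + np.array([0.0, 1.0])

state, t_prev = np.zeros(2), 0.0                  # S_{l,0} = 0, t_0 = 0
for t_i in events:
    state = g_l(state, t_i - t_prev)              # S_{l,i} = g_l(S_{l,i-1}, t_i - t_{i-1})
    assert np.allclose(state, S_direct(t_i))
    t_prev = t_i

t = 1.9                                           # a time after the last event
print(g_l(state, t - t_prev), S_direct(t))        # S_l(t) between events also matches
```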

Step 1. Construct the approximation of gl(x,t)=(gl,1(x,t),gl,2(x,t))2g_{l}(x,t)=(g_{l,1}(x,t),g_{l,2}(x,t))^{\top}\in\mathbb{R}^{2}, where gl,1,gl,2C([3(s0+1),3(s0+1)]2×[0,T])g_{l,1},g_{l,2}\in C^{\infty}\left([-3({s_{0}+1}),3({s_{0}+1})]^{2}\times[0,T]\right) . Here x2x\in\mathbb{R}^{2}.

Let g~l,i(x,t)=gl,i(3(s0+1)(2x1),Tt)\tilde{g}_{l,i}(x,t)=g_{l,i}(3(s_{0}+1)(2x-1),Tt), i=1,2i=1,2. Then g~l,iC([0,1]3)\tilde{g}_{l,i}\in C^{\infty}\left([0,1]^{3}\right). By simple computation, we have

g~l,iWk,([0,1]3)6(s0+1)(wlT)k.\displaystyle\|\tilde{g}_{l,i}\|_{W^{k,\infty}\left([0,1]^{3}\right)}\leq 6({s_{0}+1})(w_{l}T)^{k}.

Applying Lemma 14 to g~l,i/[6(s0+1)]\tilde{g}_{l,i}/[6({s_{0}+1})], for any 𝒩+\mathcal{N}\in\mathbb{N}_{+}, there exists a tanh neural network g~l,i𝒩\tilde{g}_{l,i}^{\mathcal{N}} with only one hidden layer and width 3(𝒩+15wlT)/2(𝒩+15wlT+33)3\lceil(\mathcal{N}+15w_{l}T)/2\rceil\binom{\mathcal{N}+15w_{l}T+3}{3} such that

|g~l,i(x,t)g~l,i𝒩(x,t)|6(s0+1)exp(𝒩),(x,t)[0,1]3.\displaystyle\left|\tilde{g}_{l,i}(x,t)-\tilde{g}_{l,i}^{\mathcal{N}}(x,t)\right|\leq 6({s_{0}+1})\exp(-\mathcal{N}),~{}(x,t)\in[0,1]^{3}.

By coordinate transformation, we get

|gl,i(x,t)g~l,i𝒩(x6(s0+1)+12,tT)|6(s0+1)exp(𝒩),(x,t)[3(s0+1),3(s0+1)]2×[0,T].\displaystyle\left|g_{l,i}(x,t)-\tilde{g}_{l,i}^{\mathcal{N}}\left(\frac{x}{6({s_{0}+1})}+\frac{1}{2},\frac{t}{T}\right)\right|\leq 6({s_{0}+1})\exp(-\mathcal{N}),~{}(x,t)\in[-3({s_{0}+1}),3({s_{0}+1})]^{2}\times[0,T].

Define g^l,i𝒩(x,t)=g~l,i𝒩(x/[6(s0+1)]+1/2,t/T)\hat{g}_{l,i}^{\mathcal{N}}(x,t)=\tilde{g}_{l,i}^{\mathcal{N}}(x/[6({s_{0}+1})]+1/2,t/T), then

|gl,i(x,t)g^l,i𝒩(x,t)|6(s0+1)exp(𝒩),(x,t)[3(s0+1),3(s0+1)]2×[0,T].\displaystyle\left|g_{l,i}(x,t)-\hat{g}_{l,i}^{\mathcal{N}}(x,t)\right|\leq 6({s_{0}+1})\exp(-\mathcal{N}),~{}(x,t)\in[-3({s_{0}+1}),3({s_{0}+1})]^{2}\times[0,T].

Taking 𝒩𝒩+log(6(s0+1))\mathcal{N}\leftarrow\mathcal{N}+\lceil\log(6({s_{0}+1}))\rceil, we have

|gl,i(x,t)g^l,i𝒩(x,t)|exp(𝒩),(x,t)[3(s0+1),3(s0+1)]2×[0,T].\displaystyle\left|g_{l,i}(x,t)-\hat{g}_{l,i}^{\mathcal{N}}(x,t)\right|\leq\exp(-\mathcal{N}),~{}(x,t)\in[-3({s_{0}+1}),3({s_{0}+1})]^{2}\times[0,T].

In particular, |gl,i(x,t)g^l,i𝒩(x,t)|1\left|g_{l,i}(x,t)-\hat{g}_{l,i}^{\mathcal{N}}(x,t)\right|\leq 1. The width of this NN is bounded by 3𝒩/2(𝒩+33)3\lceil\mathcal{N}^{\prime}/2\rceil\binom{\mathcal{N}^{\prime}+3}{3}, with 𝒩\mathcal{N}^{\prime} defined below. From Lemma 14 and Remark 15, the weights of g^l,i𝒩\hat{g}_{l,i}^{\mathcal{N}} are bounded by

O((s0+1)exp(𝒩2+𝒩3Cd𝒩2)(𝒩(𝒩+2))3𝒩(𝒩+2)),\displaystyle O\left(({s_{0}+1})\exp(\frac{{\mathcal{N}^{\prime}}^{2}+\mathcal{N}^{\prime}-3Cd\mathcal{N}^{\prime}}{2})(\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2))^{3\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2)}\right),

where 𝒩=𝒩+log(6(s0+1))+15wlT\mathcal{N}^{\prime}=\mathcal{N}+\lceil\log(6({s_{0}+1}))\rceil+15w_{l}T. Since g^l,i𝒩\hat{g}_{l,i}^{\mathcal{N}}\in\mathbb{R}, after a minor modification (precisely, increasing the width by 1), we can assume g^l,i𝒩\hat{g}_{l,i}^{\mathcal{N}} has the following structure:

g^l,i𝒩(x,t)=Vl,iσ((Wl,iBl,i)(xt)+bl,i).\displaystyle\hat{g}_{l,i}^{\mathcal{N}}(x,t)=V_{l,i}\sigma\left(\begin{pmatrix}W_{l,i}&B_{l,i}\end{pmatrix}\begin{pmatrix}x\\ t\end{pmatrix}+b_{l,i}\right).

Denote g^l𝒩(x,t)=(g^l,1𝒩(x,t),g^l,2𝒩(x,t))\hat{g}_{l}^{\mathcal{N}}(x,t)=\left(\hat{g}_{l,1}^{\mathcal{N}}(x,t),\hat{g}_{l,2}^{\mathcal{N}}(x,t)\right)^{\top}.

Step 1 (continued). Construct the approximation of the identity and of g0(x)=x+1g_{0}(x)=x+1, x[(s0+1),2(s0+1)]x\in[-(s_{0}+1),2(s_{0}+1)]. Here xx\in\mathbb{R}.

Similarly to Step 3 in the proof of Theorem 5, taking ψh(x)=2σ(hx/2)/[σ(0)h]\psi_{h}(x)=2\sigma\left(hx/2\right)/[\sigma^{{}^{\prime}}(0)h], we have

|xψh(x)|(6M)4h2,x[M,M].\displaystyle|x-\psi_{h}(x)|\leq(6M)^{4}h^{2},~{}x\in[-M,M].

For g0(x)=x+1g_{0}(x)=x+1, x[(s0+1),2(s0+1)]x\in[-(s_{0}+1),2(s_{0}+1)], we can construct a similar approximation as in the proof of Theorem 5. There exists a tanh neural network g^0𝒩\hat{g}_{0}^{\mathcal{N}} with only one hidden layer and width 3(𝒩′′+5)/23\lceil(\mathcal{N}^{\prime\prime}+5)/2\rceil such that

|g0(x)g^0𝒩(x)|exp(𝒩),x[(s0+1),2(s0+1)],\displaystyle\left|g_{0}(x)-\hat{g}_{0}^{\mathcal{N}}(x)\right|\leq\exp(-\mathcal{N}),~{}x\in[-(s_{0}+1),2(s_{0}+1)],

where 𝒩′′=𝒩+(s0+3)log2\mathcal{N}^{\prime\prime}=\mathcal{N}+\lceil(s_{0}+3)\log 2\rceil. The weight of g^0𝒩\hat{g}_{0}^{\mathcal{N}} is bounded by

O((s0+1)exp(𝒩′′2+𝒩′′3Cd𝒩′′2)[𝒩′′(𝒩′′+2)]3𝒩′′(𝒩′′+2)).\displaystyle O\left(({s_{0}+1})\exp\left(\frac{{\mathcal{N}^{\prime\prime}}^{2}+\mathcal{N}^{\prime\prime}-3Cd\mathcal{N}^{\prime\prime}}{2}\right)\left[\mathcal{N}^{\prime\prime}(\mathcal{N}^{\prime\prime}+2)\right]^{3\mathcal{N}^{\prime\prime}(\mathcal{N}^{\prime\prime}+2)}\right).

Step 2. Construct the approximation of Sl,iS_{l,i} and Sl(t)S_{l}(t) under the event {Nes0}\{N_{e}\leq s_{0}\}.

Let hl,0𝒩=(hl,0,1𝒩,hl,0,2𝒩)=𝟎h_{l,0}^{\mathcal{N}}=(h_{l,0,1}^{\mathcal{N}},h_{l,0,2}^{\mathcal{N}})^{\top}=\mathbf{0} and S¯l,0𝒩=(S¯l,0,1𝒩,S¯l,0,2𝒩)=𝟎\overline{S}_{l,0}^{\mathcal{N}}=(\overline{S}_{l,0,1}^{\mathcal{N}},\overline{S}_{l,0,2}^{\mathcal{N}})^{\top}=\mathbf{0}. For 1is01\leq i\leq s_{0}, we construct hl,i𝒩h_{l,i}^{\mathcal{N}} and S¯l,i𝒩\overline{S}_{l,i}^{\mathcal{N}} recursively by

{hl,i𝒩=σ((Wl,1(Vl,100Vl,2)Wl,2(Vl,100Vl,2))hl,i1𝒩+(Bl,1Bl,2)(titi1)+(bl,1bl,2)),S¯l,i𝒩=(Vl,100Vl,2)hl,i𝒩.\left\{\begin{aligned} h_{l,i}^{\mathcal{N}}&=\sigma\left(\begin{pmatrix}W_{l,1}\begin{pmatrix}V_{l,1}&0\\ 0&V_{l,2}\end{pmatrix}\\ W_{l,2}\begin{pmatrix}V_{l,1}&0\\ 0&V_{l,2}\end{pmatrix}\end{pmatrix}h_{l,i-1}^{\mathcal{N}}+\begin{pmatrix}B_{l,1}\\ B_{l,2}\end{pmatrix}\begin{pmatrix}t_{i}-t_{i-1}\end{pmatrix}+\begin{pmatrix}b_{l,1}\\ b_{l,2}\end{pmatrix}\right),\\ \overline{S}_{l,i}^{\mathcal{N}}&=\begin{pmatrix}V_{l,1}&0\\ 0&V_{l,2}\end{pmatrix}h_{l,i}^{\mathcal{N}}.\end{aligned}\right.

Hence S¯l,i𝒩=g^l𝒩(S¯l,i1𝒩,titi1),1is0\overline{S}_{l,i}^{\mathcal{N}}=\hat{g}_{l}^{\mathcal{N}}(\overline{S}_{l,i-1}^{\mathcal{N}},t_{i}-t_{i-1}),1\leq i\leq s_{0}. Here we agree on t0=0t_{0}=0.

Similarly, we can define S¯l𝒩(t),t(ti1,ti]\overline{S}_{l}^{\mathcal{N}}(t),t\in(t_{i-1},t_{i}] by

{hl𝒩(t)=σ((Wl,1(Vl,100Vl,2)Wl,2(Vl,100Vl,2))hl,i1𝒩+(Bl,1Bl,2)(tti1)+(bl,1bl,2)),S¯l𝒩(t)=(Vl,100Vl,2)hl𝒩(t).\left\{\begin{aligned} h_{l}^{\mathcal{N}}(t)&=\sigma\left(\begin{pmatrix}W_{l,1}\begin{pmatrix}V_{l,1}&0\\ 0&V_{l,2}\end{pmatrix}\\ W_{l,2}\begin{pmatrix}V_{l,1}&0\\ 0&V_{l,2}\end{pmatrix}\end{pmatrix}h_{l,i-1}^{\mathcal{N}}+\begin{pmatrix}B_{l,1}\\ B_{l,2}\end{pmatrix}\begin{pmatrix}t-t_{i-1}\end{pmatrix}+\begin{pmatrix}b_{l,1}\\ b_{l,2}\end{pmatrix}\right),\\ \overline{S}_{l}^{\mathcal{N}}(t)&=\begin{pmatrix}V_{l,1}&0\\ 0&V_{l,2}\end{pmatrix}h_{l}^{\mathcal{N}}(t).\end{aligned}\right.

Hence S¯l𝒩(t)=g^l𝒩(S¯l,i1𝒩,tti1),t(ti1,ti]\overline{S}_{l}^{\mathcal{N}}(t)=\hat{g}_{l}^{\mathcal{N}}(\overline{S}_{l,i-1}^{\mathcal{N}},t-t_{i-1}),t\in(t_{i-1},t_{i}]. The approximation error can be bounded by

Sl(t)S¯l𝒩(t)2\displaystyle\quad\left\|S_{l}(t)-\overline{S}_{l}^{\mathcal{N}}(t)\right\|_{2}
=gl(Sl,i1,tti1)g^l𝒩(S¯l,i1𝒩,tti1)2\displaystyle=\left\|g_{l}(S_{l,i-1},t-t_{i-1})-\hat{g}_{l}^{\mathcal{N}}(\overline{S}_{l,i-1}^{\mathcal{N}},t-t_{i-1})\right\|_{2}
gl(Sl,i1,tti1)gl(S¯l,i1𝒩,tti1)2+gl(S¯l,i1𝒩,tti1)g^l𝒩(S¯l,i1𝒩,tti1)2\displaystyle\leq\left\|g_{l}(S_{l,i-1},t-t_{i-1})-g_{l}(\overline{S}_{l,i-1}^{\mathcal{N}},t-t_{i-1})\right\|_{2}+\left\|g_{l}(\overline{S}_{l,i-1}^{\mathcal{N}},t-t_{i-1})-\hat{g}_{l}^{\mathcal{N}}(\overline{S}_{l,i-1}^{\mathcal{N}},t-t_{i-1})\right\|_{2}
Sl,i1S¯l,i1𝒩2+2max{gl,1g^l,1𝒩gl,2g^l,2𝒩}\displaystyle\leq\left\|S_{l,i-1}-\overline{S}_{l,i-1}^{\mathcal{N}}\right\|_{2}+\sqrt{2}\max\left\{\left\|g_{l,1}-\hat{g}_{l,1}^{\mathcal{N}}\right\|_{\infty}\vee\left\|g_{l,2}-\hat{g}_{l,2}^{\mathcal{N}}\right\|_{\infty}\right\}
Sl,i1S¯l,i1𝒩2+2exp(𝒩)\displaystyle\leq\left\|S_{l,i-1}-\overline{S}_{l,i-1}^{\mathcal{N}}\right\|_{2}+\sqrt{2}\exp(-\mathcal{N}) (51)
\displaystyle\leq\cdots
2iexp(𝒩),t(ti1,ti].\displaystyle\leq\sqrt{2}i\exp(-\mathcal{N}),~{}t\in(t_{i-1},t_{i}].

Under the event {Nes0}\{N_{e}\leq s_{0}\}, we have

Sl(t)S¯l𝒩(t)22(s0+1)exp(𝒩).\displaystyle\left\|S_{l}(t)-\overline{S}_{l}^{\mathcal{N}}(t)\right\|_{2}\leq\sqrt{2}(s_{0}+1)\exp(-\mathcal{N}).

Moreover, S¯l,i𝒩2Sl,iS¯l,i𝒩2+Sl,i22(s0+1)+(s0+1)3(s0+1)\left\|\overline{S}_{l,i}^{\mathcal{N}}\right\|_{2}\leq\left\|S_{l,i}-\overline{S}_{l,i}^{\mathcal{N}}\right\|_{2}+\left\|S_{l,i}\right\|_{2}\leq\sqrt{2}(s_{0}+1)+(s_{0}+1)\leq 3(s_{0}+1), is0i\leq s_{0}, then S¯l,i𝒩[3(s0+1),3(s0+1)]2\overline{S}_{l,i}^{\mathcal{N}}\in[-3(s_{0}+1),3(s_{0}+1)]^{2} and (51) can be verified by induction under the event {Nes0}\{N_{e}\leq s_{0}\}.

For the approximation of S0(t)S_{0}(t), we can similarly construct a simple RNN such that S¯0,i𝒩=g^0𝒩(S¯0,i1𝒩)\overline{S}_{0,i}^{\mathcal{N}}=\hat{g}_{0}^{\mathcal{N}}(\overline{S}_{0,i-1}^{\mathcal{N}}) and |S0(t)S¯0𝒩(t)|(s0+1)exp(𝒩)\left|S_{0}(t)-\overline{S}_{0}^{\mathcal{N}}(t)\right|\leq(s_{0}+1)\exp(-\mathcal{N}).

Step 3. Construct the approximation of λ(t)\lambda^{\ast}(t) under the event {Nes0}\{N_{e}\leq s_{0}\}.

Since λ0Ws,([0,T],B0)\lambda_{0}\in W^{s,\infty}([0,T],B_{0}), from the proof of Theorem 4, there exists a two-layer tanh neural network λ¯0N\overline{\lambda}_{0}^{N} with width less than 3s/2+6N3\lceil s/2\rceil+6N such that

|λ¯0N(t)λ0(t)|3𝒞B0Ts2Ns,0tT.\displaystyle\left|\overline{\lambda}_{0}^{N}(t)-\lambda_{0}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}},~{}\forall 0\leq t\leq T. (52)

Moreover, the weights of λ¯0N\overline{\lambda}_{0}^{N} can be bounded by

O([2s5s(s1)!B0Ts]s2N(1+s2)/2(s(s+2))3s(s+2)).\displaystyle O\left(\left[\frac{\sqrt{2s}5^{s}}{(s-1)!}B_{0}T^{s}\right]^{-\frac{s}{2}}N^{(1+s^{2})/2}(s(s+2))^{3s(s+2)}\right)~{}.

Here we assume λ¯0N(t)\overline{\lambda}_{0}^{N}(t) have the following structure

λ¯0N(t)=V2σ(V1σ(Bt+b0)+b1)+b2.\displaystyle\overline{\lambda}_{0}^{N}(t)=V_{2}^{{}^{\prime}}\sigma\left(V_{1}^{{}^{\prime}}\sigma\left(B^{{}^{\prime}}t+b_{0}^{{}^{\prime}}\right)+b_{1}^{{}^{\prime}}\right)+b_{2}^{{}^{\prime}}~{}.

Since λ(t)=λ0(t)+μ^0(S0(t)1)/2+l=1(ν^l,μ^l)(Sl(t)(01))\lambda^{\ast}(t)=\lambda_{0}(t)+\hat{\mu}_{0}(S_{0}(t)-1)/2+\sum_{l=1}^{\infty}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(S_{l}(t)-\begin{pmatrix}0\\ 1\end{pmatrix}\right), we can construct its (finite sum) approximation by

λ¯(t)=λ¯0N(t)+μ^02(ψh(S0¯𝒩(t))1)+l=1Nμ(ν^l,μ^l)(ψh(Sl¯𝒩(t))(01)).\displaystyle\overline{\lambda}(t)=\overline{\lambda}_{0}^{N}(t)+\frac{\hat{\mu}_{0}}{2}(\psi_{h}(\overline{S_{0}}^{\mathcal{N}}(t))-1)+\sum_{l=1}^{N_{\mu}}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(\psi_{h}(\overline{S_{l}}^{\mathcal{N}}(t))-\begin{pmatrix}0\\ 1\end{pmatrix}\right)~{}.

It can be viewed as (Nμ+2)(N_{\mu}+2) copies of the RNNs defined above running in parallel; the sketch below illustrates this assembly.
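The following sketch is illustrative only; it uses an assumed band-limited toy kernel (so that the truncation at N_μ = 2 is exact) to show how the excitation sum is reassembled from the per-frequency states S_l(t) and the event counter S_0(t).

```python
import numpy as np

# Toy kernel with mu_hat_0 = 1.0, mu_hat_1 = 0.3, nu_hat_2 = 0.2 and all other
# Fourier coefficients zero, so the finite sum with N_mu = 2 recovers it exactly.
T = 2.0
mu_hat = {0: 1.0, 1: 0.3}
nu_hat = {2: 0.2}

def mu(u):
    return (mu_hat[0] / 2 + mu_hat[1] * np.cos(2 * np.pi * u / T)
            + nu_hat[2] * np.sin(4 * np.pi * u / T))

events = np.array([0.3, 0.8, 1.1, 1.6])
t = 1.9
past = events[events < t]

def S_l(l):                                       # S_l(t) as defined in the proof
    w = 2 * l * np.pi / T
    return np.array([np.sin(w * (t - past)).sum(),
                     np.cos(w * (t - past)).sum() + 1.0])

S0 = len(past) + 1                                # counts the dummy event t_0 = 0, as in (50)
recon = (mu_hat[0] / 2 * (S0 - 1)
         + mu_hat[1] * (S_l(1)[1] - 1.0)          # mu_hat_l pairs with the cosine component
         + nu_hat[2] * S_l(2)[0])                 # nu_hat_l pairs with the sine component
print(recon, mu(t - past).sum())                  # the two values coincide
```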

Under the event {Nes0}\{N_{e}\leq s_{0}\}, we have B1λ(t)B0+C0s0B_{1}\leq\lambda^{\ast}(t)\leq B_{0}+C_{0}s_{0}. Recall that f(x)=min{max{x,lf},uf}f(x)=\min\{\max\{x,l_{f}\},u_{f}\}. Here we can take lf=B1l_{f}=B_{1}, uf=B0+C0s0u_{f}=B_{0}+C_{0}s_{0}. The final output is λ^(t;S)=f(λ¯(t;S))\hat{\lambda}(t;S)=f(\overline{\lambda}(t;S)).

Step 4. Compute the approximation error under the event {Nes0}\{N_{e}\leq s_{0}\}.

Under the event {Nes0}\{N_{e}\leq s_{0}\} and the construction of ff, we have

λλ^Lλλ¯L.\displaystyle\|\lambda^{\ast}-\hat{\lambda}\|_{L^{\infty}}\leq\|\lambda^{\ast}-\overline{\lambda}\|_{L^{\infty}}~{}. (53)

By the construction of λ¯\overline{\lambda},

|λ(t)λ¯(t)|\displaystyle\left|\lambda^{\ast}(t)-\overline{\lambda}(t)\right|\leq |λ0(t)λ¯0N(t)|+μ^02|S0(t)ψh(S¯0𝒩(t))|+|l=1Nμ(ν^l,μ^l)(Sl(t)ψh(Sl¯𝒩(t)))|\displaystyle\left|\lambda_{0}(t)-\overline{\lambda}_{0}^{N}(t)\right|+\frac{\hat{\mu}_{0}}{2}\left|S_{0}(t)-\psi_{h}(\overline{S}_{0}^{\mathcal{N}}(t))\right|+\left|\sum_{l=1}^{N_{\mu}}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(S_{l}(t)-\psi_{h}(\overline{S_{l}}^{\mathcal{N}}(t))\right)\right|
+|l>Nμ(ν^l,μ^l)(Sl(t)(01))|.\displaystyle+\left|\sum_{l>N_{\mu}}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(S_{l}(t)-\begin{pmatrix}0\\ 1\end{pmatrix}\right)\right|~{}. (54)

Under the event {Nes0}\{N_{e}\leq s_{0}\}, for the second term, we have

|S0(t)ψh(S¯0𝒩(t))|\displaystyle\left|S_{0}(t)-\psi_{h}(\overline{S}_{0}^{\mathcal{N}}(t))\right| |S0(t)S¯0𝒩(t)|+|S¯0𝒩(t)ψh(S¯0𝒩(t))|\displaystyle\leq\left|S_{0}(t)-\overline{S}_{0}^{\mathcal{N}}(t)\right|+\left|\overline{S}_{0}^{\mathcal{N}}(t)-\psi_{h}(\overline{S}_{0}^{\mathcal{N}}(t))\right|
(s0+1)g0g^0𝒩+(6M)4h2\displaystyle\leq(s_{0}+1)\left\|g_{0}-\hat{g}_{0}^{\mathcal{N}}\right\|_{\infty}+(6M)^{4}h^{2}
(s0+1)exp(𝒩)+(18(s0+1))4h2,0tT.\displaystyle\leq(s_{0}+1)\exp(-\mathcal{N})+(18(s_{0}+1))^{4}h^{2},~{}0\leq t\leq T. (55)

For the third term, similarly,

|l=1Nμ(ν^l,μ^l)(Sl(t)ψh(Sl¯𝒩(t)))|\displaystyle\left|\sum_{l=1}^{N_{\mu}}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(S_{l}(t)-\psi_{h}(\overline{S_{l}}^{\mathcal{N}}(t))\right)\right| l=1Nμ(ν^l,μ^l)2Sl(t)ψh(Sl¯𝒩(t))2\displaystyle\leq\sum_{l=1}^{N_{\mu}}\left\|(\hat{\nu}_{l},\hat{\mu}_{l})^{\top}\right\|_{2}\left\|S_{l}(t)-\psi_{h}(\overline{S_{l}}^{\mathcal{N}}(t))\right\|_{2}
l=1Nμ2C0(Sl(t)Sl¯𝒩(t)2+Sl¯𝒩(t)ψh(Sl¯𝒩(t))2)\displaystyle\leq\sum_{l=1}^{N_{\mu}}\sqrt{2}C_{0}\left(\left\|S_{l}(t)-\overline{S_{l}}^{\mathcal{N}}(t)\right\|_{2}+\left\|\overline{S_{l}}^{\mathcal{N}}(t)-\psi_{h}(\overline{S_{l}}^{\mathcal{N}}(t))\right\|_{2}\right)
2C0Nμ((s0+1)exp(𝒩)+(18(s0+1))4h2),0tT,\displaystyle\leq 2C_{0}N_{\mu}\left((s_{0}+1)\exp(-\mathcal{N})+(18(s_{0}+1))^{4}h^{2}\right),~{}0\leq t\leq T, (56)

where we take M=3(s0+1)M=3(s_{0}+1) in (55) and (56) to ensure that S¯0𝒩(t)\overline{S}_{0}^{\mathcal{N}}(t) and S¯l𝒩(t)\overline{S}_{l}^{\mathcal{N}}(t) can be well approximated by ψh(S¯0𝒩(t))\psi_{h}(\overline{S}_{0}^{\mathcal{N}}(t)) and ψh(S¯l𝒩(t))\psi_{h}(\overline{S}_{l}^{\mathcal{N}}(t)).

For the fourth term, using Lemma 8,

|l>Nμ(ν^l,μ^l)(Sl(t)(01))|\displaystyle\left|\sum_{l>N_{\mu}}(\hat{\nu}_{l},\hat{\mu}_{l})\cdot\left(S_{l}(t)-\begin{pmatrix}0\\ 1\end{pmatrix}\right)\right| =ti<t|μ(tti)SNμ(tti)|\displaystyle=\sum_{t_{i}<t}\left|\mu(t-t_{i})-S_{N_{\mu}}(t-t_{i})\right|
s04C0Tk(k1)(2π)kNμk1.\displaystyle\leq s_{0}\frac{4C_{0}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}}.

Here SNμ(t)S_{N_{\mu}}(t) is the partial sum of the Fourier series defined in Lemma 8.

Finally, by (52), we have |λ¯0N(t)λ0(t)|3𝒞B0Ts/(2Ns),0tT|\overline{\lambda}_{0}^{N}(t)-\lambda_{0}(t)|\leq 3\mathcal{C}B_{0}T^{s}/(2N^{s}),~{}0\leq t\leq T. To trade off the error terms in (54), take 𝒩=log((s0+1)NsNμ)\mathcal{N}=\lceil\log((s_{0}+1)N^{s}N_{\mu})\rceil and h=(18(s0+1))2Ns/2Nμ1/2h=(18(s_{0}+1))^{-2}N^{-s/2}N_{\mu}^{-1/2}. Then under the event {Nes0}\{N_{e}\leq s_{0}\}, we have

|λ(t)λ¯(t)|\displaystyle\left|\lambda^{\ast}(t)-\overline{\lambda}(t)\right| 2C0Nμ+1NsNμ+s04C0Tk(k1)(2π)kNμk1+3𝒞B0Ts2Ns\displaystyle\leq\frac{2C_{0}N_{\mu}+1}{N^{s}N_{\mu}}+s_{0}\frac{4C_{0}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}}+\frac{3\mathcal{C}B_{0}T^{s}}{2N^{s}}
3𝒞B0Ts+4C0+22Ns+s04C0Tk(k1)(2π)kNμk1,t[0,T].\displaystyle\leq\frac{3\mathcal{C}B_{0}T^{s}+4C_{0}+2}{2N^{s}}+s_{0}\frac{4C_{0}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}},~{}t\in[0,T]. (57)

Step 5. Compute the final approximation error.

By (46),

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle\quad|\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|
𝔼[(T+NeB1)λ^λL𝟙{Nes0}]+𝔼[(T+NeB1)λ^λL𝟙{Ne>s0}]\displaystyle\leq\mathbb{E}\left[(T+\frac{N_{e}}{B_{1}})\left\|\hat{\lambda}-\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right]+\mathbb{E}\left[(T+\frac{N_{e}}{B_{1}})\left\|\hat{\lambda}-\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
:=𝕀1+𝕀2.\displaystyle:=\mathbb{I}_{1}+\mathbb{I}_{2}~{}. (58)

Taking η=(cμ+1)/(2cμ)\eta=(c_{\mu}+1)/(2c_{\mu}) in Lemma 2 , we have

(Nes)\displaystyle\mathbb{P}\left(N_{e}\geq s\right) 2B0T1cμexp(log(η)2[η(B0T)(1cμη)s])\displaystyle\leq\frac{2\sqrt{B_{0}T}}{1-c_{\mu}}\exp\left(\frac{\log(\eta)}{2}\left[\eta(B_{0}T)-(1-c_{\mu}\eta)s\right]\right)
2B0T1cμexp(log(cμ+12cμ)2[cμ+12cμ(B0T)1cμ2s])\displaystyle\leq\frac{2\sqrt{B_{0}T}}{1-c_{\mu}}\exp\left(\frac{\log\left(\frac{c_{\mu}+1}{2c_{\mu}}\right)}{2}\left[\frac{c_{\mu}+1}{2c_{\mu}}(B_{0}T)-\frac{1-c_{\mu}}{2}s\right]\right)
:=aeexp(ces).\displaystyle:=a_{e}\exp\left(-c_{e}s\right)~{}.

By (53) and (57),

𝕀1\displaystyle\mathbb{I}_{1} (T+1B1)λλ^L𝔼[(Ne+1)𝟙{Nes0}]\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\|\lambda^{\ast}-\hat{\lambda}\|_{L^{\infty}}\mathbb{E}\left[(N_{e}+1)\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right]
(T+1B1)λλ¯L𝔼[(Ne+1)]\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\|\lambda^{\ast}-\overline{\lambda}\|_{L^{\infty}}\mathbb{E}\left[(N_{e}+1)\right]
=(T+1B1)λλ¯L(1+s=1(Nes))\displaystyle=\left(T+\frac{1}{B_{1}}\right)\|\lambda^{\ast}-\overline{\lambda}\|_{L^{\infty}}\left(1+\sum_{s=1}^{\infty}\mathbb{P}(N_{e}\geq s)\right)
(T+1B1)(1+aeexp(ce)1exp(ce))(3𝒞B0Ts+4C0+22Ns+s04C0Tk(k1)(2π)kNμk1).\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)\left(1+\frac{a_{e}\exp(-c_{e})}{1-\exp(-c_{e})}\right)\left(\frac{3\mathcal{C}B_{0}T^{s}+4C_{0}+2}{2N^{s}}+s_{0}\frac{4C_{0}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}}\right). (59)

Since λ^LB0+C0s0\|\hat{\lambda}\|_{L^{\infty}}\leq B_{0}+C_{0}s_{0} and λLB0+C0Ne\|\lambda^{\ast}\|_{L^{\infty}}\leq B_{0}+C_{0}N_{e}, similar to (48), we have

𝕀2\displaystyle\mathbb{I}_{2} 𝔼[(T+NeB1)λ^L𝟙{Ne>s0}]+𝔼[(T+NeB1)λL𝟙{Ne>s0}]\displaystyle\leq\mathbb{E}\left[\left(T+\frac{N_{e}}{B_{1}}\right)\|\hat{\lambda}\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]+\mathbb{E}\left[\left(T+\frac{N_{e}}{B_{1}}\right)\left\|\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
(T+1B1)(B0+C0s0)𝔼[(Ne+1)𝟙{Ne>s0}]+(T+1B1)E[(Ne+1)(B0+C0Ne)𝟙{Ne>s0}]\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)(B_{0}+C_{0}s_{0})\mathbb{E}\left[(N_{e}+1)\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]+\left(T+\frac{1}{B_{1}}\right)E\left[(N_{e}+1)(B_{0}+C_{0}N_{e})\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
(T+1B1)aeexp(ce(s0+1))(2(s0+1)(B0+C0s0)+3C0(s0+1)+2B0(1exp(ce))2).\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)a_{e}\exp(-c_{e}(s_{0}+1))\left(2(s_{0}+1)(B_{0}+C_{0}s_{0})+\frac{3C_{0}(s_{0}+1)+2B_{0}}{(1-\exp(-c_{e}))^{2}}\right). (60)

Combining (58), (59), and (60), we have

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle\quad|\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|
(T+1B1)aeexp(ce(s0+1))(2(s0+1)(B0+C0s0)+3C0(s0+1)+2B0(1exp(ce))2)\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)a_{e}\exp(-c_{e}(s_{0}+1))\left(2(s_{0}+1)(B_{0}+C_{0}s_{0})+\frac{3C_{0}(s_{0}+1)+2B_{0}}{(1-\exp(-c_{e}))^{2}}\right)
+(T+1B1)(1+aeexp(ce)1exp(ce))(3𝒞B0Ts+4C0+22Ns+s04C0Tk(k1)(2π)kNμk1).\displaystyle\quad+\left(T+\frac{1}{B_{1}}\right)\left(1+\frac{a_{e}\exp(-c_{e})}{1-\exp(-c_{e})}\right)\left(\frac{3\mathcal{C}B_{0}T^{s}+4C_{0}+2}{2N^{s}}+s_{0}\frac{4C_{0}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}}\right).

Let s0=slog(N)/ces_{0}=\lceil s\log(N)/c_{e}\rceil and denote λ^N,Nμ=λ^\hat{\lambda}^{N,N_{\mu}}=\hat{\lambda}. We have

|𝔼[loss(λ^N,Nμ,Stest)]𝔼[loss(λ,Stest)]|11cμexp(2B0Tcμ2)(Ts+log2NNs+TklogNNμk1).\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N,N_{\mu}},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{1}{1-c_{\mu}}\exp\left(\frac{2B_{0}T}{c_{\mu}^{2}}\right)\left(\frac{T^{s}+\log^{2}N}{N^{s}}+\frac{T^{k}\log N}{N_{\mu}^{k-1}}\right).

Step 6. Bound the sizes of the network width and weights.

From Steps 1-5, the width of the network is less than

(3𝒩2(𝒩+33))2Nμ+3s2+6N+3𝒩′′+52\displaystyle\left(3\lceil\frac{\mathcal{N}^{\prime}}{2}\rceil\binom{\mathcal{N}^{\prime}+3}{3}\right)2N_{\mu}+3\Big{\lceil}\frac{s}{2}\Big{\rceil}+6N+3\lceil\frac{\mathcal{N}^{\prime\prime}+5}{2}\rceil

where 𝒩=𝒩+log(6(s0+1))+15wNμT\mathcal{N}^{\prime}=\mathcal{N}+\lceil\log(6({s_{0}+1}))\rceil+15w_{N_{\mu}}T, 𝒩′′=𝒩+(s0+3)log2\mathcal{N}^{\prime\prime}=\mathcal{N}+\lceil(s_{0}+3)\log 2\rceil, 𝒩=log((s0+1)NsNμ)\mathcal{N}=\lceil\log((s_{0}+1)N^{s}N_{\mu})\rceil, s0=slog(N)/ces_{0}=\lceil s\log(N)/c_{e}\rceil. Hence

DN+Nμ5log4N.\displaystyle D\lesssim N+N_{\mu}^{5}\log^{4}N~{}.

From the construction of g^l,i𝒩\hat{g}_{l,i}^{\mathcal{N}}, g^0𝒩\hat{g}_{0}^{\mathcal{N}}, ψh\psi_{h}, and λ¯0N\overline{\lambda}_{0}^{N}, the weights of the network are less than

𝒞1max\displaystyle\mathcal{C}_{1}^{\prime}\max {((s0+1)exp(𝒩2+𝒩3Cd𝒩2)(𝒩(𝒩+2))3𝒩(𝒩+2)),\displaystyle\left\{\left(({s_{0}+1})\exp(\frac{{\mathcal{N}^{\prime}}^{2}+\mathcal{N}^{\prime}-3Cd\mathcal{N}^{\prime}}{2})(\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2))^{3\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2)}\right),\right.
((s0+1)exp(𝒩′′2+𝒩′′3Cd𝒩′′2)(𝒩′′(𝒩′′+2))3𝒩′′(𝒩′′+2)),2σ(0)h,\displaystyle\quad\left(({s_{0}+1})\exp(\frac{{\mathcal{N}^{\prime\prime}}^{2}+\mathcal{N}^{\prime\prime}-3Cd\mathcal{N}^{\prime\prime}}{2})(\mathcal{N}^{\prime\prime}(\mathcal{N}^{\prime\prime}+2))^{3\mathcal{N}^{\prime\prime}(\mathcal{N}^{\prime\prime}+2)}\right),\frac{2}{\sigma^{{}^{\prime}}(0)h},
([2s5s(s1)!B0Ts]s/2N(1+s2)/2(s(s+2))3s(s+2))},\displaystyle\quad\left.\left(\left[\frac{\sqrt{2s}5^{s}}{(s-1)!}B_{0}T^{s}\right]^{-s/2}N^{(1+s^{2})/2}(s(s+2))^{3s(s+2)}\right)\right\},

where h=(18(s0+1))2Ns/2Nμ1/2h=(18(s_{0}+1))^{-2}N^{-s/2}N_{\mu}^{-1/2}. Hence the weights of the network are less than

𝒞1(log(NNμ))12s2(log(NNμ))2,\displaystyle\mathcal{C}_{1}(\log(NN_{\mu}))^{12s^{2}(\log(NN_{\mu}))^{2}}~{},

where 𝒞1\mathcal{C}_{1} is a constant related to s,B0,C0,cμs,B_{0},C_{0},c_{\mu}, and TT. ∎

Lemma 9.

Let δj=jk\delta_{j}=\frac{j}{k}, 1jk1\leq j\leq k, δ=(δ1,δ2,,δk)\delta=(\delta_{1},\delta_{2},\cdots,\delta_{k})^{\top} and

Vδ=(111δ1δ2δkδ12δ22δk2δ1k1δ2k1δkk1)\displaystyle V_{\delta}=\begin{pmatrix}1&1&\cdots&1\\ \delta_{1}&\delta_{2}&\cdots&\delta_{k}\\ \delta_{1}^{2}&\delta_{2}^{2}&\cdots&\delta_{k}^{2}\\ \vdots&\vdots&\ddots&\vdots\\ \delta_{1}^{k-1}&\delta_{2}^{k-1}&\cdots&\delta_{k}^{k-1}\end{pmatrix}

then VδV_{\delta} is invertible and Vδ1C8k\|V_{\delta}^{-1}\|_{\infty}\leq C\cdot 8^{k}, where CC is a universal constant.

Proof of Lemma 9.

See Gautschi (1990). ∎
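As a numerical aside (not a proof), the sketch below evaluates the infinity-norm of the inverse Vandermonde matrix for the equispaced nodes δ_j = j/k and checks that its ratio to 8^k stays bounded.

```python
import numpy as np

# Numerical illustration of Lemma 9: ||V_delta^{-1}||_inf grows at most like
# a constant times 8^k for the nodes delta_j = j / k.
for k in range(2, 11):
    delta = np.arange(1, k + 1) / k
    V = np.vander(delta, k, increasing=True).T     # V[i, j] = delta_j ** i
    inv_norm = np.linalg.norm(np.linalg.inv(V), ord=np.inf)
    print(k, inv_norm, inv_norm / 8.0**k)          # the ratio stays bounded
```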

Lemma 10.

For μCk,([0,T],C0)\mu\in C^{k,\infty}([0,T],C_{0}) and δ=(δ1,δ2,,δk)\delta=(\delta_{1},\delta_{2},\cdots,\delta_{k})^{\top} defined in Lemma 9, there exists α=(α1,α2,,αk)\alpha=(\alpha_{1},\alpha_{2},\cdots,\alpha_{k})^{\top} such that

μ~(t):=μ(t)+j=1kαjexp(δjt)\displaystyle\tilde{\mu}(t):=\mu(t)+\sum_{j=1}^{k}\alpha_{j}\exp(-\delta_{j}t) (61)

satisfying α2C0C8k/(1exp(T))\|\alpha\|_{\infty}\leq 2C_{0}C8^{k}/(1-\exp(-T)), μ~Ck,([0,T],C0+kα)\tilde{\mu}\in C^{k,\infty}([0,T],C_{0}+k\|\alpha\|_{\infty}), and μ~(j)(0+)=μ~(j)(T)\tilde{\mu}^{(j)}(0+)=\tilde{\mu}^{(j)}(T-), 0jk10\leq j\leq k-1, where the constant CC is defined in Lemma 9.

Proof of Lemma 10.

We only need to solve the following equations:

μ~(j)(0+)=μ~(j)(T),0jk1.\displaystyle\tilde{\mu}^{(j)}(0+)=\tilde{\mu}^{(j)}(T-),~{}0\leq j\leq k-1. (62)

In matrix form,

(1eδ1T1eδ2T1eδkT(δ1)(1eδ1T)(δ2)(1eδ2T)(δk)(1eδkT)(δ1)k1(1eδ1T)(δ2)k1(1eδ2T)(δk)k1(1eδkT))(α1α2αk)\displaystyle~{}~{}~{}~{}\begin{pmatrix}1-e^{-\delta_{1}T}&1-e^{-\delta_{2}T}&\cdots&1-e^{-\delta_{k}T}\\ (-\delta_{1})(1-e^{-\delta_{1}T})&(-\delta_{2})(1-e^{-\delta_{2}T})&\cdots&(-\delta_{k})(1-e^{-\delta_{k}T})\\ \vdots&\vdots&\ddots&\vdots\\ (-\delta_{1})^{k-1}(1-e^{-\delta_{1}T})&(-\delta_{2})^{k-1}(1-e^{-\delta_{2}T})&\cdots&(-\delta_{k})^{k-1}(1-e^{-\delta_{k}T})\end{pmatrix}\begin{pmatrix}\alpha_{1}\\ \alpha_{2}\\ \vdots\\ \alpha_{k}\end{pmatrix}
=(μ(T)μ(0+)μ(1)(T)μ(1)(0+)μ(k1)(T)μ(k1)(0+)).\displaystyle=\begin{pmatrix}{\mu}(T-)-{\mu}(0+)\\ {\mu}^{(1)}(T-)-{\mu}^{(1)}(0+)\\ \vdots\\ {\mu}^{(k-1)}(T-)-{\mu}^{(k-1)}(0+)\end{pmatrix}. (63)

Rewrite (63) as

DVδΛδα=Δμ,\displaystyle DV_{\delta}\Lambda_{\delta}\alpha=\Delta_{\mu},

where D=diag{1,1,,(1)k1}D=\mathrm{diag}\{1,-1,\cdots,(-1)^{k-1}\}, Λδ=diag{1eδ1T,1eδ2T,,1eδkT}\Lambda_{\delta}=\mathrm{diag}\{1-e^{-\delta_{1}T},1-e^{-\delta_{2}T},\cdots,1-e^{-\delta_{k}T}\}, Δμ=(μ(T)μ(0+),μ(1)(T)μ(1)(0+),,μ(k1)(T)μ(k1)(0+))\Delta_{\mu}=({\mu}(T-)-{\mu}(0+),{\mu}^{(1)}(T-)-{\mu}^{(1)}(0+),\cdots,{\mu}^{(k-1)}(T-)-{\mu}^{(k-1)}(0+))^{\top}, and VδV_{\delta} is defined in Lemma 9. By Lemma 9 and δj=j/k\delta_{j}=j/k, 1jk1\leq j\leq k, the matrix DVδΛδDV_{\delta}\Lambda_{\delta} is invertible and

α\displaystyle\|\alpha\|_{\infty} D1Vδ1Λδ1Δμ\displaystyle\leq\|D^{-1}\|_{\infty}\|V_{\delta}^{-1}\|_{\infty}\|\Lambda_{\delta}^{-1}\|_{\infty}\|\Delta_{\mu}\|_{\infty}
(C8k)1(1exp(T))(2C0)=2C0C8k1exp(T),\displaystyle\leq(C*8^{k})\frac{1}{(1-\exp(-T))}(2C_{0})=\frac{2C_{0}C8^{k}}{1-\exp(-T)},

where the constant CC is defined in Lemma 9. By (61), we have μ~Ck,([0,T],C0+kα)\tilde{\mu}\in C^{k,\infty}([0,T],C_{0}+k\|\alpha\|_{\infty}). ∎
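The following sketch gives a worked instance of Lemma 10 with an assumed toy kernel μ(t) = exp(-t): it solves the linear system (63) for α and verifies that the corrected kernel μ̃ matches its derivatives of order 0, ..., k-1 at 0 and T.

```python
import numpy as np

# Worked instance of Lemma 10 (illustrative, with the assumed kernel exp(-t)):
# solve D * V_delta * Lambda_delta * alpha = Delta_mu, then check the matching
# boundary derivatives of mu_tilde = mu + sum_j alpha_j * exp(-delta_j * t).
k, T = 3, 2.0
delta = np.arange(1, k + 1) / k

def mu_deriv(j, t):                                # j-th derivative of exp(-t)
    return (-1.0) ** j * np.exp(-t)

D = np.diag([(-1.0) ** j for j in range(k)])
V = np.vander(delta, k, increasing=True).T         # V[i, j] = delta_j ** i
Lam = np.diag(1.0 - np.exp(-delta * T))
Delta_mu = np.array([mu_deriv(j, T) - mu_deriv(j, 0.0) for j in range(k)])

alpha = np.linalg.solve(D @ V @ Lam, Delta_mu)

def mu_tilde_deriv(j, t):
    corr = sum(a * (-d) ** j * np.exp(-d * t) for a, d in zip(alpha, delta))
    return mu_deriv(j, t) + corr

for j in range(k):
    print(j, mu_tilde_deriv(j, 0.0), mu_tilde_deriv(j, T))   # the two columns agree
```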

Now we prove Theorem 6. The proof is based on Theorem 5, Theorem 9, and Lemma 10. From Lemma 10, for μCk,([0,T],C0)\mu\in C^{k,\infty}([0,T],C_{0}), there exists α=(α1,α2,,αk)k\alpha=(\alpha_{1},\alpha_{2},\cdots,\alpha_{k})^{\top}\in\mathbb{R}^{k} such that μ~(t):=μ(t)+j=1kαjexp(δjt)\tilde{\mu}(t):=\mu(t)+\sum_{j=1}^{k}\alpha_{j}\exp(-\delta_{j}t) satisfying the boundary condition μ~(j)(0+)=μ~(j)(T)\tilde{\mu}^{(j)}(0+)=\tilde{\mu}^{(j)}(T-), 0jk10\leq j\leq k-1, and we have μ~Ck,([0,T],C0+k2C0C8k1exp(T))\tilde{\mu}\in C^{k,\infty}([0,T],C_{0}+k\frac{2C_{0}C8^{k}}{1-\exp(-T)}). Define ν~(t):=μ(t)μ~(t)=j=1kαjexp(δjt)\tilde{\nu}(t):=\mu(t)-\tilde{\mu}(t)=-\sum_{j=1}^{k}\alpha_{j}\exp(-\delta_{j}t). Denote

λ1(t)\displaystyle\lambda^{\ast}_{1}(t) :=λ0(t)+ti<tμ~(tti),\displaystyle:=\lambda_{0}(t)+\sum_{t_{i}<t}\tilde{\mu}(t-t_{i}),
λ2(t)\displaystyle\lambda^{\ast}_{2}(t) :=ti<tν~(tti)=j=1kti<t(αj)exp(δj(tti)):=j=1kλ2j(t),\displaystyle:=\sum_{t_{i}<t}\tilde{\nu}(t-t_{i})=\sum_{j=1}^{k}\sum_{t_{i}<t}(-\alpha_{j})\exp(-\delta_{j}(t-t_{i})):=\sum_{j=1}^{k}\lambda_{2j}^{\ast}(t),

and then λ(t)=λ1(t)+λ2(t)\lambda^{\ast}(t)=\lambda^{\ast}_{1}(t)+\lambda^{\ast}_{2}(t).

Fix s0+s_{0}\in\mathbb{N}_{+}. By the proof of Theorem 9, under the event {Nes0}\{N_{e}\leq s_{0}\}, there exists an RNN (without the output layer) λ¯1(t)\overline{\lambda}_{1}(t) such that

|λ1(t)λ¯1(t)|3𝒞B0Ts+4C0~+22Ns+s04C0~Tk(k1)(2π)kNμk1,t[0,T],\displaystyle\left|\lambda_{1}^{\ast}(t)-\overline{\lambda}_{1}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}+4\tilde{C_{0}}+2}{2N^{s}}+s_{0}\frac{4\tilde{C_{0}}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}},~{}t\in[0,T],

where C0~=C0+2kC0C8k/(1exp(T))\tilde{C_{0}}=C_{0}+2kC_{0}C8^{k}/(1-\exp(-T)).

By the proof of Theorem 5, under the event {Nes0}\{N_{e}\leq s_{0}\}, for 1jk1\leq j\leq k, there exists an RNN (without the output layer) λ¯2j(t)\overline{\lambda}_{2j}(t) such that

|λ2j(t)λ¯2j(t)|2|αj|Ns4C0C8k(1exp(T))Ns,t[0,T].\displaystyle\left|\lambda_{2j}^{\ast}(t)-\overline{\lambda}_{2j}(t)\right|\leq\frac{2|\alpha_{j}|}{N^{s}}\leq\frac{4C_{0}C8^{k}}{(1-\exp(-T))N^{s}},~{}t\in[0,T].

Let λ¯2(t)=j=1kλ¯2j(t)\overline{\lambda}_{2}(t)=\sum_{j=1}^{k}\overline{\lambda}_{2j}(t). We have

|λ2(t)λ¯2(t)|2(C~0C0)Ns,t[0,T].\displaystyle\left|\lambda_{2}^{\ast}(t)-\overline{\lambda}_{2}(t)\right|\leq\frac{2(\tilde{C}_{0}-C_{0})}{N^{s}},t\in[0,T].

Let λ¯(t)=λ¯1(t)+λ¯2(t)\overline{\lambda}(t)=\overline{\lambda}_{1}(t)+\overline{\lambda}_{2}(t); then

|λ(t)λ¯(t)||λ1(t)λ¯1(t)|+|λ2(t)λ¯2(t)|3𝒞B0Ts+8C0~+22Ns+s04C0~Tk(k1)(2π)kNμk1.\displaystyle\left|\lambda^{\ast}(t)-\overline{\lambda}(t)\right|\leq\left|\lambda_{1}^{\ast}(t)-\overline{\lambda}_{1}(t)\right|+\left|\lambda_{2}^{\ast}(t)-\overline{\lambda}_{2}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}+8\tilde{C_{0}}+2}{2N^{s}}+s_{0}\frac{4\tilde{C_{0}}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}}.

Under the event {Nes0}\{N_{e}\leq s_{0}\}, B1λB0+C0s0B_{1}\leq\lambda^{\ast}\leq B_{0}+C_{0}s_{0}. Hence we can take lf=B1l_{f}=B_{1} and uf=B0+C0s0u_{f}=B_{0}+C_{0}s_{0} and denote λ^(t)=f(λ¯(t))\hat{\lambda}(t)=f(\overline{\lambda}(t)). Then λλ^λλ¯\|\lambda^{\ast}-\hat{\lambda}\|_{\infty}\leq\|\lambda^{\ast}-\overline{\lambda}\|_{\infty}. By similar arguments in Theorem 9, we have

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle\quad|\mathbb{E}[{\text{loss}}(\hat{\lambda},S_{test})]-\mathbb{E}[{\text{loss}}(\lambda^{\ast},S_{test})]|
(T+1B1)aeexp(ce(s0+1))(2(s0+1)(B0+C0s0)+3C0(s0+1)+2B0(1exp(ce))2)\displaystyle\leq\left(T+\frac{1}{B_{1}}\right)a_{e}\exp(-c_{e}(s_{0}+1))\left(2(s_{0}+1)(B_{0}+C_{0}s_{0})+\frac{3C_{0}(s_{0}+1)+2B_{0}}{(1-\exp(-c_{e}))^{2}}\right)
+(T+1B1)(1+aeexp(ce)1exp(ce))(3𝒞B0Ts+8C~02Ns+s04C~0Tk(k1)(2π)kNμk1).\displaystyle\quad+\left(T+\frac{1}{B_{1}}\right)\left(1+\frac{a_{e}\exp(-c_{e})}{1-\exp(-c_{e})}\right)\left(\frac{3\mathcal{C}B_{0}T^{s}+8\tilde{C}_{0}}{2N^{s}}+s_{0}\frac{4\tilde{C}_{0}T^{k}}{(k-1)(2\pi)^{k}N_{\mu}^{k-1}}\right)~{}.

Let s0=slog(N)/ces_{0}=\lceil s\log(N)/c_{e}\rceil and denote λ^N,Nμ=λ^\hat{\lambda}^{N,N_{\mu}}=\hat{\lambda}. We have

|𝔼[loss(λ^N,Nμ,Stest)]𝔼[loss(λ,Stest)]|log2NNs+logNNμk1.\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N,N_{\mu}},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{\log^{2}N}{N^{s}}+\frac{\log N}{N_{\mu}^{k-1}}~{}.

The bounds on the width and on the weights can also be obtained similarly to the proof of Theorem 9.

9.4 Proof of Theorem 7

Denote λ1(t)=λ0(t)+ti<tαexp(β(tti))\lambda_{1}^{*}(t)=\lambda_{0}(t)+\sum_{t_{i}<t}\alpha\exp(-\beta(t-t_{i})). Then λ(t)=Ψ(λ1(t))\lambda^{*}(t)=\Psi\left(\lambda_{1}^{*}(t)\right). Fix s0+s_{0}\in\mathbb{N}_{+}. From the proof of Theorem 5, under the event {Nes0}\{N_{e}\leq s_{0}\}, there exists a two-layer recurrent neural network λ¯1(t)\overline{\lambda}_{1}(t) of the form (43) such that

|λ¯1(t)λ1(t)|3𝒞B0Ts+22N1s,t[0,T].\displaystyle\left|\overline{\lambda}_{1}(t)-\lambda_{1}^{*}(t)\right|\leq\frac{3\mathcal{C}B_{0}T^{s}+2}{2N_{1}^{s}},~{}\forall t\in[0,T]. (64)

Moreover, the width of λ¯1(t)\overline{\lambda}_{1}(t) satisfies DN1D\lesssim N_{1} and the weights of λ¯1(t)\overline{\lambda}_{1}(t) are bounded by

O((logN1)12s2(logN1)2).\displaystyle O\left((\log N_{1})^{12s^{2}(\log N_{1})^{2}}\right)~{}.

Under the event {Nes0}\{N_{e}\leq s_{0}\}, the function λ1(t)\lambda_{1}^{*}(t) satisfies 0λ1B0+αs00\leq\lambda_{1}^{*}\leq B_{0}+\alpha s_{0}. Using (64) and taking (3𝒞B0Ts+2)/2N1s1(3\mathcal{C}B_{0}T^{s}+2)/2N_{1}^{s}\leq 1, we have λ¯1[1,B0+αs0+1]\overline{\lambda}_{1}\in[-1,B_{0}+\alpha s_{0}+1]. Hence we need to construct an approximation of Ψ\Psi on [1,B0+αs0+1][-1,B_{0}+\alpha s_{0}+1]. Let Ψ~(x)=Ψ(ρx1)\tilde{\Psi}(x)=\Psi(\rho x-1), where ρ=B0+αs0+2\rho={B_{0}+\alpha s_{0}+2}. Then Ψ(x)=Ψ~((x+1)/ρ)\Psi(x)=\tilde{\Psi}((x+1)/\rho).

Since Ψ\Psi is LL-Lipschitz, Ψ~\tilde{\Psi} is defined on [0,1][0,1] and is ρL\rho L-Lipschitz. By Corollary 5.4 of De Ryck et al. (2021), there exists a tanh neural network Ψ~N2\tilde{\Psi}^{N_{2}} with 2 hidden layers such that

Ψ~Ψ~N2L[0,1]7(ρLB~0)N2.\displaystyle\left\|\tilde{\Psi}-\tilde{\Psi}^{N_{2}}\right\|_{L^{\infty}[0,1]}\leq\frac{7(\rho L\vee\tilde{B}_{0})}{N_{2}}.

Let ΨN2(x)=Ψ~N2((x+1)/ρ)\Psi^{N_{2}}(x)=\tilde{\Psi}^{N_{2}}((x+1)/\rho). Then

|Ψ(x)ΨN2(x)|7(ρLB~0)N2,x[1,B0+αs0+1].\displaystyle\left|\Psi(x)-\Psi^{N_{2}}(x)\right|\leq\frac{7(\rho L\vee\tilde{B}_{0})}{N_{2}},~{}x\in[-1,B_{0}+\alpha s_{0}+1].

Then under the event {Nes0}\{N_{e}\leq s_{0}\}, we have

|Ψ(λ1(t))ΨN2(λ¯1(t))|\displaystyle\left|\Psi(\lambda_{1}^{*}(t))-\Psi^{N_{2}}(\overline{\lambda}_{1}(t))\right| |Ψ(λ1(t))Ψ(λ¯1(t))|+|Ψ(λ¯1(t))ΨN2(λ¯1(t))|\displaystyle\leq\left|\Psi(\lambda_{1}^{*}(t))-\Psi(\overline{\lambda}_{1}(t))\right|+\left|\Psi(\overline{\lambda}_{1}(t))-\Psi^{N_{2}}(\overline{\lambda}_{1}(t))\right|
L|λ¯1(t)λ1(t)|+ΨΨN2L\displaystyle\leq L\left|\overline{\lambda}_{1}(t)-\lambda_{1}^{*}(t)\right|+\left\|\Psi-\Psi^{N_{2}}\right\|_{L^{\infty}}
L3𝒞B0Ts+22N1s+7(ρLB~0)N2.\displaystyle\leq L\frac{3\mathcal{C}B_{0}T^{s}+2}{2N_{1}^{s}}+\frac{7(\rho L\vee\tilde{B}_{0})}{N_{2}}. (65)

Recall that f(x)=min{max{x,lf},uf}f(x)=\min\{\max\{x,l_{f}\},u_{f}\}. Since B~1ΨB~0\tilde{B}_{1}\leq\Psi\leq\tilde{B}_{0}, we can take lf=B~1l_{f}=\tilde{B}_{1} and uf=B~0u_{f}=\tilde{B}_{0}. Define λ^(t)=f(ΨN2(λ¯1(t)))\hat{\lambda}(t)=f\left(\Psi^{N_{2}}(\overline{\lambda}_{1}(t))\right). We have

|λ(t)λ^(t)||Ψ(λ1(t))ΨN2(λ¯1(t))|,t[0,T].\displaystyle\left|\lambda^{*}(t)-\hat{\lambda}(t)\right|\leq\left|\Psi(\lambda_{1}^{*}(t))-\Psi^{N_{2}}(\overline{\lambda}_{1}(t))\right|,~{}\forall t\in[0,T]. (66)

Similar to (46), we have

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle\quad|\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|
𝔼|loss(λ^,Stest)loss(λ,Stest)|\displaystyle\leq\mathbb{E}\left|\text{loss}(\hat{\lambda},S_{test})-\text{loss}(\lambda^{\ast},S_{test})\right|
𝔼(|i=1Ne(logλ^(ti)logλ(ti))|+|0T(λ^(t)λ(t))dt|)\displaystyle\leq\mathbb{E}\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\hat{\lambda}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\hat{\lambda}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)
𝔼[(|i=1Ne(logλ^(ti)logλ(ti))|+|0T(λ^(t)λ(t))dt|)𝟙{Nes0}\displaystyle\leq\mathbb{E}\left[\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\hat{\lambda}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\hat{\lambda}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right.
+(|i=1Ne(logλ^(ti)logλ(ti))|+|0T(λ^(t)λ(t))dt|)𝟙{Ne>s0}]\displaystyle\quad\quad+\left.\left(\Big{|}\sum_{i=1}^{N_{e}}(\log\hat{\lambda}(t_{i})-\log\lambda^{\ast}(t_{i}))\Big{|}+\Big{|}\int_{0}^{T}\left(\hat{\lambda}(t)-\lambda^{\ast}(t)\right)\mathrm{dt}\Big{|}\right)\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
𝔼[(T+NeB1~)λ^λL𝟙{Nes0}]+𝔼[(T+NeB1~)λ^λL𝟙{Ne>s0}]\displaystyle\leq\mathbb{E}\left[(T+\frac{N_{e}}{\tilde{B_{1}}})\left\|\hat{\lambda}-\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right]+\mathbb{E}\left[(T+\frac{N_{e}}{\tilde{B_{1}}})\left\|\hat{\lambda}-\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
:=𝕀1+𝕀2.\displaystyle:=\mathbb{I}_{1}+\mathbb{I}_{2}~{}. (67)

Since ΨB0~\Psi\leq\tilde{B_{0}}, arguing as before and taking η=e\eta=e in Lemma 2, we have

(Nes)2B0~Texp(eB0~Ts2),\displaystyle\mathbb{P}(N_{e}\geq s)\leq 2\sqrt{\tilde{B_{0}}T}\exp\left(\frac{e\tilde{B_{0}}T-s}{2}\right),

and similar to (36), we have

𝔼(Ne+1)1+s=1(Nes)1+2B0~T1exp(1/2)exp(eB0~T12)5B0~T+1exp(3B0~T2).\displaystyle\mathbb{E}(N_{e}+1)\leq 1+\sum_{s=1}^{\infty}\mathbb{P}(N_{e}\geq s)\leq 1+\frac{2\sqrt{\tilde{B_{0}}T}}{1-\exp(-1/2)}\exp\left(\frac{e\tilde{B_{0}}T-1}{2}\right)\leq{5\sqrt{\tilde{B_{0}}T+1}}\exp\left(\frac{3\tilde{B_{0}}T}{2}\right). (68)

By (65), (66), and (68),

𝕀1\displaystyle\mathbb{I}_{1} (T+1B1~)λλ^L𝔼[(Ne+1)𝟙{Nes0}]\displaystyle\leq\left(T+\frac{1}{\tilde{B_{1}}}\right)\|\lambda^{\ast}-\hat{\lambda}\|_{L^{\infty}}\mathbb{E}\left[(N_{e}+1)\mathbbm{1}_{\{N_{e}\leq s_{0}\}}\right]
(T+1B1~)λλ^L𝔼[(Ne+1)]\displaystyle\leq\left(T+\frac{1}{\tilde{B_{1}}}\right)\|\lambda^{\ast}-\hat{\lambda}\|_{L^{\infty}}\mathbb{E}\left[(N_{e}+1)\right]
(T+1B1~)5B0~T+1exp(3B0~T2)(L3𝒞B0Ts+22N1s+7(ρLB~0)N2)\displaystyle\leq\left(T+\frac{1}{\tilde{B_{1}}}\right){5\sqrt{\tilde{B_{0}}T+1}}\exp\left(\frac{3\tilde{B_{0}}T}{2}\right)\left(L\frac{3\mathcal{C}B_{0}T^{s}+2}{2N_{1}^{s}}+\frac{7(\rho L\vee\tilde{B}_{0})}{N_{2}}\right)
5Lexp(2B0~T)(T+1B1~)(3𝒞B0Ts+22N1s+7(ρ(B~0/L))N2).\displaystyle\leq 5L\exp\left({2\tilde{B_{0}}T}\right)\left(T+\frac{1}{\tilde{B_{1}}}\right)\left(\frac{3\mathcal{C}B_{0}T^{s}+2}{2N_{1}^{s}}+\frac{7(\rho\vee(\tilde{B}_{0}/L))}{N_{2}}\right). (69)

On the other hand, since λ^LB0~\|\hat{\lambda}\|_{L^{\infty}}\leq\tilde{B_{0}} and λLB0~\|\lambda^{\ast}\|_{L^{\infty}}\leq\tilde{B_{0}}, we have

𝕀2\displaystyle\mathbb{I}_{2} 𝔼[(T+NeB1~)λ^L𝟙{Ne>s0}]+𝔼[(T+NeB1~)λL𝟙{Ne>s0}]\displaystyle\leq\mathbb{E}\left[\left(T+\frac{N_{e}}{\tilde{B_{1}}}\right)\|\hat{\lambda}\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]+\mathbb{E}\left[\left(T+\frac{N_{e}}{\tilde{B_{1}}}\right)\left\|\lambda^{\ast}\right\|_{L^{\infty}}\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
2(T+1B1~)B0~𝔼[(Ne+1)𝟙{Ne>s0}]\displaystyle\leq 2\left(T+\frac{1}{\tilde{B_{1}}}\right)\tilde{B_{0}}\mathbb{E}\left[(N_{e}+1)\mathbbm{1}_{\{N_{e}>s_{0}\}}\right]
2(T+1B1~)B0~((s0+1)(Nes0+1)+s=s0+1(Nes))\displaystyle\leq 2\left(T+\frac{1}{\tilde{B_{1}}}\right)\tilde{B_{0}}\left((s_{0}+1)\mathbb{P}(N_{e}\geq s_{0}+1)+\sum_{s=s_{0}+1}^{\infty}\mathbb{P}(N_{e}\geq s)\right)
4(T+1B1~)B0~B0~Texp(eB0~T(s0+1)2)((s0+1)+11e12)\displaystyle\leq 4\left(T+\frac{1}{\tilde{B_{1}}}\right)\tilde{B_{0}}\sqrt{\tilde{B_{0}}T}\exp\left(\frac{e\tilde{B_{0}}T-(s_{0}+1)}{2}\right)\left((s_{0}+1)+\frac{1}{1-e^{-\frac{1}{2}}}\right)
4(T+1B1~)B0~B0~Texp(3B0~T(s0+1)2)(s0+4)\displaystyle\leq 4\left(T+\frac{1}{\tilde{B_{1}}}\right)\tilde{B_{0}}\sqrt{\tilde{B_{0}}T}\exp\left(\frac{3\tilde{B_{0}}T-(s_{0}+1)}{2}\right)\left(s_{0}+4\right)
4(T+1B1~)B0~exp(2B0~T)(s0+4)exp(s0+12).\displaystyle\leq 4\left(T+\frac{1}{\tilde{B_{1}}}\right)\tilde{B_{0}}\exp\left({2\tilde{B_{0}}T}\right)\left(s_{0}+4\right)\exp\left(-\frac{s_{0}+1}{2}\right). (70)

Combining (67), (69), and (70), we have

|𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]|\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]| 5Lexp(2B0~T)(T+1B1~)(3𝒞B0Ts+22N1s+7(ρ(B~0/L))N2)\displaystyle\leq 5L\exp\left({2\tilde{B_{0}}T}\right)\left(T+\frac{1}{\tilde{B_{1}}}\right)\left(\frac{3\mathcal{C}B_{0}T^{s}+2}{2N_{1}^{s}}+\frac{7(\rho\vee(\tilde{B}_{0}/L))}{N_{2}}\right)
+4(T+1B1~)B0~exp(2B0~T)(s0+4)exp(s0+12).\displaystyle\quad+4\left(T+\frac{1}{\tilde{B_{1}}}\right)\tilde{B_{0}}\exp\left({2\tilde{B_{0}}T}\right)\left(s_{0}+4\right)\exp\left(-\frac{s_{0}+1}{2}\right).

Let s0=2logNs_{0}=\lceil 2\log N\rceil, N1=N2=NN_{1}=N_{2}=N and denote λ^N=λ^\hat{\lambda}^{N}=\hat{\lambda}. We have

|𝔼[loss(λ^N,Stest)]𝔼[loss(λ,Stest)]|logNN.\displaystyle|\mathbb{E}[\text{loss}(\hat{\lambda}^{N},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]|\lesssim\frac{\log N}{N}.

Similar to the proof of Theorem 5, we can bound the width of the network by

max{3𝒩~2(𝒩~+22)+3s2+6N+2,6N},\displaystyle\max\left\{3\left\lceil\frac{\tilde{\mathcal{N}}}{2}\right\rceil\binom{\tilde{\mathcal{N}}+2}{2}+3\left\lceil\frac{s}{2}\right\rceil+6N+2,6N\right\},

where 𝒩~=slog(N)+10(δT1)+2log(3(s0+1))\tilde{\mathcal{N}}=\lceil s\log(N)\rceil+10(\delta T\vee 1)+2\lceil\log(3(s_{0}+1))\rceil. Hence we have DND\lesssim N.

Moreover, from the construction of λ^\hat{\lambda}, the weights of the network is less than

𝒞1max{(log(N))12s2(log(N))2,NρρL},\displaystyle\mathcal{C}_{1}^{\prime}\max\left\{(\log(N))^{12s^{2}(\log(N))^{2}},\frac{N}{\rho\sqrt{\rho L}}\right\},

where ρ=B0+αs0+2=B0+α2logN+2\rho=B_{0}+\alpha s_{0}+2=B_{0}+\alpha\lceil 2\log N\rceil+2, 𝒞1\mathcal{C}_{1}^{\prime} is a constant related to s,B0,α,δ,T,B~0s,B_{0},\alpha,\delta,T,\tilde{B}_{0}, and LL. Then the weights of the network can be bounded by

𝒞1(log(N))12s2(log(N))2,\displaystyle\mathcal{C}_{1}(\log(N))^{12s^{2}(\log(N))^{2}},

where 𝒞1\mathcal{C}_{1} is a constant related to s,B0,α,δ,T,B~0s,B_{0},\alpha,\delta,T,\tilde{B}_{0}, and LL.

9.5 Proof of Theorem 8

Without loss of generality, we denote t1=T/3t_{1}=T/3, t2=2T/3t_{2}=2T/3 for simplicity. Since the compensator of N(t)N(t) is Λ(t)=0tλ(s)ds\Lambda(t)=\int_{0}^{t}\lambda^{\ast}(s)\mathrm{d}s, for a predictable stochastic process λ(t),t[0,T]\lambda(t),t\in[0,T], we have

𝔼[loss(λ,Stest)]\displaystyle\mathbb{E}[\text{loss}(\lambda,S_{test})] =𝔼[ti<Tlogλ(ti)+0Tλ(t)dt]\displaystyle=\mathbb{E}\left[-\sum_{t_{i}<T}\log\lambda(t_{i})+\int_{0}^{T}\lambda(t)\mathrm{d}t\right]
=𝔼[0Tlogλ(t)dN(t)+0Tλ(t)dt]\displaystyle=\mathbb{E}\left[-\int_{0}^{T}\log\lambda(t)\mathrm{d}N(t)+\int_{0}^{T}\lambda(t)\mathrm{d}t\right]
=𝔼[0T(λ(t)logλ(t)λ(t))dt].\displaystyle=\mathbb{E}\left[\int_{0}^{T}\left(\lambda(t)-\log\lambda(t)\cdot\lambda^{\ast}(t)\right)\mathrm{d}t\right].

Since both λ\lambda^{\ast} and λ^ne\hat{\lambda}_{ne} are predictable, we have

𝔼[loss(λ^ne,Stest)]𝔼[loss(λ,Stest)]\displaystyle\mathbb{E}[\text{loss}(\hat{\lambda}_{ne},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]
=\displaystyle= 𝔼[0T(λ^ne(t)logλ^ne(t)λ(t))dt]𝔼[0T(λ(t)logλ(t)λ(t))dt]\mathbb{E}\left[\int_{0}^{T}\left(\hat{\lambda}_{ne}(t)-\log\hat{\lambda}_{ne}(t)\cdot\lambda^{\ast}(t)\right)\mathrm{d}t\right]-\mathbb{E}\left[\int_{0}^{T}\left(\lambda^{\ast}(t)-\log\lambda^{\ast}(t)\cdot\lambda^{\ast}(t)\right)\mathrm{d}t\right]
:=\displaystyle:= 𝔼[0T(g(λ^ne(t),λ(t))g(λ(t),λ(t)))dt],\displaystyle\mathbb{E}\left[\int_{0}^{T}\left(g(\hat{\lambda}_{ne}(t),\lambda^{\ast}(t))-g(\lambda^{\ast}(t),\lambda^{\ast}(t))\right)\mathrm{d}t\right],

where g(x,y)=x-y\log x\geq y-y\log y=g(y,y) for all x,y>0, with equality if and only if x=y. Thus

𝔼[loss(λ^ne,Stest)]𝔼[loss(λ,Stest)]=𝔼[0T(g(λ^ne(t),λ(t))g(λ(t),λ(t)))dt]0.\displaystyle\mathbb{E}[\text{loss}(\hat{\lambda}_{ne},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]=\mathbb{E}\left[\int_{0}^{T}\left(g(\hat{\lambda}_{ne}(t),\lambda^{\ast}(t))-g(\lambda^{\ast}(t),\lambda^{\ast}(t))\right)\mathrm{d}t\right]\geq 0.
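The elementary inequality g(x,y)\geq g(y,y) holds because, for fixed y>0, the map x\mapsto x-y\log x is strictly convex with derivative 1-y/x, which vanishes only at x=y. A minimal numerical check of this fact (Python, illustrative only):

```python
import numpy as np

def g(x, y):
    # g(x, y) = x - y * log(x), as in the proof
    return x - y * np.log(x)

rng = np.random.default_rng(0)
x = np.linspace(1e-3, 20.0, 200_001)
for y in rng.uniform(0.1, 10.0, size=5):
    gap = g(x, y) - g(y, y)
    assert gap.min() >= -1e-9                  # g(x, y) >= g(y, y), up to floating-point error
    assert abs(x[gap.argmin()] - y) < 1e-3     # the minimizer is (numerically) x = y
print("checked: for each y > 0, x -> g(x, y) is minimized at x = y")
```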

Denote \mathcal{E}=\{\text{there is no event in }[0,2T/3]\}; note that \mathbb{P}(\mathcal{E})>0. Denote I_{0}=[T/3,2T/3]. Since the integrand above is nonnegative, restricting to I_{0} and to the event \mathcal{E} yields

𝔼[loss(λ^ne,Stest)]𝔼[loss(λ,Stest)]𝔼[(I0(g(λ^ne(t),λ(t))g(λ(t),λ(t)))dt)𝟙].\displaystyle\mathbb{E}[\text{loss}(\hat{\lambda}_{ne},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\geq\mathbb{E}\left[\left(\int_{I_{0}}\left(g(\hat{\lambda}_{ne}(t),\lambda^{\ast}(t))-g(\lambda^{\ast}(t),\lambda^{\ast}(t))\right)\mathrm{d}t\right)\mathbbm{1}_{\mathcal{E}}\right]. (71)

Under the event \mathcal{E},

\hat{\lambda}_{ne}(t)=f(\alpha t+b)=\begin{cases}T,&\alpha t+b<T,\\ \alpha t+b,&T\leq\alpha t+b\leq 4T,\\ 4T,&\alpha t+b>4T,\end{cases}\qquad t\in I_{0},

and

𝔼[(I0(g(λ^ne(t),λ(t))g(λ(t),λ(t)))dt)𝟙]\displaystyle\mathbb{E}\left[\left(\int_{I_{0}}\left(g\left(\hat{\lambda}_{ne}(t),\lambda^{\ast}(t)\right)-g\left(\lambda^{\ast}(t),\lambda^{\ast}(t)\right)\right)\mathrm{d}t\right)\mathbbm{1}_{\mathcal{E}}\right]
=\displaystyle= [I0(g(f(αt+b),9Tt2)g(9Tt2,9Tt2))dt]()\displaystyle\left[\int_{I_{0}}\left(g\left(f(\alpha t+b),\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t\right]\mathbb{P}(\mathcal{E})
:=F(\alpha,b)\mathbb{P}(\mathcal{E}). (72)

Then we only need to show

infα,bF(α,b)>0.\displaystyle\inf_{\alpha\in\mathbb{R},b\in\mathbb{R}}F(\alpha,b)>0.

Case 1. |α|>18|\alpha|>18, bb\in\mathbb{R}.

Since \{t:T\leq\alpha t+b\leq 4T\} is an interval of length at most 3T/|\alpha|<T/6, it cannot intersect both I_{1}:=[T/3,5T/12] and I_{2}:=[7T/12,2T/3]; hence, by continuity of t\mapsto\alpha t+b, \hat{\lambda}_{ne} is identically T or identically 4T on at least one of I_{1} and I_{2}. Since g(x,y)\geq g(y,y) for all x,y>0,

inf|α|>18,bF(α,b)min\displaystyle\inf_{|\alpha|>18,b\in\mathbb{R}}F(\alpha,b)\geq\min {I1(g(T,9Tt2)g(9Tt2,9Tt2))dt,\displaystyle\left\{\int_{I_{1}}\left(g\left(T,\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t,\right.
I2(g(T,9Tt2)g(9Tt2,9Tt2))dt,\displaystyle\int_{I_{2}}\left(g\left(T,\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t,
I1(g(4T,9Tt2)g(9Tt2,9Tt2))dt,\displaystyle\int_{I_{1}}\left(g\left(4T,\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t,
I2(g(4T,9Tt2)g(9Tt2,9Tt2))dt}\displaystyle\left.\int_{I_{2}}\left(g\left(4T,\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t\right\}
:=C1\displaystyle:=C_{1} >0.\displaystyle>0. (73)

Case 2. |α|18|\alpha|\leq 18, |b|>16T|b|>16T.

In this case, |\alpha t|\leq 12T for all t\in I_{0}, so \alpha t+b>4T on I_{0} when b>16T and \alpha t+b<T on I_{0} when b<-16T. Hence \{t:T\leq\alpha t+b\leq 4T\}\cap I_{0}=\emptyset and \hat{\lambda}_{ne} is identically T or identically 4T on I_{0}. Therefore

inf|α|18,|b|>16TF(α,b)min\displaystyle\inf_{|\alpha|\leq 18,|b|>16T}F(\alpha,b)\geq\min {I0(g(T,9Tt2)g(9Tt2,9Tt2))dt,\displaystyle\left\{\int_{I_{0}}\left(g\left(T,\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t,\right.
I0(g(4T,9Tt2)g(9Tt2,9Tt2))dt}\displaystyle\left.\int_{I_{0}}\left(g\left(4T,\frac{9}{T}t^{2}\right)-g\left(\frac{9}{T}t^{2},\frac{9}{T}t^{2}\right)\right)\mathrm{d}t\right\}
:=C2\displaystyle:=C_{2} >0.\displaystyle>0. (74)

Case 3. |α|18|\alpha|\leq 18, |b|16T|b|\leq 16T.

By (72), F is continuous with respect to (\alpha,b). For each fixed (\alpha,b), the piecewise linear function t\mapsto f(\alpha t+b) cannot coincide with the strictly convex function \frac{9}{T}t^{2} on all of I_{0}, so F(\alpha,b)>0. Since \{|\alpha|\leq 18,|b|\leq 16T\} is a compact set in \mathbb{R}^{2} and F is continuous, there exists C_{3}>0 such that

inf|α|18,|b|16TF(α,b)C3>0.\displaystyle\inf_{|\alpha|\leq 18,|b|\leq 16T}F(\alpha,b)\geq C_{3}>0. (75)

By (71), (72), (73), (74), and (75),

\mathbb{E}[\text{loss}(\hat{\lambda}_{ne},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})]\geq\min\{C_{1},C_{2},C_{3}\}\mathbb{P}(\mathcal{E}):=C>0.

Hence Theorem 8 is proved.
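As an illustrative sanity check (not a substitute for the compactness argument in Case 3), one can evaluate F(\alpha,b) on a grid over the region \{|\alpha|\leq 18,|b|\leq 16T\} and confirm that its minimum is strictly positive; the horizon T=1 and the grid resolutions below are arbitrary choices, and a Riemann sum stands in for the integral over I_{0}.

```python
import numpy as np

T = 1.0                                      # arbitrary horizon for illustration
t = np.linspace(T / 3, 2 * T / 3, 1001)      # discretization of I_0 = [T/3, 2T/3]
dt = t[1] - t[0]
lam_star = 9 * t**2 / T                      # true intensity 9 t^2 / T on I_0

def g(x, y):
    return x - y * np.log(x)

def F(alpha, b):
    lam_hat = np.clip(alpha * t + b, T, 4 * T)   # f clips alpha * t + b to [T, 4T]
    return np.sum(g(lam_hat, lam_star) - g(lam_star, lam_star)) * dt

vals = [F(a, b) for a in np.linspace(-18, 18, 73)
                for b in np.linspace(-16 * T, 16 * T, 129)]
print("minimum of F over the grid:", min(vals))   # strictly positive
```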

Remark 13.

Note that in the proof of Theorem 8 we have shown that the excess risk

𝔼[loss(λ^,Stest)]𝔼[loss(λ,Stest)]\displaystyle\mathbb{E}[\text{loss}(\hat{\lambda},S_{test})]-\mathbb{E}[\text{loss}(\lambda^{\ast},S_{test})] (76)

is nonnegative and is strictly positive whenever \hat{\lambda}\neq\lambda^{\ast}. Thus (76) is a well-defined excess risk.

10 Supporting Lemmas

Lemma 11.

(Lemma 8 in Chen et al. (2020))   Let 𝒢={Ad1×d2:A2λ}\mathcal{G}=\{A\in\mathbb{R}^{d_{1}\times d_{2}}:\|A\|_{2}\leq\lambda\} be the set of matrices with bounded spectral norm and ϵ>0\epsilon>0 be given. The covering number 𝒩(𝒢,ϵ,F)\mathcal{N}(\mathcal{G},\epsilon,\|\cdot\|_{F}) is bounded above by

𝒩(𝒢,ϵ,F)(1+(d1d2)λϵ)d1d2.\displaystyle\mathcal{N}(\mathcal{G},\epsilon,\|\cdot\|_{F})\leq\left(1+\frac{(\sqrt{d_{1}}\wedge\sqrt{d_{2}})\lambda}{\epsilon}\right)^{d_{1}d_{2}}.
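For a quick sense of scale (the values of d_{1}, d_{2}, \lambda, and \epsilon below are arbitrary illustrative choices), the metric-entropy bound implied by Lemma 11, namely \log\mathcal{N}\leq d_{1}d_{2}\log\left(1+(\sqrt{d_{1}}\wedge\sqrt{d_{2}})\lambda/\epsilon\right), can be evaluated directly:

```python
from math import log, sqrt

def log_covering_bound(d1, d2, lam, eps):
    """Upper bound on log N(G, eps, ||.||_F) from Lemma 11 (illustrative)."""
    return d1 * d2 * log(1 + min(sqrt(d1), sqrt(d2)) * lam / eps)

# e.g. 64 x 64 weight matrices with spectral norm at most 5, at resolution 0.01
print(log_covering_bound(64, 64, 5.0, 0.01))
```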

The following definition and lemma provide a bridge between the covering number and a high-probability bound on the supremum of a sub-gaussian process.

Definition 1.

A stochastic process {Xh}hH\{X_{h}\}_{h\in H} is called a sub-gaussian process for metric d(,)d(\cdot,\cdot) on HH if

𝔼[exp(λ(Xh1Xh2))]exp(λ2d(h1,h2)22) for λ,h1,h2H.\mathbb{E}\left[\exp\left(\lambda\left(X_{h_{1}}-X_{h_{2}}\right)\right)\right]\leq\exp\left(\frac{\lambda^{2}d(h_{1},h_{2})^{2}}{2}\right)~{}\text{ for }\lambda\in\mathbb{R},~{}h_{1},h_{2}\in H.

A stochastic process {Xh}hH\{X_{h}\}_{h\in H} is called a centered sub-gaussian process for metric d(,)d(\cdot,\cdot) on HH if {Xh}hH\{X_{h}\}_{h\in H} is a sub-gaussian process for metric d(,)d(\cdot,\cdot) and 𝔼[Xh]=0,hH\mathbb{E}[X_{h}]=0,~{}\forall h\in H.
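For instance, a centered Gaussian process is a centered sub-gaussian process for its canonical metric: since X_{h_{1}}-X_{h_{2}} is a centered Gaussian random variable,

\mathbb{E}\left[\exp\left(\lambda\left(X_{h_{1}}-X_{h_{2}}\right)\right)\right]=\exp\left(\frac{\lambda^{2}\operatorname{Var}\left(X_{h_{1}}-X_{h_{2}}\right)}{2}\right)\quad\text{ for }\lambda\in\mathbb{R},

so Definition 1 is satisfied with d(h_{1},h_{2})=\sqrt{\operatorname{Var}\left(X_{h_{1}}-X_{h_{2}}\right)}.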

Lemma 12.

Suppose {Xh}hH\{X_{h}\}_{h\in H} is a centered sub-gaussian process for metric Kd(,)K\cdot d(\cdot,\cdot) on metric space HH, where the diameter of HH is finite, i.e. diam(H)=suph1,h2Hd(h1,h2)<+\operatorname{diam}(H)=\sup_{h_{1},h_{2}\in H}d(h_{1},h_{2})<+\infty. Then with probability at least 1δ1-\delta, for any fixed h0Hh_{0}\in H, we have

suphH|XhXh0|6K(8diam(H)log(2δ)+k=κ2klog𝒩(H,d,2k))\displaystyle\sup_{h\in H}|X_{h}-X_{h_{0}}|\leq 6K\left(8\operatorname{diam}(H)\sqrt{\log\left(\frac{2}{\delta}\right)}+\sum_{k=-\kappa}^{\infty}2^{-k}\sqrt{\log\mathcal{N}\left(H,d,2^{-k}\right)}\right)

and

suphH|XhXh0|12K(4diam(H)log(2δ)+02diam(H)log𝒩(H,d,ϵ)dϵ),\displaystyle\sup_{h\in H}|X_{h}-X_{h_{0}}|\leq 12K\left(4\operatorname{diam}(H)\sqrt{\log\left(\frac{2}{\delta}\right)}+\int_{0}^{2\operatorname{diam}(H)}\sqrt{\log\mathcal{N}\left(H,d,\epsilon\right)}~{}\mathrm{d}\epsilon\right),

where κ+\kappa\in\mathbb{Z}_{+} satisfies 2κ1<diam(H)2κ2^{\kappa-1}<\operatorname{diam}(H)\leq 2^{\kappa}.
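To make the second bound of Lemma 12 concrete, the sketch below evaluates its right-hand side for the toy index set H=[0,1] equipped with the Euclidean metric, for which \operatorname{diam}(H)=1 and \mathcal{N}(H,d,\epsilon)\leq\lceil 1/(2\epsilon)\rceil\vee 1; the values of K and \delta and the discretization of the entropy integral are arbitrary illustrative choices.

```python
import numpy as np

K, delta, diam = 1.0, 0.05, 1.0        # H = [0, 1] with the Euclidean metric

def log_covering(eps):
    # N(H, d, eps) <= max(ceil(1 / (2 eps)), 1) for the unit interval
    return np.log(np.maximum(np.ceil(1.0 / (2.0 * eps)), 1.0))

# Entropy integral int_0^{2 diam} sqrt(log N(H, d, eps)) d eps, via a Riemann sum
eps = np.linspace(1e-6, 2 * diam, 200_000)
entropy_integral = np.mean(np.sqrt(log_covering(eps))) * (2 * diam - 1e-6)

bound = 12 * K * (4 * diam * np.sqrt(np.log(2 / delta)) + entropy_integral)
print("high-probability bound on sup_h |X_h - X_{h_0}|:", bound)
```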

Proof of Lemma 12.

Let \kappa\in\mathbb{Z}_{+} satisfy 2^{\kappa-1}<\operatorname{diam}(H)\leq 2^{\kappa}. Define \epsilon_{k}=2^{-k} for k\in\mathbb{Z}, k\geq-\kappa. Let H_{k} be an \epsilon_{k}-net of H with respect to the metric d(\cdot,\cdot), i.e., H_{k}\subset H covers H at scale \epsilon_{k}. Since \epsilon_{-\kappa}=2^{\kappa}\geq\operatorname{diam}(H), we may take H_{-\kappa}=\{h_{0}\}. Define \pi_{k}(h) as the element of H_{k} closest to h under d(\cdot,\cdot). Then for every h\in H, we have

XhXh0=k=κ+1(Xπk(h)Xπk1(h))a.s..\displaystyle X_{h}-X_{h_{0}}=\sum_{k=-\kappa+1}^{\infty}\left(X_{\pi_{k}(h)}-X_{\pi_{k-1}(h)}\right)\quad a.s.~{}.

Thus

suphH|XhXh0|k=κ+1suphH|Xπk(h)Xπk1(h)|a.s..\displaystyle\sup_{h\in H}|X_{h}-X_{h_{0}}|\leq\sum_{k=-\kappa+1}^{\infty}\sup_{h\in H}\left|X_{\pi_{k}(h)}-X_{\pi_{k-1}(h)}\right|\quad a.s.~{}.

Consider P_{k}=\{X_{\pi_{k}(h)}-X_{\pi_{k-1}(h)}:h\in H\}. Then |P_{k}|\leq|H_{k-1}||H_{k}|\leq|H_{k}|^{2}, and every element of P_{k} is K(\epsilon_{k}+\epsilon_{k-1})-sub-gaussian since d(\pi_{k}(h),\pi_{k-1}(h))\leq\epsilon_{k}+\epsilon_{k-1}. By Hoeffding’s inequality and a union bound argument, we have

(supXPk|X|t)\displaystyle\mathbb{P}\left(\sup_{X\in P_{k}}|X|\geq t\right) =(XPk{|X|t})\displaystyle=\mathbb{P}\left(\bigcup_{X\in P_{k}}\left\{|X|\geq t\right\}\right)
XPk(|X|t)\displaystyle\leq\sum_{X\in P_{k}}\mathbb{P}\left(|X|\geq t\right)
2|Pk|exp(t22K2(ϵk1+ϵk)2)\displaystyle\leq 2|P_{k}|\exp\left(-\frac{t^{2}}{2K^{2}(\epsilon_{k-1}+\epsilon_{k})^{2}}\right)
2|Pk|exp(t218K2ϵk2).\displaystyle\leq 2|P_{k}|\exp\left(-\frac{t^{2}}{18K^{2}\epsilon_{k}^{2}}\right).

For \delta_{k}\leq 1/2, set 2|P_{k}|\exp\left(-t^{2}/(18K^{2}\epsilon_{k}^{2})\right)=\delta_{k}, i.e., t=\sqrt{18}K\epsilon_{k}\sqrt{\log(|P_{k}|)+\log(2/\delta_{k})}\leq 3\sqrt{2}K\epsilon_{k}\left(\sqrt{\log(|P_{k}|)}+\sqrt{\log(2/\delta_{k})}\right). Then with probability at least 1-\delta_{k}, we have

supXPk|X|\displaystyle\sup_{X\in P_{k}}|X| 32Kϵk(log(|Pk|)+log(2/δk))\displaystyle\leq 3\sqrt{2}K\epsilon_{k}\left(\sqrt{\log(|P_{k}|)}+\sqrt{\log(2/\delta_{k})}\right)
6Kϵk(log(|Hk|)+log(1/δk)).\displaystyle\leq 6K\epsilon_{k}\left(\sqrt{\log(|H_{k}|)}+\sqrt{\log(1/\delta_{k})}\right).

Thus, with probability at least 1k=κ+δk1-\sum_{k=-\kappa}^{+\infty}\delta_{k}, we get

\sup_{h\in H}|X_{h}-X_{h_{0}}|\leq 6K\sum_{k=-\kappa}^{\infty}2^{-k}\left(\sqrt{\log\mathcal{N}\left(H,d,2^{-k}\right)}+\sqrt{\log\left(1/\delta_{k}\right)}\right).

Let δk=δ/2k+κ+1\delta_{k}=\delta/2^{k+\kappa+1}. Then k=κδk=δ\sum_{k=-\kappa}^{\infty}\delta_{k}=\delta. We have

k=κ2klog(1/δk)\displaystyle\sum_{k=-\kappa}^{\infty}2^{-k}\sqrt{\log\left(1/\delta_{k}\right)} =k=κ2klog(2k+κ+1/δ)\displaystyle=\sum_{k=-\kappa}^{\infty}2^{-k}\sqrt{\log\left(2^{k+\kappa+1}/\delta\right)}
k=κ2kk+κ+1log(2/δ)\displaystyle\leq\sum_{k=-\kappa}^{\infty}2^{-k}\sqrt{k+\kappa+1}\sqrt{\log\left(2/\delta\right)}
8diam(H)log(2/δ).\displaystyle\leq 8\operatorname{diam}(H)\sqrt{\log\left(2/\delta\right)}~{}.

Thus,

suphH|XhXh0|6K(8diam(H)log(2δ)+k=κ2klog𝒩(H,d,2k)).\displaystyle\sup_{h\in H}|X_{h}-X_{h_{0}}|\leq 6K\left(8\operatorname{diam}(H)\sqrt{\log\left(\frac{2}{\delta}\right)}+\sum_{k=-\kappa}^{\infty}2^{-k}\sqrt{\log\mathcal{N}\left(H,d,2^{-k}\right)}\right).

Since

k=κ2klog𝒩(H,d,2k)202κlog𝒩(H,d,ϵ)dϵ202diam(H)log𝒩(H,d,ϵ)dϵ,\displaystyle\sum_{k=-\kappa}^{\infty}2^{-k}\sqrt{\log\mathcal{N}\left(H,d,2^{-k}\right)}\leq 2\int_{0}^{2^{\kappa}}\sqrt{\log\mathcal{N}\left(H,d,\epsilon\right)}~{}\mathrm{d}\epsilon\leq 2\int_{0}^{2\operatorname{diam}(H)}\sqrt{\log\mathcal{N}\left(H,d,\epsilon\right)}~{}\mathrm{d}\epsilon~{},

the lemma is proved. ∎

Lemma 13.

(Theorem 5.1 in De Ryck et al. (2021))   Let d,s+d,s\in\mathbb{N}_{+}, δ>0\delta>0 and fWs,([0,1]d)f\in W^{s,\infty}([0,1]^{d}). There exist constants 𝒞(d,s,f)\mathcal{C}(d,s,f) and N0(d)>0N_{0}(d)>0 such that for every integer N>N0(d)N>N_{0}(d), there exists a tanh neural network f^N\hat{f}^{N} with two hidden layers, with one width at most 3s/2(s+d1d)+d(N1)3\lceil s/2\rceil\tbinom{s+d-1}{d}+d(N-1) and the other width at most 3(d+2)/2(2d+1d)Nd3\lceil(d+2)/2\rceil\tbinom{2d+1}{d}N^{d} (or 3s/2+N13\lceil s/2\rceil+N-1 and 6N6N for d=1d=1), such that

ff^NL([0,1]d)(1+δ)𝒞(d,s,f)Ns.\displaystyle\left\|f-\hat{f}^{N}\right\|_{L^{\infty}([0,1]^{d})}\leq(1+\delta)\frac{\mathcal{C}(d,s,f)}{N^{s}}~{}.

If fCs([0,1]d)f\in C^{s}([0,1]^{d}), then it holds that

𝒞(d,s,f)=(3d)ss!2sfWs,([0,1]d),N0(d)=3d2,\displaystyle\mathcal{C}(d,s,f)=\frac{(3d)^{s}}{s!2^{s}}\|f\|_{W^{s,\infty}([0,1]^{d})},\quad N_{0}(d)=\frac{3d}{2},

and else, it holds that

𝒞(d,s,f)=π1/4s(5d)s(s1)!fWs,([0,1]d),N0(d)=5d2.\displaystyle\mathcal{C}(d,s,f)=\frac{\pi^{1/4}\sqrt{s}(5d)^{s}}{(s-1)!}\|f\|_{W^{s,\infty}([0,1]^{d})},\quad N_{0}(d)=5d^{2}.

Moreover, the weights of f^N\hat{f}^{N} scale as O(𝒞(d,s,f)s/2Nd(d+s2)/2(s(s+2))3s(s+2))O(\mathcal{C}(d,s,f)^{-s/2}N^{d(d+s^{2})/2}(s(s+2))^{3s(s+2)}).
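As an empirical companion to Lemma 13 (and not the constructive two-hidden-layer network of De Ryck et al. (2021)), the sketch below fits only the output layer of a one-hidden-layer tanh network with randomly drawn hidden weights to a smooth function on [0,1] by least squares, and reports the resulting sup-norm error for several widths; the target function, the weight scale, and the widths are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)              # a smooth target on [0, 1]
x_train = np.linspace(0.0, 1.0, 512)[:, None]
x_test = np.linspace(0.0, 1.0, 2048)[:, None]

for width in (4, 8, 16, 32, 64):
    W = rng.normal(scale=10.0, size=(1, width))  # random hidden-layer weights
    b = rng.uniform(-10.0, 10.0, size=width)     # random hidden-layer biases
    Phi = np.tanh(x_train @ W + b)               # tanh features on the training grid
    coef, *_ = np.linalg.lstsq(Phi, f(x_train).ravel(), rcond=None)
    err = np.max(np.abs(np.tanh(x_test @ W + b) @ coef - f(x_test).ravel()))
    print(f"width = {width:3d}   sup-norm error on the test grid = {err:.3e}")
```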

Remark 14.

By Lemma 13, there exists a constant C(\delta) depending only on \delta such that

|the weights of f^N|C(δ)𝒞(d,s,f)s/2Nd(d+s2)/2(s(s+2))3s(s+2).\displaystyle|\text{the weights of $\hat{f}^{N}$}|\leq C(\delta)\mathcal{C}(d,s,f)^{-s/2}N^{d(d+s^{2})/2}(s(s+2))^{3s(s+2)}.
Lemma 14.

(Corollary 5.8 in De Ryck et al. (2021)) Let d+d\in\mathbb{N}_{+}, Ωd\Omega\subset\mathbb{R}^{d} open with [0,1]dΩ[0,1]^{d}\subset\Omega and let ff be analytic on Ω\Omega. If, for some C>0C>0, ff satisfies that fWs,([0,1]d)Cs\|f\|_{W^{s,\infty}([0,1]^{d})}\leq C^{s} for all ss\in\mathbb{N}, then for any 𝒩+\mathcal{N}\in\mathbb{N}_{+}, there exists a one-layer tanh\tanh neural network f^𝒩\hat{f}^{\mathcal{N}} of width 3(𝒩+5Cd)/2(𝒩+(5C+1)dd)3\lceil(\mathcal{N}+5Cd)/2\rceil\tbinom{\mathcal{N}+(5C+1)d}{d} (or 3𝒩/23\lceil\mathcal{N}/2\rceil for d=1d=1) such that

\left\|f-\hat{f}^{\mathcal{N}}\right\|_{L^{\infty}([0,1]^{d})}\leq\exp(-\mathcal{N})~{}.
Remark 15.

In De Ryck et al. (2021), the construction of \hat{f}^{\mathcal{N}} in Lemma 14 uses Lemma 13 directly. Hence a bound on the weights of \hat{f}^{\mathcal{N}} can be derived from Lemma 13: there exists a constant \tilde{C} such that

|the weights of f^𝒩|C~exp(𝒩2+𝒩3Cd𝒩2)(𝒩(𝒩+2))3𝒩(𝒩+2),\displaystyle|\text{the weights of $\hat{f}^{\mathcal{N}}$}|\leq\tilde{C}\exp(\frac{{\mathcal{N}^{\prime}}^{2}+\mathcal{N}^{\prime}-3Cd\mathcal{N}^{\prime}}{2})(\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2))^{3\mathcal{N}^{\prime}(\mathcal{N}^{\prime}+2)},

where \mathcal{N}^{\prime}=\mathcal{N}+5Cd. We emphasize that De Ryck et al. (2021) do not state this bound explicitly, but it follows from Lemma 13 by a direct calculation.

References

  • Aalen et al. [2008] Odd Aalen, Ornulf Borgan, and Hakon Gjessing. Survival and event history analysis: a process point of view. Springer Science & Business Media, 2008.
  • Bauwens and Hautsch [2009] Luc Bauwens and Nikolaus Hautsch. Modelling financial high frequency data using point processes. In Handbook of financial time series, pages 953–979. Springer, 2009.
  • Brémaud and Massoulié [1996] Pierre Brémaud and Laurent Massoulié. Stability of nonlinear Hawkes processes. The Annals of Probability, pages 1563–1588, 1996.
  • Cai et al. [2022] Biao Cai, Jingfei Zhang, and Yongtao Guan. Latent network structure learning from high-dimensional multivariate point processes. Journal of the American Statistical Association, pages 1–14, 2022.
  • Cao et al. [2019] Jian Cao, Zhi Li, and Jian Li. Financial time series forecasting model based on CEEMDAN and LSTM. Physica A: Statistical Mechanics and its Applications, 519:127–139, 2019.
  • Chen et al. [2020] Minshuo Chen, Xingguo Li, and Tuo Zhao. On generalization bounds of a family of recurrent neural networks. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pages 1233–1243. PMLR, 2020.
  • Chimmula and Zhang [2020] Vinay Kumar Reddy Chimmula and Lei Zhang. Time series forecasting of COVID-19 transmission in Canada using LSTM networks. Chaos, Solitons & Fractals, 135:109864, 2020.
  • Cybenko [1989] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • Daley and Vere-Jones [2008] Daryl J Daley and David Vere-Jones. An introduction to the theory of point processes: volume II: general theory and structure. Springer, 2008.
  • Daley et al. [2003] Daryl J Daley, David Vere-Jones, et al. An introduction to the theory of point processes: volume I: elementary theory and methods. Springer, 2003.
  • De Ryck et al. [2021] Tim De Ryck, Samuel Lanthaler, and Siddhartha Mishra. On the approximation of functions by tanh neural networks. Neural Networks, 143:732–750, 2021.
  • Du et al. [2015] Nan Du, Yichen Wang, Niao He, Jimeng Sun, and Le Song. Time-sensitive recommendation from recurrent user activities. Advances in neural information processing systems, 28, 2015.
  • Du et al. [2016] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1555–1564, 2016.
  • Dugas et al. [2001] Michel J Dugas, Patrick Gosselin, and Robert Ladouceur. Intolerance of uncertainty and worry: Investigating specificity in a nonclinical sample. Cognitive therapy and Research, 25:551–558, 2001.
  • Enguehard et al. [2020] Joseph Enguehard, Dan Busbridge, Adam Bozson, Claire Woodcock, and Nils Hammerla. Neural temporal point processes for modelling electronic health records. In Machine Learning for Health, pages 85–113. PMLR, 2020.
  • Fang et al. [2023] Guanhua Fang, Ganggang Xu, Haochen Xu, Xuening Zhu, and Yongtao Guan. Group network Hawkes process. Journal of the American Statistical Association, pages 1–17, 2023.
  • Farajtabar et al. [2017] Mehrdad Farajtabar, Yichen Wang, Manuel Gomez-Rodriguez, Shuang Li, Hongyuan Zha, and Le Song. Coevolve: A joint point process model for information diffusion and network evolution. Journal of Machine Learning Research, 18(41):1–49, 2017.
  • Fleming and Harrington [2013] Thomas R Fleming and David P Harrington. Counting processes and survival analysis, volume 625. John Wiley & Sons, 2013.
  • Fukushima [1969] Kunihiko Fukushima. Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics, 5(4):322–333, 1969.
  • Gautschi [1990] Walter Gautschi. How (un)stable are Vandermonde systems? Asymptotic and Computational Analysis, 1990. URL https://api.semanticscholar.org/CorpusID:18896588.
  • Hansen et al. [2015] Niels Richard Hansen, Patricia Reynaud-Bouret, and Vincent Rivoirard. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 2015.
  • Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  • Hawkes [1971] Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
  • Hawkes [2018] Alan G Hawkes. Hawkes processes and their applications to finance: a review. Quantitative Finance, 18(2):193–198, 2018.
  • Hawkes and Oakes [1974] Alan G Hawkes and David Oakes. A cluster process representation of a self-exciting process. Journal of applied probability, 11(3):493–503, 1974.
  • Hosseini et al. [2017] Seyed Abbas Hosseini, Keivan Alizadeh, Ali Khodadadi, Ali Arabzadeh, Mehrdad Farajtabar, Hongyuan Zha, and Hamid R Rabiee. Recurrent Poisson factorization for temporal recommendation. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 847–855, 2017.
  • Isham and Westcott [1979] Valerie Isham and Mark Westcott. A self-correcting point process. Stochastic processes and their applications, 8(3):335–347, 1979.
  • James et al. [2013] Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. An introduction to statistical learning, volume 112. Springer, 2013.
  • Jiao et al. [2023] Yuling Jiao, Guohao Shen, Yuanyuan Lin, and Jian Huang. Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of Statistics, 51(2):691–716, 2023.
  • Kingman [1992] John Frank Charles Kingman. Poisson processes, volume 3. Clarendon Press, 1992.
  • Laub et al. [2021] Patrick J Laub, Young Lee, and Thomas Taimre. The elements of Hawkes processes. Springer, 2021.
  • Li et al. [2018] Shuang Li, Shuai Xiao, Shixiang Zhu, Nan Du, Yao Xie, and Le Song. Learning temporal point processes via reinforcement learning. Advances in neural information processing systems, 31, 2018.
  • Lin et al. [2022] Haitao Lin, Lirong Wu, Guojiang Zhao, Pai Liu, and Stan Z Li. Exploring generative neural temporal point process. arXiv preprint arXiv:2208.01874, 2022.
  • Lu et al. [2021] Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021.
  • McCulloch and Pitts [1943] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5:115–133, 1943.
  • Medsker and Jain [1999] Larry Medsker and Lakhmi C Jain. Recurrent neural networks: design and applications. CRC press, 1999.
  • Mei and Eisner [2017] Hongyuan Mei and Jason M Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. Advances in neural information processing systems, 30, 2017.
  • Ogata and Vere-Jones [1984] Yosihiko Ogata and David Vere-Jones. Inference for earthquake models: a self-correcting model. Stochastic processes and their applications, 17(2):337–347, 1984.
  • Omi et al. [2019] Takahiro Omi, Kazuyuki Aihara, et al. Fully neural network based model for general temporal point processes. Advances in neural information processing systems, 32, 2019.
  • Perkel et al. [1967] Donald H Perkel, George L Gerstein, and George P Moore. Neuronal spike trains and stochastic point processes: I. the single spike train. Biophysical journal, 7(4):391–418, 1967.
  • Rubanova et al. [2019] Yulia Rubanova, Ricky TQ Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. Advances in neural information processing systems, 32, 2019.
  • Schäfer and Zimmermann [2007] Anton Maximilian Schäfer and Hans-Georg Zimmermann. Recurrent neural networks are universal approximators. International journal of neural systems, 17(04):253–263, 2007.
  • Schmidt-Hieber [2020] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4), 2020. doi: 10.1214/19-AOS1875. URL https://doi.org/10.1214/19-AOS1875.
  • Schoenberg [2005] Frederic Paik Schoenberg. Consistent parametric estimation of the intensity of a spatial–temporal point process. Journal of Statistical Planning and Inference, 128(1):79–93, 2005.
  • Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Shchur et al. [2021] Oleksandr Shchur, Ali Caner Türkmen, Tim Januschowski, and Stephan Günnemann. Neural temporal point processes: A review. arXiv preprint arXiv:2104.03528, 2021.
  • Shen et al. [2019] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. arXiv preprint arXiv:1906.05497, 2019.
  • Suh and Cheng [2024] Namjoon Suh and Guang Cheng. A survey on statistical theory of deep learning: Approximation, training dynamics, and generative models. arXiv preprint arXiv:2401.07187, 2024.
  • Tarwani and Edem [2017] Kanchan M Tarwani and Swathi Edem. Survey on recurrent neural network in natural language processing. Int. J. Eng. Trends Technol, 48(6):301–304, 2017.
  • Tu et al. [2020] Zhuozhuo Tu, Fengxiang He, and Dacheng Tao. Understanding generalization in recurrent neural networks. In International Conference on Learning Representations, 2020. URL https://api.semanticscholar.org/CorpusID:214346647.
  • Vidyasagar [2013] Mathukumalli Vidyasagar. Learning and generalisation: with applications to neural networks. Springer Science & Business Media, 2013.
  • Wang et al. [2012] Ting Wang, Mark Bebbington, and David Harte. Markov-modulated Hawkes process with stepwise decay. Annals of the Institute of Statistical Mathematics, 64:521–544, 2012.
  • Williams et al. [2020] Alex Williams, Anthony Degleris, Yixin Wang, and Scott Linderman. Point process models for sequence detection in high-dimensional neural spike trains. Advances in neural information processing systems, 33:14350–14361, 2020.
  • Yin et al. [2017] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923, 2017.
  • Zhang et al. [2021] Yizhou Zhang, Karishma Sharma, and Yan Liu. Vigdet: Knowledge informed neural temporal point process for coordination detection on social media. Advances in Neural Information Processing Systems, 34:3218–3231, 2021.
  • Zhou et al. [2022] Zihao Zhou, Xingyi Yang, Ryan Rossi, Handong Zhao, and Rose Yu. Neural point process for learning spatiotemporal event dynamics. In Learning for Dynamics and Control Conference, pages 777–789. PMLR, 2022.
  • Zuo et al. [2020] Simiao Zuo, Haoming Jiang, Zichong Li, Tuo Zhao, and Hongyuan Zha. Transformer Hawkes process. In International conference on machine learning, pages 11692–11702. PMLR, 2020.