On Non-asymptotic Theory of Recurrent Neural Networks in Temporal Point Processes
Abstract
Temporal point process (TPP) is an important tool for modeling and predicting irregularly timed events across various domains. Recently, recurrent neural network (RNN)-based TPPs have shown practical advantages over traditional parametric TPP models. However, the theoretical understanding of neural TPPs remains nascent. In this paper, we establish excess risk bounds for RNN-TPPs under many well-known TPP settings. In particular, we show that an RNN-TPP with no more than four layers can achieve a vanishing generalization error. Our technical contributions include the characterization of the complexity of the multi-layer RNN class, the construction of neural networks for approximating dynamic event intensity functions, and a truncation technique for alleviating the issue of unbounded event sequences. Our results bridge the gap between TPP applications and neural network theory.
1 Introduction
Temporal point process (TPP) (Daley et al., 2003; Daley and Vere-Jones, 2008) is an important mathematical framework that provides tools for analyzing and predicting the timing and patterns of events in continuous time. TPP deals particularly with event-stream data in which events occur at irregular time stamps, which differs from classical time series analysis, where a regular time spacing between data points is often assumed. In real-world applications, the events could be anything from transactions in financial markets (Bauwens and Hautsch, 2009; Hawkes, 2018) to user activities on online social network platforms (Farajtabar et al., 2017; Fang et al., 2023), earthquakes in seismology (Wang et al., 2012; Laub et al., 2021), neural spikes in biological experiments (Perkel et al., 1967; Williams et al., 2020), or failure times in survival analysis (Aalen et al., 2008; Fleming and Harrington, 2013).
With the advent of artificial intelligence in recent decades, the neural network (McCulloch and Pitts, 1943) has proven to be a powerful architecture that can be adapted to different applications with distinct purposes. In modern machine learning, researchers have also incorporated deep neural networks into TPPs to handle complex patterns and dependencies in event data, leading to advancements in many areas such as recommendation systems (Du et al., 2015; Hosseini et al., 2017), social network analysis (Du et al., 2016; Zhang et al., 2021), healthcare analytics (Li et al., 2018; Enguehard et al., 2020), etc. Many new TPP models have been proposed in the recent literature, including, but not limited to, the recurrent temporal point process (Du et al., 2016), the fully neural network TPP model (Omi et al., 2019), and the transformer Hawkes process (Zuo et al., 2020); see Shchur et al. (2021); Lin et al. (2022) and the references therein for a more comprehensive review.
Despite the recent progress in TPP applications mentioned above, there is a lack of theoretical understanding of neural TPPs. A fundamental question remains: can a neural network-based TPP provably achieve a small generalization error? In this paper, we provide an affirmative answer to this question for recurrent neural network (RNN, Medsker and Jain (1999))-based TPPs. To be specific, we establish non-asymptotic generalization error bounds under mild model assumptions and provide constructions of RNN architectures that can approximate many widely used TPPs, including the homogeneous Poisson process, the non-homogeneous Poisson process, the self-exciting process, etc.
There are a few challenges in developing the theory of RNN-based TPPs. (a) Characterization of the functional space. In machine learning theory, it is necessary to specify the model space before deriving any generalization error. In our setting, this becomes more complicated since the model must be data-dependent (i.e., it adapts to the past events). Otherwise, the model could not capture the information in the event history and would fail to provide a good fit. (b) Expressive power of the RNN architecture. RNN is the most widely adopted neural architecture in TPP modelling. However, it remains questionable whether RNNs can approximate most well-known temporal point processes. If the answer is yes, it would be of great interest to know how many hidden layers and how large the hidden dimensions need to be for the approximation. (c) Expressive power of the activation function. In modern neural networks, the activation function is chosen to be a simple non-linear function for the sake of computational feasibility. In RNNs, it is taken to be "tanh" by default. It is therefore important to understand the approximation power of the tanh activation function. (d) Variable length of event sequences. Unlike standard RNN modelling, where each sample is assumed to have the same number of observations (events), the event sequences in our setting may vary in length from one to another. In addition, their lengths are potentially unbounded. This adds difficulty to computing the complexity of the model space.
To overcome the above challenges, we adopt the following approaches. (a) In TPPs, the intensity function is the core object. We recursively construct the multi-layer hidden cells through RNNs to store the event information and adopt a suitable output layer to compute the intensity value. Equipped with suitable input embeddings, our construction can capture the information in the event history and adapt to event sequences of variable length. (b) For the four main categories of TPPs, the homogeneous Poisson process, the non-homogeneous Poisson process, the self-exciting process, and the self-correcting process, we carefully study their intensity formulas. We decompose the intensity function into different parts and approximate them component-wise. Our construction explicitly gives the upper bounds on the model depth, the width of the hidden layers, and the parameter weights of the RNN architecture needed to achieve a certain level of approximation accuracy. (c) We use the results of a recent work (De Ryck et al., 2021), which provides approximation guarantees for one- and two-layer tanh neural networks. We adapt these results to our specific RNN structure and give universal approximation results for each of the intensity components. (d) Thanks to the exponential decay of the tail probability of the sequence length, we are able to use a truncation technique to decouple the randomness of independent and identically distributed (i.i.d.) samples and the lengths of event sequences. For the space of truncated loss functions, the space complexity can be obtained by calculating the covering number. The classical chaining methods in empirical process theory can hence be applied as well.
Our main technical contributions can be summarized as follows.
(i) In the analysis of the stochastic error in the excess risk of RNN-based TPPs, we provide a truncation technique to decompose the randomness into a bounded component and a tail component. By carefully balancing the two parts, we establish a nearly optimal stochastic error bound. Additionally, we derive the complexity of the multi-layer RNN-based TPP class, where we precisely analyze and compute the Lipschitz constant of the RNN architecture. This extends the existing result in Chen et al. (2020), which only gives the Lipschitz constant of a single-layer RNN. Therefore, our truncation technique and the Lipschitz result for multi-layer RNNs can be useful and of independent interest for many other related problems.
(ii) We establish approximation error bounds for the intensity functions of the four main categories of TPPs. To the best of our knowledge, there is very little work (De Ryck et al., 2021) on the approximation properties of the tanh activation function. Our work is the first to provide approximation results for RNN-based statistical models. Our construction procedure relies heavily on the Markov nature (Laub et al., 2021) of self-exciting processes, so that we can design hidden cells to store sufficient information about past events. Moreover, we decompose the excitation function into different parts, each of which is a simple smooth function (i.e., either an exponential or a trigonometric function) that can be well approximated by a single-layer network. Our construction method can be viewed as a useful tool for analyzing other sequential-type neural networks.
(iii) We illustrate the differences between the architectures of classical RNNs and RNN-based TPPs. Note that the observed events happen on a discrete time grid, while TPP models must account for the continuous time domain. Therefore, the interpolation of the hidden cell values at each time point is important and necessary. We show that improper interpolation mechanisms (e.g., constant, linear, or exponential-decay interpolation) may fail to endow the RNN-based TPP with universal approximation ability. Our result indicates that the input embedding plays an important role in interpolating the hidden states.
The rest of the paper is organized as follows. In Section 2, the background of TPPs, the formulation of RNN-based TPPs, and useful notations are introduced. The main theories, along with high-level explanations, are given in Section 3. The technical tools for analyzing stochastic errors are provided in Section 4. The construction procedures for approximating different types of intensity functions are given in Section 5. In Section 6, we explain how improper interpolation of hidden states in RNN-TPPs may lead to unsatisfactory approximation results. The concluding remarks are given in Section 7.
2 Preliminaries
2.1 Framework Specification
We observe a set of irregular event time sequences,
(1)
where with being the end time point, and is the number of events in the -th sequence, . It is assumed that each of ’s is independently generated from a TPP model with an unknown intensity function defined on . That is,
where with being the number of events observed up to time , and is the history filtration before time .
In the literature on TPP learning (Shchur et al., 2021), the primary goal is to estimate based on . Throughout the current work, we adopt the negative log-likelihood function as our objective. To be specific, for any event time sequence , we define
(2) $\ell(\lambda; s) = -\sum_{j=1}^{n} \log \lambda(t_j) + \int_0^{T} \lambda(t)\, dt$, for an event sequence $s = \{t_1, \dots, t_n\} \subset [0, T]$.
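Equation (2) is the standard TPP negative log-likelihood: the negative log-intensity summed over the observed events plus the integrated intensity (compensator) over the observation window. A minimal numerical sketch, in which the trapezoidal quadrature, the grid size, and the toy intensity are illustrative choices rather than anything used in the paper:

```python
import numpy as np

def tpp_nll(intensity, event_times, T, n_grid=1000):
    """Negative log-likelihood (2) of one event sequence under intensity lambda(t).

    intensity:   callable, vectorized over numpy arrays, returning positive values
    event_times: event times in (0, T]
    T:           end of the observation window
    """
    event_times = np.asarray(event_times, dtype=float)
    # Event term: -sum_j log lambda(t_j)
    event_term = -np.sum(np.log(intensity(event_times)))
    # Compensator term: integral_0^T lambda(t) dt, via the trapezoidal rule
    grid = np.linspace(0.0, T, n_grid)
    vals = intensity(grid)
    compensator = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid))
    return event_term + compensator

# Toy example: a smooth non-homogeneous intensity (illustrative only)
lam = lambda t: 1.0 + 0.5 * np.sin(2.0 * np.pi * np.asarray(t) / 5.0)
print(tpp_nll(lam, event_times=[0.7, 1.9, 3.2, 4.4], T=5.0))
```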
Then the estimator can be defined as
(3)
where is a user-specified functional space. For example, in the existing works, can be taken as any space of parametric models (Schoenberg, 2005; Laub et al., 2021), nonparametric models (Cai et al., 2022; Fang et al., 2023), or neural network models (Du et al., 2016; Mei and Eisner, 2017).
In the language of deep learning, is also called a training data set, is known as the loss function of the predictor , and defined in (3) is the empirical risk minimizer (ERM). To evaluate the performance of , a common practice in machine (deep) learning is to use the excess risk (Hastie et al., 2009; James et al., 2013; Vidyasagar, 2013; Shalev-Shwartz and Ben-David, 2014). To be mathematically formal, we define
(4)
where is a testing sample, i.e., a new event time sequence, which is independent of and also follows the intensity . The expectation here is taken with respect to the new testing data. We give a proof of the non-negativity of this quantity in the supplementary. As a result, (4) is a well-defined excess risk under our model setup.
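As a toy instance of the ERM (3): if the candidate class consists of constant intensities (homogeneous Poisson), minimizing the average negative log-likelihood has the closed-form solution "total number of events divided by n times T". The simulation below is purely illustrative; the rate, window, and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, true_rate = 5.0, 200, 1.3

# Simulate n homogeneous Poisson sequences on [0, T].
seqs = [np.sort(rng.uniform(0, T, rng.poisson(true_rate * T))) for _ in range(n)]

# Average negative log-likelihood of a constant intensity c:
#   (1/n) * sum_i [ -N_i * log(c) + c * T ]
def avg_nll(c):
    return np.mean([-len(s) * np.log(c) + c * T for s in seqs])

# ERM over a grid versus the closed-form minimizer sum_i N_i / (n * T).
grid = np.linspace(0.2, 3.0, 2801)
erm_grid = grid[np.argmin([avg_nll(c) for c in grid])]
erm_closed = sum(len(s) for s in seqs) / (n * T)
print(erm_grid, erm_closed)
```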
2.2 RNN Structure
Throughout this paper, we consider to be a space of RNN-based TPP models. An arbitrary intensity function in , indexed by the parameter , is defined through the following recursive formula,
(5)
where the hidden vector function has the following hierarchical form,
(6)
with
(7)
Here , are two known activation functions of the hidden layers and the output layer, respectively. Both of them are pre-determined by the user. We specifically take and , where and are two fixed positive constants. The input embedding vector function is also known to the user before training. In the current work, we particularly take where for , . The model parameters consist of , , and (). For notational simplicity, we concatenate all parameter matrices and vectors and write them as , where . By default, we take the initial values and for . The last time grid . We call the model defined through equations (5)-(7) the RNN-TPP.
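To make the recursion (5)-(7) concrete, the sketch below runs a toy forward pass of a multi-layer tanh RNN-TPP: hidden states are updated at the past event times and once more at the query time, with an input embedding built from the elapsed time. The embedding, layer sizes, and the clipped-exponential output activation are illustrative stand-ins for the quantities defined above, not the exact construction analyzed in the paper.

```python
import numpy as np

def rnn_tpp_intensity(t, event_times, params, b_min=0.1, b_max=10.0):
    """Evaluate a toy multi-layer RNN-TPP intensity at a query time t.

    params["layers"]: list of dicts with keys 'W' (recurrent), 'U' (input), 'b' (bias).
    params["output"]: pair (v, c) for the output layer.
    All shapes, the embedding, and the output activation are illustrative.
    """
    layers, (v, c) = params["layers"], params["output"]
    h = [np.zeros(layer["W"].shape[0]) for layer in layers]
    past = [s for s in event_times if s <= t]
    prev = 0.0
    for u in past + [t]:                      # process past events, then the query time itself
        inp = np.array([u - prev, 1.0])       # input embedding: (elapsed time, constant)
        for k, layer in enumerate(layers):    # hierarchical (multi-layer) update
            h[k] = np.tanh(layer["W"] @ h[k] + layer["U"] @ inp + layer["b"])
            inp = h[k]
        prev = u
    raw = v @ h[-1] + c                       # output layer on the top hidden state
    return float(np.clip(np.exp(raw), b_min, b_max))   # positive, bounded intensity

rng = np.random.default_rng(0)
params = {
    "layers": [
        {"W": 0.3 * rng.standard_normal((8, 8)), "U": 0.3 * rng.standard_normal((8, 2)), "b": np.zeros(8)},
        {"W": 0.3 * rng.standard_normal((8, 8)), "U": 0.3 * rng.standard_normal((8, 8)), "b": np.zeros(8)},
    ],
    "output": (0.3 * rng.standard_normal(8), 0.0),
}
print(rnn_tpp_intensity(2.5, event_times=[0.4, 1.1, 2.0], params=params))
```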


Moreover, we define the maximum hidden size , where is the dimension of the -th hidden layer, and the parameter norm
Then the RNN-TPP class is described by
(8)
where may depend on the hidden size and the sample size . To help readers gain more intuition, a graphical illustration of the network structure is given in Figure 1.
Remark 1.
The default choice (De Ryck et al., 2021) of activation function in RNNs is the tanh function. In practice, the number of layers is usually no more than 4.
Remark 2.
Remark 3.
In the standard application of RNN models, the training data usually consist of discrete-time sequences (e.g., sequences of tokens in natural language processing (NLP) (Yin et al., 2017; Tarwani and Edem, 2017); time series in financial market forecasting (Cao et al., 2019; Chimmula and Zhang, 2020)). Therefore, the classical (single-layer) RNN architecture is defined only through the discrete time grids. That is, the hidden vector at -th grid is
where is the corresponding embedding input. The prediction at time step is given by . In contrast, the RNN-based TPP model should take into account any time point between grids and . Hence the interpolation of between and is heuristically necessary to give reasonable model predictions over the entire time interval .
Remark 4.
In the literature, there exist a few methods to interpolate the hidden embedding between and . In Du et al. (2016), a constant embedding mechanism is used, i.e., for and any and . In Mei and Eisner (2017), the authors adopted an exponential decay method to encode the hidden representations under an extended RNN architecture, the Long Short-Term Memory (LSTM) network. More recently, Rubanova et al. (2019) used the neural ordinary differential equation (ODE) method to solve for the intermediate hidden state .
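For intuition, the two simplest interpolation rules mentioned above can be written in a few lines; the decay rate and target state in the exponential-decay variant are illustrative placeholders rather than the exact LSTM-based mechanism of Mei and Eisner (2017).

```python
import numpy as np

def interpolate_hidden(h_i, t, t_i, mode="constant", decay=1.0, h_bar=None):
    """Toy interpolation of a hidden state between event time t_i and the next event.

    mode='constant'  : h(t) = h_i                         (as in Du et al., 2016)
    mode='exp_decay' : h(t) decays toward a target h_bar  (in the spirit of Mei and Eisner, 2017)
    The decay rate and target are illustrative placeholders.
    """
    if mode == "constant":
        return h_i
    if mode == "exp_decay":
        h_bar = np.zeros_like(h_i) if h_bar is None else h_bar
        return h_bar + (h_i - h_bar) * np.exp(-decay * (t - t_i))
    raise ValueError(mode)

h_i = np.array([0.8, -0.3])
print(interpolate_hidden(h_i, t=1.7, t_i=1.0, mode="constant"))
print(interpolate_hidden(h_i, t=1.7, t_i=1.0, mode="exp_decay", decay=2.0))
```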
Remark 5.
Our result still holds if tanh is replaced with other sigmoidal-type activation functions (Cybenko, 1989) (e.g., ReLU (Fukushima, 1969)). In the literature on TPP modelling, the most common choice of is the Softplus function (Dugas et al., 2001; Zhou et al., 2022), $\mathrm{softplus}(x) = \log(1 + e^{x})$, which ensures that the intensity is positive and differentiable. Our result also holds if we take to be with . Introducing and only serves a technical purpose, i.e., ensuring that the predicted intensity value is bounded from above and below.
2.3 Classical TPPs
In the statistical literature, TPPs can be categorized into several types based on the nature of the intensity functions. Four main categories are summarized as follows.
Homogeneous Poisson process (Kingman, 1992). It is the simplest type where events occur completely independently of one another, and the intensity function is constant, i.e., , where is unknown and needs to be estimated.
Non-homogeneous Poisson process (Kingman, 1992; Daley et al., 2003). In this model, the intensity function varies over time but is still independent of past events. That is, is a non-constant unknown function that is usually estimated via certain nonparametric methods.
Self-exciting process (Hawkes and Oakes, 1974). Future events are influenced by past events, which can lead to clustering of events in time. A well-known example is the Hawkes process (Hawkes, 1971; Hawkes and Oakes, 1974), whose intensity function takes the form
(9) $\lambda^*(t) = \mu(t) + \sum_{t_j < t} g(t - t_j),$
where $\mu(\cdot)$ and $g(\cdot)$ are positive functions called the background intensity and the excitation/impact function, respectively. In many applications (Laub et al., 2021), the excitation function takes the exponential form $g(t) = \alpha e^{-\beta t}$, which allows efficient computation. The model defined in (9) is also known as the linear self-exciting process since the intensity is an additive combination of different components. More generally, the non-linear self-exciting process (Brémaud and Massoulié, 1996)
(10) $\lambda^*(t) = \phi\big(\mu(t) + \sum_{t_j < t} g(t - t_j)\big)$
is also considered in the literature, where $\phi(\cdot)$ is a non-linear link function.
Self-correcting process (Isham and Westcott, 1979; Ogata and Vere-Jones, 1984). The occurrence of an event decreases the likelihood of future events for some time period. To be mathematically formal, the intensity takes the form
(11)
where both and are positive and may be a non-linear function.
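For reference, the sketch below evaluates representative intensities from these categories at a query time given a fixed history; the parameter values and the specific functional forms (a sinusoidal non-homogeneous rate, an exponential excitation, and the exponential self-correcting parametrization) are common textbook choices used here for illustration only.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.5, alpha=0.8, beta=1.2):
    """Linear self-exciting (Hawkes) intensity with an exponential excitation function."""
    past = np.asarray([s for s in history if s < t])
    return mu + alpha * np.sum(np.exp(-beta * (t - past)))

def nonhomogeneous_poisson_intensity(t, history, a=1.0, b=0.5):
    """Time-varying intensity that ignores the history (illustrative sinusoidal form)."""
    return a + b * np.sin(t)

def self_correcting_intensity(t, history, mu=1.0, alpha=0.3):
    """A common self-correcting parametrization: grows with time, drops after each event."""
    n_t = sum(1 for s in history if s < t)
    return np.exp(mu * t - alpha * n_t)

history = [0.6, 1.4, 2.1]
for f in (hawkes_intensity, nonhomogeneous_poisson_intensity, self_correcting_intensity):
    print(f.__name__, float(f(2.5, history)))
```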
2.4 Notations
Let and . We use and to denote the set of nonnegative integers and the set of all integers, respectively. Denote for a positive integer . Let . For a set , denote its cardinality by . For a vector , denote its Euclidean norm by . Write if there exists some constant such that for all indices , where the range of may be defined case by case. For a function defined on some domain, denote by its essential upper bound. For , the Sobolev norm is defined as . For a constant , the -ball of the Sobolev space is defined as
For a constant , the ball is a subset of that contains all -order smooth functions. We use to hide all constants and use to denote with log factors hidden. Throughout this paper, , , , , and are positive real numbers and may be defined case by case.
3 Main Results
Recent applications in event stream analysis have witnessed the usefulness of TPPs incorporating RNNs. However, no study in the existing literature explains from a theoretical perspective why the RNN structure is so useful in TPP modeling. We attempt to answer the question of whether RNN-TPPs can provably have a small generalization error, or excess risk. Our answer is positive! When the event data are generated according to the classical models described in Section 2.3, we show that RNN-TPPs can generalize well to such data.
To make the presentation easier, we only need to focus on the self-exciting processes. (The homogeneous Poisson, non-homogeneous Poisson, and self-correcting processes can be treated similarly for the following reasons. If we take in (9), the linear self-exciting process reduces to the homogeneous Poisson or non-homogeneous Poisson process. In the RNN-TPP architecture, we can take the input embedding function , i.e., use an additional input dimension to store the number of past events. Then establishing the excess risk of the self-correcting process is technically equivalent to that of the non-homogeneous Poisson process.) To start with, we first consider the linear case (9).
Some regularity assumptions should be stated before we present the main theorem.
(A1) There exists a constant such that , where , .
(A2) .
(A3) There exists a positive constant such that .
Assumption (A1) requires the boundedness of the background intensity, which is also common in neural network approximation studies. Assumption (A2) is standard in the literature on Hawkes processes and guarantees the existence of a stationary version of the process when is constant. Assumption (A3) imposes a lower bound, which ensures that sufficient intensity exists in any subdomain of .
Theorem 1.
Under model (9) and the RNN-TPP class defined in (8), suppose that assumptions (A1)-(A3) hold. Then, for i.i.d. sample sequences , with probability at least , the excess risk (4) of the ERM (3) satisfies:
(i) (Poisson case) If , for , , , , and ,
(12)
(ii) (Vanilla Hawkes case) If , for , , , , and ,
(13)
(iii) (General case) If , , , for , , , , and ,
(14)
As suggested by Theorem 1, there exists a two-layer RNN-TPP model whose excess risk vanishes as the size of the training set goes to infinity. The width of such a network grows with the sample size, while the depth remains two.
Remark 6.
Here we require the depth of the RNN-TPP due to the fact that . However, if we allow to be sufficiently smooth (i.e., ), we only need a one-layer neural network to approximate . As a result, the number of layers of the RNN-TPP can be reduced to one.
Now we consider the true model to be a non-linear Hawkes process, which is given in (10). For simplicity, we only consider the case , which is
(15)
The regularity of is presented as Assumption (A4).
(A4) Function is -Lipschitz, positive and bounded. In other words, there exist such that and for any .
Theorem 2.
For the non-linear case, as indicated by Theorem 2, we require a deeper RNN-TPP with four layers to achieve a vanishing excess risk. Under the Lipschitz assumption on , the width of the hidden layers is of order . When is allowed to have higher-order smoothness, the width can be reduced to that of the vanilla Hawkes case.
Remark 7.
(i) Two additional layers of the RNN are required for the approximation of an arbitrary non-linear Lipschitz continuous function . (ii) For the model with a general excitation function , we can obtain a similar excess risk bound using the same technique as in the proof of Theorem 1.
To better explain the excess risk bounds obtained in Theorems 1-2, we rely on the following decomposition lemma.
Lemma 1.
Let , for any random sample , the excess risk of ERM (3) satisfies
(17)
By Lemma 1, the excess risk of the ERM is bounded by the sum of two terms, the stochastic error and the approximation error . The first term can be bounded via the complexity of the function class using empirical process theory, where the unboundedness of the loss function needs to be handled carefully; we present the details in section 4. The second term characterizes the approximation ability of the RNN function class for the true intensity , measured by the expectation of the negative log-likelihood loss. In order to bound this term, we need to carefully construct a suitable RNN which approximates well. This has not yet been studied in the literature; see section 5 for the details.
Based on Lemma 1, the results in Theorem 1 admit the following form,
where is the stochastic error and is the approximation error; is the complexity of the RNN function class and is the corresponding approximation rate, where is a tuning parameter. For the Poisson case, we can construct a two-layer RNN-TPP with width to achieve approximation error. Hence , , and the final excess risk bound is as in (12). For the vanilla Hawkes case, since the exponential function is -smooth, we only need extra hidden cells in each layer to obtain approximation error, and then we have an excess risk bound of the same order. For the general case, motivated by the vanilla Hawkes case, we decompose into two parts. One part is a polynomial of exponential functions, which can be well approximated by a width- tanh neural network. The other part is a function satisfying , . It is easy to check that the -th Fourier coefficients of , , decay at the rate of . Then it is sufficient to approximate the first functions in the Fourier expansion of to get approximation error, which additionally costs complexity (see section 5.3 for details). Combining this with the approximation result for , we get the final bound . Similarly, for the nonlinear Hawkes case, we need complexity to obtain approximation error, which leads to the excess risk bound.
As we emphasize in the above remarks, the number of layers depends on the smoothness of . If and , we only need a one-layer neural network to approximate ; hence the number of layers in the RNN-TPP can be reduced to one.
4 Stochastic Error
In this section, we focus on the stochastic error in (17). This type of stochastic error for RNN function classes has been studied in the recent literature, such as Chen et al. (2020) and Tu et al. (2020). However, these works only consider the case where the lengths of the input sequences are bounded, which is not applicable in the TPP setting. Here we establish an upper bound on the stochastic error in (17) via a novel decoupling technique that makes the classical results applicable. This technique can be used in many other related problems.
4.1 Main Variance Term
We first state some mild assumptions on the RNN-TPP function class under a more general framework.
(B1) The embedding function is bounded by a constant on the time domain , i.e. .
(B2) The parameter lies in a bounded domain . More precisely, we assume that the spectral norms of weight matrices (vectors) and other parameters are bounded respectively, i.e., , , , , and .
(B3) Activation functions and are Lipschitz continuous with parameters and respectively, , and there exists such that . Additionally, is entrywise bounded by , and satisfies .
Now we consider the first term of (17). For convenience, we denote .
Theorem 3.
Suppose that assumptions (B1)-(B3) hold and that the event number satisfies the tail condition
with probability at least , we have
Thus
(18)
where , , , .
Remark 8.
There exist constants such that the tail condition always holds for (non-)homogeneous Poisson processes, linear and nonlinear Hawkes processes, and self-correcting processes under weak assumptions. To be more concrete, Lemma 2 in the following section gives such a result for the linear case.
Remark 9.
For a one-layer RNN with width and bounded sequence length , Chen et al. (2020) give a stochastic error bound of the type . Our bound reduces the term to , thanks to the bounded output layer, i.e., . The term is also order-optimal, noting that the number of free parameters in a single-layer RNN is at least .
The stochastic error in (3) is mainly determined by the complexity of the RNN function class , which will be discussed in the following section. To obtain this bound, we need to handle the unboundedness of the event number. We use a truncation technique to decouple the randomness of the tail of , which allows us to use classical empirical process theory to derive the upper bound. Our computation is motivated by Chen et al. (2020), which gives a generalization error bound for a single-layer RNN function class.
4.2 Key Techniques
For the reader's convenience, the main techniques for proving Theorem 3 are summarized as follows.
4.2.1 Probability Bound of Events Number
Define . The following lemma characterizes the tails of the event numbers and under model (9) and assumptions (A1) and (A2) (for assumption (A1), we only need in this section). The proof is similar to that of Proposition 2 in Hansen et al. (2015); see the supplementary for details.
Lemma 2.
For model (9), under assumptions (A1) and (A2), with probability at least , we have
Hence
where . Let and with being fixed. Then
(19)
Our result is more refined than Proposition 2 in Hansen et al. (2015): we compute all the constants explicitly and introduce a tuning parameter to control the probability bound.
For the nonlinear case (10), under Assumption (A4), we can obtain results similar to the non-homogeneous Poisson case, which are included in the above Lemma.
4.2.2 From Unboundedness to Boundedness
The following lemma is the key to handling the unboundedness of , i.e., the unboundedness of the loss function. For any , we let and .
Lemma 3.
For any and nonempty parameter set , we have
(20)
Proof of Lemma 3.
For , we have
Hence, under the condition , we have , thus . Then
∎
The consequence of this lemma is a decomposition of into two parts. The first part is the tail probability of the supremum of a set of bounded variables, and can therefore be handled by standard empirical process theory. The second part is the tail probability of . Thanks to Lemma 2, this term can be controlled by the exponential decay property of the sub-critical point process. By choosing a suitable , we can make (20) sharper. This result plays a key role in the stochastic error calculations.
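In schematic form, writing $Z_n(\theta)$ for the centered empirical loss process and $N_i$ for the number of events in the $i$-th sequence (generic notation; the exact quantities are those in Lemma 3), the decomposition reads, for any truncation level $M$ and any $\epsilon>0$,
$$\Pr\Big(\sup_{\theta}|Z_n(\theta)|>\epsilon\Big)\;\le\;\Pr\Big(\sup_{\theta}|Z_n(\theta)|\cdot\mathbf{1}\{\max_{i\le n}N_i\le M\}>\epsilon\Big)\;+\;\Pr\Big(\max_{i\le n}N_i>M\Big).$$
The first term involves only bounded variables and is handled by chaining; the second term is controlled by Lemma 2.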
4.2.3 Complexity of the RNN-TPP Class
To obtain the result in Theorem 3, we need to compute the complexity of the RNN function class specified in section 2.2. There are many possible complexity measures in deep learning theory (Suh and Cheng, 2024); here we choose the covering number, which can be computed explicitly for the RNN function class. In our setup, the key to the computation of the covering number is finding the Lipschitz continuity constant of RNN-TPPs with respect to the parameters, which separates the contribution of the spectral norms of the weight matrices from that of the total number of parameters (Chen et al., 2020).
Consider two different sets of parameters , . Denote , , , . The following lemma characterizes the Lipschitz constant of .
Lemma 4.
Under Assumptions (B1)-(B3), given an input sequence of length , (here we set ), for , , and , we have
(21)
where , , (), and . We set if .
The proof of Lemma 4 is based on induction; the full proof is given in the supplementary. Our result extends Lemma 2 in Chen et al. (2020), which only considers the family of one-layer RNN models. Lemma 4 is of independent interest and can be useful in other problems involving RNN-based modeling. Using Lemma 4, we can establish a covering number bound for under a "truncated" distance.
Denote by the covering number of the metric space , i.e., the minimal cardinality of a subset that covers at scale with respect to the metric . Given a fixed integer , we define a truncated distance,
The following lemma gives an upper bound of .
Lemma 5.
Under assumptions (B1)-(B3), for any and defined as (8), the covering number is bounded by
where , , and .
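As a rough numerical illustration of the quantity controlled by Lemma 4, one can perturb the parameters of a small single-layer tanh RNN and compare the change in its output with the change in the parameters. The network size, input sequence, and perturbation scale below are arbitrary and unrelated to the constants in Lemmas 4-5.

```python
import numpy as np

rng = np.random.default_rng(1)

def rnn_output(theta, inputs):
    """First hidden coordinate of a toy single-layer tanh RNN run over `inputs`."""
    W, U, b = theta
    h = np.zeros(W.shape[0])
    for x in inputs:
        h = np.tanh(W @ h + U @ x + b)
    return h[0]

d_h, d_x, seq_len = 6, 2, 12
inputs = [rng.standard_normal(d_x) for _ in range(seq_len)]
theta = (0.4 * rng.standard_normal((d_h, d_h)),
         0.4 * rng.standard_normal((d_h, d_x)),
         np.zeros(d_h))

# Perturb the parameters slightly and compare the output change to the parameter change.
eps = 1e-4
delta = tuple(eps * rng.standard_normal(p.shape) for p in theta)
theta_perturbed = tuple(p + d for p, d in zip(theta, delta))
param_dist = np.sqrt(sum(np.sum(d ** 2) for d in delta))
out_dist = abs(rnn_output(theta, inputs) - rnn_output(theta_perturbed, inputs))
print("empirical Lipschitz ratio (output change / parameter change):", out_dist / param_dist)
```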
5 Approximation Error
In this section, we focus on the approximation error, i.e., the second part of (17). The approximation error of deep neural networks has been broadly studied in the literature (Schmidt-Hieber, 2020; Shen et al., 2019; Jiao et al., 2023; Lu et al., 2021). However, most of these works only consider the ReLU activation, which is different from tanh, the activation function usually chosen for RNNs. Recently, De Ryck et al. (2021) studied the approximation properties of shallow tanh neural networks, which provides a technical tool for our analysis. To the best of our knowledge, the approximation ability of RNN-type networks has not been fully studied in the literature. Here we provide a family of approximation results for the intensities of the various TPP models stated in section 2.3.
5.1 Poisson Case
We start with the approximation of the (non-homogeneous) Poisson process, whose intensity is independent of the event history, i.e., , where is an unknown function. In this case, we do not need to take into account the transfer of information across the time domain. To be precise, we can take for . Then the problem degenerates to a standard neural network approximation problem. Using the approximation results for tanh neural networks in De Ryck et al. (2021), we obtain the following result.
Theorem 4.
(Approximation for Poisson process) Under model and assumptions (A1) and (A3), for , , there exists an RNN-TPP as stated in section 2.2 with , , , and input function such that
(22)
where . Moreover, the width of satisfies and the weights of are less than
where is a universal constant.
A graphical representation of the RNN approximation is given in Figure 2. For non-homogeneous Poisson models, the RNN-TPP in Theorem 4 is in fact a two-layer neural network. From Theorem 4, we need an RNN-TPP with width and to obtain approximation error. Combining this with Theorem 3, we obtain part (i) of Theorem 1.

5.2 Vanilla Hawkes Case
Recall that the intensity of the vanilla Hawkes process has the form
(23)
Different from the Poisson process, the intensity of the vanilla Hawkes process depends on historical events. Hence it cannot be approximated by a simple feedforward neural network and requires the recurrent structure. We construct an RNN-TPP to approximate the intensity using the Markov property of (23). Specifically, note that if we have observed the first event times , then for any satisfying , we have
Therefore, we can use the hidden layers in the RNN-TPP to store the information of and then compute with the help of the input . Together with the approximation of , we can obtain the final approximation result. A graphical illustration of the above construction procedure is given in Figure 3.
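For an exponential excitation function, this Markov property means the whole event history can be summarized by a single excitation state that decays between events and jumps at each event, which is exactly what the hidden cells are constructed to store. A minimal sketch, with illustrative parameters mu, alpha, beta (not the paper's notation):

```python
import numpy as np

def exp_hawkes_intensity_path(query_times, event_times, mu=0.5, alpha=0.8, beta=1.2):
    """Evaluate mu + sum_{t_j < t} alpha * exp(-beta * (t - t_j)) via the Markov recursion.

    A single state A tracks the excitation accumulated at the most recent event,
    so each evaluation only needs the time elapsed since that event.
    """
    out = []
    A, t_last, j = 0.0, 0.0, 0
    event_times = list(event_times)
    for t in sorted(query_times):
        # Absorb all events that occurred strictly before the query time t.
        while j < len(event_times) and event_times[j] < t:
            s = event_times[j]
            A = A * np.exp(-beta * (s - t_last)) + alpha   # decay, then add the new jump
            t_last, j = s, j + 1
        out.append(mu + A * np.exp(-beta * (t - t_last)))
    return out

events = [0.6, 1.4, 2.1]
print(exp_hawkes_intensity_path([0.5, 1.0, 2.0, 3.0], events))
```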
Theorem 5.
Due to the smoothness of the exponential function, the approximation rate in Theorem 5 only adds the term compared with the results in Theorem 4. Similarly, combining with Theorem 3, we easily obtain part (ii) of Theorem 1.

5.3 Linear Hawkes Case
Now we consider the general linear Hawkes process, i.e., (9) in section 2.3. Motivated by the approximation construction for the vanilla Hawkes process, we want to find a decomposition of the general in which each term has the "Markov property", so that we can construct the corresponding RNN structure. Precisely, for , , , we can decompose into two parts,
where satisfies the boundary condition, , , , and , . The term can be handled similarly to the vanilla Hawkes process. For , we consider its Fourier expansion,
Thanks to the boundary condition, can be well approximated by a finite sum of its Fourier series. Then we can use the "Markov property" of the trigonometric function pairs and to construct the RNN-TPP. The construction is similar to that for the exponential function case but requires more delicate calculations. Combining all the approximation parts, we obtain the approximation theorem for (9). The above ideas are visualized in Figure 4.
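The "Markov property" of a cosine/sine pair is that advancing time by a fixed increment acts on the pair as a rotation matrix, so the pair can be carried forward in a hidden state just like the exponential term. A quick numerical check of this identity (frequency and shift are arbitrary):

```python
import numpy as np

def rotate_pair(state, omega, delta):
    """Advance (cos(omega*t), sin(omega*t)) to time t + delta via a rotation matrix."""
    c, s = np.cos(omega * delta), np.sin(omega * delta)
    R = np.array([[c, -s], [s, c]])
    return R @ state

omega, t, delta = 2.0, 0.7, 0.45
state_t = np.array([np.cos(omega * t), np.sin(omega * t)])
advanced = rotate_pair(state_t, omega, delta)
direct = np.array([np.cos(omega * (t + delta)), np.sin(omega * (t + delta))])
print(np.allclose(advanced, direct))   # True: the pair can be carried in a hidden state
```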

Theorem 6.
We make a few remarks on Theorem 6. There are two tuning parameters in , where is the tuning parameter controlling the approximation error of , , and the finite sum of the Fourier series, and is the tuning parameter controlling the number of terms of the Fourier series entering the RNN-TPP. The term is obtained similarly to that in the vanilla Hawkes case, and the term is the error caused by the finite-sum approximation of the Fourier series. Moreover, the term in the width of the RNN-TPP is caused by the approximation construction for the first terms of the Fourier series. Finally, combining with Theorem 3, we obtain part (iii) of Theorem 1.
5.4 Nonlinear Hawkes Case
Finally, we consider the nonlinear Hawkes process defined in (10) in section 2.3. To simplify the statement, we only consider the simple case, i.e., . The results for general can be obtained similarly. Compared to the vanilla Hawkes case, the additional challenge here is the presence of a nonlinear function . With two additional layers, we can approximate well. Together with the results for the vanilla Hawkes process, we can obtain the desired RNN-TPP architecture. For clarity, we also provide a graphical illustration in Figure 5.
Theorem 7.
Since we assume to be Lipschitz continuous, we can only get approximation error. The rate can be improved if is allowed to have better smoothness properties. Again, combining with Theorem 3, we arrive at Theorem 2.

Remark 10.
The universal approximation properties of one-layer RNNs were studied in Schäfer and Zimmermann (2007). Our results differ from theirs in the following sense. (i) The RNN-TPP is defined over the continuous time domain , while the standard RNN only considers discrete time points. In other words, our approximation results hold uniformly over all . (ii) Schäfer and Zimmermann (2007) do not give explicit formulas for the widths of the hidden layers or the parameter weights in the construction of the RNN approximator. Therefore, their results cannot be directly used to compute the approximation error.
6 Usefulness of Interpolation of Hidden States
As mentioned in Remark 3, the RNN-TPP needs to take into account any continuous time point between the observed time grids and . The interpolation of the hidden state between and is therefore essential in the construction of RNN-TPPs.
In this section, we give a counter-example to illustrate that an RNN-TPP model with a linear interpolation of hidden states is unable to precisely capture the true intensity in terms of the excess risk . For simplicity, we only consider the single-layer RNN-TPP; the argument is the same for multi-layer RNN-TPPs.
We consider a (single-layer) RNN-TPP which admits the following model structure,
(27)
where is the embedding for the -th event, , , , and and will be determined from the true intensity. If we take , the model coincides with one using a constant hidden state interpolation mechanism, as in Du et al. (2016). In other words, for all satisfying when .
Theorem 8.
Suppose the true model intensity on has the following form,
Hence we can take and , and then there exists a constant such that
(28)
Theorem 8 tells us that RNN-TPPs with an improper hidden state interpolation may fail to offer a good approximation, even under a very simple non-homogeneous Poisson model. Therefore, the user-specified input embedding vector function plays an important role in interpolating the hidden states. It should be carefully chosen so that it can summarize the information in the past event history to some extent.
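To see the issue concretely: with a constant (or linear) hidden-state interpolation and a monotone output activation, the modelled intensity is constant (or monotone) on every event-free interval, so a non-monotone true intensity incurs an unavoidable error there, whereas an embedding that feeds the elapsed time back into the network faces no such obstruction. A toy computation with an illustrative sinusoidal intensity:

```python
import numpy as np

# True intensity on an event-free interval (t_i, t_{i+1}); it is clearly non-monotone.
lam = lambda t: 1.0 + np.sin(2.0 * t)
t_i, t_next = 1.0, 4.0
grid = np.linspace(t_i, t_next, 2000)
vals = lam(grid)

# With constant hidden-state interpolation, the intensity is frozen at its value at t_i,
# so on this interval the model can realize at best a single constant level.  The best
# possible constant (in sup norm) is the midrange, and its error is half the oscillation.
best_constant = 0.5 * (vals.max() + vals.min())
unavoidable_error = np.max(np.abs(vals - best_constant))
print(f"oscillation of the true intensity on the interval: {vals.max() - vals.min():.3f}")
print(f"best possible sup-norm error under constant interpolation: {unavoidable_error:.3f}")
```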
Remark 11.
Remark 12.
For other choices of in the output layer (e.g., Softplus), the failure of the linear interpolation mechanism can be shown similarly.
7 Discussion
In this paper, we give a positive answer to the question of whether RNN-TPPs can provably have small excess risks in the estimation of well-known TPPs. We establish excess risk bounds under the homogeneous Poisson process, non-homogeneous Poisson process, self-exciting process, and self-correcting process frameworks. Our analysis focuses on two parts, the stochastic error and the approximation error. For the stochastic error, we use a novel truncation technique to decouple the randomness and make classical empirical process theory applicable. We carefully compute the Lipschitz constant of multi-layer RNNs, which is a useful intermediate result for future RNN-related work. For the approximation error, we construct a series of RNNs to approximate the intensities of different TPPs, providing the explicit network depth, width, and parameter weights. To the best of our knowledge, our work is the first to study the approximation ability of multi-layer RNNs over the continuous time domain. We believe the results in the current work add value to both the learning theory and neural network fields.
There are several possible extensions along the research line of neural network-based TPPs. First, it is not clear whether the approximation rate can be improved by a more refined RNN construction (with possibly fewer layers and smaller width) or by other approaches. Second, we only consider the "large " setting here, where the event sequences are observed on a bounded time domain with repeated samples. It would be interesting to extend our results to the "large " setting, where the end time goes to infinity but the number of event sequences, , remains fixed. Third, in the current work, we do not take different event types into account. It may be useful to extend our results to marked TPP settings. Moreover, it is also worth investigating the theoretical performance of other neural network architectures (e.g., Transformer-TPPs) that have performed well in recent empirical applications.
Supplementary Material for "On Non-asymptotic Theory of Recurrent Neural Networks in Temporal Point Processes"
Additional Notations in the Supplementary: For two random variables and , we write if for any . Use to denote the set of positive integers.
8 Proofs in section 3 and 4
8.1 Proof of Lemma 1
By the definition of and , we have
8.2 Proof of Lemma 2
From model assumptions (A1) and (A2), we have , , . Following the notation in the paper, we denote as the number of event times of in . Consider another density and similarly denote as the number of event times of in . Then for any fixed event sequence , , and thus . By a similar formulation as in Daley et al. (2003), the point process with intensity is equivalent to a birth-immigration process with immigration intensity and birth intensity . Hence
where and is the number of event times in generation , which are children of generation .
For , let . We have
and
for any . Since , for any fixed and any , we have
i.e.
for any .
Since can only take integer values, we can get . Thus
Setting , we get
Now take , and then . By Boole's inequality, we have
For the second term, . Let . It can be shown that
For the first term, we have
where . We can take so that . Then
Now let , , and . Taking and
, we have .
Since
we get that with probability at least ,
where , and . Since , we have . Thus with probability at least ,
Taking and , we have . Then
8.3 Proof of Lemma 4
The proof is based on induction. Using the same notation, we give two claims.
Claim 1.
For , , is bounded by
(29)
Proof of Claim 1.
Claim 2.
For , and , is bounded by
(30)
(31) |
Proof of Claim 2.
Now we prove Lemma 4. For , we have
8.4 Proof of Lemma 5
From Lemma 4, for , we have
where .
8.5 Proof of Theorem 3
Lemma 6.
Under assumptions (B1)-(B3), for fixed , with probability at least , we have
Hence
where , , , and .
Proof of Lemma 6.
For , denote . Then . For two parameters and , we have
and similarly,
Hence
By the property of bounded variables, is -sub-gaussian. Since are mutually independent, is -sub-gaussian. From assumptions (B2) and (B3), there exists such that , implying .
The diameter of under the distance can be bounded by
(32)
By Lemma 5, we get
(33)
where , . Denote , . We have
(34)
where we need . If , (34) is obvious since the integral is less than .
∎
Lemma 7.
Suppose the event number satisfies the tail condition
Under assumptions (B1)-(B3), for fixed , we have
Proof of Lemma 7.
From assumptions (B2) and (B3), there exists such that . Then
By the tail condition , we have
∎
9 Proofs in section 5 and 6
9.1 Proof of Theorem 4
9.2 Proof of Theorem 5
The proof is divided into several steps. Let . Here we agree on , . To be concise, we denote , , , hence , , , where we take by default.
We first fix .
Step 1. Construct the approximation of , where
.
Let , then . By simple computation, we have
Applying Lemma 14 to , for any , there exists a tanh neural network with only one hidden layer and width such that
By coordinate transformation, we get
Define . Then
From Lemma 14 and Remark 15, the weights of are bounded by
(37)
where . Taking , we have
(38)
Especially, . Since , by a small modification (precisely, increasing the width by one), we can assume that has the following structure:
Step 2. Construct the approximation of and under the event .
Let , , for . We construct and recursively by
Hence , here .
Similarly, we can define by
Hence . The approximation error can be bounded by
Under the event , we have
(39)
Step 3. Construct the approximation of identity.
By Lemma 3.1 of De Ryck et al. (2021), for any , there exists a one-layer tanh neural network such that
(40)
Actually, can be represented as
Step 4. Construct the approximation of under the event .
Since , from the proof of Theorem 4 , there exists a two-layer tanh neural network with width less than such that
(41)
Moreover, the weights of can be bounded by
Here we assume have the following structure
Since , we can construct its approximation by
and
(42)
Under the event , we have . Recall that . Here we can take , .
Step 5. Estimate the approximation error under the event .
We rewrite (42) as . Under the event and the construction of , we have
From the construction of , we get
(43)
then
(44)
From (39), (40), and (38), under the event , we have
where we take to ensure that can be well approximated by . On the other hand, (41) shows that
To trade off the two error terms in (44), let , and then we can take . Moreover, take and . Hence, under , we have
(45)
Step 6. Estimate the final approximation error.
Similar to (35), we have
(46)
Since , , taking , in Lemma 2 , we have
By (45),
(47)
On the other hand, from and , we have
(48)
Combining (46), (47), and (48), we have
Let , and denote . We have
Step 7. Bound the sizes of the network width and weights.
From Steps 1-6, the width of the network is less than
where . Since and , we have .
From the construction of , and , the weights of the network are less than
where . Since , , , the weights are less than
where is a constant related to , and .
9.3 Proof of Theorem 6
Lemma 8.
Suppose , , . The Fourier series of is given by
(49)
where , /T, . If , , then
and on . Moreover, denote the partial sum of as ,
Proof of Lemma 8.
The proof is a standard Fourier analysis exercise and we omit it. ∎
Theorem 9.
Under model assumption 5 and , , for , there exists an RNN structure as stated in section 2.2 with , , , and input function such that
Moreover, the width of satisfies and the weights of are less than
where is a constant related to , and .
Proof of Theorem 9.
Similar to the proof of Theorem 5 , the proof is divided into several steps. Denote , , , , where , . For , define
and
Hence we have
and
where we agree on . Define . If we assume , the true intensity can be rewritten as
(50)
where refers to the standard inner product of vectors and .
We first fix . Since , we assume so that (50) holds.
Step 1. Construct the approximation of , where . Here .
Let , . Then . By simple computation, we have
Applying Lemma 14 to , for any , there exists a tanh neural network with only one hidden layer and width such that
By coordinate transformation, we get
Define , then
Taking , we have
Especially, . The width of this NN is bounded by . From Lemma 14 and Remark 15, the weights of are bounded by
where . Since , by a small modification (precisely, increasing the width by one), we can assume that has the following structure:
Denote .
Step 1′. Construct the approximation of identity and , . Here .
Similar to Step 3 in the proof of Theorem 5, taking , we have
For , , we can construct a similar approximation as in the proof of Theorem 5. There exists a tanh neural network with only one hidden layer and width such that
where . The weight of is bounded by
Step 2. Construct the approximation of and under the event .
Let , , for . We construct and recursively by
Hence . Here we agree on .
Similarly, we can define by
Hence . The approximation error can be bounded by
(51)
Under the event , we have
Moreover, , , then and (51) can be verified by induction under the event .
For the approximation of , we can similarly construct a simple RNN such that and .
Step 3. Construct the approximation of under the event .
Since , from the proof of Theorem 4, there exists a two-layer tanh neural network with width less than such that
(52)
Moreover, the weights of can be bounded by
Here we assume have the following structure
Since , we can construct its (finite sum) approximation by
It can be viewed as a parallel combination of the RNNs defined before.
Under the event , we have . Recall that . Here we can take , . The final output is .
Step 4. Compute the approximation error under the event .
Under the event and the construction of , we have
(53)
By the construction of ,
(54)
Under the event , for the second term, we have
(55)
For the third term, similarly,
(56)
where we take in (55) and (56) to ensure that and can be well approximated by and .
For the fourth term, using Lemma 8,
Here is the finite sum of the Fourier series defined in Lemma 8.
Finally, by (52), we have . To trade off the error terms in (54), take and . Then under the event , we have
(57)
Step 5. Compute the final approximation error.
Since and , similar to (48), we have
(60)
Combining (58), (59), and (60), we have
Let and denote . We have
Step 6. Bound the sizes of the network width and weights.
From Steps 1-5, the width of the network is less than
where , , , . Hence
From the construction of , , , , the weights of the network are less than
where . Hence the weights of the network are less than
where is a constant related to , and . ∎
Lemma 9.
Let , , and
then is invertible and , where is a universal constant.
Lemma 10.
Proof of Lemma 10.
Now we prove Theorem 6. The proof is based on Theorem 5, Theorem 9, and Lemma 10. From Lemma 10, for , there exists satisfying the boundary condition , , and we have . Define . Denote
and then .
Fix . By the proof of Theorem 9, under the event , there exists an RNN (without the output layer) such that
where .
By the proof of Theorem 5 , under the event , for, there exists an RNN (without the output layer) such that
Let . We have
Let ,
Under the event , . Hence we can take and and denote . Then . By similar arguments in Theorem 9, we have
Let and denote . We have
The bounds on the width and the weights can also be obtained similarly to the proof of Theorem 9.
9.4 Proof of Theorem 7
Denote . Then . Fix . From the proof of Theorem 5, under the event , there exists a two-layer recurrent neural network as in (43) such that
(64)
Moreover, the width of satisfies and the weights of are bounded by
Under the event , the function satisfies . Using (64) and taking , we have . Hence we need to construct an approximation of on . Let , where . Then .
Since is -Lipschitz, and is defined on and is -Lipschitz, by Corollary 5.4 of De Ryck et al. (2021), there exists a tanh neural network with two hidden layers such that
Let . Then
Then under the event , we have
(65)
Recall that . Since , we can take and . Define . We have
(66)
Similar to (46), we have
(67)
Since , similar to (46), taking in Lemma 2 , we have
and similar to (36), we have
(68) |
(69) |
On the other hand, since and , we have
(70)
Combining (67), (69), and (70), we have
Let , and denote . We have
Similar to the proof of Theorem 5, we can bound the width of the network by
where . Hence we have .
Moreover, from the construction of , the weights of the network are less than
where , is a constant related to , and . Then the weights of the network can be bounded by
where are constants related to , and .
9.5 Proof of Theorem 8
Without loss of generality, we denote , for simplicity. Since the compensator of is , for a predictable stochastic process , we have
Since both and are predictable, we have
where , , and the equality holds if and only if . Thus
Denote , and . Denote . By a similar argument, we have
(71)
Under the event ,
and
(72)
Then we only need to show
Case 1. , .
Since , on or . From ,
(73)
Case 2. , .
In this case , we can check that . Hence
(74)
Case 3. , .
By (72) , is continuous with respect to . For fixed , since , . Since is a compact set in , there exists such that
(75)
By (71), (72), (73), (74), and (75),
Hence Theorem 8 is proved.
Remark 13.
Note that we have proved the excess risk
(76)
is always positive if in the proof of Theorem 8. Thus (76) is a well-defined excess risk.
10 Supporting Lemmas
Lemma 11.
(Lemma 8 in Chen et al. (2020)) Let be the set of matrices with bounded spectral norm and be given. The covering number is bounded above by
The following lemma provides a bridge between the covering number and upper bounds for sub-gaussian processes.
Definition 1.
A stochastic process is called a sub-gaussian process for metric on if
A stochastic process is called a centered sub-gaussian process for metric on if is a sub-gaussian process for metric and .
Lemma 12.
Suppose is a centered sub-gaussian process for metric on metric space , where the diameter of is finite, i.e. . Then with probability at least , for any fixed , we have
and
where satisfies .
Proof of Lemma 12.
Let satisfy . Define . Let be a net of with metric , i.e., covers at scale with respect to the metric . Clearly . We take . Define as the closest element of in under the metric . Then , we have
Thus
Consider , and note that any element in is sub-gaussian. By Hoeffding's inequality and a union bound argument, we have
Let , . Then with probability at least , we have
Thus, with probability at least , we get
Let . Then . We have
Thus,
Since
the lemma is proved. ∎
Lemma 13.
(Theorem 5.1 in De Ryck et al. (2021)) Let , and . There exist constants and such that for every integer , there exists a tanh neural network with two hidden layers, with one width at most and the other width at most (or and for ), such that
If , then it holds that
and else, it holds that
Moreover, the weights of scale as .
Remark 14.
By Lemma 13, there exists a constant which depends only on , such that
Lemma 14.
(Corollary 5.8 in De Ryck et al. (2021)) Let , open with and let be analytic on . If, for some , satisfies that for all , then for any , there exists a one-layer neural network of width (or for ) such that
Remark 15.
In De Ryck et al. (2021), the construction of in Lemma 14 uses Lemma 13 directly. Hence the weights of can be derived from Lemma 13. Then there exists a constant such that
where . We emphasize that the original literature (De Ryck et al., 2021) does not give this result, but it can be obtained by simple calculations.
References
- Aalen et al. [2008] Odd Aalen, Ornulf Borgan, and Hakon Gjessing. Survival and event history analysis: a process point of view. Springer Science & Business Media, 2008.
- Bauwens and Hautsch [2009] Luc Bauwens and Nikolaus Hautsch. Modelling financial high frequency data using point processes. In Handbook of financial time series, pages 953–979. Springer, 2009.
- Brémaud and Massoulié [1996] Pierre Brémaud and Laurent Massoulié. Stability of nonlinear hawkes processes. The Annals of Probability, pages 1563–1588, 1996.
- Cai et al. [2022] Biao Cai, Jingfei Zhang, and Yongtao Guan. Latent network structure learning from high-dimensional multivariate point processes. Journal of the American Statistical Association, pages 1–14, 2022.
- Cao et al. [2019] Jian Cao, Zhi Li, and Jian Li. Financial time series forecasting model based on ceemdan and lstm. Physica A: Statistical mechanics and its applications, 519:127–139, 2019.
- Chen et al. [2020] Minshuo Chen, Xingguo Li, and Tuo Zhao. On generalization bounds of a family of recurrent neural networks. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pages 1233–1243. PMLR, 2020.
- Chimmula and Zhang [2020] Vinay Kumar Reddy Chimmula and Lei Zhang. Time series forecasting of covid-19 transmission in canada using lstm networks. Chaos, solitons & fractals, 135:109864, 2020.
- Cybenko [1989] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
- Daley and Vere-Jones [2008] Daryl J Daley and David Vere-Jones. An introduction to the theory of point processes: volume II: general theory and structure. Springer, 2008.
- Daley et al. [2003] Daryl J Daley, David Vere-Jones, et al. An introduction to the theory of point processes: volume I: elementary theory and methods. Springer, 2003.
- De Ryck et al. [2021] Tim De Ryck, Samuel Lanthaler, and Siddhartha Mishra. On the approximation of functions by tanh neural networks. Neural Networks, 143:732–750, 2021.
- Du et al. [2015] Nan Du, Yichen Wang, Niao He, Jimeng Sun, and Le Song. Time-sensitive recommendation from recurrent user activities. Advances in neural information processing systems, 28, 2015.
- Du et al. [2016] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1555–1564, 2016.
- Dugas et al. [2001] Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, and René Garcia. Incorporating second-order functional knowledge for better option pricing. In Advances in Neural Information Processing Systems, volume 13, 2001.
- Enguehard et al. [2020] Joseph Enguehard, Dan Busbridge, Adam Bozson, Claire Woodcock, and Nils Hammerla. Neural temporal point processes for modelling electronic health records. In Machine Learning for Health, pages 85–113. PMLR, 2020.
- Fang et al. [2023] Guanhua Fang, Ganggang Xu, Haochen Xu, Xuening Zhu, and Yongtao Guan. Group network hawkes process. Journal of the American Statistical Association, pages 1–17, 2023.
- Farajtabar et al. [2017] Mehrdad Farajtabar, Yichen Wang, Manuel Gomez-Rodriguez, Shuang Li, Hongyuan Zha, and Le Song. Coevolve: A joint point process model for information diffusion and network evolution. Journal of Machine Learning Research, 18(41):1–49, 2017.
- Fleming and Harrington [2013] Thomas R Fleming and David P Harrington. Counting processes and survival analysis, volume 625. John Wiley & Sons, 2013.
- Fukushima [1969] Kunihiko Fukushima. Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics, 5(4):322–333, 1969.
- Gautschi [1990] Walter Gautschi. How (un)stable are vandermonde systems? Asymptotic and Computational Analysis, 1990. URL https://api.semanticscholar.org/CorpusID:18896588.
- Hansen et al. [2015] Niels Richard Hansen, Patricia Reynaud-Bouret, and Vincent Rivoirard. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 2015.
- Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
- Hawkes [1971] Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
- Hawkes [2018] Alan G Hawkes. Hawkes processes and their applications to finance: a review. Quantitative Finance, 18(2):193–198, 2018.
- Hawkes and Oakes [1974] Alan G Hawkes and David Oakes. A cluster process representation of a self-exciting process. Journal of applied probability, 11(3):493–503, 1974.
- Hosseini et al. [2017] Seyed Abbas Hosseini, Keivan Alizadeh, Ali Khodadadi, Ali Arabzadeh, Mehrdad Farajtabar, Hongyuan Zha, and Hamid R Rabiee. Recurrent poisson factorization for temporal recommendation. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 847–855, 2017.
- Isham and Westcott [1979] Valerie Isham and Mark Westcott. A self-correcting point process. Stochastic processes and their applications, 8(3):335–347, 1979.
- James et al. [2013] Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. An introduction to statistical learning, volume 112. Springer, 2013.
- Jiao et al. [2023] Yuling Jiao, Guohao Shen, Yuanyuan Lin, and Jian Huang. Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of Statistics, 51(2):691–716, 2023.
- Kingman [1992] John Frank Charles Kingman. Poisson processes, volume 3. Clarendon Press, 1992.
- Laub et al. [2021] Patrick J Laub, Young Lee, and Thomas Taimre. The elements of Hawkes processes. Springer, 2021.
- Li et al. [2018] Shuang Li, Shuai Xiao, Shixiang Zhu, Nan Du, Yao Xie, and Le Song. Learning temporal point processes via reinforcement learning. Advances in neural information processing systems, 31, 2018.
- Lin et al. [2022] Haitao Lin, Lirong Wu, Guojiang Zhao, Pai Liu, and Stan Z Li. Exploring generative neural temporal point process. arXiv preprint arXiv:2208.01874, 2022.
- Lu et al. [2021] Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021.
- McCulloch and Pitts [1943] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5:115–133, 1943.
- Medsker and Jain [1999] Larry Medsker and Lakhmi C Jain. Recurrent neural networks: design and applications. CRC press, 1999.
- Mei and Eisner [2017] Hongyuan Mei and Jason M Eisner. The neural hawkes process: A neurally self-modulating multivariate point process. Advances in neural information processing systems, 30, 2017.
- Ogata and Vere-Jones [1984] Yosihiko Ogata and David Vere-Jones. Inference for earthquake models: a self-correcting model. Stochastic processes and their applications, 17(2):337–347, 1984.
- Omi et al. [2019] Takahiro Omi, Kazuyuki Aihara, et al. Fully neural network based model for general temporal point processes. Advances in neural information processing systems, 32, 2019.
- Perkel et al. [1967] Donald H Perkel, George L Gerstein, and George P Moore. Neuronal spike trains and stochastic point processes: I. the single spike train. Biophysical journal, 7(4):391–418, 1967.
- Rubanova et al. [2019] Yulia Rubanova, Ricky TQ Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. Advances in neural information processing systems, 32, 2019.
- Schäfer and Zimmermann [2007] Anton Maximilian Schäfer and Hans-Georg Zimmermann. Recurrent neural networks are universal approximators. International journal of neural systems, 17(04):253–263, 2007.
- Schmidt-Hieber [2020] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics, 48(4), 2020. doi: 10.1214/19-AOS1875. URL https://doi.org/10.1214/19-AOS1875.
- Schoenberg [2005] Frederic Paik Schoenberg. Consistent parametric estimation of the intensity of a spatial–temporal point process. Journal of Statistical Planning and Inference, 128(1):79–93, 2005.
- Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- Shchur et al. [2021] Oleksandr Shchur, Ali Caner Türkmen, Tim Januschowski, and Stephan Günnemann. Neural temporal point processes: A review. arXiv preprint arXiv:2104.03528, 2021.
- Shen et al. [2019] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. arXiv preprint arXiv:1906.05497, 2019.
- Suh and Cheng [2024] Namjoon Suh and Guang Cheng. A survey on statistical theory of deep learning: Approximation, training dynamics, and generative models. arXiv preprint arXiv:2401.07187, 2024.
- Tarwani and Edem [2017] Kanchan M Tarwani and Swathi Edem. Survey on recurrent neural network in natural language processing. Int. J. Eng. Trends Technol, 48(6):301–304, 2017.
- Tu et al. [2020] Zhuozhuo Tu, Fengxiang He, and Dacheng Tao. Understanding generalization in recurrent neural networks. In International Conference on Learning Representations, 2020. URL https://api.semanticscholar.org/CorpusID:214346647.
- Vidyasagar [2013] Mathukumalli Vidyasagar. Learning and generalisation: with applications to neural networks. Springer Science & Business Media, 2013.
- Wang et al. [2012] Ting Wang, Mark Bebbington, and David Harte. Markov-modulated hawkes process with stepwise decay. Annals of the Institute of Statistical Mathematics, 64:521–544, 2012.
- Williams et al. [2020] Alex Williams, Anthony Degleris, Yixin Wang, and Scott Linderman. Point process models for sequence detection in high-dimensional neural spike trains. Advances in neural information processing systems, 33:14350–14361, 2020.
- Yin et al. [2017] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923, 2017.
- Zhang et al. [2021] Yizhou Zhang, Karishma Sharma, and Yan Liu. Vigdet: Knowledge informed neural temporal point process for coordination detection on social media. Advances in Neural Information Processing Systems, 34:3218–3231, 2021.
- Zhou et al. [2022] Zihao Zhou, Xingyi Yang, Ryan Rossi, Handong Zhao, and Rose Yu. Neural point process for learning spatiotemporal event dynamics. In Learning for Dynamics and Control Conference, pages 777–789. PMLR, 2022.
- Zuo et al. [2020] Simiao Zuo, Haoming Jiang, Zichong Li, Tuo Zhao, and Hongyuan Zha. Transformer hawkes process. In International conference on machine learning, pages 11692–11702. PMLR, 2020.