Spatio-temporal point processes with deep non-stationary kernels

Zheng Dong H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology Xiuyuan Cheng Department of Mathematics, Duke University Yao Xie¹¹1Email: yao.xie@isye.gatech.edu H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

Abstract

Point process data are becoming ubiquitous in modern applications, such as social networks, health care, and finance. Despite the powerful expressiveness of the popular recurrent neural network (RNN) models for point process data, they may not successfully capture sophisticated non-stationary dependencies in the data due to their recurrent structures. Another popular type of deep model for point process data is based on representing the influence kernel (rather than the intensity function) by neural networks. We take the latter approach and develop a new deep non-stationary influence kernel that can model non-stationary spatio-temporal point processes. The main idea is to approximate the influence kernel with a novel and general low-rank decomposition, enabling efficient representation through deep neural networks and computational efficiency and better performance. We also take a new approach to maintain the non-negativity constraint of the conditional intensity by introducing a log-barrier penalty. We demonstrate our proposed method’s good performance and computational efficiency compared with the state-of-the-art on simulated and real data.

1 Introduction

Point process data, consisting of sequential events with timestamps and associated information such as location or category, are ubiquitous in modern scientific fields and real-world applications. The distribution of events is of great scientific and practical interest, both for predicting new events and understanding the events’ generative dynamics (Reinhart, 2018). To model such discrete events in continuous time and space, spatio-temporal point processes (STPPs) are widely used in a diverse range of domains, including modeling earthquakes (Ogata, 1988, 1998), the spread of infectious diseases (Schoenberg et al., 2019; Dong et al., 2021), and wildfire propagation Hering et al. (2009).

A modeling challenge is to accurately capture the underlying generative model of event occurrence in general spatio-temporal point processes (STPP) while maintaining the model efficiency. Specific parametric forms of conditional intensity are proposed in seminal works of Hawkes process (Hawkes, 1971; Ogata, 1988) to tackle the issue of computational complexity in STPPs, which requires evaluating the complex multivariate integral in the likelihood function. They use an exponentially decaying influence kernel to measure the influence of a past event over time and assume the influence of all past events is positive and linearly additive. Despite computational simplicity (since the integral of the likelihood function is avoided), such a parametric form limits the model’s practicality in modern applications.

Recent models use neural networks in modeling point processes to capture complicated event occurrences. RNN (Du et al., 2016) and LSTM (Mei and Eisner, 2017) have been used by taking advantage of their representation power and capability in capturing event temporal dependencies. However, the recurrent structures of RNN-based models cannot capture long-range dependency (Bengio et al., 1994) and attention-based structure (Zhang et al., 2020; Zuo et al., 2020) is introduced to address such limitations of RNN. Despite much development, existing models still cannot sufficiently capture spatio-temporal non-stationarity, which are common in real-world data (Graham et al., 2013; Dong et al., 2021). Moreover, while RNN-type models may produce strong prediction performance, the models consist of general forms of network layers and the modeling power relies on the hidden states, thus often not easily interpretable.

A promising approach to overcome the above model restrictions is point process models that combine statistical models with neural network representation, such as Zhu et al. (2022) and Chen et al. (2020), to enjoy both the interpretability and expressive power of neural networks. In particular, the idea is to represent the (possibly non-stationary) influence kernel based on a spectral decomposition and represent the basis functions using neural networks. However, the prior work (Zhu et al., 2022) is not specifically designed for non-stationary kernel and the low-rank representation can be made significantly more efficient, which is the main focus of this paper.

Contribution. In this paper, we develop a non-stationary kernel (referred to as DNSK) for (possibly non-stationary) spatio-temporal processes that enjoy efficient low-rank representation, which leads to much improved computational efficiency and predictive performance. The construction is based on an interesting observation that by reparameterize the influence kernel from the original form of $k(t^{\prime},t)$ , (where $t^{\prime}$ is the historical even time, and $t$ is the current time) to an equivalent form $k(t^{\prime},t-t^{\prime})$ (which thus is parameterized by the displacement $t-t^{\prime}$ instead), the rank can be reduced significantly, as shown in Figure 1. This observation inspired us to design a much more efficient representation of the non-stationary point processes with much fewer basis functions to represent the same kernel.

In summary, the contributions of our paper include

•

We introduce an efficient low-rank representation of the influence kernel based on a novel “displacement” re-parameterization. Our representation can well-approximate a large class of general non-stationary influence kernels and is generalizable to spatio-temporal kernels (also potentially to data with high-dimensional marks). Efficient representation leads to lower computational cost and better prediction power, as demonstrated in our experiments.
•

In model fitting, we introduce a log-barrier penalty term in the objective function to ensure the non-negative conditional intensity function so the model is statistically meaningful, and the problem is numerically stable. This approach also enables the model to learn general influence functions (that can have negative values), which is a drastic improvement from existing influence kernel-based methods that require the kernel functions to be non-negative.
•

Using extensive synthetic and real data experiments, we show the competitive performance of our proposed methods in both model recovery and event prediction compared with the state-of-the-art, such as the RNN-based and transformer-based models.

Refer to caption — (a) Kernel matrix of $k(t^{\prime},t)$ with rank $298$ .

1.1 Related works

The original work of A. Hawkes (Hawkes, 1971) provides classic self-exciting point processes for temporal events, which express the conditional intensity function with an influence kernel and a base rate. Ogata (1988, 1998) propose a parametric form of spatio-temporal influence kernel which enjoys strong model interpretability and efficiency. However, such simple parametric forms own limited expressiveness in characterizing the complex events’ dynamic in modern applications.

Neural networks have been widely adopted in point processes (Xiao et al., 2017; Chen et al., 2020). Du et al. (2016) incorporates recurrent neural networks and Mei and Eisner (2017) use a continuous-time invariant of LSTM to model event influence with exponential decay over time. These RNN-based models may be unable to capture complicated event dependencies due to the recurrent structure. Zhang et al. (2020); Zuo et al. (2020) introduce self-attentive structures into point processes for their capability to memorize long-term influence by dealing with an event sequence as a whole. The main limitation is that they assume a dot-product-based score function and assume linearly decaying of event influence. Omi et al. (2019) propose a fully-connected neural network to model the cumulative intensity function to go beyond parametric decaying influence. However, the event embeddings are still generated by RNN, and fitting cumulative intensity function by neural networks lacks model interpretability. Note that all the above models tackle temporal events with categorical marks, which are inapplicable in continuous time and location space.

Recent works adopt neural networks in learning the influence kernel function. The kernel introduced in Okawa et al. (2021) uses neural networks to model the latent dynamic of time interval but still assumes an exponentially decaying influence over time. Zhu et al. (2022) proposes a kernel representation using spectral decomposition and represents feature functions using deep neural networks to harvest powerful model expressiveness when dealing with marked event data. Our method considers an alternative novel kernel representation that allows the general kernel to be expressed further low-rankly.

2 Background

Spatio-temporal point processes (STPPs). (Reinhart, 2018; Moller and Waagepetersen, 2003) have been widely used to model sequences of random events that happen in continuous time and space. Let $\mathcal{H}=\{(t_{i},s_{i})\}_{i=1}^{n}$ denote the event stream with time $t_{i}\in[0,T]\subset\mathbb{R}$ and location $s_{i}\in\mathcal{S}\subset\mathbb{R}^{d_{\mathcal{S}}}$ of $i$ th event. The event number $n$ is also random. Given the observed history $\mathcal{H}_{t}=\{(t_{i},s_{i})\in\mathcal{H}|t_{i}<t\}$ before time $t$ , an STPP is then fully characterized by the conditional intensity function

\lambda\left(t,s\mid\mathcal{H}_{t}\right)=\lim_{\Delta t\downarrow 0,\Delta s\downarrow 0}\frac{\mathbb{E}\left[\mathbb{N}([t,t+\Delta t]\times B(s,\Delta s))\mid\mathcal{H}_{t}\right]}{|B(s,\Delta s)|\Delta t},

(1)

where $B(s,\Delta s)$ is a ball centered at $s\in\mathbb{R}^{d_{\mathcal{S}}}$ with radius $\Delta s$ , and the counting measure $\mathbb{N}$ is defined as the number of events occurring in $[t,t+\Delta t]\times B(s,\Delta s)\subset\mathbb{R}^{d_{\mathcal{S}}+1}$ . Naturally $\lambda\left(t,s|\mathcal{H}_{t}\right)\geq 0$ for any arbitrary $t$ and $s$ . In the following, we omit the dependency on history $\mathcal{H}_{t}$ and use common shorthand $\lambda(t,s)$ . The log-likelihood of observing $\mathcal{H}$ on $[0,T]\times\mathcal{S}$ is given by (Daley et al., 2003)

\ell(\mathcal{H})=\sum_{i=1}^{n}\log\lambda\left(t_{i},s_{i}\right)-\int_{0}^{T}\int_{\mathcal{S}}\lambda(t,s)dsdt

(2)

Neural point processes. parameterize the conditional intensity function by taking advantage of recurrent neural networks (RNNs). In Du et al. (2016), an input vector $\bm{x}_{i}$ which extracts the information of event $t_{i}$ and the associated event attributes $m_{i}$ (can be event mark or location) is fed into the RNN. A hidden state vector $\bm{h}_{i}$ is updated by $\bm{h}_{i}=\rho(\bm{h}_{i-1},\bm{x}_{i})$ , where $\rho(\cdot)$ is a mapping fulfilled by recurrent neural network operations. The conditional intensity function on $(t_{i},t_{i+1}]$ is then defined as $\lambda(t)=\delta(t,\bm{h}_{i})$ , where $\delta$ is an exponential transformation that guarantees a positive intensity. In Mei and Eisner (2017) the RNN is replaced by a continuous-time LSTM module with hidden states $\bm{h}(t)$ defined on $[0,T]$ and a Softplus function $\delta$ . Attention-based models are introduced in Zuo et al. (2020); Zhang et al. (2020) to overcome the inability of RNNs to capture sophisticated event dependencies due to their recurrent structures.

Hawkes process. (Hawkes, 1971) is a well-known generalized point process model. Assuming the influences from past events are linearly additive, the conditional intensity function takes the form of

\lambda(t,s)=\mu+\sum_{(t^{\prime},s^{\prime})\in\mathcal{H}_{t}}k(t^{\prime},t,s^{\prime},s),

(3)

where $k$ is an influence kernel function that captures event interactions. Commonly the kernel function is assumed to be stationary, that is, $k$ only depends on $t-t^{\prime}$ and $s-s^{\prime}$ , which limits the model expressivity. In this work, we aim to capture complicated non-stationarity in spatio-temporal event dependencies by leveraging the strong approximation power of neural networks in kernel fitting.

3 Low-rank deep non-stationary kernel

Due to the intricate dependencies between events, it is challenging to choose the form of kernel function that achieves great model expressiveness while enjoying high model efficiency. In this section, we introduce a unified model with a low-rank deep non-stationary kernel to capture the complex heterogeneity in events’ influence over spatio-temporal space.

3.1 Kernel with history and spatio-temporal displacement

For the influence kernel function $k(t^{\prime},t,s^{\prime},s)$ , by using the displacements in $t$ and $s$ as variables, we first re-parameterize the kernel as $k(t^{\prime},t-t^{\prime},s^{\prime},s-s^{\prime})$ , where the minus in $s-s^{\prime}$ refers to element-wise difference between $s$ and $s^{\prime}$ when $d_{\mathcal{S}}>1$ . Then we achieve a finite-rank decomposed representation based on (truncated) singular value decomposition (SVD) for kernel functions (Mollenhauer et al., 2020) (which can be understood as the kernel version of matrix SVD, where the eigendecomposition is based on Mercer’s Theorem (Mercer, 1909)), and that the decomposed spatial (and temporal) kernel functions can be approximated under shared basis functions (cf. Assumption A.2). The resulting approximate finite-rank representation is written as (details are in Appendix A.1)

k(t^{\prime},t-t^{\prime},s^{\prime},s-s^{\prime})=\sum_{r=1}^{R}\sum_{l=1}^{L}\alpha_{lr}\psi_{l}(t^{\prime})\varphi_{l}(t-t^{\prime})u_{r}(s^{\prime})v_{r}(s-s^{\prime}).

(4)

Here $\{\psi_{l},\varphi_{l}:[0,T]\to\mathbb{R},l=1,\dots,L\}$ are two sets of temporal basis functions that characterize the temporal influence of event at $t^{\prime}$ and the decaying effect brought by elapsed time $t-t^{\prime}$ . Similarly, spatial basis functions $\{u_{r},v_{r}:\mathcal{S}\to\mathbb{R},r=1,\dots,R\}$ capture the spatial influence of event at $s^{\prime}$ and the decayed influence after spreading over the displacement of $s-s^{\prime}$ . The corresponding weight $\alpha_{lr}$ at different spatio-temporal ranks combines each set of basis functions into a weighted summation, leading to the final expression of influence kernel $k$ .

To further enhance the model expressiveness, we use a fully-connected neural network to represent each basis function. The history or displacement is taken as the input and fed through multiple hidden layers equipped with Softplus non-linear activation function. To allow for inhibiting influence from past events (negative value of influence kernel $k$ ), we use a linear output layer for each neural network. For an influence kernel with temporal rank $L$ and spatial rank $R$ , we need $2(L+R)$ independent neural networks for modeling.

The benefits of our proposed kernel framework lies in the following: (i) The kernel parameterization with displacement significantly reduces the rank needed when representing the complicated kernel encountered in practice as shown in Figure 1. (ii) The non-stationarity of original influence of historical events over spatio-temporal space can be conveniently captured by in-homogeneous $\{\psi_{l}\}_{l=1}^{L}$ , $\{u_{r}\}_{r=1}^{R}$ , making the model applicable in modeling general STPPs. (iii) The propagating patterns of influence are characterized by $\{\varphi_{l}\}_{l=1}^{L}$ , $\{v_{r}\}_{r=1}^{R}$ which go beyond simple parametric forms. In particular, when the events’ influence has finite range, i.e. there exist $\tau_{\text{max}}$ and $a_{\text{max}}$ such that the influence decays to zero if $|t-t^{\prime}|>\tau_{\text{max}}$ or $||s-s^{\prime}||>a_{\text{max}}$ , we can restrict the parameterization of $\{\varphi_{l}\}_{l=1}^{L}$ and $\{v_{r}\}_{r=1}^{R}$ on a local domain $[0,\tau_{\text{max}}]\times B(0,a_{\text{max}})$ instead of $[0,T]\times\mathcal{S}$ , which further reduce the model complexity. Details of choosing kernel and neural network architectures are described in Appendix C.

Remark 1 (the class of influence kernel expressed).

The proposed deep kernel representation covers a large class of non-stationary kernels generally used in STPPs. In particular, the proposed form of the kernel does not need to be positive semi-definite or even symmetric (Reinhart, 2018). The low-rank decomposed formulation (4) is of SVD-type (cf. Appendix A.1). While each $\varphi_{l}$ (and $v_{r}$ ) can be viewed as stationary (i.e., shift-invariant), the combination with left modes in the summation enables to model spatio-temporal non-stationarity. The technical assumptions A.1 and A.2 do not require more than the existence of a low-rank decomposition motivated by kernel SVD. As long as the $2(R+L)$ many functions $\psi_{l}$ , $\varphi_{l}$ , and $u_{r}$ , $v_{r}$ are sufficiently regular, they can be approximated and learned by a neural network. The universal approximation power of neural networks enables our framework to express a broad range of general kernel functions, and the low-rank decomposed form reduces the modeling of a spatio-temporal kernel to finite many functions on time and space domains (the right modes are on truncated domains), respectively.

4 Efficient computation of model

We consider model optimization through Maximum likelihood estimation (MLE) (Reinhart, 2018). The resulting conditional intensity function could now be negative by allowing inhibiting historical influence. A common approach to guarantee the non-negativity is to adopt a nonlinear positive activation function in the conditional intensity (Du et al., 2016; Zhu et al., 2022). However, the integral of such a nonlinear intensity over spatio-temporal space is computationally expensive. To tackle this, we first introduce a log-barrier to the MLE optimization problem to guarantee the non-negativity of conditional intensity function $\lambda$ and maintain its linearity. Then we provide a computationally efficient strategy that benefits from the linearity of the conditional intensity. The extension of the approach to point process data with marks is given in Appendix B.

4.1 Model optimization with log-barrier

We re-denote $\ell(\mathcal{H})$ in (2) by $\ell(\theta)$ in terms of model parameter $\theta$ . The constrained MLE optimization problem for model parameter estimation can be formulated as:

\min_{\theta}-\ell(\theta),s.t.-\lambda(t,s)\leq 0,\ \forall t\in[0,T],\forall s\in\mathcal{S}.

Introduce a log-barrier method (Boyd et al., 2004) to ensure the non-negativity of $\lambda$ , and penalize the values of $\lambda$ on a dense enough grid $\mathcal{U}_{\text{bar},t}\times\mathcal{U}_{\text{bar},s}\subset[0,T]\times\mathcal{S}$ . The log-barrier is defined as

p(\theta,b):=-\frac{1}{|\mathcal{U}_{\text{bar},t}\times\mathcal{U}_{\text{bar},s}|}\sum_{c_{t}=1}^{|\mathcal{U}_{\text{bar},t}|}\sum_{c_{s}=1}^{|\mathcal{U}_{\text{bar},s}|}\log(\lambda(t_{c_{t}},s_{c_{s}})-b),

(5)

where $c_{t},c_{s}$ indicate the index of the gird, and $b$ is a lower bound of conditional intensity function on the grid to guarantee the feasibility of logarithm operation. The MLE optimization problem can be written as

	$\displaystyle\min_{\theta}\mathcal{L}(\theta):=-\ell(\theta)+\frac{1}{w}p(\theta,b)$	$\displaystyle=-\left(\sum_{i=1}^{n}\log\lambda(t_{i},s_{i})-\int_{0}^{T}\int_{\mathcal{S}}\lambda(t,s)dsdt\right)$		(6)
		$\displaystyle~{}~{}~{}-\frac{1}{w\|\mathcal{U}_{\text{bar},t}\times\mathcal{U}_{\text{bar},s}\|}\sum_{c_{t}=1}^{\|\mathcal{U}_{\text{bar},t}\|}\sum_{c_{s}=1}^{\|\mathcal{U}_{\text{bar},s}\|}\log(\lambda(t_{c_{t}},s_{c_{s}})-b),$		(6)

where $w$ is a weight to control the trade-off between log-likelihood and log-barrier; $w$ and $b$ can be set accordingly during the learning procedure. Details can be found in Appendix A.2.

Note that previous works (Du et al., 2016; Mei and Eisner, 2017; Pan et al., 2021; Zuo et al., 2020; Zhu et al., 2022) use a scaled positive transformation to guarantee non-negativity conditional intensity function. Compared with them, the log-barrier method preserves the linearity of the conditional intensity function. As shown in Table 1, such a log-barrier method enables efficient model computation (See more details in Section 4.2) and enhance the model recovery power.

4.2 Model computation

The log-likelihood computation of general STPPs (especially those with general influence function) is often difficult and requires numerical integral and thus time-consuming. Given a sequence of events $\{x_{i}=(t_{i},s_{i})\}_{i=1}^{n}$ of number $n$ , the complexity of neural network evaluation is of $\mathcal{O}(n^{2})$ for the term of log-summation and of $\mathcal{O}(Kn)$ ( $K\gg n$ ) when using numerical integration for the double integral term with $K$ sampled points in a multi-dimensional space. In the following, we circumvent the calculation difficulty by proposing an efficient computation for $\mathcal{L}(\theta)$ with complexity $\mathcal{O}(n)$ of neural network evaluation through a domain discretization strategy.

Computation of log-summation.

The first log-summation term in (2) can be written as:

\sum_{i=1}^{n}\log\lambda(t_{i},s_{i})=\sum_{i=1}^{n}\log{\left(\mu+\sum_{t_{j}<t_{i}}\sum_{r=1}^{R}\sum_{l=1}^{L}\alpha_{lr}\psi_{l}(t_{j})\varphi_{l}(t_{i}-t_{j})u_{r}(s_{j})v_{r}(s_{i}-s_{j})\right)}.

(7)

Note that each $\psi_{l}$ only needs to be evaluated at event time $\{t_{i}\}_{i=1}^{n}$ and each $u_{r}$ is evaluated at all the event location $\{s_{i}\}_{i=1}^{n}$ . To avoid the redundant evaluations of $\varphi_{l}$ over every pair of $(t_{i},t_{j})$ , we set up a uniform grid $\mathcal{U}_{t}$ over time horizon $[0,\tau_{\text{max}}]$ and evaluate $\varphi_{l}$ on the grid. The value of $\varphi_{l}(t_{j}-t_{i})$ can be obtained by linear interpolation with values on two adjacent grid points of $t_{j}-t_{i}$ . By doing so, we only need to evaluate $\varphi_{l}$ for $|\mathcal{U}_{t}|$ times on the grids. Note that $\varphi_{l}$ can be simply feed with $0$ when $t_{j}-t_{i}>\tau_{\text{max}}$ without any neural network evaluation.

Here we directly evaluate $v_{r}(s_{i}-s_{j})$ since numerical interpolation is less accurate in location space. Note that one does not need to evaluate every pair of index $(i,j)$ . Instead, we have $I:=\{(i,j)\mid v_{r}(s_{i}-s_{j})\text{ will be computed}\}=\{(i,j)\mid t_{j}<t_{i}\leq t_{j}+\tau_{\text{max}}\}\cap\{(i,j)\mid\|s_{i}-s_{j}\|\leq a_{\text{max}}\}$ . We use $0$ to other pairs of $(i,j)$ .

Computation of integral.

A benefit of our approach is that we avoid numerical integration for the conditional intensity function (needed to evaluate the likelihood function), since the design of the kernel allows us to decompose the desired integral to integrating basis functions. Specifically, we have

		$\displaystyle\int_{0}^{T}\int_{\mathcal{S}}\lambda(t,s)dsdt=\mu\|\mathcal{S}\|T+\sum_{i=1}^{n}\int_{0}^{T}\int_{\mathcal{S}}I(t_{i}<t)k(t_{i},t,s_{i},s)dsdt$		(8)
		$\displaystyle~{}~{}~{}=\mu\|\mathcal{S}\|T+\sum_{i=1}^{n}\sum_{r=1}^{R}u_{r}(s_{i})\int_{\mathcal{S}}v_{r}(s-s_{i})ds\sum_{l=1}^{L}\alpha_{rl}\psi_{l}(t_{i})\int_{0}^{T-t_{i}}\varphi_{l}(t)dt.$		(8)

To compute the integral of $\varphi_{l}$ , we take the advantage of the pre-computed $\varphi_{l}$ on the grid $\mathcal{U}_{t}$ . Let $F_{l}(t):=\int_{0}^{t}\varphi_{l}(\tau)d\tau$ . Then $F_{l}(T-t_{i})$ can be computed by the linear interpolation of values of $F_{l}$ at two adjacent grid points (in $\mathcal{U}_{t}$ ) of $T-t_{i}$ . In particular, $F_{l}$ evaluated on $\mathcal{U}_{t}$ equals to the cumulative sum of $\varphi_{l}$ divided by the grid width.

The integral of $v_{r}$ can be estimated based on a grid $\mathcal{U}_{s}$ in $B(0,a_{\text{max}})\subset\mathbb{R}^{d_{\mathcal{S}}}$ since it decays outside the ball. For each $s_{i}$ , $\int_{\mathcal{S}}v_{r}(s-s_{i})ds=\int_{B(0,a_{\text{max}})\cap\{\mathcal{S}-s_{i}\}}v_{r}(s)ds$ , where $\mathcal{S}-s_{i}:=\{s^{\prime}\mid s^{\prime}=s-s_{i},s\in\mathcal{S}\}$ . Thus the integral is well estimated with the evaluations of $v_{r}$ on grid set $\mathcal{U}_{s}\cap\mathcal{S}-s_{i}$ . Note that in practice we only evaluate $v_{r}$ on $\mathcal{U}_{s}$ once and use subsets of the evaluations for different $s_{i}$ . More details about grid-based computation can be found in Appendix A.3.

Computation of log-barrier.

The barrier term $p(\theta,b)$ is calculated in a similar way as (7) by replacing $(t_{i},s_{i},\mu)$ with $(t_{c_{t}},s_{c_{s}},\mu-b)$ , i.e. we use interpolation to calculate $\varphi_{l}(t_{c_{t}}-t_{j})$ and evaluate $v_{r}$ on a subset of $\{(s_{c_{s}},s_{j})\}$ , $c_{s}=1,\dots,|\mathcal{U}_{\text{bar},s}|,j=1,\dots,n$ .

4.3 Computational complexity

Table 1: Comparison of model training time per epoch on a 1D and a 3D synthetic data. Time is measured in second.

	1D Data set 1		3D Data set 1
Model	#Parameters	Training time	#Parameters	Training time
NSMPP	$2576$	$60.040$	$9415$	$170.004$
DNSK+Barrier	$2307$	$1.299$	$9228$	$3.529$

The evaluation of $\{u_{r}\}_{r=1}^{R}$ and $\{\psi_{l}\}_{l=1}^{L}$ over $n$ events costs $\mathcal{O}((R+L)n)$ complexity. The evaluation of $\{\varphi_{l}\}_{l=1}^{L}$ is of $\mathcal{O}(L|\mathcal{U}_{t}|)$ complexity since it relies on the grid $\mathcal{U}_{t}$ . The evaluation of $\{v_{r}\}_{r=1}^{R}$ costs no more than $\mathcal{O}(RC\tau_{\text{max}}n)+\mathcal{O}(R|\mathcal{U}_{s}|)$ complexity. We note that $L,R,\tau_{\text{max}},|\mathcal{U}_{t}|,|\mathcal{U}_{s}|$ are all constant that much less than event number $n$ , thus the overall computation complexity will be $\mathcal{O}(n)$ . We compare the model training time per epoch for a baseline equipped with a softplus activation function (NSMPP) and our model with log-barrier method (DNSK+Barrier) on a 1D synthetic data set and a 3D synthetic data set. The quantitative results in Table 1 demonstrates the efficiency improvement of our model by using log-barrier technique. More details about the computation complexity analysis can be found in Appendix A.4.

5 Experiment

We use large-scale synthetic and real data sets to demonstrate the superior performance of our model and present the results in this section. Experimental details and results can be found in Appendix C. Codes will be released upon publication.

Baselines.

We compare our method (DNSK+Barrier) with: (i) RMTPP (RMTPP) (Du et al., 2016); (ii) Neural Hawkes (NH) (Mei and Eisner, 2017); (iii) Transformer Hawkes process (THP) (Zuo et al., 2020); (iv) Parametric Hawkes process (PHP+exp) with exponentially decaying spatio-temporal kernel; (v) Neual spectral marked point processes (NSMPP) (Zhu et al., 2022); (vi) DNSK without log-barrier but with a non-negative Softplus activation function (DNSK+Softplus). We note that RMTPP, NH and THP directly model conditional intensity function using neural networks while others learn the influence kernel in the framework of (3). In particular, NSMPP designs the kernel based on singular value decomposition but parameterizes it without displacement. The model parameters are estimated using the training data via Adam optimization method (Kingma and Ba, 2014). Details of training can be found in Appendix A.2 and C.

5.1 Synthetic data experiments

Synthetic data sets.

To show the effectiveness of DNSK+Barrier, we conduct all the models on three temporal data sets and three spatio-temporal data sets generated by following true kernels: (i) 1D exponential kernel (ii) 1D non-stationary kernel; (iii) 1D infinite rank kernel; (iv) 2D exponential kernel; (v) 3D non-stationary inhibition kernel; (vi) 3D non-stationary mixture kernel. Data sets are generated using thinning algorithm in Daley and Vere-Jones (2008). Each data set is composed of $2000$ sequences. Details of kernel formulas and data generation can be found in Appendix C.

We consider two performance metrics for testing data evaluation: Mean relative error (MRE) of the predicted intensity and log-likelihood. The true and predicted $\lambda^{*}(x),\hat{\lambda}(x)$ can be calculated using (4) with true and learned kernel. The MRE for one test trajectory is defined as $\int_{\mathcal{X}}{|\lambda^{*}(x)-\hat{\lambda}(x)|}/{\lambda^{*}(x)}dx$ and the averaged MRE over all test trajectories is reported. The log-likelihood for observing each testing sequence can be computed according to (2), and the average predictive log-likelihood per event is reported. The log-likelihood shows the model’s goodness-of-fit, and the intensity evaluation further reflects the model’s ability to recover the underlying mechanism of event occurrence and predict the future.

Table 2: Synthetic data results. Testing log-likelihood per event (on the left side of slash, higher the better) and MRE (on the right side of slash, lower the better) are reported in each entry.

Model	1D Data set 1	1D Data set 2	1D Data set 3	2D Data set 1	3D Data set 1	3D Data set 2
RMTPP	$-0.467_{(0.009)}/0.086$	$-2.591_{(0.010)}/0.259$	$-1.353_{(0.002)}/0.212$	$-5.268_{(0.053)}/0.195$	$-2.561_{(0.015)}/0.117$	$-2.289_{(0.002)}/0.316$
NH	$-0.459_{(0.002)}/0.068$	$-2.543_{(0.007)}/0.092$	$-1.315_{(0.032)}/0.204$	$-5.223_{(0.051)}/0.174$	$-2.524_{(0.002)}/0.098$	$-2.291_{(0.003)}/0.319$
THP	$-0.537_{(0.008)}/0.843$	$-2.554_{(0.005)}/0.106$	$-1.319_{(0.003)}/0.115$	$-5.292_{(0.029)}/0.182$	$-2.527_{(0.001)}/0.041$	$-2.497_{(0.018)}/0.350$
PHP+exp	$-0.451_{(0.001)}/0.093$	$-2.725_{(0.002)}/0.181$	$-1.524_{(0.015)}/0.223$	$-2.737_{(0.003)}/0.306$	$-2.683_{(0.002)}/0.291$	$-2.424_{(0.003)}/0.282$
NSMPP	$-0.462_{(0.010)}/0.078$	$-2.638_{(0.008)}/0.162$	$-1.473_{(0.033)}/0.164$	$-2.807_{(0.016)}/0.156$	$-2.637_{(0.012)}/0.193$	$-2.381_{(0.012)}/0.280$
DNSK+Softplus	$-0.455_{(0.003)}/0.045$	$-2.539_{(0.002)}/0.037$	$-1.300_{(0.004)}/0.104$	$-2.592_{(0.002)}/0.042$	$-2.515_{(0.002)}/0.088$	$-2.279_{(0.003)}/0.119$
DNSK+Barrier	$\textbf{--0.451}_{(0.002)}/\textbf{0.039}$	$\textbf{--2.536}_{(0.003)}/\textbf{0.016}$	$\textbf{--1.298}_{(0.002)}/\textbf{0.031}$	$\textbf{--2.585}_{(0.001)}/\textbf{0.028}$	$\textbf{--2.498}_{(0.003)}/\textbf{0.021}$	$\textbf{--2.251}_{(0.001)}/\textbf{0.082}$

The heat maps in Figure 2 visualize the results of non-stationary kernel recovery for DNSK+Barrier and NSMPP on 1D Data set 2 and 3 (The true kernel used in 1D Data set 3 is the one in Figure 1). DNSK+Barrier recovers the true kernel more accurately than NSMPP, indicating the strong representation power of the low-rank kernel parameterization with displacements. Line charts in Figure 2 present the recovered intensities with the true ones (dark grey curves). It demonstrates that our method can accurately capture the temporal dynamics of events. In particular, the average conditional intensity $\lambda$ over multiple testing sequences shows the model’s ability to recover data non-stationarity over time. While DNSK+Barrier successfully captures the non-stationarity among data, both RMTPP and NH fail to do so by showing a flat curve of the averaged intensity. Note that THP with positional encoding recovers the data non-stationarity (as shown in two figures in the last column). However, our method still outperforms THP which suffers from limited model expressiveness when complicated propagation of event influence is involved (see two figures in the penultimate column).

Tabel 2 summarized the quantitative results of testing log-likelihood and MRE. It shows that DNSK+Barrier has superior predictive performance against baselines in characterizing the dynamics of data generation in spatio-temporal space. Specifically, with evidently over-parameterization for 1D Data set 1 generated by a stationary exponentially decaying kernel, our model can still approximate the kernel and recover the true conditional intensity without overfitting, which shows the adaptiveness of our model. Moreover, DNSK+Barrier enjoys outstanding performance gain when learning a diverse variety of complicated non-stationary kernels. The comparison between DNSK+Softplus and DNSK+Barrier proves that the model with log-barrier achieves a better recovery performance by maintaining the linearity of the conditional intensity. THP outperforms RMTPP in non-stationary cases but is still limited due to its pre-assumed parametric form of influence propagation. More results about kernel and intensity recovery can be found in Appendix C.

5.2 Real data results

Real data sets.

We provide a comprehensive evaluation of our approach on several real-world data sets: we first use two popular data sets containing time-stamped events with categorical marks to demonstrate the robustness of DNSK+Barrier in marked STPPs (refer to Appendix B for detailed definition and kernel modeling): (i) Financial Transactions. (Du et al., 2016). This data set contains transaction records of a stock in one day with time unit milliseconds and the action (mark) of each transaction. We partition the events into different sequences by time stamps. (ii) StackOverflow (Leskovec and Krevl, 2014): The data is collected from the website StackOverflow over two years, containing reward records for users who promote engagement in the community. Each user’s reward history is treated as a sequence.

Next, we demonstrate the practical versatility of the model using the following spatio-temporal data sets: (i) Southern California earthquake data provided by Southern California Earthquake Data Center (SCEDC) contains time and location information of earthquakes in Southern California. We collect 19,414 records from 1999 to 2019 with magnitude larger than 2.5 and partition the data into multiple sequences by month with average length of 40.2. (ii) Atlanta robbery & burglary data. Atlanta Police Department (APD) provides a proprietary data source for city crime. We extract 3420 reported robberies and 14958 burglaries with time and location from 2013 to 2019. Two crime types are preprocessed as separate data sets on a 10-day basis with average lengths of 13.7 and 58.7.

Finally, the model’s ability to tackle high-dimensional marks is evaluated with Atlanta textual crime data. The proprietary data set provided by APD records 4644 crime incidents from 2016 to 2017 with time, location, and comprehensive text descriptions. The text information is preprocessed by TF-IDF technique, leading to a $5012$ -dimensional mark for each event.

Table 3: Results of real data sets with time and categorical marks. We compare the testing log-likelihood (Testing

\ell

, higher the better), event time prediction error (lower the better), and event type prediction accuracy (higher the better) of DNSK+Barrier and other baselines on each data set.

	Financial			StackOverflow
Model	Testing $-\ell$	Time RMSE	Type Accuracy	Testing $-\ell$	Time RMSE	Type Accuracy
RMTPP	$-3.890$	$1.560$	$0.620$	$-2.600$	$9.780$	$0.459$
NH	$-3.600$	$1.560$	$0.622$	$-2.550$	$9.830$	$0.463$
THP	$-0.938$	$1.019$	$0.596$	–1.231	$11.804$	$0.436$
NSMPP	$-3.058$	$1.276$	$0.608$	$-3.182$	$8.735$	$0.447$
DNSK+Softplus	$-0.889$	$0.327$	$0.621$	$-2.173$	$6.416$	$0.497$
DNSK+Barrier	–0.709	0.153	0.630	$-2.089$	4.833	0.508

Table 3 summarizes the results of models dealing with categorical marks. Event time and type prediction are evaluated by Root Mean Square Error (RMSE) and accuracy, respectively. We can see that DNSK+Barrier outperforms the baselines in all prediction tasks by providing less time RMSE and higher type accuracy.

For real-world spatio-temporal data, we report average predictive log-likelihood per event for the testing set since MRE is not applicable. Besides, we perform online prediction for earthquake data to demonstrate the model predicting ability. The probability density function $f(t,s)$ which represents the conditional probability that the next event will occur at $(t,s)$ given history $\mathcal{H}_{t}$ can be written as $f(t,s)=\lambda(t,s)\exp\left(-\int_{\mathcal{S}}\int_{t_{n}}^{t}\lambda(\tau,\nu)d\tau d\nu\right).$ The predicted time and location of the next event can be computed as $\mathbb{E}\left[t_{n+1}|\mathcal{H}_{t}\right]=\int_{t_{n}}^{\infty}t\int_{\mathcal{S}}f(t,s)dsdt$ , $\mathbb{E}\left[s_{n+1}|\mathcal{H}_{t}\right]=\int_{\mathcal{S}}s\int_{t_{n}}^{\infty}f(t,s)dtds.$ We predict the the time and location of the last event in each sequence. The mean absolute error (MAE) of the predictions is computed. The quantitative results in Table 4 show that DNSK+Barrier provides more accurate predictions than other alternatives with higher event log-likelihood.

To demonstrate our model’s interpretability and power to capture heterogeneous data characteristics, we visualize the learned influence kernels and predicted conditional intensity for two crime categories in Figure 3. The first column shows kernel evaluations at fixed geolocation in downtown Atlanta which intuitively reflect the spatial influence of crimes in that neighborhood. The influence of a robbery in the downtown area is more intensive but regional, while a burglary which is hard to be responded to by police in time would impact a larger neighborhood along major highways of Atlanta. We also provide the predicted conditional intensity over space for two crimes. As we can observe, DNSK+Barrier captures the occurrence of events in regions with a higher crime rate, and crimes of the same category happening in different regions would influence their neighborhoods differently. We note that this example emphasizes the ability of the proposed method to recover data non-stationarity with different sequence lengths, and improve the limited model interpretability of other neural network-based methods (RMTPP, NH, and THP) in practice.

Table 4: Real data results with spatial and mark information. Testing log-likelihood (higher the better) and prediction mean absolute error of event time and location (lower the better) are reported.

Model	Testing $\ell$	Time MAE	Location MAE	Testing $\ell$	Testing $\ell$	Testing $\ell$
	South California Earthquake			Atlanta Robbery	Atlanta Burglary	Atlanta Textual Crime
RMTPP	$-1.825_{(0.053)}$	$6.963$	$0.602$	$-2.349_{(0.018)}$	$-0.246_{(0.008)}$	/
NH	$-1.818_{(0.037)}$	$6.880$	$0.458$	$-2.445_{(0.016)}$	$-0.308_{(0.012)}$	/
THP	$-1.784_{(0.007)}$	$6.113$	$0.633$	$\textbf{--2.221}_{(0.005)}$	$-0.251_{(0.001)}$	/
PHP+exp	$-2.048_{(0.093)}$	$8.132$	$0.487$	$-2.641_{(0.006)}$	$0.382_{(0.005)}$	$-10.542_{(0.016)}$
NSMPP	$-4.152_{(0.187)}$	$6.780$	$0.455$	$-2.753_{(0.113)}$	$0.304_{(0.098)}$	$-8.694_{(0.336)}$
DNSK+Softplus	$-1.807_{(0.096)}$	$1.685$	$0.492$	$-2.435_{(0.009)}$	$0.493_{(0.008)}$	$-5.876_{(0.057)}$
DNSK+Barrier	$\textbf{--1.751}_{(0.080)}$	1.474	0.431	$-2.255_{(0.008)}$	$\textbf{0.519}_{(0.012)}$	$\textbf{--5.279}_{(0.044)}$

•

¹RMTPP, NH, and THP are not applicable when dealing with high-dimensional data.

For Atlanta textual crime data, we borrow the idea in Zhu and Xie (2022) by encoding the highly sparse TF-IDF representation into a binary mark vector with dimension $d=50$ using Restricted Boltzmann Machine (RBM) (Fischer and Igel, 2012). The average testing log-likelihoods per event for each model are reported in Table 4. The results show that DNSK+Barrier outperforms PHP+exp in Zhu and Xie (2022) and NSMPP by achieving a higher testing log-likelihood. We visualize the basis functions of learned influence kernel by DNSK+Barrier in Figure A.4 in Appendix.

6 Conclusion

We propose a deep non-stationary kernel for spatio-temporal point processes using a low-rank parameterization scheme based on displacement, which enables the model to be further low-rank when learning complicated influence kernel, thus significantly reducing the model complexity. The non-negativity of the solution is guaranteed by a log-barrier method which maintains the linearity of the conditional intensity function. Based on that, we propose a computationally efficient strategy for model estimation. The superior performance of our model is demonstrated using synthetic and real data sets.

Acknowledgement

The work of Dong and Xie are partially supported by by an NSF CAREER CCF-1650913, and NSF DMS-2134037, CMMI-2015787, DMS-1938106, and DMS-1830210.

References

Reinhart [2018] Alex Reinhart. A review of self-exciting spatio-temporal point processes and their applications. Statistical Science, 33(3):299–318, 2018.
Ogata [1988] Yosihiko Ogata. Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical association, 83(401):9–27, 1988.
Ogata [1998] Yosihiko Ogata. Space-time point-process models for earthquake occurrences. Annals of the Institute of Statistical Mathematics, 50(2):379–402, 1998.
Schoenberg et al. [2019] Frederic Paik Schoenberg, Marc Hoffmann, and Ryan J Harrigan. A recursive point process model for infectious diseases. Annals of the Institute of Statistical Mathematics, 71(5):1271–1287, 2019.
Dong et al. [2021] Zheng Dong, Shixiang Zhu, Yao Xie, Jorge Mateu, and Francisco J Rodríguez-Cortés. Non-stationary spatio-temporal point process modeling for high-resolution covid-19 data. arXiv preprint arXiv:2109.09029, 2021.
Hering et al. [2009] Amanda S Hering, Cynthia L Bell, and Marc G Genton. Modeling spatio-temporal wildfire ignition point patterns. Environmental and Ecological Statistics, 16(2):225–250, 2009.
Hawkes [1971] Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
Du et al. [2016] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1555–1564, 2016.
Mei and Eisner [2017] Hongyuan Mei and Jason M Eisner. The neural hawkes process: A neurally self-modulating multivariate point process. Advances in neural information processing systems, 30, 2017.
Bengio et al. [1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
Zhang et al. [2020] Qiang Zhang, Aldo Lipani, Omer Kirnap, and Emine Yilmaz. Self-attentive hawkes process. In International conference on machine learning, pages 11183–11193. PMLR, 2020.
Zuo et al. [2020] Simiao Zuo, Haoming Jiang, Zichong Li, Tuo Zhao, and Hongyuan Zha. Transformer hawkes process. In International conference on machine learning, pages 11692–11702. PMLR, 2020.
Graham et al. [2013] Michael Graham, Jarno Kiviaho, and Jussi Nikkinen. Short-term and long-term dependencies of the s&p 500 index and commodity prices. Quantitative Finance, 13(4):583–592, 2013.
Zhu et al. [2022] Shixiang Zhu, Haoyun Wang, Zheng Dong, Xiuyuan Cheng, and Yao Xie. Neural spectral marked point processes. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=0rcbOaoBXbg.
Chen et al. [2020] Ricky TQ Chen, Brandon Amos, and Maximilian Nickel. Neural spatio-temporal point processes. arXiv preprint arXiv:2011.04583, 2020.
Xiao et al. [2017] Shuai Xiao, Junchi Yan, Xiaokang Yang, Hongyuan Zha, and Stephen Chu. Modeling the intensity function of point process via recurrent neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
Omi et al. [2019] Takahiro Omi, Kazuyuki Aihara, et al. Fully neural network based model for general temporal point processes. Advances in neural information processing systems, 32, 2019.
Okawa et al. [2021] Maya Okawa, Tomoharu Iwata, Yusuke Tanaka, Hiroyuki Toda, Takeshi Kurashima, and Hisashi Kashima. Dynamic hawkes processes for discovering time-evolving communities’ states behind diffusion processes. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1276–1286, 2021.
Moller and Waagepetersen [2003] Jesper Moller and Rasmus Plenge Waagepetersen. Statistical inference and simulation for spatial point processes. CRC press, 2003.
Daley et al. [2003] Daryl J Daley, David Vere-Jones, et al. An introduction to the theory of point processes: volume I: elementary theory and methods. Springer, 2003.
Mollenhauer et al. [2020] Mattes Mollenhauer, Ingmar Schuster, Stefan Klus, and Christof Schütte. Singular value decomposition of operators on reproducing kernel hilbert spaces. In Advances in Dynamics, Optimization and Computation: A volume dedicated to Michael Dellnitz on the occasion of his 60th birthday, pages 109–131. Springer, 2020.
Mercer [1909] J Mercer. Functions ofpositive and negativetypeand theircommection with the theory ofintegral equations. Philos. Trinsdictions Rogyal Soc, 209:4–415, 1909.
Boyd et al. [2004] Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
Pan et al. [2021] Zhimeng Pan, Zheng Wang, Jeff M Phillips, and Shandian Zhe. Self-adaptable point processes with nonparametric time decays. Advances in Neural Information Processing Systems, 34, 2021.
Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014. URL https://arxiv.org/abs/1412.6980.
Daley and Vere-Jones [2008] Daryl J Daley and David Vere-Jones. An Introduction to the Theory of Point Processes. Volume II: General Theory and Structure. Springer, 2008.
Leskovec and Krevl [2014] Jure Leskovec and Andrej Krevl. Snap datasets: Stanford large network dataset collection, 2014.
Zhu and Xie [2022] Shixiang Zhu and Yao Xie. Spatiotemporal-textual point processes for crime linkage detection. The Annals of Applied Statistics, 16(2):1151–1170, 2022.
Fischer and Igel [2012] Asja Fischer and Christian Igel. An introduction to restricted boltzmann machines. In Iberoamerican congress on pattern recognition, pages 14–36. Springer, 2012.
Lewis and Shedler [1979] PA W Lewis and Gerald S Shedler. Simulation of nonhomogeneous poisson processes by thinning. Naval research logistics quarterly, 26(3):403–413, 1979.

Appendix A Additional methodology details

A.1 Derivation of (4)

We denote $\tau:=t-t^{\prime},\nu:=s-s^{\prime}$ , the variables $t^{\prime}\in[0,T]$ , $\tau\in[0,\tau_{\rm max}]$ , $s^{\prime}\in\mathcal{S}$ and $\nu\in B(0,a_{\rm max})$ , where the sets $\mathcal{S},B(0,a_{\rm max})\subset\mathbb{R}^{2}$ . Viewing the spatial and temporal variables, i.e., $(t^{\prime},\tau)$ and $(s^{\prime},\nu)$ , as left and right mode variables, respectively, the kernel function SVD [Mollenhauer et al., 2020, Mercer, 1909] of $k$ gives that

k(t^{\prime},\tau,s^{\prime},\nu)=\sum_{k=1}^{\infty}\sigma_{k}g_{k}(t^{\prime},\tau)h_{k}(s^{\prime},\nu).

(A.1)

We assume that the SVD can be truncated at $k\leq K$ with a residual of $\varepsilon$ for some small $\varepsilon>0$ , and this holds as long as the singular values $\sigma_{k}$ decay sufficiently fast. To fulfill the approximate finite-rank representation, it suffices to have the scalars $\sigma_{k}$ and the functions $g_{k}$ and $h_{k}$ so that the expansion approximates the kernel $k$ , even if they are not SVD of the kernel. This leads to the following assumption:

Assumption A.1.

There exist coefficients $\sigma_{k}$ , and functions $g_{k}(t^{\prime},\tau)$ , $h_{k}(s^{\prime},\nu)$ s.t.

k(t^{\prime},\tau,s^{\prime},\nu)=\sum_{k=1}^{K}\sigma_{k}g_{k}(t^{\prime},\tau)h_{k}(s^{\prime},\nu)+O(\varepsilon).

(A.2)

To proceed, one can apply kernel SVD again to $g_{k}$ and $h_{k}$ respectively, and obtain left and right singular functions that potentially differ for different $k$ . Here, we impose that across $k=1,\cdots,K$ , the singular functions of $g_{k}$ are the same (as shown below, being approximately same suffices) set of basis functions, that is,

g_{k}(t^{\prime},\tau)=\sum_{l=1}^{\infty}\beta_{k,l}\psi_{l}(t^{\prime})\varphi_{l}(\tau).

As we will truncate $l$ to be up to a finite rank again (up to an $O(\varepsilon)$ residual) we require the (approximately) shared singular modes only up to $L$ . Similarly as above, technically it suffices to have a finite-rank expansion to achieve the $O(\varepsilon)$ error without requiring them to be SVD, which leads to the following assumption where we assume the same condition for $h_{k}$ :

Assumption A.2.

For the $g_{k}$ and $h_{k}$ in (A.2), up to an $O(\varepsilon)$ error,

(i) The $K$ temporal kernel functions $g_{k}(t^{\prime},\tau)$ can be approximated under a same set of left and right basis functions, i.e., there exist coefficients $\beta_{kl}$ , and functions $\psi_{l}(t^{\prime})$ , $\varphi_{l}(\tau)$ for $l=1,\cdots,L$ , s.t.

g_{k}(t^{\prime},\tau)=\sum_{l=1}^{L}\beta_{kl}\psi_{l}(t^{\prime})\varphi_{l}(\tau)+O(\varepsilon),\quad k=1,\cdots,K.

(A.3)

(ii) The $K$ spatial kernel functions $h_{k}(s^{\prime},\nu)$ can be approximated under a same set of left and right basis functions, i.e., there exist coefficients $\gamma_{kr}$ , and functions $u_{r}(s^{\prime})$ , $v_{r}(\nu)$ for $r=1,\cdots,R$ , s.t.

h_{k}(s^{\prime},\nu)=\sum_{r=1}^{R}\gamma_{kr}u_{r}(t^{\prime})v_{r}(\nu)+O(\varepsilon),\quad r=1,\cdots,R.

(A.4)

Inserting (A.3) and (A.4) into (A.2) gives the rank-truncated representation of the kernel function. Since $K$ , $L$ , $R$ are fixed numbers, assuming boundedness of all the coefficients and functions, we have the representation with the final residual as $O(\varepsilon)$ , namely,

k(t^{\prime},\tau,s^{\prime},\nu)=\sum_{l=1}^{L}\sum_{r=1}^{R}\sum_{k=1}^{K}\sigma_{k}\beta_{kl}\gamma_{kr}\psi_{l}(t^{\prime})\varphi_{l}(\tau)u_{r}(t^{\prime})v_{r}(\nu)+O(\varepsilon).

Defining $\sum_{k=1}^{K}\sigma_{k}\beta_{kl}\gamma_{kr}$ as $\alpha_{lr}$ leads to (4).

A.2 Algorithms

We include the algorithms 1 and 2.

Input: Training set

X

, batch size

M

, epoch number

E

, learning rate

\gamma

, constant

a>1

to update

s

in (6).

Initialization: model parameter

\theta_{0}

, first epoch

e=0

s=s_{0}

while

e<E

for each batch with size

M

1. For 1D temporal point process, compute

\ell(\theta)

\{\lambda(t_{c_{t}})\}_{c_{t}=1,\dots,|\mathcal{U}_{\text{bar},t}|}

. For spatio-temporal point process, compute

\ell(\theta)

\{\lambda(t_{c_{t}},s_{c_{s}})\}_{c_{t}=1,\dots,|\mathcal{U}_{\text{bar},t}|,{c_{s}}=1,\dots,|\mathcal{U}_{\text{bar},s}|}

2. Set

b=\min\{\lambda(t_{c_{t}})\}_{c_{t}=1,\dots,|\mathcal{U}_{\text{bar},t}|}-\epsilon

(or

\min\{\{\lambda(t_{c_{t}},s_{c_{s}})\}_{c_{t}=1,\dots,|\mathcal{U}_{\text{bar},t}|,c_{s}=1,\dots,|\mathcal{U}_{\text{bar},s}|}-\epsilon

), where

\epsilon

is a small value to guarantee logarithm feasibility.

3. Compute

\mathcal{L}(\theta)=-\ell(\theta)+\frac{1}{w}p(\theta,b)

4. Update

\theta_{e+1}\leftarrow\theta_{e}-\gamma\frac{\partial\mathcal{L}}{\partial\theta_{e}}

e\leftarrow e+1,w\leftarrow w\cdot a

end for

end while

Algorithm 1 Model parameter estimation

Input: Model

\lambda(\cdot),T,\mathcal{S}

, Upper bound of conditional intensity

\bar{\lambda}

Initialization:

\mathcal{H}_{T}=\emptyset,t=0,n=0

while

t<T

1. Sample

u\sim\text{Unif}(0,1)

t\leftarrow t-\ln u/\bar{\lambda}

3. Sample

s\sim\text{Unif}(\mathcal{S}),D\sim\text{Unif}(0,1)

\lambda=\lambda(t,s|\mathcal{H}_{T})

D\bar{\lambda}\leq\lambda

then

n\leftarrow n+1

;

t_{n}=t,s_{n}=s

\mathcal{H}_{T}\leftarrow\mathcal{H}_{T}\cup\{(t_{n},s_{n})\}

end if

end while

t_{n}>=T

then

return

\mathcal{H}_{T}-\{(t_{n},s_{n})\}

else

return

\mathcal{H}_{T}

end if

Algorithm 2 Synthetic data generation

A.3 Grid-based model computation

In this section, we elaborate on the details of the grid-based efficient model computation.

In Figure A.1, we visualize the procedure of computing the integrals of $\int_{0}^{T-t_{i}}\varphi_{l}(t)dt$ and $\int_{\mathcal{S}}v_{r}(s-s_{i})ds$ in (8), respectively. Panel (a) illustrates the calculation of $\int_{0}^{T-t_{i}}\varphi_{l}(t)dt$ . As explained in Section 4.2, the evaluations of $\varphi_{l}$ only happens on the grid $\mathcal{U}_{t}$ over $[0,\tau_{\text{max}}]$ (since $\varphi_{l}(t)=0$ when $t>\tau_{\text{max}}$ ). The value of $F(t)=\int_{0}^{t}\varphi_{l}(\tau)d\tau$ on the grid can be obtained through numerical integration. Then given $t_{i}$ , the value of $F(T-t_{i})=\int_{0}^{T-t_{i}}\varphi_{l}(t)dt$ is calculated using linear interpolation of $F$ on two adjacent grid points of $T-t_{i}$ . Panel (b) shows the computation of $\int_{\mathcal{S}}v_{r}(s-s_{i})ds$ . Given $s_{i}$ , $\int_{\mathcal{S}}v_{r}(s-s_{i})ds=\int_{B(0,a_{\text{max}})\cap\{\mathcal{S}-s_{i}\}}v_{r}(s)ds$ since $v_{r}(s)=0$ when $s>a_{\text{max}}$ . Then $B(0,a_{\text{max}})$ is discretized into the grid $\mathcal{U}_{s}$ , and $\int_{\mathcal{S}}v_{r}(s-s_{i})ds$ can be calculated based on the value of $v_{r}$ on the grid points in $\mathcal{U}_{s}\cap\mathcal{S}-s_{i}$ (the deep red dots in Figure A.1(b)) using numerical integration.

Table A.1: Comparison of DNSK+Barrier performance on 3D Data set 2 with different grid resolutions. Testing log-likelihood per event and intensity MRE are reported. The highlighted ones are the results in the main paper.

Temporal resolution: $\|\mathcal{U}_{t}\|$	1000	1500	3000
	Spatial resolution: $\|\mathcal{U}_{s}\|$
30	$-2.272_{(0.005)}/0.102$	$-2.252_{(0.002)}/0.088$	$-2.250_{(0.002)}/0.081$
50	$-2.257_{(0.002)}/0.095$	$-2.251_{(0.001)}/0.082$	$-2.249_{(0.001)}/0.078$
100	$-2.255_{(0.001)}/0.091$	$-2.252_{(0.001)}/0.081$	$-2.250_{(0.001)}/0.078$

To evaluate the sensitivity of our model to the chosen grids, we compare the performance of DNSK+Barrier on 3D Data set 2 using grids with different resolutions. The quantitative results of testing log-likelihood and intensity prediction error are reported in Table A.1. We use $|\mathcal{U}_{t}|=50,|\mathcal{U}_{s}|=1500$ for the experiments in the main paper. As we can see, the model shows similar performances when a higher grid resolution is used and works slightly less accurately but still better than other baselines with less number of grid points. It reveals that our choice of grid resolution is accurate enough to capture the complex dynamics of event occurrences for this non-stationary data, and the model performance is robust to different grid resolutions.

In practice, the grids can be flexibly chosen to reach the balance of model accuracy and computational efficiency. For instance, the number of uniformly distributed grid points along one dimension can be chosen around $\mathcal{O}(n_{0})$ , where $n_{0}$ is the average number of events in one observed sequence. Note that $|\mathcal{U}_{t}|$ or $|\mathcal{U}_{s}|$ would be far less than the total number of observed events because we use thousands of sequences ( $2000$ in our synthetic experiments) for model learning. And the grid size can be even smaller when it comes to non-Lebesgue-measured space.

A.4 Details of computational complexity

We provide the detailed analysis of the $\mathcal{O}(n)$ computation complexity of $\mathcal{L}(\theta)$ in Section 4.3 as following:

$\bullet$ Computation of log-summation. The evaluation of $\{u_{r}\}_{r=1}^{R}$ and $\{\psi_{l}\}_{l=1}^{L}$ over $n$ events costs $\mathcal{O}((R+L)n)$ complexity. The evaluation of $\{\varphi_{l}\}_{l=1}^{L}$ is of $\mathcal{O}(L|\mathcal{U}_{t}|)$ complexity since it relies on the grid $\mathcal{U}_{t}$ . With the assumption that the conditional intensity is bounded by a constant $C$ in a finite time horizon [Lewis and Shedler, 1979, Daley et al., 2003, Zhu et al., 2022], for each fixed $j$ , the cardinality of set $\{(i,j)\mid t_{j}<t_{i}\leq t_{j}+\tau_{\text{max}}\}$ is less than $C\tau_{\text{max}}$ , which leads to a $\mathcal{O}(RC\tau_{\text{max}}n)$ complexity of $\{v_{r}\}_{r=1}^{R}$ evaluation.

$\bullet$ Computation of integral. The integration of $\{\varphi_{l}\}_{l=1}^{L}$ only relies on numerical operations of $\{\varphi_{l}\}_{l=1}^{L}$ on grids $\mathcal{U}_{t}$ without extra evaluations of neural networks. The integration of $\{v_{r}\}_{r=1}^{R}$ depends on the evaluation on grid $\mathcal{U}_{s}$ of $\mathcal{O}(R|\mathcal{U}_{s}|)$ complexity.

$\bullet$ Computation of barrier. $\{\varphi_{l}\}_{l=1}^{L}$ on grid $\mathcal{U}_{\text{bar},t}$ is estimated by numerical interpolation of previously computed $\{\varphi_{l}\}_{l=1}^{L}$ on grid $\mathcal{U}_{t}$ . Additional neural network evaluations of $\{v_{r}\}_{r=1}^{R}$ cost no more than $\mathcal{O}(RC\tau_{\text{max}}n)$ complexity.

Appendix B Deep non-stationary kernel for Marked STPPs

In marked STPPs [Reinhart, 2018], each observed event is associated with additional information describing event attribute, denoted as $m\in\mathcal{M}\subset\mathbb{R}^{d_{\mathcal{M}}}$ . Let $\mathcal{H}=\{(t_{i},s_{i},m_{i})\}_{i=1}^{n}$ denote the event sequence. Given the observed history $\mathcal{H}_{t}=\{(t_{i},s_{i},m_{i})\in\mathcal{H}|t_{i}<t\}$ , the conditional intensity function of a marked STPPs is similarly defined as:

\lambda\left(t,s,m\right)=\lim_{\Delta t\downarrow 0,\Delta s\downarrow 0,\Delta m\downarrow 0}\frac{\mathbb{E}\left[\mathbb{N}([t,t+\Delta t]\times B(s,\Delta s)\times B(m,\Delta m))\mid\mathcal{H}_{t}\right]}{|B(s,\Delta s)||B(m,\Delta m)|\Delta t},

where $B(m,\Delta m)$ is a ball centered at $m\in\mathbb{R}^{d_{\mathcal{M}}}$ with radius $\Delta m$ . The log-likelihood of observing $\mathcal{H}$ on $[0,T]\times\mathcal{S}\times\mathcal{M}$ is given by

\ell(\mathcal{H})=\sum_{i=1}^{n}\log\lambda\left(t_{i},s_{i},m_{i}\right)-\int_{0}^{T}\int_{\mathcal{S}}\int_{\mathcal{M}}\lambda(t,s,m)dmdsdt.

B.1 Kernel incorporating marks

One of the salient features of our spatio-temporal kernel framework is that it can be conveniently adopted in modeling marked STPPs with additional sets of mark basis functions $\{g_{q},h_{q}\}_{q=1}^{Q}$ . We modify the influence kernel function $k$ accordingly as following:

k(t^{\prime},t-t^{\prime},s^{\prime},s-s^{\prime},m^{\prime},m)=\sum_{q=1}^{Q}\sum_{r=1}^{R}\sum_{l=1}^{L}\alpha_{lrq}\psi_{l}(t^{\prime})\varphi_{l}(t-t^{\prime})u_{r}(s^{\prime})v_{r}(s-s^{\prime})g_{q}(m^{\prime})h_{q}(m).

Here $m^{\prime},m\in\mathcal{M}\subset\mathbb{R}^{d_{\mathcal{M}}}$ and $\{g_{q},h_{q}:\mathcal{M}\to\mathbb{R},q=1,\dots,Q\}$ represented by independent neural networks model the influence of historical mark $m^{\prime}$ and current mark $m$ , respectively. Since the mark space $\mathcal{M}$ is always categorical and the difference between $m^{\prime}$ and $m$ is of little practical meaning, we use $g_{q}$ and $h_{q}$ to model $m^{\prime}$ and $m$ separately instead of modeling $m-m^{\prime}$ .

B.2 Log-barrier and model computation

The conditional intensity for marked spatio-temporal point processes at $(t,s,m)$ can be written as:

\lambda(t,s,m)=\mu+\sum_{l,r,q}\alpha_{lrq}\sum_{(t_{i},s_{i},m_{i})\in\mathcal{H}_{t}}\psi_{l}(t_{i})\varphi(t-t_{i})u_{r}(s_{i})v_{r}(s-s_{i})g_{q}(m_{i})h_{q}(m).

We need to guarantee the non-negativity of $\lambda$ over the space of $[0,T]\times\mathcal{S}\times\mathcal{M}$ . When the total number of unique categorical mark in $\mathcal{M}$ is small, the log-barrier can be conveniently computed as the summation of $\lambda$ on grids $\mathcal{U}_{\text{bar},t}\times\mathcal{U}_{\text{bar},s}\times\mathcal{M}$ . In the following we focus on the case that $\mathcal{M}$ is high-dimensional with $\mathcal{O}(n)$ number of unique marks.

For model simplicity we use non-negative $g_{q}$ and $h_{q}$ in this case (which can be done by adding a non-negative activation function to the linear output layer in neural networks). We re-write $\lambda(t,s,m)$ and denote as following:

\lambda(t,s,m)=\mu+\sum_{q}\underbrace{\vrule width=0.0pt,height=0.0pt,depth=17.22217pt\left(\sum_{l,r}\alpha_{lrq}\sum_{(t_{i},s_{i},m_{i})\in\mathcal{H}_{t}}\psi_{l}(t_{i})\varphi(t-t_{i})u_{r}(s_{i})v_{r}(s-s_{i})g_{q}(m_{i})\right)}_{\hat{F}_{q}(t,s)}h_{q}(m).

Note that the function in the brackets are only with regard to $t,s$ . We denote it as $\hat{F}_{q}(t,s)$ (since it is in the $r$ th rank of mark). Since $h_{q}(m)\geq 0$ , the non-negativity of $\lambda$ can be guaranteed by the non-negativity of $\hat{F}_{q}(t,s)$ . Thus we apply log-barrier method on $\hat{F}_{q}(t,s)$ . The log-barrier term becomes:

p(\theta,b):=-\frac{1}{Q|\mathcal{U}_{\text{bar},t}\times\mathcal{U}_{\text{bar},s}|}\sum_{c_{t}=1}^{|\mathcal{U}_{\text{bar},t}|}\sum_{c_{s}=1}^{|\mathcal{U}_{\text{bar},s}|}\sum_{q=1}^{Q}\log(\hat{F}_{q}(t_{c_{t}},s_{c_{s}})-b),

Since our model is low-rank, the value of $Q$ will not be large.

For the model computation, the additional evaluations for $\{g_{q}\}_{q=1}^{Q}$ on events is of $\mathcal{O}(Qn)$ complexity and the evaluations for $\{h_{q}\}_{q=1}^{Q}$ only depends on the unique number of marks which at most of $\mathcal{O}(n)$ . The log-barrier method does not introduce extra evaluation in mark space. Thus the overall computation complexity for DNSK in marked STPPs is still $\mathcal{O}(n)$ .

Appendix C Additional experimental results

In this section we provide details of data sets and experimental setup, together with additional experimental results.

Synthetic data sets.

To show the robustness of our model, we generate three temporal data sets and three spatio-temporal data sets using the following kernels:

(i)

1D Data set 1 with exponential kernel: $k(t^{\prime},t)=0.8e^{-(t-t^{\prime})}$ .
(ii)

1D Data set 2 with non-stationary kernel: $k(t^{\prime},t)=0.3(0.5+0.5\cos(0.2t^{\prime}))e^{-2(t-t^{\prime})}$ .

(iii)

1D Data set 3 with infinite rank kernel:

k(t^{\prime},t)=0.3\sum_{j=1}^{\infty}2^{-j}\left(0.3+\cos(2+(\frac{t^{\prime}}{5})^{0.7}1.3(j+1)\pi)\right)e^{-\frac{8(t-t^{\prime})^{2}}{25}j^{2}}

(iv)

2D Data set 1 with exponential kernel: $k(t^{\prime},t,s^{\prime},s)=0.5e^{-1.5(t-t^{\prime})}e^{-0.8s^{\prime}}$ .

(v)

3D Data set 1 with non-stationary inhibition kernel:

k(t^{\prime},t,s^{\prime},s)=0.3(1-0.01t)e^{-2(t-t^{\prime})}\frac{1}{2\pi\sigma_{s^{\prime}}^{2}}e^{-\frac{\|s^{\prime}\|^{2}}{2\sigma_{s^{\prime}}^{2}}}\frac{\cos{(10\|s-s^{\prime}\|)}}{2\pi\sigma_{s}^{2}(1+e^{10(\|s-s^{\prime}\|-0.5)}}e^{-\frac{\|s-s^{\prime}\|^{2}}{2\sigma_{s}^{2}}}

, where $\sigma_{s^{\prime}}=0.5,\sigma_{s}=0.15$ .

(vi)

3D Data set 2 with non-stationary mixture kernel:

k(t^{\prime},t,s^{\prime},s)=\sum_{r=1}^{2}\sum_{l=1}^{2}\alpha_{rl}u_{r}(s^{\prime})v_{r}(s-s^{\prime})\psi_{l}(t^{\prime})\varphi_{l}(t-t^{\prime})

, where $u_{1}(s^{\prime})=1-a_{s}(s^{\prime}_{2}+1),u_{2}(s^{\prime})=1-b_{s}(s^{\prime}_{2}+1),v_{1}(s-s^{\prime})=\frac{1}{2\pi\sigma_{1}^{2}}e^{-\frac{\|s-s^{\prime}\|^{2}}{2\sigma_{1}^{2}}},v_{2}(s-s^{\prime})=\frac{1}{2\pi\sigma_{2}^{2}}e^{-\frac{\|s-s^{\prime}-0.8\|^{2}}{2\sigma_{2}^{2}}},\psi_{1}(t^{\prime})=1-a_{t}t^{\prime},\psi_{2}(t^{\prime})=1-b_{t}t^{\prime},\varphi_{1}(t-t^{\prime})=e^{-\beta(t-t^{\prime})},\varphi_{2}(t-t^{\prime})=(t-t^{\prime}-1)\cdot I(t-t^{\prime}<3)$ , and $a_{s}=0.3,b_{s}=0.4,a_{t}=0.02,b_{t}=0.02,\sigma_{1}=0.2,\sigma_{2}=0.3,\beta=2,(\alpha_{11},\alpha_{12},\alpha_{21},\alpha_{22})=(0.6,0.15,0.225,0.525)$ .

Note that kernel (iii) is the one we illustrated in Figure 1, which is of infinite rank according to the formulas. In Figure 1, the value matrix of $k(t^{\prime},t)$ and $k(t^{\prime},t-t^{\prime})$ are the kernel evaluations on a same $300\times 300$ uniform grid. As we can see, the rank of the value matrix of the same kernel $k$ is reduced from $298$ to $7$ after changing to the displacement-based kernel parameterization.

Details of Experimental setup.

For RMTPP and NH we test embedding size of $\{32,64,128\}$ and choose $64$ for experiments. For THP we take the default experiment setting recommended by Zuo et al. [2020]. For NSMPP we use the same model setting in Zhu et al. [2022] with rank $5$ . Each experiment is implemented by the following procedure: Given the data set, we split $90\%$ of the sequences as training set and $10\%$ as testing set.

We use independent fully-connected neural networks with two-hidden layers for each basis function. Each layer contains $64$ hidden nodes. The temporal rank of DNSK+Barrier is set to be $1$ for synthetic data (i), (ii), (iv), (v), $2$ for (vi), and $3$ for (iii). The spatial rank is $1$ for synthetic data (iv), (v) and $2$ for (vi). The temporal and spatial rank for real data are both set to be $2$ through cross validation. For each real data set, the $\tau_{\text{max}}$ is chosen to be around $T/4$ and $s_{\text{max}}$ is $1$ for each data set since the location space is normalized before training. The hyper-parameter of DNSK+Softplus are the same as DNSK+Barrier. For RMTPP, NH, and THP the batch size is $32$ and the learning rate is $10^{-3}$ . For others, the batch size is $64$ and the learning rate is $10^{-1}$ . The quantitative results are collected by running each experiment for $5$ independent times. All experiments are implemented on Google Colaboratory (Pro version) with 25GB RAM and a Tesla T4 GPU.

C.1 Synthetic results with 2D & 3D kernel

In this section we present additional experiment results for the synthetic data sets with 2D exponential and 3D non-stationary mixture kernel. Our proposed model successfully recovers the kernel and event conditional intensity in both case. Note that the recovery of 3D mixture kernel demonstrates the capability of our model to handle complex event dependency with mixture patterns by conveniently setting time and mark rank to be more than 1.

C.2 Atlanta textual crime data with high-dimensional marks

Figure A.4 visualizes the fitting and prediction results of DNSK+Barrier. Our model presents an decaying pattern in temporal effect and captures two different patterns of spatial influence for incidents in the northeast. Besides, the in-sample and out-of-sample intensity predictions demonstrate the ability of DNSK to characterize the event occurrences by showing different conditional intensities.