Spatio-Temporal-Network Point Processes for Modeling
Crime Events with Landmarks

Zheng Dong H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology Jorge Mateu Department of Mathematics, Universitat Jaume I, Castelló de la Plana, Valencia, Spain Yao Xie¹¹1Email: yao.xie@isye.gatech.edu H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

Abstract

Self-exciting point processes are widely used to model the contagious effects of crime events living within continuous geographic space, using their occurrence time and locations. However, in urban environments, most events are naturally constrained within the city’s street network structure, and the contagious effects of crime are governed by such a network geography. Meanwhile, the complex distribution of urban infrastructures also plays an important role in shaping crime patterns across space. We introduce a novel spatio-temporal-network point process framework for crime modeling that integrates these urban environmental characteristics by incorporating self-attention graph neural networks. Our framework incorporates the street network structure as the underlying event space, where crime events can occur at random locations on the network edges. To realistically capture criminal movement patterns, distances between events are measured using street network distances. We then propose a new mark for a crime event by concatenating the event’s crime category with the type of its nearby landmark, aiming to capture how the urban design influences the mixing structures of various crime types. A graph attention network architecture is adopted to learn the existence of mark-to-mark interactions. Extensive experiments on crime data from Valencia, Spain, demonstrate the effectiveness of our framework in understanding the crime landscape and forecasting crime risks across regions.

1 Introduction

Self-exciting point processes (Reinhart,, 2018) have been used in crime modeling with several successful attempts on burglary (Mohler et al.,, 2011), gang violence (Zipkin et al.,, 2014), gunshot incidents (Dong and Xie,, 2024), and terrorism data (Porter and White,, 2012). The statistical structure of a self-exciting process is well-suited to characterize both the endogenous crime rates and the contagious pattern observed in crime data (Johnson,, 2008; Mohler,, 2013; Loeffler and Flaxman,, 2018). Specifically, it models the intensity of crime events using a background event rate and a so-called influence kernel that plays a pivotal role in capturing the contagious effect of an observed crime event on future crime events in nearby neighborhoods.

The dynamics of crime contagion are particularly complex within urban settings, influenced heavily by the geographic layout of the city. While crimes occur in a continuous space (e.g., within a city area measured by longitude and latitude), they are mostly confined to the street networks, influencing both the escape routes of criminals and the spatial distribution of crime (Rossmo,, 1999). Early research (Mohler et al.,, 2011) also suggests that crime’s contagious effects propagate along these street networks instead of dispersing freely, as evidenced by a fitted non-parametric influence kernel from real crime events. In this situation, traditional point processes with Euclidean distance-based influence kernels (Mohler,, 2014; Reinhart and Greenhouse,, 2018; Zhuang and Mateu,, 2019) often fall short, necessitating an adjusted influence kernel that respects this urban constraint.

Another factor contributing to the complexity of urban crime dynamics is the diversity of the surrounding urban environments where different crimes occur. Diverse land uses, ranging from commercial to residential areas, influence the types and prevalence of criminal activities, each fostering unique interactions between potential offenders and victims (Fleming et al.,, 1994; Stucky and Ottensmann,, 2009; Kinney et al.,, 2008). For instance, commercial areas can host a variety of legitimate (shopping, working, eating, etc.) and criminal (shoplifting, picking pockets, etc.) activities during business hours, creating specific crime patterns that are very different from those in other regions (Kinney et al.,, 2008). While previous studies have shown the effectiveness of fine-crafted point process models in understanding the landscapes of various crime types across different regions (Mohler,, 2014; Linderman and Adams,, 2014), there remains a gap in these models’ capability to integrate the information of urban land uses, limiting their explanatory power regarding the relationship between specific urban surroundings and crime patterns.

In this paper, we introduce a novel spatio-temporal-network point process model tailored for analyzing crime within urban street networks. This model uniquely incorporates the structure of city street networks and adopts a street-network-based distance metric that aligns more closely with the actual movement patterns of criminals compared to traditional Euclidean metrics, providing a realistic depiction of crime patterns within a networked urban environment. To integrate the contextual data of urban land uses into the model, we craft a special mark for each crime event, which considers the information about nearby landmarks such as banks, restaurants, and supermarkets. Using the concept of urban functional zones (Yuan et al.,, 2014) that segment the entire city area based on the landmarks, we extend the traditional mark of a crime event, typically the category of the crime (Mohler,, 2014; Reinhart and Greenhouse,, 2018) (e.g., burglary, larceny, robbery, etc.), into a new mark that contains both the crime and landmark categories. Such an event mark allows for direct analysis of the impact of specific urban surroundings on local crime patterns.

The design of our influence kernel jointly considers the time, location, and mark information of crime events. A temporal kernel and a street distance-based spatial kernel characterize how previous crimes influence future ones over time and space, respectively. Moreover, we introduce a novel mark network captured through graph neural networks (GNNs), which assesses interactions between different crime events based on their marks. This GNN framework predicts potential linkages between different marks while considering their intrinsic similarities. By capturing these intricate relationships, our model facilitates a deeper understanding of crime clustering and propagation. Tested extensively with real crime data from Valencia, Spain, our model has proven highly effective in capturing the dynamic landscape of urban crime and predicting crime risk across the city, offering significant improvements over existing methodologies.

The paper is organized as follows. The rest of this section reviews related literature. Section 2 introduces the crime and landmark data sets collected in Valencia that motivate our model. Section 3 presents the data-processing techniques that define the format of the discrete event data with marks. Section 4 introduces our spatio-temporal-network point-process model with graph neural networks, which is learned using the estimation strategy in Section 5. Finally, in Section 6, we present the results using our model on the real crime data in Valencia and a comparison with baselines. The paper ends with some further discussion.

1.1 Related work

Our research is placed within the domain of predictive policing (Perry,, 2013), which includes four general categories: methods for predicting crimes (Chainey et al.,, 2008; Neill and Gorr,, 2007; Wang and Brown,, 2012), methods for predicting offenders (Bonta et al.,, 1998; Grann et al.,, 1999), methods for predicting perpetrators’ identities (Lev-Wiesel et al.,, 2004; Tarzia et al.,, 2018), and methods for predicting victims of crime (Gottfredson,, 1981; Russo et al.,, 2013). Our study belongs to the first category, which aims to forecast places and times with an increased crime risk. Unlike the other three categories that require the collection of extensive information about crime incidents, such as police reports, to identify certain individuals or groups that may get involved in criminal activities, the prediction of spatio-temporal occurrences of crime can be mainly achieved by leveraging historical crime data without the necessary access to sensitive information.

Many mathematical models have been used to understand the complex phenomenon of crime; a family of those includes tools that aim to detect potential hotspots based on empirical observations of spatial clusters of crime incidents (Levine and CrimeStat,, 2002; Bowers et al.,, 2004; Chainey et al.,, 2008). However, most hotspot modeling approaches do not consider the temporal dynamics of the hotspot, despite some exploring the overall evolution of hotspots rather than focusing on individual events (Short et al.,, 2008). Other models use regression-based methods (Meera and Jayakumar,, 1995; Kennedy et al.,, 2011, 2016) to quantitatively assess the effects of different factors on the total number of crimes in a specific region, and the interpretable results from the regression models can aid in more targeted and effective interventions. These methods usually run over a discretized space and require the collection of contextual information, such as demographic and socioeconomic data, to establish the regression models. Compared with them, our method focuses on modeling the spatio-temporal near-repeat effect of crime using discrete crime event data and can provide the prediction and risk evaluation on a fine-grained level over the street network.

In recent decades, there has been a substantial body of research (Kinney et al.,, 2008; Johnson and Bowers,, 2010; Groff,, 2011; Weisburd et al.,, 2012; Xu and Griffiths,, 2017) examining the relationship between urban land use and crime patterns. These studies have pinpointed environmental characteristics linked to increased crime risks in specific urban settings. Our approach differs from the aggregated-statistics-based analysis often taken in such research (Fleming et al.,, 1994; Kinney et al.,, 2008; Stucky and Ottensmann,, 2009; Browning et al.,, 2010; Xu and Griffiths,, 2017). Instead, we model the spatio-temporal crime patterns through the lens of individual crime incidents, providing a dynamic perspective for explaining the crime and integrating effective surveillance.

The application of self-exciting point processes, motivated by the modeling of earthquake occurrences in seismology (Ogata,, 1988), has been widely explored to characterize the dynamics of criminal activities (Mohler et al.,, 2011; Lewis et al.,, 2012; Mohler,, 2013; Reinhart,, 2018; Zhuang and Mateu,, 2019; Zhu and Xie,, 2022). Previous attempts demonstrate the effectiveness of point processes in modeling crime using residential burglary data in Los Angeles (Mohler et al.,, 2011), civilian death reports in Iraq (Lewis et al.,, 2012), and gunshot data in Washington, DC (Loeffler and Flaxman,, 2018). Later approaches (Mohler,, 2014; Reinhart and Greenhouse,, 2018; Zhu and Xie,, 2022) improve point process models for crime by incorporating the events’ type, location, and textual information to capture complex crime patterns in different modeling tasks. Compared with them, our approach extends the traditional modeling of crime events in Euclidean space by adopting a network distance between crime events that is more realistic to estimate the travel distance of criminals in the urban environment. A recent paper considers crime events on linear street network (D’Angelo et al.,, 2024) that focus on improving the estimation of the non-parametric influence kernel and event intensity. Our model differs from theirs by considering an influence kernel that can leverage multiple levels of information.

A number of studies have considered the problem of modeling discrete events observed within network structures using self-exciting point processes in addition to modeling crime incidents. Most of them (Fang et al.,, 2023; Cai et al.,, 2024; Sanna Passino et al.,, 2024) only model the temporal occurrences of the events that come from the nodes in the networks. Besides, a previous work (Cho et al.,, 2013) studies the missing data problem in spatio-temporal social networks, where the network nodes are geographically distributed entities. The missing data problem also exists in criminology, such as detecting unreported crimes, which is not the focus of our study. Another work of network Hawkes (Linderman and Adams,, 2014) adopts a similar decomposed representation for the mark interactions as ours. However, their approach includes the estimation of a binary random matrix using Gibbs sampling, which is fundamentally different and more complicated, and the events’ location information is processed on an aggregated level.

Last but not least, incorporating neural networks in point process models has recently been a popular research topic (Shchur et al.,, 2021). Various neural point processes focus on leveraging recurrent neural networks (RNNs) (Du et al.,, 2016; Mei and Eisner,, 2017) or Transformer structure (Zuo et al.,, 2020) to encode the historical information. Compared with our method, they did not consider the statistical framework of the self-exciting point process and often lacked model interpretability. Another line of work (Dong et al., 2023b, ; Dong et al.,, 2022; Zhu et al., 2021a, ; Zhu et al., 2021b, ; Zhu et al., 2021c, ) focuses on representing the influence kernel using neural networks, allowing for the modeling of a wider range of event dynamics such as non-stationary and inhibiting effects. However, they do not consider contextual information such as the latent network structure or mark features. Graph neural networks have been extensively developed within the machine learning community. Nevertheless, their application in point processes has received scant investigation. Two recent works use message-passing GNNs in point processes (Xia et al.,, 2022; Wu et al.,, 2020) for the task of temporal link prediction rather than modeling discrete marked events. Another concurrent study of graph point processes (Dong et al., 2023a, ) shares similarities with ours by approximating influence kernels using graph neural networks. However, they do not consider the spatial aspect of the data.

2 Data description

The crime data in this study is collected by the local police department in Valencia (Spain), a town located along the Mediterranean coast with more than 1.5 million inhabitants. The data set records thefts and robberies over five years from 2015 to 2019, including a total of $47,125$ crime events. Each record contains comprehensive information about one event, including time, location (measured in longitude and latitude), and the crime category. The recorded events are categorized into three distinct types, including: (i) Assault (Agresión, in its source name) referring to thefts involving physical assault, (ii) Subtraction (Sustracción) referring to thefts executed smoothly without the use of force, and (iii) Others (Otros) referring to other types of street thefts or robberies not included in the previous categories.

The data set uniquely focuses on crimes that occurred on city streets, as emphasized by the local police department. To support our data analysis, we acquire street network data within the Valencia city boundary from the OpenStreetMap database (OpenStreetMap contributors,, 2017). Fig 1(a) and (b) provide visual snapshots of the recorded crime events scattered across Valencia’s street network in July 2016 and October 2018, respectively.

Additionally, to investigate the relationship between the patterns of the reported crimes and the surrounding urban environment, we collect the location information of $1,975$ city landmarks in Valencia, categorized into seven types: financial, industrial, market, nightclub, police, restaurant, and taxi. Fig 1(c) visualizes the spatial distribution of these landmarks, with different colors indicating different categories.

3 Data processing

We first present the processing strategies for our crime data set, as they play an important role in characterizing the latent and complex correlation structure presented in the events. Consider a sequence of $n$ reported crime events in Valencia. Denoting each event as a tuple, the entire sequence of events can be represented as

(t_{1},s_{1},c_{1}),(t_{2},s_{2},c_{2}),\dots,(t_{n},s_{n},c_{n}).

(1)

For the $i$ -th event, $t_{i}\in[0,T]$ represents the time of incident occurrence, $s_{i}\in\mathcal{S}\subseteq\mathbb{R}^{2}$ denotes the location of the incident, measured in longitude and latitude coordinates, where $\mathcal{S}$ denotes the geographical area covered by the city of Valencia, and $c_{i}\in\mathscr{C}\coloneqq\{1,2,3\}$ denotes the crime category of the $i$ -th incident, with $1,2,\text{and}~{}3$ representing Assault, Subtraction, and Others, respectively. All the events are temporally ordered, i.e., $0\leq t_{1}<t_{2}<\dots<t_{n}\leq T$ . We also introduce the notation for seven landmark categories as $\mathscr{L}\coloneqq[1:7]$ . Values from $1$ to $7$ correspond to the landmark categories of financial, industrial, market, nightclub, police, restaurant, and taxi, respectively.

3.1 Urban functional zone identification

Urban areas with different facilities and functionalities, known as urban functional zones (Yuan et al.,, 2014), can have different crime patterns based on the citizen activities exhibited in those areas (Kinney et al.,, 2008). For instance, commercial or public places attract more human activities and, potentially, more crime and disorder events (Andresen,, 2007; Wuschke and Kinney,, 2018). The identification of the urban functional zones is critical for implementing targeted crime prevention strategies and mitigating potential hotspots.

In our study, we partition the entire city area of Valencia into various urban functional zones based on the 1,975 city landmarks. This approach aligns with the point-of-interest (POI) method commonly referenced in the literature (Gao et al.,, 2017; Hu and Han,, 2019; Long et al.,, 2015; Yuan et al.,, 2014), which involves geographic entities that can be abstracted as points for zone identification, such as schools, banks, companies, restaurants, and supermarkets (Jiang et al.,, 2015). Specifically, we use the $k$ -nearest neighbors algorithm to identify different functional zones based on their proximity to the city landmarks. For a given location $s\in\mathcal{S}$ , we find its $k$ nearest landmarks and assign to it a landmark category $l\coloneqq\ell(s)$ as the most common landmark category among the $k$ landmarks. Thus, the function

\ell(s):\mathcal{S}\rightarrow\mathscr{L}

serves a labeling mechanism that maps each location within the city $\mathcal{S}$ to a corresponding landmark category in the set $\mathscr{L}$ . Locations sharing the same landmark category (e.g., $l$ ) are grouped to form the functional zone $\mathcal{S}_{l}$ , and we have $\mathcal{S}=\cup_{l\in\mathscr{L}}\mathcal{S}_{l}$ . The left panel in Fig 2 visualizes the partition of urban functional zones in Valencia.

3.2 Event mark definition

To accurately depict patterns of criminal activity across the city, it is crucial to consider contextual information about crime events, such as the type of crime and the environment setting in which it occurs. Currently, crimes are grouped by their crime types, for instance, Assault (or Agresión, in the original name). This categorization, however, may overlook important contextual differences. For instance, an assault near a restaurant and another near a bank are both categorized under Assault, despite the distinct human activity patterns typical of dining and financial areas. By refining our crime categorization to account for these specific environment settings, we can enhance our understanding of crime dynamics.

We design a novel mark associated with each event (Reinhart,, 2018) to categorize the crime events. The mark is designed to combine the event’s crime category $c$ and the landmark category $\ell(s)$ of its location $s$ , thus considering the urban functional zone that the event falls in. We denote the event mark as $c\times\ell(s)$ . For instance, as illustrated in the right panel of Fig 2, the Assault occurring in a restaurant zone on July 9th, 2016, is assigned the label $1\times 6$ (representing Assault $\times$ restaurant), while another Subtraction on October 12th, 2018, in a financial zone receives the label $2\times 1$ (representing Subtraction $\times$ financial). The value space of the mark is a finite set $\mathscr{C}\times\mathscr{L}$ with a size of 21 (three crime categories and seven landmark categories). We refer to the mark “crime-landmark label” of the event in the later discussion to reveal its practical meaning. As we can see, this new mark derives a comprehensive categorization of crime events by including contexts of the observed event. Meanwhile, it allows for a detailed examination of crime patterns across different urban functional zones by analyzing incidents through the lens of their specific crime-landmark labels.

3.3 Event dependence through multiple spaces

Crime events are ordered in time, and historical events will impact the probability, timing, or characteristics of future events (Mohler et al.,, 2011; Loeffler and Flaxman,, 2018). Such an impact is referred to as event dependence. To model the dependence among temporal events, we consider their relations over the geographic space with an underlying street network structure and the mark space (crime-landmark labels) characterized by an interaction network.

To model the spatial relationship between crime events on the urban streets of Valencia, we overlay a street network structure $G$ on top of the continuous geographic space $\mathcal{S}$ . This street network is constructed using the data from OpenStreetMap database (OpenStreetMap contributors,, 2017). The streets in Valencia are represented as linear segments linked at their endpoints. Note that the endpoints of these segments do not necessarily align with actual street intersections; they can be located in the middle of a street, such as on a curved street divided into multiple segments. These endpoints are treated as the nodes of the street network, while the street segments become the edges connecting these nodes. Each network edge is associated with an attribute, known as the edge weight, indicating the length of the corresponding street segment measured in kilometers. The network is processed to be undirected to reflect the mobility patterns in street crimes in Valencia, where perpetrators commonly travel on foot or by bike in either direction along the streets (Bounce,, 2024). The street network consists of 8,043 nodes and 12,309 weighted, undirected edges, covering the entire city area of Valencia. It is worth noting that crime events are integrated into this network by being mapped to random locations on the network edges based on their geographic coordinates rather than being assigned to specific nodes. Two layers at the bottom in Fig 3 illustrate such an overlay of the street network $G$ and the mapping of the crime events to the network edges.

Understanding the relation between event marks also provides valuable insights into characterizing the dependencies of crime events. By analyzing the sequence of observed marks, we can determine if certain events tend to be triggered by others in a specific pattern. In this study, such dependencies are represented through a mark network, denoted as $A$ . Each node of the mark network represents a distinct crime-landmark label (total of 21 nodes), and the events are assigned to the corresponding nodes based on their crime-landmark labels. The edges between these nodes indicate the potential relation between the crime-landmark labels they connect with. Such a relation can be directional, i.e., an observed crime with label $c\times\ell(s)$ may influence the occurrence of a future event with label $c^{\prime}\times\ell(s^{\prime})$ but not vice versa. Hence, the edges of the mark network are directional. Unlike the street network, which is derived directly from available geographic data, the mark network is established by learning a point process model detailed in the next section from the crime data. This model learns from the crime data to establish the directed and weighted edges of the mark network, indicating both the direction and strength of the dependencies between different event marks. An example of the mark network is presented at the top of Fig 3, highlighting directional relations among various event marks (crime-landmark labels).

4 Point process modeling for event dependence

With the introduced event marks in Section 3, we re-denote the processed data of $n$ observed crime events in (1) as

(t_{1},s_{1},c_{1}\times l_{1}),(t_{2},s_{2},c_{2}\times l_{2}),\dots,(t_{n},s_{n},c_{n}\times l_{n}),

where $0\leq t_{1}<t_{2}<\cdots<t_{n}\leq T,s_{i}\in\mathcal{S},c_{i}\in\mathscr{C}$ , and $l_{i}\coloneqq\ell(s_{i})\in\mathscr{L}$ . In the following, we present our point process modeling for understanding the multi-modal dependencies among the reported crime events over the street network.

4.1 Spatio-temporal-network point processes

Self-exciting spatio-temporal point processes (Moller and Waagepetersen,, 2003; Reinhart,, 2018) are widely used in crime modeling to capture the contagious nature of crime events (Mohler et al.,, 2011). Let $\mathcal{H}_{t}=\{(t_{i},s_{i},c_{i}\times l_{i})\in\mathcal{H}_{T}|t_{i}<t\}$ denote the observed crime events happened before time $t$ ; we adopt a conditional intensity function for each event category $c\times l$ to suggest the possibility of observing a new event with label $c\times l$ conditioning on the history. Specifically, the conditional intensity function at time $t$ and location $s$ is defined as

\lambda_{cl}\left(t,s\mid\mathcal{H}_{t}\right)=\lim_{\Delta t\downarrow 0,\Delta s\downarrow 0}\frac{\mathbb{E}\left[\mathbb{N}_{cl}([t,t+\Delta t]\times B(s,\Delta s))\mid\mathcal{H}_{t}\right]}{|B(s,\Delta s)|\Delta t},\quad s\in\mathcal{S}_{l},

where $B(s,\Delta s)$ is a ball centered at location $s$ with radius $\Delta s$ . The $\mathbb{N}_{cl}$ is the counting measure for events with label $c\times l$ , i.e., $\mathbb{N}_{cl}(A)$ is defined as the number of events with label $c\times l$ occurring within any subset $A\subseteq[0,T]\times\mathcal{S}$ . This function essentially measures the rate at which events are expected to occur at a specific time and place based on historical data, with $\lambda_{cl}\left(t,s|\mathcal{H}_{t}\right)\geq 0$ for any arbitrary $c$ , $l$ , $t$ and $s$ . To simplify the notation, we omit the $\times$ between $c$ and $l$ in the subscript.

Hawkes processes proposed in (Hawkes,, 1971) provide the self-exciting model formulation for capturing the triggering effects among events. It assumes that the occurrences of future events are positively influenced by the observed history, and the influence of past events is linearly additive. In this study, we model the conditional intensity function as follows:

\lambda_{cl}(t,s\mid\mathcal{H}_{t})=\mu_{cl}+\sum_{(t^{\prime},s^{\prime},c^{\prime}\times l^{\prime})\in\mathcal{H}_{t}}k(t^{\prime},t,s^{\prime},s,c^{\prime}\times l^{\prime},c\times l),\quad s\in\mathcal{S}_{l}.

(2)

Here, $\mu_{cl}$ is a constant representing the base intensity of events with label $c\times l$ . The $k$ function is the so-called influence kernel that captures the influence of a past incident $(t^{\prime},s^{\prime},c^{\prime}\times l^{\prime})$ on a current event $(t,s,c\times l)$ . This formulation allows for characterizing the influence of historical events on the likelihood of future events within the framework of the Hawkes process.

A separable form of the influence kernel has been commonly assumed in previous literature (Dong et al., 2023b, ; Mohler,, 2014; Reinhart,, 2018; Reinhart and Greenhouse,, 2018; Zhu and Xie,, 2022). The influence kernel $k$ can be expressed by the product of three individual kernel functions as

k(t^{\prime},t,s^{\prime},s,c^{\prime}\times l^{\prime},c\times l)=f(t^{\prime},t)\cdot g(s^{\prime},s)\cdot h(c^{\prime}\times l^{\prime},c\times l).

The kernel functions $f,g,h$ characterize the event influence over the space of times, locations, and event marks, respectively. We note that the separable form of the influence kernel enables a computationally efficient procedure for model fitting, given the large size of the data set. Meanwhile, the separable influence kernel can also provide us with interpretable results, as illustrated in Section 6. In the following, we introduce the construction of these kernel functions in our context of modeling the street crime events within an urban environment.

Temporal kernel

We choose our temporal kernel $f$ to be an exponential function

f(t^{\prime},t)=\beta e^{-\beta(t-t^{\prime})},\quad t>t^{\prime}.

Such a kernel function assumes the influence of a past event becomes significant in the near future and decays over time exponentially with a decaying rate $\beta>0$ , for subsequent incidents usually aggregate in time, occurring sooner after previous crimes.

Street-network-based spatial kernel

In our case, criminal activities appear on the city street network, and criminals typically use roads to flee crime scenes rather than traveling in straight lines, which is impractical due to urban structures, such as buildings. Therefore, the Euclidean distance between event locations becomes unsuitable for assessing the spatial connectivity between crime events. Favored by the overlay of the street network, we adopt a street network distance (Wei et al.,, 2020), denoted as $d_{\rm net}(s,s^{\prime})$ , for calculating the travel distance between any two locations $s$ and $s^{\prime}$ on the network edges. The calculation of $d_{\rm net}$ involves two scenarios, as illustrated in Fig 4: (i) The movement from $s$ to $s^{\prime}$ on different edges involves moving from $s$ to an adjacent node ( $u_{1}$ or $u_{2}$ ), traversing the shortest path (indicated by dashed lines in Fig 4) to a node ( $u_{3}$ or $u_{4}$ ) on the edge that $s^{\prime}$ falls on, and finally proceeding to $s^{\prime}$ . There are four possible paths between $s$ and $s^{\prime}$ : $s\to u_{1}\mathrel{\leadsto}u_{3}\to s^{\prime}$ , $s\to u_{1}\mathrel{\leadsto}u_{4}\to s^{\prime}$ , $s\to u_{2}\mathrel{\leadsto}u_{3}\to s^{\prime}$ , and $s\to u_{2}\mathrel{\leadsto}u_{4}\to s^{\prime}$ , where the $\mathrel{\leadsto}$ represents the shortest path over the network between two nodes. Then, $d_{\rm net}(s,s^{\prime})$ equals the shortest length of these four paths; (ii) For $s$ and $s^{\prime\prime}$ on the same edge, $d_{\rm net}(s,s^{\prime\prime})$ is simply the straight-line (Euclidean) distance between them. Based on the street network distance, we propose a Gaussian spatial kernel, defined as

g(s^{\prime},s)=\frac{1}{2\pi\sigma^{2}}e^{-\frac{d_{\rm net}^{2}(s,s^{\prime})}{2\sigma^{2}}}.

This kernel function indicates that the influence of an event decays as the distance increases. The parameter $\sigma>0$ determines the scale of influence across the street network, illustrating how spatial interactions diminish over distance.

Interactions between event marks

To model the interactions between event marks that are categorical, we represent the kernel function $h$ using a set of coefficients $\{\alpha_{cl,c^{\prime}l^{\prime}}\}_{c,c^{\prime}\in\mathscr{C},l,l^{\prime}\in\mathscr{L}}$ , where

h(c^{\prime}\times l^{\prime},c\times l)=\alpha_{cl,c^{\prime}l^{\prime}}

captures the influence of a historical event with mark $c^{\prime}\times l^{\prime}$ on a future event with mark $c\times l$ . A larger value of $\alpha_{cl,c^{\prime}l^{\prime}}$ contributes more to the conditional intensity function, suggesting a higher possibility of observing a future event marked by $c\times l$ given an observed event mark $c^{\prime}\times l^{\prime}$ . Note that such an interaction can be directional, that is, $\alpha_{cl,c^{\prime}l^{\prime}}\neq\alpha_{c^{\prime}l^{\prime},cl}$ . All the coefficients are set to be non-negative, and a zero-value $\alpha_{cl,c^{\prime}l^{\prime}}$ means no influence from events with mark $c^{\prime}\times l^{\prime}$ to events with mark $c\times l$ . The mark network $A$ is established accordingly from the coefficients. When $\alpha_{cl,c^{\prime}l^{\prime}}>0$ , a directed edge from the node representing crime-landmark label $c^{\prime}\times l^{\prime}$ to the node representing label $c\times l$ is created, with the edge weight assigned as $\alpha_{cl,c^{\prime}l^{\prime}}$ .

Following the chosen kernel functions, the conditional intensity function for a crime event with mark $c\times l$ at time $t$ and location $s$ is modeled as follows

\displaystyle\lambda_{cl}(t,s)

\displaystyle=\mu_{cl}+\sum_{(t^{\prime},s^{\prime},c^{\prime}\times l^{\prime})\in\mathcal{H}_{t}}\alpha_{cl,c^{\prime}l^{\prime}}\beta e^{-\beta(t-t^{\prime})}\frac{e^{-\frac{d_{\rm net}^{2}(s,s^{\prime})}{2\sigma^{2}}}}{2\pi\sigma^{2}},\quad s\in\mathcal{S}_{l}.

(3)

The base intensity $\mu_{cl}$ is estimated from the data. The influence kernel is chosen to integrate to $\alpha_{cl,c^{\prime}l^{\prime}}$ , providing a natural interpretation of the coefficient: the $\alpha_{cl,c^{\prime}l^{\prime}}$ is the expected number of crime events with mark $c\times l$ triggered by an observed event with mark $c^{\prime}\times l^{\prime}$ . Here, for notation simplicity, we omit the dependence on history $\mathcal{H}_{t}$ in the intensity function and use common shorthand $\lambda_{cl}(t,s)$ to denote $\lambda_{cl}(t,s\mid\mathcal{H}_{t})$ . Note that it is possible to allow different spatial and temporal decays for events with different crime landmark labels. Yet, this approach would significantly increase the number of model parameters.

4.2 Influence kernel learning with graph neural networks

The learning of the coefficients $\{\alpha_{cl,c^{\prime}l^{\prime}}\}_{c,c^{\prime}\in\mathscr{C},l,l^{\prime}\in\mathscr{L}}$ plays an essential role in understanding the mark interactions and the characterization of the event dynamics. Traditionally, these coefficients have been directly estimated from data, as outlined in various studies (Mohler,, 2014; Reinhart and Greenhouse,, 2018; Zhu and Xie,, 2022). However, recent advancements in point process models have showcased the value of incorporating prior knowledge of event marks, known as features, into the modeling of these coefficients and the mark interactions. For example, when modeling the interactions between different social media users (marks), the work of Group Network Hawkes Process (Fang et al.,, 2023) treats these users as network nodes and leverages their characteristics (the features of the marks) to effectively identify the group interactions and influential users in social networks. In our case, the event marks are defined by the combinations of multiple crime and landmark categories. It is reasonable to believe that marks sharing the same crime or landmark category tend to exhibit stronger interactions than those with differing categories. Therefore, these crime or landmark categories that compose the mark can be regarded as the mark features that we can leverage in the coefficient modeling.

We introduce a novel approach via GNNs to model the coefficients, leveraging their ability to integrate nodal features in learning node similarity. We first decompose the coefficients into two components as follows:

\alpha_{cl,c^{\prime}l^{\prime}}=a_{cl,c^{\prime}l^{\prime}}\cdot p_{cl,c^{\prime}l^{\prime}},

where $a_{cl,c^{\prime}l^{\prime}}>0$ and $0\leq p_{cl,c^{\prime}l^{\prime}}\leq 1$ are both scalars. Together, these two components can be viewed as the strength and the chance of the interaction between two marks. The term $p_{cl,c^{\prime}l^{\prime}}$ , modeled by a GNN, will incorporate the mark features and capture the graph topology by providing the likelihood for any $c\times l$ to have a connection to the rest of $c^{\prime}\times l^{\prime}$ ; meanwhile, the $a_{cl,c^{\prime}l^{\prime}}$ captures the weights on the edges in the mark network, indicating the strength of the connection.

We model these two components separately. For the strength $a_{cl,c^{\prime}l^{\prime}}$ , we treat it as a trainable scalar that is learned from data. For the modeling of the chance, we use Graph Attention Networks (GAT) (Veličković et al.,, 2018) to take the mark features into account. The feature of mark $c\times l$ can be denoted by a column vector $X_{cl}\in\mathbb{R}^{D}$ , which is the concatenation of the one-hot vectors of the crime category $c$ and the landmark category $l$ . Based on the feature vectors, GAT computes the chance of the influence from mark $c^{\prime}\times l^{\prime}$ to mark $c\times l$ through a so-called multi-head self-attention mechanism over graphs. The attention mechanism in the $r$ -th attention head is carried out as

e^{r}_{cl,c^{\prime}l^{\prime}}=\gamma(\bm{W}^{r}X_{cl},\bm{W}^{r}X_{c^{\prime}l^{\prime}}),

Here, the weight matrix $\bm{W}^{r}\in\mathbb{R}^{D^{\prime}\times D}$ defines a shared linear transformation of each mark feature, and $\gamma\colon\mathbb{R}^{D^{\prime}}\times\mathbb{R}^{D^{\prime}}\to\mathbb{R}$ takes a pair of transformed features as input and computes the attention weight $e^{r}_{cl,c^{\prime}l^{\prime}}$ (defined in the next equation). The attention weights are normalized through the softmax function, which leads to the output of the $r$ -th head

	$\displaystyle p^{r}_{cl,c^{\prime}l^{\prime}}$	$\displaystyle={\rm softmax}(\{e^{r}_{cl,c^{\prime}l^{\prime}}\}_{c^{\prime}\in\mathscr{C},l^{\prime}\in\mathscr{L}})$		(4)
		$\displaystyle=\frac{{\rm exp}\left({\rm LeakyReLU}\left({\bm{b}^{r}}^{\top}\left[\bm{W}^{r}X_{cl}\\|\bm{W}^{r}X_{c^{\prime}l^{\prime}}\right]\right)\right)}{\sum_{c^{\prime}\in\mathscr{C},l^{\prime}\in\mathscr{L}}{\rm exp}\left({\rm LeakyReLU}\left({\bm{b}^{r}}^{\top}\left[\bm{W}^{r}X_{cl}\\|\bm{W}^{r}X_{c^{\prime}l^{\prime}}\right]\right)\right)}.$		(4)

Here, $\|$ stands for the concatenation operation, and the Leaky ReLU (Maas et al.,, 2013) is an element-wise non-linear function defined as ${\rm LeakyReLU}(x)=\max(0,x)+b\min(0,x)$ . In our numerical experiments, we set $b=0.2$ as used in the original GAT (Veličković et al.,, 2018). The results from different attention heads are averaged to get the final chance $p_{cl,c^{\prime}l^{\prime}}$ as

p_{cl,c^{\prime}l^{\prime}}=\frac{1}{R}\sum_{r=1}^{R}p^{r}_{cl,c^{\prime}l^{\prime}}.

Note that the GAT introduces a constraint of $\sum_{c^{\prime}\in\mathscr{C},l^{\prime}\in\mathscr{L}}p_{cl,c^{\prime}l^{\prime}}=1$ to ensure that the $p_{cl,c^{\prime}l^{\prime}}$ collectively form a probability distribution over potential connections. The hyper-parameter to be determined in advance is the number of attention heads $R$ to achieve the balance between model flexibility and generability. The learnable parameters are $\{\bm{b}^{r},\bm{W}^{r}\}_{r=1}^{R}$ in GAT and the interaction strength $\{a_{cl,c^{\prime}l^{\prime}}\}_{c,c^{\prime}\in\mathscr{C},l,l^{\prime}\in\mathscr{L}}$ .

5 Model estimation

We now discuss the estimation of model parameters based on the Maximum Likelihood Estimation (MLE) approach (Reinhart,, 2018). The units for measuring the event time and distance are days and kilometers, respectively, throughout the model estimation and empirical experiments.

We first estimate the base intensity $\{\mu_{cl}\}_{c\in\mathscr{C},l\in\mathscr{L}}$ by the average number of observed crime events with mark $c\times l$ over a street segment and time interval of unit length (i.e., one kilometer and one day). Other non-parametric procedures for estimating the base intensity using stochastic declustering (Mohler et al.,, 2011; Zhuang and Mateu,, 2019) or kernel density estimation (Mohler,, 2014; Reinhart and Greenhouse,, 2018; Yuan et al.,, 2019) have been adopted in previous literature on modeling self-exciting crime events. Compared with these methods, our approach provides a more computationally efficient procedure, particularly for large-scale crime data sets (e.g., more than 10,000 crimes) (Reinhart,, 2018), and avoids the model identification issue when Gaussian kernels are used in both base intensity and influence kernel (Reinhart and Greenhouse,, 2018). By estimating base intensities for various crime types and urban functional regions, our approach also captures the heterogeneity in event occurrence across both geographic space and mark space.

The influence kernel is estimated by maximizing the log-likelihood function of the point process model (Daley et al.,, 2003). We denote the parameters in the influence kernel as $\theta\coloneqq\{\{\bm{b}^{r},\bm{W}^{r}\}_{r=1}^{R},\{a_{cl,c^{\prime}l^{\prime}}\}_{c,c^{\prime}\in\mathscr{C},l,l^{\prime}\in\mathscr{L}},\beta,\sigma\}$ . The log-likelihood function of observing $\mathcal{H}_{T}=\{(t_{i},s_{i},c_{i}\times l_{i})\}_{i=1}^{n}$ on $[0,T]\times\mathcal{S}$ is given by

\displaystyle L(\theta)

\displaystyle=\sum_{i=1}^{n}\log\lambda_{c_{i}l_{i}}\left(t_{i},s_{i}\right)-\sum_{c\in\mathscr{C},l\in\mathscr{L}}\int_{0}^{T}\int_{\mathcal{S}}\lambda_{cl}(t,s)dsdt,

(5)

where $\theta$ is incorporated into the conditional intensity function (see Appendix A for log-likelihood derivation). Due to the existence of graph neural networks in our model and the large data size, solving the M-step in the classic expectation-maximization (EM) algorithm for point processes (Liu et al.,, 2021; Veen and Schoenberg,, 2008; Zhu and Xie,, 2022) becomes intractable and overwhelming. Therefore, we adopt the commonly-used optimization strategy of stochastic gradient descent (Robbins and Monro,, 1951) to estimate the model parameters $\theta$ . The crime data set used for model training is separated into multiple event sequences by consecutive fixed-length time windows. The obtained event sequences will be retrieved in random order with a fixed batch size. Each retrieved batch of the event sequences is used to compute the gradient of the loss function with regard to the model parameters using backpropagation (Rumelhart et al.,, 1986). The model parameters are then updated along the computed gradient with a chosen learning rate $\eta$ . In our case, the loss function for each batch is the summation of the negative log-likelihoods $-L(\theta)$ of all the sequences in that batch. Algorithm 1 summarizes the learning procedure for the parameters $\theta$ , where we set the batch size $M=3$ , learning rate $\eta=1.0$ , and epoch number $E=1,500$ in our experiments. The validity of using multiple subsequences for learning the parameters can be guaranteed by setting the length of the time window used for splitting the entire sequence (for example, 120 days) much larger than the scale of the decaying temporal effect of historical events (around 30.77 days by the final learned model).

Input: Training set

\{\mathcal{H}^{j}_{T}\}_{j=1}^{J}

with

J

non-overlapping subsequences, where

\mathcal{H}_{T}^{j}=\{(t_{i},s_{i},c_{i}\times l_{i})\}_{i=1}^{n_{j}}

and

\cup_{j=1}^{J}\mathcal{H}^{j}_{T}=\mathcal{H}_{T}

; batch size

M

; epoch number

E

; learning rate

\eta

Initialization: model parameters

\theta^{(0)}

, first epoch

e=0

while

e<E

for each batch

\{\mathcal{H}_{T}^{j_{1}},\dots,\mathcal{H}_{T}^{j_{M}}\}

with size

M

1. Compute the negative log-likelihood

-L^{j}(\theta^{(e)})

using

\mathcal{H}_{T}^{j}

for

j\in\{j_{1},\dots,j_{M}\}

, according to (5).

2. Compute the gradient

\left.\bm{g}^{(e)}=\nabla_{\theta}\left(\sum_{j_{1},\dots,j_{M}}-L^{j}(\theta)\right)\right\rvert_{\theta=\theta^{(e)}}

using backpropagation.

3. Update the model parameters:

\theta^{(e+1)}\leftarrow\theta^{(e)}-\eta\bm{g}^{(e)}

end for

e\leftarrow e+1

end while

return Learned model parameter

\theta^{(E)}

Algorithm 1 Model parameter estimation using stochastic gradient descent

Remark: The computational cost of the loss function mainly lies in the evaluation of the first term in (5), which involves evaluations of the influence kernel between each pair of events in the event sequence. By dividing the entire training sequence with $n$ events into $J$ subsequences with each of $n_{j}$ events, the complexity of computing (5) over the entire data set can be reduced from $\mathcal{O}(n^{2})$ down to $\sum_{j=1}^{J}\mathcal{O}(n_{j}^{2})\approx\mathcal{O}(n^{2}/J^{2})$ . In fact, we are eliminating the overwhelming and unnecessary evaluations of the influence kernel between event pairs that are far away enough over time so that the earlier event has little or no influence on the latter one. Fig B1 in Appendix B shows the model training time and the model’s goodness-of-fit on the training data set with different $J$ s. With a proper $J$ , enhanced model computational efficiency can be attained without degrading the model performance. In our experiments, we choose $J=12$ to achieve the balance between the model performance and computational efficiency (i.e., the length of the time window for each subsequence is 120 days).

6 Results

We now present the results by analyzing the crime data set in Valencia (Spain), and further demonstrate the competitive performance of our proposed model (referred to as STNPP) in predicting future crime rates and understanding the dynamics of crime events. The entire data set is partitioned into two parts. The first part includes data from 2015 through 2018, which is used to estimate the model parameters and evaluate the goodness-of-fit of the model. The second part contains data from 2019 and is used for assessing the model’s predictive performance.

6.1 Model validation

We first validate our model from two aspects: the determination of the hyper-parameter $R$ and the goodness-of-fit of the chosen model on the crime data.

An appropriate choice of the number of attention heads $R$ in GAT needs to be determined in advance, which can be achieved using cross-validation. We first divide the training data from 2015 to 2018 into 12 subsequences with the same time window length of 120 days. Then, we adopt 4-fold cross-validation on the training data to determine the value of $R$ . Given a choice of $R$ , all the 12 subsequences are shuffled randomly and split into four groups. One round of cross-validation involves taking one group as the hold-out data, training the model with the remaining groups, and evaluating the trained model on the hold-out data. The final model performance is obtained by averaging the metrics over four independent rounds. We compare the performance of the model with $R=2,4,6,8,12$ , and $16$ attention heads in terms of the model log-likelihood on the hold-out data, which evaluates the model generalization ability to the unseen data. Better model performance is indicated by a higher hold-out log-likelihood. Fig 5 reports the averaged hold-out log-likelihood with different attention heads. According to the results, we choose $R=8$ as an optimal choice in the remaining experiments in this study.

Another model assumption – the stationarity of the influence kernel needs to be validated by investigating the model’s fit with the training data. Stationarity means that the model parameters do not vary over time, indicating that the pattern of event influence remains consistent. Previous research on crime modeling with point processes has frequently made this assumption, but often without adequately verifying its validity. The work of the non-stationary ETAS model (Kumazawa and Ogata,, 2014) presents a method to test the goodness-of-fit of a stationary point process model to the data by comparing the expected cumulative number of events computed from the learned model and empirical cumulative number of events. In our context, given the learned model $\hat{\lambda}_{cl}$ , the expected cumulative number of events in the time interval $[0,t]$ is computed as $\Lambda(t)=\int^{t}_{0}\int_{\mathcal{S}}\sum_{c\in\mathscr{C},l\in\mathscr{L}}\hat{\lambda}_{cl}(t,s)dsdt$ . If the model represents a good approximation of the real data, we expect that $\Lambda(t)$ and the empirical cumulative event counts $\mathbb{N}(t)=\sum_{c\in\mathscr{C},l\in\mathscr{L}}\mathbb{N}_{cl}(t)$ are close. We fit the model using the entire training set, and plot the $\mathbb{N}(t)$ and $\Lambda(t)$ from 2015 to 2018 in Fig 6. The consistent match between the empirical and expected cumulative event numbers suggests that the underlying data dynamics are stationary, and the assumption of kernel stationarity is reliable when fitting this data set.

6.2 Data fitting and in-sample estimations

We then fit the model on the entire training data from 2015 to 2018 and analyze the results. To quantitatively demonstrate the effectiveness of our model, we compare our model with different baselines in terms of the fitted log-likelihood on the training data, the Akaike Information Criterion (AIC) (Akaike,, 1974, 1998) of the model, and the mean absolute error (MAE) of the in-sample estimation of the number of the crime events. The log-likelihood, computed by (5) using training data, measures the model goodness-of-fit to the training data. The AIC considers both the model fit to the data and the model complexity. It is described as $AIC=-2\max_{\theta}\log L(\theta)+2k$ , where $\log L(\theta)$ is the model log-likelihood and $k$ is the number of model parameters to be estimated. The in-sample estimation of the event number over a given time interval can be performed as follows: we fit the model using the entire training data, feed the same data into the fitted model, and calculate the integral of the conditional intensity function over the time interval as the estimated number of events. In practice, the in-sample estimation of number of events with mark $c\times l$ over $[t_{1},t_{2}]$ can be calculated by $\int_{t_{1}}^{t_{2}}\int_{\mathcal{S}}\hat{\lambda}_{cl}(t,s)dsdt$ . We evaluate the in-sample estimation of event numbers during each week using our model STNPP, and compare its performance with four baselines, including two predictive time series models, one point process model, and an ablated variant of our model: (1) The persistence forecast (Persistent) that uses the event number in the previous week as the estimation; (2) Vector autoregression (VAR), which is a statistical model for analyzing and predicting multivariate time series data; (3) Epidemic-type aftershock sequence (ETAS) model (Ogata,, 1998) with a diffusion-type kernel using Euclidean distance; (4) The STNPP without GAT (STNPP-GAT). We slightly modify the ETAS model by incorporating a set of coefficients to account for the interactions between different event marks, since the original ETAS model cannot deal with multiple event types (see Appendix B for details). Fig 7 visualizes the in-sample estimations by different models on the number of each event type and the total events from 2015 to 2018, alongside the actual observed values. Our model effectively recovers both the overall temporal trend in total event numbers and the specific temporal patterns for event types that occurred either frequently or infrequently during the training period. Note that such heterogeneity in the event dynamics is simultaneously captured by a holistic model instead of fitting independent models for each type of event.

Table 1: Quantitative results of data fitting and in-sample estimation. Bold indicates the best performance.

Model	MAE (rare) ( $\downarrow$ )	MAE (frequent) ( $\downarrow$ )	MAE (total) ( $\downarrow$ )	Training log-likelihood ( $\uparrow$ )	AIC ( $\downarrow$ )
Persistent	0.998	5.736	31.538	/	/
VAR	0.906	3.680	21.940	/	/
ETAS	0.785	4.266	30.925	-2.476	45039.270
STNPP-GAT	0.728	3.875	21.561	-2.427	44173.386
STNPP	0.716	3.708	20.080	-2.413	44099.266

More quantitative results about the in-sample estimations are summarized in Table 1. To showcase our model’s versatility in handling different types of events with distinct underlying mechanisms, we present separate in-sample estimation MAE assessments for events characterized by frequent or rare occurrences of the marks. The frequent event marks include those with landmark categories of “financial,” “industrial,” and “restaurant,” corresponding to the crime-landmark categories with the top nine total observations. The remaining crime-landmark categories are treated as rare event marks. We report MAE (rare), MAE (frequent), and MAE (total) as the final metrics, representing the estimation MAEs averaged over rare, frequent, and all event marks. The results in Table 1 demonstrate the comparable or superior predictive performance of STNPP against baselines. Note that although VAR has performance metrics that are close to our method, it is a time series model for predicting the event numbers, and it is not designed for dealing with discrete spatio-temporal events or providing any insights on the underlying event dynamics.

Table 1 also reports the training log-likelihood and the model AIC for three spatio-temporal point processes (we omit the comparison with AIC of VAR, which is not meaningful). The highest training log-likelihood and the lowest model AIC show that STNPP enjoys the best goodness-of-fit to the data. Besides, the improved performance from ETAS to STNPP-GAT highlights the advantages of using street-network distance, and the performance gain of STNPP against STNPP-GAT emphasizes the benefits of incorporating nodal (mark) features in capturing complex event dynamics.

6.3 Out-of-sample predictions on testing data

Table 2: Quantitative results of out-of-sample estimation. Bold indicates the best performance.

Model	MAE (rare) ( $\downarrow$ )	MAE (frequent) ( $\downarrow$ )	MAE (total) ( $\downarrow$ )	Testing log-likelihood ( $\uparrow$ )
Persistent	1.006	5.803	28.808	/
VAR	0.998	5.502	27.507	/
ETAS	0.879	4.786	28.715	-2.223
STNPP-GAT	0.769	4.329	26.302	-2.201
STNPP	0.773	4.223	21.788	-2.183

The model’s predictive power can be assessed by the out-of-sample prediction task on the testing data set. We perform a one-week-ahead prediction of the number of events over the time window of 2019. Specifically, at a given time $t^{*}$ in 2019, we feed the data before $t^{*}$ into the learned model and evaluate the conditional intensity function over the next week. The predicted number of events with mark $c\times l$ in the following week can be estimated by the integral of the evaluated intensity function $\hat{\lambda}_{cl}$ over time and space, similarly as those in the in-sample predictions. The MAE between the predicted event numbers and the number of true observations are computed to indicate the model’s predictive performance. We perform the out-of-sample prediction on a weekly basis over the year of 2019 and report the average prediction MAE. Table 2 presents the average MAEs of the predictions for the number of rare events, frequent events, and total events by our model STNPP and four baselines, indicating the superior performance of our model against other baselines on predicting the future. Besides the MAE, we also compare the fitted log-likelihood of the testing data using different point process models and report them in the table. The highest log-likelihood of STNPP showcases the best generalization ability of our model to the unseen data.

We visualize the predictive power of STNPP in Fig 8 by showing the predicted conditional intensity function for three types of crimes over the street network at different times in 2019. Each panel compares the predicted conditional intensity of one type of crime over space given the observed history with the true distribution of that type of crime in the next two days. As we can observe, the predicted event intensity by our model is consistent with the true distribution of future events, showing a higher intensity in those areas with a higher likelihood of observing crime events. Meanwhile, our model discerns the spatial patterns of different crimes by learning from the historical data, such as the risk for Assault victims in major busy areas (e.g., the financial zone in the north part of the city) and a more regional, concentrated pattern for Subtraction and Others.

6.4 Learned coefficients of mark interactions

The coefficients $\{\alpha_{cl,c^{\prime}l^{\prime}}\}_{c,c^{\prime}\in\mathscr{C},l,l^{\prime}\in\mathscr{L}}$ learned by GAT capture the direction and magnitude of the influence between different event marks and is crucial in interpreting the model in practice. We visualize the learned coefficients by our model STNPP in Fig 9(a) by stacking them together into a matrix. Each matrix entry represents the coefficient that models the triggering effect from the event mark at the corresponding column to the mark at the corresponding row. As the matrix can be regarded as the weighted adjacency matrix of the mark network we established in Section 3.3, we adopt the Louvain algorithm (Blondel et al.,, 2008) to perform community detection on the event marks. The detected communities tell us the groups of marks that are more closely connected, which are indicated by the square frames with red dashed lines in the visualized matrix. Five communities are detected based on the coefficients, suggesting different types of human daily activities. For instance, the largest community with six marks, including Assault and Subtraction in industrial, market, and nightclub zones, showcases the clustering patterns of certain crime events related to citizen activities after hours, such as grocery shopping or night amusement. Other communities also reveal criminal activities that are relevant to specific urban facilities, such as restaurants (the first community) and industrial zones (the last community).

We also visualize the mark network $A$ established from the learned coefficients in Figure 9(c), with nodes representing the event marks and edges indicating their interactions. The colors of the nodes suggest the detected communities of different marks. To demonstrate the benefits of adopting GAT to learn the coefficient and their community structure, we compare the learned coefficients by the ablated model STNPP-GAT in Fig 9(b)(d) with detected communities. Although we have no ground truth to validate the community detection results, we here report the modularity (Newman,, 2010) of the mark networks. Networks with higher modularity have stronger intra-community connections and fewer inter-community connections. The modularity of the learned mark network by STNPP is much higher than the one learned by STNPP-GAT, as reported in Fig 9(c)(d). These visualizations also reveal a more distinct community structure in the network learned by STNPP, in contrast to the one of STNPP-GAT, which has more blurred community divisions. This high modularity of the mark network is beneficial for the decision-making of the local police department. For example, the detected communities highlight those closely connected marks and help identify influential crime events within specific communities. These insights can lead to more targeted and effective police patrolling against criminal activities.

To identify the most influential event marks, we plot the expected number of events that are triggered by an observed event with each mark in Fig 10. The number of triggered events by one event with mark $c^{\prime}\times l^{\prime}$ is calculated by aggregating the coefficient $\alpha_{cl,c^{\prime}l^{\prime}}$ over the index $c$ and $l$ , i.e., $\sum_{c\in\mathscr{C},l\in\mathscr{L}}\alpha_{cl,c^{\prime}l^{\prime}}$ , and a larger number indicates a stronger influence by an observed event with mark $c^{\prime}\times l^{\prime}$ . As we can see, crime events with marks of Assault $\times$ taxi have the strongest influence on the future by triggering the most number of events. Although this event mark is barely observed during the five-year period, its impact on subsequent event occurrences is not negligible. Other influential event marks include those related to police zones or assault activities, indicating the heterogeneity of the event dynamics across different crime types and urban areas.

6.5 Important event marks

Another important task for local practitioners in implementing effective prevention strategies is to identify particular types of crime events that can lead to an obvious risk increase in the community’s exposure to the crimes. These events not only include those that trigger the subsequent event occurrences to a large extent, such as Assault $\times$ taxi, but also include those event types that have a smaller influence magnitude on future events but are frequently reported, such as Assault $\times$ restaurant.

To this end, we investigate the contribution of each event mark $c^{\prime}\times l^{\prime}$ to the underlying event generation mechanisms by neutralizing its influence (i.e., set $\alpha_{cl,c^{\prime}l^{\prime}}=0,\forall c\in\mathscr{C},l\in\mathscr{L}$ ) and then evaluating the performance gap between the reduced model and the original model. A larger performance gap indicates a higher importance of that event mark in the effectiveness of the model. We evaluate the performance of each reduced model with influence from one type of event mark neutralized (a total of 21 reduced models) in terms of three metrics: AIC on training data, negative log-likelihood on testing data, and out-of-sample prediction MAE, and compare them with the metrics obtained by the full model. The corresponding results are presented in Fig 11. For the top two marks that have the most expected number of triggered events shown in Fig 10, the neutralization of the influence of Assault $\times$ taxi leads to obvious performance degradation, while the other one of Others $\times$ police have a much smaller impact, due to the scarce observation of this event mark during the investigation period. Other event marks, to which the neutralization of the influence can significantly decline the model performance, include those of Assault $\times$ financial, Assault $\times$ industrial, Assault $\times$ restaurant, Subtraction $\times$ restaurant, and so on. Crime events with these marks relate to those places and the daily activities of citizens who are more vulnerable to criminals. These events are more frequently observed than others, and their cumulative effects on attracting future criminal activities need to be paid attention to.

7 Discussion

We have presented a new spatio-temporal-network point process developed to model crime events within Valencia, Spain. This model is built on the city’s street network for spatial analysis, mirroring the real-world context where urban crimes mainly occur along streets. The introduction of a spatial kernel that measures distance across the street network respects the intrinsic network nature of urban areas, capturing the contagion effect of crime more accurately and realistically compared to traditional point process models with kernels that rely on Euclidean distance. The integration of urban environmental factors such as nearby facilities and land use into our analysis adds another layer of depth. By partitioning the city into different functional zones and creating new event marks with corresponding crime and zone categories, our model allows us to explore how specific urban environments foster particular types of crime. The adoption of a graph attention neural network architecture improves the learning of the complex interactions between various event marks, which also enables the identification of those important crime types in different environmental contexts, leading to insights that could inform targeted interventions. The numerical results on the real crime data in Valencia demonstrate the superior performance of our model against common baselines in forecasting the numbers of crime events and their distributions. The results not only prove the effectiveness of our model in actual practice but also underscore the importance of models tailored to crime modeling in specific urban contexts.

Several avenues exist for further enhancement of our model. Considering a directed street network could offer additional insights, particularly in scenarios where the movement direction of perpetrators (such as those in vehicles) plays a role in crime execution. A more rigorous statistical analysis of the significance of learned mark-to-mark interactions would enhance the robustness of our findings, potentially revealing more intricate patterns that could be pivotal for law enforcement and urban planning strategies. By addressing these areas, we aim to refine our understanding of urban crime dynamics further, thus not only contributing to academic discourse but also providing a practical framework for enhancing public safety and security in urban settings.

Acknowledgement

This work is partially supported by an NSF CAREER CCF-1650913, NSF DMS-2134037, CMMI-2015787, CMMI-2112533, DMS-GR00023160, DMS-1938106, DMS-1830210, ONR N000142412278, and the Coca-Cola Foundation.

References

Akaike, (1974) Akaike, H. (1974). A new look at the statistical model identification. IEEE transactions on automatic control, 19(6):716–723.
Akaike, (1998) Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In Selected papers of hirotugu akaike, pages 199–213. Springer.
Andresen, (2007) Andresen, M. A. (2007). Location quotients, ambient populations, and the spatial analysis of crime in vancouver, canada. Environment and Planning A, 39(10):2423–2444.
Blondel et al., (2008) Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008.
Bonta et al., (1998) Bonta, J., Law, M., and Hanson, K. (1998). The prediction of criminal and violent recidivism among mentally disordered offenders: a meta-analysis. Psychological bulletin, 123(2):123.
Bounce, (2024) Bounce (2024). Is valencia safe to visit? a comprehensive safety guide.
Bowers et al., (2004) Bowers, K. J., Johnson, S. D., and Pease, K. (2004). Prospective hot-spotting: the future of crime mapping? British journal of criminology, 44(5):641–658.
Browning et al., (2010) Browning, C. R., Byron, R. A., Calder, C. A., Krivo, L. J., Kwan, M.-P., Lee, J.-Y., and Peterson, R. D. (2010). Commercial density, residential concentration, and crime: Land use patterns and violence in neighborhood context. Journal of Research in Crime and Delinquency, 47(3):329–357.
Cai et al., (2024) Cai, B., Zhang, J., and Guan, Y. (2024). Latent network structure learning from high-dimensional multivariate point processes. Journal of the American Statistical Association, 119(545):95–108.
Chainey et al., (2008) Chainey, S., Tompson, L., and Uhlig, S. (2008). The utility of hotspot mapping for predicting spatial patterns of crime. Security journal, 21:4–28.
Cho et al., (2013) Cho, Y.-S., Galstyan, A., Brantingham, P. J., and Tita, G. (2013). Latent self-exciting point process model for spatial-temporal networks. arXiv preprint arXiv:1302.2671.
Daley et al., (2003) Daley, D. J., Vere-Jones, D., et al. (2003). An introduction to the theory of point processes: volume I: elementary theory and methods. Springer.
Dong et al., (2022) Dong, Z., Cheng, X., and Xie, Y. (2022). Spatio-temporal point processes with deep non-stationary kernels. arXiv preprint arXiv:2211.11179.
(14) Dong, Z., Repasky, M., Cheng, X., and Xie, Y. (2023a). Deep graph kernel point processes. arXiv preprint arXiv:2306.11313.
Dong and Xie, (2024) Dong, Z. and Xie, Y. (2024). Atlanta gun violence modeling via nonstationary spatio-temporal point processes. arXiv preprint arXiv:2408.09258.
(16) Dong, Z., Zhu, S., Xie, Y., Mateu, J., and Rodríguez-Cortés, F. J. (2023b). Non-stationary spatio-temporal point process modeling for high-resolution covid-19 data. Journal of the Royal Statistical Society Series C: Applied Statistics, 72(2):368–386.
Du et al., (2016) Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M., and Song, L. (2016). Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data mining, pages 1555–1564.
D’Angelo et al., (2024) D’Angelo, N., Payares, D., Adelfio, G., and Mateu, J. (2024). Self-exciting point process modelling of crimes on linear networks. Statistical Modelling, 24(2):139–168.
Fang et al., (2023) Fang, G., Xu, G., Xu, H., Zhu, X., and Guan, Y. (2023). Group network hawkes process. Journal of the American Statistical Association, pages 1–17.
Fleming et al., (1994) Fleming, Z., Brantingham, P., Brantingham, P., et al. (1994). Exploring auto theft in british columbia. Crime prevention studies, 3:47–90.
Gao et al., (2017) Gao, S., Janowicz, K., and Couclelis, H. (2017). Extracting urban functional regions from points of interest and human activities on location-based social networks. Transactions in GIS, 21(3):446–467.
Gottfredson, (1981) Gottfredson, M. R. (1981). On the etiology of criminal victimization. J. Crim. L. & Criminology, 72:714.
Grann et al., (1999) Grann, M., Långström, N., Tengström, A., and Kullgren, G. (1999). Psychopathy (PCL-R) predicts violent recidivism among criminal offenders with personality disorders in sweden. Law and human behavior, 23:205–217.
Groff, (2011) Groff, E. (2011). Exploring ‘near’: Characterizing the spatial extent of drinking place influence on crime. Australian & New Zealand Journal of Criminology, 44(2):156–179.
Hawkes, (1971) Hawkes, A. G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90.
Hu and Han, (2019) Hu, Y. and Han, Y. (2019). Identification of urban functional areas based on poi data: A case study of the guangzhou economic and technological development zone. Sustainability, 11(5):1385.
Jiang et al., (2015) Jiang, S., Alves, A., Rodrigues, F., Ferreira Jr, J., and Pereira, F. C. (2015). Mining point-of-interest data from social networks for urban land use classification and disaggregation. Computers, Environment and Urban Systems, 53:36–46.
Johnson, (2008) Johnson, S. D. (2008). Repeat burglary victimisation: a tale of two theories. Journal of Experimental Criminology, 4:215–240.
Johnson and Bowers, (2010) Johnson, S. D. and Bowers, K. J. (2010). Permeability and burglary risk: Are cul-de-sacs safer? Journal of Quantitative Criminology, 26:89–111.
Kennedy et al., (2011) Kennedy, L. W., Caplan, J. M., and Piza, E. (2011). Risk clusters, hotspots, and spatial intelligence: risk terrain modeling as an algorithm for police resource allocation strategies. Journal of quantitative criminology, 27:339–362.
Kennedy et al., (2016) Kennedy, L. W., Caplan, J. M., Piza, E. L., and Buccine-Schraeder, H. (2016). Vulnerability and exposure to crime: Applying risk terrain modeling to the study of assault in chicago. Applied Spatial Analysis and Policy, 9:529–548.
Kinney et al., (2008) Kinney, J. B., Brantingham, P. L., Wuschke, K., Kirk, M. G., and Brantingham, P. J. (2008). Crime attractors, generators and detractors: Land use and urban crime opportunities. Built environment, 34(1):62–74.
Kumazawa and Ogata, (2014) Kumazawa, T. and Ogata, Y. (2014). Nonstationary ETAS models for nonstandard earthquakes. The Annals of Applied Statistics, 8(3):1825 – 1852.
Lev-Wiesel et al., (2004) Lev-Wiesel, R., Amir, M., and Besser, A. (2004). Posttraumatic growth among female survivors of childhood sexual abuse in relation to the perpetrator identity. Journal of Loss and Trauma, 10(1):7–17.
Levine and CrimeStat, (2002) Levine, N. and CrimeStat, I. (2002). A spatial statistics program for the analysis of crime incident locations. Ned Levine and Associates, Houston, TX, and the National Institute of Justice, Washington, DC.
Lewis et al., (2012) Lewis, E., Mohler, G., Brantingham, P. J., and Bertozzi, A. L. (2012). Self-exciting point process models of civilian deaths in Iraq. Security Journal, 25:244–264.
Linderman and Adams, (2014) Linderman, S. and Adams, R. (2014). Discovering latent network structure in point process data. In International Conference on Machine Learning, pages 1413–1421. PMLR.
Liu et al., (2021) Liu, X., Carter, J., Ray, B., and Mohler, G. (2021). Point process modeling of drug overdoses with heterogeneous and missing data. The Annals of Applied Statistics, 15(1):88 – 101.
Loeffler and Flaxman, (2018) Loeffler, C. and Flaxman, S. (2018). Is gun violence contagious? a spatiotemporal test. Journal of quantitative criminology, 34:999–1017.
Long et al., (2015) Long, Y., Shen, Z., Long, Y., and Shen, Z. (2015). Discovering functional zones using bus smart card data and points of interest in beijing. Geospatial analysis to support urban planning in Beijing, pages 193–217.
Maas et al., (2013) Maas, A. L., Hannun, A. Y., Ng, A. Y., et al. (2013). Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3. Atlanta, GA.
Meera and Jayakumar, (1995) Meera, A. K. and Jayakumar, M. D. (1995). Determinants of crime in a developing country: a regression model. Applied Economics, 27(5):455–460.
Mei and Eisner, (2017) Mei, H. and Eisner, J. M. (2017). The neural hawkes process: A neurally self-modulating multivariate point process. Advances in Neural Information Processing Systems, 30.
Mohler, (2013) Mohler, G. (2013). Modeling and estimation of multi-source clustering in crime and security data. The Annals of Applied Statistics, pages 1525–1539.
Mohler, (2014) Mohler, G. (2014). Marked point process hotspot maps for homicide and gun crime prediction in chicago. International Journal of Forecasting, 30(3):491–497.
Mohler et al., (2011) Mohler, G. O., Short, M. B., Brantingham, P. J., Schoenberg, F. P., and Tita, G. E. (2011). Self-exciting point process modeling of crime. Journal of the American Statistical Association, 106(493):100–108.
Moller and Waagepetersen, (2003) Moller, J. and Waagepetersen, R. P. (2003). Statistical inference and simulation for spatial point processes. CRC press.
Neill and Gorr, (2007) Neill, D. B. and Gorr, W. L. (2007). Detecting and preventing emerging epidemics of crime. Advances in Disease Surveillance, 4(13).
Newman, (2010) Newman, M. (2010). Networks: An Introduction. Oxford University Press.
Ogata, (1988) Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association, 83(401):9–27.
Ogata, (1998) Ogata, Y. (1998). Space-time point-process models for earthquake occurrences. Annals of the Institute of Statistical Mathematics, 50:379–402.
OpenStreetMap contributors, (2017) OpenStreetMap contributors (2017). Planet dump retrieved from https://planet.osm.org . https://www.openstreetmap.org.
Perry, (2013) Perry, W. L. (2013). Predictive policing: The role of crime forecasting in law enforcement operations. Rand Corporation.
Porter and White, (2012) Porter, M. D. and White, G. (2012). Self-exciting hurdle models for terrorist activity.
Reinhart, (2018) Reinhart, A. (2018). A review of self-exciting spatio-temporal point processes and their applications. Statistical Science, 33(3):299–318.
Reinhart and Greenhouse, (2018) Reinhart, A. and Greenhouse, J. (2018). Self-exciting point processes with spatial covariates: modelling the dynamics of crime. Journal of the Royal Statistical Society Series C: Applied Statistics, 67(5):1305–1329.
Robbins and Monro, (1951) Robbins, H. and Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, pages 400–407.
Rossmo, (1999) Rossmo, D. K. (1999). Geographic profiling. CRC press.
Rumelhart et al., (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. nature, 323(6088):533–536.
Russo et al., (2013) Russo, S., Roccato, M., and Vieno, A. (2013). Criminal victimization and crime risk perception: A multilevel longitudinal study. Social Indicators Research, 112:535–548.
Sanna Passino et al., (2024) Sanna Passino, F., Che, Y., and Cardoso Correia Perello, C. (2024). Graph-based mutually exciting point processes for modelling event times in docked bike-sharing systems. Stat, 13(1):e660.
Shchur et al., (2021) Shchur, O., Türkmen, A. C., Januschowski, T., and Günnemann, S. (2021). Neural temporal point processes: A review. arXiv preprint arXiv:2104.03528.
Short et al., (2008) Short, M. B., D’orsogna, M. R., Pasour, V. B., Tita, G. E., Brantingham, P. J., Bertozzi, A. L., and Chayes, L. B. (2008). A statistical model of criminal behavior. Mathematical Models and Methods in Applied Sciences, 18(supp01):1249–1267.
Stucky and Ottensmann, (2009) Stucky, T. D. and Ottensmann, J. R. (2009). Land use and violent crime. Criminology, 47(4):1223–1264.
Tarzia et al., (2018) Tarzia, L., Thuraisingam, S., Novy, K., Valpied, J., Quake, R., and Hegarty, K. (2018). Exploring the relationships between sexual violence, mental health and perpetrator identity: a cross-sectional australian primary care study. BMC public health, 18:1–9.
Veen and Schoenberg, (2008) Veen, A. and Schoenberg, F. P. (2008). Estimation of space–time branching process models in seismology using an em–type algorithm. Journal of the American Statistical Association, 103(482):614–624.
Veličković et al., (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph attention networks. In International Conference on Learning Representations.
Wang and Brown, (2012) Wang, X. and Brown, D. E. (2012). The spatio-temporal modeling for criminal incidents. Security Informatics, 1:1–17.
Wei et al., (2020) Wei, N., Walteros, J. L., and Batta, R. (2020). On the distance between random events on a network. Networks, 75(2):203–231.
Weisburd et al., (2012) Weisburd, D., Groff, E. R., and Yang, S.-M. (2012). The criminology of place: Street segments and our understanding of the crime problem. Oxford University Press.
Wu et al., (2020) Wu, W., Liu, H., Zhang, X., Liu, Y., and Zha, H. (2020). Modeling event propagation via graph biased temporal point process. IEEE Transactions on Neural Networks and Learning Systems.
Wuschke and Kinney, (2018) Wuschke, K. and Kinney, J. B. (2018). 475Built Environment, Land Use, and Crime. In The Oxford Handbook of Environmental Criminology. Oxford University Press.
Xia et al., (2022) Xia, W., Li, Y., and Li, S. (2022). Graph neural point process for temporal interaction prediction. IEEE Transactions on Knowledge and Data Engineering.
Xu and Griffiths, (2017) Xu, J. and Griffiths, E. (2017). Shooting on the street: measuring the spatial influence of physical features on gun violence in a bounded street network. Journal of quantitative criminology, 33:237–253.
Yuan et al., (2019) Yuan, B., Li, H., Bertozzi, A. L., Brantingham, P. J., and Porter, M. A. (2019). Multivariate spatiotemporal hawkes processes and network reconstruction. SIAM Journal on Mathematics of Data Science, 1(2):356–382.
Yuan et al., (2014) Yuan, N. J., Zheng, Y., Xie, X., Wang, Y., Zheng, K., and Xiong, H. (2014). Discovering urban functional zones using latent activity trajectories. IEEE Transactions on Knowledge and Data Engineering, 27(3):712–725.
(77) Zhu, S., Li, S., Peng, Z., and Xie, Y. (2021a). Imitation learning of neural spatio-temporal point processes. IEEE Transactions on Knowledge and Data Engineering, 34(11):5391–5402.
(78) Zhu, S., Wang, H., Dong, Z., Cheng, X., and Xie, Y. (2021b). Neural spectral marked point processes. In International Conference on Learning Representations.
Zhu and Xie, (2022) Zhu, S. and Xie, Y. (2022). Spatiotemporal-textual point processes for crime linkage detection. The Annals of Applied Statistics, 16(2):1151–1170.
(80) Zhu, S., Zhang, M., Ding, R., and Xie, Y. (2021c). Deep fourier kernel for self-attentive point processes. In International Conference on Artificial Intelligence and Statistics, pages 856–864. PMLR.
Zhuang and Mateu, (2019) Zhuang, J. and Mateu, J. (2019). A semiparametric spatiotemporal hawkes-type point process model with periodic background for crime data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 182(3):919–942.
Zipkin et al., (2014) Zipkin, J. R., Short, M. B., and Bertozzi, A. L. (2014). Cops on the dots in a mathematical model of urban crime and police response. Discrete and Continuous Dynamical Systems-B, 19(5):1479–1506.
Zuo et al., (2020) Zuo, S., Jiang, H., Li, Z., Zhao, T., and Zha, H. (2020). Transformer hawkes process. In International Conference on Machine Learning, pages 11692–11702. PMLR.

Appendix A Derivation of model log-likelihood

The log-likelihood of observing a total number of $n$ events within $[0,T]\times\mathcal{S}$ can be derived in two steps: (1) For any $1\leq i\leq n$ , compute the conditional probability density function of the $(i+1)$ -th event given the previous $i$ events; (2) Use probability chain rule to get the final likelihood by multiplying $n$ conditional probability densities together. Without loss of generality, we showcase below the derivation of the $(i+1)$ -th conditional probability density. For any $t\in(t_{i},t_{i+1}]$ , we let $F_{cl}(t)=\mathbb{P}(t_{i+1}<t,c_{i+1}=c,l_{i+1}=l|\mathcal{H}_{t_{i}}\cup\{(t_{i},s_{i},c_{i}\times l_{i})\})$ be the cumulative probability function for the next event happened before time $t$ with mark $c\times l$ . We also denote $f_{cl}(t,s)\triangleq f_{cl}(t,s|\mathcal{H}_{t_{i}}\cup\{(t_{i},s_{i},c_{i}\times l_{i})\})$ to be the corresponding conditional probability density function for the next event with mark $c\times l$ at time $t$ and location $s$ , i.e., $F_{cl}(t)=\int_{t_{i}}^{t}\int_{\mathcal{S}}f_{cl}(t,s)dsdt$ . By summing over all the marks, we can define $F(t)\triangleq\sum_{c\in\mathscr{C},l\in\mathscr{L}}F_{cl}(t)=\mathbb{P}(t_{i+1}<t|\mathcal{H}_{t_{i}}\cup\{(t_{i},s_{i},c_{i}\times l_{i})\})$ , $f(t,s)\triangleq\sum_{c\in\mathscr{C},l\in\mathscr{L}}f_{cl}(t,s)$ , and $\lambda(t,s)\triangleq\sum_{c\in\mathscr{C},l\in\mathscr{L}}\lambda_{cl}(t,s)$ . Then, if we denote $\Omega=[t,t+dt)\times B(s,\Delta s)$ to be a small neighborhood around $(t,s)$ , the conditional intensity $\lambda(t,s)$ can be expressed as

$\displaystyle\lambda(t,s)\|B(s,\Delta s)\|dt$	$\displaystyle=\mathbb{P}\{(t_{i+1},s_{i+1})\in\Omega\|\mathcal{H}_{t}\}$	(A1)
	$\displaystyle=\mathbb{P}\{(t_{i+1},s_{i+1})\in\Omega\|\mathcal{H}_{t_{i}}\cup\{(t_{i},s_{i},c_{i}\times l_{i})\}\cup\{t_{i+1}\geq t\}\}$
	$\displaystyle=\frac{\mathbb{P}\{(t_{i+1},s_{i+1})\in\Omega,t_{n+1}\geq t\|\mathcal{H}_{t_{i}}\cup\{(t_{i},s_{i},c_{i}\times l_{i})\}\}}{\mathbb{P}\{t_{i+1}\geq t\|\mathcal{H}_{t_{i}}\cup\{(t_{i},s_{i},c_{i}\times l_{i})\}\}}$
	$\displaystyle=\frac{f(t,s)\|B(s,\Delta s)\|dt}{1-F(t)}$

If integrating over $s$ we can have

\displaystyle dt\cdot\int_{\mathcal{S}}\lambda(t,s)ds=\frac{dt\cdot\int_{\mathcal{S}}f(t,s)ds}{1-F(t)}=\frac{dF(t)}{1-F(t)}=-d\log{(1-F(t))}.

Replace $t$ with $\tau$ and integrate $\tau$ over $(t_{i},t)$ leads to $F(t)=1-\exp(-\int_{t_{i}}^{t}\int_{\mathcal{S}}\lambda(\tau,u)dud\tau)$ because $F(t_{i})=0$ . Then we have

f(t,s)=\lambda(t,s)\cdot\exp\left(-\int_{t_{n}}^{t}\int_{\mathcal{S}}\lambda(\tau,u)dud\tau\right)\ .

Since $f_{cl}(t,s)$ is proportional to $\lambda_{cl}(t,s)$ , we have

\displaystyle f_{cl}(t,s)

\displaystyle=f(t,s)\cdot\frac{\lambda_{cl}(t,s)}{\lambda(t,s)}=\lambda_{cl}(t,s)\cdot\exp\left(-\int_{t_{n}}^{t}\int_{\mathcal{S}}\lambda(\tau,u)dud\tau\right)\ .

The log-likelihood for observing the entire event sequence can be computed via the chain rule as

	$\displaystyle L(\theta)$	$\displaystyle=\log L(\{(t_{i},s_{i},c_{i}\times l_{i})\}_{i=1}^{n})=\log\left(\prod_{i=1}^{n}f_{c_{i}l_{i}}(t_{i},s_{i})\right)$
		$\displaystyle=\sum_{i=1}^{n}\log\lambda_{c_{i}l_{i}}\left(t_{i},s_{i}\right)-\sum_{c\in\mathscr{C},l\in\mathscr{L}}\int_{0}^{T}\int_{\mathcal{S}}\lambda_{cl}(t,s)dsdt,$

which leads to the results in (5).

Appendix B Experiment details and additional results

B.1 Choice of the number of training subsequences

We assess the efficiency and performance of the training procedure with different numbers of event subsequences $J$ in the training set. We choose $J$ from the value set $\{6,8,10,12,14,16,18,20,24\}$ . The training time per epoch (i.e., computation time for the entire training set) and the fitted model’s log-likelihood on the training set are reported for each $J$ in Figure B1.

As observed, partitioning the entire sequence into fewer subsequences (i.e., longer time window for each subsequence) allows the model to better fit the data. Note that a longer time window for subsequences means more preservation of the dependencies among events. This preservation is crucial for achieving a good fit, as it ensures that dependent events are analyzed within the same context. However, a longer length of each subsequence demands higher computational complexity for the log-likelihood function in (5), thus reducing the model training efficiency. On the other hand, increasing the number of subsequences by dividing the sequence into many shorter segments reduces the complexity of evaluating the log-likelihood and enhances computational efficiency. Nonetheless, this method risks severing the crucial dependencies between events, as it may arbitrarily split more dependent events into separate subsequences. Such fragmentation can result in an underfitting of the data, failing to capture essential patterns of dependencies among crime events.

We select the number of subsequences to be $12$ as this choice can strike a balance between the computational efficiency and the model’s goodness-of-fit to the data. This selection is justified by the two performance metrics in Figure B1, which achieve a significant improvement in learning efficiency while maintaining the model fitting performance.

B.2 Baseline descriptions

We compare our model with the following four baselines:

•

The persistence forecast (Persistent) is a simple and straightforward forecasting technique where the future value is predicted to be the same as the most recent observed value. In our experiments, the number of events with a specified mark in the next week $t+1$ is predicted as the number of events with the same mark observed in the current week $t$ .
•

The Vector Autoregression (VAR) is a statistical model used to capture the linear dependencies among multiple time series. VAR generalizes the univariate autoregressive model (AR) by modeling each variable in the system as a linear combination of past values of itself and past values of all the other variables in the system. Specifically, denoting the variable vector as $y\in\mathbb{R}^{d}$ and its value at time $t$ as $y_{t}$ , the linear relationship between future values and past values is expressed as

$y_{t}=C+\sum_{i=1}^{p}A_{i}y_{t-i}+\epsilon_{t}.$

Here $C\in\mathbb{R}^{d}$ is a constant vector, $A_{i}$ are coefficient matrices, and $\epsilon_{t}$ is a white noise vector.

•

The Epidemic-type aftershock sequence (ETAS) model is a benchmark point process model for modeling spatio-temporal discrete event data. The original ETAS only models the time and location of the event without considering the event type. Here, we slightly modify the original model by incorporating a set of coefficients to account for the interactions between different event marks. Specifically, the influence kernel takes the form of a diffusion-type kernel as

\displaystyle k(t^{\prime},t,s^{\prime},s,c^{\prime}\times l^{\prime},c\times l)=\frac{\eta_{cl,c^{\prime}l^{\prime}}e^{-\beta(t-t^{\prime})}}{2\pi\sqrt{|\Sigma|}(t-t^{\prime})}\cdot\exp{\left\{-\frac{(s-s^{\prime})^{\top}\Sigma^{-1}(s-s^{\prime})}{2(t-t^{\prime})}\right\}}.

Here $\mu\geq 0$ is the base event intensity, $\Sigma=\mathrm{diag}(\sigma_{x}^{2},\sigma_{y}^{2})$ is a two-dimensional diagonal matrix representing the covariance of the spatial correlation, $\beta>0$ is the decaying rate, and $\eta_{cl,c^{\prime}l^{\prime}}>0$ controls the magnitude of the influence from past events. We use the same estimation strategy for the base intensity $\mu$ and estimate other parameters $\{\sigma_{x},\sigma_{y},\beta,\eta_{cl,c^{\prime}l^{\prime}}\}$ using the same SGD with regard to model likelihood.

•

The STNPP without GAT (STNPP-GAT) is an ablated variant of our model where we remove the GAT architecture and directly estimate the coefficients $\{\alpha_{cl,c^{\prime}l^{\prime}}\}_{c,c^{\prime}\in\mathscr{C},l,l^{\prime}\in\mathscr{L}}$ using SGD. The goal of comparing our model to STNPP-GAT is to showcase how the integration of the GAT architecture enhances our ability to discern the intricate patterns of mark interactions. This improvement facilitates the identification of closely related marks and yields more precise predictions.

$\displaystyle\lambda(t,s)\|B(s,\Delta s)\|dt$	$\displaystyle=\mathbb{P}\{(t_{i+1},s_{i+1})\in\Omega\|\mathcal{H}_{t}\}$	(A1)
	$\displaystyle=\mathbb{P}\{(t_{i+1},s_{i+1})\in\Omega\|\mathcal{H}_{t_{i}}\cup\{(t_{i},s_{i},c_{i}\times l_{i})\}\cup\{t_{i+1}\geq t\}\}$
	$\displaystyle=\frac{\mathbb{P}\{(t_{i+1},s_{i+1})\in\Omega,t_{n+1}\geq t\|\mathcal{H}_{t_{i}}\cup\{(t_{i},s_{i},c_{i}\times l_{i})\}\}}{\mathbb{P}\{t_{i+1}\geq t\|\mathcal{H}_{t_{i}}\cup\{(t_{i},s_{i},c_{i}\times l_{i})\}\}}$
	$\displaystyle=\frac{f(t,s)\|B(s,\Delta s)\|dt}{1-F(t)}$

Spatio-Temporal-Network Point Processes for Modeling Crime Events with Landmarks