

A Graphical Point Process Framework for Understanding Removal Effects in Multi-Touch Attribution

Jun Tao1,2, Qian Chen1, James W. Snyder Jr.2, Arava Sai Kumar2, Amirhossein Meisami2, Lingzhou Xue1

1The Pennsylvania State University; 2Adobe Inc.
(First Version: May 2022; This Version: February 2023)
Abstract

Marketers employ various online advertising channels to reach customers, and they are particularly interested in attribution: measuring the degree to which individual touchpoints contribute to an eventual conversion. The availability of individual customer-level path-to-purchase data and the increasing number of online marketing channels and types of touchpoints bring new challenges to this fundamental problem. We aim to tackle the attribution problem with finer granularity by conducting attribution at the path level. To this end, we develop a novel graphical point process framework to study the direct conversion effects and the full relational structure among numerous types of touchpoints simultaneously. Utilizing the temporal point process of conversion and the graphical structure, we further propose graphical attribution methods to allocate proper path-level conversion credit, called the attribution score, to individual touchpoints or corresponding channels for each customer's path to purchase. Our proposed attribution methods consider the attribution score as the removal effect, and we use a rigorous probabilistic definition to derive two types of removal effects. We examine the performance of our proposed methods in extensive simulation studies and compare them with commonly used attribution models. We also demonstrate the performance of the proposed methods in a real-world attribution application.

Keywords: Granger Causality, Graphical Model, High Dimensional Statistics, Multi-Touch Attribution, Point Process.

1 Introduction

Attribution is a classic problem. The granularity of data plays an important role in studying this problem. Early work investigating synergies across channels employed aggregate data to extract insights for customer targeting and marketing budget allocation (e.g., Naik & Raman, 2003). With advances in online data collection, individual customer-level path-to-purchase data, describing when and how individual customers interact with various channels in their purchase funnels, has become available. Such data availability led attribution modeling into a new phase. Paths to purchase differ among customers, and so do the temporal distances between touchpoints. In Figure 1, we provide two illustrative examples of customer-level path-to-purchase data. As shown in Path 1, one customer received an email ad about the focal product at time $t_1$, and then this customer saw an ad about the product on social media at time $t_2$. Later, this customer searched for the product in a search engine, and its paid ad appeared at the top of the search result page at time $t_3$. This customer clicked on the product ad at time $t_4$ to visit the product's website and then purchased the product at time $t_5$. As shown in Path 2, another customer saw a search engine ad about the product at time $t_1'$ and then clicked on this ad at time $t_2'$ to visit the product's website and converted at time $t_3'$.

Figure 1: The illustrations of customer-level path-to-purchase data.

Researchers have started to utilize customer-level path-to-purchase data for attribution modeling (e.g., Xu et al., 2014; Li & Kannan, 2014; Anderl et al., 2016). In recent years, we have witnessed an increasing number of online marketing channels and types of touchpoints. The channels include various search engines (e.g., Google, Bing), social media platforms (e.g., LinkedIn, Twitter, Instagram, TikTok, Snapchat, WhatsApp), display ads, emails, web banners, app banners, customer support, desktop notifications, and many others. Meanwhile, there are many different types of touches within each search engine/platform/media. For example, ad impressions and ad clicks are different touchpoints that may have different conversion effects. To enable granular marketing, marketers are interested in evaluating the marketing performance of each search engine/platform/media and even the performance of a particular type of touchpoint. But in most existing research on attribution modeling, touchpoints on the path are typically aggregated to the channel they belong to (e.g., search, display, email); different search engines/platforms/media within the same channel are not differentiated, and neither are different types of touchpoints within the same search engine/platform/media. For example, both search ad impressions and clicks are considered customers' interactions with the search channel and are not studied separately; touchpoints via Instagram, Twitter, and other social media platforms are not differentiated but studied together as one social media channel. In addition, most existing studies consider only a few channels simultaneously.

Similar to recent work in attribution modeling, we also focus on the path-to-purchase data. But we aim to achieve finer granularity by conducting attribution at the path level and providing touchpoint-wise scores. Namely, we aim to allocate appropriate credit for the conversion to each touchpoint or a subset of touchpoints for each customer’s path to purchase. By doing this, we can provide attribution scores for both individual touchpoints and channels, which can help marketers make better marketing decisions. Also, we would like to study the touchpoints from more channels simultaneously, which better suits firms’ current needs.

Attributing proper credit to each touchpoint is very challenging, as touchpoints interplay with each other both within-channel and across-channel. For instance, seeing email and social ads may increase a customer's future probability of searching for the product in a search engine and then clicking the search ad to visit the firm's website and make an online purchase. Also, a touchpoint (e.g., search ad impression) may trigger the occurrence of a future touchpoint within the same channel (e.g., search ad click), leading to a conversion. The phenomenon that earlier touches may increase the probability of future touches and possible conversions is well discussed in Li & Kannan (2014) as the carryover effect (through the same channel) and the spillover effect (through other channels).

To properly capture both direct and indirect conversion effects of various touchpoints, we must consider the following characteristics in attribution modeling. Firstly, different types of touchpoints may vary significantly in their probability of occurring, of exciting other types of touchpoints, and of affecting purchase conversion. Modeling the multivariate nature of different types of touchpoints is necessary. Secondly, all marketing effects, including conversion effects and the interactive effects among touches, decay over time. It is essential to consider the time interval between two touches in attribution modeling. Solely utilizing the sequence information of the touches ignores such decaying effects, resulting in estimation biases. Thirdly, touchpoints and purchase conversions tend to gather together as clusters on the timeline. In other words, the path-to-purchase data are clumpy (Zhang et al., 2013). Fourthly, the attribution modeling and its estimation methods need to be scalable for analyzing a large number of marketing channels and types of touchpoints.

1.1 Related Literature

The attribution problem has been studied broadly by researchers from both academia and industry for over two decades. In this subsection, we provide a brief overview of existing approaches, and more details can be found in the survey papers (Kannan et al., 2016; Gaur & Bharti, 2020). Existing attribution models can be classified into two major categories: rule-based heuristics and data-driven approaches.

1.1.1 Rule-based Heuristics

Simple rule-based heuristics are widely used for multi-touch attribution in practice. For example, the last-touch attribution method assigns the credit solely to the touchpoint directly preceding the conversion; the first-touch attribution assigns the credit to the first touchpoint in the customer journey; the U-shaped method assigns an equal amount of credit to the first and the last touchpoints while evenly distributing the remaining credit amongst the other touchpoints. These rule-based heuristics can still be easily employed for the path-to-purchase data, but they either ignore the effects of other touchpoints or do not consider the interactive effects among touchpoints. They are also criticized for being biased and lacking rationale justifying their appropriateness as attribution measures (Singal et al., 2022).
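As a concrete illustration, these three rules can be written in a few lines of Python. This is a hypothetical sketch: the touchpoint labels are made up, and the 40% end weight in the U-shaped rule is one common choice rather than part of the rule itself, which only requires equal first/last credit.

```python
def last_touch(path):
    """All credit to the touchpoint directly preceding the conversion."""
    return [0.0] * (len(path) - 1) + [1.0]

def first_touch(path):
    """All credit to the first touchpoint in the customer journey."""
    return [1.0] + [0.0] * (len(path) - 1)

def u_shaped(path, end_weight=0.4):
    """Equal credit to the first and last touchpoints (end_weight each, an
    illustrative choice), with the remainder split evenly among the others."""
    n = len(path)
    if n == 1:
        return [1.0]
    if n == 2:
        return [0.5, 0.5]
    mid = (1.0 - 2 * end_weight) / (n - 2)
    return [end_weight] + [mid] * (n - 2) + [end_weight]

path = ["email", "social", "search_impression", "search_click"]
print([round(c, 2) for c in u_shaped(path)])  # [0.4, 0.1, 0.1, 0.4]
```

Each rule returns credits summing to one for a single converting path, which is what makes the bias critique concrete: the scores depend only on position, never on timing or interaction.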

1.1.2 Data-driven Approaches

The data-driven approaches mainly consist of incremental value (or removal effect) approaches and Shapley value approaches. The main novelty of previous work in this category comes from the following two perspectives. The majority of research focuses on proposing new models to describe user behavior (e.g., Shao & Li, 2011; Breuer et al., 2011; Danaher & van Heerde, 2018; Xu et al., 2014; Zhao et al., 2019), while the rest focuses on studying the attribution scoring methods themselves, such as the justification and fairness of the Shapley value (e.g., Dalessandro et al., 2012; Singal et al., 2022).

  • The incremental value (or removal effect) approaches compute the change in the conversion probability when one touchpoint or a set of touchpoints is removed from a customer's path. As a result, this change in the conversion probability is also known as the removal effect. In the past decade, researchers have developed a variety of models to describe consumer behavior, such as regression models (e.g., Shao & Li, 2011; Breuer et al., 2011; Danaher & van Heerde, 2018; Zhao et al., 2019), Markov models (Yang & Ghose, 2010; Anderl et al., 2016; Berman, 2018; Kakalejčík et al., 2018), Bayesian models (e.g., Li & Kannan, 2014), time series models (Kireyev et al., 2016; De Haan et al., 2016), survival theory-based models (Zhang et al., 2014; Ji et al., 2016), deep learning models (Li et al., 2018; Kumar et al., 2020), and so on. The main novelty of previous work in this line comes from modeling user behavior. Most of these works treat the touchpoints as deterministic rather than stochastic events and ignore the dynamic interactions among these marketing communications and interventions (Xu et al., 2014). However, it is important to account for the exciting effects of these touchpoints. Also, as pointed out by Singal et al. (2022), there exists little (if any) theoretical justification for attribution based on the incremental value.

  • The Shapley value approaches apply the game-theoretic concept of the Shapley value (Shapley, 1953) for allocating credit to individual players in a cooperative game. Due to the nature of the Shapley value, it typically provides channel-level but not path-level or touchpoint-wise attribution scores. In addition, existing methods based on the Shapley value do not take into account the temporal distance between touchpoints in the path-to-purchase data (e.g., Dalessandro et al., 2012; De Haan et al., 2016; Kireyev et al., 2016; Berman, 2018; Singal et al., 2022). For example, the most recent work by Singal et al. (2022) used a discrete Markov chain model to describe the transitions in a customer's state along the journey through the conversion funnel, which does not incorporate the temporal distance when the customer moves from one state to another in a transition.

  • Attribution has also been investigated from other angles. For example, Xu et al. (2014) proposed a Bayesian method using a multivariate point process and calculated the attribution scores using simulations. However, this simulation-based attribution method is computationally expensive and unable to provide a path-level score for each observed path with a conversion.
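To make the Shapley allocation above concrete, the following minimal sketch computes exact Shapley values by averaging each channel's marginal contribution over all orderings. The coalition value function and its conversion probabilities are entirely hypothetical; note that, as the text observes, time plays no role here.

```python
from itertools import permutations

def shapley_values(channels, v):
    """Exact Shapley values: average each channel's marginal contribution
    over all orderings. v maps a frozenset of channels to a coalition value
    (e.g., the conversion probability when only those channels are active)."""
    phi = {c: 0.0 for c in channels}
    perms = list(permutations(channels))
    for order in perms:
        coalition = frozenset()
        for c in order:
            phi[c] += v(coalition | {c}) - v(coalition)
            coalition = coalition | {c}
    return {c: phi[c] / len(perms) for c in channels}

# Hypothetical conversion probabilities for each subset of channels.
probs = {frozenset(): 0.0,
         frozenset({"search"}): 0.10,
         frozenset({"email"}): 0.04,
         frozenset({"search", "email"}): 0.18}
vals = shapley_values(["search", "email"], probs.__getitem__)
print({c: round(x, 3) for c, x in vals.items()})  # {'search': 0.12, 'email': 0.06}
```

The scores sum to the grand-coalition value (0.18) regardless of the order in which channels appeared on any individual path, which is exactly why this construction yields channel-level rather than path-level credit.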

1.2 Our Approach

To the best of our knowledge, none of the existing models can study the full relational structure of numerous types of touches across multiple channels, which is essential to understanding both the direct and indirect conversion effects of each type of touch and the corresponding channels. Given the potentially large number of various types of touchpoints under study, a model that can simultaneously study the interactive and conversion effects of these many types of touchpoints is needed. To fill these research gaps, we make the following efforts in this work:

Firstly, we develop a novel graphical point process model for attribution to describe customer behavior in the multi-channel setting using customer-level path-to-purchase data. The graphical model not only learns the direct conversion effects of numerous types of touchpoints but also estimates exciting effects among different types of touches simultaneously.

Secondly, we propose graphical attribution methods to allocate proper conversion credit, called the attribution score, to individual touchpoints and the corresponding channels for each customer’s path to purchase. Our proposed methods consider the attribution score as the removal effect. We derive two versions of the removal effect using the temporal point process of conversion and the graphical structure.

Thirdly, we propose a regularization method for simultaneous edge selection and parameter estimation. We design a customized alternating direction method of multipliers (ADMM) to solve this optimization problem in an efficient and scalable way. In addition, we provide a theoretical guarantee by establishing the asymptotic theory for parameter estimates.

In what follows, we briefly introduce the idea of our proposed graphical point process model. As the first step, we model path-to-purchase data as multivariate temporal point processes. More specifically, the proposed method considers individual paths of touchpoints as independent event streams. Each stream consists of various types of touchpoints (events) occurring irregularly and asynchronously on a common timeline. To capture the dynamic inter-dependencies (e.g., exciting patterns) among touches from a large number of independent event streams, the proposed method models event streams as multivariate temporal point processes. The multivariate temporal point processes are commonly characterized using conditional intensity functions (Gunawardana et al., 2011), which describe the instantaneous rates of occurrence of future touchpoints given the history of prior touchpoints. The proposed model can account for the temporal distances and the clumpy nature of touches. Xu et al. (2014) was the first paper to tackle attribution modeling using multivariate temporal point processes. They consider advertisement clicks and purchases as dependent random events in continuous time and cast the model in a Bayesian hierarchical framework.

The proposed model further introduces a Granger causality graph that is a directed graph to represent the dependencies among various event types. The nodes in the graph represent event types, and directed edges depict the historical influence of one type of event on the others, which are called the Granger causality relations (Granger, 1969). The proposed graphical model allows for a large number of online marketing channels and types of touchpoints. By fully capturing the Granger causality relations among various event types, the proposed model simultaneously measures how the numerous types of touches affect conversion, as well as the exciting effects of different types of touches within and across channels.

Based on the learned graph, we propose graphical attribution methods to assign proper credit for conversions to each type of touchpoint or a corresponding channel. The conversion credits, also called attribution scores, are calculated at the path level from a granular point of view. They can be aggregated to the channel level when necessary. The first attribution method measures the direct effect of the event(s) of interest on the conversion: the relative change in conversion intensity when we remove only the event(s) of interest from the path and assume the other touchpoints on the path remain unaffected. The second attribution method fully uses the graphical causality structure and measures the total removal effect of the event(s) of interest. The corresponding attribution score is the marginal lift of the expected intensity of conversion, considering not only the removal of the events of interest but also the potential loss of subsequent customer-initiated events.

We examine the performance of our proposed methods in simulation studies and compare their performance with commonly used attribution models using two sets of simulated data. One data set is simulated from the multivariate Hawkes process (Hawkes, 1971). The other data set is simulated from a modified version of the Digital Advertising System Simulation (DASS) developed by Google Inc. The simulated data includes online customer browsing behavior and injected advertising events that impact customer behavior. Our proposed methods outperform the benchmark models in measuring channels’ contribution to conversions. Moreover, we demonstrate the performance of the proposed methods in a real-world attribution application.

1.3 Our Contributions

We provide practitioners with a new attribution modeling tool to understand how different marketing efforts contribute to conversions at a finer granularity in online multi-channel settings, where there now exists a potentially large number of different types of touchpoints. Our tool distributes proper credit to individual touchpoints and the corresponding channels by conducting attribution modeling at the individual customer path level. It thereby supports firms' granular marketing operations, such as budget allocation and profit maximization.

In addition to the substantive contribution, we have the following methodological contributions to the literature.

Firstly, we contribute to the attribution modeling literature by proposing a graphical point process model to describe customer behavior using individual customer-level path-to-purchase data. We rigorously model the exciting effects of numerous types of touches in this framework and develop an efficient penalized algorithm for model estimation. We apply Granger causality in a marketing context to study the temporal relations among marketing activities. We also establish the asymptotic theory for parameter estimates.

Secondly, our proposed graphical attribution methods contribute to the literature on the incremental value (or removal effect) approaches. We provide a rigorous probabilistic definition of attribution scores and derive two types of removal effects, namely, the direct and total removal effects. We develop a new efficient thinning-based simulation method and a backpropagation algorithm for the calculation of two types of removal effects, respectively.

We organize the rest of the paper as follows. We introduce the proposed graphical point process model in Section 2 and the proposed graphical attribution methods in Section 3. We present the model estimation, computational details, and asymptotic properties in Section 4. We then demonstrate the performance of the proposed method and algorithm with the simulated data in Section 5 and provide an empirical application in Section 6. We conclude this work in Section 7.

2 Graphical Point Process Model

To tackle the attribution problem, we first propose a graphical point process model. This model utilizes customer-level path-to-purchase data to learn the full relational structure among different types of touchpoints. This section introduces the proposed graphical point process model.

Our graphical point process model considers the observed individual paths to purchases as independent event streams, where each stream consists of various types of events (i.e., touchpoints and conversions) occurring irregularly and asynchronously on a common timeline. Suppose there are $p$ unique types of events, labeled $1,\dots,p$. A customer's path $D$ is represented by $\{(t_i,e_i)\}_{i=1}^{m}$ with $0\leq t_1<t_2<\dots<t_m\leq T$, where $m$ is the total count of occurred events and $T$ is the length of observation. For the $i$-th event $(t_i,e_i)$, $t_i$ is its timestamp, and $e_i\in\mathcal{E}=\{1,2,\dots,p\}$ is its event label, which specifies the type of touchpoint (e.g., social ad impression, social ad click, email sent, email ad click, search ad impression, search ad click, display ad impression, display ad click) or conversion. Without loss of generality, let $e=1$ denote the label of conversion. Given the path $D$, if there exists $i\leq m$ such that $e_i=1$, meaning there is a conversion event, then $D$ is called a positive path. Otherwise, $D$ is called a negative path. Suppose we observe $n$ paths, $D_1,\dots,D_n$, where $D_j=\{(t_i^j,e_i^j)\}_{i=1}^{m_j}$ is the $j$-th path with length $T_j$.
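As a concrete illustration of this notation, a path can be stored as a time-ordered list of (timestamp, label) pairs. The sketch below (hypothetical labels and timestamps) classifies paths as positive or negative and evaluates the counting function $N_e(t)$ introduced in the next subsection.

```python
# A path D = {(t_i, e_i)} as a time-ordered list of (timestamp, event-label)
# pairs. Labels are integers in {1, ..., p}; label 1 denotes conversion,
# matching the convention in the text. All values here are hypothetical.
CONVERSION = 1

path_1 = [(0.5, 4), (1.2, 3), (2.0, 2), (2.1, 1)]   # ends in a conversion
path_2 = [(0.3, 4), (1.7, 2)]                        # no conversion

def is_positive(path):
    """A path is positive iff it contains at least one conversion event."""
    return any(e == CONVERSION for _, e in path)

def counting_function(path, e, t):
    """N_e(t): the number of type-e events with timestamp <= t."""
    return sum(1 for t_i, e_i in path if e_i == e and t_i <= t)

print(is_positive(path_1), is_positive(path_2))  # True False
print(counting_function(path_1, 4, 1.0))         # 1
```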

To capture the dynamic inter-dependencies among touches from a large number of independent event streams, the proposed framework models event streams as multivariate temporal point processes. It introduces a directed Granger causality graph to represent the dependencies among various types of touchpoints in the event streams. In this section, we first provide a brief overview of the temporal point process and the Granger causality graph. Then we will introduce our proposed model in detail.

2.1 Temporal Point Process

For an event type labeled as $e$, we can describe its occurrence on the timeline as a temporal point process $N_e(t)$. The function $N_e(t)$ is the number of type-$e$ events that happened until time $t$, which is a right-continuous, non-decreasing piece-wise constant function $\mathbb{R}_{\geq 0}\rightarrow\mathbb{N}$:

$$N_e(t)=\sum_{i:e_i=e}\mathbbm{1}_{\{t_i\leq t\}}.$$

We assume $N_e(t)$ has step size $1$ almost surely and that no two coordinates jump simultaneously. If $t'<t$, then $N_e(t)-N_e(t')$ is the number of type-$e$ events that occurred during the interval $(t',t]$, which can also be denoted by $N_e((t',t])$. Putting the $p$ event types together, we let $\mathbf{N}(t)$ denote the vector of counting functions $(N_1(t),\dots,N_p(t))^{\top}$. Each coordinate $N_e(t)$, $e=1,\dots,p$, is characterized by its conditional intensity

$$\lambda_e(t\mid\mathcal{H}_t):=\lim_{\Delta t\to 0^{+}}\frac{1}{\Delta t}\,\mathbb{P}\left(N_e(t+\Delta t)-N_e(t)>0\mid\mathcal{H}_t\right),$$

which describes the instantaneous rate of occurrence of future type-$e$ events. The filtration $\mathcal{H}_t:=\sigma\{\mathbf{N}(u):u<t\}$ is the $\sigma$-algebra generated by all events up to but excluding $t$, representing the historical information before time $t$.

An example is the multivariate Hawkes process (Hawkes, 1971) with the conditional intensity function

$$\lambda_e(t\mid\mathcal{H}_t)=\mu_e+\sum_{e'=1}^{p}\int_{0}^{t}h_{e'e}(t-u)\,dN_{e'}(u).$$

Here $\mu_e>0$ is the baseline intensity, serving as a background rate of occurrence regardless of historical impact, and $h_{e'e}(\cdot)\geq 0$ is called an impact function. Intuitively, each past type-$e'$ event with $h_{e'e}(\cdot)>0$ contributes positively to the occurrence of the current type-$e$ event by increasing the conditional intensity of event type $e$, and this influence may decrease through time. Such positive influence is called an exciting effect from event type $e'$ to event type $e$.
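To make the Hawkes intensity concrete, the following sketch evaluates $\lambda_e(t\mid\mathcal{H}_t)$ for an exponential impact function $h_{e'e}(s)=a_{e'e}\,\beta e^{-\beta s}$. The exponential form is a common choice used here purely for illustration, and all parameter values are hypothetical.

```python
import math

def hawkes_intensity(t, e, history, mu, a, beta=1.0):
    """lambda_e(t | H_t) = mu[e] + sum over past events (u, e') with u < t
    of h_{e'e}(t - u), with exponential impact
    h_{e'e}(s) = a[e'][e] * beta * exp(-beta * s)."""
    lam = mu[e]
    for u, e_prime in history:
        if u < t:
            lam += a[e_prime][e] * beta * math.exp(-beta * (t - u))
    return lam

# Two event types (0 and 1); type 0 excites type 1, but not vice versa.
mu = [0.2, 0.1]
a = [[0.0, 0.5],
     [0.0, 0.0]]
history = [(1.0, 0), (2.5, 0)]
print(round(hawkes_intensity(3.0, 1, history, mu, a), 4))  # 0.4709
```

The two past type-0 events both raise the type-1 intensity above its baseline of 0.1, with the more recent event (at 2.5) contributing more because its influence has decayed less.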

2.2 Granger Causality Graph

Let $\mathcal{E}=\{1,2,\dots,p\}$ be the set of various types of labeled events (i.e., touchpoints and conversions), whose historical influences on each other are of great interest. We use the Granger causality relations (Granger, 1969) to describe the temporal dependencies among the studied types of touchpoints. If the history of event type $e'$ helps to predict event type $e$ above and beyond the history of event type $e$ alone, event type $e'$ is said to "Granger-cause" event type $e$. Granger causality was introduced and discussed in the original paper (Granger, 1969) and follow-up papers (Granger, 1980, 1988), while Sims (1972) gave an alternative definition of Granger causality.

For any event label subset $W\subseteq\mathcal{E}$, let $\mathbf{N}_W(t)$ be the subprocess $(N_e(t))_{e\in W}$. For example, $\mathbf{N}_{\mathcal{E}\setminus\{e,e'\}}(t)$ denotes the subprocess of all event types other than $e$ and $e'$. For a temporal point process, Granger causality is defined below.

Definition 1.

(Local independence (Didelez, 2008)) The temporal point process $N_e(t)$ is locally independent of $N_{e'}(t)$ given $\mathbf{N}_{\mathcal{E}\setminus\{e,e'\}}(t)$, denoted by $N_{e'}\nrightarrow N_e\mid\mathbf{N}_{\mathcal{E}\setminus\{e,e'\}}$, if the conditional intensity $\lambda_e(t)$ is measurable with respect to $\sigma\{\mathbf{N}_{\mathcal{E}\setminus\{e,e'\}}(u):u<t\}$ for all $t<T$. Otherwise, $N_e(t)$ is said to be locally dependent on $N_{e'}(t)$ given $\mathbf{N}_{\mathcal{E}\setminus\{e,e'\}}(t)$ with respect to $\mathcal{H}_t$, written $N_{e'}\rightarrow N_e\mid\mathbf{N}_{\mathcal{E}\setminus\{e,e'\}}$.

The above definition was introduced by Didelez (2008) for marked point processes on an interval $[0,T]$. Eichler (2012) studied stationary multivariate point processes, such as multivariate Hawkes processes, on $\mathbb{R}$ and used the notion of "Granger non-causality." The above definition is equivalent to saying that the temporal point process $N_{e'}(t)$ does not Granger-cause $N_e(t)$ with respect to the historical information $\mathcal{H}_t$. Otherwise, $N_{e'}(t)$ Granger-causes $N_e(t)$ with respect to $\mathcal{H}_t$. For ease of interpretation, we also simply say that event type $e'$ Granger-causes event type $e$.

We use a directed graph $\mathcal{G}=(\mathcal{E},\mathcal{A})$, as in Figure 2, to represent the temporal dependencies among various types of touchpoints in the conversion paths. The node set of the graph represents the set of event types $\mathcal{E}$. The edge set $\mathcal{A}\subseteq\mathcal{E}\times\mathcal{E}$ represents the Granger causality relations between the various event types. If $N_{e'}(t)$ Granger-causes $N_e(t)$ with respect to $\mathcal{H}_t$, then the directed edge from node $e'$ to node $e$, denoted by $(e'\rightarrow e)$, belongs to $\mathcal{A}$. Such a graph $\mathcal{G}=(\mathcal{E},\mathcal{A})$ is called a Granger causality graph. For a multivariate temporal point process, the Granger causality graph is defined below.

Definition 2.

(Granger causality graph (Didelez, 2008; Eichler, 2012)) A multivariate temporal point process $\mathbf{N}(t)$ is said to follow the Granger causality graph $\mathcal{G}=(\mathcal{E},\mathcal{A})$ if for any pair of nodes $e,e'\in\mathcal{E}$,

$$(e'\rightarrow e)\notin\mathcal{A}\Longleftrightarrow N_{e'}\nrightarrow N_e\mid\mathbf{N}_{\mathcal{E}\setminus\{e,e'\}}.$$
Figure 2: A toy example of the Granger causality graph for attribution. All four types of touchpoints Granger-cause conversion. Within either the display channel or the search channel, impression Granger-causes click. Between the two channels, display impression Granger-causes search impression.

2.3 The Proposed Graphical Point Process Model

We treat each path to purchase as a $p$-dimensional point process $\mathbf{N}(t)$. In the attribution problem, there are two categories of event types: firm-initiated event types and customer-initiated event types. The firm-initiated event types (Wiesel et al., 2011) are initiated by firms, such as email sent, social ad impression, and display impression. The customer-initiated event types (Bowman & Narayandas, 2001; Wiesel et al., 2011) are initiated by customers or prospective customers, including conversion, search ad impression and click, email click, and so forth. Let $\mathcal{E}_{\mathrm{f}}$ and $\mathcal{E}_{\mathrm{c}}$ denote the sets of firm-initiated and customer-initiated event types, so that $\mathcal{E}=\mathcal{E}_{\mathrm{f}}\cup\mathcal{E}_{\mathrm{c}}$ and $\mathcal{E}_{\mathrm{f}}\cap\mathcal{E}_{\mathrm{c}}=\emptyset$. Suppose there are $q$ customer-initiated event types for some $1\leq q<p$. Without loss of generality, let $\mathcal{E}_{\mathrm{c}}=\{1,\dots,q\}$ with conversion labeled as $e=1$, and $\mathcal{E}_{\mathrm{f}}=\{q+1,\dots,p\}$. We assume that the point process of the firm-initiated event types $\mathbf{N}_{\mathcal{E}_{\mathrm{f}}}(t)$ is controlled strategically by the firms or advertisers. In other words, $\mathbf{N}_{\mathcal{E}_{\mathrm{f}}}(t)$ only serves as an observed input whose conditional intensity function does not require learning. For the point process of customer-initiated event types $\mathbf{N}_{\mathcal{E}_{\mathrm{c}}}(t)$, we model it through the following conditional intensity:

$$\lambda_e(t\mid\mathcal{H}_t)=\mu_e+\sum_{e'=1}^{p}\alpha_{e'e}\int_{0}^{t}\psi_{e'e}(t-u)\,dN_{e'}(u),\quad\text{for }e\in\mathcal{E}_{\mathrm{c}},\qquad(1)$$

where:

  • $\mu_e\geq 0$ is the baseline intensity, which corresponds to sources of occurrence of type-$e$ events other than the occurrence history of itself and other types of touchpoints, namely the intrinsic tendency of occurrence.

  • $\psi_{e'e}(\cdot)\geq 0$, $e'=1,\dots,p$, is a bounded, left-continuous kernel function with $\psi_{e'e}(t)=0$ when $t\leq 0$ and $\int_{0}^{\infty}\psi_{e'e}(t)\,dt=1$. The kernel function $\psi_{e'e}(\cdot)$ describes the shape of the touchpoints' impact. For example, $\psi_{e'e}(t)=\frac{1}{T_0}\mathbbm{1}_{\{0<t\leq T_0\}}$ accounts for a constant impact of a previous type-$e'$ event within $T_0$ of its occurrence; $\psi_{e'e}(t)=\frac{1}{T_0}\exp(-\frac{t}{T_0})\mathbbm{1}_{\{t>0\}}$ works for an exponentially decaying impact; $\psi_{e'e}(t)=\sqrt{\frac{2}{\pi T_0^2}}\exp(-\frac{t^2}{2T_0^2})\mathbbm{1}_{\{t>0\}}$ can be used when the decaying impact is even faster.

  • $\alpha_{e'e}\geq 0$, $e'=1,\dots,p$, is the Granger causality coefficient. The value of $\alpha_{e'e}$ describes the scale of the temporal dependence.
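The three example kernels can be coded directly. The sketch below (hypothetical helper names; midpoint-rule numerical integration) also checks that each integrates to one, as the kernel definition requires.

```python
import math

def uniform_kernel(t, T0):
    """Constant impact within T0 of the event's occurrence."""
    return 1.0 / T0 if 0 < t <= T0 else 0.0

def exp_kernel(t, T0):
    """Exponentially decaying impact."""
    return math.exp(-t / T0) / T0 if t > 0 else 0.0

def half_normal_kernel(t, T0):
    """Impact that decays even faster (Gaussian tail)."""
    if t <= 0:
        return 0.0
    return math.sqrt(2.0 / (math.pi * T0 ** 2)) * math.exp(-t ** 2 / (2 * T0 ** 2))

def integrates_to_one(kernel, T0=2.0, upper=60.0, n=200000):
    """Midpoint-rule check that the kernel integrates to 1 over (0, inf);
    the tail beyond `upper` is negligible for these kernels at T0=2."""
    h = upper / n
    total = sum(kernel((i + 0.5) * h, T0) for i in range(n)) * h
    return abs(total - 1.0) < 1e-3

for k in (uniform_kernel, exp_kernel, half_normal_kernel):
    print(k.__name__, integrates_to_one(k))
```

The unit-integral constraint is what lets the coefficient $\alpha_{e'e}$, rather than the kernel, carry the scale of the temporal dependence.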

To interpret the coefficient $\alpha_{e'e}$, we introduce the following theorem.

Theorem 1.

Assume a point process $\mathbf{N}(t)$ with conditional intensity functions defined in (1) follows the Granger causality graph $\mathcal{G}=(\mathcal{E},\mathcal{A})$. For any event labels $e'\in\mathcal{E}$ and $e\in\mathcal{E}_{\mathrm{c}}$, if the condition $N_{e'}(T)>0$ holds, then

$$(e'\rightarrow e)\notin\mathcal{A}\Longleftrightarrow\alpha_{e'e}=0.$$

This result is an adaptation from the case of the multivariate Hawkes process (Eichler, 2012; Xu et al., 2016). Analogous to the multivariate Hawkes process, the meaning of $\alpha_{e^{\prime}e}$ can be described in two cases: $\alpha_{e^{\prime}e}>0$ implies an exciting effect from event type $e^{\prime}$ to event type $e$, which increases the conditional intensity $\lambda_{e}(t\mid\mathcal{H}_{t})$; $\alpha_{e^{\prime}e}=0$ implies no Granger causality from event type $e^{\prime}$ to event type $e$, since $\lambda_{e}(t\mid\mathcal{H}_{t})$ is not affected by the occurrence of any type-$e^{\prime}$ event. By this construction, we can use the matrix $A=(\alpha_{e^{\prime}e})_{e^{\prime}\in\mathcal{E},e\in\mathcal{E}_{\mathrm{c}}}\in\mathbb{R}_{\geq 0}^{p\times q}$ to represent the graphical Granger causality structure for the customer-initiated event types.

As the firm-initiated event types are fully controlled by marketers or advertisers, we assume in our proposed model that they do not depend on other types of events. That is, the ground truth of $\mathcal{A}$ for $e\in\mathcal{E}_{\mathrm{f}}$ is known, with $N_{e^{\prime}}\nrightarrow N_{e}\mid\mathbf{N}_{\mathcal{E}\setminus\{e,e^{\prime}\}}$ for any $e^{\prime}\in\mathcal{E}$.

3 Graphical Attribution Method

The goal of attribution is to assign proper credit for conversions to each type of touchpoint or the corresponding channel. We adopt a path-level scoring approach to obtain a granular view. This approach does not rely on the distribution of the firm-initiated event types $\mathbf{N}_{\mathcal{E}_{\mathrm{f}}}(t)$, so it applies even when customers are treated with different advertising strategies.

The path-level credit, called the attribution score, represents the potential fractional loss of a conversion on a path given the absence of certain event(s). We refer to this value as the removal effect of the event(s). In this section, we first derive one version of the removal effect using the temporal point process of conversion, the direct removal effect, by analyzing the incremental contribution of each touchpoint. Then we fully exploit the graphical structure and propose another version, the total removal effect, which explains the marginal increase in the chance of conversion.

3.1 Direct Removal Effect

In this subsection, we calculate attribution scores using the direct removal effect from the point of view of the graphical point process. We define the direct removal effect (DRE) as the relative change in conversion intensity when we remove only the event(s) of interest from the path, assuming the other touchpoints on the path remain unaffected. That is, the direct removal effect considers merely the influence of the studied event(s) on an occurrence of conversion and ignores their influence on other events. Graphically speaking, the direct removal effect focuses on the direct Granger causality parent nodes of conversion and does not depend on the hierarchy beyond them.

For a positive path $D$, suppose there is a conversion at $t=t_{i^{\star}}$ for some $1<i^{\star}\leq m$, that is, $e_{i^{\star}}=1$. Let $F^{W}_{t}(D):=\{(t_{i},e_{i})\in D: t_{i}<t,\ e_{i}\in W\}$ be the set of events occurring before $t$ whose event labels belong to $W$, where $W\subseteq\mathcal{E}$ is an arbitrary subset of event labels. In particular, let $F_{t}(D)$ stand for $F^{\mathcal{E}}_{t}(D)$, the truncated path before $t$. We are interested in the influence of a subset of touchpoints $R\subseteq F_{t_{i^{\star}}}(D)$ on this conversion and call $R$ a removal set. Let $\mathbf{N}^{D}(t)=(N_{1}^{D}(t),\dots,N_{p}^{D}(t))^{\top}$ denote the point process with respect to path $D$, with $N_{e}^{D}(t)=\sum_{(t_{i},e_{i})\in D: e_{i}=e}\mathbbm{1}_{\{t_{i}\leq t\}}$ for $e=1,\dots,p$. Let $\mathcal{H}_{t}^{D}$ be the corresponding filtration $\sigma\{\mathbf{N}^{D}(u):u<t\}$. We calculate the attribution score as the direct removal effect of $R$ with respect to path $D$:

\[
\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R\mid D):=\frac{\lambda_{1}(t_{i^{\star}}\mid\mathcal{H}_{t_{i^{\star}}}^{D})-\lambda_{1}(t_{i^{\star}}\mid\mathcal{H}_{t_{i^{\star}}}^{D\setminus R})}{\lambda_{1}(t_{i^{\star}}\mid\mathcal{H}_{t_{i^{\star}}}^{D})}.\tag{2}
\]

The above definition is general and does not depend on any modeling assumptions for the point process.

Suppose that the conversion event satisfies model (1), so that the conditional intensity function of this conversion takes the form

\[
\lambda_{1}(t_{i^{\star}}\mid\mathcal{H}_{t_{i^{\star}}})=\mu_{1}+\sum_{i<i^{\star}}\alpha_{e_{i}1}\psi_{e_{i}1}(t_{i^{\star}}-t_{i}).\tag{3}
\]

We can derive more specific attribution scores based on Equations (2) and (3). For example, the direct removal effect of touchpoint $(t_{i},e_{i})$ can be calculated by

\[
\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(\{(t_{i},e_{i})\}\mid D)=\frac{\alpha_{e_{i}1}\psi_{e_{i}1}(t_{i^{\star}}-t_{i})}{\lambda_{1}(t_{i^{\star}}\mid\mathcal{H}_{t_{i^{\star}}}^{D})}.
\]

This expression shows the relationship between the Granger causality graph and the attribution score. A touchpoint $(t_{i},e_{i})$ can be attributed a score only if its touchpoint type is a parent node of the conversion node on the graph, that is, $(e_{i}\rightarrow 1)\in\mathcal{A}$. More generally, for a subset $R$ of $F_{t_{i^{\star}}}(D)$, the direct removal effect is

\[
\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R\mid D)=\sum_{(t_{i},e_{i})\in R}\frac{\alpha_{e_{i}1}\psi_{e_{i}1}(t_{i^{\star}}-t_{i})}{\lambda_{1}(t_{i^{\star}}\mid\mathcal{H}_{t_{i^{\star}}}^{D})}.\tag{4}
\]

Besides, under model (1), the baseline effect can be defined as $\frac{\mu_{1}}{\lambda_{1}(t_{i^{\star}}\mid\mathcal{H}_{t_{i^{\star}}}^{D})}$. For example, Figure 3 shows the direct removal effects of touchpoints with respect to a path consisting of three touchpoints and a conversion.

Figure 3: The direct removal effects and the decomposition of the relative conditional intensity of conversion for an example path. The path contains three touchpoints, with $t_{1}=1$, $t_{2}=3$, $t_{3}=6$, and $e_{1}$, $e_{2}$, and $e_{3}$ representing search impression, display impression, and search impression, respectively. There is a conversion at $t_{4}=7$.
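A path-level computation in the spirit of Figure 3 can be sketched as follows. The baseline intensity, coefficients, and exponential kernel scale below are illustrative assumptions, not estimates from the paper; the function implements (4) together with the baseline effect.

```python
import math

def psi_exp(t, T0=2.0):
    # exponential kernel with illustrative scale T0 (assumption)
    return math.exp(-t / T0) / T0 if t > 0 else 0.0

def direct_removal_effects(path, t_conv, mu1, alpha1, psi=psi_exp):
    """Direct removal effect (4) of each touchpoint, plus the baseline effect."""
    contrib = {(t_i, e_i): alpha1[e_i] * psi(t_conv - t_i) for t_i, e_i in path}
    lam1 = mu1 + sum(contrib.values())   # lambda_1(t* | H^D) under model (1)
    scores = {ev: c / lam1 for ev, c in contrib.items()}
    return scores, mu1 / lam1            # touchpoint scores, baseline effect

# Path mimicking Figure 3: search at t=1, display at t=3, search at t=6; conversion at t=7
path = [(1.0, "search"), (3.0, "display"), (6.0, "search")]
scores, baseline = direct_removal_effects(path, 7.0, mu1=0.05,
                                          alpha1={"search": 0.8, "display": 0.6})
```

The touchpoint scores and the baseline effect sum to one, reproducing the decomposition of the relative conditional intensity in the figure.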

In what follows, we explain the attribution score's general form (2) in the language of probability. Consider an ideal experiment with two customers, A and B, with a toy example shown in Figure 4. Suppose A and B react to touchpoints in the same way; their conditional intensity functions for customer-initiated event types follow the same model. We observe $D_{\mathrm{A}}$, the path of A, with a conversion at $t=t_{i^{\star}}>0$. Let $D_{\mathrm{B}}$ denote the path of B. Suppose $F^{\mathcal{E}\setminus\{1\}}_{t_{i^{\star}}}(D_{\mathrm{B}})\subsetneqq F^{\mathcal{E}\setminus\{1\}}_{t_{i^{\star}}}(D_{\mathrm{A}})$, that is, the touchpoints on $D_{\mathrm{B}}$ form a proper subset of those on $D_{\mathrm{A}}$. Assume that there is no information about $N_{1}^{D_{\mathrm{B}}}(t)$; namely, we do not observe whether B converts or not. We can sample the process $N_{1}^{D_{\mathrm{B}}}(t)$ to obtain a complete path $D_{\mathrm{B}}$. We exploit the thinning operation for a temporal point process, which uses a definite rule to delete points of a basic point process, yielding a new point process.

Figure 4: The ideal experiment interpreted through the direct removal effect of the display impression for a path containing three touchpoints. With the removal of the display impression, the direct removal effect describes the chance of seeing a conversion with the same timestamp.
Theorem 2.

(Lewis & Shedler, 1979; Ogata, 1981) Assume a univariate temporal point process $N(t)$ on $[0,T]$ with intensity function $\lambda(t)$. Let $t_{1},\dots,t_{N(T)}$ be the timestamps of $N(t)$. Suppose a function $\lambda^{\prime}(t)$ satisfies, almost surely,

\[
\lambda^{\prime}(t)\leq\lambda(t),\quad\text{for }0<t<T.
\]

For $i=1,\dots,N(T)$, delete the point at $t=t_{i}$ with probability $1-\lambda^{\prime}(t_{i})/\lambda(t_{i})$. Then the remaining points form a point process $N^{\prime}(t)$ on the interval $[0,T]$ with intensity function $\lambda^{\prime}(t)$.

We refer to the point process $N(t)$ as a background process for the desired point process $N^{\prime}(t)$. In general, Ogata (1981) suggested using a Poisson process as the background process for point process simulation. To sample $N_{1}^{D_{\mathrm{B}}}(t)$, we can look for a constant $\overline{\lambda}$ satisfying $\overline{\lambda}\geq\lambda_{1}(t\mid\mathcal{H}_{t}^{D_{\mathrm{B}}})$ for $t\in[0,T]$ and use the Poisson process with rate $\overline{\lambda}$ as the background process.
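Theorem 2 with a Poisson background process is the classical Lewis-Shedler/Ogata sampler. A minimal sketch (our own helper, written for a deterministic intensity function for simplicity; a history-dependent intensity would be thinned sequentially):

```python
import numpy as np

def sample_by_thinning(lam_func, lam_bar, T, seed=0):
    """Thin a rate-lam_bar homogeneous Poisson process on [0, T] down to a
    process with intensity lam_func(t) <= lam_bar (Lewis & Shedler; Ogata)."""
    rng = np.random.default_rng(seed)
    n = rng.poisson(lam_bar * T)                   # background Poisson points
    times = np.sort(rng.uniform(0.0, T, size=n))
    # keep each point with probability lam_func(t) / lam_bar
    accept = rng.uniform(size=n) < np.array([lam_func(t) for t in times]) / lam_bar
    return times[accept]
```

For example, `sample_by_thinning(lambda t: 2.0, 4.0, 1000.0)` keeps each background point with probability one half, yielding approximately a rate-2 Poisson process.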

On closer inspection, sampling a Poisson process as the background process is not necessary. The point process of conversion for A, $N_{1}^{D_{\mathrm{A}}}(t)$, can also serve as a background process. Based on (3), we have $\lambda_{1}(t\mid\mathcal{H}_{t}^{D_{\mathrm{B}}})\leq\lambda_{1}(t\mid\mathcal{H}_{t}^{D_{\mathrm{A}}})$ before any occurrence of conversion for B. Theorem 2 tells us that if we delete the conversion at $t=t_{i^{\star}}$ in $N_{1}^{D_{\mathrm{A}}}(t)$ with probability $1-\lambda_{1}(t_{i^{\star}}\mid\mathcal{H}_{t_{i^{\star}}}^{D_{\mathrm{B}}})/\lambda_{1}(t_{i^{\star}}\mid\mathcal{H}_{t_{i^{\star}}}^{D_{\mathrm{A}}})$, then we obtain a process following the distribution described in (3) for B. Compared with (2), this deletion probability is exactly the direct removal effect of $D_{\mathrm{A}}\setminus D_{\mathrm{B}}$ with respect to path $D_{\mathrm{A}}$.

3.2 Total Removal Effect

The direct removal effect of a given subset of events describes the expected loss of conversion by comparing the path without that subset to the original path. In this subsection, we study another attribution score that measures the overall influence of removing the subset of events along the path. The corresponding attribution score is the marginal lift of the expected intensity of conversion, accounting not only for the removal of the events of interest but also for the potential loss of other related customer-initiated events.

For example, the direct removal effect of the first touchpoint $(t_{1},e_{1})\in D$ can be studied through $D\setminus\{(t_{1},e_{1})\}$. But removing such a touchpoint at an early stage of the path results in more than the removal of the touchpoint itself: some later touchpoints may be affected and tend not to occur. So $D\setminus\{(t_{1},e_{1})\}$ may not well represent a possible remaining path described by model (1), because $\mathbf{N}_{\mathcal{E}_{\mathrm{c}}}^{D\setminus\{(t_{1},e_{1})\}}(t)$ is not guaranteed to follow the same model as $\mathbf{N}_{\mathcal{E}_{\mathrm{c}}}^{D}(t)$. In other words, the direct removal effect may not fully reflect the overall influence of removing the touchpoint.

By model (1), for any $e\in\mathcal{E}_{\mathrm{c}}$ (including conversion),

\[
\lambda_{e}(t\mid\mathcal{H}_{t}^{D\setminus R})\leq\lambda_{e}(t\mid\mathcal{H}_{t}^{D}),\quad t\in(0,T).
\]

The removal of $R$ from $D$ may result in the loss of certain subsequent touchpoints as well as the conversion itself. The actual remaining path may contain even fewer event occurrences, yielding a subset of $D\setminus R$. This inspires us to apply the thinning operation to $\mathbf{N}_{\mathcal{E}_{\mathrm{c}}}(t)$ according to Theorem 2, instead of thinning the univariate process $N_{1}(t)$ only.

Let $i_{\min}(R)=\min\{1\leq i\leq m:(t_{i},e_{i})\in R\}$ be the index of the first event in $R$. Recall the ideal experiment with two customers, A and B. Now assume that $D_{\mathrm{A}}=D$ for A and that the process of the customer-initiated event types $\mathbf{N}_{\mathcal{E}_{\mathrm{c}}}^{D_{\mathrm{B}}}(t)$ is unknown for B. Let $D_{\mathrm{B}}=D\setminus R$ initially. By Theorem 2, if we delete the touchpoint $(t_{i},e_{i})$ in $F^{\mathcal{E}_{\mathrm{c}}}_{t_{i^{\star}}}(D_{\mathrm{B}})$ from $D_{\mathrm{B}}$ with probability $1-\lambda_{e_{i}}(t_{i}\mid\mathcal{H}_{t_{i}}^{D_{\mathrm{B}}})/\lambda_{e_{i}}(t_{i}\mid\mathcal{H}_{t_{i}}^{D_{\mathrm{A}}})$, sequentially for $i>i_{\min}(R)$, then the obtained process $\mathbf{N}_{\mathcal{E}_{\mathrm{c}}}^{D_{\mathrm{B}}}(t)$ follows model (1). Suppose $D\setminus R^{\diamond}$ is the thinned version of $D_{\mathrm{B}}$, with $R^{\diamond}\supseteq R$ being the actual removal set. Figure 5 shows a toy example of this idea. As a result, given a subset $R$ of a path $D$, its total removal effect (TRE) is defined by

\[
\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(total)}}(R\mid D):=\mathbbm{E}[\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R^{\diamond}\mid D)\mid D]=\frac{\lambda_{1}(t_{i^{\star}}\mid\mathcal{H}_{t_{i^{\star}}}^{D})-\mathbbm{E}[\lambda_{1}(t_{i^{\star}}\mid\mathcal{H}_{t_{i^{\star}}}^{D\setminus R^{\diamond}})\mid D]}{\lambda_{1}(t_{i^{\star}}\mid\mathcal{H}_{t_{i^{\star}}}^{D})},
\]

where the conditional expectation is taken with respect to the actual removal set $R^{\diamond}$ generated by the random thinning operation. The procedure with the thinning operation is summarized in Algorithm 1.

Figure 5: For the example path containing three touchpoints, the total removal effect of the display impression can be viewed as the expected direct removal effect, where the uncertainty lies in the actual removal set. The removal of the display impression may result in the removal of the subsequent search impression.
Algorithm 1 Total removal effect: thinning
Input: A path $D$, a specified conversion timestamp $t_{i^{\star}}$ ($i^{\star}>1$), and a nonempty removal set $R\subseteq F_{t_{i^{\star}}}(D)$. Model parameters $\bm{\mu},\bm{\alpha}_{1},\dots,\bm{\alpha}_{q}$ and a large integer $L$.
for $\ell=1,\dots,L$ do
     $R^{\diamond}=R$.
     for $i$ in $\{i>i_{\min}(R):(t_{i},e_{i})\in F^{\mathcal{E}_{\mathrm{c}}}_{t_{i^{\star}}}(D\setminus R)\}$ (ascending order) do
          Update $R^{\diamond}=R^{\diamond}\cup\{(t_{i},e_{i})\}$ with probability $1-\frac{\lambda_{e_{i}}(t_{i}\mid\mathcal{H}_{t_{i}}^{D\setminus R^{\diamond}})}{\lambda_{e_{i}}(t_{i}\mid\mathcal{H}_{t_{i}}^{D})}$.
     end for
     $x_{\ell}=\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R^{\diamond}\mid D)$.
end for
Return: $\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(total)}}(R\mid D)=\frac{1}{L}\sum_{\ell=1}^{L}x_{\ell}$.
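Algorithm 1 can be sketched as a short Monte Carlo routine under model (1) with an exponential kernel. All numbers below (kernel scale, baselines, coefficients, the three-touchpoint path) are illustrative assumptions; type 1 is conversion, type 2 a customer-initiated touchpoint, and type 3 a firm-initiated touchpoint.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def psi(t, T0=2.0):
    return math.exp(-t / T0) / T0 if t > 0 else 0.0

def lam(e, t, events, mu, alpha):
    # conditional intensity (1) of type e at t, given the events strictly before t
    return mu[e] + sum(alpha.get((ep, e), 0.0) * psi(t - tp) for tp, ep in events if tp < t)

def tre_thinning(path, t_conv, R, mu, alpha, customer_types, L=2000):
    """Algorithm 1: Monte Carlo estimate of att^(total)(R | D) by thinning."""
    t_min = min(t for t, _ in R)                  # time of the first removed event
    later = sorted(ev for ev in path if ev not in R and t_min < ev[0] < t_conv
                   and ev[1] in customer_types)
    lam1 = lam(1, t_conv, path, mu, alpha)
    total = 0.0
    for _ in range(L):
        Rd = set(R)
        for (t_i, e_i) in later:                  # ascending order
            kept = [ev for ev in path if ev not in Rd]
            p_del = 1.0 - lam(e_i, t_i, kept, mu, alpha) / lam(e_i, t_i, path, mu, alpha)
            if rng.uniform() < p_del:
                Rd.add((t_i, e_i))
        # direct removal effect (4) of the realized removal set
        total += sum(alpha.get((e, 1), 0.0) * psi(t_conv - t) for t, e in Rd) / lam1
    return total / L

# Illustrative example: remove the firm-initiated touchpoint at t = 3
path = [(1.0, 2), (3.0, 3), (6.0, 2)]
mu = {1: 0.05, 2: 0.1}
alpha = {(2, 1): 0.8, (3, 1): 0.6, (3, 2): 0.5}
tre = tre_thinning(path, 7.0, {(3.0, 3)}, mu, alpha, customer_types={1, 2})
```

Removing the touchpoint at $t=3$ may also thin the later type-2 touchpoint at $t=6$, so the estimate exceeds the direct removal effect of the removed event alone.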

The largest possible value of $R^{\diamond}$ is $R\cup\{(t_{i},e_{i})\in F^{\mathcal{E}_{\mathrm{c}}}_{t_{i^{\star}}}(D\setminus R): i>i_{\min}(R)\}$, denoted by $\Omega$ for ease of notation. Using the thinning algorithm, we can derive an explicit form of the total removal effect:

\[
\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(total)}}(R\mid D)=\sum_{R\subseteq R^{\prime}\subseteq\Omega}\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R^{\prime}\mid D)\,\mathbbm{P}(R^{\diamond}=R^{\prime}\mid D).
\]

During the thinning operation, the value of $R^{\diamond}$ depends on sampling a sequence of Bernoulli random variables. Let $U_{i}\sim\mathrm{Bernoulli}\big(1-\frac{\lambda_{e_{i}}(t_{i}\mid\mathcal{H}_{t_{i}}^{D\setminus R^{\prime}})}{\lambda_{e_{i}}(t_{i}\mid\mathcal{H}_{t_{i}}^{D})}\big)$ be the Bernoulli random variable for thinning event $(t_{i},e_{i})\in\Omega$. Then the conditional probability mass function of $R^{\diamond}$ is

\[
\begin{aligned}
\mathbbm{P}(R^{\diamond}=R^{\prime}\mid D)=&\ \prod_{(t_{i},e_{i})\in\Omega\setminus R}\mathbbm{P}(U_{i}=\mathbbm{1}_{\{(t_{i},e_{i})\in R^{\prime}\}})\\
=&\ \prod_{(t_{i},e_{i})\in\Omega\setminus R}\left(1-\frac{\lambda_{e_{i}}(t_{i}\mid\mathcal{H}_{t_{i}}^{D\setminus R^{\prime}})}{\lambda_{e_{i}}(t_{i}\mid\mathcal{H}_{t_{i}}^{D})}\right)^{\mathbbm{1}_{\{(t_{i},e_{i})\in R^{\prime}\}}}\left(\frac{\lambda_{e_{i}}(t_{i}\mid\mathcal{H}_{t_{i}}^{D\setminus R^{\prime}})}{\lambda_{e_{i}}(t_{i}\mid\mathcal{H}_{t_{i}}^{D})}\right)^{1-\mathbbm{1}_{\{(t_{i},e_{i})\in R^{\prime}\}}}.
\end{aligned}
\]

The above probability mass function can be difficult to use in practice due to the combinatorial subset calculation. Under model (1), we can change the order of summation to derive the total removal effect according to the linear decomposition of the direct removal effect in (4):

\[
\begin{aligned}
\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(total)}}(R\mid D)&=\sum_{R\subseteq R^{\prime}\subseteq\Omega}\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R^{\prime}\mid D)\,\mathbbm{P}(R^{\diamond}=R^{\prime}\mid D)\\
&=\sum_{R\subseteq R^{\prime}\subseteq\Omega}\sum_{(t_{i},e_{i})\in R^{\prime}}\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(\{(t_{i},e_{i})\}\mid D)\,\mathbbm{P}(R^{\diamond}=R^{\prime}\mid D)\\
&=\sum_{(t_{i},e_{i})\in\Omega}\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(\{(t_{i},e_{i})\}\mid D)\,\mathbbm{P}((t_{i},e_{i})\in R^{\diamond}\mid D).
\end{aligned}
\]

This result implies another iterative algorithm, which is simulation-free and more efficient. The basic idea is to redistribute the scores obtained from the direct removal effect. It adopts an intuitive backpropagation style of scoring, summarized in Algorithm 2; backpropagation, introduced by Rumelhart et al. (1986), is widely used in artificial neural networks.

Algorithm 2 Total removal effect: backpropagation
Input: A path $D$, a specified conversion timestamp $t_{i^{\star}}$ ($i^{\star}>1$), and a nonempty removal set $R\subseteq F_{t_{i^{\star}}}(D)$. Model parameters $\bm{\mu},\bm{\alpha}_{1},\dots,\bm{\alpha}_{q}$.
Initialize: $y_{i}=\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(\{(t_{i},e_{i})\}\mid D)$ for $i_{\min}(R)\leq i\leq i^{\star}-1$.
for $i$ in $\{i>i_{\min}(R):(t_{i},e_{i})\in F^{\mathcal{E}_{\mathrm{c}}}_{t_{i^{\star}}}(D\setminus R)\}$ (descending order) do
     for $i^{\prime}$ in $\{1\leq i^{\prime}\leq i-1:(e_{i^{\prime}}\rightarrow e_{i})\in\mathcal{A}\}$ do
          $y_{i^{\prime}}=y_{i^{\prime}}+y_{i}\cdot\Big(1-\frac{\lambda_{e_{i}}(t_{i}\mid\mathcal{H}_{t_{i}}^{D\setminus\{(t_{i^{\prime}},e_{i^{\prime}})\}})}{\lambda_{e_{i}}(t_{i}\mid\mathcal{H}_{t_{i}}^{D})}\Big)$.
     end for
end for
Return: $\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(total)}}(R\mid D)=\sum_{(t_{i},e_{i})\in R}y_{i}$.
Figure 6: For the example path containing three touchpoints, the total removal effect of the display ad impression can be calculated by score backpropagation. It consists of two parts: its own direct removal effect and its share of the direct removal effect of the subsequent search ad impression. The graph on the right describes how the score flows between event types, with black and red arrows representing the direct removal effect attribution and the score backpropagation, respectively.

As shown by the example path in Figure 6, instead of flowing forward along the path, the score flows backward from the last touchpoint to the earlier touchpoints. This flow follows the reverse of the Granger causality graph, whose arrows point from each event type to its Granger causality parents.
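Algorithm 2 can be sketched directly from this score-flow picture. The model setup below (exponential kernel, baselines, coefficients, three-touchpoint path) consists of illustrative assumptions of our own, not values from the paper; the Granger edges are read off the nonzero entries of $\alpha$.

```python
import math

def psi(t, T0=2.0):
    return math.exp(-t / T0) / T0 if t > 0 else 0.0

def lam(e, t, events, mu, alpha):
    return mu[e] + sum(alpha.get((ep, e), 0.0) * psi(t - tp) for tp, ep in events if tp < t)

def tre_backprop(path, t_conv, R, mu, alpha, customer_types):
    """Algorithm 2: simulation-free total removal effect via score backpropagation."""
    lam1 = lam(1, t_conv, path, mu, alpha)
    t_min = min(t for t, _ in R)
    # initialize y_i with the touchpoint-wise direct removal effects
    y = {ev: alpha.get((ev[1], 1), 0.0) * psi(t_conv - ev[0]) / lam1
         for ev in path if t_min <= ev[0] < t_conv}
    later = [ev for ev in path if ev not in R and t_min < ev[0] < t_conv
             and ev[1] in customer_types]
    for (t_i, e_i) in sorted(later, reverse=True):     # descending order
        full = lam(e_i, t_i, path, mu, alpha)
        for parent in y:                               # earlier Granger parents
            t_p, e_p = parent
            if t_p < t_i and alpha.get((e_p, e_i), 0.0) > 0:
                reduced = lam(e_i, t_i, [ev for ev in path if ev != parent], mu, alpha)
                y[parent] += y[(t_i, e_i)] * (1.0 - reduced / full)
    return sum(y[ev] for ev in R)

path = [(1.0, 2), (3.0, 3), (6.0, 2)]
mu = {1: 0.05, 2: 0.1}
alpha = {(2, 1): 0.8, (3, 1): 0.6, (3, 2): 0.5}
tre = tre_backprop(path, 7.0, {(3.0, 3)}, mu, alpha, customer_types={1, 2})
```

On this example the backpropagated score equals the expectation that Algorithm 1 estimates by simulation, up to Monte Carlo noise.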

3.3 Remark on the Two Scoring Methods

In this subsection, we discuss the differences between the two proposed attribution scoring methods and examine their behavior in an extreme case.

The direct removal effect regards the contributions from each touchpoint as individual components, while the total removal effect of a specified subset of touchpoints is the cumulative influence resulting from its removal along the path. The direct removal effect regards the conversion event as the only response, whereas the whole vector of customer-initiated event types is the response for the total removal effect. Intuitively, on the score-flow graph in Figure 6, the direct removal effect lets the score come directly out of the conversion node, while the total removal effect is based on the continual flow of the score until it reaches a specified node or a node with no parents.

These two scores also differ in terms of additivity. Suppose $R_{1}$ and $R_{2}$ are two mutually exclusive subsets of the truncated path $F_{t_{i^{\star}}}(D)$. Under model (1), the direct removal effect has the additive property implied by its definition:

\[
\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R_{1}\cup R_{2}\mid D)=\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R_{1}\mid D)+\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R_{2}\mid D).
\]

However, for the total removal effect, the score flows between the two sets during backpropagation wherever there is Granger causality between their event types. Hence the total removal effect is subadditive rather than additive:

\[
\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(total)}}(R_{1}\cup R_{2}\mid D)\leq\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(total)}}(R_{1}\mid D)+\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(total)}}(R_{2}\mid D).
\]

As a result, to obtain the total removal effect of a channel, we should take the removal set to be the set of all the channel's individual touchpoints rather than summing the touchpoint-wise scores.

Figure 7: An extreme case where the Granger causality graph is line-shaped.

We consider an extreme case illustrated by the Granger causality graph in Figure 7. In practice, we believe that touchpoints in each channel Granger-cause conversion separately, so the conversion node has multiple parents on the Granger causality graph, as in Figure 2. But here we use this example to justify the two proposed attribution scores. On the graph, conversion is the event type labeled $1$, and all the other event types $2,\dots,p$ are touchpoints, where the type labeled $p$ is the only firm-initiated event type. Suppose there is no baseline intensity for any customer-initiated event type, namely $\mu_{e}=0$ for $e=1,\dots,p-1$. Then a type-$(e+1)$ event is the only possible cause of a type-$e$ event. Suppose we observe a path $D$ with the same pattern as Figure 7, that is,

\[
D=\{(t_{1},p),\dots,(t_{p-2},3),(t_{p-1},2),(t_{p},1)\}.
\]

If we look at the direct removal effect of each touchpoint, we get

\[
\mathrm{att}_{t_{p}}^{\mathrm{(direct)}}(\{(t_{i},e_{i})\}\mid D)=\begin{cases}100\%,&i=p-1;\\ 0\%,&i=1,\dots,p-2,\end{cases}
\]

where $e_{i}=p-i+1$ for $i=1,\dots,p-1$. The direct removal effect ignores the importance of touchpoints at the early stage, since they cannot instantly trigger a conversion. On the other hand, the total removal effect gives

\[
\mathrm{att}_{t_{p}}^{\mathrm{(total)}}(\{(t_{i},e_{i})\}\mid D)=100\%,\quad i=1,\dots,p-1.
\]

It may seem that the total removal effect over-allocates, since there is only one conversion on path $D$. This issue was also pointed out by Singal et al. (2022) as a drawback of attribution methods based on the removal effect. However, it does not mean the scores are wrong. This extreme over-allocation problem implies that it might be inappropriate to calculate such an attribution score for each individual touchpoint. In Figure 7, it is likely that event types $2$ to $p$ all belong to the same channel. A detailed example is the case where email sent, email open, and email click are the touchpoints, and conversion can only be triggered by email click (invite-only purchase through an email link). In this case, we can regard event types $2$ to $p$ as an entirety using the removal set $R=\{(t_{1},p),\dots,(t_{p-2},3),(t_{p-1},2)\}$. Then we have

\[
\mathrm{att}_{t_{p}}^{\mathrm{(direct)}}(R\mid D)=\mathrm{att}_{t_{p}}^{\mathrm{(total)}}(R\mid D)=100\%.
\]

If event types $2$ to $p$ actually come from different channels, then either the graph is unlikely to be true, or there is no need to distinguish these channels from one another since they are highly dependent.
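The chain example can be verified numerically. The sketch below builds the line-shaped graph of Figure 7 with $p=4$, unit coefficients, zero baselines for the customer-initiated types, and an exponential kernel (all illustrative choices of ours), then computes the touchpoint-wise direct removal effects and, via the backpropagation of Algorithm 2, the total removal effects.

```python
import math

def psi(t, T0=2.0):
    return math.exp(-t / T0) / T0 if t > 0 else 0.0

# Line-shaped graph of Figure 7 with p = 4: type 4 -> 3 -> 2 -> 1 (conversion),
# unit coefficients, zero baselines for the customer-initiated types 1, 2, 3.
path = [(1.0, 4), (2.0, 3), (3.0, 2)]   # touchpoints; conversion occurs at t = 4
t_conv = 4.0
alpha = {(4, 3): 1.0, (3, 2): 1.0, (2, 1): 1.0}
mu = {1: 0.0, 2: 0.0, 3: 0.0}

def lam(e, t, events):
    return mu.get(e, 0.0) + sum(alpha.get((ep, e), 0.0) * psi(t - tp)
                                for tp, ep in events if tp < t)

lam1 = lam(1, t_conv, path)
dre = {ev: alpha.get((ev[1], 1), 0.0) * psi(t_conv - ev[0]) / lam1 for ev in path}

tre = {}
for ev in path:                          # total removal effect of each single touchpoint
    y = dict(dre)                        # initialize with direct removal effects
    for (t_i, e_i) in sorted((x for x in path if x[0] > ev[0]), reverse=True):
        full = lam(e_i, t_i, path)
        for (t_p, e_p) in path:
            if t_p < t_i and alpha.get((e_p, e_i), 0.0) > 0:
                red = lam(e_i, t_i, [x for x in path if x != (t_p, e_p)])
                y[(t_p, e_p)] += y[(t_i, e_i)] * (1.0 - red / full)
    tre[ev] = y[ev]
```

Here `dre` assigns 100% to the last touchpoint and 0% to the earlier ones, while `tre` assigns 100% to every touchpoint, reproducing the over-allocation discussed above.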

To summarize, the direct removal effect serves as an explanatory score that allocates the credit of a conversion event to each touchpoint according to its incremental influence. Both proposed methods can give channel-level scores, but the total removal effect provides a marginal point of view on conversion. In other words, it is better suited to cases where the leave-one-channel-out loss of conversion is the desired quantity.

4 Estimation

In this section, we develop a regularized estimator for our proposed model. We design a computationally efficient alternating direction method of multipliers (ADMM) algorithm to solve the corresponding optimization problem. We also derive the graphical model selection consistency and the rates of convergence of the model estimates and attribution scores in the asymptotic regime.

We introduce some necessary notation to facilitate the presentation. For any $K$-dimensional vector $\bm{x}=(x_{1},\dots,x_{K})^{\top}\in\mathbb{R}^{K}$, where $K\geq 1$ is an arbitrary integer, let $\|\bm{x}\|_{1}:=\sum_{k=1}^{K}|x_{k}|$, $\|\bm{x}\|_{2}:=\sqrt{\sum_{k=1}^{K}x_{k}^{2}}$, and $\|\bm{x}\|_{\infty}:=\max_{1\leq k\leq K}|x_{k}|$ denote its $L_{1}$-norm, $L_{2}$-norm, and $L_{\infty}$-norm, respectively. Let $\bm{x}_{+}$ denote the non-negative part of $\bm{x}$, defined by $\bm{x}_{+}:=(\max(0,x_{1}),\dots,\max(0,x_{K}))^{\top}$. For a matrix $M=(M_{ij})_{i,j}$, the matrix $L_{1}$-norm is defined as $\|M\|_{1,\infty}:=\max_{i}\sum_{j}|M_{ij}|$.

4.1 Model Learning

Suppose there are $n$ paths in total, $D_{1},\dots,D_{n}$, where $D_{j}$ is the $j$-th path on the time interval $[0,T_{j}]$. Suppose the kernel functions $\psi_{e^{\prime}e}(\cdot)$, $e^{\prime}=1,\dots,p$, $e=1,\dots,q$, are known. Let $\bm{\mu}=(\mu_{1},\dots,\mu_{q})^{\top}\in\mathbb{R}_{\geq 0}^{q}$ be the vector of baseline intensities of $\mathbf{N}_{\mathcal{E}_{\mathrm{c}}}(t)$ and $\bm{\alpha}_{e}=(\alpha_{1e},\dots,\alpha_{pe})^{\top}\in\mathbb{R}_{\geq 0}^{p}$ be the $e$-th column of $A$ for $e=1,\dots,q$. Learning the model becomes an optimization problem:

\[
\min_{\bm{\mu},\bm{\alpha}_{1},\dots,\bm{\alpha}_{q}\in\mathbb{R}_{\geq 0}^{p}}\frac{1}{n}\sum_{j=1}^{n}\Phi(D_{j};\bm{\mu},\bm{\alpha}_{1},\dots,\bm{\alpha}_{q}),
\]

where $\Phi(D_{j};\bm{\mu},\bm{\alpha}_{1},\dots,\bm{\alpha}_{q})$ is the loss function for the $j$-th path $D_{j}$. Consider a least-squares functional:

\[
\Phi(D_{j};\bm{\mu},\bm{\alpha}_{1},\dots,\bm{\alpha}_{q})=\sum_{e=1}^{q}\left\{\frac{1}{2}\int_{0}^{T_{j}}[\lambda_{e}(t\mid\mathcal{H}_{t}^{D_{j}})]^{2}\,dt-\int_{0}^{T_{j}}\lambda_{e}(t\mid\mathcal{H}_{t}^{D_{j}})\,dN^{D_{j}}_{e}(t)\right\}.
\]

Compared with the negative log-likelihood, the least-squares functional enjoys better computational efficiency. Its equivalent form was proposed for estimating the additive risk model by Lin & Ying (1994). It was also adopted by Hansen et al. (2015) and Bacry et al. (2020) for point process models.

Assuming that the edge set is sparse, we add sparsity constraints on the coefficients as follows:

\[
\min_{\bm{\mu},\bm{\alpha}_{1},\dots,\bm{\alpha}_{q}\in\mathbb{R}_{\geq 0}^{p}}\frac{1}{n}\sum_{j=1}^{n}\Phi(D_{j};\bm{\mu},\bm{\alpha}_{1},\dots,\bm{\alpha}_{q})+\sum_{e=1}^{q}\gamma_{e}\|\bm{\alpha}_{e}\|_{1},\tag{5}
\]

where $\gamma_{1},\dots,\gamma_{q}\geq 0$ are regularization parameters that control the individual sparsity of the coefficient vectors $\bm{\alpha}_{1},\dots,\bm{\alpha}_{q}$. It is worth pointing out that $\{\mu_{e},\bm{\alpha}_{e}\}_{e=1}^{q}$ are separable in the objective function (5). Thus, the optimization problem decomposes into node-wise model learning. For each node $e\in\mathcal{E}_{\mathrm{c}}$, learning its parent nodes yields

\[
\min_{\mu_{e}\geq 0,\,\bm{\alpha}_{e}\in\mathbb{R}_{\geq 0}^{p}}\frac{1}{n}\sum_{j=1}^{n}\phi_{e}(D_{j};\mu_{e},\bm{\alpha}_{e})+\gamma_{e}\|\bm{\alpha}_{e}\|_{1},\tag{6}
\]

where

\[
\phi_{e}(D_{j};\mu_{e},\bm{\alpha}_{e})=\frac{1}{2}\int_{0}^{T_{j}}[\lambda_{e}(t\mid\mathcal{H}_{t}^{D_{j}})]^{2}\,dt-\int_{0}^{T_{j}}\lambda_{e}(t\mid\mathcal{H}_{t}^{D_{j}})\,dN^{D_{j}}_{e}(t).
\]

In what follows, we write $\bm{\theta}=(\theta_{0},\theta_{1},\dots,\theta_{p})^{\top}=(\mu_{e},\bm{\alpha}_{e}^{\top})^{\top}\in\mathbb{R}_{\geq 0}^{p+1}$. Let $\bm{X}_{j}(t)=(X_{j,0}(t),X_{j,1}(t),\dots,X_{j,p}(t))^{\top}$ with

\[
X_{j,0}(t)=1,\quad X_{j,k}(t)=\int_{0}^{t}\psi_{ke}(t-u)\,dN^{D_{j}}_{k}(u),\quad k=1,\dots,p.
\]

Now the conditional intensity can be written as $\lambda_{e}(t\mid\mathcal{H}_{t}^{D_{j}})=\bm{\theta}^{\top}\bm{X}_{j}(t)$. Let $V=(V_{kk^{\prime}})_{k,k^{\prime}=0}^{p}\in\mathbb{R}^{(p+1)\times(p+1)}$ and $\bm{b}=(b_{0},\dots,b_{p})^{\top}\in\mathbb{R}^{p+1}$, where for $k,k^{\prime}=0,\dots,p$,

\[
V_{kk^{\prime}}=\frac{1}{n}\sum_{j=1}^{n}\int_{0}^{T_{j}}X_{j,k}(t)X_{j,k^{\prime}}(t)\,dt,\quad b_{k}=\frac{1}{n}\sum_{j=1}^{n}\int_{0}^{T_{j}}X_{j,k}(t)\,dN_{e}^{D_{j}}(t).
\]

Then the regularized solution satisfies

\[
\hat{\bm{\theta}}=\underset{\bm{\theta}\in\mathbb{R}_{\geq 0}^{p+1}}{\arg\min}\ \frac{1}{2}\bm{\theta}^{\top}V\bm{\theta}-\bm{b}^{\top}\bm{\theta}+\gamma_{e}\|\bm{\alpha}_{e}\|_{1}.\tag{7}
\]

The above problem is equivalent to a linearly constrained one:

\[
\min_{\bm{\theta}\in\mathbb{R}_{\geq 0}^{p+1},\,\bm{\alpha}_{e}=\bm{\alpha}_{e}^{\prime}}\ \frac{1}{2}\bm{\theta}^{\top}V\bm{\theta}-\bm{b}^{\top}\bm{\theta}+\gamma_{e}\|\bm{\alpha}_{e}^{\prime}\|_{1}.
\]

We apply the alternating direction method of multipliers (ADMM), where the corresponding augmented Lagrangian function is

\[
\mathcal{L}_{\eta}(\bm{\theta},\bm{\alpha}_{e}^{\prime},\bm{\omega})=\frac{1}{2}\bm{\theta}^{\top}V\bm{\theta}-\bm{b}^{\top}\bm{\theta}+\gamma_{e}\|\bm{\alpha}_{e}^{\prime}\|_{1}+\bm{\omega}^{\top}(\bm{\alpha}_{e}-\bm{\alpha}_{e}^{\prime})+\frac{\eta}{2}\|\bm{\alpha}_{e}-\bm{\alpha}_{e}^{\prime}\|_{2}^{2}.
\]

The learning algorithm is shown in Algorithm 3.

Algorithm 3 Graphical point process learning by ADMM
Input: paths $D_{1},\dots,D_{n}$ and the regularization parameter $\gamma_{e}$.
Pre-compute: $V$ and $\bm{b}$.
Initialize: set $\eta>0$ and proper initial values of $\mu_{e}$, $\bm{\alpha}_{e}$, $\bm{\alpha}_{e}^{\prime}$, and $\bm{\omega}$.
while not converged do
    $(\mu_{e},\bm{\alpha}_{e}^{\top})^{\top}=\Big[\big(V+\mathrm{diag}(0,\eta I_{p})\big)^{-1}\big(\bm{b}+(0,(\eta\bm{\alpha}_{e}^{\prime}-\bm{\omega})^{\top})^{\top}\big)\Big]_{+}$
    $\bm{\alpha}_{e}^{\prime}=(\bm{\alpha}_{e}+\eta^{-1}\bm{\omega}-\eta^{-1}\gamma_{e}\mathbbm{1}_{p})_{+}$
    $\bm{\omega}=\bm{\omega}+\eta(\bm{\alpha}_{e}-\bm{\alpha}_{e}^{\prime})$
end while
Return: $\bm{\theta}=(\mu_{e},\bm{\alpha}_{e}^{\top})^{\top}$.
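The three updates above amount to a ridge-regularized nonnegative quadratic step, a one-sided soft-thresholding step, and a dual ascent step. The following NumPy sketch illustrates them under our own naming and default choices (e.g., the stopping rule); it is not the paper's implementation:

```python
import numpy as np

def admm_graphical_pp(V, b, gamma_e, eta=1.0, n_iter=500, tol=1e-8):
    """Sketch of Algorithm 3: solve
       min_{theta >= 0} 0.5 theta' V theta - b' theta + gamma_e ||alpha_e||_1,
    where theta = (mu_e, alpha_e') and the baseline mu_e is not penalized."""
    p1 = V.shape[0]                     # p + 1
    theta = np.zeros(p1)                # (mu_e, alpha_e)
    alpha_prime = np.zeros(p1 - 1)
    omega = np.zeros(p1 - 1)
    # Augmented system matrix V + diag(0, eta, ..., eta); eta only on alpha block.
    M = V + np.diag(np.r_[0.0, np.full(p1 - 1, eta)])
    for _ in range(n_iter):
        # theta-update: solve the ridge system, then project onto the nonnegative orthant.
        rhs = b + np.r_[0.0, eta * alpha_prime - omega]
        theta = np.maximum(np.linalg.solve(M, rhs), 0.0)
        alpha = theta[1:]
        # alpha'-update: one-sided soft threshold (positive part), as in the algorithm.
        alpha_prime_new = np.maximum(alpha + omega / eta - gamma_e / eta, 0.0)
        # dual ascent on the consensus constraint alpha_e = alpha_e'.
        omega = omega + eta * (alpha - alpha_prime_new)
        converged = np.max(np.abs(alpha_prime_new - alpha_prime)) < tol
        alpha_prime = alpha_prime_new
        if converged:
            break
    return theta
```

With a diagonal $V$ the objective separates, so the fixed point can be checked coordinate-wise: the unpenalized baseline recovers $b_{0}$ and each penalized coordinate is soft-thresholded by $\gamma_{e}$.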

4.2 Asymptotic Properties

In this subsection, we derive the rate of convergence of the proposed estimator. The customer-initiated part of our model (1) is similar to the Hawkes process (Hawkes, 1971). Unlike the typical multivariate Hawkes process, however, model (1) is not guaranteed to be a stationary point process. As a result, the asymptotic results are derived under the assumption that the sample size $n\to\infty$, instead of the length of observation $T\to\infty$ (Guo et al., 2018; Yu et al., 2020). In survival analysis, Lin & Ying (1994) studied the additive risk model, and Lin & Lv (2013) established the consistency of the corresponding $L_{1}$-regularized estimator. Combining these works, we will show that under certain conditions, including irrepresentability, our proposed estimator is consistent in both the classical fixed-$p$ setting and the sparse high-dimensional setting, where $\log(p)$ is comparable to the sample size $n$.

Assumption 1.

(I.I.D.) The processes of the customer-initiated event types $\mathbf{N}_{\mathcal{E}_{\mathrm{c}}}^{D_{j}}(t)$, $j=1,\dots,n$, are independent and follow model (1).

The convergence properties of the estimator rely on the identical distribution of the observations. We do not need to assume the external (firm-initiated) events are I.I.D. across paths. They may vary from one path to another, but model (1) should hold for each individual.

Assumption 2.

(Bounded input) There exist constants $\overline{T}$ and $\overline{X}$ such that $T_{j}<\overline{T}$ and $\sup_{t\in(0,T_{j})}\|\bm{X}_{j}(t)\|_{\infty}<\overline{X}$ a.s. for $j=1,\dots,n$.

Define the active set $S=\{0\}\cup\{e^{\prime}\in\mathcal{E}:\alpha_{e^{\prime}e}>0\}$ and its complement $S^{\mathsf{c}}=\{e^{\prime}\in\mathcal{E}:\alpha_{e^{\prime}e}=0\}$. We use $s=\mathrm{Card}(S)$ to denote the cardinality of the active set $S$. Let $G=(G_{kk^{\prime}})_{k,k^{\prime}=0}^{p}=\mathbb{E}V$ be the population version of $V$. Assume the sub-matrix $G_{SS}=(G_{kk^{\prime}})_{k,k^{\prime}\in S}$ is non-singular and define $\kappa:=\|G_{SS}^{-1}\|_{1,\infty}$.

Assumption 3.

(Irrepresentability) There exists a constant $\xi\in(0,1)$ such that

\big\|G_{S^{\mathsf{c}}S}G_{SS}^{-1}\big\|_{1,\infty}<1-\xi.

This condition is adapted from Condition 3 in Lin & Lv (2013), which is a generalization of Condition (15) in Wainwright (2009) for linear regression with LASSO.

Now we establish the rate of convergence and the model selection consistency of the regularized estimator. With a slight abuse of notation, let $\bm{\theta}$ denote the true value. We consider two scenarios: the number of event types $p$ is fixed, or $p$ diverges while the active set $S$ is sparse in the sense that $s$ remains bounded.

Theorem 3.

Under Assumptions 1-3, there exist $C_{1}>0$ and $C_{2}>0$ such that the regularized estimator $\hat{\bm{\theta}}$ in (7) satisfies the following properties:

  1. (i) If $p$ is fixed, then for any constant $0<\nu<1$, when $\gamma_{e}$ is chosen properly and $n$ is sufficiently large, with probability at least $1-2(p+1)(p+2)\exp(-n^{\nu})$,

     • (Edge selection) $\hat{\bm{\theta}}_{S^{\mathsf{c}}}=0$;

     • ($L_{\infty}$-error) $\|\hat{\bm{\theta}}-\bm{\theta}\|_{\infty}\leq 10\xi^{-1}\kappa C_{1}^{-\frac{1}{2}}C_{2}n^{-\frac{1-\nu}{2}}$;

  2. (ii) If $p$ diverges, then for any constant $\zeta>2$, when $\gamma_{e}$ is chosen properly and $n$ is sufficiently large, with probability at least $1-3/(p+1)^{\zeta-2}$,

     • (Edge selection) $\hat{\bm{\theta}}_{S^{\mathsf{c}}}=0$;

     • ($L_{\infty}$-error) $\|\hat{\bm{\theta}}-\bm{\theta}\|_{\infty}\leq 10\xi^{-1}\kappa C_{1}^{-\frac{1}{2}}C_{2}\sqrt{\frac{\zeta\log(p+1)}{n}}$.

In the above theorem, we do not specify the choices of $\gamma_{e}$ and $n$ for ease of presentation; the detailed choices can be found in the supplementary materials.

We now turn to the rate of convergence of the attribution scores. For a positive path $D$ with a conversion at $t=t_{i^{\star}}$, we would like to analyze the direct removal effect of a subset $R\subseteq F_{t^{\star}}(D)$. In practice, for Equation (2), we can only obtain estimates of the conditional intensities. As a result, for model (1), we use the estimated direct removal effect given by

\widehat{\mathrm{att}}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R\mid D):=\frac{\sum_{(t_{i},e_{i})\in R}\hat{\alpha}_{e_{i}1}\psi_{e_{i}1}(t_{i^{\star}}-t_{i})}{\hat{\mu}_{1}+\sum_{i<i^{\star}}\hat{\alpha}_{e_{i}1}\psi_{e_{i}1}(t_{i^{\star}}-t_{i})}.

Let $r=\mathrm{Card}(R)$ denote the cardinality of the removal set $R$. For the kernel functions, suppose there exists $\overline{\psi}_{1}>0$ such that $\max_{1\leq e\leq p}\sup_{t>0}|\psi_{e1}(t)|<\overline{\psi}_{1}$.
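Given fitted parameters, the estimated direct removal effect is simply a ratio of kernel-weighted sums. A minimal Python sketch, where the exponential kernel and the dictionary layout for the fitted coefficients are illustrative assumptions on our part:

```python
import math

def exp_kernel(t, scale=10.0):
    """Illustrative exponential kernel psi(t) = (1/scale) exp(-t/scale) for t > 0."""
    return math.exp(-t / scale) / scale if t > 0 else 0.0

def direct_removal_effect(path, t_star, removal_set, mu1_hat, alpha1_hat,
                          kernel=exp_kernel):
    """Estimated direct removal effect: the share of the conversion intensity at
    t_star contributed by the removed touchpoints.
    path: list of (t_i, e_i) with t_i < t_star; removal_set: a subset of path;
    alpha1_hat: dict mapping event type e -> estimate of alpha_{e,1}."""
    denom = mu1_hat + sum(alpha1_hat[e] * kernel(t_star - t) for t, e in path)
    numer = sum(alpha1_hat[e] * kernel(t_star - t) for t, e in removal_set)
    return numer / denom
```

Since the baseline $\hat{\mu}_{1}$ always remains in the denominator, the score lies in $[0,1)$ and approaches one only when the baseline is negligible and the whole path is removed.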

Theorem 4.

Given a path $D$ with a conversion at $t=t_{i^{\star}}$, $i^{\star}>1$, under the above assumptions for $e=1$, there exists $C_{3}>0$ depending on $\overline{\psi}_{1}$ and $D$ such that the estimated direct removal effect of the removal set $R\subseteq F_{t^{\star}}(D)$ satisfies the following:

  1. (i) If $p$ is fixed and $n$ is sufficiently large, then for any constant $0<\nu<1$, with probability at least $1-2(p+1)(p+2)\exp(-n^{\nu})$,

     \left|\widehat{\mathrm{att}}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R\mid D)-\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R\mid D)\right|\leq 10\xi^{-1}\kappa C_{1}^{-\frac{1}{2}}C_{2}C_{3}rn^{-\frac{1-\nu}{2}};

  2. (ii) If $p$ diverges and $n$ is sufficiently large, then for any constant $\zeta>2$, with probability at least $1-3/(p+1)^{\zeta-2}$,

     \left|\widehat{\mathrm{att}}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R\mid D)-\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(direct)}}(R\mid D)\right|\leq 10\xi^{-1}\kappa C_{1}^{-\frac{1}{2}}C_{2}C_{3}r\sqrt{\frac{\zeta\log(p+1)}{n}}.

The constants $\xi$, $\kappa$, $C_{1}$, and $C_{2}$ come from Theorem 3 with $e=1$. Based on the analysis of the estimated direct removal effect, we then provide an error bound for the total removal effect in estimation. We refer to the estimated total removal effect as

\widehat{\mathrm{att}}_{t_{i^{\star}}}^{\mathrm{(total)}}(R\mid D):=\mathbb{E}\big[\widehat{\mathrm{att}}_{t_{i^{\star}}}^{\mathrm{(direct)}}(\widehat{R^{\diamond}}\mid D)\mid D\big],

where $\widehat{R^{\diamond}}\supseteq R$ is the actual removal set obtained by the thinning operation in Algorithm 1 using estimated thinning probabilities. Besides $\overline{\psi}_{1}$, for each $e=2,\dots,q$, suppose there exists $\overline{\psi}_{e}>0$ such that $\max_{1\leq e^{\prime}\leq p}\sup_{t>0}|\psi_{e^{\prime}e}(t)|<\overline{\psi}_{e}$.
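To build intuition for how $\widehat{R^{\diamond}}$ arises, the following is a schematic, hypothetical sketch of the thinning idea: each later event is itself removed with probability equal to the estimated share of its conditional intensity contributed by already-removed events. This is our simplified reading of the mechanism; Algorithm 1 itself is not reproduced here:

```python
import random

def thin_path(path, removal_set, mu_hat, alpha_hat, kernel, rng=None):
    """Sketch of the thinning operation producing the realized removal set.
    path: list of (t, e) events; removal_set: initially removed touchpoints;
    mu_hat[e]: estimated baseline of type e;
    alpha_hat[e_prime][e]: estimated excitation from type e_prime onto type e."""
    rng = rng or random.Random(0)
    removed = set(removal_set)
    for t, e in sorted(path):
        if (t, e) in removed:
            continue
        # Total estimated intensity of this event and the part due to removed ancestors.
        total = mu_hat[e] + sum(alpha_hat[ep][e] * kernel(t - s)
                                for s, ep in path if s < t)
        from_removed = sum(alpha_hat[ep][e] * kernel(t - s)
                           for s, ep in removed if s < t)
        # Remove this event with the estimated thinning probability.
        if total > 0 and rng.random() < from_removed / total:
            removed.add((t, e))
    return removed
```

Averaging the estimated direct removal effect over many such thinned draws then approximates the estimated total removal effect by Monte Carlo.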

Theorem 5.

Given a path $D$ with a conversion at $t=t_{i^{\star}}$, $i^{\star}>1$, and a removal set $R\subseteq F_{t^{\star}}(D)$, under the above assumptions for $e=1,\dots,q$, there exists $C_{4}>0$ depending on $\overline{\psi}_{1},\dots,\overline{\psi}_{q}$, $D$, and $R$ such that the estimated total removal effect of $R$ satisfies the following:

  1. (i) If $p$ is fixed and $n$ is sufficiently large, then for any constant $0<\nu<1$, with probability at least $1-2p(p+1)(p+2)\exp(-n^{\nu})$,

     \left|\widehat{\mathrm{att}}_{t_{i^{\star}}}^{\mathrm{(total)}}(R\mid D)-\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(total)}}(R\mid D)\right|\leq C_{4}n^{-\frac{1-\nu}{2}};

  2. (ii) If $p$ diverges and $n$ is sufficiently large, then for any constant $\zeta>3$, with probability at least $1-3/(p+1)^{\zeta-3}$,

     \left|\widehat{\mathrm{att}}_{t_{i^{\star}}}^{\mathrm{(total)}}(R\mid D)-\mathrm{att}_{t_{i^{\star}}}^{\mathrm{(total)}}(R\mid D)\right|\leq C_{4}\sqrt{\frac{\zeta\log(p+1)}{n}}.

5 Simulation Study

In this section, we carry out two simulation experiments to examine the performance of our graphical attribution methods. In the first part, we validate the proposed graphical attribution methods using data simulated from the multivariate Hawkes process. In the second part, we compare the proposed methods with the commonly used attribution models using data simulated from a modified version of the Digital Advertising System Simulation (DASS) developed by Google Inc. The simulated data includes online customer browsing behavior and injected advertising events that impact this customer behavior.

We first explain some channel-level metrics for attribution methods. In general, suppose there are $Z$ channels, labeled $z=1,\dots,Z$. Inspired by the previous literature (Anderl et al., 2016; Li & Kannan, 2014), we calculate the proportion of the channel-level conversion count (proportion of CCC) for each channel $z$:

\mathsf{p}_{z}:=\frac{\mathrm{CCC}_{z}}{\sum_{z^{\prime}=1}^{Z}\mathrm{CCC}_{z^{\prime}}}\quad\text{with}\quad\mathrm{CCC}_{z}=\sum_{j=1}^{n}N_{1}^{D_{j}}((0,T_{j}])-\sum_{j=1}^{n}N_{1}^{D_{j}^{(z\text{-off})}}((0,T_{j}]),

where $D_{j}^{(z\text{-off})}$ is a path following the same distribution as $D_{j}$ except that it has no touchpoints of channel $z$. $\mathrm{CCC}_{z}$ is interpreted as the number of conversions resulting from channel $z$: the first term is the number of conversions out of the $n$ paths, and the second term is the number of conversions out of the $n$ paths when channel $z$ is turned off. To obtain this quantity for synthetic data, we can disable all the related touchpoint types and run the simulator again with the same seed; the decrease in total conversions is then the desired value.

Let $\mathcal{C}_{z}\subseteq\mathcal{E}$ denote the set of touchpoint types belonging to channel $z$, for $z=1,\dots,Z$. Then the corresponding removal set with respect to the conversion at $t$ in path $D$ is $F^{\mathcal{C}_{z}}_{t}(D)=\{(t_{i},e_{i})\in D:t_{i}<t,\ e_{i}\in\mathcal{C}_{z}\}$. We use the proportion of the channel-level aggregated score (proportion of CAS) to estimate the proportion of CCC for channel $z$:

\hat{\mathsf{p}}_{z}:=\frac{\mathrm{CAS}_{z}}{\sum_{z^{\prime}=1}^{Z}\mathrm{CAS}_{z^{\prime}}}\quad\text{with}\quad\mathrm{CAS}_{z}=\sum_{j=1}^{n}\sum_{i\leq m_{j}:\ e_{i}^{j}=1}\widehat{\mathrm{att}}_{t_{i}^{j}}\big(F^{\mathcal{C}_{z}}_{t_{i}^{j}}(D_{j})\mid D_{j}\big).

Recall that the attribution score is an incremental component, or marginal loss, with respect to conversion. Its aggregated version, CAS, is the overall decrease in conversion counts for each channel compared with the total number of conversions and thus can be used to estimate CCC. Let $\bm{\mathsf{p}}=(\mathsf{p}_{1},\dots,\mathsf{p}_{Z})^{\top}$ denote the vector of proportions of CCC and $\hat{\bm{\mathsf{p}}}=(\hat{\mathsf{p}}_{1},\dots,\hat{\mathsf{p}}_{Z})^{\top}$ the vector of proportions of CAS. We adopt the KL divergence $D_{\mathrm{KL}}(\cdot\parallel\cdot)$ and the Hellinger distance $H(\cdot,\cdot)$ to measure the estimation accuracy, where

D_{\mathrm{KL}}(\bm{\mathsf{p}}\parallel\hat{\bm{\mathsf{p}}}):=\sum_{z=1}^{Z}\mathsf{p}_{z}\log\left(\frac{\mathsf{p}_{z}}{\hat{\mathsf{p}}_{z}}\right),\qquad H(\bm{\mathsf{p}},\hat{\bm{\mathsf{p}}}):=\sqrt{\frac{1}{2}\sum_{z=1}^{Z}\left(\sqrt{\hat{\mathsf{p}}_{z}}-\sqrt{\mathsf{p}_{z}}\right)^{2}}.
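Both discrepancy measures are direct transcriptions of the formulas above; a short sketch (assuming all estimated proportions are strictly positive so the KL divergence is finite):

```python
import math

def kl_divergence(p, p_hat):
    """KL divergence between two channel-proportion vectors (p_hat must be > 0)."""
    return sum(pz * math.log(pz / qz) for pz, qz in zip(p, p_hat) if pz > 0)

def hellinger(p, p_hat):
    """Hellinger distance between two channel-proportion vectors; lies in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(qz) - math.sqrt(pz)) ** 2
                               for pz, qz in zip(p, p_hat)))
```

Both quantities vanish when the estimated proportions match the ground truth exactly and grow as the two distributions diverge.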

5.1 Simulation Based on Hawkes Process

In this subsection, we verify our attribution methods using a dataset simulated from the multivariate Hawkes process (Hawkes, 1971).

Referring to Equation (1), our model reduces to a multivariate Hawkes process if $N_{e}(t)$ is a Poisson process for each $e\in\mathcal{E}_{\mathrm{f}}$; in other words, the multivariate Hawkes process is nested in our model. Therefore, we simulate a data set using a multivariate Hawkes process according to Figure 2, which involves $Z=2$ channels, display and search, and $4$ types of touchpoints: display impression, display click, search impression, and search click. Display impression is regarded as a firm-initiated event type, following a Poisson process with rate $0.02$. A total of $n=10{,}000$ paths are simulated with $T_{j}=365$ days for $j=1,\dots,n$. We take $\psi_{e^{\prime}e}(t)=\frac{1}{10}\exp(-\frac{t}{10})\cdot\mathbbm{1}_{\{t>0\}}$ for each pair of connected nodes. The baseline intensities of search impression and conversion are set to $0.02$ and $1\times 10^{-4}$, respectively, and the two click touchpoint types have zero baselines.
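To illustrate the kind of path generated here, the following is a minimal sketch of Ogata-style thinning for a univariate self-exciting process with the same exponential kernel; the paper's simulator is multivariate, and the parameter values below are arbitrary assumptions for illustration:

```python
import math
import random

def simulate_hawkes(mu, alpha, scale=10.0, T=365.0, rng=None):
    """Ogata-style thinning for a univariate Hawkes process with kernel
    psi(t) = (1/scale) exp(-t/scale) and branching ratio alpha (< 1)."""
    rng = rng or random.Random(1)
    events, t = [], 0.0
    while t < T:
        # Intensity at the current time dominates the intensity at any later
        # time until the next event, because the kernel is decreasing.
        lam_bar = mu + sum(alpha * math.exp(-(t - s) / scale) / scale
                           for s in events)
        t += rng.expovariate(lam_bar)       # next candidate event time
        if t >= T:
            break
        lam_t = mu + sum(alpha * math.exp(-(t - s) / scale) / scale
                         for s in events)
        if rng.random() < lam_t / lam_bar:  # accept with probability lam_t / lam_bar
            events.append(t)
    return events
```

With $\alpha=0$ this reduces to a homogeneous Poisson process; a positive branching ratio inflates the expected count by the cluster-size factor $1/(1-\alpha)$.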

From \ To          | Display click | Search impression | Search click | Conversion
Display impression | 0.08          | 0.08              | 0            | 0.01
Display click      | 0             | 0                 | 0            | 0.08
Search impression  | 0             | 0                 | 0.08         | 0.02
Search click       | 0             | 0                 | 0            | 0.1
Conversion         | 0             | 0                 | 0            | 0
Table 1: Granger causality coefficients of the simulated data.
Channel: $\mathsf{p}_{z}$ | TRE             | DRE
Display: 0.3799           | 0.3782 (0.0104) | 0.3491 (0.0112)
Search: 0.6201            | 0.6218 (0.0104) | 0.6509 (0.0112)
KL divergence             | 0.0002 (0.0002) | 0.0023 (0.0015)
Hellinger distance        | 0.0064 (0.0042) | 0.0226 (0.0083)
Table 2: Comparison of the proportions of CAS between the total removal effect and the direct removal effect. Reported numbers are averages over 100 independent runs, with standard errors given in parentheses. The first column lists the proportions of channel-level conversion counts (CCC) as the ground truth.

The Granger causality coefficients are given in Table 1, and the simulation results, calculated over 100 independent runs, are summarized in Table 2. As shown in Table 2, our graphical TRE attribution method, which takes into account the Granger causality among different types of touchpoints, is accurate in estimating the true removal effects of both the search and display channels. The proportions of CAS for both channels calculated by TRE are very close to the ground truth, and the KL divergence and Hellinger distance of TRE are very small. This is not surprising since Theorem 2 guarantees that $D_{j}\setminus(F^{\mathcal{C}_{z}}_{T_{j}}(D_{j}))^{\diamond}$ and $D_{j}^{(z\text{-off})}$ are identically distributed. In contrast, the graphical DRE method tends to underestimate the contribution of the display channel because it ignores the exciting effect of the display impression on the search impression.

5.2 Simulation Based on DASS

Next, we compare our graphical attribution methods with existing methods, including DNAMTA (Li et al., 2018), logistic regression, and the Markov model (Anderl et al., 2016), as well as the rule-based methods: last-touch, first-touch, linear, time-decay, and U-shaped. Table 3 describes the models under comparison. Among them, the Markov model provides channel-level scores directly, and the others provide path-level scores that can be aggregated to the channel level.

Method   | Type        | Scoring Description
TRE      | Data-driven | Total removal effect of the graphical point process model.
DRE      | Data-driven | Direct removal effect of the graphical point process model.
DNAMTA   | Data-driven | An incremental score derived from the conversion probability of the Deep Neural Net With Attention multi-touch attribution model developed in Li et al. (2018).
Logistic | Data-driven | An incremental score derived from the conversion probability of logistic regression.
Markov   | Data-driven | Removal effect of the Markov model developed in Anderl et al. (2016).
Last     | Rule-based  | Last-touch attribution, assigning all credit to the touchpoint closest to the conversion.
First    | Rule-based  | First-touch attribution, assigning all credit to the initial touchpoint on a path.
Linear   | Rule-based  | Linear attribution, assigning equal credit to each touchpoint before the conversion.
Decay    | Rule-based  | Time-decay attribution, where touchpoints closer to the conversion receive more credit than touchpoints that are farther away in time from the conversion.
U-shaped | Rule-based  | U-shaped attribution, assigning 40% of the credit to each of the first and last touchpoints, with the other touchpoints splitting the remaining 20% equally.
Table 3: Description of attribution methods under comparison.
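The rule-based schemes in Table 3 can be written down in a few lines. A sketch of the credit shares assigned to the touchpoints preceding a conversion; the time-decay half-life is an arbitrary assumption, since the paper does not specify one:

```python
def rule_based_credit(touch_times, t_conv, rule, half_life=7.0):
    """Credit shares for touchpoints before a conversion, for the common
    rule-based schemes. touch_times: sorted times < t_conv."""
    m = len(touch_times)
    if m == 0:
        return []
    if rule == "last":                       # all credit to the final touchpoint
        return [0.0] * (m - 1) + [1.0]
    if rule == "first":                      # all credit to the initial touchpoint
        return [1.0] + [0.0] * (m - 1)
    if rule == "linear":                     # equal split
        return [1.0 / m] * m
    if rule == "decay":                      # weight halves per half_life of recency gap
        w = [2 ** (-(t_conv - t) / half_life) for t in touch_times]
        s = sum(w)
        return [x / s for x in w]
    if rule == "u_shaped":                   # 40% / 40% endpoints, 20% split in middle
        if m == 1:
            return [1.0]
        if m == 2:
            return [0.5, 0.5]
        mid = 0.2 / (m - 2)
        return [0.4] + [mid] * (m - 2) + [0.4]
    raise ValueError(f"unknown rule: {rule}")
```

Each scheme returns shares summing to one per conversion; aggregating these shares by channel across paths yields the channel-level scores compared in Table 4.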

For model comparison, we simulate data from a modified version of the Digital Advertising System Simulation (DASS). DASS (Sapp et al., 2016), developed by Google Inc., is a popular attribution simulator in the industry whose effectiveness is well accepted by practitioners. "It generates the data to which observational models can be applied, as well as the ability to run virtual experiments with simulated customers to measure the actual incremental value of marketing for direct comparison" (Sapp et al., 2016). We modified DASS to add two desired features and call the result DASS+. First, DASS simulates transitions between the browsing states of each customer without any clear regard for timestamps, while DASS+ uses a transition matrix reflecting these browsing states in each minute. Second, DASS has no explicit restriction on the number of advertisements that can be served; with DASS+, the number of ads is capped at a fixed amount, and the ads can be served in a pre-determined distribution across the simulation timeframe.

This synthetic data involves $Z=4$ channels (email, display, search, and social) including $9$ types of touchpoints: email sent, email open, email click, display impression, display click, search impression, search click, social impression, and social click. With all channels turned on, we obtain $n=98{,}986$ valid paths out of $100{,}000$ customers; among them, there are $62{,}287$ positive paths and $36{,}699$ negative paths. The simulation period is $90$ days for each path. For model learning, we use the timestamp of the very first event as the starting time ($t_{1}=0$) and the timestamp of the final event on the path as the terminal time.

Figure 8 shows the Granger causality graph learned by our proposed model for the simulated data. Most types of touchpoints have exciting effects on conversion. Similar to the first simulation study, we confirm the intra-channel carry-over effects for every channel, even with more channels present, and we confirm the inter-channel spill-over effects from other channels to the search impression.

Figure 8: The Granger causality graph for simulated data.
Channel: $\mathsf{p}_{z}$ | TRE | DRE | DNAMTA | Logistic | Markov | Last | First | Linear | Decay | U-shaped
Display: 0.207 | 0.185 (0.006) | 0.137 (0.006) | 0.172 (0.027) | 0.275 (0.004) | 0.221 (0.000) | 0.371 (0.002) | 0.397 (0.003) | 0.380 (0.001) | 0.372 (0.001) | 0.383 (0.001)
Email: 0.171 | 0.139 (0.008) | 0.101 (0.007) | 0.130 (0.019) | 0.217 (0.004) | 0.356 (0.000) | 0.277 (0.001) | 0.293 (0.002) | 0.288 (0.001) | 0.282 (0.001) | 0.286 (0.001)
Search: 0.485 | 0.574 (0.006) | 0.680 (0.007) | 0.579 (0.034) | 0.324 (0.003) | 0.206 (0.000) | 0.135 (0.001) | 0.078 (0.001) | 0.108 (0.000) | 0.127 (0.000) | 0.107 (0.001)
Social: 0.137 | 0.102 (0.009) | 0.082 (0.008) | 0.119 (0.028) | 0.183 (0.004) | 0.217 (0.000) | 0.217 (0.002) | 0.232 (0.002) | 0.224 (0.001) | 0.219 (0.001) | 0.224 (0.001)
KL divergence | 0.008 (0.002) | 0.011 (0.001) | 0.012 (0.007) | 0.024 (0.002) | 0.093 (0.002) | 0.154 (0.004) | 0.255 (0.005) | 0.194 (0.004) | 0.164 (0.004) | 0.196 (0.004)
Hellinger distance | 0.067 (0.007) | 0.141 (0.007) | 0.078 (0.023) | 0.117 (0.005) | 0.226 (0.003) | 0.277 (0.004) | 0.342 (0.003) | 0.306 (0.003) | 0.285 (0.003) | 0.307 (0.003)
Table 4: Comparison of proportions of the channel-level aggregated score (CAS) for different methods. Reported numbers are averages over 10 independent runs, with standard errors given in parentheses. The first column lists the proportions of channel-level conversion counts (CCC) as the ground truth.

Table 4 summarizes the proportions of CAS, as well as the KL divergence and the Hellinger distance, for the different methods, calculated over 10 independent runs. Our two graphical attribution methods achieve the most accurate results. The rule-based methods (i.e., last-touch, first-touch, linear, time-decay, and U-shaped) underestimate the contribution of the search channel and overestimate those of the other channels, particularly the display channel. They are unable to take the baseline effect into consideration and thus are outperformed by the other methods. Among all the methods, our graphical attribution methods have the smallest estimation errors, which demonstrates the advantage of our proposed graphical attribution methods in measuring channels' contributions to conversions.

6 Real Application

In this section, we apply the proposed methods to a real-world use case. The data concern paid conversions of an online subscription product of a Fortune 500 company within $4$ consecutive months. There are $2{,}887{,}657$ paths and $74{,}440$ conversions in total.

The touchpoints belong to $Z=4$ channels with more specific details. For the search channel, a branded search advertises a specific company or product, while a non-branded search is a generic search result, not for a specific company or product. For the social channel, an owned social click is within the company's control and not paid for (e.g., a corporate LinkedIn post), and an earned social click is outside the company's control but also not paid for (e.g., a third party sharing a corporate LinkedIn post). For the email channel, awareness means the ad is simply trying to make a customer aware of products; a promotion email means that a discount-priced product is being offered; and call to action means the customer is already familiar with the product, and the ad contains a specific call to action (e.g., buy now).

Figure 9: The Granger causality graph for real data.

The learned Granger causality graph is shown in Figure 9. Besides the excitation from touchpoints to a potential conversion, the graphical point process model finds interactions between touchpoints within and across channels. For example, an awareness email open touchpoint may increase the chance of opening a promotion email, and a display impression can trigger search clicks and awareness email opens. There is also a self-loop for the branded search click, meaning that clicks of this type tend to appear in clusters. Figure 10 visualizes the proportions of CAS among the different methods. The two graphical attribution methods and DNAMTA produce similar results, giving the search channel the highest proportion of credit ($\geq 70\%$). As for the display channel, compared with the direct removal effect ($10.7\%$), the total removal effect assigns more credit ($14.2\%$) since a display impression may trigger a search click. Logistic regression emphasizes the importance of the display channel more than the other algorithmic methods, assigning the display channel a score ($33.6\%$) close to that of the search channel ($44.8\%$). The rule-based methods, together with the Markov model, tend to give the highest scores to the email channel ($\geq 38\%$).

Figure 10: The comparison of the proportions of CAS between graphical attribution methods and other methods. Each bar represents the proportions of CAS for an attribution method, with four channels colored differently in four parts.

Based on the Granger causality graph, there indeed exists a hierarchical structure among the event types, and thus it is necessary to build a model whose response is more than just conversion. The channel-level aggregated scores obtained from our graphical methods reflect that the search channel is the most effective channel. The total removal effect emphasizes the importance of the display channel, which may play a key role in the early stage of a positive path.

7 Conclusion

In this paper, we propose a novel graphical point process framework for multi-touch attribution. First, we develop a graphical point process model to analyze customer-level path-to-purchase data. The graphical model utilizes Granger causality to reveal the exciting effects among touchpoints as well as the direct conversion effects of numerous types of touchpoints. Then, in the point process framework, we further propose graphical attribution methods to allocate proper conversion credit to individual touchpoints and the corresponding channels for each customer's path to purchase. Our proposed attribution methods consider the attribution score as the removal effect, and we study two types of removal effects, providing both their probabilistic definitions and their mathematical forms. We develop a new efficient thinning-based simulation method and a backpropagation algorithm for the calculation. We employ a regularization method to select edges and estimate parameters simultaneously, and we develop an ADMM algorithm to solve this optimization problem with the desired computational efficiency and scalability. In addition, we provide a theoretical guarantee by establishing the asymptotic theory for the parameter estimates.

References

  • Anderl et al. (2016) Anderl, E., Becker, I., Von Wangenheim, F. & Schumann, J. H. (2016), ‘Mapping the customer journey: Lessons learned from graph-based online attribution modeling’, International Journal of Research in Marketing 33(3), 457–474.
  • Bacry et al. (2020) Bacry, E., Bompaire, M., Gaïffas, S. & Muzy, J.-F. (2020), ‘Sparse and low-rank multivariate hawkes processes’, Journal of Machine Learning Research 21(50), 1–32.
  • Berman (2018) Berman, R. (2018), ‘Beyond the last touch: Attribution in online advertising’, Marketing Science 37(5), 771–792.
  • Bowman & Narayandas (2001) Bowman, D. & Narayandas, D. (2001), ‘Managing customer-initiated contacts with manufacturers: The impact on share of category requirements and word-of-mouth behavior’, Journal of Marketing Research 38(3), 281–297.
  • Breuer et al. (2011) Breuer, R., Brettel, M. & Engelen, A. (2011), ‘Incorporating long-term effects in determining the effectiveness of different types of online advertising’, Marketing Letters 22(4), 327–340.
  • Dalessandro et al. (2012) Dalessandro, B., Perlich, C., Stitelman, O. & Provost, F. (2012), Causally motivated attribution for online advertising, in ‘Proceedings of the sixth International Workshop on Data Mining for Online Advertising and Internet Economy’, pp. 1–9.
  • Danaher & van Heerde (2018) Danaher, P. J. & van Heerde, H. J. (2018), ‘Delusion in attribution: Caveats in using attribution for multimedia budget allocation’, Journal of Marketing Research 55(5), 667–685.
  • De Haan et al. (2016) De Haan, E., Wiesel, T. & Pauwels, K. (2016), ‘The effectiveness of different forms of online advertising for purchase conversion in a multiple-channel attribution framework’, International Journal of Research in Marketing 33(3), 491–507.
  • Didelez (2008) Didelez, V. (2008), ‘Graphical models for marked point processes based on local independence’, Journal of the Royal Statistical Society: Series B 70(1), 245–264.
  • Eichler (2012) Eichler, M. (2012), ‘Graphical modelling of multivariate time series’, Probability Theory and Related Fields 153(1), 233–268.
  • Gaur & Bharti (2020) Gaur, J. & Bharti, K. (2020), ‘Attribution modelling in marketing: Literature review and research agenda’, Academy of Marketing Studies Journal 24(4), 1–21.
  • Granger (1969) Granger, C. W. (1969), ‘Investigating causal relations by econometric models and cross-spectral methods’, Econometrica 37(3), 424–438.
  • Granger (1980) Granger, C. W. (1980), ‘Testing for causality: a personal viewpoint’, Journal of Economic Dynamics and Control 2, 329–352.
  • Granger (1988) Granger, C. W. (1988), ‘Some recent development in a concept of causality’, Journal of Econometrics 39(1-2), 199–211.
  • Gunawardana et al. (2011) Gunawardana, A., Meek, C. & Xu, P. (2011), ‘A model for temporal dependencies in event streams’, Advances in Neural Information Processing Systems 24, 1962–1970.
  • Guo et al. (2018) Guo, X., Hu, A., Xu, R. & Zhang, J. (2018), ‘Consistency and computation of regularized mles for multivariate hawkes processes’, arXiv preprint arXiv:1810.02955 .
  • Hansen et al. (2015) Hansen, N. R., Reynaud-Bouret, P. & Rivoirard, V. (2015), ‘Lasso and probabilistic inequalities for multivariate point processes’, Bernoulli 21(1), 83–143.
  • Hawkes (1971) Hawkes, A. G. (1971), ‘Spectra of some self-exciting and mutually exciting point processes’, Biometrika 58(1), 83–90.
  • Ji et al. (2016) Ji, W., Wang, X. & Zhang, D. (2016), A probabilistic multi-touch attribution model for online advertising, in ‘Proceedings of the 25th ACM International on Conference on Information and Knowledge Management’, pp. 1373–1382.
  • Kakalejčík et al. (2018) Kakalejčík, L., Bucko, J., Resende, P. A. & Ferencova, M. (2018), ‘Multichannel marketing attribution using markov chains’, Journal of Applied Management and Investments 7(1), 49–60.
  • Kannan et al. (2016) Kannan, P., Reinartz, W. & Verhoef, P. C. (2016), ‘The path to purchase and attribution modeling: Introduction to special section’, International Journal of Research in Marketing 33(3), 449–456.
  • Kireyev et al. (2016) Kireyev, P., Pauwels, K. & Gupta, S. (2016), ‘Do display ads influence search? attribution and dynamics in online advertising’, International Journal of Research in Marketing 33(3), 475–490.
  • Kumar et al. (2020) Kumar, S., Gupta, G., Prasad, R., Chatterjee, A., Vig, L. & Shroff, G. (2020), Camta: Causal attention model for multi-touch attribution, in ‘2020 International Conference on Data Mining Workshops (ICDMW)’, IEEE, pp. 79–86.
  • Lewis & Shedler (1979) Lewis, P. W. & Shedler, G. S. (1979), ‘Simulation of nonhomogeneous poisson processes by thinning’, Naval Research Logistics Quarterly 26(3), 403–413.
  • Li & Kannan (2014) Li, H. & Kannan, P. (2014), ‘Attributing conversions in a multichannel online marketing environment: An empirical model and a field experiment’, Journal of Marketing Research 51(1), 40–56.
  • Li et al. (2018) Li, N., Arava, S. K., Dong, C., Yan, Z. & Pani, A. (2018), ‘Deep neural net with attention for multi-channel multitouch attribution’, arXiv preprint arXiv:1809.02230 .
  • Lin & Ying (1994) Lin, D. Y. & Ying, Z. (1994), ‘Semiparametric analysis of the additive risk model’, Biometrika 81(1), 61–71.
  • Lin & Lv (2013) Lin, W. & Lv, J. (2013), ‘High-dimensional sparse additive hazards regression’, Journal of the American Statistical Association 108(501), 247–264.
  • Naik & Raman (2003) Naik, P. A. & Raman, K. (2003), ‘Understanding the impact of synergy in multimedia communications’, Journal of Marketing Research 40(4), 375–388.
  • Ogata (1981) Ogata, Y. (1981), ‘On Lewis’ simulation method for point processes’, IEEE Transactions on Information Theory 27(1), 23–31.
  • Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986), ‘Learning representations by back-propagating errors’, Nature 323(6088), 533–536.
  • Sapp et al. (2016) Sapp, S., Vaver, J., Shi, M. & Bathia, N. (2016), DASS: Digital advertising system simulation, Technical report, Google Inc.
  • Shao & Li (2011) Shao, X. & Li, L. (2011), Data-driven multi-touch attribution models, in ‘Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining’, pp. 258–264.
  • Shapley (1953) Shapley, L. S. (1953), A value for n-person games, in H. W. Kuhn & A. W. Tucker, eds, ‘Contributions to the Theory of Games II’, Princeton University Press, pp. 307–317.
  • Sims (1972) Sims, C. A. (1972), ‘Money, income, and causality’, American Economic Review 62(4), 540–552.
  • Singal et al. (2022) Singal, R., Besbes, O., Desir, A., Goyal, V. & Iyengar, G. (2022), ‘Shapley meets uniform: An axiomatic framework for attribution in online advertising’, Management Science 68(10), 7457–7479.
  • Wainwright (2009) Wainwright, M. J. (2009), ‘Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$-constrained quadratic programming (lasso)’, IEEE Transactions on Information Theory 55(5), 2183–2202.
  • Wiesel et al. (2011) Wiesel, T., Pauwels, K. & Arts, J. (2011), ‘Practice prize paper—marketing’s profit impact: Quantifying online and off-line funnel progression’, Marketing Science 30(4), 604–611.
  • Xu et al. (2016) Xu, H., Farajtabar, M. & Zha, H. (2016), Learning Granger causality for Hawkes processes, in ‘International Conference on Machine Learning’, PMLR, pp. 1717–1726.
  • Xu et al. (2014) Xu, L., Duan, J. A. & Whinston, A. (2014), ‘Path to purchase: A mutually exciting point process model for online advertising and conversion’, Management Science 60(6), 1392–1412.
  • Yang & Ghose (2010) Yang, S. & Ghose, A. (2010), ‘Analyzing the relationship between organic and sponsored search advertising: Positive, negative, or zero interdependence?’, Marketing Science 29(4), 602–623.
  • Yu et al. (2020) Yu, X., Shanmugam, K., Bhattacharjya, D., Gao, T., Subramanian, D. & Xue, L. (2020), Hawkesian graphical event models, in ‘International Conference on Probabilistic Graphical Models’, PMLR, pp. 569–580.
  • Zhang et al. (2013) Zhang, Y., Bradlow, E. T. & Small, D. S. (2013), ‘New measures of clumpiness for incidence data’, Journal of Applied Statistics 40(11), 2533–2548.
  • Zhang et al. (2014) Zhang, Y., Wei, Y. & Ren, J. (2014), Multi-touch attribution in online advertising with survival theory, in ‘2014 IEEE International Conference on Data Mining’, IEEE, pp. 687–696.
  • Zhao et al. (2019) Zhao, K., Mahboobi, S. H. & Bagheri, S. R. (2019), ‘Revenue-based attribution modeling for online advertising’, International Journal of Market Research 61(2), 195–209.