Treatment Effect Estimation Amidst Dynamic Network Interference in Online Gaming Experiments
Abstract.
The evolving landscape of online multiplayer gaming presents unique challenges in assessing the causal impacts of game features. Traditional A/B testing methodologies fall short due to complex player interactions, leading to violations of fundamental assumptions like the Stable Unit Treatment Value Assumption (SUTVA). Unlike traditional social networks with stable and long-term connections, networks in online games are often dynamic and short-lived. Players are temporarily teamed up for the duration of a game, forming transient networks that dissolve once the game ends. This fleeting nature of interactions presents a new challenge compared with running experiments in a stable social network. This study introduces a novel framework for treatment effect estimation in online gaming environments, considering the dynamic and ephemeral network interference that occurs among players. We propose an innovative estimator tailored for scenarios where a completely randomized experimental design is implemented without explicit knowledge of network structures. Notably, our method facilitates post-hoc interference adjustment on experimental data, significantly reducing the complexities and costs associated with intricate experimental designs and randomization strategies. The proposed framework stands out for its ability to accommodate varying levels of interference, thereby yielding more accurate and robust estimations. Through comprehensive simulations set against a variety of interference scenarios, along with empirical validation using real-world data from a mobile gaming environment, we demonstrate the efficacy of our approach. This study represents a pioneering effort in exploring causal inference in user-randomized experiments impacted by dynamic network effects.
1. Introduction
A/B testing is heavily used to drive product decisions in the dynamic landscape of technology and digital platforms. Central to the effectiveness of these tests is the Stable Unit Treatment Value Assumption (SUTVA) (Rubin, 1980), which posits that the outcome of any unit (e.g., a user) in an experiment is unaffected by the treatment assignment of other units. This assumption is critical for unbiased estimation of treatment effects. However, in many real-world applications of A/B testing, particularly within online gaming, the SUTVA assumption is commonly violated.
Online games, such as multiplayer online battle arena (MOBA) games, frequently employ matchmaking strategies to group players into teams. Connections and interactions between players are temporarily formed for the span of a single game session, and players' experiences are inherently affected by teammates and adversaries. For instance, consider an online experiment to assess the impact of game difficulty, a scenario where players in the treatment group are assigned a less challenging version of the game. The assignment to treatment and control groups is delineated before the game starts. However, once the game session unfolds, the game difficulty level may be re-defined based on all the players' treatment assignments. In such a team-oriented environment, the exposure of a single player to a lower difficulty setting can ripple through the entire team, altering the actual treatment receipt of all involved and potentially affecting the game's outcome. Since the data are collected afterwards, interference among players becomes an inevitable issue. Furthermore, the network structure within online gaming is inherently dynamic and short-lived, continually evolving as game sessions conclude and new ones begin. Each game session reshapes the network connections as players disperse and regroup. During the experiment period, players may enter multiple games and receive different numbers of in-game treatments. This post-hoc network interference, stemming from the interactions between players and complicated by the number of treatments players received, may lead to biased estimates of causal effects.
Recently, various experimental designs have been proposed to address the challenges posed by network interference, particularly in terms of mitigating spillover effects and enhancing the accuracy of treatment effect estimation. Notably, cluster-based randomized experiments (e.g., Aronow and Middleton, 2013; Eckles et al., 2016; Ugander et al., 2013; Walker and Muchnik, 2014) have been at the forefront of these developments. These experiments group subjects into clusters based on their network connections, and then randomize the treatment at the cluster level rather than the individual level. Multi-level experiments (Hudgens and Halloran, 2008) introduce sophisticated frameworks where interventions are administered at multiple hierarchical levels within the network. Additionally, mixed experiments, as explored by Karrer et al. (2020), simultaneously employ both unit-level and cluster-level randomization to capture a broader range of network effects.
Despite the innovative nature of these online experimental designs, there can be methodological, ethical, and cost-related challenges in applying these designs in the online gaming environment. For instance, it is not realistic to completely isolate players into distinct clusters without affecting their typical gaming experience. Players often interact in a dynamic, interconnected online space, making it difficult to create clear-cut, isolated clusters for experimentation. Additionally, cluster-based or multi-level experiments are typically inefficient to implement both computationally and financially.
More recent studies have focused on causal effect estimation under interference. One line of research particularly emphasized the concept of partial interference, which assumes that network interference is restricted to non-overlapping groups (e.g., Sobel, 2006; Tchetgen and VanderWeele, 2012; Liu and Hudgens, 2014; Basse and Feller, 2017). Another line of work studied causal estimation in a more generalized setting, with arbitrary interference or with specific network graphs introduced as extra information into the models (e.g., Bowers et al., 2013; Manski, 2013; Goldsmith-Pinkham and Imbens, 2013; Forastiere et al., 2016). Other papers discussed causal effect estimation under potentially mis-specified or unknown network interference (e.g., Halloran and Hudgens, 2016; Choi, 2016; Forastiere et al., 2016). Complementing these studies, Karwa and Airoldi (2018) presented an innovative approach for causal estimation by developing a semi-parametric form of the potential outcomes, in which the potential outcome function includes a specification of the exposure mapping with network neighborhood information. Sävje et al. (2017) investigated the large sample properties of generalized treatment effect estimation under an unknown interference structure.
While the existing literature offers valuable insights into causal inference in the presence of network interference, treatment effect estimation in dynamic environments like online gaming remains less explored and comes with its own set of unique challenges. Specifically, within a short period, players repeatedly start new gaming sessions with potentially different team members. In contrast to applications like traditional social networks with a static and long-term network structure (e.g., Xu et al., 2015; Gui et al., 2015), the mechanisms of interference are not well-defined or consistent over time, so all players can be subject to treatment spillover through a random process that experiment designers cannot control. Nevertheless, we adopt a framework similar to some of the previous work and define a more general exposure mapping in such settings.
Causal analysis within the literature on online gaming has been sparse. Meng et al. (2019) proposed 'exCause', a general causal analysis framework for real-time game sessions. He et al. (2021) employed causal inference techniques, specifically the estimation of heterogeneous treatment effects, to assess the impact of software updates and patches in games. Dong et al. (2020) introduced attention neural networks to refine the estimation of causal effects using mobile gaming data. However, the network interference problem for causal inference in online gaming remains unexplored.
In this study, we introduce an innovative framework that combines causal inference under network interference with the context of online gaming. Our main contributions can be summarized as follows:
• We formalize the problem of treatment effect estimation in online gaming with post-hoc network interference, where the networks are dynamic and ephemeral.
• We develop an estimator for the treatment effect under interference, specifically when a completely randomized experimental design is required and the network structure is not explicitly known.
• We evaluate the proposed estimator through a series of simulation studies with various interference settings, and demonstrate the accuracy and robustness of our approach, especially in comparison to more naive methodologies.
• We validate the effectiveness of the proposed framework through its deployment on real-world data from a mobile gaming online experiment.
The rest of the paper is organized as follows. In Section 2, we introduce the experimentation process in online mobile games, and discuss the limitations of the naive estimator for the Average Treatment Effect (ATE). In Section 3, we define the specific causal estimand of interest and describe our proposed framework for estimation in detail. In Section 4, we conduct a series of simulation studies under various interference scenarios and compare the performance of different estimators. Section 5 includes the application and validation of our proposed estimator using data from a real-world mobile gaming experiment, thereby demonstrating its practical efficacy.
2. Online Experimentation of the Mobile Game
Mobile gaming has seen unprecedented growth in recent years, becoming a significant part of the digital entertainment industry. Specifically, MOBA games have not only attained immense popularity but also cultivated a vast, global player community, largely due to their captivating gameplay mechanics and dynamic player interaction. Take a 5v5 game as an example: 5 players work together as a team to achieve objectives and defeat the opposing team. In this study, we ignore the influence of the opposing team and treat the games as 'Player versus Environment' (PVE).
In this study, we focus on the causal analysis of the ‘treated game’ feature, an intervention designed to enhance player engagement and promote a more active gaming community. Specifically, when the ‘treated game’ is activated in a game session, it modulates the game’s experience level for the players. In an ideal scenario without interference among players, those in the treatment group would consistently participate in the ‘treated game’, while those in the control group would never engage in the ‘treated game’.
Our experimental design incorporates a substantial segment of the online traffic, utilizing 40% of it to gauge the effects of our treatment. This allocation is divided evenly, with 20% of users randomly assigned to the treatment group and the remaining 20% serving as a control group for a comparative analysis. This process can be seen as the initial stage of the randomized treatment assignment mechanism.
In the second stage, the players engage in games with an unknown team-matching process. The actual treatment status, or in other words, the activation status of the ‘treated game’, for each player in a given game is determined based on a specific criterion: all team members will receive the treatment if at least one member was initially designated to receive the treatment. Under this criterion, players who were initially assigned to the treatment group will consistently receive the treatment throughout the experiment. In contrast, those initially designated as ‘control’ may experience a shift in experience due to the influence of their teammates from the treatment group. During the experimentation period, each player may participate in multiple gaming sessions. An illustration of this process is shown in Figure 1.
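The second-stage activation rule described above can be sketched in a few lines. The helper below is illustrative (the function and variable names are ours, not from the production system): it marks a session as treated whenever at least one team member was initially assigned to treatment, and counts the number of treated games each player experiences.

```python
from collections import defaultdict

def run_sessions(assignment, sessions):
    """Apply the activation rule: a session is 'treated' if at least one
    team member was initially assigned to treatment. Returns the number
    of treated games each player experienced (players with zero treated
    games are omitted). Illustrative helper, not from the paper.

    assignment: dict player_id -> 0/1 initial assignment
    sessions: list of teams, each a list of player_ids
    """
    treated_games = defaultdict(int)
    for team in sessions:
        if any(assignment[p] == 1 for p in team):
            for p in team:
                treated_games[p] += 1
    return dict(treated_games)
```

In this sketch, a player assigned to control accumulates treated games only through teammates, which is exactly the spillover that produces the control-mixed group described below.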

Figure 1. An illustration of the randomized experiment process with network interference during online games.
Under this setting, players can be further categorized into three distinct groups based on their original treatment assignment and their engagement with the ‘treated game’ feature:
• Treatment Group: Players with an initial treatment assignment who consistently participate in the 'treated game';
• Control-Mixed Group: Players originally assigned to the control group who have played both the standard (control) game and the 'treated game';
• Control-Control Group: Players initially assigned to the control group, who exclusively play the control game without any exposure to the 'treated game'.
In the rest of the paper, we adopt the following notation. For each player $i$, let $Z_i \in \{0, 1\}$ be the initial treatment assignment. We let $M_i$ denote the number of treated games played by player $i$, and $Y_i$ is the outcome of interest. As for the three treatment receipt groups, we use $\mathcal{T}$ to denote the set of players in the treatment group, $\mathcal{C}_m$ the set of players in the control-mixed group, and $\mathcal{C}_c$ the set of players in the control-control group. We let $X_i$ denote any pre-experiment covariates we have for each player.
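With this notation, the three receipt groups are a deterministic function of the initial assignment and the number of treated games. A minimal sketch (function name is illustrative):

```python
def classify(z, m):
    """Map initial assignment z (0/1) and number of treated games m
    to one of the three treatment receipt groups."""
    if z == 1:
        return "treatment"            # always plays the treated game
    return "control-mixed" if m > 0 else "control-control"
```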
2.1. The Naive Estimator
The goal of this experimentation is to estimate ATE to validate whether this ‘treated game’ intervention can significantly bring positive influence on user engagement. The ATE is commonly estimated using the strategy of Difference-in-Means (DiM) (e.g., Karwa and Airoldi, 2018; Splawa-Neyman et al., 1990; Neyman et al., 1935; Ding et al., 2017), where the difference in the average outcomes between the treatment and control groups are calculated.
If we ignore the post-hoc network interference and directly apply the DiM estimator to the treatment and control groups, a naive estimator of the overall treatment effect is

(1) $\hat{\tau}_{\text{naive}} = \frac{1}{|\mathcal{T}|} \sum_{i \in \mathcal{T}} Y_i - \frac{1}{|\mathcal{C}_m| + |\mathcal{C}_c|} \sum_{i \in \mathcal{C}_m \cup \mathcal{C}_c} Y_i.$
This naive estimator ignores the treatment received by players in the control group through treated games, which results in an over-estimation of the baseline based on the control group data. Consequently, such over-estimation introduces bias into the analysis, leading to under-estimation of the treatment effect. Similarly, one might define a naive estimator by replacing the second term with the average outcome in the control-control group. Such an estimator is also biased since, intrinsically, the control-control group tends to include players who are less active, as opposed to those in the control-mixed group, who are generally more active and thus have a higher chance of participating in the 'treated game'. We demonstrate such biases in Sections 4 and 5.
3. Network Experiment Analysis
In this section, we move beyond the naive estimators and formally define the causal estimand of interest in our setting, together with assumptions that allow us to define a valid estimator of the average treatment effect.
3.1. Causal Estimand
We follow the potential outcome framework and use $Y_i(\mathbf{z})$ to denote the outcome that would be observed had the treatment status of all $N$ players been set to the vector $\mathbf{z} = (z_1, \dots, z_N)$. The potential outcome is indexed by the treatment status of all players, as we allow arbitrary interference through treated games. This potential outcome is well-defined if the following assumption holds.
Assumption 1 (No multiple treatment).
If $\mathbf{Z} = \mathbf{z}$, then $Y_i = Y_i(\mathbf{z})$ for all $i$.
Specific to our case, as the number of treated games played by each player is known, it can serve as the so-called 'exposure mapping' in the interference literature (Aronow and Samii, 2017), i.e., we can assume that the treatment of other players only affects player $i$ through the number of treated games that player $i$ played.
Assumption 2 (Exposure mapping).
Let $M_i(\mathbf{z})$ denote the number of treated games played by player $i$ under the assignment vector $\mathbf{z}$. There exists an exposure function $f(\mathbf{z}, i) = M_i(\mathbf{z})$ such that the following equality holds
$Y_i(\mathbf{z}) = Y_i(\mathbf{z}')$
if $f(\mathbf{z}, i) = f(\mathbf{z}', i)$. We further assume that the potential outcomes depend on $\mathbf{z}$ only through this exposure, so the potential outcome can be simplified into $Y_i(m)$ for $m = M_i(\mathbf{z})$.
We can define the causal effect for a fixed level of exposure $m \ge 1$,

(2) $\tau(m) = E[Y_i(m)] - E[Y_i(0)].$

We refer to $\tau(m)$ as the average treatment effect in the rest of the paper, as it measures the average effect on the players' potential outcomes when they experience $m$ 'treated games', compared to when they do not experience the treated game at all. This causal analysis helps in understanding how varying intensities or types of treatment influence players' behavior or game experience.
Another estimand of interest is the overall effect $\tau$, defined as the ATEs weighted by the players' natural distribution of the number of games they engage in with the 'treated game' feature activated. Since the initial treatment assignment is randomized, we have
$\tau = \sum_{m \ge 1} \tau(m) \, \Pr(M_i = m \mid Z_i = 1).$
3.2. Estimation
Since the exposure received by each player cannot be randomized, we further make the common unconfoundedness assumption for both the initial treatment assignment and exposure as follows.
Assumption 3 (Unconfoundedness of the joint treatment).
For all exposure levels $m$, we assume $Y_i(m) \perp (Z_i, M_i) \mid X_i$, i.e., the joint treatment $(Z_i, M_i)$ is unconfounded given the pre-experiment covariates.

Under Assumptions 1 to 3, we have the following identification result:

(3) $E[Y_i(m)] = E\left[ E[Y_i(m) \mid X_i] \right]$
(4) $\qquad\;\;\, = E\left[ E[Y_i(m) \mid M_i = m, X_i] \right]$
(5) $\qquad\;\;\, = E\left[ E[Y_i \mid M_i = m, X_i] \right].$

Here, equation (3) is due to the law of total expectation; (4) is due to Assumption 3; and (5) is due to Assumptions 1 and 2.
For the estimand of interest, we only need to evaluate $E[Y_i(m)]$ for $m \ge 1$ and $E[Y_i(0)]$. For the latter, we have pre-experiment data for all the players in the study, which can provide a robust estimate of the outcome function without intervention. We denote the predicted outcome from this model by $\hat{\mu}_0(X_i)$.
For the treated players, the level of exposure $M_i$ can have a highly unbalanced distribution, so models that directly estimate the outcome can be unstable and computationally challenging. Therefore, we propose a weighted estimator of $E[Y_i(m)]$ using Inverse Probability Weighting (IPW) (e.g., Rosenbaum and Rubin, 1983; Hirano and Imbens, 2001). IPW is a statistical technique used to adjust for potential selection bias where random assignment is not possible. The core idea of IPW is to re-weight the data so that the weighted sample resembles a randomized experiment. That is, we adopt the following Hajek-type estimator (Hajek, 1971) for $E[Y_i(m)]$,

(6) $\hat{E}[Y_i(m)] = \dfrac{\sum_i \mathbb{1}(M_i = m) \, Y_i / \hat{e}_m(X_i)}{\sum_i \mathbb{1}(M_i = m) / \hat{e}_m(X_i)},$
where $\hat{e}_m(X_i) = \hat{\Pr}(M_i = m \mid X_i)$ is the estimated propensity score for being exposed to $m$ treated games. Putting the two estimators together, we have an estimator of the average treatment effect

(7) $\hat{\tau}(m) = \hat{E}[Y_i(m)] - \dfrac{1}{N} \sum_{i=1}^{N} \hat{\mu}_0(X_i).$
In practice, observations with large $M_i$ can be extremely sparse. One common strategy is to truncate the number of treated games and treat all values above a certain threshold as a single category. The estimation of propensity scores for a multi-level treatment is notably more complex than in binary cases and requires more flexible models to avoid unstable or extreme weights in the IPW estimator. We estimate the propensity score function with flexible predictive models, as described in Sections 4 and 5.
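A minimal sketch of the resulting estimator of $E[Y_i(m)]$, combining exposure truncation, multi-level propensity scores, and the Hajek weighting of equation (6). We use scikit-learn's multinomial logistic regression as a stand-in for the flexible classifier; all function and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def hajek_ipw(y, m, X, m_max=10):
    """Hajek-weighted estimates of E[Y(m)] for truncated exposure levels
    m = 1..m_max, with propensity scores from a multinomial logistic
    regression (a stand-in for any flexible classifier)."""
    y, m, X = np.asarray(y), np.asarray(m), np.asarray(X)
    m = np.minimum(m, m_max)                  # collapse large exposures
    clf = LogisticRegression(max_iter=1000).fit(X, m)
    probs = clf.predict_proba(X)              # P(M_i = level | X_i)
    est = {}
    for j, level in enumerate(clf.classes_):
        if level == 0:
            continue                          # E[Y(0)] is handled separately
        idx = m == level
        w = 1.0 / probs[idx, j]               # inverse propensity weights
        est[int(level)] = float(np.sum(w * y[idx]) / np.sum(w))
    return est
```

Because the Hajek form normalizes by the summed weights, a few small propensities inflate the variance less than they would in the unnormalized Horvitz-Thompson form.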
4. Simulations
In this section, we evaluate the proposed framework with simulation studies reflecting different scenarios of network interference. The source code is available at https://github.com/YuZoeyZhu/-KDD-2024-Network-Interference-Online-Gaming/tree/main. We assume a 5v5 game setting, i.e., 5 players participate in each game (recall that we treat the games as PVE and ignore the opposing team). Let the total number of players be $N$. Under the completely randomized treatment assignment, we generate the treatment assignment $Z_i$ for each player $i$. For each player, we also generate a single pre-experiment covariate $X_i$. The player-matching process is simulated as follows. For each round of the game,
• Sample the number of treatment players $n_t$ based on the pre-set probability (Table 1). The number of control players is $n_c = 5 - n_t$;
• Sample $n_t$ players from the treatment group, with sampling probabilities depending on the covariate $X_i$;
• Sample $n_c$ players from the control group, with sampling probabilities depending on the covariate $X_i$.
We generate $R$ rounds of games. Let $M_i$ denote the number of 'treated games' player $i$ participated in. We assume the potential outcome follows a distribution whose mean depends on both the covariate $X_i$ and the number of treated games $M_i$. This synthetic data generating process mimics the distribution of real data, where the heterogeneity of $Y_i$ increases for larger $X_i$. For example, if $X_i$ measures the level of activity of a player, active players tend to have larger variance in $Y_i$ than less active players.
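The matching-and-outcome process can be sketched as follows. The covariate distribution and the outcome model below (a normal outcome whose mean grows with $M_i$ and whose variance grows with $X_i$) are placeholder assumptions, since the exact functional forms are not reproduced here; the default `p_nt` mirrors Case I of Table 1.

```python
import numpy as np

def simulate(n_players=2000, n_games=2000,
             p_nt=(0.40, 0.10, 0.10, 0.10, 0.10, 0.20), seed=0):
    """Two-stage simulation sketch: randomized assignment, then repeated
    5-player matchings where the number of treatment players per game is
    drawn from p_nt. Outcome model is an assumed placeholder:
    Y ~ Normal(X + 2*M, sqrt(1 + 0.5*X^2))."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, n_players)        # initial assignment
    x = rng.normal(1.0, 0.5, n_players)      # single pre-experiment covariate
    m = np.zeros(n_players, dtype=int)       # number of treated games played
    treat_ids = np.flatnonzero(z == 1)
    ctrl_ids = np.flatnonzero(z == 0)
    for _ in range(n_games):
        n_t = rng.choice(6, p=p_nt)          # treatment players in this game
        team = np.concatenate([
            rng.choice(treat_ids, n_t, replace=False),
            rng.choice(ctrl_ids, 5 - n_t, replace=False),
        ])
        if n_t > 0:                          # any treated member -> treated game
            m[team] += 1
    y = rng.normal(x + 2.0 * m, np.sqrt(1.0 + 0.5 * x ** 2))
    return z, x, m, y
```

Note that this sketch matches players uniformly at random; the covariate-dependent sampling described above would replace the two `rng.choice` calls with weighted draws.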
The simulation setting allows us to control the amount of interference by varying the probability of sampling players from the treatment group. More specifically, we consider three different settings with a progressively increasing level of interference in the control-mixed group from Case I to Case III. The number of games and the distribution of exposure are summarized in Table 1. Figure 2 illustrates the distribution of sample size proportions across different values of $M_i$ for all three groups of players: the treatment, control-mixed, and control-control groups. The amount of interference, increasing from I to III, is reflected by the sample size proportion of the control-mixed group at each exposure level relative to the treatment group, which can be compared via the overlapping area under the blue line (the control-mixed group) and the red line (the treatment group). In addition, in Case I, the sample size proportions are relatively small in both the treatment and control-mixed groups at lower exposure levels, which poses more challenges in estimating $\tau(m)$ when $m$ is small. In Case II, the distribution of $M_i$ in the control-mixed group is skewed to the right with the mode at 3. The sample size proportions are slightly smaller at both small and large $m$, but are overall more balanced compared with Case I. In Case III, the distributions of $M_i$ in the control-mixed group and the treatment group are both skewed to the right with similar modes. In contrast to Case I, the sample size proportions are small in both groups at higher $m$. In all three cases, a similar share of the samples fall in the control-control group.
In the simulation, we assume the baseline outcome function $\mu_0(X_i) = E[Y_i(0) \mid X_i]$ is known, which is a reasonable assumption when there are enough pre-experiment data without treatment. Thus we focus on the estimation of the counterfactual mean under non-zero exposures. We truncate $M_i$ at 10, treating it as categorical with values in $\{1, \dots, 10\}$, and apply the XGBoost classification model (Chen and Guestrin, 2016) to estimate the propensity scores $\hat{e}_m(X_i)$. We also evaluate a modified version of the proposed estimator that uses only the players in the treatment group to estimate $E[Y_i(m)]$ in equation (6), which we refer to as the 'proposed estimator without control-mixed'.
Let $\bar{Y}_T(m)$ be the average outcome for players in the treatment group who are exposed to $m$ treated games, and let $\bar{Y}_C$ and $\bar{Y}_{CC}$ be the average outcomes for players in the control group and the control-control group, respectively. We compare our estimator to two naive estimators ignoring interference:
• Naive: This is the simplest DiM estimator, ignoring interference and confounding, which only computes the contrast between players in the treatment group who are exposed to $m$ treated games and the whole control group,
(8) $\hat{\tau}_{\text{naive}}(m) = \bar{Y}_T(m) - \bar{Y}_C;$
• Naive - w/o Control-Mixed: Alternatively, we may remove all players subject to interference and consider the comparison to the control-control group only,
(9) $\hat{\tau}_{\text{naive-cc}}(m) = \bar{Y}_T(m) - \bar{Y}_{CC}.$
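Both naive contrasts can be computed directly from the observed data; a sketch with illustrative names:

```python
import numpy as np

def naive_estimates(y, z, m, m_max=10):
    """The two naive difference-in-means contrasts: treated players at
    exposure level k vs. (a) the whole control group and (b) the
    control-control group only. Returns {k: (naive, naive_cc)}."""
    y, z, m = map(np.asarray, (y, z, m))
    ybar_c = y[z == 0].mean()                  # whole control group
    ybar_cc = y[(z == 0) & (m == 0)].mean()    # control-control only
    out = {}
    for k in range(1, m_max + 1):
        sel = (z == 1) & (m == k)
        if sel.any():
            out[k] = (y[sel].mean() - ybar_c, y[sel].mean() - ybar_cc)
    return out
```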
To evaluate the performance of each estimator, we generate 100 datasets in each case and compare the mean and 95% uncertainty interval of the estimated effects with the truth for $m = 1, \dots, 10$. As shown in Figure 3, in all three cases, the first naive estimator tends to underestimate the effects at all levels of $m$, whereas the second naive estimator, which ignores the control-mixed group, consistently overestimates the effects. This is as expected: in the former case, $\bar{Y}_C$ is over-estimated, as some of the players in the control group are exposed to treatment through interference; in the latter case, $\bar{Y}_{CC}$ is under-estimated, as the players in the control-control group are more likely to be less active players. Both versions of the proposed estimator show improvement over the naive estimators in general. We also observe that the proposed estimator without the control-mixed group exhibits larger uncertainty in general, especially for levels of $m$ with a small sample size. By incorporating both the treatment and control-mixed groups, the proposed estimator is able to achieve the smallest bias with low variation.
This simulation study offers clear insights into the performance of our estimator across various probabilities of treatment player matching. In real cases, these matching probabilities can be strategically adjusted. For instance, selecting an appropriate matching probability can help maintain a balance between enhancing the user engagement of new features and improving the accuracy of the estimator. It can be further explored to manage and control network interference, thereby aligning the experimentation with practical business considerations.
Table 1. Number of simulated games and the pre-set distribution of the number of treatment players $n_t$ per game in each case.

Case | Games | $P(n_t{=}0)$ | $P(n_t{=}1)$ | $P(n_t{=}2)$ | $P(n_t{=}3)$ | $P(n_t{=}4)$ | $P(n_t{=}5)$
---|---|---|---|---|---|---|---
I | 2000 | 0.40 | 0.10 | 0.10 | 0.10 | 0.10 | 0.20
II | 1000 | 0.06 | 0.02 | 0.19 | 0.23 | 0.34 | 0.16
III | 1000 | 0.20 | 0.34 | 0.07 | 0.16 | 0.06 | 0.17
Figure 2. Distribution of sample size proportions across exposure levels for the treatment, control-mixed, and control-control groups in Cases I-III.
Figure 3. Mean and 95% uncertainty intervals of the estimated effects for each estimator across exposure levels in Cases I-III.
5. Case Study
In this case study, we consider an online experiment for the 'treated game' feature of a MOBA game from Tencent. We obtained the experiment dataset with a total of 58,565 players over a two-week experimentation period. We also collected a pre-experiment dataset from the same set of players, covering the two weeks immediately preceding the start of the experiment. The control-control group makes up around 7.8% of the whole sample.
In this experiment, we focus on monitoring a specific target metric (TM) for each player. This business metric is a measurement of a player’s engagement in the mobile game. An increase in TM generally indicates a higher level of player involvement and satisfaction with the game, which can lead to increased loyalty, longer-term player retention, and potentially greater revenue through in-game purchases. We expect to see a significant increase in TM with the new ‘treated game’ feature.
This specific TM, being inherently positive, displays a distribution with a decreasing trend and a long right tail. Figure 4 illustrates the distributions of TM across the three groups, comparing the experimental period and the pre-experimental phase data. In both datasets, the TM for the control-control group is relatively smaller, aligning with the inference that this group consists of less active players. Additionally, the pre-experiment TM generally surpasses the experiment TM, which could be attributed to various factors, such as in-game events, public holidays, etc.
The pre-experiment features, which serve as potential confounders, encompass a variety of descriptors that capture different characteristics of the players. These include metrics that measure players’ gaming abilities, their levels within the game, and their in-game purchases.
As for the pre-processing of the data, we treat the observations with extreme TM values as outliers and remove them from the data. We also truncate $M_i$ with a threshold of 21.
Figure 4. Distributions of TM across the three groups during the pre-experiment and experiment periods.
5.1. Estimation of $E[Y_i(0)]$
We estimated $E[Y_i(0)]$, the expected potential outcome under no treatment, as the sum of the observed TM during the pre-experiment period and a covariate-dependent increment during the experiment period. That is, we assume the difference in the potential outcomes between the pre-experiment and experiment periods under no treatment can be explained by the observed covariates. We estimate this difference with a linear model fit on the control-control group, and make predictions for the control-mixed and treatment group players to obtain $\hat{\mu}_0(X_i)$. This approach is related to the Difference-in-Differences (DiD) estimator (Ashenfelter, 1978; Ashenfelter and Card, 1999; Bertrand et al., 2003).
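A minimal sketch of this baseline step, under the assumptions above: fit the pre-to-experiment change on the control-control group with a least-squares linear model, then shift each remaining player's pre-experiment TM by the predicted change. Variable names are illustrative.

```python
import numpy as np

def predict_mu0(X_cc, diff_cc, X_other, y_pre_other):
    """Estimate E[Y(0)] for treatment / control-mixed players:
    regress the (experiment - pre-experiment) change diff_cc on the
    control-control covariates X_cc, then add the predicted change
    to each remaining player's observed pre-experiment outcome."""
    A_cc = np.column_stack([np.ones(len(X_cc)), X_cc])    # intercept + covariates
    beta, *_ = np.linalg.lstsq(A_cc, diff_cc, rcond=None)
    A = np.column_stack([np.ones(len(X_other)), X_other])
    return y_pre_other + A @ beta
```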
5.2. Estimation of propensity score
We used the XGBoost classification model to estimate the propensity score $\hat{e}_m(X_i)$ with $m \in \{1, \dots, 21\}$. To implement the model, we used 5-fold cross-validation to find the optimal hyper-parameters, including the learning rate and a max depth of 6, and evaluated the overall accuracy of the model on a held-out testing set. The propensity score is estimated with the predicted probability of the corresponding level of exposure.
5.3. Performance
Figure 5 presents the effect estimates, comparing four estimators across multiple treatment levels. The 'Naive' estimator, represented by the dark grey line, suggests a conservative estimate of the treatment effects, as expected; it also provides a baseline for comparison against more sophisticated approaches. The 'Naive without Control-Mixed' estimator, in the light grey line, shows overall higher effect estimates than the others, indicating that naively excluding the control-mixed data may lead to over-estimation. The proposed estimator lies between the two naive approaches, and is also close to the trajectory of the 'Proposed without Control-Mixed' estimator.
The estimated treatment effects show an overall increasing trend as the treatment level grows, regardless of the estimation approach. Moreover, the relationship is non-linear, with fluctuations and diminishing incremental benefits beyond a certain level of $m$. This is expected because a higher level of treatment is presumed to exert a stronger influence on the outcome, and the treatment effects reach a plateau as the treatment level approaches a saturation point where additional increments in $m$ no longer yield proportional increases in the outcome. The relatively small sample sizes under large exposures also lead to noisier estimates of the ATE.
Figure 5. Estimated treatment effects across treatment levels for the four estimators.
Lastly, Table 2 shows the estimates of the overall treatment effect $\tau$ for all the estimators.
Estimator | Estimated overall effect
---|---
Naive | 0.960
Naive - w/o Control-Mixed | 5.270
Proposed - w/o Control-Mixed | 2.509
Proposed | 2.113
6. Conclusion
In conclusion, our comprehensive study provides a novel framework for analyzing causal effects in the context of online gaming, where traditional A/B testing faces the challenges of post-hoc network interference. By focusing on the ‘treated game’ feature of the mobile MOBA game, we have demonstrated the potential of our proposed framework to accurately estimate treatment effects amidst the complex dynamics of player interactions.
It is worth noting that certain limitations could impact the generalizability and applicability of our findings. First, our model relies on assumptions that may not hold across all gaming environments or user demographics. Moreover, our model presumes that the post-hoc network interference is uniformly distributed across all players, an assumption that might oversimplify the complex and often unique interactions between players. In addition, potential selection biases may not be fully accounted for due to the unknown team-matching strategy. For future work, we plan to further refine and validate the model and its underlying assumptions, for instance by exploring models that account for heterogeneity in player behavior or that explicitly model the unique network structures inherent in different gaming communities.
The potential applications of our work extend beyond the realm of online gaming. The principles and methodologies we have outlined could be adapted to other digital platforms and social networks where user engagement and interaction are pivotal to the product’s success. Specifically, we can generalize our work to the exploration of dynamic network interference scenarios, such as those encountered in real-time strategy or ride-sharing platforms like Uber with continuously evolving user interactions.
Acknowledgements.
We would like to thank Sizhe Zhang, Yifeng Huang, Junlong Zhou, Ke Nie and Ning Zhang for useful comments and discussions.

References
- Aronow and Middleton (2013) Peter Aronow and Joel Middleton. 2013. A Class of Unbiased Estimators of the Average Treatment Effect in Randomized Experiments. Journal of Causal Inference 1 (01 2013). https://doi.org/10.1515/jci-2012-0009
- Aronow and Samii (2017) Peter M. Aronow and Cyrus Samii. 2017. Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics 11, 4 (2017), 1912 – 1947. https://doi.org/10.1214/16-AOAS1005
- Ashenfelter (1978) Orley Ashenfelter. 1978. Estimating the Effect of Training Programs on Earnings. The Review of Economics and Statistics 60, 1 (1978), 47–57. https://EconPapers.repec.org/RePEc:tpr:restat:v:60:y:1978:i:1:p:47-57
- Ashenfelter and Card (1999) Orley Ashenfelter and David Card. 1999. Using Longitudinal Structure of Earnings to Estimate the Effect of Training Programs. Review of Economics and Statistics 67 (10 1999). https://doi.org/10.2307/1924810
- Basse and Feller (2017) Guillaume Basse and Avi Feller. 2017. Analyzing Two-Stage Experiments in the Presence of Interference. J. Amer. Statist. Assoc. 113 (06 2017). https://doi.org/10.1080/01621459.2017.1323641
- Bertrand et al. (2003) Marianne Bertrand, Esther Duflo, and Sendhil Mullainathan. 2003. How Much Should We Trust Differences-In-Differences Estimates? The Quarterly Journal of Economics 119 (11 2003), 249–275. https://doi.org/10.1162/003355304772839588
- Bowers et al. (2013) Jake Bowers, Mark Fredrickson, and Costas Panagopoulos. 2013. Reasoning about Interference Between Units: A General Framework. Political Analysis 21 (12 2013), 97–124. https://doi.org/10.2307/23359695
- Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
- Choi (2016) David Choi. 2016. Estimation of Monotone Treatment Effects in Network Experiments. (01 2016). https://doi.org/10.6084/M9.FIGSHARE.3443393
- Ding et al. (2017) Peng Ding, Xinran Li, and Luke Miratrix. 2017. Bridging Finite and Super Population Causal Inference. Journal of Causal Inference 5 (04 2017). https://doi.org/10.1515/jci-2016-0027
- Dong et al. (2020) Boxiang Dong, Hui Bill Li, Yang Ryan Wang, and Rami Safadi. 2020. 2D-ATT: Causal Inference for Mobile Game Organic Installs with 2-Dimensional Attentional Neural Network. In 2020 IEEE International Conference on Big Data (Big Data). 1503–1512. https://doi.org/10.1109/BigData50022.2020.9378413
- Eckles et al. (2016) Dean Eckles, Brian Karrer, and Johan Ugander. 2016. Design and Analysis of Experiments in Networks: Reducing Bias from Interference. Journal of Causal Inference 5 (02 2016). https://doi.org/10.1515/jci-2015-0021
- Forastiere et al. (2016) Laura Forastiere, Edoardo Airoldi, and Fabrizia Mealli. 2016. Identification and Estimation of Treatment and Interference Effects in Observational Studies on Networks. J. Amer. Statist. Assoc. 116 (09 2016). https://doi.org/10.1080/01621459.2020.1768100
- Goldsmith-Pinkham and Imbens (2013) Paul Goldsmith-Pinkham and Guido Imbens. 2013. Social Networks and the Identification of Peer Effects Rejoinder. Journal of Business & Economic Statistics 31 (07 2013). https://doi.org/10.1080/07350015.2013.801251
- Gui et al. (2015) Huan Gui, Ya Xu, Anmol Bhasin, and Jiawei Han. 2015. Network A/B Testing: From Sampling to Estimation. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15). 399–409. https://doi.org/10.1145/2736277.2741081
- Hajek (1971) J. Hajek. 1971. Discussion of “An essay on the logical foundations of survey sampling, part one,” by D. Basu. Holt, Rinehart and Winston, Toronto, Ontario, Canada, 236.
- Halloran and Hudgens (2016) M. Halloran and Michael Hudgens. 2016. Dependent Happenings: A Recent Methodological Review. Current Epidemiology Reports 3 (07 2016). https://doi.org/10.1007/s40471-016-0086-4
- He et al. (2021) Yuzi He, Christopher Tran, Julie Jiang, Keith Burghardt, Emilio Ferrara, Elena Zheleva, and Kristina Lerman. 2021. Heterogeneous Effects of Software Patches in a Multiplayer Online Battle Arena Game. In Proceedings of the 16th International Conference on the Foundations of Digital Games (Montreal, QC, Canada) (FDG ’21). Association for Computing Machinery, New York, NY, USA, Article 11, 9 pages. https://doi.org/10.1145/3472538.3472550
- Hirano and Imbens (2001) Keisuke Hirano and Guido Imbens. 2001. Estimation of Causal Effects Using Propensity Score Weighting: An Application to Data on Right Heart Catheterization. Health Services and Outcomes Research Methodology 2 (12 2001), 259–278. https://doi.org/10.1023/A:1020371312283
- Hudgens and Halloran (2008) Michael Hudgens and M Halloran. 2008. Toward Causal Inference With Interference. J. Amer. Statist. Assoc. 103 (07 2008), 832–842. https://doi.org/10.1198/016214508000000292
- Karrer et al. (2020) Brian Karrer, Liang Shi, Monica Bhole, Matt Goldman, Tyrone Palmer, Charlie Gelman, Mikael Konutgan, and Feng Sun. 2020. Network experimentation at scale. arXiv preprint.
- Karwa and Airoldi (2018) Vishesh Karwa and Edoardo Airoldi. 2018. A systematic investigation of classical causal inference strategies under mis-specification due to network interference. arXiv preprint.
- Liu and Hudgens (2014) Lan Liu and Michael Hudgens. 2014. Large Sample Randomization Inference of Causal Effects in the Presence of Interference. J. Amer. Statist. Assoc. 109 (03 2014), 288–301. https://doi.org/10.1080/01621459.2013.844698
- Manski (2013) Charles F. Manski. 2013. Identification of treatment response with social interactions. Econometrics Journal 16, 1 (Feb. 2013), S1–S23. https://doi.org/10.1111/j.1368-423X.2012.00368.x
- Meng et al. (2019) Yuan Meng, Shenglin Zhang, Zijie Ye, Benliang Wang, Zhi Wang, Yongqian Sun, Qitong Liu, Shuai Yang, and Dan Pei. 2019. Causal Analysis of the Unsatisfying Experience in Realtime Mobile Multiplayer Games in the Wild. In 2019 IEEE International Conference on Multimedia and Expo (ICME). 1870–1875. https://doi.org/10.1109/ICME.2019.00321
- Neyman et al. (1935) J. Neyman, K. Iwaszkiewicz, and St. Kolodziejczyk. 1935. Statistical Problems in Agricultural Experimentation. Supplement to the Journal of the Royal Statistical Society 2, 2 (1935), 107–180. http://www.jstor.org/stable/2983637
- Rosenbaum and Rubin (1983) Paul Rosenbaum and Donald Rubin. 1983. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 70 (04 1983), 41–55. https://doi.org/10.1093/biomet/70.1.41
- Rubin (1980) Donald B. Rubin. 1980. Randomization Analysis of Experimental Data: The Fisher Randomization Test Comment. J. Amer. Statist. Assoc. 75, 371 (1980), 591–593.
- Sobel (2006) Michael Sobel. 2006. What Do Randomized Studies of Housing Mobility Demonstrate?: Causal Inference in the Face of Interference. J. Amer. Statist. Assoc. 101 (02 2006), 1398–1407. https://doi.org/10.2307/27639760
- Splawa-Neyman et al. (1990) Jerzy Splawa-Neyman, Dorota Dabrowska, and T. Speed. 1990. On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9. Statist. Sci. 5 (11 1990). https://doi.org/10.1214/ss/1177012031
- Sävje et al. (2017) Fredrik Sävje, Peter Aronow, and Michael Hudgens. 2017. Average treatment effects in the presence of unknown interference. The Annals of Statistics 49 (11 2017). https://doi.org/10.1214/20-AOS1973
- Tchetgen and VanderWeele (2012) Eric J. Tchetgen Tchetgen and Tyler J. VanderWeele. 2012. On causal inference in the presence of interference. Statistical Methods in Medical Research 21 (2012), 55 – 75. https://api.semanticscholar.org/CorpusID:34519507
- Ugander et al. (2013) Johan Ugander, Brian Karrer, Lars Backstrom, and Jon Kleinberg. 2013. Graph cluster randomization: network exposure to multiple universes. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’13). https://doi.org/10.1145/2487575.2487695
- Walker and Muchnik (2014) Dylan Walker and Lev Muchnik. 2014. Design of Randomized Experiments in Networks. Proceedings of the IEEE (12 2014). https://doi.org/10.1109/JPROC.2014.2363674
- Xu et al. (2015) Ya Xu, Nanyu Chen, Adrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15). 2227–2236. https://doi.org/10.1145/2783258.2788602