
Competition over data: how does data purchase affect users?

Yongchan Kwon yk3012@columbia.edu
Department of Statistics, Columbia University
Tony Ginart tginart@stanford.edu
Department of Electrical Engineering, Stanford University
James Zou jamesz@stanford.edu
Department of Biomedical Data Science, Stanford University
Abstract

Competition among machine learning (ML) predictors is widespread in practice, and it is increasingly important to understand the impacts and biases that arise from such competition. One critical aspect of ML competition is that predictors are constantly updated by acquiring additional data during the competition. Although this active data acquisition can substantially change the overall competition environment, it has not been well studied. In this paper, we study what happens when ML predictors can purchase additional data during the competition. We introduce a new environment in which ML predictors use active learning algorithms to effectively acquire labeled data within their budgets while competing against each other. We empirically show that the overall performance of an ML predictor improves when predictors can purchase additional labeled data. Surprisingly, however, the quality that users experience, i.e., the accuracy of the predictor selected by each user, can decrease even as the individual predictors get better. We demonstrate that this phenomenon naturally arises from a trade-off: competition pushes each predictor to specialize in a subset of the population, while data purchase makes predictors more uniform. With comprehensive experiments, we show that our findings are robust to different modeling assumptions.

1 Introduction

When there are several companies in a marketplace offering similar services, a customer usually chooses the one that offers the best options or functionalities, leading to competition among the companies. Accordingly, companies are motivated to offer high-quality services without raising prices too much, as their ultimate goal is to attract more customers and make more profit. When it comes to machine learning (ML)-based services, high-quality service is often achieved by regularly re-training models after buying more data from customers or data vendors (Meierhofer et al., 2019). In this paper, we consider a competition setting where multiple companies offer ML-based services while constantly improving their predictions by acquiring labeled data.

As a concrete example, consider the U.S. auto insurance market (Jin & Vasserman, 2019; Sennaar, 2019). Auto insurance companies such as State Farm, Progressive, and AllState use ML models to analyze customer data, assess risk, and adjust premiums. These companies also offer Pay-How-You-Drive insurance, which is usually cheaper than regular auto insurance on the condition that the insurer monitors driving patterns, such as rapid acceleration or oscillations in speed (Arumugam & Bhargavi, 2019; Jin & Vasserman, 2019). In other words, the companies provide financial benefits to customers in exchange for the customers' driving-pattern data. With these user data, they can regularly update their ML models, improving model performance while competing with each other.

Analyzing the effects of data purchase in competition has practical implications, but it has received little attention in the ML literature. The effects of data acquisition have been investigated in the active learning (AL) literature (Settles, 2009; Ren et al., 2020), but AL considers a single-agent setting, so it is not straightforward to model competition within it. Recently, Ginart et al. (2021) studied the implications of competition by modeling an environment where ML predictors compete against each other for user data. They showed that competition pushes competing predictors to focus on a small subset of the population and helps users find high-quality predictions. Although this work describes an interesting phenomenon, its model is too simple to capture a data purchase system. The impact of data purchase on competition, the main focus of our work, has thus remained largely unexplored. Our environment can model situations where competing companies actively acquire user data by providing financial benefits to users, which in turn influences how users choose service providers (see Figure 1). Related works are further discussed in Section 5.

Figure 1: Illustrations of our competition environment (left) when there is a company showing purchase intent and (right) when no company shows purchase intent. In step 1, indicated by the first arrow, each predictor receives a user query and decides whether to buy the user data. In step 2, indicated by the second arrow, (left) if some company thinks the data is worth buying, it shows purchase intent, and the user $X_t$ then selects a buyer in exchange for financial benefits. (Right) If no company thinks the user data is worth buying, the user selects one company based on the received ML predictions. In step 3, only the selected predictor obtains the user label $Y_t$ and updates its model. We provide details on the environment in Section 2.

Contributions

In this paper, we propose a general competition environment and study what happens when competing ML predictors can actively acquire user data. Our main contributions are as follows.

  • We propose a novel environment that can simulate various real-world competitions. Our environment allows ML predictors to use AL algorithms to purchase labeled data within a finite budget while competing against each other (Section 2).

  • Surprisingly, our results show that when competing ML predictors purchase data, the quality of the predictions selected by each user can decrease even as competing ML predictors get better (Section 3.1).

  • We demonstrate that data purchase makes competing predictors similar to each other, leading to this counterintuitive finding (Section 3.2). Our finding is robust and is consistently observed across various competition situations (Section 3.3).

  • We theoretically analyze how the diversity of a user's available options affects the user experience, supporting our empirical findings (Section 4).

2 A general environment for competition and data purchase

This section formally introduces a new and general competition environment. In our environment, competition is represented by a series of interactions between a sequence of users and a fixed set of competing ML predictors. The interactions are modeled as supervised learning tasks. To be more specific, we first define some notation.

Notations

At each round $t\in[T] := \{1,\dots,T\}$, we denote a user query by $X_t\in\mathcal{X}$ and its associated user label by $Y_t\in\mathcal{Y}$. We focus on classification problems, i.e., $|\mathcal{Y}|$ is finite, although our environment easily extends to regression settings. We denote a sequence of users by $\{(X_t,Y_t)\}_{t=1}^{T}$ and assume users are independent and identically distributed (i.i.d.) according to some distribution $P_{X,Y}$. We call $P_{X,Y}$ the user distribution.

As for the ML predictor side, we suppose there are $M$ competing predictors in a market. For $i\in[M]$, each ML predictor is described as a tuple $\mathcal{C}^{(i)} := (n_{\mathrm{s}}^{(i)}, n_{\mathrm{b}}^{(i)}, f^{(i)}, \pi^{(i)})$, where $n_{\mathrm{s}}^{(i)}\in\mathbb{N}$ is the number of i.i.d. seed data points from $P_{X,Y}$, $n_{\mathrm{b}}^{(i)}\in\mathbb{N}$ is a budget, $f^{(i)}: \mathcal{X}\to\mathcal{Y}$ is an ML model, and $\pi^{(i)}: \mathcal{X}\to\{0,1\}$ is a buying strategy. We consider the following setting. A predictor $\mathcal{C}^{(i)}$ initially owns $n_{\mathrm{s}}^{(i)}$ data points and can additionally purchase user data within a budget of $n_{\mathrm{b}}^{(i)}$. We set the price of one data point to one, i.e., a predictor $\mathcal{C}^{(i)}$ can purchase up to $n_{\mathrm{b}}^{(i)}$ data points from the sequence of users. A predictor $\mathcal{C}^{(i)}$ produces a prediction using the ML model $f^{(i)}$ and decides whether to buy the user data with the buying strategy $\pi^{(i)}$. We take the utility function for $\mathcal{C}^{(i)}$ to be the classification accuracy of $f^{(i)}$ with respect to the user distribution $P_{X,Y}$. Lastly, $f^{(i)}$ and $\pi^{(i)}$ are allowed to be updated throughout the $T$ competition rounds; that is, companies keep improving their ML models with newly collected data points.

Competition dynamics

Before the first competition round, all $M$ competing predictors independently train their models $f^{(i)}$ using the $n_{\mathrm{s}}^{(i)}$ seed data points. After this initialization, at each round $t\in[T]$, a user sends a query $X_t$ to all the predictors $\{\mathcal{C}^{(j)}\}_{j=1}^{M}$, and each predictor $\mathcal{C}^{(i)}$ decides whether to buy the user data. We describe this decision with the buying strategy $\pi^{(i)}$. If the predictor $\mathcal{C}^{(i)}$ thinks the labeled data would be worth one unit of budget, we denote this by $\pi^{(i)}(X_t)=1$; otherwise, $\pi^{(i)}(X_t)=0$. For $\pi^{(i)}$, ML predictors can use any stream-based AL algorithm (Freund et al., 1997; Žliobaitė et al., 2013). For instance, a predictor $\mathcal{C}^{(i)}$ can use the uncertainty-based AL rule (Settles & Craven, 2008), i.e., $\mathcal{C}^{(i)}$ attempts to purchase user data if the current prediction $f^{(i)}(X_t)$ is not confident (e.g., the Shannon entropy of the probability estimate $p^{(i)}(X_t)$ at the $t$-th round exceeds some predefined threshold). In brief, a predictor $\mathcal{C}^{(i)}$ shows purchase intent if its remaining budget is greater than zero and $\pi^{(i)}(X_t)=1$. If the remaining budget is zero or $\pi^{(i)}(X_t)=0$, then $\mathcal{C}^{(i)}$ does not show purchase intent and instead provides a prediction $f^{(i)}(X_t)$ to the user. To keep the analysis of complicated real-world competition tractable, we simplify the environment by assuming that the service price is the same for all companies, so users primarily compare the quality of ML predictions. Given that users select one company within their desired price range, this assumption means that the options offered by competing companies all fall within that range.
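As an illustration, the following is a minimal sketch of such an uncertainty-based buying strategy; the threshold value and the scikit-learn-style predict_proba interface are assumptions made for this example, not the paper's exact implementation.

```python
import numpy as np

def entropy_buying_strategy(model, x, threshold):
    """Return 1 (buy) if the model's predictive entropy at x exceeds the threshold, else 0."""
    # `model` is assumed to expose predict_proba, as in scikit-learn-style classifiers.
    p = model.predict_proba(x.reshape(1, -1))[0]   # probability estimate p^(i)(X_t)
    entropy = -np.sum(p * np.log(p + 1e-12))       # Shannon entropy of the prediction
    return int(entropy >= threshold)               # pi^(i)(X_t) in {0, 1}
```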

We now elaborate on how a user selects one predictor. At every round $t\in[T]$, the user selects exactly one predictor based on both the purchase intents and the prediction information received from $\{\mathcal{C}^{(j)}\}_{j=1}^{M}$. If there is a buyer, we assume the user prefers the company with purchase intent to the others. We can think of this as a bargain: the company offers a financial advantage (e.g., discounts or coupons), and the user selects it even if the quality might not be the best. When there is more than one buyer, we assume the user selects one of them uniformly at random. Once selected, only the selected predictor's budget is reduced by one; all other predictors' budgets stay the same because they are not selected and do not have to provide financial benefits. If no predictor shows purchase intent, the user receives the predictions $\{f^{(j)}(X_t)\}_{j=1}^{M}$ and chooses predictor $\mathcal{C}^{(i)}$ with the following probability.

$$P\left(W_t=i \mid Y_t, \{f^{(j)}(X_t)\}_{j=1}^{M}\right) = \frac{\exp\left(\alpha\, q\left(Y_t, f^{(i)}(X_t)\right)\right)}{\sum_{j=1}^{M} \exp\left(\alpha\, q\left(Y_t, f^{(j)}(X_t)\right)\right)}, \qquad (1)$$

where $\alpha\geq 0$ denotes a temperature parameter and $W_t\in[M]$ denotes the index of the selected predictor. Here, $q: \mathcal{Y}\times\mathcal{Y}\to\mathbb{R}^{+} := \{z\in\mathbb{R} \mid z\geq 0\}$ is a predefined quality function that measures the similarity between the user label $Y_t$ and the prediction (e.g., $\mathbbm{1}(\{Y_1=Y_2\})$). With the softmax function in Equation (1), users are more likely to select high-quality predictions, capturing the rationality of the user selection. The temperature parameter $\alpha$ indicates how selective users are; for instance, when $\alpha$ is close to $\infty$, users are very confident in their selection and choose the best company. Afterwards, the selected predictor $\mathcal{C}^{(W_t)}$ obtains the user label $Y_t$ and updates the model $f^{(W_t)}$ by training on the new datum $(X_t, Y_t)$. The other predictors $f^{(i)}$, $i\neq W_t$, stay the same. We describe our competition system in Environment 1.
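For concreteness, a minimal sketch of the selection rule in Equation (1) is given below; the function names and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def select_predictor(y_true, predictions, quality_fn, alpha, rng=np.random.default_rng()):
    """Sample the index W_t of the selected predictor according to Equation (1)."""
    qualities = np.array([quality_fn(y_true, y_hat) for y_hat in predictions])
    logits = alpha * qualities
    probs = np.exp(logits - logits.max())          # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(predictions), p=probs)

# Example quality function: the correctness function q(y1, y2) = 1{y1 == y2}
correctness = lambda y1, y2: float(y1 == y2)
```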

Characteristics of our environment

Our environment simplifies real-world competition and data purchases, which usually exist in much more complicated forms, yet it captures their key characteristics. First, our environment reflects the rationality of customers. Customers are likely to choose the best service within their budget, but they may select a company that is not necessarily the best if it offers financial benefits, such as promotional coupons, discounts, or free services (Rowley, 1998; Familmaleki et al., 2015; Reimers & Shiller, 2019). Such a user selection reflects that a user can prioritize financial advantages and change her selection, which has not been considered in the ML literature. Second, our environment realistically models a company's data acquisition. Competing companies strive to attract more users, constantly purchasing user data to improve their ML predictions. Since the data buying process can be costly for the companies, data should be carefully chosen, which is why we incorporate AL algorithms. Our environment allows companies to use AL algorithms within finite budgets and to selectively acquire user data. Third, our environment is flexible and accommodates various competition situations in practice. Note that we make no assumptions about the number of competing predictors $M$, the budgets $n_{\mathrm{b}}^{(i)}$, the algorithms for the predictors $f^{(i)}$ or the buying strategies $\pi^{(i)}$, or the user distribution $P_{X,Y}$.

Example 1 (Auto insurance in Section 1).

$X_t$ includes the $t$-th driver's demographic information and driving or insurance claim history, and $Y_t$ is the driver's preferred insurance plan within the user's budget constraints. Each predictor $\mathcal{C}^{(i)}$ is one insurance company (e.g., State Farm, Progressive, or AllState), offering an auto insurance plan $f^{(i)}(X_t)$ based on what it predicts to be most suitable for this driver. The driver chooses the company whose offered plan is closest to $Y_t$. If a company believes its database contains few data points from a particular group of drivers to which the $t$-th driver belongs (e.g., new drivers in their 30s), it can attempt to collect more data. Accordingly, the company offers discounts to attract her, and the acquired data are used to improve the company's future ML predictions.

Environment 1 A competition environment with data purchase
Input: Number of competition rounds $T$; user distribution $P_{X,Y}$; number of predictors $M$; competing predictors $\mathcal{C}^{(i)} = (n_{\mathrm{s}}^{(i)}, n_{\mathrm{b}}^{(i)}, f^{(i)}, \pi^{(i)})$ for $i\in[M]$.
Procedure:
For all $i\in[M]$, a model $f^{(i)}$ is trained using the $n_{\mathrm{s}}^{(i)}$ seed data points
for $t\in[T]$ do
     $(X_t, Y_t)$ is drawn from $P_{X,Y}$ and a set of buyers $\mathcal{B}=\emptyset$ is initialized
     for $i\in[M]$ do
          if ($n_{\mathrm{b}}^{(i)}\geq 1$) and ($\pi^{(i)}(X_t)=1$) then
               $\mathcal{B}\leftarrow\mathcal{B}\cup\{\mathcal{C}^{(i)}\}$
          else
               Predict $f^{(i)}(X_t)$
          end if
     end for
     if $|\mathcal{B}|\geq 1$ then
          A user selects one predictor $W_t$ from $\mathcal{B}$ uniformly at random
          $n_{\mathrm{b}}^{(W_t)} \leftarrow n_{\mathrm{b}}^{(W_t)} - 1$
     else
          A user selects one predictor $W_t$ based on Equation (1)
     end if
     $\mathcal{C}^{(W_t)}$ receives the user label $Y_t$ and updates $f^{(W_t)}$
end for
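To make the procedure concrete, below is a minimal Python sketch of Environment 1 under simplifying assumptions: each predictor is a plain object holding a model, a budget, and a buying strategy, and `select_predictor` is the softmax rule sketched above. It is meant as an illustration of the dynamics, not the paper's exact implementation.

```python
import numpy as np

def run_competition(predictors, sample_user, quality_fn, alpha, T, rng=np.random.default_rng()):
    """Simulate Environment 1. Each element of `predictors` is assumed to expose
    .model (with predict/update), .budget (int), and .buys(x) -> {0, 1}."""
    for _ in range(T):
        x, y = sample_user()                              # draw (X_t, Y_t) ~ P_{X,Y}
        buyers = [i for i, c in enumerate(predictors)
                  if c.budget >= 1 and c.buys(x) == 1]    # predictors showing purchase intent
        if buyers:                                        # step 2 (left): user picks a buyer uniformly
            w = rng.choice(buyers)
            predictors[w].budget -= 1                     # the buyer pays one unit of budget
        else:                                             # step 2 (right): user picks via Equation (1)
            preds = [c.model.predict(x) for c in predictors]
            w = select_predictor(y, preds, quality_fn, alpha, rng)
        predictors[w].model.update(x, y)                  # step 3: only the selected predictor sees Y_t
    return predictors
```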

3 Experiments

Using the proposed Environment 1, we investigate the impacts of data purchase in ML competition. Our experiments show an interesting phenomenon: data purchase can decrease the quality of the predictor selected by a user, even when the quality of the predictors improves on average (Section 3.1). In addition, we demonstrate that data purchase makes ML predictors similar to each other; it reduces the effective variety of options users face, while allowing predictors to avoid specializing in a small subset of the population (Section 3.2). Lastly, we show our results are robust to different modeling assumptions (Section 3.3).

Metrics

To quantitatively measure the effects of data purchase, we introduce three population-level evaluation metrics. First, we define the overall quality as follows.

$$\mathbb{E}\left[\frac{1}{M}\sum_{j=1}^{M} q\left(Y, f^{(j)}(X)\right)\right], \qquad \text{(Overall quality)}$$

where the expectation is taken over the user distribution $P_{X,Y}$. The overall quality represents the average quality that competing predictors provide in the market. Second is the quality of experience (QoE), the quality of the predictor selected by a user. The QoE is defined as

$$\mathbb{E}\left[q\left(Y, f^{(W)}(X)\right)\right]. \qquad \text{(QoE)}$$

Here, the expectation is over the random variables $(X, Y, W)$, and the conditional distribution of the selected index, $P(W \mid X, Y)$, is given by Equation (1). Given that a user selects one predictor based on Equation (1) when there is no buyer, QoE can be considered the key utility of users. Note that the overall quality and QoE capture different aspects of prediction quality, and they are equal only when users select one predictor uniformly at random, i.e., when $\alpha=0$ (see Lemma 1).

Next, we define the diversity to quantify how variable the ML predictions are. To be more specific, for $i\in\mathcal{Y}$, we define the proportion of predictors whose prediction is $i$ as $p_i(X) := \frac{1}{M}\sum_{j=1}^{M} \mathbbm{1}(f^{(j)}(X)=i)$. Then the diversity is defined as

$$\mathbb{E}\left[-\sum_{i\in\mathcal{Y}} p_i(X)\log\left(p_i(X)\right)\right], \qquad \text{(Diversity)}$$

where the expectation is taken over the marginal distribution $P_X$ and we use the convention $0\log(0)=0$ when $p_i(X)=0$. Note that the diversity is the expected Shannon entropy of the competing ML predictions. When there are many different options that a user can choose from, the diversity tends to be large.
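As a rough illustration, the three metrics can be estimated from a held-out test set as sketched below; the array shapes and the Monte Carlo estimation of QoE (sampling $W$ via Equation (1)) are assumptions made for this example.

```python
import numpy as np

def evaluate_metrics(preds, y_true, alpha, rng=np.random.default_rng()):
    """preds: (M, n) array of predicted labels from M predictors on n test users.
    y_true: (n,) array of true labels. Returns (overall quality, QoE estimate, diversity)."""
    M, n = preds.shape
    correct = (preds == y_true).astype(float)              # q(Y, f^(j)(X)) with the correctness function
    overall_quality = correct.mean()                        # E[(1/M) sum_j q(Y, f^(j)(X))]

    # QoE: sample W for each user from the softmax in Equation (1), then average q(Y, f^(W)(X))
    logits = alpha * correct                                # shape (M, n)
    probs = np.exp(logits - logits.max(axis=0))
    probs /= probs.sum(axis=0)
    W = np.array([rng.choice(M, p=probs[:, t]) for t in range(n)])
    qoe = correct[W, np.arange(n)].mean()

    # Diversity: expected Shannon entropy of the empirical label distribution across predictors
    entropies = []
    for t in range(n):
        _, counts = np.unique(preds[:, t], return_counts=True)
        p = counts / M
        entropies.append(-np.sum(p * np.log(p)))
    diversity = float(np.mean(entropies))
    return overall_quality, qoe, diversity
```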

Implementation protocol

Our experiments consider seven real datasets to describe various user distributions $P_{X,Y}$: Insurance (Van Der Putten & van Someren, 2000), Adult (Dua & Graff, 2017), Postures (Gardner et al., 2014), Skin-nonskin (Chang & Lin, 2011), MNIST (LeCun et al., 2010), Fashion-MNIST (Xiao et al., 2017), and CIFAR10 (Krizhevsky et al., 2009). To minimize the variance caused by other factors, we first consider a homogeneous setting in Sections 3.1 and 3.2: for each competition, all predictors have the same number of seed data points $n_{\mathrm{s}}^{(i)}$ and budgets $n_{\mathrm{b}}^{(i)}$, the same classification algorithm for $f^{(i)}$, and the same AL algorithm for $\pi^{(i)}$. In the heterogeneous settings of Section 3.3, competitors are allowed to have different configurations.

Throughout the paper, we set the total number of competition rounds to $T=10^4$, the number of predictors to $M=18$, and the quality function to the correctness function, i.e., $q(Y_1,Y_2)=\mathbbm{1}(\{Y_1=Y_2\})$ for all $Y_1,Y_2\in\mathcal{Y}$. We use a small number of seed data points $n_{\mathrm{s}}^{(i)}$, between 50 and 200 depending on the dataset. We use either a logistic model or a neural network with one hidden layer for $f^{(i)}$. As for the buying policy, we use a standard entropy-based AL rule for $\pi^{(i)}$ (Settles & Craven, 2008). We consider various competition situations by varying the budget $n_{\mathrm{b}}\in\{0,100,200,400\}$ (in Section 3, for notational convenience, we often suppress the predictor index in the superscript when the context is clear, e.g., $n_{\mathrm{b}}$ instead of $n_{\mathrm{b}}^{(i)}$) and the temperature parameter $\alpha\in\{0,1,2,4\}$. Note that each pair $(n_{\mathrm{b}},\alpha)$ generates one competition environment. We repeat experiments 30 times for each pair $(n_{\mathrm{b}},\alpha)$ to obtain stable estimates. We provide the full implementation details in Appendix A.

At evaluation time, we do not perform data buying procedures, so that we can directly compare the predictive performance of the ML models. Since the evaluation metrics are defined as population-level quantities, it is difficult to compute the expectations exactly. To handle this, we use sample averages computed on i.i.d. held-out test data that are not used during the competition rounds.

3.1 Effects of data purchase on quality

We first study how data purchase affects the overall quality and the QoE in various competition settings. As Figure 2 illustrates, the overall quality increases with $n_{\mathrm{b}}$ across all datasets. For instance, when $\alpha=4$ and the dataset is Postures, the overall quality is $0.405$ on average when $n_{\mathrm{b}}=0$, but it increases to $0.440$ and $0.464$ when $n_{\mathrm{b}}=200$ and $n_{\mathrm{b}}=400$, corresponding to $9\%$ and $14\%$ increases, respectively. In contrast, data purchase mostly decreases QoE as $n_{\mathrm{b}}$ increases. For example, when the user distribution is Insurance and $\alpha=2$, QoE is $0.875$ when $n_{\mathrm{b}}=0$, but it drops to $0.867$ and $0.814$ when $n_{\mathrm{b}}=200$ and $n_{\mathrm{b}}=400$, corresponding to $1\%$ and $7\%$ reductions, respectively. For MNIST and Fashion-MNIST, although there are small increases when $\alpha=1$, QoE decreases when $\alpha=4$.

The increase in overall quality can be explained as follows. Since an ML predictor attempts to collect user data when its prediction is highly uncertain, this active data acquisition improves the individual model and reduces its uncertainty. As in AL, data purchase effectively increases a model's predictive performance, and hence the overall quality.

In most cases, surprisingly, QoE decreases even when the overall quality increases. In other words, the quality that competing predictors provide is generally improved, but this does not necessarily mean that users will be more satisfied with the ML predictions. Although this result might sound counterintuitive, we demonstrate that it arises because data purchase leaves users with effectively fewer options, increasing the probability that a user encounters only low-quality predictions. To verify this hypothesis, we examine how data purchase affects diversity in the next section.

Figure 2: Illustrations of QoE as a function of the overall quality for various levels of $n_{\mathrm{b}}\in\{0,100,200,400\}$ and $\alpha\in\{0,1,2,4\}$ on the seven datasets. Color indicates $\alpha$, and point size indicates the budget $n_{\mathrm{b}}$: the larger the budget, the larger the point. In several settings, the overall quality increases as more budget is used, but QoE decreases.
Figure 3: Illustrations of the diversity as a function of the budget $n_{\mathrm{b}}$ for various $\alpha\in\{0,1,2,4\}$ on the seven datasets. Color indicates $\alpha$. Shaded regions denote 99% confidence bands based on 30 independent runs. Competing ML predictors become similar in the sense that the diversity decreases as the budget increases.
Figure 4: Heatmaps of $Q(j,y)-Q_{\mathrm{avg}}(y)$ for the (left) Insurance and (right) Adult datasets. We consider $n_{\mathrm{b}}\in\{0,100,200,400\}$ and $\alpha=4$. The heatmaps in each row represent different $n_{\mathrm{b}}$ but share the same color scale. In each heatmap, the horizontal axis indicates a predictor ID in $\{1,\dots,18\}$ and the vertical axis indicates a class in $\mathcal{Y}=\{1,2\}$. A cell colored red (resp. blue) indicates a class-specific quality higher (resp. lower) than the average, and a white cell indicates the average. As the budget increases, the diversity decreases, and predictors produce similar class-specific quality.
Figure 5: Probability density plots of the average quality $\frac{1}{M}\sum_{j=1}^{M} q(Y, f^{(j)}(X))$ near zero when $n_{\mathrm{b}}\in\{0,100,200,400\}$ and $\alpha=4$. Color indicates $n_{\mathrm{b}}$. As $n_{\mathrm{b}}$ increases, the average quality is more likely to be close to zero. That is, the probability that all ML predictors produce low-quality predictions at the same time increases, and users may be less satisfied with the ML predictions after the competing predictors purchase data.

3.2 Effects of data purchase on diversity

We now investigate the effect of data purchase on the diversity. Figure 3 illustrates the diversity as a function of the budget $n_{\mathrm{b}}$ in various competition settings. In general, the diversity monotonically decreases as $n_{\mathrm{b}}$ increases across all datasets. That is, the competing predictors become more similar as larger budgets are allowed, and users effectively have fewer options when $n_{\mathrm{b}}$ increases. In particular, when $\alpha=4$ and the dataset is Adult, the diversity is $0.371$ on average when $n_{\mathrm{b}}=0$, but it drops to $0.270$ and $0.187$ when $n_{\mathrm{b}}=200$ and $n_{\mathrm{b}}=400$, corresponding to $27\%$ and $50\%$ reductions, respectively.

We also compare the class-specific qualities of competing predictors. In Figure 4, we illustrate heatmaps of the difference $Q(j,y)-Q_{\mathrm{avg}}(y)$, where the class-specific quality is defined as $Q(j,y) := \mathbb{E}\left[q\left(Y, f^{(j)}(X)\right) \mid Y=y\right]$ for $j\in[M]$ and $y\in\mathcal{Y}$, and $Q_{\mathrm{avg}}(y) := \frac{1}{M}\sum_{j=1}^{M} Q(j,y)$ for $y\in\mathcal{Y}$. This difference measures the class-specific quality of a company relative to the market, showing how specialized its ML model is. We use the Insurance and Adult datasets. When $n_{\mathrm{b}}=0$, the Adult heatmap shows that predictors 1 and 5 specialize so heavily in class 2 predictions that they sacrifice their predictive power for class 1 compared to other predictors. In other words, competition encourages each ML model to specialize in a small subgroup. However, when $n_{\mathrm{b}}=400$, all predictors have similar levels of class-specific quality. Data purchase makes competing ML predictors similar and keeps predictors from focusing too much on a subgroup of the population. Ginart et al. (2021) reported a similar specialization effect, but what previous work has not shown is that this specialization can be alleviated when predictors purchase user data.
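A minimal sketch of how the class-specific quality gap $Q(j,y)-Q_{\mathrm{avg}}(y)$ in Figure 4 could be estimated from test data is shown below; the array layout is an assumption made for illustration.

```python
import numpy as np

def class_specific_quality_gap(preds, y_true):
    """preds: (M, n) predicted labels from M predictors; y_true: (n,) true labels.
    Returns an (M, K) array estimating Q(j, y) - Q_avg(y) on the test sample."""
    M, _ = preds.shape
    classes = np.unique(y_true)
    Q = np.zeros((M, len(classes)))
    for k, y in enumerate(classes):
        mask = (y_true == y)                              # condition on Y = y
        Q[:, k] = (preds[:, mask] == y).mean(axis=1)      # class-specific accuracy Q(j, y)
    return Q - Q.mean(axis=0, keepdims=True)              # subtract Q_avg(y) over predictors
```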

Figure 6: Heterogeneous predictors. Illustrations of QoE as a function of the overall quality when ML predictors have different buying strategies $\pi$. Color indicates $\alpha$, and point size indicates the budget $n_{\mathrm{b}}$: the larger the budget, the larger the point. In several settings, the overall quality increases as more budget is used, but QoE decreases.

Implications of decreases in diversity

We now examine the connection between the diversity and the quality of the predictor selected by a user. We demonstrate that the probability of encountering only low-quality predictions can increase due to the reduction in diversity, explaining how diversity affects QoE. Figure 5 illustrates the probability density functions of the average quality near zero. It clearly shows that the probability that the average quality is near zero increases as more budget $n_{\mathrm{b}}$ is used: the areas for $n_{\mathrm{b}}=400$ (colored in yellow) are clearly larger than those for $n_{\mathrm{b}}=0$ (colored in red). That is, as predictions become similar, it is more likely that all ML predictions have low quality simultaneously. Hence, the probability that users are not satisfied with the predictions increases, leading to decreases in QoE even when the overall quality increases.

3.3 Robustness to modeling assumptions

We further demonstrate that our findings are robust to different modeling assumptions. We consider the same situation as in the previous sections, but ML predictors can now have different buying strategies $\pi$. To be more specific, we consider three types of buying strategies obtained by varying the threshold of the uncertainty-based AL method. For $C_{\mathrm{Ent}}\in\{0, 0.3, 0.6\}$, we consider the buying strategy $\pi_{C_{\mathrm{Ent}}}(X_t) = \mathbbm{1}(\{\mathrm{Ent}(p^{(i)}(X_t)) \geq C_{\mathrm{Ent}}\log(|\mathcal{Y}|)\})$, where $\mathrm{Ent}$ is the Shannon entropy function and $p^{(i)}(X_t)$ is the probability estimate given $X_t$. We assume there are $6$ predictors for each buying strategy $\pi_0$, $\pi_{0.3}$, and $\pi_{0.6}$. This models a situation where there are three groups of companies with different levels of sensitivity to data purchase; for instance, in our setting, $\pi_{0.6}$ is the most conservative data buyer and is the least likely to buy new data.
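A small sketch of how such a heterogeneous pool of buying strategies could be constructed is shown below, building on the entropy rule sketched in Section 2; the grouping into 6 predictors per threshold follows the text, while the helper names are illustrative assumptions.

```python
import numpy as np
from functools import partial

def entropy_rule(model, x, c_ent, n_classes):
    """pi_{C_Ent}(x) = 1{ Ent(p(x)) >= C_Ent * log(|Y|) }."""
    p = model.predict_proba(x.reshape(1, -1))[0]
    entropy = -np.sum(p * np.log(p + 1e-12))
    return int(entropy >= c_ent * np.log(n_classes))

# Three groups of 6 predictors each, with thresholds C_Ent in {0, 0.3, 0.6}
def make_buying_strategies(n_classes, group_size=6, thresholds=(0.0, 0.3, 0.6)):
    return [partial(entropy_rule, c_ent=c, n_classes=n_classes)
            for c in thresholds for _ in range(group_size)]
```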

Figure 7: Heterogeneous predictors. Illustrations of the diversity as a function of the budget $n_{\mathrm{b}}$ when ML predictors have different buying strategies $\pi$. Color indicates $\alpha$. Shaded regions denote 99% confidence bands based on 30 independent runs. As the budget increases, the diversity decreases.

Figure 6 shows the relationship between QoE and the overall quality when there are heterogeneous ML predictors with different buying strategies. Similar to Figure 2, the overall quality increases but QoE generally decreases as the budget increases across all datasets. As for the diversity, as anticipated, Figure 7 shows that the diversity decreases as the budget increases. This demonstrates that our findings hold robustly across different environment settings. We also conduct additional experiments (i) when budgets $n_{\mathrm{b}}$ differ across companies and (ii) when the number of predictors $M$ differs. All these additional results are provided in Appendix B.

4 Theoretical analysis on competition

In this section, we establish a simple representation for QoE when the quality function is the correctness function. Based on this representation, we theoretically analyze how a diversity-like quantity can affect QoE. Proofs are provided in Appendix C.

Lemma 1 (A simple representation for QoE).

Suppose there is a set of $M\geq 2$ predictors $\{f^{(j)}\}_{j=1}^{M}$ and the quality function is the correctness function, i.e., $q(Y_1,Y_2)=\mathbbm{1}(\{Y_1=Y_2\})$ for all $Y_1,Y_2\in\mathcal{Y}$. Let $Z := \frac{1}{M}\sum_{j=1}^{M} q\left(Y, f^{(j)}(X)\right)$ be the average quality for a user $(X,Y)$. For any $\alpha\geq 0$, we have

$$\mathbb{E}[Z] = \text{(Overall quality)} \leq \text{(QoE)} = \mathbb{E}\left[\frac{Z e^{\alpha}}{Z e^{\alpha} + (1-Z)}\right], \qquad (2)$$

where the equality holds when $\alpha=0$ and the expectation is taken over $P_{X,Y}$.

Lemma 1 presents a relationship between QoE and the overall quality: QoE is never smaller than the overall quality, and the gap appears when $\alpha>0$. In addition, it shows that QoE simplifies to a function of the average quality $Z$ over competitors when the quality function $q$ is the correctness function. When $q$ is not the correctness function, QoE does not admit such a simple representation; we present upper and lower bounds for QoE in Appendix C.2. Using the relationship in Lemma 1, the following theorem provides a sufficient condition under which the overall quality is larger but the QoE is smaller.
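As a sanity check, a short Monte Carlo sketch comparing the closed-form QoE in Equation (2) to direct simulation of the softmax selection is given below; the binomial model for the number of correct predictors is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
M, alpha, n = 18, 2.0, 200_000

# Assume, for illustration, each predictor is independently correct with probability 0.6,
# so Z = N_cor / M with N_cor ~ Binomial(M, 0.6).
Z = rng.binomial(M, 0.6, size=n) / M

# Closed-form QoE from Lemma 1: E[ Z e^alpha / (Z e^alpha + 1 - Z) ]
qoe_closed_form = np.mean(Z * np.exp(alpha) / (Z * np.exp(alpha) + 1 - Z))

# Direct simulation: sample W via the softmax in Equation (1) and average the selected quality
correct = rng.random((n, M)) < 0.6                      # which predictors are correct for each user
logits = alpha * correct
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
W = (probs.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)
qoe_simulated = correct[np.arange(n), W].mean()

print(qoe_closed_form, qoe_simulated)   # the two estimates should roughly agree
```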

Theorem 1 (Comparison of two competition dynamics).

Suppose there are two sets of $M\geq 2$ predictors, $\mathcal{F}_1 := \{f^{(j)}\}_{j=1}^{M}$ and $\mathcal{F}_2 := \{g^{(j)}\}_{j=1}^{M}$. Without loss of generality, suppose the overall quality for $\mathcal{F}_2$ is larger than that for $\mathcal{F}_1$. For the correctness function $q$, define $Z_1 := \frac{1}{M}\sum_{j=1}^{M} q(Y, f^{(j)}(X))$ and $Z_2 := \frac{1}{M}\sum_{j=1}^{M} q(Y, g^{(j)}(X))$ as in Lemma 1. If $\alpha\geq C_{\alpha}$ and $\mathrm{Var}[Z_2]\geq C_1 \mathrm{Var}[Z_1]$ for some explicit constants $C_{\alpha}$ and $C_1\leq 1$, then the QoE for $\mathcal{F}_2$ is smaller than that for $\mathcal{F}_1$.

Theorem 1 compares two competition dynamics, $\mathcal{F}_1$ and $\mathcal{F}_2$, providing a sufficient condition under which the QoE for $\mathcal{F}_2$ is smaller than that for $\mathcal{F}_1$ even though the associated overall quality is larger. For ease of understanding, we can regard $\mathcal{F}_1$ (resp. $\mathcal{F}_2$) as a set of ML predictors when $n_{\mathrm{b}}^{(i)}=0$ (resp. $n_{\mathrm{b}}^{(i)}>0$). Theorem 1 implies that QoE can decrease when $\mathrm{Var}[Z_2]$ is large enough compared to $\mathrm{Var}[Z_1]$. Our results in Figures 3, 4, and 5 show that data purchase makes competing predictors similar, so when $\alpha$ is large enough the average quality is more likely to be zero or one. This increases the variance $\mathrm{Var}[Z_2]$, since the variance of a $[0,1]$-valued random variable is maximized when its mass spreads toward the endpoints. Theorem 1 thus explains why QoE decreases when ML predictors can actively acquire user data through the data purchase system, supporting our main experimental findings.
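To make the constants in Theorem 1 concrete, the short sketch below computes $C_{\alpha}$ and the function $\alpha \mapsto C_1$ as given explicitly in the proof in Appendix C.1, for hypothetical values of $M$, $\mu_1$, and $\mu_2$; the chosen numbers are purely illustrative.

```python
import numpy as np

def theorem1_constants(M, mu1, mu2):
    """From the proof of Theorem 1:
    C_1 = (e^alpha + M - 1) / ((M - 1) e^alpha + 1) and
    C_alpha = log[((M-1) mu1(1-mu1) - mu2(1-mu2)) / ((M-1) mu2(1-mu2) - mu1(1-mu1))]."""
    v1, v2 = mu1 * (1 - mu1), mu2 * (1 - mu2)
    c_alpha = np.log(((M - 1) * v1 - v2) / ((M - 1) * v2 - v1))
    c1 = lambda alpha: (np.exp(alpha) + M - 1) / ((M - 1) * np.exp(alpha) + 1)
    return c_alpha, c1

# Illustrative values: M = 18 predictors, overall qualities mu1 = 0.6 and mu2 = 0.65
c_alpha, c1 = theorem1_constants(18, 0.6, 0.65)
print(c_alpha, c1(4.0))   # any alpha >= C_alpha satisfies the temperature condition
```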

5 Related works

This work builds on and extends the recent paper by Ginart et al. (2021), which studied the impacts of competition. We extend this setting by incorporating a data purchase system into the competition. Note that the setting of Ginart et al. (2021) is a special case of ours with $n_{\mathrm{b}}^{(i)}=0$ for all $i\in[M]$. Our environment enables us to study the impacts of data acquisition in competition, which the previous work does not consider. Whereas the previous work showed that competing predictors become highly focused on sub-populations, our work suggests that this specialization can be beneficial in the sense that it provides users with a variety of options and a better quality of the selected predictors.

A field related to our work is stream-based AL, the problem of learning an algorithm that effectively selects data points to label from a stream of data points (Settles, 2009; Ren et al., 2020). AL has been shown to have better sample complexity than passive learning algorithms (Kanamori, 2007; Hanneke et al., 2011; El-Yaniv & Wiener, 2012), and it is practically effective when the training sample size is small (Konyushkova et al., 2017; Gal et al., 2017; Sener & Savarese, 2018). However, our competition environment differs significantly from AL. In AL, there is only one agent, so competition cannot arise. In addition, while an agent in AL collects data only through label queries, competing predictors in our environment obtain data both from data purchase and from the regular competition. These differences create a unique competition environment, and this work studies the impacts of data purchase in such competitive systems.

Competition has been studied in multi-agent reinforcement learning (MARL), the problem of optimizing goals in a setting where a group of agents in a common environment interact with each other and with the environment (Lowe et al., 2017; Foerster et al., 2017). Competing agents in MARL maximize their own objectives, which may conflict with those of other agents. This setting is often characterized by zero-sum Markov games and is applied to video games such as Pong or Starcraft II (Littman, 1994; Tampuu et al., 2017; Vinyals et al., 2019). We refer to Zhang et al. (2019) for a complementary literature survey of MARL.

Although MARL and our environment have some similarities, the user selection and the data purchase in our environment uniquely define the interactions between users and ML predictors. In MARL, it is assumed that all agents observe information drawn from the shared environment; different agents may observe different states and rewards, but all agents receive information and use it to update their policy functions. In contrast, in our environment, only the selected predictor obtains the label and updates its model, which is arguably more realistic. In addition, ML predictors can collect data points through data purchase. These settings have not been considered in the MARL literature.

6 Conclusion

In this paper, to characterize the nature of competition and data purchase, we propose a new competition environment in which ML predictors are allowed to actively acquire user labels and improve their models. Our results show that even though data purchase improves the quality that predictors provide, it can decrease the quality that users experience. We explain this counterintuitive finding by demonstrating that data purchase makes competing predictors similar to each other across various situations.

Broader impact statement

Our findings can broadly benefit the ML community by providing insights into how competition over data and data acquisition can affect users' experiences. To keep the analysis and experiments tractable, we make some modeling simplifications. Similar simplifications are commonly used in the ML and multi-agent literature and are necessary here, especially because there is no systematic previous analysis of data purchase in competition. For example, one simplifying assumption in our environment is that the user distribution does not change over time. In practice, customer behavior can change or evolve over time (Jin & Vasserman, 2019; Reimers & Shiller, 2019). It is therefore important to expand this direction of research with other models in future work.

References

  • Arumugam & Bhargavi (2019) Subramanian Arumugam and R Bhargavi. A survey on driving behavior analysis in usage based insurance using big data. Journal of Big Data, 6(1):1–21, 2019.
  • Chang & Lin (2011) Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1–27, 2011.
  • Dua & Graff (2017) Dheeru Dua and Casey Graff. Uci machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • El-Yaniv & Wiener (2012) Ran El-Yaniv and Yair Wiener. Active learning via perfect selective classification. The Journal of Machine Learning Research, 13(1):255–279, 2012.
  • Familmaleki et al. (2015) Mahsa Familmaleki, Alireza Aghighi, and Kambiz Hamidi. Analyzing the influence of sales promotion on customer purchasing behavior. International Journal of Economics & management sciences, 4(4):1–6, 2015.
  • Foerster et al. (2017) Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 1146–1155. PMLR, 2017.
  • Freund et al. (1997) Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine learning, 28(2):133–168, 1997.
  • Gal et al. (2017) Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In International Conference on Machine Learning, pp. 1183–1192, 2017.
  • Gardner et al. (2014) Andrew Gardner, Christian A Duncan, Jinko Kanno, and Rastko Selmic. 3d hand posture recognition from small unlabeled point sets. In 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp.  164–169. IEEE, 2014.
  • Ginart et al. (2021) Tony Ginart, Eva Zhang, Yongchan Kwon, and James Zou. Competing ai: How does competition feedback affect machine learning? In International Conference on Artificial Intelligence and Statistics, pp.  1693–1701. PMLR, 2021.
  • Hanneke et al. (2011) Steve Hanneke et al. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
  • Jin & Vasserman (2019) Yizhou Jin and Shoshana Vasserman. Buying data from consumers: The impact of monitoring programs in us auto insurance. Unpublished manuscript. Harvard University, Department of Economics, Cambridge, MA, 2019.
  • Kanamori (2007) Takafumi Kanamori. Pool-based active learning with optimal sampling distribution and its information geometrical interpretation. Neurocomputing, 71(1-3):353–362, 2007.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Konyushkova et al. (2017) Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Learning active learning from data. In Advances in Neural Information Processing Systems, pp. 4225–4235, 2017.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • Littman (1994) Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pp.  157–163. Elsevier, 1994.
  • Lowe et al. (2017) Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
  • Meierhofer et al. (2019) Jürg Meierhofer, Thilo Stadelmann, and Mark Cieliebak. Data products. In Applied Data Science, pp.  47–61. Springer, 2019.
  • Reimers & Shiller (2019) Imke Reimers and Benjamin R Shiller. The impacts of telematics on competition and consumer behavior in insurance. The Journal of Law and Economics, 62(4):613–632, 2019.
  • Ren et al. (2020) Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, and Xin Wang. A survey of deep active learning. arXiv preprint arXiv:2009.00236, 2020.
  • Rowley (1998) Jennifer Rowley. Promotion and marketing communications in the information marketplace. Library review, 1998.
  • Sener & Savarese (2018) Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
  • Sennaar (2019) Kumba Sennaar. How america’s top 4 insurance companies are using machine learning. https://emerj.com/ai-sector-overviews/machine-learning-at-insurance-companies, 2019. Posted February 26, 2020; Retrieved May 19, 2021.
  • Settles (2009) Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
  • Settles & Craven (2008) Burr Settles and Mark Craven. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp.  1070–1079, 2008.
  • Tampuu et al. (2017) Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12(4):e0172395, 2017.
  • Van Der Putten & van Someren (2000) Peter Van Der Putten and Maarten van Someren. Coil challenge 2000: The insurance company case. Technical report, Technical Report 2000–09, Leiden Institute of Advanced Computer Science, 2000.
  • Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • Zhang et al. (2019) Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635, 2019.
  • Žliobaitė et al. (2013) Indrė Žliobaitė, Albert Bifet, Bernhard Pfahringer, and Geoffrey Holmes. Active learning with drifting streaming data. IEEE transactions on neural networks and learning systems, 25(1):27–39, 2013.

Appendix

In this appendix, we provide implementation details in Appendix A, additional numerical experiments in Appendix B, and proofs and additional theoretical results in Appendix C.

Appendix A Implementation details

In this section, we provide implementation details. We explain the user distribution, ML predictors, and the proposed environment.

Datasets (user distribution)

As for the datasets (user distribution $P_{X,Y}$), we use the following seven datasets in our experiments: Insurance (Van Der Putten & van Someren, 2000), Adult (Dua & Graff, 2017), Postures (Gardner et al., 2014), Skin-nonskin (Chang & Lin, 2011), Fashion-MNIST (Xiao et al., 2017), MNIST (LeCun et al., 2010), and CIFAR10 (Krizhevsky et al., 2009). For each dataset, we first split it into a competition dataset and an evaluation dataset: the competition dataset is used during the $T=10^4$ competition rounds, and the evaluation dataset is used to evaluate metrics after the competition. For Fashion-MNIST, MNIST, and CIFAR10, we use the original training and test datasets as the competition and evaluation datasets, respectively. For Insurance, Adult, Postures, and Skin-nonskin, we randomly sample $5000$ data points from the original dataset to form the evaluation dataset and use the remaining data points as the competition dataset. At each round of competition, we randomly sample one data point from the competition dataset. After the $T$ competition rounds, we randomly sample $3000$ points from the evaluation dataset and evaluate the metrics (the overall quality, QoE, and diversity). Note that all experimental results are based on the evaluation dataset. Table 1 shows a summary of the seven datasets used in our experiments.

Table 1: A summary of datasets used in our experiments.
Dataset         Competition dataset size   Evaluation dataset size   Input dimension   # of classes $|\mathcal{Y}|$
Insurance       13823                      5000                      16                2
Adult           43842                      5000                      108               2
Postures        69975                      5000                      15                5
Skin-nonskin    239057                     5000                      3                 2
Fashion-MNIST   60000                      10000                     784               10
MNIST           60000                      10000                     784               10
CIFAR10         50000                      10000                     3072              10

As for preprocessing, we standardize the Skin-nonskin features to zero mean and unit standard deviation, and we apply channel-wise standardization to the two image datasets MNIST and CIFAR10. Other than these three datasets, we do not apply any preprocessing. To reflect randomness in customers' selections, we apply random noise to the original labels: for every dataset, each label is replaced with a random label with probability 30%. This random perturbation is applied to both the competition and evaluation datasets.
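A minimal sketch of this label-flipping step, under the assumption that a replacement label is drawn uniformly from the label set, is given below.

```python
import numpy as np

def add_label_noise(labels, n_classes, noise_rate=0.3, rng=np.random.default_rng(0)):
    """With probability `noise_rate`, replace each label with one drawn uniformly from the classes."""
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    labels[flip] = rng.integers(0, n_classes, size=flip.sum())
    return labels
```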

ML predictors

We fix the number of predictors to $M=18$ throughout our experiments. For each dataset, which defines one competition environment, we consider a homogeneous setting, i.e., all predictors have the same number of seed data points $n_{\mathrm{s}}^{(i)}$, budget $n_{\mathrm{b}}^{(i)}$, model $f^{(i)}$, and buying strategy $\pi^{(i)}$. As for the buying strategy, we fix $\pi^{(i)}(X_t) = \mathbbm{1}(\{\mathrm{Ent}(p^{(i)}(X_t)) \geq 0.3\log(|\mathcal{Y}|)\})$, where $\mathrm{Ent}(p^{(i)}(X_t))$ is the Shannon entropy of $p^{(i)}(X_t)$ and $p^{(i)}(X_t)$ is the corresponding probability estimate given $X_t$. That is, if the entropy is higher than the predefined threshold $0.3\log(|\mathcal{Y}|)$, a predictor decides to buy the user data. Note that $\log(|\mathcal{Y}|)$ is the Shannon entropy of the uniform distribution on $\mathcal{Y}$.

Table 2 summarizes the seed data $n_{\mathrm{s}}$ and the model $f$ for each dataset. Every ML predictor is initially trained with the $n_{\mathrm{s}}$ seed data points. For all experiments, we use the Adam optimizer (Kingma & Ba, 2014) with the specified learning rate and number of epochs, and the batch size is fixed to $64$. If a predictor is selected, its ML model is updated with one iteration on the newly obtained data point, and we retrain the model whenever 'retrain period' new samples have been obtained.

Table 2: A summary of hyperparameters by dataset. Logistic denotes a logistic model and NN denotes a neural network with one hidden layer.
Dataset         Seed data $n_{\mathrm{s}}$   Model      # of hidden nodes   Epochs   Learning rate       Retrain period
Insurance       100                          Logistic   -                   10       $5\times 10^{-3}$   50
Adult           100                          Logistic   -                   10       $10^{-2}$           50
Postures        200                          Logistic   -                   10       $3\times 10^{-2}$   50
Skin-nonskin    50                           Logistic   -                   10       $3\times 10^{-2}$   50
Fashion-MNIST   50                           NN         400                 30       $10^{-4}$           150
MNIST           50                           NN         400                 30       $10^{-4}$           150
CIFAR10         100                          NN         400                 30       $10^{-4}$           150
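For illustration, a rough sketch of this per-round update and periodic retraining might look as follows; the `fit` and `train_one_step` methods are placeholder assumptions standing in for the unspecified training code (e.g., a logistic model or a one-hidden-layer network trained with Adam), not the paper's implementation.

```python
class OnlinePredictor:
    """Sketch of the update protocol: one gradient step per new point, plus periodic retraining."""

    def __init__(self, model, seed_X, seed_y, retrain_period=50):
        self.model = model
        self.X, self.y = list(seed_X), list(seed_y)        # data owned by the predictor
        self.retrain_period = retrain_period
        self.new_since_retrain = 0
        self.model.fit(self.X, self.y)                     # initial training on the seed data

    def update(self, x, y):
        self.X.append(x)
        self.y.append(y)
        self.model.train_one_step(x, y)                    # one iteration on the new datum
        self.new_since_retrain += 1
        if self.new_since_retrain >= self.retrain_period:  # full retraining every `retrain_period` samples
            self.model.fit(self.X, self.y)
            self.new_since_retrain = 0
```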

Appendix B Additional numerical experiments

In this section, we provide additional experimental results to demonstrate the robustness of our findings to different modeling assumptions in heterogeneous settings. Specifically, we consider different budgets in Subsection B.1 and different numbers of competing predictors in Subsection B.2. All additional results again show the robustness of our experimental findings to different modeling assumptions.

B.1 Different budgets

We use the same configuration as the homogeneous setting but with different budgets, on the Insurance, Adult, and Skin-nonskin datasets. For $n_{\mathrm{b}}^{(i)}\in\{0,100,200,400\}$, we assume that the first $9$ predictors have a budget of $n_{\mathrm{b}}^{(i)}$, while the last $9$ predictors have a budget of $n_{\mathrm{b}}^{(i)}/2$. That is, half of the predictors have half the budget of the other group. This situation can be thought of as some groups of companies having more capital than others. Figure 8 shows that the main findings appear again even when different budgets are used (QoE generally decreases, overall quality increases, and diversity decreases).

Figure 8: Heterogeneous predictors. Main figures when competing predictors use different budgets. The results are similar to the homogeneous setting, showing the robustness of our main findings.

B.2 Different number of competing predictors

We also show that our findings are consistent for different numbers of competing predictors in the market. All the experiments in Section 3 of the manuscript consider $M=18$. Here, we consider the homogeneous setting with a different number of competing predictors, $M=9$ or $M=12$. As Figures 9 and 10 show, the main findings are observed again for these numbers of predictors.

Figure 9: Heterogeneous predictors. Main figures when there are $M=9$ competing predictors. The results are similar to the $M=18$ case, showing the robustness of our main results.
Figure 10: Heterogeneous predictors. Main figures when there are $M=12$ competing predictors. The results are similar to the $M=18$ case, showing the robustness of our main results.

Appendix C Proofs and additional theoretical results

We provide proofs for Lemma 1 and Theorem 1 in Subsection C.1. We also present an additional theoretical result, QoE for a general quality function, in Subsection C.2.

C.1 Proofs

Proof of Lemma 1.

For notational convenience, we set $q^{(j)} := q(Y, f^{(j)}(X))$ for $j\in[M]$.

$$\mathbb{E}\left[q\left(Y, f^{(W(\alpha))}(X)\right)\right] = \mathbb{E}\left[\mathbb{E}\left[q\left(Y, f^{(W(\alpha))}(X)\right) \mid Y, \{f^{(j)}\}_{j=1}^{M}\right]\right] = \mathbb{E}\left[\sum_{j=1}^{M} p^{(j)}(\alpha)\, q^{(j)}\right], \qquad (3)$$

where, for $i\in[M]$,

$$p^{(i)}(\alpha) = \frac{\exp\left(\alpha q^{(i)}\right)}{\sum_{j=1}^{M} \exp\left(\alpha q^{(j)}\right)}.$$

Since $q^{(j)} = \mathbbm{1}(\{Y=f^{(j)}(X)\}) \in \{0,1\}$, for $N_{\mathrm{cor}} := \sum_{j=1}^{M} q(Y, f^{(j)}(X)) = \sum_{j=1}^{M} \mathbbm{1}(\{Y=f^{(j)}(X)\})$, we have

$$p^{(j)}(\alpha) = \frac{\exp\left(\alpha \mathbbm{1}(\{Y=f^{(j)}(X)\})\right)}{N_{\mathrm{cor}}\exp(\alpha) + (M - N_{\mathrm{cor}})},$$

and

$$\sum_{j=1}^{M} p^{(j)}(\alpha)\, q^{(j)} = \frac{1}{M}\sum_{j=1}^{M} \frac{e^{\alpha}\, \mathbbm{1}(\{Y=f^{(j)}(X)\})}{(N_{\mathrm{cor}}/M)e^{\alpha} + (1 - N_{\mathrm{cor}}/M)} = \frac{(N_{\mathrm{cor}}/M)e^{\alpha}}{(N_{\mathrm{cor}}/M)e^{\alpha} + (1 - N_{\mathrm{cor}}/M)}.$$

Since $Z = N_{\mathrm{cor}}/M$ and $k(z,\alpha) := \frac{z e^{\alpha}}{z e^{\alpha} + (1-z)}$ is increasing in $\alpha$ with $k(z,0)=z$, this concludes the proof. ∎

Proof of Theorem 1.

For $0\leq z\leq 1$ and $\alpha\geq 0$, let $k(z,\alpha) := \frac{z e^{\alpha}}{z e^{\alpha} + (1-z)}$, $\mu_1 := \mathbb{E}[Z_1]$, and $\mu_2 := \mathbb{E}[Z_2]$. Note that

$$\mathbb{E}\left[q\left(Y, f^{(W(\alpha))}(X)\right)\right] = \mu_1 + \mathbb{E}\left[k(Z_1,\alpha) - Z_1\right],$$
$$\mathbb{E}\left[q\left(Y, g^{(W(\alpha))}(X)\right)\right] = \mu_2 + \mathbb{E}\left[k(Z_2,\alpha) - Z_2\right].$$

Thus, we have

$$\mathbb{E}\left[q\left(Y, f^{(W(\alpha))}(X)\right)\right] \geq \mathbb{E}\left[q\left(Y, g^{(W(\alpha))}(X)\right)\right]$$
$$\Longleftrightarrow\quad \mathbb{E}\left[k(Z_1,\alpha) - Z_1\right] - \mathbb{E}\left[k(Z_2,\alpha) - Z_2\right] \geq \mu_2 - \mu_1.$$

For $Z\in\{\frac{1}{M},\dots,\frac{M-1}{M}\}$, we have

$$\frac{M(e^{\alpha}-1)}{(M-1)e^{\alpha}+1} \leq \frac{e^{\alpha}-1}{Ze^{\alpha}+(1-Z)} \leq \frac{M(e^{\alpha}-1)}{e^{\alpha}+(M-1)}.$$

Therefore, since $k(Z,\alpha) - Z = \frac{Z(1-Z)(e^{\alpha}-1)}{Ze^{\alpha}+(1-Z)}$, we have

$$\frac{M(e^{\alpha}-1)}{(M-1)e^{\alpha}+1}\, Z(1-Z) \leq k(Z,\alpha) - Z \leq \frac{M(e^{\alpha}-1)}{e^{\alpha}+(M-1)}\, Z(1-Z). \qquad (4)$$

Let $C_{\mathrm{low}} = \frac{M(e^{\alpha}-1)}{(M-1)e^{\alpha}+1}$ and $C_{\mathrm{upp}} = \frac{M(e^{\alpha}-1)}{e^{\alpha}+(M-1)}$. From the inequalities (4), we have

$$\mathbb{E}\left[k(Z_1,\alpha) - Z_1\right] - \mathbb{E}\left[k(Z_2,\alpha) - Z_2\right] \geq C_{\mathrm{low}}\mathbb{E}\left[Z_1(1-Z_1)\right] - C_{\mathrm{upp}}\mathbb{E}\left[Z_2(1-Z_2)\right]$$
$$= C_{\mathrm{low}}\left(\mu_1(1-\mu_1) - \mathrm{Var}[Z_1]\right) - C_{\mathrm{upp}}\left(\mu_2(1-\mu_2) - \mathrm{Var}[Z_2]\right).$$

The last equality is due to $\mathbb{E}[Z(1-Z)] = \mathbb{E}[Z](1-\mathbb{E}[Z]) - \mathrm{Var}(Z)$. Therefore, QoE is decreased if

$$C_{\mathrm{low}}\left(\mu_1(1-\mu_1) - \mathrm{Var}[Z_1]\right) - C_{\mathrm{upp}}\left(\mu_2(1-\mu_2) - \mathrm{Var}[Z_2]\right) \geq \mu_2 - \mu_1$$
$$\Longleftrightarrow\quad \mathrm{Var}[Z_2] \geq \frac{C_{\mathrm{low}}}{C_{\mathrm{upp}}}\left(\mathrm{Var}[Z_1] - \mu_1(1-\mu_1)\right) + \frac{1}{C_{\mathrm{upp}}}\left(\mu_2 - \mu_1\right) + \mu_2(1-\mu_2)$$
$$\Longleftrightarrow\quad \mathrm{Var}[Z_2] \geq C_1\mathrm{Var}[Z_1] + C_2(\mu_1,\mu_2),$$

where

$$C_1 := \frac{C_{\mathrm{low}}}{C_{\mathrm{upp}}} = \frac{e^{\alpha}+(M-1)}{(M-1)e^{\alpha}+1} \leq 1,$$
$$C_2(\mu_1,\mu_2) := -C_1\mu_1(1-\mu_1) + \frac{1}{C_{\mathrm{upp}}}\left(\mu_2 - \mu_1\right) + \mu_2(1-\mu_2).$$

Therefore, if there is a constant $C_{\alpha}$ such that $C_2\geq 0$ whenever $\alpha\geq C_{\alpha}$, the proof is concluded.

By the definition of $C_2$ and since $\mu_2\geq\mu_1$, $C_2$ is nonnegative whenever $\mu_2(1-\mu_2) \geq C_1\mu_1(1-\mu_1)$. Observe that

$$\mu_2(1-\mu_2) - C_1\mu_1(1-\mu_1) \geq 0$$
$$\Longleftrightarrow\quad C_1 \leq \frac{\mu_2(1-\mu_2)}{\mu_1(1-\mu_1)}$$
$$\Longleftrightarrow\quad \frac{e^{\alpha}+(M-1)}{(M-1)e^{\alpha}+1} \leq \frac{\mu_2(1-\mu_2)}{\mu_1(1-\mu_1)}$$
$$\Longleftrightarrow\quad e^{\alpha} \geq \frac{(M-1)\mu_1(1-\mu_1) - \mu_2(1-\mu_2)}{(M-1)\mu_2(1-\mu_2) - \mu_1(1-\mu_1)}.$$

By setting $C_{\alpha} = \log\frac{(M-1)\mu_1(1-\mu_1) - \mu_2(1-\mu_2)}{(M-1)\mu_2(1-\mu_2) - \mu_1(1-\mu_1)}$, this concludes the proof. ∎

C.2 QoE for a general quality function

The following theorem shows the upper and lower bounds of QoE for a general quality function.

Theorem 2.

Suppose there is a set of $M\geq 2$ prediction models $\{f^{(j)}\}_{j=1}^{M}$. For any non-negative function $q: \mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{+}$ and $\alpha\geq 0$, we have the following upper and lower bounds:

$$\mathbb{E}\left[\frac{1}{M}\sum_{j=1}^{M} q\left(Y, f^{(j)}(X)\right)\right] \leq \mathbb{E}\left[q\left(Y, f^{(W(\alpha))}(X)\right)\right] \leq \mathbb{E}\left[\max_{j\in[M]} q\left(Y, f^{(j)}(X)\right)\right],$$

where $W(\alpha)\in[M]$ denotes the selected index.

Proof of Theorem 2.

We use the same notation as in the proof of Lemma 1. We first show that QoE is increasing in $\alpha$. From the representation (3), we have

$$\frac{\partial}{\partial\alpha}\mathbb{E}\left[\sum_{j=1}^{M} p^{(j)}(\alpha)\, q^{(j)}\right] = \mathbb{E}\left[\sum_{j=1}^{M} \frac{\partial p^{(j)}(\alpha)}{\partial\alpha}\, q^{(j)}\right]$$
$$= \mathbb{E}\left[\sum_{j=1}^{M} \frac{\exp\left(\alpha q^{(j)}\right) q^{(j)} \left(\sum_{k=1}^{M}\exp\left(\alpha q^{(k)}\right)\right) - \exp\left(\alpha q^{(j)}\right)\sum_{k=1}^{M}\exp\left(\alpha q^{(k)}\right) q^{(k)}}{\left(\sum_{k=1}^{M}\exp\left(\alpha q^{(k)}\right)\right)^{2}}\, q^{(j)}\right]$$
$$= \mathbb{E}\left[\sum_{j=1}^{M} p^{(j)}(\alpha)\left(q^{(j)} - \bar{q}\right) q^{(j)}\right],$$

where $\bar{q} := \sum_{k=1}^{M} p^{(k)}(\alpha)\, q^{(k)}$. From the last expression, we have

$$\sum_{j=1}^{M} p^{(j)}(\alpha)\left(q^{(j)} - \bar{q}\right) q^{(j)} = \sum_{j=1}^{M} p^{(j)}(\alpha)\left(q^{(j)}\right)^{2} - \bar{q}^{2} \geq 0.$$

Note that the non-negativity follows from the Cauchy-Schwarz inequality. We now prove the upper bound. Note that

$$\sum_{j=1}^{M} p^{(j)}(\alpha)\, q^{(j)} \leq \max_{j\in[M]} q^{(j)},$$

and the equality holds when $\alpha=\infty$. Therefore, taking expectations on both sides provides the upper bound. As for the lower bound, due to the representation (3), it is enough to show that

$$\sum_{j=1}^{M} p^{(j)}(\alpha)\, q^{(j)} \geq \frac{1}{M}\sum_{j=1}^{M} q^{(j)}.$$

Since the left-hand side is increasing in $\alpha$ and plugging in $\alpha=0$ gives $p^{(j)}(0)=1/M$, the inequality holds for all $\alpha\geq 0$. This concludes the proof. ∎