
Ranking with Multiple Objectives

Nikhil R. Devanur Amazon. Email: Iam@nikhildevanur.com. Work done while the author was at Microsoft Research.    Sivakanth Gopi Microsoft Research. Email: sigopi@microsoft.com
Abstract

In search and advertisement ranking, it is often required to simultaneously maximize multiple objectives. For example, the objectives can correspond to multiple intents of a search query, or in the context of advertising, they can be relevance and revenue. It is important to efficiently find rankings which strike a good balance between such objectives. Motivated by such applications, we formulate a general class of problems where

  • each result gets a different score corresponding to each objective,

  • the results of a ranking are aggregated by taking, for each objective, a weighted sum of the scores in the order of the ranking, and

  • an arbitrary concave function of the aggregates is maximized.

Combining the aggregates using a concave function naturally leads to more balanced outcomes. We give an approximation algorithm in a bicriteria/resource augmentation setting: the algorithm, given a slight advantage, does as well as the optimum. In particular, if the aggregation step is just the sum of the top $k$ results, then the algorithm outputs $k+1$ results which do as well as the optimal top $k$ results. We show how this approach helps with balancing different objectives via simulations on synthetic data as well as on real data from LinkedIn.

1 Introduction

We study the problem of ranking with multiple objectives. Ranking is an important component of many online platforms such as Google, Bing, Facebook, LinkedIn, Amazon, Yelp, and so on. It is quite common that the platform has multiple objectives while choosing a ranking. For instance, in a search engine, when someone searches for “jaguar”, it could refer to either the animal or the car company. Thus there is one set of results that are relevant for jaguar the animal, another for jaguar the car company, and the search engine has to pick a ranking to satisfy both intents.

Another common reason to have multiple objectives is advertising. The final ranking produced has organic results as well as ads, and the objectives are relevance and revenue. Ads contribute to both relevance and revenue, whereas organic results only contribute to relevance. While in some cases ads occupy specialized slots, it is becoming more common to have floating ads. Also, in many cases, the same result can qualify both as an ad and as an organic result, and it is not desirable to repeat it. In such cases, one has to produce a single ranking of all the results (the union of organic results and ads) that achieves a certain tradeoff between the two objectives.

The predominant methodology currently used to handle multiple objectives is to combine them into one objective using a linear combination (Vogt and Cottrell, 1999). The advantage of this is that it can trace out the entire Pareto frontier of the achievable objectives. The disadvantage is that one has to choose a single linear combination for a large number of instances. This often results in cases where one objective is favored much more than the others. This is illustrated in Figure 1. To explain this figure we introduce some notation.

Suppose that there are $m$ instances, and for each instance there are $n$ results that are to be ranked. Each result $j$ for instance $i$ has two numbers associated with it, $a_{ij}$ and $b_{ij}$, that correspond to the two objectives. Given a ranking $\pi:[n]\rightarrow[n]$, we aggregate the two objective values for instance $i$ using cumulative scores defined as

\[
\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}}_{i},\pi)=\sum_{j\in[n]}w_{j}a_{i\pi(j)}\text{ and }\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}}_{i},\pi)=\sum_{j\in[n]}w_{j}b_{i\pi(j)},
\]

for some non-negative weight vector ${\mathbf{w}}=(w_{1},\cdots,w_{n})$ with (weakly) decreasing weights, i.e., $w_{1}\geq w_{2}\geq\dots\geq w_{n}\geq 0$. (Throughout the paper, we use the convention that $w_{i}$ is the $i^{\rm th}$ coordinate of $\mathbf{w}$.) Further, ${\mathbf{a}}_{i}$ is the vector $(a_{i1},a_{i2},\cdots,a_{in})$. For example, in advertisement ranking, $w_{i}$ can represent the click rate, i.e., the probability that a user clicks the $i^{\rm th}$ result in the ranking. It is natural that the click rates decrease with the position, i.e., it is more likely that a top result is clicked. Suppose $a_{j}$ represents the revenue generated when the $j^{\rm th}$ ad is clicked, and $b_{j}$ represents the relevance of the $j^{\rm th}$ ad to the user query. Then $\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}},\pi)$ represents the expected revenue generated and $\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}},\pi)$ represents the expected total relevance for the user when the ads are ranked according to $\pi$. In the figure, the weight vector is the one used in discounted cumulative gain (DCG) and normalized DCG (NDCG) (Burges et al., 2005), which are standard measures often used in evaluating search engine rankings. This weight vector is:

\[
w_{i}=\frac{1}{\log_{2}(i+1)}. \qquad (1)
\]

We normalize the cumulative scores by the best possible ranking for each objective. This is motivated by two things: the resulting numbers are all in $[0,1]$ so they are comparable to each other, and how well the ranking did relative to the best achievable one is often the more meaningful measure. We define

\[
\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}({\mathbf{a}})=\max_{\pi}\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}},\pi)\text{ and }\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}({\mathbf{b}})=\max_{\pi}\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}},\pi).
\]

When the weight vector is the one mentioned above, we refer to the normalized cumulative scores as NDCG. Figure 1 shows scatter plots of the NDCGs for the two objectives, for different algorithms: a given algorithm produces ranking $\pi_{i}$ for instance $i$, and each dot in the plot is a point

\[
\left(\frac{\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}}_{i},\pi_{i})}{\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}({\mathbf{a}}_{i})},\frac{\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}}_{i},\pi_{i})}{\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}({\mathbf{b}}_{i})}\right).
\]
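As a concrete illustration, the following minimal sketch (ours, not the authors' code) computes the weighted cumulative score and its NDCG-style normalization for a single instance; the DCG weights follow (1), and the score vector and ranking are placeholders.

```python
import numpy as np

def cumulative_score(scores, ranking, w):
    # cs_w(scores, pi) = sum_j w_j * scores[pi(j)]
    return float(np.dot(w, scores[ranking]))

def best_cumulative_score(scores, w):
    # cs*_w(scores): sort the scores in decreasing order against the decreasing weights
    return float(np.dot(w, np.sort(scores)[::-1]))

n = 50
w = 1.0 / np.log2(np.arange(2, n + 2))   # DCG weights w_i = 1 / log2(i + 1)
a = np.random.rand(n)                    # placeholder scores for one objective
pi = np.random.permutation(n)            # some candidate ranking
ndcg_a = cumulative_score(a, pi, w) / best_cumulative_score(a, w)
```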

The source of the data is the LinkedIn news feed: it is a random sample from one day of results. Here we do not go into the details of what the two objectives are; Section 4 has more details. On the right, we show the result of ranking using the sum of the two scores. The triangle shape of the scatter plot is persistent across different samples and different choices of linear combinations. What we wish to avoid are the two corners of the triangle where one of the two NDCGs is rather small. Ideally, we would like to be at the apex of the triangle, which is at the top right corner of the figure.

On the left, we show the results of our algorithm for the following objective:

\[
\max_{\pi}\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}}_{i},\pi)\cdot\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}}_{i},\pi).
\]

The undesirable corners of the triangle have vanished: instances where one objective is much smaller than the other are rare, if they occur at all. The points are closer to the top right corner.

Figure 1: Scatter plot of NDCGs for two different objectives, $A$ and $B$, on real data.

1.1 Main Results

The key idea is to combine the two cumulative scores using a concave function $f$. (Maximizing the product is the same as maximizing the sum of logs, which is a concave function.) Concave functions tend to favor more balanced outcomes almost by definition: the function at the average of two points is at least as high as the average of the function at the two points, i.e., $f(\tfrac{x_{1}+x_{2}}{2},\tfrac{y_{1}+y_{2}}{2})\geq\tfrac{f(x_{1},y_{1})+f(x_{2},y_{2})}{2}$. Figure 1 is a good demonstration of this.

We allow arbitrary concave functions that are strictly increasing in each coordinate. We define the combined objective score of $\pi$ with weights ${\mathbf{w}}$ as

\[
\operatorname{\mathsf{co}}_{{\mathbf{w}},f}({\mathbf{a}},{\mathbf{b}},\pi)=f\left(\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}},\pi),\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}},\pi)\right).
\]

For some objectives, the sum of the cumulative scores across different instances is still an important metric, e.g., the total revenue, or the number of clicks. We allow incorporating such metrics via a global concave function, i.e., a concave function of the sum of all the cumulative scores over all the instances. Let $F$, and $f_{i}$ for $i\in[m]$, be concave functions in two variables. We consider the problem of finding a ranking $\pi_{i}$ for each $i$ in order to maximize

\[
\sum_{i\in[m]}\operatorname{\mathsf{co}}_{{\mathbf{w}},f_{i}}({\mathbf{a}}_{i},{\mathbf{b}}_{i},\pi_{i})+F\left(\sum_{i\in[m]}\operatorname{\mathsf{cs}}_{{\mathbf{w}}}({\mathbf{a}}_{i},\pi_{i}),\sum_{i\in[m]}\operatorname{\mathsf{cs}}_{{\mathbf{w}}}({\mathbf{b}}_{i},\pi_{i})\right).
\]

Our main results are polynomial time bi-criteria approximation algorithms for the problem mentioned above. These are similar in spirit to results with resource augmentation in scheduling, or Bulow-Klemperer style results in mechanism design.

  • Consider the special case where we sum the top $k<n$ entries, i.e., the weight vector is $k$ ones followed by all zeros. For this case, we allow the algorithm to sum the top $k+1$ entries, and show that the resulting objective is at least as good as the optimum for the sum of the top $k$ results.

  • For the general case, the algorithm gets an advantage as follows: replace one coordinate of ${\mathbf{w}}$, say $w_{j}$, with the immediately preceding coordinate $w_{j-1}$. For the case of summing the top $k$ entries, this corresponds to replacing the $(k+1)^{\rm st}$ coordinate, which is a 0, with the $k^{\rm th}$ coordinate, which is a 1. The replacement can be different for different $i$. Again, the ranking output by the algorithm with the new weights does as well as the optimum with the original weights ${\mathbf{w}}$. See Theorem 2.2 and Theorem 3.2 for formal statements.

    One advantage of such a guarantee is that it does not depend on the parameters of the concave functions, such as their Lipschitz constants or the range of values, as is usual with other types of guarantees. This allows greater flexibility in the choice of these concave functions.

  • When there is no global function $F$, for each $i$, the algorithm just does a binary search (Proposition 2.8). In each iteration of the binary search, we compute a ranking optimal for a linear combination of the two objectives. The running time to solve each ranking problem (i.e., each instance $i$) is $O(n\log^{2}n)$. In practice, the ranking algorithms are required to be very fast, so this is an important property. For the general case, this is still true, provided that we are given two additional parameters that are optimized appropriately. In practice, such parameters are tuned 'offline' so we can still use the binary search to rank 'online' any new instance $i$.

1.2 Related Work

Rank aggregation is much studied, most frequently in the context of databases with ‘fuzzy’ queries (Fagin and Wimmers, 1997) and in the context of information retrieval or web search (Dwork et al., 2001; Aslam and Montague, 2001). There are two main categories of results (Renda and Straccia, 2003), (i) where the input is a set of rankings (Dwork et al., 2001; Renda and Straccia, 2003), and (ii) where the input is a set of scores (Vogt and Cottrell, 1999; Fox and Shaw, 1994). Clearly score based aggregation methods are more powerful, since there is strictly more information; our paper falls in the score based aggregation category.

Among the score based methods, Vogt and Cottrell (1999) use the same form of cumulative scores as us, and empirically evaluate the usage of a linear combination of cumulative scores. They identify limitations of this method and conditions under which it does well. Fox and Shaw (1994) propose and evaluate several methods for combining the scores for different objectives result by result, which are then used to rank. In contrast, we first aggregate the scores for each objective and then combine these cumulative scores.

Azar et al. (2009) also consider rank aggregation motivated by multiple intents in search rankings, but with several differences. They consider a large number of different intents, as opposed to this paper where we focus on just 2. Their objective function also depends on a weight vector but in a different way. For each intent, a result is either ‘relevant’ or not, and given a ranking, the cumulative score for that intent is the weight corresponding to the highest rank at which a relevant result appears. Their objective is a weighted sum of the cumulative scores across all intents.

The rank based aggregation methods are closely related to voting schemes and social choice theory, and a lot of this has focused on algorithms to compute the Kemeny-Young rank aggregation (Young and Levenglick, 1978; Young, 1988; Saari, 1995; Borda, 1784).

Organization:

In Section 2 we give our binary search based algorithm for a single instance. The general case is presented in Section 3. We present experimental results in Section 4. The appendix contains some missing proofs.

2 A single instance of ranking with multiple objectives

Let us formally define the $\operatorname{\mathsf{Rank}}$ problem.

Definition 2.1 ($\operatorname{\mathsf{Rank}}({\mathbf{a}},{\mathbf{b}},{\mathbf{w}},f)$).

Given ${\mathbf{a}},{\mathbf{b}},{\mathbf{w}},f$, find a ranking $\pi$ which maximizes the combined objective, i.e., find $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})=\max_{\pi}\operatorname{\mathsf{co}}_{{\mathbf{w}}}({\mathbf{a}},{\mathbf{b}},\pi)$.

It is not clear if $\operatorname{\mathsf{Rank}}({\mathbf{a}},{\mathbf{b}},{\mathbf{w}},f)$ can be efficiently solved, because it involves optimization over all rankings and there are exponentially many of them. Our main result shows that when $f$ is concave, we can find nearly optimal solutions. We will assume that $f$ is differentiable and has continuous derivatives.

Theorem 2.2.

Suppose $f(\alpha,\beta)$ is a concave function over the range $\alpha,\beta\geq 0$ and $f$ is strictly increasing in each coordinate in that range. Given an instance of $\operatorname{\mathsf{Rank}}({\mathbf{a}},{\mathbf{b}},{\mathbf{w}},f)$, there is an algorithm that runs in $O(n\log^{2}n)$ time (this is a Las Vegas algorithm, i.e., it always outputs the correct answer and runs in $O(n\log^{2}n)$ time with high probability; we also give an $O(n\log n\log B)$ algorithm when the $a_{i},b_{i}$ are integers bounded by $B$) and outputs a ranking $\pi$ of $[n]$ such that $\operatorname{\mathsf{co}}_{{\mathbf{w}}^{\prime}}({\mathbf{a}},{\mathbf{b}},\pi)\geq\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})$, where ${\mathbf{w}}^{\prime}={\mathbf{w}}+(w_{i}-w_{i+1}){\mathbf{e}}_{i+1}$ for some $i\in[n-1]$. In other words, ${\mathbf{w}}^{\prime}$ is obtained by replacing $w_{i+1}$ with $w_{i}$ in ${\mathbf{w}}$ for some $i\in[n-1]$.

We have the following corollary for the important special case where the cumulative scores are the sum of scores of the top $k$ elements, i.e., ${\mathbf{w}}=(1,\dots,1,0,\dots,0)$ with exactly $k$ ones. In this special case, $\operatorname{\mathsf{Rank}}$ is called the $\operatorname{\mathsf{TOP}}_{k}$ problem.

Corollary 2.3.

Given such an instance of $\operatorname{\mathsf{TOP}}_{k}({\mathbf{a}},{\mathbf{b}},f)$, there is an efficient algorithm that outputs a subset $S\subset[n]$ of at most $k+1$ elements such that

\[
f\left(\sum_{i\in S}a_{i},\sum_{i\in S}b_{i}\right)\geq\max_{|T|=k}f\left(\sum_{i\in T}a_{i},\sum_{i\in T}b_{i}\right).
\]

We will now prove Theorem 2.2. We also make the mild assumption that the numbers in ${\mathbf{a}},{\mathbf{b}},{\mathbf{w}}$ are generic for the proof, which can be achieved by perturbing all the numbers with a tiny additive noise. In particular, we will assume that $w_{1}>w_{2}>\dots>w_{n}>0$. This only perturbs $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})$ by a tiny amount, and by a limiting argument it does not affect the result. To prove Theorem 2.2, we create a convex programming relaxation for $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})$ as shown in (2) and denote its value by $\mathsf{OPT}$.

\[
\begin{aligned}
\mathsf{OPT}=\max_{\pi_{ij},\alpha,\beta\geq 0}\quad & f(\alpha,\beta) \qquad\qquad (2)\\
\text{s.t.}\quad & \alpha\leq\sum_{i,j=1}^{n}\pi_{ij}w_{i}a_{j} \qquad\rightarrow\ (p)\\
& \beta\leq\sum_{i,j=1}^{n}\pi_{ij}w_{i}b_{j} \qquad\rightarrow\ (q)\\
& \forall i\ \ \sum_{j=1}^{n}\pi_{ij}\leq 1 \qquad\rightarrow\ (r_{i})\\
& \forall j\ \ \sum_{i=1}^{n}\pi_{ij}\leq 1 \qquad\rightarrow\ (c_{j})
\end{aligned}
\]

It is clear that $\mathsf{OPT}$ is a relaxation for $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})$, i.e., $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})\leq\mathsf{OPT}$. By convex programming duality, $\mathsf{OPT}$ can be expressed as a dual minimization problem (3) by introducing a dual variable for every constraint in the primal, as shown in (2). Note that by Slater's condition, strong duality holds here (Boyd and Vandenberghe, 2004). The constraints in the dual correspond to variables in the primal as shown in (3).

\[
\begin{aligned}
\mathsf{OPT}=\min_{r_{i},c_{j},p,q\geq 0}\quad & \sum_{i=1}^{n}r_{i}+\sum_{j=1}^{n}c_{j}+f^{*}(-p,-q) \qquad\qquad (3)\\
\text{s.t.}\quad & \forall i,j\ \ r_{i}+c_{j}\geq w_{i}(pa_{j}+qb_{j}) \qquad\rightarrow\ (\pi_{ij})
\end{aligned}
\]

Here $f^{*}$ is the Fenchel dual of $f$, defined as

\[
f^{*}(\mu,\nu)=\sup_{\alpha,\beta\geq 0}\left(\mu\alpha+\nu\beta+f(\alpha,\beta)\right).
\]

Note that $f^{*}$ is a convex function since it is the supremum of linear functions. Since $f(\alpha,\beta)$ is strictly increasing in each coordinate, $f^{*}(\mu,\nu)=\infty$ unless $\mu,\nu<0$. Since the dual is a minimization problem, the optimum value is attained only when $\mu,\nu<0$. Hereafter, wlog, we assume that $p,q>0$ in the dual (3). For example, when $f(\alpha,\beta)=\log(\alpha\beta)$,

\[
f^{*}(\mu,\nu)=\begin{cases}-\log(\mu\nu)-2&\text{if }\mu,\nu<0\\ \infty&\text{otherwise}.\end{cases}
\]
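For completeness, here is the short computation behind this example, using the sign convention of the definition above. For $\mu<0$ the inner maximization over $\alpha$ is solved in closed form, and symmetrically for $\beta$:
\[
\sup_{\alpha\geq 0}\bigl(\mu\alpha+\log\alpha\bigr)\text{ is attained at }\alpha=-1/\mu,\text{ with value }-1-\log(-\mu),
\]
so, for $\mu,\nu<0$,
\[
f^{*}(\mu,\nu)=\sup_{\alpha,\beta\geq 0}\bigl(\mu\alpha+\nu\beta+\log\alpha+\log\beta\bigr)=-2-\log(-\mu)-\log(-\nu)=-\log(\mu\nu)-2,
\]
since $(-\mu)(-\nu)=\mu\nu$.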

If $(\pi^{*},\alpha^{*},\beta^{*})$ is some optimal solution for the primal (2) and $({\mathbf{r}}^{*},{\mathbf{c}}^{*},p^{*},q^{*})$ is some optimal solution for the dual (3), then together they should satisfy the KKT conditions given in (4). A constraint of the primal is tight if the corresponding variable in the dual is strictly positive, and vice versa.

\[
\begin{array}{lll|lll}
p^{*}>0&\Rightarrow&\sum_{ij}\pi_{ij}^{*}w_{i}a_{j}=\alpha^{*}&\nabla f(\alpha^{*},\beta^{*})&=&(p^{*},q^{*})\\
q^{*}>0&\Rightarrow&\sum_{ij}\pi_{ij}^{*}w_{i}b_{j}=\beta^{*}&\pi_{ij}^{*}>0&\Rightarrow&r_{i}^{*}+c_{j}^{*}=w_{i}(p^{*}a_{j}+q^{*}b_{j})\\
r_{i}^{*}>0&\Rightarrow&\sum_{j}\pi_{ij}^{*}=1&&&\\
c_{j}^{*}>0&\Rightarrow&\sum_{i}\pi_{ij}^{*}=1&&&
\end{array}
\qquad (4)
\]
Proposition 2.4.

Let $p,q>0$ be fixed. Then the value of the minimization program in (3) is given by

\[
\Psi(p,q)=\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})+f^{*}(-p,-q)
\]

where $p{\mathbf{a}}+q{\mathbf{b}}=(pa_{1}+qb_{1},\dots,pa_{n}+qb_{n})$. Moreover, the KKT conditions (4) can be simplified to:

\[
\begin{aligned}
& \pi^{*}\in\mathsf{ConvHull}\{\pi:\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}})=\operatorname{\mathsf{cs}}_{\mathbf{w}}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}},\pi)\}, \qquad (5)\\
& \alpha^{*}=\sum_{ij}\pi_{ij}^{*}w_{i}a_{j},\quad \beta^{*}=\sum_{ij}\pi_{ij}^{*}w_{i}b_{j},\\
& \nabla f(\alpha^{*},\beta^{*})=(p^{*},q^{*}).
\end{aligned}
\]
Proof.

For fixed $p,q>0$, the dual program (3) reduces (after ignoring the fixed additive term $f^{*}(-p,-q)$) to the following linear program:

\[
\begin{aligned}
\min_{r_{i},c_{j}\geq 0}\quad & \sum_{i=1}^{n}r_{i}+\sum_{j=1}^{n}c_{j}\\
\text{s.t.}\quad & \forall i,j\ \ r_{i}+c_{j}\geq w_{i}(pa_{j}+qb_{j}) \qquad\rightarrow\ (\pi_{ij}).
\end{aligned}
\]

The dual linear program is:

\[
\begin{aligned}
\max_{\pi_{ij}\geq 0}\quad & \sum_{ij}\pi_{ij}w_{i}(pa_{j}+qb_{j})\\
& \forall i\ \ \sum_{j=1}^{n}\pi_{ij}\leq 1 \qquad\rightarrow\ (r_{i})\\
& \forall j\ \ \sum_{i=1}^{n}\pi_{ij}\leq 1 \qquad\rightarrow\ (c_{j})
\end{aligned}
\]

The constraints on $\pi$ are the doubly stochastic constraints on the matrix $\pi$. Therefore, by the Birkhoff-von Neumann theorem, the feasible solutions are convex combinations of permutations and the optimum is attained at a permutation. An optimal permutation must sort the values of $p{\mathbf{a}}+q{\mathbf{b}}$ in decreasing order, and convex combinations of such permutations are also optimal solutions. Thus the set of optimal solutions is $\mathsf{ConvHull}\{\pi:\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})=\operatorname{\mathsf{cs}}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}},\pi)\}$, and the value of both of the above programs is $\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})$. ∎

Lemma 2.5.

Fix some $p,q>0$. Then one of the following is true:

  1. There are no ties among $p{\mathbf{a}}+q{\mathbf{b}}$, i.e., there is a unique $\pi$ such that $\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})=\operatorname{\mathsf{cs}}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}},\pi)$.

  2. There is exactly one tie among $p{\mathbf{a}}+q{\mathbf{b}}$, i.e., there are exactly two permutations $\pi_{1},\pi_{2}$ such that $\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})=\operatorname{\mathsf{cs}}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}},\pi_{1})=\operatorname{\mathsf{cs}}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}},\pi_{2})$. Moreover, $\pi_{1},\pi_{2}$ differ by an adjacent transposition, i.e., $\pi_{2}$ can be obtained from $\pi_{1}$ by swapping two adjacent elements.

Proof.

$\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})=p\operatorname{\mathsf{cs}}({\mathbf{a}},\pi)+q\operatorname{\mathsf{cs}}({\mathbf{b}},\pi)$ where $\pi$ is any permutation which sorts $p{\mathbf{a}}+q{\mathbf{b}}$ in descending order. There are two cases:

  1. Case 1: If there are no ties among $p{\mathbf{a}}+q{\mathbf{b}}$, then the permutation $\pi$ is unique.

  2. Case 2: Suppose there are ties among $p{\mathbf{a}}+q{\mathbf{b}}$. Because we assumed that ${\mathbf{a}},{\mathbf{b}}$ are generic, there can be at most one tie among $(pa_{1}+qb_{1},\dots,pa_{n}+qb_{n})$, i.e., there is at most one pair $s,t$ such that $pa_{s}+qb_{s}=pa_{t}+qb_{t}$. Two such ties would impose two linearly independent equations on $p,q$, forcing them to both be zero. Therefore there are at most two distinct permutations $\pi_{1},\pi_{2}$ such that $\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})=p\operatorname{\mathsf{cs}}({\mathbf{a}},\pi_{1})+q\operatorname{\mathsf{cs}}({\mathbf{b}},\pi_{1})=p\operatorname{\mathsf{cs}}({\mathbf{a}},\pi_{2})+q\operatorname{\mathsf{cs}}({\mathbf{b}},\pi_{2})$. Moreover, $s,t$ must be next to each other in $\pi_{1},\pi_{2}$, and their order is switched between $\pi_{1}$ and $\pi_{2}$. Therefore they differ by an adjacent transposition. ∎

Remark 2.6.

From Proposition 2.4 and Lemma 2.5, the solution $\pi^{*}$ for the primal program (2) is either a single permutation or a convex combination of two permutations which differ only by an adjacent transposition (i.e., swapping two elements next to each other).

Remark 2.7.

$\Psi(p,q)=\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})+f^{*}(-p,-q)$ is a convex function. The gradient of $\Psi$ (or a subgradient at points where $\Psi$ is not differentiable) can be calculated efficiently, and therefore $\mathsf{OPT}=\min_{p,q\geq 0}\Psi(p,q)$ can be found efficiently using gradient (or subgradient) descent (Bubeck, 2015).
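To make the remark concrete, the following is a minimal sketch (not from the paper) of projected subgradient descent on $\Psi$ for the running example $f(\alpha,\beta)=\log(\alpha\beta)$, for which $f^{*}(-p,-q)=-\log(pq)-2$; the step size, iteration count, and initialization are arbitrary illustrative choices.

```python
import numpy as np

def subgradient_descent_psi(a, b, w, iters=200, lr=0.05):
    # Minimize Psi(p, q) = cs*_w(p*a + q*b) - log(p*q) - 2 over p, q > 0.
    p, q = 1.0, 1.0
    for _ in range(iters):
        pi = np.argsort(-(p * a + q * b))        # ranking that sorts p*a + q*b descending
        grad_p = np.dot(w, a[pi]) - 1.0 / p      # subgradient of cs* plus d/dp of -log(pq)
        grad_q = np.dot(w, b[pi]) - 1.0 / q
        p = max(p - lr * grad_p, 1e-9)           # project back into the positive quadrant
        q = max(q - lr * grad_q, 1e-9)
    return p, q, np.argsort(-(p * a + q * b))
```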

It turns out that there is a much more efficient algorithm to find 𝖮𝖯𝖳\mathsf{OPT} using binary search. We need the notion of a subgradient. For a convex function g:dg:\mathbb{R}^{d}\to\mathbb{R}, the subgradient of gg at a point xdx\in\mathbb{R}^{d} is defined as g(x)={v:g(x+y)g(x)+v,yy}\partial g(x)=\{v:g(x+y)\geq g(x)+\left\langle v,y\right\rangle\forall y\}. It is always a convex subset of d\mathbb{R}^{d}. If gg is differentiable at xx then, g(x)={g(x)}\partial g(x)=\{\nabla g(x)\}.

Proposition 2.8 (Binary search to find 𝖮𝖯𝖳\mathsf{OPT}).

Suppose a1,,ana_{1},\dots,a_{n} and b1,,bnb_{1},\dots,b_{n} are integers bounded by BB in absolute value. We can solve the primal program (2), the dual program (3) and find 𝖮𝖯𝖳=minp,q>0𝖼𝗌𝐰(p𝐚+q𝐛)+f(p,q)\mathsf{OPT}=\min_{p,q>0}\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})+f^{*}(-p,-q) in O(nlognlogB)O(n\log n\log B) time. Moreover there is a strongly polynomial randomized algorithm which runs in O(nlog2n)O(n\log^{2}n) time.444Strongly polynomial refers to the fact that the running time is independent of BB or the actual numbers in 𝐚,𝐛{\mathbf{a}},{\mathbf{b}}. In this model, it is assumed that arithmetic and comparison operations between aia_{i}’s and bib_{i}’s take constant time.

Proof.

To solve the primal program (2) and the dual program (3), it is enough to find $(\pi^{*},\alpha^{*},\beta^{*})$ and $(p^{*},q^{*})$ which together satisfy all the simplified KKT conditions (5).

Throughout the proof, we drop the subscript ${\mathbf{w}}$ from $\operatorname{\mathsf{cs}}_{\mathbf{w}}$ for brevity. $\Psi(p,q)=\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})+f^{*}(-p,-q)$ is a convex function, so a local minimum is a global minimum, and therefore it is enough to find $(p^{*},q^{*})$ such that $0\in\partial\Psi(p^{*},q^{*})$.

\[
0\in\partial\Psi(p^{*},q^{*})\iff\nabla f^{*}(-p^{*},-q^{*})\in\partial\operatorname{\mathsf{cs}}^{*}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}}).
\]

It is easy to see that $\nabla f^{*}(-p,-q)=(\alpha,\beta)\iff\nabla f(\alpha,\beta)=(p,q)$. Therefore we can rewrite the optimality condition for $(p^{*},q^{*})$ as:

\[
(p^{*},q^{*})\in\nabla f\left(\partial\operatorname{\mathsf{cs}}^{*}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}})\right). \qquad (6)
\]

Thus we need to find a fixed point of a set-valued map. We begin by calculating the subgradient $\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})$. Note that $\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})=p\operatorname{\mathsf{cs}}({\mathbf{a}},\pi)+q\operatorname{\mathsf{cs}}({\mathbf{b}},\pi)$ where $\pi$ is any permutation which sorts $p{\mathbf{a}}+q{\mathbf{b}}$ in descending order. By Lemma 2.5, there are two cases:

  1. Case 1: If there are no ties among $p{\mathbf{a}}+q{\mathbf{b}}$, then the permutation $\pi$ is unique, $\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})$ is differentiable at $(p,q)$, and
\[
\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})=\{\left(\operatorname{\mathsf{cs}}({\mathbf{a}},\pi),\operatorname{\mathsf{cs}}({\mathbf{b}},\pi)\right)\}.
\]

  2. Case 2: Suppose there are ties among $p{\mathbf{a}}+q{\mathbf{b}}$. Then there exist exactly two permutations $\pi_{1},\pi_{2}$ such that $\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})=\operatorname{\mathsf{cs}}(p{\mathbf{a}}+q{\mathbf{b}},\pi_{1})=\operatorname{\mathsf{cs}}(p{\mathbf{a}}+q{\mathbf{b}},\pi_{2})$. In this case,
\[
\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})=\left\{\mu\left(\operatorname{\mathsf{cs}}({\mathbf{a}},\pi_{1}),\operatorname{\mathsf{cs}}({\mathbf{b}},\pi_{1})\right)+(1-\mu)\left(\operatorname{\mathsf{cs}}({\mathbf{a}},\pi_{2}),\operatorname{\mathsf{cs}}({\mathbf{b}},\pi_{2})\right):\mu\in[0,1]\right\}.
\]

We make a few observations. The value of the subgradient $\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})$ only depends on the ratio of $p$ and $q$, $\lambda=q/p$; this is because the optimal ranking of $p{\mathbf{a}}+q{\mathbf{b}}$ only depends on the ratio $\lambda$. And as we change this ratio $\lambda$ from $0$ to $\infty$, the subgradient changes at most $\binom{n}{2}$ times. This happens whenever $\lambda$ is such that $a_{i}+\lambda b_{i}=a_{j}+\lambda b_{j}$ for some $i\neq j$. We call the set

\[
C=\left\{\frac{a_{i}-a_{j}}{b_{j}-b_{i}}:1\leq i<j\leq n,\ \frac{a_{i}-a_{j}}{b_{j}-b_{i}}>0\right\},
\]

the critical set of $\lambda$'s where the subgradient changes value. ($|C|$ is equal to the number of inversions in ${\mathbf{a}}$ w.r.t. ${\mathbf{b}}$, also called the Kendall tau distance.) Let $m=|C|$ and let $\lambda_{1}<\lambda_{2}<\dots<\lambda_{m}$ be an ordering of the critical set $C$; further define $\lambda_{0}=0$ and $\lambda_{m+1}=\infty$.
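As an illustration, a naive enumeration of the critical set $C$ is sketched below (our code, computing all $\binom{n}{2}$ candidate ratios in quadratic time); the near-linear-time algorithm instead binary searches over $C$ without listing it, as described in Appendix A.

```python
import itertools

def critical_ratios(a, b):
    # Ratios lambda = (a_i - a_j) / (b_j - b_i) > 0 at which the sorted
    # order of a + lambda * b changes.
    C = set()
    for i, j in itertools.combinations(range(len(a)), 2):
        if b[j] != b[i]:
            lam = (a[i] - a[j]) / (b[j] - b[i])
            if lam > 0:
                C.add(lam)
    return sorted(C)
```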

Figure 2: The positive quadrant $\mathcal{Q}^{++}$ is divided into regions $A_{i}$ and rays $R_{i}$ based on the values of the subgradient $\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})$.

Define the regions

\[
A_{i}=\left\{(p,p\lambda):p>0,\ \lambda_{i}<\lambda<\lambda_{i+1}\right\}.
\]

Also define the rays

\[
R_{i}=\left\{(p,p\lambda_{i}):p>0\right\}.
\]

In the region $A_{i}$, there is a unique permutation $\sigma_{i}=\operatorname{argmax}_{\pi}\operatorname{\mathsf{cs}}(p{\mathbf{a}}+q{\mathbf{b}},\pi)$, therefore the subgradient is unique. Denote its value by $g_{i}$, i.e.,

\[
\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})|_{A_{i}}=\{g_{i}\}=\{(\operatorname{\mathsf{cs}}({\mathbf{a}},\sigma_{i}),\operatorname{\mathsf{cs}}({\mathbf{b}},\sigma_{i}))\}.
\]

On the ray $R_{i+1}$, we have $\{\sigma_{i},\sigma_{i+1}\}=\operatorname{argmax}_{\pi}\operatorname{\mathsf{cs}}(p{\mathbf{a}}+q{\mathbf{b}},\pi)$, therefore the subgradient is given by

\[
\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})|_{R_{i+1}}=\{\mu g_{i}+(1-\mu)g_{i+1}:\mu\in[0,1]\}.
\]

Figure 2 shows the values of the subgradient $\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})$ as a function of $(p,q)$ in the regions $A_{i},R_{i}$.

Let $\mathcal{Q}^{++}=\{(p,q):p,q>0\}$ be the positive quadrant. Since $f(\alpha,\beta)$ is strictly increasing in each coordinate in $\mathcal{Q}^{++}$, the gradients also lie in the positive quadrant, i.e., $\nabla f:\mathcal{Q}^{++}\to\mathcal{Q}^{++}$. We now define a function $\phi:\{0,1,\dots,m\}\to\{-1,0,1\}$ as follows:

\[
\phi(i)=\begin{cases}0&\text{ if }\nabla f(g_{i})\in R_{i}\cup A_{i}\cup R_{i+1}\\ +1&\text{ if }\nabla f(g_{i})\text{ lies anticlockwise to }R_{i}\cup A_{i}\cup R_{i+1}\\ -1&\text{ if }\nabla f(g_{i})\text{ lies clockwise to }R_{i}\cup A_{i}\cup R_{i+1}.\end{cases} \qquad (7)
\]

We now show that it is enough to find some $i\in\{0,1,\dots,m\}$ such that one of the following is true.

  1. $\phi(i)=0$. In this case, we set $(p^{*},q^{*})=\nabla f(g_{i})$. The condition $\phi(i)=0$ implies that $(p^{*},q^{*})\in R_{i}\cup A_{i}\cup R_{i+1}$. Therefore $g_{i}\in\partial\operatorname{\mathsf{cs}}^{*}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}})$. Applying $\nabla f$ on both sides implies that $(p^{*},q^{*})=\nabla f(g_{i})\in\nabla f(\partial\operatorname{\mathsf{cs}}^{*}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}}))$, which is the fixed point condition (6). Moreover, setting $\pi^{*}=\sigma_{i}$ and $(\alpha^{*},\beta^{*})=g_{i}$ gives a solution to the simplified KKT conditions (5).

  2. $\phi(i)=1,\ \phi(i+1)=-1$. In this case, we claim that there exists some $\mu^{*}\in(0,1)$ such that $\nabla f(\mu^{*}g_{i}+(1-\mu^{*})g_{i+1})\in R_{i+1}$. This is because the curve $\gamma:[0,1]\to\mathcal{Q}^{++}$ given by $\gamma(\mu)=\nabla f(\mu g_{i}+(1-\mu)g_{i+1})$ starts and ends on opposite sides of the ray $R_{i+1}$, as shown in Figure 3, so it should cross it at some point $\mu^{*}\in(0,1)$, which can be found by binary search (here we need continuity of $\nabla f$). We then set $(p^{*},q^{*})=\nabla f(\mu^{*}g_{i}+(1-\mu^{*})g_{i+1})$. Since $(p^{*},q^{*})\in R_{i+1}$, $\mu^{*}g_{i}+(1-\mu^{*})g_{i+1}\in\partial\operatorname{\mathsf{cs}}^{*}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}})$. Applying $\nabla f$ to both sides, we get $(p^{*},q^{*})\in\nabla f(\partial\operatorname{\mathsf{cs}}^{*}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}}))$, which is the fixed point condition (6). Setting $\pi^{*}=\mu^{*}\sigma_{i}+(1-\mu^{*})\sigma_{i+1}$ and $(\alpha^{*},\beta^{*})=\mu^{*}g_{i}+(1-\mu^{*})g_{i+1}$ gives a solution to the simplified KKT conditions (5).

    Figure 3: The endpoints of the curve $\gamma$ lie on opposite sides of the ray $R_{i+1}$, so $\gamma$ must cross the ray $R_{i+1}$ at some point.

Now if either $\phi(0)=0$ or $\phi(m)=0$, we are done. Otherwise, $\phi(0)=1$ and $\phi(m)=-1$. By a simple binary search on $\{0,1,\dots,m\}$ we can find a point $i$ such that either $\phi(i)=0$ or $\phi(i)=1,\phi(i+1)=-1$.

A naive implementation of the binary search described above requires finding the set of critical values $C$ and sorting them, which can take $\Omega(n^{2}\log n)$ time. To improve this to near-linear time, we need a way to do binary search on $C$ without listing all the values in $C$. See Appendix A for how to achieve this. ∎

We are now ready to prove Theorem 2.2.

Proof of Theorem 2.2.

By Proposition 2.8 and Remark 2.6, we can find a solution $(\pi^{*},\alpha^{*},\beta^{*})$ to the primal program (2) where $\pi^{*}$ is either a permutation or a convex combination of two permutations which differ by an adjacent transposition. If $\pi^{*}$ is a permutation, then

\[
\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})\leq\mathsf{OPT}=f(\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}},\pi^{*}),\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}},\pi^{*})).
\]

Thus we just output $\pi^{*}$, which is the optimal ranking.

If $\pi^{*}=\mu\pi_{1}+(1-\mu)\pi_{2}$, i.e., a convex combination of $\pi_{1},\pi_{2}$ which differ in the $i,i+1$ positions, then

\[
\begin{aligned}
\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})\leq\mathsf{OPT}
&=f\bigl(\mu(\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}},\pi_{1}),\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}},\pi_{1}))+(1-\mu)(\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}},\pi_{2}),\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}},\pi_{2}))\bigr)\\
&\leq f\bigl(\operatorname{\mathsf{cs}}_{{\mathbf{w}}^{\prime}}({\mathbf{a}},\pi_{1}),\operatorname{\mathsf{cs}}_{{\mathbf{w}}^{\prime}}({\mathbf{b}},\pi_{1})\bigr)=\operatorname{\mathsf{co}}_{{\mathbf{w}}^{\prime}}({\mathbf{a}},{\mathbf{b}},\pi_{1})
\end{aligned}
\]

where ${\mathbf{w}}^{\prime}={\mathbf{w}}+(w_{i}-w_{i+1}){\mathbf{e}}_{i+1}$. Similarly, $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})\leq\operatorname{\mathsf{co}}_{{\mathbf{w}}^{\prime}}({\mathbf{a}},{\mathbf{b}},\pi_{2})$. So in this case we output either $\pi_{1}$ or $\pi_{2}$, whichever has the higher combined objective. ∎
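For readers who prefer code, here is a compact reference implementation of the single-instance algorithm (our sketch, not the authors' code): it enumerates the critical ratios naively in $O(n^{2}\log n)$ time and scans the regions instead of binary searching, but it follows the case analysis above. The callables `f` and `grad_f` are assumed to be a concave, coordinate-wise increasing objective and its gradient.

```python
import numpy as np

def cs(scores, pi, w):
    # Weighted cumulative score cs_w(scores, pi).
    return float(np.dot(w, scores[pi]))

def rank_two_objectives(a, b, w, f, grad_f):
    # Returns (ranking, weights_used) in the sense of Theorem 2.2.
    a, b, w = np.asarray(a, float), np.asarray(b, float), np.asarray(w, float)
    n = len(a)
    lams = sorted({(a[i] - a[j]) / (b[j] - b[i])
                   for i in range(n) for j in range(i + 1, n)
                   if b[j] != b[i] and (a[i] - a[j]) / (b[j] - b[i]) > 0})
    bounds = [0.0] + lams + [np.inf]          # lambda_0 = 0 < lambda_1 < ... < inf

    def region_perm(i):
        # Optimal permutation sigma_i for any (p, q) with q/p inside region A_i.
        lo, hi = bounds[i], bounds[i + 1]
        lam = lo + 1.0 if np.isinf(hi) else 0.5 * (lo + hi)
        return np.argsort(-(a + lam * b), kind="stable")

    def phi(i):
        sigma = region_perm(i)
        p, q = grad_f(cs(a, sigma, w), cs(b, sigma, w))
        slope = q / p
        if slope > bounds[i + 1]:
            return 1                          # gradient lies anticlockwise to the region
        if slope < bounds[i]:
            return -1                         # gradient lies clockwise to the region
        return 0

    num_regions = len(bounds) - 1
    signs = [phi(i) for i in range(num_regions)]      # binary search over C suffices in theory
    for i, s in enumerate(signs):
        if s == 0:
            return region_perm(i), w                  # sigma_i is optimal for the original weights
    i = next(i for i in range(num_regions - 1) if signs[i] == 1 and signs[i + 1] == -1)
    pi1, pi2 = region_perm(i), region_perm(i + 1)
    t = int(np.argmax(pi1 != pi2))                    # pi1, pi2 differ at positions t, t + 1
    w_boost = w.copy()
    w_boost[t + 1] = w_boost[t]                       # the boosted weights w' of Theorem 2.2
    best = max((pi1, pi2),
               key=lambda pi: f(cs(a, pi, w_boost), cs(b, pi, w_boost)))
    return best, w_boost
```

For the product objective of Figure 1 one can take `f = lambda x, y: np.log(x) + np.log(y)` and `grad_f = lambda x, y: (1.0 / x, 1.0 / y)`.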

3 Multiple instances of ranking with global aggregation

Suppose we have several instances of $\operatorname{\mathsf{Rank}}$ and we want to do well locally in each problem, but we also want to do well when we aggregate our solutions globally. Such a situation arises in the example application discussed in the beginning. For each user we want to solve an instance of $\operatorname{\mathsf{Rank}}$, but globally, across all the users, we also want the average revenue and average relevance to be high. We model this as follows.

Let $({\mathbf{a}}_{1},{\mathbf{b}}_{1},{\mathbf{w}}_{1},f_{1}),\dots,({\mathbf{a}}_{m},{\mathbf{b}}_{m},{\mathbf{w}}_{m},f_{m})$ be $m$ instances of $\operatorname{\mathsf{Rank}}$ on sequences of length $n$. Let $\mathbf{w}=({\mathbf{w}}_{1},\dots,{\mathbf{w}}_{m})$, $f=(f_{1},\dots,f_{m})$, $\mathbf{a}=({\mathbf{a}}_{1},\dots,{\mathbf{a}}_{m})$, $\mathbf{b}=({\mathbf{b}}_{1},\dots,{\mathbf{b}}_{m})$, and let $\pi=(\pi_{1},\dots,\pi_{m})$ be a sequence of rankings of $[n]$. The global cumulative a-score and b-score of $\pi$ are defined as

\[
\operatorname{\mathsf{gcs}}_{\mathbf{w}}(\mathbf{a},\pi)=\sum_{i=1}^{m}\operatorname{\mathsf{cs}}_{{\mathbf{w}}_{i}}({\mathbf{a}}_{i},\pi_{i})\text{ and }\operatorname{\mathsf{gcs}}_{\mathbf{w}}(\mathbf{b},\pi)=\sum_{i=1}^{m}\operatorname{\mathsf{cs}}_{{\mathbf{w}}_{i}}({\mathbf{b}}_{i},\pi_{i})
\]

respectively. Suppose $F(\alpha,\beta)$ is a concave function increasing in each coordinate. The combined objective function is defined as

\[
\operatorname{\mathsf{co}}_{\mathbf{w}}(\mathbf{a},\mathbf{b},\pi)=F(\operatorname{\mathsf{gcs}}_{\mathbf{w}}(\mathbf{a},\pi),\operatorname{\mathsf{gcs}}_{\mathbf{w}}(\mathbf{b},\pi))+\sum_{i=1}^{m}f_{i}(\operatorname{\mathsf{cs}}_{{\mathbf{w}}_{i}}({\mathbf{a}}_{i},\pi_{i}),\operatorname{\mathsf{cs}}_{{\mathbf{w}}_{i}}({\mathbf{b}}_{i},\pi_{i})).
\]
Definition 3.1 ($\operatorname{\mathsf{MultiRank}}(\mathbf{a},\mathbf{b},\mathbf{w},f,F)$).

Given $\mathbf{a},\mathbf{b},\mathbf{w},f,F$, find a sequence of rankings $\pi=(\pi_{1},\dots,\pi_{m})$ of $[n]$ which maximizes $\operatorname{\mathsf{co}}_{\mathbf{w}}(\mathbf{a},\mathbf{b},\pi)$, i.e., find $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}(\mathbf{a},\mathbf{b})=\max_{\pi}\operatorname{\mathsf{co}}_{\mathbf{w}}(\mathbf{a},\mathbf{b},\pi)$.

We will assume that the functions $f_{1},\dots,f_{m},F$ are concave and strictly increasing in each coordinate, and that they are differentiable with continuous derivatives. Our main theorem is that we can efficiently find a sequence of rankings $\pi$ which, with a slight advantage, does as well as the optimal sequence of rankings.

Theorem 3.2.

Suppose the functions $f_{1},\dots,f_{m},F$ are concave and strictly increasing in each coordinate. Given an instance of $\operatorname{\mathsf{MultiRank}}(\mathbf{a},\mathbf{b},\mathbf{w},f,F)$, we can efficiently find a sequence of rankings $\pi=(\pi_{1},\dots,\pi_{m})$ such that

\[
\operatorname{\mathsf{co}}_{\mathbf{w}^{\prime}}(\mathbf{a},\mathbf{b},\pi)\geq\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}(\mathbf{a},\mathbf{b})
\]

where $\mathbf{w}^{\prime}=({\mathbf{w}}_{1}^{\prime},\dots,{\mathbf{w}}_{m}^{\prime})$ and ${\mathbf{w}}_{i}^{\prime}={\mathbf{w}}_{i}+(w_{t_{i}}-w_{t_{i}+1}){\mathbf{e}}_{t_{i}+1}$ for some $t_{i}\in[n-1]$, i.e., each ${\mathbf{w}}^{\prime}_{i}$ is obtained by replacing $w_{t_{i}+1}$ with $w_{t_{i}}$.

In the special case when the weight vectors ${\mathbf{w}}_{1},\dots,{\mathbf{w}}_{m}$ are just a sequence of $k$ ones followed by zeros, i.e., the cumulative scores are calculated by adding the scores of the top $k$ results, $\operatorname{\mathsf{MultiRank}}$ is called the $\operatorname{\mathsf{Multi-TOP}}_{k}$ problem.

Corollary 3.3.

Suppose the functions $f_{1},\dots,f_{m},F$ are concave and strictly increasing in each coordinate. Given an instance of $\operatorname{\mathsf{Multi-TOP}}_{k}(\mathbf{a},\mathbf{b},f,F)$, we can efficiently find a sequence $S=(S_{1},\dots,S_{m})$ of subsets of $[n]$ of size at most $k+1$ (corresponding to the top $k+1$ elements) such that

\[
\operatorname{\mathsf{co}}(\mathbf{a},\mathbf{b},S)\geq\max_{|T_{i}|=k}\operatorname{\mathsf{co}}(\mathbf{a},\mathbf{b},(T_{1},\dots,T_{m})).
\]

Our approach to proving Theorem 3.2 is very similar to the proof of Theorem 2.2: we write a convex programming relaxation and solve its dual program. We will also assume that the sequences ${\mathbf{a}}_{i},{\mathbf{b}}_{i},{\mathbf{w}}$ are generic, which can be ensured by perturbing all entries by a tiny additive noise; this does not change $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}(\mathbf{a},\mathbf{b})$ by much, and by a limiting argument it does not affect the result. We first develop a convex programming relaxation $\mathsf{OPT}$ as shown in (8).

\[
\begin{aligned}
\mathsf{OPT}=\max_{\pi^{i}_{jk},\alpha_{i},\beta_{i},\alpha,\beta\geq 0}\quad & F(\alpha,\beta)+\sum_{i=1}^{m}f_{i}(\alpha_{i},\beta_{i}) \qquad\qquad (8)\\
\text{s.t.}\quad & \forall i\in[m]\quad \alpha_{i}\leq\sum_{j,k=1}^{n}w_{ij}a_{ik}\pi^{i}_{jk} \qquad\rightarrow\ (p_{i})\\
& \forall i\in[m]\quad \beta_{i}\leq\sum_{j,k=1}^{n}w_{ij}b_{ik}\pi^{i}_{jk} \qquad\rightarrow\ (q_{i})\\
& \alpha\leq\sum_{i=1}^{m}\sum_{j,k=1}^{n}w_{ij}a_{ik}\pi^{i}_{jk} \qquad\rightarrow\ (p)\\
& \beta\leq\sum_{i=1}^{m}\sum_{j,k=1}^{n}w_{ij}b_{ik}\pi^{i}_{jk} \qquad\rightarrow\ (q)\\
& \forall i\in[m],j\in[n]\quad \sum_{k=1}^{n}\pi^{i}_{jk}\leq 1 \qquad\rightarrow\ (r_{ij})\\
& \forall i\in[m],k\in[n]\quad \sum_{j=1}^{n}\pi^{i}_{jk}\leq 1 \qquad\rightarrow\ (c_{ik})
\end{aligned}
\]

It is clear that $\mathsf{OPT}$ is a relaxation for $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}(\mathbf{a},\mathbf{b})$, i.e., $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}(\mathbf{a},\mathbf{b})\leq\mathsf{OPT}$. By convex programming duality, $\mathsf{OPT}$ can be expressed as a dual minimization problem (9) by introducing a dual variable for every constraint in the primal, as shown in (8). Again, by Slater's condition, strong duality holds (Boyd and Vandenberghe, 2004). The constraints in the dual correspond to variables in the primal as shown in (9).

\[
\begin{aligned}
\mathsf{OPT}=\min_{r_{ij},c_{ik},p_{i},q_{i},p,q\geq 0}\quad & \sum_{i=1}^{m}\left(\sum_{j=1}^{n}r_{ij}+\sum_{k=1}^{n}c_{ik}+f_{i}^{*}(-p_{i},-q_{i})\right)+F^{*}(-p,-q) \qquad (9)\\
\text{s.t.}\quad & \forall i\in[m],\ j,k\in[n]\quad r_{ij}+c_{ik}\geq w_{ij}\left((p_{i}+p)a_{ik}+(q_{i}+q)b_{ik}\right) \qquad\rightarrow\ (\pi^{i}_{jk})
\end{aligned}
\]

Here $f_{i}^{*}$ is the Fenchel dual of $f_{i}$ and $F^{*}$ is the Fenchel dual of $F$. If $(\pi^{*},\alpha_{i}^{*},\beta_{i}^{*},\alpha^{*},\beta^{*})$ is some optimal solution for the primal (8) and $(r^{*},c^{*},p_{i}^{*},q_{i}^{*},p^{*},q^{*})$ is some optimal solution for the dual (9), then together they should satisfy the KKT conditions given in (10). Note that a constraint of the primal is tight if the corresponding variable in the dual is strictly positive, and vice versa.

\[
\begin{array}{lll|lll}
p_{i}^{*}>0&\Rightarrow&\sum_{j,k=1}^{n}w_{ij}a_{ik}\pi^{i*}_{jk}=\alpha_{i}^{*}&\nabla f_{i}(\alpha_{i}^{*},\beta_{i}^{*})&=&(p_{i}^{*},q_{i}^{*})\\
q_{i}^{*}>0&\Rightarrow&\sum_{j,k=1}^{n}w_{ij}b_{ik}\pi^{i*}_{jk}=\beta_{i}^{*}&\nabla F(\alpha^{*},\beta^{*})&=&(p^{*},q^{*})\\
r^{*}_{ij}>0&\Rightarrow&\sum_{k=1}^{n}\pi^{i*}_{jk}=1&\pi^{i*}_{jk}>0&\Rightarrow&r^{*}_{ij}+c^{*}_{ik}=w_{ij}\left((p_{i}^{*}+p^{*})a_{ik}+(q_{i}^{*}+q^{*})b_{ik}\right)\\
c^{*}_{ik}>0&\Rightarrow&\sum_{j=1}^{n}\pi^{i*}_{jk}=1&&&\\
p^{*}>0&\Rightarrow&\sum_{i=1}^{m}\sum_{j,k=1}^{n}w_{ij}a_{ik}\pi^{i*}_{jk}=\alpha^{*}&&&\\
q^{*}>0&\Rightarrow&\sum_{i=1}^{m}\sum_{j,k=1}^{n}w_{ij}b_{ik}\pi^{i*}_{jk}=\beta^{*}&&&
\end{array}
\qquad (10)
\]
Proposition 3.4.

Let $p,q,p_{1},q_{1},\dots,p_{m},q_{m}>0$ be fixed. Then the value of the minimization program in (9) is given by

\[
\Psi(p,q,p_{1},q_{1},\dots,p_{m},q_{m})=F^{*}(-p,-q)+\sum_{i=1}^{m}\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)+f_{i}^{*}(-p_{i},-q_{i}).
\]

Moreover, the KKT conditions (10) can be simplified to:

\[
\begin{aligned}
& \pi_{i}^{*}\in\mathsf{ConvHull}\{\pi:\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}((p^{*}+p_{i}^{*}){\mathbf{a}}_{i}+(q^{*}+q_{i}^{*}){\mathbf{b}}_{i})=\operatorname{\mathsf{cs}}_{{\mathbf{w}}_{i}}((p^{*}+p_{i}^{*}){\mathbf{a}}_{i}+(q^{*}+q_{i}^{*}){\mathbf{b}}_{i},\pi)\}, \qquad (11)\\
& \sum_{j,k=1}^{n}w_{ij}a_{ik}\pi^{i*}_{jk}=\alpha_{i}^{*},\quad \sum_{j,k=1}^{n}w_{ij}b_{ik}\pi^{i*}_{jk}=\beta_{i}^{*},\\
& \sum_{i=1}^{m}\sum_{j,k=1}^{n}w_{ij}a_{ik}\pi^{i*}_{jk}=\alpha^{*},\quad \sum_{i=1}^{m}\sum_{j,k=1}^{n}w_{ij}b_{ik}\pi^{i*}_{jk}=\beta^{*},\\
& \nabla f_{i}(\alpha_{i}^{*},\beta_{i}^{*})=(p_{i}^{*},q_{i}^{*}),\quad \nabla F(\alpha^{*},\beta^{*})=(p^{*},q^{*}).
\end{aligned}
\]
Proof.

The proof is very similar to the proof of Proposition 2.4 where we write a linear program for each sub-problem and the corresponding dual linear program. We will skip the details. ∎

Remark 3.5.
\[
\Psi(p,q,p_{1},q_{1},\dots,p_{m},q_{m})=F^{*}(-p,-q)+\sum_{i=1}^{m}\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)+f_{i}^{*}(-p_{i},-q_{i})
\]

is a convex function. The gradient (or a subgradient) of $\Psi$ can be calculated efficiently, and therefore $\mathsf{OPT}=\min_{p,q,p_{i},q_{i}>0}\Psi(p,q,p_{1},q_{1},\dots,p_{m},q_{m})$ can be found efficiently using gradient (or subgradient) descent. The objective is also amenable to the use of stochastic gradient descent, which can be much faster when $m\gg 1$.

Proposition 3.6.

For fixed $p,q>0$, we can find

\[
\min_{p_{i},q_{i}>0}\Psi(p,q,p_{1},q_{1},\dots,p_{m},q_{m})=\min_{p_{i},q_{i}>0}F^{*}(-p,-q)+\sum_{i=1}^{m}\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)+f_{i}^{*}(-p_{i},-q_{i})
\]

efficiently using binary search.

Proof.

It is enough to find

\[
\mathrm{argmin}_{p_{i},q_{i}>0}\ \operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)+f_{i}^{*}(-p_{i},-q_{i})
\]

for each fixed $i$. By convexity of the objective, it is enough to find $(p_{i},q_{i})$ such that $\nabla f_{i}^{*}(-p_{i},-q_{i})\in\partial\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)$, which can be rewritten as:

\[
(p_{i},q_{i})\in\nabla f_{i}\left(\partial\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)\right). \qquad (12)
\]

This fixed point equation can be solved using binary search in exactly the same way as in Proposition 2.8. The subgradient $\partial\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)$ only depends on the ratio $\lambda=(q+q_{i})/(p+p_{i})$. Geometrically, this ratio is constant on any line passing through $(-p,-q)$. Figure 4 shows the regions where $\partial\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)$ remains constant. Now the fixed point equation (12) can be solved in exactly the same way as in the proof of Proposition 2.8. We will skip the details. ∎

Figure 4: The positive quadrant $\mathcal{Q}^{++}$ is divided into regions $A_{j}$ and rays $R_{j}$ based on the values of the subgradients $\partial\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)$. One can use binary search as in the proof of Proposition 2.8 to solve the fixed point equation (12).

We are now ready to prove Theorem 3.2.

Proof of Theorem 3.2.

By Proposition 3.6 and Remark 2.6, we can find a solution to the primal program (8) where each $\pi_{i}^{*}$ is either a permutation or a convex combination of two permutations which differ by an adjacent transposition. If $\pi_{i}^{*}$ is a permutation, then we just output $\pi_{i}^{*}$, which is the optimal ranking for the $i^{\rm th}$ ranking problem. If $\pi_{i}^{*}=\mu\pi_{1}+(1-\mu)\pi_{2}$, i.e., a convex combination of $\pi_{1},\pi_{2}$ which differ in the $j,j+1$ positions, then we output either $\pi_{1}$ or $\pi_{2}$ as the ranking for the $i^{\rm th}$ subproblem. ∎

4 Experiments

4.1 Synthetic Data

We first present the results on synthetic data. The purpose of this experiment is to illustrate how different objective functions affect the distribution of the NDCGs. These results are summarized in Figures 6 and 7. We present the scatter plot of the NDCGs, just like in Figure 1, as well as the cumulative distribution functions (CDFs).

We aim to capture the multiple intents scenario where the results are likely to be good along one dimension but not both. The values are drawn from an anti-correlated log-normal distribution: the $\log(a_{ij})$'s and $\log(b_{ij})$'s are drawn from a multivariate Gaussian with mean zero and covariance matrix

\[
\begin{bmatrix}0.2&-0.16\\ -0.16&0.2\end{bmatrix}.
\]

A scatter plot of the distribution of these values is shown in Figure 5.

Figure 5: Distribution of $a_{ij}$ and $b_{ij}$ values drawn from an anti-correlated log-normal distribution.

Other parameters of the experiment are as follows: we draw 50 results for each instance, i.e., $n=50$. The weight vector is the same as in the introduction, (1), except that we only consider the top 10 results, i.e., the coordinates of ${\mathbf{w}}$ after the first 10 are 0. The number of different instances is $m=500$.
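For reproducibility, the following minimal sketch (our code; the random seed and library choices are ours) generates synthetic data with these parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 500, 50, 10
cov = np.array([[0.2, -0.16],
                [-0.16, 0.2]])
logs = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=(m, n))
a = np.exp(logs[:, :, 0])                    # a_{ij}: anti-correlated log-normal scores
b = np.exp(logs[:, :, 1])                    # b_{ij}
w = 1.0 / np.log2(np.arange(2, n + 2))       # DCG weights from (1)
w[k:] = 0.0                                  # only the top 10 positions carry weight
```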

In addition to the product and the sum, we present the result of using two more combining functions: a quadratic and a normalized sum. Since we are plotting the NDCGs, a natural algorithm is to maximize the sum of the NDCGs. This is what the normalized sum does. The quadratic function first normalizes the scores to get the NDCGs, and then applies the function

\[
f(x,y)=2x-x^{2}+2y-y^{2}.
\]

It can be seen from Figure 6 that the concave functions are quite a bit more clustered than the additive functions. This can also be seen in the table inside the figure, which shows the sum of the cumulative scores (the DCGs) as well as the mean of the normalized cumulative scores (the NDCGs). These quantities are almost the same across all algorithms. We also show the standard deviations of the NDCGs, which capture quite well how clustered the points are, and show a significant difference between the concave and the additive functions.

Figure 6: Scatter plot of NDCGs for two different objectives, $A$ and $B$, on synthetic data.

We present the CDFs of the NDCGs for the four algorithms in Figure 7. The dots on the curves represent deciles, i.e., the values corresponding to the bottom $10\%$ of the population, $20\%$, and so on. Recall that a lower CDF implies that the values are higher (a distribution $F$ stochastically dominates another distribution $G$ iff the CDF of $F$ is always below that of $G$). The CDFs show that in the bottom half of the distribution, the concave functions are higher than the additive ones. Also, the steeper shape of the CDFs for the concave functions shows how they are more concentrated. There is indeed a price to pay in that the top half is worse, but this is unavoidable. The additive function picks a point on the Pareto frontier after all; in fact, it maximizes the mean of $A$ for a fixed mean of $B$ and vice versa. The whole point is that the mean is not necessarily an appropriate metric.

Figure 7: CDFs of NDCGs for two different objectives, $A$ and $B$, on synthetic data.

4.2 Real Data

The purpose of the experiment in this section is to show how the ideas from this paper could help in a realistic setting of ranking. We present experiments on real data from LinkedIn, sampled from one day of results from their news feed. The number of instances is about 33,000. The results are either organic or ads. Organic results only have relevance scores, their revenue scores are 0. Ads have both relevance and revenue scores.

The objectives are ad revenue and relevance. We will denote the ad revenue by $A$ and relevance by $B$. We use the same weight vector ${\mathbf{w}}$ as in the introduction, (1), up to 10 coordinates. Ad revenue can be added across instances, so we just sum up the ad revenue across different instances and tune the algorithms so that they all have roughly the same revenue. (The difference is less than $1\%$.) It makes less sense to add the relevance scores. In fact, it is more important to make sure that no instance gets a really bad relevance score than to optimize for the mean or even the standard deviation. To do this, we aim to make the bottom quartile (25%) as high as possible.

Motivated by the above consideration, for the relevance objective, we consider a function that has a steep penalty for lower values. We first normalize the scores to get the NDCG, and then apply this function on the normalized value. For revenue, we just add up the cumulative scores. The function we use to combine the two cumulative scores is thus

\[
f(x,y)=x-e^{-c_{1}y/\operatorname{\mathsf{cs}}_{{\mathbf{w}}}^{*}({\mathbf{b}})-c_{2}},
\]

for some suitable constants $c_{1}$ and $c_{2}$. Higher values of $c_{1}$ make the curve steeper and make the distribution more concentrated. We choose $c_{1}$ so that we benefit the bottom quartile as much as possible. The constant $c_{2}$ is tuned so that the total revenue is close to some target.

We compare this with an additive function. The revenue term is not normalized whereas the relevance term is. This function is

g(x,y)=x+c_{3}y/\operatorname{\mathsf{cs}}_{{\mathbf{w}}}^{*}({\mathbf{b}}),

for some suitable choice of c_3. Once again, c_3 is tuned to achieve a revenue target.
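For concreteness, here is a minimal Python sketch of the two combining rules. The constants c_1, c_2, c_3 and the normalizer \operatorname{\mathsf{cs}}_{\mathbf{w}}^{*}(\mathbf{b}) are placeholders to be tuned as described above; the numbers in the usage example are illustrative, not the values used in the experiments.

```python
import math

def f_exp(x, y, cs_star_b, c1, c2):
    """Concave combiner: revenue x minus a steep exponential penalty
    that kicks in when the normalized relevance y / cs_star_b is low."""
    return x - math.exp(-c1 * y / cs_star_b - c2)

def g_additive(x, y, cs_star_b, c3):
    """Additive baseline: revenue x plus scaled normalized relevance."""
    return x + c3 * y / cs_star_b

# Illustrative usage: x is the revenue cumulative score, y the relevance
# cumulative score, and cs_star_b the best achievable relevance cumulative
# score for the instance (so y / cs_star_b is the NDCG).
x, y, cs_star_b = 2.5, 7.0, 10.0
print(f_exp(x, y, cs_star_b, c1=5.0, c2=0.0))
print(g_additive(x, y, cs_star_b, c3=1.0))
```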

We also add constraints on the ranking to better reflect the real scenario (although not exactly the same constraints as used in reality, for confidentiality reasons): an ad cannot occupy the first position, and the total number of ads in the top 10 positions is at most 4. It is quite easy to see that we can optimize a single objective given these constraints: we first sort by the score, then slide ads down if the first slot has an ad, and finally remove any ads beyond the top 4; a sketch appears below. (Although our guarantees don't extend, our algorithm handles such constraints as long as we can solve the problem of optimizing a single objective. In experiments, the algorithm seems to do well. It is an interesting open problem to generalize our guarantees to such settings.)
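The sort-then-repair step just described might look roughly as follows. This is a simplified illustration assuming each result is a (score, is_ad) pair, not the exact production logic.

```python
def rerank_with_ad_constraints(results, max_ads_in_top10=4):
    """results: list of (score, is_ad) pairs.
    Sort by score, keep an organic result in the first position,
    and allow at most max_ads_in_top10 ads among the top 10 positions."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)

    # Slide ads down if the first slot has an ad: promote the best organic result.
    if ranked and ranked[0][1]:
        first_organic = next((i for i, r in enumerate(ranked) if not r[1]), None)
        if first_organic is not None:
            ranked.insert(0, ranked.pop(first_organic))

    # Drop ads beyond the allowed number within the top-10 window.
    out, ads_in_top10 = [], 0
    for r in ranked:
        if r[1] and len(out) < 10:
            if ads_in_top10 >= max_ads_in_top10:
                continue  # skip this ad; a later result fills the slot
            ads_in_top10 += 1
        out.append(r)
    return out
```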

We present the CDFs of the NDCGs for relevance for the two algorithms in Figure 8. The figure shows that in the bottom quartile the exp function does better, and the relation flips above this. For the bottom decile, the difference is significant. As mentioned earlier, this is exactly what we wanted to achieve.

Another important aspect of a ranking algorithm in this context is the set of positions that ads occupy. In Figure 9, we show this distribution: for each position, we show the number of instances for which there was an ad in that position. For the additive function, which is the graph at the bottom, most of the ads are clustered around positions 2 to 4, and the number gradually decreases further down. The distribution in the case of the exp function is better spread out. Interestingly, the most common position in which an ad is shown is the very last one.

Figure 8: CDF of NDCGs for objective B, relevance, on real data.
Figure 9: Distribution of ads by position.

To conclude, the choice of a concave function to combine the different objectives gives a greater degree of freedom to ranking algorithms. This freedom can be used to better control several important metrics in search and ad rankings. This experiment shows how this can be done for the relevance NDCGs in the bottom quartile, or for the distribution of ad positions.


Appendix A Running time of binary search in Proposition 2.8

Getting O(n\log n\log B) running time.

Note that the critical set C can be of size \binom{n}{2}, so a naive binary search on this set would require listing all the critical values and then sorting them. We can avoid this by doing a binary search directly on the ratio \lambda=q/p.

Recall that the critical set C=\{\lambda_{1},\dots,\lambda_{m}\} where \lambda_{1}<\lambda_{2}<\dots<\lambda_{m}, and we set \lambda_{0}=0,\lambda_{m+1}=\infty. By the assumption that a_{i},b_{i} are integers bounded by B, \lambda_{m}<2B+1 and \lambda_{i+1}-\lambda_{i}\geq 1/8B^{2} for every i. Define \tilde{\phi}:\mathbb{R}^{\geq 0}\to\{-1,0,1\} as follows:

\tilde{\phi}(\lambda)=\begin{cases}0&\text{ if }\lambda_{i}\leq\lambda<\lambda_{i+1}\text{ and }\nabla f(g_{i})\in R_{i}\cup A_{i}\cup R_{i+1}\\ +1&\text{ if }\lambda_{i}\leq\lambda<\lambda_{i+1}\text{ and }\nabla f(g_{i})\text{ lies anticlockwise to }R_{i}\cup A_{i}\cup R_{i+1}\\ -1&\text{ if }\lambda_{i}\leq\lambda<\lambda_{i+1}\text{ and }\nabla f(g_{i})\text{ lies clockwise to }R_{i}\cup A_{i}\cup R_{i+1}.\end{cases}
Claim A.1.

Given \lambda^{*}>0, we can compute \tilde{\phi}(\lambda^{*}) in O(n\log n) time.

Proof.

It is enough to show that we can find \lambda_{i},\lambda_{i+1} such that \lambda_{i}\leq\lambda^{*}<\lambda_{i+1} in O(n\log n) time. We can find the ranking \pi which sorts \mathbf{a}+\lambda^{*}\mathbf{b} in decreasing order in O(n\log n) time. We can then evaluate g_{i}=(\operatorname{\mathsf{cs}}(\mathbf{a},\pi),\operatorname{\mathsf{cs}}(\mathbf{b},\pi)) in O(n) time. Now \lambda_{i+1} is the first critical point where the sorted order of \mathbf{a}+\lambda\mathbf{b} switches from \pi as we increase \lambda from \lambda^{*}. Once \lambda crosses \lambda_{i+1}, some adjacent elements of \pi switch positions in the sorted order of \mathbf{a}+\lambda\mathbf{b}. Therefore

\lambda_{i+1}=\min\left\{\lambda:\lambda>\lambda^{*}\text{ and }a_{\pi(i)}+\lambda b_{\pi(i)}=a_{\pi(i+1)}+\lambda b_{\pi(i+1)}\text{ for some }i\in[n-1]\right\}

where if the set is empty we set \lambda_{i+1}=\infty. Note that this can be computed in O(n) time. Similarly

\lambda_{i}=\max\left\{\lambda:\lambda\leq\lambda^{*}\text{ and }a_{\pi(i)}+\lambda b_{\pi(i)}=a_{\pi(i+1)}+\lambda b_{\pi(i+1)}\text{ for some }i\in[n-1]\right\}

where if the set is empty we set \lambda_{i}=0. ∎
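A small Python sketch of this computation: sort by a+\lambda^{*}b, compute the cumulative scores, and scan adjacent pairs for the nearest crossing points on either side of \lambda^{*}. The weight vector w is assumed to have length n (padded with zeros beyond the top positions); names are illustrative.

```python
import math

def bracket_critical_values(a, b, w, lam_star):
    """Return (lam_i, lam_next, g) with lam_i <= lam_star < lam_next, where g
    is the pair of cumulative scores of the ranking that sorts a + lam_star*b
    in decreasing order (weighted by w)."""
    n = len(a)
    order = sorted(range(n), key=lambda j: a[j] + lam_star * b[j], reverse=True)
    g = (sum(w[r] * a[order[r]] for r in range(n)),
         sum(w[r] * b[order[r]] for r in range(n)))

    lam_lo, lam_hi = 0.0, math.inf
    for r in range(n - 1):
        i, j = order[r], order[r + 1]
        if b[i] == b[j]:
            continue  # parallel lines, no crossing point
        lam = (a[i] - a[j]) / (b[j] - b[i])  # value of lambda where i and j tie
        if lam > lam_star:
            lam_hi = min(lam_hi, lam)
        elif lam >= 0:
            lam_lo = max(lam_lo, lam)
    return lam_lo, lam_hi, g
```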

Now we claim that it is enough to find some \lambda_{\ell}<\lambda_{u} such that one of the following is true:

  1. \tilde{\phi}(\lambda_{\ell})=0 or \tilde{\phi}(\lambda_{u})=0. This is similar to the case when \phi(i)=0.

  2. \tilde{\phi}(\lambda_{\ell})=+1, \tilde{\phi}(\lambda_{u})=-1 and \lambda_{u}-\lambda_{\ell}<1/8B^{2}. Then there can be at most one critical point between \lambda_{\ell} and \lambda_{u}, and since \tilde{\phi}(\lambda_{\ell})\neq\tilde{\phi}(\lambda_{u}) there is exactly one, i.e., there exists a unique i such that \lambda_{\ell}<\lambda_{i+1}<\lambda_{u}. Therefore \lambda_{\ell} and \lambda_{u} belong to adjacent regions. This is similar to the case when \phi(i)=1,\phi(i+1)=-1.

Now if either \tilde{\phi}(0)=0 or \tilde{\phi}(2B+1)=0, we are done. Otherwise \tilde{\phi}(0)=+1 and \tilde{\phi}(2B+1)=-1. Using binary search in the range [0,2B+1], one can find such \lambda_{\ell},\lambda_{u} in O(\log B) iterations. Since each iteration runs in O(n\log n) time, the total running time is O(n\log n\log B).
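This bisection can be sketched as follows, assuming a callable phi_tilde that evaluates \tilde{\phi} in O(n\log n) time as in Claim A.1; the stopping tests mirror the two cases above.

```python
def find_bracket(phi_tilde, B):
    """Bisect [0, 2B+1] until phi_tilde is 0 at a probed point, or the two
    endpoints have signs +1 / -1 and are closer than 1/(8*B^2)."""
    lo, hi = 0.0, 2.0 * B + 1.0
    if phi_tilde(lo) == 0:
        return lo, lo
    if phi_tilde(hi) == 0:
        return hi, hi
    # Invariant from here on: phi_tilde(lo) == +1 and phi_tilde(hi) == -1.
    while hi - lo >= 1.0 / (8.0 * B * B):
        mid = (lo + hi) / 2.0
        s = phi_tilde(mid)
        if s == 0:
            return mid, mid
        if s == +1:
            lo = mid
        else:
            hi = mid
    return lo, hi
```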

Getting strongly polynomial randomized O(n\log^{2}n) running time.

We will only give a proof sketch. In this case we cannot do a binary search over \lambda, because all the critical values can be concentrated in a small region and it may take a long time to find this region. Before we proceed we make a few claims.

Claim A.2.

Given two (generic) sequences of numbers \mathbf{c} and \mathbf{d} of length n, let I be the set of inversions of \mathbf{d} w.r.t. \mathbf{c}, i.e., I=\left\{(i,j):i<j,\frac{c_{i}-c_{j}}{d_{j}-d_{i}}>0\right\}. We can find the size |I| in O(n\log n) time, and we can sample uniformly at random from I in O(n\log n) time. (Note that there can be as many as \binom{n}{2} inversions, so we cannot list them all.)

Proof.

We only give a proof sketch. Wlog, we can assume that \mathbf{c} is already sorted, by applying the same permutation to both \mathbf{c} and \mathbf{d}. We now sort \mathbf{d} using the merge sort algorithm, and it is not hard to see that we can count and sample from inversions during this process. ∎
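The counting half can be sketched as below: reorder \mathbf{d} by \mathbf{c} and count inversions with mergesort. Uniform sampling can be layered on top by first counting, drawing a random index, and locating that inversion during a second merge pass. This is a generic mergesort illustration, not code from the paper.

```python
def count_inversions(c, d):
    """Count pairs that are ordered one way by c and the other way by d,
    assuming all values are distinct (the 'generic' case)."""
    # Reorder d by increasing c; inversions of that sequence are exactly
    # the discordant pairs.
    seq = [dj for _, dj in sorted(zip(c, d))]

    def sort_count(arr):
        if len(arr) <= 1:
            return arr, 0
        mid = len(arr) // 2
        left, inv_l = sort_count(arr[:mid])
        right, inv_r = sort_count(arr[mid:])
        merged, inv, i, j = [], inv_l + inv_r, 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                inv += len(left) - i  # right[j] jumps over the remaining left elements
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv

    return sort_count(seq)[1]
```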

Claim A.3.

Given any \lambda_{\ell}<\lambda_{u}, we can sample uniformly at random from C\cap[\lambda_{\ell},\lambda_{u}] in O(n\log n) time.

Proof.

Let I be the set of inversions of \mathbf{a}+\lambda_{u}\mathbf{b} w.r.t. \mathbf{a}+\lambda_{\ell}\mathbf{b}. We claim that

C\cap[\lambda_{\ell},\lambda_{u}]=\{\lambda:a_{i}+\lambda b_{i}=a_{j}+\lambda b_{j},(i,j)\in I\}.

This is because, as \lambda increases from \lambda_{\ell} to \lambda_{u}, C\cap[\lambda_{\ell},\lambda_{u}] is the set of critical points where a switch happens in the sorted order of \mathbf{a}+\lambda\mathbf{b}. Therefore the critical points in [\lambda_{\ell},\lambda_{u}] correspond exactly to the inversions I. For each inversion (i,j)\in I, \frac{a_{i}-a_{j}}{b_{j}-b_{i}}\in C\cap[\lambda_{\ell},\lambda_{u}]. ∎
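Assuming an inversion sampler sample_inversion(c, d) as in Claim A.2 (a hypothetical helper returning a uniformly random inversion of d w.r.t. c), a uniformly random critical value in the interval can then be obtained as follows.

```python
def sample_critical(a, b, lam_lo, lam_hi, sample_inversion):
    """Sample a uniformly random element of C intersected with [lam_lo, lam_hi].
    sample_inversion(c, d) is an assumed helper returning a uniformly random
    inversion (i, j) of d w.r.t. c, as in Claim A.2."""
    c = [a[k] + lam_lo * b[k] for k in range(len(a))]
    d = [a[k] + lam_hi * b[k] for k in range(len(a))]
    i, j = sample_inversion(c, d)
    # The crossing point of results i and j is the sampled critical value.
    return (a[i] - a[j]) / (b[j] - b[i])
```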

Suppose \phi(0)=1 and \phi(m)=-1 (otherwise we are done), where \phi is the function defined in Equation (7). We want to find an i such that \phi(i)=0, or \phi(i)=1 and \phi(i+1)=-1. Set \lambda_{\ell}=0,\lambda_{u}=\infty. From Claim A.3, we can sample a uniformly random \lambda\in C\cap[\lambda_{\ell},\lambda_{u}] in O(n\log n) time. Suppose \lambda=\lambda_{i}; then we can find \lambda_{i+1} and g_{i} as shown in Claim A.1 in O(n\log n) time. Therefore we can evaluate \phi(i) in O(n\log n) time. We then continue the binary search based on the value of \phi(i), updating either the lower bound \lambda_{\ell}=\lambda or the upper bound \lambda_{u}=\lambda. In each iteration, the random \lambda\in C\cap[\lambda_{\ell},\lambda_{u}] eliminates a constant fraction of the points in C\cap[\lambda_{\ell},\lambda_{u}], i.e., the size of C\cap[\lambda_{\ell},\lambda_{u}] shrinks by a constant factor in expectation. Therefore the algorithm ends in O(\log n) iterations with high probability. In fact, we can stop the sampling process once the size of C\cap[\lambda_{\ell},\lambda_{u}] becomes O(n), and then do a regular binary search by listing the remaining critical values. Since the running time of each iteration is O(n\log n), the total running time is O(n\log^{2}n).
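Putting the pieces together, the randomized search might be organized as in the sketch below, where count_critical, sample_critical and phi_at are assumed oracles implementing Claims A.2, A.3 and A.1 respectively (with \mathbf{a} and \mathbf{b} captured by these callables). This is an illustrative outline rather than the paper's implementation; the final listing-and-binary-search step over the surviving O(n) critical values is omitted.

```python
import math

def randomized_search(count_critical, sample_critical, phi_at, n):
    """Shrink [lam_lo, lam_hi] by probing random critical values until phi_at
    reports 0 or only O(n) critical values remain; the survivors would then be
    listed and binary-searched as in the deterministic algorithm."""
    lam_lo, lam_hi = 0.0, math.inf
    while count_critical(lam_lo, lam_hi) > 4 * n:
        lam = sample_critical(lam_lo, lam_hi)
        s = phi_at(lam)  # phi evaluated for the region that starts at lam
        if s == 0:
            return lam, lam
        if s == +1:
            lam_lo = lam  # the sign change lies to the right
        else:
            lam_hi = lam  # the sign change lies to the left
    return lam_lo, lam_hi  # at most O(n) critical values remain in this range
```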