
Ranking with Multiple Objectives

Nikhil R. Devanur Amazon. Email: Iam@nikhildevanur.com. Work done while the author was at Microsoft Research.    Sivakanth Gopi Microsoft Research. Email: sigopi@microsoft.com
Abstract

In search and advertisement ranking, it is often required to simultaneously maximize multiple objectives. For example, the objectives can correspond to multiple intents of a search query, or in the context of advertising, they can be relevance and revenue. It is important to efficiently find rankings which strike a good balance between such objectives. Motivated by such applications, we formulate a general class of problems where

  • each result gets a different score corresponding to each objective,

  • the results of a ranking are aggregated by taking, for each objective, a weighted sum of the scores in the order of the ranking, and

  • an arbitrary concave function of the aggregates is maximized.

Combining the aggregates using a concave function naturally leads to more balanced outcomes. We give an approximation algorithm in a bicriteria/resource augmentation setting: the algorithm, given a slight advantage, does as well as the optimum. In particular, if the aggregation step is just the sum of the top $k$ results, then the algorithm outputs $k+1$ results which do as well as the optimal top $k$ results. We show how this approach helps with balancing different objectives via simulations on synthetic data as well as on real data from LinkedIn.

1 Introduction

We study the problem of ranking with multiple objectives. Ranking is an important component of many online platforms such as Google, Bing, Facebook, LinkedIn, Amazon, Yelp, and so on. It is quite common that the platform has multiple objectives while choosing a ranking. For instance, in a search engine, when someone searches for “jaguar”, it could refer to either the animal or the car company. Thus there is one set of results that are relevant for jaguar the animal, another for jaguar the car company, and the search engine has to pick a ranking to satisfy both intents.

Another common reason to have multiple objectives is advertising. The final ranking produced has organic results as well as ads, and the objectives are relevance and revenue. Ads contribute to both relevance and revenue, whereas organic results only contribute to relevance. While in some cases ads occupy specialized slots, it is becoming more common to have floating ads. Also, in many cases, the same result can qualify both as an ad and as an organic result, and it is not desirable to repeat it. In such cases, one has to produce a single ranking of all the results (the union of organic results and ads) that achieves a certain tradeoff between the two objectives.

The predominant methodology currently used to handle multiple objectives is to combine them into one objective using a linear combination (Vogt and Cottrell, 1999). The advantage of this is that it can trace out the entire Pareto frontier of the achievable objectives. The disadvantage is that one has to choose a single linear combination for a large number of instances. This often results in cases where one objective is favored much more than the others. This is illustrated in Figure 1. To explain this figure we introduce some notation.

Suppose that there are $m$ instances, and for each instance there are $n$ results that are to be ranked. Each result $j$ for instance $i$ has two numbers associated with it, $a_{ij}$ and $b_{ij}$, that correspond to the two objectives. Given a ranking $\pi:[n]\rightarrow[n]$, we aggregate the two objective values for instance $i$ using cumulative scores defined as

\[
\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}}_{i},\pi)=\sum_{j\in[n]}w_{j}a_{i\pi(j)}\text{ and }\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}}_{i},\pi)=\sum_{j\in[n]}w_{j}b_{i\pi(j)},
\]

for some non-negative weight vector ${\mathbf{w}}=(w_{1},\cdots,w_{n})$ with (weakly) decreasing weights, i.e., $w_{1}\geq w_{2}\geq\dots\geq w_{n}\geq 0$. (Throughout the paper, we use the convention that $w_{i}$ is the $i^{\rm th}$ coordinate of $\mathbf{w}$.) Further, ${\mathbf{a}}_{i}$ is the vector $(a_{i1},a_{i2},\cdots,a_{in})$. For example, in advertisement ranking, $w_{i}$ can represent the click rate, i.e., the probability that a user clicks the $i^{\rm th}$ result in the ranking. It is natural that the click rates decrease with the position, i.e., it is more likely that a top result is clicked. Suppose $a_{j}$ represents the revenue generated when the $j^{\rm th}$ ad is clicked, and $b_{j}$ represents the relevance of the $j^{\rm th}$ ad to the user query. Then $\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}},\pi)$ represents the expected revenue generated and $\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}},\pi)$ represents the expected total relevance for the user when the ads are ranked according to $\pi$. In the figure, the weight vector is the one used in discounted cumulative gain (DCG) and normalized DCG (NDCG) (Burges et al., 2005), which are standard measures often used in evaluating search engine rankings. This weight vector is:

\[
w_{i}=\frac{1}{\log_{2}(i+1)}. \qquad (1)
\]

We normalize the cumulative scores by the best possible ranking for each objective. This is motivated by two things: the resulting numbers are all in $[0,1]$ so they are comparable to each other, and how well the ranking did relative to the best achievable one is often the more meaningful measure. We define

\[
\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}({\mathbf{a}})=\max_{\pi}\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}},\pi)\text{ and }\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}({\mathbf{b}})=\max_{\pi}\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}},\pi).
\]

When the weight vector is the one mentioned above, we refer to the normalized cumulative scores as NDCG. Figure 1 shows scatter plots of the NDCGs for the two objectives, for different algorithms: a given algorithm produces ranking $\pi_{i}$ for instance $i$, and each dot in the plot is a point

\[
\left(\frac{\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}}_{i},\pi_{i})}{\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}({\mathbf{a}}_{i})},\frac{\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}}_{i},\pi_{i})}{\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}({\mathbf{b}}_{i})}\right).
\]
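As a concrete illustration, the following minimal sketch (ours, not the authors' code) computes the weighted cumulative score and its NDCG-style normalization for a single instance; the DCG weights follow (1), and the score vector and ranking are placeholders.

```python
import numpy as np

def cumulative_score(scores, ranking, w):
    # cs_w(scores, pi) = sum_j w_j * scores[pi(j)]
    return float(np.dot(w, scores[ranking]))

def best_cumulative_score(scores, w):
    # cs*_w(scores): sort the scores in decreasing order against the decreasing weights
    return float(np.dot(w, np.sort(scores)[::-1]))

n = 50
w = 1.0 / np.log2(np.arange(2, n + 2))   # DCG weights w_i = 1 / log2(i + 1)
a = np.random.rand(n)                    # placeholder scores for one objective
pi = np.random.permutation(n)            # some candidate ranking
ndcg_a = cumulative_score(a, pi, w) / best_cumulative_score(a, w)
```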

The source of the data is the LinkedIn news feed: it is a random sample from one day of results. Here we do not go into the details of what the two objectives are; Section 4 has more details. On the right, we show the result of ranking using the sum of the two scores. The triangle shape of the scatter plot is persistent across different samples and different choices of linear combinations. What we wish to avoid are the two corners of the triangle where one of the two NDCGs is rather small. Ideally, we would like to be at the apex of the triangle, which is at the top right corner of the figure.

On the left, we show the results of our algorithm for the following objective:

\[
\max_{\pi}\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}}_{i},\pi)\cdot\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}}_{i},\pi).
\]

The undesirable corners of the triangle have vanished: instances where one objective is much smaller than the other are rare, if they occur at all. The points are closer to the top right corner.

Figure 1: Scatter plot of NDCGs for two different objectives, $A$ and $B$, on real data.

1.1 Main Results

The key idea is to combine the two cumulative scores using a concave function $f$. (Maximizing the product is the same as maximizing the sum of logs, which is a concave function.) Concave functions tend to favor more balanced outcomes almost by definition: the function at the average of two points is at least as high as the average of the function at the two points, i.e., $f(\tfrac{x_{1}+x_{2}}{2},\tfrac{y_{1}+y_{2}}{2})\geq\tfrac{f(x_{1},y_{1})+f(x_{2},y_{2})}{2}$. Figure 1 is a good demonstration of this.

We allow arbitrary concave functions that are strictly increasing in each coordinate. We define the combined objective score of $\pi$ with weights ${\mathbf{w}}$ as

\[
\operatorname{\mathsf{co}}_{{\mathbf{w}},f}({\mathbf{a}},{\mathbf{b}},\pi)=f\left(\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}},\pi),\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}},\pi)\right).
\]

For some objectives, the sum of the cumulative scores across different instances is still an important metric, e.g., the total revenue, or the number of clicks. We allow incorporating such metrics via a global concave function, i.e., a concave function of the sum of all the cumulative scores over all the instances. Let $F$, and $f_{i}$ for $i\in[m]$, be concave functions in two variables. We consider the problem of finding a ranking $\pi_{i}$ for each $i$ in order to maximize

\[
\sum_{i\in[m]}\operatorname{\mathsf{co}}_{{\mathbf{w}},f_{i}}({\mathbf{a}}_{i},{\mathbf{b}}_{i},\pi_{i})+F\left(\sum_{i\in[m]}\operatorname{\mathsf{cs}}_{{\mathbf{w}}}({\mathbf{a}}_{i},\pi_{i}),\sum_{i\in[m]}\operatorname{\mathsf{cs}}_{{\mathbf{w}}}({\mathbf{b}}_{i},\pi_{i})\right).
\]

Our main results are polynomial time bi-criteria approximation algorithms for the problem mentioned above. These are similar in spirit to results with resource augmentation in scheduling, or Bulow-Klemperer style results in mechanism design.

  • Consider the special case where we sum the top $k<n$ entries, i.e., the weight vector is $k$ ones followed by all zeros. For this case, we allow the algorithm to sum the top $k+1$ entries, and show that the resulting objective is at least as good as the optimum for the sum of the top $k$ results.

  • For the general case, the algorithm gets an advantage as follows: replace one coordinate of ${\mathbf{w}}$, say $w_{j}$, with the immediately preceding coordinate $w_{j-1}$. For the case of summing the top $k$ entries, this corresponds to replacing the $(k+1)^{\rm st}$ coordinate, which is a 0, with the $k^{\rm th}$ coordinate, which is a 1. The replacement can be different for different $i$. Again, the ranking output by the algorithm with the new weights does as well as the optimum with the original weights ${\mathbf{w}}$. See Theorem 2.2 and Theorem 3.2 for formal statements.

    One advantage of such a guarantee is that it does not depend on the parameters of the concave functions, such as their Lipschitz constants or the range of values, as is usual with other types of guarantees. This allows greater flexibility in the choice of these concave functions.

  • When there is no global function $F$, for each $i$, the algorithm just does a binary search (Proposition 2.8). In each iteration of the binary search, we compute a ranking optimal for a linear combination of the two objectives. The running time to solve each ranking problem (i.e., each instance $i$) is $O(n\log^{2}n)$. In practice, the ranking algorithms are required to be very fast, so this is an important property. For the general case, this is still true, provided that we are given two additional parameters that are optimized appropriately. In practice, such parameters are tuned 'offline' so we can still use the binary search to rank 'online' any new instance $i$.

1.2 Related Work

Rank aggregation is much studied, most frequently in the context of databases with ‘fuzzy’ queries (Fagin and Wimmers, 1997) and in the context of information retrieval or web search (Dwork et al., 2001; Aslam and Montague, 2001). There are two main categories of results (Renda and Straccia, 2003), (i) where the input is a set of rankings (Dwork et al., 2001; Renda and Straccia, 2003), and (ii) where the input is a set of scores (Vogt and Cottrell, 1999; Fox and Shaw, 1994). Clearly score based aggregation methods are more powerful, since there is strictly more information; our paper falls in the score based aggregation category.

Among the score based methods, Vogt and Cottrell (1999) use the same form of cumulative scores as us, and empirically evaluate the usage of a linear combination of cumulative scores. They identify limitations of this method and conditions under which it does well. Fox and Shaw (1994) propose and evaluate several methods for combining the scores for different objectives result by result, which are then used to rank. In contrast, we first aggregate the scores for each objective and then combine these cumulative scores.

Azar et al. (2009) also consider rank aggregation motivated by multiple intents in search rankings, but with several differences. They consider a large number of different intents, as opposed to this paper where we focus on just 2. Their objective function also depends on a weight vector but in a different way. For each intent, a result is either ‘relevant’ or not, and given a ranking, the cumulative score for that intent is the weight corresponding to the highest rank at which a relevant result appears. Their objective is a weighted sum of the cumulative scores across all intents.

The rank based aggregation methods are closely related to voting schemes and social choice theory, and a lot of this has focused on algorithms to compute the Kemeny-Young rank aggregation (Young and Levenglick, 1978; Young, 1988; Saari, 1995; Borda, 1784).

Organization:

In Section 2 we give our binary search based algorithm for a single instance. The general case is presented in Section 3. We present experimental results in Section 4. The appendix contains some missing proofs.

2 A single instance of ranking with multiple objectives

Let us formally define the $\operatorname{\mathsf{Rank}}$ problem.

Definition 2.1 ($\operatorname{\mathsf{Rank}}({\mathbf{a}},{\mathbf{b}},{\mathbf{w}},f)$).

Given ${\mathbf{a}},{\mathbf{b}},{\mathbf{w}},f$, find a ranking $\pi$ which maximizes the combined objective, i.e., find $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})=\max_{\pi}\operatorname{\mathsf{co}}_{{\mathbf{w}}}({\mathbf{a}},{\mathbf{b}},\pi)$.

It is not clear if $\operatorname{\mathsf{Rank}}({\mathbf{a}},{\mathbf{b}},{\mathbf{w}},f)$ can be efficiently solved, because it involves optimization over all rankings and there are exponentially many of them. Our main result shows that when $f$ is concave, we can find nearly optimal solutions. We will assume that $f$ is differentiable and has continuous derivatives.

Theorem 2.2.

Suppose $f(\alpha,\beta)$ is a concave function over the range $\alpha,\beta\geq 0$ and $f$ is strictly increasing in each coordinate in that range. Given an instance of $\operatorname{\mathsf{Rank}}({\mathbf{a}},{\mathbf{b}},{\mathbf{w}},f)$, there is an algorithm that runs in $O(n\log^{2}n)$ time (this is a Las Vegas algorithm, i.e., it always outputs the correct answer and runs in $O(n\log^{2}n)$ time with high probability; we also give an $O(n\log n\log B)$ algorithm when the $a_{i},b_{i}$ are integers bounded by $B$) and outputs a ranking $\pi$ of $[n]$ such that $\operatorname{\mathsf{co}}_{{\mathbf{w}}^{\prime}}({\mathbf{a}},{\mathbf{b}},\pi)\geq\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})$, where ${\mathbf{w}}^{\prime}={\mathbf{w}}+(w_{i}-w_{i+1}){\mathbf{e}}_{i+1}$ for some $i\in[n-1]$. In other words, ${\mathbf{w}}^{\prime}$ is obtained by replacing $w_{i+1}$ with $w_{i}$ in ${\mathbf{w}}$ for some $i\in[n-1]$.

We have the following corollary for the important special case where the cumulative scores are the sum of scores of the top $k$ elements, i.e., ${\mathbf{w}}=(1,\dots,1,0,\dots,0)$ with exactly $k$ ones. In this special case, $\operatorname{\mathsf{Rank}}$ is called the $\operatorname{\mathsf{TOP}}_{k}$ problem.

Corollary 2.3.

Given such an instance of $\operatorname{\mathsf{TOP}}_{k}({\mathbf{a}},{\mathbf{b}},f)$, there is an efficient algorithm that outputs a subset $S\subset[n]$ of at most $k+1$ elements such that

\[
f\left(\sum_{i\in S}a_{i},\sum_{i\in S}b_{i}\right)\geq\max_{|T|=k}f\left(\sum_{i\in T}a_{i},\sum_{i\in T}b_{i}\right).
\]

We will now prove Theorem 2.2. We also make the mild assumption that the numbers in ${\mathbf{a}},{\mathbf{b}},{\mathbf{w}}$ are generic for the proof, which can be achieved by perturbing all the numbers with a tiny additive noise. In particular, we will assume that $w_{1}>w_{2}>\dots>w_{n}>0$. This only perturbs $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})$ by a tiny amount, and by a limiting argument it does not affect the result. To prove Theorem 2.2, we create a convex programming relaxation for $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})$ as shown in (2) and denote its value by $\mathsf{OPT}$.

\[
\begin{aligned}
\mathsf{OPT}=\max_{\pi_{ij},\alpha,\beta\geq 0}\quad & f(\alpha,\beta) \qquad\qquad (2)\\
\text{s.t.}\quad & \alpha\leq\sum_{i,j=1}^{n}\pi_{ij}w_{i}a_{j} \qquad\rightarrow\ (p)\\
& \beta\leq\sum_{i,j=1}^{n}\pi_{ij}w_{i}b_{j} \qquad\rightarrow\ (q)\\
& \forall i\ \ \sum_{j=1}^{n}\pi_{ij}\leq 1 \qquad\rightarrow\ (r_{i})\\
& \forall j\ \ \sum_{i=1}^{n}\pi_{ij}\leq 1 \qquad\rightarrow\ (c_{j})
\end{aligned}
\]

It is clear that $\mathsf{OPT}$ is a relaxation for $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})$, i.e., $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})\leq\mathsf{OPT}$. By convex programming duality, $\mathsf{OPT}$ can be expressed as a dual minimization problem (3) by introducing a dual variable for every constraint in the primal, as shown in (2). Note that by Slater's condition, strong duality holds here (Boyd and Vandenberghe, 2004). The constraints in the dual correspond to variables in the primal as shown in (3).

\[
\begin{aligned}
\mathsf{OPT}=\min_{r_{i},c_{j},p,q\geq 0}\quad & \sum_{i=1}^{n}r_{i}+\sum_{j=1}^{n}c_{j}+f^{*}(-p,-q) \qquad\qquad (3)\\
\text{s.t.}\quad & \forall i,j\ \ r_{i}+c_{j}\geq w_{i}(pa_{j}+qb_{j}) \qquad\rightarrow\ (\pi_{ij})
\end{aligned}
\]

Here $f^{*}$ is the Fenchel dual of $f$, defined as

\[
f^{*}(\mu,\nu)=\sup_{\alpha,\beta\geq 0}\left(\mu\alpha+\nu\beta+f(\alpha,\beta)\right).
\]

Note that $f^{*}$ is a convex function since it is the supremum of linear functions. Since $f(\alpha,\beta)$ is strictly increasing in each coordinate, $f^{*}(\mu,\nu)=\infty$ unless $\mu,\nu<0$. Since the dual is a minimization problem, the optimum value is attained only when $\mu,\nu<0$. Hereafter, wlog, we assume that $p,q>0$ in the dual (3). For example, when $f(\alpha,\beta)=\log(\alpha\beta)$,

\[
f^{*}(\mu,\nu)=\begin{cases}-\log(\mu\nu)-2&\text{if }\mu,\nu<0\\ \infty&\text{otherwise}.\end{cases}
\]
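For completeness, here is the short computation behind this example, using the sign convention of the definition above. For $\mu<0$ the inner maximization over $\alpha$ is solved in closed form, and symmetrically for $\beta$:
\[
\sup_{\alpha\geq 0}\bigl(\mu\alpha+\log\alpha\bigr)\text{ is attained at }\alpha=-1/\mu,\text{ with value }-1-\log(-\mu),
\]
so, for $\mu,\nu<0$,
\[
f^{*}(\mu,\nu)=\sup_{\alpha,\beta\geq 0}\bigl(\mu\alpha+\nu\beta+\log\alpha+\log\beta\bigr)=-2-\log(-\mu)-\log(-\nu)=-\log(\mu\nu)-2,
\]
since $(-\mu)(-\nu)=\mu\nu$.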

If $(\pi^{*},\alpha^{*},\beta^{*})$ is some optimal solution for the primal (2) and $({\mathbf{r}}^{*},{\mathbf{c}}^{*},p^{*},q^{*})$ is some optimal solution for the dual (3), then together they should satisfy the KKT conditions given in (4). A constraint of the primal is tight if the corresponding variable in the dual is strictly positive, and vice versa.

\[
\begin{array}{lll|lll}
p^{*}>0&\Rightarrow&\sum_{ij}\pi_{ij}^{*}w_{i}a_{j}=\alpha^{*}&\nabla f(\alpha^{*},\beta^{*})&=&(p^{*},q^{*})\\
q^{*}>0&\Rightarrow&\sum_{ij}\pi_{ij}^{*}w_{i}b_{j}=\beta^{*}&\pi_{ij}^{*}>0&\Rightarrow&r_{i}^{*}+c_{j}^{*}=w_{i}(p^{*}a_{j}+q^{*}b_{j})\\
r_{i}^{*}>0&\Rightarrow&\sum_{j}\pi_{ij}^{*}=1&&&\\
c_{j}^{*}>0&\Rightarrow&\sum_{i}\pi_{ij}^{*}=1&&&
\end{array}
\qquad (4)
\]
Proposition 2.4.

Let $p,q>0$ be fixed. Then the value of the minimization program in (3) is given by

\[
\Psi(p,q)=\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})+f^{*}(-p,-q)
\]

where $p{\mathbf{a}}+q{\mathbf{b}}=(pa_{1}+qb_{1},\dots,pa_{n}+qb_{n})$. Moreover, the KKT conditions (4) can be simplified to:

\[
\begin{aligned}
& \pi^{*}\in\mathsf{ConvHull}\{\pi:\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}})=\operatorname{\mathsf{cs}}_{\mathbf{w}}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}},\pi)\}, \qquad (5)\\
& \alpha^{*}=\sum_{ij}\pi_{ij}^{*}w_{i}a_{j},\quad \beta^{*}=\sum_{ij}\pi_{ij}^{*}w_{i}b_{j},\\
& \nabla f(\alpha^{*},\beta^{*})=(p^{*},q^{*}).
\end{aligned}
\]
Proof.

For fixed $p,q>0$, the dual program (3) reduces (after ignoring the fixed additive term $f^{*}(-p,-q)$) to the following linear program:

\[
\begin{aligned}
\min_{r_{i},c_{j}\geq 0}\quad & \sum_{i=1}^{n}r_{i}+\sum_{j=1}^{n}c_{j}\\
\text{s.t.}\quad & \forall i,j\ \ r_{i}+c_{j}\geq w_{i}(pa_{j}+qb_{j}) \qquad\rightarrow\ (\pi_{ij}).
\end{aligned}
\]

The dual linear program is:

\[
\begin{aligned}
\max_{\pi_{ij}\geq 0}\quad & \sum_{ij}\pi_{ij}w_{i}(pa_{j}+qb_{j})\\
& \forall i\ \ \sum_{j=1}^{n}\pi_{ij}\leq 1 \qquad\rightarrow\ (r_{i})\\
& \forall j\ \ \sum_{i=1}^{n}\pi_{ij}\leq 1 \qquad\rightarrow\ (c_{j})
\end{aligned}
\]

The constraints on $\pi$ are the doubly stochastic constraints on the matrix $\pi$. Therefore, by the Birkhoff-von Neumann theorem, the feasible solutions are convex combinations of permutations and the optimum is attained at a permutation. An optimal permutation must sort the values of $p{\mathbf{a}}+q{\mathbf{b}}$ in decreasing order, and convex combinations of such permutations are also optimal solutions. Thus the set of optimal solutions is $\mathsf{ConvHull}\{\pi:\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})=\operatorname{\mathsf{cs}}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}},\pi)\}$, and the value of both of the above programs is $\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})$. ∎

Lemma 2.5.

Fix some $p,q>0$. Then one of the following is true:

  1. There are no ties among $p{\mathbf{a}}+q{\mathbf{b}}$, i.e., there is a unique $\pi$ such that $\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})=\operatorname{\mathsf{cs}}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}},\pi)$.

  2. There is exactly one tie among $p{\mathbf{a}}+q{\mathbf{b}}$, i.e., there are exactly two permutations $\pi_{1},\pi_{2}$ such that $\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})=\operatorname{\mathsf{cs}}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}},\pi_{1})=\operatorname{\mathsf{cs}}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}},\pi_{2})$. Moreover, $\pi_{1},\pi_{2}$ differ by an adjacent transposition, i.e., $\pi_{2}$ can be obtained from $\pi_{1}$ by swapping two adjacent elements.

Proof.

$\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})=p\operatorname{\mathsf{cs}}({\mathbf{a}},\pi)+q\operatorname{\mathsf{cs}}({\mathbf{b}},\pi)$ where $\pi$ is any permutation which sorts $p{\mathbf{a}}+q{\mathbf{b}}$ in descending order. There are two cases:

  1. Case 1: If there are no ties among $p{\mathbf{a}}+q{\mathbf{b}}$, then the permutation $\pi$ is unique.

  2. Case 2: Suppose there are ties among $p{\mathbf{a}}+q{\mathbf{b}}$. Because we assumed that ${\mathbf{a}},{\mathbf{b}}$ are generic, there can be at most one tie among $(pa_{1}+qb_{1},\dots,pa_{n}+qb_{n})$, i.e., there is at most one pair $s,t$ such that $pa_{s}+qb_{s}=pa_{t}+qb_{t}$. Two such ties would impose two linearly independent equations on $p,q$, forcing them to both be zero. Therefore there are at most two distinct permutations $\pi_{1},\pi_{2}$ such that $\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})=p\operatorname{\mathsf{cs}}({\mathbf{a}},\pi_{1})+q\operatorname{\mathsf{cs}}({\mathbf{b}},\pi_{1})=p\operatorname{\mathsf{cs}}({\mathbf{a}},\pi_{2})+q\operatorname{\mathsf{cs}}({\mathbf{b}},\pi_{2})$. Moreover, $s,t$ must be next to each other in $\pi_{1},\pi_{2}$, and their order is switched between $\pi_{1}$ and $\pi_{2}$. Therefore they differ by an adjacent transposition. ∎

Remark 2.6.

From Proposition 2.4 and Lemma 2.5, the solution $\pi^{*}$ for the primal program (2) is either a single permutation or a convex combination of two permutations which differ only by an adjacent transposition (i.e., swapping two elements next to each other).

Remark 2.7.

$\Psi(p,q)=\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})+f^{*}(-p,-q)$ is a convex function. The gradient of $\Psi$ (or a subgradient at points where $\Psi$ is not differentiable) can be calculated efficiently, and therefore $\mathsf{OPT}=\min_{p,q\geq 0}\Psi(p,q)$ can be found efficiently using gradient (or subgradient) descent (Bubeck, 2015).
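To make the remark concrete, the following is a minimal sketch (not from the paper) of projected subgradient descent on $\Psi$ for the running example $f(\alpha,\beta)=\log(\alpha\beta)$, for which $f^{*}(-p,-q)=-\log(pq)-2$; the step size, iteration count, and initialization are arbitrary illustrative choices.

```python
import numpy as np

def subgradient_descent_psi(a, b, w, iters=200, lr=0.05):
    # Minimize Psi(p, q) = cs*_w(p*a + q*b) - log(p*q) - 2 over p, q > 0.
    p, q = 1.0, 1.0
    for _ in range(iters):
        pi = np.argsort(-(p * a + q * b))        # ranking that sorts p*a + q*b descending
        grad_p = np.dot(w, a[pi]) - 1.0 / p      # subgradient of cs* plus d/dp of -log(pq)
        grad_q = np.dot(w, b[pi]) - 1.0 / q
        p = max(p - lr * grad_p, 1e-9)           # project back into the positive quadrant
        q = max(q - lr * grad_q, 1e-9)
    return p, q, np.argsort(-(p * a + q * b))
```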

It turns out that there is a much more efficient algorithm to find 𝖮𝖯𝖳\mathsf{OPT} using binary search. We need the notion of a subgradient. For a convex function g:dg:\mathbb{R}^{d}\to\mathbb{R}, the subgradient of gg at a point xdx\in\mathbb{R}^{d} is defined as g(x)={v:g(x+y)g(x)+v,yy}\partial g(x)=\{v:g(x+y)\geq g(x)+\left\langle v,y\right\rangle\forall y\}. It is always a convex subset of d\mathbb{R}^{d}. If gg is differentiable at xx then, g(x)={g(x)}\partial g(x)=\{\nabla g(x)\}.

Proposition 2.8 (Binary search to find 𝖮𝖯𝖳\mathsf{OPT}).

Suppose a1,,ana_{1},\dots,a_{n} and b1,,bnb_{1},\dots,b_{n} are integers bounded by BB in absolute value. We can solve the primal program (2), the dual program (3) and find 𝖮𝖯𝖳=minp,q>0𝖼𝗌𝐰(p𝐚+q𝐛)+f(p,q)\mathsf{OPT}=\min_{p,q>0}\operatorname{\mathsf{cs}}^{*}_{\mathbf{w}}(p{\mathbf{a}}+q{\mathbf{b}})+f^{*}(-p,-q) in O(nlognlogB)O(n\log n\log B) time. Moreover there is a strongly polynomial randomized algorithm which runs in O(nlog2n)O(n\log^{2}n) time.444Strongly polynomial refers to the fact that the running time is independent of BB or the actual numbers in 𝐚,𝐛{\mathbf{a}},{\mathbf{b}}. In this model, it is assumed that arithmetic and comparison operations between aia_{i}’s and bib_{i}’s take constant time.

Proof.

To solve the primal program (2) and the dual program (3), it is enough to find $(\pi^{*},\alpha^{*},\beta^{*})$ and $(p^{*},q^{*})$ which together satisfy all the simplified KKT conditions (5).

Throughout the proof, we drop the subscript ${\mathbf{w}}$ from $\operatorname{\mathsf{cs}}_{\mathbf{w}}$ for brevity. $\Psi(p,q)=\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})+f^{*}(-p,-q)$ is a convex function, so a local minimum is a global minimum, and therefore it is enough to find $(p^{*},q^{*})$ such that $0\in\partial\Psi(p^{*},q^{*})$.

\[
0\in\partial\Psi(p^{*},q^{*})\iff\nabla f^{*}(-p^{*},-q^{*})\in\partial\operatorname{\mathsf{cs}}^{*}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}}).
\]

It is easy to see that $\nabla f^{*}(-p,-q)=(\alpha,\beta)\iff\nabla f(\alpha,\beta)=(p,q)$. Therefore we can rewrite the optimality condition for $(p^{*},q^{*})$ as:

\[
(p^{*},q^{*})\in\nabla f\left(\partial\operatorname{\mathsf{cs}}^{*}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}})\right). \qquad (6)
\]

Thus we need to find a fixed point of a set-valued map. We begin by calculating the subgradient $\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})$. Note that $\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})=p\operatorname{\mathsf{cs}}({\mathbf{a}},\pi)+q\operatorname{\mathsf{cs}}({\mathbf{b}},\pi)$ where $\pi$ is any permutation which sorts $p{\mathbf{a}}+q{\mathbf{b}}$ in descending order. By Lemma 2.5, there are two cases:

  1. Case 1: If there are no ties among $p{\mathbf{a}}+q{\mathbf{b}}$, then the permutation $\pi$ is unique, $\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})$ is differentiable at $(p,q)$, and
\[
\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})=\{\left(\operatorname{\mathsf{cs}}({\mathbf{a}},\pi),\operatorname{\mathsf{cs}}({\mathbf{b}},\pi)\right)\}.
\]

  2. Case 2: Suppose there are ties among $p{\mathbf{a}}+q{\mathbf{b}}$. Then there exist exactly two permutations $\pi_{1},\pi_{2}$ such that $\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})=\operatorname{\mathsf{cs}}(p{\mathbf{a}}+q{\mathbf{b}},\pi_{1})=\operatorname{\mathsf{cs}}(p{\mathbf{a}}+q{\mathbf{b}},\pi_{2})$. In this case,
\[
\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})=\left\{\mu\left(\operatorname{\mathsf{cs}}({\mathbf{a}},\pi_{1}),\operatorname{\mathsf{cs}}({\mathbf{b}},\pi_{1})\right)+(1-\mu)\left(\operatorname{\mathsf{cs}}({\mathbf{a}},\pi_{2}),\operatorname{\mathsf{cs}}({\mathbf{b}},\pi_{2})\right):\mu\in[0,1]\right\}.
\]

We make a few observations. The value of the subgradient $\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})$ only depends on the ratio of $p$ and $q$, $\lambda=q/p$; this is because the optimal ranking of $p{\mathbf{a}}+q{\mathbf{b}}$ only depends on the ratio $\lambda$. And as we change this ratio $\lambda$ from $0$ to $\infty$, the subgradient changes at most $\binom{n}{2}$ times. This happens whenever $\lambda$ is such that $a_{i}+\lambda b_{i}=a_{j}+\lambda b_{j}$ for some $i\neq j$. We call the set

\[
C=\left\{\frac{a_{i}-a_{j}}{b_{j}-b_{i}}:1\leq i<j\leq n,\ \frac{a_{i}-a_{j}}{b_{j}-b_{i}}>0\right\},
\]

the critical set of $\lambda$'s where the subgradient changes value. ($|C|$ is equal to the number of inversions in ${\mathbf{a}}$ w.r.t. ${\mathbf{b}}$, also called the Kendall tau distance.) Let $m=|C|$ and let $\lambda_{1}<\lambda_{2}<\dots<\lambda_{m}$ be an ordering of the critical set $C$; further define $\lambda_{0}=0$ and $\lambda_{m+1}=\infty$.
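As an illustration, a naive enumeration of the critical set $C$ is sketched below (our code, computing all $\binom{n}{2}$ candidate ratios in quadratic time); the near-linear-time algorithm instead binary searches over $C$ without listing it, as described in Appendix A.

```python
import itertools

def critical_ratios(a, b):
    # Ratios lambda = (a_i - a_j) / (b_j - b_i) > 0 at which the sorted
    # order of a + lambda * b changes.
    C = set()
    for i, j in itertools.combinations(range(len(a)), 2):
        if b[j] != b[i]:
            lam = (a[i] - a[j]) / (b[j] - b[i])
            if lam > 0:
                C.add(lam)
    return sorted(C)
```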

Figure 2: The positive quadrant $\mathcal{Q}^{++}$ is divided into regions $A_{i}$ and rays $R_{i}$ based on the values of the subgradient $\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})$.

Define the regions

\[
A_{i}=\left\{(p,p\lambda):p>0,\ \lambda_{i}<\lambda<\lambda_{i+1}\right\}.
\]

Also define the rays

\[
R_{i}=\left\{(p,p\lambda_{i}):p>0\right\}.
\]

In the region $A_{i}$, there is a unique permutation $\sigma_{i}=\operatorname{argmax}_{\pi}\operatorname{\mathsf{cs}}(p{\mathbf{a}}+q{\mathbf{b}},\pi)$, therefore the subgradient is unique. Denote its value by $g_{i}$, i.e.,

\[
\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})|_{A_{i}}=\{g_{i}\}=\{(\operatorname{\mathsf{cs}}({\mathbf{a}},\sigma_{i}),\operatorname{\mathsf{cs}}({\mathbf{b}},\sigma_{i}))\}.
\]

On the ray $R_{i+1}$, we have $\{\sigma_{i},\sigma_{i+1}\}=\operatorname{argmax}_{\pi}\operatorname{\mathsf{cs}}(p{\mathbf{a}}+q{\mathbf{b}},\pi)$, therefore the subgradient is given by

\[
\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})|_{R_{i+1}}=\{\mu g_{i}+(1-\mu)g_{i+1}:\mu\in[0,1]\}.
\]

Figure 2 shows the values of the subgradient $\partial\operatorname{\mathsf{cs}}^{*}(p{\mathbf{a}}+q{\mathbf{b}})$ as a function of $(p,q)$ in the regions $A_{i},R_{i}$.

Let $\mathcal{Q}^{++}=\{(p,q):p,q>0\}$ be the positive quadrant. Since $f(\alpha,\beta)$ is strictly increasing in each coordinate in $\mathcal{Q}^{++}$, the gradients also lie in the positive quadrant, i.e., $\nabla f:\mathcal{Q}^{++}\to\mathcal{Q}^{++}$. We now define a function $\phi:\{0,1,\dots,m\}\to\{-1,0,1\}$ as follows:

\[
\phi(i)=\begin{cases}0&\text{ if }\nabla f(g_{i})\in R_{i}\cup A_{i}\cup R_{i+1}\\ +1&\text{ if }\nabla f(g_{i})\text{ lies anticlockwise to }R_{i}\cup A_{i}\cup R_{i+1}\\ -1&\text{ if }\nabla f(g_{i})\text{ lies clockwise to }R_{i}\cup A_{i}\cup R_{i+1}.\end{cases} \qquad (7)
\]

We now show that it is enough to find some $i\in\{0,1,\dots,m\}$ such that one of the following is true.

  1. $\phi(i)=0$. In this case, we set $(p^{*},q^{*})=\nabla f(g_{i})$. The condition $\phi(i)=0$ implies that $(p^{*},q^{*})\in R_{i}\cup A_{i}\cup R_{i+1}$. Therefore $g_{i}\in\partial\operatorname{\mathsf{cs}}^{*}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}})$. Applying $\nabla f$ on both sides implies that $(p^{*},q^{*})=\nabla f(g_{i})\in\nabla f(\partial\operatorname{\mathsf{cs}}^{*}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}}))$, which is the fixed point condition (6). Moreover, setting $\pi^{*}=\sigma_{i}$ and $(\alpha^{*},\beta^{*})=g_{i}$ gives a solution to the simplified KKT conditions (5).

  2. $\phi(i)=1,\ \phi(i+1)=-1$. In this case, we claim that there exists some $\mu^{*}\in(0,1)$ such that $\nabla f(\mu^{*}g_{i}+(1-\mu^{*})g_{i+1})\in R_{i+1}$. This is because the curve $\gamma:[0,1]\to\mathcal{Q}^{++}$ given by $\gamma(\mu)=\nabla f(\mu g_{i}+(1-\mu)g_{i+1})$ starts and ends on opposite sides of the ray $R_{i+1}$, as shown in Figure 3, so it should cross it at some point $\mu^{*}\in(0,1)$, which can be found by binary search (here we need continuity of $\nabla f$). We then set $(p^{*},q^{*})=\nabla f(\mu^{*}g_{i}+(1-\mu^{*})g_{i+1})$. Since $(p^{*},q^{*})\in R_{i+1}$, $\mu^{*}g_{i}+(1-\mu^{*})g_{i+1}\in\partial\operatorname{\mathsf{cs}}^{*}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}})$. Applying $\nabla f$ to both sides, we get $(p^{*},q^{*})\in\nabla f(\partial\operatorname{\mathsf{cs}}^{*}(p^{*}{\mathbf{a}}+q^{*}{\mathbf{b}}))$, which is the fixed point condition (6). Setting $\pi^{*}=\mu^{*}\sigma_{i}+(1-\mu^{*})\sigma_{i+1}$ and $(\alpha^{*},\beta^{*})=\mu^{*}g_{i}+(1-\mu^{*})g_{i+1}$ gives a solution to the simplified KKT conditions (5).

    Figure 3: The endpoints of the curve $\gamma$ lie on opposite sides of the ray $R_{i+1}$, so $\gamma$ must cross the ray $R_{i+1}$ at some point.

Now if either $\phi(0)=0$ or $\phi(m)=0$, we are done. Otherwise, $\phi(0)=1$ and $\phi(m)=-1$. By a simple binary search on $\{0,1,\dots,m\}$ we can find a point $i$ such that either $\phi(i)=0$ or $\phi(i)=1,\phi(i+1)=-1$.

A naive implementation of the binary search described above requires finding the set of critical values $C$ and sorting them, which can take $\Omega(n^{2}\log n)$ time. To improve this to near-linear time, we need a way to do binary search on $C$ without listing all the values in $C$. See Appendix A for how to achieve this. ∎

We are now ready to prove Theorem 2.2.

Proof of Theorem 2.2.

By Proposition 2.8 and Remark 2.6, we can find a solution $(\pi^{*},\alpha^{*},\beta^{*})$ to the primal program (2) where $\pi^{*}$ is either a permutation or a convex combination of two permutations which differ by an adjacent transposition. If $\pi^{*}$ is a permutation, then

\[
\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})\leq\mathsf{OPT}=f(\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}},\pi^{*}),\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}},\pi^{*})).
\]

Thus we just output $\pi^{*}$, which is the optimal ranking.

If $\pi^{*}=\mu\pi_{1}+(1-\mu)\pi_{2}$, i.e., a convex combination of $\pi_{1},\pi_{2}$ which differ in the $i,i+1$ positions, then

\[
\begin{aligned}
\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})\leq\mathsf{OPT}
&=f\bigl(\mu(\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}},\pi_{1}),\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}},\pi_{1}))+(1-\mu)(\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{a}},\pi_{2}),\operatorname{\mathsf{cs}}_{\mathbf{w}}({\mathbf{b}},\pi_{2}))\bigr)\\
&\leq f\bigl(\operatorname{\mathsf{cs}}_{{\mathbf{w}}^{\prime}}({\mathbf{a}},\pi_{1}),\operatorname{\mathsf{cs}}_{{\mathbf{w}}^{\prime}}({\mathbf{b}},\pi_{1})\bigr)=\operatorname{\mathsf{co}}_{{\mathbf{w}}^{\prime}}({\mathbf{a}},{\mathbf{b}},\pi_{1})
\end{aligned}
\]

where ${\mathbf{w}}^{\prime}={\mathbf{w}}+(w_{i}-w_{i+1}){\mathbf{e}}_{i+1}$. Similarly, $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}({\mathbf{a}},{\mathbf{b}})\leq\operatorname{\mathsf{co}}_{{\mathbf{w}}^{\prime}}({\mathbf{a}},{\mathbf{b}},\pi_{2})$. So in this case we output either $\pi_{1}$ or $\pi_{2}$, whichever has the higher combined objective. ∎
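For readers who prefer code, here is a compact reference implementation of the single-instance algorithm (our sketch, not the authors' code): it enumerates the critical ratios naively in $O(n^{2}\log n)$ time and scans the regions instead of binary searching, but it follows the case analysis above. The callables `f` and `grad_f` are assumed to be a concave, coordinate-wise increasing objective and its gradient.

```python
import numpy as np

def cs(scores, pi, w):
    # Weighted cumulative score cs_w(scores, pi).
    return float(np.dot(w, scores[pi]))

def rank_two_objectives(a, b, w, f, grad_f):
    # Returns (ranking, weights_used) in the sense of Theorem 2.2.
    a, b, w = np.asarray(a, float), np.asarray(b, float), np.asarray(w, float)
    n = len(a)
    lams = sorted({(a[i] - a[j]) / (b[j] - b[i])
                   for i in range(n) for j in range(i + 1, n)
                   if b[j] != b[i] and (a[i] - a[j]) / (b[j] - b[i]) > 0})
    bounds = [0.0] + lams + [np.inf]          # lambda_0 = 0 < lambda_1 < ... < inf

    def region_perm(i):
        # Optimal permutation sigma_i for any (p, q) with q/p inside region A_i.
        lo, hi = bounds[i], bounds[i + 1]
        lam = lo + 1.0 if np.isinf(hi) else 0.5 * (lo + hi)
        return np.argsort(-(a + lam * b), kind="stable")

    def phi(i):
        sigma = region_perm(i)
        p, q = grad_f(cs(a, sigma, w), cs(b, sigma, w))
        slope = q / p
        if slope > bounds[i + 1]:
            return 1                          # gradient lies anticlockwise to the region
        if slope < bounds[i]:
            return -1                         # gradient lies clockwise to the region
        return 0

    num_regions = len(bounds) - 1
    signs = [phi(i) for i in range(num_regions)]      # binary search over C suffices in theory
    for i, s in enumerate(signs):
        if s == 0:
            return region_perm(i), w                  # sigma_i is optimal for the original weights
    i = next(i for i in range(num_regions - 1) if signs[i] == 1 and signs[i + 1] == -1)
    pi1, pi2 = region_perm(i), region_perm(i + 1)
    t = int(np.argmax(pi1 != pi2))                    # pi1, pi2 differ at positions t, t + 1
    w_boost = w.copy()
    w_boost[t + 1] = w_boost[t]                       # the boosted weights w' of Theorem 2.2
    best = max((pi1, pi2),
               key=lambda pi: f(cs(a, pi, w_boost), cs(b, pi, w_boost)))
    return best, w_boost
```

For the product objective of Figure 1 one can take `f = lambda x, y: np.log(x) + np.log(y)` and `grad_f = lambda x, y: (1.0 / x, 1.0 / y)`.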

3 Multiple instances of ranking with global aggregation

Suppose we have several instances of $\operatorname{\mathsf{Rank}}$ and we want to do well locally in each problem, but we also want to do well when we aggregate our solutions globally. Such a situation arises in the example application discussed in the beginning. For each user we want to solve an instance of $\operatorname{\mathsf{Rank}}$, but globally, across all the users, we also want the average revenue and average relevance to be high. We model this as follows.

Let $({\mathbf{a}}_{1},{\mathbf{b}}_{1},{\mathbf{w}}_{1},f_{1}),\dots,({\mathbf{a}}_{m},{\mathbf{b}}_{m},{\mathbf{w}}_{m},f_{m})$ be $m$ instances of $\operatorname{\mathsf{Rank}}$ on sequences of length $n$. Let $\mathbf{w}=({\mathbf{w}}_{1},\dots,{\mathbf{w}}_{m})$, $f=(f_{1},\dots,f_{m})$, $\mathbf{a}=({\mathbf{a}}_{1},\dots,{\mathbf{a}}_{m})$, $\mathbf{b}=({\mathbf{b}}_{1},\dots,{\mathbf{b}}_{m})$, and let $\pi=(\pi_{1},\dots,\pi_{m})$ be a sequence of rankings of $[n]$. The global cumulative a-score and b-score of $\pi$ are defined as

\[
\operatorname{\mathsf{gcs}}_{\mathbf{w}}(\mathbf{a},\pi)=\sum_{i=1}^{m}\operatorname{\mathsf{cs}}_{{\mathbf{w}}_{i}}({\mathbf{a}}_{i},\pi_{i})\text{ and }\operatorname{\mathsf{gcs}}_{\mathbf{w}}(\mathbf{b},\pi)=\sum_{i=1}^{m}\operatorname{\mathsf{cs}}_{{\mathbf{w}}_{i}}({\mathbf{b}}_{i},\pi_{i})
\]

respectively. Suppose $F(\alpha,\beta)$ is a concave function increasing in each coordinate. The combined objective function is defined as

\[
\operatorname{\mathsf{co}}_{\mathbf{w}}(\mathbf{a},\mathbf{b},\pi)=F(\operatorname{\mathsf{gcs}}_{\mathbf{w}}(\mathbf{a},\pi),\operatorname{\mathsf{gcs}}_{\mathbf{w}}(\mathbf{b},\pi))+\sum_{i=1}^{m}f_{i}(\operatorname{\mathsf{cs}}_{{\mathbf{w}}_{i}}({\mathbf{a}}_{i},\pi_{i}),\operatorname{\mathsf{cs}}_{{\mathbf{w}}_{i}}({\mathbf{b}}_{i},\pi_{i})).
\]
Definition 3.1 ($\operatorname{\mathsf{MultiRank}}(\mathbf{a},\mathbf{b},\mathbf{w},f,F)$).

Given $\mathbf{a},\mathbf{b},\mathbf{w},f,F$, find a sequence of rankings $\pi=(\pi_{1},\dots,\pi_{m})$ of $[n]$ which maximizes $\operatorname{\mathsf{co}}_{\mathbf{w}}(\mathbf{a},\mathbf{b},\pi)$, i.e., find $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}(\mathbf{a},\mathbf{b})=\max_{\pi}\operatorname{\mathsf{co}}_{\mathbf{w}}(\mathbf{a},\mathbf{b},\pi)$.

We will assume that the functions $f_{1},\dots,f_{m},F$ are concave and strictly increasing in each coordinate, and that they are differentiable with continuous derivatives. Our main theorem is that we can efficiently find a sequence of rankings $\pi$ which, with a slight advantage, does as well as the optimal sequence of rankings.

Theorem 3.2.

Suppose the functions $f_{1},\dots,f_{m},F$ are concave and strictly increasing in each coordinate. Given an instance of $\operatorname{\mathsf{MultiRank}}(\mathbf{a},\mathbf{b},\mathbf{w},f,F)$, we can efficiently find a sequence of rankings $\pi=(\pi_{1},\dots,\pi_{m})$ such that

\[
\operatorname{\mathsf{co}}_{\mathbf{w}^{\prime}}(\mathbf{a},\mathbf{b},\pi)\geq\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}(\mathbf{a},\mathbf{b})
\]

where $\mathbf{w}^{\prime}=({\mathbf{w}}_{1}^{\prime},\dots,{\mathbf{w}}_{m}^{\prime})$ and ${\mathbf{w}}_{i}^{\prime}={\mathbf{w}}_{i}+(w_{t_{i}}-w_{t_{i}+1}){\mathbf{e}}_{t_{i}+1}$ for some $t_{i}\in[n-1]$, i.e., each ${\mathbf{w}}^{\prime}_{i}$ is obtained by replacing $w_{t_{i}+1}$ with $w_{t_{i}}$.

In the special case when the weight vectors ${\mathbf{w}}_{1},\dots,{\mathbf{w}}_{m}$ are just a sequence of $k$ ones followed by zeros, i.e., the cumulative scores are calculated by adding the scores of the top $k$ results, $\operatorname{\mathsf{MultiRank}}$ is called the $\operatorname{\mathsf{Multi-TOP}}_{k}$ problem.

Corollary 3.3.

Suppose the functions $f_{1},\dots,f_{m},F$ are concave and strictly increasing in each coordinate. Given an instance of $\operatorname{\mathsf{Multi-TOP}}_{k}(\mathbf{a},\mathbf{b},f,F)$, we can efficiently find a sequence $S=(S_{1},\dots,S_{m})$ of subsets of $[n]$ of size at most $k+1$ (corresponding to the top $k+1$ elements) such that

\[
\operatorname{\mathsf{co}}(\mathbf{a},\mathbf{b},S)\geq\max_{|T_{i}|=k}\operatorname{\mathsf{co}}(\mathbf{a},\mathbf{b},(T_{1},\dots,T_{m})).
\]

Our approach to proving Theorem 3.2 is very similar to the proof of Theorem 2.2: we write a convex programming relaxation and solve its dual program. We will also assume that the sequences ${\mathbf{a}}_{i},{\mathbf{b}}_{i},{\mathbf{w}}$ are generic, which can be ensured by perturbing all entries by a tiny additive noise; this does not change $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}(\mathbf{a},\mathbf{b})$ by much, and by a limiting argument it does not affect the result. We first develop a convex programming relaxation $\mathsf{OPT}$ as shown in (8).

\[
\begin{aligned}
\mathsf{OPT}=\max_{\pi^{i}_{jk},\alpha_{i},\beta_{i},\alpha,\beta\geq 0}\quad & F(\alpha,\beta)+\sum_{i=1}^{m}f_{i}(\alpha_{i},\beta_{i}) \qquad\qquad (8)\\
\text{s.t.}\quad & \forall i\in[m]\quad \alpha_{i}\leq\sum_{j,k=1}^{n}w_{ij}a_{ik}\pi^{i}_{jk} \qquad\rightarrow\ (p_{i})\\
& \forall i\in[m]\quad \beta_{i}\leq\sum_{j,k=1}^{n}w_{ij}b_{ik}\pi^{i}_{jk} \qquad\rightarrow\ (q_{i})\\
& \alpha\leq\sum_{i=1}^{m}\sum_{j,k=1}^{n}w_{ij}a_{ik}\pi^{i}_{jk} \qquad\rightarrow\ (p)\\
& \beta\leq\sum_{i=1}^{m}\sum_{j,k=1}^{n}w_{ij}b_{ik}\pi^{i}_{jk} \qquad\rightarrow\ (q)\\
& \forall i\in[m],j\in[n]\quad \sum_{k=1}^{n}\pi^{i}_{jk}\leq 1 \qquad\rightarrow\ (r_{ij})\\
& \forall i\in[m],k\in[n]\quad \sum_{j=1}^{n}\pi^{i}_{jk}\leq 1 \qquad\rightarrow\ (c_{ik})
\end{aligned}
\]

It is clear that $\mathsf{OPT}$ is a relaxation for $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}(\mathbf{a},\mathbf{b})$, i.e., $\operatorname{\mathsf{co}}^{*}_{\mathbf{w}}(\mathbf{a},\mathbf{b})\leq\mathsf{OPT}$. By convex programming duality, $\mathsf{OPT}$ can be expressed as a dual minimization problem (9) by introducing a dual variable for every constraint in the primal, as shown in (8). Again, by Slater's condition, strong duality holds (Boyd and Vandenberghe, 2004). The constraints in the dual correspond to variables in the primal as shown in (9).

\[
\begin{aligned}
\mathsf{OPT}=\min_{r_{ij},c_{ik},p_{i},q_{i},p,q\geq 0}\quad & \sum_{i=1}^{m}\left(\sum_{j=1}^{n}r_{ij}+\sum_{k=1}^{n}c_{ik}+f_{i}^{*}(-p_{i},-q_{i})\right)+F^{*}(-p,-q) \qquad (9)\\
\text{s.t.}\quad & \forall i\in[m],\ j,k\in[n]\quad r_{ij}+c_{ik}\geq w_{ij}\left((p_{i}+p)a_{ik}+(q_{i}+q)b_{ik}\right) \qquad\rightarrow\ (\pi^{i}_{jk})
\end{aligned}
\]

Here $f_{i}^{*}$ is the Fenchel dual of $f_{i}$ and $F^{*}$ is the Fenchel dual of $F$. If $(\pi^{*},\alpha_{i}^{*},\beta_{i}^{*},\alpha^{*},\beta^{*})$ is some optimal solution for the primal (8) and $(r^{*},c^{*},p_{i}^{*},q_{i}^{*},p^{*},q^{*})$ is some optimal solution for the dual (9), then together they should satisfy the KKT conditions given in (10). Note that a constraint of the primal is tight if the corresponding variable in the dual is strictly positive, and vice versa.

\[
\begin{array}{lll|lll}
p_{i}^{*}>0&\Rightarrow&\sum_{j,k=1}^{n}w_{ij}a_{ik}\pi^{i*}_{jk}=\alpha_{i}^{*}&\nabla f_{i}(\alpha_{i}^{*},\beta_{i}^{*})&=&(p_{i}^{*},q_{i}^{*})\\
q_{i}^{*}>0&\Rightarrow&\sum_{j,k=1}^{n}w_{ij}b_{ik}\pi^{i*}_{jk}=\beta_{i}^{*}&\nabla F(\alpha^{*},\beta^{*})&=&(p^{*},q^{*})\\
r^{*}_{ij}>0&\Rightarrow&\sum_{k=1}^{n}\pi^{i*}_{jk}=1&\pi^{i*}_{jk}>0&\Rightarrow&r^{*}_{ij}+c^{*}_{ik}=w_{ij}\left((p_{i}^{*}+p^{*})a_{ik}+(q_{i}^{*}+q^{*})b_{ik}\right)\\
c^{*}_{ik}>0&\Rightarrow&\sum_{j=1}^{n}\pi^{i*}_{jk}=1&&&\\
p^{*}>0&\Rightarrow&\sum_{i=1}^{m}\sum_{j,k=1}^{n}w_{ij}a_{ik}\pi^{i*}_{jk}=\alpha^{*}&&&\\
q^{*}>0&\Rightarrow&\sum_{i=1}^{m}\sum_{j,k=1}^{n}w_{ij}b_{ik}\pi^{i*}_{jk}=\beta^{*}&&&
\end{array}
\qquad (10)
\]
Proposition 3.4.

Let $p,q,p_{1},q_{1},\dots,p_{m},q_{m}>0$ be fixed. Then the value of the minimization program in (9) is given by

\[
\Psi(p,q,p_{1},q_{1},\dots,p_{m},q_{m})=F^{*}(-p,-q)+\sum_{i=1}^{m}\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)+f_{i}^{*}(-p_{i},-q_{i}).
\]

Moreover, the KKT conditions (10) can be simplified to:

\[
\begin{aligned}
& \pi_{i}^{*}\in\mathsf{ConvHull}\{\pi:\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}((p^{*}+p_{i}^{*}){\mathbf{a}}_{i}+(q^{*}+q_{i}^{*}){\mathbf{b}}_{i})=\operatorname{\mathsf{cs}}_{{\mathbf{w}}_{i}}((p^{*}+p_{i}^{*}){\mathbf{a}}_{i}+(q^{*}+q_{i}^{*}){\mathbf{b}}_{i},\pi)\}, \qquad (11)\\
& \sum_{j,k=1}^{n}w_{ij}a_{ik}\pi^{i*}_{jk}=\alpha_{i}^{*},\quad \sum_{j,k=1}^{n}w_{ij}b_{ik}\pi^{i*}_{jk}=\beta_{i}^{*},\\
& \sum_{i=1}^{m}\sum_{j,k=1}^{n}w_{ij}a_{ik}\pi^{i*}_{jk}=\alpha^{*},\quad \sum_{i=1}^{m}\sum_{j,k=1}^{n}w_{ij}b_{ik}\pi^{i*}_{jk}=\beta^{*},\\
& \nabla f_{i}(\alpha_{i}^{*},\beta_{i}^{*})=(p_{i}^{*},q_{i}^{*}),\quad \nabla F(\alpha^{*},\beta^{*})=(p^{*},q^{*}).
\end{aligned}
\]
Proof.

The proof is very similar to the proof of Proposition 2.4 where we write a linear program for each sub-problem and the corresponding dual linear program. We will skip the details. ∎

Remark 3.5.
\[
\Psi(p,q,p_{1},q_{1},\dots,p_{m},q_{m})=F^{*}(-p,-q)+\sum_{i=1}^{m}\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)+f_{i}^{*}(-p_{i},-q_{i})
\]

is a convex function. The gradient (or a subgradient) of $\Psi$ can be calculated efficiently, and therefore $\mathsf{OPT}=\min_{p,q,p_{i},q_{i}>0}\Psi(p,q,p_{1},q_{1},\dots,p_{m},q_{m})$ can be found efficiently using gradient (or subgradient) descent. The objective is also amenable to the use of stochastic gradient descent, which can be much faster when $m\gg 1$.

Proposition 3.6.

For fixed $p,q>0$, we can find

\[
\min_{p_{i},q_{i}>0}\Psi(p,q,p_{1},q_{1},\dots,p_{m},q_{m})=\min_{p_{i},q_{i}>0}F^{*}(-p,-q)+\sum_{i=1}^{m}\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)+f_{i}^{*}(-p_{i},-q_{i})
\]

efficiently using binary search.

Proof.

It is enough to find

\[
\mathrm{argmin}_{p_{i},q_{i}>0}\ \operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)+f_{i}^{*}(-p_{i},-q_{i})
\]

for each fixed $i$. By convexity of the objective, it is enough to find $(p_{i},q_{i})$ such that $\nabla f_{i}^{*}(-p_{i},-q_{i})\in\partial\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)$, which can be rewritten as:

\[
(p_{i},q_{i})\in\nabla f_{i}\left(\partial\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)\right). \qquad (12)
\]

This fixed point equation can be solved using binary search in exactly the same way as in Proposition 2.8. The subgradient $\partial\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)$ only depends on the ratio $\lambda=(q+q_{i})/(p+p_{i})$. Geometrically, this ratio is constant on any line passing through $(-p,-q)$. Figure 4 shows the regions where $\partial\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)$ remains constant. Now the fixed point equation (12) can be solved in exactly the same way as in the proof of Proposition 2.8. We will skip the details. ∎

Figure 4: The positive quadrant $\mathcal{Q}^{++}$ is divided into regions $A_{j}$ and rays $R_{j}$ based on the values of the subgradients $\partial\operatorname{\mathsf{cs}}^{*}_{{\mathbf{w}}_{i}}\left((p+p_{i}){\mathbf{a}}_{i}+(q+q_{i}){\mathbf{b}}_{i}\right)$. One can use binary search as in the proof of Proposition 2.8 to solve the fixed point equation (12).

We are now ready to prove Theorem 3.2.

Proof of Theorem 3.2.

By Proposition 3.6 and Remark 2.6, we can find a solution to the primal program (8) where each $\pi_{i}^{*}$ is either a permutation or a convex combination of two permutations which differ by an adjacent transposition. If $\pi_{i}^{*}$ is a permutation, then we just output $\pi_{i}^{*}$, which is the optimal ranking for the $i^{\rm th}$ ranking problem. If $\pi_{i}^{*}=\mu\pi_{1}+(1-\mu)\pi_{2}$, i.e., a convex combination of $\pi_{1},\pi_{2}$ which differ in the $j,j+1$ positions, then we output either $\pi_{1}$ or $\pi_{2}$ as the ranking for the $i^{\rm th}$ subproblem. ∎

4 Experiments

4.1 Synthetic Data

We first present the results on synthetic data. The purpose of this experiment is to illustrate how different objective functions affect the distribution of the NDCGs. These results are summarized in Figures 6 and 7. We present the scatter plot of the NDCGs, just like in Figure 1, as well as the cumulative distribution functions (CDFs).

We aim to capture the multiple intents scenario where the results are likely to be good along one dimension but not both. The values are drawn from an anti-correlated log-normal distribution: the $\log(a_{ij})$'s and $\log(b_{ij})$'s are drawn from a multivariate Gaussian with mean zero and covariance matrix

\[
\begin{bmatrix}0.2&-0.16\\ -0.16&0.2\end{bmatrix}.
\]

A scatter plot of the distribution of these values is shown in Figure 5.

Figure 5: Distribution of $a_{ij}$ and $b_{ij}$ values drawn from an anti-correlated log-normal distribution.

Other parameters of the experiment are as follows: we draw 50 results for each instance, i.e., $n=50$. The weight vector is the same as in the introduction, (1), except that we only consider the top 10 results, i.e., the coordinates of ${\mathbf{w}}$ after the first 10 are 0. The number of different instances is $m=500$.
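For reproducibility, the following minimal sketch (our code; the random seed and library choices are ours) generates synthetic data with these parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 500, 50, 10
cov = np.array([[0.2, -0.16],
                [-0.16, 0.2]])
logs = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=(m, n))
a = np.exp(logs[:, :, 0])                    # a_{ij}: anti-correlated log-normal scores
b = np.exp(logs[:, :, 1])                    # b_{ij}
w = 1.0 / np.log2(np.arange(2, n + 2))       # DCG weights from (1)
w[k:] = 0.0                                  # only the top 10 positions carry weight
```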

In addition to the product and the sum, we present the result of using two more combining functions: a quadratic and a normalized sum. Since we are plotting the NDCGs, a natural algorithm is to maximize the sum of the NDCGs. This is what the normalized sum does. The quadratic function first normalizes the scores to get the NDCGs, and then applies the function

\[
f(x,y)=2x-x^{2}+2y-y^{2}.
\]

It can be seen from Figure 6 that the concave functions are quite a bit more clustered than the additive functions. This can also be seen in the table inside the figure, which shows the sum of the cumulative scores (the DCGs) as well as the mean of the normalized cumulative scores (the NDCGs). These quantities are almost the same across all algorithms. We also show the standard deviations of the NDCGs, which capture quite well how clustered the points are, and show a significant difference between the concave and the additive functions.

Figure 6: Scatter plot of NDCGs for two different objectives, $A$ and $B$, on synthetic data.

We present the CDFs of the NDCGs for the four algorithms in Figure 7. The dots on the curves represent deciles, i.e., the values corresponding to the bottom $10\%$ of the population, $20\%$, and so on. Recall that a lower CDF implies that the values are higher (a distribution $F$ stochastically dominates another distribution $G$ iff the CDF of $F$ is always below that of $G$). The CDFs show that in the bottom half of the distribution, the concave functions are higher than the additive ones. Also, the steeper shape of the CDFs for the concave functions shows how they are more concentrated. There is indeed a price to pay in that the top half is worse, but this is unavoidable. The additive function picks a point on the Pareto frontier after all; in fact, it maximizes the mean of $A$ for a fixed mean of $B$ and vice versa. The whole point is that the mean is not necessarily an appropriate metric.

Figure 7: CDFs of NDCGs for two different objectives, $A$ and $B$, on synthetic data.

4.2 Real Data

The purpose of the experiment in this section is to show how the ideas from this paper could help in a realistic setting of ranking. We present experiments on real data from LinkedIn, sampled from one day of results from their news feed. The number of instances is about 33,000. The results are either organic or ads. Organic results only have relevance scores, their revenue scores are 0. Ads have both relevance and revenue scores.

The objectives are ad revenue and relevance. We will denote the ad revenue by $A$ and relevance by $B$. We use the same weight vector ${\mathbf{w}}$ as in the introduction, (1), up to 10 coordinates. Ad revenue can be added across instances, so we just sum up the ad revenue across different instances and tune the algorithms so that they all have roughly the same revenue. (The difference is less than $1\%$.) It makes less sense to add the relevance scores. In fact, it is more important to make sure that no instance gets a really bad relevance score than to optimize for the mean or even the standard deviation. To do this, we aim to make the bottom quartile (25%) as high as possible.

Motivated by the above consideration, for the relevance objective, we consider a function that has a steep penalty for lower values. We first normalize the scores to get the NDCG, and then apply this function on the normalized value. For revenue, we just add up the cumulative scores. The function we use to combine the two cumulative scores is thus

\[
f(x,y)=x-e^{-c_{1}y/\operatorname{\mathsf{cs}}_{{\mathbf{w}}}^{*}({\mathbf{b}})-c_{2}},
\]

for some suitable constants $c_{1}$ and $c_{2}$. Higher values of $c_{1}$ make the curve steeper and make the distribution more concentrated. We choose $c_{1}$ so that we benefit the bottom quartile as much as possible. The constant $c_{2}$ is tuned so that the total revenue is close to some target.

We compare this with an additive function. The revenue term is not normalized whereas the relevance term is. This function is

g(x,y)=x+c_{3}y/\operatorname{\mathsf{cs}}_{{\mathbf{w}}}^{*}({\mathbf{b}}),

for some suitable choice of c_3. Once again, c_3 is tuned to achieve a revenue target.
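For concreteness, here is a minimal Python sketch of the two combining rules. The constants c_1, c_2, c_3 and the normalizer \operatorname{\mathsf{cs}}_{\mathbf{w}}^{*}(\mathbf{b}) are placeholders to be tuned as described above; the numbers in the usage example are illustrative, not the values used in the experiments.

```python
import math

def f_exp(x, y, cs_star_b, c1, c2):
    """Concave combiner: revenue x minus a steep exponential penalty
    that kicks in when the normalized relevance y / cs_star_b is low."""
    return x - math.exp(-c1 * y / cs_star_b - c2)

def g_additive(x, y, cs_star_b, c3):
    """Additive baseline: revenue x plus scaled normalized relevance."""
    return x + c3 * y / cs_star_b

# Illustrative usage: x is the revenue cumulative score, y the relevance
# cumulative score, and cs_star_b the best achievable relevance cumulative
# score for the instance (so y / cs_star_b is the NDCG).
x, y, cs_star_b = 2.5, 7.0, 10.0
print(f_exp(x, y, cs_star_b, c1=5.0, c2=0.0))
print(g_additive(x, y, cs_star_b, c3=1.0))
```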

We also add constraints on the ranking to better reflect the real scenario (although not exactly the same constraints as used in reality, for confidentiality reasons): an ad cannot occupy the first position, and the total number of ads in the top 10 positions is at most 4. It is quite easy to see that we can optimize a single objective given these constraints: we first sort by the score, then slide ads down if the first slot has an ad, and finally remove any ads beyond the top 4; a sketch appears below. (Although our guarantees don't extend, our algorithm handles such constraints as long as we can solve the problem of optimizing a single objective. In experiments, the algorithm seems to do well. It is an interesting open problem to generalize our guarantees to such settings.)
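The sort-then-repair step just described might look roughly as follows. This is a simplified illustration assuming each result is a (score, is_ad) pair, not the exact production logic.

```python
def rerank_with_ad_constraints(results, max_ads_in_top10=4):
    """results: list of (score, is_ad) pairs.
    Sort by score, keep an organic result in the first position,
    and allow at most max_ads_in_top10 ads among the top 10 positions."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)

    # Slide ads down if the first slot has an ad: promote the best organic result.
    if ranked and ranked[0][1]:
        first_organic = next((i for i, r in enumerate(ranked) if not r[1]), None)
        if first_organic is not None:
            ranked.insert(0, ranked.pop(first_organic))

    # Drop ads beyond the allowed number within the top-10 window.
    out, ads_in_top10 = [], 0
    for r in ranked:
        if r[1] and len(out) < 10:
            if ads_in_top10 >= max_ads_in_top10:
                continue  # skip this ad; a later result fills the slot
            ads_in_top10 += 1
        out.append(r)
    return out
```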

We present the CDFs of the NDCGs for relevance for the two algorithms in Figure 8. The figure shows that in the bottom quartile the exp function does better, and the relation flips above this. For the bottom decile, the difference is significant. As mentioned earlier, this is exactly what we wanted to achieve.

Another important aspect of a ranking algorithm in this context is the set of positions that ads occupy. In Figure 9, we show this distribution: for each position, we show the number of instances for which there was an ad in that position. For the additive function, which is the graph at the bottom, most of the ads are clustered around positions 2 to 4, and the number gradually decreases further down. The distribution in the case of the exp function is better spread out. Interestingly, the most common position in which an ad is shown is the very last one.

Figure 8: CDF of NDCGs for objective B, relevance, on real data.
Figure 9: Distribution of ads by position.

To conclude, the choice of a concave function to combine the different objectives gives a greater degree of freedom to ranking algorithms. This freedom can be used to better control several important metrics in search and ad rankings. This experiment shows how this can be done for the relevance NDCGs in the bottom quartile, or for the distribution of ad positions.


Appendix A Running time of binary search in Proposition 2.8

Getting O(n\log n\log B) running time.

Note that the critical set C can be of size \binom{n}{2}, so a naive binary search on this set would require listing all the critical values and then sorting them. We can avoid this by doing a binary search directly on the ratio \lambda=q/p.

Recall that the critical set C=\{\lambda_{1},\dots,\lambda_{m}\} where \lambda_{1}<\lambda_{2}<\dots<\lambda_{m}, and we set \lambda_{0}=0,\lambda_{m+1}=\infty. By the assumption that a_{i},b_{i} are integers bounded by B, \lambda_{m}<2B+1 and \lambda_{i+1}-\lambda_{i}\geq 1/8B^{2} for every i. Define \tilde{\phi}:\mathbb{R}^{\geq 0}\to\{-1,0,1\} as follows:

\tilde{\phi}(\lambda)=\begin{cases}0&\text{ if }\lambda_{i}\leq\lambda<\lambda_{i+1}\text{ and }\nabla f(g_{i})\in R_{i}\cup A_{i}\cup R_{i+1}\\ +1&\text{ if }\lambda_{i}\leq\lambda<\lambda_{i+1}\text{ and }\nabla f(g_{i})\text{ lies anticlockwise to }R_{i}\cup A_{i}\cup R_{i+1}\\ -1&\text{ if }\lambda_{i}\leq\lambda<\lambda_{i+1}\text{ and }\nabla f(g_{i})\text{ lies clockwise to }R_{i}\cup A_{i}\cup R_{i+1}.\end{cases}
Claim A.1.

Given \lambda^{*}>0, we can compute \tilde{\phi}(\lambda^{*}) in O(n\log n) time.

Proof.

It is enough to show that we can find \lambda_{i},\lambda_{i+1} such that \lambda_{i}\leq\lambda^{*}<\lambda_{i+1} in O(n\log n) time. We can find the ranking \pi which sorts \mathbf{a}+\lambda^{*}\mathbf{b} in decreasing order in O(n\log n) time. We can then evaluate g_{i}=(\operatorname{\mathsf{cs}}(\mathbf{a},\pi),\operatorname{\mathsf{cs}}(\mathbf{b},\pi)) in O(n) time. Now \lambda_{i+1} is the first critical point where the sorted order of \mathbf{a}+\lambda\mathbf{b} switches from \pi as we increase \lambda from \lambda^{*}. Once \lambda crosses \lambda_{i+1}, some adjacent elements of \pi switch positions in the sorted order of \mathbf{a}+\lambda\mathbf{b}. Therefore

\lambda_{i+1}=\min\left\{\lambda:\lambda>\lambda^{*}\text{ and }a_{\pi(i)}+\lambda b_{\pi(i)}=a_{\pi(i+1)}+\lambda b_{\pi(i+1)}\text{ for some }i\in[n-1]\right\}

where if the set is empty we set \lambda_{i+1}=\infty. Note that this can be computed in O(n) time. Similarly

\lambda_{i}=\max\left\{\lambda:\lambda\leq\lambda^{*}\text{ and }a_{\pi(i)}+\lambda b_{\pi(i)}=a_{\pi(i+1)}+\lambda b_{\pi(i+1)}\text{ for some }i\in[n-1]\right\}

where if the set is empty we set \lambda_{i}=0. ∎
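A small Python sketch of this computation: sort by a+\lambda^{*}b, compute the cumulative scores, and scan adjacent pairs for the nearest crossing points on either side of \lambda^{*}. The weight vector w is assumed to have length n (padded with zeros beyond the top positions); names are illustrative.

```python
import math

def bracket_critical_values(a, b, w, lam_star):
    """Return (lam_i, lam_next, g) with lam_i <= lam_star < lam_next, where g
    is the pair of cumulative scores of the ranking that sorts a + lam_star*b
    in decreasing order (weighted by w)."""
    n = len(a)
    order = sorted(range(n), key=lambda j: a[j] + lam_star * b[j], reverse=True)
    g = (sum(w[r] * a[order[r]] for r in range(n)),
         sum(w[r] * b[order[r]] for r in range(n)))

    lam_lo, lam_hi = 0.0, math.inf
    for r in range(n - 1):
        i, j = order[r], order[r + 1]
        if b[i] == b[j]:
            continue  # parallel lines, no crossing point
        lam = (a[i] - a[j]) / (b[j] - b[i])  # value of lambda where i and j tie
        if lam > lam_star:
            lam_hi = min(lam_hi, lam)
        elif lam >= 0:
            lam_lo = max(lam_lo, lam)
    return lam_lo, lam_hi, g
```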

Now we claim that it is enough to find some \lambda_{\ell}<\lambda_{u} such that one of the following is true:

  1. \tilde{\phi}(\lambda_{\ell})=0 or \tilde{\phi}(\lambda_{u})=0. This is similar to the case when \phi(i)=0.

  2. \tilde{\phi}(\lambda_{\ell})=+1, \tilde{\phi}(\lambda_{u})=-1 and \lambda_{u}-\lambda_{\ell}<1/8B^{2}. Then there can be at most one critical point between \lambda_{\ell} and \lambda_{u}, and since \tilde{\phi}(\lambda_{\ell})\neq\tilde{\phi}(\lambda_{u}) there is exactly one, i.e., there exists a unique i such that \lambda_{\ell}<\lambda_{i+1}<\lambda_{u}. Therefore \lambda_{\ell} and \lambda_{u} belong to adjacent regions. This is similar to the case when \phi(i)=1,\phi(i+1)=-1.

Now if either \tilde{\phi}(0)=0 or \tilde{\phi}(2B+1)=0, we are done. Otherwise \tilde{\phi}(0)=+1 and \tilde{\phi}(2B+1)=-1. Using binary search in the range [0,2B+1], one can find such \lambda_{\ell},\lambda_{u} in O(\log B) iterations. Since each iteration runs in O(n\log n) time, the total running time is O(n\log n\log B).
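This bisection can be sketched as follows, assuming a callable phi_tilde that evaluates \tilde{\phi} in O(n\log n) time as in Claim A.1; the stopping tests mirror the two cases above.

```python
def find_bracket(phi_tilde, B):
    """Bisect [0, 2B+1] until phi_tilde is 0 at a probed point, or the two
    endpoints have signs +1 / -1 and are closer than 1/(8*B^2)."""
    lo, hi = 0.0, 2.0 * B + 1.0
    if phi_tilde(lo) == 0:
        return lo, lo
    if phi_tilde(hi) == 0:
        return hi, hi
    # Invariant from here on: phi_tilde(lo) == +1 and phi_tilde(hi) == -1.
    while hi - lo >= 1.0 / (8.0 * B * B):
        mid = (lo + hi) / 2.0
        s = phi_tilde(mid)
        if s == 0:
            return mid, mid
        if s == +1:
            lo = mid
        else:
            hi = mid
    return lo, hi
```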

Getting strongly polynomial randomized O(n\log^{2}n) running time.

We will only give a proof sketch. In this case we cannot do a binary search over \lambda, because all the critical values can be concentrated in a small region and it may take a long time to find this region. Before we proceed we make a few claims.

Claim A.2.

Given two (generic) sequences of numbers \mathbf{c} and \mathbf{d} of length n, let I be the set of inversions of \mathbf{d} w.r.t. \mathbf{c}, i.e., I=\left\{(i,j):i<j,\frac{c_{i}-c_{j}}{d_{j}-d_{i}}>0\right\}. We can find the size |I| in O(n\log n) time, and we can sample uniformly at random from I in O(n\log n) time. (Note that there can be as many as \binom{n}{2} inversions, so we cannot list them all.)

Proof.

We only give a proof sketch. Wlog, we can assume that \mathbf{c} is already sorted, by applying the same permutation to both \mathbf{c} and \mathbf{d}. We now sort \mathbf{d} using the merge sort algorithm, and it is not hard to see that we can count and sample from inversions during this process. ∎
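The counting half can be sketched as below: reorder \mathbf{d} by \mathbf{c} and count inversions with mergesort. Uniform sampling can be layered on top by first counting, drawing a random index, and locating that inversion during a second merge pass. This is a generic mergesort illustration, not code from the paper.

```python
def count_inversions(c, d):
    """Count pairs that are ordered one way by c and the other way by d,
    assuming all values are distinct (the 'generic' case)."""
    # Reorder d by increasing c; inversions of that sequence are exactly
    # the discordant pairs.
    seq = [dj for _, dj in sorted(zip(c, d))]

    def sort_count(arr):
        if len(arr) <= 1:
            return arr, 0
        mid = len(arr) // 2
        left, inv_l = sort_count(arr[:mid])
        right, inv_r = sort_count(arr[mid:])
        merged, inv, i, j = [], inv_l + inv_r, 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                inv += len(left) - i  # right[j] jumps over the remaining left elements
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv

    return sort_count(seq)[1]
```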

Claim A.3.

Given any \lambda_{\ell}<\lambda_{u}, we can sample uniformly at random from C\cap[\lambda_{\ell},\lambda_{u}] in O(n\log n) time.

Proof.

Let I be the set of inversions of \mathbf{a}+\lambda_{u}\mathbf{b} w.r.t. \mathbf{a}+\lambda_{\ell}\mathbf{b}. We claim that

C\cap[\lambda_{\ell},\lambda_{u}]=\{\lambda:a_{i}+\lambda b_{i}=a_{j}+\lambda b_{j},(i,j)\in I\}.

This is because, as \lambda increases from \lambda_{\ell} to \lambda_{u}, C\cap[\lambda_{\ell},\lambda_{u}] is the set of critical points where a switch happens in the sorted order of \mathbf{a}+\lambda\mathbf{b}. Therefore the critical points in [\lambda_{\ell},\lambda_{u}] correspond exactly to the inversions I. For each inversion (i,j)\in I, \frac{a_{i}-a_{j}}{b_{j}-b_{i}}\in C\cap[\lambda_{\ell},\lambda_{u}]. ∎
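Assuming an inversion sampler sample_inversion(c, d) as in Claim A.2 (a hypothetical helper returning a uniformly random inversion of d w.r.t. c), a uniformly random critical value in the interval can then be obtained as follows.

```python
def sample_critical(a, b, lam_lo, lam_hi, sample_inversion):
    """Sample a uniformly random element of C intersected with [lam_lo, lam_hi].
    sample_inversion(c, d) is an assumed helper returning a uniformly random
    inversion (i, j) of d w.r.t. c, as in Claim A.2."""
    c = [a[k] + lam_lo * b[k] for k in range(len(a))]
    d = [a[k] + lam_hi * b[k] for k in range(len(a))]
    i, j = sample_inversion(c, d)
    # The crossing point of results i and j is the sampled critical value.
    return (a[i] - a[j]) / (b[j] - b[i])
```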

Suppose \phi(0)=1 and \phi(m)=-1 (otherwise we are done), where \phi is the function defined in Equation (7). We want to find an i such that \phi(i)=0, or \phi(i)=1 and \phi(i+1)=-1. Set \lambda_{\ell}=0,\lambda_{u}=\infty. From Claim A.3, we can sample a uniformly random \lambda\in C\cap[\lambda_{\ell},\lambda_{u}] in O(n\log n) time. Suppose \lambda=\lambda_{i}; then we can find \lambda_{i+1} and g_{i} as shown in Claim A.1 in O(n\log n) time. Therefore we can evaluate \phi(i) in O(n\log n) time. We then continue the binary search based on the value of \phi(i), updating either the lower bound \lambda_{\ell}=\lambda or the upper bound \lambda_{u}=\lambda. In each iteration, the random \lambda\in C\cap[\lambda_{\ell},\lambda_{u}] eliminates a constant fraction of the points in C\cap[\lambda_{\ell},\lambda_{u}], i.e., the size of C\cap[\lambda_{\ell},\lambda_{u}] shrinks by a constant factor in expectation. Therefore the algorithm ends in O(\log n) iterations with high probability. In fact, we can stop the sampling process once the size of C\cap[\lambda_{\ell},\lambda_{u}] becomes O(n), and then do a regular binary search by listing the remaining critical values. Since the running time of each iteration is O(n\log n), the total running time is O(n\log^{2}n).
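Putting the pieces together, the randomized search might be organized as in the sketch below, where count_critical, sample_critical and phi_at are assumed oracles implementing Claims A.2, A.3 and A.1 respectively (with \mathbf{a} and \mathbf{b} captured by these callables). This is an illustrative outline rather than the paper's implementation; the final listing-and-binary-search step over the surviving O(n) critical values is omitted.

```python
import math

def randomized_search(count_critical, sample_critical, phi_at, n):
    """Shrink [lam_lo, lam_hi] by probing random critical values until phi_at
    reports 0 or only O(n) critical values remain; the survivors would then be
    listed and binary-searched as in the deterministic algorithm."""
    lam_lo, lam_hi = 0.0, math.inf
    while count_critical(lam_lo, lam_hi) > 4 * n:
        lam = sample_critical(lam_lo, lam_hi)
        s = phi_at(lam)  # phi evaluated for the region that starts at lam
        if s == 0:
            return lam, lam
        if s == +1:
            lam_lo = lam  # the sign change lies to the right
        else:
            lam_hi = lam  # the sign change lies to the left
    return lam_lo, lam_hi  # at most O(n) critical values remain in this range
```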