A survey on making skylines more flexible

Cem Cebeci Politecnico di Milano
Milan, Italy
cem.cebeci@mail.polimi.it
Abstract

Top-$k$ queries and skylines are the two most common approaches to finding the most interesting entries in a homogeneous multi-dimensional dataset. However, both strategies have shortcomings. Top-$k$ queries are very challenging to specify precisely, and skylines are not customizable to specific scenarios, on top of having unpredictable output cardinalities. We describe some alternative methods aimed at addressing the shortcomings of top-$k$ queries and skylines and compare all approaches to illustrate which of the desired properties each of them possesses.

Keywords: top-$k$ query, skyline, multi-dimensional optimization

1 Introduction

Selecting the most interesting entries from a multi-dimensional dataset is an important operation in today’s recommender systems [8][15]. In order to perform this task, the two most commonly used approaches are top-$k$ queries [7] and skylines [1].

Given a multi-dimensional dataset $R$ and a scoring function $f: R \rightarrow \mathbb{R}$, the top-$k$ query on $R$ with $f$ returns the $k$ tuples $r_1, \ldots, r_k \in R$ for which $f(r_i)$ is no larger than the score of any tuple not included in the query result. Top-$k$ queries are expressive and precise, and fast algorithms [5][6] exist to compute them. However, defining "most interesting" as a function of dataset attributes proves to be a significant challenge, even though techniques such as crowdsourcing [4] provide some tangible results. Additionally, depending on the dataset, query results may differ significantly under slight alterations of $f$, which poses a considerable problem given the natural inaccuracy in determining $f$.
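The definition above, and the sensitivity of the result to the choice of $f$, can be illustrated with a minimal Python sketch. The dataset `hotels`, the helper names `weighted_sum` and `top_k`, and the weight vectors are hypothetical choices made for illustration; lower scores are assumed to be better, as in the definition.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Row = Tuple[float, ...]

def weighted_sum(weights: Sequence[float]) -> Callable[[Row], float]:
    """Build a linear scoring function f(r) = sum_i w_i * r_i."""
    return lambda r: sum(w * x for w, x in zip(weights, r))

def top_k(dataset: Sequence[Row], score: Callable[[Row], float], k: int) -> List[Row]:
    """Return the k tuples with the smallest scores, best first."""
    return heapq.nsmallest(k, dataset, key=score)

# Hypothetical two-attribute data, e.g. (price, distance); lower is better.
hotels = [(1.0, 6.0), (2.0, 3.0), (4.0, 1.0), (5.0, 5.0)]

# Two different weight vectors lead to two different top-1 answers.
price_focused = top_k(hotels, weighted_sum((0.8, 0.2)), k=1)
distance_focused = top_k(hotels, weighted_sum((0.2, 0.8)), k=1)
```

In this toy dataset, shifting the weight from the first attribute to the second changes the top-1 tuple, which is the kind of dependence on $f$ the paragraph above refers to.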

On the other hand, a skyline query returns the tuples in a dataset that are not Pareto dominated. In other words, it picks the tuples $r$ for which there is no other tuple $t$ such that $t$ is no worse than $r$ on all attributes and better than $r$ on at least one attribute. It is worth noting that every tuple in the skyline is the top-1 result for some monotone scoring function. The notion of a skyline can be extended to a $k$-skyband: the tuples that have fewer than $k$ tuples Pareto dominating them or, equivalently, the tuples that appear in the top-$k$ result for some scoring function.
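As a minimal sketch of these definitions, the following Python code checks Pareto dominance by direct pairwise comparison and derives the skyline and the $k$-skyband from it. The function names and the sample data are hypothetical, lower attribute values are assumed to be better, and the quadratic scan is chosen for clarity rather than efficiency.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, ...]

def dominates(t: Row, r: Row) -> bool:
    """True if t is no worse than r everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(t, r)) and any(a < b for a, b in zip(t, r))

def skyband(dataset: Sequence[Row], k: int) -> List[Row]:
    """Tuples dominated by fewer than k other tuples."""
    return [r for r in dataset if sum(dominates(t, r) for t in dataset) < k]

def skyline(dataset: Sequence[Row]) -> List[Row]:
    """The skyline is the 1-skyband: tuples dominated by no other tuple."""
    return skyband(dataset, 1)

# Hypothetical data; (3, 4) is dominated only by (2, 3), (5, 5) by three tuples.
data = [(1, 6), (2, 3), (4, 1), (5, 5), (3, 4)]
```

Here `skyline(data)` keeps the three mutually incomparable tuples, while `skyband(data, 2)` additionally admits `(3, 4)`, which has exactly one dominator.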

Skylines do not suffer from the same challenge as top-$k$ queries, since they require no scoring function to be specified. However, for the same reason, they are not customizable in any way. There is no way to alter a skyline query to fit a user’s preferences, which is crucial in recommender systems. Another shortcoming of skyline queries is that their output cardinality is unbounded. In fact, for datasets with high dimensionality, the skyline may comprise the whole dataset [10], which renders the skyline operator useless. In general, a high number of results is undesirable in skyline queries.

To address the shortcomings of skylines and top-$k$ queries, multiple alternative operators have been introduced. In this paper, we describe some of those proposed solutions and investigate how they relate to each other in Section 2. We compare the different approaches in Section 3 and conclude in Section 4.

2 Proposed Solutions

2.1 Flexible Skylines

As mentioned before, one of the major shortcomings of skylines is the lack of customization. The skyline of a dataset does not depend on any kind of parameter, only on the dataset itself. As a result, the skyline query cannot be customized in any way. $F$-skylines are a generalization of the skyline that introduces a parameter to the query while compromising very little of the simplicity of skyline queries [3].

Before we define what an $F$-skyline is, let us define $F$-domination. A tuple $r_1$ $F$-dominates another tuple $r_2$ if and only if $r_1$ is at least as good as $r_2$ for all scoring functions in $F$ and better than $r_2$ for at least one scoring function in $F$. In formal terms:

$$r_1 \prec_F r_2 \iff \forall f \in F.\ f(r_1) \leq f(r_2) \ \land\ \exists f \in F.\ f(r_1) < f(r_2)$$

If $F$ is the set of all monotonic functions, $F$-domination reduces to Pareto domination.

Given a dataset $R$ and a set of monotone scoring functions $F$, the two flexible skyline operators are defined as follows:

$$ND(R,F) := \{r \in R \mid \nexists t \in R.\ t \prec_F r\}$$
$$PO(R,F) := \{r \in R \mid \exists f \in F.\ \forall r' \in R.\ f(r) \leq f(r')\}$$

The first operator, $ND$, denotes all non-$F$-dominated tuples of $R$, and the second operator, $PO$, denotes the tuples in $R$ that are optimal for at least one of the functions in $F$. When $F$ is the set of all monotonic functions, both $ND(R,F)$ and $PO(R,F)$ coincide with the skyline of $R$. However, in the general case, they are not necessarily equal [3].

$ND$ and $PO$ are both customizable since they have an additional parameter, $F$, that enables customized queries. In addition, they partially address the other shortcoming of skylines: they lower the output cardinality for non-universal choices of $F$. One can show $PO(R,F) \subseteq ND(R,F) \subseteq SKY(R)$ for all choices of $R$ and $F$, which guarantees that neither $ND$ nor $PO$ has a greater cardinality than the skyline. Experimental results [3] show that the cardinality of $ND$ is dramatically lower than that of the skyline even with only a few constraints on $F$, and that the cardinality of $PO$ is lower still.
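The two operators and the containment between them can be illustrated with a small Python sketch. It assumes, purely for illustration, that $F$ is a finite list of scoring functions; in [3] $F$ is typically an infinite family described by constraints, which requires dedicated algorithms not shown here. The dataset and the two weighted sums are hypothetical and chosen so that no two tuples tie on any score.

```python
from typing import Callable, List, Sequence, Tuple

Row = Tuple[float, ...]
Score = Callable[[Row], float]

def f_dominates(r1: Row, r2: Row, F: Sequence[Score]) -> bool:
    """r1 F-dominates r2: no worse under every f in F, strictly better under one."""
    pairs = [(f(r1), f(r2)) for f in F]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)

def nd(R: Sequence[Row], F: Sequence[Score]) -> List[Row]:
    """Non-F-dominated tuples of R."""
    return [r for r in R if not any(f_dominates(t, r, F) for t in R)]

def po(R: Sequence[Row], F: Sequence[Score]) -> List[Row]:
    """Tuples of R that are optimal (minimal score) for at least one f in F."""
    return [r for r in R if any(all(f(r) <= f(t) for t in R) for f in F)]

# Hypothetical finite family: two weighted sums that both favour attribute 1.
F = [lambda r: 2 * r[0] + r[1], lambda r: 4 * r[0] + r[1]]
R = [(10, 60), (20, 30), (40, 10), (50, 50), (14, 51)]
```

For this data the Pareto skyline contains four tuples, `nd(R, F)` keeps three of them, and `po(R, F)` keeps two, so the chain $PO \subseteq ND \subseteq SKY$ is strict at both steps in this example.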

Customizability and control over result cardinality, the two properties flexible skylines provide, are already available in top-$k$ queries, although flexible skylines do not require precisely defined scoring functions as top-$k$ queries do. As an example, consider a simplified universe of functions comprising only weighted sums over the attributes with normalized weights, that is, functions of the form $f(r) = \sum_i w_i r_i$ where $r_i$ denotes an attribute in the schema of $R$. A top-$k$ query would need to specify a precise scoring function (or equivalently, a precise vector $w$), but a flexible skyline allows less precise queries such as $F = \{f(r) = w_1 r_1 + w_2 r_2 \mid w_1 > 2 w_2\}$. Such constraints are significantly easier to obtain [12]. For instance, if a user prefers the tuple $(2,3)$ to $(1,6)$, we can argue:

$$2w_1 + 3w_2 < w_1 + 6w_2, \quad \text{i.e.,} \quad w_1 < 3w_2$$
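The same derivation can be mechanized. The sketch below, with hypothetical helper names, turns one pairwise preference into the coefficients of a linear constraint on the weights and checks whether a given weight vector satisfies it; lower weighted sums are assumed to be better.

```python
from typing import Sequence, Tuple

def preference_constraint(preferred: Sequence[float],
                          other: Sequence[float]) -> Tuple[float, ...]:
    """Coefficients c such that 'preferred beats other' means sum_i c_i * w_i < 0."""
    return tuple(p - o for p, o in zip(preferred, other))

def satisfies(weights: Sequence[float], coefficients: Sequence[float]) -> bool:
    """Check whether a weight vector is consistent with the stated preference."""
    return sum(c * w for c, w in zip(coefficients, weights)) < 0

# Preferring (2, 3) to (1, 6) gives w1 - 3*w2 < 0, i.e. w1 < 3*w2.
c = preference_constraint((2, 3), (1, 6))
```

With these coefficients, the weights $(0.7, 0.3)$ are consistent with the preference, while $(0.8, 0.2)$ are not, since $0.8 \geq 3 \cdot 0.2$.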

Flexible skylines merge the idea of scoring functions of top-$k$ queries with traditional skylines. This is further illustrated by the fact that choosing $F$ to be the set of all monotonic functions reduces both operators to the traditional skyline, but choosing it to contain a single monotonic function reduces them to the top-1 query result. By combining the two approaches, flexible skylines are both easy to specify and customizable at the same time.

The two flexible operators are discussed only for skylines here for brevity, but [11] and [2] extend this notion to $k$-skybands to provide a more generalized framework.

2.2 Output Size Specified (OSS) operators

Another approach [10] to the same problem introduces the notion of $\rho$-dominance and two operators that use this relation to provide types of queries that are not as precise as top-$k$ queries but still offer query customization and a smaller output cardinality than the skyline. In fact, these operators allow specifying the cardinality of their output as a parameter.

These operators only use weighted sums over attributes with normalized preferences as scoring functions. They define a new kind of dominance, much like $F$-dominance. Given an estimate preference vector $w$ and tuples $r_1, r_2$, $r_1$ $\rho$-dominates $r_2$ if the following conditions are satisfied:

  1. $\forall v \in \Delta^{d-1}.\ |w - v| \leq \rho \implies \sum_i v_i {r_1}_i \leq \sum_i v_i {r_2}_i$

  2. $\exists v \in \Delta^{d-1}.\ |w - v| \leq \rho \ \land\ \sum_i v_i {r_1}_i < \sum_i v_i {r_2}_i$

In other words, they relax the normalized preference vector $w$ up to a distance of $\rho$ and check dominance for every preference vector inside the resulting hypersphere. If $r_1$ performs at least as well as $r_2$ for all such vectors and better than $r_2$ for at least one, $r_1$ $\rho$-dominates $r_2$, much like the concept of domination in flexible skylines.
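For two attributes, the two conditions can be checked exactly with very little code. The sketch below assumes $d = 2$ and a Euclidean distance $|w - v|$ (the text above does not fix the norm), writes every preference vector as $(t, 1 - t)$, and uses the fact that the score difference is linear in $t$, so it is enough to inspect the two endpoints of the admissible interval. All function names and the sample data are hypothetical.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Row = Tuple[float, float]

def rho_interval(w1: float, rho: float) -> Tuple[float, float]:
    """Admissible values of t for v = (t, 1 - t) with |w - v| <= rho.

    For w = (w1, 1 - w1) the Euclidean distance is sqrt(2) * |t - w1|.
    """
    half = rho / sqrt(2)
    return max(0.0, w1 - half), min(1.0, w1 + half)

def rho_dominates(r1: Row, r2: Row, w1: float, rho: float) -> bool:
    """Exact rho-dominance test for d = 2 (lower scores are better)."""
    lo, hi = rho_interval(w1, rho)

    def gap(t: float) -> float:
        # Score of r1 minus score of r2 under the vector (t, 1 - t).
        return t * (r1[0] - r2[0]) + (1 - t) * (r1[1] - r2[1])

    g_lo, g_hi = gap(lo), gap(hi)
    # Linear in t: never worse iff both endpoints <= 0, better somewhere iff one < 0.
    return max(g_lo, g_hi) <= 0 and min(g_lo, g_hi) < 0

def non_rho_dominated(R: Sequence[Row], w1: float, rho: float) -> List[Row]:
    """Tuples of R that no other tuple rho-dominates."""
    return [r for r in R if not any(rho_dominates(t, r, w1, rho) for t in R)]

R = [(1, 6), (2, 3), (4, 1), (5, 5)]
```

With $w = (0.5, 0.5)$, a small radius such as $\rho = 0.1$ leaves only two non-dominated tuples, while $\rho = 0.5$ already recovers the whole Pareto skyline of this dataset: a larger $\rho$ makes dominance harder to establish and therefore enlarges the result.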

The first output size specified operator is called $ORD$ because it is Output size specified, has Relaxed input and is Dominance-oriented. Given an estimate preference vector $w$, an output size $m$ and a dataset $R$, $ORD(R,w,m)$ denotes the $m$ tuples in $R$ that are not $\rho$-dominated for the minimum value of $\rho$ that leaves $m$ tuples non-dominated.

The second output size specified operator is $ORU$, again because it is Output size specified, has Relaxed input and is Utility-oriented. $ORU(R,w,m)$ denotes $m$ tuples in $R$ that are optimal for at least one preference vector within a distance of $\rho$ from $w$.

Both of these operators can be further generalized by relaxing $ORD$ to contain tuples that are dominated by fewer than $k$ others and relaxing $ORU$ to contain tuples that are in the top-$k$ result for at least one vector.

This approach is very similar to flexible skylines. For any preference vector $v$ and distance $\rho$, the weighted sums whose weight vectors lie inside the hypersphere centered at $v$ with radius $\rho$ specify a set of functions $F$. For any dataset $R$, the non-$\rho$-dominated tuples in $R$ are then precisely $ND(R,F)$, and the tuples that are optimal for at least one vector are precisely $PO(R,F)$.

2.3 UTK queries

Another parallel method that addresses the problem of relaxing the input preference vector is uncertain top-$k$ (UTK) queries [11]. Similar to OSS operators, these queries only consider weighted sums over the attributes as scoring functions. Likewise, each UTK query operates on a set of possible preference vectors rather than a single vector, and these possible preference vectors form a convex polytope in the $(d-1)$-dimensional simplex. Instead of specifying the region of possible vectors via a central vector and a radius, UTK queries allow the region to be specified as a parameter on its own.

[11] defines two different UTK operators, $UTK_1$ and $UTK_2$. Given a $d$-dimensional dataset $R$, a desired output cardinality $k$ and a preference region $P \subseteq \Delta^{d-1}$, $UTK_1(k,R,P)$ contains the tuples in $R$ that are in the top-$k$ result for some preference vector in $P$, and $UTK_2(k,R,P)$ partitions $P$ so that the vectors in each partition produce the same top-$k$ result and labels the partitions with the top-$k$ results they correspond to.
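To make the two operators concrete, the following sketch handles the special case $d = 2$, where a preference vector is $(t, 1 - t)$ and the region $P$ is an interval $[lo, hi]$ of values of $t$ with $lo < hi$. It is not the algorithm of [11]: it simply relies on the observation that the top-$k$ set can only change where the scores of two tuples cross, and evaluates one vector between each pair of consecutive crossings. The function names and data are hypothetical, and ties that fall exactly on an endpoint of $P$ are ignored.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Set, Tuple

Row = Tuple[float, float]
Partition = Tuple[float, float, FrozenSet[Row]]

def score(r: Row, t: float) -> float:
    """Weighted sum with preference vector (t, 1 - t); lower is better."""
    return t * r[0] + (1 - t) * r[1]

def utk_2d(R: Sequence[Row], k: int, lo: float, hi: float) -> Tuple[Set[Row], List[Partition]]:
    """Return (UTK1-style tuple set, UTK2-style partition of [lo, hi])."""
    cuts = {lo, hi}
    for a, b in combinations(R, 2):
        d0, d1 = a[0] - b[0], a[1] - b[1]
        slope = d0 - d1  # score(a, t) - score(b, t) = slope * t + d1
        if slope != 0:
            t = -d1 / slope  # value of t at which a and b have equal scores
            if lo < t < hi:
                cuts.add(t)
    ordered = sorted(cuts)
    partitions: List[Partition] = []
    for left, right in zip(ordered, ordered[1:]):
        mid = (left + right) / 2  # no score crossings strictly inside (left, right)
        top = frozenset(sorted(R, key=lambda r: score(r, mid))[:k])
        if partitions and partitions[-1][2] == top:
            # Same top-k set as the previous piece: merge the two intervals.
            partitions[-1] = (partitions[-1][0], right, top)
        else:
            partitions.append((left, right, top))
    utk1: Set[Row] = set()
    for _, _, top in partitions:
        utk1 |= top
    return utk1, partitions

R = [(1, 6), (2, 3), (4, 1), (5, 5)]
tuples_wide, parts_wide = utk_2d(R, 1, 0.2, 0.8)
tuples_narrow, _ = utk_2d(R, 1, 0.3, 0.45)
```

On this data the wide region yields three possible top-1 tuples spread over three partitions, while the narrow region pins the answer down to a single tuple, which shows how the shape of $P$ controls the uncertainty in the result.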

UTK queries can express a larger family of scoring functions than OSS operators since every $(d-1)$-dimensional hypersphere is a $(d-1)$-dimensional polytope but the converse is not true. However, flexible skylines can express an even larger family of scoring functions since UTK is still limited to functions that are linear in the tuple attributes.

The sets of tuples computed by $UTK_1$ queries correspond to the tuples computed by the $PO$ and $ORU$ operators for matching inputs. Since UTK queries only check optimality and not domination, $ND$ and $ORD$ cannot be computed by UTK queries.

2.4 $\epsilon$-skylines

$\epsilon$-skylines [14] are one of the earlier approaches to addressing the shortcomings of top-$k$ and skyline queries. Similar to OSS operators, $\epsilon$-skylines only consider weighted sums as scoring functions. They use the notion of $\epsilon$-dominance, which, given a dataset $R$, a normalized preference vector $w$ and a constant $\epsilon \in [-1,1]$, is defined as:

$$r_1 \prec_\epsilon r_2 \iff (\forall i.\ w_i {r_1}_i \leq w_i {r_2}_i + \epsilon) \ \land\ (\exists i.\ {r_1}_i < {r_2}_i)$$

This is a relaxed version of Pareto dominance in which the dominant tuple is allowed to be worse in some attributes, up to an additive constant $\epsilon$. In fact, when we pick $\epsilon = 0$, $\epsilon$-dominance reduces to Pareto dominance.
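The definition translates directly into code. The sketch below is a naive quadratic implementation with hypothetical function names and data; it assumes attribute values normalized to $[0,1]$ and lower values being better, which is what makes the range $\epsilon \in [-1,1]$ meaningful.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, ...]

def eps_dominates(r1: Row, r2: Row, w: Sequence[float], eps: float) -> bool:
    """r1 eps-dominates r2 under weights w (lower attribute values are better)."""
    within_slack = all(wi * a <= wi * b + eps for wi, a, b in zip(w, r1, r2))
    strictly_better = any(a < b for a, b in zip(r1, r2))
    return within_slack and strictly_better

def eps_skyline(R: Sequence[Row], w: Sequence[float], eps: float) -> List[Row]:
    """Tuples of R that are not eps-dominated by any other tuple."""
    return [r for r in R if not any(eps_dominates(t, r, w, eps) for t in R)]

# Hypothetical data with attributes normalized to [0, 1].
R = [(0.1, 0.6), (0.2, 0.3), (0.4, 0.1), (0.5, 0.5)]
w = (0.5, 0.5)
```

For this data, `eps_skyline(R, w, 0.0)` is the ordinary Pareto skyline, `eps_skyline(R, w, 1.0)` is empty, and `eps_skyline(R, w, -1.0)` returns the whole dataset, which illustrates how $\epsilon$ moves the result size in either direction.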

The $\epsilon$-skyline of a dataset for a preference vector $w$ and constant $\epsilon$ is simply the set of non-$\epsilon$-dominated tuples. Unlike traditional skylines, $\epsilon$-skylines are customizable in both preferences and output cardinality. Since $\epsilon$ is scaled against the weights $w_i$ while checking dominance, changing $w$ enables the operator to respond to the preferences of a user. The impact of $\epsilon$ on dominance decreases as the weight $w_i$ of an attribute increases.

Regarding the output cardinality, $\epsilon$-skylines do not provide a precise number but allow controlling the cardinality by changing $\epsilon$. For positive $\epsilon$, every tuple’s $\epsilon$-dominance region is larger than its Pareto dominance region, resulting in a smaller number of non-dominated tuples. Conversely, a negative $\epsilon$ shrinks the $\epsilon$-dominance regions and results in a larger number of non-dominated tuples. In fact, for datasets whose traditional skyline contains at least two tuples, $\epsilon = -1$ produces the whole dataset and $\epsilon = 1$ produces the empty set [14].

2.5 Representative Skylines

The approaches described so far have focused on customizing the query to include user preferences. Representative skylines [9][13] take an orthogonal approach: they reduce the cardinality of the skyline directly by picking $k$ representative tuples from the skyline. The original implementation [9] picks the tuples that maximize the number of non-skyline tuples dominated by the picked tuples. A variant [13] of this implementation picks the tuples that minimize the maximum distance between a non-picked skyline tuple and its nearest picked tuple.
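Both selection criteria can be sketched with an exhaustive search over all size-$k$ subsets of the skyline. This brute-force formulation is only meant to make the two objectives explicit and is feasible for small skylines; it is not the algorithm of [9] or [13]. The function names and the dataset are hypothetical, Euclidean distance is assumed, and $k$ is assumed not to exceed the skyline size.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Tuple

Row = Tuple[float, ...]

def dominates(t: Row, r: Row) -> bool:
    """Pareto dominance with lower values being better."""
    return all(a <= b for a, b in zip(t, r)) and any(a < b for a, b in zip(t, r))

def skyline(R: Sequence[Row]) -> List[Row]:
    return [r for r in R if not any(dominates(t, r) for t in R)]

def max_dominance_representatives(R: Sequence[Row], k: int) -> Tuple[Row, ...]:
    """Pick k skyline tuples that together dominate the most non-skyline tuples."""
    sky = skyline(R)
    rest = [r for r in R if r not in sky]

    def covered(picked: Tuple[Row, ...]) -> int:
        return sum(any(dominates(p, r) for p in picked) for r in rest)

    return max(combinations(sky, k), key=covered)

def distance_representatives(R: Sequence[Row], k: int) -> Tuple[Row, ...]:
    """Pick k skyline tuples minimizing the largest distance from any
    skyline tuple to its nearest picked tuple."""
    sky = skyline(R)

    def worst_gap(picked: Tuple[Row, ...]) -> float:
        return max(min(dist(s, p) for p in picked) for s in sky)

    return min(combinations(sky, k), key=worst_gap)

# Hypothetical data: a central cluster of skyline tuples plus two extremes.
R = [(0, 24), (8, 12), (10, 10), (12, 8), (24, 0),   # skyline
     (8, 14), (10, 11), (14, 8)]                      # dominated tuples
```

On this data the dominance-based criterion with $k = 3$ selects the three clustered tuples, whereas the distance-based criterion selects the two extremes and the cluster center, so the two objectives can disagree substantially on the same skyline.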

Figure 1: An example skyline illustrating drawbacks

Both versions have some drawbacks [13]. Consider the example in Figure 1, where the skyline of the given dataset is $\{r_1, r_2, r_4, r_6, r_8, r_9\}$. If we optimize the number of dominated non-skyline tuples and pick $k = 3$, we get $\{r_4, r_6, r_8\}$. These three tuples are very similar in the trade-off they offer between attributes $a_1$ and $a_2$. This subset fails to represent the overall skyline since it does not include any other type of trade-off in the skyline and focuses on a small cluster of tuples. On the other hand, minimizing the maximum distance gives $\{r_1, r_4, r_9\}$. This selection presents other trade-offs much better, but it fails to capture that most of the tuples in the skyline have a trade-off similar to that of $r_4$.

Representative skylines have the benefit of requiring no user input. They provide a way to limit output cardinality, with different diversification techniques, without compromising the simplicity of skyline queries. On the other hand, since they admit no input other than the dataset itself, they are not customizable save for the output cardinality.

3 Comparison

In this section, we discuss how the various proposed solutions compare to each other, as well as to traditional top-$k$ queries and skylines. We analyze which problems each solution can address by inspecting four properties: input flexibility, customizability, control over cardinality and ranked output. These properties are discussed one by one in the following subsections. The discussion is summarized in Table 1.

| Approach | Input Flexibility | Customizability | Cardinality Control | Ranked Output |
|---|---|---|---|---|
| Top-$k$ queries | No | Yes | Yes | Yes |
| Traditional Skylines | Yes | No | No | No |
| Flexible Skylines | Yes | Yes | Partial | No |
| OSS Operators | Yes | Partial | Yes | No |
| UTK Queries | Yes | Partial | Yes | No |
| $\epsilon$-skylines | Yes | Partial | Partial | No |
| Representative Skylines | Yes | No | Yes | Yes |

Table 1: Properties of different approaches

3.1 Input flexibility

A query’s input is said to be flexible if it does not need to be precise and if small changes in the input do not cause significant changes in the output. Traditional skylines and representative skylines are the obvious superior choices for input flexibility since they have no input but the dataset itself. Thus, no additional parameter needs to be defined to make these queries. $\epsilon$-skylines are a bit less flexible: they require a value for $\epsilon$ and a weight vector, but the weight vector does not affect the output as directly as it does in some other approaches. OSS operators, flexible skylines and UTK queries are somewhat less flexible because they require some specifics about either the estimate preference weights or the family of scoring functions. Finally, top-$k$ queries are easily the least flexible of the queries we consider: they require a precisely defined scoring function, which affects the output dramatically.

3.2 Customizability

A query is said to be customizable if it can adapt to the specific requirements of a user or a scenario. Traditional skylines are not customizable in any way: they simply return the same set of tuples for every scenario. Representative skylines, however, are mildly customizable since the output cardinality can be controlled. $\epsilon$-skylines, OSS operators and UTK queries have noticeably higher customizability since they allow weight vectors to represent preferences over the attributes, though they only allow preferences to be expressed in the form of weighted sums. Top-$k$ queries and flexible skylines are fully customizable: they can be configured to return any tuple in the skyline.

3.3 Control over cardinality

The approaches are clearly divided into three groups on this property. Traditional skyline queries offer no control over the output cardinality. $\epsilon$-skylines and flexible skylines provide methods to increase or decrease the cardinality but offer no precise control. Lastly, top-$k$ queries, representative skylines, UTK queries and OSS operators have a parameter to specify the output cardinality.

3.4 Ranked output

Top-$k$ queries and representative skylines provide a ranking between the tuples they return. None of the other approaches have this property.

4 Conclusion

In this paper, we mentioned the commonly used strategies to select the most interesting tuples in homogeneous multi-dimensional datasets together with their well-known shortcomings. Then, we described some strategies that aim to address the shortcomings of the conventional methods: flexible skylines, output size specified operators, UTK queries, $\epsilon$-skylines and representative skylines. Finally, we compared these proposed solutions with each other and the traditional strategies, pointing out which issues each one can address.

Future work on flexible skylines could include estimating the output cardinality or limiting it by construction similar to what output size specified operators do. Discovering other families of scoring functions that are representative of users’ preferences and can be efficiently computed by flexible skylines could also be worthwhile research. Ranking the items contained in flexible skylines and computing the skyline tuple-by-tuple could be other directions to improve on this notion.

References

  • [1] Stephan Börzsönyi, Donald Kossmann, and Konrad Stocker. The skyline operator. Proceedings 17th International Conference on Data Engineering, pages 421–430, 2001.
  • [2] Paolo Ciaccia and Davide Martinenghi. FA + TA < FSA: Flexible score aggregation. CIKM ’18, page 57–66, New York, NY, USA, 2018. Association for Computing Machinery.
  • [3] Paolo Ciaccia and Davide Martinenghi. Flexible skylines: Dominance for arbitrary sets of monotone functions. ACM Trans. Database Syst., 45(4), dec 2020.
  • [4] Eleonora Ciceri, Piero Fraternali, Davide Martinenghi, and Marco Tagliasacchi. Crowdsourcing for top-k query processing over uncertain data. IEEE Transactions on Knowledge and Data Engineering, 28(1):41–53, 2016.
  • [5] Ronald Fagin. Combining fuzzy information from multiple systems (extended abstract). In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’96, page 216–226, New York, NY, USA, 1996. Association for Computing Machinery.
  • [6] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01, page 102–113, New York, NY, USA, 2001. Association for Computing Machinery.
  • [7] Ihab F. Ilyas, George Beskales, and Mohamed A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv., 40(4), oct 2008.
  • [8] Shuhei Kishida, Seiji Ueda, Atsushi Keyaki, and Jun Miyazaki. Skyline-based recommendation considering user preferences. In Lei Chen, Christian S. Jensen, Cyrus Shahabi, Xiaochun Yang, and Xiang Lian, editors, Web and Big Data, pages 133–141, Cham, 2017. Springer International Publishing.
  • [9] Xuemin Lin, Yidong Yuan, Qing Zhang, and Ying Zhang. Selecting stars: The k most representative skyline operator. In 2007 IEEE 23rd International Conference on Data Engineering, pages 86–95, 2007.
  • [10] Kyriakos Mouratidis, Keming Li, and Bo Tang. Marrying Top-k with Skyline Queries: Relaxing the Preference Input While Producing Output of Controllable Size, page 1317–1330. Association for Computing Machinery, New York, NY, USA, 2021.
  • [11] Kyriakos Mouratidis and Bo Tang. Exact processing of uncertain top-k queries in multi-criteria settings. Proc. VLDB Endow., 11(8):866–879, apr 2018.
  • [12] Li Qian, Jinyang Gao, and H. V. Jagadish. Learning user preferences by adaptive pairwise comparison. Proc. VLDB Endow., 8(11):1322–1333, jul 2015.
  • [13] Yufei Tao, Ling Ding, Xuemin Lin, and Jian Pei. Distance-based representative skyline. In 2009 IEEE 25th International Conference on Data Engineering, pages 892–903, 2009.
  • [14] Tian Xia, Donghui Zhang, and Yufei Tao. On skylining with flexible dominance relation. In 2008 IEEE 24th International Conference on Data Engineering, pages 1397–1399, 2008.
  • [15] Xiwang Yang, Harald Steck, Yang Guo, and Yong Liu. On top-k recommendation using social networks. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys ’12, page 67–74, New York, NY, USA, 2012. Association for Computing Machinery.