
Integration of Skyline Queries into Spark SQL

Lukas Grasmann (lukas.grasmann@tuwien.ac.at), Reinhard Pichler (reinhard.pichler@tuwien.ac.at), and Alexander Selzer (alexander.selzer@tuwien.ac.at), TU Wien, Vienna, Austria
Abstract.

Skyline queries are frequently used in data analytics and multi-criteria decision support applications to filter relevant information from large amounts of data. Apache Spark is a popular framework for processing large, distributed datasets. The framework even provides a convenient SQL-like interface via the Spark SQL module. However, skyline queries are not natively supported and require tedious rewriting to fit the SQL standard or Spark's SQL-like language.

The goal of our work is to fill this gap. We thus provide a full-fledged integration of the skyline operator into Spark SQL. This allows for a simple, easy-to-use syntax to formulate skyline queries. Moreover, our empirical results show that this integrated solution by far outperforms a solution based on rewriting into standard SQL.

Copyright: rights retained. ISBN: 978-3-89318-088-2. 26th International Conference on Extending Database Technology (EDBT), 28-31 March 2023, Ioannina, Greece.

1. Introduction

Skyline queries are an important tool in data analytics and decision making to find interesting points in a multi-dimensional dataset. Given a set $P$ of data points, we can compute the skyline (also called Pareto front) using the skyline operator (Börzsönyi et al., 2001). Intuitively, a data point $p \in P$ belongs to the skyline if it is not dominated by any other point. For a given set of dimensions, a data point $q$ dominates $p$ (denoted $q \prec p$) if $q$ is better in at least one dimension while being at least as good in every other dimension. More formally, the skyline $S$ is obtained as $S = \{p \in P \mid \nexists q \in P : q \prec p\}$.

Figure 1. Example skyline for hotels

A common example for skyline queries in the literature can be found in Figure 1. Suppose that we want to find the perfect hotel for our holiday stay. Each hotel has multiple properties that express its quality such as the price per night or the distance to the beach among others. In this example, we limit ourselves to the price per night (y-axis) and the user rating (x-axis) for simplicity. Such attributes relevant to the skyline are called skyline dimensions. A hotel dominates another hotel if it is strictly better in at least one skyline dimension and not worse in all others. For the price per night, minimization is desirable while we want the rating to be as high as possible (maximization). Figure 1 depicts the skyline of the hotels according to these two dimensions. In this simple (two-dimensional) example, it is immediately visible that for each hotel not part of the skyline there exists a “better” (dominating) alternative. In practice, things get more complex as skyline queries usually handle a larger number of skyline dimensions.

Skyline queries have a wide range of applications, including recommender systems to propose suitable items to the user (Börzsönyi et al., 2001) and location-based systems to find “the best” routes (Kriegel et al., 2010; Huang and Jensen, 2004). Skyline queries have also been used to improve the quality of (web) services (Alrifai et al., 2010) and they have applications in privacy (Chen et al., 2007), authentication (Lin et al., 2011), and DNA searching (Chen and Lian, 2008). Many further use cases for skyline queries are given in these surveys: (Kalyvas and Tzouramanis, 2017; Hose and Vlachou, 2012; Chomicki et al., 2013). Skyline queries are particularly useful when dealing with big datasets and filtering the relevant data items. Indeed, in data science and data analytics, we are typically confronted with big data, which is stored in an optimized and distributed manner. We thus need a processing engine that is capable of efficiently operating on such data.

Apache Spark (Zaharia et al., 2016) is a unified framework that is specifically designed for distributed processing of potentially big data and has gained huge popularity in recent years. It works for a wide range of applications, and it can interoperate with a great variety of data sources. Conceptually, it is similar to the MapReduce framework, where instead of map-combine-reduce nodes, there are individual nodes in an execution plan for each processing step. Spark also provides a multitude of modular extensions such as MLlib, PySpark, SparkR, and others, built on top of its core functionality. Another important extension is Spark SQL, which provides relational query functionality and comes with the powerful Catalyst optimizer.

Hence, Apache Spark would be the perfect fit for executing skyline queries over big, distributed data. Despite this, to date, the skyline operator has not been integrated into Spark SQL. Nevertheless, it is possible to formulate skyline queries using “plain” SQL. Listing 2 shows the “plain” skyline query for our hotel example.

SELECT price, user_rating FROM hotels AS o WHERE NOT EXISTS(
    SELECT * FROM hotels AS i WHERE
        i.price <= o.price
        AND i.user_rating >= o.user_rating
        AND (
            i.price < o.price
            OR i.user_rating < o.user_rating
        )
);
Listing 1 Hotel skyline query in plain SQL (Börzsönyi et al., 2001; Eder, 2009)

There are several drawbacks to such a formulation of skyline queries in plain SQL. First of all, as more skyline dimensions are used, the SQL code gets more and more cumbersome. Hence, such an approach is error-prone, and readability and maintainability will inevitably be lost. Moreover, most SQL engines (this also applies to Spark SQL) are not optimized for this kind of nested SELECT. Hence, performance is also an issue. Therefore, a better solution is called for. This is precisely the goal of our work:

  • We aim at an integration of skyline queries into Spark SQL with a simple, easy-to-use syntax; e.g., the skyline query for our hotel example should look as depicted in Listing 2.

  • We want to make use of existing query optimizations provided by the Catalyst optimizer of Spark SQL and extend the optimizer to also cover skyline queries.

  • At the same time, we want to make sure that our integration of the skyline operator has no negative effect on the optimization and execution of other queries.

  • Maintainability of the new code is an issue. We aim at a modular implementation, making use of existing functionality and making future enhancements as easy as possible.

  • Spark and Spark SQL support many different data sources. The integration of skyline queries should work independently of the data source that is being used.

SELECT price, user_rating FROM hotels SKYLINE OF price MIN, user_rating MAX;
Listing 2 Hotel skyline query in extended SQL (Börzsönyi et al., 2001; Eder, 2009)

Our contribution

The main result of this work is a full-fledged integration of skyline queries. This includes the following:

  • Virtually every single component involved in the query optimization and execution process of Spark SQL (see Figure 2) had to be extended: the parser, analyzer, optimizer, etc.

  • While a substantial portion of new code had to be provided, at the same time, we took care to make use of existing code as much as possible and to keep the extensions modular for easy maintenance and future enhancements.

  • For the optimization of skyline queries, the specifics of distributed query processing had to be taken into account. Apart from incorporating new rules into the Catalyst optimizer of Spark SQL, we have extended the skyline query syntax compared with previous proposals (Börzsönyi et al., 2001; Eder, 2009) to allow the user to control further skyline-specific optimization.

  • We have carried out an extensive empirical evaluation which shows that the integrated version of skyline queries by far outperforms the rewritten version in the style of Listing 1.

Structure of the paper

In Section 2, we discuss relevant related work. The syntax and semantics of skyline queries are introduced in Section 3. A brief introduction to Apache Spark and Spark SQL is given in Section 4. In Section 5, we describe the main ideas of our integration of the skyline operator into Spark SQL. We report on the results of our empirical evaluation in Section 6. In Section 7, we conclude and identify some lines of future work.

The source code of our implementation is available here: https://github.com/Lukas-Grasmann/skyline-queries-spark. Further material, comprising the binaries of our implementation, all input data, and the queries from our empirical evaluation, is provided at (Grasmann et al., 2022).

2. Related Work

Since the original publication of the skyline operator by Börzsönyi, Kossmann, and Stocker (Börzsönyi et al., 2001), various algorithms and approaches to skyline queries have been proposed. In this section, we give an overview of the relevant related work. We limit ourselves to, broadly, the “original” definition of skyline queries and exclude related query types like reverse skyline queries (Dellis and Seeger, 2007).

The main cost factor of skyline computation is the time spent on dominance testing, which in turn depends on how many dominance tests between tuples need to be performed. A simple, straightforward algorithm runs in quadratic time w.r.t. the size of the data. It checks for each pair of tuples whether one of them is dominated and, therefore, not part of the resulting skyline. This basic idea of pairwise dominance tests is realized in the Block-Nested-Loop skyline algorithm (Börzsönyi et al., 2001), which performs well especially when the resulting output is not too big, and the input dataset fits in memory.

Recent algorithms follow two main approaches, which either decrease the total number of dominance checks or increase the parallelism of these checks:

  (1) Using a score function on the skyline dimensions and sorting the tuples w.r.t. this score function (Börzsönyi et al., 2001; Liu and Li, 2020). These approaches may require the entire dataset to be available on a single node, which is not ideal in a distributed environment.

  (2) Partitioning (and distributing) the skyline computation by first independently computing the skyline of partitions (local skylines) before proceeding to the computation of the final skyline (global skyline) (Kim and Kim, 2018; Cui et al., 2008; Tang et al., 2019; Gulzar et al., 2019). This does not necessarily decrease the total number of dominance checks but speeds up the algorithm through parallelism.

Most of these approaches are centered around special data structures and indexes, which are provided by the respective database system. This makes them less suitable for Spark, which currently has no internal support for building and maintaining indexes. Such algorithms include Bitmap and Index (Tan et al., 2001), which compute the skyline with the help of index structures. Many algorithms of this class also make use of presorting, i.e., sorting the tuples from the dataset according to a monotone scoring function. Such scoring can, by definition, rule out certain dominance relationships between tuples so that fewer comparisons become necessary. This includes the algorithms SFS (Chomicki et al., 2003, 2005), LESS (Godfrey et al., 2005, 2007), SALSA (Bartolini et al., 2006, 2008), and SDI (Liu and Li, 2020).

A simple partition-based algorithm is the Divide-and-Conquer algorithm presented in the original skyline paper (Börzsönyi et al., 2001). Its working principle is to recursively divide the data into partitions until the units are small enough such that the skyline can be computed efficiently. The resulting (local) skylines are then merged step by step until the entire recursion stack has been processed. Related approaches include the nearest-neighbor search (Kossmann et al., 2002) as well as other similar algorithms (Papadias et al., 2003, 2005; Morse et al., 2007; Zhang et al., 2009; Sheng and Tao, 2012).

The algorithm BSkyTree (Lee and Hwang, 2010a, b) supports both a sorting-based and a partition-based variant. These make use of a tree-based index structure to increase performance. Alternatively, there also exist tree structures based on quad-trees (Park et al., 2013). Here, the data is partitioned dynamically and gradually into smaller and smaller "rectangular" chunks until the skylines can be computed efficiently.

There are also newer algorithms specifically targeted towards the MapReduce framework (Kim and Kim, 2018; Lee et al., 2007; Tang et al., 2019). MapReduce approaches use mappers to generate partitions by assigning keys to each tuple while the reducers are responsible for computing the actual skylines for all tuples with the same key. Each intermediate skyline for a partition is called a local skyline. This approach usually employs two or more rounds of map and reduce operations (stages) where each stage first (re-)partitions the data and then computes skylines. The final result is then derived by the last stage which generates only one partition and computes the global skyline.

The success of MapReduce approaches relies on keeping the intermediate results as small as possible and maximizing parallelism (Kim and Kim, 2018). To achieve this, multiple different partitioning schemes exist such as random, grid-based, and angle-based partitioning (Kim and Kim, 2018). Keeping the last local skylines before the global skyline computation small is especially desirable since the global skyline computation cannot be fully parallelized. Indeed, this is usually the bottleneck which slows the skyline computation down (Kim and Kim, 2018). This approach is not suitable for our purposes since it requires data from different partitions to be passed along with meta-information about the partitions. Spark does not currently support such functionality and adding it would be beyond the scope of this work.

There are efforts to eliminate multiple data points at once and to distribute the global skyline computation step by partitioning the data. In the case of grid-based partitioning, it is possible to eliminate entire cells of data if they are dominated by another (non-empty) cell (Tang et al., 2019). It is then also possible to cut down the number of data points against which each tuple must be compared in the global skyline, since some cells can never dominate each other according to the basic properties of the skyline (if a tuple is better in at least one dimension, then it cannot be dominated) (Kim and Kim, 2018; Tang et al., 2019).

There have been two past efforts to implement skyline computation in Spark. The first one is the system SkySpark (Kanellis, 2020) which uses Spark as the data source and realizes the skyline computation on the level of RDD operations (see Section 4 for details on Spark and RDDs). A similar effort was made in the context of the thesis by Ioanna Papanikolaou (Papanikolaou, 2018), which uses the map-reduce functionality of RDDs in Spark to compute the skyline. Both systems thus realize skyline computation as standalone Spark programs rather than as part of Spark SQL, which would make, for instance, a direct performance comparison with our fully integrated solution difficult (even though it might still provide some interesting insights). However, the most important issue is that these implementations are not ready to use: SkySpark (Kanellis, 2020), with the last source code update in 2016, implements skyline computation for an old version of Spark. The code has been deprecated and archived by the author with a note that the code does not meet his standards anymore. For the system reported in (Papanikolaou, 2018), there is no runnable code provided since it is only available as code snippets in the PDF document.

As will be detailed in Section 5, we have chosen the Block-Nested-Loop skyline algorithm (Börzsönyi et al., 2001) for our integration into Spark SQL due to its simplicity. Owing to the modular structure of our implementation, replacing this algorithm with more sophisticated ones in the future should not be too difficult. We will come back to suggestions for future work in Section 7. Partitioning also plays an important role in our distributed approach (though see Section 5 for the limits of distributed processing in the case of incomplete data). However, to avoid unnecessary communication cost, we refrain from overriding Spark's partitioning mechanism.

3. Skyline Queries

In this section, we take a deeper look into the syntax and semantics of skyline queries. Skyline queries can be expressed in a simple extension of SELECT-FROM-WHERE queries in SQL. This syntax was proposed together with the original skyline operator in (Börzsönyi et al., 2001) and can be found in Listing 3. The $m$ skyline dimensions out of $n$ total dimensions are denoted as $d_1$ through $d_m$ (where $m \leq n$). The DISTINCT keyword specifies that only a single tuple is returned if there are several tuples with the same values in all skyline dimensions. If there are multiple possibilities, the exact tuple that is returned is chosen arbitrarily. The keyword COMPLETE is an extension compared to (Börzsönyi et al., 2001) that we have introduced. It allows the user to inform the system that the dataset is complete in the sense that no null occurs in the skyline dimensions. The system can, therefore, safely choose the complete skyline algorithm, which is more efficient than the incomplete one. This will be discussed in more detail in Section 5.

SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ...
    SKYLINE OF [DISTINCT] [COMPLETE]
        d_1 [MIN | MAX | DIFF], ..., d_m [MIN | MAX | DIFF]
ORDER BY ...
Listing 3 Syntax of skyline queries in SQL (Börzsönyi et al., 2001)

In Section 1, we have already introduced the dominance relationship and the skyline itself in a semi-formal way. Below, we provide formal definitions of both concepts.

Definition 3.1 (Dominance relationship between tuples (Börzsönyi et al., 2001; Kim and Kim, 2018)).

Given a set of tuples $R$, two $n$-ary tuples $r, s \in R$, and a set of dimensions $D$ which always consists of four (potentially empty) disjoint subsets $D_{\mathit{min}}, D_{\mathit{max}}, D_{\mathit{diff}}, D_{\mathit{extra}}$, we define $r_i$ and $s_i$ as the values of the tuples $r$ and $s$ in dimension $d_i \in D$, respectively. The subsets of the dimensions correspond to the skyline dimensions ($D_{\mathit{min}}, D_{\mathit{max}}, D_{\mathit{diff}}$) and the "extra" (non-skyline) dimensions ($D_{\mathit{extra}}$). Then $r$ dominates $s$ ($r \prec s$) if and only if:

$$\bigg(\bigwedge_{d_i \in D_{\mathit{min}}} r_i \leq s_i\bigg) \wedge \bigg(\bigwedge_{d_i \in D_{\mathit{max}}} r_i \geq s_i\bigg) \wedge \bigg(\bigwedge_{d_i \in D_{\mathit{diff}}} r_i = s_i\bigg) \wedge \Bigg(\bigg(\bigvee_{d_i \in D_{\mathit{min}}} r_i < s_i\bigg) \vee \bigg(\bigvee_{d_i \in D_{\mathit{max}}} r_i > s_i\bigg)\Bigg)$$

In words, the above definition means that a tuple $r \in R$ dominates another tuple $s \in R$ if and only if:

  • The values in all DIFF dimensions are equal and

  • $r$ is at least as good in all MIN/MAX skyline dimensions and

  • $r$ is strictly better in at least one MIN/MAX skyline dimension

The dominance relationship is transitive, i.e., if $a$ dominates $b$ and $b$ dominates $c$, then $a$ also dominates $c$.

Given this formal definition of the dominance relationship, we can now also formally define the skyline itself.

Definition 3.2 (Skyline (Börzsönyi et al., 2001)).

Given a set of skyline dimensions $D$, let $R$ be a set of tuples. The skyline of $R$ (denoted $\mathit{SKY}(R)$) is the set of tuples defined as follows: $\mathit{SKY}(R) := \{r \in R \mid \nexists s \in R : s \prec r\}$.

Note that the above definitions apply to complete datasets. For incomplete datasets (if null may occur in some skyline dimension), we need to slightly modify the definition of dominance in that the comparison of two tuples is always restricted to those dimensions where both are not null. Hence, the above definition for the complete case is modified as follows (Gulzar et al., 2019):

  • The values in all DIFF dimensions where both $r$ and $s$ are not null are equal and

  • $r$ is at least as good in all MIN/MAX skyline dimensions where both $r$ and $s$ are not null and

  • $r$ is strictly better in at least one MIN/MAX skyline dimension where both $r$ and $s$ are not null

Given the modified dominance relationship, we now run into the problem that for incomplete datasets the transitivity property of skyline dominance is lost and there may be cyclic dominance relationships. For instance, assume a dataset similar to the one given in (Gulzar et al., 2019) consisting of three tuples $a = (1, *, 10)$, $b = (3, 2, *)$, $c = (*, 5, 3)$, where $*$ is a placeholder for a missing value. If all three dimensions are skyline dimensions (to be minimized), then, given the definition of dominance in incomplete datasets, it follows that $a \prec b$ since $1 < 3$ in the first dimension. Similarly, we have $b \prec c$ since $2 < 5$ in the second dimension. Lastly, we note that $a \not\prec c$ since $10 \nless 3$, but $c \prec a$ since $3 < 10$ in the third dimension. Under the assumption of transitivity, it would follow from $a \prec b$ and $b \prec c$ that $a \prec c$ holds, which is not the case. In other words, for incomplete data, the transitivity property is lost (Gulzar et al., 2019). Since $a \prec b$, $b \prec c$, and $c \prec a$, the dominance relationships form a cycle and are therefore referred to as cyclic dominance relationships (Gulzar et al., 2019). Hence, when computing the skyline of a potentially incomplete dataset, prematurely deleting dominated tuples may lead to erroneous results. Indeed, this is the trap into which the skyline algorithm proposed in (Gulzar et al., 2019) fell. For details, see Appendix A.
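To make the cycle concrete, here is a small self-contained Scala sketch of our own (not code from the implementation); Option encodes a missing value, and all dimensions are to be minimized:

def dominatesIncomplete(r: Seq[Option[Int]], s: Seq[Option[Int]]): Boolean = {
  // restrict the comparison to dimensions where both tuples are non-null
  val shared = r.zip(s).collect { case (Some(ri), Some(si)) => (ri, si) }
  shared.forall { case (ri, si) => ri <= si } && shared.exists { case (ri, si) => ri < si }
}

val a = Seq(Some(1), None, Some(10))
val b = Seq(Some(3), Some(2), None)
val c = Seq(None, Some(5), Some(3))

// dominatesIncomplete(a, b), dominatesIncomplete(b, c), and
// dominatesIncomplete(c, a) all evaluate to true: a dominance cycle.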

As already mentioned in Section 1, skyline queries can be formulated in plain SQL without the specialized skyline syntax from Listing 3. A general schema for rewriting skyline queries into plain SQL is given in Listing 4. A first version of this rewriting was informally described in (Börzsönyi et al., 2001). The full details and the corresponding syntax used here were introduced in (Eder, 2009): First, the outer query selects all tuples from some relation (which may itself be the result of a complex SELECT statement) and then we use a subquery with WHERE NOT EXISTS to eliminate all dominated tuples.

SELECT column_list FROM rel AS o WHERE condition(s) AND NOT EXISTS(
    SELECT * FROM rel AS i WHERE condition(s)
        AND i.a_1 <= o.a_1 AND ... AND i.a_j <= o.a_j
        AND i.a_{j+1} >= o.a_{j+1} AND ... AND i.a_k >= o.a_k
        AND i.a_{k+1} = o.a_{k+1} AND ... AND i.a_m = o.a_m
        AND (
            i.a_1 < o.a_1 OR ... OR i.a_j < o.a_j
            OR i.a_{j+1} > o.a_{j+1} OR ... OR i.a_k > o.a_k
        )
);
Listing 4 Translated skyline query in plain SQL (Börzsönyi et al., 2001; Eder, 2009)

4. Apache Spark

Figure 2. Apache Spark SQL Query Processing and Execution (adapted from (Zaharia et al., 2010))

Apache Spark is a unified and distributed framework for processing large volumes of data. The framework can handle many different data sources. Through various specialized modules on top of its core module, it supports a wide range of processing workloads. The central data structure of Spark Core is the resilient distributed dataset (RDD). An RDD is a collection of elements distributed over the nodes of the cluster. Working with Spark comes down to defining transformations and actions on RDDs, while data distribution, parallel execution, synchronization, fault-tolerance, etc. are taken care of by the system and are largely hidden from the user.

The focus of our work lies on the Spark SQL module, which extends the Spark Core by providing an SQL interface. Queries can be formulated either via query strings or using an API. This API is based on the specialized data structures DataFrame and Dataset. The two are (meanwhile) closely related in that every DataFrame is also a Dataset, with the only difference being that Datasets are strongly typed while DataFrames are not. The processing and execution of input queries follows a well-defined schema with multiple steps that are depicted in Figure 2.
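The relationship between the two can be illustrated with the standard Spark API (a minimal sketch; the Hotel schema and the input file are hypothetical):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Hotel(price: Double, userRating: Double)

val spark = SparkSession.builder().appName("dataframe-vs-dataset").getOrCreate()
import spark.implicits._

val df: DataFrame = spark.read.parquet("hotels.parquet") // untyped: DataFrame = Dataset[Row]
val ds: Dataset[Hotel] = df.as[Hotel]                    // strongly typed view of the same data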

First, queries are formulated either via SQL query strings or via APIs. In the case of query strings, it is necessary to parse them first. Both methods produce a logical execution plan that captures which operations the query consists of. For instance, filtering is represented as a specific node in the plan and can be used for both WHERE and HAVING clauses in the query. The references to tables or columns are not yet assigned to actual objects in the database in this step. To resolve this, the Analyzer takes each identifier and translates it using the Catalog. The result returned by the Analyzer is the resolved logical plan, in which all placeholders have been replaced by actual references to objects in the database.

The Catalyst Optimizer is crucial to the success of Spark SQL. It is a powerful rule-based optimizer that can be extended with new rules for specific (newly introduced) nodes. Optimizations are applied to the resolved logical plan.

To actually execute the query, the logical execution plan must first be translated into a physical execution plan, which contains information about how the various operations are to be realized. For example, Spark provides different join implementations, which are each represented by different nodes in the physical plan and one of them is selected during the physical planning phase. There may be more than one physical plan generated during this step. Based on the physical plan(s), one specific plan must be selected, and Code Generation is carried out according to the chosen physical plan. In Spark, the generated code is a computation based on RDDs.

It will turn out that, for a full-fledged integration of skyline queries into Spark SQL, virtually all of the components of query processing and execution mentioned above have to be extended or modified. This will be the topic of the next section.

5. Integrating Skylines into Apache Spark

Below, we give an overview of the required modifications and extensions of each component of the query processing flow depicted in Figure 2. An important aspect of integrating skyline queries into Spark SQL is the algorithm selection for the actual computation of the skyline. Here, we will discuss in some detail the challenges arising from potentially incomplete data and our solution.

5.1. Skyline Query Syntax in Spark SQL

The parser is the first step of processing a Spark SQL query. Spark uses the ANTLR parser generator to generate a lexer and a parser, which take query strings as input and produce nodes of a logical plan. We modify both parts to extend Spark's SQL-like syntax such that it also includes skyline queries.

skylineClause
    : SKYLINE
      skylineDistinct=DISTINCT?
      skylineComplete=COMPLETE?
      skylineItems+=skylineItem (',' skylineItems+=skylineItem)*
    ;

skylineItem
    : skylineItemExpression=expression skylineMinMaxDiff=(MIN | MAX | DIFF)
    ;
Listing 5 ANTLR grammar for skyline queries

The grammar corresponding to the syntax introduced in Listing 3 is shown in Listing 5. Given a "regular" SELECT statement, note that a skyline clause always comes after the HAVING clause (if any) but before any ORDER BY clause. Each skyline clause consists of the SKYLINE OF keyword (line 2), an optional DISTINCT (line 3) and COMPLETE (line 4), as well as an arbitrary number of skyline dimensions (lines 5 and 8-9), where the "type" MIN/MAX/DIFF is given separately for each dimension (line 9). The ordering of skyline dimensions in this syntax has no bearing on the result of the query (except, possibly, for the order of the output tuples). It may, however, have a slight effect on the performance of dominance checks since it determines the order in which the skyline dimensions of two tuples are compared.
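For illustration, a query exercising both optional keywords could be issued through the query-string interface as follows (a sketch; the hotels table is hypothetical):

// Hypothetical example (the table hotels and its columns are assumed):
val result = spark.sql(
  """SELECT id, price, user_rating
    |FROM hotels
    |SKYLINE OF DISTINCT COMPLETE price MIN, user_rating MAX
    |ORDER BY price""".stripMargin)
result.show()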

5.2. Extending the Logical Plan

Each node in the logical plan stores the details necessary to choose an algorithm for the specific operator and to derive a physical plan. In the case of the skyline operator, we use a single node with a single child in the logical plan. This node stores the skyline dimensions (SkylineDimension) as well as other vital information like whether the skyline should be DISTINCT or not. The child node of the skyline operator node provides the input data for the skyline computation.

Each SkylineDimension extends the default Spark Expression such that it stores both the reference to the database dimension and the type (i.e., MIN/MAX/DIFF). The database dimension is usually a column but can also be a more complex Expression (e.g., an aggregate) in Spark. It is stored as the child of the SkylineDimension which allows us to make use of the generic functionality responsible for resolving Expressions in the analyzer.
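A minimal sketch of such an expression is given below; the field names and mixed-in traits are illustrative rather than the exact implementation, but the essential design point is that the dimension itself is the child Expression:

import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression, Unevaluable}
import org.apache.spark.sql.types.DataType

// Illustrative encoding of the MIN/MAX/DIFF "type" of a dimension.
sealed trait SkylineMinMaxDiff
case object SkylineMin extends SkylineMinMaxDiff
case object SkylineMax extends SkylineMinMaxDiff
case object SkylineDiff extends SkylineMinMaxDiff

// The dimension wraps an arbitrary child Expression, so the analyzer's
// generic resolution rules apply to it without modification.
case class SkylineDimension(child: Expression, minMaxDiff: SkylineMinMaxDiff)
  extends UnaryExpression with Unevaluable {
  override def dataType: DataType = child.dataType
  override protected def withNewChildInternal(newChild: Expression): SkylineDimension =
    copy(child = newChild)
}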

5.3. Extending the Analyzer

The analyzer of Spark already offers a wide range of rules which can be used to resolve the expressions. Since the skyline dimensions used by our skyline operator as well as their respective children are also expressions, they will be resolved automatically by the existing rules in most cases. We have to ensure that all common queries also work when the skyline operator is used. To achieve this, we have to extend the existing rules from the Analyzer to also incorporate the skyline operator. Such rules mainly pertain to the propagation of aggregates across the (unresolved) logical execution plan of Spark.

First, we ensure that we are also able to compute skylines that include dimensions not present in the final projection (i.e., the attributes specified in the SELECT clause). To achieve this, we expand the function ResolveMissingReferences by adding another case for the SkylineOperator. The code is shown in Listing 6. The rule is applied if and only if the SkylineOperator has not yet been resolved or has missing input attributes, and the child is already fully resolved (lines 1-3). Then, we resolve the expressions and add the missing attributes (lines 4-5), which are then used to generate a new set of skyline dimensions (lines 6-7). If no output was added, we simply replace the skyline dimensions (lines 8-9). Otherwise, we create a new skyline with the newly generated child (lines 10-11) and add a projection to eliminate redundant attributes (line 12).

case s @ SkylineOperator(_, _, skylineItems, child)
if (!s.resolved || s.missingInput.nonEmpty)
&& child.resolved =>
val (exprs, newChild) =
resolveExprsAndAddMissingAttrs(skylineItems, child)
val dimensions =
exprs.map(_.asInstanceOf[SkylineDimension])
if (child.output == newChild.output) {
s.copy(skylineItems = dimensions)
} else {
val newSkyline = s.copy(dimensions, newChild)
Project(child.output, newSkyline)
}
Listing 6 Analyzer extension to allow dimensions not present in the projection in the skyline operator

Next, we need to take care that aggregate attributes are also propagated to the skyline properly. The code to accomplish this is given in Listing 7. We will only explain it briefly since it was modified from existing Spark code for similar nodes.

To do this, we match a skyline operator that is the parent or an ancestor node of an Aggregate (line 3). Then we try to resolve the skyline dimensions using the output of the aggregate as a basis (lines 5-6). This also introduces missing aggregates in the Aggregate node, which is necessary, for example, if the output of the query only contains the sum but the skyline is based on the count (lines 7-13). Using the output of the resolution, we (re-)construct the skyline operator (lines 14-17). Additionally, we introduce analogous rules for other similar cases, e.g., plans where a Filter node introduced by a HAVING clause sits between the Aggregate and the skyline operator node.

case SkylineOperator(
distinct, complete, skylineItems,
agg: Aggregate
) if agg.resolved =>
val maybeResolved = skylineItems.map(_.child)
.map(resolveExpressionByPlanOutput(_, agg))
resolveOperatorWithAggregate(
maybeResolved, agg,
(newExprs, newChild) => {
val newSkylineItems = skylineItems.zip(newExprs)
.map {
case (skylineItems, expr) =>
skylineItems.copy(child = expr)
}
SkylineOperator(
distinct, complete, newSkylineItems, newChild
)
}
)
Listing 7 Analyzer extension to propagate aggregate attributes to skylines

Note that the correct handling of Filter and Aggregate nodes is non-trivial in general – not only in combination with the skyline operator. Indeed, when working on the integration of skyline queries into Spark SQL, we noticed that aggregates are in some cases not resolved correctly by the default rules of Spark. This is, for example, the case when Sort nodes are resolved in combination with Filters and Aggregates. Such cases may be introduced by a HAVING clause in the query. For a more detailed description of this erroneous behavior of Spark SQL together with a proposal how to fix this bug, see Appendix B.

5.4. Additional Optimizations

Given a resolved logical plan, the Catalyst Optimizer aims at improving the logical query plan by applying rule-based optimizations (see Figure 2). The default optimizations of Spark also apply to skyline queries – especially to the ones where the data itself is retrieved by a more complex query. Skyline queries integrated into Apache Spark thus benefit from existing optimizations, which can be further improved by adding the following specialized rules.

If a skyline query contains a single MIN or MAX dimension, then this is equivalent to selecting the tuples with the lowest or highest value in that skyline dimension, respectively. In other words, the Pareto optimum in a single dimension is simply the optimum. There are two possibilities to rewrite the skyline: we can either sort the values according to the skyline dimension and then select only the top results, or find the lowest or highest value in a scalar subquery (i.e., a subquery that yields a single value) and then select the tuples which have the desired value. Given that, for a relation with $n$ tuples, sorting and selecting exhibits a worst-case runtime of $\mathcal{O}(n \cdot \log(n))$ while the scalar subquery and selection can, when optimized, be done in $\mathcal{O}(n)$ time, we opt for the latter.
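As an illustration of the two rewritings (a sketch over the hypothetical hotels table, not the optimizer's internal plan form):

// Sort-based variant: O(n log n); cutting off after the first row would
// additionally drop ties on the minimum.
val bySort = spark.sql("SELECT * FROM hotels ORDER BY price ASC")

// Scalar-subquery variant: O(n) when optimized and keeps all tuples whose
// price equals the minimum; this is the rewriting we choose.
val byScan = spark.sql(
  "SELECT * FROM hotels WHERE price = (SELECT MIN(price) FROM hotels)")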

A more sophisticated optimization is taken from (Börzsönyi et al., 2001), where (Carey and Kossmann, 1997) is referenced for the correctness of the transformation. It is applicable if the skyline appears in combination with a Join node in the tree. If the output of the join serves as the input to the skyline operator, then we can check whether it is possible to move the skyline into one of the "sides" of the join. This is possible if the join is non-reductive (Börzsönyi et al., 2001), as defined in (Carey and Kossmann, 1997). Intuitively, non-reductiveness in our case means that, due to the constraints in the database, it can be inferred that for every tuple in the first joined table there must exist at least one join "partner" in the other table if the tuple is part of the skyline. The main benefit of this optimization is that computing the skyline before the join is likely to reduce the input size of both the skyline operator and the (now subsequent) join operation, which typically reduces the total query execution time.
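Schematically, the transformation looks as follows (a sketch over hypothetical tables, assuming the join is non-reductive):

// Before: the skyline consumes the join result.
val before = spark.sql(
  """SELECT * FROM listings l JOIN hosts h ON l.host_id = h.id
    |SKYLINE OF l.price MIN""".stripMargin)

// After: the skyline is pushed into the left join input, shrinking the
// inputs of both the skyline operator and the join.
val after = spark.sql(
  """SELECT * FROM
    |  (SELECT * FROM listings SKYLINE OF price MIN) l
    |  JOIN hosts h ON l.host_id = h.id""".stripMargin)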

5.5. Algorithm Selection

Given the problems that arise with skyline queries on incomplete datasets as mentioned in Section 3, we have to take extra care which algorithm is used for which dataset. The decision is mainly governed by whether we can assume the input dataset to be complete or not. If the dataset is (potentially) incomplete, we have to select an algorithm that is capable of handling incomplete data since otherwise, the computation may not yield the correct results.

As will be explained in Section 5.7 (and it will also be observed in the empirical evaluation in Section 6), skyline algorithms that are able to handle incomplete datasets are lacking in performance compared to their complete counterparts. Given that Spark can handle multiple different data sources and cannot always detect the nullability of a column, it is desirable to provide an override that decides whether a complete algorithm is used regardless of the detected input columns. We do this by introducing the optional COMPLETE keyword in the skyline query syntax. This gives the user the possibility to enforce the use of the complete algorithms also on datasets that can technically be incomplete but are known to be complete. The correctness of the algorithm only depends on whether null values actually appear in the data.

We implement the algorithm selection using the nodes in the physical execution plan. Which algorithm is selected depends on which physical nodes are chosen during translation of the (resolved and optimized) logical execution plan. In a minimally viable implementation, it is possible to implement the skyline operator in a single physical node. This has, however, the disadvantage that Spark’s potential of parallelism would be lost for the skyline computation. We therefore split the computation into two steps, represented by two separate nodes in the physical plan. The pseudocode of the algorithm selection is shown in Listing 8.

input : resolved logical plan with skyline operator node sky as root
output: optimized (resolved) logical plan

1  skylineNullable ← ∃ d ∈ D_SKY : isNullable(d)
2  if COMPLETE is set OR ¬skylineNullable then
3      local_skyline  ← local_node()
4      global_skyline ← complete_global_node()
5  else
6      local_skyline  ← local_node()
7      global_skyline ← incomplete_global_node()
8  return global_skyline
Listing 8 Selection of appropriate skyline nodes in physical execution plan

The main decision to be made (lines 1-2) is whether we may use the complete algorithm. This is the case when the SKYLINE clause contains the COMPLETE keyword or when the skyline dimensions are recognized as not nullable by the system. Note that the local skylines use the same basic nodes in both the complete and the incomplete case, while the nodes for the global skyline differ.

The first node in the physical plan corresponds to the local skyline computation, which can be done in a distributed manner (lines 3 and 6). This means that multiple instances of this node may exist, distributed across the cluster on which Spark runs. For the actual distribution of the data, we can use multiple different schemes provided by Spark or leave it as-is from the child in the physical plan. When left at the standard distribution (UnspecifiedDistribution), Spark will usually try to distribute the data equally given the number of available executors. For example, if there are 10 executors available for 10,000,000 tuples in the original dataset, then each executor will receive roughly 1 million tuples. For the algorithms that can handle incomplete datasets, we have to use a specialized distribution scheme, which will be discussed in more detail in Section 5.7.

The output of the local skyline computation is used as input for calculating the global skyline (lines 4 and 7). This means that, in the physical execution plan, we create a global skyline node whose only child is the local skyline. In contrast to the local skyline, we may not freely choose the distribution but must instead ensure that all tuples from the local skylines are handled by the same executor for the global skyline computation. This is enforced by choosing the AllTuples distribution provided by Spark.
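The distribution requirements of the two nodes can be sketched as follows (illustrative node classes against Spark's SparkPlan API; the skyline logic itself is elided):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, Distribution, UnspecifiedDistribution}
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}

case class LocalSkylineExec(child: SparkPlan) extends UnaryExecNode {
  override def output: Seq[Attribute] = child.output
  // keep whatever partitioning the child already provides
  override def requiredChildDistribution: Seq[Distribution] = Seq(UnspecifiedDistribution)
  override protected def doExecute(): RDD[InternalRow] =
    child.execute().mapPartitions(iter => iter /* local skyline per partition */)
  override protected def withNewChildInternal(newChild: SparkPlan): LocalSkylineExec =
    copy(child = newChild)
}

case class GlobalSkylineExec(child: SparkPlan) extends UnaryExecNode {
  override def output: Seq[Attribute] = child.output
  // route all local-skyline tuples to a single executor
  override def requiredChildDistribution: Seq[Distribution] = Seq(AllTuples)
  override protected def doExecute(): RDD[InternalRow] =
    child.execute().mapPartitions(iter => iter /* global skyline */)
  override protected def withNewChildInternal(newChild: SparkPlan): GlobalSkylineExec =
    copy(child = newChild)
}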

To keep our solution modular, we encapsulate the dominance check in a new utility that we introduce in Spark. It takes as input the values and types of the skyline dimensions of two tuples and checks if one tuple dominates the other. For every skyline dimension, we match the data type to avoid costly casting and (in the case of casting to floating types) potential loss of accuracy.
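The core of such a utility can be sketched as follows (a simplified version of our own over Int-valued dimensions; the actual utility dispatches on Spark's data types):

sealed trait DimType
case object MinDim extends DimType
case object MaxDim extends DimType
case object DiffDim extends DimType

// Returns true iff tuple r dominates tuple s w.r.t. the given dimension types.
def dominates(r: IndexedSeq[Int], s: IndexedSeq[Int], dims: IndexedSeq[DimType]): Boolean = {
  var strictlyBetter = false
  var i = 0
  while (i < dims.length) {
    dims(i) match {
      case MinDim  => if (r(i) > s(i)) return false else if (r(i) < s(i)) strictlyBetter = true
      case MaxDim  => if (r(i) < s(i)) return false else if (r(i) > s(i)) strictlyBetter = true
      case DiffDim => if (r(i) != s(i)) return false
    }
    i += 1
  }
  strictlyBetter
}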

With our approach to algorithm selection in the physical plan and the modular structure of the code, it is easy to incorporate further skyline algorithms if so desired in the future. Choosing a different algorithm mainly entails replacing the local and/or global skyline computation nodes in the physical plan by new ones.

5.6. Skylines in Complete Datasets

For complete datasets, we adapt the Block-Nested-Loop skyline algorithm (Börzsönyi et al., 2001). The main idea is to keep a window of tuples in which the skyline of all tuples processed up to this point is stored. We iterate through the entire dataset, and, for each tuple, we check which dominance relationships exist with the tuples in the current window. If tuple tt is dominated by a tuple in the window, then tt is eliminated. Here, it is not necessary to check tt against the remaining tuples since it cannot dominate any tuples in the window due to transitivity. If tuple tt dominates one or more tuples in the window, then the dominated tuples are eliminated. In this case, tt is inserted into the window since, by transitivity, tt cannot be dominated by other tuples in the window. Tuple tt is also inserted into the window if it is found incomparable with all tuples in the current window.
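In outline, the window-based loop looks as follows (a sketch parameterized by a dominance test such as the one in Section 5.5; it assumes the window fits into memory):

def bnlSkyline[T](tuples: Iterator[T], dominates: (T, T) => Boolean): Iterator[T] = {
  var window = List.empty[T] // current skyline of all tuples seen so far
  for (t <- tuples) {
    // if some window tuple dominates t, discard t; by transitivity,
    // t can then dominate no window tuple either
    if (!window.exists(w => dominates(w, t))) {
      // otherwise evict every window tuple dominated by t and insert t
      window = t :: window.filterNot(w => dominates(t, w))
    }
  }
  window.iterator
}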

We can use the same algorithm for both the local and the global skyline computation. The only necessary difference is the distribution of the data, where we let Spark handle the distribution for the local skyline while we force the AllTuples distribution for the global skyline computation. This has the additional advantage that partitioning done in prior processing steps can be kept, which increases the locality of the data for the local skyline computation and may thus help to further improve the overall performance.

The Block-Nested-Loop approach is most efficient if the window fits into main memory. Note that also Spark, in general, performs best if large portions of the data fit into the main memory available. Especially in cloud-based platforms, there is typically sufficient RAM available (in the order of magnitude of terabytes or even petabytes). At any rate, if RAM does not suffice, Spark will swap data to disk like any other program – with the usual expected performance loss.

5.7. Skylines in Incomplete Datasets

When computing the local skyline for an incomplete dataset, we must take care that no potential cyclic dominance relationships are lost; otherwise, tuples may appear in the result even though they are dominated by another tuple. To avoid this, we use a special form of a bitmap-based skyline algorithm (Gulzar et al., 2019).

Given a set $P$ of tuples, we can assign each tuple $p \in P$ a bitmap $b$ (an index in binary format) such that each bit in $b$ corresponds to a skyline dimension. If a tuple $p$ has a null value in a skyline dimension, then the corresponding bit in bitmap $b$ is set to $1$; otherwise, it is set to $0$. Subsequently, the data is partitioned according to the bitmap indexes such that all tuples with a specific bitmap $b$ are assigned to the same subset (partition) $P_b$ of $P$. We can then calculate the local skyline $\mathit{SKY}(P_b)$ for each set of tuples $P_b$ without losing the transitivity property or running into issues with cyclic dominance relationships.

In Spark, this sort of partitioning is done via the integrated distribution of the nodes. We craft an expression for the distribution which uses the predefined IsNull() method to achieve this effect.
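The bitmap computation itself can be sketched as follows (our illustration; in the implementation, the effect is achieved via the distribution expression with IsNull() mentioned above):

// Bit i of the result is 1 iff skyline dimension i is null; all tuples
// with the same bitmap b end up in the same partition P_b.
def nullBitmap(dims: Seq[Option[Any]]): Int =
  dims.zipWithIndex.foldLeft(0) { case (bitmap, (value, i)) =>
    if (value.isEmpty) bitmap | (1 << i) else bitmap
  }

// e.g., nullBitmap(Seq(Some(1), None, Some(10))) == 2 (only bit 1 set)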

For the global skyline computation, we cannot use the standard Block-Nested-Loop approach since cyclic dominance relationships may occur and the transitivity of the dominance relationship is not guaranteed. We therefore opt for the less efficient approach of comparing all tuples against each other. Even if a tuple $t$ is dominated, we may not immediately delete it since it may be the only tuple that dominates another tuple $t'$. In such a case, by prematurely deleting $t$, tuple $t'$ would be erroneously added to the skyline. This is a subtle point which was, for instance, overlooked in the algorithm proposed in (Gulzar et al., 2019). In Appendix A, we illustrate the error resulting from premature deletion of dominated tuples in detail.

The following lemma guarantees that our skyline computation in case of a potentially incomplete dataset yields the correct result:

Lemma 5.1.

Let a dataset $P$ be partitioned according to the null values and let $S_{\mathit{local}}$ denote the resulting union of local skylines. Then it holds for every tuple $p \in P$ not part of the global skyline $S_{\mathit{global}}$ that either $p \notin S_{\mathit{local}}$ or there exists $q \in S_{\mathit{local}}$ with $q \prec p$.

Proof.

Let $p \in P$ be a tuple that is not part of the global skyline $S_{\mathit{global}}$, i.e., there exists some tuple $q \in P$ with $q \prec p$. If both $p$ and $q$ belong to the same partition during the local skyline computation, then this dominance relationship will be detected in this step and $p$ will be deleted. Hence, in this case, $p \notin S_{\mathit{local}}$.

Now suppose that $p$ and $q$ belong to different partitions. If $q \in S_{\mathit{local}}$, then $q$ is the desired tuple in $S_{\mathit{local}}$ with $q \prec p$.

It remains to consider the case that $q \notin S_{\mathit{local}}$. This means that $q$ is deleted during the local skyline computation. In other words, there exists a tuple $r$ in the same partition as $q$ with $r \prec q$ such that $r \in S_{\mathit{local}}$. The latter property is guaranteed by the fact that, inside each partition, all tuples have nulls at the same positions and there can be no cyclic dominance relationships.

By the assumption $q \prec p$ and the definition of skylines in incomplete datasets, $q$ is at least as good as $p$ in all non-missing skyline dimensions and strictly better in at least one. Since $r \prec q$ and both tuples have the same set of missing dimensions due to being in the same partition, it follows that $r$ is also at least as good as $p$ in all non-missing skyline dimensions and strictly better in at least one. That is, $r$ is the desired tuple in $S_{\mathit{local}}$ with $r \prec p$. ∎

Selecting an algorithm which can handle incomplete datasets yields the correct result also for a complete dataset (while the converse is, of course, not true). There is, however, a major drawback to always using an incomplete algorithm. Since the partitioning in the incomplete algorithm is done according to which values of a tuple are null, only limited partitioning is possible. The worst case occurs when the algorithm for incomplete datasets is applied to a complete dataset. In this case, since there are no null values, there is only a single partition to which all tuples belong. This means that any potential of parallelism is lost. Hence, in the case of a complete dataset, the algorithm for incomplete data performs even worse than a strictly non-distributed approach which immediately proceeds to the global skyline computation by a single executor.

5.8. Extending the User Interfaces

In addition to our extension of the SQL string-interface of Spark SQL, we integrate the skyline queries also into the basic DataFrame API provided in Java and Scala by adding new API functions. Here, the data about the skyline dimensions can either be passed via pairs of skyline dimension and associated dimension type or by using the columnar definition of Spark directly. For the latter purpose, we define the functions smin(), smax(), and sdiff(), which each take a single argument that provides the skyline dimension in Spark columnar format as input. The API calls bypass the parsing step and directly create a new skyline operator node in the logical plan.
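Usage could then look as follows (a sketch; the name of the Dataset method is illustrative, while smin and smax are the functions described above):

import org.apache.spark.sql.functions.col

// skyline over two dimensions via the columnar variant of the API
val best = hotels.skyline(smin(col("price")), smax(col("user_rating")))
best.show()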

We also integrate skyline queries into Python and R via PySpark and SparkR, respectively – two of the most popular languages for analytical and statistical applications. Rather than re-implementing skyline queries in these languages, we build an intermediate layer that calls the Scala implementation of the DataFrame API.

In Python, this method relies on the external, open-source library Py4J (https://www.py4j.org/). Its usage requires the skyline dimensions and skyline types to first be translated into Java objects, which can then be passed on using Py4J. A slight complication arises from the fact that Python is a dynamically typed language, which imposes restrictions on the method signatures. Skyline dimensions and their types are therefore passed as separate lists of strings: the first dimension is matched to the first skyline type, and so forth. Passing the skyline dimensions via Columns (i.e., Expressions) works analogously to the Scala/Java APIs.

In R, the integration is simpler since R allows for tuples of data to be entered more easily. Hence, in addition to the column-based input analogous to the Python interface, the R interface also accepts a list of pairs of skyline dimension plus type.

5.9. Ensuring Correctness

As far as the impact of the skyline handling on the overall behavior of Spark SQL is concerned, we recall from Sections 5.2 and 5.3 that, in the logical plan, the skyline operator gives rise to a single node with a single input (from the child node) and a single output (to the parent node). When translating the logical node to a physical plan, this still holds since, even if there are multiple physical nodes in the plan, there is only a single input and output.

Therefore, the main concern of ensuring correctness of our integration of skyline queries into Spark SQL is the handling of skyline queries itself. There are no side effects whatsoever of the skyline integration on the rest of query processing in Spark SQL. This also applies to potential effects of the skyline integration on the performance of Spark SQL commands not using the skyline feature at all. In this case, the only difference in the query processing flow is an additional clause in the parser. The additional cost is negligible.

We have intensively tested the skyline handling to provide evidence for its correctness. Unit tests are provided as part of our implementation in the GitHub repository. Additionally, for a significant portion of the experiments reported in Section 6, we have verified that our integrated skyline computation yields the same result as the equivalent “plain” SQL query in the style of Listing 4.

6. Empirical Evaluation

In this section, we take a closer look at how our integration performs based on a series of benchmarks executed on a cluster. We will first provide information about the experimental setup as well as the input data and the queries which are executed on them. Following that, we will provide an overview of the outcomes of the benchmark. To round off the section, we will give a brief summary of the performance measurements and draw some conclusions.

6.1. Experimental Setup

We have implemented the integration of skyline queries into Apache Spark, and provide our implementation as open-source code at https://github.com/Lukas-Grasmann/skyline-queries-spark. Our implementation uses the same languages as Apache Spark, which include Java, Scala, Python, and R. The bulk of the main functionality was written in Scala, which is also the language in which the core functionalities of Spark are implemented. The experiments can be easily reproduced by using the provided open-source software. The binaries, benchmark data, queries, and additional scripts can be found at (Grasmann et al., 2022).

All tests were run on a cluster consisting of 2 namenodes and 18 datanodes. The latter are used to execute the actual program. Every node consists of 2 Xeon E5-2650 v4 CPUs by Intel that provide 24 cores each, which equals 48 cores per node and up to 864 cores across all worker nodes in total. Each node provides up to 256GB of RAM and 4 hard disks with a total capacity of 4TB each.

The resource management of the cluster is entirely based on Cloudera and uses YARN to deploy the applications. It provides the possibility to access data stored in Hive, which we found convenient for storing and maintaining the test data in our experiments. This is also a common way in which Apache Spark is used in practice.

For fine-tuning the parameters, we use the command line arguments provided by Spark to deploy the applications to YARN. In these tests, we tell Spark how many executors are to be spawned. The actual resource assignment is then left to Spark to provide conditions as close to the real-world setting as possible.

6.2. Test Data and Queries

For our tests with a real-world dataset, we use the freely available subset of the Inside Airbnb dataset as input, which contains accommodations listed on Airbnb over a certain time span. The chosen time span is 30 days, which yields approximately 1 million tuples. The tuples were downloaded from the Inside Airbnb website (Airbnb, 2022) and then merged while eliminating duplicates and string-based columns. For the complete variant of the dataset, we have eliminated all tuples containing a null value in at least one skyline dimension. Content-wise, this dataset is similar to the hotel example provided in Section 1. The use cases considered here are also similar and encompass finding the "best" listings according to the chosen dimensions. The relevant key (identifying) dimension and the 6 skyline dimensions can be found in Table 1. From this relation, we constructed skyline queries with 1 dimension, 2 dimensions, ..., 6 dimensions by selecting the dimensions in the same order as they appear in Table 1, i.e., the one-dimensional skyline query only uses the first skyline dimension (price) in the table, the two-dimensional query uses the first two dimensions (price, accommodates), etc.

Dimension             Type  Description
id                    KEY   identification number
price                 MIN   price for renting
accommodates          MAX   (max.) number of accommodated people
bedrooms              MAX   number of bedrooms
beds                  MAX   number of beds
number_of_reviews     MAX   number of reviews
review_scores_rating  MAX   total review score rating (all categories)

Table 1. Skyline dimensions in the Inside Airbnb dataset

We also carried out tests on a synthetic dataset, namely the store_sales table from the benchmark DSB (Ding et al., 2021). The skyline dimensions used in the benchmarks can be found in Table 2. The table has 2 identifying key dimensions and, again, 6 skyline dimensions. The selection of skyline dimensions to derive 6 skyline queries (with the number of skyline dimensions ranging from 1 to 6) is done exactly as described above for the Airbnb test case.

As far as the test data is concerned, we randomly generated data such that the total size of the table is approximately 15,000,000 tuples (note that data generation in DSB works by indicating the data volume in bytes, not tuples; in our case, we generated 1.5 GB of data). For the tests with varying data size, we simply select the first tuples from the table until the desired size of the dataset is reached. For the complete dataset, we only select tuples that contain no null value in any of the six potential skyline dimensions.

Dimension            Type  Description
ss_item_sk           KEY   stock item identifier
ss_ticket_number     KEY   ticket number identifier
ss_quantity          MAX   quantity purchased in sale
ss_wholesale_cost    MIN   wholesale cost
ss_list_price        MIN   list price
ss_sales_price       MIN   sales price
ss_ext_discount_amt  MAX   total discount given
ss_ext_sales_price   MIN   sum of sales price

Table 2. Skyline dimensions in the store_sales dataset

For both datasets, there exists a complete as well as an incomplete variant. The difference between them is that for the complete dataset all tuples that contain null values in at least one skyline dimension have been removed. For the real-world dataset, this means that the incomplete dataset is bigger than the complete variant while for the synthetic dataset, both variants have the same size.

6.3. Tested Algorithms

In total, we run our tests on up to four skyline algorithms:

  (1) The distributed algorithm for complete datasets (described in Section 5.6), which splits the skyline computation into local and global skyline computation; strictly speaking, only the local part is distributed (whence, parallelized).

  (2) The non-distributed complete algorithm, which completely gives up on parallelism, skips the local skyline computation, and immediately proceeds to the global skyline computation from the previous algorithm.

  (3) The distributed algorithm for incomplete datasets described in Section 5.7. Recall, however, that here, distribution is based on the occurrence of nulls in skyline dimensions, which severely restricts the potential of parallelism.

  (4) As the principal reference algorithm for our skyline algorithms, we run our tests also with the rewriting of skyline queries into plain SQL as described in Listing 4.

In our performance charts, we refer to these 4 algorithms as “distributed complete”, “non-distributed complete”, “distributed incomplete”, and “reference”. In all tests with complete datasets, we compare all four algorithms against each other. For incomplete datasets, the complete algorithms are not applicable; we thus only compare the remaining two algorithms in these cases.
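To make the division of labor between the local and the global phase of algorithm (1) concrete, the following is a minimal sketch in plain Scala on RDDs of numeric tuples, assuming that all dimensions are to be minimized; our actual implementation instead operates on Spark SQL’s internal rows and plan nodes.

import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

object DistributedSkylineSketch extends Serializable {
  // p dominates q: at least as good in every dimension, strictly better in one.
  def dominates(p: Array[Double], q: Array[Double]): Boolean =
    p.zip(q).forall { case (a, b) => a <= b } &&
      p.zip(q).exists { case (a, b) => a < b }

  // Block-nested-loop skyline of one block of tuples (the "window").
  def blockNestedLoop(tuples: Iterator[Array[Double]]): Iterator[Array[Double]] = {
    val window = ArrayBuffer.empty[Array[Double]]
    for (t <- tuples if !window.exists(w => dominates(w, t))) {
      // t survives; drop all window entries that t dominates, then add t.
      val survivors = window.filterNot(w => dominates(t, w))
      window.clear()
      window ++= survivors
      window += t
    }
    window.iterator
  }

  // Local skylines are computed per partition in parallel;
  // the global pass over their union is sequential.
  def skyline(data: RDD[Array[Double]]): Seq[Array[Double]] = {
    val localSkylines = data.mapPartitions(blockNestedLoop).collect()
    blockNestedLoop(localSkylines.iterator).toSeq
  }
}

In this sketch, the non-distributed complete algorithm (2) corresponds to calling blockNestedLoop on all tuples at once, skipping the mapPartitions step.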

6.4. Experimental Results

In our experimental evaluation, we want to find out how the following parameters affect the execution time:

  • number of skyline dimensions;

  • number of input tuples in the dataset;

  • number of executors used by Spark.

For the number of skyline dimensions, we have crafted queries with 1 – 6 dimensions. The number of executors used by Spark is chosen from 1, 2, 3, 5, 10. For the size of the dataset, we distinguish between the real-world (Airbnb) and the synthetic dataset (DSB): all tests with the real-world dataset are carried out with all tuples contained in the data, i.e., ca. 1,200,000 tuples if nulls are allowed and ca. 820,000 tuples after removing all tuples with a null in one of the skyline dimensions. In contrast, the synthetic dataset is randomly generated and we choose subsets of size $10^6$, $2\cdot 10^6$, $5\cdot 10^6$, and $10^7$, both for the complete and the incomplete dataset.

We have carried out tests with virtually all combinations of the above-mentioned value ranges for the three parameters under investigation, resulting in a big collection of test cases: complete vs. incomplete data, 1 – 6 skyline dimensions, “all” tuples (for the Airbnb data) vs. varying between $10^6$, $2\cdot 10^6$, $5\cdot 10^6$, and $10^7$ tuples (for the DSB data), and between 1 and 10 executors. In all these cases, we ran 4 algorithms (in the case of complete data) or 2 algorithms (for incomplete data) as explained in Section 6.3. Below we report on a representative subset of our experiments. The runtime measurements are shown in Figures 3–7. Further performance charts are provided in Appendix C.

In our experiments, we have defined a timeout of 3600 seconds. Timeouts are “visualized” in our performance charts by missing data points. Actually, timeouts did occur quite frequently – especially with incomplete data and here, in particular, with the “reference” algorithm. Therefore, in Figures 3–7, for the case of incomplete data (always the plot on the right-hand side), we only show the results for datasets smaller than $10^7$ tuples. In contrast, for complete data (shown in the plots on the left-hand side), we typically scale up to the maximum size of $10^7$ tuples.

Figure 3. Number of dimensions vs. execution time on the Inside Airbnb dataset
Figure 4. Number of dimensions vs. execution time on the store_sales dataset
Figure 5. Number of input tuples vs. execution time on the store_sales dataset
Figure 6. Number of executors vs. execution time on the Inside Airbnb dataset
Figure 7. Number of executors vs. execution time on the store_sales dataset

The plots for the impact of the number of skyline dimensions on the execution time can be found in Figure 3 for the real-world dataset and Figure 4 for the synthetic dataset. Of course, if the number of dimensions increases, the dominance checks get slightly more costly. But the most important effect on the execution time is via the size of the skyline (in particular, of the intermediate skyline in the window).

Increasing the number of dimensions can have two opposing effects. First, adding a dimension can make tuples incomparable, where previously one tuple dominated the other. In this case, the size of the skyline increases due to the additional skyline dimension. Second, however, it may also happen that two tuples had identical values in the previously considered dimensions and the additional dimension introduces a dominance relation between these tuples. In this case, the size of the skyline decreases.
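Both effects can be seen on a toy example. The following self-contained snippet (our illustration of the dominance definition, assuming minimization in all dimensions) shows a dominated pair becoming incomparable, and a tied pair becoming comparable, once a third dimension is added.

object DimensionEffects extends App {
  // p dominates q: at least as good in every dimension, strictly better in one.
  def dominates(p: Seq[Int], q: Seq[Int]): Boolean =
    p.zip(q).forall { case (a, b) => a <= b } &&
      p.zip(q).exists { case (a, b) => a < b }

  // Effect 1: an extra dimension makes a dominated tuple incomparable.
  println(dominates(Seq(1, 3), Seq(1, 5)))       // true:  (1,3) dominates (1,5)
  println(dominates(Seq(1, 3, 4), Seq(1, 5, 2))) // false: now incomparable -> skyline grows

  // Effect 2: an extra dimension breaks a tie between identical tuples.
  println(dominates(Seq(1, 5), Seq(1, 5)))       // false: neither dominates -> both in skyline
  println(dominates(Seq(1, 5, 2), Seq(1, 5, 4))) // true:  dominance introduced -> skyline shrinks
}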

For the real-world data (Figure 3), the execution time tends to increase with the number of skyline dimensions. This is, in particular, the case for the reference algorithm. In other words, here we see the first possible effect of increasing the number of dimensions. In contrast, for the synthetic dataset, we also see the second possible effect. This effect is best visible for the reference algorithm in the left plot in Figure 4. Apparently, there are many tuples with the maximal value in the first dimension (= ss_quantity), which become distinguishable by the second dimension (= ss_wholesale_cost). For the right plot in Figure 4, it should be noted that the tests were carried out with a 10 times smaller dataset to avoid timeouts, which makes the test results less robust. The peak in the case of two dimensions is probably due to “unfavorable” partitioning of tuples causing some of the local skylines to get overly big.

At any rate, when comparing the algorithms, we note that the specialized algorithms are faster in almost all cases and scale significantly better than the “reference” algorithm (in particular, in the case of the real-world data). For complete data, the “distributed complete” algorithm performs best. The discrepancy between the best specialized algorithm and the reference algorithm is over 50% in most cases and can reach ca. 80% (complete synthetic data, 5 dimensions) and even over 95% (complete synthetic data, 1 dimension).

The impact of the size of the dataset on the execution time is shown in Figure 5. Recall that we consider the size of the real-world dataset as fixed. Hence, the experiments with varying data size were only done with the synthetic data. Of course, the execution time increases with the size of the dataset, since a larger dataset also increases the number of dominance checks needed. The exact increase depends on the algorithm used; it is particularly dramatic for the “reference” algorithm, which even reaches the timeout threshold as bigger datasets are considered. Again, the “distributed complete” algorithm, when applicable, performs best and, in almost all cases, all specialized skyline algorithms outperform the “reference” algorithm. The discrepancy between the best specialized algorithm and the reference algorithm reaches ca. 60% for incomplete data and over 80% for complete data (for $5\cdot 10^6$ tuples). For the biggest datasets ($10^7$ tuples), the reference algorithm even times out.

The impact of the number of executors on the performance is shown in Figure 6 for the real-world dataset and in Figure 7 for the synthetic dataset. Ideally, we want the algorithms to be able to use an arbitrary number of executors productively. Here, we use the term “executors” since it is the parameter by which we can instruct Spark how many instances of the code it should run in parallel.

We observe that the benefit of additional executors tapers off after a certain number of executors has been reached; the smaller the dataset, the sooner this point is reached. This behavior, especially in the case of the distributed algorithms, can be explained as follows: as we increase the parallelism of the local skyline computation, the portion of data contained in each partition becomes smaller; this means that more and more dominance relationships get lost and fewer tuples are eliminated from the local skylines. Hence, more work is left to the global skyline computation, which allows for little to no parallelism and, thus, becomes the bottleneck of the entire computation. In other words, there is a sweet spot in terms of the amount of parallelism relative to the size of the input data.
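A rough back-of-the-envelope model (our simplification, not a formal cost analysis from this paper) makes this trade-off explicit: for $k$ executors over an evenly partitioned input $P$, the overall time behaves roughly like

$$T(k) \;\approx\; T_{\mathrm{local}}\big(\lceil |P|/k \rceil\big) \;+\; T_{\mathrm{global}}\Big(\sum_{i=1}^{k} |S_i|\Big),$$

where $S_i$ denotes the local skyline of partition $i$. Increasing $k$ shrinks the first term, but smaller partitions eliminate fewer tuples, so $\sum_i |S_i|$ tends to grow and the (sequential) second term eventually dominates.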

Note that the “reference” algorithm is also able to make (limited) use of parallelism. Indeed, the execution plan generated by Spark from the plain SQL query is still somewhat distributed, albeit not as much as the truly distributed and specialized approaches. As in the previous experiments, the “reference” algorithm never outperforms any of the specialized algorithms – not even the non-distributed or quasi-non-distributed (incomplete algorithm on complete dataset) approaches. The discrepancy between the best specialized algorithm and the reference algorithm is consistently above 50% and may even go up to 90% (complete synthetic data, 10 executors).

6.5. Further Measurements and Experiments

Our main concern with the experiments reported in Section 6.4 was to measure the execution time of our integrated skyline algorithms depending on several factors (number of dimensions/tuples/executors). Apart from the execution time, we made further measurements in our experiments. In particular, we were interested in the memory consumption of the various algorithms in the various settings. In short, the results were unsurprising: the number of dimensions does not influence the memory consumption, while the number of executors and, even more so, the number of input tuples do. In the case of the executors, this is mainly due to the fact that each executor loads its entire execution environment, including the Java runtime, into main memory. In the case of the number of tuples, of course, more memory is needed if more tuples are loaded. Importantly, we did not observe any significant difference in memory consumption between the four algorithms studied here. For the sake of completeness, we provide plots with the memory consumption in various scenarios in Appendix C.

The test setup described in Section 6.2 was intentionally minimalistic in the sense that we applied skyline queries to base tables of the database rather than to the result of (possibly complex) SQL queries. This allowed us to focus on the behavior of our new skyline algorithms vs. the reference queries without having to worry about side-effects of the overall query evaluation. But, of course, it is also interesting to test the behavior of skyline queries on top of complex queries. To this end, we have carried out experiments on the MusicBrainz database (Foundation, 2022) with more complex queries including joins and aggregates. Due to lack of space, the detailed results are given in Appendix E. In a nutshell, the results are quite similar to the ones reported in Section 6.4:

  • The execution time of the reference solution is almost always above that of the specialized algorithms. The only cases where the reference solution performs best are the easiest ones, with execution times below 50 seconds. For the harder cases, it is always slower and – in contrast to the specialized algorithms – causes several timeouts.

  • The memory consumption is comparable for all 4 algorithms. There are, however, some peaks in the case of the reference algorithm that do not occur with the specialized algorithms. In general, the behavior of the reference algorithm seems to be less stable than that of the other algorithms. For instance, with 4, 5, or 6 dimensions, the execution time jumps from below 50 seconds for 3 executors to over 1000 seconds for 5 executors and then slightly decreases again for 10 executors.

  • It is also notable that skyline queries rewritten to plain SQL are much less readable and less intuitive in the context of the entire query.

6.6. Summary of Results

From our experimental evaluation, we conclude that, in almost all cases, our specialized algorithms outperform the “reference” algorithm and also provide better scalability. Parallelization (if applicable) has proved profitable – but only up to a certain point that depends on the size of the data. Moreover, it has become clear that using the incomplete distributed algorithm on a complete dataset is not advisable and may perform worse than the non-distributed algorithm. As such, the introduction of additional syntax to select an appropriate algorithm through the keyword COMPLETE may help to significantly boost the performance.

7. Conclusion and Future Work

In this work, we have presented an extension of Spark SQL with the skyline operator. It provides users with a simple, easy-to-use syntax for formulating skyline queries. To the best of our knowledge, this is the first distributed data processing system with native support for skyline queries. Through our extension of the various components of Spark SQL query execution and, in particular, of the Catalyst optimizer, we have obtained an implementation that clearly outperforms the formulation of skyline queries in plain SQL. Nevertheless, there is still ample space for future enhancements:

So far, we have implemented only the Block-Nested-Loop skyline algorithm and variants thereof. It would be interesting to implement additional algorithms based on other paradigms like ordering (Chomicki et al., 2003, 2005; Godfrey et al., 2005, 2007; Bartolini et al., 2006, 2008; Liu and Li, 2020) or index structures (Tan et al., 2001; Lee and Hwang, 2010a, b) and to evaluate their strengths and weaknesses in the Apache Spark context. Also, for the partitioning scheme, further options such as angle-based partitioning (Kim and Kim, 2018; Vlachou et al., 2008) are worth trying. Further specialized algorithms that require a deeper modification of Spark (such as Z-order partitioning (Tang et al., 2019), which requires the computation of a Z-address for each tuple (Lee et al., 2010)) are more long-term projects.

For our skyline algorithm that can cope with (potentially) incomplete datasets, we have chosen a straightforward extension of the Block-Nested-Loop algorithm. As mentioned in Section 5.7, this may severely limit the potential for parallelism. Moreover, in the worst case, the algorithm may thus have to compare each tuple against every other tuple. Clearly, a more sophisticated algorithm for potentially incomplete datasets would be highly desirable.

We have included rule-based optimizations in our integration of skyline queries. The support for cost-based optimization in Spark SQL is still rudimentary as of now. Fully integrating skyline queries into a future cost-based optimizer will be an important, highly non-trivial research and development project. However, as soon as further skyline algorithms are implemented, a light-weight form of cost-based optimization should be added that selects the best-suited skyline algorithm for a particular query.

Finally, from the user perspective, the integration into different Spark modules such as structured streaming would be desirable.

To conclude, the goal of this work was a full integration of skyline queries into Spark SQL. The favorable experimental results with simple skyline algorithms (Block-Nested-Loops and variants thereof) demonstrate that an integrated solution is clearly superior to a rewriting of the query on SQL level. To leverage the full potential of the integrated solution, the implementation of further (more sophisticated) skyline algorithms is the most important task for future work. It should be noted that, in our implementation, we have paid particular attention to a modular structure of our new software so that the implementation of further algorithms is possible without having to worry about Spark SQL as a whole. Moreover, our code is provided under Apache Spark’s own open-source license, and we explicitly invite other groups to join in this effort of enabling convenient and efficient skyline queries in Spark SQL.

Acknowledgements.
This work was supported by the Austrian Science Fund (FWF): P30930 and by the Vienna Science and Technology Fund (WWTF) grant VRG18-013.

References

  • Airbnb (2022) Inside Airbnb. 2022. Inside Airbnb. Retrieved January 19, 2022 from http://insideairbnb.com/get-the-data.html
  • Alrifai et al. (2010) Mohammad Alrifai, Dimitrios Skoutas, and Thomas Risse. 2010. Selecting skyline services for QoS-based web service composition. In Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, Michael Rappa, Paul Jones, Juliana Freire, and Soumen Chakrabarti (Eds.). ACM, 11–20. https://doi.org/10.1145/1772690.1772693
  • Bartolini et al. (2006) Ilaria Bartolini, Paolo Ciaccia, and Marco Patella. 2006. SaLSa: computing the skyline without scanning the whole sky. In Proceedings of the 2006 ACM CIKM International Conference on Information and Knowledge Management, Arlington, Virginia, USA, November 6-11, 2006, Philip S. Yu, Vassilis J. Tsotras, Edward A. Fox, and Bing Liu (Eds.). ACM, 405–414. https://doi.org/10.1145/1183614.1183674
  • Bartolini et al. (2008) Ilaria Bartolini, Paolo Ciaccia, and Marco Patella. 2008. Efficient sort-based skyline evaluation. ACM Trans. Database Syst. 33, 4 (2008), 31:1–31:49. https://doi.org/10.1145/1412331.1412343
  • Börzsönyi et al. (2001) Stephan Börzsönyi, Donald Kossmann, and Konrad Stocker. 2001. The Skyline Operator. In Proceedings of the 17th International Conference on Data Engineering, April 2-6, 2001, Heidelberg, Germany, Dimitrios Georgakopoulos and Alexander Buchmann (Eds.). IEEE Computer Society, 421–430. https://doi.org/10.1109/ICDE.2001.914855
  • Carey and Kossmann (1997) Michael J. Carey and Donald Kossmann. 1997. On Saying "Enough Already!" in SQL. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA, Joan Peckham (Ed.). ACM Press, 219–230. https://doi.org/10.1145/253260.253302
  • Chen et al. (2007) Bee-Chung Chen, Raghu Ramakrishnan, and Kristen LeFevre. 2007. Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge. In Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, Christoph Koch, Johannes Gehrke, Minos N. Garofalakis, Divesh Srivastava, Karl Aberer, Anand Deshpande, Daniela Florescu, Chee Yong Chan, Venkatesh Ganti, Carl-Christian Kanne, Wolfgang Klas, and Erich J. Neuhold (Eds.). ACM, 770–781. http://www.vldb.org/conf/2007/papers/research/p770-chen.pdf
  • Chen and Lian (2008) Lei Chen and Xiang Lian. 2008. Dynamic skyline queries in metric spaces. In EDBT 2008, 11th International Conference on Extending Database Technology, Nantes, France, March 25-29, 2008, Proceedings (ACM International Conference Proceeding Series), Alfons Kemper, Patrick Valduriez, Noureddine Mouaddib, Jens Teubner, Mokrane Bouzeghoub, Volker Markl, Laurent Amsaleg, and Ioana Manolescu (Eds.), Vol. 261. ACM, 333–343. https://doi.org/10.1145/1353343.1353386
  • Chomicki et al. (2013) Jan Chomicki, Paolo Ciaccia, and Niccolò Meneghetti. 2013. Skyline queries, front and back. SIGMOD Rec. 42, 3 (2013), 6–18. https://doi.org/10.1145/2536669.2536671
  • Chomicki et al. (2003) Jan Chomicki, Parke Godfrey, Jarek Gryz, and Dongming Liang. 2003. Skyline with Presorting. In Proceedings of the 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India, Umeshwar Dayal, Krithi Ramamritham, and T. M. Vijayaraman (Eds.). IEEE Computer Society, 717–719. https://doi.org/10.1109/ICDE.2003.1260846
  • Chomicki et al. (2005) Jan Chomicki, Parke Godfrey, Jarek Gryz, and Dongming Liang. 2005. Skyline with Presorting: Theory and Optimizations. In Intelligent Information Processing and Web Mining, Proceedings of the International IIS: IIPWM’05 Conference held in Gdansk, Poland, June 13-16, 2005 (Advances in Soft Computing), Mieczyslaw A. Klopotek, Slawomir T. Wierzchon, and Krzysztof Trojanowski (Eds.), Vol. 31. Springer, 595–604. https://doi.org/10.1007/3-540-32392-9_72
  • Cui et al. (2008) Bin Cui, Hua Lu, Quanqing Xu, Lijiang Chen, Yafei Dai, and Yongluan Zhou. 2008. Parallel Distributed Processing of Constrained Skyline Queries by Filtering. In Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7-12, 2008, Cancún, Mexico, Gustavo Alonso, José A. Blakeley, and Arbee L. P. Chen (Eds.). IEEE Computer Society, 546–555. https://doi.org/10.1109/ICDE.2008.4497463
  • Dellis and Seeger (2007) Evangelos Dellis and Bernhard Seeger. 2007. Efficient Computation of Reverse Skyline Queries. In Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, Christoph Koch, Johannes Gehrke, Minos N. Garofalakis, Divesh Srivastava, Karl Aberer, Anand Deshpande, Daniela Florescu, Chee Yong Chan, Venkatesh Ganti, Carl-Christian Kanne, Wolfgang Klas, and Erich J. Neuhold (Eds.). ACM, 291–302. http://www.vldb.org/conf/2007/papers/research/p291-dellis.pdf
  • Ding et al. (2021) Bailu Ding, Surajit Chaudhuri, Johannes Gehrke, and Vivek Narasayya. 2021. DSB: A Decision Support Benchmark for Workload-Driven and Traditional Database Systems. Proc. VLDB Endow. 14, 13 (sep 2021), 3376–3388. https://doi.org/10.14778/3484224.3484234
  • Eder (2009) Hannes Eder. 2009. On Extending PostgreSQL with the Skyline Operator. Master’s thesis. Fakultät für Informatik der Technischen Universität Wien.
  • Foundation (2022) MetaBrainz Foundation. 2022. MusicBrainz. Retrieved August 30, 2022 from https://musicbrainz.org/
  • Godfrey et al. (2005) Parke Godfrey, Ryan Shipley, and Jarek Gryz. 2005. Maximal Vector Computation in Large Data Sets. In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005, Klemens Böhm, Christian S. Jensen, Laura M. Haas, Martin L. Kersten, Per-Åke Larson, and Beng Chin Ooi (Eds.). ACM, 229–240. http://www.vldb.org/archives/website/2005/program/paper/tue/p229-godfrey.pdf
  • Godfrey et al. (2007) Parke Godfrey, Ryan Shipley, and Jarek Gryz. 2007. Algorithms and analyses for maximal vector computation. VLDB J. 16, 1 (2007), 5–28. https://doi.org/10.1007/s00778-006-0029-7
  • Grasmann et al. (2022) Lukas Grasmann, Reinhard Pichler, and Alexander Selzer. 2022. Integration of Skyline Queries into Spark SQL - Experiment Artifacts. Zenodo. https://doi.org/10.5281/zenodo.7157840
  • Gulzar et al. (2019) Yonis Gulzar, Ali Amer Alwan, and Sherzod Turaev. 2019. Optimizing Skyline Query Processing in Incomplete Data. IEEE Access 7 (2019), 178121–178138. https://doi.org/10.1109/ACCESS.2019.2958202
  • Hose and Vlachou (2012) Katja Hose and Akrivi Vlachou. 2012. A survey of skyline processing in highly distributed environments. VLDB J. 21, 3 (2012), 359–384. https://doi.org/10.1007/s00778-011-0246-6
  • Huang and Jensen (2004) Xuegang Huang and Christian S. Jensen. 2004. In-Route Skyline Querying for Location-Based Services. In Web and Wireless Geographical Information Systems, 4th InternationalWorkshop, W2GIS 2004, Goyang, Korea, November 2004, Revised Selected Papers (Lecture Notes in Computer Science), Yong Jin Kwon, Alain Bouju, and Christophe Claramunt (Eds.), Vol. 3428. Springer, 120–135. https://doi.org/10.1007/11427865_10
  • Kalyvas and Tzouramanis (2017) Christos Kalyvas and Theodoros Tzouramanis. 2017. A Survey of Skyline Query Processing. CoRR abs/1704.01788 (2017), 127. arXiv:1704.01788 http://arxiv.org/abs/1704.01788
  • Kanellis (2020) Aki Kanellis. 2020. SkySpark. Retrieved November 10, 2020 from https://github.com/AkiKanellis/skyspark
  • Kim and Kim (2018) Junsu Kim and Myoung Ho Kim. 2018. An efficient parallel processing method for skyline queries in MapReduce. J. Supercomput. 74, 2 (2018), 886–935. https://doi.org/10.1007/s11227-017-2171-y
  • Kossmann et al. (2002) Donald Kossmann, Frank Ramsak, and Steffen Rost. 2002. Shooting Stars in the Sky: An Online Algorithm for Skyline Queries. In Proceedings of 28th International Conference on Very Large Data Bases, VLDB 2002, Hong Kong, August 20-23, 2002. Morgan Kaufmann, 275–286. https://doi.org/10.1016/B978-155860869-6/50032-9
  • Kriegel et al. (2010) Hans-Peter Kriegel, Matthias Renz, and Matthias Schubert. 2010. Route skyline queries: A multi-preference path planning approach. In Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA, Feifei Li, Mirella M. Moro, Shahram Ghandeharizadeh, Jayant R. Haritsa, Gerhard Weikum, Michael J. Carey, Fabio Casati, Edward Y. Chang, Ioana Manolescu, Sharad Mehrotra, Umeshwar Dayal, and Vassilis J. Tsotras (Eds.). IEEE Computer Society, 261–272. https://doi.org/10.1109/ICDE.2010.5447845
  • Lee and Hwang (2010a) Jongwuk Lee and Seung-won Hwang. 2010a. BSkyTree: scalable skyline computation using a balanced pivot selection. In EDBT 2010, 13th International Conference on Extending Database Technology, Lausanne, Switzerland, March 22-26, 2010, Proceedings (ACM International Conference Proceeding Series), Ioana Manolescu, Stefano Spaccapietra, Jens Teubner, Masaru Kitsuregawa, Alain Léger, Felix Naumann, Anastasia Ailamaki, and Fatma Özcan (Eds.), Vol. 426. ACM, 195–206. https://doi.org/10.1145/1739041.1739067
  • Lee and Hwang (2010b) Jongwuk Lee and Seung-won Hwang. 2010b. BSkyTree: scalable skyline computation using a balanced pivot selection. In EDBT 2010, 13th International Conference on Extending Database Technology, Lausanne, Switzerland, March 22-26, 2010, Proceedings (ACM International Conference Proceeding Series), Ioana Manolescu, Stefano Spaccapietra, Jens Teubner, Masaru Kitsuregawa, Alain Léger, Felix Naumann, Anastasia Ailamaki, and Fatma Özcan (Eds.), Vol. 426. ACM, 195–206. https://doi.org/10.1145/1739041.1739067
  • Lee et al. (2010) Ken C. K. Lee, Wang-Chien Lee, Baihua Zheng, Huajing Li, and Yuan Tian. 2010. Z-SKY: an efficient skyline query processing framework based on Z-order. VLDB J. 19, 3 (2010), 333–362. https://doi.org/10.1007/s00778-009-0166-x
  • Lee et al. (2007) Ken C. K. Lee, Baihua Zheng, Huajing Li, and Wang-Chien Lee. 2007. Approaching the Skyline in Z Order. In Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, Christoph Koch, Johannes Gehrke, Minos N. Garofalakis, Divesh Srivastava, Karl Aberer, Anand Deshpande, Daniela Florescu, Chee Yong Chan, Venkatesh Ganti, Carl-Christian Kanne, Wolfgang Klas, and Erich J. Neuhold (Eds.). ACM, 279–290. http://www.vldb.org/conf/2007/papers/research/p279-lee.pdf
  • Lin et al. (2011) Xin Lin, Jianliang Xu, and Haibo Hu. 2011. Authentication of location-based skyline queries. In Proceedings of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October 24-28, 2011, Craig Macdonald, Iadh Ounis, and Ian Ruthven (Eds.). ACM, 1583–1588. https://doi.org/10.1145/2063576.2063805
  • Liu and Li (2020) Rui Liu and Dominique Li. 2020. Efficient Skyline Computation in High-Dimensionality Domains. In Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020, Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Böhm, Dan Olteanu, George H. L. Fletcher, Arijit Khan, and Bin Yang (Eds.). OpenProceedings.org, 459–462. https://doi.org/10.5441/002/edbt.2020.57
  • Morse et al. (2007) Michael D. Morse, Jignesh M. Patel, and H. V. Jagadish. 2007. Efficient Skyline Computation over Low-Cardinality Domains. In Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, Christoph Koch, Johannes Gehrke, Minos N. Garofalakis, Divesh Srivastava, Karl Aberer, Anand Deshpande, Daniela Florescu, Chee Yong Chan, Venkatesh Ganti, Carl-Christian Kanne, Wolfgang Klas, and Erich J. Neuhold (Eds.). ACM, 267–278. http://www.vldb.org/conf/2007/papers/research/p267-morse.pdf
  • Papadias et al. (2003) Dimitris Papadias, Yufei Tao, Greg Fu, and Bernhard Seeger. 2003. An Optimal and Progressive Algorithm for Skyline Queries. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, Alon Y. Halevy, Zachary G. Ives, and AnHai Doan (Eds.). ACM, 467–478. https://doi.org/10.1145/872757.872814
  • Papadias et al. (2005) Dimitris Papadias, Yufei Tao, Greg Fu, and Bernhard Seeger. 2005. Progressive skyline computation in database systems. ACM Trans. Database Syst. 30, 1 (2005), 41–82. https://doi.org/10.1145/1061318.1061320
  • Papanikolaou (2018) Ioanna Papanikolaou. 2018. Distributed Algorithms for Skyline Computation using Apache Spark. Master’s thesis. International Hellenic University. Retrieved January 23, 2022 from https://repository.ihu.edu.gr/xmlui/handle/11544/29524
  • Park et al. (2013) Yoonjae Park, Jun-Ki Min, and Kyuseok Shim. 2013. Parallel Computation of Skyline and Reverse Skyline Queries Using MapReduce. Proc. VLDB Endow. 6, 14 (2013), 2002–2013. https://doi.org/10.14778/2556549.2556580
  • Sheng and Tao (2012) Cheng Sheng and Yufei Tao. 2012. Worst-Case I/O-Efficient Skyline Algorithms. ACM Trans. Database Syst. 37, 4 (2012), 26:1–26:22. https://doi.org/10.1145/2389241.2389245
  • Tan et al. (2001) Kian-Lee Tan, Pin-Kwang Eng, and Beng Chin Ooi. 2001. Efficient Progressive Skyline Computation. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass (Eds.). Morgan Kaufmann, 301–310. http://www.vldb.org/conf/2001/P301.pdf
  • Tang et al. (2019) Mingjie Tang, Yongyang Yu, Walid G. Aref, Qutaibah M. Malluhi, and Mourad Ouzzani. 2019. Efficient Parallel Skyline Query Processing for High-Dimensional Data. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019. IEEE, 2113–2114. https://doi.org/10.1109/ICDE.2019.00251
  • Vlachou et al. (2008) Akrivi Vlachou, Christos Doulkeridis, and Yannis Kotidis. 2008. Angle-based space partitioning for efficient parallel skyline computation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, Jason Tsong-Li Wang (Ed.). ACM, 227–238. https://doi.org/10.1145/1376616.1376642
  • Zaharia et al. (2010) Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In 2nd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud’10, Boston, MA, USA, June 22, 2010, Erich M. Nahum and Dongyan Xu (Eds.). USENIX Association, 10. https://www.usenix.org/conference/hotcloud-10/spark-cluster-computing-working-sets
  • Zaharia et al. (2016) Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: a unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65. https://doi.org/10.1145/2934664
  • Zhang et al. (2009) Shiming Zhang, Nikos Mamoulis, and David W. Cheung. 2009. Scalable skyline computation using object-based space partitioning. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, Ugur Çetintemel, Stanley B. Zdonik, Donald Kossmann, and Nesime Tatbul (Eds.). ACM, 483–494. https://doi.org/10.1145/1559845.1559897

Appendix A Problems with Incomplete Datasets

Recall from Section 3 that, in the case of incomplete data, cyclic dominance relationships may arise. Therefore, care is required when deleting dominated tuples, since premature deletion of dominated tuples may prevent us from identifying other tuples as being dominated. Below, we illustrate this problem by revisiting the skyline algorithm from (Gulzar et al., 2019), which behaves incorrectly in the presence of cyclic dominance relationships.

The algorithm in (Gulzar et al., 2019) divides the dataset into $n$ clusters (= partitions) according to the occurrence of nulls in skyline dimensions and computes the local skyline in each cluster. It then arranges these clusters, i.e., $C_1$ through $C_n$, in some order and computes the global skyline by visiting the clusters in this order and carrying out dominance checks for the data points in the local skylines of all clusters. For each data point, the following two steps are applied:

  • For the current data point $p$ of the iteration, we check for every not yet deleted data point $q$ in all subsequent clusters whether one dominates the other. For instance, if $C_i$ is the current cluster, we check all clusters $C_j$ with $j>i$. Checking within the same cluster is not necessary since dominated data points have already been removed when computing the local skyline. If dominance is detected, there are two scenarios:

    • If $p\prec q$, then $q$ is immediately eliminated.

    • If $q\prec p$, then a domination flag is set for $p$.

  • If, after checking against all data points $q$, the domination flag is set for $p$, then $p$ is subsequently also eliminated from the candidate list, since it is dominated by another point.


Counterexample. We now argue that the approach of (Gulzar et al., 2019) to compute the global skyline as described above is incorrect. To show this, we give a counterexample with simple data where applying the algorithm does not yield a correct skyline. We thus revisit the example that was used in Section 3 to illustrate the existence of dominance cycles in the case of incomplete data: suppose that we have three skyline dimensions and consider the following three data points: $a=(1,*,10)$, $b=(3,2,*)$, and $c=(*,5,3)$. Here, “*” indicates a null value.

As already discussed, we know that $a\prec b$, $b\prec c$, and $c\prec a$ in this example under the assumption that all dimensions are minimized in the skyline. From the definition of the dominance relation and the skyline for incomplete data, it follows that the skyline of these values should be empty as every single tuple is dominated by another tuple. We now “execute” the above algorithm to check whether we get the same result.

  • For clustering, the tuples correspond to the bitmaps 101, 110, and 011 respectively. It follows that every tuple is in its own cluster according to the algorithm.

  • Clearly, each tuple is contained in its local skyline since each cluster contains only a single element.

This leaves the global skyline computation as the only step where tuples are eliminated. We now assume w.l.o.g. that $a$ belongs to $C_1$, $b$ belongs to $C_2$, and $c$ belongs to $C_3$ as per the algorithm introduced in (Gulzar et al., 2019). It follows that the algorithm starts by comparing $a$ from $C_1$ to $b$, which is the only element in $C_2$. Since $a\prec b$, we remove $b$ from $C_2$, after which $C_2$ is empty. Next, we go to the cluster $C_3$ and compare $a$ to $c$. Since $c\prec a$, we set the domination flag for $a$. No further clusters and tuples remain. Therefore, $a$ is deleted from $C_1$ since the domination flag is set. For the next iteration, we check the cluster $C_2$. Since $C_2$ is now empty, no comparisons remain for this cluster. We move on to the last cluster $C_3$. Since there are no further clusters, we note that $C_3$ still contains $c$ after these steps. Hence, the algorithm returns tuple $c$ as part of the global skyline, which contradicts our expected result of an empty skyline.

Correct Skyline Computation. As detailed in Section 5.7, we take over from (Gulzar et al., 2019) the idea of partitioning the data according to the occurrence of nulls for the local skyline computation of incomplete data. However, in our global skyline computation for incomplete data, we compare all pairs of tuples with each other and set a dominance flag if one tuple is dominated. Only after all pairs have been processed in this way do we actually delete the tuples for which the dominance flag is set. In this way, we can guarantee that all dominance relationships are found even if they are cyclic.
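The following compact sketch (an illustration in plain Scala, not our Spark implementation) applies the common-dimension dominance relation from Section 3 and the flag-based two-phase elimination to the counterexample above and, as expected, yields an empty skyline.

object IncompleteSkylineSketch extends App {
  type Rec = Seq[Option[Int]] // None encodes a null ("*")

  // q dominates p when, restricted to the dimensions where both are non-null,
  // q is at least as good everywhere and strictly better at least once.
  def dominates(q: Rec, p: Rec): Boolean = {
    val common = q.zip(p).collect { case (Some(x), Some(y)) => (x, y) }
    common.nonEmpty &&
      common.forall { case (x, y) => x <= y } &&
      common.exists { case (x, y) => x < y }
  }

  val a: Rec = Seq(Some(1), None, Some(10))
  val b: Rec = Seq(Some(3), Some(2), None)
  val c: Rec = Seq(None, Some(5), Some(3))
  val candidates = Seq(a, b, c)

  // Phase 1: only mark dominated tuples; Phase 2: delete them all at once.
  val flagged = candidates.filter(p => candidates.exists(q => dominates(q, p)))
  val skyline = candidates.filterNot(flagged.contains)
  println(skyline) // List() -- the expected empty skyline
}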

Appendix B Spark SQL Error When Sorting on Aggregates With Filter

While integrating skyline queries into Apache Spark, we noticed that aggregates are sometimes not resolved correctly by the default rules of Spark. This occurs, for example, when an Aggregate (introduced by a GROUP BY clause) is used in combination with a Filter (introduced by a HAVING clause) as input to the Skyline node. The aggregates are then not resolved correctly since they are either not introduced into the Aggregate node or lost during a projection after the Filter. As part of the implementation, we have taken preventive measures such that these queries can be resolved correctly for skyline queries.

object PreventPrematureProjections extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
    // Match a Sort over a Project over a Filter (HAVING) over an Aggregate.
    case sort @ Sort(_, _,
        project @ Project(_,
          filter @ Filter(_,
            aggregate: Aggregate)))
        if filter.resolved && aggregate.resolved =>
      // Temporarily remove the premature Project between Sort and Filter.
      val newSort = sort.copy(child = filter.copy(child = aggregate))

      val newSortMaybeResolved = ResolveAggregateFunctions
        .apply(newSort)
        .asInstanceOf[Sort]

      // Keep the rewritten plan only if it resolved additional attributes;
      // the Project is then reintroduced as the parent node.
      if (!newSortMaybeResolved.equals(newSort)) {
        project.copy(child = newSortMaybeResolved)
      } else {
        sort
      }
  }
}
Listing 9. Prevent premature Project node in plan

Note, however, that this problem is not skyline-specific: a similar (erroneous) behavior in the Sort node leads to errors with queries in standard Apache Spark SQL. Below we discuss how this error can be fixed for standard Apache Spark SQL; the bugfix in the case of the skyline extension is similar. First, we eliminate Project nodes that may be introduced as part of the Filter caused by a HAVING clause in the query. To achieve this, we introduce a new analyzer rule, which can be found in Listing 9.

The rule first matches the plan shape described above using Scala’s pattern-matching syntax. It then rewrites the plan to exclude the premature Project node and re-runs ResolveAggregateFunctions on the result. Only if this allows additional attributes to be resolved do we return the changed plan, with the Project reintroduced as the “parent” node; otherwise, we return the old plan as-is.

case Sort(sortOrder, global, filter @ Filter(_, agg: Aggregate)) =>
  // Try to resolve the ordering expressions against the Aggregate's output.
  val maybeResolved = sortOrder
    .map(_.child)
    .map(resolveExpressionByPlanOutput(_, agg))
  resolveOperatorWithAggregate(maybeResolved, agg, (newExprs, newChild) => {
    // Build new sort orders from the (potentially) resolved expressions.
    val newSortOrder = sortOrder.zip(newExprs).map {
      case (order, expr) => order.copy(child = expr)
    }
    Sort(newSortOrder, global, filter.copy(child = newChild))
  })
Listing 10. Extension of ResolveAggregateFunctions

In addition to the new rule introduced above, we also extend the ResolveAggregateFunctions rule such that it can handle plans consisting of Sort, Filter, and Aggregate. This is necessary since, otherwise, the aggregates from the Aggregate node may not be resolved correctly in the Sort node. To achieve this, we modify the existing source code of Spark to fit our needs. The modifications can be found in Listing 10.

This code first takes all ordering expressions and tries to resolve them against the output of the Aggregate node. The (potentially) resolved expressions are then inserted into a new list of sort orders, which is used to create a new Sort node that is (more) resolved than before.

Appendix C Additional Benchmarks

In this section, we provide additional benchmarks and performance data that were left out of Section 6 due to lack of space. First, we look at the peak memory consumption across all nodes in Figure 8 for the real-world dataset and in Figures 9 and 10 for the synthetic dataset. As in Section 6, the measurements for complete datasets are shown on the left-hand side, while the results for incomplete datasets are shown on the right-hand side.

In Figures 8 and 9, we are interested in the impact of the number of executors on the memory consumption, while the plots in Figure 10 are about the relationship between data size and memory consumption. In Figures 8 and 9, we observe that the memory consumption mainly increases with the number of executors – but with a considerable amount of jitter. The increase in memory consumption is the expected behavior, as every single executor must load the entire execution environment of Spark in order to execute the code generated by the nodes in the physical plan. However, there are some outliers such as, in Figure 8, for 5 executors in the case of the non-distributed complete algorithm (left plot) and the distributed incomplete algorithm (right plot). The problem here is probably that Spark’s history server sometimes reports the memory consumption inaccurately. These inaccuracies occur since Spark uses periodical “heartbeats” to report the current memory consumption, and the peak is then computed by selecting the highest reported value. If the timing of the heartbeats is off, then the peak memory consumption may not be captured correctly. There is notably less jitter in Figures 9 and 10. This is likely due to the fact that the execution times of the queries are higher and the measurement of the memory consumption becomes more accurate the longer each Spark application runs.

To sum up, the plots show that the improved runtime of our specialized algorithms compared with the reference algorithm is not achieved at the expense of significantly higher memory usage. On the contrary, in most cases, the memory consumption of our algorithms is even below that of the reference algorithm. In the remaining cases, the additional memory consumption (in particular, of the distributed complete algorithm) is within a reasonable margin. A comparison among our algorithms shows that the distributed complete algorithm has the highest memory requirements. This is most obvious as the data size increases. Here, the size of the “window” (see Section 5.6 for explanations) maintained in the skyline computation seems to be the predominant cost factor in terms of memory consumption.

In Figures 3 and 4, we have already seen plots depicting the relationship between the number of dimensions and the execution time. Further results – with varying numbers of executors – are shown in Figure 11 for the real-world dataset and in Figure 12 for the synthetic dataset. In the case of the real-world dataset, the plots provide a similar picture as in Section 6. In the case of the synthetic dataset, two observations in Figure 12 seem particularly noteworthy. First, since the tests with incomplete data were carried out on a bigger dataset than in Figure 4, timeouts (in particular, of the reference algorithm) occur much more frequently. Second, in the tests with complete datasets, the two opposing effects of increasing the number of dimensions are very well visible: when moving from 1 dimension to 2 and further to 3 dimensions, the size of the skyline apparently shrinks and, therefore, in particular the reference algorithm requires significantly less time; as further dimensions are added, the size of the skyline apparently increases and, hence, in particular the reference algorithm starts requiring significantly more time again. However, what remains unchanged compared with Figures 3 and 4 in Section 6 is that, in almost all cases, the time needed by our specialized algorithms is (significantly) below the time needed by the reference algorithm.

In Figure 13, we follow up on Figure 5, providing further details on the impact of the size of the dataset on the execution time. In the plots in Figure 13, the number of executors takes the values 2, 3, 5, or 10. We notice that only when we increase the number of executors to 5 or even 10 can the reference algorithm cope with the biggest dataset of size $10^7$. In all tests with complete data, the distributed complete algorithm performs best, and also the other specialized algorithms outperform the reference algorithm. In the case of incomplete data and a higher number of executors, the performance of the reference algorithm is comparable to the distributed incomplete algorithm.

In Figures 6 and 7, we analyzed the relationship between the number of executors and the running time. We report on further experiments in this direction in Figure 14 for the real-world dataset and in Figure 15 for the synthetic dataset. Again, the picture in Figures 14 and 15 is quite similar to the one in Figures 6 and 7. For instance, in almost all cases, all of the specialized algorithms outperform the reference algorithm and, for complete data, in almost all cases, the distributed complete algorithm performs best. Most noteworthy in the case of the tests with the real-world dataset (Figure 14) is the observation that the distributed complete algorithm (which normally performs best) hardly profits from additional executors. We already discussed in Section 6 that, if the dataset is too small, then the distributed computation of the local skylines loses its effectiveness – leaving more work for the global skyline computation. This is in sharp contrast to the setting in Figure 7, where we considered an almost 10 times bigger dataset and where additional executors were clearly helpful. This effect is still visible but weaker in Figure 15, where we consider the complete dataset with $5\cdot 10^6$ tuples. In the case of the incomplete dataset of this size, the reference algorithm runs into several timeouts and, in all cases, the distributed incomplete algorithm performs better than the reference algorithm.

Figure 8. Number of executors vs. memory consumption on the Inside Airbnb dataset
Figure 9. Number of executors vs. memory consumption on the store_sales dataset
Figure 10. Number of input tuples vs. memory consumption on the store_sales dataset
Figure 11. Number of dimensions vs. execution time on the Inside Airbnb dataset
Figure 12. Number of dimensions vs. execution time on the store_sales dataset
Figure 13. Number of input tuples vs. execution time on the store_sales dataset
Figure 14. Number of executors vs. execution time on the Inside Airbnb dataset
Figure 15. Number of executors vs. execution time on the store_sales dataset

Appendix D Tabulation of Execution Time

In this section, we revisit the experimental evaluation from Section 6.4. In Figures 3–7, we presented plots of the execution times depending on the number of dimensions, tuples, and executors, respectively. We now present the same results in table form to make the precise execution times easier to read. Since some queries do not finish in the allotted time frame, we use “t.o.” to mark such timeouts in the tables.

The tables also contain the execution times relative to the reference query. Here, we always denote the time consumed by the reference query as 100%. For the other algorithms, a value below 100% means a speed-up while a value above 100% means a slow-down. If a timeout has occurred, no meaningful percentage value can be given. We use “n.a.” to denote that the calculation of a percentage value is not applicable. In fact, if the reference query has timed out, we use “n.a.” for the entire column since no comparison with the reference value is possible in this case even if the specialized algorithms do not time out. Note that we never have the opposite situation that a specialized algorithm times out but not the reference algorithm.

The results are shown in Tables 3–12. The correspondence with the plots in Figures 3–7 is as follows:

  • The values of the plots in Figure 3 (number of dimensions vs. execution time on the Inside Airbnb dataset) are shown in Table 3 for the complete dataset and in Table 4 for the incomplete dataset.

  • The values of the plots in Figure 4 (number of dimensions vs. execution time on the store_sales dataset) are shown in Table 5 for the complete dataset and in Table 6 for the incomplete dataset.

  • The values of the plots in Figure 5 (number of input tuples vs. execution time on the store_sales dataset) are shown in Table 7 for the complete dataset and in Table 8 for the incomplete dataset.

  • The values of the plots in Figure 6 (number of executors vs. execution time on the Inside Airbnb dataset) are shown in Table 9 for the complete dataset and in Table 10 for the incomplete dataset.

  • The values of the plots in Figure 7 (number of executors vs. execution time on the store_sales dataset) are shown in Table 11 for the complete dataset and in Table 12 for the incomplete dataset.

From these tables, it becomes very apparent that if the reference query can execute the skyline in a reasonable time, then the computation with our specialized algorithms displays a similar or even better behavior. Actually, a notable fact with regard to the results discussed in Section 6.4 can be derived from Table 6. Here, we encounter one of the rare cases where the reference query is better than our implementation. However, we also see that the increase is only $\sim 6.6$ seconds or $\sim 6.5\%$. While this is an interesting result, it is unlikely to ever have an impact on the performance or usability of our integration of the skyline operator.

algorithm 1 2 3 4 5 6
1 reference 100.00% 100.00% 100.00% 100.00% 100.00% 100.00%
2 non-distributed complete 96.60% 86.88% 83.03% 80.54% 75.45% 48.66%
3 distributed complete 96.42% 97.81% 83.44% 77.71% 74.28% 46.08%
4 distributed incomplete 81.09% 90.92% 85.72% 84.86% 78.62% 51.10%
algorithm 1 2 3 4 5 6
1 reference 43.72 43.94 45.61 48.19 52.63 96.34
2 non-distributed complete 42.23 38.17 37.87 38.81 39.71 46.88
3 distributed complete 42.15 42.98 38.06 37.45 39.09 44.40
4 distributed incomplete 35.45 39.95 39.10 40.90 41.38 49.23
Table 3. Number of dimensions vs. execution time on complete Inside Airbnb dataset (Executors: 5, tuples: 820698)
algorithm 1 2 3 4 5 6
1 reference 100.00% 100.00% 100.00% 100.00% 100.00% 100.00%
2 distributed incomplete 83.07% 69.62% 87.92% 69.08% 61.64% 34.61%
algorithm 1 2 3 4 5 6
1 reference 45.58 50.05 50.21 58.03 66.22 147.82
2 distributed incomplete 37.87 34.85 44.15 40.08 40.82 51.17
Table 4. Number of dimensions vs. execution time on incomplete Inside Airbnb dataset (Executors: 5, tuples: 1193465)
algorithm 1 2 3 4 5 6
1 reference 100.00% 100.00% 100.00% 100.00% 100.00% 100.00%
2 non-distributed complete 2.63% 34.41% 54.98% 63.06% 46.26% 69.97%
3 distributed complete 2.20% 27.42% 56.83% 45.35% 22.22% 29.12%
4 distributed incomplete 2.30% 36.12% 57.33% 59.74% 65.95% 95.69%
algorithm 1 2 3 4 5 6
1 reference 2463.29 164.18 105.44 93.86 281.42 1693.31
2 non-distributed complete 64.77 56.50 57.97 59.18 130.17 1184.86
3 distributed complete 54.26 45.02 59.93 42.56 62.52 493.03
4 distributed incomplete 56.58 59.29 60.45 56.06 185.60 1620.40
Table 5. Number of dimensions vs. execution time on the complete store_sales dataset (Executors: 10, tuples: 10000000)
algorithm 1 2 3 4 5 6
1 reference 100.00% 100.00% 100.00% 100.00% 100.00% 100.00%
2 distributed incomplete 14.60% 47.59% 33.80% 25.98% 36.61% 106.51%
algorithm 1 2 3 4 5 6
1 reference 314.47 354.12 258.10 197.19 149.84 101.62
2 distributed incomplete 45.92 168.53 87.23 51.23 54.86 108.23
Table 6. Number of dimensions vs. execution time on the incomplete store_sales dataset (Executors: 10, tuples: 1000000)
algorithm 1000000 2000000 5000000 10000000
1 reference 100.00% 100.00% 100.00% n.a.
2 non-distributed complete 56.23% 30.91% 21.18% n.a.
3 distributed complete 42.51% 23.46% 17.94% n.a.
4 distributed incomplete 72.73% 44.69% 40.72% n.a.
algorithm 1000000 2000000 5000000 10000000
1 reference 191.35 542.55 2022.67 t.o.
2 non-distributed complete 107.59 167.69 428.31 1184.43
3 distributed complete 81.35 127.31 362.83 723.11
4 distributed incomplete 139.16 242.47 823.56 1705.47
Table 7. Number of tuples vs. execution time on complete store_sales dataset (executors: 3, dimensions: 6)
algorithm 1000000 2000000 5000000 10000000
1 reference 100.00% 100.00% 100.00% n.a.
2 distributed incomplete 109.52% 73.15% 41.47% n.a.
algorithm 1000000 2000000 5000000 10000000
1 reference 101.17 282.33 1227.49 t.o.
2 distributed incomplete 110.80 206.53 509.06 1342.98
Table 8. Number of tuples vs. execution time on incomplete store_sales dataset (Executors: 3, dimensions: 6)
algorithm 1 2 3 5 10
1 reference 100.00% 100.00% 100.00% 100.00% 100.00%
2 non-distributed complete 29.34% 47.40% 48.95% 48.66% 48.38%
3 distributed complete 30.09% 49.74% 47.76% 46.08% 45.15%
4 distributed incomplete 33.69% 54.30% 52.15% 51.10% 50.77%
algorithm 1 2 3 5 10
1 reference 155.69 91.23 97.47 96.34 102.42
2 non-distributed complete 45.68 43.24 47.71 46.88 49.55
3 distributed complete 46.85 45.37 46.55 44.40 46.24
4 distributed incomplete 52.46 49.54 50.82 49.23 52.00
Table 9. Number of executors vs. execution time on Inside Airbnb dataset (tuples: 820698, dimensions: 6)
algorithm 1 2 3 5 10
1 reference 100.00% 100.00% 100.00% 100.00% 100.00%
2 distributed incomplete 33.48% 40.18% 39.56% 34.61% 37.39%
algorithm 1 2 3 5 10
1 reference 189.48 134.60 134.95 147.82 143.96
2 distributed incomplete 63.44 54.08 53.39 51.17 53.83
Table 10. Number of executors vs. execution time on incomplete Inside Airbnb dataset (tuples: 1193465, dimensions: 6)
algorithm 1 2 3 5 10
1 reference n.a. n.a. n.a. n.a. 100.00%
2 non-distributed complete n.a. n.a. n.a. n.a. 69.97%
3 distributed complete n.a. n.a. n.a. n.a. 29.12%
4 distributed incomplete n.a. n.a. n.a. n.a. 95.69%
algorithm 1 2 3 5 10
1 reference t.o. t.o. t.o. t.o. 1693.31
2 non-distributed complete 1154.56 1361.92 1184.43 1148.37 1184.86
3 distributed complete 1080.17 899.84 723.11 587.94 493.03
4 distributed incomplete 1686.67 1587.29 1705.47 1658.78 1620.40
Table 11. Number of executors vs. execution time on complete store_sales dataset (tuples: 10000000, dimensions: 6)
algorithm 1 2 3 5 10
1 reference 100.00% 100.00% 100.00% n.a. 100.00%
2 distributed incomplete 24.85% 34.17% 41.47% n.a. 74.49%
algorithm 1 2 3 5 10
1 reference 2768.18 1737.20 1227.49 t.o. 704.39
2 distributed incomplete 687.76 593.55 509.06 675.54 524.68
Table 12. Number of executors vs. execution time on incomplete store_sales dataset (tuples: 5000000, dimensions: 6)

Appendix E Skylines of More Complex Queries

We have also evaluated the performance of our skyline implementation using more complex queries. For this purpose, we have adapted a part of the MusicBrainz database (Foundation, 2022) to allow for more complex queries, which contain multiple joins and aggregates.

E.1. Test Data and Queries

We have selected tuples from MusicBrainz’ recordings such that we have one dataset for the complete and one for the incomplete algorithms. Both datasets contain almost exactly 1.5 million recording tuples. Tuples were selected such that all recordings with ratings ($\sim 500{,}000$) were selected first and the rest ($\sim 1{,}000{,}000$) were sampled randomly. To limit the complexity of the joins, we have also limited the tables of the join “partners” in similar ways. For details on the generation of the dataset and the datasets themselves as .csv files, we refer to our collection of supplemental material (Grasmann et al., 2022).

As opposed to our simpler queries, there are now slight differences between the complete and incomplete queries due to their more complex nature. Nevertheless, they should still be comparable with regard to both execution time and memory consumption.

Our “base” queries are given in Listing 11 (complete) and Listing 12 (incomplete). They can easily be extended to skyline queries by adding the skyline clause and, where applicable, a subquery. In Table 13, we give the skyline dimensions as we have already done in Section 6.1. The actual skyline queries are then constructed by combining the base queries with the skyline dimensions from this table.

SELECT
  r.id,
  ifnull(r.length, 0) AS length,
  r.video,
  ifnull(rm.rating, 0) AS rating,
  ifnull(rm.rating_count, 0) AS rating_count,
  recording_tracks.num_tracks,
  recording_tracks.min_position
FROM recording_complete r LEFT OUTER JOIN (
  SELECT
    ri.id AS id,
    count(ti.recording) AS num_tracks,
    min(ti.position) AS min_position
  FROM recording_complete ri
  JOIN track ti ON (ti.recording = ri.id)
  GROUP BY ri.id
) recording_tracks USING (id)
JOIN recording_meta rm USING (id)
Listing 11. Base query (complete) of the more complex benchmarks on a subset of the MusicBrainz database
SELECT * FROM recording_incomplete r
LEFT OUTER JOIN (
  SELECT
    ri.id AS id,
    count(ti.recording) AS num_tracks,
    min(ti.position) AS min_position
  FROM recording_incomplete ri
  JOIN track ti ON (ti.recording = ri.id)
  GROUP BY ri.id
) recording_tracks USING (id)
JOIN recording_meta rm USING (id)
Listing 12. Base query (incomplete) of the more complex benchmarks on a subset of the MusicBrainz database
Dimension      Type   Description
id             KEY    id of the recording
rating         MAX    rating of the recording (cumulative)
rating_count   MAX    number of ratings
length         MIN    length of the recording
video          MAX    whether the recording has a video
num_tracks     MAX    how many tracks the recording is on
min_position   MIN    lowest track number of the recording
Table 13. Skyline dimensions for recordings in the MusicBrainz dataset

Intuitively, our queries find the best-rated and most frequently rated recordings that are short and have an associated video. In addition, we seek recordings that appear on many tracks and have a very low position on the associated album (i.e., at the beginning of the album).

The more complex queries considered in this section clearly demonstrate the advantage of the easy-to-use skyline syntax. While the queries in skyline syntax are relatively straightforward, consisting only of the base query and a skyline clause, the corresponding reference queries become quite extensive and unwieldy. As can be seen in Listing 13, the rewritten query is no longer easy to read, even after heavy formatting to balance space restrictions and readability. In comparison, the specialized skyline syntax given in Listing 14, which expresses the same query, is significantly more concise and readable. Schematically, the rewriting follows the NOT EXISTS pattern sketched below.
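To make the structure of such a rewriting explicit, the following minimal sketch shows the NOT EXISTS pattern for just two of the dimensions from Table 13 (rating MAX, length MIN); here, base is a placeholder standing in for the base query of Listing 11 and is not part of the actual benchmark queries.

-- Minimal sketch of the NOT EXISTS rewriting pattern for two dimensions;
-- "base" is a placeholder for the base query of Listing 11.
SELECT * FROM base AS o
WHERE NOT EXISTS (
  SELECT 1 FROM base AS i
  WHERE i.rating >= o.rating     -- at least as good in the MAX dimension
    AND i.length <= o.length     -- at least as good in the MIN dimension
    AND ( i.rating > o.rating    -- strictly better in at least
       OR i.length < o.length )  -- one dimension
);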

SELECT * FROM (SELECT * FROM (
  SELECT
    r.id,
    ifnull(r.length, 0) AS length,
    r.video,
    ifnull(rm.rating, 0) AS rating,
    ifnull(rm.rating_count, 0) AS rating_count,
    recording_tracks.num_tracks,
    recording_tracks.min_position
  FROM recording_complete r
  LEFT OUTER JOIN (
    SELECT
      ri.id AS id,
      count(ti.recording) AS num_tracks,
      min(ti.position) AS min_position
    FROM recording_complete ri
    JOIN track ti ON (ti.recording = ri.id)
    GROUP BY ri.id
  ) recording_tracks USING (id)
  JOIN recording_meta rm USING (id)
)) AS o WHERE NOT EXISTS(
  SELECT * FROM (SELECT * FROM (
    SELECT
      r.id,
      ifnull(r.length, 0) AS length,
      r.video,
      ifnull(rm.rating, 0) AS rating,
      ifnull(rm.rating_count, 0) AS rating_count,
      recording_tracks.num_tracks,
      recording_tracks.min_position
    FROM recording_complete r
    LEFT OUTER JOIN (
      SELECT
        ri.id AS id,
        count(ti.recording) AS num_tracks,
        min(ti.position) AS min_position
      FROM recording_complete ri
      JOIN track ti ON (ti.recording = ri.id)
      GROUP BY ri.id
    ) recording_tracks USING (id)
    JOIN recording_meta rm USING (id)
  )) AS i WHERE
    i.rating >= o.rating AND
    i.rating_count >= o.rating_count AND
    i.length <= o.length AND
    i.video >= o.video AND
    i.num_tracks >= o.num_tracks AND
    i.min_position <= o.min_position AND (
      i.rating > o.rating OR
      i.rating_count > o.rating_count OR
      i.length < o.length OR
      i.video > o.video OR
      i.num_tracks > o.num_tracks OR
      i.min_position < o.min_position ) )
Listing 13. Reference query of the more complex benchmarks on a subset of the MusicBrainz database (6 dimensions)
SELECT * FROM (
  SELECT
    r.id,
    ifnull(r.length, 0) AS length,
    r.video,
    ifnull(rm.rating, 0) AS rating,
    ifnull(rm.rating_count, 0) AS rating_count,
    recording_tracks.num_tracks,
    recording_tracks.min_position
  FROM recording_complete r LEFT OUTER JOIN (
    SELECT
      ri.id AS id,
      count(ti.recording) AS num_tracks,
      min(ti.position) AS min_position
    FROM recording_complete ri
    JOIN track ti ON (ti.recording = ri.id)
    GROUP BY ri.id
  ) recording_tracks USING (id)
  JOIN recording_meta rm USING (id)
) SKYLINE OF COMPLETE
  rating MAX,
  rating_count MAX,
  length MIN,
  video MAX,
  num_tracks MAX,
  min_position MIN
Listing 14. Skyline query (complete) of the more complex benchmarks on a subset of the MusicBrainz database
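Since only the complete variant is reproduced above, the following is a sketch of the corresponding incomplete skyline query, constructed from the base query in Listing 12; it assumes that the skyline clause admits an INCOMPLETE keyword analogous to COMPLETE in Listing 14, as the exact clause for the incomplete case is not shown in this appendix.

-- Sketch only: assumes an INCOMPLETE keyword analogous to COMPLETE in
-- Listing 14; the base query is that of Listing 12.
SELECT * FROM (
  SELECT * FROM recording_incomplete r
  LEFT OUTER JOIN (
    SELECT
      ri.id AS id,
      count(ti.recording) AS num_tracks,
      min(ti.position) AS min_position
    FROM recording_incomplete ri
    JOIN track ti ON (ti.recording = ri.id)
    GROUP BY ri.id
  ) recording_tracks USING (id)
  JOIN recording_meta rm USING (id)
) SKYLINE OF INCOMPLETE
  rating MAX,
  rating_count MAX,
  length MIN,
  video MAX,
  num_tracks MAX,
  min_position MIN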

E.2. Experimental Results

In this section, we take a closer look at the performance results and the conclusions we can draw from them. To this end, we again consider both the execution time and the memory consumption depending on the number of dimensions and the number of executors. Since we are using a real-world dataset, varying the number of tuples as a parameter is not applicable here (as was already the case for the Airbnb dataset in Section 6.4). The impact of the number of dimensions on the execution time and on the memory consumption is shown in Figure 16 and Figure 17, respectively. Likewise, the impact of the number of executors on the execution time and on the memory consumption is shown in Figure 18 and Figure 19, respectively.

First, we note that the more complex queries make the measurements both noisier and harder to interpret. The way in which Spark handles joins and aggregates has a huge influence on query execution and affects both the execution time and the memory consumption. This can be seen across almost all plots corresponding to the complex queries. In these cases, our implementation takes the output of the base query and then uses it to compute the skyline (we exclude potential optimizations here). Still, while always yielding the same result, Spark may execute the queries in many different optimized ways, which heavily affects the performance. Hence, the results obtained with the simpler queries, such as those in Section 6.4, give a more precise picture of the performance of the various skyline algorithms.

As already mentioned in Section 6.4, there is a limit to how much distribution makes sense for the distributed algorithms. Since joins can also be computed in a distributed fashion, this effect may be amplified by more complex queries. This can, for example, be seen in Figure 18, where 3 executors is the optimum for which the query executes fastest. Adding further executors may slow down the query due to additional distribution and synchronization costs. In these benchmarks, however, it is impossible to pinpoint exactly which part of the query causes such a loss of performance.

To conclude, for skyline queries on top of complex queries, it is not always clear whether a particular increase in execution time or memory consumption is caused by the skyline computation or by peculiarities of the way Spark processes the complex base query. Nevertheless, the results obtained with the complex queries are consistent with those obtained with the simple queries in Section 6.4. Above all, it has again turned out that the reference queries are significantly slower in most cases and, in some cases, also consume significantly more memory than our specialized algorithms. As such, the specialized algorithms are potentially able to handle bigger data and more complex queries than the “plain” SQL reference solution without running into serious problems.

Figure 16. Number of dimensions vs. execution time using complex queries on the MusicBrainz dataset
Figure 17. Number of dimensions vs. memory consumption using complex queries on the complete MusicBrainz dataset
Figure 18. Number of executors vs. execution time using complex queries on the MusicBrainz dataset
Figure 19. Number of executors vs. memory consumption using complex queries on the MusicBrainz dataset