Facilitating Database Tuning with Hyper-Parameter Optimization: A Comprehensive Experimental Evaluation
[Supplemental Material]
OUTLINE
This supplemental material is organized as follows.
S1. More background for evaluating configuration tuning systems.
S2. Details about intra-algorithms and workloads.
S3. More details and results about the experiments.
S4. Experimental environment and reproduction instructions.
Appendix S1 More background for evaluating configuration tuning systems
Previous evaluations of database configuration tuning are limited to a subset of existing systems and ignore the analysis and evaluation of intra-algorithm components. Instead, we identify three key modules of configuration tuning systems and conduct a thorough analysis and experimental evaluation from a micro perspective (i.e., evaluating every fine-grained algorithm). Figure S1 presents the fine-grained algorithms adopted by existing database tuning systems or drawn from the HPO field. When conducting database configuration tuning in practice, we have to choose a solution “path” across the three modules: (1) knob selection, (2) configuration optimization, and (3) knowledge transfer, as shown in the figure. Each knob selection algorithm determines a unique configuration space and can be “linked” to any of the configuration optimization algorithms (i.e., optimizers). Among the optimizers, all the BO-based optimizers assuming a Gaussian model (SMAC, vanilla BO, mixed-kernel BO, TuRBO) can be “linked” to the workload mapping or RGPE transfer frameworks, and the DDPG algorithm is “linked” to the fine-tuning framework. Existing systems cover only a part of the possible solutions, and it remains unclear which “path” is the best for database configuration tuning. We evaluate all the fine-grained algorithms listed in Figure S1 and carefully decompose the search (evaluation) space to identify the best “path” in various scenarios.
Appendix S2 Details about intra-algorithms
In this section, we present details about the intra-algorithms that we describe at a high level in the paper due to space constraints.
S2.1. Ablation Analysis
Ablation analysis (DBLP:conf/aaai/BiedenkappLEHFH17) selects the features whose changes contribute the most to the performance improvement of configurations. We now describe how ablation analysis quantifies the performance change due to a certain feature’s change. Given a default configuration $\theta^{d}$ and a target configuration $\theta^{t}$ (usually a better one), ablation analysis first computes the feature differences between the default and target configurations. Next, an ablation path $\theta^{d} = \theta_{0}, \theta_{1}, \theta_{2}, \dots$ is constructed iteratively. In each iteration $i$, starting from the previous ablation path configuration $\theta_{i-1}$, we consider every remaining feature change $\delta$, apply it to $\theta_{i-1}$, and obtain the candidate $\theta_{i-1}[\delta]$. Each feature change is a modification of one feature from its value in $\theta_{i-1}$ to its value in $\theta^{t}$, along with any other feature modifications that may be necessary due to conditional constraints in the configuration space. The next configuration $\theta_{i}$ on the ablation path is the candidate with the best objective performance; the performance evaluation is approximated via a surrogate model for efficiency. The order in which the features are changed gives their importance ranks from the ablation analysis. Given a set of observations, we conduct the ablation analysis between the default configuration and every configuration in the observation set with better performance than the default (i.e., the target configurations). For each feature, we use its average rank across all ablation paths as its final importance.
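To make the construction concrete, below is a minimal sketch of the greedy ablation-path loop, assuming dict-based configurations and a fitted `surrogate` callable (both names are illustrative); the handling of conditional constraints between knobs is omitted for brevity.

```python
def ablation_path(default, target, surrogate, minimize=True):
    """Greedy ablation-path construction between two configurations.

    default, target: dicts mapping knob name -> value.
    surrogate: callable(config: dict) -> predicted objective value,
        used in place of real DBMS evaluations for efficiency.
    Returns knob names in the order they were flipped, i.e. the
    importance ranking induced by this (default, target) pair.
    """
    current = dict(default)
    remaining = [k for k in default if default[k] != target[k]]
    order = []
    while remaining:
        best_knob, best_perf = None, None
        for k in remaining:
            candidate = dict(current)
            candidate[k] = target[k]  # flip a single knob to its target value
            perf = surrogate(candidate)
            if best_perf is None or (perf < best_perf if minimize else perf > best_perf):
                best_knob, best_perf = k, perf
        current[best_knob] = target[best_knob]
        remaining.remove(best_knob)
        order.append(best_knob)  # knobs flipped earlier are more important
    return order
```

A knob flipped earlier on the path receives a better (smaller) rank; averaging these ranks over the ablation paths of all better-than-default targets yields the final importance ranking.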
S2.2. SHAP
SHAP (DBLP:conf/nips/LundbergL17) (SHapley Additive exPlanation) uses Shapley values of a conditional expectation function of the original model. SHAP values attribute to each feature the change in the expected model prediction when conditioning on that feature. They explain how to get from the base value that would be predicted if we did not know any features to the current output. When the model is non-linear or the input features are not independent, the order in which features are added to the expectation matters, and the SHAP values arise from averaging the contribution values across all possible orderings (DBLP:conf/nips/LundbergL17). The exact computation of SHAP values is challenging; they can be estimated by the Shapley sampling values method (DBLP:journals/kais/StrumbeljK14) or the Kernel SHAP method (DBLP:conf/nips/LundbergL17).
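As an illustration of the Kernel SHAP estimation, the following sketch ranks the features of a synthetic regression problem by mean absolute SHAP value; the data and model are stand-ins for tuning observations and a fitted surrogate, not our actual setup.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins: 200 sampled configurations over 10 "knobs" and a
# measured objective; in practice these come from tuning observations.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 10))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100).fit(X, y)

background = shap.sample(X, 50)                # background set for the conditional expectation
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X[:20])    # per-feature attributions for 20 configurations

importance = np.abs(shap_values).mean(axis=0)  # global importance per knob
ranking = np.argsort(-importance)              # knob indices, most important first
```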
S2.3. SMAC
SMAC (DBLP:conf/lion/HutterHL11) constructs a random forest as a set of regression trees, each of which is built on $n$ data points randomly sampled with repetition from the entire training data set. It computes the random forest’s predictive mean $\mu(\theta)$ and variance $\sigma^2(\theta)$ for a new configuration $\theta$ as the empirical mean and variance of the individual trees’ predictions, yielding the Gaussian model $\mathcal{N}(\mu(\theta), \sigma^2(\theta))$. SMAC uses the random forest model to select a list of promising parameter configurations. To quantify how promising a configuration $\theta$ is, it uses the model’s predictive distribution for $\theta$ to compute its expected positive improvement $EI(\theta)$ (DBLP:journals/jgo/JonesSW98) over the best configuration seen so far. $EI(\theta)$ is large for configurations with high predicted performance and for those with high predicted uncertainty; thereby, it offers an automatic trade-off between exploitation (focusing on known good parts of the space) and exploration (gathering more information in unknown parts of the space). To gather a set of promising configurations with low computational overhead, SMAC performs a simple multi-start local search and considers all resulting configurations with locally maximal $EI$.
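The sketch below shows the two ingredients just described, the empirical per-tree mean/variance and the EI criterion. It is a simplification rather than SMAC's actual implementation, which additionally handles categorical and conditional parameters and interleaves random configurations.

```python
import numpy as np
from scipy.stats import norm

def rf_mean_var(forest, X):
    """Empirical mean/variance of per-tree predictions for configurations X."""
    preds = np.stack([tree.predict(X) for tree in forest.estimators_])
    return preds.mean(axis=0), preds.var(axis=0)

def expected_improvement(mu, var, best, xi=0.0):
    """EI over the best observed value, for a minimization objective."""
    sigma = np.sqrt(np.maximum(var, 1e-12))
    z = (best - mu - xi) / sigma
    return (best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Usage sketch (X_obs, y_obs, X_cand are assumed observations / candidates):
# forest = RandomForestRegressor(n_estimators=10).fit(X_obs, y_obs)
# mu, var = rf_mean_var(forest, X_cand)
# promising = X_cand[np.argsort(-expected_improvement(mu, var, y_obs.min()))]
```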
S2.4. RGPE
RGPE (feurer2018scalable) is a scalable meta-learning framework to accelerate BO-based optimizers. First, for each previous tuning task $T^i$, it trains a base Gaussian process (GP) model $f^i$ on the corresponding observations $D^i$ from $T^i$. Then it builds a combined surrogate model $\bar{f}$ from the base GP models, instead of the original surrogate fitted on the observations of the target task only. The prediction of $\bar{f}$ at point $\theta$ is given by:

$$\mu(\theta) = \sum_{i=1}^{K} w_i \mu_i(\theta), \qquad \sigma^2(\theta) = \sum_{i=1}^{K} w_i^2 \sigma_i^2(\theta), \quad (1)$$

where $w_i$ is the weight of base surrogate $f^i$, and $\mu_i$ and $\sigma_i^2$ are the predictive mean and variance of base surrogate $f^i$. The weight $w_i$ reflects the similarity between the previous task and the current task. Therefore, $\bar{f}$ utilizes the knowledge of previous tuning tasks, which can greatly accelerate the convergence of tuning on the target task. We then use the following ranking loss function $\mathcal{L}$, i.e., the number of misranked pairs, to measure the similarity between a previous task and the target task:

$$\mathcal{L}(f^i, D) = \sum_{j=1}^{n} \sum_{k=1}^{n} \mathbb{1}\big( (f^i(\theta_j) < f^i(\theta_k)) \oplus (y_j < y_k) \big), \quad (2)$$

where $\oplus$ is the exclusive-or operator, $K$ denotes the number of previous tuning tasks, $n$ is the number of observations $(\theta_j, y_j)$ in the target task’s observation set $D$, and $f^i(\theta_j)$ denotes the prediction of $f^i$ on configuration $\theta_j$. Based on the ranking loss function, the weight $w_i$ is set to the probability that $f^i$ has the smallest ranking loss on $D$, that is, $w_i = P(i = \arg\min_{i'} \mathcal{L}(f^{i'}, D))$. This probability can be estimated using MCMC sampling (DBLP:conf/aistats/MartensTY19).
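A compact sketch of the pieces above follows, with the pairwise ranking loss of Equation (2) and the combined prediction of Equation (1); note that, for brevity, it estimates the weights by bootstrap resampling of the target observations rather than the MCMC sampling cited above.

```python
import numpy as np

def ranking_loss(pred, y):
    """Number of misranked pairs between surrogate predictions and observations."""
    return np.sum((pred[:, None] < pred[None, :]) ^ (y[:, None] < y[None, :]))

def rgpe_weights(base_preds, y, n_samples=1000, seed=0):
    """w_i = P(base surrogate i has the smallest ranking loss), estimated by
    resampling the target observations. base_preds: list of prediction arrays,
    one per base surrogate, on the target configurations with outcomes y."""
    rng = np.random.default_rng(seed)
    wins = np.zeros(len(base_preds))
    for _ in range(n_samples):
        idx = rng.integers(0, len(y), size=len(y))     # resample observations
        losses = [ranking_loss(p[idx], y[idx]) for p in base_preds]
        wins[np.argmin(losses)] += 1
    return wins / n_samples

def combined_prediction(weights, mus, variances):
    """Eq. (1): weighted mean and variance of the combined surrogate."""
    mu = np.sum(weights[:, None] * np.stack(mus), axis=0)
    var = np.sum(weights[:, None] ** 2 * np.stack(variances), axis=0)
    return mu, var
```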
S2.5. More Details about Workloads
While we have presented general profile information for the workloads in Table LABEL:wkl in the paper, we detail their implementation and the reasons we select them in this section.
The Reasons for Workload Selection. When answering Q1 and Q2, we analyze the tuning performance over OLTP and OLAP scenarios. We use an OLAP workload, JOB, and an OLTP workload, SYSBENCH. These two workloads are often adopted in evaluating database configuration tuning methods and represent online analytical and transaction processing scenarios, respectively. For example, JOB is adopted by QTune (DBLP:journals/pvldb/LiZLG19), and SYSBENCH is adopted by CDBTune (DBLP:conf/sigmod/ZhangLZLXCXWCLR19), QTune (DBLP:journals/pvldb/LiZLG19), and ResTune (DBLP:conf/sigmod/ZhangWCJT0Z021). When conducting knowledge transfer experiments (Q3), we focus on the OLTP scenarios, since few OLAP workloads beyond JOB and TPC-H are suitable for constructing source workloads of tuning tasks. We choose three OLTP workloads, SYSBENCH, TPC-C, and Twitter, as the target tuning workloads; all have been adopted in previous studies. For example, OtterTune (DBLP:conf/sigmod/AkenPGZ17), CDBTune (DBLP:conf/sigmod/ZhangLZLXCXWCLR19), and ResTune (DBLP:conf/sigmod/ZhangWCJT0Z021) have adopted TPC-C for evaluation, and ResTune has also adopted Twitter. We use five additional OLTP workloads (i.e., SEATS, SmallBank, TATP, Voter, SIBench) as source workloads and configure them with various sizes and read-write ratios, as shown in Table LABEL:wkl. SIBench is a microbenchmark designed to explore snapshot isolation in DBMSs (DBLP:journals/pvldb/JungHFR11). Based on our observations, the tuning opportunity for SIBench is limited; we add it to the source workloads to increase their diversity.
Implementation of the Workloads. For JOB, we use the same setup illustrated in (DBLP:journals/pvldb/LeisGMBK015). For SYSBENCH, we load 150 tables, each of which contains 800,000 rows, and adopt the read-write mode. For workloads from OLTP-Bench, we use the scale factor to determine the data size (e.g., the number of warehouses in TPCC), as shown in Table S1. In addition, the parameter `terminal` is set to 64 for each workload. We keep the other parameters at the default values provided by OLTP-Bench, including `isolation` and the `weights` of transactions.
Table S1. Scale factors of the workloads from OLTP-Bench.

| Workload | TPCC | Smallbank | SIBench | Voter | Seats | TATP | Twitter |
|---|---|---|---|---|---|---|---|
| Scale Factor | 200 | 1500 | 10 | 1000 | 10000 | 50 | 100 |
Appendix S3 More details and results about the experiments
In this section, we present additional experimental details.
S3.1. Knob Selection
Table S2. Top 20 important knobs for OLTP workloads.

| Knob | Type | Dynamic | Module | Description |
|---|---|---|---|---|
| innodb_thread_concurrency | Integer | Yes | Concurrency | The maximum number of threads permitted inside of InnoDB. |
| innodb_log_file_size | Integer | No | Logging | The size in bytes of each log file in a log group. |
| max_allowed_packet | Integer | Yes | Replication | The upper limit on the size of any single message between the MySQL server and clients. |
| innodb_io_capacity_max | Integer | Yes | IO | The maximum number of IOPS performed by InnoDB background tasks. |
| tmp_table_size | Integer | Yes | Memory | The maximum size of internal in-memory temporary tables. |
| query_prealloc_size | Integer | Yes | Memory | The size in bytes of the persistent buffer used for statement parsing and execution. |
| max_heap_table_size | Integer | Yes | Memory | The maximum size to which user-created MEMORY tables are permitted to grow. |
| innodb_doublewrite | Categorical | No | Memory | Whether the doublewrite buffer is enabled. |
| transaction_alloc_block_size | Integer | Yes | Memory | The amount in bytes by which to increase a per-transaction memory pool which needs memory. |
| join_buffer_size | Integer | Yes | Memory | The minimum size of the buffer that is used for joins. |
| innodb_flush_log_at_trx_commit | Categorical | Yes | Logging | Controls the balance between ACID compliance for commit operations and performance. |
| innodb_max_dirty_pages_pct_lwm | Integer | Yes | Logging | The percentage of dirty pages at which preflushing is enabled to control the dirty page ratio. |
| innodb_log_files_in_group | Integer | No | Logging | The number of log files in the log group. |
| innodb_buffer_pool_size | Integer | Yes | Memory | The size in bytes of the buffer pool. |
| innodb_online_alter_log_max_size | Integer | Yes | Logging | An upper limit on the size of the log files used during online DDL operations for InnoDB tables. |
| key_cache_age_threshold | Integer | Yes | Memory | Controls the demotion of buffers from the hot sublist of a key cache to the warm sublist. |
| binlog_cache_size | Integer | Yes | Memory | The size of the cache to hold changes to the binary log during a transaction. |
| innodb_purge_rseg_truncate_frequency | Integer | Yes | Logging | The frequency with which the purge system frees rollback segments. |
| query_cache_limit | Integer | Yes | Memory | The maximum size of individual query results that can be cached. |
| innodb_sort_buffer_size | Integer | No | Memory | The sort buffer size for online DDL operations that create or rebuild secondary indexes. |
Table S3. Hardware instances used for the target DBMS.

| Instance | A | B | C | D |
|---|---|---|---|---|
| CPU | 4 cores | 8 cores | 16 cores | 32 cores |
| RAM | 8GB | 16GB | 32GB | 64GB |
Top impacting knobs with high tunability for OLTP workloads. We further conduct an experiment using SHAP to generate a ranking of the most impacting knobs across OLTP workloads and hardware instances, and we use this ranking to evaluate the knowledge transfer component across OLTP workloads in LABEL:sec:exp-transfer in the paper. We use the seven OLTP workloads listed in Table LABEL:wkl in the paper and perform LHS to collect 1250 samples for each workload on each of the four hardware instances listed in Table S3. We then adopt SHAP to generate an importance ranking for each (workload, hardware) pair and count the number of times each knob appears in the top 20 of all the rankings to measure its overall importance. Table S2 shows the top-20 important knobs for OLTP workloads with brief descriptions. We believe this ranking can provide database practitioners guidance for choosing which knobs to tune. Dynamic variables can be changed at runtime using the SET statement, while the others can only be set at server startup using options on the command line or in an option file. We find that configuring the maximum number of threads and the size of the log file contributes the most to the performance gain, which is aligned with previous analyses (DBLP:conf/sigmod/AkenPGZ17; DBLP:conf/sigmod/ZhangWCJT0Z021; DBLP:journals/pvldb/AkenYBFZBP21). In addition, 11 of the top-20 knobs are related to memory allocation, which indicates that the default memory allocation settings in MySQL may not be appropriate across workloads and hardware instances. We leave the important-knob ranking for OLAP workloads as future work, since few OLAP workloads beyond JOB and TPC-H are suitable for database tuning tasks.
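The pipeline can be summarized by the sketch below; `bounds`, `workloads`, `instances`, and the `shap_ranking` helper are hypothetical placeholders for our actual tuning setup, not part of our released code.

```python
from collections import Counter

from scipy.stats import qmc

# Assumed inputs: `bounds` (one (low, high) pair per knob), `workloads`,
# `instances`, and a hypothetical shap_ranking(workload, instance, configs)
# helper that runs the benchmark, fits a model on the observations, and
# returns knob names sorted by SHAP importance.

# Step 1: LHS, 1250 space-filling configurations per (workload, instance) pair.
sampler = qmc.LatinHypercube(d=len(bounds), seed=0)
configs = qmc.scale(sampler.random(n=1250),
                    [lo for lo, _ in bounds], [hi for _, hi in bounds])

# Step 2: count how often each knob enters the top 20 of a ranking.
counter = Counter()
for workload in workloads:
    for instance in instances:
        counter.update(shap_ranking(workload, instance, configs)[:20])
top20 = [knob for knob, _ in counter.most_common(20)]
```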
S3.2. Configuration Optimization
Table S4. Average ranking of optimizers in terms of the best configuration they found.

| Configuration Space | Workload | Vanilla BO | One-Hot BO | Mixed-Kernel BO | SMAC | TPE | TuRBO | DDPG | GA |
|---|---|---|---|---|---|---|---|---|---|
| Small | JOB | 5.00 | 2.67 | 2.33 | 3.33 | 6.33 | 3.67 | 5.00 | 7.67 |
| Small | SYSBENCH | 5.67 | 5.33 | 2.00 | 3.33 | 5.33 | 4.00 | 5.00 | 5.33 |
| Small | Average | 5.33 | 4.00 | 2.17 | 3.33 | 5.83 | 3.83 | 5.00 | 6.50 |
| Medium | JOB | 5.33 | 3.33 | 3.00 | 1.00 | 7.67 | 3.67 | 5.00 | 7.00 |
| Medium | SYSBENCH | 5.00 | 4.33 | 1.67 | 1.67 | 6.67 | 4.33 | 6.00 | 6.33 |
| Medium | Average | 5.17 | 3.83 | 2.33 | 1.33 | 7.17 | 4.00 | 5.50 | 6.67 |
| Large | JOB | 7.00 | 7.00 | 7.00 | 1.00 | 7.00 | 7.00 | 2.00 | 7.00 |
| Large | SYSBENCH | 7.67 | 6.00 | 3.33 | 1.00 | 6.00 | 3.00 | 4.33 | 4.67 |
| Large | Average | 7.33 | 6.50 | 5.17 | 1.00 | 6.50 | 5.00 | 3.17 | 5.83 |
| Overall | | 5.94 | 4.78 | 3.22 | 1.89 | 6.50 | 4.28 | 4.56 | 6.33 |
Average ranking of optimizers in terms of the best configuration they found. While we have presented the average rankings of optimizers in Table 7 in the paper, we detail the rankings on each workload and configuration space in Table S4. In addition, we describe how we calculate the average ranking of the optimizers. For each workload and configuration space, we run three tuning sessions per optimizer and sort the three sessions by the best performance found within 200 iterations. Then, we rank the optimizers based on the best performance in their best sessions, then based on their second-best sessions, and lastly based on their worst sessions. Finally, we average the three ranks of each optimizer, which yields one entry in Table S4.
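A sketch of this ranking procedure is given below; it assumes higher performance is better (e.g., throughput; pass `maximize=False` for latency) and, as a simplifying assumption of ours, averages the ranks of tied optimizers.

```python
import numpy as np
from scipy.stats import rankdata

def average_rank(session_bests, maximize=True):
    """session_bests: dict optimizer -> three session-best performances.
    Ranks optimizers on their best, second-best, and worst sessions
    separately, then averages the three ranks per optimizer."""
    names = list(session_bests)
    sessions = np.array([sorted(session_bests[n], reverse=maximize)
                         for n in names])                # best session first
    per_session_ranks = np.column_stack([
        rankdata(-sessions[:, j] if maximize else sessions[:, j])
        for j in range(sessions.shape[1])                # rank per session index
    ])
    return dict(zip(names, per_session_ranks.mean(axis=1)))
```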
S3.3. Knowledge Transfer
We have demonstrated the average performance enhancement (i.e., PE), speedup, and absolute performance ranking (i.e., APR) in Table 8 and omitted the performance plot in the paper due to space constraints. Figure S2 plots the absolute performance over iterations for each baseline (i.e., each combination of transfer framework and base learner). On TPCC, both RGPE (Mixed-kernel BO) and RGPE (SMAC) find approximately the best performance within 200 iterations, while RGPE (SMAC) has a better speedup. On SYSBENCH, RGPE (SMAC) finds the best performance, though it takes a few more steps. On Twitter, RGPE (Mixed-kernel BO) finds the best performance, and does so quickly. In general, we find that the combinations of RGPE with base learners achieve the best absolute performance as well as the best speedup.
Appendix S4 Experimental environment and reproduction instructions
We conduct all experiments on Aliyun ECS. Each experiment consists of two instances. The first instance is used for the tuning server and is deployed on `ecs.s6-c1m2.xlarge`. The second instance is used for the target DBMS deployment, with four kinds of hardware configurations: `ecs.s6-c1m2.xlarge`, `ecs.s6-c1m2.2xlarge`, `ecs.s6-c1m2.4xlarge`, and `ecs.n4.8xlarge`. The detailed physical memory and CPU information is shown in Table S5. The operating system of each ECS instance is Linux 4.9. The Python version used is 3.7, and the detailed package requirements for our experiments are listed in `requirements.txt` in the root directory of our GitHub repository.
To reproduce the results of our experiments, please download and install the workloads we use following the instructions in `README.md`, and run `train.py` under the directory `script/` with specified arguments, as shown below:
python train.py --method=VBO --knobs_num=5 --y_variable=lat --workload=job --dbname=imdbload --lhs_log=JOB5_VBO.res --knobs_config=../experiment/gen_knobs/job_shap.json
More reproduction details are provided in our online repository.
Table S5. Hardware configurations of the ECS instances.

| Type | CPU | RAM |
|---|---|---|
| `ecs.s6-c1m2.xlarge` | 4 cores | 8GB |
| `ecs.s6-c1m2.2xlarge` | 8 cores | 16GB |
| `ecs.s6-c1m2.4xlarge` | 16 cores | 32GB |
| `ecs.n4.8xlarge` | 32 cores | 64GB |