Facilitating Database Tuning with Hyper-Parameter Optimization: A Comprehensive Experimental Evaluation
[Supplemental Material]
OUTLINE
This supplemental material is organized as follows.
S1. More background for evaluating configuration tuning systems.
S2. Details about intra-algorithms and workloads.
S3. More details and results about the experiments.
S4. Experimental environment and reproduction instructions.
Appendix S1 More background for evaluating configuration tuning systems
Previous evaluations of database configuration tuning are limited to a subset of existing systems and ignore the analysis and evaluation of intra-algorithm components. Instead, we identify three key modules of configuration tuning systems and conduct a thorough analysis and experimental evaluation from a micro perspective (i.e., evaluating every fine-grained algorithm). Figure S1 presents the fine-grained algorithms adopted by existing database tuning systems or drawn from the HPO field. When conducting database configuration tuning in practice, we have to choose a solution “path” across the three modules: (1) knob selection, (2) configuration optimization, and (3) knowledge transfer, as shown in the figure. Each knob selection algorithm determines a unique configuration space and can be “linked” to any of the configuration optimization algorithms (i.e., optimizers). Among the optimizers, all the BO-based optimizers assuming a Gaussian model (SMAC, vanilla BO, mixed-kernel BO, TuRBO) can be “linked” to the workload mapping or RGPE transfer frameworks, and the DDPG algorithm is “linked” to the fine-tuning framework. Existing systems cover only a part of the possible solutions, and it remains unclear which “path” is the best for database configuration tuning. We evaluate all the fine-grained algorithms listed in Figure S1 and carefully decompose the search (evaluation) space to identify the best “path” in various scenarios.
Appendix S2 Details about intra-algorithms
In this section, we present details about the intra-algorithms that we describe at a high level in the paper due to space constraints.
S2.1. Ablation Analysis
Ablation analysis (DBLP:conf/aaai/BiedenkappLEHFH17) selects the features whose changes contribute the most to the performance improvement of configurations. We now describe how ablation analysis quantifies the performance change due to a certain feature’s change. Given a default configuration $\theta^{d}$ and a target configuration $\theta^{t}$ (usually a better one), ablation analysis first computes the feature differences between the default and target configurations. Next, an ablation path $\theta^{d} = \theta_{0}, \theta_{1}, \theta_{2}, \dots$ is constructed iteratively. In each iteration $i$, starting from the previous ablation path configuration $\theta_{i-1}$, we consider every remaining feature change $\delta$, apply it to $\theta_{i-1}$, and obtain the candidate $\theta_{i-1}[\delta]$. Each feature change is a modification of one feature from its value in $\theta_{i-1}$ to its value in $\theta^{t}$, along with any other feature modifications that may be necessary due to conditional constraints in the configuration space. The next configuration $\theta_{i}$ on the ablation path is the candidate with the best objective performance; the performance evaluation is approximated via a surrogate model for efficiency. The order in which the features are changed gives their importance ranks from the ablation analysis. Given a set of observations, we conduct the ablation analysis between the default configuration and every configuration in the observation set with better performance than the default (i.e., the target configurations). For each feature, we use its average rank across all ablation paths as its final importance.
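To make the construction concrete, below is a minimal sketch of the greedy ablation-path loop, assuming dict-based configurations and a fitted `surrogate` callable (both names are illustrative); the handling of conditional constraints between knobs is omitted for brevity.

```python
def ablation_path(default, target, surrogate, minimize=True):
    """Greedy ablation-path construction between two configurations.

    default, target: dicts mapping knob name -> value.
    surrogate: callable(config: dict) -> predicted objective value,
        used in place of real DBMS evaluations for efficiency.
    Returns knob names in the order they were flipped, i.e. the
    importance ranking induced by this (default, target) pair.
    """
    current = dict(default)
    remaining = [k for k in default if default[k] != target[k]]
    order = []
    while remaining:
        best_knob, best_perf = None, None
        for k in remaining:
            candidate = dict(current)
            candidate[k] = target[k]  # flip a single knob to its target value
            perf = surrogate(candidate)
            if best_perf is None or (perf < best_perf if minimize else perf > best_perf):
                best_knob, best_perf = k, perf
        current[best_knob] = target[best_knob]
        remaining.remove(best_knob)
        order.append(best_knob)  # knobs flipped earlier are more important
    return order
```

A knob flipped earlier on the path receives a better (smaller) rank; averaging these ranks over the ablation paths of all better-than-default targets yields the final importance ranking.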
S2.2. SHAP
SHAP (DBLP:conf/nips/LundbergL17) (SHapley Additive exPlanation) uses Shapley values of a conditional expectation function of the original model. SHAP values attribute to each feature the change in the expected model prediction when conditioning on that feature. They explain how to get from the base value that would be predicted if we did not know any features to the current output. When the model is non-linear or the input features are not independent, the order in which features are added to the expectation matters, and the SHAP values arise from averaging the contribution values across all possible orderings (DBLP:conf/nips/LundbergL17). The exact computation of SHAP values is challenging; they can be estimated by the Shapley sampling values method (DBLP:journals/kais/StrumbeljK14) or the Kernel SHAP method (DBLP:conf/nips/LundbergL17).
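As an illustration of the Kernel SHAP estimation, the following sketch ranks the features of a synthetic regression problem by mean absolute SHAP value; the data and model are stand-ins for tuning observations and a fitted surrogate, not our actual setup.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins: 200 sampled configurations over 10 "knobs" and a
# measured objective; in practice these come from tuning observations.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 10))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100).fit(X, y)

background = shap.sample(X, 50)                # background set for the conditional expectation
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X[:20])    # per-feature attributions for 20 configurations

importance = np.abs(shap_values).mean(axis=0)  # global importance per knob
ranking = np.argsort(-importance)              # knob indices, most important first
```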
S2.3. SMAC
SMAC (DBLP:conf/lion/HutterHL11) constructs a random forest as a set of regression trees, each of which is built on $n$ data points randomly sampled with repetition from the entire training data set. It computes the random forest’s predictive mean $\mu(\theta)$ and variance $\sigma^2(\theta)$ for a new configuration $\theta$ as the empirical mean and variance of the individual trees’ predictions, yielding the Gaussian model $\mathcal{N}(\mu(\theta), \sigma^2(\theta))$. SMAC uses the random forest model to select a list of promising parameter configurations. To quantify how promising a configuration $\theta$ is, it uses the model’s predictive distribution for $\theta$ to compute its expected positive improvement $EI(\theta)$ (DBLP:journals/jgo/JonesSW98) over the best configuration seen so far. $EI(\theta)$ is large for configurations with high predicted performance and for those with high predicted uncertainty; thereby, it offers an automatic trade-off between exploitation (focusing on known good parts of the space) and exploration (gathering more information in unknown parts of the space). To gather a set of promising configurations with low computational overhead, SMAC performs a simple multi-start local search and considers all resulting configurations with locally maximal $EI$.
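The sketch below shows the two ingredients just described, the empirical per-tree mean/variance and the EI criterion. It is a simplification rather than SMAC's actual implementation, which additionally handles categorical and conditional parameters and interleaves random configurations.

```python
import numpy as np
from scipy.stats import norm

def rf_mean_var(forest, X):
    """Empirical mean/variance of per-tree predictions for configurations X."""
    preds = np.stack([tree.predict(X) for tree in forest.estimators_])
    return preds.mean(axis=0), preds.var(axis=0)

def expected_improvement(mu, var, best, xi=0.0):
    """EI over the best observed value, for a minimization objective."""
    sigma = np.sqrt(np.maximum(var, 1e-12))
    z = (best - mu - xi) / sigma
    return (best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Usage sketch (X_obs, y_obs, X_cand are assumed observations / candidates):
# forest = RandomForestRegressor(n_estimators=10).fit(X_obs, y_obs)
# mu, var = rf_mean_var(forest, X_cand)
# promising = X_cand[np.argsort(-expected_improvement(mu, var, y_obs.min()))]
```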
S2.4. RGPE
RGPE (feurer2018scalable) is a scalable meta-learning framework to accelerate BO-based optimizers. First, for each previous tuning task $T^i$, it trains a base Gaussian process (GP) model $f^i$ on the corresponding observations $D^i$ from $T^i$. Then it builds a combined surrogate model $\bar{f}$ from the base GP models, instead of the original surrogate fitted on the observations of the target task only. The prediction of $\bar{f}$ at point $\theta$ is given by:

$$\mu(\theta) = \sum_{i=1}^{K} w_i \mu_i(\theta), \qquad \sigma^2(\theta) = \sum_{i=1}^{K} w_i^2 \sigma_i^2(\theta), \quad (1)$$

where $w_i$ is the weight of base surrogate $f^i$, and $\mu_i$ and $\sigma_i^2$ are the predictive mean and variance of base surrogate $f^i$. The weight $w_i$ reflects the similarity between the previous task and the current task. Therefore, $\bar{f}$ utilizes the knowledge of previous tuning tasks, which can greatly accelerate the convergence of tuning on the target task. We then use the following ranking loss function $\mathcal{L}$, i.e., the number of misranked pairs, to measure the similarity between a previous task and the target task:

$$\mathcal{L}(f^i, D) = \sum_{j=1}^{n} \sum_{k=1}^{n} \mathbb{1}\big( (f^i(\theta_j) < f^i(\theta_k)) \oplus (y_j < y_k) \big), \quad (2)$$

where $\oplus$ is the exclusive-or operator, $K$ denotes the number of previous tuning tasks, $n$ is the number of observations $(\theta_j, y_j)$ in the target task’s observation set $D$, and $f^i(\theta_j)$ denotes the prediction of $f^i$ on configuration $\theta_j$. Based on the ranking loss function, the weight $w_i$ is set to the probability that $f^i$ has the smallest ranking loss on $D$, that is, $w_i = P(i = \arg\min_{i'} \mathcal{L}(f^{i'}, D))$. This probability can be estimated using MCMC sampling (DBLP:conf/aistats/MartensTY19).
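A compact sketch of the pieces above follows, with the pairwise ranking loss of Equation (2) and the combined prediction of Equation (1); note that, for brevity, it estimates the weights by bootstrap resampling of the target observations rather than the MCMC sampling cited above.

```python
import numpy as np

def ranking_loss(pred, y):
    """Number of misranked pairs between surrogate predictions and observations."""
    return np.sum((pred[:, None] < pred[None, :]) ^ (y[:, None] < y[None, :]))

def rgpe_weights(base_preds, y, n_samples=1000, seed=0):
    """w_i = P(base surrogate i has the smallest ranking loss), estimated by
    resampling the target observations. base_preds: list of prediction arrays,
    one per base surrogate, on the target configurations with outcomes y."""
    rng = np.random.default_rng(seed)
    wins = np.zeros(len(base_preds))
    for _ in range(n_samples):
        idx = rng.integers(0, len(y), size=len(y))     # resample observations
        losses = [ranking_loss(p[idx], y[idx]) for p in base_preds]
        wins[np.argmin(losses)] += 1
    return wins / n_samples

def combined_prediction(weights, mus, variances):
    """Eq. (1): weighted mean and variance of the combined surrogate."""
    mu = np.sum(weights[:, None] * np.stack(mus), axis=0)
    var = np.sum(weights[:, None] ** 2 * np.stack(variances), axis=0)
    return mu, var
```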
S2.5. More Details about Workloads
While we have presented general profile information for the workloads in Table LABEL:wkl in the paper, we detail their implementation and the reasons we select them in this section.
The Reasons for Workload Selection. When answering Q1 and Q2, we analyze the tuning performance over OLTP and OLAP scenarios. We use an OLAP workload, JOB, and an OLTP workload, SYSBENCH. These two workloads are often adopted in evaluating database configuration tuning methods and represent online analytical and transaction processing scenarios, respectively. For example, JOB is adopted by QTune (DBLP:journals/pvldb/LiZLG19), and SYSBENCH is adopted by CDBTune (DBLP:conf/sigmod/ZhangLZLXCXWCLR19), QTune (DBLP:journals/pvldb/LiZLG19), and ResTune (DBLP:conf/sigmod/ZhangWCJT0Z021). When conducting knowledge transfer experiments (Q3), we focus on the OLTP scenarios, since few OLAP workloads beyond JOB and TPC-H are suitable for constructing source workloads of tuning tasks. We choose three OLTP workloads, SYSBENCH, TPC-C, and Twitter, as the target tuning workloads; all have been adopted in previous studies. For example, OtterTune (DBLP:conf/sigmod/AkenPGZ17), CDBTune (DBLP:conf/sigmod/ZhangLZLXCXWCLR19), and ResTune (DBLP:conf/sigmod/ZhangWCJT0Z021) have adopted TPC-C for evaluation, and ResTune has also adopted Twitter. We use five additional OLTP workloads (i.e., SEATS, SmallBank, TATP, Voter, SIBench) as source workloads and configure them with various sizes and read-write ratios, as shown in Table LABEL:wkl. SIBench is a microbenchmark designed to explore snapshot isolation in DBMSs (DBLP:journals/pvldb/JungHFR11). Based on our observations, the tuning opportunity for SIBench is limited; we add it to the source workloads to increase their diversity.
Implementation of the Workloads. For JOB, we use the same setup illustrated in (DBLP:journals/pvldb/LeisGMBK015). For SYSBENCH, we load 150 tables, each of which contains 800,000 rows, and adopt the read-write mode. For workloads from OLTP-Bench, we use the scale factor to determine the data size (e.g., the number of warehouses in TPCC), as shown in Table S1. In addition, the parameter `terminal` is set to 64 for each workload. We keep the other parameters at the default values provided by OLTP-Bench, including `isolation` and the `weights` of transactions.
Table S1. Scale factors of the workloads from OLTP-Bench.

| Workload | TPCC | Smallbank | SIBench | Voter | Seats | TATP | Twitter |
|---|---|---|---|---|---|---|---|
| Scale Factor | 200 | 1500 | 10 | 1000 | 10000 | 50 | 100 |
Appendix S3 More details and results about the experiments
In this section, we present additional experimental details.
S3.1. Knob Selection
Table S2. Top 20 important knobs for OLTP workloads.

| Knob | Type | Dynamic | Module | Description |
|---|---|---|---|---|
| innodb_thread_concurrency | Integer | Yes | Concurrency | The maximum number of threads permitted inside of InnoDB. |
| innodb_log_file_size | Integer | No | Logging | The size in bytes of each log file in a log group. |
| max_allowed_packet | Integer | Yes | Replication | The upper limit on the size of any single message between the MySQL server and clients. |
| innodb_io_capacity_max | Integer | Yes | IO | The maximum number of IOPS performed by InnoDB background tasks. |
| tmp_table_size | Integer | Yes | Memory | The maximum size of internal in-memory temporary tables. |
| query_prealloc_size | Integer | Yes | Memory | The size in bytes of the persistent buffer used for statement parsing and execution. |
| max_heap_table_size | Integer | Yes | Memory | The maximum size to which user-created MEMORY tables are permitted to grow. |
| innodb_doublewrite | Categorical | No | Memory | Whether the doublewrite buffer is enabled. |
| transaction_alloc_block_size | Integer | Yes | Memory | The amount in bytes by which to increase a per-transaction memory pool which needs memory. |
| join_buffer_size | Integer | Yes | Memory | The minimum size of the buffer that is used for joins. |
| innodb_flush_log_at_trx_commit | Categorical | Yes | Logging | Controls the balance between ACID compliance for commit operations and performance. |
| innodb_max_dirty_pages_pct_lwm | Integer | Yes | Logging | The percentage of dirty pages at which preflushing is enabled to control the dirty page ratio. |
| innodb_log_files_in_group | Integer | No | Logging | The number of log files in the log group. |
| innodb_buffer_pool_size | Integer | Yes | Memory | The size in bytes of the buffer pool. |
| innodb_online_alter_log_max_size | Integer | Yes | Logging | An upper limit on the size of the log files used during online DDL operations for InnoDB tables. |
| key_cache_age_threshold | Integer | Yes | Memory | Controls the demotion of buffers from the hot sublist of a key cache to the warm sublist. |
| binlog_cache_size | Integer | Yes | Memory | The size of the cache to hold changes to the binary log during a transaction. |
| innodb_purge_rseg_truncate_frequency | Integer | Yes | Logging | The frequency with which the purge system frees rollback segments. |
| query_cache_limit | Integer | Yes | Memory | The maximum size of individual query results that can be cached. |
| innodb_sort_buffer_size | Integer | No | Memory | The sort buffer size for online DDL operations that create or rebuild secondary indexes. |
Table S3. Hardware instances used for the target DBMS.

| Instance | A | B | C | D |
|---|---|---|---|---|
| CPU | 4 cores | 8 cores | 16 cores | 32 cores |
| RAM | 8GB | 16GB | 32GB | 64GB |
Top impacting knobs with high tunability for OLTP workloads. We further conduct an experiment using SHAP to generate a ranking of the most impacting knobs across OLTP workloads and hardware instances, and we use this ranking to evaluate the knowledge transfer component across OLTP workloads in LABEL:sec:exp-transfer in the paper. We use the seven OLTP workloads listed in Table LABEL:wkl in the paper and perform LHS to collect 1250 samples for each workload on each of the four hardware instances listed in Table S3. We then adopt SHAP to generate an importance ranking for each (workload, hardware) pair and count the number of times each knob appears in the top 20 of all the rankings to measure its overall importance. Table S2 shows the top-20 important knobs for OLTP workloads with brief descriptions. We believe this ranking can provide database practitioners guidance for choosing which knobs to tune. Dynamic variables can be changed at runtime using the SET statement, while the others can only be set at server startup using options on the command line or in an option file. We find that configuring the maximum number of threads and the size of the log file contributes the most to the performance gain, which is aligned with previous analyses (DBLP:conf/sigmod/AkenPGZ17; DBLP:conf/sigmod/ZhangWCJT0Z021; DBLP:journals/pvldb/AkenYBFZBP21). In addition, 11 of the top-20 knobs are related to memory allocation, which indicates that the default memory allocation settings in MySQL may not be appropriate across workloads and hardware instances. We leave the important-knob ranking for OLAP workloads as future work, since few OLAP workloads beyond JOB and TPC-H are suitable for database tuning tasks.
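The pipeline can be summarized by the sketch below; `bounds`, `workloads`, `instances`, and the `shap_ranking` helper are hypothetical placeholders for our actual tuning setup, not part of our released code.

```python
from collections import Counter

from scipy.stats import qmc

# Assumed inputs: `bounds` (one (low, high) pair per knob), `workloads`,
# `instances`, and a hypothetical shap_ranking(workload, instance, configs)
# helper that runs the benchmark, fits a model on the observations, and
# returns knob names sorted by SHAP importance.

# Step 1: LHS, 1250 space-filling configurations per (workload, instance) pair.
sampler = qmc.LatinHypercube(d=len(bounds), seed=0)
configs = qmc.scale(sampler.random(n=1250),
                    [lo for lo, _ in bounds], [hi for _, hi in bounds])

# Step 2: count how often each knob enters the top 20 of a ranking.
counter = Counter()
for workload in workloads:
    for instance in instances:
        counter.update(shap_ranking(workload, instance, configs)[:20])
top20 = [knob for knob, _ in counter.most_common(20)]
```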
S3.2. Configuration Optimization
Table S4. Average ranking of optimizers in terms of the best configuration they found.

| Configuration Space | Workload | Vanilla BO | One-Hot BO | Mixed-Kernel BO | SMAC | TPE | TuRBO | DDPG | GA |
|---|---|---|---|---|---|---|---|---|---|
| Small | JOB | 5.00 | 2.67 | 2.33 | 3.33 | 6.33 | 3.67 | 5.00 | 7.67 |
| Small | SYSBENCH | 5.67 | 5.33 | 2.00 | 3.33 | 5.33 | 4.00 | 5.00 | 5.33 |
| Small | Average | 5.33 | 4.00 | 2.17 | 3.33 | 5.83 | 3.83 | 5.00 | 6.50 |
| Medium | JOB | 5.33 | 3.33 | 3.00 | 1.00 | 7.67 | 3.67 | 5.00 | 7.00 |
| Medium | SYSBENCH | 5.00 | 4.33 | 1.67 | 1.67 | 6.67 | 4.33 | 6.00 | 6.33 |
| Medium | Average | 5.17 | 3.83 | 2.33 | 1.33 | 7.17 | 4.00 | 5.50 | 6.67 |
| Large | JOB | 7.00 | 7.00 | 7.00 | 1.00 | 7.00 | 7.00 | 2.00 | 7.00 |
| Large | SYSBENCH | 7.67 | 6.00 | 3.33 | 1.00 | 6.00 | 3.00 | 4.33 | 4.67 |
| Large | Average | 7.33 | 6.50 | 5.17 | 1.00 | 6.50 | 5.00 | 3.17 | 5.83 |
| Overall | | 5.94 | 4.78 | 3.22 | 1.89 | 6.50 | 4.28 | 4.56 | 6.33 |
Average ranking of optimizers in terms of the best configuration they found. While we have presented the average rankings of optimizers in Table 7 in the paper, we detail the rankings on each workload and configuration space in Table S4. In addition, we describe how we calculate the average ranking of the optimizers. For each workload and configuration space, we run three tuning sessions per optimizer and sort the three sessions by the best performance found within 200 iterations. Then, we rank the optimizers based on the best performance in their best sessions, then based on their second-best sessions, and lastly based on their worst sessions. Finally, we average the three ranks of each optimizer, which yields one entry in Table S4.
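A sketch of this ranking procedure is given below; it assumes higher performance is better (e.g., throughput; pass `maximize=False` for latency) and, as a simplifying assumption of ours, averages the ranks of tied optimizers.

```python
import numpy as np
from scipy.stats import rankdata

def average_rank(session_bests, maximize=True):
    """session_bests: dict optimizer -> three session-best performances.
    Ranks optimizers on their best, second-best, and worst sessions
    separately, then averages the three ranks per optimizer."""
    names = list(session_bests)
    sessions = np.array([sorted(session_bests[n], reverse=maximize)
                         for n in names])                # best session first
    per_session_ranks = np.column_stack([
        rankdata(-sessions[:, j] if maximize else sessions[:, j])
        for j in range(sessions.shape[1])                # rank per session index
    ])
    return dict(zip(names, per_session_ranks.mean(axis=1)))
```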
S3.3. Knowledge Transfer
We have demonstrated the average performance enhancement (i.e., PE), speedup, and absolute performance ranking (i.e., APR) in Table 8 and omitted the performance plot in the paper due to space constraints. Figure S2 plots the absolute performance over iterations for each baseline (i.e., each combination of transfer framework and base learner). On TPCC, both RGPE (Mixed-kernel BO) and RGPE (SMAC) find approximately the best performance within 200 iterations, while RGPE (SMAC) has a better speedup. On SYSBENCH, RGPE (SMAC) finds the best performance, though it takes a few more steps. On Twitter, RGPE (Mixed-kernel BO) finds the best performance, and does so quickly. In general, we find that the combinations of RGPE with base learners achieve the best absolute performance as well as the best speedup.
Appendix S4 Experimental environment and reproduction instructions
We conduct all experiments on Aliyun ECS. Each experiment consists of two instances. The first instance is used for the tuning server and is deployed on `ecs.s6-c1m2.xlarge`. The second instance is used for the target DBMS deployment, with four kinds of hardware configurations: `ecs.s6-c1m2.xlarge`, `ecs.s6-c1m2.2xlarge`, `ecs.s6-c1m2.4xlarge`, and `ecs.n4.8xlarge`. The detailed physical memory and CPU information is shown in Table S5. The operating system of each ECS instance is Linux 4.9. The Python version used is 3.7, and the detailed package requirements for our experiments are listed in `requirements.txt` in the root directory of our GitHub repository.
To reproduce the results of our experiments, please download and install the workloads we use following the instructions in `README.md`, and run `train.py` under the directory `script/` with specified arguments, as shown below:
python train.py --method=VBO --knobs_num=5 --y_variable=lat --workload=job --dbname=imdbload --lhs_log=JOB5_VBO.res --knobs_config=../experiment/gen_knobs/job_shap.json
More reproduction details are provided in our online repository.
Table S5. Hardware configurations of the ECS instances.

| Type | CPU | RAM |
|---|---|---|
| `ecs.s6-c1m2.xlarge` | 4 cores | 8GB |
| `ecs.s6-c1m2.2xlarge` | 8 cores | 16GB |
| `ecs.s6-c1m2.4xlarge` | 16 cores | 32GB |
| `ecs.n4.8xlarge` | 32 cores | 64GB |