
ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels

Ali TehraniJamsaz1, Alok Mishra2, Akash Dutta1, Abid M. Malik3, Barbara Chapman2, Ali Jannesari1
1Iowa State University, Ames, Iowa, USA {tehrani, adutta, jannesar}@iastate.edu 2Stony Brook University, Stony Brook, New York, USA {almishra, bchapman}@cs.stonybrook.edu 3Brookhaven National Laboratory, Upton, New York, USA amalik@bnl.gov
Abstract

GPU-based HPC clusters are attracting more scientific application developers due to their extensive parallelism and energy efficiency. In order to achieve portability among a variety of multi/many core architectures, a popular choice for an application developer is to utilize directive-based parallel programming models, such as OpenMP. However, even with OpenMP, the developer must choose from among many strategies for exploiting a GPU or a CPU. Recently, Machine Learning (ML) approaches have brought significant advances in the optimizations of HPC applications. To this end, several ways have been proposed to represent applications’ characteristics for ML models. However, the available techniques fail to capture features that are crucial for exposing parallelism. In this paper, we introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree to represent control and data flow information. The originality of this work lies in the addition of new edges exploiting the implicit ordering and parent-child relationships in ASTs, as well as the introduction of edge weights to account for loop and condition information. We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region across CPUs and GPUs. Various transformations utilizing collapse and data transfer between the CPU and GPU are used to construct the dataset. The predicted runtime of the model is used to determine which transformation provides the best performance. Results show that our approach is indeed effective, with a normalized RMSE ranging from 4×10⁻³ to at most 1×10⁻² in its runtime predictions.

Index Terms:
OpenMP, HPC, offloading, program representation

I Introduction

Over the years, the increase in the number of on-chip cores has led to significant improvement in the performance of parallel code. To exploit this increased computation capacity, applications have been readily modified using Pthreads or OpenMP constructs. In the last decade, General Purpose Graphics Processing Units (GPGPUs) have gained popularity. They can handle massive data parallelism with low power consumption. As a result, most HPC platforms are now coupled with GPGPU(s). HPC platforms will continue to support more accelerators, but given the difficulties in utilizing and configuring even one accelerator, using and configuring multiple heterogeneous accelerators will become increasingly difficult in the future. On the other hand, utilizing GPUs effectively imposes challenges that require re-engineering the code and applications. The full computing power of GPUs may not be utilized if they are not used properly. In addition, improper data transfer between the host (CPU) and device (GPU) can result in significant overheads. It is exceptionally challenging and burdensome for developers to create applications for extremely heterogeneous platforms with multiple devices. Recently emerging tools and programming models aim to automate this process of application adaptation to heterogeneous platforms. One of the most popular parallel programming models, OpenMP [1], aims to make the process of developing parallel programs that can run on different architectures simpler. Despite this, optimizing a code to use the OpenMP directives correctly is still a tedious task for large and complex applications.

Recent advances in Deep Learning (DL) have enabled researchers to apply DL to a wide range of software engineering tasks and challenges, including code comment generation, compiler optimizations, and heterogeneous platform adaptation. Since source code cannot be directly fed into DL models, we need a suitable representation for applications to serve as input to various DL models. In this paper, we propose ParaGraph, a graph-based program representation that aims to expose critical characteristics (e.g., the number of iterations in a loop) of HPC applications. To the best of our knowledge, existing program representations are not designed to expose and represent the characteristics of parallel HPC applications. These program representations usually target serial code and do not incorporate characteristics of parallel code. As a result, DL models built on top of these representations cannot model features inherent to parallel codes.

Previous works, such as COMPOFF [2], have relied on feature engineering to overcome these shortcomings. However, feature engineering requires expert knowledge. For a fast-evolving field such as HPC, always relying on expert intervention for such feature engineering is not realistic. There is a need for an adaptable approach that can automatically extract such features. Our proposed approach aims to address this and uses Graph Neural Networks (GNNs) to model the code graphs generated by ParaGraph. Experimental results show that our model can predict the runtime of HPC kernels with a very low error rate (at most 1×10⁻² in terms of normalized RMSE), confirming the efficacy of our approach.

In the rest of the paper, we first discuss some background in Section II on program representation, OpenMP GPU offloading, DL in compilers, OpenMP Advisor (a tool we used to generate various kernels for our work), and other related work. Then in Section III, we define ParaGraph. Section IV covers the experiments carried out in this paper, and Section V discusses the results. Finally, we conclude our work with discussions of future plans in Section VI.

II Background and Other Related Work

In this section, related works and some background for program representation, OpenMP parallelism, and its tools are discussed.

II-A Program Representation

Recently, with breakthroughs in Machine Learning (ML) and specifically Deep Learning (DL), researchers have been applying data-driven approaches to various software engineering tasks and challenges, ranging from code comment generation [3] to compiler optimizations [4]. However, source code cannot simply be fed to DL models, as they do not accept strings as input. Therefore, applications need to be presented in a way usable by the models. Some works have presented programs as a sequence of tokens [5, 6]. While this kind of representation has worked well for some downstream tasks, presenting programs as a sequence of tokens discards critical syntactic and semantic characteristics of programs that are essential for deep learning models to reason about code.

Recently, to better present programs’ semantic and syntactic information, graph-based representations have been proposed. Allamanis et al. [7] proposed enhanced ASTs and added new edges to represent control flow information. They applied the augmented AST to two tasks: variable misuse detection and variable name suggestion. Inspired by their work, we also augment the AST by adding new edges. However, we incorporate additional edges to better represent loops and if conditions.

Cummins et al. [4] proposed a lower-level graph representation based on the LLVM intermediate representation for solving compiler optimization tasks. These representations are effective for downstream tasks, such as simple algorithm classification. To the best of our knowledge, there does not exist a representation tailored toward representing the characteristics of parallel loops and if-statements.

II-B OpenMP GPU Offloading

When different architectures support different native languages, or different optimization techniques for the same language, program portability becomes a serious problem. Regrettably, programming languages and compilers are still far from being able to handle portability on their own. Program portability should be the primary priority for developers who have to deal with multiple devices accessing the same data, or whose programs must work effectively on systems with various node architectures, such as many-core vs. GPU-accelerated nodes.

Utilizing a directive-based programming paradigm, such as OpenMP, the de-facto standard for parallel programming in C/C++ and Fortran, is one approach to ensuring portability across several architectures. OpenMP intends to move to extremely heterogeneous architectures [8] and has supported GPU offloading since specification 4.0. Unfortunately, even with OpenMP, optimizing large-scale applications remains a challenging task.

Figure 1: OpenMP target teams distribute parallel for (levels of parallelism within a target region: teams distribute spreads iterations across teams, and parallel for spreads them across threads within a team).

Since GPUs are highly parallel machines, programmers should take full advantage of this fact to get the most performance possible. The “omp parallel for” pragma alone will only parallelize a code for CPUs; it won’t offload the computation to a GPU. A platform like a GPU expects a high level of coarse-grained parallelism, and exposing only a single level of parallelism constrains how much of the GPU can be used. Figure 1 illustrates how the OpenMP teams and distribute directives create additional levels of parallelism. At the start of a target region, only one team and one member thread are active. The teams distribute directive distributes the full loop iteration space among all available teams. Additionally, in the case of nested parallel loops, we can utilize the parallel for directive on them to further distribute the iterations of the nested loops among threads within a team. We utilize the combined directive “teams distribute parallel for” to distribute the iteration space of one loop among teams and threads inside a team when there is just one level of parallel loops or when the outer loop has sufficient parallelism. The teams directive groups threads into teams, and distribute schedules loop iterations across those teams. Teams resemble CUDA threadblocks on NVIDIA GPUs, as there is no synchronization primitive to act as a barrier between threads from different teams. The threads within each team are typically parallelized using parallel for.

There are several frameworks being developed right now that will automatically assist application developers in handling extreme heterogeneity. These frameworks necessitate analytical cost models, which help them select the kind of optimization the application requires. Hand-tuned cost functions are extensively employed currently; however, calculating optimization costs requires a deeper understanding of the underlying hardware. Despite its efficacy, manually building a cost model for a single architecture can take months. Since cost functions are crucial and manual tuning is time-consuming, compiler engineers are investigating Machine and Deep Learning approaches as a means of automating this process.

II-C ML in Compiler

Recently, a lot of research has been done on how to use learning-based techniques in compilation as well. Early work exploiting ML in compilers primarily explored its use to help optimize sequential programs. However, its application to the task of optimizing parallel programs has attracted significant attention in the past decade due to the prevalence of multi-core platforms and, more recently, heterogeneous systems [9, 10]. Mirka et al. [11] devised a decision-tree-based approach to predict the scheduling policy for an OpenMP parallel region. Denoyelle et al. [12] use machine learning techniques to optimize OpenMP programs for scheduling policies and the number of threads. Alcaraz et al. [13] use neural networks and performance counters to predict the number of threads. Dutta et al. [14] use an LLVM-IR based graph representation and performance counters to train a GNN model to predict the number of threads, chunk size, and scheduling policy for OpenMP loops. Tree and graph-based features have also been used by Malik et al. [15], who present a unique graph-based approach for feature representation. Learning-based techniques were used to build classifiers to determine whether to offload OpenCL code [16] and to select a clock frequency at which the processor should operate [17]. A high level of accuracy was reported; however, the benefits could not be quantified as the work did not attempt to generate modified code. They also explored regression techniques to build curve-fitting models to search for the sweet spot for work partitioning among processors [18] or a trade-off of energy and performance [19].

The results of prior efforts applying ML to compiler optimizations are encouraging. However, new feature engineering practices need to be explored that can help learning-based methods learn more about a code and its computational needs. In COMPOFF [2], the authors provide a proof of concept for using an Artificial Neural Network model to make better decisions in offloading a kernel to a GPU using OpenMP. They utilize a fully-connected feed-forward network, also referred to as a multi-layer perceptron (MLP), which is effectively a stack of fully connected layers.

II-D OpenMP Advisor

OpenMP Advisor [20] is a first-of-its-kind compiler tool that enables OpenMP code offloading to a GPU via Machine Learning. The Advisor is divided into three major modules: Kernel Analysis, Cost Model, and Code Transformation. The Kernel Analysis module recommends various variants for a given application, and the Code Transformation module generates code for those variants. In our work, we use the code transformation module of OpenMP Advisor to generate various kernels for training our model. Through the use of COMPOFF, a machine learning-based cost model, this tool can identify the kernels that are most suitable for offloading by predicting their runtime. But COMPOFF has some limitations. It requires figuring out how many operations are contained within a kernel, which is a challenging task in and of itself. The Advisor’s current functionality is restricted to GPUs, though it has the potential to be expanded to support additional OpenMP-capable hardware. One technique for expanding to other devices is to develop a model that works beyond GPUs as well. Extending COMPOFF to work beyond GPUs is a daunting task, and there is a need for a new cost model that can bridge this gap.

II-E Other Related Works

A recent tool called BLISS [21] allows for auto-tuning parallel applications without the use of instrumentation or domain-specific knowledge. This auto-tuner develops a variety of lightweight models with the aid of Bayesian optimization that can compete with the output quality of a complex, heavyweight Machine Learning (ML) model. In the same context, Bayesian optimization is frequently used by other autotuners, including BOHB [22], HiPerBOt [23], GPTune [24] and ytopt [25]. Unfortunately, due to their expensive evaluation functions, tuning large-scale HPC systems, which are becoming increasingly heterogeneous, complex, and expensive to evaluate, is still quite challenging. Moreover, most of the autotuners mentioned above are online, meaning that even at inference time they execute applications iteratively to tune them. Therefore, for complex heterogeneous large-scale HPC systems, these tuners will be very costly to apply. In contrast, ParaGraph is an offline approach and does not require the execution of applications in the inference phase. COMPOFF [2] provides a novel portable cost model that statically calculates the cost of OpenMP offloading on various architectures. COMPOFF is also an offline approach. In this work, we compare ParaGraph to COMPOFF since both can predict the runtime of a kernel on a GPU statically.

III ParaGraph

Figure 2: Modification to the AST to create the augmented AST for ParaGraph (example snippets for a variable declaration and assignment, an if/else statement, and a for loop, annotated with the added NextToken, NextSibling, Ref, ConTrue/ConFalse, and ForExec/ForNext edges and with edge weights).

In this section, we present ParaGraph, a weighted graph representation of programs that captures characteristics related to HPC kernels. The Abstract Syntax Tree serves as a base for ParaGraph; additional edges and attributes are introduced on top of the AST to construct ParaGraph. Some previous works have proposed various ways of augmenting ASTs to present semantic information [26, 7]. Although these AST-based program representations are effective in training deep learning models for downstream tasks such as variable misuse detection, not enough attention has been given to better representing regions of programs that have a high impact on the performance of the application, such as loops and if conditions. In ParaGraph, we try to address these shortcomings. As stated, ParaGraph is based on the AST; however, it includes new edges to represent characteristics such as control flow or the order of children of a node. Moreover, loops are represented by special edges conveying the order in which the loop condition and its body execute iteratively. Additionally, we add weights to the edges to represent how many times each node will be utilized during the execution of a program. It is worth noting that all these augmentations are applied statically; therefore, they do not cause significant overhead. ParaGraph can be directly utilized by Graph Neural Network models, and our experimental results show that, in performance optimization, this representation is quite effective and outperforms the state-of-the-art approach.

III-A ParaGraph Construction

III-A1 Abstract Syntax Tree

As stated, the base of ParaGraph is the Abstract Syntax Tree. Typically, any compiler can be used to generate an AST. In this study, we use Clang, a popular C/C++ front-end, to parse and compile C/C++ programs. ASTs contain two types of nodes: non-terminal and terminal. Non-terminal nodes are often called syntax nodes, whereas terminal nodes are called syntax tokens. The relation between nodes in an AST is a parent-child relation. Once an AST is retrieved, we add additional edges and attributes depicting information related to control flow and data flow. The following subsections provide more details on these augmentations.
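The following minimal sketch parses a source file with Clang's Python bindings and walks the resulting AST, distinguishing syntax nodes from syntax tokens. The choice of the clang.cindex API and the input file name kernel.c are illustrative assumptions; the paper only states that Clang is used as the front-end.

```python
# Minimal sketch: obtaining a Clang AST with the clang.cindex Python bindings.
from clang import cindex

def walk(cursor, depth=0):
    """Print each AST node; nodes without children act as terminal syntax tokens."""
    children = list(cursor.get_children())
    kind = "syntax node" if children else "syntax token"
    print("  " * depth + f"{cursor.kind.name} ({kind})")
    for child in children:
        walk(child, depth + 1)

index = cindex.Index.create()
tu = index.parse("kernel.c", args=["-fopenmp"])  # hypothetical input file
walk(tu.cursor)
```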

III-A2 Augmenting Abstract Syntax Tree

ASTs typically provide structural and syntactical information about a program. Although this information can be useful for neural networks to learn the characteristics of programs to some extent, it has been shown that adding additional attributes and information, especially from a compiler's point of view, boosts the learning capabilities of deep learning models. Inspired by existing works [26, 7], we introduce the following additional edges to the AST (shown in Figure 2) to represent information related to the control and data flow of the programs; a small construction sketch follows the list.

  • NextToken: By default, no order is imposed among the syntax tokens. However, from a compiler’s perspective, syntax tokens have an order, running from the far-left token to the far-right one. To present this information in a graph, we introduce a new edge type called NextToken. NextToken connects each syntax token to the syntax token on its right side; in other words, it connects each syntax token to the next syntax token, imposing an order on the syntax tokens. This edge type is shown in orange in Figure 2 (the AST on the left).

  • NextSib: As mentioned above, edges in ASTs show a parent-child relationship. Compilers assume an order among children nodes. For example, a binary operator such as division has two children: in an AST, the left child is always the numerator, and the right child is the denominator. Therefore, it is necessary to show the order of the children. NextSib captures this information by connecting each syntax node to its sibling on the right (the blue edge in Figure 2).

  • Ref: In Clang’s AST, referenced variables are represented by DeclRefExpr nodes. These nodes are terminal and do not have children. During graph construction, we introduce Ref edges (shown in pink in Figure 2) connecting a DeclRefExpr node to where the corresponding variable is defined. This conveys where a declared variable is used throughout the graph.

  • ForNext, ForExec: Loops are typically shown as ForStmt nodes in Clang’s AST. ForStmt nodes usually have four children. The first child initializes the loop counter, whereas the second child checks the loop’s condition, deciding whether the next iteration should be executed or the loop has ended. The third child is a CompoundStmt representing the body of the loop. Lastly, the fourth child is a modifier that modifies the loop counter, such as incrementing it by one. The relationship between these four children exposes critical characteristics of loops. These characteristics are known to compilers but are not presented explicitly in the ASTs. We add ForExec edges, which connect the first child (counter initialization) to the second child (loop condition) and also connect the second child to the third child (loop body), as shown on the right side of Figure 2. In effect, ForExec edges show the flow of executing the next iteration of the loop. We connect the third child (loop body) of the ForStmt node to the fourth child (loop counter modifier) via a new type of edge called ForNext. ForNext represents whether the next iteration of the loop needs to be executed or not, whereas ForExec represents the next execution of the loop. After the fourth child (loop counter modifier) comes the second child (loop condition), which checks whether the next iteration is going to happen or the loop has ended. Therefore, the fourth child is connected to the second child with a ForNext edge, as shown in Figure 2.

  • ConTrue, ConFalse: If statements in the ASTs are shown by IfStmt nodes. IfStmt nodes typically have three children. The first child is the comparison condition, the second child is the body of the if condition, and the third child is the body of the else part. To present this information, we connect the first child to the second child through a new type of edge called ConTrue, showing that the flow moves to the second child if the if condition is true. On the other hand, the first child is connected to the third child via a ConFalse edge. This edge shows that if the condition of the if statement is not met, the flow moves to the third child.

  • Child: Child edges are the normal AST edges that represent the parent-child relationship between nodes in the AST.
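The sketch below (in Python, using clang.cindex) illustrates how the edges listed above could be added while traversing the AST. It is not the authors' implementation: the assumed child ordering of ForStmt and IfStmt nodes (following the descriptions above), the node-id scheme, and the omission of the NextToken pass are simplifications.

```python
# Hedged sketch of the edge-augmentation pass: Child, NextSib, Ref,
# ForExec/ForNext, and ConTrue/ConFalse edges added while walking a Clang AST.
from clang import cindex
from clang.cindex import CursorKind

edges = []     # (src_id, dst_id, edge_type) triples
node_ids = {}  # cursor hash -> integer node id

def nid(cursor):
    return node_ids.setdefault(cursor.hash, len(node_ids))

def add_edges(cursor):
    children = list(cursor.get_children())
    for child in children:
        edges.append((nid(cursor), nid(child), "Child"))          # plain AST edge
    for left, right in zip(children, children[1:]):
        edges.append((nid(left), nid(right), "NextSib"))           # sibling order
    if cursor.kind == CursorKind.DECL_REF_EXPR and cursor.referenced is not None:
        edges.append((nid(cursor), nid(cursor.referenced), "Ref")) # use -> definition
    if cursor.kind == CursorKind.FOR_STMT and len(children) == 4:
        # Assumed roles, following the description above: init, cond, body, inc.
        init, cond, body, inc = children
        edges.append((nid(init), nid(cond), "ForExec"))
        edges.append((nid(cond), nid(body), "ForExec"))
        edges.append((nid(body), nid(inc), "ForNext"))
        edges.append((nid(inc), nid(cond), "ForNext"))
    if cursor.kind == CursorKind.IF_STMT and len(children) >= 2:
        edges.append((nid(children[0]), nid(children[1]), "ConTrue"))      # then-branch
        if len(children) == 3:
            edges.append((nid(children[0]), nid(children[2]), "ConFalse")) # else-branch
    for child in children:
        add_edges(child)

tu = cindex.Index.create().parse("kernel.c", args=["-fopenmp"])  # hypothetical input
add_edges(tu.cursor)
```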

Figure 3: Workflow of ParaGraph (an input application is expanded into variants, the Clang frontend produces an AST for each variant, the ParaGraph generator augments the ASTs, the variants are compiled and executed by the runtime measurement module, and the resulting dataset of graphs and runtimes is used to train the GNN model for prediction).

III-A3 Weighted Edges

In the previous section, we explained how we augment the AST by introducing new types of edges to better present programs. However, there is still some information missing from our representation. For instance, whenever we encounter a loop in a program, the loop’s body could be executed multiple times, or the branches of an if statement will not execute the same number of times. To present this information in our graph structure, we augment the edges by adding weights. Weights of edges are calculated based on the region and edge type. Weights are added only to Child edges, since these are AST edges, and a compiler uses the AST nodes and Child edges to transform an AST into a lower-level representation for machine execution. For illustration purposes, some edge weights are shown in Figure 2. We add the weights as follows:

  • Default edge weight: By default, we add a weight of 1 to edges to represent the fact that each statement in code executes one time only, and once a statement executes, the control moves to the next statement.

  • Loops: Unlike typical statements that execute only once, statements in the body of a loop will be executed several times depending on the number of iterations. To expose the number of iterations in our graph representation, we first observe the number of iterations in a loop and then multiply the edge weights by that number. Moreover, the workload of each thread is also implicitly taken into account if the loop scheduling is static. This is done by dividing the number of iterations by the number of threads. For instance, if a loop has 100 iterations, and it is statically scheduled among four threads, we roughly assume each thread executes 25 iterations; therefore, edges within the body of the loop will have a weight of 25.

  • If statements: The nodes and edges under one branch of an if statement versus the other will not have the same number of executions. Therefore, the edge weights within if statements need to be adjusted. To model the execution of branches, we posit a probability: the assumption is that each branch of an if statement has a probability of 1/2. Therefore, the weights of edges within if statements are divided by 2. Experimental results show that this representation of if statements provides better results; further optimization of edge weights is left as future work.
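As a concrete illustration of this scheme (a sketch under assumed parameters, not the authors' code), the function below computes the weight of a Child edge from the trip count of an enclosing statically scheduled loop, the number of threads, and the nesting depth of enclosing if branches:

```python
# Illustrative weighting for Child edges: default weight 1, scaled by the
# per-thread iteration count of an enclosing statically scheduled loop, and
# halved once for every enclosing if branch (assumed probability 1/2).
def child_edge_weight(iterations=1, num_threads=1, enclosing_ifs=0):
    per_thread_iterations = iterations / num_threads
    return per_thread_iterations / (2 ** enclosing_ifs)

# The example from the text: a 100-iteration loop statically scheduled
# over four threads gives edges inside the loop body a weight of 25.
assert child_edge_weight(iterations=100, num_threads=4) == 25.0
```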

III-B Benefit to GNN

ParaGraph can be used by existing GNN models, which are a type of neural network that can operate over graphs, for downstream tasks, especially those related to performance optimization. ParaGraph can be exploited both as a heterogeneous graph and as a homogeneous graph. For simplicity, in this study, we treat ParaGraph as a homogeneous graph. Typically, a graph is defined as in Equation 1, where V is the set of nodes and E contains an adjacency matrix. Elements of E are either 1 or 0 to show whether an edge exists between two nodes.

G = (V, E)   (1)

We extend the AST for our new representation by including new edges like NextToken, NextSib, and the other types that were explained earlier. We formally define ParaGraph as follows:

ParaGraph = (V, E, T, W)   (2)

Where V and E are as previously defined in Equation 1, T ∈ ℤ⁺ represents the type of each edge (such as NextToken, NextSib, etc.), and W ∈ ℤ⁺ represents the edge weights, which are zero for any edge type other than Child.

These additional edges and attributes are all added to the graphs statically. Therefore, one obvious question is why we do not let the model reason about these additional attributes by itself rather than presenting them in the graph. On the one hand, adding these attributes to the graphs is not costly; on the other hand, we do not want to waste the resources and the learning capabilities of the model on rediscovering these obvious facts. Therefore, by augmenting the ASTs and including the attributes mentioned earlier, we can train GNN models more effectively. Later on, our ablation study further confirms that the model benefits from the attributes and information presented in ParaGraph.

We adapt Relational Graph Attention Networks (RGAT) [27] and use the ParaGraph representation as input to train a model for predicting the runtime of applications on different accelerators. In RGAT, attention logits are computed for each edge type. In the results section, we will see that with ParaGraph, the model has a very small error in its predictions and outperforms the current state-of-the-art approach.

Figure 3 shows the overall workflow of our GNN-based pipeline for runtime prediction. The first step is preparing different variants of an application for each accelerator used in this study. This step is further explained in detail in the next section. Then, using Clang, ASTs are produced, and a series of augmentations, as discussed, are applied to them to construct ParaGraph. In order to train a model to predict the runtime of an application, we need to create a dataset with a list of applications and their variants, accompanied by their runtimes. Therefore, we execute each one of the variants on the specified accelerator and measure the runtime. Lastly, the ParaGraph representations of the variants and their runtimes are used to train the GNN model. Along with the ParaGraphs, our feature set also includes the number of teams and threads used for executing an application. These features are then fed into the GNN-based model for predicting the runtime.

IV Experiments and Setups

We used two clusters and compilers to test and evaluate our tool. Our experiments were carried out on ORNL’s Summit supercomputing cluster [28] with LLVM/Clang (ver 13.0) and an nvptx backend for GPUs, as well as LLNL’s Corona cluster [29] with LLVM/Clang (ver 15.0) and a rocm backend for GPUs. For the purposes of this study, we only use one GPU per cluster node.

One of the primary obstacles we faced, as with any other data-driven approach, is the lack of a publicly accessible dataset. Creating a dataset that can be used to train our model was the first step in building a data-driven cost model. Although there are plenty of OpenMP benchmark suites publicly available, little effort has been put into OpenMP GPU offloading benchmarking, and there are very few publicly available benchmarks for OpenMP GPU offloading. Even though a few benchmarks, such as the Rodinia benchmark suite [30], include kernels that implement OpenMP GPU offloading, we cannot use them due to a lack of different kernel variations. Therefore, as a first step, we create a collection of benchmarks that leverage OpenMP GPU offloading. The goal is to include a broad class of benchmarks that would cover a spectrum of statistical simulation, linear algebra, data mining, etc.

IV-A Data Collection

There are three parts to our data collection: Code Variant Generation, Graph Generation, and Runtime Collection.

Application | Num Kernels | Domain
Correlation Coefficient [31] | 1 | Statistics
Covariance | 2 | Probability Theory
Gauss Seidel Method | 1 | Linear Algebra
K-nearest neighbors [30] | 1 | Data Mining
Laplace’s Equation [32] | 2 | Numerical Analysis
Matrix-Matrix Multiplication | 1 | Linear Algebra
Matrix-Vector Multiplication | 1 | Linear Algebra
Matrix Transpose | 1 | Linear Algebra
Particle Filter [30] | 7 | Medical Imaging
TABLE I: Benchmark Applications

IV-A1 Code Variant Generation

The first step is to create different kernels for which data needs to be collected. We needed application kernels from multiple domains to have a better prediction across diverse disciplines of Computer Science. Table I specifies nine benchmark applications that we have identified for data collection. However, nine applications are not enough. There are only seventeen kernels if we count the number of kernels in each of these applications. In the future, as this model develops, more applications will be added.

As a proof of concept for this research, we only consider the following six transformations:

  • cpu: A cpu parallel kernel using omp parallel for.

  • cpu_collapse: In case of a nested collapsible loop, we collapse it with omp parallel for collapse(2) directive.

  • gpu: A gpu kernel using a combined omp target teams distribute parallel for directive. All data is already considered to be present on the GPU.

  • gpu_collapse: A gpu kernel with nested collapsible loop using omp target teams distribute parallel for collapse(2) directive. All data is already considered to be present on the GPU.

  • gpu_mem: Same as combined gpu offloading, but with data transfer.

  • gpu_collapse_mem: Same as combined gpu_collapse, but with data transfer.

We used the OpenMP Advisor tool [20] to generate these kernel variants. Even after creating multiple variations for each kernel, we only had 66 distinct kernels. To create more kernels, we vary the levels of parallelism and the data used for all these kernels. We modified these parameters for each application to generate a minimum of 2000-3000 distinct kernels for our dataset. When the kernels from all applications are combined, we get almost 26,000 unique data points for our dataset. This step is independent of the architecture on which we will be running the kernels.
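To make the six variants concrete, the sketch below pairs each variant name with the OpenMP directive it corresponds to and prepends it to a loop nest. This is only an illustration: the actual code is generated by the OpenMP Advisor's code transformation module, and the map clauses (with the placeholder array name a and size n) in the *_mem variants are assumptions about how the data transfer could be expressed.

```python
# Hedged sketch of emitting the six directive variants; the real variants are
# produced by OpenMP Advisor, and the map clauses below are only placeholders.
VARIANTS = {
    "cpu":              "#pragma omp parallel for",
    "cpu_collapse":     "#pragma omp parallel for collapse(2)",
    "gpu":              "#pragma omp target teams distribute parallel for",
    "gpu_collapse":     "#pragma omp target teams distribute parallel for collapse(2)",
    "gpu_mem":          "#pragma omp target teams distribute parallel for "
                        "map(tofrom: a[0:n])",
    "gpu_collapse_mem": "#pragma omp target teams distribute parallel for collapse(2) "
                        "map(tofrom: a[0:n])",
}

def emit_variant(loop_nest_src, variant):
    """Prepend the chosen directive to the source of a loop nest."""
    return VARIANTS[variant] + "\n" + loop_nest_src

print(emit_variant("for (int i = 0; i < n; i++) { /* ... */ }", "gpu"))
```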

IV-A2 ParaGraph Generation

In this step of our pipeline, we create the ParaGraphs for the kernels generated in the previous section. For each kernel, our dataset should include two parts: the graph fed to the GNN and the runtime. For training, the first step is to generate a ParaGraph for each one of the kernels. Since ParaGraph is based on the AST, each kernel is compiled, and its AST is generated. In order to produce the ParaGraph, each AST is traversed, and additional edges and weights are added as explained in Section III. While edge types convey information from the compiler’s point of view, edge weights expose information about the flow of execution and, more importantly, the hotspots of the kernels, such as loops. Typically, edges within a for loop have higher weights to represent the number of iterations of the loop.

IV-A3 Runtime Collection

As our next step, we collect the actual runtime of each kernel. This process is entirely dependent on the hardware that the kernel is intended to run on. The complexity of this process was largely determined by our ability to get access to the compute nodes of the HPC clusters. Failures in this step were difficult to handle interactively, and we had to wait for individual jobs to fail before even realizing the error. As a result, we had to be very cautious when submitting jobs to run on these clusters.

Platform | #Data Points | Runtime Range (ms) | Std. Dev.
Summit
IBM POWER9 (CPU) | 13,023 | [0.23 - 736,798] | 48,502
NVIDIA V100 (GPU) | 26,040 | [0.035 - 30,174] | 3,708
Corona
AMD EPYC7401 (CPU) | 17,681 | [0.024 - 291,627] | 16,942
AMD MI50 (GPU) | 26,668 | [0.448 - 46,913] | 4,828
TABLE II: Data points collected on each accelerator.

Using the OpenMP Advisor, we were able to generate code that already collected kernel runtime information. It accomplishes this by inserting gettimeofday function calls, which return the system clock time expressed in elapsed seconds and microseconds, both before and after a kernel call, and then computing the difference between the two times to obtain the runtime in microseconds. OpenMP Advisor also aided us in generating all six variants of each kernel, and we were able to generate multiple variants for each application by varying the levels of parallelism and data used. We built each of these kernels on a local cluster to ensure that our code does not break as a result of using an automated tool to generate our kernels. After successfully building all of our kernels, we ran each kernel on a local cluster to validate our data collection. Once we were confident that all of our kernels were building and running on a local cluster, we created jobs to build and run them on the Summit and Corona clusters. Nonetheless, we had limited access to these clusters, and our jobs would not run for long due to node failures or time constraints. As a result, we had to submit multiple jobs and keep track of their success or failure in collecting data.
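A hypothetical driver for this collection step is sketched below; the binary names, timeout, and CSV layout are assumptions, and the only detail carried over from the text is that each generated variant prints its gettimeofday-based kernel time, which the harness then records.

```python
# Illustrative harness (an assumption, not the authors' scripts) for runtime
# collection: run each compiled variant and record one (binary, runtime) row.
import csv
import subprocess

def collect_runtimes(binaries, output_csv="runtimes.csv"):
    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["binary", "runtime_us"])
        for binary in binaries:
            try:
                # Each variant is assumed to print its measured kernel time
                # (gettimeofday difference, in microseconds) on stdout.
                result = subprocess.run([binary], capture_output=True, text=True,
                                        timeout=3600, check=True)
                writer.writerow([binary, result.stdout.strip()])
            except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
                # Failed or timed-out runs are skipped and retried in a later batch.
                continue
```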

As soon as we had each kernel’s runtime, we could easily accompany it with the corresponding ParaGraph to collect data points for our dataset. Table II shows the statistics of the data points collected from each one of the accelerators.

IV-B ParaGraph Model

Once we have enough data from both the Summit and Corona clusters, we begin training our model. We implemented a GNN-based neural network using RGAT [27] as the convolution layers. Our model is implemented using the PyTorch Geometric library with Mean Squared Error as the loss function and Adam [33] as the optimizer. To embed the graph, the model uses three graph convolution layers based on RGAT, followed by two fully connected layers with the ReLU activation function. As mentioned, the number of teams and threads are considered as two additional features. Another fully connected layer is used to embed these two features. Finally, the embedding of the graph and the two features are concatenated together and passed through the last fully connected layer for runtime prediction.
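The sketch below mirrors this architecture with PyTorch Geometric's RGATConv. The hidden size, the node-type embedding, the mean pooling used to obtain a graph-level embedding, and the way the edge weights would be attached (not shown) are assumptions not fixed by the paper.

```python
# Hedged sketch of the described model: three RGAT layers, two fully connected
# layers with ReLU, a separate embedding for (#teams, #threads), and a final
# fully connected layer predicting the runtime.
import torch
import torch.nn.functional as F
from torch_geometric.nn import RGATConv, global_mean_pool

class ParaGraphModel(torch.nn.Module):
    def __init__(self, num_node_types, num_edge_types, hidden=64):
        super().__init__()
        self.embed = torch.nn.Embedding(num_node_types, hidden)
        self.conv1 = RGATConv(hidden, hidden, num_relations=num_edge_types)
        self.conv2 = RGATConv(hidden, hidden, num_relations=num_edge_types)
        self.conv3 = RGATConv(hidden, hidden, num_relations=num_edge_types)
        self.fc1 = torch.nn.Linear(hidden, hidden)
        self.fc2 = torch.nn.Linear(hidden, hidden)
        self.fc_cfg = torch.nn.Linear(2, hidden)   # embeds (#teams, #threads)
        self.out = torch.nn.Linear(2 * hidden, 1)  # predicted runtime

    def forward(self, node_type, edge_index, edge_type, batch, teams_threads):
        x = self.embed(node_type)
        x = F.relu(self.conv1(x, edge_index, edge_type))
        x = F.relu(self.conv2(x, edge_index, edge_type))
        x = F.relu(self.conv3(x, edge_index, edge_type))
        x = global_mean_pool(x, batch)             # graph-level embedding
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        cfg = F.relu(self.fc_cfg(teams_threads))
        return self.out(torch.cat([x, cfg], dim=-1)).squeeze(-1)
```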

The edge weights and two additional features are normalized using MinMaxScaler. The dataset is split into train-validation sets using a 9:1 ratio.
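A small sketch of this preprocessing step, on placeholder data; the use of scikit-learn's MinMaxScaler and train_test_split is an assumption about tooling, not stated by the paper.

```python
# Illustrative normalization of edge weights and (#teams, #threads) features,
# followed by a 9:1 train/validation split; all arrays below are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
edge_weights = rng.uniform(1, 1000, size=(5000, 1))    # placeholder Child-edge weights
teams_threads = rng.integers(1, 257, size=(2600, 2))   # placeholder (#teams, #threads)

edge_weights = MinMaxScaler().fit_transform(edge_weights)
teams_threads = MinMaxScaler().fit_transform(teams_threads.astype(float))

train_idx, val_idx = train_test_split(np.arange(len(teams_threads)),
                                      test_size=0.1, random_state=0)  # 9:1 split
```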

V Results

In the following subsection, we explain the metrics we have used to evaluate the ParaGraph model.

V-A Evaluation Metric

To evaluate the performance of ParaGraph, we use RMSE, the Root Mean Square Error (Equation 3).

RMSE = \sqrt{\frac{\sum_{i=1}^{N}(x_{i}-\hat{x}_{i})^{2}}{N}}   (3)

Where x_i stands for the runtime of a data point in microseconds, x̂_i is the runtime predicted by ParaGraph, and N is the total number of samples.

Since the range of runtimes differs across platforms, the normalized version of RMSE is also considered. Normalized RMSE is calculated by dividing the RMSE by the distance between the minimum and maximum runtime. We also use relative error (i.e., absolute error divided by the range of runtime) to report the error rate.
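As a worked example of these metrics on placeholder values (purely illustrative, not data from the paper):

```python
# RMSE (Equation 3), normalized RMSE, and relative error on placeholder runtimes.
import numpy as np

actual = np.array([120.0, 45000.0, 310.0, 9800.0])     # runtimes (microseconds)
predicted = np.array([118.0, 44500.0, 330.0, 9900.0])

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
runtime_range = actual.max() - actual.min()
norm_rmse = rmse / runtime_range                        # normalized by runtime range
relative_error = np.abs(actual - predicted) / runtime_range

print(rmse, norm_rmse, relative_error.mean())
```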

Platform | RMSE (ms) | Norm-RMSE
Summit
IBM POWER9 (CPU) | 4325 | 6×10⁻³
NVIDIA V100 (GPU) | 280 | 9×10⁻³
Corona
AMD EPYC7401 (CPU) | 968 | 4×10⁻³
AMD MI50 (GPU) | 510 | 1×10⁻²
TABLE III: Experimental Results
Figure 4: Prediction error per 10-second bins.

V-B Experimental Results

Table III shows the experimental results for each accelerator. We have the NVIDIA V100, the AMD MI50, the IBM POWER9 with 22 cores, and the AMD EPYC 7401 with 24 cores. As shown, the RMSE values range from 280 (ms) to 4325 (ms). The reason why we have different RMSE values, such as 4325 (ms) for the POWER9, lies in the fact that the runtime dispersion differs across the accelerators. The standard deviation in Table II gives us some insight into how dispersed the collected data are. Moving on to Normalized-RMSE, which is independent of the range of runtimes, we see that ParaGraph has roughly the same error across accelerators; thus, it can be applied to different accelerators.

To further analyze our results, we have calculated the relative error per bin of 10 seconds. Figure 4 shows 11 bins for all four accelerators. Each bin has a 10-second range except the last bin. The figure shows that the relative error is small across different bins and accelerators (less than 10%), meaning that our ParaGraph model has stable behavior for different ranges of runtimes.

Moreover, we analyze how stable the model is during the training process in Figure 5. The figure shows the validation RMSE for the four accelerators. In the first few epochs, the ParaGraph model is not very stable, resulting in fluctuations in the RMSE; however, as the model is trained further, it is able to pick up and learn the features from the representation, reducing the RMSE value in each epoch and ultimately converging.

Figure 5: Normalized RMSE per each epoch.
Figure 6: Error rate per each application.

Lastly, we calculate the average relative error per application to see if the ParaGraph model is able to make predictions with low error for all types of applications. Figure 6 shows the error rate of each application. As can be seen, the model is indeed capable of making accurate predictions for a wide variety of applications, resulting in a low error rate. Therefore, the model is not biased toward one application. On the AMD MI50 GPU, the Laplace data was corrupted during collection. Consequently, neither this study nor the training process includes that data.

V-C Ablation Study

The ParaGraph representation, as explained in Section III, is built by applying a series of major augmentations on top of the AST. In this section, we quantify the impact of these augmentations. First, we consider the AST itself without additional edges and weights; we call it Raw AST. Then, we add additional edges and edge types to the AST and name it Augmented AST. Lastly, we have ParaGraph, which contains both the additional edges and the edge weights. Table IV shows the results of the ablation study. We see that Raw AST results in the highest error for all four accelerators. Adding new edges and introducing new types of edges (Augmented AST) improves the prediction to some extent. For example, the RMSE on the V100 drops from 2114 (ms) to 786 (ms) with the addition of these new edges.

One of the key characteristics of our proposed program representation is the edge weights. Edge weights convey essential information about how often different regions of the AST execute. Therefore, we see quite a good improvement in RMSE when edge weights are added. For instance, the RMSE for the V100 is further improved to 280 (ms).

Platform | Raw AST | Aug AST | ParaGraph
Summit
IBM POWER9 (CPU) | 27593 | 26860 | 4325
NVIDIA V100 (GPU) | 2114 | 786 | 280
Corona
AMD EPYC7401 (CPU) | 11911 | 9633 | 968
AMD MI50 (GPU) | 2888 | 1177 | 510
TABLE IV: RMSE (ms) of training with and without edge weights.
Figure 7: RMSE of the validation set during training the GNN models on MI50 data points.

We analyze the addition of edges and their weights further to see how the training process of the ParaGraph model is affected by these augmentations.

Figure 7 shows how the model behaves during training. This figure depicts the RMSE value per epoch for Raw AST, Augmented AST, and ParaGraph on the MI50 accelerator.

Using only the Raw AST without any augmentations, which means having only one edge type, the model is able to learn some characteristics of the applications and reduce the RMSE per epoch; however, this reduction in RMSE is not significant. The Augmented AST contains 8 different types of edges. We see that the addition of these edges initially destabilizes the training process of the model. In the first few epochs, the model is challenged to learn the different relations between the nodes. However, after several epochs, the prediction of the model stabilizes and it achieves an RMSE of 1177 (ms). Once the edges of the Augmented AST are further augmented with weights, thus constructing ParaGraph, we see further improvements in the model’s predictions. Although the validation RMSE has some fluctuation in the initial epochs, it ultimately converges with a considerably smaller error.

V-D Comparison with State-of-the-art Tool

To the best of our knowledge, COMPOFF [2] is the only state-of-the-art OpenMP GPU offloading cost model. We therefore compare the results from ParaGraph with those of COMPOFF. As mentioned in Section II-D, OpenMP Advisor uses COMPOFF for predicting the runtime of OpenMP kernels.

While OpenMP Advisor eventually needs a cost model that can forecast for all possible underlying architectures, COMPOFF is currently only suitable for GPU execution. In contrast, ParaGraph can model the application runtime regardless of the underlying architecture. In our experiments, ParaGraph was used to predict runtime on both CPUs and GPUs. Therefore, we compare ParaGraph and COMPOFF using data points collected from the NVIDIA V100 GPU only.

For both ParaGraph and COMPOFF, Figure 9 demonstrates a strong correlation between the actual and predicted data. The results for COMPOFF are represented by blue dots, while those for ParaGraph are represented by orange dots. Despite the fact that both perform admirably, ParaGraph demonstrates a much stronger correlation between the predicted and actual runtime. In Figure 8, COMPOFF (blue) demonstrates a slightly higher error rate for smaller runtime kernels, but as the runtime increases, this error rate decreases (or disappears entirely). However, for all kernels, the error rate is significantly lower for ParaGraph (orange). Similar results were also observed on the AMD GPU. We were unable to compare the outcomes of our prediction on CPUs because COMPOFF does not support CPUs. This is one significant advantage ParaGraph has over COMPOFF.

Figure 8: Comparison of ParaGraph and COMPOFF on predicting runtimes on NVIDIA V100 for each data point.

V-E Discussion

ParaGraph is able to model the execution of applications statically. Our experimental results show the effectiveness of ParaGraph in predicting applications’ runtime across different accelerators. On the other hand, there are applications whose runtime behavior is not stable; that is, they show different behaviors across various executions. These applications cannot be modeled statically. As a result, predicting their runtime using only static features is a challenge. One way of tackling this challenge could be developing a hybrid approach using static and dynamic features (for example, performance counters) of applications. However, collecting dynamic features requires executing the applications, which will potentially increase the cost of prediction.

Figure 9: ParaGraph and COMPOFF prediction on NVIDIA V100 as compared to the actual runtime.

VI Conclusion and Future Work

In this paper, we proposed ParaGraph, a novel way of representing HPC kernels. ParaGraph is based on ASTs; it incorporates additional features into the AST to better represent applications. In particular, ParaGraph includes some of the insights of the compiler by adding new edges to the AST. ParaGraph also adds information about the way loops and if statements will be executed by adding weights to the corresponding edges. We evaluated ParaGraph on a set of applications for four different accelerators. It achieved a low error rate, highlighting the effectiveness of ParaGraph. One of the main benefits of ParaGraph is its independence from the underlying hardware architecture. Therefore, it can be applied to a wider range of architectures, unlike previous approaches.

In this work, we used ParaGraph for predicting the execution time of a given kernel for the OpenMP GPU offloading problem. In the future, we plan to explore and analyze how ParaGraph can help with other OpenMP optimization strategies, such as predicting SIMD stride, scheduling strategies, loop chunk size, etc. Another interesting research area to explore is to use ParaGraph to capture parallelism for other parallel programming models, such as OpenACC, Kokkos, HIP, etc., for building cost models for various optimization problems.

References

  • [1] L. Dagum and R. Menon, “Openmp: an industry standard api for shared-memory programming,” IEEE computational science and engineering, vol. 5, no. 1, pp. 46–55, 1998.
  • [2] A. Mishra, S. Chheda, C. Soto, A. M. Malik, M. Lin, and B. Chapman, “COMPOFF: A compiler cost model using machine learning to predict the cost of openmp offloading,” in 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 30-June 3, 2022.   IEEE, 2022.
  • [3] A. Ciurumelea, S. Proksch, and H. C. Gall, “Suggesting comment completions for python using neural language models,” in 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER).   IEEE, 2020, pp. 456–467.
  • [4] C. Cummins, Z. V. Fisches, T. Ben-Nun, T. Hoefler, M. F. O’Boyle, and H. Leather, “Programl: A graph-based program representation for data flow analysis and compiler optimizations,” in International Conference on Machine Learning.   PMLR, 2021, pp. 2244–2253.
  • [5] V. Raychev, M. Vechev, and E. Yahav, “Code completion with statistical language models,” in Proceedings of the 35th ACM SIGPLAN conference on programming language design and implementation, 2014, pp. 419–428.
  • [6] M. Allamanis, H. Peng, and C. Sutton, “A convolutional attention network for extreme summarization of source code,” in International conference on machine learning.   PMLR, 2016, pp. 2091–2100.
  • [7] M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to represent programs with graphs,” arXiv preprint arXiv:1711.00740, 2017.
  • [8] M. A. Heroux, R. Thakur, L. McInnes, J. S. Vetter, X. S. Li, J. Aherns, T. Munson, and K. Mohror, “Ecp software technology capability assessment report,” Oak Ridge National Lab.(ORNL), Oak Ridge, TN (United States), Tech. Rep., 2020.
  • [9] Y. Kim, J. Lee, J.-S. Kim, H. Jei, and H. Roh, “Comprehensive techniques of multi-gpu memory optimization for deep learning acceleration,” Cluster Computing, vol. 23, no. 3, pp. 2193–2204, 2020.
  • [10] K. Rzadca, P. Findeisen, J. Swiderski, P. Zych, P. Broniek, J. Kusmierek, P. Nowak, B. Strack, P. Witusowski, S. Hand et al., “Autopilot: workload autoscaling at google,” in Proceedings of the Fifteenth European Conference on Computer Systems, 2020, pp. 1–16.
  • [11] M. Mirka, G. Sassatelli, and A. Gamatié, “Online learning for dynamic control of openmp workloads,” in 2020 9th International Conference on Modern Circuits and Systems Technologies (MOCAST).   IEEE, 2020, pp. 1–6.
  • [12] N. Denoyelle, B. Goglin, E. Jeannot, and T. Ropars, “Data and thread placement in numa architectures: A statistical learning approach,” in Proceedings of the 48th International Conference on Parallel Processing, 2019, pp. 1–10.
  • [13] J. Alcaraz, A. TehraniJamsaz, A. Dutta, A. Sikora, A. Jannesari, J. Sorribes, and E. Cesar, “Predicting number of threads using balanced datasets for openmp regions,” Computing, pp. 1–19, 2022.
  • [14] A. Dutta, J. Alcaraz, A. TehraniJamsaz, A. Sikora, E. Cesar, and A. Jannesari, “Pattern-based autotuning of openmp loops using graph neural networks,” in 2022 IEEE/ACM International Workshop on Artificial Intelligence and Machine Learning for Scientific Applications (AI4S).   IEEE, 2022, pp. 26–31.
  • [15] A. M. Malik, “Automatic static feature generation for compiler optimization problems,” in Australasian Joint Conference on Artificial Intelligence.   Springer, 2011, pp. 769–778.
  • [16] S. Dublish, V. Nagarajan, and N. Topham, “Poise: Balancing thread-level parallelism and memory system performance in gpus using machine learning,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).   IEEE, 2019, pp. 492–505.
  • [17] A. Iranfar, W. S. De Souza, M. Zapater, K. Olcoz, S. X. de Souza, and D. Atienza, “A machine learning-based framework for throughput estimation of time-varying applications in multi-core servers,” in 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC).   IEEE, 2019, pp. 211–216.
  • [18] D. Grewe and M. F. O’Boyle, “A static task partitioning approach for heterogeneous systems using opencl,” in International conference on compiler construction.   Springer, 2011, pp. 286–305.
  • [19] H. Sayadi, “Energy-efficiency prediction of multithreaded workloads on heterogeneous composite cores architectures using machine learning techniques,” arXiv preprint arXiv:1808.01728, 2018.
  • [20] A. Mishra, A. M. Malik, M. Lin, and B. Chapman, “Openmp advisor,” arXiv preprint arXiv:2301.03636, 2023.
  • [21] R. B. Roy, T. Patel, V. Gadepally, and D. Tiwari, “Bliss: auto-tuning complex applications using a pool of diverse lightweight learning models,” in Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 1280–1295.
  • [22] S. Falkner, A. Klein, and F. Hutter, “Bohb: Robust and efficient hyperparameter optimization at scale,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1437–1446.
  • [23] H. Menon, A. Bhatele, and T. Gamblin, “Auto-tuning parameter choices in hpc applications using bayesian optimization,” in 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS).   IEEE, 2020, pp. 831–840.
  • [24] Y. Liu, W. M. Sid-Lakhdar, O. Marques, X. Zhu, C. Meng, J. W. Demmel, and X. S. Li, “Gptune: multitask learning for autotuning exascale applications,” in Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 234–246.
  • [25] X. Wu, M. Kruse, P. Balaprakash, H. Finkel, P. Hovland, V. Taylor, and M. Hall, “Autotuning polybench benchmarks with llvm clang/polly loop optimization pragmas using bayesian optimization,” Concurrency and Computation: Practice and Experience, vol. 34, no. 20, p. e6683, 2022.
  • [26] M. Allamanis, “Graph neural networks in program analysis,” in Graph Neural Networks: Foundations, Frontiers, and Applications.   Springer, 2022, pp. 483–497.
  • [27] D. Busbridge, D. Sherburn, P. Cavallo, and N. Y. Hammerla, “Relational graph attention networks,” arXiv preprint arXiv:1904.05811, 2019.
  • [28] ORNL, “Oak Ridge Leadership Computing Facility - Summit supercomputing cluster,” 2017. [Online]. Available: https://www.olcf.ornl.gov/summit/
  • [29] Lawrence Livermore National Laboratory, “LLNL - Corona,” 2019. [Online]. Available: https://hpc.llnl.gov/hardware/compute-platforms/corona
  • [30] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in 2009 IEEE International Symposium on Workload Characterization (IISWC).   IEEE, 2009, pp. 44–54.
  • [31] P. Schober, C. Boer, and L. A. Schwarte, “Correlation coefficients: appropriate use and interpretation,” Anesthesia & Analgesia, vol. 126, no. 5, pp. 1763–1768, 2018.
  • [32] A. K. Mitra, “Finite difference method for the solution of laplace equation,” Department of aerospace engineering Iowa state University, 2010.
  • [33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.