

Anatomy of High-Performance Column-Pivoted QR Decomposition

Maksim Melnichenko (Innovative Computing Laboratory, University of Tennessee, Knoxville) · Riley Murray (Sandia National Laboratories) · William Killian (NVIDIA) · James Demmel (University of California, Berkeley) · Michael W. Mahoney (International Computer Science Institute (ICSI), Lawrence Berkeley National Laboratory, and University of California, Berkeley) · Piotr Luszczek (MIT Lincoln Laboratory) · Mark Gates (Innovative Computing Laboratory, University of Tennessee, Knoxville)
Abstract

We introduce an algorithmic framework for performing QR factorization with column pivoting (QRCP) on general matrices. The framework enables the design of practical QRCP algorithms through user-controlled choices for the core subroutines. We provide a comprehensive overview of how to navigate these choices on modern hardware platforms, offering detailed descriptions of alternative methods for both CPUs and GPUs. The practical QRCP algorithms developed within this framework are implemented as part of the open-source RandLAPACK library. Our empirical evaluation demonstrates that, on a dual AMD EPYC 9734 system, the proposed method achieves performance improvements of up to two orders of magnitude over LAPACK's standard QRCP routine and greatly surpasses the performance of the current state-of-the-art randomized QRCP algorithm [MOHvdG:2017:QR]. Additionally, on an NVIDIA H100 GPU, our method attains approximately 65% of the performance of cuSOLVER's unpivoted QR factorization.

Send correspondence to the first author, at mmelnic1@vols.utk.edu.

1 Introduction

Randomized Numerical Linear Algebra (RandNLA) is a relatively young branch of the field of numerical linear algebra (NLA). It leverages randomization as a computational resource to develop algorithms for classical linear algebra problems with performance superior to that of deterministic NLA schemes. Some of these algorithms reliably compute approximate solutions to a given problem. Others, under suitable assumptions, provide the exact solution by generating a full matrix factorization. Algorithms for full matrix factorizations are an essential part of the decompositional approach to matrix computations, which revolutionized the field of computational science [814652].

Despite the ongoing widespread influence of RandNLA, it has yet to have a major practical impact on how we compute the "classical" full matrix decompositions, e.g., Cholesky, LU, QR, Schur, the eigenvalue decomposition, and the full SVD [Higham:blog:big6]. This is largely due to the fact that, in modern implementations of such decompositions, all of the computations contributing to the leading-order terms in the algorithms' complexity are cast in terms of Level 3 Basic Linear Algebra Subprograms (BLAS) operations. Such operations are ideally suited to achieving high performance on modern hardware, and hence it is notoriously difficult to improve upon the performance of schemes that predominantly rely on Level 3 BLAS. There is, however, a classical matrix decomposition whose widely adopted implementation does not predominantly rely on Level 3 BLAS: the QR decomposition with column pivoting (QRCP). The fundamental problem with the classical approach to QR with column pivoting (Householder QRCP) is that only half of its computations can be cast in terms of Level 3 BLAS operations [LAWN114]. Consequently, the standard LAPACK function for QRCP, called GEQP3, remains very slow due to memory bandwidth constraints.

In recent work [MBM2024], we introduced a novel QRCP algorithm called "Cholesky QR with Randomization and Pivoting for Tall matrices" (CQRRPT). CQRRPT carefully uses techniques from RandNLA to deliver acceleration over the alternative methods for QRCP. These include specialized communication-avoiding algorithms [FK2020] and, in certain cases, standard LAPACK unpivoted QR, called GEQRF, together with LATSQR, which is tailored specifically for tall matrices [demmel2012communication]. Furthermore, CQRRPT can be implemented to achieve numerical stability regardless of the properties of its input data. This not only highlights its reliability, but also makes the method appealing as a tool for orthogonalization. The main limitation of this scheme, however, hides in its name: CQRRPT is only applicable to rectangular matrices where the number of rows is much larger than the number of columns. This limitation disqualifies CQRRPT from candidacy as a definitive "solution" to the problem of designing a QRCP algorithm.

This paper introduces an algorithmic framework, referred to as BQRRP (pronounced "bee-crip"), which stands for "Blocked QR with Randomization and Pivoting." BQRRP serves as a natural evolution of CQRRPT (pronounced "see-cript"), encapsulating many of CQRRPT's core features while resolving its main aforementioned limitation.

This paper does not dwell too much on the theoretical properties of the algorithmic framework we propose. Rather, it concentrates on providing many details on the practical aspects of what goes into a high-performance QRCP implementation. As such, while our work is particularly beneficial for numerical software engineers, it is accessible to a broader audience who may find value in exploring it at their own pace.

1.1 Existing work and our contribution

Naturally, efforts to revise the standard approach to QRCP have emerged in the past, among which are the works of Bischof and Quintana-Ortí [BQ1998_Alg, BQ1998_Code], as well as the independent works of Martinsson [Martinsson:2015:QR] and Duersch and Gu [DG:2017:QR], with subsequent extensions by Martinsson et al. [MOHvdG:2017:QR] and also by Xiao, Gu, and Langou [XGL:2017:RandQRCP]. Of particular interest are randomized algorithms described in [DG:2017:QR] and [MOHvdG:2017:QR]. Both these papers, in addition to the derivation of the pseudocode schemes, present software with practical QRCP implementations.

The early randomized Householder methods.

In [DG:2017:QR], the authors show several benchmarks of prospective QRCP methods (written in Fortran 90), compared against both the standard pivoted and unpivoted QR algorithms provided by the contemporary (though unspecified) version of the Intel Math Kernel Library (MKL). The RQRCP method, described in the pseudocode of [DG:2017:QR, Algorithm 4], exhibits an order-of-magnitude speedup over the standard pivoted QR and asymptotically achieves up to 60% of the performance of the standard unpivoted QR when applied to full-rank square matrices. Their work also presents a TRQRCP scheme [DG:2017:QR, Algorithm 6], which proves faster than RQRCP in low-rank cases.

In the other work [MOHvdG:2017:QR], the authors present a randomized algorithm called HQRRP that offers an order-of-magnitude speedup over GEQP3 and achieves up to 80% of the performance of the standard unpivoted QR, GEQRF. In the benchmarks shown therein, the standard pivoted and unpivoted QR functions were taken from LAPACK version 3.4.0, and the implementations were linked against BLAS from MKL version 11.2.3. The performance results were reported for an HQRRP implemented with libflame version 11104 [FVE09]. In addition to the libflame-based implementation, the authors presented an LAPACK-compatible version of HQRRP (available at https://github.com/flame/hqrrp). The emphasis on the practicality and portability of HQRRP can be seen throughout [MOHvdG:2017:QR], suggesting that, at its time, this algorithm qualified as an ideal replacement for GEQP3 in LAPACK.

Our motivation.

The development of these algorithms, however, was not the final chapter in the quest for a high-performance QRCP method. As with their predecessors, neither HQRRP nor any of the variants of RQRCP made it into LAPACK. Many years of hardware development have passed, and benchmarking on modern machines shows that the performance gap between the standard pivoted and unpivoted QR factorization algorithms has grown from roughly 10× to nearly 100×. Simultaneously, while the LAPACK-compatible version of HQRRP proved easily portable to modern libraries, its performance did not increase commensurately. See Figure 1.

[Figure 1 image; panels: Intel CPU and AMD CPU; rows: speedup HQRRP/GEQP3 and speedup GEQRF/GEQP3; x-axis: $m = n$.]

Figure 1: Speedup of HQRRP [MOHvdG:2017:QR] (first row) and standard LAPACK unpivoted QR, GEQRF (second row), over the standard LAPACK QRCP, GEQP3, attained on matrices of sizes from 1,000-by-1,000 to 10,000-by-10,000, when varying the number of OpenMP threads used. In all experiments, the HQRRP block size was set to 128, as suggested in [MOHvdG:2017:QR, Sec. 4.1]. Results were captured on the machines described in Table 1. On the Intel system, the performance of HQRRP relative to that of GEQP3 stagnates as the number of OpenMP threads increases; on the AMD system, HQRRP achieves performance above what was reported in [MOHvdG:2017:QR, Fig. 1]. On both systems, the speedup of GEQRF over GEQP3 greatly exceeds the speedup observed in [MOHvdG:2017:QR, Fig. 5]. We further discuss the observed results in Appendix A.

With that, following in the steps of our predecessors, we accepted the challenge of formulating a modern approach to QR with column pivoting that matches the contemporary performance of algorithms relying on Level 3 BLAS functionality.

Modern randomized QRCP.

Our newest installment in the QRCP family of algorithms is the BQRRP algorithmic framework. In this framework, one implements a BQRRP instantiation by selecting the set of core subroutines most suitable for the system on which BQRRP is to be used. Our main contributions are: (a) a detailed description of the most suitable practical choices for such subroutines on two modern systems; and (b) CPU and GPU implementations of BQRRP, as part of the open-source RandLAPACK library, that allow users to configure the algorithm in the way most suitable for their needs. This manuscript describes both the CPU and GPU versions of BQRRP. As we show in Section 7, an implementation of BQRRP_CPU not only outperforms HQRRP, but is also up to two orders of magnitude faster than the standard pivoted QR (GEQP3) implementation available in MKL 2025. BQRRP_GPU exhibits excellent throughput and shows reasonable performance relative to the unpivoted QR (GEQRF) implementation available in NVIDIA's cuSOLVER 11.6.1.9. This makes BQRRP_GPU a strong candidate for adoption as a standard general QRCP method by GPU linear algebra library vendors.

The dominant cost of BQRRP_CPU comes from a routine that applies an implicitly-stored orthonormal matrix to a portion of the input data in a loop. On a GPU, in addition to this bottleneck, there is the added cost of permuting the columns of a portion of the input matrix at every iteration of the main loop. This cost can be mitigated by employing advanced pivoting strategies that allow for additional levels of parallelism, such as the "parallel pivots" approach. We discuss these in Section 4. The rest of the major operations in BQRRP are all highly efficient, reaching Level 3 BLAS performance.

BQRRP_CPU is designed to be storage-efficient, performing all of its major operations in place (modulo comparatively tiny workspace buffers). Furthermore, the output format of both BQRRP_CPU and BQRRP_GPU is identical to that of GEQP3.

Whether our novel approach remains relevant in the long run is an open question. However, even in the worst-case scenario, we emphasize that the value of our work extends beyond delivering a modern high-performance algorithm for QRCP. It also lies in establishing a solid foundation for analyzing the performance and implementation details of future QRCP methods, as well as exposing the underlying structure of such algorithms, which can be invaluable for future developers if the scourge of slow QRCP returns.

1.2 Outline of the manuscript

The rest of Section 1 is dedicated to describing the notation (Section 1.3), as well as the setup of the experiments (Section 1.4) used throughout the paper. Section 2 then formally introduces the generalized view of the BQRRP algorithmic framework in Algorithm 1. Our formalism emphasizes that BQRRP is defined by the choices of its core subroutines. The choices for these subroutines are presented in Sections 2.1–2.5; the subroutine details there are aided by a series of pseudocode algorithms and CPU performance plots. Section 3 presents a step-by-step guide to how BQRRP_CPU is constructed in RandLAPACK, with particular attention to its in-place storage feature. The discussion is aided by a detailed storage visualization in Figure 5. Section 4 discusses how constructing a GPU version of BQRRP differs from constructing the previously-described CPU version, making references to Algorithm 1 and Sections 2.1–2.5. Section 5 details how each subroutine that comprises the CPU and GPU versions of BQRRP affects the overall performance of the algorithm, presenting runtime breakdown plots. This is done to help identify specific performance bottlenecks inside the algorithm. Section 6 provides empirical investigations of pivot quality. It shows how easy-to-compute metrics of pivot quality compare when running LAPACK's default function GEQP3 versus the versions of BQRRP defined previously in Section 3. The results are shown for input matrices that are notoriously difficult for QRCP to process. Section 7 provides performance experiments with both the GPU and CPU implementations of BQRRP in RandLAPACK. Concluding remarks are given in Section 8.

1.3 Definitions and notation

Preliminaries.

Matrices appear in boldface sans-serif capital letters. The transpose of a matrix $\bm{\mathsf{X}}$ is given by $\bm{\mathsf{X}}^{\text{\tiny T}}$. Numerical vectors appear as boldface lowercase letters, while index vectors appear as uppercase letters. The identity matrix is denoted by $\bm{\mathsf{I}}$. We enumerate components of matrices and vectors with indices starting from zero, rather than from one. We extract the leading $k$ columns of $\bm{\mathsf{X}}$ by writing $\bm{\mathsf{X}}(:, 0{:}k)$; this range excludes column $k$. The trailing $n-k$ columns are extracted by writing $\bm{\mathsf{X}}(:, k{:}n)$. The $(i,j)^{\text{th}}$ entry of $\bm{\mathsf{X}}$ is $\bm{\mathsf{X}}(i,j)$. Similar conventions apply to extracting the rows of a matrix or the components of a vector.

The matrix we ultimately aim to decompose is denoted by $\bm{\mathsf{M}}$ and has dimensions $m$-by-$n$, with $m$ and $n$ not related in any particular way. We use the letters $i$ and $j$ as zero-based integer indices between $0$ and $\min\{m,n\}-1$. In some contexts, $i$ denotes the iteration of an algorithm loop. The symbol $b$ refers to the block size parameter used in a given algorithm; $b$ is expected to be small relative to the matrix size, i.e., $b \ll \min\{m,n\}$. The symbol $\ell$ denotes the estimated rank of a given matrix. The symbol $k$ denotes the block rank, i.e., the rank of a given submatrix, $k \leq b$.

Given an iteration $i \in 0{:}(\lceil n/b\rceil - 1)$ of an arbitrary algorithm that parses the given matrix in increments of block size $b$, we use $s = i \times b$ to denote the start of the current row/column block range; $r = \min\{m, (i+1)b\}$ denotes the (exclusive) end of the current row block range; and $c = \min\{n, (i+1)b\}$ denotes the (exclusive) end of the current column block range.
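To make this indexing concrete, here is a tiny self-contained C++ sketch (illustrative only; the variable names mirror the notation above) that prints $s$, $r$, and $c$ for each iteration of such a blocked loop:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
    int64_t m = 10, n = 7, b = 3;              // toy dimensions
    int64_t num_iters = (n + b - 1) / b;       // ceil(n / b) iterations
    for (int64_t i = 0; i < num_iters; ++i) {
        int64_t s = i * b;                     // start of the current block
        int64_t r = std::min(m, (i + 1) * b);  // exclusive end of row block
        int64_t c = std::min(n, (i + 1) * b);  // exclusive end of column block
        std::printf("i=%lld: s=%lld r=%lld c=%lld\n",
                    (long long) i, (long long) s, (long long) r, (long long) c);
    }
    return 0;
}
```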

Throughout this paper, we aim to clearly identify the purpose of each function; function names are presented in typewriter-style font.

We often refer to linear algebraic functions using their conventional BLAS and LAPACK names in uppercase. The explanation of the standard naming conventions can be found at https://www.netlib.org/lapack/lug/node24.html. Whenever we mention a given LAPACK function name, we avoid using the letter that denotes the precision, assuming that all computations are performed in double precision (for example, we use “GEQP3” instead of “DGEQP3”). We use BQRRP (fixed-width font) when referring to any practical algorithm developed from the BQRRP (standard font) algorithmic framework. We use BQRRP_CPU and BQRRP_GPU to specifically denote the respective CPU and GPU versions of BQRRP.

BQRRP intended output format.

A BQRRP algorithm intends to provide output in a format identical to that of the standard LAPACK QRCP routine, GEQP3. Assuming that a matrix $\bm{\mathsf{M}} \in \mathbb{R}^{m \times n}$ is passed as input to GEQP3, the output format is as follows:

  • Vector $\bm{\tau}$ of length $\min\{m,n\}$ that holds the scalar factors of the elementary reflectors.

  • Modified in-place $\bm{\mathsf{M}}$ with its above-diagonal portion occupied by an explicit upper-trapezoidal $\bm{\mathsf{R}}$ of size $\min\{m,n\} \times n$. The below-diagonal portion of the output $\bm{\mathsf{M}}$ stores Householder vectors $\bm{v}_i$ for $i = 0{:}\min\{m,n\}$, which, together with the vector $\bm{\tau}$, compactly represent the orthonormal matrix $\bm{\mathsf{Q}}$ as a product of $\min\{m,n\}$ elementary reflectors:

    $\bm{\mathsf{Q}} = \bm{\mathsf{H}}_1 \bm{\mathsf{H}}_2 \cdots \bm{\mathsf{H}}_{\min\{m,n\}},$
    $\bm{\mathsf{H}}_i = \bm{\mathsf{I}} - \tau_i \bm{v}_i \bm{v}_i^{\text{\tiny T}}.$
  • Permutation vector $\bm{J}$ of size $n$ such that if $\bm{J}(j) = i+1$, then the $j^{\mathrm{th}}$ column of $\bm{\mathsf{M}}(:, \bm{J})$ was the $i^{\mathrm{th}}$ column of $\bm{\mathsf{M}}$. Note that, per LAPACK convention, the permutation vector $\bm{J}$ stores entries in the one-based index format used by Fortran and MATLAB (hence the "$+1$" in the expression $\bm{J}(j) = i+1$).

We refer back to the above output format throughout the paper when talking about the "intended" output format for BQRRP. The described format of the $\bm{\mathsf{Q}}$ factor is referred to as the implicit economical storage format. By contrast, some of the algorithms described in this manuscript use an explicit economical storage format, where $\bm{\mathsf{Q}}$ is comprised of $\min\{m,n\}$ explicitly-defined orthonormal column vectors.
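To illustrate the permutation convention, the following C++ sketch (a hypothetical helper, not part of RandLAPACK) forms the pivoted copy of a column-major matrix from a one-based permutation vector $\bm{J}$ of the kind GEQP3 returns:

```cpp
#include <cstdint>
#include <vector>

// Given column-major M (m-by-n, leading dimension m) and a one-based
// LAPACK-style permutation vector J of length n, form the pivoted copy
// M_piv whose j-th column is column J[j]-1 of the original M.
std::vector<double> apply_pivots(const std::vector<double>& M,
                                 const std::vector<int64_t>& J,
                                 int64_t m, int64_t n) {
    std::vector<double> M_piv(m * n);
    for (int64_t j = 0; j < n; ++j) {
        int64_t src = J[j] - 1;  // convert one-based entry to zero-based column
        for (int64_t i = 0; i < m; ++i)
            M_piv[i + j * m] = M[i + src * m];
    }
    return M_piv;
}
```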

1.4 Experiments setup

Each subsequent section of this manuscript is interlaced with practical performance experiments. For that reason, we present upfront the details of the hardware and software setup of our experiments.

Our software.

The versions of the BQRRP algorithm discussed in this work, as well as their subcomponent functions, are implemented as part of an open-source C++ library called RandLAPACK. Together with its counterpart, RandBLAS, RandLAPACK provides a platform for developing, testing, and benchmarking high-performance RandNLA algorithms. This software was produced as part of the BALLISTIC project [BALLISTIC], and it is actively developed by the authors of this paper. It is important to note that RandBLAS and RandLAPACK rely on BLAS++ and LAPACK++ [gates2022portable] for basic linear algebra functions; BLAS++ and LAPACK++ themselves serve as wrappers around low-level vendor-optimized BLAS and LAPACK packages. As such, many vendor-optimized libraries (Intel's MKL, AMD's AOCL, Apple Accelerate, etc.) can be used to supply our software with basic linear algebra capabilities. Furthermore, RandBLAS relies on the Random123 library (available at https://github.com/DEShawResearch/random123) for its random number generators. For a detailed description of RandBLAS, visit:

https://randblas.readthedocs.io/en/latest/.

In contrast to RandBLAS, RandLAPACK's scope continues to change and expand. Because of this, we have yet to produce formal project documentation.

All experiments in this manuscript were run using the following version of our software:

https://github.com/BallisticLA/RandLAPACK/releases/tag/BQRRP-benchmark.

The code for BQRRP can be found in /RandLAPACK/drivers/rl_bqrrp*. The code for constructing and dispatching the experiments can be found in /benchmark/bench_BQRRP/*, as well as /test/drivers/bench_bqrrp*.

An automated script for re-running all of the experiments shown in this paper, as well as the MATLAB plotting scripts for replicating our plots, can be found in:

https://github.com/BallisticLA/BQRRP_benchmarking.

Algorithm performance metric.

Throughout this paper, we measure algorithm performance via the canonical FLOP rate. This metric is obtained by dividing the FLOP count of a standard algorithm for a given matrix size by the wall clock time required to run the algorithm under consideration. For example, to compare canonical FLOP rates of various versions of QR and QRCP factorizations, we use the FLOP count of the standard LAPACK unpivoted QR, GEQRF [LAWN41:1994, Page 121], and divide it by the wall clock time of any given QR or QRCP algorithm being compared. The canonical FLOP rate gives a clear way to compare algorithm performance by using a standard FLOP count divided by the algorithm's actual runtime. Unlike the raw FLOP rate, which gives credit to unnecessary work, the canonical FLOP rate sets up a fair comparison by holding all algorithms to the same amount of work. While runtime alone could be used for comparisons, FLOP rates are useful because they connect to familiar performance standards, such as the peak FLOP rates advertised for the hardware or for standard kernels like GEMM for matrix-matrix multiplication. FLOP rates also help reveal scalability issues across problem sizes: if an algorithm's FLOP rate drops significantly as the problem size grows, that may reveal an inefficient method or implementation. Canonical FLOP rates, therefore, give a balanced measure that considers both runtime and standard performance benchmarks.
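As a concrete example, the sketch below converts a measured wall clock time into a canonical GFLOP/s figure using the familiar leading-order GEQRF FLOP count $2mn^2 - \tfrac{2}{3}n^3$ for an $m \times n$ matrix with $m \geq n$ (a simplification; the exact count in [LAWN41:1994] includes lower-order terms that we omit here):

```cpp
#include <cstdio>

// Leading-order FLOP count of Householder QR (GEQRF) for m >= n.
double geqrf_flops(double m, double n) {
    return 2.0 * m * n * n - (2.0 / 3.0) * n * n * n;
}

int main() {
    double m = 32768, n = 32768;
    double seconds = 12.5;  // hypothetical measured wall clock time
    double gflops = geqrf_flops(m, n) / seconds * 1e-9;
    std::printf("canonical rate: %.1f GFLOP/s\n", gflops);
    return 0;
}
```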

Numerical properties of test matrices.

All test matrices were generated using RandBLAS. For the performance experiments, each matrix entry is independently sampled from the standard normal distribution. Such matrices are appropriate for our performance tests: they are representative of "typical" user inputs, and they have predictable numerical and performance properties. (For the numerically challenging matrices, see Section 6.) The average-case performance on which we focus extensively here is best observed with our primary choice of i.i.d. matrix entries, which gives representative behavior in pivot column interchanges.

Sizes of test matrices.

We focus our experiments on large matrices, where the performance gap between Level 2 and Level 3 BLAS-based algorithms is readily apparent. Simultaneously, we want to ensure that the experiments we set up terminate in a reasonable time. Hence, we do not conduct runs on input matrices of the absolute largest sizes that can fit on our systems. Once we finish analyzing the established versions of the BQRRP algorithm on the larger matrix sizes, we present an investigation of the performance of various methods applied to input matrices of a wide range of sizes, using various numbers of OpenMP threads, in Section 7.

There are some known performance concerns when it comes to selecting the specific sizes of the input matrices. Specifically, matrices whose column sizes are powers of 2 place increased demand on main memory bandwidth due to how modern cache memory works: elements from nearby columns map to the same cache lines, because power-of-2 memory addresses have all lower bits set to 0. Note that all modern libraries use a packed storage format for an efficient in-cache matrix multiply kernel, but they still have to read and write the cached data from/to main memory, causing false sharing of cache lines. Due to their much simpler cache structure, GPUs do not experience this problem, and thus we needed to compromise on the matrix size selection to obtain comparable results between the two types of computing devices. Because of this, we chose to run our experiments on two families of matrix dimensions: powers of two and multiples of ten. Additionally, in our main performance experiments with various QR and QRCP algorithms, we focus on square input matrices. This choice provides a meaningful baseline for performance comparisons, while ensuring a balanced computational workload and an even distribution of work across threads. Nonetheless, we do present performance results captured on both tall and wide matrices in Section C.2.

Hardware and software configuration.

All of our tests used double-precision arithmetic, and all code was compiled with the -O3 flag. The performance of each algorithm is determined by selecting its best execution time from five consecutive runs. Instead of running each algorithm individually five times, we execute the entire set of compared algorithms in sequence, repeating this set five times, to capture comparatively consistent performance.
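A minimal sketch of this measurement scheme follows (the function name best_of_five is ours, and the callbacks stand in for the actual benchmarked routines):

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Run the whole set of compared algorithms in sequence, repeat the set five
// times, and record each algorithm's best (minimum) wall clock time.
std::vector<double> best_of_five(const std::vector<std::function<void()>>& algs) {
    std::vector<double> best(algs.size(), std::numeric_limits<double>::infinity());
    for (int rep = 0; rep < 5; ++rep) {
        for (std::size_t a = 0; a < algs.size(); ++a) {
            auto t0 = std::chrono::steady_clock::now();
            algs[a]();  // e.g., a lambda wrapping one QR/QRCP invocation
            std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
            if (dt.count() < best[a]) best[a] = dt.count();
        }
    }
    return best;
}
```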

The detailed configurations of the machines that we used for conducting the CPU experiments are shown in Table 1.

                            Intel Xeon 8462Y+ (2×)    AMD EPYC 9734 (2×)
Cores per socket            32                        112
Total threads               128                       448
Clock speed (base)          2.80 GHz                  2.2 GHz
Clock speed (boost)         4.10 GHz                  3.0 GHz
L1 cache per socket         80 KB                     64 KB
L2 cache per socket         2 MB                      1 MB
L3 cache per socket         60 MB                     256 MB
RAM                         DDR5, 1 TB                DDR5, 1 TB
FP64 peak performance       5.4 TeraFLOPs             6.5 TeraFLOPs
BLAS & LAPACK               MKL 2025.0                MKL 2025.0
Compiler                    GCC 13.2.0                GCC 13.2.0
CMake                       3.31.4                    3.31.4
OS                          Red Hat Enterprise Linux 8.9

Table 1: Key details of the hardware and software configuration of the platforms where CPU testing was performed. Note that we use MKL on both the Intel and AMD systems, since it performs better than AOCL 5.0 on AMD hardware. This observation is discussed in Section A.1.

When running benchmarks on the CPUs described in Table 1, we set the number of OpenMP threads (via the OMP_NUM_THREADS environment variable) to the maximum number of threads available, unless specified otherwise. While using the maximum number of available threads may not always be the optimal approach to achieve peak performance, we advocate for harnessing all available computational resources. Both the tested Intel and AMD systems featured dual-socket configurations. We launched our experiments with the numactl --interleave=all command to balance memory usage across the two memory controllers.

The system setup used for the GPU experiments can be found in Table 2. Since the GPU listed in Table 2 has "only" 80 GB of memory, we are unable to run double-precision experiments with input matrices of certain sizes on this GPU.

                            NVIDIA Hopper (H100)
CUDA cores                  16,896
Memory type                 HBM3
Memory size                 80 GB
Memory bandwidth            3.35 TB/s
BLAS & LAPACK               cuBLAS 12.4.5.8 & cuSOLVER 11.6.1.9
FP64 peak performance       60 TeraFLOPs
CUDA                        NVCC 12.4.1
CMake                       3.27

Table 2: Key details of the hardware and software configuration of the platform where GPU testing was performed.

2 The framework

As stated in Section 1.1, the intended output format for a BQRRP algorithm is in line with that of the standard LAPACK QRCP, GEQP3. Algorithm 1 gives pseudocode toward this end. For simplicity of presentation, it uses a simplified representation of orthogonal factors from QR factorizations, stating that such factors represent a square matrix using some number of elementary reflectors (the implicit economical storage format). The pseudocode is valid in a setting where variables are passed by value rather than by reference, and where functions can return nontrivial data structures. Details on how a BQRRP algorithm can be properly implemented in place, working with raw pointers, are deferred to Section 3.

The pseudocode in Algorithm 1 has two required inputs: the matrix $\bm{\mathsf{M}}$ and an integer block size $b$. It relies on five essential helper functions inside its main loop. These helper functions are applied to various submatrices whose bounds are computed at steps 9–11.

  • qrcp_wide in step 12 represents a column-pivoted QR factorization method, suitable for wide matrices.

  • tri_rank in step 14 computes some notion of numerical rank of an input triangular matrix.

  • col_perm in steps 16, 18, 19 is responsible for permuting the columns of a given matrix in accordance with the indices stored in a given vector.

  • qr_tall in step 20 performs a tall unpivoted QR factorization.

  • apply_trans_q in steps 24 and 27 applies the transpose of the $\bm{\mathsf{Q}}$-factor output by qr_tall to a given matrix.

Sections 2.1–2.5 describe possibilities for how each of these functions might be implemented. We recommend specific implementations targeting the CPU architectures from Table 1.

Algorithm 1 Blocked QR with Randomization and Pivoting
1: Required input: An $m$-by-$n$ matrix $\bm{\mathsf{M}}$; and an integer block size parameter $b$.
2: Optional input: A scalar $\gamma$ that sets the size of the sketch relative to $b$ ($\gamma \geq 1$).
3: Output: Numerical rank $\ell$; an orthonormal $m \times m$ matrix $\bm{\mathsf{Q}}$ based on $\ell$ elementary reflectors; an upper-trapezoidal $\ell \times n$ matrix $\bm{\mathsf{R}}$; and a column permutation vector $\bm{J}$ of length $n$.
4: function bqrrp($\bm{\mathsf{M}}, b, \gamma$)
5:   Set $d = \lceil\gamma \cdot b\rceil$ and sample a $d$-by-$n$ sketching operator $\bm{\mathsf{S}}$ from a Gaussian distribution
6:   Allocate empty $\bm{\mathsf{Q}}$, $\bm{\mathsf{R}}$; $\bm{J} = 1{:}(n+1)$
7:   Sketch $\bm{\mathsf{M}}^{\mathrm{sk}} = \bm{\mathsf{S}}\bm{\mathsf{M}}$
8:   for $i = 0{:}\lceil n/b\rceil$ do
9:     $s = i \cdot b$ is the start of the current row/column block;
10:    $c = \min\{n, (i+1)b\}$ is the (exclusive) end column of the column block;
11:    $r = \min\{m, (i+1)b\}$ is the (exclusive) end row of the row block;
12:    Decompose $[\sim, \bm{\mathsf{R}}^{\mathrm{sk}}, \bm{J}^{\mathrm{sk}}] = \texttt{qrcp\_wide}(\bm{\mathsf{M}}^{\mathrm{sk}}(:, s{:}))$
13:        // $\bm{J}^{\mathrm{sk}}$ is a permutation vector of length $n - s$
14:    Determine $k = \texttt{tri\_rank}(\bm{\mathsf{R}}^{\mathrm{sk}})$
15:        // $k \leq \min\{b, n-s, m-s\}$
16:    Permute $\bm{\mathsf{R}}(0{:}s, s{:}) = \texttt{col\_perm}(\bm{\mathsf{R}}(0{:}s, s{:}), \bm{J}^{\mathrm{sk}})$
17:        // The rectangular portion of the computed rows is permuted (no-op at $i = 0$)
18:    Permute $\bm{\mathsf{M}}(s{:}, s{:}) = \texttt{col\_perm}(\bm{\mathsf{M}}(s{:}, s{:}), \bm{J}^{\mathrm{sk}})$
19:    Update $\bm{J} = \texttt{col\_perm}(\bm{J}(s{:}), \bm{J}^{\mathrm{sk}})$
20:    Decompose $[\bm{\mathsf{Q}}^{\mathrm{curr}}, \bm{\mathsf{R}}_{11}] = \texttt{qr\_tall}(\bm{\mathsf{M}}(s{:}, s{:}c), k)$
21:        // $\bm{\mathsf{Q}}^{\mathrm{curr}}$ uses $k$ reflectors and implicitly represents $m - s$ columns; $\bm{\mathsf{R}}_{11}$ is $k$-by-$b$
22:    Update $\bm{\mathsf{R}}(s{:}(s+k), s{:}c) = \bm{\mathsf{R}}_{11}$
23:    if $k \neq \min\{b, (n-s), (m-s)\}$ then
24:      Perform $[\sim, \bm{\mathsf{R}}_{12}] = \texttt{apply\_trans\_q}(\bm{\mathsf{Q}}^{\mathrm{curr}}, \bm{\mathsf{M}}(s{:}k, c{:}))$
25:          // Output $\bm{\mathsf{R}}_{12}$ is $k \times (n - c)$
26:    else
27:      Perform $[\bm{\mathsf{M}}^{\mathrm{to\_update}}, \bm{\mathsf{R}}_{12}] = \texttt{apply\_trans\_q}(\bm{\mathsf{Q}}^{\mathrm{curr}}, \bm{\mathsf{M}}(s{:}, c{:}))$
28:          // Output $\bm{\mathsf{M}}^{\mathrm{to\_update}}$ is $(m-r)$-by-$(n-c)$; $\bm{\mathsf{R}}_{12}$ is $b$-by-$(n-c)$
29:      Update $\bm{\mathsf{M}}(r{:}, c{:}) = \bm{\mathsf{M}}^{\mathrm{to\_update}}$
30:    Update $\bm{\mathsf{R}}(s{:}(s+k), c{:}) = \bm{\mathsf{R}}_{12}$
31:    Update $\bm{\mathsf{Q}}(s{:}r, s{:}(s+k)) = \bm{\mathsf{Q}}^{\mathrm{curr}}$
32:    if $i = n/b$ or $k \neq \min\{b, (n-s), (m-s)\}$ then
33:      $\ell = s + k$
34:      break
35:    Update $\bm{\mathsf{M}}^{\mathrm{sk}}(:, c{:}) = \begin{bmatrix} \bm{\mathsf{R}}^{\mathrm{sk}}_{12} - \bm{\mathsf{R}}^{\mathrm{sk}}_{11}(\bm{\mathsf{R}}_{11})^{-1}\bm{\mathsf{R}}_{12} \\ \bm{\mathsf{R}}^{\mathrm{sk}}_{22} \end{bmatrix}$
36:  return $\ell$, $\bm{\mathsf{Q}}$, $\bm{\mathsf{R}}$, $\bm{J}$

While BQRRP is a randomized algorithm, it uses randomness only once. Before entering the main loop, it generates a matrix whose entries are independent mean-zero, variance-one Gaussian random variables. (The unit-variance requirement is not essential, so long as all entries are sampled from the same mean-zero Gaussian distribution.) This matrix is applied to $\bm{\mathsf{M}}$ to obtain a smaller matrix called the sketch. The number of rows in the sketch is $d = \lceil\gamma b\rceil$, where $\gamma$ is called the sampling factor. The natural default value of $\gamma$ depends on the implementation of qrcp_wide (in our recommended implementation, the only reasonable value is $\gamma = 1$). As BQRRP iterates, the sketch is updated deterministically with a method proposed by Duersch and Gu [DG:2017:QR, Section 4].
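For illustration, the following simplified stand-in forms $\bm{\mathsf{M}}^{\mathrm{sk}} = \bm{\mathsf{S}}\bm{\mathsf{M}}$ with a plain C++ Gaussian generator and a BLAS++ GEMM call; RandLAPACK itself draws its Gaussians through RandBLAS and the Random123 generators, so treat this as a sketch of the idea rather than the production code path:

```cpp
#include <blas.hh>   // BLAS++, the wrapper layer used by RandBLAS/RandLAPACK
#include <cstdint>
#include <random>
#include <vector>

// Form the d-by-n sketch Msk = S * M, where S is a d-by-m standard Gaussian
// operator and M is m-by-n. All matrices are column-major.
void gaussian_sketch(int64_t d, int64_t m, int64_t n,
                     const double* M, std::vector<double>& Msk) {
    std::vector<double> S(d * m);
    std::mt19937_64 gen(0);  // fixed seed, for reproducibility of the example
    std::normal_distribution<double> N(0.0, 1.0);
    for (auto& s : S) s = N(gen);
    Msk.resize(d * n);
    blas::gemm(blas::Layout::ColMajor, blas::Op::NoTrans, blas::Op::NoTrans,
               d, n, m, 1.0, S.data(), d, M, m, 0.0, Msk.data(), d);
}
```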

Remark 1.

In our prior work on QRCP of tall matrices [MBM2024], we used a fast sparse operator to prevent sketching from becoming a computational bottleneck. In the context of BQRRP, Gaussian sketching has no risk of being a bottleneck operation. This is because BQRRP sketches in the sampling regime rather than the embedding regime, to borrow terms from [RandLAPACK_Book, Section 2.2].

Note that Sections 2.1–2.5 do not dwell on the edge cases of Algorithm 1 ($m$, $n$ not divisible by $b$; small square inputs in the described pseudocodes; etc.), instead describing the default formulation of the subroutines that go into Algorithm 1 and their performance on the most relevant inputs.

2.1 A practical wide QRCP selection

At iteration $i \in 0{:}(\lceil n/b\rceil - 1)$, step 12 in Algorithm 1 uses a qrcp_wide function, applied to a wide (for most $i$) sketched matrix $\bm{\mathsf{M}}^{\mathrm{sk}} \in \mathbb{R}^{d \times (n - ib)}$. One could implement this by calling the standard LAPACK QRCP function, GEQP3. In our high-performance implementation of BQRRP, we instead employ an approach that uses LU factorization with partial pivoting, GETRF, to retrieve the pivot vector $\bm{J}^{\mathrm{sk}}$, and then unpivoted QR, GEQRF, to find the matrix $\bm{\mathsf{R}}^{\mathrm{sk}}$ needed in the sample updating step (step 35). Algorithm 2 shows how such a method is implemented.

Algorithm 2 : Practical wide QRCP
1: Input: a submatrix $\bm{\mathsf{M}}^{\mathrm{sk}} \in \mathbb{R}^{d \times (n - ib)}$, where $d = \lceil\gamma b\rceil \geq b$.
2: function qrcp_practical($\bm{\mathsf{M}}^{\mathrm{sk}}$)
3:   Allocate $\bm{\mathsf{M}}^{\mathrm{sk\_trans}} = \texttt{transpose}(\bm{\mathsf{M}}^{\mathrm{sk}})$
4:   Compute $[\sim, \sim, \bm{J}^{\mathrm{lu}}] = \texttt{lu}(\bm{\mathsf{M}}^{\mathrm{sk\_trans}})$
5:       // Done via standard row-pivoted LU factorization, GETRF
6:   Convert $\bm{J}^{\mathrm{qr}} = \texttt{piv\_transform}(\bm{J}^{\mathrm{lu}})$
7:   Permute $\bm{\mathsf{M}}^{\mathrm{sk}} = \texttt{col\_perm}(\bm{\mathsf{M}}^{\mathrm{sk}}, \bm{J}^{\mathrm{qr}})$
8:   Compute $[\bm{\mathsf{Q}}^{\mathrm{sk}}, \bm{\mathsf{R}}^{\mathrm{sk}}] = \texttt{qr}(\bm{\mathsf{M}}^{\mathrm{sk}})$
9:       // Done via standard unpivoted QR factorization, GEQRF
10:  return $\bm{\mathsf{Q}}^{\mathrm{sk}}$, $\bm{\mathsf{R}}^{\mathrm{sk}}$, $\bm{J}^{\mathrm{qr}}$

Step 3 in Algorithm 2 deliberately does not use an in-place transpose. Since the matrix $\bm{\mathsf{M}}^{\mathrm{sk}}$ used by the qrcp_wide function in step 12 of Algorithm 1 is much smaller than $\bm{\mathsf{M}}$, allocating a $d \times n$ buffer for $\bm{\mathsf{M}}^{\mathrm{sk\_trans}}$ is more efficient than performing an additional in-place transpose to restore $\bm{\mathsf{M}}^{\mathrm{sk}}$ at the end of the algorithm.

Remark 2.

Note that a portion of $\bm{\mathsf{R}}^{\mathrm{sk}}$ can be used in step 20 for preconditioning the next panel of $\bm{\mathsf{M}}$ (see details in Section 2.3).

Permutation formats.

Step 6 in Algorithm 2 is crucial, since the format in which pivoted LU represents the permutation vector $\bm{J}$ differs from the pivoted QR format, and hence a format conversion procedure is required. In the context of pivoted LU factorization, row $i$ of the input matrix was interchanged with row $\bm{J}^{\mathrm{lu}}(i)$. The format conversion consists of first creating a vector $\bm{J}^{\mathrm{qr}}$ of length $n$ with entries from $1$ to $n$, and then serially swapping its elements according to the entries of $\bm{J}^{\mathrm{lu}}$. Simply put, for each index $i \in 0{:}(n-1)$, the element at $\bm{J}^{\mathrm{qr}}(i)$ swaps positions with the element at $\bm{J}^{\mathrm{qr}}(\bm{J}^{\mathrm{lu}}(i) - 1)$.
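A direct C++ transcription of this conversion follows (a sketch of the piv_transform step in Algorithm 2; the function name is borrowed from the pseudocode):

```cpp
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Convert a GETRF-style pivot vector (row i was interchanged with row
// Jlu[i]-1, one-based) into a GEQP3-style permutation vector Jqr (one-based).
std::vector<int64_t> piv_transform(const std::vector<int64_t>& Jlu, int64_t n) {
    std::vector<int64_t> Jqr(n);
    std::iota(Jqr.begin(), Jqr.end(), 1);       // Jqr = (1, 2, ..., n)
    for (std::size_t i = 0; i < Jlu.size(); ++i)
        std::swap(Jqr[i], Jqr[Jlu[i] - 1]);     // replay the LU swaps serially
    return Jqr;
}
```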

Remark 3.

When implementing the pivot translation procedure in practice, it is important to remember that pivot vectors in LAPACK store entries in a one-based index format used by Fortran and MATLAB (as stated previously in Section 1.3).

Wide QRCP and sketching.

When Algorithm 2 is used in BQRRP, the first $b$ components of $\bm{J}^{\mathrm{qr}}$ are the same for $\gamma = 1$ and $\gamma > 1$. Therefore, the rank-revealing properties of BQRRP using Algorithm 2 for qrcp_wide cannot be improved by using $\gamma > 1$.

Candidates’ performance.

A practical comparison of the performance of the two approaches is shown in Figure 2. The performance superiority of Algorithm 2 over the standard QRCP approach makes it the best option for use in a BQRRP implementation.

[Figure 2 image; panels: Intel CPU and AMD CPU; y-axis: GigaFLOP/s; x-axis: $d$.]

Figure 2: Performance of the candidate methods for the qrcp_wide function in step 12 of Algorithm 1, captured on the Intel and AMD systems (see Table 1). Performance is measured via the canonical FLOP rate, relying on the FLOP count of the standard LAPACK QR function (GEQRF). Experiments were conducted on matrices of size $d_1 \times n_1$ and $d_2 \times n_2$, with $n_1 = 65{,}536$ and $d_1 \in 256\cdot\{1,2,4,\dots,32\}$, and $n_2 = 64{,}000$ and $d_2 \in 250\cdot\{1,2,4,\dots,32\}$. On both systems, the performance of the LUQR scheme (Algorithm 2) is superior to GEQP3 for all choices of $d$.

2.2 Numerical rank selection

In order to understand the role of tri_rank at step 14, it is useful to pretend that all computations in BQRRP and its subroutines are performed in exact arithmetic. If the current block is not full-rank ($k \neq \min\{b, (n-s), (m-s)\}$), then the current iteration of BQRRP will be the final one. To see why this is reasonable, suppose tri_rank returns the exact rank of its input matrix. One can show that, conditional on an event which occurs with probability 1, BQRRP returns $\ell = \operatorname{rank}(\bm{\mathsf{M}})$ and a full decomposition of $\bm{\mathsf{M}}$. If the update at step 35 were still performed, $\bm{\mathsf{M}}^{\mathrm{sk}}(:, c{:})$ would be the zero matrix.

Of course, it is unrealistic to assume black-box access to a method to compute the exact rank of a floating-point matrix. This raises the question of whether, setting aside rounding errors, we could ensure a full decomposition of $\bm{\mathsf{M}}$ if we allowed for overestimation of rank. The answer is that we can; it is valid for tri_rank to simply return the dimension $p$ of the input $p \times p$ triangular matrix. This choice can be used with minor modifications to the rest of BQRRP and would ensure that BQRRP mimics GEQP3 as closely as possible. It would be sufficient to use a different updating formula at step 35 that remains well-posed if $\bm{\mathsf{R}}_{11}$ is singular; two such methods are described in [DG:2017:QR, §4]. It would also be sufficient to keep the ill-posed update, while ensuring that qrcp_wide returns $\bm{J}^{\mathrm{sk}} = (1, 2, \ldots, n-s+1)$ when $\bm{\mathsf{M}}^{\mathrm{sk}}$ contains Infs or NaNs. Since $\bm{J}^{\mathrm{sk}}$ only affects the pivot decisions used on $\bm{\mathsf{M}}$, a stable Householder QR method can safely be applied in qr_tall, without being impacted by $\bm{\mathsf{M}}^{\mathrm{sk}}$ becoming ill-formed.

The BQRRP implementation in RandLAPACK uses a naive rank estimator that is described in our prior work on preconditioned column-pivoted Cholesky QR [MBM2024]. This strategy is only needed for implementations of qr_tall based on Cholesky QR. More sophisticated strategies would be needed if BQRRP were intended as a drop-in replacement for LAPACK 3.12’s GEQP3RK function for truncated QRCP of low-rank matrices. Such strategies are beyond the scope of this manuscript. In particular, all performance experiments we conduct are on matrices of full numerical rank.
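For illustration, here is a minimal sketch of one such naive estimator on an upper-triangular factor, using a simple diagonal-threshold rule; the exact rule used in RandLAPACK (described in [MBM2024]) may differ:

```cpp
#include <cmath>
#include <cstdint>

// Return the largest k such that the leading k diagonal entries of the
// p-by-p upper-triangular R (column-major, leading dimension ldr) are
// nonnegligible relative to R(0,0). Illustrative only.
int64_t tri_rank_naive(const double* R, int64_t p, int64_t ldr, double tol) {
    double ref = std::abs(R[0]);
    if (ref == 0.0) return 0;
    for (int64_t j = 0; j < p; ++j)
        if (std::abs(R[j + j * ldr]) < tol * ref)
            return j;  // first negligible diagonal entry found
    return p;          // full rank
}
```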

2.3 Tall QR selection

Step 20 in Algorithm 1 is concerned with performing unpivoted QR on a (generally) tall submatrix of the input matrix that has been permuted via a permutation vector computed according to the description in Section 2.1. Thus, any QR factorization method that can handle tall matrices is suitable here. The resulting $\bm{\mathsf{Q}}$-factor must follow the format outlined in Section 1.3; if the selected QR method produces $\bm{\mathsf{Q}}$ in a different format, it must be converted to the required representation. Additionally, at iteration $i \in 0{:}(\lceil n/b\rceil - 1)$, the tall QR is performed on a matrix of size $(m - ib)$-by-$b$ (except possibly in the last iteration, if $n$ is not evenly divisible by $b$), regardless of whether the block has full rank $k$ (estimated with the procedure described in Section 2.2). Even though $b$ reflectors are computed in that case, we use only $k$ of them outside of this step.

Available methods.

LAPACK offers several algorithms suitable for tall QR. The natural choice in this context is LATSQR [lapack_latsqr]: a specialized Householder QR factorization for tall matrices. Alternatively, one could use a standard Householder QR factorization, GEQRF [lapack_geqrf]. A third option, GEQRT [lapack_geqrt], is a blocked algorithm based on the compact WY representation [Bischof:1987:WRP:37316.37324]. Finally, LAPACK's GEQR calls either LATSQR or GEQRT, depending on the size of the input matrix and the output from LAPACK's ILAENV inquiry function [lapack_ilaenv]. In addition to the standard LAPACK algorithms, we consider the use of Cholesky QR in this context, following our approach from prior work [MBM2024]. Background on Cholesky QR is given in Appendix B.
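For reference, a bare-bones Cholesky QR sketch in terms of BLAS++/LAPACK++ calls (the wrappers our software builds on) is shown below; it omits the preconditioning, rank handling, and Householder reconstruction that Algorithm 3 adds:

```cpp
#include <blas.hh>
#include <lapack.hh>
#include <cstdint>
#include <vector>

// Basic Cholesky QR of a tall m-by-k, column-major A (lda = m):
//   G = A^T A;  G = R^T R (Cholesky);  Q = A R^{-1}.
// On return, A is overwritten by the explicit economical Q, and R holds the
// k-by-k upper-triangular factor. No failure handling, for brevity.
void cholqr(int64_t m, int64_t k, double* A, std::vector<double>& R) {
    R.assign(k * k, 0.0);
    // Gram matrix G = A^T A, stored in R's buffer (upper triangle).
    blas::syrk(blas::Layout::ColMajor, blas::Uplo::Upper, blas::Op::Trans,
               k, m, 1.0, A, m, 0.0, R.data(), k);
    // Cholesky factorization: the upper triangle of R now holds the R factor.
    int64_t info = lapack::potrf(lapack::Uplo::Upper, k, R.data(), k);
    // info > 0 signals a rank-deficient block; BQRRP handles this case
    // through its block-rank machinery (Section 2.2).
    (void) info;
    // Q = A * R^{-1} via a triangular solve from the right.
    blas::trsm(blas::Layout::ColMajor, blas::Side::Right, blas::Uplo::Upper,
               blas::Op::NoTrans, blas::Diag::NonUnit,
               m, k, 1.0, R.data(), k, A, m);
}
```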

Remark 4.

In principle, one could use a version of the Gram–Schmidt method to obtain an explicit version of the $\bm{\mathsf{Q}}$-factor. While this may be a viable option in some settings, it would increase the space requirements by $m^2$ words, and it would not meet our output format requirements.

Cholesky QR dependencies.

Although Cholesky QR can be safe to use in the context of BQRRP, its use produces the following complication: steps 24 and 27 require access to the fully-formed $m$-by-$m$ representation of the orthonormal factor computed at the current iteration, while Cholesky QR only outputs an explicit economical representation $\bm{\mathsf{Q}}^{\mathrm{chol}} \in \mathbb{R}^{m \times k}$. We can acquire the full representation by using a specialized LAPACK routine, ORHR_COL. This routine transforms the explicit economical $\bm{\mathsf{Q}}^{\mathrm{chol}}$ output by Cholesky QR into an implicit representation via Householder reconstruction. A description of the basic approach for performing Householder reconstruction can be found in [BD2015, Algorithms 5, 6]. An advanced recursive implementation of ORHR_COL is described in its respective Netlib LAPACK documentation [lapack_orhr_col]; see also [F1997].

Algorithm 3 shows a practical implementation of Cholesky QR and its dependencies, offering a method that can be used in step 20.

Algorithm 3 : Cholesky QR + dependencies in the context of qr_tall
1: Input: Iteration $i \in 0{:}(\lceil n/b\rceil - 1)$ (from Algorithm 1); matrix $\bm{\mathsf{M}}^{\mathrm{perm}} \in \mathbb{R}^{(m-ib) \times b}$, where $b \ll m$; upper-triangular $\bm{\mathsf{R}}^{\mathrm{sk}} \in \mathbb{R}^{d \times (n-ib)}$ (output from step 12 in Algorithm 1), where $d = \lceil\gamma b\rceil \geq \gamma b \geq b$ when $\gamma \geq 1$; block rank $k \leq b$
2: function cholqr_deps($\bm{\mathsf{M}}^{\mathrm{perm}}, \bm{\mathsf{R}}^{\mathrm{sk}}, k$)
3:   Truncate and precondition $\bm{\mathsf{M}}^{\mathrm{pre}} = \bm{\mathsf{M}}^{\mathrm{perm}}(:, 0{:}k)\,(\bm{\mathsf{R}}^{\mathrm{sk}})^{-1}$
4:   Decompose $[\bm{\mathsf{Q}}^{\mathrm{chol}}, \bm{\mathsf{R}}^{\mathrm{chol}}] = \texttt{cholqr}(\bm{\mathsf{M}}^{\mathrm{pre}})$
5:       // $\bm{\mathsf{Q}}^{\mathrm{chol}}$ is explicit $(m-ib)$-by-$k$; $\bm{\mathsf{R}}^{\mathrm{chol}}$ is $k$-by-$k$
6:   Reconstruct $[\bm{\mathsf{Q}}^{\mathrm{curr}}, \bm{D}] = \texttt{householder\_reconstruct}(\bm{\mathsf{Q}}^{\mathrm{chol}})$
7:       // Using ORHR_COL; $\bm{\mathsf{Q}}^{\mathrm{curr}}$ uses $k$ reflectors and represents $m - ib$ columns; $\bm{D}$ is a sign vector
8:   Compute $\bm{\mathsf{R}}_{11} = \bm{\mathsf{R}}^{\mathrm{chol}}\,\texttt{diag}(\bm{D})\,\bm{\mathsf{R}}^{\mathrm{sk}}(0{:}k, 0{:}b)$
9:       // Undoing the preconditioning; $\bm{\mathsf{R}}_{11}$ is $k \times b$
10:  return $\bm{\mathsf{Q}}^{\mathrm{curr}}$, $\bm{\mathsf{R}}_{11}$

Candidates’ performance.

Figure 3 illustrates how the use of preconditioning and Householder restoration affects the performance of Cholesky QR, relative to alternative methods for tall QR factorization. Figure 3 shows that GEQRF is the best alternative method for tall QR factorization across both systems. Additionally, favoring the use of Cholesky QR on the Intel system for input matrices of select sizes can be a reasonable option.

[Figure 3 image; panels: Intel CPU and AMD CPU; y-axis: GigaFLOP/s; x-axis: $k$.]

Figure 3: Performance of the candidate methods for the qr_tall function in step 20 of Algorithm 1, captured on the Intel and AMD systems (see Table 1). Performance is measured via the canonical FLOP rate, using the FLOP count of LAPACK's GEQRF, on matrices of sizes $m_1 \times k_1$ and $m_2 \times k_2$, with $m_1 = 65{,}536$ and $k_1 \in 256\cdot\{1,2,4,\dots,32\}$, and $m_2 = 64{,}000$ and $k_2 \in 250\cdot\{1,2,4,\dots,32\}$. Across both systems, we observe that plain Cholesky QR outperforms all alternative methods; however, the version of Cholesky QR with all of its dependencies (Algorithm 3) is only competitive on the Intel system with input data of dimensions that are powers of two. Otherwise, on both the Intel and AMD systems, GEQRF shows the best performance of the practical alternatives.
Remark 5.

In theory, the performance of GEQRT should be comparable to that of GEQRF, as the primary distinction between the two is that GEQRT provides access to the $\bm{\mathsf{T}}$ matrix, which encodes the sequence of upper-triangular block reflectors. However, as shown in Figure 3, GEQRT performs significantly worse than all other tall QR implementations. Varying the internal block size had no noticeable impact on performance. We suspect this is because GEQRT is not as heavily optimized in MKL 2025.0 as the other QR routines.

Output $\bm{\mathsf{Q}}$ representation.

Only GEQRF outputs the representation of $\bm{\mathsf{Q}}$ that precisely matches the one we are after (described in Section 1.3). The format of $\bm{\mathsf{Q}}$ produced by ORHR_COL differs from the intended one in the following way: instead of generating a vector $\bm{\tau}$ with the scalar factors of the elementary reflectors, it produces an upper-trapezoidal matrix $\bm{\mathsf{T}}$ that represents a sequence of upper-triangular block reflectors stored compactly. Each block in $\bm{\mathsf{T}}$ is of size $n_b \times b$ (except possibly in the last iteration, if $n$ is not evenly divisible by $b$), with $n_b$ being the block size parameter. The scalar factors of the elementary reflectors, originally intended for $\bm{\tau}$, are stored along the diagonals of the blocks of $\bm{\mathsf{T}}$.
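Under our reading of the LAPACK documentation for this storage scheme (column-major $\bm{\mathsf{T}}$, with the diagonal entry of column $j$ sitting at row $j \bmod n_b$ of its block), the $\bm{\tau}$ vector could be recovered with a sketch like the following; this layout assumption should be checked against the target library:

```cpp
#include <cstdint>
#include <vector>

// Recover the tau vector from ORHR_COL's compact T factor, assuming T is
// ldt-by-k, column-major, with the diagonal entry of column j at row j % nb.
// Illustrative sketch only; verify the layout for your LAPACK build.
std::vector<double> tau_from_T(const double* T, int64_t ldt,
                               int64_t k, int64_t nb) {
    std::vector<double> tau(k);
    for (int64_t j = 0; j < k; ++j)
        tau[j] = T[(j % nb) + j * ldt];
    return tau;
}
```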

For details of the representations used by GEQR, LATSQR, and GEQRT, we refer the readers to the official LAPACK documentation.

2.4 Selection of methods for applying the Q-factor

In BQRRP, steps 24 and 27 represent the application of the transpose of a full representation of the current iteration's $\bm{\mathsf{Q}}$-factor, $(\bm{\mathsf{Q}}^{\mathrm{curr}})^{\text{\tiny T}}$ (obtained as described in Section 2.3), to a pivoted trailing portion of the current iteration's matrix $\bm{\mathsf{M}}$. Whether step 24 or 27 is executed depends on whether the current block rank, estimated as outlined in Section 2.2, equals the number of columns in the current block. The distinction between these steps lies in what is updated during the current iteration of the main BQRRP loop. One approach updates only a portion of the output $\bm{\mathsf{R}}$-factor (in this case, the current iteration is the final one). The other updates both the $\bm{\mathsf{R}}$-factor and the trailing portion of the input matrix, enabling the algorithm to proceed with further iterations. In an actual implementation (i.e., one where the decomposition is performed in place), no logical branching is needed to distinguish between steps 24 and 27; the difference is only a matter of how many rows of $\bm{\mathsf{M}}$ are involved in the computation.

Available methods.

There are two main methods for applying $(\bm{\mathsf{Q}}^{\mathrm{curr}})^{\text{\tiny T}}$ to a given matrix; the choice of the right method relates to the decision made in the tall QR step.

The first available method is LAPACK's ORMQR function [lapack_ormqr], which takes reflectors produced by GEQRF or GEQP3. Recall that Section 2.3 states that the GEQP3-compatible representation of $\bm{\mathsf{Q}}$ must be obtained regardless of the algorithm used in the tall QR step; therefore, ORMQR is always an option once that representation is obtained. The cost of ORMQR is $4nmk - 2nk^2 + 3nk$ FLOPs [LAWN41:1994, Page 122].

The second option in this context is the GEMQRT function [lapack_gemqrt]. The input format of GEMQRT matches the output format of ORHR_COL. Since Cholesky QR must use ORHR_COL, pairing Cholesky QR with GEMQRT can be an efficient choice. Alternatively, one could pair GEMQRT with any tall QR method, first ensuring that the representation of the output $\bm{\mathsf{Q}}$-factor is made compatible with the input representation of GEMQRT.

One more alternative to the methods described above is avoiding the trailing update altogether, as described in [DG:2017:QR, Algorithm 5.1]. This approach, however, is only applicable to low-rank data, and implementing it would further increase the complexity of an already nontrivial BQRRP scheme.

Candidates’ performance.

The relative performance of the alternative ways of applying an orthonormal factor $(\bm{\mathsf{Q}}^{\mathrm{curr}})^{\text{\tiny T}}$ to an arbitrary matrix of size $m$-by-$(n-b)$ is shown in Figure 4. The performance superiority of ORMQR over the alternatives in the majority of the explored cases makes it the best option for use in a BQRRP implementation.

[Figure 4 image; panels: Intel CPU and AMD CPU; y-axis: GigaFLOP/s; x-axis: $b$.]

Figure 4: Performance of the candidate methods for the apply_trans_q function in steps 24 and 27 of Algorithm 1, captured on the Intel and AMD systems (see Table 1). Performance is measured via the canonical FLOP rate, relying on the FLOP count of the LAPACK function ORMQR. Experiments were conducted using orthonormal matrices represented by $b_{1,2}$ column vectors of length $m_{1,2}$, applied to general matrices of size $m_{1,2}$-by-$(n_{1,2} - b_{1,2})$, where $m_1 = n_1 = 65{,}536$ and $b_1 \in 256\cdot\{1,2,4,\dots,32\}$, and $m_2 = n_2 = 64{,}000$ and $b_2 \in 250\cdot\{1,2,4,\dots,32\}$. As such, our experiment mimics the action in step 27 of Algorithm 1 at iteration one. In most cases across both systems, the performance of ORMQR is superior to that of GEMQRT for all choices of $b$ and $n_b$.
Remark 6.

Similar to the underperformance of GEQRT, which we attributed to limited optimization in MKL 2025.0, GEMQRT also exhibits subpar performance compared to ORMQR, as shown in Figure 4, despite the expectation that their performance should be comparable.

2.5 Selection of implementation for column permutation

A BQRRP algorithm requires a conceptually trivial, yet crucially important kernel: a function for permuting the columns of a given matrix in accordance with the pivot vector output by qrcp_wide at step 12. This kernel is used in Algorithm 1 at step 16 for permuting the columns in the rectangular portion of the $\bm{\mathsf{R}}$-factor, at step 18 for permuting the columns of a submatrix of the input matrix, and at step 19 for updating the permutation vector.

Recall that pivoted QR factorization generates a pivot vector $\bm{J}^{\mathrm{qr}}$ such that if $\bm{J}^{\mathrm{qr}}(j) = i+1$, then the $j^{\mathrm{th}}$ column of $\bm{\mathsf{M}}(:, \bm{J}^{\mathrm{qr}})$ was the $i^{\mathrm{th}}$ column of $\bm{\mathsf{M}}$. This representation of the pivot vector is referred to as the "permutation format." Such vectors are stored with one-based indexing for consistency with LAPACK. The approach for permuting the columns of a given matrix in accordance with $\bm{J}^{\mathrm{qr}}$ is described in Algorithm 4.

Algorithm 4 : Sequential approach to column permutation
1: Input: A matrix $\bm{\mathsf{M}} \in \mathbb{R}^{m \times n}$; a pivot vector $\bm{J}^{\mathrm{qr}}$ of length $n$ produced by a black-box qrcp function.
2: function col_perm_sequential($\bm{\mathsf{M}}, \bm{J}^{\mathrm{qr}}$)
3:   for $i = 0{:}n$ do
4:     $j = \bm{J}^{\mathrm{qr}}(i) - 1$
5:     swap($\bm{\mathsf{M}}(:, i), \bm{\mathsf{M}}(:, j)$)
6:         // Swap the entirety of two columns of $\bm{\mathsf{M}}$
7:     $\mathtt{idx} = \texttt{find}(\bm{J}^{\mathrm{qr}}, i+1)$
8:         // Find the index of the element with value $i+1$
9:     $\bm{J}^{\mathrm{qr}}(\mathtt{idx}) = j + 1$
10:  return $\bm{\mathsf{M}}$

There are two important things to note about Algorithm 4. First, as seen in step 9, the pivot vector $\bm{J}^{\mathrm{qr}}$ is updated at every iteration of the main loop. In the context of a BQRRP algorithm, we want to preserve $\bm{J}^{\mathrm{qr}}$ after the column permutation is done, and hence a copy of the pivot vector is required. Second, the way the permutations are performed implies that the same column can be moved several times. This prevents us from parallelizing the main loop of this algorithm (hence the keyword "sequential" in its name).
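A direct C++ transcription of Algorithm 4, operating on a copy of the pivot vector as just discussed (blas::swap is the BLAS++ wrapper around the Level 1 BLAS swap):

```cpp
#include <algorithm>
#include <blas.hh>
#include <cstdint>
#include <vector>

// Permute the columns of a column-major M (m-by-n, leading dimension m)
// according to a one-based, permutation-format pivot vector Jqr, following
// Algorithm 4. Jqr is taken by value so the caller's copy is preserved.
void col_perm_sequential(int64_t m, int64_t n, double* M,
                         std::vector<int64_t> Jqr) {
    for (int64_t i = 0; i < n; ++i) {
        int64_t j = Jqr[i] - 1;
        if (j != i)
            blas::swap(m, &M[i * m], 1, &M[j * m], 1);  // swap columns i and j
        // The column formerly at position i now lives at position j;
        // redirect the future reference to it.
        auto it = std::find(Jqr.begin() + i, Jqr.end(), i + 1);
        if (it != Jqr.end()) *it = j + 1;
    }
}
```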

In principle, the sequential nature of Algorithm 4 could cause a performance bottleneck in a BQRRP algorithm. However, we do not anticipate this happening when running BQRRP on a CPU, as CPUs are latency-tolerant hardware requiring a very low level of parallelism to saturate the available memory bandwidth. Column permutation is an inherently data-intensive operation with no floating-point computation involved, and thus its main hardware bottleneck is the maximum rate at which matrix elements can be transferred between their main memory locations. Historically, CPUs continue to suffer from decreasing bandwidth per compute core, and thus very few cores are needed to saturate the local memory controller in a socket. Specialized vector load and store instructions inside the swap function (and other Level 1 BLAS) allow one to take advantage of all memory channels, even with the majority of the cores remaining idle. The remaining issue may stem from the sequential nature of the loop in Algorithm 4. However, modern CPUs feature latency-hiding mechanisms such as out-of-order execution, register renaming across an extended shadow register file, branch prediction, and data prefetching. These hardware resources allow multiple iterations of the loop to be unrolled at runtime onto the CPU units that are waiting for the swap operation to finish. If non-temporal data-movement instructions are used for swapping, then there is no interference between the levels of cache memory and the matrix column elements, which move through only a few vector registers. In summary, the utilization of the CPU components is evenly distributed, and the only bottleneck is the main memory bandwidth during column swapping.

3 Practical implementation of BQRRP and storage management

The ideas from Section 2 can be consolidated into two visions for BQRRP algorithms: one relying on Cholesky QR (represented by Algorithm 3 and the 𝗤\bm{\mathsf{Q}} matrix storage format conversion), and another relying on Householder QR in step 20. In both cases, ORMQR is the function of choice in steps 24 and 27, due to its reliable performance and straightforward input format requirements. Both approaches rely on Algorithm 2, due to its decisively superior efficiency compared to GEQP3.

From the storage standpoint, both approaches preserve the central feature of BQRRP_CPU, namely that it can be implemented so that all major computations take place in the space occupied by the given input matrix (aside from some relatively small workspace buffers). To explain how this can be achieved, this section provides a step-by-step breakdown of how each operation in the two versions of BQRRP_CPU is implemented and how the data in each operation is stored.

The rest of this section is split into subsections that correspond to specific steps (or collections of steps) in Algorithm 1. We make references to the previous sections (Sections 2.1–2.4) and the steps in Algorithm 1. We emphasize that our description matches the way BQRRP is implemented as part of RandLAPACK. The aforementioned two versions of BQRRP_CPU are implicitly defined through the choices that the user makes when configuring the algorithm. To aid the upcoming detailed description, Figure 5 presents a visualization of how all the major components of BQRRP are stored at iteration ii.

[Figure: fig_5_bqrrp_storage]
Figure 5: A visualization of how all the major components of BQRRP are stored at iteration i0:(n/b1)i\in{0}{:}{(\lceil n/b\rceil-1)} of its main loop. Refer to pseudocode Algorithm 1 for the parameter and buffer names. BQRRP would require a maximum of dm+2dn+2b2+4n+bdm+2dn+2b^{2}+4n+b additional words of memory for the internal workspace buffers. This includes buffers for 𝗠sk_trans\bm{\mathsf{M}}^{\mathrm{sk\_trans}} (size n×dn\times d, d=γbd=\gamma b), 𝑫\bm{D} (size bb) and 𝑱lu\bm{J}^{\mathrm{lu}} (of length nn), not depicted in the figure. These storage costs are modest, if not negligible, considering the relative sizes of parameters used in practical settings: bdnb\leq d\ll n.

3.1 Input and output specification (steps 4, 36)

On input, BQRRP has access to the following: a matrix 𝗠m×n\bm{\mathsf{M}}\in\mathbb{R}^{m\times n}, with no specific requirements on how mm and nn relate; two empty buffers of length nn to store the permutation vector 𝑱\bm{J} and a vector 𝝉\bm{\tau} of scalar factors of the elementary reflectors; an integer block size parameter bnb\ll n; and a scalar γ\gamma that sets the sketch size relative to bb (if using LU-based QRCP on the sketch 𝗠sk\bm{\mathsf{M}}^{\text{sk}}, γ=1.0\gamma=1.0). Additionally, BQRRP receives instructions on which internal subroutine choices to make; chief among these is whether to use Householder QR or Cholesky QR in step 20. If Cholesky QR is chosen, the user can supply an optional integer parameter nbn_{b}, defining the number of column block reflectors to be used in ORHR_COL.

On output, BQRRP returns the following: the inferred numerical rank \ell; the 𝗤\bm{\mathsf{Q}}-factor, represented by \ell Householder reflectors that are stored implicitly and occupy the portion of the space below the diagonal (which initially contained the input matrix 𝗠\bm{\mathsf{M}}), together with a vector 𝝉\bm{\tau} of length \ell; an upper-trapezoidal 𝗥\bm{\mathsf{R}} factor sized ×n\ell\times n and stored in the upper-triangular portion of 𝗠\bm{\mathsf{M}}’s space; and a permutation vector 𝑱\bm{J} of length nn.

3.2 Initial sketching details (steps 5-7)

At step 5 of Algorithm 1, a sketching operator 𝗦d×m\bm{\mathsf{S}}\in\mathbb{R}^{d\times m} (where d=γbd=\lceil\gamma b\rceil) is constructed with independent mean-zero variance-one Gaussian random variables as its entries. (Variance 1/d1/d could also be used; this would have the effect of making the expected squared norm of each column in 𝗦𝗠\bm{\mathsf{S}}\bm{\mathsf{M}} match the squared norm of the corresponding column in 𝗠\bm{\mathsf{M}}.) The generation of this sketching operator is governed by RandBLAS; the memory required to store 𝗦\bm{\mathsf{S}} is automatically allocated upon the operator construction and deallocated when the operator application function returns. In our in-place implementation of BQRRP, step 6 is, naturally, not explicitly performed. At step 7, the sketch resulting from applying the operator 𝗦\bm{\mathsf{S}} to the input 𝗠\bm{\mathsf{M}} is stored in a d×nd\times n buffer 𝗠sk{\bm{\mathsf{M}}}^{\mathrm{sk}}. A portion of this space will further be used to store an updated sketch at every iteration of BQRRP’s main loop.
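As an illustration, the following NumPy lines mimic steps 5 and 7 (this is a sketch, not the RandBLAS API; all names and sizes here are ours):

    import numpy as np

    rng = np.random.default_rng()
    m, n, b, gamma = 4096, 2048, 256, 1.0
    d = int(np.ceil(gamma * b))

    S = rng.standard_normal((d, m))  # step 5: dense mean-zero variance-one Gaussian operator
    M = rng.standard_normal((m, n))  # stand-in for the input matrix
    M_sk = S @ M                     # step 7: the d-by-n sketch, kept in its own buffer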

3.3 Block partitions in BQRRP (steps 8, 11)

In Sections 3.4-3.7, we refer to the current iteration of the Algorithm 1 main loop as i0:(n/b1)i\in{0}{:}{(\lceil n/b\rceil-1)}. We use s=ib\mathit{s}=ib to denote the start of the current row/column block range, r=min{m,(i+1)b}\mathit{r}=\min\{m,(i+1)b\} to denote the (exclusive) end of the current row block range, and c=min{n,(i+1)b}\mathit{c}=\min\{n,(i+1)b\} to denote the (exclusive) end of the current column block range (the same notation used in step 11 of Algorithm 1). This ensures that the final iteration is well-defined even if neither mm nor nn is evenly divisible by bb.
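In code, these ranges amount to the following trivial sketch (zero-based indexing, exclusive upper bounds; the function name is ours):

    def block_bounds(i, b, m, n):
        # Row/column block ranges for iteration i of the main loop (step 11);
        # the min() calls make the final partial block well-defined.
        s = i * b
        r = min(m, (i + 1) * b)
        c = min(n, (i + 1) * b)
        return s, r, c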

3.4 Processing the sketch and column permutation (steps 12-19)

QRCP on the sketch.

At step 12, performing QRCP on the sketch as described in Algorithm 2 requires an n×dn\times d buffer for transposing 𝗠sk{\bm{\mathsf{M}}}^{\mathrm{sk}} (the full buffer size is only needed at i=0i=0). The important data computed in this step at iteration ii is stored as follows: the upper-trapezoidal 𝗥skd×(ns){\bm{\mathsf{R}}}^{\mathrm{sk}}\in\mathbb{R}^{d\times(n-\mathit{s})} is stored at 𝗠sk(:,s:){\bm{\mathsf{M}}}^{\mathrm{sk}}(:,{\mathit{s}}{:}); and a vector 𝑱sk(ns){\bm{J}}^{\mathrm{sk}}\in\mathbb{R}^{(n-\mathit{s})} is stored in a buffer of length nn. In the discussion that follows, we partition 𝗥sk{\bm{\mathsf{R}}}^{\mathrm{sk}} into three components: the upper-triangular 𝗥11sk=𝗠sk(0:b,s:c){\bm{\mathsf{R}}}^{\mathrm{sk}}_{11}={\bm{\mathsf{M}}}^{\mathrm{sk}}({0}{:}{b},{\mathit{s}}{:}\mathit{c}); the rectangular 𝗥12sk=𝗠sk(0:b,c:){\bm{\mathsf{R}}}^{\mathrm{sk}}_{12}={\bm{\mathsf{M}}}^{\mathrm{sk}}({0}{:}{b},{\mathit{c}}{:}); and the upper-trapezoidal 𝗥22sk=𝗠sk(b:,c:){\bm{\mathsf{R}}}^{\mathrm{sk}}_{22}={\bm{\mathsf{M}}}^{\mathrm{sk}}({b}{:},{\mathit{c}}{:}).

Our implementation of BQRRP allows for alternatively performing QRCP via GEQP3, in which case a buffer for transposing 𝗠sk{\bm{\mathsf{M}}}^{\mathrm{sk}} is not needed.

Block rank computation.

In step 14, if the block rank kk (computed as described in Section 2.2) is not equal to min{b,(ns),(ms)}\min\{b,(n-\mathit{s}),(m-\mathit{s})\} (i.e., the current block is not full-rank), then the given iteration is the final one, as the termination criterion at step 32 dictates. In that case, the trailing portion of the matrix 𝗠\bm{\mathsf{M}} is not updated at step 29.

If BQRRP is set to use Cholesky QR in step 20, then a rank estimate is employed to ensure no infinite or not-a-number values appear when the preconditioning is performed.

Combining column permutations.

Since the intended output format of BQRRP matches that of GEQP3, the output 𝗥\bm{\mathsf{R}} factor will be stored in the upper-triangular portion of 𝗠\bm{\mathsf{M}}’s memory space. At iteration i1i\geq 1, step 16 permutes the trailing columns in the computed rectangular portion of the 𝗥\bm{\mathsf{R}}-factor, implying that 𝗠(0:s,s:)\bm{\mathsf{M}}({0}{:}{\mathit{s}},{\mathit{s}}{:}) is to be permuted in accordance with 𝑱sk{\bm{J}}^{\mathrm{sk}}. Meanwhile, step 18 requires 𝗠(s:,s:)\bm{\mathsf{M}}({\mathit{s}}{:},{\mathit{s}}{:}) to be permuted by the same vector. Therefore, steps 16 and 18 can be combined into permuting 𝗠(:,s:)\bm{\mathsf{M}}(\hskip 1.25pt{}{:}\hskip 0.24994pt{}{},{\mathit{s}}{:}) at every iteration of the BQRRP’s main loop.

Remark 7.

If Algorithm 4 is used for permuting columns, a copy of 𝑱sk{\bm{J}}^{\mathrm{sk}} is required.

After the columns of 𝗠(:,s:)\bm{\mathsf{M}}(\hskip 1.25pt{}{:}\hskip 0.24994pt{}{},{\mathit{s}}{:}) have been permuted, we may perform an early termination check by verifying whether the first column in 𝗠(:,s:)\bm{\mathsf{M}}(\hskip 1.25pt{}{:}\hskip 0.24994pt{}{},{\mathit{s}}{:}) consists entirely of zeros; if so, BQRRP terminates immediately. If at iteration ii we have kmin{b,ns,ms}k\neq\min\{b,n-\mathit{s},m-\mathit{s}\}, then BQRRP should avoid permuting 𝗠(s+k:,c:)\bm{\mathsf{M}}({\mathit{s}+k}{:},{\mathit{c}}{:}), since this trailing portion of the input matrix would not be used (i.e., the trailing update to 𝗠\bm{\mathsf{M}} would be avoided in step 24). This was not shown in Algorithm 1 for simplicity of presentation.

Updating the permutation vector.

By step 19, the pivot vector 𝑱sk{\bm{J}}^{\mathrm{sk}}, computed at the current iteration, has been used for all the necessary internal operations. It can now be incorporated into the vector 𝑱\bm{J} that is to be output by BQRRP. When i=0i=0, this is done by copying 𝑱sk{\bm{J}}^{\mathrm{sk}} into 𝑱\bm{J}; at any subsequent iteration, the trailing portion of 𝑱\bm{J}, 𝑱(s:)\bm{J}({\mathit{s}}{:}), is permuted in accordance with 𝑱sk{\bm{J}}^{\mathrm{sk}} via (a vector version of) Algorithm 4. Since 𝑱sk{\bm{J}}^{\mathrm{sk}} is not needed after this step, we do not have to create a copy of it in Algorithm 4.
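For illustration, a gather-based NumPy equivalent of this step (update_pivot_vector is our name; the swap-based vector version of Algorithm 4 produces the same result):

    import numpy as np

    def update_pivot_vector(J, J_sk, s):
        # Step 19: permute the trailing entries of J by the one-based pivots J_sk.
        # Fancy indexing gathers into a temporary, so J_sk itself is not modified.
        J[s:] = J[s:][J_sk - 1]
        return J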

3.5 Panel QR factorizations details (steps 20, 22)

The internal details of step 20 depend on the user’s choice of tall QR function. As noted earlier, our implementation of BQRRP allows users to choose between GEQRF and the Cholesky QR-based approach.

Householder QR on a panel.

When using GEQRF in step 20, we perform the factorization on all columns of 𝗠perm=𝗠(s:,s:c){\bm{\mathsf{M}}}^{\mathrm{perm}}=\bm{\mathsf{M}}({\mathit{s}}{:},{\mathit{s}}{:}\mathit{c}), regardless of whether kk, the current column block rank, is equal to kmax=min{b,(ns),(ms)}k_{\mathrm{max}}=\min\{b,(n-\mathit{s}),(m-\mathit{s})\}. This is because we want 𝗥11\bm{\mathsf{R}}_{11} to have (cs)=min{b,(ns)}(\mathit{c}-\mathit{s})=\min\{b,(n-\mathit{s})\} columns. Although (cs)(\mathit{c}-\mathit{s}) reflectors are computed in that case, we use only kk of them outside of this step. Using GEQRF on 𝗠perm{\bm{\mathsf{M}}}^{\mathrm{perm}} in this step requires no additional storage. This function produces 𝗤curr\bm{\mathsf{Q}}^{\mathrm{curr}}, represented by (cs)(\mathit{c}-\mathit{s}) reflectors of length (ms)(m-\mathit{s}) (stored in the portion of 𝗠(s:,s:c)\bm{\mathsf{M}}({\mathit{s}}{:},{\mathit{s}}{:}\mathit{c}) below the diagonal) and the vector of scalar factors of the elementary reflectors (stored in 𝝉(s:c)\bm{\tau}({\mathit{s}}{:}\mathit{c})). The explicit 𝗥11\bm{\mathsf{R}}_{11} of size (rs)×(cs)(\mathit{r}-\mathit{s})\times(\mathit{c}-\mathit{s}) is stored directly above 𝗤curr\bm{\mathsf{Q}}^{\mathrm{curr}}, in the upper-triangular portion of 𝗠(s:r,s:c)\bm{\mathsf{M}}({\mathit{s}}{:}\mathit{r},{\mathit{s}}{:}\mathit{c}). Note that in this case, step 22 is performed implicitly, since 𝗥11\bm{\mathsf{R}}_{11} will be stored where it needs to be on output from GEQRF.

Cholesky QR on a panel.

Using the Cholesky QR approach in step 20 comprises four parts: preconditioning; performing Cholesky QR on the preconditioned matrix; implicit 𝗤curr\bm{\mathsf{Q}}^{\mathrm{curr}} reconstruction; and computing 𝗥11\bm{\mathsf{R}}_{11}. All of these are done in accordance with Algorithm 3.

The preconditioning is to be done on a portion of the matrix 𝗠perm{\bm{\mathsf{M}}}^{\mathrm{perm}}. We perform 𝗠pre=𝗠perm(:,0:k)(𝗥11sk(0:k,0:k))1{\bm{\mathsf{M}}}^{\mathrm{pre}}={\bm{\mathsf{M}}}^{\mathrm{perm}}(\hskip 1.25pt{}{:}\hskip 0.24994pt{}{},{0}{:}{k})({\bm{\mathsf{R}}}^{\mathrm{sk}}_{11}({0}{:}{k},{0}{:}{k}))^{-1}, using 𝗥11sk{\bm{\mathsf{R}}}^{\mathrm{sk}}_{11} stored in 𝗠sk(0:b,s:c){\bm{\mathsf{M}}}^{\mathrm{sk}}({0}{:}{b},{\mathit{s}}{:}\mathit{c}). This step requires no workspace buffers, and the matrix 𝗠pre(ms)×k{\bm{\mathsf{M}}}^{\mathrm{pre}}\in\mathbb{R}^{(m-\mathit{s})\times k} is stored at 𝗠(s:,s:(s+k))\bm{\mathsf{M}}({\mathit{s}}{:},{\mathit{s}}{:}(\mathit{s}+k)) after the preconditioning. The next step is to perform Cholesky QR on 𝗠pre{\bm{\mathsf{M}}}^{\mathrm{pre}}.

As explained in Section 2.3, Cholesky QR comprises: computing the Gram matrix (SYRK), performing the Cholesky factorization (POTRF), and forming the explicit economical version of the orthonormal factor (TRSM). These functions require (at most) a single b×bb\times b workspace buffer to store the Gram matrix and the output upper-triangular 𝗥chol\bm{\mathsf{R}}^{\mathrm{chol}} factor, both of size k×kk\times k, kmin{b,(ns),(ms)}k\leq\min\{b,(n-\mathit{s}),(m-\mathit{s})\}. The explicit orthonormal factor 𝗤chol(ms)×k\bm{\mathsf{Q}}^{\mathrm{chol}}\in\mathbb{R}^{(m-\mathit{s})\times k} is stored at 𝗠(s:,s:(s+k))\bm{\mathsf{M}}({\mathit{s}}{:},{\mathit{s}}{:}(\mathit{s}+k)), in place of 𝗠pre{\bm{\mathsf{M}}}^{\mathrm{pre}}.
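A compact NumPy/SciPy sketch of the preconditioning and the SYRK/POTRF/TRSM sequence (function and variable names are ours; ORHR_COL and the final assembly of R_11 are omitted):

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def precond_cholesky_qr(M_perm, R_sk):
        # Precondition from the right: M_pre = M_perm @ inv(R_sk), done as a
        # triangular solve (the TRSM analogue) rather than an explicit inverse.
        M_pre = solve_triangular(R_sk, M_perm.T, trans='T').T
        G = M_pre.T @ M_pre                                  # SYRK: Gram matrix
        R_chol = cholesky(G, lower=False)                    # POTRF: G = R_chol^T R_chol
        Q = solve_triangular(R_chol, M_pre.T, trans='T').T   # TRSM: Q = M_pre @ inv(R_chol)
        return Q, R_chol

Undoing the preconditioning then amounts to forming R_chol @ R_sk, which (up to the sign adjustments from ORHR_COL) is what step 22 stores as R_11.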

ORHR_COL used for the Householder reconstruction returns an implicit representation of 𝗤curr\bm{\mathsf{Q}}^{\mathrm{curr}} in the form of kk reflectors and an upper-trapezoidal matrix 𝗧\bm{\mathsf{T}} of size nb×kn_{b}\times k. The kk reflectors fit exactly in place of the portion of 𝗤chol\bm{\mathsf{Q}}^{\mathrm{chol}} below the diagonal of 𝗠(s:,s:(s+k))\bm{\mathsf{M}}({\mathit{s}}{:},{\mathit{s}}{:}(\mathit{s}+k)), while 𝗧\bm{\mathsf{T}} requires an additional buffer of size nb×bn_{b}\times b. Furthermore, ORHR_COL produces a sign vector 𝑫\bm{D} of length kk that requires a storage buffer of length bb.

Computing 𝗥11\bm{\mathsf{R}}_{11} requires first applying the entries from the sign vector 𝑫\bm{D} to the columns of 𝗥chol\bm{\mathsf{R}}^{\mathrm{chol}} (this is done so that 𝗥chol\bm{\mathsf{R}}^{\mathrm{chol}} is in line with 𝗤curr\bm{\mathsf{Q}}^{\mathrm{curr}}, computed by ORHR_COL). The next step amounts to undoing the preconditioning on 𝗥chol\bm{\mathsf{R}}^{\mathrm{chol}} by multiplying it by the upper-triangular 𝗥11sk(0:k,0:(cs)){\bm{\mathsf{R}}}^{\mathrm{sk}}_{11}({0}{:}{k},{0}{:}{(\mathit{c}-\mathit{s})}), stored at 𝗠sk(0:k,s:c){\bm{\mathsf{M}}}^{\mathrm{sk}}({0}{:}{k},{\mathit{s}}{:}\mathit{c}). The product is temporarily stored in place of 𝗥chol\bm{\mathsf{R}}^{\mathrm{chol}} and is then copied into the upper-triangular portion of 𝗠(s:(s+k),s:c)\bm{\mathsf{M}}({\mathit{s}}{:}(\mathit{s}+k),{\mathit{s}}{:}\mathit{c}), placing it right above the implicitly-stored Householder reflectors (step 22).

Remark 8.

This additional copy is currently unavoidable because no standard BLAS function for multiplying two triangular matrices exists (although one could implement trtrmm).

For the output format of BQRRP to match that described in Section 1.3, we would need to copy the entries from the block diagonals of 𝗧\bm{\mathsf{T}} into 𝝉(s:(s+k))\bm{\tau}({\mathit{s}}{:}(\mathit{s}+k)).

3.6 Applying transposed Q and updating factors (steps 23-31)

Steps 24 and 27 perform the operations described in Section 2.4. Depending on whether the estimated block rank kk matches the column block size bb, we apply the transpose of 𝗤curr\bm{\mathsf{Q}}^{\mathrm{curr}} (from the Householder representation in 𝗠(s:,s:(s+k))\bm{\mathsf{M}}({\mathit{s}}{:},{\mathit{s}}{:}(\mathit{s}+k)) and 𝝉(s:(s+k))\bm{\tau}({\mathit{s}}{:}(\mathit{s}+k))) to either the first kk or to all (ms)(m-\mathit{s}) rows of 𝗠(s:,c:)\bm{\mathsf{M}}({\mathit{s}}{:},{\mathit{c}}{:}).

Regardless of whether step 24 or 27 is executed, the rows 𝗠(s:(s+k),c:)\bm{\mathsf{M}}({\mathit{s}}{:}(\mathit{s}+k),{\mathit{c}}{:}) represent 𝗥12\bm{\mathsf{R}}_{12} (this implicitly fulfills step 30). If step 27 is executed, then the remaining rows 𝗠(r:,c:)\bm{\mathsf{M}}({\mathit{r}}{:},{\mathit{c}}{:}) represent the “working submatrix” of the matrix 𝗠\bm{\mathsf{M}} at the next iteration (this implicitly fulfills step 29). Otherwise, the trailing 𝗠\bm{\mathsf{M}} update is avoided.

In BQRRP, step 31 is performed implicitly, as the matrix 𝗤\bm{\mathsf{Q}} is constructed from the Householder vectors, stored below the main diagonal of 𝗠\bm{\mathsf{M}} and the scalar factors of the elementary reflectors stored in 𝝉\bm{\tau}.

3.7 Algorithm termination and sample update (steps 32-35)

Termination criteria.

The termination criterion check (step 32) simply amounts to verifying whether the maximum number of iterations has been reached (ii reached n/b1\lceil n/b\rceil-1) or whether the block rank kk does not match the column block size min{b,ns,ms}\min\{b,n-\mathit{s},m-\mathit{s}\}. Before termination, the rank parameter \ell is updated to s+k\mathit{s}+k (step 33).

Sample update.

If the maximum number of iterations has not been reached at this point, the sketch 𝗠sk{\bm{\mathsf{M}}}^{\mathrm{sk}} is updated at step 35. This step first computes the expression 𝗥11sk(𝗥11)1{\bm{\mathsf{R}}}^{\mathrm{sk}}_{11}(\bm{\mathsf{R}}_{11})^{-1}, where 𝗥11sk{\bm{\mathsf{R}}}^{\mathrm{sk}}_{11} is stored in the upper-triangular portion of 𝗠sk(0:b,s:c){\bm{\mathsf{M}}}^{\mathrm{sk}}({0}{:}{b},{\mathit{s}}{:}\mathit{c}) and 𝗥11\bm{\mathsf{R}}_{11} is stored in the upper-triangular portion of 𝗠(s:r,s:c)\bm{\mathsf{M}}({\mathit{s}}{:}\mathit{r},{\mathit{s}}{:}\mathit{c}). The result is written into the space of 𝗥11sk{\bm{\mathsf{R}}}^{\mathrm{sk}}_{11}, at 𝗠sk(0:b,s:c){\bm{\mathsf{M}}}^{\mathrm{sk}}({0}{:}{b},{\mathit{s}}{:}\mathit{c}); we explicitly zero out the entries in this space below the diagonal. Next, the full expression 𝗥12sk𝗥11sk(𝗥11)1𝗥12{\bm{\mathsf{R}}}^{\mathrm{sk}}_{12}-{\bm{\mathsf{R}}}^{\mathrm{sk}}_{11}(\bm{\mathsf{R}}_{11})^{-1}\bm{\mathsf{R}}_{12} is computed. The matrix 𝗥12sk{\bm{\mathsf{R}}}^{\mathrm{sk}}_{12} is stored in 𝗠sk(0:b,c:){\bm{\mathsf{M}}}^{\mathrm{sk}}({0}{:}{b},{\mathit{c}}{:}) and 𝗥12\bm{\mathsf{R}}_{12} is stored at 𝗠(s:r,c:)\bm{\mathsf{M}}({\mathit{s}}{:}\mathit{r},{\mathit{c}}{:}). The result of this expression is placed into the space of 𝗥12sk{\bm{\mathsf{R}}}^{\mathrm{sk}}_{12}, at 𝗠sk(0:b,c:){\bm{\mathsf{M}}}^{\mathrm{sk}}({0}{:}{b},{\mathit{c}}{:}). The upper-trapezoidal matrix 𝗥22sk{\bm{\mathsf{R}}}^{\mathrm{sk}}_{22} is already stored at its intended location, in 𝗠sk(b:(db),c:){\bm{\mathsf{M}}}^{\mathrm{sk}}({b}{:}(d-b),{\mathit{c}}{:}), but we need to make sure that the entries are zeroed out below the diagonal. After the sketch update is completed, the “working submatrix” of 𝗠sk{\bm{\mathsf{M}}}^{\mathrm{sk}} is implicitly located at 𝗠sk(:,c:){\bm{\mathsf{M}}}^{\mathrm{sk}}(\hskip 1.25pt{}{:}\hskip 0.24994pt{}{},{\mathit{c}}{:}).
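In NumPy/SciPy terms, the core of this update reads as follows (a sketch with our own names; the triangular solve applies (R_11)^{-1} from the right):

    import numpy as np
    from scipy.linalg import solve_triangular

    def sample_update(R_sk_11, R_sk_12, R_11, R_12):
        # T = R_sk_11 @ inv(R_11), computed as a right-hand triangular solve.
        T = solve_triangular(R_11, R_sk_11.T, trans='T').T
        T = np.triu(T)                     # explicitly zero the subdiagonal entries
        R_sk_12_new = R_sk_12 - T @ R_12   # updated trailing sample
        return T, R_sk_12_new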

4 Considerations specific to implementations targeting GPU accelerators

4.1 Limited LAPACK functionality

Implementing a GPU version of the BQRRP algorithm involves several challenges, primarily due to the limited availability of GPU variants of most LAPACK and BLAS-level functions. Specifically, NVIDIA cuSOLVER lacks support for the full range of LAPACK functions. This is understandable, as the latest LAPACK release, 3.12.0, includes over 2,000 functions, many of which are not widely used. Nevertheless, the lack of a wide range of LAPACK functions limits our choices for the internal subroutines in BQRRP_GPU.

For the purpose of constructing BQRRP_GPU, the most notable function that cuSOLVER does not offer is ORHR_COL, which we require when using Cholesky QR in the qr_tall subroutine. In order to provide some investigation of Cholesky QR methods for qr_tall, we developed a basic CUDA implementation of [BD2015, Algorithm 5]. We note that dramatically improved performance could be achieved by the Cholesky QR approach if cuSOLVER offered a recursive LU-based implementation of ORHR_COL similar to the one used by LAPACK. It is also worth noting that there is no standard QRCP algorithm offered in cuSOLVER. Therefore, our BQRRP_GPU implementation is forced to use the LU-based Algorithm 2 for qrcp_wide.

4.2 Column permutation

A major difference between the CPU and GPU versions of BQRRP is the importance, on the GPU, of a high-performance routine for permuting the columns of a given matrix in accordance with a pivot vector.

Hardware considerations.

Recall that Algorithm 4 illustrated a sequential approach to permuting columns. Despite this approach being sequential, we do not anticipate it having a major negative effect on the overall performance of BQRRP_CPU, as described in Section 2.5. By contrast, on a GPU, there are very few hardware resources devoted to dealing with inherently sequential instruction streams or even tolerating the high latency of the main memory transactions. The initially presented sequential loop for pivoting is anathema to the parallel GPU hardware.

The alternative approach.

The widely-used alternative solution is the “parallel pivoting” strategy, shown in Algorithm 5. By contrast to Algorithm 4, Algorithm 5 does not swap columns within the memory space of a single matrix. Instead, it copies each column from one memory space into another, at the new location given by the permutation vector; this allows the main loop to be parallelized, significantly increasing the performance of this column permutation approach. The performance gain offered by Algorithm 5 comes at the cost of increased space usage, as it requires a copy of the input 𝗠\bm{\mathsf{M}}.

Algorithm 5 : Column-parallel approach to column permutation
1:Input: A matrix 𝗠m×n\bm{\mathsf{M}}\in\mathbb{R}^{m\times n}, a pivot vector 𝑱qr\bm{J}^{\mathrm{qr}} output by a black-box qrcp function
2:function col_perm_parallel(𝗠,𝑱qr)\texttt{col\_perm\_parallel}(\bm{\mathsf{M}},\bm{J}^{\mathrm{qr}})
3:  𝗠cpy=copy(𝗠)\bm{\mathsf{M}}^{\mathrm{cpy}}=\texttt{copy}(\bm{\mathsf{M}})
4:  for i=0:ni=0:n do
5:    j=𝑱qr(i)1j=\bm{J}^{\mathrm{qr}}(i)-1
6:    𝗠(:,i)=𝗠cpy(:,j)\bm{\mathsf{M}}(:,i)=\bm{\mathsf{M}}^{\mathrm{cpy}}(:,j)   
7:return 𝗠\bm{\mathsf{M}}
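In NumPy terms, Algorithm 5 collapses to a single gather (a sketch with our names; each output column is written independently, which is precisely what makes the loop parallel):

    import numpy as np

    def col_perm_parallel(M, J_qr):
        # Out-of-place permutation: column i of the result is column
        # J_qr(i) - 1 of the original matrix (one-based pivots).
        M_cpy = M.copy()              # the extra m-by-n buffer Algorithm 5 pays for
        M[:, :] = M_cpy[:, J_qr - 1]  # all columns can be copied concurrently
        return M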

Managing data copies in Algorithm 5.

In order to minimize the number of explicit copies performed as part of using Algorithm 5 in practice, one could swap the pointers that point to the memory spaces of 𝗠\bm{\mathsf{M}} and 𝗠cpy\bm{\mathsf{M}}^{\mathrm{cpy}} before performing column permutation. This approach is safe as long as either the entire matrix 𝗠\bm{\mathsf{M}} is permuted in Algorithm 5, or the non-permuted part of 𝗠\bm{\mathsf{M}} is not used outside of Algorithm 5. As stated in Section 2.5, the column permutation is used in steps 16, 18, and 19 in Algorithm 1 (where steps 16 and 18 can be merged) and in step 7 of Algorithm 2. As such, we will be applying permutations to portions of matrices 𝗠sk\bm{\mathsf{M}}^{\mathrm{sk}} and 𝗠\bm{\mathsf{M}}, as well as a vector 𝑱sk{\bm{J}}^{\mathrm{sk}} at every iteration of BQRRP_GPU.
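A minimal sketch of the pointer-swapping idea for one iteration, specialized to a trailing column block (names are ours):

    def permute_trailing_and_swap(M, M_spare, J_sk, s):
        # Gather the permuted trailing columns into the spare buffer, then
        # exchange the buffer roles instead of copying the result back.
        M_spare[:, s:] = M[:, s:][:, J_sk - 1]
        # Columns 0:s of the new "current" buffer are stale after the swap;
        # tracking which buffer holds which finalized columns is the parity
        # bookkeeping described below.
        return M_spare, M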

Remark 9.

To avoid overcomplicating the notation, the description below does not account for cases where nn is not evenly divisible by bb.

Since the full 𝗠sk\bm{\mathsf{M}}^{\mathrm{sk}} is not used outside of BQRRP_GPU, the pointer-swapping strategy can be used there without any memory safety concerns. This is, however, not the case with 𝗠\bm{\mathsf{M}} and 𝑱sk{\bm{J}}^{\mathrm{sk}}, as both of their entire memory spaces must hold correct data on output from BQRRP_GPU. To illustrate the complication with the pointer-swapping strategy, suppose the main loop in BQRRP_GPU starts executing with i=0i=0, where index 0 is considered “even.” Then, at the end of the 0th iteration, the space of 𝗠cpy\bm{\mathsf{M}}^{\mathrm{cpy}} will contain the correct data to be a part of the output of BQRRP_GPU, since the pointer swapping took place. At iteration 1, the space 𝗠cpy\bm{\mathsf{M}}^{\mathrm{cpy}} will contain the first bb properly-computed columns; the rest of the correct entries will be contained in 𝗠(:,b:)\bm{\mathsf{M}}(:,{b}{:}). Continuing with this logic, we conclude that when BQRRP_GPU terminates, the space 𝗠cpy\bm{\mathsf{M}}^{\mathrm{cpy}} will contain the “correct” entries in 𝗠cpy(:,ib:(i+1)b)\bm{\mathsf{M}}^{\mathrm{cpy}}(:,{ib}{:}(i+1)b) for all even in/bi\leq n/b, and the space 𝗠\bm{\mathsf{M}} will contain the correct entries in 𝗠(:,ib:(i+1)b)\bm{\mathsf{M}}(:,{ib}{:}(i+1)b) for all odd ii. If BQRRP_GPU terminated at an even iteration, the pointers to 𝗠\bm{\mathsf{M}} and 𝗠cpy\bm{\mathsf{M}}^{\mathrm{cpy}} need to be swapped back. Regardless of the final parity of the loop index, the “correct” columns need to be copied from 𝗠cpy\bm{\mathsf{M}}^{\mathrm{cpy}} to 𝗠\bm{\mathsf{M}}. Note that if BQRRP_GPU terminates early at an even iteration, the trailing columns in the range (i+1)b(i+1)b to nn in 𝗠cpy\bm{\mathsf{M}}^{\mathrm{cpy}} also need to be copied into 𝗠\bm{\mathsf{M}}.

As for the vector 𝑱\bm{J}, since it is not permuted at iteration 0, the space of 𝑱cpy\bm{J}^{\mathrm{cpy}} will contain the “correct” entries in 𝑱cpy(ib:(i+1)b)\bm{J}^{\mathrm{cpy}}({ib}{:}(i+1)b) for all odd in/bi\leq n/b, while the space 𝑱\bm{J} will contain the correct entries in 𝑱(ib:(i+1)b)\bm{J}({ib}{:}(i+1)b) for all even ii when BQRRP_GPU terminates. In all other respects, the vector 𝑱\bm{J} is processed similarly to the matrix 𝗠\bm{\mathsf{M}}.

4.3 The view of a practical BQRRP_GPU

Considering the constraints of designing a GPU version of BQRRP described in Sections 4.1 and 4.2, we adjust our view of the BQRRP_GPU algorithms from the two implicit CPU versions of BQRRP described in the beginning of Section 3.

All in all, we stick to the vision of BQRRP_GPU being represented by two implicit algorithms (defined by the user choices during algorithm configuration), where one version relies on Cholesky QR and its dependencies, and the other version relies on Householder QR in step 20. Cholesky QR is paired with the sequential approach to ORHR_COL, as we decided that implementing a high-performance GPU version of ORHR_COL is beyond the scope of this work. In steps 24 and 27, we use cuSOLVER’s ORMQR function. Furthermore, we decided to sacrifice storage for performance by using Algorithm 5 (and the pointer-swapping logic) for column permutation in our implementation of BQRRP_GPU.

At the time of writing, the current version 1.0.1 of RandBLAS does not offer GPU support. Because of that, steps 5 and 7 are performed outside of BQRRP_GPU, and 𝗠sk\bm{\mathsf{M}}^{\mathrm{sk}} is provided as an input into the algorithm.

5 Performance profiling of major computational kernels

Having established in Sections 3 and 4 how the two views of BQRRP_CPU and BQRRP_GPU are constructed, it is important to analyze how different parts of these algorithms affect the overall algorithm performance, in order to identify the bottlenecks for future optimization. In this section, we omit the results for matrices with dimensions that are multiples of ten, as their performance profiling plots are nearly identical to those for matrices with dimensions that are powers of two.

CPU performance breakdown.

We first present in Figure 6 the algorithm runtime breakdown results for the two CPU versions of BQRRP. We depict the percentage of runtime that is occupied by a given component kernel of the two versions of BQRRP_CPU on the y-axis. We use square test matrices with 65,536 rows and columns and the block size parameter b varying as powers of two from 256 to 2048 (x-axis).

[Figure: fig_6_bqrrp_runtime_breakdown_cpu — panels: Intel CPU (left), AMD CPU (right); rows: BQRRP_CQR_CPU runtime (%), BQRRP_HQR_CPU runtime (%); x-axis: b]
Figure 6: Percentages of BQRRP_CPU runtime, occupied by its respective subroutines. The top row represents BQRRP_CPU with Cholesky QR on a panel, and separately shows the percentage of runtime occupied by the preconditioned Cholesky QR and Householder restoration. The bottom row represents BQRRP_CPU with Householder QR on a panel. The results are captured on an Intel CPU (left) and an AMD CPU (right) (see Table 1). Observe that in all plots, apply_trans_q (Section 2.4) is among the most costly subroutines.

An immediate observation from Figure 6 is that the apply_trans_q function is the major bottleneck in three out of the four BQRRP configurations that we consider, despite the fact that we used the best available function for this step (per our results in Section 2.4). We also observe that col_perm_sequential and ORHR_COL are proportionally much slower on the AMD system than on the Intel system. It is plausible that the slowdown in the former function is more visible on the AMD system given that it has up to 448 threads available, while the only parallelism within col_perm_sequential stems from the BLAS SWAP function. Since MKL is used on the AMD system (as noted in Section 1.4), the underlying BLAS functions may not be fully optimized for AMD hardware, which could result in suboptimal performance. We do not have an explanation for the slow performance of ORHR_COL on the AMD system.

Remark 10.

The results in Figure 6 raise the question: could one have predicted this plot based solely on the operation counts of the core subroutines in BQRRP? The short answer is no, as evidenced by the expense of column permutation, which requires a total of O(mn)O(mn) operations across the entire algorithm. Furthermore, since BQRRP is a blocked algorithm, the operation count for each subroutine varies across iterations; adding these operation counts would lead to complex expressions that obscure more than they clarify.

GPU performance breakdown.

In Figure 7, we present the runtime breakdown of the two implicit versions of BQRRP_GPU to identify any bottlenecks in their respective subroutines. Any given implementation of BQRRP_GPU uses a single GPU stream, and hence timing its computational kernels does not involve any complications. We depict the percentage of runtime that is occupied by a given subcomponent of BQRRP_GPU on the y-axis. We use a square test matrix with 32,768 rows and columns and the block size parameter b varying as powers of two from 32 to 2048 (x-axis).

[Figure: fig_7_bqrrp_runtime_breakdown_gpu — rows: BQRRP_CQR_GPU runtime (%), BQRRP_HQR_GPU runtime (%); x-axis: b]
Figure 7: Percentages of the BQRRP_GPU runtime, occupied by its respective subroutines. The top row represents BQRRP_GPU with Cholesky QR on a panel, and separately shows the percentage of runtime occupied by the preconditioned Cholesky QR and Householder restoration. The bottom row represents BQRRP_GPU with Householder QR on a panel. The results are captured on an NVIDIA H100 GPU (see Table 2). In the algorithm with Cholesky QR on a panel, observe that our naive approach for ORHR_COL_GPU appears to be too costly. In the algorithm with Householder QR on a panel, observe that the main bottleneck is the apply_trans_q function, similar to the CPU results.

The glaring issue with the top plot emerges as the block size increases: our simple implementation of ORHR_COL comes to dominate the runtime. This suggests that the naive approach to ORHR_COL is not viable in practice. As such, the version of BQRRP_GPU with Cholesky QR on a panel is expected to have worse overall performance than the one with Householder QR. As we can see from the bottom plot in Figure 7, whenever the naive implementation of ORHR_COL is not an issue, the main algorithm bottleneck is ORMQR, just like in the CPU version of BQRRP. Both plots in Figure 7 show that when smaller block sizes are in use, permuting the columns of the matrix 𝗠\bm{\mathsf{M}} is rather costly.

6 Pivot quality on the Kahan matrix

This section gives experimental comparisons of the pivot quality attained by LAPACK’s standard QRCP subroutine GEQP3 and by BQRRP, configured to use Algorithm 2 in step 12. For a QR factorization with column pivoting, pivot quality directly determines the reconstruction quality of the factors that a given algorithm produces.

We use two pivot quality metrics, following our prior work [MBM2024, Section 4]. The first is the Frobenius norm of the trailing submatrices of the output 𝗥\bm{\mathsf{R}}-factor, 𝗥(i:,i:)\bm{\mathsf{R}}({i}{:},{i}{:}) for 0i<n0\leq i<n. This has the natural interpretation as the norm of the residual from a rank-ii approximation of 𝗠\bm{\mathsf{M}} as 𝗤(:,0:i)𝗥(0:i,:)\bm{\mathsf{Q}}(\hskip 1.25pt{}{:}\hskip 0.24994pt{}{},{0}{:}{i})\bm{\mathsf{R}}({0}{:}{i},\hskip 1.25pt{}{:}\hskip 0.24994pt{}{}). We plot this metric as the ratios

𝗥geqp3(i:,i:)F/𝗥bqrrp(i:,i:)F.\|\bm{\mathsf{R}}^{\text{geqp3}}({i}{:},{i}{:})\|_{\mathrm{F}}/\|\bm{\mathsf{R}}^{\text{bqrrp}}({i}{:},{i}{:})\|_{\mathrm{F}}.
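These trailing norms can be accumulated in a single backward pass; a NumPy sketch for a square upper-triangular R (trailing_frobenius_norms is our name):

    import numpy as np

    def trailing_frobenius_norms(R):
        # ||R[i:, i:]||_F for all i. Since R is upper triangular, moving from
        # i to i+1 only discards the entries R[i, i:], so the squared norms
        # form a reverse cumulative sum of per-row tail norms.
        n = R.shape[0]
        row_tail_sq = np.array([np.dot(R[i, i:], R[i, i:]) for i in range(n)])
        return np.sqrt(np.cumsum(row_tail_sq[::-1])[::-1])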

The second pivot quality metric involves the ratios of |𝗥(i,i)||\bm{\mathsf{R}}(i,i)| to the singular values of 𝗠\bm{\mathsf{M}}. If 𝗥\bm{\mathsf{R}} comes from GEQP3, this ratio can be quite bad in the worst case: letting σi\sigma_{i} denote the ithi^{\text{th}} singular value of 𝗠\bm{\mathsf{M}}, theory only guarantees that |𝗥(i,i)|/σi|\bm{\mathsf{R}}(i,i)|/\sigma_{i} lies between (n(n+1)/2)1/2(n(n+1)/2)^{-1/2} and 2n12^{n-1} [Higham:blog:rrf]. Since there is a chance of large deviations, we plot |𝗥(i,i)|/σi|\bm{\mathsf{R}}(i,i)|/\sigma_{i} for BQRRP and GEQP3 separately (rather than plotting the ratio |𝗥geqp3(i,i)|/|𝗥bqrrp(i,i)||\bm{\mathsf{R}}^{\text{geqp3}}(i,i)|/|\bm{\mathsf{R}}^{\text{bqrrp}}(i,i)|).

The quality of pivots produced by BQRRP naturally depends on the type of input matrix. Rather than exhaustively testing BQRRP on the various matrices commonly used in QRCP verification schemes, we focused directly on a challenging case: the Kahan matrix, which is notoriously difficult for QRCP to handle. The Kahan matrix is known for having small differences in column norms, which can cause pivoting to fail and result in inaccurate factorizations. Following the description in [kahan_matrix_generator], we parameterize the Kahan matrix by n,p,θn,p,\theta, taking α=sin(θ)\alpha=\sin(\theta), β=cos(θ)\beta=-\cos(\theta), and

\bm{\mathsf{M}}=\underbrace{\begin{bmatrix}1&&&\\ &\alpha^{1}&&\\ &&\ddots&\\ &&&\alpha^{n-1}\end{bmatrix}}_{\text{diagonal}}\underbrace{\begin{bmatrix}\beta&1&\cdots&1\\ &\beta&\ddots&\vdots\\ &&\ddots&1\\ &&&\beta\end{bmatrix}}_{\text{upper-triangular}}+~\epsilon_{\mathrm{mach}}\cdot p\underbrace{\begin{bmatrix}n&&&\\ &n-1&&\\ &&\ddots&\\ &&&1\end{bmatrix}}_{\text{diagonal}} (1)
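For reference, a small NumPy generator for this construction (kahan_matrix is our name; ϵ_mach is taken as the double-precision machine epsilon):

    import numpy as np

    def kahan_matrix(n, p=1000.0, theta=1.2):
        # Kahan matrix per Eq. (1): a scaled unit upper-triangular factor plus
        # a tiny diagonal perturbation that avoids exact rank deficiency.
        alpha, beta = np.sin(theta), -np.cos(theta)
        D = np.diag(alpha ** np.arange(n))                   # diag(1, alpha, ..., alpha^(n-1))
        U = beta * np.eye(n) + np.triu(np.ones((n, n)), 1)   # beta on the diagonal, ones above
        E = np.finfo(float).eps * p * np.diag(np.arange(n, 0, -1.0))
        return D @ U + E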

Figure 8 shows the spectrum of a Kahan matrix (with default choices for the tuning parameter values), obtained with the SVD function GESVD. Figure 9 compares BQRRP pivot quality against GEQP3 pivot quality in the two aforementioned metrics. The experiment was conducted using two different BQRRP block sizes.

[Figure: fig_8_spectrum_of_kahan_matrix — y-axis: σ_i; x-axis: i]
Figure 8: Spectrum of a Kahan matrix of order 16,384, generated as described in Eq. 1, using p=1000 and θ=1.2. The spectrum was obtained using LAPACK’s most accurate SVD function, GESVD. Note that the trailing singular values fall below the double-precision machine ϵ\epsilon.
Figure 9: Pivot quality results for BQRRP with block sizes 64 and 4096 show that block size has limited impact on pivot quality. The residual-norm ratio is generally similar to that of GEQP3, diverging mainly near sharp singular value drops, especially for b=4096. The second metric shows near-identical behavior, except at the final singular value.
[Figure: fig_9_bqrrp_pivot_quality — columns: b=64, b=4096; rows: ‖R^geqp3(i:,i:)‖_F / ‖R^bqrrp(i:,i:)‖_F and |R(i,i)|/σ_i, each plotted against i]

7 Performance results

The experiments that we conduct in this section compare the performance of the following QR and QRCP algorithms:

  • BQRRP_CQR – a version of BQRRP_CPU that uses Cholesky QR (and its dependencies) on a panel, ORMQR for the updating step, and Algorithm 2 in Step 12.

  • BQRRP_HQR – a version of BQRRP_CPU that uses Householder QR on a panel, ORMQR for the updating step, and Algorithm 2 as a black-box qrcp_wide.

  • GEQRF – standard unpivoted Householder QR.

This section concentrates on square matrices; tall and wide experiments are found in Section C.2. The sampling factor, γ\gamma, is set to the default value of 1.0 in all experiments (since this is the only reasonable choice if Algorithm 2 is in use at step 12). In all experiments, the performance is measured via canonical FLOP rate, relying on the FLOP count of GEQRF.

7.1 CPU algorithms performance

In addition to the algorithms listed above, the CPU experiments also involve benchmarking GEQP3 – standard pivoted QR – and HQRRP – the randomized pivoted QR algorithm from [MOHvdG:2017:QR]. We ported a LAPACK-compatible implementation [hqrrp_lapack_sources] of HQRRP into RandLAPACK (found in /RandLAPACK/drivers/rl_hqrrp.hh) for easier benchmarking.

The first set of experimental results is shown in Figure 10. Algorithms were run on m1,2×m1,2m_{1,2}\times m_{1,2} matrices using block sizes b1,2b_{1,2}, with m1 = 65,536 (Figure 10, row one) and m2 = 64,000 (Figure 10, row two), and with the block size in BQRRP_CPU and HQRRP varying as b1 = 256·{1, 2, 4, …, 32} and b2 = 250·{1, 2, 4, …, 32}.

We set the HQRRP block sizes to the same values as BQRRP block sizes, given that the internal logic of HQRRP is extremely similar to that of BQRRP. This allows us to explore how similar block size decisions impact both algorithms. A detailed exploration of the optimal HQRRP block size is provided in Section A.2.

[Uncaptioned image]

Figure 10: Performance of various QR and QRCP methods, captured on the Intel and AMD systems (see Table 1). The execution of GEQRF and GEQP3 does not depend on the block size parameter bb, and hence their performance is depicted as constant. Observe that on the Intel system, both versions of BQRRP_CPU exhibit comparable performance. On the AMD system, BQRRP_CPU with Householder QR is the fastest method.

Figure 10 shows that on the Intel system, BQRRP_CPU with Cholesky QR on a panel has near-identical performance to that with Householder QR on a panel. BQRRP_CQR and BQRRP_HQR are up to 19× faster than the standard pivoted QR, GEQP3 (as seen in the bottom left plot in Figure 10), and they achieve up to 60% of the performance of the unpivoted QR, GEQRF (as seen in the top left plot in Figure 10). Furthermore, the BQRRP_CPU algorithms are 7×–20× faster than HQRRP, depending on the block size used.

On the AMD system, BQRRP_CPU with Householder QR on a panel is the fastest QRCP method. BQRRP_HQR is up to 148× faster than the standard pivoted QR, GEQP3 (as seen in the bottom right plot in Figure 10), and it achieves up to 35% of the performance of the unpivoted QR, GEQRF (as seen in the top right plot in Figure 10). Furthermore, it is 3×–148× faster than HQRRP, depending on the block size used.

Thread scaling results.

Figure 11 depicts thread scaling results for the QR and QRCP schemes run on m×mm\times m matrices with m ∈ {8,000, 16,000, 32,000}. In these plots, we set the block size parameter in BQRRP_CQR and BQRRP_HQR to b = m/32. This is because Figure 10 shows that this block size setting generally yields the best BQRRP performance across all experiments.

[Uncaptioned image]

Figure 11: Effects of varying the number of threads on the FLOP rates of QR and QRCP methods, captured on the Intel and AMD systems (see Table 1). In these experiments, m×mm\times m matrices are used, where m ∈ {8,000, 16,000, 32,000}; the HQRRP, BQRRP_CQR, and BQRRP_HQR block size is set to b = m/32. Observe that both versions of BQRRP_CPU exhibit excellent thread scaling, on par with GEQRF, for the larger matrix sizes. Simultaneously, both HQRRP and GEQP3 exhibit poor thread scaling (especially on the AMD system).

Figure 11 shows that BQRRP exhibits good thread scaling. For a 32,000×32,000 matrix, BQRRP’s performance improves by a factor of 20× on the Intel system and 37× on the AMD system when scaling from 1 to 128 threads. In comparison, GEQP3 achieves only a 6× speedup on Intel and exhibits negligible thread scaling on AMD.

The standout performer is, as expected, GEQRF, with speedups of 43× on Intel and 99× on AMD. Observe also that the performance of GEQRF on the AMD system begins to exceed that on the Intel system at 128 threads for the matrix of size 32,000×32,000. This is in line with what we saw in Figure 10, where GEQRF and BQRRP showed much better overall performance on the AMD system.

We also observe that the performance of HQRRP begins to stagnate at 64 threads on the Intel system. We present a thorough investigation of HQRRP thread scaling for various block sizes in Section A.2.

In the bottom-left plot of Figure 11, the performance of BQRRP approaches that of GEQP3 for a matrix of size 8,000×8,000, with only a 1.5× difference. This suggests that a BQRRP block size of m/32 may be suboptimal for smaller input matrices. We further explore the performance of the discussed QR and QRCP schemes on smaller matrices in Section C.1.

7.2 Performance of GPU Implementations

As noted in Section 4, a major challenge in designing GPU versions of algorithms is the lack of GPU implementations for many LAPACK-level functions. Unlike in the CPU experiments, the only readily available GPU algorithm to compare with BQRRP_GPU is cuSOLVER’s GEQRF (a GPU QRCP exists in MAGMA [magma_zgeqp3_gpu], [tomov2010magma], but is not widely used). We therefore compare cuSOLVER’s GEQRF with two BQRRP_GPU variants: one using Cholesky QR and another using Householder QR for the panel factorization. Experiments are run on m×mm\times m matrices with m ∈ 256·{8, 16, 32, …, 128} (Figure 12), and block sizes from 32 to 2048.

[Figure: fig_12_bqrrp_performance_varying_block_size_gpu — panels: m=2,048, m=4,096, m=8,192, m=16,384, m=32,768; y-axis: TeraFLOP/s; x-axis: b]
Figure 12: Performance of standard QR and two versions of BQRRP_GPU, captured on an NVIDIA GPU (for the details on system configuration, refer to Table 2). cuSOLVER’s GEQRF performance is constant, as it is independent of the block size bb. BQRRP with Householder QR outperforms its Cholesky-based variant, aligning with our profiling results in Figure 7. The relative performance of both BQRRP_GPU versions to cuSOLVER remains stable across input sizes. Notably, smaller matrices reach peak performance with smaller block sizes.

Figure 12 shows that the implementation of BQRRP_GPU with Householder QR on a panel is able to achieve up to 60% of the performance of the unpivoted QR method offered by cuSOLVER. The performance of BQRRP_GPU relying on Cholesky QR is far from acceptable because a slow implementation of ORHR_COL was used. Figure 12 also shows that the relative performance of the two BQRRP_GPU schemes to cuSOLVER’s GEQRF remains stable as the input matrix size changes.

8 Conclusion

We have introduced BQRRP, a powerful algorithmic framework for QR factorization with column pivoting (QRCP) for general matrices. The framework enables the design of practical QRCP algorithms by allowing users to control key subroutine choices. We provide a detailed analysis of how these choices can be navigated for modern hardware, presenting formulations of BQRRP_CPU and BQRRP_GPU. The CPU version is designed for maximally in-place computation, while the GPU version prioritizes performance at the cost of additional storage. Both implementations produce output in the same format as GEQP3.

Looking ahead, the relative performance of core subroutines within the BQRRP framework may evolve due to advancements in computational techniques. Nevertheless, our work remains relevant, as it offers a plug-and-play algorithmic framework and a structured approach for analyzing the efficiency of individual subroutines in modern QRCP methods.

Our RandLAPACK implementation of BQRRP_CPU achieves up to 140× the speed of GEQP3 and up to 60% of the performance of GEQRF. Similarly, BQRRP_GPU, also implemented in RandLAPACK, reaches up to 60% of GEQRF performance. Given these results and the flexibility of the BQRRP framework, it presents a strong case for inclusion as an alternative to GEQP3 in LAPACK.

Our CPU benchmarks were conducted on Intel Sapphire Rapids and AMD Zen4c, while GPU results were obtained using an NVIDIA H100. The core subroutine choices were tuned to optimize performance on these architectures. For large matrices, we recommend setting the BQRRP block size to 1/32 of the number of columns, while for smaller matrices, it should be set to its maximum feasible value. While this tuning strategy is not rigorously derived, it serves as a general guideline. For more specialized cases, where computations are performed on specific hardware with vendor-optimized libraries, subroutine choices within BQRRP should be re-evaluated accordingly.

This adaptability opens the door for future research and practical implementations by engineers looking to integrate BQRRP into their software. In particular, it would be valuable to assess BQRRP’s performance on ARM-based architectures and consumer-grade hardware such as Apple M-series silicon. Additionally, broader GPU testing is of interest, although current evaluations are limited by RandLAPACK’s CUDA-specific GPU support.

Acknowledgements

This work was partially funded by an NSF Collaborative Research Framework: Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC), a project of the International Computer Science Institute, the University of Tennessee’s ICL, the University of California at Berkeley, and the University of Colorado at Denver (NSF Grant Nos. 2004235, 2004541, 2004763, 2004850, respectively). MWM would also like to acknowledge the NSF, DOE, and ONR Basic Research Challenge on RLA for providing partial support for this work.

RM was partially supported by Laboratory Directed Research and Development (LDRD) funding from Sandia National Laboratories; Sandia is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly-owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

PL was supported in part by the Department of the Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000.

The views and conclusions contained in this document are those of the authors and should not be interpreted as presenting the official policies, either expressed or implied, of the Department of the Air Force, the Department of Energy, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References


Appendix A Investigating HQRRP performance

Recall from Section 1.1 that we discussed prior work on developing a modern QRCP approach, highlighting HQRRP [MOHvdG:2017:QR] as a particularly promising scheme. Nevertheless, comparing the results in Figure 1 with those reported in [MOHvdG:2017:QR, Fig. 1] and [MOHvdG:2017:QR, Fig. 5], we observe that the performance gap between GEQRF and GEQP3 has increased from under 10× to around 100× on a modern AMD CPU, while the speedup of HQRRP over GEQP3 has not scaled proportionately, remaining at approximately 13× at best. Moreover, from column one in Figure 1, we see that the performance of HQRRP may fall below that of GEQP3 as the number of OpenMP threads in use increases.

This overall difference in performance is, of course, largely attributable to the drastic differences in hardware platforms: our experiments were conducted on modern CPUs, described in Table 1, whereas the experiments from [MOHvdG:2017:QR] were performed on an Intel Xeon E5-2695 v3 (the Haswell platform) processor featuring “only” 14 cores in a single socket, a platform that is now over a decade old. Nonetheless, to thoroughly assess the HQRRP algorithm, we investigate how tuning its block size parameter affects performance on modern systems (although [MOHvdG:2017:QR, Sec. 4.1] suggests that a block size of 64 or 128 should yield near-best performance, this may not hold true on our hardware systems). Furthermore, we perform subroutine performance profiling (similar to Section 5) on HQRRP to identify any computational bottlenecks. Finally, it is worth analyzing the performance of each individual algorithm involved in Figure 1 in order to detect any performance instabilities of each given scheme (of particular interest is the relationship between the performance of GEQRF and GEQP3 on the Intel system). The following subsections address each of these tasks.

A.1 Alternative view of Figure 1

[Uncaptioned image]

Figure 13: Performance of HQRRP (first row), GEQRF (second row), and GEQP3 (third row), attained on matrices of sizes between 1,000-by-1,000 and 10,000-by-10,000, when varying the number of OpenMP threads used. Performance is measured in terms of the canonical FLOP rate, relying on the FLOP count of the standard LAPACK QR function (GEQRF). Results were captured on the machines described in Table 1. Column three depicts results obtained on an AMD CPU using AOCL 5.0.0 (as opposed to MKL 2025.0, used elsewhere).

Figure 13 presents the performance results of running HQRRP, GEQRF, and GEQP3, measured in terms of the canonical FLOP rate, as opposed to relative speedup (Figure 1). As such, we are able to assess performance instabilities of each individual algorithm. We observe that the algorithms perform rather unstably on the Intel CPU across the board, even though, as stated in Section 1.4, we perform 20 runs of each algorithm per given matrix size to address potential instabilities. Of particular interest is the fact that increasing the number of OpenMP threads only occasionally has a positive effect on the performance of HQRRP on Intel hardware. Additionally, it is worth noting that the performance of GEQP3 on the Intel CPU is far superior to that on the AMD CPU, regardless of which vendor library the algorithm is sourced from.

Additional observations can be made regarding the relative performance of algorithms run on AMD hardware when using AOCL versus MKL. As seen from the two rightmost columns in Figure 13, while HQRRP performs roughly the same at its peak, the performance of GEQRF and GEQP3 sourced from MKL is far superior to that from AOCL. This observation is the basis for our decision to report AMD system results using MKL instead of AOCL.

A.2 Varying HQRRP block size

As stated previously, in the experiments presented in Figure 1 and Figure 13, the HQRRP block size parameter is set to 128, per the suggestion in [MOHvdG:2017:QR, Sec. 4.1]. Figure 14 shows how varying the block size in HQRRP affects its performance for various numbers of OpenMP threads.

[Uncaptioned image]

Figure 14: Performance comparison of HQRRP, GEQRF, and GEQP3, captured on the Intel and AMD systems (see Table 1). The experiments are conducted on matrices with 32,000 rows and columns and the HQRRP block size parameter b ∈ {5, 10, 25, 50, 125, 250, 500, 1000, 2000, 4000, 8000} (x-axis).

Results in Figure 14 show that the performance of HQRRP indeed peaks at around block sizes 64–128, as [MOHvdG:2017:QR] suggests. However, comparing Figure 11 with Figure 14, we see that even at its best, HQRRP does not reach the performance of either version of BQRRP (except, of course, in the single-threaded case). Furthermore, as seen previously, the performance of HQRRP begins to stagnate at 64 OpenMP threads, suggesting that some suboptimal (possibly Level 2 BLAS) subroutines may have been used in HQRRP.

A.3 HQRRP subroutines profiling

In an effort to better understand the performance gap between BQRRP and HQRRP seen in Figure 11 and Figure 14 (as well as the reason for poor thread scaling in HQRRP), we present the subroutine performance breakdown of HQRRP below (similar to how we did it for BQRRP in Section 5). Figure 15 depicts the percentage of runtime that is occupied by a given subcomponent of HQRRP on the y-axis. We use square test matrices with 32,000 rows and columns and the block size parameter b ∈ {5, 10, 25, 50, 125, 250, 500, 1000, 2000, 4000, 8000} (x-axis).

[Figure: fig_15_hqrrp_runtime_breakdown — panels: Intel CPU, AMD CPU; rows: Threads = 4, 16, 64, 128; x-axis: b]
Figure 15: Percentages of HQRRP runtime, occupied by its respective subroutines. Experminets were conducted on square matrices of size 32,000×32,00032{,}000\times 32{,}000 with the HQRRP block size bb taking values {5,10,25,50,125,250,500,1000,2000,4000,8000}\{5,~10,~25,~50,~125,~250,~500,~1000,~2000,~4000,~8000\}. The results are captured on Intel and AMD systems (see Table 1).

As seen in Figure 15, the function responsible for updating the input matrix tends to occupy most of the HQRRP runtime when smaller block sizes are in use (especially at lower numbers of OpenMP threads). In the LAPACK-compatible HQRRP implementation, this function is called “NoFLA_Apply_Q_WY_lhfc_blk_var4.” It relies on a single LAPACK subroutine, LARFB, which applies a block reflector to a given rectangular matrix. Other notable subroutines come from the QR and QRCP functions within HQRRP, which are performed by calling “NoFLA_QRPmod_WY_unb_var4” in unpivoted and pivoted modes, respectively. There, the most costly function is LARF, which applies a single reflector to a given rectangular matrix. Finally, the column permutation strategy in HQRRP, implemented in “NoFLA_QRP_pivot_G_B_C,” becomes costly when larger numbers of OpenMP threads are used (particularly at 64 threads, where the stagnation in HQRRP performance begins).

Overall, we conclude that the use of LARF within HQRRP is suboptimal, since this function operates on a single reflector at a time and is consequently cast largely in terms of Level-2 BLAS. Furthermore, depending on the system and the input problem, the ORMQR routine used in BQRRP may be superior to the LARFB routine used in HQRRP.
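
To make this distinction concrete, the following NumPy sketch (our own illustration, not code from HQRRP or RandLAPACK; the function names are hypothetical) contrasts the two application styles: applying $b$ Householder reflectors one at a time via rank-1 updates, as LARF does, versus accumulating a compact-WY representation and applying it with matrix-matrix products, in the manner of LARFT/LARFB.

```python
import numpy as np

def apply_reflectors_one_by_one(C, V, tau):
    # LARF-style: apply the reflectors H_j = I - tau_j v_j v_j^T to C
    # from the left, one at a time. Each iteration is a matrix-vector
    # product plus a rank-1 update, i.e., Level-2 BLAS-like work.
    C = C.copy()
    for j in range(V.shape[1]):
        v = V[:, j]
        C -= tau[j] * np.outer(v, v @ C)
    return C

def apply_reflectors_blocked(C, V, tau):
    # LARFB-style: build the triangular factor T of the compact-WY form
    # (as LARFT does), so that H_1 * ... * H_b = I - V T V^T, then apply
    # the product with a few matrix-matrix multiplications (Level-3 BLAS).
    b = V.shape[1]
    T = np.zeros((b, b))
    for j in range(b):
        T[j, j] = tau[j]
        if j > 0:
            T[:j, j] = -tau[j] * (T[:j, :j] @ (V[:, :j].T @ V[:, j]))
    return C - V @ (T.T @ (V.T @ C))
```

Both routines apply the same sequence of reflectors (their outputs agree up to roundoff, which can be checked with np.allclose), but only the blocked variant casts the dominant work as matrix-matrix products.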

Appendix B Cholesky QR background

This section provides background information on Cholesky QR that supports the discussion in Section 2.3.

Cholesky QR.

Given an $m \times n$ matrix $\bm{\mathsf{M}}$ with $m \geq n$, Cholesky QR computes the Gram matrix $\bm{\mathsf{G}} = \bm{\mathsf{M}}^{\text{\tiny{T}}}\bm{\mathsf{M}}$, factors $\bm{\mathsf{G}} = \bm{\mathsf{R}}^{\text{\tiny{T}}}\bm{\mathsf{R}}$ to obtain a non-singular upper-triangular matrix $\bm{\mathsf{R}}$, and then forms the orthonormal factor $\bm{\mathsf{Q}} = \bm{\mathsf{M}}\bm{\mathsf{R}}^{-1}$. Note that this procedure works only if $\bm{\mathsf{M}}$ has rank $n$. The FLOP count of Cholesky QR is close to that of standard LAPACK unpivoted QR, GEQRF, as long as $m \gg n$. In a practical implementation, Cholesky QR can significantly outperform GEQRF even when the input matrices are not conventionally considered “very tall,” with the ratio $m/n$ on the order of 10. Despite its simplicity and speed, Cholesky QR is rarely used in practice, as it fails to provide accurate output when the numerical rank of the matrix $\bm{\mathsf{G}}$ falls below $n$ (POTRF always outputs a factorization of the whole input or of a leading principal submatrix thereof). This phenomenon can be mitigated with a variety of preconditioning and truncation strategies.
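
For concreteness, here is a minimal NumPy/SciPy sketch of the procedure just described; it assumes the input has full column rank, in which case the Cholesky factorization succeeds.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def cholesky_qr(M):
    # Gram matrix; a production implementation would use a symmetric
    # rank-k update (SYRK) and reference only one triangle of G.
    G = M.T @ M
    # G = R^T R (POTRF); raises LinAlgError when G is not numerically
    # positive definite, i.e., when M is (nearly) rank-deficient.
    R = cholesky(G, lower=False)
    # Q = M R^{-1}, computed with a triangular solve (TRSM) rather than
    # by forming the inverse explicitly.
    Q = solve_triangular(R, M.T, trans='T').T
    return Q, R
```

Forming the Gram matrix squares the condition number, $\kappa(\bm{\mathsf{G}}) = \kappa(\bm{\mathsf{M}})^2$, which is why the plain method is fragile: $\bm{\mathsf{M}}$ need only be moderately ill-conditioned for POTRF to fail.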

Preconditioned Cholesky QR.

The use of Cholesky QR in the context of step 20 is motivated by the fact that a suitable preconditioner, in the form of ${\bm{\mathsf{R}}}^{\mathrm{sk}}(0{:}k, 0{:}k)$, is acquired “for free” in step 12 (as this step is necessary for acquiring the permutation vector). The preconditioning would be performed by applying $({\bm{\mathsf{R}}}^{\mathrm{sk}}(0{:}k, 0{:}k))^{-1}$ to a portion of the permuted matrix $\bm{\mathsf{M}}$ from the right. The effectiveness of preconditioning of this flavor has been thoroughly analyzed in Section 2.1 and Appendix A1 of [MBM2024]. The idea was first introduced by Fan, Guo, and Lin [FGL:2021:CholeskyQR] and studied in detail by others [Balabanov:2022:cholQR, HSBY:2023:rand_chol_qr]. In the context of Cholesky QR, the numerical rank computation (step 14, described in Section 2.2) is not strictly necessary; it is used solely to ensure that the preconditioning is performed safely (meaning that no infinite or not-a-number values appear when inverting ${\bm{\mathsf{R}}}^{\mathrm{sk}}(0{:}k, 0{:}k)$). As such, naive rank estimation suffices.
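
A hedged sketch of this preconditioned variant follows, reusing cholesky_qr from the sketch above. Here M_perm stands for the relevant portion of the permuted input and R_pre for the (assumed non-singular) upper-triangular preconditioner playing the role of ${\bm{\mathsf{R}}}^{\mathrm{sk}}(0{:}k, 0{:}k)$; both names are ours for illustration, not RandLAPACK identifiers.

```python
from scipy.linalg import solve_triangular

def preconditioned_cholesky_qr(M_perm, R_pre):
    # Right-precondition: M_pre = M_perm @ inv(R_pre), computed with a
    # triangular solve rather than an explicit inverse.
    M_pre = solve_triangular(R_pre, M_perm.T, trans='T').T
    # M_pre is close to orthonormal, so plain Cholesky QR is now safe.
    Q, R_tilde = cholesky_qr(M_pre)
    # Fold the preconditioner back in: M_perm = Q @ (R_tilde @ R_pre).
    return Q, R_tilde @ R_pre
```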

Appendix C Additional CPU performance experiments

C.1 Performance results on smaller inputs

As stated in Section 7.1, although the results in Figure 10 show that BQRRP performs best when the block size is set to $b = n/32$ (for an $m \times n$ input matrix), this may only hold for larger input matrix sizes (as seen in Figure 11, the performance of BQRRP gets suspiciously close to that of GEQRF when the input is of size 8,000 × 8,000). As such, we investigate which block size settings work best for smaller input matrices.

[Figure 16 image: Intel CPU (left column) and AMD CPU (right column); rows correspond to input sizes $m = 1000$, $2000$, and $4000$; the $y$-axes are in GigaFLOP/s, and the $x$-axes show the block size $\mathbf{b}$.]

Figure 16: Performance of various QR and QRCP methods, captured on Intel and AMD systems (see Table 1). The execution of the GEQRF and GEQP3 functions does not depend on the block size parameter $b$, and hence their performance is depicted as constant. This figure focuses on how varying the block size parameter in BQRRP affects its performance on smaller input matrices.

Figure 16 shows that, for all the smaller matrices tested on both systems, the best choice of block size is either $n/2$ or $n$ itself (recall that $m = n$ here). Figure 16 also suggests that BQRRP_HQR is always by far the better option over BQRRP_CQR, and that at no point does the performance of HQRRP exceed that of GEQP3.

C.2 Varying the aspect ratio in the test matrices

Section 1.4 briefly explains our choice of test matrix types and sizes. As noted there and in Section 7, we primarily use square matrices to evaluate QR and QRCP schemes. However, since BQRRP applies to any aspect ratio, we also present results for tall and wide matrices. We do not explore the extremely tall or wide cases, where specialized algorithms may be more suitable [MBM2024, Fukaya2024CholeskyQR, Armstrong2025].

[Figure 17 image: Intel CPU (left column) and AMD CPU (right column); the $y$-axes are in GigaFLOP/s, and the $x$-axes show $\mathbf{m}$ (tall case) or $\mathbf{n}$ (wide case).]

Figure 17: Performance of various QR and QRCP methods, captured on Intel and AMD systems (see Table 1). The first row depicts tall input matrices with $m = 2n$; the second row depicts wide input matrices with $n = 2m$. The BQRRP block size is set to $n/32$, and the HQRRP block size is set to 128.

Figure 17 shows the expected behavior: the performance of both BQRRP versions draws closer to that of GEQRF as the input matrix size increases. Additionally, as seen previously in Section C.1, we observe that the BQRRP block size $b = n/32$ is suboptimal for smaller matrices.