MEMQSim: Highly Memory-Efficient and Modularized Quantum State-Vector Simulation

Boyuan Zhang (bozhan@iu.edu), Indiana University, Bloomington, IN, USA; Bo Fang (bo.fang@pnnl.gov), Pacific Northwest National Laboratory, Richland, WA, USA; Qiang Guan (qguan@kent.edu), Kent State University, Kent, OH, USA; Ang Li (ang.li@pnnl.gov), Pacific Northwest National Laboratory, Richland, WA, USA; and Dingwen Tao (ditao@iu.edu), Indiana University, Bloomington, IN, USA

(2023)

Published in: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W 2023), November 12–17, 2023, Denver, CO, USA. DOI: 10.1145/3624062.3624257. ISBN: 979-8-4007-0785-8/23/11.

1. Introduction

The field of quantum computing has advanced markedly [1]. Despite its substantial potential, present-day quantum hardware suffers from considerable environmental noise, and the efficacy of quantum error correction in the Noisy Intermediate-Scale Quantum (NISQ) era remains limited [5]. Quantum circuit simulation is therefore an essential tool for researchers across disciplines, particularly for validating the correctness of quantum algorithms [3].

Nonetheless, simulating quantum circuits is challenging, primarily because memory usage grows exponentially with the number of qubits. For instance, the Frontier system, with a memory capacity of 47.3 PB, can simulate only 51 qubits, while Summit, with a memory capacity of 2.8 PB, can simulate only 47 qubits [6]. Because large-memory systems such as high-performance computing (HPC) clusters or cloud infrastructures are not readily available to most quantum computing practitioners, the ability to simulate quantum circuits is tightly restricted by the memory capacity of the available devices.
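The exponential growth described above can be checked with a few lines of arithmetic. The sketch below is only illustrative and rests on two assumptions not stated in the text: each amplitude is a double-precision complex number (16 bytes), and 1 PB is taken as 10^15 bytes. The function names are hypothetical.

```python
def state_vector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory needed to hold all 2**num_qubits amplitudes."""
    return (1 << num_qubits) * bytes_per_amplitude


def max_qubits(memory_bytes: int, bytes_per_amplitude: int = 16) -> int:
    """Largest qubit count whose full state vector fits in memory_bytes."""
    amplitudes = memory_bytes // bytes_per_amplitude
    if amplitudes < 1:
        raise ValueError("memory too small for a single amplitude")
    return amplitudes.bit_length() - 1  # floor(log2(amplitudes))


PB = 10**15  # assumption: decimal petabytes

summit_bytes = 28 * PB // 10     # 2.8 PB, as quoted in the text
frontier_bytes = 473 * PB // 10  # 47.3 PB, as quoted in the text

print("Summit:", max_qubits(summit_bytes), "qubits")
print("Frontier:", max_qubits(frontier_bytes), "qubits")
```

Under these assumptions each additional qubit doubles the required memory, so even a large jump in capacity buys only a handful of extra qubits.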

State-of-the-art quantum state-vector simulators [4, 8, 7, 2] have so far not emphasized minimizing the memory footprint during simulation. Prior work [6] did incorporate compression into state-vector simulation in order to fit more qubits within a limited memory space. This approach, while promising, leaves open problems that call for further research: (1) Compression and decompression occur very frequently and therefore account for a significant portion of the total simulation time. (2) It does not harness GPUs to accelerate the simulation; despite supporting a greater number of qubits, its low performance makes it impractical for wider use. (3) Data locality is insufficiently exploited, resulting in low cache hit rates; a more efficient memory access pattern is needed.

Figure 1. Modularized quantum circuit simulation overview.

In response to these challenges, we introduce MEMQSim, a state-vector quantum simulator that leverages data compression. A high-level conceptual overview of our approach is depicted in Figure 1. Our methodology employs data compression to achieve high memory efficiency, with careful design to keep the simulation efficient. MEMQSim is independent of both the composition of the quantum algorithm and the computational tasks of the simulation, which makes it modular enough to be incorporated into major simulation backends such as Qiskit, Cirq, SV-Sim [4], and UniQ [8]. We make the following contributions:

• We introduce a highly memory-efficient, modular framework. MEMQSim is designed to be integrated into state-of-the-art GPU simulator backends.
• We bring lossy compression to state-vector simulation to raise the number of qubits that can be simulated.
• MEMQSim effectively manages the massive data exchange between CPU and GPU. We build a prototype of MEMQSim and show preliminary results.

2. Design and Preliminary Result

In this section, we present the design of a memory-efficient state-vector simulator based on data compression. Our proposed system uses CPU memory to store the state vector and the GPU to execute the state-vector updates. We adopt this approach because user-level devices typically have more CPU memory than GPU memory. The implementation of this design is not straightforward, however, because of the complexity of data management and of compression/decompression during the simulation.

Design challenges: (1) The intensive data exchange between the CPU and GPU requires careful scheduling. Because GPU memory capacity is normally much smaller than CPU memory capacity, the GPU must retrieve data from the CPU and process one portion of the state vector at a time. (2) The frequency and granularity of compression and decompression significantly influence the simulation speed. Excessive compression/decompression adds substantial overhead to the end-to-end time; a coarse granularity inflates the memory footprint, while an excessively fine granularity lowers the compression ratio. (3) The behavior of different quantum algorithms affects the access pattern on the state vector.

In light of these challenges, we propose our design, MEMQSim. An overview of our approach is illustrated in Figure 1. We explain the overall simulation process below:

Offline stage: MEMQSim partitions the input circuit and the corresponding state vector. Each data chunk of the state vector is compressed independently and stored in CPU memory in compressed form.
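The chunking step of the offline stage can be sketched as follows. This is a minimal illustration, not the MEMQSim implementation: it uses the lossless `zlib` module from the Python standard library as a stand-in for the lossy compressor, and the function names and the chunk length of 256 amplitudes are hypothetical choices.

```python
import struct
import zlib


def compress_chunks(amplitudes, chunk_len):
    """Split the state vector into fixed-size chunks and compress each independently."""
    if len(amplitudes) % chunk_len:
        raise ValueError("chunk_len must divide the state-vector length")
    chunks = []
    for start in range(0, len(amplitudes), chunk_len):
        block = amplitudes[start:start + chunk_len]
        # Serialize as interleaved (real, imag) little-endian doubles.
        flat = [part for a in block for part in (a.real, a.imag)]
        raw = struct.pack(f"<{len(flat)}d", *flat)
        chunks.append(zlib.compress(raw))  # stand-in for a lossy compressor
    return chunks


def decompress_chunk(blob):
    """Recover the complex amplitudes of one chunk."""
    raw = zlib.decompress(blob)
    flat = struct.unpack(f"<{len(raw) // 8}d", raw)
    return [complex(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


# Initial state |0...0> on 10 qubits: one amplitude is 1, the rest are 0.
num_qubits = 10
state = [0j] * (1 << num_qubits)
state[0] = 1 + 0j

chunks = compress_chunks(state, chunk_len=256)
compressed_bytes = sum(len(c) for c in chunks)
print(len(chunks), "chunks,", compressed_bytes, "bytes compressed vs", len(state) * 16, "raw")
```

Because each chunk is compressed on its own, any single chunk can later be decompressed without touching the others, which is what makes chunk-wise streaming to the GPU possible.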

Online stage: As shown in Figure 1, MEMQSim pipelines decompression, buffer transfer between CPU and GPU, and GPU computation. In particular, MEMQSim (1) decompresses a selection of data chunks into the CPU buffers and (2) transfers the corresponding state-vector amplitudes to GPU memory. This process is repeated across the state vector until the GPU memory is fully occupied with ordered state-vector amplitudes. (3) MEMQSim launches the GPU kernel asynchronously to update the state-vector amplitudes during the CPU-GPU data transfer and (4) returns the updated values to the CPU buffers. (5) The CPU then uses idle cores to decompress data chunks and update state-vector amplitudes on the CPU side. (6) Finally, the data block is re-compressed and stored back into main memory. Once the GPU completes a single iteration, this procedure is repeated until all amplitudes have been updated. The process then advances to the next stage and continues until all stages have been processed. We have developed a prototype of MEMQSim, plugged into the SV-Sim [4] framework. Going forward, our design has the potential to serve as a plugin for a range of GPU simulators and to accommodate various compression algorithms.
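The decompress, update, and re-compress cycle of the online stage can be sketched on the CPU alone. The example below is a simplified illustration under several assumptions: `zlib` stands in for the lossy compressor, a plain Python loop stands in for the GPU kernel and the CPU-GPU transfers, qubit 0 is the least significant index bit, and the target qubit is low enough that every amplitude pair lies inside one chunk. All function names are hypothetical.

```python
import math
import struct
import zlib


def pack(amps):
    """Serialize complex amplitudes as interleaved little-endian doubles."""
    flat = [part for a in amps for part in (a.real, a.imag)]
    return struct.pack(f"<{len(flat)}d", *flat)


def unpack(raw):
    """Inverse of pack()."""
    flat = struct.unpack(f"<{len(raw) // 8}d", raw)
    return [complex(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


def apply_gate_in_chunk(amps, gate, target):
    """Apply a 2x2 gate to `target`; stand-in for the GPU kernel."""
    stride = 1 << target
    if 2 * stride > len(amps):
        raise ValueError("target qubit pairs amplitudes across chunks")
    for base in range(0, len(amps), 2 * stride):
        for i in range(base, base + stride):
            a0, a1 = amps[i], amps[i + stride]
            amps[i] = gate[0][0] * a0 + gate[0][1] * a1
            amps[i + stride] = gate[1][0] * a0 + gate[1][1] * a1


def update_compressed_state(chunks, gate, target):
    """Decompress each chunk, update it, and store it compressed again."""
    updated = []
    for blob in chunks:
        amps = unpack(zlib.decompress(blob))       # decompress into a buffer
        apply_gate_in_chunk(amps, gate, target)    # update the amplitudes
        updated.append(zlib.compress(pack(amps)))  # re-compress and store
    return updated


h = 1 / math.sqrt(2)
HADAMARD = [[h, h], [h, -h]]

# 3-qubit state |000>, stored as two compressed chunks of 4 amplitudes.
state = [0j] * 8
state[0] = 1 + 0j
chunks = [zlib.compress(pack(state[i:i + 4])) for i in (0, 4)]

chunks = update_compressed_state(chunks, HADAMARD, target=0)
result = [a for blob in chunks for a in unpack(zlib.decompress(blob))]
print(result[:2])
```

The sketch processes chunks strictly one after another; the pipelining of decompression, transfer, and computation described above is not modeled here.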

For step (2), we have devised two strategies. The first transfers the corresponding state-vector elements to GPU memory one at a time, using CUDA asynchronous copies. The second allocates a buffer on the GPU side and moves the whole data chunk from the CPU buffer to the GPU buffer; GPU threads then map the amplitudes to their appropriate positions.
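The difference between the two strategies can be illustrated by counting copy operations. The sketch below does not use CUDA; `MockDevice` is a hypothetical stand-in for GPU memory that only counts how many host-to-device copies are issued, and the destination indices are arbitrary example values.

```python
class MockDevice:
    """Hypothetical stand-in for GPU memory that counts host-to-device copies."""

    def __init__(self, size):
        self.memory = [0j] * size
        self.copy_calls = 0

    def copy_in(self, offset, values):
        """Model one host-to-device copy of a contiguous run of amplitudes."""
        self.memory[offset:offset + len(values)] = values
        self.copy_calls += 1


def transfer_elementwise(device, chunk, dest_indices):
    """First strategy: one copy per amplitude, straight to its final position."""
    for value, dest in zip(chunk, dest_indices):
        device.copy_in(dest, [value])


def transfer_buffered(device, chunk, dest_indices, staging_offset):
    """Second strategy: one bulk copy into a staging buffer, then a mapping pass."""
    device.copy_in(staging_offset, chunk)
    # Mapping pass: done by GPU threads in the described design, a loop here.
    for i, dest in enumerate(dest_indices):
        device.memory[dest] = device.memory[staging_offset + i]


chunk = [1 + 0j, 2 + 0j, 3 + 0j, 4 + 0j]
dest_indices = [0, 2, 4, 6]  # example scattered positions in the state array
state_size = 8

dev_a = MockDevice(state_size)
transfer_elementwise(dev_a, chunk, dest_indices)

dev_b = MockDevice(state_size + len(chunk))  # extra room for the staging buffer
transfer_buffered(dev_b, chunk, dest_indices, staging_offset=state_size)

print("element-wise copies:", dev_a.copy_calls)
print("buffered copies:", dev_b.copy_calls)
```

Both strategies leave the same amplitudes in the same positions; the buffered one trades extra device memory for a single copy call.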

We present preliminary results on the time taken by the different data movement strategies between the CPU and GPU in Table 1. The synchronous strategy transfers a complete data chunk through a single CUDA memory copy operation and therefore represents the minimum time necessary for the transfer between the CPU and GPU. The host-to-device time of the asynchronous strategy is roughly 900 to 1000 times longer than the synchronous time. This discrepancy is caused by the many separate CUDA memory copy operations it launches, each of which incurs significant overhead. The buffer strategy, although it demands additional memory space, significantly speeds up data movement: in Table 1, its time is at most about 1.4x that of the synchronous version. By employing a state-of-the-art data compressor, we extrapolate that, on average, 5 more qubits can be simulated without slowing down the original quantum circuit simulation.

Table 1. Data transfer time H2D/D2H in seconds.

| Qubits | Sync copy time | Async copy time | Buffer copy time |
|--------|----------------|-----------------|------------------|
| 20     | 0.003/0.008    | 2.7/9.2         | 0.003/0.004      |
| 25     | 0.080/0.233    | 77.9/294.4      | 0.110/0.273      |
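The relative cost of each strategy can be derived directly from Table 1. The short script below copies the table values verbatim and divides each strategy's time by the synchronous baseline; the dictionary layout and the function name `slowdown` are illustrative choices.

```python
# (H2D, D2H) transfer times in seconds, copied from Table 1.
table = {
    20: {"sync": (0.003, 0.008), "async": (2.7, 9.2), "buffer": (0.003, 0.004)},
    25: {"sync": (0.080, 0.233), "async": (77.9, 294.4), "buffer": (0.110, 0.273)},
}


def slowdown(qubits, strategy):
    """Return (H2D, D2H) time of `strategy` relative to the sync copy."""
    sync = table[qubits]["sync"]
    other = table[qubits][strategy]
    return tuple(o / s for o, s in zip(other, sync))


for qubits in table:
    for strategy in ("async", "buffer"):
        h2d, d2h = slowdown(qubits, strategy)
        print(f"{qubits} qubits, {strategy}: H2D {h2d:.2f}x, D2H {d2h:.2f}x")
```

With only two problem sizes, these ratios are indicative rather than a characterization of how the strategies scale.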

3. Conclusion and Future Work

In this extended abstract, we introduced a highly memory-efficient state-vector simulation of quantum circuits based on data compression that harnesses both CPUs and GPUs. We described the challenges inherent in architecting such a system and proposed tailored solutions. We also outlined our preliminary implementation and discussed the potential for integration with other GPU-oriented simulators. In future work, we aim to present a more comprehensive set of results to substantiate the efficacy and performance of our approach.

References

[1] Sergio Boixo, Sergei V Isakov, Vadim N Smelyanskiy, Ryan Babbush, Nan Ding, Zhang Jiang, Michael J Bremner, John M Martinis, and Hartmut Neven. Characterizing quantum supremacy in near-term devices. Nature Physics, 14(6):595–600, 2018.
[2] Bo Fang, M. Yusuf Özkaya, Ang Li, Ümit V. Çatalyürek, and Sriram Krishnamoorthy. Efficient hierarchical state vector simulation of quantum circuits via acyclic graph partitioning. In CLUSTER, pages 289–300, 2022.
[3] Tyson Jones, Anna Brown, Ian Bush, and Simon C Benjamin. QuEST and high performance simulation of quantum computers. Scientific Reports, 9(1):1–11, 2019.
[4] Ang Li, Bo Fang, Christopher Granade, Guen Prawiroatmodjo, Bettina Heim, Martin Roetteler, and Sriram Krishnamoorthy. SV-Sim: scalable PGAS-based state vector simulation of quantum circuits. In SC21, pages 1–14, 2021.
[5] John Preskill. Quantum computing in the NISQ era and beyond. Quantum, 2:79, 2018.
[6] Xin-Chuan Wu, Sheng Di, Emma Maitreyee Dasgupta, Franck Cappello, Hal Finkel, Yuri Alexeev, and Frederic T Chong. Full-state quantum circuit simulation by using data compression. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–24, 2019.
[7] Chen Zhang, Zeyu Song, Haojie Wang, Kaiyuan Rong, and Jidong Zhai. HyQuas: hybrid partitioner based quantum circuit simulation system on GPU. In Proceedings of the ACM International Conference on Supercomputing, pages 443–454, 2021.
[8] Chen Zhang, Haojie Wang, Zixuan Ma, Lei Xie, Zeyu Song, and Jidong Zhai. UniQ: a unified programming model for efficient quantum circuit simulation. In SC22, pages 692–707. IEEE Computer Society, 2022.