
Fast State Stabilization using Deep Reinforcement Learning for Measurement-based Quantum Feedback Control

Chunxiang Song, Yanan Liu, Daoyi Dong, Hidehiro Yonezawa

Chunxiang Song is with the School of Engineering and Technology, University of New South Wales, Canberra, ACT 2600, Australia (chunxsong@gmail.com). Yanan Liu is with the School of Engineering, University of Newcastle, Callaghan, NSW 2308, Australia (yaananliu@gmail.com). Daoyi Dong is with the School of Engineering, Australian National University, Canberra, ACT 2601, Australia (daoyidong@gmail.com). Hidehiro Yonezawa is with the Optical Quantum Control Research Team, RIKEN Center for Quantum Computing, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan (hidehiro.yonezawa@riken.jp).
Abstract

The stabilization of quantum states is a fundamental problem for realizing various quantum technologies. Measurement-based feedback strategies have demonstrated powerful performance, and the construction of quantum control signals using measurement information has attracted great interest. However, the interaction between quantum systems and the environment is inevitable, especially when measurements are introduced, which leads to decoherence. To mitigate decoherence, it is desirable to stabilize quantum systems faster, thereby reducing the time of interaction with the environment. In this paper, we utilize information obtained from measurement and apply deep reinforcement learning (DRL) algorithms, without explicitly constructing specific complex measurement-control mappings, to rapidly drive random initial quantum states to a target state. The proposed DRL algorithm speeds up convergence to the target state, which shortens the interaction between quantum systems and their environments and protects coherence. Simulations are performed on two-qubit and three-qubit systems, and the results show that our algorithm can successfully stabilize random initial quantum states to the target entangled state, with a convergence time faster than traditional methods such as Lyapunov feedback control and several DRL algorithms with different reward functions. Moreover, it exhibits robustness against imperfect measurements and delays in system evolution.

Index Terms:
deep reinforcement learning (DRL), feedback control, learning control, quantum state stabilization

I Introduction

Quantum control theory focuses on manipulating quantum systems using external control fields or operations to regulate their behaviors [1]. A significant objective in quantum control is the preparation of target states, particularly quantum entangled states, which serve as vital resources for various quantum applications, including quantum teleportation [2, 3], fast quantum algorithms [3, 4], and quantum computation [5]. Achieving high-fidelity entangled states often involves using classical control methods, with feedback control being particularly noteworthy. Quantum systems can be stabilized at target states or subspaces through feedback control methods that continuously monitor the system and design feedback controllers based on real-time feedback information. In quantum measurement-based feedback, quantum measurements, while providing valuable information, introduce stochastic noise, complicating the state preparation process. To address the challenges posed by stochastic nonlinear dynamics induced by measurement, classical control methods such as the Lyapunov method [6, 7, 8] have been applied. However, devising feedback strategies remains a formidable task, given the vast space of possibilities where different responses may be required for each measurement outcome. Moreover, opportunities exist for further enhancing stability and convergence speed.

Recently, quantum learning control, first introduced in [9], has shown potential in addressing various quantum control problems. Its popularity has grown with the incorporation of machine learning (ML) algorithms that exhibit excellent optimization performance and promising outcomes. Quantum learning control applies suitable ML algorithms as tools for improving quantum system performance [10, 9, 11, 12, 13], and can offer robust solutions for developing effective quantum control and estimation methods [10]. For instance, the utilization of gradient algorithms in quantum learning control has demonstrated its significance in addressing quantum robust control problems [14]. Another example involves a widely used class of algorithms, evolutionary algorithms (EAs), which have gained attention in learning control due to their ability to avoid local optima and their independence from gradient information [1]. Nonetheless, the real-time nature and randomness of measurement feedback control pose challenges, expanding the decision space significantly. This randomness makes it almost impossible to reproduce the same measurement trajectory [15], bringing challenges for EAs in applying control policies learned from certain sample trajectories to entirely different ones.

Our study is motivated by the application of suitable deep reinforcement learning (DRL) algorithms within feedback loops to exploit information obtained from measurements, thereby achieving predefined objectives. This approach holds the potential to enhance feedback control schemes, reducing the convergence time to target states and exhibiting robustness in the face of uncertainties. RL techniques have been applied for target state preparation in many-body quantum systems of interacting qubits, where a sequence of unitaries was applied and a measurement was implemented at the final time to provide reward information to the agent [16]. A similar idea was then utilized in harmonic oscillator systems to achieve different target state preparations through an ancilla qubit in [17], where a final reward measurement (POVM) was carefully designed to provide reward information to the agent for different state preparation tasks. In recent years, DRL approaches have also started to play an important role in quantum algorithms, such as QAOA, for ground state preparation in different quantum systems [18, 19]. A similar state preparation problem was also considered in [20], where the system is a double-well system and the reward is a unique function of the measurement results. In [21, 22], DRL-based approaches were employed for the preparation of cavity Fock state superpositions using fidelity-based reward functions, with system states as the training information under continuous weak measurement. In this study, we aim to devise a feedback control scheme based on DRL algorithms to enhance state stabilization, such as Bell states and GHZ states, for multi-qubit systems under continuous weak measurement. The designed algorithm can be applied to multi-qubit systems and provides high fidelity and faster convergence to a given target state.

To achieve the objectives, we exploit the information derived from quantum measurement as the input signal to train our DRL agent. The agent actively interacts with the quantum system, making control decisions based on the received input. We design a generalized reward function that quantifies the similarity between the current quantum state and the desired target state. This incentivizes the DRL agent to iteratively learn control strategies that maximize rewards, ultimately leading to more effective control strategies for stabilizing entangled quantum states. Our work shows the potential of DRL in solving complex quantum control challenges, contributing to the fields of quantum information processing and quantum computation.

The main contributions of this paper are as follows:

  1. A DRL algorithm is proposed to achieve the stabilization of given entangled states in multi-qubit systems under continuous measurement. We design an effective and versatile reward function based on the distance between the system state and the target state, allowing flexible parameter adjustment for different objectives to enhance the performance of the DRL agent.

  2. We compare the proposed DRL-based control strategy with the Lyapunov method and other DRL methods for state preparation through numerical simulations. Our DRL method achieves faster stabilization for both target states, which effectively reduces the noise generated during system-environment interactions.

  3. We analyze the robustness of our DRL scheme under the presence of imperfect measurements and time delays in the feedback loop. The trained DRL agent exhibits remarkable adaptability to uncertainties in the environment, particularly excelling in the pursuit of robust control fields to achieve quantum state stability.

  4. We conduct ablation studies to showcase the superiority of the proposed reward function. By comparing several commonly used reward function designs, our reward function shows better performance in stabilizing the target states.

The following is the organization of this paper. Section II briefly introduces the stochastic master equation for quantum systems under continuous weak measurements. Section III explains in detail the logic and implementation behind DRL. Section IV gives some details of the implementation of DRL in the quantum measurement feedback control. Numerical results are given in Section V. In Section VI, the performance of the proposed algorithm is analyzed through ablation studies and comparisons with other related methods. Section VII is the conclusion.

II Quantum System Dynamics

For a quantum system, its state can be represented by a density matrix $\rho$ defined on the Hilbert space $\mathbb{H}$. This density matrix exhibits essential properties: it is Hermitian ($\rho=\rho^{\dagger}$), has unit trace ($\mathrm{Tr}(\rho)=1$), and is positive semi-definite ($\rho\geq 0$). The dynamics of the quantum system may be observed through continuous weak measurements, enabling us to acquire pertinent measurement information for the design of an appropriate feedback controller. The evolution of the quantum trajectory can be described by the stochastic master equation (SME) [15, 23]:

d\rho_{t}=-\frac{i}{\hbar}\Big[H_{0}+\sum_{j=1}^{M}u_{j}(t)H_{j},\,\rho_{t}\Big]dt+\kappa_{c}\mathcal{D}[c]\rho_{t}\,dt+\sqrt{\eta_{c}\kappa_{c}}\,\mathcal{H}[c]\rho_{t}\,dW, (1)

where $i=\sqrt{-1}$ and the reduced Planck constant $\hbar=1$ is used throughout this paper; the Hermitian operators $H_{0}$ and $H_{j}$ ($j=1,2,\cdots,M$) are the free Hamiltonian and control Hamiltonians, respectively; $u_{j}(t)$ is a real-valued control signal, which can be interpreted as the strength of the corresponding control Hamiltonian $H_{j}$; $\kappa_{c}$ and $\eta_{c}$ are the measurement strength and efficiency, respectively; $dW$ is a standard Wiener process caused by the measurement; the Hermitian operator $c$ is an observable; the superoperators $\mathcal{D}[c]\rho_{t}$ and $\mathcal{H}[c]\rho_{t}$ are related to the measurement, describing the disturbance to the system state and the information gain from the measurement process, respectively [24], and have the following forms:

\begin{cases}\mathcal{D}[c]\rho_{t}=c\rho_{t}c^{\dagger}-\frac{1}{2}c^{\dagger}c\rho_{t}-\frac{1}{2}\rho_{t}c^{\dagger}c,\\ \mathcal{H}[c]\rho_{t}=c\rho_{t}+\rho_{t}c^{\dagger}-\mathrm{Tr}[(c+c^{\dagger})\rho_{t}]\rho_{t}.\end{cases} (2)

On any given trajectory, the corresponding measured current is $I(t)=dy(t)/dt$ [25, 23], where

dy(t)=\sqrt{\eta_{c}\kappa_{c}}\,\mathrm{Tr}[(c+c^{\dagger})\rho_{t}]dt+dW. (3)

With the measurement result $y_{t}$, information on the standard Wiener process $dW$ can be extracted from (3). Utilizing (1), an estimate of the system state can be obtained and used to construct a feedback controller.
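For concreteness, one Euler–Maruyama integration step of (1), together with the measurement record (3), might look as sketched below; the helper name and signature are our own illustrative choices, and $H$ stands for the total Hamiltonian $H_{0}+\sum_{j}u_{j}(t)H_{j}$.

```python
import numpy as np

def sme_step(rho, H, c, dt, kappa=1.0, eta=1.0, rng=None):
    """One Euler-Maruyama step of the SME (1) with the superoperators (2),
    returning the updated state estimate and the measurement record dy in (3)."""
    rng = rng if rng is not None else np.random.default_rng()
    dW = rng.normal(scale=np.sqrt(dt))                         # Wiener increment
    D = c @ rho @ c.conj().T - 0.5 * (c.conj().T @ c @ rho + rho @ c.conj().T @ c)
    Hsup = c @ rho + rho @ c.conj().T \
        - np.real(np.trace((c + c.conj().T) @ rho)) * rho      # H[c]rho in (2)
    drho = (-1j * (H @ rho - rho @ H) + kappa * D) * dt \
        + np.sqrt(eta * kappa) * Hsup * dW                     # SME (1), hbar = 1
    dy = np.sqrt(eta * kappa) * np.real(np.trace((c + c.conj().T) @ rho)) * dt + dW
    return rho + drho, dy
```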

In this paper, we consider DRL-based feedback stabilization of target quantum states. We demonstrate our algorithm by stabilizing a GHZ entangled state of a three-qubit system and an eigenstate of an angular momentum system, while the scheme has the potential to be extended to other quantum systems.

III Deep Reinforcement Learning

Abstracting a real-world problem into a Markov decision process (MDP) serves as the foundational step in applying DRL [26]. MDP provides a formal framework for modeling the interaction between an agent and its environment (quantum systems in this work), offering a structured specification of the agent’s decision-making problem. The environment is abstracted with essential elements such as states, actions, rewards, and more. The agent is a pivotal component of DRL, representing a learning entity or decision-maker that, through interactions with the environment, learns to take actions to achieve objectives and continually refines its decision strategies to enhance the effectiveness of its actions. This process of agent-environment interaction and learning constitutes the core mechanism through which DRL efficiently tackles real-world challenges and achieves desirable outcomes.

An MDP is a structured representation denoted by the tuple $\langle\mathcal{S},\mathcal{A},R,P,\gamma\rangle$, where each element plays a crucial role in modeling the problem and applying DRL:

  • $\mathcal{S}=\{\rho\in\mathbb{H}:\rho=\rho^{\dagger},\ \mathrm{Tr}(\rho)=1,\ \rho\geq 0\}$ represents the set of states. At each time step $t$, the environment presents a specific quantum state $\rho_{t}$ to the agent, who subsequently makes decisions based on this state.

  • $\mathcal{A}$ denotes the set of actions, comprising the actions $a_{t}\in\mathcal{A}$ that the agent can take at each time step. In this context, the actions correspond to the control signals $u_{j}(t)$ defined in (1), with values restricted to a bounded range, e.g., $u_{j}(t)\in[-1,1]$ in this paper.

  • $R$ denotes the reward function. This paper considers the task of stabilizing the current state to the target state, so the immediate reward $r_{t}$ can be simplified as

    r_{t}=R(\rho_{t}). (4)

    In this study, the reward function is defined using the trace-based distance $D_{\rho_{t}}$:

    D_{\rho_{t}}\triangleq 1-\mathrm{Tr}(\rho_{d}\rho_{t}), (5)

    which quantifies the difference between the current state $\rho_{t}$ and the target state $\rho_{d}$. When $D_{\rho_{t}}=0$, the system state has stabilized at the target state $\rho_{d}$.

  • $P(\rho_{t+1}|\rho_{t},a_{t})$ is the state transition function. It indicates how the environment transitions to the next state $\rho_{t+1}$ after taking action $a_{t}$ in the current state $\rho_{t}$, and it is consistent with the stochastic evolution of the quantum system described in (1).

  • $\gamma$ is the discount factor, which determines the emphasis placed on future rewards, influencing the agent's decision-making from a long-term perspective.



Figure 1: Illustration of MDP for Agent-Environment interaction.

An MDP is a time-dependent and ongoing process, involving continuous interaction between the agent and the environment. The interaction can be illustrated as shown in Fig. 1. For simplicity, we consider the interaction between the agent and the environment as a discrete time series (e.g., $t=0,1,2,3,\cdots$). Starting from $t=0$ with the known initial state $\rho_{0}$, the agent selects an action $a_{0}$ based on $\rho_{0}$ and applies it to the environment. Subsequently, the environment undergoes a state transition according to the state transition function $P(\rho_{t+1}|\rho_{t},a_{t})$, resulting in the next state $\rho_{1}$, and provides an immediate reward $r_{1}$ based on the reward function $R$. The environment also utilizes a classical computer to solve the SME (1) to estimate the density matrix $\rho_{1}$, which is then fed back to the agent. This process is iterated until completion. Therefore, the MDP and the agent jointly generate a sequence, or trajectory, as follows:

\{\rho_{0},r_{0},a_{0}\},\{\rho_{1},r_{1},a_{1}\},\{\rho_{2},r_{2},a_{2}\},\cdots (6)

The function that selects an action $a$ from the set of actions $\mathcal{A}$ based on the current state $\rho$ is referred to as a policy $\pi(a|\rho)$. The objective of the MDP is to find the policy $\pi(a|\rho)$ that allows the agent to make optimal decisions, effectively maximizing the long-term cumulative reward. Different methods, including value-based, policy-based, and actor-critic methods, have been developed to update the policy in DRL [27, 28, 29]. In this paper, a highly effective actor-critic style proximal policy optimization (PPO) algorithm [30] is applied.

IV Applying DRL to Quantum Measurement-based Feedback Control

In this section, we apply DRL to quantum systems and aim to design a measurement-based feedback strategy to stabilize a given target state. The application comprises training and testing parts. In the training part, the primary object is the agent's policy function $\pi_{\theta}(a|\rho)$, constructed by a neural network with an adjustable parameter set $\theta$. This parameter set is updated toward higher reward using data generated through the interaction between the agent and the environment. Once the agent finishes training, it can be applied to the quantum system to generate real-time feedback control signals that achieve stabilization of the target state.

IV-A Environment: States and Actions

The environment's state is represented by the quantum system's density matrix $\rho$, which contains all the information about the quantum state. To use $\rho$ in DRL, it needs to be converted into a format suitable for neural networks. We achieve this by flattening the density matrix into a vector that includes both its real and imaginary parts. For example, for a single-qubit system with $\rho=\begin{bmatrix}\alpha_{1}+\beta_{1}i&\alpha_{2}+\beta_{2}i\\ \alpha_{3}+\beta_{3}i&\alpha_{4}+\beta_{4}i\end{bmatrix}$, it is converted to $\rho:=[\alpha_{1},\alpha_{2},\alpha_{3},\alpha_{4},\beta_{1},\beta_{2},\beta_{3},\beta_{4}]^{T}$. This process keeps all the quantum state information and makes it usable for the neural network.
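As a concrete illustration, this encoding is a one-line operation in NumPy (the function name is ours):

```python
import numpy as np

def flatten_state(rho: np.ndarray) -> np.ndarray:
    """Flatten a d x d complex density matrix into a real vector of length
    2*d*d: all real parts first, then all imaginary parts."""
    return np.concatenate([rho.real.ravel(), rho.imag.ravel()])

# Example for the single-qubit case in the text:
rho = np.array([[0.5, 0.5j], [-0.5j, 0.5]])
obs = flatten_state(rho)   # [0.5, 0, 0, 0.5, 0, 0.5, -0.5, 0]
```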

The policy function $\pi_{\theta}(a|\rho)$ maps the quantum state's vectorized form to a set of control actions $a$. These actions correspond to the control signals applied to the control Hamiltonians $H_{1},H_{2},\dots,H_{M}$. For example, if there are two control Hamiltonians, the actions at each time step are given by $a:=[u_{1},u_{2}]^{T}$, where $u_{1}$ and $u_{2}$ are the control amplitudes.

The DRL algorithm trains the policy to select actions that stabilize the quantum state. At each time step $t$, the agent uses the policy to choose an action $a_{t}$ based on the state $\rho_{t}$. The environment then updates the quantum state to $\rho_{t+1}$ according to the system dynamics and provides a reward $r_{t}$. This reward encourages the agent to take actions that reduce the difference between $\rho_{t+1}$ and the target state $\rho_{\text{target}}$.

It is important to note that the estimated state $\rho$ may not be a valid quantum state due to the inherent randomness in measurements and cumulative errors in solving the SME on classical computers [31]. For example, the state matrix may contain non-physical (negative) eigenvalues. To address this issue, a suitable approach is to check the eigenvalues of the estimated matrix at each step. When negative values are encountered, the state should be projected back onto a valid physical state. This can be achieved by finding the closest density matrix under the 2-norm [32]. This approximation ensures that a non-physical density matrix is transformed into the most probable positive semi-definite quantum state with trace equal to $1$.
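One way to carry out such a projection is to diagonalize the estimated matrix and project its eigenvalue vector onto the probability simplex, which yields the closest density matrix in the Frobenius (2-) norm; the sketch below is our own minimal implementation of this idea, not necessarily the exact routine of [32].

```python
import numpy as np

def project_to_physical_state(rho):
    """Project a Hermitian, approximately trace-one matrix onto the nearest
    density matrix in Frobenius norm by projecting its eigenvalues onto the
    probability simplex {p : p_i >= 0, sum_i p_i = 1}."""
    rho = 0.5 * (rho + rho.conj().T)              # enforce Hermiticity
    evals, evecs = np.linalg.eigh(rho)
    u = np.sort(evals)[::-1]                      # eigenvalues, descending
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    idx = np.nonzero(u - (css - 1.0) / k > 0)[0][-1]
    tau = (css[idx] - 1.0) / (idx + 1)            # simplex-projection threshold
    p = np.maximum(evals - tau, 0.0)              # clipped, renormalized spectrum
    return (evecs * p) @ evecs.conj().T           # rho = V diag(p) V^dagger
```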

As mentioned in Section III, the interaction between the agent and the environment forms an episode $\tau=\{\rho_{0},r_{0},a_{0},\rho_{1},r_{1},a_{1},\dots,\rho_{m}\}$, where $m$ is the final step of the episode and $\tau\in\mathbb{D}=\{\tau_{n}\}_{n=1,2,\dots,N}$, with $N$ denoting the total number of possible sequences. Without confusion, we use $\tau$ to represent a single episode. The primary goal of the agent is to maximize the expected cumulative reward across all possible episodes. The construction of the reward function is discussed in detail in the following section.

IV-B Reward

The design of the reward function is critical in DRL. For instance, [16, 17] proposed sparse reward functions, i.e., providing rewards only at the end of each trajectory by evaluating the final state. While this approach is straightforward and easy to implement, it often suffers from slow learning or even training failure in complex systems due to the sparsity of reward signals. To address this, [21] and [22] introduced a scheme that collects fidelity information at each step and applies a higher-weighted reward for high-fidelity states. However, these methods still face challenges in balancing exploration and exploitation, particularly in systems with high-dimensional state spaces.

In this work, we propose a Partitioned Nonlinear Reward (PNR) function based on the distance $D_{\rho_{t}}$ in (5). A lower $D_{\rho_{t}}$ indicates better alignment with the target state. The reward at each time step, $r_{t}$, is defined as:

r_{t}=\left(\frac{\overline{D}-\underline{D}}{\mathfrak{f}\,(D_{\rho_{t}}-\underline{D})-\mathfrak{e}\,(D_{\rho_{t}}-\overline{D})}-\frac{1}{\mathfrak{f}}\right)\times\left(\frac{\mathfrak{e}\mathfrak{f}\,(\overline{R}-\underline{R})}{\mathfrak{f}-\mathfrak{e}}\right)+\underline{R}, (7)

where $\overline{D}$ and $\underline{D}$ are the upper and lower bounds of the distance $D_{\rho_{t}}$, $\overline{R}$ and $\underline{R}$ are the upper and lower bounds of the reward, and $\mathfrak{e}$ and $\mathfrak{f}$ are parameters that regulate the slope of the reward curve.

In general, this reward function is motivated by the inverse proportional relation between $r_{t}$ and $D_{\rho_{t}}$ (when the other parameters in (7) are fixed), which ensures that the reward value increases as the distance $D_{\rho_{t}}$ decreases. More specifically, the bounds $\overline{D}$ and $\underline{D}$ are designed to ensure $D_{\rho_{t}}\in[\underline{D},\overline{D}]$. The distance is mapped to a reward value constrained within the range $[\underline{R},\overline{R}]$. The two tunable parameters $\mathfrak{f}$ and $\mathfrak{e}$ adjust the steepness of the curve as $D_{\rho_{t}}$ approaches $\underline{D}$ and $\overline{D}$, respectively.

The detailed effects of different $\mathfrak{e}$ and $\mathfrak{f}$ are plotted in Fig. 2. For given bounds $\underline{D},\overline{D},\underline{R}$, and $\overline{R}$, when $\mathfrak{e}<\mathfrak{f}$, the reward curve near $\underline{D}$ is steeper than that near $\overline{D}$. This indicates that the rate of increase in the reward value is more pronounced as $D_{\rho_{t}}$ approaches $\underline{D}$. Conversely, when $\mathfrak{e}>\mathfrak{f}$, the trend is reversed.
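For reference, the mapping in (7) can be written directly as a small helper; the sketch below (argument names are ours) reproduces the boundary values $r_{t}=\overline{R}$ at $D_{\rho_{t}}=\underline{D}$ and $r_{t}=\underline{R}$ at $D_{\rho_{t}}=\overline{D}$.

```python
def pnr_reward(D, D_lo, D_hi, R_lo, R_hi, e, f):
    """Nonlinear map of Eq. (7): a distance D in [D_lo, D_hi] is mapped to a
    reward in [R_lo, R_hi]; e and f set the steepness near D_lo and D_hi."""
    core = (D_hi - D_lo) / (f * (D - D_lo) - e * (D - D_hi)) - 1.0 / f
    return core * (e * f * (R_hi - R_lo) / (f - e)) + R_lo

# Sanity check of the boundary behavior with e = 2, f = 10:
assert abs(pnr_reward(0.0, 0.0, 1.0, -0.1, 0.0, 2, 10) - 0.0) < 1e-12
assert abs(pnr_reward(1.0, 0.0, 1.0, -0.1, 0.0, 2, 10) + 0.1) < 1e-12
```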


Figure 2: Example reward curves for different parameter combinations of $\mathfrak{e}$ and $\mathfrak{f}$.

In most existing work, the upper and lower bounds of $D_{\rho_{t}}$ are fixed at $\overline{D}=1$ and $\underline{D}=0$, respectively. The reward function in Eq. (7), however, offers the flexibility to adjust these bounds, as well as the corresponding reward bounds $\overline{R}$ and $\underline{R}$. This flexibility enables the division of the state space into multiple regions, which is particularly useful for addressing complex problems. Moreover, by adjusting $\overline{D}$ and $\underline{D}$, it becomes possible to assign positive reward values in certain regions of the state space and negative reward values (penalties) in others. This feature allows for more efficient and effective optimization, resulting in improved performance.

In this paper, we divide the state space into two regions using a partition parameter $d$. These two regions are denoted as the Proximity Zone ($D_{\rho_{t}}<d$) and the Exploration Zone ($D_{\rho_{t}}\geq d$). The same reward function (7) is used in both zones, and the corresponding bounds $\underline{D},\overline{D}$ are determined by the partition parameter $d$. This division provides more flexibility to balance reward and penalty across the whole state space. One possible shape of the reward function in these two zones is shown in Fig. 3.


Figure 3: Reward function. (a) When $D_{\rho_{t}}$ is between $0$ and $d$, we consider the current state to be nearly at the target state. A smaller $D_{\rho_{t}}$ corresponds to a larger positive reward; when $D_{\rho_{t}}=0$, the maximum reward $r=\overline{R}_{P}$ is obtained. (b) When $D_{\rho_{t}}$ is between $d$ and $1$, there is still a distance between the current state and the target state. At $D_{\rho_{t}}=1$, the maximum negative reward (penalty) is imposed, and as $D_{\rho_{t}}$ decreases, the penalty decreases accordingly.

In the Proximity Zone ($D_{\rho_{t}}<d$), the state is considered close to success. Positive rewards encourage the agent to converge rapidly to the target state. The distance bounds in this region are $\overline{D}=\overline{D}_{P}=d$ and $\underline{D}=\underline{D}_{P}=0$, while the reward bounds satisfy $\overline{R}_{P}>\underline{R}_{P}\geq 0$. As shown in Fig. 3(a), the reward function ensures that the slope of the curve increases as $D_{\rho_{t}}$ approaches zero, emphasizing rapid stabilization near the target state.

In the Exploration Zone ($D_{\rho_{t}}\geq d$), the system is considered far from the target state, and penalties are applied to guide exploration. The boundaries of this zone are $\overline{D}=\overline{D}_{E}=1$ and $\underline{D}=\underline{D}_{E}=d$, with the reward range satisfying $\underline{R}_{E}<\overline{R}_{E}\leq 0$. As illustrated in Fig. 3(b), the penalty decreases as $D_{\rho_{t}}$ approaches $d$, with the slope of the curve ensuring a progressively faster reduction. This mechanism encourages the state to transition into the Proximity Zone.

Remark 1: In the Exploration Zone, when the distance $D_{\rho_{t}}$ approaches $d$, we reduce penalties rather than introduce large positive rewards. This approach avoids potential reward hacking [33], where the agent might exploit increasing rewards by staying near $D_{\rho_{t}}\approx d$ without fully reaching the target. By applying small penalties at each step, the agent is encouraged to explore more broadly while progressively improving its policy.

Additionally, a small penalty proportional to the number of steps taken before stabilization is introduced to encourage efficiency. For example, the first step incurs a penalty of $-1\times 10^{-6}$, the second step $-2\times 10^{-6}$, and so forth. This step-based penalty discourages unnecessary delays and motivates the agent to achieve stabilization promptly.
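Putting the pieces together, the per-step PNR signal might be computed as in the sketch below, which reuses the pnr_reward helper from the earlier sketch and adds the step-proportional penalty; the default parameter values follow those reported in Sections IV-C and VI.

```python
def partitioned_reward(D, step, d=0.001,
                       R_P=(1.0, 100.0), R_E=(-0.1, 0.0), e=2.0, f=10.0):
    """Sketch of the partitioned scheme: the Proximity Zone (D < d) gives a
    positive reward via pnr_reward on [0, d], the Exploration Zone (D >= d)
    gives a small penalty via pnr_reward on [d, 1], and a step-proportional
    penalty is subtracted to discourage slow stabilization."""
    if D < d:
        r = pnr_reward(D, 0.0, d, R_P[0], R_P[1], e, f)
    else:
        r = pnr_reward(D, d, 1.0, R_E[0], R_E[1], e, f)
    return r - 1e-6 * step
```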

The proposed reward function adjusts based on the distance DρtD_{\rho_{t}}, with steeper slopes in both regions as the system approaches the target state. This design encourages exploration in the Exploration Zone while driving efficient convergence in the Proximity Zone. We show the superiority of our design by simulating different reward functions in Section VI.

With this reward function, we are now in a position to maximize the cumulative expected reward of the sequence, so for a complete sequence $\tau$, its cumulative reward can be expressed as

R(\tau)=\sum_{t=0}^{m}A^{\theta}(\rho_{t},a_{t}). (8)

$A^{\theta}(\rho_{t},a_{t})=Q(\rho_{t},a_{t})-V^{\phi}(\rho_{t})$ is known as the advantage function in RL and is used to assess the desirability of taking a specific action $a_{t}$ at state $\rho_{t}$. $Q(\rho_{t},a_{t})=\sum_{t^{\prime}=t}^{m}\gamma^{t^{\prime}-t}r_{t^{\prime}}$ is the action-value function, indicating the expected discounted reward for choosing action $a_{t}$ in state $\rho_{t}$, i.e., the cumulative sum of discounted rewards from this action until the end of the episode, where $r_{t^{\prime}}$ is the reward in (7). The value of $\gamma$ lies between $0$ and $1$, determining the emphasis on long-term rewards (close to $1$) or short-term rewards (close to $0$); it effectively introduces a discounting mechanism for future rewards, thereby shaping the agent's preference for future rewards when making decisions. $V^{\phi}(\rho_{t})$ is the state-value function (or baseline) and is modeled by a neural network with the same structure as the policy network but with different parameters $\phi$. It approximates the expected discounted reward from state $\rho_{t}$ to the end of an episode. Specifically, if the current state is $\rho_{t}$, the possible actions $a_{t}^{(1)},a_{t}^{(2)},\dots$ correspond to discounted rewards $Q(\rho_{t},a_{t}^{(1)}),Q(\rho_{t},a_{t}^{(2)}),\dots$; since $V^{\phi}(\rho_{t})$ represents the expected value of these discounted rewards at $\rho_{t}$, they can be used as features to approximate $V^{\phi}(\rho_{t})$. When $A^{\theta}(\rho_{t},a_{t})>0$, the action $a_{t}$ is considered better than average, and its probability of being chosen in subsequent iterations is increased; otherwise, it is decreased.
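To make these definitions concrete, the discounted reward-to-go $Q(\rho_{t},a_{t})$ and the advantage $A^{\theta}=Q-V^{\phi}$ can be computed from a recorded episode as sketched below. PPO implementations such as Stable-Baselines3 actually use generalized advantage estimation (GAE); this plain version only illustrates the definitions in (8).

```python
import numpy as np

def discounted_returns_and_advantages(rewards, values, gamma=0.99):
    """rewards: per-step rewards r_t of one episode; values: V^phi(rho_t) from
    the value network. Returns the reward-to-go Q and the advantage A = Q - V."""
    m = len(rewards)
    q = np.zeros(m)
    running = 0.0
    for t in reversed(range(m)):           # discounted reward-to-go
        running = rewards[t] + gamma * running
        q[t] = running
    advantages = q - np.asarray(values, dtype=float)
    return q, advantages
```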

IV-C Training

The essence of employing DRL to address MDPs is centered on the training regimen of the agent, which is governed by the policy $\pi_{\theta}(a|\rho)$. This study implements the PPO algorithm, a model-free policy gradient method within the domain of DRL. PPO incorporates a "clip ratio" mechanism that limits the extent of policy updates, thereby enhancing the stability and reliability of the learning process. This approach mitigates the requirement for extensive environmental sampling, a common feature of traditional policy gradient techniques, and consequently improves the efficiency of the training phase. The PPO algorithm operates with three sets of network parameters: the primary policy parameters, a duplicate of the policy parameters, and the value network parameters. These parameters are instrumental in the iterative policy refinement and state-value estimation processes. The comprehensive algorithmic details and the underlying mathematical formalism are presented in the Appendix.

Our DRL algorithm is implemented using the open-source Python library Stable-Baselines3 [34], while the quantum dynamic environment is constructed within the Gymnasium framework [35]. All simulations in this study are conducted on a computer equipped with an Apple M1 Pro chip and 32 GB of memory, utilizing Python 3.11.3, stable-baselines3 2.3.2, and Gymnasium 0.29.1. We design a reasonable reward function to guide the DRL agent through iterative learning, aiming to train an excellent DRL agent capable of generating control signals to achieve stabilization of the target entangled state.
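For orientation, the sketch below shows how such an environment and training run can be wired together with Gymnasium and Stable-Baselines3. It is a deliberately simplified single-qubit toy: the class name, the dynamics (target $|0\rangle\langle 0|$, control Hamiltonian $\sigma_{y}$, observable $\sigma_{z}$), and the plain negative-distance reward are our illustrative choices, not the paper's implementation; the multi-qubit environments in this work follow the same pattern with the operators and PNR reward described above.

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

SY = np.array([[0, -1j], [1j, 0]])
SZ = np.array([[1, 0], [0, -1]], dtype=complex)

class OneQubitFeedbackEnv(gym.Env):
    """Toy single-qubit stand-in for the multi-qubit environments."""
    def __init__(self, dt=1e-3, kappa=1.0, eta=1.0, max_steps=2000):
        super().__init__()
        self.dt, self.kappa, self.eta, self.max_steps = dt, kappa, eta, max_steps
        self.target = np.diag([1.0, 0.0]).astype(complex)
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float64)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float64)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        psi = self.np_random.normal(size=2) + 1j * self.np_random.normal(size=2)
        psi /= np.linalg.norm(psi)
        self.rho, self.steps = np.outer(psi, psi.conj()), 0   # random pure state
        return self._obs(), {}

    def step(self, action):
        H = float(action[0]) * SY                              # u(t) * H_1, with H_0 = 0
        c, rho, dt = SZ, self.rho, self.dt
        dW = self.np_random.normal(scale=np.sqrt(dt))
        D = c @ rho @ c - 0.5 * (c @ c @ rho + rho @ c @ c)    # D[c]rho, c Hermitian
        Hc = c @ rho + rho @ c - np.real(np.trace(2 * c @ rho)) * rho   # H[c]rho
        rho = rho + (-1j * (H @ rho - rho @ H) + self.kappa * D) * dt \
            + np.sqrt(self.eta * self.kappa) * Hc * dW
        rho = 0.5 * (rho + rho.conj().T)                       # keep Hermitian
        self.rho = rho / np.real(np.trace(rho))
        self.steps += 1
        dist = 1.0 - np.real(np.trace(self.target @ self.rho))
        return self._obs(), -dist, bool(dist < 1e-3), self.steps >= self.max_steps, {}

    def _obs(self):
        return np.concatenate([self.rho.real.ravel(), self.rho.imag.ravel()])

# model = PPO("MlpPolicy", OneQubitFeedbackEnv(), verbose=1)
# model.learn(total_timesteps=1_000_000)
```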

Reward Function Parameters: We set $d=0.001$, the partition parameter of the reward function mentioned in Section IV-B. This means that when the distance $D_{\rho_{t}}$ is less than $0.001$, the system is considered to be close to the target and receives a positive reward. In this case, the fidelity of the system state can easily exceed $0.999$. Ideally, the smaller the value of $d$, the higher the control fidelity of the trained agent. However, as $d$ decreases, the training process becomes increasingly challenging and time-consuming. Therefore, the choice of $d$ should strike a balance between achieving high fidelity and managing training complexity.

The upper and lower bounds of the reward in the Proximity Zone, $\overline{R}_{P}$ and $\underline{R}_{P}$, are set to $100$ and $1$, respectively, encouraging the system to get as close as possible to the target. In the Exploration Zone, the upper and lower bounds $\overline{R}_{E}$ and $\underline{R}_{E}$ are set to $0$ and $-0.1$, respectively, maintaining a penalty without overly punishing the agent.

We set $\mathfrak{e}=2$ and $\mathfrak{f}=10$, emphasizing a steeper reward escalation as $D_{\rho_{t}}$ approaches $\underline{D}$.

Initial State: During training, the state is randomly reset to a quantum state after the completion of each episode, which means that at each episode in the training iteration, the agent starts from a new state and explores the environment from that point.

Early Termination: Regarding early termination, continuous quantum measurement feedback control can be modeled as an infinite-horizon MDP, but during training, each episode is simulated on a finite time horizon. Additionally, practical applications require a finite system evolution time. Therefore, we set fixed duration or termination conditions to end an episode. The termination conditions include the following:

  • If the system is measured continuously for $10$ iterations and the distance $D_{\rho_{t}}$ remains within the interval $[0,d]$ for all of them, the system is considered to have reached the target state with high fidelity. At this point, the task is concluded.

  • In a specific system, the maximum training time for a trajectory is set to a fixed value. For example, for the two-qubit state stabilization problem in Section V-A, we set the maximum training time $T=20$ arbitrary units (a.u.). When the evolution time reaches $20$ a.u., regardless of whether the state has converged to the goal, the training trajectory is halted. This approach not only greatly saves training time but also significantly reduces the issue of overfitting.

These early termination conditions bias the data distribution towards samples that are more relevant to the task, thereby saving training time and preventing undesirable behaviors. During agent testing, the time is typically not limited to evaluate the agent’s performance in a real environment and assess its ability to complete tasks within a reasonable time frame.
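A minimal sketch of these two termination checks (names and structure are ours) is given below:

```python
from collections import deque

def make_termination_checker(d=0.001, window=10, t_max=20.0, dt=1e-3):
    """Returns a callable that reports (reached_target, time_exceeded):
    reached_target after `window` consecutive steps with D <= d,
    time_exceeded once the evolution time step*dt reaches t_max."""
    recent = deque(maxlen=window)

    def check(D, step):
        recent.append(D <= d)
        reached_target = len(recent) == window and all(recent)
        time_exceeded = step * dt >= t_max
        return reached_target, time_exceeded

    return check
```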

The pseudo-code for the PPO in quantum state stabilization is shown in Algorithm 1.

Algorithm 1 PPO for Quantum Feedback Control
  Begin
  Initialize policy weights $\theta$
  Initialize value function weights $\phi$
  repeat
     Sample training data from the environment
     for t = 0,1,... do
        $\rho_{t}\leftarrow$ start state
        $r_{t}\leftarrow$ reward
        $a_{t}\sim\pi_{\theta}(a_{t}|\rho_{t})$
        Apply $a_{t}$ and simulate the system one step forward
        $\rho_{t+1}\leftarrow$ end state
        $r_{t+1}\leftarrow$ reward
        Record $(\rho_{t},r_{t},a_{t},\rho_{t+1})$ into memory $M$
     end for
     $\theta^{\prime}\leftarrow\theta$
     for each update do
        Sample $N_{m}$ samples $\{(\rho_{t},r_{t},a_{t},\rho_{t+1})\}$ from $M$
        Update main policy:
        for each $(\rho_{t},r_{t},a_{t},\rho_{t+1})$ do
           compute advantage $A^{\theta}$ in (8) using current $V^{\phi}$ and GAE
           compute $\frac{p_{\theta}(a_{t}|\rho_{t})}{p_{\theta^{\prime}}(a_{t}|\rho_{t})}$
        end for
        Calculate the gradient to update the policy parameters $\theta$ via (26)
        Update value function $V^{\phi}$:
        for each $(\rho_{t},r_{t},a_{t},\rho_{t+1})$ do
           Use TD to update the value function $V^{\phi}$
        end for
     end for
  until The training termination condition is triggered

V Numerical Simulation

V-A Two-Qubit System

We consider a two-qubit system in a symmetric dispersive interaction with an optical probe, as described in [36]. The system's dynamics are governed by the SME in (1), and we use the DRL control scheme to stabilize the system to a target entangled state from arbitrary initial states. Denote the Pauli matrices

\sigma_{x}=\begin{bmatrix}0&1\\ 1&0\end{bmatrix},\qquad\sigma_{y}=\begin{bmatrix}0&-i\\ i&0\end{bmatrix},\qquad\sigma_{z}=\begin{bmatrix}1&0\\ 0&-1\end{bmatrix}.

The control Hamiltonians in (1) are chosen as

H_{1}=\sigma_{y}\otimes I_{2},\quad\text{and}\quad H_{2}=I_{2}\otimes\sigma_{y}, (9)

which implies that two control channels are applied to this quantum system. The forms of $H_{1}$ and $H_{2}$ represent $\sigma_{y}$ rotations on the first and second qubits, respectively, enabling independent control over each qubit.

The observable operator is chosen as

c=\sigma_{z}\otimes I_{2}+I_{2}\otimes\sigma_{z}, (10)

which corresponds to the measurement of a collective $\sigma_{z}$-like observable for the two qubits.

We specify the target state as

\rho_{d}=\begin{bmatrix}0&0&0&0\\ 0&0.5&0.5&0\\ 0&0.5&0.5&0\\ 0&0&0&0\end{bmatrix}, (11)

which is a symmetric two-qubit Bell state.
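For reference, the operators in (9)-(11) can be assembled directly from Kronecker products, e.g., in NumPy:

```python
import numpy as np

I2 = np.eye(2)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.array([[1, 0], [0, -1]], dtype=complex)

H1 = np.kron(sy, I2)                        # sigma_y rotation on the first qubit, Eq. (9)
H2 = np.kron(I2, sy)                        # sigma_y rotation on the second qubit
c = np.kron(sz, I2) + np.kron(I2, sz)       # collective sigma_z observable, Eq. (10)

bell = np.array([0, 1, 1, 0], dtype=complex) / np.sqrt(2)   # (|01> + |10>)/sqrt(2)
rho_d = np.outer(bell, bell.conj())         # target Bell state, Eq. (11)
```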

We utilize the previously summarized PPO algorithm to train the DRL agent. For the training trajectories, we set a time interval of $\Delta t=0.001$ a.u. for each measurement step, with a maximum evolution time of $T=20$ a.u., corresponding to a maximum of $20{,}000$ steps. At each step, the DRL agent interacts with the environment, obtaining system information to generate control signals, which are then stored for iterative updates of the policy. The total number of training steps is set to $10^{7}$. On the computer configuration described in Section IV-C, the training process requires approximately $50$ minutes. However, the primary focus of this study is the performance of the trained agent rather than the optimization of training time. In practical implementations, employing a pre-trained agent is feasible: once the target state is specified, the agent does not require repeated updates and can be used directly after training is completed.

To evaluate its performance, we test the proposed strategy on $50$ randomly selected distinct initial quantum states $\rho_{0}$ (corresponding to different initial distances $D_{\rho_{0}}$, as indicated by the blue lines in Fig. 4). Each light blue line represents the evolution of a specific initial state toward the target state, averaged over $50$ different trajectories under varying environmental noise, while the dark blue line depicts the average over all initial states, i.e., the average of $2500$ individual trajectories. Smaller values of the distance $D_{\rho_{t}}$ indicate that the system is closer to the target state. It is worth noting that the trained agent successfully stabilizes every initial state to the target state with high fidelity. Furthermore, for comparison with the Lyapunov method in [36], we retain the same $50$ sets of randomly selected initial states and obtain the orange trajectories in Fig. 4 using the Lyapunov method. It can be observed that the control signals generated by the DRL agent outperform the Lyapunov method. Taking the time at which the distance $D_{\rho_{t}}$ first falls below $0.001$ as the stabilization time, the average stabilization time of the $2500$ trajectories under the guidance of the DRL agent is $4.59$ a.u., while the average time using the Lyapunov method is $5.86$ a.u. The DRL agent's average stabilization time is thus $22\%$ shorter than that of the Lyapunov method, indicating that our DRL approach stabilizes these quantum states to the target state faster than the Lyapunov method.


Figure 4: Evolution of the distance $D_{\rho_{t}}$ for $50$ random initial states stabilized to the target two-qubit state under the control of the DRL agent (blue) and the Lyapunov method (orange). The average stabilization time under DRL control is $4.59$ a.u., while the Lyapunov method requires an average of $5.86$ a.u. (The stabilization time is defined as the time when the distance $D_{\rho_{t}}\leq 0.001$.) The light blue (orange) lines represent the average evolution trajectories for individual initial states, and the dark blue (orange) line represents the average trajectory across all initial states.

V-B GHZ State

We then consider a more complex problem: preparing three-qubit entangled GHZ states, which are special entangled states regarded as maximally entangled by many measures [37], [38]. A GHZ entangled state is defined in the following form [37, 39]:

|{\rm{GHZ}}\rangle=\frac{1}{\sqrt{2}}(|0\rangle^{\otimes n}+|1\rangle^{\otimes n}), (12)

where $n$ is the number of qubits. Its density matrix can be expressed as $\rho_{\rm{GHZ}}\triangleq|{\rm{GHZ}}\rangle\langle{\rm{GHZ}}|$. For the three-qubit GHZ state, we choose $|{\rm{GHZ}}\rangle=(1/\sqrt{2})(|000\rangle+|111\rangle)$, which gives the following density matrix:

\rho_{\rm{GHZ}}=\frac{1}{2}\begin{bmatrix}1&0&0&0&0&0&0&1\\ 0&0&0&0&0&0&0&0\\ 0&0&0&0&0&0&0&0\\ 0&0&0&0&0&0&0&0\\ 0&0&0&0&0&0&0&0\\ 0&0&0&0&0&0&0&0\\ 0&0&0&0&0&0&0&0\\ 1&0&0&0&0&0&0&1\end{bmatrix}. (13)

A degenerate observable $c$ is required due to quantum state collapse after measurement [36, 8, 40]: without any control, the system in (1) will randomly converge to an eigenstate or eigenspace of $c$. Hence, we choose an observable in the following diagonal form:

c=\text{diag}[\lambda_{d},\lambda_{2},\cdots,\lambda_{n-1},\lambda_{d}], (14)

where $\lambda_{d}\neq\lambda_{k}$ ($k=2,\cdots,n-1$), and $\lambda_{d}$ is the eigenvalue corresponding to the target state $\rho_{d}$, i.e., $c\rho_{d}=\lambda_{d}\rho_{d}$.

Due to the degenerate form of the observable $c$ in (14), the system may converge to other states in the eigenspace associated with $\lambda_{d}$; two control channels $H_{1}$ and $H_{2}$ based on the Lyapunov method have been applied in [36, 8, 40] to solve this problem. For subsequent performance comparisons, two control channels are also used in this paper.

For any training trajectory, we take a time interval of $\Delta t=0.001$ a.u. for each measurement step. Given that the maximum evolution time is set to $T=40$ a.u., the maximum number of evolution steps for any trajectory during training is $40{,}000$. The total number of training steps is $10^{8}$.

For all instances, in order to compare with the Lyapunov methods presented in [8], we choose the same system Hamiltonian $H_{0}=\text{diag}[1,-1,-1,1,1,-1,-1,1]$, whose diagonal form indicates that the eigenvalues correspond to the energy levels of the system in the computational basis $|q_{1}q_{2}q_{3}\rangle$ (where $q_{i}\in\{0,1\}$ for three qubits). The target state is $\rho_{d}\triangleq\rho_{\rm{GHZ}}$ in (13), and the observable $c$ is chosen to be

c=2(\sigma_{z}\otimes\sigma_{z}\otimes I_{2})+I_{2}\otimes\sigma_{z}\otimes\sigma_{z}=\text{diag}[3,1,-3,-1,-1,-3,1,3], (15)

to measure correlations between the $z$-components of different pairs of qubits in the three-qubit system, while ensuring that the target state is an eigenstate. The control Hamiltonians $H_{1}$ and $H_{2}$ are chosen as

H_{1}=I_{2}\otimes I_{2}\otimes\sigma_{x}+\sigma_{x}\otimes\sigma_{x}\otimes I_{2}, (16)

and

H_{2}=\sigma_{x}\otimes I_{2}\otimes I_{2}+I_{2}\otimes\sigma_{x}\otimes\sigma_{x}. (17)

Similar to the two-qubit system, the two control channels represented by $H_{1}$ and $H_{2}$, together with their strengths $u_{1}$ and $u_{2}$, provide control over the system's dynamics. The forms of the control Hamiltonians represent how the control action is applied to the three-qubit system. For example, $H_{1}$ represents an independent $x$-axis control on the third qubit and a correlated $x$-axis interaction between the first and second qubits.
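Analogously to the two-qubit case, the operators in (13)-(17) can be assembled from Kronecker products; the sketch below also checks the diagonal of $c$ against (15).

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
kron = lambda *ops: reduce(np.kron, ops)

H0 = np.diag([1, -1, -1, 1, 1, -1, -1, 1]).astype(complex)   # free Hamiltonian
c = 2 * kron(sz, sz, I2) + kron(I2, sz, sz)                   # observable, Eq. (15)
H1 = kron(I2, I2, sx) + kron(sx, sx, I2)                      # Eq. (16)
H2 = kron(sx, I2, I2) + kron(I2, sx, sx)                      # Eq. (17)

ghz = np.zeros(8, dtype=complex)
ghz[0] = ghz[7] = 1 / np.sqrt(2)                              # (|000> + |111>)/sqrt(2)
rho_ghz = np.outer(ghz, ghz.conj())                           # target state, Eq. (13)

assert np.allclose(np.diag(c).real, [3, 1, -3, -1, -1, -3, 1, 3])   # consistency check
```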

We test the trained DRL agent in various environments to evaluate its performance and robustness. The goal is to assess how well the agent generalizes its learned policies to different scenarios and how it copes with perturbations and variations in the environment. To achieve this, we expose the trained DRL agent to a set of diverse environments, each with unique characteristics and challenges. These environments are carefully designed to represent a wide range of scenarios and potential disturbances that the agent might encounter in real-world applications. During the testing phase, we measure the agent’s performance in terms of its ability to achieve the desired objectives and maintain stability in each environment. In particular, we examine its response to changes in the measurement efficiency ηc\eta_{c} and time delay disturbances to assess its robustness and adaptability.

We first investigate the "perfect case". In this paper, the "perfect case" assumes negligible delay in solving the SME (1) on classical computers and perfect detection, that is, measurement efficiency $\eta_{c}=1$. In contrast, situations with delay or imperfect detection are collectively referred to as "imperfect cases". We then show performance indications for the "imperfect cases".

V-B1 Stabilization of the GHZ state under perfect case

We initiate the testing phase to evaluate the ability of arbitrary initial states to stabilize to the target GHZ state within a specified time frame. As shown in Fig. 5, we employ the comparative approach of Section V-A, randomly selecting $50$ distinct initial states for control using the DRL agent and the Lyapunov method. The blue and orange lines correspond to the DRL method and the Lyapunov method, respectively. The average evolution time of the $2500$ trajectories guided by the DRL agent is $16\%$ shorter than that of the Lyapunov method (DRL: $10.41$ a.u. vs. Lyapunov: $12.33$ a.u.). This indicates that our DRL approach stabilizes quantum states to the target GHZ state more rapidly.


Figure 5: Evolution of the distance $D_{\rho_{t}}$ for $50$ random initial states stabilized to the target GHZ state under the control of the DRL agent (blue) and the Lyapunov method (orange). The average stabilization time under DRL control is $10.41$ a.u., while the Lyapunov method requires an average of $12.33$ a.u. (The stabilization time is defined as the time when the distance $D_{\rho_{t}}\leq 0.001$.) The light blue (orange) lines represent the average evolution trajectories for individual initial states, and the dark blue (orange) line represents the average trajectory across all initial states.

In addition, we explore the evolution of two specific initial states, $\rho_{0}^{1}=\text{diag}[0,1,0,0,0,0,0,0]$ and $\rho_{0}^{2}=\text{diag}[1,0,0,0,0,0,0,0]$, mentioned in [8], as examples. We repeat their stabilization $50$ times each to obtain averaged convergence curves that approximate the system's evolution. Figs. 6(a) and 6(b) depict the evolution of these two distinct initial states. The blue curve represents the evolution controlled by the DRL agent, while the orange curve represents the evolution controlled by the Lyapunov method from [8]. It can be observed that the well-trained DRL agent not only achieves stable convergence to the target state but also converges faster than the Lyapunov method.


Figure 6: Evolution of the distance $D_{\rho_{t}}$ when stabilizing two particular initial states, (a) $\rho_{0}^{1}=\text{diag}[0,1,0,0,0,0,0,0]$ and (b) $\rho_{0}^{2}=\text{diag}[1,0,0,0,0,0,0,0]$, to the target GHZ state under the control of the DRL agent (blue) and the Lyapunov method (orange). Each trajectory represents the average of 50 different stabilization processes for these specific initial states.

We randomly select a single trajectory under the control of the DRL agent with initial state $\rho_{0}^{1}$. The left subplot of Fig. 7 illustrates the evolution of $D_{\rho_{t}}$ along this trajectory, and the top images display the Wigner function of the system state at five different evolution times. The subplot on the right serves as a reference for the target three-qubit GHZ state. A comparison reveals that the phase-space distribution of the system state gradually approaches the target state over time, and at $t=20$ a.u. the system state coincides with the target state.


Figure 7: One specific (not averaged) evolution trajectory of the distance $D_{\rho_{t}}$ under the control of the DRL agent, starting from the initial state $\rho_{0}^{1}$. The top five images depict the Wigner function of the system state at different moments during the evolution. The figure on the right shows the phase-space probability distribution of the standard three-qubit GHZ state, which serves as a reference.

In practical agent training and application, uncertainties often exist. For example, the efficiency of measurements is typically not perfect, and there are frequently issues related to time delays in the feedback process. In the following two subsections we explore the robustness of our DRL agent to these two imperfections.

V-B2 Stabilization of the GHZ state with imperfect measurement

We first investigate the impact of reduced measurement efficiency, focusing on the robustness of an agent trained under the "perfect case" assumption. In this test, we consider a measurement efficiency of $\eta_{c}=0.8$, which represents a relatively high level achievable in current laboratory environments [41]. As shown in Fig. 8, both the DRL agent and the Lyapunov-based method successfully stabilize $50$ randomly selected initial states to the target GHZ state, and the DRL agent demonstrates slightly superior performance. These results highlight that the DRL agent, trained under ideal conditions, retains significant robustness even in the presence of reduced measurement efficiency.


Figure 8: With a measurement efficiency $\eta_{c}=0.8$, under the control of the DRL agent and the Lyapunov-based method, $50$ random initial states stably evolve to the target GHZ state. The light lines represent the average evolution trajectories for individual initial states, and the dark blue line represents the average trajectory across all initial states.

V-B3 Stabilization of the GHZ state with time delay

We evaluate the performance of the trained agent in the presence of time delays in the feedback process. In rapidly evolving quantum systems, the time required for classical computers to solve the SME (1) is often non-negligible. To account for this, we incorporate a fixed time compensation during agent testing. For example, assuming a time delay of $\mathcal{t}=0.05$ a.u., the agent only receives the initial state $\rho_{0}$ as input from $t=0$ a.u. to $t=0.05$ a.u., and generates control signals based on this state to guide the system's evolution. At $t=0.051$ a.u., the agent receives the state $\rho_{1}$, and subsequently at each step it processes the state $\rho_{t-0.05}$, which corresponds to the state from $0.05$ a.u. earlier. As illustrated in Fig. 9, using $50$ randomly selected initial states, we observe that both the trained DRL agent and the Lyapunov-based method handle time delays effectively, with the DRL agent demonstrating superior performance compared to the Lyapunov-based approach.
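A sketch of this delayed-feedback testing loop is given below: the agent always acts on the state estimate from $0.05$ a.u. earlier, while the true system keeps evolving. The one-step propagator and the trained agent are passed in as callables and are assumptions for illustration; `agent.predict` follows the Stable-Baselines3 interface.

```python
import numpy as np
from collections import deque

def run_delayed_episode(agent, propagate, rho0, dt=1e-3, delay=0.05, n_steps=40_000):
    """Roll out one trajectory in which the agent sees the state from `delay`
    time units earlier; `propagate(rho, action)` is a one-step SME solver."""
    flatten = lambda rho: np.concatenate([rho.real.ravel(), rho.imag.ravel()])
    delay_steps = int(round(delay / dt))
    buffer = deque([rho0] * (delay_steps + 1), maxlen=delay_steps + 1)
    rho = rho0
    for _ in range(n_steps):
        delayed_rho = buffer[0]                            # state the agent actually sees
        action, _ = agent.predict(flatten(delayed_rho), deterministic=True)
        rho = propagate(rho, action)                       # true (undelayed) evolution
        buffer.append(rho)
    return rho
```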


Figure 9: With a time delay $\mathcal{t}=0.05$ a.u., under the control of the DRL agent and the Lyapunov-based method, $50$ random initial states stably evolve to the target GHZ state. The light blue lines represent the average evolution trajectories for individual initial states, and the dark blue line represents the average trajectory across all initial states.

VI Analyzing the Advantages of PNR Reward Function

In this section, the three-qubit example is used to analyze the advantages of the PNR reward function. As described in Section IV-B, the PNR employs a partition parameter $d$ to divide the state space into two regions: the Proximity Zone ($D_{\rho_{t}}<d$) and the Exploration Zone ($D_{\rho_{t}}\geq d$). The reward function in each region follows the nonlinear form specified in (7).

The parameters for the PNR are set as follows:

  • $d=0.001$,  $\mathfrak{e}=2$,  $\mathfrak{f}=10$;

  • $\overline{R}_{P}=100$,  $\underline{R}_{P}=1$,  $\overline{R}_{E}=0$,  $\underline{R}_{E}=-0.1$.

To evaluate the effectiveness of the proposed approach, we consider several alternative reward functions for training the DRL agent. For fair comparisons, all agents are trained with the same parameters for a given target state, with the only difference being the reward functions used during training. The purpose of this analysis is to provide a clear and comprehensive understanding of the impact of the various modules of the designed reward function on the control performance, as well as to compare its performance with existing reward function designs.

The following assumptions are made:

  • The maximum evolution time of the system, denoted $T_{\text{max}}$, is set to $100$ a.u. Any trajectory in which the system does not converge to the target state within $T_{\text{max}}$ (i.e., $D_{\rho_{T_{\text{max}}}}>0.001$) is considered non-convergent. For the subsequent calculation of the average convergence time, the "convergence time" of these non-convergent trajectories is set to $100$ a.u.

  • A metric termed the Stabilization Success Rate is defined. During testing, $50$ distinct initial states are considered for stabilization to the target, with each initial state being stabilized under $50$ different environmental noise conditions. This results in a total of $50\times 50=2500$ individual trajectories. The Stabilization Success Rate is calculated as the ratio of the number of convergent trajectories within $T_{\text{max}}=100$ a.u. to the total number of trajectories ($2500$); both metrics can be computed as sketched below.
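A minimal sketch of these two evaluation metrics (function and argument names are ours):

```python
import numpy as np

def evaluate(convergence_times, t_max=100.0):
    """`convergence_times` holds one entry per trajectory: the first time at
    which D_rho <= 0.001, or np.inf if that never happens. Non-convergent
    trajectories are capped at t_max for the average, as described above."""
    raw = np.asarray(convergence_times, dtype=float)
    capped = np.minimum(raw, t_max)            # cap non-convergent runs at t_max
    success_rate = np.mean(raw <= t_max)       # Stabilization Success Rate
    return capped.mean(), success_rate
```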

VI-A Reward Function Designs

The reward functions tested are categorized as follows:

  • Partitioned Nonlinear Reward 1 (PNR1): This follows the structure of PNR, with the only difference being the parameters $\mathfrak{e}=10$ and $\mathfrak{f}=2$, such that the slope decreases as the distance diminishes.

  • Partitioned Linear Reward (PLR): This design mirrors the partitioned structure of PNR but employs a linear reward function in each region instead of a nonlinear one. The bounds remain consistent with PNR.

  • Partitioned Sparse Reward (PSR): [16, 17] use a sparse reward structure that provides rewards only at the last time step. We consider similar design ideas, retaining the partition structure for consistency:

    • In the Proximity Zone, the reward is a fixed positive constant.

    • In the Exploration Zone, no reward is assigned.

  • Non-Partitioned Nonlinear Reward (NPNR): This reward design does not partition the state space. Common reward function designs, such as those in [21, 22], apply a uniform reward structure across all states, treating the entire state space equally without distinguishing between different regions. The NPNR reward function adopts the nonlinear form described in (7), defined over the entire state space $D_{\rho_{t}}\in[0,1]$. The rewards are non-positive, ranging over $[-1,0]$, with the penalty magnitude decreasing as $D_{\rho_{t}}$ gets closer to $0$.

  • Non-Partitioned Linear Negative Reward (NPLNR): Similar to NPNR, this design does not partition the state space. The reward function is linearly defined over D_{\rho_{t}}\in[0,1] and is always non-positive, in [-1,0]. The penalty decreases linearly as D_{\rho_{t}} approaches 0.

  • Non-Partitioned Linear Positive Reward (NPLPR): The only distinction between this design and NPLNR is that the non-partitioned reward values are strictly positive, ranging over [0,100]. The reward increases linearly as D_{\rho_{t}} approaches 0.

  • Fidelity-Based Positive Reward (FPR): This non-partitioned reward is based on fidelity, a common approach in machine learning for quantum measurement feedback problems [21, 22]. Specifically, we adopt the reward function proposed in [22]:

    r=F(t)^{4}+4\cdot F(t)^{25}, \qquad (18)

    where F(t) is the fidelity at time t. The reward is positive and increases as D_{\rho_{t}} approaches 0 (see the one-line translation below).
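
Since (18) depends only on the fidelity, it translates directly into a one-line reward function; the sketch below assumes F(t) is available from the current state estimate at each step.

```python
def fpr_reward(fidelity: float) -> float:
    """Fidelity-based reward of Eq. (18): r = F(t)^4 + 4 * F(t)^25."""
    return fidelity ** 4 + 4.0 * fidelity ** 25
```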

VI-B Performance Comparison

To facilitate comparison, the reward functions are categorized into three types. Our PNR serves as the baseline for evaluating and comparing these categories.

VI-B1 Category 1: Partitioned Nonlinear Rewards

This category contains PNR1. The effect of the parameters \mathfrak{e} and \mathfrak{f} on the performance of the agent is evaluated. Using 50 random initial states, each averaged over 50 trajectories, the results in Fig. 10 show that PNR1 performs worse than PNR. The average convergence time for PNR1 is 12.19 a.u., compared with 10.41 a.u. for PNR. Additionally, the convergence success rate for PNR1 is 99.2%, lower than the 100% success rate of PNR. These results suggest that setting \mathfrak{e} smaller than \mathfrak{f} is more effective, i.e., the reward function's steepness should increase as the target is approached.


Figure 10: The effect of the \mathfrak{e} and \mathfrak{f} parameters in the PNR reward function on the performance of the DRL algorithm. PNR: \mathfrak{e}=2, \mathfrak{f}=10; PNR1: \mathfrak{e}=10, \mathfrak{f}=2.

VI-B2 Category 2: Partitioned Linear and Sparse Rewards

This class includes PLR and PSR, which differ from PNR only in the form of the reward function, being linear and sparse, respectively. As shown in Fig. 11, PLR achieves relatively good performance, with a convergence time of 11.80 a.u. and a success rate of 99.76%, although it still underperforms PNR. This highlights the advantage of the designed nonlinear reward function. In contrast, PSR performs poorly, as sparse rewards alone are insufficient for complex systems.


Figure 11: Effect of the reward-function form on the DRL algorithm when the state space is partitioned. PNR: partitioned nonlinear reward; PLR: partitioned linear reward; PSR: partitioned sparse reward.

VI-B3 Category 3: Non-Partitioned Rewards

This category evaluates NPNR, NPLNR, NPLPR, and FPR. These non-partitioned reward functions include linear, nonlinear, and fidelity-based forms, and the reward values are either positively encouraging (the positive reward increases toward the target) or negatively spurring (the negative penalty decreases toward the target). Fig. 12 shows that among these designs, NPNR and NPLNR perform best, with convergence times of 13.60 a.u. and 18.48 a.u. and success rates of 99.88% and 99.56%, respectively. These results suggest that decreasing negative penalties near the target is more effective than increasing positive rewards. In addition, NPNR outperforming NPLNR highlights the role of nonlinear reward functions. The commonly used FPR performs poorly, with a convergence time of 43.59 a.u. and a success rate of 97.72%, highlighting its limitations in stabilizing complex quantum states. NPLPR's strategy of increasing positive rewards is nearly ineffective, which is consistent with Remark 1.


Figure 12: Performance of the DRL algorithm with non-partitioned reward functions. NPNR: nonlinear negative reward; NPLNR: linear negative reward; NPLPR: linear positive reward; FPR: fidelity-based reward.

Our PNR outperforms other reward designs, demonstrating superior performance in stabilizing complex quantum states like the GHZ state. Nonlinear reward functions, especially when combined with partitioning, are shown to be more effective than linear or sparse reward structures in achieving faster convergence and higher success rates.

Table I summarizes the parameters and characteristics of these reward function designs, as well as their performance.

TABLE I: Comparison of Reward Functions
Reward Type | Partitioned | \mathfrak{e},\mathfrak{f} | Reward Function | PZ Rewards | EZ Rewards | NP Rewards | Time (a.u.) | Success Rate
PNR | Yes | \mathfrak{e}=2,\mathfrak{f}=10 | Nonlinear | [1, 100] | [-0.1, 0] | \ | 10.41 | 100%
PNR1 | Yes | \mathfrak{e}=10,\mathfrak{f}=2 | Nonlinear | [1, 100] | [-0.1, 0] | \ | 12.19 | 99.2%
PLR | Yes | \ | Linear | [1, 100] | [-0.1, 0] | \ | 11.80 | 99.76%
PSR | Yes | \ | Sparse | 1 | 0 | \ | 53.25 | 62.08%
NPNR | No | \ | Nonlinear | \ | \ | [-1, 0] | 13.60 | 99.88%
NPLNR | No | \ | Linear | \ | \ | [-1, 0] | 18.48 | 99.56%
NPLPR | No | \ | Linear | \ | \ | [0, 100] | 97.23 | 3.08%
FPR | No | \ | Fidelity-Based | \ | \ | Fidelity | 43.59 | 97.72%
PZ (EZ) Rewards: reward values in the Proximity (Exploration) Zone;  NP Rewards: reward values when not partitioned.

VII Conclusions

In this work, we designed a DRL agent and applied it to measurement-based feedback control for stabilizing quantum entangled states. With a designed reward function, the trained agent can quickly stabilize the quantum system to the target state with high fidelity and demonstrates strong robustness.

The DRL agent was tested under a “perfect case,” and its performance was compared mainly with a Lyapunov-based switching method and several other DRL algorithms with different reward functions. The results showed that our approach achieves comparable fidelity in a shorter time, potentially mitigating noise caused by prolonged interactions. The agent was also evaluated under “imperfect cases,” including low measurement efficiency and feedback delay, and it maintained strong performance under these challenging conditions.

To analyze our method, we conducted ablation studies on the reward function to identify the role of each component in achieving the final performance. We also compared our reward function with fidelity-based designs from the literature and several variants of our reward function, showing the superior performance of our approach.

The proposed DRL-based framework has the potential to be extended to stabilize any multi-qubit system. Naturally, it is expected that as the system dimension increases, both the training time and the time required for state estimation using the SME will grow. Future research could explore methods to reduce the dimensionality of the state space and enhance the speed of quantum state estimation based on the SME. Furthermore, developing control strategies that rely on less system information, such as using only measurement currents, may allow for efficient stabilization of the target state with sufficient fidelity. These directions aim to further improve the practical applicability of the proposed control framework.

Appendix: PPO Algorithm

In the field of DRL, the PPO algorithm has become a prominent method due to its effectiveness and stability in policy optimization. This appendix elucidates the core algorithmic concepts of PPO and provides a detailed description of its mechanisms, with the aim of giving the reader a comprehensive understanding of how the algorithm works and facilitating its application in practical scenarios. The core of PPO is to maximize the expected cumulative return by optimizing the policy parameters while ensuring that the updates remain stable. The following paragraphs delve into the specifics of these algorithmic ideas.

-A Core Algorithmic Ideas of PPO

The probability of each trajectory is multiplied by its corresponding cumulative reward, and the sum of these products yields the expected reward. The probability of a specific trajectory \tau occurring under policy parameters \theta is defined as:

p_{\theta}(\tau)=\prod_{t=0}^{m}p_{\theta}(a_{t}|\rho_{t})\,p(\rho_{t+1}|\rho_{t},a_{t}). \qquad (19)

We use p(\text{EVENT}) to denote the probability of the occurrence of EVENT. For example, p_{\theta}(a_{t}|\rho_{t}) is the probability of the agent choosing action a_{t} given \rho_{t}, while p(\rho_{t+1}|\rho_{t},a_{t}) is the probability of the environment transitioning to \rho_{t+1} from \rho_{t} when action a_{t} is applied.

When the parameter \theta is given, the expected value of the total reward, denoted J(\theta), is evaluated as the weighted sum over sampled trajectories \tau, expressed by

J(\theta)=\sum_{\tau}R(\tau)p_{\theta}(\tau):=\mathbb{E}_{\tau\sim p_{\theta}(\tau)}[R(\tau)]. \qquad (20)

To maximize J(\theta), so that the chosen policy parameters \theta lead to higher average rewards, we adopt gradient ascent. Thus, we take the derivative of the expected reward J(\theta) in (20), which yields (21):

\begin{aligned}
\nabla J(\theta) &= \sum_{\tau}R(\tau)\nabla p_{\theta}(\tau)\\
&= \sum_{\tau}R(\tau)p_{\theta}(\tau)\nabla\log p_{\theta}(\tau)\\
&= \sum_{\tau}R(\tau)p_{\theta}(\tau)\sum_{t=0}^{m}\nabla\log p_{\theta}(a_{t}|\rho_{t}) \qquad (21)\\
&= \mathbb{E}_{\tau\sim p_{\theta}(\tau)}\Big[R(\tau)\sum_{t=0}^{m}\nabla\log p_{\theta}(a_{t}|\rho_{t})\Big]\\
&\approx \mathbb{E}_{(\rho_{t},a_{t})\sim\pi_{\theta}}\Big[\nabla\log p_{\theta}(a_{t}|\rho_{t})\,A^{\theta}(\rho_{t},a_{t})\Big].
\end{aligned}

We use \nabla f(x)=f(x)\nabla\log f(x) to derive the second line of (21). The last, approximate equality reflects practical gradient computation: instead of calculating the expected reward for an entire trajectory, the rewards contributed by each individual state-action pair (\rho,a) are computed separately and then summed to obtain the total cumulative reward used for optimization. The policy update of \pi_{\theta}(a|\rho) is therefore biased towards state-action pairs that contribute to higher cumulative rewards within the sequence. For instance, if an action a executed in state \rho leads to a positive cumulative discounted reward, the subsequent update will increase the probability of choosing action a in state \rho, while diminishing the likelihood of selecting other actions. The update equation for the parameters \theta is:

\theta=\theta+\eta\nabla J(\theta), \qquad (22)

where \eta is the learning rate.
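
The update in (21)-(22) maps onto a few lines of code. The sketch below is a generic policy-gradient step under stated assumptions: `policy` maps a batch of states to a torch.distributions object whose `log_prob` returns one value per sample, and the advantage estimates are supplied by the caller; none of these names come from the paper.

```python
import torch

# Sketch of the vanilla policy-gradient update of (21)-(22): the gradient of
# J(theta) is estimated from sampled state-action pairs weighted by an
# advantage estimate, then theta is moved along that gradient.
def policy_gradient_step(policy, optimizer, states, actions, advantages):
    dist = policy(states)                    # pi_theta(a | rho)
    log_probs = dist.log_prob(actions)       # assumed: one value per sample
    loss = -(log_probs * advantages).mean()  # minimizing -J ascends J
    optimizer.zero_grad()
    loss.backward()                          # grad log pi * A, as in (21)
    optimizer.step()                         # theta <- theta + eta * grad J
```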

Once the policy \pi_{\theta}(a|\rho) is updated, new training data must be collected before the next policy update, because the modified policy changes the probability distribution p_{\theta}(\tau). After sampling, the parameter \theta is refined and all previously collected data are discarded; subsequent updates require fresh data. This is the fundamental principle underlying the conventional PG algorithm. However, for quantum systems, sampling system information is often time-consuming and computationally demanding. For instance, after each measurement, a classical computer must solve the SME (1) to obtain the system state at the next time step. The inability to reuse previously acquired data therefore leads to a protracted training process.

To address this challenge, an additional policy \pi_{\theta^{\prime}}, mirroring the architecture of \pi_{\theta}(a|\rho), is introduced. Instead of interacting with the environment directly, the primary agent \pi_{\theta} lets the auxiliary agent \pi_{\theta^{\prime}} interact with the environment and collect data, which are then used to train \pi_{\theta} multiple times, effectively reducing the computational and resource demands of data collection. To ensure that the data sampled by \pi_{\theta^{\prime}} remain consistent with \pi_{\theta}, importance sampling [42] is introduced. This approach enhances data reuse and the overall efficiency of the training procedure. Equation (21) is then updated as:

\begin{aligned}
\nabla J(\theta) &= \mathbb{E}_{(\rho_{t},a_{t})\sim\pi_{\theta^{\prime}}}\Big[\frac{p_{\theta}(\rho_{t},a_{t})}{p_{\theta^{\prime}}(\rho_{t},a_{t})}\nabla\log p_{\theta}(a_{t}|\rho_{t})A^{\theta^{\prime}}(\rho_{t},a_{t})\Big]\\
&= \mathbb{E}_{(\rho_{t},a_{t})\sim\pi_{\theta^{\prime}}}\Big[\frac{p_{\theta}(a_{t}|\rho_{t})p_{\theta}(\rho_{t})}{p_{\theta^{\prime}}(a_{t}|\rho_{t})p_{\theta^{\prime}}(\rho_{t})}\nabla\log p_{\theta}(a_{t}|\rho_{t})A^{\theta^{\prime}}(\rho_{t},a_{t})\Big]\\
&= \mathbb{E}_{(\rho_{t},a_{t})\sim\pi_{\theta^{\prime}}}\Big[\frac{p_{\theta}(a_{t}|\rho_{t})}{p_{\theta^{\prime}}(a_{t}|\rho_{t})}\nabla\log p_{\theta}(a_{t}|\rho_{t})A^{\theta^{\prime}}(\rho_{t},a_{t})\Big]. \qquad (23)
\end{aligned}

Here, all the state-action pairs (\rho_{t},a_{t}) (or, equivalently, all trajectories \tau\in\mathbb{D}) are sampled from \pi_{\theta^{\prime}}; in the last line, the state-distribution ratio p_{\theta}(\rho_{t})/p_{\theta^{\prime}}(\rho_{t}) is approximated as 1 since the two policies are kept close. The term p_{\theta}(\rho_{t},a_{t})/p_{\theta^{\prime}}(\rho_{t},a_{t}) is the importance weight, which reweights the data sampled by \pi_{\theta^{\prime}} so as to estimate the expectation under the target policy \pi_{\theta} more accurately.

The corresponding objective function from (23) can be calculated as:

J^{\theta^{\prime}}(\theta)=\mathbb{E}_{(\rho_{t},a_{t})\sim\pi_{\theta^{\prime}}}\Big[\frac{p_{\theta}(a_{t}|\rho_{t})}{p_{\theta^{\prime}}(a_{t}|\rho_{t})}A^{\theta^{\prime}}(\rho_{t},a_{t})\Big]. \qquad (24)

Nonetheless, in the absence of a constraint, when A^{\theta^{\prime}}(\rho_{t},a_{t})>0, indicating a desirable state-action combination, the agent tends to increase its likelihood, effectively amplifying the ratio p_{\theta}(a_{t}|\rho_{t})/p_{\theta^{\prime}}(a_{t}|\rho_{t}). This can lead to inaccurate policy learning and an erratic learning process, impeding convergence. To counteract this, PPO introduces a pivotal mechanism, the “clip ratio,” which confines the ratio between the new and the previous policy, thereby ensuring consistency and improving the algorithm's reliability. The following equation gives the PPO clipping objective, in which the clipping term bounds the difference between p_{\theta} and p_{\theta^{\prime}} during the policy update:

J_{\rm PPO}^{\theta^{\prime}}(\theta)\approx\mathbb{E}_{(\rho_{t},a_{t})\sim\pi_{\theta^{\prime}}}\min\Big(\varrho A^{\theta^{\prime}}(\rho_{t},a_{t}),\ {\rm clip}(\varrho,1-\varsigma,1+\varsigma)A^{\theta^{\prime}}(\rho_{t},a_{t})\Big), \qquad (25)

where \varrho=p_{\theta}(a_{t}|\rho_{t})/p_{\theta^{\prime}}(a_{t}|\rho_{t}). The bounds 1-\varsigma and 1+\varsigma in the clip function limit the range of the ratio \varrho, and \varsigma is a hyperparameter, typically set to 0.1 or 0.2. Exhaustively enumerating all possible trajectories is infeasible, so in practical training the objective function (25) is usually estimated as:

J_{\rm PPO}^{\theta^{\prime}}(\theta)\approx\frac{1}{Nm}\sum_{\tau\in\mathbb{D}}\sum_{t=0}^{m}\min\Big(\varrho A^{\theta^{\prime}}(\rho_{t},a_{t}),\ {\rm clip}(\varrho,1-\varsigma,1+\varsigma)A^{\theta^{\prime}}(\rho_{t},a_{t})\Big), \qquad (26)

where N and m are finite positive integers denoting the number of collected trajectories and the maximum number of steps within each trajectory, respectively.
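
The clipped surrogate in (25)-(26) is straightforward to express in code. The sketch below is a generic implementation of the standard PPO clipping term, not the paper's code; `log_probs_new`, `log_probs_old`, and `advantages` are assumed inputs corresponding to \pi_{\theta}, \pi_{\theta^{\prime}}, and A^{\theta^{\prime}}(\rho_{t},a_{t}), and `clip_eps` plays the role of \varsigma.

```python
import torch

# Sketch of the clipped surrogate objective in (25)-(26).
def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)              # varrho
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The sign is flipped so that maximizing J_PPO becomes minimizing a loss.
    return -torch.min(unclipped, clipped).mean()
```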

Based on the above, PPO maintains three sets of network parameters in its update strategy (a minimal illustration of the parameter bookkeeping follows the list):

  • One set of main policy parameters \theta, which are updated at every optimization step.

  • One set of policy parameter copies \theta^{\prime}, which interact with the environment and collect data. Importance sampling allows these data to be used to update the main policy parameters \theta. Typically, \theta^{\prime} is synchronized with \theta only after several updates of \theta have been performed.

  • One set of value network parameters \phi, which are updated at every step from the collected data via supervised learning to refine the evaluation of states.
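
The sketch below illustrates only the bookkeeping of \theta and \theta^{\prime}; the layer sizes and module names are placeholders and do not come from the paper.

```python
import copy
import torch.nn as nn

# theta' starts as a copy of theta, gathers data while theta is updated several
# times, and is only then synchronized with the latest theta.
policy = nn.Sequential(nn.Linear(4, 128), nn.Tanh(), nn.Linear(128, 2))  # theta
policy_old = copy.deepcopy(policy)                                       # theta'

# ... several gradient updates of `policy` using data collected by `policy_old`,
#     reweighted by importance sampling as in (23)-(26) ...

policy_old.load_state_dict(policy.state_dict())   # synchronize theta' <- theta
```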

-B Generalized Parameter Selection for Training Agents

Neural Network Architecture: Each policy \pi_{\theta} is represented by a neural network that maps a given state \rho to a probability distribution over actions a. The action distribution is modeled as a Gaussian distribution. The input layer is followed by two fully connected hidden layers, each with 128 neurons, and a linear output layer whose dimension equals that of the action space (2 in this paper). All hidden layers use the Tanh activation function. The value function V^{\phi}(\rho_{t}) uses a similar neural network architecture, the only difference being that the output layer is a single linear unit estimating the state value. The value function V^{\phi}(\rho_{t}) is estimated using the temporal-difference (TD) method [43]. The generalized advantage estimator (GAE) [44] is then employed to compute the advantage function in (8), which is subsequently used in (25) to calculate the gradient for updating the policy \pi_{\theta}.

Learning Rate: The learning rate is a hyperparameter that determines the step size of the algorithm's updates based on observed rewards and experiences during training. In our training process, the learning rate \eta is not constant but follows a linear schedule: it starts at \eta=5\times 10^{-7} and decreases linearly over the course of training. This allows the algorithm to explore more in the early stages, when the policy may be far from optimal; as training progresses and the policy approaches convergence, the decreasing learning rate promotes stability and fine-tunes the policy around the optimal solution, helping the DRL agent achieve better performance and stability. Please refer to [43, 45] for more details.
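
Since Stable-Baselines3 [34] is cited for the implementation, the architecture and learning-rate schedule above can be expressed, for instance, as the configuration sketched below. "QuantumFeedbackEnv-v0" is a placeholder for a registered Gymnasium environment wrapping the SME dynamics, and total_timesteps is illustrative.

```python
import torch
from stable_baselines3 import PPO

# Sketch of the described hyperparameters as a Stable-Baselines3 configuration.
def linear_schedule(initial_lr=5e-7):
    # Stable-Baselines3 calls this with the remaining progress (1.0 -> 0.0).
    return lambda progress_remaining: initial_lr * progress_remaining

model = PPO(
    "MlpPolicy",
    "QuantumFeedbackEnv-v0",                           # placeholder env id
    policy_kwargs=dict(
        net_arch=dict(pi=[128, 128], vf=[128, 128]),   # two hidden layers of 128
        activation_fn=torch.nn.Tanh,                   # Tanh hidden activations
    ),
    learning_rate=linear_schedule(),                   # linearly decaying eta
    clip_range=0.2,                                    # PPO clipping parameter
)
model.learn(total_timesteps=1_000_000)                 # illustrative budget
```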

References

  • [1] D. Dong and I. R. Petersen, Learning and Robust Control in Quantum Technology.   Springer Nature, 2023.
  • [2] A. Karlsson and M. Bourennane, “Quantum teleportation using three-particle entanglement,” Physical Review A, vol. 58, no. 6, p. 4394, 1998.
  • [3] A. Ekert and R. Jozsa, “Quantum algorithms: entanglement–enhanced information processing,” Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, vol. 356, no. 1743, pp. 1769–1782, 1998.
  • [4] R. Jozsa and N. Linden, “On the role of entanglement in quantum-computational speed-up,” Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, vol. 459, no. 2036, pp. 2011–2032, 2003.
  • [5] X. Gu, L. Chen, A. Zeilinger, and M. Krenn, “Quantum experiments and graphs. III. high-dimensional and multiparticle entanglement,” Physical Review A, vol. 99, no. 3, p. 032338, 2019.
  • [6] S. Kuang, D. Dong, and I. R. Petersen, “Rapid Lyapunov control of finite-dimensional quantum systems,” Automatica, vol. 81, pp. 164–175, 2017.
  • [7] Y. Liu, D. Dong, S. Kuang, I. R. Petersen, and H. Yonezawa, “Two-step feedback preparation of entanglement for qubit systems with time delay,” Automatica, vol. 125, p. 109174, 2021.
  • [8] Y. Liu, S. Kuang, and S. Cong, “Lyapunov-based feedback preparation of GHZ entanglement of N-qubit systems,” IEEE Transactions on Cybernetics, vol. 47, no. 11, pp. 3827–3839, 2016.
  • [9] R. S. Judson and H. Rabitz, “Teaching lasers to control molecules,” Physical Review Letters, vol. 68, no. 10, p. 1500, 1992.
  • [10] D. Dong and I. R. Petersen, “Quantum estimation, control and learning: opportunities and challenges,” Annual Reviews in Control, vol. 54, pp. 243–251, 2022.
  • [11] D. Dong, “Learning control of quantum systems,” in Encyclopedia of Systems and Control, J. Baillieul and T. Samad, Eds.   Springer London, 2020, https://doi.org/10.1007/978-1-4471-5102-9_100161-1.
  • [12] S. Sharma, H. Singh, and G. G. Balint-Kurti, “Genetic algorithm optimization of laser pulses for molecular quantum state excitation,” The Journal of Chemical Physics, vol. 132, no. 6, p. 064108, 2010.
  • [13] O. M. Shir, Niching in derandomized evolution strategies and its applications in quantum control.   Leiden University, 2008.
  • [14] D. Dong, M. A. Mabrok, I. R. Petersen, B. Qi, C. Chen, and H. Rabitz, “Sampling-based learning control for quantum systems with uncertainties,” IEEE Transactions on Control Systems Technology, vol. 23, no. 6, pp. 2155–2166, 2015.
  • [15] H. M. Wiseman, “Quantum theory of continuous feedback,” Physical Review A, vol. 49, no. 3, p. 2133, 1994.
  • [16] M. Bukov, A. G. Day, D. Sels, P. Weinberg, A. Polkovnikov, and P. Mehta, “Reinforcement learning in different phases of quantum control,” Physical Review X, vol. 8, no. 3, p. 031086, 2018.
  • [17] V. Sivak, A. Eickbusch, H. Liu, B. Royer, I. Tsioutsios, and M. Devoret, “Model-free quantum control with reinforcement learning,” Physical Review X, vol. 12, no. 1, p. 011059, 2022.
  • [18] M. M. Wauters, E. Panizon, G. B. Mbeng, and G. E. Santoro, “Reinforcement-learning-assisted quantum optimization,” Physical Review Research, vol. 2, no. 3, p. 033446, 2020.
  • [19] J. Yao, L. Lin, and M. Bukov, “Reinforcement learning for many-body ground-state preparation inspired by counterdiabatic driving,” Physical Review X, vol. 11, no. 3, p. 031070, 2021.
  • [20] S. Borah, B. Sarma, M. Kewming, G. J. Milburn, and J. Twamley, “Measurement-based feedback quantum control with deep reinforcement learning for a double-well nonlinear potential,” Physical Review Letters, vol. 127, no. 19, p. 190403, 2021.
  • [21] R. Porotti, A. Essig, B. Huard, and F. Marquardt, “Deep reinforcement learning for quantum state preparation with weak nonlinear measurements,” Quantum, vol. 6, p. 747, 2022.
  • [22] A. Perret and Y. Bérubé-Lauzière, “Preparation of cavity-Fock-state superpositions by reinforcement learning exploiting measurement backaction,” Physical Review A, vol. 109, no. 2, p. 022609, 2024.
  • [23] A. C. Doherty, S. Habib, K. Jacobs, H. Mabuchi, and S. M. Tan, “Quantum feedback control and classical control theory,” Physical Review A, vol. 62, no. 1, p. 012105, 2000.
  • [24] K. Jacobs and D. A. Steck, “A straightforward introduction to continuous quantum measurement,” Contemporary Physics, vol. 47, no. 5, pp. 279–303, 2006.
  • [25] H. M. Wiseman and G. J. Milburn, Quantum Measurement and Control.   Cambridge University Press, 2009.
  • [26] M. Van Otterlo and M. Wiering, “Reinforcement learning and Markov decision processes,” Reinforcement Learning: State-of-the-Art, pp. 3–42, 2012.
  • [27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
  • [28] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Advances in Neural Information Processing Systems, vol. 12, 1999.
  • [29] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” Advances in Neural Information Processing Systems, vol. 12, 1999.
  • [30] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [31] B. Qi, Z. Hou, L. Li, D. Dong, G. Xiang, and G. Guo, “Quantum state tomography via linear regression estimation,” Scientific Reports, vol. 3, no. 1, p. 3496, 2013.
  • [32] J. A. Smolin, J. M. Gambetta, and G. Smith, “Efficient method for computing the maximum-likelihood quantum state from measurements with additive Gaussian noise,” Physical Review Letters, vol. 108, no. 7, p. 070502, 2012.
  • [33] D. Hadfield-Menell, S. Milli, P. Abbeel, S. J. Russell, and A. Dragan, “Inverse reward design,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [34] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021.
  • [35] M. Towers, A. Kwiatkowski, J. Terry, J. U. Balis, G. De Cola, T. Deleu, M. Goulão, A. Kallinteris, M. Krimmel, A. KG et al., “Gymnasium: A standard interface for reinforcement learning environments,” arXiv preprint arXiv:2407.17032, 2024.
  • [36] M. Mirrahimi and R. van Handel, “Stabilizing feedback controls for quantum systems,” SIAM Journal on Control and Optimization, vol. 46, no. 2, pp. 445–467, 2007.
  • [37] D. M. Greenberger, M. A. Horne, and A. Zeilinger, “Going beyond Bell’s theorem,” in Bell’s Theorem, Quantum Theory and Conceptions of the Universe.   Springer, 1989, pp. 69–72.
  • [38] W. Dür, G. Vidal, and J. I. Cirac, “Three qubits can be entangled in two inequivalent ways,” Physical Review A, vol. 62, no. 6, p. 062314, 2000.
  • [39] T. Monz, P. Schindler, J. T. Barreiro, M. Chwalla, D. Nigg, W. A. Coish, M. Harlander, W. Hänsel, M. Hennrich, and R. Blatt, “14-qubit entanglement: Creation and coherence,” Physical Review Letters, vol. 106, no. 13, p. 130506, 2011.
  • [40] S. Kuang, G. Li, Y. Liu, X. Sun, and S. Cong, “Rapid feedback stabilization of quantum systems with application to preparation of multiqubit entangled states,” IEEE Transactions on Cybernetics, vol. 52, no. 10, pp. 11213–11225, 2021.
  • [41] J. Zhang, Y.-x. Liu, R.-B. Wu, K. Jacobs, and F. Nori, “Quantum feedback: theory, experiments, and applications,” Physics Reports, vol. 679, pp. 1–60, 2017.
  • [42] T. Xie, Y. Ma, and Y.-X. Wang, “Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [43] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [44] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
  • [45] X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 1–14, 2018.