
Accelerated Magnonic Motional Cooling with Deep Reinforcement Learning

Bijita Sarma (bijita.sarma@oist.jp), Sangkha Borah, A Kani, and Jason Twamley
Quantum Machines Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa 904-0495, Japan
Abstract

Achieving fast cooling of motional modes is a prerequisite for leveraging such bosonic quanta for high-speed quantum information processing. In this work, we address how to reduce the cooling time below the limit set by conventional sideband cooling techniques, and propose a scheme based on deep reinforcement learning (DRL) to achieve this. In particular, we show how the scheme can be used effectively to accelerate the dynamic motional cooling of a macroscopic magnonic sphere, and how it can be extended uniformly to more complex systems, for example a tripartite opto-magno-mechanical system, to cool the motional mode below the time bound of coherent cooling. While conventional sideband cooling methods do not work beyond the well-known rotating wave approximation (RWA) regimes, our proposed DRL scheme can be applied uniformly to regimes operating within and beyond the RWA, and thus offers a new and complete toolkit for the rapid control and generation of macroscopic quantum states for applications in quantum technologies.

Deep learning; Reinforcement learning; Ground state cooling; Quantum control; Magnomechanics; Trapping and cooling

Introduction.- Fast cooling of bosonic mechanical modes of macroscopic systems is a primary objective of ongoing efforts in quantum technology Schäfermeier et al. (2016); Guo et al. (2019); Park and Wang (2009); Clark et al. (2017); Frimmer et al. (2016), and is a prerequisite for diverse prospective applications, such as the realization of macroscopic superposition states Abdi et al. (2016), gravitational tests of decoherence Bose et al. (1999); Marshall et al. (2003), ultra-precise measurements and sensing Schliesser et al. (2009); Manley et al. (2021), and bosonic quantum computing Bourassa et al. (2021). Macroscopic yttrium-iron-garnet (YIG, $\mathrm{Y_3Fe_5O_{12}}$) magnets have recently attracted strong interest for such applications given their versatility in coupling to other modes across a wide frequency spectrum, e.g., optical, microwave, and acoustic modes, as well as to superconducting qubits Zhang et al. (2016); Lachance-Quirion et al. (2019); Wang and Hu (2020); Li et al. (2018); Zhang et al. (2014); Wang et al. (2018). In addition, highly polished YIG spheres feature a high magnonic $Q$-factor and exhibit large frequency tunability due to the magnetic field dependence of the excited magnon modes. Considering, in particular, the cooling of the motional modes of such objects, the usual method of sideband cooling, based on a weak magnomechanical interaction, operates on a time scale longer than the mechanical period of oscillation and depends on the relaxation dynamics of the subsystems. Cooling the bosonic modes of such systems on a timescale shorter than the mode oscillation period is highly advantageous for quantum computation and bosonic error correction Joshi et al. (2021); Bourassa et al. (2021). By going over to the strong coupling regime of the magnon-phonon interaction, the speed of motional cooling can be greatly enhanced, giving rise to accelerated cooling. However, in such strong coupling regimes energy-nonconserving dynamics prevails because of the simultaneous presence of counter-rotating interactions, which makes it impossible to use sideband motional cooling techniques. In this work, we explore a machine-learning-based approach to reduce the cooling time below the limit set by conventional cooling techniques.

Recently, various machine learning (ML) approaches, aided by artificial neural networks as function approximators, have found widespread technological applications Goodfellow et al. (2016). Among the various ML approaches, reinforcement learning (RL) Sutton and Barto (2018) is considered to bear the closest resemblance to human-like learning, in that the RL-agent gathers experience on its own by interacting with its environment in a trial-and-error fashion. In RL terminology, the environment describes the virtual or real world surrounding the agent, with all the physics hard-coded into it along with a reward function based on which the agent can distinguish its good moves from the bad ones. RL operated in combination with artificial neural networks is known as deep reinforcement learning (DRL). DRL has become crucial in many industrial and engineering applications, primarily after recent seminal works by Google DeepMind researchers Silver et al. (2016, 2017). Following these developments, there have been several fascinating applications of DRL in various fundamental domains of science, including, in quantum physics, quantum error correction Carleo et al. (2019); Bukov et al. (2018); Fösel et al. (2018), quantum control Borah et al. (2021), and state engineering Wang et al. (2020); Niu et al. (2019); Zhang et al. (2019a); Xu et al. (2021); Zhang et al. (2019b); Mackeprang et al. (2020); Haug et al. (2020); Guo et al. (2021); Bilkis et al. (2020); Porotti et al. (2019).

In this work, we propose a DRL-based dynamical coupling scheme for accelerated motional cooling of a macroscopic object that works for general parameter settings of the coupling strength between the subsystems. In particular, we use the protocol to cool the acoustic phonon modes of a YIG sphere via a magno-mechanical interaction, and show that it works efficiently in the strong coupling regime, where other methods such as sideband cooling fail. We also show how going over to the strong coupling regime is particularly advantageous, as it lowers the cooling time well below the phonon oscillation period and two orders of magnitude below the sideband cooling time limit. We demonstrate the usefulness and generalizability of our DRL cooling protocol by extending its application to a tripartite system of a trapped YIG magnet whose magnonic modes couple to the centre-of-mass (COM) mode in the trap and to an optical cavity mode, and show that, despite the system being in the ultrastrong coupling regime, our DRL scheme can find nontrivial coupling modulations that cool the motional mode, which is usually not possible with coherent counter-intuitive protocols.

Figure 1: (a) The schematic workflow of the DRL protocol is shown, where the RL-environment is either the bipartite magno-mechanical system (b), or the tripartite opto-magno-mechanical system (c). See text for further detail on DRL and the physical models.

Bipartite magno-mechanical cooling.- We consider a setup as shown in Fig. 1(b), where a highly polished YIG sphere is placed in a microwave cavity. With a large external homogeneous magnetic field, $B_0$, applied in the $\hat{z}$-direction, the YIG sphere is magnetized to its saturation magnetization, $M_s=5.87\times 10^{5}~\mathrm{A\,m^{-1}}$, which gives rise to collective spin-wave magnon modes Tabuchi et al. (2014); Zhang et al. (2014). The frequency of the Kittel mode, the uniform magnon mode of the YIG sphere, is given by $\omega_m=\gamma B_0$, where $\gamma/2\pi=28~\mathrm{GHz/T}$ is the gyromagnetic ratio. Due to magnetostriction, YIG spheres also exhibit high-$Q$ acoustic modes which are coupled to the magnon modes Zhang et al. (2016); Li et al. (2018), and by driving the magnon modes with microwave (MW) fields the magno-mechanical coupling can be tuned and controlled Wang et al. (2018); Li et al. (2018). In the limit of adiabatic elimination for a low-$Q$ cavity, the bipartite magno-mechanical Hamiltonian is given by (see Supplemental Material for details),

$$\tilde{\mathcal{H}}/\hbar=\Delta_{m}m^{\dagger}m+\omega_{b}b^{\dagger}b+(\tilde{G}m+\tilde{G}^{\ast}m^{\dagger})(b+b^{\dagger}), \qquad (1)$$

where $m\,(m^{\dagger})$ and $b\,(b^{\dagger})$ are the magnonic and acoustic mode annihilation (creation) operators, $\omega_b$ is the resonance frequency of the acoustic mode, and $\Delta_m=\omega_m-\omega_d$ (drive frequency $\omega_d$) is the detuning. In such a bipartite system, the beam-splitter interaction, $\tilde{G}mb^{\dagger}+\tilde{G}^{*}m^{\dagger}b$, valid for weak magno-mechanical coupling, favours mechanical cooling at the red sideband $\Delta_m=\omega_b$; the full coupling interaction, which must be retained in the strong/ultrastrong coupling regimes, is not favourable for the usual sideband cooling approach. Sideband cooling works through anti-Stokes scattering of excitations from the thermally populated mode $b$ to the zero-entropy mode $m$. However, such cooling needs constant driving, as it is a steady-state process that takes a duration of the order of the relaxation time of the subsystems (see Supplemental Material). If one can access the strong coupling regime and manage to tame the counter-rotating interactions therein, faster cooling than this limit is possible. In the following, we design a DRL-based algorithm that models a dynamic variation of the coupling, $\tilde{G}$, to achieve faster cooling of the acoustic mode, operating both within and beyond the weak coupling regime.

The schematic workflow of the DRL scheme applied to the physical model (RL-environment) is shown in Fig. 1(a). The RL-agent consists of a neural network model that is optimized to select actions that lead to desirable changes and to the maximum net reward. The RL-agent is modelled using the recently proposed Soft Actor-Critic (SAC) algorithm Haarnoja et al. (2018), which maximizes both the entropy, $\mathcal{H}(\pi_{\theta}(\cdot|s_t))$, of the policy $\pi_{\theta}$ and the long-term discounted cumulative return $r(s_t,a_t)$, i.e., $\max\mathbb{E}\left[\sum_t \gamma^{t}\big(r(s_t,a_t)+\alpha\mathcal{H}(\pi_{\theta}(\cdot|s_t))\big)\right]$, where $\theta$ denotes the optimizable weights of the neural networks, $\gamma$ is the discount factor, and $\alpha$ is the regularization coefficient that determines the stochasticity of the policy. The policy $\pi_{\theta}$ sets the rules for the particular actions applied to the RL-environment (see Supplemental Material for further details).

The dynamics of the system is described by the quantum master equation (QME) for the density matrix $\rho$ with the Hamiltonian $\tilde{\mathcal{H}}$,

$$d\rho(t) = -\frac{i}{\hbar}\left[\tilde{\mathcal{H}},\rho\right]dt+\sum_{j}\left[\kappa_{j}\left(\bar{n}_{c_{j}}+1\right)\mathcal{L}[c_{j}]\rho+\kappa_{j}\bar{n}_{c_{j}}\mathcal{L}[c_{j}^{\dagger}]\rho\right]dt, \qquad (2)$$

with dissipation and thermal fluctuations given by the Lindblad superoperators, $\mathcal{L}[c_{j}]\rho\equiv c_{j}\rho c_{j}^{\dagger}-\frac{1}{2}\{c_{j}^{\dagger}c_{j},\rho\}$, where the $\kappa_j$ are the damping rates of the modes, and the thermal occupation of each bosonic mode is $\bar{n}_{c_{j}}=[\exp(\hbar\omega_{j}/k_{B}T)-1]^{-1}$, with $T$ the bath temperature and $k_B$ the Boltzmann constant. Solving the full QME to obtain the mean occupancy is a computationally intensive task. DRL typically requires several thousands of episodes of training, and solving the full QME within each episode is too resource intensive for complex systems such as the ones we consider in this work. We therefore compute the mean occupancies of each mode from a set of linear differential equations for the second-order moments obtained from the QME, $\partial_t\langle\hat{o}_i\hat{o}_j\rangle=\mathrm{Tr}(\dot{\rho}\,\hat{o}_i\hat{o}_j)=\sum_{m,n}\zeta_{mn}\langle\hat{o}_m\hat{o}_n\rangle$, where $\hat{o}_i$, $\hat{o}_j$, $\hat{o}_m$, $\hat{o}_n$ are drawn from the operators $(c_j^{\dagger},c_j)$ and the $\zeta_{mn}$ are the corresponding coefficients. Instantaneous solutions of these equations, together with the controls $\tilde{G}$, are used as the observations for the RL-agent in Fig. 1(a), and the reward function is chosen as $r(t)=1/\tilde{n}_b(t)$. Here $\tilde{n}_b(t)=\langle b^{\dagger}b\rangle(t)/n_b^{T}$ is the cooling quotient of the phonon number with respect to the thermal occupancy $n_b^{T}$ at temperature $T$, where $\langle b^{\dagger}b\rangle$ is the mean phonon population. Further details of the DRL controller can be found in the Supplemental Material.
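As an illustration, a minimal numerical sketch of this moment-based propagation is given below. It is not the original numerical implementation: it assumes a real coupling $\tilde{G}$, the amplitude-damping convention of Eq. (S15), and the Gaussian (linearized) dynamics, for which the second-order moments are equivalent to the quadrature covariance matrix; all parameter values are placeholders.

```python
# Minimal sketch (illustrative): propagate the second-order moments of the bipartite
# system, Eq. (1), assuming a *real* coupling G and the amplitude-damping convention
# of Eq. (S15).  Quadratures u = (x_m, p_m, x_b, p_b); the symmetric covariance matrix
# sigma obeys the Lyapunov-type equation  d(sigma)/dt = A sigma + sigma A^T + D.
import numpy as np
from scipy.integrate import solve_ivp

def drift(G, delta_m, omega_b, kappa_m, kappa_b):
    return np.array([[-kappa_m,  delta_m,      0.0,      0.0],
                     [-delta_m, -kappa_m,   -2.0*G,      0.0],
                     [     0.0,      0.0, -kappa_b,  omega_b],
                     [  -2.0*G,      0.0, -omega_b, -kappa_b]])

def diffusion(kappa_m, kappa_b, n_m, n_b):
    d = [2*kappa_m*(n_m + 0.5)]*2 + [2*kappa_b*(n_b + 0.5)]*2
    return np.diag(d)

def evolve(sigma0, G, dt, pars):
    A = drift(G, *pars[:4])
    D = diffusion(pars[2], pars[3], *pars[4:])
    rhs = lambda t, s: (A @ s.reshape(4, 4) + s.reshape(4, 4) @ A.T + D).ravel()
    sol = solve_ivp(rhs, (0.0, dt), sigma0.ravel(), rtol=1e-8, atol=1e-10)
    return sol.y[:, -1].reshape(4, 4)

def phonon_number(sigma):
    # <b^dagger b> = (sigma_xx + sigma_pp - 1)/2 for the mechanical mode (zero means)
    return 0.5*(sigma[2, 2] + sigma[3, 3] - 1.0)

# Example: one RL time step (placeholder numbers; time measured in mechanical periods)
omega_b = 2*np.pi
pars = (omega_b, omega_b, 0.1*omega_b, 1e-5*omega_b, 0.0, 100.0)  # (Delta_m, omega_b, kappa_m, kappa_b, n_m, n_b)
sigma = np.diag([0.5, 0.5, 100.5, 100.5])   # magnon in vacuum, phonon thermal with n_b = 100
sigma = evolve(sigma, G=5*omega_b, dt=0.1, pars=pars)
print(phonon_number(sigma))
```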

Figure 2: (a) Accelerated DRL-based cooling of the phonon mode, shown via the cooling quotient, $\tilde{n}_b$, of the bipartite system of Fig. 1(b) as a function of $\tilde{G}_{\rm max}$, the maximum allowed control [with values within the RWA or beyond the RWA (BRWA)]. (b) The DRL-optimized complex pulse sequence shown as a polar plot for $\tilde{G}_{\rm max}/(\sqrt{2}\omega_b)=5$. The green dot shows the starting point and the red dot the final point. (c) The time required for cooling with the DRL scheme, $\tilde{\tau}_{\rm DRL}$ (right $y$-axis), for $\tilde{n}_b\lesssim 10^{-4}$, compared with the time limit for effective sideband cooling, $\tau_{\rm SB}$ (left $y$-axis).

In Fig. 2(a), we show the cooling quotient, $\tilde{n}_b$, for the DRL-optimized controls as a function of $\tilde{G}_{\rm max}$ (the maximum of the coupling parameter). We consider $\Delta_m=\omega_b$ and damping rates $\kappa_b/\omega_b=10^{-5}$ and $\kappa_m/\omega_b=0.1$. We find that as the coupling is increased towards the ultrastrong coupling regime ($\tilde{G}\gtrsim\omega_b$), the cooling becomes much faster. The DRL-optimized complex pulse sequence is shown as a polar plot for $\tilde{G}_{\rm max}/(\sqrt{2}\omega_b)=5$ in Fig. 2(b). We denote the minimum cooling time achieved by our method as $\tilde{\tau}_{\rm DRL}$. Figure 2(c) compares this time as a function of $\tilde{G}_{\rm max}$ with the sideband cooling time limit, $\tau_{\rm SB}$, which represents the shortest cooling time possible with such methods, valid only when the rotating wave approximation (RWA) is applicable (see Supplemental Material). With the DRL-based coupling modulations we achieve cooling times approximately two orders of magnitude shorter than the sideband cooling limit, and in the ultrastrong coupling regime the cooling time is lowered even further.

Tripartite opto-magno-mechanical cooling.- We now show that the proposed scheme can be extended effectively to more complex systems, for example a higher-order tripartite opto-magno-mechanical system, where we intend to cool the motional mode through non-trivial three-mode interactions. For this, we consider a system comprising a levitated YIG sphere in a harmonic trap Seberson et al. (2020); Rusconi et al. (2017); Huillery et al. (2020), along with a driven whispering-gallery-mode (WGM) optical microresonator placed along the $\hat{x}$-direction with a magnetostrictive rod (MR) attached to it, as depicted in Fig. 1(c); this serves as the RL-environment in Fig. 1(a) of the DRL model. In a large external homogeneous magnetic field, $B_0$, applied in the $\hat{z}$-direction, the YIG sphere is magnetized to the saturation magnetization, $M_s$, and the homogeneously magnetized fundamental magnon mode (Kittel mode) produces a change in the axial length of the MR, which modulates the WGM optical mode frequency Forstner et al. (2014); Xia et al. (2015); Yu et al. (2016); Guo et al. (2019). This gives rise to a coupling between the WGM optical mode ($a$) and the magnon mode ($m$) of the form $\Omega_S a^{\dagger}a(m+m^{\dagger})$, where $\Omega_S=\Delta\omega$ is the optical frequency shift. The magnon mode can also be coupled to the COM motion of the YIG sphere by applying a spatially inhomogeneous, time-dependent external magnetic field, $\mathbf{H}_g(y,t)$ Hoang et al. (2016); Delord et al. (2018), which satisfies the weak-driving, $|\mathbf{H}_g(y,t)|\ll H_0$, and small-curl, $|\nabla\times\mathbf{H}_g(y,t)||\mathbf{r}|\ll|\mathbf{H}_g(y,t)|$, conditions (see the Supplemental Material for more information). Considering a time-varying gradient magnetic field of the form $\mathbf{H}_g(\mathbf{y},t)=\frac{b_g(t)}{\mu_0}y\hat{y}$ ($b_g$ in units of $\mathrm{T/m}$), the interaction Hamiltonian between the COM motion in the $\hat{y}$-direction (frequency $\omega_b$) and the magnon mode is $\mathcal{H}_{mb}(t)=\tilde{\Omega}_P(\hat{b}+\hat{b}^{\dagger})(\hat{m}+\hat{m}^{\dagger})$, with $\tilde{\Omega}_P=\frac{b_g}{4}\sqrt{\frac{|\gamma|M_S}{\rho\omega_b}}$, where $M_s=5.87\times 10^{5}~\mathrm{A\,m^{-1}}$ is the saturation magnetization and $\rho=5170~\mathrm{kg/m^3}$ is the mass density of YIG. In the rotating frame of the cavity drive and the displacement picture of the average field in each mode Khosla et al. (2017), the complete Hamiltonian is described by

$$\tilde{\mathcal{H}}/\hbar= \Delta_{a}a^{\dagger}a+\omega_{m}m^{\dagger}m+\omega_{b}b^{\dagger}b+\tilde{\Omega}_{S}(a+a^{\dagger})(m+m^{\dagger})+\tilde{\Omega}_{P}(m+m^{\dagger})(b+b^{\dagger}), \qquad (3)$$

where $\Delta_a$ is the cavity detuning and $\tilde{\Omega}_S$ is the driving-enhanced optomagnonic coupling rate. One can modulate $\tilde{\Omega}_S$ via the external drive, whereas the phonon-magnon coupling $\tilde{\Omega}_P$ can be modulated through the time-varying magnetic field gradient. To cool the COM motion, we aim to transfer the phonon population from the COM mode to the optical mode without populating the magnon mode. The damping rates of the cavity, magnon, and COM modes are $\kappa_i$, and the corresponding thermal populations at temperature $T$ are $n_i^T$, with $i\in\{a,m,b\}$. Since the optical cavity mode oscillates at a high frequency, its thermal bath has essentially zero occupancy even at room temperature; the phonon and magnon baths, however, are thermally occupied.

Similar to the bipartite system discussed above, we use the second-order moment equations to solve the dynamics of the system, and the DRL scheme is used to optimize the controls $\tilde{\Omega}_{\rm P/S}$ by maximizing the net reward signal per episode, $r(t)=1/\tilde{n}_b-\lambda\langle m^{\dagger}m\rangle(t)$, where $\lambda$ is a constant chosen such that the magnon mode does not get populated. Given that the COM mode frequencies of the YIG sphere are of the order of $\omega_b/2\pi\sim 10$-$100~\mathrm{kHz}$ and the magnon frequency is $\omega_m/2\pi\sim 10~\mathrm{GHz}$, the realistic choice of $\omega_m$ for $\omega_b/2\pi=100~\mathrm{kHz}$ is $\omega_m=10^{5}\,\omega_b$. With such a large frequency difference to the intermediate magnon mode, this constitutes a largely detuned system ($\omega_m\gg\sqrt{\tilde{\Omega}_P^2+\tilde{\Omega}_S^2}$). In such a system the magnon mode is effectively decoupled, and the ideal time limit for a swap between the mechanical quanta and the optical mode with ideal Raman pulses is $\tilde{\tau}_{\rm lim}=\pi\omega_m/(2\tilde{\Omega}_P\tilde{\Omega}_S)$ (see the Supplemental Material). However, this limit holds only as long as the RWA is valid. The effectiveness of this kind of cooling is also strongly reduced in the presence of damping, and going beyond the RWA to decrease the cooling time is not possible because of the counter-rotating dynamics. We apply the DRL strategy in a regime where the $\tilde{\Omega}_i$ are sufficiently large to access cooling limits not obtainable by these conventional means, while keeping the counter-rotating dynamics under control. Although the method of coupled second-order moments reduces the computational resources drastically, simulating the dynamics with the realistic parameter $\omega_m=10^{5}\,\omega_b$ and high coupling strengths, for example $\tilde{\Omega}_S^{\rm max}=\tilde{\Omega}_P^{\rm max}=100\,\omega_b$, is still computationally very demanding because the set of differential equations becomes very stiff. Hence, we adopt a two-step training procedure. We first train on an auxiliary system with $\omega_m=10^{3}\,\omega_b$ and $\tilde{\Omega}_S^{\rm max}=\tilde{\Omega}_P^{\rm max}=10\,\omega_b$, for which the equations can be solved without much computational effort. The trained auxiliary model is then used as a supervisor/teacher for the actual system, which we call primary, with $\omega_m=10^{5}\,\omega_b$ and $\tilde{\Omega}_S^{\rm max}=\tilde{\Omega}_P^{\rm max}=100\,\omega_b$. Training the primary system for a few hundred episodes with periodic evaluation of the RL-agent yields the best trained model. In the RL literature, such a scheme is known as imitation learning, and it illustrates the generalizability of RL-trained models.

In Fig. 3(a), the cooling quotients of the photon, magnon, and phonon modes, $\tilde{n}_i=n_i/n_T$, are shown, where the $n_i$ are the mean occupancies and $n_T$ is the phonon thermal bath population. The cooling time limit for this system is $\tilde{\tau}_{\rm lim}\omega_b/(2\pi)\sim 5$ (see the Supplemental Material). The plot shows the population dynamics of the three modes for the DRL-based coupling parameters with a maximum allowed value $\tilde{\Omega}_i^{\rm max}/\omega_b=100$. While the magnon mode is kept at a constant population, scattering between the mechanical and optical modes cools the mechanical mode by five orders of magnitude. This is analogous to the stimulated Raman adiabatic passage (STIRAP) technique for three-mode systems. However, such adiabatic techniques are known to require longer times and work ideally only as long as counter-rotating terms are absent (see the Supplemental Material). In contrast, the coupling parameters found by our DRL protocol are non-trivial, as shown in Fig. 3(b), and the overall cooling time is reduced below the adiabatic limit even for high values of the coupling. In the bottom panel of Fig. 3(a) we show the cooling times obtained with our DRL method for even larger couplings; higher coupling parameters yield even shorter cooling times.

Figure 3: (a) The mean occupancies of the phonon, magnon, and photon modes obtained with the DRL scheme, shown in terms of the cooling quotient $\tilde{n}_i$ (green, orange, and blue lines respectively) for $\tilde{\Omega}_{\{S,P\}}/\omega_b=100$. The vertical lines show the time limits for DRL-based cooling with $\tilde{\Omega}_i^{\rm max}/\omega_b=\{100,120,150\}$ (green, purple, pink respectively). While the DRL-derived scheme gives a cooling quotient of $\tilde{n}_b<10^{-4}$, the corresponding Raman pulses with the same maximum allowed coupling strength reach poorer values of $\tilde{n}_b$, and their cooling time limits are far longer than those obtained with the DRL scheme. The markers show the cooling results obtained with the conventional counter-intuitive pulses, with the corresponding cooling quotients and time limits shown for each point in the order $(\tilde{\tau}_{\rm lim},\tilde{n}_b)$. (b) The corresponding DRL-optimized coupling parameters for a maximum allowed value of $\tilde{\Omega}_{\{S,P\}}/\omega_b=100$.

Conclusion.- To conclude, in this work we address how to reduce the time required for cooling bosonic motional modes below the limit accessible with well-known scattering-based cooling methods. While such conventional cooling methods do not work in the strong and ultrastrong coupling regimes where the RWA is not valid, we design a DRL-based algorithm that works both within and beyond these regimes, and show that by accessing the ultrastrong coupling limit with DRL-designed pulses the conventional cooling limit can be broken, resulting in accelerated cooling. We further show how the protocol can be adapted to cooling in a tripartite system with an opto-magno-mechanical interaction, which presents more complexity for the DRL control owing to the large dimensionality of the Hilbert space, and find nontrivial three-mode interactions leading to accelerated cooling that breaks the coherent cooling time limits. This study thus outlines a comprehensive toolbox for applying DRL to fast and efficient quantum control in magnonic systems within and beyond RWA restrictions, which can be adapted to other quantum systems of interest. For future and ongoing efforts in quantum technology this is expected to play a pivotal role, especially in conjunction with laboratory experiments.

Acknowledgements
The authors thank Okinawa Institute of Science and Technology (OIST) Graduate University for the super-computing facilities provided by the Scientific Computing and Data Analysis section of the Research Support Division, and for financial support. The authors are grateful for the help and support provided by Jeffery Prine from the Digital Content, Brand, and Design Section of the Communication and Public Relations Division at OIST.


Supporting Information

S1 Reinforcement learning and soft actor-critic algorithm

Reinforcement learning (RL) has been one of the most exciting fields of machine learning (ML) following a few revolutionary works carried out during 2013-2016 by DeepMind and Google Silver et al. (2016, 2017). Although these applications were demonstrated initially for different board games, researchers around the world are now looking into their potential use in various technological applications. In RL, a software agent (we will call it the RL-agent) makes observations and applies actions on an environment (we call it the RL-environment), which causes a change in the dynamics of the environment; in return, the RL-agent receives scalar values as rewards. The objective of the RL-agent is to maximize the total reward obtained within a particular time range (we call it an episode). This training approach differs from typical approaches in which the ML model is trained on predefined data with labels (supervised learning) or without labels (unsupervised learning).

Figure S1: The workflow of deep reinforcement learning (DRL) is shown. The RL consists of two main characters – the RL-environment and the RL-agent. The RL-agent, in the case of DRL, comprises artificial neural networks which operate as function approximators. The RL-agent applies certain actions on the RL-environment and receives certain quantities as observations along with a scalar reward signal, based on which the neural network is updated to adopt a better policy for improved rewards.

The workflow of a typical RL setup is shown in Fig. S1. Here, the block labelled the RL-agent comprises a set of neural network layers which act as function approximators. This type of RL is known as deep reinforcement learning (or DRL for short). The block on the right depicts the RL-environment, which is the world in which the RL-agent lives and on whose dynamics it can effect change via actions. In RL terminology, the environment signifies the real/virtual space in which the rules of the physics of the problem/system are encoded, along with a reward estimation function returning real scalar values that determine whether the applied action has led to desirable changes or not. In practice, the RL-agent makes a certain number of interactions, at first applying random actions on the RL-environment, based on which the weights of the nonlinear neural network function estimators are updated. The procedure is repeated several hundreds or thousands of times, depending on the complexity of the problem. Such a learning process is analogous to that of a human, where the agent learns by conducting trial-and-error experiments on the RL-environment. Depending on the type of RL algorithm used, the neural-network-based agent can explore even newer possibilities not usually explored, thus outperforming an analogous supervised-learning agent.

Figure S2: The workflow of DRL in the soft actor-critic (SAC) setting. In the actor-critic framework, the RL-agent has two independent neural networks for policy evaluation and policy improvement to guide the choice of actions, known respectively as the critic and actor networks. The critic, in the SAC framework, consists of two neural networks, $Q_1$ and $Q_2$, for improved policy evaluation through Q-function updates. Based on the suggestion of the critic (in terms of the estimation of the discounted future rewards and the entropy-regularized critic loss for SAC), the actor network is updated, and the next action is chosen.

The algorithm that the RL-agent uses to determine its actions is called its policy, which in the case of DRL is represented by the neural network itself, taking the observations as input and outputting the actions. A policy can be deterministic, in which case the action of the RL-agent is determined by the policy parameters $\theta$ for a given state $s_t$ at time $t$: $a_t=\mu_{\theta}(s_t)$, or stochastic, in which case the actions are sampled from a probability distribution conditioned on $s_t$: $a_t\sim\pi_{\theta}(\cdot|s_t)$. The task of the RL-agent is to optimize the policy parameters $\theta$ to maximize the net discounted reward $R(\tau)=\sum_{t=0}^{\infty}\gamma^{t}r_t$ over a trajectory $\tau=(s_0,a_0,s_1,a_1,\ldots)$, with discount factor $\gamma\in(0,1)$. To achieve that, the expected return over the discounted rewards, $J(\pi)=\mathbb{E}[R(\tau)]$, is optimized to obtain the optimal policy $\pi^{\star}=\mathrm{argmax}_{\pi}\,J(\pi)$.
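For concreteness, a minimal sketch of the discounted return (reward-to-go) computation is given below; it is purely illustrative and the reward values are placeholders.

```python
# Minimal sketch: the discounted return R_t = sum_{k>=0} gamma^k r_{t+k},
# computed for every step of a finished episode.
def discounted_returns(rewards, gamma=0.99):
    returns, running = [], 0.0
    for r in reversed(rewards):          # accumulate from the last step backwards
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# Example: three time steps with rewards 1, 0, 2 and gamma = 0.9
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))   # -> [2.62, 1.8, 2.0]
```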

As a fundamental concept in RL, the value function gives a prediction of the expected, cumulative, discounted future reward, and provides a measure of how well a given state $s$ or state-action pair $(s,a)$ performs in generating a high net return. The state value $V_{\pi}(s)=\mathbb{E}[R_t|s_t=s]$ is defined as the expected return obtained by following policy $\pi$ from state $s$. The action value $Q_{\pi}(s,a)=\mathbb{E}[R_t|s_t=s,a_t=a]$ is the expected return for selecting action $a$ in state $s$ and then following policy $\pi$. The value functions obey the so-called Bellman equations Sutton and Barto (2018), which can be solved self-consistently. For example, the action-value function obeys,

$$Q_{\pi}(s,a)=\mathbb{E}\left[r(s,a)+\gamma\max_{a^{\prime}}Q_{\theta}(s^{\prime},a^{\prime})\right]. \qquad (S4)$$

There are several approaches to optimizing the policy, which can be broadly categorized into three types, namely (a) policy-gradient-based, (b) value-based, and (c) actor-critic-based methods. In value-based approaches, such as Q-learning, the policy is optimized to maximize the value functions by solving the Bellman equations as discussed above. In policy gradient methods, on the other hand, the policy parameters are optimized with gradient-based algorithms:

$$\nabla_{\theta}J(\pi_{\theta})=\mathbb{E}\sum_{t=0}^{T}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\hat{R}_{t}\right], \qquad (S5)$$

where $\mathbb{E}$ represents the expectation value over the trajectory $\tau$. This basic approach can be improved by using a baseline function, $b(s_t)$, to reduce the variance of the gradient estimate, and it forms the basis of the more advanced state-of-the-art actor-critic DRL algorithms. In their most general form, policy gradient methods work by optimizing the following objective (loss) function:

$$\mathrm{L}^{\rm PG}(\theta)=\hat{\mathbb{E}}_{t}\left[\log\pi_{\theta}(a_{t}|s_{t})\hat{A}_{t}\right], \qquad (S6)$$

where $\pi_{\theta}$ is a stochastic policy and $\hat{A}_t=Q(s_t,a_t)-V(s_t)$ is an estimator of the advantage function at timestep $t$, since $R_t$ is an estimate of $Q(s_t,a_t)$.
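As an illustration, the objective of Eq. (S6) can be written as the following minimal PyTorch sketch (our own illustrative code, not the implementation used in this work); the log-probabilities and advantage estimates are assumed to be supplied by the policy network and a baseline.

```python
# Minimal sketch: the policy-gradient loss of Eq. (S6) for a batch of logged transitions.
# `log_probs` are log pi_theta(a_t|s_t) evaluated by the current policy network and
# `advantages` are estimates of A_t.
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # Negative sign: optimizers minimize, while we want to maximize E[log pi * A].
    return -(log_probs * advantages.detach()).mean()

# Example with dummy numbers
log_probs = torch.tensor([-0.1, -1.2, -0.7], requires_grad=True)
advantages = torch.tensor([0.5, -0.3, 1.0])
loss = policy_gradient_loss(log_probs, advantages)
loss.backward()   # gradients flow into the policy parameters via log_probs
```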

An actor-critic algorithm learns both a policy and a state-value function. The value function is used for bootstrapping, i.e., updating the estimate of a state from subsequent estimates, which reduces variance and accelerates learning Sutton and Barto (2018). The critic updates the action-value function parameters, and the actor updates the policy parameters in the direction suggested by the critic.

The Soft Actor-Critic (SAC) algorithm is a recently proposed actor-critic algorithm Haarnoja et al. (2018) in the field of reinforcement learning. The main difference of SAC compared to other actor-critic methods is that its policy is optimized in an entropy-regularized manner and is inherently stochastic. The policy is trained to maximize a tradeoff between expected return and entropy, with an increase in entropy leading to more exploration and, in addition, preventing the policy from converging prematurely. The following discussion closely follows the theory presented in Achiam (2021).

In entropy-regularized RL, the agent gets a bonus reward at each time step proportional to the entropy of the policy,

π=argmaxπ𝔼τπ[t=0γt(R(st,at,st+1)+αH(π(|st)))],\displaystyle\pi^{*}=\arg\max_{\pi}\underset{\tau\sim\pi}{\mathbb{E}}\left[{\sum_{t=0}^{\infty}\gamma^{t}\bigg{(}R(s_{t},a_{t},s_{t+1})+\alpha H\left(\pi(\cdot|s_{t})\right)\bigg{)}}\right], (S7)

where $H(P)=\underset{x\sim P}{\mathbb{E}}[-\log P(x)]$ denotes the entropy computed from the probability distribution $P$, and $\alpha>0$ is the trade-off coefficient. The corresponding value functions in this setting, $V^{\pi}$ and $Q^{\pi}$, become,

$$V^{\pi}(s)=\underset{\tau\sim\pi}{\mathbb{E}}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(R(s_{t},a_{t},s_{t+1})+\alpha H(\pi(\cdot|s_{t}))\big)\,\Big|\,s_{0}=s\Big], \qquad (S8)$$
$$Q^{\pi}(s,a)=\underset{\tau\sim\pi}{\mathbb{E}}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t},s_{t+1})+\alpha\sum_{t=1}^{\infty}\gamma^{t}H(\pi(\cdot|s_{t}))\,\Big|\,s_{0}=s,a_{0}=a\Big]. \qquad (S9)$$

With these definitions, $V^{\pi}$ and $Q^{\pi}$ are connected by,

$$V^{\pi}(s)=\underset{a\sim\pi}{\mathbb{E}}\left[Q^{\pi}(s,a)+\alpha H\left(\pi(\cdot|s)\right)\right]. \qquad (S10)$$

The Bellman equation for $Q^{\pi}$ becomes,

$$Q^{\pi}(s,a)\approx r+\gamma\left(Q^{\pi}(s^{\prime},\tilde{a}^{\prime})-\alpha\log\pi(\tilde{a}^{\prime}|s^{\prime})\right),\quad \tilde{a}^{\prime}\sim\pi(\cdot|s^{\prime}), \qquad (S11)$$

where the expectation is over the next states: the reward $r$ and the next state $s^{\prime}$ come from the replay buffer, while the next action $\tilde{a}^{\prime}$ is sampled afresh from the policy. SAC concurrently learns a policy $\pi_{\theta}$ and two Q-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$, sets up a loss function for each Q-function, and takes the minimum Q-value of the two Q approximators,

$$L(\phi_{i},\mathcal{D})=\underset{(s,a,r,s^{\prime},d)\sim\mathcal{D}}{\mathbb{E}}\left[\Big(Q_{\phi_{i}}(s,a)-y(r,s^{\prime},d)\Big)^{2}\right], \qquad (S12)$$

where the target is given by

$$y(r,s^{\prime},d)=r+\gamma\left(\min_{j=1,2}Q_{\phi_{\text{targ},j}}(s^{\prime},\tilde{a}^{\prime})-\alpha\log\pi_{\theta}(\tilde{a}^{\prime}|s^{\prime})\right),\quad \tilde{a}^{\prime}\sim\pi_{\theta}(\cdot|s^{\prime}). \qquad (S13)$$
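A minimal PyTorch sketch of Eqs. (S12)-(S13) is given below for illustration; it is not the Stable-Baselines3 internals, and the policy, Q-networks, and batch tensors are assumed to be supplied by the user.

```python
# Minimal sketch: the SAC critic update of Eqs. (S12)-(S13).  `policy`, `q1`, `q2`,
# `q1_targ`, `q2_targ` are assumed to be callables (e.g. torch.nn.Module instances);
# `policy(s)` is assumed to return a sampled action and its log-probability.
import torch

def sac_critic_loss(batch, policy, q1, q2, q1_targ, q2_targ, gamma=0.99, alpha=0.1):
    s, a, r, s2 = batch                     # states, actions, rewards, next states
    with torch.no_grad():                   # the target is not back-propagated through
        a2, logp_a2 = policy(s2)
        q_min = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (q_min - alpha * logp_a2)          # Eq. (S13)
    # Eq. (S12): mean-squared Bellman error for both critics
    return ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
```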

For the studies reported in this article, we design the RL-environments using the framework template of OpenAI-Gym Brockman et al. (2016). The RL-agents are modelled using SAC policy, with two critic models following the implementation available in the open-source software package Stable-Baselines3 Raffin et al. (2019). For modelling the neural networks and optimization of the network parameters we have used the PyTorch Paszke et al. (2019) ML module.

Figure S3: Mechanical sideband cooling dynamics of the two-mode system with constant magno-mechanical coupling rates $\tilde{G}$, for the parameters $\Delta_m/\omega_b=1$, $\kappa_m/\omega_b=0.1$ and $\kappa_b/\omega_b=10^{-5}$. From the figure, one can estimate the steady-state cooling time limit and the optimum cooling quotient, $\tilde{n}_b$. Considering an optimum cooling quotient of $\tilde{n}_b\approx 10^{-4}$, the best cooling from such a sideband cooling process is found to occur on a cooling time scale of the order of $\tau_{\rm SB}\omega_b/(2\pi)\approx 40$.

The bipartite magno-mechanical model and sideband cooling time limit

The effective two-mode magno-mechanical model shown in Fig. 1(b) of the main text consists of a YIG sphere placed in a microwave (MW) cavity, where a uniform magnetic field $B_0$ is applied in the $z$-direction that induces spin-wave magnons precessing in the $xy$-plane. The magnetic field component of the cavity mode couples to the magnons through magnetic dipole coupling Tabuchi et al. (2014); Zhang et al. (2014). The YIG sphere also acts as an excellent mechanical resonator because of the magnetostrictive effect, which gives rise to vibrational acoustic phonon modes Zhang et al. (2016); Li et al. (2018). We consider the magnon modes to be driven by a MW field with frequency $\omega_p$ and amplitude $\epsilon_p$ Wang et al. (2018); Li et al. (2018). The Hamiltonian of the system in a frame rotating at the magnon drive frequency is given by,

$$\mathcal{H}_{0}/\hbar=\delta_{a}a^{\dagger}a+\omega_{b}b^{\dagger}b+\delta_{m}m^{\dagger}m+J(am^{\dagger}+a^{\dagger}m)+Gm^{\dagger}m(b+b^{\dagger})+i(\epsilon_{p}m^{\dagger}-\epsilon^{*}_{p}m), \qquad (S14)$$

where $a\,(a^{\dagger})$, $m\,(m^{\dagger})$ and $b\,(b^{\dagger})$ are the annihilation (creation) operators of the cavity, magnon, and phonon modes respectively, $\delta_i=\omega_i-\omega_p$ are the detunings, $J$ is the magnon-photon coupling strength, and $G$ is the magno-mechanical coupling rate. The Rabi frequency $\epsilon_p$ is given by $\sqrt{5N}\gamma B_0/4$, where $\gamma/2\pi=28~\mathrm{GHz/T}$ is the gyromagnetic ratio and $N=\rho_n V$ is the total number of spins, with $\rho_n=4.22\times 10^{27}~\mathrm{m^{-3}}$ the spin density of YIG and $V$ the volume of the sphere.

The quantum Langevin equations for the average dynamics of the fluctuation operators are given by

$$\begin{aligned}
\dot{a} &= -(i\delta_{a}+\kappa_{a})a-iJm,\\
\dot{m} &= -(i\delta_{m}+\kappa_{m})m-iJa-i\tilde{G}(b+b^{\dagger}),\\
\dot{b} &= -(i\omega_{b}+\kappa_{b})b-i(\tilde{G}^{*}m+\tilde{G}m^{\dagger}),
\end{aligned} \qquad (S15)$$

where $\tilde{G}$ is the driving-enhanced magno-mechanical coupling. When the cavity has a low $Q$-factor, with a damping rate $\kappa_a\gg\{J,\tilde{G},\kappa_m,\kappa_b\}$, the cavity field can be adiabatically eliminated, which gives rise to an effective magno-mechanical Hamiltonian of the form,

$$\tilde{\mathcal{H}}/\hbar=\omega_{b}b^{\dagger}b+\Delta_{m}m^{\dagger}m+(\tilde{G}m^{\dagger}+\tilde{G}^{*}m)(b+b^{\dagger}), \qquad (S16)$$

where the effective parameters are given by $\Delta_m=\delta_m-\frac{|J|^{2}\delta_a}{\delta_a^{2}+\kappa_a^{2}}$ and $\tilde{\kappa}_m=\kappa_m+\frac{|J|^{2}\kappa_a}{\delta_a^{2}+\kappa_a^{2}}$. Hence, when the sphere is placed at a node of the cavity magnetic field, or the cavity damping is very high, the system effectively reduces to a two-mode coupled system with a magno-mechanical interaction.
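For convenience, a minimal sketch of these effective-parameter expressions is given below; the numerical values in the example are made-up placeholders, not parameters from this work.

```python
# Minimal sketch: effective magnon detuning and linewidth after adiabatically
# eliminating a low-Q cavity, using the expressions quoted above.
def effective_magnon_parameters(delta_m, kappa_m, J, delta_a, kappa_a):
    denom = delta_a**2 + kappa_a**2
    delta_m_eff = delta_m - (abs(J)**2 / denom) * delta_a
    kappa_m_eff = kappa_m + (abs(J)**2 / denom) * kappa_a
    return delta_m_eff, kappa_m_eff

# Example (placeholder numbers, in units of omega_b)
print(effective_magnon_parameters(delta_m=1.0, kappa_m=0.05, J=0.5, delta_a=1.0, kappa_a=10.0))
```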

The magnon and mechanical modes have very different frequencies, with $\omega_m\sim 2\pi\times 10~\mathrm{GHz}$ and $\omega_b\sim 2\pi\times 10~\mathrm{MHz}$, so the magnon mode behaves as a 'cold' sink at zero temperature for the 'hot' mechanical mode. At the red sideband, $\Delta_m=\omega_b$, the resonant interaction in the weak coupling regime between the mechanical mode and the magnon anti-Stokes sideband transfers thermal quanta from the mechanical mode to the magnon mode; this is the working principle of the well-known sideband cooling technique. In Fig. S3 we show the sideband cooling dynamics for several values of $\tilde{G}$ in the regime $\tilde{G}<\omega_b$. The time limit for such irreversible cooling is denoted by $\tau_{\rm SB}$. From the figure, one can estimate the steady-state cooling time limit and the optimum cooling quotient, $\tilde{n}_b$. Considering an optimum cooling quotient of $\tilde{n}_b\approx 10^{-4}$, the best steady-state cooling from this sideband cooling mechanism occurs on a time scale of the order of $\tau_{\rm SB}\omega_b/(2\pi)\approx 40$. As can be seen from the figure, such sideband cooling is not possible for higher values of the magno-mechanical coupling $\tilde{G}$ because of the effect of the counter-rotating terms beyond the RWA. With our DRL scheme, we show that cooling can be obtained even in the strong coupling regime with a proper choice of coupling modulations and, more interestingly, that it leads to a lower cooling time limit.

S2 The DRL controller for the bipartite magno-mechanical system

Figure S4: (Top) The learning curve of the RL-agent for the bipartite magno-mechanical model shown in Fig. 1(b) of the main text. The upper panel shows the mean net reward $\bar{R}(N)$ over each episode $N$ within the training period of the RL-agent. The corresponding cooling quotient over each episode $N$ is shown in the lower panel. To understand the learning process of the RL-agent, we divide the learning curve into four distinct regions, namely A, B, C, and D. The typical performance of the RL-agent within these regions is evaluated and plotted in panels A, B, C, and D below the learning curves. In each plot, the variation of the cooling quotient $\tilde{n}_b$ is shown for an episode in which the controls, shown as polar plots in the insets, are determined by the selected RL-agent having acquired knowledge up to that point of the learning curve. In A and B, the RL-agent tries more or less random combinations of the controls and learns the broad lessons needed to achieve a large net return. In A, the RL-agent learns to avoid increasing the cooling quotient, while in B it learns to choose controls such that $\tilde{n}_b$ can be held constant at $\tilde{n}_b\sim 1$. The RL-agent must explore for a relatively long time to gain control and move beyond this, which does not happen until about $N=8000$. Most of the learning to achieve cooling takes place in region C, where the RL-agent fine-tunes its controls to achieve an overall larger return, which is further improved in the learning period marked by D.

For the two-mode magno-mechanical system, the control parameter is the coupling strength $\tilde{G}(t)$, which can be complex (see Eq. 1 of the main text). In the DRL model, the real and imaginary parts of $\tilde{G}$ are the two actions learned by the DRL-agent through trial and error. In a given iteration of the RL workflow at time $t$, the RL-agent chooses two actions based on which the system (RL-environment) propagates the dynamics to $t=t+\delta t$, where $\delta t=0.1\,\omega_b/(2\pi)$, during which $\tilde{G}(t)$ remains fixed. The observations that the RL-agent receives from the RL-environment comprise the instantaneous values of the moments obtained by solving the second-order coupled differential equations. The actions of the RL-agent are also appended to the list of observations, which we found to be advantageous for learning. The SAC controller is built with the following hyperparameter settings (a minimal construction sketch in Stable-Baselines3 follows the list below):

Network hidden layer sizes: $512\times 256\times 256\times 128$
Activation function: Rectified Linear Unit (ReLU)
Learning rate: $10^{-4}$
Buffer size: 1000000
Batch size: 512
Soft update coefficient, $\tau$: 0.005
Discount factor, $\gamma$: 0.99
Entropy regularization coefficient, $\alpha$: adaptive, starting from 0.1.
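The sketch below shows how these hyperparameters would map onto a Stable-Baselines3 SAC agent; it is not the authors' training script, and `MagnonEnv` is a hypothetical Gym environment implementing the moment dynamics and reward of Eq. (S17).

```python
# Minimal sketch (illustrative): construct an SAC agent with the hyperparameters listed above.
import torch
from stable_baselines3 import SAC

env = MagnonEnv()                              # hypothetical custom gym.Env
model = SAC(
    "MlpPolicy",
    env,
    learning_rate=1e-4,
    buffer_size=1_000_000,
    batch_size=512,
    tau=0.005,                                 # soft update coefficient
    gamma=0.99,                                # discount factor
    ent_coef="auto_0.1",                       # adaptive alpha, initial value 0.1
    policy_kwargs=dict(net_arch=[512, 256, 256, 128],
                       activation_fn=torch.nn.ReLU),
)
model.learn(total_timesteps=1_000_000)         # number of steps is a placeholder
```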

The goal of the problem is to optimize the neural network parameters such that the agent learns the non-trivial combinations of the controls, $\tilde{G}$, so that the phonon occupancy of the YIG sphere is reduced, thereby cooling it significantly. For this, the agent is given the following reward function,

$$r(t)=\frac{n_{T}}{\langle b^{\dagger}b(t)\rangle}, \qquad (S17)$$

where $n_T$ is the thermal phonon number, $b^{\dagger}b$ is the number operator of the mechanical mode, and $\langle A\rangle$ denotes the expectation value of the operator $A$. The task of the RL-agent is to maximize the net reward, $R=\sum_i^{\mathcal{N}_t}r_i$, over an episode. Note that this is, in principle, a highly challenging task: for each control (action), the number of possible combinations in a given episode is $\mathcal{N}_C^{\mathcal{N}_t}$, where $\mathcal{N}_t$ is the total number of time steps allowed within an episode and $\mathcal{N}_C$ is the number of possible discretizations of the control strength bounded by $\tilde{G}_{\rm max}$. Since in our case the actions are continuous, the problem is even harder. A minimal sketch of such an RL-environment is shown below.
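The following sketch of a Gym-style environment wrapping the moment dynamics is purely illustrative; `propagate_moments`, `initial_thermal_moments`, and `phonon_occupation` are hypothetical helpers (for instance, the covariance propagation sketched earlier), and all sizes and bounds are placeholders.

```python
# Minimal sketch (illustrative, not the authors' code) of an RL-environment for the
# bipartite system: actions set Re(G) and Im(G), observations are the flattened
# second-order moments plus the last action, and the reward follows Eq. (S17).
import numpy as np
import gym
from gym import spaces

class MagnonCoolingEnv(gym.Env):
    def __init__(self, n_thermal=100.0, g_max=5.0, dt=0.1, n_steps=150):
        super().__init__()
        self.n_thermal, self.g_max, self.dt, self.n_steps = n_thermal, g_max, dt, n_steps
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(18,), dtype=np.float32)

    def reset(self):
        self.t = 0
        self.moments = initial_thermal_moments(self.n_thermal)      # hypothetical helper
        return self._obs(np.zeros(2))

    def step(self, action):
        G = self.g_max * (action[0] + 1j * action[1])
        self.moments = propagate_moments(self.moments, G, self.dt)  # hypothetical helper
        n_b = phonon_occupation(self.moments)                       # hypothetical helper
        reward = self.n_thermal / max(n_b, 1e-12)                   # Eq. (S17)
        self.t += 1
        return self._obs(action), reward, self.t >= self.n_steps, {}

    def _obs(self, action):
        return np.concatenate([np.real(self.moments).ravel(), action]).astype(np.float32)
```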

The RL-agent is trained for the following set of cases: $\tilde{G}_{\rm max}/(\sqrt{2}\omega_b)=\{0.1,0.2,0.3,0.5,1,2,3,4,5\}$. The learning curve for $\tilde{G}_{\rm max}/(\sqrt{2}\omega_b)=5$ is shown in Fig. S4. In these plots, we show how the RL-agent gradually learns the non-intuitive control sequences by optimizing the neural network weights to attain larger net returns as determined by Eq. S17. We also show the performance of the RL-agent at different stages of the learning process, as explained in more detail in the caption of Fig. S4.

Figure S5: The effectiveness of the STIRAP protocol for the tripartite opto-magno-mechanical system for different $\tilde{\Omega}_{\rm P/S}^{\rm max}$, with the numerically optimized counterintuitive Gaussian pulses typically used in such protocols. The performance with (dotted lines) and without (solid lines) damping is shown. For (a) and (b), such a protocol cools the phonon mode ($\tilde{n}_b$) by about three orders of magnitude in the absence of damping. For $\tilde{\Omega}_{\rm P/S}^{\rm max}=10\,\omega_b$, shown in (c), and for $\tilde{\Omega}_{\rm P/S}^{\rm max}=12\,\omega_b$, shown in (d), the STIRAP protocol cannot achieve any effective cooling because of the counter-rotating terms of the non-RWA model. This is seen even more clearly for $\tilde{\Omega}_{\rm P/S}^{\rm max}=15\,\omega_b$, shown in (e), where the populations increase, leading to heating of the system. In (f), we compare the minimum achievable phonon occupation, $\tilde{n}_b^{\rm min}$, for the above choices of $\tilde{\Omega}_{\rm P/S}^{\rm max}$ with (dotted) and without (solid) damping. In (g), the time required to cool the phonon occupation below $10^{-3}$, or to attain the minimum occupation, $\tau_s^{\rm min}$, is compared with (dotted) and without (solid) damping.

Tripartite opto-magno-mechanical model and the cooling time limit

For the tripartite system model, we consider the cooling of a levitated spherical YIG ferrimagnet with high spin density, trapped in a harmonic potential, in which spin-wave magnon modes are excited by applying an external bias magnetic field. In a large external homogeneous magnetic field, $B_0$, applied in the $\hat{z}$-direction, the YIG sphere is magnetized to its saturation magnetization, $M_s=5.87\times 10^{5}~\mathrm{A\,m^{-1}}$, whereas the small magnetization fluctuation, $\bm{m}(\bm{r},t)$, around the fully magnetized state is given by $\bm{M}(\bm{r},t)=M_S\hat{z}+\bm{m}(\bm{r},t)$. For the homogeneously magnetized fundamental magnon mode (Kittel mode), precessing around the $\hat{z}$-axis with eigenfrequency $\omega_m=|\gamma|B_0$, the magnetization fluctuation is $\bm{m}(\bm{r},t)=\mathcal{M}_K(\hat{x}+i\hat{y})m+\mathrm{h.c.}$ Fletcher and Bell (1959), which induces a magnetic field $\bm{b}(\bm{r},t)=\mu_0\frac{\mathcal{M}_K}{3}\frac{R^{3}}{r^{3}}e^{i\phi}\left(2\sin\theta\,\hat{r}-\cos\theta\,\hat{\theta}-i\hat{\phi}\right)m+\mathrm{h.c.}$ outside the sphere, where $\mathcal{M}_K=\sqrt{\frac{\hbar\gamma M_s}{2V}}$ is the zero-point magnetization, $\mu_0$ is the vacuum permeability, $m$ is the bosonic magnon operator, and $R$ and $V$ are the radius and the volume of the sphere. As schematically shown in the main text, a driven WGM optical microsphere with a magnetostrictive rod (MR) attached to it is placed near the YIG magnet. The magnon mode field modulates the axial length of the MR, which in turn modulates the WGM optical mode frequency; this is described by a coupling of the form $\Omega_S a^{\dagger}a(m+m^{\dagger})$ between the WGM optical mode ($a$) and the magnon bosonic mode ($m$), where $\Omega_S=\Delta\omega$ is the optical frequency shift, i.e., the single-photon magnon-cavity coupling.

The magnon mode can also be coupled to the COM motion of the YIG sphere by applying a spatially inhomogeneous, time-dependent external magnetic field, $\mathbf{H}_g(y,t)$ Hoang et al. (2016); Delord et al. (2018), which satisfies the weak-driving, $|\mathbf{H}_g(y,t)|\ll H_0$, and small-curl, $|\nabla\times\mathbf{H}_g(y,t)||\mathbf{r}|\ll|\mathbf{H}_g(y,t)|$, conditions. The COM motion of the magnet, with frequencies $\omega_x,\omega_y,\omega_z$ in the $\hat{x},\hat{y},\hat{z}$ directions, is described by the Hamiltonian $H_b=\sum_{j=x,y,z}\omega_j b_j^{\dagger}b_j$, where the bosonic ladder operators describe the annihilation and creation of a motional quantum along direction $j$. The quantum Hamiltonian describing the interaction between the COM motion and the spin-wave magnetization due to the gradient magnetic field is $\mathcal{H}_{mb}(t)=-\frac{\mu_0}{2}\int dV\left[M_S\hat{z}+\mathbf{m}(\mathbf{r})\right]\cdot\mathbf{H}_g(\mathbf{y},t)$. Considering a time-varying gradient magnetic field of the form $\mathbf{H}_g(\mathbf{y},t)=\frac{b_g(t)}{\mu_0}y\hat{y}$ ($b_g$ in units of $\mathrm{T/m}$), the interaction Hamiltonian between the COM motion in the $\hat{y}$-direction (frequency $\omega_y\equiv\omega_b$, with $b_y\equiv b$) and the magnon mode is $\mathcal{H}_{mb}(t)=\tilde{\Omega}_P(t)(\hat{b}+\hat{b}^{\dagger})(\hat{m}+\hat{m}^{\dagger})$, with $\tilde{\Omega}_P(t)=\frac{b_g(t)}{4}\sqrt{\frac{|\gamma|M_S}{\rho\omega_b}}$, where $\rho=5170~\mathrm{kg/m^3}$ is the mass density of YIG. Taking into account all these interactions, in the rotating frame of the cavity drive and the displacement picture of the average field in each mode Wang and Clerk (2012), the Hamiltonian is given by

$$\tilde{\mathcal{H}}/\hbar= \Delta_{a}a^{\dagger}a+\omega_{m}m^{\dagger}m+\omega_{b}b^{\dagger}b+\tilde{\Omega}_{S}(a+a^{\dagger})(m+m^{\dagger})+\tilde{\Omega}_{P}(m+m^{\dagger})(b+b^{\dagger}), \qquad (S18)$$

where $\Delta_a$ is the cavity detuning and $\tilde{\Omega}_S$ is the drive-enhanced optomagnonic coupling.

Figure S6: (Top) The learning curve of the RL-agent for the three-mode opto-magno-mechanical model shown in Fig. 1(c) of the main text. The upper panel shows the mean net reward $\tilde{R}(N)$ (scaled by the thermal number) over each episode $N$ within the training period. The corresponding variation of the cooling quotient of the mean phonon number, averaged over each episode, $\bar{\tilde{n}}_b(N)$, is shown in the lower panel. A zoomed view of the region beyond $N=1\mathrm{k}$ is shown in the inset. To better understand the learning process of the RL-agent, we divide the learning curve into four distinct regions, namely A, B, C, and D. The typical performance of the RL-agent within these regions is evaluated and plotted in panels A, B, C, and D. In each plot, the variations of $\tilde{n}_b$ (green), $\tilde{n}_m$ (orange), and $\tilde{n}_a$ (blue) are shown for an episode in which the controls, $\tilde{\Omega}_{\rm P/S}$, shown in the lower panels (similar to Fig. 3(b) of the main text), are determined by the selected RL-agent having acquired knowledge up to that point of the learning curve. In region A of the learning curve, the RL-agent explores the large parameter space more or less randomly to search for controls that would reduce the mean phonon number below its thermal population limit, $n_b^T$. By the end of stage A, the RL-agent has more or less learned that applying constant control pulses cannot transfer population out of the phonon mode. In the middle part of B, the RL-agent learns that considerably higher net returns are possible if large-amplitude control pulses are followed by smaller ones. The RL-agent then learns to apply the smaller controls at precise times, which leads to better cooling through population transfer in regions C and D.

In this tripartite system, the optical and COM modes are not directly coupled; however, both are coupled to the magnon mode. In the large-detuning condition, $\omega_m\gg\sqrt{\tilde{\Omega}_P^2+\tilde{\Omega}_S^2}$, and under the RWA, the model can be reduced to an effective bipartite system with the Hamiltonian,

$$\tilde{\mathcal{H}}/\hbar=\begin{pmatrix}-\Delta_{\rm eff}&-\Omega_{\rm eff}\\ -\Omega_{\rm eff}&\Delta_{\rm eff}\end{pmatrix}, \qquad (S19)$$

where $\Delta_{\rm eff}=(\tilde{\Omega}_P^2-\tilde{\Omega}_S^2)/(2\omega_m)$ and $\Omega_{\rm eff}=\tilde{\Omega}_S\tilde{\Omega}_P/\omega_m$. This results in a transfer time limit of $\tilde{\tau}_{\rm lim}=\pi\omega_m/(2\tilde{\Omega}_S\tilde{\Omega}_P)$. In the three-mode system given by the full tripartite model under the RWA, a widely applicable scheme for irreversible population transfer between the uncoupled modes is to apply STIRAP-like ideal pulses Bergmann et al. (1998). However, adiabatic effective population transfer between the optical and COM modes can occur only when the total operation time of the pulse sequence satisfies $T\gg\tilde{\tau}_{\rm lim}$. In Fig. S5, we show the cooling dynamics with ideal STIRAP-like Gaussian pulses, obtained with a counter-intuitive sequence of the coupling parameters for the transfer of quanta from the COM mode to the optical mode. The cooling time is of the order of the limit $\tilde{\tau}_{\rm lim}$ given by the above expression, which shows that these pulses are not adiabatic but rather give the shortest time for excitation transfer. The performance in terms of the cooling quotient can be seen from the figure. As this time limit is the shortest possible cooling time, one has to compromise on the effective cooling achieved. For coupling parameters in the range $\tilde{\Omega}_i^{\rm max}/\omega_b=\{6,8,10\}$ the RWA is valid, whereas for values of $\tilde{\Omega}_i^{\rm max}$ beyond $12\,\omega_b$ the counter-rotating terms have a significant effect. Therefore, even though increasing the coupling parameters shows signs of lowering the transfer time, effective cooling is not obtained in these cases. In contrast, our proposed DRL scheme finds finely tuned pulses that go beyond such limits to obtain effective and much faster cooling. A minimal sketch of a counter-intuitive Gaussian pulse pair is given below.
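The sketch below illustrates the counter-intuitive ordering (Stokes coupling before pump coupling) used in STIRAP-like pulse sequences; the pulse widths, delay, and amplitudes are placeholder choices, not the numerically optimized values used for Fig. S5.

```python
# Minimal sketch (illustrative): a counter-intuitive Gaussian pulse pair, where the
# Stokes coupling Omega_S precedes the pump coupling Omega_P.
import numpy as np

def stirap_pulses(t, omega_max, t0, delay, sigma):
    omega_S = omega_max * np.exp(-((t - (t0 - delay / 2)) / sigma) ** 2)  # applied first
    omega_P = omega_max * np.exp(-((t - (t0 + delay / 2)) / sigma) ** 2)  # applied second
    return omega_S, omega_P

# Example: an episode of 15 mechanical periods (placeholder numbers)
t = np.linspace(0.0, 15.0, 1501)
omega_S, omega_P = stirap_pulses(t, omega_max=10.0, t0=7.5, delay=3.0, sigma=2.0)
```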

Figure S7: The same as Fig. 3 in the main text, but for the parameters (a) $\tilde{\Omega}_{\rm S/P}^{\rm max}/\omega_b=120$ and (b) $\tilde{\Omega}_{\rm S/P}^{\rm max}/\omega_b=150$.

S3 The DRL controller for the tripartite opto-magno-mechanical system

The tripartite system is modelled similarly to the two-mode model described earlier, except that an additional Gaussian noise layer is added to the final layer of the network to facilitate more exploration and uncover the non-trivial controls. Here the controls are $\tilde{\Omega}_{\rm P/S}$, which are taken to be real and positive. As before, the observations are the instantaneous solutions of the coupled differential equations for the second-order moments of the quantum master equation of the tripartite system, together with the controls $\tilde{\Omega}_{\rm P/S}$. In a given episode, the RL-agent makes a total of 150 interactions with the RL-environment in timesteps of $dt=0.1$, each time applying its actions and maintaining a generous exploration-exploitation trade-off. The reward function for this particular case is,

$$r(t)=\frac{n_{T}}{\langle b^{\dagger}b(t)\rangle}-\lambda\,\langle m^{\dagger}m(t)\rangle, \qquad (S20)$$

where $n_T$ is the thermal phonon number, $b^{\dagger}b\ (m^{\dagger}m)$ is the number operator of the phonon (magnon) mode, and $\lambda=10$ is chosen to guarantee that population is shuffled out of the phonon mode into the optical cavity mode, cooling the phonon mode without transferring population to the magnon mode.

Given the large computational overhead of solving the coupled second-order moment equations for the tripartite opto-magno-mechanical system with the state-of-the-art experimental parameter choice of $\omega_m=10^{5}\,\omega_b$, the problem is solved in two steps. In the first step, we train an auxiliary system with $\omega_m=10^{3}\,\omega_b$ and $\tilde{\Omega}_{\rm P/S}=10\,\omega_b$, for which the set of equations can be solved without much computational effort. The trained auxiliary model is then used as a supervisor/teacher for the actual system, which we call primary, with $\omega_m=10^{5}\,\omega_b$ and $\tilde{\Omega}_{\rm P/S}=100\,\omega_b$. Training the primary system for a few hundred episodes with periodic evaluation of the RL-agent yields the best trained model. In the RL literature, this can be considered a form of imitation learning. A minimal sketch of this two-step (auxiliary-to-primary) procedure is given below.
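The sketch below shows how such a two-step transfer could be arranged in Stable-Baselines3; it is illustrative rather than the authors' pipeline. `TripartiteEnv` is a hypothetical Gym environment, and its arguments and the timestep counts are placeholders.

```python
# Minimal sketch (illustrative): a SAC agent trained on a cheap auxiliary environment
# is re-used to initialize training on the expensive primary environment.
from stable_baselines3 import SAC

aux_env = TripartiteEnv(omega_m=1e3, omega_max=10.0)       # cheap auxiliary system
model = SAC("MlpPolicy", aux_env, learning_rate=1e-4)
model.learn(total_timesteps=500_000)                        # step 1: train the teacher
model.save("sac_auxiliary")

primary_env = TripartiteEnv(omega_m=1e5, omega_max=100.0)   # stiff primary system
model = SAC.load("sac_auxiliary", env=primary_env)          # re-use the learned weights
model.learn(total_timesteps=50_000)                         # step 2: fine-tune briefly
model.save("sac_primary")
```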

The (auxiliary) RL-agent is trained for the following set of cases: $\tilde{\Omega}_{\rm P/S}^{\rm max}=10,12,15$. The learning curve for the first case is shown in Fig. S6. In these plots, we show how the RL-agent gradually learns the non-intuitive control sequences by optimizing the neural network weights to attain larger net returns as determined by Eq. S20. We also show the performance of the RL-agent at different stages of the learning process, as explained in more detail in the caption of Fig. S6.

References