Incorporation of Internal Coordinates Interpolation into the Freezing String Method

Jonah Marks, Joseph Gomes (joe-gomes@uiowa.edu)

Abstract

We present an improved method for determining guess structures for transition state searches by incorporating internal coordinates interpolation into the freezing string method (FSM). We test our method on over 40 reactions across 3 benchmark datasets covering a diverse set of chemical reactions. Our results show that incorporating internal coordinates interpolation improves the computational efficiency of the FSM by enabling larger interpolation step sizes and fewer optimization steps per cycle, while maintaining a 100% success rate on the benchmark reactions, including systems where previous attempts based on linear synchronous transit interpolation have failed. We provide an open-source Python implementation of the FSM, in addition to the reactant, product, and transition state structures of all reactions studied.

Keywords: Chemical calculations, Chemical reactions, Optimization, Transition states

Department of Chemical and Biochemical Engineering, University of Iowa, Iowa City, United States

Abbreviations: MEP, TS, FSM, LST, RIC

Table of contents graphic: Reaction coordinate diagram of the bicyclo[1.1.0]butane ring opening reaction to trans-butadiene.

1 Introduction

The determination of reaction mechanisms is an important step in predicting the thermodynamic and kinetic properties of chemical reactions. In computational chemistry, the reaction mechanism is typically represented as a pathway on the Born-Oppenheimer potential energy surface (PES) of the system of interest given by a suitable potential energy function of the nuclear coordinates. The equilibrium states correspond to stationary points on the PES with zero first-order derivatives (gradient) in all directions and all positive eigenvalues of the second-order derivative (Hessian) matrix. The equilibrium states can be identified by energy minimization given a suitable initial guess structure. There exist many pathways by which equilibrium states may interconvert; however, only a small subset of these pathways are thermally accessible and relevant to thermodynamic and kinetic analysis of reaction mechanisms. The minimum energy path (MEP) is the route that requires the least potential energy for the system to undergo the transition. The transition states (TSs), as defined in transition state theory, are first-order saddle points with zero gradient and exactly one negative eigenvalue of the Hessian matrix. The MEP connecting two equilibrium states must pass through one or more TSs and serves as a representative reaction path.

The first step in identifying the MEP is typically locating the minimum energy TS connecting the equilibrium states of interest. Determination of TS geometries is a computationally expensive task that frequently requires significant human intervention. Efforts have been made to develop automated processes for finding first-order saddle points on PESs while attempting to minimize computational cost. Many modern algorithms take a chain-of-states¹ approach to this problem. Geometric interpolation between the reactant and product geometries is performed, producing a string of intermediate structures, or nodes, distributed along the reaction pathway. The resulting structures are optimized using ab initio methods to approximate the MEP of the reaction. The highest energy structure in the chain-of-states is taken as a guess of the TS geometry. A local surface walking algorithm then refines the guess to the exact TS geometry.^{2, 3, 4, 5, 6, 7, 8, 9, 10} These algorithms require local gradient and Hessian information to locate saddle points, and their success is heavily dependent on the initial guess being located within the desired basin of attraction.

Many such chain-of-states methods that provide accurate TS geometry guesses from reactant and product structures have been previously reported. The details of how intermediate structures are selected and optimized differ between methods, and consequently influence the computational cost and success of the algorithm. The nudged elastic band (NEB) method creates an initial chain-of-states by interpolation between reactant and product, and every intermediate structure along the string is iteratively optimized to lie along the reaction pathway, together with an additional penalty term to maintain an even distribution of structures along the chain-of-states.^{11, 12, 13, 14, 15, 16, 17, 18} The growing string method (GSM) in most use cases incurs fewer gradient calculations than the NEB by developing a better initial chain-of-states, or string.^{19, 20, 21, 22, 23, 24, 25, 26, 27, 28} The GSM creates two strings; one starts at the reactant configuration and the other starts at the product configuration. At each iteration of the growing process, an intermediate structure, or node, is added to each string frontier in the direction of the opposing string, after which the entire string is optimized. This growth process is repeated until the strings meet, and is followed by additional optimization of the unified string. In addition to providing a better initial chain-of-states, this avoids simulation of non-physical structures in the interior nodes which may result from the initial interpolation.

The freezing string method (FSM) further reduces the number of gradient calls, but at the cost of no longer identifying the true MEP.^{29, 30, 31} Two strings are grown from the reactant and product, as in the GSM. Once a frontier node is placed, optimization is performed to step the node closer to the reaction pathway, after which it is “frozen”, i.e., it will not move for the remainder of the calculation. New frontier nodes are added to both strings, and the process is repeated. Once the two strings unite, the highest energy structure is taken to be the TS guess without optimization of the unified string. In practice, the resulting pathway can deviate significantly from the MEP but often produces guess structures suitable for further refinement.

The FSM greatly improves the computational efficiency of TS guess finding compared to previously described algorithms, often resulting in an order of magnitude reduction in the number of electronic structure calculations required to determine the true TS structure. Despite these improvements, the FSM is not a foolproof algorithm. The efficiency of the calculation, and ultimately its success, depends heavily on the interpolation algorithm used to generate initial guess structures at the frontier of the growing strings. There exist known issues with commonly applied Cartesian coordinate and linear synchronous transit (LST) interpolation techniques³², which can produce high energy or otherwise aphysical molecular geometries far from the true MEP that are then subjected to electronic structure calculation and geometry optimization.^{25, 33} Because the FSM uses relatively few optimization steps, the incorporation of these geometries as anchor points into future interpolation steps poisons the calculation, resulting in a failed search and wasted computational effort.

In this work, we demonstrate an improved method for initiating TS searches by incorporating internal coordinates (ICs) interpolation into the FSM. Additionally, we incorporate the L-BFGS-B optimizer with an explicit line search for step size determination, which improves the reaction pathway optimization steps. We test our method on over 40 reactions across three benchmark datasets covering a broad set of chemical reactions. Our results show that, using previously studied LST interpolation, the incorporation of the L-BFGS-B optimization reduces the computational effort relative to previous studies, even considering the additional computational cost incurred due to step size determination by line search. Incorporation of ICs interpolation further improves the computational efficiency of the FSM and successfully locates high-quality TS guess structures where previous attempts based on LST interpolation have failed. We provide an open-source Python implementation of the FSM, in addition to the reactant, product, and TS structures of all reactions computed at the $\omega$B97X-V/def2-TZVP level of theory. We anticipate that these resources will be of broad interest for researchers in computational chemistry studying problems where the fast and reliable location of TSs is important, or those developing algorithms who wish to evaluate their methods on a broad set of realistic chemical problems.

2 Methods

2.1 Overview of the Freezing String Method

Figure 1: Algorithm flowchart describing the freezing string method. $s$ is the interpolation step size and $N_{\mathrm{opt}}$ is the maximum number of optimization steps per interpolation step.

A flowchart of the FSM algorithm is shown in Figure 1. The goal of the FSM is to produce an approximate reaction pathway connecting two given reactant and product structures. The approximate reaction pathway is represented as a chain-of-states consisting of nodes along a parameterized string. The interior nodes on the string represent intermediate geometries along the approximate reaction pathway. The method evolves the string by alternately adding nodes to the reactant and product sides of a growing string. The new reactant and product side structures are generated by taking a step inward along an interpolated path between frontier nodes. After interpolation, these new frontier structures undergo geometry optimization in the directions perpendicular to the approximate reaction pathway, and then the geometries are frozen. These steps are repeated until the reactant and product sides of the growing string meet. The highest energy node along the pathway is chosen as the TS guess structure, which is then refined to the true TS geometry using a local surface walking optimization algorithm.
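The growth cycle described above can be illustrated on a model problem. The following is a minimal sketch, not the implementation released with this work: it assumes a hypothetical two-dimensional surface with minima at (-1, 0) and (1, 0) and a saddle point at (0, 0.5), uses straight-line interpolation between frontier nodes, and replaces the L-BFGS-B optimizer with fixed-size steepest-descent steps. All function names and parameter values are illustrative.

```python
import numpy as np

def energy_and_gradient(p):
    """Hypothetical 2D surface: minima at (-1, 0) and (1, 0), saddle at (0, 0.5)."""
    x, y = p
    d = y - 0.5 * (1.0 - x**2)  # displacement from the curved valley floor
    e = (x**2 - 1.0)**2 + 5.0 * d**2
    g = np.array([4.0 * x * (x**2 - 1.0) + 10.0 * d * x, 10.0 * d])
    return e, g

def relax_perpendicular(p, tangent, n_opt=2, alpha=0.05):
    """Take n_opt fixed-size descent steps perpendicular to the unit tangent."""
    for _ in range(n_opt):
        _, g = energy_and_gradient(p)
        g_perp = g - tangent * (tangent @ g)
        p = p - alpha * g_perp
    return p

def freezing_string(reactant, product, n_nodes=10, n_opt=2):
    """Grow reactant- and product-side strings inward until they meet."""
    r_side = [np.asarray(reactant, float)]
    p_side = [np.asarray(product, float)]
    s = np.linalg.norm(p_side[0] - r_side[0]) / n_nodes  # fixed step size
    for _ in range(4 * n_nodes):  # safety cap on the number of growth cycles
        for grow, other in ((r_side, p_side), (p_side, r_side)):
            gap = other[-1] - grow[-1]
            dist = np.linalg.norm(gap)
            if dist <= s:  # the two frontiers have met
                return r_side + p_side[::-1]
            tangent = gap / dist
            node = grow[-1] + s * tangent  # step inward from the frontier
            grow.append(relax_perpendicular(node, tangent, n_opt))  # then freeze
    return r_side + p_side[::-1]

path = freezing_string([-1.0, 0.0], [1.0, 0.0])
energies = [energy_and_gradient(p)[0] for p in path]
ts_guess = path[int(np.argmax(energies))]
print("TS guess:", ts_guess, "energy:", max(energies))
```

Each node is relaxed only when it is placed and is never revisited, so the gradient count grows linearly with the number of nodes. The highest energy node is only a guess and would still be passed to a local TS search.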

The interpolation step is performed by adding nodes at a user-defined distance, $s$ , from the reactant and product frontier nodes. During the first evolution step, the frontier nodes are the user-defined reactant and product structures, while subsequent steps use the frozen nodes from the previous iteration as anchor points for interpolation. The interpolation step size $s$ is fixed during the calculation, and in this work, we determine $s$ by dividing the arc length of an interpolated path connecting the initial reactant and product structures by a user-defined nominal number of nodes, $N_{\mathrm{nodes}}$ . The interpolation can be performed using many different coordinate systems. We consider both LST interpolation and linear interpolation in redundant internal coordinates. Tangent directions at the new reactant and product frontier nodes are determined from a cubic spline fitted through the chemical structures of the interpolated pathway.
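As an illustration of the step size definition, the sketch below computes $s$ from a discretized interpolated path. It assumes the arc length is approximated by summing Cartesian distances between consecutive images; the function name and the array layout (images, atoms, 3) are hypothetical.

```python
import numpy as np

def interpolation_step_size(path, n_nodes):
    """Arc length of a discretized path divided by the nominal node count.

    path: array of shape (n_images, n_atoms, 3) holding Cartesian geometries.
    """
    path = np.asarray(path, dtype=float)
    flat = path.reshape(len(path), -1)  # one row of 3N coordinates per image
    segment_lengths = np.linalg.norm(np.diff(flat, axis=0), axis=1)
    return segment_lengths.sum() / n_nodes

# One atom moving along a straight line of length 5, sampled at 11 images.
demo_path = np.linspace([[0.0, 0.0, 0.0]], [[3.0, 4.0, 0.0]], 11)
demo_s = interpolation_step_size(demo_path, n_nodes=10)
```

A finer sampling of the interpolated path gives a better estimate of the arc length when the path is curved.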

During optimization, the energy is minimized by assumption of the local quadratic approximation:

$$E(\boldsymbol{x})=E(\boldsymbol{x_{0}})+(\boldsymbol{x}-\boldsymbol{x_{0}})^{T}\boldsymbol{g}^{\perp}+\frac{1}{2}(\boldsymbol{x}-\boldsymbol{x_{0}})^{T}\boldsymbol{H}(\boldsymbol{x}-\boldsymbol{x_{0}}) \tag{1}$$

Here, $\boldsymbol{x_{0}}$ is the current geometry, $\boldsymbol{x}$ is a proposed geometry, $\boldsymbol{g}^{\perp}$ is the perpendicular gradient, and $\boldsymbol{H}$ is the approximate Hessian in the space perpendicular to the approximate reaction pathway. The perpendicular gradient is given by $\boldsymbol{g}^{\perp}=(\boldsymbol{I}-\boldsymbol{\hat{t}}\boldsymbol{\hat{t}}^{T})\boldsymbol{g}$ where $\boldsymbol{\hat{t}}$ is the normalized tangent vector determined during the interpolation step.
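The projection that defines the perpendicular gradient can be written in a few lines. This is a minimal sketch with a hypothetical function name; the tangent is normalized inside the function, so an unnormalized tangent may be passed.

```python
import numpy as np

def perpendicular_gradient(gradient, tangent):
    """Apply (I - t t^T) to the gradient, with t the normalized tangent."""
    g = np.asarray(gradient, dtype=float).ravel()
    t = np.asarray(tangent, dtype=float).ravel()
    t_hat = t / np.linalg.norm(t)
    # g - t (t . g) avoids building the full 3N x 3N projector.
    return g - t_hat * (t_hat @ g)

g_perp = perpendicular_gradient([1.0, 2.0, 3.0], [0.0, 0.0, 2.0])
```

Here the tangent lies along the third coordinate, so that component of the gradient is removed and the other two are unchanged.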

The optimization of equation 1 is performed using the L-BFGS-B algorithm³⁴ as implemented in the SciPy Python library.³⁵ We place bounds on each Cartesian coordinate such that no coordinate is displaced by more than 0.3 Å during a single optimization step. Each optimization step begins with a backtracking line search, in which at most $N_{\mathrm{ls}}$ (often one) energy and gradient calculations are performed to determine an appropriate step size. The optimization proceeds until a specified convergence criterion is satisfied, or a maximum number of steps $N_{\mathrm{opt}}$ has been performed.
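A bounded relaxation of this kind can be sketched with `scipy.optimize.minimize`. The example below is an assumption-laden illustration rather than the released code: the function name and the toy surface are hypothetical, and the bounds are placed around the starting geometry for the whole call, which is simpler than restricting each individual step as described above. The `maxiter` and `maxls` options stand in for $N_{\mathrm{opt}}$ and $N_{\mathrm{ls}}$.

```python
import numpy as np
from scipy.optimize import minimize

def relax_node(x0, energy_and_gradient, tangent, n_opt=2, n_ls=3, max_disp=0.3):
    """Minimize the energy perpendicular to the tangent with box bounds."""
    x0 = np.asarray(x0, dtype=float).ravel()
    t_hat = np.asarray(tangent, dtype=float).ravel()
    t_hat = t_hat / np.linalg.norm(t_hat)

    def objective(x):
        e, g = energy_and_gradient(x)
        return e, g - t_hat * (t_hat @ g)  # energy and perpendicular gradient

    # Assumption: one box of half-width max_disp around the starting point.
    bounds = list(zip(x0 - max_disp, x0 + max_disp))
    result = minimize(objective, x0, jac=True, method="L-BFGS-B", bounds=bounds,
                      options={"maxiter": n_opt, "maxls": n_ls})
    return result.x

# Hypothetical quadratic surface whose minimum lies outside the box.
target = np.ones(3)

def toy_surface(x):
    return float(np.sum((x - target) ** 2)), 2.0 * (x - target)

x_start = np.zeros(3)
x_new = relax_node(x_start, toy_surface, tangent=[1.0, 0.0, 0.0])
```

Because the gradient component along the tangent is projected out, the first coordinate does not move, while the remaining coordinates move toward the minimum without leaving the box.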

2.2 Redundant internal coordinates

Cartesian coordinates $\boldsymbol{x}=(x_{1},x_{2},\dots,x_{3N})$ are used to specify the positions of $N$ atoms in a molecule and are a necessary input for electronic structure calculations. Internal coordinates (ICs) $\boldsymbol{q}=(q_{1}(\boldsymbol{x}),q_{2}(\boldsymbol{x}),\dots,q_{n}(\boldsymbol{x}))$ are functions of the Cartesian coordinates that better describe the collective motions of atoms. There exist several choices for constructing coordinate systems when performing interpolation between two molecular geometries. The definition of the redundant internal coordinates used in this work has been reported previously.^{36, 37} Redundant internal coordinate sets are constructed separately for reactant and product molecules, and the union of these two sets gives the full set of redundant coordinates. Bonds, angles, linear angles, torsion angles, and out-of-plane angles are assigned based on a procedure outlined by Bakken and Helgaker,³⁸ and the full set of coordinates is further pruned to ensure each internal coordinate is well-defined for both reactant and product molecules.

2.3 ICs interpolation

Given a set of $n$ primitive redundant coordinates $\boldsymbol{q}$ , a set of structures $\boldsymbol{q}^{i}(f)$ are produced by linear interpolation

$$\boldsymbol{q}^{i}(f)=(1-f)\boldsymbol{q}^{R}+f\boldsymbol{q}^{P} \tag{2}$$

where $f$ is the interpolation parameter and $\boldsymbol{q}^{R}$ and $\boldsymbol{q}^{P}$ represent the reactant and product internal coordinate values. Displacements $\boldsymbol{\Delta q}$ in the internal coordinates are related to Cartesian displacements $\boldsymbol{\Delta x}$ through the well-known $B$ matrix,³⁹ $\boldsymbol{\Delta q}=\boldsymbol{B}\boldsymbol{\Delta x}$, a relation valid for small Cartesian displacements. The final step is to transform the geometries along the interpolated path from internal coordinates back to Cartesians. Given a target geometry $\boldsymbol{q}^{i}$ in internal coordinates, this is done iteratively using the formula

$$\boldsymbol{x}_{k+1}=\boldsymbol{x}_{k}+(\boldsymbol{B}(\boldsymbol{x}_{k})^{T})^{-1}[\boldsymbol{q}^{i}-\boldsymbol{q}(\boldsymbol{x}_{k})]. \tag{3}$$

The iteration is terminated when the Cartesian coordinates generated on the ($k+1$)th iteration, $\boldsymbol{x}_{k+1}$, agree with those of the $k$th iteration, $\boldsymbol{x}_{k}$, to within a tolerance of $10^{-7}$ Å. The transformation of all $\boldsymbol{q}^{i}(f)$ results in a chain-of-states $\boldsymbol{x}^{c}(f)$ connecting the current frontier reactant and product nodes that is used during the interpolation step of the FSM.
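The interpolation and back-transformation can be sketched for a bent triatomic molecule described by two bond lengths and one angle. This is an illustration under stated assumptions, not the implementation used in this work: the $B$ matrix is built by finite differences, the inverse in the iterative formula is taken to be the Moore-Penrose pseudoinverse of $\boldsymbol{B}$, and each image is used as the starting guess for the next. All function names are hypothetical.

```python
import numpy as np

def internals(x):
    """Primitive internals of a triatomic A-B-C: r(AB), r(BC), angle(ABC) in rad."""
    a, b, c = np.asarray(x, dtype=float).reshape(3, 3)
    u, v = a - b, c - b
    ru, rv = np.linalg.norm(u), np.linalg.norm(v)
    angle = np.arccos(np.clip(u @ v / (ru * rv), -1.0, 1.0))
    return np.array([ru, rv, angle])

def b_matrix(x, h=1e-6):
    """Finite-difference B matrix with elements B[i, j] = dq_i / dx_j."""
    x = np.asarray(x, dtype=float).ravel()
    columns = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        columns.append((internals(x + step) - internals(x - step)) / (2.0 * h))
    return np.array(columns).T

def to_cartesian(q_target, x_start, tol=1e-7, max_iter=50):
    """Iterate x <- x + pinv(B) (q_target - q(x)) until x stops changing."""
    x = np.asarray(x_start, dtype=float).ravel().copy()
    for _ in range(max_iter):
        dx = np.linalg.pinv(b_matrix(x)) @ (q_target - internals(x))
        x = x + dx
        if np.max(np.abs(dx)) < tol:
            break
    return x

def interpolate_internal(x_reactant, x_product, n_images=5):
    """Linear interpolation in internals, mapped back to Cartesians."""
    q_r, q_p = internals(x_reactant), internals(x_product)
    x = np.asarray(x_reactant, dtype=float).ravel()
    images = []
    for f in np.linspace(0.0, 1.0, n_images):
        x = to_cartesian((1.0 - f) * q_r + f * q_p, x)
        images.append(x)
    return np.array(images)

def bent_triatomic(r, theta_deg):
    """Cartesians of A-B-C with B at the origin and equal bond lengths r."""
    th = np.radians(theta_deg)
    return np.array([[r, 0.0, 0.0], [0.0, 0.0, 0.0],
                     [r * np.cos(th), r * np.sin(th), 0.0]])

images = interpolate_internal(bent_triatomic(1.0, 100.0), bent_triatomic(1.2, 140.0))
```

The middle image has bond lengths of 1.1 and an angle of 120 degrees, which a straight-line Cartesian interpolation between the same end points would not reproduce. Torsions and other periodic coordinates would need additional care when differences are taken, which this sketch does not address.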

2.4 LST interpolation

LST is another chemically realistic interpolation method that produces a chain-of-states pathway.³² A set of structures $\boldsymbol{r}^{i}(f)$ is produced by linear interpolation in the set of internuclear distances $\boldsymbol{r}^{i}=\{r_{ab};a>b=1,2,\dots,N\}$

$$\boldsymbol{r}^{i}(f)=(1-f)\boldsymbol{r}^{R}+f\boldsymbol{r}^{P} \tag{4}$$

where $f$ is the interpolation parameter and $\boldsymbol{r}^{R}$ and $\boldsymbol{r}^{P}$ represent the reactant and product internuclear distances. A second set of structures $\boldsymbol{x}^{i}(f)$ is produced by linear interpolation in Cartesian coordinates

$$\boldsymbol{x}^{i}(f)=(1-f)\boldsymbol{x}^{R}+f\boldsymbol{x}^{P} \tag{5}$$

The final interpolated pathway is determined by minimizing the objective $S$ by the method of least-squares at each interpolation point $f$

$$S=\sum_{a>b}^{N}\frac{(r_{ab}^{i}-r_{ab}^{c})^{2}}{(r_{ab}^{i})^{4}}+w\sum_{j=1}^{3N}(x_{j}^{i}-x_{j}^{c})^{2} \tag{6}$$

where $w$ is a weighting factor (nominally $10^{-6}$ ) and the superscripts $i$ and $c$ denote interpolated and calculated values, respectively. The variables $\boldsymbol{x}^{c}$ are those optimized during the minimization of $S$ , and the variables $\boldsymbol{r}^{c}$ are derived from $\boldsymbol{x}^{c}$ . The denominator in the first term ensures that important distances between bonded atoms are preserved in the final interpolated geometries. The second term is weighted by a factor $w$ , and produces forces to prevent rigid translation or rotation such that the interpolated structures align with the end structures. The weighting factor $w$ additionally weights within the objective function the relative importance of linear interpolation in internuclear distances (first term) and linear interpolation in Cartesian coordinates (second term). The optimization of equation 6 is performed using the L-BFGS-B algorithm³⁴ as implemented in the SciPy Python library. The optimization of $S$ for all $f$ results in a chain-of-states $\boldsymbol{x}^{c}(f)$ connecting the current frontier reactant and product nodes that is used during the interpolation step of the FSM.
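The LST objective can be evaluated directly from its definition. The sketch below uses hypothetical function names and only computes $S$ for a trial geometry; in practice this function would be handed to an optimizer such as L-BFGS-B.

```python
import numpy as np

def pair_distances(x):
    """All unique internuclear distances r_ab of a geometry with N atoms."""
    x = np.asarray(x, dtype=float).reshape(-1, 3)
    a, b = np.triu_indices(len(x), k=1)
    return np.linalg.norm(x[a] - x[b], axis=1)

def lst_objective(x_c, x_r, x_p, f, w=1e-6):
    """LST objective S for trial Cartesians x_c at interpolation parameter f."""
    x_c, x_r, x_p = (np.asarray(v, dtype=float).ravel() for v in (x_c, x_r, x_p))
    r_i = (1.0 - f) * pair_distances(x_r) + f * pair_distances(x_p)  # distances
    x_i = (1.0 - f) * x_r + f * x_p                                  # Cartesians
    r_c = pair_distances(x_c)
    return np.sum((r_i - r_c) ** 2 / r_i ** 4) + w * np.sum((x_i - x_c) ** 2)

# A diatomic with unit bond length rotated by 90 degrees between end points.
x_r = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
x_p = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
s_midpoint = lst_objective(0.5 * (x_r + x_p), x_r, x_p, f=0.5)
```

At the Cartesian midpoint the bond has shrunk from 1 to about 0.707, so the first term penalizes this geometry even though the second term vanishes. Minimizing $S$ restores the bond length at the price of a small Cartesian deviation.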

2.5 Computational details

Electronic structure calculations are performed to produce the optimized geometries of the reactant and product for each reaction studied, to compute the quantum mechanical gradients required by the FSM algorithm, and to refine the geometry of the TS guess structures produced by the FSM. All electronic structure calculations were performed using the range-separated hybrid generalized gradient approximation exchange-correlation functional with non-local correlation, $\omega$B97X-V,⁴⁰ and the triple-$\zeta$, polarized valence def2-TZVP basis set.⁴¹ During geometry optimization, energies were converged to $10^{-6}$ Ha (Hartree) and the maximum norm of the Cartesian gradient was converged to $10^{-3}$ Ha bohr$^{-1}$. Geometry optimization was terminated if the convergence criteria were not met within 250 optimization steps. All reported energy values are electronic energies without thermal or zero-point correction. The eigenvector-following local TS searches were performed using the partitioned rational function optimization method.⁵ Frequency analysis was performed to confirm the nature of each stationary point: there must be zero imaginary frequencies for PES minima and exactly one imaginary frequency for PES TSs. Intrinsic reaction coordinate (IRC) pathway calculations initiated at the TS were performed with the predictor-corrector algorithm of Schmidt et al.⁴² to further characterize TS geometries. The FSM software is written in Python and performs file-based data exchange with an electronic structure software package to obtain the quantum mechanical gradients used in the method. All QM calculations were performed using a release version of the Q-Chem 6.0 software package.⁴³
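The stationary point criterion can be illustrated by counting negative Hessian eigenvalues, each of which corresponds to an imaginary frequency. This is a simplified sketch with a hypothetical function name: a production frequency analysis works with the mass-weighted Hessian and projects out translations and rotations, which is omitted here.

```python
import numpy as np

def classify_stationary_point(hessian, tol=1e-8):
    """Classify a stationary point by its number of negative Hessian eigenvalues."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(hessian, dtype=float))
    n_negative = int(np.sum(eigenvalues < -tol))
    if n_negative == 0:
        return "minimum"
    if n_negative == 1:
        return "transition state"
    return "higher-order saddle point"

kind_minimum = classify_stationary_point(np.diag([1.0, 2.0]))
kind_ts = classify_stationary_point(np.diag([-4.0, 10.0]))
```

The tolerance keeps near-zero eigenvalues, such as those from overall translation and rotation, from being counted as negative.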

3 Results and Discussion

We measure benchmark performance by tracking the number of quantum mechanical gradient calculations performed as part of the FSM and TS search procedures, as well as the total number of gradient calculations. We report separately the total number of gradients computed during the FSM calculation, which measures the efficiency with which a TS guess is obtained, and the total number of gradients computed during the TS calculation, which serves as an indicator of TS guess quality. We compare FSM calculations with either redundant internal coordinates (FSM-RIC) or linear synchronous transit (FSM-LST) interpolation methods. We perform FSM calculations with nominally 18 nodes ($N_{\mathrm{nodes}}$) along the approximate reaction pathway, two steps per optimization cycle ($N_{\mathrm{opt}}$), and at most three line search steps per optimization cycle ($N_{\mathrm{ls}}$) unless otherwise stated. We consider this set of parameters to be a conservative baseline where both the FSM-RIC and FSM-LST should perform well.

ID	Reaction	Gradients (FSM-RIC)	Gradients (TS-RIC)	Gradients (TOTAL-RIC)	Gradients (FSM-LST)	Gradients (TS-LST)	Gradients (TOTAL-LST)
1	H₂CO $\rightarrow$ H₂ + CO	58	13	71	62	11	73
2	SiH₂ + H₂ $\rightarrow$ SiH₄	55	5	60	53	4	57
3	acetaldehyde keto-enol tautomerism	63	3	66	61	4	65
4	CH₃CH₃ $\rightarrow$ CH₂CH₂ + H₂	58	24	82	52	21	73
5	bicyclo[1.1.0]butane $\rightarrow$ trans-butadiene	61	47	108	58	25	83
6	parent Diels-Alder cycloaddition reaction	62	15	77	53	33	86
7	cis,cis-2,4-hexadiene $\rightarrow$ 3,4-dimethylcyclobutene	60	18	78	79	25	104
8	alanine dipeptide rearrangement	61	64	125	45	45	90
9	silyl ketene acetal $\rightarrow$ silyl ester Ireland-Claisen rearrangement	60	82	142	60	88	148

ID	Reaction	Gradients (FSM-RIC)	Gradients (TS-RIC)	Gradients (TOTAL-RIC)	Gradients (FSM-LST)	Gradients (TS-LST)	Gradients (TOTAL-LST)
1	C₂H₄ + N₂O $\rightarrow$ C₂N₂O	69	7	76	61	48	109
2	1,3-pentadiene hydrogen transfer	63	5	68	68	6	74
3	HCN $\rightarrow$ HNC	78	3	81	72	9	81
4	1,4-hexadiene Cope rearrangement	56	27	83	60	12	72
5	1,3-cyclopentadiene hydrogen shift	62	5	67	64	7	71
6	1,3-butadiene cyclization	60	6	66	66	8	74
7	Diels-Alder endo addition of cyclopentadiene to cyclopentadiene	62	50	112	57	52	109
8	Diels-Alder addition of cyclopentadiene and ethylene	62	11	73	63	11	74
9	difluorocarbene addition to ethylene	56	24	80	50	12	62
10	ene reaction of ethylene and propene	49	42	91	51	47	98
11	Grignard addition of phenyl magnesium bromide to benzophenone	53	38	91	59	49	108
12	H₂CO $\rightarrow$ H₂ + CO	58	13	71	62	11	73
13	CH₃CH₂F $\rightarrow$ CH₂CH₂ + HF	52	9	61	67	9	76
14	water assisted hydrolysis of ethyl acetate	42	60	102	56	86	142
15	H₂ + H₂CO $\rightarrow$ CH₃OH	52	8	60	61	17	78
16	2-methyl-3-phenyloxirane ring opening	58	48	106	59	36	95
17	CH₂CHCH₂CH₂CHO Claisen rearrangement	57	31	88	61	30	91
18	SiH₂ + H₂ $\rightarrow$ SiH₄	55	5	60	53	4	57
19	Cl^- + CH₃F $\rightarrow$ CH₃Cl + F^-	56	4	60	60	3	63
20	sulfur dioxide addition to butadiene	63	21	84	65	22	87

ID	Reaction	Gradients (FSM-RIC)	Gradients (TS-RIC)	Gradients (TOTAL-RIC)	Gradients (FSM-LST)	Gradients (TS-LST)	Gradients (TOTAL-LST)
1	HCN $\rightarrow$ HNC	78	3	81	72	9	81
2	HCCH $\rightarrow$ CCH₂	61	11	72	63	4	67
3	H₂CO $\rightarrow$ H₂ + CO	58	13	71	62	11	73
4	CH₃O $\rightarrow$ CH₂OH	52	5	57	48	7	55
5	cyclopropyl ring opening	59	11	70	62	25	87
6	bicyclo[1.1.0]butane $\rightarrow$ trans-butadiene	61	47	108	58	25	83
7	formyloxyethyl 1,2-migration	63	14	77	62	11	73
8	parent Diels-Alder cycloaddition	62	15	77	53	33	86
9	s-tetrazine $\rightarrow$ 2HCN + N₂	76	7	83	80	15	95
10	trans-butadiene $\rightarrow$ cis-butadiene	63	2	65	69	2	71
11	CH₃CH₃ $\rightarrow$ CH₂CH₂ + H₂	58	24	82	52	21	73
12	CH₃CH₂F $\rightarrow$ CH₂CH₂ + HF	52	9	61	67	9	76
13	acetaldehyde keto-enol tautomerism	63	3	66	61	4	65
14	HCOCl $\rightarrow$ HCl + CO	65	5	70	67	6	73
15	H₂O + PO ${}_{3}^{-}$ $\rightarrow$ H₂PO ${}_{4}^{-}$	55	21	76	48	24	72
16	CH₂CHCH₂CH₂CHO Claisen rearrangement	57	31	88	61	30	91
17	SiH₂ + CH₃CH₃ $\rightarrow$ SiH₃CH₂CH₃	55	5	60	53	10	63
18	HNCCS $\rightarrow$ HNC + CS	72	8	80	53	9	62
19	HCONH ${}_{3}^{+}$ $\rightarrow$ NH ${}_{4}^{+}$ + CO	65	18	83	99	17	116
20	acrolein rotational TS	63	2	65	59	13	72
21	HCONHOH $\rightarrow$ HCOHNHO	57	12	69	62	5	67
22	HNC + H₂ $\rightarrow$ H₂CNH	36	18	54	66	15	81
23	H₂CNH $\rightarrow$ HCNH₂	30	6	36	66	3	69
24	HCNH₂ $\rightarrow$ HCN + H₂	60	41	101	53	26	79
