Robust crystal structure identification at extreme conditions using a density-independent spectral descriptor and supervised learning

P. Lafourcade paul.lafourcade@cea.fr J.-B. Maillet C. Denoual E. Duval A. Allera A. M. Goryaeva M.-C. Marinica CEA DAM DIF, 91297 Arpajon, France Université Paris-Saclay, LMCE, 91680 Bruyères-le-Châtel, France Laboratoire d’Acoustique de l’Université du Mans (LAUM), UMR 6613, CNRS, Le Mans Université, Le Mans, France Université Paris-Saclay, CEA, Service de recherche en Corrosion et Comportement des Matériaux, SRMP, F-91191 Gif-sur-Yvette, France

(August 12, 2025)

Abstract

The increased time- and length-scale of classical molecular dynamics simulations have led to raw data flows surpassing storage capacities, necessitating on-the-fly integration of structural analysis algorithms. As a result, algorithms must be computationally efficient, accurate, and stable at finite temperature to reliably extract the relevant features of the data at simulation time. In this work, we leverage spectral descriptors to encode local atomic environments and build crystal structure classification models. In addition to the classical way spectral descriptors are computed, i.e. over a fixed radius neighborhood sphere around a central atom, we propose an extension to make them independent from the material’s density. Models are trained on defect-free crystal structures with moderate thermal noise and elastic deformation, using the linear discriminant analysis (LDA) method for dimensionality reduction and logistic regression (LR) for subsequent classification. The proposed classification model is intentionally designed to be simple, incorporating only a limited number of parameters. This deliberate simplicity enables the model to be trained effectively even when working with small databases. Despite the limited training data, the model still demonstrates inherent transferability, making it applicable to a broader range of scenarios and datasets. The accuracy of our models in extreme conditions (high temperature, high density, large deformation) is compared to traditional algorithms from the literature, namely adaptive common neighbor analysis (a-CNA), polyhedral template matching (PTM) and diamond structure identification (IDS). Finally, we showcase two applications of our method: tracking a solid-solid BCC-to-HCP phase transformation in Zirconium at high pressure up to high temperature, and visualizing stress-induced dislocation loop expansion in single crystal FCC Aluminum containing a Frank-Read source, at high temperature.

keywords:

Crystal Structure , Atomic Descriptors , Supervised Learning , Molecular dynamics

1 Introduction

In the last decades, atomic scale simulations such as ab-initio calculations and molecular dynamics (MD) have been increasingly used to model materials properties based on atomic-scale processes. As large-scale simulations are needed to realistically simulate the dynamics of extended systems (e.g. linear defects, interfaces) over long simulated times, MD simulations have been successfully used to investigate a wide range of thermodynamic regimes due to their favorable linear scaling of computational time vs. system’s size. More recently, MD simulations based on empirical potentials have seen a considerable increase of their accuracy and accessible time- and length-scale, as they benefited from the development of ab-initio calculations and machine learning (ML) methods –resulting in more accurate force fields, in conjunction with the development of high performance computing (HPC) facilities and more efficient computational methods. As a result, simulations based on ab-initio-accurate ML force fields have reached the scale of tens [1, 2] to hundreds [3] of billions atoms, allowing studies to be carried out at the micrometer scale where direct microstructure comparison with experiments becomes possible [4]. However, trajectories of atoms obtained via atomistic simulations must be further processed in order to extract the properties and statistics of objects of interest, i.e. defects, phases, interfaces or precipitates for example.

Present-day standard visualization methods can help to identify structural changes by analyzing atomic trajectories. However, the automation of such quantitative analysis as well as a robust identification and extraction of crystalline defects is still challenging. For this purpose, numerous computational methods have been developed in order to enable post-processing analysis of particle-position datasets. These methods generally proceed by comparing individual atomic environments to that of a reference structure, while allowing a certain tolerance in the result. The most common methods for the local structure analysis include basic energy thresholding, censtrosymmetry parameter (CSP) analysis [5], bond order analysis [6], common neighbor analysis (CNA) [7, 8], adaptative common neighbor analysis (a-CNA) [9], bond angle analysis (BAA) [10], Voronoi analysis [11], neighbor distance analysis (NDA) [9] and polyhedral template matching (PTM) analysis [12].

Another set of techniques, oriented towards continuum mechanics measures, has been proposed in the literature to identify the deformation state as well as defects such as dislocation lines. In order to observe dislocation-mediated plasticity, such tools can be used to filter the data, remove crystalline atoms, and extract the atoms that constitute the core of defects. However, no information can be extracted concerning the type of crystal defects, which can be represented by vacancies, interstitials, or dislocation lines. Also, atoms identification can fail, and cases where defects overlap (e.g. an interstitial is found in a dislocation core) is not taken into account. Continuum-like measures, computed as per-atom variables over the current neighborhood, with respect to a reference configuration, have also been introduced [13, 14, 15, 16, 17], allowing the computation of local deformation gradient tensors, slip vectors or Nye tensor. The combination of the latter techniques, along with structural identification methods, have also been used to identify dislocation lines and their Burgers vectors. However they become inefficient at high temperatures or when dislocation interactions come into play. Another widely-used method, known as the dislocation extraction algorithm (DXA) [18, 19] allows to consecutively mesh the atomistic configuration, map the local tetrahedra to perfect crystal structures, extract the distorted tetrahedra (i.e. disordered atoms), and perform the Burgers circuit in order to extract the dislocation lines as well as their Burgers vectors. This algorithm is very helpful for monitoring the evolution of dislocation densities over time [20, 4, 21] and characterizing crystal plasticity features [22].

Most of the methods listed above are directly available in the Open Visualization Tool (OVITO) [23], enabling a straightforward comparison with newly developed structure identification tools. However, a common weakness of traditional analysis methods is their sensitivity to thermal noise which can be limiting when simulating crystals at finite temperature. To remove thermal vibrations while preserving the features of the high-temperature structure, vibration-averaging can be used, as well as structure denoising based on graph neural networks [24].

Some of the techniques described above have been used to process atomistic simulations on-the-fly, benefiting from their ease of implementation and high data throughput since they are easily integrated in MD codes such as LAMMPS [25, 26] or ExaSTAMP [27, 28, 29, 30]. However, more computationally demanding methods, such as DXA, remain challenging to use for on-the-fly, large-scale application. Although the DXA implementation available in OVITO is highly optimized, the algorithm is computationally expensive and requires 1 GB of free RAM memory per million atoms. The memory should be distributed across several nodes to scale to large systems, or a parallelism scheme should be adopted, which is not the case in the official distribution but seems to be a work in progress on their side according to very recent communications. For example, one cubic micrometer sample of BCC Tantalum with only the positions written to a single ASCII file would contain approximately 55 billion atoms and occupy 24 GB of disk space. In order to run DXA on this sample, up to 55,000 GB of memory would be required, which is technically difficult to achieve without special hardware design. This shows the urgent necessity to develop in-situ analysis tools, since filtering such simulations on-the-fly can cut the needs of storage capacity by orders of magnitude. In addition to basic properties, other descriptors like deformation gradient tensors, velocity-gradient tensors or bond-order parameters for example would drastically increase the necessary disk space. It becomes clear that storing a few terabytes per snapshot over a few nanoseconds, even with a low output frequency, cannot be considered as viable.

More lately, additional atomic descriptors enabling local structure analysis have been introduced [31, 32], such as Behler-Parrinello Chebyshev polynomial representations (CPR) [33], Behler-Parrinello symmetry functions (BP) [34], smooth overlap of atomic positions (SOAP) [35], atomic cluster expansion (ACE) [36], adaptative generalizable neighborhod informed features (AGNI) [37, 38]. In addition, machine learning aided crystal structure identifiers have been published, either based on Bayesian deep learning (ARISE) [39] or neural networks [40]. Finally, structural defects in crystalline solids can be effectively detected as structural outliers using distortion scores of local atomic environments [41, 42]. This method uses minimum covariance determinant (MCD) in conjunction with compact atomic descriptors like bispectrum [43] and allows for accurate structural analysis even in noisy structures. However, until now, it has not been coupled with an automated structure classifier. These methods, while being highly accurate, also have a non-negligible computational cost compared to traditional methods presented above. However, the performance of BSO4 for example, has been dramatically improved since the initial version of SNAP (see [1]) and was ported to GPU, making it one of the most efficient force-field framework with near-ab-initio accuracy. For practical use in on-the-fly MD analysis, the calculation of BSO4 every Nth step should thus not be a bottleneck –even when using another force-field than SNAP. The present methodology tightly integrates with the MD engine with acceptable overhead.

In this work, we address in-situ analysis of large-scale MD trajectories and strive to minimize the amount of information to be stored. While today the research community aims at simulating larger and larger samples through MD simulations, the bottleneck is not only in performing the simulation itself, but mainly in effectively and accurately analyzing it, which represents a paradigm shift. However, there is always a trade-off to find between computational cost, accuracy, and robustness since a low sensitivity to atomic displacements usually comes at the price of a reduced capability of the identification method to distinguish similar structures [9] Here, we propose a novel algorithm (see Algorithm 1) for the identification and classification of crystal structures, also allowing for the accurate detection of atoms that contribute to defects. The method uses machine learning (ML) techniques and can be used for the analysis of materials under extreme conditions, including thermal noise, large deformations as well as large hydrostatic pressures.

The paper is organized as follows. The first section details the construction of the training database used for different models. The second section describes the training process for the different algorithms employed in this work. Finally, in the last section, we demonstrate the performance of our crystal structure identification algorithm and apply it to analyze large-scale MD simulations at finite temperature. We present two cases of interest: solid-solid phase transition in hexagonal close-packed Zirconium and Frank-Read source dislocation loops expansion in face-centered cubic Aluminum.

Algorithm 1 Crystal structure classification algorithm

C:

Number of crystal structures in the database

M:

Dimension of the atomic descriptor

\bm{\mu}_{\mathrm{cs}}:

Mean descriptor of sub database with structure

\mathrm{cs}

{\mathbf{\Sigma}}_{\mathrm{cs}}:

Covariance matrix of sub database with structure

\mathrm{cs}

\epsilon_{\mathrm{cs}}:

Acceptance threshold for structure

\mathrm{cs}

for Each atom in simulation cell do

1. Compute per-atom descriptor

\mathbf{B}\in\mathbb{R}^{D}

2. Reduce dimension

\mathbf{x}=P_{\textrm{LDA}}(\mathbf{B})

P_{\textrm{LDA}}:\mathbb{R}^{D}\to\mathbb{R}^{C-1}

3. Logistic regression

\mathrm{cs}=P_{\textrm{LR}}(\mathbf{x})

P_{\textrm{LR}}:\mathbb{R}^{C-1}\to\mathbb{R}^{1}

4. Compute

d^{\mathrm{cs}}_{\mathrm{Maha}}=\sqrt{(\mathbf{x}-\bm{\mu}_{\mathrm{cs}})^{\mathrm{T}}{\mathbf{\Sigma}}^{-1}_{\mathrm{cs}}(\mathbf{x}-\bm{\mu}_{\mathrm{cs}})}

d^{\mathrm{cs}}_{\mathrm{Maha}}>\epsilon_{\mathrm{cs}}

then

Define atom as non-crystalline, i.e.

\mathrm{cs}=-1

else

Assign crystal structure

\mathrm{cs}

to atom

end if

end for

2 Database preparation

The training database contains four different crystal structures: body-centered cubic (BCC), face-centered cubic (FCC), hexagonal close-packed (HCP), and cubic diamond (c-DIA). The extension to other crystalline structures is straightforward. For each structure, a model metal is considered, and modelled using the following interatomic potentials: Aluminium [44] for BCC, Iron [45] for FCC, Zirconium [46] for HCP and Silicon [47] for c-DIA. A common aspect of supervised machine learning techniques is that their application range is given by the information contained in the learning database. Hence the elements of the training database should be carefully chosen with respect to target applications. Here, we pay attention to include the structures carrying information about temperature and small deformations. Then, once built, the database is mapped into a descriptor space onto which learning will be performed. These procedures are detailed below.

2.1 Construction of the database in Cartesian space

2.1.1 Finite temperature molecular dynamics trajectories

The effects of temperature must be accounted for when developing a robust crystal structure classifier suitable for materials at extreme conditions. In order to sample finite-temperature configurations for the database, we compute MD trajectories in the NPT ensemble for each material representing the different crystal structures. The simulation cell dimensions correspond to the equilibrium density of each material at 300 K and ambient pressure, and contain 864 atoms. Trajectories are integrated with a timestep of 1 fs, and the coupling parameters for the thermostat and barostat are set to 0.1 and 1.0 ps, respectively. NPT simulations are performed at ambient pressure while ramping up the temperature from 0 to 2/3 of each material’s melting temperature over a 500 ps time window. Configurations are extracted every 5 ps along each trajectory, leading to an ensemble of 100 configurations per crystal structure.

2.1.2 Deformation measure

A macroscopic deformation gradient tensor ${\mathbf{F}}$ is applied to the entire system while remapping the $3N$ atoms coordinates $\mathbf{q}={\mathbf{r}}_{1}\oplus\ldots\oplus{\mathbf{r}}_{N}\in\mathbb{R}^{3N}$ (where ${\mathbf{r}}_{i}\in\mathbb{R}^{3}$ are the cartesian coordinates of the $i^{th}$ atom) into the deformed simulation cell. For every configuration extracted from the NPT trajectories:

\mathbf{r}_{i,\textrm{deformed}}={\mathbf{F}}\mathbf{r}_{i},

(1)

where $\mathbf{r}_{i,\textrm{deformed}}\in\mathbb{R}^{3}$ stands for the cartesian coordinates of the $i^{th}$ atom subjected to a homogeneous deformation governed by ${\mathbf{F}}$ . This deformation gradient tensor reads:

{\mathbf{F}}=\begin{pmatrix}F_{11}&F_{12}&F_{13}\\ 0&F_{22}&F_{23}\\ 0&0&F_{33}\end{pmatrix}

(2)

leading to $6$ independent variables describing all homogeneous deformation possibilities in any material. In the present work, we focus on the measure of deviatoric deformation $\epsilon_{\mathrm{eq}}=\sqrt{\frac{3}{2}\mathrm{dev}({\mathbf{E}}):\mathrm{dev}({\mathbf{E}})}$ as a threshold criterion. Here, $\mathrm{dev}{\mathbf{(E)}}={\mathbf{E}}-\frac{1}{3}\mathrm{tr}({\mathbf{E}}){\mathbf{I}}$ with ${\mathbf{E}}$ the Green-Lagrange strain tensor constructed from ${\mathbf{F}}$ with ${\mathbf{E}}=1/2({\mathbf{F}}^{T}{\mathbf{F}}-{\mathbf{I}})$ and ${\mathbf{I}}$ the identity. Only small strains, i.e. within the elastic deformation regime, are considered by imposing a threshold value on $\epsilon_{eq}$ , i.e. $\epsilon_{eq}<0.05$ . This way, subsets of atomistic configurations subjected to large local deformation such as dislocation-mediated plasticity or amorphous shear banding should emerge as outliers during the classification process.

2.2 Mapping the database into the descriptor space

2.2.1 Bispectrum SO4 descriptor

In this work, we use the bispectrum SO4 [43] to map the local atomic density function into invariant representations. This descriptor has several advantages compared to using the Cartesian coordinates of the surrounding atoms, e.g., it has constant dimension and is invariant to atomic permutation, rotation, and translation. It has been widely used in the context of machine learning interatomic potentials - MLIP, including spectral neighbor analysis potentials (SNAP) [48] and other similar forms [49, 50, 51]. Bispectrum SO4 also has the capacity to provide an accurate description of the atomic neighborhood, suitable for advanced structural analysis [43, 41]. Below, we briefly recall the key concepts and the algebraic formalism used to compute this descriptor. The fully detailed mathematical definition is given in [48]. For all monoatomic systems employed in the present study, the atomic neighbor density around atom $i$ at location ${\mathbf{r}}_{i}$ reads:

\rho_{i}({\mathbf{r}})=\delta({\mathbf{r}})+\sum_{r_{ii^{\prime}}<r_{\mathrm{cut}}}f_{c}(r_{ii^{\prime}})\delta({\mathbf{r}}-{\mathbf{r_{ii^{\prime}}}})

(3)

where ${\mathbf{r}}_{ii^{\prime}}={\mathbf{r}}_{i}-{\mathbf{r}}_{i^{\prime}}$ is the distance between the central atom $i$ and the neighbor atom $i^{\prime}$ , and the cutoff function $f_{c}$ ensures that the contribution of neighboring atoms smoothly decreases to zero at $r_{\mathrm{cut}}$ . By mapping radial neighbor coordinates $r$ to an angular component $\theta_{0}=\theta_{0}^{\mathrm{max}}r/r_{\mathrm{cut}}$ , the atomic neighbor density can be expanded in the basis functions of the unit 3-sphere, the 4D hyper-spherical harmonics $U^{j}_{m,m^{\prime}}(\theta_{0},\theta,\phi)$ :

\rho({\mathbf{r}})=\sum_{j=0,1/2,...}^{\infty}\sum_{m=-j}^{j}\sum_{m^{\prime}=-j}^{j}u^{j}_{m,m^{\prime}}U^{j}_{m,m^{\prime}}(\theta_{0},\theta,\phi),

(4)

where the expansion coefficients $u^{j}_{m,m^{\prime}}$ are a sum over discrete values of the corresponding basis function evaluated at each neighbor position,

u^{j}_{m,m^{\prime}}=U^{j}_{m,m^{\prime}}(0)+\sum_{r_{ii^{\prime}}<r_{\mathrm{cut}}}f_{c}(r_{ii^{\prime}})U^{j}_{m,m^{\prime}}(\theta_{0},\theta,\phi).

(5)

Finally, using the scalar triple products of these expansion coefficients, the real-value bispectrum components can be expressed as:

B_{j_{1},j_{2},j}=\sum_{m,m^{\prime}}u^{j*}_{m,m^{\prime}}\sum_{\begin{subarray}{c}m_{1},m^{\prime}_{1}\\ m_{2},m^{\prime}_{2}\end{subarray}}H\begin{subarray}{c}jmm^{\prime}\\ j_{1}m_{1}m^{\prime}_{1}\\ j_{2}m_{2}m^{\prime}_{2}\end{subarray}u^{j_{1}}_{m_{1},m^{\prime}_{1}}u^{j_{2}}_{m_{2},m^{\prime}_{2}},

(6)

with $*$ the complex conjugation operator and where the constants $H\begin{subarray}{c}jmm^{\prime}\\ j_{1}m_{1}m^{\prime}_{1}\\ j_{2}m_{2}m^{\prime}_{2}\end{subarray}$ are the Clebsch-Gordan coefficients for the hyper-spherical harmonics. The final coefficient is invariant to rotation and permutation. The order of the expansion $J_{max}$ determines the accuracy of the geometrical representation of the atomic neighborhood, although bispectrum coefficients are not listed in order of importance. However, increasing the value of $J_{max}$ leads to better accuracy but also to a higher computational cost. In the following, we choose a value of the expansion parameter $J_{max}=4$ that represents a good compromise between the accuracy of geometrical description and computational cost [52, 41]. This leads to a bispectrum $\mathbf{B}$ with $55$ real components, i.e. $\in\mathbb{R}^{55}$ , which corresponds to the dimensionality of feature space.

2.2.2 Fixed cutoff or fixed number of neighbors computation

Refer to caption — Figure 1: Construction of a density-invariant descriptor. Heatmaps showing the relative evolution of bispectrum components (in %) as a function of volumetric ratio with respect to a reference bispectrum taken at $V/V_{0}=1.0$ . Panels correspond to different bispectrum variants and setups: (a) Fixed $6\text{\,}\mathrm{\SIUnitSymbolAngstrom}$ cutoff distance. (b) and (c) Fixed number of neighbors $N_{\mathrm{neigh}}$ =24 with a Heavyside and tanh weight functions, respectively. While a fixed number of neighbors considerably reduces the density dependence, the regularized $\tanh$ weight function achieves near-invariance with respect to the local density at finite temperature.

Concerning the cutoff parameter $r_{\mathrm{cut}}$ used to define the neighborhood of a central atom for which the bispectrum is computed, two strategies are proposed. Firstly we consider a fixed cutoff radius, as for the calculation of the potential which is the common procedure in the context of MLIP. Each neighbor in the cutoff sphere is included in the bispectrum calculation, weighted by the cut-off function which smoothly switches from 1 for distances lower than $r_{\mathrm{cut}}$ to exactly zero above $r_{\mathrm{cut}}$ . The number of neighbors $N_{\mathrm{neigh}}$ of a central atom is determined by the magnitude of $r_{\mathrm{cut}}$ , resulting in a larger number of neighbors for denser materials.

The second alternative is to compute the bispectrum of an atomic environment containing a fixed number of neighbors $N_{\mathrm{neigh}}$ . Its main advantage is to remove the dependence of the crystal structure analysis on the density of the material. Hence different materials at varying densities (even locally) could be mapped to an equivalent descriptor representation.

A simple way to achieve the selection of the $N_{\mathrm{neigh}}$ neighbors would be to sort them by their distance to the central atom and consider only the $N_{\mathrm{neigh}}$ first ones while choosing the cutoff radius as the distance between the central atom and its farthest neighbor. More formally, using a basic dichotomy algorithm, one can compute the optimal cutoff radius $r_{\mathrm{cut}}$ that satisfies these requirements by defining the following equality:

W_{i}(r_{\mathrm{cut}})=\sum_{j}^{M}H(r_{\mathrm{cut}}-|r_{ij}|)=N_{\mathrm{neigh}},

(7)

where $W_{i}(r_{\mathrm{cut}})$ is the total weight factor associated with the central atom $i$ , $H$ is the Heavyside function, $N_{\mathrm{neigh}}$ is the target number of neighbors and $M$ is the actual number of neighbors, related to $r_{\mathrm{cut}}$ , required to satisfy $W_{i}(r_{\mathrm{cut}})=N_{\mathrm{neigh}}$ . Using the Heavyside weight function systematically leads to the solution with $M=N_{\mathrm{neigh}}$ and $r_{\mathrm{cut}}$ equal to the distance to the $N^{\mathrm{th}}$ neighbor. However, at finite temperature, the BSO4 descriptor strongly depends on local thermal fluctuations since the position (and the identity) of the last neighbor may vary from one step to another. A way to regularize the neighborhood construction by limiting thermal temperature effects is to use a smooth weight function such as tanh:

W_{i}(r_{\mathrm{cut}})=\sum_{j}^{M}\frac{1}{2}\left(1-\tanh{\frac{r_{ij}-r_{\mathrm{cut}}}{\delta}}\right)=N_{\mathrm{neigh}}\;.

(8)

The $\delta$ parameter ensures the smooth transition for the weights from 1 to 0 and is set to $0.3\text{\,}\mathrm{\SIUnitSymbolAngstrom}$ in the present work, close to thermal fluctuations of atomic positions. In Algorithm 2 we present an algorithm able to compute a general $\mathbf{B}$ descriptor for a constant numbers of neighbours.

Algorithm 2 Density-independent

\mathbf{B}

N_{\mathrm{neigh}}:

Target number of neighbors

W_{i}:

Weight function, Heavyside or tanh

r_{\mathrm{neigh}}:

cut-off for initial neihgbor list

\mathcal{S}(i,r):

neighbour list of the

i^{th}

atom within distance

r

if (

W_{i}

== tanh) then

Choose tolerance factor

\delta

end if

for Each atom

i

in simulation cell do

1. Build initial neighbor list of the i^th atom,

\mathcal{S}(i,r_{\mathrm{neigh}})

2. Find optimal

r_{\mathrm{cut}}

to minimize

\mathcal{L}=|W_{i}(r_{\mathrm{cut}})-N_{\mathrm{neigh}}|

for Each neighbor

j

\mathcal{S}(i,r_{\mathrm{neigh}})

Compute interatomic distance

d=\sqrt{(\mathbf{r}_{j}-\mathbf{r}_{i})^{2}}

(d<r_{\mathrm{cut}})

then

Add neighbor

j

\mathcal{S}(i,r_{\mathrm{cut}})

end if

end for

3. Basic computation of

\mathbf{B}

using neighbor list

\mathcal{S}(i,r_{\mathrm{cut}})

end for

In Figure 1, we highlight the distinctions between the standard bispectrum with a fixed cutoff radius, denoted as $\mathbf{B}_{\textrm{cut}}$ , and the proposed approach that utilizes a fixed number of neighbors, denoted as $\mathbf{B}^{\mathrm{Heavyside}}_{nn}$ or $\mathbf{B}^{\mathrm{tanh}}_{nn}$ (depending on the weight function employed). We set up NPT simulations of BCC Fe at 100 K, during which the pressure was gradually increased to around 40 GPa (corresponding to a volumetric compression of approximately 20%). The variations of the $\mathbf{B}_{cut}$ and $\mathbf{B}_{nn}$ components with respect to the volumetric ratio $V/V_{0}$ are illustrated in Figure 1-a, b, c, respectively.

The two ways of computing the bispectrum lead to very different results. Indeed, $\mathbf{B}_{nn}$ is almost insensitive to density while the $\mathbf{B}_{cut}$ components evolve significantly with it. Thus, the $\mathbf{B}_{nn}$ should be better at identifying crystal structures even during MD simulations involving a large change in material density. Besides, one can notice a difference between $\mathbf{B}^{\mathrm{Heavyside}}_{nn}$ and $\mathbf{B}^{\mathrm{tanh}}_{nn}$ . Indeed, even if it does not depend on local density, $\mathbf{B}^{\mathrm{Heavyside}}_{nn}$ displays some slight evolution that is caused by thermal fluctuations around the sharp step of the Heavyside function. On the contrary, $\mathbf{B}^{\mathrm{tanh}}_{nn}$ appears satisfyingly stable with density, exhibiting minor sensitivity. In the following, we consider BSO4 descriptors computed using either a fixed cutoff radius or a fixed target number of neighbors using the $\tanh$ regularization. The two descriptors will be respectively labelled $\mathbf{B}^{N}_{nn}$ and $\mathbf{B}^{R}_{\mathrm{cut}}$ with $N$ and $R$ the values of the corresponding target number of neighbors or cutoff radius.

2.2.3 Size of the database

As described above, configurations included in the database are selected at different temperatures and after distinct instantaneous deformations. The combination of these two effects makes the local environment of each atom of the supercell unique. Hence, there is no need to apply a peculiar sparsification procedure to avoid redundancy in the database.

For each of the four different crystal structures considered, 100 snapshots containing 864 atoms each are extracted from the NPT trajectory up to 2/3 $T_{m}$ . Then, the 6 non-zero components of the deformation gradient tensor ${\mathbf{F}}$ are sampled using the Latin Hypercube Sampling with Multi-Dimensional Uniformity (LHSMDU) [53], in order to obtain 100 draws to be applied to the different snapshots, while ensuring the condition $\epsilon_{eq}<0.05$ . Following this procedure, our total database contains $N_{\mathrm{atoms}}$ x $N_{\mathrm{snapshots}}$ = 864 x 100 = 86400 bispectrum vectors $\mathbf{B}\in\mathbb{R}^{55}$ , for each crystal structures, giving a total of $M=345600$ $\mathbf{B}$ vectors.

3 Crystal Structure Classifier

In this section we present the different steps of our procedure to build the supervised learning crystal structure analysis (SL-CSA) tool. Firstly, we delve into the configuration of our current classifier, which is constructed through a two-step process involving dimensionality reduction and logistic regression (LR). Subsequently, we compare our classification models to established tools in the literature, with particular emphasis on the adaptive common neighbor analysis (a-CNA), polyhedral template matching (PTM) and diamond structure identification (IDS), three methods available in OVITO [23].

3.1 Dimensionality reduction

The database is composed of 4 different crystal structures namely body centered cubic (BCC), face-centered cubic (FCC), hexagonal close packed (HCP) and cubic diamond (c-DIA), each containing 86400 local atomic environments encoded by descriptor vectors $\mathbf{B}\in\mathbb{R}^{55}$ . For the initial step of classification, we perfomed a supervised dimensionality reduction. We have uses the linear discriminant analysis (LDA), a statistical technique that is commonly used for supervised classification and feature extraction in machine learning [54]. LDA is particularly relevant for dimension reduction while preserving the most important discriminatory information. The underlying assumption is that the covariance matrix of each class is the same. LDA works by finding a linear combination of the original features that maximizes the separation between different classes in the data. The separation between classes is achieved by maximizing the ratio of the between-class variance to the within-class variance. The resulting linear combination, or discriminant function, is then used to project the data onto a lower-dimensional space, with dimension equal to the number of labels reduced by one. In the general case, one can reduce the dimension of the initial descriptor $\mathbf{B}\in\mathbb{R}^{D}$ leading to a new projected descriptor $\mathbf{x}=P_{\mathrm{LDA}}(\mathbf{B})$ $:\mathbb{R}^{D}\to\mathbb{R}^{d=C-1}$ :

P_{\textrm{LDA}}(\mathbf{B})=\mathbf{C}_{\mathrm{LDA}}^{\mathrm{T}}\cdot(\mathbf{B}-\bm{\mu}^{\mathbf{B}}_{\mathrm{db}})

(9)

with $\mathbf{C}_{\mathrm{LDA}}\in\mathbb{R}^{Dxd}$ the reduction coefficients matrix of LDA and $\bm{\mu}^{\mathbf{B}}_{\mathrm{db}}\in\mathbb{R}^{D}$ the average descriptor of the entire database. In the present case, the initial dimension $D=55$ corresponds to the BSO4 dimension imposed by the choice $J_{\mathrm{max}}=4$ and the new projector has a dimension $d=3$ equal to the number of crystal structures of the database minus one. Thanks to LDA, the separation between classes is maximized in this low dimensional space, hence facilitating the subsequent classification step.

3.2 Logistic Regression

Once the dimension reduction step is performed by means of LDA, the new descriptor $\mathbf{x}$ is considered as an input for performing a multinomial logistic regression which provides a probability vector $\mathbf{p}=P_{\textrm{LR}}(\mathbf{x})$ $P:\mathbb{R}^{d}\to\mathbb{R}^{C}$ defined as:

P_{\textrm{LR}}(\mathbf{x})=\frac{\exp(\mathbf{s}(\mathbf{x}))}{\sum_{i}\exp(\mathbf{s(\mathbf{x})}[i])}

(10)

where $\mathbf{s}(\mathbf{x})\in\mathbb{R}^{C}$ corresponds to the score vector that reads:

\mathbf{s}(\mathbf{x})=\mathbf{b}_{\mathrm{LR}}+\mathbf{D_{\mathrm{LR}}}\cdot\mathbf{x}^{\mathrm{T}}

(11)

with $\mathbf{b}_{\mathrm{LR}}\in\mathbb{R}^{C}$ and $\mathbf{D}_{\mathrm{LR}}\in\mathbb{R}^{Cxd}$ the bias vector and decision matrix of the logistic regression model after training. In the end, the crystal structure assigned to each atom with descriptor $\mathbf{x}$ is computed as the argmax of the probability vector $\mathbf{p}(\mathbf{x})$ . In the present work, $C=4$ e.g. the total number of crystal structures. The logistic regression step will systematically attributes a crystal structure to an atom which can lead to misclassification. Some atoms, e.g. defective ones, will be wrongly classified as crystalline. This misclassification is expected because the LDA dimension-reduced descriptors $\mathbf{x}$ are constructed based on the assumption that the covariance matrix of each class is identical. This pitfall can be overcome by methods such as QDA (Quadratic Discriminant Analysis), which are less strict and allow for different feature covariance matrices for different classes. However, these methods result in a quadratic decision boundary, which is more challenging to train and stabilize. For this reason, we stick to the framework of LDA and make the final decision in the classification by employing statistical distances with respect to each class based on the full covariance matrix of each class, similar to [41, 42]. Consequently, an additional step, serving as a sanity check (referred to as step 4 in Algorithm 1), is performed and described in the subsequent section.

3.3 Crystal structure classifier

The full database presented in 2 is replicated into six different databases computed with different flavors of the descriptors. We use the $\mathbf{B}^{R}_{\mathrm{cut}}$ descriptor with $R$ equal to 3.0, 6.0, and $9.0\text{\,}\mathrm{\SIUnitSymbolAngstrom}$ , and the $\mathbf{B}^{N}_{\mathrm{nn}}$ descriptor with $N$ set to 24, 48, and 72 respectively. A different instance of our classification model (SL-CSA) is trained on each of the six databases. The results obtained with each model are compared against each other and against crystal structure classifiers of the literature in the next section.

The purpose of the classification procedure is to identify local crystal structure with high fidelity or to detect atoms that do not belong to any of the reference crystal structures, considered as outliers. Since the logistic regression assigns a crystal structure to all atoms, we consider a final sanity check step using the Mahalanobis distance of a given recued descriptor $\mathbf{x}$ to a given crystal structure cs:

d^{\mathrm{cs}}_{\mathrm{Maha}}(\mathbf{x})=\sqrt{(\mathbf{x}-\bm{\mu}_{\mathrm{cs}})^{\mathrm{T}}\cdot{\mathbf{\Sigma}}^{-1}_{\mathrm{cs}}\cdot(\mathbf{x}-\bm{\mu}_{\mathrm{cs}})}

(12)

where ${\mathbf{\Sigma}}_{\mathrm{cs}}$ and $\bm{\mu}_{\mathrm{cs}}$ are the sample covariance matrix and average of the descriptors in the reference database associated to crystal structure CS. The decision to assign the cs to the descriptor ${\mathbf{x}}$ is made based on a threshold criterion on this distance: if the distance is lower than an acceptance threshold, the cs is assigned to the descriptor. Hence, this acceptance threshold determines the accuracy of the classifier. It is chosen by investigating the properties of the distributions of distances in the reference database: for each descriptor, the distance to each cs is computed, and the results are gathered by crystal structure so that the distribution of within- and between-class distances can be computed for each crystal structure. Results corresponding to $\mathbf{B}^{24}_{nn}$ are shown in Figure 2 whereas the distributions for other descriptors are given in the Supporting information (see Figures S1 to S4). The separation between c-DIA and other classes is systematically large due to the geometrical particularities of the diamond structure compared to BCC, FCC and HCP. On the other hand, between-class distances for BCC, FCC and HCP crystal structures are smaller and may even overlap. The acceptance threshold needs to be carefully chosen to ensure a minimal error during classification. In order to calculate this threshold for each class, we define the error rate of a crystal structure cs as:

\tau_{\mathrm{cs}}=\frac{n_{d^{\mathrm{cs}}_{\mathrm{Maha}}\geq\mathrm{threshold}}}{n^{\mathrm{cs}}_{\mathrm{tot}}}

(13)

where $n_{d^{\mathrm{cs}}_{\mathrm{Maha}}}$ corresponds to the number of atoms of the database with a distance greater than the threshold and $n^{\mathrm{cs}}_{\mathrm{tot}}$ is the total number of atoms with crystal structure cs. We also define a second error rate that quantifies the number of misassigned atoms (i.e. atoms of another crystal structure with a between-class distance with cs lower than the defined threshold):

\overline{\tau_{\mathrm{cs}}}=\frac{n_{d^{\mathrm{\overline{cs}}}_{\mathrm{Maha}}\leq\mathrm{threshold}}}{n_{\mathrm{tot}}-n^{\mathrm{cs}}_{\mathrm{tot}}}

(14)

where $n_{d^{\mathrm{\overline{cs}}}_{\mathrm{Maha}}}$ corresponds to the number of atoms belonging to another crystal structure with a distance to cs lower than the threshold, and $n_{\mathrm{tot}}$ is the total number of samples in the database. Finally, the optimal threshold for each crystal structure is defined as the Mahalanobis distance that minimizes $|\tau_{\mathrm{cs}}-\overline{\tau_{\mathrm{cs}}}|$ . Both error rates are displayed in blue and red lines in Figure 3 for each crystal structure as a function of the acceptance threshold. The optimal threshold is displayed as a vertical dashed line and is defined by the crossing of the two error rates.

Since the distributions of Mahalanobis distance are entirely disjointed for the c-DIA case, the acceptance threshold is arbitrarily set to $6\text{\,}\mathrm{\SIUnitSymbolAngstrom}$ . This allows to get 99 % of correct predictions on the database. For BCC, FCC and HCP crystal structures, the calculated optimal thresholds are equal to 5.1, 5.4 and $5.1\text{\,}\mathrm{\SIUnitSymbolAngstrom}$ respectively. This means that atoms exhibiting within-class distances greater than the acceptance threshold for each crystal structure will be considered as defective. In addition, the Mahalanobis distance to a specific crystal structure can also be used as a tool to classify defects, as was done previously [41] and this will be discussed in Section 4.

3.4 Comparison to standard classifiers

In the following, we focus on the two classifiers built on the $\mathbf{B}^{24}_{\mathrm{nn}}$ and $\mathbf{B}^{6}_{\mathrm{cut}}$ , descriptors as they systematically provide better scores when identifying the structures in the testing database. The selected classifiers are tested against well-established crystal structure identification methods, namely the a-CNA, PTM, and IDS algorithms (where applicable). It is important to mention that certain algorithms may require specific configurations. Therefore, in this section, we provide a detailed description of the parameters used in the current study. Prior to analyzing a particle neighborhood, a-CNA determines the optimal cutoff radius automatically for each individual particle by computing a local length scale specific to each crystal structure. It is to be noted that a-CNA does not have a tolerance criterion associated with its classification, i.e. if an atom neighborhood cannot be mapped to one of the known crystal structures, it is classified as non-crystalline, and labeled “unknown". In contrast to a-CNA, PTM requires a user-defined RMSD parameter, typically set to 0.1 by default. This parameter is the same for all structures, and atoms with RMSD greater than the threshold are assigned as non-crystalline. Setting a much larger RSMD threshold can reduce the number of “unknown" labels, however, it also favors the appearance of false positives. Calibration of the RMSD parameter for different crystal structures is out of the scope of the present study and we only perform the analysis with the most commonly used standard settings, i.e. RMSD=0.1. The acceptance threshold of the SL-CSA classifier is defined for each type of structure as described in Section 3.3.

Below we explore the performance of the structure identification methods for different thermo-mechanical states. In Section 3.4.1 we investigate the effect of thermal fluctuations; in Section 3.4.2 sensitivity to the material’s density is examined; and finally, in Section 3.4.3 we consider the sensitivity to large non-hydrostatic deformation.

3.4.1 Sensitivity to high temperature

Here we perform an analysis of NPT trajectories in BCC Iron, FCC Aluminium, HCP Zirconium, and c-DIA Silicon, where pressure was maintained at 0 GPa as the temperature is increased up to each material’s melting point $T_{m}$ . These simulations are distinct from those of the training database.

We define the accuracy score of a crystal structure classifier algorithm as the number of atoms identified as crystalline over the total number of atoms in the simulation, assuming that the analyzed materials are fully crystalline at $\frac{2}{3}T_{m}$ . Figure 4 reports the evolution of the accuracy score as a function of the average temperature in the simulation cell for the four different crystal structures. The SL-CSA classifier clearly outperforms a-CNA, and PTM for BCC Fe, FCC Al, and HCP Zr. The conventional methods shift from 100 % accuracy before reaching $\frac{2}{3}T_{m}$ , while the SL-CSA retains more than 98 % accuracy along the whole NPT trajectory. The IDS tool used for c-DIA Si appears not very sensitive to thermal noise and provides comparable results with SL-CSA with an accuracy above 99 % for this structure along the entire trajectory. Finally, the classifier built with $\mathbf{B}^{24}_{\mathrm{nn}}$ appears less sensitive to thermal fluctuations than its counterpart using $\mathbf{B}^{6}_{\mathrm{cut}}$ .

3.4.2 Sensitivity to material’s density

In order to examine the performance of our classifiers in extrapolation conditions with changing density (beyond their trained domain at a density of $\rho_{0}$ ), we generate 100 synthetic samples for each of the four crystal structures at densities $\rho\in[0.5\rho_{0},1.5\rho_{0}]$ . These samples correspond to simulation cells at different lattice parameters containing 864 atoms, each atom being shifted from its crystalline position by Gaussian noise with $\sigma=0.1$ Å. This setup allows building artificial configurations far from the domain of validity of the interatomic potential, spanning a wide range of material densities.

Figure 5 compares the results between a-CNA, IDS, PTM, and the two classifiers based on $\mathbf{B}^{6}_{\mathrm{cut}}$ and $\mathbf{B}^{24}_{\mathrm{nn}}$ descriptors. Here, $\rho_{0}$ is taken as the reference density for all measurements and classifiers since we aim at comparing the results w.r.t. the samples present in the database. For the three metallic systems, a-CNA and PTM algorithms provide very good results and correctly predict 100 % of the crystal structures in the entire range of density spanned. The CNA-based classifiers are not adapted for the diamond structure, and we use the IDS classifier instead. Both IDS and PTM predict 100% of c-DIA structure for any density, proving that they are not sensitive to volumetric strain. Thus, the three tested standard classifiers can perform well in the presence of large volumetric deformation, at least in the presence of moderate thermal noise. The $\mathbf{B}^{6}_{\mathrm{cut}}$ based classifier performs well for the densities near $\rho_{0}$ . However, due to the fixed cutoff, it largely fails at predicting correct crystal structures when density changes and may even predict a wrong crystal structure, e.g. BCC instead of FCC/HCP. Using the classifier trained on the $\mathbf{B}^{24}_{\mathrm{nn}}$ descriptor allows for the full restitution of the correct crystal structures in BCC, FCC, HCP, and c-IDA systems, over the whole range of densities. Together with the low sensitivity to thermal noise, these results prove its applicability to various thermodynamic conditions, i.e. when large hydrostatic pressure and high temperature are involved.

3.4.3 Sensitivity to large deformation

In order to explore the sensitivity of the classifiers to large deformations, we design a database with NPT trajectories for the four tested structures. For each material, the trajectories are performed at ambient pressure and at $\frac{2}{3}T_{m}$ for 200 ps. Along each trajectory, 200 snapshots are extracted for subsequent application of large deformations. All configurations are different from those of the learning database, and simulations were carried out with different seeds for temperature initialization.

In order to explore independently diagonal and deviatoric deformations, we consider two subsets of structures, where ( $F_{11}$ , $F_{33}$ ) and ( $F_{12}$ , $F_{13}$ ) are applied. Each component of the deformation tensor $F_{ij}$ is drawn from a uniform distribution in the intervals [0.7, 1.3] and [-0.3, 0.3] for longitudinal and deviatoric strains components respectively.

The 200 couples of diagonal deformation tensor components ( $F_{11}$ , $F_{33}$ ) are assigned to the 200 snapshots of each crystal structure, by applying the corresponding deformation gradient tensor ${\mathbf{F}}$ to the simulation cell while rescaling atomic positions. Finally, the different classifiers a-CNA, PTM, $\mathbf{B}_{\mathrm{cut}}^{6}$ and $\mathbf{B}_{\mathrm{nn}}^{24}$ are used to analyse the deformed samples. The same process using deviatoric deformation tensor components ( $F_{12}$ , $F_{13}$ ) has been employed. Results for each crystal structure are displayed in Figure 6.

Similar trends are observed for BCC, FCC, HCP, and c-DIA, where the performance of the classifiers is roughly independent of the crystalline structure. For diagonal deformations associated with compression and tension of the simulation cell, the CNA and PTM analysis are limited to small or very small deformation only (lower than 5%), and their accuracy is strongly reduced beyond this point. This effect is likely attributed to the combined effects of deformation and temperature. On the other hand, the performances of the $\mathbf{B}_{\mathrm{cut}}^{6}$ and $\mathbf{B}_{\mathrm{nn}}^{24}$ classifiers remain robust up to deformation as large as 20% (30% in the best cases). We note that $\mathbf{B}_{\mathrm{nn}}^{24}$ even shows an extended range of accuracy compared to $\mathbf{B}_{\mathrm{cut}}^{6}$ , and both classifiers exhibit some anisotropic response to diagonal deformation. Concerning deviatoric deformations the performances of our classifiers are even better than CNA and PTM, with correct structural assignment up to 30% deformation. Once again we attribute the high accuracy of our classifiers to their capabilities to handle temperature and deformation effect conjointly. This is highly valuable in the context of in situ classification of large-scale simulations of materials at extreme conditions. In the following, we demonstrate the applications of our most robust classifier constructed with the $\mathbf{B}^{24}_{\mathrm{nn}}$ descriptor for the analysis of structures challenging to perform with traditional methods.

4 Applications

Two examples of interest to the materials science community are outlined below. The first example considers the crystalline Zr solid-solid phase transition from HCP to BCC under high pressure, and the second explores the identification of dislocations as they form and expand from a Frank-Read source in Al. The difficulty in the first example is in the accurate identification of the crystal structures where volume discontinuity may be present, whereas in the second example it relies on the extraction of defective atoms present in the dislocations’ cores. The comparison of the results provided by SL-CSA and traditional methods is made for both applications.

4.1 HCP-BCC phase transition in HCP Zr

Here we perform the analysis of the high-pressure HCP $\to$ BCC transition in Zr, following the MD simulation procedure previously described in Refs. [55, 56, 57, 58, 59]. We aim to reveal the effect of the accuracy of the classification on the characterization of this transition. To this end, we track the number of atoms belonging to different crystalline structures as a function of time. To investigate this transition, we performed MD simulations in a simulation cell with 442,368 atoms at high temperature and high pressure. The system was first equilibrated at 0 GPa and 1500 K for 10 ps in the NPT ensemble using a 2 fs timestep. Then, the target pressure was set to 18 GPa, allowing the simulation cell to relax independently in the three directions of space. The temperature was maintained at 1500 K for the entire simulation with a Nosé-Hoover thermostat. At high pressure, a few ps are needed for the first BCC seed to nucleate in the HCP single crystal, meaning that the BCC structure becomes thermodynamically more stable than its HCP counterpart. Four snapshots of the MD simulation are taken at different times and are depicted in Figure 7.

The results of structure identification during the phase transition revealed unexpected differences between the methods, which can be grouped into two distinct categories- CNA and PTM0.1 for one and PTM0.15 and SL-CSA for the other, as shown in Figure 8. From the beginning of the simulation, CNA and PTM0.1 correctly identify 80% of the atoms, this proportion rapidly decreasing to 70%. Then, as the structural transition occurs, the number of unclassified atoms increases drastically up to 60%, demonstrating the inability of these methods to extract the correct underlying mechanism. The proportion of BCC atoms finally increases slowly, reaching asymptotic values between 60 and 70%, well below the expected proportion. In this context, using these results to characterize the kinetic of this phase transition would probably lead to wrong predictions.

On the other hand, PTM0.15 and SL-CSA exhibit similar trends. Initially, there is a high population of HCP atoms followed by a rapid transition towards the BCC structure. Finally, the system reaches an asymptotic behavior, resulting in a significant population of BCC atoms. The population of unclassified atoms during the transition remains low. Although these two methods show similar behavior, there are still some differences, especially during the transition phase. SL-CSA exhibits a lower population of unclassified atoms compared to PTM0.15. Additionally, at the end of the simulation, there is an approximate 10% difference in the population of BCC atoms between the two methods, with SL-CSA leading to a higher population of the BCC phase. Based on these results, we conclude that the SL-CSA classifier yields better results and, that this method is well suitable for a quantitative evaluation of the mechanism and kinetics of phase transitions. In addition, the proposed method does not require any parameter-tuning in opposition with PTM and its corresponding RMSD, to which the results are very sensitive.

4.2 Frank-Read source in Aluminium at 700 K

In this example, we investigate the capabilities of the classifier to extract the atoms that belong to defect structures from the bulk, and, in particular, to identify atoms belonging to dislocation cores. We emphasize that the classifier was not trained to distinguish any defective configuration. Hence, the analysis developed below only concerns the identification of crystalline and non-crystalline (or defective) atoms.

The present simulation involves the double emission of Frank-Read dislocation sources in FCC Al at high temperature, an example that has been previously used (at low temperature) to demonstrate the capability of the PTM algorithm in [60], with a setup similar to the one described in [61]. The only difference is that the dislocation sources have been pinned by two cylindrical pores periodically along the $z$ direction. The system with 2,300,504 atoms was initially equilibrated in the NPT ensemble at 0 GPa and 700 K for 10 ps, using similar coupling constants as for the HCP to BCC simulation described in the previous section. Finally, a shear stress $\sigma_{yz}$ of 1.5 GPa was applied to the simulation cell, while keeping the temperature at 700 K in order to trigger the expansion of dislocation lines emerging from the Frank-Read sources. A snapshot of the simulation cell at t = 14 ps is displayed in Figure 9.

The aim here is to investigate the capability of the SL-CSA procedure to extract defects, i.e. atoms that have not been assigned by the classifier to a specific crystal structure. Figure 9-a, b depicts the microstructure analyzed using PTM0.1 and SL-CSA, after removing atoms belonging to the FCC structure. The significant thermal noise in the simulation cell causes the presence of atomic environments deviating from the ideal FCC structure, including non-crystalline and BCC atoms. Here, PTM tends to identify more non-crystalline atoms than SL-CSA, the latter leading to the identification of HCP atoms in the FCC bulk. Since FCC Al tends to easily nucleate HCP stacking faults, the fact that thermally disturbed FCC bulk atoms are being identified as HCP is not that uncommon. A main difference subsists between both classifiers: PTM method appears more dependent on the local thermal/stress state than SL-CSA. Indeed, the area that is relaxed by the dislocation loop (i.e. at lower stress) appears less noisy than the one that is not in its vicinity, in opposition to the SL-CSA, where the noisy atoms look homogeneously distributed across the sample. In the end, what really matters is the ability of the present methods to extract defective parts of atomistic simulations for subsequent analysis. For example, defective atoms identified with SL-CSA or PTM could be fed to the Dislocation eXtraction Analysis (DXA) tool [19] for dislocation identification. Both PTM and SL-CSA seem able to identify the dislocation loop stacking faults generated by the Frank-Read sources, constituted by atoms assigned with the HCP crystal structure. Employing the cluster analysis available in OVITO allows for removing noisy bulk atoms and the remaining defective structure extracted from both PTM and SL-CSA classifications are displayed in Figure 9-c, d respectively. In comparison to the PTM0.1 analysis workflow, SL-CSA looks more robust and leads to the extraction of almost only dislocation core atoms. Such a structure would be a good candidate for performing extended analysis in terms of dislocation density calculations. However, dislocation line extraction is not the object of the present work and will be part of further studies. Overall, our crystal structure classification procedure SL-CSA performs rather well compared to the existing tools from the literature, even when both non-hydrostatic stresses and high temperature are involved such as in this dislocation-mediated plasticity simulation toy model.

5 Conclusion and perspectives

We have introduced a novel classifier that surpasses the capabilities of conventional approaches, such as a-CNA and PTM, when it comes to identifying crystal structures under extreme conditions like high temperature, high pressure, and large deformation. This makes our method particularly suitable for real-time analysis of molecular dynamics (MD) simulations.

Our proposed classifier operates on a simple learning process, utilizing a training database that encompasses various structures of interest, including BCC, FCC, HCP, and c-DIA. The characterization of local structures is facilitated by a spectral descriptor, which captures the geometric arrangement of neighboring atoms and represents it as a vector. We have proposed a modification to the conventional bispectrum descriptor, ensuring that a fixed number of neighbors is incorporated within the descriptor. This modification enables the analysis of materials at varying densities without loosing accuracy. The newfound insensitivity of the modified descriptor to density changes has a significant impact on the size of the required training database for the classifiers, while maintaining transferability. The training database includes configurations at high temperatures and small deformations, i.e. in the elastic regime.

A simple logistic regression is employed for the classifier, carefully controlling the balance between false positives and false negatives. The classifier is applied following dimensionality reduction using an LDA discriminator. While LDA may lose information about the covariance matrix differences between the classes, we mitigate this by refining the results with a test based on Mahalanobis distance within each and across the classes. We compare the performance of our current SL-CSA classifier to standard classification tools (a-CNA and PTM) across various densities, temperatures, and deformations. Notably, the SL-CSA classifier demonstrates superior reliability, even in scenarios where temperature and deformation interact.

Finally, the SL-CSA classifier was examined on large-scale simulations of solid-solid phase transformations and the detection of dislocation core atoms. These simulations showed that our classifier can conduct an analysis of crystalline structures with higher precision than traditional techniques, allowing to accurately estimate a proportion of atoms that belong to a given crystalline structure. We are optimistic that this capability will be advantageous for raising the accuracy of coarse-grained models of such processes.

In perspective, given its transferability and capability to analyze unknown features, the present method holds potential for further expansion in identifying more complex crystalline structures and directly detecting specific defect types, such as dislocation cores. Additionally, its application can extend to novelty detection in the field of materials science, such as recent nano-phases inclusions [42].

Data Availability

Data will be made available on request.

References

[1] Kien Nguyen-Cong, Jonathan T Willman, Stan G Moore, Anatoly B Belonoshko, Rahulkumar Gayatri, Evan Weinberg, Mitchell A Wood, Aidan P Thompson, and Ivan I Oleynik. Billion atom molecular dynamics simulations of carbon at extreme conditions and experimental time and length scales. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12, 2021.
[2] Zhuoqiang Guo, Denghui Lu, Yujin Yan, Siyu Hu, Rongrong Liu, Guangming Tan, Ninghui Sun, Wanrun Jiang, Lijun Liu, Yixiao Chen, et al. Extending the limit of molecular dynamics with ab initio accuracy to 10 billion atoms. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 205–218, 2022.
[3] Anders Johansson, Yu Xie, Cameron J Owen, Jin Soo, Lixin Sun, Jonathan Vandermause, Boris Kozinsky, et al. Micron-scale heterogeneous catalysis with bayesian force fields from first principles and active learning. arXiv preprint arXiv:2204.12573, 2022.
[4] L. A. Zepeda-Ruiz & A. Stukowski & T. Oppelstrup & V. V. Bulatov. Probing the limits of metal plasticity with molecular dynamics simulations. Nature, 550:492–495, 2017.
[5] Cynthia L. Kelchner, S. J. Plimpton, and J. C. Hamilton. Dislocation nucleation and defect structure during surface indentation. Phys. Rev. B, 58:11085–11088, Nov 1998.
[6] Paul J. Steinhardt, David R. Nelson, and Marco Ronchetti. Bond-orientational order in liquids and glasses. Phys. Rev. B, 28:784–805, Jul 1983.
[7] J. Dana. Honeycutt and Hans C. Andersen. Molecular dynamics study of melting and freezing of small lennard-jones clusters. The Journal of Physical Chemistry, 91(19):4950–4963, 1987.
[8] D Faken and H Jónsson. Systematic analysis of local atomic structure combined with 3d computer graphics. Comput. Mater. Sci., 2:279–286, 1994.
[9] Alexander Stukowski. Structure identification methods for atomistic simulations of crystalline materials. Modelling and Simulation in Materials Science and Engineering, 20(4):045021, may 2012.
[10] G. J. Ackland and A. P. Jones. Applications of local crystal structure measures in experiment and simulation. Phys. Rev. B, 73:054104, Feb 2006.
[11] Emanuel A. Lazar, Jian Han, and David J. Srolovitz. Topological framework for local structure analysis in condensed matter. Proceedings of the National Academy of Sciences, 112(43):E5769–E5776, 2015.
[12] Peter Mahler Larsen, Søren Schmidt, and Jakob Schiøtz. Robust structural identification via polyhedral template matching. Modelling and Simulation in Materials Science and Engineering, 24(5):055007, may 2016.
[13] G J Tucker, J A Zimmerman, and D L McDowell. Shear deformation kinematics of bicrystalline grain boundaries in atomistic simulations. Modelling and Simulation in Materials Science and Engineering, 18(1):015002, dec 2009.
[14] Garritt J. Tucker, Jonathan A. Zimmerman, and David L. McDowell. Continuum metrics for deformation and microrotation from atomistic simulations: Application to grain boundaries. International Journal of Engineering Science, 49(12):1424–1434, 2011. Advances in generalized continuum mechanics.
[15] Jonathan A. Zimmerman, Douglas J. Bammann, and Huajian Gao. Deformation gradients for continuum mechanical analysis of atomistic simulations. International Journal of Solids and Structures, 46(2):238–253, 2009.
[16] Craig S. Hartley and Y. Mishin. Representation of dislocation cores using nye tensor distributions. Materials science & engineering. A, Structural materials : properties, microstructure and processing, 400:18–21, 2005.
[17] Paolo Cermelli and Morton E. Gurtin. On the characterization of geometrically necessary dislocations in finite plasticity. Journal of the mechanics and physics of solids, 49(7):1539–1568, 2001.
[18] A. Stukowski & K. Albe. Extracting dislocations and non-dislocation crystal defects from atomistic simulation data. Modelling Simul. Mater. Sci. Eng., 18(085001), 2010.
[19] A. Stukowski et al. Automated identification and indexing of dislocations in crystal interfaces. Modelling Simul. Mater. Sci. Eng., 20(085007), 2012.
[20] V. V. Bulatov & W. Cai. Computer Simulations Of Dislocations. Oxford University Press, 2006.
[21] L. A. Zepeda-Ruiz & A. Stukowski & T. Oppelstrup & N. Bertin & N. R. Barton & R. Freitas & V. V. Bulatov. Atomistic insights into metal hardening. Nature Materials, pages 1–6, 2020.
[22] Nicolas Bertin, Ryan B. Sills, and Wei Cai. Frontiers in the simulation of dislocations. Annual Review of Materials Research, 50(1):437–464, 2020.
[23] Alexander Stukowski. Visualization and analysis of atomistic simulation data with ovito-the open visualization tool. Modelling and Simulation in Materials Science and Engineering, 18(1), JAN 2010.
[24] Tim Hsu, Babak Sadigh, Nicolas Bertin, Cheol Woo Park, James Chapman, Vasily Bulatov, and Fei Zhou. Score-based denoising for atomic structure identification, 2023.
[25] Steve Plimpton. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 117(1):1–19, 1995.
[26] Aidan P. Thompson, H. Metin Aktulga, Richard Berger, Dan S. Bolintineanu, W. Michael Brown, Paul S. Crozier, Pieter J. in ’t Veld, Axel Kohlmeyer, Stan G. Moore, Trung Dac Nguyen, Ray Shan, Mark J. Stevens, Julien Tranchida, Christian Trott, and Steven J. Plimpton. Lammps - a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales. Computer Physics Communications, 271:108171, 2022.
[27] Emmanuel Cieren, Laurent Colombet, Samuel Pitoiset, and Raymond Namyst. Exastamp: A parallel framework for molecular dynamics on heterogeneous clusters. In Luís Lopes, Julius Žilinskas, Alexandru Costan, Roberto G. Cascella, Gabor Kecskemeti, Emmanuel Jeannot, Mario Cannataro, Laura Ricci, Siegfried Benkner, Salvador Petit, Vittorio Scarano, José Gracia, Sascha Hunold, Stephen L. Scott, Stefan Lankes, Christian Lengauer, Jesús Carretero, Jens Breitbart, and Michael Alexander, editors, Euro-Par 2014: Parallel Processing Workshops, pages 121–132, Cham, 2014. Springer International Publishing.
[28] Emmanuel Cieren. Molecular Dynamics for Exascale Supercomputers. Theses, Université de Bordeaux, October 2015.
[29] Raphael Prat. Équilibrage dynamique de charge sur supercalculateur exaflopique appliqué à la dynamique moléculaire. Theses, Université de Bordeaux, October 2019.
[30] Raphael Prat, Thierry Carrard, Laurent Soulard, Olivier Durand, Raymond Namyst, and Laurent Colombet. Amr-based molecular dynamics for non-uniform, highly dynamic particle simulations. Computer Physics Communications, 253:107177, 2020.
[31] Felix Musil, Andrea Grisafi, Albert P. Bartók, Christoph Ortner, Gábor Csányi, and Michele Ceriotti. Physics-inspired structural representations for molecules and materials. Chemical Reviews, 121(16):9759–9815, 2021. PMID: 34310133.
[32] Heejung W. Chung, Rodrigo Freitas, Gowoon Cheon, and Evan J. Reed. Data-centric framework for crystal structure identification in atomistic simulations using machine learning. Phys. Rev. Mater., 6:043801, Apr 2022.
[33] Nongnuch Artrith, Alexander Urban, and Gerbrand Ceder. Efficient and accurate machine-learning interpolation of atomic energies in compositions with many species. Physical Review B, 96(1):014112, 2017.
[34] Jörg Behler. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. The Journal of chemical physics, 134(7):074106, 2011.
[35] Sandip De, Albert P Bartók, Gábor Csányi, and Michele Ceriotti. Comparing molecules and solids across structural and alchemical space. Physical Chemistry Chemical Physics, 18(20):13754–13769, 2016.
[36] Claudio Zeni, Kevin Rossi, Aldo Glielmo, and Stefano De Gironcoli. Compact atomic descriptors enable accurate predictions via linear models. The Journal of Chemical Physics, 154(22):224112, 2021.
[37] Rohit Batra, Huan Doan Tran, Chiho Kim, James Chapman, Lihua Chen, Anand Chandrasekaran, and Rampi Ramprasad. General atomic neighborhood fingerprint for machine learning-based methods. The Journal of Physical Chemistry C, 123(25):15859–15866, 2019.
[38] Anand Chandrasekaran, Deepak Kamal, Rohit Batra, Chiho Kim, Lihua Chen, and Rampi Ramprasad. Solving the electronic structure problem with machine learning. npj Computational Materials, 5(1):1–7, 2019.
[39] Andreas Leitherer, Angelo Ziletti, and Luca M Ghiringhelli. Robust recognition and exploratory analysis of crystal structures via bayesian deep learning. Nature communications, 12(1):1–13, 2021.
[40] Ryan S DeFever, Colin Targonski, Steven W Hall, Melissa C Smith, and Sapna Sarupria. A generalized deep learning approach for local structure identification in molecular simulations. Chemical science, 10(32):7503–7515, 2019.
[41] Alexandra M Goryaeva, Clovis Lapointe, Chendi Dai, Julien Dérès, Jean-Bernard Maillet, and Mihai-Cosmin Marinica. Reinforcing materials modelling by encoding the structures of defects in crystalline solids into distortion scores. Nature communications, 11(1):1–14, 2020.
[42] Alexandra M. Goryaeva, Christophe Domain, Alain Chartier, Alexandre Dézaphie, Thomas D. Swinburne, Kan Ma, Marie Loyer-Prost, Jérôme Creuze, and Mihai-Cosmin Marinica. Compact A15 Frank-Kasper nano-phases at the origin of dislocation loops in face-centred cubic metals. Nature Communications, 14(1):3003, May 2023.
[43] Albert Bartók, Risi Kondor, and Gábor Csányi. On representing chemical environments. Physical Review B, 87:184115, 2013.
[44] M.I. Mendelev, M.J. Kramer, C.A. Becker, and M. Asta. Analysis of semi-empirical interatomic potentials appropriate for simulation of crystalline and liquid al and cu. Philosophical Magazine, 88(12):1723–1750, 2008.
[45] M. I. Mendelev, S. Han, D. J. Srolovitz, G. J. Ackland, D. Y. Sun, and M. Asta. Development of new interatomic potentials appropriate for crystalline and liquid iron. Philosophical Magazine, 83(35):3977–3994, 2003.
[46] M. I. Mendelev and G. J. Ackland. Development of an interatomic potential for the simulation of phase transformations in zirconium. Philosophical Magazine Letters, 87(5):349–359, 2007.
[47] Frank H. Stillinger and Thomas A. Weber. Computer simulation of local order in condensed phases of silicon. Phys. Rev. B, 31:5262–5271, Apr 1985.
[48] Aidan Thompson, L.P. Swiler, Christian Trott, S.M. Foiles, and Garritt Tucker. Spectral neighbor analysis method for automated generation of quantum-accurate interatomic potentials. Journal of Computational Physics, 285, 2015.
[49] Mitchell A. Wood and Aidan P. Thompson. Extending the accuracy of the snap interatomic potential form. J. Chem. Phys., 148(24), 2018.
[50] Alexandra M Goryaeva, Julien Dérès, Clovis Lapointe, Petr Grigorev, Thomas D Swinburne, James R Kermode, Lisa Ventelon, Jacopo Baima, and Mihai-Cosmin Marinica. Efficient and transferable machine learning potentials for the simulation of crystal defects in bcc Fe and W. Phys. Rev. Mater., 5(10):103803, 2021.
[51] Anruo Zhong, Clovis Lapointe, Alexandra M. Goryaeva, Jacopo Baima, Manuel Athènes, and Mihai-Cosmin Marinica. Anharmonic thermo-elasticity of tungsten from accelerated bayesian adaptive biasing force calculations with data-driven force fields. Phys. Rev. Mater., 7:023802, Feb 2023.
[52] Alexandra M Goryaeva, Jean-Bernard Maillet, and Mihai-Cosmin Marinica. Towards better efficiency of interatomic linear machine learning potentials. Comput. Mater. Sci., 166:200–209, 2019.
[53] Jared Deutsch and Clayton Deutsch. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Journal of Statistical Planning and Inference, 142:763–772, 2012.
[54] R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179, 1936.
[55] Fran çois Willaime and Carlo Massobrio. Temperature-induced hcp-bcc phase transformation in zirconium: A lattice and molecular-dynamics study based on an n-body potential. Phys. Rev. Lett., 63:2244–2247, Nov 1989.
[56] Rajeev Ahuja, John M. Wills, Börje Johansson, and Olle Eriksson. Crystal structures of ti, zr, and hf under compression: Theory. Phys. Rev. B, 48:16269–16279, Dec 1993.
[57] C W Greeff. Phase changes and the equation of state of zr. Modelling and Simulation in Materials Science and Engineering, 13(7):1015, aug 2005.
[58] Hongxiang Zong, Yufei Luo, Xiangdong Ding, Turab Lookman, and Graeme J. Ackland. hcp → $\omega$ phase transition mechanisms in shocked zirconium: A machine learning based atomic simulation study. Acta Materialia, 162:126–135, 2019.
[59] Hui Xia, Arthur L. Ruoff, and Yogesh K. Vohra. Temperature dependence of the $\omega$ -bcc phase transition in zirconium metal. Phys. Rev. B, 44:10374–10376, Nov 1991.
[60] Alexander Stukowski and Karsten Albe. Dislocation detection algorithm for atomistic simulations. Modelling and Simulation in Materials Science and Engineering, 18(2):025016, mar 2010.
[61] Maurice de Koning, Wei Cai, and Vasily V. Bulatov. Anomalous dislocation multiplication in fcc metals. Phys. Rev. Lett., 91:025503, Jul 2003.