Topology combined machine learning for consonant recognition
Abstract
In artificial-intelligence-aided signal processing, existing deep learning models often exhibit a black-box structure, and their validity and comprehensibility remain elusive. The integration of topological methods, despite its relatively nascent application, serves the dual purpose of making models more interpretable and extracting structural information from time-dependent data for smarter learning. Here, we provide a transparent and broadly applicable methodology, TopCap, to capture the most salient topological features inherent in time series for machine learning. Rooted in high-dimensional ambient spaces, TopCap is capable of capturing features rarely detected in datasets with low intrinsic dimensionality. Applying time-delay embedding and persistent homology, we obtain descriptors which encapsulate information such as the vibration of a time series, in terms of its variability of frequency, amplitude, and average line, demonstrated with simulated data. This information is then vectorised and fed into multiple machine learning algorithms such as k-nearest neighbours and support vector machine. Notably, in classifying voiced and voiceless consonants, TopCap achieves an accuracy exceeding 96% and is geared towards designing topological convolutional layers for deep learning of speech and audio signals.
1 Introduction
In 1966, Mark Kac asked the famous question: ``Can one hear the shape of a drum?'' To hear the shape of a drum is to infer information about the shape of the drumhead from the sound it makes, using mathematical theory. In this article, we mirror the question across senses and address instead: ``Can we see the sound of human speech?''
Advancements in artificial intelligence (AI) have led to the widespread adoption of voice recognition technologies, encompassing applications such as speech-to-text conversion and music generation. The rise of topological data analysis (TDA) [1] has integrated topological methods into many areas including AI [2, 3], making neural networks more interpretable and efficient, with a focus on structural information. In the field of voice recognition [4, 5], more specifically consonant recognition [6, 7, 8, 9, 10], prevalent methodologies frequently revolve around the analysis of energy and spectral information. While topological approaches are still rare in this area, we combine TDA and machine learning to obtain a classification for speech data, based on geometric patterns hidden within phonetic segments. The method we propose, TopCap (referring to capturing topological structures of data), is applicable not only to audio data but also to general-purpose time series data that require extraction of structural information for machine learning algorithms. Initially, we endow phonetic time series with point-cloud structure in a high-dimensional Euclidean space via time-delay embedding (TDE, see Fig. 1a) with appropriate choices of parameters. Subsequently, 1-dimensional persistence diagrams are computed using persistent homology (see Section 3 of Supplementary Information for an explanation of the terminologies). We then conduct evaluations with nine machine learning algorithms to demonstrate the significant capabilities of TopCap in the desired classification.

Conceptually, TDA is an approach which facilitates the examination of data structure through the lens of topology. This discipline was originally formulated to investigate the ``shape'' of data, particularly point-cloud data in high-dimensional spaces [11]. Characterised by a unique insensitivity to metrics, robustness against noise, invariance under continuous deformation, and coordinate-free computation [1], TDA has been combined with machine learning algorithms to uncover intricate and concealed information within datasets [12, 3, 13, 14, 15, 16]. In these contexts, topological methods have been employed to extract structural information from the dataset, thereby enhancing the efficiency of the original algorithms. Notably, TDA excels in identifying patterns such as clusters, loops, and voids in data, establishing it as a burgeoning tool in the realm of data analysis [17]. As a nascent field of study, the majority of theoretical results pertaining to topological methods have yet to find their optimal applications and benefit everyday life. Nevertheless, with its distinctive emphasis on the shape of data, TDA has led to novel applications in various far-reaching fields, as evidenced in the literature. These include image recognition [18, 19, 20], time series forecasting [21] and classification [22], brain activity monitoring [23, 24], protein structural analysis [25, 26], speech recognition [27], signal processing [28, 29], neural networks [30, 31, 32, 2], among others. It is anticipated that with the further development of theoretical foundations and their applications, the promising future of TDA will pave a new direction to enhance numerous aspects of daily life.
The task of extracting features that pertain to structural information is both intriguing and formidable. This process is integral to a multitude of practical applications, as evidenced by various studies [33, 34, 35, 36]. Scholars strive to identify the most effective representatives and descriptors of shape within a given dataset. Despite the fact that TDA is specifically designed for shape capture, there are several hurdles that persist in this newly developed field of study. These include (1) the nature and sensitivity of descriptors obtained by methods in TDA, (2) the dimensionality of the data and other parameter choices, (3) the vectorisation of topological features, and (4) computational cost. These challenges will be elaborated in the following paragraphs within this section. Subsequently, we will demonstrate how our proposed methodology, TopCap, addresses these challenges through a case study on consonant classification.
When applying TDA, the most imminent question is to comprehend the characteristics and nature of descriptors extracted via topological methods. TDA is grounded in the pure-mathematical field of algebraic topology (AT) [37, 38], with persistent homology (PH) being its primary tool [39, 40]. While AT can quantify topological information to a certain extent [38, 1, 17], it is vitally important to understand both the capabilities and limitations of TDA. Generally speaking, TDA methods distinguish objects up to continuous deformation. For example, PH cannot differentiate a disk from a filled rectangle, given that one can continuously deform the rectangle into a disk by pulling out its four edges. In contrast, PH can distinguish between a filled rectangle and an unfilled one due to the presence of a ``hole'' in the latter, preventing a continuous deformation between the two. In certain circumstances, these methods are considered too coarse to capture the structural information in data, thereby necessitating a more precise descriptor of shapes. To draw an analogy, TDA can be conceptualised as a scanner with diverse inputs encompassing time series, graphs, pictures, videos, etc. The output of this scanner is a multiset of intervals in the extended real line, referred to as a persistence diagram (PD) (in this article, we freely use the usual birth-by-death PDs and their birth-by-lifetime variants, whichever better serve our purposes; see Section 3 of Supplementary Information for details) or a persistence barcode (PB) [11, 41, 42] (cf. Fig. 1f). The precision of the topological descriptor depends on two factors: (1) the association of a topological space, i.e., the process of transforming the input data into a topological space (see Fig.
1b for a simplicial-complex representation of spaces; typically, the original datasets are less structured, and one should find a suitable representation of the data), and (2) the vectorisation of the PD or PB, i.e., how to perform statistical inference with PD/PB. Although many theoretical results provide a solid foundation for TDA, few elucidate the practical implications of PD and PB. For example, what does it mean if many points are distributed near the birth–death diagonal line in a PD? In most cases, these points are regarded as descriptors of noise and are often disregarded if possible. Consequently, the TDA scanner can be seen as an imprecise observer, overlooking much of the information contained in less significant regions. In this article, we present an example of simulated time series to demonstrate that points distributed in such regions indeed encode vibration patterns of the time series, and that a different distribution in these regions leads to a different pattern of vibration. This serves as a motivation for proposing TopCap and is further discussed in Section 2.1. It turns out that topological descriptors can be sharpened by noting patterns in these regions.
Motivated by the capability of topological methods to discern vibration patterns in time series, we carry out a case study that classifies consonants into voiced and voiceless categories through topological means (see Section 4 of Supplementary Information for details of phonetic categories). This case study serves to demonstrate our methodology. The first challenge, as many researchers may encounter when applying topological methods, is to determine the dimension of point clouds derived from input data [43, 44, 45]. This essentially involves transforming the input into a topological space. In situations where the dimensionality of the data is large, researchers often project the data into a lower-dimensional topological space to facilitate visualisation and reduce computational cost [23, 24, 46]. On the other hand, as in this study and other applications of time series analysis [47, 48, 49, 50, 22, 51, 27], low-dimensional data are embedded into a higher-dimensional space. In both scenarios, deciding on the data dimensionality is both critical and challenging, and tuning the dimension is often a tremendous task. In Section 3 of Discussion below, we delve into the issue of data dimensionality. In our case, although it may seem to contradict most algorithms, embedding the data into a higher-dimensional space makes the computation slightly faster and the point cloud smoother and more regular; most importantly, it reveals salient topological features that are seldom detected in lower-dimensional spaces. When confronting the dimensionality of data, researchers often recall the well-known ``curse of dimensionality'' [52]: as the dimension increases, the amount of data involved grows, often exponentially, thereby escalating computational cost. Even worse, the computational cost of a typical algorithm itself normally rises as the dimension grows.
However, topological methods do not necessarily prefer data of lower dimension. For computing PH (see Fig. 1d for the process of computing PD/PB from point clouds), a commonly used algorithm [53, 54] sees its complexity grow with the number of simplices formed during the process, with a worst-case time complexity that is polynomial (cubic, for the standard matrix-reduction algorithm) in the number of simplices. The computational cost is thus directly related to the number of simplices formed during filtration. In practice, computation time may not increase much as the dimension of the data rises, because a higher embedding dimension may have little effect on the number of points in the point cloud; the number of simplices formed during filtration may then barely grow, as observed in our case.
Having obtained a suitable topological space from input data, one can derive a PD/PB from the topological space, which constitutes a multiset of intervals. The subsequent challenge lies in the vectorisation of the PD/PB for its integration into a machine-learning algorithm. The vectorisation process is essentially linked to the construction of the topological space, as the combination of the methods chosen for constructing the topological space and for vectorisation together determines the descriptor utilised in machine learning. A plethora of vectorisation methods exist, such as the persistence landscape (PL) [55] and the persistence image (PI) [56], among others, as documented in various studies [40, 57] (cf. Fig. 1f). The selection of these methods requires careful consideration. In Section 4 of Methods, we employ the maximal persistence and its corresponding birth time as two features. These have been integrated into nine traditional machine learning algorithms to classify voiced and voiceless consonants, yielding an accuracy that exceeds 96% with each algorithm. This vectorisation method is quite simple, primarily due to our construction of topological spaces from phonetic time series, as detailed in the Methods section. This construction enables PH to capture significant topological features within the time series. In Section 2.1, we also observe a pattern of vibration which could potentially be vectorised by PI into a matrix. As one of its strengths, PI emphasises regions where the weighting function scores are high, which makes it a computationally flexible method. Future work may involve a more precise recognition of such patterns using PI.
An outline of the remainder of this article is as follows: Section 1.1 gives an overview of closely related works in the field, with an extended commentary relegated to Section 1 of Supplementary Information; Section 2 of Results provides in more detail the motivations for TopCap, presents the final results of classifying voiced and voiceless consonants, and explains our purposes in practical use; Section 3 of Discussion highlights important parameter setups and indicates potential directions for future work; Section 4 of Methods contains a detailed template of TopCap; Section 5 on data and code availability gives the data and code sources for our experiments.
1.1 Related works
Time series analysis [58] is a prevalent tool for various applied sciences. The recent surge in TDA has opened new avenues for the integration of topological methods into time series analysis [21, 59, 60]. Much literature has contributed to the theoretical foundation in this area. For example, theoretical frameworks for processing periodic time series have been proposed by Harer and Perea [61], followed by their implementation, with collaborators, in discovering periodicity in gene expressions [62]. Their article [61] studied the geometric structure of truncated Fourier series of a periodic function and its dependence on the parameters in time-delay embedding (TDE), providing a solid background for TopCap. In addition to periodic time series, towards more general and complex scenarios, quasi-periodic time series have also been the subject of scholarly attention. Research in this direction has primarily concentrated on the selection of parameters for geometric space reconstruction [63] and has been extended to vector-valued time series [64].
In this article, a topological space is constructed from data using TDE, a technique that has been widely employed in the reconstruction of time series (see Fig. 1a and cf. Section 2 of Supplementary Information for more background). Thanks to the topological invariance of TDE, the general construction of simplicial-complex representations (see Fig. 1b) and the computation of PH from point clouds (see Fig. 1d) apply to time series data, although this transformation involves subtle technical issues in practice. For instance, Emrani et al. [47] utilised TDE and PH to identify the periodic structure of dynamical systems, with applications to wheeze detection in pulmonology. They selected the embedded dimension as 2, and their delay parameter was determined by an autocorrelation-like (ACL) function, which provided a range for the delay between the first and second critical points of the ACL function. Pereira and de Mello [48] proposed a data clustering approach based on PD. The data were initially reconstructed by TDE so as to obtain the corresponding PD, which was then subjected to k-means clustering. The delay was determined using the first minimum of the auto mutual information function, and the embedded dimension was set to be 2, as using 3 dimensions did not significantly improve the results. Khasawneh and Munch [49] introduced a topological approach for examining the stability of a class of nonlinear stochastic delay equations. They used false nearest neighbours to determine the embedded dimension and chose the delay to equal the first zero of the autocorrelation function. Subsequently, the longest persistence lifetime in the PD was used as a vectorisation to quantify periodicity. Umeda [22] focused on a classification problem for volatile time series by extracting the structure of attractors, using TDA to represent transition rules of the time series.
He fixed the TDE parameters in his study and introduced a novel vectorisation method, which was then applied to a convolutional neural network (CNN) to achieve classification. Gidea and Katz [51] employed TDA to detect early signs prior to financial crashes. They studied multi-dimensional time series and used the persistence landscape as a vectorisation method. For speech recognition, Brown and Knudson [27] fixed the TDE parameters and examined the structure of point clouds obtained via TDE of human speech signals, together with their corresponding PBs.
Upon reviewing the relevant literature, we see that there is currently no general framework for systematically choosing the embedding dimension and the delay, and researchers often have to make ad hoc choices for practical needs. While the TDE–PH topological approach to handling time series data is not new, TopCap extracts features from high-dimensional spaces; in our experiment, the embedding dimension is 100. In some cases, in a low-dimensional space, regardless of how optimal the choice of delay is, the structure of the time series cannot be adequately captured. In contrast, given a high-dimensional space, feature extraction from data becomes simpler. Of course, operating in a high-dimensional space comes with its own cost; for example, the adjustment of the delay then requires careful consideration. Nonetheless, it also offers advantages, which we will elucidate step by step in the subsequent sections.
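As a concrete illustration of the sliding-window construction underlying TDE, the following NumPy sketch embeds a 1-D series into d-dimensional space. This is our own illustrative code, not the authors' implementation; the names `d`, `tau`, and `skip` follow the dimension, delay, and skip parameters discussed in this article.

```python
import numpy as np

def time_delay_embedding(x, d, tau, skip=1):
    """Embed a 1-D time series x into R^d.

    Each embedded point is (x[t], x[t + tau], ..., x[t + (d - 1) * tau]),
    and consecutive points are taken every `skip` time steps.
    """
    x = np.asarray(x, dtype=float)
    window = (d - 1) * tau  # span of one embedded point
    n_points = (len(x) - window - 1) // skip + 1
    if n_points <= 0:
        raise ValueError("series too short for this (d, tau) choice")
    starts = np.arange(n_points) * skip
    offsets = np.arange(d) * tau
    return x[starts[:, None] + offsets[None, :]]
```

For a sampled periodic signal, choosing d and tau so that the window `(d - 1) * tau` covers roughly one period makes the embedded point cloud trace out a closed loop, which is the structure PH then detects.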
2 Results
This research drew inspiration from Carlsson and his collaborators' discovery of the Klein-bottle distribution of high-contrast, local patches of natural images [20], as well as their subsequent work on topological convolutional neural networks for learning image and even video data [2]. By analogy, we would like to understand a distribution space for speech data, and even a directed-graph structure on it modelling the complex network of speech-signal sequences, and how these topological inputs may enable smarter learning. Here are some of our first findings in this direction.
2.1 Construction of vibrating time series
The impetus behind TopCap lies in an observation of how PD can capture vibration patterns within time series. To begin with, our aim is to determine which sorts of information can be extracted using topological methods. As the name indicates, topological methods quantify features based on topology, which distinguishes spaces that cannot be continuously deformed into each other. In the context of time series, we conduct a series of experiments to scrutinise the performance of topological methods, their limitations as well as their potential.
Given a periodic time series, its TDE target is situated on a closed curve (i.e., a loop) in a sufficiently high-dimensional Euclidean space (see Fig. 1a). Despite the satisfactory point-cloud representation of a periodic time series, it remains rare in practical measurement and observation to capture a truly periodic series. Often, we find ourselves dealing with time series that are not periodic yet exhibit certain patterns within some time segments. For instance, Fig. 1c portrays the average temperature of the United States from the year 2012 to 2022, as documented in [65]. Although the temperature does not adhere strictly to a periodic pattern, it does display a noticeable cyclical trend on an annual basis. Typically, the temperature tends to rise from January to July and fall from August to December, with each year approximately comprising one cycle of the variation pattern. One strength of topological methods is their ability to capture ``cycles''. A question then arises naturally: Can these methods also capture the cycle of temperature as well as subtle variations within and among these cycles? To be more precise, we first observe that variations occur in several ways. For instance, the amplitude (or range) of the annual temperature variation may fluctuate slightly, with the maximum and minimum annual temperatures varying from year to year. Additionally, the trend line for the annual average temperature also shows fluctuations, such as the average temperature in 2012 surpassing that of 2013. Despite each year’s temperature pattern bearing resemblance to that depicted in the left panel in Fig. 1c (representing a single cycle of temperature within a year), it may be more beneficial for prediction and response strategies to focus on the evolution of this pattern rather than its specific form. In other words, attention should be directed towards how this cycle varies over the years. This leads to several questions. 
How can we consistently capture these subtle changes in the pattern’s evolution, such as variations in the frequency, amplitude, and trend line of cycles? How can we describe the similarities and differences between time series that possess distinct evolutionary trajectories? In applications, these are crucial inquiries that warrant further exploration.
To address these questions, we propose three kinds of ``fundamental variations'' which are utilised for depicting the evolutionary trace of a time series. Consider a series sampled from a periodic function f(t), where T represents a period.

(1) Variation of frequency. Denote the frequency by 1/T. Note that the series is not necessarily periodic in the mathematical sense. Rather, it exhibits a recurring pattern after the period T. For instance, the average temperature from Fig. 1c is not a periodic series, but we consider its period to be one year since it follows a specific pattern, i.e., the one displayed in the left panel of Fig. 1c. This 1-year pattern always lasts for a year as time progresses. Hence, there is no frequency variation in this example. This type of variation can be represented as f(g(t)t), where g(t) is a series representing the changing frequency. This type of variation occurs, for example, when one switches their vocal tone or when one's heartbeats experience a transition from walking mode to running mode.

(2) Variation of amplitude. The amplitudes of temperature in the years 2014 and 2015 are 42.73 °F and 40.93 °F, respectively. So the variation of amplitude from 2014 to 2015 is −1.80 °F. This can be represented by A(t)f(t), where A(t) is a series of the changing amplitude. This type of variation is observed when a particle vibrates with resistance or when there is a change in the volume of a sound.

(3) Variation of average line. The average temperatures through the years 2012 and 2013 are 55.28 °F and 52.43 °F, respectively. The variation of average line from 2012 to 2013 is −2.85 °F. This can be represented by f(t) + m(t), where m(t) is a series representing the variation of average line. This type of variation is observed when a stock experiences a downturn over several days or when global warming causes a year-by-year increase in temperature.
To summarize, Fig. 1e provides a visual representation of the three fundamental variations. It is important to note that these variations are not utilised to depict the pattern itself but rather to illustrate the variation within the pattern or how the time series oscillates over time. This approach offers a dynamic perspective on the evolution of the time series, capturing changes in patterns that static analyses might overlook.
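The three fundamental variations above can be simulated directly. The sketch below generates one series of each type from a base cosine; the particular modulating functions (a linearly drifting frequency, a decaying amplitude envelope, and a linearly rising average line) are illustrative choices of ours, not the exact construction used in Section 4.1.

```python
import numpy as np

t = np.linspace(0.0, 20.0, 2000)

# (1) Variation of frequency: the instantaneous frequency drifts with time.
freq = 1.0 + 0.05 * t                      # g(t), slowly increasing
var_frequency = np.cos(2 * np.pi * freq * t)

# (2) Variation of amplitude: the envelope shrinks over time.
amp = np.exp(-0.05 * t)                    # A(t), decaying envelope
var_amplitude = amp * np.cos(2 * np.pi * t)

# (3) Variation of average line: the trend line drifts upward.
trend = 0.1 * t                            # m(t), rising average line
var_average = np.cos(2 * np.pi * t) + trend
```

Feeding each series through TDE and PH, as described in this section, produces the distinct persistence-diagram signatures shown in Fig. 2.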

Using three simulated time series corresponding to the above three fundamental types of variation (see Section 4.1 for detailed construction), we demonstrate that PD can distinguish these variations and detect how significant they are. See Fig. 2, where a smaller value of the variation parameter indicates a more rapid fundamental variation. Here, regardless of the parameter value, each individual diagram features a prominent single point at the top and a cluster of points with relatively short duration, except in the degenerate case where the variation vanishes. In that case, the series represents a pure cosine function, and thus the diagram consists of a single point. Normally, one tends to overlook the points in a PD that exhibit a short duration, as they are sometimes inferred as noise. However, in this example, the distribution of those points holds valuable information regarding the three fundamental variations. As shown in Fig. 2, each fundamental variation has its distinct pattern of distribution in the lower region of a diagram, which leads to refined inferences: if the points spiral along the vertical axis of lifetime, it is probably due to a variation of amplitude; if every two or four points stay close to form a ``shuttle'', it probably indicates a variation of average line; otherwise the points just seem to spread randomly, which more likely results from a variation of frequency. It is also straightforward to distinguish the parameter values for a specific fundamental variation by the most significant point in the diagram: a longer lifetime for the barcode of the solitary point indicates slower variation. The lower region of a diagram also gives some hints in this respect.
In this simulated example, we demonstrated how PD could be utilised as a uniform means to distinguish three fundamental variations of the cosine series and their respective rates of change. However, it is important to note that in general scenarios, identifying the fundamental variations in a time series using topological methods may encounter significant challenges. Although topological methods are indeed capable of capturing this information, vectorising this information for subsequent utilisation remains a complex task at this stage. Having recognised the potential of topological methods, we resort to an alternative algorithm for handling time series. Specifically, despite the difficulty in vectorising PD to measure each fundamental variation, we have developed a simplified algorithm to measure the vibration of time series as a whole. This approach provides a comprehensive understanding of the overall behaviour of a time series, bypassing the need for complex vectorisation.
2.2 Traditional machine learning methods with novel topological features
Utilising datasets comprising human speech, we initially employ the Montreal Forced Aligner to align natural speech into phonetic segments. Following preprocessing of these phonetic segments, TDE is conducted with a relatively high dimension parameter and with the delay parameter chosen so that the embedding window covers approximately one period of the time series, where the period is estimated as the (minimal) period of the series. Following additional refinement procedures, PDs are computed for these segments and are then vectorised based on the maximal persistence and its corresponding birth time. The comprehensive procedural framework is expounded in Sections 4.2 and 4.3, while the corresponding workflow is shown in Fig. 3e. In applications of TDE, the dimension parameter is usually determined through algorithms designed to identify the minimal appropriate dimension [45, 66]. The delay parameter is determined by the autocorrelation function with no specific rule, but in many cases it is set to a small positive integer number of time steps. In our pursuit of enhanced extraction of topological features, a relatively high dimension is chosen (see Section 3.1 for more discussion on dimension in TDE). Given this higher dimension, the usual choice of a small delay may prove excessively diminutive, particularly in light of the time series only taking values at discrete time steps. Consequently, in TopCap we adopt an adjusted parametrisation with a relatively large delay.
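The vectorisation step, reducing each 1-dimensional PD to the pair (birth time, maximal persistence), is simple enough to sketch directly. In practice the diagram itself would be computed by a PH library such as Ripser or GUDHI (not shown here); the helper below is our own illustrative code and only performs the final reduction, taking the diagram as an array of (birth, death) rows.

```python
import numpy as np

def topcap_features(dgm):
    """Vectorise a 1-dimensional persistence diagram.

    `dgm` is an (n, 2) array of (birth, death) pairs, in the format
    returned by persistent-homology libraries. Returns the
    (birth, persistence) pair of the most persistent point,
    or (0, 0) for an empty diagram.
    """
    dgm = np.asarray(dgm, dtype=float).reshape(-1, 2)
    if dgm.size == 0:
        return 0.0, 0.0
    lifetimes = dgm[:, 1] - dgm[:, 0]
    i = int(np.argmax(lifetimes))
    return float(dgm[i, 0]), float(lifetimes[i])
```

The resulting two-dimensional feature vector is what we feed to the classifiers below.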
We input the pair of birth time and maximal persistence from the 1-dimensional PD of each sound record into multiple traditional classification algorithms: Tree, Discriminant, Logistic Regression, Naive Bayes, Support Vector Machine, k-Nearest Neighbours, Kernel, Ensemble, and Neural Network. We use the MATLAB (R2022b) Classification Learner app with 5-fold cross-validation and set aside 30% of the records as test data. This app runs the machine learning algorithms automatically. There are a total of 1016 records, with 712 training samples and 304 test samples. Among them, 694 records are voiced consonants and the remaining are voiceless consonants. The models we choose in this app are Optimizable Tree, Optimizable Discriminant, Efficient Logistic Regression, Optimizable Naive Bayes, Optimizable SVM, Optimizable KNN, Kernel, Optimizable Ensemble, and Optimizable Neural Network. The results are shown in Fig. 3a–d.
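The study uses MATLAB's Classification Learner; as an open-source analogue, a minimal k-nearest-neighbours classifier over the two topological features might look as follows. The synthetic feature values in the usage below are illustrative, not the paper's data; only the qualitative pattern (voiced consonants at higher birth time and lifetime) is taken from the text.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=5):
    """Classify each row of test_X by majority vote among its k nearest
    training points (Euclidean distance in feature space)."""
    train_X = np.asarray(train_X, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    train_y = np.asarray(train_y)
    preds = []
    for x in test_X:
        dists = np.linalg.norm(train_X - x, axis=1)
        nearest = train_y[np.argsort(dists)[:k]]
        labels, counts = np.unique(nearest, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```

Because the two classes occupy well-separated regions of the (birth time, lifetime) plane, even this simple rule separates them effectively, consistent with the high accuracies reported across all nine algorithms.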

The receiver operating characteristic (ROC) curve, area under the curve (AUC), and accuracy metrics collectively demonstrate the efficacy of these topological features as inputs for a variety of machine learning algorithms. Each of these algorithms achieves an accuracy higher than 96%. The ROC and AUC together depict the high performance of our classification model across all classification thresholds. The 2D histograms depicted in Fig. 3c–d collectively illustrate the distinct distributions of voiced and voiceless consonants. Voiced consonants tend to exhibit a relatively higher birth time and lifetime, which provides an explanation for the high performance of these algorithms. Despite the intricate structure that a PD may present, appropriately extracted topological features enable traditional machine learning algorithms to separate complex data effectively. This highlights the potential of TDA in enhancing the performance of machine learning models. In summary, the most significant distinction between voiced and voiceless consonants is that the former exhibit higher maximal persistence. This can scarcely be detected in lower dimensions, regardless of how we tune the delay parameter for TDE. See Fig. S2 of Supplementary Information for recognition of vowels as well as consonants in terms of their ``shapes''.
2.3 The three fundamental variations within one-dimensional PD
A PD for 1-dimensional PH encodes much more information beyond the birth time and lifetime of the maximal persistence point. The three fundamental variations examined in Section 2.1 also manifest themselves in certain regions of the PD, which can in turn be vectorised. To capture these variations, we perform an experiment with two records of the vowel [A]. Specifically, we demonstrate the fundamental variations by comparing the PDs of (a) a record of [A] that is relatively unstable with respect to the fundamental variations and (b) another record of the same vowel that is relatively stable. To better illustrate the results, we crop each record into 4 overlapping intervals, each starting from time 0 and ending at 600, 800, 1000, and 1200, respectively. Each time a new segment of 200 units is added to the original sample, the amplitude and frequency of the series alter more drastically in case (a). A more rapid rate of change may lead to more points distributed in the lower region of the diagram. The outcomes are presented in Fig. 4. In particular, the plots in Fig. 4c show that the spectral frequency of (a) indeed varies faster than that of (b).
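The nested-interval cropping described above amounts to slicing each record at a fixed set of end points. A minimal sketch (the end points are those quoted above; the function name is ours):

```python
import numpy as np

def nested_crops(record, ends=(600, 800, 1000, 1200)):
    """Return overlapping crops record[0:e] for each end point e,
    so each crop extends the previous one by 200 samples."""
    record = np.asarray(record)
    return [record[:e] for e in ends]
```

Each crop is then embedded and its PD computed separately, so the growth of the lower region of the diagram can be tracked as the record lengthens.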
We should also mention that the 1-dimensional PD here serves as a profile for the combined effect of the fundamental variations. Currently, it is unclear how the points in the lower region change in response to a specific fundamental variation.

3 Discussion
In the realm of topological methods applied to time series [47, 48, 49, 50, 22, 51, 27], the determination of parameters in TDE emerges as a pivotal subject. This significance stems from the profound impact that the selection of parameters has on the resulting topological spaces and their corresponding PDs. Indeed, there exist several convenient algorithms for parameter selection. For example, a method for ascertaining the embedded dimension is provided by the False Nearest Neighbours algorithm (FNN) [66], a widely utilised tool for identifying the minimal embedded dimension.
However, in the context of persistent homology, the objective is usually not to achieve a ``minimal'' dimension. On the contrary, a dimension of substantial magnitude might be desirable due to certain advantages it offers, which will be explained later. In this section, our objective is to elucidate the intricate relationship between TDE and PD. We explore in detail how the embedded dimension and subsampling (i.e., the skip parameter in TDE) affect both maximal persistence and computation time. We also present the maximal persistence of a time series under varying dimension, delay, and skip parameters in TDE. Despite the absence of a universal framework for parameter selection, it is noteworthy that topological spaces exhibit traceable variations in response to these parameters.
There are three crucial parameters in TDE, namely, dimension, delay, and skip. It should be mentioned, however, that the TDE plus PD approach encompasses many other significant parameters. These include the construction of the underlying topological space of the point clouds (i.e., the distance function for pairs of points) and the type of complexes utilised in PD, among others. Some of these choices, despite their importance, have seldom been discussed before. In this article, we propose a method for determining the delay in order to capture the theoretical maximal persistence of a time series in high-dimensional TDE. Hopefully, future research will yield more systematic approaches for determining the other parameters, particularly the dimension of the TDE.
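For concreteness, here is a minimal sketch of TDE with these three parameters. The indexing convention (point i collects samples i, i+delay, ..., i+(dim-1)*delay, with start indices stepped by skip) is our assumption and may differ in detail from the implementation used in the experiments:

```python
import numpy as np

def time_delay_embedding(series, dim, delay, skip):
    """Embed a 1-dimensional series as a point cloud in dim-dimensional
    Euclidean space: point i is (x[i], x[i+delay], ..., x[i+(dim-1)*delay]),
    keeping every skip-th start index."""
    series = np.asarray(series, dtype=float)
    n_starts = len(series) - (dim - 1) * delay  # number of valid start positions
    if n_starts <= 0:
        return np.empty((0, dim))  # window overruns the series: empty embedding
    starts = np.arange(0, n_starts, skip)
    return series[starts[:, None] + delay * np.arange(dim)[None, :]]
```

For example, `time_delay_embedding(x, dim=100, delay=4, skip=5)` mirrors the parameter regime used by TopCap.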
3.1 Embedded dimension and maximal persistence
In the TDE plus PD approach, determining the dimension of TDE can be complex, yet it plays a pivotal role in the extraction of maximal persistence. As discussed in this section, a larger dimension can significantly enhance the theoretical maximal persistence of a time series. In the TopCap construction, the dimension of TDE is set at 100, a relatively large dimension for the experiment. The appropriate dimension depends on factors such as the length of the time series (the dimension cannot exceed this length, as the resulting point cloud would then contain no points at all), the periodicity of the time series (the window size should be compatible with its approximate period), among other considerations.
According to Perea and Harer [61, Proposition 5.1], there is no information loss for trigonometric polynomials if and only if the dimension of TDE exceeds twice the maximal frequency. In this context, no information loss means that the original time series can be reconstructed from the embedded point cloud. In general, for a periodic function, a higher dimension of TDE yields a more precise approximation by trigonometric polynomials. Although no real data are perfectly periodic, each time series exhibits its own ``pattern of vibration'', as discussed in Section 2.1, and a higher dimension may be employed to capture this vibration pattern more accurately. Furthermore, an increased embedded dimension may reduce the computation time for PD. For instance, computation times for a voiced consonant [N] (see Fig. 5a) are 0.2671, 0.2473, and 0.2375 seconds for embedded dimensions 10, 100, and 1000, respectively. This is attributed to the influence of a larger dimension on the number of points in the point cloud. While this reduction may not seem substantial compared to the impact of skip (see Fig. 5d), it can become significant when handling large datasets. Most importantly, an increased embedded dimension can enhance the maximal persistence (the chief motivation for a larger dimension) and smooth the shape of the resulting point clouds, making the embedding more geometrically natural. Typically, most algorithms prefer a lower dimension owing to factors such as the curse of dimensionality and computation cost; in the TDE plus PD approach, however, we opt for a higher dimension.
However, the embedded dimension cannot be arbitrarily large. As illustrated in Fig. 5c, when the embedded dimension rises to 1280, it becomes infeasible to capture a significant maximal persistence in the phonetic time series. This is attributed to the dehiscence of the point cloud, which renders it incapable of supporting a proper maximal persistence. When the embedded dimension reaches 1290, an empty 1-dimensional barcode is obtained, as there are not enough points to form even one cycle. In this scenario, the dimension of TDE is constrained by the length of the time series.
Using a sound record of the voiced consonant [N] as an exemplar, we delineate the correlation between maximal persistence and embedded dimension in Fig. 5. As depicted in Fig. 5b, maximal persistence tends to escalate in a nonlinear fashion as the dimension increases, signifying that a more substantial maximal persistence is captured in higher-dimensional TDE. Notably, two severe reductions in maximal persistence are observed, at embedded dimensions 600 and 1190. At dimension 600, this time series can theoretically attain its maximal persistence with a delay of 2. However, given that the length of the series is 1337 and the window size is then 1200, with the skip set at 5 only 28 points remain in the resulting point cloud for the persistence diagram computation. So few points cannot represent the original series adequately, leading to a decrease in maximal persistence, and the same happens when the dimension equals 610 or 620. A similar phenomenon occurs when the dimension reaches 1190. The principal component analysis (PCA) for dimension 1280 is shown in Fig. 5c: the hypothetical cycle fails to form because of a dehiscence in the point cloud, resulting in these relatively catastrophic decreases in maximal persistence. With an appropriately chosen dimension and delay, the series achieves its maximal persistence again; the corresponding window size leaves 142 points in the point cloud when the skip equals 5, without any dehiscence or improper representation. The embedded dimension also contributes significantly to the geometric properties of the time-delay embedding: as the shape becomes smoother in higher dimensions, the structure of the point cloud becomes more reasonable.
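The point counts quoted above follow from a simple formula. In the sketch below, the delay value 2 at dimension 600 is our inference from the reported window size and point count, not a figure stated in the text:

```python
def embedded_point_count(n, dim, delay, skip):
    # Number of TDE points for a series of length n, where each point
    # spans (dim - 1) * delay samples and start indices step by skip.
    n_starts = n - (dim - 1) * delay
    return 0 if n_starts <= 0 else (n_starts - 1) // skip + 1

# The [N] record above has length 1337 and skip 5; at dimension 600
# with an assumed delay of 2, only 28 points remain:
print(embedded_point_count(1337, 600, 2, 5))  # → 28
```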

3.2 Skip, maximal persistence, and persistence execution time
Computation time plays a critical role when processing a substantial volume of data. In this context, we consider the skip parameter in TDE, as it significantly influences the number of points within the point cloud, thereby directly impacting the number of simplices during filtration and hence the computation time for PD. In this subsection, we demonstrate that an appropriate increment in the skip parameter can markedly decrease computation time. Notably, maximal persistence is resilient to an increase in skip up to a certain extent. Consequently, it is feasible to increase skip in TDE to expedite the computation of PD. For details on the complexity of computing persistent homology, readers can refer to Zomorodian and Carlsson [53, Section 4.3 Discussion] as well as Edelsbrunner et al. [67, Section 4].
Using an example of a sound record of the voiced consonant [m], we elucidate the relationship between skip, computation time, and the cardinality of the resulting point cloud obtained via TDE in Fig. 5d. Computation time is measured after restarting the Jupyter Notebook each time, on a Dell Precision 3581 with an Intel® Core™ i7-13800H CPU (base frequency 2.50 GHz, 14 cores); it refers to the time for executing the code ripser(Points, maxdim=1). As depicted in Fig. 5d, a substantial reduction in computation time is observed as the skip parameter increases, while the computed maximal persistence remains stable and resilient.
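This trade-off can be illustrated without audio data. The sketch below times the pairwise-distance computation (the input to a Vietoris-Rips filtration, standing in here for the full ripser run) on a synthetic series; absolute timings are machine-dependent:

```python
import time
import numpy as np

def tde(series, dim, delay, skip):
    # Time-delay embedding, following the convention of the Methods section.
    starts = np.arange(0, len(series) - (dim - 1) * delay, skip)
    return series[starts[:, None] + delay * np.arange(dim)]

def pairwise_distances(pts):
    # Euclidean distance matrix, computed via the Gram matrix.
    sq = (pts ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (pts @ pts.T)
    return np.sqrt(np.maximum(d2, 0.0))

series = np.sin(np.linspace(0, 40 * np.pi, 2000))
results = {}
for skip in (1, 5, 20):
    pts = tde(series, dim=100, delay=4, skip=skip)
    t0 = time.perf_counter()
    pairwise_distances(pts)
    results[skip] = (pts.shape[0], time.perf_counter() - t0)
    print(skip, results[skip])
```

Increasing skip from 1 to 20 shrinks the point cloud from 1604 to 81 points, and the quadratic cost of the distance matrix (and the even steeper cost of the filtration) falls accordingly.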
3.3 Dependence of maximal persistence
In this subsection, we present a table that delineates the maximal persistence in relation to dimension, delay, and skip within the context of the TDE plus PD approach. The experiment is executed on a voiced consonant [N], a time series of length 887. Theoretically, for a periodic function, one attains the maximal persistence in the desired dimension when the window size approximates the period. However, the phonetic time series we typically handle are far from periodic. Although we estimate the period of a time series by its autocorrelation function, we cannot guarantee that the desired delay will reach the actual maximal persistence in general cases. Nevertheless, the desired delay usually yields a relatively good maximal persistence. For instance, as illustrated in Table 1, when the dimension is 10, the desired delay is 40, corresponding to a maximal persistence of 0.1290, marginally lower than the maximal persistence of 0.1315 achieved at a delay of 60. As the dimension increases, however, the point clouds resulting from TDE become more reasonable, and it becomes increasingly probable that the maximal persistence is attained at the desired delay. For example, when the dimension is either 50 or 100, the maximal persistence of the time series is achieved at the desired delay. This provides additional justification for preferring higher dimensions. The table further illustrates that increasing the dimension can enhance the maximal persistence of a time series more substantially than tuning the delay.
dimension = 10, desired delay = 40 | dimension = 50, desired delay = 8 | dimension = 100, desired delay = 4
delay | skip | maximal persistence | delay | skip | maximal persistence | delay | skip | maximal persistence |
1 | 1 | 0.0610 | 1 | 1 | 0.2834 | 1 | 1 | 0.4270 |
10 | 1 | 0.1299 | 3 | 1 | 0.3021 | 2 | 1 | 0.4337 |
20 | 1 | 0.1312 | 4 | 1 | 0.3054 | 2 | 5 | 0.4146 |
30 | 1 | 0.1281 | 5 | 1 | 0.3058 | 3 | 1 | 0.4357 |
39 | 1 | 0.1229 | 6 | 1 | 0.3042 | 3 | 5 | 0.4120 |
39 | 5 | 0.1134 | 7 | 1 | 0.3052 | 4 | 1 | 0.4381 |
40 | 1 | 0.1290 | 7 | 5 | 0.2886 | 4 | 5 | 0.4139 |
40 | 5 | 0.1195 | 8 | 1 | 0.3093 | 5 | 1 | 0.4375 |
41 | 1 | 0.1200 | 8 | 5 | 0.2928 | 5 | 5 | 0.4105 |
41 | 5 | 0.1153 | 9 | 1 | 0.3091 | 6 | 1 | 0.4347 |
45 | 1 | 0.0940 | 9 | 5 | 0.2913 | 6 | 5 | 0.4114 |
50 | 1 | 0.1226 | 10 | 1 | 0.3069 | 7 | 1 | 0.4380 |
60 | 1 | 0.1315 | 15 | 1 | 0.3070 | 8 | 1 | 0.4378 |
94 | 1 | empty | 18 | 1 | empty | 9 | 1 | empty |
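The desired delays in the table header are consistent with the delay rule of Section 4.3 (window size approximately equal to the estimated period). Assuming a hypothetical period estimate of 400 samples for this record, an assumption of ours chosen to match the header, they are reproduced by:

```python
# Hypothetical check: delay = round(period / dimension), with an assumed
# period of 400 samples (not a value stated in the text).
assumed_period = 400
desired = {dim: round(assumed_period / dim) for dim in (10, 50, 100)}
print(desired)  # → {10: 40, 50: 8, 100: 4}
```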
4 Methods
4.1 On the construction of vibrating time series
There are three kinds of fundamental variations mentioned in Section 2.1. In order to substantiate our argument, we construct a test series built from line segments connecting successive extrema, whose frequency of oscillation changes more slowly as the governing parameter increases; in the limiting case, the series becomes a periodic function. We perform TDE on this series with dimension 3, delay 100, and skip 10, and then compute its 1-dimensional PD; the underlying space of the point cloud is 3-dimensional Euclidean space. See Fig. 2a for the result. Replacing the frequency variation by variations in amplitude and in average line, we obtain the similar diagrams in Fig. 2b and Fig. 2c.
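As a stand-in for the construction above (whose exact formula is omitted here), a simple chirp exhibits the same kind of frequency variation and can be fed through the identical TDE-plus-PD pipeline; the function and its parameters are our own illustration:

```python
import numpy as np

def chirp_series(n=2000, f0=5.0, f1=20.0):
    # Instantaneous frequency ramps linearly from f0 to f1 Hz over one
    # second, mimicking a series whose frequency of vibration varies.
    t = np.linspace(0.0, 1.0, n)
    phase = 2 * np.pi * (f0 * t + 0.5 * (f1 - f0) * t ** 2)
    return np.sin(phase)
```

The resulting series can then be embedded with dimension 3, delay 100, and skip 10 exactly as above and its 1-dimensional PD compared against Fig. 2.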
4.2 Obtain phonetic data from natural speech
We used speech files sourced from SpeechBox [68], ALLSSTAR Corpus [69], task HT1 language English L1 files, retrieved on 28 January 2023. SpeechBox is a web-based system providing access to an extensive collection of digital speech corpora developed by the Speech Communication Research Group in the Department of Linguistics at Northwestern University. This section contains a total of 25 individual files, comprising 14 files from females and 11 files from males. The age range of these speakers spans from 18 to 26 years, with an average age of 19.92 years. Each file is presented in WAV format and is accompanied by a corresponding aligned file in TextGrid format, featuring three tiers: sentences, words, and phones. Collectively, these 25 speech files amount to a total duration of 41.21 minutes. Each file contains an individual reading the same sentences consecutively for a duration ranging from 80 to 120 seconds, contingent upon each person's pace. The original WAV files have a sampling frequency of 22050 Hz and comprise only one channel. Since the Montreal Forced Aligner [70] is trained at a sampling frequency of 16000 Hz, we adjusted the sampling frequency of the WAV files accordingly. We extracted the words tier from each TextGrid and aligned words into phones using the english_mfa dictionary and acoustic model (MFA version 2.0.6). Thus, we obtained corresponding phonetic data from these speech files. We then used the voiced and voiceless consonants in those segments as our dataset. Voiced consonants are those articulated with the vocal cords vibrating in the throat, whereas voiceless consonants are articulated without such vibration.
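The resampling from 22050 Hz to 16000 Hz can be sketched via linear interpolation; this is a simplified stand-in, as a production pipeline would use a proper polyphase resampler (e.g. scipy.signal.resample_poly):

```python
import numpy as np

def resample_linear(signal, sr_from=22050, sr_to=16000):
    # Resample by linear interpolation on the common time axis.
    signal = np.asarray(signal, dtype=float)
    n_out = int(round(len(signal) * sr_to / sr_from))
    t_old = np.arange(len(signal)) / sr_from
    t_new = np.arange(n_out) / sr_to
    return np.interp(t_new, t_old, signal)
```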
Using Praat [71], we extracted voiced consonants [N], [m], [n], [j], [l], [v], and [Z]; for voiceless consonants, we selected [f], [k], [8], [t], [s], and [tS]. These phones were then read as time series. Our selection was limited to these voiced and voiceless consonants, as we aimed to balance the ratio of voiced and voiceless consonant records in these speech files. Additionally, some consonants, such as [d] and [h], were difficult to classify using this method.
4.3 Derive topological features from phonetic data
Prior to extracting topological features from a time series, we first imbue this 1-dimensional time series with a topological structure through TDE. Notably, this technique is also applicable to multidimensional time series. The underlying space throughout this article is always Euclidean space. By establishing the topological structure, or more precisely the distance matrix, we can subsequently calculate the PD. We elaborate on the main steps below.
(1) Data cleaning. This involved eliminating the initial and final sections of the time series up to the first point with an amplitude exceeding 0.03, so as to mitigate the impact of environmental noise at the beginning and end of a phone. Any resulting series with fewer than 500 points was disregarded, as such series were considered insufficiently long or contaminated by excessive environmental noise.
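A sketch of this cleaning step; reading ``amplitude exceeding 0.03'' as absolute amplitude is our interpretation:

```python
import numpy as np

def clean_series(series, threshold=0.03, min_length=500):
    # Trim leading/trailing sections up to the first and last samples whose
    # absolute amplitude exceeds the threshold; discard series that are
    # too short afterwards.
    series = np.asarray(series, dtype=float)
    loud = np.flatnonzero(np.abs(series) > threshold)
    if loud.size == 0:
        return None  # never exceeds the threshold: pure background noise
    trimmed = series[loud[0]: loud[-1] + 1]
    return trimmed if trimmed.size >= min_length else None
```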
(2) Parameter selection for TDE. Here, we selected suitable parameters for TDE to capture the theoretical maximal persistence of a given time series. The dimension of the embedding was fixed at 100. The guiding principle is to choose a large embedded dimension relative to the limited length of the time series: as discussed in Section 3.1 (cf. Section 2 of the Supplementary Information), a higher dimension yields a more accurate approximation, enhances computational efficiency, and produces a more prominent maximal persistence. Nonetheless, caution must be exercised when selecting the dimension, as excessively large dimensions may lead to empty point clouds and other uncontrollable factors. With a proper dimension fixed, we then compute the delay of the embedding. According to Perea and Harer [61], in the case of a periodic function, the delay can be expressed as
\[ \tau = \frac{nT}{d}, \]
where $T$ denotes the period, $d$ represents the dimension of the embedding, and $n$ is some positive integer. Under these conditions, we obtain the largest maximal persistence; in what follows, we say that the embedding reaches the maximal persistence of the time series. The time series under consideration in our case were far from periodic, so we used the first peak of the autocorrelation function to represent the (minimal) period $T$ and set $n = 1$, thus obtaining a relatively proper delay. The common choice of $n$ is to let the window size equal the (minimal) period. However, for a discrete time series one may obtain $\tau = 0$ in this way when the dimension of the TDE is too large, so one strategy is to increase $n$ until a reasonable $\tau$ results. The performance of the delay obtained in this way is presented in Section 3.1. The value $\tau$ was then rounded to the nearest integer; if it rounded to 0, we took it as 1. It was also common for the window size to exceed the number of points in the series, resulting in an empty embedding; in this case, we took $\tau$ to be $N/d$, where $N$ denotes the number of points (i.e., the point capacity) of the time series, and rounded it downwards. This enabled us to obtain an appropriate delay for each time series, thereby facilitating the attainment of maximal persistence for the specified dimension. Lastly, we took skip = 5, mainly to achieve an affordable computation time. The impact of the skip parameter in TDE on maximal persistence and computation time is expounded in Section 3.2 for those interested. Once the parameters have been established, the time series can be transformed into a point cloud. If there were fewer than 40 points in this point cloud, we excluded the time series from further analysis, as too few points cannot represent the original structure of the series; this problem is also discussed in Section 3.1.
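The delay-selection step can be sketched as follows; the first-peak detection is a simple local-maximum search and is our own minimal implementation of the strategy described above:

```python
import numpy as np

def estimate_period(series):
    # (Minimal) period: lag of the first peak of the autocorrelation function.
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]
    peaks = np.flatnonzero((ac[1:-1] > ac[:-2]) & (ac[1:-1] >= ac[2:])) + 1
    return int(peaks[0]) if peaks.size else None

def desired_delay(series, dim):
    # delay = round(n * period / dim), increasing n while rounding gives 0.
    period = estimate_period(series)
    if period is None:
        return 1
    n, delay = 1, round(period / dim)
    while delay == 0:
        n += 1
        delay = round(n * period / dim)
    return delay
```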
(3) Computing PD. Using Ripser [72, 73], we computed the PD of the point cloud quickly and efficiently. We then extracted the maximal persistence from the 1-dimensional PD, using its birth time and lifetime as two features of the time series. Vectorising a PD in general is challenging, owing to the indeterminate (and potentially large) number of tuples in the barcode and the ambiguity of the information they carry; it is not fully understood what kinds of information can be derived from different parts of the PD. Here we extracted only the maximal persistence and the corresponding birth time, a decision informed by our prior parameter selection, which ensures that the PD attains its maximum. See Fig. 3e for the flow chart of this section.
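The feature extraction in this step reduces to picking the most persistent point of the 1-dimensional diagram. A minimal sketch, operating on the (birth, death) array that `ripser(points, maxdim=1)['dgms'][1]` returns:

```python
import numpy as np

def max_persistence_features(dgm1):
    # dgm1: (k, 2) array of (birth, death) pairs from a 1-dimensional PD.
    dgm1 = np.asarray(dgm1, dtype=float).reshape(-1, 2)
    if dgm1.shape[0] == 0:
        return 0.0, 0.0  # empty barcode: no 1-cycle was formed
    lifetimes = dgm1[:, 1] - dgm1[:, 0]
    i = int(np.argmax(lifetimes))
    return float(dgm1[i, 0]), float(lifetimes[i])  # (birth time, lifetime)
```

Returning (0.0, 0.0) for an empty diagram is our convention; the pipeline above discards such series beforehand.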
5 Data and code availability
The data that support the findings of this study are openly available in SpeechBox [68], ALLSSTAR Corpus [69], L1-ENG division at https://speechbox.linguistics.northwestern.edu. The code resources and supplementary materials for TopCap can be accessed on the GitHub page at https://github.com/AnnFeng233/TDA_Consonant_Recognition.
References
- [1] G. Carlsson, ``Topology and data,'' Bulletin of The American Mathematical Society, vol. 46, pp. 255–308, Apr. 2009.
- [2] E. R. Love, B. Filippenko, V. Maroulas, and G. Carlsson, ``Topological convolutional layers for deep learning,'' Journal of Machine Learning Research, vol. 24, no. 59, pp. 1–35, 2023.
- [3] G. Carlsson and R. B. Gabrielsson, ``Topological approaches to deep learning,'' in Topological Data Analysis: The Abel Symposium 2018, pp. 119–146, Springer, 2020.
- [4] K. W. Grant, B. E. Walden, and P. F. Seitz, ``Auditory-visual speech recognition by hearing-impaired subjects: Consonant recognition, sentence recognition, and auditory-visual integration,'' The Journal of the Acoustical Society of America, vol. 103, no. 5, pp. 2677–2690, 1998.
- [5] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, and D. Englund, ``Deep learning with coherent nanophotonic circuits,'' Nature Photonics, vol. 11, no. 7, pp. 441–446, 2017.
- [6] E. W. Healy, S. E. Yoho, Y. Wang, F. Apoux, and D. Wang, ``Speech-cue transmission by an algorithm to increase consonant recognition in noise for hearing-impaired listeners,'' The Journal of the Acoustical Society of America, vol. 136, no. 6, pp. 3325–3336, 2014.
- [7] T. Nazzi and A. Cutler, ``How consonants and vowels shape spoken-language recognition,'' Annual Review of Linguistics, vol. 5, pp. 25–47, 2019.
- [8] D. J. Van Tasell, S. D. Soli, V. M. Kirby, and G. P. Widin, ``Speech waveform envelope cues for consonant recognition,'' The Journal of the Acoustical Society of America, vol. 82, no. 4, pp. 1152–1161, 1987.
- [9] D. Wang and G. Hu, ``Unvoiced speech segregation,'' in 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, vol. 5, pp. V–V, 2006.
- [10] P. Weber, C. J. Champion, S. Houghton, P. Jančovič, and M. Russell, ``Consonant recognition with continuous-state hidden markov models and perceptually-motivated features,'' in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
- [11] G. Carlsson, A. Zomorodian, A. Collins, and L. J. Guibas, ``Persistence barcodes for shapes,'' International Journal of Shape Modeling, vol. 11, no. 02, pp. 149–187, 2005.
- [12] O. Balabanov and M. Granath, ``Unsupervised learning using topological data augmentation,'' Physical Review Research, vol. 2, no. 1, p. 013354, 2020.
- [13] A. Hadadi, C. Guillet, J.-R. Chardonnet, M. Langovoy, Y. Wang, and J. Ovtcharova, ``Prediction of cybersickness in virtual environments using topological data analysis and machine learning,'' Frontiers in Virtual Reality, vol. 3, p. 973236, 2022.
- [14] F. A. Khasawneh, E. Munch, and J. A. Perea, ``Chatter classification in turning using machine learning and topological data analysis,'' IFAC-PapersOnLine, vol. 51, no. 14, pp. 195–200, 2018.
- [15] D. Leykam and D. G. Angelakis, ``Topological data analysis and machine learning,'' Advances in Physics: X, vol. 8, no. 1, 2023.
- [16] G. Muszynski, K. Kashinath, V. Kurlin, and M. Wehner, ``Topological data analysis and machine learning for recognizing atmospheric river patterns in large climate datasets,'' Geoscientific Model Development, vol. 12, no. 2, pp. 613–628, 2019.
- [17] F. Chazal and B. Michel, ``An introduction to topological data analysis: fundamental and practical aspects for data scientists,'' Frontiers in Artificial Intelligence, vol. 4, p. 108, 2021.
- [18] A. Asaad and S. Jassim, ``Topological data analysis for image tampering detection,'' in Digital Forensics and Watermarking: 16th International Workshop, IWDW 2017, Magdeburg, Germany, August 23-25, 2017, Proceedings 16, pp. 136–146, Springer, 2017.
- [19] L. Ver Hoef, H. Adams, E. J. King, and I. Ebert-Uphoff, ``A primer on topological data analysis to support image analysis tasks in environmental science,'' Artificial Intelligence for the Earth Systems, vol. 2, no. 1, p. e220039, 2023.
- [20] G. Carlsson, T. Ishkhanov, V. Silva, and A. Zomorodian, ``On the local behavior of spaces of natural images,'' International Journal of Computer Vision, vol. 76, pp. 1–12, Jan. 2008.
- [21] S. Zeng, F. Graf, C. Hofer, and R. Kwitt, ``Topological attention for time series forecasting,'' Advances in Neural Information Processing Systems, vol. 34, pp. 24871–24882, 2021.
- [22] Y. Umeda, ``Time series classification via topological data analysis,'' Information and Media Technologies, vol. 12, pp. 228–239, 2017.
- [23] M. Saggar, O. Sporns, J. Gonzalez-Castillo, P. A. Bandettini, G. Carlsson, G. Glover, and A. L. Reiss, ``Towards a new approach to reveal dynamical organization of the brain using topological data analysis,'' Nature Communications, vol. 9, no. 1, p. 1399, 2018.
- [24] J. L. Nielson, S. R. Cooper, J. K. Yue, M. D. Sorani, T. Inoue, E. L. Yuh, P. Mukherjee, T. C. Petrossian, J. Paquette, and P. Y. Lum, ``Uncovering precision phenotype-biomarker associations in traumatic brain injury using topological data analysis,'' PloS One, vol. 12, no. 3, p. e0169490, 2017.
- [25] T. K. Dey and S. Mandal, ``Protein classification with improved topological data analysis,'' in 18th International Workshop on Algorithms in Bioinformatics (WABI 2018), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
- [26] A. Martino, A. Rizzi, and F. M. F. Mascioli, ``Supervised approaches for protein function prediction by topological data analysis,'' in 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, IEEE, 2018.
- [27] K. Brown and K. Knudson, ``Nonlinear statistics of human speech data,'' International Journal of Bifurcation and Chaos, vol. 19, pp. 2307–2319, July 2009.
- [28] S. Barbarossa and S. Sardellitti, ``Topological signal processing over simplicial complexes,'' IEEE Transactions on Signal Processing, vol. 68, pp. 2992–3007, 2020.
- [29] E. Tulchinskii, K. Kuznetsov, L. Kushnareva, D. Cherniavskii, S. Barannikov, I. Piontkovskaya, S. Nikolenko, and E. Burnaev, ``Topological data analysis for speech processing,'' 2023.
- [30] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun, ``Measuring and relieving the over-smoothing problem for graph neural networks from the topological view,'' in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 3438–3445, 2020.
- [31] W. Bae, J. Yoo, and J. Chul Ye, ``Beyond deep residual learning for image restoration: Persistent homology-guided manifold simplification,'' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 145–153, 2017.
- [32] C. Hofer, R. Kwitt, M. Niethammer, and A. Uhl, ``Deep learning with topological signatures,'' Advances in Neural Information Processing Systems, vol. 30, 2017.
- [33] E. Gerardin, G. Chételat, M. Chupin, R. Cuingnet, B. Desgranges, H.-S. Kim, M. Niethammer, B. Dubois, S. Lehéricy, and L. Garnero, ``Multidimensional classification of hippocampal shape features discriminates Alzheimer's disease and mild cognitive impairment from normal aging,'' Neuroimage, vol. 47, no. 4, pp. 1476–1486, 2009.
- [34] M. Yang, K. Kpalma, and J. Ronsin, ``A survey of shape feature extraction techniques,'' Pattern Recognition, vol. 15, no. 7, pp. 43–90, 2008.
- [35] Z. Xiong, Z. Tang, X. Chen, X. Zhang, K. Zhang, and C. Ye, ``Research on image retrieval algorithm based on combination of color and shape features,'' Journal of Signal Processing Systems, vol. 93, pp. 139–146, 2021.
- [36] D. Zhang and G. Lu, ``Review of shape representation and description techniques,'' Pattern Recognition, vol. 37, no. 1, pp. 1–19, 2004.
- [37] J. Lee, Introduction to topological manifolds, vol. 202. Springer Science & Business Media, 2010.
- [38] A. Hatcher, Algebraic topology. Cambridge: Cambridge University Press, 2002.
- [39] A. Zomorodian and G. Carlsson, ``Computing persistent homology,'' Discrete & Computational Geometry, vol. 33, no. 2, pp. 249–274, 2005.
- [40] H. Edelsbrunner and J. Harer, ``Persistent homology – a survey,'' Contemporary Mathematics, vol. 453, no. 26, pp. 257–282, 2008.
- [41] R. Ghrist, ``Barcodes: the persistent topology of data,'' Bulletin of the American Mathematical Society, vol. 45, no. 1, pp. 61–75, 2008.
- [42] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, ``Stability of persistence diagrams,'' in Proceedings of the twenty-first annual symposium on Computational geometry, pp. 263–271, 2005.
- [43] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, ``On the surprising behavior of distance metrics in high dimensional space,'' in Database Theory—ICDT 2001: 8th International Conference London, UK, January 4–6, 2001 Proceedings 8, pp. 420–434, Springer, 2001.
- [44] V. Polianskii and F. T. Pokorny, ``Voronoi graph traversal in high dimensions with applications to topological data analysis and piecewise linear interpolation,'' 2020.
- [45] M. B. Kennel, R. Brown, and H. D. Abarbanel, ``Determining embedding dimension for phase-space reconstruction using a geometrical construction,'' Physical Review A, vol. 45, no. 6, p. 3403, 1992.
- [46] B. Lin, ``Topological data analysis in time series: temporal filtration and application to single-cell genomics,'' Algorithms, vol. 15, no. 10, p. 371, 2022.
- [47] S. Emrani, T. Gentimis, and H. Krim, ``Persistent homology of delay embeddings and its application to wheeze detection,'' IEEE Signal Processing Letters, vol. 21, no. 4, pp. 459–463, 2014.
- [48] C. M. Pereira and R. F. de Mello, ``Persistent homology for time series and spatial data clustering,'' Expert Systems with Applications, vol. 42, no. 15-16, pp. 6026–6038, 2015.
- [49] F. A. Khasawneh and E. Munch, ``Chatter detection in turning using persistent homology,'' Mechanical Systems and Signal Processing, vol. 70, pp. 527–541, 2016.
- [50] L. M. Seversky, S. Davis, and M. Berger, ``On time-series topological data analysis: New data and opportunities,'' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 59–67, 2016.
- [51] M. Gidea and Y. Katz, ``Topological data analysis of financial time series: Landscapes of crashes,'' Physica A: Statistical Mechanics and its Applications, vol. 491, pp. 820–834, 2018.
- [52] R. Bellman, ``Dynamic programming,'' Science, vol. 153, no. 3731, pp. 34–37, 1966.
- [53] A. Zomorodian and G. Carlsson, ``Computing persistent homology,'' Discrete & Computational Geometry, vol. 33, pp. 249–274, Feb. 2005.
- [54] N. Milosavljević, D. Morozov, and P. Skraba, ``Zigzag persistent homology in matrix multiplication time,'' in Proceedings of the twenty-seventh Annual Symposium on Computational Geometry, pp. 216–225, 2011.
- [55] P. Bubenik, ``Statistical topological data analysis using persistence landscapes,'' Journal of Machine Learning Research, vol. 16, no. 1, pp. 77–102, 2015.
- [56] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier, ``Persistence images: A stable vector representation of persistent homology,'' Journal of Machine Learning Research, vol. 18, 2017.
- [57] D. Ali, A. Asaad, M. J. Jimenez, V. Nanda, E. Paluzo-Hidalgo, and M. Soriano-Trigueros, ``A survey of vectorization methods in topological data analysis,'' IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–14, 2023.
- [58] J. D. Hamilton, Time series analysis. Princeton University Press, 2020.
- [59] Y.-M. Chung, C.-S. Hu, Y.-L. Lo, and H.-T. Wu, ``A persistent homology approach to heart rate variability analysis with an application to sleep-wake classification,'' Frontiers in Physiology, vol. 12, p. 637684, 2021.
- [60] J. A. Perea, ``Topological time series analysis,'' Notices of the American Mathematical Society, vol. 66, no. 5, pp. 686–694, 2019.
- [61] J. A. Perea and J. Harer, ``Sliding windows and persistence: An application of topological methods to signal analysis,'' Foundations of Computational Mathematics, vol. 15, pp. 799–838, June 2015.
- [62] J. Perea, A. Deckard, S. Haase, and J. Harer, ``SW1PerS: Sliding windows and 1-persistence scoring; discovering periodicity in gene expression time series data,'' BMC bioinformatics, vol. 16, p. 257, Aug. 2015.
- [63] C. Tralie and J. Perea, ``(Quasi) periodicity quantification in video data, using topology,'' SIAM Journal on Imaging Sciences, vol. 11, Apr. 2017.
- [64] H. Gakhar and J. A. Perea, ``Sliding window persistence of quasiperiodic functions,'' Journal of Applied and Computational Topology, 2023.
- [65] NOAA, ``National centers for environmental information,'' 2023. Climate at a Glance: National Time Series, published January 2023, retrieved 28 January 2023.
- [66] H. Kantz and T. Schreiber, Nonlinear time series analysis, vol. 7. Cambridge University Press, 2004.
- [67] Edelsbrunner, Letscher, and Zomorodian, ``Topological persistence and simplification,'' Discrete & Computational Geometry, vol. 28, pp. 511–533, Nov. 2002.
- [68] A. R. Bradlow. Retrieved from https://speechbox.linguistics.northwestern.edu.
- [69] A. R. Bradlow, ``ALLSSTAR: Archive of L1 and L2 scripted and spontaneous transcripts and recordings.'' Retrieved from https://speechbox.linguistics.northwestern.edu/.
- [70] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, ``Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi,'' in Proc. Interspeech 2017, pp. 498–502, 2017.
- [71] P. Boersma and D. Weenink, ``Praat: doing phonetics by computer [computer program].,'' 2023. Version 6.3.09, retrieved 2 March 2023 from http://www.praat.org/.
- [72] C. Tralie, N. Saul, and R. Bar-On, ``Ripser.py: A lean persistent homology library for python,'' The Journal of Open Source Software, vol. 3, p. 925, Sep 2018.
- [73] U. Bauer, ``Ripser: efficient computation of Vietoris-Rips persistence barcodes,'' Journal of Applied and Computational Topology, vol. 5, no. 3, pp. 391–423, 2021.
Acknowledgements
The authors would like to thank Meng Yu for his invaluable mentorship on audio and speech signal processing. The authors would also like to thank Haibao Duan, Houhong Fan, Fuquan Fang, Fengchun Lei, Yanlin Li, Zhi Lü, Jie Wu, Kelin Xia, Jiang Yang, Jin Zhang and Zhen Zhang for helpful discussions and encouragement. This work was partly supported by the National Natural Science Foundation of China grant 12371069.
Author information
Authors and Affiliations
Department of Mathematics, Southern University of Science and Technology, Shenzhen, China
Pingyao Feng, Siheng Yi, Qingrui Qu, Zhiwang Yu & Yifei Zhu
Contributions
Y.Z. planned the project. P.F. and S.Y. constructed the theoretical framework. P.F. designed the sample, built the algorithms, and analysed the data. S.Y. assisted with the algorithms. P.F., S.Y., Q.Q., Z.Y., and Y.Z. wrote the paper and contributed to the discussion.
Corresponding authors
Correspondence to Pingyao Feng or Yifei Zhu.
Supplementary information
1 Related works in TDA
Here we present a review of the literature on two topics: (1) TDA and its applications, encompassing the genesis of TDA, recommended resources, and practical applications; (2) vectorisation of PH, wherein we summarise methods for vectorising PH.
1.1 TDA and its applications
The evolution of TDA is relatively nascent when juxtaposed with other enduring fields, and its applications are still somewhat delimited. The genesis of the concept of invariants of filtered complexes can be traced back to Barannikov in 1994 \citeSSbarannikov; these invariants are nowadays referred to as PD/PB (persistence diagram/barcode). They were conceived with the objective of quantifying the pairing of critical values of a function in the framework of Morse theory. In 1999, Robins \citeSSHistory1 pioneered the concept of ``persistent Betti numbers'' of inverse systems and underscored their stability in Hausdorff distance. The modern incarnation of persistent homology was established in the first decade of the 21st century. Zomorodian, under the tutelage of Edelsbrunner, submitted his doctoral thesis in 2001 \citeSSZomorodian2001ComputingAC, wherein he employed persistence to distinguish between topological noise and inherent features of a space. After that, the term ``persistent homology group'' first appeared in the work by Edelsbrunner et al. \citeSSedel_persistent_diagram in 2002. This seminal work formalised topological methodologies to chronicle the evolution of an expanding complex originating from a point set in $\mathbb{R}^3$, a process they termed topological simplification. The expansion process is known as a filtration. They classified topological modifications based on the lifetime of topological features during filtration and proposed an algorithm to compute this simplification process. Subsequently, in 2005, Carlsson et al. \citeSSpersistent_shape applied persistent homology to generate a barcode as a shape descriptor. Their methodology was able to distinguish between shapes with varying degrees of ``sharp'' features, such as corners.
In the same year, Zomorodian and Carlsson \citeSSPH1 presented an algebraic interpretation of persistent homology and developed a natural algorithm for computing the persistent homology of spaces in any dimension over any field. Cohen-Steiner et al. \citeSSBarcode3 established the stability of persistence diagrams, where robustness is measured by the bottleneck distance between diagrams. In 2008, Carlsson, Singh and Sexton founded Ayasdi, a company that combines mathematics and finance to truly put theory into practice. The inception of TDA may seem complex, as it originates from pure mathematical fields such as Morse theory and algebraic topology. However, the underlying principle remains steadfast: to identify topological features that quantify the shape of the data to a certain degree and that are robust against noise and perturbations.
An abundance of materials is available that offer a thorough understanding of TDA for both specialists and non-specialists. In 2009, Carlsson \citeSScarlsson_topology_2009 wrote an extensive survey on the applications of geometry and topology to the analysis of various types of data. This work introduced topics such as the characteristics of topological methods, persistence, and clusters. More recent work by Carlsson and Vejdemo-Johansson \citeSScarlsson_TDA_application discussed practical case studies of topological methods, such as their applications to image data and time series. For non-specialists seeking to delve into TDA, the introductory article by Chazal and Michel \citeSSintroDS may be more readable. It provides explicit explanations and hands-on guidance on both the theoretical and practical aspects of TDA. Several software tools assist researchers in building case studies on data. The GUDHI library \citeSSGUDHI, an open-source C++ library with a Python interface, includes a comprehensive set of tools involving different complexes and vectorisation tools. Ripser \citeSSBauer2021Ripser, also a C++ library with a Python binding, surpasses GUDHI in terms of computing Vietoris–Rips PD/PB, especially when high-dimensional cases or large quantities of PD/PB are required. TTK \citeSSTTK is both a library and software designed for topological analysis with a focus on scientific visualisation. Other standard libraries include Dionysus (https://mrzv.org/software/dionysus2), PHAT (https://bitbucket.org/phat-code/phat), DIPHA (https://github.com/DIPHA/dipha), and Giotto (https://giotto-ai.github.io/gtda-docs/0.4.0). Additionally, an R interface named TDA \citeSSfasy2015introduction is available for the libraries GUDHI, Dionysus, and PHAT.
The recent proliferation of TDA has established it as an effective instrument in numerous studies. Owing to the characteristics of topological methods \citeSScarlsson_topology_2009, a multitude of applications have been discovered, particularly in the realm of recognition. In the field of biomedicine, Nicolau et al. \citeSSBio4 utilised the topological method Mapper \citeSSMapper to analyse transcriptional data related to breast cancer. This method is used due to its high performance in shape recognition in high dimensions. The book authored by Rabadán and Blumberg \citeSSBio5 provides an introduction to TDA techniques and their specific applications in biology, encompassing topics such as evolutionary processes and cancer genomics. In signal processing, Emrani et al. \citeSSSP1 introduced a topological approach for the analysis of breathing sound signals for the detection of wheezing, which can distinguish abnormal wheeze signals from normal breathing signals due to the periodic patterns within wheezing. Robinson \citeSStopo_sig offers a systematic exploration of the intersection between topology and signal processing. In the context of deep learning, Bae et al. \citeSSDL1 proposed a PH-based deep residual learning algorithm for image restoration tasks. Hofer et al. \citeSSDL2 incorporated topological signatures into deep neural networks to learn unusual structures that are typically challenging for most machine learning techniques. Love et al. \citeSSDL3 extracted the topological features of images and videos by TDA, and input these features into the kernel of the convolutional layer. In their case, manifolds with relationships to the natural image space are used to parameterize image filters, which also parameterize slices in layers of neural networks. For networks, an early application of PH on sensor networks is presented in the work by de Silva and Ghrist \citeSSsensor_Silva.
They applied topological methods to graphs representing the distance estimation between nodes and a proximity sensor. Subsequently, Horak et al. \citeSSPH_CN discussed PH in different networks, observing that persistent topological attributes are related to the robustness of networks and reflect deficiencies in certain connectivity properties of networks. Additionally, Jonsson’s book \citeSSsim_com_graph provides insights on how to construct a simplicial complex from a graph. Given the potential of topological methods, researchers from various fields may be inspired to incorporate these techniques into their own studies, potentially leading to unexpected and exciting discoveries.
1.2 Vectorisation of PH
When executing PH, one typically obtains PD/PB, which is a set of intervals on the extended real line. Indeed, PD/PB can be considered a form of vectorisation of the original data. However, they may not be sufficiently accessible for further applications, such as integration into machine learning algorithms for future model development. Since the intervals live on the extended real line, some may involve $\infty$ as their termination point, which poses challenges for algorithms that cannot handle infinity. This issue can be mitigated by setting a threshold for the maximal lifetime, which is a relatively straightforward solution. However, there are more intrinsic challenges embedded in the vectorisation of PD/PB that are not easily resolved and may pose difficulties for researchers attempting to leverage this powerful tool. For example, the number of intervals in PD/PB is not fixed; sometimes there may be 10, and other times there may be 100. Moreover, a PD is too sparse to feed directly into machine learning algorithms. Researchers might extract the top five longest intervals from the set as a method of vectorisation, or remove intervals with a length less than a certain threshold, or implement distance functions and kernel methods on PD/PB to achieve vectorisation. In this article, vectorisation in TopCap is relatively simple, as we extract the maximal persistence and its corresponding birth time as two topological features to feed into machine learning algorithms.
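The simple vectorisation used by TopCap can be sketched as follows; the function name, the `inf_cutoff` threshold, and its default value are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def top_feature(diagram, inf_cutoff=10.0):
    """Vectorise a persistence diagram as (birth, persistence) of its
    most persistent point. Infinite death times are truncated at
    inf_cutoff, a heuristic threshold (an illustrative choice)."""
    dgm = np.asarray(diagram, dtype=float)
    if dgm.size == 0:
        return np.array([0.0, 0.0])
    deaths = np.minimum(dgm[:, 1], inf_cutoff)  # threshold infinite bars
    pers = deaths - dgm[:, 0]
    i = int(np.argmax(pers))
    return np.array([dgm[i, 0], pers[i]])  # (birth time, max persistence)

# Example: the second interval has the longest lifetime.
print(top_feature([[0.1, 0.4], [0.2, 1.5]]))  # [0.2 1.3]
```

The output is a fixed-length vector regardless of how many intervals the diagram contains, which sidesteps the variable-size problem described above.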
There is no definitive rule to determine that one method of vectorisation is superior to another, as the performance of vectorisation methods largely depends on the data and how it is transformed into a topological space. Nevertheless, there are a great many creative methods for vectorising PH. Persistence Landscapes (PL) \citeSSVec3, developed by Bubenik, is one popular method. Bubenik’s work introduces both theoretical and experimental aspects of PL in a statistical manner. Generally speaking, PL maps PD into a function space that is stable and invertible \citeSSPLProperty. A toolbox \citeSSPLTool is also available for implementing PL. Persistence Image (PI) \citeSSVec4, another vectorisation method developed by Adams et al., stably maps PD to a finite-dimensional vector representation depending on resolution, weight function, and distribution of points in PD. For additional vectorisation methods, one might consider the article by Ali et al. \citeSSVec2, which presents 13 ways to vectorise PD.
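To convey the idea behind PL, here is a from-scratch evaluation of landscape functions on a grid (a didactic sketch, not the toolbox of \citeSSPLTool):

```python
import numpy as np

def landscape(diagram, k, ts):
    """Evaluate the k-th persistence landscape function (k = 1, 2, ...)
    on a grid ts. Each diagram point (b, d) contributes the tent
    function max(0, min(t - b, d - t)); the k-th landscape is the
    k-th largest tent value at each t."""
    dgm = np.asarray(diagram, dtype=float)
    ts = np.asarray(ts, dtype=float)
    tents = np.maximum(
        0.0,
        np.minimum(ts[None, :] - dgm[:, 0, None], dgm[:, 1, None] - ts[None, :]),
    )  # shape: (num_points, num_ts)
    tents = np.sort(tents, axis=0)[::-1]  # largest tent values first
    if k > tents.shape[0]:
        return np.zeros_like(ts)  # fewer points than k: landscape is zero
    return tents[k - 1]

# A single bar (0, 2): the first landscape peaks at t = 1 with value 1.
print(landscape([[0.0, 2.0]], k=1, ts=[0.0, 1.0, 2.0]))  # [0. 1. 0.]
```

Sampling each landscape on a fixed grid again yields a fixed-length vector suitable for standard learning algorithms.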
2 Time-delay embedding
Time-delay embedding (TDE) is also known as sliding window embedding, delay embedding, and delay coordinate embedding. TDE of a real-valued function $f$, with parameters $d$ and $\tau$, is defined to be the vector-valued function
$$SW_{d,\tau} f(t) = \big(f(t), f(t+\tau), \ldots, f(t+d\tau)\big),$$
where $d$ is a positive integer and $\tau$ is a positive real number. The integer $d+1$ is the dimension of the target space for the embedding, $\tau$ is the delay, and their product $d\tau$ is called the window size. Given the Manifold Hypothesis that time series data lie on a manifold $M$, this method reconstructs this topological space from the input time series when $d$ is at least twice the dimension of the hypothesised manifold $M$. The embedding property holds (generically, in a technical sense we omit here) for time series sampled from a trajectory $\varphi \colon \mathbb{R} \to M$ whose image is dense in $M$ via an ``observation'' function $F \colon M \to \mathbb{R}$, i.e., $f = F \circ \varphi$.
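The embedding itself is a few lines of numpy (a minimal sketch; the function name and sample signal are illustrative):

```python
import numpy as np

def sliding_window(f, d, tau, ts):
    """Time-delay embedding SW_{d,tau} f(t) = (f(t), f(t+tau), ..., f(t+d*tau)),
    evaluated at each t in ts; returns an array of shape (len(ts), d + 1)."""
    ts = np.asarray(ts, dtype=float)
    offsets = tau * np.arange(d + 1)  # the d + 1 delayed sample times
    return f(ts[:, None] + offsets[None, :])

# Embed one period of cos(t) into R^{d+1}; the window size is d * tau.
d, tau = 10, 2 * np.pi / 11
cloud = sliding_window(np.cos, d, tau, np.linspace(0, 2 * np.pi, 200))
print(cloud.shape)  # (200, 11)
```

Each row of `cloud` is one window of the signal; the rows together form the point cloud on which persistent homology is computed.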
Section 5 in \citeSSperea_sliding_2015 establishes that the $N$-truncated Fourier series expansion
$$S_N f(t) = \sum_{n=0}^{N} a_n \cos(nLt) + b_n \sin(nLt)$$
of a periodic time series $f$ with period $2\pi/L$ can be reconstructed into a circle, i.e., the image of $SW_{d,\tau} S_N f$ is a topological circle when $d \geq 2N$. What's more, the maximal persistence of the resulting point cloud is the largest when the window size is proportional to the period, i.e.,
$$d\tau = \frac{2\pi k}{L} \cdot \frac{d}{d+1}$$
for some positive integer $k$. To put it briefly, an increase in the dimension of TDE results in a more precise truncation when expanding the Fourier series; the maximal persistence of the point cloud is the largest when the window size is proportional to the period. Furthermore, Proposition 4.2 in \citeSSperea_sliding_2015 states that the difference between $SW_{d,\tau} f$ and $SW_{d,\tau} S_N f$ can be very small.
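The effect of the window size can be checked numerically without computing persistence at all, using the spread of distances to the centroid as a crude roundness proxy (an illustrative sketch, not part of TopCap):

```python
import numpy as np

def spread(cloud):
    """Std of distances to the centroid -- a crude 'roundness' proxy:
    it is ~0 exactly when the point cloud lies on a round circle."""
    centered = cloud - cloud.mean(axis=0)
    return np.linalg.norm(centered, axis=1).std()

def embed_cos(d, tau, n=400):
    """Sliding-window point cloud of cos(t) over one full period."""
    ts = np.linspace(0, 2 * np.pi, n, endpoint=False)
    return np.cos(ts[:, None] + tau * np.arange(d + 1)[None, :])

d = 10
tau_opt = 2 * np.pi / (d + 1)  # window size d*tau = 2*pi * d/(d+1)
tau_bad = 0.1 * tau_opt        # window much shorter than the period
s_opt = spread(embed_cos(d, tau_opt))
s_bad = spread(embed_cos(d, tau_bad))
print(s_opt, s_bad)  # ~0 for the round circle, clearly positive otherwise
```

With the proportional window the embedded curve is a round circle (all points equidistant from the centroid), so its maximal persistence is as large as possible; a much shorter window yields an elongated loop.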
This methodology also proves particularly advantageous in scenarios where the system under investigation exhibits nonlinear dynamics, precluding simplistic or straightforward analysis of the time series data. By embedding the time series, the inherent geometric configuration of the system is exposed, facilitating more profound comprehension and refined analysis.
3 Persistent homology
Topology is a branch of mathematics that studies the properties of geometric objects that remain unchanged under continuous transformations. In other words, it focuses on the intrinsic features of a space that are preserved regardless of its shape or size, and algebraic topology provides a quantitative description of such topological properties.
A simplicial complex is a powerful tool in algebraic topology that enables us to represent a topological space using discrete data. Unlike the original space, which can be challenging to compute and analyse, a simplicial complex provides a combinatorial description that is much more amenable to computation. We can use algebraic techniques to study the properties of a simplicial complex, such as its homology and cohomology groups, which provide information about the topology of the underlying space.
Formally, a simplicial complex $K$ with vertices in a nonempty set $V$ is a collection of nonempty finite subsets of $V$ such that $\sigma \in K$ and $\emptyset \neq \tau \subseteq \sigma$ always imply $\tau \in K$. A set $\sigma \in K$ with $n+1$ elements is called an $n$-simplex of the simplicial complex $K$. For instance, consider $S^1 \vee S^2$, the wedge of a circle and a sphere, as a topological space. It can be approximated by the simplicial complex with six vertices $v_1$, $v_2$, $v_3$, $v_4$, $v_5$, $v_6$. The simplicial complex can be represented as
$$K = \big\{\emptyset \neq \sigma \subseteq \{v_1, v_2, v_3, v_4\} : |\sigma| \leq 3\big\} \cup \big\{\{v_5\}, \{v_6\}, \{v_4, v_5\}, \{v_5, v_6\}, \{v_4, v_6\}\big\},$$
which is a combinatorial avatar for $S^1 \vee S^2$. See Fig. S1.
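In code, such a complex can be represented directly as a collection of vertex sets, and the defining closure condition is easy to verify (an illustrative sketch of the circle-wedge-sphere complex, with vertices labelled 1 to 6):

```python
from itertools import combinations

# S^1 v S^2: the boundary of the tetrahedron on {1, 2, 3, 4} (a sphere)
# wedged at vertex 4 with the triangle loop on {4, 5, 6} (a circle).
sphere = [frozenset(s) for r in (1, 2, 3)
          for s in combinations([1, 2, 3, 4], r)]
circle = [frozenset(s) for s in [(4,), (5,), (6,), (4, 5), (5, 6), (4, 6)]]
K = set(sphere) | set(circle)

def is_complex(K):
    """Every nonempty proper subset of a simplex must itself be a simplex."""
    return all(frozenset(t) in K
               for s in K for r in range(1, len(s))
               for t in combinations(sorted(s), r))

print(len(K), is_complex(K))  # 19 True
```

Removing any face (say an edge of a triangle) breaks the closure condition, so `is_complex` would then return `False`.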

Given a simplicial complex $K$, let $p$ be a prime number and $\mathbb{F}_p$ be the finite field with $p$ elements. Define $C_n(K)$ to be the free $\mathbb{F}_p$-vector space with basis the set of $n$-simplices in $K$. If $\sigma = \{v_0, v_1, \ldots, v_n\}$ is an $n$-simplex, define the boundary of $\sigma$, denoted by $\partial_n \sigma$, to be the alternating formal sum of the $(n-1)$-dimensional faces of $\sigma$ given by
$$\partial_n \sigma = \sum_{i=0}^{n} (-1)^i \sigma^{(i)},$$
where $\sigma^{(i)}$ is defined as the $i$-th $(n-1)$-dimensional face of $\sigma$, which is the $(n-1)$-simplex with vertices $\{v_0, \ldots, v_{i-1}, v_{i+1}, \ldots, v_n\}$. We can extend $\partial_n$ linearly to $C_n(K)$ so that $\partial_n \colon C_n(K) \to C_{n-1}(K)$. The boundary operator satisfies $\partial_n \circ \partial_{n+1} = 0$. Elements in $C_n(K)$ with boundary 0 are called $n$-cycles, denoted by $Z_n(K)$. Elements in $C_n(K)$ which are the image of an element of $C_{n+1}(K)$ under $\partial_{n+1}$ are called $n$-boundaries and denoted by $B_n(K)$. It follows that
$$B_n(K) \subseteq Z_n(K)$$
due to $\partial_n \circ \partial_{n+1} = 0$. Then, define the quotient
$$H_n(K) = Z_n(K)/B_n(K)$$
to be the $n$-th simplicial homology group of $K$ with $\mathbb{F}_p$-coefficients. We call $\dim_{\mathbb{F}_p} H_n(K)$ the $n$-th Betti number, denoted by $\beta_n$, which records the number of $n$-dimensional holes in the topological space.
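Continuing the circle-wedge-sphere example above, its Betti numbers $\beta_0 = \beta_1 = \beta_2 = 1$ can be computed from ranks of boundary matrices over $\mathbb{F}_2$, where the signs $(-1)^i$ vanish (a didactic sketch, not an optimised implementation):

```python
import numpy as np
from itertools import combinations

# Simplices of the wedge S^1 v S^2, vertices labelled 1 to 6.
verts = [(i,) for i in range(1, 7)]
edges = [tuple(e) for e in combinations((1, 2, 3, 4), 2)] + [(4, 5), (5, 6), (4, 6)]
tris = [tuple(t) for t in combinations((1, 2, 3, 4), 3)]
simplices = {0: verts, 1: edges, 2: tris}

def boundary_matrix(n):
    """Matrix of the boundary map C_n -> C_{n-1} over F_2 (signs vanish mod 2)."""
    rows, cols = simplices[n - 1], simplices[n]
    M = np.zeros((len(rows), len(cols)), dtype=np.int64)
    for j, s in enumerate(cols):
        for face in combinations(s, n):  # all (n-1)-dimensional faces of s
            M[rows.index(face), j] = 1
    return M

def rank_f2(M):
    """Gaussian elimination over the field with two elements."""
    M = M.copy() % 2
    rank = 0
    for col in range(M.shape[1]):
        pivot = next((r for r in range(rank, M.shape[0]) if M[r, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(M.shape[0]):
            if r != rank and M[r, col]:
                M[r] = (M[r] + M[rank]) % 2
        rank += 1
    return rank

# beta_n = dim C_n - rank(d_n) - rank(d_{n+1}), with rank(d_0) = rank(d_3) = 0.
ranks = {0: 0, 1: rank_f2(boundary_matrix(1)), 2: rank_f2(boundary_matrix(2)), 3: 0}
betti = [len(simplices[n]) - ranks[n] - ranks[n + 1] for n in range(3)]
print(betti)  # [1, 1, 1]: one component, one loop, one cavity
```

This matches the topology of $S^1 \vee S^2$: one connected component, the circle's loop, and the sphere's cavity.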
For two simplicial complexes $K$ and $L$ and a simplicial map $f \colon K \to L$, the simplicial map induces a linear map $f_* \colon H_n(K) \to H_n(L)$. If two topological spaces $X$ and $Y$ are homotopy equivalent, the above homology groups are isomorphic and the Betti numbers are equal, i.e., $H_n(X) \cong H_n(Y)$ and $\beta_n(X) = \beta_n(Y)$, which means that the persistence diagrams corresponding to the two topological spaces are the same.
Let $X$ be a finite point cloud with metric $d_X$. Define a family of simplicial complexes, called Rips complexes, by
$$R_\epsilon(X) = \big\{\emptyset \neq \sigma \subseteq X : d_X(x, y) \leq \epsilon \text{ for all } x, y \in \sigma\big\}.$$
The family
$$\{R_\epsilon(X)\}_{\epsilon \geq 0}$$
is known as the Rips filtration of $X$. If $\epsilon_1 \leq \epsilon_2$, then $R_{\epsilon_1}(X) \subseteq R_{\epsilon_2}(X)$. So we can get the following sequence
$$R_{\epsilon_1}(X) \subseteq R_{\epsilon_2}(X) \subseteq \cdots \subseteq R_{\epsilon_m}(X)$$
for $\epsilon_1 \leq \epsilon_2 \leq \cdots \leq \epsilon_m$. As $\epsilon$ varies, the topological features in the simplicial complexes vary along the filtration, resulting in the emergence and disappearance of holes.
Record the emergence and disappearance of holes by pairs $(b, d)$, in which $b$ represents the birth time, $d$ represents the death time, and $d - b$ represents the lifetime of the hole. So we get a multiset
$$D = \{(b_i, d_i)\}_{i \in I}.$$
The multiset can be represented as a multiset of points in the coordinate system, called a persistence diagram, or as a graph of bars, called a barcode. We use maximal persistence to refer to the maximal lifetime among all the points in the persistence diagram.
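In degree 0, the barcode of a Rips filtration can be computed directly with a union-find structure: every point is born at $\epsilon = 0$, and a bar dies whenever two components merge (a from-scratch sketch; in practice one would use a PH library):

```python
import numpy as np

def h0_barcode(points):
    """H0 persistence of the Rips filtration: process pairwise distances
    in increasing order; each merge of two components kills one bar."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    parent = list(range(n))

    def find(i):  # union-find root lookup with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    dists = sorted(
        (np.linalg.norm(pts[i] - pts[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    deaths = []
    for eps, i, j in dists:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(float(eps))  # one component dies at this scale
    return [(0.0, d) for d in deaths] + [(0.0, np.inf)]  # last bar never dies

print(h0_barcode(np.array([[0.0], [1.0], [10.0]])))
# [(0.0, 1.0), (0.0, 9.0), (0.0, inf)]
```

The one bar with death time $\infty$ corresponds to the single connected component that survives at every scale; such infinite bars are one reason vectorisation needs a lifetime threshold.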
4 Vowels and consonants
In phonetics, a phone is the smallest basic unit of human speech sound: a short speech segment possessing distinct physical or perceptual properties. Phones are generally classified into two principal classes: vowels and consonants. A vowel is a speech sound pronounced with an open vocal tract, with no significant build-up of air pressure at any point above the glottis and with at least some airflow escaping through the mouth. A consonant, by contrast, is a speech sound articulated with a complete or partial closure of the vocal tract, usually forcing air through a narrow channel in the mouth or nose. Whereas all vowels are pronounced with vibrating vocal cords, consonants can be further categorised into two classes according to whether or not the vocal cords vibrate in the throat during articulation. If the vocal cords vibrate, the consonant is known as a voiced consonant. In contrast, if the vocal cords do not vibrate, the consonant is known as a voiceless consonant. Since vocal cord vibration produces a stable periodic signal of air pressure, voiced consonants tend to have more periodic components than voiceless consonants, which can be detected by the first homology groups in persistent homology as topological information in time series. To visualise vowels, voiced consonants, and voiceless consonants in TDE and PD, see Fig. S2.

References