Principal Subsimplex Analysis
Abstract
Compositional data, also referred to as simplicial data, naturally arise in many scientific domains such as geochemistry, microbiology, and economics. In such domains, obtaining sensible lower-dimensional representations and modes of variation plays an important role. A typical approach to the problem is applying a log-ratio transformation followed by principal component analysis (PCA). However, this approach has several well-known weaknesses: it amplifies variation in minor variables; it can obscure important variation within major elements; it is not directly applicable to data sets containing zeros, and zero imputation methods give highly variable results; it has limited ability to capture linear patterns present in compositional data. In this paper, we propose novel methods that produce nested sequences of simplices of decreasing dimensions analogous to backwards principal component analysis. These nested sequences offer both interpretable lower dimensional representations and linear modes of variation. In addition, our methods are applicable to data sets containing zeros without any modification. We demonstrate our methods on simulated data and on relative abundances of diatom species during the late Pliocene. Supplementary materials and R implementations for this article are available online.
Keywords: Modes of variation; Backwards approach; Nested relations; Compositional data; Paleoceanography
1 Introduction
1.1 Motivation
Compositional data, which are also referred to as simplicial data, are multivariate observations consisting of vectors of proportions. Such data are prevalent in domains where the relative magnitude between variables is the primary concern, including fields such as geochemistry (proportions of constituent elements), microbiology (proportions of species), and economics (proportions of portfolio components). Compositional data vectors are characterized by the constraints that the entries are nonnegative and sum to one. These constraints complicate the application of traditional multivariate analysis techniques, which are designed for Euclidean data, to compositional data sets.
A major challenge with compositional data is identifying meaningful lower dimensional approximations and the corresponding modes of variation (see Section 3.1.4 of Marron and Dryden (2021)). A mode of variation is a set of data objects (e.g. composition vectors) that is one-dimensional in some sense and describes a component of variation. A prototypical example is found in classical Principal Components Analysis (PCA), where modes of variation are the straight lines through the mean spanned by the loading vectors.
While classical PCA is popular, it often fails to provide effective approximations and modes of variation for non-Euclidean data, including compositional data. Figure 1 demonstrates this issue using a compositional data set with three variables. The distribution of this 2-dimensional compositional data set is effectively visualized by a ternary plot, a rotation of the 2-simplex, as shown in Figure 1. The blue dots represent data points, while the green arrows depict the first and second principal component directions. The red dots show the one-dimensional approximations of the two selected blue data points, as indicated by the dashed lines. Notably, these red dots fall outside the original compositional space, meaning that the lower-dimensional representations have negative proportions and no longer represent compositions. This example motivates development of more interpretable and effective methods for lower-dimensional approximation in compositional data analysis.

A popular strategy to address such challenges is applying transformations such as log-ratio transformations (Aitchison, 1982, 1983, 1986) and power transformations (Aitchison, 1986) followed by the usual PCA. Log-ratio transformations bijectively map the interior of the simplex to a Euclidean space of the same dimension, thus enabling application of techniques for Euclidean data. Log-ratio transformations capture well the curvature present in compositional data and are the basis of the well-known logistic normal distribution (Aitchison, 1986). On the other hand, power transforms provide a natural continuum between no transformation and log-ratio transformations. Power transforms are effective in normalizing marginal distributions of data, especially remedying high skewness, which is common in compositional data. However, these transformations can strongly distort the original compositional space and lead to undesirable results in some applications. Log-ratio transformations magnify variation within variables with low average proportions (minor variables), obscuring important signals present in variables with high average proportions (major variables). In addition, log-ratio transformations do not naturally accommodate data points with zero proportions, requiring either a separate treatment of those data points or imputation for zeros. Scealy et al. (2015) proposed a different set of power transformations that map compositional data vectors onto the surface of various manifolds. They then applied classical and robust PCA on the tangent space to the manifold. All power transformations inherit the problem of direct PCA in the spirit of Figure 1, namely that the lower dimensional approximations may leave the object space, and the corresponding modes of variation are not interpretable in a compositional sense.
The above discussion motivates the development of an alternative PCA method that aims to describe variation in major variables while avoiding the limitations of transformations. In other contexts there has also been a growing shift away from transformation-based approaches, with a focus on analyzing compositional data on its original scale of measurement (Xiong et al., 2015, Weistuch et al., 2022, Fiksel et al., 2022, Scealy and Wood, 2023, Firth and Sammut, 2023, Lundborg and Pfister, 2023, Scealy et al., 2024). This article aligns with this view.
1.2 Backwards Principal Component Analysis
This paper proposes applying the backwards Principal Component Analysis approach to compositional data. Backwards PCA was motivated by Jung et al. (2012) and then discussed in detail by Damon and Marron (2014) and Marron and Dryden (2021, Section 8.6). Given an object space that is a rank $d$ manifold, backwards PCA sequentially finds the best-fitting rank $r$ submanifold of the previous rank $r+1$ submanifold, for $r = d-1, \dots, 0$. A major advantage of backwards PCA is that its approximations of any rank naturally remain in the object space, and the nested rank $r$ approximations have natural relationships with the rank $r+1$ approximations, which lead to easily defined $r$th scores. The nested nature of backwards PCA naturally leads to modes of variation, which distinguishes backwards PCA from general dimensionality reduction methods.
Backwards PCA is a unifying framework and requires special attention for each class of object spaces. Backwards PCA has been implemented for a variety of object spaces, including spherical data (Jung et al., 2012), skeletal representations (Pizer et al., 2013), polyspheres (Eltzner et al., 2015), nonnegative data (Zhang et al., 2015), and high-dimensional tori (Eltzner et al., 2018, Zoubouloglou et al., 2023). More discussion on the applications of backwards PCA can be found in Section S1 of the supplementary material.
1.3 Proposed Methods
Using the backwards PCA framework, this paper proposes Principal Subsimplex Analysis (PSA). Given a $d$-dimensional compositional data set, PSA effectively identifies a nested sequence of simplices of dimension $r$, for $r = d-2, d-3, \dots, 0$. The vertices of each $r$-dimensional subsimplex correspond to a partition of the $d$ parts into $r+1$ groups, thus PSA is closely related to amalgamation in the sense defined in Aitchison (1986). We introduce two versions: PSA via Simplices (PSA-S) and PSA via Orthants (PSA-O), which use the Euclidean metric and the spherical metric (arc length on the unit sphere), respectively. It will become clear that PSA-S approximates data through amalgamation, while PSA-O combines parts in a related but distinct manner.
Our proposed method has the following four key features.
1. Targets Variation among Major Variables: Because PSA uses Euclidean or spherical metrics, it has better ability than transformation-based approaches to capture important variation that occurs among major variables. In addition, PSA is more robust to the existence of pure noise variables.
2. Natural Treatment of Zeros: By construction of the method, PSA naturally handles data sets with zeros.
3. Compositional Lower Dimensional Approximations: Each lower dimensional approximation is a simplex whose vertices represent groups of variables, which offers interpretable structure. This approach preserves important properties of compositional data, such as nonnegativity and potential negative correlations between variables. In particular, the 2-dimensional representation provides useful visualization.
4. Linear Modes of Variation: PSA produces interpretable modes of variation. In addition, these modes of variation are linear, which serves as a more interpretable alternative to the non-linear modes of variation produced by log-ratio PCA.
1.4 Related Work
Quinn and Erb (2020) proposed a method, Amalgams, which uses amalgamation for dimensionality reduction of compositional data. For $d$-dimensional compositional data and a prescribed number $m < d$, Amalgams searches for an $m$-dimensional amalgamation that optimizes a given objective function, for example, either the average inter-sample distance or classification accuracy. The most important difference between our method and Amalgams is that our method, based on backwards PCA, produces modes of variation. Our method may also be computationally faster, as Amalgams employs a slow genetic algorithm for estimation.
1.5 Organization
The rest of the paper is organized as follows. Section 2 describes the geometry of simplices and proposes PSA-S. Motivated by scaling challenges, Section 3 proposes PSA-O. Section 4 describes the benchmark methods. The two versions of PSA, together with Euclidean PCA and the two transformation-based PCAs, are applied to simulated data sets in Section 5, illustrating distinctive properties of our new PSA methods. The same methods are compared on relative abundances of diatom species during the late Pliocene in Section 6. The analysis for this article was performed using R Statistical Software (R Core Team, 2024). The R package PSA and the authors' code for this article are available at https://github.com/haneone33/Principal-Subsimplex-Analysis.
2 Principal Subsimplex Analysis via Simplices
Each version of PSA is a special case of backwards PCA that produces a nested sequence of simplices

$$\Delta^{d-1} = S_{d-1} \supset S_{d-2} \supset \cdots \supset S_0 \tag{1}$$

such that $S_r$ is an $r$-dimensional subsimplex of $S_{r+1}$ for each $r = 0, 1, \dots, d-2$, in the sense formally defined in Section 2.1. Each subsimplex $S_r$ is estimated within $S_{r+1}$ so that it best approximates the given data and serves as the rank $r$ approximating subset. Both versions of PSA construct the sequence of simplices based on pairwise aggregation of vertices, but the two versions differ in the way they approximate data points using the subsimplices.
In this section, we focus on the first version of PSA, Principal Subsimplex Analysis via Simplices (PSA-S). Section 2.1 defines subsimplices and develops notation for sequences of subsimplices. Section 2.2 describes how an aggregation of two vertices yields a subsimplex. Section 2.3 handles a scaling problem. The procedure of PSA-S is then given in Section 2.4.
2.1 Notation
In a Euclidean space $\mathbb{R}^d$, a set of vectors $\{v_1, \dots, v_{r+1}\}$ is said to be affinely independent if $v_2 - v_1, \dots, v_{r+1} - v_1$ are linearly independent. An $r$-dimensional simplex is the convex hull of an affinely independent set of vectors, that is,

$$\langle v_1, \dots, v_{r+1} \rangle = \Big\{ \textstyle\sum_{i=1}^{r+1} \lambda_i v_i : \lambda_i \geq 0 \text{ for all } i, \ \sum_{i=1}^{r+1} \lambda_i = 1 \Big\}. \tag{2}$$

In this case, the vectors $v_1, \dots, v_{r+1}$ are called the vertices of the simplex. Given a simplex $S$, a subsimplex of $S$ is a simplex of any dimension that is also a subset of $S$.

The $(d-1)$-dimensional unit simplex, denoted by $\Delta^{d-1}$, is a special case of a simplex in $\mathbb{R}^d$ whose vertices are the unit vectors $e_1, \dots, e_d$, where $e_i$ is a vector of zeros with a unique one in the $i$th coordinate. A vector of compositions of $d$ elements naturally belongs to the $(d-1)$-dimensional unit simplex. Given a compositional data set in $\Delta^{d-1}$, PSA reveals interpretable lower dimensional representations taking the form of subsimplices of $\Delta^{d-1}$ of varying dimensions.
2.2 Pairwise Aggregation of Vertices
The sequence of nested subsimplices (1) has the following explicit representation using vertices,

$$S_r = \langle v^r_1, \dots, v^r_{r+1} \rangle, \quad r = 0, 1, \dots, d-1, \tag{3}$$

where $\{v^r_1, \dots, v^r_{r+1}\}$ is an affinely independent subset of $S_{r+1}$. For notational consistency, we set $S_{d-1} = \Delta^{d-1}$ so that $v^{d-1}_i = e_i$ for $i = 1, \dots, d$.

Given an $r$-simplex $S_r = \langle v^r_1, \dots, v^r_{r+1} \rangle$, an easy and interpretable way to construct $S_{r-1}$ is to form its vertex set by aggregating two of the vertices of $S_r$ at a ratio $\alpha$, while keeping the other vertices. Without loss of generality, suppose the two merging vertices are $v^r_1$ and $v^r_2$. The new vertex is given as the convex combination of these two vertices at an appropriate ratio $\alpha \in [0, 1]$,

$$v^{r-1}_1 = \alpha v^r_1 + (1 - \alpha) v^r_2. \tag{4}$$

This new vertex together with the existing vertices after the reindexing

$$v^{r-1}_i = v^r_{i+1}, \quad i = 2, \dots, r, \tag{5}$$

are affinely independent and give an $(r-1)$-dimensional subsimplex $S_{r-1} = \langle v^{r-1}_1, \dots, v^{r-1}_r \rangle$.
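To make the aggregation concrete, the following short R snippet (an illustrative sketch, not code from the accompanying PSA package) merges the first two vertices of the unit 2-simplex at a ratio $\alpha = 0.4$, as in Equation (4):

```r
# Merge the first two vertices of the unit 2-simplex at ratio alpha = 0.4
alpha <- 0.4
v1 <- c(1, 0, 0); v2 <- c(0, 1, 0)
v_new <- alpha * v1 + (1 - alpha) * v2   # Equation (4): convex combination
v_new                                    # 0.4 0.6 0.0, a vertex of the new subsimplex
```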
2.3 Scaling Functions
Given a $(d-1)$-dimensional simplex, the size of an $r$-dimensional subsimplex varies depending on its vertex set, in terms of its $r$-dimensional volume. In particular, an $r$-dimensional subsimplex of the unit $(d-1)$-simplex is smaller than the unit $r$-simplex unless the vertices of the subsimplex are chosen from the vertices of $\Delta^{d-1}$. This raises a challenge in terms of variation explained by the Residual Sum of Squared scores (RSS). This scale problem is handled by correctly identifying an $r$-simplex with the unit $r$-simplex.

An $r$-simplex $S_r = \langle v_1, \dots, v_{r+1} \rangle$ is naturally identified with the unit $r$-simplex $\Delta^r$ by matching the vertices of $S_r$ to those of $\Delta^r$. We define the scaling function $h_r : S_r \to \Delta^r$ by

$$h_r\Big( \textstyle\sum_{i=1}^{r+1} \lambda_i v_i \Big) = (\lambda_1, \dots, \lambda_{r+1}). \tag{6}$$

Given a sample $x = \sum_{i=1}^{r+1} \lambda_i v_i \in S_r$, the proportion associated with vertex $v_i$ is given by $\lambda_i$. In this sense, $h_r(x)$ can be interpreted as writing $x$ in a new coordinate system with basis $\{v_1, \dots, v_{r+1}\}$.
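Concretely, the coordinates $(\lambda_1, \dots, \lambda_{r+1})$ can be recovered numerically by solving the linear system defined by the vertices. The R sketch below illustrates this (the function name `scaling_h` is ours, not part of the PSA package interface):

```r
# Barycentric coordinates of x in the subsimplex with vertex matrix V,
# whose columns are the affinely independent vertices v_1, ..., v_{r+1}.
# Since x = V %*% lambda exactly, a least squares solve recovers lambda.
scaling_h <- function(x, V) {
  lambda <- as.vector(qr.solve(V, x))
  stopifnot(abs(sum(lambda) - 1) < 1e-8)  # coordinates of a simplex point sum to 1
  lambda
}

V <- cbind(c(0.4, 0.6, 0), c(0, 0, 1))   # vertices of a 1-dimensional subsimplex
x <- 0.3 * V[, 1] + 0.7 * V[, 2]         # a point on that subsimplex
scaling_h(x, V)                          # 0.3 0.7, its image under the scaling function
```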
2.4 Principal Subsimplex Analysis via Simplices (PSA-S)
To describe the PSA-S procedure, assume we have a $d$-dimensional compositional data set $x_1, \dots, x_n$ in $\Delta^{d-1}$. We denote by $\hat{x}^r_j$ the rank $r$ approximation of $x_j$, which is a member of $S_r$, for $r = 0, \dots, d-1$ and $j = 1, \dots, n$. For notational consistency, we set $\hat{x}^{d-1}_j = x_j$.

The subsimplices in (3) are constructed inductively through repeated aggregation of vertices. Suppose we have obtained the rank $r$ subsimplex $S_r = \langle v^r_1, \dots, v^r_{r+1} \rangle$ and the approximations $\hat{x}^r_1, \dots, \hat{x}^r_n$. Further suppose that the approximating subsimplex $S_{r-1}$ is formed by combining $v^r_1$ and $v^r_2$ at a ratio $\alpha$. (The first two vertices are chosen for notational convenience.) PSA-S approximates a data point in $S_r$ in a mass preserving way, as illustrated in Figure 2. A point $x = \sum_{i=1}^{r+1} \lambda_i v^r_i$ in $S_r$ is mapped onto $S_{r-1}$ along the line segment parallel to the edge connecting $v^r_1$ and $v^r_2$. This mapping preserves the weights $\lambda_3, \dots, \lambda_{r+1}$ and only re-distributes the first two weights according to the ratio $\alpha$. In other words, the rank $r-1$ approximation is given by the map $\pi_{r-1} : S_r \to S_{r-1}$,

$$\pi_{r-1}\Big( \textstyle\sum_{i=1}^{r+1} \lambda_i v^r_i \Big) = (\lambda_1 + \lambda_2)\, v^{r-1}_1 + \textstyle\sum_{i=3}^{r+1} \lambda_i\, v^{r-1}_{i-1}. \tag{7}$$

Thus the rank $r-1$ approximation of $x_j$ is $\hat{x}^{r-1}_j = \pi_{r-1}(\hat{x}^r_j)$. Notably, the mapping is well defined for data points with zero proportions.


The $r$th score of a point $x \in S_r$ is the signed distance from $x$ to $\pi_{r-1}(x)$, where $S_r$ is rescaled to have unit size through $h_r$:

$$z_r(x) = \pm \big\| h_r(x) - h_r\big(\pi_{r-1}(x)\big) \big\|.$$

To arrive at an analogue of modes of variation in PCA, define the $r$th loading vector to be the difference of the two merged vertices, i.e. $v^r_1 - v^r_2$. This loading vector indicates the direction pointing from each $\pi_{r-1}(x)$ to $x$. The score represents the amount of the movement in that direction because, writing $x = \sum_{i=1}^{r+1} \lambda_i v^r_i$,

$$x - \pi_{r-1}(x) = \big( \lambda_1 - \alpha(\lambda_1 + \lambda_2) \big)\, (v^r_1 - v^r_2).$$

As described in Section S2 of the supplementary material, the corresponding mode of variation is a line segment parallel to $v^r_1 - v^r_2$, and thus the loading vector is indeed the direction of the mode of variation. A more detailed investigation of modes of variation for PSA and backwards PCA is found in the same section of the supplementary material.

The merged vertices and the ratio $\alpha$ are chosen so that they minimize the Residual Sum of Squared scores (RSS), $\sum_{j=1}^n z_r(\hat{x}^r_j)^2$. Given the vertices $v^r_1$ and $v^r_2$ to be merged, the optimal $\alpha$ has a closed-form solution,

$$\hat{\alpha} = \frac{\sum_{j=1}^{n} \lambda_{1j} (\lambda_{1j} + \lambda_{2j})}{\sum_{j=1}^{n} (\lambda_{1j} + \lambda_{2j})^2},$$

where $\lambda_{ij}$ denotes the $i$th barycentric coordinate of $\hat{x}^r_j$ with respect to the vertices of $S_r$. The pair of vertices is chosen so that the RSS is minimized at the optimal $\alpha$.
The procedure of PSA-S is summarized in Algorithm 1 of the Supplementary material.
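The following R sketch implements a single PSA-S merge step under the notation above, using the closed-form $\hat{\alpha}$. It is a minimal illustration of the ideas in Algorithm 1, not the package implementation, and the helper names are ours:

```r
# One PSA-S merge step. L is the n x (r+1) matrix of barycentric coordinates
# of the current rank-r approximations. Returns the best pair of vertices,
# the optimal ratio alpha, the minimized RSS, and the updated coordinates.
psa_s_step <- function(L) {
  k <- ncol(L)
  best <- list(rss = Inf)
  for (a in 1:(k - 1)) for (b in (a + 1):k) {
    s <- L[, a] + L[, b]                                   # combined weight of the pair
    alpha <- if (sum(s^2) > 0) sum(L[, a] * s) / sum(s^2) else 0.5
    rss <- 2 * sum((L[, a] - alpha * s)^2)                 # residual sum of squared scores
    if (rss < best$rss) best <- list(pair = c(a, b), alpha = alpha, rss = rss)
  }
  a <- best$pair[1]; b <- best$pair[2]
  # Equation (7): mass preserving update; the merged weight goes to the new vertex
  best$L_new <- cbind(L[, a] + L[, b], L[, -c(a, b), drop = FALSE])
  best
}
```

Iterating `psa_s_step` from the raw data matrix (whose rows already are barycentric coordinates with respect to $e_1, \dots, e_d$) down to a single column traces out the full nested sequence (3).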
3 Principal Subsimplex Analysis via Orthants
PSA-S uses scaling functions to handle the varying size of subsimplices. This rescaling can be avoided by representing unit simplices as unit nonnegative orthants, which are closely related to simplices through the invertible projection map described below. PSA-O employs a collection of nonnegative orthants, playing a role parallel to that of the collection of subsimplices in PSA-S. Because two nonnegative orthants of the same dimension have the same size (that is, they are isometric Riemannian manifolds with boundary), the scales of the approximating subsets in the sequence naturally remain the same.
3.1 Nonnegative Orthants
Let $\{u_1, \dots, u_{r+1}\}$ be a set of orthonormal vectors in $\mathbb{R}^d$. The $r$-dimensional nonnegative orthant with the vertex set $\{u_1, \dots, u_{r+1}\}$ is the set of linear combinations of $u_1, \dots, u_{r+1}$ with nonnegative coefficients whose squared sum is one. In other words,

$$[u_1, \dots, u_{r+1}] = \Big\{ \textstyle\sum_{i=1}^{r+1} \mu_i u_i : \mu_i \geq 0 \text{ for all } i, \ \sum_{i=1}^{r+1} \mu_i^2 = 1 \Big\}.$$

Note that an $r$-nonnegative orthant is a subset of an $r$-dimensional unit sphere. Important subsets of an $r$-nonnegative orthant $O$ are suborthants, which are subsets of $O$ that are $s$-nonnegative orthants for some $s \leq r$. The unit $(d-1)$-nonnegative orthant, denoted by $\mathbb{O}^{d-1}$, is a special case of a nonnegative orthant in $\mathbb{R}^d$ whose vertices are $e_1, \dots, e_d$.
For each $r$-nonnegative orthant $[u_1, \dots, u_{r+1}]$ there exists a corresponding $r$-simplex $\langle u_1, \dots, u_{r+1} \rangle$ with the same vertex set. Each nonnegative orthant and its corresponding simplex are intimately related through the invertible projection map $\rho : \langle u_1, \dots, u_{r+1} \rangle \to [u_1, \dots, u_{r+1}]$,

$$\rho(x) = \frac{x}{\|x\|}.$$

The inverse correspondence is

$$\rho^{-1}\Big( \textstyle\sum_{i=1}^{r+1} \mu_i u_i \Big) = \frac{\sum_{i=1}^{r+1} \mu_i u_i}{\sum_{i=1}^{r+1} \mu_i}.$$

This projection is equivalent to transporting $x$ along the line from the origin through $x$ until it intersects with the unit sphere.
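For the unit simplex and the unit orthant, both maps reduce to normalization, as in this small R sketch (function names are ours):

```r
# Projection between the unit simplex and the unit nonnegative orthant
to_orthant <- function(x) x / sqrt(sum(x^2))  # rho: scale to unit Euclidean norm
to_simplex <- function(w) w / sum(w)          # rho^{-1}: scale to unit sum

x <- c(0.2, 0.3, 0.5)
w <- to_orthant(x)            # nonnegative entries, on the unit sphere
all.equal(to_simplex(w), x)   # TRUE: the maps are mutually inverse
```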
The (geodesic) distance along the surface of the sphere between two points $w_1, w_2$ in an $r$-nonnegative orthant is the length of the shorter arc that connects the two points, which is a part of a great circle. The distance function is given by $d(w_1, w_2) = \arccos(w_1 \cdot w_2)$. Equipped with this geodesic distance, any two $r$-nonnegative orthants are isometric.
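In R this distance can be computed as below (a sketch; the clipping guards against inner products rounding slightly above one):

```r
# Geodesic (arc length) distance between two points of a nonnegative orthant
geodesic_dist <- function(w1, w2) acos(pmin(1, sum(w1 * w2)))
geodesic_dist(c(1, 0, 0), c(0, 1, 0))   # pi/2, a quarter of a great circle
```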
3.2 Principal Suborthant Analysis (PSA-O)
PSA-O proceeds in a manner similar to that of PSA-S, but the fundamental low-rank approximation happens in the unit nonnegative orthant $\mathbb{O}^{d-1}$. PSA-O has three steps: (i) project the data points $x_1, \dots, x_n$ in $\Delta^{d-1}$ to $w_1, \dots, w_n$ in $\mathbb{O}^{d-1}$ through the projection $\rho$, (ii) find a nested sequence of nonnegative orthants $O_r$ and lower dimensional approximations $\hat{w}^r_j$ for $r = d-1, \dots, 0$, and (iii) project the $O_r$ and $\hat{w}^r_j$ in $\mathbb{O}^{d-1}$ back onto $\Delta^{d-1}$ through the inverse of the projection map, $\rho^{-1}$. As mentioned earlier, the use of nonnegative orthants naturally avoids the scaling problem because all nonnegative orthants of the same dimension have the same size. We will use $w$ to indicate points on nonnegative orthants, and continue to use $x$ and $\hat{x}$ for points on simplices.

The above three steps are detailed as follows. Firstly, we project each $x_j$ onto the corresponding point $w_j = \rho(x_j)$ on $\mathbb{O}^{d-1}$ and initialize $O_{d-1} = \mathbb{O}^{d-1} = [u^{d-1}_1, \dots, u^{d-1}_d]$ with $u^{d-1}_i = e_i$, and $\hat{w}^{d-1}_j = w_j$ for $j = 1, \dots, n$.

Next, similar to the PSA-S algorithm, PSA-O iteratively combines two vertices at some ratio $\alpha$. Suppose the rank $r$ approximating orthant $O_r = [u^r_1, \dots, u^r_{r+1}]$ has been computed and $u^r_1$ and $u^r_2$ are merged to give the new vertex

$$u^{r-1}_1 = \frac{\alpha u^r_1 + (1 - \alpha) u^r_2}{\big\| \alpha u^r_1 + (1 - \alpha) u^r_2 \big\|}. \tag{8}$$

This new vertex and the remaining vertices $u^{r-1}_i = u^r_{i+1}$ for $i = 2, \dots, r$ form an orthonormal set and thus define an $(r-1)$-nonnegative suborthant $O_{r-1} = [u^{r-1}_1, \dots, u^{r-1}_r]$.


The lower dimensional approximation $\hat{w}^{r-1}$ of a point $\hat{w}^r \in O_r$ is the point in $O_{r-1}$ that is closest to $\hat{w}^r$ with respect to the geodesic distance (shown as an arc in Figure 3(b)) on $O_r$, and is given by the normalized orthogonal projection onto the span of the vertices of $O_{r-1}$,

$$\hat{w}^{r-1} = \frac{\sum_{i=1}^{r} \langle \hat{w}^r, u^{r-1}_i \rangle\, u^{r-1}_i}{\big\| \sum_{i=1}^{r} \langle \hat{w}^r, u^{r-1}_i \rangle\, u^{r-1}_i \big\|}.$$

This projection is equivalent to transporting $\hat{w}^r$ onto the suborthant along the great circle that is perpendicular to the suborthant (Figure 3). The score is the signed geodesic distance,

$$\pm\, d\big(\hat{w}^r, \hat{w}^{r-1}\big) = \pm \arccos\big( \hat{w}^r \cdot \hat{w}^{r-1} \big).$$

Again, modes of variation are based on the $r$th loading vector, defined as the difference of the two merged vertices mapped back onto the simplex. Note that although the difference $\hat{w}^r - \hat{w}^{r-1}$ is not exactly parallel to the loading vector, as it is in the case of PSA-S, the loading vectors still serve as effective representatives of the directions.
Lastly, the approximating subsets $O_r$ and the lower dimensional approximations $\hat{w}^r_j$ for $r = d-1, \dots, 0$ are mapped onto $\Delta^{d-1}$ through the inverse of the projection map.

The search for the optimal pair of vertices and the ratio $\alpha$ is implemented by a grid search for $\alpha$, for each of the $r(r+1)/2$ candidate pairs of vertices. Note that the grid search is fast and effective because, for each pair of vertices, $\alpha$ is a one-dimensional parameter in the bounded interval $[0, 1]$.
The procedure of PSA-O is summarized in Algorithm 2 of the Supplementary material.
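The sketch below illustrates one PSA-O merge step in R under the notation above. It is again our own minimal illustration of the ideas in Algorithm 2, not the package code, and boundary cases where the spherical projection leaves the orthant are ignored:

```r
# One PSA-O merge step. W is the n x d matrix of current approximations
# (unit rows); U is the d x (r+1) matrix of orthonormal vertex columns.
psa_o_step <- function(W, U, n_grid = 101) {
  k <- ncol(U)
  best <- list(rss = Inf)
  for (a in 1:(k - 1)) for (b in (a + 1):k) {
    for (alpha in seq(0, 1, length.out = n_grid)) {
      u_new <- alpha * U[, a] + (1 - alpha) * U[, b]
      u_new <- u_new / sqrt(sum(u_new^2))               # Equation (8)
      V <- cbind(u_new, U[, -c(a, b), drop = FALSE])    # suborthant vertices
      C <- W %*% V                                      # coefficients <w, u_i>
      W_hat <- (C %*% t(V)) / sqrt(rowSums(C^2))        # normalized projection
      rss <- sum(acos(pmin(1, rowSums(W * W_hat)))^2)   # squared geodesic scores
      if (rss < best$rss)
        best <- list(pair = c(a, b), alpha = alpha, rss = rss,
                     U_new = V, W_new = W_hat)
    }
  }
  best
}
```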
4 Benchmark Methods
Both the simulation study in Section 5 and the applications to real data in Section 6 compare the two versions of PSA with three common benchmark approaches: Euclidean PCA, power transform PCA, and log-ratio PCA. To describe the transformations, we interchangeably use $x$ and $(x_1, \dots, x_d)$ to denote a vector in $\Delta^{d-1}$.
Power transform PCA starts with the transformation $x \mapsto x^a = (x_1^a, \dots, x_d^a)$ for some parameter $a > 0$. In this paper, $a = 1/2$ was used for all data sets for simplicity. For improved analysis, $a$ can be selected to maximize the Gaussian likelihood of the transformed data.
For log-ratio PCA, the centered log-ratio transformation was used, where

$$\operatorname{clr}(x) = \Big( \log\frac{x_1}{g(x)}, \dots, \log\frac{x_d}{g(x)} \Big), \tag{9}$$

and $g(x) = (x_1 x_2 \cdots x_d)^{1/d}$ is the geometric mean of the entries of $x$. For the map to be defined, zeros in the data set should be removed or replaced. In this paper, only for log-ratio PCA, zeros were replaced by half of the overall minimum nonzero value of the data.
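For reference, the following R sketch shows the three benchmark decompositions as described above, up to details of the individual data sets (`prcomp` is base R; the zero replacement mirrors the rule just stated, and the function name `benchmark_pcas` is ours):

```r
clr <- function(x) log(x) - mean(log(x))   # centered log-ratio, Equation (9)

benchmark_pcas <- function(X) {            # rows of X are compositions
  eucl  <- prcomp(X)                       # Euclidean PCA
  power <- prcomp(sqrt(X))                 # power transform PCA with a = 1/2
  X0 <- X
  X0[X0 == 0] <- min(X[X > 0]) / 2         # replace zeros (log-ratio PCA only)
  X0 <- X0 / rowSums(X0)                   # renormalize to unit sum
  logratio <- prcomp(t(apply(X0, 1, clr)))
  list(euclidean = eucl, power = power, logratio = logratio)
}
```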
5 Simulation Studies
In this section, the behavior of the two versions of PSA and the three benchmark methods is compared in two toy examples.
The first simulated data set is 2-dimensional, shown using a ternary plot in the top left panel of Figure 4. The data set consists of four clusters. Each cluster contains either 5 or 10 data points that are randomly drawn around its cluster center following an isotropic Gaussian distribution. To ensure the vectors lie on the unit 2-simplex, negative entries were reset to zero and the resulting vectors were renormalized to have unit sum. Consequently, 5.5% of the entries of the data set were zero. The unbalanced size of the clusters is intended to contrast PSA-S and PSA-O.
Figure 4 illustrates the lower dimensional representations produced by the methods. The rank 1 approximating subsets (red line segments) demonstrate the fact that the approximating subsets of PSA are subsimplices. In addition, the first modes of variation, shown as red lines, are linear in the simplex. PCA also provides linear modes of variation; however, the rank 1 approximations of PCA leave the simplex, which is a substantial drawback of PCA. The power transform PCA plot suggests that the power transformation distorts the original data, and again the approximations leave the simplex. Lastly, log-ratio PCA provides compositional lower dimensional representations, but the log transform results in curved modes of variation when mapped back onto the simplex. Note that in the first mode of variation of log-ratio PCA, the green points are so spread out that they no longer form a clear cluster.

The scores scatter plot matrices are shown in Figure 5 (a). The off-diagonal panels are the scatter plots of pairs of scores, and the diagonal panels are the density plots for the scores of each rank, colored by cluster. In each of the matrices, the same axis limits were used for both the x-axis and the y-axis in all panels to enhance comparison of scales. The corresponding loading vectors (reflecting the impact of each feature on the mode of variation) are shown in Figure 5 (b).


The first column of Figure 5 (b) suggests that the first mode of variation for PSA-S occurs between V2 and a combination of V1 and V3. The corresponding panel of Figure 5 (a) confirms that a large positive first score is associated with high V2 content (green and purple clusters) and a large negative first score is associated with high V1 or V3 content (red and cyan clusters). On the other hand, the lower panel of the loading plots suggests that the second mode of variation occurs between V1 and V3, as confirmed in the scores plot. PSA-O provides a similar result in this example. The PCA panels of Figure 5 allow a similar interpretation but provide a rotated view with a different emphasis of the clusters. The power transform PCA shows a similar result to PCA with noisier scores. Interpretation of the scores of log-ratio PCA is more involved due to the curved modes of variation. A large positive first score is associated with the points near V1 (cyan cluster) and a large negative first score is associated with the points near V3 (red cluster). However, a medium level of the first score is associated with the points near V2 (green and purple clusters). We remark that this unexpected association does not happen for the other methods. In addition, the clear separation of the clusters is significantly blurred in the log-ratio scores plot.
To compare how the five different methods handle low-proportion pure noise variables, we concatenated each vector of simulated data with three random numbers that were independently drawn from a mean-zero distribution with small variance. Again, negative values were reset to zero and the vectors were renormalized to have unit sum. Figure 6 shows the first three scores for each of the methods. The distributions of the first two scores of PSA-S, PSA-O, PCA, and power transform PCA are similar to those in Figure 5, implying that these methods are resilient to the added noise variables. The scores of the log-ratio PCA do not show strong differences between the clusters, suggesting that the first three modes of variation for log-ratio PCA were sensitive to the pure noise variables.

In summary, PSA-S and PSA-O provide clearer visual separation of this type of cluster that lives near the edge of the simplex than power transform PCA or log-ratio PCA. Furthermore, PSA-S and PSA-O are more resistant to pure noise variables. Conventional PCA gives similar performance, but at the cost of low dimensional approximations that are not in a subsimplex, and thus no longer compositions.
6 Application to Relative Abundance of Diatom Species
Taylor-Silva and Riesselman, (2018) studied historic relative abundance of diatom species to explore Southern Ocean conditions during the late Pliocene, which had global temperatures similar to what may occur with global warming. The diatoms were observed between 89.84mbsf and 56.53mbsf (meters below seafloor) in a drill core from the Antarctic margins. Across 71 different depth samples, 61 different diatom taxa were identified. The dataset is available at https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/2017PA003225 as Table S3.
The distribution of the diatom species is shown in a parallel coordinate plot (Figure 7) with colors representing the depth of the sample. The diatom data set contains a large proportion of zero relative abundances (46% of entries). For the log-ratio PCA only, zeros were replaced by half of 0.0026, the overall minimum nonzero value of the data set, and then each measurement was renormalized to have unit sum.
Taylor-Silva and Riesselman (2018) noticed the following major features of the compositions: (1) starting from the deepest measurements, the relative abundance of sea-ice preferring diatoms increased until (2) a very short warm period called the KM3 marine isotope stage (70.34mbsf - 69.44mbsf) that contained a high abundance of warm-water diatoms, and (3) after KM3, the proportion of warm-water diatoms plummeted, there was a high proportion of sea-ice preferring diatoms, and some open-ocean taxa progressively declined. A measurement with extremely low total abundance of diatoms (Taylor-Silva and Riesselman, 2018) forms a clear outlier directly after the KM3 period (post-KM3, 67.03mbsf) with an extremely high proportion of A. ingens. The high abundance of warm-water diatoms during KM3 suggested a dramatic difference in ocean currents, and because the KM3 stage coincided with a temporary elevation of atmospheric CO2 concentration to post-industrial levels, Taylor-Silva and Riesselman (2018) inferred that a temporary elevation in atmospheric CO2 was sufficient to trigger a substantial climate response.

Figure 8 shows the scatter plots of the first two scores for each of the five methods. Color change along each of the x-axes suggests that all first modes of variation are associated with depth. The unusual nature of the KM3 compositions was not highlighted by any of the methods; however, all but PSA-S indicate two distinct phases separated by KM3. Notably, the PSA-O scores are aligned into arms that are close to parallel to each axis and separated by the KM3 period, which provides the clearest and most interpretable modes of variation of all five methods. The PSA-S scores are similar to those of PSA-O, but the transition between the two phases does not align with KM3, and the second mode of variation is confounded with the post-KM3 outlier. Indeed, the loading vectors of PSA-S shown in the first row of Figure S3 suggest that while the first loading vector of PSA-S is similar to that of PSA-O, the second loading vector of PSA-S is a mixture of the second and the fourth loading vectors of PSA-O. The scores scatter plot of PCA in Figure 8 shows a similar pattern to PSA-O but is rotated compared to the axes, and the scores appear to contain more noise, making the interpretation of scores in connection with loadings more difficult. The scores plots of the power transform PCA and the log-ratio PCA in Figure 8 are similar to that of PCA but appear even more corrupted by noise, especially for the log-ratio PCA.
The following discussion on the scores plot with a particular emphasis on PSA-O illustrates the strong interpretability of modes of variation and lower dimensional representations obtained from PSA.

Up until the end of KM3, the compositions moved approximately according to the first mode of variation of PSA-O. This first mode of variation has large positive loadings (Figure 9) for several open-ocean diatoms (F. barronii, R. antarctica, D. antarcticus, R. naviculoides), large negative loadings for two sea-ice affinity diatoms (F. sublinearis, A. karstenii), and large loadings for a number of other taxa. As the scores of the first mode of variation are decreasing over time, this loading means that there was increasing sea ice. This confirms observation (1) above by Taylor-Silva and Riesselman (2018), although they also noted this change towards more sea ice was happening concurrently with changes to a number of other taxa unrelated to ocean temperatures. After KM3, the compositions maintained very low scores for the first mode of variation, which is consistent with their observation (3), although the progressive declines in open-ocean taxa were not highlighted by the PSA-O scores or loadings.


The second mode of variation for PSA-O, shown in the second panel of Figure 9 (b), has a large positive loading for T. insigna (no particular association with ocean temperatures) and only small loadings for the other taxa, which suggests that after KM3, the largest changes in diatom compositions corresponded to increases in T. insigna over time. The scores for the third mode of variation for PSA-O jump from negative to large positive values after KM3 and the outlier, and then decrease over time. This third mode has a large positive loading for the sea-ice affinity diatom F. bohatyi; however, there are also several much smaller negative loadings for sea-ice affinity taxa. Combined with the second mode of variation, this suggests that there was a rapid jump in F. bohatyi after KM3 and the outlier, after which F. bohatyi had a progressive decline and T. insigna had a progressive increase. The scores for the fourth mode of variation highlight the outlier at 67.03mbsf, and the loadings suggest that the outlier had unusually low proportions of T. inura and Chaetoceros spp., and unusually high proportions of A. karstenii and Denticulopsis spp. Both are consistent with Figure 7, with the unusually low proportions difficult to notice there. The loadings further suggest the outlier has unusually high proportions of T. torokina, A. ingens, and E. antarctica, which broadly aligns with Figure 7.
Lastly, the compositional rank 2 representation produced by PSA-O has an effective visualization as a ternary plot (Figure 10). More directly apparent than in the scores and loading plots (Figures 8 and 9) and in Taylor-Silva and Riesselman (2018, Fig. 3) is the central role of T. insigna (the major part of vertex V3), with nearly zero relative abundance prior to KM3 and dominating behavior of the compositions after KM3. The trend over time from open-ocean diatoms (largely represented by V2) to sea-ice affinity diatoms (largely represented by V1) is also clear.


7 Conclusion
This article proposed a new approach for the decomposition of compositional data, applying a backwards PCA framework. At each rank $r$, the proposed optimization can be effectively performed by either a closed form expression or a one-dimensional grid search, for each of the $r(r+1)/2$ candidate pairs of vertices. The proposed approach provides lower dimensional representations that are compositions, and the corresponding modes of variation have compositional interpretations in the sense that the proportion of one group of species increases while that of another group of species decreases. Furthermore, the modes of variation are linear. These properties were illustrated via simulated data sets and an analysis of diatom species relative abundance, for which our orthant-based method, PSA-O, provided the clearest modes of variation and an informative rank 2 approximation.
References
- Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2):139–160.
- Aitchison, J. (1983). Principal component analysis of compositional data. Biometrika, 70(1):57–65.
- Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman and Hall.
- Damon, J. and Marron, J. S. (2014). Backwards principal component analysis and principal nested relations. Journal of Mathematical Imaging and Vision, 50(1-2):107–114.
- Eltzner, B., Huckemann, S., and Mardia, K. V. (2018). Torus principal component analysis with applications to RNA structure. Annals of Applied Statistics.
- Eltzner, B., Jung, S., and Huckemann, S. (2015). Dimension reduction on polyspheres with application to skeletal representations. In Geometric Science of Information: Second International Conference, GSI 2015, Palaiseau, France, October 28-30, 2015, Proceedings 2, pages 22–29. Springer.
- Fiksel, J., Zeger, S., and Datta, A. (2022). A transformation-free linear regression for compositional outcomes and predictors. Biometrics, 78:974–987.
- Firth, D. and Sammut, F. (2023). Analysis of composition on the original scale of measurement. https://arxiv.org/abs/2312.10548.
- Fletcher, P. T., Lu, C., Pizer, S. M., and Joshi, S. (2004). Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging, 23(8):995–1005.
- Huckemann, S. and Ziezold, H. (2006). Principal component analysis for Riemannian manifolds, with an application to triangular shape spaces. Advances in Applied Probability, 38(2):299–319.
- Jung, S., Dryden, I. L., and Marron, J. S. (2012). Analysis of principal nested spheres. Biometrika, 99(3):551–568.
- Lundborg, A. R. and Pfister, N. (2023). Perturbation-based analysis of compositional data. https://arxiv.org/abs/2311.18501.
- Marron, J. S. and Dryden, I. L. (2021). Object Oriented Data Analysis. CRC Press.
- Pizer, S. M., Jung, S., Goswami, D., Vicory, J., Zhao, X., Chaudhuri, R., Damon, J. N., Huckemann, S., and Marron, J. S. (2013). Nested sphere statistics of skeletal models. In Innovations for Shape Analysis: Models and Algorithms, pages 93–115.
- Quinn, T. P. and Erb, I. (2020). Amalgams: data-driven amalgamation for the dimensionality reduction of compositional data. NAR Genomics and Bioinformatics, 2(4):lqaa076.
- R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
- Scealy, J. L., de Caritat, P., Grunsky, E. C., Tsagris, M. T., and Welsh, A. H. (2015). Robust principal component analysis for power transformed compositional data. Journal of the American Statistical Association, 110(509):136–148.
- Scealy, J. L., Hingee, K. L., Kent, J. T., and Wood, A. T. A. (2024). Robust score matching for compositional data. Statistics and Computing, 34.
- Scealy, J. L. and Welsh, A. H. (2011). Regression for compositional data by using distributions defined on the hypersphere. Journal of the Royal Statistical Society Series B: Statistical Methodology, 73(3):351–375.
- Scealy, J. L. and Wood, A. T. A. (2023). Score matching for compositional distributions. Journal of the American Statistical Association, 118:1811–1823.
- Taylor-Silva, B. I. and Riesselman, C. R. (2018). Polar frontal migration in the warm late Pliocene: Diatom evidence from the Wilkes Land margin, East Antarctica. Paleoceanography and Paleoclimatology, 33(1):76–92.
- Weistuch, C., Zhu, J., Deasy, J. O., and Tannenbaum, A. R. (2022). The maximum entropy principle for compositional data. BMC Bioinformatics, 23:1–13.
- Xiong, J., Dittmer, D. P., and Marron, J. S. (2015). Virus hunting using radial distance weighted discrimination. The Annals of Applied Statistics, 9:2090–2109.
- Zhang, L., Lu, S., and Marron, J. S. (2015). Nested nonnegative cone analysis. Computational Statistics & Data Analysis, 88:100–110.
- Zhu, C. and Müller, H.-G. (2024). Spherical autoregressive models, with application to distributional and compositional time series. Journal of Econometrics, 239(2):105389.
- Zoubouloglou, P., García-Portugués, E., and Marron, J. S. (2023). Scaled torus principal component analysis. Journal of Computational and Graphical Statistics, 32(3):1024–1035.
Supplementary Material for Principal Subsimplex Analysis
S1 Backwards Principal Component Analysis
For a data set lying on a manifold, backwards PCA searches for a nested sequence of submanifolds of decreasing dimensions that fit the data. These submanifolds provide lower dimensional approximations. Moreover, the differences between the rank $r$ and rank $r-1$ approximations will be a key component of the modes of variation of backwards PCA. (See Section S2 for more discussion on modes of variation.)
Specifically, backwards PCA starts from the full rank representation of the data, a $d$-dimensional manifold say, and finds a $(d-1)$-dimensional submanifold. One identifies a $(d-2)$-dimensional submanifold within it and continues this procedure until reaching a 0-dimensional submanifold. Each $r$-dimensional submanifold is referred to as the rank $r$ approximating subset of the data. In particular, the 0-dimensional submanifold is called the backwards mean, which is frequently a single point, though it could be a discrete collection of points in general.
Principal Nested Spheres (PNS; Jung et al., 2012) is an application of backwards PCA to spherical data, and it was the first concrete implementation of backwards PCA. Preceding the development of PNS, important precursors include Principal Geodesic Analysis (Fletcher et al., 2004) and Geodesic Principal Component Analysis (Huckemann and Ziezold, 2006). These precursors find a sequence of submanifolds spanned by an increasing number of geodesics. While geodesic-based approaches have proved their efficacy in many applications, their utility is sometimes constrained by inherent limitations. For example, in spheres, the submanifolds spanned by geodesics are the great spheres (the intersections of the sphere with hyperplanes passing through the origin), thus geodesic-based approaches cannot explain spherical patterns beyond those distributed along great circles. PNS was conceived to find small spheres, which are the intersections of the sphere with hyperplanes not necessarily passing through the origin. The procedure starts from a $d$-dimensional unit sphere, finds a $(d-1)$-dimensional small sphere, finds a $(d-2)$-dimensional small sphere within it, and iterates the procedure until it reaches the backwards mean. PNS has been adapted to a variety of spaces which involve spheres in different ways. Examples include the space of skeletal representations (Pizer et al., 2013), polyspheres (Eltzner et al., 2015), and high-dimensional tori (Eltzner et al., 2018, Zoubouloglou et al., 2023).
Another application of backwards PCA, to nonnegative matrix factorization (Zhang et al., 2015), illustrates the distinction in interpretability between backwards PCA and many dimensionality reduction methods. In most nonnegative matrix factorization algorithms, the rank $r$ approximation has no trivial relationship with the rank $r+1$ approximation. In contrast, in backwards PCA, the rank $r$ approximation of the object space is a subset of the rank $r+1$ approximation, so it is easy to understand the relationship between approximations of different ranks. Moreover, the differences between rank $r$ approximations and rank $r+1$ approximations naturally define modes of variation.
S2 Modes of Variation by PSA
Modes of variation are the important 'directions' in which the data vary. In the context of Euclidean Principal Component Analysis, the loading vectors (equivalently, the principal components) are interpreted as the modes of variation. Modes of variation are not necessarily linear, as in the case of Principal Geodesic Analysis, where modes of variation are geodesics passing through the Fréchet mean.
More generally, Marron and Dryden (2021) defined a mode of variation as a one-parameter family of objects in the data space. The best-known modes of variation are those for Euclidean PCA, which are orthogonal straight lines passing through the mean of the data. Thinking of a mode of variation as a family of data objects effectively generalizes the definition of modes of variation for Euclidean PCA to nonlinear spaces. For example, the modes of variation for principal geodesic analysis fit this definition, as a geodesic is a one-parameter family of points.
A natural choice of the first mode of variation for backwards PCA is the collection of one-dimensional approximations. For example, in PNS, the first mode of variation was defined as the rank 1 approximating circle. However, the higher order modes of variation for backwards PCA have not yet been formally defined. In this paper, we propose the higher order modes of variation for PSA as follows.
The idea behind modes of variation in PSA is illustrated using Example 1 in Section 5 of the paper. The middle panel of Figure S1 shows a different view of the application of PSA-S to Example 1. The plot illustrates that the first merge occurred between the vertices $e_1$ and $e_3$, giving the new vertex $v^1_1$. The merge yielded the rank 1 approximating subset $S_1$, which is depicted as the red line segment. The vertex $e_2$, which was not merged, was relabeled as $v^1_2$. In the second merge of PSA-S, $v^1_1$ was merged with $v^1_2$ to give the backwards mean $S_0$, which is marked by the black solid dot.
More importantly, the panel also illustrates the way each point in $\Delta^2$ is approximated by PSA-S. Each of the gray line segments is the collection of the points with the same rank 1 approximation. We call these collections the rank 2 trajectories. The index 'rank 2' indicates that the collection is chosen from the rank 2 approximating subset. Generally, a rank $r$ trajectory is a one-dimensional subset of the rank $r$ approximating subset, all of whose points have the same rank $r-1$ approximation. The unique rank 1 trajectory is the red line segment, which is the rank 1 approximating subset itself.



We define the $r$th mode of variation as the rank $r$ trajectory passing through the backwards mean. The first and the second modes of variation are shown in Figure S1 as the red and blue line segments, respectively. This definition generalizes the modes of variation for PCA and PGA. Moreover, this definition of higher order modes of variation can be naturally extended to backwards PCA on other spaces, including PNS.
The right panel of Figure S1 is the same display for PSA-O. It is clear from the figures that while the trajectories of PSA-S are parallel to a side of the ternary plot, those of PSA-O are not. This is because in PSA-O, points are projected to the orthant space, then projected onto the great circle that corresponds to the red line segment along great circles that are perpendicular to that great circle, and then projected back to the simplex.
An alternative way of describing a mode of variation, instead of using a set of data objects, is to use the difference of the two merged vertices (as vectors in $\mathbb{R}^d$), which we call a loading vector. In the PSA-S panel of Figure S1, the direction of the second mode of variation is the difference of the merged vertices $e_1$ and $e_3$. The direction of the first mode of variation is the difference of the merged vertices $v^1_1$ and $v^1_2$. This observation leads to the definition of loading vectors in Section 2.4. We may represent modes of variation of PSA-O in the same way, although the trajectories of PSA-O are not exactly parallel to each other. Figure S2 visualizes the loading vectors using bar plots (the same figure as Figure 5 (b)). A positive score implies high content of the elements with red bars. The red bars and the green bars add up to 1 and -1, respectively, due to the facts that the two merged vertices are in $\Delta^{d-1}$ and that they consist of disjoint sets of elements.

Figure S3 uses the same bar plots to display the loading vectors of Example 2.

S3 Algorithms of PSA
S4 Alternative Approaches to Compositional Data
Let us denote the component-wise arithmetic by $x \odot y = (x_1 y_1, \dots, x_d y_d)$ and $x / y = (x_1/y_1, \dots, x_d/y_d)$, where $x = (x_1, \dots, x_d)$ and $y = (y_1, \dots, y_d)$. Similarly, the component-wise logarithm and power transformations are denoted by $\log x = (\log x_1, \dots, \log x_d)$ and $x^a = (x_1^a, \dots, x_d^a)$ for any $a \neq 0$. Also, $g(x) = (x_1 x_2 \cdots x_d)^{1/d}$ denotes the geometric mean of the entries of $x$. Let $\mathbf{1}_d = (1, \dots, 1) \in \mathbb{R}^d$.
S4.1 Log-Ratio Transformations
The log transform is defined for positive values, so we denote by $\Delta^{d-1}_+$ the interior of $\Delta^{d-1}$, or the open unit $(d-1)$-simplex, that is,

$$\Delta^{d-1}_+ = \big\{ x \in \Delta^{d-1} : x_i > 0 \text{ for all } i = 1, \dots, d \big\}.$$
Aitchison (1986) proposed the additive log-ratio transformation, defined by

$$\operatorname{alr}(x) = \Big( \log\frac{x_1}{x_d}, \dots, \log\frac{x_{d-1}}{x_d} \Big). \tag{10}$$

Division by the last component gives the additive log-ratio transformation a nice property that the component-wise log transformation does not have: it is a one-to-one mapping from the open unit $(d-1)$-simplex to the entire Euclidean space $\mathbb{R}^{d-1}$.
This correspondence is particularly useful in parametric analysis of compositional data with no zero entries. A random variable $X$ supported on $\Delta^{d-1}_+$ is said to follow a logistic normal distribution with parameters $\mu$ and $\Sigma$, denoted by $X \sim \mathcal{LN}(\mu, \Sigma)$, if $\operatorname{alr}(X)$ follows $\mathcal{N}(\mu, \Sigma)$. The class of logistic normal distributions is more flexible than the traditional class of Dirichlet distributions in the sense that it can model correlation between components in a way that Dirichlet distributions cannot.
Because the additive log-ratio transformation depends on the choice of ordering of the components, which may greatly affect subsequent analyses, Aitchison (1986) further proposed the centered log-ratio transformation, $\operatorname{clr}$, which centers each point by its geometric mean $g(x)$:

$$\operatorname{clr}(x) = \log\frac{x}{g(x)} = \Big( \log\frac{x_1}{g(x)}, \dots, \log\frac{x_d}{g(x)} \Big). \tag{11}$$

The centered log-ratio transformation is a one-to-one mapping from $\Delta^{d-1}_+$ to the $(d-1)$-dimensional subspace $V = \{ z \in \mathbb{R}^d : \sum_{i=1}^d z_i = 0 \}$. One can remove the redundant dimension by isometrically identifying the subspace with $\mathbb{R}^{d-1}$, which leads to the isometric log-ratio transformation, $\operatorname{ilr}$. The identification can be computed by

$$\operatorname{ilr}(x) = H \operatorname{clr}(x), \tag{12}$$

where $H$ is the lower $(d-1) \times d$ submatrix of the Helmert matrix of order $d$. This refinement is required when a statistical method requires the data to be of full rank.
For principal component analysis purposes, however, there is no real difference between using clr and ilr, and the clr transformation is intuitively and computationally simpler, for clr maps all data points into the hyperplane $V$, which is parallel to the simplex and essentially passes through the mean. Consequently, the last principal component of the clr-transformed data is the normal vector of $V$, and the variation explained by that principal component is zero. In addition, all lower dimensional approximating subspaces and approximation points of clr-PCA lie on $V$, and these approximations can be effectively mapped back onto the open unit simplex through the inverse of the clr map.
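As a concrete illustration, the clr and ilr maps can be computed in R as follows (a sketch using base R's Helmert contrasts; `contr.helmert` returns unnormalized contrasts, so the rows are rescaled to be orthonormal):

```r
clr <- function(x) log(x) - mean(log(x))

ilr <- function(x) {                   # Equation (12)
  d <- length(x)
  H <- t(contr.helmert(d))             # (d-1) x d, rows orthogonal to the ones vector
  H <- H / sqrt(rowSums(H^2))          # orthonormalize the rows
  as.vector(H %*% clr(x))
}

x <- c(0.2, 0.3, 0.5)
sum(ilr(x)^2) - sum(clr(x)^2)   # ~0: the identification is an isometry
```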
Figure S4 illustrates an example of 2-dimensional compositional data to which log-ratio PCA is applied. In Figure S4 (a), two outlying data points are colored blue and red, while the other points are colored black. Figure S4 (b) shows the distribution of the same data after the ilr transformation, along with the magenta and green line segments indicating the first and the second principal directions. These line segments are mapped back onto the simplex in Figure S4 (c). Note that significant curvature is introduced through the log-ratio transformation.



The log-ratio transformations are not applicable when a data set contains zeros. Using the terminology in Chapter 11 of Aitchison (1986), zeros in compositional data are called essential zeros (or structural zeros) if the corresponding parts were truly not there, and rounded zeros (or count zeros) if the zeros occur because the true values were below the detection level. Essential zeros may suggest the existence of distinct subpopulations that can be modeled by hierarchical models, for example, zero-inflated Poisson models, in which the Poisson component models count zeros and the point mass at zero models essential zeros. Rounded zeros are typically replaced by small values through imputation.
S4.2 Power Transformations
Box-Cox power transformations are also frequently used to handle high skewness, and they generalize the log transformation in the sense that $\lim_{a \to 0} (x^a - \mathbf{1}_d)/a = \log x$. Aitchison (1986) proposed to apply the Box-Cox power transformation to the ratios of the components, which we denote by $T_a$,

$$T_a(x) = \frac{1}{a} \Big( \Big(\frac{x_1}{x_d}\Big)^a - 1, \dots, \Big(\frac{x_{d-1}}{x_d}\Big)^a - 1 \Big), \tag{13}$$

for some $a > 0$. On the other hand, one can consider the Box-Cox power transformation on the components but not on the ratios, $\tilde{T}_a$,

$$\tilde{T}_a(x) = \frac{x^a - \mathbf{1}_d}{a}. \tag{14}$$

This approach was extended with regard to robustness in Scealy et al. (2015). In contrast to the case of the centered log-ratio transformation, the data points resulting from power transformations lie on a curved space rather than a linear space.
For PCA purposes, $a$ is often chosen so that it maximizes the profile log-likelihood assuming normality after the power transformation. Another approach is simply using $a = 1/2$ and considering $x \mapsto x^{1/2} = (\sqrt{x_1}, \dots, \sqrt{x_d})$. Note that this is a translation and a dilation of the Box-Cox power transformation. This transformation effectively maps a simplex onto the nonnegative orthant of the same dimension, which proves particularly advantageous for parametric analysis of compositional data. This enables the utilization of models defined on spheres (Scealy and Welsh, 2011, Zhu and Müller, 2024).
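The orthant mapping property is easy to verify numerically; this is the same square-root map that underlies the orthant representation used by PSA-O in Section 3 of the main paper:

```r
x <- c(0.1, 0.4, 0.5)   # a composition: nonnegative, unit sum
w <- sqrt(x)            # the a = 1/2 power transform (up to translation and dilation)
sum(w^2)                # 1: w lies on the unit nonnegative orthant
```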
S5 Additional Discussion on the Diatom Data
This section provides additional figures and discussion for the Diatom data.
S5.1 Data Set
In Section 6 of the main manuscript, we investigated the distribution of the compositions of the diatom data. Figure S5 (a) shows the depth distribution, with an equally spaced grid of the unit interval for the y-axis. This shows that the depths where the samples were measured are not uniformly distributed but appear in clusters. Figure S5 (b) shows the percent of nonzero observations for each of the diatom species. The bar plot suggests that the percent of nonzero observations is not proportional to the variance, and that most observations of the last 10-15 species are zero.


S5.2 Scores and Loading Vectors
Figure S6(a)-(e) are the score plot matrices of the five methods. The same color scheme as in Figure 8 of the main manuscript was used, and the outlier is marked by a red circle as before. Note that the diagonal panels are different from the density plots that are usual for scatter plot matrices: each of the diagonal panels has the score on the x-axis and the depth on the y-axis. Figure S11 provides the corresponding loading vectors.
The detailed discussion on the scores and the loading vectors of PSA-O given in Section 6 is now compared to those of PSA-S. Recall that the first, second, and fourth modes of variation of PSA-O are driven by the first phase (depth > 67 mbsf), the second phase (depth < 67 mbsf), and the outlier, respectively. A similar investigation into the diagonal panels of Figure S6(a) suggests that each of the modes of variation of PSA-S captures variation that is driven by either different time periods or the outlier. The first four modes of variation of PSA-S are associated with the transition of species during the first phase, the second phase, the earlier period of the first phase, and the later period of the second phase, respectively. In addition, the second mode of variation is also associated with the outlier. This observation can be attributed to the fact that the first loading vector of PSA-S is similar to that of PSA-O, as represented by positive weights of F. barronii and R. antarctica and negative weights of T. inura and A. karstenii. The second loading vector of PSA-S is similar to a mixture of the second and the fourth loading vectors of PSA-O, as represented by positive weights of T. insigna and Denticulopsis spp. and negative weights of T. inura and A. karstenii. Importantly, however, the second mode of PSA-S is not exactly the same as the mixture, as A. karstenii indicates opposite directions: negative in the second mode of PSA-S and positive in the mixture of the two modes of PSA-O. The two methods are pointing out different yet important aspects of the outlier. PSA-S pulls out A. ingens, which is the most distinctive species in the outlier, while PSA-O extracts most of the species that are distinctive in the outlier, with altered weights.
The diagonal panels of Figure S6(c) suggest that the first, second, and fourth modes of variation of PCA are associated with the transition of diatom species through the entire period, the second phase, and the later part of the first phase, respectively, while the third mode is associated with the outlier. The second and the third rows of Figure S11 indicate that the first mode of PCA is a mixture of the negative of the first mode and the positive of the second mode of PSA-O, as represented by positive weights of T. insigna and A. karstenii and negative weights of F. barronii and R. antarctica. The second mode of PCA is a mixture of the second and the third modes of PSA-O, as represented by extreme weights of T. insigna and F. bohatyi. The third mode of PCA and the fourth mode of PSA-O share some species but also have some distinct species, indicating that they capture different aspects of the outlier.
The first two scores of the power transform PCA show similar trends to those of PCA, except that the outlier begins to stick out in the second mode of variation. Accordingly, A. ingens is introduced in the second loading vector of the power transform PCA, while the first two loading vectors of the power transform PCA are mostly the same as those of PCA. In contrast, the third mode of the power transform PCA is noisier and does not manifest a clear association with any time period. The fourth mode is driven by the outlier.
Similarly, the first two scores and loading vectors of the log-ratio PCA are close to those of PCA, while the outlier is more clearly pulled out in the second mode of variation as a result of the higher weight of A. ingens in the loading vector.






S5.3 Parallel Coordinate Plot Representation of Loading Vectors
An alternative way of representing loading vectors is to use parallel coordinate plots. Figure S12 shows parallel coordinate plots for the PSA-S and PSA-O loading vectors. An advantage of parallel coordinate plots over bar plots is that the same species appears at the same x-axis position, so one can easily compare the loading vectors across different methods. For example, by comparing Figures S12(a) and S12(b), one can quickly notice that the first loading vectors of PSA-S and PSA-O are similar and that the second loading vector of PSA-S is a mixture of the second and the fourth loading vectors of PSA-O. In contrast, an advantage of bar plots is that one can easily recognize the important players in the loading vectors, particularly when there are more than a handful of variables.


S5.4 Lower Dimensional Compositional Representations and Ternary Plots
We saw in Figure 10 that the rank 2 approximations of PSA-O can be effectively visualized through a ternary plot. The same display for PSA-S is shown in Figure S14. It is clear from the ternary plot that the observations in the first phase (yellow to green) have similarly low levels of V3, while the observations in the second phase (green to navy) and the outlier have varied proportions of V3. In addition, V3 consists of T. insigna together with five more variables, effectively pin-pointing the variables which drive the variation in the second phase. Similarly to the vertices of PSA-O, V1 is more related to the sea-ice affinity diatoms and V2 is more related to open-ocean diatoms.

