The -Robinson-Foulds Dissimilarity Measures for Comparison of Labeled Trees
Abstract
Understanding the mutational history of tumor cells is a critical endeavor in unraveling the mechanisms underlying cancer. Since the modeling of tumor cell evolution employs labeled trees, researchers are motivated to develop different methods to assess and compare mutation trees and other labeled trees. While the Robinson-Foulds distance is a widely utilized metric for comparing phylogenetic trees, its applicability to labeled trees reveals certain limitations. This paper introduces the -Robinson-Foulds dissimilarity measures, tailored to address the challenges of labeled tree comparison. The Robinson-Foulds distance is succinctly expressed as -RF in the space of labeled trees with nodes. Like the Robinson-Foulds distance, the -Robinson-Foulds is a pseudometric for multiset-labeled trees and becomes a metric in the space of 1-labeled trees. By setting to a small value, the -Robinson-Foulds dissimilarity can capture analogous local regions in two labeled trees with different size or different labels.
keywords:
Phylogenetic trees, mutation trees, labeled trees, Robinson-Foulds distance, -Robinson-Foulds dissimilarity1 Introduction
Trees in biology are a fundamental concept as they depict the evolutionary history of entities. These entities may consist of organisms, species, proteins, genes or genomes. Trees are also useful for healthcare analysis and medical diagnosis. Introducing different kinds of tree models has given rise to the question about how these models can be efficiently compared for evaluation. This question has led to defining a robust dissimilarity measure in the space of targeted trees. For example, mutation/clonal trees are introduced to model tumor evolution, in which nodes represent cellular populations and are labeled by the gene mutations carried by those populations (Karpov et al., 2019; Schwartz and Schäffer, 2017). The progression of tumors varies among different patients; additionally, information about such variations is significant for cancer treatment. Therefore, dissimilarity measures for mutation trees have become a focus of recent research (DiNardo et al. (2020); Jahn et al. (2021); Llabrés et al. (2021); Karpov et al. (2019)).
In prior studies on species trees, several measures have been introduced to compare two phylogenetic trees. Some examples of such distances are Robinson-Foulds distance (RF) (Robinson and Foulds, 1981), Nearest-Neighbor Interchange (NNI) (Li et al., 1996; Robinson, 1971), Quartet distance (Estabrook et al., 1985), and Path distance (Steel and Penny, 1993; Williams and Clifford, 1971). Although these distances have been widely used for phylogenetic trees, they are defined based on the assumption that the involved trees have the same label sets. Moreover, only leaves of phylogenetic trees are labeled. Thus, these distances are not useful for comparing trees with different label sets or trees in which all the nodes are labeled.
1.1 Related Work on Comparison of Labeled Trees
To get around some limitations of the above-mentioned distances in the comparison of mutation trees, researchers have introduced new measures for mutation trees. Some of these measures are Common Ancestor Set distance (CASet) (DiNardo et al., 2020), Distinctly Inherited Set Comparison distance (DISC) (DiNardo et al., 2020), and Multi-Labeled Tree Dissimilarity measure (MLTD) (Karpov et al., 2019). While these distance measures enable efficient comparison of clonal trees, they are defined based on the assumption that mutations cannot occur more than once and mutations will not be lost in the course of tumor evolution. As a result, these metrics exhibit multiple limitations when applied to the comparison of trees used to model complex tumor evolution, wherein mutations may indeed occur multiple times and subsequently be lost.
In addition to the three measures mentioned above, a group of other dissimilarity measures have been introduced for the comparison of mutation trees, including Parent-Child Distance (Govek et al., 2018) and Ancestor-Descendant Distance (Govek et al., 2018). These measures are metric only in the space of ‘1-mutation’ trees, in which each node is labeled by exactly one mutation. These distances are again defined based on the assumption that mentioned above.
Additionally, there are other measures for mutation trees, defined based on the generalization methods. In such methods, researchers aim to extend the definition of an existing distance, which mostly used to compare phylogenetic trees, to mutation trees. For example, the generalized Nearest Neighbour Interchange (gNNI) (Jahn et al. (2021)) is defined by some minor modifications of NNI, which was first defined for the comparison of phylogenetic trees. The other example is the Path Distance (Govek et al. (2018)) which was first defined for phylogenetic trees comparison. Although these measure are applicable to mutation trees, they are only well defined for mutation trees with the same label sets (Jahn et al. (2021); Govek et al. (2018)).
Apart from the measures mentioned above, another recently proposed distance is the generalized RF distance (GRF) (Llabrés et al., 2020, 2021). This measure allows for the comparison of phylogenetic trees, phylogenetic networks, mutation and clonal trees. An important point about this distance is that its value depends on the intersection between clusters or clones of trees. However, this intersection does not contribute to the RF distance. In fact, if two clusters or clones of two trees are different, their contribution to the RF distance is 1; otherwise, it is 0. Hence, the generalized RF distance has a better resolution than the RF distance. However, it is defined based on the assumption that two distinct nodes in each tree are labeled by two disjoint sets (Llabrés et al., 2020).
There are some other generalizations of the RF distance, such as Bourque distance (Jahn et al., 2021). This distance is effective for comparing mutation trees whose nodes are labeled by non-empty sets and has linear time complexity. However, like the above distances, it does not allow for multiple occurrences of mutations during the tumor history (Jahn et al., 2021). Other generalization of the RF distance have also been proposed for gene trees (Briand et al., 2020, 2022).
The above-mentioned measures are not able to quantify similarity or difference of some tree models. Two instances of such models are the Dollo (Farris, 1977) and the Camin-Sokal model (Camin and Sokal, 1965). The reason behind the inadequacy of the measures for these models is that it is possible for mutations to get lost after they are gained in the Dollo model, and a mutation can occur more than once during the tumor history in the Camin-Sokal model (Llabrés et al., 2020). Hence, some measures are needed to mitigate the problem. To the best of our knowledge, Triplet-based Distance (Ciccolella et al., 2021) is the only measure introduced so far to resolve the issue. The distance is useful for comparing mutation trees whose nodes are labeled by non-empty sets; additionally, it allows for multiple occurrences of mutations during the tumor history and losing a mutation after it is gained (Ciccolella et al., 2021). Thus, the measure is applicable to the broader family of trees in which two nodes of a tree may have non-disjoint sets of labels. Nevertheless, it is not able to compare those trees in which there is a node whose label has more than one copy of a mutation.
Although no tree model has been introduced so far that allows for more than one copy of a mutation in the label of a single node, current models can be extended to deal with copy number of mutations. For example, the constrained -Dollo model (Sashittal et al., 2023) takes the variant read count and the total read count of each mutation in each cell, derived from single-cell DNA sequencing data, as input; then, based on three thresholds for the variant read count, the total read count, and the variant allele frequency, it decides whether a mutation is present or absent in a cell or it is missing (Sashittal et al., 2023). Alternatively, the model can consider the exact frequency numbers to show the multiplicity of each mutation in each cell. This motivates us to introduce new distances that can be used to compare pairs of labeled trees whose nodes are labeled by non-empty multisets.
1.2 Our Contributions to Tree Comparison
In this paper, we present -RF dissimilarity measures designed for the comparison of labeled trees. They are first defined for 1-labeled trees (Section 3). Subsequently, we extend these measures to multiset-labeled trees (Section 5). We delve into the mathematical properties of the -RF measures in Sections 4 and 5. In particular, -RF is a metric for 1-labeled trees. We also assess the validity of the -RF measures through comparisons with CASet, DISC, and GRF (Section 5), and the evaluation of their performance in the context of tree clustering (Section 6).
2 Concepts and Notations
A graph consists of a set of nodes and a set of edges that are each an unordered pair of distinct nodes, whereas a directed graph consists of a set of nodes and a set of directed edges that are each an ordered pair of distinct nodes.
Let be a (directed) graph. We use and to denote its node and edge set, respectively. If is undirected, will still be used to denote an edge between and with the understanding that . Let . If , we say that and are adjacent, the edge is incident to and , or and are two endpoints of .
The degree of is defined as the number of edges incident to . In addition, if is directed, the indegree and outdegree of are defined as the number of edges such that and , respectively. The nodes of degree 1 are called the leaves in an undirected graph, whereas the nodes of indegree 1 and outdegree 0 are called the leaves in a directed graph. We use to denote the leaf set for . Non-leaf nodes are called internal nodes.
A path of length from to consists of a sequence of nodes such that , and for . The distance from to , denoted as , is the length of the shortest paths from to , and it is set to if there is no path from to . Note that if is undirected, for all . The diameter of , denoted as , is defined as . If is directed, its diameter is defined as the diameter of its undirected version that has the node set and edge set .
2.1 Trees
A tree is a graph in which there is exactly one path from every node to any other node. It is binary if every internal node is of degree 3. It is a line tree if every internal node is of degree 2. Each line tree has exactly two leaves.
2.2 Rooted Trees
A rooted tree is a directed tree with a specific root node where the edges are oriented away from the root. In a rooted tree, there is exactly one edge entering for every non-root node , and thus there is a unique path from its root to every other node.
Let be a rooted tree and . If , is called a child of and is called the parent of . In general, for , if belongs to the unique path from to , is said to be a descendant of , and is said to be an ancestor of . We use , and to denote the set of all children, ancestors and descendants of , respectively. Note that and .
A binary rooted tree is a rooted tree in which the root is of indegree 0 and outdegree 2, and every other internal node is of indegree 1 and outdegree 2. A rooted line tree is a rooted tree in which each internal node has only one child. A rooted caterpillar tree is a rooted tree in which every internal node has at most one child that is internal.
2.3 Labeled Trees
Let be a set and be the set of all subsets of . A tree or rooted tree is labeled with the subsets of if is equipped with a function such that , and for every . In particular, if contains exactly one element for each and is one-to-one, is said to be a -labeled tree on . In addition, if is 1-labeled on , then for , is defined as .
2.4 Phylogenetic and Mutation Trees
Let be a finite taxon set. A phylogenetic tree (respectively, rooted phylogenetic tree) on is a binary tree (respectively, binary rooted tree) in which the leaves are uniquely labeled by the taxa of and the internal nodes are unlabeled.
A mutation tree on a set M of mutated genes is a rooted tree in which nodes are labeled with nonempty subsets of .
2.5 Dissimilarity Measures for Trees
Let be a set of trees. A dissimilarity measure on is a symmetric real function . A dissimilarity measure should capture the idea that the more similar two trees are, the lower their measure value is. A pseudometric on is a dissimilarity measure that satisfies the triangle inequality condition. Finally, a metric (distance) on is a pseudometric such that unless and are the same tree.
3 The -RF Measure for 1-Labeled Trees
In this section, we first recall the definition of the RF distance and then present -RF dissimilarity measures for 1-labeled trees for arbitrary .
3.1 The -RF Measure for 1-Labeled Unrooted Trees
Let be a set of labels and let be a 1-labeled tree over . Each induces a pair of label subsets on :
(1) | |||
(2) |
We further define:
(3) |
The RF distance of two 1-labeled trees and is defined as:
(4) |

Example 1.
Consider the three 1-labeled trees in Figure 1. We have , because to are:
respectively, whereas to are:
respectively. However, , even if and have the same topology and only one node is labeled differently.
The above example indicates that the RF cannot capture local similarity (and difference) for 1-labeled trees if they have different labels. One popular dissimilarity measure for sets is the Jaccard distance. It is obtained by dividing the size of the symmetric difference of two sets by the size of their union. Two 1-labeled trees are identical if and only if they have the same set of edges. Therefore, we propose to use and its generalization to measure the dissimilarity for 1-labeled trees and .
Let be a non-negative integer and let be a -labeled tree. Each edge induces the following pair of subsets of labels:
(5) | |||
Clearly, and . We further define:
(6) |
Definition 1.
Let and let and be two 1-labeled trees. The -RF dissimilarity score of and is defined as:
(7) |
Example 2.
Continuing with Example 1, we have , as for are:
, | , | , | ||
, | , | , |
respectively, and for are:
, | , | , | ||
, | , | , |
respectively. We also have . Thus, -RF captures the difference of the trees better than the RF distance.
3.2 The -RF Measure for 1-Labeled Rooted Trees
Let be an integer and let be a -labeled rooted tree. For a node , we define and as:
(8) | |||
(9) |
For each , we define as the following ordered pair od two label subsets:
(10) |
Here, the first subset of contains the labels of the descendants that are at distance at most from , whereas the second subset contains the labels of the other nodes around the edge within a distance of . Next, we define:
(11) |
Definition 2.
Let and let and be two 1-labeled rooted trees. Then, the -RF dissimilarity between and is defined as:
(12) |
Example 3.
Consider the two 1-labeled rooted trees and in Figure 2. () are the following ordered pairs of label subsets:
() are the following ordered pairs of label subsets:
Therefore, .

4 Characterization of -RF for 1-Labeled Trees
In order to evaluate -RF, we first provide the mathematical properties of the -RF. We then present experimental results on the frequency distribution of these measures.
4.1 Mathematical Properties
Proposition 1.
Let and be two 1-labeled trees.
(a) Let and . For any , .
(b) Assume that . For ,
. In addition, the second inequality become equality if and
(c) Renaming each node with its label, we have .
(d) If , then .
Proof.
(a) Note that if and , each involves at least three labels. If and have only two common elements, for every and . Thus, we have , implying that
(b) The second inequality follows from that and for . We prove the first inequality as follows.
Let . Since and are 1-labeled, we identify a node with its label in the trees. Without loss of generality, we may assume . Define .
If , then, , as . This also implies that for every , and .
If , there exists at least a node that is or more distance away from . Since is connected, we let be a path from and with the smallest length. Clearly, the first nodes in (including ) are all in , i.e. at least one end of the first edges of are found in .
In summary, we have proved that there are at least edges such that either or . For each of these edges , appears in at least one label subset of and thus . Therefore, .
If and , then, . Therefore, the induced pair contains for every edge of . On the other hand, the induced pair does not contain for each edge of . Thus, and .
(c) Note that we may represent each node of a 1-labeled tree with its unique label. As a result, and for and . Thus, (c) follows.
(d) It follows from the definition of the -RF. ∎
Lemma 4.1.
Let be an integer. -RF satisfies the non-negativity, symmetry and triangle inequality conditions.
Proof 4.2.
Let . The non-negativity and symmetry conditions are trivial. The triangle inequality is derived from the inequality for any three 1-labeled trees .
Remark 4.3.
Proposition 4.4.
The 0-RF is a metric on the space of all 1-labeled rooted trees.
Proof 4.5.
Let and be two 1-labeled rooted trees. By Remark 4.3, it is enough to show that and are identical if . By identifying a node with its label in and , we obtain that and . If , and thus , i.e. and are identical.
Lemma 4.6.
Let be a 1-labeled rooted tree with at least two nodes and be a subset of . Define to be the subtree obtained by the removal of all the leaves of . Then, for any ,
(13) |
Proof 4.7.
Since is 1-labeled, we identify a node of with its label in the following discussion. With this convention, for any subset of nodes, .
Let denote the subset of edges incident to a leaf of , i.e., . Then,
If , and thus
This has proved Eqn. (13).
Proposition 4.8.
Let be an integer. The -RF is a metric in the space of all 1-labeled rooted trees.
Proof 4.9.
Let and be two 1-labeled rooted trees. By Remark 4.3, it is enough to show that (equivalently, ) implies that and are identical. To this end, we prove that can be uniquely determined by using mathematical induction.
Since , is a single node if and only if is empty if and only is empty. Therefore, the single-node tree is uniquely determined by .
Assume is uniquely determined by for arbitrary 1-labeled tree such that . Consider a 1-labeled tree such that .
For a leaf , there is a unique edge entering . Note that . Since if and only if is a leaf, we can identify from such that
For , there is a unique edge entering . Since , the children of are all a leaf if and only if if and only if . Therefore, we can identify whose children are all leaves from the ordered pairs such that contains only .
Let be the set of all nodes whose children are just leaves and . Since is nonempty, . Define .
For the tree obtained from by the removal of the leaves of , By Eqn. (13), can be efficiently constructed from . By the induction hypothesis, is uniquely determined by . As a result, is determined.
This concludes the proof.
Corollary 4.10.
Let . The -RF is a metric in the space of all -labeled trees.
Proof 4.11.
Lemma 4.12.
Let and let be a 1-labeled rooted tree with nodes. All subsets and for all nodes and can be computed in at most set operations.
Proof 4.13.
Since is 1-labeled, we can identify a node of with its label. In this way, for all nodes and . By ordering the labels, we represent each subset of labels (and each subset of nodes) as a -bit 0-1 string, where the -th bit is 1 if and only if the -th label (node) is in the subset.
The statement is obvious in the case , since and, clearly, all the for can be computed in at most set operations. We assume and prove the statement by induction as follows.
Assume that all the for have been computed in at most set operations. Assume has children . Then,
This implies that for all can be computed from all using set operations. In total, we can compute all subsets and ) in at most set operations.
Lemma 4.14.
Let and be a 1-labeled rooted tree with nodes. Using for , we can compute for all in set operations, where is defined in Eqn. (8).
Proof 4.15.
Since is a 1-labeled rooted tree, we identify a node with its label. In this way, we just need to show that for all nodes can be computed in set operations.
Let be the root of . For any node , let the unique path from to be
Then, we have that
Given the subsets for all and , the above formula implies that can be computed in at most set operations. In total, we can compute all for all in set operations.
Proposition 4.16.
Let and be two 1-labeled trees with nodes and . Then, can be computed in time.
Proof 4.17.
We first consider the rooted tree case. Let and be two 1-labeled rooted trees with nodes. Without loss of generality, we may assume that and are labeled on the same set , with . (Otherwise, we can consider them labeled on , with .) By Lemma 4.12 and Lemma 4.14, we can compute for all in set operations for . Since each edge induces an ordered pair of label subsets and we represent each label subset using a -bit string, we consider as a -bit string. In this way, we sort all the edge-induced pairs of label subsets for each tree in time by radix sort (that is, indexing) and then compute the symmetric difference of the two set of edge-induced pairs in time. This concludes the proof.
In the unrooted case, we first root the trees at a leaf. In this way, we can compute all the edge-induced pairs of label subsets in the derived rooted trees in time. Since the edges induce unordered pairs of label subsets in the original trees, we rearrange the two label subsets obtained for an edge in such a way that the smallest label in the first subset is smaller than every label in the second one. After the rearrangement, we can radix-sort the edge-induced pairs and compute the -RF score in time.




4.2 The Distribution of -RF Scores
We examined the distribution of the -RF dissimilarity scores for 1-labeled unrooted and rooted trees with the same label set and with different label sets.
The distribution of the frequency of the pairwise -RF scores in the space of -node 1-labeled unrooted and rooted trees for from 4 to 7 are presented in Figures 3 to 6, respectively. For each , it suffices to consider . Recall that -RF is actually the RF distance. The frequency distribution for the RF distance in the space of phylogenetic trees is known to be Poisson (Steel and Penny, 1993). It seems also true that the pairwise 0-RF and -RF scores have a Poisson distribution in the space of -node 1-labeled unrooted and rooted trees. However, the distribution of the pairwise -RF scores is unlikely Poisson when and .
We examined 1,679,616 (respectively, 60,466,176) pairs of 6-node 1-labeled unrooted (respectively, rooted) trees such that the trees in each pair have common labels, with . Table 1 shows that the majority of pairs have a largest dissimilarity score of 10.
1-RF | 2 | 4 | 6 | 8 | 10 |
---|---|---|---|---|---|
3 | 0 | 0 | 0 | 3,072 | 1,676,544 |
4 | 0 | 0 | 432 | 16,800 | 1,662,384 |
5 | 0 | 340 | 3,720 | 53,100 | 1,622,456 |
1-RF | 2 | 4 | 6 | 8 | 10 |
---|---|---|---|---|---|
3 | 0 | 0 | 0 | 79,872 | 60,386,304 |
4 | 0 | 0 | 7,776 | 419,136 | 60,039,264 |
5 | 0 | 4,080 | 65,760 | 1,310,880 | 59,085,456 |
5 A Generalization to Multiset-Labeled Trees
In this section, we extend the measures introduced in Section 3 to multiset-labeled unrooted and rooted trees.
5.1 Multisets and Their Operations
A multiset is a collection of elements in which an element can occur one or more times (Jűrgensen, 2020). The set of all distinct elements appearing in a multiset is denoted by . In this paper, we simply represent by the monomial if , where is simplified to for each .
Let and be two multisets. We say is a sub-multiset of , denoted by , if for every , . In addition, we say that if and . Furthermore, the union, sum, intersection, difference, and symmetric difference of and are respectively defined as follows:
-
•
;
-
•
;
-
•
;
-
•
;
-
•
;
where is defined as 0 if for .
Let be a set and be the set of all sub-multisets on . A tree is labeled with the sub-multisets of if is equipped with a function such that and , for every . We call such a tree as a multiset-labeled tree. For and , we define and as follows:
(14) | |||||
(15) |
5.2 The -RF for Multiset-Labeled Trees
Let be a multiset-labeled tree. Then, each edge of induces a pair of multisets
(16) |
where is defined in Eqn. (14), and in Eqn. (2). Note that Eqn. (16) is obtained from Eqn. (1) by replacing with .
Remark 5.1.

We use Eqn. (16), Eqn. (3) and Eqn. (4) to define the RF-distance for multiset-labeled trees by replacing with in Eqn. (4).
Let . We use Eqn. (5), Eqn. (6), and Eqn. (7) to define the -RF for multiset-labeled trees by replacing with in Eqn. (5) and replacing with in Eqn. (7).
Example 5.2.

5.3 The -RF for Multiset-Labeled Rooted Trees
Let be an integer. We use Eqn. (10), Eqn. (11), and Eqn. (12) to define -RF for multiset-labeled rooted trees by replacing with in Eqn. (10) and replacing with in Eqn. (12).
Proposition 5.3.
Let be an integer. The -RF satisfies the non-negativity, symmetry, and triangle inequality conditions. Hence, -RF is a pseudometric for each in the space of multiset-labeled (rooted) trees.
Proof 5.4.
The non-negativity and symmetry conditions follow from the definition of the -RF. The triangle inequality condition is proved as follows.
Let , , and be three multiset-labeled trees. We need to show:
Note that denotes the multiset of edge-induced order pairs of sub-multisets in for .
If , we have either or . Assume . Then, . If , we have . This implies that and Thus,
On the other hand, if and , we have:
If and , we have , implying . Thus, we have:
Lastly, if , we can obtain the same result.
In summary, we have:
In addition, for each , we have:
Therefore, we have:
that is, the triangle inequality holds.
For multiset-labeled rooted trees, the proof is similar and hence omitted.

Remark 5.5.
For multiset-labeled trees, does not imply and are identical, as shown in Fig. 9.
Proposition 5.6.
Let and and be two (rooted) trees whose nodes are labeled by and , respectively. Then, can be computed in time, where is the maximum multiplicity of a label appearing in and .
Proof 5.7.
An algorithm for the 1-labeled case can be modified as follows for computing -RF on multiset-labeled rooted and unrooted trees:
-
•
Represent each label multiset as a -dimensional vector, in which the integer at position is the multiplicity of the -th label. Computing all edge-induced pairs in both trees takes set operations. Each set operation takes integer operations.
-
•
Radix-sort all the edge-induced pairs for and in and integer operations, respectively.
-
•
Compute the symmetric difference of the set of the edge-induced pairs in the two input trees in set operation. Each set operation takes integer operations.
In summary, one can compute using integer operations, as .

5.4 Correlation of the -RF and the Other Measures
Let and be two 1-labeled rooted trees with the same label set . Again, we identify the nodes with their labels in the two trees. For any two subset and of , we use to denote their Jaccard distance. The CASet distance between and is defined to be the average of a pair of nodes and , whereas the DISC distance between and is the average of an order pair of nodes DiNardo et al. (2020).
Using the Pearson correlation, we compared the -RF with CASet, DISC, and GRF (Llabrés et al., 2020) in the space of set-labeled trees for different from 0 to 28.
Firstly, we conducted the correlation analysis in the space of mutation trees with the same label set. Using a method reported by Jahn et al. (2021), we generated a simulated dataset containing 5,000 rooted trees in which the root was labeled with 0 and the other nodes were labeled by the disjoint subsets of , where the trees might have different number of nodes. Using all pairwise scores for CASet, DISC, GRF and -RF, we conducted the Pearson correlation analysis of -RF with the other three (left panel, Fig. 10).
Our results show that CASet, DISC and GRF were all positively correlated with -RF. We observed the following facts:
-
•
The GRF and -RF had the largest Pearson correlation for each , whereas the DISC and -RF had the largest Pearson correlation for each .
-
•
The 5-RF and 6-RF were less correlated to CASet, DISC and GRF than other -RF.
-
•
The Pearson correlation between -RF and CASet (respectively, DISC) increased when went from 6 to 15.
Secondly, we conducted the Pearson correlation analysis on the trees with different but overlapping label sets. The dataset was generated by the same method and was a union of 5 groups of rooted trees, each of which contained 200 trees over the same label set. We computed the dissimilarity scores for each tree in the first family and each tree in other groups and then computed the Pearson correlation between different measures. Again, all the dissimilarity measures were positively correlated, but less correlated than in the first case; see Fig 10 (right). This observation could be the result of the fact that difference in label sets of two trees makes their -RF score at least . However, the difference does not strongly contribute to the other distances because DISC and CASet consider the intersection of label sets (see (DiNardo et al., 2020)), and GRF considers the intersection of clusters.
The right dotplot of Fig. 10 shows that the -RF and DISC had the largest Pearson correlation for from 1 to 9, and the -RF and the CASet had the largest Pearson correlation for . Moreover, all the Pearson correlations decreased when changed from 1 to 15. This trend was not observed in the first case. This decreasing trend could be the result of the fact that difference in label sets contributes to -RF more as increases.
6 Clustering Trees with the -RF
A test was designed to demonstrate which of the -RF, CASet, DISC, and GRF is good at clustering labeled trees.
We generated randomly 5 tree families each containing 50 trees using the program reported by Jahn et al. (2021). The nodes were labeled by the subsets of a 30-label set in the trees of each family. The label sets used in different tree families were different, but overlapping. As the nodes were labeled by disjoint subsets, each different label between the label sets of two trees induces at least different pairs, where is the degree of the node with the label. Thus, a large number of different elements between the label sets could make the trees more distinguishable by the -RF. Therefore, the label sets used for the different tree families differed in only one label.
We computed the pairwise dissimilarity scores for all 250 trees in the five groups using each measure; we then clustered the 250 trees into clusters using the -means algorithm, where ranges from to . The clustering results were assessed using the Silhouette score (Kaufman and Rousseeuw, 2009).

As Fig. 11 illustrates, neither of the CASet, DISC, and GRF distances were able to recognize the exact number of families. However, CASet had the highest Silhouette score when the number of clusters was 5, compared to DISC, GRF, and the -RF for . In addition, the figure shows that the -RF could recognize the correct number of families when ranges from 12 to 19. Moreover, the Silhouette score of the -RF increased when increased from to . This interesting observation may stem from the fact that as increases, the number of pairs of trees achieving the highest possible -RF score also increases, thereby enhancing the recognizability of families. It’s worth noting that such pairs are guaranteed to exist when is larger than the minimum diameter of the trees, which is 8 in our case.
7 Conclusions
The development of an efficient and robust measure for the comparison of labeled trees is important. In this paper, we have proposed a novel variant of dissimilarity metrics, namely the -RF, tailored for labeled trees. The -RF facilitates the analysis of local structures in labeled trees, accommodating nodes labeled with (not necessarily the same) multisets. Significantly, these metrics find practical applicability in mutation trees used in cancer research.
The RF distance is succinctly expressed as -RF within the space of labeled trees with nodes. By setting to a value smaller than , the -RF metric can capture analogous local regions in two labeled trees. Notably, for every , the -RF is a pseudometric for multiset-labeled trees and becomes a metric in the space of 1-labeled trees. However, the distribution of pairwise -RF scores in the space of 1-labeled unrooted (or rooted) trees conforms to a Poisson distribution specifically for , and unlikely have the same trend for other values of .
We verified the -RF measures through a comprehensive comparison with CASet, DISC (DiNardo et al. (2020)) and GRF (Llabrés et al. (2021)) on randomly labeled trees generated by a house-made program (Jahn et al. (2021)). Our findings revealed a consistent positive correlation between -RF and each of the other three measures for every value of . Notably, the correlation values exhibited a tendency to be higher when the measures were applied to assess mutation trees with identical label sets. Furthermore, our study underscored the superior clustering capabilities of -RF compared to the three mentioned measures.
We would like to emphasize that selecting an appropriate -RF in practical applications lacks a universal rule of thumb, primarily due to a shortage of experience in this domain. Perhaps a judicious approach involves choosing a suitable -RF by carefully considering the topological similarity among the trees under consideration.
Future work includes how to apply the -RF to designing tree inference algorithms like GraPhyC (Govek et al., 2018) and also how to infer the exact frequency distribution of the -RF for each . It is also interesting to investigate the generalization of RF-distance for clonal trees (Llabrés et al., 2020).
The computer program for the -RF can be downloaded from https://github.com/Elahe-khayatian/k-RF-measures.git.
Acknowledgments
The authors would like to thank the anonymous reviewer for providing helpful suggestions and comments to our first submission of the work. This research was partially supported by the the Ministerio de Ciencia e Innovación (MCI), the Agencia Estatal de Investigación (AEI) and the European Regional Development Funds (ERDF) through project METACIRCLE PID2021-126114NB-C44, also supported by the European Regional Development Fund (FEDER), by the Agency for Management of University and Research Grants (AGAUR) through grant 2017-SGR-786 (ALBCOM), and by Singapore MOE Tier 1 grant R-146-000-318-114.
References
- Briand et al. (2020) Briand, S., Dessimoz, C., El-Mabrouk, N., Lafond, M., and Lobinska, G. (2020). A generalized Robinson-Foulds distance for labeled trees. BMC Genomics, 21(Suppl 10):779.
- Briand et al. (2022) Briand, S., Dessimoz, C., El-Mabrouk, N., and Nevers, Y. (2022). A linear time solution to the labeled Robinson-Foulds distance problem. Systematic Biology, 71(6):1391–1403.
- Camin and Sokal (1965) Camin, J. H. and Sokal, R. R. (1965). A method for deducing branching sequences in phylogeny. Evolution, 19(3):311–326.
- Ciccolella et al. (2021) Ciccolella, S., Bernardini, G., Denti, L., Bonizzoni, P., Previtali, M., and Vedova, G. D. (2021). Triplet-based similarity score for fully multilabeled trees with poly-occurring labels. Bioinformatics, 37(2):178–184.
- DiNardo et al. (2020) DiNardo, Z., Tomlinson, K., Ritz, A., and Oesper, L. (2020). Distance measures for tumor evolutionary trees. Bioinformatics, 36(7):2090–2097.
- Estabrook et al. (1985) Estabrook, G. F., McMorris, F. R., and Meacham, C. A. (1985). Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34(2):193–200.
- Farris (1977) Farris, J. S. (1977). Phylogenetic analysis under Dollo’s law. Systematic Biology, 26(1):77–88.
- Govek et al. (2018) Govek, K., Sikes, C., and Oesper, L. (2018). A consensus approach to infer tumor evolutionary histories. In Proc. 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB ’18), pages 63–72, New York, NY. ACM Press.
- Jahn et al. (2021) Jahn, K., Beerenwinkel, N., and Zhang, L. (2021). The Bourque distances for mutation trees of cancers. Algorithms for Molecular Biology, 16:9.
- Jűrgensen (2020) Jűrgensen, H. (2020). Multisets, heaps, bags, families: What is a multiset? Mathematical Structures in Computer Science, 30(2):139–158.
- Karpov et al. (2019) Karpov, N., Malikic, S., Rahman, M. K., and Sahinalp, S. C. (2019). A multi-labeled tree dissimilarity measure for comparing clonal trees of tumor progression. Algorithms for Molecular Biology, 14:17.
- Kaufman and Rousseeuw (2009) Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: An introduction to cluster analysis. John Wiley & Sons.
- Li et al. (1996) Li, M., Tromp, J., and Zhang, L. (1996). On the nearest neighbour interchange distance between evolutionary trees. Journal of Theoretical Biology, 182(4):463–467.
- Llabrés et al. (2020) Llabrés, M., Rosselló, F., and Valiente, G. (2020). A generalized Robinson-Foulds distance for clonal trees, mutation trees, and phylogenetic trees and networks. In Proc. 11th ACM Int. Conf. Bioinformatics, Computational Biology and Health Informatics, pages 13:1–13:10, New York, NY. ACM Press.
- Llabrés et al. (2021) Llabrés, M., Rosselló, F., and Valiente, G. (2021). The generalized Robinson-Foulds distance for phylogenetic trees. Journal of Computational Biology, 28(12):1–15.
- Robinson (1971) Robinson, D. F. (1971). Comparison of labeled trees with valency three. Journal of Combinatorial Theory, 11(2):105–119.
- Robinson and Foulds (1981) Robinson, D. F. and Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1–2):131–147.
- Sashittal et al. (2023) Sashittal, P., Zhang, H., Iacobuzio-Donahue, C. A., and Raphael, B. J. (2023). Condor: Tumor phylogeny inference with a copy-number constrained mutation loss model. bioRxiv [Preprint].
- Schwartz and Schäffer (2017) Schwartz, R. and Schäffer, A. A. (2017). The evolution of tumour phylogenetics: Principles and practice. Nature Reviews Genetics, 18(4):213–229.
- Steel and Penny (1993) Steel, M. A. and Penny, D. (1993). Distributions of tree comparison metrics: Some new results. Systematic Biology, 42(2):126–141.
- Williams and Clifford (1971) Williams, W. T. and Clifford, H. T. (1971). On the comparison of two classifications of the same set of elements. Taxon, 20(4):519–522.