The $k$-Robinson-Foulds Dissimilarity Measures for Comparison of Labeled Trees

Elahe Khayatian1    Gabriel Valiente2    Louxin Zhang1∗
1Department of Mathematics, National University of Singapore,
Singapore 119076
2Department of Computer Science,
Technical University of Catalonia, E-08034 Barcelona, Spain
Corresponding author: E-mail: matzlx@nus.edu.sg
Abstract

Understanding the mutational history of tumor cells is a critical endeavor in unraveling the mechanisms underlying cancer. Since the modeling of tumor cell evolution employs labeled trees, researchers are motivated to develop different methods to assess and compare mutation trees and other labeled trees. While the Robinson-Foulds distance is a widely utilized metric for comparing phylogenetic trees, its applicability to labeled trees reveals certain limitations. This paper introduces the $k$-Robinson-Foulds dissimilarity measures, tailored to address the challenges of labeled tree comparison. The Robinson-Foulds distance is succinctly expressed as $n$-RF in the space of labeled trees with $n$ nodes. Like the Robinson-Foulds distance, the $k$-Robinson-Foulds is a pseudometric for multiset-labeled trees and becomes a metric in the space of 1-labeled trees. By setting $k$ to a small value, the $k$-Robinson-Foulds dissimilarity can capture analogous local regions in two labeled trees with different sizes or different labels.

keywords:
Phylogenetic trees, mutation trees, labeled trees, Robinson-Foulds distance, $k$-Robinson-Foulds dissimilarity

1 Introduction

Trees are a fundamental concept in biology, as they depict the evolutionary history of entities. These entities may consist of organisms, species, proteins, genes or genomes. Trees are also useful for healthcare analysis and medical diagnosis. The introduction of different kinds of tree models has raised the question of how these models can be efficiently compared for evaluation, which in turn calls for a robust dissimilarity measure in the space of targeted trees. For example, mutation/clonal trees are introduced to model tumor evolution, in which nodes represent cellular populations and are labeled by the gene mutations carried by those populations (Karpov et al., 2019; Schwartz and Schäffer, 2017). The progression of tumors varies among different patients; additionally, information about such variations is significant for cancer treatment. Therefore, dissimilarity measures for mutation trees have become a focus of recent research (DiNardo et al., 2020; Jahn et al., 2021; Llabrés et al., 2021; Karpov et al., 2019).

In prior studies on species trees, several measures have been introduced to compare two phylogenetic trees. Some examples of such distances are Robinson-Foulds distance (RF) (Robinson and Foulds, 1981), Nearest-Neighbor Interchange (NNI) (Li et al., 1996; Robinson, 1971), Quartet distance (Estabrook et al., 1985), and Path distance (Steel and Penny, 1993; Williams and Clifford, 1971). Although these distances have been widely used for phylogenetic trees, they are defined based on the assumption that the involved trees have the same label sets. Moreover, only leaves of phylogenetic trees are labeled. Thus, these distances are not useful for comparing trees with different label sets or trees in which all the nodes are labeled.

1.1 Related Work on Comparison of Labeled Trees

To get around some limitations of the above-mentioned distances in the comparison of mutation trees, researchers have introduced new measures for mutation trees. Some of these measures are Common Ancestor Set distance (CASet) (DiNardo et al., 2020), Distinctly Inherited Set Comparison distance (DISC) (DiNardo et al., 2020), and Multi-Labeled Tree Dissimilarity measure (MLTD) (Karpov et al., 2019). While these distance measures enable efficient comparison of clonal trees, they are defined based on the assumption that mutations cannot occur more than once and mutations will not be lost in the course of tumor evolution. As a result, these metrics exhibit multiple limitations when applied to the comparison of trees used to model complex tumor evolution, wherein mutations may indeed occur multiple times and subsequently be lost.

In addition to the three measures mentioned above, a group of other dissimilarity measures have been introduced for the comparison of mutation trees, including Parent-Child Distance (Govek et al., 2018) and Ancestor-Descendant Distance (Govek et al., 2018). These measures are metrics only in the space of ‘1-mutation’ trees, in which each node is labeled by exactly one mutation. They are again defined under the assumptions mentioned above, namely that a mutation occurs at most once and is never lost.

Additionally, there are other measures for mutation trees that are defined by generalization. In such methods, researchers aim to extend the definition of an existing distance, which is mostly used to compare phylogenetic trees, to mutation trees. For example, the generalized Nearest Neighbour Interchange (gNNI) (Jahn et al., 2021) is defined by some minor modifications of NNI, which was first defined for the comparison of phylogenetic trees. Another example is the Path Distance (Govek et al., 2018), which was also first defined for the comparison of phylogenetic trees. Although these measures are applicable to mutation trees, they are only well defined for mutation trees with the same label sets (Jahn et al., 2021; Govek et al., 2018).

Apart from the measures mentioned above, another recently proposed distance is the generalized RF distance (GRF) (Llabrés et al., 2020, 2021). This measure allows for the comparison of phylogenetic trees, phylogenetic networks, mutation and clonal trees. An important point about this distance is that its value depends on the intersection between clusters or clones of trees. However, this intersection does not contribute to the RF distance. In fact, if two clusters or clones of two trees are different, their contribution to the RF distance is 1; otherwise, it is 0. Hence, the generalized RF distance has a better resolution than the RF distance. However, it is defined based on the assumption that two distinct nodes in each tree are labeled by two disjoint sets (Llabrés et al., 2020).

There are some other generalizations of the RF distance, such as Bourque distance (Jahn et al., 2021). This distance is effective for comparing mutation trees whose nodes are labeled by non-empty sets and has linear time complexity. However, like the above distances, it does not allow for multiple occurrences of mutations during the tumor history (Jahn et al., 2021). Other generalizations of the RF distance have also been proposed for gene trees (Briand et al., 2020, 2022).

The above-mentioned measures are not able to quantify similarity or difference of some tree models. Two instances of such models are the Dollo (Farris, 1977) and the Camin-Sokal model (Camin and Sokal, 1965). The reason behind the inadequacy of the measures for these models is that it is possible for mutations to get lost after they are gained in the Dollo model, and a mutation can occur more than once during the tumor history in the Camin-Sokal model (Llabrés et al., 2020). Hence, some measures are needed to mitigate the problem. To the best of our knowledge, Triplet-based Distance (Ciccolella et al., 2021) is the only measure introduced so far to resolve the issue. The distance is useful for comparing mutation trees whose nodes are labeled by non-empty sets; additionally, it allows for multiple occurrences of mutations during the tumor history and losing a mutation after it is gained (Ciccolella et al., 2021). Thus, the measure is applicable to the broader family of trees in which two nodes of a tree may have non-disjoint sets of labels. Nevertheless, it is not able to compare those trees in which there is a node whose label has more than one copy of a mutation.

Although no tree model has been introduced so far that allows for more than one copy of a mutation in the label of a single node, current models can be extended to deal with copy number of mutations. For example, the constrained $k$-Dollo model (Sashittal et al., 2023) takes the variant read count and the total read count of each mutation in each cell, derived from single-cell DNA sequencing data, as input; then, based on three thresholds for the variant read count, the total read count, and the variant allele frequency, it decides whether a mutation is present or absent in a cell or it is missing (Sashittal et al., 2023). Alternatively, the model can consider the exact frequency numbers to show the multiplicity of each mutation in each cell. This motivates us to introduce new distances that can be used to compare pairs of labeled trees whose nodes are labeled by non-empty multisets.

1.2 Our Contributions to Tree Comparison

In this paper, we present $k$-RF dissimilarity measures designed for the comparison of labeled trees. They are first defined for 1-labeled trees (Section 3). Subsequently, we extend these measures to multiset-labeled trees (Section 5). We delve into the mathematical properties of the $k$-RF measures in Sections 4 and 5. In particular, $k$-RF is a metric for 1-labeled trees. We also assess the validity of the $k$-RF measures through comparisons with CASet, DISC, and GRF (Section 5), and the evaluation of their performance in the context of tree clustering (Section 6).

2 Concepts and Notations

A graph consists of a set of nodes and a set of edges that are each an unordered pair of distinct nodes, whereas a directed graph consists of a set of nodes and a set of directed edges that are each an ordered pair of distinct nodes.

Let $G$ be a (directed) graph. We use $V(G)$ and $E(G)$ to denote its node and edge set, respectively. If $G$ is undirected, $(u,v)$ will still be used to denote an edge between $u$ and $v$ with the understanding that $(u,v)=(v,u)$. Let $u,v\in V(G)$. If $(u,v)\in E(G)$, we say that $u$ and $v$ are adjacent, the edge $(u,v)$ is incident to $u$ and $v$, or $u$ and $v$ are two endpoints of $(u,v)$.

The degree of $v$ is defined as the number of edges incident to $v$. In addition, if $G$ is directed, the indegree and outdegree of $v$ are defined as the number of edges $(x,y)$ such that $y=v$ and $x=v$, respectively. The nodes of degree 1 are called the leaves in an undirected graph, whereas the nodes of indegree 1 and outdegree 0 are called the leaves in a directed graph. We use $\mathit{Leaf}(G)$ to denote the leaf set for $G$. Non-leaf nodes are called internal nodes.

A path of length $k$ from $u$ to $v$ consists of a sequence of nodes $u_{0},u_{1},\ldots,u_{k}$ such that $u_{0}=u$, $u_{k}=v$ and $(u_{i-1},u_{i})\in E(G)$ for $i=1,2,\ldots,k$. The distance from $u$ to $v$, denoted as $d_{G}(u,v)$, is the length of a shortest path from $u$ to $v$, and it is set to $\infty$ if there is no path from $u$ to $v$. Note that if $G$ is undirected, $d_{G}(u,v)=d_{G}(v,u)$ for all $u,v\in V(G)$. The diameter of $G$, denoted as $\textrm{diam}(G)$, is defined as $\max_{u,v\in V(G)}d_{G}(u,v)$. If $G$ is directed, its diameter is defined as the diameter of its undirected version that has the node set $V(G)$ and edge set $E(G)\cup\{(u,v)\mid(v,u)\in E(G)\}$.
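The distance and diameter definitions above can be illustrated with a breadth-first search. The following is a minimal sketch rather than part of the paper; the function names and the small example tree are assumptions made for illustration, and the graph is assumed to be connected, so no distance is infinite.

```python
from collections import deque

def undirected_adjacency(edges):
    """Build the adjacency sets of the undirected version of a graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs_distances(adj, source):
    """Shortest-path lengths (numbers of edges) from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(edges):
    """Largest distance between two nodes; edge directions are ignored."""
    adj = undirected_adjacency(edges)
    return max(max(bfs_distances(adj, s).values()) for s in adj)

# Hypothetical example: the path a-b-c-d with an extra leaf e attached to b.
example_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "e")]
example_diameter = diameter(example_edges)
```

Because `diameter` symmetrizes the edges first, the same call covers the directed case, where the diameter is defined through the undirected version of the graph.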

2.1 Trees

A tree $T$ is a graph in which there is exactly one path from every node to any other node. It is binary if every internal node is of degree 3. It is a line tree if every internal node is of degree 2. Each line tree has exactly two leaves.

2.2 Rooted Trees

A rooted tree is a directed tree with a specific root node where the edges are oriented away from the root. In a rooted tree, there is exactly one edge entering $u$ for every non-root node $u$, and thus there is a unique path from its root to every other node.

Let $T$ be a rooted tree and $u,v\in V(T)$. If $(u,v)\in E(T)$, $v$ is called a child of $u$ and $u$ is called the parent of $v$. In general, for $u\neq v$, if $u$ belongs to the unique path from $\mathit{root}(T)$ to $v$, $v$ is said to be a descendant of $u$, and $u$ is said to be an ancestor of $v$. We use $C_{T}(u)$, $A_{T}(u)$ and $D_{T}(u)$ to denote the set of all children, ancestors and descendants of $u$, respectively. Note that $u\notin A_{T}(u)$ and $u\notin D_{T}(u)$.

A binary rooted tree is a rooted tree in which the root is of indegree 0 and outdegree 2, and every other internal node is of indegree 1 and outdegree 2. A rooted line tree is a rooted tree in which each internal node has only one child. A rooted caterpillar tree is a rooted tree in which every internal node has at most one child that is internal.

2.3 Labeled Trees

Let $L$ be a set and $\mathbb{P}(L)$ be the set of all subsets of $L$. A tree or rooted tree $T$ is labeled with the subsets of $L$ if $T$ is equipped with a function $\ell:V(T)\to\mathbb{P}(L)$ such that $\cup_{v\in V(T)}\ell(v)=L$, and $\ell(v)\neq\emptyset$ for every $v\in V(T)$. In particular, if $\ell(v)$ contains exactly one element for each $v\in V(T)$ and $\ell$ is one-to-one, $T$ is said to be a $1$-labeled tree on $L$. In addition, if $T$ is 1-labeled on $L$, then for $C\subseteq V(T)$, $L(C)$ is defined as $L(C)=\{x\in L\mid\exists w\in C:\ell(w)=\{x\}\}$.

2.4 Phylogenetic and Mutation Trees

Let $X$ be a finite taxon set. A phylogenetic tree (respectively, rooted phylogenetic tree) on $X$ is a binary tree (respectively, binary rooted tree) in which the leaves are uniquely labeled by the taxa of $X$ and the internal nodes are unlabeled.

A mutation tree on a set $M$ of mutated genes is a rooted tree in which nodes are labeled with nonempty subsets of $M$.

2.5 Dissimilarity Measures for Trees

Let $\mathcal{T}$ be a set of trees. A dissimilarity measure on $\mathcal{T}$ is a symmetric real function $d:\mathcal{T}\times\mathcal{T}\to\mathbb{R}^{\geqslant 0}$. A dissimilarity measure should capture the idea that the more similar two trees are, the lower their measure value is. A pseudometric on $\mathcal{T}$ is a dissimilarity measure that satisfies the triangle inequality condition. Finally, a metric (distance) on $\mathcal{T}$ is a pseudometric $d$ such that $d(S,T)\neq 0$ unless $S$ and $T$ are the same tree.

3 The $k$-RF Measure for 1-Labeled Trees

In this section, we first recall the definition of the RF distance and then present $k$-RF dissimilarity measures for 1-labeled trees for arbitrary $k$.

3.1 The $k$-RF Measure for 1-Labeled Unrooted Trees

Let $X$ be a set of labels and let $T$ be a 1-labeled tree over $X$. Each $e=(u,v)\in E(T)$ induces a pair of label subsets on $X$:

$$P_{T}(e)=\left\{L(B_{e}(u)),L(B_{e}(v))\right\}, \tag{1}$$

where

$$B_{e}(u)=\{w\mid d_{T}(w,u)<d_{T}(w,v)\},\qquad B_{e}(v)=\{w\mid d_{T}(w,v)<d_{T}(w,u)\}. \tag{2}$$

We further define:

$$\mathcal{P}(T)=\{P_{T}(e)\mid e\in E(T)\}. \tag{3}$$

The RF distance of two 1-labeled trees $S$ and $T$ is defined as:

$$d_{RF}(S,T)=|\mathcal{P}(S)\bigtriangleup\mathcal{P}(T)|. \tag{4}$$
Figure 1: Three 1-labeled trees in Example 1, illustrating that the Robinson-Foulds distance heavily penalizes trees with different labels. Although $T$ and $\acute{S}$ differ only in the label of one node, the RF distance is 4 for $S$ and $T$, but 12 for $\acute{S}$ and $T$.
Example 1.

Consider the three 1-labeled trees in Figure 1. We have $d_{RF}(S,T)=4$, because $P_{T}(e_{1})$ to $P_{T}(e_{6})$ are:

$$\begin{array}{lll}
\{\{a,b,c,d,e,f\},\{g\}\}, & \{\{a,b,c,d,f\},\{e,g\}\}, & \{\{a,b,c,d\},\{e,f,g\}\},\\
\{\{a,e,f,g\},\{b,c,d\}\}, & \{\{a,c,d,e,f,g\},\{b\}\}, & \{\{a,b,d,e,f,g\},\{c\}\},
\end{array}$$

respectively, whereas $P_{S}(\bar{e}_{1})$ to $P_{S}(\bar{e}_{6})$ are:

$$\begin{array}{lll}
\{\{a,b,c,d,f,g\},\{e\}\}, & \{\{a,b,c,d,e,g\},\{f\}\}, & \{\{a,b,c,d\},\{e,f,g\}\},\\
\{\{a,e,f,g\},\{b,c,d\}\}, & \{\{a,b,d,e,f,g\},\{c\}\}, & \{\{a,c,d,e,f,g\},\{b\}\},
\end{array}$$

respectively. The two collections share four pairs, and only the first two pairs of each list are missing from the other list. However, $d_{RF}(\acute{S},T)=12$, even though $T$ and $\acute{S}$ have the same topology and only one node is labeled differently.

The above example indicates that the RF cannot capture local similarity (and difference) for 1-labeled trees if they have different labels. One popular dissimilarity measure for sets is the Jaccard distance. It is obtained by dividing the size of the symmetric difference of two sets by the size of their union. Two 1-labeled trees are identical if and only if they have the same set of edges. Therefore, we propose to use $|E(S)\bigtriangleup E(T)|$ and its generalization to measure the dissimilarity for 1-labeled trees $S$ and $T$.
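As a small illustration of the two quantities named in this paragraph, the sketch below computes the Jaccard distance of two sets and the size of the symmetric difference of two edge sets. The function name and the two three-node trees are hypothetical and chosen only for illustration; each undirected edge is stored as a `frozenset` of its two endpoint labels, which is one possible encoding and not one prescribed by the paper.

```python
def jaccard_distance(a, b):
    """Size of the symmetric difference divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty sets
    return len(a ^ b) / len(union)

# Two hypothetical 1-labeled trees on the labels a, b, c.
edges_s = {frozenset(("a", "b")), frozenset(("b", "c"))}
edges_t = {frozenset(("a", "b")), frozenset(("a", "c"))}

edge_difference = len(edges_s ^ edges_t)          # |E(S) symmetric-difference E(T)|
edge_jaccard = jaccard_distance(edges_s, edges_t)
```

Here the two trees share the edge between a and b and differ in one edge each, so the symmetric difference has two elements out of a union of three.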

Let $k$ be a non-negative integer and let $T$ be a $1$-labeled tree. Each edge $e=(u,v)$ induces the following pair of subsets of labels:

$$P_{T}(e,k)=\{L(B_{e}(u,k)),L(B_{e}(v,k))\}, \tag{5}$$

where

$$B_{e}(x,k)=\{w\in B_{e}(x)\mid d_{T}(w,x)\leqslant k\},\quad x=u,v.$$

Clearly, $B_{e}(u,\infty)=B_{e}(u)$ and $B_{e}(u,0)=\{u\}$. We further define:

$$\mathcal{P}_{k}(T)=\{P_{T}(e,k)\mid e\in E(T)\}. \tag{6}$$

Definition 1.

Let $k\geqslant 0$ and let $S$ and $T$ be two 1-labeled trees. The $k$-RF dissimilarity score of $S$ and $T$ is defined as:

$$d_{k\text{-RF}}(S,T)=|\mathcal{P}_{k}(S)\bigtriangleup\mathcal{P}_{k}(T)|. \tag{7}$$
Example 2.

Continuing with Example 1, we have $d_{1\text{-RF}}(\acute{S},T)=4$, as $P_{T}(e_{i},1)$ for $1\leqslant i\leqslant 6$ are:

$$\begin{array}{lll}
\{\{g\},\{e,f\}\}, & \{\{e,g\},\{a,f\}\}, & \{\{e,f\},\{a,d\}\},\\
\{\{a,f\},\{b,c,d\}\}, & \{\{b\},\{a,c,d\}\}, & \{\{c\},\{a,b,d\}\},
\end{array}$$

respectively, and $P_{\acute{S}}(\acute{e}_{i},1)$ for $1\leqslant i\leqslant 6$ are:

$$\begin{array}{lll}
\{\{h\},\{e,f\}\}, & \{\{e,h\},\{a,f\}\}, & \{\{e,f\},\{a,d\}\},\\
\{\{a,f\},\{b,c,d\}\}, & \{\{b\},\{a,c,d\}\}, & \{\{c\},\{a,b,d\}\},
\end{array}$$

respectively. We also have $d_{1\text{-RF}}(S,T)=8$. Thus, $1$-RF captures the difference of the trees better than the RF distance.
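The values in Examples 1 and 2 can be reproduced with a short script. The sketch below is an illustration under assumptions, not the authors' implementation: nodes are identified with their labels, and the three edge lists are inferred from the induced pairs listed in the examples, since Figure 1 itself is not reproduced here. The helper names are hypothetical. Passing `k=None` removes the distance bound and gives the RF distance of Eqn. (4).

```python
from collections import deque

def adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def side(adj, start, blocked, k):
    """B_e(start, k) for the edge e = (start, blocked): the nodes on start's side
    of e that are within distance k of start (k=None means no bound)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        x = queue.popleft()
        if k is not None and seen[x] == k:
            continue  # do not expand beyond distance k
        for y in adj[x]:
            crosses_e = (x == start and y == blocked)
            if y not in seen and not crosses_e:
                seen[y] = seen[x] + 1
                queue.append(y)
    return frozenset(seen)

def induced_pairs(edges, k=None):
    """The set P_k(T) of unordered pairs of label subsets, one per edge."""
    adj = adjacency(edges)
    return {frozenset((side(adj, u, v, k), side(adj, v, u, k))) for u, v in edges}

def k_rf(edges_s, edges_t, k=None):
    return len(induced_pairs(edges_s, k) ^ induced_pairs(edges_t, k))

# Edge lists inferred from the pairs listed in Examples 1 and 2 (an assumption).
T = [("g", "e"), ("e", "f"), ("f", "a"), ("a", "d"), ("d", "b"), ("d", "c")]
S = [("e", "g"), ("f", "g"), ("g", "a"), ("a", "d"), ("d", "b"), ("d", "c")]
S_acute = [("h", "e"), ("e", "f"), ("f", "a"), ("a", "d"), ("d", "b"), ("d", "c")]

rf_S_T = k_rf(S, T)                    # RF distance of S and T
rf_Sacute_T = k_rf(S_acute, T)         # RF distance of S-acute and T
one_rf_Sacute_T = k_rf(S_acute, T, 1)  # 1-RF of S-acute and T
one_rf_S_T = k_rf(S, T, 1)             # 1-RF of S and T
```

Under these assumed edge lists, the four values agree with those stated in the examples (4, 12, 4 and 8, respectively).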

3.2 The $k$-RF Measure for 1-Labeled Rooted Trees

Let $k\geqslant 0$ be an integer and let $T$ be a $1$-labeled rooted tree. For a node $w\in V(T)$, we define $B_{k}(w)$ and $D_{k}(w)$ as:

$$B_{k}(w)=\{x\in V(T)\mid\exists y\in A_{T}(w)\cup\{w\}:d(y,w)+d(y,x)\leqslant k\}, \tag{8}$$

$$D_{k}(w)=\{w\}\cup\{x\in D_{T}(w)\mid d(w,x)\leqslant k\}. \tag{9}$$

For each $e=(u,v)\in E(T)$, we define $P_{T}(e,k)$ as the following ordered pair of two label subsets:

$$P_{T}(e,k)=(L(D_{k}(v)),L(B_{k}(u)\setminus D_{k}(v))). \tag{10}$$

Here, the first subset of $P_{T}(e,k)$ contains the labels of $v$ and of the descendants that are at distance at most $k$ from $v$, whereas the second subset contains the labels of the other nodes around the edge $e$ within a distance of $k$. Next, we define:

$$\mathcal{P}_{k}(T)=\{P_{T}(e,k)\mid e\in E(T)\}. \tag{11}$$

Definition 2.

Let $k\geqslant 0$ and let $S$ and $T$ be two 1-labeled rooted trees. Then, the $k$-RF dissimilarity between $S$ and $T$ is defined as:

$$d_{k\text{-RF}}(S,T)=|\mathcal{P}_{k}(S)\bigtriangleup\mathcal{P}_{k}(T)|. \tag{12}$$
Example 3.

Consider the two 1-labeled rooted trees $S$ and $T$ in Figure 2. $P_{T}(e_{i},1)$ ($1\leqslant i\leqslant 7$) are the following ordered pairs of label subsets:

$$\begin{array}{llll}
(\{f,h\},\{b,d\}), & (\{c,f,g\},\{b,h\}), & (\{c\},\{f,g,h\}), & (\{g\},\{c,f,h\}),\\
(\{a,d,e\},\{b,h\}), & (\{a\},\{b,d,e\}), & (\{e\},\{a,b,d\}). &
\end{array}$$

$P_{S}(\bar{e}_{i},1)$ ($1\leqslant i\leqslant 7$) are the following ordered pairs of label subsets:

$$\begin{array}{llll}
(\{b,d\},\{c,f\}), & (\{a,d,e\},\{b,c\}), & (\{a\},\{b,d,e\}), & (\{e\},\{a,b,d\}),\\
(\{f,g,h\},\{b,c\}), & (\{g\},\{c,f,h\}), & (\{h\},\{c,f,g\}). &
\end{array}$$

Therefore, $d_{1\text{-RF}}(S,T)=8$.

Figure 2: Two 1-labeled rooted trees used to illustrate the $1$-RF in Example 3.
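Example 3 can be checked with a direct implementation of Eqns. (8) to (12). This is a minimal sketch under assumptions: nodes are identified with their labels, edges are written as (parent, child), and the two edge lists are inferred from the ordered pairs listed in the example because Figure 2 is not reproduced here. All function names are hypothetical.

```python
def children_and_parents(edges):
    children, parent = {}, {}
    for u, v in edges:              # each edge is (parent, child)
        children.setdefault(u, []).append(v)
        children.setdefault(v, [])
        parent[v] = u
    return children, parent

def descendants_within(children, w, k):
    """D_k(w): w together with its descendants at distance at most k."""
    result, frontier = {w}, [w]
    for _ in range(k):
        frontier = [c for x in frontier for c in children[x]]
        result.update(frontier)
    return result

def ball(children, parent, w, k):
    """B_k(w): go up j steps from w to y (y = w or an ancestor of w),
    then down at most k - j steps from y."""
    result, y, j = set(), w, 0
    while j <= k:
        result |= descendants_within(children, y, k - j)
        if y not in parent:         # y is the root
            break
        y, j = parent[y], j + 1
    return result

def rooted_pairs(edges, k):
    """The set P_k(T) of ordered pairs (L(D_k(v)), L(B_k(u) minus D_k(v)))."""
    children, parent = children_and_parents(edges)
    pairs = set()
    for u, v in edges:
        d = descendants_within(children, v, k)
        pairs.add((frozenset(d), frozenset(ball(children, parent, u, k) - d)))
    return pairs

def rooted_k_rf(edges_s, edges_t, k):
    return len(rooted_pairs(edges_s, k) ^ rooted_pairs(edges_t, k))

# Edge lists inferred from the pairs listed in Example 3 (an assumption).
T = [("b", "h"), ("b", "d"), ("h", "f"), ("f", "c"), ("f", "g"), ("d", "a"), ("d", "e")]
S = [("c", "b"), ("c", "f"), ("b", "d"), ("d", "a"), ("d", "e"), ("f", "g"), ("f", "h")]

one_rf = rooted_k_rf(S, T, 1)
zero_rf = rooted_k_rf(S, T, 0)
```

With these assumed trees, `one_rf` matches the value 8 stated in the example, and `zero_rf` equals the number of directed edges that occur in exactly one of the two trees.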

4 Characterization of $k$-RF for 1-Labeled Trees

In order to evaluate $k$-RF, we first provide the mathematical properties of the $k$-RF. We then present experimental results on the frequency distribution of these measures.

4.1 Mathematical Properties

Proposition 1.

Let $S$ and $T$ be two 1-labeled trees.

(a) Let $|L(S)\cap L(T)|\leqslant 2$ and $|E(T)|\geqslant 2$. For any $k\geqslant 1$, $d_{k\text{-RF}}(S,T)=|E(S)|+|E(T)|$.

(b) Assume that $L(S)\neq L(T)$. For $k<\min\{\textrm{diam}(T),\textrm{diam}(S)\}$, $k+1\leqslant d_{k\text{-RF}}(S,T)\leqslant|E(S)|+|E(T)|$. In addition, the second inequality becomes an equality if $k\geqslant\min\{\textrm{diam}(T),\textrm{diam}(S)\}$ and $|L(S)|=|L(T)|$.

(c) Renaming each node with its label, we have $d_{0\text{-RF}}(S,T)=|E(S)\bigtriangleup E(T)|$.

(d) If $k\geqslant\max\{\textrm{diam}(S),\textrm{diam}(T)\}-1$, then $d_{k\text{-RF}}(S,T)=d_{RF}(S,T)$.

Proof.

(a) Note that if $k\geqslant 1$ and $|E(T)|\geqslant 2$, each $P_{T}(e,k)$ involves at least three labels. If $L(S)$ and $L(T)$ have at most two common elements, $P_{T}(e,k)\neq P_{S}(\acute{e},k)$ for every $e\in E(T)$ and $\acute{e}\in E(S)$. Thus, we have $\mathcal{P}_{k}(S)\cap\mathcal{P}_{k}(T)=\emptyset$, implying that $d_{k\text{-RF}}(S,T)=|\mathcal{P}_{k}(S)\bigtriangleup\mathcal{P}_{k}(T)|=|\mathcal{P}_{k}(T)|+|\mathcal{P}_{k}(S)|=|E(S)|+|E(T)|$.

(b) The second inequality follows from the fact that $d_{k\text{-RF}}(S,T)=|\mathcal{P}_{k}(S)\bigtriangleup\mathcal{P}_{k}(T)|\leqslant|\mathcal{P}_{k}(T)|+|\mathcal{P}_{k}(S)|$ and $|\mathcal{P}_{k}(X)|=|E(X)|$ for $X=S,T$. We prove the first inequality as follows.

Let $k<\min\{\textrm{diam}(T),\textrm{diam}(S)\}$. Since $S$ and $T$ are 1-labeled, we identify a node with its label in the trees. Without loss of generality, we may assume that there is a node $v\in V(T)\setminus V(S)$. Define $\mathcal{N}^{(k)}_{T}(v)=\{u\mid d_{T}(u,v)\leqslant k\}$.

If $\mathcal{N}^{(k)}_{T}(v)=V(T)$, then $|\mathcal{N}^{(k)}_{T}(v)|=|V(T)|\geqslant\textrm{diam}(T)+1\geqslant k+2$, as $k<\textrm{diam}(T)$. This also implies that for every $(x,y)\in E(T)$, $d_{T}(v,x)\leqslant k$ and $d_{T}(v,y)\leqslant k$. Since $|E(T)|=|V(T)|-1\geqslant k+1$, there are at least $k+1$ such edges.

If $\mathcal{N}^{(k)}_{T}(v)\neq V(T)$, there exists at least one node $w$ that is at distance $k+1$ or more from $v$. Since $T$ is connected, we let $P(v,w)$ be a path from $v$ to $w$ with the smallest length. Clearly, the first $k+1$ nodes in $P(v,w)$ (including $v$) are all in $\mathcal{N}^{(k)}_{T}(v)$, i.e. at least one end of each of the first $k+1$ edges of $P(v,w)$ is found in $\mathcal{N}^{(k)}_{T}(v)$.

In summary, we have proved that there are at least $k+1$ edges $(x,y)$ such that either $d_{T}(v,x)\leqslant k$ or $d_{T}(v,y)\leqslant k$. For each of these edges $e$, $v$ appears in at least one label subset of $P_{T}(e,k)$ and thus $P_{T}(e,k)\notin\mathcal{P}_{k}(S)$. Therefore, $d_{k\text{-RF}}(S,T)\geqslant|\mathcal{P}_{k}(T)\setminus\mathcal{P}_{k}(S)|\geqslant k+1$.

If $|L(S)|=|L(T)|$ and $k\geqslant\min\{\textrm{diam}(T),\textrm{diam}(S)\}$, then $\mathcal{N}^{(k)}_{T}(v)=V(T)$. Therefore, the induced pair $P_{T}(e,k)$ contains $v$ for every edge $e$ of $T$. On the other hand, the induced pair $P_{S}(e,k)$ does not contain $v$ for each edge $e$ of $S$. Thus, $\mathcal{P}_{k}(S)\cap\mathcal{P}_{k}(T)=\emptyset$ and $d_{k\text{-RF}}(S,T)=|\mathcal{P}_{k}(S)|+|\mathcal{P}_{k}(T)|=|E(S)|+|E(T)|$.

(c) Note that we may represent each node of a 1-labeled tree with its unique label. As a result, $P_{T}(e,0)=e$ and $P_{S}(\bar{e},0)=\bar{e}$ for $e\in E(T)$ and $\bar{e}\in E(S)$. Thus, (c) follows.

(d) It follows from the definition of the $k$-RF. ∎

Lemma 4.1.

Let $k\geqslant 0$ be an integer. $k$-RF satisfies the non-negativity, symmetry and triangle inequality conditions.

Proof 4.2.

Let $k\geqslant 0$. The non-negativity and symmetry conditions are trivial. The triangle inequality $d_{k\text{-RF}}(T_{1},T_{2})\leqslant d_{k\text{-RF}}(T_{1},T_{3})+d_{k\text{-RF}}(T_{3},T_{2})$ is derived from the inclusion $\mathcal{P}_{k}(T_{1})\bigtriangleup\mathcal{P}_{k}(T_{2})\subseteq(\mathcal{P}_{k}(T_{1})\bigtriangleup\mathcal{P}_{k}(T_{3}))\cup(\mathcal{P}_{k}(T_{3})\bigtriangleup\mathcal{P}_{k}(T_{2}))$, which holds for any three 1-labeled trees $T_{1},T_{2},T_{3}$.
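The set inclusion used in this proof holds for arbitrary finite sets, not only for sets of induced pairs. The following sketch spot-checks it on random subsets of a small universe; it is an illustration under assumptions (the function name, the seed and the universe size are arbitrary choices) and not a substitute for the proof.

```python
import random

def inclusion_holds(a, b, c):
    """Check that a ^ b is a subset of (a ^ c) | (c ^ b), where ^ is symmetric difference."""
    return (a ^ b) <= ((a ^ c) | (c ^ b))

random.seed(0)
universe = range(10)
all_hold = True
for _ in range(1000):
    a, b, c = (set(random.sample(universe, random.randint(0, 10))) for _ in range(3))
    # The inclusion implies the triangle inequality on the sizes.
    all_hold = all_hold and inclusion_holds(a, b, c)
    all_hold = all_hold and len(a ^ b) <= len(a ^ c) + len(c ^ b)
```

The size inequality follows because an element in exactly one of the two outer sets must disagree with the middle set on at least one side.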

Remark 4.3.

Proposition 1 and Lemma 4.1 can be proved in the same manner for 1-labeled rooted trees.

Proposition 4.4.

The 0-RF is a metric on the space of all 1-labeled rooted trees.

Proof 4.5.

Let $S$ and $T$ be two 1-labeled rooted trees. By Remark 4.3, it is enough to show that $S$ and $T$ are identical if $d_{0\text{-RF}}(S,T)=0$. By identifying a node with its label in $S$ and $T$, we obtain that $\mathcal{P}_{0}(S)=E(S)$ and $\mathcal{P}_{0}(T)=E(T)$. If $d_{0\text{-RF}}(S,T)=0$, then $|E(T)\bigtriangleup E(S)|=0$ and thus $E(T)=E(S)$, i.e. $S$ and $T$ are identical.

Lemma 4.6.

Let $T$ be a 1-labeled rooted tree with at least two nodes and $\mathcal{L}$ be a subset of $\mathit{Leaf}(T)$. Define $T^{\prime}$ to be the subtree obtained by the removal of all the leaves of $\mathcal{L}$. Then, for any $k$,

$$\mathcal{P}_{k}(T^{\prime})=\{(X\setminus\mathcal{L},Y\setminus\mathcal{L})\mid(X,Y)\in\mathcal{P}_{k}(T)\;\&\;X\cap\mathcal{L}\neq\emptyset\}. \tag{13}$$

Proof 4.7.

Since $T$ is 1-labeled, we identify a node of $T$ with its label in the following discussion. With this convention, for any subset $S$ of nodes, $L(S)=S$.

Let $\bar{E}(T)$ denote the subset of edges incident to a leaf of $\mathcal{L}$, i.e., $\bar{E}(T)=\{(x,y)\in E(T)\mid y\in\mathcal{L}\}$. Then,

$$V(T)=V(T^{\prime})\uplus\mathcal{L},\qquad E(T)=E(T^{\prime})\uplus\bar{E}(T).$$

If $(u,v)\in\bar{E}(T)$, then $v\in\mathcal{L}\subseteq\mathit{Leaf}(T)$ and thus $D_{k}(v)=\{v\}\subseteq\mathcal{L}$.

For an edge $e=(u,v)\in E(T^{\prime})$, $P_{T}(e,k)=(D_{k}(v),B_{k}(u)\setminus D_{k}(v))$. By Eqn. (8) and (9),

$$\begin{aligned}
D_{k}(v) &= [D_{k}(v)\cap V(T^{\prime})]\uplus[D_{k}(v)\cap\mathcal{L}],\\
B_{k}(u)\setminus D_{k}(v) &= [(B_{k}(u)\setminus D_{k}(v))\cap V(T^{\prime})]\uplus[(B_{k}(u)\setminus D_{k}(v))\cap\mathcal{L}]\\
&= \left([B_{k}(u)\cap V(T^{\prime})]\setminus[D_{k}(v)\cap V(T^{\prime})]\right)\uplus[(B_{k}(u)\setminus D_{k}(v))\cap\mathcal{L}].
\end{aligned}$$

If $(u,v)\in E(T^{\prime})$, then $D_{k}(v)\setminus\mathcal{L}=D_{k}(v)\cap V(T^{\prime})\neq\emptyset$ and $(B_{k}(u)\setminus D_{k}(v))\setminus\mathcal{L}=[B_{k}(u)\cap V(T^{\prime})]\setminus[D_{k}(v)\cap V(T^{\prime})]$. Therefore,

$$\left(D_{k}(v)\setminus\mathcal{L},(B_{k}(u)\setminus D_{k}(v))\setminus\mathcal{L}\right)=P_{T^{\prime}}(e,k).$$

This proves Eqn. (13).

Proposition 4.8.

Let $k\geqslant 1$ be an integer. The $k$-RF is a metric in the space of all 1-labeled rooted trees.

Proof 4.9.

Let $S$ and $T$ be two 1-labeled rooted trees. By Remark 4.3, it is enough to show that $d_{k\text{-RF}}(S,T)=0$ (equivalently, $\mathcal{P}_{k}(T)=\mathcal{P}_{k}(S)$) implies that $S$ and $T$ are identical. To this end, we prove by mathematical induction on the number of nodes that $E(T)$ is uniquely determined by $\mathcal{P}_{k}(T)$.

Since $|E(T)|=|\mathcal{P}_{k}(T)|$, $T$ is a single node if and only if $E(T)$ is empty if and only if $\mathcal{P}_{k}(T)$ is empty. Therefore, the single-node tree is uniquely determined by $\mathcal{P}_{k}(T)$.

Assume that $S$ is uniquely determined by $\mathcal{P}_{k}(S)$ for every 1-labeled rooted tree $S$ such that $|V(S)|<n$, where $n\geqslant 2$. Consider a 1-labeled rooted tree $T$ such that $|V(T)|=n$.

For a leaf $v\in\mathit{Leaf}(T)$, there is a unique edge $e=(u,v)$ entering $v$. Note that $k\geqslant 1$. Since $D_{k}(v)=\{v\}$ if and only if $v$ is a leaf, we can identify $v$ from $P_{T}(e,k)=(P_{1},P_{2})\in\mathcal{P}_{k}(T)$ such that $P_{1}=\{v\}$.

For $v\in V(T)\setminus\mathit{Leaf}(T)$, there is a unique edge $e=(u,v)$ entering $v$. Since $k\geqslant 1$, the children of $v$ are all leaves if and only if $D_{k}(v)=\{v\}\cup C_{T}(v)$ if and only if $D_{k}(v)\setminus\mathit{Leaf}(T)=\{v\}$. Therefore, we can identify each $v$ whose children are all leaves from the ordered pairs $(P_{1},P_{2})\in\mathcal{P}_{k}(T)$ such that $P_{1}\setminus\mathit{Leaf}(T)$ contains only $v$.

Let $V^{\prime}$ be the set of all nodes whose children are all leaves and let $D_{T}(V^{\prime})=\cup_{x\in V^{\prime}}C_{T}(x)$. Since $V^{\prime}$ is nonempty, $D_{T}(V^{\prime})\neq\emptyset$. Define $E^{\prime}(T)=\{(x,y)\in E(T)\mid x\in V^{\prime},y\in D_{T}(V^{\prime})\}$.

For the tree $T^{\prime}$ obtained from $T$ by the removal of the leaves of $D_{T}(V^{\prime})$, $|V(T^{\prime})|=|V(T)|-|D_{T}(V^{\prime})|<n$. By Eqn. (13), $\mathcal{P}_{k}(T^{\prime})$ can be efficiently constructed from $\mathcal{P}_{k}(T)$. By the induction hypothesis, $E(T^{\prime})$ is uniquely determined by $\mathcal{P}_{k}(T^{\prime})$. As a result, $E(T)=E(T^{\prime})\cup E^{\prime}(T)$ is determined.

This concludes the proof.

Corollary 4.10.

Let $k\geqslant 0$. The $k$-RF is a metric in the space of all $1$-labeled trees.

Proof 4.11.

If $k=0$, the statement follows from the same proof as for Proposition 4.4. Now, let $S$ and $T$ be two 1-labeled trees and $k\geqslant 1$. By Lemma 4.1, it is enough to show that if $d_{k\text{-RF}}(S,T)=0$ (equivalently, $\mathcal{P}_{k}(T)=\mathcal{P}_{k}(S)$), then $S$ and $T$ are identical. This can be proved in a manner similar to Proposition 4.8.

Lemma 4.12.

Let $k\geqslant 0$ and let $T$ be a 1-labeled rooted tree with $n$ nodes. All subsets $D_{i}(w)=\{w\}\cup\{x\in D_{T}(w)\mid d(w,x)\leqslant i\}$ and $L(D_{i}(w))$ for all nodes $w$ and $i\leqslant k$ can be computed in at most $2(k+1)n$ set operations.

Proof 4.13.

Since $T$ is 1-labeled, we can identify a node of $T$ with its label. In this way, $D_{i}(w)=L(D_{i}(w))$ for all nodes $w$ and $i\leqslant k$. By ordering the $n$ labels, we represent each subset of labels (and each subset of nodes) as an $n$-bit 0-1 string, where the $i$-th bit is 1 if and only if the $i$-th label (node) is in the subset.

The statement is obvious in the case $k=0$, since $D_{0}(w)=\{w\}$ and, clearly, all the $D_{0}(w)$ for $w\in V(T)$ can be computed in at most $2n$ set operations. We assume $k>0$ and prove the statement by induction as follows.

Assume that all the $D_{k-1}(w)$ for $w\in V(T)$ have been computed in at most $2kn$ set operations. Assume $w$ has $d_{w}$ children $u_{1},u_{2},\ldots,u_{d_{w}}$. Then,

\[ D_{k}(w)=\{w\}\cup\left(\cup_{i=1}^{d_{w}}D_{k-1}(u_{i})\right). \]

This implies that $D_{k}(w)$ for all $w$ can be computed from all $D_{k-1}(w)$ using $\sum_{w\in V(T)}(1+d_{w})=2n-1$ set operations. In total, we can compute all subsets $D_{i}(w)$ ($i\leqslant k$ and $w\in V(T)$) in at most $2n-1+2kn\leqslant 2(k+1)n$ set operations.
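The recurrence in this proof can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: it assumes nodes are the integers $0,\ldots,n-1$, stores each subset as an integer bitmask (the $n$-bit string of the proof), and uses the hypothetical names `descendant_sets` and `tree`.

```python
def descendant_sets(children, k):
    """Return a list D where D[i][w] is the bitmask of D_i(w), for 0 <= i <= k.

    children maps every node (an integer in 0..n-1) to the list of its children.
    """
    nodes = list(children)
    D = [{w: 1 << w for w in nodes}]        # D_0(w) = {w}
    for _ in range(k):
        prev, level = D[-1], {}
        for w in nodes:
            mask = 1 << w                    # start from {w}
            for u in children[w]:
                mask |= prev[u]              # one union per child of w
            level[w] = mask
        D.append(level)
    return D


# Hypothetical 5-node tree: 0 -> 1, 2;  1 -> 3;  3 -> 4.
tree = {0: [1, 2], 1: [3], 2: [], 3: [4], 4: []}
D = descendant_sets(tree, 2)
print(bin(D[1][0]), bin(D[2][0]), bin(D[2][1]))  # 0b111 0b1111 0b11010
```

Each level performs one union per child, matching the $\sum_{w}(1+d_{w})$ count used in the proof.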

Lemma 4.14.

Let $k\geqslant 0$ and $T$ be a 1-labeled rooted tree with $n$ nodes. Using $L(D_{i}(w))$ for $w\in V(T)$, $0\leqslant i\leqslant k$, we can compute $L(B_{k}(w))$ for all $w$ in $O(kn)$ set operations, where $B_{k}(w)$ is defined in Eqn. (8).

Proof 4.15.

Since $T$ is a 1-labeled rooted tree, we identify a node with its label. In this way, we just need to show that $B_{k}(w)$ for all nodes $w$ can be computed in $O(kn)$ set operations.

Let $r$ be the root of $T$. For any node $w\in V(T)$, let the unique path from $r$ to $w$ be

\[ w_{0}=r,\, w_{1},\, \ldots,\, w_{t}=w. \]

Then, we have that

\[ B_{k}(w_{t})=\cup_{i=0}^{\min(k,t)}D_{k-i}(w_{t-i}). \]

Given the subsets $D_{i}(u)$ for all $i\leqslant k$ and $u\in V(T)$, the above formula implies that $B_{k}(w_{t})$ can be computed in at most $k$ set operations. In total, we can compute $B_{k}(w)$ for all $w\in V(T)$ in $O(kn)$ set operations.
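The formula for $B_{k}(w_{t})$ can be turned into code by walking at most $k$ steps up from each node. The sketch below is only an illustration under the same assumptions as the proof (1-labeled rooted tree, nodes identified with labels); the bitmask encoding and the names `k_balls` and `tree` are hypothetical.

```python
def k_balls(children, root, k):
    """Return {w: bitmask of B_k(w)} using B_k(w_t) = U_i D_{k-i}(w_{t-i})."""
    nodes = list(children)
    parent = {root: None}
    for w in nodes:
        for u in children[w]:
            parent[u] = w

    # D[i][w] is the bitmask of D_i(w), built level by level as in Lemma 4.12.
    D = [{w: 1 << w for w in nodes}]
    for _ in range(k):
        prev, level = D[-1], {}
        for w in nodes:
            mask = 1 << w
            for u in children[w]:
                mask |= prev[u]
            level[w] = mask
        D.append(level)

    B = {}
    for w in nodes:
        mask, anc, i = 0, w, 0
        while anc is not None and i <= k:    # i runs from 0 to min(k, t)
            mask |= D[k - i][anc]            # union with D_{k-i}(w_{t-i})
            anc = parent[anc]
            i += 1
        B[w] = mask
    return B


# Hypothetical 5-node tree: 0 -> 1, 2;  1 -> 3;  3 -> 4.
tree = {0: [1, 2], 1: [3], 2: [], 3: [4], 4: []}
B1 = k_balls(tree, 0, 1)
B2 = k_balls(tree, 0, 2)
```

In this toy tree, `B2[4]` is the bitmask of $\{1,3,4\}$: node 4 itself, its parent 3, and its grandparent 1.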

Proposition 4.16.

Let $S$ and $T$ be two 1-labeled trees with $n$ nodes and $k\geqslant 0$. Then, $d_{k\text{-RF}}(S,T)$ can be computed in $O(kn^{2})$ time.

Proof 4.17.

We first consider the rooted tree case. Let $S$ and $T$ be two 1-labeled rooted trees with $n$ nodes. Without loss of generality, we may assume that $S$ and $T$ are labeled on the same set $L$, with $|L|=n$. (Otherwise, we can consider them labeled on $L=L(S)\cup L(T)$, with $n\leqslant|L|\leqslant 2n$.) By Lemma 4.12 and Lemma 4.14, we can compute $P_{X}(e,k)$ for all $e\in E(X)$ in $O(kn)$ set operations for $X=S,T$. Since each set operation on $O(n)$-bit strings takes $O(n)$ time, this step takes $O(kn^{2})$ time. Since each edge induces an ordered pair of label subsets and we represent each label subset using an $n$-bit string, we consider $P_{X}(e,k)$ as a $2n$-bit string. In this way, we sort all the edge-induced pairs of label subsets for each tree in $O(n^{2})$ time by radix sort (that is, indexing) and then compute the symmetric difference of the two sets of edge-induced pairs in $O(n^{2})$ time. This concludes the proof for the rooted case.

In the unrooted case, we first root the trees at a leaf. In this way, we can compute all the edge-induced pairs of label subsets in the derived rooted trees in $O(kn^{2})$ time. Since the edges induce unordered pairs of label subsets in the original trees, we rearrange the two label subsets obtained for an edge in such a way that the smallest label in the first subset is smaller than every label in the second one. After the rearrangement, we can radix-sort the edge-induced pairs and compute the $k$-RF score in $O(n^{2})$ time.
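The rearrangement step for unordered pairs can be illustrated as follows. This is a sketch under stated assumptions: label subsets are nonempty, disjoint integer bitmasks, and a hash-based counter replaces the radix sort of the proof. The function names are hypothetical.

```python
from collections import Counter


def lowest_label(mask):
    """Index of the smallest label present in a nonempty bitmask."""
    return (mask & -mask).bit_length() - 1


def canonical_pair(a, b):
    """Order an unordered pair so that the subset holding the smallest label comes first."""
    return (a, b) if lowest_label(a) < lowest_label(b) else (b, a)


def score_from_unordered_pairs(pairs_s, pairs_t):
    """Size of the symmetric difference of two collections of unordered pairs."""
    cs = Counter(canonical_pair(a, b) for a, b in pairs_s)
    ct = Counter(canonical_pair(a, b) for a, b in pairs_t)
    return sum(((cs | ct) - (cs & ct)).values())


# Hypothetical edge-induced pairs over labels {0, 1, 2}.
pairs_s = [(0b001, 0b110), (0b011, 0b100)]
pairs_t = [(0b110, 0b001), (0b101, 0b010)]
print(score_from_unordered_pairs(pairs_s, pairs_t))  # 2
```

After canonical ordering, the first pair of each list coincides, so only one pair per tree contributes to the score.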

Figure 3: The frequency distributions of all pairwise $k$-Robinson-Foulds (RF) scores in the space of 1-labeled unrooted (top row) and rooted (bottom row) 4-node trees for $k=0,1,2$. In the bar charts, the $x$-axis represents $k$-RF scores and the $y$-axis represents the number of tree pairs with a specific $k$-RF score.
Figure 4: The frequency distributions of all pairwise $k$-Robinson-Foulds (RF) scores in the space of 1-labeled unrooted (top row) and rooted (bottom row) 5-node trees for $k\leqslant 3$. In the bar charts, the $x$-axis represents $k$-RF scores and the $y$-axis represents the number of tree pairs with a specific $k$-RF score.
Figure 5: The frequency distributions of all pairwise $k$-Robinson-Foulds (RF) scores in the space of 1-labeled unrooted (top row) and rooted (bottom row) 6-node trees for $k\leqslant 4$. In each bar chart, the $x$-axis represents $k$-RF scores and the $y$-axis represents the number of tree pairs whose $k$-RF equals a given score.
Figure 6: The frequency distributions of all pairwise $k$-Robinson-Foulds (RF) scores in the space of 1-labeled unrooted (top row) and rooted (bottom row) 7-node trees for $k\leqslant 5$. In each bar chart, the $x$-axis represents $k$-RF scores and the $y$-axis represents the number of tree pairs whose $k$-RF equals a given score.

4.2 The Distribution of $k$-RF Scores

We examined the distribution of the $k$-RF dissimilarity scores for 1-labeled unrooted and rooted trees with the same label set and with different label sets.

The frequency distributions of the pairwise $k$-RF scores in the space of $n$-node 1-labeled unrooted and rooted trees for $n$ from 4 to 7 are presented in Figures 3 to 6, respectively. For each $n$, it suffices to consider $k=0,\ldots,n-2$. Recall that $(n-2)$-RF is actually the RF distance. The frequency distribution for the RF distance in the space of phylogenetic trees is known to be Poisson (Steel and Penny, 1993). It also appears that the pairwise 0-RF and $(n-2)$-RF scores have a Poisson distribution in the space of $n$-node 1-labeled unrooted and rooted trees. However, the distribution of the pairwise $k$-RF scores is unlikely to be Poisson when $k=1,2,3$ and $k\neq n-2$.

We examined 1,679,616 (respectively, 60,466,176) pairs of 6-node 1-labeled unrooted (respectively, rooted) trees such that the trees in each pair have $c$ common labels, with $c=3,4,5$. Table 1 shows that the majority of pairs attain the largest dissimilarity score, 10.

Table 1: The number of pairs of 1-labeled 6-node unrooted (top) and rooted (bottom) trees that have $c$ labels in common and have 1-Robinson-Foulds (RF) score $d$, for $c=3,4,5$ and $d=2,4,6,8,10$.
Unrooted trees:
c \ d       2        4         6           8           10
3           0        0         0       3,072    1,676,544
4           0        0       432      16,800    1,662,384
5           0      340     3,720      53,100    1,622,456
Rooted trees:
c \ d       2        4         6           8           10
3           0        0         0      79,872   60,386,304
4           0        0     7,776     419,136   60,039,264
5           0    4,080    65,760   1,310,880   59,085,456
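As a consistency check on the reported totals, note that 1,679,616 and 60,466,176 coincide with $(6^{4})^{2}$ and $(6^{5})^{2}$, the squares of Cayley's counts $n^{n-2}$ and $n^{n-1}$ of labeled unrooted and rooted $n$-node trees for $n=6$. The short script below, with the table transcribed by hand into the hypothetical dictionaries `table_unrooted` and `table_rooted`, verifies that each row of Table 1 sums to the corresponding total.

```python
n = 6
unrooted_pairs = (n ** (n - 2)) ** 2   # ordered pairs of labeled unrooted trees
rooted_pairs = (n ** (n - 1)) ** 2     # ordered pairs of labeled rooted trees

# Rows of Table 1: c -> counts for scores d = 2, 4, 6, 8, 10.
table_unrooted = {
    3: [0, 0, 0, 3_072, 1_676_544],
    4: [0, 0, 432, 16_800, 1_662_384],
    5: [0, 340, 3_720, 53_100, 1_622_456],
}
table_rooted = {
    3: [0, 0, 0, 79_872, 60_386_304],
    4: [0, 0, 7_776, 419_136, 60_039_264],
    5: [0, 4_080, 65_760, 1_310_880, 59_085_456],
}

rows_ok = (all(sum(row) == unrooted_pairs for row in table_unrooted.values())
           and all(sum(row) == rooted_pairs for row in table_rooted.values()))
print(unrooted_pairs, rooted_pairs, rows_ok)  # 1679616 60466176 True
```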

5 A Generalization to Multiset-Labeled Trees

In this section, we extend the measures introduced in Section 3 to multiset-labeled unrooted and rooted trees.

5.1 Multisets and Their Operations

A multiset is a collection of elements in which an element $x$ can occur one or more times (Jürgensen, 2020). The set of all distinct elements appearing in a multiset $A$ is denoted by $\mbox{Supp}(A)$, and the multiplicity of $x$ in $A$ is denoted by $m_{A}(x)$. In this paper, we simply represent $A$ by the monomial $x_{1}^{m_{A}(x_{1})}\cdots x_{n}^{m_{A}(x_{n})}$ if $\mbox{Supp}(A)=\{x_{1},x_{2},\ldots,x_{n}\}$, where $x_{i}^{1}$ is simplified to $x_{i}$ for each $i$.

Let $A$ and $B$ be two multisets. We say $A$ is a sub-multiset of $B$, denoted by $A\subseteq_{m}B$, if for every $x\in\mbox{Supp}(A)$, $m_{A}(x)\leqslant m_{B}(x)$. In addition, we say that $A=B$ if $A\subseteq_{m}B$ and $B\subseteq_{m}A$. Furthermore, the union, sum, intersection, difference, and symmetric difference of $A$ and $B$ are respectively defined as follows:

  • $A\cup_{m}B=\left\{x^{\max\{m_{A}(x),m_{B}(x)\}}\mid x\in\mbox{Supp}(A)\cup\mbox{Supp}(B)\right\}$;

  • $A\uplus_{m}B=\left\{x^{m_{A}(x)+m_{B}(x)}\mid x\in\mbox{Supp}(A)\cup\mbox{Supp}(B)\right\}$;

  • $A\cap_{m}B=\left\{x^{\min\{m_{A}(x),m_{B}(x)\}}\mid x\in\mbox{Supp}(A)\cap\mbox{Supp}(B)\right\}$;

  • $A\setminus_{m}B=\left\{x^{m_{A}(x)-m_{B}(x)}\mid x\in\mbox{Supp}(A):m_{A}(x)>m_{B}(x)\right\}$;

  • $A\triangle_{m}B=(A\cup_{m}B)\setminus_{m}(A\cap_{m}B)$;

where $m_{X}(x)$ is defined as 0 if $x\notin\mbox{Supp}(X)$ for $X=A,B$.
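These five operations map directly onto Python's `collections.Counter`, whose `|`, `+`, `&`, and `-` operators take the maximum, sum, minimum, and positive difference of multiplicities. The following sketch uses two made-up multisets, $A=a^{2}bc$ and $B=ab^{3}d$, purely for illustration.

```python
from collections import Counter

A = Counter("aabc")     # the multiset a^2 b c
B = Counter("abbbd")    # the multiset a b^3 d

union = A | B                       # max of multiplicities
msum = A + B                        # sum of multiplicities
inter = A & B                       # min of multiplicities
diff = A - B                        # keeps x only when m_A(x) > m_B(x)
sym_diff = (A | B) - (A & B)        # symmetric difference as defined above

print(sorted(sym_diff.elements()))  # ['a', 'b', 'b', 'c', 'd']
```

The multiplicity of each element in the symmetric difference equals $|m_{A}(x)-m_{B}(x)|$, which is the quantity that the $k$-RF for multiset-labeled trees sums over.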

Let $L$ be a set and $\mathbb{P}_{m}(L)$ be the set of all sub-multisets on $L$. A tree $T$ is labeled with the sub-multisets of $L$ if $T$ is equipped with a function $\ell:V(T)\to\mathbb{P}_{m}(L)$ such that $\cup_{v\in V(T)}\mbox{Supp}(\ell(v))=L$ and $\ell(v)\neq\emptyset$ for every $v\in V(T)$. We call such a tree a multiset-labeled tree. For $C\subseteq V(T)$ and $x\in L$, we define $L_{m}(C)$ and $m_{T}(x)$ as follows:

\[ L_{m}(C)=\uplus_{v\in C}\ell(v); \tag{14} \]
\[ m_{T}(x)=\sum_{v\in V(T)}m_{\ell(v)}(x). \tag{15} \]
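Eqns. (14) and (15) amount to summing multiplicities over a set of nodes. A minimal sketch, assuming node labels are stored as `Counter` objects in a dictionary (the names `label_sum`, `multiplicity_in_tree`, and the toy labeling are hypothetical):

```python
from collections import Counter


def label_sum(labels, nodes):
    """L_m(C): multiset sum of the labels of the nodes in C (Eqn. (14))."""
    total = Counter()
    for v in nodes:
        total += labels[v]
    return total


def multiplicity_in_tree(labels, x):
    """m_T(x): total multiplicity of x over all node labels (Eqn. (15))."""
    return sum(label[x] for label in labels.values())


# Hypothetical labeling of a 3-node tree with the multisets ac, c^2 and ab^2.
labels = {0: Counter("ac"), 1: Counter("cc"), 2: Counter("abb")}
print(label_sum(labels, [0, 1]))          # Counter({'c': 3, 'a': 1})
print(multiplicity_in_tree(labels, "a"))  # 2
```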

5.2 The $k$-RF for Multiset-Labeled Trees

Let $T$ be a multiset-labeled tree. Then, each edge $e=(u,v)$ of $T$ induces a pair of multisets

\[ P_{T}(e)=\left\{L_{m}(B_{e}(u)),L_{m}(B_{e}(v))\right\}, \tag{16} \]

where $L_{m}()$ is defined in Eqn. (14), and $B_{e}(u)$ in Eqn. (2). Note that Eqn. (16) is obtained from Eqn. (1) by replacing $L()$ with $L_{m}()$.

Remark 5.1.

In a multiset-labeled tree $T$, two edges may induce the same multiset pair, as shown in Figure 7. Hence, $\mathcal{P}(T)$ in Eqn. (3) is a multiset in general.

Figure 7: Two multiset-labeled trees used to show that different edges can give the same multiset pair. Here, $P_{T}(e_{2})=P_{T}(e_{3})=\{abc,a^{2}b^{2}c\}$.

We use Eqn. (16), Eqn. (3) and Eqn. (4) to define the RF distance for multiset-labeled trees by replacing $\triangle$ with $\triangle_{m}$ in Eqn. (4).

Let $k\geqslant 0$. We use Eqn. (5), Eqn. (6), and Eqn. (7) to define the $k$-RF for multiset-labeled trees by replacing $L()$ with $L_{m}()$ in Eqn. (5) and replacing $\triangle$ with $\triangle_{m}$ in Eqn. (7).

Example 5.2.

Consider the multiset-labeled trees $S$, $\acute{S}$, and $T$ in Figure 8. $\mathcal{P}_{k}(T)$, $\mathcal{P}_{k}(S)$ and $\mathcal{P}_{k}(\acute{S})$ for $k=0,1,\infty$ are summarized in Table 2. We obtain:

\[
\begin{array}{lll}
d_{0\text{-RF}}(T,\acute{S})=2; & d_{1\text{-RF}}(T,\acute{S})=6; & d_{\text{RF}}(T,\acute{S})=12;\\
d_{0\text{-RF}}(S,\acute{S})=10; & d_{1\text{-RF}}(S,\acute{S})=12; & d_{\text{RF}}(S,\acute{S})=12.
\end{array}
\]

It is not hard to see that both $d_{0\text{-RF}}(T,\acute{S})$ and $d_{1\text{-RF}}(T,\acute{S})$ reflect the local similarity of the two multiset-labeled trees better than $d_{\text{RF}}(T,\acute{S})$.

Figure 8: Three multiset-labeled trees in Example 5.2.
Table 2: Edge-induced unordered pairs of multisets in the three trees in Fig. 8 for $k=0,1,\infty$.
\[
\begin{array}{clll}
\hline
\textrm{Tree} & \mathcal{P}_{0}(\;) & \mathcal{P}_{1}(\;) & \mathcal{P}_{\infty}(\;)\\
\hline
T & \{c^{2},e^{2}\} & \{c^{2},ce^{2}\} & \{a^{2}b^{2}c^{3}d^{2}e^{2},c^{2}\}\\
 & \{c,e^{2}\} & \{ab^{2}cd^{2},ac^{2}\} & \{a^{2}b^{2}c^{3}d^{2},c^{2}e^{2}\}\\
 & \{ac,c\} & \{ac^{2},c^{2}e^{2}\} & \{a^{2}b^{2}c^{2}d^{2},c^{3}e^{2}\}\\
 & \{ac,d\} & \{ab^{2},ac^{2}d^{2}\} & \{ab^{2}cd^{2},ac^{4}e^{2}\}\\
 & \{ab^{2},d\} & \{acd,ce^{2}\} & \{ab^{2},ac^{5}d^{2}e^{2}\}\\
 & \{cd,d\} & \{a^{2}b^{2}cd,cd\} & \{a^{2}b^{2}c^{4}de^{2},cd\}\\
\hline
S & \{c^{2},e\} & \{ac^{2}e^{2},c^{2}\} & \{a^{2}b^{2}c^{3}d^{2}e^{2},c^{2}\}\\
 & \{ce,e\} & \{a^{2}bc^{2}d,bd\} & \{a^{2}b^{2}c^{2}d^{2},c^{3}e^{2}\}\\
 & \{ac,e\} & \{ab^{2}cd^{2},ace\} & \{ab^{2}cd^{2},ac^{4}e^{2}\}\\
 & \{ac,d\} & \{ac^{3}e,ce\} & \{a^{2}b^{2}c^{4}d^{2}e,ce\}\\
 & \{abc,d\} & \{acd,c^{3}e^{2}\} & \{abc,abc^{4}d^{2}e^{2}\}\\
 & \{bd,d\} & \{abc,abcd^{2}\} & \{a^{2}bc^{5}de^{2},bd\}\\
\hline
\acute{S} & \{c^{2},e^{2}\} & \{c^{2},ce^{2}\} & \{ab^{3}c^{3}d^{2}e^{2},c^{2}\}\\
 & \{c,e^{2}\} & \{ac^{2},c^{2}e^{2}\} & \{ab^{3}c^{3}d^{2},c^{2}e^{2}\}\\
 & \{ac,c\} & \{acd,ce^{2}\} & \{ab^{3}c^{2}d^{2},c^{3}e^{2}\}\\
 & \{ac,d\} & \{ac^{2}d^{2},b^{3}\} & \{ac^{4}e^{2},b^{3}cd^{2}\}\\
 & \{b^{3},d\} & \{ac^{2},b^{3}cd^{2}\} & \{ac^{5}e^{2}d^{2},b^{3}\}\\
 & \{cd,d\} & \{ab^{3}cd,cd\} & \{ab^{3}c^{4}e^{2}d,cd\}\\
\hline
\end{array}
\]
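The scores of Example 5.2 for $T$ and $\acute{S}$ can be re-derived from the $\mathcal{P}_{0}$ and $\mathcal{P}_{1}$ columns of Table 2. The sketch below is only an illustration: it transcribes the pairs by hand, writes a monomial such as $ab^{2}$ as the string `"ab2"`, and uses the hypothetical helpers `mono`, `pair`, and `score`.

```python
import re
from collections import Counter


def mono(s):
    """Expand a monomial such as 'ab2c' into the canonical string 'abbc'."""
    parts = re.findall(r"([a-z])(\d*)", s)
    return "".join(sorted(ch * int(exp or 1) for ch, exp in parts))


def pair(x, y):
    """Canonical form of an unordered pair of multisets."""
    return tuple(sorted((mono(x), mono(y))))


def score(pairs_1, pairs_2):
    """Size of the multiset symmetric difference of two collections of pairs."""
    c1 = Counter(pair(*p) for p in pairs_1)
    c2 = Counter(pair(*p) for p in pairs_2)
    return sum(((c1 | c2) - (c1 & c2)).values())


# Columns P_0 and P_1 of Table 2 for T and S-acute (called S_acute here).
P0_T = [("c2", "e2"), ("c", "e2"), ("ac", "c"), ("ac", "d"), ("ab2", "d"), ("cd", "d")]
P0_S_acute = [("c2", "e2"), ("c", "e2"), ("ac", "c"), ("ac", "d"), ("b3", "d"), ("cd", "d")]
P1_T = [("c2", "ce2"), ("ab2cd2", "ac2"), ("ac2", "c2e2"),
        ("ab2", "ac2d2"), ("acd", "ce2"), ("a2b2cd", "cd")]
P1_S_acute = [("c2", "ce2"), ("ac2", "c2e2"), ("acd", "ce2"),
              ("ac2d2", "b3"), ("ac2", "b3cd2"), ("ab3cd", "cd")]

print(score(P0_T, P0_S_acute), score(P1_T, P1_S_acute))  # 2 6
```

The two values agree with $d_{0\text{-RF}}(T,\acute{S})=2$ and $d_{1\text{-RF}}(T,\acute{S})=6$ in Example 5.2.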

5.3 The $k$-RF for Multiset-Labeled Rooted Trees

Let $k\geqslant 0$ be an integer. We use Eqn. (10), Eqn. (11), and Eqn. (12) to define the $k$-RF for multiset-labeled rooted trees by replacing $L()$ with $L_{m}()$ in Eqn. (10) and replacing $\triangle$ with $\triangle_{m}$ in Eqn. (12).

Proposition 5.3.

Let $k\geqslant 0$ be an integer. The $k$-RF satisfies the non-negativity, symmetry, and triangle inequality conditions. Hence, the $k$-RF is a pseudometric for each $k$ in the space of multiset-labeled (rooted) trees.

Proof 5.4.

The non-negativity and symmetry conditions follow from the definition of the $k$-RF. The triangle inequality condition is proved as follows.

Let $T_{1}$, $T_{2}$, and $T_{3}$ be three multiset-labeled trees. We need to show:

\[ d_{k\text{-RF}}(T_{1},T_{2})\leqslant d_{k\text{-RF}}(T_{1},T_{3})+d_{k\text{-RF}}(T_{3},T_{2}). \]

Note that $\mathcal{P}_{k}(X)$ denotes the multiset of edge-induced pairs of sub-multisets in $X$ for $X=T_{1},T_{2},T_{3}$.

If $x^{m(x)}\in\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{2})$, we have either $x^{m(x)}\in\mathcal{P}_{k}(T_{1})\setminus_{m}\mathcal{P}_{k}(T_{2})$ or $x^{m(x)}\in\mathcal{P}_{k}(T_{2})\setminus_{m}\mathcal{P}_{k}(T_{1})$. Assume $x^{m(x)}\in\mathcal{P}_{k}(T_{1})\setminus_{m}\mathcal{P}_{k}(T_{2})$. Then, $m_{\mathcal{P}_{k}(T_{1})}(x)>m_{\mathcal{P}_{k}(T_{2})}(x)$. If $x\notin\mbox{Supp}(\mathcal{P}_{k}(T_{3})\setminus_{m}\mathcal{P}_{k}(T_{2}))$, we have $m_{\mathcal{P}_{k}(T_{1})}(x)>m_{\mathcal{P}_{k}(T_{2})}(x)\geqslant m_{\mathcal{P}_{k}(T_{3})}(x)$. This implies that $x\in\mbox{Supp}(\mathcal{P}_{k}(T_{1})\setminus_{m}\mathcal{P}_{k}(T_{3}))$ and

\[ m_{\mathcal{P}_{k}(T_{1})\setminus_{m}\mathcal{P}_{k}(T_{3})}(x)=m_{\mathcal{P}_{k}(T_{1})}(x)-m_{\mathcal{P}_{k}(T_{3})}(x)\geqslant m_{\mathcal{P}_{k}(T_{1})}(x)-m_{\mathcal{P}_{k}(T_{2})}(x)=m(x). \]

Thus, $m(x)\leqslant m_{\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{3})}(x)+m_{\mathcal{P}_{k}(T_{3})\triangle_{m}\mathcal{P}_{k}(T_{2})}(x)$.

On the other hand, if $x\in\mbox{Supp}(\mathcal{P}_{k}(T_{3})\setminus_{m}\mathcal{P}_{k}(T_{2}))$ and $m_{\mathcal{P}_{k}(T_{3})}(x)\geqslant m_{\mathcal{P}_{k}(T_{1})}(x)$, we have:

\[
\begin{array}{rcl}
m_{\mathcal{P}_{k}(T_{3})\setminus_{m}\mathcal{P}_{k}(T_{2})}(x) & = & m_{\mathcal{P}_{k}(T_{3})}(x)-m_{\mathcal{P}_{k}(T_{2})}(x)\\
 & \geqslant & m_{\mathcal{P}_{k}(T_{1})}(x)-m_{\mathcal{P}_{k}(T_{2})}(x)=m(x),
\end{array}
\]

so the same upper bound on $m(x)$ holds.

If $x\in\mbox{Supp}(\mathcal{P}_{k}(T_{3})\setminus_{m}\mathcal{P}_{k}(T_{2}))$ and $m_{\mathcal{P}_{k}(T_{3})}(x)<m_{\mathcal{P}_{k}(T_{1})}(x)$, we have $m_{\mathcal{P}_{k}(T_{1})}(x)>m_{\mathcal{P}_{k}(T_{3})}(x)>m_{\mathcal{P}_{k}(T_{2})}(x)$, implying $x\in\mbox{Supp}(\mathcal{P}_{k}(T_{1})\setminus_{m}\mathcal{P}_{k}(T_{3}))$. Thus, we have:

\[
\begin{array}{rcl}
m(x) & = & m_{\mathcal{P}_{k}(T_{1})\setminus_{m}\mathcal{P}_{k}(T_{3})}(x)+m_{\mathcal{P}_{k}(T_{3})\setminus_{m}\mathcal{P}_{k}(T_{2})}(x)\\
 & \leqslant & m_{\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{3})}(x)+m_{\mathcal{P}_{k}(T_{3})\triangle_{m}\mathcal{P}_{k}(T_{2})}(x).
\end{array}
\]

Lastly, if $x^{m(x)}\in\mathcal{P}_{k}(T_{2})\setminus_{m}\mathcal{P}_{k}(T_{1})$, we can obtain the same result.

In summary, we have:

\[ \mbox{Supp}(\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{2}))\subseteq\mbox{Supp}(\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{3}))\cup\mbox{Supp}(\mathcal{P}_{k}(T_{3})\triangle_{m}\mathcal{P}_{k}(T_{2})). \]

In addition, for each $x\in\mbox{Supp}(\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{2}))$, we have:

\[ m_{\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{2})}(x)\leqslant m_{\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{3})}(x)+m_{\mathcal{P}_{k}(T_{3})\triangle_{m}\mathcal{P}_{k}(T_{2})}(x). \]

Therefore, we have:

\[ |\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{2})|\leqslant|\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{3})|+|\mathcal{P}_{k}(T_{3})\triangle_{m}\mathcal{P}_{k}(T_{2})|, \]

that is, the triangle inequality holds.

For multiset-labeled rooted trees, the proof is similar and hence omitted.
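The case analysis above can also be read as a single pointwise inequality. The following compact restatement is offered as a reading aid, not as a replacement for the proof; it writes $A$, $B$, $C$ for $\mathcal{P}_{k}(T_{1})$, $\mathcal{P}_{k}(T_{2})$, $\mathcal{P}_{k}(T_{3})$ and assumes that $|\cdot|$ counts the elements of a multiset with multiplicity.

```latex
% Assumption: |A| is the number of elements of the multiset A counted with multiplicity.
\begin{align*}
m_{A\triangle_m B}(x) &= \max\{m_A(x),m_B(x)\}-\min\{m_A(x),m_B(x)\} = |m_A(x)-m_B(x)|,\\
|m_A(x)-m_B(x)| &\leqslant |m_A(x)-m_C(x)| + |m_C(x)-m_B(x)|,\\
|A\triangle_m B| = \sum_x |m_A(x)-m_B(x)| &\leqslant \sum_x |m_A(x)-m_C(x)| + \sum_x |m_C(x)-m_B(x)|
 = |A\triangle_m C| + |C\triangle_m B|.
\end{align*}
```

The first line follows from the definitions of $\cup_{m}$, $\cap_{m}$, and $\setminus_{m}$ in Section 5.1, and the second is the triangle inequality for integers.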

Figure 9: Two distinct multiset-labeled trees $S$ and $T$ satisfying $\mathcal{P}_{2}(S)=\mathcal{P}_{2}(T)=\{\{a^{2}d^{2},b\},\{abd,ad\},\{a,abd^{2}\}\}$, showing that the 2-RF score can be 0 even for distinct trees.
Remark 5.5.

For multiset-labeled trees, $d_{k\text{-RF}}(S,T)=0$ does not imply that $S$ and $T$ are identical, as shown in Fig. 9.

Proposition 5.6.

Let $k\geqslant 0$ and let $S$ and $T$ be two (rooted) trees whose nodes are labeled by $L(S)$ and $L(T)$, respectively. Then, $d_{k\text{-RF}}(S,T)$ can be computed in $O((k+B)D(|V(S)|+|V(T)|))$ time, where $B$ is the maximum multiplicity of a label appearing in $\{P_{T}(e,k)\mid e\in E(T)\}\cup\{P_{S}(e,k)\mid e\in E(S)\}$ and $D=|\mathit{Supp}(L(S))\cup\mathit{Supp}(L(T))|$.

Proof 5.7.

An algorithm for the 1-labeled case can be modified as follows for computing the $k$-RF on multiset-labeled rooted and unrooted trees:

  • Represent each label multiset as a $D$-dimensional vector, in which the integer at position $j$ is the multiplicity of the $j$-th label. Computing all edge-induced pairs in both trees takes $O(k(|E(S)|+|E(T)|))$ set operations. Each set operation takes $D$ integer operations.

  • Radix-sort all the edge-induced pairs for $S$ and $T$ in $O(D(|E(S)|+B))$ and $O(D(|E(T)|+B))$ integer operations, respectively.

  • Compute the symmetric difference of the multisets of edge-induced pairs in the two input trees in $|E(S)|+|E(T)|$ set operations. Each set operation takes $D$ integer operations.

In summary, one can compute $d_{k\text{-RF}}(S,T)$ using $O((k+B)D(|V(S)|+|V(T)|))$ integer operations, as $|E(S)|=|V(S)|-1$ and $|E(T)|=|V(T)|-1$.
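The last two steps of this procedure can be sketched as follows. This is a simplified illustration, not the authors' implementation: multisets are given as dictionaries, Python's built-in comparison sort stands in for the radix sort, and the names `to_vector`, `sym_diff_size`, and `score_from_pairs` are hypothetical. The edge-induced pairs are assumed to be already computed.

```python
def to_vector(multiset, label_order):
    """D-dimensional multiplicity vector of a multiset given as a dict."""
    return tuple(multiset.get(x, 0) for x in label_order)


def sym_diff_size(sorted_a, sorted_b):
    """Merge two sorted lists and count the entries that cannot be matched."""
    i = j = diff = 0
    while i < len(sorted_a) and j < len(sorted_b):
        if sorted_a[i] == sorted_b[j]:
            i += 1
            j += 1
        elif sorted_a[i] < sorted_b[j]:
            i += 1
            diff += 1
        else:
            j += 1
            diff += 1
    return diff + (len(sorted_a) - i) + (len(sorted_b) - j)


def score_from_pairs(pairs_s, pairs_t, label_order):
    """Score from ordered edge-induced pairs of multisets (rooted case)."""
    vec_s = sorted((to_vector(p, label_order), to_vector(q, label_order)) for p, q in pairs_s)
    vec_t = sorted((to_vector(p, label_order), to_vector(q, label_order)) for p, q in pairs_t)
    return sym_diff_size(vec_s, vec_t)


# Hypothetical input: S has a duplicated pair, as allowed by Remark 5.1.
order = ["a", "b", "c"]
pairs_s = [({"a": 1}, {"b": 2}), ({"a": 1}, {"b": 2}), ({"c": 1}, {"a": 1})]
pairs_t = [({"a": 1}, {"b": 2}), ({"c": 1}, {"a": 1, "b": 1})]
print(score_from_pairs(pairs_s, pairs_t, order))  # 3
```

Because the merge cancels equal entries one by one, duplicated pairs are handled with their multiplicities, as required by $\triangle_{m}$.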

Figure 10: Pearson correlation of the $k$-Robinson-Foulds (RF) with CASet$\cap$, DISC$\cap$, and GRF. The analyses were conducted on random rooted trees with the same label set (left) and with different but overlapping label sets (right), generated by the method reported in Jahn et al. (2021). The Pearson correlation became constant for $k\geqslant 19$, the range in which the $k$-RF becomes the RF. CASet$\cap$: Common Ancestor Set distance; DISC$\cap$: Distinctly Inherited Set Comparison distance (DiNardo et al., 2020). GRF: Generalized RF distance (Llabrés et al., 2021).

5.4 Correlation of the $k$-RF and the Other Measures

Let $T$ and $S$ be two 1-labeled rooted trees with the same label set $X$. Again, we identify the nodes with their labels in the two trees. For any two subsets $X'$ and $X''$ of $X$, we use $d_{J}(X',X'')$ to denote their Jaccard distance. The CASet$\cap$ distance between $T$ and $S$ is defined to be the average of $d_{J}(A_{T}(i)\cap A_{T}(j),A_{S}(i)\cap A_{S}(j))$ over all pairs of nodes $i$ and $j$, whereas the DISC$\cap$ distance between $T$ and $S$ is the average of $d_{J}(A_{T}(i)\setminus A_{T}(j),A_{S}(i)\setminus A_{S}(j))$ over all ordered pairs $(i,j)$ of nodes (DiNardo et al., 2020).
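The two averages can be sketched in Python as below. The sketch rests on several assumptions that are not stated in this section: $A_{T}(i)$ is taken to be the set of ancestors of $i$ including $i$ itself, the averages run over pairs of distinct nodes, and the Jaccard distance of two empty sets is taken to be 0. Trees are given as hypothetical parent maps, and the function names are illustrative only.

```python
from itertools import combinations, permutations


def ancestor_sets(parent):
    """Map each node to the set of its ancestors, the node itself included (assumption)."""
    anc = {}
    for v in parent:
        chain, u = set(), v
        while u is not None:
            chain.add(u)
            u = parent[u]
        anc[v] = chain
    return anc


def jaccard_distance(x, y):
    """1 - |x & y| / |x | y|, with distance 0 for two empty sets (assumption)."""
    union = x | y
    return 0.0 if not union else 1.0 - len(x & y) / len(union)


def caset_distance(parent_t, parent_s):
    a_t, a_s = ancestor_sets(parent_t), ancestor_sets(parent_s)
    pairs = list(combinations(sorted(parent_t), 2))      # unordered pairs
    return sum(jaccard_distance(a_t[i] & a_t[j], a_s[i] & a_s[j]) for i, j in pairs) / len(pairs)


def disc_distance(parent_t, parent_s):
    a_t, a_s = ancestor_sets(parent_t), ancestor_sets(parent_s)
    pairs = list(permutations(sorted(parent_t), 2))      # ordered pairs
    return sum(jaccard_distance(a_t[i] - a_t[j], a_s[i] - a_s[j]) for i, j in pairs) / len(pairs)


# Hypothetical 3-node trees on the same label set: a path 0-1-2 and a star rooted at 0.
path = {0: None, 1: 0, 2: 1}
star = {0: None, 1: 0, 2: 0}
print(caset_distance(path, star), disc_distance(path, star))
```

For these two toy trees, the sketch gives a CASet$\cap$ value of $1/6$ and a DISC$\cap$ value of $1/4$ under the stated assumptions.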

Using the Pearson correlation, we compared the $k$-RF with CASet$\cap$, DISC$\cap$, and GRF (Llabrés et al., 2020) in the space of set-labeled trees for different $k$ from 0 to 28.

Firstly, we conducted the correlation analysis in the space of mutation trees with the same label set. Using a method reported by Jahn et al. (2021), we generated a simulated dataset containing 5,000 rooted trees in which the root was labeled with 0 and the other nodes were labeled by disjoint subsets of $\{1,2,\ldots,29\}$, where the trees might have different numbers of nodes. Using all $\binom{5000}{2}$ pairwise scores for CASet$\cap$, DISC$\cap$, GRF and $k$-RF, we conducted the Pearson correlation analysis of the $k$-RF with the other three (left panel, Fig. 10).
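For readers who wish to reproduce this kind of analysis, the Pearson correlation between two lists of pairwise scores can be computed as in the sketch below. The function name `pearson` and the toy score lists are hypothetical and are not data from this study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy example: scores of two hypothetical measures on the same four tree pairs.
scores_a = [0, 2, 4, 6]
scores_b = [1, 1, 3, 3]
print(round(pearson(scores_a, scores_b), 4))  # 0.8944
```

The function assumes that neither list is constant, since the coefficient is undefined when a variance is zero.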

Our results show that CASet$\cap$, DISC$\cap$ and GRF were all positively correlated with the $k$-RF. We observed the following facts:

  • The GRF and $k$-RF had the largest Pearson correlation for each $k<8$, whereas the DISC$\cap$ and $k$-RF had the largest Pearson correlation for each $k\geqslant 8$.

  • The 5-RF and 6-RF were less correlated with CASet$\cap$, DISC$\cap$ and GRF than the other $k$-RF.

  • The Pearson correlation between the $k$-RF and CASet$\cap$ (respectively, DISC$\cap$) increased when $k$ went from 6 to 15.

Secondly, we conducted the Pearson correlation analysis on trees with different but overlapping label sets. The dataset was generated by the same method and was a union of 5 groups of rooted trees, each of which contained 200 trees over the same label set. We computed the dissimilarity scores between each tree in the first group and each tree in the other groups and then computed the Pearson correlation between different measures. Again, all the dissimilarity measures were positively correlated, but less correlated than in the first case; see Fig. 10 (right). This observation could be the result of the fact that a difference in the label sets of two trees makes their $k$-RF score at least $k+1$. However, the difference does not strongly contribute to the other distances because DISC$\cap$ and CASet$\cap$ consider the intersection of label sets (see DiNardo et al., 2020), and GRF considers the intersection of clusters.

The right panel of Fig. 10 shows that the $k$-RF and DISC$\cap$ had the largest Pearson correlation for $k$ from 1 to 9, and the $k$-RF and CASet$\cap$ had the largest Pearson correlation for $k\geqslant 10$. Moreover, all the Pearson correlations decreased when $k$ changed from 1 to 15. This trend was not observed in the first case. This decreasing trend could be the result of the fact that a difference in label sets contributes more to the $k$-RF as $k$ increases.

6 Clustering Trees with the $k$-RF

A test was designed to assess how well the $k$-RF, CASet$\cap$, DISC$\cap$, and GRF cluster labeled trees.

We randomly generated 5 tree families, each containing 50 trees, using the program reported by Jahn et al. (2021). The nodes were labeled by the subsets of a 30-label set in the trees of each family. The label sets used in different tree families were different, but overlapping. As the nodes were labeled by disjoint subsets, each label by which the label sets of two trees differ induces at least $d$ different pairs, where $d$ is the degree of the node with the label. Thus, a large number of different elements between the label sets could make the trees more distinguishable by the $k$-RF. Therefore, the label sets used for the different tree families differed in only one label.

We computed the pairwise dissimilarity scores for all 250 trees in the five families using each measure; we then clustered the 250 trees into $c$ clusters using the $K$-means algorithm, where $c$ ranges from 2 to 57. The clustering results were assessed using the Silhouette score (Kaufman and Rousseeuw, 2009).
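The Silhouette score can be evaluated directly from a matrix of pairwise dissimilarities. The sketch below follows the standard definition, $s(i)=(b(i)-a(i))/\max\{a(i),b(i)\}$ with $s(i)=0$ for singleton clusters; it does not reproduce the clustering step, and the function name and toy data are hypothetical. It assumes at least two clusters and that not all dissimilarities are zero.

```python
def silhouette_score(dist, labels):
    """Mean silhouette value from a dissimilarity matrix and cluster labels."""
    clusters = {}
    for idx, c in enumerate(labels):
        clusters.setdefault(c, []).append(idx)

    total = 0.0
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            continue                                   # s(i) = 0 for a singleton cluster
        # a(i): mean dissimilarity to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean dissimilarity to any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != c)
        total += (b - a) / max(a, b)
    return total / len(labels)


# Toy example: four points on a line at 0, 1, 10, 11.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
good = silhouette_score(dist, [0, 0, 1, 1])
bad = silhouette_score(dist, [0, 1, 0, 1])
print(round(good, 4), bad < 0)  # 0.8997 True
```

A score close to 1 indicates compact, well-separated clusters, which is why the score is evaluated for each candidate number of clusters $c$.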

Figure 11: Silhouette scores of clustering 250 rooted trees with the $k$-RF for $0\leqslant k\leqslant 11$ (left) and $12\leqslant k\leqslant 19$ (middle) and with CASet$\cap$, DISC$\cap$, and GRF (right). RF: Robinson-Foulds. CASet$\cap$: Common Ancestor Set distance; DISC$\cap$: Distinctly Inherited Set Comparison distance (DiNardo et al., 2020). GRF: Generalized RF distance (Llabrés et al., 2021).

As Fig. 11 illustrates, none of the CASet$\cap$, DISC$\cap$, and GRF distances was able to recognize the exact number of families. However, CASet$\cap$ had the highest Silhouette score when the number of clusters was 5, compared to DISC$\cap$, GRF, and the $k$-RF for $k\leqslant 12$. In addition, the figure shows that the $k$-RF could recognize the correct number of families when $k$ ranges from 12 to 19. Moreover, the Silhouette score of the $k$-RF increased when $k$ increased from 8 to 19. This interesting observation may stem from the fact that as $k$ increases, the number of pairs of trees achieving the highest possible $k$-RF score also increases, thereby enhancing the recognizability of families. It is worth noting that such pairs are guaranteed to exist when $k$ is larger than the minimum diameter of the trees, which is 8 in our case.

7 Conclusions

The development of an efficient and robust measure for the comparison of labeled trees is important. In this paper, we have proposed a novel variant of dissimilarity metrics, namely the $k$-RF, tailored for labeled trees. The $k$-RF facilitates the analysis of local structures in labeled trees, accommodating nodes labeled with (not necessarily the same) multisets. Significantly, these metrics find practical applicability in mutation trees used in cancer research.

The RF distance is succinctly expressed as $(n-1)$-RF within the space of labeled trees with $n$ nodes. By setting $k$ to a value smaller than $n-1$, the $k$-RF metric can capture analogous local regions in two labeled trees. Notably, for every $k$, the $k$-RF is a pseudometric for multiset-labeled trees and becomes a metric in the space of 1-labeled trees. However, the distribution of pairwise $k$-RF scores in the space of 1-labeled unrooted (or rooted) trees appears to conform to a Poisson distribution for $k=n-2$, and is unlikely to follow the same trend for other values of $k\geqslant 1$.

We verified the $k$-RF measures through a comprehensive comparison with CASet$\cap$, DISC$\cap$ (DiNardo et al., 2020) and GRF (Llabrés et al., 2021) on randomly labeled trees generated by an in-house program (Jahn et al., 2021). Our findings revealed a consistent positive correlation between the $k$-RF and each of the other three measures for every value of $k$. Notably, the correlation values exhibited a tendency to be higher when the measures were applied to assess mutation trees with identical label sets. Furthermore, our study underscored the superior clustering capabilities of the $k$-RF compared to the three mentioned measures.

We would like to emphasize that selecting an appropriate $k$-RF in practical applications lacks a universal rule of thumb, primarily due to a shortage of experience in this domain. Perhaps a judicious approach involves choosing a suitable $k$-RF by carefully considering the topological similarity among the trees under consideration.

Future work includes how to apply the $k$-RF to designing tree inference algorithms like GraPhyC (Govek et al., 2018) and also how to infer the exact frequency distribution of the $k$-RF for each $k\geqslant 1$. It is also interesting to investigate the generalization of the RF distance for clonal trees (Llabrés et al., 2020).

The computer program for the $k$-RF can be downloaded from https://github.com/Elahe-khayatian/k-RF-measures.git.

Acknowledgments

The authors would like to thank the anonymous reviewer for providing helpful suggestions and comments on our first submission of the work. This research was partially supported by the Ministerio de Ciencia e Innovación (MCI), the Agencia Estatal de Investigación (AEI) and the European Regional Development Funds (ERDF) through project METACIRCLE PID2021-126114NB-C44, also supported by the European Regional Development Fund (FEDER), by the Agency for Management of University and Research Grants (AGAUR) through grant 2017-SGR-786 (ALBCOM), and by Singapore MOE Tier 1 grant R-146-000-318-114.

References

  • Briand et al. (2020) Briand, S., Dessimoz, C., El-Mabrouk, N., Lafond, M., and Lobinska, G. (2020). A generalized Robinson-Foulds distance for labeled trees. BMC Genomics, 21(Suppl 10):779.
  • Briand et al. (2022) Briand, S., Dessimoz, C., El-Mabrouk, N., and Nevers, Y. (2022). A linear time solution to the labeled Robinson-Foulds distance problem. Systematic Biology, 71(6):1391–1403.
  • Camin and Sokal (1965) Camin, J. H. and Sokal, R. R. (1965). A method for deducing branching sequences in phylogeny. Evolution, 19(3):311–326.
  • Ciccolella et al. (2021) Ciccolella, S., Bernardini, G., Denti, L., Bonizzoni, P., Previtali, M., and Vedova, G. D. (2021). Triplet-based similarity score for fully multilabeled trees with poly-occurring labels. Bioinformatics, 37(2):178–184.
  • DiNardo et al. (2020) DiNardo, Z., Tomlinson, K., Ritz, A., and Oesper, L. (2020). Distance measures for tumor evolutionary trees. Bioinformatics, 36(7):2090–2097.
  • Estabrook et al. (1985) Estabrook, G. F., McMorris, F. R., and Meacham, C. A. (1985). Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34(2):193–200.
  • Farris (1977) Farris, J. S. (1977). Phylogenetic analysis under Dollo’s law. Systematic Biology, 26(1):77–88.
  • Govek et al. (2018) Govek, K., Sikes, C., and Oesper, L. (2018). A consensus approach to infer tumor evolutionary histories. In Proc. 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB ’18), pages 63–72, New York, NY. ACM Press.
  • Jahn et al. (2021) Jahn, K., Beerenwinkel, N., and Zhang, L. (2021). The Bourque distances for mutation trees of cancers. Algorithms for Molecular Biology, 16:9.
  • Jürgensen (2020) Jürgensen, H. (2020). Multisets, heaps, bags, families: What is a multiset? Mathematical Structures in Computer Science, 30(2):139–158.
  • Karpov et al. (2019) Karpov, N., Malikic, S., Rahman, M. K., and Sahinalp, S. C. (2019). A multi-labeled tree dissimilarity measure for comparing clonal trees of tumor progression. Algorithms for Molecular Biology, 14:17.
  • Kaufman and Rousseeuw (2009) Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: An introduction to cluster analysis. John Wiley & Sons.
  • Li et al. (1996) Li, M., Tromp, J., and Zhang, L. (1996). On the nearest neighbour interchange distance between evolutionary trees. Journal of Theoretical Biology, 182(4):463–467.
  • Llabrés et al. (2020) Llabrés, M., Rosselló, F., and Valiente, G. (2020). A generalized Robinson-Foulds distance for clonal trees, mutation trees, and phylogenetic trees and networks. In Proc. 11th ACM Int. Conf. Bioinformatics, Computational Biology and Health Informatics, pages 13:1–13:10, New York, NY. ACM Press.
  • Llabrés et al. (2021) Llabrés, M., Rosselló, F., and Valiente, G. (2021). The generalized Robinson-Foulds distance for phylogenetic trees. Journal of Computational Biology, 28(12):1–15.
  • Robinson (1971) Robinson, D. F. (1971). Comparison of labeled trees with valency three. Journal of Combinatorial Theory, 11(2):105–119.
  • Robinson and Foulds (1981) Robinson, D. F. and Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1–2):131–147.
  • Sashittal et al. (2023) Sashittal, P., Zhang, H., Iacobuzio-Donahue, C. A., and Raphael, B. J. (2023). Condor: Tumor phylogeny inference with a copy-number constrained mutation loss model. bioRxiv [Preprint].
  • Schwartz and Schäffer (2017) Schwartz, R. and Schäffer, A. A. (2017). The evolution of tumour phylogenetics: Principles and practice. Nature Reviews Genetics, 18(4):213–229.
  • Steel and Penny (1993) Steel, M. A. and Penny, D. (1993). Distributions of tree comparison metrics: Some new results. Systematic Biology, 42(2):126–141.
  • Williams and Clifford (1971) Williams, W. T. and Clifford, H. T. (1971). On the comparison of two classifications of the same set of elements. Taxon, 20(4):519–522.