The $k$ -Robinson-Foulds Dissimilarity Measures for Comparison of Labeled Trees

Elahe Khayatian¹, Gabriel Valiente², Louxin Zhang¹,∗
¹Department of Mathematics, National University of Singapore,
Singapore 119076
²Department of Computer Science,
Technical University of Catalonia, E-08034 Barcelona, Spain
∗Corresponding author. E-mail: matzlx@nus.edu.sg

Abstract

Understanding the mutational history of tumor cells is a critical endeavor in unraveling the mechanisms underlying cancer. Since the modeling of tumor cell evolution employs labeled trees, researchers are motivated to develop different methods to assess and compare mutation trees and other labeled trees. While the Robinson-Foulds distance is a widely utilized metric for comparing phylogenetic trees, its applicability to labeled trees reveals certain limitations. This paper introduces the $k$ -Robinson-Foulds dissimilarity measures, tailored to address the challenges of labeled tree comparison. The Robinson-Foulds distance is succinctly expressed as $n$ -RF in the space of labeled trees with $n$ nodes. Like the Robinson-Foulds distance, the $k$ -Robinson-Foulds is a pseudometric for multiset-labeled trees and becomes a metric in the space of 1-labeled trees. By setting $k$ to a small value, the $k$ -Robinson-Foulds dissimilarity can capture analogous local regions in two labeled trees with different sizes or different labels.

Keywords: Phylogenetic trees, mutation trees, labeled trees, Robinson-Foulds distance, $k$ -Robinson-Foulds dissimilarity

1 Introduction

Trees in biology are a fundamental concept as they depict the evolutionary history of entities. These entities may consist of organisms, species, proteins, genes or genomes. Trees are also useful for healthcare analysis and medical diagnosis. Introducing different kinds of tree models has given rise to the question of how these models can be efficiently compared for evaluation. This question has led to defining a robust dissimilarity measure in the space of targeted trees. For example, mutation/clonal trees are introduced to model tumor evolution, in which nodes represent cellular populations and are labeled by the gene mutations carried by those populations (Karpov et al., 2019; Schwartz and Schäffer, 2017). The progression of tumors varies among different patients; additionally, information about such variations is significant for cancer treatment. Therefore, dissimilarity measures for mutation trees have become a focus of recent research (DiNardo et al., 2020; Jahn et al., 2021; Llabrés et al., 2021; Karpov et al., 2019).

In prior studies on species trees, several measures have been introduced to compare two phylogenetic trees. Some examples of such distances are Robinson-Foulds distance (RF) (Robinson and Foulds, 1981), Nearest-Neighbor Interchange (NNI) (Li et al., 1996; Robinson, 1971), Quartet distance (Estabrook et al., 1985), and Path distance (Steel and Penny, 1993; Williams and Clifford, 1971). Although these distances have been widely used for phylogenetic trees, they are defined based on the assumption that the involved trees have the same label sets. Moreover, only leaves of phylogenetic trees are labeled. Thus, these distances are not useful for comparing trees with different label sets or trees in which all the nodes are labeled.

1.1 Related Work on Comparison of Labeled Trees

To get around some limitations of the above-mentioned distances in the comparison of mutation trees, researchers have introduced new measures for mutation trees. Some of these measures are Common Ancestor Set distance (CASet) (DiNardo et al., 2020), Distinctly Inherited Set Comparison distance (DISC) (DiNardo et al., 2020), and Multi-Labeled Tree Dissimilarity measure (MLTD) (Karpov et al., 2019). While these distance measures enable efficient comparison of clonal trees, they are defined based on the assumption that mutations cannot occur more than once and mutations will not be lost in the course of tumor evolution. As a result, these metrics exhibit multiple limitations when applied to the comparison of trees used to model complex tumor evolution, wherein mutations may indeed occur multiple times and subsequently be lost.

In addition to the three measures mentioned above, a group of other dissimilarity measures have been introduced for the comparison of mutation trees, including Parent-Child Distance (Govek et al., 2018) and Ancestor-Descendant Distance (Govek et al., 2018). These measures are metrics only in the space of ‘1-mutation’ trees, in which each node is labeled by exactly one mutation. These distances are again defined based on the assumption mentioned above, namely that a mutation cannot occur more than once and cannot be lost.

Additionally, there are other measures for mutation trees that are defined by generalization. In such methods, researchers aim to extend the definition of an existing distance, which is mostly used to compare phylogenetic trees, to mutation trees. For example, the generalized Nearest Neighbour Interchange (gNNI) (Jahn et al., 2021) is defined by some minor modifications of NNI, which was first defined for the comparison of phylogenetic trees. The other example is the Path Distance (Govek et al., 2018), which was first defined for the comparison of phylogenetic trees. Although these measures are applicable to mutation trees, they are only well defined for mutation trees with the same label sets (Jahn et al., 2021; Govek et al., 2018).

Apart from the measures mentioned above, another recently proposed distance is the generalized RF distance (GRF) (Llabrés et al., 2020, 2021). This measure allows for the comparison of phylogenetic trees, phylogenetic networks, mutation and clonal trees. An important point about this distance is that its value depends on the intersection between clusters or clones of trees. However, this intersection does not contribute to the RF distance. In fact, if two clusters or clones of two trees are different, their contribution to the RF distance is 1; otherwise, it is 0. Hence, the generalized RF distance has a better resolution than the RF distance. However, it is defined based on the assumption that two distinct nodes in each tree are labeled by two disjoint sets (Llabrés et al., 2020).

There are some other generalizations of the RF distance, such as Bourque distance (Jahn et al., 2021). This distance is effective for comparing mutation trees whose nodes are labeled by non-empty sets and has linear time complexity. However, like the above distances, it does not allow for multiple occurrences of mutations during the tumor history (Jahn et al., 2021). Other generalizations of the RF distance have also been proposed for gene trees (Briand et al., 2020, 2022).

The above-mentioned measures are not able to quantify similarity or difference of some tree models. Two instances of such models are the Dollo (Farris, 1977) and the Camin-Sokal model (Camin and Sokal, 1965). The reason behind the inadequacy of the measures for these models is that it is possible for mutations to get lost after they are gained in the Dollo model, and a mutation can occur more than once during the tumor history in the Camin-Sokal model (Llabrés et al., 2020). Hence, some measures are needed to mitigate the problem. To the best of our knowledge, Triplet-based Distance (Ciccolella et al., 2021) is the only measure introduced so far to resolve the issue. The distance is useful for comparing mutation trees whose nodes are labeled by non-empty sets; additionally, it allows for multiple occurrences of mutations during the tumor history and losing a mutation after it is gained (Ciccolella et al., 2021). Thus, the measure is applicable to the broader family of trees in which two nodes of a tree may have non-disjoint sets of labels. Nevertheless, it is not able to compare those trees in which there is a node whose label has more than one copy of a mutation.

Although no tree model has been introduced so far that allows for more than one copy of a mutation in the label of a single node, current models can be extended to deal with copy number of mutations. For example, the constrained $k$ -Dollo model (Sashittal et al., 2023) takes the variant read count and the total read count of each mutation in each cell, derived from single-cell DNA sequencing data, as input; then, based on three thresholds for the variant read count, the total read count, and the variant allele frequency, it decides whether a mutation is present or absent in a cell or it is missing (Sashittal et al., 2023). Alternatively, the model can consider the exact frequency numbers to show the multiplicity of each mutation in each cell. This motivates us to introduce new distances that can be used to compare pairs of labeled trees whose nodes are labeled by non-empty multisets.

1.2 Our Contributions to Tree Comparison

In this paper, we present $k$ -RF dissimilarity measures designed for the comparison of labeled trees. They are first defined for 1-labeled trees (Section 3). Subsequently, we extend these measures to multiset-labeled trees (Section 5). We delve into the mathematical properties of the $k$ -RF measures in Sections 4 and 5. In particular, $k$ -RF is a metric for 1-labeled trees. We also assess the validity of the $k$ -RF measures through comparisons with CASet, DISC, and GRF (Section 5), and the evaluation of their performance in the context of tree clustering (Section 6).

2 Concepts and Notations

A graph consists of a set of nodes and a set of edges that are each an unordered pair of distinct nodes, whereas a directed graph consists of a set of nodes and a set of directed edges that are each an ordered pair of distinct nodes.

Let $G$ be a (directed) graph. We use $V(G)$ and $E(G)$ to denote its node and edge set, respectively. If $G$ is undirected, $(u,v)$ will still be used to denote an edge between $u$ and $v$ with the understanding that $(u,v)=(v,u)$ . Let $u,v\in V(G)$ . If $(u,v)\in E(G)$ , we say that $u$ and $v$ are adjacent, the edge $(u,v)$ is incident to $u$ and $v$ , or $u$ and $v$ are two endpoints of $(u,v)$ .

The degree of $v$ is defined as the number of edges incident to $v$ . In addition, if $G$ is directed, the indegree and outdegree of $v$ are defined as the number of edges $(x,y)$ such that $y=v$ and $x=v$ , respectively. The nodes of degree 1 are called the leaves in an undirected graph, whereas the nodes of indegree 1 and outdegree 0 are called the leaves in a directed graph. We use $\mathit{Leaf}(G)$ to denote the leaf set for $G$ . Non-leaf nodes are called internal nodes.

A path of length $k$ from $u$ to $v$ consists of a sequence of nodes $u_{0},u_{1},\ldots,u_{k}$ such that $u_{0}=u$ , $u_{k}=v$ and $(u_{i-1},u_{i})\in E(G)$ for $i=1,2,\cdots,k$ . The distance from $u$ to $v$ , denoted as $d_{G}(u,v)$ , is the length of the shortest paths from $u$ to $v$ , and it is set to $\infty$ if there is no path from $u$ to $v$ . Note that if $G$ is undirected, $d_{G}(u,v)=d_{G}(v,u)$ for all $u,v\in V(G)$ . The diameter of $G$ , denoted as $\textrm{diam}(G)$ , is defined as $\max_{u,v\in V(G)}d_{G}(u,v)$ . If $G$ is directed, its diameter is defined as the diameter of its undirected version that has the node set $V(G)$ and edge set $E(G)\cup\{(u,v)\mid(v,u)\in E(G)\}$ .
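
The definitions of distance and diameter above can be checked on small examples. The following is a minimal sketch, assuming a graph is given as a list of edges; the function names `distances_from` and `diameter` are hypothetical and chosen only for illustration. Because every edge is inserted in both directions, a directed input is treated as its undirected version, in line with the definition of the diameter of a directed graph.

```python
from collections import defaultdict, deque

def distances_from(adj, source):
    """Breadth-first search: return d_G(source, x) for every reachable node x."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def diameter(edges):
    """Return max over u, v of d_G(u, v); infinity if the graph is disconnected."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    best = 0
    for u in list(adj):
        dist = distances_from(adj, u)
        if len(dist) < len(adj):      # some node is unreachable from u
            return float("inf")
        best = max(best, max(dist.values()))
    return best

# A small tree: the path a-b-c-d with an extra leaf e attached to b.
print(diameter([("a", "b"), ("b", "c"), ("c", "d"), ("b", "e")]))  # 3
```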

2.1 Trees

A tree $T$ is a graph in which there is exactly one path from every node to any other node. It is binary if every internal node is of degree 3. It is a line tree if every internal node is of degree 2. Each line tree has exactly two leaves.

2.2 Rooted Trees

A rooted tree is a directed tree with a specific root node where the edges are oriented away from the root. In a rooted tree, there is exactly one edge entering $u$ for every non-root node $u$ , and thus there is a unique path from its root to every other node.

Let $T$ be a rooted tree and $u,v\in V(T)$ . If $(u,v)\in E(T)$ , $v$ is called a child of $u$ and $u$ is called the parent of $v$ . In general, for $u\neq v$ , if $u$ belongs to the unique path from $\mathit{root}(T)$ to $v$ , $v$ is said to be a descendant of $u$ , and $u$ is said to be an ancestor of $v$ . We use $C_{T}(u)$ , $A_{T}(u)$ and $D_{T}(u)$ to denote the set of all children, ancestors and descendants of $u$ , respectively. Note that $u\notin A_{T}(u)$ and $u\notin D_{T}(u)$ .

A binary rooted tree is a rooted tree in which the root is of indegree 0 and outdegree 2, and every other internal node is of indegree 1 and outdegree 2. A rooted line tree is a rooted tree in which each internal node has only one child. A rooted caterpillar tree is a rooted tree in which every internal node has at most one child that is internal.

2.3 Labeled Trees

Let $L$ be a set and $\mathbb{P}(L)$ be the set of all subsets of $L$ . A tree or rooted tree $T$ is labeled with the subsets of $L$ if $T$ is equipped with a function $\ell:V(T)\to\mathbb{P}(L)$ such that $\cup_{v\in V(T)}\ell(v)=L$ , and $\ell(v)\neq\emptyset$ for every $v\in V(T)$ . In particular, if $\ell(v)$ contains exactly one element for each $v\in V(T)$ and $\ell$ is one-to-one, $T$ is said to be a $1$ -labeled tree on $L$ . In addition, if $T$ is 1-labeled on $L$ , then for $C\subseteq V(T)$ , $L(C)$ is defined as $L(C)=\{x\in L\mid\exists w\in C:\ell(w)=\{x\}\}$ .

2.4 Phylogenetic and Mutation Trees

Let $X$ be a finite taxon set. A phylogenetic tree (respectively, rooted phylogenetic tree) on $X$ is a binary tree (respectively, binary rooted tree) in which the leaves are uniquely labeled by the taxa of $X$ and the internal nodes are unlabeled.

A mutation tree on a set $M$ of mutated genes is a rooted tree in which nodes are labeled with nonempty subsets of $M$ .

2.5 Dissimilarity Measures for Trees

Let $\mathcal{T}$ be a set of trees. A dissimilarity measure on $\mathcal{T}$ is a symmetric real function $d:\mathcal{T}\times\mathcal{T}\to\mathbb{R}^{\geqslant 0}$ . A dissimilarity measure should capture the idea that the more similar two trees are, the lower their measure value is. A pseudometric on $\mathcal{T}$ is a dissimilarity measure that satisfies the triangle inequality condition. Finally, a metric (distance) on $\mathcal{T}$ is a pseudometric $d$ such that $d(S,T)\neq 0$ unless $S$ and $T$ are the same tree.

3 The $k$ -RF Measure for 1-Labeled Trees

In this section, we first recall the definition of the RF distance and then present $k$ -RF dissimilarity measures for 1-labeled trees for arbitrary $k$ .

3.1 The $k$ -RF Measure for 1-Labeled Unrooted Trees

Let $X$ be a set of labels and let $T$ be a 1-labeled tree over $X$ . Each $e=(u,v)\in E(T)$ induces a pair of label subsets on $X$ :

	$\displaystyle P_{T}(e)=\left\{L(B_{e}(u)),L(B_{e}(v))\right\},$		(1)
	$\displaystyle B_{e}(u)=\{w\mid d_{T}(w,u)<d_{T}(w,v)\},$
	$\displaystyle B_{e}(v)=\{w\mid d_{T}(w,v)<d_{T}(w,u)\}.$		(2)

We further define:

	$\displaystyle\mathcal{P}(T)=\{P_{T}(e)\mid e\in E(T)\}.$		(3)

The RF distance of two 1-labeled trees $S$ and $T$ is defined as:

	$\displaystyle d_{RF}(S,T)=|\mathcal{P}(S)\bigtriangleup\mathcal{P}(T)|.$		(4)

Figure 1: Three 1-labeled trees in Example 1 to illustrate that the Robinson-Foulds distance exhibits a heavy penalty against trees with different labels. Although $T$ and $\acute{S}$ differ only in the label of one node, the RF distance is 4 for $S$ and $T$ , but 12 for $\acute{S}$ and $T$ .

Example 1.

Consider the three 1-labeled trees in Figure 1. We have $d_{\mbox{\tiny RF}}(S,T)=4$ , because $P_{T}(e_{1})$ to $P_{T}(e_{6})$ are:

	$\displaystyle\left\{\{a,b,c,d,e,f\},\;\{g\}\right\},$	$\displaystyle\left\{\{a,b,c,d,f\},\;\{e,g\}\right\},$	$\displaystyle\left\{\{a,b,c,d\},\;\{e,f,g\}\right\},$
	$\displaystyle\left\{\{a,e,f,g\},\;\{b,c,d\}\right\},$	$\displaystyle\left\{\{a,c,d,e,f,g\},\;\{b\}\right\},$	$\displaystyle\left\{\{a,b,d,e,f,g\},\;\{c\}\right\},$

respectively, whereas $P_{S}(\bar{e}_{1})$ to $P_{S}(\bar{e}_{6})$ are:

	$\displaystyle\left\{\{a,b,c,d,f,g\},\;\{e\}\right\},$	$\displaystyle\left\{\{a,b,c,d,e,g\},\;\{f\}\right\},$	$\displaystyle\left\{\{a,b,c,d\},\;\{e,f,g\}\right\},$
	$\displaystyle\left\{\{a,e,f,g\},\;\{b,c,d\}\right\},$	$\displaystyle\{\{a,b,d,e,f,g\},\;\{c\}\},$	$\displaystyle\{\{a,c,d,e,f,g\},\;\{b\}\},$

respectively. However, $d_{\mbox{\tiny RF}}(\acute{S},T)=12$ , even if $T$ and $\acute{S}$ have the same topology and only one node is labeled differently.
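
The numbers in Example 1 can be reproduced with a few lines of code. The sketch below is only an illustration, not the authors' implementation: the edge lists of $T$ , $S$ and $\acute{S}$ are reconstructed from the label subsets listed in Examples 1 and 2 (the figure itself is not reproduced here), and the function names `splits` and `rf_distance` are hypothetical.

```python
from collections import defaultdict

def splits(edges):
    """Return P(T): one unordered pair of label subsets per edge, as in Eqn. (1).

    Nodes are identified with their labels, which is valid for 1-labeled trees.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = frozenset(adj)
    result = set()
    for u, v in edges:
        # Collect the component containing u after deleting the edge (u, v).
        side, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in side and (x, y) != (u, v) and (x, y) != (v, u):
                    side.add(y)
                    stack.append(y)
        result.add(frozenset({frozenset(side), nodes - side}))
    return result

def rf_distance(edges_s, edges_t):
    """Size of the symmetric difference of the two sets of pairs, Eqn. (4)."""
    return len(splits(edges_s) ^ splits(edges_t))

# Edge lists reconstructed from the pairs listed in the examples (an assumption).
T = [("g", "e"), ("e", "f"), ("f", "a"), ("a", "d"), ("d", "b"), ("d", "c")]
S = [("g", "e"), ("g", "f"), ("a", "g"), ("a", "d"), ("d", "b"), ("d", "c")]
S_acute = [("h", "e"), ("e", "f"), ("f", "a"), ("a", "d"), ("d", "b"), ("d", "c")]

print(rf_distance(S, T))        # 4
print(rf_distance(S_acute, T))  # 12
```

Since every pair induced by an edge of $\acute{S}$ contains the label $h$ and no pair of $T$ does, no pair is shared and the score is $6+6=12$ .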

The above example indicates that the RF cannot capture local similarity (and difference) for 1-labeled trees if they have different labels. One popular dissimilarity measure for sets is the Jaccard distance. It is obtained by dividing the size of the symmetric difference of two sets by the size of their union. Two 1-labeled trees are identical if and only if they have the same set of edges. Therefore, we propose to use $|E(S)\bigtriangleup E(T)|$ and its generalization to measure the dissimilarity for 1-labeled trees $S$ and $T$ .

Let $k$ be a non-negative integer and let $T$ be a $1$ -labeled tree. Each edge $e=(u,v)$ induces the following pair of subsets of labels:

	$\displaystyle P_{T}(e,k)=\{L(B_{e}(u,k)),L(B_{e}(v,k))\},$		(5)
	$\displaystyle B_{e}(x,k)=\{w\in B_{e}(x)\mid d_{T}(w,x)\leqslant k\},\;\;x=u,v.$

Clearly, $B_{e}(u,\infty)=B_{e}(u)$ and $B_{e}(u,0)=\{u\}$ . We further define:

	$\displaystyle\mathcal{P}_{k}(T)=\{P_{T}(e,k)\mid e\in E(T)\}.$		(6)

Definition 1.

Let $k\geqslant 0$ and let $S$ and $T$ be two 1-labeled trees. The $k$ -RF dissimilarity score of $S$ and $T$ is defined as:

	$\displaystyle d_{\tiny k\mbox{\rm-RF}}(S,T)=|\mathcal{P}_{k}(S)\bigtriangleup\mathcal{P}_{k}(T)|.$		(7)

Example 2.

Continuing with Example 1, we have $d_{\tiny 1\mbox{\rm-RF}}(\acute{S},T)=4$ , as $P_{T}(e_{i},1)$ for $1\leqslant i\leqslant 6$ are:

$\{\{g\},\{e,f\}\}$ ,		$\{\{e,g\},\{a,f\}\}$ ,		$\{\{e,f\},\{a,d\}\}$ ,
$\{\{a,f\},\{b,c,d\}\}$ ,		$\{\{b\},\{a,c,d\}\}$ ,		$\{\{c\},\{a,b,d\}\}$ ,

respectively, and $P_{\acute{S}}(\acute{e}_{i},1)$ for $1\leqslant i\leqslant 6$ are:

$\{\{h\},\{e,f\}\}$ ,		$\{\{e,h\},\{a,f\}\}$ ,		$\{\{e,f\},\{a,d\}\}$ ,
$\{\{a,f\},\{b,c,d\}\}$ ,		$\{\{b\},\{a,c,d\}\}$ ,		$\{\{c\},\{a,b,d\}\}$ ,

respectively. We also have $d_{\tiny 1\mbox{\rm-RF}}(S,T)=8$ . Thus, $1$ -RF captures the difference of the trees better than the RF distance.
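
Definition 1 translates directly into a breadth-first search that stops at depth $k$ . The following sketch is a hedged illustration under the same assumptions as before: the edge lists are reconstructed from the pairs listed in Examples 1 and 2, and the names `k_splits` and `k_rf` are hypothetical.

```python
from collections import defaultdict, deque

def k_splits(edges, k):
    """Return P_k(T) of Eqn. (6) for a 1-labeled unrooted tree given by its edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def ball(start, blocked, radius):
        # B_e(start, radius): nodes on start's side of the edge (start, blocked)
        # that are within distance radius of start.
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            x, dist = queue.popleft()
            if dist == radius:
                continue
            for y in adj[x]:
                if y not in seen and not (x == start and y == blocked):
                    seen.add(y)
                    queue.append((y, dist + 1))
        return frozenset(seen)

    return {frozenset({ball(u, v, k), ball(v, u, k)}) for u, v in edges}

def k_rf(edges_s, edges_t, k):
    """The k-RF dissimilarity score of Eqn. (7)."""
    return len(k_splits(edges_s, k) ^ k_splits(edges_t, k))

# Edge lists reconstructed from the pairs listed in the examples (an assumption).
T = [("g", "e"), ("e", "f"), ("f", "a"), ("a", "d"), ("d", "b"), ("d", "c")]
S = [("g", "e"), ("g", "f"), ("a", "g"), ("a", "d"), ("d", "b"), ("d", "c")]
S_acute = [("h", "e"), ("e", "f"), ("f", "a"), ("a", "d"), ("d", "b"), ("d", "c")]

print(k_rf(S_acute, T, 1))         # 4
print(k_rf(S, T, 1))               # 8
print(k_rf(S, T, float("inf")))    # 4, the RF distance of Example 1
```

Passing `float("inf")` as `k` never triggers the depth cut-off, so the last call recovers the RF distance.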

3.2 The $k$ -RF Measure for 1-Labeled Rooted Trees

Let $k\geqslant 0$ be an integer and let $T$ be a $1$ -labeled rooted tree. For a node $w\in V(T)$ , we define $B_{k}(w)$ and $D_{k}(w)$ as:

	$\displaystyle B_{k}(w)=\{x\in V(T)\mid\exists y\in A_{T}(w)\cup\{w\}:d(y,w)+d(y,x)\leqslant k\},$		(8)
	$\displaystyle D_{k}(w)=\{w\}\cup\{x\in D_{T}(w)\mid d(w,x)\leqslant k\}.$		(9)

For each $e=(u,v)\in E(T)$ , we define $P_{T}(e,k)$ as the following ordered pair of two label subsets:

	$\displaystyle P_{T}(e,k)=(L(D_{k}(v)),L(B_{k}(u)\setminus D_{k}(v))).$		(10)

Here, the first subset of $P_{T}(e,k)$ contains the labels of the descendants that are at distance at most $k$ from $v$ , whereas the second subset contains the labels of the other nodes around the edge $e$ within a distance of $k$ . Next, we define:

	$\displaystyle\mathcal{P}_{k}(T)=\{P_{T}(e,k)\mid e\in E(T)\}.$		(11)

Definition 2.

Let $k\geqslant 0$ and let $S$ and $T$ be two 1-labeled rooted trees. Then, the $k$ -RF dissimilarity between $S$ and $T$ is defined as:

	$\displaystyle d_{\tiny k\mbox{\rm-RF}}(S,T)=|\mathcal{P}_{k}(S)\bigtriangleup\mathcal{P}_{k}(T)|.$		(12)

Example 3.

Consider the two 1-labeled rooted trees $S$ and $T$ in Figure 2. $P_{T}(e_{i},1)$ ( $1\leqslant i\leqslant 7$ ) are the following ordered pairs of label subsets:

$\displaystyle\begin{array}{llll}(\{f,h\},\{b,d\}),&(\{c,f,g\},\{b,h\}),&(\{c\},\{f,g,h\}),&(\{g\},\{c,f,h\}),\\ (\{a,d,e\},\{b,h\}),&(\{a\},\{b,d,e\}),&(\{e\},\{a,b,d\}).&\end{array}$

$P_{S}(\bar{e}_{i},1)$ ( $1\leqslant i\leqslant 7$ ) are the following ordered pairs of label subsets:

$\displaystyle\begin{array}{llll}(\{b,d\},\{c,f\}),&(\{a,d,e\},\{b,c\}),&(\{a\},\{b,d,e\}),&(\{e\},\{a,b,d\}),\\ (\{f,g,h\},\{b,c\}),&(\{g\},\{c,f,h\}),&(\{h\},\{c,f,g\}).&\end{array}$

Therefore, $d_{\tiny 1\mbox{\rm-RF}}(S,T)=8$ .
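
The rooted definition can be evaluated literally from Eqns. (8) to (10). The sketch below is an illustration under stated assumptions: the parent maps of $S$ and $T$ are reconstructed from the ordered pairs listed in Example 3 (Figure 2 is not reproduced here), $k$ is assumed to be a non-negative integer, and the names `chain`, `rooted_k_pairs` and `rooted_k_rf` are hypothetical. It favors closeness to the definitions over efficiency.

```python
def chain(parent, w):
    """Return [w, parent(w), ..., root]; position i holds the ancestor at distance i."""
    path = [w]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def rooted_k_pairs(parent, k):
    """Return P_k(T) of Eqn. (11); `parent` maps each non-root node to its parent."""
    nodes = set(parent) | set(parent.values())
    up = {w: chain(parent, w) for w in nodes}
    # dist_from[x][y] = d(y, x) for every y in A_T(x) together with x itself.
    dist_from = {x: {y: i for i, y in enumerate(up[x])} for x in nodes}

    def D(w):  # Eqn. (9): w and its descendants within distance k
        return {x for x in nodes if dist_from[x].get(w, k + 1) <= k}

    def B(w):  # Eqn. (8): some y above w (or w itself) with d(y,w) + d(y,x) <= k
        return {x for x in nodes
                if any(i + dist_from[x].get(y, k + 1) <= k
                       for i, y in enumerate(up[w]))}

    pairs = set()
    for v, u in parent.items():          # one ordered pair per edge (u, v)
        d = D(v)
        pairs.add((frozenset(d), frozenset(B(u) - d)))
    return pairs

def rooted_k_rf(parent_s, parent_t, k):
    """The k-RF dissimilarity of Eqn. (12) for 1-labeled rooted trees."""
    return len(rooted_k_pairs(parent_s, k) ^ rooted_k_pairs(parent_t, k))

# Parent maps reconstructed from the pairs in Example 3 (an assumption).
T = {"h": "b", "d": "b", "f": "h", "c": "f", "g": "f", "a": "d", "e": "d"}
S = {"b": "c", "f": "c", "d": "b", "a": "d", "e": "d", "g": "f", "h": "f"}

print(rooted_k_rf(S, T, 1))  # 8
```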

4 Characterization of $k$ -RF for 1-Labeled Trees

In order to evaluate $k$ -RF, we first provide the mathematical properties of the $k$ -RF. We then present experimental results on the frequency distribution of these measures.

4.1 Mathematical Properties

Proposition 1.

Let $S$ and $T$ be two 1-labeled trees.

(a) Let $|L(S)\cap L(T)|\leqslant 2$ and $|E(T)|\geqslant 2$ . For any $k\geqslant 1$ , $d_{\tiny k\mbox{\rm-RF}}(S,T)=|E(S)|+|E(T)|$ .

(b) Assume that $L(S)\neq L(T)$ . For $k<\min\{\textrm{diam}(T),\textrm{diam}(S)\}$ ,
$k+1\leqslant d_{\tiny k\mbox{\rm-RF}}(S,T)\leqslant|E(S)|+|E(T)|$ . In addition, the second inequality becomes an equality if $k\geqslant\min\{\textrm{diam}(T),\textrm{diam}(S)\}$ and $|L(S)|=|L(T)|$ .

(c) $d_{\tiny 0\mbox{\rm-RF}}(S,T)=|E(S)\bigtriangleup E(T)|$ .

(d) If $k\geqslant\max\{\textrm{diam}(S),\textrm{diam}(T)\}-1$ , then $d_{\tiny k\mbox{\rm-RF}}(S,T)=d_{RF}(S,T)$ .

Proof.

(a) Note that if $k\geqslant 1$ and $|E(T)|\geqslant 2$ , each $P_{T}(e,k)$ involves at least three labels. If $L(S)$ and $L(T)$ have at most two common elements, $P_{T}(e,k)\neq P_{S}(\acute{e},k)$ for every $e\in E(T)$ and $\acute{e}\in E(S)$ . Thus, we have $\mathcal{P}_{k}(S)\cap\mathcal{P}_{k}(T)=\emptyset$ , implying that $d_{\tiny k\mbox{\rm-RF}}(S,T)=|\mathcal{P}_{k}(S)\bigtriangleup\mathcal{P}_{k}(T)|=|\mathcal{P}_{k}(T)|+|\mathcal{P}_{k}(S)|=|E(S)|+|E(T)|.$

(b) The second inequality follows from the fact that $d_{\tiny k\mbox{\rm-RF}}(S,T)=|\mathcal{P}_{k}(S)\bigtriangleup\mathcal{P}_{k}(T)|\leqslant|\mathcal{P}_{k}(T)|+|\mathcal{P}_{k}(S)|$ and $|\mathcal{P}_{k}(X)|=|E(X)|$ for $X=S,T$ . We prove the first inequality as follows.

Let $k<\min\{\mbox{diam}(T),\mbox{diam}(S)\}$ . Since $S$ and $T$ are 1-labeled, we identify a node with its label in the trees. Without loss of generality, we may assume $v\in V(T)\setminus V(S)$ . Define ${\cal N}^{(k)}_{T}(v)=\{u\;|\;d_{T}(u,v)\leqslant k\}$ .

If ${\cal N}^{(k)}_{T}(v)=V(T)$ , then, $|{\cal N}^{(k)}_{T}(v)|=|V(T)|\geqslant\mbox{diam}(T)+1\geqslant k+2$ , as $k<\mbox{diam}(T)$ . This also implies that for every $(x,y)\in E(T)$ , $d_{T}(v,x)\leqslant k$ and $d_{T}(v,y)\leqslant k$ .

If ${\cal N}^{(k)}_{T}(v)\neq V(T)$ , there exists at least one node $w$ that is at distance $k+1$ or more from $v$ . Since $T$ is connected, we let $P(v,w)$ be a path from $v$ to $w$ with the smallest length. Clearly, the first $k+1$ nodes in $P(v,w)$ (including $v$ ) are all in ${\cal N}^{(k)}_{T}(v)$ , i.e., each of the first $k+1$ edges of $P(v,w)$ has at least one end in ${\cal N}^{(k)}_{T}(v)$ .

In summary, we have proved that there are at least $k+1$ edges $(x,y)$ such that either $d_{T}(v,x)\leqslant k$ or $d_{T}(v,y)\leqslant k$ . For each of these edges $e$ , $v$ appears in at least one label subset of $P_{T}(e,k)$ and thus $P_{T}(e,k)\not\in\mathcal{P}_{k}(S)$ . Therefore, $d_{\tiny k\mbox{\rm-RF}}(S,T)\geqslant|\mathcal{P}_{k}(T)\setminus\mathcal{P}_{k}(S)|\geqslant k+1$ .

If $|L(S)|=|L(T)|$ and $k\geqslant\min\{\textrm{diam}(T),\textrm{diam}(S)\}$ , then, ${\cal N}^{(k)}_{T}(v)=V(T)$ . Therefore, the induced pair $P_{T}(e,k)$ contains $v$ for every edge $e$ of $T$ . On the other hand, the induced pair $P_{S}(e,k)$ does not contain $v$ for each edge $e$ of $S$ . Thus, $\mathcal{P}_{k}(S)\cap\mathcal{P}_{k}(T)=\emptyset$ and $d_{\tiny k\mbox{\rm-RF}}(S,T)=|\mathcal{P}_{k}(S)|+|\mathcal{P}_{k}(T)|=|E(S)|+|E(T)|$ .

(c) Note that we may represent each node of a 1-labeled tree with its unique label. As a result, $P_{T}(e,0)=e$ and $P_{S}(\bar{e},0)=\bar{e}$ for $e\in E(T)$ and $\bar{e}\in E(S)$ . Thus, (c) follows.

(d) It follows from the definition of the $k$ -RF. ∎

Lemma 4.1.

Let $k\geqslant 0$ be an integer. $k$ -RF satisfies the non-negativity, symmetry and triangle inequality conditions.

Proof 4.2.

Let $k\geqslant 0$ . The non-negativity and symmetry conditions are trivial. The triangle inequality $d_{\tiny k\mbox{\rm-RF}}(T_{1},T_{2})\leqslant d_{\tiny k\mbox{\rm-RF}}(T_{1},T_{3})+d_{\tiny k\mbox{\rm-RF}}(T_{3},T_{2})$ follows from the inclusion $\mathcal{P}_{k}(T_{1})\bigtriangleup\mathcal{P}_{k}(T_{2})\subseteq(\mathcal{P}_{k}(T_{1})\bigtriangleup\mathcal{P}_{k}(T_{3}))\cup(\mathcal{P}_{k}(T_{3})\bigtriangleup\mathcal{P}_{k}(T_{2}))$ for any three 1-labeled trees $T_{1},T_{2},T_{3}$ .
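
The set inclusion used in this proof can be checked element by element. The following derivation is a short sketch that writes $A_{i}=\mathcal{P}_{k}(T_{i})$ ; this abbreviation is introduced here only for brevity and is not part of the original notation.

```latex
% Abbreviation (an assumption for this sketch): A_i = \mathcal{P}_k(T_i).
% Take x in A_1 \triangle A_2; by symmetry assume x \in A_1 \setminus A_2.
\begin{align*}
x\in A_{3} &\;\Longrightarrow\; x\in A_{3}\setminus A_{2}\subseteq A_{3}\bigtriangleup A_{2},\\
x\notin A_{3} &\;\Longrightarrow\; x\in A_{1}\setminus A_{3}\subseteq A_{1}\bigtriangleup A_{3}.
\end{align*}
% Hence A_1 \triangle A_2 is contained in the union, and counting elements gives
\begin{align*}
|A_{1}\bigtriangleup A_{2}|
  \leqslant |(A_{1}\bigtriangleup A_{3})\cup(A_{3}\bigtriangleup A_{2})|
  \leqslant |A_{1}\bigtriangleup A_{3}|+|A_{3}\bigtriangleup A_{2}|.
\end{align*}
```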

Remark 4.3.

Proposition 1 and Lemma 4.1 can be proved in the same manner for 1-labeled rooted trees.

Proposition 4.4.

The 0-RF is a metric on the space of all 1-labeled rooted trees.

Proof 4.5.

Let $S$ and $T$ be two 1-labeled rooted trees. By Remark 4.3, it is enough to show that $S$ and $T$ are identical if $d_{\tiny 0\mbox{\rm-RF}}(S,T)=0$ . By identifying a node with its label in $S$ and $T$ , we obtain that $\mathcal{P}_{0}(S)=E(S)$ and $\mathcal{P}_{0}(T)=E(T)$ . If $d_{\tiny 0\mbox{\rm-RF}}(S,T)=0$ , $|E(T)\bigtriangleup E(S)|=0$ and thus $E(T)=E(S)$ , i.e. $S$ and $T$ are identical.

Lemma 4.6.

Let $T$ be a 1-labeled rooted tree with at least two nodes and $\mathcal{L}$ be a subset of $\mathit{Leaf}(T)$ . Define $T^{\prime}$ to be the subtree obtained by the removal of all the leaves of ${\mathcal{L}}$ . Then, for any $k$ ,

	$\displaystyle{\mathcal{P}}_{k}(T^{\prime})=\{(X\setminus\mathcal{L},Y\setminus\mathcal{L})\mid(X,Y)\in{\mathcal{P}}_{k}(T)\ \mbox{and}\ X\setminus\mathcal{L}\neq\emptyset\}.$		(13)

Proof 4.7.

Since $T$ is 1-labeled, we identify a node of $T$ with its label in the following discussion. With this convention, for any subset $S$ of nodes, $L(S)=S$ .

Let $\bar{E}(T)$ denote the subset of edges incident to a leaf of $\mathcal{L}$ , i.e., $\bar{E}(T)=\{(x,y)\in E(T)\;|\;y\in\mathcal{L}\}$ . Then,

V(T)=V(T^{\prime})\uplus\mathcal{L},\;\;\ E(T)=E(T^{\prime})\uplus\bar{E}(T).

If $(u,v)\in\bar{E}(T)$ , $v\in\mathcal{L}\subseteq\mathit{Leaf}(T)$ and thus $D_{k}(v)=\{v\}\subseteq\mathcal{L}.$

For an edge $e=(u,v)\in E(T^{\prime})$ , $P_{T}(e,k)=(D_{k}(v),B_{k}(u)\setminus D_{k}(v))$ . By Eqn. (8) and (9),

$\displaystyle D_{k}(v)$	$\displaystyle=$	$\displaystyle[D_{k}(v)\cap V(T^{\prime})]\uplus[D_{k}(v)\cap\mathcal{L}],$
$\displaystyle B_{k}(u)\setminus D_{k}(v)$	$\displaystyle=$	$\displaystyle[(B_{k}(u)\setminus D_{k}(v))\cap V(T^{\prime})]\uplus[(B_{k}(u)\setminus D_{k}(v))\cap\mathcal{L}]$
	$\displaystyle=$	$\displaystyle\left([B_{k}(u)\cap V(T^{\prime})]\setminus[D_{k}(v)\cap V(T^{\prime})]\right)\uplus[(B_{k}(u)\setminus D_{k}(v))\cap\mathcal{L}].$

Since $(u,v)\in E(T^{\prime})$ , $D_{k}(v)\setminus\mathcal{L}=D_{k}(v)\cap V(T^{\prime})\neq\emptyset$ and
$(B_{k}(u)\setminus D_{k}(v))\setminus\mathcal{L}=[B_{k}(u)\cap V(T^{\prime})]\setminus[D_{k}(v)\cap V(T^{\prime})].$ Therefore,
$\left(D_{k}(v)\setminus\mathcal{L},(B_{k}(u)\setminus D_{k}(v))\setminus\mathcal{L}\right)=P_{T^{\prime}}(e,k).$

This proves Eqn. (13).

Proposition 4.8.

Let $k\geqslant 1$ be an integer. The $k$ -RF is a metric in the space of all 1-labeled rooted trees.

Proof 4.9.

Let $S$ and $T$ be two 1-labeled rooted trees. By Remark 4.3, it is enough to show that $d_{\tiny k\mbox{\rm-RF}}(S,T)=0$ (equivalently, $\mathcal{P}_{k}(T)=\mathcal{P}_{k}(S)$ ) implies that $S$ and $T$ are identical. To this end, we prove that $E(T)$ can be uniquely determined by $\mathcal{P}_{k}(T)$ using mathematical induction.

Since $|E(T)|=|{\mathcal{P}}_{k}(T)|$ , $T$ is a single node if and only if $E(T)$ is empty if and only if ${\mathcal{P}}_{k}(T)$ is empty. Therefore, the single-node tree is uniquely determined by ${\mathcal{P}}_{k}(T)$ .

Assume that $S$ is uniquely determined by ${\mathcal{P}}_{k}(S)$ for every 1-labeled rooted tree $S$ such that $|V(S)|<m$ , where $m\geqslant 2$ . Consider a 1-labeled rooted tree $T$ such that $|V(T)|=m$ .

For a leaf $v\in\mathit{Leaf}(T)$ , there is a unique edge $e=(u,v)$ entering $v$ . Note that $k\geqslant 1$ . Since $D_{k}(v)=\{v\}$ if and only if $v$ is a leaf, we can identify the leaves of $T$ as the nodes $v$ for which there is a pair $P_{T}(e,k)=(P_{1},P_{2})\in{\mathcal{P}}_{k}(T)$ such that $P_{1}=\{v\}.$

For $v\in V(T)\setminus\mathit{Leaf}(T)$ , there is a unique edge $e=(u,v)$ entering $v$ . Since $k\geqslant 1$ , the children of $v$ are all leaves if and only if $D_{k}(v)=\{v\}\cup C_{T}(v)$ if and only if $D_{k}(v)\setminus\mathit{Leaf}(T)=\{v\}$ . Therefore, we can identify each node $v$ whose children are all leaves from the ordered pairs $(P_{1},P_{2})\in{\mathcal{P}}_{k}(T)$ such that $P_{1}\setminus\mathit{Leaf}(T)$ contains only $v$ .

Let $V^{\prime}$ be the set of all nodes whose children are all leaves and $D_{T}(V^{\prime})=\cup_{x\in V^{\prime}}C_{T}(x)$ . Since $V^{\prime}$ is nonempty, $D_{T}(V^{\prime})\neq\emptyset$ . Define $E^{\prime}(T)=\{(x,y)\in E(T)\;\;|\;\;x\in V^{\prime},y\in D_{T}(V^{\prime})\}$ .

For the tree $T^{\prime}$ obtained from $T$ by the removal of the leaves of $D_{T}(V^{\prime})$ , $|V(T^{\prime})|=|V(T)|-|D_{T}(V^{\prime})|<m.$ By Eqn. (13), ${\mathcal{P}}_{k}(T^{\prime})$ can be efficiently constructed from ${\mathcal{P}}_{k}(T)$ . By the induction hypothesis, $E(T^{\prime})$ is uniquely determined by ${\mathcal{P}}_{k}(T^{\prime})$ . As a result, $E(T)=E(T^{\prime})\cup E^{\prime}(T)$ is determined.

This concludes the proof.

Corollary 4.10.

Let $k\geqslant 0$ . The $k$ -RF is a metric in the space of all $1$ -labeled trees.

Proof 4.11.

If $k=0$ , the statement follows from the same proof as for Proposition 4.4. Now, let $S$ and $T$ be two 1-labeled trees and $k\geqslant 1$ . By Lemma 4.1, it is enough to show that if $d_{\tiny k\mbox{\rm-RF}}(S,T)=0$ (equivalently, $\mathcal{P}_{k}(T)=\mathcal{P}_{k}(S)$ ), then $S$ and $T$ are identical. This can be proved in a manner similar to Proposition 4.8.

Lemma 4.12.

Let $k\geqslant 0$ and let $T$ be a 1-labeled rooted tree with $n$ nodes. All subsets $D_{i}(w)=\{w\}\cup\{x\in D_{T}(w)\mid d(w,x)\leqslant i\}$ and $L(D_{i}(w))$ for all nodes $w$ and $i\leqslant k$ can be computed in at most $2(k+1)n$ set operations.

Proof 4.13.

Since $T$ is 1-labeled, we can identify a node of $T$ with its label. In this way, $D_{i}(w)=L(D_{i}(w))$ for all nodes $w$ and $i\leqslant k$ . By ordering the $n$ labels, we represent each subset of labels (and each subset of nodes) as an $n$ -bit 0-1 string, where the $i$ -th bit is 1 if and only if the $i$ -th label (node) is in the subset.

The statement is obvious in the case $k=0$ , since $D_{0}(w)=\{w\}$ and, clearly, all the $D_{0}(w)$ for $w\in V(T)$ can be computed in at most $2n$ set operations. We assume $k>0$ and prove the statement by induction as follows.

Assume that all the $D_{k-1}(w)$ for $w\in V(T)$ have been computed in at most $2kn$ set operations. Assume $w$ has $d_{w}$ children $u_{1},u_{2},\ldots,u_{d_{w}}$ . Then,

D_{k}(w)=\{w\}\cup\left(\cup_{i=1}^{d_{w}}D_{k-1}(u_{i})\right)

This implies that $D_{k}(w)$ for all $w$ can be computed from all $D_{k-1}(w)$ using $\sum_{w\in V(T)}(1+d_{w})=2n-1$ set operations. In total, we can compute all subsets $D_{i}(w)$ ( $i\leqslant k$ and $w\in V(T)$ ) in at most $2n-1+2kn\leqslant 2(k+1)n$ set operations.
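
The recurrence in this proof, together with the bit-string representation of subsets, can be sketched as follows. This is a minimal illustration, assuming the tree is given as a map from each node to the list of its children; Python integers stand in for the $n$ -bit strings, and the names `descendant_balls` and `decode` are hypothetical.

```python
def descendant_balls(children, root, k):
    """Compute D_i(w) for all nodes w and 0 <= i <= k as integer bitmasks.

    Bit j of a mask is set iff the j-th node of `order` belongs to the subset.
    Returns (order, balls) where balls[i][w] encodes D_i(w).
    """
    order = [root]
    for w in order:                       # breadth-first listing of all nodes
        order.extend(children.get(w, []))
    bit = {w: 1 << j for j, w in enumerate(order)}

    balls = [dict(bit)]                   # D_0(w) = {w}
    for _ in range(k):
        prev, cur = balls[-1], {}
        for w in order:
            mask = bit[w]
            for c in children.get(w, []):
                mask |= prev[c]           # D_i(w) = {w} ∪ D_{i-1}(c) over children c
            cur[w] = mask
        balls.append(cur)
    return order, balls

def decode(mask, order):
    """Turn a bitmask back into the set of nodes it encodes."""
    return {w for j, w in enumerate(order) if (mask >> j) & 1}

# The rooted tree T as reconstructed from Example 3 (an assumption).
children = {"b": ["h", "d"], "h": ["f"], "f": ["c", "g"], "d": ["a", "e"]}
order, balls = descendant_balls(children, "b", 2)
print(sorted(decode(balls[1]["f"], order)))  # ['c', 'f', 'g']
```

Each level performs one union per parent-child edge on top of initializing one mask per node, which matches the count of $2n-1$ set operations per level used in the proof.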

Lemma 4.14.

Let $k\geqslant 0$ and $T$ be a 1-labeled rooted tree with $n$ nodes. Using $L(D_{i}(w))$ for $w\in V(T),0\leqslant i\leqslant k$ , we can compute $L(B_{k}(w))$ for all $w$ in $O(kn)$ set operations, where $B_{k}(w)$ is defined in Eqn. (8).

Proof 4.15.

Since $T$ is a 1-labeled rooted tree, we identify a node with its label. In this way, we just need to show that $B_{k}(w)$ for all nodes $w$ can be computed in $O(kn)$ set operations.

Let $r$ be the root of $T$ . For any node $w\in V(T)$ , let the unique path from $r$ to $w$ be

w_{0}=r,w_{1},\ldots,w_{t}=w.

Then, we have that

B_{k}(w_{t})=\cup^{\min(k,t)}_{i=0}D_{k-i}(w_{t-i}).

Given the subsets $D_{i}(u)$ for all $i\leqslant k$ and $u\in V(T)$ , the above formula implies that $B_{k}(w_{t})$ can be computed in at most $k$ set operations. In total, we can compute $B_{k}(w)$ for all $w\in V(T)$ in $O(kn)$ set operations.
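
The formula for $B_{k}(w_{t})$ amounts to walking up from $w$ towards the root and uniting one precomputed subset per step. The sketch below illustrates this with plain Python sets instead of bit strings; the parent map is the tree $T$ as reconstructed from Example 3 (an assumption), and the name `all_B` is hypothetical.

```python
def all_B(parent, k):
    """Compute B_k(w) for every node w from the tables D_0, ..., D_k.

    `parent` maps each non-root node to its parent; k is a non-negative integer.
    """
    nodes = set(parent) | set(parent.values())
    children = {w: [] for w in nodes}
    for v, u in parent.items():
        children[u].append(v)

    # D[i][w] = D_i(w), built level by level from the children's subsets.
    D = [{w: {w} for w in nodes}]
    for _ in range(k):
        D.append({w: {w}.union(*(D[-1][c] for c in children[w])) for w in nodes})

    B = {}
    for w in nodes:
        acc, y, i = set(), w, 0
        while y is not None and i <= k:   # at most k + 1 terms, hence k unions
            acc |= D[k - i][y]            # contribution of the i-th ancestor of w
            y, i = parent.get(y), i + 1
        B[w] = acc
    return B

T = {"h": "b", "d": "b", "f": "h", "c": "f", "g": "f", "a": "d", "e": "d"}
print(sorted(all_B(T, 1)["h"]))  # ['b', 'f', 'h']
```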

Proposition 4.16.

Let $S$ and $T$ be two 1-labeled trees with $n$ nodes and $k\geqslant 0$ . Then, $d_{\tiny k\mbox{\rm-RF}}(S,T)$ can be computed in $O(kn^{2})$ time.

Proof 4.17.

We first consider the rooted tree case. Let $S$ and $T$ be two 1-labeled rooted trees with $n$ nodes. Without loss of generality, we may assume that $S$ and $T$ are labeled on the same set $L$ , with $|L|=n$ . (Otherwise, we can consider them labeled on $L=L(S)\cup L(T)$ , with $n\leqslant|L|\leqslant 2n$ .) By Lemma 4.12 and Lemma 4.14, we can compute $P_{X}(e,k)$ for all $e\in E(X)$ in $O(kn)$ set operations for $X=S,T$ . Since each edge induces an ordered pair of label subsets and we represent each label subset using an $n$ -bit string, we consider $P_{X}(e,k)$ as a $2n$ -bit string. In this way, we sort all the edge-induced pairs of label subsets for each tree in $O(n^{2})$ time by radix sort (that is, indexing) and then compute the symmetric difference of the two sets of edge-induced pairs in $O(n^{2})$ time. This concludes the proof.

In the unrooted case, we first root the trees at a leaf. In this way, we can compute all the edge-induced pairs of label subsets in the derived rooted trees in $O(kn^{2})$ time. Since the edges induce unordered pairs of label subsets in the original trees, we rearrange the two label subsets obtained for an edge in such a way that the smallest label in the first subset is smaller than every label in the second one. After the rearrangement, we can radix-sort the edge-induced pairs and compute the $k$ -RF score in $O(n^{2})$ time.
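The encoding and comparison steps of the proof can be illustrated as follows. The sketch assumes that the edge-induced pairs have already been computed and are given as pairs of label collections. It uses Python integers as $n$ -bit strings and a hash-based set in place of the radix sort, so it shows the idea rather than the stated running time. All function names are hypothetical.

```python
def to_bits(labels, index):
    """Encode a label subset as an integer whose j-th bit marks label j."""
    bits = 0
    for x in labels:
        bits |= 1 << index[x]
    return bits


def encode_pairs(pairs, index, ordered=True):
    """Encode edge-induced pairs of label subsets as pairs of bit strings.

    For unordered pairs (unrooted trees) the two disjoint, non-empty subsets
    are arranged so that the one holding the smallest label comes first.
    """
    encoded = set()
    for first, second in pairs:
        a, b = to_bits(first, index), to_bits(second, index)
        # (x & -x) isolates the lowest set bit, i.e. the smallest label.
        if not ordered and (b & -b) < (a & -a):
            a, b = b, a
        encoded.add((a, b))
    return encoded


def rf_from_pairs(pairs_s, pairs_t, labels, ordered=True):
    """Size of the symmetric difference of two sets of edge-induced pairs."""
    index = {x: j for j, x in enumerate(sorted(labels))}
    return len(encode_pairs(pairs_s, index, ordered)
               ^ encode_pairs(pairs_t, index, ordered))


# Two unrooted 4-node paths, a-b-c-d and a-c-b-d, with k large enough that
# each edge induces the full bipartition of the label set.
S = [("a", "bcd"), ("ab", "cd"), ("abc", "d")]
T = [("bcd", "a"), ("ac", "bd"), ("d", "acb")]
print(rf_from_pairs(S, T, "abcd", ordered=False))  # 2
```

With `ordered=False` the two outer edges of both paths receive the same canonical encoding, so only the middle edges differ.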

4.2 The Distribution of $k$ -RF Scores

We examined the distribution of the $k$ -RF dissimilarity scores for 1-labeled unrooted and rooted trees with the same label set and with different label sets.

The frequency distributions of the pairwise $k$ -RF scores in the space of $n$ -node 1-labeled unrooted and rooted trees for $n$ from 4 to 7 are presented in Figures 3 to 6, respectively. For each $n$ , it suffices to consider $k=0,\ldots,n-2$ . Recall that $(n-2)$ -RF is actually the RF distance. The frequency distribution for the RF distance in the space of phylogenetic trees is known to be Poisson (Steel and Penny, 1993). The pairwise 0-RF and $(n-2)$ -RF scores also seem to have a Poisson distribution in the space of $n$ -node 1-labeled unrooted and rooted trees. However, the distribution of the pairwise $k$ -RF scores is unlikely to be Poisson when $k=1,2,3$ and $k\neq n-2$ .

We examined 1,679,616 (respectively, 60,466,176) pairs of 6-node 1-labeled unrooted (respectively, rooted) trees such that the trees in each pair have $c$ common labels, with $c=3,4,5$ . Table 1 shows that the majority of pairs attain the largest dissimilarity score, 10.

Table 1: The number of pairs of 1-labeled 6-node unrooted (top) and rooted (bottom) trees that have $c$ labels in common and have 1-Robinson-Foulds (1-RF) score $d$ for $c=3,4,5$ and $d=2,4,6,8,10$ .

$c$ \ 1-RF score $d$	4	6	8	10
3	0	0	3,072	1,676,544
4	0	432	16,800	1,662,384
5	340	3,720	53,100	1,622,456

$c$ \ 1-RF score $d$	4	6	8	10
3	0	0	79,872	60,386,304
4	0	7,776	419,136	60,039,264
5	4,080	65,760	1,310,880	59,085,456
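The totals and the share of pairs with the maximum score can be checked directly from Table 1. The short script below is only a sanity check. It assumes that the totals arise from Cayley's formula ( $n^{n-2}$ labeled unrooted trees and $n^{n-1}$ labeled rooted trees on $n$ nodes), which agrees with the pair counts quoted above; the variable names are illustrative.

```python
n = 6
unrooted_pairs = (n ** (n - 2)) ** 2   # 1296^2 = 1,679,616
rooted_pairs = (n ** (n - 1)) ** 2     # 7776^2 = 60,466,176

# Rows of Table 1: c -> counts for 1-RF scores 4, 6, 8, 10.
unrooted = {3: [0, 0, 3_072, 1_676_544],
            4: [0, 432, 16_800, 1_662_384],
            5: [340, 3_720, 53_100, 1_622_456]}
rooted = {3: [0, 0, 79_872, 60_386_304],
          4: [0, 7_776, 419_136, 60_039_264],
          5: [4_080, 65_760, 1_310_880, 59_085_456]}

for name, table, total in [("unrooted", unrooted, unrooted_pairs),
                           ("rooted", rooted, rooted_pairs)]:
    for c, row in table.items():
        assert sum(row) == total       # every row accounts for all pairs
        share = row[-1] / total        # fraction of pairs with score 10
        print(f"{name}, c={c}: {share:.1%} of pairs score 10")
```

Every row sums to the corresponding total, so no pair in these experiments has a 1-RF score outside the listed columns.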

5 A Generalization to Multiset-Labeled Trees

In this section, we extend the measures introduced in Section 3 to multiset-labeled unrooted and rooted trees.

5.1 Multisets and Their Operations

A multiset is a collection of elements in which an element $x$ can occur one or more times (Jürgensen, 2020). The number of occurrences of $x$ in a multiset $A$ is called its multiplicity and is denoted by $m_{A}(x)$ . The set of all distinct elements appearing in $A$ is denoted by $\mbox{Supp}(A)$ . In this paper, we simply represent $A$ by the monomial $x_{1}^{m_{A}(x_{1})}\ldots x_{n}^{m_{A}(x_{n})}$ if $\mbox{Supp}(A)=\{x_{1},x_{2},\cdots,x_{n}\}$ , where $x_{i}^{1}$ is simplified to $x_{i}$ for each $i$ .

Let $A$ and $B$ be two multisets. We say $A$ is a sub-multiset of $B$ , denoted by $A\subseteq_{m}B$ , if for every $x\in\mbox{Supp}(A)$ , $m_{A}(x)\leqslant m_{B}(x)$ . In addition, we say that $A=B$ if $A\subseteq_{m}B$ and $B\subseteq_{m}A$ . Furthermore, the union, sum, intersection, difference, and symmetric difference of $A$ and $B$ are respectively defined as follows:

• $A\cup_{m}B=\left\{x^{\max\{m_{A}(x),m_{B}(x)\}}\mid x\in\mbox{Supp}(A)\cup\mbox{Supp}(B)\right\}$ ;
• $A\uplus_{m}B=\left\{x^{m_{A}(x)+m_{B}(x)}\mid x\in\mbox{Supp}(A)\cup\mbox{Supp}(B)\right\}$ ;
• $A\cap_{m}B=\left\{x^{\min\{m_{A}(x),m_{B}(x)\}}\mid x\in\mbox{Supp}(A)\cap\mbox{Supp}(B)\right\}$ ;
• $A\setminus_{m}B=\left\{x^{m_{A}(x)-m_{B}(x)}\mid x\in\mbox{Supp}(A):m_{A}(x)>m_{B}(x)\right\}$ ;
• $A\triangle_{m}B=(A\cup_{m}B)\setminus_{m}(A\cap_{m}B)$ ;

where $m_{X}(x)$ is defined as 0 if $x\notin\mbox{Supp}(X)$ for $X=A,B$ .
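These operations map directly onto Python's `collections.Counter`, whose operators `|`, `+`, `&` and `-` take the maximum, sum, minimum and positive difference of multiplicities. The following sketch, with two multisets chosen only for illustration, evaluates each definition once.

```python
from collections import Counter

A = Counter({"a": 2, "b": 1})            # the multiset a^2 b
B = Counter({"a": 1, "b": 3, "c": 1})    # the multiset a b^3 c

union = A | B                  # maximum of multiplicities:  a^2 b^3 c
total = A + B                  # sum of multiplicities:      a^3 b^4 c
common = A & B                 # minimum of multiplicities:  a b
a_minus_b = A - B              # positive part of m_A - m_B: a
sym_diff = (A | B) - (A & B)   # |m_A - m_B| per element:    a b^2 c

# Sub-multiset test: every multiplicity in `common` is bounded by that in A.
is_sub = all(common[x] <= A[x] for x in common)
print(sym_diff, is_sub)
```

Note that the symmetric difference keeps each element with multiplicity $|m_{A}(x)-m_{B}(x)|$ , which is the quantity used later when sizes of symmetric differences are compared.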

Let $L$ be a set and $\mathbb{P}_{m}(L)$ be the set of all sub-multisets on $L$ . A tree $T$ is labeled with the sub-multisets of $L$ if $T$ is equipped with a function $\ell:V(T)\to\mathbb{P}_{m}(L)$ such that $\cup_{v\in V(T)}\mbox{Supp}(\ell(v))=L$ and $\ell(v)\neq\emptyset$ for every $v\in V(T)$ . We call such a tree a multiset-labeled tree. For $C\subseteq V(T)$ and $x\in L$ , we define $L_{m}(C)$ and $m_{T}(x)$ as follows:

\begin{array}{rcll}L_{m}(C)&=&\uplus_{v\in C}\ell(v);&\quad(14)\\ m_{T}(x)&=&\sum_{v\in V(T)}m_{\ell(v)}(x).&\quad(15)\end{array}
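As a small illustration of Eqn. (14) and Eqn. (15), the sketch below evaluates both quantities on a hypothetical three-node labeling; the node names and the dictionary `ell` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical labeling of a three-node tree: node -> label multiset.
ell = {"u": Counter("aab"), "v": Counter("bc"), "w": Counter("c")}


def L_m(C):
    """Eqn. (14): multiset sum of the labels of the nodes in C."""
    out = Counter()
    for v in C:
        out += ell[v]
    return out


def m_T(x):
    """Eqn. (15): total multiplicity of x over all nodes of the tree."""
    return sum(label[x] for label in ell.values())


print(L_m(["u", "v"]), m_T("c"))
```

Here $L_{m}(\{u,v\})=a^{2}b^{2}c$ and $m_{T}(c)=2$ .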

5.2 The $k$ -RF for Multiset-Labeled Trees

Let $T$ be a multiset-labeled tree. Then, each edge $e=(u,v)$ of $T$ induces a pair of multisets

P_{T}(e)=\left\{L_{m}(B_{e}(u)),L_{m}(B_{e}(v))\right\},\qquad(16)

where $L_{m}()$ is defined in Eqn. (14), and $B_{e}(u)$ in Eqn. (2). Note that Eqn. (16) is obtained from Eqn. (1) by replacing $L()$ with $L_{m}()$ .

Remark 5.1.

In a multiset-labeled tree $T$ , two edges may induce the same pair of multisets, as shown in Figure 7. Hence, $\mathcal{P}(T)$ in Eqn. (3) is a multiset in general.

We use Eqn. (16), Eqn. (3) and Eqn. (4) to define the RF-distance for multiset-labeled trees by replacing $\triangle$ with $\triangle_{m}$ in Eqn. (4).

Let $k\geqslant 0$ . We use Eqn. (5), Eqn. (6), and Eqn. (7) to define the $k$ -RF for multiset-labeled trees by replacing $L()$ with $L_{m}()$ in Eqn. (5) and replacing $\triangle$ with $\triangle_{m}$ in Eqn. (7).

Example 5.2.

Consider the multiset-labeled trees $S$ , $\acute{S}$ , and $T$ in Figure 8. ${\mathcal{P}}_{k}(T),{\mathcal{P}}_{k}(S)$ and ${\mathcal{P}}_{k}(\acute{S})$ for $k=0,1,\infty$ are summarized in Table 2. We obtain:

\begin{array}[]{lll}d_{0\mbox{\tiny-RF}}(T,\acute{S})=2;&d_{\tiny 1\mbox{\rm-RF}}(T,\acute{S})=6;&d_{\mbox{\tiny RF}}(T,\acute{S})=12;\\ d_{0\mbox{\tiny-RF}}(S,\acute{S})=10;&d_{\tiny 1\mbox{\rm-RF}}(S,\acute{S})=12;&d_{\mbox{\tiny RF}}(S,\acute{S})=12.\\ \end{array}

It is not hard to see that both $d_{0\mbox{\tiny-RF}}(T,\acute{S})$ and $d_{\tiny 1\mbox{\rm-RF}}(T,\acute{S})$ reflect the local similarity of the two multiset-labeled trees better than $d_{\mbox{\tiny RF}}(T,\acute{S})$ .

Table 2: Edge-induced unordered pairs of multisets in the three trees in Fig. 8 for $k=0,1,\infty$ .

\begin{array}[]{clll}\hline\cr\textrm{Tree}&\mathcal{P}_{0}(\;)&\mathcal{P}_{1}(\;)&\mathcal{P}_{\infty}(\;)\\
\hline\cr T&\{c^{2},e^{2}\}&\{c^{2},ce^{2}\}&\{a^{2}b^{2}c^{3}d^{2}e^{2},c^{2}\}\\
&\{c,e^{2}\}&\{ab^{2}cd^{2},ac^{2}\}&\{a^{2}b^{2}c^{3}d^{2},c^{2}e^{2}\}\\
&\{ac,c\}&\{ac^{2},c^{2}e^{2}\}&\{a^{2}b^{2}c^{2}d^{2},c^{3}e^{2}\}\\
&\{ac,d\}&\{ab^{2},ac^{2}d^{2}\}&\{ab^{2}cd^{2},ac^{4}e^{2}\}\\
&\{ab^{2},d\}&\{acd,ce^{2}\}&\{ab^{2},ac^{5}d^{2}e^{2}\}\\
&\{cd,d\}&\{a^{2}b^{2}cd,cd\}&\{a^{2}b^{2}c^{4}de^{2},cd\}\\
\hline\cr S&\{c^{2},e\}&\{ac^{2}e^{2},c^{2}\}&\{a^{2}b^{2}c^{3}d^{2}e^{2},c^{2}\}\\
&\{ce,e\}&\{a^{2}bc^{2}d,bd\}&\{a^{2}b^{2}c^{2}d^{2},c^{3}e^{2}\}\\
&\{ac,e\}&\{ab^{2}cd^{2},ace\}&\{ab^{2}cd^{2},ac^{4}e^{2}\}\\
&\{ac,d\}&\{ac^{3}e,ce\}&\{a^{2}b^{2}c^{4}d^{2}e,ce\}\\
&\{abc,d\}&\{acd,c^{3}e^{2}\}&\{abc,abc^{4}d^{2}e^{2}\}\\
&\{bd,d\}&\{abc,abcd^{2}\}&\{a^{2}bc^{5}de^{2},bd\}\\
\hline\cr\acute{S}&\{c^{2},e^{2}\}&\{c^{2},ce^{2}\}&\{ab^{3}c^{3}d^{2}e^{2},c^{2}\}\\
&\{c,e^{2}\}&\{ac^{2},c^{2}e^{2}\}&\{ab^{3}c^{3}d^{2},c^{2}e^{2}\}\\
&\{ac,c\}&\{acd,ce^{2}\}&\{ab^{3}c^{2}d^{2},c^{3}e^{2}\}\\
&\{ac,d\}&\{ac^{2}d^{2},b^{3}\}&\{ac^{4}e^{2},b^{3}cd^{2}\}\\
&\{b^{3},d\}&\{ac^{2},b^{3}cd^{2}\}&\{ac^{5}e^{2}d^{2},b^{3}\}\\
&\{cd,d\}&\{ab^{3}cd,cd\}&\{ab^{3}c^{4}e^{2}d,cd\}\\
\hline\cr\end{array}
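The scores in Example 5.2 can be recomputed from the $\mathcal{P}_{0}$ and $\mathcal{P}_{1}$ columns of Table 2. The sketch below is an illustration only: it writes each multiset as an expanded string (for instance `"cc"` for $c^{2}$ ), stores each unordered pair in a canonical order, and takes the size of the multiset symmetric difference. The helper names `pairs` and `score` are hypothetical, and `S1` stands for $\acute{S}$ .

```python
from collections import Counter


def pairs(*items):
    """Multiset of unordered pairs; each multiset is an expanded string."""
    return Counter(tuple(sorted(("".join(sorted(x)), "".join(sorted(y)))))
                   for x, y in items)


def score(P, Q):
    """Size of the multiset symmetric difference of P and Q."""
    return sum(((P | Q) - (P & Q)).values())


# Columns P_0 and P_1 of Table 2, with x^m expanded to m copies of x.
P0_T = pairs(("cc", "ee"), ("c", "ee"), ("ac", "c"),
             ("ac", "d"), ("abb", "d"), ("cd", "d"))
P0_S = pairs(("cc", "e"), ("ce", "e"), ("ac", "e"),
             ("ac", "d"), ("abc", "d"), ("bd", "d"))
P0_S1 = pairs(("cc", "ee"), ("c", "ee"), ("ac", "c"),
              ("ac", "d"), ("bbb", "d"), ("cd", "d"))
P1_T = pairs(("cc", "cee"), ("abbcdd", "acc"), ("acc", "ccee"),
             ("abb", "accdd"), ("acd", "cee"), ("aabbcd", "cd"))
P1_S1 = pairs(("cc", "cee"), ("acc", "ccee"), ("acd", "cee"),
              ("accdd", "bbb"), ("acc", "bbbcdd"), ("abbbcd", "cd"))

print(score(P0_T, P0_S1), score(P0_S, P0_S1), score(P1_T, P1_S1))  # 2 10 6
```

The three printed values agree with $d_{0\mbox{\tiny-RF}}(T,\acute{S})=2$ , $d_{0\mbox{\tiny-RF}}(S,\acute{S})=10$ and $d_{\tiny 1\mbox{\rm-RF}}(T,\acute{S})=6$ in Example 5.2.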

5.3 The $k$ -RF for Multiset-Labeled Rooted Trees

Let $k\geqslant 0$ be an integer. We use Eqn. (10), Eqn. (11), and Eqn. (12) to define $k$ -RF for multiset-labeled rooted trees by replacing $L()$ with $L_{m}()$ in Eqn. (10) and replacing $\triangle$ with $\triangle_{m}$ in Eqn. (12).

Proposition 5.3.

Let $k\geqslant 0$ be an integer. The $k$ -RF satisfies the non-negativity, symmetry, and triangle inequality conditions. Hence, $k$ -RF is a pseudometric for each $k$ in the space of multiset-labeled (rooted) trees.

Proof 5.4.

The non-negativity and symmetry conditions follow from the definition of the $k$ -RF. The triangle inequality condition is proved as follows.

Let $T_{1}$ , $T_{2}$ , and $T_{3}$ be three multiset-labeled trees. We need to show:

\displaystyle d_{\tiny k\mbox{\rm-RF}}(T_{1},T_{2})\leqslant d_{\tiny k\mbox{\rm-RF}}(T_{1},T_{3})+d_{\tiny k\mbox{\rm-RF}}(T_{3},T_{2}).

Note that $\mathcal{P}_{k}(X)$ denotes the multiset of edge-induced pairs of multisets in $X$ for $X=T_{1},T_{2},T_{3}$ .

If $x^{m(x)}\in\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{2})$ , we have either $x^{m(x)}\in\mathcal{P}_{k}(T_{1})\setminus_{m}\mathcal{P}_{k}(T_{2})$ or $x^{m(x)}\in\mathcal{P}_{k}(T_{2})\setminus_{m}\mathcal{P}_{k}(T_{1})$ . Assume $x^{m(x)}\in\mathcal{P}_{k}(T_{1})\setminus_{m}\mathcal{P}_{k}(T_{2})$ . Then, $m_{\mathcal{P}_{k}(T_{1})}(x)>m_{\mathcal{P}_{k}(T_{2})}(x)$ . If $x\notin\textrm{Supp}(\mathcal{P}_{k}(T_{3})\setminus_{m}\mathcal{P}_{k}(T_{2}))$ , we have $m_{\mathcal{P}_{k}(T_{1})}(x)>m_{\mathcal{P}_{k}(T_{2})}(x)\geqslant m_{\mathcal{P}_{k}(T_{3})}(x)$ . This implies that $x\in\mbox{Supp}(\mathcal{P}_{k}(T_{1})\setminus_{m}\mathcal{P}_{k}(T_{3}))$ and $m_{\mathcal{P}_{k}(T_{1})\setminus_{m}\mathcal{P}_{k}(T_{3})}(x)=m_{\mathcal{P}_{k}(T_{1})}(x)-m_{\mathcal{P}_{k}(T_{3})}(x)\geqslant m_{\mathcal{P}_{k}(T_{1})}(x)-m_{\mathcal{P}_{k}(T_{2})}(x)=m(x).$ Thus, $m(x)\leqslant m_{\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{3})}(x)+m_{\mathcal{P}_{k}(T_{3})\triangle_{m}\mathcal{P}_{k}(T_{2})}(x).$

On the other hand, if $x\in\textrm{Supp}(\mathcal{P}_{k}(T_{3})\setminus_{m}\mathcal{P}_{k}(T_{2}))$ and $m_{\mathcal{P}_{k}(T_{3})}(x)\geqslant m_{\mathcal{P}_{k}(T_{1})}(x)$ , we have:

\begin{array}{rcl}m_{\mathcal{P}_{k}(T_{3})\setminus_{m}\mathcal{P}_{k}(T_{2})}(x)&=&m_{\mathcal{P}_{k}(T_{3})}(x)-m_{\mathcal{P}_{k}(T_{2})}(x)\\ &\geqslant&m_{\mathcal{P}_{k}(T_{1})}(x)-m_{\mathcal{P}_{k}(T_{2})}(x)=m(x).\end{array}

If $x\in\textrm{Supp}(\mathcal{P}_{k}(T_{3})\setminus_{m}\mathcal{P}_{k}(T_{2}))$ and $m_{\mathcal{P}_{k}(T_{3})}(x)<m_{\mathcal{P}_{k}(T_{1})}(x)$ , we have $m_{\mathcal{P}_{k}(T_{1})}(x)>m_{\mathcal{P}_{k}(T_{3})}(x)>m_{\mathcal{P}_{k}(T_{2})}(x)$ , implying $x\in\textrm{Supp}(\mathcal{P}_{k}(T_{1})\setminus_{m}\mathcal{P}_{k}(T_{3}))$ . Thus, we have:

\begin{array}{rcl}m(x)&=&m_{\mathcal{P}_{k}(T_{1})\setminus_{m}\mathcal{P}_{k}(T_{3})}(x)+m_{\mathcal{P}_{k}(T_{3})\setminus_{m}\mathcal{P}_{k}(T_{2})}(x)\\ &\leqslant&m_{\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{3})}(x)+m_{\mathcal{P}_{k}(T_{3})\triangle_{m}\mathcal{P}_{k}(T_{2})}(x).\end{array}

Lastly, if $x^{m(x)}\in\mathcal{P}_{k}(T_{2})\setminus_{m}\mathcal{P}_{k}(T_{1})$ , we can obtain the same result.

In summary, we have:

\textrm{Supp}(\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{2}))\subseteq\textrm{Supp}(\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{3}))\cup\textrm{Supp}(\mathcal{P}_{k}(T_{3})\triangle_{m}\mathcal{P}_{k}(T_{2})).

In addition, for each $x\in\textrm{Supp}(\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{2}))$ , we have:

m_{\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{2})}(x)\leqslant m_{\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{3})}(x)+m_{\mathcal{P}_{k}(T_{3})\triangle_{m}\mathcal{P}_{k}(T_{2})}(x).

Therefore, we have:

|\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{2})|\leqslant|\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{3})|+|\mathcal{P}_{k}(T_{3})\triangle_{m}\mathcal{P}_{k}(T_{2})|,

that is, the triangle inequality holds.

For multiset-labeled rooted trees, the proof is similar and hence omitted.
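The case analysis in the proof can be condensed into a single inequality, given here as a sketch in the notation of the proof. Write $m_{i}(x)$ for $m_{\mathcal{P}_{k}(T_{i})}(x)$ , a shorthand introduced only for this remark, and assume that the size of a multiset is the sum of its multiplicities. By the definitions of $\cup_{m}$ , $\cap_{m}$ and $\setminus_{m}$ , the multiplicity of $x$ in a symmetric difference is the absolute difference of its two multiplicities, so the ordinary triangle inequality on integers applies to each $x$ , and summing over $x$ gives the result:

```latex
\begin{aligned}
m_{\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{2})}(x)
  &= \max\{m_{1}(x),m_{2}(x)\}-\min\{m_{1}(x),m_{2}(x)\}
   = |m_{1}(x)-m_{2}(x)| \\
  &\leqslant |m_{1}(x)-m_{3}(x)|+|m_{3}(x)-m_{2}(x)|, \\
|\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{2})|
  &= \sum_{x}|m_{1}(x)-m_{2}(x)|
   \leqslant |\mathcal{P}_{k}(T_{1})\triangle_{m}\mathcal{P}_{k}(T_{3})|
     +|\mathcal{P}_{k}(T_{3})\triangle_{m}\mathcal{P}_{k}(T_{2})|.
\end{aligned}
```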

Remark 5.5.

For multiset-labeled trees, $d_{\tiny k\mbox{\rm-RF}}(S,T)=0$ does not imply $S$ and $T$ are identical, as shown in Fig. 9.

Proposition 5.6.

Let $k\geqslant 0$ and $S$ and $T$ be two (rooted) trees whose nodes are labeled by $L(S)$ and $L(T)$ , respectively. Then, $d_{\tiny k\mbox{\rm-RF}}(S,T)$ can be computed in $O((k+B)D(|V(S)|+|V(T)|))$ time, where $B$ is the maximum multiplicity of a label appearing in $\{P_{T}(e,k)\mid e\in E(T)\}\cup\{P_{S}(e,k)\mid e\in E(S)\}$ and $D=|\mbox{Supp}(L(S))\cup\mbox{Supp}(L(T))|$ .

Proof 5.7.

An algorithm for the 1-labeled case can be modified as follows for computing $k$ -RF on multiset-labeled rooted and unrooted trees:

• Represent each label multiset as a $D$ -dimensional vector, in which the integer at position $j$ is the multiplicity of the $j$ -th label. Computing all edge-induced pairs in both trees takes $O(k(|E(S)|+|E(T)|))$ set operations. Each set operation takes $D$ integer operations.
• Radix-sort all the edge-induced pairs for $S$ and $T$ in $O(D(|E(S)|+B))$ and $O(D(|E(T)|+B))$ integer operations, respectively.
• Compute the symmetric difference of the multisets of edge-induced pairs in the two input trees in $|E(S)|+|E(T)|$ set operations. Each set operation takes $D$ integer operations.

In summary, one can compute $d_{\tiny k\mbox{\rm-RF}}(S,T)$ using $O((k+B)D(|V(S)|+|V(T)|))$ integer operations, as $|E(S)|=|V(S)|-1$ and $|E(T)|=|V(T)|-1$ .
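The three steps of the proof can be sketched as follows, assuming the edge-induced pairs of label multisets are already available as dictionaries of multiplicities. The sketch uses Python's built-in comparison sort in place of the radix sort, so it illustrates the data layout and the merge step rather than the stated bound; all names are hypothetical.

```python
def to_vector(multiset, labels):
    """D-dimensional vector of multiplicities, one position per label."""
    return tuple(multiset.get(x, 0) for x in labels)


def encode(pairs, labels, ordered=True):
    """Sorted list of edge-induced pairs, each as a pair of vectors."""
    out = []
    for first, second in pairs:
        a, b = to_vector(first, labels), to_vector(second, labels)
        out.append((a, b) if ordered else tuple(sorted((a, b))))
    return sorted(out)   # comparison sort standing in for the radix sort


def sym_diff_size(xs, ys):
    """Merge two sorted lists, counting entries left unmatched on either side."""
    i = j = unmatched = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            i, j = i + 1, j + 1
        elif xs[i] < ys[j]:
            i, unmatched = i + 1, unmatched + 1
        else:
            j, unmatched = j + 1, unmatched + 1
    return unmatched + (len(xs) - i) + (len(ys) - j)


labels = ["a", "b", "c"]
# Hypothetical edge-induced pairs; the first pair occurs twice in S.
S = [({"a": 2}, {"b": 1, "c": 1}), ({"a": 2}, {"b": 1, "c": 1}),
     ({"c": 1}, {"a": 2, "b": 1})]
T = [({"b": 1, "c": 1}, {"a": 2}), ({"a": 1, "c": 1}, {"a": 1, "b": 1})]
print(sym_diff_size(encode(S, labels, False), encode(T, labels, False)))  # 3
```

Because equal entries are matched one by one during the merge, a pair that occurs twice in one tree and once in the other contributes exactly one to the score, as required by $\triangle_{m}$ .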

5.4 Correlation of the $k$ -RF and the Other Measures

Let $T$ and $S$ be two 1-labeled rooted trees with the same label set $X$ . Again, we identify the nodes with their labels in the two trees. For any two subsets $X^{\prime}$ and $X^{\prime\prime}$ of $X$ , we use $d_{J}(X^{\prime},X^{\prime\prime})$ to denote their Jaccard distance. The CASet $\cap$ distance between $T$ and $S$ is defined to be the average of $d_{J}(A_{T}(i)\cap A_{T}(j),A_{S}(i)\cap A_{S}(j))$ over all pairs of nodes $i$ and $j$ , whereas the DISC $\cap$ distance between $T$ and $S$ is the average of $d_{J}(A_{T}(i)\setminus A_{T}(j),A_{S}(i)\setminus A_{S}(j))$ over all ordered pairs $(i,j)$ of nodes (DiNardo et al., 2020).
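The two averages can be sketched in a few lines. The code below is a hedged illustration, not the reference implementation of DiNardo et al. (2020): it assumes that $A_{T}(i)$ consists of $i$ together with its ancestors in $T$ , that the Jaccard distance of two empty sets is 0, and that the averages run over pairs of distinct nodes. Trees are given as hypothetical parent maps.

```python
from itertools import combinations, permutations


def jaccard(x, y):
    """Jaccard distance; two empty sets are taken to be at distance 0 (assumption)."""
    union = x | y
    return 1 - len(x & y) / len(union) if union else 0.0


def ancestor_sets(parent):
    """A[i] = i together with all its ancestors (assumed meaning of A_T(i))."""
    A = {}
    for i in parent:
        A[i], u = set(), i
        while u is not None:
            A[i].add(u)
            u = parent[u]
    return A


def caset(parent_t, parent_s):
    """Average over unordered pairs {i, j} of distinct nodes."""
    At, As = ancestor_sets(parent_t), ancestor_sets(parent_s)
    pairs = list(combinations(sorted(At), 2))
    return sum(jaccard(At[i] & At[j], As[i] & As[j]) for i, j in pairs) / len(pairs)


def disc(parent_t, parent_s):
    """Average over ordered pairs (i, j) of distinct nodes."""
    At, As = ancestor_sets(parent_t), ancestor_sets(parent_s)
    pairs = list(permutations(sorted(At), 2))
    return sum(jaccard(At[i] - At[j], As[i] - As[j]) for i, j in pairs) / len(pairs)


# T is the path r -> a -> b; S is the star r -> a, r -> b (parent maps).
T = {"r": None, "a": "r", "b": "a"}
S = {"r": None, "a": "r", "b": "r"}
print(caset(T, S), disc(T, S))
```

On this toy pair only the node pair $\{a,b\}$ contributes to the CASet $\cap$ average, whereas two ordered pairs contribute to the DISC $\cap$ average.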

Using the Pearson correlation, we compared the $k$ -RF with CASet $\cap$ , DISC $\cap$ , and GRF (Llabrés et al., 2020) in the space of set-labeled trees for different $k$ from 0 to 28.

Firstly, we conducted the correlation analysis in the space of mutation trees with the same label set. Using a method reported by Jahn et al. (2021), we generated a simulated dataset containing 5,000 rooted trees in which the root was labeled with 0 and the other nodes were labeled by disjoint subsets of $\{1,2,\ldots,29\}$ , where the trees might have different numbers of nodes. Using all $\binom{5000}{2}$ pairwise scores for CASet $\cap$ , DISC $\cap$ , GRF and $k$ -RF, we conducted the Pearson correlation analysis of $k$ -RF with the other three (left panel, Fig. 10).
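For reference, the Pearson correlation of two score lists over the same tree pairs can be computed as in the sketch below. The score values are made up for illustration and are not taken from the experiments.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical pairwise scores of two measures on the same four tree pairs.
krf_scores = [2, 6, 10, 12]
other_scores = [0.1, 0.4, 0.5, 0.9]
print(round(pearson(krf_scores, other_scores), 3))
```

A value close to 1 indicates that the two measures rank the tree pairs in a nearly linear relation; the coefficient is undefined when either list is constant.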

Our results show that CASet $\cap$ , DISC $\cap$ and GRF were all positively correlated with $k$ -RF. We observed the following facts:

• The GRF and $k$ -RF had the largest Pearson correlation for each $k<8$ , whereas the DISC $\cap$ and $k$ -RF had the largest Pearson correlation for each $k\geqslant 8$ .
• The 5-RF and 6-RF were less correlated with CASet $\cap$ , DISC $\cap$ and GRF than the other $k$ -RF.
• The Pearson correlation between $k$ -RF and CASet $\cap$ (respectively, DISC $\cap$ ) increased when $k$ went from 6 to 15.

Secondly, we conducted the Pearson correlation analysis on trees with different but overlapping label sets. The dataset was generated by the same method and was a union of 5 groups of rooted trees, each of which contained 200 trees over the same label set. We computed the dissimilarity scores for each tree in the first group and each tree in the other groups and then computed the Pearson correlation between different measures. Again, all the dissimilarity measures were positively correlated, but less correlated than in the first case; see Fig. 10 (right). This observation could be the result of the fact that a difference in the label sets of two trees makes their $k$ -RF score at least $k+1$ . However, the difference does not strongly contribute to the other distances because DISC $\cap$ and CASet $\cap$ consider the intersection of label sets (see DiNardo et al., 2020), and GRF considers the intersection of clusters.

The right dotplot of Fig. 10 shows that the $k$ -RF and DISC $\cap$ had the largest Pearson correlation for $k$ from 1 to 9, and the $k$ -RF and CASet $\cap$ had the largest Pearson correlation for $k\geqslant 10$ . Moreover, all the Pearson correlations decreased when $k$ changed from 1 to 15. This trend was not observed in the first case. The decrease could be the result of the fact that a difference in label sets contributes more to the $k$ -RF as $k$ increases.

6 Clustering Trees with the $k$ -RF

A test was designed to demonstrate which of the $k$ -RF, CASet $\cap$ , DISC $\cap$ , and GRF is best at clustering labeled trees.

We randomly generated 5 tree families, each containing 50 trees, using the program reported by Jahn et al. (2021). The nodes were labeled by the subsets of a 30-label set in the trees of each family. The label sets used in different tree families were different, but overlapping. As the nodes were labeled by disjoint subsets, each label that differs between the label sets of two trees induces at least $d$ different pairs, where $d$ is the degree of the node with the label. Thus, a large number of different elements between the label sets could make the trees more distinguishable by the $k$ -RF. Therefore, the label sets used for the different tree families differed in only one label.

We computed the pairwise dissimilarity scores for all 250 trees in the five groups using each measure; we then clustered the 250 trees into $c$ clusters using the $K$ -means algorithm, where $c$ ranges from $2$ to $57$ . The clustering results were assessed using the Silhouette score (Kaufman and Rousseeuw, 2009).
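The Silhouette score depends only on pairwise dissimilarities and cluster labels, so it can be evaluated directly on a matrix of $k$ -RF scores. The sketch below is a plain-Python illustration of the standard definition; it does not reproduce the clustering step, since how the $K$ -means algorithm was applied to the dissimilarity scores is not detailed here. It assumes at least two clusters, and the function name is hypothetical.

```python
def silhouette(dist, labels):
    """Mean Silhouette score from a pairwise dissimilarity matrix.

    dist[i][j] is the dissimilarity of items i and j; labels[i] is the
    cluster of item i. Items in singleton clusters score 0 by convention.
    """
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    scores = []
    for i, c in enumerate(labels):
        own = [j for j in clusters[c] if j != i]
        if not own:
            scores.append(0.0)
            continue
        # a: mean dissimilarity to the other members of the same cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean dissimilarity to the members of another cluster.
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != c)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Four items on a line with two well-separated groups.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
print(silhouette(dist, [0, 0, 1, 1]) > silhouette(dist, [0, 1, 0, 1]))  # True
```

A clustering that matches the true groups yields a score close to 1, while a clustering that mixes the groups yields a negative score.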

As Fig. 11 illustrates, none of the CASet $\cap$ , DISC $\cap$ , and GRF distances was able to recognize the exact number of families. However, CASet $\cap$ had the highest Silhouette score when the number of clusters was 5, compared to DISC $\cap$ , GRF, and the $k$ -RF for $k\leqslant 12$ . In addition, the figure shows that the $k$ -RF could recognize the correct number of families when $k$ ranges from 12 to 19. Moreover, the Silhouette score of the $k$ -RF increased when $k$ increased from $8$ to $19$ . This interesting observation may stem from the fact that as $k$ increases, the number of pairs of trees achieving the highest possible $k$ -RF score also increases, thereby enhancing the recognizability of families. It is worth noting that such pairs are guaranteed to exist when $k$ is larger than the minimum diameter of the trees, which is 8 in our case.

7 Conclusions

The development of an efficient and robust measure for the comparison of labeled trees is important. In this paper, we have proposed a novel variant of dissimilarity metrics, namely the $k$ -RF, tailored for labeled trees. The $k$ -RF facilitates the analysis of local structures in labeled trees, accommodating nodes labeled with (not necessarily the same) multisets. Significantly, these metrics find practical applicability in mutation trees used in cancer research.

The RF distance is succinctly expressed as $(n-1)$ -RF within the space of labeled trees with $n$ nodes. By setting $k$ to a value smaller than $n-1$ , the $k$ -RF metric can capture analogous local regions in two labeled trees. Notably, for every $k$ , the $k$ -RF is a pseudometric for multiset-labeled trees and becomes a metric in the space of 1-labeled trees. However, the distribution of pairwise $k$ -RF scores in the space of 1-labeled unrooted (or rooted) trees appears to conform to a Poisson distribution specifically for $k=n-2$ , and is unlikely to follow the same trend for other values of $k\geqslant 1$ .

We verified the $k$ -RF measures through a comprehensive comparison with CASet, DISC (DiNardo et al., 2020) and GRF (Llabrés et al., 2021) on randomly labeled trees generated by an in-house program (Jahn et al., 2021). Our findings revealed a consistent positive correlation between $k$ -RF and each of the other three measures for every value of $k$ . Notably, the correlation values exhibited a tendency to be higher when the measures were applied to assess mutation trees with identical label sets. Furthermore, our study underscored the superior clustering capabilities of $k$ -RF compared to the three mentioned measures.

We would like to emphasize that selecting an appropriate $k$ -RF in practical applications lacks a universal rule of thumb, primarily due to a shortage of experience in this domain. Perhaps a judicious approach involves choosing a suitable $k$ -RF by carefully considering the topological similarity among the trees under consideration.

Future work includes how to apply the $k$ -RF to designing tree inference algorithms like GraPhyC (Govek et al., 2018) and also how to infer the exact frequency distribution of the $k$ -RF for each $k\geqslant 1$ . It is also interesting to investigate the generalization of RF-distance for clonal trees (Llabrés et al., 2020).

The computer program for the $k$ -RF can be downloaded from https://github.com/Elahe-khayatian/k-RF-measures.git.

Acknowledgments

The authors would like to thank the anonymous reviewer for providing helpful suggestions and comments on our first submission of the work. This research was partially supported by the Ministerio de Ciencia e Innovación (MCI), the Agencia Estatal de Investigación (AEI) and the European Regional Development Funds (ERDF) through project METACIRCLE PID2021-126114NB-C44, also supported by the European Regional Development Fund (FEDER), by the Agency for Management of University and Research Grants (AGAUR) through grant 2017-SGR-786 (ALBCOM), and by Singapore MOE Tier 1 grant R-146-000-318-114.

References

Briand et al. (2020) Briand, S., Dessimoz, C., El-Mabrouk, N., Lafond, M., and Lobinska, G. (2020). A generalized Robinson-Foulds distance for labeled trees. BMC Genomics, 21(Suppl 10):779.
Briand et al. (2022) Briand, S., Dessimoz, C., El-Mabrouk, N., and Nevers, Y. (2022). A linear time solution to the labeled Robinson-Foulds distance problem. Systematic Biology, 71(6):1391–1403.
Camin and Sokal (1965) Camin, J. H. and Sokal, R. R. (1965). A method for deducing branching sequences in phylogeny. Evolution, 19(3):311–326.
Ciccolella et al. (2021) Ciccolella, S., Bernardini, G., Denti, L., Bonizzoni, P., Previtali, M., and Vedova, G. D. (2021). Triplet-based similarity score for fully multilabeled trees with poly-occurring labels. Bioinformatics, 37(2):178–184.
DiNardo et al. (2020) DiNardo, Z., Tomlinson, K., Ritz, A., and Oesper, L. (2020). Distance measures for tumor evolutionary trees. Bioinformatics, 36(7):2090–2097.
Estabrook et al. (1985) Estabrook, G. F., McMorris, F. R., and Meacham, C. A. (1985). Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34(2):193–200.
Farris (1977) Farris, J. S. (1977). Phylogenetic analysis under Dollo’s law. Systematic Biology, 26(1):77–88.
Govek et al. (2018) Govek, K., Sikes, C., and Oesper, L. (2018). A consensus approach to infer tumor evolutionary histories. In Proc. 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB ’18), pages 63–72, New York, NY. ACM Press.
Jahn et al. (2021) Jahn, K., Beerenwinkel, N., and Zhang, L. (2021). The Bourque distances for mutation trees of cancers. Algorithms for Molecular Biology, 16:9.
Jürgensen (2020) Jürgensen, H. (2020). Multisets, heaps, bags, families: What is a multiset? Mathematical Structures in Computer Science, 30(2):139–158.
Karpov et al. (2019) Karpov, N., Malikic, S., Rahman, M. K., and Sahinalp, S. C. (2019). A multi-labeled tree dissimilarity measure for comparing clonal trees of tumor progression. Algorithms for Molecular Biology, 14:17.
Kaufman and Rousseeuw (2009) Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: An introduction to cluster analysis. John Wiley & Sons.
Li et al. (1996) Li, M., Tromp, J., and Zhang, L. (1996). On the nearest neighbour interchange distance between evolutionary trees. Journal of Theoretical Biology, 182(4):463–467.
Llabrés et al. (2020) Llabrés, M., Rosselló, F., and Valiente, G. (2020). A generalized Robinson-Foulds distance for clonal trees, mutation trees, and phylogenetic trees and networks. In Proc. 11th ACM Int. Conf. Bioinformatics, Computational Biology and Health Informatics, pages 13:1–13:10, New York, NY. ACM Press.
Llabrés et al. (2021) Llabrés, M., Rosselló, F., and Valiente, G. (2021). The generalized Robinson-Foulds distance for phylogenetic trees. Journal of Computational Biology, 28(12):1–15.
Robinson (1971) Robinson, D. F. (1971). Comparison of labeled trees with valency three. Journal of Combinatorial Theory, 11(2):105–119.
Robinson and Foulds (1981) Robinson, D. F. and Foulds, L. R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1–2):131–147.
Sashittal et al. (2023) Sashittal, P., Zhang, H., Iacobuzio-Donahue, C. A., and Raphael, B. J. (2023). Condor: Tumor phylogeny inference with a copy-number constrained mutation loss model. bioRxiv [Preprint].
Schwartz and Schäffer (2017) Schwartz, R. and Schäffer, A. A. (2017). The evolution of tumour phylogenetics: Principles and practice. Nature Reviews Genetics, 18(4):213–229.
Steel and Penny (1993) Steel, M. A. and Penny, D. (1993). Distributions of tree comparison metrics: Some new results. Systematic Biology, 42(2):126–141.
Williams and Clifford (1971) Williams, W. T. and Clifford, H. T. (1971). On the comparison of two classifications of the same set of elements. Taxon, 20(4):519–522.
