This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

[2]\fnmMomoko \surHayamizu

1]\orgdivDepartment of Pure and Applied Mathematics, \orgnameGraduate School of Fundamental Science and Engineering, Waseda University, \countryJapan

2]\orgdivDepartment of Applied Mathematics, Faculty of Science and Engineering, \orgnameWaseda University, \countryJapan

Orientability of Undirected Phylogenetic Networks to a Desired Class: Practical Algorithms and Application to Tree-Child Orientation

\fnmTsuyoshi \surUrata uratsuyo244@moegi.waseda.jp    \fnmManato \surYokoyama mana.aki.aya@akane.waseda.jp    \fnmHaruki \surMiyaji h.miyaji@ruri.waseda.jp    hayamizu@waseda.jp [ [
Abstract

The 𝒞\mathcal{C}-Orientation problem asks whether it is possible to orient an undirected graph to a directed phylogenetic network of a desired network class 𝒞\mathcal{C}. This problem arises, for example, when visualising evolutionary data, as popular methods such as Neighbor-Net are distance-based and inevitably produce undirected graphs. The complexity of 𝒞\mathcal{C}-Orientation remains open for many classes 𝒞\mathcal{C}, including binary tree-child networks, and practical methods are still lacking. In this paper, we propose an exact FPT algorithm for 𝒞\mathcal{C}-Orientation that is applicable to any class 𝒞\mathcal{C} and parameterised by the reticulation number and the maximum size of minimal basic cycles, and a very fast heuristic for Tree-Child Orientation. While the state-of-the-art for 𝒞\mathcal{C}-Orientation is a simple exponential time algorithm whose computational bottleneck lies in searching for appropriate reticulation vertex placements, our methods significantly reduce this search space. Experiments show that, although our FPT algorithm is still exponential, it significantly outperforms the existing method. The heuristic runs even faster but with increasing false negatives as the reticulation number grows. Given this trade-off, we also discuss theoretical directions for improvement and biological applicability of the heuristic approach.

keywords:
Phylogenetic Networks, Tree-Child Networks, Acyclic Graph Orientation, FPT Algorithm, Exact Algorithm, Heuristic Algorithm

1 Introduction

Phylogenetic networks are a powerful tool for representing complex evolutionary relationships between species that cannot be adequately modelled by trees. These networks are particularly useful in the presence of reticulate events, such as hybridisation, horizontal gene transfer (HGT) and recombination. Hybridisation refers to the interbreeding of individuals from different species, which can lead to the formation of a new hybrid species that shares genetic material from both parent species. HGT is the transmission of copied genetic material to another organism without being its offspring, a process particularly common in bacteria and archaea. Recombination is the exchange of genetic material between different genomes and is a common occurrence not only in viruses but also in bacteria, eukaryotes, and other organisms. While phylogenetic trees depict the hierarchical branching of evolutionary lineages, networks can represent both the hierarchical and non-hierarchical connections resulting from reticulate events, providing a more nuanced view of evolutionary history.

However, constructing directed phylogenetic networks from biological data remains a challenging task. Distance-based methods, such as Neighbor-Net, are widely used because they are scalable and helpful in visualising the data. However, the resulting networks are inevitably undirected, often making it difficult to interpret the evolutionary history. To provide a more informative representation of the data, it is meaningful to develop a method for transforming undirected graphs into directed phylogenetic networks in a way that ensures the resulting network has a desired property.

Recently, Huber et al.[1] introduced two different orientation problems, each considering orientation under a different constraint. The Constrained Orientation problem asks whether a given undirected phylogenetic network can be oriented to a directed network under the constraint of a given root edge and desired in-degrees of all vertices, and asks to find a feasible orientation under that constraint if one exists. In [1], it was shown that such a feasible orientation is unique if one exists, and a linear time algorithm for solving this problem was provided. However, when an undirected network has been created from data, it is not usually the case that there is complete knowledge of where to insert the root and which vertices are reticulations. The 𝒞\mathcal{C}-Orientation problem does not constrain the position of the root or the in-degree of the vertices. Instead, it asks whether a given binary network can be oriented to a directed phylogenetic network belonging to a desired class 𝒞\mathcal{C}.

The complexity of 𝒞\mathcal{C}-Orientation is not fully understood for many classes 𝒞\mathcal{C}, and no study has discussed practically useful methods. In [1], Tree-Based Orientation was shown to be NP-hard. Maeda et al.[2] conjectured that Tree-Child Orientation is NP-hard. Bulteau et al.[3] studied a similar problem and showed that for a graph with maximum degree five, determining whether the graph can be oriented to a tree-child network with a designated root vertex is NP-hard (Corollary 5 in [3]). Without the assumption of maximum degree five, the complexity of this problem is still unclear. Indeed, [3] was concluded by noting that even for graphs of maximum degree three (i.e. binary networks), finding an orientation to a desired class, such as tree-child, tree-based or reticulation-visible networks, may not be easy. Although [1] provided FPT algorithms for a special case of 𝒞\mathcal{C}-Orientation where 𝒞\mathcal{C} satisfies several conditions, which are theoretically applicable to various classes including tree-child networks but are not easy to implement, there is no FPT algorithm to solve 𝒞\mathcal{C}-Orientation in its general form. Also, no studies have pursued practically useful heuristics.

In this paper, we provide a practically efficient exponential time algorithm for 𝒞\mathcal{C}-Orientation (Algorithm 1), which is FPT in both the reticulation number and the maximum size of minimal basic cycles selected by the algorithm. We also present a heuristic method for Tree-Child Orientation (Algorithm 2) which, although still exponential, runs very fast in practice because it only considers reticulation placements to maximise the sum of their pairwise distances. Using artificially generated networks, we compare the accuracy and execution time of the proposed methods for solving Tree-Child Orientation with those of the existing exponential time algorithm for 𝒞\mathcal{C}-Orientation (Algorithm 2 in [1]). Our theoretical and empirical results demonstrate the usefulness of Algorithm 1, especially for relatively large input graphs with 55 or more reticulations, where the exponential time method in [1] becomes computationally infeasible. Our Algorithm 2, while much faster than Algorithm 1, tends to decrease in accuracy as the reticulation number increases.

The rest of the paper is structured as follows. Section 2 provides the necessary mathematical definitions and notation, including the definition of phylogenetic networks. Section 3 briefly reviews relevant results from [1] and formally states the problems of interest. Section 4 gives the theoretical background of our proposed methods, including the concept of ‘cycle basis’ and a theorem that allows us to reduce the search space (Theorem 4). Section 5 describes the proposed exact method (Algorithm 1) and a heuristic method (Algorithm 2). Theorem 7 ensures that the heuristic is correct when r2r\leq 2. We analyse the time complexity of Algorithm 1. Section 6 explains the experimental setup, including the method used to create undirected graphs (details can be found in the appendix), and presents the results of three experiments. Section 7 discusses the limitations and biological application of the heuristic method. Finally, Section 8 concludes the paper and outlines future research directions.

2 Definitions and Notation

2.1 Graph theory

An undirected graph is an ordered pair (V,E)(V,E) consisting of a set VV of vertices and a set EE of edges between vertices without any orientation. Given an undirected graph GG, its vertex set and edge set are denoted by V(G)V(G) and E(G)E(G), respectively. An edge of an undirected graph between vertices uu and vv is denoted by {u,v}\{u,v\} or {v,u}\{v,u\}. An undirected graph is simple if it contains neither a loop nor multiple edges, namely, any edge {u,v}\{u,v\} satisfies uvu\neq v, and any edges {u,v}{u,v}\{u,v\}\neq\{u^{\prime},v^{\prime}\} satisfy at least one of uuu\neq u^{\prime} and vvv\neq v^{\prime}. Given simple graphs Gi=(Vi,Ei)G_{i}=(V_{i},E_{i}) (i=1,2,,ni=1,2,\dots,n), the graph (V,E)(V,E) with V:=i=1nViV:=\bigcup_{i=1}^{n}V_{i} and E:=i=1nEiE:=\bigcup_{i=1}^{n}E_{i} is called the union of the graphs GiG_{i}.

For a vertex vv of an undirected graph GG, the degree of vv in GG, denoted by 𝑑𝑒𝑔G(v){\it deg}_{G}(v), is the number of edges of GG that joins vv with another vertex of GG.

An (undirected) path is an undirected graph PP with a vertex set {v1,v2,,v}\{v_{1},v_{2},\ldots,v_{\ell}\} and an edge set {{v1,v2},{v2,v3},,{v1,v}}\{\{v_{1},v_{2}\},\{v_{2},v_{3}\},\ldots,\{v_{\ell-1},v_{\ell}\}\}. The number of edges of a path PP is called the length of PP. An (undirected) cycle is an undirected graph CC with a vertex set {v1,v2,,v}\{v_{1},v_{2},\ldots,v_{\ell}\} and an edge set E={{v1,v2},{v2,v3},,{v1,v},{v,v1}}E=\{\{v_{1},v_{2}\},\{v_{2},v_{3}\},\ldots,\{v_{\ell-1},v_{\ell}\},\{v_{\ell},v_{1}\}\}. The number of edges of a cycle CC is called the length of CC. A subgraph of an undirected graph G=(V,E)G=(V,E) is a graph G=(V,E)G^{\prime}=(V^{\prime},E^{\prime}) such that VVV^{\prime}\subseteq V and EEE^{\prime}\subseteq E. In this case, GG contains GG^{\prime}. If an undirected graph GG contains a cycle CC as a subgraph, CC is a cycle of GG. An undirected graph GG is connected if for any u,vV(G)u,v\in V(G), GG contains a path between uu and vv. The distance dG(u,v)d_{G}(u,v) between uu and vv in GG is defined by the length of the shortest path connecting uu and vv in GG.

A directed graph is an ordered pair (V,A)(V,A) consisting of a set VV of vertices and a set AA of oriented edges called arcs. Given a directed graph DD, its vertex set and arc set are denoted by V(D)V(D) and A(D)A(D), respectively. An arc that goes from vertex uu to vertex vv is denoted by (u,v)(u,v). Given an arc (u,v)(u,v), uu is a parent of vv and vv is a child of uu. A directed graph is simple if it contains neither a loop nor multiple arcs. For a vertex vv of a directed graph DD, the in-degree of vv in DD, denoted by 𝑖𝑛𝑑𝑒𝑔D(v){\it indeg}_{D}(v), is the number of arcs of DD that arrive at vv. Likewise, the out-degree of vv in DD, denoted by 𝑜𝑢𝑡𝑑𝑒𝑔D(v){\it outdeg}_{D}(v), is the number of arcs of DD that start from vv. A directed cycle is a directed graph with a vertex set V={v1,v2,,v}V=\{v_{1},v_{2},\ldots,v_{\ell}\} and an arc set A={(v1,v2),(v2,v3),,(v1,v),(v,v1)}A=\{(v_{1},v_{2}),(v_{2},v_{3}),\ldots,(v_{\ell-1},v_{\ell}),(v_{\ell},v_{1})\}. A subgraph of a directed graph is defined in the same way as before. A directed acyclic graph (DAG) is a directed graph that contains no cycle as its subgraph. Given a directed graph DD, the undirected graph obtained by ignoring the direction of all its arcs is called the underlying graph of DD and denoted by U(D)U(D).

2.2 Phylogenetic networks

Throughout this paper, XX is a finite set with |X|2|X|\geq 2, representing a set of the present-day species of interest. All graphs considered here are simple and finite, meaning the numbers of vertices and edges are finite. An undirected binary phylogenetic network on XX is a simple, connected, undirected graph NN such that its vertex set VV is partitioned into VI:={vV𝑑𝑒𝑔N(v)=3}V_{I}:=\{v\in V\mid{\it deg}_{N}(v)=3\} and VL:={vV𝑑𝑒𝑔N(v)=1}V_{L}:=\{v\in V\mid{\it deg}_{N}(v)=1\}, and VLV_{L} can be identified with XX. Each vertex in VIV_{I} and in VLV_{L} is called an internal vertex and a leaf of NN, respectively.

A directed binary phylogenetic network on XX is a simple, acyclic directed graph DD such that the underlying graph of DD is connected, the vertex set VV of DD contains a unique vertex ρ\rho with (𝑖𝑛𝑑𝑒𝑔D(ρ),𝑜𝑢𝑡𝑑𝑒𝑔D(ρ))=(0,2)({\it indeg}_{D}(\rho),{\it outdeg}_{D}(\rho))=(0,2) and the set V{ρ}V\setminus\{\rho\} is partitioned into VT:={vV(𝑖𝑛𝑑𝑒𝑔D(v),𝑜𝑢𝑡𝑑𝑒𝑔D(v))=(1,2)}V_{T}:=\{v\in V\mid({\it indeg}_{D}(v),{\it outdeg}_{D}(v))=(1,2)\} and VR:={vV(𝑖𝑛𝑑𝑒𝑔D(v),𝑜𝑢𝑡𝑑𝑒𝑔D(v))=(2,1)}V_{R}:=\{v\in V\mid({\it indeg}_{D}(v),{\it outdeg}_{D}(v))=(2,1)\}, and VL:={vV(𝑖𝑛𝑑𝑒𝑔D(v),𝑜𝑢𝑡𝑑𝑒𝑔D(v))=(1,0)}V_{L}:=\{v\in V\mid({\it indeg}_{D}(v),{\it outdeg}_{D}(v))=(1,0)\} that can be identified with XX. The vertex ρ\rho is called the root of DD, and each vertex in VTV_{T}, in VRV_{R} and in VLV_{L} is called a tree vertex, a reticulation and a leaf of DD, respectively.

A directed binary phylogenetic network DD on XX is a tree-child network if every non-leaf vertex of DD has at least one tree vertex as a child [4]. Tree-child networks are characterised by the absence of the two forbidden subgraphs [5] that are illustrated in Fig. 1. Namely, tree-child networks contain neither a vertex with two reticulation children nor a reticulation with a reticulation child.

Refer to caption
Figure 1: The forbidden subgraphs for tree-child networks. Left: a vertex with two reticulation children. Right: a reticulation with a reticulation child. The black vertices represent reticulations and the white circle indicates a tree vertex.

3 Acyclic orientation problems and known results

For an undirected phylogenetic network NN on XX, orienting NN is a procedure that inserts the root ρ\rho into a unique edge eρ={u,v}e_{\rho}=\{u,v\} of NN by replacing the edge {u,v}\{u,v\} with the arcs (ρ,u),(ρ,v)(\rho,u),(\rho,v) and then orienting the other edges so that the resulting graph N\vec{N} is a directed phylogenetic network on XX. In other words, a directed phylogenetic network N\vec{N} on XX is an orientation of NN if the underlying graph U(N)U(\vec{N}) becomes isomorphic to NN after suppressing the root ρ\rho of U(N)U(\vec{N}) (i.e. replacing the undirected path uρvu-\rho-v with the edge {u,v}\{u,v\}).

Determining the orientability of undirected graphs to directed graphs with desired properties is a classic topic in graph theory and its application (e.g. [6]), but orientability problems for phylogenetic networks have only recently been studied [1, 3]. Huber et al.[1] discussed two types of problems, Constrained Orientation and 𝒞\mathcal{C}-Orientation.

3.1 Orientation with a desired root position and in-degrees

Problem 1.

(Constrained Orientation)

INPUT:

An undirected (not necessarily binary) phylogenetic network N=(V,E)N=(V,E) on XX, an edge eρEe_{\rho}\in E into which a unique root ρ\rho is inserted, and the desired in-degree δN(v)\delta_{N}^{-}(v) of each vVv\in V.

OUTPUT:

An orientation N\vec{N} of NN that satisfies the constraint (eρ,δN)(e_{\rho},\delta_{N}^{-}) if it exists, and ‘NO’ otherwise.

We note that when NN is binary, the constraint δN\delta_{N}^{-} in Problem 1 specifies which internal vertices of NN are to become reticulations or tree vertices in N\vec{N}. Below we restate the relevant result from [1].

Theorem 1 (Part of Theorems 1 and 2 in [1]).

In Problem 1, if there exists an orientation N\vec{N} of NN that satisfies the constraint (eρ,δN)(e_{\rho},\delta_{N}^{-}), then N\vec{N} is unique for (eρ,δN)(e_{\rho},\delta_{N}^{-}). Algorithm 1 in [1] can determine whether or not N\vec{N} exists and find N\vec{N} if it does, both in O(|E|)O(|E|) time.

Although Theorem 1 is sufficient for our purpose, a necessary and sufficient condition for when N\vec{N} exists under the constraint (eρ,δN)(e_{\rho},\delta_{N}^{-}) was provided in [1] using the notion of ‘degree cut’. The interested reader is referred to [1].

3.2 Orientation to a desired class 𝒞\mathcal{C} of networks

The next problem is about orientation under a different constraint. It asks whether a given graph can be oriented to be a directed phylogenetic network in a desired class 𝒞\mathcal{C}, where the position of the root edge and the in-degree of each vertex are unknown. In [1], 𝒞\mathcal{C}-Orientation was defined under the assumption that the input NN is binary, unlike Problem 1.

Problem 2.

(𝒞\mathcal{C}-Orientation)

INPUT:

An undirected binary phylogenetic network NN on XX.

OUTPUT:

An orientation N\vec{N} of NN such that N\vec{N} belongs to the class 𝒞\mathcal{C} of directed binary phylogenetic networks on XX if it exists, and ‘NO’ otherwise.

Huber et al.[1] described a simple exponential time algorithm for solving Problem 2 (Algorithm 2 in [1]), which uses the above-mentioned O(|E|)O(|E|) time algorithm for Problem 1. It repeatedly solves Problem 1 for all possible combinations of the root edge eρEe_{\rho}\in E and a set {v1,,vr}=VRV\{v_{1},\dots,v_{r}\}=V_{R}\subseteq V of reticulations (i.e. vertices of desired in-degree 22) until it finds an orientation N\vec{N} of NN that satisfies (eρ,δN)(e_{\rho},\delta_{N}^{-}) and belongs to class 𝒞\mathcal{C}.

The complexity of 𝒞\mathcal{C}-Orientation depends on 𝒞\mathcal{C} but is still unknown for most of the popular classes of phylogenetic networks. For example, when 𝒞\mathcal{C} is the class of trees, Problem 2 is obviously solvable in polynomial time, and when 𝒞\mathcal{C} is the class of tree-based networks, it was shown in [1] that the problem is NP-hard. An important remark is that if 𝒞\mathcal{C}^{\prime} is a subclass of 𝒞\mathcal{C}, it does not necessarily imply that 𝒞\mathcal{C}^{\prime}-Orientation is easier or harder than 𝒞\mathcal{C}-Orientation. In fact, the complexity of the following problem is still open [7, 8].

Problem 3.

(Tree-Child Orientation)

INPUT:

An undirected binary phylogenetic network NN on XX.

OUTPUT:

An orientation N\vec{N} of NN that is a tree-child network on XX if it exists, and ‘NO’ otherwise.

While FPT algorithms for a special case of 𝒞\mathcal{C}-Orientation were provided in [1], it remains challenging to develop a practical method for this type of orientation problems. This motivates us to explore a heuristic approach for solving Problem 3.

In what follows, when there exists such an orientation N\vec{N} of NN as described in Problem 3, we say that NN is tree-child orientable and call N\vec{N} a tree-child orientation of NN.

4 Theoretical aspects of the proposed methods

Before discussing Problem 3, we consider a general setting where we want to orient any undirected phylogenetic network N=(V,E)N=(V,E) on XX to a rooted directed phylogenetic network N\vec{N} on XX. Then N\vec{N} must contain r=|E||V|+1r=|E|-|V|+1 reticulation vertices (which can be easily verified by induction, and can also be derived from the equations in Lemma 2.1 in [9]), and so we need to decide which rr vertices among |V||X||V|-|X| internal vertices of NN will have in-degree 2 in N\vec{N}. The number of possible ways to select rr vertices from non-leaf vertices of NN is (|V||X|r)\binom{|V|-|X|}{r}, which is exponential. We will now consider how we can reduce the number of candidates to examine.

Lemma 2.

Let N=(V,A)N=(V,A) be any directed acyclic graph and let U(N)=(V,E)U(N)=(V,E) be the underlying graph of NN. Then, any cycle CC of U(N)U(N) has a vertex whose in-degree in NN is at least 22.

Proof.

To obtain a contradiction, suppose U(N)U(N) contains a cycle C=(v0,v1,,vn)C=(v_{0},v_{1},\dots,v_{n}) such that 𝑖𝑛𝑑𝑒𝑔N(vi)1{\it indeg}_{N}(v_{i})\leq 1 for every viV(C)v_{i}\in V(C). Let C\vec{C} be the subgraph of NN that corresponds to CC. If C\vec{C} contains viv_{i} with 𝑖𝑛𝑑𝑒𝑔N(vi)=0{\it indeg}_{N}(v_{i})=0, then C\vec{C} must contain another vertex vjv_{j} with 𝑖𝑛𝑑𝑒𝑔N(vj)2{\it indeg}_{N}(v_{j})\geq 2. Then, for each viV(C)v_{i}\in V(C), 𝑖𝑛𝑑𝑒𝑔N(vi)=1{\it indeg}_{N}(v_{i})=1 holds. Let {vi1,vi}\{v_{i-1},v_{i}\} and {vi,vi+1}\{v_{i},v_{i+1}\} be two consecutive undirected edges of CC. We may assume that C\vec{C} has the arc (vi1,vi)(v_{i-1},v_{i}), instead of (vi,vi1)(v_{i},v_{i-1}). Since 𝑖𝑛𝑑𝑒𝑔N(vi)=1{\it indeg}_{N}(v_{i})=1, C\vec{C} contains (vi,vi+1)(v_{i},v_{i+1}), not (vi+1,vi)(v_{i+1},v_{i}). The same argument applies to all arcs of C\vec{C}. It follows that C\vec{C} is a directed cycle, a contradiction. ∎

Lemma 2 allows us to exclude some of the inappropriate reticulation placements (specifically, cases where there is a cycle having no reticulation) that make orientation impossible. However, checking all cycles in a graph is computationally inefficient. To reduce the search space, we will now introduce some relevant concepts.

For a connected undirected graph N=(V,E)N=(V,E), the cycle rank of NN is defined to be the number r:=|E||V|+1r:=|E|-|V|+1 (e.g. p.24 in [10]), which is also known as circuit rank, cyclomatic number and the (first-order) Betti number of NN. Note that rr is zero if NN is a tree and that rr is the number of desired reticulation vertices if NN is a binary phylogenetic network that is an instance of Problem 1. The cycle rank rr of NN can also be interpreted as the rank of a vector space called the ‘cycle space’, where each of the rr basis vectors, called a basic cycle, corresponds to the edge-set of a simple cycle in NN. If we define the summation of cycles CC and CC^{\prime} as the cycle induced by the symmetric difference of their edge-sets, then the cycle space of NN can be identified with the set of even-degree subgraphs of NN, so any cycle in NN can be expressed as a sum of basic cycles in a cycle basis 𝒮\mathcal{S} (for details, see Section 1.9 of [11], Section 4.3 of [12] and Section 6.4.2 of [10]).

Proposition 3 will be used in the proof of the main theorem (Theorem 4).

Proposition 3.

Let D=(V,A)D=(V,A) be any directed graph. If each vVv\in V satisfies (indegD(v),outdegD(v))(\mathrm{indeg}_{D}(v),\mathrm{outdeg}_{D}(v)){(0,2),(1,1),(1,2),(2,1),(2,0)}\in\{(0,2),(1,1),(1,2),(2,1),(2,0)\}, then r=|V2||V0|+1r=|V_{2}|-|V_{0}|+1 holds, where r:=|A||V|+1r:=|A|-|V|+1 and V2V_{2} (resp. V0V_{0}) denotes the set of vertices vv with indegD(v)=2\mathrm{indeg}_{D}(v)=2 (resp. indegD(v)=0\mathrm{indeg}_{D}(v)=0). If, in addition, DD is acyclic, then r|V2|r\leq|V_{2}| holds.

Proof.

Let V1:=V(V0V2)V_{1}:=V\setminus(V_{0}\cup V_{2}). Since the sum of in-degrees of all vertices must equal |A||A| , we have |A|=|V1|+2|V2||A|=|V_{1}|+2|V_{2}|. Hence, r=|A||V|+1=|V1|+2|V2|(|V0|+|V1|+|V2|)+1=|V2||V0|+1r=|A|-|V|+1=|V_{1}|+2|V_{2}|-(|V_{0}|+|V_{1}|+|V_{2}|)+1=|V_{2}|-|V_{0}|+1.

We claim |V0|1|V_{0}|\geq 1 holds if DD is acyclic. To verify this, we will prove that any finite directed acyclic graph GG has at least one vertex of in-degree zero. Let PP be a longest vertex-disjoint directed path in GG, with ss and tt as the start and end vertices of PP, respectively. If indegG(s)>0\mathrm{indeg}_{G}(s)>0, there exists a vertex sV(G)s^{\prime}\in V(G) with (s,s)A(G)(s^{\prime},s)\in A(G). Since GG is acyclic, ss^{\prime} is not a vertex of PP. Then, by adding (s,s)(s^{\prime},s) to PP, one can obtain a vertex-disjoint directed path in GG that is longer than PP, but this contradicts the maximality of PP. Thus, indegG(s)=0\mathrm{indeg}_{G}(s)=0 holds, which proves the above claim. Hence, we obtain r|V2|r\leq|V_{2}|. ∎

Theorem 4.

Let (N,eρ,δN)(N,e_{\rho},\delta_{N}^{-}) be an instance of Problem 1 where NN is binary and let VRV_{R} denote the set of reticulations specified by δN(v)\delta_{N}^{-}(v). If there exists an orientation N\vec{N} of NN satisfying the constraint (eρ,δN)(e_{\rho},\delta_{N}^{-}), then for any cycle basis 𝒮\mathcal{S} of NN, there exists a bijection ϕ:𝒮VR\phi:\mathcal{S}\to V_{R} with the property that ϕ(C)V(C)\phi(C)\in V(C) holds for each C𝒮C\in\mathcal{S}.

Proof.

Let 𝒮\mathcal{S} be any cycle basis of NN and let BB be the bipartite graph defined by V(B):=𝒮VRV(B):=\mathcal{S}\sqcup V_{R} and E(B):={(C,v)𝒮×VRvV(C)}E(B):=\{(C,v)\in\mathcal{S}\times V_{R}\mid v\in V(C)\}. Here, |𝒮|=|VR||\mathcal{S}|=|V_{R}| holds (note that we could have |𝒮||VR||\mathcal{S}|\geq|V_{R}| if NN were allowed to have a reticulation with a large in-degree but here NN is binary). We also see that no vertex of BB has degree zero for the following reasons: since N\vec{N} is a directed acyclic graph, Lemma 2 implies that for each cycle C𝒮C\in\mathcal{S}, there exists at least one reticulation vVRv\in V_{R} with vV(C)v\in V(C); conversely, for each reticulation vVRv\in V_{R}, there exists at least one cycle C𝒮C\in\mathcal{S} with vV(C)v\in V(C). The proof will be completed if we can show that there is a perfect matching in BB. We will prove that BB satisfies the marriage condition, i.e. for any subset 𝒮\mathcal{S}^{\prime} of 𝒮\mathcal{S}, |𝒮||VR||\mathcal{S}^{\prime}|\leq|V_{R}^{\prime}| holds, where VRV_{R}^{\prime} is the neighbourhood of 𝒮\mathcal{S}^{\prime} in BB. To prove this, without loss of generality, we may assume that the union of all cycles in 𝒮\mathcal{S}^{\prime}, denoted by NN^{\prime}, is a connected subgraph of NN.

Recalling that N\vec{N} is a unique acyclic orientation for (N,eρ,VR)(N,e_{\rho},V_{R}), one can convert NN^{\prime} into a directed graph DD by assigning the same direction to each edge of NN^{\prime} as in N\vec{N}. Since DD is a subgraph of N\vec{N}, DD is also acyclic. By construction, each vertex uu of DD satisfies (indegD(u),outdegD(u))(\mathrm{indeg}_{D}(u),\mathrm{outdeg}_{D}(u)){(0,2),(1,1),(1,2),(2,1),(2,0)}\in\{(0,2),(1,1),(1,2),(2,1),(2,0)\}. If we write V2V_{2} for the set {uV(D)indegD(u)=2}\{u\in V(D)\mid\mathrm{indeg}_{D}(u)=2\}, then Proposition 3 yields |A(D)||V(D)|+1|V2||A(D)|-|V(D)|+1\leq|V_{2}|. The left hand side equals |E(N)||V(N)|+1=|𝒮||E(N^{\prime})|-|V(N^{\prime})|+1=|\mathcal{S}^{\prime}|. Since the in-degree of a vertex in a subgraph DD never exceeds its in-degree in the original graph N\vec{N}, we have V2VRV_{2}\subseteq V_{R}^{\prime}. Thus, |V2||VR||V_{2}|\leq|V_{R}^{\prime}| holds. Hence, by Hall’s marriage theorem, BB has a perfect matching. Any perfect matching in BB induces a bijection ϕ:𝒮VR\phi:\mathcal{S}\to V_{R} satisfying ϕ(C)V(C)\phi(C)\in V(C) for each C𝒮C\in\mathcal{S}. ∎

Theorem 4 provides a necessary condition for the feasibility of a reticulation placement VRV_{R} in Problem 1 (binary version), where feasibility means that (N,eρ,VR)(N,e_{\rho},V_{R}) admits an acyclic orientation for some eρe_{\rho}. This implies that we need not consider all (|V||X|r)\binom{|V|-|X|}{r} reticulation placements in solving Problem 2, regardless of the class CC we are interested in. Therefore, we may use an arbitrary cycle basis 𝒮\mathcal{S} of NN and choose exactly one reticulation vertex from each of the rr basic cycles, thereby reducing the search space for feasible reticulation placements in both Problems 2 and 3.

In solving Problem 3, we will also use the following results. From the forbidden structures of tree-child networks (Fig. 1), Lemma 5 follows immediately, and this leads to Theorem 6.

Lemma 5.

A directed phylogenetic network NN is tree-child if every two reticulations of NN are distant at least 33 in the underlying graph of NN.

Theorem 6.

If (N,eρ,δN)(N,e_{\rho},\delta_{N}^{-}) is an instance of Problem 1 such that dN(u,v)3d_{N}(u,v)\geq 3 for any distinct u,v{vV(N)δN(v)=2}u,v\in\{v\in V(N)\mid\delta_{N}^{-}(v)=2\} and if there exists an orientation N\vec{N} of NN that satisfies the constraint (eρ,δN)(e_{\rho},\delta_{N}^{-}), then N\vec{N} is a tree-child network.

We note that Theorem 6 provides a sufficient condition for tree-child orientability, not a necessary condition. In fact, a tree-child network can contain a pair of reticulations whose distance is less than 3 (see Fig. 2).

Refer to caption
Figure 2: Left: An instance (N,eρ,δN)(N,e_{\rho},\delta_{N}^{-}) of Problem 1 containing a pair of prescribed reticulations (i.e. vertices of desired in-degree 22) at distance 22. The squares and circles are the leaves and internal vertices, respectively. The root edge eρe_{\rho} is highlighted in bold. The number next to each vertex vv indicates its desired in-degree δN(v)\delta_{N}^{-}(v), and the black vertices are the prescribed reticulations (i.e. those with desired in-degree 22). Right: The orientation N\vec{N} for this instance (N,eρ,δN)(N,e_{\rho},\delta_{N}^{-}), which is a tree-child network. The root ρ\rho inserted into eρe_{\rho} is shown as a unique star vertex.

5 Proposed methods

Based on Theorem 4, we propose an exact method (Algorithm 1) for 𝒞\mathcal{C}-Orientation (Problem 2) and a heuristic method (Algorithm 2) for Tree-Child Orientation (Problem 3). While Theorem 4 holds for any cycle basis, our algorithms use a ‘minimal’ one for computational efficiency. A cycle basis 𝒮\mathcal{S} of NN is called minimal if the sum of the lengths of all cycles in 𝒮\mathcal{S} is not greater than that of any other cycle basis of NN, i.e. minimising C𝒮|E(C)|\sum_{C\in\mathcal{S}}{|E(C)|}. Intuitively, focusing on a minimal cycle basis allows us to use basic cycles that do not overlap too much. The problem of computing a minimal cycle basis has been extensively studied (e.g. [13, 14, 15]; see Chapter 7 of [16] for a brief literature review) and polynomial time algorithms exist (e.g. [16, 17, 15]). By using an algorithm given in [17], a minimal cycle basis of N=(V,E)N=(V,E) can be computed in O(|V||E|2/log|V|)O(|V||E|^{2}/\log|V|).

For both Problems 2 and 3, our proposed methods first compute a minimal cycle basis 𝒮={C1,,Cr}\mathcal{S}=\{C_{1},\dots,C_{r}\} of a given network NN. Then, for Problem 2, our exact method (Algorithm 1) repeatedly picks exactly one reticulation vertex viv_{i} from each basic cycle CiC_{i} to obtain a candidate set VRV_{R} of rr reticulations, and solves Problem 1 for all (eρ,VR)(e_{\rho},V_{R}) until it finds a 𝒞\mathcal{C}-orientation of NN. While this algorithm still requires exponential time, it is more efficient than the exponential time method in [1] because the search space for appropriate VRV_{R} is smaller than (O(|V|)r)\binom{O(|V|)}{r}.

For Problem 3, we first note that a straightforward way to reduce the search space for VRV_{R} is to exclude any candidate sets containing adjacent reticulations, as such configurations are invalid in tree-child networks (recall Fig. 1, right panel). Beyond this basic constraint, we introduce a novel approach to further reduce the search space based on Theorem 6, which suggests that reticulations should be placed as far apart as possible. This insight leads to our heuristic method (Algorithm 2) that considers only those reticulation sets VRV_{R} that maximise the sum of pairwise distances between reticulations. While this significant reduction in the search space makes the algorithm faster than Algorithm 1, it may fail to find an existing tree-child orientation; more precisely, while false positives cannot occur, a ‘NO’ output from Algorithm 2 should be interpreted as ‘Probably NO’. However, Theorem 7 ensures that Algorithm 2 works correctly when rr is very small.

Algorithm 1 Exact FPT Algorithm for 𝒞\mathcal{C}-Orientation
1:An undirected binary phylogenetic network N=(V,E)N=(V,E) on XX
2:A tree-child orientation N\vec{N} of NN if one exists, else ‘NO’
3:Compute the number r:=|E||V|+1r:=|E|-|V|+1 of reticulations N\vec{N} must have
4:Compute a minimal cycle basis 𝒮={C1,,Cr}\mathcal{S}=\{C_{1},\dots,C_{r}\} of NN
5:Compute S:={(v1,,vr)V(C1)××V(Cr)}S:=\{(v_{1},\dots,v_{r})\in V(C_{1})\times\dots\times V(C_{r})\}
6:for each reticulation placement s=(v1,,vr)Ss=(v_{1},\dots,v_{r})\in S do
7:     for each vertex vVv\in V do
8:         Define the desired in-degree δN(v)\delta_{N}^{-}(v) as δN(v):=2\delta_{N}^{-}{(v)}:=2 if v{v1,,vr}v\in\{v_{1},\dots,v_{r}\} and δN(v):=1\delta_{N}^{-}{(v)}:=1 otherwise
9:     end for
10:     repeat
11:         Pick any eEe\in E, set eρ:=ee_{\rho}:=e
12:         Run the linear time algorithm for Problem 1 in [1] for the instance (N,eρ,δN)(N,e_{\rho},\delta_{N}^{-})
13:         if the algorithm finds the feasible orientation N~\tilde{N} for (N,eρ,δN)(N,e_{\rho},\delta_{N}^{-}) and N~\tilde{N} is in 𝒞\mathcal{C} then
14:              return N~\tilde{N} as a 𝒞\mathcal{C}-orientation N\vec{N} of NN
15:         end if
16:     until no more edges are left in EE
17:end for
18:return ‘NO’
Algorithm 2 Heuristic Algorithm for Tree-Child Orientation
1:An undirected binary phylogenetic network N=(V,E)N=(V,E) on XX.
2:A tree-child orientation N\vec{N} of NN if found, else ‘NO’.
3:Compute the number r:=|E||V|+1r:=|E|-|V|+1 of reticulations N\vec{N} must have
4:Compute a minimal cycle basis 𝒮={C1,,Cr}\mathcal{S}=\{C_{1},\dots,C_{r}\} of NN
5:Compute S:={(v1,,vr)V(C1)××V(Cr)dN(vi,vj)2}S:=\{(v_{1},\dots,v_{r})\in V(C_{1})\times\dots\times V(C_{r})\mid d_{N}(v_{i},v_{j})\geq 2\}
6:Compute S:={sSs=argmaxsSf(s):=1i<jrdN(vi,vj)}S^{\ast}:=\{s^{\ast}\in S\mid s^{\ast}=\operatorname{arg\,max}_{s\in S}{f(s):=\sum_{1\leq i<j\leq r}{d_{N}(v_{i},v_{j})}}\}
7:for each reticulation placement s=(v1,,vr)Ss^{\ast}=(v_{1}^{\ast},\dots,v_{r}^{\ast})\in S^{\ast} do
8:     Define the desired in-degree δN(v)\delta_{N}^{-}{(v)} of all vVv\in V as δN(v):=2\delta_{N}^{-}{(v)}:=2 if v{v1,,vr}v\in\{v_{1}^{\ast},\dots,v_{r}^{\ast}\} and otherwise δN(v):=1\delta_{N}^{-}{(v)}:=1
9:     repeat
10:         Pick any eEe\in E, set eρ:=ee_{\rho}:=e
11:         Run the linear time algorithm for Problem 1 in [1] for the instance (N,eρ,δN)(N,e_{\rho},\delta_{N}^{-})
12:         if the algorithm finds the feasible orientation N~\tilde{N} for (N,eρ,δN)(N,e_{\rho},\delta_{N}^{-}) and no vertex of N~\tilde{N} has only reticulations as its children then
13:              return N~\tilde{N} as a tree-child orientation N\vec{N} of NN
14:         end if
15:     until no more edges are left in EE
16:end for
17:return ‘NO’
Theorem 7.

If the input N=(V,E)N=(V,E) satisfies r:=|E||V|+12r:=|E|-|V|+1\leq 2, then Algorithm 2 returns a correct solution to Tree-Child Orientation (Problem 3).

Proof.

We may focus on the case of r=2r=2 as the statement is obvious when r1r\leq 1. Then, NN contains exactly two cycles C1C_{1} and C2C_{2}, each of which has at least 33 edges.

When C1C_{1} and C2C_{2} do not share an edge, NN is a level-11 network. This implies that NN is planar and tree-child orientable (see Fig. 3). Algorithm 2 computes a minimal cycle basis {C1,C2}\{C_{1},C_{2}\} of NN, which is unique in this case. Then, Algorithm 2 selects a most distant pair s=(v1,v2)s^{\ast}=(v_{1}^{\ast},v_{2}^{\ast}) in V(C1)×V(C2)V(C_{1})\times V(C_{2}). Since NN is binary, dN(v1,v2)3d_{N}(v_{1}^{\ast},v_{2}^{\ast})\geq 3. We can see that (N,eρ,δN)(N,e_{\rho},\delta_{N}^{-}) has an orientation N\vec{N} for the root edge eρe_{\rho} shown in Fig. 3. By Theorem 6, N\vec{N} is tree-child. Hence, Algorithm 2 can correctly find a tree-child orientation N\vec{N} of NN.

When C1C_{1} and C2C_{2} share an edge, they share exactly one edge because r3r\geq 3 otherwise. Then, as Fig. 4 indicates, NN has a tree-child orientation N\vec{N} if and only if at least one of C1C_{1} and C2C_{2} has 44 or more edges. When each of C1C_{1} and C2C_{2} has exactly 33 edges as in Fig. 4(a), Algorithm 2 correctly returns ‘NO’. When |E(C1)|=3|E(C_{1})|=3 and |E(C2)|=4|E(C_{2})|=4, the algorithm selects a most distant pair s=(v1,v2)s^{\ast}=(v_{1}^{\ast},v_{2}^{\ast}) in V(C1)×V(C2)V(C_{1})\times V(C_{2}) as in Fig. 4(b). Although dN(v1,v2)=2d_{N}(v_{1}^{\ast},v_{2}^{\ast})=2 holds, the algorithm can insert the root ρ\rho into an appropriate edge eρEe_{\rho}\in E, making their distance from 22 into 33. When |E(C1)|4|E(C_{1})|\geq 4 and |E(C2)|4|E(C_{2})|\geq 4 as in Fig. 4(c), dN(v1,v2)3d_{N}(v_{1}^{\ast},v_{2}^{\ast})\geq 3. Hence, when a tree-child orientation N\vec{N} of NN exists, Algorithm 2 correctly outputs it. ∎

Refer to caption
Figure 3: Proof of Theorem 7 (the case when two cycles are edge-disjoint). The star is the root ρ\rho. The black vertices are a pair of reticulations, v1v_{1}^{\ast} and v2v_{2}^{\ast}, that maximises the sum of their distances.
Refer to caption
Figure 4: Proof of Theorem 7 (when two cycles are not edge-disjoint). The star is the root ρ\rho. The black vertices are a pair of reticulations, v1v_{1}^{\ast} and v2v_{2}^{\ast}, that maximises the sum of their distances in each case. (a) When both C1C_{1} and C2C_{2} are 33-cycles, NN must be a NO instance; (b) (c) YES instances.

Theorem 8 states that Algorithm 1 is FPT both in the reticulation number rr and the size cc of longest basic cycles in 𝒮\mathcal{S} used in the computation.

Theorem 8.

Algorithm 1 solves Tree-Child Orientation (Problem 3) in O(cr|V||E|)=O(cr|V|2)O(c^{r}~{}\cdot~{}|V||E|)=O(c^{r}\cdot|V|^{2}) time, where rr is the reticulation number of N=(V,E)N=(V,E) and cc is the size of largest cycles in a minimal cycle basis 𝒮\mathcal{S} chosen at Line 2.

Proof.

Recalling a minimal cycle basis 𝒮\mathcal{S} of NN can be computed in O(|V||E|2/log|V|)O(|V||E|^{2}/\log|V|) time by the algorithm in [17], we know that Line 2 takes O(|V||E|2/log|V|)O(|V||E|^{2}/\log|V|) time. Line 5–7 takes O(|V|)O(|V|) time for each sSs\in S. Also, since one can check in O(|V|)O(|V|) time whether or not N~\tilde{N} is in the class 𝒞\mathcal{C} of tree-child networks, Line 8–14 takes O(|E|×(|V|+|V|))=O(|V||E|)O(|E|\times(|V|+|V|))=O(|V||E|) time for each sSs\in S. Line 5–14 needs to be repeated |S|=O(|V(C1)|××|V(Cr)|)=O(cr)|S|=O(|V(C_{1})|\times\dots\times|V(C_{r})|)=O(c^{r}) times. Therefore, Line 4–15 can be done in O(cr|V||E|)O(c^{r}\cdot|V||E|) time. Since NN is binary, O(|V|)=O(|E|)O(|V|)=O(|E|) holds. Thus, Algorithm 1 runs in O(cr|V|2)O(c^{r}\cdot|V|^{2}) time. ∎

Theorem 8 implies that the unparameterised worst-case complexity of Algorithm 1 is O(|V|r+2)O(|V|^{r+2}). This complexity is equivalent to that of the exponential time method (Algorithm 2 in [1]) that performs O(|V||E|)O(|V||E|) time calculations for all (|V|r)=O(|V|r)\binom{|V|}{r}=O(|V|^{r}) reticulation placements. Although two FPT algorithms for a special case of 𝒞\mathcal{C}-Orientation were proposed in [1], Algorithm 1 differs in its parameterisation from them. Specifically, our Algorithm 1 is parameterised by rr and cc, while the FPT algorithms in [1] are parameterised by rr and by the level of NN, respectively.

Theorem 8 also shows that the size of search space depends on the choice of 𝒮\mathcal{S} at Line 2 of Algorithm 1 although cc never exceeds the size of longest cycles in NN (the same applies to Line 2 of Algorithm 2). To illustrate this, consider two minimal cycle bases 𝒮={C1,C2,C3}\mathcal{S}=\{C_{1},C_{2},C_{3}\} with |V(C1)|=|V(C2)|=|V(C3)|=4|V(C_{1})|=|V(C_{2})|=|V(C_{3})|=4 and 𝒮={C1,C2,C3}\mathcal{S}^{\prime}=\{C_{1}^{\prime},C_{2}^{\prime},C_{3}^{\prime}\} where |V(C1)|=3|V(C_{1}^{\prime})|=3, |V(C2)|=4|V(C_{2}^{\prime})|=4 and |V(C3)|=5|V(C_{3}^{\prime})|=5 (note that they have the same total length 4+4+4=3+4+54+4+4=3+4+5). When the former 𝒮\mathcal{S} is selected, the number of elements of SS at Line 3 of either algorithm is |V(C1)|×|V(C2)|×|V(C3)|=43|V(C_{1})|\times|V(C_{2})|\times|V(C_{3})|=4^{3}, whereas |S|=|V(C1)|×|V(C2)|×|V(C3)|=3×4×5|S|=|V(C_{1}^{\prime})|\times|V(C_{2}^{\prime})|\times|V(C_{3}^{\prime})|=3\times 4\times 5 for the latter 𝒮\mathcal{S}^{\prime}.

6 Experiments

We implemented our two proposed methods (Algorithms 1 and 2) and the existing exponential time algorithm described by Huber et al. (Algorithm 2 in [1]) using Python 3.11.6. In the implementation of Algorithms 1 and 2, we used the minimum_cycle_basis function from the Python package networkx to compute a minimal cycle basis. We note that the algorithm implemented in networkx is not the O(|V||E|2/log|V|)O(|V||E|^{2}/\log|V|) algorithm given in [17] but the O(|E|3+|E||V|2log|V|)O(|E|^{3}+|E||V|^{2}\log{|V|}) algorithm given in Section 7.2 of [16]. Undirected binary phylogenetic networks were generated as test data using a method described in Appendix. The source code, test data, and the program used to generate the data are available at https://github.com/hayamizu-lab/tree-child-orienter. The full details of the results can be found at https://github.com/hayamizu-lab/tree-child-orienter/tree/main/results.

6.1 Experiment 1: Execution time

In Experiment 6.1, we compared the execution times of the following methods: the existing exponential time algorithm (Algorithm 2 in [1]), Algorithm 1, and Algorithm 2 on 20 hand-picked tree-child orientable networks with 10 leaves. The 20 networks consisted of five samples each for reticulation numbers r=2,3,4r=2,3,4 and 55. Due to the time-consuming nature of the existing method, conducting experiments with a larger number of samples was infeasible. The experiment was performed on a MacBook Air (CPU: Intel Core i5, 1.6GHz, 8GB memory). We note that the hardware specification in Experiment 6.1 differs from Experiments 6.2 and 6.3, but this does not undermine the validity of this study as we do not compare results across experiments.

The results are summarised in Fig. 5. Although both Algorithm 2 in [1] and Algorithm 1 require exponential time in general, Algorithm 1 is expected to be faster in practice due to its smaller search space. Indeed, Algorithm 1 was significantly faster than the existing exponential time method. We also confirmed that Algorithm 2 was faster than Algorithm 1.

Refer to caption
Figure 5: Results of Experiment 6.1. Comparison of the execution time of the three methods for tree-child orientable networks with 1010 leaves. Our exponential time algorithm (Algorithm 1) has been shown to be significantly faster than the existing exponential time algorithm described in [1].

6.2 Experiment 2: Accuracy for small graphs

In Experiment 6.2, we evaluated the accuracy and execution time of Algorithms 1 and 2 on the 289 networks with 10 leaves and 1 to 5 reticulations in Table 4. The experiment was performed on a MacBook Pro (CPU: Apple M2 Pro, clock speed 3.49GHz, memory 16GB). The tree-child orientability of the sample graphs was determined using the existing exponential time algorithm (Algorithm 2 in [1]). Out of 289 instances, 268 were tree-child orientable (YES instances) and 21 were not (NO instances).

The accuracy and execution time of the two methods are summarised in Table 1. The correctness of Algorithm 1 was empirically confirmed, while being much faster than the existing method in [1]. In accordance with Theorem 7, Algorithm 2 returned a correct solution whenever r2r\leq 2, but in practice, it still worked correctly for most cases with r4r\leq 4, with a much shorter running time than Algorithm 1 for both YES and NO instances. Algorithm 2 often failed to find a tree-child orientation for YES instances with r=5r=5.

6.3 Experiment 3: Accuracy for large graphs

In Experiment 6.3, we evaluated the performance of Algorithms 1 and 2 on the 471 larger sample networks with 20 leaves and 1 to 9 reticulations in Table 4. The experiment was performed on the same MacBook Pro used in Experiment 6.2. Since Algorithm 2 in [1] became infeasible for most cases with r6r\geq 6, we compared Algorithms 1 and 2 based on the number of tree-child orientations found and the time taken to find one.

The results are summarised in Table 2. As reticulation number rr increased, the ability of Algorithm 2 to find tree-child orientations decreased monotonically. By contrast, Algorithm 1 was still able to find a tree-child orientation for many instances with large rr. It generally takes a long time, but sometimes it can find a tree-child orientation in a practical time.

Table 1: Results of Experiment 6.2. The input networks have 1010 leaves. #YES (resp. #NO) is the number of tree-child orientable (resp. non-tree-child orientable) instances among the generated networks with each reticulation number rr. #YES and #NO have been verified using the existing exponential time algorithm in [1]. The execution time here is the time taken to find a tree-child orientation or to output ‘NO’ or ‘Probably NO’.
Algorithm 1 Algorithm 2
rr #YES Accuracy Execution Time (sec) Accuracy Execution Time (sec)
#NO Mean Min Max Mean Min Max
1 170
170/170
(100%)
0.002 0.001 0.004
170/170
(100%)
0.002 0.001 0.004
0 N/A N/A N/A N/A N/A N/A N/A N/A
2 52
52/52
(100%)
0.011 0.002 0.063
52/52
(100%)
0.004 0.002 0.018
5
5/5
(100%)
0.062 0.061 0.064
5/5
(100%)
0.013 0.007 0.036
3 24
24/24
(100%)
0.081 0.004 0.497
24/24
(100%)
0.007 0.005 0.012
7
7/7
(100%)
0.539 0.232 1.173
7/7
(100%)
0.021 0.011 0.047
4 17
17/17
(100%)
0.741 0.008 3.735
16/17
(94%)
0.026 0.013 0.056
6
6/6
(100%)
3.918 2.428 7.650
6/6
(100%)
0.048 0.020 0.112
5 4
4/4
(100%)
14.382 1.481 48.980
1/4
(25%)
0.153 0.133 0.173
4
4/4
(100%)
23.340 13.892 39.959
4/4
(100%)
0.115 0.088 0.156
Table 2: Results of Experiment 6.3. The input networks have 2020 leaves. #Graph is the number of generated networks with each reticulation number rr. #YES is the number of YES instances among those networks, which is equal to the number of tree-child orientations found by Algorithm 1. #TC is the number of tree-child orientations found by Algorithm 2. The execution time here is the time taken to find a tree-child orientation.
Algorithm 1 Algorithm 2
rr #Graph #YES Execution time (sec) #TC Execution Time (sec)
Mean Min Max Mean Min Max
1 163 163 0.005 0.004 0.025 163 0.005 0.003 0.025
2 108 107 0.032 0.006 0.152 107 0.009 0.006 0.064
3 77 75 0.323 0.010 2.312 74 0.019 0.010 0.081
4 49 44 2.352 0.014 13.404 40 0.067 0.031 0.122
5 23 18 11.485 0.031 52.339 15 0.414 0.125 1.452
6 20 13 226.701 0.026 966.848 5 3.360 1.607 5.327
7 16 12 2732.563 1.190 23761.347 4 31.745 11.619 61.007
8 10 7 23155.714 1115.183 67313.690 1 942.973 942.973 942.973
9 5 3 229788.111 20176.763 623017.346 0 N/A N/A N/A

7 Discussion

7.1 Theoretical limitations of Algorithm 2

Algorithm 2 is very fast for both YES and NO instances, but interestingly, it becomes inaccurate as the reticulation number rr increases. It would be useful to analyse the possible causes of such failures.

The first remark is that the algorithm does not necessarily find a reticulation placement that maximises the sum of the pairwise distances, because it searches for an optimal placement for a fixed minimal cycle basis 𝒮\mathcal{S}. More precisely, the choice of 𝒮\mathcal{S} can affect the maximum value f(s)f(s^{\ast}) of the objective function. For example, when the algorithm selects the minimal cycle basis 𝒮={C1,,C8}\mathcal{S}=\{C_{1},\dots,C_{8}\} as shown on the left of Fig. 6, an optimal reticulation placement ss^{\ast} attains f(s)=130f(s^{\ast})=130. On the other hand, when the cycle C8C_{8} is replaced as shown on the right, then f(s)=129f(s^{\ast})=129. This observation suggests that if our goal is to maximise the sum of the distances between reticulations, then we need to carefully select a minimal cycle basis 𝒮\mathcal{S}.

Refer to caption
Figure 6: Discussion of the objective function ff of Algorithm 2. This network NN has multiple minimal cycle bases 𝒮\mathcal{S} with different values of f(s)f(s^{\ast}). Different basic cycles are highlighted in different colours, and the black vertices indicate an optimal reticulation placement for each minimal cycle basis. The reticulation set in the left figure achieves f(s)=130f(s^{*})=130. On the other hand, the placement in the right figure, where C8C_{8} is replaced and the remaining cycles are the same, only attains f(s)=129f(s^{*})=129.

However, more importantly, we note that maximising the sum of the pairwise distances of the reticulations is not always advantageous for finding a tree-child orientation. For example, the graph on the left of Fig. 7 has a tree-child orientation if the four reticulations are placed as shown on the right. However, by maximising the sum of the distances of the reticulations, Algorithm 2 has to select the reticulation placement with f(s)=20f(s^{\ast})=20 as on the left of Fig. 7. Then, the algorithm will end up with returning ‘NO’ because there is no tree-child orientation for this reticulation placement, regardless of the choice of root edge eρe_{\rho}. The placement on the right has f(s)=19f(s)=19, which is not maximum but does allow for a tree-child orientation.

Refer to caption
Figure 7: An example of NN that has a tree-child orientation but for which Algorithm 2 fails. For any sSs^{\ast}\in S^{\ast} maximising ff with f(s)=20f(s^{\ast})=20, (N,eρ,δN)(N,e_{\rho},\delta_{N}^{-}) is not tree-child orientable, regardless of the choice of root edge eρe_{\rho}. However, choosing sSSs\in S\setminus S^{\ast} as shown on the right, with f(s)=19f(s)=19, yields a tree-child orientable (N,eρ,δN)(N,e_{\rho},\delta_{N}^{-}).

7.2 Biological application of Algorithm 2

While the biological implications of 𝒞\mathcal{C}-Orientation methods require further investigation, we present here a case study applying our heuristic to a network in the biological literature. We emphasise that our focus is not on validating the biological accuracy, but rather on identifying challenges and areas for potential improvement.

Fig. 8(a) shows a part of the biologically estimated tree-child network representing the evolutionary history of eight wheat relatives [18]. Note that the original network proposed in [18] is more complex, and we have ignored the uncertain reticulation arcs and selected some of the most likely reticulation arcs to create a tree-child network. Given the unrooted and undirected form of this network, Algorithm 2 successfully determined it as tree-child orientable. However, the algorithm could not recover the same network as in Fig. 8(a) and instead produced an alternative tree-child orientation with a different root position, such as the one shown in Fig. 8(b).

It is true that the networks differ in their deep ancestral reticulation histories due to the alternative root positions, but they agree with more recent important reticulation events explained in [18], namely the introgression of the Sitopsis ancestor by the Ae. speltoides ancestor and the complex gene flows from Ae. tauschii to Ae. caudata. Notably, despite the difference in the root position, most arc orientations remain consistent with the original network. This observation underscores the importance of allowing root flexibility to find desired orientations that might otherwise be overlooked.

While we have seen the similarity between the networks, the result also shows a significant limitation of the heuristic. Indeed, even for the root placement shown in Fig. 8(b), there exists a more accurate tree-child orientation that preserves the original reticulation structure, which can be obtained by reversing the red arc. The failure of Algorithm 2 to identify the better solution stems from its current objective function, which prioritises maximising the sum of distances between reticulations. This observation further highlights the challenge of developing biologically meaningful orientation heuristics.

Refer to caption
Figure 8: Application of Algorithm 2 to a biologically estimated network. (a) A tree-child network (created by a slight modification of Fig. 5 in [18]) representing an evolutionary scenario for wheat relatives (from left to right: Ae. speltoides, Ae. mutica, Sitopsis, Ae. tauschii, Ae. caudata, Ae. umbellulata, Comopyrum and Triticum). (b) One of the possible outputs of Algorithm 2 when applied to the unrooted and undirected version of network (a). The networks differ in root edge selection and the orientation of one edge (highlighted in red). Star vertices indicate roots, and black vertices denote nodes with in-degree two.

8 Conclusion and future work

The CC-Orientation problem, which asks whether a given undirected binary phylogenetic network can be oriented to a directed phylogenetic network of a desired class CC, is an important computational problem in phylogenetics. The complexity of this problem remains unknown for many network classes CC, including the class of binary tree-child networks. A simple exponential time algorithm for 𝒞\mathcal{C}-Orientation was provided in [1]. FPT algorithms for a special case of the problem were also proposed in [1], but their practical application is limited due to the intricate nature of the procedures and the challenges in implementation, despite the constraints imposed on 𝒞\mathcal{C}. Additionally, no study has explored heuristic approaches to solve 𝒞\mathcal{C}-Orientation in practice, even for a particular class 𝒞\mathcal{C} such as tree-child networks.

In this paper, we have proposed a simple, easy to implement, practical exact FPT algorithm (Algorithm 1) for 𝒞\mathcal{C}-Orientation and a heuristic algorithm for Tree-Child Orientation (Algorithm 2) based on Theorem 4. They improve the search space of the existing simple exponential time algorithm by using a cycle basis to reduce the number of possible reticulation placements. Our experiments showed that Algorithm 1 is significantly faster in practice than a state-of-the-art exponential time algorithm in [1]. Algorithm 2 is even faster, with a trade-off between the accuracy and the reticulation number. Further research using larger and more diverse datasets could provide more insight into the strengths and limitations of the proposed methods. Their usefulness and effectiveness should also be tested in real-world data analysis. As tree-child networks can describe evolutionary histories that do not involve frequent reticulation events, one can find various tree-child networks in the literature other than the one we used in Section 7.2, such as a hybridisation network of bread wheat (Fig. 3 in [19]), a hybridisation network of mosquitoes (Fig. 15 in [20]) and an admixture network of diverse populations (Fig. 3 in [21]).

Although we have used Tree-Child Orientation as a case study for performance evaluation, Algorithm 1 is applicable to orientation problems for classes 𝒞\mathcal{C} other than tree-child networks. For example, orientation for stack-free networks [22] or tree-based networks [23] is expected to be a good application, because one can quickly decide whether a given network belongs to such a class. Speeding up Algorithm 1 and extending it to non-binary networks are topics for future research. It would be also interesting to consider alternative FPT algorithms that do not use a cycle basis.

Improving the accuracy of Algorithm 2 is also an interesting direction for future research. As discussed in Section 7, there are clearly rooms for improvement in the current objective function. Introducing a more suitable objective function may lead to the development of faster and more useful tree-child orientation heuristics.

Appendix A Algorithm for generating undirected binary phylogenetic networks

To create undirected binary phylogenetic networks on XX for the experiments, we used a simple method explained below (see Table 3 and Fig. 9 for illustrations). The code can be found at https://github.com/hayamizu-lab/tree-child-orienter/tree/main/Appendix. This method uses the idea of the coalescent model which is a popular approach for simulating phylogenetic trees.

Given a set XX of nn present-day species, we trace back nn lineages from the present to the past. At each step, lineages can either split with probability PrP_{r} or coalesce with probability (1Pr)(1-P_{r}). If two lineages coalesce, a tree vertex is created, and the set of extant taxa is updated. If a lineage splits, a reticulation is created, and the set of extant taxa is updated. This process continues until all lineages coalesce into a single vertex, the root. The resulting graph becomes a rooted directed binary phylogenetic network on XX after the vertices of in-degree and out-degree 11 are suppressed and the necessary vertex and arc are added to resolve the nonbinary vertices. It can then be converted to an undirected phylogenetic network on XX after suppressing the root and by ignoring all arc orientations and non-leaf vertex labels.

By adjusting the value of PrP_{r}, we can generate phylogenetic networks with various reticulation numbers. When Pr=0P_{r}=0, no reticulations occur, and the generated networks are guaranteed to be phylogenetic trees. If PrP_{r} has been set a larger value, the algorithm tends to produce graphs having more reticulations, as shown in Table 4. Thus, we generated undirected binary phylogenetic networks on XX with varying levels of complexity, and selected suitable ones in each computational experiment.

For Experiments 6.2 and 6.3, we generated 12001200 various undirected binary phylogenetic networks on XX. The method requires the number |X||X| of leaves and reticulation probability parameter PrP_{r} to be specified. Table 4 summarises the breakdown of the generated graphs.

Table 3: An illustration of the graph generation method used in this study (see also Fig. 9).
Step Taxon set Selected lineage(s) Event New taxa New arcs
1 {1,2,3,4}\{1,2,3,4\} 3,4 Coalesce 5 (5,3),(5,4)(5,3),(5,4)
2 {1,2,5}\{1,2,5\} 2 Split 6,7 (6,2),(7,2)(6,2),(7,2)
3 {1,5,6,7}\{1,5,6,7\} 1,6 Coalesce 8 (8,1),(8,6)(8,1),(8,6)
4 {5,7,8}\{5,7,8\} 7,8 Coalesce 9 (9,7),(9,8)(9,7),(9,8)
5 {5,9}\{5,9\} 5 Split 10,11 (10,5),(11,5)(10,5),(11,5)
6 {9,10,11}\{9,10,11\} 9,10 Coalesce 12 (12,9),(12,10)(12,9),(12,10)
7 {11,12}\{11,12\} 11,12 Coalesce 13 (13,11),(13,12)(13,11),(13,12)
8 {13}\{13\} - - - -
Refer to caption
Figure 9: An illustration of how the process in Table 3 yields a directed binary phylogenetic network on X={x1,x2,x3,x4}X=\{x_{1},x_{2},x_{3},x_{4}\} and how it is converted to an undirected one.
Table 4: Summary of the undirected binary phylogenetic networks on XX generated before the experiments.
Number rr of reticulations
(|X|,Pr)(|X|,P_{r}) 0 11 22 33 44 55 66 77 88 99 1010\leq Total
(10, 0.05) 143 48 8 1 0 0 0 0 0 0 0 200
(10, 0.1) 99 67 20 9 3 2 0 0 0 0 0 200
(10, 0.15) 66 55 29 21 20 6 1 0 1 0 1 200
(20, 0.05) 73 82 31 10 3 0 0 0 1 0 0 200
(20, 0.1) 37 62 40 29 16 5 2 6 2 1 0 200
(20, 0.15) 14 19 37 38 30 18 18 10 7 4 5 200
Total 432 333 165 108 72 31 21 16 11 5 6 1200
\bmhead

Supplementary information The source code, datasets and detailed experimental results are available at https://github.com/hayamizu-lab/tree-child-orienter (archived at swh:1:dir:666a10c01c14741702fddd5f8704b30bc90299e5).

\bmhead

Acknowledgements We thank Yufeng Wu and Louxin Zhang for sharing code for randomly generating graphs, which we adapted for our experiments. We also appreciate the anonymous reviewers’ careful reading of our manuscript and useful comments.

\bmhead

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

\bmhead

Funding MH is supported by JST FOREST Program Grant Number JPMJFR2135, Japan.

References

  • \bibcommenthead
  • Huber et al. [2024] Huber, K.T., Iersel, L., Janssen, R., Jones, M., Moulton, V., Murakami, Y., Semple, C.: Orienting undirected phylogenetic networks. Journal of Computer and System Sciences 140, 103480 (2024)
  • Maeda et al. [2023] Maeda, S., Kaneko, Y., Muramatsu, H., Murakami, Y., Hayamizu, M.: Orienting undirected phylogenetic networks to tree-child network. arXiv preprint arXiv:2305.10162 (2023)
  • Bulteau et al. [2023] Bulteau, L., Weller, M., Zhang, L.: On Turning A Graph Into A Phylogenetic Network. working paper or preprint (2023). https://hal.science/hal-04085424
  • Cardona et al. [2008] Cardona, G., Rosselló, F., Valiente, G.: Comparison of Tree-child Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6(4), 552–569 (2008)
  • Semple [2016] Semple, C.: Phylogenetic Networks with Every Embedded Phylogenetic Tree a Base Tree. Bulletin of Mathematical Biology 78(1), 132–137 (2016)
  • Robbins [1939] Robbins, H.E.: A Theorem on Graphs, with an Application to a Problem of Traffic Control. The American Mathematical Monthly 46(5), 281–283 (1939)
  • Döcker and Linz [2025] Döcker, J., Linz, S.: On the existence of funneled orientations for classes of rooted phylogenetic networks. Theoretical Computer Science 1023, 114908 (2025) https://doi.org/10.1016/j.tcs.2024.114908
  • Bulteau and Zhang [2023] Bulteau, L., Zhang, L.: The tree-child network problem and the shortest common supersequences for permutations are NP-hard. arXiv preprint arXiv:2307.04335 (2023)
  • McDiarmid et al. [2015] McDiarmid, C., Semple, C., Welsh, D.: Counting phylogenetic networks. Annals of Combinatorics 19, 205–224 (2015)
  • Gross et al. [2003] Gross, J.L., Yallen, J., Zhang, P.: Handbook of Graph Theory. CRC press, Boca Raton (2003)
  • Diestel [2017] Diestel, R.: Graph Theory, 5th Edition. Springer, Berlin (2017)
  • Bondy and Murty [2008] Bondy, J.A., Murty, U.S.R.: Graph Theory. Springer, London (2008)
  • Stepanets [1964] Stepanets, G.F.: Basis systems of vector cycles with extremal properties in graphs. Uspekhi Matematicheskikh Nauk 19(2), 171–175 (1964)
  • Deo et al. [1982] Deo, N., Prabhu, G., Krishnamoorthy, M.S.: Algorithms for Generating Fundamental Cycles in a Graph. ACM Transactions on Mathematical Software 8(1), 26–201342 (1982) https://doi.org/10.1145/355984.355988
  • Horton [1987] Horton, J.D.: A polynomial-time algorithm to find the shortest cycle basis of a graph. Society for Industrial and Applied Mathematics Journal on Computing 16(2), 358–366 (1987)
  • Coelho de Pina [1995] Pina, J.: Applications of shortest path methods. Ph.D. thesis, University of Amsterdam (December 1995). https://hdl.handle.net/11245/1.118244
  • Amaldi et al. [2010] Amaldi, E., Iuliano, C., Rizzi, R.: Efficient Deterministic Algorithms for Finding a Minimum Cycle Basis in Undirected Graphs. In: Integer Programming and Combinatorial Optimization: 14th International Conference, IPCO 2010, Lausanne, Switzerland, June 9-11, 2010. Proceedings 14, pp. 397–410 (2010). Springer
  • Glémin et al. [2019] Glémin, S., Scornavacca, C., Dainat, J., Burgarella, C., Viader, V., Ardisson, M., Sarah, G., Santoni, S., David, J., Ranwez, V.: Pervasive hybridizations in the history of wheat relatives. Science Advances 5(5), 9188 (2019) https://doi.org/10.1126/sciadv.aav9188 https://www.science.org/doi/pdf/10.1126/sciadv.aav9188
  • Marcussen et al. [2014] Marcussen, T., Sandve, S.R., Heier, L., et al.: Ancient hybridizations among the ancestral genomes of bread wheat. Science 345(6194), 1250092 (2014) https://doi.org/10.1126/science.1250092 https://www.science.org/doi/pdf/10.1126/science.1250092
  • Willems et al. [2014] Willems, M., Tahiri, N., Makarenkov, V.: A new efficient algorithm for inferring explicit hybridization networks following the Neighbor-Joining principle. Journal of Bioinformatics and Computational Biology 12(05), 1450024 (2014)
  • Mallick et al. [2016] Mallick, S., Li, H., Lipson, M., et al.: The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016) https://doi.org/10.1038/nature18964
  • Semple and Simpson [2018] Semple, C., Simpson, J.: When is a Phylogenetic Network Simply an Amalgamation of Two Trees? Bulletin of Mathematical Biology 80(9), 2338–2348 (2018)
  • Francis and Steel [2015] Francis, A.R., Steel, M.: Which Phylogenetic Networks are Merely Trees with Additional Arcs? Systematic Biology 64(5), 768–777 (2015) https://doi.org/%****␣sn-article.bbl␣Line␣375␣****10.1093/sysbio/syv037