
Cluster Editing with Vertex Splitting

Faisal N. Abu-Khzam    Emmanuel Arrighi    Matthias Bentert    Pål Grønås Drange    Judith Egan    Serge Gaspers    Alexis Shaw    Peter Shaw    Blair D. Sullivan    Petra Wolf
Abstract

Cluster Editing, also known as Correlation Clustering, is a well-studied graph modification problem. In this problem, one is given a graph and the task is to perform up to $k$ edge additions or deletions to transform it into a cluster graph, i.e., a graph consisting of a disjoint union of cliques. However, in real-world networks, clusters are often overlapping. For example, in social networks, a person might belong to several communities, e.g., those corresponding to work, school, or neighborhood. Other strong motivations come from biological network analysis and from language networks. Trying to cluster words with similar usage in the latter can be confounded by homonyms, that is, words with multiple meanings like “bat”. In this paper, we introduce a new variant of Cluster Editing whereby a vertex can be split into two or more vertices. First used in the context of graph drawing, this operation allows a vertex $v$ to be replaced by two vertices whose combined neighborhood is the neighborhood of $v$ (and thus $v$ can belong to more than one cluster). We call the new problem Cluster Editing with Vertex Splitting and we initiate the study of it. We show that it is $\mathcal{NP}$-complete and fixed-parameter tractable when parameterized by the total number $k$ of allowed vertex-splitting and edge-editing operations. In particular, we obtain an $O(2^{9k\log k}+n+m)$-time algorithm and a $6k$-vertex kernel.

Author e-mail addresses: faisal.abukhzam@lau.edu.lb, emmanuel.arrighi@gmail.com, matthias.bentert@uib.no, pal.drange@uib.no, judith.egan@cdu.edu.au, serge.gaspers@unsw.edu.au, Alexis.Shaw@student.unsw.edu.au, petershaw@ojlab.ac.cn, sullivan@cs.utah.edu, and mail@wolfp.net.

1 Introduction

Cluster Editing is defined as follows. Given a graph $G$ and a non-negative integer $k$, one is asked whether $G$ can be turned into a disjoint union of cliques by a sequence of at most $k$ edge-editing operations (i.e., additions or removals of edges). The problem is known to be $\mathcal{NP}$-complete since the work of Křivánek and Morávek [24] and it is fixed-parameter tractable when parameterized by $k$, the total number of allowed edge-editing operations [5, 6, 18]. Over the last decade, Cluster Editing has been well studied from both theoretical and practical perspectives [5, 7, 8, 10, 11, 14, 15, 19, 21].

In general, clustering results in a partitioning of the input graph. Hence, it forces each vertex to be in one and only one cluster. This can be a limitation when the entity represented by a vertex plays a role in multiple clusters. This situation is recorded in work on gene regulatory networks [2], where enumeration of maximal cliques was considered a viable alternative to clustering. Moreover, such vertices can effectively hide clique-like structures and also greatly increase the computational time required to obtain an optimal solution [27, 28].

In this paper, we introduce a new variant, which we call Cluster Editing with Vertex Splitting, in an attempt to allow for such overlapping clusters. (Footnote 1: Preliminary versions of parts of this paper have been presented at ISCO 2018 and IPEC 2023; see [3] and [4].) We show the new problem to be $\mathcal{NP}$-complete and investigate its parameterized complexity. We obtain a polynomial kernel using the notion of critical cliques as introduced by Lin et al. [25] and applied to Cluster Editing by Guo [19].

This paper is structured as follows: In Section 2, we give some basic definitions and notation used throughout the paper. In Section 3, we prove that our problem is $\mathcal{NP}$-complete, even on graphs of bounded maximum degree. In Section 4, we study the order of operations in an optimal solution and Section 5 is devoted to critical cliques. In Section 6, we show how to obtain a $6k$-vertex kernel in linear time. Section 7 presents a fixed-parameter tractable algorithm and we conclude in Section 8 with some open problems and future directions.

2 Preliminaries

For a positive integer $n$, we use $[n]$ to denote the set $\{1,2,\ldots,n\}$ of all positive integers up to $n$. All logarithms in this paper use 2 as their base. We use standard graph-theoretic notation and refer the reader to the textbook by Diestel [12] for commonly used definitions. All graphs in this work are simple, unweighted, and undirected. We denote the open and closed neighborhoods of a vertex $v$ by $N(v)$ and $N[v]$, respectively. For a subset $V'$ of vertices in a graph $G$, we denote by $G[V']$ the subgraph of $G$ induced by $V'$. For an introduction to parameterized complexity, fixed-parameter tractability, and kernelization, we refer the reader to the textbooks by Flum and Grohe [17], Niedermeier [26], and Cygan et al. [9].

The exponential-time hypothesis (ETH), formulated by Impagliazzo, Paturi, and Zane [22], states that there exists some positive real number $s$ such that 3-Sat on $N$ variables and $M$ clauses cannot be solved in $2^{s(N+M)}$ time.

A cluster graph is a graph in which the vertex set of each connected component induces a clique. Equivalently, a graph is a cluster graph if and only if the graph does not have $P_3$ as an induced subgraph.
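The $P_3$-free characterization lends itself to a direct test. The following is a minimal sketch, assuming the graph is stored as a dictionary mapping each vertex to the set of its neighbors; the function name `is_cluster_graph` and the two example graphs are illustrative and not part of the paper.

```python
from itertools import combinations


def is_cluster_graph(adj):
    """Return True if the graph has no induced P3.

    `adj` maps each vertex to the set of its neighbors
    (simple, undirected graph; an assumed representation).
    """
    for v, nbrs in adj.items():
        # v is the center of an induced P3 exactly when two of its
        # neighbors are not adjacent to each other.
        for a, b in combinations(nbrs, 2):
            if b not in adj[a]:
                return False
    return True


# A triangle plus a disjoint edge is a cluster graph; a path on three
# vertices is itself an induced P3.
triangle_plus_edge = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
path = {1: {2}, 2: {1, 3}, 3: {2}}
```

This check looks at every pair of neighbors of every vertex, so it is meant for illustration rather than for large inputs.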

Problem Definition

Given a graph $G=(V,E)$, an edit sequence of length $k$ is a sequence $\sigma=(e_1,e_2,\dots,e_k)$ of $k$ operations, where each $e_i$ is one of the following operations:

  1. do nothing,

  2. add an edge to $E$,

  3. delete an edge from $E$, and

  4. split a vertex, that is, replace a vertex $v$ by two vertices $v_1,v_2$ such that $N(v_1)\cup N(v_2)=N(v)$.

An example of the splitting operation is given in Figure 1 and we call the two new vertices $v_1$ and $v_2$ copies of the original vertex $v$ (and if $v_1$ or $v_2$ are further split in the future, then the resulting vertices are also called copies of $v$).

Figure 1: An illustration of a vertex-split operation. The vertex $v$ is replaced by $v_1$ and $v_2$, with the vertices $a$, $c$, and $d$ being adjacent to exactly one of the two vertices and $b$ being adjacent to both.

We denote the graph resulting from applying an edit sequence $\sigma$ to a graph $G$ by $G_{|\sigma}$. Cluster Editing with Vertex Splitting is then defined as follows. Given a graph $G$ and an integer $k$, is there an edit sequence $\sigma$ of length $k$ such that $G_{|\sigma}$ is a cluster graph?
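To make the four operations concrete, here is a small sketch that applies an edit sequence to a graph. The tuple encoding of operations (`"nop"`, `"add"`, `"del"`, `"split"`) and the function name `apply_edits` are assumptions made for this illustration; the paper does not prescribe a representation.

```python
def apply_edits(adj, ops):
    """Apply an edit sequence and return the resulting adjacency dict.

    Assumed encoding of operations (hypothetical, for illustration):
      ("nop",)                         do nothing
      ("add", u, v)                    add the edge {u, v}
      ("del", u, v)                    delete the edge {u, v}
      ("split", v, v1, v2, n1, n2)     replace v by v1 and v2 with
                                       neighborhoods n1 and n2
    """
    g = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    for op in ops:
        kind = op[0]
        if kind == "nop":
            continue
        if kind == "add":
            _, u, v = op
            g[u].add(v)
            g[v].add(u)
        elif kind == "del":
            _, u, v = op
            g[u].discard(v)
            g[v].discard(u)
        elif kind == "split":
            _, v, v1, v2, n1, n2 = op
            n1, n2 = set(n1), set(n2)
            # The definition requires N(v1) | N(v2) == N(v).
            if n1 | n2 != g[v]:
                raise ValueError("N(v1) | N(v2) must equal N(v)")
            for w in g.pop(v):
                g[w].discard(v)
            for new, nbrs in ((v1, n1), (v2, n2)):
                g[new] = set(nbrs)
                for w in nbrs:
                    g[w].add(new)
        else:
            raise ValueError(f"unknown operation {kind!r}")
    return g


# Splitting the middle vertex of the path a - v - b yields two disjoint edges.
path = {"a": {"v"}, "v": {"a", "b"}, "b": {"v"}}
result = apply_edits(path, [("split", "v", "v1", "v2", {"a"}, {"b"})])
```

In this example a single split turns an induced $P_3$ into a cluster graph, whereas plain Cluster Editing would need one edge edit as well.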

3 $\mathcal{NP}$-hardness

In this section, we show that Cluster Editing with Vertex Splitting is $\mathcal{NP}$-complete.

Theorem 1.

Cluster Editing with Vertex Splitting is $\mathcal{NP}$-complete. Moreover, assuming the exponential-time hypothesis, there is no $2^{o(n+m)}$-time or $2^{o(k)}\cdot\text{poly}(n)$-time algorithm for it.

Proof.

Since containment in $\mathcal{NP}$ is obvious (non-deterministically guess the sequence of operations and check that the resulting graph is indeed a cluster graph), we focus on the $\mathcal{NP}$-hardness and present a reduction from 3-Sat. Therein, we will use two gadgets, a variable gadget and a clause gadget. The variable gadget is a wheel graph with two (connected) center vertices. An example of this graph is depicted on the left side of Figure 2. We call this graph with $t$ vertices on the outside $W_t$ and we will only consider instances with $t \bmod 6=0$, that is, $t=6a$ for some positive integer $a$. The clause gadget is a “crown graph” as depicted in Figure 3(a).

(a) The graph $W_6$ (variable gadget).
(b) One of the three ways of transforming $W_6$ into two $K_5$’s using six operations.
Figure 2: The graph $W_{6t}$ requires $8t-2$ edits and any solution with exactly $8t-2$ edits results in a disjoint union of $K_5$’s.

(a) The crown graph
(b) Good solution: one added edge and two deleted edges
(c) Bad solution 1: three added edges
(d) Bad solution 2: two splits and one added edge
(e) Bad solution 3: one split, one deleted edge, and one added edge
Figure 3: The crown graph with its four solutions of size 3. The good solution is the only solution with three operations that creates at least one isolated vertex.

More precisely, for each variable $x_i$, we construct a variable gadget $G_i$ which is a $W_{6a}$ where $a$ is the number of clauses that contain either $x_i$ or $\neg x_i$. For each clause $C_j$, we construct a clause gadget $H_j$ as depicted in Figure 3(a), that is, a $K_5$ with the edges of a triangle removed. We arbitrarily assign each of the three vertices of degree two in $H_j$ to one literal in $C_j$. Finally, we connect the variable and clause gadgets as follows. If a variable $x_i$ appears in a clause $C_j$, then let $u$ be the vertex in $H_j$ assigned to $x_i$ (or $\neg x_i$). Moreover, let $b$ be the number such that $C_j$ is the $b$th clause containing either $x_i$ or $\neg x_i$ and let $c=6(b-1)$. Let the vertices on the outer cycle of $G_i$ be $v_1,v_2,\ldots,v_{6a}$. If $C_j$ contains the literal $x_i$, then we add the three edges $\{u,v_{c+1}\},\{u,v_{c+2}\},\{u,v_{c+3}\}$. If $C_j$ contains the literal $\neg x_i$, then we add the three edges $\{u,v_{c+2}\},\{u,v_{c+3}\},\{u,v_{c+4}\}$. To complete the reduction, we set $k=35M-2N$, where $M$ is the number of clauses and $N$ is the number of variables.
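The construction can be sketched in a few lines of code. This is only an illustration under stated assumptions: clauses are given as triples of nonzero integers (literal `+i` stands for $x_i$, `-i` for $\neg x_i$), every variable occurs in at least one clause, no clause repeats a variable, and the vertex labels and the name `build_instance` are hypothetical.

```python
from itertools import combinations


def build_instance(clauses, num_vars):
    """Return (edges, k) for the instance described in the reduction.

    Assumptions (not from the paper): DIMACS-style integer literals,
    every variable occurs at least once, no variable repeats in a clause.
    """
    edges = set()

    def add(u, v):
        edges.add(frozenset((u, v)))

    # a = number of clauses containing x_i or its negation
    occurrences = {i: 0 for i in range(1, num_vars + 1)}
    for clause in clauses:
        for lit in clause:
            occurrences[abs(lit)] += 1
    if any(a == 0 for a in occurrences.values()):
        raise ValueError("every variable must occur in some clause")

    # Variable gadget W_{6a}: outer cycle plus two adjacent centers.
    for i, a in occurrences.items():
        t = 6 * a
        c1, c2 = ("center", i, 1), ("center", i, 2)
        add(c1, c2)
        for p in range(1, t + 1):
            v = ("v", i, p)
            add(v, ("v", i, p % t + 1))  # cycle edge to the next vertex
            add(v, c1)
            add(v, c2)

    # Clause gadget: K_5 minus a triangle; the triangle vertices are the
    # three degree-two vertices, one per literal.
    seen = {i: 0 for i in occurrences}
    for j, clause in enumerate(clauses):
        verts = [("h", j, q) for q in range(5)]
        triangle = set(verts[:3])
        for u, w in combinations(verts, 2):
            if not (u in triangle and w in triangle):
                add(u, w)
        for u, lit in zip(verts[:3], clause):
            i = abs(lit)
            c = 6 * seen[i]  # c = 6(b - 1) for the b-th occurrence
            seen[i] += 1
            offsets = (1, 2, 3) if lit > 0 else (2, 3, 4)
            for off in offsets:
                add(u, ("v", i, c + off))

    k = 35 * len(clauses) - 2 * num_vars
    return edges, k


edges, k = build_instance([(1, 2, 3), (-1, 2, -3)], num_vars=3)
vertices = set().union(*edges)
```

For this two-clause example, each variable gadget is a $W_{12}$, and counting gadget and connection edges gives $23M+2N$ vertices and $70M+N$ edges in total, in line with the later observation that $n,m\in O(N+M)$.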

We next show that the reduction is correct, that is, the constructed instance of Cluster Editing with Vertex Splitting is a yes-instance if and only if the original formula $\phi$ of 3-Sat is satisfiable. To this end, first assume that $\phi$ is satisfiable and let $\beta$ be a satisfying assignment. For each variable $x_i$, we will partition $G_i$ into $K_5$’s as follows. Let $a$ be the value such that $G_i$ is isomorphic to $W_{6a}$. If $\beta$ sets $x_i$ to true, then we remove the edge $\{v_{3j},v_{3j+1}\}$ and add the edge $\{v_{3j+1},v_{3j+3}\}$ for each integer $1\leq j\leq 2a$ (where values larger than $6a$ are taken modulo $6a$). If $\beta$ sets $x_i$ to false, then we remove the edge $\{v_{3j+1},v_{3j+2}\}$ and add the edge $\{v_{3j+2},v_{3j+4}\}$ for each $1\leq j\leq 2a$. Moreover, we split each of the two center vertices $2a-1$ times. In total, we use $8a-2$ modifications to transform $G_i$ into a collection of $K_5$’s. Since each clause contains exactly three literals and we add six vertices for each variable appearance, the sum of lengths of cycles in all variable gadgets combined is $18M$. Hence, in all variable gadgets combined, we perform $24M-2N$ modifications.
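The count for the variable gadgets can be spelled out as a short derivation. The symbol $a_i$ (the value of $a$ for the gadget $G_i$) is introduced here for illustration, and the derivation reads the splitting step as splitting each of the two centers $2a_i-1$ times.

```latex
% Assumption: a_i is the number of clauses containing x_i or \neg x_i.
\begin{align*}
  \underbrace{2a_i}_{\text{edge deletions}}
    + \underbrace{2a_i}_{\text{edge additions}}
    + \underbrace{2(2a_i-1)}_{\text{center splits}} &= 8a_i-2, \\
  \sum_{i=1}^{N} 6a_i = 18M \quad &\Longrightarrow \quad \sum_{i=1}^{N} a_i = 3M, \\
  \sum_{i=1}^{N} (8a_i-2) &= 8\cdot 3M-2N = 24M-2N .
\end{align*}
```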

Next, we modify the crown graphs. To this end, let $C_j$ be a clause and let $H_j$ be the constructed clause gadget. Since $\beta$ is a satisfying assignment, at least one variable appearing in $C_j$ satisfies it. If multiple such variables exist, then we pick one arbitrarily. Let $x_i$ be the selected variable and let $u$ be the vertex in $H_j$ assigned to $x_i$. We first turn $H_j$ into a $K_4$ and an isolated vertex by removing the two edges incident to $u$ in $H_j$ and adding the missing edge between the two vertices assigned to different variables. Finally, we look at the edges between variable gadgets and clause gadgets. For the vertex $u$, note that by construction the three vertices that $u$ is adjacent to in $G_i$ already belong to a $K_5$ and hence we can add two edges to (copies of) the two centers of the variable gadget to form a $K_6$. For the two other vertices in $H_j$ that have edges to vertices in variable gadgets, we remove all three such edges, that is, six edges per clause. Hence, we use $3+2+6=11$ modifications for each clause. Since the total number of modifications is $35M-2N$ and the resulting graph is a collection of $K_4$’s, $K_5$’s, and $K_6$’s, the constructed instance of Cluster Editing with Vertex Splitting is a yes-instance.

For the other direction, suppose the constructed instance of Cluster Editing with Vertex Splitting is a yes-instance. We first show that $24M-2N$ modifications are necessary to transform all variable gadgets into cluster graphs and that this bound can only be achieved if each time exactly three consecutive vertices on the cycles are contained in the same $K_5$. To this end, consider any variable gadget $G_i$. By construction, $G_i$ is isomorphic to $W_{6a}$ for some integer $a$. By the counting argument from above, it suffices to show that at least $8a-2$ modifications are necessary. Note first that some edge in the cycle has to be removed or some vertex on the cycle has to be split as otherwise any solution would contain a clique with all vertices in the cycle and this would require at least $18a^2-9a>8a-2$ edge additions (since the degree of each of the $6a$ vertices in the cycle would need to increase from $2$ to $6a-1$). We next analyze how many modifications are necessary to separate $b$ vertices from the outer cycle into a clique. We require at least two modifications for the center vertices (either splitting them or removing the edges between them and the first vertex that we want to separate) and one operation to separate the cycle on the other end (either splitting a vertex or removing an edge of the cycle). For $b\in\{1,2\}$ these operations are enough. For $b\geq 3$, we need to add $\binom{b}{2}-(b-1)$ edges (all edges in a clique of size $b$ minus the already existing edges of a path on $b$ vertices). Note that the “average cost” per separated vertex (number of operations divided by $b$) is minimized (only) with $b=3$ with a cost of $4$ for three vertices. Hence, to separate all but $c$ vertices from the cycle, we require at least $4(6a-c)/3$ operations. The cost for making the remaining $c$ vertices into a clique requires again $\binom{c}{2}-(c-1)$ edge additions. Analogously, the optimal solution is to have $c=3$ with just a single edge addition. Thus, the minimum number of required operations is at least $4(6a-3)/3+2=8a-2$ (where the $+2$ comes from the initial edge removal and the final edge addition between the last $c=3$ vertices) and this value can only be reached by partitioning the cycle into triples which each form a $K_5$ with the two center vertices. Note that it is always preferable to delete an edge on the outer cycle rather than split one of its two endpoints, as splitting a vertex increases the number of vertices on the cycle and thus incurs a higher overall cost.

Next, we analyze the clause gadget and the edges between the different gadgets. We start with the latter. Let $u$ be a vertex in a clause gadget $H_j$ with (three) incident edges to some variable gadget. The only way to not use at least three operations to deal with the three edges is if $u$ is an isolated vertex or if the three neighbors do not have two more neighbors in the current solution. In the former case, we can (possibly) add the two edges between $u$ and the two centers of the respective variable gadgets to build a $K_6$. In the latter case, we have used at least three operations more in the variable gadget than intended (either by removing edges between neighbors of $u$ and the center vertices or by splitting all neighbors of $u$). Since each vertex in a variable gadget is only adjacent to at most one vertex in a clause gadget, this cannot lead to an overall reduction in the number of operations and we can therefore ignore this latter case.
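The average-cost claim in the variable-gadget analysis above can be checked numerically. The sketch below encodes the stated cost of separating $b$ cycle vertices (three cutting operations plus the missing clique edges) and confirms that the cost per vertex is smallest at $b=3$; the function names and the search range of 1 to 49 are choices made for this illustration.

```python
from fractions import Fraction
from math import comb


def separation_cost(b):
    """Operations to separate b consecutive cycle vertices into a clique.

    Two operations for the center vertices, one to cut the cycle at the
    other end, plus the edges missing from a path on b vertices.
    """
    return 3 + max(0, comb(b, 2) - (b - 1))


def average_cost(b):
    """Cost per separated vertex, kept exact with Fraction."""
    return Fraction(separation_cost(b), b)


# Search a finite range; the cost per vertex grows for larger b.
best = min(range(1, 50), key=average_cost)
```

With these definitions, `average_cost(3)` equals $4/3$, while $b=2$ and $b=4$ both give $3/2$.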

We are now in a position to argue that at least eleven modifications are necessary for each clause gadget. First, note that at least three operations are required to transform a crown into a cluster graph. Possible ways of achieving this are depicted in Figure 3. In each of these possibilities, at most one vertex becomes an isolated vertex. To make two vertices independent, at least four operations are required and for three isolated vertices, at least five operations are required. As shown above, at least two operations are required for each isolated vertex with edges to variable gadgets and at least three operations are required for non-isolated vertices with edges to variable gadgets. Thus, at least eleven operations are required for each clause gadget and eleven operations are sufficient if and only if the three vertices incident to one of the vertices in $H_j$ belong to the same $K_5$ in the variable gadget.

By the argument above, at least $24M-2N+11M=k$ operations are necessary and since the constructed instance is a yes-instance, there is a way to cover all variable gadgets with $K_5$’s such that for each clause there is at least one vertex whose three neighbors in a variable gadget belong to the same $K_5$. Let $C_j$ be a clause, let $u$ be a vertex with all three neighbors in the same $K_5$, and let $x_i$ be the variable corresponding to this variable gadget. If $x_i$ appears positively in $C_j$, then $v_{3\ell+1},v_{3\ell+2},$ and $v_{3\ell+3}$ belong to the same $K_5$ for each $\ell$ and we set $x_i$ to true. If $x_i$ appears negatively in $C_j$, then $v_{3\ell+2},v_{3\ell+3},$ and $v_{3\ell+4}$ belong to the same $K_5$ for each $\ell$ and we set $x_i$ to false. Note that we never set a variable to both true and false in this way. We set all remaining variables arbitrarily to true or false. By construction, the variable $x_i$ satisfies $C_j$ and since we do the same for all clauses, all clauses are satisfied, that is, the original formula $\phi$ is satisfiable. Thus, the constructed instance is equivalent to the original instance of 3-Sat.

Since the reduction can clearly be computed in polynomial time, this concludes the proof for the $\mathcal{NP}$-hardness. For the ETH-based hardness, observe that $k,n,m\in O(N+M)$. This implies that there are no $2^{o(n+m)}$-time or $2^{o(k)}\cdot\text{poly}(n)$-time algorithms for Cluster Editing with Vertex Splitting unless the ETH fails [22]. ∎

In contrast to the reduction for Cluster Editing [23], our reduction does not produce instances with constant maximum degree. We instead observe that in our reduction, the maximum degree of the produced instances depends only on the maximum number of times a variable appears in a clause. Combining this with the fact that 3-Sat remains $\mathcal{NP}$-hard when restricted to instances where each variable appears in at most four clauses [29], we obtain the following corollary:

Corollary 1.

Cluster Editing with Vertex Splitting remains $\mathcal{NP}$-hard on graphs with maximum degree 24.

4 The Edit-Sequence Approach

In this section, we show that we can always assume that a solution to Cluster Editing with Vertex Splitting has a specific structure, that is, it first adds edges, then removes edges, and finally splits vertices. We mention that removing edges can also be moved to the front or to the back and that we do not use the fact that we want to reach a cluster graph at any point. The statement therefore holds for any graph-modification problem which only adds edges, deletes edges, and splits vertices.

To start, we say that two edit sequences $\sigma$ and $\sigma'$ are equivalent if $G_{|\sigma}$ and $G_{|\sigma'}$ are isomorphic. We show first that all vertex splittings can be moved to the back of the edit sequence.

Lemma 1.

For any edit sequence $\sigma=(e_1,e_2,\dots,e_i,e_{i+1},\dots,e_k)$ where $e_i$ is a vertex splitting and $e_{i+1}$ is an edge addition, an edge deletion, or a do-nothing operation, there is an equivalent edit sequence $\sigma'=(e_1,e_2,\dots,e_{i-1},e'_i,e'_{i+1},e_{i+2},\dots,e_k)$ of the same length where $e'_i$ is an edge addition, an edge deletion, or a do-nothing operation and $e'_{i+1}$ is a vertex splitting.

Proof.

If $e_{i+1}$ is a do-nothing operation or the edge added or removed by it is not incident to one of the vertices introduced by the vertex split $e_i$, then the edit sequence $\sigma'$ where $e'_i=e_{i+1}$ and $e'_{i+1}=e_i$ is equivalent to $\sigma$ and of the same length. So assume without loss of generality that $e_i$ splits vertex $v$ into $v_1$ and $v_2$ and $e_{i+1}$ adds or deletes the edge $\{v_1,w\}$ for some vertex $w$. If $e_{i+1}$ adds the edge $\{v_1,w\}$ and the edge $\{v,w\}$ exists after performing the edit subsequence $\sigma_1=(e_1,e_2,\ldots,e_{i-1})$, then we set $e'_i$ to be the do-nothing operation. If $e_{i+1}$ adds the edge $\{v_1,w\}$ and the edge $\{v,w\}$ does not exist after performing $\sigma_1$, then we let $e'_i$ add the edge $\{v,w\}$. In both cases, we let $e'_{i+1}=e_i$ with the modification that $v_1$ is also adjacent to $w$. Note that this is possible as $v$ is adjacent to $w$ after performing the edit sequence $(e_1,e_2,\ldots,e_{i-1},e'_i)$. If $e_{i+1}$ deletes the edge $\{v_1,w\}$, then we know that the edge $\{v,w\}$ exists after performing $\sigma_1$ and we can assume without loss of generality that the edge $\{v_2,w\}$ does not exist after performing the edit sequence $(e_1,e_2,\ldots,e_{i-1},e'_i,e'_{i+1})$. Hence, we let $e'_i$ remove the edge $\{v,w\}$ and let $e'_{i+1}=e_i$ with the modification that $v_1$ is no longer adjacent to $w$. In all cases, the graphs reached after performing the edit sequences $(e_1,e_2,\ldots,e_i,e_{i+1})$ and $(e_1,e_2,\ldots,e_{i-1},e'_i,e'_{i+1})$ are identical and hence performing the edit sequence $\sigma^*=(e_{i+2},e_{i+3},\ldots,e_k)$ afterwards shows that $\sigma$ and $\sigma'$ are equivalent (and they are of the same length). ∎

We show next that moving edge additions to the front results in an equivalent edit sequence.

Lemma 2.

For any edit sequence $\sigma=(e_1,e_2,\dots,e_i,e_{i+1},\dots,e_k)$ where $e_i$ is an edge deletion and $e_{i+1}$ is an edge addition, at least one of the edit sequences $\sigma'=(e_1,e_2,\dots,e_{i-1},e'_i,e'_{i+1},e_{i+2},\dots,e_k)$, where either $e'_i=e_{i+1}$ and $e'_{i+1}=e_i$ or both $e'_i$ and $e'_{i+1}$ are do-nothing operations, is of the same length and equivalent to $\sigma$.

Proof.

Note that if $e_{i+1}$ adds the edge that $e_i$ removed, then replacing these two (consecutive) operations with do-nothing operations results in the same graph. If $e_{i+1}$ adds a different edge than $e_i$ removed, then the graphs reached after the edit subsequences $\sigma_1=(e_1,e_2,\ldots,e_i,e_{i+1})$ and $\sigma_2=(e_1,e_2,\ldots,e_{i-1},e_{i+1},e_i)$ are identical. Hence, performing the edit sequence $\sigma^*=(e_{i+2},e_{i+3},\ldots,e_k)$ afterwards results in isomorphic graphs for both starting edit sequences. Note that performing $\sigma_1$ first and then $\sigma^*$ is equivalent to performing $\sigma$ and performing $\sigma_2$ first and $\sigma^*$ afterwards is equivalent to performing $\sigma'$ where $e'_i=e_{i+1}$ and $e'_{i+1}=e_i$. This concludes the proof. ∎
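The two cases of this proof can be mirrored on plain edge sets. The following sketch is an illustration only: operations are encoded as `(kind, (u, v))` pairs with kind `"add"` or `"del"`, and the helper names are hypothetical.

```python
def apply_all(edges, ops):
    """Apply edge additions and deletions to a set of frozenset edges."""
    result = set(edges)
    for kind, (u, v) in ops:
        e = frozenset((u, v))
        if kind == "add":
            result.add(e)
        else:  # "del"
            result.discard(e)
    return result


def swap_delete_add(deletion, addition):
    """Replacement for a deletion immediately followed by an addition.

    Same edge: both operations become do-nothing (returned as an empty
    list). Different edges: the two operations are simply swapped.
    """
    if frozenset(deletion[1]) == frozenset(addition[1]):
        return []
    return [addition, deletion]


edges = {frozenset("ab"), frozenset("bc")}

# Different edges: delete {a, b}, then add {a, c}.
d1, a1 = ("del", ("a", "b")), ("add", ("a", "c"))
before = apply_all(edges, [d1, a1])
after = apply_all(edges, swap_delete_add(d1, a1))

# Same edge: delete {a, b}, then add it back.
d2, a2 = ("del", ("a", "b")), ("add", ("b", "a"))
same_before = apply_all(edges, [d2, a2])
same_after = apply_all(edges, swap_delete_add(d2, a2))
```

The same-edge case relies on the deleted edge being present beforehand, which holds because an edge deletion removes an edge from $E$.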

We can easily deduce the following theorem from the two above lemmata.

Theorem 2.

For any edit sequence $\sigma=(e_1,e_2,\ldots,e_k)$, there is an edit sequence $\sigma'=(e'_1,e'_2,\ldots,e'_{k'})$ such that

  1. $k'\leq k$,

  2. $\sigma$ and $\sigma'$ are equivalent,

  3. $e'_i$ is not a do-nothing operation for any $i\in[k']$,

  4. if $e'_i$ is an edge addition and $e'_j$ is an edge deletion or a vertex splitting, then $i<j$, and

  5. if $e'_i$ is an edge deletion and $e'_j$ is a vertex splitting, then $i<j$.

Proof.

Let $\sigma=(e_1,e_2,\ldots,e_k)$ be any edit sequence. If $e_i$ is a do-nothing operation for some $i\in[k]$, then the edit sequence $(e_1,e_2,\ldots,e_{i-1},e_{i+1},\ldots,e_k)$ is equivalent and shorter. Hence, we can assume that $\sigma$ does not contain any do-nothing operations. If there exists a vertex-splitting operation $e_i$ and an operation $e_j$ with $j>i$ such that $e_j$ is not a vertex-splitting operation, then let $i'$ be the last index of a vertex-splitting operation such that $e_{i'+1}$ is not a vertex-splitting operation. By Lemma 1, we can modify $\sigma$ into an equivalent edit sequence $(e'_1,e'_2,\ldots,e'_k)$ where $e'_j=e_j$ for all $j\in[k]\setminus\{i',i'+1\}$ and $e'_{i'+1}$ is a vertex-splitting operation and $e'_{i'}$ is not. If $e'_{i'}$ is a do-nothing operation, we can again remove it. Performing this procedure repeatedly until no longer applicable results in an edit sequence $\sigma_1$ that is equivalent to $\sigma$, does not contain any do-nothing operations, and in which all edge-addition and edge-deletion operations come before all vertex-splitting operations. If $\sigma_1$ performs an edge-deletion operation $e_i$ before an edge-addition operation $e_j$ ($i<j$), then there is also an index $i'$ such that $e_{i'}$ is an edge-deletion operation and $e_{i'+1}$ is an edge-addition operation. We can then use Lemma 2 to obtain a new sequence in which the number of index pairs $(i,j)$ with $i<j$, $e_i$ an edge-deletion operation, and $e_j$ an edge-addition operation is reduced by at least one. If the application of Lemma 2 introduces a do-nothing operation, then we again remove it. Repeatedly applying Lemma 2 finally results in an equivalent sequence $\sigma'$ where the number of mentioned index pairs is zero, that is, all edge-addition operations come before all edge-deletion operations. Note that $\sigma'$ now satisfies all requirements of the theorem statement. ∎

We refer to an edit sequence satisfying the statement of Theorem 2 as an edit sequence in standard form.
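Whether a given sequence is in standard form depends only on the order of its operation types. The check below is a minimal sketch that assumes each operation is summarized by one of the hypothetical labels `"nop"`, `"add"`, `"del"`, and `"split"`; it tests the ordering conditions but does not perform the rewriting from Lemmas 1 and 2.

```python
# Position of each operation type in a standard-form sequence.
RANK = {"add": 0, "del": 1, "split": 2}


def is_standard_form(kinds):
    """Check the ordering conditions on a list of operation-type labels.

    A sequence passes if it has no do-nothing operation, all additions
    precede all deletions, and all deletions precede all splits.
    """
    if any(kind == "nop" for kind in kinds):
        return False
    ranks = [RANK[kind] for kind in kinds]
    # Non-decreasing ranks mean: additions, then deletions, then splits.
    return all(x <= y for x, y in zip(ranks, ranks[1:]))
```

For example, `["add", "add", "del", "split"]` passes, while `["del", "add"]` and `["add", "nop"]` do not.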

5 Critical Cliques

Originally introduced by Lin et al. [25], critical cliques provide a useful tool in understanding the clique structure in graphs.

Definition 1.

A critical clique is a subset of vertices $C$ that is maximal with the properties that

  1. $C$ is a clique, and

  2. there exists $U\subseteq V(G)$ such that $N[v]=U$ for all $v\in C$.

Equivalently, two vertices $u$ and $v$ belong to the same critical clique if and only if they are true twins, that is, $N[u]=N[v]$. Hence, each vertex $v$ appears in exactly one critical clique. The critical clique quotient graph $\mathcal{C}$ of $G$ contains a node for each critical clique in $G$ and two nodes are adjacent if and only if the two respective critical cliques $C_1$ and $C_2$ are adjacent, that is, there is an edge between each vertex in $C_1$ and each vertex in $C_2$. Note that by the definition of critical cliques, this is equivalent to the condition that at least one edge $\{u,v\}$ with $u\in C_1$ and $v\in C_2$ exists. To avoid confusion, we will call the vertices in $\mathcal{C}$ nodes and in $G$ vertices.
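Because critical cliques are exactly the classes of true twins, they can be computed by grouping vertices by closed neighborhood. The sketch below assumes the graph is stored as a dictionary mapping each vertex to the set of its neighbors; the function names are illustrative and this is not claimed to be the linear-time routine used for the kernel.

```python
from collections import defaultdict


def critical_cliques(adj):
    """Group vertices with identical closed neighborhoods N[v]."""
    groups = defaultdict(set)
    for v, nbrs in adj.items():
        groups[frozenset(nbrs | {v})].add(v)
    return [frozenset(group) for group in groups.values()]


def quotient_graph(adj):
    """Build the critical clique quotient graph.

    Nodes are critical cliques (frozensets of vertices). Two nodes are
    adjacent if at least one edge runs between them, which by the
    definition of critical cliques means that all such edges exist.
    """
    cliques = critical_cliques(adj)
    owner = {v: c for c in cliques for v in c}
    quotient = {c: set() for c in cliques}
    for v, nbrs in adj.items():
        for w in nbrs:
            if owner[v] != owner[w]:
                quotient[owner[v]].add(owner[w])
    return quotient


# a and b are true twins; the quotient graph is the path {a,b} - {c} - {d}.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cliques = critical_cliques(graph)
quotient = quotient_graph(graph)
```

Hashing closed neighborhoods keeps the code short; each vertex contributes work proportional to its degree plus the cost of building its frozenset.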

The following lemma is adapted from Lemma 1 by Guo [19] with a careful restatement in the context of our new problem. (Footnote 2: We mention in passing that we claimed a slightly stronger lemma in a previous version of this paper. Firbas et al. [16] observed that the stronger version does not hold and conjectured that this weaker version is true. We confirm this here.)

Lemma 3 (Critical clique lemma).

Let $(G,k)$ be a yes-instance of Cluster Editing with Vertex Splitting. Then, there exists a solution $\sigma$ of length at most $k$ such that for each critical clique $C_i$ in $G$ and each clique $S_j$ in $G_{|\sigma}$, either $S_j$ contains exactly one copy of each vertex in $C_i$ or $S_j$ does not contain any copy of a vertex in $C_i$.

Proof.

Let $\hat{\sigma}$ be an optimal solution for $(G,k)$ in standard form. For each critical clique $C_i$, we select a representative vertex $r_i\in C_i$ by picking any vertex in $C_i$ with fewest appearances in $\hat{\sigma}$. If it holds for each critical clique $C_i$ in $G$ and each clique $S_j$ in the resulting cluster graph $G_{|\hat{\sigma}}$ that $S_j$ contains exactly one copy of each vertex in $C_i$ or $S_j$ does not contain any copy of a vertex in $C_i$, then $\hat{\sigma}$ satisfies all requirements of the lemma statement and we are done. Otherwise, there exists a clique $S_j$ in $G_{|\hat{\sigma}}$ which contains two copies of some vertex $v$ or there exists a critical clique $C_i$ and two vertices $u,v\in C_i$ such that $S_j$ contains a copy of $u$ but no copy of $v$. In the former case, note that removing any operations involving one of the two copies results in a solution of strictly shorter length, contradicting the fact that $\hat{\sigma}$ was an optimal solution. In the latter case, there also exists such a pair of vertices where one of the two vertices is $r_i$. Let $w$ be the other vertex.

We find a new optimal solution by removing all operations involving $w$ and copying all operations including $r_i$ and replacing $r_i$ by $w$ in the copy. For the sake of notational convenience, we say that if some operation involves the $j$th copy of $r_i$, then the very next operation will be the copy and will use the $j$th copy of $w$. (Footnote 3: We only point out here that since we removed any operations involving $w$, the original edge between $r_i$ and $w$ is not removed.) Moreover, when splitting another vertex (not $r_i$ or $w$), we treat each copy of $w$ the same as the corresponding copy of $r_i$. When splitting the $j$th copy of $r_i$ to create the $j'$th and $j''$th copies, we keep both new vertices adjacent to the $j$th copy of $w$, and when splitting the $j$th copy of $w$, we make the $j'$th copy of $w$ adjacent to the $j'$th copy of $r_i$ and the $j''$th copy of $w$ adjacent to the $j''$th copy of $r_i$. Note that each copy of $w$ is adjacent to the respective copy of $r_i$ and not adjacent to any other copy of $r_i$ (and vice versa).

Since $w$ is involved in at least as many operations as $r_i$ (recall that $r_i$ was picked as having fewest appearances in $\hat{\sigma}$), the resulting sequence $\sigma'$ will be at most as long as $\hat{\sigma}$. We next show that $\sigma'$ is also a solution. If the resulting graph $G_{|\sigma'}$ is not a cluster graph, then it contains an induced $P_3$. Let $X$ be an induced $P_3$ in $G_{|\sigma'}$. Since we only modified operations for $w$, some copy of $w$ has to be part of $X$. Let $j$ be the index such that $X$ contains the $j$th copy of $w$. If $X$ contains another copy of $w$, then the two copies are the two ends. The center cannot be a copy of $r_i$ as each copy of $w$ is only adjacent to the corresponding copy of $r_i$ and vice versa. Hence, we have another induced $P_3$ where we replace the two copies of $w$ by their respective copies of $r_i$. This contradicts that $\hat{\sigma}$ is a solution.

So assume that $X$ contains only the $j$th copy of $w$. If $X$ does not contain the $j$th copy of $r_i$, then the graph also contains a $P_3$ where we replace the $j$th copy of $w$ by the $j$th copy of $r_i$. This again contradicts the assumption that $\hat{\sigma}$ is a solution. Thus, we may assume that $X$ contains both the $j$th copy of $r_i$ and the $j$th copy of $w$. Since these two vertices are adjacent, one of the two vertices is the center of $X$. However, all other vertices have either both or none of the two vertices as neighbors. This contradicts that $X$ exists. Hence, $\sigma'$ is also a solution.

Note that the number of vertices that behave differently than their representative is one smaller in $\sigma'$ compared to $\hat{\sigma}$. Hence, repeating the above procedure at most $n$ times results in an optimal solution $\sigma$ as stated in the lemma. ∎
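The proof above relies on the fact that a graph is a cluster graph exactly when it has no induced $P_3$. As a minimal sketch of that test, assuming a graph stored as a dictionary mapping each vertex to its set of neighbors (the function names `find_induced_p3` and `is_cluster_graph` are illustrative and not from the paper):

```python
from itertools import combinations


def find_induced_p3(adj):
    """Return (u, c, w) where c is adjacent to u and w but u, w are non-adjacent.

    Returns None if no induced P3 exists. `adj` maps each vertex to the set
    of its neighbours (assumed symmetric, no self-loops).
    """
    for center, nbrs in adj.items():
        # Two neighbours of `center` that are not adjacent form an induced P3.
        for u, w in combinations(sorted(nbrs), 2):
            if w not in adj[u]:
                return (u, center, w)
    return None


def is_cluster_graph(adj):
    """A graph is a disjoint union of cliques iff it has no induced P3."""
    return find_induced_p3(adj) is None
```

This simple check inspects every pair of neighbors of every vertex, so it is meant for illustration on small graphs rather than as an efficient routine.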

In the following two sections, we will use Lemma 3 to develop a linear-vertex kernel and a $2^{O(k\log k)}(n+m)$-time algorithm.

6 A $6k$-vertex kernel

In this section, we prove a problem kernel for Cluster Editing with Vertex Splitting with at most $6k$ vertices based on the critical clique lemma (Lemma 3) and a similar kernel for Cluster Editing by Guo [19]. To this end, we propose three reduction rules, prove that they are safe (a reduction rule is safe if its application results in an equivalent instance), that they can be performed exhaustively in linear time, and that their application gives a kernel as required. We start with the simplest one.

Reduction Rule 1.

Remove all isolated cliques.

Lemma 4.

Reduction Rule 1 is safe.

Proof.

Notice that for any optimal solution $\sigma$, each clique in $G_{|\sigma}$ only contains copies of vertices of one connected component in $G$. Hence, if a clique contains a copy of a vertex in an isolated clique $K$ in $G$, then it only contains copies of vertices in $K$. Since $K$ by definition does not require any operations to be transformed into a clique with no outgoing edges, removing $K$ results in an equivalent instance of Cluster Editing with Vertex Splitting. ∎

We next bound the size of each critical clique in terms of the sizes of adjacent critical cliques.

Reduction Rule 2.

If there is a critical clique $K$ such that $|K| > |\bigcup_{C \in N(K)} C| + 1$, then reduce the size of $K$ to $|\bigcup_{C \in N(K)} C| + 1$.

Lemma 5.

Reduction Rule 2 is safe.

Proof.

Let $K$ be a critical clique with $|K| > |\bigcup_{C \in N(K)} C|$ and let $\sigma$ be an optimal solution. We show that $\sigma$ does not split any vertex corresponding to $K$ and does not add or delete any edge incident to such a vertex. By Lemma 3, we may assume that all cliques in $G_{|\sigma}$ contain no or exactly one copy of the vertices in $K$. Let $H$ be a clique in $G_{|\sigma}$ that contains exactly one copy of each vertex corresponding to $K$. Let $A$ be the set of nodes in $N(K)$ in the critical clique quotient graph $\mathcal{C}$ of $G$ whose corresponding vertices have a copy in $H$. Analogously, let $B$ be the set of nodes not in $N[K]$ in $\mathcal{C}$ whose corresponding vertices have a copy in $H$. Note first that we can assume $B = \emptyset$ as otherwise $\sigma$ adds at least $|K|$ edges between a vertex corresponding to a node in $B$ and all vertices in $K$. Splitting all vertices corresponding to nodes in $A$ instead (and therefore splitting the clique $H$ into two cliques, one containing a copy of each vertex corresponding to a node in $K \cup A$ and the other containing a copy of each vertex corresponding to a node in $A \cup B$) results in another cluster graph reached by a shorter edit sequence as $|A| \le |\bigcup_{C \in N(K)} C| < |K|$. This contradicts the fact that $\sigma$ is an optimal solution. We next show that $A = N(K)$. Assume towards a contradiction that $A \neq N(K)$. Then, there exists a node $v \in N(K) \setminus A$. Modifying $\sigma$ to split each vertex corresponding to $v$ once, add all missing edges between vertices corresponding to $v$ and vertices corresponding to a node in $A$, and remove all edge removals between vertices corresponding to $K$ and $v$ results in another cluster graph. Moreover, the newly acquired edit sequence is in fact shorter as $|v| + |v| \cdot |\bigcup_{C \in A} C| \le |v| \cdot (|\bigcup_{C \in N(K) \setminus \{v\}} C| + 1) \le |v| \cdot |\bigcup_{C \in N(K)} C| < |v| \cdot |K|$. This again contradicts the fact that $\sigma$ is optimal. We have shown that $A = N(K)$ and $B = \emptyset$. Since $H$ contains a copy of each vertex adjacent to a vertex in $K$, we can assume without loss of generality that no vertex in $K$ is split and no edge incident to such a vertex is added or deleted. Since the above argument holds as long as $|K| > |\bigcup_{C \in N(K)} C|$, we can reduce the size of $K$ to $|\bigcup_{C \in N(K)} C| + 1$ and still have an equivalent instance of Cluster Editing with Vertex Splitting. ∎

We next show that if more than $6k$ vertices are left after performing Reduction Rules 1 and 2 exhaustively, then we have a no-instance.

Lemma 6.

If there is a solution to Cluster Editing with Vertex Splitting on $(G,k)$, then there are at most $6k$ vertices and $4k$ critical cliques left after performing Reduction Rules 1 and 2 exhaustively.

Proof.

We follow an approach similar to that taken by Guo [19]. Let $\sigma$ be an optimal solution of Cluster Editing with Vertex Splitting that satisfies Lemma 3. Let $G'$ be the graph obtained from $G$ after applying Reduction Rules 1 and 2 exhaustively. Partition the node set of the critical clique quotient graph $\mathcal{C}$ of $G'$ into four sets $W$, $X$, $Y$, and $Z$ as follows. Let $W$ be the set of nodes whose corresponding vertices are the endpoint of some edge added by $\sigma$. Let $X$ be the subset of nodes not contained in $W$ whose corresponding vertices are the endpoint of some edge deleted by $\sigma$. Let $Y$ be the subset of nodes not in $W$ or $X$ whose corresponding vertices are split by $\sigma$. Finally, let $Z$ be all other nodes in $\mathcal{C}$. As each vertex corresponding to a node in $W$, $X$, and $Y$ is affected by some operation in $\sigma$ and each operation can affect at most $2$ vertices, it holds that $|\bigcup_{C \in W \cup X \cup Y} C| \le 2|\sigma| \le 2k$.

Let us now consider $Z$. Assume towards a contradiction that there is a clique $H$ in $G_{|\sigma}$ that contains the vertices corresponding to a node in $Z$ but no vertex corresponding to a node in $W \cup X \cup Y$, or that $H$ contains vertices corresponding to two nodes in $Z$. In the former case, let $v \in Z$ be a node whose corresponding vertices are contained in $H$. By definition of $Z$, the vertices corresponding to $v$ are adjacent to all vertices in $H$ (or the vertices whose copies are in $H$). Hence, none of these vertices correspond to a node in $W \cup X \cup Y$. Since Reduction Rule 1 removed all isolated critical cliques, $v$ is not an isolated node and there is therefore a second node $u \in Z$ whose corresponding vertices are contained in $H$. Hence, we are in the second case where $H$ contains the vertices of at least two critical cliques $u, v \in Z$. By definition of $Z$, the vertices corresponding to $u$ and $v$ are adjacent to all vertices in $H$ (or the vertices whose copies are in $H$) and not adjacent to any other vertices. Moreover, since $H$ is a clique and no edges incident to vertices corresponding to $u$ or $v$ were added by $\sigma$, all vertices corresponding to $u$ and $v$ are pairwise adjacent in $G$. Note that this means that $u$ and $v$ are actually the same critical clique, a contradiction. Hence, each clique $H$ in $G_{|\sigma}$ contains vertices corresponding to at most one node in $Z$ and if it does, then it contains vertices of at least one node in $W \cup X \cup Y$. This also means that the number of nodes in $Z$ is upper bounded by $|W \cup X \cup Y| \le 2k$. Thus, $G'$ contains at most $|W \cup X \cup Y \cup Z| \le 4k$ critical cliques.

Consider now the graph $G_{|\sigma}$. The number of vertices that are copies of vertices corresponding to nodes in $W \cup X \cup Y$ is at most $2k$. By definition of $Z$, the vertices corresponding to nodes in $Z$ are not split during $\sigma$. Hence, the number of vertices in $G_{|\sigma}$ that are copies of vertices corresponding to nodes in $Z$ is the same as the number of vertices in $G'$ that correspond to nodes in $Z$. Assume towards a contradiction that this number is more than $4k$. Then, there exists by the pigeonhole principle a clique $H$ in $G_{|\sigma}$ and an integer $\ell$ such that $H$ contains $\ell$ vertices which are copies of vertices corresponding to nodes in $W \cup X \cup Y$ and at least $2\ell+1$ vertices that are (copies of vertices) corresponding to nodes in $Z$. Let $A$ be the set of vertices in $G'$ which correspond to a node in $W \cup X \cup Y$ and have a copy in $H$ and let $B$ be the set of vertices in $G'$ which correspond to a node in $Z$ and are contained in $H$. By definition of $Z$, all vertices in $B$ are adjacent to all vertices in $A$ in $G'$. Moreover, as shown above, all vertices in $B$ belong to the same critical clique $C$ in $G'$. Hence, $|C| \ge 2|\bigcup_{C' \in N(C)} C'| + 1 \ge |\bigcup_{C' \in N(C)} C'| + 2$. This contradicts the fact that $C$ was reduced with respect to Reduction Rule 2. Thus, the number of vertices in $G'$ that correspond to a node in $Z$ is at most $4k$ and the total number of vertices in $G'$ is at most $6k$. ∎

The above lemma can be converted into the following reduction rule. Note that a $C_4$ is a cycle on four vertices and a $C_4$ cannot be transformed into a cluster graph with a single operation.

Reduction Rule 3.

If there are more than $6k$ vertices or $4k$ critical cliques left after applying Reduction Rules 1 and 2 exhaustively, then reduce the graph to a $C_4$ and set $k=1$.

Based on the previous three reduction rules, it is easy to derive a problem kernel with $6k$ vertices in linear time.

Theorem 3.

One can compute a kernel with at most $6k$ vertices and $4k$ critical cliques for Cluster Editing with Vertex Splitting in linear time.

Proof.

Computing the critical clique quotient graph $\mathcal{C}$ of $G$ takes linear time [25]. Applying Reduction Rule 1 exhaustively also takes linear time and this removes all isolated nodes in $\mathcal{C}$. Applying Reduction Rule 2 exhaustively takes linear time as shown next. First, we can sort all critical cliques by their size in linear time using bucket sort. Applying Reduction Rule 2 to a critical clique $C$ takes $\deg(C)$ time. By the handshaking lemma, this procedure takes time linear in the number of edges in $\mathcal{C}$, which is upper-bounded by the number of edges in the input graph. Note that if we iterate over the critical cliques in increasing order of size, then an application of Reduction Rule 2 can never affect previously considered critical cliques. This is due to the fact that an application of Reduction Rule 2 depends only on adjacent critical cliques. Since, whenever we reduce the size of a critical clique, we only reduce it to a size larger than all of its adjacent critical cliques, this can never lead to a situation where we can reduce a critical clique that was initially smaller than the current critical clique. Hence, applying Reduction Rule 2 exhaustively takes linear time. Afterwards, we can compute the critical clique quotient graph $\mathcal{C}'$ of the resulting graph $G'$ and count the number of vertices in $G'$ and $\mathcal{C}'$ in linear time. If the number of vertices in $G'$ is at most $6k$ and the number of vertices in $\mathcal{C}'$ is at most $4k$, then we are done. Otherwise, we apply Reduction Rule 3. This is correct by Lemma 6, can be performed in constant time, and the resulting graph has $4 \le 6k'$ vertices and $4 \le 4k'$ critical cliques, where $k' = 1$ is the newly set parameter. This concludes the proof. ∎
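The procedure in this proof can be sketched in code. The following is a simplified illustration, not the linear-time implementation from the proof: it assumes that critical cliques are the classes of vertices sharing the same closed neighborhood, it uses a comparison sort instead of bucket sort, and all function names are hypothetical.

```python
from collections import defaultdict


def graph_from_edges(edges):
    """Build a symmetric adjacency dictionary from a list of edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def critical_cliques(adj):
    """Group vertices by closed neighbourhood N[v] (assumed definition)."""
    groups = defaultdict(list)
    for v, nbrs in adj.items():
        groups[frozenset(nbrs | {v})].append(v)
    return sorted(sorted(g) for g in groups.values())


def apply_rules_1_and_2(adj):
    """Return the graph after Reduction Rules 1 and 2 (illustrative sketch)."""
    cliques = critical_cliques(adj)
    owner = {v: i for i, c in enumerate(cliques) for v in c}
    # Quotient graph: two critical cliques are adjacent if their vertices are.
    quotient = {i: set() for i in range(len(cliques))}
    for v, nbrs in adj.items():
        for u in nbrs:
            if owner[u] != owner[v]:
                quotient[owner[v]].add(owner[u])
    size = {i: len(c) for i, c in enumerate(cliques)}
    # Rule 1: an isolated clique is an isolated node of the quotient graph.
    alive = [i for i in quotient if quotient[i]]
    # Rule 2: visit cliques in increasing order of size and cap each size at
    # (total size of adjacent critical cliques) + 1.
    for i in sorted(alive, key=lambda i: size[i]):
        cap = sum(size[j] for j in quotient[i]) + 1
        if size[i] > cap:
            size[i] = cap
    keep = set()
    for i in alive:
        keep.update(cliques[i][:size[i]])
    return {v: adj[v] & keep for v in keep}
```

For example, on a graph consisting of a clique on five vertices whose members are all adjacent to a sixth vertex, which in turn has one pendant neighbor, plus a separate triangle, the sketch drops the triangle (Rule 1) and shrinks the five-vertex critical clique to two vertices (Rule 2). Comparing the resulting vertex count against $6k$ would then correspond to Reduction Rule 3.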

We leave it as an open problem whether the size of the kernel (especially the number of edges) can be improved and mention that there is a $2k$-vertex kernel for Cluster Editing [7].

7 An FPT algorithm

The result in Theorem 3 implies that Cluster Editing with Vertex Splitting is fixed-parameter tractable. By Lemma 3, we can assume that all vertices in the same critical clique belong to the same clique in a solution. It is easy to see that the final solution consists of at most $2k$ cliques and guessing these for each of the at most $4k$ critical cliques takes $O((2^{2k})^{4k}) = O(2^{8k^2})$ time. Checking whether a given solution can be reached in $k$ operations takes $O(k^2)$ time and combined with the time for computing the kernel, this results in a running time in $O(2^{8k^2} k^2 + n + m)$. This is, however, far from optimal as shown next.
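The exponent in this naive bound can be spelled out as follows, reading the guess as a choice, for each of the at most $4k$ critical cliques, of a subset of the at most $2k$ solution cliques (this reading is our interpretation of the counting argument):

```latex
\[
  \underbrace{2^{2k}}_{\text{subsets of } 2k \text{ cliques}}
  \text{ choices per critical clique}
  \quad\Longrightarrow\quad
  \left(2^{2k}\right)^{4k} = 2^{2k \cdot 4k} = 2^{8k^{2}}
  \text{ combined guesses.}
\]
```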

Theorem 4.

Cluster Editing with Vertex Splitting can be solved in $O(2^{9k\log k} + n + m)$ time.

Proof.

First, we compute the critical clique of each vertex and the critical clique quotient graph $\mathcal{C}$ of $G$ in linear time [25]. Next, we compute the kernel from Theorem 3 in linear time. Note that $\mathcal{C}$ contains at most $4k$ vertices. By Lemma 3, we can also assume that all vertices in a critical clique belong to the same clique in the graph $G_{|\sigma}$ reached after performing an optimal solution $\sigma$. Let $\mathcal{X} = \{S_1, S_2, \ldots, S_\ell\}$ be the set of cliques in $G_{|\sigma}$. Note that $\mathcal{X}$ contains $\ell \le 2k$ cliques as each operation can complete at most two cliques of the solution (removing an edge between two cliques or splitting a vertex contained in both cliques) and Reduction Rule 1 removed all isolated cliques (cliques that can be completed without an operation). Hence, if there are more than $2k$ cliques in the solution, then we cannot reach the solution with $k$ operations. To streamline the following argumentation, we will cover the nodes in $\mathcal{C}$ by cliques $S_1, S_2, \ldots, S_\ell$ with $\ell = 2k$ and assume that an optimal solution contains exactly $2k$ cliques by allowing some of the cliques to be empty. Next, we iterate over all possible colorings of the nodes in $\mathcal{C}$ using $\ell+1$ colors $0, 1, 2, \ldots, \ell$. Note that there are at most $(\ell+1)^{4k} \in O((2k+1)^{4k})$ such colorings.

The idea behind the coloring is the following. All colors $1, 2, \ldots, \ell$ will correspond to the cliques $S_1, S_2, \ldots, S_\ell$, that is, we try to find a solution where all (vertices in critical cliques corresponding to) nodes of the same color (except for color $0$) belong to the same clique in the solution. The color $0$ indicates that the node will belong to multiple cliques in the solution, that is, all vertices in the respective critical clique will be split. Since each such split operation reduces $k$ by one, we can reject any coloring in which the number of vertices in critical cliques corresponding to nodes with color $0$ is more than $k$. In particular, we can reject any coloring in which more than $k$ nodes have color $0$.
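The enumeration and rejection step described here can be sketched as follows. This is an illustration only: the function name `candidate_colorings` and the representation of the quotient graph as a list of critical clique sizes are assumptions made for the example.

```python
from itertools import product


def candidate_colorings(clique_sizes, k):
    """Yield colorings of the critical cliques with colors 0..2k.

    `clique_sizes[i]` is the number of vertices in the i-th critical clique.
    Color 0 marks a critical clique whose vertices are all split, so a
    coloring is rejected when more than k vertices receive color 0.
    """
    ell = 2 * k  # number of (possibly empty) solution cliques
    for coloring in product(range(ell + 1), repeat=len(clique_sizes)):
        split_vertices = sum(
            size for size, color in zip(clique_sizes, coloring) if color == 0
        )
        if split_vertices <= k:
            yield coloring
```

With two critical cliques of sizes 2 and 1 and $k = 1$, there are $3^2 = 9$ colorings in total, and the three that give color $0$ to the size-2 critical clique are rejected, leaving six candidates.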

Next, we guess two indices $i \in [k]$, $j \in [\ell]$ and assume that the $i$th node of color $0$ belongs to $S_j$, or we guess that all nodes of color $0$ have been assigned to all cliques they belong to. Note that in each iteration, there are $k\ell+1$ possibilities and since each guess reduces $k$ by at least one, is the last guess corresponding to one of the nodes of color $0$, or the last guess in general, we can make at most $2k+1$ guesses. Hence, there are at most $(k\ell+1)^{2k+1} = (2k^2+1)^{2k+1} \in O((2k+1)^{4k+1})$ such guesses.

It remains to compute the best solution corresponding to each possible set of guesses. To this end, we first iterate over each pair of vertices and add an edge between them if this edge does not already exist and we guessed that there is a clique $S_i$ which contains a copy of each of the two vertices. Moreover, we remove an existing edge between them if we guessed that the two vertices do not appear in a common clique. Finally, we perform all split operations. Therein, we iteratively split one vertex $v$ into two vertices $u_1$ and $u_2$ where $u_1$ will be the vertex in some clique $S_i$ and $u_2$ might be split further in the future. The vertex $u_1$ is adjacent to all vertices that are guessed to belong to $S_i$. The vertex $u_2$ is adjacent to all vertices that $v$ was adjacent to, except for vertices that are adjacent to $u_1$ and not guessed to also belong to some other clique $S_j$ which (some copy of) $u_2$ belongs to.

Since our algorithm basically performs an exhaustive search, it will find an optimal solution. It only remains to analyze the running time. We first compute the kernel in $O(n+m)$ time. We then try $O((2k+1)^{4k})$ possible colorings of $\mathcal{C}$ and for each coloring $O((2k+1)^{4k+1})$ guesses. Afterwards, we compute the solution in $O(k^2)$ time as $n \in O(k)$ by Reduction Rule 3. Thus, the overall running time is in $O((2k+1)^{8k+1} \cdot k^2 + n + m) \subseteq O(2^{9k\log k} + n + m)$. ∎
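One way to verify the final inclusion is sketched below, assuming that $\log$ denotes the binary logarithm and using $2k+1 \le 3k$ for $k \ge 1$; this derivation is ours and only spells out the asymptotic step.

```latex
\begin{align*}
(2k+1)^{8k+1} \cdot k^{2}
  &= 2^{(8k+1)\log(2k+1) + 2\log k} \\
  &\le 2^{(8k+1)(\log k + \log 3) + 2\log k} \\
  &= 2^{8k\log k \,+\, 8k\log 3 \,+\, 3\log k \,+\, \log 3}.
\end{align*}
```

The lower-order terms $8k\log 3 + 3\log k + \log 3$ are in $o(k \log k)$ and hence at most $k \log k$ for all sufficiently large $k$, so the exponent is at most $9k\log k$ from that point on; the finitely many smaller values of $k$ are absorbed by the constant hidden in the $O$-notation.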

We should note in passing that, while the constants in the running time of our algorithm can probably be improved, a completely new approach is required if one wants a single-exponential-time algorithm. This is due to the fact that the number of possible partitions of $O(k)$ critical cliques into clusters grows super-exponentially (roughly as fast as $k!$) even if no vertex-splitting operations are allowed.
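To give a feeling for this growth, the number of ways to partition a set of $n$ elements is the Bell number $B_n$. The short sketch below computes Bell numbers with the Bell triangle; it is an illustration of the counting claim only and plays no role in the algorithm (the function name `bell_numbers` is ours).

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_{n_max}], the numbers of set partitions.

    Uses the Bell triangle: each row starts with the last entry of the
    previous row, and every further entry adds the entry above-left.
    """
    bells = [1]
    row = [1]
    for _ in range(n_max):
        new_row = [row[-1]]
        for above in row:
            new_row.append(new_row[-1] + above)
        row = new_row
        bells.append(row[0])
    return bells
```

Already for ten elements there are 115975 partitions, compared with $2^{10} = 1024$ subsets.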

8 Conclusion

By allowing a vertex to split into two vertices, we extend the notion of Cluster Editing in an attempt to better model clustering problems where a data element may have roles in more than one cluster. On the one hand, we show that this new problem, which we call Cluster Editing with Vertex Splitting, is $\mathcal{NP}$-complete and, assuming the ETH, there are no $2^{o(n+m)}$-time or $2^{o(k)} \cdot \text{poly}(n)$-time algorithms for it. On the other hand, we give a $6k$-vertex kernel and an $O(2^{9k\log k} + n + m)$-time algorithm. This leaves the following gaps.

Open Problem 1.

Does there exist a $2^{O(k)} \cdot \text{poly}(n)$-time algorithm for Cluster Editing with Vertex Splitting?

Open Problem 2.

Does there exist a linear-size kernel, that is, a kernel in which the number of vertices plus the number of edges is in $O(k)$?

However, even resolving these questions should only be seen as a starting point for a much larger undertaking. While we do understand the parameterized complexity of Cluster Editing with Vertex Splitting with respect to $k$ reasonably well, there are still a lot of open questions regarding structural parameters of the input graph. Future work may also consider a bound on the number of allowed edge edits incident to each vertex as used by Komusiewicz and Uhlmann [23] and Abu-Khzam [1]. Moreover, one might also study the approximability of Cluster Editing with Vertex Splitting as the trivial constant-factor approximation of Cluster Editing does not carry over if we allow vertex splitting. In case Cluster Editing with Vertex Splitting turns out to be hard to approximate, one might then continue with studying FPT-approximation (schemes) and approximation kernels (also known as lossy kernels).

The vertex splitting operation may also be applicable to other classes of target graphs, including bipartite graphs, chordal graphs, comparability graphs, perfect graphs, or disjoint unions of graphs like complete bipartite graphs (bi-clusters), $s$-cliques, $s$-clubs, $s$-plexes, $k$-cores, or $\gamma$-quasi-cliques. We note that the results in Section 4 are directly applicable to these settings as well. Especially Bicluster Editing has received significant attention recently [13, 20, 30, 31]. To the best of our knowledge, nothing is known about Bicluster Editing with Vertex Splitting. We should also note that there are two natural versions in the bipartite case and both of them seem worth studying. The two versions differ in whether or not one requires that all copies of a split vertex lie on the same side of a bipartition in a solution. On the one hand, the additional requirement makes sense if the data is inherently bipartite. This happens for example if each vertex represents either a researcher or a paper and an edge represents an authorship. On the other hand, if edges reflect something like a seller-buyer relationship, then it is plausible that a person both sells and buys.

Finally, we believe that it also makes sense to study a variant of the vertex-splitting operation where the neighborhoods of the two newly introduced vertices form a partition of the neighborhood of the split vertex rather than a covering. That is, when we split a vertex $v$ into $v_1$ and $v_2$, instead of allowing a neighbor $w$ of $v$ to be adjacent to both $v_1$ and $v_2$, we require that $w$ must be adjacent to exactly one of the two. This operation is called exclusive vertex splitting in the literature, and can be seen to be closely related to the Clique Partitioning problem.
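The difference between the two splitting operations can be made concrete with a small sketch. It assumes a graph stored as a dictionary of neighbor sets and that the two new vertices are not adjacent to each other; the function `split_vertex` and its parameters are hypothetical names chosen for this illustration.

```python
def split_vertex(adj, v, nbrs1, nbrs2, v1, v2, exclusive=False):
    """Replace v by v1 and v2 with neighbourhoods nbrs1 and nbrs2.

    The two neighbourhoods must cover N(v). With exclusive=True they must
    also be disjoint, i.e. form a partition of N(v).
    Returns a new adjacency dictionary; the input is left unchanged.
    """
    nbrs1, nbrs2 = set(nbrs1), set(nbrs2)
    if nbrs1 | nbrs2 != adj[v]:
        raise ValueError("the new neighbourhoods must cover N(v) exactly")
    if exclusive and nbrs1 & nbrs2:
        raise ValueError("exclusive splitting requires disjoint neighbourhoods")
    new_adj = {u: set(nb) - {v} for u, nb in adj.items() if u != v}
    new_adj[v1] = set(nbrs1)
    new_adj[v2] = set(nbrs2)
    for u in nbrs1:
        new_adj[u].add(v1)
    for u in nbrs2:
        new_adj[u].add(v2)
    return new_adj
```

For instance, splitting the shared vertex of two triangles into one copy per triangle is valid under both variants, whereas a split in which some neighbor is kept adjacent to both copies is accepted only by the covering variant.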

Acknowledgments

Matthias Bentert is supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 819416). Pål Grønås Drange was partially supported by the Research Council of Norway, grant Parameterized Complexity for Practical Computing (PCPC) (NFR, no. 274526). Serge Gaspers is the recipient of an Australian Research Council (ARC) Future Fellowship (FT140100048) and acknowledges support under the ARC’s Discovery Projects funding scheme (DP150101134). Alexis Shaw is the recipient of an Australian Government Research Training Program Scholarship.

References

  • Abu-Khzam [2017] F. N. Abu-Khzam. On the complexity of multi-parameterized cluster editing. Journal of Discrete Algorithms, 45:26–34, 2017.
  • Abu-Khzam et al. [2005] F. N. Abu-Khzam, N. E. Baldwin, M. A. Langston, and N. F. Samatova. On the relative efficiency of maximal clique enumeration algorithms, with applications to high-throughput computational biology. In Proceedings of the 2005 International Conference on Research Trends in Science and Technology, pages 1–10, 2005.
  • Abu-Khzam et al. [2018] F. N. Abu-Khzam, J. Egan, S. Gaspers, A. Shaw, and P. Shaw. Cluster editing with vertex splitting. In International Symposium on Combinatorial Optimization, pages 1–13. Springer, 2018.
  • Arrighi et al. [2023] E. Arrighi, M. Bentert, P. G. Drange, B. D. Sullivan, and P. Wolf. Cluster editing with overlapping communities. In Proceedings of the 18th International Symposium on Parameterized and Exact Computation (IPEC ’23). Schloss Dagstuhl — Leibniz-Zentrum für Informatik, 2023. To appear.
  • Böcker [2012] S. Böcker. A golden ratio parameterized algorithm for cluster editing. Journal of Discrete Algorithms, 16:79–89, 2012.
  • Cai [1996] L. Cai. Fixed-parameter tractability of graph modification problems for hereditary properties. Information Processing Letters, 58(4):171–176, 1996.
  • Chen and Meng [2012] J. Chen and J. Meng. A 2k kernel for the cluster editing problem. Journal of Computer and System Sciences, 78(1):211–220, 2012.
  • Chen et al. [2006] J. Chen, X. Huang, I. A. Kanj, and G. Xia. Strong computational lower bounds via parameterized complexity. Journal of Computer and System Sciences, 72(8):1346 – 1367, 2006.
  • Cygan et al. [2015] M. Cygan, F. V. Fomin, L. Kowalik, D. Lokshtanov, D. Marx, M. Pilipczuk, M. Pilipczuk, and S. Saurabh. Parameterized Algorithms. Springer, 2015.
  • D’Addario et al. [2014] M. D’Addario, D. Kopczynski, J. Baumbach, and S. Rahmann. A modular computational framework for automated peak extraction from ion mobility spectra. BMC Bioinformatics, 15(1):25, 2014.
  • Dehne et al. [2006] F. Dehne, M. A. Langston, X. Luo, S. Pitre, P. Shaw, and Y. Zhang. The cluster editing problem: Implementations and experiments. In Proceedings of the 2nd International Workshop on Parameterized and Exact Computation (IWPEC ’06), pages 13–24. Springer Berlin Heidelberg, 2006.
  • Diestel [2005] R. Diestel. Graph Theory. Springer, 2005.
  • Drange et al. [2015] P. G. Drange, F. Reidl, F. S. Villaamil, and S. Sikdar. Fast biclustering by dual parameterization. In Proceedings of the 10th International Symposium on Parameterized and Exact Computation (IPEC ’15), pages 402–413. Schloss Dagstuhl — Leibniz-Zentrum für Informatik, 2015.
  • Fadiel et al. [2006] A. Fadiel, M. A. Langston, X. Peng, A. D. Perkins, H. S. Taylor, O. Tuncalp, D. Vitello, P. H. Pevsner, and F. Naftolin. Computational analysis of mass spectrometry data using novel combinatorial methods. In Proceedings of the 2006 IEEE/ACS International Conference on Computer Systems and Applications (AICCSA ’06), pages 266–273. IEEE Computer Society, 2006.
  • Fellows et al. [2007] M. Fellows, M. Langston, F. Rosamond, and P. Shaw. Efficient parameterized preprocessing for cluster editing. In Proceedings of the 16th International Symposium on Fundamentals of Computation Theory (FCT ’07), pages 312–321. Springer, 2007.
  • Firbas et al. [2023] A. Firbas, A. Dobler, F. Holzer, J. Schafellner, M. Sorge, A. Villedieu, and M. Wißmann. The complexity of cluster vertex splitting and company. CoRR, abs/2309.00504, 2023.
  • Flum and Grohe [2006] J. Flum and M. Grohe. Parameterized Complexity Theory. Springer, 2006.
  • Gramm et al. [2005] J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Graph-modeled data clustering: Exact algorithms for clique generation. Theory of Computing Systems, 38(4):373–392, 2005.
  • Guo [2009] J. Guo. A more effective linear kernelization for cluster editing. Theoretical Computer Science, 410(8-10):718 – 726, 2009.
  • Guo et al. [2008] J. Guo, F. Hüffner, C. Komusiewicz, and Y. Zhang. Improved algorithms for bicluster editing. In Proceedings of the 5th International Conference on Theory and Applications of Models of Computation (TAMC ’08), pages 445–456. Springer, 2008.
  • Hüffner et al. [2010] F. Hüffner, C. Komusiewicz, H. Moser, and R. Niedermeier. Fixed-parameter algorithms for cluster vertex deletion. Theory of Computing Systems, 47(1):196–217, 2010.
  • Impagliazzo et al. [2001] R. Impagliazzo, R. Paturi, and F. Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63(4):512–530, 2001.
  • Komusiewicz and Uhlmann [2012] C. Komusiewicz and J. Uhlmann. Cluster editing with locally bounded modifications. Discrete Applied Mathematics, 160(15):2259–2270, 2012.
  • Křivánek and Morávek [1986] M. Křivánek and J. Morávek. NP-hard problems in hierarchical-tree clustering. Acta Informatica, 23(3):311–323, 1986.
  • Lin et al. [2000] G.-H. Lin, T. Jiang, and P. E. Kearney. Phylogenetic k-root and Steiner k-root. In Proceedings of the 11th International Symposium on Algorithms and Computation (ISAAC ’00), pages 539–551. Springer, 2000.
  • Niedermeier [2006] R. Niedermeier. An Invitation to Fixed-Parameter Algorithms. Oxford University Press, 2006.
  • Radovanović et al. [2010] M. Radovanović, A. Nanopoulos, and M. Ivanović. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11:2487–2531, 2010.
  • Tomasev et al. [2014] N. Tomasev, M. Radovanovic, D. Mladenic, and M. Ivanovic. The role of hubness in clustering high-dimensional data. IEEE Transactions on Knowledge and Data Engineering, 26(3):739–751, 2014.
  • Tovey [1984] C. A. Tovey. A simplified NP-complete satisfiability problem. Discrete Applied Mathematics, 8(1):85–89, 1984.
  • Tsur [2021] D. Tsur. Faster parameterized algorithm for bicluster editing. Information Processing Letters, 168, 2021.
  • Xiao and Kou [2022] M. Xiao and S. Kou. A simple and improved parameterized algorithm for bicluster editing. Information Processing Letters, 2022.