

Local Hyper-Flow Diffusion

Kimon Fountoulakis, School of Computer Science, University of Waterloo, Waterloo, ON, Canada. Email: kfountou@uwaterloo.ca

Pan Li, Department of Computer Science, Purdue University, West Lafayette, IN, USA. Email: panli@purdue.edu

Shenghao Yang, School of Computer Science, University of Waterloo, Waterloo, ON, Canada. Email: shenghao.yang@uwaterloo.ca
Abstract

Recently, hypergraphs have attracted a lot of attention due to their ability to capture complex relations among entities. The surge of interest in hypergraphs has resulted in data of increasing size and complexity that exhibit interesting small-scale and local structure, e.g., small-scale communities and localized node-ranking around a given set of seed nodes. Popular and principled ways to capture the local structure are the local hypergraph clustering problem and the related seed set expansion problem. In this work, we propose the first local diffusion method that achieves an edge-size-independent Cheeger-type guarantee for the problem of local hypergraph clustering while applying to a rich class of higher-order relations that covers many previously studied special cases. Our method is based on a primal-dual optimization formulation where the primal problem has a natural network flow interpretation, and the dual problem has a cut-based interpretation using the $\ell_2$-norm penalty on associated cut-costs. We demonstrate that the new technique is significantly better than state-of-the-art methods on both synthetic and real-world data.

1 Introduction

Hypergraphs [8] generalize graphs by allowing a hyperedge to consist of multiple nodes, thereby capturing higher-order relationships in complex systems and datasets [36]. Hypergraphs have been used for music recommendation on Last.fm data [10], news recommendation [29], sets of product reviews on Amazon [37], and sets of co-purchased products at Walmart [1]. Beyond internet applications, hypergraphs are used for analyzing higher-order structure in neuronal, air-traffic and food networks [6, 30].

In order to explore and understand higher-order relationships in hypergraphs, recent work has made use of cut-cost functions that are defined by associating each hyperedge with a specific set function. These functions assign specific penalties for separating the nodes within individual hyperedges. They generalize the notion of hypergraph cuts and are crucial for determining small-scale community structure [30, 33]. The most popular cut-cost functions, with increasing capability to model complex multiway relationships, are the unit cut-cost [28, 27, 23], the cardinality-based cut-cost [43, 44] and the general submodular cut-cost [31, 47]. An illustration of a hyperedge and the associated cut-cost function is given in Figure 1. In the simplest setting, where all cut-cost functions take value either 0 or 1 (e.g., the case when $\gamma_1=\gamma_2=1$ in Figure 1(b)), we obtain a unit cut-cost hypergraph. In a slightly more general setting, where the cut-costs are determined solely by the number of nodes on either side of the hyperedge cut (e.g., the case when $\gamma_1=1/2$ and $\gamma_2=1$ in Figure 1(b)), we obtain a cardinality-based hypergraph. We refer to hypergraphs associated with arbitrary submodular cut-cost functions (e.g., the case when $\gamma_1=1/2$ and $0\leq\gamma_2\leq 1$ in Figure 1(b)) as general submodular hypergraphs.

Hypergraphs that arise from data science applications contain interesting small-scale local structure such as local communities [33, 42]. Exploiting this structure is central to the above-mentioned applications on hypergraphs and related applications in machine learning and applied mathematics [7]. We consider local hypergraph clustering as the task of finding a community-like cluster around a given set of seed nodes, where nodes in the cluster are densely connected to each other while relatively isolated from the rest of the graph. One of the most powerful primitives for the local hypergraph clustering task is graph diffusion, the process of spreading a given initial mass from some seed node(s) to neighbor nodes using the edges of the graph. Graph diffusions have been successfully employed in industry: both Pinterest and Twitter use diffusion methods for their recommendation systems [16, 17, 22], and Google uses diffusion methods to cluster query refinements [40]. Let us not forget PageRank [9, 39], the model behind Google's search engine.

Figure 1: (a) Network motif. (b) Hyperedge and cut-cost $w_e$. A food network can be mapped into a hypergraph by taking each network pattern in (a) as a hyperedge [30]. This network pattern captures carbon flow from two preys ($v_1,v_2$) to two predators ($v_3,v_4$). (b) is a hyperedge associated with cut-cost $w_e$ that models their relations: $w_e$ is a set function defined over the node set $e$ s.t. $w_e(\{v_i\})=\gamma_1$ for $i=1,2,3,4$, $w_e(\{v_1,v_2\})=\gamma_2$, $w_e(\{v_1,v_3\})=w_e(\{v_1,v_4\})=1$ and $w_e(S)=w_e(e\backslash S)$ for $S\subseteq e$. $w_e$ becomes the unit cut-cost when $\gamma_1=\gamma_2=1$; $w_e$ is cardinality-based if $\gamma_1=1/2$ and $\gamma_2=1$; more generally, $w_e$ is submodular if $\gamma_1=1/2$ and $0\leq\gamma_2\leq 1$. The specific choices depend on the application.

Empirical and theoretical performance of local diffusion methods is often measured on the problem of local hypergraph clustering [45, 20, 33]. Existing local diffusion methods directly apply only to hypergraphs with the unit cut-cost [26, 42]. For the slightly more general cardinality-based cut-cost, they rely on graph reduction techniques that result in a rather pessimistic edge-size-dependent approximation error [46, 26, 33, 43]. Moreover, none of the existing methods is capable of handling general submodular cut-costs. In this work, we are interested in designing a diffusion framework that (i) achieves stronger theoretical guarantees for the problem of local hypergraph clustering, (ii) is flexible enough to work with general submodular hypergraphs, and (iii) permits computationally efficient algorithms. We propose the first local diffusion method that simultaneously accomplishes all three goals.

In what follows we describe our main contributions and previous work. In Section 2 we provide preliminaries and notation. In Section 3 we introduce our diffusion model from a combinatorial flow perspective. In Section 4 we discuss the local hypergraph clustering problem and the Cheeger-type quadratic approximation error. In Section 5 we present an efficient optimization algorithm for our diffusion problem. In Section 6 we perform experiments using both synthetic and real datasets.

1.1 Our main contributions

In this work we propose a generic local diffusion model that applies to hypergraphs characterized by a rich class of cut-cost functions covering many previously studied special cases, e.g., unit, cardinality-based and submodular cut-costs. We provide the first edge-size-independent Cheeger-type approximation error for the problem of local hypergraph clustering using any of these cut-costs. In particular, assume that there exists a cluster $C$ with conductance $\Phi(C)$, and assume that we are given a set of seed nodes that reasonably overlaps with $C$; then the proposed diffusion model can be used to find a cluster $\hat{C}$ with conductance at most $O(\sqrt{\Phi(C)})$ (in the appendix we show that an $\ell_p$-norm version of the proposed model can achieve $O(\Phi(C))$ asymptotically). Our hypergraph diffusion model is formulated as a convex optimization problem. It has a natural combinatorial flow interpretation that generalizes the notion of network flows over hyperedges. We show that the optimization problem can be solved efficiently by an alternating minimization method. In addition, we prove that the number of nonzero nodes in the optimal solution is independent of the size of the hypergraph and depends only on the amount of initial mass. This key property ensures that our algorithm scales well in practice for large datasets. We evaluate our method using both synthetic and real-world data and show that it improves local clustering accuracy significantly for hypergraphs with unit, cardinality-based and general submodular cut-costs.

1.2 Previous work

Recently, clustering methods on hypergraphs have received renewed interest. Different methods require different assumptions about the hyperedge cut-cost, which can be roughly categorized into unit cut-cost, cardinality-based cut-cost and general submodular cut-cost. Moreover, existing methods can be either global, where the output is not localized around a given set of seed nodes, or local, where the output is a small cluster around a set of seed nodes. Local algorithms are the only scalable ones for large hypergraphs, which is our main focus. Many works propose global methods and thus are not scalable to large hypergraphs [49, 3, 25, 34, 6, 48, 11, 30, 31, 12, 47, 32, 13, 24, 42]. Local diffusion-based methods are more relevant to our work [26, 33, 43]. In particular, iterative hypergraph min-cut methods can be adopted for the local hypergraph clustering problem [43]. However, these methods require, in theory and in practice, a large seed set, i.e., they are not expansive and thus cannot work with a single seed node. The expansive combinatorial diffusion [45] has been generalized to hypergraphs [26] and can detect a target cluster using only one seed node. However, combinatorial methods have a large bias towards low conductance clusters as opposed to finding the target cluster [18]. The most relevant work to ours is [33]. However, the methods proposed in [33] depend on a reduction from hypergraphs to directed graphs. This results in an approximation error for clustering that is proportional to the size of the hyperedges and induces performance degradation when the hyperedges are large. In fact, none of the above approaches (global or local) has an edge-size-independent approximation error bound even for simple cardinality-based hypergraphs. Moreover, existing local approaches do not work for general submodular hypergraphs.

2 Preliminaries and Notations

Submodular function. Given a set $S$, we denote by $2^S$ the power set of $S$ and by $|S|$ the cardinality of $S$. A submodular function $F:2^S\rightarrow\mathbb{R}$ is a set function such that $F(A)+F(B)\geq F(A\cup B)+F(A\cap B)$ for any $A,B\subseteq S$.

Submodular hypergraph. A hypergraph $H=(V,E)$ is defined by a set of nodes $V$ and a set of hyperedges $E\subseteq 2^V$, i.e., each hyperedge $e\in E$ is a subset of $V$. A hypergraph is termed submodular if every $e\in E$ is associated with a submodular function $w_e:2^e\rightarrow\mathbb{R}_{\geq 0}$ [31]. The weight $w_e(S)$ indicates the cut-cost of splitting the hyperedge $e$ into two subsets, $S$ and $e\setminus S$. This general form allows us to describe the potentially complex higher-order relation among multiple nodes (Figure 1). A proper hyperedge weight $w_e$ should satisfy $w_e(\emptyset)=w_e(e)=0$. To ease notation we extend the domain of $w_e$ to $2^V$ by setting $w_e(S):=w_e(S\cap e)$ for any $S\subseteq V$. We assume without loss of generality that $w_e$ is normalized by $\vartheta_e:=\max_{S\subseteq e}w_e(S)$, so that $w_e(S)\in[0,1]$ for any $S\subseteq V$. For simplicity of presentation, we assume that $\vartheta_e=1$ for all $e$; this is without loss of generality, and in the appendix we show that our method works with arbitrary $\vartheta_e>0$. A submodular hypergraph is written as $H=(V,E,\mathcal{W})$ where $\mathcal{W}:=\{w_e,\vartheta_e\}_{e\in E}$. Note that when $w_e(S)=1$ for any $S\in 2^e\backslash\{\emptyset,e\}$, the definition reduces to unit cut-cost hypergraphs. When $w_e(S)$ depends only on $|S|$, it reduces to cardinality-based cut-cost hypergraphs.
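To make the definitions concrete, the following minimal Python sketch (our own illustration, not taken from the paper's released code; node labels 1-4 stand for $v_1,\ldots,v_4$) encodes the Figure 1 cut-cost and checks submodularity by brute force:

```python
from itertools import combinations

def make_cut_cost(gamma1, gamma2):
    """Cut-cost w_e of the Figure 1 hyperedge e = {v1, v2, v3, v4},
    with v1, v2 the preys and v3, v4 the predators."""
    e = frozenset([1, 2, 3, 4])
    def w(S):
        S = frozenset(S) & e                 # extend w_e to 2^V via S -> S ∩ e
        if len(S) > 2:
            S = e - S                        # symmetry: w_e(S) = w_e(e \ S)
        if len(S) == 0:
            return 0.0
        if len(S) == 1:
            return gamma1
        return gamma2 if S in (frozenset([1, 2]), frozenset([3, 4])) else 1.0
    return w

def is_submodular(w, e):
    subsets = [frozenset(c) for k in range(len(e) + 1) for c in combinations(e, k)]
    return all(w(A) + w(B) >= w(A | B) + w(A & B) for A in subsets for B in subsets)

w = make_cut_cost(gamma1=0.5, gamma2=0.0)         # the submodular setting of Figure 1
print(is_submodular(w, frozenset([1, 2, 3, 4])))  # True
print(w({1, 2}), w({1, 3}))                       # 0.0 1.0
```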

Vector/Function on $V$ or $E$. For a set of nodes $S\subseteq V$, we denote by $\mathit{1}_S$ the indicator vector of $S$, i.e., $[\mathit{1}_S]_v=1$ if $v\in S$ and 0 otherwise. For a vector $x\in\mathbb{R}^{|V|}$, we write $x(S):=\sum_{v\in S}x_v$, where $x_v$ is the entry in $x$ that corresponds to $v\in V$. We define the support of $x$ as $\textnormal{supp}(x):=\{v\in V\,|\,x_v\neq 0\}$. The support of a vector in $\mathbb{R}^{|E|}$ is defined analogously. We refer to a function over nodes $x:V\rightarrow\mathbb{R}$ and its explicit representation as a $|V|$-dimensional vector interchangeably.

Volume, cut, conductance. Given a submodular hypergraph $H=(V,E,\mathcal{W})$, the degree of a node $v$ is defined as $d_v:=|\{e\in E:v\in e\}|$. We reserve $d$ for the vector of node degrees and set $D=\mbox{diag}(d)$. We refer to $\textnormal{vol}(S):=d(S)$ as the volume of $S\subseteq V$. A cut is treated as a proper subset $C\subset V$, or a partition $(C,\bar{C})$ where $\bar{C}:=V\setminus C$. The cut-set of $C$ is defined as $\partial C:=\{e\in E\,|\,e\cap C\neq\emptyset,\,e\cap\bar{C}\neq\emptyset\}$; the cut-size of $C$ is defined as $\textnormal{vol}(\partial C):=\sum_{e\in\partial C}\vartheta_e w_e(C)=\sum_{e\in E}\vartheta_e w_e(C)$. The conductance of a cut $C$ in $H$ is $\Phi(C):=\frac{\textnormal{vol}(\partial C)}{\min\{\textnormal{vol}(C),\textnormal{vol}(V\setminus C)\}}$.
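As a concrete illustration of these definitions, here is a small Python sketch (our own toy example, not from the paper) computing degrees, volumes and the conductance of a cut in a unit cut-cost hypergraph with $\vartheta_e=1$:

```python
def degree(v, E):
    # d_v = number of hyperedges containing v (vartheta_e = 1 for all e)
    return sum(1 for e in E if v in e)

def vol(S, E):
    return sum(degree(v, E) for v in S)

def conductance(C, V, E):
    # unit cut-cost: w_e(C) = 1 exactly when e is cut by the partition (C, V \ C)
    cut_size = sum(1 for e in E if e & C and e - C)
    return cut_size / min(vol(C, E), vol(V - C, E))

V = {1, 2, 3, 4, 5}
E = [{1, 2, 3}, {3, 4}, {4, 5}, {1, 2}]
print(conductance({1, 2, 3}, V, E))   # one cut hyperedge {3,4}; 1 / min(6, 3) = 1/3
```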

Flow. A flow routing over a hyperedge $e$ is a function $r_e:e\rightarrow\mathbb{R}$ where $r_e(v)$ specifies the amount of mass that flows from $\{v\}$ to $e\setminus\{v\}$ over $e$. To ease notation we extend the domain of $r_e$ to $V$ by setting $r_e(v)=0$ for $v\not\in e$, so $r_e$ is treated as a function $r_e:V\rightarrow\mathbb{R}$ or equivalently a $|V|$-dimensional vector. The net (out)flow at a node $v$ is given by $\sum_{e\in E}r_e(v)$. Given a routing function $r_e$ and a set of nodes $S\subseteq V$, a directional routing on $e$ with direction $S\rightarrow e\setminus S$ is represented by $r_e(S)$, which specifies the net amount of mass that flows from $S$ to $e\setminus S$. A routing $r_e$ is called proper if it obeys flow conservation, i.e., $r_e^T\mathit{1}_e=0$. Our flow definition generalizes the notion of network flows to hypergraphs. We provide concrete illustrations in Figure 2.

Figure 2: Illustration of proper flow routings: (a) flows on a graph, (b) hyperedge routing, (c) flows on a hypergraph. The numbers next to each node correspond to entries in the flow routing $r_e$ over a (hyper)edge $e$. We assign the same color to a (hyper)edge and its associated flow values. Our flow definition is a natural generalization of graph edge flow where $r_e(v)=\pm f$ if and only if $v\in e$, i.e., $v$ is incident to $e$, where $f$ and the sign determine the amplitude and direction of the flow over $e$. In Figure 2(a), the net (out)flow at node $v_3$ is given by $\sum_{e\in E}r_e(v_3)=1-2-3=-4$. In Figure 2(b), the directional flow from $\{v_1\}$ to $\{v_2,v_3,v_4,v_5\}$ over this hyperedge equals $-1$; similarly, the directional flow from $\{v_1,v_2,v_4\}$ to $\{v_3,v_5\}$ equals $3+2-1=4$, etc. In Figure 2(c), the net (out)flow at node $v_3$ is given by $\sum_{e\in E}r_e(v_3)=3-2=1$.
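The following sketch checks the Figure 2(b) routing numerically; the entries for $v_3$ and $v_5$ are our own assumption, chosen to be consistent with the directional flows quoted in the caption and with flow conservation:

```python
# Flow routing over the hyperedge e = {v1,...,v5} of Figure 2(b).
# r_e(v1) = -1, r_e(v2) = 3, r_e(v4) = 2 follow from the caption; the values
# for v3 and v5 below are assumed (any pair summing to -4 preserves conservation).
r_e = {1: -1, 2: 3, 3: -3, 4: 2, 5: -1}

assert sum(r_e.values()) == 0            # proper routing: r_e^T 1_e = 0

def directional_flow(S, r_e):
    # r_e(S): net mass moving from S to e \ S over this hyperedge
    return sum(r_e[v] for v in S)

print(directional_flow({1}, r_e))        # -1: flow from {v1} to {v2,...,v5}
print(directional_flow({1, 2, 4}, r_e))  # 3 + 2 - 1 = 4: from {v1,v2,v4} to {v3,v5}
```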

3 Diffusion as an Optimization Problem

In this section we provide details of the proposed local diffusion method. We view a diffusion process as the task of spreading mass from a small set of seed nodes to a larger set of nodes. More precisely, given a hypergraph $H=(V,E,\mathcal{W})$, we assign each node a sink capacity specified by a sink function $T$, i.e., node $v$ is allowed to hold at most $T(v)$ amount of mass. In this work we focus on the setting where $T(v)=d_v$, so that a high-degree node that is part of many hyperedges can hold more mass than a low-degree node that is part of few hyperedges. Moreover, we assign each node some initial mass specified by a source function $\Delta$, i.e., node $v$ holds $\Delta(v)$ amount of mass at the start of the diffusion. In order to encourage the spread of mass in the hypergraph, the initial mass on the seed nodes is made larger than their capacity. This forces the seed nodes to diffuse mass to neighbor nodes to remove their excess mass. In Section 4, where we treat the local hypergraph clustering problem in detail, we discuss the choice of $\Delta$ that yields good theoretical guarantees.

Given a set of proper flow routings $r_e$ for $e\in E$, recall that $\sum_{e\in E}r_e(v)$ specifies the amount of net (out)flow at node $v$. Therefore, the vector $m=\Delta-\sum_{e\in E}r_e$ gives the amount of net mass at each node after routing. The excess mass at a node $v$ is $\textnormal{ex}(v):=\max\{m_v-d_v,0\}$. In order to force the diffusion of initial mass we could simply require that $\textnormal{ex}(v)=0$ for all $v\in V$, or equivalently, $\Delta-\sum_{e\in E}r_e\leq d$. But to provide additional flexibility in the diffusion dynamics, we introduce a hyper-parameter $\sigma\geq 0$ and impose a softer constraint $\Delta-\sum_{e\in E}r_e\leq d+\sigma Dz$, where $z\in\mathbb{R}^{|V|}$ is an optimization variable that controls how much excess mass is allowed on each node. In the context of numerical optimization, we show in Section 5 that $\sigma$ allows a reformulation which makes the optimization problem amenable to efficient alternating minimization schemes.

Note that so far we have not discussed how specific higher-order relations among nodes within a hyperedge affect the flow routings over it. Clearly, simply requiring that the $r_e$'s obey flow conservation (i.e., $r_e^T\mathit{1}_e=0$), as in the standard graph setting, is not enough for hypergraphs. An important difference between hyperedge flows and graph edge flows is that additional constraints on $r_e$ are needed. To this end, we consider $r_e=\phi_e\rho_e$ for some $\phi_e\in\mathbb{R}_+$ and $\rho_e\in B_e$, where

B_e:=\{\rho_e\in\mathbb{R}^{|V|}~|~\rho_e(S)\leq w_e(S),\ \forall S\subseteq V,\ \mbox{and}\ \rho_e(V)=w_e(V)\}

is the base polytope [4] of the submodular cut-cost $w_e$ associated with hyperedge $e$. It is straightforward to see that $r_e(v)=0$ for every $v\not\in e$ and $r_e^T\mathit{1}_e=0$, so $r_e$ defines a proper flow routing over $e$. Moreover, for any $S\subseteq V$, recall that $r_e(S)$ represents the net amount of mass that moves from $S$ to $e\setminus S$ over hyperedge $e$. Therefore, the constraints $\rho_e(S)\leq w_e(S)$ for $S\subseteq e$ mean that the directional flows $r_e(S)$ are upper bounded by the submodular function $\phi_e w_e(S)$. Intuitively, one may think of $\phi_e$ and $\rho_e$ as the scale and the shape of $r_e$, respectively.

The goal of our diffusion problem is to find low cost routings $r_e\in\phi_e B_e$ for $e\in E$ such that the capacity constraint $\Delta-\sum_{e\in E}r_e\leq d+\sigma Dz$ is satisfied. We consider the (weighted) $\ell_2$-norm of $\phi$ and $z$ as the cost of diffusion. In the appendix we show that the $\ell_2$-norm readily extends to the $\ell_p$-norm for any $p\geq 2$. Formally, we arrive at the following convex optimization formulation (input: the source function $\Delta$, the hypergraph $H=(V,E,\mathcal{W})$, and a hyper-parameter $\sigma$):

\min_{\phi\in\mathbb{R}^{|E|}_{+},\,z\in\mathbb{R}^{|V|}_{+}}~\frac{1}{2}\sum_{e\in E}\phi_{e}^{2}+\frac{\sigma}{2}\sum_{v\in V}d_{v}z_{v}^{2},\quad\mbox{s.t.}\quad\Delta-\sum_{e\in E}r_{e}\leq d+\sigma Dz,\quad r_{e}\in\phi_{e}B_{e},\ \forall e\in E. (1)

We name problem (1) Hyper-Flow Diffusion (HFD) for the combinatorial flow interpretation discussed above. The dual problem of (1) is:

\max_{x\in\mathbb{R}^{|V|}_{+}}~(\Delta-d)^{T}x-\frac{1}{2}\sum_{e\in E}f_{e}(x)^{2}-\frac{\sigma}{2}\sum_{v\in V}d_{v}x_{v}^{2}, (2)

where $f_e$ in (2) is the support function of the base polytope $B_e$, given by $f_e(x):=\max_{\rho_e\in B_e}\rho_e^T x$. $f_e$ is also known as the Lovász extension of the submodular function $w_e$.
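Since $f_e$ drives the dual objective, it is worth noting that it can be evaluated exactly by Edmonds' greedy algorithm, which also returns the maximizing $\rho_e\in B_e$. A minimal sketch (our own, assuming a submodular $w_e$ with $w_e(\emptyset)=0$ given as a Python callable on sets):

```python
def lovasz_extension(x, e, w):
    """Evaluate f_e(x) = max_{rho in B_e} rho^T x by Edmonds' greedy algorithm.
    Nodes outside e are ignored, matching the extension w_e(S) = w_e(S ∩ e)."""
    order = sorted(e, key=lambda v: x[v], reverse=True)  # sort e by node height
    rho, f, prev = {}, 0.0, frozenset()
    for v in order:
        cur = prev | {v}
        rho[v] = w(cur) - w(prev)    # greedy marginal gain: a vertex of B_e
        f += rho[v] * x[v]
        prev = cur
    return f, rho

# For the unit cut-cost, f_e(x) = max_{v in e} x_v - min_{v in e} x_v:
e = frozenset([1, 2, 3, 4])
w_unit = lambda S: 1.0 if 0 < len(S & e) < len(e) else 0.0
x = {1: 0.9, 2: 0.7, 3: 0.2, 4: 0.0}
print(lovasz_extension(x, e, w_unit)[0])   # 0.9 - 0.0 = 0.9
```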

We provide a combinatorial interpretation for (2) and leave the algebraic derivations to the appendix. For the dual problem, one can view the solution $x$ as assigning heights to nodes, and the goal is to separate/cut the nodes with source mass from the rest of the hypergraph. Observe that the linear term in the dual objective encourages raising $x$ higher on the seed nodes and setting it lower on others. The cost $f_e(x)$ captures the discrepancy in node heights over a hyperedge $e$ and encourages smooth height transitions over adjacent nodes. The dual solution embeds nodes into the nonnegative real line, and this embedding is what we actually use for local clustering and node ranking.

4 Local Hypergraph Clustering

In this section we discuss the performance of the primal-dual pair (1)-(2) in the context of local hypergraph clustering. We consider a generic hypergraph $H=(V,E,\mathcal{W})$ with submodular hyperedge weights $\mathcal{W}=\{w_e,\vartheta_e\}_{e\in E}$. Given a set of seed nodes $S\subset V$, the goal of local hypergraph clustering is to identify a target cluster $C\subset V$ that contains or overlaps well with $S$. This generalizes the definition of local clustering over graphs [19]. To the best of our knowledge, we are the first to consider this problem for general submodular hypergraphs. We consider a subset of nodes having low conductance as a good cluster, i.e., these nodes are well-connected internally and well-separated from the rest of the hypergraph. Following prior work on local hypergraph clustering, we assume the existence of an unknown target cluster $C$ with conductance $\Phi(C)$. We prove that applying a sweep-cut to an optimal solution $\hat{x}$ of (2) returns a cluster $\hat{C}$ whose conductance is at most quadratically worse than $\Phi(C)$. Note that this result resembles Cheeger-type approximation guarantees of spectral clustering in the graph setting [2], and it is the first such result that is independent of hyperedge size for general hypergraphs. We keep the discussion at a high level and defer details to the appendix, where we prove a more general and stronger result, i.e., a constant approximation error, when the primal problem (1) is penalized by the $\ell_p$-norm for any $p\geq 2$.

In order to start a diffusion process we need to specify the source mass $\Delta$. Similar to the $p$-norm flow diffusion in the graph setting [20], we let

\Delta(v)=\begin{cases}\delta d_{v}&\mbox{if}~v\in S,\\ 0&\mbox{otherwise,}\end{cases} (3)

where $S$ is a set of seed nodes and $\delta\geq 1$. Below, we make the assumptions that the seed set $S$ and the target cluster $C$ overlap, that a constant factor of $\textnormal{vol}(C)$ amount of mass is trapped in $C$ initially, and that the hyper-parameter $\sigma$ is not too large. Note that Assumption 2 is without loss of generality: if the right value of $\delta$ is not known a priori, we can always employ binary search to find a good choice. Assumption 3 is very weak as it allows $\sigma$ to reside in an interval containing 0.

Assumption 1.

$\textnormal{vol}(S\cap C)\geq\alpha\,\textnormal{vol}(C)$ and $\textnormal{vol}(S\cap C)\geq\beta\,\textnormal{vol}(S)$ for some $\alpha,\beta\in(0,1]$.

Assumption 2.

The source mass $\Delta$ as specified in (3) satisfies $\delta=3/\alpha$, so $\Delta(C)\geq 3\,\textnormal{vol}(C)$.

Assumption 3.

$\sigma$ satisfies $0\leq\sigma\leq\beta\Phi(C)/3$.

Let $\hat{x}$ be an optimal solution of the dual problem (2). For $h>0$ define the sweep sets $S_h:=\{v\in V\,|\,\hat{x}_v\geq h\}$. We state the approximation property in Theorem 1.

Theorem 1.

Under Assumptions 1, 2 and 3, there exists $h>0$ such that $\Phi(S_h)\leq O(\sqrt{\Phi(C)}/\alpha\beta)$.
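The sweep-cut rounding behind Theorem 1 is simple to implement. A sketch (our own, reusing the `conductance` helper from the Section 2 sketch and assuming a dual embedding `x` with nonnegative entries; it scans every prefix of the sorted support, which includes every sweep set $S_h$):

```python
def sweep_cut(x, V, E):
    # Scan the sweep sets S_h = {v : x_v >= h} induced by the embedding x
    # and return the one with smallest conductance.
    support = sorted((v for v in V if x[v] > 0), key=lambda v: x[v], reverse=True)
    best_set, best_phi, S = None, float("inf"), set()
    for v in support:
        S.add(v)
        if len(S) == len(V):          # skip the trivial cut C = V
            break
        phi = conductance(S, V, E)
        if phi < best_phi:
            best_set, best_phi = set(S), phi
    return best_set, best_phi
```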

One of the challenges we face in establishing Theorem 1 is making sure that our diffusion model enjoys both good clustering guarantees and practical algorithmic advantages at the same time. This is achieved by introducing the hyper-parameter $\sigma$ into our diffusion problem. We demonstrate how $\sigma$ helps with algorithmic development in Section 5; from a clustering perspective, however, the additional flexibility given by $\sigma>0$ complicates the underlying diffusion dynamics, making them more difficult to analyze. Another challenge is connecting the Lovász extension $f_e(x)$ in (2) with the conductance of a cluster. We resolve these problems by combining a generalized Rayleigh quotient result for submodular hypergraphs [31], primal-dual convex conjugate relations between (1) and (2), and a classical property of the Choquet integral/Lovász extension.

Let $(\hat{\phi},\hat{r},\hat{z})$ be an optimal solution of the primal problem (1). We state the following lemma on the locality (i.e., sparsity) of the optimal solutions, which justifies why HFD is a local diffusion method.

Lemma 2.

$|\textnormal{supp}(\hat{\phi})|\leq\textnormal{vol}(\textnormal{supp}(\hat{x}))\leq\|\Delta\|_1$; moreover, $\textnormal{vol}(\textnormal{supp}(\hat{z}))=\textnormal{vol}(\textnormal{supp}(\hat{x}))$ if $\sigma>0$.

5 Optimization algorithm for HFD

We use a simple Alternating Minimization (AM) [5] method that efficiently solves the primal diffusion problem (1). For $e\in E$, we define a diagonal matrix $A_e\in\mathbb{R}^{|V|\times|V|}$ such that $[A_e]_{v,v}=1$ if $v\in e$ and 0 otherwise. Denote $\mathcal{C}:=\{(\phi,r):r_e\in\phi_e B_e,~\forall e\in E\}$. The following Lemma 3 allows us to cast problem (1) into an equivalent separable formulation amenable to the AM method.

Lemma 3.

The following problem is equivalent to (1) for any $\sigma>0$, in the sense that $(\hat{\phi},\hat{r},\hat{z})$ is optimal in (1) for some $\hat{z}\in\mathbb{R}^{|V|}$ if and only if $(\hat{\phi},\hat{r},\hat{s})$ is optimal in (4) for some $\hat{s}\in\bigotimes_{e\in E}\mathbb{R}^{|V|}$.

\min_{\phi,r,s}~\frac{1}{2}\sum_{e\in E}\left(\phi_{e}^{2}+\frac{1}{\sigma}\|s_{e}-r_{e}\|_{2}^{2}\right),\quad\mbox{s.t.}\quad(\phi,r)\in\mathcal{C},\quad\Delta-\sum_{e\in E}s_{e}\leq d,\quad s_{e,v}=0,\ \forall v\not\in e. (4)

The AM method for problem (4) is given in Algorithm 1. The first sub-problem corresponds to computing projections onto a group of cones, where all the projections can be computed in parallel. The computation of each projection depends on the choice of base polytope $B_e$. If the hyperedge weight $w_e$ is the unit cut-cost, $B_e$ has special structure and the projection can be computed in $O(|e|\log|e|)$ time [32]. For general $B_e$, a conic Fujishige-Wolfe minimum norm algorithm can be adopted to compute the projection [32]. The second sub-problem in Algorithm 1 can be computed in closed form (see the sketch after Algorithm 1). We provide more information about Algorithm 1 and its convergence properties in the appendix.

Algorithm 1 Alternating Minimization for problem (4)

Initialization:

\phi^{(0)}:=0,\quad r^{(0)}:=0,\quad s^{(0)}_{e}:=D^{-1}A_{e}[\Delta-d]_{+},\ \forall e\in E.

For $k=0,1,2,\ldots$ do:

(\phi^{(k+1)},r^{(k+1)}):=\operatorname*{argmin}_{(\phi,r)\in\mathcal{C}}\ \sum_{e\in E}\left(\phi_{e}^{2}+\tfrac{1}{\sigma}\|s_{e}^{(k)}-r_{e}\|_{2}^{2}\right)

s^{(k+1)}:=\operatorname*{argmin}_{s}\ \sum_{e\in E}\|s_{e}-r_{e}^{(k+1)}\|_{2}^{2}\quad\mbox{s.t.}\quad\Delta-\sum_{e\in E}s_{e}\leq d,\ s_{e,v}=0,\ \forall v\not\in e.
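For illustration, the second sub-problem admits the following closed form (a sketch under the main-text setting $\vartheta_e=1$, so $d_v$ equals the number of incident hyperedges, and not the paper's released implementation): the problem separates over nodes, and for each node $v$ it is a Euclidean projection of $(r_{e,v})_{e\ni v}$ onto the half-space $\sum_{e\ni v}s_{e,v}\geq\Delta(v)-d_v$, which shifts all incident entries by a common nonnegative amount.

```python
def s_update(r, Delta, d, V, E):
    """Closed-form second step of Algorithm 1 (sketch, vartheta_e = 1).
    r and s map (hyperedge index, node) pairs to reals; Delta, d map nodes to reals."""
    s = {(ei, v): r.get((ei, v), 0.0) for ei, e in enumerate(E) for v in e}
    for v in V:
        incident = [ei for ei, e in enumerate(E) if v in e]
        if not incident:
            continue
        total = sum(s[(ei, v)] for ei in incident)
        # project onto sum_e s_{e,v} >= Delta(v) - d(v): shift each entry equally
        shift = max(0.0, (Delta.get(v, 0.0) - d[v] - total) / len(incident))
        for ei in incident:
            s[(ei, v)] += shift
    return s
```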

We remark that the reformulation (4) for $\sigma>0$ is crucial from an algorithmic point of view. If $\sigma=0$, then the primal problem (1) has complicated coupling constraints that are hard to deal with. In this case, one has to resort to the dual problem (2). However, problem (2) has a nonsmooth objective function, which prevents the application of optimization methods designed for smooth objectives. Even though the subgradient method may be applied, we have observed empirically that its convergence is extremely slow for our problem, and early stopping results in output of poor quality.

Lastly, as noted in Lemma 2, the number of nonzeros in the optimal solution is upper bounded by $\|\Delta\|_1$. In Figure 3 we plot the number of nodes having positive excess (which equals the number of nonzeros in the dual solution $\hat{x}$) at every iteration of Algorithm 1. Figure 3 indicates that Algorithm 1 is strongly local, meaning that it works only on a small fraction of nodes (and their incident hyperedges) as opposed to producing dense iterates. This key empirical observation enables our algorithm to scale to large datasets by simply keeping track of all active nodes and hyperedges. Proving that the worst-case running time of AM depends only on the number of nonzero nodes at optimality, as opposed to the size of the whole hypergraph, is an open problem which we leave for future work.

Figure 3: (a) Cluster 12 - Gift Cards. (b) Cluster 18 - Magazine Subs. (c) Cluster 24 - Prime Pantry. The blue solid line plots the number of nonzeros in the dual solution $x$ over 200 iterations of Algorithm 1, when it is applied to solve HFD on the Amazon-reviews hypergraph for local clustering. See Section 6.2 for details about the dataset. The error bars show standard deviation over 10 trials. In each trial we pick a different seed node and set the same amount of initial mass. The black dashed line shows the average number of nonzeros at optimality. The algorithm touches only a small fraction of nodes out of the total 2,268,264 nodes in the Amazon-reviews dataset.

6 Empirical Results

In this section we evaluate the performance of HFD for local clustering. First, we carry out experiments on synthetic hypergraphs with varying target cluster conductances and varying hyperedge sizes. For the unit cut-cost setting, we show that HFD is more robust and performs better when the target cluster is noisy; for a cardinality-based cut-cost setting, we show that the edge-size-independent approximation guarantee is important for obtaining good recovery results. Second, we carry out experiments using real-world data. We show that HFD significantly outperforms existing state-of-the-art diffusion methods for both unit and cardinality-based cut-costs. Moreover, we provide a compelling example where a specialized submodular cut-cost is necessary for obtaining good results. Code that reproduces all results is available at https://github.com/s-h-yang/HFD.

6.1 Synthetic experiments using hypergraph stochastic block model (HSBM)

The generative model. We generalize the standard $k$-uniform hypergraph stochastic block model ($k$HSBM) [21] to allow different types of inter-cluster hyperedges to appear with possibly different probabilities according to the cardinality of the hyperedge cut. Let $V=\{1,2,\ldots,n\}$ be a set of nodes and let $k\geq 2$ be the required constant hyperedge size. We consider the $k$HSBM with parameters $k$, $n$, $p$, $q_j$, $j=1,2,\ldots,\lfloor k/2\rfloor$. The model samples a $k$-uniform hypergraph according to the following rules: (i) the community label $\sigma_i\in\{0,1\}$ is chosen uniformly at random for $i\in V$ (we consider two blocks for simplicity; in general the model applies to any number of blocks); (ii) each size-$k$ subset $e=\{v_1,v_2,\ldots,v_k\}$ of $V$ appears independently as a hyperedge with probability

\mathbb{P}(e\in E)=\begin{cases}p&\mbox{if}~\sigma_{v_{1}}=\sigma_{v_{2}}=\cdots=\sigma_{v_{k}},\\ q_{j}&\mbox{if}~\min\{k-\sum_{i=1}^{k}\sigma_{v_{i}},\sum_{i=1}^{k}\sigma_{v_{i}}\}=j.\end{cases}

If $k=3$ or all $q_j$'s are the same, then we obtain the standard two-block $k$HSBM. We use this setting to evaluate HFD for the unit cut-cost. If the $q_j$'s are different, then we obtain a cardinality-based $k$HSBM. In particular, when $q_1\geq q_2\geq\cdots\geq q_{\lfloor k/2\rfloor}$, it models the scenario where hyperedges containing similar numbers of nodes from each block are rare, while small noise (e.g., hyperedges that have one or two nodes in one block and all the rest in the other block) is more frequent. We use $q_1\gg q_j$, $j\geq 2$, to evaluate HFD for the cardinality-based cut-cost. There are other random hypergraph models, for example the Poisson degree-corrected HSBM [14] that deals with degree heterogeneity and edge size heterogeneity. In our experiments we focus on the $k$HSBM because it allows stronger control over hyperedge sizes. We provide details on data generation in the appendix.
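A minimal sampler for this model (our own sketch; it enumerates all $\binom{n}{k}$ candidate hyperedges, so it is only meant for small $n$):

```python
import random
from itertools import combinations

def sample_khsbm(n, k, p, q, seed=0):
    """Two-block kHSBM: q[j-1] is the probability of a hyperedge whose
    smaller side of the cut contains j nodes (j = 1, ..., floor(k/2))."""
    rng = random.Random(seed)
    sigma = [rng.randint(0, 1) for _ in range(n)]   # community labels, rule (i)
    E = []
    for e in combinations(range(n), k):             # rule (ii)
        ones = sum(sigma[v] for v in e)
        j = min(k - ones, ones)                     # cardinality of the cut
        if rng.random() < (p if j == 0 else q[j - 1]):
            E.append(set(e))
    return sigma, E

sigma, E = sample_khsbm(n=60, k=4, p=0.01, q=[0.001, 0.0001])  # q1 >> q2
```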

Task and methods. We consider the local hypergraph clustering problem. We assume that we are given a single labelled node and the goal is to recover all nodes having the same label. Using a single seed node is the most common (and sought-after) practice for local graph clustering tasks. We compare the performance of HFD with two other methods: (i) Localized Quadratic Hypergraph Diffusions (LH) [33], which can be seen as a hypergraph analogue of Approximate Personalized PageRank (APPR); (ii) ACL [2], which is used to compute APPR vectors on a standard graph obtained by reducing the hypergraph through star expansion [50]. There are other heuristic methods which first reduce a hypergraph to a graph by clique expansion [6] and then apply diffusion methods for standard graphs. We did not compare with this approach because clique expansion often results in a dense graph and consequently makes the computation slow; moreover, it was shown in [33] that clique expansion does not offer significant performance improvement over star expansion.

Cut-costs and parameters. We consider both the unit cut-cost, i.e., $w_e(S)=1$ if $S\cap e\neq\emptyset$ and $e\setminus S\neq\emptyset$, and the cardinality cut-cost $w_e(S)=\min\{|S\cap e|,|e\setminus S|\}/\lfloor|e|/2\rfloor$. HFD with unit and cardinality cut-costs is denoted by U-HFD and C-HFD, respectively. LH also works with both unit and cardinality cut-costs, and we denote these variants by U-LH and C-LH, respectively. For HFD, we initialize the seed mass so that $\|\Delta\|_1$ is a constant factor times the volume of the target cluster, and we set $\sigma=0.01$. We extensively tune LH by performing binary search over its parameters $\kappa$ and $\delta$ and pick the output cluster having the lowest conductance. For ACL we use the same parameter choices as in [33]. Details on parameter settings are provided in the appendix.
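For reference, the two cut-cost families used in the experiments can be written in a few lines (a sketch; both are normalized so that $\max_S w_e(S)=1$, i.e., $\vartheta_e=1$):

```python
def w_unit(S, e):
    S = S & e
    return 1.0 if S and e - S else 0.0              # 1 iff e is cut

def w_card(S, e):
    S = S & e
    return min(len(S), len(e - S)) / (len(e) // 2)  # cardinality-based, normalized

e = {1, 2, 3, 4, 5, 6}
print(w_unit({1}, e), w_card({1}, e))               # 1.0 0.333...
print(w_unit({1, 2, 3}, e), w_card({1, 2, 3}, e))   # 1.0 1.0
```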

Results. For each hypergraph, we randomly pick a block as the target cluster. We run each method 50 times; each time we choose a different node from the target cluster as the single seed node.
Unit cut-cost results. Figure 4 shows local clustering results when we fix $k=3$ but vary the conductance of the target cluster (i.e., constant $p$ but varying $q_1$). Observe that the performance of all methods degrades as the target cluster becomes noisier, but U-HFD performs significantly better than both U-LH and ACL when the conductance of the target cluster is between 0.2 and 0.4. The reason that U-HFD performs better is in part that it requires much weaker conditions for its theoretical conductance guarantee to hold. On the contrary, LH assumes an upper bound on the conductance of the target cluster [33]. This upper bound is dataset-dependent and can become very small in many cases, leading to poor practical performance. We provide more details on this point in the appendix. ACL with star expansion is a heuristic method that has no performance guarantee.
Cardinality cut-cost results. Figure 5 shows the median (markers) and 25-75 percentiles (lower-upper bars) of conductance ratios (i.e., the ratio between output conductance and ground-truth conductance, lower is better) and F1 scores for different methods for $k\in\{3,4,5,6\}$. The target cluster for each $k$ has conductance around 0.3. (See the appendix for similar results when we fix the target cluster conductances around 0.2 and 0.25, respectively. These cover a reasonably wide range of scenarios in terms of the target conductance and illustrate the performance of the algorithms for different levels of noise.) For $k=3$, unit and cardinality cut-costs are equivalent, and therefore all methods perform similarly. As $k$ increases, the cardinality cut-cost provides better performance than the unit cut-cost in both conductance and F1. However, since the theoretical approximation guarantee of C-LH depends on hyperedge size [33], there is a noticeable performance degradation for C-LH when we increase $k=3$ to $k=4$. On the other hand, the performance of C-HFD appears to be independent of $k$, which aligns with our conductance bound in Theorem 1.

Figure 4: Output conductance and F1 against ground-truth conductance.

Figure 5: Conductance ratio and F1 on $k$-uniform hypergraphs.

6.2 Experiments using real-world data

We conduct extensive experiments using real-world data. First, we show that HFD achieves superior local clustering performance compared to existing methods for both unit and cardinality-based cut-costs. Then, we show that a general submodular cut-cost (recall that HFD is the only method that applies to this setting) can be necessary for capturing complex higher-order relations in the data, improving F1 scores by up to 20% for local clustering and providing the only meaningful results for node ranking. In the appendix we report additional local clustering experiments on two additional datasets, where our method improves F1 scores by 8% on average over 13 different target clusters.

Datasets. We provide basic information about the datasets used in our experiments. Complete descriptions are provided in the appendix.
Amazon-reviews ($|V|$ = 2,268,264, $|E|$ = 4,285,363) [38, 43]. In this hypergraph each node represents a product. A set of products are connected by a hyperedge if they are reviewed by the same person. We use product category labels as ground-truth cluster identities. We consider all clusters of fewer than 10,000 nodes.
Trivago-clicks ($|V|$ = 172,738, $|E|$ = 233,202) [14]. The nodes in this hypergraph are accommodations/hotels. A set of nodes are connected by a hyperedge if a user performed a "click-out" action on them during the same browsing session. We use geographical locations as ground-truth cluster identities. There are 160 such clusters, and we filter them by cluster size and conductance.
Florida Bay food network ($|V|$ = 128, $|E|$ = 141,233) [30]. Nodes in this hypergraph correspond to different species or organisms that live in the Bay, and hyperedges correspond to transformed motifs (Figure 1) of the original dataset. Each species is labelled according to its role in the food chain: producers, low-level consumers, high-level consumers.

Methods and parameters. We compare HFD with LH and ACL. (We also tried a flow-improve method for hypergraphs [43], but the method was very slow in our experiments, so we only used it for small datasets; see the appendix for results. The flow-improve method did not improve the performance of existing methods, so we omitted it from comparisons on larger datasets.) There is a heuristic nonlinear variant of LH which is shown to outperform linear LH in some cases [33], so we also compare with the same nonlinear variant considered in [33]. We denote the linear and nonlinear versions by LH-2.0 and LH-1.4, respectively. We set $\sigma=0.0001$ for HFD, and we set the parameters for LH-2.0, LH-1.4 and ACL as suggested by the authors [33]. More details on parameter choices appear in the appendix. We prefix methods that use unit and cardinality-based cut-costs by U- and C-, respectively.

Experiments for unit and cardinality cut-costs. For each target cluster in Amazon-reviews and Trivago-clicks, we run the methods multiple times; each time we use a different node as the single seed node. (We show additional results using seed sets of more than one node in the appendix.) We report the median F1 scores of the output clusters in Table 1 and Table 2. For Amazon-reviews, we only compare the unit cut-cost because it is both shown in [33] and verified by our experiments that the unit cut-cost is more suitable for this dataset. Observe that U-HFD obtains the highest F1 scores for nearly all clusters. In particular, U-HFD significantly outperforms the other methods for clusters 12, 18 and 24, where we see an increase in F1 score of up to 52%. For Trivago-clicks, C-HFD has the best performance for all but one cluster. Among all the other methods, U-HFD has the second highest F1 scores for nearly all clusters. Moreover, observe that for each method (i.e., HFD, LH-2.0, LH-1.4), the cardinality cut-cost leads to higher F1 than its unit cut-cost counterpart.

Table 1: F1 results for Amazon-reviews network (columns are cluster IDs)
Method 1 2 3 12 15 17 18 24 25
U-HFD 0.45 0.09 0.65 0.92 0.04 0.10 0.80 0.81 0.09
U-LH-2.0 0.23 0.07 0.23 0.29 0.05 0.06 0.21 0.28 0.05
U-LH-1.4 0.23 0.09 0.35 0.40 0.00 0.07 0.31 0.35 0.06
ACL 0.23 0.07 0.22 0.25 0.04 0.05 0.17 0.20 0.04
Table 2: F1 results for Trivago-clicks network
Method KOR ISL PRI UA-43 VNM HKG MLT GTM UKR EST
U-HFD 0.75 0.99 0.89 0.85 0.28 0.82 0.98 0.94 0.60 0.94
C-HFD 0.76 0.99 0.95 0.94 0.32 0.80 0.98 0.97 0.68 0.94
U-LH-2.0 0.70 0.86 0.79 0.70 0.24 0.92 0.88 0.82 0.50 0.90
C-LH-2.0 0.73 0.90 0.84 0.78 0.27 0.94 0.96 0.88 0.51 0.83
U-LH-1.4 0.69 0.84 0.80 0.75 0.28 0.87 0.92 0.83 0.47 0.90
C-LH-1.4 0.71 0.88 0.84 0.78 0.27 0.88 0.93 0.85 0.50 0.85
ACL 0.65 0.84 0.75 0.68 0.23 0.90 0.83 0.69 0.50 0.88

Experiments for general submodular cut-cost. In order to understand the importance of specialized general submodular hypergraphs, we study the node-ranking problem for the Florida Bay food network using the hypergraph modelling shown in Figure 1. We compare HFD using unit (U-HFD, $\gamma_1=\gamma_2=1$), cardinality-based (C-HFD, $\gamma_1=1/2$ and $\gamma_2=1$) and submodular (S-HFD, $\gamma_1=1/2$ and $\gamma_2=0$) cut-costs. Our goal is to find the species most similar to a queried species based on the food-network structure. Table 3 shows that S-HFD provides the only meaningful node-ranking results. Intuitively, when $\gamma_2=0$, separating the preys $v_1,v_2$ from the predators $v_3,v_4$ incurs 0 cost. This encourages S-HFD to diffuse mass among preys or among predators only, and not to cross from a predator to a prey or vice versa. As a result, similar species receive similar amounts of mass and thus are ranked similarly. In the local clustering setting, Table 3 compares HFD using the different cut-costs. By exploiting specialized higher-order relations, S-HFD further improves F1 scores by up to 20% over U-HFD and C-HFD. This is not surprising, given the poor node-ranking results of the other cut-costs. In the appendix we show another application of the submodular cut-cost for node ranking in an international oil trade network.

Table 3: Node-ranking and local clustering in Florida Bay food network using different cut-costs
Top-2 node-ranking results Clustering F1
Method Query: Raptors Query: Gray Snapper Prod. Low High
U-HFD Epiphytic Gastropods, Detriti. Gastropods Meiofauna, Epiphytic Gastropods 0.69 0.47 0.64
C-HFD Epiphytic Gastropods, Detriti. Gastropods Meiofauna, Epiphytic Gastropods 0.67 0.47 0.64
S-HFD Gruiformes, Small Shorebirds Snook, Mackerel 0.69 0.62 0.84

References

  • [1] I. Amburg, N. Veldt, and A. R. Benson. Clustering in graphs and hypergraphs with categorical edge labels. In Proceedings of the Web Conference, 2020.
  • [2] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. FOCS ’06 Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 475–486, 2006.
  • [3] Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings and graph partitioning. JACM, 56(2), April 2009.
  • [4] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.
  • [5] A. Beck. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1):185–209, 2015.
  • [6] A. R. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.
  • [7] Austin Benson, David Gleich, and Desmond Higham. Higher-order network analysis takes off, fueled by classical ideas and new data. SIAM News (online), 2021.
  • [8] Claude Berge. Hypergraphs: combinatorics of finite sets, volume 45. Elsevier, 1984.
  • [9] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
  • [10] J. Bu, S. Tan, C. Chen, C. Wang, H. Wu, L. Zhang, and X. He. Music recommendation by unified hypergraph: combining social media information and music content. In MM ’10: Proceedings of the 18th ACM international conference on Multimedia, 2010.
  • [11] T.-H. Hubert Chan, Anand Louis, Zhihao Gavin Tang, and Chenzi Zhang. Spectral properties of hypergraph laplacian and approximation algorithms. JACM, 65(3), March 2018.
  • [12] Eli Chien, Pan Li, and Olgica Milenkovic. Landing probabilities of random walks for seed-set expansion in hypergraphs, 2019.
  • [13] Uthsav Chitra and Benjamin Raphael. Random walks on hypergraphs with edge-dependent vertex weights. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1172–1181. PMLR, 09–15 Jun 2019.
  • [14] Philip S. Chodrow, Nate Veldt, and Austin R. Benson. Generative hypergraph clustering: from blockmodels to modularity, 2021.
  • [15] Ivar Ekeland and Roger Témam. Convex Analysis and Variational Problems. Society for Industrial and Applied Mathematics, 1999.
  • [16] C. Eksombatchai, P. Jindal, J. Z. Liu, Y. Liu, R. Sharma, C. Sugnet, M. Ulrich, and J. Leskovec. Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. WWW '18: Proceedings of the 2018 World Wide Web Conference, pages 1775–1784, 2018.
  • [17] C. Eksombatchai, J. Leskovec, R. Sharma, C. Sugnet, and M. Ulrich. Node graph traversal methods. U.S. Patent 10 762 134 B1, Sep. 2020, 2020.
  • [18] K. Fountoulakis, M. Liu, D. F. Gleich, and M. W. Mahoney. Flow-based algorithms for improving clusters: A unifying framework, software, and performance. arXiv:2004.09608, 2020.
  • [19] K. Fountoulakis, F. Roosta-Khorasani, J. Shun, X. Cheng, and M. W. Mahoney. Variational perspective on local graph clustering. Mathematical Programming B, pages 1–21, 2017.
  • [20] K. Fountoulakis, D. Wang, and S. Yang. $p$-norm flow diffusion for local graph clustering. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • [21] Debarghya Ghoshdastidar and Ambedkar Dukkipati. Consistency of spectral partitioning of uniform hypergraphs under planted partition model. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
  • [22] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. WTF: the who to follow service at twitter. WWW ’13: Proceedings of the 22nd international conference on World Wide Web, pages 505–514, 2013.
  • [23] Scott W. Hadley. Approximation techniques for hypergraph partitioning problems. Discrete Appl. Math., 59(2):115–127, May 1995.
  • [24] Koby Hayashi, Sinan G. Aksoy, Cheong Hee Park, and Haesun Park. Hypergraph random walks, laplacians, and clustering. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, page 495–504, New York, NY, USA, 2020. Association for Computing Machinery.
  • [25] Matthias Hein, Simon Setzer, Leonardo Jost, and Syama Sundar Rangapuram. The total variation on hypergraphs-learning on hypergraphs revisited. In Proceedings of the 26th International Conference on Neural Information Processing Systems-Volume 2, pages 2427–2435, 2013.
  • [26] R. Ibrahim and D. F. Gleich. Local hypergraph clustering using capacity releasing diffusion. PLOS ONE, 15(12):1–20, 12 2020.
  • [27] Edmund Ihler, Dorothea Wagner, and Frank Wagner. Modeling hypergraphs by graphs with the same mincut properties. Information Processing Letters, 45(4):171–175, 1993.
  • [28] E. L. Lawler. Cutsets and partitions of hypergraphs. Networks, 3(3):275–285, 1973.
  • [29] L. Li and T. Li. News recommendation via hypergraph learning: encapsulation of user behavior and news content. In WSDM ’13: Proceedings of the sixth ACM international conference on Web search and data mining, 2013.
  • [30] P. Li and O. Milenkovic. Inhomogeneous hypergraph clustering with applications. In Advances in Neural Information Processing Systems, 2017.
  • [31] P. Li and O. Milenkovic. Submodular hypergraphs: p-laplacians, cheeger inequalities and spectral clustering. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • [32] Pan Li, Niao He, and Olgica Milenkovic. Quadratic decomposable submodular function minimization: Theory and practice. Journal of Machine Learning Research, 21(106):1–49, 2020.
  • [33] M. Liu, N. Veldt, H. Song, P. Li, and D. F. Gleich. Strongly local hypergraph diffusions for clustering and semi-supervised learning. In TheWebConf 2021, 2021.
  • [34] Anand Louis. Hypergraph markov operators, eigenvalues and approximation algorithms. STOC, page 713–722, New York, NY, USA, 2015. Association for Computing Machinery.
  • [35] Rossana Mastrandrea, Julie Fournet, and Alain Barrat. Contact patterns in a high school: A comparison between data collected using wearable sensors, contact diaries and friendship surveys. PLOS ONE, 10(9):e0136497, 2015.
  • [36] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827, 2002.
  • [37] J. Ni, J. Li, and J. McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, 2019.
  • [38] Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, 2019.
  • [39] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford, 1999. Technical Report 1999-66, Stanford InfoLab.
  • [40] E. Sadikov, J. Madhavan, L. Wang, and A. Halevy. Clustering query refinements by user intent. WWW ’10: Proceedings of the 19th international conference on World wide web, pages 841–850, 2010.
  • [41] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. An overview of microsoft academic service (MAS) and applications. In Proceedings of the 24th International Conference on World Wide Web, 2015.
  • [42] Yuuki Takai, Atsushi Miyauchi, Masahiro Ikeda, and Yuichi Yoshida. Hypergraph clustering based on pagerank. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1970–1978, 2020.
  • [43] N. Veldt, A. R. Benson, and J. Kleinberg. Minimizing localized ratio cut objectives in hypergraphs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020.
  • [44] Nate Veldt, Austin R. Benson, and Jon Kleinberg. Hypergraph cuts with general splitting functions, 2020.
  • [45] D. Wang, K. Fountoulakis, M. Henzinger, M. W. Mahoney, and S. Rao. Capacity releasing diffusion for speed and locality. Proceedings of the 34th International Conference on Machine Learning, 70:3598–3607, 2017.
  • [46] Hao Yin, Austin R Benson, Jure Leskovec, and David F Gleich. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 555–564, 2017.
  • [47] Yuichi Yoshida. Cheeger inequalities for submodular transformations. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2582–2601. SIAM, 2019.
  • [48] Chenzi Zhang, Shuguang Hu, Zhihao Gavin Tang, and TH Hubert Chan. Re-revisiting learning on hypergraphs: confidence interval and subgradient method. In International Conference on Machine Learning, pages 4026–4034. PMLR, 2017.
  • [49] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. Advances in neural information processing systems, 19:1601–1608, 2006.
  • [50] J. Y. Zien, M. D. F. Schlag, and P. K. Chan. Multilevel spectral hypergraph partitioning with arbitrary vertex sizes. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(9):1389–1399, 1999.

Appendices for: Local Hyper-Flow Diffusion

Outline of the Appendix:

  • Appendix A contains supplementary material to Section 3 and Section 4 of the paper:

    • mathematical derivation of the dual diffusion problem;

    • proofs of Theorem 1 and Lemma 2.

  • Appendix B contains supplementary material to Section 5 of the paper:

    • proof of Lemma 3;

    • convergence properties of Algorithm 1;

    • specialized algorithms for alternating minimization sub-problems of Algorithm 1.

  • Appendix C contains supplementary material to Section 6 of the paper:

    • additional synthetic experiments using kk-uniform hypergraph stochastic block model;

    • complete information about the real datasets considered in Section 6 of the paper;

    • experiments for local clustering using seed sets that contain more than one node;

    • experiments using 3 additional real datasets that are not discussed in the main paper;

    • parameter settings and implementation details.

Appendix A Approximation guarantee for local hypergraph clustering

In this section we prove a generalized and stronger version of Theorem 1 in the main paper, where the primal and dual diffusion problems are penalized by the $\ell_p$-norm and $\ell_q$-norm, respectively, with $p\geq 2$ and $1/p+1/q=1$. Moreover, we consider a generic hypergraph $H=(V,E,\mathcal{W})$ with general submodular weights $\mathcal{W}=\{w_e,\vartheta_e\}_{e\in E}$ for any nonzero $\vartheta_e:=\max_{S\subseteq e}w_e(S)$. All claims in the main paper are therefore immediate special cases when $p=q=2$ and $\vartheta_e=1$ for all $e\in E$.

Unless otherwise stated, we use the same notation as in the main paper. We generalize the definition of the degree of a node vVv\in V as

dv:=eE:veϑe.d_{v}:=\sum_{e\in E:v\in e}\vartheta_{e}.

Note that when ϑe=1\vartheta_{e}=1 for all ee, the above definition reduces to dv=|{eE:ve}|d_{v}=|\{e\in E:v\in e\}|, which is the number of hyperedges to which vv belongs.

Given H=(V,E,𝒲)H=(V,E,\mathcal{W}) where 𝒲={we,ϑe}eE\mathcal{W}=\{w_{e},\vartheta_{e}\}_{e\in E}, p2p\geq 2, and a hyperparameter σ0\sigma\geq 0, our primal Hyper-Flow Diffusion (HFD) problem is written as

minϕ+|E|,z+|V|1peEϑeϕep+σpvVdvzvps.t.ΔeEϑered+σDzreϕeBe,eE\begin{split}\min_{\phi\in\mathbb{R}^{|E|}_{+},z\in\mathbb{R}^{|V|}_{+}}~{}&\frac{1}{p}\sum_{e\in E}\vartheta_{e}\phi_{e}^{p}+\frac{\sigma}{p}\sum_{v\in V}d_{v}z_{v}^{p}\\ \mbox{s.t.}\hskip 25.60747pt&\Delta-\sum_{e\in E}\vartheta_{e}r_{e}\leq d+\sigma Dz\\ &r_{e}\in\phi_{e}B_{e},~{}\forall e\in E\end{split} (A.1)

where

Be:={ρe|V||ρe(S)we(S),SV,andρe(V)=we(V)}B_{e}:=\{\rho_{e}\in\mathbb{R}^{|V|}~{}|~{}\rho_{e}(S)\leq w_{e}(S),\forall S\subseteq V,~{}\mbox{and}~{}\rho_{e}(V)=w_{e}(V)\}

is the base polytope of wew_{e}. The vector m=ΔeEϑerem=\Delta-\sum_{e\in E}\vartheta_{e}r_{e} gives the net amount of mass after routing. Note that we multiply rer_{e} by ϑe\vartheta_{e} because we have normalized wew_{e} by ϑe\vartheta_{e} in its definition.
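To make these objects concrete, here is a minimal Python sketch (our own illustration, not released code; the layout that stores each hyperedge as a (nodes, theta_e) pair and each routing vector r_e as a dict over the nodes of e is an assumption) that computes the generalized degrees d_v and the net mass m = Delta - sum_e theta_e r_e:

# Minimal sketch (ours): generalized degrees and net mass after routing.
def degrees(n, hyperedges):
    d = [0.0] * n
    for nodes, theta in hyperedges:
        for v in nodes:
            d[v] += theta              # d_v = sum of theta_e over e containing v
    return d

def net_mass(Delta, hyperedges, r):
    m = list(Delta)                    # m = Delta - sum_e theta_e * r_e
    for (nodes, theta), r_e in zip(hyperedges, r):
        for v in nodes:
            m[v] -= theta * r_e.get(v, 0.0)
    return m

H = [((0, 1, 2), 1.0), ((1, 2, 3), 2.0)]
print(degrees(4, H))                   # [1.0, 3.0, 3.0, 2.0]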

Lemma 1.

The following optimization problem is dual to (A.1):

maxx+|V|(Δd)Tx1qeEϑefe(x)qσqvVdvxvq\max_{x\in\mathbb{R}^{|V|}_{+}}~{}(\Delta-d)^{T}x-\frac{1}{q}\sum_{e\in E}\vartheta_{e}f_{e}(x)^{q}-\frac{\sigma}{q}\sum_{v\in V}d_{v}x_{v}^{q} (A.2)

where fe(x):=maxρeBeρeTxf_{e}(x):=\max_{\rho_{e}\in B_{e}}\rho_{e}^{T}x is the support function of base polytope BeB_{e}.

Proof.

Using convex conjugates, for x+|V|x\in\mathbb{R}^{|V|}_{+}, we have

1qfe(x)q\displaystyle\frac{1}{q}f_{e}(x)^{q} =maxϕe0ϕefe(x)1pϕep,eE,\displaystyle=\max_{\phi_{e}\geq 0}~{}\phi_{e}f_{e}(x)-\frac{1}{p}\phi_{e}^{p},~{}\forall e\in E, (A.3a)
1qxvq\displaystyle\frac{1}{q}x_{v}^{q} =maxzv0zvxv1pzvp,vV.\displaystyle=\max_{z_{v}\geq 0}~{}z_{v}x_{v}-\frac{1}{p}z_{v}^{p},~{}\forall v\in V. (A.3b)

Applying the definition of fe(x)f_{e}(x), we can write (A.3a) as

1qfe(x)q=maxϕe0ϕefe(x)1pϕep=maxϕe0,reϕeBereTx1pϕep.\frac{1}{q}f_{e}(x)^{q}=\max_{\phi_{e}\geq 0}~{}\phi_{e}f_{e}(x)-\frac{1}{p}\phi_{e}^{p}\ =\max_{\phi_{e}\geq 0,r_{e}\in\phi_{e}B_{e}}r_{e}^{T}x-\frac{1}{p}\phi_{e}^{p}.

Therefore,

maxx+|V|(Δd)Tx1qeEϑefe(x)qσqvVdvxvq=maxx+|V|(Δd)TxeEϑe(maxϕe0,reϕeBereTx1pϕep)σvVdv(maxzv0zvxv1pzvp)=maxx+|V|(Δd)Tx+minϕ+|E|reϕeBe,eEeE(1pϑeϕepϑereTx)+minz+|V|σvV(1pdvzvpdvzvxv)=minϕ+|E|,z+|V|reϕeBe,eE1peEϑeϕep+σpvVdvzvp+maxx+|V|((Δd)TxeEϑereTxσvVdvzvxv)=minϕ+|E|,z+|V|reϕeBe,eE1peEϑeϕep+σpvVdvzvps.t.ΔdeEϑereσDz0.\begin{split}&~{}\max_{x\in\mathbb{R}^{|V|}_{+}}~{}(\Delta-d)^{T}x-\frac{1}{q}\sum_{e\in E}\vartheta_{e}f_{e}(x)^{q}-\frac{\sigma}{q}\sum_{v\in V}d_{v}x_{v}^{q}\\ =&~{}\max_{x\in\mathbb{R}^{|V|}_{+}}~{}(\Delta-d)^{T}x-\sum_{e\in E}\vartheta_{e}\left(\max_{\phi_{e}\geq 0,r_{e}\in\phi_{e}B_{e}}~{}r_{e}^{T}x-\frac{1}{p}\phi_{e}^{p}\right)-\sigma\sum_{v\in V}d_{v}\left(\max_{z_{v}\geq 0}~{}z_{v}x_{v}-\frac{1}{p}z_{v}^{p}\right)\\ =&~{}\max_{x\in\mathbb{R}^{|V|}_{+}}~{}(\Delta-d)^{T}x+\min_{\begin{subarray}{c}\phi\in\mathbb{R}^{|E|}_{+}\\ r_{e}\in\phi_{e}B_{e},\forall e\in E\end{subarray}}\sum_{e\in E}\left(\frac{1}{p}\vartheta_{e}\phi_{e}^{p}-\vartheta_{e}r_{e}^{T}x\right)+\min_{z\in\mathbb{R}^{|V|}_{+}}\sigma\sum_{v\in V}\left(\frac{1}{p}d_{v}z_{v}^{p}-d_{v}z_{v}x_{v}\right)\\ =&~{}\min_{\begin{subarray}{c}\phi\in\mathbb{R}^{|E|}_{+},z\in\mathbb{R}^{|V|}_{+}\\ r_{e}\in\phi_{e}B_{e},\forall e\in E\end{subarray}}\frac{1}{p}\sum_{e\in E}\vartheta_{e}\phi_{e}^{p}+\frac{\sigma}{p}\sum_{v\in V}d_{v}z_{v}^{p}+\max_{x\in\mathbb{R}^{|V|}_{+}}\left((\Delta-d)^{T}x-\sum_{e\in E}\vartheta_{e}r_{e}^{T}x-\sigma\sum_{v\in V}d_{v}z_{v}x_{v}\right)\\ =&~{}\min_{\begin{subarray}{c}\phi\in\mathbb{R}^{|E|}_{+},z\in\mathbb{R}^{|V|}_{+}\\ r_{e}\in\phi_{e}B_{e},\forall e\in E\end{subarray}}\frac{1}{p}\sum_{e\in E}\vartheta_{e}\phi_{e}^{p}+\frac{\sigma}{p}\sum_{v\in V}d_{v}z_{v}^{p}\quad\mbox{s.t.}\quad\Delta-d-\sum_{e\in E}\vartheta_{e}r_{e}-\sigma Dz\leq 0.\end{split}

In the above derivations, we may exchange the order of minimization and maximization to arrive at the second-to-last equality, by Proposition 2.2, Chapter VI, in [15]. The last equality follows from

maxx+|V|((Δd)TxeEϑereTxσvVdvzvxv)={0,ifΔdeEϑereσDz0,+,otherwise.\max_{x\in\mathbb{R}^{|V|}_{+}}\bigg{(}(\Delta-d)^{T}x-\sum_{e\in E}\vartheta_{e}r_{e}^{T}x-\sigma\sum_{v\in V}d_{v}z_{v}x_{v}\bigg{)}=\left\{\begin{array}[]{ll}0,&\mbox{if}~{}\Delta-d-\sum\limits_{e\in E}\vartheta_{e}r_{e}-\sigma Dz\leq 0,\\ +\infty,&\mbox{otherwise}.\end{array}\right.

Notation. For the rest of this section, we reserve the notation (ϕ^,z^)(\hat{\phi},\hat{z}) and x^\hat{x} for optimal solutions of (A.1) and (A.2) respectively. If σ=0\sigma=0, we simply treat z^=0\hat{z}=0.

The next lemma relates primal and dual optimal solutions. We make frequent use of this relation throughout our discussion.

Lemma 2.

We have that ϕ^ep=fe(x^)q\hat{\phi}_{e}^{p}=f_{e}(\hat{x})^{q} for all eEe\in E. Moreover, if σ>0\sigma>0, then z^vp=x^vq\hat{z}_{v}^{p}=\hat{x}_{v}^{q} for all vVv\in V.

Proof.

Given x^\hat{x} an optimal solution to (A.2), it follows directly from (A.3) and strong duality that (ϕ^,z^)(\hat{\phi},\hat{z}) must satisfy, for each eEe\in E and vVv\in V,

ϕ^e=fe(x^)q1=argmaxϕe0ϕefe(x^)1pϕepandz^v=x^vq1=argmaxzv0zvx^v1pzvp.\hat{\phi}_{e}=f_{e}(\hat{x})^{q-1}=\operatorname*{argmax}_{\phi_{e}\geq 0}~{}\phi_{e}f_{e}(\hat{x})-\frac{1}{p}\phi_{e}^{p}\quad\mbox{and}\quad\hat{z}_{v}=\hat{x}_{v}^{q-1}=\operatorname*{argmax}_{z_{v}\geq 0}~{}z_{v}\hat{x}_{v}-\frac{1}{p}z_{v}^{p}.

Diffusion setup. Recall that we pick a scalar δ\delta and set the source Δ\Delta as

Δv={δdv,ifvS,0,otherwise.\Delta_{v}=\left\{\begin{array}[]{ll}\delta d_{v},&\mbox{if}~{}v\in S,\\ 0,&\mbox{otherwise}.\end{array}\right. (A.4)
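In code, with the degree routine from the earlier sketch, setting up this source vector is a one-liner (again our own illustration):

# Sketch (ours): Delta_v = delta * d_v on the seed set S, 0 elsewhere, as in (A.4).
def source_mass(d, S, delta):
    return [delta * d[v] if v in S else 0.0 for v in range(len(d))]

# e.g. source_mass([1.0, 3.0, 3.0, 2.0], {1}, 3.0) == [0.0, 9.0, 0.0, 0.0]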

For convenience we restate the assumptions below.

Assumption A.1.

vol(SC)αvol(C)\textnormal{vol}(S\cap C)\geq\alpha\textnormal{vol}(C) and vol(SC)βvol(S)\textnormal{vol}(S\cap C)\geq\beta\textnormal{vol}(S) for some α,β(0,1]\alpha,\beta\in(0,1].

Assumption A.2.

The source mass Δ\Delta as specified in (A.4) satisfies δ=3/α\delta=3/\alpha, which gives Δ(C)3vol(C)\Delta(C)\geq 3\textnormal{vol}(C).

Assumption A.3.

σ\sigma satisfies 0σβΦ(C)/30\leq\sigma\leq\beta\Phi(C)/3.

A.1 Technical lemmas

In this subsection we state and prove some technical lemmas that will be used for the main proof in the next subsection.

The following lemma characterizes the maximizers of the support function for a base polytope.

Lemma 3 (Proposition 4.2 in [4]).

Let ww be a submodular function such that w()=0w(\emptyset)=0. Let x|V|x\in\mathbb{R}^{|V|}, with unique values a1>>ama_{1}>\cdots>a_{m}, taken at sets A1,,AmA_{1},\ldots,A_{m} (i.e., V=A1AmV=A_{1}\cup\cdots\cup A_{m} and i{1,,m}\forall i\in\{1,\ldots,m\}, vAi\forall v\in A_{i}, xv=aix_{v}=a_{i}). Let BB be the associated base polytope. Then ρB\rho\in B is optimal for maxρBρTx\max_{\rho\in B}\rho^{T}x if and only if for all i=1,,mi=1,\ldots,m, ρ(A1Ai)=w(A1Ai)\rho(A_{1}\cup\cdots\cup A_{i})=w(A_{1}\cup\cdots\cup A_{i}).
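One standard way to realize a maximizer satisfying the condition of Lemma 3 is Edmonds' greedy algorithm: sort the coordinates of x in decreasing order and assign marginal gains of w along that order. The sketch below is our own illustration (representing w as a Python callable on frozensets is an assumption); it returns a maximizer rho in B together with the value f(x) = max_{rho in B} rho^T x.

# Sketch (ours): Edmonds' greedy algorithm for max_{rho in B} rho^T x,
# where B is the base polytope of a submodular w with w(emptyset) = 0.
def greedy_max(w, x):
    order = sorted(range(len(x)), key=lambda v: -x[v])   # decreasing x
    rho, prefix, prev = [0.0] * len(x), set(), 0.0
    for v in order:
        prefix.add(v)
        cur = w(frozenset(prefix))
        rho[v] = cur - prev            # marginal gain w(A + v) - w(A)
        prev = cur
    return rho, sum(r * xv for r, xv in zip(rho, x))

# Unit cut-cost of a 3-node hyperedge: w(S) = min(|S|, |e \ S|, 1).
w = lambda S: min(len(S), 3 - len(S), 1)
print(greedy_max(w, [0.9, 0.5, 0.1]))  # rho = [1.0, 0.0, -1.0], f(x) ~ 0.8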

Recall that (ϕ^,z^)(\hat{\phi},\hat{z}) and x^\hat{x} denote the optimal solutions of (A.1) and (A.2) respectively. We start with a lemma on the locality of the optimal solutions.

Lemma 4 (Lemma 2 in the main paper).

We have

esupp(ϕ^)ϑevol(supp(x^))Δ1.\sum_{e\in\textnormal{supp}(\hat{\phi})}\vartheta_{e}~{}\leq~{}\textnormal{vol}(\textnormal{supp}(\hat{x}))~{}\leq~{}\|\Delta\|_{1}.

Moreover, if σ>0\sigma>0, then vol(supp(z^))=vol(supp(x^))\textnormal{vol}(\textnormal{supp}(\hat{z}))=\textnormal{vol}(\textnormal{supp}(\hat{x})).

Proof.

To see the first inequality, note that if x^v=0\hat{x}_{v}=0 for every vev\in e for some ee, then fe(x^)=0f_{e}(\hat{x})=0. By Lemma 2, this means ϕ^e=0\hat{\phi}_{e}=0. Thus, ϕ^e0\hat{\phi}_{e}\neq 0 only if there is some vev\in e such that x^v0\hat{x}_{v}\neq 0. Therefore, we have that

esupp(ϕ^)ϑevsupp(x^)eE:veϑe=vsupp(x^)dv=vol(supp(x^)).\sum_{e\in\textnormal{supp}(\hat{\phi})}\vartheta_{e}\leq\sum_{v\in\textnormal{supp}(\hat{x})}\sum_{e\in E:v\in e}\vartheta_{e}=\sum_{v\in\textnormal{supp}(\hat{x})}d_{v}=\textnormal{vol}(\textnormal{supp}(\hat{x})).

To see the last inequality, note that, by the first order optimality condition of (A.2), if x^v0\hat{x}_{v}\neq 0 then we must have

Δvdv=eEϑefe(x^)q1ρ^e,v+σdvx^vq1,for someρ^efe(x^)=argmaxρeBeρeTx^.\Delta_{v}-d_{v}=\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q-1}\hat{\rho}_{e,v}+\sigma d_{v}\hat{x}_{v}^{q-1},~{}~{}\mbox{for some}~{}~{}\hat{\rho}_{e}\in\partial f_{e}(\hat{x})=\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}\hat{x}. (A.5)

Denote N:=supp(x^)N:=\textnormal{supp}(\hat{x}) and E[N]:={eE|vNfor allve}E[N]:=\{e\in E~{}|~{}v\in N~{}\mbox{for all}~{}v\in e\}. Note that E[N]N=E[N]\cap\partial N=\emptyset, and E[N]N={eE|vNfor someve}E[N]\cup\partial N=\{e\in E~{}|~{}v\in N~{}\mbox{for some}~{}v\in e\}, that is, E[N]NE[N]\cup\partial N contains all hyperedges that are incident to some node in NN. Moreover, we have that for any ρ^eargmaxρeBeρeTx^\hat{\rho}_{e}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}\hat{x},

vNρ^e,v=ρ^e(N)={we(N),ifeN,0,ifeE[N],\sum_{v\in N}\hat{\rho}_{e,v}=\hat{\rho}_{e}(N)=\left\{\begin{array}[]{ll}w_{e}(N),&\mbox{if}~{}e\in\partial N,\\ 0,&\mbox{if}~{}e\in E[N],\end{array}\right.

where ρ^e(N)=we(N)\hat{\rho}_{e}(N)=w_{e}(N) for eNe\in\partial N follows from Lemma 3, since x^v>0\hat{x}_{v}>0 for vNv\in N and x^v=0\hat{x}_{v}=0 for vNv\not\in N. The equality ρ^e(N)=0\hat{\rho}_{e}(N)=0 for eE[N]e\in E[N] follows from ρ^e(N)=ρ^e(e)=ρ^e(V)=we(V)=0\hat{\rho}_{e}(N)=\hat{\rho}_{e}(e)=\hat{\rho}_{e}(V)=w_{e}(V)=0, because eNe\subseteq N, ρ^e,v=0\hat{\rho}_{e,v}=0 for all vev\not\in e, and we(V)=0w_{e}(V)=0 for a cut-cost function (cutting nothing incurs no penalty).

Taking sums over vNv\in N on both sides of equation (A.5) we obtain

Δ(N)vol(N)=vNeEϑefe(x^)q1ρ^e,v+vNσdvx^vq1=vNeE[N]ϑefe(x^)q1ρ^e,v+vNeNϑefe(x^)q1ρ^e,v+vNσdvx^vq1=eE[N]ϑefe(x^)q1vNρ^e,v+eNϑefe(x^)q1vNρ^e,v+vNσdvx^vq1=0+eNϑefe(x^)q1we(N)+vNσdvx^vq10.\begin{split}\Delta(N)-\textnormal{vol}(N)~{}&=~{}\sum_{v\in N}\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q-1}\hat{\rho}_{e,v}+\sum_{v\in N}\sigma d_{v}\hat{x}_{v}^{q-1}\\ &=~{}\sum_{v\in N}\sum_{e\in E[N]}\vartheta_{e}f_{e}(\hat{x})^{q-1}\hat{\rho}_{e,v}+\sum_{v\in N}\sum_{e\in\partial N}\vartheta_{e}f_{e}(\hat{x})^{q-1}\hat{\rho}_{e,v}+\sum_{v\in N}\sigma d_{v}\hat{x}_{v}^{q-1}\\ &=~{}\sum_{e\in E[N]}\vartheta_{e}f_{e}(\hat{x})^{q-1}\sum_{v\in N}\hat{\rho}_{e,v}+\sum_{e\in\partial N}\vartheta_{e}f_{e}(\hat{x})^{q-1}\sum_{v\in N}\hat{\rho}_{e,v}+\sum_{v\in N}\sigma d_{v}\hat{x}_{v}^{q-1}\\ &=~{}0+\sum_{e\in\partial N}\vartheta_{e}f_{e}(\hat{x})^{q-1}w_{e}(N)+\sum_{v\in N}\sigma d_{v}\hat{x}_{v}^{q-1}\\ &\geq~{}0.\end{split}

The second equality follows from ρ^e,v=0\hat{\rho}_{e,v}=0 for all vev\not\in e. This proves vol(supp(x^))Δ(supp(x^))Δ1\textnormal{vol}(\textnormal{supp}(\hat{x}))\leq\Delta(\textnormal{supp}(\hat{x}))\leq\|\Delta\|_{1}.

Finally, if σ>0\sigma>0, then vol(supp(z^))=vol(supp(x^))\textnormal{vol}(\textnormal{supp}(\hat{z}))=\textnormal{vol}(\textnormal{supp}(\hat{x})) follows from Lemma 2, which gives z^vp=x^vq\hat{z}_{v}^{p}=\hat{x}_{v}^{q} for all vVv\in V. ∎

The following inequality is a special case of Hölder’s inequality for degree-weighted norms. It will become useful later.

Lemma 5.

For x|V|x\in\mathbb{R}^{|V|} and p>1p>1 we have that

(vVdv|xv|)pvol(supp(x))p1vVdv|xv|p.\Bigg{(}\sum_{v\in V}d_{v}|x_{v}|\Bigg{)}^{p}\leq\textnormal{vol}(\textnormal{supp}(x))^{p-1}\sum_{v\in V}d_{v}|x_{v}|^{p}.
Proof.

Let q=p/(p1)q=p/(p-1). Applying Hölder's inequality, we have

vVdv|xv|=vsupp(x)|dv1/q||dv1/pxv|(vsupp(x)dv)1/q(vsupp(x)dv|xv|p)1/p=vol(supp(x))1/q(vVdv|xv|p)1/p.\begin{split}\sum_{v\in V}d_{v}|x_{v}|=\sum_{v\in\textnormal{supp}(x)}|d_{v}^{1/q}||d_{v}^{1/p}x_{v}|&\leq\Bigg{(}\sum_{v\in\textnormal{supp}(x)}d_{v}\Bigg{)}^{1/q}\Bigg{(}\sum_{v\in\textnormal{supp}(x)}d_{v}|x_{v}|^{p}\Bigg{)}^{1/p}\\ &=\textnormal{vol}(\textnormal{supp}(x))^{1/q}\Bigg{(}\sum_{v\in V}d_{v}|x_{v}|^{p}\Bigg{)}^{1/p}.\end{split}

Lemma 6 (Lemma I.2 in [31]).

For any x+|V|{0}x\in\mathbb{R}^{|V|}_{+}\setminus\{0\} and q1q\geq 1, one has

eEϑefe(x)qvVdvxvqc(x)qqq,\frac{\sum_{e\in E}\vartheta_{e}f_{e}(x)^{q}}{\sum_{v\in V}d_{v}x_{v}^{q}}\geq\frac{c(x)^{q}}{q^{q}},

where

c(x):=minh0vol({vV|xvq>h})vol({vV|xvq>h})=minh0vol({vV|xv>h})vol({vV|xv>h}).c(x):=\min_{h\geq 0}\frac{\textnormal{vol}(\partial\{v\in V|x_{v}^{q}>h\})}{\textnormal{vol}(\{v\in V|x_{v}^{q}>h\})}=\min_{h\geq 0}\frac{\textnormal{vol}(\partial\{v\in V|x_{v}>h\})}{\textnormal{vol}(\{v\in V|x_{v}>h\})}.
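Since the ratio in c(x) changes only at the distinct values of x, c(x) can be computed by a simple sweep. A minimal sketch follows (ours; each hyperedge carries a callable w_e as in the earlier greedy sketch, evaluated on the cut side restricted to e, and vol(∂S) = sum over e of theta_e * w_e(S) as in the paper):

# Sketch (ours): c(x) via a sweep over the distinct entries of x >= 0.
# For an uncut hyperedge, w(S & e) = 0, so summing over all edges is safe.
def sweep_conductance(x, d, hyperedges):
    best = float("inf")
    for h in sorted({0.0} | set(x)):
        S = {v for v in range(len(x)) if x[v] > h}
        if not S:
            continue
        vol = sum(d[v] for v in S)
        cut = sum(theta * w(frozenset(S & set(nodes)))
                  for nodes, theta, w in hyperedges)
        best = min(best, cut / vol)
    return best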

Recall that the objective function of our primal diffusion problem (A.1) consists of two parts. The first part is eEϑeϕep\sum_{e\in E}\vartheta_{e}\phi_{e}^{p} and it penalizes the cost of flow routing; the second part is vVdvzvp\sum_{v\in V}d_{v}z_{v}^{p} and it penalizes the cost of excess mass. An immediate consequence of Lemma 6 is the inequality in Lemma 7 that relates the cost of optimal flow routing eEϑeϕ^ep\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p} and the cost of excess mass vVdvz^vp\sum_{v\in V}d_{v}\hat{z}_{v}^{p} at optimality.

For h>0h>0, recall that the sweep sets are defined as Sh:={vV|x^vh}S_{h}:=\{v\in V|\hat{x}_{v}\geq h\}.

Let h^argminh>0Φ(Sh)\hat{h}\in\operatorname*{argmin}_{h>0}\Phi(S_{h}) and denote S^=Sh^\hat{S}=S_{\hat{h}}. That is, S^=Sh\hat{S}=S_{h} for some h>0h>0 and Φ(S^)Φ(Sh)\Phi(\hat{S})\leq\Phi(S_{h}) for all h>0h>0.

Lemma 7.

For p>1p>1 and q=p/(p1)q=p/(p-1) we have that

eEϑeϕ^ep(Φ(S^)q)qvVdvz^vp.\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}\geq\left(\frac{\Phi(\hat{S})}{q}\right)^{q}\sum_{v\in V}d_{v}\hat{z}_{v}^{p}.
Proof.

By Lemma 2,

eEϑeϕ^ep=eEϑefe(x^)qandvVdvz^vp=vVdvx^vq,\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}=\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q}\quad\mbox{and}\quad\sum_{v\in V}d_{v}\hat{z}_{v}^{p}=\sum_{v\in V}d_{v}\hat{x}_{v}^{q},

and the result follows from applying Lemma 6. ∎

Given a vector a|V|a\in\mathbb{R}^{|V|} and a set SVS\subseteq V, recall that we write a(S)=vSava(S)=\sum_{v\in S}a_{v}. This defines a modular set function aa on subsets of VV. The Lovász extension of a modular function aa is simply f(x)=aTxf(x)=a^{T}x [4]. Since all modular functions are also submodular, we arrive at the following lemma, which follows from a classical property of the Choquet integral/Lovász extension.

Lemma 8.

We have that

ΔTx^=h=0+Δ(Sh)𝑑h,dTx^=h=0+vol(Sh)𝑑h,fe(x^)=h=0+we(Sh)𝑑h.\Delta^{T}\hat{x}=\int_{h=0}^{+\infty}\Delta(S_{h})dh,\quad d^{T}\hat{x}=\int_{h=0}^{+\infty}\textnormal{vol}(S_{h})dh,\quad f_{e}(\hat{x})=\int_{h=0}^{+\infty}w_{e}(S_{h})dh.
Proof.

Recall that, by definition, vol(S)=d(S)\textnormal{vol}(S)=d(S) where dd is the degree vector. Δ\Delta and dd are modular functions on 2V2^{V} and wew_{e} is a submodular function on 2V2^{V}. The Lovász extension of Δ\Delta and dd are ΔTx\Delta^{T}x and dTxd^{T}x, respectively. The Lovász extension of wew_{e} is fe(x)f_{e}(x). The results then follow immediately from representing the Lovász extensions using Choquet integrals. See, e.g., Proposition 3.1 in [4]. ∎
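For intuition, note that hShh\mapsto S_{h} is piecewise constant, so each integral above is a finite sum. Writing a_1 > \cdots > a_m for the distinct nonzero values of \hat{x}, T_i := \{v\in V~|~\hat{x}_v \geq a_i\}, and a_{m+1} := 0, one has, for instance,

f_{e}(\hat{x}) \;=\; \int_{h=0}^{+\infty} w_{e}(S_{h})\,dh \;=\; \sum_{i=1}^{m}(a_{i}-a_{i+1})\,w_{e}(T_{i}),

and likewise for \Delta^{T}\hat{x} and d^{T}\hat{x} with w_{e} replaced by the modular functions \Delta and d.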

A.2 Proof of Theorem 1 in the main paper

We restate the theorem below with respect to the general formulations (A.1) and (A.2) for any p2p\geq 2 and q=p/(p1)q=p/(p-1).

Let us recall that the sweep sets are defined as Sh:={vV|x^vh}S_{h}:=\{v\in V|\hat{x}_{v}\geq h\}.

Theorem 9.

Under Assumptions A.1A.2A.3, for some h>0h>0 we have that

Φ(Sh)O(Φ(C)1/qαβ).\Phi(S_{h})\leq O\left(\frac{\Phi(C)^{1/q}}{\alpha\beta}\right).

Recall that S^\hat{S} is such that S^=Sh\hat{S}=S_{h} for some h>0h>0 and Φ(S^)Φ(Sh)\Phi(\hat{S})\leq\Phi(S_{h}) for all h>0h>0. We will assume without loss of generality that Φ(C)(Φ(S^)/q)q\Phi(C)\leq(\Phi(\hat{S})/q)^{q}, as otherwise Φ(S^)<qΦ(C)1/q\Phi(\hat{S})<q\Phi(C)^{1/q} and the statement in Theorem 9 already holds.

Denote ν^:=eEϑeϕ^ep\hat{\nu}:=\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}, the cost of optimal flow routing. The following claim states that ν^\hat{\nu} must be large.

Claim A.1.

ν^vol(C)p/vol(C)p1.\hat{\nu}\geq\textnormal{vol}(C)^{p}/\textnormal{vol}(\partial C)^{p-1}.

Proof.

The proof of this claim follows from a case analysis on the total amount of excess mass σvVdvz^v\sigma\sum_{v\in V}d_{v}\hat{z}_{v} at optimality. Intuitively, if the excess is small, then naturally there must be a large amount of flow in order to satisfy the primal constraint; if the excess is large, then Lemma 7 and Lemma 5 guarantee that flow is also large. We give details below.

Suppose that σvVdvz^v<vol(C)\sigma\sum_{v\in V}d_{v}\hat{z}_{v}<\textnormal{vol}(C). Note that this also includes the case where σ=0\sigma=0. By Assumption A.2 there is at least Δ(C)3vol(C)\Delta(C)\geq 3\textnormal{vol}(C) amount of source mass trapped in CC at the beginning. Moreover, the primal constraint enforces that the nodes in CC can settle at most vC(dv+σdvz^v)vol(C)+vVσdvz^v<2vol(C)\sum_{v\in C}(d_{v}+\sigma d_{v}\hat{z}_{v})\leq\textnormal{vol}(C)+\sum_{v\in V}\sigma d_{v}\hat{z}_{v}<2\textnormal{vol}(C) amount of mass. Therefore, the remaining at least vol(C)\textnormal{vol}(C) amount of mass needs to get out of CC using the hyperedges in C\partial C. That is, the net amount of mass that moves from CC to VCV\setminus C satisfies eCϑer^e(C)vol(C)\sum_{e\in\partial C}\vartheta_{e}\hat{r}_{e}(C)\geq\textnormal{vol}(C). We focus on the cost of ϕ^\hat{\phi} restricted to these hyperedges alone. It is easy to see that

eCϑeϕ^ep\displaystyle\sum_{e\in\partial C}\vartheta_{e}\hat{\phi}_{e}^{p}~{} minϕ+|C|eCϑeϕepsubject tor^eϕeBe,eC\displaystyle\geq~{}\min_{\phi\in\mathbb{R}^{|\partial C|}_{+}}~{}\sum_{e\in\partial C}\vartheta_{e}\phi_{e}^{p}~{}~{}\mbox{subject to}~{}~{}\hat{r}_{e}\in\phi_{e}B_{e},~{}\forall e\in\partial C (A.6a)
minϕ+|C|eCϑeϕepsubject toeCϑer^e(C)eCϑeϕewe(C)\displaystyle\geq~{}\min_{\phi\in\mathbb{R}^{|\partial C|}_{+}}~{}\sum_{e\in\partial C}\vartheta_{e}\phi_{e}^{p}~{}~{}\mbox{subject to}~{}\sum_{e\in\partial C}\vartheta_{e}\hat{r}_{e}(C)\leq\sum_{e\in\partial C}\vartheta_{e}\phi_{e}w_{e}(C) (A.6b)
minϕ+|C|eCϑeϕepsubject tovol(C)eCϑeϕewe(C).\displaystyle\geq~{}\min_{\phi\in\mathbb{R}^{|\partial C|}_{+}}~{}\sum_{e\in\partial C}\vartheta_{e}\phi_{e}^{p}~{}~{}\mbox{subject to}~{}~{}\textnormal{vol}(C)\leq\sum_{e\in\partial C}\vartheta_{e}\phi_{e}w_{e}(C). (A.6c)

The first inequality follows because ϕ^\hat{\phi} restricted to C\partial C is a feasible solution in problem (A.6a). The second inequality follows because r^eϕeBe\hat{r}_{e}\in\phi_{e}B_{e} implies r^e(C)ϕewe(C)\hat{r}_{e}(C)\leq\phi_{e}w_{e}(C), therefore every feasible solution for (A.6a) is also a feasible solution for (A.6b). The third inequality follows because vol(C)eEϑer^e(C)\textnormal{vol}(C)\leq\sum_{e\in E}\vartheta_{e}\hat{r}_{e}(C). Let ϕ¯+|C|\bar{\phi}\in\mathbb{R}^{|\partial C|}_{+} be an optimal solution of problem (A.6c). The optimality condition of (A.6c) is given by (we may assume the pp factor in the gradient of eCϑeϕep\sum_{e\in\partial C}\vartheta_{e}\phi_{e}^{p} is absorbed into multipliers λ\lambda and ηe\eta_{e})

ϑeϕep1λϑewe(C)ηe=0,eCϕe0,ηe0,ϕeηe=0,eCvol(C)eCϑeϕewe(C)λ0,λ(vol(C)eCϑeϕewe(C))=0.\begin{split}&\vartheta_{e}\phi_{e}^{p-1}-\lambda\vartheta_{e}w_{e}(C)-\eta_{e}=0,~{}\forall e\in\partial C\\ &\phi_{e}\geq 0,~{}\eta_{e}\geq 0,~{}\phi_{e}\eta_{e}=0,~{}\forall e\in\partial C\\ &\textnormal{vol}(C)\leq\sum_{e\in\partial C}\vartheta_{e}\phi_{e}w_{e}(C)\\ &\lambda\geq 0,~{}\lambda\bigg{(}\textnormal{vol}(C)-\sum_{e\in\partial C}\vartheta_{e}\phi_{e}w_{e}(C)\bigg{)}=0.\end{split} (A.7)

If λ=0\lambda=0, then the conditions in (A.7) imply that ϑeϕep1=ηe\vartheta_{e}\phi_{e}^{p-1}=\eta_{e}, but then by complementary slackness we would obtain ϕe=ηe=0\phi_{e}=\eta_{e}=0 for all eCe\in\partial C, which would violate feasibility. Therefore we must have λ>0\lambda>0, and consequently, we have that

eCϑeϕ¯ewe(C)=vol(C).\sum_{e\in\partial C}\vartheta_{e}\bar{\phi}_{e}w_{e}(C)=\textnormal{vol}(C). (A.8)

Moreover, the conditions in (A.7) imply that for eCe\in\partial C, ϕ¯e=0\bar{\phi}_{e}=0 if and only if we(C)=0w_{e}(C)=0, and hence we have that

ϑeϕ¯ep1=λϑewe(C),eC.\vartheta_{e}\bar{\phi}_{e}^{p-1}=\lambda\vartheta_{e}w_{e}(C),~{}\forall e\in\partial C. (A.9)

Rearranging (A.9), we get

ϕ¯ewe(C)=λ1/(p1)we(C)p/(p1),eC.\bar{\phi}_{e}w_{e}(C)=\lambda^{1/(p-1)}w_{e}(C)^{p/(p-1)},~{}\forall e\in\partial C.

Substituting the above into (A.8),

vol(C)=eCϑeϕ¯ewe(C)=eCϑeλ1/(p1)we(C)p/(p1),\textnormal{vol}(C)=\sum_{e\in\partial C}\vartheta_{e}\bar{\phi}_{e}w_{e}(C)=\sum_{e\in\partial C}\vartheta_{e}\lambda^{1/(p-1)}w_{e}(C)^{p/(p-1)},

this gives

λ1/(p1)=vol(C)eCϑewe(C)p/(p1).\lambda^{1/(p-1)}=\frac{\textnormal{vol}(C)}{\sum_{e\in\partial C}\vartheta_{e}w_{e}(C)^{p/(p-1)}}.

Therefore, the solution ϕ¯\bar{\phi} for (A.6c) is given by

ϕ¯e=λ1/(p1)we(C)1/(p1)=vol(C)we(C)1/(p1)eCϑewe(C)p/(p1),eC,\bar{\phi}_{e}=\lambda^{1/(p-1)}w_{e}(C)^{1/(p-1)}=\frac{\textnormal{vol}(C)w_{e}(C)^{1/(p-1)}}{\sum_{e^{\prime}\in\partial C}\vartheta_{e^{\prime}}w_{e^{\prime}}(C)^{p/(p-1)}},\quad\forall e\in\partial C,

and hence,

ν^=eEϑeϕ^epeCϑeϕ^epeCϑeϕ¯ep=eCϑevol(C)pwe(C)p/(p1)(eCϑewe(C)p/(p1))p=vol(C)peCϑewe(C)p/(p1)(eCϑewe(C)p/(p1))p=vol(C)p(eCϑewe(C)p/(p1))p1vol(C)p(eCϑewe(C))p1\begin{split}\hat{\nu}=\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}\geq\sum_{e\in\partial C}\vartheta_{e}\hat{\phi}_{e}^{p}\geq\sum_{e\in\partial C}\vartheta_{e}\bar{\phi}_{e}^{p}&=\sum_{e\in\partial C}\vartheta_{e}\frac{\textnormal{vol}(C)^{p}w_{e}(C)^{p/(p-1)}}{\left(\sum_{e^{\prime}\in\partial C}\vartheta_{e^{\prime}}w_{e^{\prime}}(C)^{p/(p-1)}\right)^{p}}\\ &=\frac{\textnormal{vol}(C)^{p}\sum_{e\in\partial C}\vartheta_{e}w_{e}(C)^{p/(p-1)}}{\left(\sum_{e^{\prime}\in\partial C}\vartheta_{e^{\prime}}w_{e^{\prime}}(C)^{p/(p-1)}\right)^{p}}\\ &=\frac{\textnormal{vol}(C)^{p}}{\left(\sum_{e^{\prime}\in\partial C}\vartheta_{e^{\prime}}w_{e^{\prime}}(C)^{p/(p-1)}\right)^{p-1}}\\ &\geq\frac{\textnormal{vol}(C)^{p}}{\big{(}\sum_{e^{\prime}\in\partial C}\vartheta_{e^{\prime}}w_{e^{\prime}}(C)\big{)}^{p-1}}\end{split}

where the last inequality follows because we(C)[0,1]w_{e}(C)\in[0,1] and p1p\geq 1.

Suppose now that σvVdvz^vvol(C)\sigma\sum_{v\in V}d_{v}\hat{z}_{v}\geq\textnormal{vol}(C). Because Φ(C)(Φ(S^)/q)q\Phi(C)\leq(\Phi(\hat{S})/q)^{q} (recall that we assumed this without loss of generality), by Assumption A.3, we know that σ<(Φ(S^)/q)q\sigma<(\Phi(\hat{S})/q)^{q}. Therefore,

ν^=eEϑeϕ^ep(i)σvVdvz^vp(ii)σ(vVdvz^v)pvol(supp(z^))p1(iii)σp(vVdvz^v)pσp1(3vol(C)/β)p1(iv)σp(vVdvz^v)pvol(C)p1(v)vol(C)pvol(C)p1.\begin{split}\hat{\nu}=\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}&\stackrel{{\scriptstyle(i)}}{{\geq}}\sigma\sum_{v\in V}d_{v}\hat{z}_{v}^{p}\\ &\stackrel{{\scriptstyle(ii)}}{{\geq}}\frac{\sigma\left(\sum_{v\in V}d_{v}\hat{z}_{v}\right)^{p}}{\textnormal{vol}(\textnormal{supp}(\hat{z}))^{p-1}}\\ &\stackrel{{\scriptstyle(iii)}}{{\geq}}\frac{\sigma^{p}\left(\sum_{v\in V}d_{v}\hat{z}_{v}\right)^{p}}{\sigma^{p-1}(3\textnormal{vol}(C)/\beta)^{p-1}}\\ &\stackrel{{\scriptstyle(iv)}}{{\geq}}\frac{\sigma^{p}\left(\sum_{v\in V}d_{v}\hat{z}_{v}\right)^{p}}{\textnormal{vol}(\partial C)^{p-1}}\\ &\stackrel{{\scriptstyle(v)}}{{\geq}}\frac{\textnormal{vol}(C)^{p}}{\textnormal{vol}(\partial C)^{p-1}}.\end{split}

(i)(i) is due to Lemma 7 together with the bound σ<(Φ(S^)/q)q\sigma<(\Phi(\hat{S})/q)^{q} established above. (ii)(ii) is due to Lemma 5. (iii)(iii) is due to Lemma 4, which gives vol(supp(z^))Δ1\textnormal{vol}(\textnormal{supp}(\hat{z}))\leq\|\Delta\|_{1}, and Assumptions A.1 and A.2, which give Δ13vol(C)/β\|\Delta\|_{1}\leq 3\textnormal{vol}(C)/\beta, so vol(supp(z^))p1(3vol(C)/β)p1\textnormal{vol}(\textnormal{supp}(\hat{z}))^{p-1}\leq(3\textnormal{vol}(C)/\beta)^{p-1} for p1p\geq 1. (iv)(iv) is due to Assumption A.3 that σβvol(C)3vol(C)\sigma\leq\frac{\beta\textnormal{vol}(\partial C)}{3\textnormal{vol}(C)}, so (3σvol(C)/β)p1vol(C)p1(3\sigma\textnormal{vol}(C)/\beta)^{p-1}\leq\textnormal{vol}(\partial C)^{p-1} for p1p\geq 1. (v)(v) is due to the assumption that σvVdvz^vvol(C)\sigma\sum_{v\in V}d_{v}\hat{z}_{v}\geq\textnormal{vol}(C). ∎

To connect Φ(Sh)\Phi(S_{h}) with Φ(C)\Phi(C), we define the length of a hyperedge eEe\in E as

l^(e):={max(1/vol(C)1/q,fe(x^)/ν^1/q),iffe(x^)>0,0,otherwise.\hat{l}(e):=\left\{\begin{array}[]{ll}\max(1/\textnormal{vol}(C)^{1/q},f_{e}(\hat{x})/\hat{\nu}^{1/q}),&\mbox{if}~{}f_{e}(\hat{x})>0,\\ 0,&\mbox{otherwise}.\end{array}\right.

The next claim follows from simple algebraic computations and the locality of solutions in Lemma 4.

Claim A.2.

eEϑefe(x^)l^(e)q14ν^1/q/β\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})\hat{l}(e)^{q-1}\leq 4\hat{\nu}^{1/q}/\beta.

Proof.

For eEe\in E, define l(e):=fe(x^)/ν^1/ql(e):=f_{e}(\hat{x})/\hat{\nu}^{1/q}. Then l(e)l^(e)l(e)\leq\hat{l}(e). Moreover,

e:l(e)<l^(e)ϑeesupp(ϕ^)ϑevol(supp(x^))Δ1=3αvol(S)3βvol(C).\sum_{e:l(e)<\hat{l}(e)}\vartheta_{e}\leq\sum_{e\in\textnormal{supp}(\hat{\phi})}\vartheta_{e}\leq\textnormal{vol}(\textnormal{supp}(\hat{x}))\leq\|\Delta\|_{1}=\frac{3}{\alpha}\textnormal{vol}(S)\leq\frac{3}{\beta}\textnormal{vol}(C).

The first inequality follows because l(e)<l^(e)l(e)<\hat{l}(e) only if l(e)0l(e)\neq 0, and by Lemma 2, l(e)0l(e)\neq 0 if and only if ϕ^e0\hat{\phi}_{e}\neq 0. The second and the third inequalities are due to Lemma 4. The second-to-last equality follows from the diffusion setting (A.4) and Assumption A.2 that δ=3/α\delta=3/\alpha. The last inequality follows from Assumption A.1. Therefore,

eEϑefe(x^)l^(e)q1=e:l(e)=l^(e)ϑefe(x^)fe(x^)q1ν^(q1)/q+e:l(e)<l^(e)ϑefe(x^)1vol(C)(q1)/qe:l(e)=l^(e)ϑefe(x^)fe(x^)q1ν^(q1)/q+e:l(e)<l^(e)ϑeν^1/qvol(C)1/q1vol(C)(q1)/q=1ν^(q1)/qe:l(e)=l^(e)ϑefe(x^)q+ν^1/qvol(C)e:l(e)<l^(e)ϑe1ν^(q1)/qeEϑefe(x^)q+ν^1/qvol(C)3vol(C)β=ν^ν^(q1)/q+3ν^1/qβ4ν^1/qβ\begin{split}\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})\hat{l}(e)^{q-1}&=\sum_{e:l(e)=\hat{l}(e)}\vartheta_{e}f_{e}(\hat{x})\frac{f_{e}(\hat{x})^{q-1}}{\hat{\nu}^{(q-1)/q}}+\sum_{e:l(e)<\hat{l}(e)}\vartheta_{e}f_{e}(\hat{x})\frac{1}{\textnormal{vol}(C)^{(q-1)/q}}\\ &\leq\sum_{e:l(e)=\hat{l}(e)}\vartheta_{e}f_{e}(\hat{x})\frac{f_{e}(\hat{x})^{q-1}}{\hat{\nu}^{(q-1)/q}}+\sum_{e:l(e)<\hat{l}(e)}\vartheta_{e}\frac{\hat{\nu}^{1/q}}{\textnormal{vol}(C)^{1/q}}\frac{1}{\textnormal{vol}(C)^{(q-1)/q}}\\ &=\frac{1}{\hat{\nu}^{(q-1)/q}}\sum_{e:l(e)=\hat{l}(e)}\vartheta_{e}f_{e}(\hat{x})^{q}+\frac{\hat{\nu}^{1/q}}{\textnormal{vol}(C)}\sum_{e:l(e)<\hat{l}(e)}\vartheta_{e}\\ &\leq\frac{1}{\hat{\nu}^{(q-1)/q}}\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q}+\frac{\hat{\nu}^{1/q}}{\textnormal{vol}(C)}\frac{3\textnormal{vol}(C)}{\beta}\\ &=\frac{\hat{\nu}}{\hat{\nu}^{(q-1)/q}}+\frac{3\hat{\nu}^{1/q}}{\beta}\\ &\leq\frac{4\hat{\nu}^{1/q}}{\beta}\end{split}

where the last equality follows from Lemma 2 that ν^=eEϑeϕ^ep=eEϑefe(x^)q\hat{\nu}=\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}=\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q}. ∎

By the strong duality between (A.1) and (A.2), we know that

(Δd)Tx^1qeEϑefe(x^)qσqvVdvx^vq=1peEϑeϕ^ep+σpvVdvz^vp.(\Delta-d)^{T}\hat{x}-\frac{1}{q}\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q}-\frac{\sigma}{q}\sum_{v\in V}d_{v}\hat{x}_{v}^{q}=\frac{1}{p}\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}+\frac{\sigma}{p}\sum_{v\in V}d_{v}\hat{z}_{v}^{p}.

Hence, by Lemma 2, we get

(Δd)Tx^1qeEϑefe(x^)q+1peEϑeϕ^ep=eEϑeϕ^ep=ν^.(\Delta-d)^{T}\hat{x}\geq\frac{1}{q}\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q}+\frac{1}{p}\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}=\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}=\hat{\nu}.

It then follows that

eEϑefe(x^)l^(e)q1(Δd)Tx^eEϑefe(x^)l^(e)q1ν^(i)4ν^1/qβν^=4βν^1/p(ii)4vol(C)1/qβvol(C),\frac{\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})\hat{l}(e)^{q-1}}{(\Delta-d)^{T}\hat{x}}\leq\frac{\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})\hat{l}(e)^{q-1}}{\hat{\nu}}\stackrel{{\scriptstyle(i)}}{{\leq}}\frac{4\hat{\nu}^{1/q}}{\beta\hat{\nu}}=\frac{4}{\beta\hat{\nu}^{1/p}}\stackrel{{\scriptstyle(ii)}}{{\leq}}\frac{4\textnormal{vol}(\partial C)^{1/q}}{\beta\textnormal{vol}(C)}, (A.10)

where (i)(i) follows from Claim A.2 and (ii)(ii) follows from Claim A.1.

We can write the left-most ratio in (A.10) in its integral form, as follows. By Lemma 8, we have

(Δd)Tx^=h=0(Δ(Sh)vol(Sh))𝑑h,(\Delta-d)^{T}\hat{x}=\int_{h=0}^{\infty}(\Delta(S_{h})-\textnormal{vol}(S_{h}))dh,

and

eEϑefe(x^)l^(e)q1=eEϑeh=0we(Sh)𝑑hl^(e)q1=h=0eEϑewe(Sh)l^(e)q1dh=h=0eShϑewe(Sh)l^(e)q1dh,\begin{split}\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})\hat{l}(e)^{q-1}&=\sum_{e\in E}\vartheta_{e}\int_{h=0}^{\infty}w_{e}(S_{h})dh~{}\hat{l}(e)^{q-1}\\ &=\int_{h=0}^{\infty}\sum_{e\in E}\vartheta_{e}w_{e}(S_{h})\hat{l}(e)^{q-1}dh\\ &=\int_{h=0}^{\infty}\sum_{e\in\partial S_{h}}\vartheta_{e}w_{e}(S_{h})\hat{l}(e)^{q-1}dh,\end{split}

where the last equality follows from the fact that we(Sh)=0w_{e}(S_{h})=0 for eShe\not\in\partial S_{h}. Therefore, we get

h=0eShϑewe(Sh)l^(e)q1Δ(Sh)vol(Sh)𝑑h4vol(C)1/qβvol(C),\int_{h=0}^{\infty}\frac{\sum_{e\in\partial S_{h}}\vartheta_{e}w_{e}(S_{h})\hat{l}(e)^{q-1}}{\Delta(S_{h})-\textnormal{vol}(S_{h})}dh\leq\frac{4\textnormal{vol}(\partial C)^{1/q}}{\beta\textnormal{vol}(C)},

which means that there exists h>0h>0 such that

eShϑewe(Sh)l^(e)q1Δ(Sh)vol(Sh)4vol(C)1/qβvol(C).\frac{\sum_{e\in\partial S_{h}}\vartheta_{e}w_{e}(S_{h})\hat{l}(e)^{q-1}}{\Delta(S_{h})-\textnormal{vol}(S_{h})}\leq\frac{4\textnormal{vol}(\partial C)^{1/q}}{\beta\textnormal{vol}(C)}. (A.11)

Finally, we connect the left-hand side of inequality (A.11) to the conductance of ShS_{h}. For the denominator, by Assumption A.2, we have

Δ(Sh)vol(Sh)3αvol(Sh).\Delta(S_{h})-\textnormal{vol}(S_{h})\leq\frac{3}{\alpha}\textnormal{vol}(S_{h}). (A.12)

For the numerator, every hyperedge eShe\in\partial S_{h} must contain some u,veu,v\in e such that x^ux^v\hat{x}_{u}\neq\hat{x}_{v}, thus fe(x^)>0f_{e}(\hat{x})>0, which means l^(e)1/vol(C)1/q\hat{l}(e)\geq 1/\textnormal{vol}(C)^{1/q}. This gives

eShϑewe(Sh)l^(e)q1eShϑewe(Sh)vol(C)(q1)/q=vol(Sh)vol(C)(q1)/q.\sum_{e\in\partial S_{h}}\vartheta_{e}w_{e}(S_{h})\hat{l}(e)^{q-1}\geq\frac{\sum_{e\in\partial S_{h}}\vartheta_{e}w_{e}(S_{h})}{\textnormal{vol}(C)^{(q-1)/q}}=\frac{\textnormal{vol}(\partial S_{h})}{\textnormal{vol}(C)^{(q-1)/q}}. (A.13)

Putting (A.11), (A.12) and (A.13) together, there exists h>0h>0 such that

Φ(Sh)=vol(Sh)vol(Sh)12vol(C)1/qαβvol(C)1/q=12Φ(C)1/qαβ.\Phi(S_{h})=\frac{\textnormal{vol}(\partial S_{h})}{\textnormal{vol}(S_{h})}\leq\frac{12\textnormal{vol}(\partial C)^{1/q}}{\alpha\beta\textnormal{vol}(C)^{1/q}}=\frac{12\Phi(C)^{1/q}}{\alpha\beta}.

Appendix B Optimization algorithm for HFD

In this section we give details on an Alternating Minimization (AM) algorithm [5] that solves the primal problem (A.1). In Algorithm B.1 we write the basic AM steps in a slightly more general form than what is given by Algorithm 1 in the main paper. The key observation is that the AM method provides a unified framework to solve HFD, when the objective function of the primal problem (A.1) is penalized by any p\ell_{p}-norm for p2p\geq 2.

Algorithm B.1 Alternating Minimization for HFD

Initialization:

ϕ(0):=0,r(0):=0,se(0):=D1Ae[Δd]+,eE.\phi^{(0)}:=0,r^{(0)}:=0,s^{(0)}_{e}:=D^{-1}A_{e}\left[\Delta-d\right]_{+},\forall e\in E.

For k=0,1,2,k=0,1,2,\ldots do:

(ϕ(k+1),r(k+1)):=argmin(ϕ,r)𝒞eEϑe(ϕep+1σp1se(k)repp)\displaystyle(\phi^{(k+1)},r^{(k+1)}):=\operatorname*{argmin}\limits_{(\phi,r)\in\mathcal{C}}\sum\limits_{e\in E}\vartheta_{e}\left(\phi_{e}^{p}+\frac{1}{\sigma^{p-1}}\|s_{e}^{(k)}-r_{e}\|_{p}^{p}\right)
s(k+1):=argminseEϑesere(k+1)pp,s.t.ΔeEϑesed,se,v=0,ve.\displaystyle s^{(k+1)}:=\operatorname*{argmin}\limits_{s}\sum\limits_{e\in E}\vartheta_{e}\|s_{e}-r_{e}^{(k+1)}\|_{p}^{p},\hskip 5.69054pt\mbox{s.t.}~{}\Delta-\sum\limits_{e\in E}\vartheta_{e}s_{e}\leq d,\ s_{e,v}=0,\forall v\not\in e.

Let us remind the reader of the definitions and notation that we will use. We consider a generic hypergraph H=(V,E,𝒲)H=(V,E,\mathcal{W}) where 𝒲={we,ϑe}eE\mathcal{W}=\{w_{e},\vartheta_{e}\}_{e\in E} are submodular hyperedge weights. For each eEe\in E, we define a diagonal matrix Ae|V|×|V|A_{e}\in\mathbb{R}^{|V|\times|V|} such that [Ae]v,v=1[A_{e}]_{v,v}=1 if vev\in e and 0 otherwise. We use the notation reE|V|r\in\bigotimes_{e\in E}\mathbb{R}^{|V|} to represent a vector in the space |V||E|\mathbb{R}^{|V||E|}, where each re|V|r_{e}\in\mathbb{R}^{|V|} corresponds to a block in rr indexed by eEe\in E. For a vector re|V|r_{e}\in\mathbb{R}^{|V|}, re,vr_{e,v} is the entry in rer_{e} that corresponds to vVv\in V. For a vector x|V|x\in\mathbb{R}^{|V|}, [x]+:=max{x,0}[x]_{+}:=\max\{x,0\} where the maximum is taken entry-wise.

We denote 𝒞:={(ϕ,r)+|E|×(eE|V|)|reϕeBe,eE}\mathcal{C}:=\{(\phi,r)\in\mathbb{R}^{|E|}_{+}\times(\bigotimes_{e\in E}\mathbb{R}^{|V|})~{}|~{}r_{e}\in\phi_{e}B_{e},~{}\forall e\in E\}.

We will prove the equivalence between the primal diffusion problem (A.1) and its separable reformulation shortly, but let us start with a simple lemma that gives a closed-form solution for one of the AM sub-problems.

Lemma 1.

The optimal solution to the following problem

minseE|V|eEϑeserepp,s.t.ΔeEϑesed,se,v=0,ve.\min_{s\in\bigotimes_{e\in E}\mathbb{R}^{|V|}}\sum_{e\in E}\vartheta_{e}\|s_{e}-r_{e}\|_{p}^{p},~{}\mbox{s.t.}~{}\Delta-\sum_{e\in E}\vartheta_{e}s_{e}\leq d,~{}s_{e,v}=0,\forall v\not\in e. (B.1)

is given by

se=re+AeD1[ΔeEϑered]+,eE.s_{e}^{*}=r_{e}+A_{e}D^{-1}\Big{[}\Delta-\sum_{e^{\prime}\in E}\vartheta_{e^{\prime}}r_{e^{\prime}}-d\Big{]}_{+},~{}\forall e\in E. (B.2)
Proof.

Rewrite (B.1) as

minseE|V|\displaystyle\min_{s\in\bigotimes_{e\in E}\mathbb{R}^{|V|}} vVeEϑe|se,vre,v|p\displaystyle\sum_{v\in V}\sum_{e\in E}\vartheta_{e}|s_{e,v}-r_{e,v}|^{p}
s.t. ΔveEϑese,vdv,vV\displaystyle\Delta_{v}-\sum_{e\in E}\vartheta_{e}s_{e,v}\leq d_{v},~{}\forall v\in V
se,v=0,ve.\displaystyle s_{e,v}=0,~{}\forall v\not\in e.

Then it is immediate that (B.1) decomposes into |V||V| sub-problems indexed by vVv\in V,

minξv|Ev|eEvϑe|ξv,ere,v|p,s.t.ΔveEvϑeξv,edv,\min_{\xi_{v}\in\mathbb{R}^{|E_{v}|}}\sum_{e\in E_{v}}\vartheta_{e}|\xi_{v,e}-r_{e,v}|^{p},~{}\mbox{s.t.}~{}\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}\xi_{v,e}\leq d_{v}, (B.3)

where Ev:={eE|ve}E_{v}:=\{e\in E~{}|~{}v\in e\} is the set of hyperedges incident to vv, and we use ξv,e\xi_{v,e} for the entry in ξv\xi_{v} that corresponds to eEve\in E_{v}. Let ξv\xi_{v}^{*} denote the optimal solution for (B.3). We have that se,v=ξv,es^{*}_{e,v}=\xi^{*}_{v,e} if vev\in e and se,v=0s^{*}_{e,v}=0 otherwise. Therefore, it suffices to find ξv\xi_{v}^{*} for vVv\in V. The optimality condition of (B.3) is given by

pϑe|ξv,ere,v|p1sign(ξv,ere,v)ϑeλ0,eEv,\displaystyle p\vartheta_{e}|\xi_{v,e}-r_{e,v}|^{p-1}\mathop{\mathrm{sign}}(\xi_{v,e}-r_{e,v})-\vartheta_{e}\lambda\ni 0,~{}\forall e\in E_{v},
λ0,ΔveEvϑeξv,edv,λ(ΔveEvϑeξv,edv)=0,\displaystyle\lambda\geq 0,~{}\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}\xi_{v,e}\leq d_{v},~{}\lambda\Big{(}\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}\xi_{v,e}-d_{v}\Big{)}=0,

where

sign(a):={{1},ifa<0,{1},ifa>0,[1,1]ifa=0.\mathop{\mathrm{sign}}(a):=\left\{\begin{array}[]{ll}\{-1\},&\mbox{if}~{}a<0,\\ \{1\},&\mbox{if}~{}a>0,\\ \mbox{$[-1,1]$}&\mbox{if}~{}a=0.\end{array}\right.

There are two cases for λ\lambda. We show that in both cases the solution given by (B.2) is optimal.

Case 1. If λ>0\lambda>0, then we must have that pϑe|ξv,ere,v|p1>0p\vartheta_{e}|\xi_{v,e}-r_{e,v}|^{p-1}>0 for all eEve\in E_{v} (otherwise, the stationarity condition would be violated). This means that p|ξv,ere,v|p1=λp|\xi_{v,e}-r_{e,v}|^{p-1}=\lambda for all eEve\in E_{v}, that is, ξv,e1re1,v=ξv,e2re2,v>0\xi_{v,e_{1}}-r_{e_{1},v}=\xi_{v,e_{2}}-r_{e_{2},v}>0 for every e1,e2Eve_{1},e_{2}\in E_{v}. Denote tv:=ξv,ere,vt_{v}:=\xi_{v,e}-r_{e,v}. Because λ>0\lambda>0, by complementarity we have

ΔveEvϑe(tv+re,v)=ΔveEvϑeξv,e=dv,\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}(t_{v}+r_{e,v})=\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}\xi_{v,e}=d_{v},

which implies that tv=(eEvϑe)1(ΔveEvϑere,vdv)t_{v}=(\sum_{e\in E_{v}}\vartheta_{e})^{-1}(\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}r_{e,v}-d_{v}). Note that ΔveEvϑere,vdv>0\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}r_{e,v}-d_{v}>0 because ΔveEvϑeξv,edv=0\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}\xi_{v,e}-d_{v}=0 and ξv,e>re,v\xi_{v,e}>r_{e,v} for all eEve\in E_{v}. Therefore we have that

se,v=ξv,e=re,v+dv1[ΔveEvϑere,vdv]+.s^{*}_{e,v}=\xi^{*}_{v,e}=r_{e,v}+d_{v}^{-1}\Big{[}\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}r_{e,v}-d_{v}\Big{]}_{+}.

Case 2. If λ=0\lambda=0, then we have that pϑe|ξv,ere,v|p1sign(ξv,ere,v)0p\vartheta_{e}|\xi_{v,e}-r_{e,v}|^{p-1}\mathop{\mathrm{sign}}(\xi_{v,e}-r_{e,v})\ni 0 for all eEve\in E_{v}, which implies ξv,ere,v=0\xi_{v,e}-r_{e,v}=0 for all eEve\in E_{v}. Then we must have

ΔveEvϑere,v=ΔveEvϑeξv,edv.\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}r_{e,v}=\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}\xi_{v,e}\leq d_{v}.

Therefore we still have that

se,v=ξv,e=re,v=re,v+dv1[ΔveEvϑere,vdv]+.s^{*}_{e,v}=\xi^{*}_{v,e}=r_{e,v}=r_{e,v}+d_{v}^{-1}\Big{[}\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}r_{e,v}-d_{v}\Big{]}_{+}.

The required result then follows from the definition of AeA_{e} and DD. ∎
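The closed-form update (B.2) is cheap to apply: compute the excess [Δ − Σ_e ϑ_e r_e − d]_+ once, scale it by D^{-1}, and add the result to r_e on the nodes of each hyperedge. Below is a minimal, self-contained Python sketch (ours, reusing the toy data layout from the sketches in Appendix A; it assumes d_v > 0 for every node):

# Sketch (ours): the closed-form s-update (B.2).
def s_update(Delta, d, hyperedges, r):
    m = list(Delta)                            # Delta - sum_e theta_e * r_e
    for (nodes, theta), r_e in zip(hyperedges, r):
        for v in nodes:
            m[v] -= theta * r_e.get(v, 0.0)
    g = [max(m[v] - d[v], 0.0) / d[v] for v in range(len(d))]
    return [{v: r_e.get(v, 0.0) + g[v] for v in nodes}
            for (nodes, theta), r_e in zip(hyperedges, r)]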

We are now ready to show that the primal problem (A.1) can be cast into an equivalent separable formulation, which can then be solved by the AM method in Algorithm B.1. We give the reformulation under general p\ell_{p}-norm penalty and arbitrary ϑe>0\vartheta_{e}>0.

Lemma 2 (Lemma 3 in the main paper).

The following problem is equivalent to (A.1) for any σ>0\sigma>0, in the sense that (ϕ^,r^,z^)(\hat{\phi},\hat{r},\hat{z}) is optimal in (A.1) for some z^|V|\hat{z}\in\mathbb{R}^{|V|} if and only if (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) is optimal in (B.4) for some s^eE|V|\hat{s}\in\bigotimes_{e\in E}\mathbb{R}^{|V|}.

minϕ,r,s1peEϑe(ϕep+1σp1serepp)s.t.(ϕ,r)𝒞,ΔeEϑesed,se,v=0,ve.\begin{split}\min_{\phi,r,s}~{}&\frac{1}{p}\sum_{e\in E}\vartheta_{e}\left(\phi_{e}^{p}+\frac{1}{\sigma^{p-1}}\left\|s_{e}-r_{e}\right\|_{p}^{p}\right)\\ \textnormal{s.t.}~{}&(\phi,r)\in\mathcal{C},~{}\Delta-\sum_{e\in E}\vartheta_{e}s_{e}\leq d,~{}s_{e,v}=0,\forall v\not\in e.\end{split} (B.4)
Proof.

We will show the forward direction and the converse follows from exactly the same reasoning. Let ν^1\hat{\nu}_{1} and ν^2\hat{\nu}_{2} denote the optimal objective value of problems (A.1) and (B.4), respectively. Let (ϕ^,r^,z^)(\hat{\phi},\hat{r},\hat{z}) be an optimal solution for (A.1). Define s^e:=r^e+σAez^\hat{s}_{e}:=\hat{r}_{e}+\sigma A_{e}\hat{z} for eEe\in E. We show that (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) is an optimal solution for (B.4).

Because r^e,v=0\hat{r}_{e,v}=0 for all vev\not\in e, by the definition of AeA_{e}, we know that s^e,v=0\hat{s}_{e,v}=0 for all vev\not\in e. Moreover,

σDz^=σeEϑeAez^=eEϑe(s^er^e),\sigma D\hat{z}=\sigma\sum_{e\in E}\vartheta_{e}A_{e}\hat{z}=\sum_{e\in E}\vartheta_{e}(\hat{s}_{e}-\hat{r}_{e}),

so

ΔeEϑes^e=ΔeEϑer^eσDz^d.\Delta-\sum_{e\in E}\vartheta_{e}\hat{s}_{e}=\Delta-\sum_{e\in E}\vartheta_{e}\hat{r}_{e}-\sigma D\hat{z}\leq d.

Therefore, (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) is a feasible solution for (B.4). Furthermore,

σvVdvz^vp=σeEϑevez^vp=σeEϑeAez^pp=1σp1eEϑeσAez^pp=1σp1eEϑes^er^epp.\begin{split}\sigma\sum_{v\in V}d_{v}\hat{z}_{v}^{p}&=\sigma\sum_{e\in E}\vartheta_{e}\sum_{v\in e}\hat{z}_{v}^{p}=\sigma\sum_{e\in E}\vartheta_{e}\|A_{e}\hat{z}\|_{p}^{p}\\ &=\frac{1}{\sigma^{p-1}}\sum_{e\in E}\vartheta_{e}\left\|\sigma A_{e}\hat{z}\right\|_{p}^{p}=\frac{1}{\sigma^{p-1}}\sum_{e\in E}\vartheta_{e}\left\|\hat{s}_{e}-\hat{r}_{e}\right\|_{p}^{p}.\end{split}

This means that (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) attains objective value ν^1\hat{\nu}_{1} in (B.4). Hence ν^1ν^2\hat{\nu}_{1}\geq\hat{\nu}_{2}.

In order to show that (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) is indeed optimal for (B.4), it remains to show that ν^2ν^1\hat{\nu}_{2}\geq\hat{\nu}_{1}. Let (ϕ,r,s)(\phi^{\prime},r^{\prime},s^{\prime}) be an optimal solution for (B.4). Then we know that

s=argminseE|V|eEϑeserepp,s.t.ΔeEϑesed,se,v=0ve.s^{\prime}=\operatorname*{argmin}_{s\in\bigotimes_{e\in E}\mathbb{R}^{|V|}}\sum_{e\in E}\vartheta_{e}\|s_{e}-r^{\prime}_{e}\|_{p}^{p},~{}\mbox{s.t.}~{}\Delta-\sum_{e\in E}\vartheta_{e}s_{e}\leq d,~{}s_{e,v}=0~{}\forall v\not\in e. (B.5)

According to Lemma 1, we know that

se=re+AeD1[ΔeEϑered]+,eE.s^{\prime}_{e}=r^{\prime}_{e}+A_{e}D^{-1}\Big{[}\Delta-\sum_{e^{\prime}\in E}\vartheta_{e^{\prime}}r^{\prime}_{e^{\prime}}-d\Big{]}_{+},~{}\forall e\in E. (B.6)

Define z:=1σD1[ΔeEϑered]+z^{\prime}:=\frac{1}{\sigma}D^{-1}[\Delta-\sum_{e\in E}\vartheta_{e}r^{\prime}_{e}-d]_{+}. Then z0z^{\prime}\geq 0. Moreover, we have that

eEϑeseeEϑere=eEϑeAeD1[ΔeEϑered]+=[ΔeEϑered]+=σDz,\sum_{e\in E}\vartheta_{e}s^{\prime}_{e}-\sum_{e\in E}\vartheta_{e}r^{\prime}_{e}=\sum_{e\in E}\vartheta_{e}A_{e}D^{-1}\Big{[}\Delta-\sum_{e^{\prime}\in E}\vartheta_{e^{\prime}}r^{\prime}_{e^{\prime}}-d\Big{]}_{+}=\Big{[}\Delta-\sum_{e^{\prime}\in E}\vartheta_{e^{\prime}}r^{\prime}_{e^{\prime}}-d\Big{]}_{+}=\sigma Dz^{\prime},

so

ΔeEϑere=ΔeEϑese+σDzd+σDz.\Delta-\sum_{e\in E}\vartheta_{e}r^{\prime}_{e}=\Delta-\sum_{e\in E}\vartheta_{e}s^{\prime}_{e}+\sigma Dz^{\prime}\leq d+\sigma Dz^{\prime}.

Therefore, (ϕ,r,z)(\phi^{\prime},r^{\prime},z^{\prime}) is a feasible solution for (A.1). Furthermore,

1σp1eEϑeserepp=1σp1eEϑeσAezpp=σeEϑeAezpp=σeEϑevezvp=σvVdvzvp.\begin{split}\frac{1}{\sigma^{p-1}}\sum_{e\in E}\vartheta_{e}\left\|s^{\prime}_{e}-r^{\prime}_{e}\right\|_{p}^{p}&=\frac{1}{\sigma^{p-1}}\sum_{e\in E}\vartheta_{e}\left\|\sigma A_{e}z^{\prime}\right\|_{p}^{p}=\sigma\sum_{e\in E}\vartheta_{e}\|A_{e}z^{\prime}\|_{p}^{p}\\ &=\sigma\sum_{e\in E}\vartheta_{e}\sum_{v\in e}{z^{\prime}}_{v}^{p}=\sigma\sum_{v\in V}d_{v}{z^{\prime}}_{v}^{p}.\end{split}

This means that (ϕ,r,z)(\phi^{\prime},r^{\prime},z^{\prime}) attains objective value ν^2\hat{\nu}_{2} in (A.1). Hence ν^2ν^1\hat{\nu}_{2}\geq\hat{\nu}_{1}. ∎

Remark. The constructive proof of Lemma 2 means that, given an optimal solution (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) for problem (B.4), one can recover an optimal solution (ϕ^,r^,z^)(\hat{\phi},\hat{r},\hat{z}) for our original primal formulation (A.1) via z^:=1σD1[ΔeEϑer^ed]+\hat{z}:=\frac{1}{\sigma}D^{-1}[\Delta-\sum_{e\in E}\vartheta_{e}\hat{r}_{e}-d]_{+}. It then follows from Lemma 2 that the dual optimal solution x^\hat{x} is given by x^=z^p1\hat{x}=\hat{z}^{p-1}. Therefore, a sweep cut rounding procedure readily applies to the solution (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) of problem (B.4).
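In code, this recovery is direct. A sketch (ours, same toy layout as before) that maps the routing part r̂ of a solution of (B.4) to ẑ and the dual embedding x̂, which can then be rounded with the sweep-cut routine sketched in Subsection A.1:

# Sketch (ours): recover z_hat and the embedding x_hat = z_hat^{p-1} from r_hat.
def recover_embedding(Delta, d, hyperedges, r_hat, sigma, p):
    m = list(Delta)                            # Delta - sum_e theta_e * r_e
    for (nodes, theta), r_e in zip(hyperedges, r_hat):
        for v in nodes:
            m[v] -= theta * r_e.get(v, 0.0)
    z = [max(m[v] - d[v], 0.0) / (sigma * d[v]) for v in range(len(d))]
    return z, [zv ** (p - 1) for zv in z]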

Let g(ϕ,r,s)g(\phi,r,s) denote the objective function of problem (B.4) and let gg^{*} denote its optimal objective value.

The following theorem gives the convergence rate of Algorithm B.1 applied to (B.4), when its objective function is penalized by the p\ell_{p}-norm for p2p\geq 2.

Theorem 3 ([5]).

Let {ϕ(k),r(k),s(k)}k0\{\phi^{(k)},r^{(k)},s^{(k)}\}_{k\geq 0} be the sequence generated by Algorithm B.1. Then for any k1k\geq 1,

g(ϕ(k),r(k),s(k))g3max{g(ϕ(0),r(0),s(0))g,LpR2}k,g(\phi^{(k)},r^{(k)},s^{(k)})-g^{*}\leq\frac{3\max\{g(\phi^{(0)},r^{(0)},s^{(0)})-g^{*},L_{p}R^{2}\}}{k},

where

R=max(ϕ,r,s)max(ϕ^,r^,s^)𝒪{ϕϕ^22+rr^22+ss^22|g(ϕ,r,s)g(ϕ(0),r(0),s(0))},Lp=(p1)ϑmax2/pΔpp2dmin(p1)(p2)/pσp1,\begin{split}R&=\max_{(\phi,r,s)\in\mathcal{F}}~{}\max_{(\hat{\phi},\hat{r},\hat{s})\in\mathcal{O}}\big{\{}\|\phi-\hat{\phi}\|_{2}^{2}+\|r-\hat{r}\|_{2}^{2}+\|s-\hat{s}\|_{2}^{2}~{}\big{|}~{}g(\phi,r,s)\leq g(\phi^{(0)},r^{(0)},s^{(0)})\big{\}},\\ L_{p}&=(p-1)\frac{\vartheta_{\max}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{p-1}},\end{split}

where \mathcal{F} and 𝒪\mathcal{O} denote the feasible set and set of optimal solutions, respectively, ϑmax:=maxeEϑe\vartheta_{\max}:=\max\limits_{e\in E}\vartheta_{e}, and dmin:=minvsupp(Δ)dvd_{\min}:=\min\limits_{v\in\textnormal{supp}(\Delta)}d_{v}.

Remark. When p=2p=2, as considered in the main paper, the objective function g(ϕ,r,s)g(\phi,r,s) has Lipschitz continuous gradient with constant L2=ϑmax/σL_{2}=\vartheta_{\max}/\sigma. When p>2p>2, the gradient of g(ϕ,r,s)g(\phi,r,s) is not generally Lipschitz continuous. However, the sub-linear convergence rate in Theorem 3 applies as long as g(ϕ,r,s)g(\phi,r,s) is block Lipschitz smooth in the sub-level sets containing the iterates generated by Algorithm B.1. We give more details in Subsection B.1.
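For reference, the constant LpL_{p} in Theorem 3 is immediate to evaluate from Δ\Delta, the degrees, the weights and σ\sigma; a one-function sketch (ours):

# Sketch (ours): the smoothness constant L_p of Theorem 3.
def lipschitz_bound(Delta, d, thetas, sigma, p):
    theta_max = max(thetas)
    d_min = min(d[v] for v in range(len(d)) if Delta[v] > 0)
    norm_p = sum(abs(x) ** p for x in Delta) ** (1.0 / p)
    return ((p - 1) * theta_max ** (2.0 / p) * norm_p ** (p - 2)
            / (d_min ** ((p - 1) * (p - 2) / p) * sigma ** (p - 1)))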

B.1 Block Lipschitz smoothness over sub-level set

Recall that g(ϕ,r,s)g(\phi,r,s) denotes the objective function of problem (B.4). Lemma 4 concerns specifically the setting in which problem (B.4) is penalized by the p\ell_{p}-norm for some p>2p>2.

Lemma 4 (Block Lipschitz smoothness).

The partial gradient (ϕ,r)g(ϕ,r,s)\nabla_{(\phi,r)}g(\phi,r,s) is Lipschitz continuous over the sub-level sets (given any fixed ss)

Uϕ,r(s):={(ϕ,r)+|E|×(eE|V|)|g(ϕ,r,s)g(ϕ(0),r(0),s(0))}U_{\phi,r}(s):=\{(\phi,r)\in\mathbb{R}^{|E|}_{+}\times(\mbox{$\bigotimes_{e\in E}$}\mathbb{R}^{|V|})~{}|~{}g(\phi,r,s)\leq g(\phi^{(0)},r^{(0)},s^{(0)})\}

with constant Lϕ,rL_{\phi,r} such that

Lϕ,r(p1)ϑmax2/pΔpp2dmin(p1)(p2)/pσp1,L_{\phi,r}\leq(p-1)\frac{\vartheta_{\max}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{p-1}},

where ϑmax:=maxeEϑe\vartheta_{\max}:=\max_{e\in E}\vartheta_{e} and dmin:=minvsupp(Δ)dvd_{\min}:=\min_{v\in\textnormal{supp}(\Delta)}d_{v}. The partial gradient sg(ϕ,r,s)\nabla_{s}g(\phi,r,s) is Lipschitz continuous over the sub-level sets (given any fixed (ϕ,r)(\phi,r))

Us(ϕ,r):={seE|V||g(ϕ,r,s)g(ϕ(0),r(0),s(0))}U_{s}(\phi,r):=\{s\in\mbox{$\bigotimes_{e\in E}$}\mathbb{R}^{|V|}~{}|~{}g(\phi,r,s)\leq g(\phi^{(0)},r^{(0)},s^{(0)})\}

with constant LsLϕ,rL_{s}\leq L_{\phi,r}.

Proof.

Fix seE|V|s\in\bigotimes_{e\in E}\mathbb{R}^{|V|} and consider

g1(ϕ,r):=g(ϕ,r,s)=1peEϑeϕep+1pσp1eEvVϑe|re,vse,v|p.g_{1}(\phi,r):=g(\phi,r,s)=\frac{1}{p}\sum_{e\in E}\vartheta_{e}\phi_{e}^{p}+\frac{1}{p\sigma^{p-1}}\sum_{e\in E}\sum_{v\in V}\vartheta_{e}|r_{e,v}-s_{e,v}|^{p}.

The function g1(ϕ,r)g_{1}(\phi,r) is coordinate-wise separable and hence its second order derivative 2g1(ϕ,r)\nabla^{2}g_{1}(\phi,r) is a diagonal matrix. Therefore, the largest eigenvalue of 2g1(ϕ,r)\nabla^{2}g_{1}(\phi,r) is the largest coordinate-wise second order partial derivative, that is,

Lϕ,r=max(ϕ,r)Uϕ,r(s)λmax(2g1(ϕ,r))=max(ϕ,r)Uϕ,r(s)maxeE,vV{ϕe2g1(ϕ,r),re,v2g1(ϕ,r)}.L_{\phi,r}=\max_{(\phi,r)\in U_{\phi,r}(s)}\lambda_{\max}(\nabla^{2}g_{1}(\phi,r))=\max_{(\phi,r)\in U_{\phi,r}(s)}\max_{e\in E,v\in V}\{\nabla^{2}_{\phi_{e}}g_{1}(\phi,r),\nabla^{2}_{r_{e,v}}g_{1}(\phi,r)\}.

So it suffices to upper bound ϕe2g1(ϕ,r)\nabla^{2}_{\phi_{e}}g_{1}(\phi,r) and re,v2g1(ϕ,r)\nabla^{2}_{r_{e,v}}g_{1}(\phi,r) for all (ϕ,r)Uϕ,r(s)(\phi,r)\in U_{\phi,r}(s). We have that

g(ϕ(0),r(0),s(0))=1pσp1eEϑeve[Δvdv]+pdvp=1pσp1vV[Δvdv]+pdvp1Δpppσp1dminp1g(\phi^{(0)},r^{(0)},s^{(0)})=\frac{1}{p\sigma^{p-1}}\sum_{e\in E}\vartheta_{e}\sum_{v\in e}\frac{[\Delta_{v}-d_{v}]_{+}^{p}}{d_{v}^{p}}=\frac{1}{p\sigma^{p-1}}\sum_{v\in V}\frac{[\Delta_{v}-d_{v}]_{+}^{p}}{d_{v}^{p-1}}\leq\frac{\|\Delta\|_{p}^{p}}{p\sigma^{p-1}d_{\min}^{p-1}}

where dmin=minvsupp(Δ)dvd_{\min}=\min_{v\in\textnormal{supp}(\Delta)}d_{v}. It follows that for all (ϕ,r)Uϕ,r(s)(\phi,r)\in U_{\phi,r}(s),

ϕe2g1(ϕ,r)\displaystyle\nabla^{2}_{\phi_{e}}g_{1}(\phi,r) =(p1)ϑeϕep2(p1)ϑe2/pΔpp2dmin(p1)(p2)/pσ(p1)(p2)/p(p1)ϑe2/pΔpp2dmin(p1)(p2)/pσp1,eE,\displaystyle=(p-1)\vartheta_{e}\phi_{e}^{p-2}\leq\frac{(p-1)\vartheta_{e}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{(p-1)(p-2)/p}}\leq\frac{(p-1)\vartheta_{e}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{p-1}},~{}\forall e\in E,
re,v2g1(ϕ,r)\displaystyle\nabla^{2}_{r_{e,v}}g_{1}(\phi,r) =(p1)ϑeσp1|se,vre,v|p2(p1)ϑe2/pΔpp2dmin(p1)(p2)/pσp1,eE,vV,\displaystyle=(p-1)\frac{\vartheta_{e}}{\sigma^{p-1}}|s_{e,v}-r_{e,v}|^{p-2}\leq\frac{(p-1)\vartheta_{e}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{p-1}},~{}\forall e\in E,~{}\forall v\in V,

because otherwise we would have g(ϕ,r,s)>g(ϕ(0),r(0),s(0))g(\phi,r,s)>g(\phi^{(0)},r^{(0)},s^{(0)}). Hence,

Lϕ,rmaxeE(p1)ϑe2/pΔpp2dmin(p1)(p2)/pσp1=(p1)ϑmax2/pΔpp2dmin(p1)(p2)/pσp1.L_{\phi,r}\leq\max_{e\in E}\frac{(p-1)\vartheta_{e}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{p-1}}=\frac{(p-1)\vartheta_{\max}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{p-1}}.

Finally, by the symmetry between rr and ss in g(ϕ,r,s)g(\phi,r,s), we know that LsLϕ,rL_{s}\leq L_{\phi,r}. ∎

Remark. Because the iterates generated by Algorithm B.1 monotonically decrease the objective function value, in particular, we have that

g(ϕ(0),r(0),s(0))g(ϕ(k+1),r(k+1),s(k))g(ϕ(k+1),r(k+1),s(k+1))g(\phi^{(0)},r^{(0)},s^{(0)})\geq g(\phi^{(k+1)},r^{(k+1)},s^{(k)})\geq g(\phi^{(k+1)},r^{(k+1)},s^{(k+1)})

for any k0k\geq 0. Therefore, the sequence of iterates lives in the sub-level sets. As a result, for any p>2p>2, the block Lipschitz smoothness within sub-level sets suffices to obtain the sub-linear convergence rate for the AM method [5].

B.2 Alternating minimization sub-problems

We now discuss how to solve the sub-problems in Algorithm B.1 efficiently. By Lemma 1, we know that the sub-problem with respect to ss,

s(k+1):=argminseEϑesere(k+1)pp,s.t.ΔeEϑesed,se,v=0,ve,s^{(k+1)}:=\operatorname*{argmin}\limits_{s}\sum\limits_{e\in E}\vartheta_{e}\|s_{e}-r_{e}^{(k+1)}\|_{p}^{p},~{}\mbox{s.t.}~{}\Delta-\sum\limits_{e\in E}\vartheta_{e}s_{e}\leq d,\ s_{e,v}=0,\forall v\not\in e,

has closed-form solution

se(k+1)=re(k+1)+AeD1[ΔeEϑere(k+1)d]+,eE.s_{e}^{(k+1)}=r_{e}^{(k+1)}+A_{e}D^{-1}\Big{[}\Delta-\sum_{e^{\prime}\in E}\vartheta_{e^{\prime}}r_{e^{\prime}}^{(k+1)}-d\Big{]}_{+},~{}\forall e\in E.

For the sub-problem with respect to (ϕ,r)(\phi,r),

(ϕ(k+1),r(k+1)):=argmin(ϕ,r)𝒞eEϑe(ϕep+1σp1se(k)repp),(\phi^{(k+1)},r^{(k+1)}):=\operatorname*{argmin}\limits_{(\phi,r)\in\mathcal{C}}\sum\limits_{e\in E}\vartheta_{e}\left(\phi_{e}^{p}+\frac{1}{\sigma^{p-1}}\|s_{e}^{(k)}-r_{e}\|_{p}^{p}\right),

note that it decomposes into |E||E| independent problems that can be minimized separately. That is, for eEe\in E, we have

(ϕe(k+1),re(k+1))=argminϕe0,reϕeBeϑeϕep+1σp1ϑese(k)repp=argminϕe0,reϕeBe1pϕep+1pσp1se(k)repp.\begin{split}(\phi_{e}^{(k+1)},r_{e}^{(k+1)})&=\operatorname*{argmin}_{\phi_{e}\geq 0,r_{e}\in\phi_{e}B_{e}}\vartheta_{e}\phi_{e}^{p}+\frac{1}{\sigma^{p-1}}\vartheta_{e}\|s_{e}^{(k)}-r_{e}\|_{p}^{p}\\ &=\operatorname*{argmin}_{\phi_{e}\geq 0,r_{e}\in\phi_{e}B_{e}}\frac{1}{p}\phi_{e}^{p}+\frac{1}{p\sigma^{p-1}}\|s_{e}^{(k)}-r_{e}\|_{p}^{p}.\end{split} (B.7)

The above problem (B.7) is strictly convex so it has a unique minimizer.

We focus on p=2p=2 first. In this case, problem (B.7) can be solved in sub-linear time using either the conic Frank-Wolfe algorithm or the conic Fujishige-Wolfe minimum norm algorithm studied in [32]. Notice that the dimension of problem (B.7) is the size of the corresponding hyperedge. Therefore, as long as the hyperedge is not extremely large, we can easily obtain a good update (ϕe(k+1),re(k+1))(\phi_{e}^{(k+1)},r_{e}^{(k+1)}).

If BeB_{e} has a special structure, for example, if the hyperedge weight wew_{e} models the unit cut-cost, then an exact solution for (B.7) can be computed in time O(|e|log|e|)O(|e|\log|e|) [32]. For completeness we translate the algorithmic details of [32] to our setting and list them in Algorithm B.2. The basic idea is to find dual variables achieving dual optimality, and then recover the primal optimal solution from the dual. We refer the reader to [32] for detailed justifications. Given eEe\in E, se|V|s_{e}\in\mathbb{R}^{|V|}, and a,ba,b\in\mathbb{R}, denote

e(a):={ve|se,vσa}ande(b):={ve|se,vσb}.e_{\geq}(a):=\{v\in e~{}|~{}s_{e,v}\geq\sigma a\}~{}~{}\mbox{and}~{}~{}e_{\leq}(b):=\{v\in e~{}|~{}s_{e,v}\leq\sigma b\}.

Define

γ(a,b):=ab+ve(a)σ(ase,vσ).\gamma(a,b):=a-b+\sum_{v\in e_{\geq}(a)}\sigma\left(a-\frac{s_{e,v}}{\sigma}\right).
Algorithm B.2 An Exact Projection Algorithm for (B.7) (p=2p=2, unit cut-cost) [32]
1:  Input: ee, ses_{e}.
2:  amaxvese,v/σa\leftarrow\max_{v\in e}s_{e,v}/\sigma,   bminvese,v/σb\leftarrow\min_{v\in e}s_{e,v}/\sigma
3:  While true:
4:       waσ|e(a)|w_{a}\leftarrow\sigma\,|e_{\geq}(a)|,  wbσ|e(b)|w_{b}\leftarrow\sigma\,|e_{\leq}(b)|
5:       a1maxvee(a)se,v/σa_{1}\leftarrow\max_{v\in e\setminus e_{\geq}(a)}s_{e,v}/\sigma,  b1b+(aa1)wa/wbb_{1}\leftarrow b+(a-a_{1})w_{a}/w_{b}
6:       b2minvee(b)se,v/σb_{2}\leftarrow\min_{v\in e\setminus e_{\leq}(b)}s_{e,v}/\sigma,  a2a(b2b)wb/waa_{2}\leftarrow a-(b_{2}-b)w_{b}/w_{a}
7:       iargmini{1,2}bii^{*}\leftarrow\operatorname*{argmin}_{i\in\{1,2\}}b_{i}
8:       If aibia_{i^{*}}\leq b_{i^{*}} or γ(ai,bi)0\gamma(a_{i^{*}},b_{i^{*}})\leq 0 break
9:       aaia\leftarrow a_{i^{*}},  bbib\leftarrow b_{i^{*}}
10:  aaγ(a,b)wb/(wawb+wa+wb)a\leftarrow a-\gamma(a,b)w_{b}/(w_{a}w_{b}+w_{a}+w_{b}),  bb+γ(a,b)wa/(wawb+wa+wb)b\leftarrow b+\gamma(a,b)w_{a}/(w_{a}w_{b}+w_{a}+w_{b})
11:  For vev\in e do:
12:       If ve(a)v\in e_{\geq}(a) then  re,vse,vσar_{e,v}\leftarrow s_{e,v}-\sigma a
13:       Else if ve(b)v\in e_{\leq}(b) then  re,vse,vσbr_{e,v}\leftarrow s_{e,v}-\sigma b
14:       Else  re,v0r_{e,v}\leftarrow 0
15:  Return: rer_{e}
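For concreteness, the primitives used throughout Algorithm B.2 are simple to implement; the helpers below (our own naming, with the entries se,vs_{e,v} of a hyperedge stored in a dict) realize e(a)e_{\geq}(a), e(b)e_{\leq}(b) and γ(a,b)\gamma(a,b) as defined above:

# Sketch (ours): the primitives e_>=, e_<= and gamma used by Algorithm B.2.
def e_ge(s_e, sigma, a):
    return {v for v, sv in s_e.items() if sv >= sigma * a}

def e_le(s_e, sigma, b):
    return {v for v, sv in s_e.items() if sv <= sigma * b}

def gamma(s_e, sigma, a, b):
    return a - b + sum(sigma * (a - s_e[v] / sigma) for v in e_ge(s_e, sigma, a))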

Now we discuss the case p>2p>2 in (B.7). The dual of (B.7) is written as

minye1qfe(ye)q+σqyeqqyeTse(k).\min_{y_{e}}\frac{1}{q}f_{e}(y_{e})^{q}+\frac{\sigma}{q}\|y_{e}\|_{q}^{q}-y_{e}^{T}s_{e}^{(k)}. (B.8)

Let (ϕe,re)(\phi_{e}^{*},r_{e}^{*}) and yey_{e}^{*} be optimal solutions of (B.7) and (B.8), respectively. Then one has

re=se(k)σ(ye)q1andϕe=((re)Tye)1/q.r_{e}^{*}=s_{e}^{(k)}-\sigma(y_{e}^{*})^{q-1}~{}\mbox{and}~{}\phi_{e}^{*}=\big{(}(r_{e}^{*})^{T}y_{e}^{*}\big{)}^{1/q}.

Both the derivation of (B.8) and the above relations between (ϕe,re)(\phi_{e}^{*},r_{e}^{*}) and yey_{e}^{*} follow from reasoning and algebraic computations similar to those used in the proofs of Lemma 1 and Lemma 2. Therefore, we can use a subgradient method to compute yey_{e}^{*} first and then recover ϕe\phi_{e}^{*} and rer_{e}^{*}. For special cases like the unit cut-cost, a similar approach to Algorithm B.2 can be adopted to obtain an almost (up to a binary search tolerance) exact solution, by modifying Steps 2–6 to work with the general p\ell_{p}-norm and replacing Step 10 with a binary search. See Algorithm B.3 for details.

Caution. To simplify notation in Algorithm B.3, for cc\in\mathbb{R} and p>0p>0, cpc^{p} is to be interpreted as cp:=|c|psign(c)c^{p}:=|c|^{p}\mathop{\mathrm{sign}}(c), where we treat sign(0):=0\mathop{\mathrm{sign}}(0):=0. For q=p/(p1)q=p/(p-1), we define

γp(a,b):=(ab)q1+ve(aq1)σ(aq1se,vσ).\gamma_{p}(a,b):=(a-b)^{q-1}+\sum_{v\in e_{\geq}(a^{q-1})}\sigma\left(a^{q-1}-\frac{s_{e,v}}{\sigma}\right).
Algorithm B.3 An p\ell_{p}-Projection Algorithm for (B.7) (p>2p>2, unit cut-cost)
1:  Input: ee, ses_{e}.
2:  amaxve(se,v/σ)p1a\leftarrow\max_{v\in e}(s_{e,v}/\sigma)^{p-1},   bminve(se,v/σ)p1b\leftarrow\min_{v\in e}(s_{e,v}/\sigma)^{p-1},   qp/(p1)q\leftarrow p/(p-1)
3:  While true:
4:       waσ|e(aq1)|w_{a}\leftarrow\sigma\,|e_{\geq}(a^{q-1})|,  wbσ|e(bq1)|w_{b}\leftarrow\sigma\,|e_{\leq}(b^{q-1})|
5:       a1maxvee(aq1)(se,v/σ)p1a_{1}\leftarrow\max_{v\in e\setminus e_{\geq}(a^{q-1})}(s_{e,v}/\sigma)^{p-1},  b1(bq1+(aq1a1q1)wa/wb)p1b_{1}\leftarrow(b^{q-1}+(a^{q-1}-a_{1}^{q-1})w_{a}/w_{b})^{p-1}
6:       b2minvee(bq1)(se,v/σ)p1b_{2}\leftarrow\min_{v\in e\setminus e_{\leq}(b^{q-1})}(s_{e,v}/\sigma)^{p-1},   a2(aq1(b2q1bq1)wb/wa)p1a_{2}\leftarrow(a^{q-1}-(b_{2}^{q-1}-b^{q-1})w_{b}/w_{a})^{p-1}
7:       iargmini{1,2}bii^{*}\leftarrow\operatorname*{argmin}_{i\in\{1,2\}}b_{i}
8:       If aibia_{i^{*}}\leq b_{i^{*}} or γp(ai,bi)0\gamma_{p}(a_{i^{*}},b_{i^{*}})\leq 0 break
9:       aaia\leftarrow a_{i^{*}},  bbib\leftarrow b_{i^{*}}
10:  Employ binary search for a^[b,a]\hat{a}\in[b,a] such that γp(a^,b^)=0\gamma_{p}(\hat{a},\hat{b})=0 while maintaining b^=(bq1+(aq1a^q1)wa/wb)p1\hat{b}=(b^{q-1}+(a^{q-1}-\hat{a}^{q-1})w_{a}/w_{b})^{p-1} and b^a^\hat{b}\leq\hat{a}
11:  For vev\in e do:
12:       If ve(a^q1)v\in e_{\geq}(\hat{a}^{q-1}) then  re,vse,vσa^q1r_{e,v}\leftarrow s_{e,v}-\sigma\hat{a}^{q-1}
13:       Else if ve(b^q1)v\in e_{\leq}(\hat{b}^{q-1}) then  re,vse,vσb^q1r_{e,v}\leftarrow s_{e,v}-\sigma\hat{b}^{q-1}
14:       Else  re,v0r_{e,v}\leftarrow 0
15:  Return: rer_{e}

Appendix C Empirical set-up and results

C.1 Experiments using synthetic data

In this subsection we provide details about how we generate synthetic hypergraphs using the kk-uniform stochastic block model and how we set the parameters for the algorithms used in our experiments. Additional synthetic experiments that demonstrate or explain the robustness of our method are also provided.

Data generation. We generate four sets of hypergraphs using the generalized kkHSBM described in the main paper. All hypergraphs have n=100n=100 nodes. For simplicity, we require that each block in the hypergraph has constant size 50.

1st set of hypergraphs. We generate the first set of hypergraphs with k=3k=3, constant p=0.0765p=0.0765 and varying q[0.0041,0.0735]q\in[0.0041,0.0735]. Recall that for k=3k=3 there is only one possible inter-cluster probability qq1q\equiv q_{1}. We pick p=0.0765p=0.0765 so that the expected number of intra-cluster hyperedges is 1500 for each block of size 50. We set a wide range for qq so that the interval covers both extremes, i.e., the cases where the ground-truth target cluster is very clean and where it is very noisy. These hypergraphs are used to evaluate the performance of algorithms for the unit cut-cost setting when the target cluster conductance varies. Figure 4 in the main paper is based on the local clustering results on these hypergraphs.

2nd set of hypergraphs. For the second set of hypergraphs, we vary k{3,4,5,6}k\in\{3,4,5,6\}. Moreover, we set q2==qk/2=0q_{2}=\cdots=q_{\lfloor k/2\rfloor}=0, so every inter-cluster hyperedge contains a single node on one side and the rest on the other side. In this setting, separating the two ground-truth communities incurs a small penalty under the cardinality cut-cost, but a large penalty under the unit cut-cost. Therefore, methods that exploit an appropriate cardinality-based cut-cost should perform better. The hypergraphs are sampled so that the conductance of a block stays the same across different kk’s. We compute the conductance based on the unit cut-cost when generating the hypergraphs, because the scale of conductance based on the unit cut-cost is less affected by kk than the scale of conductance based on the cardinality cut-cost; see below for details on how the scale of conductance based on the cardinality cut-cost is affected by kk. The second set of hypergraphs is used to evaluate the performance of algorithms for both unit and cardinality cut-costs when the hyperedge size varies. Figure 5 in the main paper (and Figure C.3 and Figure C.4 in the appendix) is based on the local clustering results on these hypergraphs.

3rd set of hypergraphs. For the third set of hypergraphs, we set q2==qk/2=0q_{2}=\cdots=q_{\lfloor k/2\rfloor}=0. We consider constant k=4k=4 or k=5k=5, constant pp and varying q1q_{1}. These hypergraphs are used to evaluate the performance of algorithms for both unit and cardinality cut-costs when the target cluster conductance varies. Figure C.1 and Figure C.2 in the appendix are based on the local clustering results on these hypergraphs.

4th set of hypergraphs. This set consists of two hypergraphs generated with k=3k=3, p=0.04p=0.04 and q{0.001,0.011}q\in\{0.001,0.011\}. The ground-truth target cluster in the first hypergraph has conductance 0.05, while the ground-truth target cluster in the second hypergraph has conductance 0.3. These two hypergraphs are used to compare the performance of algorithms for the unit cut-cost setting when the theoretical assumptions of LH hold (for the first hypergraph) or fail (for the second hypergraph).
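For illustration, the following Julia sketch samples a kk-uniform hypergraph under the simplest reading of the model for k=3k=3: two equal-size blocks, where each intra-block triple is kept independently with probability pp and each inter-block triple with probability qq. The generalized kkHSBM in the main paper further distinguishes inter-cluster splits via q1,,qk/2q_{1},\ldots,q_{\lfloor k/2\rfloor}; the function and variable names here are ours.

using Combinatorics, Random

# Two equal-size blocks {1,…,n/2} and {n/2+1,…,n}; intra-block triples are
# kept with probability p, inter-block triples with probability q.
function sample_3hsbm(n::Int, p::Float64, q::Float64; rng = Random.default_rng())
    half = div(n, 2)
    E = Vector{Vector{Int}}()
    for t in combinations(1:n, 3)          # t is sorted ascending
        intra = (t[end] <= half) || (t[1] > half)
        if rand(rng) < (intra ? p : q)
            push!(E, collect(t))
        end
    end
    return E
end

E = sample_3hsbm(100, 0.0765, 0.0388)      # e.g. a point inside the q-range above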

Parameters. For HFD, in all synthetic experiments we initialize the seed mass so that Δ1\|\Delta\|_{1} is three times the volume of the target cluster (recall from Assumption 2 that this is without loss of generality). We set σ=0.01\sigma=0.01. We tune the parameters for LH as suggested by the authors [33]. Specifically, LH has a regularization parameter κ\kappa and we let κ=cr\kappa=c\cdot r, where rr is the ratio between the number of seed node(s) and the size of the target cluster. We perform a binary search on cc and find that c=0.35c=0.35 gives good results for the synthetic hypergraphs. An important parameter for LH is δ\delta. When δ=1\delta=1 it models the unit cut-cost, and when δ>1\delta>1 it models a cardinality-based cut-cost with upper bound δ\delta [33]. We consider both cases, δ=1\delta=1 (U-LH) and δ>1\delta>1 (C-LH). In principle, for kk-uniform hypergraphs LH should produce the same result for any δk\delta\geq k, so one could simply set δ=k\delta=k for C-LH. However, in our experiments we find that the δ\delta value that gives the best clustering results can be much larger than kk. In order to get the best performance out of C-LH, we run C-LH for δ=2i\delta=2^{i}, i=0,1,,12i=0,1,\ldots,12, and among the 13 output clusters we pick the one with the lowest conductance. For ACL, we use the same set of parameter values used in [33], because that parameter setting also produces good results in our synthetic experiments.

Scale of cardinality-based conductance. To see how the ground-truth conductance (computed using the cardinality cut-cost) scales with hyperedge size k2k\geq 2, let us assume that a hypergraph H=(V,E)H=(V,E), with |V|=100|V|=100 nodes and two blocks of 50 nodes each, is generated with p=0p=0, q1=1q_{1}=1 and q2==qk/2=0q_{2}=\ldots=q_{\lfloor k/2\rfloor}=0. In this case, the hypergraph contains all possible inter-cluster hyperedges and no intra-cluster ones. Let CC denote a target cluster, that is, CC is either one of the two ground-truth blocks. Since we have |V|=100|V|=100 nodes and each of the two blocks contains 5050 nodes, the total number of hyperedges is

|E|=2(50k1)(501).|E|=2\binom{50}{k-1}\binom{50}{1}.

Let wew_{e} denote the cardinality-based cut-cost given by we(S)=min{|Se|,|eS|}/|e|/2w_{e}(S)=\min\{|S\cap e|,|e\setminus S|\}/\lfloor|e|/2\rfloor. Then for each eEe\in E we have that we(C)=1k/2w_{e}(C)=\frac{1}{\lfloor k/2\rfloor}. Moreover, the volume of CC is

vol(C)=(k1)(50k1)(501)+(501)(50k1)=k(50k1)(501),\textnormal{vol}(C)=(k-1)\binom{50}{k-1}\binom{50}{1}+\binom{50}{1}\binom{50}{k-1}=k\binom{50}{k-1}\binom{50}{1},

and hence we have

Φ(C)=vol(C)vol(C)=eEwe(C)vol(C)=1k/2|E|vol(C)=2k/2(50k1)(501)k(50k1)(501)=2kk/2.\Phi(C)=\frac{\textnormal{vol}(\partial C)}{\textnormal{vol}(C)}=\frac{\sum_{e\in E}w_{e}(C)}{\textnormal{vol}(C)}=\frac{\frac{1}{\lfloor k/2\rfloor}|E|}{\textnormal{vol}(C)}=\frac{\frac{2}{\lfloor k/2\rfloor}\binom{50}{k-1}\binom{50}{1}}{k\binom{50}{k-1}\binom{50}{1}}=\frac{2}{k\lfloor k/2\rfloor}.

This means that, for any p0p\geq 0, q11q_{1}\leq 1, q2==qk/2=0q_{2}=\cdots=q_{\lfloor k/2\rfloor}=0, if BB is one of the two blocks in HH, then Φ(B)1\Phi(B)\leq 1 for k=2,3k=2,3, Φ(B)1/4\Phi(B)\leq 1/4 for k=4k=4, and Φ(B)1/5\Phi(B)\leq 1/5 for k=5k=5. This explains why the ranges of ground-truth conductance we consider in Figure C.1 and Figure C.2 differ from the range of ground-truth conductance in Figure 4 in the main paper. For each kk we make the range of conductance (i.e., the xx-axis) as wide as possible, but because the scale of cardinality-based conductance differs across kk, the ranges vary accordingly.
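The closed form derived above is easy to verify numerically; a minimal Julia check:

# Numerical check of Φ(C) = 2 / (k⌊k/2⌋) for p = 0, q₁ = 1, two blocks of 50.
for k in 2:6
    nE  = 2 * binomial(50, k - 1) * 50     # all (and only) inter-cluster hyperedges
    vol = k * binomial(50, k - 1) * 50     # volume of one block
    cut = nE / fld(k, 2)                   # each hyperedge contributes 1/⌊k/2⌋
    @assert cut / vol ≈ 2 / (k * fld(k, 2))
end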

Additional results. Figure C.1 and Figure C.2 show how the algorithms perform on kk-uniform hypergraphs for k=4,5k=4,5, respectively, as we vary the target cluster conductance. The plots show that as the target cluster becomes noisier, the performance of all methods degrades. However, C-HFD is better in terms of both conductance and F1 score, especially when the target cluster is noisy but not pure noise (i.e., the ground-truth conductance is high but not too high). For k=5k=5 in the high-conductance regime, methods that use the unit cut-cost, e.g., U-HFD, perform poorly because they find clusters of low conductance with respect to the unit cut-cost as opposed to the cardinality cut-cost. In general, lower unit cut-cost conductance does not necessarily translate to lower cardinality-based conductance or higher F1 score. For both Figure C.1 and Figure C.2, the ground-truth conductance is computed using the cardinality-based cut-cost; therefore the ground-truth conductances (on the xx-axes) have different scales and ranges. Figure C.3 and Figure C.4 show the median (markers) and 25-75 percentiles (lower-upper bars) of conductance ratios and F1 scores for k=3,4,5,6k=3,4,5,6. The target clusters have unit cut-cost conductance around 0.2 for Figure C.3 and 0.25 for Figure C.4. Notice that, when the target clusters are less noisy (cf. Figure 5 in the main paper, where the target clusters are noisier, having unit conductance around 0.3), U-HFD and C-HFD are significantly better than the other methods. The performance of U-HFD is slightly affected by the hyperedge size when the target clusters have unit conductance around 0.25, while the performance of C-HFD stays the same across all kk’s.

Figure C.1: Average output conductance and F1 score against ground-truth conductance, on kk-uniform hypergraphs with k=4k=4. The error bars show variation over 50 runs using different seed nodes. Both the ground-truth and the target conductances are computed using cardinality-based cut-cost.
Figure C.2: Average output conductance and F1 score against ground-truth conductance, on kk-uniform hypergraphs with k=5k=5. The error bars show variation over 50 runs using different seed nodes. Both the ground-truth and the target conductances are computed using cardinality-based cut-cost.
Figure C.3: Conductance ratio and F1 score on kk-uniform hypergraphs for k{3,4,5,6}k\in\{3,4,5,6\}. Target clusters have unit conductance around 0.20.
Figure C.4: Conductance ratio and F1 score on kk-uniform hypergraphs for k{3,4,5,6}k\in\{3,4,5,6\}. Target clusters have unit conductance around 0.25.

Why is the empirical performance of U-HFD better than U-LH? For the unit cut-cost setting, the local clustering guarantee for HFD holds under much weaker assumptions than those required for LH. The assumptions for LH fail in many cases, and consequently U-HFD has significantly better performance than U-LH in the experiments on both synthetic and real data. More specifically, the theoretical framework for LH assumes that the node embeddings are global (i.e., the solution is dense). However, in order to obtain a localized algorithm, the authors use a regularization parameter κ>0\kappa>0 to impose sparsity in the solution. The localized algorithm computes a sparse approximation to the original global solution, but some clustering errors can be introduced in the process. In general, this does not seem to be a major issue: as shown in Figure C.5, localizing the solution only slightly affects the clustering performance. A more crucial assumption of LH is that its approximation guarantee relies on the strong condition that the conductance of the target cluster is upper bounded by γ8c\frac{\gamma}{8c}, where γ(0,1)\gamma\in(0,1) is a tuning parameter and cc is a constant that depends on both γ\gamma and a specific sampling strategy for selecting a seed node from the target cluster. In our experiments we find that this assumption is often violated.

In what follows we provide a simple illustrative example using synthetic hypergraphs. First, we sample a sequence of hypergraphs using the kkHSBM with n=100n=100 nodes, two ground-truth communities of 50 nodes each, constant k=3k=3, and varying pp and qq. For each hypergraph we identify one ground-truth community as the target cluster, and we select a seed node uniformly at random from the target cluster. We compute the quantity γ8c\frac{\gamma}{8c} and find that it is always less than 0.12 for any γ(0,1)\gamma\in(0,1). This means that, in order for the assumption of LH to hold, the target cluster must have conductance at most 0.12, which is a very strict requirement that cannot hold in general. In order to compare the performance of LH when its assumption holds or fails, we pick two hypergraphs (the fourth set of hypergraphs that we generated) corresponding to the two scenarios. The target clusters have conductance 0.05 and 0.3, respectively, so the assumption for LH holds for the first hypergraph but fails for the second. Moreover, we consider both global and localized solutions for LH. The global solution demonstrates the performance of LH under the required theoretical framework, while the localized solution demonstrates what happens in practice when one uses a sparse approximation for computational efficiency. For LH, we compute the global solution by simply setting the regularization parameter κ\kappa to 0; we tune the localized solution and set κ=0.25r\kappa=0.25r, where rr is the ratio between the number of seed node(s) and the size of the target cluster. The way we pick κ\kappa is similar to the authors’ choice for LH. For HFD, we set σ=0.01\sigma=0.01 and initial mass 3 times the volume of the target cluster. We run both methods multiple times, each time using a different node from the target cluster as the single seed node. The median, lower and upper quartiles of F1 scores are shown in Figure C.5.
For LH, observe that (i) whether the assumption holds or fails, localizing the solution slightly reduces the F1 score, and (ii) for both global and localized solutions, LH has much worse performance on the hypergraph where its assumption does not hold. On the other hand, HFD perfectly recovers the target clusters in both settings.

Figure C.5: Local clustering results under various settings for LH. The markers show the median; the error bars show the 25th and 75th percentiles. The left-most case aligns with the required theoretical framework for LH and, moreover, the strong assumption on the target cluster conductance is satisfied; the right-most case is what typically happens in practice when one applies the localized algorithm for LH and the assumption on the target cluster conductance does not hold. ACL is a heuristic method that applies to the star expansion of hypergraphs and has no performance guarantee. In practice, we observe that ACL and LH have similar performance.

C.2 Experiments using real-world data

C.2.1 Datasets and ground-truth clusters

We provide complete details on the real-world hypergraphs used in the experiments. The last three datasets are used only for additional experiments in the appendix.

Amazon-reviews [38, 43]. This is a hypergraph constructed from Amazon product review data, where each node represents a product. A set of products are connected by a hyperedge if they are reviewed by the same person. We use product category labels as ground-truth cluster identities. In total there are 29 product categories. Because we are mostly interested in local clustering, we consider all clusters consisting of fewer than 10,000 nodes.

Trivago-clicks [14]. The nodes in this hypergraph are accommodations/hotels. A set of nodes are connected by a hyperedge if a user performed a “click-out” action on them during the same browsing session, which means the user was forwarded to a partner site. We use geographical locations as ground-truth cluster identities. There are 160 such clusters. We consider all clusters in this dataset that consist of fewer than 1,000 nodes and have conductance less than 0.25.

Florida Bay food network [30]. Nodes in this hypergraph correspond to different species or organisms that live in the Bay, and hyperedges correspond to transformed network motifs of the original dataset. Each species is labelled according to its role in the food chain.

High-school-contact [35, 14]. Nodes in this hypergraph represent high school students. A group of people are connected by a hyperedge if they were all in proximity of one another at a given time, based on data from sensors worn by students. We use the classroom to which a student belongs as ground truth. In total there are 9 classrooms.

Microsoft-academic [41, 1]. The original co-authorship network is a subset of the Microsoft Academic Graph where nodes are authors and hyperedges correspond to a publication from those authors. We take the dual of the original hypergraph by converting hyperedges to nodes and nodes to hyperedges. After constructing the dual hypergraph, we remove all hyperedges containing just one node and keep the largest connected component. In the resulting hypergraph, each node represents a paper and is labelled by its publication venue. A set of papers are connected by a hyperedge if they share a common coauthor. We combine similar computer science conferences into four broader categories: Data (KDD, WWW, VLDB, SIGMOD), ML (ICML, NeurIPS), TCS (STOC, FOCS), CV (ICCV, CVPR).

Oil-trade network. This hypergraph is constructed using the 2017 international oil trade records from the UN Comtrade dataset. We adopt a modelling approach similar to Figure 1 in the main paper. Each node represents a country, and {v1,v2,v3,v4}\{v_{1},v_{2},v_{3},v_{4}\} form a hyperedge if the trade surplus from each of v1,v2v_{1},v_{2} to each of v3,v4v_{3},v_{4} exceeds 10 million USD (roughly the 80th percentile of country-wise oil export values). Therefore, two countries belong to the same hyperedge if they share 2\geq 2 important trading partners. We use this network for the node-ranking problem.

Table C.1 provides summary statistics about the hypergraphs. Table C.2 includes the statistics of all ground truth clusters that we used in the experiments.

Table C.1: Summary of real-world hypergraphs

Dataset Number of nodes Number of hyperedges Maximum hyperedge size Maximum node degree Median / Mean hyperedge size Median / Mean node degree
Amazon-reviews 2,268,231 4,285,363 9,350 28,973 8.0 / 17.1 11.0 / 32.2
Trivago-clicks 172,738 233,202 86 588 3.0 / 4.1 2.0 / 5.6
Florida-Bay 126 141,233 4 19,843 4.0 / 4.0 3,770.5 / 4,483.6
Microsoft-academic 44,216 22,464 187 21 3.0 / 5.4 2.0 / 2.7
High-school-contact 327 7,818 5 148 2.0 / 2.3 53.0 / 55.6
Oil-trade 229 100,639 4 16,394 4.0 / 4.0 175.0 / 1,757.9

Table C.2: Summary of ground-truth clusters used in the experiments
Dataset Cluster Size Volume Conductance
Amazon-reviews 1 - Amazon Fashion 31 3042 0.06
2 - All Beauty 85 4092 0.12
3 - Appliances 48 183 0.18
12 - Gift Cards 148 2965 0.13
15 - Industrial & Scientific 5334 72025 0.14
17 - Luxury Beauty 1581 28074 0.11
18 - Magazine Subs. 157 2302 0.13
24 - Prime Pantry 4970 131114 0.10
25 - Software 802 11884 0.14
Trivago-clicks KOR - South Korea 945 3696 0.24
ISL - Iceland 202 839 0.21
PRI - Puerto Rico 144 473 0.25
UA-43 - Crimea 200 1091 0.24
VNM - Vietnam 832 2322 0.24
HKG - Hong Kong 536 4606 0.24
MLT - Malta 157 495 0.24
GTM - Guatemala 199 652 0.24
UKR - Ukraine 264 648 0.24
EST - Estonia 158 850 0.23
Florida- Bay Producers 17 10781 0.70
Low-level consumers 35 173311 0.58
High-level consumers 70 375807 0.54
Microsoft- academic Data 15817 45060 0.06
ML 10265 26765 0.16
TCS 4159 10065 0.08
CV 13974 38395 0.08
High-school-contact Class 1 36 1773 0.25
Class 2 34 1947 0.29
Class 3 40 2987 0.20
Class 4 29 913 0.41
Class 5 38 2271 0.26
Class 6 34 1320 0.26
Class 7 44 2951 0.16
Class 8 39 2204 0.19
Class 9 33 1826 0.25

C.2.2 Methods and parameter setting

HFD. We use σ=0.0001\sigma=0.0001 for all the experiments. We set the total amount of initial mass Δ1\|\Delta\|_{1} to a constant factor tt times the volume of the target cluster. For Amazon-reviews, on the smaller clusters 1, 2, 3, 12, 18 we used t=200t=200; on the larger clusters 15, 17, 24, 25 we used t=50t=50. For Trivago-clicks, High-school-contact and Microsoft-academic, we used t=3t=3. For the Florida Bay food network, we used t=20,10,5t=20,10,5 for clusters 1, 2, 3, respectively. In all experiments, the choice of tt ensures that the diffusion process covers some part of the target cluster and incurs a high cost in the objective function. For the single seed node setting, we simply set the initial mass on the seed node to Δ1\|\Delta\|_{1}. For the multiple seed nodes setting, where we are given a seed set SS, for each vSv\in S we set the initial mass on vv to dvΔ1/vol(S)d_{v}\|\Delta\|_{1}/\textnormal{vol}(S).
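As a sketch, the initialization just described amounts to the following Julia routine; the names d, seeds and vol_target are illustrative, not our exact implementation:

# ‖Δ‖₁ = t · vol(target); a single seed receives all mass, while a seed set S
# receives mass proportional to node degrees: Δ_v = d_v · ‖Δ‖₁ / vol(S).
function initial_mass(d::Vector{Float64}, seeds::Vector{Int},
                      vol_target::Float64, t::Float64)
    Δ = zeros(length(d))
    total = t * vol_target
    volS = sum(d[seeds])
    for v in seeds
        Δ[v] = length(seeds) == 1 ? total : d[v] * total / volS
    end
    return Δ
end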

LH, ACL. We used the parameters suggested by the authors [33]. For both *-LH-2.0 and *-LH-1.4, we set γ=0.1\gamma=0.1, ρ=0.5\rho=0.5, and κ=cr\kappa=c\cdot r, where rr is the ratio between the number of seed nodes and the size of the target cluster and cc is a tuning constant. For Amazon-reviews, we set c=0.025c=0.025 as suggested in [33]. For Microsoft-academic, Trivago-clicks, and Florida-Bay we also used c=0.025c=0.025 because it produces good results. For High-school-contact we selected c=0.25c=0.25 after some tuning to make sure both *-LH-2.0 and *-LH-1.4 have good results. We set the parameters for ACL in exactly the same way as in [33]. We set δ=1\delta=1 for U-LH-* and δ=maxeE|e|\delta=\max_{e\in E}|e| for C-LH-*.

C.2.3 Additional experiments

Multiple seed nodes. We conduct additional experiments using multiple seed nodes for the Amazon-reviews and Trivago-clicks datasets. For each target cluster, we randomly select 1% of the nodes from that cluster as seed nodes, and we enforce that at least 5 nodes are selected as seeds; for example, if a cluster consists of only 100 nodes, we still select 5 nodes to form a seed set (see the sketch after this paragraph). We run 30 trials for each cluster and report the median conductance and F1 score of the output clusters. The results are shown in Table C.3 and Table C.4. For the multiple seed nodes setting, the results of U-LH-1.4, U-LH-2.0 and ACL on Amazon-reviews align with the ones reported in [33]: we reproduced almost identical numbers under the same setting, with only a few small differences due to randomness in seed node selection. In general, using more seed nodes improves the performance of all methods in terms of both conductance and F1 score. For Amazon-reviews, the output clusters of HFD always have the lowest conductance, even though in some cases low conductance does not align well with the given ground truth, and hence the lowest conductance does not lead to the highest F1 score. Similarly, for Trivago-clicks, both U-HFD and C-HFD consistently find the lowest-conductance clusters among all methods, which in general (but not always) leads to a higher F1 score. Note that, if a method uses the unit cut-cost (resp. the cardinality-based cut-cost), then we compute the conductance of the output cluster using the unit cut-cost (resp. the cardinality-based cut-cost). Therefore, depending on the specific cut-cost, the conductances in Table C.4 may have different scales. We highlight the lowest conductance for the two cut-costs separately.
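For reference, the seed-selection rule amounts to the following illustrative Julia helper (the clusters used in these experiments always contain more than 5 nodes, so the slice is well defined):

using Random

# 1% of the cluster as seeds, but never fewer than 5 nodes.
pick_seeds(cluster::Vector{Int}; rng = Random.default_rng()) =
    shuffle(rng, cluster)[1:max(5, ceil(Int, 0.01 * length(cluster)))]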

Table C.3: Complete local clustering results for Amazon-reviews network
Cluster
Metric Seed Method 1 2 3 12 15 17 18 24 25
Conductance Single U-HFD 0.17 0.11 0.12 0.16 0.36 0.25 0.17 0.14 0.28
U-LH-2.0 0.42 0.50 0.25 0.44 0.74 0.44 0.57 0.58 0.61
U-LH-1.4 0.33 0.44 0.25 0.36 0.81 0.40 0.51 0.54 0.59
ACL 0.42 0.50 0.25 0.54 0.77 0.52 0.63 0.68 0.65
Multiple U-HFD 0.05 0.10 0.12 0.13 0.20 0.16 0.14 0.11 0.32
U-LH-2.0 0.05 0.15 0.15 0.21 0.45 0.45 0.26 0.18 0.53
U-LH-1.4 0.05 0.13 0.15 0.15 0.35 0.33 0.19 0.14 0.47
ACL 0.05 0.27 0.16 0.27 0.56 0.53 0.33 0.30 0.59
F1 score Single U-HFD 0.45 0.09 0.65 0.92 0.04 0.10 0.80 0.81 0.09
U-LH-2.0 0.23 0.07 0.23 0.29 0.05 0.06 0.21 0.28 0.05
U-LH-1.4 0.23 0.09 0.35 0.40 0.00 0.07 0.31 0.35 0.06
ACL 0.23 0.07 0.22 0.25 0.04 0.05 0.17 0.20 0.04
Multiple U-HFD 0.49 0.50 0.69 0.98 0.19 0.36 0.91 0.89 0.33
U-LH-2.0 0.59 0.42 0.73 0.77 0.22 0.25 0.65 0.62 0.17
U-LH-1.4 0.52 0.45 0.73 0.90 0.27 0.29 0.79 0.77 0.20
ACL 0.59 0.25 0.70 0.64 0.20 0.19 0.51 0.49 0.14
Table C.4: Complete local clustering results for Trivago-clicks network
Cluster
Metric Seed Method KOR ISL PRI UA-43 VNM HKG MLT GTM UKR EST
Conductance Single U-HFD 0.010 0.023 0.014 0.011 0.018 0.017 0.010 0.007 0.016 0.012
U-LH-2.0 0.020 0.042 0.027 0.027 0.037 0.035 0.031 0.035 0.032 0.019
U-LH-1.4 0.036 0.069 0.047 0.039 0.060 0.052 0.040 0.045 0.065 0.036
ACL 0.027 0.050 0.034 0.031 0.042 0.043 0.047 0.039 0.043 0.026
\cdashline3-13 C-HFD 0.007 0.016 0.007 0.005 0.009 0.011 0.007 0.003 0.010 0.009
C-LH-2.0 0.022 0.066 0.030 0.030 0.035 0.035 0.029 0.028 0.029 0.029
C-LH-1.4 0.043 0.095 0.042 0.048 0.071 0.059 0.053 0.047 0.075 0.046
Multiple U-HFD 0.009 0.023 0.011 0.010 0.014 0.017 0.010 0.008 0.017 0.012
U-LH-2.0 0.023 0.034 0.018 0.021 0.054 0.030 0.021 0.022 0.041 0.018
U-LH-1.4 0.048 0.045 0.038 0.032 0.084 0.051 0.049 0.049 0.085 0.024
ACL 0.030 0.037 0.018 0.024 0.064 0.033 0.021 0.024 0.045 0.020
\cdashline3-13 C-HFD 0.006 0.016 0.006 0.005 0.006 0.011 0.007 0.003 0.011 0.009
C-LH-2.0 0.024 0.062 0.021 0.021 0.047 0.034 0.023 0.017 0.036 0.029
C-LH-1.4 0.054 0.067 0.033 0.037 0.094 0.057 0.053 0.044 0.094 0.032
F1 score Single U-HFD 0.75 0.99 0.89 0.85 0.28 0.82 0.98 0.94 0.60 0.94
U-LH-2.0 0.70 0.86 0.79 0.70 0.24 0.92 0.88 0.82 0.50 0.90
U-LH-1.4 0.69 0.84 0.80 0.75 0.28 0.87 0.92 0.83 0.47 0.90
ACL 0.65 0.84 0.75 0.68 0.23 0.90 0.83 0.69 0.50 0.88
C-HFD 0.76 0.99 0.95 0.94 0.32 0.80 0.98 0.97 0.68 0.94
C-LH-2.0 0.73 0.90 0.84 0.78 0.27 0.94 0.96 0.88 0.51 0.83
C-LH-1.4 0.71 0.88 0.84 0.78 0.27 0.88 0.93 0.85 0.50 0.85
Multiple U-HFD 0.87 0.99 0.97 0.92 0.55 0.82 0.98 0.97 0.87 0.94
U-LH-2.0 0.83 0.91 0.92 0.84 0.71 0.93 0.95 0.93 0.86 0.92
U-LH-1.4 0.78 0.84 0.83 0.79 0.74 0.85 0.85 0.84 0.75 0.87
ACL 0.81 0.89 0.91 0.85 0.68 0.93 0.96 0.91 0.83 0.90
C-HFD 0.86 0.99 0.97 0.96 0.32 0.80 0.98 0.97 0.69 0.94
C-LH-2.0 0.86 0.94 0.94 0.87 0.76 0.94 0.97 0.94 0.88 0.91
C-LH-1.4 0.83 0.89 0.90 0.83 0.67 0.89 0.92 0.85 0.77 0.89

Additional datasets, local clustering using unit and cardinality cut-costs. Table C.5 and Table C.6 show local clustering results on the High-school-contact and Microsoft-academic networks, respectively. We use the single seed node setting: we run the methods from each node in a target cluster and report the median conductance and F1 score. We cap the maximum number of runs at 500. Similar to the results on the other datasets, the output clusters of HFD always have the lowest conductance, which leads to the highest F1 score in most cases. We omit cardinality-based methods for Microsoft-academic because their results are very similar to those for the unit cut-cost setting.

Table C.5: Local clustering results for High-school-contact network
Cluster
Metric Method Class 1 Class 2 Class 3 Class 4 Class 5 Class 6 Class 7 Class 8 Class 9
Conductance U-HFD 0.25 0.29 0.13 0.42 0.21 0.26 0.16 0.19 0.25
U-LH-2.0 0.31 0.36 0.23 0.63 0.33 0.36 0.18 0.21 0.30
U-LH-1.4 0.29 0.32 0.21 0.54 0.29 0.37 0.16 0.22 0.29
ACL 0.62 0.64 0.61 0.98 0.61 0.60 0.59 0.55 0.59
\cdashline3-11 C-HFD 0.25 0.28 0.20 0.41 0.24 0.26 0.16 0.19 0.25
C-LH-2.0 0.27 0.33 0.20 0.57 0.29 0.32 0.16 0.20 0.27
C-LH-1.4 0.28 0.32 0.20 0.52 0.28 0.33 0.16 0.21 0.28
F1 score U-HFD 0.99 1.00 0.59 0.96 0.73 1.00 0.88 1.00 0.99
U-LH-2.0 0.91 0.83 0.93 0.66 0.84 0.88 0.96 0.96 0.90
U-LH-1.4 0.93 0.78 0.90 0.78 0.70 0.90 0.97 0.95 0.88
ACL 0.72 0.73 0.73 0.06 0.70 0.76 0.77 0.78 0.76
C-HFD 0.99 1.00 1.00 0.96 0.80 1.00 1.00 1.00 0.99
C-LH-2.0 0.93 0.82 0.92 0.74 0.84 0.93 0.97 0.97 0.91
C-LH-1.4 0.94 0.74 0.69 0.84 0.76 0.94 0.96 0.96 0.85
Table C.6: Local clustering results for Microsoft-academic network
Cluster
Metric Method Data ML TCS CV
Cond U-HFD 0.03 0.06 0.06 0.03
U-LH-2.0 0.07 0.09 0.10 0.07
U-LH-1.4 0.07 0.08 0.09 0.07
ACL 0.08 0.11 0.11 0.09
F1 score U-HFD 0.78 0.54 0.86 0.73
U-LH-2.0 0.67 0.46 0.71 0.61
U-LH-1.4 0.65 0.46 0.59 0.59
ACL 0.64 0.43 0.70 0.57

Additional dataset, node ranking using general submodular cut-cost. We provide another compelling use case of general submodular cut-costs. We consider the node ranking problem on the Oil-trade network. Our goal is to find the countries most related to a queried country based on the trade-network structure. We use the hypergraph modelling shown in Figure 1 in the main paper. We compare HFD using unit (U-HFD, γ1=γ2=1\gamma_{1}=\gamma_{2}=1), cardinality-based (C-HFD, γ1=1/2\gamma_{1}=1/2 and γ2=1\gamma_{2}=1) and submodular (S-HFD, γ1=1/2\gamma_{1}=1/2 and γ2=0\gamma_{2}=0) cut-costs. Table C.7 shows the top-2 ranking results. In this example, we use Iran as the seed node and rank the other countries according to the ordering of the dual variables returned by HFD. In 2017, the US imposed strict sanctions on Iran. Nevertheless, Bangladesh (generally regarded as an American ally) is among the top two ranked countries under the unit and cardinality-based cut-costs, which is hard to justify. On the other hand, S-HFD ranks Turkmenistan and Iraq as the top two. Interested readers can easily verify that these countries share strong economic or historical ties with Iran.

Additional method: pp-norm HFD. We tried HFD with the unit cut-cost and p=4p=4 (U-HFD-4.0). In practice, however, we did not observe that a larger p>2p>2 necessarily leads to better clustering results. We show a sample result of U-HFD-4.0 for Amazon-reviews in Table C.8. Notice that the performances of U-HFD-2.0 (p=2p=2) and U-HFD-4.0 are very similar.

Additional method: LH ++ flow improve. We tried a flow-improve method for hypergraphs [43], applied to the output of U-LH-2.0. The method is slow in our experiments, so we only tried it on a few small instances. The results for the Florida Bay food network are shown in Table C.9. In general, we find that applying the flow-improve method does not lead to consistent performance improvements.

Table C.7: Top-2 node-ranking results for Oil-trade network
Method Query: Iran
U-HFD Kenya, Bangladesh
C-HFD Bangladesh, United Rep. of Tanzania
S-HFD Turkmenistan, Iraq
Table C.8: Local clustering results for Amazon-reviews network using pp-norm HFD
Cluster
Metric Seed Method 1 2 3 12 15 17 18 24 25
Cond Single U-HFD-2.0 0.17 0.11 0.12 0.16 0.36 0.25 0.17 0.14 0.28
U-HFD-4.0 0.17 0.10 0.12 0.16 0.35 0.26 0.17 0.14 0.38
Multiple U-HFD-2.0 0.05 0.10 0.12 0.13 0.20 0.16 0.14 0.11 0.32
U-HFD-4.0 0.05 0.10 0.12 0.14 0.20 0.16 0.14 0.12 0.32
F1 score Single U-HFD-2.0 0.45 0.09 0.65 0.92 0.04 0.10 0.80 0.81 0.09
U-HFD-4.0 0.48 0.07 0.65 0.92 0.04 0.09 0.80 0.82 0.10
Multiple U-HFD-2.0 0.49 0.50 0.69 0.98 0.19 0.36 0.91 0.89 0.33
U-HFD-4.0 0.49 0.50 0.69 0.98 0.19 0.36 0.91 0.88 0.35
Table C.9: Local clustering results for the food network using unit cut-costs
Cluster
Metric Method Producers Low-level consumers High-level consumers
Conductance U-HFD 0.49 0.36 0.35
U-LH-2.0 0.51 0.39 0.39
U-LH-2.0 ++ flow 0.52 0.39 0.40
U-LH-1.4 0.49 0.39 0.41
ACL 0.52 0.39 0.40
F1 score U-HFD 0.69 0.47 0.64
U-LH-2.0 0.69 0.45 0.57
U-LH-2.0 ++ flow 0.69 0.45 0.57
U-LH-1.4 0.69 0.45 0.58
ACL 0.69 0.44 0.57

C.3 Computing platform and implementation detail

We implemented the AM algorithm [5] given in Algorithm B.1 in Julia. The code was run on a personal laptop with 32GB RAM and a 2.9 GHz 6-Core Intel Core i9; no GPU was used for computation. In the rest of this section, we discuss the implementation details of how we solve the nontrivial sub-problem in Algorithm B.1 to obtain the update (ϕ(k+1),r(k+1))(\phi^{(k+1)},r^{(k+1)}).

For the unit cut-cost case, we use an exact projection algorithm [32] to obtain the update (ϕ(k+1),r(k+1))(\phi^{(k+1)},r^{(k+1)}); algorithmic details are provided in Algorithm B.2. For cardinality-based or general submodular cut-costs, a conic Fujishige-Wolfe minimum norm algorithm [32] can be adopted to efficiently compute (ϕ(k+1),r(k+1))(\phi^{(k+1)},r^{(k+1)}). Our implementation uses simpler alternative methods. For the cardinality cut-cost, we use a projected subgradient method that works on a related dual problem to obtain the primal update (ϕ(k+1),r(k+1))(\phi^{(k+1)},r^{(k+1)}). The subgradient method is easy to implement, requires little computational overhead, and works well in practice for this sub-problem. For the specialized submodular cut-cost shown in Figure 1, since the hyperedge consists of only 4 nodes and has a special structure, we simply perform an exhaustive search that allows us to compute (ϕ(k+1),r(k+1))(\phi^{(k+1)},r^{(k+1)}) exactly using a constant number of vector-vector additions and multiplications. We provide details below.

Recall that the sub-problem to compute (ϕ(k+1),r(k+1))(\phi^{(k+1)},r^{(k+1)}) decomposes into a sequence of separate problems indexed by eEe\in E (cf. (B.7), in the following we assume p=2p=2 for simplicity):

minϕe0,reϕeBe12ϕe2+12σsere22.\min_{\phi_{e}\geq 0,r_{e}\in\phi_{e}B_{e}}\frac{1}{2}\phi_{e}^{2}+\frac{1}{2\sigma}\|s_{e}-r_{e}\|_{2}^{2}. (C.1)

The dual problem of (C.1) is written as (cf. (B.8), here we have p=q=2p=q=2)

minye12fe(ye)2+σ2ye22seTye.\min_{y_{e}}\frac{1}{2}f_{e}(y_{e})^{2}+\frac{\sigma}{2}\|y_{e}\|_{2}^{2}-s_{e}^{T}y_{e}. (C.2)

Let (ϕe,re)(\phi_{e}^{*},r_{e}^{*}) and yey_{e}^{*} denote primal and dual optimal solutions for (C.1) and (C.2), respectively. Then we have that

re+σye=seandϕe2=reTye.r_{e}^{*}+\sigma y_{e}^{*}=s_{e}\quad\mbox{and}\quad{\phi_{e}^{*}}^{2}={r_{e}^{*}}^{T}y_{e}^{*}.

The dual problem (C.2) can be derived in the same way that we derived the primal-dual HFD formulations; moreover, the above relations between ϕe,\phi_{e}^{*}, rer_{e}^{*} and yey_{e}^{*} follow immediately from the primal-dual derivation, the dual optimality condition and simple algebraic work. Therefore, in order to find an optimal solution (ϕe,re)(\phi_{e}^{*},r_{e}^{*}) for the primal problem (C.1), it suffices to find an optimal solution yey_{e}^{*} for the dual problem (C.2) and then recover (ϕe,re)(\phi_{e}^{*},r_{e}^{*}). Now, since 1Tre=0\mathit{1}^{T}r_{e}^{*}=0, we know that σ1Tye=1Tse\sigma\mathit{1}^{T}y_{e}^{*}=\mathit{1}^{T}s_{e}, i.e., yey_{e}^{*} lies in the hyperplane :={ye|σ1Tye=1Tse}\mathcal{H}:=\{y_{e}|\sigma\mathit{1}^{T}y_{e}=\mathit{1}^{T}s_{e}\}. Let hh denote the objective function of the dual problem (C.2); we compute yey_{e}^{*} using the projected subgradient method:

ye(k+1):=P(ye(k)1kg(k)g(k)2),y_{e}^{(k+1)}:=P_{\mathcal{H}}\left(y_{e}^{(k)}-\frac{1}{k}\frac{g^{(k)}}{\|g^{(k)}\|_{2}}\right),

where g(k)h(ye(k))g^{(k)}\in\partial h(y_{e}^{(k)}) is a subgradient at ye(k)y_{e}^{(k)}, and P()P_{\mathcal{H}}(\cdot) denotes the projection onto the hyperplane \mathcal{H}. We add the projection step so that, when we stop the subgradient method after KK iterations to get yey~e:=ye(K)y_{e}^{*}\approx\tilde{y}_{e}:=y_{e}^{(K)} and approximately recover rer_{e}^{*} as rer~e:=seσy~er_{e}^{*}\approx\tilde{r}_{e}:=s_{e}-\sigma\tilde{y}_{e}, the resulting r~e\tilde{r}_{e} is still a proper flow routing, i.e., 1Tr~e=0\mathit{1}^{T}\tilde{r}_{e}=0, and hence it is possible to have r~eϕ~eBe\tilde{r}_{e}\in\tilde{\phi}_{e}B_{e} for some ϕ~e\tilde{\phi}_{e}. In other words, the projection step is crucial because it permits the use of a sub-optimal dual solution y~e\tilde{y}_{e} to obtain a sub-optimal but feasible primal solution r~e\tilde{r}_{e}.
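A minimal Julia sketch of this projected subgradient loop is given below. It assumes a user-supplied oracle lovasz(y) that returns fe(y)f_{e}(y) together with a maximizer ρargmaxρBeρTy\rho\in\operatorname*{argmax}_{\rho\in B_{e}}\rho^{T}y (for submodular wew_{e} this is the standard greedy algorithm); all names are illustrative, not our exact implementation.

using LinearAlgebra

# Solve the dual (C.2) by projected subgradient, then recover (ϕ̃_e, r̃_e).
function dual_subgradient(s::Vector{Float64}, σ::Float64, lovasz; K::Int = 500)
    n = length(s)
    proj(y) = y .- (sum(y) - sum(s) / σ) / n    # projection onto σ1ᵀy = 1ᵀs_e
    y = proj(s ./ σ)                            # feasible starting point
    for k in 1:K
        f, ρ = lovasz(y)                        # f = f_e(y), ρ ∈ argmax_{ρ∈B_e} ρᵀy
        g = f .* ρ .+ σ .* y .- s               # subgradient of h at y
        norm(g) == 0 && break
        y = proj(y .- g ./ (k * norm(g)))       # step size 1/k, normalized direction
    end
    r = s .- σ .* y                             # r̃_e := s_e − σỹ_e, so 1ᵀr̃_e = 0
    ϕ = sqrt(max(dot(r, y), 0.0))               # ϕ_e² = r_eᵀy_e at optimality
    return ϕ, r, y
end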

For the cardinality cut-cost, our implementation uses the projected subgradient method described above to solve the sub-problem in Algorithm B.1 for ϕe\phi_{e} and rer_{e}. In what follows we describe how we handle the specialized submodular cut-cost.

Consider e={v1,v2,v3,v4}e=\{v_{1},v_{2},v_{3},v_{4}\} with associated submodular cut-cost wew_{e} such that we({vi})=1/2w_{e}(\{v_{i}\})=1/2 for i=1,2,3,4i=1,2,3,4, we({v1,v2})=0w_{e}(\{v_{1},v_{2}\})=0, we({v1,v3})=we({v1,v4})=1w_{e}(\{v_{1},v_{3}\})=w_{e}(\{v_{1},v_{4}\})=1, and we(S)=we(eS)w_{e}(S)=w_{e}(e\setminus S) for any SeS\subseteq e. Let BeB_{e} be the base polytope of wew_{e}. The sub-problem for this hyperedge is given in (C.1). Suppose (ϕe,re)(\phi_{e}^{*},r_{e}^{*}) is optimal for (C.1), and re=ϕeρer_{e}^{*}=\phi_{e}^{*}\rho_{e}^{*} for some ρeBe\rho_{e}^{*}\in B_{e}. If ϕe>0\phi_{e}^{*}>0, then ϕe=seTρeσ+ρe22\phi_{e}^{*}=\frac{s_{e}^{T}\rho_{e}^{*}}{\sigma+\|\rho_{e}^{*}\|_{2}^{2}}. To see this, substitute re=ϕeρer_{e}=\phi_{e}\rho_{e}^{*} into (C.1) and optimize over ϕe\phi_{e} only; the relation then follows from the first-order optimality condition and the assumption that ϕe>0\phi_{e}^{*}>0. On the other hand, if ϕe=0\phi_{e}^{*}=0, then we simply have re=0r_{e}^{*}=0. Therefore, in order to compute (ϕe,re)(\phi_{e}^{*},r_{e}^{*}) when ϕe>0\phi_{e}^{*}>0, it suffices to find ρe\rho_{e}^{*}. To do so, we look at the dual problem (C.2). Let yey_{e}^{*} be an optimal dual solution; then ρeargmaxρeBeρeTye\rho_{e}^{*}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}. The subsequent claims carry out a case analysis to determine all possible nontrivial candidates for ρe\rho_{e}^{*}.

Claim C.1.

If se,v1=se,v2s_{e,v_{1}}=s_{e,v_{2}}, then ρe,v1=ρe,v2=0\rho^{*}_{e,v_{1}}=\rho^{*}_{e,v_{2}}=0; if se,v3=se,v4s_{e,v_{3}}=s_{e,v_{4}}, then ρe,v3=ρe,v4=0\rho^{*}_{e,v_{3}}=\rho^{*}_{e,v_{4}}=0.

Proof.

The optimality condition of the dual problem (C.2) is that, for some ρ^eargmaxρeBeρeTye\hat{\rho}_{e}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*},

(ρ^eTye)ρ^e+σye=se.(\hat{\rho}_{e}^{T}y_{e}^{*})\hat{\rho}_{e}+\sigma y_{e}^{*}=s_{e}. (C.3)

Suppose se,v1=se,v2s_{e,v_{1}}=s_{e,v_{2}}, then we must have ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}. Otherwise, say ye,v1>ye,v2y_{e,v_{1}}^{*}>y_{e,v_{2}}^{*}, then we know that ρ^e,v1=1/2>1/2=ρ^e,v2\hat{\rho}_{e,v_{1}}=1/2>-1/2=\hat{\rho}_{e,v_{2}}, which follows from applying the greedy algorithm [4] to find ρ^e\hat{\rho}_{e} using the order of indices in yey_{e}^{*}. But then according to the optimality condition (C.3), we have

se,v1=(ρ^eTye)ρ^e,v1+σye,v1>(ρ^eTye)ρ^e,v2+σye,v2=se,v2,s_{e,v_{1}}=(\hat{\rho}_{e}^{T}y_{e}^{*})\hat{\rho}_{e,v_{1}}+\sigma y_{e,v_{1}}^{*}>(\hat{\rho}_{e}^{T}y_{e}^{*})\hat{\rho}_{e,v_{2}}+\sigma y_{e,v_{2}}^{*}=s_{e,v_{2}},

which contradicts our assumption that se,v1=se,v2s_{e,v_{1}}=s_{e,v_{2}}. Similarly, ye,v1<ye,v2y_{e,v_{1}}^{*}<y_{e,v_{2}}^{*} is not possible, either. Now, because ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}, by the optimality condition (C.3), we must also have ρ^e,v1=ρ^e,v2\hat{\rho}_{e,v_{1}}=\hat{\rho}_{e,v_{2}}. Finally, because ρ^eBe\hat{\rho}_{e}\in B_{e}, we know that ρ^e,v1+ρ^e,v20\hat{\rho}_{e,v_{1}}+\hat{\rho}_{e,v_{2}}\leq 0 and ρ^e,v1+ρ^e,v2=(ρ^e,v3+ρ^e,v4)we({v3,v4})=0\hat{\rho}_{e,v_{1}}+\hat{\rho}_{e,v_{2}}=-(\hat{\rho}_{e,v_{3}}+\hat{\rho}_{e,v_{4}})\geq-w_{e}(\{v_{3},v_{4}\})=0, so ρ^e,v1+ρ^e,v2=0\hat{\rho}_{e,v_{1}}+\hat{\rho}_{e,v_{2}}=0. Therefore, ρ^e,v1=ρ^e,v2=0\hat{\rho}_{e,v_{1}}=\hat{\rho}_{e,v_{2}}=0. Since ρ^e\hat{\rho}_{e} was chosen arbitrarily from the set argmaxρeBeρeTye\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}, and ρeargmaxρeBeρeTye\rho^{*}_{e}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}, we have that ρe,v1=ρe,v2=0\rho^{*}_{e,v_{1}}=\rho^{*}_{e,v_{2}}=0 as required. The other claim, on nodes v3v_{3} and v4v_{4}, follows in the same way. ∎

Claim C.2.

If se,v1se,v2s_{e,v_{1}}\neq s_{e,v_{2}} and se,v3=se,v4s_{e,v_{3}}=s_{e,v_{4}}, then ρe,v1,ρe,v2{1/2,1/2}\rho^{*}_{e,v_{1}},\rho^{*}_{e,v_{2}}\in\{1/2,-1/2\} and ρe,v3=ρe,v4=0\rho^{*}_{e,v_{3}}=\rho^{*}_{e,v_{4}}=0; if se,v1=se,v2s_{e,v_{1}}=s_{e,v_{2}} and se,v3se,v4s_{e,v_{3}}\neq s_{e,v_{4}}, then ρe,v1=ρe,v2=0\rho^{*}_{e,v_{1}}=\rho^{*}_{e,v_{2}}=0 and ρe,v3,ρe,v4{1/2,1/2}\rho^{*}_{e,v_{3}},\rho^{*}_{e,v_{4}}\in\{1/2,-1/2\}.

Proof.

We show the first case; the second case follows by symmetry. Let ρ^eargmaxρeBeρeTye\hat{\rho}_{e}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}. Suppose se,v1se,v2s_{e,v_{1}}\neq s_{e,v_{2}} and se,v3=se,v4s_{e,v_{3}}=s_{e,v_{4}}. Then by Claim C.1 we have ρ^e,v3=ρ^e,v4=0\hat{\rho}_{e,v_{3}}=\hat{\rho}_{e,v_{4}}=0. Assume without loss of generality that se,v1>se,v2s_{e,v_{1}}>s_{e,v_{2}}. If ye,v1<ye,v2y_{e,v_{1}}^{*}<y_{e,v_{2}}^{*}, then applying the greedy algorithm we get ρ^e,v1=1/2<1/2=ρ^e,v2\hat{\rho}_{e,v_{1}}=-1/2<1/2=\hat{\rho}_{e,v_{2}}. But this contradicts the optimality condition (C.3). Therefore we must have ye,v1ye,v2y_{e,v_{1}}^{*}\geq y_{e,v_{2}}^{*}. There are two cases. If ye,v1>ye,v2y_{e,v_{1}}^{*}>y_{e,v_{2}}^{*}, then applying the greedy algorithm we get ρ^e,v1=1/2\hat{\rho}_{e,v_{1}}=1/2 and ρ^e,v2=1/2\hat{\rho}_{e,v_{2}}=-1/2. If ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}, then because ρ^e,v1+ρ^e,v2=0\hat{\rho}_{e,v_{1}}+\hat{\rho}_{e,v_{2}}=0 (see the proof of Claim C.1 for an argument) and ρ^e,v3=ρ^e,v4=0\hat{\rho}_{e,v_{3}}=\hat{\rho}_{e,v_{4}}=0, we have ρ^eTye=0\hat{\rho}_{e}^{T}y_{e}^{*}=0. But this contradicts the optimality condition (C.3), because se,v1>se,v2s_{e,v_{1}}>s_{e,v_{2}} and ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}. Therefore we cannot have ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}. Since our choice of ρ^eargmaxρeBeρeTye\hat{\rho}_{e}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*} was arbitrary, and ρeargmaxρeBeρeTye\rho_{e}^{*}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}, we know that ρe\rho_{e}^{*} must satisfy the same properties as ρ^e\hat{\rho}_{e}. ∎

Claim C.3.

If se,v1se,v2s_{e,v_{1}}\neq s_{e,v_{2}} and se,v3se,v4s_{e,v_{3}}\neq s_{e,v_{4}}, then ρe,v1,ρe,v2{±1/2,±a}\rho^{*}_{e,v_{1}},\rho^{*}_{e,v_{2}}\in\{\pm 1/2,\pm a\} and ρe,v3,ρe,v4{±1/2,±b}\rho^{*}_{e,v_{3}},\rho^{*}_{e,v_{4}}\in\{\pm 1/2,\pm b\}, where a=(12+σ)(se,v1se,v2)/(se,v3se,v4)a=(\frac{1}{2}+\sigma)(s_{e,v_{1}}-s_{e,v_{2}})/(s_{e,v_{3}}-s_{e,v_{4}}) and b=(12+σ)(se,v3se,v4)/(se,v1se,v2)b=(\frac{1}{2}+\sigma)(s_{e,v_{3}}-s_{e,v_{4}})/(s_{e,v_{1}}-s_{e,v_{2}}).

Proof.

Let us assume without loss of generality that se,v1>se,v2s_{e,v_{1}}>s_{e,v_{2}} and se,v3>se,v4s_{e,v_{3}}>s_{e,v_{4}}. Let ρ^eargmaxρeBeρeTye\hat{\rho}_{e}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}. We have that ye,v1ye,v2y_{e,v_{1}}^{*}\geq y_{e,v_{2}}^{*} and ye,v3ye,v4y_{e,v_{3}}^{*}\geq y_{e,v_{4}}^{*} (see the proof of Claim C.2 for an argument for this). There are four cases and we analyze them one by one in the following.

Case 1. If ye,v1>ye,v2y_{e,v_{1}}^{*}>y_{e,v_{2}}^{*} and ye,v3>ye,v4y_{e,v_{3}}^{*}>y_{e,v_{4}}^{*}, then we have ρ^e,v1=ρ^e,v3=1/2\hat{\rho}_{e,v_{1}}=\hat{\rho}_{e,v_{3}}=1/2 and ρ^e,v2=ρ^e,v4=1/2\hat{\rho}_{e,v_{2}}=\hat{\rho}_{e,v_{4}}=-1/2.

Case 2. If ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*} and ye,v3=ye,v4y_{e,v_{3}}^{*}=y_{e,v_{4}}^{*}, then ρ^eTye=0\hat{\rho}_{e}^{T}y_{e}^{*}=0 and hence the optimality condition (C.3) cannot be satisfied. This leads to a contradiction.

Case 3. Suppose that ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*} and ye,v3>ye,v4y_{e,v_{3}}^{*}>y_{e,v_{4}}^{*}. Then, according to the optimality condition (C.3), because se,v1>se,v2s_{e,v_{1}}>s_{e,v_{2}} and ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}, we must have ρ^e,v1>ρ^e,v2\hat{\rho}_{e,v_{1}}>\hat{\rho}_{e,v_{2}}. Moreover, because ρ^e,v1+ρ^e,v2=0\hat{\rho}_{e,v_{1}}+\hat{\rho}_{e,v_{2}}=0, we know that ρ^e,v1=a=ρ^e,v2\hat{\rho}_{e,v_{1}}=a=-\hat{\rho}_{e,v_{2}} for some a>0a>0. We also know that ρ^e,v3=1/2\hat{\rho}_{e,v_{3}}=1/2 and ρ^e,v4=1/2\hat{\rho}_{e,v_{4}}=-1/2 since ye,v3>ye,v4y_{e,v_{3}}^{*}>y_{e,v_{4}}^{*}. Substituting the primal-dual relation ϕe=ρ^eTye\phi_{e}^{*}=\hat{\rho}_{e}^{T}y_{e}^{*} into (C.3), we have

ϕeρ^e,v1+σye,v1=se,v1andϕeρ^e,v2+σye,v2=se,v2.\phi_{e}^{*}\hat{\rho}_{e,v_{1}}+\sigma y_{e,v_{1}}^{*}=s_{e,v_{1}}~{}~{}\mbox{and}~{}~{}\phi_{e}^{*}\hat{\rho}_{e,v_{2}}+\sigma y_{e,v_{2}}^{*}=s_{e,v_{2}}.

Because ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}, we get that

ϕe(ρ^e,v1ρ^e,v2)=se,v1se,v2,\phi_{e}^{*}(\hat{\rho}_{e,v_{1}}-\hat{\rho}_{e,v_{2}})=s_{e,v_{1}}-s_{e,v_{2}},

and hence

ϕe=se,v1se,v2ρ^e,v1ρ^e,v2=se,v1se,v22a.\phi_{e}^{*}=\frac{s_{e,v_{1}}-s_{e,v_{2}}}{\hat{\rho}_{e,v_{1}}-\hat{\rho}_{e,v_{2}}}=\frac{s_{e,v_{1}}-s_{e,v_{2}}}{2a}. (C.4)

Because ρ^argmaxρeBeρeTye\hat{\rho}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*} was arbitrary, and ρeargmaxρeBeρeTye\rho_{e}^{*}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}, we know that ρe,v1=a=ρe,v2\rho_{e,v_{1}}^{*}=a=-\rho_{e,v_{2}}^{*} and ρe,v3=1/2=ρe,v4\rho_{e,v_{3}}^{*}=1/2=-\rho_{e,v_{4}}^{*}. On the other hand, since se,v1>se,v2s_{e,v_{1}}>s_{e,v_{2}} we know that ϕe>0\phi_{e}^{*}>0, therefore

ϕe=seTρeσ+ρe22=a(se,v1se,v2)+12(se,v3se,v4)σ+2a2+12.\phi_{e}^{*}=\frac{s_{e}^{T}\rho_{e}^{*}}{\sigma+\|\rho_{e}^{*}\|_{2}^{2}}=\frac{a(s_{e,v_{1}}-s_{e,v_{2}})+\frac{1}{2}(s_{e,v_{3}}-s_{e,v_{4}})}{\sigma+2a^{2}+\frac{1}{2}}. (C.5)

Combining equations (C.4) and (C.5) we get that a=(12+σ)(se,v1se,v2)/(se,v3se,v4)a=(\frac{1}{2}+\sigma)(s_{e,v_{1}}-s_{e,v_{2}})/(s_{e,v_{3}}-s_{e,v_{4}}).

Case 4. Suppose that ye,v1>ye,v2y_{e,v_{1}}^{*}>y_{e,v_{2}}^{*} and ye,v3=ye,v4y_{e,v_{3}}^{*}=y_{e,v_{4}}^{*}. Following an argument similar to Case 3, we get that ρe,v1=1/2=ρe,v2\rho_{e,v_{1}}^{*}=1/2=-\rho_{e,v_{2}}^{*} and ρe,v3=b=ρe,v4\rho_{e,v_{3}}^{*}=b=-\rho_{e,v_{4}}^{*}, where b=(12+σ)(se,v3se,v4)/(se,v1se,v2)b=(\frac{1}{2}+\sigma)(s_{e,v_{3}}-s_{e,v_{4}})/(s_{e,v_{1}}-s_{e,v_{2}}). ∎

Finally, combining Claims C.1, C.2, C.3 and the constraint that ρe,v1+ρe,v2=ρe,v3+ρe,v4=0\rho_{e,v_{1}}^{*}+\rho_{e,v_{2}}^{*}=\rho_{e,v_{3}}^{*}+\rho_{e,v_{4}}^{*}=0, there are at most 12 possible choices for ρe\rho_{e}^{*}. Therefore, an exhaustive search among these candidate vectors for ρe\rho_{e}^{*} (and hence ϕe=seTρeσ+ρe22\phi_{e}^{*}=\frac{s_{e}^{T}\rho_{e}^{*}}{\sigma+\|\rho_{e}^{*}\|_{2}^{2}} and re=ϕeρer_{e}^{*}=\phi_{e}^{*}\rho_{e}^{*}) that minimizes (C.1) can be done using a constant number of vector-vector additions and multiplications.
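To make the exhaustive search concrete, the following Julia sketch enumerates the candidates implied by Claims C.1-C.3 (skipping undefined or infeasible ones) and returns the minimizer of (C.1); variable names are ours, and degenerate cases are handled by the filters rather than by explicit case analysis.

using LinearAlgebra

# Exhaustive search over candidate ρ_e; s = [s_{e,v₁}, s_{e,v₂}, s_{e,v₃}, s_{e,v₄}].
function best_primal(s::Vector{Float64}, σ::Float64)
    a = (0.5 + σ) * (s[1] - s[2]) / (s[3] - s[4])
    b = (0.5 + σ) * (s[3] - s[4]) / (s[1] - s[2])
    pairs12 = [(0.0, 0.0), (0.5, -0.5), (-0.5, 0.5), (a, -a), (-a, a)]
    pairs34 = [(0.0, 0.0), (0.5, -0.5), (-0.5, 0.5), (b, -b), (-b, b)]
    bestϕ, bestr, bestval = 0.0, zeros(4), Inf
    for (ρ1, ρ2) in pairs12, (ρ3, ρ4) in pairs34
        ρ = [ρ1, ρ2, ρ3, ρ4]
        # Skip undefined candidates (a or b may be Inf/NaN) and points outside B_e.
        (all(isfinite, ρ) && maximum(abs, ρ) <= 0.5) || continue
        ϕ = max(dot(s, ρ) / (σ + dot(ρ, ρ)), 0.0)
        val = 0.5 * ϕ^2 + sum(abs2, s .- ϕ .* ρ) / (2σ)   # objective (C.1)
        val < bestval && ((bestϕ, bestr, bestval) = (ϕ, ϕ .* ρ, val))
    end
    return bestϕ, bestr
end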