

Local Hyper-Flow Diffusion

Kimon Fountoulakis, School of Computer Science, University of Waterloo, Waterloo, ON, Canada. Email: kfountou@uwaterloo.ca

Pan Li, Department of Computer Science, Purdue University, West Lafayette, IN, USA. Email: panli@purdue.edu

Shenghao Yang, School of Computer Science, University of Waterloo, Waterloo, ON, Canada. Email: shenghao.yang@uwaterloo.ca
Abstract

Recently, hypergraphs have attracted a lot of attention due to their ability to capture complex relations among entities. The surge of interest in hypergraphs has resulted in data of increasing size and complexity that exhibit interesting small-scale and local structure, e.g., small-scale communities and localized node-ranking around a given set of seed nodes. Popular and principled ways to capture the local structure are the local hypergraph clustering problem and the related seed set expansion problem. In this work, we propose the first local diffusion method that achieves an edge-size-independent Cheeger-type guarantee for the problem of local hypergraph clustering while applying to a rich class of higher-order relations that covers many previously studied special cases. Our method is based on a primal-dual optimization formulation where the primal problem has a natural network flow interpretation, and the dual problem has a cut-based interpretation using the $\ell_2$-norm penalty on associated cut-costs. We demonstrate that the new technique is significantly better than state-of-the-art methods on both synthetic and real-world data.

1 Introduction

Hypergraphs [8] generalize graphs by allowing a hyperedge to consist of multiple nodes, thereby capturing higher-order relationships in complex systems and datasets [36]. Hypergraphs have been used for music recommendation on Last.fm data [10], news recommendation [29], sets of product reviews on Amazon [37], and sets of co-purchased products at Walmart [1]. Beyond internet applications, hypergraphs are used for analyzing higher-order structure in neuronal, air-traffic and food networks [6, 30].

In order to explore and understand higher-order relationships in hypergraphs, recent work has made use of cut-cost functions that are defined by associating each hyperedge with a specific set function. These functions assign specific penalties for separating the nodes within individual hyperedges. They generalize the notion of hypergraph cuts and are crucial for determining small-scale community structure [30, 33]. The most popular cut-cost functions, with increasing capability to model complex multiway relationships, are the unit cut-cost [28, 27, 23], the cardinality-based cut-cost [43, 44] and the general submodular cut-cost [31, 47]. An illustration of a hyperedge and the associated cut-cost function is given in Figure 1. In the simplest setting, where all cut-cost functions take value either 0 or 1 (e.g., the case when $\gamma_1=\gamma_2=1$ in Figure 1(b)), we obtain a unit cut-cost hypergraph. In a slightly more general setting, where the cut-costs are determined solely by the number of nodes on either side of the hyperedge cut (e.g., the case when $\gamma_1=1/2$ and $\gamma_2=1$ in Figure 1(b)), we obtain a cardinality-based hypergraph. We refer to hypergraphs associated with arbitrary submodular cut-cost functions (e.g., the case when $\gamma_1=1/2$ and $0\leq\gamma_2\leq 1$ in Figure 1(b)) as general submodular hypergraphs.

Hypergraphs that arise from data science applications contain interesting small-scale local structure such as local communities [33, 42]. Exploiting this structure is central to the above-mentioned applications on hypergraphs and related applications in machine learning and applied mathematics [7]. We consider local hypergraph clustering as the task of finding a community-like cluster around a given set of seed nodes, where nodes in the cluster are densely connected to each other while relatively isolated from the rest of the graph. One of the most powerful primitives for the local hypergraph clustering task is graph diffusion, the process of spreading a given initial mass from some seed node(s) to neighbor nodes using the edges of the graph. Graph diffusions have been successfully employed in industry: both Pinterest and Twitter use diffusion methods for their recommendation systems [16, 17, 22], and Google uses diffusion methods to cluster query refinements [40]. Let us not forget PageRank [9, 39], the model behind Google's search engine.

Figure 1: (a) Network motif. (b) Hyperedge and cut-cost $w_e$. A food network can be mapped into a hypergraph by taking each network pattern in (a) as a hyperedge [30]. This network pattern captures carbon flow from two preys ($v_1,v_2$) to two predators ($v_3,v_4$). (b) is a hyperedge associated with cut-cost $w_e$ that models their relations: $w_e$ is a set function defined over the node set $e$ s.t. $w_e(\{v_i\})=\gamma_1$ for $i=1,2,3,4$, $w_e(\{v_1,v_2\})=\gamma_2$, $w_e(\{v_1,v_3\})=w_e(\{v_1,v_4\})=1$ and $w_e(S)=w_e(e\backslash S)$ for $S\subseteq e$. $w_e$ becomes the unit cut-cost when $\gamma_1=\gamma_2=1$; $w_e$ is cardinality-based if $\gamma_1=1/2$ and $\gamma_2=1$; more generally, $w_e$ is submodular if $\gamma_1=1/2$ and $0\leq\gamma_2\leq 1$. The specific choices depend on the application.

Empirical and theoretical performance of local diffusion methods is often measured on the problem of local hypergraph clustering [45, 20, 33]. Existing local diffusion methods directly apply only to hypergraphs with the unit cut-cost [26, 42]. For the slightly more general cardinality-based cut-cost, they rely on graph reduction techniques that result in a rather pessimistic edge-size-dependent approximation error [46, 26, 33, 43]. Moreover, none of the existing methods is capable of handling general submodular cut-costs. In this work, we are interested in designing a diffusion framework that (i) achieves stronger theoretical guarantees for the problem of local hypergraph clustering, (ii) is flexible enough to work with general submodular hypergraphs, and (iii) permits computationally efficient algorithms. We propose the first local diffusion method that simultaneously accomplishes all three goals.

In what follows we describe our main contributions and previous work. In Section 2 we provide preliminaries and notation. In Section 3 we introduce our diffusion model from a combinatorial flow perspective. In Section 4 we discuss the local hypergraph clustering problem and the Cheeger-type quadratic approximation error. In Section 5 we present an efficient optimization algorithm for our diffusion problem. In Section 6 we perform experiments using both synthetic and real datasets.

1.1 Our main contributions

In this work we propose a generic local diffusion model that applies to hypergraphs characterized by a rich class of cut-cost functions covering many previously studied special cases, e.g., unit, cardinality-based and submodular cut-costs. We provide the first edge-size-independent Cheeger-type approximation error for the problem of local hypergraph clustering using any of these cut-costs. In particular, assume that there exists a cluster $C$ with conductance $\Phi(C)$, and assume that we are given a set of seed nodes that reasonably overlaps with $C$; then the proposed diffusion model can be used to find a cluster $\hat{C}$ with conductance at most $O(\sqrt{\Phi(C)})$ (in the appendix we show that an $\ell_p$-norm version of the proposed model can achieve $O(\Phi(C))$ asymptotically). Our hypergraph diffusion model is formulated as a convex optimization problem. It has a natural combinatorial flow interpretation that generalizes the notion of network flows over hyperedges. We show that the optimization problem can be solved efficiently by an alternating minimization method. In addition, we prove that the number of nonzero nodes in the optimal solution is independent of the size of the hypergraph and depends only on the amount of initial mass. This key property ensures that our algorithm scales well in practice for large datasets. We evaluate our method using both synthetic and real-world data and show that it improves local clustering accuracy significantly for hypergraphs with unit, cardinality-based and general submodular cut-costs.

1.2 Previous work

Recently, clustering methods on hypergraphs have received renewed interest. Different methods require different assumptions about the hyperedge cut-cost, which can be roughly categorized into unit cut-cost, cardinality-based cut-cost and general submodular cut-cost. Moreover, existing methods can be either global, where the output is not localized around a given set of seed nodes, or local, where the output is a small cluster around a set of seed nodes. Local algorithms are the only scalable ones for large hypergraphs, which is our main focus. Many works propose global methods and thus are not scalable to large hypergraphs [49, 3, 25, 34, 6, 48, 11, 30, 31, 12, 47, 32, 13, 24, 42]. Local diffusion-based methods are more relevant to our work [26, 33, 43]. In particular, iterative hypergraph min-cut methods can be adopted for the local hypergraph clustering problem [43]. However, these methods require, in theory and in practice, a large seed set, i.e., they are not expansive and thus cannot work with a single seed node. The expansive combinatorial diffusion [45] has been generalized to hypergraphs [26] and can detect a target cluster using only one seed node. However, combinatorial methods have a large bias towards low conductance clusters as opposed to finding the target cluster [18]. The most relevant work to ours is [33]. However, the methods proposed in [33] depend on a reduction from hypergraphs to directed graphs. This results in an approximation error for clustering that is proportional to the size of the hyperedges and induces performance degradation when the hyperedges are large. In fact, none of the above approaches (global or local) has an edge-size-independent approximation error bound even for simple cardinality-based hypergraphs. Moreover, existing local approaches do not work for general submodular hypergraphs.

2 Preliminaries and Notations

Submodular function. Given a set $S$, we denote by $2^S$ the power set of $S$ and by $|S|$ the cardinality of $S$. A submodular function $F:2^S\rightarrow\mathbb{R}$ is a set function such that $F(A)+F(B)\geq F(A\cup B)+F(A\cap B)$ for any $A,B\subseteq S$.

Submodular hypergraph. A hypergraph $H=(V,E)$ is defined by a set of nodes $V$ and a set of hyperedges $E\subseteq 2^V$, i.e., each hyperedge $e\in E$ is a subset of $V$. A hypergraph is termed submodular if every $e\in E$ is associated with a submodular function $w_e:2^e\rightarrow\mathbb{R}_{\geq 0}$ [31]. The weight $w_e(S)$ indicates the cut-cost of splitting the hyperedge $e$ into two subsets, $S$ and $e\setminus S$. This general form allows us to describe the potentially complex higher-order relation among multiple nodes (Figure 1). A proper hyperedge weight $w_e$ should satisfy $w_e(\emptyset)=w_e(e)=0$. To ease notation we extend the domain of $w_e$ to $2^V$ by setting $w_e(S):=w_e(S\cap e)$ for any $S\subseteq V$. We assume without loss of generality that $w_e$ is normalized by $\vartheta_e:=\max_{S\subseteq e}w_e(S)$, so that $w_e(S)\in[0,1]$ for any $S\subseteq V$. For simplicity of presentation, we assume that $\vartheta_e=1$ for all $e$; this is without loss of generality, and in the appendix we show that our method works with arbitrary $\vartheta_e>0$. A submodular hypergraph is written as $H=(V,E,\mathcal{W})$ where $\mathcal{W}:=\{w_e,\vartheta_e\}_{e\in E}$. Note that when $w_e(S)=1$ for any $S\in 2^e\backslash\{\emptyset,e\}$, the definition reduces to unit cut-cost hypergraphs. When $w_e(S)$ depends only on $|S|$, it reduces to cardinality-based cut-cost hypergraphs.
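To make the definitions concrete, the following minimal Python sketch (our own illustration, not taken from the paper's released code; node labels 1-4 stand for $v_1,\ldots,v_4$) encodes the Figure 1 cut-cost and checks submodularity by brute force:

```python
from itertools import combinations

def make_cut_cost(gamma1, gamma2):
    """Cut-cost w_e of the Figure 1 hyperedge e = {v1, v2, v3, v4},
    with v1, v2 the preys and v3, v4 the predators."""
    e = frozenset([1, 2, 3, 4])
    def w(S):
        S = frozenset(S) & e                 # extend w_e to 2^V via S -> S ∩ e
        if len(S) > 2:
            S = e - S                        # symmetry: w_e(S) = w_e(e \ S)
        if len(S) == 0:
            return 0.0
        if len(S) == 1:
            return gamma1
        return gamma2 if S in (frozenset([1, 2]), frozenset([3, 4])) else 1.0
    return w

def is_submodular(w, e):
    subsets = [frozenset(c) for k in range(len(e) + 1) for c in combinations(e, k)]
    return all(w(A) + w(B) >= w(A | B) + w(A & B) for A in subsets for B in subsets)

w = make_cut_cost(gamma1=0.5, gamma2=0.0)         # the submodular setting of Figure 1
print(is_submodular(w, frozenset([1, 2, 3, 4])))  # True
print(w({1, 2}), w({1, 3}))                       # 0.0 1.0
```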

Vector/Function on $V$ or $E$. For a set of nodes $S\subseteq V$, we denote by $\mathit{1}_S$ the indicator vector of $S$, i.e., $[\mathit{1}_S]_v=1$ if $v\in S$ and 0 otherwise. For a vector $x\in\mathbb{R}^{|V|}$, we write $x(S):=\sum_{v\in S}x_v$, where $x_v$ is the entry in $x$ that corresponds to $v\in V$. We define the support of $x$ as $\textnormal{supp}(x):=\{v\in V\,|\,x_v\neq 0\}$. The support of a vector in $\mathbb{R}^{|E|}$ is defined analogously. We refer to a function over nodes $x:V\rightarrow\mathbb{R}$ and its explicit representation as a $|V|$-dimensional vector interchangeably.

Volume, cut, conductance. Given a submodular hypergraph $H=(V,E,\mathcal{W})$, the degree of a node $v$ is defined as $d_v:=|\{e\in E:v\in e\}|$. We reserve $d$ for the vector of node degrees and set $D=\mbox{diag}(d)$. We refer to $\textnormal{vol}(S):=d(S)$ as the volume of $S\subseteq V$. A cut is treated as a proper subset $C\subset V$, or a partition $(C,\bar{C})$ where $\bar{C}:=V\setminus C$. The cut-set of $C$ is defined as $\partial C:=\{e\in E\,|\,e\cap C\neq\emptyset,\,e\cap\bar{C}\neq\emptyset\}$; the cut-size of $C$ is defined as $\textnormal{vol}(\partial C):=\sum_{e\in\partial C}\vartheta_e w_e(C)=\sum_{e\in E}\vartheta_e w_e(C)$. The conductance of a cut $C$ in $H$ is $\Phi(C):=\frac{\textnormal{vol}(\partial C)}{\min\{\textnormal{vol}(C),\textnormal{vol}(V\setminus C)\}}$.
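As a concrete illustration of these definitions, here is a small Python sketch (our own toy example, not from the paper) computing degrees, volumes and the conductance of a cut in a unit cut-cost hypergraph with $\vartheta_e=1$:

```python
def degree(v, E):
    # d_v = number of hyperedges containing v (vartheta_e = 1 for all e)
    return sum(1 for e in E if v in e)

def vol(S, E):
    return sum(degree(v, E) for v in S)

def conductance(C, V, E):
    # unit cut-cost: w_e(C) = 1 exactly when e is cut by the partition (C, V \ C)
    cut_size = sum(1 for e in E if e & C and e - C)
    return cut_size / min(vol(C, E), vol(V - C, E))

V = {1, 2, 3, 4, 5}
E = [{1, 2, 3}, {3, 4}, {4, 5}, {1, 2}]
print(conductance({1, 2, 3}, V, E))   # one cut hyperedge {3,4}; 1 / min(6, 3) = 1/3
```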

Flow. A flow routing over a hyperedge $e$ is a function $r_e:e\rightarrow\mathbb{R}$ where $r_e(v)$ specifies the amount of mass that flows from $\{v\}$ to $e\setminus\{v\}$ over $e$. To ease notation we extend the domain of $r_e$ to $V$ by setting $r_e(v)=0$ for $v\not\in e$, so $r_e$ is treated as a function $r_e:V\rightarrow\mathbb{R}$ or equivalently a $|V|$-dimensional vector. The net (out)flow at a node $v$ is given by $\sum_{e\in E}r_e(v)$. Given a routing function $r_e$ and a set of nodes $S\subseteq V$, a directional routing on $e$ with direction $S\rightarrow e\setminus S$ is represented by $r_e(S)$, which specifies the net amount of mass that flows from $S$ to $e\setminus S$. A routing $r_e$ is called proper if it obeys flow conservation, i.e., $r_e^T\mathit{1}_e=0$. Our flow definition generalizes the notion of network flows to hypergraphs. We provide concrete illustrations in Figure 2.

Figure 2: Illustration of proper flow routings: (a) flows on a graph, (b) hyperedge routing, (c) flows on a hypergraph. The numbers next to each node correspond to entries in the flow routing $r_e$ over a (hyper)edge $e$. We assign the same color to a (hyper)edge and its associated flow values. Our flow definition is a natural generalization of graph edge flow where $r_e(v)=\pm f$ if and only if $v\in e$, i.e., $v$ is incident to $e$, where $f$ and the sign determine the amplitude and direction of the flow over $e$. In Figure 2(a), the net (out)flow at node $v_3$ is given by $\sum_{e\in E}r_e(v_3)=1-2-3=-4$. In Figure 2(b), the directional flow from $\{v_1\}$ to $\{v_2,v_3,v_4,v_5\}$ over this hyperedge equals $-1$; similarly, the directional flow from $\{v_1,v_2,v_4\}$ to $\{v_3,v_5\}$ equals $3+2-1=4$, etc. In Figure 2(c), the net (out)flow at node $v_3$ is given by $\sum_{e\in E}r_e(v_3)=3-2=1$.
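The following sketch checks the Figure 2(b) routing numerically; the entries for $v_3$ and $v_5$ are our own assumption, chosen to be consistent with the directional flows quoted in the caption and with flow conservation:

```python
# Flow routing over the hyperedge e = {v1,...,v5} of Figure 2(b).
# r_e(v1) = -1, r_e(v2) = 3, r_e(v4) = 2 follow from the caption; the values
# for v3 and v5 below are assumed (any pair summing to -4 preserves conservation).
r_e = {1: -1, 2: 3, 3: -3, 4: 2, 5: -1}

assert sum(r_e.values()) == 0            # proper routing: r_e^T 1_e = 0

def directional_flow(S, r_e):
    # r_e(S): net mass moving from S to e \ S over this hyperedge
    return sum(r_e[v] for v in S)

print(directional_flow({1}, r_e))        # -1: flow from {v1} to {v2,...,v5}
print(directional_flow({1, 2, 4}, r_e))  # 3 + 2 - 1 = 4: from {v1,v2,v4} to {v3,v5}
```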

3 Diffusion as an Optimization Problem

In this section we provide details of the proposed local diffusion method. We view a diffusion process as the task of spreading mass from a small set of seed nodes to a larger set of nodes. More precisely, given a hypergraph $H=(V,E,\mathcal{W})$, we assign each node a sink capacity specified by a sink function $T$, i.e., node $v$ is allowed to hold at most $T(v)$ amount of mass. In this work we focus on the setting where $T(v)=d_v$, so that a high-degree node that is part of many hyperedges can hold more mass than a low-degree node that is part of few hyperedges. Moreover, we assign each node some initial mass specified by a source function $\Delta$, i.e., node $v$ holds $\Delta(v)$ amount of mass at the start of the diffusion. In order to encourage the spread of mass in the hypergraph, the initial mass on the seed nodes is made larger than their capacity. This forces the seed nodes to diffuse mass to neighbor nodes to remove their excess mass. In Section 4, where we treat the local hypergraph clustering problem in detail, we discuss the choice of $\Delta$ that yields good theoretical guarantees.

Given a set of proper flow routings $r_e$ for $e\in E$, recall that $\sum_{e\in E}r_e(v)$ specifies the amount of net (out)flow at node $v$. Therefore, the vector $m=\Delta-\sum_{e\in E}r_e$ gives the amount of net mass at each node after routing. The excess mass at a node $v$ is $\textnormal{ex}(v):=\max\{m_v-d_v,0\}$. In order to force the diffusion of initial mass we could simply require that $\textnormal{ex}(v)=0$ for all $v\in V$, or equivalently, $\Delta-\sum_{e\in E}r_e\leq d$. But to provide additional flexibility in the diffusion dynamics, we introduce a hyper-parameter $\sigma\geq 0$ and impose a softer constraint $\Delta-\sum_{e\in E}r_e\leq d+\sigma Dz$, where $z\in\mathbb{R}^{|V|}$ is an optimization variable that controls how much excess mass is allowed on each node. In the context of numerical optimization, we show in Section 5 that $\sigma$ allows a reformulation which makes the optimization problem amenable to efficient alternating minimization schemes.

Note that so far we have not discussed how specific higher-order relations among nodes within a hyperedge affect the flow routings over it. Clearly, simply requiring that the $r_e$'s obey flow conservation (i.e., $r_e^T\mathit{1}_e=0$), as in the standard graph setting, is not enough for hypergraphs. An important difference between hyperedge flows and graph edge flows is that additional constraints on $r_e$ are needed. To this end, we consider $r_e=\phi_e\rho_e$ for some $\phi_e\in\mathbb{R}_+$ and $\rho_e\in B_e$, where

B_e:=\{\rho_e\in\mathbb{R}^{|V|}~|~\rho_e(S)\leq w_e(S),\ \forall S\subseteq V,\ \mbox{and}\ \rho_e(V)=w_e(V)\}

is the base polytope [4] of the submodular cut-cost $w_e$ associated with hyperedge $e$. It is straightforward to see that $r_e(v)=0$ for every $v\not\in e$ and $r_e^T\mathit{1}_e=0$, so $r_e$ defines a proper flow routing over $e$. Moreover, for any $S\subseteq V$, recall that $r_e(S)$ represents the net amount of mass that moves from $S$ to $e\setminus S$ over hyperedge $e$. Therefore, the constraints $\rho_e(S)\leq w_e(S)$ for $S\subseteq e$ mean that the directional flows $r_e(S)$ are upper bounded by the submodular function $\phi_e w_e(S)$. Intuitively, one may think of $\phi_e$ and $\rho_e$ as the scale and the shape of $r_e$, respectively.

The goal of our diffusion problem is to find low cost routings $r_e\in\phi_e B_e$ for $e\in E$ such that the capacity constraint $\Delta-\sum_{e\in E}r_e\leq d+\sigma Dz$ is satisfied. We consider the (weighted) $\ell_2$-norm of $\phi$ and $z$ as the cost of diffusion. In the appendix we show that the $\ell_2$-norm readily extends to the $\ell_p$-norm for any $p\geq 2$. Formally, we arrive at the following convex optimization formulation (input: the source function $\Delta$, the hypergraph $H=(V,E,\mathcal{W})$, and a hyper-parameter $\sigma$):

\min_{\phi\in\mathbb{R}^{|E|}_{+},\,z\in\mathbb{R}^{|V|}_{+}}~\frac{1}{2}\sum_{e\in E}\phi_{e}^{2}+\frac{\sigma}{2}\sum_{v\in V}d_{v}z_{v}^{2},\quad\mbox{s.t.}\quad\Delta-\sum_{e\in E}r_{e}\leq d+\sigma Dz,\quad r_{e}\in\phi_{e}B_{e},\ \forall e\in E. (1)

We name problem (1) Hyper-Flow Diffusion (HFD) for the combinatorial flow interpretation discussed above. The dual problem of (1) is:

\max_{x\in\mathbb{R}^{|V|}_{+}}~(\Delta-d)^{T}x-\frac{1}{2}\sum_{e\in E}f_{e}(x)^{2}-\frac{\sigma}{2}\sum_{v\in V}d_{v}x_{v}^{2}, (2)

where $f_e$ in (2) is the support function of the base polytope $B_e$, given by $f_e(x):=\max_{\rho_e\in B_e}\rho_e^T x$. $f_e$ is also known as the Lovász extension of the submodular function $w_e$.
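Since $f_e$ drives the dual objective, it is worth noting that it can be evaluated exactly by Edmonds' greedy algorithm, which also returns the maximizing $\rho_e\in B_e$. A minimal sketch (our own, assuming a submodular $w_e$ with $w_e(\emptyset)=0$ given as a Python callable on sets):

```python
def lovasz_extension(x, e, w):
    """Evaluate f_e(x) = max_{rho in B_e} rho^T x by Edmonds' greedy algorithm.
    Nodes outside e are ignored, matching the extension w_e(S) = w_e(S ∩ e)."""
    order = sorted(e, key=lambda v: x[v], reverse=True)  # sort e by node height
    rho, f, prev = {}, 0.0, frozenset()
    for v in order:
        cur = prev | {v}
        rho[v] = w(cur) - w(prev)    # greedy marginal gain: a vertex of B_e
        f += rho[v] * x[v]
        prev = cur
    return f, rho

# For the unit cut-cost, f_e(x) = max_{v in e} x_v - min_{v in e} x_v:
e = frozenset([1, 2, 3, 4])
w_unit = lambda S: 1.0 if 0 < len(S & e) < len(e) else 0.0
x = {1: 0.9, 2: 0.7, 3: 0.2, 4: 0.0}
print(lovasz_extension(x, e, w_unit)[0])   # 0.9 - 0.0 = 0.9
```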

We provide a combinatorial interpretation for (2) and leave the algebraic derivations to the appendix. For the dual problem, one can view the solution $x$ as assigning heights to nodes, and the goal is to separate/cut the nodes with source mass from the rest of the hypergraph. Observe that the linear term in the dual objective encourages raising $x$ higher on the seed nodes and setting it lower on others. The cost $f_e(x)$ captures the discrepancy in node heights over a hyperedge $e$ and encourages smooth height transitions over adjacent nodes. The dual solution embeds nodes into the nonnegative real line, and this embedding is what we actually use for local clustering and node ranking.

4 Local Hypergraph Clustering

In this section we discuss the performance of the primal-dual pair (1)-(2) in the context of local hypergraph clustering. We consider a generic hypergraph $H=(V,E,\mathcal{W})$ with submodular hyperedge weights $\mathcal{W}=\{w_e,\vartheta_e\}_{e\in E}$. Given a set of seed nodes $S\subset V$, the goal of local hypergraph clustering is to identify a target cluster $C\subset V$ that contains or overlaps well with $S$. This generalizes the definition of local clustering over graphs [19]. To the best of our knowledge, we are the first to consider this problem for general submodular hypergraphs. We consider a subset of nodes having low conductance as a good cluster, i.e., these nodes are well-connected internally and well-separated from the rest of the hypergraph. Following prior work on local hypergraph clustering, we assume the existence of an unknown target cluster $C$ with conductance $\Phi(C)$. We prove that applying a sweep-cut to an optimal solution $\hat{x}$ of (2) returns a cluster $\hat{C}$ whose conductance is at most quadratically worse than $\Phi(C)$. Note that this result resembles Cheeger-type approximation guarantees of spectral clustering in the graph setting [2], and it is the first such result that is independent of hyperedge size for general hypergraphs. We keep the discussion at a high level and defer details to the appendix, where we prove a more general and stronger result, i.e., a constant approximation error, when the primal problem (1) is penalized by the $\ell_p$-norm for any $p\geq 2$.

In order to start a diffusion process we need to specify the source mass $\Delta$. Similar to the $p$-norm flow diffusion in the graph setting [20], we let

\Delta(v)=\begin{cases}\delta d_{v}&\mbox{if}~v\in S,\\ 0&\mbox{otherwise,}\end{cases} (3)

where $S$ is a set of seed nodes and $\delta\geq 1$. Below, we make the assumptions that the seed set $S$ and the target cluster $C$ overlap, that a constant factor of $\textnormal{vol}(C)$ amount of mass is trapped in $C$ initially, and that the hyper-parameter $\sigma$ is not too large. Note that Assumption 2 is without loss of generality: if the right value of $\delta$ is not known a priori, we can always employ binary search to find a good choice. Assumption 3 is very weak as it allows $\sigma$ to reside in an interval containing 0.

Assumption 1.

$\textnormal{vol}(S\cap C)\geq\alpha\,\textnormal{vol}(C)$ and $\textnormal{vol}(S\cap C)\geq\beta\,\textnormal{vol}(S)$ for some $\alpha,\beta\in(0,1]$.

Assumption 2.

The source mass $\Delta$ as specified in (3) satisfies $\delta=3/\alpha$, so $\Delta(C)\geq 3\,\textnormal{vol}(C)$.

Assumption 3.

$\sigma$ satisfies $0\leq\sigma\leq\beta\Phi(C)/3$.

Let $\hat{x}$ be an optimal solution of the dual problem (2). For $h>0$ define the sweep sets $S_h:=\{v\in V\,|\,\hat{x}_v\geq h\}$. We state the approximation property in Theorem 1.

Theorem 1.

Under Assumptions 1, 2 and 3, there exists $h>0$ such that $\Phi(S_h)\leq O(\sqrt{\Phi(C)}/\alpha\beta)$.
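The sweep-cut rounding behind Theorem 1 is simple to implement. A sketch (our own, reusing the `conductance` helper from the Section 2 sketch and assuming a dual embedding `x` with nonnegative entries; it scans every prefix of the sorted support, which includes every sweep set $S_h$):

```python
def sweep_cut(x, V, E):
    # Scan the sweep sets S_h = {v : x_v >= h} induced by the embedding x
    # and return the one with smallest conductance.
    support = sorted((v for v in V if x[v] > 0), key=lambda v: x[v], reverse=True)
    best_set, best_phi, S = None, float("inf"), set()
    for v in support:
        S.add(v)
        if len(S) == len(V):          # skip the trivial cut C = V
            break
        phi = conductance(S, V, E)
        if phi < best_phi:
            best_set, best_phi = set(S), phi
    return best_set, best_phi
```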

One of the challenges we face in establishing Theorem 1 is making sure that our diffusion model enjoys both good clustering guarantees and practical algorithmic advantages at the same time. This is achieved by introducing the hyper-parameter $\sigma$ into our diffusion problem. We demonstrate how $\sigma$ helps with algorithmic development in Section 5; from a clustering perspective, however, the additional flexibility given by $\sigma>0$ complicates the underlying diffusion dynamics, making them more difficult to analyze. Another challenge is connecting the Lovász extension $f_e(x)$ in (2) with the conductance of a cluster. We resolve these problems by combining a generalized Rayleigh quotient result for submodular hypergraphs [31], primal-dual convex conjugate relations between (1) and (2), and a classical property of the Choquet integral/Lovász extension.

Let $(\hat{\phi},\hat{r},\hat{z})$ be an optimal solution of the primal problem (1). We state the following lemma on the locality (i.e., sparsity) of the optimal solutions, which justifies why HFD is a local diffusion method.

Lemma 2.

$|\textnormal{supp}(\hat{\phi})|\leq\textnormal{vol}(\textnormal{supp}(\hat{x}))\leq\|\Delta\|_1$; moreover, $\textnormal{vol}(\textnormal{supp}(\hat{z}))=\textnormal{vol}(\textnormal{supp}(\hat{x}))$ if $\sigma>0$.

5 Optimization algorithm for HFD

We use a simple Alternating Minimization (AM) [5] method that efficiently solves the primal diffusion problem (1). For $e\in E$, we define a diagonal matrix $A_e\in\mathbb{R}^{|V|\times|V|}$ such that $[A_e]_{v,v}=1$ if $v\in e$ and 0 otherwise. Denote $\mathcal{C}:=\{(\phi,r):r_e\in\phi_e B_e,~\forall e\in E\}$. The following Lemma 3 allows us to cast problem (1) into an equivalent separable formulation amenable to the AM method.

Lemma 3.

The following problem is equivalent to (1) for any $\sigma>0$, in the sense that $(\hat{\phi},\hat{r},\hat{z})$ is optimal in (1) for some $\hat{z}\in\mathbb{R}^{|V|}$ if and only if $(\hat{\phi},\hat{r},\hat{s})$ is optimal in (4) for some $\hat{s}\in\bigotimes_{e\in E}\mathbb{R}^{|V|}$.

\min_{\phi,r,s}~\frac{1}{2}\sum_{e\in E}\left(\phi_{e}^{2}+\frac{1}{\sigma}\|s_{e}-r_{e}\|_{2}^{2}\right),\quad\mbox{s.t.}\quad(\phi,r)\in\mathcal{C},\quad\Delta-\sum_{e\in E}s_{e}\leq d,\quad s_{e,v}=0,\ \forall v\not\in e. (4)

The AM method for problem (4) is given in Algorithm 1. The first sub-problem corresponds to computing projections onto a group of cones, where all the projections can be computed in parallel. The computation of each projection depends on the choice of base polytope $B_e$. If the hyperedge weight $w_e$ is the unit cut-cost, $B_e$ has special structure and the projection can be computed in $O(|e|\log|e|)$ time [32]. For general $B_e$, a conic Fujishige-Wolfe minimum norm algorithm can be adopted to compute the projection [32]. The second sub-problem in Algorithm 1 can be computed in closed form (see the sketch after Algorithm 1). We provide more information about Algorithm 1 and its convergence properties in the appendix.

Algorithm 1 Alternating Minimization for problem (4)

Initialization:

\phi^{(0)}:=0,\quad r^{(0)}:=0,\quad s^{(0)}_{e}:=D^{-1}A_{e}[\Delta-d]_{+},\ \forall e\in E.

For $k=0,1,2,\ldots$ do:

(\phi^{(k+1)},r^{(k+1)}):=\operatorname*{argmin}_{(\phi,r)\in\mathcal{C}}\ \sum_{e\in E}\left(\phi_{e}^{2}+\tfrac{1}{\sigma}\|s_{e}^{(k)}-r_{e}\|_{2}^{2}\right)

s^{(k+1)}:=\operatorname*{argmin}_{s}\ \sum_{e\in E}\|s_{e}-r_{e}^{(k+1)}\|_{2}^{2}\quad\mbox{s.t.}\quad\Delta-\sum_{e\in E}s_{e}\leq d,\ s_{e,v}=0,\ \forall v\not\in e.
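For illustration, the second sub-problem admits the following closed form (a sketch under the main-text setting $\vartheta_e=1$, so $d_v$ equals the number of incident hyperedges, and not the paper's released implementation): the problem separates over nodes, and for each node $v$ it is a Euclidean projection of $(r_{e,v})_{e\ni v}$ onto the half-space $\sum_{e\ni v}s_{e,v}\geq\Delta(v)-d_v$, which shifts all incident entries by a common nonnegative amount.

```python
def s_update(r, Delta, d, V, E):
    """Closed-form second step of Algorithm 1 (sketch, vartheta_e = 1).
    r and s map (hyperedge index, node) pairs to reals; Delta, d map nodes to reals."""
    s = {(ei, v): r.get((ei, v), 0.0) for ei, e in enumerate(E) for v in e}
    for v in V:
        incident = [ei for ei, e in enumerate(E) if v in e]
        if not incident:
            continue
        total = sum(s[(ei, v)] for ei in incident)
        # project onto sum_e s_{e,v} >= Delta(v) - d(v): shift each entry equally
        shift = max(0.0, (Delta.get(v, 0.0) - d[v] - total) / len(incident))
        for ei in incident:
            s[(ei, v)] += shift
    return s
```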

We remark that the reformulation (4) for $\sigma>0$ is crucial from an algorithmic point of view. If $\sigma=0$, then the primal problem (1) has complicated coupling constraints that are hard to deal with. In this case, one has to resort to the dual problem (2). However, problem (2) has a nonsmooth objective function, which prevents the application of optimization methods designed for smooth objectives. Even though the subgradient method may be applied, we have observed empirically that its convergence is extremely slow for our problem, and early stopping results in output of poor quality.

Lastly, as noted in Lemma 2, the number of nonzeros in the optimal solution is upper bounded by $\|\Delta\|_1$. In Figure 3 we plot the number of nodes having positive excess (which equals the number of nonzeros in the dual solution $\hat{x}$) at every iteration of Algorithm 1. Figure 3 indicates that Algorithm 1 is strongly local, meaning that it works only on a small fraction of nodes (and their incident hyperedges) as opposed to producing dense iterates. This key empirical observation enables our algorithm to scale to large datasets by simply keeping track of all active nodes and hyperedges. Proving that the worst-case running time of AM depends only on the number of nonzero nodes at optimality, as opposed to the size of the whole hypergraph, is an open problem which we leave for future work.

Figure 3: (a) Cluster 12 - Gift Cards. (b) Cluster 18 - Magazine Subs. (c) Cluster 24 - Prime Pantry. The blue solid line plots the number of nonzeros in the dual solution $x$ over 200 iterations of Algorithm 1, when it is applied to solve HFD on the Amazon-reviews hypergraph for local clustering. See Section 6.2 for details about the dataset. The error bars show standard deviation over 10 trials. In each trial we pick a different seed node and set the same amount of initial mass. The black dashed line shows the average number of nonzeros at optimality. The algorithm touches only a small fraction of nodes out of the total 2,268,264 nodes in the Amazon-reviews dataset.

6 Empirical Results

In this section we evaluate the performance of HFD for local clustering. First, we carry out experiments on synthetic hypergraphs with varying target cluster conductances and varying hyperedge sizes. For the unit cut-cost setting, we show that HFD is more robust and performs better when the target cluster is noisy; for a cardinality-based cut-cost setting, we show that the edge-size-independent approximation guarantee is important for obtaining good recovery results. Second, we carry out experiments using real-world data. We show that HFD significantly outperforms existing state-of-the-art diffusion methods for both unit and cardinality-based cut-costs. Moreover, we provide a compelling example where a specialized submodular cut-cost is necessary for obtaining good results. Code that reproduces all results is available at https://github.com/s-h-yang/HFD.

6.1 Synthetic experiments using hypergraph stochastic block model (HSBM)

The generative model. We generalize the standard $k$-uniform hypergraph stochastic block model ($k$HSBM) [21] to allow different types of inter-cluster hyperedges to appear with possibly different probabilities according to the cardinality of the hyperedge cut. Let $V=\{1,2,\ldots,n\}$ be a set of nodes and let $k\geq 2$ be the required constant hyperedge size. We consider the $k$HSBM with parameters $k$, $n$, $p$, $q_j$, $j=1,2,\ldots,\lfloor k/2\rfloor$. The model samples a $k$-uniform hypergraph according to the following rules: (i) the community label $\sigma_i\in\{0,1\}$ is chosen uniformly at random for $i\in V$ (we consider two blocks for simplicity; in general the model applies to any number of blocks); (ii) each size-$k$ subset $e=\{v_1,v_2,\ldots,v_k\}$ of $V$ appears independently as a hyperedge with probability

\mathbb{P}(e\in E)=\begin{cases}p&\mbox{if}~\sigma_{v_{1}}=\sigma_{v_{2}}=\cdots=\sigma_{v_{k}},\\ q_{j}&\mbox{if}~\min\{k-\sum_{i=1}^{k}\sigma_{v_{i}},\sum_{i=1}^{k}\sigma_{v_{i}}\}=j.\end{cases}

If $k=3$ or all $q_j$'s are the same, then we obtain the standard two-block $k$HSBM. We use this setting to evaluate HFD for the unit cut-cost. If the $q_j$'s are different, then we obtain a cardinality-based $k$HSBM. In particular, when $q_1\geq q_2\geq\cdots\geq q_{\lfloor k/2\rfloor}$, it models the scenario where hyperedges containing similar numbers of nodes from each block are rare, while small noise (e.g., hyperedges that have one or two nodes in one block and all the rest in the other block) is more frequent. We use $q_1\gg q_j$, $j\geq 2$, to evaluate HFD for the cardinality-based cut-cost. There are other random hypergraph models, for example the Poisson degree-corrected HSBM [14] that deals with degree heterogeneity and edge size heterogeneity. In our experiments we focus on the $k$HSBM because it allows stronger control over hyperedge sizes. We provide details on data generation in the appendix.
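A minimal sampler for this model (our own sketch; it enumerates all $\binom{n}{k}$ candidate hyperedges, so it is only meant for small $n$):

```python
import random
from itertools import combinations

def sample_khsbm(n, k, p, q, seed=0):
    """Two-block kHSBM: q[j-1] is the probability of a hyperedge whose
    smaller side of the cut contains j nodes (j = 1, ..., floor(k/2))."""
    rng = random.Random(seed)
    sigma = [rng.randint(0, 1) for _ in range(n)]   # community labels, rule (i)
    E = []
    for e in combinations(range(n), k):             # rule (ii)
        ones = sum(sigma[v] for v in e)
        j = min(k - ones, ones)                     # cardinality of the cut
        if rng.random() < (p if j == 0 else q[j - 1]):
            E.append(set(e))
    return sigma, E

sigma, E = sample_khsbm(n=60, k=4, p=0.01, q=[0.001, 0.0001])  # q1 >> q2
```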

Task and methods. We consider the local hypergraph clustering problem. We assume that we are given a single labelled node and the goal is to recover all nodes having the same label. Using a single seed node is the most common (and sought-after) practice for local graph clustering tasks. We compare the performance of HFD with two other methods: (i) Localized Quadratic Hypergraph Diffusions (LH) [33], which can be seen as a hypergraph analogue of Approximate Personalized PageRank (APPR); (ii) ACL [2], which is used to compute APPR vectors on a standard graph obtained by reducing the hypergraph through star expansion [50]. There are other heuristic methods which first reduce a hypergraph to a graph by clique expansion [6] and then apply diffusion methods for standard graphs. We did not compare with this approach because clique expansion often results in a dense graph and consequently makes the computation slow; moreover, it was shown in [33] that clique expansion does not offer significant performance improvement over star expansion.

Cut-costs and parameters. We consider both the unit cut-cost, i.e., $w_e(S)=1$ if $S\cap e\neq\emptyset$ and $e\setminus S\neq\emptyset$, and the cardinality cut-cost $w_e(S)=\min\{|S\cap e|,|e\setminus S|\}/\lfloor|e|/2\rfloor$. HFD with unit and cardinality cut-costs is denoted by U-HFD and C-HFD, respectively. LH also works with both unit and cardinality cut-costs, and we denote these variants by U-LH and C-LH, respectively. For HFD, we initialize the seed mass so that $\|\Delta\|_1$ is a constant factor times the volume of the target cluster, and we set $\sigma=0.01$. We extensively tune LH by performing binary search over its parameters $\kappa$ and $\delta$ and pick the output cluster having the lowest conductance. For ACL we use the same parameter choices as in [33]. Details on parameter settings are provided in the appendix.
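For reference, the two cut-cost families used in the experiments can be written in a few lines (a sketch; both are normalized so that $\max_S w_e(S)=1$, i.e., $\vartheta_e=1$):

```python
def w_unit(S, e):
    S = S & e
    return 1.0 if S and e - S else 0.0              # 1 iff e is cut

def w_card(S, e):
    S = S & e
    return min(len(S), len(e - S)) / (len(e) // 2)  # cardinality-based, normalized

e = {1, 2, 3, 4, 5, 6}
print(w_unit({1}, e), w_card({1}, e))               # 1.0 0.333...
print(w_unit({1, 2, 3}, e), w_card({1, 2, 3}, e))   # 1.0 1.0
```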

Results. For each hypergraph, we randomly pick a block as the target cluster. We run each method 50 times; each time we choose a different node from the target cluster as the single seed node.
Unit cut-cost results. Figure 4 shows local clustering results when we fix $k=3$ but vary the conductance of the target cluster (i.e., constant $p$ but varying $q_1$). Observe that the performance of all methods degrades as the target cluster becomes noisier, but U-HFD performs significantly better than both U-LH and ACL when the conductance of the target cluster is between 0.2 and 0.4. The reason that U-HFD performs better is in part that it requires much weaker conditions for its theoretical conductance guarantee to hold. On the contrary, LH assumes an upper bound on the conductance of the target cluster [33]. This upper bound is dataset-dependent and can become very small in many cases, leading to poor practical performance. We provide more details on this point in the appendix. ACL with star expansion is a heuristic method that has no performance guarantee.
Cardinality cut-cost results. Figure 5 shows the median (markers) and 25-75 percentiles (lower-upper bars) of conductance ratios (i.e., the ratio between output conductance and ground-truth conductance, lower is better) and F1 scores for different methods for $k\in\{3,4,5,6\}$. The target cluster for each $k$ has conductance around 0.3. (See the appendix for similar results when we fix the target cluster conductances around 0.2 and 0.25, respectively. These cover a reasonably wide range of scenarios in terms of the target conductance and illustrate the performance of the algorithms for different levels of noise.) For $k=3$, unit and cardinality cut-costs are equivalent, and therefore all methods perform similarly. As $k$ increases, the cardinality cut-cost provides better performance than the unit cut-cost in both conductance and F1. However, since the theoretical approximation guarantee of C-LH depends on hyperedge size [33], there is a noticeable performance degradation for C-LH when we increase $k=3$ to $k=4$. On the other hand, the performance of C-HFD appears to be independent of $k$, which aligns with our conductance bound in Theorem 1.

Figure 4: Output conductance and F1 against ground-truth conductance.

Figure 5: Conductance ratio and F1 on $k$-uniform hypergraphs.

6.2 Experiments using real-world data

We conduct extensive experiments using real-world data. First, we show that HFD achieves superior local clustering performance compared to existing methods for both unit and cardinality-based cut-costs. Then, we show that a general submodular cut-cost (recall that HFD is the only method that applies to this setting) can be necessary for capturing complex higher-order relations in the data, improving F1 scores by up to 20% for local clustering and providing the only meaningful results for node ranking. In the appendix we report additional local clustering experiments on two additional datasets, where our method improves F1 scores by 8% on average over 13 different target clusters.

Datasets. We provide basic information about the datasets used in our experiments. Complete descriptions are provided in the appendix.
Amazon-reviews ($|V|$ = 2,268,264, $|E|$ = 4,285,363) [38, 43]. In this hypergraph each node represents a product. A set of products are connected by a hyperedge if they are reviewed by the same person. We use product category labels as ground-truth cluster identities. We consider all clusters of fewer than 10,000 nodes.
Trivago-clicks ($|V|$ = 172,738, $|E|$ = 233,202) [14]. The nodes in this hypergraph are accommodations/hotels. A set of nodes are connected by a hyperedge if a user performed a "click-out" action on them during the same browsing session. We use geographical locations as ground-truth cluster identities. There are 160 such clusters, and we filter them by cluster size and conductance.
Florida Bay food network ($|V|$ = 128, $|E|$ = 141,233) [30]. Nodes in this hypergraph correspond to different species or organisms that live in the Bay, and hyperedges correspond to transformed motifs (Figure 1) of the original dataset. Each species is labelled according to its role in the food chain: producers, low-level consumers, high-level consumers.

Methods and parameters. We compare HFD with LH and ACL. (We also tried a flow-improve method for hypergraphs [43], but the method was very slow in our experiments, so we only used it for small datasets; see the appendix for results. The flow-improve method did not improve the performance of existing methods, so we omitted it from comparisons on larger datasets.) There is a heuristic nonlinear variant of LH which is shown to outperform linear LH in some cases [33], so we also compare with the same nonlinear variant considered in [33]. We denote the linear and nonlinear versions by LH-2.0 and LH-1.4, respectively. We set $\sigma=0.0001$ for HFD, and we set the parameters for LH-2.0, LH-1.4 and ACL as suggested by the authors [33]. More details on parameter choices appear in the appendix. We prefix methods that use unit and cardinality-based cut-costs by U- and C-, respectively.

Experiments for unit and cardinality cut-costs. For each target cluster in Amazon-reviews and Trivago-clicks, we run the methods multiple times; each time we use a different node as the single seed node. (We show additional results using seed sets of more than one node in the appendix.) We report the median F1 scores of the output clusters in Table 1 and Table 2. For Amazon-reviews, we only compare the unit cut-cost because it is both shown in [33] and verified by our experiments that the unit cut-cost is more suitable for this dataset. Observe that U-HFD obtains the highest F1 scores for nearly all clusters. In particular, U-HFD significantly outperforms the other methods for clusters 12, 18 and 24, where we see an increase in F1 score of up to 52%. For Trivago-clicks, C-HFD has the best performance for all but one cluster. Among all the other methods, U-HFD has the second highest F1 scores for nearly all clusters. Moreover, observe that for each method (i.e., HFD, LH-2.0, LH-1.4), the cardinality cut-cost leads to higher F1 than its unit cut-cost counterpart.

Table 1: F1 results for Amazon-reviews network (columns are cluster IDs)
Method 1 2 3 12 15 17 18 24 25
U-HFD 0.45 0.09 0.65 0.92 0.04 0.10 0.80 0.81 0.09
U-LH-2.0 0.23 0.07 0.23 0.29 0.05 0.06 0.21 0.28 0.05
U-LH-1.4 0.23 0.09 0.35 0.40 0.00 0.07 0.31 0.35 0.06
ACL 0.23 0.07 0.22 0.25 0.04 0.05 0.17 0.20 0.04
Table 2: F1 results for Trivago-clicks network
Method KOR ISL PRI UA-43 VNM HKG MLT GTM UKR EST
U-HFD 0.75 0.99 0.89 0.85 0.28 0.82 0.98 0.94 0.60 0.94
C-HFD 0.76 0.99 0.95 0.94 0.32 0.80 0.98 0.97 0.68 0.94
U-LH-2.0 0.70 0.86 0.79 0.70 0.24 0.92 0.88 0.82 0.50 0.90
C-LH-2.0 0.73 0.90 0.84 0.78 0.27 0.94 0.96 0.88 0.51 0.83
U-LH-1.4 0.69 0.84 0.80 0.75 0.28 0.87 0.92 0.83 0.47 0.90
C-LH-1.4 0.71 0.88 0.84 0.78 0.27 0.88 0.93 0.85 0.50 0.85
ACL 0.65 0.84 0.75 0.68 0.23 0.90 0.83 0.69 0.50 0.88

Experiments for general submodular cut-cost. In order to understand the importance of specialized general submodular hypergraphs, we study the node-ranking problem for the Florida Bay food network using the hypergraph modelling shown in Figure 1. We compare HFD using unit (U-HFD, $\gamma_1=\gamma_2=1$), cardinality-based (C-HFD, $\gamma_1=1/2$ and $\gamma_2=1$) and submodular (S-HFD, $\gamma_1=1/2$ and $\gamma_2=0$) cut-costs. Our goal is to find the species most similar to a queried species based on the food-network structure. Table 3 shows that S-HFD provides the only meaningful node-ranking results. Intuitively, when $\gamma_2=0$, separating the preys $v_1,v_2$ from the predators $v_3,v_4$ incurs 0 cost. This encourages S-HFD to diffuse mass among preys or among predators only, and not to cross from a predator to a prey or vice versa. As a result, similar species receive similar amounts of mass and thus are ranked similarly. In the local clustering setting, Table 3 compares HFD using the different cut-costs. By exploiting specialized higher-order relations, S-HFD further improves F1 scores by up to 20% over U-HFD and C-HFD. This is not surprising, given the poor node-ranking results of the other cut-costs. In the appendix we show another application of the submodular cut-cost for node ranking in an international oil trade network.

Table 3: Node-ranking and local clustering in Florida Bay food network using different cut-costs
Top-2 node-ranking results Clustering F1
Method Query: Raptors Query: Gray Snapper Prod. Low High
U-HFD Epiphytic Gastropods, Detriti. Gastropods Meiofauna, Epiphytic Gastropods 0.69 0.47 0.64
C-HFD Epiphytic Gastropods, Detriti. Gastropods Meiofauna, Epiphytic Gastropods 0.67 0.47 0.64
S-HFD Gruiformes, Small Shorebirds Snook, Mackerel 0.69 0.62 0.84

References

  • [1] I. Amburg, N. Veldt, and A. R. Benson. Clustering in graphs and hypergraphs with categorical edge labels. In Proceedings of the Web Conference, 2020.
  • [2] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. FOCS ’06 Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 475–486, 2006.
  • [3] Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings and graph partitioning. JACM, 56(2), April 2009.
  • [4] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.
  • [5] A. Beck. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1):185–209, 2015.
  • [6] A. R. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.
  • [7] Austin Benson, David Gleich, and Desmond Higham. Higher-order network analysis takes off, fueled by classical ideas and new data. SIAM News (online), 2021.
  • [8] Claude Berge. Hypergraphs: combinatorics of finite sets, volume 45. Elsevier, 1984.
  • [9] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
  • [10] J. Bu, S. Tan, C. Chen, C. Wang, H. Wu, L. Zhang, and X. He. Music recommendation by unified hypergraph: combining social media information and music content. In MM ’10: Proceedings of the 18th ACM international conference on Multimedia, 2010.
  • [11] T.-H. Hubert Chan, Anand Louis, Zhihao Gavin Tang, and Chenzi Zhang. Spectral properties of hypergraph laplacian and approximation algorithms. JACM, 65(3), March 2018.
  • [12] Eli Chien, Pan Li, and Olgica Milenkovic. Landing probabilities of random walks for seed-set expansion in hypergraphs, 2019.
  • [13] Uthsav Chitra and Benjamin Raphael. Random walks on hypergraphs with edge-dependent vertex weights. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1172–1181. PMLR, 09–15 Jun 2019.
  • [14] Philip S. Chodrow, Nate Veldt, and Austin R. Benson. Generative hypergraph clustering: from blockmodels to modularity, 2021.
  • [15] Ivar Ekeland and Roger Témam. Convex Analysis and Variational Problems. Society for Industrial and Applied Mathematics, 1999.
  • [16] C. Eksombatchai, P. Jindal, J. Z. Liu, Y. Liu, R. Sharma, C. Sugnet, M. Ulrich, and J. Leskovec. Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. WWW '18: Proceedings of the 2018 World Wide Web Conference, pages 1775–1784, 2018.
  • [17] C. Eksombatchai, J. Leskovec, R. Sharma, C. Sugnet, and M. Ulrich. Node graph traversal methods. U.S. Patent 10 762 134 B1, Sep. 2020, 2020.
  • [18] K. Fountoulakis, M. Liu, D. F. Gleich, and M. W. Mahoney. Flow-based algorithms for improving clusters: A unifying framework, software, and performance. arXiv:2004.09608, 2020.
  • [19] K. Fountoulakis, F. Roosta-Khorasani, J. Shun, X. Cheng, and M. W. Mahoney. Variational perspective on local graph clustering. Mathematical Programming B, pages 1–21, 2017.
  • [20] K. Fountoulakis, D. Wang, and S. Yang. $p$-norm flow diffusion for local graph clustering. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • [21] Debarghya Ghoshdastidar and Ambedkar Dukkipati. Consistency of spectral partitioning of uniform hypergraphs under planted partition model. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
  • [22] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. WTF: the who to follow service at twitter. WWW ’13: Proceedings of the 22nd international conference on World Wide Web, pages 505–514, 2013.
  • [23] Scott W. Hadley. Approximation techniques for hypergraph partitioning problems. Discrete Appl. Math., 59(2):115–127, May 1995.
  • [24] Koby Hayashi, Sinan G. Aksoy, Cheong Hee Park, and Haesun Park. Hypergraph random walks, laplacians, and clustering. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, page 495–504, New York, NY, USA, 2020. Association for Computing Machinery.
  • [25] Matthias Hein, Simon Setzer, Leonardo Jost, and Syama Sundar Rangapuram. The total variation on hypergraphs-learning on hypergraphs revisited. In Proceedings of the 26th International Conference on Neural Information Processing Systems-Volume 2, pages 2427–2435, 2013.
  • [26] R. Ibrahim and D. F. Gleich. Local hypergraph clustering using capacity releasing diffusion. PLOS ONE, 15(12):1–20, 12 2020.
  • [27] Edmund Ihler, Dorothea Wagner, and Frank Wagner. Modeling hypergraphs by graphs with the same mincut properties. Information Processing Letters, 45(4):171–175, 1993.
  • [28] E. L. Lawler. Cutsets and partitions of hypergraphs. Networks, 3(3):275–285, 1973.
  • [29] L. Li and T. Li. News recommendation via hypergraph learning: encapsulation of user behavior and news content. In WSDM ’13: Proceedings of the sixth ACM international conference on Web search and data mining, 2013.
  • [30] P. Li and O. Milenkovic. Inhomogeneous hypergraph clustering with applications. In Advances in Neural Information Processing Systems, 2017.
  • [31] P. Li and O. Milenkovic. Submodular hypergraphs: p-laplacians, cheeger inequalities and spectral clustering. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • [32] Pan Li, Niao He, and Olgica Milenkovic. Quadratic decomposable submodular function minimization: Theory and practice. Journal of Machine Learning Research, 21(106):1–49, 2020.
  • [33] M. Liu, N. Veldt, H. Song, P. Li, and D. F. Gleich. Strongly local hypergraph diffusions for clustering and semi-supervised learning. In TheWebConf 2021, 2021.
  • [34] Anand Louis. Hypergraph markov operators, eigenvalues and approximation algorithms. STOC, page 713–722, New York, NY, USA, 2015. Association for Computing Machinery.
  • [35] Rossana Mastrandrea, Julie Fournet, and Alain Barrat. Contact patterns in a high school: A comparison between data collected using wearable sensors, contact diaries and friendship surveys. PLOS ONE, 10(9):e0136497, 2015.
  • [36] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827, 2002.
  • [37] J. Ni, J. Li, and J. McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, 2019.
  • [38] Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, 2019.
  • [39] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford, 1999. Technical Report 1999-66, Stanford InfoLab.
  • [40] E. Sadikov, J. Madhavan, L. Wang, and A. Halevy. Clustering query refinements by user intent. WWW ’10: Proceedings of the 19th international conference on World wide web, pages 841–850, 2010.
  • [41] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. An overview of microsoft academic service (MAS) and applications. In Proceedings of the 24th International Conference on World Wide Web, 2015.
  • [42] Yuuki Takai, Atsushi Miyauchi, Masahiro Ikeda, and Yuichi Yoshida. Hypergraph clustering based on pagerank. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1970–1978, 2020.
  • [43] N. Veldt, A. R. Benson, and J. Kleinberg. Minimizing localized ratio cut objectives in hypergraphs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020.
  • [44] Nate Veldt, Austin R. Benson, and Jon Kleinberg. Hypergraph cuts with general splitting functions, 2020.
  • [45] D. Wang, K. Fountoulakis, M. Henzinger, M. W. Mahoney, and S. Rao. Capacity releasing diffusion for speed and locality. Proceedings of the 34th International Conference on Machine Learning, 70:3598–3607, 2017.
  • [46] Hao Yin, Austin R Benson, Jure Leskovec, and David F Gleich. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 555–564, 2017.
  • [47] Yuichi Yoshida. Cheeger inequalities for submodular transformations. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2582–2601. SIAM, 2019.
  • [48] Chenzi Zhang, Shuguang Hu, Zhihao Gavin Tang, and TH Hubert Chan. Re-revisiting learning on hypergraphs: confidence interval and subgradient method. In International Conference on Machine Learning, pages 4026–4034. PMLR, 2017.
  • [49] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. Advances in neural information processing systems, 19:1601–1608, 2006.
  • [50] J. Y. Zien, M. D. F. Schlag, and P. K. Chan. Multilevel spectral hypergraph partitioning with arbitrary vertex sizes. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(9):1389–1399, 1999.

Appendices for: Local Hyper-Flow Diffusion

Outline of the Appendix:

  • Appendix A contains supplementary material to Section 3 and Section 4 of the paper:

    • mathematical derivation of the dual diffusion problem;

    • proofs of Theorem 1 and Lemma 2.

  • Appendix B contains supplementary material to Section 5 of the paper:

    • proof of Lemma 3;

    • convergence properties of Algorithm 1;

    • specialized algorithms for alternating minimization sub-problems of Algorithm 1.

  • Appendix C contains supplementary material to Section 6 of the paper:

    • additional synthetic experiments using kk-uniform hypergraph stochastic block model;

    • complete information about the real datasets considered in Section 6 of the paper;

    • experiments for local clustering using seed sets that contain more than one node;

    • experiments using 3 additional real datasets that are not discussed in the main paper;

    • parameter settings and implementation details.

Appendix A Approximation guarantee for local hypergraph clustering

In this section we prove a generalized and stronger version of Theorem 1 in the main paper, where the primal and dual diffusion problems are penalized by the $\ell_p$-norm and $\ell_q$-norm, respectively, with $p\geq 2$ and $1/p+1/q=1$. Moreover, we consider a generic hypergraph $H=(V,E,\mathcal{W})$ with general submodular weights $\mathcal{W}=\{w_e,\vartheta_e\}_{e\in E}$ for any nonzero $\vartheta_e:=\max_{S\subseteq e}w_e(S)$. All claims in the main paper are therefore immediate special cases when $p=q=2$ and $\vartheta_e=1$ for all $e\in E$.

Unless otherwise stated, we use the same notation as in the main paper. We generalize the definition of the degree of a node vVv\in V as

dv:=eE:veϑe.d_{v}:=\sum_{e\in E:v\in e}\vartheta_{e}.

Note that when ϑe=1\vartheta_{e}=1 for all ee, the above definition reduces to dv=|{eE:ve}|d_{v}=|\{e\in E:v\in e\}|, which is the number of hyperedges to which vv belongs.

Given H=(V,E,𝒲)H=(V,E,\mathcal{W}) where 𝒲={we,ϑe}eE\mathcal{W}=\{w_{e},\vartheta_{e}\}_{e\in E}, p2p\geq 2, and a hyperparameter σ0\sigma\geq 0, our primal Hyper-Flow Diffusion (HFD) problem is written as

minϕ+|E|,z+|V|1peEϑeϕep+σpvVdvzvps.t.ΔeEϑered+σDzreϕeBe,eE\begin{split}\min_{\phi\in\mathbb{R}^{|E|}_{+},z\in\mathbb{R}^{|V|}_{+}}~{}&\frac{1}{p}\sum_{e\in E}\vartheta_{e}\phi_{e}^{p}+\frac{\sigma}{p}\sum_{v\in V}d_{v}z_{v}^{p}\\ \mbox{s.t.}\hskip 25.60747pt&\Delta-\sum_{e\in E}\vartheta_{e}r_{e}\leq d+\sigma Dz\\ &r_{e}\in\phi_{e}B_{e},~{}\forall e\in E\end{split} (A.1)

where

Be:={ρe|V||ρe(S)we(S),SV,andρe(V)=we(V)}B_{e}:=\{\rho_{e}\in\mathbb{R}^{|V|}~{}|~{}\rho_{e}(S)\leq w_{e}(S),\forall S\subseteq V,~{}\mbox{and}~{}\rho_{e}(V)=w_{e}(V)\}

is the base polytope of wew_{e}. The vector m=ΔeEϑerem=\Delta-\sum_{e\in E}\vartheta_{e}r_{e} gives the net amount of mass after routing. Note that we multiply rer_{e} by ϑe\vartheta_{e} because we have normalized wew_{e} by ϑe\vartheta_{e} in its definition.
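To make these objects concrete, here is a minimal Python sketch (our own illustration, not released code; the layout that stores each hyperedge as a (nodes, theta_e) pair and each routing vector r_e as a dict over the nodes of e is an assumption) that computes the generalized degrees d_v and the net mass m = Delta - sum_e theta_e r_e:

# Minimal sketch (ours): generalized degrees and net mass after routing.
def degrees(n, hyperedges):
    d = [0.0] * n
    for nodes, theta in hyperedges:
        for v in nodes:
            d[v] += theta              # d_v = sum of theta_e over e containing v
    return d

def net_mass(Delta, hyperedges, r):
    m = list(Delta)                    # m = Delta - sum_e theta_e * r_e
    for (nodes, theta), r_e in zip(hyperedges, r):
        for v in nodes:
            m[v] -= theta * r_e.get(v, 0.0)
    return m

H = [((0, 1, 2), 1.0), ((1, 2, 3), 2.0)]
print(degrees(4, H))                   # [1.0, 3.0, 3.0, 2.0]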

Lemma 1.

The following optimization problem is dual to (A.1):

maxx+|V|(Δd)Tx1qeEϑefe(x)qσqvVdvxvq\max_{x\in\mathbb{R}^{|V|}_{+}}~{}(\Delta-d)^{T}x-\frac{1}{q}\sum_{e\in E}\vartheta_{e}f_{e}(x)^{q}-\frac{\sigma}{q}\sum_{v\in V}d_{v}x_{v}^{q} (A.2)

where fe(x):=maxρeBeρeTxf_{e}(x):=\max_{\rho_{e}\in B_{e}}\rho_{e}^{T}x is the support function of base polytope BeB_{e}.

Proof.

Using convex conjugates, for x+|V|x\in\mathbb{R}^{|V|}_{+}, we have

1qfe(x)q\displaystyle\frac{1}{q}f_{e}(x)^{q} =maxϕe0ϕefe(x)1pϕep,eE,\displaystyle=\max_{\phi_{e}\geq 0}~{}\phi_{e}f_{e}(x)-\frac{1}{p}\phi_{e}^{p},~{}\forall e\in E, (A.3a)
1qxvq\displaystyle\frac{1}{q}x_{v}^{q} =maxzv0zvxv1pzvp,vV.\displaystyle=\max_{z_{v}\geq 0}~{}z_{v}x_{v}-\frac{1}{p}z_{v}^{p},~{}\forall v\in V. (A.3b)

Applying the definition of fe(x)f_{e}(x), we can write (A.3a) as

1qfe(x)q=maxϕe0ϕefe(x)1pϕep=maxϕe0,reϕeBereTx1pϕep.\frac{1}{q}f_{e}(x)^{q}=\max_{\phi_{e}\geq 0}~{}\phi_{e}f_{e}(x)-\frac{1}{p}\phi_{e}^{p}\ =\max_{\phi_{e}\geq 0,r_{e}\in\phi_{e}B_{e}}r_{e}^{T}x-\frac{1}{p}\phi_{e}^{p}.

Therefore,

maxx+|V|(Δd)Tx1qeEϑefe(x)qσqvVdvxvq=maxx+|V|(Δd)TxeEϑe(maxϕe0,reϕeBereTx1pϕep)σvVdv(maxzv0zvxv1pzvp)=maxx+|V|(Δd)Tx+minϕ+|E|reϕeBe,eEeE(1pϑeϕepϑereTx)+minz+|V|σvV(1pdvzvpdvzvxv)=minϕ+|E|,z+|V|reϕeBe,eE1peEϑeϕep+σpvVdvzvp+maxx+|V|((Δd)TxeEϑereTxσvVdvzvxv)=minϕ+|E|,z+|V|reϕeBe,eE1peEϑeϕep+σpvVdvzvps.t.ΔdeEϑereσDz0.\begin{split}&~{}\max_{x\in\mathbb{R}^{|V|}_{+}}~{}(\Delta-d)^{T}x-\frac{1}{q}\sum_{e\in E}\vartheta_{e}f_{e}(x)^{q}-\frac{\sigma}{q}\sum_{v\in V}d_{v}x_{v}^{q}\\ =&~{}\max_{x\in\mathbb{R}^{|V|}_{+}}~{}(\Delta-d)^{T}x-\sum_{e\in E}\vartheta_{e}\left(\max_{\phi_{e}\geq 0,r_{e}\in\phi_{e}B_{e}}~{}r_{e}^{T}x-\frac{1}{p}\phi_{e}^{p}\right)-\sigma\sum_{v\in V}d_{v}\left(\max_{z_{v}\geq 0}~{}z_{v}x_{v}-\frac{1}{p}z_{v}^{p}\right)\\ =&~{}\max_{x\in\mathbb{R}^{|V|}_{+}}~{}(\Delta-d)^{T}x+\min_{\begin{subarray}{c}\phi\in\mathbb{R}^{|E|}_{+}\\ r_{e}\in\phi_{e}B_{e},\forall e\in E\end{subarray}}\sum_{e\in E}\left(\frac{1}{p}\vartheta_{e}\phi_{e}^{p}-\vartheta_{e}r_{e}^{T}x\right)+\min_{z\in\mathbb{R}^{|V|}_{+}}\sigma\sum_{v\in V}\left(\frac{1}{p}d_{v}z_{v}^{p}-d_{v}z_{v}x_{v}\right)\\ =&~{}\min_{\begin{subarray}{c}\phi\in\mathbb{R}^{|E|}_{+},z\in\mathbb{R}^{|V|}_{+}\\ r_{e}\in\phi_{e}B_{e},\forall e\in E\end{subarray}}\frac{1}{p}\sum_{e\in E}\vartheta_{e}\phi_{e}^{p}+\frac{\sigma}{p}\sum_{v\in V}d_{v}z_{v}^{p}+\max_{x\in\mathbb{R}^{|V|}_{+}}\left((\Delta-d)^{T}x-\sum_{e\in E}\vartheta_{e}r_{e}^{T}x-\sigma\sum_{v\in V}d_{v}z_{v}x_{v}\right)\\ =&~{}\min_{\begin{subarray}{c}\phi\in\mathbb{R}^{|E|}_{+},z\in\mathbb{R}^{|V|}_{+}\\ r_{e}\in\phi_{e}B_{e},\forall e\in E\end{subarray}}\frac{1}{p}\sum_{e\in E}\vartheta_{e}\phi_{e}^{p}+\frac{\sigma}{p}\sum_{v\in V}d_{v}z_{v}^{p}\quad\mbox{s.t.}\quad\Delta-d-\sum_{e\in E}\vartheta_{e}r_{e}-\sigma Dz\leq 0.\end{split}

In the above derivations, we may exchange the order of minimization and maximization to arrive at the second-to-last equality, by Proposition 2.2, Chapter VI, in [15]. The last equality follows from

maxx+|V|((Δd)TxeEϑereTxσvVdvzvxv)={0,ifΔdeEϑereσDz0,+,otherwise.\max_{x\in\mathbb{R}^{|V|}_{+}}\bigg{(}(\Delta-d)^{T}x-\sum_{e\in E}\vartheta_{e}r_{e}^{T}x-\sigma\sum_{v\in V}d_{v}z_{v}x_{v}\bigg{)}=\left\{\begin{array}[]{ll}0,&\mbox{if}~{}\Delta-d-\sum\limits_{e\in E}\vartheta_{e}r_{e}-\sigma Dz\leq 0,\\ +\infty,&\mbox{otherwise}.\end{array}\right.

Notation. For the rest of this section, we reserve the notation (ϕ^,z^)(\hat{\phi},\hat{z}) and x^\hat{x} for optimal solutions of (A.1) and (A.2) respectively. If σ=0\sigma=0, we simply treat z^=0\hat{z}=0.

The next lemma relates primal and dual optimal solutions. We make frequent use of this relation throughout our discussion.

Lemma 2.

We have that ϕ^ep=fe(x^)q\hat{\phi}_{e}^{p}=f_{e}(\hat{x})^{q} for all eEe\in E. Moreover, if σ>0\sigma>0, then z^vp=x^vq\hat{z}_{v}^{p}=\hat{x}_{v}^{q} for all vVv\in V.

Proof.

Given x^\hat{x} an optimal solution to (A.2), it follows directly from (A.3) and strong duality that (ϕ^,z^)(\hat{\phi},\hat{z}) must satisfy, for each eEe\in E and vVv\in V,

ϕ^e=fe(x^)q1=argmaxϕe0ϕefe(x^)1pϕepandz^v=x^vq1=argmaxzv0zvx^v1pzvp.\hat{\phi}_{e}=f_{e}(\hat{x})^{q-1}=\operatorname*{argmax}_{\phi_{e}\geq 0}~{}\phi_{e}f_{e}(\hat{x})-\frac{1}{p}\phi_{e}^{p}\quad\mbox{and}\quad\hat{z}_{v}=\hat{x}_{v}^{q-1}=\operatorname*{argmax}_{z_{v}\geq 0}~{}z_{v}\hat{x}_{v}-\frac{1}{p}z_{v}^{p}.

Diffusion setup. Recall that we pick a scalar δ\delta and set the source Δ\Delta as

Δv={δdv,ifvS,0,otherwise.\Delta_{v}=\left\{\begin{array}[]{ll}\delta d_{v},&\mbox{if}~{}v\in S,\\ 0,&\mbox{otherwise}.\end{array}\right. (A.4)
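In code, with the degree routine from the earlier sketch, setting up this source vector is a one-liner (again our own illustration):

# Sketch (ours): Delta_v = delta * d_v on the seed set S, 0 elsewhere, as in (A.4).
def source_mass(d, S, delta):
    return [delta * d[v] if v in S else 0.0 for v in range(len(d))]

# e.g. source_mass([1.0, 3.0, 3.0, 2.0], {1}, 3.0) == [0.0, 9.0, 0.0, 0.0]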

For convenience we restate the assumptions below.

Assumption A.1.

vol(SC)αvol(C)\textnormal{vol}(S\cap C)\geq\alpha\textnormal{vol}(C) and vol(SC)βvol(S)\textnormal{vol}(S\cap C)\geq\beta\textnormal{vol}(S) for some α,β(0,1]\alpha,\beta\in(0,1].

Assumption A.2.

The source mass Δ\Delta as specified in (A.4) satisfies δ=3/α\delta=3/\alpha, which gives Δ(C)3vol(C)\Delta(C)\geq 3\textnormal{vol}(C).

Assumption A.3.

σ\sigma satisfies 0σβΦ(C)/30\leq\sigma\leq\beta\Phi(C)/3.

A.1 Technical lemmas

In this subsection we state and prove some technical lemmas that will be used for the main proof in the next subsection.

The following lemma characterizes the maximizers of the support function for a base polytope.

Lemma 3 (Proposition 4.2 in [4]).

Let ww be a submodular function such that w()=0w(\emptyset)=0. Let x|V|x\in\mathbb{R}^{|V|}, with unique values a1>>ama_{1}>\cdots>a_{m}, taken at sets A1,,AmA_{1},\ldots,A_{m} (i.e., V=A1AmV=A_{1}\cup\cdots\cup A_{m} and i{1,,m}\forall i\in\{1,\ldots,m\}, vAi\forall v\in A_{i}, xv=aix_{v}=a_{i}). Let BB be the associated base polytope. Then ρB\rho\in B is optimal for maxρBρTx\max_{\rho\in B}\rho^{T}x if and only if for all i=1,,mi=1,\ldots,m, ρ(A1Ai)=w(A1Ai)\rho(A_{1}\cup\cdots\cup A_{i})=w(A_{1}\cup\cdots\cup A_{i}).
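One standard way to realize a maximizer satisfying the condition of Lemma 3 is Edmonds' greedy algorithm: sort the coordinates of x in decreasing order and assign marginal gains of w along that order. The sketch below is our own illustration (representing w as a Python callable on frozensets is an assumption); it returns a maximizer rho in B together with the value f(x) = max_{rho in B} rho^T x.

# Sketch (ours): Edmonds' greedy algorithm for max_{rho in B} rho^T x,
# where B is the base polytope of a submodular w with w(emptyset) = 0.
def greedy_max(w, x):
    order = sorted(range(len(x)), key=lambda v: -x[v])   # decreasing x
    rho, prefix, prev = [0.0] * len(x), set(), 0.0
    for v in order:
        prefix.add(v)
        cur = w(frozenset(prefix))
        rho[v] = cur - prev            # marginal gain w(A + v) - w(A)
        prev = cur
    return rho, sum(r * xv for r, xv in zip(rho, x))

# Unit cut-cost of a 3-node hyperedge: w(S) = min(|S|, |e \ S|, 1).
w = lambda S: min(len(S), 3 - len(S), 1)
print(greedy_max(w, [0.9, 0.5, 0.1]))  # rho = [1.0, 0.0, -1.0], f(x) ~ 0.8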

Recall that (ϕ^,z^)(\hat{\phi},\hat{z}) and x^\hat{x} denote the optimal solutions of (A.1) and (A.2) respectively. We start with a lemma on the locality of the optimal solutions.

Lemma 4 (Lemma 2 in the main paper).

We have

esupp(ϕ^)ϑevol(supp(x^))Δ1.\sum_{e\in\textnormal{supp}(\hat{\phi})}\vartheta_{e}~{}\leq~{}\textnormal{vol}(\textnormal{supp}(\hat{x}))~{}\leq~{}\|\Delta\|_{1}.

Moreover, if σ>0\sigma>0, then vol(supp(z^))=vol(supp(x^))\textnormal{vol}(\textnormal{supp}(\hat{z}))=\textnormal{vol}(\textnormal{supp}(\hat{x})).

Proof.

To see the first inequality, note that if x^v=0\hat{x}_{v}=0 for every vev\in e for some ee, then fe(x^)=0f_{e}(\hat{x})=0. By Lemma 2, this means ϕ^e=0\hat{\phi}_{e}=0. Thus, ϕ^e0\hat{\phi}_{e}\neq 0 only if there is some vev\in e such that x^v0\hat{x}_{v}\neq 0. Therefore, we have that

esupp(ϕ^)ϑevsupp(x^)eE:veϑe=vsupp(x^)dv=vol(supp(x^)).\sum_{e\in\textnormal{supp}(\hat{\phi})}\vartheta_{e}\leq\sum_{v\in\textnormal{supp}(\hat{x})}\sum_{e\in E:v\in e}\vartheta_{e}=\sum_{v\in\textnormal{supp}(\hat{x})}d_{v}=\textnormal{vol}(\textnormal{supp}(\hat{x})).

To see the last inequality, note that, by the first order optimality condition of (A.2), if x^v0\hat{x}_{v}\neq 0 then we must have

Δvdv=eEϑefe(x^)q1ρ^e,v+σdvx^vq1,for someρ^efe(x^)=argmaxρeBeρeTx^.\Delta_{v}-d_{v}=\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q-1}\hat{\rho}_{e,v}+\sigma d_{v}\hat{x}_{v}^{q-1},~{}~{}\mbox{for some}~{}~{}\hat{\rho}_{e}\in\partial f_{e}(\hat{x})=\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}\hat{x}. (A.5)

Denote N:=supp(x^)N:=\textnormal{supp}(\hat{x}) and E[N]:={eE|vNfor allve}E[N]:=\{e\in E~{}|~{}v\in N~{}\mbox{for all}~{}v\in e\}. Note that E[N]N=E[N]\cap\partial N=\emptyset, and E[N]N={eE|vNfor someve}E[N]\cup\partial N=\{e\in E~{}|~{}v\in N~{}\mbox{for some}~{}v\in e\}, that is, E[N]NE[N]\cup\partial N contains all hyperedges that are incident to some node in NN. Moreover, we have that for any ρ^eargmaxρeBeρeTx^\hat{\rho}_{e}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}\hat{x},

vNρ^e,v=ρ^e(N)={we(N),ifeN,0,ifeE[N],\sum_{v\in N}\hat{\rho}_{e,v}=\hat{\rho}_{e}(N)=\left\{\begin{array}[]{ll}w_{e}(N),&\mbox{if}~{}e\in\partial N,\\ 0,&\mbox{if}~{}e\in E[N],\end{array}\right.

where ρ^e(N)=we(N)\hat{\rho}_{e}(N)=w_{e}(N) for eNe\in\partial N follows from Lemma 3, since x^v>0\hat{x}_{v}>0 for vNv\in N and x^v=0\hat{x}_{v}=0 for vNv\not\in N. The equality ρ^e(N)=0\hat{\rho}_{e}(N)=0 for eE[N]e\in E[N] follows from ρ^e(N)=ρ^e(e)=ρ^e(V)=we(V)=0\hat{\rho}_{e}(N)=\hat{\rho}_{e}(e)=\hat{\rho}_{e}(V)=w_{e}(V)=0, because eNe\subseteq N, ρ^e,v=0\hat{\rho}_{e,v}=0 for all vev\not\in e, and we(V)=0w_{e}(V)=0 for a cut-cost function (cutting nothing incurs no penalty).

Taking sums over vNv\in N on both sides of equation (A.5) we obtain

Δ(N)vol(N)=vNeEϑefe(x^)q1ρ^e,v+vNσdvx^vq1=vNeE[N]ϑefe(x^)q1ρ^e,v+vNeNϑefe(x^)q1ρ^e,v+vNσdvx^vq1=eE[N]ϑefe(x^)q1vNρ^e,v+eNϑefe(x^)q1vNρ^e,v+vNσdvx^vq1=0+eNϑefe(x^)q1we(N)+vNσdvx^vq10.\begin{split}\Delta(N)-\textnormal{vol}(N)~{}&=~{}\sum_{v\in N}\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q-1}\hat{\rho}_{e,v}+\sum_{v\in N}\sigma d_{v}\hat{x}_{v}^{q-1}\\ &=~{}\sum_{v\in N}\sum_{e\in E[N]}\vartheta_{e}f_{e}(\hat{x})^{q-1}\hat{\rho}_{e,v}+\sum_{v\in N}\sum_{e\in\partial N}\vartheta_{e}f_{e}(\hat{x})^{q-1}\hat{\rho}_{e,v}+\sum_{v\in N}\sigma d_{v}\hat{x}_{v}^{q-1}\\ &=~{}\sum_{e\in E[N]}\vartheta_{e}f_{e}(\hat{x})^{q-1}\sum_{v\in N}\hat{\rho}_{e,v}+\sum_{e\in\partial N}\vartheta_{e}f_{e}(\hat{x})^{q-1}\sum_{v\in N}\hat{\rho}_{e,v}+\sum_{v\in N}\sigma d_{v}\hat{x}_{v}^{q-1}\\ &=~{}0+\sum_{e\in\partial N}\vartheta_{e}f_{e}(\hat{x})^{q-1}w_{e}(N)+\sum_{v\in N}\sigma d_{v}\hat{x}_{v}^{q-1}\\ &\geq~{}0.\end{split}

The second equality follows from ρ^e,v=0\hat{\rho}_{e,v}=0 for all vev\not\in e. This proves vol(supp(x^))Δ(supp(x^))Δ1\textnormal{vol}(\textnormal{supp}(\hat{x}))\leq\Delta(\textnormal{supp}(\hat{x}))\leq\|\Delta\|_{1}.

Finally, if σ>0\sigma>0, then vol(supp(z^))=vol(supp(x^))\textnormal{vol}(\textnormal{supp}(\hat{z}))=\textnormal{vol}(\textnormal{supp}(\hat{x})) follows from Lemma 2, which gives z^vp=x^vq\hat{z}_{v}^{p}=\hat{x}_{v}^{q} for all vVv\in V. ∎

The following inequality is a special case of Hölder’s inequality for degree-weighted norms. It will become useful later.

Lemma 5.

For x|V|x\in\mathbb{R}^{|V|} and p>1p>1 we have that

(vVdv|xv|)pvol(supp(x))p1vVdv|xv|p.\Bigg{(}\sum_{v\in V}d_{v}|x_{v}|\Bigg{)}^{p}\leq\textnormal{vol}(\textnormal{supp}(x))^{p-1}\sum_{v\in V}d_{v}|x_{v}|^{p}.
Proof.

Let q=p/(p1)q=p/(p-1). Applying Hölder's inequality, we have

vVdv|xv|=vsupp(x)|dv1/q||dv1/pxv|(vsupp(x)dv)1/q(vsupp(x)dv|xv|p)1/p=vol(supp(x))1/q(vVdv|xv|p)1/p.\begin{split}\sum_{v\in V}d_{v}|x_{v}|=\sum_{v\in\textnormal{supp}(x)}|d_{v}^{1/q}||d_{v}^{1/p}x_{v}|&\leq\Bigg{(}\sum_{v\in\textnormal{supp}(x)}d_{v}\Bigg{)}^{1/q}\Bigg{(}\sum_{v\in\textnormal{supp}(x)}d_{v}|x_{v}|^{p}\Bigg{)}^{1/p}\\ &=\textnormal{vol}(\textnormal{supp}(x))^{1/q}\Bigg{(}\sum_{v\in V}d_{v}|x_{v}|^{p}\Bigg{)}^{1/p}.\end{split}

Lemma 6 (Lemma I.2 in [31]).

For any x+|V|{0}x\in\mathbb{R}^{|V|}_{+}\setminus\{0\} and q1q\geq 1, one has

eEϑefe(x)qvVdvxvqc(x)qqq,\frac{\sum_{e\in E}\vartheta_{e}f_{e}(x)^{q}}{\sum_{v\in V}d_{v}x_{v}^{q}}\geq\frac{c(x)^{q}}{q^{q}},

where

c(x):=minh0vol({vV|xvq>h})vol({vV|xvq>h})=minh0vol({vV|xv>h})vol({vV|xv>h}).c(x):=\min_{h\geq 0}\frac{\textnormal{vol}(\partial\{v\in V|x_{v}^{q}>h\})}{\textnormal{vol}(\{v\in V|x_{v}^{q}>h\})}=\min_{h\geq 0}\frac{\textnormal{vol}(\partial\{v\in V|x_{v}>h\})}{\textnormal{vol}(\{v\in V|x_{v}>h\})}.
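Since the ratio in c(x) changes only at the distinct values of x, c(x) can be computed by a simple sweep. A minimal sketch follows (ours; each hyperedge carries a callable w_e as in the earlier greedy sketch, evaluated on the cut side restricted to e, and vol(∂S) = sum over e of theta_e * w_e(S) as in the paper):

# Sketch (ours): c(x) via a sweep over the distinct entries of x >= 0.
# For an uncut hyperedge, w(S & e) = 0, so summing over all edges is safe.
def sweep_conductance(x, d, hyperedges):
    best = float("inf")
    for h in sorted({0.0} | set(x)):
        S = {v for v in range(len(x)) if x[v] > h}
        if not S:
            continue
        vol = sum(d[v] for v in S)
        cut = sum(theta * w(frozenset(S & set(nodes)))
                  for nodes, theta, w in hyperedges)
        best = min(best, cut / vol)
    return best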

Recall that the objective function of our primal diffusion problem (A.1) consists of two parts. The first part is eEϑeϕep\sum_{e\in E}\vartheta_{e}\phi_{e}^{p} and it penalizes the cost of flow routing; the second part is vVdvzvp\sum_{v\in V}d_{v}z_{v}^{p} and it penalizes the cost of excess mass. An immediate consequence of Lemma 6 is the inequality in Lemma 7 that relates the cost of optimal flow routing eEϑeϕ^ep\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p} and the cost of excess mass vVdvz^vp\sum_{v\in V}d_{v}\hat{z}_{v}^{p} at optimality.

For h>0h>0, recall that the sweep sets are defined as Sh:={vV|x^vh}S_{h}:=\{v\in V|\hat{x}_{v}\geq h\}.

Let h^argminh>0Φ(Sh)\hat{h}\in\operatorname*{argmin}_{h>0}\Phi(S_{h}) and denote S^=Sh^\hat{S}=S_{\hat{h}}. That is, S^=Sh\hat{S}=S_{h} for some h>0h>0 and Φ(S^)Φ(Sh)\Phi(\hat{S})\leq\Phi(S_{h}) for all h>0h>0.

Lemma 7.

For p>1p>1 and q=p/(p1)q=p/(p-1) we have that

eEϑeϕ^ep(Φ(S^)q)qvVdvz^vp.\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}\geq\left(\frac{\Phi(\hat{S})}{q}\right)^{q}\sum_{v\in V}d_{v}\hat{z}_{v}^{p}.
Proof.

By Lemma 2,

eEϑeϕ^ep=eEϑefe(x^)qandvVdvz^vp=vVdvx^vq,\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}=\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q}\quad\mbox{and}\quad\sum_{v\in V}d_{v}\hat{z}_{v}^{p}=\sum_{v\in V}d_{v}\hat{x}_{v}^{q},

and the result follows from applying Lemma 6. ∎

Given a vector a|V|a\in\mathbb{R}^{|V|} and a set SVS\subseteq V, recall that we write a(S)=vSava(S)=\sum_{v\in S}a_{v}. This defines a modular set function aa on subsets of VV. The Lovász extension of a modular function aa is simply f(x)=aTxf(x)=a^{T}x [4]. Since all modular functions are also submodular, we arrive at the following lemma, which follows from a classical property of the Choquet integral/Lovász extension.

Lemma 8.

We have that

ΔTx^=h=0+Δ(Sh)𝑑h,dTx^=h=0+vol(Sh)𝑑h,fe(x^)=h=0+we(Sh)𝑑h.\Delta^{T}\hat{x}=\int_{h=0}^{+\infty}\Delta(S_{h})dh,\quad d^{T}\hat{x}=\int_{h=0}^{+\infty}\textnormal{vol}(S_{h})dh,\quad f_{e}(\hat{x})=\int_{h=0}^{+\infty}w_{e}(S_{h})dh.
Proof.

Recall that, by definition, vol(S)=d(S)\textnormal{vol}(S)=d(S) where dd is the degree vector. Δ\Delta and dd are modular functions on 2V2^{V} and wew_{e} is a submodular function on 2V2^{V}. The Lovász extension of Δ\Delta and dd are ΔTx\Delta^{T}x and dTxd^{T}x, respectively. The Lovász extension of wew_{e} is fe(x)f_{e}(x). The results then follow immediately from representing the Lovász extensions using Choquet integrals. See, e.g., Proposition 3.1 in [4]. ∎
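For intuition, note that hShh\mapsto S_{h} is piecewise constant, so each integral above is a finite sum. Writing a_1 > \cdots > a_m for the distinct nonzero values of \hat{x}, T_i := \{v\in V~|~\hat{x}_v \geq a_i\}, and a_{m+1} := 0, one has, for instance,

f_{e}(\hat{x}) \;=\; \int_{h=0}^{+\infty} w_{e}(S_{h})\,dh \;=\; \sum_{i=1}^{m}(a_{i}-a_{i+1})\,w_{e}(T_{i}),

and likewise for \Delta^{T}\hat{x} and d^{T}\hat{x} with w_{e} replaced by the modular functions \Delta and d.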

A.2 Proof of Theorem 1 in the main paper

We restate the theorem below with respect to the general formulations (A.1) and (A.2) for any p2p\geq 2 and q=p/(p1)q=p/(p-1).

Let us recall that the sweep sets are defined as Sh:={vV|x^vh}S_{h}:=\{v\in V|\hat{x}_{v}\geq h\}.

Theorem 9.

Under Assumptions A.1A.2A.3, for some h>0h>0 we have that

Φ(Sh)O(Φ(C)1/qαβ).\Phi(S_{h})\leq O\left(\frac{\Phi(C)^{1/q}}{\alpha\beta}\right).

Recall that S^\hat{S} is such that S^=Sh\hat{S}=S_{h} for some h>0h>0 and Φ(S^)Φ(Sh)\Phi(\hat{S})\leq\Phi(S_{h}) for all h>0h>0. We will assume without loss of generality that Φ(C)(Φ(S^)/q)q\Phi(C)\leq(\Phi(\hat{S})/q)^{q}, as otherwise Φ(S^)<qΦ(C)1/q\Phi(\hat{S})<q\Phi(C)^{1/q} and the statement in Theorem 9 already holds.

Denote ν^:=eEϑeϕ^ep\hat{\nu}:=\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}, the cost of optimal flow routing. The following claim states that ν^\hat{\nu} must be large.

Claim A.1.

ν^vol(C)p/vol(C)p1.\hat{\nu}\geq\textnormal{vol}(C)^{p}/\textnormal{vol}(\partial C)^{p-1}.

Proof.

The proof of this claim follows from a case analysis on the total amount of excess mass σvVdvz^v\sigma\sum_{v\in V}d_{v}\hat{z}_{v} at optimality. Intuitively, if the excess is small, then naturally there must be a large amount of flow in order to satisfy the primal constraint; if the excess is large, then Lemma 7 and Lemma 5 guarantee that flow is also large. We give details below.

Suppose that σvVdvz^v<vol(C)\sigma\sum_{v\in V}d_{v}\hat{z}_{v}<\textnormal{vol}(C). Note that this also includes the case where σ=0\sigma=0. By Assumption A.2 there is at least Δ(C)3vol(C)\Delta(C)\geq 3\textnormal{vol}(C) amount of source mass trapped in CC at the beginning. Moreover, the primal constraint enforces that the nodes in CC can settle at most vC(dv+σdvz^v)vol(C)+vVσdvz^v<2vol(C)\sum_{v\in C}(d_{v}+\sigma d_{v}\hat{z}_{v})\leq\textnormal{vol}(C)+\sum_{v\in V}\sigma d_{v}\hat{z}_{v}<2\textnormal{vol}(C) amount of mass. Therefore, the remaining at least vol(C)\textnormal{vol}(C) amount of mass needs to get out of CC using the hyperedges in C\partial C. That is, the net amount of mass that moves from CC to VCV\setminus C satisfies eCϑer^e(C)vol(C)\sum_{e\in\partial C}\vartheta_{e}\hat{r}_{e}(C)\geq\textnormal{vol}(C). We focus on the cost of ϕ^\hat{\phi} restricted to these hyperedges alone. It is easy to see that

eCϑeϕ^ep\displaystyle\sum_{e\in\partial C}\vartheta_{e}\hat{\phi}_{e}^{p}~{} minϕ+|C|eCϑeϕepsubject tor^eϕeBe,eC\displaystyle\geq~{}\min_{\phi\in\mathbb{R}^{|\partial C|}_{+}}~{}\sum_{e\in\partial C}\vartheta_{e}\phi_{e}^{p}~{}~{}\mbox{subject to}~{}~{}\hat{r}_{e}\in\phi_{e}B_{e},~{}\forall e\in\partial C (A.6a)
minϕ+|C|eCϑeϕepsubject toeCϑer^e(C)eCϑeϕewe(C)\displaystyle\geq~{}\min_{\phi\in\mathbb{R}^{|\partial C|}_{+}}~{}\sum_{e\in\partial C}\vartheta_{e}\phi_{e}^{p}~{}~{}\mbox{subject to}~{}\sum_{e\in\partial C}\vartheta_{e}\hat{r}_{e}(C)\leq\sum_{e\in\partial C}\vartheta_{e}\phi_{e}w_{e}(C) (A.6b)
minϕ+|C|eCϑeϕepsubject tovol(C)eCϑeϕewe(C).\displaystyle\geq~{}\min_{\phi\in\mathbb{R}^{|\partial C|}_{+}}~{}\sum_{e\in\partial C}\vartheta_{e}\phi_{e}^{p}~{}~{}\mbox{subject to}~{}~{}\textnormal{vol}(C)\leq\sum_{e\in\partial C}\vartheta_{e}\phi_{e}w_{e}(C). (A.6c)

The first inequality follows because ϕ^\hat{\phi} restricted to C\partial C is a feasible solution in problem (A.6a). The second inequality follows because r^eϕeBe\hat{r}_{e}\in\phi_{e}B_{e} implies r^e(C)ϕewe(C)\hat{r}_{e}(C)\leq\phi_{e}w_{e}(C), therefore every feasible solution for (A.6a) is also a feasible solution for (A.6b). The third inequality follows because vol(C)eEϑer^e(C)\textnormal{vol}(C)\leq\sum_{e\in E}\vartheta_{e}\hat{r}_{e}(C). Let ϕ¯+|C|\bar{\phi}\in\mathbb{R}^{|\partial C|}_{+} be an optimal solution of problem (A.6c). The optimality condition of (A.6c) is given by (we may assume the pp factor in the gradient of eCϑeϕep\sum_{e\in\partial C}\vartheta_{e}\phi_{e}^{p} is absorbed into multipliers λ\lambda and ηe\eta_{e})

ϑeϕep1λϑewe(C)ηe=0,eCϕe0,ηe0,ϕeηe=0,eCvol(C)eCϑeϕewe(C)λ0,λ(vol(C)eCϑeϕewe(C))=0.\begin{split}&\vartheta_{e}\phi_{e}^{p-1}-\lambda\vartheta_{e}w_{e}(C)-\eta_{e}=0,~{}\forall e\in\partial C\\ &\phi_{e}\geq 0,~{}\eta_{e}\geq 0,~{}\phi_{e}\eta_{e}=0,~{}\forall e\in\partial C\\ &\textnormal{vol}(C)\leq\sum_{e\in\partial C}\vartheta_{e}\phi_{e}w_{e}(C)\\ &\lambda\geq 0,~{}\lambda\bigg{(}\textnormal{vol}(C)-\sum_{e\in\partial C}\vartheta_{e}\phi_{e}w_{e}(C)\bigg{)}=0.\end{split} (A.7)

If λ=0\lambda=0, then the conditions in (A.7) imply that ϑeϕep1=ηe\vartheta_{e}\phi_{e}^{p-1}=\eta_{e}, but then by complementary slackness we would obtain ϕe=ηe=0\phi_{e}=\eta_{e}=0 for all eCe\in\partial C, which would violate feasibility. Therefore we must have λ>0\lambda>0, and consequently, we have that

eCϑeϕ¯ewe(C)=vol(C).\sum_{e\in\partial C}\vartheta_{e}\bar{\phi}_{e}w_{e}(C)=\textnormal{vol}(C). (A.8)

Moreover, the conditions in (A.7) imply that for eCe\in\partial C, ϕ¯e=0\bar{\phi}_{e}=0 if and only if we(C)=0w_{e}(C)=0, and hence we have that

ϑeϕ¯ep1=λϑewe(C),eC.\vartheta_{e}\bar{\phi}_{e}^{p-1}=\lambda\vartheta_{e}w_{e}(C),~{}\forall e\in\partial C. (A.9)

Rearranging (A.9), we get

ϕ¯ewe(C)=λ1/(p1)we(C)p/(p1),eC.\bar{\phi}_{e}w_{e}(C)=\lambda^{1/(p-1)}w_{e}(C)^{p/(p-1)},~{}\forall e\in\partial C.

Substituting the above into (A.8),

vol(C)=eCϑeϕ¯ewe(C)=eCϑeλ1/(p1)we(C)p/(p1),\textnormal{vol}(C)=\sum_{e\in\partial C}\vartheta_{e}\bar{\phi}_{e}w_{e}(C)=\sum_{e\in\partial C}\vartheta_{e}\lambda^{1/(p-1)}w_{e}(C)^{p/(p-1)},

this gives

λ1/(p1)=vol(C)eCϑewe(C)p/(p1).\lambda^{1/(p-1)}=\frac{\textnormal{vol}(C)}{\sum_{e\in\partial C}\vartheta_{e}w_{e}(C)^{p/(p-1)}}.

Therefore, the solution ϕ¯\bar{\phi} for (A.6c) is given by

ϕ¯e=λ1/(p1)we(C)1/(p1)=vol(C)we(C)1/(p1)eCϑewe(C)p/(p1),eC,\bar{\phi}_{e}=\lambda^{1/(p-1)}w_{e}(C)^{1/(p-1)}=\frac{\textnormal{vol}(C)w_{e}(C)^{1/(p-1)}}{\sum_{e^{\prime}\in\partial C}\vartheta_{e^{\prime}}w_{e^{\prime}}(C)^{p/(p-1)}},\quad\forall e\in\partial C,

and hence,

ν^=eEϑeϕ^epeCϑeϕ^epeCϑeϕ¯ep=eCϑevol(C)pwe(C)p/(p1)(eCϑewe(C)p/(p1))p=vol(C)peCϑewe(C)p/(p1)(eCϑewe(C)p/(p1))p=vol(C)p(eCϑewe(C)p/(p1))p1vol(C)p(eCϑewe(C))p1\begin{split}\hat{\nu}=\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}\geq\sum_{e\in\partial C}\vartheta_{e}\hat{\phi}_{e}^{p}\geq\sum_{e\in\partial C}\vartheta_{e}\bar{\phi}_{e}^{p}&=\sum_{e\in\partial C}\vartheta_{e}\frac{\textnormal{vol}(C)^{p}w_{e}(C)^{p/(p-1)}}{\left(\sum_{e^{\prime}\in\partial C}\vartheta_{e^{\prime}}w_{e^{\prime}}(C)^{p/(p-1)}\right)^{p}}\\ &=\frac{\textnormal{vol}(C)^{p}\sum_{e\in\partial C}\vartheta_{e}w_{e}(C)^{p/(p-1)}}{\left(\sum_{e^{\prime}\in\partial C}\vartheta_{e^{\prime}}w_{e^{\prime}}(C)^{p/(p-1)}\right)^{p}}\\ &=\frac{\textnormal{vol}(C)^{p}}{\left(\sum_{e^{\prime}\in\partial C}\vartheta_{e^{\prime}}w_{e^{\prime}}(C)^{p/(p-1)}\right)^{p-1}}\\ &\geq\frac{\textnormal{vol}(C)^{p}}{\big{(}\sum_{e^{\prime}\in\partial C}\vartheta_{e^{\prime}}w_{e^{\prime}}(C)\big{)}^{p-1}}\end{split}

where the last inequality follows because we(C)[0,1]w_{e}(C)\in[0,1] and p1p\geq 1.

Suppose now that σvVdvz^vvol(C)\sigma\sum_{v\in V}d_{v}\hat{z}_{v}\geq\textnormal{vol}(C). Because Φ(C)(Φ(S^)/q)q\Phi(C)\leq(\Phi(\hat{S})/q)^{q} (recall that we assumed this without loss of generality), by Assumption A.3, we know that σ<(Φ(S^)/q)q\sigma<(\Phi(\hat{S})/q)^{q}. Therefore,

ν^=eEϑeϕ^ep(i)σvVdvz^vp(ii)σ(vVdvz^v)pvol(supp(z^))p1(iii)σp(vVdvz^v)pσp1(3vol(C)/β)p1(iv)σp(vVdvz^v)pvol(C)p1(v)vol(C)pvol(C)p1.\begin{split}\hat{\nu}=\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}&\stackrel{{\scriptstyle(i)}}{{\geq}}\sigma\sum_{v\in V}d_{v}\hat{z}_{v}^{p}\\ &\stackrel{{\scriptstyle(ii)}}{{\geq}}\frac{\sigma\left(\sum_{v\in V}d_{v}\hat{z}_{v}\right)^{p}}{\textnormal{vol}(\textnormal{supp}(\hat{z}))^{p-1}}\\ &\stackrel{{\scriptstyle(iii)}}{{\geq}}\frac{\sigma^{p}\left(\sum_{v\in V}d_{v}\hat{z}_{v}\right)^{p}}{\sigma^{p-1}(3\textnormal{vol}(C)/\beta)^{p-1}}\\ &\stackrel{{\scriptstyle(iv)}}{{\geq}}\frac{\sigma^{p}\left(\sum_{v\in V}d_{v}\hat{z}_{v}\right)^{p}}{\textnormal{vol}(\partial C)^{p-1}}\\ &\stackrel{{\scriptstyle(v)}}{{\geq}}\frac{\textnormal{vol}(C)^{p}}{\textnormal{vol}(\partial C)^{p-1}}.\end{split}

(i)(i) is due to Lemma 7 together with the bound σ<(Φ(S^)/q)q\sigma<(\Phi(\hat{S})/q)^{q} established above. (ii)(ii) is due to Lemma 5. (iii)(iii) is due to Lemma 4, which gives vol(supp(z^))Δ1\textnormal{vol}(\textnormal{supp}(\hat{z}))\leq\|\Delta\|_{1}, and Assumptions A.1 and A.2, which give Δ13vol(C)/β\|\Delta\|_{1}\leq 3\textnormal{vol}(C)/\beta, so vol(supp(z^))p1(3vol(C)/β)p1\textnormal{vol}(\textnormal{supp}(\hat{z}))^{p-1}\leq(3\textnormal{vol}(C)/\beta)^{p-1} for p1p\geq 1. (iv)(iv) is due to Assumption A.3 that σβvol(C)3vol(C)\sigma\leq\frac{\beta\textnormal{vol}(\partial C)}{3\textnormal{vol}(C)}, so (3σvol(C)/β)p1vol(C)p1(3\sigma\textnormal{vol}(C)/\beta)^{p-1}\leq\textnormal{vol}(\partial C)^{p-1} for p1p\geq 1. (v)(v) is due to the assumption that σvVdvz^vvol(C)\sigma\sum_{v\in V}d_{v}\hat{z}_{v}\geq\textnormal{vol}(C). ∎

To connect Φ(Sh)\Phi(S_{h}) with Φ(C)\Phi(C), we define the length of a hyperedge eEe\in E as

l^(e):={max(1/vol(C)1/q,fe(x^)/ν^1/q),iffe(x^)>0,0,otherwise.\hat{l}(e):=\left\{\begin{array}[]{ll}\max(1/\textnormal{vol}(C)^{1/q},f_{e}(\hat{x})/\hat{\nu}^{1/q}),&\mbox{if}~{}f_{e}(\hat{x})>0,\\ 0,&\mbox{otherwise}.\end{array}\right.

The next claim follows from simple algebraic computations and the locality of solutions in Lemma 4.

Claim A.2.

eEϑefe(x^)l^(e)q14ν^1/q/β\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})\hat{l}(e)^{q-1}\leq 4\hat{\nu}^{1/q}/\beta.

Proof.

For eEe\in E, define l(e):=fe(x^)/ν^1/ql(e):=f_{e}(\hat{x})/\hat{\nu}^{1/q}. Then l(e)l^(e)l(e)\leq\hat{l}(e). Moreover,

e:l(e)<l^(e)ϑeesupp(ϕ^)ϑevol(supp(x^))Δ1=3αvol(S)3βvol(C).\sum_{e:l(e)<\hat{l}(e)}\vartheta_{e}\leq\sum_{e\in\textnormal{supp}(\hat{\phi})}\vartheta_{e}\leq\textnormal{vol}(\textnormal{supp}(\hat{x}))\leq\|\Delta\|_{1}=\frac{3}{\alpha}\textnormal{vol}(S)\leq\frac{3}{\beta}\textnormal{vol}(C).

The first inequality follows because l(e)<l^(e)l(e)<\hat{l}(e) only if l(e)0l(e)\neq 0, and by Lemma 2, l(e)0l(e)\neq 0 if and only if ϕ^e0\hat{\phi}_{e}\neq 0. The second and the third inequalities are due to Lemma 4. The second-to-last equality follows from the diffusion setting (A.4) and Assumption A.2 that δ=3/α\delta=3/\alpha. The last inequality follows from Assumption A.1. Therefore,

eEϑefe(x^)l^(e)q1=e:l(e)=l^(e)ϑefe(x^)fe(x^)q1ν^(q1)/q+e:l(e)<l^(e)ϑefe(x^)1vol(C)(q1)/qe:l(e)=l^(e)ϑefe(x^)fe(x^)q1ν^(q1)/q+e:l(e)<l^(e)ϑeν^1/qvol(C)1/q1vol(C)(q1)/q=1ν^(q1)/qe:l(e)=l^(e)ϑefe(x^)q+ν^1/qvol(C)e:l(e)<l^(e)ϑe1ν^(q1)/qeEϑefe(x^)q+ν^1/qvol(C)3vol(C)β=ν^ν^(q1)/q+3ν^1/qβ4ν^1/qβ\begin{split}\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})\hat{l}(e)^{q-1}&=\sum_{e:l(e)=\hat{l}(e)}\vartheta_{e}f_{e}(\hat{x})\frac{f_{e}(\hat{x})^{q-1}}{\hat{\nu}^{(q-1)/q}}+\sum_{e:l(e)<\hat{l}(e)}\vartheta_{e}f_{e}(\hat{x})\frac{1}{\textnormal{vol}(C)^{(q-1)/q}}\\ &\leq\sum_{e:l(e)=\hat{l}(e)}\vartheta_{e}f_{e}(\hat{x})\frac{f_{e}(\hat{x})^{q-1}}{\hat{\nu}^{(q-1)/q}}+\sum_{e:l(e)<\hat{l}(e)}\vartheta_{e}\frac{\hat{\nu}^{1/q}}{\textnormal{vol}(C)^{1/q}}\frac{1}{\textnormal{vol}(C)^{(q-1)/q}}\\ &=\frac{1}{\hat{\nu}^{(q-1)/q}}\sum_{e:l(e)=\hat{l}(e)}\vartheta_{e}f_{e}(\hat{x})^{q}+\frac{\hat{\nu}^{1/q}}{\textnormal{vol}(C)}\sum_{e:l(e)<\hat{l}(e)}\vartheta_{e}\\ &\leq\frac{1}{\hat{\nu}^{(q-1)/q}}\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q}+\frac{\hat{\nu}^{1/q}}{\textnormal{vol}(C)}\frac{3\textnormal{vol}(C)}{\beta}\\ &=\frac{\hat{\nu}}{\hat{\nu}^{(q-1)/q}}+\frac{3\hat{\nu}^{1/q}}{\beta}\\ &\leq\frac{4\hat{\nu}^{1/q}}{\beta}\end{split}

where the last equality follows from Lemma 2 that ν^=eEϑeϕ^ep=eEϑefe(x^)q\hat{\nu}=\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}=\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q}. ∎

By the strong duality between (A.1) and (A.2), we know that

(Δd)Tx^1qeEϑefe(x^)qσqvVdvx^vq=1peEϑeϕ^ep+σpvVdvz^vp.(\Delta-d)^{T}\hat{x}-\frac{1}{q}\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q}-\frac{\sigma}{q}\sum_{v\in V}d_{v}\hat{x}_{v}^{q}=\frac{1}{p}\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}+\frac{\sigma}{p}\sum_{v\in V}d_{v}\hat{z}_{v}^{p}.

Hence, by Lemma 2, we get

(Δd)Tx^1qeEϑefe(x^)q+1peEϑeϕ^ep=eEϑeϕ^ep=ν^.(\Delta-d)^{T}\hat{x}\geq\frac{1}{q}\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})^{q}+\frac{1}{p}\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}=\sum_{e\in E}\vartheta_{e}\hat{\phi}_{e}^{p}=\hat{\nu}.

It then follows that

eEϑefe(x^)l^(e)q1(Δd)Tx^eEϑefe(x^)l^(e)q1ν^(i)4ν^1/qβν^=4βν^1/p(ii)4vol(C)1/qβvol(C),\frac{\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})\hat{l}(e)^{q-1}}{(\Delta-d)^{T}\hat{x}}\leq\frac{\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})\hat{l}(e)^{q-1}}{\hat{\nu}}\stackrel{{\scriptstyle(i)}}{{\leq}}\frac{4\hat{\nu}^{1/q}}{\beta\hat{\nu}}=\frac{4}{\beta\hat{\nu}^{1/p}}\stackrel{{\scriptstyle(ii)}}{{\leq}}\frac{4\textnormal{vol}(\partial C)^{1/q}}{\beta\textnormal{vol}(C)}, (A.10)

where (i)(i) follows from Claim A.2 and (ii)(ii) follows from Claim A.1.

We can write the left-most ratio in (A.10) in its integral form, as follows. By Lemma 8, we have

(Δd)Tx^=h=0(Δ(Sh)vol(Sh))𝑑h,(\Delta-d)^{T}\hat{x}=\int_{h=0}^{\infty}(\Delta(S_{h})-\textnormal{vol}(S_{h}))dh,

and

eEϑefe(x^)l^(e)q1=eEϑeh=0we(Sh)𝑑hl^(e)q1=h=0eEϑewe(Sh)l^(e)q1dh=h=0eShϑewe(Sh)l^(e)q1dh,\begin{split}\sum_{e\in E}\vartheta_{e}f_{e}(\hat{x})\hat{l}(e)^{q-1}&=\sum_{e\in E}\vartheta_{e}\int_{h=0}^{\infty}w_{e}(S_{h})dh~{}\hat{l}(e)^{q-1}\\ &=\int_{h=0}^{\infty}\sum_{e\in E}\vartheta_{e}w_{e}(S_{h})\hat{l}(e)^{q-1}dh\\ &=\int_{h=0}^{\infty}\sum_{e\in\partial S_{h}}\vartheta_{e}w_{e}(S_{h})\hat{l}(e)^{q-1}dh,\end{split}

where the last equality follows from the fact that we(Sh)=0w_{e}(S_{h})=0 for eShe\not\in\partial S_{h}. Therefore, we get

h=0eShϑewe(Sh)l^(e)q1Δ(Sh)vol(Sh)𝑑h4vol(C)1/qβvol(C),\int_{h=0}^{\infty}\frac{\sum_{e\in\partial S_{h}}\vartheta_{e}w_{e}(S_{h})\hat{l}(e)^{q-1}}{\Delta(S_{h})-\textnormal{vol}(S_{h})}dh\leq\frac{4\textnormal{vol}(\partial C)^{1/q}}{\beta\textnormal{vol}(C)},

which means that there exists h>0h>0 such that

eShϑewe(Sh)l^(e)q1Δ(Sh)vol(Sh)4vol(C)1/qβvol(C).\frac{\sum_{e\in\partial S_{h}}\vartheta_{e}w_{e}(S_{h})\hat{l}(e)^{q-1}}{\Delta(S_{h})-\textnormal{vol}(S_{h})}\leq\frac{4\textnormal{vol}(\partial C)^{1/q}}{\beta\textnormal{vol}(C)}. (A.11)

Finally, we connect the left-hand side of inequality (A.11) to the conductance of ShS_{h}. For the denominator, by Assumption A.2, we have

Δ(Sh)vol(Sh)3αvol(Sh).\Delta(S_{h})-\textnormal{vol}(S_{h})\leq\frac{3}{\alpha}\textnormal{vol}(S_{h}). (A.12)

For the numerator, every hyperedge eShe\in\partial S_{h} must contain some u,veu,v\in e such that x^ux^v\hat{x}_{u}\neq\hat{x}_{v}, thus fe(x^)>0f_{e}(\hat{x})>0, which means l^(e)1/vol(C)1/q\hat{l}(e)\geq 1/\textnormal{vol}(C)^{1/q}. This gives

eShϑewe(Sh)l^(e)q1eShϑewe(Sh)vol(C)(q1)/q=vol(Sh)vol(C)(q1)/q.\sum_{e\in\partial S_{h}}\vartheta_{e}w_{e}(S_{h})\hat{l}(e)^{q-1}\geq\frac{\sum_{e\in\partial S_{h}}\vartheta_{e}w_{e}(S_{h})}{\textnormal{vol}(C)^{(q-1)/q}}=\frac{\textnormal{vol}(\partial S_{h})}{\textnormal{vol}(C)^{(q-1)/q}}. (A.13)

Putting (A.11), (A.12) and (A.13) together, there exists h>0h>0 such that

Φ(Sh)=vol(Sh)vol(Sh)12vol(C)1/qαβvol(C)1/q=12Φ(C)1/qαβ.\Phi(S_{h})=\frac{\textnormal{vol}(\partial S_{h})}{\textnormal{vol}(S_{h})}\leq\frac{12\textnormal{vol}(\partial C)^{1/q}}{\alpha\beta\textnormal{vol}(C)^{1/q}}=\frac{12\Phi(C)^{1/q}}{\alpha\beta}.

Appendix B Optimization algorithm for HFD

In this section we give details on an Alternating Minimization (AM) algorithm [5] that solves the primal problem (A.1). In Algorithm B.1 we write the basic AM steps in a slightly more general form than what is given by Algorithm 1 in the main paper. The key observation is that the AM method provides a unified framework to solve HFD, when the objective function of the primal problem (A.1) is penalized by any p\ell_{p}-norm for p2p\geq 2.

Algorithm B.1 Alternating Minimization for HFD

Initialization:

ϕ(0):=0,r(0):=0,se(0):=D1Ae[Δd]+,eE.\phi^{(0)}:=0,r^{(0)}:=0,s^{(0)}_{e}:=D^{-1}A_{e}\left[\Delta-d\right]_{+},\forall e\in E.

For k=0,1,2,k=0,1,2,\ldots do:

(ϕ(k+1),r(k+1)):=argmin(ϕ,r)𝒞eEϑe(ϕep+1σp1se(k)repp)\displaystyle(\phi^{(k+1)},r^{(k+1)}):=\operatorname*{argmin}\limits_{(\phi,r)\in\mathcal{C}}\sum\limits_{e\in E}\vartheta_{e}\left(\phi_{e}^{p}+\frac{1}{\sigma^{p-1}}\|s_{e}^{(k)}-r_{e}\|_{p}^{p}\right)
s(k+1):=argminseEϑesere(k+1)pp,s.t.ΔeEϑesed,se,v=0,ve.\displaystyle s^{(k+1)}:=\operatorname*{argmin}\limits_{s}\sum\limits_{e\in E}\vartheta_{e}\|s_{e}-r_{e}^{(k+1)}\|_{p}^{p},\hskip 5.69054pt\mbox{s.t.}~{}\Delta-\sum\limits_{e\in E}\vartheta_{e}s_{e}\leq d,\ s_{e,v}=0,\forall v\not\in e.

Let us remind the reader of the definitions and notation that we will use. We consider a generic hypergraph H=(V,E,𝒲)H=(V,E,\mathcal{W}) where 𝒲={we,ϑe}eE\mathcal{W}=\{w_{e},\vartheta_{e}\}_{e\in E} are submodular hyperedge weights. For each eEe\in E, we define a diagonal matrix Ae|V|×|V|A_{e}\in\mathbb{R}^{|V|\times|V|} such that [Ae]v,v=1[A_{e}]_{v,v}=1 if vev\in e and 0 otherwise. We use the notation reE|V|r\in\bigotimes_{e\in E}\mathbb{R}^{|V|} to represent a vector in the space |V||E|\mathbb{R}^{|V||E|}, where each re|V|r_{e}\in\mathbb{R}^{|V|} corresponds to a block in rr indexed by eEe\in E. For a vector re|V|r_{e}\in\mathbb{R}^{|V|}, re,vr_{e,v} is the entry in rer_{e} that corresponds to vVv\in V. For a vector x|V|x\in\mathbb{R}^{|V|}, [x]+:=max{x,0}[x]_{+}:=\max\{x,0\} where the maximum is taken entry-wise.

We denote 𝒞:={(ϕ,r)+|E|×(eE|V|)|reϕeBe,eE}\mathcal{C}:=\{(\phi,r)\in\mathbb{R}^{|E|}_{+}\times(\bigotimes_{e\in E}\mathbb{R}^{|V|})~{}|~{}r_{e}\in\phi_{e}B_{e},~{}\forall e\in E\}.

We will prove the equivalence between the primal diffusion problem (A.1) and its separable reformulation shortly, but let us start with a simple lemma that gives a closed-form solution for one of the AM sub-problems.

Lemma 1.

The optimal solution to the following problem

minseE|V|eEϑeserepp,s.t.ΔeEϑesed,se,v=0,ve.\min_{s\in\bigotimes_{e\in E}\mathbb{R}^{|V|}}\sum_{e\in E}\vartheta_{e}\|s_{e}-r_{e}\|_{p}^{p},~{}\mbox{s.t.}~{}\Delta-\sum_{e\in E}\vartheta_{e}s_{e}\leq d,~{}s_{e,v}=0,\forall v\not\in e. (B.1)

is given by

se=re+AeD1[ΔeEϑered]+,eE.s_{e}^{*}=r_{e}+A_{e}D^{-1}\Big{[}\Delta-\sum_{e^{\prime}\in E}\vartheta_{e^{\prime}}r_{e^{\prime}}-d\Big{]}_{+},~{}\forall e\in E. (B.2)
Proof.

Rewrite (B.1) as

minseE|V|\displaystyle\min_{s\in\bigotimes_{e\in E}\mathbb{R}^{|V|}} vVeEϑe|se,vre,v|p\displaystyle\sum_{v\in V}\sum_{e\in E}\vartheta_{e}|s_{e,v}-r_{e,v}|^{p}
s.t. ΔveEϑese,vdv,vV\displaystyle\Delta_{v}-\sum_{e\in E}\vartheta_{e}s_{e,v}\leq d_{v},~{}\forall v\in V
se,v=0,ve.\displaystyle s_{e,v}=0,~{}\forall v\not\in e.

Then it is immediate that (B.1) decomposes into |V||V| sub-problems indexed by vVv\in V,

minξv|Ev|eEvϑe|ξv,ere,v|p,s.t.ΔveEvϑeξv,edv,\min_{\xi_{v}\in\mathbb{R}^{|E_{v}|}}\sum_{e\in E_{v}}\vartheta_{e}|\xi_{v,e}-r_{e,v}|^{p},~{}\mbox{s.t.}~{}\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}\xi_{v,e}\leq d_{v}, (B.3)

where Ev:={eE|ve}E_{v}:=\{e\in E~{}|~{}v\in e\} is the set of hyperedges incident to vv, and we use ξv,e\xi_{v,e} for the entry in ξv\xi_{v} that corresponds to eEve\in E_{v}. Let ξv\xi_{v}^{*} denote the optimal solution for (B.3). We have that se,v=ξv,es^{*}_{e,v}=\xi^{*}_{v,e} if vev\in e and se,v=0s^{*}_{e,v}=0 otherwise. Therefore, it suffices to find ξv\xi_{v}^{*} for vVv\in V. The optimality condition of (B.3) is given by

pϑe|ξv,ere,v|p1sign(ξv,ere,v)ϑeλ0,eEv,\displaystyle p\vartheta_{e}|\xi_{v,e}-r_{e,v}|^{p-1}\mathop{\mathrm{sign}}(\xi_{v,e}-r_{e,v})-\vartheta_{e}\lambda\ni 0,~{}\forall e\in E_{v},
λ0,ΔveEvϑeξv,edv,λ(ΔveEvϑeξv,edv)=0,\displaystyle\lambda\geq 0,~{}\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}\xi_{v,e}\leq d_{v},~{}\lambda\Big{(}\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}\xi_{v,e}-d_{v}\Big{)}=0,

where

sign(a):={{1},ifa<0,{1},ifa>0,[1,1]ifa=0.\mathop{\mathrm{sign}}(a):=\left\{\begin{array}[]{ll}\{-1\},&\mbox{if}~{}a<0,\\ \{1\},&\mbox{if}~{}a>0,\\ \mbox{$[-1,1]$}&\mbox{if}~{}a=0.\end{array}\right.

There are two cases for λ\lambda. We show that in both cases the solution given by (B.2) is optimal.

Case 1. If λ>0\lambda>0, then we must have that pϑe|ξv,ere,v|p1>0p\vartheta_{e}|\xi_{v,e}-r_{e,v}|^{p-1}>0 for all eEve\in E_{v} (otherwise, the stationarity condition would be violated). This means that p|ξv,ere,v|p1=λp|\xi_{v,e}-r_{e,v}|^{p-1}=\lambda for all eEve\in E_{v}, that is, ξv,e1re1,v=ξv,e2re2,v>0\xi_{v,e_{1}}-r_{e_{1},v}=\xi_{v,e_{2}}-r_{e_{2},v}>0 for every e1,e2Eve_{1},e_{2}\in E_{v}. Denote tv:=ξv,ere,vt_{v}:=\xi_{v,e}-r_{e,v}. Because λ>0\lambda>0, by complementarity we have

ΔveEvϑe(tv+re,v)=ΔveEvϑeξv,e=dv,\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}(t_{v}+r_{e,v})=\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}\xi_{v,e}=d_{v},

which implies that tv=(eEvϑe)1(ΔveEvϑere,vdv)t_{v}=(\sum_{e\in E_{v}}\vartheta_{e})^{-1}(\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}r_{e,v}-d_{v}). Note that ΔveEvϑere,vdv>0\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}r_{e,v}-d_{v}>0 because ΔveEvϑeξv,edv=0\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}\xi_{v,e}-d_{v}=0 and ξv,e>re,v\xi_{v,e}>r_{e,v} for all eEve\in E_{v}. Therefore we have that

se,v=ξv,e=re,v+dv1[ΔveEvϑere,vdv]+.s^{*}_{e,v}=\xi^{*}_{v,e}=r_{e,v}+d_{v}^{-1}\Big{[}\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}r_{e,v}-d_{v}\Big{]}_{+}.

Case 2. If λ=0\lambda=0, then we have that pϑe|ξv,ere,v|p1sign(ξv,ere,v)0p\vartheta_{e}|\xi_{v,e}-r_{e,v}|^{p-1}\mathop{\mathrm{sign}}(\xi_{v,e}-r_{e,v})\ni 0 for all eEve\in E_{v}, which implies ξv,ere,v=0\xi_{v,e}-r_{e,v}=0 for all eEve\in E_{v}. Then we must have

ΔveEvϑere,v=ΔveEvϑeξv,edv.\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}r_{e,v}=\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}\xi_{v,e}\leq d_{v}.

Therefore we still have that

se,v=ξv,e=re,v=re,v+dv1[ΔveEvϑere,vdv]+.s^{*}_{e,v}=\xi^{*}_{v,e}=r_{e,v}=r_{e,v}+d_{v}^{-1}\Big{[}\Delta_{v}-\sum_{e\in E_{v}}\vartheta_{e}r_{e,v}-d_{v}\Big{]}_{+}.

The required result then follows from the definition of AeA_{e} and DD. ∎
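The closed-form update (B.2) is cheap to apply: compute the excess [Δ − Σ_e ϑ_e r_e − d]_+ once, scale it by D^{-1}, and add the result to r_e on the nodes of each hyperedge. Below is a minimal, self-contained Python sketch (ours, reusing the toy data layout from the sketches in Appendix A; it assumes d_v > 0 for every node):

# Sketch (ours): the closed-form s-update (B.2).
def s_update(Delta, d, hyperedges, r):
    m = list(Delta)                            # Delta - sum_e theta_e * r_e
    for (nodes, theta), r_e in zip(hyperedges, r):
        for v in nodes:
            m[v] -= theta * r_e.get(v, 0.0)
    g = [max(m[v] - d[v], 0.0) / d[v] for v in range(len(d))]
    return [{v: r_e.get(v, 0.0) + g[v] for v in nodes}
            for (nodes, theta), r_e in zip(hyperedges, r)]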

We are now ready to show that the primal problem (A.1) can be cast into an equivalent separable formulation, which can then be solved by the AM method in Algorithm B.1. We give the reformulation under general p\ell_{p}-norm penalty and arbitrary ϑe>0\vartheta_{e}>0.

Lemma 2 (Lemma 3 in the main paper).

The following problem is equivalent to (A.1) for any σ>0\sigma>0, in the sense that (ϕ^,r^,z^)(\hat{\phi},\hat{r},\hat{z}) is optimal in (A.1) for some z^|V|\hat{z}\in\mathbb{R}^{|V|} if and only if (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) is optimal in (B.4) for some s^eE|V|\hat{s}\in\bigotimes_{e\in E}\mathbb{R}^{|V|}.

minϕ,r,s1peEϑe(ϕep+1σp1serepp)s.t.(ϕ,r)𝒞,ΔeEϑesed,se,v=0,ve.\begin{split}\min_{\phi,r,s}~{}&\frac{1}{p}\sum_{e\in E}\vartheta_{e}\left(\phi_{e}^{p}+\frac{1}{\sigma^{p-1}}\left\|s_{e}-r_{e}\right\|_{p}^{p}\right)\\ \textnormal{s.t.}~{}&(\phi,r)\in\mathcal{C},~{}\Delta-\sum_{e\in E}\vartheta_{e}s_{e}\leq d,~{}s_{e,v}=0,\forall v\not\in e.\end{split} (B.4)
Proof.

We will show the forward direction and the converse follows from exactly the same reasoning. Let ν^1\hat{\nu}_{1} and ν^2\hat{\nu}_{2} denote the optimal objective value of problems (A.1) and (B.4), respectively. Let (ϕ^,r^,z^)(\hat{\phi},\hat{r},\hat{z}) be an optimal solution for (A.1). Define s^e:=r^e+σAez^\hat{s}_{e}:=\hat{r}_{e}+\sigma A_{e}\hat{z} for eEe\in E. We show that (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) is an optimal solution for (B.4).

Because r^e,v=0\hat{r}_{e,v}=0 for all vev\not\in e, by the definition of AeA_{e}, we know that s^e,v=0\hat{s}_{e,v}=0 for all vev\not\in e. Moreover,

σDz^=σeEϑeAez^=eEϑe(s^er^e),\sigma D\hat{z}=\sigma\sum_{e\in E}\vartheta_{e}A_{e}\hat{z}=\sum_{e\in E}\vartheta_{e}(\hat{s}_{e}-\hat{r}_{e}),

so

ΔeEϑes^e=ΔeEϑer^eσDz^d.\Delta-\sum_{e\in E}\vartheta_{e}\hat{s}_{e}=\Delta-\sum_{e\in E}\vartheta_{e}\hat{r}_{e}-\sigma D\hat{z}\leq d.

Therefore, (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) is a feasible solution for (B.4). Furthermore,

σvVdvz^vp=σeEϑevez^vp=σeEϑeAez^pp=1σp1eEϑeσAez^pp=1σp1eEϑes^er^epp.\begin{split}\sigma\sum_{v\in V}d_{v}\hat{z}_{v}^{p}&=\sigma\sum_{e\in E}\vartheta_{e}\sum_{v\in e}\hat{z}_{v}^{p}=\sigma\sum_{e\in E}\vartheta_{e}\|A_{e}\hat{z}\|_{p}^{p}\\ &=\frac{1}{\sigma^{p-1}}\sum_{e\in E}\vartheta_{e}\left\|\sigma A_{e}\hat{z}\right\|_{p}^{p}=\frac{1}{\sigma^{p-1}}\sum_{e\in E}\vartheta_{e}\left\|\hat{s}_{e}-\hat{r}_{e}\right\|_{p}^{p}.\end{split}

This means that (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) attains objective value ν^1\hat{\nu}_{1} in (B.4). Hence ν^1ν^2\hat{\nu}_{1}\geq\hat{\nu}_{2}.

In order to show that (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) is indeed optimal for (B.4), it remains to show that ν^2ν^1\hat{\nu}_{2}\geq\hat{\nu}_{1}. Let (ϕ,r,s)(\phi^{\prime},r^{\prime},s^{\prime}) be an optimal solution for (B.4). Then we know that

s=argminseE|V|eEϑeserepp,s.t.ΔeEϑesed,se,v=0ve.s^{\prime}=\operatorname*{argmin}_{s\in\bigotimes_{e\in E}\mathbb{R}^{|V|}}\sum_{e\in E}\vartheta_{e}\|s_{e}-r^{\prime}_{e}\|_{p}^{p},~{}\mbox{s.t.}~{}\Delta-\sum_{e\in E}\vartheta_{e}s_{e}\leq d,~{}s_{e,v}=0~{}\forall v\not\in e. (B.5)

According to Lemma 1, we know that

se=re+AeD1[ΔeEϑered]+,eE.s^{\prime}_{e}=r^{\prime}_{e}+A_{e}D^{-1}\Big{[}\Delta-\sum_{e^{\prime}\in E}\vartheta_{e^{\prime}}r^{\prime}_{e^{\prime}}-d\Big{]}_{+},~{}\forall e\in E. (B.6)

Define z:=1σD1[ΔeEϑered]+z^{\prime}:=\frac{1}{\sigma}D^{-1}[\Delta-\sum_{e\in E}\vartheta_{e}r^{\prime}_{e}-d]_{+}. Then z0z^{\prime}\geq 0. Moreover, we have that

eEϑeseeEϑere=eEϑeAeD1[ΔeEϑered]+=[ΔeEϑered]+=σDz,\sum_{e\in E}\vartheta_{e}s^{\prime}_{e}-\sum_{e\in E}\vartheta_{e}r^{\prime}_{e}=\sum_{e\in E}\vartheta_{e}A_{e}D^{-1}\Big{[}\Delta-\sum_{e^{\prime}\in E}\vartheta_{e^{\prime}}r^{\prime}_{e^{\prime}}-d\Big{]}_{+}=\Big{[}\Delta-\sum_{e^{\prime}\in E}\vartheta_{e^{\prime}}r^{\prime}_{e^{\prime}}-d\Big{]}_{+}=\sigma Dz^{\prime},

so

ΔeEϑere=ΔeEϑese+σDzd+σDz.\Delta-\sum_{e\in E}\vartheta_{e}r^{\prime}_{e}=\Delta-\sum_{e\in E}\vartheta_{e}s^{\prime}_{e}+\sigma Dz^{\prime}\leq d+\sigma Dz^{\prime}.

Therefore, (ϕ,r,z)(\phi^{\prime},r^{\prime},z^{\prime}) is a feasible solution for (A.1). Furthermore,

1σp1eEϑeserepp=1σp1eEϑeσAezpp=σeEϑeAezpp=σeEϑevezvp=σvVdvzvp.\begin{split}\frac{1}{\sigma^{p-1}}\sum_{e\in E}\vartheta_{e}\left\|s^{\prime}_{e}-r^{\prime}_{e}\right\|_{p}^{p}&=\frac{1}{\sigma^{p-1}}\sum_{e\in E}\vartheta_{e}\left\|\sigma A_{e}z^{\prime}\right\|_{p}^{p}=\sigma\sum_{e\in E}\vartheta_{e}\|A_{e}z^{\prime}\|_{p}^{p}\\ &=\sigma\sum_{e\in E}\vartheta_{e}\sum_{v\in e}{z^{\prime}}_{v}^{p}=\sigma\sum_{v\in V}d_{v}{z^{\prime}}_{v}^{p}.\end{split}

This means that (ϕ,r,z)(\phi^{\prime},r^{\prime},z^{\prime}) attains objective value ν^2\hat{\nu}_{2} in (A.1). Hence ν^2ν^1\hat{\nu}_{2}\geq\hat{\nu}_{1}. ∎

Remark. The constructive proof of Lemma 2 means that, given an optimal solution (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) for problem (B.4), one can recover an optimal solution (ϕ^,r^,z^)(\hat{\phi},\hat{r},\hat{z}) for our original primal formulation (A.1) via z^:=1σD1[ΔeEϑer^ed]+\hat{z}:=\frac{1}{\sigma}D^{-1}[\Delta-\sum_{e\in E}\vartheta_{e}\hat{r}_{e}-d]_{+}. It then follows from Lemma 2 that the dual optimal solution x^\hat{x} is given by x^=z^p1\hat{x}=\hat{z}^{p-1}. Therefore, a sweep cut rounding procedure readily applies to the solution (ϕ^,r^,s^)(\hat{\phi},\hat{r},\hat{s}) of problem (B.4).
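In code, this recovery is direct. A sketch (ours, same toy layout as before) that maps the routing part r̂ of a solution of (B.4) to ẑ and the dual embedding x̂, which can then be rounded with the sweep-cut routine sketched in Subsection A.1:

# Sketch (ours): recover z_hat and the embedding x_hat = z_hat^{p-1} from r_hat.
def recover_embedding(Delta, d, hyperedges, r_hat, sigma, p):
    m = list(Delta)                            # Delta - sum_e theta_e * r_e
    for (nodes, theta), r_e in zip(hyperedges, r_hat):
        for v in nodes:
            m[v] -= theta * r_e.get(v, 0.0)
    z = [max(m[v] - d[v], 0.0) / (sigma * d[v]) for v in range(len(d))]
    return z, [zv ** (p - 1) for zv in z]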

Let g(ϕ,r,s)g(\phi,r,s) denote the objective function of problem (B.4) and let gg^{*} denote its optimal objective value.

The following theorem gives the convergence rate of Algorithm B.1 applied to (B.4), when its objective function is penalized by the p\ell_{p}-norm for p2p\geq 2.

Theorem 3 ([5]).

Let {ϕ(k),r(k),s(k)}k0\{\phi^{(k)},r^{(k)},s^{(k)}\}_{k\geq 0} be the sequence generated by Algorithm B.1. Then for any k1k\geq 1,

g(ϕ(k),r(k),s(k))g3max{g(ϕ(0),r(0),s(0))g,LpR2}k,g(\phi^{(k)},r^{(k)},s^{(k)})-g^{*}\leq\frac{3\max\{g(\phi^{(0)},r^{(0)},s^{(0)})-g^{*},L_{p}R^{2}\}}{k},

where

R=max(ϕ,r,s)max(ϕ^,r^,s^)𝒪{ϕϕ^22+rr^22+ss^22|g(ϕ,r,s)g(ϕ(0),r(0),s(0))},Lp=(p1)ϑmax2/pΔpp2dmin(p1)(p2)/pσp1,\begin{split}R&=\max_{(\phi,r,s)\in\mathcal{F}}~{}\max_{(\hat{\phi},\hat{r},\hat{s})\in\mathcal{O}}\big{\{}\|\phi-\hat{\phi}\|_{2}^{2}+\|r-\hat{r}\|_{2}^{2}+\|s-\hat{s}\|_{2}^{2}~{}\big{|}~{}g(\phi,r,s)\leq g(\phi^{(0)},r^{(0)},s^{(0)})\big{\}},\\ L_{p}&=(p-1)\frac{\vartheta_{\max}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{p-1}},\end{split}

where \mathcal{F} and 𝒪\mathcal{O} denote the feasible set and set of optimal solutions, respectively, ϑmax:=maxeEϑe\vartheta_{\max}:=\max\limits_{e\in E}\vartheta_{e}, and dmin:=minvsupp(Δ)dvd_{\min}:=\min\limits_{v\in\textnormal{supp}(\Delta)}d_{v}.

Remark. When p=2p=2, as considered in the main paper, the objective function g(ϕ,r,s)g(\phi,r,s) has Lipschitz continuous gradient with constant L2=ϑmax/σL_{2}=\vartheta_{\max}/\sigma. When p>2p>2, the gradient of g(ϕ,r,s)g(\phi,r,s) is not generally Lipschitz continuous. However, the sub-linear convergence rate in Theorem 3 applies as long as g(ϕ,r,s)g(\phi,r,s) is block Lipschitz smooth in the sub-level sets containing the iterates generated by Algorithm B.1. We give more details in Subsection B.1.
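For reference, the constant LpL_{p} in Theorem 3 is immediate to evaluate from Δ\Delta, the degrees, the weights and σ\sigma; a one-function sketch (ours):

# Sketch (ours): the smoothness constant L_p of Theorem 3.
def lipschitz_bound(Delta, d, thetas, sigma, p):
    theta_max = max(thetas)
    d_min = min(d[v] for v in range(len(d)) if Delta[v] > 0)
    norm_p = sum(abs(x) ** p for x in Delta) ** (1.0 / p)
    return ((p - 1) * theta_max ** (2.0 / p) * norm_p ** (p - 2)
            / (d_min ** ((p - 1) * (p - 2) / p) * sigma ** (p - 1)))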

B.1 Block Lipschitz smoothness over sub-level set

Recall that g(ϕ,r,s)g(\phi,r,s) denotes the objective function of problem (B.4). Lemma 4 concerns specifically the setting in which problem (B.4) is penalized by the p\ell_{p}-norm for some p>2p>2.

Lemma 4 (Block Lipschitz smoothness).

The partial gradient (ϕ,r)g(ϕ,r,s)\nabla_{(\phi,r)}g(\phi,r,s) is Lipschitz continuous over the sub-level sets (given any fixed ss)

Uϕ,r(s):={(ϕ,r)+|E|×(eE|V|)|g(ϕ,r,s)g(ϕ(0),r(0),s(0))}U_{\phi,r}(s):=\{(\phi,r)\in\mathbb{R}^{|E|}_{+}\times(\mbox{$\bigotimes_{e\in E}$}\mathbb{R}^{|V|})~{}|~{}g(\phi,r,s)\leq g(\phi^{(0)},r^{(0)},s^{(0)})\}

with constant Lϕ,rL_{\phi,r} such that

Lϕ,r(p1)ϑmax2/pΔpp2dmin(p1)(p2)/pσp1,L_{\phi,r}\leq(p-1)\frac{\vartheta_{\max}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{p-1}},

where ϑmax:=maxeEϑe\vartheta_{\max}:=\max_{e\in E}\vartheta_{e} and dmin:=minvsupp(Δ)dvd_{\min}:=\min_{v\in\textnormal{supp}(\Delta)}d_{v}. The partial gradient sg(ϕ,r,s)\nabla_{s}g(\phi,r,s) is Lipschitz continuous over the sub-level sets (given any fixed (ϕ,r)(\phi,r))

Us(ϕ,r):={seE|V||g(ϕ,r,s)g(ϕ(0),r(0),s(0))}U_{s}(\phi,r):=\{s\in\mbox{$\bigotimes_{e\in E}$}\mathbb{R}^{|V|}~{}|~{}g(\phi,r,s)\leq g(\phi^{(0)},r^{(0)},s^{(0)})\}

with constant LsLϕ,rL_{s}\leq L_{\phi,r}.

Proof.

Fix seE|V|s\in\bigotimes_{e\in E}\mathbb{R}^{|V|} and consider

g1(ϕ,r):=g(ϕ,r,s)=1peEϑeϕep+1pσp1eEvVϑe|re,vse,v|p.g_{1}(\phi,r):=g(\phi,r,s)=\frac{1}{p}\sum_{e\in E}\vartheta_{e}\phi_{e}^{p}+\frac{1}{p\sigma^{p-1}}\sum_{e\in E}\sum_{v\in V}\vartheta_{e}|r_{e,v}-s_{e,v}|^{p}.

The function g1(ϕ,r)g_{1}(\phi,r) is coordinate-wise separable and hence its second order derivative 2g1(ϕ,r)\nabla^{2}g_{1}(\phi,r) is a diagonal matrix. Therefore, the largest eigenvalue of 2g1(ϕ,r)\nabla^{2}g_{1}(\phi,r) is the largest coordinate-wise second order partial derivative, that is,

Lϕ,r=max(ϕ,r)Uϕ,r(s)λmax(2g1(ϕ,r))=max(ϕ,r)Uϕ,r(s)maxeE,vV{ϕe2g1(ϕ,r),re,v2g1(ϕ,r)}.L_{\phi,r}=\max_{(\phi,r)\in U_{\phi,r}(s)}\lambda_{\max}(\nabla^{2}g_{1}(\phi,r))=\max_{(\phi,r)\in U_{\phi,r}(s)}\max_{e\in E,v\in V}\{\nabla^{2}_{\phi_{e}}g_{1}(\phi,r),\nabla^{2}_{r_{e,v}}g_{1}(\phi,r)\}.

So it suffices to upper bound ϕe2g1(ϕ,r)\nabla^{2}_{\phi_{e}}g_{1}(\phi,r) and re,v2g1(ϕ,r)\nabla^{2}_{r_{e,v}}g_{1}(\phi,r) for all (ϕ,r)Uϕ,r(s)(\phi,r)\in U_{\phi,r}(s). We have that

g(ϕ(0),r(0),s(0))=1pσp1eEϑeve[Δvdv]+pdvp=1pσp1vV[Δvdv]+pdvp1Δpppσp1dminp1g(\phi^{(0)},r^{(0)},s^{(0)})=\frac{1}{p\sigma^{p-1}}\sum_{e\in E}\vartheta_{e}\sum_{v\in e}\frac{[\Delta_{v}-d_{v}]_{+}^{p}}{d_{v}^{p}}=\frac{1}{p\sigma^{p-1}}\sum_{v\in V}\frac{[\Delta_{v}-d_{v}]_{+}^{p}}{d_{v}^{p-1}}\leq\frac{\|\Delta\|_{p}^{p}}{p\sigma^{p-1}d_{\min}^{p-1}}

where dmin=minvsupp(Δ)dvd_{\min}=\min_{v\in\textnormal{supp}(\Delta)}d_{v}. It follows that for all (ϕ,r)Uϕ,r(s)(\phi,r)\in U_{\phi,r}(s),

ϕe2g1(ϕ,r)\displaystyle\nabla^{2}_{\phi_{e}}g_{1}(\phi,r) =(p1)ϑeϕep2(p1)ϑe2/pΔpp2dmin(p1)(p2)/pσ(p1)(p2)/p(p1)ϑe2/pΔpp2dmin(p1)(p2)/pσp1,eE,\displaystyle=(p-1)\vartheta_{e}\phi_{e}^{p-2}\leq\frac{(p-1)\vartheta_{e}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{(p-1)(p-2)/p}}\leq\frac{(p-1)\vartheta_{e}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{p-1}},~{}\forall e\in E,
re,v2g1(ϕ,r)\displaystyle\nabla^{2}_{r_{e,v}}g_{1}(\phi,r) =(p1)ϑeσp1|se,vre,v|p2(p1)ϑe2/pΔpp2dmin(p1)(p2)/pσp1,eE,vV,\displaystyle=(p-1)\frac{\vartheta_{e}}{\sigma^{p-1}}|s_{e,v}-r_{e,v}|^{p-2}\leq\frac{(p-1)\vartheta_{e}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{p-1}},~{}\forall e\in E,~{}\forall v\in V,

because otherwise we would have g(ϕ,r,s)>g(ϕ(0),r(0),s(0))g(\phi,r,s)>g(\phi^{(0)},r^{(0)},s^{(0)}). Hence,

Lϕ,rmaxeE(p1)ϑe2/pΔpp2dmin(p1)(p2)/pσp1=(p1)ϑmax2/pΔpp2dmin(p1)(p2)/pσp1.L_{\phi,r}\leq\max_{e\in E}\frac{(p-1)\vartheta_{e}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{p-1}}=\frac{(p-1)\vartheta_{\max}^{2/p}\|\Delta\|_{p}^{p-2}}{d_{\min}^{(p-1)(p-2)/p}\sigma^{p-1}}.

Finally, by the symmetry between rr and ss in g(ϕ,r,s)g(\phi,r,s), we know that LsLϕ,rL_{s}\leq L_{\phi,r}. ∎

Remark. Because the iterates generated by Algorithm B.1 monotonically decrease the objective function value, in particular, we have that

g(ϕ(0),r(0),s(0))g(ϕ(k+1),r(k+1),s(k))g(ϕ(k+1),r(k+1),s(k+1))g(\phi^{(0)},r^{(0)},s^{(0)})\geq g(\phi^{(k+1)},r^{(k+1)},s^{(k)})\geq g(\phi^{(k+1)},r^{(k+1)},s^{(k+1)})

for any k0k\geq 0. Therefore, the sequence of iterates lives in the sub-level sets. As a result, for any p>2p>2, the block Lipschitz smoothness within sub-level sets suffices to obtain the sub-linear convergence rate for the AM method [5].

B.2 Alternating minimization sub-problems

We now discuss how to solve the sub-problems in Algorithm B.1 efficiently. By Lemma 1, we know that the sub-problem with respect to ss,

s(k+1):=argminseEϑesere(k+1)pp,s.t.ΔeEϑesed,se,v=0,ve,s^{(k+1)}:=\operatorname*{argmin}\limits_{s}\sum\limits_{e\in E}\vartheta_{e}\|s_{e}-r_{e}^{(k+1)}\|_{p}^{p},~{}\mbox{s.t.}~{}\Delta-\sum\limits_{e\in E}\vartheta_{e}s_{e}\leq d,\ s_{e,v}=0,\forall v\not\in e,

has closed-form solution

se(k+1)=re(k+1)+AeD1[ΔeEϑere(k+1)d]+,eE.s_{e}^{(k+1)}=r_{e}^{(k+1)}+A_{e}D^{-1}\Big{[}\Delta-\sum_{e^{\prime}\in E}\vartheta_{e^{\prime}}r_{e^{\prime}}^{(k+1)}-d\Big{]}_{+},~{}\forall e\in E.

For the sub-problem with respect to (ϕ,r)(\phi,r),

(ϕ(k+1),r(k+1)):=argmin(ϕ,r)𝒞eEϑe(ϕep+1σp1se(k)repp),(\phi^{(k+1)},r^{(k+1)}):=\operatorname*{argmin}\limits_{(\phi,r)\in\mathcal{C}}\sum\limits_{e\in E}\vartheta_{e}\left(\phi_{e}^{p}+\frac{1}{\sigma^{p-1}}\|s_{e}^{(k)}-r_{e}\|_{p}^{p}\right),

note that it decomposes into |E||E| independent problems that can be minimized separately. That is, for eEe\in E, we have

(ϕe(k+1),re(k+1))=argminϕe0,reϕeBeϑeϕep+1σp1ϑese(k)repp=argminϕe0,reϕeBe1pϕep+1pσp1se(k)repp.\begin{split}(\phi_{e}^{(k+1)},r_{e}^{(k+1)})&=\operatorname*{argmin}_{\phi_{e}\geq 0,r_{e}\in\phi_{e}B_{e}}\vartheta_{e}\phi_{e}^{p}+\frac{1}{\sigma^{p-1}}\vartheta_{e}\|s_{e}^{(k)}-r_{e}\|_{p}^{p}\\ &=\operatorname*{argmin}_{\phi_{e}\geq 0,r_{e}\in\phi_{e}B_{e}}\frac{1}{p}\phi_{e}^{p}+\frac{1}{p\sigma^{p-1}}\|s_{e}^{(k)}-r_{e}\|_{p}^{p}.\end{split} (B.7)

The above problem (B.7) is strictly convex so it has a unique minimizer.

We focus on p=2p=2 first. In this case, problem (B.7) can be solved in sub-linear time using either the conic Frank-Wolfe algorithm or the conic Fujishige-Wolfe minimum norm algorithm studied in [32]. Notice that the dimension of problem (B.7) is the size of the corresponding hyperedge. Therefore, as long as the hyperedge is not extremely large, we can easily obtain a good update (ϕe(k+1),re(k+1))(\phi_{e}^{(k+1)},r_{e}^{(k+1)}).

If BeB_{e} has a special structure, for example, if the hyperedge weight wew_{e} models the unit cut-cost, then an exact solution for (B.7) can be computed in time O(|e|log|e|)O(|e|\log|e|) [32]. For completeness we translate the algorithmic details of [32] to our setting and list them in Algorithm B.2. The basic idea is to find dual variables achieving dual optimality, and then recover the primal optimal solution from the dual. We refer the reader to [32] for detailed justifications. Given eEe\in E, se|V|s_{e}\in\mathbb{R}^{|V|}, and a,ba,b\in\mathbb{R}, denote

e(a):={ve|se,vσa}ande(b):={ve|se,vσb}.e_{\geq}(a):=\{v\in e~{}|~{}s_{e,v}\geq\sigma a\}~{}~{}\mbox{and}~{}~{}e_{\leq}(b):=\{v\in e~{}|~{}s_{e,v}\leq\sigma b\}.

Define

γ(a,b):=ab+ve(a)σ(ase,vσ).\gamma(a,b):=a-b+\sum_{v\in e_{\geq}(a)}\sigma\left(a-\frac{s_{e,v}}{\sigma}\right).
Algorithm B.2 An Exact Projection Algorithm for (B.7) (p=2p=2, unit cut-cost) [32]
1:  Input: ee, ses_{e}.
2:  amaxvese,v/σa\leftarrow\max_{v\in e}s_{e,v}/\sigma,   bminvese,v/σb\leftarrow\min_{v\in e}s_{e,v}/\sigma
3:  While true:
4:       waσ|e(a)|w_{a}\leftarrow\sigma\,|e_{\geq}(a)|,  wbσ|e(b)|w_{b}\leftarrow\sigma\,|e_{\leq}(b)|
5:       a1maxvee(a)se,v/σa_{1}\leftarrow\max_{v\in e\setminus e_{\geq}(a)}s_{e,v}/\sigma,  b1b+(aa1)wa/wbb_{1}\leftarrow b+(a-a_{1})w_{a}/w_{b}
6:       b2minvee(b)se,v/σb_{2}\leftarrow\min_{v\in e\setminus e_{\leq}(b)}s_{e,v}/\sigma,  a2a(b2b)wb/waa_{2}\leftarrow a-(b_{2}-b)w_{b}/w_{a}
7:       iargmini{1,2}bii^{*}\leftarrow\operatorname*{argmin}_{i\in\{1,2\}}b_{i}
8:       If aibia_{i^{*}}\leq b_{i^{*}} or γ(ai,bi)0\gamma(a_{i^{*}},b_{i^{*}})\leq 0 break
9:       aaia\leftarrow a_{i^{*}},  bbib\leftarrow b_{i^{*}}
10:  aaγ(a,b)wb/(wawb+wa+wb)a\leftarrow a-\gamma(a,b)w_{b}/(w_{a}w_{b}+w_{a}+w_{b}),  bb+γ(a,b)wa/(wawb+wa+wb)b\leftarrow b+\gamma(a,b)w_{a}/(w_{a}w_{b}+w_{a}+w_{b})
11:  For vev\in e do:
12:       If ve(a)v\in e_{\geq}(a) then  re,vse,vσar_{e,v}\leftarrow s_{e,v}-\sigma a
13:       Else if ve(b)v\in e_{\leq}(b) then  re,vse,vσbr_{e,v}\leftarrow s_{e,v}-\sigma b
14:       Else  re,v0r_{e,v}\leftarrow 0
15:  Return: rer_{e}
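For concreteness, the primitives used throughout Algorithm B.2 are simple to implement; the helpers below (our own naming, with the entries se,vs_{e,v} of a hyperedge stored in a dict) realize e(a)e_{\geq}(a), e(b)e_{\leq}(b) and γ(a,b)\gamma(a,b) as defined above:

# Sketch (ours): the primitives e_>=, e_<= and gamma used by Algorithm B.2.
def e_ge(s_e, sigma, a):
    return {v for v, sv in s_e.items() if sv >= sigma * a}

def e_le(s_e, sigma, b):
    return {v for v, sv in s_e.items() if sv <= sigma * b}

def gamma(s_e, sigma, a, b):
    return a - b + sum(sigma * (a - s_e[v] / sigma) for v in e_ge(s_e, sigma, a))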

Now we discuss the case p>2p>2 in (B.7). The dual of (B.7) is written as

minye1qfe(ye)q+σqyeqqyeTse(k).\min_{y_{e}}\frac{1}{q}f_{e}(y_{e})^{q}+\frac{\sigma}{q}\|y_{e}\|_{q}^{q}-y_{e}^{T}s_{e}^{(k)}. (B.8)

Let (ϕe,re)(\phi_{e}^{*},r_{e}^{*}) and yey_{e}^{*} be optimal solutions of (B.7) and (B.8), respectively. Then one has

re=se(k)σ(ye)q1andϕe=((re)Tye)1/q.r_{e}^{*}=s_{e}^{(k)}-\sigma(y_{e}^{*})^{q-1}~{}\mbox{and}~{}\phi_{e}^{*}=\big{(}(r_{e}^{*})^{T}y_{e}^{*}\big{)}^{1/q}.

Both the derivation of (B.8) and the above relations between (ϕe,re)(\phi_{e}^{*},r_{e}^{*}) and yey_{e}^{*} follow from reasoning and algebraic computations similar to those used in the proofs of Lemma 1 and Lemma 2. Therefore, we can use a subgradient method to compute yey_{e}^{*} first and then recover ϕe\phi_{e}^{*} and rer_{e}^{*}. For special cases like the unit cut-cost, a similar approach to Algorithm B.2 can be adopted to obtain an almost (up to a binary search tolerance) exact solution, by modifying Steps 2–6 to work with the general p\ell_{p}-norm and replacing Step 10 with a binary search. See Algorithm B.3 for details.

Caution. To simplify notation in Algorithm B.3, for cc\in\mathbb{R} and p>0p>0, cpc^{p} is to be interpreted as cp:=|c|psign(c)c^{p}:=|c|^{p}\mathop{\mathrm{sign}}(c), where we treat sign(0):=0\mathop{\mathrm{sign}}(0):=0. For q=p/(p1)q=p/(p-1), we define

γp(a,b):=(ab)q1+ve(aq1)σ(aq1se,vσ).\gamma_{p}(a,b):=(a-b)^{q-1}+\sum_{v\in e_{\geq}(a^{q-1})}\sigma\left(a^{q-1}-\frac{s_{e,v}}{\sigma}\right).
Algorithm B.3 An p\ell_{p}-Projection Algorithm for (B.7) (p>2p>2, unit cut-cost)
1:  Input: ee, ses_{e}.
2:  amaxve(se,v/σ)p1a\leftarrow\max_{v\in e}(s_{e,v}/\sigma)^{p-1},   bminve(se,v/σ)p1b\leftarrow\min_{v\in e}(s_{e,v}/\sigma)^{p-1},   qp/(p1)q\leftarrow p/(p-1)
3:  While true:
4:       waσ|e(aq1)|w_{a}\leftarrow\sigma\,|e_{\geq}(a^{q-1})|,  wbσ|e(bq1)|w_{b}\leftarrow\sigma\,|e_{\leq}(b^{q-1})|
5:       a1maxvee(aq1)(se,v/σ)p1a_{1}\leftarrow\max_{v\in e\setminus e_{\geq}(a^{q-1})}(s_{e,v}/\sigma)^{p-1},  b1(bq1+(aq1a1q1)wa/wb)p1b_{1}\leftarrow(b^{q-1}+(a^{q-1}-a_{1}^{q-1})w_{a}/w_{b})^{p-1}
6:       b2minvee(bq1)(se,v/σ)p1b_{2}\leftarrow\min_{v\in e\setminus e_{\leq}(b^{q-1})}(s_{e,v}/\sigma)^{p-1},   a2(aq1(b2q1bq1)wb/wa)p1a_{2}\leftarrow(a^{q-1}-(b_{2}^{q-1}-b^{q-1})w_{b}/w_{a})^{p-1}
7:       iargmini{1,2}bii^{*}\leftarrow\operatorname*{argmin}_{i\in\{1,2\}}b_{i}
8:       If aibia_{i^{*}}\leq b_{i^{*}} or γp(ai,bi)0\gamma_{p}(a_{i^{*}},b_{i^{*}})\leq 0 break
9:       aaia\leftarrow a_{i^{*}},  bbib\leftarrow b_{i^{*}}
10:  Employ binary search for a^[b,a]\hat{a}\in[b,a] such that γp(a^,b^)=0\gamma_{p}(\hat{a},\hat{b})=0 while maintaining b^=(bq1+(aq1a^q1)wa/wb)p1\hat{b}=(b^{q-1}+(a^{q-1}-\hat{a}^{q-1})w_{a}/w_{b})^{p-1} and b^a^\hat{b}\leq\hat{a}
11:  For vev\in e do:
12:       If ve(a^q1)v\in e_{\geq}(\hat{a}^{q-1}) then  re,vse,vσa^q1r_{e,v}\leftarrow s_{e,v}-\sigma\hat{a}^{q-1}
13:       Else if ve(b^q1)v\in e_{\leq}(\hat{b}^{q-1}) then  re,vse,vσb^q1r_{e,v}\leftarrow s_{e,v}-\sigma\hat{b}^{q-1}
14:       Else  re,v0r_{e,v}\leftarrow 0
15:  Return: rer_{e}

Appendix C Empirical set-up and results

C.1 Experiments using synthetic data

In this subsection we provide details about how we generate synthetic hypergraphs using the kk-uniform stochastic block model and how we set the parameters for the algorithms used in our experiments. Additional synthetic experiments that demonstrate or explain the robustness of our method are also provided.

Data generation. We generate four sets of hypergraphs using the generalized kkHSBM described in the main paper. All hypergraphs have n=100n=100 nodes. For simplicity, we require that each block in the hypergraph has constant size 50.

1st set of hypergraphs. We generate the first set of hypergraphs with k=3k=3, constant p=0.0765p=0.0765 and varying q[0.0041,0.0735]q\in[0.0041,0.0735]. Recall that for k=3k=3 there is only one possible inter-cluster probability qq1q\equiv q_{1}. We pick p=0.0765p=0.0765 so that the expected number of intra-cluster hyperedges is 1500 for each block of size 50. We set a wide range for qq so that the interval covers both extremes, i.e., the cases where the ground-truth target cluster is very clean and where it is very noisy. These hypergraphs are used to evaluate the performance of algorithms for the unit cut-cost setting when the target cluster conductance varies. Figure 4 in the main paper is based on the local clustering results on these hypergraphs.

2nd set of hypergraphs. For the second set of hypergraphs, we vary k{3,4,5,6}k\in\{3,4,5,6\}. Moreover, we set q2==qk/2=0q_{2}=\cdots=q_{\lfloor k/2\rfloor}=0, so every inter-cluster hyperedge contains a single node on one side and the rest on the other side. In this setting, separating the two ground-truth communities incurs a small penalty under the cardinality cut-cost, but a large penalty under the unit cut-cost. Therefore, methods that exploit an appropriate cardinality-based cut-cost should perform better. The hypergraphs are sampled so that the conductance of a block stays the same across different kk’s. We compute the conductance based on the unit cut-cost when generating the hypergraphs, because the scale of conductance based on the unit cut-cost is less affected by kk than the scale of conductance based on the cardinality cut-cost; see below for details on how the scale of conductance based on the cardinality cut-cost is affected by kk. The second set of hypergraphs is used to evaluate the performance of algorithms for both unit and cardinality cut-costs when the hyperedge size varies. Figure 5 in the main paper (and Figure C.3 and Figure C.4 in the appendix) is based on the local clustering results on these hypergraphs.

3rd set of hypergraphs. For the third set of hypergraphs, we set q2==qk/2=0q_{2}=\cdots=q_{\lfloor k/2\rfloor}=0. We consider constant k=4k=4 or k=5k=5, constant pp and varying q1q_{1}. These hypergraphs are used to evaluate the performance of algorithms for both unit and cardinality cut-costs when the target cluster conductance varies. Figure C.1 and Figure C.2 in the appendix are based on the local clustering results on these hypergraphs.

4th set of hypergraphs. This set consists of two hypergraphs generated with k=3k=3, p=0.04p=0.04 and q{0.001,0.011}q\in\{0.001,0.011\}. The ground-truth target cluster in the first hypergraph has conductance 0.05, while the ground-truth target cluster in the second hypergraph has conductance 0.3. These two hypergraphs are used to compare the performance of algorithms for the unit cut-cost setting when the theoretical assumptions of LH hold (for the first hypergraph) or fail (for the second hypergraph).
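For illustration, the following Julia sketch samples a kk-uniform hypergraph under the simplest reading of the model for k=3k=3: two equal-size blocks, where each intra-block triple is kept independently with probability pp and each inter-block triple with probability qq. The generalized kkHSBM in the main paper further distinguishes inter-cluster splits via q1,,qk/2q_{1},\ldots,q_{\lfloor k/2\rfloor}; the function and variable names here are ours.

using Combinatorics, Random

# Two equal-size blocks {1,…,n/2} and {n/2+1,…,n}; intra-block triples are
# kept with probability p, inter-block triples with probability q.
function sample_3hsbm(n::Int, p::Float64, q::Float64; rng = Random.default_rng())
    half = div(n, 2)
    E = Vector{Vector{Int}}()
    for t in combinations(1:n, 3)          # t is sorted ascending
        intra = (t[end] <= half) || (t[1] > half)
        if rand(rng) < (intra ? p : q)
            push!(E, collect(t))
        end
    end
    return E
end

E = sample_3hsbm(100, 0.0765, 0.0388)      # e.g. a point inside the q-range above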

Parameters. For HFD, in all synthetic experiments we initialize the seed mass so that Δ1\|\Delta\|_{1} is three times the volume of the target cluster (recall from Assumption 2 that this is without loss of generality). We set σ=0.01\sigma=0.01. We tune the parameters for LH as suggested by the authors [33]. Specifically, LH has a regularization parameter κ\kappa and we let κ=cr\kappa=c\cdot r, where rr is the ratio between the number of seed node(s) and the size of the target cluster. We perform a binary search on cc and find that c=0.35c=0.35 gives good results for the synthetic hypergraphs. An important parameter for LH is δ\delta. When δ=1\delta=1 it models the unit cut-cost, and when δ>1\delta>1 it models a cardinality-based cut-cost with upper bound δ\delta [33]. We consider both cases, δ=1\delta=1 (U-LH) and δ>1\delta>1 (C-LH). In principle, for kk-uniform hypergraphs LH should produce the same result for any δk\delta\geq k, so one could simply set δ=k\delta=k for C-LH. However, in our experiments we find that the δ\delta value that gives the best clustering results can be much larger than kk. In order to get the best performance out of C-LH, we run C-LH for δ=2i\delta=2^{i}, i=0,1,,12i=0,1,\ldots,12, and among the 13 output clusters we pick the one with the lowest conductance. For ACL, we use the same set of parameter values used in [33], because that parameter setting also produces good results in our synthetic experiments.

Scale of cardinality-based conductance. To see how the ground-truth conductance (computed using the cardinality cut-cost) scales with hyperedge size k2k\geq 2, let us assume that a hypergraph H=(V,E)H=(V,E), with |V|=100|V|=100 nodes and two blocks of 50 nodes each, is generated with p=0p=0, q1=1q_{1}=1 and q2==qk/2=0q_{2}=\ldots=q_{\lfloor k/2\rfloor}=0. In this case, the hypergraph contains all possible inter-cluster hyperedges and no intra-cluster ones. Let CC denote a target cluster, that is, CC is either one of the two ground-truth blocks. Since we have |V|=100|V|=100 nodes and each of the two blocks contains 5050 nodes, the total number of hyperedges is

|E|=2(50k1)(501).|E|=2\binom{50}{k-1}\binom{50}{1}.

Let wew_{e} denote the cardinality-based cut-cost given by we(S)=min{|Se|,|eS|}/|e|/2w_{e}(S)=\min\{|S\cap e|,|e\setminus S|\}/\lfloor|e|/2\rfloor. Then for each eEe\in E we have that we(C)=1k/2w_{e}(C)=\frac{1}{\lfloor k/2\rfloor}. Moreover, the volume of CC is

vol(C)=(k1)(50k1)(501)+(501)(50k1)=k(50k1)(501),\textnormal{vol}(C)=(k-1)\binom{50}{k-1}\binom{50}{1}+\binom{50}{1}\binom{50}{k-1}=k\binom{50}{k-1}\binom{50}{1},

and hence we have

Φ(C)=vol(C)vol(C)=eEwe(C)vol(C)=1k/2|E|vol(C)=2k/2(50k1)(501)k(50k1)(501)=2kk/2.\Phi(C)=\frac{\textnormal{vol}(\partial C)}{\textnormal{vol}(C)}=\frac{\sum_{e\in E}w_{e}(C)}{\textnormal{vol}(C)}=\frac{\frac{1}{\lfloor k/2\rfloor}|E|}{\textnormal{vol}(C)}=\frac{\frac{2}{\lfloor k/2\rfloor}\binom{50}{k-1}\binom{50}{1}}{k\binom{50}{k-1}\binom{50}{1}}=\frac{2}{k\lfloor k/2\rfloor}.

This means that, for any p0p\geq 0, q11q_{1}\leq 1, q2==qk/2=0q_{2}=\cdots=q_{\lfloor k/2\rfloor}=0, if BB is one of the two blocks in HH, then Φ(B)1\Phi(B)\leq 1 for k=2,3k=2,3, Φ(B)1/4\Phi(B)\leq 1/4 for k=4k=4, and Φ(B)1/5\Phi(B)\leq 1/5 for k=5k=5. This explains why the ranges of ground-truth conductance we consider in Figure C.1 and Figure C.2 differ from the range of ground-truth conductance in Figure 4 in the main paper. For each kk we make the range of conductance (i.e., the xx-axis) as wide as possible, but because the scale of cardinality-based conductance differs across kk, the ranges vary accordingly.
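The closed form derived above is easy to verify numerically; a minimal Julia check:

# Numerical check of Φ(C) = 2 / (k⌊k/2⌋) for p = 0, q₁ = 1, two blocks of 50.
for k in 2:6
    nE  = 2 * binomial(50, k - 1) * 50     # all (and only) inter-cluster hyperedges
    vol = k * binomial(50, k - 1) * 50     # volume of one block
    cut = nE / fld(k, 2)                   # each hyperedge contributes 1/⌊k/2⌋
    @assert cut / vol ≈ 2 / (k * fld(k, 2))
end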

Additional results. Figure C.1 and Figure C.2 show how the algorithms perform on kk-uniform hypergraphs for k=4,5k=4,5, respectively, as we vary the target cluster conductance. The plots show that as the target cluster becomes noisier, the performance of all methods degrades. However, C-HFD is better in terms of both conductance and F1 score, especially when the target cluster is noisy but not pure noise (i.e., the ground-truth conductance is high but not too high). For k=5k=5 in the high-conductance regime, methods that use the unit cut-cost, e.g., U-HFD, perform poorly because they find clusters of low conductance with respect to the unit cut-cost as opposed to the cardinality cut-cost. In general, lower unit cut-cost conductance does not necessarily translate to lower cardinality-based conductance or higher F1 score. For both Figure C.1 and Figure C.2, the ground-truth conductance is computed using the cardinality-based cut-cost; therefore the ground-truth conductances (on the xx-axes) have different scales and ranges. Figure C.3 and Figure C.4 show the median (markers) and 25-75 percentiles (lower-upper bars) of conductance ratios and F1 scores for k=3,4,5,6k=3,4,5,6. The target clusters have unit cut-cost conductance around 0.2 for Figure C.3 and 0.25 for Figure C.4. Notice that, when the target clusters are less noisy (cf. Figure 5 in the main paper, where the target clusters are noisier, having unit conductance around 0.3), U-HFD and C-HFD are significantly better than the other methods. The performance of U-HFD is slightly affected by the hyperedge size when the target clusters have unit conductance around 0.25, while the performance of C-HFD stays the same across all kk’s.

Figure C.1: Average output conductance and F1 score against ground-truth conductance, on kk-uniform hypergraphs with k=4k=4. The error bars show variation over 50 runs using different seed nodes. Both the ground-truth and the target conductances are computed using cardinality-based cut-cost.
Figure C.2: Average output conductance and F1 score against ground-truth conductance, on kk-uniform hypergraphs with k=5k=5. The error bars show variation over 50 runs using different seed nodes. Both the ground-truth and the target conductances are computed using cardinality-based cut-cost.
Figure C.3: Conductance ratio and F1 score on kk-uniform hypergraphs for k{3,4,5,6}k\in\{3,4,5,6\}. Target clusters have unit conductance around 0.20.
Figure C.4: Conductance ratio and F1 score on kk-uniform hypergraphs for k{3,4,5,6}k\in\{3,4,5,6\}. Target clusters have unit conductance around 0.25.

Why is the empirical performance of U-HFD better than U-LH? For the unit cut-cost setting, the local clustering guarantee for HFD holds under much weaker assumptions than those required for LH. The assumptions for LH fail in many cases, and consequently U-HFD has significantly better performance than U-LH in the experiments on both synthetic and real data. More specifically, the theoretical framework for LH assumes that the node embeddings are global (i.e., the solution is dense). However, in order to obtain a localized algorithm, the authors use a regularization parameter κ>0\kappa>0 to impose sparsity in the solution. The localized algorithm computes a sparse approximation to the original global solution, but some clustering errors can be introduced in the process. In general, this does not seem to be a major issue: as shown in Figure C.5, localizing the solution only slightly affects the clustering performance. A more crucial assumption of LH is that its approximation guarantee relies on the strong condition that the conductance of the target cluster is upper bounded by γ8c\frac{\gamma}{8c}, where γ(0,1)\gamma\in(0,1) is a tuning parameter and cc is a constant that depends on both γ\gamma and a specific sampling strategy for selecting a seed node from the target cluster. In our experiments we find that this assumption is often violated.

In what follows we provide a simple illustrative example using synthetic hypergraphs. First, we sample a sequence of hypergraphs using the kkHSBM with n=100n=100 nodes, two ground-truth communities of 50 nodes each, constant k=3k=3, and varying pp and qq. For each hypergraph we identify one ground-truth community as the target cluster, and we select a seed node uniformly at random from the target cluster. We compute the quantity γ8c\frac{\gamma}{8c} and find that it is always less than 0.12 for any γ(0,1)\gamma\in(0,1). This means that, in order for the assumption of LH to hold, the target cluster must have conductance at most 0.12, which is a very strict requirement that cannot hold in general. In order to compare the performance of LH when its assumption holds or fails, we pick two hypergraphs (the fourth set of hypergraphs that we generated) corresponding to the two scenarios. The target clusters have conductance 0.05 and 0.3, respectively, so the assumption for LH holds for the first hypergraph but fails for the second. Moreover, we consider both global and localized solutions for LH. The global solution demonstrates the performance of LH under the required theoretical framework, while the localized solution demonstrates what happens in practice when one uses a sparse approximation for computational efficiency. For LH, we compute the global solution by simply setting the regularization parameter κ\kappa to 0; we tune the localized solution and set κ=0.25r\kappa=0.25r, where rr is the ratio between the number of seed node(s) and the size of the target cluster. The way we pick κ\kappa is similar to the authors’ choice for LH. For HFD, we set σ=0.01\sigma=0.01 and initial mass 3 times the volume of the target cluster. We run both methods multiple times, each time using a different node from the target cluster as the single seed node. The median, lower and upper quartiles of F1 scores are shown in Figure C.5.
For LH, observe that (i) whether the assumption holds or fails, localizing the solution slightly reduces the F1 score, and (ii) for both global and localized solutions, LH has much worse performance on the hypergraph where its assumption does not hold. On the other hand, HFD perfectly recovers the target clusters in both settings.

Figure C.5: Local clustering results under various settings for LH. The markers show the median; the error bars show the 25th and 75th percentiles. The left-most case aligns with the required theoretical framework for LH and, moreover, the strong assumption on the target cluster conductance is satisfied; the right-most case is what typically happens in practice when one applies the localized algorithm for LH and the assumption on the target cluster conductance does not hold. ACL is a heuristic method that applies to the star expansion of hypergraphs and has no performance guarantee. In practice, we observe that ACL and LH have similar performance.

C.2 Experiments using real-world data

C.2.1 Datasets and ground-truth clusters

We provide complete details on the real-world hypergraphs used in the experiments. The last three datasets are used only for additional experiments in the appendix.

Amazon-reviews [38, 43]. This is a hypergraph constructed from Amazon product review data, where each node represents a product. A set of products are connected by a hyperedge if they are reviewed by the same person. We use product category labels as ground-truth cluster identities. In total there are 29 product categories. Because we are mostly interested in local clustering, we consider all clusters consisting of fewer than 10,000 nodes.

Trivago-clicks [14]. The nodes in this hypergraph are accommodations/hotels. A set of nodes are connected by a hyperedge if a user performed a “click-out” action on them during the same browsing session, which means the user was forwarded to a partner site. We use geographical locations as ground-truth cluster identities. There are 160 such clusters. We consider all clusters in this dataset that consist of fewer than 1,000 nodes and have conductance less than 0.25.

Florida Bay food network [30]. Nodes in this hypergraph correspond to different species or organisms that live in the Bay, and hyperedges correspond to transformed network motifs of the original dataset. Each species is labelled according to its role in the food chain.

High-school-contact [35, 14]. Nodes in this hypergraph represent high school students. A group of people are connected by a hyperedge if they were all in proximity of one another at a given time, based on data from sensors worn by students. We use the classroom to which a student belongs as ground truth. In total there are 9 classrooms.

Microsoft-academic [41, 1]. The original co-authorship network is a subset of the Microsoft Academic Graph where nodes are authors and hyperedges correspond to a publication from those authors. We take the dual of the original hypergraph by converting hyperedges to nodes and nodes to hyperedges. After constructing the dual hypergraph, we remove all hyperedges containing just one node and keep the largest connected component. In the resulting hypergraph, each node represents a paper and is labelled by its publication venue. A set of papers are connected by a hyperedge if they share a common coauthor. We combine similar computer science conferences into four broader categories: Data (KDD, WWW, VLDB, SIGMOD), ML (ICML, NeurIPS), TCS (STOC, FOCS), CV (ICCV, CVPR).

Oil-trade network. This hypergraph is constructed using the 2017 international oil trade records from the UN Comtrade dataset. We adopt a modelling approach similar to Figure 1 in the main paper. Each node represents a country, and {v1,v2,v3,v4}\{v_{1},v_{2},v_{3},v_{4}\} form a hyperedge if the trade surplus from each of v1,v2v_{1},v_{2} to each of v3,v4v_{3},v_{4} exceeds 10 million USD (roughly the 80th percentile of country-wise oil export values). Therefore, two countries belong to the same hyperedge if they share 2\geq 2 important trading partners. We use this network for the node-ranking problem.

Table C.1 provides summary statistics about the hypergraphs. Table C.2 includes the statistics of all ground truth clusters that we used in the experiments.

Table C.1: Summary of real-world hypergraphs

Dataset Number of nodes Number of hyperedges Maximum hyperedge size Maximum node degree Median / Mean hyperedge size Median / Mean node degree
Amazon-reviews 2,268,231 4,285,363 9,350 28,973 8.0 / 17.1 11.0 / 32.2
Trivago-clicks 172,738 233,202 86 588 3.0 / 4.1 2.0 / 5.6
Florida-Bay 126 141,233 4 19,843 4.0 / 4.0 3,770.5 / 4,483.6
Microsoft-academic 44,216 22,464 187 21 3.0 / 5.4 2.0 / 2.7
High-school-contact 327 7,818 5 148 2.0 / 2.3 53.0 / 55.6
Oil-trade 229 100,639 4 16,394 4.0 / 4.0 175.0 / 1,757.9

Table C.2: Summary of ground-truth clusters used in the experiments
Dataset Cluster Size Volume Conductance
Amazon-reviews 1 - Amazon Fashion 31 3042 0.06
2 - All Beauty 85 4092 0.12
3 - Appliances 48 183 0.18
12 - Gift Cards 148 2965 0.13
15 - Industrial & Scientific 5334 72025 0.14
17 - Luxury Beauty 1581 28074 0.11
18 - Magazine Subs. 157 2302 0.13
24 - Prime Pantry 4970 131114 0.10
25 - Software 802 11884 0.14
Trivago-clicks KOR - South Korea 945 3696 0.24
ISL - Iceland 202 839 0.21
PRI - Puerto Rico 144 473 0.25
UA-43 - Crimea 200 1091 0.24
VNM - Vietnam 832 2322 0.24
HKG - Hong Kong 536 4606 0.24
MLT - Malta 157 495 0.24
GTM - Guatemala 199 652 0.24
UKR - Ukraine 264 648 0.24
EST - Estonia 158 850 0.23
Florida- Bay Producers 17 10781 0.70
Low-level consumers 35 173311 0.58
High-level consumers 70 375807 0.54
Microsoft- academic Data 15817 45060 0.06
ML 10265 26765 0.16
TCS 4159 10065 0.08
CV 13974 38395 0.08
High-school-contact Class 1 36 1773 0.25
Class 2 34 1947 0.29
Class 3 40 2987 0.20
Class 4 29 913 0.41
Class 5 38 2271 0.26
Class 6 34 1320 0.26
Class 7 44 2951 0.16
Class 8 39 2204 0.19
Class 9 33 1826 0.25

C.2.2 Methods and parameter setting

HFD. We use σ=0.0001\sigma=0.0001 for all the experiments. We set the total amount of initial mass Δ1\|\Delta\|_{1} to a constant factor tt times the volume of the target cluster. For Amazon-reviews, on the smaller clusters 1, 2, 3, 12, 18 we used t=200t=200; on the larger clusters 15, 17, 24, 25 we used t=50t=50. For Trivago-clicks, High-school-contact and Microsoft-academic, we used t=3t=3. For the Florida Bay food network, we used t=20,10,5t=20,10,5 for clusters 1, 2, 3, respectively. In all experiments, the choice of tt ensures that the diffusion process covers some part of the target cluster and incurs a high cost in the objective function. For the single seed node setting, we simply set the initial mass on the seed node to Δ1\|\Delta\|_{1}. For the multiple seed nodes setting, where we are given a seed set SS, for each vSv\in S we set the initial mass on vv to dvΔ1/vol(S)d_{v}\|\Delta\|_{1}/\textnormal{vol}(S).
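As a sketch, the initialization just described amounts to the following Julia routine; the names d, seeds and vol_target are illustrative, not our exact implementation:

# ‖Δ‖₁ = t · vol(target); a single seed receives all mass, while a seed set S
# receives mass proportional to node degrees: Δ_v = d_v · ‖Δ‖₁ / vol(S).
function initial_mass(d::Vector{Float64}, seeds::Vector{Int},
                      vol_target::Float64, t::Float64)
    Δ = zeros(length(d))
    total = t * vol_target
    volS = sum(d[seeds])
    for v in seeds
        Δ[v] = length(seeds) == 1 ? total : d[v] * total / volS
    end
    return Δ
end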

LH, ACL. We used the parameters suggested by the authors [33]. For both *-LH-2.0 and *-LH-1.4, we set γ=0.1\gamma=0.1, ρ=0.5\rho=0.5, and κ=cr\kappa=c\cdot r, where rr is the ratio between the number of seed nodes and the size of the target cluster and cc is a tuning constant. For Amazon-reviews, we set c=0.025c=0.025 as suggested in [33]. For Microsoft-academic, Trivago-clicks, and Florida-Bay we also used c=0.025c=0.025 because it produces good results. For High-school-contact we selected c=0.25c=0.25 after some tuning to make sure both *-LH-2.0 and *-LH-1.4 have good results. We set the parameters for ACL in exactly the same way as in [33]. We set δ=1\delta=1 for U-LH-* and δ=maxeE|e|\delta=\max_{e\in E}|e| for C-LH-*.

C.2.3 Additional experiments

Multiple seed nodes. We conduct additional experiments using multiple seed nodes for the Amazon-reviews and Trivago-clicks datasets. For each target cluster, we randomly select 1% of the nodes from that cluster as seed nodes, and we enforce that at least 5 nodes are selected as seeds; for example, if a cluster consists of only 100 nodes, we still select 5 nodes to form a seed set (see the sketch after this paragraph). We run 30 trials for each cluster and report the median conductance and F1 score of the output clusters. The results are shown in Table C.3 and Table C.4. For the multiple seed nodes setting, the results of U-LH-1.4, U-LH-2.0 and ACL on Amazon-reviews align with the ones reported in [33]: we reproduced almost identical numbers under the same setting, with only a few small differences due to randomness in seed node selection. In general, using more seed nodes improves the performance of all methods in terms of both conductance and F1 score. For Amazon-reviews, the output clusters of HFD always have the lowest conductance, even though in some cases low conductance does not align well with the given ground truth, and hence the lowest conductance does not lead to the highest F1 score. Similarly, for Trivago-clicks, both U-HFD and C-HFD consistently find the lowest-conductance clusters among all methods, which in general (but not always) leads to a higher F1 score. Note that, if a method uses the unit cut-cost (resp. the cardinality-based cut-cost), then we compute the conductance of the output cluster using the unit cut-cost (resp. the cardinality-based cut-cost). Therefore, depending on the specific cut-cost, the conductances in Table C.4 may have different scales. We highlight the lowest conductance for the two cut-costs separately.
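For reference, the seed-selection rule amounts to the following illustrative Julia helper (the clusters used in these experiments always contain more than 5 nodes, so the slice is well defined):

using Random

# 1% of the cluster as seeds, but never fewer than 5 nodes.
pick_seeds(cluster::Vector{Int}; rng = Random.default_rng()) =
    shuffle(rng, cluster)[1:max(5, ceil(Int, 0.01 * length(cluster)))]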

Table C.3: Complete local clustering results for Amazon-reviews network
Cluster
Metric Seed Method 1 2 3 12 15 17 18 24 25
Conductance Single U-HFD 0.17 0.11 0.12 0.16 0.36 0.25 0.17 0.14 0.28
U-LH-2.0 0.42 0.50 0.25 0.44 0.74 0.44 0.57 0.58 0.61
U-LH-1.4 0.33 0.44 0.25 0.36 0.81 0.40 0.51 0.54 0.59
ACL 0.42 0.50 0.25 0.54 0.77 0.52 0.63 0.68 0.65
Multiple U-HFD 0.05 0.10 0.12 0.13 0.20 0.16 0.14 0.11 0.32
U-LH-2.0 0.05 0.15 0.15 0.21 0.45 0.45 0.26 0.18 0.53
U-LH-1.4 0.05 0.13 0.15 0.15 0.35 0.33 0.19 0.14 0.47
ACL 0.05 0.27 0.16 0.27 0.56 0.53 0.33 0.30 0.59
F1 score Single U-HFD 0.45 0.09 0.65 0.92 0.04 0.10 0.80 0.81 0.09
U-LH-2.0 0.23 0.07 0.23 0.29 0.05 0.06 0.21 0.28 0.05
U-LH-1.4 0.23 0.09 0.35 0.40 0.00 0.07 0.31 0.35 0.06
ACL 0.23 0.07 0.22 0.25 0.04 0.05 0.17 0.20 0.04
Multiple U-HFD 0.49 0.50 0.69 0.98 0.19 0.36 0.91 0.89 0.33
U-LH-2.0 0.59 0.42 0.73 0.77 0.22 0.25 0.65 0.62 0.17
U-LH-1.4 0.52 0.45 0.73 0.90 0.27 0.29 0.79 0.77 0.20
ACL 0.59 0.25 0.70 0.64 0.20 0.19 0.51 0.49 0.14
Table C.4: Complete local clustering results for Trivago-clicks network
Cluster
Metric Seed Method KOR ISL PRI UA-43 VNM HKG MLT GTM UKR EST
Conductance Single U-HFD 0.010 0.023 0.014 0.011 0.018 0.017 0.010 0.007 0.016 0.012
U-LH-2.0 0.020 0.042 0.027 0.027 0.037 0.035 0.031 0.035 0.032 0.019
U-LH-1.4 0.036 0.069 0.047 0.039 0.060 0.052 0.040 0.045 0.065 0.036
ACL 0.027 0.050 0.034 0.031 0.042 0.043 0.047 0.039 0.043 0.026
\cdashline3-13 C-HFD 0.007 0.016 0.007 0.005 0.009 0.011 0.007 0.003 0.010 0.009
C-LH-2.0 0.022 0.066 0.030 0.030 0.035 0.035 0.029 0.028 0.029 0.029
C-LH-1.4 0.043 0.095 0.042 0.048 0.071 0.059 0.053 0.047 0.075 0.046
Multiple U-HFD 0.009 0.023 0.011 0.010 0.014 0.017 0.010 0.008 0.017 0.012
U-LH-2.0 0.023 0.034 0.018 0.021 0.054 0.030 0.021 0.022 0.041 0.018
U-LH-1.4 0.048 0.045 0.038 0.032 0.084 0.051 0.049 0.049 0.085 0.024
ACL 0.030 0.037 0.018 0.024 0.064 0.033 0.021 0.024 0.045 0.020
\cdashline3-13 C-HFD 0.006 0.016 0.006 0.005 0.006 0.011 0.007 0.003 0.011 0.009
C-LH-2.0 0.024 0.062 0.021 0.021 0.047 0.034 0.023 0.017 0.036 0.029
C-LH-1.4 0.054 0.067 0.033 0.037 0.094 0.057 0.053 0.044 0.094 0.032
F1 score Single U-HFD 0.75 0.99 0.89 0.85 0.28 0.82 0.98 0.94 0.60 0.94
U-LH-2.0 0.70 0.86 0.79 0.70 0.24 0.92 0.88 0.82 0.50 0.90
U-LH-1.4 0.69 0.84 0.80 0.75 0.28 0.87 0.92 0.83 0.47 0.90
ACL 0.65 0.84 0.75 0.68 0.23 0.90 0.83 0.69 0.50 0.88
C-HFD 0.76 0.99 0.95 0.94 0.32 0.80 0.98 0.97 0.68 0.94
C-LH-2.0 0.73 0.90 0.84 0.78 0.27 0.94 0.96 0.88 0.51 0.83
C-LH-1.4 0.71 0.88 0.84 0.78 0.27 0.88 0.93 0.85 0.50 0.85
Multiple U-HFD 0.87 0.99 0.97 0.92 0.55 0.82 0.98 0.97 0.87 0.94
U-LH-2.0 0.83 0.91 0.92 0.84 0.71 0.93 0.95 0.93 0.86 0.92
U-LH-1.4 0.78 0.84 0.83 0.79 0.74 0.85 0.85 0.84 0.75 0.87
ACL 0.81 0.89 0.91 0.85 0.68 0.93 0.96 0.91 0.83 0.90
C-HFD 0.86 0.99 0.97 0.96 0.32 0.80 0.98 0.97 0.69 0.94
C-LH-2.0 0.86 0.94 0.94 0.87 0.76 0.94 0.97 0.94 0.88 0.91
C-LH-1.4 0.83 0.89 0.90 0.83 0.67 0.89 0.92 0.85 0.77 0.89

Additional datasets, local clustering using unit and cardinality cut-costs. Table C.5 and Table C.6 show local clustering results on the High-school-contact and Microsoft-academic networks, respectively. We use the single seed node setting: we run the methods from each node in a target cluster and report the median conductance and F1 score. We cap the maximum number of runs at 500. Similar to the results on the other datasets, the output clusters of HFD always have the lowest conductance, which leads to the highest F1 score in most cases. We omit cardinality-based methods for Microsoft-academic because their results are very similar to those for the unit cut-cost setting.

Table C.5: Local clustering results for High-school-contact network
Cluster
Metric Method Class 1 Class 2 Class 3 Class 4 Class 5 Class 6 Class 7 Class 8 Class 9
Conductance U-HFD 0.25 0.29 0.13 0.42 0.21 0.26 0.16 0.19 0.25
U-LH-2.0 0.31 0.36 0.23 0.63 0.33 0.36 0.18 0.21 0.30
U-LH-1.4 0.29 0.32 0.21 0.54 0.29 0.37 0.16 0.22 0.29
ACL 0.62 0.64 0.61 0.98 0.61 0.60 0.59 0.55 0.59
\cdashline3-11 C-HFD 0.25 0.28 0.20 0.41 0.24 0.26 0.16 0.19 0.25
C-LH-2.0 0.27 0.33 0.20 0.57 0.29 0.32 0.16 0.20 0.27
C-LH-1.4 0.28 0.32 0.20 0.52 0.28 0.33 0.16 0.21 0.28
F1 score U-HFD 0.99 1.00 0.59 0.96 0.73 1.00 0.88 1.00 0.99
U-LH-2.0 0.91 0.83 0.93 0.66 0.84 0.88 0.96 0.96 0.90
U-LH-1.4 0.93 0.78 0.90 0.78 0.70 0.90 0.97 0.95 0.88
ACL 0.72 0.73 0.73 0.06 0.70 0.76 0.77 0.78 0.76
C-HFD 0.99 1.00 1.00 0.96 0.80 1.00 1.00 1.00 0.99
C-LH-2.0 0.93 0.82 0.92 0.74 0.84 0.93 0.97 0.97 0.91
C-LH-1.4 0.94 0.74 0.69 0.84 0.76 0.94 0.96 0.96 0.85
Table C.6: Local clustering results for Microsoft-academic network
Cluster
Metric Method Data ML TCS CV
Cond U-HFD 0.03 0.06 0.06 0.03
U-LH-2.0 0.07 0.09 0.10 0.07
U-LH-1.4 0.07 0.08 0.09 0.07
ACL 0.08 0.11 0.11 0.09
F1 score U-HFD 0.78 0.54 0.86 0.73
U-LH-2.0 0.67 0.46 0.71 0.61
U-LH-1.4 0.65 0.46 0.59 0.59
ACL 0.64 0.43 0.70 0.57

Additional dataset, node ranking using general submodular cut-cost. We provide another compelling use case of general submodular cut-costs. We consider the node ranking problem on the Oil-trade network. Our goal is to find the countries most related to a queried country based on the trade-network structure. We use the hypergraph modelling shown in Figure 1 in the main paper. We compare HFD using unit (U-HFD, γ1=γ2=1\gamma_{1}=\gamma_{2}=1), cardinality-based (C-HFD, γ1=1/2\gamma_{1}=1/2 and γ2=1\gamma_{2}=1) and submodular (S-HFD, γ1=1/2\gamma_{1}=1/2 and γ2=0\gamma_{2}=0) cut-costs. Table C.7 shows the top-2 ranking results. In this example, we use Iran as the seed node and rank the other countries according to the ordering of the dual variables returned by HFD. In 2017, the US imposed strict sanctions on Iran. Nevertheless, Bangladesh (generally regarded as an American ally) is among the top two ranked countries under the unit and cardinality-based cut-costs, which is hard to justify. On the other hand, S-HFD ranks Turkmenistan and Iraq as the top two. Interested readers can easily verify that these countries share strong economic or historical ties with Iran.

Additional method: pp-norm HFD. We tried HFD with the unit cut-cost and p=4p=4 (U-HFD-4.0). In practice, however, we did not observe that a larger p>2p>2 necessarily leads to better clustering results. We show a sample result of U-HFD-4.0 for Amazon-reviews in Table C.8. Notice that the performances of U-HFD-2.0 (p=2p=2) and U-HFD-4.0 are very similar.

Additional method: LH ++ flow improve. We tried a flow-improve method for hypergraphs [43], applied to the output of U-LH-2.0. The method is slow in our experiments, so we only tried it on a few small instances. The results for the Florida Bay food network are shown in Table C.9. In general, we find that applying the flow-improve method does not lead to consistent performance improvements.

Table C.7: Top-2 node-ranking results for Oil-trade network
Method Query: Iran
U-HFD Kenya, Bangladesh
C-HFD Bangladesh, United Rep. of Tanzania
S-HFD Turkmenistan, Iraq
Table C.8: Local clustering results for Amazon-reviews network using pp-norm HFD
Cluster
Metric Seed Method 1 2 3 12 15 17 18 24 25
Cond Single U-HFD-2.0 0.17 0.11 0.12 0.16 0.36 0.25 0.17 0.14 0.28
U-HFD-4.0 0.17 0.10 0.12 0.16 0.35 0.26 0.17 0.14 0.38
Multiple U-HFD-2.0 0.05 0.10 0.12 0.13 0.20 0.16 0.14 0.11 0.32
U-HFD-4.0 0.05 0.10 0.12 0.14 0.20 0.16 0.14 0.12 0.32
F1 score Single U-HFD-2.0 0.45 0.09 0.65 0.92 0.04 0.10 0.80 0.81 0.09
U-HFD-4.0 0.48 0.07 0.65 0.92 0.04 0.09 0.80 0.82 0.10
Multiple U-HFD-2.0 0.49 0.50 0.69 0.98 0.19 0.36 0.91 0.89 0.33
U-HFD-4.0 0.49 0.50 0.69 0.98 0.19 0.36 0.91 0.88 0.35
Table C.9: Local clustering results for the food network using unit cut-costs
Cluster
Metric Method Producers Low-level consumers High-level consumers
Conductance U-HFD 0.49 0.36 0.35
U-LH-2.0 0.51 0.39 0.39
U-LH-2.0 ++ flow 0.52 0.39 0.40
U-LH-1.4 0.49 0.39 0.41
ACL 0.52 0.39 0.40
F1 score U-HFD 0.69 0.47 0.64
U-LH-2.0 0.69 0.45 0.57
U-LH-2.0 ++ flow 0.69 0.45 0.57
U-LH-1.4 0.69 0.45 0.58
ACL 0.69 0.44 0.57

C.3 Computing platform and implementation detail

We implemented the AM algorithm [5] given in Algorithm B.1 in Julia. The code was run on a personal laptop with 32GB RAM and a 2.9 GHz 6-Core Intel Core i9; no GPU was used for computation. In the rest of this section, we discuss the implementation details of how we solve the nontrivial sub-problem in Algorithm B.1 to obtain the update (ϕ(k+1),r(k+1))(\phi^{(k+1)},r^{(k+1)}).

For the unit cut-cost case, we use an exact projection algorithm [32] to obtain the update (ϕ(k+1),r(k+1))(\phi^{(k+1)},r^{(k+1)}); algorithmic details are provided in Algorithm B.2. For cardinality-based or general submodular cut-costs, a conic Fujishige-Wolfe minimum norm algorithm [32] can be adopted to efficiently compute (ϕ(k+1),r(k+1))(\phi^{(k+1)},r^{(k+1)}). Our implementation uses simpler alternative methods. For the cardinality cut-cost, we use a projected subgradient method that works on a related dual problem to obtain the primal update (ϕ(k+1),r(k+1))(\phi^{(k+1)},r^{(k+1)}). The subgradient method is easy to implement, requires little computational overhead, and works well in practice for this sub-problem. For the specialized submodular cut-cost shown in Figure 1, since the hyperedge consists of only 4 nodes and has a special structure, we simply perform an exhaustive search that allows us to compute (ϕ(k+1),r(k+1))(\phi^{(k+1)},r^{(k+1)}) exactly using a constant number of vector-vector additions and multiplications. We provide details below.

Recall that the sub-problem to compute (ϕ(k+1),r(k+1))(\phi^{(k+1)},r^{(k+1)}) decomposes into a sequence of separate problems indexed by eEe\in E (cf. (B.7), in the following we assume p=2p=2 for simplicity):

minϕe0,reϕeBe12ϕe2+12σsere22.\min_{\phi_{e}\geq 0,r_{e}\in\phi_{e}B_{e}}\frac{1}{2}\phi_{e}^{2}+\frac{1}{2\sigma}\|s_{e}-r_{e}\|_{2}^{2}. (C.1)

The dual problem of (C.1) is written as (cf. (B.8), here we have p=q=2p=q=2)

minye12fe(ye)2+σ2ye22seTye.\min_{y_{e}}\frac{1}{2}f_{e}(y_{e})^{2}+\frac{\sigma}{2}\|y_{e}\|_{2}^{2}-s_{e}^{T}y_{e}. (C.2)

Let (ϕe,re)(\phi_{e}^{*},r_{e}^{*}) and yey_{e}^{*} denote primal and dual optimal solutions for (C.1) and (C.2), respectively. Then we have that

re+σye=seandϕe2=reTye.r_{e}^{*}+\sigma y_{e}^{*}=s_{e}\quad\mbox{and}\quad{\phi_{e}^{*}}^{2}={r_{e}^{*}}^{T}y_{e}^{*}.

The dual problem (C.2) can be derived in the same way that we derived the primal-dual HFD formulations; moreover, the above relations between ϕe,\phi_{e}^{*}, rer_{e}^{*} and yey_{e}^{*} follow immediately from the primal-dual derivation, the dual optimality condition and simple algebraic work. Therefore, in order to find an optimal solution (ϕe,re)(\phi_{e}^{*},r_{e}^{*}) for the primal problem (C.1), it suffices to find an optimal solution yey_{e}^{*} for the dual problem (C.2) and then recover (ϕe,re)(\phi_{e}^{*},r_{e}^{*}). Now, since 1Tre=0\mathit{1}^{T}r_{e}^{*}=0, we know that σ1Tye=1Tse\sigma\mathit{1}^{T}y_{e}^{*}=\mathit{1}^{T}s_{e}, i.e., yey_{e}^{*} lies in the hyperplane :={ye|σ1Tye=1Tse}\mathcal{H}:=\{y_{e}|\sigma\mathit{1}^{T}y_{e}=\mathit{1}^{T}s_{e}\}. Let hh denote the objective function of the dual problem (C.2); we compute yey_{e}^{*} using the projected subgradient method:

ye(k+1):=P(ye(k)1kg(k)g(k)2),y_{e}^{(k+1)}:=P_{\mathcal{H}}\left(y_{e}^{(k)}-\frac{1}{k}\frac{g^{(k)}}{\|g^{(k)}\|_{2}}\right),

where g(k)h(ye(k))g^{(k)}\in\partial h(y_{e}^{(k)}) is a subgradient at ye(k)y_{e}^{(k)}, and P()P_{\mathcal{H}}(\cdot) denotes the projection onto the hyperplane \mathcal{H}. We add the projection step so that, when we stop the subgradient method after KK iterations to get yey~e:=ye(K)y_{e}^{*}\approx\tilde{y}_{e}:=y_{e}^{(K)} and approximately recover rer_{e}^{*} as rer~e:=seσy~er_{e}^{*}\approx\tilde{r}_{e}:=s_{e}-\sigma\tilde{y}_{e}, the resulting r~e\tilde{r}_{e} is still a proper flow routing, i.e., 1Tr~e=0\mathit{1}^{T}\tilde{r}_{e}=0, and hence it is possible to have r~eϕ~eBe\tilde{r}_{e}\in\tilde{\phi}_{e}B_{e} for some ϕ~e\tilde{\phi}_{e}. In other words, the projection step is crucial because it permits the use of a sub-optimal dual solution y~e\tilde{y}_{e} to obtain a sub-optimal but feasible primal solution r~e\tilde{r}_{e}.
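A minimal Julia sketch of this projected subgradient loop is given below. It assumes a user-supplied oracle lovasz(y) that returns fe(y)f_{e}(y) together with a maximizer ρargmaxρBeρTy\rho\in\operatorname*{argmax}_{\rho\in B_{e}}\rho^{T}y (for submodular wew_{e} this is the standard greedy algorithm); all names are illustrative, not our exact implementation.

using LinearAlgebra

# Solve the dual (C.2) by projected subgradient, then recover (ϕ̃_e, r̃_e).
function dual_subgradient(s::Vector{Float64}, σ::Float64, lovasz; K::Int = 500)
    n = length(s)
    proj(y) = y .- (sum(y) - sum(s) / σ) / n    # projection onto σ1ᵀy = 1ᵀs_e
    y = proj(s ./ σ)                            # feasible starting point
    for k in 1:K
        f, ρ = lovasz(y)                        # f = f_e(y), ρ ∈ argmax_{ρ∈B_e} ρᵀy
        g = f .* ρ .+ σ .* y .- s               # subgradient of h at y
        norm(g) == 0 && break
        y = proj(y .- g ./ (k * norm(g)))       # step size 1/k, normalized direction
    end
    r = s .- σ .* y                             # r̃_e := s_e − σỹ_e, so 1ᵀr̃_e = 0
    ϕ = sqrt(max(dot(r, y), 0.0))               # ϕ_e² = r_eᵀy_e at optimality
    return ϕ, r, y
end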

For the cardinality cut-cost, our implementation uses the projected subgradient method described above to solve the sub-problem in Algorithm B.1 for ϕe\phi_{e} and rer_{e}. In what follows we describe how we handle the specialized submodular cut-cost.

Consider e={v1,v2,v3,v4}e=\{v_{1},v_{2},v_{3},v_{4}\} with associated submodular cut-cost wew_{e} such that we({vi})=1/2w_{e}(\{v_{i}\})=1/2 for i=1,2,3,4i=1,2,3,4, we({v1,v2})=0w_{e}(\{v_{1},v_{2}\})=0, we({v1,v3})=we({v1,v4})=1w_{e}(\{v_{1},v_{3}\})=w_{e}(\{v_{1},v_{4}\})=1, and we(S)=we(eS)w_{e}(S)=w_{e}(e\setminus S) for any SeS\subseteq e. Let BeB_{e} be the base polytope of wew_{e}. The sub-problem for this hyperedge is given in (C.1). Suppose (ϕe,re)(\phi_{e}^{*},r_{e}^{*}) is optimal for (C.1), and re=ϕeρer_{e}^{*}=\phi_{e}^{*}\rho_{e}^{*} for some ρeBe\rho_{e}^{*}\in B_{e}. If ϕe>0\phi_{e}^{*}>0, then ϕe=seTρeσ+ρe22\phi_{e}^{*}=\frac{s_{e}^{T}\rho_{e}^{*}}{\sigma+\|\rho_{e}^{*}\|_{2}^{2}}. To see this, substitute re=ϕeρer_{e}=\phi_{e}\rho_{e}^{*} into (C.1) and optimize over ϕe\phi_{e} only; the relation then follows from the first-order optimality condition and the assumption that ϕe>0\phi_{e}^{*}>0. On the other hand, if ϕe=0\phi_{e}^{*}=0, then we simply have re=0r_{e}^{*}=0. Therefore, in order to compute (ϕe,re)(\phi_{e}^{*},r_{e}^{*}) when ϕe>0\phi_{e}^{*}>0, it suffices to find ρe\rho_{e}^{*}. To do so, we look at the dual problem (C.2). Let yey_{e}^{*} be an optimal dual solution; then ρeargmaxρeBeρeTye\rho_{e}^{*}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}. The subsequent claims carry out a case analysis to determine all possible nontrivial candidates for ρe\rho_{e}^{*}.

Claim C.1.

If se,v1=se,v2s_{e,v_{1}}=s_{e,v_{2}}, then ρe,v1=ρe,v2=0\rho^{*}_{e,v_{1}}=\rho^{*}_{e,v_{2}}=0; if se,v3=se,v4s_{e,v_{3}}=s_{e,v_{4}}, then ρe,v3=ρe,v4=0\rho^{*}_{e,v_{3}}=\rho^{*}_{e,v_{4}}=0.

Proof.

The optimality condition of the dual problem (C.2) is that, for some ρ^eargmaxρeBeρeTye\hat{\rho}_{e}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*},

(ρ^eTye)ρ^e+σye=se.(\hat{\rho}_{e}^{T}y_{e}^{*})\hat{\rho}_{e}+\sigma y_{e}^{*}=s_{e}. (C.3)

Suppose se,v1=se,v2s_{e,v_{1}}=s_{e,v_{2}}, then we must have ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}. Otherwise, say ye,v1>ye,v2y_{e,v_{1}}^{*}>y_{e,v_{2}}^{*}, then we know that ρ^e,v1=1/2>1/2=ρ^e,v2\hat{\rho}_{e,v_{1}}=1/2>-1/2=\hat{\rho}_{e,v_{2}}, which follows from applying the greedy algorithm [4] to find ρ^e\hat{\rho}_{e} using the order of indices in yey_{e}^{*}. But then according to the optimality condition (C.3), we have

se,v1=(ρ^eTye)ρ^e,v1+σye,v1>(ρ^eTye)ρ^e,v2+σye,v2=se,v2,s_{e,v_{1}}=(\hat{\rho}_{e}^{T}y_{e}^{*})\hat{\rho}_{e,v_{1}}+\sigma y_{e,v_{1}}^{*}>(\hat{\rho}_{e}^{T}y_{e}^{*})\hat{\rho}_{e,v_{2}}+\sigma y_{e,v_{2}}^{*}=s_{e,v_{2}},

which contradicts our assumption that se,v1=se,v2s_{e,v_{1}}=s_{e,v_{2}}. Similarly, ye,v1<ye,v2y_{e,v_{1}}^{*}<y_{e,v_{2}}^{*} is not possible, either. Now, because ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}, by the optimality condition (C.3), we must also have ρ^e,v1=ρ^e,v2\hat{\rho}_{e,v_{1}}=\hat{\rho}_{e,v_{2}}. Finally, because ρ^eBe\hat{\rho}_{e}\in B_{e}, we know that ρ^e,v1+ρ^e,v20\hat{\rho}_{e,v_{1}}+\hat{\rho}_{e,v_{2}}\leq 0 and ρ^e,v1+ρ^e,v2=(ρ^e,v3+ρ^e,v4)we({v3,v4})=0\hat{\rho}_{e,v_{1}}+\hat{\rho}_{e,v_{2}}=-(\hat{\rho}_{e,v_{3}}+\hat{\rho}_{e,v_{4}})\geq-w_{e}(\{v_{3},v_{4}\})=0, so ρ^e,v1+ρ^e,v2=0\hat{\rho}_{e,v_{1}}+\hat{\rho}_{e,v_{2}}=0. Therefore, ρ^e,v1=ρ^e,v2=0\hat{\rho}_{e,v_{1}}=\hat{\rho}_{e,v_{2}}=0. Since ρ^e\hat{\rho}_{e} was chosen arbitrarily from the set argmaxρeBeρeTye\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}, and ρeargmaxρeBeρeTye\rho^{*}_{e}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}, we have that ρe,v1=ρe,v2=0\rho^{*}_{e,v_{1}}=\rho^{*}_{e,v_{2}}=0 as required. The other claim, on nodes v3v_{3} and v4v_{4}, follows in the same way. ∎

Claim C.2.

If se,v1se,v2s_{e,v_{1}}\neq s_{e,v_{2}} and se,v3=se,v4s_{e,v_{3}}=s_{e,v_{4}}, then ρe,v1,ρe,v2{1/2,1/2}\rho^{*}_{e,v_{1}},\rho^{*}_{e,v_{2}}\in\{1/2,-1/2\} and ρe,v3=ρe,v4=0\rho^{*}_{e,v_{3}}=\rho^{*}_{e,v_{4}}=0; if se,v1=se,v2s_{e,v_{1}}=s_{e,v_{2}} and se,v3se,v4s_{e,v_{3}}\neq s_{e,v_{4}}, then ρe,v1=ρe,v2=0\rho^{*}_{e,v_{1}}=\rho^{*}_{e,v_{2}}=0 and ρe,v3,ρe,v4{1/2,1/2}\rho^{*}_{e,v_{3}},\rho^{*}_{e,v_{4}}\in\{1/2,-1/2\}.

Proof.

We show the first case; the second case follows by symmetry. Let ρ^eargmaxρeBeρeTye\hat{\rho}_{e}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}. Suppose se,v1se,v2s_{e,v_{1}}\neq s_{e,v_{2}} and se,v3=se,v4s_{e,v_{3}}=s_{e,v_{4}}. Then by Claim C.1 we have ρ^e,v3=ρ^e,v4=0\hat{\rho}_{e,v_{3}}=\hat{\rho}_{e,v_{4}}=0. Assume without loss of generality that se,v1>se,v2s_{e,v_{1}}>s_{e,v_{2}}. If ye,v1<ye,v2y_{e,v_{1}}^{*}<y_{e,v_{2}}^{*}, then applying the greedy algorithm we get ρ^e,v1=1/2<1/2=ρ^e,v2\hat{\rho}_{e,v_{1}}=-1/2<1/2=\hat{\rho}_{e,v_{2}}. But this contradicts the optimality condition (C.3). Therefore we must have ye,v1ye,v2y_{e,v_{1}}^{*}\geq y_{e,v_{2}}^{*}. There are two cases. If ye,v1>ye,v2y_{e,v_{1}}^{*}>y_{e,v_{2}}^{*}, then applying the greedy algorithm we get ρ^e,v1=1/2\hat{\rho}_{e,v_{1}}=1/2 and ρ^e,v2=1/2\hat{\rho}_{e,v_{2}}=-1/2. If ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}, then because ρ^e,v1+ρ^e,v2=0\hat{\rho}_{e,v_{1}}+\hat{\rho}_{e,v_{2}}=0 (see the proof of Claim C.1 for an argument) and ρ^e,v3=ρ^e,v4=0\hat{\rho}_{e,v_{3}}=\hat{\rho}_{e,v_{4}}=0, we have ρ^eTye=0\hat{\rho}_{e}^{T}y_{e}^{*}=0. But this contradicts the optimality condition (C.3), because se,v1>se,v2s_{e,v_{1}}>s_{e,v_{2}} and ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}. Therefore we cannot have ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}. Since our choice of ρ^eargmaxρeBeρeTye\hat{\rho}_{e}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*} was arbitrary, and ρeargmaxρeBeρeTye\rho_{e}^{*}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}, we know that ρe\rho_{e}^{*} must satisfy the same properties as ρ^e\hat{\rho}_{e}. ∎

Claim C.3.

If se,v1se,v2s_{e,v_{1}}\neq s_{e,v_{2}} and se,v3se,v4s_{e,v_{3}}\neq s_{e,v_{4}}, then ρe,v1,ρe,v2{±1/2,±a}\rho^{*}_{e,v_{1}},\rho^{*}_{e,v_{2}}\in\{\pm 1/2,\pm a\} and ρe,v3,ρe,v4{±1/2,±b}\rho^{*}_{e,v_{3}},\rho^{*}_{e,v_{4}}\in\{\pm 1/2,\pm b\}, where a=(12+σ)(se,v1se,v2)/(se,v3se,v4)a=(\frac{1}{2}+\sigma)(s_{e,v_{1}}-s_{e,v_{2}})/(s_{e,v_{3}}-s_{e,v_{4}}) and b=(12+σ)(se,v3se,v4)/(se,v1se,v2)b=(\frac{1}{2}+\sigma)(s_{e,v_{3}}-s_{e,v_{4}})/(s_{e,v_{1}}-s_{e,v_{2}}).

Proof.

Let us assume without loss of generality that se,v1>se,v2s_{e,v_{1}}>s_{e,v_{2}} and se,v3>se,v4s_{e,v_{3}}>s_{e,v_{4}}. Let ρ^eargmaxρeBeρeTye\hat{\rho}_{e}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}. We have that ye,v1ye,v2y_{e,v_{1}}^{*}\geq y_{e,v_{2}}^{*} and ye,v3ye,v4y_{e,v_{3}}^{*}\geq y_{e,v_{4}}^{*} (see the proof of Claim C.2 for an argument for this). There are four cases and we analyze them one by one in the following.

Case 1. If ye,v1>ye,v2y_{e,v_{1}}^{*}>y_{e,v_{2}}^{*} and ye,v3>ye,v4y_{e,v_{3}}^{*}>y_{e,v_{4}}^{*}, then we have ρ^e,v1=ρ^e,v3=1/2\hat{\rho}_{e,v_{1}}=\hat{\rho}_{e,v_{3}}=1/2 and ρ^e,v2=ρ^e,v4=1/2\hat{\rho}_{e,v_{2}}=\hat{\rho}_{e,v_{4}}=-1/2.

Case 2. If ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*} and ye,v3=ye,v4y_{e,v_{3}}^{*}=y_{e,v_{4}}^{*}, then ρ^eTye=0\hat{\rho}_{e}^{T}y_{e}^{*}=0 and hence the optimality condition (C.3) cannot be satisfied. This leads to a contradiction.

Case 3. Suppose that ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*} and ye,v3>ye,v4y_{e,v_{3}}^{*}>y_{e,v_{4}}^{*}. Then, according to the optimality condition (C.3), because se,v1>se,v2s_{e,v_{1}}>s_{e,v_{2}} and ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}, we must have ρ^e,v1>ρ^e,v2\hat{\rho}_{e,v_{1}}>\hat{\rho}_{e,v_{2}}. Moreover, because ρ^e,v1+ρ^e,v2=0\hat{\rho}_{e,v_{1}}+\hat{\rho}_{e,v_{2}}=0, we know that ρ^e,v1=a=ρ^e,v2\hat{\rho}_{e,v_{1}}=a=-\hat{\rho}_{e,v_{2}} for some a>0a>0. We also know that ρ^e,v3=1/2\hat{\rho}_{e,v_{3}}=1/2 and ρ^e,v4=1/2\hat{\rho}_{e,v_{4}}=-1/2 since ye,v3>ye,v4y_{e,v_{3}}^{*}>y_{e,v_{4}}^{*}. Substituting the primal-dual relation ϕe=ρ^eTye\phi_{e}^{*}=\hat{\rho}_{e}^{T}y_{e}^{*} into (C.3), we have

ϕeρ^e,v1+σye,v1=se,v1andϕeρ^e,v2+σye,v2=se,v2.\phi_{e}^{*}\hat{\rho}_{e,v_{1}}+\sigma y_{e,v_{1}}^{*}=s_{e,v_{1}}~{}~{}\mbox{and}~{}~{}\phi_{e}^{*}\hat{\rho}_{e,v_{2}}+\sigma y_{e,v_{2}}^{*}=s_{e,v_{2}}.

Because ye,v1=ye,v2y_{e,v_{1}}^{*}=y_{e,v_{2}}^{*}, we get that

ϕe(ρ^e,v1ρ^e,v2)=se,v1se,v2,\phi_{e}^{*}(\hat{\rho}_{e,v_{1}}-\hat{\rho}_{e,v_{2}})=s_{e,v_{1}}-s_{e,v_{2}},

and hence

ϕe=se,v1se,v2ρ^e,v1ρ^e,v2=se,v1se,v22a.\phi_{e}^{*}=\frac{s_{e,v_{1}}-s_{e,v_{2}}}{\hat{\rho}_{e,v_{1}}-\hat{\rho}_{e,v_{2}}}=\frac{s_{e,v_{1}}-s_{e,v_{2}}}{2a}. (C.4)

Because ρ^argmaxρeBeρeTye\hat{\rho}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*} was arbitrary, and ρeargmaxρeBeρeTye\rho_{e}^{*}\in\operatorname*{argmax}_{\rho_{e}\in B_{e}}\rho_{e}^{T}y_{e}^{*}, we know that ρe,v1=a=ρe,v2\rho_{e,v_{1}}^{*}=a=-\rho_{e,v_{2}}^{*} and ρe,v3=1/2=ρe,v4\rho_{e,v_{3}}^{*}=1/2=-\rho_{e,v_{4}}^{*}. On the other hand, since se,v1>se,v2s_{e,v_{1}}>s_{e,v_{2}} we know that ϕe>0\phi_{e}^{*}>0, therefore

ϕe=seTρeσ+ρe22=a(se,v1se,v2)+12(se,v3se,v4)σ+2a2+12.\phi_{e}^{*}=\frac{s_{e}^{T}\rho_{e}^{*}}{\sigma+\|\rho_{e}^{*}\|_{2}^{2}}=\frac{a(s_{e,v_{1}}-s_{e,v_{2}})+\frac{1}{2}(s_{e,v_{3}}-s_{e,v_{4}})}{\sigma+2a^{2}+\frac{1}{2}}. (C.5)

Combining equations (C.4) and (C.5) we get that a=(12+σ)(se,v1se,v2)/(se,v3se,v4)a=(\frac{1}{2}+\sigma)(s_{e,v_{1}}-s_{e,v_{2}})/(s_{e,v_{3}}-s_{e,v_{4}}).

Case 4. Suppose that ye,v1>ye,v2y_{e,v_{1}}^{*}>y_{e,v_{2}}^{*} and ye,v3=ye,v4y_{e,v_{3}}^{*}=y_{e,v_{4}}^{*}. Following an argument similar to Case 3, we get that ρe,v1=1/2=ρe,v2\rho_{e,v_{1}}^{*}=1/2=-\rho_{e,v_{2}}^{*} and ρe,v3=b=ρe,v4\rho_{e,v_{3}}^{*}=b=-\rho_{e,v_{4}}^{*}, where b=(12+σ)(se,v3se,v4)/(se,v1se,v2)b=(\frac{1}{2}+\sigma)(s_{e,v_{3}}-s_{e,v_{4}})/(s_{e,v_{1}}-s_{e,v_{2}}). ∎

Finally, combining Claims C.1, C.2, C.3 and the constraint that ρe,v1+ρe,v2=ρe,v3+ρe,v4=0\rho_{e,v_{1}}^{*}+\rho_{e,v_{2}}^{*}=\rho_{e,v_{3}}^{*}+\rho_{e,v_{4}}^{*}=0, there are at most 12 possible choices for ρe\rho_{e}^{*}. Therefore, an exhaustive search among these candidate vectors for ρe\rho_{e}^{*} (and hence ϕe=seTρeσ+ρe22\phi_{e}^{*}=\frac{s_{e}^{T}\rho_{e}^{*}}{\sigma+\|\rho_{e}^{*}\|_{2}^{2}} and re=ϕeρer_{e}^{*}=\phi_{e}^{*}\rho_{e}^{*}) that minimizes (C.1) can be done using a constant number of vector-vector additions and multiplications.
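To make the exhaustive search concrete, the following Julia sketch enumerates the candidates implied by Claims C.1-C.3 (skipping undefined or infeasible ones) and returns the minimizer of (C.1); variable names are ours, and degenerate cases are handled by the filters rather than by explicit case analysis.

using LinearAlgebra

# Exhaustive search over candidate ρ_e; s = [s_{e,v₁}, s_{e,v₂}, s_{e,v₃}, s_{e,v₄}].
function best_primal(s::Vector{Float64}, σ::Float64)
    a = (0.5 + σ) * (s[1] - s[2]) / (s[3] - s[4])
    b = (0.5 + σ) * (s[3] - s[4]) / (s[1] - s[2])
    pairs12 = [(0.0, 0.0), (0.5, -0.5), (-0.5, 0.5), (a, -a), (-a, a)]
    pairs34 = [(0.0, 0.0), (0.5, -0.5), (-0.5, 0.5), (b, -b), (-b, b)]
    bestϕ, bestr, bestval = 0.0, zeros(4), Inf
    for (ρ1, ρ2) in pairs12, (ρ3, ρ4) in pairs34
        ρ = [ρ1, ρ2, ρ3, ρ4]
        # Skip undefined candidates (a or b may be Inf/NaN) and points outside B_e.
        (all(isfinite, ρ) && maximum(abs, ρ) <= 0.5) || continue
        ϕ = max(dot(s, ρ) / (σ + dot(ρ, ρ)), 0.0)
        val = 0.5 * ϕ^2 + sum(abs2, s .- ϕ .* ρ) / (2σ)   # objective (C.1)
        val < bestval && ((bestϕ, bestr, bestval) = (ϕ, ϕ .* ρ, val))
    end
    return bestϕ, bestr
end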