

PrEF: Percolation-based Evolutionary Framework for the diffusion-source-localization problem in large networks

Yang Liu1 (yangliuyh@gmail.com), Xiaoqi Wang1, Xi Wang2,3, Zhen Wang1, Jürgen Kurths4
1Northwestern Polytechnical University
2The Chinese University of Hong Kong
3Stanford University
4Potsdam Institute for Climate Impact Research
Abstract

We assume that the states of a number of nodes in a network can be investigated if necessary, and study what configuration of those nodes facilitates a better solution to the diffusion-source-localization (DSL) problem. In particular, we formulate a candidate set that is guaranteed to contain the diffusion source, and propose a method, the Percolation-based Evolutionary Framework (PrEF), to minimize such a set. One can then conduct a more intensive investigation on only a few nodes to target the source. To achieve this, we first demonstrate that there are similarities between the DSL problem and the network immunization problem. We find that minimizing the candidate set is equivalent to minimizing the order parameter if we view the observer set as the removal node set. Hence, PrEF is developed based on network percolation and an evolutionary algorithm. The effectiveness of the proposed method is validated on both model and empirical networks under varied circumstances. Our results show that the developed approach achieves a much smaller candidate set than the state of the art in almost all cases. Meanwhile, our approach is also more stable, i.e., its performance is similar irrespective of infection probabilities, diffusion models, and outbreak ranges. More importantly, our approach might provide a new framework to tackle the DSL problem in extremely large networks.

1 Introduction

There has recently been an enormous amount of interest in the diffusion-source-localization (DSL) problem on networks, which aims to find the real source of an ongoing or finished diffusion [1, 2, 3]. Two specific scenarios are epidemics and misinformation, both of which can be well modeled on networks. As one of the biggest enemies of global health, infectious diseases can cause rapid population declines or species extinction [4], from the Black Death (probably bubonic plague), which is estimated to have killed as much as one third of the population of Europe between 1346 and 1350 [5], to the present COVID-19 pandemic, which might result in the largest global recession in history; moreover, climate change keeps exacerbating the spread of diseases and increasing the probability of global epidemics [6, 7]. Hence, the study of the DSL problem can potentially help administrations make policies to prevent future outbreaks and thus save many lives and resources. Regarding misinformation, as one of the biggest threats to our societies, it can greatly impact global health and the economy and further weaken public trust in governments [8]: during the ongoing COVID-19 pandemic, fake news circulating on online social networks (OSNs) has made people believe that the virus can be killed by drinking salty water or bleach [9], and the 'Barack Obama was injured' hoax wiped out $130 billion in stock value [10]. In this circumstance, localizing the real source plays an important role in containing misinformation fundamentally.

The main task of the DSL problem is to find an estimator that infers a source from the known partial information; the ideal estimator is the one that returns the real source. However, due to the complexity of contact patterns and the uncertainty of diffusion models, the real source is generally almost impossible to infer exactly, even when the underlying network is a tree [11]. Hence, as an alternative, the error distance is used, and an estimator is said to be better than another if the corresponding inferred source is closer to the real source in hop-distance [2, 12, 13, 14, 3, 11]. Various methods have therefore been developed to minimize such error distance under different assumptions about the known information, such as observers having knowledge of time stamps and infection directions [2], the diffusion information [15], or the states of all nodes [3], etc. [14, 11]. But here we ask: what should we do next once we acquire an estimator with a small error distance? Indeed, one can carry out an intensive detection on the neighbor region of the inferred source to search for the real source. For regular networks, a small error distance is usually associated with a small number of nodes to be checked. However, most real-world networks are heterogeneous, which means that even a short error distance might correspond to a great number of nodes, especially in social networks.

Hence, in this paper, we present a method, the Percolation-based Evolutionary Framework (PrEF), to tackle the DSL problem by suppressing a candidate set that is guaranteed to contain the real source. In particular, we assume that there is a group of nodes in the network whose states can be investigated if necessary. Those nodes are also assumed to have information on both time stamps and infection directions. Our goal is then to use as few observers (nodes) as possible to achieve the containment of the candidate set. We find that this goal can be reached via the solution of the network immunization problem. Hence, we build our method on network percolation and evolutionary computation. Results on both model and empirical networks show that the proposed method substantially outperforms the state-of-the-art approaches.

Key contributions of this paper are summarized as follows:

  • DSL vs. network immunization. We concretely study and derive the connection between the DSL problem and the network immunization problem, and find that the solution of the latter can be used to effectively cope with the former.

  • Percolation-based evolutionary framework. We propose a percolation-based evolutionary framework to solve the DSL problem, which builds on a network percolation process and potentially works for a range of scenarios.

  • Extensive evaluation on synthetic and empirical networks. We evaluate the proposed method on two synthetic networks and four empirical networks drawn from different real-world scenarios, with sizes up to over 800,000 nodes. Results show that our method is more effective, efficient, and stable than state-of-the-art approaches, and is also capable of handling large networks.

2 Related Work

DSL approaches. Shah and Zaman first studied the DSL problem with a single diffusion source and introduced the rumor centrality method, which counts the distinct diffusion configurations of each node [1, 16]. They considered the Susceptible-Infected (SI) model and concluded that a node is the source with higher probability if it has a larger rumor centrality. Following that, Dong et al. also investigated the SI model and proposed a better approach based on the local rumor center, generalized from rumor centrality, given that a prior distribution is known for each node being the rumor source [17]. Similarly, Zhu and Ying investigated the problem under the Susceptible-Infected-Recovered (SIR) model and found that the Jordan center can be used to characterize such probability [12]. Wang et al. showed that the detection probability can be boosted if multiple diffusion snapshots are observed [13], which can be viewed as integrating the information of the diffusion time to some extent. Indeed, if time stamps or other additional information are known, the corresponding methods usually work better [2, 3, 18]. In short, almost all existing methods study the DSL problem on either simple networks (such as tree-like or model networks) or small empirical networks. Hence, their performance might be questioned in real and complex scenarios, such as networks having many cycles [19, 20].

Network immunization. The network immunization problem aims to find a key group of nodes whose removal fragments a given network to the greatest extent, which has been proved to be an NP-hard problem [21]. In general, approaches for coping with this problem fall into four categories. The first obtains the key group by strategies such as randomly selecting nodes from the network, usually called local-information-based methods [22, 23]. Since the network topology does not have to be precisely known, these methods can be quite useful in some scenarios. When the network topology is known, the second category is usually much more effective. Methods in this category draw the key group by directly ranking nodes according to measurements like degree centrality, eigenvector centrality, PageRank, and betweenness centrality [20]. More concretely, they first calculate the importance of each node using these centralities and choose the top-ranked nodes as the key group. The third category takes the same strategy but heuristically updates the importance of nodes in the remaining network after the most important node is removed; the key group eventually consists of all removed nodes [21]. The last category obtains the key group in indirect ways [24, 25]. For instance, refs. [26, 27] achieve the goal by tackling the feedback vertex set problem.

3 Model

We assume that a diffusion $\zeta$ occurs on a network $G(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ are the corresponding node set and edge set, respectively. Letting $e_{uv}\in\mathcal{E}$ be the edge between nodes $u$ and $v$, we define the nearest-neighbor set of $u$ as $\Gamma(u)=\{v,\forall e_{uv}\in\mathcal{E}\}$. A connected component $c_i$ of $G$ is a subnetwork $G'(\mathcal{V}',\mathcal{E}')$ satisfying $\mathcal{V}'\subset\mathcal{V}$, $\mathcal{E}'=\mathcal{E}\cap(\mathcal{V}'\times\mathcal{V}')$, and $\mathcal{E}\cap((\mathcal{V}\setminus\mathcal{V}')\times\mathcal{V}')\equiv\emptyset$. In particular, denoting $|c_i|$ the size of $c_i$ (i.e., the number of nodes in $c_i$), the largest connected component (LCC), $c_{\max}$, is defined as the component that contains the most nodes, that is, $c_{\max}=\arg\max_{c_i}|c_i|$. Now assuming that $G''$ is the remaining network of $G$ after the removal of a fraction $q$ of nodes and their incident edges, the corresponding size of the LCC, $|c''_{\max}|$, is a monotonically decreasing function of $q$. This function is also known as the order parameter $\mathcal{G}(q)=|c''_{\max}|/n$, where $n=|\mathcal{V}|$ is the number of nodes in $G$. According to percolation theory [28], when $q$ is large enough, say larger than the critical threshold $q_c$, the probability that a randomly selected node belongs to the LCC approaches zero. In other words, if $q>q_c$, there is no giant connected component in $G''$.
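As a concrete illustration, $\mathcal{G}(q)$ can be measured directly from an adjacency list: remove the chosen nodes, find the largest surviving component via a BFS, and divide by $n$. The following pure-Python sketch is illustrative (the ring network and removal set are not from the paper):

```python
from collections import defaultdict

def largest_cc_size(adj, removed):
    """Size of the largest connected component after deleting `removed` nodes."""
    removed = set(removed)
    seen = set()
    best = 0
    for s in adj:
        if s in removed or s in seen:
            continue
        # BFS over the surviving subnetwork
        stack, comp = [s], 0
        seen.add(s)
        while stack:
            u = stack.pop()
            comp += 1
            for v in adj[u]:
                if v not in removed and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, comp)
    return best

def order_parameter(adj, removed):
    """G(q) = |LCC of the remaining network| / n."""
    return largest_cc_size(adj, removed) / len(adj)

# Toy example: a ring of 10 nodes; removing 2 opposite nodes halves the LCC.
adj = defaultdict(set)
for i in range(10):
    adj[i].add((i + 1) % 10)
    adj[(i + 1) % 10].add(i)
print(order_parameter(adj, []))      # 1.0
print(order_parameter(adj, [0, 5]))  # 0.4  (two chains of 4 nodes each)
```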

The diffusion $\zeta$ is generally associated with four factors: the network structure $G$, the diffusion source $v_s$, the dynamic model $M$, and the corresponding time $t$, say $\zeta(G,v_s,M,t)$. Regarding $M$, here we particularly consider the Susceptible-Infected-Recovered (SIR) model [29] as an example to explain the proposed method; more models are discussed in the result section. Nodes of $G$ governed by the SIR model are either susceptible, infected, or recovered. As $t\rightarrow t+1$, an infected node $u$ has an infection probability $\beta_{uv}$ (or a time interval $\tau_{uv}=1/\beta_{uv}$) to transmit the information or virus, say $\varsigma$, to each susceptible neighbor $v\in\Gamma(u)$. Meanwhile, it recovers with a recovery probability $\gamma_u$ (or after a duration $\tau_u=1/\gamma_u$). Recovered nodes stay in the recovered state, and $\varsigma$ cannot pass through a recovered node.
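The discrete-time SIR dynamics above can be sketched as follows; for simplicity this sketch assumes a uniform infection probability `beta` for every edge and a uniform recovery probability `gamma` for every node, whereas the paper allows per-edge $\beta_{uv}$ and per-node $\gamma_u$:

```python
import random

def simulate_sir(adj, source, beta, gamma, seed=None, t_max=1000):
    """Discrete-time SIR on adjacency dict `adj`; returns {node: infection time}."""
    rng = random.Random(seed)
    infected = {source}
    recovered = set()
    t_infect = {source: 0}
    t = 0
    while infected and t < t_max:
        t += 1
        newly_infected, newly_recovered = set(), set()
        for u in infected:
            for v in adj[u]:
                # u transmits to each susceptible neighbor with probability beta
                if v not in t_infect and rng.random() < beta:
                    newly_infected.add(v)
                    t_infect[v] = t
            if rng.random() < gamma:   # u recovers with probability gamma
                newly_recovered.add(u)
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
    return t_infect

# Toy run on a path network 0-1-2-3-4 starting from node 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
times = simulate_sir(adj, source=0, beta=1.0, gamma=0.0, seed=1)
print(times)  # with beta=1, gamma=0: {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
```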

Now supposing that a group of nodes $\mathcal{O}\subset\mathcal{V}$ is particularly chosen as observers, whose states can hence be investigated if necessary, we study what and how the configuration of $\mathcal{O}$ could facilitate better solutions to the diffusion-source-localization (DSL) problem [1, 2, 3]. In particular, we assume that a node $u\in\mathcal{O}$ records the relative infection time stamp $t_u$ once it gets infected. Besides, we also consider that part of $\mathcal{O}$, say $\mathcal{O}_d$, has the ability to record the infection direction, that is, a node $u\in\mathcal{O}_d$ can also show us the node $v$ if $v$ transmits $\varsigma$ to $u$. Based on these assumptions, for a diffusion $\zeta$ triggered by an unknown diffusion source $v_s$ at time $t_0$, the DSL problem aims to design an estimator $\sigma(G,\mathcal{O})$ that gives us the inferred source $\widehat{v}_s=\sigma(G,\mathcal{O})$ satisfying $\widehat{v}_s=\arg\max_{v\in G}\mathcal{P}(\mathcal{O}|v)$, where $\mathcal{P}(\mathcal{O}|v)$ is the probability that we observe $\mathcal{O}$ given $\zeta(G,v,M,t)$. Obviously, the state of a node $i\in\mathcal{O}$ is governed by all parameters of $\zeta$ but with unknown $M$ and $t$. Hence, with high probability $\widehat{v}_s$ differs from the real source $v_s$ in most scenarios [2, 12, 3]. Therefore, the error distance $\epsilon=h(\widehat{v}_s,v_s)$ is used to verify the performance of an estimator, where $h(\widehat{v}_s,v_s)$ represents the hop-distance between $\widehat{v}_s$ and $v_s$. Usually, an estimator is said to be better than another if it has a smaller $\epsilon$.

But here we ask: what should we do next after we obtain an estimator with a small $\epsilon$? In other words, can the estimator help us find the diffusion source more easily? Indeed, after acquiring $\widehat{v}_s$, one can conduct a more intensive search on the neighbor region of $\widehat{v}_s$ to detect $v_s$. In this case, a small $\epsilon$ generally corresponds to a small search region. However, due to the heterogeneity of contact patterns in most real-world networks [30], even a small region (i.e., a small $\epsilon$) might be associated with a lot of nodes. Therefore, it would be more practical in real scenarios if such an estimator gave us a candidate set $\mathcal{V}_c$ satisfying $v_s\in\mathcal{V}_c$ for sure. Hence, we formulate the goal function that this paper aims to achieve as

$\widehat{\mathcal{O}}=\arg\min_{\mathcal{O}}|\mathcal{V}_c|,\ v_s\in\mathcal{V}_c\ \text{for sure},$ (1)

where $|\mathcal{V}_c|$ is the size of $\mathcal{V}_c$. Intuitively, Eq. (1) is developed based on the assumption that the smaller the candidate set, the lower the cost of the intensive search. In general, $\mathcal{V}_c$ should be finite, guaranteed by a finite $\mathcal{O}$, even for an infinite $G$; otherwise, the cost would be infinite since the intensive search would have to be carried out on an infinite population.

4 Method

Let $\mathcal{V}_r$ be the removal node set and $\mathcal{V}_o$ the rest, i.e., $\mathcal{V}_r\cup\mathcal{V}_o=\mathcal{V}$ and $\mathcal{V}_r\cap\mathcal{V}_o=\emptyset$. For the subnetwork $G'$ induced by $\mathcal{V}_o$, the boundary of a connected component $c_i$, say $\widehat{c}_i$, is defined as

$\widehat{c}_i=\{u|e_{uv}\in\mathcal{E},\forall v\in c_i,\forall u\in\mathcal{V}_r\}.$ (2)

Likewise, we write the component cover $\alpha(u)$ that a specific node $u\in\mathcal{V}_r$ corresponds to as

$\alpha(u)=\bigcup_{v\in\Gamma(u)}c_i(v),$ (3)

where $c_i(v)$ represents the component that node $v$ belongs to. Denoting $t_v$ the time stamp at which node $v$ gets infected and $\mathcal{O}'=\{u|u=\arg\min_{v\in\mathcal{O}}t_v\}$, where $t_v$ is assumed to be infinite if $v$ is still susceptible, the proposed approach is developed based on the following observation (Observation 1).

Observation 1.

Letting

$\mathcal{V}_c'=\bigcap_{\forall u\in\mathcal{O}'}\alpha(u)\cup\{u\},$ (4)

we then have $v_s\in\mathcal{V}_c'$ for sure.

Figure 1: Examples regarding DSL, where $\mathcal{V}_r=\mathcal{O}$ (observers) and $\mathcal{V}_o$ is the rest. Nodes in the susceptible state are colored green (marked $\mathbf{S}$), infected orange ($\mathbf{I}$), and recovered blue ($\mathbf{R}$). $t_v$ represents the time stamp at which node $v$ gets infected, e.g., $t_1$ of node $1$. (a) Snapshot of $\zeta$. (b) $\mathcal{O}'=\{1\}$, i.e., $t_1<t_v,v\in[2,7]$. (c) $\mathcal{O}'=\{1,2\}$, i.e., $t_1=t_2<t_v,v\in[3,7]$.

Example 1. Considering Fig. 1(a) and (b), the boundary of the connected component that the diffusion source belongs to is $\{1,2\}$.

Example 2. Regarding Fig. 1(b), $\alpha(1)$ consists of all nodes covered by the shadows.

Example 3. With respect to Fig. 1(c), $\mathcal{V}_c'$ comprises node $1$, node $2$, the diffusion source, and the node adjacent to the source. $\blacksquare$
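A minimal computational sketch of Observation 1, interpreting Eq. (4) as in Example 3 (the intersection of the covers $\alpha(u)$ over the earliest-infected observers $\mathcal{O}'$, together with $\mathcal{O}'$ itself); the path network and infection times below are illustrative, not from the paper:

```python
def components(adj, removed):
    """Connected components of the subnetwork on V minus `removed`."""
    removed = set(removed)
    comp_of, comps = {}, []
    for s in adj:
        if s in removed or s in comp_of:
            continue
        comp, stack = set(), [s]
        comp_of[s] = len(comps)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v not in removed and v not in comp_of:
                    comp_of[v] = len(comps)
                    stack.append(v)
        comps.append(comp)
    return comp_of, comps

def candidate_set(adj, observers, t_infect):
    """V_c' of Observation 1: intersect the component covers alpha(u) over
    the earliest-infected observers O', then add O' itself (Example 3)."""
    comp_of, comps = components(adj, observers)
    infected = {u: t_infect[u] for u in observers if u in t_infect}
    t_min = min(infected.values())
    earliest = [u for u, t in infected.items() if t == t_min]
    vc = None
    for u in earliest:
        # alpha(u): union of the components adjacent to observer u
        alpha = set().union(*(comps[comp_of[v]] for v in adj[u] if v in comp_of))
        vc = alpha if vc is None else vc & alpha
    return vc | set(earliest)

# Path 0-1-2-3-4 with observers {1, 3}; a diffusion from node 2 hits both first.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(candidate_set(adj, observers=[1, 3], t_infect={1: 1, 3: 1}))  # {1, 2, 3}
```

The real source (node 2) is indeed contained in the returned candidate set.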

We now consider the generation of the observer set $\mathcal{O}$ (or equivalently $\mathcal{V}_r$). For a given network $G(\mathcal{V},\mathcal{E})$ constructed by the configuration model [31], letting $\langle k\rangle$ and $\langle k^2\rangle$ be the first and second moments of the corresponding degree sequence, respectively, we have Lemma 1.

Lemma 1.

(Molloy-Reed criterion [31]) A network $G$ constructed based on the configuration model with high probability has a giant connected component (GCC) if

$\langle k^2\rangle/\langle k\rangle>2,$ (5)

where a GCC represents a connected component whose size is proportional to the network size $n$.

Now suppose that $\mathcal{V}_r$ consists of nodes randomly chosen from $G$ and let $q=|\mathcal{V}_r|/n$ be the fraction of removed nodes. Apparently, such removal changes the degree sequence of the remaining network (i.e., the subnetwork $G'$; see also Fig. 1(b) as an example). For a node $v$ shared by both $G$ and $G'$, the probability that its degree $k$ (in $G$) decreases to a specific degree $k'$ (in $G'$) is

$\binom{k}{k'}(1-q)^{k'}q^{k-k'},$

where $\binom{k}{k'}$ is the binomial coefficient (note that each node is removed with probability $q$). Letting $p_k$ denote the degree distribution of $G$, we then have the new degree distribution $p'_{k'}$ of $G'$ as

$p'_{k'}=\sum_{k=k'}^{\infty}p_k\binom{k}{k'}(1-q)^{k'}q^{k-k'},$ (6)

and hence the corresponding first and second moments can be further obtained as

$\langle k'\rangle=\sum_{k'=0}^{\infty}k'p'_{k'}=\sum_{k'=0}^{\infty}k'\sum_{k=k'}^{\infty}p_k\binom{k}{k'}(1-q)^{k'}q^{k-k'}=(1-q)\langle k\rangle$ (7)

and

$\langle k'^2\rangle=(1-q)^2\langle k^2\rangle+q(1-q)\langle k\rangle,$ (8)

respectively. Since $G$ is constructed using the configuration model, its edges are independent of each other. That is, each edge of $G$ shares the same probability of connecting to $v\in\mathcal{V}_r$. In other words, the removal of $v$ removes each edge with the same probability, and hence $G'$ can be viewed as a special network that is also constructed by the configuration model. Thus, Lemma 1 can be used to determine whether a GCC exists in $G'$, and we reach

$q_c=1-\frac{1}{\langle k^2\rangle/\langle k\rangle-1},$ (9)

where $q_c$ is the critical threshold of $q$, that is: i) if $q<q_c$, with high probability there is a GCC in $G'$; ii) if $q>q_c$, with high probability there is no GCC in $G'$ [32, 33].
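Eqs. (7)-(9) can be checked numerically: under random removal, each of a node's $k$ edges survives independently with probability $1-q$, so the remaining degree is $k'\sim\mathrm{Binomial}(k,1-q)$. A small Monte-Carlo sketch with an illustrative toy degree sequence:

```python
import random

def thinned_moments(degrees, q, trials=20000, seed=0):
    """Monte-Carlo moments of k' ~ Binomial(k, 1-q): under random removal of
    a fraction q of nodes, each of a node's k edges survives independently."""
    rng = random.Random(seed)
    m1 = m2 = 0.0
    for _ in range(trials):
        k = rng.choice(degrees)
        kp = sum(rng.random() < 1 - q for _ in range(k))  # surviving degree k'
        m1 += kp
        m2 += kp * kp
    return m1 / trials, m2 / trials

degrees = [1, 2, 2, 3, 3, 3, 4, 6]               # toy degree sequence
q = 0.4
k1 = sum(degrees) / len(degrees)                 # <k>   = 3.0
k2 = sum(k * k for k in degrees) / len(degrees)  # <k^2> = 11.0
m1, m2 = thinned_moments(degrees, q)
print(m1, (1 - q) * k1)                          # ~1.8 vs 1.8   (Eq. (7))
print(m2, (1 - q) ** 2 * k2 + q * (1 - q) * k1)  # ~4.68 vs 4.68 (Eq. (8))
print(1 - 1 / (k2 / k1 - 1))                     # q_c = 0.625   (Eq. (9))
```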

For random networks (such as Erdős-Rényi (ER) networks [34]), $\langle k^2\rangle=\langle k\rangle(\langle k\rangle+1)$ gives $q_c=1-1/\langle k\rangle$, which indicates that $q_c$ is usually less than $1$ and increases as $G$ becomes denser. But for heterogeneous networks, say $p_k\sim k^{-\ell}$, the second moment $\langle k^2\rangle=\sum_k k^2 p_k\sim\sum_k k^{2-\ell}$ diverges if $\ell<3$ (most empirical networks have $2<\ell<3$ [20]), which means $q_c$ approaches $1$.

Remark 1.

From the above analysis, we have the following conclusions for the case that the observer set $\mathcal{O}$ is randomly chosen. For random networks, $q_c<1$ indicates that one can always find a proper $\mathcal{O}$ that achieves a finite $\mathcal{V}_c'$, and usually the denser the network, the larger the required observer set $\mathcal{O}$. But for heterogeneous networks, $q_c\rightarrow 1$ means that this goal can only be achieved by putting almost all nodes into the observer set $\mathcal{O}$. $\blacksquare$

We further consider the case that $\mathcal{O}$ consists of hubs [35], i.e., nodes having more connections than others in a particular network. Specifically, for a network $G(\mathcal{V},\mathcal{E})$, we first define a sequence $\mathcal{S}$ over $\mathcal{V}$ in which each element is uniquely associated with a node in $G$. Then, letting $\mathcal{S}(i)=u$ and $\mathcal{S}(j)=v$ with $i<j$ if $k_u\geqslant k_v$, where $k_u$ represents the degree of node $u$, we have $\mathcal{O}=\{\mathcal{S}(i),i\in[1,\lfloor nq+0.5\rfloor]\}$. For heterogeneous networks under such removal of hubs, a threshold $q_c<1$ can be achieved and obtained by numerically solving [36]

$q_c^{(2-\ell)/(1-\ell)}-2=\frac{2-\ell}{3-\ell}k_{\min}\left(q_c^{(3-\ell)/(1-\ell)}-1\right),$ (10)

where $k_{\min}$ is the minimum degree.
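Eq. (10) can be solved numerically, e.g., by bisection; the sketch below assumes a single sign change of the residual on $(0,1)$, which holds for the illustrative case $\ell=2.5$, $k_{\min}=1$ (the function name and defaults are our own):

```python
def qc_hub_removal(ell, k_min, lo=1e-9, hi=1.0 - 1e-9, iters=200):
    """Bisection solve of Eq. (10) for the hub-removal threshold q_c."""
    def f(q):
        a = (2 - ell) / (1 - ell)
        b = (3 - ell) / (1 - ell)
        return q ** a - 2 - (2 - ell) / (3 - ell) * k_min * (q ** b - 1)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:  # root lies in [lo, mid]
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# For ell = 2.5, k_min = 1, Eq. (10) reduces to q^(1/3) + q^(-1/3) = 3, whose
# root in (0, 1) is ((3 - sqrt(5))/2)**3 ~ 0.0557: removing about 5.6% of the
# largest hubs already destroys the GCC, whereas random removal needs q -> 1.
print(round(qc_hub_removal(2.5, 1), 4))  # 0.0557
```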

For $\mathcal{V}_c'$, however, we cannot obtain an explicit equation indicating whether it is finite. But we can roughly show that there is $q_c<1$ yielding a finite $\mathcal{V}_c'$ for networks generated by the configuration model with degree distribution $p_k\sim k^{-\ell}$. Supposing that the size of the LCC of $G'$ is proportional to $n^b$ with $b<1$, then

$k_{\max}n^b/n$ (11)

gives the possibly largest size of $\mathcal{V}_c'$ (as a fraction of $n$), where $k_{\max}$ is the maximum degree of $G$, here assumed to be unique. In the mentioned case, $k_{\max}=k_{\min}n^{1/(\ell-1)}$ holds, and hence expression (11) approaches $0$ as $n\rightarrow\infty$ if $2<\ell<3$ (again, the case we are particularly interested in), which indicates that one can always find some proper value of $b>0$ satisfying $b<(\ell-2)/(\ell-1)$ for a given $\ell$. Note that, as with random removal, each edge of $G$ shares the same probability of being removed. Hence, both the size of the LCC and the number of connections that node $v_{\max}$ has with $\mathcal{V}_o$ decrease as $q$ increases, where $v_{\max}$ represents the node whose degree is $k_{\max}$.

But for networks having $k_{\max}\sim n$, such as star-shaped networks, $\mathcal{V}_c'$ is still infinite when $q$ is finite. In this case, the information of only the infection time stamps of nodes in $\mathcal{O}$ is apparently not enough. However, if the infection direction is also recorded, then the central node alone as the unique observer is sufficient for the DSL problem in those star-shaped networks. Note that most existing DSL methods would also fail in star-shaped networks. Hence, we further assume that part of $\mathcal{O}$, say $\mathcal{O}_d$, can also show the infection direction. Obviously, $|\mathcal{V}_c'|$ is a monotonically decreasing function of $|\mathcal{O}_d|$ for a specific $q$. In particular, if $\mathcal{O}_d=\mathcal{O}$, then the size of $\mathcal{V}_c'$ is bounded by the size of the LCC.

Remark 2.

In general, a finite $\mathcal{V}_c'$ for heterogeneous networks can be achieved by carefully choosing $\mathcal{O}$. In other words, the configuration of $\mathcal{O}$ plays a fundamental role in the suppression of $\mathcal{V}_c'$. Associating $\mathcal{O}$ with the sequence $\mathcal{S}$ (e.g., random removal can be viewed as removal over a random sequence $\mathcal{S}$), our goal is now to acquire a better $\mathcal{S}$ that gives a smaller $\mathcal{V}_c'$ for a specific $q$. Besides, since the number of components that $v_{\max}$ connects to is usually difficult to measure, especially for real-world networks, here we choose to achieve the containment of $\mathcal{V}_c'$ by curbing the LCC of $G'$ (see also Eq. (11)), which coincides with the suppression of the order parameter $\mathcal{G}(q)$. $\blacksquare$

Therefore, we reach

$\mathcal{\widehat{S}}=\begin{cases}\arg\min_{\mathcal{S}}q_c(\mathcal{S}),&\text{if $\delta$ is given},\\\arg\min_{\mathcal{S}}F(\mathcal{S}),&\text{otherwise},\end{cases}$ (12)

where $q_c(\mathcal{S})=\arg\min_q\mathcal{G}(q)\leqslant\delta$, $F(\mathcal{S})=\sum_q\mathcal{G}(q)$, and $\delta$ is a given parameter (such as $\delta=0.01$). Eq. (12) is also known as the network immunization problem, which aims to contain epidemics by isolating as few nodes as possible [33]. Many methods have been proposed to cope with this problem [21, 27, 37, 38, 39, 40, 24, 25]. Here we particularly choose the approach based on the evolutionary framework (AEF) to construct $\mathcal{S}$, since it achieves the state of the art in most networks.

We first introduce several auxiliary variables to ease the description of AEF. Consider a given network $G(\mathcal{V},\mathcal{E})$ and the corresponding sequence $\mathcal{S}$. Let $\mathcal{S}_i^{\bot}=(\mathcal{S}_i,\mathcal{S}_{i+1},...,\mathcal{S}_h)$, where $\mathcal{S}_i$ is a subsequence of $\mathcal{S}$. Likewise, denote $G'(\mathcal{S}_i^{\bot})$ a subnetwork $G'(\mathcal{V}',\mathcal{E}')$, in which $\mathcal{V}'=\{\mathcal{S}_i^{\bot}(j),\forall j\}$ and $\mathcal{E}'=\mathcal{E}\cap(\mathcal{V}'\times\mathcal{V}')$. Based on that, $F$ of $G'(\mathcal{S}_i^{\bot})$ regarding $\mathcal{S}_i^{\bot}$ is written as $F'(\mathcal{S}_i^{\bot})$. Further, letting $p'_{\max}=|\text{LCC}|/n$ of $G'$, we define the critical subsequence $\mathcal{S}_c$ as the subsequence satisfying $p'_{\max}\leqslant\delta$ for $G'(\mathcal{S}_i^{\bot})$ and $p'_{\max}>\delta$ for $G'((\mathcal{S}_c,\mathcal{S}_i^{\bot}))$. Note that all $F'$ are scaled by $n$, namely, the size of the studied network $G$.

The core of AEF is the relationship-related (RR) strategy, which works by repeatedly pruning the whole sequence $\mathcal{S}$. Specifically, per iteration $T$, RR keeps a new sequence $\mathcal{S}'$ (i.e., $\mathcal{S}\leftarrow\mathcal{S}'$) if $F(\mathcal{S}')<F(\mathcal{S})$ (or $q_c(\mathcal{S}')<q_c(\mathcal{S})$), which is obtained through the following steps. 1) Let $j=n$, $\mathcal{S}'\leftarrow\mathcal{S}$, and $G'(\mathcal{V}',\mathcal{E}')$ be a subnetwork of $G$, which consists of all nodes in $\mathcal{V}'=\{\mathcal{S}'(z),z\in[j,n]\}$ and the associated edges in $\mathcal{E}'=\{e_{uv},\forall u,v\in\mathcal{V}'\}$. 2) Construct the candidate set $\bar{s}_j$ by randomly choosing $\Delta$ times from $\{\mathcal{S}'(i),i\in[\max(j-r\times n,1),j]\}$, where $\Delta\in[1,\widehat{\Delta}]$ and $r\in(0,\widehat{r}]$ are randomly generated per iteration, and $\widehat{\Delta}$ and $\widehat{r}$ are given parameters. 3) Choose the node $u$,

$u=\arg\min_v\xi(v),\forall v\in\bar{s}_j,$ (13)

where $\xi(v)=\sum_{c_i\in c(v)}|c_i|$ or $\xi(v)=\prod_{c_i\in c(v)}|c_i|$, in which $c(v)$ is the set of components that node $v$ would connect to. 4) Update $G'$ and $\mathcal{S}'$, namely, $\mathcal{V}'\leftarrow\mathcal{V}'\cup\{u\}$, $\mathcal{E}'\leftarrow\mathcal{E}'\cup\{e_{uv},\forall v\in\mathcal{V}',v\neq u\}$, and swap $\mathcal{S}'(j)$ and $\mathcal{S}'(z)$ satisfying $\mathcal{S}'(z)=u$. 5) $j\leftarrow j-1$. 6) Repeat steps 2)-5) until $j=1$, which accounts for one round (see also Algorithm 1). RR acquires the solution by repeating steps 1)-6) $\widehat{T}$ times.

Algorithm 1: One round of RR [38]
Input: $G(\mathcal{V},\mathcal{E})$, $\mathcal{S}$, $\widehat{\Delta}$, $\widehat{r}$
Output: $\mathcal{S}$
1: Initialization: $\mathcal{V}'\leftarrow\{\}$, $\mathcal{E}'\leftarrow\{\}$, $j\leftarrow n$, $\mathcal{S}'\leftarrow\mathcal{S}$, $\Delta$, and $r$
2: while $j\geqslant 1$ do
3:   $j\leftarrow j-1$
4:   Get the candidate set $\bar{s}_j$ based on $\Delta$ and $r$
5:   Choose node $u\in\bar{s}_j$ based on Eq. (13)
6:   Update $G'(\mathcal{V}',\mathcal{E}')$ and $\mathcal{S}'$
7: if $F(\mathcal{S}')<F(\mathcal{S})$ then
8:   $\mathcal{S}\leftarrow\mathcal{S}'$
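A simplified pure-Python sketch of one RR round, using a union-find structure to track component sizes during node re-insertion; the acceptance test $F(\mathcal{S}')<F(\mathcal{S})$ and the repetition over $\widehat{T}$ rounds are omitted for brevity, and the function names and parameter defaults are illustrative:

```python
import random

class DSU:
    """Union-find tracking component sizes during node re-insertion."""
    def __init__(self):
        self.parent, self.size = {}, {}
    def add(self, u):
        self.parent[u], self.size[u] = u, 1
    def find(self, u):
        while self.parent[u] != u:
            self.parent[u] = self.parent[self.parent[u]]  # path halving
            u = self.parent[u]
        return u
    def union(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru != rv:
            self.parent[ru] = rv
            self.size[rv] += self.size[ru]

def rr_round(adj, order, delta_hat=6, r_hat=0.5, seed=None):
    """One RR round: rebuild G node by node in reverse over `order`; at each
    position, swap in the candidate whose re-insertion joins the smallest
    total component mass (the sum version of xi(v) in Eq. (13))."""
    rng = random.Random(seed)
    order = list(order)
    n = len(order)
    dsu, present = DSU(), set()
    for j in range(n - 1, -1, -1):
        delta = rng.randint(1, delta_hat)
        r = rng.uniform(1e-9, r_hat)
        window = order[max(j - int(r * n), 0):j + 1]
        cand = [rng.choice(window) for _ in range(delta)]
        def xi(v):  # total size of the distinct components v would connect to
            roots = {dsu.find(w) for w in adj[v] if w in present}
            return sum(dsu.size[rt] for rt in roots)
        u = min(cand, key=xi)
        dsu.add(u)
        present.add(u)
        for w in adj[u]:
            if w in present:
                dsu.union(u, w)
        z = order.index(u)
        order[j], order[z] = order[z], order[j]
    return order

# Toy usage on an 8-node ring: returns a (possibly reordered) permutation.
adj = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
new_order = rr_round(adj, list(range(8)), seed=0)
print(sorted(new_order) == list(range(8)))  # True
```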

Observation 2.

Supposing that

$F_i'=F'(\mathcal{S}_i^{\bot}\leftarrow\mathcal{S}_{i+1}^{\bot})=F'(\mathcal{S}_i^{\bot})-F'(\mathcal{S}_{i+1}^{\bot})$ (14)

holds, then for a specific sequence $\mathcal{S}$ regarding a given network $G$, $F_i'$ is independent of $F_j'$ for any $i\neq j$. That is, in such a case, the order of nodes within $\mathcal{S}_i$ has no effect on $F_j'$. $\blacksquare$

AEF is developed based on RR and Observation 2. Specifically, at $T_p$, a random integer $j(T_p)\in[\pi_1,\pi_2]$ is generated, where $\pi_1$ and $\pi_2$ are two given boundaries. Let $\mathcal{S}_i=(\mathcal{S}(z)),\forall z\in[j(T_p)\times(i-1)+1,\min(j(T_p)\times i,n)]$. Then, for all subsequences $\mathcal{S}_i,\forall i\in[1,h]$, RR with the optimization of $F$ is conducted if $\delta$ is unknown; otherwise, $\mathcal{S}_c$ is optimized by RR with $q_c$ minimized.

Hence, the containment of $\mathcal{V}_c'$ has been achieved (see also Eq. (4)), based on which existing DSL approaches can be used to further acquire the candidate set $\mathcal{V}_c\subset\mathcal{V}_c'$ (see also Eq. (1)). Here, since our goal is a framework that can effectively cope with the DSL problem in large-scale networks, we propose the following approach. Let $\mathcal{O}''$ be the effective periphery node set of $\mathcal{V}_c''$, defined as

\mathcal{O}^{\prime\prime}=\{u,\forall u\in\mathcal{O},\exists\Gamma(u)\cap\mathcal{V}_{c}^{\prime\prime}\neq\emptyset,t_{u}\neq\infty\}, (15)

where

\mathcal{V}_{c}^{\prime\prime}=\bigcap_{\forall u\in\mathcal{O}^{\prime}}\alpha(u). (16)

Letting t_min = min_{u∈𝒪″} t_u, we first refine the time stamps by t_u′ = t_u − t_min. Then, a Reverse-Influence-Sampling (RIS) [41] like strategy is conducted to infer the source v̂_c, which works as follows. 1) Let Λ = {} and G″(𝒱′, ℰ″) be the reverse network of G′(𝒱′, ℰ′), satisfying |ℰ″| ≡ |ℰ′| and e_uv ∈ ℰ″ if e_vu ∈ ℰ′. 2) Randomly choose a node u ∈ 𝒪″ and let t_u″ = t_0′ + t_u′, where t_0′ is randomly drawn from [0, t̂_0] and t̂_0 is a given parameter. 3) View u as the source: it transmits ς to one of its randomly chosen neighbors and then recovers. 4) Repeat such transmission for t_u″ steps and denote the latest infected node by v. 5) Let Λ = Λ ∪ {v}. 6) Repeat 2)–5) T_Λ times. Using θ(v) to represent the frequency of a node v ∈ Λ in Λ, we estimate the source v̂_c by

\widehat{v}_{c}=\operatorname*{arg\,max}_{v}\theta(v). (17)

The candidate set 𝒱_c (see also Eq. (1)) is finally acquired by simply considering a few layers of neighbors of v̂_c.
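The six steps above reduce each sample to a time-limited random walk on the reverse network. Below is a minimal Python sketch, assuming the reverse network G″ is given as an adjacency dict and writing θ as a Counter; `estimate_source`, `radj`, and `t_obs` are illustrative names, not from the paper.

```python
import random
from collections import Counter

def estimate_source(radj, observers, t_obs, t0_max=5, n_samples=10000, seed=None):
    """RIS-like source estimate (sketch). `radj`: reverse network as a
    dict node -> list of reverse neighbours; `observers`: the effective
    periphery node set; `t_obs[u]`: refined time stamp t'_u = t_u - t_min.
    Each sample walks t''_u = t'_0 + t'_u random steps backwards from a
    randomly chosen observer and records the node reached."""
    rng = random.Random(seed)
    tally = Counter()
    obs = list(observers)
    for _ in range(n_samples):
        u = rng.choice(obs)                    # step 2: random observer
        steps = rng.randint(0, t0_max) + t_obs[u]
        v = u
        for _ in range(steps):                 # steps 3-4: random walk
            nbrs = radj.get(v, [])
            if not nbrs:
                break
            v = rng.choice(nbrs)
        tally[v] += 1                          # step 5: record endpoint
    # Eq. (17): the most frequently reached node is the source estimate
    return tally.most_common(1)[0][0]
```

For instance, on a star whose observers all sit one hop (and one time step) from the center, every walk terminates at the center, which is then returned as the estimate.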

Remark 3.

Relying on AEF, a finite 𝒱_c′ can be achieved by a small 𝒪, especially when 𝒪_d is large, i.e., the larger R_d, the better the corresponding result, where R_d = |𝒪_d|/|𝒪| characterizes the fraction of 𝒪_d with respect to 𝒪. In tandem with the approach that we present to draw 𝒱_c from 𝒱_c′, we name this framework the Percolation-based Evolutionary Framework (PrEF) for the diffusion-source-localization problem. Note that other strategies, such as existing DSL methods, can also be further developed or integrated into PrEF to acquire 𝒱_c based on 𝒱_c′. \blacksquare

Figure 2: The fraction of the candidate set ϕ as a function of the infection probability β_1 regarding SIR1 with γ_1 = 0.1. (a) The ER network with ⟨k⟩ = 3.50 and q_c = 0.2100. (b) The SF network with ⟨k⟩ = 4.00, ℓ = 3.0, and q_c = 0.1150. (c) The PG network with q_c = 0.0810. (d) The SCM network with q_c = 0.0664. Samples are generated at ε = 0.10.

5 Results

Competitors. We mainly compare the proposed approach with the Jordan-Center (JC) method [12], which generally achieves comparable results in most cases [15, 11, 18]. For JC, the corresponding candidate set 𝒱_c is constructed based on the associated node rank, since the neighbor-based strategy usually results in a much larger 𝒱_c. Meanwhile, an observer set 𝒪 consisting of hubs is also considered as a baseline, denoted Hubs_s. Besides, since most current DSL approaches do not work for large networks, we also verify the performance of the proposed method by comparing it with approaches from the network immunization field, including the collective influence (CI) [21], the min-sum and reverse-greedy (MSRG) strategy [27], and the FINDER (FInding key players in Networks through DEep Reinforcement learning) method [25].

Settings. JC considers all infected and recovered nodes to achieve the source localization. PrEF is conducted with Δ̂ = 50, r̂ = 1, T̂ = 20, π_1 = 1, π_2 = ⌊0.1 × n⌋, T_p = 5,000 for networks of n ⩽ 10^5, T_p = 2,500 for 10^5 < n ⩽ 10^6, T_p = 500 for n > 10^6 (AEF), and T_Λ = 10^6 (RIS). Besides, we use PrEF(R_d) to represent PrEF regarding a specific R_d. In addition, for each network, q_c is obtained at 𝒢(q_c) ≈ 0.005 of AEF.

Diffusion models. SIR1: β_uv = β_1 and γ_u = γ_1, ∀u, v ∈ 𝒱. SIR2: β_uv ∈ [β_0, β_1] and γ_u = 0, ∀u, v ∈ 𝒱, i.e., the Susceptible-Infected (SI) model [29]. SIR3: β_uv ∈ [β_0, β_1] and γ_u = 1, ∀u, v ∈ 𝒱, i.e., the Independent Cascade (IC) model [42, 43].

Letting n(t, I) and n(t, R) respectively be the number of nodes in the infected and recovered states at time t of a particular diffusion ζ(G, v_s, M, t), we generate a DSL sample by the following process. 1) A node v_s ∈ 𝒱 is randomly chosen as the diffusion source to trigger ζ. 2) ζ is terminated at the first moment when

(n(t,\text{I})+n(t,\text{R}))/n\geqslant\varepsilon,

where ε is the outbreak range rate. Note that (n(t, I) + n(t, R))/n might be much larger than ε if the infection probability is large.
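The sample-generation process can be sketched for SIR1 as follows, assuming a synchronous discrete-time SIR process with uniform β and γ; the function and argument names are illustrative, and a run may also terminate early if the diffusion dies out before reaching ε.

```python
import random

def generate_dsl_sample(adj, beta=0.5, gamma=0.1, eps=0.10, seed=None):
    """Generate one DSL sample under SIR1 (sketch): pick a random source,
    run discrete-time SIR with infection probability `beta` per contact and
    recovery probability `gamma` per step, and stop once the outbreak covers
    a fraction >= eps of the network (or the diffusion dies out)."""
    rng = random.Random(seed)
    nodes = list(adj)
    n = len(nodes)
    src = rng.choice(nodes)                    # step 1: random source
    infected, recovered = {src}, set()
    # step 2: terminate at the first moment (n(t,I)+n(t,R))/n >= eps
    while infected and (len(infected) + len(recovered)) / n < eps:
        new_inf, new_rec = set(), set()
        for u in infected:
            for v in adj[u]:
                if v not in infected and v not in recovered and rng.random() < beta:
                    new_inf.add(v)
            if rng.random() < gamma:
                new_rec.add(u)
        infected = (infected | new_inf) - new_rec
        recovered |= new_rec
    return src, infected, recovered
```

As the note above says, the final outbreak fraction can overshoot ε considerably when β is large, since a whole synchronous step is applied before the stopping test.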

Evaluation metric. We mainly consider the fraction of the candidate set (see also Eq. (1)), ϕ, to evaluate the performance of the proposed method, which is defined as

\phi=|\mathcal{V}_{c}|/n.

In what follows, ϕ is the mean over 1,000 independent realizations unless otherwise stated. Besides, we also use ϕ(·) to denote ϕ of a specific approach; e.g., ϕ(PrEF) represents ϕ of PrEF.

Table 1: Experimental networks.
Networks n m
ER 10,000 35,000
SF 10,000 40,000
PG 4,941 6,594
SCM 7,228 24,784
LOCG 196,591 950,327
WG 875,713 4,322,051

Networks. We consider both model networks (the ER network [34] and the scale-free (SF) network [30]) and empirical networks (the Power Grid (PG) network [19], the Scottish-cattle-movement (SCM) network [37], the loc-Gowalla (LOCG) network [44], and the web-Google (WG) network [45]). We choose the PG network since it is widely used to evaluate the performance of DSL approaches; the rest are all closely associated with the DSL problem. In particular, the SCM network is a network of Scottish cattle movements, on which the study of the DSL problem plays an important role for food security [46]. Besides, the LOCG network is a location-based online social network, and the WG network is a Google web graph whose nodes represent web pages and whose edges are hyperlinks among them. Studying them can potentially contribute to the containment of misinformation. The basic information on these networks can be found in Table 1.

Figure 3: ϕ as a function of the outbreak range rate ε, where SIR1 with β_1 = 0.5 and γ_1 = 0.1 is considered. (a) The ER network. (b) The SF network. (c) The PG network. (d) The SCM network.
Figure 4: ϕ as a function of the fraction of observers q. Here SIR1 is conducted with β_1 = 0.5, γ_1 = 0.1, and ε = 0.10. (a) and (b) The LOCG network. (c) and (d) The WG network. Besides, R_d = 0 is for (a) and (c), and R_d = 1 for (b) and (d).

Results. We first fix q to verify the performance of PrEF over varied infection probability β_1. Indeed, if the diffusion is symmetrical, JC is an effective estimator (see Figs. 2a and 2b when β_1 is large), but its effectiveness sharply decreases as β_1 decreases. By contrast, PrEF performs steadily over the whole range of β_1 and is much better than JC when β_1 is small; for instance, ϕ(PrEF(1)) = 0.0004 versus ϕ(JC) = 0.0721 at β_1 = 0.1 in the SF network. Besides, PrEF(0) apparently works better in the ER network than in the SF network, which indicates that k_max might affect the effectiveness of PrEF(0), since the SF network has a much larger k_max. To further demonstrate this, we also consider two empirical networks: the Power Grid network (with k_max = 19) and the Scottish network (with k_max = 3667). As shown in Figs. 3c and 3d, even Hubs_s is more effective than JC in the Power Grid network, while JC, Hubs_s, and PrEF(0) all fail in the Scottish network. In contrast, PrEF(1) works extremely well in both cases. Further considering the fraction of the candidate set ϕ as a function of the outbreak range rate ε (Fig. 3), PrEF(0) and PrEF(1) still show stable performance while ϕ(JC) rapidly increases with ε.

Figure 5: ϕ of R_d regarding SIR1 with β_1 = 0.5 and γ_1 = 0.1, where ‘Random’ represents that 𝒪_d is randomly chosen from 𝒪 while ‘Importance’ corresponds to the case that 𝒪_d is generated relying on 𝒮. (a) The ER network. (b) The SF network. (c) The PG network. (d) The SCM network. Samples are generated at ε = 0.10.
Figure 6: ϕ of ε regarding SIR1, SIR2, and SIR3, where SIR1 is with β_1 = 0.5 and γ_1 = 0.1, and β_uv ∈ [0, 1] of SIR2 and β_uv ∈ [0.5, 1] of SIR3 are both randomly generated. (a) The ER network. (b) The SF network. (c) The PG network. (d) The SCM network. Solid and unfilled marks are associated with PrEF(0) and PrEF(1), respectively.

We further evaluate the performance of PrEF under different q by comparing it with CI, MSRG, and FINDER on the two large networks. From the results shown in Fig. 4, we draw the following conclusions: i) ϕ → 1 when q → 0, in accordance with our previous discussion, i.e., 𝒢(q) → 1 when q → 0; ii) a method that performs better for R_d = 0 usually also works better for R_d = 1; iii) for a specific q, PrEF always has a much smaller ϕ than CI, MSRG, and FINDER, especially on the WG network. Indeed, the size of the observer set 𝒪, the value of R_d (see also Fig. 5), and the strategy generating 𝒪 all play fundamental roles in minimizing ϕ. In particular, PrEF has the best performance over almost the whole range of q. Besides, the results in Fig. 6 further demonstrate that the proposed method is also stable against varied diffusion models.

6 Conclusion

Aiming at a state-of-the-art approach to the DSL problem in large networks, we have developed the PrEF method based on network percolation and evolutionary computation, which can effectively narrow the search region of the diffusion source. Specifically, we have found that the DSL problem is to a degree equivalent to the network immunization problem if immune nodes are viewed as observers, and hence it can be tackled in a similar scheme. In particular, we have demonstrated that the search region is bounded by the LCC if the direction information of the diffusion is known, regardless of the network structure; for the case that only the time stamp is recorded, both the LCC and the largest degree affect the search region. We have also conducted extensive experiments to evaluate the performance of the proposed method. The results show that our method is much more effective, efficient, and stable compared to existing approaches.

References

  • [1] D. Shah and T. Zaman, “Detecting sources of computer viruses in networks: theory and experiment,” in Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 2010, pp. 203–214.
  • [2] P. C. Pinto, P. Thiran, and M. Vetterli, “Locating the source of diffusion in large-scale networks,” Physical review letters, vol. 109, no. 6, p. 068702, 2012.
  • [3] S. S. Ali, T. Anwar, A. Rastogi, and S. A. M. Rizvi, “EPA: Exoneration and prominence based age for infection source identification,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 891–900.
  • [4] C. D. Harvell, C. E. Mitchell, J. R. Ward, S. Altizer, A. P. Dobson, R. S. Ostfeld, and M. D. Samuel, “Climate warming and disease risks for terrestrial and marine biota,” Science, vol. 296, no. 5576, pp. 2158–2162, 2002.
  • [5] F. Brauer, C. Castillo-Chavez, and C. Castillo-Chavez, Mathematical models in population biology and epidemiology.   Springer, 2012, vol. 2.
  • [6] A. J. McMichael, D. H. Campbell-Lendrum, C. F. Corvalán, K. L. Ebi, A. Githeko, J. D. Scheraga, and A. Woodward, Climate change and human health: risks and responses.   World Health Organization, 2003.
  • [7] D. T. Jamison, L. H. Summers, G. Alleyne, K. J. Arrow, S. Berkley, A. Binagwaho, F. Bustreo, D. Evans, R. G. Feachem, J. Frenk et al., “Global health 2035: a world converging within a generation,” The Lancet, vol. 382, no. 9908, pp. 1898–1955, 2013.
  • [8] X. Zhou and R. Zafarani, “A survey of fake news: Fundamental theories, detection methods, and opportunities,” ACM Computing Surveys (CSUR), vol. 53, no. 5, pp. 1–40, 2020.
  • [9] S. Sahoo, S. K. Padhy, J. Ipsita, A. Mehra, and S. Grover, “Demystifying the myths about COVID-19 infection and its societal importance,” Asian Journal of Psychiatry, vol. 54, p. 102244, 2020.
  • [10] K. Rapoza, “Can ’fake news’ impact the stock market?” https://www.forbes.com/sites/kenrapoza/2017/02/26/can-fake-news-impact-the-stock-market/?sh=5f820b392fac, 2017, accessed: 2021-09-15.
  • [11] J. Choi, S. Moon, J. Woo, K. Son, J. Shin, and Y. Yi, “Information source finding in networks: Querying with budgets,” IEEE/ACM Transactions on Networking, vol. 28, no. 5, pp. 2271–2284, 2020.
  • [12] K. Zhu and L. Ying, “Information source detection in the SIR model: A sample-path-based approach,” IEEE/ACM Transactions on Networking, vol. 24, no. 1, pp. 408–421, 2014.
  • [13] Z. Wang, W. Dong, W. Zhang, and C. W. Tan, “Rumor source detection with multiple observations: Fundamental limits and algorithms,” ACM SIGMETRICS Performance Evaluation Review, vol. 42, no. 1, pp. 1–13, 2014.
  • [14] J. Jiang, S. Wen, S. Yu, Y. Xiang, and W. Zhou, “Identifying propagation sources in networks: State-of-the-art and comparative studies,” IEEE Communications Surveys & Tutorials, vol. 19, no. 1, pp. 465–481, 2016.
  • [15] A. Y. Lokhov, M. Mézard, H. Ohta, and L. Zdeborová, “Inferring the origin of an epidemic with a dynamic message-passing algorithm,” Physical Review E, vol. 90, no. 1, p. 012801, 2014.
  • [16] D. Shah and T. Zaman, “Rumor centrality: a universal source detector,” in Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems, 2012, pp. 199–210.
  • [17] W. Dong, W. Zhang, and C. W. Tan, “Rooting out the rumor culprit from suspects,” in 2013 IEEE International Symposium on Information Theory.   IEEE, 2013, pp. 2671–2675.
  • [18] Y. Chai, Y. Wang, and L. Zhu, “Information sources estimation in time-varying networks,” IEEE Transactions on Information Forensics and Security, vol. 16, pp. 2621–2636, 2021.
  • [19] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature, vol. 393, no. 6684, pp. 440–442, 1998.
  • [20] M. Newman, Networks.   Oxford university press, 2018.
  • [21] F. Morone and H. A. Makse, “Influence maximization in complex networks through optimal percolation,” Nature, vol. 524, no. 7563, pp. 65–68, 2015.
  • [22] R. Cohen, S. Havlin, and D. Ben-Avraham, “Efficient immunization strategies for computer networks and populations,” Physical Review Letters, vol. 91, no. 24, p. 247901, 2003.
  • [23] Y. Liu, Y. Deng, and B. Wei, “Local immunization strategy based on the scores of nodes,” Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 26, no. 1, p. 013106, 2016.
  • [24] X.-L. Ren, N. Gleinig, D. Helbing, and N. Antulov-Fantulin, “Generalized network dismantling,” Proceedings of the national academy of sciences, vol. 116, no. 14, pp. 6554–6559, 2019.
  • [25] C. Fan, L. Zeng, Y. Sun, and Y.-Y. Liu, “Finding key players in complex networks through deep reinforcement learning,” Nature Machine Intelligence, pp. 1–8, 2020.
  • [26] S. Mugisha and H.-J. Zhou, “Identifying optimal targets of network attack by belief propagation,” Physical Review E, vol. 94, no. 1, p. 012305, 2016.
  • [27] A. Braunstein, L. Dall’Asta, G. Semerjian, and L. Zdeborová, “Network dismantling,” Proceedings of the National Academy of Sciences, vol. 113, no. 44, pp. 12 368–12 373, 2016.
  • [28] D. Stauffer and A. Aharony, Introduction to percolation theory.   CRC press, 2018.
  • [29] M. J. Keeling and P. Rohani, Modeling infectious diseases in humans and animals.   Princeton university press, 2011.
  • [30] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
  • [31] M. Molloy and B. Reed, “A critical point for random graphs with a given degree sequence,” Random structures & algorithms, vol. 6, no. 2-3, pp. 161–180, 1995.
  • [32] R. Cohen, K. Erez, D. Ben-Avraham, and S. Havlin, “Resilience of the internet to random breakdowns,” Physical Review Letters, vol. 85, no. 21, p. 4626, 2000.
  • [33] A.-L. Barabási et al., Network science.   Cambridge university press, 2016.
  • [34] P. Erdős and A. Rényi, “On random graphs I.” Publicationes Mathematicae (Debrecen), vol. 6, pp. 290–297, 1959.
  • [35] R. Albert, H. Jeong, and A.-L. Barabási, “Error and attack tolerance of complex networks,” Nature, vol. 406, no. 6794, pp. 378–382, 2000.
  • [36] R. Cohen, K. Erez, D. Ben-Avraham, and S. Havlin, “Breakdown of the internet under intentional attack,” Physical Review Letters, vol. 86, no. 16, p. 3682, 2001.
  • [37] P. Clusella, P. Grassberger, F. J. Pérez-Reche, and A. Politi, “Immunization and targeted destruction of networks using explosive percolation,” Physical Review Letters, vol. 117, no. 20, p. 208301, 2016.
  • [38] Y. Liu, X. Wang, and J. Kurths, “Optimization of targeted node set in complex networks under percolation and selection,” Physical Review E, vol. 98, no. 1, p. 012313, 2018.
  • [39] ——, “Framework of evolutionary algorithm for investigation of influential nodes in complex networks,” IEEE Transactions on Evolutionary Computation, vol. 23, no. 6, pp. 1049–1063, 2019.
  • [40] C. Fan, Y. Sun, Z. Li, Y.-Y. Liu, M. Chen, and Z. Liu, “Dismantle large networks through deep reinforcement learning,” in ICLR representation learning on graphs and manifolds workshop, 2019.
  • [41] C. Borgs, M. Brautbar, J. Chayes, and B. Lucier, “Maximizing social influence in nearly optimal time,” in Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms.   SIAM, 2014, pp. 946–957.
  • [42] J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: A complex systems look at the underlying process of word-of-mouth,” Marketing letters, vol. 12, no. 3, pp. 211–223, 2001.
  • [43] D. Kempe, J. Kleinberg, and É. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 2003, pp. 137–146.
  • [44] E. Cho, S. A. Myers, and J. Leskovec, “Friendship and mobility: user movement in location-based social networks,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2011, pp. 1082–1090.
  • [45] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters,” Internet Mathematics, vol. 6, no. 1, pp. 29–123, 2009.
  • [46] M. Keeling, M. Woolhouse, R. May, G. Davies, and B. T. Grenfell, “Modelling vaccination strategies against foot-and-mouth disease,” Nature, vol. 421, no. 6919, pp. 136–142, 2003.