On computing HITS ExpertRank via lumping the hub matrix

This work was supported by the National Natural Science Foundation of China (Grant Nos. 12001363, 71671125).
Abstract
Dangling nodes are nodes with no out-links in the web graph. Lumping all dangling nodes into a single node saves considerable computational cost and many operations. In this paper, motivated by the large number of dangling nodes in the web graph, we develop theoretical results for HITS via the lumping method. We have three main findings. First, the HITS model can be lumped even though the matrix involved is not stochastic. Second, the hub vector of the nondangling nodes can be computed separately from that of the dangling nodes, but not vice versa. Third, the authoritative vector of the nondangling nodes is difficult to compute separately from that of the dangling nodes. Therefore, it is better to compute the hub vector of the hub matrix first, rather than the authoritative vector of the authoritative matrix or both simultaneously.
keywords:
HITS, Lumping, Nondangling nodes, Dangling nodes, Similarity transformation

MSC [2020]: 65F10, 65F50, 15A18, 15A21, 68P20

1 Introduction
The PageRank of Page and Brin, the Hyperlink-Induced Topic Search (HITS) of Kleinberg, and the stochastic approach for link-structure analysis (SALSA) of Lempel and Moran all use dominant eigenvectors of non-negative matrices to rank web pages [1, 2, 3]. PageRank is used in Google, HITS is used in the ask.com search engine, and SALSA is a combination of PageRank and HITS [3, 4]. Since the late 20th century, HITS has been another extremely successful application of dominant eigenvectors in modern web information retrieval. The two ranking vectors produced by HITS (the authoritative vector and the hub vector) provide the ExpertRanks. The HITS method has broad applications, such as product quality ranking and similarity ranking. For discussions of HITS, together with the literature on modifications that overcome its weaknesses, we refer readers to [5].
The eigenproblems arising in web information retrieval and data mining can be of huge dimension. Because of computer memory constraints, the power method has become the dominant method for solving the HITS and PageRank eigenproblems [4, 5]. The web is enormous: it contains a vast number of pages and grows quickly and dynamically, so computing a large web ranking vector can take hours or even days. For search-dependent HITS, the computation involves only the nodes related to a user's query and is relatively small, which is why there has been comparatively little acceleration work on HITS. For search-independent HITS, however, the matrices involved are usually of tremendous dimension, and effective numerical acceleration is highly desirable [6].
As is well known, the power method loses efficiency when the ratio between the second largest eigenvalue $\lambda_2$ (in magnitude) and the largest eigenvalue $\lambda_1$ is close to 1. The famous Krylov subspace methods can converge faster than the power method, but they are not well suited to such web information problems because of their relatively large storage and subspace dimension requirements [4, 5, 7]. Therefore, many acceleration methods for information retrieval models have been developed, including aggregation methods [5], extrapolation methods [8, 9], two-stage acceleration methods [10, 11], and other contributions [4, 7]. Most of them target the difficult case in which the gap ratio approaches 1. By lumping the Google matrix, Ipsen and Selee analyzed the relationship between the rankings of nondangling nodes and the rankings of dangling nodes [12]. To improve the computational efficiency of the HITS ExpertRank, a filtered power method combining Chebyshev polynomials has been proposed [6]. For more theoretical and numerical results on web information retrieval models, see [4, 5, 7, 9, 13]. A natural question arises: can the HITS model still enjoy similar lumping results, so that the computational cost can be reduced, even though the matrix involved is not stochastic?
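The power method mentioned above can be sketched in a few lines. This is a minimal dense-matrix illustration with our own function and variable names, not the paper's implementation; real web matrices are sparse and are applied matrix-free.

```python
import numpy as np

def power_method(M, tol=1e-10, max_iter=1000):
    """Power iteration for the dominant eigenpair of a nonnegative matrix M.
    Convergence is governed by the gap ratio |lambda_2| / lambda_1."""
    n = M.shape[0]
    x = np.full(n, 1.0 / n)         # uniform, 1-norm-normalized starting vector
    lam = 0.0
    for _ in range(max_iter):
        y = M @ x
        lam = np.linalg.norm(y, 1)  # for nonnegative iterates, this estimates lambda_1
        y /= lam
        if np.linalg.norm(y - x, 1) < tol:
            break
        x = y
    return lam, y
```

For $M = LL^{T}$, each step costs two sparse matrix-vector products when $M$ is kept in factored form rather than formed explicitly.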
2 HITS model
The HITS model uses an adjacency matrix $L$ to describe the web link structure graph. The eigenvectors of $L^{T}L$ and $LL^{T}$ are employed to reveal the relative importance (rank) of the corresponding web pages, yielding their authoritative vectors and hub vectors. Kleinberg [1] invented new matrices defined by
(1)  $L^{T}L \qquad \text{and} \qquad LL^{T},$
respectively, where $L = (l_{ij})$ is the adjacency matrix whose entries are given by $l_{ij} = 1$ if page $i$ links to page $j$, and $l_{ij} = 0$ otherwise.
In (1), if web page $i$ has no out-links (e.g., image files, or PDF files with no links to other pages), it is called a dangling node; otherwise it is called a nondangling node.
The HITS method updates the authoritative vector $x$ and the hub vector $y$ iteratively from some initial vectors $x^{(0)}$ and $y^{(0)}$:
(2)  $x^{(k)} = L^{T}y^{(k-1)}, \qquad y^{(k)} = Lx^{(k)}, \qquad k = 1, 2, \ldots,$ each update being followed by normalization.
Once one of $x^{(k)}$ and $y^{(k)}$ has converged, the other vector is obtained by one more multiplication by $L^{T}$ or $L$. From (2), we have the following expression: the authoritative vector $x$ of the authoritative matrix $L^{T}L$ and the hub vector $y$ of the hub matrix $LL^{T}$ are defined as the principal eigenvectors satisfying
(3)  $L^{T}Lx = \lambda x \qquad \text{and} \qquad LL^{T}y = \lambda y,$
respectively. Each of $L^{T}L$ and $LL^{T}$ is a symmetric positive semi-definite matrix and thus has nonnegative eigenvalues. The ExpertRanks are provided by the authoritative and hub vectors from HITS.
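The alternating updates (2) can be illustrated with a toy dense implementation (function and variable names are our own; real web matrices are sparse):

```python
import numpy as np

def hits(L, tol=1e-12, max_iter=1000):
    """Alternating HITS updates: x = L^T y (authorities), y = L x (hubs),
    each followed by 1-norm normalization."""
    n = L.shape[0]
    y = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        x = L.T @ y
        x /= np.linalg.norm(x, 1)
        y_new = L @ x
        y_new /= np.linalg.norm(y_new, 1)
        if np.linalg.norm(y_new - y, 1) < tol:
            y = y_new
            break
        y = y_new
    return x, y

# Tiny graph: page 0 links to 1 and 2, page 1 links to 2; page 2 is dangling.
L = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
x, y = hits(L)   # x: authoritative vector, y: hub vector
```

On this graph, page 2 (two in-links) receives the top authoritative score and page 0 (two out-links) receives the top hub score, matching the intuition behind (3).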
The formulation (3) cannot guarantee the uniqueness of $\lambda$, $x$, or $y$. To ensure uniqueness, we modify $L^{T}L$ and $LL^{T}$ so that they become primitive matrices. The authoritative matrix $A$ and the hub matrix $H$ are defined by
(4)  $A = \xi L^{T}L + \frac{1-\xi}{n}ee^{T} \qquad \text{and} \qquad H = \xi LL^{T} + \frac{1-\xi}{n}ee^{T}, \qquad 0 < \xi < 1,$
respectively. Accordingly, the authoritative vector and the hub vector are defined by
(5)  $Ax = \lambda x \qquad \text{and} \qquad Hy = \mu y, \qquad x, y \ge 0, \quad \|x\|_{1} = \|y\|_{1} = 1,$
respectively, where $e$ is a vector of all ones of suitable length. In this paper, we therefore mainly discuss the computation problem (5). In the following section, we develop theoretical and practical contributions for computational purposes.
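Assuming a PageRank-style rank-one primitive modification $H = \xi LL^{T} + \frac{1-\xi}{n}ee^{T}$ (our assumption for illustration), the hub vector in (5) can be computed without ever forming $H$ explicitly, since $Hy = \xi L(L^{T}y) + \frac{1-\xi}{n}(e^{T}y)e$:

```python
import numpy as np

def modified_hub_vector(L, xi=0.85, tol=1e-12, max_iter=5000):
    """Power method on H = xi * L L^T + (1 - xi)/n * e e^T, applied matrix-free:
    H y = xi * L (L^T y) + (1 - xi)/n * (e^T y) e."""
    n = L.shape[0]
    y = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # the scalar (1 - xi)/n * sum(y) broadcasts, realizing the e e^T term
        z = xi * (L @ (L.T @ y)) + (1.0 - xi) / n * y.sum()
        z /= z.sum()               # z stays nonnegative, so its sum is its 1-norm
        if np.linalg.norm(z - y, 1) < tol:
            return z
        y = z
    return y
```

The point of the factored form is that each iteration costs two sparse matrix-vector products plus $O(n)$ work, whereas $H$ itself is completely dense.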
3 Lumping and related theorems
The adjacency matrix is lumpable, in the sense that all the dangling nodes can be lumped into a single node [10, 12]. According to (1), the adjacency matrix admits the structure
(6)  $PLP^{T} = \begin{pmatrix} L_{11} & L_{12} \\ 0 & 0 \end{pmatrix},$
where $P$ is a suitable permutation matrix, $L_{11} \in \mathbb{R}^{k \times k}$, and $k$ is the number of nondangling nodes. Then we have
(7)  $PLL^{T}P^{T} = \begin{pmatrix} L_{11}L_{11}^{T} + L_{12}L_{12}^{T} & 0 \\ 0 & 0 \end{pmatrix}.$
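This block structure is easy to check numerically: after permuting the dangling nodes to the end, every dangling row and column of $LL^{T}$ vanishes, because a dangling node contributes a zero row to the permuted adjacency matrix. A toy sketch (variable names are our own):

```python
import numpy as np

# Toy adjacency matrix: rows 1 and 3 are all zero, i.e. dangling nodes.
L = np.array([[0, 1, 0, 1],
              [0, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

# Permute nondangling nodes first, dangling nodes last (this realizes P L P^T).
nondangling = np.where(L.any(axis=1))[0]
dangling = np.where(~L.any(axis=1))[0]
perm = np.concatenate([nondangling, dangling])
Lp = L[perm][:, perm]

# Hub matrix of the permuted graph: the dangling rows of Lp are zero,
# so the corresponding rows AND columns of Lp Lp^T vanish.
H = Lp @ Lp.T
k = len(nondangling)
assert not H[k:, :].any() and not H[:, k:].any()
```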
After the primitive modification, we obtain
(8)  $PHP^{T} = \xi \begin{pmatrix} L_{11}L_{11}^{T} + L_{12}L_{12}^{T} & 0 \\ 0 & 0 \end{pmatrix} + \frac{1-\xi}{n}ee^{T},$
where is a suitable length vector of all ones. By using the lumping method and the similarity transformation matrix
where $I$ denotes the identity matrix of the appropriate order, and $e_{i}$ denotes its $i$th column vector [13]. Define
(9)
then we have
(10)
Thus, we have proved the following theorem.
Theorem 3.1.
With the above notation, let
(11)
and be a suitable permutation matrix. Then
(12)
where
the blocks are defined as above. The lumped matrix has the same nonzero eigenvalues as the hub matrix $H$.
The following theorem establishes the relationship between the hub ranking vector of $H$ and the stationary distribution of the lumped matrix. The leading $k$ elements represent the hub scores of the nondangling nodes, and the trailing elements stand for the hub scores associated with the dangling nodes; thus the relationship between the rankings of dangling nodes and those of nondangling nodes is derived. To ease the following proof, we display the structure of the submatrix, separating its first leading row and column:
(13)
To validate the analytic relationship between the ranking vector of the nondangling nodes and that of the dangling nodes in the hub matrix model, we now present our main lumping results for HITS.
Theorem 3.2.
Proof.
According to Theorem 3.1, the lumped matrix has the same nonzero eigenvalues as $H$. From (12) and (14), we obtain an eigenvector of the lumped matrix associated with the dominant eigenvalue. Therefore,
(16)
is an eigenvector of $H$ associated with the dominant eigenvalue. Since the two matrices have the same nonzero eigenvalues, and the principal eigenvalue of $H$ is simple, the corresponding normalized nonnegative eigenvector of $H$ is unique. We repartition
(17)
Multiplying out and applying (12) and (13), we have
(18)
due to the fact that
Hence,
(19)
where $P$ is a suitable permutation matrix satisfying (6). As discussed above, this eigenvector is unique, and the conclusion follows. ∎
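Under a rank-one primitive modification of the form $H = \xi LL^{T} + \frac{1-\xi}{n}ee^{T}$ (assumed here for illustration), the separation of dangling scores can be sanity-checked numerically: every dangling row of $H$ equals $\frac{1-\xi}{n}e^{T}$, so all dangling hub scores coincide and follow from the dominant eigenvalue and the total mass $e^{T}y = 1$ alone. A toy check with our own matrix and names:

```python
import numpy as np

# Two nondangling nodes (0, 1) and two dangling nodes (2, 3).
L = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
n, xi = 4, 0.85
H = xi * L @ L.T + (1 - xi) / n * np.ones((n, n))  # assumed rank-one modification

w, V = np.linalg.eigh(H)              # symmetric: eigenvalues in ascending order
lam, y = w[-1], np.abs(V[:, -1])      # dominant eigenpair (Perron vector is positive)
y /= y.sum()

# Each dangling row of H is (1 - xi)/n * e^T, so lam * y_d = (1 - xi)/n * (e^T y)
# for every dangling node d: dangling scores are equal and determined by lam.
assert np.isclose(y[2], y[3])
assert np.isclose(y[2], (1 - xi) / n / lam)
```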
Remark 3.1.
Remark 3.2.
Remark 3.3.
Remark 3.4.
Although one can permute $L$ so that (6) holds, the web adjacency matrix is very sparse (usually only about ten entries per row), and one can likewise permute $L^{T}$ so that a structure analogous to (6) holds; this phenomenon can be verified with web data matrices from the online SuiteSparse (Florida) matrix collection (https://sparse.tamu.edu/). In particular, the number of all-zero columns may exceed the number of all-zero rows. In that case, computing the authoritative ranking vector first is recommended. Note, however, that this case is very rare.
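The comparison in this remark reduces to counting all-zero rows versus all-zero columns of the adjacency matrix. A toy sketch (the matrix and names are our own):

```python
import numpy as np

# Toy pattern: two all-zero rows (dangling nodes) but three all-zero columns.
L = np.array([[0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]])

zero_rows = int((~L.any(axis=1)).sum())   # dangling nodes: lumpable in L L^T
zero_cols = int((~L.any(axis=0)).sum())   # unreferenced pages: lumpable in L^T L

# When all-zero columns outnumber all-zero rows, lumping the authoritative
# matrix removes more nodes, so the authoritative vector should go first.
order = "authoritative-first" if zero_cols > zero_rows else "hub-first"
```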
4 Conclusion
In this paper, we have studied a HITS computation approach that lumps the dangling nodes of the hub matrix into a single node. We have thereby answered the HITS computation question raised in the introduction.
HITS can be computed by the lumping approach even though the matrices involved are not stochastic. The approach we discuss is useful whether the HITS model is search-dependent or not. For the hub vector, Theorem 3.2 shows that the rankings of the nondangling nodes can be computed independently of those of the dangling nodes, while the rankings of the dangling nodes depend on the rankings of the nondangling nodes. According to Remark 3.3, the authoritative vector is relatively difficult to compute compared with the hub vector. We therefore suggest computing the hub vector first, rather than the authoritative vector or both simultaneously.
Further research may include how to compute SALSA by the lumping method. Questions such as the relationship between the ranking vectors of dangling and nondangling nodes in the SALSA model are also worth studying.
References
- Page et al. [1999] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: Bringing order to the web, Stanford Digital Libraries, 1999 (available online from http://dbpubs.stanford.edu:8090/pub/1999-66).
- Kleinberg [1999] J. M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, J. ACM 46 (1999) 604–632.
- Lempel and Moran [2000] R. Lempel, S. Moran, The stochastic approach for link-structure analysis (SALSA) and the TKC effect, Computer Networks 33 (2000) 387–401.
- Langville and Meyer [2005] A. N. Langville, C. D. Meyer, A survey of eigenvector methods for web information retrieval, SIAM review 47 (2005) 135–161.
- Langville and Meyer [2006] A. N. Langville, C. D. Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, 2006.
- Zhou [2012] Y.-K. Zhou, Practical acceleration for computing the HITS ExpertRank vectors, Journal of Computational and Applied Mathematics 236 (2012) 4398–4409.
- Eldén [2007] L. Eldén, Matrix methods in data mining and pattern recognition, SIAM, 2007.
- Brezinski and Redivo-Zaglia [2006] C. Brezinski, M. Redivo-Zaglia, The PageRank Vector: Properties, Computation, Approximation, and Acceleration, SIAM Journal on Matrix Analysis and Applications 28 (2006) 551–575.
- Feng et al. [2021] Y.-H. Feng, J.-X. You, Y.-X. Dong, An Extrapolation Iteration and Its Lumped Type Iteration for Computing PageRank, Bulletin of the Iranian Mathematical Society (2021) 1–18.
- Lee et al. [2007] C. P. Lee, G. H. Golub, S. A. Zenios, A Two-Stage Algorithm for Computing PageRank and Multistage Generalizations, Internet Mathematics 4 (2007) 299–327.
- Dong et al. [2017] Y.-X. Dong, C.-Q. Gu, Z.-B. Chen, An Arnoldi-Inout method accelerated with a two-stage matrix splitting iteration for computing PageRank, Calcolo 54 (2017) 1–23.
- Ipsen and Selee [2007] I. C. F. Ipsen, T. M. Selee, PageRank computation with special attention to dangling nodes, SIAM Journal on Matrix Analysis and Applications 29 (2007) 1281–1296.
- Dong et al. [2021] Y.-X. Dong, Y.-H. Feng, J.-X. You, J.-R. Guan, Comments on lumping the Google matrix, arXiv preprint arXiv:2107.11080 (2021).