PageRank Approach to Ranking National Football Teams

Verica Lazova
Faculty of Computer Science and Engineering
Cyril and Methodius University
Skopje, R.Macedonia
lazova992 at gmail.com

Lasko Basnarkov
Faculty of Computer Science and Engineering
Cyril and Methodius University
Skopje, R.Macedonia

Abstract

The Football World Cup, as the world’s favorite sporting event, is a source of both entertainment and an overwhelming amount of data about the games played. In this paper we analyse the available data on football world championships from 1930 until today. Our goal is to rank the national teams based on all matches played during the championships. For this purpose, we apply the PageRank with restarts algorithm to a graph built from the games played during the tournaments. Several statistics, such as matches won and goals scored, are combined in different metrics that assign weights to the links in the graph. Our results indicate that the random walk approach, with the right metrics, can indeed produce relevant rankings comparable to the official FIFA all-time ranking board.

I Introduction

Football, being the world’s most favored sport, draws people’s attention in every field, from simple entertainment to the more complex objectives of statistics, research and data analysis. Since the FIFA World Cup first took place in 1930, around 20 tournaments have been held, each comprising about 64 matches, not counting the qualification rounds [2, 3]. There is therefore a significant amount of data that one can inspect, analyse and draw conclusions from.

With that in mind, researchers are tackling problems regarding playing strategy, ranking of teams and performance analysis from different aspects, including economic, demographic, cultural and climatic factors [4]. A team’s game strategy, for example, can be studied from a graph theory perspective by constructing a network of passes between players. In this context different centrality measures can be used to determine the importance of particular players [5, 6, 7]. Another subject of interest is modelling football matches in terms of the scores during the game. For example, in [8] the authors discuss a statistical model for scoring times in a match.

Here we address the problem of ranking national football teams. Our main task is to use the available statistics to come up with an alternative ranking method for the football teams based on their achievements at the world cups. Different rating methods are currently in use and they produce relevant results. FIFA has its own 4-year points-based FIFA/Coca-Cola rating system [9] and the world cup all-time ratings [10], which include all championships since their origin. There are also the World Football Elo Ratings, based on the rating system FIDE uses to rate chess players [11].

A good ranking method should take into account not only how many times a team has won, but also how strong the defeated opponents were. A victory against a stronger opponent is more significant than a victory against a weaker one. One method that incorporates such logic is the PageRank (random walk) method, which is applicable to a wide variety of network-based problems that require ranking in some way. Besides the well-known problem of rating web pages [12], it is also used in social network analysis, in tasks such as link prediction, information diffusion and community detection [13, 14, 15]. It is also used in NLP for text summarization and word sense disambiguation [16, 17]. For previous attempts at employing the PageRank mechanism in sporting events we refer the reader to [18, 19, 20].

The rest of the paper is organized as follows. In Section II we present the ranking problem and the PageRank-based method for solving it. We also give a description and statistics of the data that were available to us. The obtained results are presented in Section III, including a discussion and a comparison to the official rankings, and we conclude the paper in Section IV.

II Materials and Methods

II-A Data

The data we used were obtained from 11v11, a web site for football statistics that contains all-time figures about the matches of the world cup, qualification games included [21]. For each national team there is information on which countries they have played against, the number of matches won, drawn and lost, as well as the number of scored and conceded goals during all match-ups. Throughout this paper we use the term match-up for a single game played between two teams, and a match-up pair is any two teams that have played against each other. The dataset contains 210 countries and statistics on 2335 match-up pairs, or 7141 games in total, during which 20298 goals were scored. The average number of games per match-up pair is 3.0582, and the average number of goals scored by each team in a match-up pair is 4.3465. Mexico versus USA is the pair with the largest number of games played against one another: about 28 games, in which around 100 goals were scored. Of these games, 15 were won by the US, 6 were drawn and the other 7 resulted in a victory for Mexico. The country with the most games played is Brazil, with about 200 matches; as expected, it is also the country with the most games won and the most goals scored.
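As a quick sanity check, the averages quoted above can be recomputed from the reported totals. This is only an illustrative sketch, and the variable names are ours rather than part of the dataset. The figure 4.3465 is recovered when the total number of goals is divided by twice the number of pairs, which suggests that it is an average per team within a pair.

```python
# Totals reported in the text for the 11v11 dataset.
pairs = 2335   # match-up pairs
games = 7141   # games played in total
goals = 20298  # goals scored in total

games_per_pair = games / pairs
goals_per_pair = goals / pairs
# Each pair involves two teams, so the per-team average is half of that.
goals_per_team_in_pair = goals / (2 * pairs)

print(round(games_per_pair, 4))          # 3.0582
print(round(goals_per_pair, 4))          # 8.6929
print(round(goals_per_team_in_pair, 4))  # 4.3465
```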

II-B Method

The ranking method explored throughout this paper is the PageRank with restarts algorithm applied to a graph built from the supplied data [12]. Each national team is a single node in the graph, and two nodes are linked if the two teams (the match-up pair) have ever competed against each other in a world cup tournament. The weight of a link is determined by a weighting function that involves one or more metrics, such as the number of games played between a match-up pair, the number of won, lost and drawn games, or the number of scored and conceded goals. The weighting functions we have tested are given in Table I.

Table I: Set of tested weighting functions and their scores, given as the normalized number of inversions with respect to the official rankings. Lower is better.

#	WEIGHTING FUNCTION	INVERSIONS
1	$f_{i,j}=\frac{l_{i,j}}{g_{i,j}}\cdot\frac{1}{G-g_{i,j}+1}$	0.032
2	$f_{i,j}=\frac{l_{i,j}}{g_{i,j}}$	0.038
3	$f_{i,j}=\frac{l_{i,j}}{g_{i,j}}+\frac{c_{i,j}}{c_{i,j}+s_{i,j}}$	0.040
4	$f_{i,j}=l_{i,j}$	0.040
5	$f_{i,j}=\frac{c_{i,j}}{s_{i,j}}$	0.041
6	$f_{i,j}=\frac{l_{i,j}}{w_{i,j}}$	0.043
7	$f_{i,j}=\frac{l_{i,j}}{g_{i,j}}+0.5\cdot\frac{d_{i,j}}{g_{i,j}}$	0.044
8	$f_{i,j}=\frac{c_{i,j}}{c_{i,j}+s_{i,j}}$	0.044
9	$f_{i,j}=\frac{c_{i,j}}{g_{i,j}}$	0.046
10	$f_{i,j}=c_{i,j}$	0.050

Within the functions we use the following notation:

$f_{i,j}$: weight of the link from node $i$ to node $j$;
$g_{i,j}$: number of games played between the two teams;
$l_{i,j}$: number of games lost by team $i$ amongst all the games $i$ and $j$ played;
$w_{i,j}$: number of games won by team $i$ amongst all the games $i$ and $j$ played;
$c_{i,j}$: number of goals conceded by team $i$ during all the games $i$ and $j$ played;
$s_{i,j}$: number of goals scored by team $i$ during all the games $i$ and $j$ played;
$d_{i,j}$: number of games drawn between the two teams;
$G$: maximum number of games played between any match-up pair.
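To make the notation concrete, a few of the weighting functions from Table I can be sketched in code. The `MatchUp` record, the function names and the example values are our own illustrative choices, not the implementation used for the experiments. The handling of a pair with no goals at all is an assumption, since it is not specified here.

```python
from dataclasses import dataclass


@dataclass
class MatchUp:
    """Statistics of team i against team j; field names follow the notation above."""
    g: int  # games played between the two teams
    l: int  # games lost by team i
    w: int  # games won by team i
    d: int  # games drawn
    c: int  # goals conceded by team i
    s: int  # goals scored by team i


def loss_ratio(m: MatchUp) -> float:
    """Function 2 in Table I: share of the games that team i lost to team j."""
    return m.l / m.g


def loss_ratio_scaled(m: MatchUp, G: int) -> float:
    """Function 1 in Table I: loss ratio multiplied by 1 / (G - g + 1)."""
    return (m.l / m.g) * (1.0 / (G - m.g + 1))


def loss_and_draw_ratio(m: MatchUp) -> float:
    """Function 7 in Table I: a draw counts as half a loss."""
    return m.l / m.g + 0.5 * m.d / m.g


def conceded_share(m: MatchUp) -> float:
    """Function 8 in Table I: share of the pair's goals conceded by team i.

    Returning 0 when no goals were scored at all is our assumption.
    """
    total = m.c + m.s
    return m.c / total if total > 0 else 0.0


# Hypothetical record: team i played 3 games against j, lost 1 and won 2,
# conceding 2 goals and scoring 4 (made-up numbers, for illustration only).
example = MatchUp(g=3, l=1, w=2, d=0, c=2, s=4)
print(loss_ratio(example))
print(loss_ratio_scaled(example, G=5))
print(conceded_share(example))
```

Note that function 1 gives more weight to pairs that have met often: the factor $1/(G-g_{i,j}+1)$ equals 1 for the pair with the most games and shrinks as the number of games decreases.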

Another factor that affects the PageRank is the damping factor. The damping factor corresponds to the probability that a random walker discontinues the walk and jumps to a random node [22]. Besides being necessary to ensure that the random walk converges to a stationary distribution, the damping factor is also intuitive. The intuition behind its use within our match-up network is the following: although the graph is dense, not every team has played against every other. So when using weighting metrics such as the loss ratio (function 2 in Table I), the damping factor amounts to adding some winning chances to all the teams that a given team has never played against. It also adds some winning chances to a team that has never won a game within a match-up.

The PageRank is calculated using the power method [23]. This method is an iterative algorithm (eq. 2) that finds the dominant eigenvector, which corresponds to the invariant distribution of the time a random walker spends at each node, i.e. the PageRank. By normalizing the adjacency matrix $A$ and applying the damping factor $d$ we get the transition probability matrix $Q$, with elements as given in eq. 1.

$\displaystyle Q_{i,j}=(1-d)\cdot\frac{A_{i,j}}{\sum\limits_{k=1}^{N}A_{i,k}}+\frac{d}{N}$	(1)

$\displaystyle\pi^{T}=\pi^{T}Q.$	(2)

Note that $Q$ is guaranteed to be irreducible and aperiodic as a consequence of the nonzero damping factor $d$.
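As a short check, assume that every row of $A$ has a nonzero sum, i.e. every team has at least one outgoing link with positive weight. Then each row of $Q$ in eq. 1 sums to one, so $Q$ is a valid transition matrix:

$\displaystyle\sum_{j=1}^{N}Q_{i,j}=(1-d)\cdot\frac{\sum_{j=1}^{N}A_{i,j}}{\sum_{k=1}^{N}A_{i,k}}+N\cdot\frac{d}{N}=(1-d)+d=1.$

The power method can then start from an arbitrary distribution, for instance the uniform one, and repeat the update of eq. 2 until the vector stops changing. The superscript $(t)$, denoting the iteration number, is our notation:

$\displaystyle\pi^{(t+1)T}=\pi^{(t)T}Q,\qquad\pi^{(0)}_{i}=\frac{1}{N}.$

Rows of $A$ that sum to zero (a team with no weighted outgoing links) are not covered by eq. 1 as written and would need separate treatment.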

Table II: Number of games played and results for each match-up pair

PAIR	GAMES	RESULTS
A-B	3	A wins 2, B wins 1
A-C	3	A wins 2, C wins 1
A-D	3	A wins 3, D wins 0
B-C	3	C wins 3, B wins 0
B-D	3	D wins 3, B wins 0
C-D	3	C wins 1, D wins 2

Table III: The PageRank of each team in descending order

TEAM	GAMES	WIN	PAGERANK
A	9	7	0.333
C	9	5	0.281
D	9	5	0.211
B	9	1	0.175

Figure 1: Graph representation of the games played. The size of each node is proportional to its PageRank.

II-C Example

For the sake of demonstration, let’s consider a toy example that illustrates our goal. Suppose there are 4 teams and the statistics for each pair are as shown in Table II. The graph (Fig. 1) is built using the loss ratio as metric (function 2 in Table I). Therefore the weight of a link from $i$ to $j$ is the fraction of their games that $i$ has lost to $j$. For instance, there is a link from A to C with weight $\frac{1}{3}$ and a link from C to A with weight $\frac{2}{3}$. That means that out of the 3 matches A and C have played against each other, A has won 2, C has won 1 and none were drawn. The next step is the calculation of the PageRank. For this we need the transition probability matrix, which is calculated according to eq. 1 with a common damping factor value of 0.15.

Finally, the results are shown in Table III. As expected, A is the highest ranked and B the lowest ranked team. Teams C and D have both won 5 games, as shown in Table III. However, PageRank takes into account the strength of the defeated opponent, not only the number of wins. As a result, team C is ranked higher since it has won a game against A, which is considered a strong opponent, in contrast to team D, which has wins only against weaker opponents.
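The toy example can be reproduced with a few lines of code. The following is a minimal sketch in plain Python, assuming the loss ratio as link weight, eq. 1 with $d=0.15$, and the power method started from the uniform distribution. The variable names, the iteration cap and the convergence threshold are our choices and not part of the original computation.

```python
# losses[i][j] = number of games team i lost to team j, read off Table II.
losses = {
    "A": {"B": 1, "C": 1, "D": 0},
    "B": {"A": 2, "C": 3, "D": 3},
    "C": {"A": 2, "B": 0, "D": 2},
    "D": {"A": 3, "B": 0, "C": 1},
}
GAMES_PER_PAIR = 3
DAMPING = 0.15  # probability of jumping to a random node, as in eq. 1

teams = sorted(losses)
n = len(teams)

# Loss ratio (function 2 in Table I) as link weight, then eq. 1 row by row.
Q = {}
for i in teams:
    weights = {j: losses[i].get(j, 0) / GAMES_PER_PAIR for j in teams}
    row_sum = sum(weights.values())  # nonzero here: every team lost at least once
    Q[i] = {j: (1 - DAMPING) * weights[j] / row_sum + DAMPING / n for j in teams}

# Power method (eq. 2): repeat pi^T <- pi^T Q, starting from the uniform vector.
pi = {t: 1.0 / n for t in teams}
for _ in range(1000):
    new_pi = {j: sum(pi[i] * Q[i][j] for i in teams) for j in teams}
    change = sum(abs(new_pi[t] - pi[t]) for t in teams)
    pi = new_pi
    if change < 1e-12:
        break

ranking = sorted(teams, key=lambda t: pi[t], reverse=True)
for t in ranking:
    print(t, round(pi[t], 3))
```

Under these assumptions the sketch yields the ordering A, C, D, B of Table III, with values that agree with the table to within about 0.01. The small remaining differences may stem from implementation details that are not given in the text.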

III Results and Discussion

Table IV: Top 20 highest ranked national teams using the combination of loss ratio and number of games the two teams played as weighting function (function 1 in Table I) and a damping factor of 0.05. The fourth column gives their position in the official ranking.

#	COUNTRY	PAGERANK	OFFICIAL
1	Brazil	0.040375	1
2	Italy	0.037992	3
3	Germany	0.033801	2
4	Netherlands	0.031052	8
5	Argentina	0.029159	4
6	England	0.029100	6
7	Spain	0.027904	5
8	France	0.025670	7
9	Czechoslovakia	0.025155	NA
10	Sweden	0.022882	10
11	Mexico	0.022034	13
12	Hungary	0.022014	16
13	Uruguay	0.020660	9
14	Belgium	0.020255	14
15	Portugal	0.020211	17
16	Poland	0.019528	15
17	Denmark	0.019206	25
18	Croatia	0.018993	27
19	Switzerland	0.016650	21
20	Yugoslavia	0.016466	NA

To find the most precise ranking, we tried several different weighting functions, and almost all of them delivered similar results. The results were evaluated by comparing the PageRank ordering to the official world cup ranking. We used the normalized number of inversions as evaluation metric [24], taking the official FIFA all-time rankings as the reference ordering. The tested weighting functions and their scores are listed in Table I. A lower score means that the results generated using the corresponding metric are more similar to the official ranking. We used only the top 30 highest ranked teams in the comparison, because we wanted to give them higher priority and get their ordering right, at the cost of misplacing some of the lower rated teams.

The error of the weighting functions also depends on the damping factor. The minimum is achieved when the damping factor is very small, around 0.05. That is the value we used in the evaluation of the metrics shown in Table I. Fig. 3 shows the errors (in normalized inversion count) for the top 5 metrics as functions of the damping factor. As expected, the error increases as the damping factor grows.

Table IV shows the top 20 teams (for brevity) according to our best weighting function. The fourth column contains the position of each team on the official rankings board. The position is marked green if the team holds the same place in both our ranking and the official one, and red if there is a large displacement (Denmark and Croatia). If a team is not found in the official ranking (Czechoslovakia and Yugoslavia in our case), its position is marked NA.

Fig. 2 shows the match-up graph. Each team is a node in the graph, represented by its national flag, and the size of each node is proportional to its PageRank. A portion of the links is omitted from the figure for the sake of clarity, so the real graph is much denser than it appears.
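Returning to the evaluation metric, the normalized number of inversions can be sketched as follows. We assume that the count of pairs ordered differently in the two rankings is divided by the total number of pairs, $n(n-1)/2$, and that teams missing from the reference ranking are simply skipped. Neither detail is spelled out above, so both are assumptions, and the function name is ours.

```python
from itertools import combinations


def normalized_inversions(ranking, reference):
    """Fraction of team pairs that `ranking` orders differently from `reference`.

    Both arguments are lists of team names, best team first. Teams that do
    not appear in `reference` are ignored (our assumption).
    """
    ref_pos = {team: k for k, team in enumerate(reference)}
    common = [team for team in ranking if team in ref_pos]
    n = len(common)
    if n < 2:
        return 0.0
    # combinations() keeps the order of `ranking`, so a is ranked above b;
    # the pair is an inversion if the reference ranks a below b.
    inversions = sum(
        1 for a, b in combinations(common, 2) if ref_pos[a] > ref_pos[b]
    )
    return inversions / (n * (n - 1) / 2)


# First five teams of Table IV, with the official order implied by its last column.
ours = ["Brazil", "Italy", "Germany", "Netherlands", "Argentina"]
official = ["Brazil", "Germany", "Italy", "Argentina", "Netherlands"]
print(normalized_inversions(ours, official))  # 0.2
```

In this small illustration two of the ten pairs are swapped (Italy and Germany, Netherlands and Argentina), which gives 0.2. A score of 0 means identical orderings and 1 means a fully reversed ordering.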

A possible issue when using PageRank as a ranking method is the following: a node can obtain a high PageRank score either if it has a highly ranked neighbour from which it receives a significant amount of votes, or if it has many low ranked neighbours. In our case, if a national team is highly ranked then it must have either defeated many low ranked teams or achieved remarkable results against a highly ranked opponent. This property of the random walk affects our results, especially since we treat all matches equally, without taking into account whether a match is a qualification round or a final. As a result, there might be teams that have received a high ranking only because they have played and won against many low ranked opponents in less significant qualification matches.

IV Conclusion

Throughout this paper we explored the PageRank method for ranking national football teams. Our results showed that even with simple weighting functions, such as the ratio of goals scored or matches won, the PageRank algorithm produces promising results. The rankings this method produced are similar to the official FIFA all-time rankings. However, it is difficult to evaluate whether PageRank, with a more sophisticated weighting function and more features within the dataset, could lead to a better rating scheme than the official one. Nevertheless, under the assumption that the FIFA ranking system is proper and accurate, the random walk approach, despite the simple dataset and weighting metrics, can replicate its results to a great extent.

Acknowledgment

We would like to thank Andrej Gajduk and Igor Trpevski for fruitful discussions and comments. VL also thanks TAPAN MNG D.O.O.E.L. Negotino for the internship opportunity during which the presented work was completed.

References

[2] Fédération Internationale de Football Association et al., “FIFA competitions and Olympic football tournaments 1908-2017,” 2014. [Online]. Available: http://www.fifa.com/worldcup/organisation/documents/index.html
[3] Fédération Internationale de Football Association et al., “FIFA World Cup comparative statistics 1982-2014,” 2014. [Online]. Available: http://www.fifa.com/worldcup/organisation/documents/index.html
[4] R. Hoffmann, L. C. Ging, and B. Ramasamy, “The socio-economic determinants of international soccer performance,” Journal of Applied Economics, vol. 5, no. 2, pp. 253–272, 2002.
[5] J. L. Peña and H. Touchette, “A network theory analysis of football strategies,” arXiv preprint arXiv:1206.6904, 2012.
[6] J. Duch, J. S. Waitzman, and L. A. N. Amaral, “Quantifying the performance of individual players in a team activity,” PloS one, vol. 5, no. 6, p. e10937, 2010.
[7] M. Hughes and I. Franks, “Analysis of passing sequences, shots and goals in soccer,” Journal of Sports Sciences, vol. 23, no. 5, pp. 509–514, 2005.
[8] M. Dixon and M. Robinson, “A birth process model for association football matches,” Journal of the Royal Statistical Society: Series D (The Statistician), vol. 47, no. 3, pp. 523–538, 1998.
[9] “FIFA/Coca-Cola world ranking,” http://www.fifa.com/fifa-world-ranking/ranking-table/men/, accessed: 2015-01-25.
[10] Fédération Internationale de Football Association et al., “FIFA World Cup all-time ranking,” 2014. [Online]. Available: http://www.fifa.com/worldcup/organisation/documents/index.html
[11] “World Football Elo Ratings,” http://www.eloratings.net, accessed: 2015-02-11.
[12] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: Bringing order to the web,” 1999.
[13] L. Backstrom and J. Leskovec, “Supervised random walks: predicting and recommending links in social networks,” in Proceedings of the fourth ACM international conference on Web search and data mining. ACM, 2011, pp. 635–644.
[14] M. Kimura and K. Saito, “Tractable models for information diffusion in social networks,” in Knowledge Discovery in Databases: PKDD 2006. Springer, 2006, pp. 259–271.
[15] A. Stanoev, D. Smilkov, and L. Kocarev, “Identifying communities by influence dynamics in social networks,” Physical Review E, vol. 84, no. 4, p. 046102, 2011.
[16] G. Erkan and D. R. Radev, “LexRank: Graph-based lexical centrality as salience in text summarization,” J. Artif. Intell. Res. (JAIR), vol. 22, no. 1, pp. 457–479, 2004.
[17] E. Agirre and A. Soroa, “Personalizing PageRank for word sense disambiguation,” in Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2009, pp. 33–41.
[18] J. P. Keener, “The Perron-Frobenius theorem and the ranking of football teams,” SIAM Review, vol. 35, no. 1, pp. 80–93, 1993.
[19] S. Mukherjee, “Identifying the greatest team and captain—a complex network approach to cricket matches,” Physica A: Statistical Mechanics and its Applications, vol. 391, no. 23, pp. 6066–6076, 2012.
[20] F. Radicchi, “Who is the best player ever? a complex network analysis of the history of professional tennis,” PloS one, vol. 6, no. 2, p. e17249, 2011.
[21] “11v11 - home of football statistics and history,” http://www.11v11.com, accessed: 2015-01-25.
[22] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” Computer networks and ISDN systems, vol. 30, no. 1, pp. 107–117, 1998.
[23] A. N. Langville and C. D. Meyer, “Deeper inside PageRank,” Internet Mathematics, vol. 1, no. 3, pp. 335–380, 2004.
[24] D. E. Knuth, The art of computer programming: sorting and searching. Pearson Education, 1998, vol. 3.