In this post we continue our investigation of Who wins the 2018 FIFA World Cup™? and take a first look at historical data on FIFA football matches. The data were obtained from the site www.eloratings.net using the Wayback Machine and some copy-and-paste. Unfortunately, the data set obtained in this way is incomplete: it does not cover all FIFA matches of this millennium. However, we were able to retrieve all matches of the 2018 FIFA World Cup participants plus the matches of Italy, the Netherlands, and Austria. Yes, Italy and the Netherlands did not qualify, but we are still convinced that these two teams are among the strongest in the world. We added Austria to pay homage to the country where we spent a lot of quality time.
We will try to answer questions like:
- What is the most probable outcome of a game? [->]
- What is the probability of a win, a draw or a loss? [->]
- What is the probability that the stronger team wins? [->] And with what result? [->]
The answers to these questions can be found by following the links after the questions. Detailed answers are given below.
Programming code
The analysis is done using the open-source software R. We first load some useful packages and set up our ggplot2 theme.
library("knitr")
knitr::opts_chunk$set(echo = TRUE)
opts_knit$set(root.dir = "~/Datatreker")
library("dplyr")
library("reshape2")
library("ggplot2")
library("scales")
library("ggthemes")
theme_set(theme_solarized_2())
theme_update(axis.text.x = element_text(face="bold", color="#993333",
size=10),
axis.text.y = element_text(face="bold", color="#993333",
size=10),
title=element_text(color="black"))
Pre-processing the data
After several pre-processing steps that we skip here, we obtained the file “Data2000.csv” (available here), which contains all the matches we retrieved between 1/1/2000 and 31/12/2017.
setwd("~/Datatreker")
data2000<-read.csv("Data2000.csv")
head(data2000)
## Date TeamA TeamB GoalA GoalB Competition
## 1 2017-12-16 Japan South Korea 1 4 East Asian Cup
## 2 2017-12-12 Japan China 2 1 East Asian Cup
## 3 2017-12-12 South Korea North Korea 1 0 East Asian Cup
## 4 2017-12-09 Japan North Korea 1 0 East Asian Cup
## 5 2017-12-09 China South Korea 2 2 East Asian Cup
## 6 2017-11-15 Australia Honduras 3 1 World Cup qualifier
## Location EloABefore EloAAfter EloBBefore EloBAfter
## 1 Home 1746 1697 1702 1751
## 2 Home 1739 1746 1570 1563
## 3 Neutral 1694 1702 1449 1441
## 4 Home 1735 1739 1453 1449
## 5 Neutral 1562 1570 1702 1694
## 6 Home 1703 1718 1611 1596
So what do we have in our data? The date of the game, the names of the opponents, the goals scored, the corresponding competition (including friendly matches), the location (home or neutral ground), and the Elo points of the two teams before and after the match. Altogether, we have data on 6,706 games.
A first look at the data shows that it is quite heterogeneous. We keep only the matches in which both teams had more than 1600 Elo points before the game. We also split the data by location, i.e. neutral ground or home ground.
DataElo<-filter(data2000, EloABefore>1600, EloBBefore>1600)
DataEloNeutral<-filter(DataElo, Location=="Neutral")
DataEloHome<-filter(DataElo, Location=="Home")
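To make the effect of the two filters concrete, here is a minimal base-R sketch on the six rows printed by head() above. The data frame `toy` is a hand-typed copy of a few columns, introduced only for illustration. Note that the comparison `> 1600` is strict, so a team with exactly 1600 points would be excluded.

```r
# Hand-typed copy of a few columns from the six rows shown by head(data2000)
toy <- data.frame(
  TeamA      = c("Japan", "Japan", "South Korea", "Japan", "China", "Australia"),
  TeamB      = c("South Korea", "China", "North Korea", "North Korea",
                 "South Korea", "Honduras"),
  Location   = c("Home", "Home", "Neutral", "Home", "Neutral", "Home"),
  EloABefore = c(1746, 1739, 1694, 1735, 1562, 1703),
  EloBBefore = c(1702, 1570, 1449, 1453, 1702, 1611)
)

# Base-R equivalent of filter(data2000, EloABefore>1600, EloBBefore>1600):
# both conditions must hold, i.e. both teams must have more than 1600 points
toy_elo <- subset(toy, EloABefore > 1600 & EloBBefore > 1600)

# Split the remaining games by location, as in the post
toy_neutral <- subset(toy_elo, Location == "Neutral")
toy_home    <- subset(toy_elo, Location == "Home")

print(toy_elo[, c("TeamA", "TeamB", "Location")])
```

In this excerpt only Japan vs. South Korea and Australia vs. Honduras survive the filter, both of them home games; in each of the other four rows one team has fewer than 1600 points.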
The games on neutral ground appear to be the more important ones; in particular, they include most of the games played during tournaments and exclude games in the qualifiers and friendly matches. Many games in the qualifiers are not of high importance. We will also see that games in the qualifiers of the confederations of Africa, Asia, North and Central America and the Caribbean, and Oceania have low predictive power for the final rounds. They even worsen the forecasts. In other words, the way a team plays against a team of the same strength says nothing about how it will play against a much stronger team. We discard friendly matches since they are played under completely different conditions.
The data obtained from www.eloratings.net has one particularity: the order of the teams, Team A vs. Team B, is not necessarily the official order of the match, i.e. home team vs. guest team. Instead, the data is arranged such that the team that won Elo points in the game is named first.
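The ordering of the teams can be checked directly from the Elo columns. The following base-R sketch does so for the six rows printed by head() above; the data frame `elo` is a hand-typed copy used only for illustration, and six rows are of course far too few to establish a rule.

```r
# Hand-typed copy of the Elo columns from the six rows shown by head(data2000)
elo <- data.frame(
  Location   = c("Home", "Home", "Neutral", "Home", "Neutral", "Home"),
  EloABefore = c(1746, 1739, 1694, 1735, 1562, 1703),
  EloAAfter  = c(1697, 1746, 1702, 1739, 1570, 1718),
  EloBBefore = c(1702, 1570, 1449, 1453, 1702, 1611),
  EloBAfter  = c(1751, 1563, 1441, 1449, 1694, 1596)
)

# Elo points won (positive) or lost (negative) by each team in the game
elo$GainA <- elo$EloAAfter - elo$EloABefore
elo$GainB <- elo$EloBAfter - elo$EloBBefore

# Does the first-named team win Elo points?
elo$AGained <- elo$GainA > 0

print(elo[, c("Location", "GainA", "GainB", "AGained")])
```

In this small excerpt Team A wins points in five of the six rows, including both neutral-ground games. The exception is the first row, a home game. This suggests, without proving it, that the ordering can be relied on mainly for neutral-ground games, which are the ones analysed in the rest of the post.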
Winning Elo points in a game can be considered as winning the game. Is this coherent with the Elo points of the teams? Yes, it is!
t.test(DataEloNeutral$EloABefore, DataEloNeutral$EloBBefore)
##
## Welch Two Sample t-test
##
## data: DataEloNeutral$EloABefore and DataEloNeutral$EloBBefore
## t = 10.05, df = 1814.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 46.78644 69.47638
## sample estimates:
## mean of x mean of y
## 1840.522 1782.391
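The Welch test above treats the two rating columns as independent samples, although each pair of ratings belongs to the same game. A paired test is a natural complement. The sketch below only illustrates the call on the six rating pairs printed by head() above; it is not a re-analysis of the full data, and the vectors `elo_a` and `elo_b` are hand-typed for illustration.

```r
# Pre-match ratings of Team A and Team B in the six rows shown by head(data2000)
elo_a <- c(1746, 1739, 1694, 1735, 1562, 1703)
elo_b <- c(1702, 1570, 1449, 1453, 1702, 1611)

# Paired t-test: tests whether the mean of the per-game differences is zero
paired <- t.test(elo_a, elo_b, paired = TRUE)

# The estimate reported by the test is the mean of the per-game differences
print(unname(paired$estimate))
print(mean(elo_a - elo_b))
```

On the real data the corresponding call would be t.test(DataEloNeutral$EloABefore, DataEloNeutral$EloBBefore, paired = TRUE).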
Most likely results
Let us summarize all the different outcomes.
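# Tabulate every scoreline (GoalA, GoalB) as a share of all neutral-ground games
df <- select(DataEloNeutral, GoalA, GoalB)
tdf <- round(table(df)/nrow(df), 3)
df2 <- as.data.frame(tdf)
# Keep scorelines with a non-zero share, label them "a:b" and convert to percent
df3 <- df2 %>% filter(Freq>0) %>% arrange(desc(Freq)) %>%
transmute(Result= paste(GoalA, GoalB, sep=":"), Freq=Freq*100)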
p <- ggplot(data=df3, aes(x=reorder(Result, Freq), y=Freq)) +
geom_bar(stat="identity", fill="steelblue")+
geom_text(aes(label=Freq), hjust=-0.3, size=3.5, color="#993333")+
labs(x="Result", y="Percentage (%)", caption="Based on all matches of the participants of 2018 FIFA World Cup (plus Italy, Netherlands \n and Austria) against teams with at least 1600 Elo points between 1/1/2000 and 31/12/2017. ")+
ggtitle("Probabilities of football match outcomes")+
theme(plot.title = element_text(hjust = 0.5),
plot.caption=element_text(hjust=0.5))+
coord_flip()
p
Note that Germany's 7:1 against Brazil at the 2014 World Cup does not appear in these statistics since it was played on home ground.
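The tabulation behind this plot can be sketched in a few lines of base R. The example uses the six scorelines printed by head() above rather than the full data set, so the resulting percentages are purely illustrative; the vectors `goal_a` and `goal_b` are hand-typed.

```r
# Goals of Team A and Team B in the six rows shown by head(data2000)
goal_a <- c(1, 2, 1, 1, 2, 3)
goal_b <- c(4, 1, 0, 0, 2, 1)

# Label each game with its scoreline, e.g. "1:0"
result <- paste(goal_a, goal_b, sep = ":")

# Relative frequency of each scoreline in percent, most frequent first
freq <- sort(round(100 * prop.table(table(result)), 1), decreasing = TRUE)

print(freq)
```

Here prop.table() plays the role of dividing the table by the number of games, and paste() builds the same "a:b" labels that appear on the axis of the plot.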
What is the probability of having a draw?
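# Goal difference (Team A minus Team B); the factor levels are converted via character
df4 <- df2 %>% mutate( GoalA= as.character(GoalA), GoalB=as.character(GoalB)) %>%
transmute( GoalDiff= as.numeric(GoalA) - as.numeric(GoalB), Freq=Freq) %>%
count(GoalDiff, wt=Freq) %>%
# Convert the summed shares to percent so that they match the axis label
transmute(GoalDiff=GoalDiff, Freq=round(100*n)) %>%
filter(Freq >0)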
p <- ggplot(data=df4, aes(x=GoalDiff, y=Freq)) +
geom_bar(stat="identity", fill="steelblue")+
geom_text(aes(label=Freq), vjust=-0.3, size=3.5, color="#993333")+
labs(x="Goal difference", y="Percentage (%)", caption="Based on all matches of the participants of 2018 FIFA World Cup (plus Italy, Netherlands and Austria)\n against teams with at least 1600 Elo points between 1/1/2000 and 31/12/2017. ")+
ggtitle("Goal differences of football match outcomes")+
theme(plot.title = element_text(hjust = 0.5),
plot.caption=element_text(hjust=0.5))+
scale_x_continuous(breaks=pretty_breaks(6))
p
We see clearly that football is a classic low-scoring game. About 26% of the games end in a draw and about 88% of the matches end with a goal difference of at most 2!
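The two shares quoted above are simple means of logical vectors. As a minimal sketch, the following base-R code computes them for the six games printed by head() above; the numbers it produces describe this tiny excerpt only, not the full data set.

```r
# Goals of Team A and Team B in the six rows shown by head(data2000)
goal_a <- c(1, 2, 1, 1, 2, 3)
goal_b <- c(4, 1, 0, 0, 2, 1)

# Goal difference from Team A's point of view
goal_diff <- goal_a - goal_b

# Share of draws and share of games decided by at most two goals
share_draw  <- mean(goal_diff == 0)
share_close <- mean(abs(goal_diff) <= 2)

print(table(goal_diff))
print(c(draw = share_draw, within_two = share_close))
```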
What are the probabilities that the stronger team will win?
In order to answer this question we will rearrange the data such that Team A will have more Elo points than Team B before the match. To avoid any misunderstandings we speak of the stronger and the weaker team.
# Games in which Team A had more Elo points before the match
DataEloNeutral %>% filter(EloABefore-EloBBefore >0) %>%
transmute(Date=Date, Stronger=TeamA, Weaker=TeamB, GoalStronger=GoalA, GoalWeaker=GoalB, EloStrongerBefore=EloABefore, EloWeakerBefore=EloBBefore) -> DataEloNeutral2a
# Games in which Team B had more Elo points: swap the roles of A and B
DataEloNeutral %>% filter(EloABefore-EloBBefore <0) %>%
transmute(Date=Date, Stronger=TeamB, Weaker=TeamA, GoalStronger=GoalB, GoalWeaker=GoalA, EloStrongerBefore=EloBBefore, EloWeakerBefore=EloABefore) -> DataEloNeutral2b
# Games between teams with equal Elo points match neither filter and are dropped
DataEloNeutral2<-rbind(DataEloNeutral2a, DataEloNeutral2b)
DataEloNeutral2 <- arrange(DataEloNeutral2, Date)
df <- select(DataEloNeutral2, GoalStronger, GoalWeaker)
tdf <- round(table(df)/sum(length(df$GoalStronger)), 3)
df2 <- as.data.frame(tdf)
df3 <- df2 %>% filter(Freq>0.05) %>% arrange(desc(Freq)) %>%
transmute(Result= paste(GoalStronger, GoalWeaker, sep=":"), Freq=Freq*100)
p <- ggplot(data=df3, aes(x=reorder(Result, Freq), y=Freq)) +
geom_bar(stat="identity", fill="steelblue")+
geom_text(aes(label=Freq), hjust=-0.3, size=3.5, color="#993333")+
labs(x="Result", y="Percentage (%)", caption="Based on all matches of the participants of 2018 FIFA World Cup (plus Italy, Netherlands \n and Austria) against teams with at least 1600 Elo points between 1/1/2000 and 31/12/2017. \n Showing only results with frequency above 5% ")+
ggtitle("Probabilities of football match outcomes \n Stronger vs. Weaker")+
theme(plot.title = element_text(hjust = 0.5),
plot.caption=element_text(hjust=0.5))+
coord_flip()
p
Again, the most likely result is 1:0, now followed by 0:0. The chances that the weaker team wins seem to be quite low. Nevertheless, the weaker team wins with a probability of 27%, see below.
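# Code each game from the stronger team's point of view: 3 = win, 1 = draw, 0 = loss
DataEloNeutral2 %>%
mutate(Result= 3*(GoalStronger>GoalWeaker)+(GoalStronger==GoalWeaker) ) %>%
count(Result) %>% arrange(desc(Result)) %>%
transmute(Result=as.character(Result), Freq=round(n/sum(n),2)*100) ->df
# Order of the bars: win (3), loss (0), draw (1)
df$Result<-factor(df$Result, levels = c(3,0,1))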
p <- ggplot(data=df, aes(x=Result, y=Freq)) +
geom_bar(stat="identity", fill="steelblue")+
geom_text(aes(label=Freq), vjust=-0.3, size=3.5, color="#993333")+
labs(x="Win, draw or loss", y="Percentage (%)", caption="Based on all matches of the participants of 2018 FIFA World Cup (plus Italy, Netherlands\n and Austria) against teams with at least 1600 Elo points between 1/1/2000 and 31/12/2017. ")+
ggtitle("Result of football match \n Stronger vs. Weaker")+
theme(plot.title = element_text(hjust = 0.5),
plot.caption=element_text(hjust=0.5))
p
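The rearrangement into stronger and weaker team and the win/draw/loss coding can also be written in one pass with ifelse(). The sketch below does this in base R for the six rows printed by head() above. The column names follow the post, while the data frame `games` is hand-typed for illustration, so the resulting counts say nothing about the full data set. As in the post, games between teams with equal Elo points are dropped.

```r
# Hand-typed copy of the goal and Elo columns shown by head(data2000)
games <- data.frame(
  GoalA      = c(1, 2, 1, 1, 2, 3),
  GoalB      = c(4, 1, 0, 0, 2, 1),
  EloABefore = c(1746, 1739, 1694, 1735, 1562, 1703),
  EloBBefore = c(1702, 1570, 1449, 1453, 1702, 1611)
)

# Drop games with equal Elo points, for which "stronger" is undefined
games <- subset(games, EloABefore != EloBBefore)

# TRUE where Team A is the stronger team before the match
a_stronger <- games$EloABefore > games$EloBBefore

# Pick the goals of the stronger and the weaker team row by row
games$GoalStronger <- ifelse(a_stronger, games$GoalA, games$GoalB)
games$GoalWeaker   <- ifelse(a_stronger, games$GoalB, games$GoalA)

# Outcome for the stronger team: 3 = win, 1 = draw, 0 = loss
games$Result <- 3 * (games$GoalStronger > games$GoalWeaker) +
  (games$GoalStronger == games$GoalWeaker)

print(table(factor(games$Result, levels = c(3, 1, 0))))
```

The single-pass version avoids building two intermediate data frames and binding them together, at the price of one ifelse() call per rearranged column.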