# What are typical football results?

In this post we continue our investigation of Who wins the 2018 FIFA World Cup™? and take a first look at historical data on FIFA football matches. The data were obtained from the site www.eloratings.net using the Wayback Machine and some copy-and-paste. Unfortunately, the data set obtained in this way is incomplete: it does not cover all FIFA matches in this millennium. However, we were able to retrieve all matches of the 2018 FIFA World Cup participants, plus the matches of Italy, the Netherlands, and Austria. Yes, Italy and the Netherlands did not qualify, but we are still convinced that these two teams are among the strongest in the world. We added Austria to pay homage to the country where we spent a lot of quality time.

We will try to answer questions like:

• What is the most probable outcome of a game? [->]
• What is the probability of a win, a draw, or a loss? [->]
• What is the probability that the stronger team wins? [->] And with what result? [->]

Short answers can be found by following the links after the questions; detailed answers are given below.

# Programming code

The analysis is done using the open-source software R. We first load some useful packages and set up our ggplot2 theme.

```r
library("knitr")
knitr::opts_chunk$set(echo = TRUE)
opts_knit$set(root.dir = "~/Datatreker")
library("dplyr")
library("reshape2")
library("ggplot2")
library("scales")
library("ggthemes")
theme_set(theme_solarized_2())
theme_update(axis.text.x = element_text(face = "bold", color = "#993333",
                                        size = 10),
             axis.text.y = element_text(face = "bold", color = "#993333",
                                        size = 10),
             title = element_text(color = "black"))
```

## Pre-processing the data

After several pre-processing steps that we skip here, we obtained the file “Data2000.csv” (available here), which contains all the matches between 1/1/2000 and 31/12/2017.

```r
setwd("~/Datatreker")
# Assumed loading step: the original call is not shown, so read.csv() is a guess
data2000 <- read.csv("Data2000.csv")
head(data2000)
```
```
##         Date       TeamA       TeamB GoalA GoalB         Competition
## 1 2017-12-16       Japan South Korea     1     4      East Asian Cup
## 2 2017-12-12       Japan       China     2     1      East Asian Cup
## 3 2017-12-12 South Korea North Korea     1     0      East Asian Cup
## 4 2017-12-09       Japan North Korea     1     0      East Asian Cup
## 5 2017-12-09       China South Korea     2     2      East Asian Cup
## 6 2017-11-15   Australia    Honduras     3     1 World Cup qualifier
##   Location EloABefore EloAAfter EloBBefore EloBAfter
## 1     Home       1746      1697       1702      1751
## 2     Home       1739      1746       1570      1563
## 3  Neutral       1694      1702       1449      1441
## 4     Home       1735      1739       1453      1449
## 5  Neutral       1562      1570       1702      1694
## 6     Home       1703      1718       1611      1596
```

So what do we have in our data? The date of the game, the names of the two teams, the goals scored, the competition (including friendly matches), whether the game was played on home or neutral ground, and the Elo points of the two teams before and after the match. Altogether, we have data on 6,706 games.

A first look at the data shows that it is quite heterogeneous. We therefore keep only matches in which both teams had more than 1600 Elo points before the game. We also split the data according to location, i.e. whether the game was played on neutral ground or not.

```r
DataElo <- filter(data2000, EloABefore > 1600, EloBBefore > 1600)
DataEloNeutral <- filter(DataElo, Location == "Neutral")
DataEloHome <- filter(DataElo, Location == "Home")
```

The games on neutral ground appear to be the more important ones: they include most of the games played during tournaments and exclude games in the qualifiers and friendly matches. Many games in the qualifiers are not of high importance. We will also see that qualifying games in the confederations of Africa, Asia, North and Central America and the Caribbean, and Oceania have low predictive power for the final rounds; they even worsen the forecasts. In other words, the way a team plays against a team of the same strength says nothing about how it will play against a much stronger team. We discard friendly matches since they are played under completely different conditions.

The data obtained from www.eloratings.net has one particularity: the order of the teams, Team A vs. Team B, is not necessarily the official order of a match, i.e. home team vs. guest team. Instead, the data is arranged such that the team that won the most Elo points in a game is named first.
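How often the first-named team actually gained Elo points can be checked directly from the before and after columns. The following is a minimal sketch on the six sample rows shown in the data preview above; the data frame `toy` and the column `EloGainA` are hypothetical names introduced only for this illustration.

```r
# Six sample rows copied from the data preview above (illustration only).
toy <- data.frame(
  TeamA      = c("Japan", "Japan", "South Korea", "Japan", "China", "Australia"),
  TeamB      = c("South Korea", "China", "North Korea", "North Korea",
                 "South Korea", "Honduras"),
  EloABefore = c(1746, 1739, 1694, 1735, 1562, 1703),
  EloAAfter  = c(1697, 1746, 1702, 1739, 1570, 1718),
  stringsAsFactors = FALSE
)

# Elo points gained by Team A in each match (negative means points lost).
toy$EloGainA <- toy$EloAAfter - toy$EloABefore

# Share of matches in which the first-named team gained Elo points.
share_a_gained <- mean(toy$EloGainA > 0)

print(toy[, c("TeamA", "TeamB", "EloGainA")])
print(share_a_gained)
```

On the full data, the same two lines could be applied to `DataEloNeutral` to see how closely the team order follows the Elo gains.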

Winning Elo points in a game can be regarded as winning the game. Is this consistent with the Elo points of the teams before the match? Yes, it is: on average, Team A has significantly more Elo points than Team B.

```r
t.test(DataEloNeutral$EloABefore, DataEloNeutral$EloBBefore)
```
```
##
##  Welch Two Sample t-test
##
## data:  DataEloNeutral$EloABefore and DataEloNeutral$EloBBefore
## t = 10.05, df = 1814.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  46.78644 69.47638
## sample estimates:
## mean of x mean of y
##  1840.522  1782.391
```
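Since both ratings in a row belong to the same match, a paired test is a natural cross-check of the Welch test. The sketch below is a suggestion rather than part of the original analysis. It uses only the six sample rows from the data preview, which are far too few for real inference, and the vector names are hypothetical.

```r
# Ratings before the match, copied from the six sample rows (illustration only).
elo_a <- c(1746, 1739, 1694, 1735, 1562, 1703)  # EloABefore
elo_b <- c(1702, 1570, 1449, 1453, 1702, 1611)  # EloBBefore

# Welch test as in the text: treats the two columns as independent samples.
welch <- t.test(elo_a, elo_b)

# Paired test: compares the two ratings within each match.
paired <- t.test(elo_a, elo_b, paired = TRUE)

# Both tests estimate the same mean difference; only the standard error differs.
mean_diff <- mean(elo_a - elo_b)

print(mean_diff)
print(welch$p.value)
print(paired$p.value)
```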

## Most likely results

Let us summarize all the different outcomes.

```r
df <- select(DataEloNeutral, GoalA, GoalB)
# Relative frequency of every (GoalA, GoalB) combination
tdf <- round(table(df) / nrow(df), 3)
df2 <- as.data.frame(tdf)
# Keep the results that occurred, label them "a:b" and convert to percent
df3 <- df2 %>%
  filter(Freq > 0) %>%
  arrange(desc(Freq)) %>%
  transmute(Result = paste(GoalA, GoalB, sep = ":"), Freq = Freq * 100)
p <- ggplot(data = df3, aes(x = reorder(Result, Freq), y = Freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = Freq), hjust = -0.3, size = 3.5, color = "#993333") +
  labs(x = "Result", y = "Percentage (%)",
       caption = "Based on all matches of the participants of 2018 FIFA World Cup (plus Italy, Netherlands \n and Austria) against teams with at least 1600 Elo points between 1/1/2000 and 31/12/2017. ") +
  ggtitle("Probabilities of football match outcomes") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.5)) +
  coord_flip()
p
```

Note that Germany's 7:1 win over Brazil four years ago does not appear in the statistics, since it was not played on neutral ground.
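The tabulation step can also be written in base R. The following is a minimal sketch on the six sample results from the data preview, not on the filtered data set; all variable names are hypothetical.

```r
# Goals from the six sample rows in the data preview (illustration only).
goals_a <- c(1, 2, 1, 1, 2, 3)
goals_b <- c(4, 1, 0, 0, 2, 1)

# Label every match with its result, e.g. "1:0".
result <- paste(goals_a, goals_b, sep = ":")

# Count each result, sort from most to least frequent, convert to percent.
freq <- sort(table(result), decreasing = TRUE)
pct <- round(100 * prop.table(freq), 1)

# The first entry is the most frequent result in the sample.
most_likely <- names(pct)[1]

print(pct)
print(most_likely)
```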

### What is the probability of having a draw?

```r
# GoalA and GoalB are factors in df2, so convert via character to numeric
df4 <- df2 %>%
  mutate(GoalA = as.character(GoalA), GoalB = as.character(GoalB)) %>%
  transmute(GoalDiff = as.numeric(GoalA) - as.numeric(GoalB), Freq = Freq) %>%
  count(GoalDiff, wt = Freq) %>%
  # Convert proportions to whole percentages to match the axis label
  transmute(GoalDiff = GoalDiff, Freq = round(n * 100)) %>%
  filter(Freq > 0)
p <- ggplot(data = df4, aes(x = GoalDiff, y = Freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = Freq), vjust = -0.3, size = 3.5, color = "#993333") +
  labs(x = "Goal difference", y = "Percentage (%)",
       caption = "Based on all matches of the participants of 2018 FIFA World Cup (plus Italy, Netherlands and Austria)\n against teams with at least 1600 Elo points between 1/1/2000 and 31/12/2017. ") +
  ggtitle("Goal differences of football match outcomes") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = pretty_breaks(6))
p
```

We see clearly that football is a classic low-scoring game. About 26% of the games end in a draw, and about 88% of the matches end with a goal difference of at most 2!
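The two shares quoted here can also be computed without going through the frequency table. The following is a minimal sketch on the six sample results from the data preview, so the numbers differ from the 26% and 88% of the full data; the variable names are hypothetical.

```r
# Goals from the six sample rows in the data preview (illustration only).
goals_a <- c(1, 2, 1, 1, 2, 3)
goals_b <- c(4, 1, 0, 0, 2, 1)

# Goal difference from Team A's point of view.
goal_diff <- goals_a - goals_b

# Share of draws and share of matches decided by at most two goals.
share_draw  <- mean(goal_diff == 0)
share_close <- mean(abs(goal_diff) <= 2)

print(share_draw)
print(share_close)
```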

## What are the probabilities that the stronger team will win?

To answer this question, we rearrange the data such that the first-named team is the one with more Elo points before the match. To avoid any misunderstanding, we speak of the stronger and the weaker team from now on.

```r
# Matches in which Team A had more Elo points before the game
DataEloNeutral2a <- DataEloNeutral %>%
  filter(EloABefore - EloBBefore > 0) %>%
  transmute(Date = Date, Stronger = TeamA, Weaker = TeamB,
            GoalStronger = GoalA, GoalWeaker = GoalB,
            EloStrongerBefore = EloABefore, EloWeakerBefore = EloBBefore)

# Matches in which Team B had more Elo points: swap the roles
# (matches with equal ratings fall into neither group and are dropped)
DataEloNeutral2b <- DataEloNeutral %>%
  filter(EloABefore - EloBBefore < 0) %>%
  transmute(Date = Date, Stronger = TeamB, Weaker = TeamA,
            GoalStronger = GoalB, GoalWeaker = GoalA,
            EloStrongerBefore = EloBBefore, EloWeakerBefore = EloABefore)

DataEloNeutral2 <- rbind(DataEloNeutral2a, DataEloNeutral2b)
DataEloNeutral2 <- arrange(DataEloNeutral2, Date)
```
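The same reordering can be done in one pass with `ifelse()` instead of two filtered copies. The following is a minimal base R sketch on three of the sample rows from the data preview; `toy`, `a_stronger` and `by_strength` are hypothetical names, and this is an alternative formulation rather than the code used for the results in this post.

```r
# Three sample rows from the data preview (illustration only).
toy <- data.frame(
  TeamA      = c("Japan", "China", "Australia"),
  TeamB      = c("South Korea", "South Korea", "Honduras"),
  GoalA      = c(1, 2, 3),
  GoalB      = c(4, 2, 1),
  EloABefore = c(1746, 1562, 1703),
  EloBBefore = c(1702, 1702, 1611),
  stringsAsFactors = FALSE
)

# Drop matches with equal ratings, as the two strict filters above do.
toy <- toy[toy$EloABefore != toy$EloBBefore, ]

# TRUE where Team A is the stronger side before the match.
a_stronger <- toy$EloABefore > toy$EloBBefore

# Pick every column from Team A or Team B depending on who is stronger.
by_strength <- data.frame(
  Stronger          = ifelse(a_stronger, toy$TeamA, toy$TeamB),
  Weaker            = ifelse(a_stronger, toy$TeamB, toy$TeamA),
  GoalStronger      = ifelse(a_stronger, toy$GoalA, toy$GoalB),
  GoalWeaker        = ifelse(a_stronger, toy$GoalB, toy$GoalA),
  EloStrongerBefore = pmax(toy$EloABefore, toy$EloBBefore),
  EloWeakerBefore   = pmin(toy$EloABefore, toy$EloBBefore),
  stringsAsFactors  = FALSE
)

print(by_strength)
```

This keeps the original row order, so no sorting by date is needed afterwards.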
```r
df <- select(DataEloNeutral2, GoalStronger, GoalWeaker)
# Relative frequency of every (GoalStronger, GoalWeaker) combination
tdf <- round(table(df) / nrow(df), 3)
df2 <- as.data.frame(tdf)
# Keep only results with a frequency above 5%
df3 <- df2 %>%
  filter(Freq > 0.05) %>%
  arrange(desc(Freq)) %>%
  transmute(Result = paste(GoalStronger, GoalWeaker, sep = ":"), Freq = Freq * 100)
p <- ggplot(data = df3, aes(x = reorder(Result, Freq), y = Freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = Freq), hjust = -0.3, size = 3.5, color = "#993333") +
  labs(x = "Result", y = "Percentage (%)",
       caption = "Based on all matches of the participants of 2018 FIFA World Cup (plus Italy, Netherlands \n and Austria) against teams with at least 1600 Elo points between 1/1/2000 and 31/12/2017. \n Showing only results with frequency above 5% ") +
  ggtitle("Probabilities of football match outcomes \n Stronger vs. Weaker") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.5)) +
  coord_flip()
p
```

Again, the most likely result is 1:0, now followed by 0:0. The chances that the weaker team wins seem to be quite low. Nevertheless, the weaker team wins with a probability of 27%, see below.

```r
# Points from the stronger team's perspective: 3 = win, 1 = draw, 0 = loss
df <- DataEloNeutral2 %>%
  mutate(Result = 3 * (GoalStronger > GoalWeaker) + (GoalStronger == GoalWeaker)) %>%
  count(Result) %>%
  arrange(desc(Result)) %>%
  transmute(Result = as.character(Result), Freq = round(n / sum(n), 2) * 100)
# Order of the bars: win (3), loss (0), draw (1)
df$Result <- factor(df$Result, levels = c(3, 0, 1))

p <- ggplot(data = df, aes(x = Result, y = Freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = Freq), vjust = -0.3, size = 3.5, color = "#993333") +
  labs(x = "Win, draw or lose", y = "Percentage (%)",
       caption = "Based on all matches of the participants of 2018 FIFA World Cup (plus Italy, Netherlands\n and Austria) against teams with at least 1600 Elo points between 1/1/2000 and 31/12/2017. ") +
  ggtitle("Result of football match \n Stronger vs. Weaker") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.5))
p
```
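The 3/1/0 coding used above can be made easier to read by attaching labels to the three outcomes. The following is a minimal base R sketch on the six sample rows from the data preview, already reordered by hand so that the stronger team comes first; the variable names and the labels "Win", "Draw" and "Loss" are assumptions made for this illustration.

```r
# Goals of the stronger and the weaker team in the six sample matches
# (illustration only; in row 5 South Korea is the stronger team).
goal_stronger <- c(1, 2, 1, 1, 2, 3)
goal_weaker   <- c(4, 1, 0, 0, 2, 1)

# Same coding as above: 3 = win, 1 = draw, 0 = loss for the stronger team.
points <- 3 * (goal_stronger > goal_weaker) + (goal_stronger == goal_weaker)

# Attach readable labels to the three outcomes.
outcome <- factor(points, levels = c(3, 1, 0), labels = c("Win", "Draw", "Loss"))

# Share of each outcome in percent.
pct <- round(100 * prop.table(table(outcome)), 1)
print(pct)
```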