In just a few weeks the 2018 FIFA World Cup™ starts. People are already passionately discussing who is going to win and what the chances of their teams are. Almost everybody has an intuition, an opinion or a gut feeling about the performances of the different nations. There might be a consensus among football experts and fans on the top favorites, e.g. Brazil, Germany and Spain, but there is more debate on possible underdogs. However, most of these predictions rely on subjective opinions and are very hard, if not impossible, to quantify. An additional difficulty is the complexity of the tournament, with billions of different outcomes, which makes it very difficult to guess the probabilities of certain events accurately.
How can we make reasonable, objective and quantitative estimates of the outcomes? For example, what is the probability that Brazil, Germany or Spain will win the cup? What are the chances that England will make it to the Round of 16? What are the chances that Brazil beats Germany in the semifinals 7:1?
In this and the following posts, we give quantitative answers to all these kinds of questions. This post starts with what we can learn by studying previous matches and tournaments. Once we have found some appropriate data, we will investigate which models are out there to model an event like the FIFA World Cup.
Besides the above motivation, we also think that forecast models for football are a brilliant topic for learning data science, in class or at home alike. In this and the following posts we will see all the steps: defining the problem, finding and cleaning the data, understanding the data, building models, making forecasts, presenting the results and eventually making a decision.
Finding the right data for the right question
Our aim is to find historical data on FIFA matches of the last decade or more. As you can imagine, we are not willing to pay for this, nor do we want to work hard. There are several free options on https://github.com/jokecamp/FootballData. But will they serve our purposes in the course of this project? Before diving into the data we should think first. It would be a considerable waste of time and effort to do the hard work of cleaning the data, only to realize shortly after that important information is just not in our files; in other words, back to square one. In the end we want to have a lean (toy) model for the outcome of football matches. The problem is that there are far too many different models out in model space. Probably the best is to cook up our own (at least for the beginning). What kind of information is most important for the outcome of a match? Let’s make a list:
- the name of the two teams (you would never have guessed this one)
- the result of the game (nor this one)
- Was the game played on neutral ground or did one of the teams have a home advantage?
- What kind of game was it? Friendly game, a game in a qualification round or a game in a tournament?
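As a minimal sketch of what one row of such a data set could look like, the four items above fit into a small Python record. The names `Match`, `MatchType` and `outcome` are our own hypothetical choices for illustration, not a format taken from any of the data sources mentioned here. The example row is the 2014 semifinal Brazil vs. Germany (1:7), which was played in Brazil and hence not on neutral ground.

```python
from dataclasses import dataclass
from enum import Enum


class MatchType(Enum):
    """Kind of game (hypothetical categories, following the list above)."""
    FRIENDLY = "friendly"
    QUALIFIER = "qualifier"
    TOURNAMENT = "tournament"


@dataclass(frozen=True)
class Match:
    """One historical match with the four pieces of information we want."""
    home_team: str        # first-named team; has home advantage unless neutral
    away_team: str        # second-named team
    home_goals: int       # goals scored by home_team
    away_goals: int       # goals scored by away_team
    neutral: bool         # True if played on neutral ground
    match_type: MatchType

    def outcome(self) -> str:
        """Return 'home_win', 'draw' or 'away_win' from the final score."""
        if self.home_goals > self.away_goals:
            return "home_win"
        if self.home_goals < self.away_goals:
            return "away_win"
        return "draw"


# 2014 World Cup semifinal, played in Brazil (not neutral ground).
example = Match("Brazil", "Germany", 1, 7,
                neutral=False, match_type=MatchType.TOURNAMENT)
print(example.outcome())
```

Keeping the record this small is deliberate: any candidate data source can be checked quickly against these six fields before we invest time in cleaning it.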
We believe that these four pieces of information are the most important ones. Clearly, one can think of many others: weather conditions, condition of the field, time of the year, players of the team, personal statistics of the players, names of the coaches, number of people playing football in these countries, economic figures of the two countries, sponsors of the teams or even star constellations. There are models out there that consider all these kinds of information or even more. We believe that some of them, especially weather and field conditions, personal performances of the players and tactical proficiency of the coaches, are very important. But we have a big problem with this Big Data approach: the FIFA World Cup is still too far away, and it is not even clear which players will be in the line-ups. We might do this another time.
Is this all the data we want? It might be that there is not enough data for each team to exclude statistical artefacts. It might be appropriate to group teams into different strength groups to compensate for these effects.
Everybody believes that a stronger team has a higher chance of beating a weaker team. Games between two equally strong teams are normally very different from games with a clear favorite. So we need a notion of the strength of a team or, in other words, a ranking. The two most popular rankings of football teams are the FIFA/Coca-Cola World Ranking and the Elo ranking. The FIFA/Coca-Cola ranking changed over time and we did not see a good way to access historical data. A sponsorship or access to historical data might however change our mind. Let us follow the lead of the Elo ranking.
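To make the idea of "strength" a bit more tangible, here is a minimal sketch of the generic Elo formulas as known from chess: the expected score of a team depends only on the rating difference, and a rating moves in proportion to how much the actual result deviated from that expectation. The function names, the default `k=20.0` and the `home_advantage` parameter are illustrative assumptions on our side, not the exact football-specific rules used by any published ranking.

```python
def expected_score(rating_a: float, rating_b: float,
                   home_advantage: float = 0.0) -> float:
    """Expected score of team A against team B (between 0 and 1).

    home_advantage is an assumed bonus in rating points added to team A
    when it plays at home; use 0.0 for neutral ground.
    """
    diff = rating_a + home_advantage - rating_b
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def update_rating(rating: float, expected: float, actual: float,
                  k: float = 20.0) -> float:
    """New rating after one game.

    actual is 1.0 for a win, 0.5 for a draw and 0.0 for a loss.
    k (assumed value) controls how strongly a single game moves the rating.
    """
    return rating + k * (actual - expected)


# Two hypothetical teams, 400 rating points apart, on neutral ground.
favorite, underdog = 1900.0, 1500.0
p = expected_score(favorite, underdog)
print(round(p, 3))  # expected score of the favorite

# If the underdog wins anyway, it gains what the favorite loses.
print(round(update_rating(underdog, 1.0 - p, 1.0), 1))
print(round(update_rating(favorite, p, 0.0), 1))
```

Two equally rated teams get an expected score of 0.5 each, which matches the intuition that such games are open, while a large rating gap marks a clear favorite.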
The Elo ranking is based on the Elo rating system, known in particular from chess, but includes modifications to take various football-specific variables into account. It is published by the website https://www.eloratings.net. The Elo ratings as of today for the top 5 nations (in this rating) are as follows:
At first sight this ranking makes sense. And jackpot, the site also gives the other information we wanted. But there is no complete historical data set available for download. We asked the site for a nice file that contains all the data we need, but never received an answer. This is not surprising, since data is money. But the internet never forgets, and a publicly available memory is https://www.web.archive.org/. Let us do some copy-and-paste and talk soon.