A more sophisticated forecast model

Football is typically a low-scoring game, and matches are frequently decided by single events. These events may be extraordinary individual performances, individual errors, injuries, refereeing errors or just lucky coincidences. Moreover, in most tournaments some teams and players are in exceptional shape and have a strong influence on the outcome. One consequence is that every now and then alleged underdogs win tournaments and reputed favorites drop out as early as the group phase.

The above effects are notoriously difficult to forecast. Nevertheless, every team has its strengths and weaknesses (e.g. defense and attack), and most results reflect the qualities of the teams. In order to model both the random effects and the deterministic drift, forecasts should be given in terms of probabilities.

A series of statistical models have been proposed in the literature for the prediction of football outcomes. They can be divided into two broad categories. The first one, the result-based model, directly models the probability of a game outcome (win/draw/loss), while the second one, the score-based model, focuses on the match score. We follow the second approach, since the match score is important in the group phase of the championship and a score-based model also implies a result-based one. The model proposed in How to impress your football fan colleagues is a first approach, but it only gives the most typical outcomes.

The chances are very low (in fact almost zero) that all matches during the World Cup end with these results. There are several score-based models, and most of them involve a Poisson model. In other words, the number of goals scored by a team is assumed to follow a Poisson distribution. This distribution is determined by a single parameter \lambda, which is the expected number of goals.
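To make the role of \lambda concrete, here is a minimal sketch that evaluates the Poisson probabilities for a single team. The rate 1.4 and the helper name poisson_pmf are assumptions chosen for illustration; they are not taken from the fitted model.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a Poisson(lam) random variable equals k."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Hypothetical rate: a team expected to score 1.4 goals per match.
lam = 1.4
goal_probs = {k: poisson_pmf(k, lam) for k in range(7)}
for k, p in goal_probs.items():
    print(f"P({k} goals) = {p:.3f}")
```

Small goal counts carry most of the probability mass, which is consistent with football being a low-scoring game.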

The forecast of a match A vs. B now proceeds as follows:

  1. We determine the expected number of goals \lambda_{A|B} scored by A against B and the expected number of goals \lambda_{B|A} scored by B.

  2. We simulate the number of goals G_{A} as a Poisson random variable with parameter \lambda_{A|B} and the number of goals G_{B} as a Poisson random variable with parameter \lambda_{B|A}.

  3. We obtain G_{A}:G_{B} as a forecast of the match A vs. B.

  4. We simulate the whole tournament.

Note that the result is the realization of a random variable. That means it takes different values with certain probabilities. A single simulation gives one single result, and we lose the information on the probabilities of certain outcomes. One way to get a weighted (or probabilistic) forecast is to simulate the tournament many times, say 100,000 times, and count how often each result occurs. This procedure is known as the Monte Carlo method.
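The Monte Carlo idea can be sketched for a single match as follows. The rates 1.8 and 0.9 are hypothetical, the two goal counts are drawn independently here for simplicity, and the function names are illustrative only. The Poisson sampler uses Knuth's multiplication method, which is adequate for the small rates that occur in football.

```python
import math
import random
from collections import Counter


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_match(lam_a: float, lam_b: float, n_runs: int, seed: int = 0) -> Counter:
    """Simulate the match n_runs times and count how often each scoreline occurs."""
    rng = random.Random(seed)
    scores = Counter()
    for _ in range(n_runs):
        scores[(sample_poisson(lam_a, rng), sample_poisson(lam_b, rng))] += 1
    return scores


# Hypothetical rates, for illustration only.
n_runs = 100_000
scores = simulate_match(lam_a=1.8, lam_b=0.9, n_runs=n_runs)
p_win = sum(c for (a, b), c in scores.items() if a > b) / n_runs
p_draw = sum(c for (a, b), c in scores.items() if a == b) / n_runs
p_loss = sum(c for (a, b), c in scores.items() if a < b) / n_runs
print(f"A wins {p_win:.3f}, draw {p_draw:.3f}, B wins {p_loss:.3f}")
print("Most frequent scorelines:", scores.most_common(3))
```

The relative frequencies of the scorelines approximate their probabilities; simulating a whole tournament repeats the same counting over all matches.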

The Model

The crucial part is the modeling (step 1). We analyzed several Poisson regression models with different degrees of complexity. We compared them using different kinds of quality measures and present here just the best model. More details on this selection process can be found in our preprint.

We use the following model, which takes a dependent Poisson regression approach. In fact, we use several Poisson regressions. These are fitted to the data described in What are typical football results?, including matches since 1 January 2010.
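For readers unfamiliar with Poisson regression, the following is a minimal sketch of how a model of the form log(mean) = b0 + b1 * x can be fitted by Newton's method on the Poisson log-likelihood. It is not the fitting code used for the model; the toy data, the rescaling of the Elo values and all names are assumptions made for illustration.

```python
import math


def fit_poisson_regression(xs, ys, n_iter=50):
    """Fit log(mean) = b0 + b1 * x by Newton's method on the Poisson log-likelihood."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0
    for _ in range(n_iter):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(y - m for y, m in zip(ys, mus))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mus))
        # Entries of the negated Hessian (the Fisher information matrix).
        h00 = sum(mus)
        h01 = sum(m * x for x, m in zip(xs, mus))
        h11 = sum(m * x * x for x, m in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system for the update.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return b0, b1


# Made-up toy data: Elo rating of the opponent and goals scored against it.
elo_opponent = [1500, 1600, 1700, 1800, 1900, 2000]
goals_scored = [3, 2, 2, 1, 1, 0]
# Rescale Elo so that the Newton iteration is numerically well behaved.
x_scaled = [(elo - 1750) / 100 for elo in elo_opponent]
b0, b1 = fit_poisson_regression(x_scaled, goals_scored)
print(f"intercept {b0:.3f}, slope per 100 Elo points {b1:.3f}")
```

In this toy data set the fitted slope is negative: the stronger the opponent, the fewer goals are expected. In practice a statistics package would normally be used instead of a hand-written solver.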

The Poisson rates \lambda_{A|B} and \lambda_{B|A} are determined as follows:

  1. We always assume that A has a higher Elo score than B. This assumption can be justified, since the better team usually dominates the weaker team’s tactics. Moreover, the number of goals the stronger team scores has an impact on the number of goals of the weaker team. For example, if team A scores 5 goals, it is more likely that B also scores 1 or 2 goals, because the defense of team A loses concentration due to the expected victory. If the stronger team A scores only 1 goal, it is more likely that B scores no goal or just one, since team A focuses more on defense and secures the victory.

  2. We determine the Poisson rate \lambda_{A|B}. This is done in several steps.

  • We determine how many goals A scores against an opponent O. The corresponding parameter \mu_{A} as a function of the Elo rating \elo{O} of the opponent O is given as

    (1)   \begin{equation*} \log \mu_A(\elo{O}) = \alpha_0 + \alpha_1 \cdot \elo{O}, \end{equation*}

    where \alpha_0 and \alpha_1 are obtained via a Poisson regression.

  • Teams of similar Elo scores may have different strengths in attack and defense. To take this effect into account we model the number of goals team B receives against a team of Elo score \elo{}=\elo{A} using a Poisson distribution with parameter \nu_{B}. The parameter \nu_{B} as a function of the Elo rating \elo{O} is given as

    (2)   \begin{equation*} \log \nu_B(\elo{O}) = \beta_0 + \beta_1 \cdot \elo{O}, \end{equation*}

    where the parameters \beta_0 and \beta_1 are obtained via Poisson regression.

  • Team A should on average score \mu_A\bigl(\elo{B}\bigr) goals against team B, but team B should on average concede \nu_B\bigl(\elo{A}\bigr) goals. As these two values rarely coincide, we model the number of goals G_A as a Poisson random variable with parameter

        \[\lambda_{A|B} = \frac{\mu_A\bigl(\elo{B}\bigr)+\nu_B\bigl(\elo{A}\bigr)}{2}.\]

  3. We determine the Poisson rate \lambda_{B|A}.
The number of goals G_B scored by B is assumed to depend on the Elo score E_A=\elo{A} and additionally on the outcome of G_A. More precisely, G_B is modeled as a Poisson random variable with parameter \lambda_B(E_A,G_A) satisfying

(3)   \begin{equation*} \log \lambda_B(E_A,G_A) = \gamma_0 + \gamma_1 \cdot E_A+\gamma_2 \cdot G_A. \end{equation*}

Once again, the parameters \gamma_0,\gamma_1,\gamma_2 are obtained by Poisson regression. Hence,

    \[\lambda_{B|A} = \lambda_B(E_A,G_A).\]

  4. The result of the match A versus B is simulated by first realizing G_A and then realizing G_B conditional on the realization of G_A.
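Steps 1 to 4 can be sketched as follows. All coefficients below are hypothetical placeholders (the fitted values are not given in this text), and the function names are illustrative. In the model described above, the \alpha and \beta coefficients come from separate regressions; the sketch simply uses one fixed set of each.

```python
import math
import random

# Hypothetical coefficients, for illustration only (not the fitted values).
ALPHA = (2.0, -0.001)         # log mu_A(elo_O) = alpha_0 + alpha_1 * elo_O
BETA = (-2.0, 0.001)          # log nu_B(elo_O) = beta_0 + beta_1 * elo_O
GAMMA = (1.5, -0.001, 0.05)   # log lambda_B = gamma_0 + gamma_1 * E_A + gamma_2 * G_A


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def rate_a_given_b(elo_a: float, elo_b: float) -> float:
    """lambda_{A|B}: average of A's expected goals scored and B's expected goals conceded."""
    mu_a = math.exp(ALPHA[0] + ALPHA[1] * elo_b)
    nu_b = math.exp(BETA[0] + BETA[1] * elo_a)
    return (mu_a + nu_b) / 2


def rate_b_given_a(elo_a: float, goals_a: int) -> float:
    """lambda_{B|A}: rate of B's goals given A's Elo score and A's realized goals."""
    return math.exp(GAMMA[0] + GAMMA[1] * elo_a + GAMMA[2] * goals_a)


def simulate_match(elo_a: float, elo_b: float, rng: random.Random) -> tuple:
    """Realize G_A first, then G_B conditional on G_A; A must be the stronger team."""
    if elo_a < elo_b:
        raise ValueError("A is assumed to have the higher Elo score")
    goals_a = sample_poisson(rate_a_given_b(elo_a, elo_b), rng)
    goals_b = sample_poisson(rate_b_given_a(elo_a, goals_a), rng)
    return goals_a, goals_b


rng = random.Random(42)
results = [simulate_match(2000, 1800, rng) for _ in range(10_000)]
mean_a = sum(a for a, _ in results) / len(results)
mean_b = sum(b for _, b in results) / len(results)
print(f"average score {mean_a:.2f} : {mean_b:.2f}")
```

Because G_B is drawn only after G_A is known, the two goal counts are dependent, which is the point of the nested regression in step 3.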