As in every season around this time the UEFA will draw the round of 16 matches for the current UEFA Champions League season in Nyon. For the 16/17 season the draw ist going to happen on the 12th of December 2016, which is tomorrow. The rules for the draw and also the drawing procedure employed by the UEFA lead to interesting statistical biases when compared to a random draw where group runners-up are drawn against group winners whithout any further restrictions (symmetrical probabilities). The rules for the draw are the following:

1. group runners-up are drawn against group winners
2. teams from the same group in the preceding group stage may not be drwan against each other
3. teams from the same association (i.e. usually country) may not be drawn againt each other

Rule three is the main source of bias leading to non-symmetrical probabilites for each team, since especially Bundesliga (GER), Primera Division (ESP), and Premier League (ENG) teams face more restrictions than teams from other associations. This has implications for the total set of probabilities in a draw, which will be examined later in this post. First the necessary dataset will be created, then all possible draws under the named restrictions are calculated and at the end the bias through the drawing procedure is calculated via simulation.

## The Data

winner winner_id winner_association winner_group runner_up runner_up_id runner_up_association runner_up_group
Arsenal 1 ENG A Paris 9 FRA A
Napoli 2 ITA B Benfica 10 POR B
Barcelona 3 ESP C Man City 11 ENG C
Atletico 4 ESP D Bayern 12 GER D
Monaco 5 FRA E Leverkusen 13 GER E
Dortmund 6 GER F Real Madrid 14 ESP F
Leicester 7 ENG G Porto 15 POR G
Juventus 8 ITA H Sevilla 16 ESP H

As one can see there are three teams each for England (ENG), Germany (GER) and Spain (ESP) that will face restrictions due to the thrid rule. Note: For displaying the table the column names were adjsuted to nicely readable names, but the order remains unchanged. The code will use the names as in the initally created table in the first R code chunk.

## Calculating all Possible Draws

Next all draws that are possible under the draw rules. First all individually possible matches are calculated in the following manner

1. create all possible pairs of group winner vs. group runner-up according to rule 1, which leads to 8 * 8 = 64 possible combinations
2. remove all pairs that violate rule 2 and 3.

This leaves us with 47 matches that are possible. Now all possibilities for the full draw defined as a possible set of 8 pairs is calculated in the following manner:

1. take all possible pairs of one group runner-up vs. group winner
2. calculate all combinations with pairs of another group runner-up vs group winner resulting in a set of uncomplete draws (consisting of 2-tuples of matches in the first run)
3. remove all uncomplete draws where a group winner appears more than one time
4. repeat steps 2-3 starting with this this tuple (2-tuple in the first run), leading to a set of 3-tuples, and so on until you have a set of 8-tuples

This leaves us with a set of 3501 possible draws in draws (a set of 8-tuples). Which is only 8.68% of the set of 4.032 × 104 possible draws in a scenario where only rule 1 is in place. Note: All of the above code runs in a fraction of a second unlike a simulation of the drawing process.A sample draw in the draw_names table looks like the following. Each draw is assigned a unique sequential draw_id:

winner runner_up draw_id
Arsenal Porto 776
Napoli Man City 776
Barcelona Leverkusen 776
Atletico Paris 776
Monaco Bayern 776
Dortmund Sevilla 776
Juventus Benfica 776

## The Drawing Procedure and Simulation

Now that all possible draws are known it is easy to assume that each draw is equally likely, then calculate the fraction of all possible draws where each match exist and call that the probability of this match. But in fact this is not correct as was pointed out to me two years ago by Julien Guyon when I published a similar analysis. That is because there is another source of bias, which is due to the draw procedure of the UEFA. When simply starting the draw by drawing a single group runner-up and drawing a group winner against this opponent only individually checking the validity of this match one could end up in a situation where it is not possible to complete the draw since the following draws would not meet the rules anymore. So the UEFA uses the following procedure:

1. chose one randomly from the set of undrawn group runners-up
2. the computer determines the possible group winners that may be drawn against that runner-up, so that a valid draw is possible in the end.Note: This might reduce the number of possible group winners to be draw to just a single one even after just a few draws.
3. draw a group winner randomly from the set of computer determined group winners
4. repeat until all matches are drawn, at maximum this will be 7 match draws

To investigate the bias introduced during this process I wrote a function that produces a draw in the manner it is done at the UEFA and returns the draw_id from the set of corresponding possible draws as calculated before:

Then I simply calculate 1000000 draws in parallel and save the resultsNote: This function might not be fast and is by no means optimized for speed, but it computes about 100 draws per second on a single core. To speed up the process I use parallel computing on my Intel Xeon E3-1231v3 on 7 threads. Since the amount of data to export intially is very small and there are no dependencies between the worker threads this gains almost a 7 times increase in speed. So I am able to compute this million of draws in less than 45 minutes, which seems reasoable.:

Now it is time to see the results:

## Probabilities w/o Drawing Process Bias

If one ignores the bias introduced through the draw procedure one obtains the following probabilities. These are the probabilites that are being reported on major news sites and social networks. But in fact these are not quite correct. As one will see in the next graph these are close to probabilities including the bias through the draw process.

## Probabilities including the Drawing Process Bias

If one includes that second source of bias one is left with the following probabilities:

Since these calculations stem from a non deterministic simulation procedure that involves randomness these are random variables itself. Nevertheless the Chi-Squared test on Goodness-of-Fit with a uniform distribution, leads to discarding the null hypothesis that all draws are equally likely:

## Conclusion

As one can see the rules imposed by the UEFA lead to interesting probabilities for single matches. Also this shows that there is a source of bias through the drawing procedure. The next plot highlights the differences in probabilities which are small, but significantly different from zero. For example Barcelona vs. Bayern becomes more likely by 0.73 percentage points due to the drawing process bias whereas Dortmund vs. Man City becomes less likely by 0.61 percentage points:

In total Napoli vs. Porto is the least likely single match with approx. 11.03% and Atletico vs. Man City is the most likely with approx. 25.36%. Another interesting thing is that these ex-ante probabilities can change quite a lot once several matches are determined. This can lead to interesting situations where a draw is fully determined after half of the draws. Note: I will probably calculate the minimum number of draws required to effectively end the draw later. As a rule of thumb one can say that a match between teams from associations with more restrictions due to rule 3 are more likely to play each other. This can be observed in previous seasons when teams from Germany had to face teams from England more likely, e.g. Arsenal vs. Bayern.