As in every season around this time the UEFA will draw the round of 16 matches for the current UEFA Champions League season in Nyon. For the 16/17 season the draw ist going to happen on the 12th of December 2016, which is tomorrow. The rules for the draw and also the drawing procedure employed by the UEFA lead to interesting statistical biases when compared to a random draw where group runners-up are drawn against group winners whithout any further restrictions (symmetrical probabilities). The rules for the draw are the following:

  1. group runners-up are drawn against group winners
  2. teams from the same group in the preceding group stage may not be drwan against each other
  3. teams from the same association (i.e. usually country) may not be drawn againt each other

Rule three is the main source of bias leading to non-symmetrical probabilites for each team, since especially Bundesliga (GER), Primera Division (ESP), and Premier League (ENG) teams face more restrictions than teams from other associations. This has implications for the total set of probabilities in a draw, which will be examined later in this post. First the necessary dataset will be created, then all possible draws under the named restrictions are calculated and at the end the bias through the drawing procedure is calculated via simulation.

The Data

library(data.table)  # for data handling
library(parallel)  # for parallelizing simulation
library(ggplot2)  # for plotting
library(viridis)  # for plotting
library(gridExtra)  # for plotting
library(knitr)  # for formatting

# get the data
teams <- data.table(
  team_f = c("Arsenal", "Napoli", "Barcelona", "Atletico", 
             "Monaco", "Dortmund", "Leicester", "Juventus"),
  team_f_id = 1L:8L,
  team_f_a = c("ENG", "ITA", "ESP", "ESP", "FRA", "GER", "ENG", "ITA"),
  team_f_g = LETTERS[1:8],
  team_s = c("Paris", "Benfica", "Man City", "Bayern", 
             "Leverkusen", "Real Madrid", "Porto", "Sevilla"),
  team_s_id = 9L:16L,
  team_s_a = c("FRA", "POR", "ENG", "GER", "GER", "ESP", "POR", "ESP"),
  team_s_g = LETTERS[1:8]
)
winner winner_id winner_association winner_group runner_up runner_up_id runner_up_association runner_up_group
Arsenal 1 ENG A Paris 9 FRA A
Napoli 2 ITA B Benfica 10 POR B
Barcelona 3 ESP C Man City 11 ENG C
Atletico 4 ESP D Bayern 12 GER D
Monaco 5 FRA E Leverkusen 13 GER E
Dortmund 6 GER F Real Madrid 14 ESP F
Leicester 7 ENG G Porto 15 POR G
Juventus 8 ITA H Sevilla 16 ESP H

As one can see there are three teams each for England (ENG), Germany (GER) and Spain (ESP) that will face restrictions due to the thrid rule. Note: For displaying the table the column names were adjsuted to nicely readable names, but the order remains unchanged. The code will use the names as in the initally created table in the first R code chunk.

Calculating all Possible Draws

Next all draws that are possible under the draw rules. First all individually possible matches are calculated in the following manner

  1. create all possible pairs of group winner vs. group runner-up according to rule 1, which leads to 8 * 8 = 64 possible combinations
  2. remove all pairs that violate rule 2 and 3.
# cross-join tow data frames
CrossJoinDT <- function(x, y) {
  cj <- setkey(x[,c(k = 1, .SD)], k)[y[, c(k = 1, .SD)],
                                     allow.cartesian=TRUE][,k:=NULL]
  return(cj)
}

# check if values of one row in a data frame are unique
IsUniqDraw <- function(x) {
  isuniq <- apply(x, 1, function(l) {length(unique(l))}) == ncol(x)
  return(isuniq)
}

# possible matches only respecting rule (1)
matches <- CrossJoinDT(teams[, 1:4, with = FALSE], 
                       teams[, 5:8, with = FALSE])

# enforce rules (2) and (3)
matches <- matches[team_f_a != team_s_a & team_f_g != team_s_g, ]

This leaves us with 47 matches that are possible. Now all possibilities for the full draw defined as a possible set of 8 pairs is calculated in the following manner:

  1. take all possible pairs of one group runner-up vs. group winner
  2. calculate all combinations with pairs of another group runner-up vs group winner resulting in a set of uncomplete draws (consisting of 2-tuples of matches in the first run)
  3. remove all uncomplete draws where a group winner appears more than one time
  4. repeat steps 2-3 starting with this this tuple (2-tuple in the first run), leading to a set of 3-tuples, and so on until you have a set of 8-tuples
# split by team first
matches_split <- split(matches[, list(team_f_id, team_s_id)], matches$team_f_id)

# create data frame of all possible draws
draws <- data.table(NULL)
for (m in matches_split) {
  draws <- CrossJoinDT(draws, m)
  draws <- draws[IsUniqDraw(draws), ]
}

# get into long format
draws_long <- matrix(t(as.matrix(draws)), ncol = 2, byrow = TRUE)

# fill wit names instead of ids
draws_names <- data.table(matrix(c(teams$team_f, teams$team_s)[draws_long], 
                                 nrow = nrow(draws_long)))
draws_names$draw_id <- sort(rep(seq_len(nrow(draws)), 8))
setnames(draws_names, c("f", "s", "draw_id"))

This leaves us with a set of 3501 possible draws in draws (a set of 8-tuples). Which is only 8.68% of the set of 4.032 × 104 possible draws in a scenario where only rule 1 is in place. Note: All of the above code runs in a fraction of a second unlike a simulation of the drawing process.A sample draw in the draw_names table looks like the following. Each draw is assigned a unique sequential draw_id:

winner runner_up draw_id
Arsenal Porto 776
Napoli Man City 776
Barcelona Leverkusen 776
Atletico Paris 776
Monaco Bayern 776
Dortmund Sevilla 776
Leicester Real Madrid 776
Juventus Benfica 776

The Drawing Procedure and Simulation

Now that all possible draws are known it is easy to assume that each draw is equally likely, then calculate the fraction of all possible draws where each match exist and call that the probability of this match. But in fact this is not correct as was pointed out to me two years ago by Julien Guyon when I published a similar analysis. That is because there is another source of bias, which is due to the draw procedure of the UEFA. When simply starting the draw by drawing a single group runner-up and drawing a group winner against this opponent only individually checking the validity of this match one could end up in a situation where it is not possible to complete the draw since the following draws would not meet the rules anymore. So the UEFA uses the following procedure:

  1. chose one randomly from the set of undrawn group runners-up
  2. the computer determines the possible group winners that may be drawn against that runner-up, so that a valid draw is possible in the end.Note: This might reduce the number of possible group winners to be draw to just a single one even after just a few draws.
  3. draw a group winner randomly from the set of computer determined group winners
  4. repeat until all matches are drawn, at maximum this will be 7 match draws

To investigate the bias introduced during this process I wrote a function that produces a draw in the manner it is done at the UEFA and returns the draw_id from the set of corresponding possible draws as calculated before:

SimulateDraw <- function(draws_sim) {
  # initially available possibilities
  avail_draws <- draws_sim
  avail_s <- 9:16
  
  for (i in 1:7) {
    # choose allowed runner-up
    s1 = sample(avail_s, 1)
    # check possible opponents
    avail_f <- unique(avail_draws[s == s1, ]$f)
    if (length(avail_f) > 1) {
      f1 = sample(avail_f, 1)
    } else {
      f1 = avail_f
    }
    # calculate remaining draws
    remaining_draw_ids <- avail_draws[s == s1 & f == f1, ]$draw_id
    # update available runners-up/group first and draws
    avail_s <- setdiff(avail_s, s1)
    avail_draws <- avail_draws[draw_id %in% remaining_draw_ids, ]
  }
  return(avail_draws$draw_id[1])
}

Then I simply calculate 1000000 draws in parallel and save the resultsNote: This function might not be fast and is by no means optimized for speed, but it computes about 100 draws per second on a single core. To speed up the process I use parallel computing on my Intel Xeon E3-1231v3 on 7 threads. Since the amount of data to export intially is very small and there are no dependencies between the worker threads this gains almost a 7 times increase in speed. So I am able to compute this million of draws in less than 45 minutes, which seems reasoable.:

# use ids in this case
draws_sim <- data.table(cbind(draws_long, sort(rep(seq_len(nrow(draws)), 8))))
setnames(draws_sim, c("f", "s", "draw_id"))

# run N draws in parallel
# make cluster
cl <- makeCluster(detectCores() - 1)

# export libraries and data
clusterEvalQ(cl, library(data.table))
clusterExport(cl, c("draws_sim", "SimulateDraw"))

# replicate in parallel
set.seed(1337)
N <- 1e6
system.time(sim_draws <- parSapply(cl, 1:N, function(x) {
  SimulateDraw(draws_sim)
}))

# stop the cluster and save
stopCluster(cl)

Now it is time to see the results:

Probabilities w/o Drawing Process Bias

If one ignores the bias introduced through the draw procedure one obtains the following probabilities. These are the probabilites that are being reported on major news sites and social networks. But in fact these are not quite correct. As one will see in the next graph these are close to probabilities including the bias through the draw process.

plot of chunk plot_g1

Probabilities including the Drawing Process Bias

If one includes that second source of bias one is left with the following probabilities:

plot of chunk plot_g2

Since these calculations stem from a non deterministic simulation procedure that involves randomness these are random variables itself. Nevertheless the Chi-Squared test on Goodness-of-Fit with a uniform distribution, leads to discarding the null hypothesis that all draws are equally likely:

# test on Goodness-of-Fit where under the null hypothesis each draw is equally 
# likely (default of the chisq.test function)
chisq.test(table(sim_draws))
## 
## 	Chi-squared test for given probabilities
## 
## data:  table(sim_draws)
## X-squared = 8322, df = 3500, p-value < 2.2e-16

Conclusion

As one can see the rules imposed by the UEFA lead to interesting probabilities for single matches. Also this shows that there is a source of bias through the drawing procedure. The next plot highlights the differences in probabilities which are small, but significantly different from zero. For example Barcelona vs. Bayern becomes more likely by 0.73 percentage points due to the drawing process bias whereas Dortmund vs. Man City becomes less likely by 0.61 percentage points:

plot of chunk plot_g3

In total Napoli vs. Porto is the least likely single match with approx. 11.03% and Atletico vs. Man City is the most likely with approx. 25.36%. Another interesting thing is that these ex-ante probabilities can change quite a lot once several matches are determined. This can lead to interesting situations where a draw is fully determined after half of the draws. Note: I will probably calculate the minimum number of draws required to effectively end the draw later. As a rule of thumb one can say that a match between teams from associations with more restrictions due to rule 3 are more likely to play each other. This can be observed in previous seasons when teams from Germany had to face teams from England more likely, e.g. Arsenal vs. Bayern.

Software Used

sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.10
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets 
## [7] base     
## 
## other attached packages:
## [1] knitr_1.15       gridExtra_2.2.1  viridis_0.3.4   
## [4] ggplot2_2.1.0    data.table_1.9.6
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.5      digest_0.6.9     plyr_1.8.3      
##  [4] chron_2.3-47     grid_3.3.2       gtable_0.2.0    
##  [7] magrittr_1.5     evaluate_0.10    scales_0.4.0    
## [10] highr_0.6        stringi_1.1.2    labeling_0.3    
## [13] tools_3.3.2      servr_0.4        stringr_1.0.0   
## [16] munsell_0.4.3    httpuv_1.3.3     colorspace_1.2-6
## [19] methods_3.3.2