Revisiting the ELO rating system

While I am currently preparing a talk on estimating player ratings using RStan to implement the Bayesian model underlying the Glicko-2 rating system, I briefly revisited the ELO rating system for comparison. Note: An implementation of the simple derived update rule for the Glicko-2 can be found on my GitHub: Schw4rz/glicko2. The ratings in both models are defined such that the probability of a win or loss only depends on the difference in the latent (i.e. unobserved) players' strength and sometimes a set of additional covariates like home field advantage. The ELO model therefore uses a (scaled) logistic function to map differences in strength to probabilities of win and loss. Let $r_A$ be the rating for player $A$, $r_B$ the rating for player $B$, and $y = 1$ if player $A$ wins and zero else, then: Note: Draws are excluded for simplicity here.

$$E_A = P(y = 1) = \frac{1}{1 + 10^{-(r_A - r_B)/400}}$$
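
As a quick illustration (the helper name elo_expected_score is my own, not part of the original ELO specification), the expected score can be computed directly from two ratings:

elo_expected_score <- function(r_a, r_b) {
  # expected score (win probability) of the first player under the ELO model
  1 / (1 + 10^(-(r_a - r_b) / 400))
}

# a player rated 200 points above the opponent is expected to score about 0.76
elo_expected_score(1700, 1500)
## [1] 0.7597469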

And the ELO update rule is:

$$r_A \leftarrow r_A + K \cdot (y - E_A), \qquad r_B \leftarrow r_B + K \cdot ((1 - y) - E_B)$$

Where $K$ - also called the K-factor - is a fixed update factor, for which Arpad Elo suggested values within a fixed range. The logic behind the equation is simple: the higher the difference between expected and observed score, the more a player's rating is adjusted. The update for a win is always positive, the update for a loss is always negative, and the updates are symmetric, so each game is a zero-sum game.
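
To make the update rule concrete, here is a minimal sketch of a single-match update; the function name elo_update and the default K-factor of 32 are my own illustrative choices:

elo_update <- function(r_a, r_b, y, k = 32) {
  # expected score for player A
  e_a <- 1 / (1 + 10^(-(r_a - r_b) / 400))
  # symmetric, zero-sum update of both ratings
  c(r_a = r_a + k * (y - e_a),
    r_b = r_b + k * ((1 - y) - (1 - e_a)))
}

# an evenly rated winner gains 16 points, the loser gives up 16 points
elo_update(1500, 1500, y = 1)
##  r_a  r_b
## 1516 1484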

Why the ELO is essentially a Logit model with early stopping

Now I was interested in the underlying statistical model behind these intuitive choices. The first equation is, apart from a scaling factor ($c = \log(10)/400$), the response function as usually specified for the Logit model:

$$P(y = 1) = \frac{1}{1 + e^{-\eta}}$$

The predictor $\eta$ is linear. So we can define an entry in row $i$ and column $j$ of a design matrix $X$ containing all observed matches, with dimensions $N \times P$ where $N$ is the number of matches and $P$ is the number of players:

$$x_{ij} = +1 \quad \text{if player } j \text{ is the first player in match } i$$

and

$$x_{ij} = -1 \quad \text{if player } j \text{ is the second player in match } i \text{ (and } x_{ij} = 0 \text{ otherwise)},$$

which is essentially dummy coding the players, where the first player has a positive and the second player has a negative sign in the design matrix. Let $\beta$ of dimension $P \times 1$ be the vector containing the ratings we are interested in estimating. Then the expectation of the vector of all observed outcomes $y$ of dimension $N \times 1$ can be written as:

$$E[y] = \frac{1}{1 + e^{-c X \beta}}$$

calculate_predictor <- function(X, beta) {
  # calculate the linear predictor Xbeta
  colSums(t(X) * beta)
}

logistic_function <- function(predictor, scaling_factor = log(10)/400) {
  # ELO uses a scaling factor of log(10) / 400 compared to the
  # usual specification of the logistic function
  1 / (1 + exp(-predictor * scaling_factor))
}
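
As a quick check (an example of my own), a 200-point rating difference fed through the scaled logistic gives the same roughly 76% win probability that the ELO expected-score formula produces:

# win probability for a player rated 200 points above the opponent
logistic_function(calculate_predictor(X = matrix(c(1, -1), nrow = 1),
                                      beta = c(1700, 1500)))
## [1] 0.7597469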

So the ELO is basically just a Logit model? Well, yes and no: you could now go ahead and just plug that into your favorite maximum likelihood estimator to estimate $\beta$. This gives you some result (it only works if you restrict one player's rating to some value, e.g. zero, for identification and the remaining design matrix has full column rank; otherwise $X^\top X$ will be singular), but you don't get the same ratings as in the ELO model. That is because the ELO estimates do not maximize the likelihood function. But what is the connection then? If we look at the update rule again, it looks a lot like the gradient descent update when optimizing the logistic (sometimes called cross-entropy) loss function. Let $L(\beta)$ be the loss function:

$$L(\beta) = \frac{1}{N} \sum_{i=1}^{N} \big[ -y_i \log(E_i) - (1 - y_i) \log(1 - E_i) \big]$$

where $E_i = \frac{1}{1 + e^{-c \, x_i^\top \beta}}$ is the expected score, i.e. the predicted win probability, for match $i$:

logistic_loss <- function(X, y, beta) {
  # number of observations
  N <- length(y)
  # expected score, i.e. predicted probability for win/loss
  e_score <- logistic_function(calculate_predictor(X, beta))

  loss <- (1 / N) * sum(-y * log(e_score) - (1 - y) * log(1 - e_score))
  return(loss)
}
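
As a small sanity check (toy inputs of my own), with all ratings at zero every expected score is 0.5, so the loss equals $\log(2) \approx 0.693$ regardless of the outcomes; this is a useful reference point for the losses reported further below:

# loss of an uninformative rating vector (all zeros): every prediction is 0.5
X_toy <- matrix(c(1, -1, 0,
                  0, 1, -1), nrow = 2, byrow = TRUE)
logistic_loss(X_toy, y = c(1, 0), beta = c(0, 0, 0))
## [1] 0.6931472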

then its partial derivative with respect to $\beta$ is:

$$\frac{\partial L(\beta)}{\partial \beta} = \frac{c}{N} X^\top (E - y)$$

Now the update rule for gradient descent with learning rate $\alpha$ is:

$$\beta \leftarrow \beta - \alpha \frac{\partial L(\beta)}{\partial \beta} = \beta + \alpha \frac{c}{N} X^\top (y - E)$$

This looks very similar to the latter part of the ELO update rule. If we now fix $\alpha$ at $\frac{NK}{c}$ (absorbing the constant factor $c/N$ into the K-factor) and only look at the update for the two players $A$ and $B$ of a single match, we have:

$$\beta_A \leftarrow \beta_A + K (y - E_A), \qquad \beta_B \leftarrow \beta_B + K ((1 - y) - E_B)$$

update_beta <- function(X, y, beta, k) {
  # number of observations
  n <- length(y)
  # expected score, i.e. predicted probability for win/loss
  e_score <- logistic_function(calculate_predictor(X, beta))
  # gradient at the current position (up to the constant factor c / N)
  gradient <- colSums(X * (e_score - y))
  # update step
  beta <- beta - k * gradient
  # print logistic loss
  #print(logistic_loss(X, y, beta))
  return(beta)
}
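
As a quick illustration (a toy example of my own, not from the data used below): for a single match between two equally rated players and k = 32, one call to update_beta reproduces the textbook ELO update of plus/minus 16 points:

# one match: player 1 beats player 2, both start with a rating of zero
X_single <- matrix(c(1, -1), nrow = 1)
update_beta(X_single, y = 1, beta = c(0, 0), k = 32)
## [1]  16 -16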

Which is exactly the update rule in the ELO system. So the ELO system is a single gradient descent update to the current ratings with a fixed learning rate that ignores the number of matches already played. This means it is a Logit model with early stopping. It is well known for gradient boosting models that early stopping can help prevent overfitting. In other words, it is a biased estimate of the ratings compared to the Logit; the advantage is that, despite being biased, the estimate has lower variance. The lower variance is not explicitly shown here; more information on the bias-variance tradeoff can be found on Wikipedia.

Another interesting feature of the ELO model is that time is not explicitly modeled; all matches are treated as if they happened simultaneously. So it is not the player that improves with multiple rating updates, it is the estimate of the player's strength that improves. The updates are essentially updates of $\beta$ in mini-batches, where a single mini-batch contains only the matches that happened in the respective "rating period". In R we can define the corresponding gradient descent steps as follows, where batches is a list containing a matrix $X$ and a vector $y$ for sequential periods:

gradient_descent <- function(batches, beta_init, k, iterations) {
  # set beta to initial value
  beta <- beta_init
  # initialize matrix to store updates
  beta_history <- matrix(nrow = length(batches),
                         ncol = length(beta_init))
  # loop over iterations, aka epochs
  for (i in 1:iterations) {
    # loop over mini-batches
    for (b in 1:length(batches)) {
      # run update procedure
      batch <- batches[[b]]
      beta <- update_beta(batch$X, batch$y, beta, k)
      beta_history[b, ] <- beta
    }
  }
  return(beta_history)
}

A Simple Example

First define some sample matches for two "rating periods" and three players, where t is an integer indicating the period, f is the first player, s is the second player, and y indicates whether player f won or lost, with one indicating a win and zero a loss:

data <- data.frame(t = factor(c(0, 0, 0, 0, 1, 1, 1)),  # period
                   f = factor(c(1, 2, 3, 1, 2, 3, 3)),  # first player
                   s = factor(c(2, 3, 1, 3, 3, 1, 1)),  # second player
                   y = c(1, 1, 0, 1, 1, 0, 1))  # win/loss
t f s y
0 1 2 1
0 2 3 1
0 3 1 0
0 1 3 1
1 2 3 1
1 3 1 0
1 3 1 1

get_input_variables builds the design matrix and outcome vector for a given subset of the matches; applying it to the data split by period yields the batches:

get_input_variables <- function(data) {
  design_matrix <-
    model.matrix(~ as.factor(f) - 1, data) -
    model.matrix(~ as.factor(s) - 1, data)
  colnames(design_matrix) <- paste0("player_", seq_len(ncol(design_matrix)))
  return(list(X = design_matrix, y = data$y))
}

batches <- lapply(split(data, data$t), get_input_variables)
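
For the first rating period (t = 0) the resulting design matrix has one row per match with the +1/-1 coding described above; schematically (the annotations are my own):

batches[["0"]]$X
#   player_1 player_2 player_3
#          1       -1        0   (player 1 vs. player 2)
#          0        1       -1   (player 2 vs. player 3)
#         -1        0        1   (player 3 vs. player 1)
#          1        0       -1   (player 1 vs. player 3)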

First I will show that the functions defined above actually compute the ELO update for a given rating period by comparing to the result from the PlayerRatings package. I will use a K-factor of k=1 and an initial rating beta_init and init of zero, but it works with any choice of k and initial rating: Note: The initial and absolute level of the ratings cannot be identified (and does not have an interpretation), only the difference is relevant.

# determine number of players from data
P <- max(sapply(batches, function(x) {ncol(x$X)}))
# run batch gradient descent
bgd_beta <- gradient_descent(batches,
                             beta_init = numeric(P),
                             k = 1,
                             iterations = 1)
# formatting
bgd_beta_df <- data.frame(bgd_beta)
colnames(bgd_beta_df) <- paste0("beta_", seq_len(P))
beta_1 beta_2 beta_3
1.500000 0.0000000 -1.500000
1.491365 0.4978413 -1.989207
elo_ratings <- PlayerRatings::elo(data.frame(apply(data, 2, as.numeric)),
                                  kfac = 1, init = 0, history = TRUE)
# formatting
elo_ratings_df <- data.frame(t(elo_ratings$history[, , "Rating"]))
colnames(elo_ratings_df) <- paste0("beta_", seq_len(P))
beta_1 beta_2 beta_3
1.500000 0.0000000 -1.500000
1.491365 0.4978413 -1.989207

Then we compare with the Logit model computed using R's glm function, treating everything as a single period for simplicity:

data$t <- factor(0)
# via glm
single_batch <- get_input_variables(data)
X <- single_batch$X[, -1]  # drop first player for identification
y <- single_batch$y
glm_beta <- c(0, coef(glm(y ~ -1 + X, family = binomial)))
glm_beta <- glm_beta * 400 / log(10)  # rescale to the ELO scale (inverse scaling factor)

# formatting
glm_beta_df <- data.frame(t(glm_beta))
colnames(glm_beta_df) <- paste0("beta_", seq_len(P))
beta_1 beta_2 beta_3
0 -65.96655 -303.4489

Now we compare with the gradient_descent function. The only thing we change is the number of iterations to 1000, so that it is indeed optimizing the likelihood (gradient descent without early stopping):

bgd_beta_opt <- gradient_descent(list(single_batch),
                                 beta_init = numeric(P),
                                 k = 1,
                                 iterations = 1000)

# normalize beta_1 to zero
bgd_beta_opt <- bgd_beta_opt - bgd_beta_opt[1, 1]
# formatting
bgd_beta_opt_df <- data.frame(bgd_beta_opt)
colnames(bgd_beta_opt_df) <- paste0("beta_", seq_len(P))
beta_1 beta_2 beta_3
0 -66.24151 -303.3957

And we can see it is the same up to slight numerical differences. We can check the loss in each case:

# for ELO
ELO_beta <- gradient_descent(list(single_batch),
                             beta_init = numeric(P),
                             k = 1,
                             iterations = 1)
logistic_loss(single_batch$X, single_batch$y, as.numeric(ELO_beta))
## [1] 0.6878389
# for GLM
GLM_beta <- gradient_descent(list(single_batch),
                             beta_init = numeric(P),
                             k = 1,
                             iterations = 1000)
logistic_loss(single_batch$X, single_batch$y, as.numeric(GLM_beta))
## [1] 0.4806901

We see that the Logit estimate has a lower loss, since it maximizes the likelihood and therefore minimizes the loss. The ELO estimate has a higher loss since it does not. This is a classic case of the bias-variance tradeoff: the ELO accepts some bias in exchange for lower variance and should therefore be better at predicting future outcomes.

Conclusion

As shown, the ELO can be interpreted as mini-batch gradient descent on a logistic loss function with a fixed learning rate that corresponds to the K-factor and that ignores the sample size. The ELO is very commonly used, which stems mainly from its simplicity and intuitive update rule. The downside is that it has no explicit notion of time as e.g. the Glicko rating system has, nor does it provide an estimate of the variance of a rating, and it ignores the number of games the rating to be updated is based on. The latter means that the rating of a player observed over many matches changes as quickly as that of a player with only a few matches.

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] knitr_1.18
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.4      magrittr_1.5        tools_3.4.4        
##  [4] Rcpp_0.12.18        stringi_1.2.4       highr_0.6          
##  [7] PlayerRatings_1.0-1 methods_3.4.4       stringr_1.2.0      
## [10] httpuv_1.3.5        servr_0.7           evaluate_0.10.1