# Revisiting the ELO rating system

While I am currently preparing a talk on estimating player ratings using RStan to implement the Bayesian model underlying the Glicko-2 rating system, I briefly revisited the ELO rating system for comparison. Note: An implementation of the simple derived update rule for the Glicko-2 can be found on my GitHub: Schw4rz/glicko2. The ratings in both models are defined such that the probability of a win or loss only depends on the difference in the latent (i.e. unobserved) players' strengths and sometimes a set of additional covariates like home field advantage. The ELO model therefore uses a (scaled) logistic function to map differences in strength to probabilities of win and loss. Let $r_i$ be the rating for player $i$, $r_j$ the rating for player $j$, and $y_i = 1$ if player $i$ wins and zero otherwise, then: Note: Draws are excluded for simplicity here.

$$E[y_i] = P(y_i = 1) = \frac{1}{1 + 10^{(r_j - r_i)/400}}$$

And the ELO update rule is:

$$r_i' = r_i + k \left( y_i - E[y_i] \right), \qquad r_j' = r_j - k \left( y_i - E[y_i] \right)$$

Where $k$ - also called the K-factor - is a fixed update factor, suggested by Arpad Elo to be somewhere in $k \in [16, 32]$. The logic behind the equation is simple: the larger the difference between the expected score $\in [0, 1]$ and the observed score $\in \{0, 1\}$, the more a player's rating is adjusted. The update for a win is always positive, the update for a loss is always negative, and the updates are symmetric, so each game is a zero-sum game.
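As a quick numerical illustration (the ratings here are chosen arbitrarily), a 1600-rated player facing a 1400-rated opponent is expected to score about 0.76, so a win with $k = 32$ raises the rating by roughly 7.7 points:

```r
# Expected score of player i against player j under the scaled logistic
elo_expected <- function(r_i, r_j) 1 / (1 + 10^((r_j - r_i) / 400))

# One ELO update for player i after observing outcome y_i (1 = win, 0 = loss)
elo_update <- function(r_i, r_j, y_i, k = 32) {
  r_i + k * (y_i - elo_expected(r_i, r_j))
}

elo_expected(1600, 1400)       # about 0.76
elo_update(1600, 1400, 1)      # about 1607.7: a win against a weaker player
elo_update(1600, 1400, 0)      # about 1575.7: a loss costs more than a win gains
```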

# Why the ELO is essentially a Logit model with early stopping

Now I was interested in the underlying statistical model behind these intuitive choices. The first equation is, apart from a scaling factor ($\ln(10)/400$), the response function as usually specified for the Logit model:

$$P(y_i = 1) = \frac{1}{1 + e^{-(r_i - r_j)}}$$

The predictor $r_i - r_j$ is linear. So we can define an entry in row $n$ and column $p$ in a design matrix $X$ containing all observed matches with dimensions $N\times P$, where $N$ is the number of matches and $P$ is the number of players:

$$x_{np} = 1_{[p=i]} - 1_{[p=j]}$$

which is essentially dummy coding the players, where the first player gets a positive and the second player a negative sign in the design matrix. Let $\beta$, of dimension $P$, be the vector containing the ratings $r$ we are interested in estimating. Then the expectation of $y$, the vector of all observed outcomes of dimension $N$, can be written as:

$$E[y] = \frac{1}{1 + e^{-X\beta}}$$

where the logistic function is applied elementwise.

So the ELO is basically just a Logit model? Well, yes and no: you could now go ahead and just plug that into your favorite maximum likelihood estimator to estimate $\beta$. This gives you some result (Note: it only works if $N>P$ and you restrict one player's rating to some value, e.g. zero, for identification; otherwise $X^TX$ will be singular), but you don't get the same ratings as in the ELO model. That is because the ELO estimates do not maximize the likelihood function. But what is the connection then? If we look at the update rule again, it looks a lot like the gradient descent update when optimizing the logistic (sometimes called cross-entropy) loss function. Let $L$ be the loss function:

$$L(\beta) = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log E[y_n] + (1 - y_n) \log\left(1 - E[y_n]\right) \right]$$

then its partial derivative with respect to $\beta$ is:

$$\frac{\partial L}{\partial \beta} = -\frac{1}{N} X^T \left( y - E[y] \right)$$

Now the update rule for gradient descent with learning rate $\eta$ is:

$$\beta' = \beta - \eta \frac{\partial L}{\partial \beta} = \beta + \frac{\eta}{N} X^T \left( y - E[y] \right)$$

This looks very similar to the latter part of the ELO update rule. If we now fix $k$ at $\eta/N$ and only look at the update for players $i, j$ of a single match, we have:

$$r_i' = r_i + k \left( y_i - E[y_i] \right), \qquad r_j' = r_j - k \left( y_i - E[y_i] \right)$$

Which is the update rule in the ELO system. So the ELO system is a single gradient descent update to the current ratings with a fixed learning rate that ignores the number of matches already played. This means it is a Logit model with early stopping. It is well known for gradient boosting models that early stopping can help against overfitting. In other words, it is a biased estimate of the ratings compared to the Logit; the advantage is that although the estimate is biased, it has lower variance. Note: The lower variance is not explicitly shown here. Some more information on the bias-variance tradeoff can be found at Wikipedia.

Another interesting feature of the ELO model is that time is not explicitly modeled; all matches are treated as happening simultaneously. So it is not the player that improves with multiple rating updates, it is the estimate of the player's strength that improves. Updates are essentially updates of $\beta$ in mini-batches, where a single mini-batch contains only the matches that happened in the respective "rating period". In R we can define the corresponding gradient descent steps as follows, where batches is a list containing a matrix $X$ and a vector $y$ for sequential periods:
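A minimal sketch of such a function (the names gradient_descent, k, and iterations are illustrative here; the scale factor $\ln(10)/400$ is absorbed into the update factor, so with iterations = 1 this reproduces the ELO update with K-factor k):

```r
# Scaled logistic response: maps rating differences to win probabilities
expected_score <- function(x) 1 / (1 + 10^(-x / 400))

# Mini-batch gradient descent on the logistic loss, one batch per rating
# period. k corresponds to eta / N with the scale factor absorbed; with
# iterations = 1 each batch triggers exactly one ELO update.
gradient_descent <- function(batches, beta_init, k = 1, iterations = 1) {
  beta <- beta_init
  for (batch in batches) {
    for (iter in seq_len(iterations)) {
      residual <- batch$y - expected_score(batch$X %*% beta)
      beta <- beta + k * t(batch$X) %*% residual
    }
  }
  as.vector(beta)
}
```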

# A Simple Example

First, define some sample matches for two "rating periods" between three players, where t is an integer indicating the period, f is the first player, s is the second player, and y indicates whether player f won or lost (one indicates a win, zero a loss):

```
t f s y
0 1 2 1
0 2 3 1
0 3 1 0
0 1 3 1
1 2 3 1
1 3 1 0
1 3 1 1
```

get_input_variables splits the data into batches:
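The original listing is not shown here; a minimal sketch of what such a function could look like, assuming the matches are stored in a data frame with the columns t, f, s, y from above:

```r
# Split matches into one batch per rating period t; each batch holds a
# design matrix X (first player +1, second player -1) and outcome vector y.
get_input_variables <- function(matches, n_players) {
  lapply(split(matches, matches$t), function(period) {
    X <- matrix(0, nrow = nrow(period), ncol = n_players)
    for (n in seq_len(nrow(period))) {
      X[n, period$f[n]] <- 1    # first player: positive sign
      X[n, period$s[n]] <- -1   # second player: negative sign
    }
    list(X = X, y = period$y)
  })
}
```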

First, I will show that the above defined functions actually compute the ELO update for a given rating period by comparing to the result from the PlayerRatings package. I will use a K-factor of k=1 and an initial rating (beta_init and init) of zero, but it works with any choice of k and initial rating. Note: The initial and absolute level of the ratings cannot be identified (and does not have an interpretation), only differences are relevant.

```
beta_1   beta_2    beta_3
1.500000 0.0000000 -1.500000
1.491365 0.4978413 -1.989207
```

```
beta_1   beta_2    beta_3
1.500000 0.0000000 -1.500000
1.491365 0.4978413 -1.989207
```

Then we compare with the Logit model, computed using R's glm function, treating everything as a single period for simplicity:
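A sketch of the call (player 1's rating is fixed at zero for identification, and the coefficients are rescaled by $400/\ln(10)$ from the logit scale back to the ELO scale):

```r
# Design matrix columns for players 2 and 3 over all seven matches;
# player 1 is the reference, its rating is fixed at zero
x2 <- c(-1, 1, 0, 0, 1, 0, 0)
x3 <- c(0, -1, 1, -1, -1, 1, 1)
y  <- c(1, 1, 0, 1, 1, 0, 1)

fit <- glm(y ~ 0 + x2 + x3, family = binomial)

# rescale the coefficients to the ELO scale
beta_glm <- c(beta_1 = 0, coef(fit) * 400 / log(10))
```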

```
beta_1 beta_2    beta_3
0      -65.96655 -303.4489
```

Now we compare with the gradient_descent function itself. The only thing we change is the number of iterations=1000, so it is indeed optimizing the likelihood (gradient descent without early stopping):

```
beta_1 beta_2    beta_3
0      -66.24151 -303.3957
```

And we can see it is the same up to slight numerical differences. We can check the loss in each case:
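The loss comparison can be sketched as follows (the helper name logistic_loss is illustrative; it evaluates the average cross-entropy loss under the scaled logistic response, which is invariant to the absolute level of the ratings):

```r
# Average logistic (cross-entropy) loss of ratings beta on matches (X, y)
logistic_loss <- function(beta, X, y) {
  p <- 1 / (1 + 10^(-(X %*% beta) / 400))
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# design matrix and outcomes for all seven matches
X <- matrix(c(1, -1, 0, 0, 1, -1, -1, 0, 1, 1, 0, -1,
              0, 1, -1, -1, 0, 1, -1, 0, 1), nrow = 7, byrow = TRUE)
y <- c(1, 1, 0, 1, 1, 0, 1)

logistic_loss(c(1.491365, 0.4978413, -1.989207), X, y)  # ELO estimate
logistic_loss(c(0, -65.96655, -303.4489), X, y)         # Logit estimate
```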

We see that the Logit estimate has lower loss, since it maximizes the likelihood and therefore minimizes the loss; the ELO estimate has higher loss since it does not. This is a classic case of the bias-variance tradeoff: the ELO accepts some bias in exchange for lower variance and should therefore be better at predicting future outcomes.

# Conclusion

As shown, the ELO can be interpreted as mini-batch gradient descent with a logistic loss function and a fixed learning rate that equals the K-factor and ignores the sample size. The ELO is very commonly used, which stems mainly from its simplicity and intuitive update rule. The downside is that it has no notion of time periods like e.g. the Glicko rating system, nor does it provide an estimate of the variance of a rating, and it ignores the number of games the rating to be updated is based on. The latter means that the rating of a player observed over many matches changes as quickly as that of a player with few matches.