Bayesian Elo Rating

<!DOCTYPE html> <html lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <meta name="description" content="Bayeselo is a freeware tool to estimate Elo ratings. It can read a file containing game records in PGN format, and produce a rating list."> <meta name="author" content="R茅mi Coulom"> <title>Bayesian Elo Rating</title> <link href="../coulom.css" rel="stylesheet" type="text/css"> <meta name="viewport" content="width=device-width, initial-scale=1"> </head> <body> <h1>Bayesian Elo Rating</h1> <p>by <a href="..">R茅mi Coulom</a>.</p> <p>Bayeselo is a freeware tool to estimate Elo ratings. It can read a file containing game records in <a href="https://en.wikipedia.org/wiki/Portable_Game_Notation">PGN</a> format, and produce a rating list.</p> <h2>Contents</h2> <ul> <li><a href="#download">Download</a></li> <li><a href="#history">History</a></li> <li><a href="#usage">Usage Documentation</a></li> <li><a href="#theory">Theory</a></li> <li><a href="#elostat">Is Bayeselo Really Better than Elostat?</a></li> <li><a href="#links">Links</a></li> </ul> <h2><a id="download">Download Software</a></h2> <ul> <li><a href="bayeselo.exe">bayeselo.exe</a>: Windows binary</li> <li><a href="bayeselo.tar.bz2">bayeselo.tar.bz2</a>: Source code (distributed under the terms of the <a href="http://www.gnu.org/copyleft/gpl.html">GNU General Public License</a>)</li> </ul> <h2><a id="history">History</a></h2> <ul> <li>2010.03.31: <ul> <li>Minor fixes to source code so that it compiles with less warnings and errors. </ul> <li>2007.01.30: <ul> <li>New optional parameter to the <tt>ratings</tt> command to indicate whether rank is rank among players in the names file, or the global rank among all players.</li> <li>The <tt>offset</tt> command can be used to set the rating of one of the players. For instance <tt>offset 2800 Fruit</tt> will offset ratings so that the rating of Fruit is 2800.</li> <li>Fix to the <tt>elostat</tt> command, to produce a better imitation of elostat's rating calculation (thanks to Brian Mottershead).</li> </ul> <li>2006.04.17: <ul> <li>The <tt>ratings</tt> commands now takes an additional optional parameter that indicates the name of a file from which to read player names whose rating is to be displayed.</li> <li>New <tt>names</tt> command in the <tt>ResultSet</tt> interface: get the list of players in alphabetical order (can be used to write a file for the <tt>ratings</tt> commands as indicated in the previous item.)</li> </ul> </li> <li>2006.02.08: <ul> <li>Better predictions: simulations take rating uncertainty into consideration.</li> <li>New <tt>pairstats</tt> command to get statistics for every pair of players.</li> <li>The <tt>ratings</tt> command now displays average opponent.</li> <li>No need to explicitly run <tt>covariance</tt> before <tt>los</tt>.</li> </ul> </li> <li>2005.12.18: <ul> <li>New <tt>scale</tt> command to scale ratings. By default, maximum-likelihood ratings are now scaled down so that they look more like Elostat/SSDF ratings.</li> <li>New <tt>prediction</tt> command to predict the outcome of round-robin tournaments.</li> </ul> </li> <li>2005.09.29: Elostat confidence intervals + improved Elostat accuracy</li> <li>2005.09.27: Fixed a bug that affected rating calculations</li> <li>2005.09.24: <ul> <li>Elostat algorithm implemented</li> <li>Better rounding of rating estimations</li> <li>Sorted opponents in the <tt>details</tt> command</li> <li>Confusing debug output of the <tt>players</tt> command removed</li> <li>Removed dependencies: smaller source and binary</li> <li>Compilation date/time added in case I forget to increment version</li> </ul> </li> <li>2005.09.22: Fixed a potential segmentation fault + link to source</li> <li>2005.09.21: Better ratings thanks to the use of a proper prior (see Winboard Forum discussion for details)</li> <li>2005.02.13: Many changes: <ul> <li>"exactdist" does not require an amount of memory proportional to the number of players anymore, so it can now be applied to a set of results with hundreds of thousands of players.</li> <li>Deep internal changes, resulting in a different organization of commands.</li> <li>"drawdist" and "advdist" are twice faster.</li> <li>New function to get the proportion of draws as a function of the average rating of players.</li> <li>New function to estimate confidence intervals by inversing the Hessian of the log-likelihood. Also, a simplified diagonal version is available.</li> <li>Tables of "likelihood that A is stronger than B".</li> <li>Some minor problems were fixed.</li> </ul> </li> <li>2005.01.15: Fixed a PGN-parsing bug (每 was considered as EOF)</li> <li>2005.01.14: Fixed a bug that could cause the program to crash when a player is not MM-connected to anybody.</li> </ul> <h2><a id="usage">Usage Documentation</a></h2> <p>Bayeselo is a command-line tool. This section presents typical examples of use.</p> <p>The example below shows how to compute ratings from a PGN file (wbec.pgn, located in the directory where the program is executed).</p> <pre>ResultSet>readpgn wbec.pgn 37 game(s) loaded, 0 game(s) with unknown result ignored. ResultSet>elo ResultSet-EloRating>mm Iteration 100: 1.60455e-005 00:00:00,00 ResultSet-EloRating>exactdist 00:00:00,04 ResultSet-EloRating>ratings Rank Name Elo + - games score oppo. draws 1 Dragon 4.7.5 120 167 148 8 75% 5 50% 2 Cerebro 2.05a 60 229 211 4 63% 1 25% 3 Movei 0.08.336 9 138 138 12 50% 6 17% 4 Zarkov 4.67 2 146 150 9 44% 18 44% ...</pre> <p>If you wish to get the output of the "ratings" command in a file, you may redirect its output like this: <tt>ratings >myfile.txt</tt></p> <p>bayeselo can produce a likelihood-of-superiorty matrix:</p> <pre>ResultSet-EloRating>los 0 5 3 Dr Ce Mo Za Co Dragon 4.7.5 59 81 85 71 Cerebro 2.05a 40 57 60 68 Movei 0.08.336 18 42 51 51 Zarkov 4.67 14 39 48 50 Comet B.68 28 31 48 49</pre> <p>Bayeselo can also produce predictions for round-robin tournaments. For instance, if wbec1to9.pgn contains games of many players, it is possible to predict the outcome of a round-robin tournament between 4 of them and an unknown "Toto", with this kind of script:</p> <pre>readpgn wbec1to9.pgn elo mm prediction rounds 2 ;This indicates that each player plays 4 games, 2 with each color. results addplayer Esc 1.16 addplayer Pharaon 2.62 addplayer Gandalf 4.32h addplayer TheCrazyBishop 0045 addplayer Toto x players ;Displays the list of players simulate ;This runs 100000 random simulations of the tournament x x x</pre> <p>Sending the commands above to bayeselo will produce an output that looks like:</p> <pre>Num Name Elo Std.Dev. 0 Esc 1.16 205.142 13.6535 1 Pharaon 2.62 464.88 13.7103 2 Gandalf 4.32h 537.769 15.2113 3 TheCrazyBishop 0045 340.416 15.5046 4 Toto 0 1000 (5 players) Rank Player name Points EPoints StdDev ERank 1 2 3 4 5 1 Gandalf 4.32h 0.0 11.55 2.06 1.59 52 37 9 1 0 2 Pharaon 2.62 0.0 10.14 2.14 2.20 18 49 28 5 0 3 TheCrazyBishop 0045 0.0 7.59 2.19 3.32 1 9 51 34 4 4 Toto 0.0 5.73 6.44 3.60 29 4 4 6 58 5 Esc 1.16 0.0 4.99 2.12 4.29 0 0 7 55 38</pre> <p>The <tt>EPoints</tt> column indicates the expected final score. <tt>ERank</tt> is the expected final rank. The matrix to the right indicates the probability in percent for every player and every rank.</p> <p>This prediction tool may also be applied to a running tournament, where some of the games have already been played. In order to do this, simply replace the <tt>addplayer</tt> commands in the script by</p> <pre>readpgn partial.pgn</pre> <p>, where <tt>partial.pgn</tt> contains the current games of the round robin. If <tt>partial.pgn</tt> does not contain all the players of the tournament yet, you may add more players with the <tt>addplayer</tt> command. In case the rating of one participant is unknown, you can set it manually with the <tt>elo</tt> command at the level of the <tt>prediction</tt> interface.</p> <p>The prediction tool also lets you change the number of points awarded for a win, a loss, and a draw. This way, it can be applied to generate predictions in the French Football Championship (soccer in the US), where a victory is 3 points, a draw 1 point and a loss 0 point. Default values are 1, 0.5, and 0.</p> <p>If you want more usage information, you can get a list of available commands with the <tt>?</tt> command when running bayeselo.</p> <h2><a id="theory">Theory</a></h2> <p>The fundamental formula of Elo theory gives the expected result (E) of a game as a function of the rating difference (D) of players. For instance:</p> <p>E=1/(1+10^(D/400))</p> <p>The fundamental assumption of the Elo rating system is that the strength of a player can be described by a single value, and that game results are drawn according to the formula above.</p> <p>The problem of Elo evaluation consists in estimating the Elo rating of a set of players, from the observation of results of their games.</p> <h3>Elostat approach</h3> <p>The fundamental Elo formula can be reversed to obtain an estimation of the rating difference between two players, as a function of the average score. This is the basis of the Elostat approach, that works in two steps:</p> <ul> <li>Iterative method to solve a fixed-point equation, so that the rating of every player is in accordance to the reverse Elo formula, assuming an expected result equal to the average score, against an opponent whose Elo is equal to the average opponent. This is done under a constraint of a given average Elo over all the players.</li> <li>Measure of variance of score to estimate uncertainty.</li> </ul> <p>The main flaw of this approach is that the estimation of uncertainty does as if a player had played against one opponent, whose Elo is equal to the mean Elo of the opponents. This assumption has bad consequences for the estimation of ratings and uncertainties:</p> <ul> <li>The expected result against two players is not equal to the expected result against one single player whose rating is the average of the two players.</li> <li>Estimation of uncertainty is wrong, because 10 wins and 10 losses against a 1500-Elo opponent should result in less uncertainty than 10 wins against a 500-Elo opponent and 10 losses against a 2500-Elo opponent.</li> </ul> <p>Also, another problem is that the estimation of uncertainty in Elostat does as if the rating of opponents are their true ratings. But those ratings also have some uncertainty that should be taken into consideration.</p> <h3>Bayesian approach</h3> <p>All these problems of the Elostat approach can be solved using a Bayesian approach. The principle of the Bayesian approach consists in choosing a prior likelihood distribution over Elo ratings, and computing a posterior distribution as a function of the observed results.</p> <p>The principle of Bayesian inference is based on Bayes's formula:</p> <p>P(Elos|Results) = P(Results|Elos)P(Elos)/P(Results)</p> <p>P(Elos) is the prior distribution. It will be chosen to be uniform in the rest of this discussion. So we get:</p> <p>P(Elos|Results) proportional to P(Results|Elos)</p> <p>In order to perform this calculation, it is necessary to assume a little more than the usual ELO formula. The expected score as a function of the Elo difference is not enough. We need the probability of a win, a draw and a loss as a function of the Elo difference.</p> <p>This can be done this way:</p> <ul> <li>f(Delta) = 1 / (1 + 10^(Delta/400)) <li>P(WhiteWins) = f(eloBlack - eloWhite - eloAdvantage + eloDraw)</li> <li>P(BlackWins) = f(eloWhite - eloBlack + eloAdvantage + eloDraw)</li> <li>P(Draw) = 1 - P(WhiteWins) - P(BlackWins)</li> </ul> <p>eloAdvantage indicates the advantage of playing first. eloDraw indicates how likely draws are. The default values in the program were obtained by finding their maximum-likelihood values over 29,610 games of Leo Dijksman's WBEC. The value measured, with 95% confidence intervals are:</p> <ul> <li>eloAdvantage = 32.8 +/- 4</li> <li>eloDraw = 97.3 +/- 2</li> </ul> <p><tt>bayeselo</tt> finds the maximum-likelihood ratings, using a minorization-maximization (MM) algorithm. A description of this algorithm is available in the <a href="#links">Links section</a> below.</p> <h2><a id="elostat">Is Bayeselo Really Better than Elostat?</a></h2> <p>In this section, I will present some facts that highlight the differences between the two programs, that, I hope, should convince most readers that bayeselo is better than elostat.</p> <p>Still, I do not claim that bayeselo is perfect, and criticism is welcome. Bayeselo has already benefited a lot from the feedback of its users, and I thank them for that. If you find a situation where the output of bayeselo looks bad or strange, do not hesitate to let me know.</p> <h3>Bayeselo takes color into consideration</h3> <p>In chess, playing with the white pieces is an advantage estimated to be worth about 33 Elo points. Bayeselo takes this into consideration. For instance, after a single draw between two players A and B, A playing white, here are the outputs of elostat and bayeselo:</p> <pre>Elostat: Program Elo + - Games Score Av.Op. Draws 1 A : 0 0 0 1 50.0 % 0 100.0 % 2 B : 0 0 0 1 50.0 % 0 100.0 % </pre> <pre>Bayeselo: Rank Name Elo + - games score draws 1 B 5 146 146 1 50% 100% 2 A -5 146 146 1 50% 100% </pre> <p>Note that the difference in estimated playing strength according to bayeselo is relatively small compared to the 33 Elo-point value of playing first. That is because of a mechanism of bayeselo that requires many games to confirm a rating difference, detailed in the next subsection. After many such draws, bayeselo's estimated rating difference would be 33 points.</p> <h3>For bayeselo, 10-0 is not the same as 1-0</h3> <p>Bayeselo uses a prior distribution over ratings, that increases the likelihood that the ratings of players are close to each other. The consequence is that a high rating difference has to be deserved, and requires many more games than with Elostat.</p> <pre>Elostat: Program Elo + - Games Score Av.Op. Draws 1 A : 300 0 0 1 100.0 % -300 0.0 % 2 B : -300 0 0 1 0.0 % 300 0.0 % Program Elo + - Games Score Av.Op. Draws 1 A : 300 0 0 10 100.0 % -300 0.0 % 2 B : -300 0 0 10 0.0 % 300 0.0 % </pre> <pre>Bayeselo: Rank Name Elo + - games score draws 1 A 41 181 152 1 100% 0% 2 B -41 152 181 1 0% 0% Rank Name Elo + - games score draws 1 A 169 172 99 10 100% 0% 2 B -169 99 172 10 0% 0% </pre> <h3>Bayeselo behaves correctly when opponents' ratings are far apart</h3> <p>A big source of problems in elostat is that it assumes that many games against many opponents is equivalent to as many games against one opponent whose rating in the average of ratings. This is very wrong, and fails badly in situations where opponent's ratings are far apart. If we consider the simple case where A beats B, B draws C, C beats D, here are the outputs of the two programs:</p> <pre>Elostat: Program Elo + - Games Score Av.Op. Draws 1 A : 709 0 0 1 100.0 % 109 0.0 % 2 B : 109 259 409 2 25.0 % 300 50.0 % 3 C : -109 409 259 2 75.0 % -300 50.0 % 4 D : -709 0 0 1 0.0 % -109 0.0 % </pre> <pre>Bayeselo: Rank Name Elo + - games score draws 1 A 96 329 254 1 100% 0% 2 C 8 195 180 2 75% 50% 3 B -8 180 195 2 25% 50% 4 D -96 254 329 1 0% 0% </pre> <p>Note that elostat gives a 218-point difference between B and C. This is really completely wrong since the only information we have about their relative strength is that B drew C. So their ratings should be close. Bayeselo gives a small advantage to C, because it drew B while playing as black.</p> <p>This is probably the most severe weakness of elostat. It does not only show with that kind of artificial situation, but also in real tournaments. The games of the 10th edition of the WBEC Ridderkerk tournament produce these ratings:</p> <pre>Elostat: Program Elo + - Games Score Av.Op. Draws 104 NullMover 0.25 : -92 58 58 124 39.1 % -15 15.3 % 109 Natwarlal 0.12 : -412 73 70 94 72.9 % -584 16.0 % 112 Alarm 0.93.1 : -495 69 67 94 61.7 % -579 12.8 % 114 NagaSkaki 3.07 : -531 65 64 94 56.4 % -576 19.1 %</pre> <pre>Bayeselo: Rank Name Elo + - games score draws 60 Natwarlal 0.12 259 69 69 94 73% 16% 97 Alarm 0.93.1 173 66 66 94 62% 13% 106 NullMover 0.25 138 55 55 124 39% 15% 109 NagaSkaki 3.07 130 63 63 94 56% 19%</pre> <p>These ratings show big differences between bayeselo and elostat. These 4 players all participated in the "Promo D" tournament between 3rd and 4th divisions. Natwarlal, Alarm, and NagaSkaki come from division 4. NullMover from division 3. Results of the promotion tournament indicate that NullMover is weaker than Natwarlal (NullMover scored 12.5/32 whereas Natwarlal scored 23.5/32). So the ratings of bayeselo look OK according to the results of the promotion tournament, whereas those of elostat are completely wrong.</p> <h2><a id="links">Links</a></h2> <ul> <li><a href="https://projecteuclid.org/journals/annals-of-statistics/volume-32/issue-1/MM-algorithms-for-generalized-Bradley-Terry-models/10.1214/aos/1079120141.full">MM Algorithms for Generalized Bradley-Terry Models</a>, by David R. Hunter, presents the Minorization-Maximization algorithm used in bayeselo to estimate the maximum-likelihood ratings.</li> <li><a href="MMNotes.pdf">hand-written notes</a> with detailed calculations for the paper of Hunter as well as the derivation for the bayeselo algorithm (ties combined with home-field advantage).</li> <li><a href="http://resolver.sub.uni-goettingen.de/purl/?GDZPPN002370808">Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung</a>, by <a href="https://en.wikipedia.org/wiki/Ernst_Zermelo">Ernst Zermelo</a> introduced the principles on which bayeselo is based, in 1929!</li> <li>Step-by-step instruction to produce <a href="wbec.html">predictions of the WBEC-Ridderkerk computer-chess tournament</a>.</li> </ul> </body> </html>

CINXE.COM

Bayesian Elo Rating