Slonik learns chess from self-play

Need for speed

Recently I wrote about my chess engine Slonik. It plays chess autonomously, without human guidance, and is probably the strongest chess engine written in Python. However, it is still not very strong by today's standards: a rough estimate places it somewhere around 1500 ELO, which is "chess club player" strength. Its greatest limitation is its speed. Most chess engines today are written in C++, a language optimized for speed. From the starting chess position, just generating all the legal positions five half-moves deep (4.8M positions) takes Slonik 58 seconds, about 82K positions per second. I am currently working on a C++ port of Slonik. Under the same conditions, C++ Slonik runs the same computation in about a tenth of a second, at roughly 46M positions per second. That's over 500 times faster! The primary driver for this work is to make bigger strides in Slonik's self-learning experiments.
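
For the curious, this kind of measurement is commonly called a "perft" count. The sketch below shows what it looks like in Python, using the python-chess library purely for illustration; Slonik has its own board representation and move generator, so this is not its actual code.

    # Perft sketch using the python-chess library, purely to illustrate the
    # benchmark described above. Not Slonik's actual code.
    import time
    import chess

    def perft(board: chess.Board, depth: int) -> int:
        """Count every legal position reachable in exactly `depth` half-moves."""
        if depth == 0:
            return 1
        nodes = 0
        for move in board.legal_moves:
            board.push(move)
            nodes += perft(board, depth - 1)
            board.pop()
        return nodes

    board = chess.Board()                  # standard starting position
    start = time.perf_counter()
    nodes = perft(board, 5)                # 4,865,609 positions at depth 5
    elapsed = time.perf_counter() - start
    print(f"{nodes} positions in {elapsed:.1f}s ({nodes / elapsed:,.0f}/sec)")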

Artificial Intelligence

One of the other major goals mentioned in my original post introducing Slonik was to have her learn from her own games. The original version of Slonik, like nearly every chess program in existence today, used a hand-coded position evaluation function: given a chess position, it outputs a number indicating which side it thinks is better and by how much. Chess engines today differentiate themselves by implementation, performance, and design, but more importantly by how much chess knowledge their authors have managed to encode in that evaluation function. Improvements to this chess knowledge, from one version of an engine to the next, are the primary means by which the engine gets stronger. These improvements require an enormous amount of human effort and, more importantly, are limited by the imagination of the engine's author.
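
To make this concrete, here is a toy sketch of what a hand-coded evaluation function might look like, counting material only and using the python-chess library for brevity. Real evaluation functions include many more terms (mobility, king safety, pawn structure, and so on); this is not Slonik's actual code.

    # Toy hand-coded evaluation: given a position, return a score in
    # centipawns (positive means White is better). Material only.
    import chess

    PIECE_VALUES = {
        chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
        chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0,
    }

    def evaluate(board: chess.Board) -> int:
        score = 0
        for piece in board.piece_map().values():
            value = PIECE_VALUES[piece.piece_type]
            score += value if piece.color == chess.WHITE else -value
        return score

    print(evaluate(chess.Board()))  # 0: the starting position is balanced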

In September 2015, a new chess engine came out, called Giraffe, which learned to play from its own games. It is written in C++, like most chess engines, but unlike the others it does not have a hand-written position evaluation function. Instead, it uses a neural network: starting from a roughly initialized state (it was fed labelled data teaching it only the approximate value of the pieces), it played games against itself for a week and reached about 2400 ELO strength. An engine that learns autonomously should eventually be able to surpass the others. Giraffe was mainly held back by performance: a neural network takes longer to evaluate than the optimized linear functions of other top engines.

One of my goals since my last post was to implement these ideas in Slonik. Inspired by Giraffe and eager to learn, I read Sutton and Barto's Reinforcement Learning: An Introduction, had some correspondence with Dr. Sutton (nice guy!), and then got to work putting these ideas into practice in Slonik. To try to improve upon Giraffe's approach, I've experimented with various potential improvements, like double learning (two neural networks learning side by side, helping each other improve and reducing bias) and prioritized sweeps (revisiting and giving priority to positions that were difficult, while adjusting for the artificial change this makes to the distribution of positions the engine sees). Many of the ideas did not pan out, but some did, and some may yet work when revisited (combined with other changes made since). I have more ideas I have yet to try. Beyond the architectural differences I have tried, one change that made a big improvement in self-learning was the use of the Huber loss rather than the L1 loss used in Giraffe. Both loss functions are attempts to cope with the very frequent outliers among the self-generated updates that arise during learning. The other major improvement was in the reinforcement learning algorithm itself, and there is an interesting little story associated with it…
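
For reference, here is a minimal sketch of the two loss functions using NumPy. The delta threshold below is just an illustrative value, not the setting Slonik uses; the point is that the Huber loss is quadratic near zero but only linear for large errors, so outlier updates contribute bounded gradients while ordinary updates keep smooth ones.

    import numpy as np

    def l1_loss(error: np.ndarray) -> np.ndarray:
        return np.abs(error)

    def huber_loss(error: np.ndarray, delta: float = 1.0) -> np.ndarray:
        # Quadratic for |error| <= delta, linear beyond it: small temporal-
        # difference errors get a smooth L2-like gradient, while outliers
        # contribute only a bounded (L1-like) gradient.
        small = np.abs(error) <= delta
        return np.where(small,
                        0.5 * error ** 2,
                        delta * (np.abs(error) - 0.5 * delta))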

[Graph: Slonik improving in chess playing strength over time through self-play, plotted as STS score (y axis) against learning iterations (x axis). Training starts from a random initialization at approximately 1500 STS score, using a 2-ply look-ahead search per position. Slonik is still playing and learning; this is a snapshot as of the writing of this post.]

Forbidden algorithms

Giraffe used an algorithm called TD-Leaf for self-learning. TD-Leaf assumes that the evaluation of the successor position (more accurately, the evaluation of the leaf node of the principal variation of the search starting from the next position) is a better estimate than the corresponding evaluation one move earlier. The "TD", or temporal difference, part of TD-Leaf assumes that, in the backwards view, moves immediately prior deserve more of the blame (or credit) for the change in evaluation than moves further in the past. At some point after the recent AlphaGo success, I was looking into the work of David Silver, lead architect of AlphaGo, and discovered that in 2008 he had been involved in the development of a revolutionary (but probably overlooked) self-learning chess program by Joel Veness called Meep, which was the first to achieve a high level of play (2300s ELO) in chess entirely through self-play. Unlike Giraffe, Meep did not use a neural network but a linear model, since linear models are faster to evaluate and are known to be sufficient for top-level play in chess. Furthermore, Meep did not use TD-Leaf but a more efficient learning algorithm, which its authors called TreeStrap. TreeStrap differs from TD-Leaf by using each position's entire search tree for updates, rather than the single evaluation of the next move, as sketched below.
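
The contrast can be sketched roughly as follows. This is hypothetical pseudocode: model.evaluate, model.update, search_tree.interior_nodes, and node.backed_up_value are illustrative interfaces I've made up for the sketch, and it simplifies the algorithms described in the TreeStrap paper; it is not Slonik's or Giraffe's actual training code.

    def td_leaf_updates(pv_leaves, model, lam=0.7):
        # TD-Leaf: one training target per move played, namely the lambda-
        # weighted sum of future changes in the principal-variation leaf
        # evaluation. Earlier positions get exponentially less credit.
        values = [model.evaluate(leaf) for leaf in pv_leaves]
        for t in range(len(values) - 1):
            target = values[t] + sum(
                (lam ** (k - t)) * (values[k + 1] - values[k])
                for k in range(t, len(values) - 1))
            model.update(pv_leaves[t], target)

    def treestrap_updates(search_tree, model):
        # TreeStrap (minimax flavour): every interior node of the search tree
        # is nudged toward its own minimax backed-up value, so a single search
        # yields many training targets instead of one.
        for node in search_tree.interior_nodes():
            model.update(node.position, node.backed_up_value)

The practical difference is volume: a single search produces thousands of TreeStrap training targets, while TD-Leaf extracts only one leaf evaluation per move played.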

I had some success with self-learning in Slonik using TD-Leaf, but when I read the TreeStrap paper, I was in awe. If this algorithm is as good as it sounds (and it sounds very good), then Giraffe should have been using it instead. I went ahead and replaced TD-Leaf with TreeStrap in Slonik, and the learning speed skyrocketed. Why didn't Giraffe use TreeStrap? Did Matthew Lai, the author of Giraffe, not know about it? Since this is my first major C++ project, I sometimes peruse the source code of Stockfish and Giraffe for implementation tips and ideas. About three weeks ago, I noticed an eyebrow-raising comment in the Giraffe source code: an unused field on a class, with a comment that said "during TreeStrap we have to record write to the ttable". Wait… so Matthew Lai did know about TreeStrap?

Then I realized what probably transpired. In a forum post on talkchess, Matthew Lai announced that he had been hired by Google DeepMind, thanks to his success with Giraffe (his impressive background no doubt helped; have a look at his resume), and that because of his work there he could no longer continue to improve his open-source engine Giraffe: he learns many techniques in his day job at Google, and the divide between open knowledge and trade secrets is muddy. Matthew was hired there in December 2015. A little investigating in the Giraffe source repository shows that the TreeStrap comment was added in that same month. I can see it in my head now: Matthew being interviewed by David Silver, and David telling him about his work on Meep a whole 7 years earlier! It appears to me that Matthew started implementing TreeStrap, then learned he couldn't use those ideas, and a forgotten remnant remained in the source. It's all speculation, but if that's what happened, maybe Matthew can use TreeStrap in Giraffe now that someone else is using it openly; the white paper on TreeStrap is, after all, linked from David Silver's personal page. That probably won't happen, though, as Matthew Lai has written that he has other ideas and pet projects he'd rather spend his time on, and besides, there may be ideas in the private Google hive mind that are even more effective. I suspect that this idea of learning from the engine's own search was used in AlphaGo's UCT routine, but I'm patiently waiting for Google to release the details of the techniques behind the recent success against world champion Ke Jie. In the meantime, I continue to work on Slonik (primarily the C++ port, currently). I have many ideas I want to try, including what I hope will be an improvement to the TreeStrap algorithm!