ChessGPT

Nick Hagar
5 min read · May 26, 2024


Read it first on my Substack

Here’s an idea I’ve been fascinated by for a while now: The transformer architecture underpinning modern LLMs excels because it effectively models sequences of discrete items. In the case of LLMs, these items are language tokens, but there’s no rule that says they have to be. In theory, anything that could be encoded as sequential data could benefit from this kind of model. I wrote about this two years ago:

Encoding is the hidden engine of AI. It’s the bridge that connects the dazzling model architectures we interact with to the real-world data they require. Finding new ways to encode that real-world data will help our models do more — not just generate text, but deploy it in intelligent ways. Anything with structure, sequence, and symbolism is a worthy target (see: GPT’s surprising adeptness at making chess moves from encoded match histories).

And again last year:

Multimodality suggests a general approach to understanding and responding to any type of incoming information, not just written language. But I still think this is an underexplored area, in that many types of information can be encoded as sequential text at scale. Chess is my go-to example here — transformers can be fine tuned on Portable Game Notation to play quite well. You can imagine similar encoding schemes for datasets like user website activity or stock market movement.

And after ruminating on this idea for quite some time, I finally did some testing by trying to play chess with an LLM.

Full disclosure — I had most of this post written before OpenAI released GPT-4o. I was prepared to write that LLMs can’t complete games against beginner bots, that the ways they failed were instructive, and that there was still potential here. But! With the release of this newest model, my conclusions have changed.

I tested three models against bots on chess.com: Gemma 2b, GPT-4-Turbo, and GPT-4o.

Gemma 2b

I used Gemma for this task to test my initial assertion that fine-tuning can improve model performance on sequential tasks. This model is on the smaller end, but that suited me: I wanted to experiment with fine-tuning without investing heavily in GPU time. I used a sample of 50,000 games from a much larger dataset on Kaggle.
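
For the curious, the shape of this kind of setup looks roughly like the sketch below. It assumes the Kaggle games have been flattened into one PGN movetext string per line; the file name and hyperparameters are illustrative, not the exact ones from my run.

```python
# Minimal sketch: causal-LM fine-tuning of Gemma 2b on PGN movetext.
# Assumes games.txt holds one game per line, e.g. "1. e4 e5 2. Nf3 Nc6 ...".
# (The Gemma checkpoint is gated on Hugging Face and requires accepting its license.)
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Treat each game as one training sequence; the collator supplies next-token labels.
dataset = load_dataset("text", data_files={"train": "games.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gemma-chess",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        logging_steps=100,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice a parameter-efficient method like LoRA keeps the GPU footprint down, but the core idea is the same either way: the model just learns to predict the next token of a game.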

Fine-tuning did improve performance relative to the model’s baseline. Fine-tuned Gemma got better at adhering to correct chess notation and making valid moves. However, it could not successfully complete a game. Across a few attempts, it consistently made valid moves until around turn 18, then devolved into hallucinations and illegal moves. And though its initial moves were technically correct, they weren’t very good — this model mostly just shuffled pieces around the board, without taking any of its opponent’s pieces, before forgetting what it was doing.
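
Those illegal moves are easy to catch programmatically. A check like the sketch below, using python-chess, flags them; the helper function here is mine, not part of any library.

```python
# Sketch: validate a model-generated move (in SAN) against the current board
# before accepting it. Uses python-chess.
import chess

def is_legal_san(board: chess.Board, san_move: str) -> bool:
    try:
        board.parse_san(san_move)  # raises ValueError on invalid or illegal SAN
        return True
    except ValueError:
        return False

board = chess.Board()
board.push_san("e4")                 # White opens; it is now Black to move
print(is_legal_san(board, "e5"))     # True: a legal reply
print(is_legal_san(board, "Qh4#"))   # False: Black's queen is blocked by the e7 pawn
```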

GPT-4-Turbo

My next step was to test a more capable model. GPT-4-Turbo wasn’t fine-tuned for chess playing, but it’s capable of lots of tasks with zero-shot prompting. So, I gave it the same task, and had it play against the same bot.
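
For context, the zero-shot setup looks roughly like this: pass the game so far as PGN movetext and ask for a single move back. The prompt wording below is illustrative, not the exact prompt from my games.

```python
# Sketch: ask GPT-4-Turbo for its next move, given the game so far as PGN movetext.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

pgn_so_far = "1. e4 e5 2. Nf3 Nc6 3. Bb5"  # example position; Black to move

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": (
            "You are playing chess as Black. Reply with your next move "
            "in standard algebraic notation and nothing else.")},
        {"role": "user", "content": f"Game so far:\n{pgn_so_far}\nYour move:"},
    ],
)
print(response.choices[0].message.content)  # e.g. "a6"
```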

Surprisingly, this model did not perform much better than my fine-tuned Gemma. Its early moves were better — it managed to take a few pieces — but it also lost the plot after about 18 turns.

GPT-4o

At this point in my tests, I was ready to conclude that LLMs just can’t sustain a full chess game. They’re capable, I thought, of solving some one-off puzzles and mimicking common openings. But they can’t hold the context of the game long enough to execute a coherent strategy.

But then, mid-writing, OpenAI released GPT-4o. I decided to run one more test, again with a 250-rated beginner bot. The result was drastically different — GPT-4o won in less than 10 turns. I upped the difficulty, playing it next against a 1000-rated bot. Not only did it win, but it also broke through the 18-turn barrier, maintaining a coherent game for nearly 40 turns.

In a final test, I set it against a 2200-rated bot. Because OpenAI has been heavily emphasizing the multimodal capabilities of this model, I also switched my prompting strategy: Rather than giving the model text strings of PGN, I gave it screenshots of the board at each turn.
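
Concretely, that means sending the model an image instead of movetext. A sketch of that request is below; the file name and prompt wording are illustrative.

```python
# Sketch: send a screenshot of the board to GPT-4o as a base64 data URL.
import base64
from openai import OpenAI

client = OpenAI()

with open("board.png", "rb") as f:  # a screenshot of the current position
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "You are playing White. Here is the current board. "
                "Reply with your next move in standard algebraic notation.")},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```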

This was finally too much for the model. It played a coherent game for 23 turns, trading pieces but generally trailing the bot. Then, on turn 24, it claimed it had found checkmate (it hadn’t) and was unable to recover from its mistake.

Conclusion

I want to emphasize that this exercise doesn’t “prove” anything about LLM behavior. This is closer to a thought experiment than a rigorous test — there’s no large-scale evaluation here like you would see in a true model benchmark. Because of that, let’s focus more on the sequence-encoding aspect of this problem than on LLM behavior.

Someone came up with PGN — they devised a series of symbols that efficiently and clearly conveys information about a specific task in a standardized way. This kind of sequential encoding is difficult! And from the model’s perspective, encodings like this are different from prose. They’re information that we can only interpret alongside some kind of outside context — in this case, an understanding of chess and a view of the current state of the board. Like prose, the encoding carries latent meaning that must be surfaced through brute-force training. Unlike prose, its “meaning” is mechanical rather than semantic. This raises all sorts of questions for other sequential encodings: How do you decide what to encode? What can be left latent, and what needs to be explicit for the model to generate against it properly?
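
To make that concrete, consider how compact the encoding is: a complete game fits in one line of text, and all of its “meaning” comes from replaying it under the rules of chess. The snippet below uses the Scholar’s Mate purely as an example.

```python
# Sketch: a complete game encoded as PGN movetext, replayed with python-chess.
import io
import chess.pgn

pgn_text = "1. e4 e5 2. Bc4 Nc6 3. Qh5 Nf6 4. Qxf7# 1-0"  # Scholar's Mate
game = chess.pgn.read_game(io.StringIO(pgn_text))

board = game.board()
for move in game.mainline_moves():
    board.push(move)            # the rules of chess supply the outside context

print(board.is_checkmate())     # True
print(board.fen())              # the reconstructed board state
```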

Clearly, cutting-edge models are improving in their ability to mimic the kinds of patterns implicit in sequential encoding. But for other kinds of tasks — like more complex games, or financial time series — can we devise a rich enough notation for these models to leverage?


Nick Hagar

Northwestern University postdoc researching digital media + AI