
GPT-4o vs Claude & Grok: LLMs Put to the Chess Test

2025-04-24 · 11 min read

Large language models (LLMs) such as GPT-4o, Claude 3.5/3.7, and Grok 3.0 are now capable of responding to a wide variety of complex requests, spanning domains as diverse as programming, logic, writing, and strategic reasoning. However, their ability to respect strict rules over an extended interaction remains an open question worth exploring.

Chess, governed by a set of formal and codified rules (FIDE), offers an ideal experimental framework for testing the robustness of LLMs. Unlike specialized engines like Stockfish or Leela, LLMs don't have integrated position evaluation or an internal mechanism to validate move legality. Their performance relies solely on their predictive understanding of language and underlying logical structures.

This paper proposes a simple but rigorous methodology to confront several LLMs with a complete chess game, in text format, where the human plays the role of intermediary. The objective is twofold:

  1. Test the ability of these models to manage contextual memory over time via FEN tracking (Forsyth-Edwards Notation).
  2. Identify situations where a model proposes an illegal move or loses track of the game.

The results are surprising: some models hallucinate or lose track of the game very quickly, while others, like GPT-4o, never deviate from the imposed constraints.

In this work, we will detail the process implemented, the prompts used, and the errors encountered, as well as directions for future tests integrating advanced features such as "projects" or step-by-step chain-of-thought (CoT) reasoning.

Context and Motivation

As someone passionate about LLMs, I understand the fundamentals of how they work without being a leading expert: training by next-token prediction, absence of persistent internal state, tendency to hallucinate under ambiguity, and so on. Yet this is precisely what makes it fascinating to confront these models with a rigid, normative system that leaves no room for ambiguity: chess.

Chess is based on a set of strict rules (those of FIDE), universally understood and verifiable. There is no room for syntactic improvisation, subjective interpretations, or unframed creativity. Either a move is legal, or it is not. This formal framework thus becomes an excellent testing ground for evaluating not the "creativity" of AI, but their ability to reason within a closed formal system, to follow precise instructions, and to self-constrain over the duration of a game.

This work therefore does not aim to prove that "LLMs can play chess," as this is not their primary purpose, but rather to assess their cognitive behavior under strict constraints. It is also a good stress test of prompt engineering in a normative framework: what happens when we force them, via simple instructions, to behave like rational entities, without anticipation, without shortcuts, without error?

Some AI experts might consider this use "off-topic," but it is precisely this challenge that reveals a lot about the actual robustness of a model. If an LLM fails to respect a well-defined formal system, what can we really hope for in ambiguous contexts, such as health, justice, or education?

📖 For curious readers, before diving in, I encourage you to read:

→ How to build an agentic system in 2025

→ AI agent architectures in 2025

Experimental Methodology

To test the ability of LLMs to respect the official rules of chess and play within a strict framework, I set up a simple process relying solely on the web interfaces of the tested models, without APIs or third-party engines, in order to stay aligned with the basic usage an average user would have.

Experimental Framework

Each AI is invited to play as a professional chess player. It receives an explicit initial prompt, which imposes:

  • strict adherence to FIDE rules,
  • the absolute prohibition of proposing illegal moves,
  • the obligation to respond only to the previous move, without anticipating the following ones.

Initial prompt (excerpt):

"You are a professional chess player. [...] You must strictly respect the official rules of chess (FIDE). You must never cheat or propose an illegal move. [...]"

At each turn, the AI receives the exact state of the board via FEN notation, as well as the last move played by the opponent. It must respond with a single move in algebraic notation, optionally accompanied by a brief justification.

Example instruction:

Here is the current state of the chessboard: FEN: rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1; Your opponent's last move: e4; It's your turn to play (black). Give your next move (algebraic notation), and specify if there is a capture, check, or promotion if needed.

The process follows a human dialogue loop (the human intermediary playing the moves and relaying the position and context at each turn).
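
For readers who want to reproduce this loop, here is a minimal sketch of how each turn's message can be assembled, assuming the python-chess library for board bookkeeping; `ask_model()` is a hypothetical placeholder for the web-chat relay (or an API call), not part of the original protocol:

```python
# Minimal sketch of the relay loop, assuming the python-chess library.
# ask_model() is a hypothetical placeholder for pasting the prompt into the
# web interface (or calling an API) and returning the model's textual reply.
import chess

def build_turn_prompt(board: chess.Board, last_move_san: str) -> str:
    """Assemble the per-turn instruction: FEN + last move + explicit request."""
    side = "black" if board.turn == chess.BLACK else "white"
    return (
        f"Here is the current state of the chessboard: FEN: {board.fen()}; "
        f"Your opponent's last move: {last_move_san}; "
        f"It's your turn to play ({side}). Give your next move (algebraic notation), "
        "and specify if there is a capture, check, or promotion if needed."
    )

board = chess.Board()
board.push_san("e4")                      # the human relays White's 1.e4
prompt = build_turn_prompt(board, "e4")
# reply = ask_model(prompt)               # e.g. "c5" from the model playing Black
# board.push_san(reply)                   # raises ValueError if the move is illegal
```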

Tested Models

Three major language models were put to the test, in versions updated at the time of testing (April 2025):

  • GPT-4o (OpenAI) – systematically tested as the white player.
  • Claude 3.5 & Claude 3.7 (Anthropic) – as black opponents.
  • Grok 3.0 (xAI) – also as a black opponent.

Each confrontation was played until a clear winning position or an implicit resignation (the model having no relevant moves left, or the position being theoretically lost).

Objectives

This protocol aims to observe three main elements:

  1. Respect for the rules of the game: legal moves only, no omissions or inventions.
  2. Response stability: no responses inconsistent with the provided position.
  3. Strategic autonomy: ability to make relevant decisions in a tactical framework.

Why this framework? (choice of prompt and FIDE loop)

The choice of a strict prompt, centered on FIDE rules, is not simply a matter of formalism. It is motivated by several observations about the behavior of language models:

  1. LLMs don't "know" how to play chess as such: they don't have an internal validation engine (unlike a real chess engine); they generate text based on conditional probabilities. This means that, by default, a move like Nf3 can be "probable" without being valid in the given position.

  2. Multi-turn interaction is essential: Providing only the FEN or only the last move often leads to hallucinations or continuity errors. The combined presentation (FEN + last move played + explicit instruction) limits these biases.

  3. Alignment alone is not sufficient: Even "aligned" models (notably Claude or Grok) sometimes propose illegal moves. This is not only a problem of intention, but of representation of the validity rule in the positional context. The prompt thus aims to force attention to the framework.

  4. Testing LLMs in "end-user" conditions: The objective is also to approach common usage: what happens if an average user plays chess with an advanced LLM through the web interface, without ancillary tools to verify moves? We want to simulate this general public usage while maintaining a rigorous protocol.

Overall Results

Across the three games played, systematically opposing GPT-4o (white side) to another LLM (black side), the finding is clear:

✅ GPT-4o never proposed a single illegal move

Regardless of the depth of the game, the complexity of the position, or the phase of play (opening, middle, endgame), each response from GPT-4o was valid, consistent with the FEN, respectful of FIDE rules, and contextually logical. This model is therefore perfectly usable as a chess dialogue engine, within the textual limits of the chosen protocol.

โŒ Claude 3.5 and Claude 3.7 proposed several illegal moves

The two tested versions of Claude made errors at critical moments:

  • Claude 3.5 proposed impossible moves, notably with pieces that had already been captured.
  • Claude 3.7, although more stable, also hallucinated positions, forgetting pieces or generating them where there were none.

โš ๏ธ Grok 3.0 resisted better but hallucinated due to instability

The xAI model showed better tactical understanding in some cases... but also produced sequences in which its position no longer matched the actual game. It therefore remains interesting but not reliable within this protocol.

📊 Summary Comparison

| Model | Move Validity | FEN Consistency | Strategic Quality |
|---|---|---|---|
| GPT-4o | ✅ Always legal | ✅ Perfect | 🧠 Very solid! |
| Claude 3.5 | ❌ Several illegal | ❌ Frequent errors | 🤔 Very weak |
| Claude 3.7 | ❌ Many! | ⚠️ Fewer errors than 3.5 | 🙂 Poor |
| Grok 3.0 | ❌ Occasional hallucinations | ⚠️ Degraded position towards the end | 🙈 Unstable |

Here are the three short videos of the games (YouTube shorts):

Claude 3.7: https://youtube.com/shorts/kSWPfE42PxE

Claude 3.5: https://youtube.com/shorts/BK-RAFPHSEY

Grok 3.0: https://youtube.com/shorts/JsO2WDJCnME

Model-by-Model Analysis

🧠 GPT-4o (OpenAI)

Status: Reference / baseline

Color: White in all games

Observed Behavior:

  • No hallucination.
  • Strict adherence to FIDE rules, even in complex endgames.
  • Gave valid moves even when the opponent proposed an illegal move (without deviating from the protocol).
  • Good understanding of classic patterns (pressure on the center, control of open files, pins).
  • Small note: GPT-4o does not propose variations or in-depth analysis in this protocol, but remains extremely rigorous in producing the single move requested.

Conclusion:

A serious candidate for playing chess in a FEN + text dialogue protocol. Its behavior is perfectly aligned with the rules without requiring adjustment. The fact that it never proposed an illegal move testifies to excellent context management.

🤖 Claude 3.5 (Anthropic)

Status: Older Claude generation

Color: Black (against GPT-4o)

Observed Behavior:

  • Claude quickly proposed impossible moves: moving pieces that had already been captured, or moving onto squares occupied by its own pieces.
  • Recurrent confusion between colors, especially in exchanges.
  • Does not seem to integrate FEN constraints well.

Hypothesis:

The model seems to rely on a fuzzy inference logic, probably based on learned patterns more than on strict modeling of the board. This makes its behavior unreliable in structured contexts like chess.

🤖 Claude 3.7 (Anthropic) — Degradation compared to 3.5

Status: More recent version

Color: Black

Observed Behavior:

Contrary to expectations, Claude 3.7 behaves worse than Claude 3.5. It proposes illegal moves much more quickly, confuses colors (e.g., attempting to move the opponent's pawns), and shows a general weakening of contextual stability.

Hypothesis:

Claude 3.7's logic is more "linguistic" than "mechanical." It understands language well, but not always the spatial implications of a game like chess.

🦾 Grok 3.0 (xAI)

Status: Elon Musk's multimodal model

Color: Black

Observed Behavior:

  • Very good start of the game.
  • Ability to respond with coherent moves.
  • But progressive hallucinations: moves played in a position that doesn't exist (probably a poorly managed FEN shift).
  • Was not as faulty as Claude, but considerably less stable than GPT-4o.

Hypothesis:

Context management is often good, but contextual memory erodes over the length of the game. The model seems to have difficulty "maintaining a stable game" without losing coherence.

โš™๏ธ Testing and Evaluation Process

๐Ÿ” Protocol Objective

The idea is not to simulate a classic game between two optimized chess engines.

The goal is to test the contextual stability of LLMs in a constrained scenario:

"Respond to a chess move in accordance with FIDE rules based on a position given via FEN."

This allows us to measure their ability to:

  • Integrate a game state specified textually.
  • Respect strict rules in a logical universe.
  • React only to the transmitted information, without hallucinating or anticipating.

🧪 Applied Protocol

Each game follows this precise protocol:

  1. Initialization by system message: The model receives an initialization prompt that transforms it into a professional chess player, reminding it of the rules and limits of its role (no anticipation, a single move, algebraic notation).

  2. Turn by turn:

    • We send the FEN (formal representation of the position).
    • We specify the last move played by the opponent.
    • We request a single response (e.g., Nf3, dxe6, O-O), possibly with a short justification.
  3. Automatic and manual evaluation:

    • All moves are manually checked (and sometimes verified with an engine such as Stockfish) to validate their legality (a minimal validation sketch follows this list).
    • Each error is documented: impossible move, absent piece, ignored rule (e.g.: forbidden castling, wrong capture...).
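
To illustrate how this check can be partly automated, here is a minimal sketch assuming the python-chess library; it flags both illegal moves and legal moves with incomplete notation (missing check sign or promotion piece). The example position is purely illustrative:

```python
# Sketch of the legality / notation check, assuming the python-chess library.
import chess

def check_reply(fen: str, reply_san: str) -> str:
    """Classify a model's reply against the position described by the FEN."""
    board = chess.Board(fen)
    try:
        move = board.parse_san(reply_san)   # raises ValueError on illegal/unparsable SAN
    except ValueError as err:
        return f"ILLEGAL or unparsable move: {reply_san!r} ({err})"
    canonical = board.san(move)             # canonical SAN, including +, # and =Q
    if canonical != reply_san:
        return f"Legal but incomplete notation: got {reply_san!r}, expected {canonical!r}"
    return f"OK: {canonical}"

# Illustrative position (after 1.e4 e5 2.Qh5 Nc6 3.Bc4 Nf6): "Qxf7" is legal,
# but the canonical notation is "Qxf7#", which the protocol requires.
fen = "r1bqkb1r/pppp1ppp/2n2n2/4p2Q/2B1P3/8/PPPP1PPP/RNB1K1NR w KQkq - 4 4"
print(check_reply(fen, "Qxf7"))
```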

📊 Evaluation Metrics

We retained three main criteria:

| Criterion | Description |
|---|---|
| ✅ Move legality | Is the proposed move allowed according to the FEN and FIDE rules? |
| 🧠 Strategic coherence | Does the move respect a strategic logic, even a basic one? |
| 🧬 Contextual stability | Does the model keep track of the game in the long term (20+ moves)? |

👉 In all cases, GPT-4o committed no infraction. The other models failed on the first criterion from the middlegame onward.

🧭 Next testing phase: "Project" Mode & reasoning-oriented models

🧱 Why change the framework?

While testing through the direct interface (web chat) reveals the raw stability of an LLM faced with structured instructions, it does not reflect the full potential of the models.

In particular, it omits:

  • The extended use of a persistent memory context.
  • The possibility of a more advanced initial framing, via tools like "projects" or "extended system instructions".
  • Integration into an agentized architecture, where the model is guided by explicit decision logic, structured memory, and validation or tooling modules.

🧪 Phase 2: Integration of "Project" mode

✅ Objective

Explore whether context persistence and conversational encapsulation allow:

  • Improving response coherence.
  • Avoiding certain disconnections observed on Claude or Grok in long sessions.
  • Approaching the behavior of a basic but legally clean analysis engine (zero illegal moves).

🔄 Methodology

  • Use of OpenAI Projects or Claude "Workbench" sessions.
  • Implementation of an enriched system framework (see the sketch after this list):
    • FIDE rules injected persistently.
    • Structured and compressed game history (FEN + partial PGN).
  • Direct comparison with performances obtained in interface mode.
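
As a rough illustration of what such an enriched framing could look like, here is a sketch assuming python-chess for the bookkeeping. The wording of the FIDE reminder and the choice to keep only the last ten moves are assumptions of this sketch, and how the text is injected (OpenAI Projects, a system field, etc.) is left abstract:

```python
# Sketch of an enriched, persistent framing: FIDE reminder + compressed history.
# How this text is injected (OpenAI Projects, a system field, ...) is left abstract.
import chess

FIDE_REMINDER = (
    "You are a professional chess player. You must strictly respect the official "
    "FIDE rules, never propose an illegal move, and answer with a single move "
    "in algebraic notation."
)

def compressed_history(moves_san: list[str], keep_last: int = 10) -> str:
    """Replay the game, then return the current FEN plus only the last few SAN moves."""
    board = chess.Board()
    for san in moves_san:
        board.push_san(san)
    recent = " ".join(moves_san[-keep_last:])
    return f"Current FEN: {board.fen()}\nRecent moves (partial history): {recent}"

def build_system_context(moves_san: list[str]) -> str:
    return FIDE_REMINDER + "\n\n" + compressed_history(moves_san)

print(build_system_context(["e4", "c5", "Nf3", "d6", "d4", "cxd4", "Nxd4", "Nf6"]))
```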

🧠 Extension: testing models with a strong "reasoning" component

Some models are specifically designed to maintain logical coherence on structured tasks:

  • OpenAI o4-mini: a compact and high-performing version integrating advanced reasoning mechanisms, adapted to constrained environments while maintaining strong analytical capability.
  • Claude with its reasoning option enabled.

📌 Hypothesis to test (semi lol)

The integration of a structured and persistent context drastically improves the quality, coherence, and legality of moves proposed by LLMs over a long sequence.

🧠 Pedagogical Interlude: understanding typical LLM errors in chess

โš ๏ธ Identified error types

  1. Illegal move
    • Examples: playing a piece that doesn't exist on the board, or moving the king into (or leaving it in) check.
    • Models concerned: Claude 3.7 and 3.5 (more rarely Grok 3.0).
    • ✅ Never observed with GPT-4o.
  2. Move inconsistent with FEN
    • The LLM gives a move that would be valid on some board... but that move is not possible in the given position.
    • This often betrays a loss of positional memory, or a structural hallucination.
  3. Notation confusion
    • The model sometimes mixes figurine notation (♘xf6+) and algebraic notation (Nxf6+), or proposes incorrect hybrid notations (a small normalization sketch follows this list).
  4. Partially correct but incomplete response
    • Examples: an unspecified promotion (e8 instead of e8=Q) or an unsignaled check (Qf6 instead of Qf6+).
    • This type of error is critical in a strict testing framework, as it introduces ambiguity or prevents automatic analysis.
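
For completeness, here is a tiny sketch of the kind of normalization that can absorb the figurine/letter confusion before running the legality check; the mapping uses the standard Unicode chess symbols, and the helper name is purely illustrative:

```python
# Sketch: normalize figurine notation (e.g. ♘xf6+) to letter-based SAN before validation.
FIGURINE_TO_LETTER = {
    "♔": "K", "♕": "Q", "♖": "R", "♗": "B", "♘": "N",
    "♚": "K", "♛": "Q", "♜": "R", "♝": "B", "♞": "N",
}

def normalize_san(reply: str) -> str:
    """Replace chess figurines with their English piece letters and trim whitespace."""
    for figurine, letter in FIGURINE_TO_LETTER.items():
        reply = reply.replace(figurine, letter)
    return reply.strip()

print(normalize_san("♘xf6+"))  # -> Nxf6+
```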

๐Ÿ” Root causes (technical)

  1. Lack of stable positional reasoning
    • Unlike a chess engine, an LLM doesn't "see" the board: it deduces a position from the text context, which creates a cumulative blur.
  2. No internal validation engine
    • None of these models has, by default, a legal verification of moves. Everything relies on attention to context.
    • GPT-4o nevertheless seems to benefit from a better positional encoding, probably linked to its multimodal training.
  3. Limited contextual memory
    • Previous moves, if not effectively summarized or recalled (via FEN for example), gradually fade from the model's "view".
  4. Non-specific alignment
    • These AIs are aligned for truth, coherence, safety... but not for the strict rules of a game like chess, unless explicitly framed by dedicated instructions or prompts.

💡 Implications for users

  • Claude & Grok can be useful for discussing strategy or replaying a known position.
  • But for a turn-by-turn game, only GPT-4o has shown flawless reliability to date, without a single illegal move, even after 35 moves.
  • This suggests a better generalization of GPT-4o to logical tasks under strict constraint.

🧮 Summary of Results by Confrontation

In order to clearly report the observed performances, here is a detailed summary of the three matches played, systematically opposing GPT-4o (white) to another LLM (black).

1. GPT-4o vs Claude 3.7 (50 half-moves)

https://youtu.be/Mwg0OyJjiGg

  • Game status: white victory by implicit resignation.
  • Critical moment: Claude 3.7 proposes an illegal move, attempting to move a non-existent piece.
  • Observation: Claude 3.7 loses track even earlier than Claude 3.5, with critical errors: moving opponent's pieces, impossible moves, increasing instability.

2. GPT-4o vs Claude 3.5 (53 half-moves)

https://youtu.be/Kq3pK8qymbk

  • Status: white victory by overwhelming tactical pressure.
  • Critical moment: Claude 3.5 plays a knight that has already been captured, thus losing complete track of the position.
  • Observation: Claude 3.5 seems less able to integrate FEN as a state of truth. GPT-4o unfolds a methodical attack after the opponent's error.

3. GPT-4o vs Grok 3.0 (69 half-moves)

https://youtube.com/live/NKULhJ5Ud1c

  • Status: white victory by forced promotion.
  • Critical moment: Grok resists until the end, but makes continuity errors in certain moves (loss of positional reference).
  • Observation: despite some hallucinations, Grok is the only model to offer more prolonged resistance. GPT-4o remains perfect.

📈 Comparative Table of Confrontations

| Match | Half-moves | Opponent illegal moves | Context errors | Outcome |
|---|---|---|---|---|
| GPT-4o vs Claude 3.7 | 50 | ✅ Many illegal moves | ⚠️ Loss of coherence | White victory |
| GPT-4o vs Claude 3.5 | 53 | ✅ Several illegal | ❌ Complete disconnection | White victory |
| GPT-4o vs Grok 3.0 | 69 | ❌ None, but positional hallucinations | ⚠️ On the last moves | White victory |

โš ๏ธ Limitations of the Current Protocol & Areas for Improvement

Although the proposed methodology allows testing the ability of LLMs to follow strict rules in a normative framework, certain limitations must be recognized in order to nuance the results and guide future improvements.

1. Absence of automatic feedback

The protocol relies on manual control of the legality and coherence of moves at each turn. This implies:

  • Significant human effort, especially on long games.
  • Risk of evaluation error or oversight of micro-irregularities.

2. Text-only format (web interface)

The experience is limited here to sessions in the web interfaces of the tested LLMs (chat.openai.com, claude.ai, x.ai). This means:

  • No advanced session memory or external decision support system.
  • No automatic correction or rule reminder in case of an illegal move (interventions were avoided as much as possible; often the only option was to open a new chat when a model got stuck in a continuous loop).

🔧 Improvement path: "project"

  • Persistent session memory.
  • An enriched system prompt in the background.
  • The addition of logical validators or compressed histories.

3. Absence of intermediate reasoning (no chain-of-thought)

The protocol imposes a direct response to each move without structured intermediate justification. Although this allows testing raw rigor, it does not take advantage of the step-by-step reasoning (chain-of-thought) that some models are capable of.

🔧 Improvement path

  • Test models specialized in reasoning.
  • Inject an instruction allowing a decomposition of reasoning, provided that it still respects the constraint of a single final response.
  • Example: "Think to validate the legality of the move, then give the final move." (A small extraction sketch follows this list.)
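
One way to reconcile free-form reasoning with the single-response constraint is to ask for a terminal marker and parse only that line. The "Final move:" marker and the regular expression below are assumptions of this sketch, not part of the protocol above:

```python
# Sketch: let the model reason freely, then extract the single final move.
# The "Final move:" marker is an assumed convention added to the CoT instruction.
import re

COT_INSTRUCTION = (
    "Think step by step to validate the legality of your move, then end your answer "
    "with exactly one line of the form 'Final move: <move in algebraic notation>'."
)

SAN_PATTERN = r"(O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?[+#]?)"

def extract_final_move(answer: str) -> str | None:
    """Return the move following the 'Final move:' marker, or None if absent."""
    match = re.search(r"Final move:\s*" + SAN_PATTERN, answer)
    return match.group(1) if match else None

sample = "The knight on g8 can legally reach f6 here.\nFinal move: Nf6"
print(extract_final_move(sample))  # -> Nf6
```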

4. Games always played against another LLM (and not an engine or a human)

The opponents are also LLMs, therefore potentially sources of errors themselves. This introduces a bias in the comparative analysis:

  • A model may seem "better" simply because the other collapsed faster.
  • Some illogical or unnatural positions may emerge, which biases the interpretation.

🔧 Improvement path

  • Create a series of games played only by the LLM against Stockfish (with strict validation, as sketched below).
  • Or, have an LLM play against itself, but with different variants of prompts or history.
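
Such a setup could look like the following sketch, assuming python-chess, its chess.engine module, and a locally installed Stockfish binary (the path is an assumption); `ask_model()` remains the hypothetical LLM wrapper used in the earlier sketches:

```python
# Sketch of an LLM-vs-Stockfish loop with strict validation, assuming python-chess
# and a local Stockfish binary (the path below is an assumption).
import chess
import chess.engine

def play_llm_vs_stockfish(ask_model, stockfish_path="/usr/bin/stockfish", max_plies=120):
    """LLM plays White, Stockfish plays Black; stop on game over or first illegal reply."""
    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
    try:
        while not board.is_game_over() and len(board.move_stack) < max_plies:
            if board.turn == chess.WHITE:
                reply = ask_model(f"FEN: {board.fen()}. Give your next move (algebraic notation).")
                try:
                    board.push_san(reply.strip())   # strict validation of the LLM's move
                except ValueError:
                    print(f"Illegal move proposed by the LLM: {reply!r}")
                    break
            else:
                result = engine.play(board, chess.engine.Limit(time=0.1))
                board.push(result.move)
    finally:
        engine.quit()
    return board
```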

🔬 Implications for LLM Research

This protocol, although artisanal and limited to web usage, raises fundamental questions about the robustness of LLMs in structured environments.

1. Legality as an indicator of "understanding"

An illegal move in FEN context is not simply a syntax error: it's a symptom of a lack of coherent mental representation. This reveals:

  • A defect in the ability to maintain a structured internal state (board + rules).
  • The absence of an internal validation module (unlike chess engines).
  • A difficulty for some models to align not on intention, but on the mechanical exactness of a normative system.

This type of test could therefore serve as a generic stress test, well beyond chess, for any domain with strict rules: procedural law, security, network architecture, etc.

2. Chess as a test bench for reasoning under constraints

Chess is not just a complex game, it's a closed structure with pure logic, which makes it an ideal tool for:

  • Testing models on their ability to integrate a complete system of rules.
  • Evaluating their stability in long interaction.
  • Analyzing internal error mechanisms without ambiguity of interpretation.

Unlike tasks such as writing or translation, here, the truth is binary: legal or illegal move, coherent or not.

3. Significant differences between model families

The results obtained clearly illustrate that not all LLMs are equivalent:

  • GPT-4o demonstrates exceptional contextual rigor, possibly linked to its multimodal architecture or to reinforced pre-training on logical structures.
  • Claude 3.5/3.7, despite being reputed for their alignment, fail on objective rules.
  • Grok 3.0 shows promise, but its instability prevents confidence over time.

These gaps indicate that "structural reasoning" is a major axis of differentiation to come in the evolution of models.

🧾 Conclusion

This work aimed to test a simple idea: can we make a generalist AI play chess, only via text, without it cheating, hallucinating, or going outside the framework?

The answer is yes... but not with all models.

  • GPT-4o showed impressive rigor: no illegal moves, no contextual rupture, strict adherence to rules even after numerous moves.
  • Claude and Grok, on the other hand, demonstrated the limits of their architecture or their encoding of the formal framework: illegal moves, position confusion, loss of track.

The interest of this research is not to say "which model plays better," but to highlight the differential capabilities of representation and stability in a closed system. Chess, like any strictly normed domain, becomes a perfect mirror of the strengths and weaknesses of these models.

The next steps, with a shift to "project" mode, the introduction of persistent memory, or the testing of models specialized in reasoning, will allow us to go even further in understanding the limitations of these models under everyday usage.

In the meantime, GPT-4o remains the only one able to play a complete game without ever going outside the framework.

LLM · chess · OpenAI · Claude · Grok