
TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models

Yiran Zhang1, Mo Wang1, Xiaoyang Li1, Kaixuan Ren1, Chencheng Zhu2, Usman Naseem1
1Macquarie University   2University of New South Wales
yiran.zhang1@students.mq.edu.au, momow818@gmail.com, lxy289692485@gmail.com, kaixuan.ren@mq.edu.au, chencheng.zhu@student.unsw.edu.au, usman.naseem@mq.edu.au
Abstract

Despite impressive advances in large language models (LLMs), existing benchmarks often focus on single-turn or single-step tasks, failing to capture the kind of iterative reasoning required in real-world settings. To address this limitation, we introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task inspired by a "Turing Machine Board Game." In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. This dynamic setup requires models to reason over time, adapt based on past information, and maintain consistency across steps—capabilities underexplored in current benchmarks. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains. To support fine-grained analysis, we provide ground-truth annotations for intermediate reasoning steps. Our evaluation of state-of-the-art LLMs reveals significant gaps: the best model achieves 81.5% accuracy in Classic mode, but performance drops to 17.8% in Nightmare mode. In contrast, human participants achieve 100% in both, underscoring the challenge TurnBench poses to current models. By incorporating feedback loops and hiding task rules, TurnBench reduces contamination risks and provides a rigorous testbed for diagnosing and advancing multi-step, multi-turn reasoning in LLMs.



1 Introduction

Figure 1: Accuracy versus average turns for leading LLMs (Deepseek-R1, GPT-o4-mini-high, Gemini-2.5-Flash) and human evaluators (Best, Average) on our TurnBench for multi-turn, multi-step reasoning. Performance is evaluated in "Classic" and "Nightmare" modes. The insets specify the percentage decrease in LLM accuracy when compared to the "Best Human" in both Classic and Nightmare mode. This visualization underscores our central finding: LLMs exhibit substantially lower accuracy than humans, particularly in the "Nightmare" setting, highlighting their current limitations in advanced complex reasoning.

Reasoning is central to human cognition and a key benchmark for evaluating the capabilities of artificial intelligence (AI) systems Wason and Johnson-Laird (1972); Dunbar and Klahr (2012). In the context of large language models (LLMs), assessing reasoning ability is especially critical as these models are increasingly deployed in complex, real-world tasks. While a growing body of work has proposed datasets and evaluation methods for probing LLM reasoning Zeng et al. (2024); Wang et al. (2023a); Welleck et al. (2022) (see Table 1), significant gaps remain in how we measure and interpret this ability—particularly in multi-step, multi-turn settings.

Dataset Multi-Turn Multi-Step No Knowledge Ground Truth Intermediate Eval Reasoning Domain
Avalonbench Game
Multi-LogiEval Narrative
BoardgameQA Game
MuSR Narrative
AIME 2024 Math
DSGBench Game
MR-Ben Science
LOGICGAME Game
TurnBench (Ours) Game
Table 1: Comparison of multi-round reasoning benchmarks across six key criteria. A ● indicates the feature is present, a ○ indicates it is absent, and a ◖ indicates partial support. The "Domain" column shows the task type of each benchmark.

First, most existing evaluations focus on single-turn or single-step reasoning tasks, overlooking the iterative and interactive nature of real-world problem-solving. Human reasoning often involves cycles of information gathering, hypothesis testing, and adaptation to feedback. This is especially true in scenarios where information is incomplete or distributed across multiple interactions. While recent benchmarks attempt to assess multi-step reasoning Tang et al. (2025); Zeng et al. (2024), they rarely simulate settings that require reasoning across multiple turns.

Second, current evaluation metrics typically emphasize final-answer correctness, with little insight into the model’s intermediate reasoning process Zhuang et al. (2023); Hao et al. (2024). As complex reasoning often admits multiple valid paths, simply scoring final outputs fails to distinguish between genuine inference and lucky guesses. Though some methods attempt process-level evaluation via manual annotation or automated proxies Zeng et al. (2024); Tang et al. (2025), these are limited by subjectivity and the absence of reliable ground truth for intermediate reasoning.

Third, data contamination poses a serious concern. Static benchmarks—often sourced from public datasets or templated questions—can overlap with pretraining corpora, making it difficult to disentangle memorization from actual reasoning Yang et al. (2025); Jain et al. (2024); Li et al. (2023). This undermines the reliability of benchmark results and inflates perceived model performance.

To address these gaps, we introduce TurnBench, a novel benchmark designed to evaluate multi-turn, multi-step reasoning through an interactive code-breaking task inspired by the Turing Machine board game. In this game, a model must uncover a hidden three-digit code by engaging in multiple rounds of interaction with logical verifiers. Each verifier is governed by a hidden rule; only one rule per verifier is active in a given instance. To succeed, the model must iteratively guess codes, select verifiers, analyze feedback, and gradually infer the underlying logical or arithmetic constraints—mirroring how humans perform exploratory reasoning.

Figure 2: Overview of the TurnBench game framework. The LLM’s objective is to deduce a secret 3-digit code composed of digits from 1 to 5. The game proceeds in iterative rounds, each comprising: 1) Proposal Step: the LLM submits a candidate 3-digit code. 2) Question Step: the LLM queries up to three verifiers, each providing Pass/Fail feedback based on its unique Hidden Active Criterion (HAC). 3) Deduce Step: the LLM analyzes the collective feedback and either submits the final code if confident in its correctness, or 4) End Round: continues to the next round with a revised proposal. This iterative process continues until the LLM successfully deduces and submits the correct code.

TurnBench explicitly addresses key shortcomings in existing benchmarks. First, it evaluates multi-turn, multi-step reasoning by requiring LLMs to adapt dynamically to feedback across multiple rounds and integrate partial clues to formulate and revise hypotheses over time. Second, it enables process-level evaluation through a rule-based mechanism that compares models’ intermediate inferences—i.e., their identification of active rules in each verifier—against ground truth, allowing structured analysis of reasoning steps beyond final answer correctness. Finally, TurnBench offers strong contamination resistance due to its dynamic rule configurations: even under fixed game setups, varying rule activations lead to distinct reasoning trajectories, minimizing the risk of data leakage from LLM pretraining corpora. Our work makes the following key contributions:

  • We propose TurnBench, a novel benchmark designed to evaluate multi-turn, multi-step reasoning in LLMs through dynamic, interactive tasks. TurnBench includes 540 game instances across two modes—Classic and Nightmare—with three difficulty levels each.

  • We introduce a novel, automated evaluation method that leverages rule-based feedback to analyze intermediate reasoning steps, offering a grounded way to assess the internal thinking of LLMs.

  • We benchmark a range of open-source and proprietary models, including GPT-o4-mini and Gemini-2.5-Flash, alongside human participants. Results show a significant performance gap between humans (100%) and models (as low as 17.8% in Nightmare mode), highlighting the challenge TurnBench presents (Figure 1).

  • We release a new dataset comprising not only game settings and final answers, but also detailed interaction logs and annotated reasoning steps for both models and humans, providing a valuable resource for future research.

2 Related Work

LLMs in Interactive Game Environments: Recent work has explored the use of LLMs as agents in interactive games to assess their planning, reasoning, and decision-making capabilities across diverse domains such as board games, card games, and social deduction settings (Schultz et al., 2024; Xu et al., 2023; Akata et al., 2023; Light et al., 2023; FAIR et al., 2022; Wang et al., 2023b; Zhuang et al., 2025; Tang et al., 2025). These benchmarks typically present the game state in textual or structured formats and prompt LLMs to make the next move using natural language generation. For instance, PokerBench (Zhuang et al., 2025) adopts classification-based decision scenarios, while AvalonBench (Light et al., 2023) and BALROG (Paglieri et al., 2025) evaluate agents through multi-turn, interactive gameplay. Common evaluation metrics include win rate, legality of actions, strategy optimality, and task completion.

Benchmarks for Multi-Step and Logical Reasoning: To more directly evaluate reasoning capabilities, recent studies have proposed benchmarks focused on multi-step logical and mathematical inference. Multi-LogiEval (Patel et al., 2024) and Belief-R1 (Wilie et al., 2024) reveal that even advanced LLMs like GPT-4 and Claude struggle with tasks involving deep reasoning and belief revision. MuSR (Sprague et al., 2023) embeds multi-step logic challenges in long-form narratives to test long-context reasoning. Other efforts leverage real-world tasks such as mathematics competitions (e.g., AIME (AoPS, 2024)) to benchmark high-level mathematical reasoning. CriticBench (Lin et al., 2024) and MR-Ben (Zeng et al., 2024) further highlight the potential of multi-round, self-reflective prompting for improving LLM reasoning through iterative critique and correction.

Rule-Based Inference and Tool-Augmented Reasoning: Several benchmarks focus on rule-based or structured inference tasks. GridPuzzle and PuzzleEval (Tyagi et al., 2024) utilize logic grid puzzles, while ZebraLogic (Lin et al., 2025) frames reasoning as constraint satisfaction problems (CSPs). RuleArena (Zhou et al., 2024) evaluates models on dynamic policy reasoning. Tool-augmented frameworks like LINC (Olausson et al., 2023) and MATHSENSEI (Das et al., 2024) enable LLMs to perform formal reasoning through external tools. Meanwhile, self-reflection strategies such as Self-Refine (Madaan et al., 2023) and ReFlexion (Shinn et al., 2023) allow models to iteratively revise incorrect or incomplete outputs via internal critique loops.

While the above efforts have made significant strides in evaluating LLM reasoning, several important gaps remain. First, few benchmarks explicitly evaluate multi-step reasoning across multiple interaction rounds—a critical feature of real-world problem-solving. Most logic and tool-based tasks are static, single-shot evaluations that do not require models to gather and integrate information over time. Second, existing benchmarks often lack ground truth for intermediate reasoning steps, limiting analysis to final-answer accuracy. This makes it difficult to determine whether a correct answer results from genuine reasoning or chance. Third, many datasets are vulnerable to data contamination due to overlap with pretraining corpora. Finally, while game-based settings are promising, they rarely focus on rule-discovery and hypothesis refinement under feedback constraints.

TurnBench is designed to fill these gaps by offering a dynamic, interactive benchmark that simulates real-world multi-turn reasoning. It provides ground-truth annotations for intermediate reasoning, minimizes contamination risk through dynamic rule configurations, and emphasizes logical consistency and rule inference across turns.

3 TurnBench

3.1 Turing Machine Game Mechanics

Turing Machine is a logic-based deduction game where the player’s objective is to identify a unique three-digit code (digits 1–5), with each digit associated with a distinct color (blue, yellow, purple). The challenge lies in interacting with a set of 4–6 verifiers, each governed by a single, hidden active criterion selected from a predefined rule pool. Players must deduce these hidden criteria and submit a code that satisfies all of them.

Each game unfolds in multiple rounds with four key phases: First, the player composes a proposed code (e.g., BLUE=2, YELLOW=4, PURPLE=3), which remains fixed for the current round. Next, the player queries up to three verifiers sequentially, each returning a binary judgment (PASS/FAIL) based on the verifier’s active rule. Then, using this feedback, the player can either attempt a final answer or skip and continue to the next round for further testing. The game ends once a final answer is submitted.

TurnBench supports two game modes: Classic and Nightmare, each with Easy, Medium, and Hard difficulty levels. In Classic mode, verifier responses correspond directly to the selected verifier’s criterion. In Nightmare mode, verifiers are secretly remapped; the player queries one verifier, but the response corresponds to another verifier’s logic, unknown to the player. This mapping must be deduced as part of the reasoning process.
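To make these mechanics concrete, the following minimal Python sketch (our illustration, not the benchmark's implementation) shows how a verifier answers PASS/FAIL against its single hidden active criterion, and how Nightmare mode routes a query through a hidden mapping so the response reflects another verifier's rule. The rule names and data layout are assumptions chosen for readability.

from typing import Callable, Dict, Tuple

Code = Tuple[int, int, int]  # (BLUE, YELLOW, PURPLE), each digit in 1..5

# Hypothetical rule pool for one verifier type that compares YELLOW and PURPLE.
RULES: Dict[str, Callable[[Code], bool]] = {
    "yellow_lt_purple": lambda c: c[1] < c[2],
    "yellow_eq_purple": lambda c: c[1] == c[2],
    "yellow_gt_purple": lambda c: c[1] > c[2],
}

class Verifier:
    """Answers PASS/FAIL against its single hidden active criterion only."""
    def __init__(self, active_rule: str):
        self.active_rule = active_rule  # hidden from the player

    def check(self, code: Code) -> str:
        return "PASS" if RULES[self.active_rule](code) else "FAIL"

def query(verifiers, idx: int, code: Code, mapping=None) -> str:
    """Classic mode: the queried verifier answers directly (mapping=None).
    Nightmare mode: a hidden permutation redirects the query, so the
    response reflects another verifier's active rule."""
    target = idx if mapping is None else mapping[idx]
    return verifiers[target].check(code)

verifiers = [Verifier("yellow_eq_purple"), Verifier("yellow_gt_purple")]
print(query(verifiers, 0, (2, 4, 4)))                        # Classic: PASS
print(query(verifiers, 0, (2, 4, 4), mapping={0: 1, 1: 0}))  # Nightmare: FAIL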

3.2 TurnBench Construction

3.2.1 Game Setups

Each TurnBench game instance consists of a specific verifier combination, one hidden active rule per verifier, and the unique correct code. For Nightmare mode, each game additionally includes a fixed or dynamically generated hidden mapping between verifiers. We curated 270 Classic and 270 Nightmare game setups (90 per difficulty level), sourced from official materials (https://www.turingmachine.info/). All setups are reproducible, and Nightmare mappings are pre-fixed or regenerated at runtime to reduce memorization risk.
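For illustration, a single game instance can be thought of as the following structure; the field names, verifier IDs, and rule identifiers below are hypothetical and do not reflect the released dataset schema.

# Illustrative sketch of one TurnBench game instance (hypothetical schema).
classic_setup = {
    "mode": "classic",
    "difficulty": "easy",
    "verifiers": [4, 9, 11, 14],   # verifier card IDs used in this game
    "active_rules": {              # one hidden active criterion per verifier
        4: "blue_is_even",
        9: "purple_gt_3",
        11: "yellow_eq_purple",
        14: "sum_gt_9",
    },
    "answer": (2, 4, 4),           # the unique correct code
}
# Rules above are for illustration only; real setups guarantee a unique solution.

nightmare_setup = {
    **classic_setup,
    "mode": "nightmare",
    # Hidden permutation: querying verifier i returns verifier mapping[i]'s result.
    "mapping": {4: 11, 9: 14, 11: 4, 14: 9},
}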

3.2.2 Verifier Design

Figure 3: Example of a verifier card. This verifier compares the values assigned to yellow and purple. There are three possible criteria: less than, equal to, or greater than. The Hidden Active Criterion (HAC) is highlighted in the middle red box—in this case, the verifier checks whether yellow equals purple.

Verifiers are central to TurnBench and encode simple numerical rules (e.g., Figure 3). We incorporate 48 official verifier types, each associated with 2 to 9 potential rules. Since the physical game’s verifier logic isn’t directly compatible with a simulation environment, we designed a Hidden Condition Selection Algorithm that selects one active rule per verifier to align with the game’s design and balance.
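As a hedged illustration of what such a selection procedure must enforce (not the exact algorithm used to build TurnBench), the sketch below brute-forces one active rule per verifier over the 125 candidate codes so that the chosen criteria admit exactly one solution and no verifier is redundant, matching the game rules quoted in the system prompt. The rule pools are toy examples.

from itertools import product

DIGITS = range(1, 6)
ALL_CODES = list(product(DIGITS, repeat=3))  # 5^3 = 125 candidate codes

def select_active_rules(rule_pools):
    """rule_pools: one dict {rule_name: predicate} per verifier."""
    for combo in product(*(pool.items() for pool in rule_pools)):
        rules = [pred for _, pred in combo]
        solutions = [c for c in ALL_CODES if all(r(c) for r in rules)]
        if len(solutions) != 1:
            continue  # the active criteria must pin down a unique code
        # Non-redundancy: dropping any single verifier must enlarge the solution set.
        if all(
            len([c for c in ALL_CODES
                 if all(r(c) for j, r in enumerate(rules) if j != i)]) > 1
            for i in range(len(rules))
        ):
            return [name for name, _ in combo], solutions[0]
    return None

pools = [
    {"blue_lt_3": lambda c: c[0] < 3, "blue_ge_3": lambda c: c[0] >= 3},
    {"yellow_even": lambda c: c[1] % 2 == 0, "yellow_odd": lambda c: c[1] % 2 == 1},
    {"sum_eq_6": lambda c: sum(c) == 6, "sum_gt_12": lambda c: sum(c) > 12},
    {"purple_is_max": lambda c: c[2] == max(c), "purple_is_min": lambda c: c[2] == min(c)},
]
print(select_active_rules(pools))
# e.g. (['blue_lt_3', 'yellow_odd', 'sum_eq_6', 'purple_is_min'], (2, 3, 1))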

Figure 4: The Reasoning Process Evaluation Pipeline in TurnBench. This pipeline analyzes the LLM’s Chain-of-Thought (CoT) generated as it deduces verifier properties during the game process (blue area). The evaluation proceeds in three steps: 1) Inference Extraction (red area): The LLM’s CoT, detailing its reasoning for each verifier’s Hidden Active Criterion (HAC), is processed by the Inference Extractor. This yields "Extracted Conclusions" – the LLM’s inferred HAC for each verifier. 2) Ground Truth Collection (orange area): Simultaneously, the "Current Game Setup ID" is used to retrieve the definitive "Ground Truth HAC" for each verifier from the "Game Metadata". 3) Judge (green area): The Judger then semantically compares the "Extracted Conclusions" from Step 1 with the corresponding "Ground Truth HAC" from Step 2. Each inferred HAC is categorized as: Correct (semantically equivalent to the ground truth), Incorrect (completely wrong), or Include (the conclusion contains the correct answer but is not yet fully refined to the precise ground truth).

3.2.3 LLM Interaction Flow

At the start of the game, the system presents the LLM with the full game context: background, rules, objective, and all verifier definitions. The model then interacts turn by turn as described in Section 3.1 (see Figure 2), adhering to a strict output protocol. In each round:

  • In the Proposal step, the LLM outputs a code in the format <CHOICE>: BLUE=X, YELLOW=Y, PURPLE=Z.

  • In the Verifier Query step, it selects verifiers with <CHOICE>: [num]. Each verifier returns PASS or FAIL.

  • In the Deduce step, the LLM either submits a final code using the same format as the Proposal step or skips the round via <CHOICE>: SKIP.

  • In Chain-of-Thought (CoT) mode, the LLM also outputs reasoning before decisions using <REASONING>.

If the LLM produces malformed output or an illegal action (e.g., an invalid verifier ID), a retry mechanism prompts re-generation while tracking error frequency. Detailed prompts and retry protocols are provided in Appendix A.
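The sketch below illustrates the kind of output parsing and retry handling this protocol implies; the regular expressions, retry limit, and fallback prompt are illustrative assumptions rather than the exact harness used in our experiments.

import re

CODE_RE = re.compile(r"<CHOICE>:\s*BLUE=([1-5]),\s*YELLOW=([1-5]),\s*PURPLE=([1-5])")
QUERY_RE = re.compile(r"<CHOICE>:\s*(\d+|SKIP)", re.IGNORECASE)

def parse_proposal(text: str):
    """Return (blue, yellow, purple) or None if the output is malformed."""
    m = CODE_RE.search(text)
    return tuple(int(d) for d in m.groups()) if m else None

def parse_choice(text: str):
    """Return a verifier ID (int), 'SKIP', or None if the output is malformed."""
    m = QUERY_RE.search(text)
    if not m:
        return None
    token = m.group(1)
    return "SKIP" if token.upper() == "SKIP" else int(token)

def get_action(ask_llm, prompt: str, parser, max_retries: int = 3):
    """Query the model, re-prompting on malformed output and counting errors.
    `ask_llm` is any callable mapping a prompt string to a model reply."""
    errors = 0
    for _ in range(max_retries):
        reply = ask_llm(prompt)
        action = parser(reply)
        if action is not None:
            return action, errors
        errors += 1
        prompt = "You did not follow the required response format. Please try again."
    raise RuntimeError("No valid action after retries")

print(parse_proposal("<REASONING>: testing parity\n<CHOICE>: BLUE=2, YELLOW=4, PURPLE=3"))  # (2, 4, 3)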

3.2.4 Evaluating Model Reasoning Process

While existing benchmarks focus solely on final answers, TurnBench introduces an automated method for evaluating intermediate reasoning. Specifically, in Classic mode, a model’s reasoning involves two phases: (1) inferring each verifier’s hidden criterion, and (2) using these to deduce the final code. Since both ground truths (criteria and final code) are known, we can semantically compare model inferences with them.

Our evaluation pipeline (Figure 4) involves two LLM-based components. First, an Inference Extractor (Gemini-2.5-Flash Google (2025)) parses the model’s <REASONING> output to identify its explicit claim about a verifier’s hidden rule. Second, a Judger, also Gemini-2.5-Flash, compares the extracted rule to the ground truth and classifies it as: Correct (semantically equivalent), Incorrect (completely wrong or missing the correct rule), or Include (partial overlap with the ground truth).
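The sketch below outlines the two components in code; the prompt wording and line-based output format are illustrative placeholders, and `llm` stands for any callable wrapping the judge model (Gemini-2.5-Flash in our experiments).

from typing import Callable, Dict

def extract_conclusions(llm: Callable[[str], str], cot_text: str) -> Dict[int, str]:
    """Inference Extractor: pull the model's claimed hidden rule per verifier
    out of its <REASONING> traces, assuming the extractor replies with
    one 'verifier_id: inferred rule' line per explicit conclusion."""
    prompt = ("From the reasoning below, list each verifier whose hidden active "
              "criterion the player explicitly concluded, one per line as "
              "'<id>: <rule>'.\n\n" + cot_text)
    conclusions = {}
    for line in llm(prompt).splitlines():
        if ":" in line:
            vid, rule = line.split(":", 1)
            if vid.strip().isdigit():
                conclusions[int(vid.strip())] = rule.strip()
    return conclusions

def judge(llm: Callable[[str], str], inferred: str, ground_truth: str) -> str:
    """Judger: semantic comparison returning Correct / Include / Incorrect."""
    prompt = (f"Ground-truth criterion: {ground_truth}\n"
              f"Model's inferred criterion: {inferred}\n"
              "Answer with one word: Correct if semantically equivalent, "
              "Include if the inference contains the truth but is not fully "
              "narrowed down, otherwise Incorrect.")
    words = llm(prompt).strip().split()
    return words[0] if words and words[0] in {"Correct", "Include", "Incorrect"} else "Incorrect"

# Example with a trivial stub standing in for the judge model:
print(judge(lambda _: "Correct", "yellow equals purple", "YELLOW = PURPLE"))  # Correct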

We validated this automated process through manual inspection. Stratified sampling selected 120 outputs (5% of total), prioritizing failed games for robustness. Manual checks revealed the inference extractor missed 13.7% of applicable conclusions, but achieved 99.7% precision. The Judger reached 99.4% classification accuracy. These results confirm that TurnBench provides a reliable mechanism for process-level evaluation of LLM reasoning.

Models Total Avg Acc Easy Avg Acc Medium Avg Acc Hard Avg Acc Win Avg Turn Win Avg VER
OA CoT OA CoT OA CoT OA CoT OA CoT OA CoT
gpt-o4-mini-high (Thinking) 0.578 0.815 0.756 0.933 0.7 0.9 0.278 0.611 16 16 7 7
gemini-2.5-flash (Thinking) 0.652 0.785 0.844 0.9 0.756 0.867 0.356 0.589 13 13 6 6
deepseek-r1 (Thinking) 0.511 0.63 0.733 0.756 0.511 0.722 0.289 0.411 12 13 6 6
gpt-4.1 0.052 0.63 0.078 0.8 0.033 0.689 0.044 0.4 41 15 21 7
llama-4-maverick 0.07 0.326 0.133 0.444 0.056 0.367 0.022 0.167 28 17 12 8
llama-3.1-8b-instruct 0.007 0.015 0.011 0.022 0.011 0 0 0.022 23 13 11 6
mistral-8b 0 0.015 0 0.011 0 0.022 0 0.011 - 8 - 4
qwen-2.5-7b-instruct 0.015 0.022 0.011 0.067 0.022 0 0.011 0 34 6 17 3
Random Guess 0.0082 0.0084 0.008 0.0083 - - - -
Best Human 1 1 1 1 18 8
Human Average 0.96 0.983 0.947 0.947 20 11
Table 2: Performance of different models on the Classic Game setting. Metrics include total, easy, medium, and hard average accuracy, as well as the average number of turns and the average number of verifiers used in successfully won games. Fewer average turns and verifier uses in winning games suggest greater reasoning efficiency. Human and random-guess baselines are included for comparison. We evaluated two prompting strategies: Only-Answer (OA) and Chain of Thought (CoT). Bold marks the best LLM result, underline marks the best result among non-thinking models, and blue marks results below the Random Guess baseline.

4 Experiment

4.1 Experiment Setup

To comprehensively explore the limitations of current large language models (LLMs) in multi-turn and multi-step reasoning tasks, we selected both commercial and widely used open-source models for evaluation, employing different prompting methods. The commercial models include gemini-2.5-flash-preview-04-17 (thinking) Google (2025), gpt-o4-mini-high-0416 (thinking) OpenAI (2025), and gpt-4.1-2025-04-14. The open-source models tested are deepseek-r1 (thinking) DeepSeek-AI (2025), llama-4-maverick Meta (2025), mistral-8b team (2025), llama-3.1-8b-instruct AI@Meta (2024), and qwen-2.5-7b-instruct Yang et al. (2024). We also evaluated two prompting strategies: Only-Answer (OA) and Chain of Thought (CoT) Wei et al. (2022).

To thoroughly test the reasoning abilities of the state-of-the-art models, all "Thinking" models had their reasoning effort set to “high.” Additionally, to compare the reasoning gap between the most advanced LLMs and humans, we invited five human participants with no prior experience with the game to take part in the experiment.

We evaluated two game modes: Classic and Nightmare. Each mode’s scenarios were divided equally into three difficulty levels: easy, medium, and hard. For Classic mode, we constructed 270 benchmark scenarios (90 per difficulty). For Nightmare mode, 45 scenarios were selected (15 per difficulty). Human participants played 45 Classic mode games (15 per difficulty), with the Nightmare mode evaluation set matching the models’.

All models and human participants were tested under identical conditions, with the same task prompts and problem setups. To ensure parity in information access, we developed a user interface for humans that displayed exactly the same text as the models saw at each step. Humans were also asked to record their reasoning and thought processes throughout.

4.2 Results and Findings

Finding 1: LLMs significantly lag behind humans in multi-turn, multi-step reasoning.
Models Total Avg Acc Easy Avg Acc Medium Avg Acc Hard Avg Acc Win Avg Turn Win Avg VER
gpt-o4-mini-high 0.111 0.133 0.200 0.000 21 8
gemini-2.5-flash 0.178 0.133 0.267 0.133 16 8
deepseek-r1 0.067 0.067 0.067 0.067 12 6
Random Guess 0.0076 0.0074 0.0079 0.0075 - -
Best Human 1 1 1 1 40 20
Human Average 0.942 0.96 0.933 0.933 38 17
Table 3: Performance of different thinking models on the Nightmare Game setting. Same metrics as Table 2, but exclusively featuring the Chain of Thought (CoT) prompting strategy. Compared to Classic mode, accuracy drops significantly. Human players maintain robust performance, while models struggle to generalize under this challenging scenario.

We analyzed overall performance using average accuracy metrics segmented by difficulty (overall, easy, medium, hard), as well as the average number of turns and verifier uses in games won successfully. Fewer turns and verifier uses indicate stronger reasoning ability.
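These metrics can be computed from per-game logs roughly as follows (the field names are assumptions about the log format).

from statistics import mean

def summarize(games):
    """games: list of dicts with keys 'difficulty', 'won', 'turns', 'verifier_queries'."""
    acc_by_difficulty = {}
    for d in ("easy", "medium", "hard"):
        subset = [g for g in games if g["difficulty"] == d]
        acc_by_difficulty[d] = mean(g["won"] for g in subset) if subset else None
    wins = [g for g in games if g["won"]]  # turn/verifier averages use won games only
    return {
        "total_acc": mean(g["won"] for g in games),
        "acc_by_difficulty": acc_by_difficulty,
        "win_avg_turns": mean(g["turns"] for g in wins) if wins else None,
        "win_avg_verifiers": mean(g["verifier_queries"] for g in wins) if wins else None,
    }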

First, we discuss Classic mode results (Table 2). Smaller standard models struggled significantly despite understanding the game rules and response format: they had difficulty leveraging verifier feedback to make effective inferences. Because TurnBench requires no external knowledge and relies solely on numerical rules, this suggests that such complex reasoning requires a certain model size and capacity.

Chain of Thought (CoT) prompting consistently improved performance across accuracy metrics and helped "Thinking" models as well. For example, the best-performing gpt-o4-mini-high increased its overall accuracy from 0.578 (OA) to 0.815 (CoT). Larger standard models also showed notable gains, e.g., gpt-4.1 rose from 0.052 (OA) to 0.63 (CoT), and llama-4-maverick from 0.07 to 0.326.

Across difficulty levels, "Thinking" models like gpt-o4-mini-high significantly outperformed standard chat models. For example, the best standard model, gpt-4.1, with CoT only matched the weakest "Thinking" model, DeepSeek, achieving 63% accuracy. This trend was consistent across all difficulties.

CoT prompting also helped models succeed with fewer turns and verifier uses (e.g., gpt-4.1 dropped from 41 to 15 turns and from 21 to 7 verifiers). However, "Thinking" models showed little difference between OA and CoT in turns and verifier use, possibly because they internally perform stepwise reasoning; CoT may mainly help them articulate their reasoning process more clearly and use it as memory for subsequent steps.

Despite these improvements, a significant gap remains between LLMs and humans. The "Best Human" achieved 100% accuracy across all metrics, whereas gpt-o4-mini-high (CoT) reached only 81.5%. Humans did average more turns (20) and verifier uses (11) than the top models: analysis of reasoning logs showed that while models sometimes integrated more clues, humans tended to take more turns (especially on hard tasks) but maintained near-perfect accuracy.

To further test these limits, we compared "Thinking" models with CoT against humans in the more challenging Nightmare mode (Table 3). All LLMs’ accuracy dropped drastically compared to Classic mode. For example, gpt-o4-mini-high fell from 0.815 to 0.111 overall and failed completely on Hard difficulty (0 accuracy). Even the best-performing model, gemini-2.5-flash, reached only 0.178 overall and 0.133 on Hard. Humans maintained extremely high performance, with the "Best Human" at 100% accuracy and the average human still achieving 94.2%.

These results clearly demonstrate that although "Thinking" models and CoT prompting improve performance, LLMs still lag far behind humans in complex multi-turn, multi-step reasoning tasks, especially under high difficulty. This highlights the substantial gap remaining between current models and human reasoning capabilities.

Finding 2: Once LLMs make a mistake in multi-turn reasoning, they struggle to recover.
llama-3.1-8b gemini-2.5-flash gpt-o4-mini-high gpt-4.1 mistral-8b llama-4-maverick qwen-2.5-7b deepseek-r1
Initial verifier errors 368 96 66 141 255 142 318 144
Persistence of initial errors (%) 89.94 91.67 53.03 86.52 90.20 63.38 99.06 93.06
Ended with no final conclusion (%) 74.18 71.87 34.85 54.61 53.33 45.77 96.23 86.11
Next-turn still incorrect (%) 17.66 19.79 27.27 33.33 38.82 25.35 3.14 6.94
Success despite persistent errors (%) 1.08 12.72 32.14 13.41 0.66 8.11 0.54 7.87
Success when no / fixed errors (%) 1.75 95.34 87.55 84.57 3.13 41.75 8.00 90.56
Table 4: Comparison of large language models on their ability to handle verifier errors during multi-turn reasoning. Metrics include the number of initial verifier errors, error persistence rate, likelihood of remaining incorrect in the next turn, and task success rates depending on error persistence or correction.

In this experiment, we conducted an in-depth analysis of the persistence and evolution of error states in LLMs during multi-turn reasoning. This analysis is based on the thinking process evaluation method described in Section 3.2.4. The results clearly demonstrate that in complex multi-turn reasoning chains, once current LLMs make an initial error, they tend to "lose their way" and struggle to recover autonomously, significantly reducing their final task success rate.

Figure 5: Flow analysis of verifier reasoning paths originating from a First Incorrect Conclusion (FIC). The flow tracks these instances through subsequent stages: 1) Next Conclusion Status (NCS): the outcome of the LLM’s next reasoning attempt on the same verifier, categorized as Correct, Incorrect, Include, or No Subsequent Conclusion (if the LLM did not revisit that verifier’s reasoning). 2) Conclusion Status Before Submit (CSBS): the final inferred status of the initially misjudged verifier before the LLM submitted its overall game answer. 3) Game Status (GS): the ultimate outcome (Won or Lost) of the game in which the FIC occurred. The visualization highlights that a significant proportion of FICs are not subsequently corrected, often resulting in "No Subsequent Conclusion" or a persistently "Incorrect" status for the verifier, and are strongly associated with a "Lost" game outcome.

Path Divergence after Initial Errors. Using a Sankey diagram (Figure 5), we tracked model behavior following the First Incorrect Conclusion (FIC). The diagram shows that a large proportion of error paths led directly to "No Subsequent Conclusion," indicating that models often cease reasoning along that path after an initial mistake; indeed, as detailed in Table 4, the rate of "Ended with no final conclusion" ranged from 34.85% (gpt-o4-mini-high) to a striking 96.23% (qwen-2.5-7b) across various models. Another substantial fraction continued producing incorrect conclusions; for instance, the "Next-turn still incorrect" metric in the same table varied from 3.14% (qwen-2.5-7b) to 38.82% (mistral-8b). In contrast, paths that quickly shifted to "Include" or "Correct" were relatively rare. Examining how these paths evolved to the Final Conclusion State Before Submission (CSBS), we found that those with either "No Subsequent Conclusion" or "Subsequent Incorrect Conclusion" overwhelmingly ended in an incorrect final conclusion. Consequently, these error paths almost always resulted in "Game Lost." Only a small minority of paths that rapidly adjusted to correct or partially correct conclusions after the first error were associated with a higher likelihood of "Game Won." This divergence visually confirms that after the first mistake, models rarely self-correct and tend either to halt reasoning or perpetuate errors—an initial indication of "losing their way."

Figure 6: Probability of a model remaining incorrect in each subsequent round after its initial error, conditioned on it being incorrect in the previous round. The likelihood of continuing in an incorrect state increases with each turn, approaching near certainty beyond the fifth round. This trend highlights the models’ limited capacity for self-correction once they enter an error state.

Solidification and Persistence of Error States. To further investigate error dynamics, we analyzed model behavior after making an error. Error states proved extremely "sticky." Figure 6 depicts the probability that a model continues to produce incorrect conclusions in subsequent relative rounds, given that it is currently incorrect. In the first relative round after the initial error (X=1), if the model outputs a conclusion, there is already approximately a 65–70% chance it is incorrect. Alarmingly, this probability rises sharply with additional rounds, nearing 100% by the fifth relative round. This suggests that once a model enters several consecutive rounds of incorrect reasoning, it almost completely loses the ability to break the error cycle.
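The conditional persistence probabilities plotted in Figure 6 can be estimated roughly as follows, assuming each trace is the sequence of per-round conclusion labels for one verifier starting at its first incorrect conclusion (the log format is an assumption).

from collections import defaultdict

def persistence_curve(error_traces):
    """error_traces: label sequences starting at the first error,
    e.g. ['Incorrect', 'Incorrect', 'Correct']."""
    still_wrong, total = defaultdict(int), defaultdict(int)
    for trace in error_traces:
        for k in range(1, len(trace)):
            if trace[k - 1] == "Incorrect":  # condition on an error in the previous round
                total[k] += 1
                still_wrong[k] += trace[k] == "Incorrect"
    return {k: still_wrong[k] / total[k] for k in sorted(total)}

print(persistence_curve([
    ["Incorrect", "Incorrect", "Correct"],
    ["Incorrect", "Incorrect", "Incorrect"],
]))  # {1: 1.0, 2: 0.5}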

5 Conclusion

In this paper, we introduced TurnBench and used it to clarify the capabilities and limitations of large language models (LLMs) in multi-turn, multi-step reasoning. TurnBench addresses several key limitations of current benchmarks and offers an effective method for automatically analyzing the reasoning processes of LLMs. Using this framework, we evaluated multiple standard chat models and thinking models, uncovering key findings that highlight the limitations of existing models. In summary, TurnBench fills a gap in the evaluation of LLMs’ multi-turn, multi-step reasoning capabilities and provides a novel solution for assessing model reasoning processes. We hope that our work will inspire further research into multi-turn reasoning.

Limitation

Effectively and accurately measuring a model’s thinking process remains a challenge. The automated evaluation of reasoning processes proposed in this paper relies on a rule-based evaluation framework, which limits its generality. Furthermore, using Gemini 2.5 Flash for inference extraction still has certain limitations: although the extracted results showed high accuracy under manual evaluation, further research and optimization are needed.

References

  • AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
  • Akata et al. (2023) Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. 2023. Playing repeated games with large language models. arXiv preprint arXiv:2305.16867.
  • AoPS (2024) AoPS. 2024. AIME 2024. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions.
  • Das et al. (2024) Debrup Das, Debopriyo Banerjee, Somak Aditya, and Ashish Kulkarni. 2024. Mathsensei: a tool-augmented large language model for mathematical reasoning. arXiv preprint arXiv:2402.17231.
  • DeepSeek-AI (2025) DeepSeek-AI. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.
  • Dunbar and Klahr (2012) Kevin N. Dunbar and David Klahr. 2012. 701 scientific thinking and reasoning. In The Oxford Handbook of Thinking and Reasoning. Oxford University Press.
  • FAIR et al. (2022) Meta Fundamental AI Research Diplomacy Team (FAIR), Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. 2022. Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074.
  • Google (2025) Google. 2025. Gemini 2.5 flash preview. https://storage.googleapis.com/model-cards/documents/gemini-2.5-flash-preview.pdf.
  • Hao et al. (2024) Shibo Hao, Yi Gu, Haotian Luo, Tianyang Liu, Xiyan Shao, Xinyuan Wang, Shuhua Xie, Haodi Ma, Adithya Samavedhi, Qiyue Gao, et al. 2024. Llm reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models. arXiv preprint arXiv:2404.05221.
  • Jain et al. (2024) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
  • Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161.
  • Light et al. (2023) Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. 2023. Avalonbench: Evaluating llms playing the game of avalon. arXiv preprint arXiv:2310.05036.
  • Lin et al. (2025) Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. 2025. Zebralogic: On the scaling limits of llms for logical reasoning. arXiv preprint arXiv:2502.01100.
  • Lin et al. (2024) Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, and Yujiu Yang. 2024. Criticbench: Benchmarking llms for critique-correct reasoning. arXiv preprint arXiv:2402.14809.
  • Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594.
  • Meta (2025) Meta. 2025. llama models. https://github.com/meta-llama/llama-models.
  • Olausson et al. (2023) Theo X Olausson, Alex Gu, Benjamin Lipkin, Cedegao E Zhang, Armando Solar-Lezama, Joshua B Tenenbaum, and Roger Levy. 2023. Linc: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. arXiv preprint arXiv:2310.15164.
  • OpenAI (2025) OpenAI. 2025. Openai o3 and o4-mini system card. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf.
  • Paglieri et al. (2025) Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. 2025. Balrog: Benchmarking agentic llm and vlm reasoning on games.
  • Patel et al. (2024) Nisarg Patel, Mohith Kulkarni, Mihir Parmar, Aashna Budhiraja, Mutsumi Nakamura, Neeraj Varshney, and Chitta Baral. 2024. Multi-logieval: Towards evaluating multi-step logical reasoning ability of large language models. arXiv preprint arXiv:2406.17169.
  • Schultz et al. (2024) John Schultz, Jakub Adamek, Matej Jusup, Marc Lanctot, Michael Kaisers, Sarah Perrin, Daniel Hennes, Jeremy Shar, Cannada Lewis, Anian Ruoss, et al. 2024. Mastering board games by external and internal planning with language models. arXiv preprint arXiv:2412.12119.
  • Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652.
  • Sprague et al. (2023) Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. 2023. Musr: Testing the limits of chain-of-thought with multistep soft reasoning. arXiv preprint arXiv:2310.16049.
  • Tang et al. (2025) Wenjie Tang, Yuan Zhou, Erqiang Xu, Keyan Cheng, Minne Li, and Liquan Xiao. 2025. Dsgbench: A diverse strategic game benchmark for evaluating llm-based agents in complex decision-making environments. arXiv preprint arXiv:2503.06047.
  • team (2025) Mistral AI team. 2025. Un ministral, des ministraux. https://mistral.ai/news/ministraux.
  • Tyagi et al. (2024) Nemika Tyagi, Mihir Parmar, Mohith Kulkarni, Aswin Rrv, Nisarg Patel, Mutsumi Nakamura, Arindam Mitra, and Chitta Baral. 2024. Step-by-step reasoning to solve grid puzzles: Where do llms falter? arXiv preprint arXiv:2407.14790.
  • Wang et al. (2023a) Boshi Wang, Xiang Yue, and Huan Sun. 2023a. Can chatgpt defend its belief in truth? evaluating llm reasoning via debate. arXiv preprint arXiv:2305.13160.
  • Wang et al. (2023b) Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. 2023b. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560.
  • Wason and Johnson-Laird (1972) Peter Cathcart Wason and Philip Nicholas Johnson-Laird. 1972. Psychology of Reasoning: Structure and Content. Harvard University Press, Cambridge, MA, USA.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
  • Welleck et al. (2022) Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, and Yejin Choi. 2022. Naturalprover: Grounded mathematical proof generation with language models. Advances in Neural Information Processing Systems, 35:4913–4927.
  • Wilie et al. (2024) Bryan Wilie, Samuel Cahyawijaya, Etsuko Ishii, Junxian He, and Pascale Fung. 2024. Belief revision: The adaptability of large language models reasoning. arXiv preprint arXiv:2406.19764.
  • Xu et al. (2023) Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, and Yang Liu. 2023. Exploring large language models for communication games: An empirical study on werewolf. arXiv preprint arXiv:2309.04658.
  • Yang et al. (2024) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
  • Yang et al. (2025) Yue Yang, Shuibo Zhang, Kaipeng Zhang, Yi Bin, Yu Wang, Ping Luo, and Wenqi Shao. 2025. Dynamic multimodal evaluation with flexible complexity by vision-language bootstrapping. In The Thirteenth International Conference on Learning Representations.
  • Zeng et al. (2024) Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, et al. 2024. Mr-ben: A meta-reasoning benchmark for evaluating system-2 thinking in llms. arXiv preprint arXiv:2406.13975.
  • Zhou et al. (2024) Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, and William Yang Wang. 2024. Rulearena: A benchmark for rule-guided reasoning with llms in real-world scenarios. arXiv preprint arXiv:2412.08972.
  • Zhuang et al. (2025) Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, and Gopala Anumanchipalli. 2025. Pokerbench: Training large language models to become professional poker players. arXiv preprint arXiv:2501.08328.
  • Zhuang et al. (2023) Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. 2023. Toolqa: A dataset for llm question answering with external tools. Advances in Neural Information Processing Systems, 36:50117–50143.

Appendix A Prompts Used in Experiments

A.1 Classic prompts with Only-Answer (OA) and Chain-of-Thought (CoT)

A.1.1 Classic system prompt

      You are participating in a competitive logic deduction game called Turing Machine.    Your goal is to win first place by deducing a secret 3-digit code with minimal rounds and verifier usage, but accuracy takes priority over speed.        Game Objective:    - Deduce the secret 3-digit code made up of digits 1-5.    - BLUE = first digit, YELLOW = second digit, PURPLE = third digit.    - Each digit can be 1, 2, 3, 4, or 5 (digits may repeat).    - The code is the ONLY combination that satisfies the active criterion of ALL chosen verifiers.        Game Structure (Rounds):    1. Proposal: Design a 3-digit code to test (format: BLUE=X, YELLOW=Y, PURPLE=Z, where X, Y, Z are digits from 1 to 5).    2. Question: Sequentially choose 0 to 3 verifiers to test your proposed code each round. After each selection, you will see the result, and then you can decide whether to select the next one.    3. Deduce: Based on verifier results, you can submit a final answer or continue to the next round.    4. End Round: If you didn’t submit a final answer, a new round begins.        Verifier Rules:    - Each verifier checks ONE specific property (criterion) about the code.    - Each verifier has multiple potential criteria, but for each game, only ONE is secretly selected as ’active’. You don’t know which criterion is active for any given verifier.    - Focus of Verification: When testing your code against a verifier, it exclusively evaluates it against its single, active criterion. The verifier completely ignores all other potential criteria, including its own inactive ones.    - PASS Condition: A verifier returns `<PASS>` if and only if your code satisfies this single active criterion.    - FAIL Condition: A verifier returns `<FAIL>` if and only if your code does not satisfy this single active criterion.    - Non-Overlapping Information: The active criteria selected across different verifiers for a game will provide distinct information.        Winning Strategy:    - It is possible to deduce the solution through joint reasoning, utilizing the combined results of multiple verifiers along with system rules such as the existence of a unique solution and the principle that no two verifiers offer redundant information.    - Only submit a final guess when you have either tested all verifiers and received <PASS> for each, or when your reasoning clearly proves your code satisfies all possible active verifier criteria. Accuracy takes priority over speed.        Current Game Setup:    {game_setup}      

A.1.2 Classic proposal step prompt

Classic - Proposal step - Step prompt - OA

      You are now entering the **Proposal Stage** of this round.        **Stage Purpose**:    In this stage, you need to compose a 3-digit code to help you to gather information from the verifiers. The code can NOT be changed in the subsequent stages of this round.        **3-digit code rules**:    - BLUE = first digit (X), YELLOW = second digit (Y), PURPLE = third digit (Z).    - Each digit (X, Y, Z) can be 1, 2, 3, 4, or 5. Digits may repeat.        **Your Goal in This Stage**:    - Design a code that will test a specific hypothesis.    - Think about what a <PASS> or <FAIL> would tell you.    - Choose a code that lets you learn something meaningful from verifiers.        **What You Must Do Now**:    - Reply the code you want to use in this round with required response format. For example, <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1    - DO NOT include any explanation, only follow the response format.        **Response format**:    <CHOICE>: BLUE=X, YELLOW=Y, PURPLE=Z      

Classic - Proposal step - Not valid format prompt - OA

      You did not follow the required response format. Please try again with same code.        **What You Must Do Now**:    - Reply the code you want to use in this round with required response format. For example, <PROPOSAL>: BLUE=1, YELLOW=1, PURPLE=1    - DO NOT include any explanation, only follow the response format.        **Response format**:    <CHOICE>: BLUE=[X], YELLOW=[Y], PURPLE=[Z]      

Classic - Proposal step - Step prompt - CoT

      You are now entering the **Proposal Stage** of this round.        **Stage Purpose**:    In this stage, you need to compose a 3-digit code to help you to gather information from the verifiers. The code can NOT be changed in the subsequent stages of this round.        **3-digit code rules**:    - BLUE = first digit (X), YELLOW = second digit (Y), PURPLE = third digit (Z).    - Each digit (X, Y, Z) can be 1, 2, 3, 4, or 5. Digits may repeat.        **Your Goal in This Stage**:    - Design a code that will test a specific hypothesis.    - Think about what a <PASS> or <FAIL> would tell you.    - Choose a code that lets you learn something meaningful from verifiers.        **What You Must Do Now**:    - Reply the code you want to use in this round with required response format. For example, <PROPOSAL>: BLUE=1, YELLOW=1, PURPLE=1    - Explain your reasoning step by step with <REASONING> tag, then provide your code.        **Response format**:    <REASONING>: [Explain your reasoning step by step for choosing this code]    <CHOICE>: BLUE=[X], YELLOW=[Y], PURPLE=[Z]      

Classic - Proposal step - Not valid format prompt - CoT

      You did not follow the required response format. Please try again with same code.        **What You Must Do Now**:    - Reply the code you want to use in this round with required response format. For example, <PROPOSAL>: BLUE=1, YELLOW=1, PURPLE=1    - Explain your reasoning step by step with <REASONING> tag, then provide your code.        **Response format**:    <REASONING>: [Explain your reasoning step by step for choosing this code]    <CHOICE>: BLUE=[X], YELLOW=[Y], PURPLE=[Z]      

A.1.3 Classic question step prompt

Classic - Question step - First question prompt - OA

      You are now entering the **Verifier Questioning Stage** of this round.        **Current Verifiers**:    {verifier_descriptions}        **Stage Purpose**:    In this stage, you can test your proposed 3-digit code using verifiers. Each verifier checks one hidden criterion. Use the test results to gather information and refine your deduction.        **Verifier Rules Summary**:    - Each verifier has ONE secretly selected active criterion.    - <PASS> means your code satisfies this rule; <FAIL> means it does not.    - Active rules do NOT overlap between verifiers.        **Your Goal in This Stage**:    - Choosing verifiers is optional; testing 0 verifiers is allowed. If you want to choose the verifier, you must choose verifiers **one at a time**. After each result, you may decide whether to test another. You may choose to test 0 to 3 verifiers **in total** during this round.    - **Passing all tested verifiers does NOT mean the code is correct.** To win, your code must satisfy the hidden criterion of **all verifiers**, whether tested or not.        **What You Must Do Now**:    - If you want to choose a verifier to test your proposed code, reply with verifier_num after <CHOICE> tag, such as <CHOICE>: 1.    - If you want to skip verifier testing for this round, reply with SKIP after <CHOICE> tag, such as <CHOICE>: SKIP.    - DO NOT include any explanation, only follow the response format.        **Response format**:    <CHOICE>: [your_choice]      

Classic - Question step - Following questions prompt - OA

      You chose Verifier <{verifier_num}> and the result is <{verifier_result}>.        **What You Must Do Now**:    - If you want to choose the next verifier to test, reply with verifier_num after <CHOICE> tag, such as <CHOICE>: 1.    - If you want to skip verifier testing for this round, reply with SKIP after <CHOICE> tag, such as <CHOICE>: SKIP.    - DO NOT include any explanation, only follow the response format.        **Response format**:    <CHOICE>: [your_choice]      

Classic - Question step - Last question prompt - OA

      You chose Verifier <{verifier_num}> and the result is <{verifier_result}>.        You have now tested the maximum number of three verifiers for this round. The next stage is the Deduce Stage. If you want to test more verifiers or new code, you can choose SKIP during the Deduce Stage to move on to the next round.      

Classic - Question step - Not valid format prompt - OA

      You did not follow the required response format. Please try again with same choice.        **What You Must Do Now**:    - If you want to choose the next verifier to test, reply with verifier_num after <CHOICE> tag, such as <CHOICE>: 1.    - If you want to skip verifier testing for this round, reply with SKIP after <CHOICE> tag, such as <CHOICE>: SKIP.    - DO NOT include any explanation, only follow the response format.        **Response format**:    <CHOICE>: [your_choice]      

Classic - Question step - Not valid verifier choice prompt - OA

      You selected Verifier <{verifier_num}>, which is not a valid verifier number.        Please choose a valid verifier or SKIP to next stage.        **What You Must Do Now**:    - If you want to choose the next verifier to test, reply with verifier_num after <CHOICE> tag, such as <CHOICE>: 1.    - If you want to skip verifier testing for this round, reply with SKIP after <CHOICE> tag, such as <CHOICE>: SKIP.    - DO NOT include any explanation, only follow the response format.        **Response format**:    <CHOICE>: [your_choice]      

Classic - Question step - First question prompt - CoT

      You are now entering the **Verifier Questioning Stage** of this round.        Current Verifiers:    {verifier_descriptions}        **Stage Purpose**:    In this stage, you can test your proposed 3-digit code using verifiers. Each verifier checks one hidden criterion. Use the test results to gather information and refine your deduction.        **Verifier Rules Summary**:    - Each verifier has ONE secretly selected active criterion.    - <PASS> means your code satisfies this rule; <FAIL> means it does not.    - Active rules do NOT overlap between verifiers.        **Your Goal in This Stage**:    - Choosing verifiers is optional; testing 0 verifiers is allowed. If you want to choose the verifier, you must choose verifiers **one at a time**. After each result, you may decide whether to test another. You may choose to test 0 to 3 verifiers **in total** during this round.    - **Passing all tested verifiers does NOT mean the code is correct.** To win, your code must satisfy the hidden criterion of **all verifiers**, whether tested or not.        **What You Must Do Now**:    - If you want to choose a verifier to test your proposed code, reply with verifier_num after <CHOICE> tag, such as <CHOICE>: 1.    - If you want to skip verifier testing for this round, reply with SKIP after <CHOICE> tag, such as <CHOICE>: SKIP.    - Explain your reasoning step by step with <REASONING> tag, then provide your choice.        **Response format**:    <REASONING>: [Explain your reasoning step by step for choosing the verifier or skipping verifiers]    <CHOICE>: [your_choice]      

Classic - Question step - Following questions prompt - CoT

      You chose Verifier <{verifier_num}> and the result is <{verifier_result}>.        **What You Must Do Now**:    - If you want to choose the next verifier to test, reply with verifier_num after <CHOICE> tag, such as <CHOICE>: 1.    - If you want to skip verifier testing for this round, reply with SKIP after <CHOICE> tag, such as <CHOICE>: SKIP.    - Explain your reasoning step by step based on verifier result after <REASONING> tag, then provide your choice.        **Response format**:    <REASONING>: [Explain your reasoning step by step for choosing the verifier or skipping verifiers]    <CHOICE>: [your_choice]      

Classic - Question step - Last question prompt - CoT

      You chose Verifier <{verifier_num}> and the result is <{verifier_result}>.        You have now tested the maximum number of three verifiers for this round. The next stage is the Deduce Stage. If you want to test more verifiers or new code, you can choose SKIP during the Deduce Stage to move on to the next round.      

Classic - Question step - Not valid format prompt - CoT

      You did not follow the required response format. Please try again with same choice.        **What You Must Do Now**:    - If you want to choose the next verifier to test, reply with verifier_num after <CHOICE> tag, such as <CHOICE>: 1.    - If you want to skip verifier testing for this round, reply with SKIP after <CHOICE> tag, such as <CHOICE>: SKIP.    - Explain your reasoning step by step based on verifier result after <REASONING> tag, then provide your choice.        **Response format**:    <REASONING>: [Explain your reasoning step by step for choosing the verifier or skipping verifiers]    <CHOICE>: [your_choice]      

Classic - Question step - Not valid verifier choice prompt - CoT

      You selected Verifier <{verifier_num}>, which is not a valid verifier number.        Please choose a valid verifier or SKIP to next stage.        **What You Must Do Now**:    - If you want to choose the next verifier to test, reply with verifier_num after <CHOICE> tag, such as <CHOICE>: 1.    - If you want to skip verifier testing for this round, reply with SKIP after <CHOICE> tag, such as <CHOICE>: SKIP.    - Explain your reasoning step by step based on verifier result after <REASONING> tag, then provide your choice.        **Response format**:    <REASONING>: [Explain your reasoning step by step for choosing the verifier or skipping verifiers]    <CHOICE>: [your_choice]      

A.1.4 Classic deduce step prompt

Classic - Deduce step - Deduce result prompt

      The final guess is {submitted_code}. The answer is {answer}, the guess is {is_correct}.      

Classic - Deduce step - Step prompt - OA

      You are now entering the **Deduce Stage** of this round.        **Stage Purpose**:    In this stage, you can analyze all the information gathered then decide whether to continue to the next round or submit a final guess.        **Hint**:    - Passing all tested verifiers does not mean the code is correct if not all verifiers were tested. To be correct, the code must satisfy the hidden criteria of all verifiers, not just the ones you tested.    - You may choose not to test some verifiers if you can clearly reason that your code meets their requirements. But you must ensure every verifier is either tested and passed, or clearly justified through reasoning. Testing and passing only part of the verifiers is not enough if others are ignored.    - This stage **is not for testing**, you don’t have to submit an answer; you can proceed to the next round to continue gathering information.    - Accuracy takes priority over speed. If you submit, the game will end, and an incorrect guess will result in immediate failure.        **Your Goal in This Stage**:    - Decide whether to submit the final guess or continue to the next round. Submit the final guess will end the game, continue to the next round will help you gather more information.    - Submission is not mandatory, you must make this decision based on your own reasoning.        **What You Must Do Now**:    - If you want to continue to the next round, reply with SKIP after <CHOICE> tag, such as <CHOICE>: SKIP    - If you want to submit a final guess to end the game, reply with BLUE=X, YELLOW=Y, PURPLE=Z after <CHOICE> tag, such as <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1.    - DO NOT include any explanation, only follow the response format.        **Response format**:    <CHOICE>: [your_choice]      

Classic - Deduce step - Not valid format prompt - OA

You did NOT follow the response format. Please try again.

**What You Must Do Now**:
- If you want to continue to the next round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- If you want to submit a final guess to end the game, reply with BLUE=X, YELLOW=Y, PURPLE=Z after the <CHOICE> tag, such as <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1.
- DO NOT include any explanation; only follow the response format.

**Response format**:
<CHOICE>: [your_choice]

Classic - Deduce step - Step prompt - CoT

You are now entering the **Deduce Stage** of this round.

**Stage Purpose**:
In this stage, you can analyze all the information gathered and then decide whether to submit a final guess or continue to the next round.

**Hint**:
- Passing all tested verifiers does not mean the code is correct if not all verifiers were tested. To be correct, the code must satisfy the hidden criteria of all verifiers, not just the ones you tested.
- You may choose not to test some verifiers if you can clearly reason that your code meets their requirements. But you must ensure every verifier is either tested and passed, or clearly justified through reasoning. Testing and passing only part of the verifiers is not enough if the others are ignored.
- This stage **is not for testing**: you don’t have to submit an answer; you can proceed to the next round to continue gathering information.
- Accuracy takes priority over speed. If you submit, the game will end, and an incorrect guess will result in immediate failure.

**Your Goal in This Stage**:
- Analyze all the information gathered.
- Decide whether to submit the final guess or continue to the next round.
- Submission is not mandatory; make this decision based on your own reasoning.

**What You Must Do Now**:
- If you want to continue to the next round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- If you want to submit a final guess to end the game, reply with BLUE=X, YELLOW=Y, PURPLE=Z after the <CHOICE> tag, such as <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1.
- Explain your reasoning step by step with the <REASONING> tag, then provide your choice. If you want to submit a final guess, you must provide the reasons for not proceeding to the next round.

**Response format**:
<REASONING>: [Analyze and explain your reasoning step by step for continuing to the next round or submitting a final guess]
<CHOICE>: [your_choice]

Classic - Deduce step - Not valid format prompt - CoT

You did NOT follow the response format. Please try again.

**What You Must Do Now**:
- If you want to continue to the next round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- If you want to submit a final guess to end the game, reply with BLUE=X, YELLOW=Y, PURPLE=Z after the <CHOICE> tag, such as <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1.
- Explain your reasoning step by step with the <REASONING> tag, then provide your choice.

**Response format**:
<REASONING>: [Analyze and explain your reasoning step by step for submitting the final guess or continuing to the next round]
<CHOICE>: [your_choice]
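
The Deduce Stage prompts above describe the intended submit-or-skip logic: submit only when every verifier is either tested and passed or clearly justified, and accuracy takes priority over speed. The sketch below restates that rule as a small function; the function, its inputs, and the single-candidate check are illustrative assumptions rather than the benchmark's evaluation code.

```python
from typing import Set

def deduce_action(
    total_verifiers: int,
    passed: Set[int],        # verifiers tested with the final code and returning <PASS>
    justified: Set[int],     # verifiers argued to be satisfied without testing
    candidate_count: int,    # codes still consistent with all evidence so far
) -> str:
    """Illustrative submit-or-skip rule mirroring the Deduce Stage hints above.

    Submit only when every verifier is covered (tested-and-passed or justified)
    and the evidence pins down a single candidate; otherwise keep gathering
    information. This is a sketch, not the benchmark's logic.
    """
    covered = passed | justified
    if len(covered) == total_verifiers and candidate_count == 1:
        return "SUBMIT"
    return "SKIP"

# Example: 4 verifiers, three passed, one justified, one candidate left -> SUBMIT.
print(deduce_action(4, passed={1, 2, 3}, justified={4}, candidate_count=1))
```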

A.2 Nightmare Prompts with Only-Answer (OA) and Chain-of-Thought (CoT)

A.2.1 Nightmare system prompt

You are participating in a competitive logic deduction game called Turing Machine.
Your goal is to win first place by deducing a secret 3-digit code with minimal rounds and verifier usage, but accuracy takes priority over speed.

Game Objective:
- Deduce the secret 3-digit code made up of digits 1-5.
- BLUE = first digit, YELLOW = second digit, PURPLE = third digit.
- Each digit can be 1, 2, 3, 4, or 5 (digits may repeat).
- The code is the ONLY combination that satisfies the active criterion of ALL chosen verifiers.

Game Structure (Rounds):
1. Proposal: Design a 3-digit code to test (format: BLUE=X, YELLOW=Y, PURPLE=Z, where X, Y, Z are digits from 1 to 5).
2. Question: Sequentially choose 0 to 3 verifiers to test your proposed code each round. After each selection, you will see the result from an unknown verifier. The verifier identity will be hidden.
3. Deduce: Based on the verifier results, you can submit a final answer or continue to the next round.
4. End Round: If you didn’t submit a final answer, a new round begins.

Verifier Rules:
- Each verifier checks ONE specific property (criterion) of the code.
- Each verifier has multiple potential criteria, but for each game only ONE is secretly selected as 'active'. You don’t know which criterion is active for any given verifier.
- Focus of Verification: When testing your code against a verifier, it EXCLUSIVELY evaluates the code against its SINGLE, ACTIVE criterion. The verifier completely ignores all other potential criteria, including its own inactive ones.
- In this game, you don’t know which Verifier’s result you’re actually seeing: the mapping between Verifiers and their displayed results is randomized and hidden from the player, though fixed for the entire game.
- PASS Condition: A verifier returns `<PASS>` if and only if your code satisfies the active criterion of the actual Verifier it is mapped to. For example, if Verifier 1 is secretly mapped to Verifier 2, then a <PASS> from Verifier 1 means your code met Verifier 2’s hidden active rule.
- FAIL Condition: A verifier returns `<FAIL>` if and only if your code does not satisfy the active criterion of the actual Verifier it is mapped to. A <FAIL> simply means the mapped Verifier’s rule was not met.
- Non-Overlapping Information: The active criteria selected across different verifiers in a game provide distinct information.

Winning Strategy:
- It is possible to deduce the solution through joint reasoning, using the combined results of multiple verifiers along with system rules such as the existence of a unique solution and the principle that no two verifiers offer redundant information.
- One possible strategy is to carefully modify your code across multiple rounds and observe how each Verifier’s output changes. By analyzing the pattern of responses, you can infer the hidden mapping between Verifiers and their actual criteria.
- Only submit a final guess when you have either tested all verifiers and received <PASS> for each, or when your reasoning clearly proves your code satisfies all possible active verifier criteria. Accuracy takes priority over speed.

Current Game Setup:
{game_setup}
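
The distinctive Nightmare rule in the system prompt above is the hidden, game-fixed permutation between the verifier a player queries and the verifier whose result is actually displayed. The snippet below simulates that semantics so the PASS/FAIL conditions are concrete; the criteria, function name, and seed handling are made-up illustrations, not the benchmark's implementation.

```python
import random
from typing import Callable, Dict, List, Tuple

Code = Tuple[int, int, int]  # (BLUE, YELLOW, PURPLE), each digit in 1..5

def simulate_nightmare_round(
    code: Code,
    active_criteria: List[Callable[[Code], bool]],  # one active rule per verifier
    seed: int = 0,
) -> Dict[int, str]:
    """Return the displayed <PASS>/<FAIL> result for each visible verifier slot.

    The permutation between displayed slots and actual verifiers is drawn once
    (fixed for the game), mirroring the hidden mapping described in the prompt.
    """
    rng = random.Random(seed)
    n = len(active_criteria)
    mapping = list(range(n))
    rng.shuffle(mapping)  # displayed slot i is secretly mapped to verifier mapping[i]
    return {
        i + 1: "<PASS>" if active_criteria[mapping[i]](code) else "<FAIL>"
        for i in range(n)
    }

# Example with three made-up criteria: BLUE is odd, YELLOW > PURPLE, digit sum is even.
criteria = [
    lambda c: c[0] % 2 == 1,
    lambda c: c[1] > c[2],
    lambda c: sum(c) % 2 == 0,
]
print(simulate_nightmare_round((3, 1, 4), criteria))
```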

A.2.2 Nightmare proposal step prompt

Nightmare - Proposal step - Step prompt - OA

You are now entering the **Proposal Stage** of this round.

**Stage Purpose**:
In this stage, you need to compose a 3-digit code to help you gather information from the verifiers. The code can NOT be changed in the subsequent stages of this round.

**3-digit code rules**:
- BLUE = first digit (X), YELLOW = second digit (Y), PURPLE = third digit (Z).
- Each digit (X, Y, Z) can be 1, 2, 3, 4, or 5. Digits may repeat.

**Your Goal in This Stage**:
- Design a code that will test a specific hypothesis.
- Think about what a <PASS> or <FAIL> would tell you, keeping in mind that you don’t know which Verifier’s result you’re actually seeing: the mapping between Verifiers and their displayed results is randomized and hidden from the player, though fixed for the entire game.
- Choose a code that lets you learn something meaningful from the verifiers.

**What You Must Do Now**:
- Reply with the code you want to use in this round in the required response format. For example, <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1.
- DO NOT include any explanation; only follow the response format.

**Response format**:
<CHOICE>: BLUE=X, YELLOW=Y, PURPLE=Z

Nightmare - Proposal step - Not valid format prompt - OA

You did not follow the required response format. Please try again with the same code.

**What You Must Do Now**:
- Reply with the code you want to use in this round in the required response format. For example, <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1.
- DO NOT include any explanation; only follow the response format.

**Response format**:
<CHOICE>: BLUE=[X], YELLOW=[Y], PURPLE=[Z]

Nightmare - Proposal step - Step prompt - CoT

You are now entering the **Proposal Stage** of this round.

**Stage Purpose**:
In this stage, you need to compose a 3-digit code to help you gather information from the verifiers. The code cannot be changed in the subsequent stages of this round.

**3-digit code rules**:
- BLUE = first digit (X), YELLOW = second digit (Y), PURPLE = third digit (Z).
- Each digit (X, Y, Z) can be 1, 2, 3, 4, or 5. Digits may repeat.

**Your Goal in This Stage**:
- Design a code that will test a specific hypothesis.
- Think about what a <PASS> or <FAIL> would tell you, keeping in mind that you don’t know which Verifier’s result you’re actually seeing: the mapping between Verifiers and their displayed results is randomized and hidden from the player, though fixed for the entire game.
- Choose a code that lets you learn something meaningful from the verifiers.

**What You Must Do Now**:
- Reply with the code you want to use in this round in the required response format. For example, <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1.
- Explain your reasoning step by step with the <REASONING> tag, then provide your code.

**Response format**:
<REASONING>: [Explain your reasoning step by step for choosing this code]
<CHOICE>: BLUE=[X], YELLOW=[Y], PURPLE=[Z]

Nightmare - Proposal step - Not valid format prompt - CoT

You did not follow the required response format. Please try again with the same code.

**What You Must Do Now**:
- Reply with the code you want to use in this round in the required response format. For example, <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1.
- Explain your reasoning step by step with the <REASONING> tag, then provide your code.

**Response format**:
<REASONING>: [Explain your reasoning step by step for choosing this code]
<CHOICE>: BLUE=[X], YELLOW=[Y], PURPLE=[Z]
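
The Proposal Stage prompts above ask the model to choose a code that yields useful information. One way to picture that task is to track how many of the 5^3 = 125 possible codes remain consistent with the criteria currently hypothesized to be active, since the secret code must satisfy the active criterion of every verifier. The sketch below enumerates that candidate space; the criteria and helper names are made-up assumptions, not material from the benchmark.

```python
from itertools import product
from typing import Callable, List, Tuple

Code = Tuple[int, int, int]  # (BLUE, YELLOW, PURPLE), digits 1-5

ALL_CODES: List[Code] = list(product(range(1, 6), repeat=3))  # 5^3 = 125 candidates

def candidates_satisfying(hypothesized_active: List[Callable[[Code], bool]]) -> List[Code]:
    """Secret-code candidates that satisfy every hypothesized active criterion.

    Each hypothesis about an active criterion prunes the 125-code space; a good
    proposal is one whose verifier results shrink this set quickly. The criteria
    passed in below are illustrative placeholders only.
    """
    return [code for code in ALL_CODES if all(rule(code) for rule in hypothesized_active)]

# Example: hypothesize that "BLUE is odd" and "the digit sum is even" are active.
remaining = candidates_satisfying([
    lambda c: c[0] % 2 == 1,
    lambda c: sum(c) % 2 == 0,
])
print(len(remaining), "candidate codes remain under these hypotheses")
```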

A.2.3 Nightmare question step prompt

Nightmare - Question step - First question prompt - OA

You are now entering the **Verifier Questioning Stage** of this round.

**Current Verifiers**:
{verifier_descriptions}

**Stage Purpose**:
In this stage, you can test your proposed 3-digit code using verifiers. Each verifier checks one hidden criterion. Use the test results to gather information and refine your deduction.

**Verifier Rules Summary**:
- Each verifier has ONE secretly selected active criterion.
- Each verifier shows results for a different, hidden verifier (the mapping is randomized but fixed for the entire game).
- <PASS> means your code satisfies the active criterion of the secretly mapped verifier. <FAIL> means your code does not satisfy that criterion.
- Active rules do NOT overlap between verifiers.

**Your Goal in This Stage**:
- Choosing verifiers is optional; testing 0 verifiers is allowed. If you choose to test verifiers, you must choose them **one at a time**. After each result, you may decide whether to test another. You may test 0 to 3 verifiers **in total** during this round.
- **Passing all tested verifiers does NOT mean the code is correct.** To win, your code must satisfy the hidden criterion of **all verifiers**, whether tested or not.

**What You Must Do Now**:
- If you want to choose a verifier to test your proposed code, reply with verifier_num after the <CHOICE> tag, such as <CHOICE>: 1.
- If you want to skip verifier testing for this round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- DO NOT include any explanation; only follow the response format.

**Response format**:
<CHOICE>: [your_choice]

Nightmare - Question step - Following questions prompt - OA

You chose Verifier <{verifier_num}> and the result is <{verifier_result}>.

**Hint**:
- `<PASS>` means your code satisfies the active criterion of the actual Verifier it is mapped to. For example, if Verifier 1 is secretly mapped to Verifier 2, then a <PASS> from Verifier 1 means your code met Verifier 2’s hidden active rule.
- `<FAIL>` means your code does not satisfy the active criterion of the actual Verifier it is mapped to.

**What You Must Do Now**:
- If you want to choose the next verifier to test, reply with verifier_num after the <CHOICE> tag, such as <CHOICE>: 1.
- If you want to skip verifier testing for this round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- DO NOT include any explanation; only follow the response format.

**Response format**:
<CHOICE>: [your_choice]

Nightmare - Question step - Last question prompt - OA

You chose Verifier <{verifier_num}> and the result is <{verifier_result}>.

You have now tested the maximum of three verifiers for this round. The next stage is the Deduce Stage. If you want to test more verifiers or a new code, you can choose SKIP during the Deduce Stage to move on to the next round.

Nightmare - Question step - Not valid format prompt - OA

You did not follow the required response format. Please try again with the same choice.

**What You Must Do Now**:
- If you want to choose the next verifier to test, reply with verifier_num after the <CHOICE> tag, such as <CHOICE>: 1.
- If you want to skip verifier testing for this round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- DO NOT include any explanation; only follow the response format.

**Response format**:
<CHOICE>: [your_choice]

Nightmare - Question step - Not valid verifier choice prompt - OA

You selected Verifier <{verifier_num}>, which is not a valid verifier number.

Please choose a valid verifier or SKIP to the next stage.

**What You Must Do Now**:
- If you want to choose the next verifier to test, reply with verifier_num after the <CHOICE> tag, such as <CHOICE>: 1.
- If you want to skip verifier testing for this round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- DO NOT include any explanation; only follow the response format.

**Response format**:
<CHOICE>: [your_choice]

Nightmare - Question step - First question prompt - CoT

You are now entering the **Verifier Questioning Stage** of this round.

**Current Verifiers**:
{verifier_descriptions}

**Stage Purpose**:
In this stage, you can test your proposed 3-digit code using verifiers. Each verifier checks one hidden criterion. Use the test results to gather information and refine your deduction.

**Verifier Rules Summary**:
- Each verifier has ONE secretly selected active criterion.
- Each verifier shows results for a different, hidden verifier (the mapping is randomized but fixed for the entire game).
- <PASS> means your code satisfies the active criterion of the secretly mapped verifier. <FAIL> means your code does not satisfy that criterion.
- Active rules do NOT overlap between verifiers.

**Your Goal in This Stage**:
- Choosing verifiers is optional; testing 0 verifiers is allowed. If you choose to test verifiers, you must choose them **one at a time**. After each result, you may decide whether to test another. You may test 0 to 3 verifiers **in total** during this round.
- **Passing all tested verifiers does NOT mean the code is correct.** To win, your code must satisfy the hidden criterion of **all verifiers**, whether tested or not.

**What You Must Do Now**:
- If you want to choose a verifier to test your proposed code, reply with verifier_num after the <CHOICE> tag, such as <CHOICE>: 1.
- If you want to skip verifier testing for this round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- Explain your reasoning step by step based on the verifier result after the <REASONING> tag, then provide your choice.

**Response format**:
<REASONING>: [Explain your reasoning step by step for choosing the verifier or skipping verifiers]
<CHOICE>: [your_choice]

Nightmare - Question step - Following questions prompt - CoT

You chose Verifier <{verifier_num}> and the result is <{verifier_result}>.

**Hint**:
- `<PASS>` means your code satisfies the active criterion of the actual Verifier it is mapped to. For example, if Verifier 1 is secretly mapped to Verifier 2, then a <PASS> from Verifier 1 means your code met Verifier 2’s hidden active rule.
- `<FAIL>` means your code does not satisfy the active criterion of the actual Verifier it is mapped to.

**What You Must Do Now**:
- If you want to choose the next verifier to test, reply with verifier_num after the <CHOICE> tag, such as <CHOICE>: 1.
- If you want to skip verifier testing for this round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- Explain your reasoning step by step based on the verifier result after the <REASONING> tag, then provide your choice.

**Response format**:
<REASONING>: [Explain your reasoning step by step for choosing the verifier or skipping verifiers]
<CHOICE>: [your_choice]

Nightmare - Question step - Last question prompt - CoT

You chose Verifier <{verifier_num}> and the result is <{verifier_result}>.

You have now tested the maximum of three verifiers for this round. The next stage is the Deduce Stage. If you want to test more verifiers or a new code, you can choose SKIP during the Deduce Stage to move on to the next round.

Nightmare - Question step - Not valid format prompt - CoT

You did not follow the required response format. Please try again with the same choice.

**What You Must Do Now**:
- If you want to choose the next verifier to test, reply with verifier_num after the <CHOICE> tag, such as <CHOICE>: 1.
- If you want to skip verifier testing for this round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- Explain your reasoning step by step based on the verifier result after the <REASONING> tag, then provide your choice.

**Response format**:
<REASONING>: [Explain your reasoning step by step for choosing the verifier or skipping verifiers]
<CHOICE>: [your_choice]

Nightmare - Question step - Not valid verifier choice prompt - CoT

You selected Verifier <{verifier_num}>, which is not a valid verifier number.

Please choose a valid verifier or SKIP to the next stage.

**What You Must Do Now**:
- If you want to choose the next verifier to test, reply with verifier_num after the <CHOICE> tag, such as <CHOICE>: 1.
- If you want to skip verifier testing for this round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- Explain your reasoning step by step based on the verifier result after the <REASONING> tag, then provide your choice.

**Response format**:
<REASONING>: [Explain your reasoning step by step for choosing the verifier or skipping verifiers]
<CHOICE>: [your_choice]

A.2.4 Nightmare deduce step prompt

Nightmare - Deduce step - Deduce result prompt

The final guess is {submitted_code}. The answer is {answer}; the guess is {is_correct}.

Nightmare - Deduce step - Step prompt - OA

You are now entering the **Deduce Stage** of this round.

**Stage Purpose**:
In this stage, you can analyze all the information gathered and then decide whether to continue to the next round or submit a final guess.

**Hint**:
- Passing all tested verifiers does not mean the code is correct if not all verifiers were tested. To be correct, the code must satisfy the hidden criteria of all verifiers, not just the ones you tested.
- You may choose not to test some verifiers if you can clearly reason that your code meets their requirements. But you must ensure every verifier is either tested and passed, or clearly justified through reasoning. Testing and passing only part of the verifiers is not enough if the others are ignored.
- This stage **is not for testing**: you don’t have to submit an answer; you can proceed to the next round to continue gathering information.
- Accuracy takes priority over speed. If you submit, the game will end, and an incorrect guess will result in immediate failure.

**Your Goal in This Stage**:
- Decide whether to submit the final guess or continue to the next round. Submitting the final guess will end the game; continuing to the next round will help you gather more information.
- Submission is not mandatory; make this decision based on your own reasoning.

**What You Must Do Now**:
- If you want to continue to the next round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- If you want to submit a final guess to end the game, reply with BLUE=X, YELLOW=Y, PURPLE=Z after the <CHOICE> tag, such as <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1.
- DO NOT include any explanation; only follow the response format.

**Response format**:
<CHOICE>: [your_choice]

Nightmare - Deduce step - Not valid format prompt - OA

You did NOT follow the response format. Please try again.

**What You Must Do Now**:
- If you want to continue to the next round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- If you want to submit a final guess to end the game, reply with BLUE=X, YELLOW=Y, PURPLE=Z after the <CHOICE> tag, such as <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1.
- DO NOT include any explanation; only follow the response format.

**Response format**:
<CHOICE>: [your_choice]

Nightmare - Deduce step - Step prompt - CoT

You are now entering the **Deduce Stage** of this round.

**Stage Purpose**:
In this stage, you can analyze all the information gathered and then decide whether to submit a final guess or continue to the next round.

**Hint**:
- Passing all tested verifiers does not mean the code is correct if not all verifiers were tested. To be correct, the code must satisfy the hidden criteria of all verifiers, not just the ones you tested.
- You may choose not to test some verifiers if you can clearly reason that your code meets their requirements. But you must ensure every verifier is either tested and passed, or clearly justified through reasoning. Testing and passing only part of the verifiers is not enough if the others are ignored.
- This stage **is not for testing**: you don’t have to submit an answer; you can proceed to the next round to continue gathering information.
- Accuracy takes priority over speed. If you submit, the game will end, and an incorrect guess will result in immediate failure.

**Your Goal in This Stage**:
- Analyze all the information gathered.
- Decide whether to submit the final guess or continue to the next round.
- Submission is not mandatory; make this decision based on your own reasoning.

**What You Must Do Now**:
- If you want to continue to the next round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- If you want to submit a final guess to end the game, reply with BLUE=X, YELLOW=Y, PURPLE=Z after the <CHOICE> tag, such as <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1.
- Explain your reasoning step by step with the <REASONING> tag, then provide your choice. If you want to submit a final guess, you must provide the reasons for not proceeding to the next round.

**Response format**:
<REASONING>: [Analyze and explain your reasoning step by step for continuing to the next round or submitting a final guess]
<CHOICE>: [your_choice]

Nightmare - Deduce step - Not valid format prompt - CoT

You did NOT follow the response format. Please try again.

**What You Must Do Now**:
- If you want to continue to the next round, reply with SKIP after the <CHOICE> tag, such as <CHOICE>: SKIP.
- If you want to submit a final guess to end the game, reply with BLUE=X, YELLOW=Y, PURPLE=Z after the <CHOICE> tag, such as <CHOICE>: BLUE=1, YELLOW=1, PURPLE=1.
- Explain your reasoning step by step with the <REASONING> tag, then provide your choice.

**Response format**:
<REASONING>: [Analyze and explain your reasoning step by step for submitting the final guess or continuing to the next round]
<CHOICE>: [your_choice]