How Complexity Challenges the Limits of AI Reasoning

Published on August 21, 2025

Ever tried to solve a puzzle so complex that your usual strategies stopped working? Recent research shows that even the most advanced AI reasoning models can run into the same roadblocks. In the landmark study “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity” (Shojaee et al., 2025), Apple researchers deeply probe how Large Reasoning Models (LRMs) deal with escalating difficulty—and what this means for the future of artificial intelligence.

The Rapid Evolution of Reasoning Models in AI

Large Language Models (LLMs) like GPT-4 or PaLM have set the standard in generating text, answering questions, and even writing code. The next leap is Large Reasoning Models (LRMs)—AI systems designed to articulate step-by-step thinking, producing detailed “reasoning traces” as they solve complex problems. The hope is that showing their work, much like a student writing out each calculation, will lead these models to more robust and trustworthy reasoning.

But are these models genuinely reasoning, or simply mimicking thought processes? Shojaee and colleagues sought to answer this by going far beyond final answer correctness, scrutinizing the logic and structure of the AI’s problem-solving “thoughts.”

Why Traditional Evaluations Fall Short

Most AI benchmarks rely on answer correctness, typically using math or coding tasks as tests. The study highlights two major pitfalls:

  • Data contamination: Overlap between training and test sets can artificially inflate performance.
  • Lack of process insight: A correct answer doesn’t guarantee sound logic; it could be the result of lucky guesses or memorization.

As the authors note, “By focusing solely on final answers, current benchmarks obscure the actual reasoning ability of these models.” This is like grading only the final answer on a math test, not the student’s work—missing whether understanding or luck was at play.

Dissecting AI Reasoning: The Puzzle Environment Approach

To overcome these limitations, the researchers built controlled puzzle environments. Imagine logic games where you can dial up the difficulty by adding more rules or steps, while keeping the underlying logic constant. This approach allowed for precise manipulation of what the authors call compositional complexity.

What is Compositional Complexity?

Compositional complexity refers to the number and diversity of logical operations required to solve a task. For example:

  • Low complexity: “Add two numbers.”
  • Medium complexity: “First, double a number, then subtract another value, and finally compare the result to a target.”
  • High complexity: “Apply a sequence of interlocking rules, remember intermediate results, and integrate constraints across multiple steps.”

By incrementally increasing complexity, the researchers could observe precisely when and how models began to fail.
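
The dial-up-the-difficulty idea can be sketched in a few lines. This is a hypothetical illustration, not the study’s actual environment: each puzzle chains n simple operations, so difficulty scales with chain length while the underlying logic stays constant.

```python
import random

# Hypothetical sketch of a compositional puzzle generator (not the
# paper's actual environment): a puzzle is a chain of n simple
# operations, so complexity grows with the length of the chain.
OPS = [
    ("double it", lambda x: x * 2),
    ("add 3", lambda x: x + 3),
    ("subtract 1", lambda x: x - 1),
]

def make_puzzle(n_steps, seed=0):
    """Return (instructions, target) for a puzzle of n_steps operations."""
    rng = random.Random(seed)
    value = rng.randint(1, 9)
    instructions = [f"start with {value}"]
    for _ in range(n_steps):
        name, fn = rng.choice(OPS)
        value = fn(value)
        instructions.append(name)
    return "; then ".join(instructions), value
```

A 3-step and a 9-step puzzle come from the same generator; only `n_steps` changes, which is what makes comparisons across complexity levels clean.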

How Reasoning Traces Were Analyzed

A central innovation in the study was the analysis of reasoning traces—the model’s written-out solution path. Researchers evaluated whether the traces:

  • Followed explicit, logical algorithms
  • Maintained consistency across similar problems
  • Showed a pattern of reasoning effort as complexity scaled

For example, in one puzzle, an LRM might write:

  Step 1: Add 4 and 7 to get 11.
  Step 2: Subtract 2 from 11 to get 9.
  Step 3: Multiply 9 by 3 to get 27.
  Step 4: 27 matches the target, so the answer is correct.

In a similar but slightly more complex puzzle, the same LRM might skip steps, double-count, or introduce irrelevant operations—signaling a breakdown in consistency or true algorithmic reasoning.
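
The consistency check can be made concrete. Below is a hedged sketch—the step format and checker are illustrative, not the paper’s tooling—that recomputes each claimed intermediate result in a trace like the one shown above:

```python
# Hypothetical trace checker: each step records an operation, an
# operand, and the result the model claims. We recompute the chain
# and flag the first inconsistent step.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
}

def check_trace(start, steps):
    """steps: list of (op_name, operand, claimed_result) tuples.
    Returns (is_consistent, index_of_first_error_or_None)."""
    value = start
    for i, (op, operand, claimed) in enumerate(steps):
        value = OPS[op](value, operand)
        if value != claimed:
            return False, i
    return True, None

# The trace above: 4 + 7 = 11, then 11 - 2 = 9, then 9 * 3 = 27.
trace = [("add", 7, 11), ("subtract", 2, 9), ("multiply", 3, 27)]
print(check_trace(4, trace))  # → (True, None)
```

Running the same checker over many near-identical puzzles is one way to quantify the consistency property the researchers looked for.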

Concrete Findings: When AI Reasoning Models Succeed—and When They Fail

The study’s comprehensive results reveal a nuanced picture. Here are the standout findings:

  • Effort rises, then drops: As compositional complexity increases, LRMs initially generate longer, more detailed reasoning traces. But past a certain threshold, their “effort”—measured in reasoning tokens—actually decreases, signaling cognitive overload or confusion.
  • Accuracy collapses at high complexity: For example, LRMs maintained over 80% accuracy on puzzles with up to 5 compositional steps, but accuracy fell abruptly to near 0% when complexity rose beyond 7 steps, even with sufficient token budget for longer reasoning.
  • Three performance regimes:
    • Low complexity: Standard LLMs outperform LRMs, likely because added reasoning steps in LRMs introduce avoidable errors.
    • Medium complexity: LRMs shine, leveraging step-by-step breakdowns to outperform LLMs—e.g., achieving 85% vs. 60% accuracy at 5-step puzzles.
    • High complexity: Both models experience a steep accuracy collapse—dropping to single-digit percentages or complete failure.
  • Algorithmic inconsistency: Even in nearly identical puzzles, LRMs sometimes switched strategies, omitted steps, or failed to use clear algorithms. This suggests that their “reasoning” is often a shallow simulation rather than robust logical computation.
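
One way to read the three regimes is as a routing rule. The thresholds below are illustrative guesses drawn from the findings above, not values prescribed by the study:

```python
# Illustrative routing by the three performance regimes reported in
# the study. The cutoffs (2 and 6 steps) are assumptions for the
# sketch, not numbers from the paper.
def pick_model(n_steps):
    if n_steps <= 2:
        return "LLM"          # low complexity: extra reasoning steps add errors
    if n_steps <= 6:
        return "LRM"          # medium complexity: step-by-step breakdown pays off
    return "human review"     # high complexity: both model families collapse

print([pick_model(n) for n in (1, 5, 9)])  # → ['LLM', 'LRM', 'human review']
```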

For visualization, imagine a chart where the X-axis is puzzle complexity (number of steps) and the Y-axis is accuracy. Both model lines start high, but plummet after a sharp “collapse point.” For example:

  Complexity (steps)   LLM Accuracy   LRM Accuracy
           3               90%            85%
           5               60%            85%
           7                8%             5%

This demonstrates how both models fall off a “complexity cliff.”
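
The cliff can also be located programmatically. A minimal sketch, assuming per-complexity accuracies are available (the numbers are the illustrative ones from the table above):

```python
# Find the "collapse point": the first complexity level where accuracy
# falls below a chosen floor. Data are the illustrative table values.
def collapse_point(results, floor=0.10):
    """results: list of (complexity, accuracy) sorted by complexity.
    Returns the first complexity with accuracy below `floor`, else None."""
    for complexity, acc in results:
        if acc < floor:
            return complexity
    return None

llm = [(3, 0.90), (5, 0.60), (7, 0.08)]
lrm = [(3, 0.85), (5, 0.85), (7, 0.05)]
print(collapse_point(llm), collapse_point(lrm))  # → 7 7
```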

Sample Reasoning Traces: Logic vs. Collapse

A strong reasoning trace for a 5-step puzzle:

  Step 1: Double 3 to get 6.
  Step 2: Add 2 for a total of 8.
  Step 3: Divide by 4 to get 2.
  Step 4: Multiply by 5 to get 10.
  Step 5: Subtract 7 to reach 3. Done.

A collapse trace for a 7-step puzzle:
  Step 1: Double 3 to get 6.
  Step 2: Add 2 to get 8.
  Step 3: Multiply by 4. (No reason given.) 32.
  Step 4: Subtract 9. (Confuses with prior steps.) 23.
  Step 5: [Repeats Step 3] Multiply by 4. 92.
  Step 6: [No step] ???
  Step 7: Answer: 92. (Incorrect, random guess.)

Such traces make the model’s breakdown visible.

Limitations, Related Work, and Counterarguments

Why Not Just Scale Up?

A natural question: “If LRMs fail at high complexity, why not just give them more training or tokens?” As Shojaee et al. point out, “Performance collapse appears rooted in the model’s internal capacity to structure reasoning—not just in the computational resources allotted.” More data or longer answers don’t solve the core logic shortfall.

Study Limitations

  • Puzzle environments are artificial: While ideal for scientific scrutiny, puzzles lack the unpredictability and messy context of real-world problems.
  • Transferability: It remains to be seen if the sharp collapse observed for puzzles occurs in domains like natural language understanding or real-world planning.
  • Judging reasoning traces: Even logical traces may reflect memorized patterns, not genuine comprehension.

Broader Context: Related Research

The study builds on work like GSM-Symbolic, which similarly found LLMs struggle with grade-school mathematical reasoning at higher complexity. Meanwhile, new approaches such as “interleaved reasoning” (guiding models to mix reasoning with answering, as explored in subsequent Apple research) aim to reduce time-to-first-token and potentially improve reasoning resilience.

As the authors conclude, “Understanding not just whether but how AI models attempt complex reasoning is key to making progress.”

Implications and Recommendations for Practice

  • Audit reasoning, not just answers: For high-stakes use, always examine AI reasoning traces for logic and consistency.
  • Task-appropriate modeling: Match model type to complexity—LLMs for straightforward tasks, LRMs for intermediate complexity, but always with oversight at higher complexity.
  • Transparency and explainability: Build evaluation tools that flag sudden drops in performance or evidence of algorithmic inconsistency.
  • Research directions: Focus on architectures and training methods encouraging consistent, explicit, and transferable algorithms across diverse tasks.
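
The “flag sudden drops” recommendation can be prototyped simply. A sketch under the assumption that you have accuracy measurements per complexity level; the 30-point threshold is arbitrary:

```python
# Flag sharp accuracy drops between adjacent complexity levels.
# Accuracies are integer percentages; the 30-point threshold is an
# assumption for the sketch, not a value from the study.
def flag_drops(results, max_drop=30):
    """results: list of (complexity, accuracy_percent) sorted by complexity.
    Returns (complexity, drop) pairs where accuracy fell by more than
    max_drop percentage points from the previous level."""
    flags = []
    for (_, a0), (c1, a1) in zip(results, results[1:]):
        if a0 - a1 > max_drop:
            flags.append((c1, a0 - a1))
    return flags

print(flag_drops([(3, 90), (5, 60), (7, 8)]))  # → [(7, 52)]
```

An evaluation harness that runs this over every benchmark sweep would surface collapse points automatically instead of relying on manual chart inspection.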

Conclusion: Towards Trustworthy AI Reasoning

AI has made impressive strides, but this research shows that even advanced models can fall for the “illusion of thinking.” The challenge is clear: ensure our models don’t just look smart, but can sustain logical reasoning when the going gets tough. Practitioners, researchers, and users alike must demand transparent, explainable, and reliable AI—especially as these systems are increasingly deployed in critical domains.

Want to dig into the technical details or see more sample traces? Read the full study, “The Illusion of Thinking” (Shojaee et al., 2025).