Apple’s STUNNING Discovery: “LLMs CAN’T REASON” | Is AGI Cancelled?

In the rapidly evolving world of artificial intelligence, large language models (LLMs) have captured the imagination of researchers, developers, and the public alike. These models, capable of generating human-like text, have been heralded as stepping stones toward artificial general intelligence (AGI). However, a recent research paper from Apple titled “The Illusion of Thinking” challenges this optimistic outlook by arguing that LLMs fundamentally cannot reason. This article dives deep into Apple’s findings, explores their implications, and examines counterarguments that question the finality of this conclusion.

🔧 The Concept of AI as a Tool for Thought

Steve Jobs once famously described the computer as a “bicycle for our minds.” It amplifies human intellectual capacity much like a bicycle amplifies physical capability. Now, AI is considered a tool that builds other tools—a meta-tool that extends human creativity and problem-solving. But how far does this extend? Can large language models truly “think” and reason like humans, or is their intelligence just an elaborate imitation?

Apple’s paper attempts to answer this question by rigorously testing LLMs on their reasoning abilities. The research focuses on whether these models can move beyond pattern matching and memorization to demonstrate genuine logical reasoning, especially as problems increase in complexity.

🧠 Understanding Large Language Models and Reasoning Models

Traditional language models respond immediately to queries, generating answers based on patterns learned during training. Recently, a new category called large reasoning models has emerged, which incorporate a “chain of thought” approach. These models simulate reasoning by breaking down problems into intermediate steps before delivering a final answer. This method has shown promise, particularly in tasks requiring multi-step logic such as mathematics and coding.
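The difference is easiest to see at the prompt level. Below is a minimal, illustrative sketch in Python (the train question and prompt wording are my own example, and the actual model call is omitted since it depends on the provider); it only approximates, through prompting, what large reasoning models are trained to do internally.

```python
# Illustrative sketch: direct prompting vs. chain-of-thought style prompting.
# The model call itself is omitted; only the prompt construction is shown.

question = "A train leaves at 9:40 and arrives at 11:05. How long is the trip?"

# Standard LLM usage: ask for the answer straight away.
direct_prompt = f"{question}\nGive only the final answer."

# Chain-of-thought style: request intermediate steps before the final answer,
# roughly mirroring what large reasoning models do internally.
cot_prompt = (
    f"{question}\n"
    "Think step by step: write out each intermediate step, "
    "then state the final answer on its own line."
)

print(direct_prompt)
print()
print(cot_prompt)
```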

Apple’s research scrutinizes these models, noting that most evaluations focus heavily on established benchmarks in math and coding. A key concern here is that these benchmarks might be compromised because the models could have memorized solutions during training rather than genuinely solving the problems.

This raises an important question: Are current benchmarks an accurate measure of reasoning ability, or just a reflection of how well models can regurgitate learned data? This skepticism underpins Apple’s motivation to explore reasoning beyond standard tests.

🧮 Reasoning Complexity: From Simple to Complex Tasks

To illustrate the differences in reasoning effort, consider three example questions:

  • Easy: What is 2 + 2? (Instant answer: 4)
  • Medium: What is 9 × 6? (Requires some mental calculation or recall)
  • Complex: How many prime numbers are there between 1 and 15 million? (A problem requiring deep computation or a clever algorithm; see the sketch just after this list)
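The complex question is the interesting one: nobody answers it by counting in their head. The natural move is to reach for an algorithm such as the sieve of Eratosthenes; the sketch below is my own illustration, not something taken from Apple's paper.

```python
# Count the primes between 1 and 15 million with a sieve of Eratosthenes.
# This is the "clever algorithm" route the complex question calls for.

def count_primes(limit: int) -> int:
    if limit < 2:
        return 0
    is_prime = bytearray([1]) * (limit + 1)   # assume everything is prime...
    is_prime[0] = is_prime[1] = 0             # ...except 0 and 1
    p = 2
    while p * p <= limit:
        if is_prime[p]:
            # Cross out every multiple of p, starting at p*p.
            is_prime[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
        p += 1
    return sum(is_prime)

print(count_primes(15_000_000))
```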

Humans naturally spend more time on medium complexity problems and might avoid or defer extremely complex ones due to cognitive effort. Similarly, AI models show varying performance based on task complexity, which Apple’s study breaks down into three main findings:

  1. Low complexity tasks: Standard language models surprisingly outperform large reasoning models. For simple arithmetic like 2 + 2, overthinking can even reduce accuracy.
  2. Medium complexity tasks: Large reasoning models demonstrate a clear advantage, excelling at problems where step-by-step thinking is beneficial.
  3. High complexity tasks: Both standard and reasoning models collapse, failing to solve very complex problems like the Tower of Hanoi with many discs or intricate puzzles.

📉 The Collapse of Reasoning at High Complexity

Apple tested models on several classic reasoning challenges such as the Tower of Hanoi, checker jumping problems, and river crossing puzzles. The results were stark: as complexity increased, model accuracy plunged to near zero.

This led to Apple’s conclusion that LLMs fail to develop generalized reasoning capabilities beyond a certain complexity threshold. In other words, the apparent reasoning these models perform is an illusion—they are, at their core, “guessing machines” that simulate logical thinking without true understanding.

🤔 Critiques and Counterpoints to Apple’s Findings

While Apple’s paper has generated buzz, it has also met with skepticism from AI experts and enthusiasts. One notable critique comes from AI researcher Sean Goedecke, who highlights potential flaws in the study’s methodology and choice of benchmarks.

Goedecke points out that the Tower of Hanoi puzzle, used extensively in Apple’s experiments, may be a poor proxy for reasoning ability. If contamination of training data is a concern for math and coding benchmarks, the Tower of Hanoi is likely even more exposed: solutions and algorithms for this puzzle are widely available online and were likely included in the training data.

Therefore, giving the model the algorithm for Tower of Hanoi does not provide new information and unsurprisingly does not improve its performance. This is akin to telling a human player to “try not to lose” in a video game—they already know the goal but might still struggle to execute.

Moreover, Goedecke argues that reasoning models are typically trained on math and coding problems, not on puzzles like the Tower of Hanoi. Thus, puzzles might not be the best way to evaluate reasoning skills in LLMs.

🔦 The Streetlight Effect and Its Impact on AI Testing

The choice of benchmarks can fall victim to the streetlight effect—looking for answers where it’s easiest rather than where the truth might actually lie. In AI research, this means focusing on problems with clear, available solutions rather than more nuanced or novel challenges.

By examining only well-known puzzles or benchmarks, researchers might miss areas where models display genuine reasoning or innovative problem-solving. This effect underscores the need for diverse and carefully chosen test sets that better reflect real-world reasoning tasks.

🧩 How Models Handle Complexity: Insight from Model Behavior

When faced with intractably complex problems, models tend to exhibit behavior similar to humans: they recognize the impossibility of enumerating every step and instead search for shortcuts or alternative approaches.

For example, in the Tower of Hanoi task with ten discs, the model quickly realizes that listing every move one by one (1,023 of them, since the minimum for n discs is 2^n − 1) is not feasible. It tries to find a generalized solution instead, mirroring how humans might approach the problem by seeking an algorithm rather than brute force.

This shift in strategy is often mistaken for failure or inability to reason. However, it may represent a more sophisticated form of reasoning—one that balances effort and feasibility.

💻 Practical Reasoning: Using AI to Build Tools for Complex Problems

While LLMs might struggle to list thousands of steps explicitly, they can assist by generating code to solve complex problems efficiently. Using Gemini 2.5 Pro, for instance, one can create a Tower of Hanoi solver that programmatically handles ten discs, calculating the minimum moves and executing the solution automatically.
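As a rough sketch of what such a tool looks like (this is the textbook recursive algorithm, not the literal code Gemini 2.5 Pro produced), ten discs take exactly 2^10 − 1 = 1,023 moves:

```python
# Classic recursive Tower of Hanoi solver.
# Illustrative sketch of the kind of tool an LLM can generate on request;
# it is not the exact output of Gemini 2.5 Pro referenced above.

def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    """Move n discs from source to target via spare, recording every move."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # move the top n-1 discs out of the way
    moves.append((source, target))              # move the largest disc
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 discs back on top

moves: list[tuple[str, str]] = []
hanoi(10, "A", "C", "B", moves)
print(f"Minimum moves for 10 discs: {len(moves)}")  # 2**10 - 1 = 1023
print("First three moves:", moves[:3])
```

The point is less the code than the division of labor: the model supplies the general algorithm, and the program enumerates the 1,023 individual moves that would be tedious to write out by hand.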

This demonstrates that although direct step-by-step reasoning might be limited, LLMs can still contribute valuable reasoning by building tools and automating processes that humans would find tedious or impossible to do manually.

🧩 Are Large Language Models Truly Reasoning or Just Imitating?

The core debate centers on whether LLMs genuinely reason or merely imitate patterns of reasoning. Apple’s paper suggests the latter, branding reasoning in LLMs as an illusion. Yet, the models’ ability to recognize problem complexity, attempt shortcuts, and generate functional code hints at more nuanced capabilities.

Without clear, universally accepted definitions of “reasoning” and “thinking” in AI, this debate is far from settled. It’s possible that LLMs exhibit a form of emergent reasoning that, while different from human cognition, achieves practical results.

🗣️ Final Thoughts: What Does This Mean for the Future of AI?

Apple’s findings serve as a sobering reminder that current LLMs have limitations, particularly with complex reasoning tasks. The notion of AGI—an AI with human-level general intelligence—might be further off than some enthusiasts hope.

However, the ability of these models to assist in coding, problem decomposition, and tool creation remains impressive and valuable. Rather than a verdict that AGI is cancelled, this research invites a more nuanced understanding of AI capabilities and encourages the development of hybrid systems that combine raw computational power with structured reasoning techniques.

Ultimately, the journey toward machines that truly think like humans is ongoing, and each study—whether optimistic or critical—adds valuable insight to the path forward.

❓ Frequently Asked Questions (FAQ)

What are large language models (LLMs)?

LLMs are AI models trained on massive amounts of text data to generate human-like language. They predict the next word in a sequence, enabling them to create coherent sentences, answer questions, and perform various language tasks.
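As a toy illustration of that next-word prediction loop (the probabilities below are made up for the example; real models score an entire vocabulary with a neural network), greedy decoding looks roughly like this:

```python
# Toy next-word prediction with made-up probabilities.
# Real LLMs compute a probability for every token in a large vocabulary;
# here we simply pick the most likely continuation from a small table.

toy_distribution = {
    "The capital of France is": {"Paris": 0.92, "Lyon": 0.04, "a": 0.02},
}

def predict_next_word(context: str) -> str:
    """Greedy decoding: return the highest-probability next word for a context."""
    candidates = toy_distribution[context]
    return max(candidates, key=candidates.get)

print(predict_next_word("The capital of France is"))  # -> Paris
```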

What is the difference between standard LLMs and large reasoning models?

Standard LLMs provide answers immediately based on learned patterns. Large reasoning models incorporate a “chain of thought” process, breaking problems into intermediate steps before producing a final answer, simulating a reasoning process.

Why does Apple claim LLMs can’t reason?

Apple’s paper argues that beyond a certain complexity, LLMs fail to solve problems requiring deep, generalized reasoning. Their successes in simpler or medium tasks are seen as pattern matching or memorization rather than true reasoning.

Are benchmarks like math and coding tests reliable measures of AI reasoning?

Not entirely. Benchmarks can be contaminated if the model has seen the solutions during training. This makes it hard to distinguish between memorization and genuine problem-solving ability.

What is the Tower of Hanoi puzzle, and why is it used to test reasoning?

The Tower of Hanoi is a classic mathematical puzzle involving moving discs between pegs following specific rules. It’s used to test multi-step reasoning and planning in AI because it requires sequential problem solving.

What are the criticisms of using Tower of Hanoi to test AI reasoning?

Critics argue that solutions for Tower of Hanoi are widely available online and likely included in training data. Thus, the puzzle may not be a fair test of reasoning since the model could rely on memorized solutions rather than reasoning.

How do LLMs behave when faced with very complex problems?

They often recognize the complexity and shift strategies, seeking shortcuts or generalized solutions instead of enumerating every detail. This behavior mirrors human problem-solving strategies.

Does the inability to list all steps mean an AI cannot reason?

Not necessarily. It may indicate the AI is using a higher-level approach, balancing effort and feasibility, which is a form of reasoning. It also reflects practical constraints like output length limits.

Can AI models still be useful despite these reasoning limitations?

Absolutely. LLMs can generate code, automate complex tasks, and assist humans in problem-solving, making them powerful tools despite their limitations in generalized reasoning.

What does this mean for the future of AGI?

While the findings suggest current models are far from AGI, research continues. Hybrid approaches and new architectures may bridge the gap, and the path to AGI remains an open and active area of exploration.

 
