LLMs Create a SELF-IMPROVING AI Agent to Play Settlers of Catan

In the rapidly evolving world of artificial intelligence, one of the most fascinating advancements is the development of autonomous, self-improving AI agents powered by large language models (LLMs). These intelligent agents are not only capable of performing complex tasks but also of iteratively improving their performance over time without human intervention. A compelling example of this is the recent breakthrough in AI agents learning to play the strategic board game Settlers of Catan. This article delves deep into how LLM-based agents are revolutionizing strategic planning, self-improvement, and game-playing AI, highlighting the latest research and insights into this exciting field.

🧩 Understanding Autonomous AI Agents and Their Architecture

The term “AI agents” often sparks debate due to its broad and sometimes ambiguous usage. However, in this context, it refers to systems built around large language models with additional scaffolding—essentially, frameworks that enhance the model’s capabilities by integrating tools, code-writing abilities, note-taking, and strategic reasoning. These architectures empower the AI to interact with complex environments, such as playing Settlers of Catan, by interpreting the game state, making decisions, and adapting strategies over time.
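
To make the idea of "scaffolding" concrete, here is a minimal, hypothetical sketch of such an agent in Python: a bare LLM call wrapped with persistent note-taking and a constrained action interface. None of the names below come from the research itself; they are purely illustrative.

```python
# Hypothetical sketch of an LLM agent with scaffolding: persistent notes plus
# a constrained action interface. Illustrative only -- not the authors' code.
from dataclasses import dataclass, field

@dataclass
class ScaffoldedAgent:
    llm: callable                               # any function: prompt -> text completion
    notes: list = field(default_factory=list)   # persistent "note-taking" memory

    def act(self, game_state: str, legal_actions: list) -> str:
        # Combine strategic guidance, accumulated notes, and the current state
        prompt = (
            "You are playing Settlers of Catan.\n"
            f"Notes from earlier turns: {self.notes}\n"
            f"Current game state: {game_state}\n"
            f"Legal actions: {legal_actions}\n"
            "Reply with exactly one legal action."
        )
        choice = self.llm(prompt).strip()
        self.notes.append(f"{game_state!r} -> {choice}")
        # Fall back to a safe legal action if the model replies off-menu
        return choice if choice in legal_actions else legal_actions[0]
```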

This approach is not novel, but it has been gaining significant traction. Google DeepMind's AlphaEvolve and the Darwin Gödel Machine are early examples of self-improving coding agents that combine LLMs with modular scaffolding. Similarly, NVIDIA's Voyager agent showed how a GPT-4-guided agent can autonomously learn and improve its gameplay in Minecraft, a complex, dynamic environment. The key takeaway is that large language models, when paired with an intelligent architectural framework, can continuously refine their own strategies and improve performance.

In the case of Settlers of Catan, the AI agent is built on the open-source Catanatron framework, which simulates the game environment and lets AI players run many games in quick succession. This simulation provides a fertile ground for training and testing self-evolving agents.
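
For readers who want to experiment, Catanatron can be driven from a few lines of Python. The snippet below follows the pattern of its documented quickstart, pitting built-in random bots against each other; exact class and module names may vary between versions, so treat it as a sketch rather than a verified recipe.

```python
# Sketch based on Catanatron's quickstart: simulate a single game between
# built-in random bots. Class names may differ across library versions.
from catanatron import Game, RandomPlayer, Color

players = [
    RandomPlayer(Color.RED),
    RandomPlayer(Color.BLUE),
    RandomPlayer(Color.WHITE),
    RandomPlayer(Color.ORANGE),
]

game = Game(players)
winner = game.play()  # runs the full game loop and returns the winning color
print(winner)
```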

🎲 Why Settlers of Catan? The Challenge of Strategic Planning

Settlers of Catan is a board game rich in strategy, involving resource management, expansion, negotiation, and chance through dice rolls. Unlike perfect information games like chess or Go, where all players see the entire game state, Catan introduces partial observability and randomness, making it a much more challenging environment for AI agents to master.

Traditional AI methods, such as reinforcement learning, have achieved superhuman performance in perfect information games. However, these methods struggle in environments like Catan, which require long-term strategic planning, dealing with uncertainty, and adapting to changing game states. This makes Catan an excellent testbed for exploring how LLM-based agents can develop coherent, adaptive strategies over extended gameplay.

🤖 The Multi-Agent Framework: Roles and Collaboration

A standout feature of this approach is the introduction of a multi-agent system in which different specialized agents collaborate to improve overall gameplay. The architecture includes:

- An analyzer agent that reviews completed games and diagnoses strategic weaknesses.
- A researcher agent that investigates alternative tactics and strategies.
- A coder agent that rewrites the player agent's prompts and underlying code.
- A player agent that executes the current strategy in actual games.

This collaborative dynamic allows the system to self-diagnose, research new tactics, implement changes, and test results iteratively, mimicking a human-like cycle of learning and improvement.
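
The cycle described above can be pictured as a loop in which each specialized agent hands its output to the next. The sketch below is an assumption about how such a loop might be wired together; the role interfaces and function names are illustrative rather than the researchers' actual code.

```python
# Illustrative analyze -> research -> code -> play loop. The four roles are
# passed in as plain callables; none of these names come from the paper.
def improvement_cycle(player, analyzer, researcher, coder, run_games, iterations=5):
    for _ in range(iterations):
        results = run_games(player)      # player agent: plays a batch of games
        weaknesses = analyzer(results)   # analyzer agent: diagnoses losses and mistakes
        ideas = researcher(weaknesses)   # researcher agent: proposes new tactics
        player = coder(player, ideas)    # coder agent: rewrites prompts and code
    return player
```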

📜 Prompt Engineering and Game State Representation

One critical factor in the success of LLM agents in complex environments is how information is presented to them. In this system, agents receive a structured representation of the game state, including details like available actions, current resources, longest road, largest army, and other vital statistics. This structured input is coupled with natural language prompts explaining the game rules and strategic guidance.

Continuously updating the AI with the current game state at each turn ensures that the agent retains context and maintains long-term coherence in its decision-making process. This approach contrasts with earlier systems that provided static or infrequent updates, often resulting in degraded performance over extended periods.

For example, the prompt might include:

“You are playing Settlers of Catan. Here are the current resources, board status, and available actions. Your goal is to maximize victory points through settlement expansion, resource prioritization, and strategic negotiation.”

This method of prompt engineering is a powerful tool to keep the AI’s reasoning aligned with the game’s objectives and rules.
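
In code, that per-turn prompt assembly might look like the hypothetical helper below. The game-state fields are assumptions modeled on the statistics mentioned above (resources, longest road, largest army, available actions), not a real Catanatron data structure.

```python
# Hypothetical per-turn prompt builder; field names are illustrative.
def build_turn_prompt(state: dict, legal_actions: list) -> str:
    return "\n".join([
        "You are playing Settlers of Catan. Your goal is to maximize victory points.",
        f"Victory points: {state['victory_points']}",
        f"Resources: {state['resources']}",  # e.g. {'wood': 2, 'brick': 1, ...}
        f"Longest road: {state['longest_road']}, largest army: {state['largest_army']}",
        f"Available actions: {legal_actions}",
        "Choose exactly one of the available actions and briefly justify it.",
    ])
```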

⚙️ Evolutionary Self-Improvement: From Basic to Advanced Agents

The evolutionary process begins with a basic agent that maps unstructured game state descriptions directly to actions. Over time, through multiple iterations, the agent evolves by rewriting its own prompts and underlying code to improve strategic planning and execution.

Two main evolution strategies are employed:

- Prompt evolution, in which the agent rewrites the natural-language prompts that guide its play.
- Agent (code) evolution, in which the agent rewrites its own underlying code and scaffolding, changing how it represents the game state and selects actions.

Each evolutionary step involves evaluating gameplay outcomes, analyzing failures, researching alternative strategies, and implementing code changes. This iterative loop enables the agent to self-improve autonomously, adapting to the complexities of the game environment.
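
Conceptually, each evolutionary step reduces to "mutate, evaluate, keep the better version." The sketch below illustrates that loop under the assumption that fitness is measured by average victory points over a batch of games; the function names are placeholders, not the actual implementation.

```python
# One evolutionary step: a candidate (a rewritten prompt or rewritten agent
# code) replaces the incumbent only if it scores better. Names are placeholders.
def evolve_once(agent, mutate, evaluate):
    """mutate:   agent -> candidate agent (new prompt or new code)
    evaluate: agent -> average victory points over a batch of games"""
    baseline = evaluate(agent)
    candidate = mutate(agent)
    return candidate if evaluate(candidate) > baseline else agent
```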

💻 Experimental Setup and Technology Stack

The experiments were conducted on readily accessible hardware, a 2019 MacBook Pro and a 2021 MacBook Pro with the M1 Max chip, over roughly 60 hours. This shows that advanced AI research of this kind is becoming increasingly feasible without prohibitively expensive infrastructure.

Several different LLMs were evaluated as the reasoning core of the agent; their relative performance is summarized in the results below.

📊 Results: How Well Did the AI Agents Perform?

The performance of the agents was benchmarked against Catanatron's strongest heuristic-based bot, which uses alpha-beta search. The evaluation metrics included average victory points, the number of settlements and cities built, largest army size, and other development indicators.
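
Aggregating those metrics across a batch of games is straightforward. The snippet below shows one hypothetical way to do it, assuming each game has already been reduced to a small dictionary of results; this record format is an assumption, not Catanatron's actual output.

```python
# Hypothetical metric aggregation over a batch of game records.
from statistics import mean

def summarize(games: list[dict]) -> dict:
    return {
        "avg_victory_points": mean(g["victory_points"] for g in games),
        "avg_settlements": mean(g["settlements"] for g in games),
        "avg_cities": mean(g["cities"] for g in games),
        "avg_largest_army": mean(g["largest_army"] for g in games),
        "win_rate": mean(g["won"] for g in games),  # g["won"] is 0 or 1
    }
```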

Among the key findings, Claude 3.7 emerged as the top performer, systematically developing sophisticated strategic prompts for its own gameplay.

These results highlight the crucial role of the underlying LLM’s capabilities in determining the success of self-improving agents.

🔮 The Future of Self-Improving AI Agents

The implications of these findings are profound for AI development and deployment. As large language models continue to evolve, their ability to self-improve will likely accelerate, enabling even more sophisticated autonomous systems.

Several important considerations and predictions follow from this work, from faster self-improvement cycles as base models advance to applications well beyond board games.

🔍 Frequently Asked Questions (FAQ)

What is an autonomous self-improving AI agent?

It is an AI system built around a large language model that can independently analyze its performance, research improvements, modify its own code or prompts, and iteratively enhance its capabilities without human intervention.

Why is Settlers of Catan a challenging game for AI?

Unlike perfect information games like chess, Settlers of Catan includes elements of randomness (dice rolls), partial observability (hidden player resources), and complex negotiation, making strategic planning and long-term coherence difficult for AI.

How do multi-agent systems improve AI performance?

By dividing tasks among specialized agents—such as analysis, research, coding, and gameplay—multi-agent systems allow for better focus, collaboration, and iterative improvements, mimicking human team dynamics and enhancing overall effectiveness.

What role does prompt engineering play in AI gameplay?

Prompt engineering shapes how the AI interprets the game state and objectives, providing structured guidance that helps maintain focus, strategic coherence, and adaptability throughout gameplay.

Can these AI agents be applied outside gaming?

Absolutely. The principles of self-improvement, multi-agent collaboration, and strategic reasoning can be applied to business automation, software development, robotics, and other fields that require adaptive AI systems.

Are open-source tools available to experiment with these AI agents?

Yes. Frameworks like Catanatron provide open-source environments for simulating Settlers of Catan, enabling researchers and developers to integrate their own AI agents and experiment with self-improving architectures.

🚀 Conclusion: The Rise of Self-Evolving AI Agents

The development of LLM-powered, self-improving AI agents marks a pivotal moment in artificial intelligence research. By demonstrating the ability to autonomously enhance strategic gameplay in a complex, uncertain environment like Settlers of Catan, these systems showcase the potential for AI to tackle real-world problems requiring long-term planning, adaptability, and collaboration.

The multi-agent framework combining analysis, research, coding, and gameplay roles creates a robust feedback loop that drives continuous improvement. As large language models grow more powerful and accessible, the future promises even more advanced autonomous agents capable of learning and evolving with minimal human input.

For businesses and technologists interested in harnessing AI’s full potential, these insights offer a valuable blueprint for building adaptive, resilient AI systems. Whether in gaming, automation, or software development, the recipe for success lies in combining powerful models with intelligent scaffolding and iterative self-improvement.

To explore reliable IT support and custom software development services that leverage the latest in AI and technology, visit Biz Rescue Pro. For more insights into AI trends and innovations, check out Canadian Technology Magazine.

 
