
GPT-5 just caught them all (Grok 4.20 and Gemini 3.0)


There’s a lot happening in the AI world right now — fast-moving model releases, leadership changes, new funds trading on AI excitement, and continued debates about safety and stewardship. This article pulls together the latest developments and offers context and perspective on what they mean for researchers, developers, investors, and anyone trying to make sense of the path toward more powerful AI systems.


📰 Quick roundup: what’s been happening this week

AI progress has been relentless. In the space of days we’ve seen:

  - Renewed debate over how AI chat apps are ranked in app stores, with xAI signaling that Grok 4.20 is imminent.
  - Rumors of a Gemini 3.0 release circulating ahead of any official confirmation.
  - GPT-5 posting striking results on unconventional benchmarks, from Pokémon Red to the International Olympiad in Informatics.
  - The Situational Awareness fund reporting roughly 47% returns in the first half of the year.
  - Igor Babushkin announcing his departure to focus on AI safety research and a new venture.

Below I’ll unpack each of these items in more detail, explain why they matter, and connect the dots between product launches, competitive dynamics, and the deepening focus on safety and governance.

🤖 Marketplace politics: Grok, OpenAI, Apple rankings, and the attention economy

Competition for user attention drives product visibility, but it is also political. Recently, there’s been public discussion about how different AI chat apps are ranked in app stores. The suggestion from some quarters is that large platform providers may be giving preferential treatment to certain partners, which has sparked debate about fairness and about how platform-level choices shape which models gain traction.

On one side, an argument has been raised that OpenAI-backed or partnered apps are getting top placements that could skew visibility and adoption. On the other, competitors like xAI (and its Grok assistant) are pushing to catch up, signaling that their next release, Grok 4.20, is imminent and aimed at cracking the top spot in app-store rankings.

Why this matters:

  - App store placement strongly affects visibility and user acquisition, so ranking decisions can determine which models gain traction.
  - If a platform operator favors one model or partner, it can accelerate adoption and create de facto standards.
  - Perceived favoritism invites scrutiny from regulators and competitors alike.

Watch for the Grok 4.20 rollout later this month and observe how any new features or performance improvements affect its reception. If a major app store does favor one player, regulators and competitors will pay attention.

🧠 New model releases and the rumor mill: Gemini 3.0 vs. reality

Rumors about “Gemini 3.0” circulated recently, complete with charts and speculative benchmarks. At present there’s no confirmed Gemini 3.0 release on the horizon. What has appeared instead is a stream of smaller releases and updates worth noting:

  - Compact models aimed at developers rather than a new flagship.
  - Updated image and inference models (such as Imagen variants) that push down the cost of creative and enterprise automation.

Bottom line: rumors often outpace reality. Expect a flurry of model names and variant numbers, but focus on confirmed releases and documented benchmark improvements. Developers and product teams should read release notes and test models directly rather than relying on leaked charts.

🕹️ Benchmarks getting creative: Pokemon Red, IOI, and what they tell us

Benchmarks have always been controversial: they’re useful, but narrow. Recently a quirky but informative trend has emerged: using classic video games like Pokémon Red as intelligence and planning benchmarks for AI agents. Several models (Claude, Gemini, and now GPT-5) have been tested on classic gameplay tasks, and GPT-5 performed impressively on Pokémon Red, needing dramatically fewer steps than previous baseline models (a rough scoring sketch follows the list below).

Why use Pokémon Red?

  - It offers a bounded, reproducible environment with clear goals and measurable progress.
  - Progressing through the game requires planning, memory, and adaptive strategy, which serve as useful proxies for long-horizon reasoning.
  - It’s an approachable test that pairs a fun demo with an insightful metric: how many steps a model needs to reach each milestone.
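To make the scoring idea concrete, here is a rough sketch of how a steps-to-milestone benchmark can be tallied. The environment interface, the milestone names, and the step budget are all assumptions for illustration, not the actual harness used in the GPT-5 or Claude runs.

```python
# Illustrative sketch of a steps-to-milestone game benchmark.
# The env interface, agent_policy, and milestone names are hypothetical placeholders
# standing in for whatever game wrapper and model API a real harness uses.

from typing import Callable, Dict, List

def run_milestone_benchmark(
    env,                                  # environment exposing reset()/step()/milestones_reached()
    agent_policy: Callable[[object], str],  # maps an observation to the next action (e.g., an LLM call)
    milestones: List[str],                # ordered goals, e.g. ["first_badge", "second_badge"]
    max_steps: int = 100_000,
) -> Dict[str, int]:
    """Return the step count at which each milestone was first reached."""
    obs = env.reset()
    reached: Dict[str, int] = {}
    for step in range(1, max_steps + 1):
        action = agent_policy(obs)           # ask the model for its next move
        obs = env.step(action)               # advance the game by one action
        for m in env.milestones_reached(obs):
            reached.setdefault(m, step)      # record only the first time a goal is hit
        if all(m in reached for m in milestones):
            break                            # stop once every milestone is done
    return reached
```

Comparing the recorded step counts across models is what yields results like the step reduction reported for GPT-5: fewer steps to reach the same milestone suggests better long-horizon planning.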

Beyond games, one of the biggest recent milestones was a leading LLM achieving the gold medal among AI competitors at the International Olympiad in Informatics (IOI). The same model placed sixth overall when including human competitors. This is significant: IOI tasks test advanced algorithmic thinking, optimization, and reasoning under time pressure. Performing at near-top human levels in such competitions shows rapid improvement in models’ logical reasoning and problem-solving abilities.

Implications of these benchmark wins:

  - Long-horizon planning and efficiency are improving quickly, not just static question answering.
  - Benchmarks are shifting toward interactive tasks, which better reflect how agents will actually be deployed.
  - Stronger reasoning translates into practical gains in coding, math, and problem-solving, often at lower compute cost.

💼 Money and AI: Situational Awareness fund and the finance playbook

A new AI-focused fund called Situational Awareness (managed by Leopold Aschenbrenner) has attracted attention after reporting a strong start: roughly 47% returns in the first half of the year, with over $1.5 billion under management. The fund’s playbook mixes public equity bets in semiconductor, infrastructure, and power companies (the industries that benefit from AI growth) with targeted venture investments in AI startups, including bets on companies like Anthropic.

Key elements of the strategy:

  - Public equity positions in semiconductors, data-center infrastructure, and power: the industries that benefit most directly from AI growth.
  - Targeted venture stakes in AI startups, including Anthropic.
  - Short positions on sectors judged likely to be disrupted by AI.

A few points of caution and context:

  - Strong returns over a six-month window say little about long-run compounding.
  - Concentrated thematic exposure cuts both ways if AI-linked valuations correct.

Whether these new funds will compound returns over decades is an open question. For now, they indicate an appetite among sophisticated investors for concentrated exposure to AI infrastructure and startups.
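As back-of-the-envelope context only (and not a description of the fund’s actual methodology), a 47% half-year return naively compounds to well over 100% annualized, which is part of why short measurement windows deserve the caution above:

```python
# Illustrative arithmetic only: naive annualization of a half-year return.
half_year_return = 0.47                        # ~47% reported for the first half of the year
annualized = (1 + half_year_return) ** 2 - 1   # compound the same rate over two halves
print(f"Naively annualized: {annualized:.0%}")  # -> roughly 116%
```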

🔧 People: Igor Babushkin, xAI, and the pivot to safety

One of the more meaningful developments on the human side is that Igor Babushkin, a co-founder and early technical lead at xAI, announced his departure to focus on AI safety research and new ventures. His reflections are worth quoting:

“In early 2023, I became convinced that we were getting close to a recipe for superintelligence. I saw the writing on the wall. Very soon, AI could reason beyond the level of humans. How could we ensure that this technology is used for good?”

That sentence captures why many leading researchers are moving from product teams to safety-focused efforts. Several trends explain this pivot:

  - A growing conviction that models may soon reason beyond the level of humans, which makes stewardship an urgent question rather than a distant one.
  - Rising capabilities mean safety is no longer an optional corner of research but central to deployment.
  - The people who built these systems have the internal expertise best suited to practical safety work.

Igor’s move includes launching a new venture (Babushkin Ventures) and a public emphasis on safety. The tone of his announcement suggested an amicable departure and gratitude for colleagues and the intense, late-night engineering culture that made major breakthroughs possible. Anecdotes about debugging large-scale training runs at 4:20 AM and the relief when a run finally succeeds are familiar to anyone who has worked on large systems, and they help explain the deep commitment of the people building these models.

🧩 Reinforcement learning, self-play, and the next frontier

There’s growing consensus among some researchers that the next wave of progress will come from blending large-scale language models with reinforcement learning (RL), self-play techniques, and other methods proven in game-playing research like AlphaStar and AlphaGo. Why is this important?

  - Self-play lets a system generate its own training signal by competing against copies of itself, the approach behind AlphaGo and AlphaStar.
  - RL lets models learn from interaction and feedback rather than only from static text.
  - Combined with the broad knowledge of large language models, these techniques could produce agents that plan and act over long horizons.

That combination explains the excitement around hybrid architectures and reinforces why safety research must extend beyond static evaluation: interactive agents can explore and exploit environments in surprising ways.
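To make the combination more concrete, here is a minimal, hypothetical sketch of a self-play loop wrapped around a model policy. None of the method names correspond to a real library; they simply mark where a game environment and a learnable policy would plug in, and real systems in the AlphaGo family add search, value networks, and far more machinery.

```python
# Minimal self-play sketch: one policy generates moves for both sides of a game,
# and the final result becomes a learning signal. Every method name here
# (reset, done, apply, score, generate_move, update) is an illustrative placeholder.

def self_play_episode(policy, env):
    """Play one game with the policy controlling both sides; return (trajectory, result)."""
    state = env.reset()
    trajectory = []
    while not env.done(state):
        action = policy.generate_move(state)   # e.g., an LLM proposing the next move
        trajectory.append((state, action))
        state = env.apply(state, action)
    return trajectory, env.score(state)         # result from the first player's perspective

def train_by_self_play(policy, env, episodes=1000):
    """Alternate between playing games and updating the policy on the outcomes."""
    for _ in range(episodes):
        trajectory, result = self_play_episode(policy, env)
        policy.update(trajectory, reward=result)  # credit or penalize the played moves
    return policy
```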

🏆 Capability milestones: GPT-5 and the evidence of rapid improvement

GPT-5 has been shown to be significantly more efficient and capable than earlier releases in several respects. A few highlights:

  - A dramatic reduction in the number of steps needed to progress through Pokémon Red compared with previous baseline models.
  - Gold-medal-level performance among AI competitors at the IOI, placing sixth overall when human competitors are included.
  - More efficient planning and stronger reasoning, which translate into lower compute cost for certain applications.

These capability gains are not just headline-grabbing — they have practical implications. More efficient planning means lower compute cost for certain applications. Better reasoning means improved utility in coding, math, problem-solving, and content generation. Those benefits are real for businesses, but they come with increased responsibility for safe deployment.

⚖️ Safety, governance, and why researchers are doubling down

With capabilities rising, safety is no longer an optional corner of research. It’s central. A few forces are pushing researchers and organizations to prioritize safety:

  - Interactive agents can explore and exploit environments in surprising ways, so static evaluations are not enough.
  - Research on “AI doing AI” hints at the theoretical possibility of recursive capability improvement.
  - The researchers who built today’s frontier models are well placed to propose practical safety frameworks, and many are choosing to do so.

Many of the people at the frontier of building large models are the same people now proposing safety frameworks and working with institutes focused on long-term risk mitigation. That alignment — from builders to safety researchers — is a healthy dynamic: expertise that understands system internals tends to produce better, more practical safety research.

📈 Practical takeaways for businesses, developers, and investors

If you’re tracking these developments from a practical perspective, here are actionable points to consider (a minimal evaluation sketch follows this list):

  - Evaluate models rigorously on your own domain-specific tasks rather than relying on public benchmarks alone.
  - Build monitoring and human oversight into production systems.
  - Budget for increased compute and potential surge costs as usage grows.
  - Adopt responsible deployment practices and keep deployments auditable.
  - Stay informed about regulatory developments, app-store distribution dynamics, and emerging safety norms.
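To act on the first point above, a domain-specific evaluation can be as simple as a fixed set of in-house prompts and checkers run against the model. In this hypothetical sketch, call_model stands in for whatever API you use, and the tasks are whatever cases matter in your domain:

```python
# Bare-bones domain-specific eval: run fixed tasks through a model and report a pass rate.
# call_model() is a placeholder for your actual model API; tasks are your own test cases.
from typing import Callable, List, Tuple

def evaluate(call_model: Callable[[str], str],
             tasks: List[Tuple[str, Callable[[str], bool]]]) -> float:
    """Each task is (prompt, checker); checker returns True if the output is acceptable."""
    passed = 0
    for prompt, checker in tasks:
        output = call_model(prompt)
        if checker(output):
            passed += 1
    return passed / len(tasks) if tasks else 0.0

# Hypothetical usage with a trivial checker:
# score = evaluate(my_model_api, [("Summarize our refund policy", lambda out: "30 days" in out)])
```

Rerunning the same task set against each new model release gives a reproducible, domain-relevant comparison point that leaked charts cannot.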

🔮 The path forward: what to watch next

Over the coming months, keep an eye on several signals that will indicate important shifts in the AI landscape:

  1. Model releases and documented benchmark improvements — but focus on reproducible tests and community evaluations.
  2. Rollouts of low-cost, high-throughput image and inference models (e.g., Imagen variants) that reduce the cost of creative and enterprise automation.
  3. Corporate and regulatory stances around app-store promotion and distribution of AI services — these will shape who gets users and how quickly.
  4. Talent movement: people leaving product teams to start safety-first initiatives, or joining funds to harvest AI-driven returns, signal where expertise and capital are flowing.
  5. Research papers demonstrating “AI doing AI” — meta-learning and automated machine learning advances — because they hint at the theoretical possibility of recursive capability improvement.

These are the areas where practical changes to business models, regulatory approaches, and societal expectations will first appear.

❓ FAQ

What is GPT-5 and why is it important?

GPT-5 refers to a generational advancement in large language models that demonstrates substantial improvements in reasoning, planning, and efficiency over earlier versions. Its importance comes from both improved utility (better answers, faster planning, more efficient use of compute) and the broader implications for capability growth — which raises both economic opportunity and safety considerations.

What does “Grok 4.20” mean?

Grok 4.20 is a model versioning label used by a competitor in the LLM space. Version updates typically indicate architecture changes, improved training data or strategies, or tuned inference behaviors. Power users and developers will evaluate the release by testing key tasks, latency, cost, and overall robustness.

Is Gemini 3.0 real?

At present, the widely circulated charts and rumors about a Gemini 3.0 release lack solid confirmation. Companies frequently iterate and release smaller components (like compact models for developers, or updated image models) before shipping a major new flagship. Treat rumor claims skeptically and wait for official release notes or reproducible benchmarks.

Why are people testing models on games like Pokémon Red?

Video games like Pokémon Red provide bounded, reproducible environments that require planning, memory, and adaptive strategies — valuable proxies for certain kinds of intelligence. They are approachable tests that combine fun demos with insightful metrics about a model’s long-horizon reasoning capabilities.

What is Situational Awareness (the fund) and why does it matter?

Situational Awareness is an AI-focused investment fund that has reported strong initial returns by investing in companies and startups that benefit from AI growth (semiconductors, data center infrastructure, AI startups) while also taking short positions on sectors likely to be disrupted. Its success matters because it channels capital toward firms enabling AI and signals that sophisticated investors are treating AI as a durable thematic rather than a short-lived fad.

Should we be worried about “recursive self-improvement” or an intelligence explosion?

Recursive self-improvement is a theoretical scenario where AI systems get increasingly better at designing and improving themselves, potentially accelerating capability growth. While it remains speculative, research and thought leaders take it seriously enough to motivate extensive safety research and governance conversations. The prudent approach is to accelerate safety work alongside capability research, maintain robust evaluation, and involve interdisciplinary expertise.

How should companies prepare for more capable models?

Companies should: (1) evaluate models rigorously on domain-specific tasks, (2) build monitoring and human oversight into production systems, (3) budget for increased compute and potential surge costs, (4) adopt responsible deployment practices, and (5) stay informed about regulatory developments and emerging safety norms.

Are app store rankings for AI apps a big deal?

Yes. App store placement affects visibility and user acquisition dramatically. If platform operators favor one model or partner, it can accelerate a model’s adoption and create de facto standards. This is why public scrutiny and transparent ranking criteria are important topics for regulators and industry groups.

🔚 Closing thoughts

We’re in a period of rapid iteration where capabilities and business models evolve alongside growing attention to safety and governance. Model performance is accelerating — from playing classic games to winning algorithmic competitions — and capital is flowing to both product builders and safety-focused researchers.

The key is balance: celebrate technical progress and the opportunities it unlocks, but also acknowledge the responsibility that comes with creating more powerful systems. That responsibility is reflected by researchers leaving product teams for safety work, by funds that seek to harness AI’s economic upside, and by public debates about distribution and platform power.

For practitioners and decision-makers, the immediate priorities remain practical and straightforward: evaluate models carefully, design for safe and auditable deployment, and build business strategies that account for the fast-changing economics of AI compute, inference cost, and user acquisition.

Keep watching the space for new releases, verified benchmarks, and thoughtful safety research — those signals will tell us more than rumor-laden charts ever will.

 
