
Ex-OpenAI CTO Reveals Plan to Fix LLMs' Biggest Problem


Introduction: Why this matters and where this came from 🧭

When Thinking Machines Lab launched their research blog, Connectionism, their first post didn’t talk about bigger models or fancy training tricks. Instead, they focused on a fundamental operational issue: non-determinism during LLM inference. As I said in the video,

“reproducibility is a bedrock of scientific progress.”

If we can’t reliably get the same output from the same input, a lot of downstream activities — debugging, auditing, benchmarking, production reliability — become much harder.

In the video I walked through the paper at a high level, and I want to do the same here but with more context, clearer analogies, and practical takeaways. If you use LLMs in production or are building tooling around them, this discovery could change how you think about reliability.

What is non-determinism in LLMs? 🧩

At its simplest, non-determinism is when the same input produces different outputs on repeated runs. With LLMs this is easy to observe: ask a model the exact same prompt multiple times and you’ll often get different completions. Sometimes this is intentional and desirable — creative tasks benefit from variety — but other times it is a major pain point. For repeatable systems, especially in enterprise, finance, legal, or scientific applications, you want the same prompt to give the same answer (unless you explicitly request variability).

There are two separate layers to this:

Intentional randomness from sampling: temperature, top-p, and similar settings deliberately inject variety, and you can dial them down or turn them off.

Unintended, system-level non-determinism: even with sampling disabled (temperature = 0, greedy decoding), the serving infrastructure can return different outputs for the exact same prompt.

So the question the paper asks is: why, even when we try to make inference deterministic, do we still get differences? And can we remove those differences entirely?
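To make the two layers concrete, here is a toy sketch using made-up next-token scores (plain NumPy, no real model): the first half shows the intentional sampling layer, the second shows greedy decoding, which only varies if the scores themselves vary.

```python
# Minimal sketch of the two layers, using invented next-token scores.
import numpy as np

logits = np.array([2.0, 1.9, 0.5])           # scores for three candidate tokens
tokens = ["Paris", "Paris,", "France"]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Layer 1: intentional randomness. Sampling at temperature > 0 is *supposed* to vary.
rng = np.random.default_rng()
probs = softmax(logits / 0.8)                # temperature = 0.8
samples = [tokens[rng.choice(len(tokens), p=probs)] for _ in range(5)]
print("sampled:", samples)                   # may differ run to run, by design

# Layer 2: greedy decoding (temperature = 0) just takes the argmax, so with
# identical scores it always returns the same token. The paper's point is that
# the serving stack can quietly change the scores themselves between runs.
print("greedy:", tokens[int(np.argmax(logits))])
```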

Common hypothesis: floating point precision and concurrency 🧮

A frequent explanation you’ll hear in engineering circles is that numerical precision and parallel execution introduce tiny rounding differences, and those tiny differences cascade into different tokens. Let me unpack that quickly.

Floating point non-associativity: Computers represent decimal numbers with finite precision (floating point). When you add numbers in different orders, rounding happens at each step, and floating point arithmetic is not associative — (a + b) + c can differ from a + (b + c) in the final bits. In deep learning, we do millions of tiny arithmetic operations, so in theory these rounding differences could accumulate and change a final token probability.
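You can see the non-associativity for yourself with a minimal NumPy sketch; the values below are chosen only to make the rounding obvious.

```python
# Floating point addition is not associative: the grouping changes the rounding.
import numpy as np

a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(0.1)

print((a + b) + c)   # 0.1  (the big values cancel first, so 0.1 survives)
print(a + (b + c))   # 0.0  (0.1 is lost when added to -1e8 at float32 precision)
```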

Concurrency: Modern accelerators (GPUs, TPUs, specialized chips) parallelize matrix math across many cores. The order in which parallel ops finish can change depending on thread scheduling, hardware micro-variations, and other processes on the machine. If two cores finish slightly differently across runs, you could get different rounding behaviors.
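Here is a small illustration of how summation order alone can change a float32 result; the two "schedules" below stand in for different parallel reduction orders (this is a CPU-side illustration, not a GPU kernel).

```python
# The same float32 values summed in two different orders can give different totals.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

sequential = np.float32(0.0)
for chunk in np.split(x, 1000):          # one schedule: 1000 chunks, left to right
    sequential += chunk.sum(dtype=np.float32)

pairwise = x.sum(dtype=np.float32)       # NumPy's own pairwise reduction

print(sequential, pairwise, sequential == pairwise)   # often not exactly equal
```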

Those explanations are plausible, and they are part of the story — they explain why small numeric differences could arise. But Thinking Machines’ team observed something important: if you run the exact same matrix multiply repeatedly on a GPU with identical inputs, you often get bitwise-identical results. That suggests floating point and concurrency aren’t the whole story by themselves.
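You can check that observation on your own hardware with a sketch like the one below; treat the outcome as an empirical probe of your setup rather than a guarantee for every kernel and device.

```python
# Re-run the exact same matmul with the exact same inputs on the same device.
# In line with the observation above, the results are typically bitwise identical.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)

reference = a @ b
for _ in range(10):
    assert torch.equal(a @ b, reference)   # exact, element-wise comparison
print("10/10 repeats bitwise identical")
```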

The real culprit: batch size and batching behavior 🚌

Thinking Machines pinpoints a different and surprising root cause: batching. To understand this, imagine every prompt you send to a public LLM is placed into a shared ride—the inference server will group multiple requests into a batch to amortize compute and improve throughput. Sometimes the carpool (batch) is big because the system is busy. Other times it’s small.

Why does that matter?

Because batching subtly changes the exact sequence of arithmetic operations the model performs. Even if the underlying GPU kernels are deterministic for a fixed matrix multiply, changing batch sizes changes matrix shapes, memory access patterns, and the order of intermediate arithmetic reductions. Those small differences shift the probabilities at certain token positions. When you’re predicting tokens one-by-one, a tiny shift in probability at some point can flip which token is chosen next. Even when sampling is turned off (temperature = 0, i.e., greedy decoding), the argmax selection can change when probabilities are nearly tied.
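To see that flip concretely, here is a tiny NumPy illustration with made-up logits for three candidate tokens; the size of the wobble is arbitrary, chosen to mimic a rounding-level difference.

```python
# Two runs of "the same" computation whose logits differ only in the last bits.
import numpy as np

logits_run_a = np.array([1.0000001, 1.0000000, -2.0], dtype=np.float32)
logits_run_b = logits_run_a.copy()
logits_run_b[1] += np.float32(5e-7)   # a rounding-level wobble, e.g. from a
                                      # different reduction order

print(np.argmax(logits_run_a))        # 0
print(np.argmax(logits_run_b))        # 1 -- a different token, and the
                                      #      completions diverge from here on
```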

Put another way: the same single prompt, when processed alone versus processed inside a batch of 8 or 32, can traverse slightly different arithmetic paths. Those differences can produce different outputs.
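A rough way to probe this on your own stack is to process the same row alone and inside a larger batch and compare the results; depending on the device and the kernels chosen for each shape, the outputs may match exactly or differ in the low-order bits. This is a probe, not a guaranteed reproduction.

```python
# Sketch: the same "prompt" (row x) processed alone vs. inside a batch of 32.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

W = torch.randn(4096, 4096, device=device)
x = torch.randn(1, 4096, device=device)            # a single request
others = torch.randn(31, 4096, device=device)      # 31 co-batched requests
batch = torch.cat([x, others], dim=0)

alone = x @ W                                      # processed on its own
in_batch = (batch @ W)[0:1]                        # the same row, inside the batch

print(torch.equal(alone, in_batch))                # False on many GPU setups
print((alone - in_batch).abs().max().item())       # size of the drift, if any
```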

How Thinking Machines proposes to fix it: batch-invariant kernels and three practical steps ⚙️

The good news from the paper is that this is not an intractable hardware-level mystery — it’s an engineering problem with concrete fixes. The Thinking Machines team proposes a combination of strategies that together remove the non-determinism introduced by variable batching.

I described these fixes in the video using a kitchen analogy, and the essence carries over here: a disciplined kitchen prepares your dish the same way whether it is cooking for one table or a packed dining room. The paper's version of that discipline comes down to three ideas:

Batch-invariant kernels: use GPU kernels whose arithmetic for any single request does not depend on how many other requests happen to share the batch.

A stable computation layout: keep shapes, tiling, and reduction strategies consistent so the same prompt follows the same arithmetic path regardless of server load.

Fixed slicing and order for attention: chunk the attention computation and reduce it in a fixed, deterministic order rather than letting it vary with batch composition.

Those three ideas — batch-invariant kernels, a stable computation layout, and fixed slicing/order for attention — together create an inference pipeline where the same prompt produces the same intermediate math and, therefore, the same final output every time, regardless of the traffic or how many concurrent prompts are coalesced.
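The paper's actual kernels are GPU code. As a rough intuition for what "batch-invariant" means, here is a toy NumPy reduction that always splits each row into the same fixed-size chunks and reduces them in the same order, so a row's result never depends on how many other rows share the batch. This is my illustration, not the authors' implementation.

```python
import numpy as np

def batch_invariant_row_sums(batch: np.ndarray, chunk: int = 256) -> np.ndarray:
    """Sum each row using a fixed chunk size and fixed left-to-right order.

    Because the per-row schedule never changes, row i of a batch of 1 and
    row i of a batch of 1000 go through the same sequence of float32 additions.
    """
    out = np.zeros(batch.shape[0], dtype=np.float32)
    for i, row in enumerate(batch):
        acc = np.float32(0.0)
        for start in range(0, row.shape[0], chunk):   # same splits for every row,
            acc += row[start:start + chunk].sum(dtype=np.float32)  # any batch size
        out[i] = acc
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4096)).astype(np.float32)       # a lone request
big = np.concatenate([x, rng.standard_normal((63, 4096)).astype(np.float32)])

assert batch_invariant_row_sums(x)[0] == batch_invariant_row_sums(big)[0]
print("row 0 is bitwise identical whether batched alone or with 63 others")
```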

Why consistency can be worth a small speed hit ⚖️

I know what you’re thinking: locking down compute patterns to be invariant will cost throughput or latency. That’s true in some configurations. Thinking Machines is explicit about the tradeoff: accept a small performance penalty in exchange for big gains in determinism and reliability.

For many applications, that tradeoff is worth it. Imagine:

Debugging a production incident where you need to replay the exact prompt and get the exact output that caused the problem.

Auditing a model-generated decision in a finance, legal, or compliance context, where you must show precisely what the system produced.

Benchmarking two model versions, where run-to-run noise would otherwise mask real regressions or improvements.

For creative, exploratory, or stochastic use cases (poetry, brainstorming), you might prefer randomness. In those cases you can still enable sampling parameters. The key point is that determinism should be a controllable property of your inference stack, not an unpredictable side-effect of batching.

Experiment: did it actually work? The Qwen3-235B test 📊

Thinking Machines ran a clear experiment to validate their approach. They used a model (referenced in the paper as Qwen3-235B) and ran 1,000 completions at temperature zero with the same prompt: “Tell me about Richard Feynman”, generating 1,000 tokens each.

Baseline behavior (no batch-invariant kernels): they observed 80 unique completions among the 1,000 runs. The most common completion occurred 78 times — meaning even at temperature zero there was significant variability due to batching differences.

After enabling the batch-invariant kernels and applying their deterministic inference pipeline, all 1,000 completions were identical. That’s not “mostly identical.” That’s bit-for-bit identical outputs for repeated runs of the same prompt under identical decoding parameters.

That experimental result is important because it shows practical, reproducible wins from the proposed engineering changes. This isn’t just a theoretical claim — it’s an empirical fix.
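If you want to run the same kind of check against your own stack, a minimal sketch looks like the following. The `generate` function is a placeholder for whatever client you use; its name and signature are my assumption, not an API from the paper.

```python
from collections import Counter

def generate(prompt: str) -> str:
    """Placeholder: call your model at temperature 0 and return the completion."""
    raise NotImplementedError("wire this up to your own inference endpoint")

def count_unique_completions(prompt: str, runs: int = 100) -> Counter:
    # Repeat the same zero-temperature request and tally distinct outputs.
    completions = Counter(generate(prompt) for _ in range(runs))
    print(f"{len(completions)} unique completions out of {runs} runs")
    print("most common appeared", completions.most_common(1)[0][1], "times")
    return completions

# count_unique_completions("Tell me about Richard Feynman")
```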

Why defeating non-determinism is important beyond debugging 🔍

The paper and I highlight several downstream benefits, but it’s worth spelling them out with concrete examples:

Debugging: when a model misbehaves, you can replay the exact prompt and reproduce the exact failing output instead of chasing a moving target.

Auditing and compliance: regulated teams can document precisely what the model produced for a given input, and reproduce it later if challenged.

Benchmarking: comparisons between model versions, prompts, or vendors reflect real differences rather than run-to-run noise.

Testing: deterministic outputs let you write assertions against expected completions without flaky tests.

Production reliability: the same request produces the same behavior for every user every time, which simplifies caching, monitoring, and incident response.

When randomness is still useful: creative and exploratory use cases 🎨

Importantly, this paper is not arguing that randomness is always bad. I emphasized in the video — and reiterate here — that creative tasks often benefit from variability. The goal is to make determinism a deliberate choice rather than an accidental side-effect of how an inference cluster happens to batch requests.

Practical systems should expose modes:

A deterministic mode for testing, auditing, and compliance-sensitive paths, where the same prompt always yields the same output.

A stochastic mode for creative and exploratory work, where sampling parameters (temperature, top-p) deliberately introduce variety.

Ideally, a hybrid or seeded mode, where core outputs stay reproducible while optional augmentations are free to vary.

In the video I briefly mentioned Lindy, which is a tool for agent-driven application building. Why mention it here? Because reproducibility and shipping quality code go hand-in-hand. If you’re building products that rely on LLMs, you want tools that can generate, test, and deploy code with consistent behavior. Lindy claims to research best practices, write tests, and run QA so you can deploy functional apps quickly — a practical example of how engineering rigor and automation reduce friction in the product lifecycle.

How this fits into the broader AI infrastructure landscape 🌐

Thinking Machines’ work is interesting not just for its immediate fix, but for what it signals about the next stage of LLM infrastructure. As models get larger and inference costs become significant, batching will remain an essential optimization for throughput. But as LLMs move into mission-critical roles, the expectations around determinism, auditability, and reproducibility will increase.

We can expect several trends:

Inference providers offering determinism as an explicit, documented serving mode rather than an accidental property of load.

Batch-invariant kernels and deterministic attention implementations finding their way into open-source serving stacks.

Reproducibility requirements appearing in enterprise contracts, regulatory guidance, and evaluation standards.

Benchmarks and audits that expect repeatable runs, making published results easier to verify.

We’re moving from an era where many production LLM systems were “best-effort” to one where predictable, auditable behavior is expected in many industries. Thinking Machines has provided a concrete technical path to help us get there.

Limitations and open questions 🔬

There are a few caveats and open problems worth noting:

Performance cost: the size of the throughput or latency penalty depends on workload, batch shapes, and hardware, and hasn’t been characterized for every configuration.

Coverage: the results are demonstrated on the models and kernels the team tested; other architectures, custom ops, or hardware may need additional engineering.

Cross-environment reproducibility: determinism within one serving stack doesn’t guarantee identical outputs across different hardware generations, driver versions, or inference frameworks.

Adoption: hosted API providers have to choose to expose deterministic modes before most users can benefit from them.

Practical checklist for engineering teams ✅

If you’re responsible for LLMs in production, here are immediate steps you can take based on these insights:

  1. Determine where determinism matters: Classify endpoints and use cases as “deterministic-required” or “stochastic-allowed”.
  2. Measure current non-determinism: Run repeated zero-temperature completions across your endpoints and quantify variability.
  3. Isolate batching effects: Test the same model under different batching regimes (single-request, batch-of-8, batch-of-32) and compare outputs.
  4. Request deterministic kernels from your provider: If you use a hosted API, ask if they can offer batch-invariant execution modes or explain their batching strategy.
  5. Introduce test fixtures: Add deterministic test cases into CI for endpoints that require reproducibility (see the sketch after this list).
  6. Monitor production drift: Track probability distributions and token-level diffs to capture silent behavior changes.
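As a concrete version of steps 2 and 5, here is a small pytest-style sketch. `generate` is again a stand-in for your own client in deterministic mode, and the prompts are arbitrary examples.

```python
import pytest

PROMPTS = [
    "Summarize our refund policy in one sentence.",
    "Tell me about Richard Feynman",
]

def generate(prompt: str) -> str:
    """Placeholder: call your deterministic-mode endpoint (temperature 0) here."""
    raise NotImplementedError

@pytest.mark.parametrize("prompt", PROMPTS)
def test_same_prompt_same_output(prompt):
    # Steps 2 and 5: repeated zero-temperature runs of the same prompt should be
    # identical on a deterministic-required endpoint.
    first = generate(prompt)
    for _ in range(4):
        assert generate(prompt) == first
```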

Conclusion: determinism as a feature, not an accident 🏁

Thinking Machines’ paper shows that the surprising non-determinism many of us have observed in LLM inference is not an irreducible artifact of floating point math or concurrency alone — it can come from batching behavior. More importantly, with targeted engineering changes (batch-invariant kernels, stable computation layouts, and deterministic attention chunking), we can eliminate that unpredictability entirely.

That doesn’t mean we must always be deterministic. But it does mean determinism should be a first-class option in inference stacks. Making determinism controllable — and achievable without mysterious hardware hacks — opens the door to better testing, auditing, benchmarking, and production stability. For teams building mission-critical systems, that’s a huge step forward.

FAQ ❓

Q: What exactly is “batching” and why do LLM servers use it?

A: Batching is the process of grouping multiple incoming inference requests into a single compute operation so the system can leverage parallelism and amortize per-request overhead. It improves throughput and resource utilization, which is crucial when serving many concurrent users. The downside is that batching can change the shapes and memory layout of operations, which is where non-determinism can creep in.
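As a toy illustration of why batching changes shapes, here is a sketch that pads a group of co-arriving requests into one array for a single forward pass; the resulting shape (and therefore the kernels and memory layout chosen downstream) depends on which requests happen to be grouped together. This is illustrative only, not a real serving loop.

```python
import numpy as np

def pad_into_batch(token_id_lists):
    """Group several requests into one right-padded array for a single forward pass."""
    max_len = max(len(ids) for ids in token_id_lists)
    batch = np.zeros((len(token_id_lists), max_len), dtype=np.int64)   # 0 = pad id
    for row, ids in enumerate(token_id_lists):
        batch[row, :len(ids)] = ids
    return batch

print(pad_into_batch([[5, 7, 9]]).shape)                    # (1, 3): a quiet moment
print(pad_into_batch([[5, 7, 9], [2, 4], [8] * 40]).shape)  # (3, 40): a busy moment
```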

Q: Will enabling deterministic kernels always make models slower?

A: Not necessarily always, but often there is a tradeoff. To achieve determinism you might avoid the absolutely fastest kernel for a specific batch shape and instead use a canonical kernel that’s invariant. In practice the performance hit can be small relative to the operational benefits, but it depends on your latency and throughput requirements.

Q: Is this problem only relevant for large, public LLM APIs?

A: No. Any system that uses batching or dynamic computation layouts can encounter this issue. It’s especially visible in highly-shared, high-throughput public APIs because they maximize variable batch coalescing. But private deployments, multi-tenant setups, and edge serving systems can all see similar effects.

Q: Can we get deterministic behavior by always setting batch size to 1?

A: Running with batch size 1 removes one axis of variability, but it’s not always practical due to throughput and cost. Additionally, other runtime factors (like attention chunking or layout differences) can still introduce variability. The Thinking Machines approach is to make the kernels invariant to batch size so you get determinism without sacrificing batching entirely.

Q: Does this fix apply to all model architectures?

A: The paper demonstrates the technique on the models they tested, and the principles are broadly applicable. However, different architectures, custom ops, or hardware-specific kernels might require additional engineering. Widespread adoption will likely involve collaboration between model providers, inference stack maintainers, and hardware vendors.

Q: How should product teams decide when to use deterministic vs stochastic modes?

A: Base the decision on the use case. If you need auditability, reproducibility, or regulatory compliance, prefer deterministic mode. If you need creativity or variety (marketing copy, ideation), prefer stochastic. Ideally, expose both modes to users or have hybrid modes where core outputs are deterministic and optional augmentations are stochastic.

Q: Where can I read the original Thinking Machines paper?

A: The Thinking Machines team published their post on the Connectionism blog and included technical details and experimental results. Check the Thinking Machines website under their research blog for “Defeating Non-Determinism in LLM Inference.”

Q: Will major cloud providers adopt deterministic inference?

A: It’s likely. As demand for reproducibility grows in enterprise and regulated industries, cloud providers and inference-serving vendors will be motivated to offer determinism as a supported option. The timeline depends on customer demand and engineering investment.

If you found this breakdown helpful, consider exploring the Thinking Machines post yourself and testing determinism in your own systems. If you’re building apps with LLMs, make reproducibility a design consideration — it’s a small change in approach that unlocks big wins for trust, testing, and auditability.

 
