
Ex-OpenAI CTO Reveals Plan to Fix LLMs' Biggest Problem


Introduction: Why this matters and where this came from 🧭

When Thinking Machines Lab launched their research blog, Connectionism, their first post didn’t talk about bigger models or fancy training tricks. Instead, they focused on a fundamental operational issue: non-determinism during LLM inference. As I said in the video,

“reproducibility is a bedrock of scientific progress.”

If we can’t reliably get the same output from the same input, a lot of downstream activities — debugging, auditing, benchmarking, production reliability — become much harder.

In the video I walked through the paper at a high level, and I want to do the same here but with more context, clearer analogies, and practical takeaways. If you use LLMs in production or are building tooling around them, this discovery could change how you think about reliability.

What is non-determinism in LLMs? 🧩

At its simplest, non-determinism is when the same input produces different outputs on repeated runs. With LLMs this is easy to observe: ask a model the exact same prompt multiple times and you’ll often get different completions. Sometimes this is intentional and desirable — creative tasks benefit from variety — but other times it is a major pain point. For repeatable systems, especially in enterprise, finance, legal, or scientific applications, you want the same prompt to give the same answer (unless you explicitly request variability).

There are two separate layers to this:

Intentional randomness from sampling: temperature, top-p, and similar settings deliberately inject variety, and you can dial them down or turn them off.

Unintended, system-level non-determinism: even with sampling disabled (temperature = 0, greedy decoding), the serving infrastructure can return different outputs for the exact same prompt.

So the question the paper asks is: why, even when we try to make inference deterministic, do we still get differences? And can we remove those differences entirely?
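To make the two layers concrete, here is a toy sketch using made-up next-token scores (plain NumPy, no real model): the first half shows the intentional sampling layer, the second shows greedy decoding, which only varies if the scores themselves vary.

```python
# Minimal sketch of the two layers, using invented next-token scores.
import numpy as np

logits = np.array([2.0, 1.9, 0.5])           # scores for three candidate tokens
tokens = ["Paris", "Paris,", "France"]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Layer 1: intentional randomness. Sampling at temperature > 0 is *supposed* to vary.
rng = np.random.default_rng()
probs = softmax(logits / 0.8)                # temperature = 0.8
samples = [tokens[rng.choice(len(tokens), p=probs)] for _ in range(5)]
print("sampled:", samples)                   # may differ run to run, by design

# Layer 2: greedy decoding (temperature = 0) just takes the argmax, so with
# identical scores it always returns the same token. The paper's point is that
# the serving stack can quietly change the scores themselves between runs.
print("greedy:", tokens[int(np.argmax(logits))])
```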

Common hypothesis: floating point precision and concurrency 🧮

A frequent explanation you’ll hear in engineering circles is that numerical precision and parallel execution introduce tiny rounding differences, and those tiny differences cascade into different tokens. Let me unpack that quickly.

Floating point non-associativity: Computers represent decimal numbers with finite precision (floating point). When you add numbers in different orders, rounding happens at each step, and floating point arithmetic is not associative — (a + b) + c can differ from a + (b + c) in the final bits. In deep learning, we do millions of tiny arithmetic operations, so in theory these rounding differences could accumulate and change a final token probability.
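You can see the non-associativity for yourself with a minimal NumPy sketch; the values below are chosen only to make the rounding obvious.

```python
# Floating point addition is not associative: the grouping changes the rounding.
import numpy as np

a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(0.1)

print((a + b) + c)   # 0.1  (the big values cancel first, so 0.1 survives)
print(a + (b + c))   # 0.0  (0.1 is lost when added to -1e8 at float32 precision)
```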

Concurrency: Modern accelerators (GPUs, TPUs, specialized chips) parallelize matrix math across many cores. The order in which parallel ops finish can change depending on thread scheduling, hardware micro-variations, and other processes on the machine. If two cores finish slightly differently across runs, you could get different rounding behaviors.
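Here is a small illustration of how summation order alone can change a float32 result; the two "schedules" below stand in for different parallel reduction orders (this is a CPU-side illustration, not a GPU kernel).

```python
# The same float32 values summed in two different orders can give different totals.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

sequential = np.float32(0.0)
for chunk in np.split(x, 1000):          # one schedule: 1000 chunks, left to right
    sequential += chunk.sum(dtype=np.float32)

pairwise = x.sum(dtype=np.float32)       # NumPy's own pairwise reduction

print(sequential, pairwise, sequential == pairwise)   # often not exactly equal
```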

Those explanations are plausible, and they are part of the story — they explain why small numeric differences could arise. But Thinking Machines’ team observed something important: if you run the exact same matrix multiply repeatedly on a GPU with identical inputs, you often get bitwise-identical results. That suggests floating point and concurrency aren’t the whole story by themselves.
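You can check that observation on your own hardware with a sketch like the one below; treat the outcome as an empirical probe of your setup rather than a guarantee for every kernel and device.

```python
# Re-run the exact same matmul with the exact same inputs on the same device.
# In line with the observation above, the results are typically bitwise identical.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)

reference = a @ b
for _ in range(10):
    assert torch.equal(a @ b, reference)   # exact, element-wise comparison
print("10/10 repeats bitwise identical")
```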

The real culprit: batch size and batching behavior 🚌

Thinking Machines pinpoints a different and surprising root cause: batching. To understand this, imagine every prompt you send to a public LLM is placed into a shared ride—the inference server will group multiple requests into a batch to amortize compute and improve throughput. Sometimes the carpool (batch) is big because the system is busy. Other times it’s small.

Why does that matter?

Because batching subtly changes the exact sequence of arithmetic operations the model performs. Even if the underlying GPU kernels are deterministic for a fixed matrix multiply, changing batch sizes changes matrix shapes, memory access patterns, and the order of intermediate arithmetic reductions. Those small differences shift the probabilities at certain token positions. When you’re predicting tokens one-by-one, a tiny shift in probability at some point can flip which token is chosen next. Even when sampling is turned off (temperature = 0, i.e., greedy decoding), the argmax selection can change when probabilities are nearly tied.
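To see that flip concretely, here is a tiny NumPy illustration with made-up logits for three candidate tokens; the size of the wobble is arbitrary, chosen to mimic a rounding-level difference.

```python
# Two runs of "the same" computation whose logits differ only in the last bits.
import numpy as np

logits_run_a = np.array([1.0000001, 1.0000000, -2.0], dtype=np.float32)
logits_run_b = logits_run_a.copy()
logits_run_b[1] += np.float32(5e-7)   # a rounding-level wobble, e.g. from a
                                      # different reduction order

print(np.argmax(logits_run_a))        # 0
print(np.argmax(logits_run_b))        # 1 -- a different token, and the
                                      #      completions diverge from here on
```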

Put another way: the same single prompt, when processed alone versus processed inside a batch of 8 or 32, can traverse slightly different arithmetic paths. Those differences can produce different outputs.
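A rough way to probe this on your own stack is to process the same row alone and inside a larger batch and compare the results; depending on the device and the kernels chosen for each shape, the outputs may match exactly or differ in the low-order bits. This is a probe, not a guaranteed reproduction.

```python
# Sketch: the same "prompt" (row x) processed alone vs. inside a batch of 32.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

W = torch.randn(4096, 4096, device=device)
x = torch.randn(1, 4096, device=device)            # a single request
others = torch.randn(31, 4096, device=device)      # 31 co-batched requests
batch = torch.cat([x, others], dim=0)

alone = x @ W                                      # processed on its own
in_batch = (batch @ W)[0:1]                        # the same row, inside the batch

print(torch.equal(alone, in_batch))                # False on many GPU setups
print((alone - in_batch).abs().max().item())       # size of the drift, if any
```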

How Thinking Machines proposes to fix it: batch-invariant kernels and three practical steps ⚙️

The good news from the paper is that this is not an intractable hardware-level mystery — it’s an engineering problem with concrete fixes. The Thinking Machines team proposes a combination of strategies that together remove the non-determinism introduced by variable batching.

I described these fixes in the video using a kitchen analogy, and the essence carries over here: a disciplined kitchen prepares your dish the same way whether it is cooking for one table or a packed dining room. The paper's version of that discipline comes down to three ideas:

Batch-invariant kernels: use GPU kernels whose arithmetic for any single request does not depend on how many other requests happen to share the batch.

A stable computation layout: keep shapes, tiling, and reduction strategies consistent so the same prompt follows the same arithmetic path regardless of server load.

Fixed slicing and order for attention: chunk the attention computation and reduce it in a fixed, deterministic order rather than letting it vary with batch composition.

Those three ideas — batch-invariant kernels, a stable computation layout, and fixed slicing/order for attention — together create an inference pipeline where the same prompt produces the same intermediate math and, therefore, the same final output every time, regardless of the traffic or how many concurrent prompts are coalesced.
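The paper's actual kernels are GPU code. As a rough intuition for what "batch-invariant" means, here is a toy NumPy reduction that always splits each row into the same fixed-size chunks and reduces them in the same order, so a row's result never depends on how many other rows share the batch. This is my illustration, not the authors' implementation.

```python
import numpy as np

def batch_invariant_row_sums(batch: np.ndarray, chunk: int = 256) -> np.ndarray:
    """Sum each row using a fixed chunk size and fixed left-to-right order.

    Because the per-row schedule never changes, row i of a batch of 1 and
    row i of a batch of 1000 go through the same sequence of float32 additions.
    """
    out = np.zeros(batch.shape[0], dtype=np.float32)
    for i, row in enumerate(batch):
        acc = np.float32(0.0)
        for start in range(0, row.shape[0], chunk):   # same splits for every row,
            acc += row[start:start + chunk].sum(dtype=np.float32)  # any batch size
        out[i] = acc
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4096)).astype(np.float32)       # a lone request
big = np.concatenate([x, rng.standard_normal((63, 4096)).astype(np.float32)])

assert batch_invariant_row_sums(x)[0] == batch_invariant_row_sums(big)[0]
print("row 0 is bitwise identical whether batched alone or with 63 others")
```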

Why consistency can be worth a small speed hit ⚖️

I know what you’re thinking: locking down compute patterns to be invariant will cost throughput or latency. That’s true in some configurations. Thinking Machines is explicit about the tradeoff: accept a small performance penalty in exchange for big gains in determinism and reliability.

For many applications, that tradeoff is worth it. Imagine:

Debugging a production incident where you need to replay the exact prompt and get the exact output that caused the problem.

Auditing a model-generated decision in a finance, legal, or compliance context, where you must show precisely what the system produced.

Benchmarking two model versions, where run-to-run noise would otherwise mask real regressions or improvements.

For creative, exploratory, or stochastic use cases (poetry, brainstorming), you might prefer randomness. In those cases you can still enable sampling parameters. The key point is that determinism should be a controllable property of your inference stack, not an unpredictable side-effect of batching.

Experiment: did it actually work? The Qwen3-235B test 📊

Thinking Machines ran a clear experiment to validate their approach. They used a model (referenced in the paper as Qwen3-235B) and ran 1,000 completions at temperature zero with the same prompt: “Tell me about Richard Feynman”, generating 1,000 tokens each.

Baseline behavior (no batch-invariant kernels): they observed 80 unique completions among the 1,000 runs. The most common completion occurred 78 times — meaning even at temperature zero there was significant variability due to batching differences.

After enabling the batch-invariant kernels and applying their deterministic inference pipeline, all 1,000 completions were identical. That’s not “mostly identical.” That’s bit-for-bit identical outputs for repeated runs of the same prompt under identical decoding parameters.

That experimental result is important because it shows practical, reproducible wins from the proposed engineering changes. This isn’t just a theoretical claim — it’s an empirical fix.
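If you want to run the same kind of check against your own stack, a minimal sketch looks like the following. The `generate` function is a placeholder for whatever client you use; its name and signature are my assumption, not an API from the paper.

```python
from collections import Counter

def generate(prompt: str) -> str:
    """Placeholder: call your model at temperature 0 and return the completion."""
    raise NotImplementedError("wire this up to your own inference endpoint")

def count_unique_completions(prompt: str, runs: int = 100) -> Counter:
    # Repeat the same zero-temperature request and tally distinct outputs.
    completions = Counter(generate(prompt) for _ in range(runs))
    print(f"{len(completions)} unique completions out of {runs} runs")
    print("most common appeared", completions.most_common(1)[0][1], "times")
    return completions

# count_unique_completions("Tell me about Richard Feynman")
```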

Why defeating non-determinism is important beyond debugging 🔍

The paper and I highlight several downstream benefits, but it’s worth spelling them out with concrete examples:

Debugging: when a model misbehaves, you can replay the exact prompt and reproduce the exact failing output instead of chasing a moving target.

Auditing and compliance: regulated teams can document precisely what the model produced for a given input, and reproduce it later if challenged.

Benchmarking: comparisons between model versions, prompts, or vendors reflect real differences rather than run-to-run noise.

Testing: deterministic outputs let you write assertions against expected completions without flaky tests.

Production reliability: the same request produces the same behavior for every user every time, which simplifies caching, monitoring, and incident response.

When randomness is still useful: creative and exploratory use cases 🎨

Importantly, this paper is not arguing that randomness is always bad. I emphasized in the video — and reiterate here — that creative tasks often benefit from variability. The goal is to make determinism a deliberate choice rather than an accidental side-effect of how an inference cluster happens to batch requests.

Practical systems should expose modes:

A deterministic mode for testing, auditing, and compliance-sensitive paths, where the same prompt always yields the same output.

A stochastic mode for creative and exploratory work, where sampling parameters (temperature, top-p) deliberately introduce variety.

Ideally, a hybrid or seeded mode, where core outputs stay reproducible while optional augmentations are free to vary.

In the video I briefly mentioned Lindy, which is a tool for agent-driven application building. Why mention it here? Because reproducibility and shipping quality code go hand-in-hand. If you’re building products that rely on LLMs, you want tools that can generate, test, and deploy code with consistent behavior. Lindy claims to research best practices, write tests, and run QA so you can deploy functional apps quickly — a practical example of how engineering rigor and automation reduce friction in the product lifecycle.

How this fits into the broader AI infrastructure landscape 🌐

Thinking Machines’ work is interesting not just for its immediate fix, but for what it signals about the next stage of LLM infrastructure. As models get larger and inference costs become significant, batching will remain an essential optimization for throughput. But as LLMs move into mission-critical roles, the expectations around determinism, auditability, and reproducibility will increase.

We can expect several trends:

Inference providers offering determinism as an explicit, documented serving mode rather than an accidental property of load.

Batch-invariant kernels and deterministic attention implementations finding their way into open-source serving stacks.

Reproducibility requirements appearing in enterprise contracts, regulatory guidance, and evaluation standards.

Benchmarks and audits that expect repeatable runs, making published results easier to verify.

We’re moving from an era where many production LLM systems were “best-effort” to one where predictable, auditable behavior is expected in many industries. Thinking Machines has provided a concrete technical path to help us get there.

Limitations and open questions 🔬

There are a few caveats and open problems worth noting:

Performance cost: the size of the throughput or latency penalty depends on workload, batch shapes, and hardware, and hasn’t been characterized for every configuration.

Coverage: the results are demonstrated on the models and kernels the team tested; other architectures, custom ops, or hardware may need additional engineering.

Cross-environment reproducibility: determinism within one serving stack doesn’t guarantee identical outputs across different hardware generations, driver versions, or inference frameworks.

Adoption: hosted API providers have to choose to expose deterministic modes before most users can benefit from them.

Practical checklist for engineering teams ✅

If you’re responsible for LLMs in production, here are immediate steps you can take based on these insights:

  1. Determine where determinism matters: Classify endpoints and use cases as “deterministic-required” or “stochastic-allowed”.
  2. Measure current non-determinism: Run repeated zero-temperature completions across your endpoints and quantify variability.
  3. Isolate batching effects: Test the same model under different batching regimes (single-request, batch-of-8, batch-of-32) and compare outputs.
  4. Request deterministic kernels from your provider: If you use a hosted API, ask if they can offer batch-invariant execution modes or explain their batching strategy.
  5. Introduce test fixtures: Add deterministic test cases into CI for endpoints that require reproducibility (see the sketch after this list).
  6. Monitor production drift: Track probability distributions and token-level diffs to capture silent behavior changes.
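As a concrete version of steps 2 and 5, here is a small pytest-style sketch. `generate` is again a stand-in for your own client in deterministic mode, and the prompts are arbitrary examples.

```python
import pytest

PROMPTS = [
    "Summarize our refund policy in one sentence.",
    "Tell me about Richard Feynman",
]

def generate(prompt: str) -> str:
    """Placeholder: call your deterministic-mode endpoint (temperature 0) here."""
    raise NotImplementedError

@pytest.mark.parametrize("prompt", PROMPTS)
def test_same_prompt_same_output(prompt):
    # Steps 2 and 5: repeated zero-temperature runs of the same prompt should be
    # identical on a deterministic-required endpoint.
    first = generate(prompt)
    for _ in range(4):
        assert generate(prompt) == first
```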

Conclusion: determinism as a feature, not an accident 🏁

Thinking Machines’ paper shows that the surprising non-determinism many of us have observed in LLM inference is not an irreducible artifact of floating point math or concurrency alone — it can come from batching behavior. More importantly, with targeted engineering changes (batch-invariant kernels, stable computation layouts, and deterministic attention chunking), we can eliminate that unpredictability entirely.

That doesn’t mean we must always be deterministic. But it does mean determinism should be a first-class option in inference stacks. Making determinism controllable — and achievable without mysterious hardware hacks — opens the door to better testing, auditing, benchmarking, and production stability. For teams building mission-critical systems, that’s a huge step forward.

FAQ ❓

Q: What exactly is “batching” and why do LLM servers use it?

A: Batching is the process of grouping multiple incoming inference requests into a single compute operation so the system can leverage parallelism and amortize per-request overhead. It improves throughput and resource utilization, which is crucial when serving many concurrent users. The downside is that batching can change the shapes and memory layout of operations, which is where non-determinism can creep in.
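As a toy illustration of why batching changes shapes, here is a sketch that pads a group of co-arriving requests into one array for a single forward pass; the resulting shape (and therefore the kernels and memory layout chosen downstream) depends on which requests happen to be grouped together. This is illustrative only, not a real serving loop.

```python
import numpy as np

def pad_into_batch(token_id_lists):
    """Group several requests into one right-padded array for a single forward pass."""
    max_len = max(len(ids) for ids in token_id_lists)
    batch = np.zeros((len(token_id_lists), max_len), dtype=np.int64)   # 0 = pad id
    for row, ids in enumerate(token_id_lists):
        batch[row, :len(ids)] = ids
    return batch

print(pad_into_batch([[5, 7, 9]]).shape)                    # (1, 3): a quiet moment
print(pad_into_batch([[5, 7, 9], [2, 4], [8] * 40]).shape)  # (3, 40): a busy moment
```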

Q: Will enabling deterministic kernels always make models slower?

A: Not necessarily always, but often there is a tradeoff. To achieve determinism you might avoid the absolutely fastest kernel for a specific batch shape and instead use a canonical kernel that’s invariant. In practice the performance hit can be small relative to the operational benefits, but it depends on your latency and throughput requirements.

Q: Is this problem only relevant for large, public LLM APIs?

A: No. Any system that uses batching or dynamic computation layouts can encounter this issue. It’s especially visible in highly-shared, high-throughput public APIs because they maximize variable batch coalescing. But private deployments, multi-tenant setups, and edge serving systems can all see similar effects.

Q: Can we get deterministic behavior by always setting batch size to 1?

A: Running with batch size 1 removes one axis of variability, but it’s not always practical due to throughput and cost. Additionally, other runtime factors (like attention chunking or layout differences) can still introduce variability. The Thinking Machines approach is to make the kernels invariant to batch size so you get determinism without sacrificing batching entirely.

Q: Does this fix apply to all model architectures?

A: The paper demonstrates the technique on the models they tested, and the principles are broadly applicable. However, different architectures, custom ops, or hardware-specific kernels might require additional engineering. Widespread adoption will likely involve collaboration between model providers, inference stack maintainers, and hardware vendors.

Q: How should product teams decide when to use deterministic vs stochastic modes?

A: Base the decision on the use case. If you need auditability, reproducibility, or regulatory compliance, prefer deterministic mode. If you need creativity or variety (marketing copy, ideation), prefer stochastic. Ideally, expose both modes to users or have hybrid modes where core outputs are deterministic and optional augmentations are stochastic.

Q: Where can I read the original Thinking Machines paper?

A: The Thinking Machines team published their post on the Connectionism blog and included technical details and experimental results. Check the Thinking Machines website under their research blog for “Defeating Non-Determinism in LLM Inference.”

Q: Will major cloud providers adopt deterministic inference?

A: It’s likely. As demand for reproducibility grows in enterprise and regulated industries, cloud providers and inference-serving vendors will be motivated to offer determinism as a supported option. The timeline depends on customer demand and engineering investment.

If you found this breakdown helpful, consider exploring the Thinking Machines post yourself and testing determinism in your own systems. If you’re building apps with LLMs, make reproducibility a design consideration — it’s a small change in approach that unlocks big wins for trust, testing, and auditability.

 
