How Hot is it in The Token Kettle?
It is utterly mind-bending that the cutting edge of artificial intelligence is fundamentally governed by the laws of a boiling kettle.
When the boiling creativity points and chill deadpan points aren’t the same across model families and versions, it means trouble for interoperable, native AI capabilities in browsers. But we might have found a saner way forward—and this is where you can help us check our fancier kettle.
The Thermodynamics of a Prompt
If you look under the hood of almost any generative AI application today, you’ll inevitably run into a parameter called temperature. It isn’t there because your local hardware is running hot. It is there because the early pioneers of large language models borrowed their core mathematical foundations directly from Statistical Mechanics.
"In physics, temperature describes how much atoms jiggle around. High temperature means a chaotic, unpredictable gas. Absolute zero means a frozen, perfectly predictable crystal structure."
In AI, we do the exact same thing to tokens (the fragments of words a model processes):
- Set it low (e.g., 0.1 or 0.2): You get a factual ice cube. Well, as much of an ice cube as a language model can realistically be. It becomes nearly deterministic, choosing the statistically safest word almost every single time.
- Set it high (e.g., 0.9 or 1.2): You get a highly creative, chaotic, and possibly hallucinating gas. The model begins skipping the obvious choices to explore the wider, weirder edges of its probability distribution.
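To make the kettle metaphor concrete, here is a minimal sketch of the usual temperature-scaled softmax, the textbook way a temperature value reshapes next-token probabilities. The function name and the three logits are made up for illustration, and real models layer other sampling steps on top of this.

```javascript
// Convert raw model scores (logits) into probabilities, scaled by temperature.
// Lower temperature sharpens the distribution; higher temperature flattens it.
function softmaxWithTemperature(logits, temperature) {
  if (!(temperature > 0)) {
    throw new RangeError("temperature must be greater than 0");
  }
  const scaled = logits.map((logit) => logit / temperature);
  // Subtract the maximum before exponentiating for numerical stability.
  const max = Math.max(...scaled);
  const exps = scaled.map((value) => Math.exp(value - max));
  const total = exps.reduce((sum, value) => sum + value, 0);
  return exps.map((value) => value / total);
}

// Hypothetical logits for three candidate next tokens.
const exampleLogits = [2.0, 1.0, 0.1];

const cold = softmaxWithTemperature(exampleLogits, 0.2);
const hot = softmaxWithTemperature(exampleLogits, 1.2);

console.log("T = 0.2:", cold.map((p) => p.toFixed(3)));
console.log("T = 1.2:", hot.map((p) => p.toFixed(3)));
```

With these made-up logits, the top token takes more than 99% of the probability mass at 0.2 but only around 61% at 1.2, which is the ice cube and the gas in numbers.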
This system works reasonably well when you are hitting an unmoving, server-side API hosted in the cloud. But when you try to bring these capabilities natively into the browser, it brings us directly to a massive, invisible nightmare for web developers: The Interoperability Heat Crisis.
The Interoperability Heat Crisis
The catch is simple: Temperature is not a universal constant.
A 0.7 temperature setting on a massive cloud-hosted model with hundreds of billions of parameters acts like a gentle, pleasant creative spark. It gives you nice marketing copy or an interesting brainstorming list without breaking structure.
But what happens when you throw that exact same 0.7 value at a tiny, sensitive, client-side model running locally inside a user’s browser?
Even worse, the web is inherently multi-engine and rapidly evolving:
- If your development browser uses one local model family, and your users use a range of browsers using different AI models…
- Or if a specific browser simply updates its underlying model version overnight to bring the latest innovations…
Your hardcoded, hyper-precise 0.42 value might suddenly turn your quality-whistling production web app into a tepid experience. Having “precise control” over the jiggling of atoms is incredibly fun (giggles), but it also creates a massive interoperability trap (zero giggles).
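One way to see why the same number behaves differently is to apply it to two models whose raw scores are spread differently. The sketch below is only an illustration: both logit vectors are invented, and the assumption that one model is "confident" and the other "hesitant" is ours, not a measurement of any real model.

```javascript
// Temperature-scaled softmax, repeated here so the snippet is self-contained.
function softmax(logits, temperature) {
  const scaled = logits.map((logit) => logit / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((v) => Math.exp(v - max));
  const total = exps.reduce((sum, v) => sum + v, 0);
  return exps.map((v) => v / total);
}

// Shannon entropy in bits: a rough measure of how spread out sampling is.
function entropyBits(probabilities) {
  return probabilities.reduce(
    (sum, p) => (p > 0 ? sum - p * Math.log2(p) : sum),
    0
  );
}

// Hypothetical logits: model A strongly prefers one token, model B does not.
const confidentModelLogits = [8.0, 3.0, 2.0, 1.0];
const hesitantModelLogits = [2.0, 1.6, 1.4, 1.2];

// The exact same temperature is applied to both models.
const sameTemperature = 0.7;
const entropyA = entropyBits(softmax(confidentModelLogits, sameTemperature));
const entropyB = entropyBits(softmax(hesitantModelLogits, sameTemperature));

console.log("Model A entropy (bits):", entropyA.toFixed(2));
console.log("Model B entropy (bits):", entropyB.toFixed(2));
```

In this toy setup, model A stays almost fully predictable at 0.7 while model B spreads its choices across all four tokens. The dial is the same, but the resulting behavior is not.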
The Proposal: Categorical Sampling Modes
To solve this, we are proposing a significant pivot for the Sampling Parameters in the Prompt API: we are moving away from raw numerical scalars entirely in web contexts and replacing them with Categorical Sampling Modes.
Instead of forcing you to guess the ideal numerical thermostat settings on a model that might change underneath your feet next month, you specify your behavioral intent using explicit semantic presets.
// A look at the proposed API shape
const session = await LanguageModel.create({
  samplingMode: "most-creative"
});
const creativeBrainstorm = await session.prompt(
  "Give me 3 weird, highly unusual flavor combinations " +
  "for an ice cream shop that surprisingly taste good."
);
Under the hood, the browser handles the heavy lifting of safely mapping that intent ("most-predictable", "predictable", "balanced", "creative", or "most-creative") to the specific mathematical dials optimal for the exact model version running on that device.
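As a purely conceptual sketch of what that mapping could look like, imagine a per-model lookup table maintained by the browser. Everything here is hypothetical: the model names, the numbers, the `topK` field, and the `resolveSamplingParams` helper are invented for illustration and do not describe any browser's actual implementation.

```javascript
// The five modes named in the proposal.
const SAMPLING_MODES = [
  "most-predictable",
  "predictable",
  "balanced",
  "creative",
  "most-creative",
];

// HYPOTHETICAL tuning tables. Model names and values are invented.
const HYPOTHETICAL_TUNING = {
  "tiny-local-model-v1": {
    "most-predictable": { temperature: 0.1, topK: 1 },
    "predictable": { temperature: 0.3, topK: 3 },
    "balanced": { temperature: 0.5, topK: 5 },
    "creative": { temperature: 0.7, topK: 8 },
    "most-creative": { temperature: 0.9, topK: 16 },
  },
  "tiny-local-model-v2": {
    "most-predictable": { temperature: 0.2, topK: 1 },
    "predictable": { temperature: 0.5, topK: 3 },
    "balanced": { temperature: 0.8, topK: 8 },
    "creative": { temperature: 1.0, topK: 16 },
    "most-creative": { temperature: 1.3, topK: 40 },
  },
};

// Resolve a semantic mode to the dials tuned for one specific model version.
function resolveSamplingParams(modelId, samplingMode) {
  if (!SAMPLING_MODES.includes(samplingMode)) {
    throw new TypeError(`Unknown sampling mode: ${samplingMode}`);
  }
  const table = HYPOTHETICAL_TUNING[modelId];
  if (!table) {
    throw new Error(`No tuning table for model: ${modelId}`);
  }
  // Return a copy so callers cannot mutate the shared table.
  return { ...table[samplingMode] };
}

console.log(resolveSamplingParams("tiny-local-model-v1", "balanced"));
console.log(resolveSamplingParams("tiny-local-model-v2", "balanced"));
```

The point of the design is that your code only ever says "balanced". If the model changes, the table changes with it, and your app does not need to.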
Testing the proposal’s limits
We recently surveyed members of our Early Preview Program (EPP) about this potential shift before revealing the full implementation details. The data revealed a sharp split among developers who actively tune their parameters:
- 🟢 50% indicated that predefined semantic presets would be perfectly acceptable and workable for their codebases.
- 🔴 28% expressed a definitive, hard requirement for raw parameter access.
The presets seem sufficient for the majority, but the 28% group asking for raw access isn’t exactly small.
When we look at broader industry trends outside of client-side web applications, cloud AI providers are also gradually phasing out or diminishing the explicit role of raw sampling settings—favoring simplified developer experiences or relying entirely on a model’s native capability to infer formatting out of the prompt instructions themselves.
But because local models are uniquely sensitive, we may still need some level of control. So, we need to know if a preset-only world is viable and sufficient. Since respondents answered based purely on a description, before anyone could test the actual implementation, we need to validate our approach with the community.
We want to see if our new “fancier kettle” can satisfy your production workloads, convert the skeptics, and help us discover exactly what constraints we might have missed.
Your turn
We want to build a resilient, cross-browser API anchored in real-world engineering constraints, not just hypothetical preferences. We need your hands-on, factual insights that can withstand scrutiny:
- Where do the presets fall off a cliff? Are there specific production tasks where a 5-tier spectrum completely fails compared to raw scalar tuning?
- Can you adapt? If your app relies on raw values today, could your product requirements be satisfied by a well-mapped, browser-optimized semantic mode, or does your software inherently depend on the fine-grained mathematical dials?
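If you want numbers to back up your answers, a small harness like the one below could help. It is only a sketch: `distinctRatio` and `sampleMode` are hypothetical helper names, the "fraction of unique outputs" metric is a crude stand-in for whatever quality measure matters to your app, and the session factory is passed in so the harness assumes nothing about the API beyond the `create` and `prompt` shape shown earlier.

```javascript
// Fraction of outputs that are unique after light normalization.
// Values near 1 mean varied outputs; values near 0 mean repeated outputs.
function distinctRatio(outputs) {
  if (outputs.length === 0) return 0;
  const normalized = outputs.map((text) => text.trim().toLowerCase());
  return new Set(normalized).size / outputs.length;
}

// Run the same prompt several times under one sampling mode.
// `createSession` is any async function that accepts { samplingMode }
// and returns an object with an async `prompt(text)` method.
async function sampleMode(createSession, samplingMode, promptText, runs) {
  const outputs = [];
  for (let i = 0; i < runs; i++) {
    // A fresh session per run keeps earlier turns from influencing later ones.
    const session = await createSession({ samplingMode });
    outputs.push(await session.prompt(promptText));
  }
  return { samplingMode, outputs, distinctRatio: distinctRatio(outputs) };
}
```

In a browser where the proposed shape is available, you would presumably pass something like `(options) => LanguageModel.create(options)` as `createSession`, then compare the results for each mode against what your raw settings produce today.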
Feedback
Help us build an API that boils more predictably by sharing your empirical feedback on the relevant issue in the W3C Web Machine Learning Working Group repo.
- 👉 Improve the Presets with Concrete Data: Share your specific prompt, your production settings, and exactly where the presets failed to match your verified production results.
- 👇 Defend the Honor of the Raw Dial: Leave a comment below detailing your empirical, hands-on experiences with on-device model sensitivity and why presets can’t cut it.
The Origin Trial for Sampling Mode Presets is officially live. Go test the presets, try your hardest to break them, and come back to tell us exactly how the kettle did.