EP007 — When the Words Aren't the Thinking (Latent Reasoning)
When you ask a modern language model to "think step by step," it writes out intermediate reasoning before answering and tends to do better on hard problems. The field has been treating those written steps as the reasoning itself. A new position paper from Wenshuo Wang argues that the evidence currently favors a different picture: the real reasoning happens in the hidden internal states moving through the network's layers, and the chain of thought is more like a transcript — sometimes faithful, often not. The cross-domain parallel is Nisbett and Wilson's 1977 stocking experiment, where shoppers gave confident, articulate explanations for choices that were actually driven by something they never mentioned.
Cross-domain connection
Cognitive psychology: Nisbett and Wilson's 1977 stocking experiment, in which shoppers presented with four identical pairs of stockings on a table picked the right-most pair four times more often than the left-most (a clean position effect) and then, asked why, confidently named material, texture, and weave, differences that did not exist. The verbal explanation came from a different system than the one that made the choice. Wang's H1 is the computational sibling: the chain of thought is produced by the same overall model whose hidden states do the reasoning, but it is not a transcript of those hidden states; it is a plausible verbal report generated alongside them. The parallel holds on three points: the primary computation is hidden, the verbal trace is produced by the same overall system, and the methodological move is the same, off the verbal report and onto behavior or substrate. It breaks on one point: in humans the verbal-report subsystem is causally decoupled from the reasoning system, but in language models the chain-of-thought tokens are not external commentary; they are fed back in as input to the next step's computation. They are IN the loop. Forward question for the show: when the trace is part of the computation that produces the next step, does the line between "architecture" and "behavior" still cut as cleanly?
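The "in the loop" point can be made concrete with a minimal sketch of autoregressive decoding. Everything here is a toy stand-in (the `next_token` function is a hypothetical placeholder for a model's forward pass, not any real API): the detail that matters is that each emitted token is appended to the context that conditions the next step, so the trace is input, not just output.

```python
def next_token(context):
    # Hypothetical stand-in for a model's forward pass:
    # deterministically picks the next token from the full context.
    return sum(context) % 10

prompt = [1, 2, 3]
trace = []
for _ in range(4):
    # The tokens generated so far re-enter as input: the trace
    # conditions the very computation that extends it.
    tok = next_token(prompt + trace)
    trace.append(tok)

print(trace)  # [6, 2, 4, 8] — each token depends on all earlier ones
```

Change any early token in `trace` and every later token shifts, which is exactly why the human analogy (verbal report causally decoupled from the choice) does not transfer wholesale.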
Concepts introduced
- The "think step by step" / chain-of-thought paradigm (named, scoped to "the model writes out intermediate reasoning before answering")
- Three competing hypotheses about what reasoning IS as a computational object — H1 (latent-state trajectories), H2 (the visible chain of thought), H0 (generic serial compute) — named cleanly so the audience can hold them
- Hidden states / residual stream as the substrate where the model's computation actually moves (introduced gently, no math)
- Unfaithful chains of thought (the empirical finding that models' written reasoning sometimes contradicts the answer they give)
- The asymmetry test: rephrase the chain of thought while preserving the hidden state → performance holds; preserve the chain of thought while perturbing the hidden state → performance collapses
- Mechanistic interpretability (named once, framed as "peering inside the wiring")
- The faithfulness category itself as a contested research object
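The asymmetry test can be caricatured with a toy "reasoner" (an illustrative sketch, not Wang's actual experimental setup): the hidden state carries the running computation, the narration is generated alongside it, and the answer is read off the hidden state alone. Rephrasing the narration leaves the answer intact; perturbing the hidden state flips it.

```python
def run(tokens, perturb_step=None, paraphrase=False):
    """Toy reasoner: hidden state h does the computing; trace narrates."""
    h = 0                  # hidden state: a running sum
    trace = []
    for i, t in enumerate(tokens):
        h += t             # the actual computation lives in h
        if perturb_step == i:
            h += 1         # perturb the hidden state; trace untouched
        # narration is a report generated alongside, not the computation
        trace.append(f"so far {t}" if paraphrase else f"add {t}")
    return trace, h % 2    # answer (parity) read off the hidden state only

tokens = [3, 1, 4, 1, 5]
_, base = run(tokens)
_, rephrased = run(tokens, paraphrase=True)   # trace changed, h preserved
_, perturbed = run(tokens, perturb_step=2)    # trace preserved, h changed

print(base == rephrased)   # True: answer survives a rephrased trace
print(base == perturbed)   # False: answer breaks when h is perturbed
```

The toy deliberately builds in the decoupling H1 predicts; in a real model the trace also feeds back into the computation, which is what makes the test's asymmetry informative rather than trivial.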