Digital systems work inherently over discrete operations of ones and zeros, while analog systems work over continuous operations. Digital systems are strictly more accurate than their analog counterparts. What is it about discrete operations that leads to accurate predictions? The answer lies in the mechanism by which discreteness tackles error propagation. A system which produces continuous predictions will incur at least some error, however negligible, at the first step of prediction for any real-world system; only in theory is the error exactly zero. Over time, this small error can accumulate because the system lacks a way to keep its propagation in check. A discrete system, on the other hand, has to bin its predictions after every step — offsetting any error it has accumulated over the previous step.
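The contrast can be sketched in a few lines of code: a toy process in which a continuous system carries its per-step error forward unchecked, while a discrete system snaps each prediction back to a bin after every step. The grid spacing and noise scale below are arbitrary choices for illustration.

```python
import random

random.seed(0)

GRID = 0.1    # spacing of the discrete bins (arbitrary choice for illustration)
NOISE = 0.02  # per-step perturbation, kept smaller than half a bin
STEPS = 1000

def snap(x, grid=GRID):
    """Bin a continuous value to the nearest grid point."""
    return round(x / grid) * grid

true_value = 0.5
continuous = true_value
discrete = true_value

for _ in range(STEPS):
    eps = random.uniform(-NOISE, NOISE)
    continuous += eps                # error is passed forward as is
    discrete = snap(discrete + eps)  # error is offset at every step

print(f"continuous drift: {abs(continuous - true_value):.4f}")
print(f"discrete drift:   {abs(discrete - true_value):.4f}")
```

Because each perturbation is smaller than half a bin, the discrete system returns to the same grid point every step and its drift stays at zero, while the continuous system performs a random walk away from the true value.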

Discretization is a criterion enforced by humans, with drawbacks as well as benefits. A discrete system typically runs slower, propagates less information, and is inherently mismatched to the continuous systems it models. In those terms, it would always do worse than a perfect continuous system, because of the approximations it has to make at every timestep. And so it appears we are losing at the very first design choice — the choice of discreteness. However, in practice, we do not necessarily care about maximum possible accuracy. What we care about more is correctness. If we can find a discrete system which can approximate the continuous data we see to a reasonable degree, then that system leads to much better predictions than a continuous system. The drawbacks simply do not matter compared to the overwhelming advantage of correctness. Systems which correctly predict processes are radically more valuable than systems which do not — even if they are slower and require more power.

So where does this intuition fall apart (or does it) when considering AI models — specifically language models [1]? Every human language that we have discovered chunks sounds into discrete word-like concepts. Without this capability, sophisticated communication is apparently impossible¹. Within seconds of speaking with someone, we would be unable to understand what they are saying, since error would accumulate between the sounds we hear and our interpretation of what they could mean. But this situation does not happen in real life. That is because in real life we always have a list of words we can map each sound to, and hence can correctly infer it before moving on to the next set of sounds. This process happens at such a fast and unconscious pace that it is hard for us to recognize its existence. The mapping is done onto the set of words, which is discrete, and hence error accumulation does not occur.

A standard transformer model produces a discrete output for each token it processes. Pink squares denote where the discretization is enforced in the model.
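Where that discretization sits can be sketched in toy form: at each step a continuous vector of scores over the vocabulary is collapsed (here via argmax) to a single token id, and only that discrete id is fed back as the next input. The sizes and random weights below are placeholders, not a real model.

```python
import random

random.seed(0)

VOCAB = 1000  # toy vocabulary size (hypothetical)
D = 16        # toy hidden size (hypothetical)

# Random stand-ins for an embedding table and an LM head.
embedding = [[random.gauss(0, 1) for _ in range(D)] for _ in range(VOCAB)]
lm_head = [[random.gauss(0, 1) for _ in range(VOCAB)] for _ in range(D)]

def logits_from_hidden(hidden):
    """Stand-in for a transformer block + LM head: hidden state -> logits."""
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*lm_head)]

hidden = [random.gauss(0, 1) for _ in range(D)]
logits = logits_from_hidden(hidden)

# The discretization step: a continuous vector of scores collapses to one id.
next_id = max(range(VOCAB), key=lambda i: logits[i])

# Only that discrete id is fed back as input; the continuous hidden state
# (and whatever numerical error it carries) does not cross the step boundary.
next_input = embedding[next_id]
print(next_id, len(next_input))
```

The design point worth noticing is that the feedback path between steps carries an integer id, not a vector, which is exactly where the binning described above happens.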

Because language models model sequences created from a discrete set of words — predicting the next word from the current sequence of words — they are inherently able to prevent error accumulation to a great degree. Of course, I say this assuming that errors are not made in the sense of ‘putting a prediction in the wrong bin’, which could still lead to error propagation. But are they actually doing discrete computation all the way? Our initial intuition would be to say yes, since all computation eventually breaks down into binary manipulation, which is discrete. Should we then not care about discretization within the design of the model at all, given the eventual operations are discrete? Relatedly, is the supposed point at which discrete prediction seems to be happening — when predicting the next word — the right point for the kind of capability we want out of language models? Maybe we want the model to be designed in such a way that it parses entire sentences and then predicts a few tokens corresponding to the end of the final sentence. That seems like an arbitrary design choice. Why is it arbitrary? Simply because for most pieces of text, the end of the final sentence does not carry any more of a special meaning than the rest of the sentences. The overall meaning of a piece of text is usually spread throughout it rather than concentrated in any particular word.

When we move between the low-level discrete manipulations performed while a language model is trained and the high-level discrete predictions it makes at every token position, we are switching abstraction levels. Design choices such as discretization depend heavily on the abstraction level at which they are enforced, leading to different behaviors at different levels. Put another way, if we were to tokenize entire sentences rather than sub-words, plausibly different behavior would emerge even though the discretization would still occur at every token position.
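A minimal illustration of the same text discretized at two abstraction levels (word-like tokens versus whole sentences), using plain string splitting as a stand-in for a real tokenizer:

```python
text = "the cat sat. the cat slept. the dog sat."

# Word-level tokenization: a small, reusable vocabulary and long sequences.
word_tokens = text.replace(".", " .").split()

# Sentence-level tokenization: each full sentence becomes one discrete symbol.
sentence_tokens = [s.strip() + "." for s in text.split(".") if s.strip()]

# Same text, different discrete units and different vocab/sequence trade-offs.
print(len(set(word_tokens)), "word types over", len(word_tokens), "tokens")
print(len(set(sentence_tokens)), "sentence types over",
      len(sentence_tokens), "tokens")
```

With word-like units the same symbols recur across contexts; with sentence-level units almost every symbol is novel, which is one way the same discretization rule behaves differently at a different abstraction level.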

Error Propagation in Continuous vs Discrete Models

The propagation of error can be understood as a combination of two different phenomena — 1) how the input is modeled and 2) how the output is modeled. In a next-token prediction regime, phenomenon 1 dictates how big the one-step error is when choosing to model the input as discrete vs continuous. Phenomenon 2, on the other hand, dictates how this error is fed forward to the next computation step, i.e. whether it is offset at every step or passed on as is. We can think of a simple experiment to analyze the effect of these two phenomena. Notice what happens when the input is modeled continuously — a given permutation of input elements/tokens is never exactly seen again in the training dataset (or is seen very rarely) — making it harder for the model to understand the context within those permutations. We can simulate this setting within a discrete-input transformer model by keeping the contexts as is but randomly swapping the discrete tokens in the context with surrogate tokens (along with the target tokens they would predict). A surrogate token is a new token that is placed in the same context as the token it replaces. Here’s what we observe:

A piece of input with the original token ids. Surrogate ids are produced by adding new vocab elements. For example, if the vocab is 1000 and the number of surrogates for each token is 5, surrogate ids = original ids + original_vocab * randint(0, 5).
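The surrogate-swapping rule from the caption can be sketched as follows. I assume here that the surrogate index is drawn independently for each token, which is one plausible reading of the setup; index 0 leaves the original token in place.

```python
import random

random.seed(0)

ORIG_VOCAB = 1000  # original vocabulary size (from the example above)
N_SURROGATES = 5   # number of surrogates per token (from the example above)

def swap_with_surrogates(token_ids):
    """Replace each token id with one of its surrogate ids.

    surrogate id = original id + ORIG_VOCAB * k, with k in [0, N_SURROGATES).
    k = 0 keeps the original id; the effective vocabulary grows to
    ORIG_VOCAB * N_SURROGATES entries.
    """
    return [t + ORIG_VOCAB * random.randrange(N_SURROGATES) for t in token_ids]

context = [17, 42, 999, 3]
swapped = swap_with_surrogates(context)

# Every surrogate maps back to its original id modulo the original vocab,
# so the context structure is preserved while the surface ids change.
print(swapped)
```

Each surrogate occupies the same position, and predicts the same (surrogate-mapped) target, as the token it replaces — only the identity of the discrete symbol changes.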
