
Troubles with language models

Grossly simplifying, David Chapman's Better Without AI focuses on two kinds of machine learning models: (1) Recommender systems, and (2) "shiny" models. Among the "shiny" models are vision models like AlexNet, CLIP, DALL-E, VisualBERT and so on, as well as language models like BERT, T5, and GPT-(fill-in-the-blank). I've never worked on a recommender system or a vision model, but I have worked with language models, so I'm curious what he has to say about them. What troubles does he discuss that are relevant to language models?

First, there are the general problems with backpropagation and with the models that come out of it:

  1. Backprop is terribly inefficient. It needs an enormous amount of data to work.
  2. Getting backprop to work for a real-world problem is incredibly fussy and takes lots of expensive person-time and/or luck.
  3. When backprop does work, it's prone to failing/misbehaving unpredictably on inputs that are unlike the training data.
  4. Backprop models often seem better than they are because they're great at finding spurious correlations in the datasets/benchmarks they're applied to.
  5. What backprop models do well is interpolation in their latent space; they are terrible at extrapolation, and because they operate in latent space it is hard to tell which is which without going back to the training data.
  6. We don't know how to make backprop models degrade gracefully for out-of-distribution inputs, because it's not obvious how we would reliably detect out-of-distribution inputs in the first place. (A toy sketch of these last two points follows the list.)
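
To make (5) and (6) concrete, here is a toy sketch of my own rather than anything from the book, using scikit-learn's MLPRegressor as a stand-in for a backprop-trained model. Fit on inputs confined to [0, 1], it interpolates well inside that range, extrapolates badly outside it, and gives no signal about which regime a given input falls in.

```python
# Toy illustration of (5) and (6): good interpolation, bad extrapolation,
# and no built-in warning when an input is out of distribution.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=(2000, 1))   # training inputs confined to [0, 1]
y_train = np.sin(2 * np.pi * x_train).ravel()      # target function to learn

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
model.fit(x_train, y_train)

for x in [0.25, 0.75, 1.5, 3.0]:                   # first two in-range, last two out-of-range
    pred = model.predict([[x]])[0]
    print(f"x={x:4.2f}  predicted={pred:+.3f}  actual={np.sin(2 * np.pi * x):+.3f}")
# In-range predictions land near the truth; out-of-range ones can be far off,
# and the model reports both with exactly the same (lack of) hesitation.
```

The point is not that a tiny MLP is representative of a large language model, only that nothing in the model's output tells you when you have wandered off the training distribution.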

These points are well-known and not especially controversial. I regularly see talks and papers about (1). Anyone who's tried to train a backpropagation model has probably run into (2). Anyone who's tried to use a model has probably run into (3). A few benchmark creators, at least, are aware of (4) and try to mitigate it by various means, for example by creating true/false questions in pairs that are almost but not quite identical, with one true and the other false (a made-up illustration follows this paragraph). (5) and (6) are, in a sense, rephrasings and extensions of (3). I've seen and read too many papers trying and failing to solve (6). So far this all seems accurate to me.
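
For concreteness, here is the kind of pairing I mean; these items are invented for illustration and are not taken from any real benchmark.

```python
# Hypothetical minimal pairs: each pair differs in one detail, with one statement
# true and one false, so a model can't score well by keying on surface wording alone.
paired_items = [
    {"statement": "Water boils at 100 °C at sea-level atmospheric pressure.",   "label": True},
    {"statement": "Water boils at 100 °C at the summit of Mount Everest.",      "label": False},
    {"statement": "The Moon completes an orbit of the Earth in about a month.", "label": True},
    {"statement": "The Earth completes an orbit of the Moon in about a month.", "label": False},
]
```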

Then there are myths about backprop:

  1. There is a myth that we can't understand backprop models, so there's no need to try. This is false; researchers have worked on understanding a few specific models and come out the other side understanding them meaningfully better.
  2. There is a myth that neural networks are the "only and inevitable method for achieving advanced AI." Chapman notes that backpropagation has benefited a lot from a combination of more money, more compute, and greatly narrowing the space of possible models using specific theories of what connections should or shouldn't be zero (narrowing the "hypothesis"/model class). Throwing such vast sums of money at the topic could plausibly help alternative methods at least as much.

Chapman points out another problem with backprop models that follows from myth (1): we don't typically try to understand specific models (e.g. GPT-3 text-curie-001) well enough to say what they are doing at inference time. Because of this, we can't reliably predict for which inputs they will do well or poorly.

Instead, we can only point to the goodness/badness of a model's outputs on a benchmark task, which is not a reliable way to predict performance in other settings. We would like to be able to say that the benchmark data are pretty similar to what we expect to see in the real-world setting where we want to apply the backprop model. David writes: "We can do better than the current practice feeding in lots of poorly-characterized input data and measuring how frequently we get bad outputs." My reading: the input data we use for training and evaluation, including benchmark data, is typically not so well-characterized that we can expect measurements on it to generalize reliably to a real-world setting. We would have to know what the data is like, in some sense, as compared with the real-world setting's data, and we don't know that. We don't know what patterns or quirks are accidentally baked into the data, or what variations are absent that should be present. That means the benchmark numbers don't tell us as much as we might like.
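
As a toy illustration of that gap (entirely made-up distributions, nothing from the book or any real benchmark), here is a sketch in which a classifier scores well on held-out data drawn the same way as its training data and noticeably worse once the inputs shift in a way the benchmark never characterized.

```python
# Toy illustration: benchmark accuracy vs. accuracy after a distribution shift.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shift=0.0):
    """Two Gaussian classes in 2-D; `shift` moves all inputs away from the training distribution."""
    y = rng.integers(0, 2, size=n)
    centers = np.where(y == 1, 2.0, 0.0)            # class-dependent mean
    x = rng.normal(size=(n, 2)) + centers[:, None] + shift
    return x, y

x_train, y_train = make_data(5000)
x_bench, y_bench = make_data(1000)              # benchmark set: same recipe as training
x_real,  y_real  = make_data(1000, shift=1.0)   # "deployment" set: same labels, shifted inputs

clf = LogisticRegression().fit(x_train, y_train)
print("accuracy on benchmark-like data:", clf.score(x_bench, y_bench))
print("accuracy after the shift:       ", clf.score(x_real, y_real))
# The first number is the one we usually report; the second is closer to what
# deployment feels like, and nothing in the first number predicts the drop.
```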

I don't want to be too harsh about benchmarks. Making a good benchmark is hard. Some benchmark data is better characterized, or at least better documented in its origins, than other benchmark data, so credit where it's due. But I'd broadly agree that the relationship between benchmark performance and real-world performance is murky at best. It might help to characterize the data better. I think this claim is not too controversial either: benchmark performance doesn't generalize as reliably as one might like, or, in an engineering context, might need.

The unfinished essay "Are language models Scary?" should eventually give more detail on problems with language models, but since it isn't finished yet, I can't comment on it.

My impression is that the issues raised are real ones. The difference between Better Without AI and the gestalt of research and understandings I've encountered is a matter of emphasis more than substance. The issues mentioned are not new; rather, they are rarely taken so seriously. Taking them more seriously seems worthwhile.