I typed up a whole post about this today, but accidentally deleted it because it was all in one markdown box so I could have footnotes. My mistake. :(
Instead I'm gonna present this more sloppily – just numbers and links with less context:
- There are about 32000^512 valid* inputs to T5. That's roughly 4 times 10^2306. That's an enormous space. It's far, far more than the number of atoms in the observable universe. It's more than a googol raised to the 23rd power.
- Related: Monkeys on typewriters meme theorem. The library of Babel. The library's website / implementation.
- Probably also on the same page: infinite coin flips. Betting everything at every opportunity forever may maximize expected total cash, but the probability that you actually end up holding that cash goes to zero. In a similar way, the chance of getting the complete text of Shakespeare's Hamlet out of a monkey on a typewriter is, if not identically zero, then so small that it is effectively impossible.
- BERT & family eat word salad. Language models behave poorly on junk inputs, and the usual regularization (dropout and weight decay) doesn't seem to help. https://arxiv.org/abs/1910.10683
- Probably a combination of model uncertainty and lack of optimization pressure.
- By model uncertainty I mean there are loads of models that fit your train data about equally well.
- When the space is this huge, how could you ever bring sufficient optimization pressure to bear while still actually doing a language model's job (masked or causal language modeling, MLM or CLM, that is)?
- Crazy thought: maybe it is reasonable to think of your point-estimated model weights as samples of random variables. Lots of hyperparameters, including the random seed, influence this, and slightly different hyperparameters will give different models, if only slightly. So maybe not all that crazy. Then the question is: why would you expect this random sample to behave well on everything else, and how could you force it to? Model uncertainty again: lots of options, so where is the pressure for this option to behave well?
- You probably have to do something different to make language models do "the right thing..." but what?
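To make the coin-flip point a bit more concrete, here is a minimal sketch of the all-in betting arithmetic. The specifics are my assumptions, not anything from the links: a double-or-nothing bet, a 60% chance of winning each round, and the hypothetical helper name `all_in_stats`.

```python
from fractions import Fraction


def all_in_stats(p_win, rounds, start=1):
    """Expected wealth and survival probability when betting everything each round.

    Assumes a double-or-nothing bet: a win doubles the stake, a loss wipes it out.
    """
    survive = p_win ** rounds        # you must win every single round to have anything
    payoff = start * 2 ** rounds     # wealth if you are still alive after all rounds
    expected = survive * payoff      # every other outcome contributes exactly 0
    return expected, survive


# Assumed win probability of 3/5, chosen only for illustration.
for n in (1, 10, 100):
    expected, survive = all_in_stats(Fraction(3, 5), n)
    print(f"rounds={n:>3}  expected wealth={float(expected):.3g}  "
          f"P(not broke)={float(survive):.3g}")
```

Under these assumptions the expected wealth keeps growing with the number of rounds while the probability of not being broke shrinks toward zero, which is the same "huge in expectation, basically never in practice" shape as the monkey typing Hamlet.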
*By valid I mean sequences of 512 tokens. In theory T5 can take longer sequences, but it was pretrained with a sequence length of 512 and I've got to pick some specific length to make this point, so 512 is it. Also, it's not exactly 32000: the vocabulary is 32100 tokens, at least in the Hugging Face implementation, and 103 of those are special tokens, so only 32100 - 103 = 31997 real tokens. But the relative error in the calculated number of sequences works out to be small, so I'm not too worried. It's still about 4 times that same absurd power of 10.
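If you want to check the counting yourself, here is a rough sketch in Python. It works in log space so the numbers stay readable; `count_sequences` is just a name I made up, and the vocabulary sizes and the length of 512 are the ones from the footnote.

```python
import math


def count_sequences(vocab_size, seq_len):
    """Return (mantissa, exponent) with vocab_size**seq_len = mantissa * 10**exponent."""
    log10_total = seq_len * math.log10(vocab_size)
    exponent = math.floor(log10_total)
    mantissa = 10 ** (log10_total - exponent)
    return mantissa, exponent


# Round figure from the post, then the footnote's count of non-special tokens.
for vocab in (32000, 31997):
    mantissa, exponent = count_sequences(vocab, 512)
    print(f"{vocab}^512 is about {mantissa:.2f} x 10^{exponent}")

# How much smaller the 31997-token count is, as a ratio of the two totals.
ratio = (31997 / 32000) ** 512
print(f"31997^512 / 32000^512 = {ratio:.4f}")

# Python integers are arbitrary precision, so the exact digit count is available too.
print("digits in 32000^512:", len(str(32000 ** 512)))
```

Both vocabulary sizes land on the same power of 10 with a leading digit of 4, and the ratio shows the two counts differ by only a few percent.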