A random variable is not a number. Instead, it's an abstract outcome. We can ground it, when it can be grounded at all, in the outcome of a concrete process.
For example, say we have a question-answering dataset and a script to run GPT-3 on it. The script runs GPT-3 DaVinci (text-davinci-002) on the question-answering dataset using greedy decoding, cleans up the answers, and calculates an accuracy number. Ignoring what this script actually outputs (it should, if the API works correctly and the postprocessing is deterministic, output a single number that is consistent between runs), we can model the accuracy as a random variable.
The tricky part here is that modeling the final accuracy alone as a random variable is useless. We need a slightly more complex statistical model for this to be interesting. One step up is to model the accuracy as an average over a set of randomly sampled questions, each paired with a question-level correctness that is itself random but depends on the question. On its own this amounts to the same thing unless we make further assumptions about the random variables. For example, we might assume that the question samples are independent and identically distributed. That assumption would let us generalize to further samples of questions generated by similar-enough sampling processes.
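As a minimal sketch of what the IID assumption buys, the snippet below treats each question-level correctness as an independent Bernoulli draw and computes the accuracy with a normal-approximation interval. The function name `accuracy_with_iid_interval` and the toy data are illustrative and not part of any real evaluation script; the 1.96 multiplier assumes a 95% interval.

```python
import math

def accuracy_with_iid_interval(correct, z=1.96):
    """Accuracy and a normal-approximation interval, ASSUMING IID questions.

    `correct` is a list of 0/1 question-level correctness values.
    """
    n = len(correct)
    accuracy = sum(correct) / n
    # Standard error of a Bernoulli mean; valid only under the IID assumption.
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, (accuracy - z * standard_error, accuracy + z * standard_error)

# Toy data (made up): 80 correct answers out of 100 questions.
toy_correct = [1] * 80 + [0] * 20
toy_accuracy, toy_interval = accuracy_with_iid_interval(toy_correct)
```

The interval is only as trustworthy as the IID assumption behind the standard error formula.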
It's important, however, that the statistical model actually grounds to the underlying process. We cannot ground it by fiat. In this case the independent and identically distributed model need not ground to the underlying process. It is common practice to collect datasets through Mechanical Turk or other crowdsourcing tasks. First you create a public "qualification task." You then grade the results and "hire," for example, as many as are interested from the 50 best performers on this "qualification task." You then assign the "real" data collection task to these best performers. Each worker writes, say, 100 questions. Questions written by the same worker plausibly resemble each other more than questions from different workers, which means the samples are no longer independent or identically distributed.
I think we "fix" that by taking the end-user perspective and saying: Well, the numbers "become" IID after we randomly permute the data and forget about the authorship information. And that might be fine if what you want is to generalize to another sample of 100 questions from the same authors. But if we want to estimate the error for a different ~50 author sample of 100 questions, or worse, a sample of 5,000 unique-author questions, I suspect we might have a problem. I suspect the number we come up with will overestimate performance.
If we want to fix this, I think we have to either (1) only ask for one question per worker (expensive, given the need for qualification), or (2) push it back into the statistical model we're using. Hierarchical models, also known as mixed models, seem like one possible way to fix the model. There might be others.
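One way to see what pushing authorship back into the model changes is a small simulation. The sketch below is not a full hierarchical model; it swaps in a simpler cluster-level summary that treats each author's mean accuracy as one observation. All numbers are assumptions for illustration: 50 authors, 100 questions each, and a hypothetical per-author accuracy drawn uniformly from 0.4 to 0.9.

```python
import math
import random
import statistics

def simulate_author_means(num_authors=50, questions_per_author=100, rng=None):
    """Return per-author accuracies under a made-up two-level process."""
    rng = rng or random.Random(0)
    author_means = []
    for _ in range(num_authors):
        # Hypothetical author effect: some authors write harder questions.
        p_author = rng.uniform(0.4, 0.9)
        correct = [rng.random() < p_author for _ in range(questions_per_author)]
        author_means.append(sum(correct) / questions_per_author)
    return author_means

def naive_se(author_means, questions_per_author):
    """Standard error that pretends all questions are IID."""
    n = len(author_means) * questions_per_author
    accuracy = statistics.mean(author_means)  # valid because clusters are equal-sized
    return math.sqrt(accuracy * (1 - accuracy) / n)

def cluster_se(author_means):
    """Standard error that treats each author as one independent unit."""
    return statistics.stdev(author_means) / math.sqrt(len(author_means))

means = simulate_author_means(rng=random.Random(0))
se_iid = naive_se(means, questions_per_author=100)
se_cluster = cluster_se(means)
```

Under these assumed settings the author-level standard error should come out larger than the IID one, because variation between authors is not averaged away by adding more questions per author.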
What happens when there's a lack of grounding like this? Do you go to statistics hell? I think what happens with frequentist statistics is that your confidence intervals end up too narrow. That is, when you estimate your model's uncertainty (its standard error, say) you will get a smaller number than you should. If you keep making this kind of mistake, you will make more errors than the error bounds you choose would suggest. You will reject true nulls more than 5 times in 100 at a threshold of p<0.05. You might care about that.
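The claim about rejecting true nulls too often can be checked by simulation. The sketch below uses a made-up process (authors whose hypothetical accuracies are uniform on 0.4 to 0.9, so the true mean accuracy is 0.65), tests the true null with an IID z-test, and counts how often it is rejected. The function name, replication count, and seed are arbitrary choices.

```python
import math
import random

TRUE_MEAN = 0.65  # mean of Uniform(0.4, 0.9), the assumed author distribution

def naive_test_rejects(rng, num_authors=50, questions_per_author=100):
    """Simulate one dataset and run an IID z-test of the (true) null."""
    num_correct = 0
    for _ in range(num_authors):
        p_author = rng.uniform(0.4, 0.9)  # hypothetical author effect
        num_correct += sum(rng.random() < p_author
                           for _ in range(questions_per_author))
    n = num_authors * questions_per_author
    accuracy = num_correct / n
    se = math.sqrt(accuracy * (1 - accuracy) / n)  # ignores authorship
    return abs(accuracy - TRUE_MEAN) / se > 1.96   # two-sided test at p < 0.05

rng = random.Random(0)
num_replications = 200
rejection_rate = sum(
    naive_test_rejects(rng) for _ in range(num_replications)
) / num_replications
```

If authorship did not matter, the rate would sit near 0.05; with this assumed amount of between-author variation it should land well above that.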