
Statistical independence is weird to think about

Random variables are one of the basic concepts of statistical theory. They're everywhere. You can't escape them. They're the core of how we think about "sampling."

A common assumption is that we have a sequence or family of independent and identically distributed random variables. I'm mostly going to focus on independence in this post. If you have random variables \(X\) and \(Y\), they are defined to be independent if, roughly speaking, knowing the value of \(X\) doesn't tell you anything about the value of \(Y\).
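More precisely, the standard definition is that the joint distribution factorizes: for any (measurable) sets \(A\) and \(B\),

\[
P(X \in A,\, Y \in B) = P(X \in A)\, P(Y \in B),
\]

which, whenever \(P(X \in A) > 0\), is the same as saying \(P(Y \in B \mid X \in A) = P(Y \in B)\): learning where \(X\) landed doesn't change the distribution of \(Y\).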

In theory this is all fine. In practice it gets sort of confusing. Say you run a survey on college students' Internet use habits, and offer respondents a fixed 0.1% chance to win a $20 Amazon gift card, limit 1,000 respondents. You send out your survey and hope for the best.

As part of this process, students take your survey and act on it. Jane Smith takes your survey, then gets her good friends Bob Jones and Amanda Blanda to take the survey, too. Since the odds of winning are fixed per respondent, Jane loses nothing by recruiting her friends, so this scenario seems likely, and you as the surveyor may never know it happened. The trouble is that Jane and her friends probably have similar Internet habits, so their responses are correlated rather than independent.

In practice, how likely is this scenario? Not very. I'd instead expect the surveyors to award a single gift card at random to one of the respondents, which gives respondents a selfish economic incentive not to bring in many more people. I think that's the more common design, and that disincentive probably pushes hard enough against any urge to recruit friends.

For a slightly more realistic scenario, say you are a natural language processing researcher. You collect a dataset in two phases. First, you ask 100 Mechanical Turk workers to write natural language open-ended questions on common sense topics. They produce a bunch of questions for you. You and your colleagues annotate the questions for "quality" (is it in English? grammatical English? is it on topic? is it a multiple-choice question? is it open-ended?) and pick the 20 workers who produced the best questions on average. In the second phase, you then ask each of the 20 chosen workers to write you 50 more questions, for a total of 1,000. You then finetune or prompt some deep neural models, say BOBERTix and GPT-2000, on 500 of these questions (the train set) and calculate each model's accuracy on the remaining 500 (the test set).
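As a concrete sketch of the selection step (with made-up worker IDs and quality scores, and a plain average as the aggregation rule, neither of which is specified above), the first phase might boil down to something like this:

```python
from collections import defaultdict

# Hypothetical phase-one data: (worker_id, quality_score) per question,
# where quality_score averages the annotators' judgments (English? on
# topic? open-ended? etc.) on some numeric scale.
phase_one = [
    ("worker_007", 0.9),
    ("worker_013", 0.4),
    ("worker_007", 0.8),
    # ... one entry per collected question
]

# Average quality per worker.
scores = defaultdict(list)
for worker_id, quality in phase_one:
    scores[worker_id].append(quality)
mean_quality = {w: sum(qs) / len(qs) for w, qs in scores.items()}

# Invite the 20 workers with the best average quality to phase two.
chosen_workers = sorted(mean_quality, key=mean_quality.get, reverse=True)[:20]
```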

This test accuracy is a statistic; in fact, we can think of it as a mean. For each question, assign it the number 1 if BOBERTix answered it correctly and 0 otherwise. The accuracy is just the mean of those numbers.
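In symbols, if \(s_i \in \{0, 1\}\) records whether the model got the \(i\)-th test question right, then

\[
\widehat{\mathrm{acc}} = \frac{1}{500} \sum_{i=1}^{500} s_i.
\]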

We can think of this setup statistically in terms of random variables. You have 500 random variables. Are these variables independent? It seems like in a sense they shouldn't be, because we know some of them are written by the same person. For any given question, there are (across both sets) 49 more by the same person. On the other hand, suppose we put the 1,000 questions in random order and take the last 500 to be our test set. Given that we have an equal number from each worker, it seems that we should be able to treat them as "random enough" in some sense.
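To make the options concrete, here is a small sketch with hypothetical worker IDs. The first split shuffles all 1,000 questions, so each worker ends up with roughly 25 questions on each side; the second holds out entire workers, which is the split you'd want if across-author generalization is the question you care about.

```python
import random

random.seed(0)

# Hypothetical dataset: 20 workers x 50 questions each, tagged by author.
workers = [f"worker_{i:02d}" for i in range(20)]
questions = [(w, f"question {j} by {w}") for w in workers for j in range(50)]

# Strategy 1: shuffle all 1,000 questions and take the last 500 as test.
# Every worker contributes roughly 25 questions to each side.
shuffled = questions[:]
random.shuffle(shuffled)
train_random, test_random = shuffled[:500], shuffled[500:]

# Strategy 2: split by author, so train and test share no workers.
random.shuffle(workers)
train_workers, test_workers = set(workers[:10]), set(workers[10:])
train_grouped = [q for q in questions if q[0] in train_workers]
test_grouped = [q for q in questions if q[0] in test_workers]
```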

Of course, if we are interested in generalizing across authors, we probably need to adjust for sample size somehow. Our sample is 1,000 questions, and that number should help in some sense, but the small number of authors means we probably want to treat it more like a sample of 20 for the purposes of across-author generalization.
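One rough way to see the "more like a sample of 20" point is to compare the naive standard error, which treats all 500 test answers as independent, against a standard error computed over the 20 per-worker mean accuracies. This is only a simulation sketch with made-up numbers (a per-worker difficulty effect I chose arbitrarily), not a substitute for proper clustered standard errors or a mixed model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up test set: 20 workers x 25 test questions each, with a per-worker
# "difficulty" effect so questions by the same worker are correlated.
worker_effect = rng.normal(0.0, 0.15, size=20)
worker_ids = np.repeat(np.arange(20), 25)
p_correct = np.clip(0.7 + worker_effect[worker_ids], 0, 1)
correct = rng.binomial(1, p_correct)          # 1 if the model got it right

# Naive standard error: pretend all 500 questions are independent.
naive_se = correct.std(ddof=1) / np.sqrt(len(correct))

# Worker-level standard error: average within workers first, then treat
# the 20 worker means as the sample.
worker_means = np.array([correct[worker_ids == w].mean() for w in range(20)])
worker_se = worker_means.std(ddof=1) / np.sqrt(len(worker_means))

print(f"naive SE:  {naive_se:.3f}")
print(f"worker SE: {worker_se:.3f}")   # typically noticeably larger
```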

I suspect you get other vaguely unsettling scenarios for independence any time you collect data in some "clustery" way. Experimental design probably helps.

I think this goes back to the idea that statistics is about counterfactuals. Counterfactuals are weird to think about, statistical ones no less so, and statistical theory is purely about counterfactuals. If I recall correctly, I first heard this not in my statistics classes or textbooks or amateur reading, but in a "philosophy of statistics/science" sort of book, Deborah Mayo's Statistical Inference as Severe Testing. Essentially, the problem is that statistical theory is about what might have happened, rather than what did happen. Here, "might" is defined in a purely mathematical sense and carries certain assumptions.

Those assumptions can be wrong. But how, and when? And what happens when they are?

There is the trivial scenario of "you included a datapoint (or several) twice when calculating the statistic." This is mostly what I found when I searched for answers online. It is true that this breaks independence, and it is definitely something to avoid, but it is a boring kind of mistake. Are there more interesting mistakes to make around independence?