"Populations" are nebulous

The basic statistics material I've encountered frequently refers to a notion of sample versus population statistics. The idea is that there is some "true" parameter, hypothetically defined as "like what we have for this sample, but if you could do that for the entire population." Under this theory, "doing statistics" means taking a random sample from this "entire population," calculating the statistic, and then drawing some conclusions about what the "population parameter," or "true" parameter, must be. Unfortunately, in many contexts it's not clear that this makes any sense.

There are a few reasons for this. There's the issue of (informal) stationarity, or as I think of it: is the "true" parameter changing over time? (That doesn't mean it makes more sense to model it as a function of continuous or discrete time, either! Nebulous numbers are nebulous.) I've written about this before; it's probably a little different from formal stationarity, which I'm not intimately familiar with. There's also the question of independence, which I'm sure I'll write up Real Soon Now. Today, though, I'm thinking about the idea of a population being incoherent.

There are two problems with the idea of the population. The first is that it frequently doesn't make sense even in principle: it's unclear what the population could possibly be. The second is that even when the idea seems reasonable, framing things that way skips a step.

What could your population possibly be?

Frequently it doesn't even make sense in principle to say that we are randomly sampling from some population. My first thought is natural language processing data. My second thought is open-sampling Internet survey data.

Sometimes in natural language processing, researchers ask Surge.AI or Mechanical Turk workers to write question-answer pairs or other natural language data for a task or benchmark. Sometimes they write the data themselves. Sometimes they use some other source. Sometimes with Mechanical Turk there is a qualifying test workers have to pass. If we were randomly sampling from a "population," what on earth would it be? Is it "things you can get qualifying workers to write"? Fair enough, but that is a pretty vague thing, and clearly not representative of arbitrary on-topic examples.

NLP data generated in this way is funny to think about. There is author-by-author variation in example-writing habits, which matters since authors frequently write multiple examples. But there's also the bias introduced by the instructions, similar to what Yarkoni highlighted in the Generalizability Crisis paper. And there are probably other things we might want to model as "random effects" under that framing.

The other example is Internet surveys. I'm thinking mainly of the Astral Codex Ten survey (formerly Slate Star Codex...), a survey run by the blogger Scott Alexander to gather informal research data about the blog's readership. The link is posted publicly on the blog, which means in theory anyone can click on it and fill it out. It is often a long survey, thirty minutes or more, so presumably most participants are actual readers rather than random trolls. So representativeness isn't the issue in practice, although it is in theory. The issue is that there is no obvious, objective list of "who counts" as a reader, from which you could, say, estimate nonresponse bias. Scott could informally check the optional email field of the responses against a list of frequent commenters, followers, or subscribers if he wanted such an estimate. But nobody's obligated to fill in that field, so some commenters, followers, or subscribers won't, and the estimated response rate is biased (slightly?) downward. Generally, though, the population of (potential) respondents is vague.
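To make that check concrete, here is a minimal sketch of the email-matching idea. Everything in it (the reader list, the addresses) is made up for illustration, not anything Scott actually does:

```python
# A made-up sketch of the informal check described above. The reader list,
# the emails, and the matching rule are all hypothetical.

known_readers = {"alice@example.com", "bob@example.com", "carol@example.com"}

# Optional email field from survey responses; many are left blank.
survey_emails = {"alice@example.com", ""}

matched = known_readers & survey_emails
response_rate = len(matched) / len(known_readers)

# Biased downward: a known reader who responded but left the field blank
# is counted here as a non-respondent.
print(f"Estimated response rate among known readers: {response_rate:.0%}")
```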

Twitter polls, I think, have a similar problem. Readership is not the same as readership-with-Twitter-account. (I should know, because I read Twitter without having an account.) That's an issue if you want to make inferences about the general population of readers, though since you can't vote without an account, it doesn't affect the poll results directly. On the other hand, if I understand correctly, anyone can vote in a poll conducted by a public account, not just your followers, so the responses come from a population larger than your direct followers. We can probably ignore this if the poll hasn't been retweeted, quote-tweeted, embedded, or otherwise linked, and approximate it as "probably just responses from direct followers." It's also unclear to me whether poll creators can see who voted for which option, so it's not clear they could even calculate nonresponse bias against their followers list. Either way, the population of (potential) voters is, again, ill-defined.

So there are two examples here where the population is, at best, vague. It is not clear that there is a single objectively correct "population" or "true" parameter that we could say we are making inferences about in the way that frequentist statistics frames things.

Sometimes, though, we do have a reasonably well-defined population. It's still somewhat nebulous, because there's no other choice, but it's good enough for many purposes. (Which ones?) Even here, though, we have the difficulty that sampling never, in practice, means drawing people with equal probability from an urn containing every relevant person... but we'll ignore that for now.

Skipping a step

The other problem with the idea of "sampling from a population" is that it's skipping a step. It pretends that we know how to draw a proper random sample from our population.

Even where we have a reasonably well-defined "population," the framing of population versus sample implies that our data-generating process generates proper random samples from the population. This is not literally true, in the sense that if we are sampling people, we have to sample without replacement, which makes our random variables technically dependent. When the population is large enough (as it would be if we're sampling from all people), this doesn't matter too much: sampling n of N units without replacement only shrinks the variance of a sample mean by the finite-population correction (N - n)/(N - 1), which is essentially 1 when n is a small fraction of N. But the framing is also not practically true, in that, for example, pollsters have to deal with nonresponse bias. That is not an insurmountable problem, but it does mean we are not in the realm of taking a proper random sample from a population. Clearly, something is missing from this picture.
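To put a number on "doesn't matter too much," here's a quick check of that correction factor for a sample of a thousand people:

```python
def fpc(N: int, n: int) -> float:
    """Finite-population correction to the variance of a sample mean
    when sampling n of N units without replacement."""
    return (N - n) / (N - 1)

n = 1_000
for N in (10_000, 1_000_000, 100_000_000):
    print(f"N = {N:>11,}: variance correction = {fpc(N, n):.6f}")
```

With a thousand people out of a hundred million, the correction is invisible; out of ten thousand, it already isn't.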

I claim we are skipping two steps when we pretend that what we're doing is a proper random sample. First, there is the work we may do to make our sample more closely approximate a random sample from the relevant population. That might look like pollsters calling their sampled phone numbers ten or more times to drive down nonresponse bias, as described in Naked Statistics. Second, there is the step where we generalize from our data, and the process that generated it, to some other situation and generating process. It is true that we can sometimes draw generalized conclusions from imperfectly random samples with acceptable error rates, but doing so is an extra step.
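Callbacks are one version of that first step. Another standard one, which is my own example rather than anything from the book, is post-stratification weighting: reweight respondents so the sample's demographic mix matches known population shares. A minimal sketch, with made-up numbers:

```python
# Post-stratification weighting: a standard survey correction (my example,
# not from Naked Statistics). The shares and counts below are made up.

population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}  # e.g. from a census
sample_counts = {"18-34": 150, "35-54": 200, "55+": 650}        # our skewed sample

n = sum(sample_counts.values())
weights = {
    group: population_share[group] / (count / n)
    for group, count in sample_counts.items()
}

# Respondents in under-represented groups count for more than one person.
for group, weight in weights.items():
    print(f"{group}: weight {weight:.2f}")
```

Of course, this only corrects along the dimensions you thought to measure, which is part of why the second, generalizing step is still its own step.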

The preface to the book All of Statistics (yes, ambitiously titled) describes observed data as instead coming from a data-generating process. I think this is a better metaphor — if treated carefully, so that it doesn't collapse down to mean the same thing as "population."

(All of Statistics even has a diagram for this: a circle for the process and a circle for the observed data, with arrows going both directions between the two. It labels the arrow from the process to the observed data "Probability." That label is wrong, because probability is not what actually generates the data we observe. Probability is a model we apply on top of the data-generating process and the data; it cannot be said to be generating the data in any meaningful sense, unless we are literally using simulated draws from a formal probabilistic model, i.e. simulated data. And yes, I am picking on this diagram.)
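For contrast, here is the one case that parenthetical concedes, where "Probability" really does generate the data: draws from a formal model we wrote down ourselves (a toy Gaussian, my choice):

```python
import random

random.seed(0)  # reproducible draws

# A formal probabilistic model we wrote down ourselves: Normal(0, 1).
# Only in this simulated setting does "Probability" generate the data.
simulated_data = [random.gauss(0.0, 1.0) for _ in range(5)]
print(simulated_data)
```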

On the other hand, maybe the idea of "population" is salvageable, if reframed. A better frame might be: we do not get to choose the population we are sampling from; we only get to control our data-generating process. We should compare that process with the kinds of generalizations we want to make and the activities we're using them for, and see how we can make the generating process more consistent with those generalizations and activities. It's hard, maybe impossible, to write down a general procedure for this. It means considering situationally relevant factors, like "maybe the people who don't answer our first polling phone call are nonrandomly different from the people who do." It's not optional, though, if we want our statistical reasoning to make sense.

I'm still not entirely satisfied with this. Frequently the way we discuss a population implies there is an objective definition: a list of data that, if you could collect them, would answer the question completely and finally. That can't be true, because populations are nebulous. So that aspect remains unsatisfying.

But: eh. This frame gets us part of the way there, I guess.

(I haven't written this up for, e.g., official statistics; think census data. I don't think anyone else has, either. It sounds like a lot of work to do in all the necessary detail: it's got to be a complicated process, probably involving a lot of unwritten institutional knowledge plus formally rational institutional handbooks that may or may not be followed exactly. But for now, consider: every day, some Americans are born. Every day, some Americans die. Some will be out of the country on vacation or business trips. Some will be abroad on more-or-less temporary work arrangements. Some foreigners will be naturalized as citizens. Some people will be residents without being citizens. So while there is a relatively well-defined list of people who live here, and even of citizens who live here, it is not absolutely well-defined. There are plenty of contingencies that make this a question with no perfect, objectively-defined, purpose-free answer. Fortunately, there seem to be true-enough-for-many-purposes answers, which seem to be roughly the kind we get; but the true-enough kind and the perfect kind are not quite the same!)

Is this the same idea as sampling bias? Sort of! They rhyme. But sampling bias is often framed in terms of what is bad and what you shouldn't do. I am more interested in what is good and what we are actually doing when we do statistics. I claim this is progress toward a better description.