We had a regular rival in the large school across the freeway. I once tried to convince one of their students that our school was much better than theirs because my much smaller graduating class had produced SIX National Merit Scholars while theirs had produced only one. Clearly our small school was superior.
Not necessarily, says Tabarrok:
The problem is that because small schools don't have a lot of students, scores are much more variable. If for random reasons a few geniuses happen to enroll one year in a small school scores jump up and if a few extra dullards enroll the next year scores fall.
Thus, for purely random reasons we would expect small schools to be among the best performing schools in any given year. Of course we would also expect small schools to be among the worst performing schools in any given year! And in fact, once we look at all the data this is exactly what we see.
Reversion to the mean. It bites in music and development as well, as Easterly and I have recently discussed.
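To see Tabarrok's point in action, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: 200 small schools of 50 students and 200 large schools of 1,000, with every student's score drawn from the same distribution (mean 100, standard deviation 15). No school is actually better than any other, so any pattern in the rankings is pure chance.

```python
import random
import statistics

def simulate_school_means(n_schools, class_size, rng, mean=100.0, sd=15.0):
    """Return one average score per school; all students share one distribution."""
    return [
        statistics.fmean([rng.gauss(mean, sd) for _ in range(class_size)])
        for _ in range(n_schools)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical sizes: 50 students per small school, 1,000 per large school.
small = [("small", m) for m in simulate_school_means(200, 50, rng)]
large = [("large", m) for m in simulate_school_means(200, 1000, rng)]

# Rank all 400 schools by average score, best first.
ranked = sorted(small + large, key=lambda school: school[1], reverse=True)
top, bottom = ranked[:20], ranked[-20:]

small_in_top = sum(1 for kind, _ in top if kind == "small")
small_in_bottom = sum(1 for kind, _ in bottom if kind == "small")

print(f"small schools among the 20 best:  {small_in_top}")
print(f"small schools among the 20 worst: {small_in_bottom}")
```

Under these assumptions the small schools should crowd both ends of the ranking, even though by construction they are no better and no worse than the large ones.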
The first answer many people will reach for is that we need longitudinal/panel data rather than cross-sectional data to solve this. But panel data is just as prone to this bias. The issue is what you're clustering. In Tabarrok's example, we are clustering students within a school. Panel data adds another dimension, but that by itself does not remove the clustering, and the new dimension gives us another way to cause the same problem! If we have thirty years' data for one school and five years' for another, randomness will affect the results in exactly the same way: the school with the shorter record will have the noisier average. In development, if we have 70 years' data for one country and only 10 for another, the same problem arises.
A second answer is that we need to grab RCTs. This again is not an automatic answer. Suppose you roll out a new program and test it in a few villages first while collecting national data. The villages are selected randomly, and let's assume that everything runs perfectly with the study and there is absolutely no cause for concern about cross-contamination or behavioral effects or anything. The results from villages with smaller populations will depend more on randomness than the results from larger villages, and the results from a few villages will vary more than the results from many villages. You can't completely rule out that it was luck that helped those particular villages do better than the national average, so you need to adjust your standard errors to reflect the differences in cluster size.
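A rough sketch of why village size matters, under simplifying assumptions: individual outcomes are independent, the standard deviation is known, and the numbers are invented for illustration (a national average of 50, a standard deviation of 10, and a village average of 53). This is the textbook standard error of a mean, not a full clustered-standard-error correction, and the function names are my own.

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

def z_score(village_mean, national_mean, sd, n):
    """How many standard errors the village average sits from the national one."""
    return (village_mean - national_mean) / standard_error(sd, n)

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
national_mean, sd, village_mean = 50.0, 10.0, 53.0

for n in (25, 100, 400):
    se = standard_error(sd, n)
    z = z_score(village_mean, national_mean, sd, n)
    print(f"village of {n:>3}: standard error = {se:.2f}, z = {z:.1f}")
```

The same three-point gap is 1.5 standard errors in a village of 25, easily luck, but 6 standard errors in a village of 400. Quadrupling the village size halves the standard error.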
The right answer is that we need to account for the effects of sample size in generating our statistics. If we forget to do that, we can come to incorrect conclusions.
* - This is how you know economists are bad at telling jokes. Anyone else would have typed "It was SO small ..." and ended with something mildly witty, like "the valedictorian and the cheerleader captain were the same person" or "I would tell my teachers I couldn't finish my paper because someone had already checked out the book from the library." But the economist types "sufficiently small." Pfft.