Bias (statistics): Difference between revisions

Revision of 16:16, 13 December 2003 by Michael Hardy (minor edit: punctuation), compared with the revision of 16:15, 13 December 2003 by the same editor (minor edit: punctuation).

Revision as of 16:16, 13 December 2003

In statistics, the word bias has at least two different senses, one referring to something considered very bad, the other referring to something that is occasionally desirable. Both mean that an estimator for some reason over- or underestimates on average what is being measured.

One meaning is involved in what is called a biased sample: if some elements are more likely to be chosen in the sample than others, and the elements more likely to be chosen tend to have higher or lower values of the quantity being estimated, then the estimate will tend to be higher or lower than the true value.

A famous case of what can go wrong when using a biased sample is found in the 1936 US presidential election polls. The Literary Digest held a poll that forecast that Alfred M. Landon would defeat Franklin Delano Roosevelt by 57% to 43%. George Gallup, using a much smaller sample (300,000 rather than 2,000,000), predicted Roosevelt would win, and he was right. What went wrong with the Literary Digest poll? It had drawn its sample from lists of telephone and automobile owners. In those days, these were luxuries, so the sample consisted mainly of middle- and upper-class citizens. A majority of these voters favored Landon, whereas the lower classes favored Roosevelt. Because the sample was biased towards wealthier citizens, the result was incorrect.

This kind of bias is usually regarded as a worse problem than statistical noise: problems with statistical noise can be lessened by enlarging the sample, but the error caused by a biased sample does not shrink as the sample grows. In particular, a meta-analysis can distill good data from studies that individually suffer from statistical noise, but a meta-analysis of biased studies will itself be biased.
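The contrast between noise and bias can be illustrated with a small simulation. What follows is a minimal sketch, not a reconstruction of the 1936 polls: the function names and every number in it (a true support of 60%, and chances of 30% and 70% that a supporter or an opponent appears on the sampling list) are hypothetical values chosen only for illustration.

```python
import random


def simulate_poll(n, p_true=0.6, reach_supporter=0.3, reach_opponent=0.7, seed=0):
    """Estimate a candidate's support from n people drawn through a sampling list.

    All parameters are hypothetical: p_true is the true share of supporters,
    and reach_supporter / reach_opponent are the chances that a supporter or
    an opponent is on the list and can therefore be sampled at all.
    """
    rng = random.Random(seed)
    sampled = 0
    supporters = 0
    while sampled < n:
        is_supporter = rng.random() < p_true
        reach = reach_supporter if is_supporter else reach_opponent
        if rng.random() < reach:  # this person is on the list, so gets polled
            sampled += 1
            if is_supporter:
                supporters += 1
    return supporters / n


def expected_biased_share(p_true=0.6, reach_supporter=0.3, reach_opponent=0.7):
    """Share of supporters among the people who are on the list."""
    on_list_supporters = p_true * reach_supporter
    on_list_opponents = (1 - p_true) * reach_opponent
    return on_list_supporters / (on_list_supporters + on_list_opponents)


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        print(n, round(simulate_poll(n), 3))
```

Under these assumed numbers the estimate settles near 39% as the sample grows, far from the true 60%; with equal inclusion chances for both groups the same code settles near 60%. Enlarging the sample only reduces the scatter around the wrong value.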

Another kind of bias in statistics does not involve biased samples, but does involve the use of a statistic whose average value differs from the value of the quantity being estimated. For example, suppose X₁, ..., X_n are independent and identically distributed random variables, each with a normal distribution with expectation μ and variance σ². Let

{\overline {X}}=(X_{1}+\cdots +X_{n})/n

be the "sample average", and let

S^{2}={\frac {1}{n}}\sum _{i=1}^{n}(X_{i}-{\overline {X}}\,)^{2}

be a "sample variance". Then S² is a "biased estimator" of σ² because

E(S^{2})={\frac {n-1}{n}}\sigma ^{2}\neq \sigma ^{2}.
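The expectation can be checked directly. The following short derivation is a sketch that uses only the independence of the X_i, their common variance σ², and the fact that the sample average has variance σ²/n:

```latex
\sum_{i=1}^{n}(X_{i}-\overline{X}\,)^{2}
  = \sum_{i=1}^{n}(X_{i}-\mu)^{2} - n(\overline{X}-\mu)^{2},
\qquad
E\left[\sum_{i=1}^{n}(X_{i}-\mu)^{2}\right] = n\sigma^{2},
\qquad
E\left[n(\overline{X}-\mu)^{2}\right] = n\cdot\frac{\sigma^{2}}{n} = \sigma^{2},

E(S^{2}) = \frac{1}{n}\left(n\sigma^{2}-\sigma^{2}\right) = \frac{n-1}{n}\,\sigma^{2}.
```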

However, this biased estimator is, by the commonly used criterion of "mean squared error", actually slightly better than the unbiased estimator that results from putting n − 1 in the denominator where n appears above; the advantage shrinks as n grows.
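The comparison can be checked numerically. The sketch below is a Monte Carlo illustration under assumed settings (n = 5, σ = 1, 20,000 trials, a fixed seed); the function name and its parameters are hypothetical, and the simulated values only approximate the exact ones.

```python
import random


def compare_variance_estimators(n=5, sigma=1.0, trials=20000, seed=0):
    """Simulate both variance estimators on normal samples of size n.

    Returns the average value and the mean squared error of the estimator
    with divisor n (biased) and of the one with divisor n - 1 (unbiased).
    """
    rng = random.Random(seed)
    true_var = sigma ** 2
    biased, unbiased = [], []
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(xs) / n
        ss = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
        biased.append(ss / n)          # divisor n
        unbiased.append(ss / (n - 1))  # divisor n - 1

    def mse(estimates):
        return sum((e - true_var) ** 2 for e in estimates) / len(estimates)

    return {
        "mean_biased": sum(biased) / trials,
        "mean_unbiased": sum(unbiased) / trials,
        "mse_biased": mse(biased),
        "mse_unbiased": mse(unbiased),
    }


if __name__ == "__main__":
    for name, value in compare_variance_estimators().items():
        print(name, round(value, 3))
```

For normal data the exact mean squared errors are (2n − 1)σ⁴/n² for the divisor-n estimator and 2σ⁴/(n − 1) for the unbiased one, so with n = 5 and σ = 1 the simulated values should land near 0.36 and 0.5 respectively.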

A far more extreme case of a biased estimator being better than any unbiased estimator is well known: suppose X has a Poisson distribution with expectation λ. It is desired to estimate

P(X=0)^{2}=e^{-2\lambda }.

The only function of the data constituting an unbiased estimator is

\delta (X)=(-1)^{X}.
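Both the unbiasedness and the uniqueness follow from the power series of the exponential function. A brief sketch of the argument, writing δ(k) for the value of an arbitrary estimator at X = k:

```latex
E\left[(-1)^{X}\right]
  = \sum_{k=0}^{\infty}(-1)^{k}\,e^{-\lambda}\frac{\lambda^{k}}{k!}
  = e^{-\lambda}\sum_{k=0}^{\infty}\frac{(-\lambda)^{k}}{k!}
  = e^{-\lambda}e^{-\lambda}
  = e^{-2\lambda}.

E\left[\delta(X)\right] = e^{-2\lambda}\ \text{for all } \lambda>0
  \iff \sum_{k=0}^{\infty}\delta(k)\frac{\lambda^{k}}{k!}
       = e^{-\lambda}
       = \sum_{k=0}^{\infty}(-1)^{k}\frac{\lambda^{k}}{k!}
  \iff \delta(k) = (-1)^{k}\ \text{for all } k.
```

The last step matches the coefficients of the two power series in λ, which is why no other function of X can be unbiased.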

If the observed value of X is 100, then the estimate is 1, although the true value of the quantity being estimated is obviously very likely to be near 0, which is the opposite extreme. And if X is observed to be 101, then the estimate is even more absurd: it is -1, although the quantity being estimated obviously must be positive. The (biased) maximum-likelihood estimator

e^{-2X}

is much better than this unbiased estimator.
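One way to make "much better" concrete is to compare mean squared errors, which is an assumption here, since the text names that criterion only for the variance example. The sketch below computes both errors by summing over the Poisson probabilities; the function names and the truncation point kmax are hypothetical choices, adequate for moderate values of λ.

```python
import math


def poisson_mse(estimator, lam, kmax=200):
    """Mean squared error of estimator(X) for exp(-2*lam), X ~ Poisson(lam).

    The infinite sum over k is truncated at kmax (an assumption that is
    harmless when lam is much smaller than kmax).
    """
    target = math.exp(-2 * lam)
    pmf = math.exp(-lam)  # P(X = 0)
    total = 0.0
    for k in range(kmax + 1):
        total += pmf * (estimator(k) - target) ** 2
        pmf *= lam / (k + 1)  # P(X = k + 1) from P(X = k)
    return total


def unbiased_estimate(k):
    """The unbiased estimator (-1)^X."""
    return (-1) ** k


def mle_estimate(k):
    """The biased maximum-likelihood estimator e^(-2X)."""
    return math.exp(-2 * k)


if __name__ == "__main__":
    for lam in (0.5, 1.0, 2.0, 5.0):
        print(lam, poisson_mse(unbiased_estimate, lam), poisson_mse(mle_estimate, lam))
```

Because the unbiased estimator only takes the values 1 and −1, its mean squared error equals 1 − e^{−4λ}, which approaches 1 as λ grows, while the error of the maximum-likelihood estimator tends to 0.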