The extreme value fallacy

If you take a number of samples of a random variable and put them in order of magnitude, the extreme values are the largest and smallest. These extreme values exhibit special distributions of their own, which depend on the distribution of the original variate and the number of ranked samples from which they were drawn. The fallacy occurs when the extremes are treated as though they were single samples from the original distribution.

A very common example is the birth month fallacy, which recurs in the media several times a year. It usually takes the form of a headline such as "People born in July are more likely to get toe-nail cancer" or, more fancifully, "Cancerians are more likely to get toe-nail cancer". What the "researchers" have done is look at the statistics for each of the twelve months, pick out the biggest, and then marvel that it seems large compared with what you would expect for a random month. The expectation (or mean) of the largest value actually increases (logarithmically) with the number of samples from which it was drawn.
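The effect is easy to demonstrate by simulation. The sketch below assumes a purely hypothetical study: 1,200 patients assigned a birth month uniformly at random, so every month expects 100 cases. Picking out the biggest month anyway produces a figure well above 100, purely by chance.

```python
import random
import statistics

random.seed(42)

# Hypothetical study: n_patients assigned a birth month uniformly at
# random, so each of the 12 months expects n_patients / 12 cases.
def largest_month_count(n_patients=1200, n_months=12):
    counts = [0] * n_months
    for _ in range(n_patients):
        counts[random.randrange(n_months)] += 1
    return max(counts)  # the "headline" month

# Repeat the "study" many times: the largest month is consistently
# well above the expected 100 cases, with no real effect present.
maxima = [largest_month_count() for _ in range(2000)]
print(round(statistics.mean(maxima), 1))
```

Treating that largest month as if it were a single sample from the original distribution, and then marvelling at how big it is, is precisely the fallacy.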

For a fully worked example from a real media story see the case of anorexia.

There are, of course, cases where the statistics of extremes is paramount. The distribution of largest values applies in cases such as floods or peak annual temperatures, and of course to records of all forms; while the distribution of smallest values applies to strength of materials problems, where the principle of a chain being as strong as its weakest link dominates, or to such phenomena as droughts or the duration of human life. The rigorous theory has been fully worked out, starting in the 1920s with the great R A Fisher and refined in the 1950s by the likes of E J Gumbel, the leading authority on the subject. Despite this, engineers and scientists long continued to apply the normal distribution to phenomena to which it could not possibly apply, such as the breakdown of electrical insulation.

The mathematics for, say, calculating the expected value of the largest of a given number of samples from a known distribution is rather complicated, but thanks to a neat piece of mathematics by Gumbel involving L'Hôpital's rule, the most likely value (the mode) is easy to calculate. For n samples from a distribution F(x), the characteristic largest value is defined as the value of x at which F(x) = 1 - 1/n. It can be shown that the characteristic largest value is a good approximation to the mode of the distribution of the largest value. Since inverse distributions are available in such packages as MathCad®, this is easy to calculate, which is why the mode has been used in these pages to illustrate such phenomena as records.
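The calculation is a one-liner wherever an inverse CDF is available. A minimal sketch, using the standard normal distribution from the Python standard library purely as an illustrative choice of F(x):

```python
from statistics import NormalDist

# Characteristic largest value: solve F(x) = 1 - 1/n via the
# inverse CDF. Here F is a standard normal, but the same line
# works for any distribution with an available inverse CDF.
def characteristic_largest(n, dist=NormalDist()):
    return dist.inv_cdf(1 - 1 / n)

for n in (10, 100, 1000, 10000):
    print(n, round(characteristic_largest(n), 3))
```

Note how slowly the value grows as n is multiplied by ten each time, which is the logarithmic growth of the expected extreme mentioned above.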

Another important quantity is the return period, which is the average time interval between occurrences of a particular value, say the annual maximum flood. The return period for a distribution F(x), where the samples of x are regularly spaced in time, is simply 1/(1 - F(x)). One of the mainstays of myths such as global warming is holding up examples of extreme floods or heat waves, when these often have a return period of only about a century. This is a doubled-up extreme value fallacy, as it involves selecting the largest value not only in time but also in space (i.e. geographical location).