Smoothing of data

Data smoothing is one of those processes that is easy to implement with a glib formula, but has much more profound implications than most users realise. In the following we assume that we start off with a set of numbers, xk, that have resulted from sampling some process in the real world, such as temperature, and that the interval between samples is T.

Technically, data smoothing is a form of low pass filtering, which means that it blocks out the high frequency components (short wiggles) in order to emphasise the low frequency ones (longer trends). There are two popular forms: (a) the running mean (or moving average) and (b) the exponentially weighted average. They are both implemented by means of efficient recursive formulae.

In each case, from an original sequence of numbers, xk, a new smoothed sequence, yk, is formed. In (a), as each new number is added into the average, the one n before is dropped out of it, so each new number is the average of the last n old numbers. In (b) a fraction of the next number in the old sequence is added to the complementary fraction of the last number in the new sequence, which means that each number from the old sequence has less and less influence as it recedes into the past.
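The two recursions described above can be sketched in Python as follows. This is a reconstruction from the verbal description, not a copy of the original formulae: it assumes (a) takes the form y[k] = y[k-1] + (x[k] - x[k-n])/n and (b) takes the form y[k] = b*y[k-1] + (1-b)*x[k], with samples before the start of the sequence treated as zero. The function names are illustrative only.

```python
from collections import deque


def running_mean(x, n):
    """Recursive running mean, assumed form: y[k] = y[k-1] + (x[k] - x[k-n]) / n.

    Inputs before the start of the sequence are taken to be zero (an assumption).
    """
    window = deque([0.0] * n, maxlen=n)  # holds the last n input values
    y, prev = [], 0.0
    for xk in x:
        oldest = window[0]       # the value from n samples ago
        window.append(xk)        # the deque drops the oldest value automatically
        prev = prev + (xk - oldest) / n
        y.append(prev)
    return y


def exp_average(x, b, y0=0.0):
    """Exponentially weighted average, assumed form: y[k] = b*y[k-1] + (1-b)*x[k]."""
    y, prev = [], y0
    for xk in x:
        prev = b * prev + (1.0 - b) * xk
        y.append(prev)
    return y
```

Each output point costs a fixed handful of operations regardless of n. One caveat with the recursive running mean in floating point is that rounding errors accumulate in the running value over very long sequences, so an occasional full recalculation may be worthwhile.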

SPECIAL NOTE: It is NOT necessary to recalculate a complete average for each new point. It is surprising how often this is done. Even the Mathcad® statistical tutorial falls into this trap. In the running mean, smoothing a sequence of length L then results in (n-1)L unnecessary calculations, which can be a very large number with strong smoothing of long sequences, resulting in long calculation times.

Each of these methods has one parameter that must be chosen. The value of n determines how many numbers from the old sequence are averaged to produce each point in the new sequence. The value of b, the weight given to the previous smoothed value (between 0 and 1), determines the effective time constant of the filter (actually -T/ln(b)).
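The relation between b and the time constant can be worked in both directions. The sketch below assumes 0 < b < 1 and uses hypothetical helper names; the second function simply inverts the expression -T/ln(b) given above.

```python
import math


def time_constant(b, T):
    """Effective time constant of the exponential filter: -T / ln(b), for 0 < b < 1."""
    return -T / math.log(b)


def b_for_time_constant(tau, T):
    """Inverse relation: the value of b that gives a time constant tau."""
    return math.exp(-T / tau)
```

For example, with samples taken every T = 1 second, a time constant of 10 seconds corresponds to b = exp(-0.1), a little over 0.9.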

Complications

1. Transient response

It is one of the implications of the uncertainty principle that, when we take a finite block of a process to represent the whole of it, there are unavoidable errors. In this case they take the form of the transient response. You can demonstrate the transient response by putting the step function test sequence (xk = 1, 1, 1, 1, 1, ...) into each formula. As this sequence is already smooth, the ideal output should be the same as the input, but the running mean ramps up to the value 1 over n samples, while the second formula produces an exponential rise towards the value 1 and never quite gets there. Thus the running mean has the advantage that its transient response is finite in length, though the errors in the exponential weighting formula become negligible after a few time constants.
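The step test is easy to try numerically. The sketch below assumes the recursive forms y[k] = y[k-1] + (x[k] - x[k-n])/n and y[k] = b*y[k-1] + (1-b)*x[k], with zero history before the first sample; the choices n = 4 and b = 0.5 are arbitrary illustrative values.

```python
def running_mean(x, n):
    """Recursive running mean with zero history before the first sample."""
    y, prev = [], 0.0
    for k, xk in enumerate(x):
        dropped = x[k - n] if k >= n else 0.0  # value leaving the window
        prev += (xk - dropped) / n
        y.append(prev)
    return y


def exp_average(x, b):
    """Exponentially weighted average starting from zero."""
    y, prev = [], 0.0
    for xk in x:
        prev = b * prev + (1.0 - b) * xk
        y.append(prev)
    return y


step = [1.0] * 12
ramp = running_mean(step, 4)   # 0.25, 0.5, 0.75, 1.0, then 1.0 from there on
rise = exp_average(step, 0.5)  # 0.5, 0.75, 0.875, ... i.e. 1 - b**(k+1)
```

The running mean reaches exactly 1 at the nth output and stays there, whereas the exponential output at sample k falls short of 1 by b**(k+1), which shrinks geometrically but is never exactly zero.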

Various methods are used to overcome this problem in the running mean without discarding the first n output points. One is to taper the average, so that the first output point is an average of one, the second an average of two, and so on up to the nth point. This means that the beginning of the output sequence is relatively unsmoothed, which can be misleading. Another, slightly better, method is to precalculate the average of the input sequence and pack n numbers equal to this value into the front of the sequence. In either case, it is not desirable to make deductions from the first n smoothed points.
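Both start-up treatments can be sketched as below. The function names are hypothetical, and the code is one plausible reading of the description rather than a definitive implementation; both versions keep a running total so that no average is recalculated from scratch.

```python
def tapered_running_mean(x, n):
    """Average over however many points are available so far, up to n."""
    y, total = [], 0.0
    for k, xk in enumerate(x):
        total += xk
        if k >= n:
            total -= x[k - n]          # drop the value that left the window
        y.append(total / min(k + 1, n))
    return y


def padded_running_mean(x, n):
    """Pack n copies of the overall mean in front of the data, then smooth."""
    mean = sum(x) / len(x)             # needs the whole sequence in advance
    padded = [mean] * n + list(x)
    total = mean * n                   # sum over the initial all-padding window
    y = []
    for k in range(n, len(padded)):
        total += padded[k] - padded[k - n]
        y.append(total / n)
    return y
```

Note that the padded version needs the complete input before it can start, so it suits recorded data rather than real-time use.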

2. Frequency response

The frequency response of the running mean formula is actually rather complicated, taking the form of what is known as a sinc function. This goes through a number of zeroes and a number of maxima as the frequency increases. Here is the actual frequency response for n=5 and n=8:

We can see that some interfering frequencies can be completely eliminated; yet a higher frequency is only reduced by a factor of 5. Thus we have to be very careful about identifying apparent periodicities in data smoothed by this method.

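The gain of the exponentially weighted average falls smoothly with frequency, with no zeroes or side lobes, so it does not have these problems. (In sampled form it does not quite reach zero: at the highest representable frequency, 1/(2T), the gain is (1-b)/(1+b).)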
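The gains of the two filters can be evaluated directly from their definitions. The sketch below assumes the running mean weights the last n samples equally and that the exponential filter has the form y[k] = b*y[k-1] + (1-b)*x[k]; frequencies are in cycles per unit time, with T as the sample interval, and the function names are illustrative.

```python
import cmath
import math


def running_mean_gain(f, n, T=1.0):
    """Magnitude of the running mean's frequency response at frequency f."""
    z = cmath.exp(-2j * math.pi * f * T)          # one-sample delay at frequency f
    return abs(sum(z ** m for m in range(n)) / n)


def exp_average_gain(f, b, T=1.0):
    """Magnitude of the exponential filter's frequency response at frequency f."""
    z = cmath.exp(-2j * math.pi * f * T)
    return abs((1.0 - b) / (1.0 - b * z))
```

With n = 5, for instance, the running mean gain is zero at f = 0.2/T but rises again to about 0.25 near f = 0.3/T, which is the kind of side lobe that can let a spurious periodicity through. The exponential gain simply decreases as f goes from 0 to 1/(2T).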

3. Phase response

The running mean is what is known technically as a linear phase filter, which means that, though all the frequency components are treated with different gains, they are all delayed by the same length of time. The exponentially weighted average does not have this property, so there is an extra form of distortion of the shape of the sequence.
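The difference in phase behaviour can be checked by computing the delay each filter imposes on a sinusoid of a given frequency. The sketch below is an illustration under stated assumptions: frequencies are in cycles per sample (T = 1), the exponential filter has the form y[k] = b*y[k-1] + (1-b)*x[k], and the helper names are hypothetical.

```python
import cmath
import math


def running_mean_response(f, n):
    """Complex frequency response of an n-point running mean (f in cycles/sample)."""
    z = cmath.exp(-2j * math.pi * f)
    return sum(z ** m for m in range(n)) / n


def exp_average_response(f, b):
    """Complex frequency response of the exponentially weighted average."""
    z = cmath.exp(-2j * math.pi * f)
    return (1.0 - b) / (1.0 - b * z)


def phase_delay(h, f):
    """Delay in samples implied by the phase of response h at frequency f."""
    return -cmath.phase(h) / (2.0 * math.pi * f)
```

For n = 5 the running mean delays components at f = 0.05 and f = 0.1 by the same 2 samples, that is (n-1)/2. For b = 0.8 the exponential filter delays the component at f = 0.01 by nearly 4 samples but the one at f = 0.1 by only about 1.5, so different components of the sequence are shifted relative to one another.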

Discussion

Data smoothing is a very useful technique for emphasising apparent slow trends in sequences of data. We have to be very careful, however, not to push it too far, especially in trying to identify periodicities in the data. We must also avoid giving too much credence to variations at the beginning (or the end!) of the smoothed sequence. Given these provisos, both the exponentially weighted average and the running mean are effective and can be implemented by means of efficient recursive formulae, though surprisingly often extremely inefficient non-recursive forms are applied. These simple examples are of real-time (or one-sided) filters, which only use present and past values, a necessary constraint in many important applications. There are many more elaborate methods, which require a much higher level of precaution.

The illustrations are condensed from Laboratory online computing, a very old (1975) and forgotten book by the author.
