Insight 4 - Kahan and Neumaier Summation
Algorithms for computing the arithmetic mean of an online data stream
A data stream can be described as an ordered sequence of points that must be accessed sequentially and can be read only once or a small number of times. Efficient processing over massive data sets has taken on increasing importance in the last few decades, driven by the growth in data traffic and storage required by ever more applications.
In particular, monitoring huge and rapidly changing streams of data has emerged as an important data management problem. Applications include the analysis of network traffic, transaction logs, telephone call records and many more.
Since data streams have potentially unbounded size, when the amount of available memory is limited it may be impossible to produce an exact result. The challenge is then to produce high-quality approximate answers, that is, answers as close to the correct one as possible. Take the arithmetic mean as an example: on a huge data stream, the textbook definition (accumulate the whole sum, then divide by the count) can give inaccurate results, because the running sum accumulates rounding error and can grow far larger than the individual values. However, D. Knuth (1998) presents a different formula that updates the mean incrementally, one element at a time.
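As a concrete illustration, here is a minimal Python sketch of that incremental update (the function name and the test stream are illustrative, not taken from Knuth's text): the running mean is corrected by a fraction of the difference between the new element and the current mean, so no potentially huge running sum is ever stored.

    def online_mean(stream):
        # Running mean via the recurrence mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k.
        mean = 0.0
        for k, x in enumerate(stream, start=1):
            mean += (x - mean) / k   # only the previous mean and the count are kept
        return mean

    data = (1e8 + (i % 10) for i in range(1_000_000))
    print(online_mean(data))   # close to 1e8 + 4.5, without ever forming a sum of order 1e14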
Kahan Summation Algorithm
In numerical analysis, the Kahan summation algorithm significantly reduces the numerical error in the total obtained by adding a sequence of finite-precision floating-point numbers. It does so by keeping a separate running compensation, a variable that accumulates the small errors lost at each step. Simply summing n numbers in sequence has a worst-case error that grows proportionally to n, and a root mean square error that grows as √n for random inputs.
With compensated summation, the worst-case error bound is independent of n, so a large number of values can be summed with an error that depends only on the floating-point precision.
function KahanSum(input)
    var sum = 0.0
    var c = 0.0                  // A running compensation for lost low-order bits.
    for i = 1 to input.length do
        var y = input[i] - c     // So far, so good: c is zero.
        var t = sum + y          // Alas, sum is big, y small, so low-order digits of y are lost.
        c = (t - sum) - y        // (t - sum) cancels the high-order part of y; subtracting y recovers negative (low part of y).
        sum = t                  // Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!
    next i                       // Next time around, the lost low part will be added to y in a fresh attempt.
    return sum
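To see the effect in practice, here is a small Python translation of the pseudocode above, compared against a naive left-to-right sum (the function name and the test data are illustrative and not part of the original algorithm description).

    def kahan_sum(values):
        total = 0.0
        c = 0.0                  # running compensation for lost low-order bits
        for x in values:
            y = x - c            # apply the stored compensation to the next term
            t = total + y        # low-order digits of y may be lost in this addition
            c = (t - total) - y  # recover (the negative of) the lost low-order part
            total = t
        return total

    # One million tiny increments added to 1.0: each increment is below half an ulp
    # of 1.0, so the naive sum discards them all, while the compensated sum keeps them.
    data = [1.0] + [1e-16] * 1_000_000
    print(sum(data))             # 1.0            (every increment rounded away)
    print(kahan_sum(data))       # ~1.0000000001  (close to the exact value)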
Neumaier Modification
A. Neumaier introduced a modification of Kahan's algorithm that also covers the case in which the next term to be added is larger in absolute value than the running sum, effectively swapping the roles of what is large and what is small. For many sequences of numbers the two algorithms agree, but a simple example shows that they can disagree.
For instance, summing the numbers [1.0, 10^100, 1.0, -10^100] in double precision, Kahan's algorithm yields 0.0, whereas Neumaier's algorithm yields the correct value 2.0.
function NeumaierSum(input)
    var sum = input[1]
    var c = 0.0                  // A running compensation for lost low-order bits.
    for i = 2 to input.length do
        var t = sum + input[i]
        if |sum| >= |input[i]| do
            c += (sum - t) + input[i]   // If sum is bigger, low-order digits of input[i] are lost.
        else
            c += (input[i] - t) + sum   // Else low-order digits of sum are lost.
        sum = t
    return sum + c               // Correction only applied once in the very end.
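That example can be reproduced with a short Python sketch of the modified algorithm (again, the function name is only illustrative); math.fsum is used here as an independent reference for the exactly rounded sum.

    import math

    def neumaier_sum(values):
        total = values[0]
        c = 0.0                          # running compensation for lost low-order bits
        for x in values[1:]:
            t = total + x
            if abs(total) >= abs(x):
                c += (total - t) + x     # low-order digits of x are lost
            else:
                c += (x - t) + total     # low-order digits of total are lost
            total = t
        return total + c                 # correction applied once at the very end

    data = [1.0, 1e100, 1.0, -1e100]
    print(neumaier_sum(data))            # 2.0, the correct result (Kahan's version returns 0.0 here)
    print(math.fsum(data))               # 2.0, exactly rounded reference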