A Question Of Rounding
By Paul Sephton
Introduction
Schools teach scholars, as part of their curriculum, how to round a number. We are taught that a number such as 0.1 may be approximated as 0, whilst the number 0.5 should be seen as approximately 1. Likewise, 1.4 is 1, whilst 1.5 is 2; 1.15 is 1.2 whilst 1.24 is also 1.2 when approximated to a single digit of precision.
Many industries use rounding; Chemistry, for one, recognises the accuracy of tests to be only within a finite range of precision. For example, whilst a Gas Chromatograph might produce the value 45.657784532, we know that it is actually only accurate to three digits of precision, and it is senseless not to round the number to 45.658 before doing further calculations.
Financial and Banking industries have similar rules for rounding. It is well known that there are very few absolutes in mathematics- for example the number Pi, the ratio of the circumference of any circle to it's diameter, is an irrational number and cannot be accurately represented. A simpler example is 1/3, or decimal 0.333333...(add an infitite number of 3's).
In modern floating point processors, we have a finite number of decimals of precision to which we store numeric floating point values. This limit of precision is dependent on the size (in bytes) of the storage unit. We primarily find 4 byte, 8 byte and 10 byte floating point numbers stored according to the IEEE-754 specification for floating point representation.
In the process of specifying the standard, the ISO had to decide what to do about the matter of loss of precision. They determined that the most stable mode for rounding at the time of precision loss, was to round towards the nearest value, or in the case of equality, to round towards the nearest even value. What this means, assuming floating point values had a single decimal of precision, is that the value 0.5 would be rounded to 0, whilst 1.5 and 2.5 would be rounded to 2. In doing this, the rounding error in calculations is averaged out, and you are left with the most consistent result.
Of course, the 8 byte double precision IEEE value has 15 decimals of precision, and not a single decimal as described above. The rounding error would only apply to the last decimal, so by rounding that last decimal of precision, we are left with a pretty accurate mathematic representation.
Probably due to a few misunderstandings, a problem has come about in the implementation of certain programming library functions, in particular the Gnu C library, GlibC and the functions used for converting IEEE floating point numbers into displayable text. This is the main interest of this document; to highlight a particular problem in the hope that this will lead to a change in approach.
The sign, mantissa and exponent
IEEE-754 double precision floating point values are stored as binary (base 2) values having a sign bit (s), a 11 bit exponent (e), and a 52 bit mantissa (f). Logically, this means that we can store positive or negative numbers in the
range 0 to 2^52 – 1 (4503599627370495) with an exponent in the range zero to 2^11 - 1 (2047). This means that we have at most 15 (and a bit) decimals of precision for the mantissa.
Values Represented by Bit Patterns in IEEE Double Format
A more familiar representation which is quite similar might be the scientific decimal representation, in the format [s]f * 10^[+/-]e – eg. -1.23e+20 or +1.23e-20, where M is the mantissa (in this case 1.23) and exp is the exponent (in this instance +20 and -20 respectively).
The problem; GlibC & sprintf()
The sprintf() function is ubiquitously used in the generation of output reports, being the prime candidate for enabling the conversion from numbers to text.
In the light of buffer overflows and other things that might go wrong, the use
of snprintf() or other string functions is strongly suggested.
-- René Pfeiffer
In coding the C library, programmers were well aware of the above information about double precision numbers, including the default rounding mode for handling calculation errors. These people knew that the FPU (Floating Processor Unit) automatically rounded the results of calculations to the nearest value, or if on the border between values, to the nearest even value.
The problem arose when this same logic was applied to precisions less than 15.
When converting a value 3.5 (stored at a precison of 15) to text, for display with a precision of zero decimals, the GCC sprintf() function correctly produces the value 4. However, when converting the value 2.5, sprintf() produces not the expected value 3, but the value 2!
It should be stressed here, that the IEEE-754 representation has a full 15 decimal precision (or 53 binary digits of precision) to play with, regardless of the number being represented. Different numbers might be represented exactly or inexactly within the FPU, but all IEEE values are represented with exactly the same precision of 15 decimals. Therefore, no assumption may be made about an implied precision of a stored number, or inferred from the display size.
Whilst a number might be inexactly stored at a precision of 15 decimals, that same number is exact when viewed at 14 decimals of precision. For example, the value 2.49999999999992 is promoted (using IEEE rounding) to 2.4999999999999 and then to 2.5 (with precision of 14) using the same rounding rules.
Where the IEEE rounding mode makes an immense amount of sense for calculations and handling the error at decimal 15, it makes no sense at all when applying that rounding mode to the display of numbers. When displaying a bond, traded on a stock exchange at 3.145 to two decimal points, one would expect 3.15 and not 3.14.
When using the function printf("%.0f", 2.5), one may therefore reasonably expect the number 2.50 to be rounded upwards to 3, since there is no ambiguity or error implied in the storage of the value 2.50.
Conclusion
The default IEEE rounding mode, as applied to calculations using numbers stored to an identical precision of 15 for double precision values, is the most stable way to produce consistent and accurate results when the rounding is consistently applied at decimal 15.
However, when formatting numbers for display, it is more desirable to accurately represent these same numbers in a consistent way according to industry and mathematical standards. Rounding upwards for positive values, or downwards for negative values is the generally accepted norm, and it is senseless to apply the IEEE-754 rules when the known precision of the number, being fixed at 15, is greater than that of the displayed precision.
It is evident that the GlibC developers, in an attempt towards compliance with the IEEE-754 standard have confused these two issues, to the detriment of the industry as a whole. The damage caused by this misunderstanding is far-reaching, and not necessarily easily circumvented. Applications which work flawlessly against the Microsoft platform have to be specifically altered before being compiled against GlibC.
Unless the difference between GlibC and the Microsoft runtime is well understood, and the adjustments made, to cater for these differences before a product is released, it is inevitable that this seemingly innocuous discrepancy will lead to general and widespread mistrust in applications which use the GlibC runtime.
Late Addendum
Subsequent to writing this article, it has come to light that the Microsoft C runtime library, whilst more accurate in most cases than the GNU C library, also fails to correctly convert binary IEEE double precision numbers to decimal representation in some cases.
The following code demonstrates the principles discussed in this article, and properly converts binary IEEE values to decimal format inside a buffer for display according to generally accepted mathematical standards - at least for the Intel platform. Note that
printf("%.2f", 5000.525);
fails to produce the expected result in both Microsoft and GNU libraries.
/* * Compile with: gcc -std=c99 -lm -o filename filename.c * * Definition of _GNU_SOURCE required for compilation with the * GNU C Compiler (disables warning about implicit definition of pow10()) */ #define _GNU_SOURCE #include <stdlib.h> #include <stdio.h> #include <string.h> #include <math.h> // Utility function converts an IEEE double precision number to a // fixed precision decimal format stored in a buffer. void tobuf(size_t max, int *len, char *buf, double x, int precision, double max_prec, double carry) { int sign = x < 0; // remember the sign double q = pow10(-precision); // current mask double y = x==0?0:fmod(fabs(x), q); // modulus double l_div = round(y*max_prec)/max_prec+carry; // significant digit int l_dec = (int)round(l_div*10/q); // round to decimal carry = l_dec>=10?l_div:0; // carry forward? l_dec = l_dec % 10; // this decimal x = x>0?x-y:x+y; // subtract modulus if (fabs(x) > 0) // recurse while |x| > 0 tobuf(max, len, buf, x, precision-1, max_prec, carry); else { // x == 0 - first digit if (*len >= max) return; if (sign) { buf[*len] = '-'; *len = *len + 1; } if (*len+1 <= max && precision >= 0) { buf[*len] = '0'; *len = *len + 1; buf[*len] = '.'; *len = *len + 1; } while (precision-- > 0) { buf[*len] = '0'; *len = *len + 1; if (*len >= max) return; } precision = -1; // don't place another period } if (*len <= max && precision == 0) { buf[*len] = '.'; *len = *len + 1; } // for first and subsequent digits, add the digit to the buffer if (*len >= max) return; if (l_dec < 0) l_dec = 0; buf[*len] = '0' + l_dec; *len = *len + 1; }
// Convert the value x to a decimal representation stored in a buffer int dbl2buf(size_t max, char *buf, double x, int precision) { const int DECIMALS=15; int max_dec = DECIMALS-(int)(trunc(log10(fabs(x)))+1); // max significant digits double max_prec = pow10(max_dec); // magnitude for precision loss int len = 0; // buffer length init double y = x==0?0:fmod(fabs(x), 1/max_prec); // determine error double l_carry = round(y*max_prec)/max_prec; // error is carried forward if (x != x) { strncpy(buf, "NAN", max); return 0; } if ((x-x) != (x-x)) { strncpy(buf, "INF", max); return 0; } tobuf(max, &len, buf, x, precision-1, max_prec, l_carry); // fill in buffer buf[len] = 0; // terminate buffer return len; // return buffer length used }
// Usage of the dbl2buf function. int main (void) { char buf[64]; double x = 5000.525; dbl2buf(sizeof(buf), buf, x, 2); printf("%.15f = %s\n", x, buf); }
References:
https://sourceware.org/bugzilla/show_bug.cgi?id=4943
Talkback: Discuss this article with The Answer Gang
Paul works as a Software Architect and Technology Officer for a financial information vendor. After abandoning a career in nuclear chemistry, during which he developed an interest in hardware and software, he joined a firm of brokers as a developer. He first started using Linux in 1994 with version 1.18 of Slackware. His first passion will always be writing software. Other interests are composing music, playing computer games, Neural network simulations and reading.