Floating-point arithmetic

Floating-point arithmetic describes operations performed on float, double, and long double types.

= Summary =

A real number consists of an a (possibly infinite) integer part, optionally preceded by a sign (negative or positive) and followed by a radix point (called a decimal point in base 10, or a binary point in base 2) and a (possibly infinite) fractional part.

A floating-point value is an approximation of a real number. It typically consists of a triplet containing a sign bit (1 if negative or 0 otherwise), an unsigned integer significand (or coefficient), and a signed integer exponent. These are combined like so:

(−1)sign &times; significand &times; 2exponent

For instance, the rational number -2 1&frasl;8 (= -2.12510 = -10.0012) might be represented as a floating-point triplet with a sign of 1, a significand of 17 (= 100012), and an exponent of -3 (3-bit fractional part):

(-1)1 &times; 17 &times; 2-3 = -1 &times; 17 &times; 1&frasl;8 = -17&frasl;8 = -2 1&frasl;8

We can represent the same number if we double the significand to 34 (= 1000102) and drop the exponent to -4 (4-bit fractional part = 10.00102):

(-1)1 &times; 34 &times; 2-4 = -1 &times; 34 &times; 1&frasl;16 = -34&frasl;16 = -17&frasl;8 = -2 1&frasl;8

The triplet &lang;1,17,-3&rang; is considered the normalized representation because its significand, 17 (= 100012), is not evenly divisible by 2 (rightmost bit is 1), meaning that it has the minimum number of significant bits necessary to represent the number.

= Limits =

C++ class template std::numeric_limits describes the limits of numeric type T. The following are some of the ones that apply to floating-point types:

static const int digits; static const int min_exponent; static const int max_exponent; static T epsilon throw;

Range

 * digits is the number of bits in the significand, such that it can represent any unsigned integer in the range 0 to 2digits-1 without rounding.
 * min_exponent and max_exponent describe the range of the exponent
 * epsilon is the smallest number such that 1 + epsilon is greater than 1 for type T and is usually equal to 1&frasl;(2digits -1).

Precision
In base 10 (decimal), the fraction 1&frasl;3 can be represented as a repeating decimal real number, where the digit 3 repeats an infinite number of times:

1&frasl;3 = 0.33333&hellip;10

A finite number of digits can approach but never reach the correct value:

1&frasl;3 = 10&frasl;30 = 100&frasl;300 = 1000&frasl;3000 = &hellip; 0.3  = 3&frasl;10     = 9&frasl;30 0.33 = 33&frasl;100   = 99&frasl;300 0.333 = 333&frasl;1000 = 999&frasl;3000 &#x22ee;

Similarly, in base 2 (binary), 8 (octal), or 16 (hexadecimal), the fraction 1&frasl;10 can only be represented with an infinitely-repeating significand, and finite significands can approach but never reach the correct value:

1&frasl;10 = 0.110 = 0.000110011001100110011001100&hellip;2 = 0.063146314631463146314&hellip;8 = 0.199999&hellip;16 0.1000000014901161193847656250000000000000 (float, 24-bit significand) 0.1000000000000000055511151231257827021182 (double, 53-bit) 0.1000000000000000000013552527156068805425 (long double, 64-bit)

= Techniques =

Comparing floating-point numbers
Consider a number n generated as follows:

double n = 0; for (int i = 0; i < 10000; ++i) n += 0.1;

Even though 10000 &times; 0.1 is 1000, n is not exactly 1000 because 0.1 can only be represented imprecisely by a binary floating-point number, and the result of each addition makes the value of n even more imprecise.

Comparison with integers

 * std::round( n ) rounds to the nearest integer
 * std::floor( n ) rounds down to the nearest integer (-4 is the floor of -4.0, -3.5, -3.4; +3 is the floor of +3.0, +3.4, +3.5).
 * std::ceil( n ) rounds up to the nearest integer (-3 is the ceil of -3.5, -3.4, -3.0; +4 is the ceil of of +3.4, +3.5, +4.0).

Comparison with other floating-point numbers

 * You would be best-served by looking at Bruce Dawson's essay on comparing floating-point numbers.

If we need to compare n to an arbitrary floating-point value, we compute the difference between them and compare it against some acceptable tolerance ("fudge factor"):


 * 1) include
 * 2) include

template  int flt_compare(T x, T y, T tolerance) { const T delta = x - y; return (std::abs(delta) < tolerance) ? 0 : delta < 0 ? -1 : +1; }

template <typename T> bool flt_equal(T x, T y, T tolerance) { return flt_compare(x, y, tolerance) == 0; }

Note that the tolerance has to be large enough to account for all imprecision introduced while computing n, which depends on the precision of the type (float</tt> is less precise than long double</tt> and therefore needs a larger tolerance) as well as on the nature of the computation.

= See also =


 * Standards for floating-point numbers:
 * Current standard on Wikipedia, IEEE 754-2008 or ISO/IEC/IEEE 60559:2011
 * Previous standard (IEEE 754-1985) on Wikipedia
 * What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg
 * IEEE Standard 754 Floating Point Numbers by Steve Hollasch


 * Intel Software Developer's Manual Volume 1: Basic Architecture &sect; 4.8.2 covers Intel's implementation of floating point numbers.
 * The GNU Multiple Precision Arithmetic Library (GMP) for arbitrary precision arithmetic on signed integers, rational numbers (i.e. fractions), and floating point numbers. (&pi; is an irrational number.)