Floating-point arithmetic
From Cppwiki
Floating-point arithmetic describes operations performed on float, double, and long double types.
Contents |
Summary
A real number consists of an a (possibly infinite) integer part, optionally preceded by a sign (negative or positive) and followed by a radix point (called a decimal point in base 10, or a binary point in base 2) and a (possibly infinite) fractional part.
A floating-point value is an approximation of a real number. It typically consists of a triplet containing a sign bit (1 if negative or 0 otherwise), an unsigned integer significand (or coefficient), and a signed integer exponent. These are combined like so:
(−1)sign × significand × 2exponent
For instance, the rational number -2 1⁄8 (= -2.12510 = -10.0012) might be represented as a floating-point triplet with a sign of 1, a significand of 17 (= 100012), and an exponent of -3 (3-bit fractional part):
(-1)1 × 17 × 2-3 = -1 × 17 × 1⁄8 = -17⁄8 = -2 1⁄8
We can represent the same number if we double the significand to 34 (= 1000102) and drop the exponent to -4 (4-bit fractional part = 10.00102):
(-1)1 × 34 × 2-4 = -1 × 34 × 1⁄16 = -34⁄16 = -17⁄8 = -2 1⁄8
The triplet 〈1,17,-3〉 is considered the normalized representation because its significand, 17 (= 100012), is not evenly divisible by 2 (rightmost bit is 1), meaning that it has the minimum number of significant bits necessary to represent the number.
Limits
C++ class template std::numeric_limits<T> describes the limits of numeric type T. The following are some of the ones that apply to floating-point types:
static const int digits; static const int min_exponent; static const int max_exponent; static T epsilon() throw();
Range
- digits is the number of bits in the significand, such that it can represent any unsigned integer in the range 0 to 2digits-1 without rounding.
- min_exponent and max_exponent describe the range of the exponent
- epsilon is the smallest number such that 1 + epsilon is greater than 1 for type T and is usually equal to 1⁄(2digits-1).
Precision
In base 10 (decimal), the fraction 1⁄3 can be represented as a repeating decimal real number, where the digit 3 repeats an infinite number of times:
1⁄3 = 0.33333…10
A finite number of digits can approach but never reach the correct value:
1⁄3 = 10⁄30 = 100⁄300 = 1000⁄3000 = … 0.3 = 3⁄10 = 9⁄30 0.33 = 33⁄100 = 99⁄300 0.333 = 333⁄1000 = 999⁄3000 ⋮
Similarly, in base 2 (binary), 8 (octal), or 16 (hexadecimal), the fraction 1⁄10 can only be represented with an infinitely-repeating significand, and finite significands can approach but never reach the correct value:
1⁄10 = 0.110
= 0.000110011001100110011001100…2
= 0.063146314631463146314…8
= 0.199999…16
0.1000000014901161193847656250000000000000 (float, 24-bit significand)
0.1000000000000000055511151231257827021182 (double, 53-bit)
0.1000000000000000000013552527156068805425 (long double, 64-bit)
Techniques
Comparing floating-point numbers
Consider a number n generated as follows:
double n = 0; for (int i = 0; i < 10000; ++i) n += 0.1;
Even though 10000 × 0.1 is 1000, n is not exactly 1000 because 0.1 can only be represented imprecisely by a binary floating-point number, and the result of each addition makes the value of n even more imprecise.
Comparison with integers
- std::round(n) rounds to the nearest integer
- std::floor(n) rounds down to the nearest integer (-4 is the floor of -4.0, -3.5, -3.4; +3 is the floor of +3.0, +3.4, +3.5).
- std::ceil(n) rounds up to the nearest integer (-3 is the ceil of -3.5, -3.4, -3.0; +4 is the ceil of of +3.4, +3.5, +4.0).
Comparison with other floating-point numbers
- You would be best-served by looking at Bruce Dawson's essay on comparing floating-point numbers.
If we need to compare n to an arbitrary floating-point value, we compute the difference between them and compare it against some acceptable tolerance ("fudge factor"):
#include <limits> #include <cmath> template <typename T> int flt_compare(T x, T y, T tolerance) { const T delta = x - y; return (std::abs(delta) < tolerance) ? 0 : delta < 0 ? -1 : +1; } template <typename T> bool flt_equal(T x, T y, T tolerance) { return flt_compare(x, y, tolerance) == 0; }
Note that the tolerance has to be large enough to account for all imprecision introduced while computing n, which depends on the precision of the type (float is less precise than long double and therefore needs a larger tolerance) as well as on the nature of the computation.
See also
- Standards for floating-point numbers:
- Current standard on Wikipedia, IEEE 754-2008 or ISO/IEC/IEEE 60559:2011
- Previous standard (IEEE 754-1985) on Wikipedia
- What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg
- IEEE Standard 754 Floating Point Numbers by Steve Hollasch
- Intel Software Developer's Manual Volume 1: Basic Architecture § 4.8.2 covers Intel's implementation of floating point numbers.
- The GNU Multiple Precision Arithmetic Library (GMP) for arbitrary precision arithmetic on signed integers, rational numbers (i.e. fractions), and floating point numbers. (π is an irrational number.)