IEEE 754 Machine Numbers and Machine Arithmetic

In order to make numerical programs more portable between different machines, the IEEE 754 standard defines machine numbers and how arithmetic operations should be performed. Virtually all current computers comply with this standard.

William Kahan and the History of IEEE 754
Soon after its conception in 1977, this standard was implemented by virtually all numerical processors.

Machine Numbers

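Machine numbers are stored as a sequence of k + n bits (each of which is 0 or 1): one sign bit s, k exponent bits, and n - 1 mantissa bits. The leading mantissa digit $d_1$ is not stored.

$s \;\; e_1 \dots e_k \;\; d_2 \dots d_n$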

For single precision numbers we have n=24, k=8.

For double precision numbers we have n=53, k=11.

The sign is "+" for s=0 and "-" for s=1.

The exponent is obtained as $e = (e_1 \dots e_k)_2 - b$, where $b = 2^{k-1} - 2$. The largest and smallest values of e are used to represent special values. Hence the smallest remaining value is $e_{\min} = 1 - b = 3 - 2^{k-1}$, and the largest remaining value is $e_{\max} = 2^k - 2 - b = 2^{k-1}$.
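The formulas for the bias and the exponent range can be checked numerically. The following is a minimal sketch; the helper name `exponent_range` is made up for this illustration. It also compares the double precision result with `sys.float_info`, which uses the same convention of a mantissa of the form (.1...) and is assumed here to describe IEEE 754 doubles.

```python
import sys

def exponent_range(k):
    """Return (b, emin, emax) for a k-bit exponent field, using the formulas from the text."""
    b = 2 ** (k - 1) - 2        # bias b = 2^(k-1) - 2
    emin = 1 - b                # stored exponent 0...01 minus the bias
    emax = 2 ** k - 2 - b       # stored exponent 1...10 minus the bias
    return b, emin, emax

print(exponent_range(8))        # single precision: (126, -125, 128)
print(exponent_range(11))       # double precision: (1022, -1021, 1024)

# Python floats are doubles on common platforms; these should agree with k = 11.
print(sys.float_info.min_exp, sys.float_info.max_exp)
```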

For $e_{\min} \le e \le e_{\max}$ we have
$x = \pm(.1d_2 \dots d_n)_2 \cdot 2^{e}$, representing normalized numbers.
For $e = e_{\min} - 1$ we have
$x = \pm(.0d_2 \dots d_n)_2 \cdot 2^{e_{\min}}$, representing ±0 and subnormal numbers (aka denormalized numbers).
For $e = e_{\max} + 1$ we have
$x = \pm\infty$ (Infinity) if all $d_j = 0$,
x = NaN otherwise.
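The three cases can be traced on actual bit patterns. The sketch below decodes a Python float, assumed to be an IEEE 754 double (k=11, n=53), into its sign, stored exponent, and mantissa bits and evaluates the formulas exactly with rational arithmetic. The function name `decode` and the returned labels are choices made for this illustration only.

```python
import struct
from fractions import Fraction

K, N = 11, 53                         # double precision: exponent bits, mantissa digits
BIAS = 2 ** (K - 1) - 2               # b from the text
EMIN, EMAX = 1 - BIAS, 2 ** K - 2 - BIAS

def decode(x):
    """Classify a double and return (kind, exact value as Fraction or None)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # the 64 bits as an integer
    s = bits >> (K + N - 1)                               # sign bit
    stored = (bits >> (N - 1)) & (2 ** K - 1)             # (e1...ek)_2
    frac = bits & (2 ** (N - 1) - 1)                      # (d2...dn)_2
    e = stored - BIAS
    sign = -1 if s else 1
    if e == EMAX + 1:
        return ("inf" if frac == 0 else "nan", None)
    if e == EMIN - 1:
        # leading digit 0: x = ±(.0 d2...dn)_2 * 2^emin
        kind = "zero" if frac == 0 else "subnormal"
        return (kind, sign * Fraction(frac, 2 ** N) * Fraction(2) ** EMIN)
    # leading digit 1: x = ±(.1 d2...dn)_2 * 2^e
    mantissa = Fraction(2 ** (N - 1) + frac, 2 ** N)
    return ("normal", sign * mantissa * Fraction(2) ** e)

for x in (1.0, -0.75, 5e-324, 0.0, float("inf"), float("nan")):
    print(x, decode(x))
```

Since `Fraction(x)` converts a float to its exact rational value, `decode(x)[1] == Fraction(x)` is a convenient check for any finite x.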

Note: All numbers with sign "+", arranged by size from +0 up to +Infinity correspond to all the bit sequences (0 0...0 0...0) up to (0 1...1 0...0), arranged as binary integers. Therefore it is easy to compare two machine numbers, or to find the next smaller or larger machine number.
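The ordering property can be tried out directly. The sketch below reinterprets doubles as 64-bit integers and obtains the next larger machine number by adding 1 to that integer. The helper names `to_bits`, `from_bits`, and `next_up` are made up for this example, Python floats are assumed to be IEEE 754 doubles, and `next_up` is only meant for nonnegative finite arguments.

```python
import math
import struct
import sys

def to_bits(x):
    """Bit pattern of a double, read as an unsigned 64-bit integer."""
    (b,) = struct.unpack(">Q", struct.pack(">d", x))
    return b

def from_bits(b):
    """Double whose bit pattern is the unsigned 64-bit integer b."""
    (x,) = struct.unpack(">d", struct.pack(">Q", b))
    return x

def next_up(x):
    """Next larger machine number for a nonnegative finite x (integer increment)."""
    return from_bits(to_bits(x) + 1)

print(next_up(0.0))                          # smallest positive subnormal number
print(next_up(1.0) - 1.0)                    # spacing of machine numbers just above 1
print(next_up(sys.float_info.max))           # one step past the largest number: inf

# Nonnegative numbers sorted by value are also sorted by bit pattern.
xs = [0.0, 5e-324, 1.0, 1.5, 1e308, math.inf]
print([to_bits(x) for x in xs] == sorted(to_bits(x) for x in xs))
```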


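Normally rounding "to nearest" is enabled. Let $x_{\max}$ be the largest machine number and x be an arbitrary real number.
For $|x| \le x_{\max}$
fl(x) is the nearest machine number. In the case of a tie the number with $d_n = 0$ is chosen.
For $|x| > x_{\max}$
fl(x) = ±Infinity, except that values exceeding $x_{\max}$ by less than half the spacing of the largest machine numbers still round to $\pm x_{\max}$.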

Other rounding modes are "towards +Infinity", "towards -Infinity", and "towards 0" (chopping).
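The default "to nearest" rule, including the tie-breaking and the overflow case, can be observed with integers near 2^53, where neighboring doubles are 2 apart. This is a small sketch that assumes Python floats are IEEE 754 doubles and relies on Python's int-to-float conversion rounding to nearest with ties going to the number with last digit 0. Plain Python does not offer a way to switch to the other rounding modes, so they are not shown.

```python
import math
import sys

two53 = 2 ** 53   # from here on, neighboring doubles are 2 apart

# 2^53 + 1 lies exactly between 2^53 and 2^53 + 2; the tie goes to 2^53 (d_n = 0).
print(float(two53 + 1) == two53)          # True
# 2^53 + 3 lies exactly between 2^53 + 2 and 2^53 + 4; the tie goes to 2^53 + 4.
print(float(two53 + 3) == two53 + 4)      # True

xmax = sys.float_info.max                 # largest machine number
half_spacing = 2.0 ** 970                 # half the spacing of doubles near xmax

# Slightly above xmax, but below the midpoint: still rounds to xmax.
print(xmax + 2.0 ** 969 == xmax)          # True
# At the midpoint the result overflows to +Infinity.
print(xmax + half_spacing == math.inf)    # True
```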

Machine Arithmetic

For addition, subtraction, multiplication, division and square roots of machine numbers the rounded exact result must be returned. For example, adding two machine numbers x, y returns the machine number fl(x+y). For all combinations of machine numbers (including ±0, ±Infinity, NaN) the result of the operation is well defined, e.g., 1/±0=±Infinity, Infinity+Infinity=Infinity, Infinity-Infinity=NaN, 0/0=NaN, 0*Infinity=NaN. Any arithmetic operation involving NaN returns NaN.
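The requirement that each operation returns the rounded exact result can be verified by doing the operation in exact rational arithmetic and rounding once at the end. The sketch below does this with Python's `fractions` module; the helper `exact_then_round` is a name invented for this illustration, and Python floats are assumed to be IEEE 754 doubles. Note that Python itself raises `ZeroDivisionError` for float division by zero instead of returning ±Infinity or NaN, so those two cases are left out.

```python
import math
import operator
from fractions import Fraction

def exact_then_round(op, x, y):
    """Apply op to the exact rational values of x and y, then round once to a double."""
    return float(op(Fraction(x), Fraction(y)))

x, y = 0.1, 0.2
for op in (operator.add, operator.sub, operator.mul, operator.truediv):
    # The hardware result should equal fl(exact result).
    print(op.__name__, op(x, y) == exact_then_round(op, x, y))

inf, nan = math.inf, math.nan
print(inf + inf)                  # inf
print(math.isnan(inf - inf))      # True: Infinity - Infinity = NaN
print(math.isnan(0.0 * inf))      # True: 0 * Infinity = NaN
print(math.isnan(nan + 1.0))      # True: NaN propagates
```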

Note that there are two distinct machine numbers +0 and -0 which behave differently in expressions such as 1/x: 1/+0=+Infinity, but 1/-0=-Infinity. However, IEEE 754 defines the comparison operator "==" such that +0==-0 is true. Note that NaN==NaN is defined as false.
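These comparison rules are easy to observe. In the short sketch below the helper `is_negative_zero` is a made-up name, and `math.copysign` is used to tell the two zeros apart because Python raises an exception for 1/0.0 rather than returning an infinity.

```python
import math

def is_negative_zero(x):
    """True only for -0.0: compares equal to zero but carries the sign bit."""
    return x == 0.0 and math.copysign(1.0, x) < 0

pos, neg = 0.0, -0.0
print(pos == neg)                                   # True: +0 == -0
print(is_negative_zero(pos), is_negative_zero(neg)) # False True

nan = math.nan
print(nan == nan)                                   # False
print(nan != nan)                                   # True
print(math.isnan(nan))                              # True: the reliable NaN test
```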