WSZiB : Lectures of Prof. dr Peter Sloot

4 Number Representation and Error Propagation

IV.1 Number Representation

Introduction

In everyday life, referring to a number means referring to a decimal number, built up from digits in the range 0 to 9. Most computers do not take a decimal view of numbers and instead use a different base for number representation. In particular, the binary representation with base 2, whose digits are 0 and 1, is widely used in computers.

The problem is that numbers that can be represented exactly in one number system are in general not exactly representable in another. A binary computer cannot represent most decimal fractions exactly. To illustrate this, consider the following example. In a decimal system the output will be 1.0, 0.9, ..., 0.1. But what does a computer make of it? Try it! (Note: make sure you know how to stop an evaluation in Mathematica: use 'Alt ,' or 'Alt .')

The problem in this example is that 0.1, and therefore all intermediate results, could not be represented exactly and the stopping criterion was never satisfied.
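A minimal Python sketch of such a loop, assuming a countdown that starts at 1.0, subtracts 0.1 each step and stops only when the value equals zero. The guard `MAX_STEPS` is an addition made here so that the sketch terminates; the names are illustrative.

```python
# Countdown from 1.0 in steps of 0.1 with the stopping criterion x == 0.
# MAX_STEPS is a safety guard added for this sketch; without it the loop
# would run forever, because x never becomes exactly zero.
MAX_STEPS = 15

x = 1.0
steps = 0
while x != 0.0 and steps < MAX_STEPS:
    print(repr(x))   # shows the accumulated representation error
    x -= 0.1
    steps += 1

hit_zero = (x == 0.0)
print("reached exactly zero:", hit_zero)
```

After ten subtractions the value is a tiny positive number rather than zero, so the test `x != 0.0` stays true and the loop only ends because of the guard.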

The example above can be rewritten such that the stopping criterion can be satisfied. Verify that the code below works as expected.
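One possible rewrite, sketched in Python under the assumption that the loop is meant to count down from 1.0 to 0.1: the counter is an integer, which is exact, and the decimal value is derived from it only when needed. The names `n` and `values` are illustrative.

```python
# Count with an exact integer and derive the decimal value per step.
values = []
n = 10
while n != 0:              # integer comparison, always reached
    values.append(n / 10)  # one correctly rounded division, no accumulation
    n -= 1

print(values)
```

Because each value comes from a single division, the representation errors do not accumulate and the stopping criterion is tested on exact data.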

As human beings and computers use different number representations, conversion is required when going from one system to the other. Results from a calculation by a computer need to be converted from binary or octal representation (nice for a computer) to a decimal representation (nice for a human being). For example

$1011_2 = 11_{10}$ and $17_8 = 15_{10}$.

On the other hand, if one wants to provide input to a computer program, the input first has to be converted from decimal to the number system of the computer before the data can be processed. A binary computer will make the following conversion when the decimal number 5.2 is fed into it (decimal numbers in magenta)

[Graphics:Images/nb4_gr_30.gif]
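The conversion of 5.2 can be traced with a short Python sketch that uses exact rational arithmetic. The helper `to_binary` is hypothetical and written for this illustration; it is not part of any library.

```python
from fractions import Fraction

def to_binary(x: Fraction, frac_bits: int) -> str:
    """Binary expansion of a non-negative x, truncated to frac_bits fractional bits."""
    integer_part = x.numerator // x.denominator
    frac = x - integer_part
    bits = []
    for _ in range(frac_bits):
        frac *= 2
        bit = frac.numerator // frac.denominator  # 0 or 1
        bits.append(str(bit))
        frac -= bit
    return bin(integer_part)[2:] + "." + "".join(bits)

expansion = to_binary(Fraction(52, 10), 12)
print(expansion)  # 101.001100110011

# The double that Python actually stores for 5.2 is a nearby binary fraction.
stored = Fraction(5.2)
print(stored == Fraction(52, 10))  # False
```

The fractional part repeats the pattern 0011 indefinitely, so any finite binary mantissa must cut the expansion off somewhere.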

The erroneous expansion of decimal numbers can have a great impact. In the Gulf war of 1991 an Iraqi Scud slipped through the Patriot defence shield and hit an army camp in Dhahran due to an incorrect expansion of the number 0.1!

REQUIRED
IV.1a The Patriot missile defence system kept time in units of 1/10th of a second. When the fatal Scud that hit Dhahran was detected, the Patriot clock had been running for 100 hours. How many bits in base 2 are required to represent 100 hours in units of 1/10th of a second?

REQUIRED
IV.1b How many bits are required to represent 1- exactly in binary? How many decimal places are required to represent the same number exactly in base 10?

Positional System

The way in which numbers are represented, both in normal life and in a computer, is by combining a sequence of units or digits with the positions at which these units occur in the number. For example, the decimal number 792 means seven hundreds (a hundred being ten times ten), nine tens and two units, or $7 \cdot 10^2 + 9 \cdot 10^1 + 2 \cdot 10^0$. This positional system was invented about 4000 years ago by the Babylonians, who did not use a decimal system as we do, but a sexagesimal system with base 60.

The positional system is certainly not the only system. Take for example the Romans, who dealt with numbers in a completely different way. Their system had no such thing as a base 2 or 10. Instead they used a fixed set of symbols I, V, X, L, C, D and M, for 1, 5, 10, 50, 100, 500 and 1000 respectively, and numbers were formed by a sequence of symbols in decreasing order of magnitude. (Note: when too many instances of a symbol would be needed, the order would change such that a lower value symbol precedes a higher value symbol, as in IV for 4.) So, the number 68 would have been written by the Romans as LXVIII.

The positional system is determined by the choice of the base or radix $\beta$, an integer greater than 1, and the 'digits' $0, 1, \ldots, \beta - 1$. Besides the well-known decimal system with base 10 the following systems are familiar in computer environments

Binary:        $\beta = 2$:    0, 1
Octal:        $\beta = 8$:    0, 1, ..., 7
Hexadecimal:    $\beta = 16$:    0, 1, ..., 9, A, B, C, D, E, F

Integer Number Representation

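An arbitrary integer number $x$ will have the following form in a positional system with base $\beta$

$x = s\, d_n d_{n-1} \ldots d_1 d_0$

where $s$ is the sign that can be $+$ or $-$, and each of the digits $d_i$ is in the range from 0 to $\beta - 1$. If we define $\sigma = +1$ for $s = +$ and $\sigma = -1$ for $s = -$, the value of $x$ is given by

$x = \sigma \sum_{i=0}^{n} d_i \beta^{i}$ .

The decimal number $+792$ has according to this notation the value: $+(7 \cdot 10^2 + 9 \cdot 10^1 + 2 \cdot 10^0)$.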
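The positional value of an integer can be evaluated digit by digit. The Python sketch below is one way to do this, with hypothetical helper names `digits_to_int` and `int_to_digits`; digits are passed most significant first, and Horner's rule is used instead of explicit powers of the base.

```python
def digits_to_int(digits, base, sign=1):
    """Value of sign * (d_n ... d_1 d_0) in the given base, by Horner's rule."""
    value = 0
    for d in digits:                 # most significant digit first
        if not 0 <= d < base:
            raise ValueError("digit out of range for this base")
        value = value * base + d
    return sign * value

def int_to_digits(value, base):
    """Digits of a non-negative integer in the given base, most significant first."""
    if value == 0:
        return [0]
    digits = []
    while value > 0:
        value, d = divmod(value, base)
        digits.append(d)
    return digits[::-1]

print(digits_to_int([7, 9, 2], 10))   # 792
print(int_to_digits(792, 2))          # [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
print(int_to_digits(792, 16))         # [3, 1, 8]
```

Converting in one direction and back again returns the original integer, since integers are exactly representable in every base.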

Fractional Number Representation

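The above-described representation for integer numbers can be modified to include fractional numbers. An arbitrary fractional number $x$ will have the following form in a positional system with base $\beta$

$x = s\, .d_1 d_2 \ldots d_n$

where the sign $s$ is followed by a point, and each of the digits $d_i$ is in the range from 0 to $\beta - 1$. If we define $\sigma = +1$ for $s = +$ and $\sigma = -1$ for $s = -$, the value of $x$ is given by

$x = \sigma \sum_{i=1}^{n} d_i \beta^{-i}$ .

In this notation the decimal number $.625$ has the value $6 \cdot 10^{-1} + 2 \cdot 10^{-2} + 5 \cdot 10^{-3}$. Similarly, the binary number $.101$ has the value $1 \cdot 2^{-1} + 0 \cdot 2^{-2} + 1 \cdot 2^{-3}$, which is 0.625 in decimal.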
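The value of a fractional digit string can be checked exactly with rational arithmetic. The following Python sketch uses a hypothetical helper `fraction_value`; the digit lists are examples chosen here.

```python
from fractions import Fraction

def fraction_value(digits, base, sign=1):
    """Exact value of sign * 0.d_1 d_2 ... d_n in the given base."""
    total = Fraction(0)
    for i, d in enumerate(digits, start=1):
        total += Fraction(d, base ** i)   # d_i * base**(-i)
    return sign * total

print(fraction_value([1, 0, 1], 2))    # 5/8
print(fraction_value([6, 2, 5], 10))   # 5/8
```

Both digit strings denote the same rational number, which shows a case where a value is exactly representable in base 2 and in base 10.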

Notice the use of a dot as the decimal separator, which is common in the English-speaking world; many other countries use a comma to denote fractional numbers.

Floating Point Number Representation

Fractional number representation becomes awkward as soon as the numbers become very large or very small. The floating point number representation introduces an exponent part, which is particularly suited to very large or very small numbers. The general form for the floating point representation of a number $x$ is

$x = s \cdot m \cdot \beta^{e}$

where $s$ is the by now well-known sign, $m$ is the mantissa with $\beta^{-1} \le m < 1$, and $e$ is the exponent, which can be a positive or negative integer number. In the special case $x = 0$ the mantissa is zero, but the value of the exponent is undefined and is system dependent. In the floating point notation the decimal number -0.000001234 would be written as $-0.1234 \times 10^{-5}$.
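Splitting a number into sign, mantissa and exponent can be sketched in Python as follows. The sketch assumes the normalisation $1/\beta \le m < 1$ (a mantissa of the form 0.d1d2...), and the helper name `normalise` is hypothetical.

```python
import math
from fractions import Fraction

def normalise(x: Fraction, base: int):
    """Split non-zero x into (sign, mantissa, exponent) with 1/base <= mantissa < 1."""
    if x == 0:
        raise ValueError("zero has no normalised representation")
    sign = 1 if x > 0 else -1
    m = abs(x)
    e = 0
    while m >= 1:                  # shift the point to the left
        m /= base
        e += 1
    while m < Fraction(1, base):   # shift the point to the right
        m *= base
        e -= 1
    return sign, m, e

sign, m, e = normalise(Fraction(-1234, 10**9), 10)   # -0.000001234
print(sign, m, e)   # -1 617/5000 -5, i.e. -0.1234 * 10**(-5)

# For base 2, math.frexp returns a mantissa in the same range [0.5, 1).
print(math.frexp(5.2))
```

The loop form is chosen for clarity rather than speed; a logarithm would give the exponent directly but needs care near powers of the base.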

Finite Number Representation

A computer cannot store a mantissa or exponent of infinite length. Instead it uses a fixed length for both the mantissa and the exponent. This implies that the numbers that can be represented by a computer are bounded by a maximum (and of course a minimum). The largest possible value for the exponent is denoted by $e_{\max}$ and the smallest (negative) value for the exponent is denoted by $e_{\min}$. All numbers represented by a computer thus lie in a fixed range, mainly depending on $e_{\min}$ and $e_{\max}$.

The fixed length of the mantissa imposes a restriction on the accuracy with which numbers can be represented.

By selecting a combination of a particular base $\beta$, mantissa length $t$, the minimum value for the exponent $e_{\min}$ and the maximum value for the exponent $e_{\max}$, the numbers that can be represented are completely determined. A finite floating point number system depends on this combination and is denoted as $(\beta, t, e_{\min}, e_{\max})$.

To enforce unambiguous representation of floating point numbers we use normalised floating point numbers, where the mantissa starts with a non-zero digit for floating point numbers not equal to zero.

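As an example all normalised numbers in the floating point system $(10, 3, -4, 5)$ are listed, and also the total of the numbers that can be represented in this system.

$0.100 \times 10^{5}$	$0.101 \times 10^{5}$	$0.102 \times 10^{5}$	⋯	$0.999 \times 10^{5}$
$0.100 \times 10^{4}$	$0.101 \times 10^{4}$	$0.102 \times 10^{4}$	⋯	$0.999 \times 10^{4}$
⋮	⋮	⋮	⋮	⋮
$0.100 \times 10^{1}$	$0.101 \times 10^{1}$	$0.102 \times 10^{1}$	⋯	$0.999 \times 10^{1}$
$0.100 \times 10^{0}$	$0.101 \times 10^{0}$	$0.102 \times 10^{0}$	⋯	$0.999 \times 10^{0}$
$0.100 \times 10^{-1}$	$0.101 \times 10^{-1}$	$0.102 \times 10^{-1}$	⋯	$0.999 \times 10^{-1}$
⋮	⋮	⋮	⋮	⋮
$0.100 \times 10^{-4}$	$0.101 \times 10^{-4}$	$0.102 \times 10^{-4}$	⋯	$0.999 \times 10^{-4}$
0
$-0.100 \times 10^{-4}$	$-0.101 \times 10^{-4}$	$-0.102 \times 10^{-4}$	⋯	$-0.999 \times 10^{-4}$
⋮	⋮	⋮	⋮	⋮
$-0.100 \times 10^{5}$	$-0.101 \times 10^{5}$	$-0.102 \times 10^{5}$	⋯	$-0.999 \times 10^{5}$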

Each row contains 900 numbers. For the positive numbers there are five rows with a positive exponent, four rows with a negative exponent and one row with exponent zero, giving ten rows. The same holds for the negative numbers, giving a total of twenty rows with 900 elements each. Finally there is a row containing only the number zero. The total number of values that can be represented in this system is therefore $20 \times 900 + 1 = 18001$. The largest number that can be represented in this system is $0.999 \times 10^{5}$ and the smallest number is $-0.999 \times 10^{5}$. The smallest positive number is given by $0.100 \times 10^{-4}$.
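The count can be verified by brute-force enumeration. The Python sketch below assumes the system parameters (10, 3, -4, 5), which match the row counts quoted above (900 numbers per row, exponents from -4 to 5), and mantissas of the form 0.d1d2d3 with a non-zero first digit. The function name `machine_numbers` is hypothetical.

```python
from fractions import Fraction

def machine_numbers(base, t, e_min, e_max):
    """All normalised numbers +-0.d1...dt * base**e with d1 != 0, plus zero."""
    numbers = [Fraction(0)]
    # Mantissas 0.d1...dt correspond to the integers base**(t-1) .. base**t - 1.
    for k in range(base ** (t - 1), base ** t):
        mantissa = Fraction(k, base ** t)
        for e in range(e_min, e_max + 1):
            value = mantissa * Fraction(base) ** e
            numbers.append(value)
            numbers.append(-value)
    return sorted(numbers)

nums = machine_numbers(10, 3, -4, 5)
smallest_positive = min(n for n in nums if n > 0)
print(len(nums), max(nums), smallest_positive)   # 18001 99900 1/100000
```

Exact fractions are used so that the enumeration itself is free of the rounding effects it is meant to illustrate.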

The largest number that can be represented in a floating point number system is called maxreal or overflow threshold or giant and its value is given by $(1 - \beta^{-t})\, \beta^{e_{\max}}$.

The smallest positive number that can be represented in a floating point number system is called minreal or underflow threshold or dwarf and its value is given by $\beta^{e_{\min} - 1}$.

All representable numbers in a system are called the machine numbers. For any machine number $x$ other than the giant, there exists a smallest larger machine number $x^{+}$, called the upper neighbour. Also, for any machine number $x$ other than the dwarf, there exists a largest smaller machine number $x^{-}$, called the lower neighbour.

The relative dispersion is defined for any non-zero machine number $x$ other than the giant as

$\rho(x) = \dfrac{x^{+} - x}{|x|}$

For $x = m \beta^{e}$, i.e. with mantissa $m$ and exponent $e$, the relative dispersion can be expressed as

$\rho := \dfrac{\beta^{-t}}{m}$,

with $\beta^{-1} \le m < 1$.

From this an interval for the relative dispersion can be derived:

$\beta^{-t} < \rho \le \beta^{1-t}$

The maximum value for the relative dispersion is called the machine precision or the system resolution and is usually denoted with η. It can be shown that the following equality holds

$\eta = \beta^{1-t}$

If we take the same floating point system $(10, 3, -4, 5)$ as in the previous example the relative dispersion is given by $\rho = 10^{-3}/m$, with $0.1 \le m < 1$.
The machine precision (or equivalently: maximal dispersion) in this system is given by $\eta = 10^{-2}$.
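The relation between relative dispersion and machine precision can be checked numerically. The Python sketch below assumes normalised mantissas of the form 0.d1...dt, so that $\rho = \beta^{-t}/m$ and $\eta = \beta^{1-t}$; the function names are hypothetical.

```python
from fractions import Fraction
import sys

def relative_dispersion(k: int, base: int, t: int) -> Fraction:
    """rho for the mantissa m = k / base**t, with base**(t-1) <= k < base**t."""
    m = Fraction(k, base ** t)
    return Fraction(1, base ** t) / m

def machine_precision(base: int, t: int) -> Fraction:
    """eta = base**(1 - t), the largest relative dispersion."""
    return Fraction(base) ** (1 - t)

# Base 10 with a three-digit mantissa: k runs over 100 .. 999.
rhos = [relative_dispersion(k, 10, 3) for k in range(100, 1000)]
print(min(rhos), max(rhos), machine_precision(10, 3))   # 1/999 1/100 1/100

# In this convention IEEE double precision corresponds to base 2 with t = 53.
print(machine_precision(2, 53) == sys.float_info.epsilon)   # True
```

The largest dispersion occurs at the smallest mantissa, which is why numbers just above a power of the base are the most coarsely spaced in relative terms.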

A real number $x$ in the range of the machine numbers, i.e. $-\mathrm{giant} \le x \le -\mathrm{dwarf}$ or $\mathrm{dwarf} \le x \le \mathrm{giant}$, will be approximated in a computer by a machine number. This machine number is chosen such that it approximates $x$ best, in some sense, among all machine numbers. The notation for this is $\hat{x}$. In a system using rounding $\hat{x}$ will be the machine number closest to $x$ (and the one closer to zero in case $x$ lies exactly between two machine numbers). For a system with truncation $\hat{x}$ will be the largest machine number less than or equal to $x$ for positive numbers, while for negative numbers the equivalence $\widehat{(-x)} = -\hat{x}$ is used.

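Proposition
For real numbers $x$ in the range of the machine numbers the following holds:

$\hat{x} = x\,(1 + \varepsilon)$

with

$|\varepsilon| \le \eta/2$ for rounding and $|\varepsilon| \le \eta$ for truncation.

The unknown ε is called the representation error.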

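For example, take the system (2, 3, -1, 2). The machine precision in this system is given by $\eta = 2^{1-3} = 1/4$. The representation error is then bounded by $|\varepsilon| \le 1/8$ in case of rounding and $|\varepsilon| \le 1/4$ in case of truncation. The positive normalised machine numbers of this system are (mantissas in binary):

	mantissa
exponent	0.100	0.101	0.110	0.111
-1	1/4	5/16	3/8	7/16
0	1/2	5/8	3/4	7/8
1	1	5/4	3/2	7/4
2	2	5/2	3	7/2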
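The bounds on the representation error can be tested exhaustively on a grid of inputs. The Python sketch below assumes the system (2, 3, -1, 2) with mantissas of the form 0.d1d2d3, positive inputs only, and ties in rounding resolved towards zero as described above. The names `fl_round`, `fl_chop` and `rel_error` are hypothetical.

```python
from fractions import Fraction

# Positive normalised numbers of (2, 3, -1, 2): mantissas 4/8 .. 7/8
# (binary 0.100 .. 0.111) times 2**e for e = -1 .. 2.
MACHINE = sorted(Fraction(k, 8) * Fraction(2) ** e
                 for k in range(4, 8) for e in range(-1, 3))
ETA = Fraction(1, 4)   # base**(1 - t) = 2**(-2)

def fl_round(x):
    """Nearest machine number; a tie goes to the candidate closer to zero."""
    return min(MACHINE, key=lambda m: (abs(m - x), m))

def fl_chop(x):
    """Largest machine number not exceeding x (x positive and in range)."""
    return max(m for m in MACHINE if m <= x)

def rel_error(x, approx):
    return abs(approx - x) / abs(x)

x = Fraction(13, 10)   # 1.3 lies between the machine numbers 5/4 and 3/2
print(fl_round(x), rel_error(x, fl_round(x)))   # 5/4 1/26
print(fl_chop(x), rel_error(x, fl_chop(x)))     # 5/4 1/26
```

For 1.3 both strategies pick 5/4, with a relative error of 1/26, well inside the bounds 1/8 and 1/4.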

REQUIRED
IV.1c What other numbers can be represented if we drop the restriction of a normalised mantissa?

The numbers in the floating point system can also be drawn graphically. In the picture below the numbers are depicted on a one-dimensional axis.

[Graphics:Images/nb4_gr_184.gif]

REQUIRED
IV.1d
a) Use the function Table[] to form a list of all the positive normalized machine numbers on a three-digit, base 4 computer that has exponents in the range -2 to 2, including 0 and sign bits.
b) How many numbers are there in the list?
c) What are the smallest and largest numbers on the list?
d) ListPlot the Log of the Sorted, Flattened list of numbers. Why do the points tend to lie in a straight line?
e) Why is the line scalloped (highly irregular)?
f) Would the scallops be more or less pronounced in base 16?
(You might want to use the function BaseForm[])

Machine Arithmetic and Errors

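The arithmetic operators +, -, × and / all have a machine implementation, denoted by $\hat{+}$, $\hat{-}$, $\hat{\times}$ and $\hat{/}$ respectively. For any two machine numbers $u$ and $v$ the relation between these operators and their machine counterparts is given by

$u \,\hat{+}\, v$	=	$(u + v)(1 + \varepsilon)$
$u \,\hat{-}\, v$	=	$(u - v)(1 + \varepsilon)$
$u \,\hat{\times}\, v$	=	$(u \times v)(1 + \varepsilon)$
$u \,\hat{/}\, v$	=	$(u / v)(1 + \varepsilon)$

where $\varepsilon$ generally differs from one operation to the next. More generally, for any of the four operators $\circ$:

$u \,\hat{\circ}\, v = (u \circ v)(1 + \varepsilon)$, with $|\varepsilon| \le \eta$.

Note: instead of the ^ notation to emphasize the finite precision of the calculation there is another notation for this purpose: fl (). Thus $u \,\hat{+}\, v$ is equivalent to fl(u + v).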

For example consider the floating point system , where the minimum and maximum values for the exponent may be anything, since these are not relevant in this example. The machine precision for this system is given by . Further, take . What is ?

In exact arithmetic:

For the floating point system with a four digits mantissa this gives . The absolute error due to round-off in this case is , whereas the relative error is .

Definition
The absolute error for a number is the absolute value of the exact value minus the approximated value.

The relative error for a number different from zero is the absolute error divided by the absolute value of the exact value.
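The two definitions can be tried out with Python's `decimal` module, which allows a four-digit decimal mantissa to be imitated. The operands `u` and `v` below are hypothetical values chosen for this sketch, and `ROUND_HALF_DOWN` is used on the assumption that ties should go towards zero.

```python
from decimal import Decimal, Context, ROUND_HALF_DOWN

# A four-digit decimal context stands in for a base-10 system with t = 4.
ctx = Context(prec=4, rounding=ROUND_HALF_DOWN)

u = Decimal("1.234")   # hypothetical operands, both exactly representable
v = Decimal("5.678")

exact = u * v                  # default 28-digit context: exact for these inputs
machine = ctx.multiply(u, v)   # product rounded to four significant digits

abs_err = abs(exact - machine)
rel_err = abs_err / abs(exact)
print(exact, machine, abs_err)   # 7.006652 7.007 0.000348
print(rel_err < Decimal("0.0005"))   # True: below eta / 2 with eta = 10**-3
```

The absolute error depends on the magnitude of the result, whereas the relative error can be compared directly with the machine precision.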

The big question now is what the 'computer result' will be of an operation $x \circ y$, where $x$ and $y$ are real numbers. In general the numbers $x$ and $y$ will not be machine numbers, so there will be an error in representing these numbers. Moreover, as we have seen above, the operations themselves also introduce errors due to the limited mantissa length. It turns out that a distinction has to be made between the operators × and / on the one hand, and + and - on the other hand, when we want to compute $\hat{x} \,\hat{\circ}\, \hat{y}$.

Absolute error for × and /

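In the following the error in the computation with the × operator is discussed. By replacing the × operator by the / operator the error in the division operation is derived. If the relative errors in representing $x$ and $y$ are given by $\varepsilon_x$ and $\varepsilon_y$ respectively, the machine numbers $\hat{x}$ and $\hat{y}$ are given by $\hat{x} = x(1 + \varepsilon_x)$ and $\hat{y} = y(1 + \varepsilon_y)$. For $\hat{x} \,\hat{\times}\, \hat{y}$ we then find

$\hat{x} \,\hat{\times}\, \hat{y} = x(1 + \varepsilon_x)\, y(1 + \varepsilon_y)\,(1 + \varepsilon)$

for some $\varepsilon$ in the order of the machine precision. From this it follows that

$|\hat{x} \,\hat{\times}\, \hat{y} - x y| \approx |x y|\,|\varepsilon_x + \varepsilon_y + \varepsilon| \le 3 \eta\, |x y|$

In this derivation we have assumed (1) $|\varepsilon_x|, |\varepsilon_y|, |\varepsilon| \le \eta$ and ignored (2) the higher order terms in $\eta$ (see Wilkinson in: IMA Bull. 22 (11/12) p. 192-200, 1986). The relative error for the multiplication easily follows from the result

$\dfrac{|\hat{x} \,\hat{\times}\, \hat{y} - x y|}{|x y|} \le 3 \eta$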

Absolute error for + and -

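The error bounds for the + and - operators are less severe than for the × and / operators. As in the previous subsection we will only discuss the error bound for one of the operators, the + operator. The error bound for the subtraction can be derived in a similar way. Again, let the relative errors in representing $x$ and $y$ be given by $\varepsilon_x$ and $\varepsilon_y$ respectively. The machine numbers $\hat{x}$ and $\hat{y}$ are given by $\hat{x} = x(1 + \varepsilon_x)$ and $\hat{y} = y(1 + \varepsilon_y)$. For $\hat{x} \,\hat{+}\, \hat{y}$ we then find

$\hat{x} \,\hat{+}\, \hat{y} = \bigl(x(1 + \varepsilon_x) + y(1 + \varepsilon_y)\bigr)(1 + \varepsilon)$

for some $\varepsilon$ in the order of the machine precision. From this it follows that

$|\hat{x} \,\hat{+}\, \hat{y} - (x + y)| \approx |x \varepsilon_x + y \varepsilon_y + (x + y)\varepsilon| \le 2 \eta\, (|x| + |y|)$

In this derivation we have assumed (1) $|\varepsilon_x|, |\varepsilon_y|, |\varepsilon| \le \eta$ and ignored (2) the higher order terms in $\eta$. The relative error for the addition easily follows from the result

$\dfrac{|\hat{x} \,\hat{+}\, \hat{y} - (x + y)|}{|x + y|} \le 2 \eta\, \dfrac{|x| + |y|}{|x + y|}$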

The above implies that for addition and subtraction the absolute error in the result is proportional to the sum of the absolute values of the operands. Adding two operands with the same sign or subtracting operands with opposite sign also gives a small relative error compared to the exact result. But it can be seen immediately that when adding two operands with opposite sign or subtracting operands with equal sign the relative error may explode due to cancellation of numbers.
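The effect of cancellation can be made visible with a four-digit decimal context in Python. The inputs `x` and `y` are hypothetical values chosen so that they nearly cancel; the context settings are assumptions of this sketch.

```python
from decimal import Decimal, Context, ROUND_HALF_DOWN

ctx = Context(prec=4, rounding=ROUND_HALF_DOWN)   # four-digit decimal mantissa

x = Decimal("1.23456")   # hypothetical real inputs, not machine numbers
y = Decimal("1.23411")

x_hat = ctx.create_decimal(x)   # 1.235
y_hat = ctx.create_decimal(y)   # 1.234

def rel_error(exact, approx):
    return abs(approx - exact) / abs(exact)

# Same sign, added: the relative error stays near the machine precision.
rel_sum = rel_error(x + y, ctx.add(x_hat, y_hat))

# Same sign, subtracted: the leading digits cancel and the error explodes.
rel_diff = rel_error(x - y, ctx.subtract(x_hat, y_hat))

print(rel_sum < Decimal("0.001"))   # True
print(rel_diff > 1)                 # True: the error exceeds 100 %
```

The absolute errors of the two results are of similar size; only the much smaller exact difference in the denominator makes the relative error of the subtraction so large.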

ADVANCED
IV.1e Take the floating point system . The machine precision is . If and what is ? What is the absolute error? What is the relative error? Use machine implementation of "-" with arguments having the same exponent.
