Tuesday, October 23, 2018

Floating-point Arithmetic

Floating-point numbers

A mathematical notation offers a way to represent a number with a string of digits of any length and a radix point. If the position of the radix point is not specified, the string indicates an integer and the radix point is implicitly placed just after the least significant digit.

A mathematical notation can also represent numbers as the product of two factors: a string of significant digits in a given base (the significand) and a scale factor, where the scale factor is the base raised to an exponent. To obtain the value of the number, multiply the significand by the base raised to the exponent, which is equivalent to shifting the radix point by a number of places equal to the value of the exponent.

Scientific notation is an example: numbers are scaled by a power of ten, and the radix point appears immediately after the first digit.

Floating-point representation is similar to scientific notation. A floating-point number consists of:

  • a signed digit string of a fixed length in a given base (or radix), called the significand, coefficient or mantissa.
    • the length of the significand determines the precision to which numbers can be represented
    • the base can be two, ten or sixteen
  • a signed integer exponent (also called scale or characteristic), which modifies the magnitude of the number.

The number exactly represented has the following form:

  significand × base^exponent (significand ∈ Z, base ∈ Z and base ≥ 2, exponent ∈ Z)

Example:

 +3.1415 = +31415 × 10^-4
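
A minimal C sketch of this form, rebuilding +3.1415 from its significand, base and exponent (the variable names are just illustrative, and the computation itself uses binary doubles, so the result is itself an approximation):

    /* Rebuild +3.1415 as +31415 * 10^-4. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int significand = 31415;  /* signed digit string in base 10 */
        int base = 10;
        int exponent = -4;        /* shifts the radix point 4 places to the left */

        double value = significand * pow(base, exponent);
        printf("%d * %d^%d = %.4f\n", significand, base, exponent, value);
        return 0;
    }

Compiled with a C99 compiler (link with -lm), this prints 31415 * 10^-4 = 3.1415.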

Floating-point computation makes it possible to represent very small and very large real numbers with finite precision.
In computing, floating-point arithmetic is therefore used to represent approximations of real numbers.
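
A minimal sketch of that approximation, assuming an IEEE 754 double (the usual case): the decimal value 0.1 has no exact binary representation, so a well-known sum comes out slightly off.

    #include <stdio.h>

    int main(void)
    {
        double a = 0.1, b = 0.2;
        /* %.17g prints enough digits to expose the stored approximation */
        printf("0.1 + 0.2 = %.17g\n", a + b);
        printf("equal to 0.3? %s\n", (a + b == 0.3) ? "yes" : "no");
        return 0;
    }

On IEEE 754 hardware this prints 0.30000000000000004 and answers no.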

The term floating point refers to the fact that the radix point (the decimal point, or the binary point in computers) can float to the left or right, that is, it can be placed anywhere relative to the significant digits of the number.
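
The standard C library makes this floating binary point visible: frexp splits a double into a normalized fraction in [0.5, 1) and a power-of-two exponent. A short sketch:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int exponent;
        /* 6.25 decomposes as 0.78125 * 2^3 */
        double fraction = frexp(6.25, &exponent);
        printf("6.25 = %g * 2^%d\n", fraction, exponent);
        return 0;
    }

Moving the binary point three places is exactly what the exponent of 3 records.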

The IEEE standard for floating-point numbers in computers and programming languages

Over the years, different floating-point representations have been used in computers. In 1985, IEEE 754, the Standard for Floating-Point Arithmetic, was established by the Institute of Electrical and Electronics Engineers (IEEE). The IEEE 754 standard defines the representation of binary floating-point numbers.

The three IEEE standard number formats widely used in computer hardware and programming languages are listed below (a short C sketch after the list shows how to inspect their parameters):

  • single precision: the float type in the C language. It is a binary format that occupies 32 bits (4 bytes); the significand has a precision of 24 bits, which corresponds to about 7.2 decimal digits.
  • double precision: the double type in the C language. It is a binary format that occupies 64 bits (8 bytes); the significand has a precision of 53 bits, which corresponds to about 15.9 decimal digits.
  • double extended: also called the extended precision format. It is a binary format that occupies 80 bits; the significand has a precision of 64 bits, which corresponds to about 19 decimal digits. If the processor handles 80-bit floating point, the long double C type can be used.
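
These figures can be checked on a given machine through the <float.h> macros; a minimal sketch follows. Note that what long double maps to is platform-dependent (the 80-bit extended format on typical x86 toolchains, but possibly plain double or IEEE quadruple precision elsewhere), and that FLT_DIG, DBL_DIG and LDBL_DIG report the guaranteed decimal digits, slightly below the approximate figures quoted above.

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        printf("float:       %2zu bytes, %2d-bit significand, %2d decimal digits\n",
               sizeof(float), FLT_MANT_DIG, FLT_DIG);
        printf("double:      %2zu bytes, %2d-bit significand, %2d decimal digits\n",
               sizeof(double), DBL_MANT_DIG, DBL_DIG);
        printf("long double: %2zu bytes, %2d-bit significand, %2d decimal digits\n",
               sizeof(long double), LDBL_MANT_DIG, LDBL_DIG);
        return 0;
    }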
