Floating-point

From Conservapedia

Floating-point is a method in computing of approximating real numbers in a fixed number of bits. The data type for floating-point numbers is typically named either generically (float, real, etc.) or after the precision of the number (single, double, extended, etc.). A floating-point system can represent very large or very small numbers despite the fixed number of bits, because the position of the radix point (the binary counterpart of the decimal point) is stored with the numerical value. In contrast, fixed-point systems represent a smaller range of values, as the position of the radix point is fixed. The trade-off is that a fixed-point number can hold more precision within its range for a given number of bits.

A floating-point value contains mantissa (or significand) and exponent fields. The sizes of these fields determine the range and the precision of the number. Both fields are signed. The mantissa sign indicates the sign of the number (for instance, -10 or 10), whereas the sign of the exponent indicates whether the radix point is moved left or right by the magnitude of the exponent. (IEEE 754 stores the exponent in biased form rather than with a separate sign bit.) The exponent determines the maximum and minimum possible magnitudes, while the mantissa contains the significant digits. A mantissa of 16 bits would allow about 4.5 decimal digits of precision (16 minus 1 for the sign = 15 bits, with a maximum magnitude of 32767, and log10(32767) is about 4.5).

Each field can be sized as needed to meet specific needs of precision and numeric range. However, most floating-point values used in programs follow the most popular floating-point standard, implemented on many processors: the IEEE 754 standard.[1]
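
The field layout described above can be illustrated with a minimal sketch, assuming the IEEE 754 single-precision format (1 sign bit, 8 exponent bits stored with a bias of 127, 23 fraction bits). The helper names decompose_float32 and decimal_digits are hypothetical and exist only for this example.

```python
import math
import struct


def decompose_float32(x):
    """Split x, rounded to IEEE 754 single precision, into its three bit fields."""
    # Reinterpret the 4 bytes of a big-endian float32 as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 bit: 0 for positive, 1 for negative
    exponent = (bits >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF      # 23 bits of the mantissa after the leading 1
    return sign, exponent, fraction


def decimal_digits(magnitude_bits):
    """Approximate decimal digits of precision offered by a number of binary digits."""
    return magnitude_bits * math.log10(2)


# -10.0 is -1.25 * 2**3: sign 1, unbiased exponent 3, fraction 0.25 * 2**23.
sign, exponent, fraction = decompose_float32(-10.0)
print(sign, exponent - 127, hex(fraction))  # 1 3 0x200000

# Fifteen magnitude bits give roughly four and a half decimal digits.
print(round(decimal_digits(15), 2))  # 4.52
```

Subtracting the bias of 127 recovers the signed exponent, which shows how a stored, always non-negative field can still move the radix point in either direction.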

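Not every possible combination of bits represents a valid real number. Such invalid values are called NaN (Not a Number). Some floating-point software and hardware also reserves special bit patterns to represent infinity. In IEEE 754, infinities and NaNs share the same all-ones exponent pattern and are told apart by the mantissa: a zero mantissa denotes infinity, and any non-zero mantissa denotes a NaN.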
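
As a rough illustration of these special values, the sketch below assumes the IEEE 754 single-precision encoding, in which an all-ones exponent field marks either an infinity (zero fraction) or a NaN (non-zero fraction). The helper names float32_bits and classify are hypothetical.

```python
import math
import struct


def float32_bits(x):
    """Return the raw 32-bit pattern of x encoded as IEEE 754 single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


def classify(bits):
    """Classify a 32-bit pattern as 'finite', 'infinity', or 'nan'."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent != 0xFF:
        return "finite"
    # All-ones exponent: the fraction field separates infinity from NaN.
    return "infinity" if fraction == 0 else "nan"


print(classify(float32_bits(1.5)))        # finite
print(classify(float32_bits(math.inf)))   # infinity
print(classify(float32_bits(-math.inf)))  # infinity
print(classify(float32_bits(math.nan)))   # nan

# A NaN compares unequal to everything, including itself.
print(math.nan == math.nan)  # False
```

Because a NaN never compares equal to itself, code that needs to detect one should use a dedicated check such as math.isnan rather than the == operator.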

References

  1. http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html