CS 208 s21 — IEEE Floating Point

1 Introduction

We will not cover all the complexities of the 50-page IEEE floating-point standard.
It saved us from the wild west of the early days of computing, when every manufacturer designed its own floating-point representation
- These were typically optimized for performance at the cost of accuracy
What do we want from a floating point standard?
- Scientists/numerical analysts want them to behave as much like real numbers as possible
- Engineers want them to be easy to implement and fast
- Scientists mostly won, as floating-point operations can be several times slower than integer operations
Basic idea: represent numbers in binary scientific notation

The idea of a binary fraction is part of the IEEE representation, so let's start with that.

$fractional-binary.png$
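
A minimal Python sketch of how a binary fraction is evaluated: bits to the left of the binary point carry weights $2^0, 2^1, \dots$ and bits to the right carry weights $2^{-1}, 2^{-2}, \dots$. The helper name `binary_fraction_value` is made up for illustration.

```python
def binary_fraction_value(bits: str) -> float:
    """Value of a binary string such as '101.11' (hypothetical helper)."""
    whole, _, frac = bits.partition(".")
    value = 0.0
    # Bits left of the binary point weigh 2^0, 2^1, 2^2, ... from right to left.
    for i, b in enumerate(reversed(whole)):
        value += int(b) * 2 ** i
    # Bits right of the binary point weigh 2^-1, 2^-2, ... from left to right.
    for j, b in enumerate(frac, start=1):
        value += int(b) * 2 ** -j
    return value


print(binary_fraction_value("101.11"))  # 4 + 1 + 1/2 + 1/4 = 5.75
print(binary_fraction_value("0.111"))   # 1/2 + 1/4 + 1/8 = 0.875
```

Note that `0.111` is just below 1: adding more 1 bits after the point gets closer to 1 without reaching it.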

The value $V$ of a floating point number is computed using the following formula: $V = (-1)^s \times M \times 2^E$
- sign: $s$ indicates positive or negative (sign bit when $V=0$ is special case)
- significand: $M$ is a fractional binary number, either between 1 and $2 - \epsilon$ (normalized form) or between 0 and $1 - \epsilon$ (denormalized form)
- exponent: $E$ weights by a power of 2 (can be negative power)
We will distribute our 6 bits (1 bit for s, 3 bits for exp, 2 bits for frac) to represent these quantities as follows:
- s encodes $s$
- exp encodes $E$ in biased form
  - normally, $E$ = exp $- Bias$ where the $k$ bits of exp are treated as an unsigned integer and $Bias = 2^{k-1} - 1$
- frac encodes the binary fraction $f = 0.f_{n-1}\cdots f_1f_0$, and the significand is normally $M = 1 + f$
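
The bias formula can be checked with a short Python sketch. The function names `bias` and `exponent` are invented here. The 8-bit and 11-bit widths are the exp field sizes of IEEE single and double precision.

```python
def bias(k: int) -> int:
    """Bias for a k-bit exp field: 2^(k-1) - 1."""
    return 2 ** (k - 1) - 1


def exponent(exp: int, k: int) -> int:
    """E for a normalized value, treating exp as an unsigned integer."""
    return exp - bias(k)


print(bias(3))    # 3-bit exp field of the 6-bit format: 3
print(bias(8))    # 8-bit exp field (IEEE single precision): 127
print(bias(11))   # 11-bit exp field (IEEE double precision): 1023

# exp = 0 and exp = all ones are left out because IEEE treats them specially.
print([exponent(e, 3) for e in range(1, 7)])  # [-2, -1, 0, 1, 2, 3]
```

With a 3-bit exp field, the unsigned values 1 through 6 cover both negative and positive powers of 2, roughly centered on $2^0$.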

Take the same six bits from before, 0b1 011 10, what value do they represent under this scheme?
- s is 1, so value is negative
- exp is 3, so $E = 3 - 3 = 0$
- frac is 0.5, so $M = 1 + 0.5 = 1.5$
- $V = -1.5 \times 2^0 = -1.5$
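
The worked example can be replayed in code. This is a sketch that applies only the normalized formulas ($E$ = exp $- Bias$, $M = 1 + f$) to the 6-bit layout with a 3-bit exp and 2-bit frac. The function name `decode_normalized` and its parameters `k` and `n` are illustrative, not part of any standard API.

```python
def decode_normalized(bits: int, k: int = 3, n: int = 2) -> float:
    """Decode a (1 + k + n)-bit pattern using the normalized formulas only."""
    frac = bits & ((1 << n) - 1)          # low n bits
    exp = (bits >> n) & ((1 << k) - 1)    # next k bits, as an unsigned integer
    s = (bits >> (n + k)) & 1             # top bit
    bias = 2 ** (k - 1) - 1
    E = exp - bias
    M = 1 + frac / 2 ** n                 # implied leading 1 plus the fraction f
    return (-1) ** s * M * 2.0 ** E


print(decode_normalized(0b101110))  # s=1, exp=3, frac=0b10 -> -1.5
print(decode_normalized(0b001110))  # same bits with s=0 -> 1.5
print(decode_normalized(0b010000))  # exp=4, frac=0 -> 1.0 * 2^1 = 2.0
```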

we want to represent very small and very large numbers, so we need the exponent to be signed
- this suggests encoding exp as a two's complement integer
we want floating-point operations to be fast in hardware
- easier to compare floats if more 1s in exp means bigger number
clever trick: store exp as unsigned with implicit bias
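- in fact, by putting exp in between s and frac, non-negative floats sort in the same order as their bit patterns read as integers, so integer comparison hardware can largely be reused for floating-point comparisons (negative values need extra handling because floats are sign-magnitude)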
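
The ordering property can be observed on real 32-bit floats. The sketch below uses Python's standard `struct` module to reinterpret a single-precision float as an unsigned integer. The helper name `float_bits` is made up for this example.

```python
import struct


def float_bits(x: float) -> int:
    """Bit pattern of x as an IEEE single-precision float, read as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


# Non-negative floats in increasing order...
values = [0.0, 1e-40, 0.5, 1.0, 1.5, 2.0, 1e10]
patterns = [float_bits(v) for v in values]

# ...have bit patterns that are also in increasing order.
print(patterns == sorted(patterns))  # True

print(hex(float_bits(1.0)))  # 0x3f800000
print(hex(float_bits(2.0)))  # 0x40000000

# Negative floats are sign-magnitude, so the integer order is reversed:
# -2.0 < -1.0 as floats, but the pattern of -2.0 is the larger integer.
print(float_bits(-2.0) > float_bits(-1.0))  # True
```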

When exp is 0, the representation switches from normalized to denormalized form
- $E = 1 - Bias$
- $M = f$
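
To see why $E = 1 - Bias$ is used rather than $0 - Bias$, it helps to list the smallest values of the 6-bit format. The sketch below assumes the 1/3/2 bit layout and handles only the normalized and denormalized cases. It does not model the special meaning IEEE gives to an all-ones exp. The name `decode_tiny` is invented for illustration.

```python
def decode_tiny(bits: int, k: int = 3, n: int = 2) -> float:
    """Decode a (1 + k + n)-bit pattern, switching to denormalized form when exp == 0."""
    frac = bits & ((1 << n) - 1)
    exp = (bits >> n) & ((1 << k) - 1)
    s = (bits >> (n + k)) & 1
    bias = 2 ** (k - 1) - 1
    f = frac / 2 ** n
    if exp == 0:
        E, M = 1 - bias, f        # denormalized: no implied leading 1
    else:
        E, M = exp - bias, 1 + f  # normalized: implied leading 1
    return (-1) ** s * M * 2.0 ** E


# Bit patterns 0b000000 .. 0b000111: four denormalized values, then four normalized ones.
smallest = [decode_tiny(b) for b in range(0b000000, 0b001000)]
print(smallest)
# [0.0, 0.0625, 0.125, 0.1875, 0.25, 0.3125, 0.375, 0.4375]
```

The values step evenly by $2^{-4} = 0.0625$ across the boundary between the largest denormalized value (0.1875) and the smallest normalized value (0.25), so there is no gap around zero.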

IEEE standard specifies four rounding modes
- the default is round-to-nearest-even, which helps avoid statistical bias in practice because ties are rounded up about half the time and down the other half
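
A small sketch of round-to-even on ties, assuming Python's `float` is an IEEE double with the default rounding mode (true on common platforms). Between $2^{53}$ and $2^{54}$, doubles are spaced 2 apart, so every odd integer in that range is an exact tie.

```python
# Tie between 2**53 and 2**53 + 2: the even significand is 2**53, so round down.
a = float(2**53 + 1)

# Tie between 2**53 + 2 and 2**53 + 4: the even significand is 2**53 + 4, so round up.
b = float(2**53 + 3)

print(a == 2**53)      # True (rounded down)
print(b == 2**53 + 4)  # True (rounded up)
```

One tie rounds down and the next rounds up, which is how rounding errors tend to cancel rather than accumulate in one direction.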
In general, perform exact computation and then round to something representable with available bits
- can underflow if closest representable value is 0
- can overflow if $E$ is too big to fit in exp (result is $\pm\infty$)
- rounding breaks associativity: $(a + b) + c$ can differ from $a + (b + c)$
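
The three effects above can be observed directly, assuming Python's `float` is an IEEE double (true on common platforms). The variable names are illustrative.

```python
import math

# Overflow: the exact product is too large for the exp field, so the result is +infinity.
big = 1e308 * 10
print(math.isinf(big))  # True

# Underflow: 5e-324 is the smallest positive double; half of it rounds to 0.
tiny = 5e-324 / 2
print(tiny == 0.0)  # True

# Associativity: 3.14 is lost when added to 1e20 first, because 1e20 + 3.14 rounds to 1e20.
left = (3.14 + 1e20) - 1e20
right = 3.14 + (1e20 - 1e20)
print(left, right)  # 0.0 3.14
```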