# CS 208 s20 — IEEE Floating Point

## Table of Contents

## 1 Introduction

- We will not cover all the complexities of the 50-page IEEE floating point standard
- It saved us from the wild west in the early days of computing where every manufacturer was designing their own floating point representation
- These were typically optimized for performance at the cose of accuracy

- What do we want from a floating point standard?
- Scientists/numerical analysts want them to be as real as possible
- Engineers want them to be easy to implement and fast
- Scientists mostly won, as floating-point operations can be several times slower than integer operations

- Basic idea: represent numbers in binary scientific notation

## 2 Fractional Binary Numbers

The idea of a binary fraction is part of the IEEE representation, so let's start with that.

- Example
`0b10.1010`

= \(1\times2^1 + 1\times2^{-1} + 1\times2^{-3} = 2.625\) **Exercise**: what's the closest we can get to 1/3 with 6 bits?^{1}

fraction | decimal |
---|---|

1/2 | 0.5 |

1/4 | 0.25 |

1/8 | 0.125 |

1/16 | 0.0625 |

1/32 | 0.03125 |

1/64 | 0.015625 |

**Exercise**: What is the base 10 equivalent of the 6-bit two's complement fixed-point number`0b101.110`

?^{2}

## 3 IEEE floating point in 6 bits

- The value \(V\) of a floating point number is computed using the following formula: \(V = (-1)^s \times M \times 2^E\)
*sign*: \(s\) indicates positive or negative (sign bit when \(V=0\) is special case)*significand*: \(M\) fractional binary number between 1 and \(2 - \epsilon\) or between 0 and \(1 - \epsilon\)*exponent*: \(E\) weights by a power of 2 (can be negative power)

- We will choose to distribute our 6 bits to represent these quantities as follows:
`s`

encodes \(s\)`exp`

encodes \(E\) in*biased*form- normally, \(E\) =
`exp`

\(- Bias\) where the \(k\) bits of`exp`

are treated as an unsigned integer and \(Bias = 2^{k-1} - 1\)

- normally, \(E\) =
`frac`

is the binary fraction \(0.f_{n-1}\cdots f_1f_0\), and the significand is \(M = 1 + f\)

### 3.1 Example

- Take the same six bits from before,
`0b1 011 10`

, what value do they represent under this scheme?`s`

is 1, so value is negative`exp`

is 3, so \(E = 3 - 3 = 0\)`frac`

is 0.5, so \(M = 1 + 0.5 = 1.5\)- V = -1.5 × 2
^{0}= -1.5

### 3.2 Exercise

- With 1 sign bit, 3 exponent bits, and 2 significand bits, how close can we get to 1/3
^{3}- using the formula from the example above

### 3.3 Why the biased form for exponents?

- we want to represent very small and very large numbers, so we need the exponent to be signed
- this suggests encoding
`exp`

as a two's complement integer

- this suggests encoding
- we want floating-point operations to be fast in hardware
- easier to compare floats if more 1s in
`exp`

means bigger number

- easier to compare floats if more 1s in
- clever trick: store
`exp`

as unsigned with implicit bias- in fact, by putting
`exp`

in between`s`

and`frac`

, the same hardware can do two's complement comparisons and floating-point comparisons

- in fact, by putting

### 3.4 Denormalized values

- When
`exp`

is 0, the representation switches from normalized to denormalized form- \(E = 1 - Bias\)
- \(M = f\)

## 4 Real IEEE

### 4.1 Reading: IEEE Floating Point Representation

Read section 2.4.2 from the CSPP book (p. 112–115), and take a look at figure 2.34 on p. 115.

### 4.2 Special cases

## 5 Useful simulation

## 6 Arithmetic

- IEEE standard specifies four
*rounding modes*- typically round-to-even, helps avoid statistical bias in practice by distributing rounding between rounding up and rounding down

- In general, perform exact computation and then round to something representable with available bits
- can
*underflow*if closest representable value is 0 - can
*overflow*if \(E\) is too big to fit in`exp`

(result is \(\pm\infty\)) - rounding breaks associtivity

- can

## 7 Homework

- Do practice problem 2.47 in the CSPP book (p. 117)
- Remember that lab 0 is due at 9pm tonight (April 20). You can use late days, see the course web page for the late work policy.
- Lab 1 has been posted. It will be due 9pm Wednesday, April 29. Part of the provided testing framework, the executable
`dlc`

, will only run on a Linux system or VM.

## Footnotes:

^{1}

`.010101`

(1/4 + 1/16 + 1/64 = 0.328125)

^{2}

Two's complement means the most significant bit has negative weight: `0b101.110`

= \(-4 + 1 + 0.5 + 0.25 = -2.25\)

^{3}

`0 001 01`

(sign exponent significand) yields \(1.25 \times 2^{1 - 3} = 1.25 \times 2^{-2} = 0.3125\).
This isn't as close as with a 6-bit binary fraction we could structure however we wanted, but an unstructured representation would require extra bits to locate the binary point.