Lecture 16 - Floating Point, FPGAs vs. ARM

We'll first look at how floating point works; this is not a how-to tutorial on building floating-point ALUs. Then we'll talk about the difference between FPGAs and an MCU (ARM architecture), including why FPGAs are good at workloads such as AI.

Floating Point

We'll look at how floating point works. A floating-point number has the form:

X = \pm S \cdot B^{\pm E}

where $S$ is the significand, $B$ is the base, and $E$ is the exponent. The leading sign applies to $S$, which is itself stored unsigned, while $E$ may be a signed number.

For example, to add numbers in floating point, we shift one of the numbers so both exponents match, add the significands, then shift the result back into normalized form:

\begin{aligned} 3.21 \cdot 10^{3} + 0.79 \cdot 10^{4} & = 3.21 \cdot 10^{3} + 7.9 \cdot 10^{3} \\ & = 11.11 \cdot 10^{3} \\ & = 1.111 \cdot 10^{4} \end{aligned}
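The same align, add, renormalize sequence can be sketched in Python. This is only an illustration: the `(significand, exponent)` tuple representation and the names `normalize` and `fp_add` are assumptions made for this example, and exact fractions are used so that no rounding is modeled.

```python
from fractions import Fraction

def normalize(sig, exp, base=10):
    """Rescale so that 1 <= |sig| < base (zero is returned unchanged)."""
    if sig == 0:
        return sig, 0
    while abs(sig) >= base:
        sig /= base
        exp += 1
    while abs(sig) < 1:
        sig *= base
        exp -= 1
    return sig, exp

def fp_add(a, b, base=10):
    """Add two (significand, exponent) pairs."""
    (sa, ea), (sb, eb) = a, b
    # Align both operands to the smaller exponent, as in the worked example.
    e = min(ea, eb)
    sa *= Fraction(base) ** (ea - e)
    sb *= Fraction(base) ** (eb - e)
    # Add the aligned significands, then shift back into normalized form.
    return normalize(sa + sb, e, base)

x = (Fraction("3.21"), 3)
y = (Fraction("0.79"), 4)
sig, exp = fp_add(x, y)
print(float(sig), exp)  # 1.111 4
```

Real hardware works in base 2 with a fixed number of significand bits, so the alignment shift can discard low-order bits; that loss is not modeled here.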

And for multiplication, we just multiply the significands and add the exponents.
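As a worked example, reusing the two operands from the addition above (the arithmetic here is ours, not from the lecture):

\begin{aligned} (3.21 \cdot 10^{3}) \cdot (0.79 \cdot 10^{4}) & = (3.21 \cdot 0.79) \cdot 10^{3 + 4} \\ & = 2.5359 \cdot 10^{7} \end{aligned}

In this case the product of the significands already lies in $[1, 10)$, so no renormalizing shift is needed.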

Values can fall outside the representable range in two places: overflow toward $\pm \infty$, and underflow for values near 0.

For instance, with 4 bits for $S$ and 2 bits for an unsigned $E$, the representable values form a new number line.

Notice the logarithmic scaling of numbers in floating point.
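To see that spacing concretely, here is a small sketch that enumerates a toy format. The exact encoding is an assumption (the lecture figure is not reproduced here): the value is taken to be $S \cdot 2^{E}$ with $S$ a 4-bit unsigned integer and $E$ a 2-bit unsigned integer.

```python
# Toy format (assumed): value = S * 2**E,
# with S a 4-bit unsigned integer and E a 2-bit unsigned integer.
values = sorted({s * 2**e for s in range(16) for e in range(4)})

# Distinct gaps between neighbouring representable values.
gaps = sorted({b - a for a, b in zip(values, values[1:])})

print(len(values), values[-1])  # 40 120
print(gaps)                     # [1, 2, 4, 8]
```

The gap between neighbouring values doubles each time the exponent increases, which is the logarithmic spacing noted above.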

ANSI/IEEE Standard

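For 32-bit (single-precision) floats, we use 8 bits for the exponent (bias = 127) and 23+1 bits for $S$ (23 stored bits plus 1 hidden bit)
For 64-bit (double-precision) floats, we use 11 bits for the exponent (bias = 1023) and 52+1 bits for $S$ (52 stored bits plus 1 hidden bit)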

Other specifications that are required:

Base 2
Significand range $[1, 2)$ with a hidden most-significant bit of 1
Support for gradual underflow
True zero is represented by all 0's
Special codes for $\pm \infty$ and NaN
Add, subtract, multiply, divide, square root
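The single-precision layout described above can be inspected from Python with the standard `struct` module. This is a minimal sketch; the helper names `decode_float32` and `value_of` are made up for this example.

```python
import math
import struct

def decode_float32(x):
    """Split x (rounded to single precision) into sign, biased exponent, fraction."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, bias 127
    fraction = bits & 0x7FFFFF       # 23 stored significand bits
    return sign, exponent, fraction

def value_of(sign, exponent, fraction):
    """Rebuild the numeric value from the three fields."""
    s = -1.0 if sign else 1.0
    if exponent == 0xFF:             # special codes: infinity and NaN
        return s * math.inf if fraction == 0 else math.nan
    if exponent == 0:                # zero and gradual underflow: no hidden 1
        return s * (fraction / 2**23) * 2.0**-126
    return s * (1 + fraction / 2**23) * 2.0**(exponent - 127)

for x in (1.0, -2.5, 0.0, math.inf):
    fields = decode_float32(x)
    print(x, fields, value_of(*fields))
```

Note how `1.0` stores a fraction of 0: the leading 1 of the significand is the hidden bit and is never written to memory.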

FPGAs vs. ARM

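Recall the AND gate truth table, with $A, B$ as inputs. Suppose we have a small memory with 4 addresses, one per input combination, and store the gate's output at each address: $0, 0, 0, 1$ for addresses $00, 01, 10, 11$. This memory is a lookup table (LUT).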

We can represent any gate using this methodology. XOR would be $0, 1, 1, 0$ for addresses $00, 01, 10, 11$, in that order.
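A minimal sketch of the idea, assuming the address is formed as $AB$ with $A$ as the high bit (the names `AND_LUT`, `XOR_LUT`, and `lut2` are illustrative only):

```python
# Each 2-input gate is a 4-entry memory; the inputs (A, B) form the address.
# Entries are listed for addresses 00, 01, 10, 11 (assumed ordering).
AND_LUT = [0, 0, 0, 1]
XOR_LUT = [0, 1, 1, 0]

def lut2(table, a, b):
    """Return the gate output stored at address (a, b)."""
    return table[(a << 1) | b]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, lut2(AND_LUT, a, b), lut2(XOR_LUT, a, b))
```

Changing the gate means changing only the four stored values; the lookup hardware stays the same.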

But we don't just have one gate; we usually have a cascade of gates. We want to send the output of one gate to the input of any other gate, which is done using muxes. When you see LUT (lookup table) usage reported for, say, a Basys board, that's essentially what it's referring to.
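As a rough sketch of cascading, the sum bit of a full adder can be built from two XOR LUTs, with muxes choosing which signal feeds each LUT input. The select values stand in for configuration bits and, like the function names, are hypothetical.

```python
XOR_LUT = [0, 1, 1, 0]

def lut2(table, a, b):
    """Return the gate output stored at address (a, b)."""
    return table[(a << 1) | b]

def mux(signals, select):
    """Routing mux: pick which available signal drives a LUT input."""
    return signals[select]

def sum_bit(a, b, cin):
    # Signals available for routing: the primary inputs first,
    # then each LUT output as it is produced.
    signals = [a, b, cin]
    # First LUT: inputs routed from signals 0 and 1, computes a XOR b.
    signals.append(lut2(XOR_LUT, mux(signals, 0), mux(signals, 1)))
    # Second LUT: inputs routed from the first LUT's output (index 3) and cin (index 2).
    signals.append(lut2(XOR_LUT, mux(signals, 3), mux(signals, 2)))
    return signals[4]

print([sum_bit(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)])
```

Rewiring the circuit amounts to changing the select values, not the LUTs themselves.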

Ultimately, LUT outputs are wired into fast carry logic (which speeds up common structures like carry-lookahead adders), then into D flip-flops, and are finally routed onward by muxes.

Sending Outputs of Gates to Other Gates

An FPGA manufacturer's software tools perform placement, putting gates that refer to each other physically near each other, and then route the connections between them. Hence the name place and route. But notice that this problem is NP-hard, so in practice something like Vivado tries many candidate solutions and keeps the best one it finds.
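The guess-and-keep-the-best idea can be sketched as a toy placer. This is not how Vivado is implemented: production tools use much more sophisticated heuristics, and the cost function (total Manhattan wirelength), the grid, and every name below are assumptions for illustration.

```python
import random

def wirelength(placement, nets):
    """Total Manhattan distance over all two-pin nets."""
    total = 0
    for u, v in nets:
        (x1, y1), (x2, y2) = placement[u], placement[v]
        total += abs(x1 - x2) + abs(y1 - y2)
    return total

def random_placement(gates, slots, rng):
    """Assign each gate to a distinct grid slot at random."""
    return dict(zip(gates, rng.sample(slots, len(gates))))

def place(gates, nets, slots, tries=200, seed=0):
    """Try many random placements and keep the cheapest one."""
    rng = random.Random(seed)
    best = random_placement(gates, slots, rng)
    for _ in range(tries - 1):
        candidate = random_placement(gates, slots, rng)
        if wirelength(candidate, nets) < wirelength(best, nets):
            best = candidate
    return best

gates = ["g0", "g1", "g2", "g3"]
nets = [("g0", "g1"), ("g1", "g2"), ("g2", "g3")]  # a chain of four gates
slots = [(x, y) for x in range(3) for y in range(3)]
best = place(gates, nets, slots)
print(best, wirelength(best, nets))
```

Connected gates end up close together because long wires raise the cost, which is the intuition behind placement.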

A routing switch built from transistors can send an input signal arriving on the left in up to 3 separate directions.

To connect gates across a large set of gates, we use a global interconnect, which can connect any gate to any other. A local interconnect connects only the gates within a set region.

ARM Cortex M4 Core

The ARM Cortex core on the STM32 offers:

Programming in embedded C
FPU (single precision)
DSP support (single-cycle multiply-accumulate)
Two ISAs:
32-bit ARM instruction Set
16-bit Thumb Instruction Set

There are 7 basic operating modes; three of them are:

User: unprivileged mode under which most tasks are run (only certain access to memory)
IRQ: entered when an interrupt is raised
Supervisor: a privileged mode entered on reset and when a software interrupt is executed

There are 37 registers, each 32 bits wide:

1 dedicated PC
1 current program status register
5 dedicated saved program status registers
30 GP registers

All of this is almost a carbon copy of the OTTER. But there's also a bus interface named AMBA, short for Advanced Microcontroller Bus Architecture. It's a standard that defines how the IO of the MCU communicates with the bus, specifying the wires between the two.

The APB (Advanced Peripheral Bus) is one of the buses defined by AMBA. You've seen it in your register code!

What to Choose?

We detail the pros and cons below:

FPGA
- Good at bit-wise, low-bit precision computation
- Highly regular workloads (any written function would need to contain a LUT of all inputs mapped to outputs)
- Static applications
- May have higher performance and more predictable timing
Processors
- Good at uncertain workloads
- Dynamic execution
- Less Expensive
- Easier Development (easier to write Python over SystemVerilog)

So for instance:

GPU-style workloads are a good fit for FPGAs since the color data can be reduced to a small bit width
TPU-style workloads may be better on FPGAs since they're highly regular (multiply/add) and also need higher performance
CPU-style workloads are better on a processor since the instruction flow is highly irregular (think many if/else)

Note there are ways to be 'in-between'. A GPGPU uses a large number of small cores that are all combined in a LUT/FPGA-like way.