Lecture 16 - Floating Point, FPGAs vs. ARM

This is not a how-to tutorial on building floating point ALUs or whatnot. Then we'll talk about the difference between FPGAs and using an MCU (ARM architecture), and why FPGAs are really good at things like AI workloads.

Floating Point

We'll look at how floating point works. A floating point number looks like this:

X = ±S × B^(±E)

where the sign bit applies to S (the significand), B is the base, and E is the exponent, which may be signed (S never is).

For example, to add two floating point numbers, we shift one of them so the exponents match, add the significands, then renormalize:

3.21 × 10^3 + 0.79 × 10^4 = 3.21 × 10^3 + 7.9 × 10^3 = 11.11 × 10^3 = 1.111 × 10^4

And for multiplication, we just multiply the significands and add the exponents.
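
Here's a minimal sketch of both rules in C, using a toy decimal (significand, exponent) pair rather than real IEEE-754 hardware, just to make the shifting and renormalizing concrete:

```c
#include <stdio.h>

/* Toy decimal floating point: value = sig * 10^exp.
 * Not IEEE-754 -- just enough to show the add/multiply rules above. */
typedef struct { double sig; int exp; } fp_t;

/* Keep the significand below 10 (the "renormalize" step). */
static fp_t normalize(fp_t x) {
    while (x.sig >= 10.0) { x.sig /= 10.0; x.exp++; }
    return x;
}

/* Add: shift the smaller-exponent operand until exponents match,
 * add the significands, then renormalize. */
static fp_t fp_add(fp_t a, fp_t b) {
    while (a.exp < b.exp) { a.sig /= 10.0; a.exp++; }
    while (b.exp < a.exp) { b.sig /= 10.0; b.exp++; }
    return normalize((fp_t){ a.sig + b.sig, a.exp });
}

/* Multiply: multiply the significands, add the exponents. */
static fp_t fp_mul(fp_t a, fp_t b) {
    return normalize((fp_t){ a.sig * b.sig, a.exp + b.exp });
}

int main(void) {
    fp_t a = { 3.21, 3 }, b = { 0.79, 4 };          /* 3.21e3 and 0.79e4 */
    fp_t s = fp_add(a, b), p = fp_mul(a, b);
    printf("sum  = %.3f x 10^%d\n", s.sig, s.exp);  /* 1.111 x 10^4 */
    printf("prod = %.4f x 10^%d\n", p.sig, p.exp);  /* 2.5359 x 10^7 */
    return 0;
}
```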

We can have overflow at both the positive and negative ends of the number line, as well as underflow for values near 0:

For instance, if we have 4 bits for S and 2 bits for an unsigned E, we get a new number line:

Notice the logarithmic scaling of numbers in floating point.
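
To see where that spacing comes from, here's a tiny sketch that enumerates an assumed toy encoding (value = S × 2^E with S an unsigned 4-bit integer and E an unsigned 2-bit integer; the slide's exact format may differ):

```c
#include <stdio.h>

int main(void) {
    /* Assumed toy encoding: value = S * 2^E, S in 0..15, E in 0..3.
     * The gap between neighboring values is 2^E, so representable
     * numbers are dense near 0 and spread out as they grow. */
    for (int e = 0; e < 4; e++)
        for (int s = 0; s < 16; s++)
            printf("E=%d S=%2d -> %3d\n", e, s, s << e);
    return 0;
}
```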

ANSI/IEEE Standard

For 32-bit (single-precision) floats, we use 8 bits for the exponent (bias = 127) and 23+1 bits for S (the +1 is the implicit leading 1)
For 64-bit (double-precision) floats, we use 11 bits for the exponent (bias = 1023) and 52+1 bits for S
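
As a quick check on those field widths, here's a small sketch that pulls the single-precision fields out of a float by hand (memcpy is the portable way to reinterpret the bits):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = -6.25f;                         /* -1.5625 * 2^2 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* reinterpret the 32 bits */

    unsigned sign = bits >> 31;               /* 1 sign bit */
    unsigned exp  = (bits >> 23) & 0xFF;      /* 8 biased exponent bits */
    unsigned frac = bits & 0x7FFFFF;          /* 23 stored fraction bits */

    printf("sign=%u  exp=%u (unbiased %d)  frac=0x%06X\n",
           sign, exp, (int)exp - 127, frac);
    /* prints: sign=1  exp=129 (unbiased 2)  frac=0x480000
       value = -1.(frac) * 2^(exp - 127), with the implicit leading 1 */
    return 0;
}
```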

Other specifications that are required:

FPGAs vs. ARM

Recall the AND gate truth table, with A and B as inputs. Suppose I have a memory with 4 addresses: use {A, B} as the address and store the truth table's output column in those 4 entries:

But we can represent any gate using this methodology: the stored bits for XOR would be 0, 1, 1, 0 for addresses 00, 01, 10, 11 (and 1, 0, 0, 1 would give XNOR).
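
In other words, a 2-input LUT is just a 4-entry memory whose contents decide which gate it implements. A minimal sketch of the idea in C:

```c
#include <stdio.h>

/* A 2-input LUT: a 4-entry memory addressed by {A, B}.
 * Changing the 4 stored bits changes which "gate" it is. */
static int lut_read(const int lut[4], int a, int b) {
    return lut[(a << 1) | b];              /* address = AB */
}

int main(void) {
    const int AND_LUT[4] = {0, 0, 0, 1};   /* addresses 00, 01, 10, 11 */
    const int XOR_LUT[4] = {0, 1, 1, 0};

    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            printf("A=%d B=%d  AND=%d  XOR=%d\n",
                   a, b, lut_read(AND_LUT, a, b), lut_read(XOR_LUT, a, b));
    return 0;
}
```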

But we don't just have one gate; we usually have a cascade of gates. We want to send the output of one gate to the input of some other, arbitrary gate, and that routing is done with muxes. When you see LUT usage reported for, say, a Basys board, this is essentially what it's referring to.

Ultimately, LUT outputs feed into fast carry logic (used for common accelerators like carry-lookahead adders), then into D flip-flops, and everything is routed between these blocks by MUXes.
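
Here's a tiny software model of that routing idea: a configuration bit feeding a mux picks which earlier LUT's output drives the next LUT's input (the particular LUT contents and connections here are made up for illustration):

```c
#include <stdio.h>

static int lut2(const int lut[4], int a, int b) { return lut[(a << 1) | b]; }
static int mux2(int sel, int in0, int in1)      { return sel ? in1 : in0; }

int main(void) {
    const int AND_LUT[4] = {0, 0, 0, 1};
    const int OR_LUT[4]  = {0, 1, 1, 1};
    const int XOR_LUT[4] = {0, 1, 1, 0};

    int a = 1, b = 0, c = 1;
    int route_sel = 1;                    /* "configuration bit" for the mux */

    int and_out = lut2(AND_LUT, a, b);
    int or_out  = lut2(OR_LUT,  a, b);
    int routed  = mux2(route_sel, and_out, or_out);  /* mux picks the source */
    int out     = lut2(XOR_LUT, routed, c);

    printf("out = %d\n", out);            /* (a OR b) XOR c = 0 here */
    return 0;
}
```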

Sending Outputs of Gates to Other Gates

Placement, in an FPGA manufacturer's software package, tries to put gates that reference each other physically near each other; hence the name "place and route" makes more sense. But notice that this problem is NP-hard, so what really happens is that a tool like Vivado sweeps a bunch of candidate placements and then chooses the best one it found (a toy sketch of this idea appears at the end of this subsection).

The above shows a way to use transistors to send an input signal on the left out in up to 3 separate directions.

To connect gates within a large set of gates, we use the global interconnect, which can connect any gate to any other. A local interconnect only connects gates within a set region.
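
To make the "sweep of guesses" idea from place and route concrete, here's a toy sketch that tries a bunch of random placements of a few connected blocks on a grid and keeps the one with the shortest total wire length. Real tools use far smarter heuristics (and would never let two blocks share a site), but the spirit is the same:

```c
#include <stdio.h>
#include <stdlib.h>

#define N_BLOCKS 4
#define N_TRIES  1000

/* Nets: pairs of blocks that must be wired together (made-up example). */
static const int nets[][2] = { {0, 1}, {1, 2}, {2, 3}, {0, 3} };

/* Total Manhattan wire length for one candidate placement. */
static int wirelength(const int x[], const int y[]) {
    int total = 0;
    for (unsigned i = 0; i < sizeof nets / sizeof nets[0]; i++) {
        int a = nets[i][0], b = nets[i][1];
        total += abs(x[a] - x[b]) + abs(y[a] - y[b]);
    }
    return total;
}

int main(void) {
    int best = 1 << 30;
    for (int t = 0; t < N_TRIES; t++) {
        int x[N_BLOCKS], y[N_BLOCKS];
        for (int i = 0; i < N_BLOCKS; i++) {
            x[i] = rand() % 8;             /* random spot on an 8x8 grid */
            y[i] = rand() % 8;
        }
        int wl = wirelength(x, y);
        if (wl < best) best = wl;          /* keep the best guess so far */
    }
    printf("best total wirelength found: %d\n", best);
    return 0;
}
```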

ARM Cortex M4 Core

The ARM Cortex Core on the STM32 is:

There are 7 basic operating modes, falling into 3 categories:

There are 37 registers, all 32 bits long:

All of this is almost a carbon copy of the OTTER. But there's a bus interface named AMBA, short for Advanced Microcontroller Bus Architecture. It's a standard for how the core talks to the MCU's memory and IO/peripherals over the bus, defining the wires between the two.

Things like the APB (Advanced Peripheral Bus) are part of this family. You've seen these in your register code!
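
For example, in your STM32 register code every register access is a load/store that the Cortex-M core issues over AMBA: AHB to the bus matrix, then through an AHB-to-APB bridge for the slower peripherals. A sketch, assuming an STM32F4-class part and its CMSIS device header:

```c
#include "stm32f4xx.h"   /* vendor CMSIS header: RCC, TIM2 definitions */

void apb_example(void) {
    /* TIM2 hangs off APB1: enable its clock through RCC... */
    RCC->APB1ENR |= RCC_APB1ENR_TIM2EN;

    /* ...then every access below travels core -> AHB -> APB1 -> TIM2. */
    TIM2->PSC = 15999;        /* prescaler */
    TIM2->ARR = 999;          /* auto-reload value */
    TIM2->CR1 |= TIM_CR1_CEN; /* start the counter */
}
```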

What to Choose?

We detail the pros and cons below:

So for instance:

Note there are ways to be 'in-between': a GPGPU uses a large number of small cores that all get combined in a LUT/FPGA-like way.