Lecture 16 - Floating Point, FPGAs vs. ARM
This is not a how-to tutorial on building floating point ALUs. Instead, we'll look at how floating point works, then talk about the difference between FPGAs and MCUs (ARM architecture), including why FPGAs are well suited to highly regular workloads such as AI processing.
Floating Point
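We'll look at how floating point works. A floating point number looks like this:
$$x = \pm S \times B^{E}$$
Here, $S$ is the significand, $B$ is the base, and $E$ is the exponent. The sign bit applies to the significand.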
For example, to add numbers in floating point, we shift one of the numbers, add them, then shift back:
And for multiplication, we just multiply the significand, and add the exponents.
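The two procedures above can be sketched in a few lines. This is a minimal illustration, not a hardware design: it assumes unsigned values stored as a hypothetical `(sig, exp)` pair meaning `sig * 2**exp`, an assumed toy significand width `P`, and truncation instead of proper rounding.

```python
P = 8  # significand width in bits (assumed toy value)

def normalize(sig, exp):
    """Shift sig until it occupies exactly P bits, adjusting exp to compensate."""
    if sig == 0:
        return 0, 0
    while sig >= (1 << P):       # too wide: shift right (low bits are truncated)
        sig >>= 1
        exp += 1
    while sig < (1 << (P - 1)):  # too narrow: shift left
        sig <<= 1
        exp -= 1
    return sig, exp

def fp_add(a, b):
    """Align the exponents, add the significands, then renormalize."""
    (sa, ea), (sb, eb) = a, b
    if ea < eb:                  # make `a` the operand with the larger exponent
        (sa, ea), (sb, eb) = (sb, eb), (sa, ea)
    sb >>= (ea - eb)             # shift the smaller operand to match exponents
    return normalize(sa + sb, ea)

def fp_mul(a, b):
    """Multiply the significands and add the exponents, then renormalize."""
    (sa, ea), (sb, eb) = a, b
    return normalize(sa * sb, ea + eb)

def to_float(x):
    """Convert a (sig, exp) pair back to a Python float for inspection."""
    sig, exp = x
    return sig * 2.0 ** exp

a = normalize(3, 0)   # 3.0
b = normalize(5, -1)  # 2.5
print(to_float(fp_add(a, b)), to_float(fp_mul(a, b)))
```

Note that the alignment shift in `fp_add` discards low bits of the smaller operand, which is where precision is lost when adding numbers of very different magnitude.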
We can have overflow in both the
For instance if we have 4 bits for
Notice the logarithmic scaling of numbers in floating point.
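One way to see this scaling is to measure the gap between a number and the next representable one. The sketch below uses the 32-bit format as an assumed example and relies on the fact that, for positive values, adding 1 to the raw bit pattern gives the next representable number. The helper names are hypothetical.

```python
import struct

def next_float32(x):
    """Return the next representable 32-bit float above a positive x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]

def gap(x):
    """Distance from x to the next representable 32-bit float."""
    return next_float32(x) - x

# The gap doubles every time the value crosses a power of two.
for x in (1.0, 2.0, 1024.0, 1048576.0):
    print(x, gap(x))
```

The absolute gap grows with the magnitude of the number, while the relative gap stays roughly constant.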
ANSI/IEEE Standard
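For the 32-bit (single-precision) format, we use 1 sign bit, 8 bits for the exponent (bias = 127), and 23+1 bits for S (23 stored bits plus the hidden bit)
For the 64-bit (double-precision) format, we use 1 sign bit, 11 bits for the exponent (bias = 1023), and 52+1 bits for S (52 stored bits plus the hidden bit)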
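As a rough illustration of the 32-bit layout, the sketch below pulls the three fields out of a number and rebuilds the value from them. The function names are hypothetical, and `rebuild` assumes a normal number (it does not handle zero, subnormals, infinity, or NaN).

```python
import struct

def decode_float32(x):
    """Split x (stored as a 32-bit float) into sign, biased exponent, and fraction fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 stored significand bits
    return sign, exponent, fraction

def rebuild(sign, exponent, fraction):
    """Reassemble a normal number: restore the hidden 1 and remove the bias."""
    significand = 1 + fraction / 2**23   # in the range [1, 2)
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)

fields = decode_float32(-6.5)
print(fields, rebuild(*fields))
```

Here -6.5 is 1.625 x 2^2, so the stored exponent is 2 + 127 = 129 and the fraction field holds the .625 part.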
Other specifications that are required:
- Base 2
- Significand range [1,2) with a hidden most-significant bit of 1
- Support for gradual underflow
- True zero is represented by all 0's
- Special codes for ±∞ and NaN
- Add, subtract, multiply, divide, square root
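The special cases in this list show up as reserved exponent values. A minimal sketch for the 32-bit format follows, with hypothetical helper names: an all-ones exponent encodes infinity or NaN, and an all-zeros exponent encodes zero or the subnormal numbers that provide gradual underflow.

```python
import math
import struct

def classify_float32(bits):
    """Classify a raw 32-bit pattern by its exponent and fraction fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:             # all ones: special codes
        return "inf" if fraction == 0 else "nan"
    if exponent == 0:                # all zeros: zero or gradual underflow
        return "zero" if fraction == 0 else "subnormal"
    return "normal"

def bits_to_float(bits):
    """Reinterpret a raw 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

for pattern in (0x00000000, 0x00000001, 0x00800000, 0x7F800000, 0x7FC00000):
    print(hex(pattern), classify_float32(pattern), bits_to_float(pattern))
```

The pattern `0x00800000` is the smallest normal number (2^-126). The subnormals below it have no hidden 1, so they fill the gap down to zero in even steps of 2^-149.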
FPGAs vs. ARM
Recall the AND gate truth table, with
But we can represent any gate using this methodology. XOR would be
But we don't just have one gate. We usually have a cascade of gates, so we'll want to send the output of one gate to the input of any other gate. This arbitrary routing is done with muxes. When you see LUT usage on, say, a Basys board, that's essentially what it's referring to.
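As a software analogy (a sketch only, with hypothetical names), a 2-input LUT is just a 4-entry table indexed by the inputs, and a mux picks which signal feeds the next LUT. The table contents below are the standard AND and XOR truth tables, listed for inputs 00, 01, 10, 11.

```python
def make_lut2(outputs):
    """Build a 2-input gate from its 4 truth-table outputs (inputs 00, 01, 10, 11)."""
    table = tuple(outputs)
    assert len(table) == 4
    def gate(a, b):
        return table[(a << 1) | b]   # the inputs form the table address
    return gate

AND = make_lut2([0, 0, 0, 1])
XOR = make_lut2([0, 1, 1, 0])

def mux(select, inputs):
    """Route one of several candidate signals onward."""
    return inputs[select]

def half_adder(a, b):
    """Two LUTs sharing the same inputs: XOR gives the sum, AND gives the carry."""
    return XOR(a, b), AND(a, b)

def cascade(a, b, c, select):
    """Feed a second LUT from a signal chosen by a mux."""
    first = XOR(a, b)
    routed = mux(select, [first, c])  # select=0 routes the XOR output, 1 routes c
    return AND(routed, a)

print(half_adder(1, 1), cascade(1, 0, 0, 0), cascade(1, 0, 0, 1))
```

Changing the gate means changing only the table contents, which is why the same physical LUT can implement any 2-input function.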
Ultimately, we wire from LUTs into Fast Carry Logic (used for common accelerators like lookahead adders), then into D flip-flops, and the results are routed onward by MUXes.
Sending Outputs of Gates to Other Gates
Placement, as done by an FPGA manufacturer's software package, puts gates that connect to each other physically near each other; routing then wires them together. Hence the name place and route. But notice that this problem is
The above shows a way to use transistors to send an input signal on the left to possibly up to 3 separate directions.
To communicate among a large set of gates, we use a global interconnect, which can connect any gate to any other gate. A local interconnect connects only the gates within a set region.
ARM Cortex M4 Core
The ARM Cortex core on the STM32 has the following features:
- Programmed in embedded C
- FPU (single precision)
- DSP hardware (single-cycle multiply-accumulate core)
There are two ISAs:
- 32-bit ARM Instruction Set
- 16-bit Thumb Instruction Set
There are 7 basic operating modes, in 3 categories:
- User: unprivileged mode under which most tasks are run (restricted access to memory)
- IRQ: entered when an interrupt is raised
- Supervisor: privileged mode entered on reset and when a software interrupt instruction is executed
There are 37 registers, all 32 bits wide:
- 1 dedicated PC
- 1 current program status register
- 5 dedicated saved program status registers
- 30 GP registers
All of this is almost a carbon copy of the OTTER. But there's a bus interface named AMBA, short for Advanced Microcontroller Bus Architecture. It's a standard that defines how the MCU's IO communicates over the bus, specifying the wires between the two.
Buses like the APB (Advanced Peripheral Bus) are part of this standard. You've seen these in your register code!
What to Choose?
We detail the pros and cons below:
- FPGA
- Good at bit-wise, low-bit precision computation
- Highly regular workloads (any written function would need to contain a LUT of all inputs mapped to outputs)
- Static applications
- May have higher performance & time certainty
- Processors
- Good at uncertain workloads
- Dynamic execution
- Less Expensive
- Easier Development (easier to write Python over SystemVerilog)
So for instance:
- GPUs are great on FPGAs since the color data can be reduced to a small bit width
- TPUs may be better on FPGAs since they're highly regular (multiply/add) and also need higher performance
- A CPU is better on a Processor since the instructions are highly unordered (think many if/else branches)
Note there are ways to be 'in-between'. A GPGPU uses a large number of small cores that are combined in a regular, FPGA-like way.