Abstract

At a high level, this paper takes a deeper look at the Graphcore IPU through microbenchmarking, covering a variety of operations such as gather and scatter. The authors address:

memory performance
latency and bandwidth
compute power
actual performance
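
For readers unfamiliar with the gather and scatter operations named above, here is a minimal sketch of their semantics in plain Python. The function names and the `fill` parameter are chosen here for illustration and are not taken from the paper or from any IPU library.

```python
def gather(src, indices):
    """Indexed read: out[i] = src[indices[i]]."""
    return [src[i] for i in indices]


def scatter(values, indices, size, fill=0):
    """Indexed write: out[indices[i]] = values[i]; untouched slots keep `fill`."""
    out = [fill] * size
    for value, i in zip(values, indices):
        out[i] = value
    return out


data = [10, 20, 30, 40]
picked = gather(data, [3, 0, 2])        # [40, 10, 30]
placed = scatter(picked, [3, 0, 2], 4)  # [10, 0, 30, 40]
```

Both operations access memory in a data-dependent order, which is presumably why they are useful for stressing a memory system in a microbenchmark.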

Memory Architectures of CPU, GPU, and IPU compared

CPUs use a hierarchy of memory caches, along with sophisticated branch prediction to anticipate the next instruction and prefetch it, so that, on average, latency is hidden.

GPUs typically use smaller cores than CPUs and do not use branch prediction as sophisticated as a CPU's, but their workloads let them run multiple threads over a batch of memory, with memory accesses interleaved across threads to hide latency.

IPUs provide only on-chip memory: each processor core (a tile, in Graphcore's terminology) has 256 KiB of local memory in scratchpad form, over which it has full control. This memory is implemented in SRAM, which is much faster than DRAM, and each core also supports 6 independent threads, which help hide its latency.
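
The idea of hiding latency with multiple threads, used by both GPUs and IPUs, can be illustrated with a toy model. This is a sketch under simplifying assumptions, not a description of the actual GPU or IPU scheduler: every instruction takes one cycle to issue and then stalls its thread for `mem_latency` cycles, and the core issues from the first ready thread each cycle. The function name, the parameters, and the latency value of 5 are all hypothetical and are not measurements from the paper.

```python
def utilization(num_threads, mem_latency, total_cycles=10_000):
    """Fraction of cycles in which the core issues an instruction.

    Toy model (an assumption, not real hardware behaviour): after issuing,
    a thread cannot issue again until `mem_latency` further cycles pass.
    """
    ready_at = [0] * num_threads  # earliest cycle each thread may issue
    busy = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                # Issue from the first ready thread, then stall it.
                ready_at[t] = cycle + 1 + mem_latency
                busy += 1
                break
    return busy / total_cycles


# Hypothetical latency of 5 cycles: more threads keep the core busier.
for n in (1, 3, 6):
    print(n, "threads:", round(utilization(n, mem_latency=5), 2))
```

In this model a single thread leaves the core idle most of the time, while enough threads to cover the stall keep it issuing every cycle. Real hardware is considerably more complex, so the numbers are only illustrative.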