Roofline Model

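Core idea: the Roofline model uses a workload's operational intensity to determine whether its performance is limited by memory bandwidth or by compute capability.
Operational intensity: the number of floating-point operations performed per byte of data accessed (FLOPs/Byte).
Performance limits:

  • Memory-bound: if data cannot be delivered (memory bandwidth) as fast as the computation consumes it, performance is limited by memory.
  • Compute-bound: if data is delivered fast enough but the processor cannot compute any faster, performance is limited by compute capability.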

Used to estimate the theoretical upper bound 𝑃 on the performance a model can achieve on a given computing platform.

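  • [Platform] Compute capability π : peak floating-point operations per second (FLOP/s).

  • [Platform] Bandwidth β : peak memory traffic per second (Byte/s).

  • [Platform] Maximum computational intensity I_max = π / β : the most floating-point operations the platform can perform per byte of memory traffic (FLOPs/Byte). A workload whose intensity is below I_max is memory-bound; at or above I_max it is compute-bound.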

  • [Model] Computational workload 𝐴 :
    the number of floating-point operations (#FLOPs) that occur during a single forward pass when processing one input sample (for a CNN, this would be a single image).

  • [Model] Memory access 𝐵 :
    the total amount of memory (#Bytes) exchanged during a single forward pass when processing one input sample. In the ideal case, B = model’s weight parameters + memory used for each layer’s output.

  • [Model] Computational intensity I=𝐴/𝐵 :
    the number of floating-point operations performed per byte of memory exchanged during the computation (#FLOPs/Byte). The higher the computational intensity, the more efficiently the model utilizes memory, as it performs more computations for each unit of memory accessed.

  • [Model] Theoretical peak performance 𝑃 :
    the theoretical maximum number of floating-point operations per second (FLOP/s) the model can achieve on a given computing platform.
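
As a rough illustration of how 𝐴, 𝐵 and I might be tallied for a single layer, the sketch below assumes a standard 2D convolution, float32 storage (4 bytes per value), a multiply-add counted as 2 FLOPs, and the ideal-case definition of 𝐵 given above (weights plus layer output, biases ignored). The function name and the example layer shape are hypothetical; a whole-model figure would sum 𝐴 and 𝐵 over all layers.

```python
def conv2d_workload(c_in, c_out, kernel, h_out, w_out, bytes_per_value=4):
    """Return (A, B, I) for one conv layer: FLOPs, bytes, FLOPs per byte."""
    # Each output element needs kernel*kernel*c_in multiply-adds (2 FLOPs each).
    flops = 2 * kernel * kernel * c_in * c_out * h_out * w_out
    # Ideal-case memory traffic: weights plus the output feature map.
    weight_values = kernel * kernel * c_in * c_out
    output_values = c_out * h_out * w_out
    mem_bytes = (weight_values + output_values) * bytes_per_value
    return flops, mem_bytes, flops / mem_bytes


# Hypothetical layer: 3x3 convolution, 64 -> 64 channels, 224x224 output.
a, b, i = conv2d_workload(c_in=64, c_out=64, kernel=3, h_out=224, w_out=224)
print(f"A = {a} FLOPs, B = {b} bytes, I = {i:.1f} FLOPs/Byte")
```

Convolutions with large feature maps tend to have high intensity because each weight is reused many times, whereas layers with many weights and little reuse pull the model-level intensity down.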

Memory Roof: Intuitively, if the data needed for the computation is supplied slower than the computation itself, the processor will idly wait for data, making memory bandwidth the primary bottleneck.
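
In symbols, using the platform and model quantities defined above, the standard Roofline bound can be written as follows. The memory roof is the sloped branch, where performance grows linearly with I; the flat branch is the compute roof.

```latex
P = \min(\pi,\ \beta \cdot I) =
\begin{cases}
\beta \cdot I, & I < I_{\max} \quad \text{(memory-bound)} \\
\pi,           & I \ge I_{\max} \quad \text{(compute-bound)}
\end{cases}
\qquad \text{with } I_{\max} = \frac{\pi}{\beta}
```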

Example

I(VGG16) = 25 FLOPs/Byte
I(MobileNet) = 7 FLOPs/Byte

Platform (1080Ti): π = 11.3 TFLOP/s, β = 484 GB/s
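
To make the example concrete, the sketch below applies the standard Roofline bound P = min(π, β·I) to the two intensities and the platform figures listed above. The function and variable names are illustrative, and the result is a theoretical ceiling, not a measured throughput.

```python
def roofline(pi_flops, beta_bytes, intensity):
    """Return (attainable FLOP/s, regime) under the Roofline bound."""
    ridge = pi_flops / beta_bytes  # I_max = pi / beta, in FLOPs/Byte
    attainable = min(pi_flops, beta_bytes * intensity)
    regime = "compute-bound" if intensity >= ridge else "memory-bound"
    return attainable, regime


PI = 11.3e12   # 1080Ti peak compute, FLOP/s
BETA = 484e9   # 1080Ti memory bandwidth, Byte/s

print(f"I_max = {PI / BETA:.2f} FLOPs/Byte")

results = {}
for name, intensity in {"VGG16": 25.0, "MobileNet": 7.0}.items():
    results[name] = roofline(PI, BETA, intensity)
    p, regime = results[name]
    print(f"{name}: P <= {p / 1e12:.2f} TFLOP/s ({regime})")
```

With these numbers I_max is about 23.3 FLOPs/Byte, so VGG16 (I = 25) sits just past the ridge and is capped by π at 11.3 TFLOP/s, while MobileNet (I = 7) is memory-bound and capped at β·I = 484 GB/s × 7 ≈ 3.39 TFLOP/s.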