[Note] Roofline Model

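Core idea: the Roofline model uses a task's operational intensity to determine whether its performance is limited by memory bandwidth or by compute capability.
Operational intensity: the number of floating-point operations performed per byte of data accessed (FLOPs/Byte).
Performance limits:

Memory-bound: if data cannot be delivered (memory bandwidth) as fast as the computation consumes it, performance is limited by memory.
Compute-bound: if data arrives fast enough but the processor's compute capability is the limit, performance is limited by compute.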

Used to estimate the theoretical upper bound 𝑃 on the performance that a model can achieve on a given computing platform.

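[Platform] Compute capability π : peak number of floating-point operations per second (FLOP/s).
[Platform] Bandwidth β : peak amount of memory traffic per second (Bytes/s).
[Platform] Maximum computational intensity I_max = π / β : the largest number of floating-point operations the platform can perform per byte of memory accessed (FLOPs/Byte). This is the ridge point between the memory-bound and compute-bound regions.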
[Model] Computational workload 𝐴 :
the number of floating-point operations (#FLOPs) that occur during a single forward pass when processing one input sample (for a CNN, this would be a single image).
[Model] Memory access 𝐵 :
the total amount of memory (#Bytes) exchanged during a single forward pass when processing one input sample. In the ideal case, B = model’s weight parameters + memory used for each layer’s output.
[Model] Computational intensity I=𝐴/𝐵 :
the number of floating-point operations performed per byte of memory exchanged during the computation (#FLOPs/Byte). The higher the computational intensity, the more efficiently the model utilizes memory, as it performs more computations for each unit of memory accessed.
[Model] Theoretical peak performance 𝑃 :
the theoretical maximum number of floating-point operations per second (FLOP/s) that the model can achieve on a given computing platform.
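
As a rough illustration of how 𝐴, 𝐵 and I could be tallied for a model, here is a minimal sketch for convolution layers. Everything named in it is an assumption made for illustration: the `ConvLayer` class, the fp32 storage size, counting one multiply-accumulate as 2 FLOPs, ignoring biases, and the two example layer shapes. The memory count follows the ideal case above (weights plus each layer's output).

```python
from dataclasses import dataclass

BYTES_PER_VALUE = 4  # assumption: values are stored as fp32


@dataclass
class ConvLayer:
    """Hypothetical description of one convolution layer (bias ignored)."""
    c_in: int    # input channels
    c_out: int   # output channels
    kernel: int  # square kernel size
    h_out: int   # output feature-map height
    w_out: int   # output feature-map width

    def flops(self) -> int:
        # Assumption: one multiply-accumulate counts as 2 FLOPs.
        macs = (self.kernel * self.kernel * self.c_in
                * self.c_out * self.h_out * self.w_out)
        return 2 * macs

    def bytes_accessed(self) -> int:
        # Ideal case from the notes: weights + this layer's output.
        weights = self.kernel * self.kernel * self.c_in * self.c_out
        outputs = self.c_out * self.h_out * self.w_out
        return BYTES_PER_VALUE * (weights + outputs)


def model_intensity(layers) -> float:
    """I = A / B over all layers, in FLOPs/Byte."""
    a = sum(layer.flops() for layer in layers)
    b = sum(layer.bytes_accessed() for layer in layers)
    return a / b


# Illustrative layer shapes only (3x3 convolutions on a 224x224 map).
layers = [ConvLayer(3, 64, 3, 224, 224), ConvLayer(64, 64, 3, 224, 224)]
print(f"A = {sum(l.flops() for l in layers)} FLOPs")
print(f"B = {sum(l.bytes_accessed() for l in layers)} Bytes")
print(f"I = {model_intensity(layers):.1f} FLOPs/Byte")
```

Summing 𝐴 and 𝐵 over the layers before dividing gives the intensity of the whole forward pass. Averaging per-layer intensities would weight small layers too heavily.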

Memory Roof: Intuitively, if the data needed for the computation is supplied more slowly than the processor can consume it (that is, I < I_max), the processor sits idle waiting for data, and memory bandwidth becomes the primary bottleneck.
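
The memory-bound and compute-bound cases combine into the usual roofline bound P = min(π, β·I): below I_max the bound is β·I, above it the bound is π. The sketch below writes this out. The function names are assumptions chosen for illustration, with π in FLOP/s, β in Bytes/s and I in FLOPs/Byte.

```python
def ridge_point(pi: float, beta: float) -> float:
    """I_max = pi / beta, the intensity at which the two roofs meet."""
    return pi / beta


def attainable_performance(pi: float, beta: float, intensity: float) -> float:
    """Roofline bound P = min(pi, beta * I), in FLOP/s."""
    return min(pi, beta * intensity)


def is_memory_bound(pi: float, beta: float, intensity: float) -> bool:
    """True when the model sits under the memory roof (I < I_max)."""
    return intensity < ridge_point(pi, beta)
```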

Example

I(VGG16) = 25 FLOPs/Byte
I(MobileNet) = 7 FLOPs/Byte

Platform (1080Ti): π = 11.3 TFLOP/s, β = 484 GB/s
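
To see where the two models land on this platform, the numbers above can be plugged into the roofline bound P = min(π, β·I). The following is a minimal sketch. The names are illustrative, 1 GB is taken as 10^9 bytes, and the bound is the standard roofline formula rather than something specific to these notes.

```python
PI = 11.3e12   # 1080Ti peak compute, FLOP/s
BETA = 484e9   # 1080Ti memory bandwidth, Bytes/s (assuming 1 GB = 1e9 bytes)

# Computational intensities from the example, in FLOPs/Byte.
MODELS = {"VGG16": 25.0, "MobileNet": 7.0}


def roofline(pi: float, beta: float, intensity: float) -> float:
    """Attainable performance P = min(pi, beta * I), in FLOP/s."""
    return min(pi, beta * intensity)


i_max = PI / BETA  # ridge point of the platform, FLOPs/Byte
results = {}
for name, intensity in MODELS.items():
    p = roofline(PI, BETA, intensity)
    regime = "memory-bound" if intensity < i_max else "compute-bound"
    results[name] = (p, regime)
    print(f"{name}: I = {intensity:g} FLOPs/Byte, "
          f"P = {p / 1e12:.2f} TFLOP/s ({regime})")
```

With these numbers I_max = 11.3 TFLOP/s ÷ 484 GB/s ≈ 23.3 FLOPs/Byte. VGG16 (I = 25) sits just past the ridge point, so it is compute-bound with P = π = 11.3 TFLOP/s. MobileNet (I = 7) is memory-bound with P = 484 GB/s × 7 ≈ 3.4 TFLOP/s.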