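Core idea: the Roofline model uses a task's operational intensity to determine whether its performance is limited by memory bandwidth or by compute capability.
Operational intensity: the number of floating-point operations performed per byte of data accessed (FLOPs/Byte).
Performance limits:
- Memory-bound: if data cannot be delivered (memory bandwidth) as fast as the computation consumes it, performance is limited by memory.
- Compute-bound: if data arrives fast enough but the processor's compute capability is the limit, performance is limited by compute.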
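As a small illustration of the operational intensity definition above, the sketch below computes FLOPs/Byte for a hypothetical element-wise vector addition. The helper name `operational_intensity` and the kernel are assumptions chosen for illustration, and caches are ignored, so every operand is counted as one memory access.

```python
def operational_intensity(flops: float, bytes_moved: float) -> float:
    """Return floating-point operations performed per byte of memory traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


# Hypothetical kernel: c[i] = a[i] + b[i] over n float32 elements.
n = 1_000_000
flops = n                 # one addition per element
bytes_moved = 3 * 4 * n   # read a, read b, write c; 4 bytes per float32

intensity = operational_intensity(flops, bytes_moved)
print(f"{intensity:.4f} FLOPs/Byte")
```

An intensity this low (1/12 FLOPs/Byte) means the kernel does very little work per byte fetched, which is the typical signature of a memory-bound task.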
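Used to estimate the theoretical upper bound 𝑃 on the performance a model can achieve on a given computing platform.
[Platform] Compute capability π: peak number of floating-point operations per second (FLOP/s).
[Platform] Bandwidth β: peak memory traffic per second (Bytes/s).
[Platform] Maximum computational intensity I_max = π / β: the number of floating-point operations the platform can perform per byte of memory traffic when both compute and bandwidth are at their peak (FLOPs/Byte).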
[Model] Computational workload 𝐴: the number of floating-point operations (FLOPs) performed in a single forward pass on one input sample (for a CNN, a single image).
[Model] Memory access 𝐵: the total amount of memory traffic (Bytes) in a single forward pass on one input sample. In the ideal case, 𝐵 = the model's weight parameters + the memory used for each layer's output.
[Model] Computational intensity I = 𝐴 / 𝐵: the number of floating-point operations performed per byte of memory traffic (FLOPs/Byte). The higher the computational intensity, the more efficiently the model uses memory, since it performs more computation per byte accessed.
[Model] Theoretical peak performance 𝑃: the maximum number of floating-point operations per second (FLOP/s) the model can achieve on a given computing platform.
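
The platform and model quantities defined above combine into the standard Roofline bound. Written in the notation of these notes, with I = A / B and I_max = π / β:

```latex
P \;=\; \min\bigl(\pi,\ \beta \cdot I\bigr)
  \;=\;
  \begin{cases}
    \beta \cdot I, & I < I_{\max} \quad \text{(memory-bound)} \\
    \pi,           & I \ge I_{\max} \quad \text{(compute-bound)}
  \end{cases}
```

The two branches meet at I = I_max, where β · I_max = π.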

Memory Roof: Intuitively, if the data needed for the computation is supplied slower than the computation itself, the processor will idly wait for data, making memory bandwidth the primary bottleneck.
Example
I(VGG16)= 25 FLOPs/Byte
I(MobileNet)= 7 FLOPs/Byte
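
To make the example concrete, the sketch below evaluates the standard Roofline bound P = min(π, β · I) for the two intensities listed above. The platform figures (10 TFLOP/s peak compute, 500 GB/s bandwidth) are hypothetical values picked for illustration, not measurements of any real device, and the function name `attainable_performance` is likewise an assumption.

```python
def attainable_performance(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound in FLOP/s: min(peak compute, bandwidth * intensity)."""
    return min(peak_flops, bandwidth * intensity)


# Hypothetical platform (illustrative numbers only).
PEAK_FLOPS = 10e12   # pi: 10 TFLOP/s
BANDWIDTH = 500e9    # beta: 500 GB/s
ridge = PEAK_FLOPS / BANDWIDTH  # I_max = pi / beta, in FLOPs/Byte

# Computational intensities from the example above (FLOPs/Byte).
models = {"VGG16": 25.0, "MobileNet": 7.0}

results = {}
for name, intensity in models.items():
    p = attainable_performance(intensity, PEAK_FLOPS, BANDWIDTH)
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    results[name] = (p, bound)
    print(f"{name}: {p / 1e12:.1f} TFLOP/s ({bound})")
```

On this assumed platform I_max is 20 FLOPs/Byte, so VGG16 (I = 25) reaches the compute roof, while MobileNet (I = 7) sits under the memory roof at 3.5 TFLOP/s. A platform with a different π / β ratio could classify the same models differently.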
