@@ -269,22 +269,110 @@ need to worry about additional cleanup.
Supported fusions
=================================================
- The following tables outline the supported fusions for ``FP32`` and ``FP16``, including any applicable
+ The following tables outline the supported fusions for ``FP32``, ``FP16``, and ``BFP16``, including any applicable
constraints.

.. note::

-    Fusion Plans with grouped convolutions are not supported.
+    Fusion Plans with grouped convolutions are supported in the inference direction for
+    convolution, bias, and activation.
+
+ The following abbreviations apply to the combination column in the following tables:
+
+ * **C**: Convolution
+ * **B**: Bias
+ * **N**: Batch Normalization
+ * **A**: Activation
+
+ For example, CBA refers to convolution plus bias plus activation.
+
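+ A minimal sketch of how a CBA plan might be assembled with the fusion API is shown below. It
+ assumes ``handle``, ``inputDesc``, ``convDesc``, ``weightsDesc``, and ``biasDesc`` were created
+ earlier, and it omits error checking; treat it as an illustration rather than a complete program.
+
+ .. code-block:: cpp
+
+    // Sketch: build a CBA (convolution + bias + activation) fusion plan.
+    miopenFusionPlanDescriptor_t fusePlanDesc;
+    miopenFusionOpDescriptor_t convOp, biasOp, activOp;
+
+    miopenCreateFusionPlan(&fusePlanDesc, miopenVerticalFusion, inputDesc);
+    miopenCreateOpConvForward(fusePlanDesc, &convOp, convDesc, weightsDesc);
+    miopenCreateOpBiasForward(fusePlanDesc, &biasOp, biasDesc);
+    miopenCreateOpActivationForward(fusePlanDesc, &activOp, miopenActivationRELU);
+
+    // The plan is usable only if it compiles for this configuration.
+    if (miopenCompileFusionPlan(handle, fusePlanDesc) == miopenStatusSuccess)
+    {
+        // ... set the op args and call miopenExecuteFusionPlan ...
+    }
+    miopenDestroyFusionPlan(fusePlanDesc);
+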
+ Convolution-based FP32 fusion for inference
+ -------------------------------------------
+
+ The following table applies to single-precision floating point.
+
+ .. csv-table::
+    :header: "Combination","Conv algo","Stride","Filter dims","N mode","Activations","Other constraints"
+    :widths: 15, 15, 15, 20, 12, 20, 20
+
+    "CBNA","Direct","1 and 2","3x3, 5x5, 7x7, 9x9, 11x11","All","All","stride and padding must be either 1 or 2"
+    "CBA","Direct","--","1x1","--","All","stride and padding not supported"
+    "CBA","Winograd","1","1x1, 2x2","--","Relu, Leaky Relu","c >= 18"
+    "CBA","Winograd","1","3x3","--","Relu, Leaky Relu","c >= 18 and c is even"
+    "CBA","Winograd","1","4x4, 5x5, 6x6","--","Relu, Leaky Relu","4 x c >= 18"
+    "CBA","Winograd","1","7x7, 8x8, 9x9","--","Relu, Leaky Relu","12 x c >= 18"
+    "CBA","Winograd","1","10x10, 11x11, 12x12","--","Relu, Leaky Relu","16 x c >= 18"
+    "CBA","Winograd","1","larger filter sizes","--","Relu, Leaky Relu","none"
+    "CBA","Winograd","2","1x1","--","Relu, Leaky Relu","2 x c >= 18"
+    "CBA","Winograd","2","2x2, 3x3, 4x4, 5x5, 6x6","--","Relu, Leaky Relu","4 x c >= 18"
+    "CBA","Winograd","2","7x7","--","Relu, Leaky Relu","12 x c >= 18"
+    "CBA","Winograd","2","8x8, 9x9, 10x10, 11x11, 12x12","--","Relu, Leaky Relu","16 x c >= 18"
+    "CBA","Winograd","2","larger filter sizes","--","Relu, Leaky Relu","none"
+    "CBA","CK","--","--","--","Relu, Clipped Relu, CLAMP","none"
+    "NA","--","--","--","All","All","padding not supported"
+    "CA","Direct","--","1x1","--","All","stride and padding not supported"
+    "CA","CK","--","--","--","Relu, Clipped Relu, CLAMP","none"

- **C = convolution, B = bias, N = batch normalization, A = activation**
+ .. note::

- .. image:: ../data/how-to/fp32fusions.png
-    :width: 800
-    :alt: Convolution based fp32 fusion
+    N mode is either spatial or per activation. For CBA, other asymmetric kernels are supported but for brevity are not enumerated here.

- .. image:: ../data/how-to/fp16fusions.png
-    :width: 800
-    :alt: Convolution based fp16 fusion
+
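+ A plan built from these tables is validated when it is compiled: compiling an unsupported
+ combination fails rather than producing incorrect results, so the compile status can serve as a
+ feature probe. A short sketch, reusing the ``handle`` and ``fusePlanDesc`` from the example above:
+
+ .. code-block:: cpp
+
+    // Sketch: treat a failed compile as "fusion unavailable" and fall
+    // back to launching the convolution, bias, and activation separately.
+    miopenStatus_t status = miopenCompileFusionPlan(handle, fusePlanDesc);
+    if (status != miopenStatusSuccess)
+    {
+        // Run the non-fused path instead.
+    }
+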
+ Convolution-based FP16 fusion for inference
+ -------------------------------------------
+
+ The following table applies to half-precision floating point.
+
+ .. csv-table::
+    :header: "Combination","Conv algo","Stride","Filter dims","N mode","Activations","Other constraints"
+    :widths: 15, 15, 15, 20, 12, 20, 20
+
+    "CBNA","Direct","1 and 2","3x3, 5x5, 7x7, 9x9, 11x11","All","All","stride and padding must be either 1 or 2"
+    "CBA","Direct","--","1x1","--","All","stride and padding not supported"
+    "CBA","CK","--","--","--","Relu, Clipped Relu, CLAMP","none"
+    "CA","Direct","--","1x1","--","All","stride and padding not supported"
+    "CA","CK","--","--","--","Relu, Clipped Relu, CLAMP","none"
+
+ .. note::
+
+    N mode is either spatial or per activation.
+
+ Convolution-based BFP16 fusion for inference
+ --------------------------------------------
+
+ The following table applies to bfloat16 (brain floating point).
+
+ .. csv-table::
+    :header: "Combination","Conv algo","Stride","Filter dims","N mode","Activations","Other constraints"
+    :widths: 15, 15, 15, 20, 12, 20, 20
+
+    "CBNA","Direct","1 and 2","3x3, 5x5, 7x7, 9x9, 11x11","All","All","stride and padding must be either 1 or 2"
+    "CBA","Direct","--","1x1","--","All","stride and padding not supported"
+    "CBA","CK","--","--","--","Relu, Clipped Relu, CLAMP","none"
+    "CA","Direct","--","1x1","--","All","stride and padding not supported"
+    "CA","CK","--","--","--","Relu, Clipped Relu, CLAMP","none"
+
+ .. note::
+
+    N mode is either spatial or per activation.
+
+ Batch Normalization-based fusion for FP32, BFP16, and FP16 for inference and training
+ -------------------------------------------------------------------------------------
+
+ The following table applies to all three precisions: FP32, FP16, and BFP16.
+
+ .. csv-table::
+    :header: "Combination","N mode","Activations","Constraints"
+    :widths: 30, 15, 15, 15
+
+    "NA for inference","All","All","None"
+    "NA forward training","All","All","None"
+    "NA backward training","All","All","None"
+
+ .. note::
+
+    N mode is either spatial or per activation.
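+
+ Any of the NA rows above can be expressed as a two-op fusion plan. A minimal sketch for NA
+ inference in spatial mode follows; ``handle``, ``inputDesc``, and ``bnScaleBiasMeanVarDesc`` are
+ assumed to have been created earlier, and error checking is omitted:
+
+ .. code-block:: cpp
+
+    // Sketch: NA (batch normalization + activation) fusion for inference.
+    miopenFusionPlanDescriptor_t naPlan;
+    miopenFusionOpDescriptor_t bnOp, activOp;
+
+    miopenCreateFusionPlan(&naPlan, miopenVerticalFusion, inputDesc);
+    miopenCreateOpBatchNormInference(naPlan, &bnOp, miopenBNSpatial,
+                                     bnScaleBiasMeanVarDesc);
+    miopenCreateOpActivationForward(naPlan, &activOp, miopenActivationRELU);
+    miopenCompileFusionPlan(handle, naPlan);
+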
Comparing performance with non-fused kernels
=================================================
@@ -298,6 +386,6 @@ non-fused version. All configurations have a batch size of 64:
The following graph depicts the speedup obtained by fusing BatchNorm (in spatial mode) with activation:
- .. image:: ../data/how-to/na.png
+ .. image:: ../data/how-to/bn_activ_fused.png
   :width: 800
   :alt: BatchNorm activation fusion