@MrBurmark MrBurmark commented Sep 17, 2025

Summary

Add more kernel attributes by automatically counting operations using wrapper types.

This could either replace our manual counts or act as a check on their accuracy. It can also produce accurate counts in some cases where we have been estimating. In other cases we may not be able to use it, for example in Algorithm_SORT, where we use std::sort. It's important to note that getting accurate counts after compiler optimization is still difficult, but in most cases manual optimization can still get us good counts.
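
To make the wrapper idea concrete, here is a minimal sketch of counting by operator overloading. It is not the code in this PR: the names `OpCounts`, `CountedReal`, `CountedRef`, `CountedPtr`, and `count_pressure_body1` are made up for illustration, and it tallies only four kinds of operation, whereas the real counters track many more (copies, assigns, pointer arithmetic, bytes).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical tallies for this sketch; the PR's counters are far more detailed.
struct OpCounts {
  std::int64_t add = 0;
  std::int64_t mult = 0;
  std::int64_t load = 0;
  std::int64_t store = 0;
};

inline OpCounts g_counts;

// Value wrapper: each arithmetic operator bumps a counter, then does the real work.
struct CountedReal {
  double value;
  CountedReal(double v = 0.0) : value(v) {}
};

inline CountedReal operator+(CountedReal a, CountedReal b) {
  ++g_counts.add;
  return CountedReal(a.value + b.value);
}

inline CountedReal operator*(CountedReal a, CountedReal b) {
  ++g_counts.mult;
  return CountedReal(a.value * b.value);
}

// Proxy returned by indexing: reading it counts a load, assigning counts a store.
class CountedRef {
 public:
  explicit CountedRef(double* p) : p_(p) {}
  operator CountedReal() const {
    ++g_counts.load;
    return CountedReal(*p_);
  }
  CountedRef& operator=(CountedReal v) {
    ++g_counts.store;
    *p_ = v.value;
    return *this;
  }

 private:
  double* p_;
};

// Pointer wrapper standing in for a wrapped Real_ptr.
struct CountedPtr {
  double* base;
  CountedRef operator[](std::int64_t i) const { return CountedRef(base + i); }
};

// Runs a loop shaped like the first PRESSURE body and returns the tallies.
inline OpCounts count_pressure_body1(std::int64_t n) {
  std::vector<double> in(n, 2.0), out(n, 0.0);
  CountedPtr compression{in.data()};
  CountedPtr bvc{out.data()};
  const CountedReal cls(0.5);
  g_counts = OpCounts{};
  for (std::int64_t i = 0; i < n; ++i) {
    bvc[i] = cls * (compression[i] + 1.0);
  }
  if (n > 0) assert(out[0] == 1.5);  // the wrapped code still computes real values
  return g_counts;
}
```

With this sketch, each iteration of the loop records one load, one add, one mult, and one store, which is the same per-iteration pattern the PRESSURE breakdown further down reports for those four operations.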

Note that this requires C++20 at the moment, but it could be back-ported to C++17 with SFINAE instead of concepts.
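
As a rough illustration of what such a back-port could look like (a sketch with hypothetical names `Wrapped`, `Arithmetic`, `counted_add`, and `counted_add_cxx17`, not code from this PR), the same constraint can be written with a C++20 concept or with `std::enable_if_t` in C++17:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical wrapper and counter, named for this sketch only.
template <typename T>
struct Wrapped {
  T value;
};

inline int g_add_count = 0;

#if defined(__cpp_concepts)
// C++20 form: a concept restricts the function to arithmetic payload types.
template <typename T>
concept Arithmetic = std::is_arithmetic_v<T>;

template <Arithmetic T>
Wrapped<T> counted_add(Wrapped<T> a, Wrapped<T> b) {
  ++g_add_count;
  return Wrapped<T>{static_cast<T>(a.value + b.value)};
}
#endif

// C++17 form: the same restriction expressed with SFINAE via enable_if_t.
template <typename T, std::enable_if_t<std::is_arithmetic_v<T>, int> = 0>
Wrapped<T> counted_add_cxx17(Wrapped<T> a, Wrapped<T> b) {
  ++g_add_count;
  return Wrapped<T>{static_cast<T>(a.value + b.value)};
}

// Small usage check of the C++17 form.
inline void demo_constrained_add() {
  g_add_count = 0;
  const Wrapped<double> r =
      counted_add_cxx17(Wrapped<double>{1.5}, Wrapped<double>{2.0});
  assert(r.value == 3.5);
  assert(g_add_count == 1);
}
```

Both forms reject non-arithmetic payloads at overload resolution. The concept version usually gives clearer diagnostics, while the SFINAE version compiles under C++17.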

At the moment I'm interested in whether people think this is a reasonable direction to take.
If so, is there anything I'm missing that I could be capturing with wrapper types and instrumentation?

As an example, I used the counters in Apps_PRESSURE, Apps_VOL3D, and Polybench_JACOBI_2D. While examining these counters and comparing them to the manual "Estimate" counters, I discovered opportunities to optimize redundant loads in the PRESSURE and VOL3D kernels and found a copy-paste error in VOL3D.
Below are the normal attributes followed by the counted attributes. After that is a breakdown of each kernel with counters for each section of the kernel. By using enough macros it was possible to capture the code of the entire kernel and print it out.

// here is the output of the kernel information with counted values
                      ,          Input ,   Input
Kernels               ,   Problem size ,    Reps
------------------------------------------------
Polybench_JACOBI_2D   ,        1000000 ,      50
Apps_PRESSURE         ,        1000000 ,     700
Apps_VOL3D            ,        1000000 ,     100

                      ,         Estimate ,      Estimate ,     Estimate
Kernels               ,   Iterations/rep ,   Kernels/rep ,    Bytes/rep
-----------------------------------------------------------------------
Polybench_JACOBI_2D   ,         80000000 ,            80 ,   1282560000
Apps_PRESSURE         ,          2000000 ,             2 ,     48000000
Apps_VOL3D            ,          1103022 ,             1 ,     35558808

                      ,    Estimate ,        Estimate ,           Estimate
Kernels               ,   FLOPS/rep ,   BytesRead/rep ,   BytesWritten/rep
--------------------------------------------------------------------------
Polybench_JACOBI_2D   ,   400000000 ,       642560000 ,          640000000
Apps_PRESSURE         ,     3000000 ,        32000000 ,           16000000
Apps_VOL3D            ,    79417584 ,        26734632 ,            8824176

                      ,                       Estimate
Kernels               ,   BytesAtomicModifyWritten/rep
------------------------------------------------------
Polybench_JACOBI_2D   ,                              0
Apps_PRESSURE         ,                              0
Apps_VOL3D            ,                              0

                      ,          Counted ,       Counted ,          Counted
Kernels               ,   Iterations/rep ,   Kernels/rep ,   NumAllocations
---------------------------------------------------------------------------
Polybench_JACOBI_2D   ,         80000000 ,            80 ,                4
Apps_PRESSURE         ,          2000000 ,             2 ,                5
Apps_VOL3D            ,          1103022 ,             1 ,                4

                      ,          Counted
Kernels               ,   AllocatedBytes
----------------------------------------
Polybench_JACOBI_2D   ,         32128128
Apps_PRESSURE         ,         40000000
Apps_VOL3D            ,         35995648

                      ,                Counted ,         Counted
Kernels               ,   LoopBytesTouched/rep ,   LoopBytes/rep
----------------------------------------------------------------
Polybench_JACOBI_2D   ,             1282560000 ,      1282560000
Apps_PRESSURE         ,               48000000 ,        48000000
Apps_VOL3D            ,               35558808 ,        35558808

                      ,             Counted ,                Counted
Kernels               ,   LoopBytesRead/rep ,   LoopBytesWritten/rep
--------------------------------------------------------------------
Polybench_JACOBI_2D   ,           642560000 ,              640000000
Apps_PRESSURE         ,            32000000 ,               16000000
Apps_VOL3D            ,            26734632 ,                8824176

                      ,                            Counted
Kernels               ,   LoopBytesAtomicModifyWritten/rep
----------------------------------------------------------
Polybench_JACOBI_2D   ,                                  0
Apps_PRESSURE         ,                                  0
Apps_VOL3D            ,                                  0

                      ,            Counted ,     Counted ,         Counted
Kernels               ,   BytesTouched/rep ,   Bytes/rep ,   BytesRead/rep
--------------------------------------------------------------------------
Polybench_JACOBI_2D   ,           16064000 ,    32064000 ,        16064000
Apps_PRESSURE         ,           40000000 ,    48000000 ,        32000000
Apps_VOL3D            ,           35558808 ,    35558808 ,        26734632

                      ,            Counted ,                        Counted
Kernels               ,   BytesWritten/rep ,   BytesAtomicModifyWritten/rep
---------------------------------------------------------------------------
Polybench_JACOBI_2D   ,           16000000 ,                              0
Apps_PRESSURE         ,           16000000 ,                              0
Apps_VOL3D            ,            8824176 ,                              0

                      ,         Counted ,          Counted ,         Counted
Kernels               ,   int64_ops/rep ,   int64_copy/rep ,   int64_add/rep
----------------------------------------------------------------------------
Polybench_JACOBI_2D   ,      1360080041 ,       1280000001 ,       640000000
Apps_PRESSURE         ,         2000000 ,                2 ,               0
Apps_VOL3D            ,         1103022 ,                1 ,               0

                      ,         Counted ,          Counted ,            Counted
Kernels               ,   int64_sub/rep ,   int64_mult/rep ,   int64_preinc/rep
-------------------------------------------------------------------------------
Polybench_JACOBI_2D   ,       160000001 ,        480000000 ,           80080040
Apps_PRESSURE         ,               0 ,                0 ,            2000000
Apps_VOL3D            ,               0 ,                0 ,            1103022

                      ,        Counted
Kernels               ,   int64_lt/rep
--------------------------------------
Polybench_JACOBI_2D   ,       80160121
Apps_PRESSURE         ,        2000002
Apps_VOL3D            ,        1103023

                      ,       Counted ,        Counted ,          Counted
Kernels               ,   ptr_ops/rep ,   ptr_copy/rep ,   ptr_assign/rep
-------------------------------------------------------------------------
Polybench_JACOBI_2D   ,     960000000 ,              0 ,                0
Apps_PRESSURE         ,      12000000 ,              0 ,                0
Apps_VOL3D            ,      81623670 ,             21 ,               24

                      ,       Counted ,           Counted
Kernels               ,   ptr_add/rep ,   ptr_bit_lsh/rep
---------------------------------------------------------
Polybench_JACOBI_2D   ,     480000000 ,         480000000
Apps_PRESSURE         ,       6000000 ,           6000000
Apps_VOL3D            ,      40811835 ,          40811835

                      ,        Counted ,         Counted ,           Counted
Kernels               ,   fp64_ops/rep ,   fp64_copy/rep ,   fp64_assign/rep
----------------------------------------------------------------------------
Polybench_JACOBI_2D   ,      400000000 ,       400000000 ,          80000000
Apps_PRESSURE         ,        3000000 ,         4000000 ,           5000000
Apps_VOL3D            ,       79417584 ,        76108518 ,          14339286

                      ,         Counted ,          Counted ,        Counted
Kernels               ,   fp64_load/rep ,   fp64_store/rep ,   fp64_abs/rep
---------------------------------------------------------------------------
Polybench_JACOBI_2D   ,       400000000 ,         80000000 ,              0
Apps_PRESSURE         ,         4000000 ,          2000000 ,        1000000
Apps_VOL3D            ,        39708792 ,          1103022 ,              0

                      ,        Counted ,        Counted ,         Counted
Kernels               ,   fp64_add/rep ,   fp64_sub/rep ,   fp64_mult/rep
-------------------------------------------------------------------------
Polybench_JACOBI_2D   ,      320000000 ,              0 ,        80000000
Apps_PRESSURE         ,        1000000 ,              0 ,         2000000
Apps_VOL3D            ,       18751374 ,       29781594 ,        30884616

                      ,       Counted ,       Counted
Kernels               ,   fp64_lt/rep ,   fp64_ge/rep
-----------------------------------------------------
Polybench_JACOBI_2D   ,             0 ,             0
Apps_PRESSURE         ,       2000000 ,       1000000
Apps_VOL3D            ,             0 ,             0


// Here is the output showing counting for each section of the kernel
// Apps_PRESSURE kernel
{
  // Line 111 hit 1 times
  // fp64* allocation_0 = malloc(1000000 * 8);
  // fp64* allocation_1 = malloc(1000000 * 8);
  // fp64* allocation_2 = malloc(1000000 * 8);
  // fp64* allocation_3 = malloc(1000000 * 8);
  // fp64* allocation_4 = malloc(1000000 * 8);
  {
    // Line 121 hit 1 times
    const Index_type ibegin = 0;
    const Index_type iend = getActualProblemSize();
    Real_ptr compression = m_compression;
    Real_ptr bvc = m_bvc;
    Real_ptr p_new = m_p_new;
    Real_ptr e_old = m_e_old;
    Real_ptr vnewc = m_vnewc;
    const Real_type cls = m_cls;
    const Real_type p_cut = m_p_cut;
    const Real_type pmin = m_pmin;
    const Real_type eosvmax = m_eosvmax;
    // Line 123 hit 1 times
    {
      // Line 126 hit 1 times
      // int64 copy 1
      // int64 preinc 1000000
      // int64 lt 1000001
      for (Index_type i = ibegin;
      i < iend;
      ++i )
      {
        // Line 127 hit 1000000 times
        // ptr add 2000000
        // ptr bit_lsh 2000000
        // fp64 copy 2000000
        // fp64 assign 1000000
        // fp64 load 1000000
        // fp64 store 1000000
        // fp64 add 1000000
        // fp64 mult 1000000
        // by rep bytes touched 16000000
        // by rep bytes read 8000000
        // by rep bytes written 8000000
        // by loop bytes touched 16000000
        // by loop bytes read 8000000
        // by loop bytes written 8000000
        bvc[i] = cls * (compression[i] + 1.0);
      }
      // Line 130 hit 1 times
      // int64 copy 1
      // int64 preinc 1000000
      // int64 lt 1000001
      for (Index_type i = ibegin;
      i < iend;
      ++i )
      {
        // Line 131 hit 1000000 times
        // ptr add 4000000
        // ptr bit_lsh 4000000
        // fp64 copy 2000000
        // fp64 assign 4000000
        // fp64 load 3000000
        // fp64 store 1000000
        // fp64 abs 1000000
        // fp64 mult 1000000
        // fp64 lt 2000000
        // fp64 ge 1000000
        // by rep bytes touched 32000000
        // by rep bytes read 24000000
        // by rep bytes written 8000000
        // by loop bytes touched 32000000
        // by loop bytes read 24000000
        // by loop bytes written 8000000
        Real_type p = bvc[i] * e_old[i] ;
        if ( fabs(p) < p_cut ) p = 0.0 ;
        if ( vnewc[i] >= eosvmax ) p = 0.0 ;
        if ( p < pmin ) p = pmin ;
        p_new[i] = p;
      }
    }
  }
}
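
The byte totals in this breakdown can be cross-checked by hand from the fp64 load and store counts. The sketch below assumes 8-byte doubles (consistent with the `malloc(1000000 * 8)` lines above); the helper names are hypothetical. The last check reflects one possible reading of BytesTouched/rep, namely that each distinct byte is counted once per rep, which would explain why it is 40000000 while LoopBytesTouched/rep is 48000000 (bvc is written in the first loop and read in the second).

```cpp
#include <cassert>
#include <cstdint>

// Assumption for this sketch: Real_type is an 8-byte double.
constexpr std::int64_t kBytesPerFp64 = 8;

// Bytes moved by a loop, from its fp64 load and store counts.
constexpr std::int64_t loop_bytes(std::int64_t loads, std::int64_t stores) {
  return (loads + stores) * kBytesPerFp64;
}

inline void check_pressure_bytes() {
  // Loop at line 127: 1000000 loads and 1000000 stores.
  const std::int64_t loop1 = loop_bytes(1000000, 1000000);
  // Loop at line 131: 3000000 loads and 1000000 stores.
  const std::int64_t loop2 = loop_bytes(3000000, 1000000);
  assert(loop1 == 16000000);
  assert(loop2 == 32000000);
  // The sum matches both the Estimate Bytes/rep and the Counted LoopBytes/rep.
  assert(loop1 + loop2 == 48000000);
  // Five arrays of 1000000 doubles, each byte counted once.
  const std::int64_t distinct = 5 * 1000000 * kBytesPerFp64;
  assert(distinct == 40000000);
}
```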

// here is the code that does the counting with various wrapper macros applied via a special include.

// Only define setCountedAttributes functions past this point
// BEWARE: data types (Index_type, Real_ptr, etc) become wrappers past this point
#include "common/counting_macros.hpp"

void rajaperf::apps::PRESSURE::setCountedAttributes()
{
  VariantID vid = VariantID::Base_Seq;
  size_t tune_idx = 0;

  RAJAPERF_COUNTERS_INITIALIZE();

  setUp(vid, tune_idx);

  {
    RAJAPERF_COUNTERS_SETUP_WRAPPER(
    const Index_type ibegin = 0;
    const Index_type iend   = getActualProblemSize();

    PRESSURE_DATA_SETUP
    );

    RAJAPERF_COUNTERS_REP_SCOPE()
    {

      RAJAPERF_COUNTERS_PAR_LOOP(for (Index_type i = ibegin; i < iend; ++i )) {
        RAJAPERF_COUNTERS_LOOP_BODY(PRESSURE_BODY1);
      }

      RAJAPERF_COUNTERS_PAR_LOOP(for (Index_type i = ibegin; i < iend; ++i )) {
        RAJAPERF_COUNTERS_LOOP_BODY(PRESSURE_BODY2);
      }

    }

  }

  tearDown(vid, tune_idx);

  RAJAPERF_COUNTERS_FINALIZE();
}
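
For readers curious how a macro can both run a loop body and capture its source text for printing, here is a minimal sketch using preprocessor stringizing. The macros `SKETCH_STR`, `SKETCH_BODY`, and `SKETCH_COUNTERS_LOOP_BODY` and the `g_captured_source` log are hypothetical stand-ins, not the definitions in common/counting_macros.hpp.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical log of captured source text; not the PR's data structure.
inline std::vector<std::string> g_captured_source;

// Stringize after the caller's argument has been macro-expanded.
#define SKETCH_STR(...) #__VA_ARGS__

// Record the body's text once per call site, then execute the body unchanged.
// Variadic so that commas inside the body do not split the argument.
#define SKETCH_COUNTERS_LOOP_BODY(...)                                  \
  do {                                                                  \
    static const bool recorded =                                        \
        (g_captured_source.push_back(SKETCH_STR(__VA_ARGS__)), true);   \
    (void)recorded;                                                     \
    __VA_ARGS__                                                         \
  } while (0)

// A kernel-body macro in the style of PRESSURE_BODY1.
#define SKETCH_BODY sum += x[i];

inline double run_sketch() {
  double x[3] = {1.0, 2.0, 3.0};
  double sum = 0.0;
  for (int i = 0; i < 3; ++i) {
    SKETCH_COUNTERS_LOOP_BODY(SKETCH_BODY);
  }
  return sum;
}
```

Because `SKETCH_BODY` is expanded before it reaches `SKETCH_STR`, the log holds the expanded statement rather than the macro name, which is the behavior needed to print a kernel body as in the breakdown above.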

@MrBurmark MrBurmark marked this pull request as draft September 17, 2025 18:31
@MrBurmark MrBurmark force-pushed the feature/burmark1/countedAttributes branch 3 times, most recently from 6a2ff21 to c4a5bb4 Compare September 17, 2025 22:54
@MrBurmark MrBurmark force-pushed the feature/burmark1/countedAttributes branch 4 times, most recently from 2d61ed6 to a3f6041 Compare September 26, 2025 22:24
@MrBurmark MrBurmark force-pushed the feature/burmark1/countedAttributes branch from ad0efd3 to 40b1d9d Compare October 7, 2025 15:20
@MrBurmark MrBurmark force-pushed the feature/burmark1/countedAttributes branch 2 times, most recently from 0303048 to 081e064 Compare October 15, 2025 05:19
@MrBurmark MrBurmark force-pushed the feature/burmark1/countedAttributes branch from 2bc6ae3 to e2f562c Compare October 22, 2025 22:37
@MrBurmark MrBurmark force-pushed the feature/burmark1/countedAttributes branch 2 times, most recently from 1e712fe to 18d7034 Compare November 4, 2025 18:25
This adds a check that our manual counts are correct and adds
the potential for more accurate counts in the future.

Note that getting accurate counts after compiler optimization is difficult,
but in many cases some manual optimization can avoid double counting
operations that the compiler would optimize away.

The accuracy of some counts also depends on how you define the counts.
This leads to differences between the exact counts of bytes and
iterations for some kernels. Normally the LoopBytes*/Rep counters
should be the same as the estimated bytes. Similarly the
ParallelIterations/Rep should match the estimate. The fp64Ops/rep should
be the same as the estimated flops.
Some kernels do not yet support counting and will have values of -1
to indicate that they were not counted.
In some cases the kernel's helper file may not have worked cleanly with the
wrappers. Some kernels use library functions like MPI or std::sort that
may not reliably work with the wrappers.
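
One way to express the comparison described above, as a sketch only (`kNotCounted`, `CheckResult`, and `compare_counted` are hypothetical names, not part of this change), is a three-way result that treats -1 as "not counted" rather than as a mismatch:

```cpp
#include <cassert>
#include <cstdint>

// Sentinel described in the commit message for kernels without counting support.
constexpr std::int64_t kNotCounted = -1;

enum class CheckResult { NotCounted, Match, Mismatch };

// Compare a counted attribute against its manual estimate.
constexpr CheckResult compare_counted(std::int64_t counted,
                                      std::int64_t estimate) {
  if (counted == kNotCounted) return CheckResult::NotCounted;
  return counted == estimate ? CheckResult::Match : CheckResult::Mismatch;
}
```

For Apps_PRESSURE, `compare_counted(48000000, 48000000)` (LoopBytes/rep against the estimated Bytes/rep) yields `CheckResult::Match`.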
@MrBurmark MrBurmark force-pushed the feature/burmark1/countedAttributes branch from 4412955 to e44e549 Compare November 4, 2025 20:43
Development

Successfully merging this pull request may close these issues.

Consider more kernel attributes relating to Bytes
