SPEChpc™ 2021: Changes in V1.1
(To check for possible updates to this document, please see http://www.spec.org/hpc2021/Docs/ )
Introduction to V1.1
SPEChpc 2021 V1.1 is an incremental update to SPEChpc 2021 V1.0 and primarily a bug fix release.
Results generated with V1.1 using the pure-MPI and OpenMP (host) models are comparable to results
from V1.0, and vice versa. However, a few changes do affect the performance of the OpenACC
and OpenMP Target Offload versions, making those results non-comparable to V1.0. These performance-relevant changes were made to
ensure better comparability between the two offload models.
In addition, reports may optionally include a break-out of the internal timing of the benchmarks
to allow a better understanding of the impact of MPI and application initialization overhead when conducting
scaling analysis.
Contents
Benchmark source code changes
LBM (505,605,705,805) |
SOMA (513,613) |
TeaLeaf (518,618,718,818) |
CloverLeaf (519,619,719,819) |
miniSweep (521,621) |
POT3D (528,628,728,828) |
SPH-EXA (532,632) |
HPGMG-FV (534,634,734,834) |
miniWeather (535,635,735,835) |
Reporting
Internal Timer Information
Changes to benchmarks
The following benchmark changes were made in V1.1:
LBM (505,605,705,805)
- Enable host parallelization of the initialization loop with OpenMP Target Offload and OpenACC. For OpenACC, the "self" clause is used. However, since "self" is a recent addition to the OpenACC standard, compiler support for this feature is limited. Setting "-DSPEC_OPENACC_NO_SELF" will instead fall back to using OpenMP host parallelization of this loop (see the sketch after this list).
- Enable support in OpenMP Target Offload for direct device to device MPI communication, enabled via "-DSPEC_ACCEL_AWARE_MPI", for MPI implementations which support this feature.
- Remove the OpenACC "acc_shutdown" API call. This was found to have issues with some MPI implementations which expect a CUDA context to be available during clean-up in MPI_Finalize. "acc_shutdown" was called before MPI_Finalize and shut down the context. While the call could have been moved to after MPI_Finalize, its use was deemed unnecessary, so it was simply removed.
- A suggested change to how remaining grid blocks are distributed amongst ranks (remaining blocks are given to the right-most rank of the grid) to a more equitable distribution was tested, but it had little to no impact on performance and produced incorrect results in a few cases, so the distribution was not modified.
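The first item above roughly follows the pattern sketched below. The routine, array, and loop body are illustrative only, not the actual LBM source; only the "-DSPEC_OPENACC_NO_SELF" macro comes from the suite.

    /* Sketch only: illustrative initialization loop, not the LBM source. */
    void init_grid(double *grid, long ncells)
    {
    #ifndef SPEC_OPENACC_NO_SELF
        /* OpenACC 2.7 "self" clause: run this compute region on the
           local (host) device, i.e. host parallelization of the loop. */
        #pragma acc parallel loop self
    #else
        /* Fallback when the compiler does not support "self":
           plain OpenMP host parallelization. */
        #pragma omp parallel for
    #endif
        for (long i = 0; i < ncells; ++i)
            grid[i] = 0.0;
    }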
SOMA (513,613)
- Enable support in OpenACC and OpenMP Target Offload for direct device to device MPI communication enabled via "-DSPEC_ACCEL_AWARE_MPI" for MPI implementations which support this feature (see the sketch after this list).
- Remove extraneous data movement in OpenACC.
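A minimal sketch of the device to device communication pattern enabled by "-DSPEC_ACCEL_AWARE_MPI", shown here with OpenACC. The routine, buffer, and peer rank are hypothetical and not taken from the SOMA source; only the macro name comes from the suite.

    #include <mpi.h>

    /* Exchange a device-resident buffer with a peer rank. */
    void exchange(double *buf, int n, int peer, MPI_Comm comm)
    {
    #ifdef SPEC_ACCEL_AWARE_MPI
        /* Accelerator-aware MPI: hand the device address directly to MPI. */
        #pragma acc host_data use_device(buf)
        MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, peer, 0, peer, 0,
                             comm, MPI_STATUS_IGNORE);
    #else
        /* Otherwise stage the data through the host copy. */
        #pragma acc update self(buf[:n])
        MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, peer, 0, peer, 0,
                             comm, MPI_STATUS_IGNORE);
        #pragma acc update device(buf[:n])
    #endif
    }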
TeaLeaf (518,618,718,818)
- Align OpenACC and OpenMP Target Offload schedules by using the "collapse(2)" clause for both models (see the sketch after this list).
- Include OpenACC "independent" clauses on two loops where they were inadvertently removed prior to release of v1.0.
- Use triplet notation for OpenACC data clauses, e.g. "buffer[n]" changed to "buffer[:n]".
- Add "%%" in printf to ensure "%" is printed in output.
CloverLeaf (519,619,719,819)
No updates.
miniSweep (521,621)
- Remove "depend" clause from OpenMP version which may cause the code to hang under some configurations.
- Pass "--nthread_e" argument to executable using the value from the config/runhpc "threads" setting.
- Enable support in OpenMP Target Offload for direct device to device MPI communication enabled via "-DSPEC_ACCEL_AWARE_MPI" for MPI implementations which support this feature.
- Remove unused variables flagged by compiler warnings.
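For example (hypothetical config excerpt), with the following setting the tools now append the corresponding "--nthread_e" value to the miniSweep command line:

    threads = 8        # config/runhpc "threads" setting -> "--nthread_e 8"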
POT3D (528,628,728,828)
- Remove OpenACC "async" directives.
- Enable support in OpenMP Target Offload for direct device to device MPI communication enabled via "-DSPEC_ACCEL_AWARE_MPI" for MPI implementations which support this feature.
- Fix an error where the modification to use a scalar instead of an array reduction was incomplete (see the sketch after this list).
- Undefine several HDF5 configuration settings which may not be supported on all platforms.
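The scalar-reduction pattern referred to above, sketched in C for brevity (POT3D itself is Fortran); the routine, array, and loop are illustrative, not the POT3D source.

    /* Reduce into a scalar rather than into an element of a device-mapped
       array, shown here with OpenMP Target Offload. */
    double sum_sq(const double *x, long n)
    {
        double s = 0.0;   /* scalar reduction variable */
        #pragma omp target teams distribute parallel for \
                    reduction(+:s) map(to:x[0:n]) map(tofrom:s)
        for (long i = 0; i < n; ++i)
            s += x[i] * x[i];
        return s;
    }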
SPH-EXA (532,632)
- Fix typo in OpenACC data directive which caused it to be ignored leading to extra data movement in other areas of the code.
HPGMG-FV (534,634,734,834)
- Revise OpenMP Target Offload's rank-to-device binding to use the local rank id rather than the global rank id, as is done in OpenACC (see the sketch after this list).
- Avoid mapping of NULL pointer-based array sections in OpenMP Target Offload.
- Remove "align" attribute since it's a tuning parameter and may cause portability issues.
Reporting
Internal Timer Information
As part of SPEC/HPG's follow-on SPEChpc weak-scaling suite (currently under development), internal timers were added to the codes to measure MPI initialization overhead, application initialization overhead, and the core computation time. For weak scaling, the core compute time will be used to determine a throughput "Figure of Merit" (FOM), measured in units of work over time.
For the current strong-scaled suites, SPEC/HPG decided to optionally include this measurement, as it may help better understand scaling behavior.
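A rough sketch of how the internal phase timers relate to one another is shown below. The phase routines are placeholders, not the benchmarks' actual code; the benchmarks report these values through the SPEC tools rather than printing them directly.

    #include <mpi.h>
    #include <stdio.h>

    static void initialize_problem(void)   { /* read input, decompose domain   */ }
    static void run_core_computation(void) { /* time-step loop, incl. MPI comm */ }
    static void write_and_verify(void)     { /* verification, output files     */ }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        double t0 = MPI_Wtime();

        initialize_problem();        double t1 = MPI_Wtime();   /* Init  */
        run_core_computation();      double t2 = MPI_Wtime();   /* Core  */
        write_and_verify();          double t3 = MPI_Wtime();   /* Resid */

        /* Application time (t3 - t0) = Init + Core + Resid; the externally
           measured ("Reported") time less this is the MPI/start-up overhead. */
        printf("Init %.3f  Core %.3f  Resid %.3f  App %.3f\n",
               t1 - t0, t2 - t1, t3 - t2, t3 - t0);

        MPI_Finalize();
        return 0;
    }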
The internal timing information may only be used for academic and research purposes, or as a derived value per SPEC's Fair Use Rules.
Reporting of the internal timing is disabled by default. To enable it, either add "showtimer=1" to your config file, use the runhpc "--showtimer=1" option, or edit the resulting "raw" (.rsf) file to change the "showtimer" field to 1 and use the rawformat utility to reformat the reports.
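For example (the remainder of the runhpc command line is elided here):

    showtimer = 1                    # in the config file
    runhpc --showtimer=1 ...         # or on the runhpc command line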
When included, the internal timer table is shown at the bottom of the report, except in the CSV report where it is listed just below the reported results. An example of the timer output from a text report:
=================================== Internal Timer Table (informational only) ==================================
| Base | Base Base Base Base | Peak | Peak Peak Peak Peak
Benchmarks | M Reportd | OverHd Init Core Resid | M Reportd | OverHd Init Core Resid
-------------- | - ------- | ------- ------- ------- ------- | - ------- | ------- ------- ------- -------
505.lbm_t | * 17.6 | 1.71 5.83 9.94 0.120 | * 17.6 | 1.71 5.83 9.94 0.120
505.lbm_t | 17.4 | 2.20 3.98 11.1 0.135 | 17.4 | 2.20 3.98 11.1 0.135
513.soma_t | 49.8 | 1.91 3.28 38.0 6.53 | 50.7 | 3.93 6.34 36.2 4.25
513.soma_t | * 52.7 | 2.14 3.60 39.8 7.21 | * 53.1 | 4.31 6.43 38.1 4.23
518.tealeaf_t | * 33.3 | 1.68 1.46 30.0 0.109 | 33.4 | 1.65 1.44 30.2 0.112
518.tealeaf_t | 32.4 | 1.87 1.23 29.2 0.0698 | * 33.7 | 2.10 1.62 29.9 0.104
519.clvleaf_t | 22.9 | 2.65 3.06 17.2 0 | 22.9 | 2.65 3.06 17.2 0
519.clvleaf_t | * 27.0 | 3.11 3.67 20.2 0 | * 27.0 | 3.11 3.67 20.2 0
521.miniswp_t | * 67.2 | 2.65 0.524 64.0 0 | * 65.2 | 3.93 0.884 60.4 0
521.miniswp_t | 64.2 | 2.72 0.515 60.9 0 | 62.7 | 4.21 1.00 57.5 0
528.pot3d_t | 37.5 | 2.45 0.275 34.7 0.0357 | 36.8 | 2.53 0.257 34.0 0.00949
528.pot3d_t | * 38.1 | 2.45 0.244 35.4 0.00141 | * 37.5 | 2.78 0.276 34.5 0.00565
532.sph_exa_t | 79.7 | 2.22 4.31 73.2 0 | 79.7 | 2.22 4.31 73.2 0
532.sph_exa_t | * 80.9 | 2.25 4.57 74.1 0 | * 80.9 | 2.25 4.57 74.1 0
534.hpgmgfv_t | 72.6 | 1.72 6.56 64.0 0.346 | * 69.9 | 4.92 5.97 58.8 0.203
534.hpgmgfv_t | * 72.8 | 1.69 6.54 64.2 0.329 | 69.7 | 4.82 6.14 58.6 0.199
535.weather_t | 25.0 | 2.83 0.0845 22.1 0.00329 | 25.0 | 2.83 0.0845 22.1 0.00329
535.weather_t | * 25.3 | 2.87 0.0872 22.4 0.00285 | * 25.3 | 2.87 0.0872 22.4 0.00285
================================================================================================================
Timer | Description |
Median (M) | Starred* (txt) or underlined (html,pdf) times indicate the median reported time used in the metric. |
Reported (Reportd) | Measured time by the SPEC tools used to compute the metric. |
MPI Overhead (OverHd) | Node, scheduler, and MPI start-up overhead time. (Reported time less application time*) |
Application Initialization (Init) | Time spent in the application initializing data, reading files, domain decomposition, etc. |
Core compute (Core) | Time spent in the core computation of the application. Time includes MPI communication. |
Residual (Resid) | Remaining application time not captured under initialization or core compute. Includes items such as verification of results or saving output data files. |
*Note the log files also include the Application time, which is the time measured between MPI_Init and MPI_Finalize. The application time is not included in the Internal Timer Table but is the sum of the Initialization, Core Compute, and Residual times.
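As a worked example using the first 505.lbm_t base row above: Init + Core + Resid = 5.83 + 9.94 + 0.120 ≈ 15.89 is the application time, and the reported time of 17.6 less that application time gives the 1.71 shown under OverHd.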
Copyright 2021-2022 Standard Performance Evaluation Corporation
All Rights Reserved