Run and Reporting Rules for SPEC HPC2002

SPEC High Performance Group

ABSTRACT
This document provides guidelines required for building, running, and reporting SPEC HPC2002 benchmarks.

Table of Contents

Purpose
1. General Philosophy
2. Building SPEC HPC

      2.0.1 Peak builds
      2.0.2 Runspec must be used
      2.0.3 The runspec build environment
      2.0.4 Continuous Build requirement
      2.0.5 Changes to the runspec build environment
      2.0.6 Cross-compilation allowed
   2.1 General Rules for Optimizations
      2.1.1 Limitations on library substitutions
      2.1.2 Feedback directed optimization is allowed
      2.1.3 Limitations on size changes
   2.2 Additional Optimization Rules
      2.2.1 Feedback directed optimization is allowed
      2.2.2 Assertion flags may be used
      2.2.3 Floating point reordering allowed
      2.2.4 Optimizations and Permitted Source Code Changes
3. Running SPEC HPC
   3.1 System Configuration
      3.1.1 File Systems
      3.1.2 System State
   3.2 Continuous Run Requirement
   3.3 Run-time environment
4. Results Disclosure
   4.1 Rules regarding availability date and systems not yet shipped
   4.2 Configuration Disclosure
      4.2.1 System Identification
      4.2.2 Hardware Configuration
      4.2.3 Software Configuration
      4.2.4 Tuning Information
   4.3 Test Results Disclosure
      4.3.1 Metrics
      4.3.2 Source Code Use and Disclosure: Review Web Rules
   4.4 Research and Academic usage of HPC
   4.5 Disclosures
5. Run Rule Exceptions
6. Revision History

Purpose

This document specifies how the benchmarks in the HPC2002 suites are to be run for measuring and publicly reporting performance results, to ensure that results generated with the suites are meaningful, comparable to other generated results, and reproducible (with documentation covering factors pertinent to reproducing the results).

Per the SPEC license agreement, all results publicly disclosed must adhere to the SPEC Run and Reporting Rules, or be clearly marked as estimates.

The following basics are expected and clarified in the main body of the document:

Each of these points is discussed in further detail below.

Suggestions for improving this run methodology should be made to the SPEC High Performance Group (HPG) for consideration in future releases.

1. General Philosophy

SPEC believes the user community will benefit from an objective series of tests which can serve as common reference and be considered as part of an evaluation process.

SPEC HPC2002 provides benchmarks in the form of source code, which are compiled according to the rules contained in this document. It is expected that a tester can obtain a copy of the suites, install the hardware, compilers, and other software described in another tester's result disclosure, and reproduce the claimed performance (within a small range to allow for run-to-run variation).

12 benchmarks are provided: 3 different applications, each with up to 4 data set sizes (small, medium, large, and extra-large). The benchmarks use the OpenMP and MPI APIs.

SPEC is aware of the importance of optimizations in producing the best system performance. SPEC is also aware that it is sometimes hard to draw an exact line between legitimate optimizations that happen to benefit SPEC benchmarks and optimizations that specifically target the SPEC benchmarks. However, with the list below, SPEC wants to increase awareness of implementers and end users to issues of unwanted benchmark-specific optimizations that would be incompatible with SPEC's goal of fair benchmarking.

To ensure that results are relevant to end-users, SPEC expects that the hardware and software implementations used for running the SPEC benchmarks adhere to the following conventions:

In cases where it appears that the above guidelines have not been followed, SPEC may investigate such a claim and request that the offending optimization (e.g. a SPEC-benchmark specific pattern matching) be backed off and the results resubmitted. Or, SPEC may request that the deficiency be corrected (e.g. make the optimization more general purpose or correct problems with code generation) before submitting results based on the optimization.

The SPEC High Performance Group reserves the right to adapt the HPC2002 benchmarks as it deems necessary to preserve its goal of fair benchmarking (e.g. remove a benchmark, modify benchmark code or workload, etc). If a change is made to a suite, SPEC will notify the appropriate parties (i.e. members and licensees). SPEC may redesignate the metrics (e.g. changing the metric from SPECenvM2002 to SPECenvM2002a). In the case that a benchmark is removed, SPEC reserves the right to republish in summary form adapted results for previously published systems, converted to the new metric. In the case of other changes, such a republication may necessitate re-testing and may require support from the original test sponsor.

SPEC HPC2002 metrics may be estimated. All estimates must be clearly identified as such. Licensees are encouraged to give a rationale or methodology for any estimates, and to publish actual SPEC HPC metrics as soon as possible. SPEC requires that every use of an estimated number be flagged, rather than burying an asterisk at the bottom of a page. For example, say something like this:

The JumboFast will achieve estimated performance of 
         Model 1   SPECenvM2002 50 est.
                   SPECenvL2002 60 est.
         Model 2   SPECenvM2002 70 est.
                   SPECenvL2002 80 est.

The use of SPEC HPC2002 metrics is permitted only after submission to SPEC, successful review, and publication. All other use of SPEC HPC2002 metrics must be clearly identified as estimated or under review. Submitted results that have not yet been approved are labeled as being under review.

2.0 Building SPEC HPC

SPEC has adopted a set of rules defining how the SPEC HPC2002 benchmark suite must be built and run to produce peak metrics. SPEC HPC2002 supports only peak builds; base builds are not supported.

2.0.1 Peak builds

"Peak" metrics are produced by building each benchmark in the suite with a set of optimizations individually tailored for that benchmark. The optimizations selected must adhere to the set of general benchmark optimization rules described in section 2.1 below. Limited source code modifications related to parallel performance are allowed (see section 2.2.4).

2.0.2 Runspec must be used

With the release of the SPEC HPC2002 suite, a set of tools based on GNU Make and Perl5 is supplied to build and run the benchmarks. To produce publishable results, these SPEC tools must be used. This helps ensure reproducibility of results by requiring that all individual benchmarks in the suite are run in the same way and that a configuration file defining the optimizations used is available.

The primary tool is called "runspec" (runspec.bat for Windows NT). It is described in the file runspec.txt in the docs subdirectory of the SPEC root directory -- in a Bourne shell, that is ${SPEC}/docs/runspec.txt .

SPEC supplies pre-compiled versions of the tools for a variety of platforms. If a new platform is used, please see ${SPEC}/docs/tools_build.txt for information on how to build the tools and how to obtain approval for them.

For more complex methods of compilation, for example feedback-directed compilation, SPEC has provided hooks in the tools so that such compilation and execution is possible (see the tools documentation, config.txt, for details). Only if, unexpectedly, such compilation and execution is not possible may the test sponsor ask SPEC for permission to use performance-neutral alternatives (see section 5).
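
As an illustration of what a configuration file records, the fragment below is a hypothetical sketch: the extension label, compiler names, and optimization flags are examples only and are not taken from any actual suite or platform:

```
     tune      = peak
     ext       = example
     CC        = cc
     FC        = f90
     COPTIMIZE = -O2
     FOPTIMIZE = -O3
```

A real config file would also carry the hardware and software descriptions used in the result disclosure (see section 4.2).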

2.0.3 The runspec build environment

When runspec is used to build the SPEC HPC2002 benchmarks, it must be used in generally available, documented, and supported environments (see section 1), and any aspects of the environment that contribute to performance must be disclosed to SPEC (see section 4).

On occasion, it may be possible to improve run time performance by environmental choices at build time. For example, one might install a performance monitor, turn on an operating system feature such as bigpages, or set an environment variable that causes the cc driver to invoke a faster version of the linker.

It is difficult to draw a precise line between environment settings that are reasonable versus settings that are not. Some settings are obviously not relevant to performance (such as hostname), and SPEC makes no attempt to regulate such settings. But for settings that do have a performance effect, for the sake of clarity, SPEC has chosen that:

(a) It is acceptable to install whatever software the tester wishes, including performance-enhancing software, provided that the software is installed prior to starting the builds, remains installed throughout the builds, is documented, supported, generally available, and disclosed to SPEC.
(b) It is acceptable to set whatever system configuration parameters the tester wishes, provided that these are applied at boot time, documented, supported, generally available, and disclosed to SPEC. "Dynamic" system parameters (i.e. ones that do not require a reboot) must nevertheless be applied at boot time, except as provided under section 2.0.5.
(c) After the boot process is completed, environment settings may be made, provided that these settings are documented, supported, generally available, and disclosed to SPEC; are made PRIOR to starting the build; and do not change during the build, except as provided in section 2.0.5.

2.0.4 Continuous Build requirement

As described in section 1, it is expected that testers can reproduce other testers' results. In particular, it must be possible for a new tester to compile the peak benchmarks for an entire suite (e.g. SPECenvM2002) in one execution of runspec, with appropriate command line arguments and an appropriate configuration file, and obtain executable binaries that are (from a performance point of view) equivalent to the binaries used by the original tester.

The simplest and least error-prone way to meet this requirement is for the original tester to take production hardware, production software, a SPEC config file, and the SPEC tools and actually build the benchmarks in a single invocation of runspec on the System Under Test (SUT). But SPEC realizes that there is a cost to benchmarking and would like to address this, for example through the rules that follow regarding cross-compilation. However, in all cases, the tester is taken to assert that the compiled executables will exhibit the same performance as if they had all been compiled on the SUT (see 2.0.6).

2.0.5 Changes to the runspec build environment

SPEC HPC2002 binaries must be built using the environment rules of section 2.0.3, and may not rely upon any changes to the environment during the build.

For a build, the environment may be changed, subject to the following constraints:

2.0.6 Cross-compilation allowed

It is permitted to use cross-compilation, that is, a building process where the benchmark executables are built on a system (or systems) that differ(s) from the SUT. The runspec tool must be used on all systems (typically with "-a build" on the host(s) and "-a validate" on the SUT).

If all systems belong to the same product family and the software used to build the executables is available on all of them, this does not need to be documented. In the case of a true cross-compilation (e.g. if the software used to build the benchmark executables is not available on the SUT, or the host system provides performance gains via specialized tuning or hardware not present on the SUT), the host system(s) and software used for the benchmark building process must be documented in the Notes section. See section 4.

2.1 General Rules for Optimizations

The following rules apply to compiler flag selection for SPEC HPC Metrics.

2.1.1 Limitations on library substitutions

Flags which substitute pre-computed (e.g. library-based) routines for routines defined in the benchmark on the basis of the routine's name are not allowed. Exceptions are:

  1. the function "alloca". It is permitted to use a flag that substitutes the system's "builtin_alloca" for any C components in a benchmark.
  2. the netlib-interface-compliant level 1, 2 and 3 BLAS functions, LAPACK functions, and FFT functions.

2.1.2 Feedback directed optimization is allowed.

Only the training input (which is automatically selected by runspec) may be used for the run that generates the feedback data.

Optimization with multiple feedback runs is also allowed (see 2.2.1).

The requirement to use only the train data set at compile time shall not be taken to forbid the use of run-time dynamic optimization tools that observe the reference execution and dynamically modify the in-memory copy of the benchmark. However, such tools may not affect in any way later executions of the same benchmark (for example, when running multiple times in order to determine the worst run time). Such tools must also be disclosed in the submission of a result and must be used for the entire suite (see section 3.3).

2.1.3 Limitations on size changes

Flags that change a data type size to a size different from the default size of the compilation system are not allowed. Exceptions are: a) a C long may be 32 bits or greater; b) pointer sizes may be set different from the default size.

2.2 Additional Optimization Rules

In addition to the rules listed in section 2.1 above, the selection of optimizations used to produce SPEC HPC2002 metrics is subject to the following:

2.2.1 Feedback directed optimization is allowed.

The allowed steps are:
PASS1: compile the program

Training run: run the program with the train data set

PASS2: re-compile the program, or invoke a tool that otherwise adjusts the program, using the profile observed during the training run.

PASS2 is optional. For example, it is conceivable that a daemon might optimize the image automatically based on the training run, without further tester intervention. Such a daemon would have to be noted in the full disclosure to SPEC.

It is acceptable to use the various fdo_* hooks to clean up the results of previous feedback compilations. The preferred hook is fdo_pre0 -- for example:

               fdo_pre0 = rm /tmp/prof/*Counts*

Other than such cleanup, no intermediate processing steps may be performed between the steps listed above. If additional processing steps are required, the optimization remains allowed, since SPEC HPC2002 reports only peak metrics.

When a two-pass process is used, the flag(s) that explicitly control(s) the generation or the use of feedback information can be - and usually will be - different in the two compilation passes. For the other flags, one of the two conditions must hold:

  1. The same set of flags is used for both invocations of the compiler/linker. For example:

            PASS1_CFLAGS= -gen_feedback -fast_library -opt1 -opt2
            PASS2_CFLAGS= -use_feedback -fast_library -opt1 -opt2

  2. The set of flags in the first invocation is a subset of the flags used in the second. For example:

            PASS1_CFLAGS= -gen_feedback -fast_library
            PASS2_CFLAGS= -use_feedback -fast_library -opt1 -opt2

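
The two conditions above reduce to a simple set relation: setting aside the flags that control feedback generation and use, the PASS1 flags must equal, or be a subset of, the PASS2 flags. The following Python sketch is illustrative only and is not part of the SPEC tools; the flag names (such as -gen_feedback) are hypothetical examples:

```python
# Illustrative checker for the PASS1/PASS2 flag rule (not part of the
# SPEC tools). Flag names such as -gen_feedback are hypothetical.

FEEDBACK_FLAGS = {"-gen_feedback", "-use_feedback"}  # flags controlling feedback

def flags_conform(pass1_flags, pass2_flags):
    """True if, ignoring the feedback-control flags, the PASS1 flags
    equal or are a subset of the PASS2 flags."""
    p1 = set(pass1_flags) - FEEDBACK_FLAGS
    p2 = set(pass2_flags) - FEEDBACK_FLAGS
    return p1 <= p2   # subset test; equality is the special case p1 == p2

# The two examples from the rule above, plus one violation:
print(flags_conform(["-gen_feedback", "-fast_library", "-opt1", "-opt2"],
                    ["-use_feedback", "-fast_library", "-opt1", "-opt2"]))  # True
print(flags_conform(["-gen_feedback", "-fast_library"],
                    ["-use_feedback", "-fast_library", "-opt1", "-opt2"]))  # True
print(flags_conform(["-gen_feedback", "-opt3"],
                    ["-use_feedback", "-fast_library"]))                    # False
```

The third call fails because PASS1 uses a flag (-opt3) that PASS2 does not.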

2.2.2 Assertion flags may be used.

An assertion flag is one that supplies semantic information that the compilation system did not derive from the source statements of the benchmark.

With an assertion flag, the programmer asserts to the compiler that the program has certain nice properties that allow the compiler to apply more aggressive optimization techniques (for example, that there is no aliasing via C pointers).

2.2.3 Floating point reordering allowed

Results may use flags which affect numerical accuracy or sensitivity by reordering floating-point operations based on algebraic identities. In addition, any reordering due to parallel calculations finishing in a different order is permitted; e.g. reductions may be done in any order if performed in parallel.

2.2.4 Optimizations and Permitted Source Code Changes

SPEC HPC allows source code modifications. Changes to the directives and source are permitted to facilitate generally useful and portable optimizations, with a focus on improving scalability. Changes in algorithm are not permitted. Vendor-unique extensions to OpenMP or MPI are allowed, provided they are portable.

Qualifications for permitted optimizations include:

  1. ANSI standard compliant optimizations
  2. ISO Fortran and C compliant optimizations
  3. Optimizations that produce valid results on other compilers and architectures
  4. Use of a subroutine or function name in a compiler flag (e.g. for inlining)

Examples of permitted source code modifications and optimizations are as follows:

  1. Loop Reordering
  2. Loops that explicitly touch memory in a specific order.
  3. Reshaping arrays
  4. Inlining source code
  5. Parallelization of serial sections without substantive algorithm changes.
  6. Vendor specific OpenMP extensions
  7. Modifications to parallel workload and/or memory distribution

Examples of optimizations or source code modifications that are not permitted are as follows:

  1. Changing a direct solver to an iterative solver.
  2. Adding calls to vendor specific subroutines
  3. Vendor unique directives, which are not OpenMP extensions
  4. Language Extensions

Full source and a written report on the nature and justification of the source changes are required with any submission having source changes. These reports will be made public on the SPEC web site.

Source code added by a vendor is expected to be portable to other compilers and architectures. In particular, the source code is required to run on at least one compiler/run-time library/architecture combination other than the vendor's platform.

All source code changes are subject to review by the HPG committee.

Source code modifications are protected by a 6-week publication window. That is, for a period of 6 weeks after the publication of results based on a set of source code changes, results based on the same source code modification or technique may not be published by others without the approval of the original tester.

3. Running SPEC HPC

3.1 System Configuration

3.1.1 File Systems

SPEC allows any type of file system (disk-based, memory-based, NFS, DFS, FAT, NTFS, Clustered, etc.) to be used. The types of file system must be disclosed in reported results.

3.1.2 System State

The system state (multi-user, single-user, init level N) may be selected by the tester. This state, along with any changes to the default configuration of daemon processes or system tuning parameters, must be documented in the notes section of the results disclosure. (For Windows NT, the system state is normally "Default"; a list of any services that were shut down, e.g. a networking service, should be provided.)

3.2 Continuous Run Requirement

All benchmark executions, including the validation steps, contributing to a particular result report must occur continuously, that is, in one execution of runspec.

3.3 Run-time environment

SPEC does not attempt to regulate the run-time environment for the benchmarks, other than to require that the environment be:

(a) set prior to runspec and consistent throughout the run, with the exception of certain user environment modifications described below.
(b) fully described in the submission, and
(c) in compliance with section 1, "Philosophy".

For example, if each of the following:

          run level:   single-user 
          OS tuning:   bigpages=yes, cpu_affinity=hard
          file system: in memory

were set prior to the start of runspec, unchanged during the run, described in the submission, and documented and supported by a vendor for general use, then these options could be used in a HPC submission.

Note: Item (a) is intended to forbid all means by which a tester might change the environment. In particular, it is forbidden to change the environment during the run using the config file hooks such as monitor_pre_bench. Those hooks are intended for use when studying the benchmarks, not for actual submissions.

For a run, the environment may be changed, subject to the following constraints:

Notes:

  1. It is permitted but not required to compile in the same runspec invocation as the execution. See section 2.0.6 regarding cross-compilation.
  2. It is permitted but not required to run multiple benchmarks in a single invocation of runspec.

4. Results Disclosure

SPEC requires a full disclosure of results and configuration details sufficient to reproduce the results. Results published outside of the SPEC web site (www.spec.org) in a publicly available medium, and not reviewed by SPEC, are either estimates or under review, and must be labeled as such. Results published under non-disclosure, for company internal use, or as company confidential are not considered "publicly" available.

A full disclosure of results will typically include:

A full disclosure of results should include sufficient information to allow a result to be independently reproduced. If a tester is aware that a configuration choice affects performance, then s/he should document it in the full disclosure.

Note: this rule is not meant to imply that the tester must describe irrelevant details or provide massively redundant information. For example, if the SuperHero Model 1 comes with a write-through cache, and the SuperHero Model 2 comes with a write-back cache, then specifying the model number is sufficient, and no additional steps need to be taken to document the cache protocol. But if the Model 3 is available with both write-through and write-back caches, then a full disclosure must specify which cache is used.

For information on how to submit a result to SPEC, contact the SPEC office. Contact information is maintained at the SPEC web site, www.spec.org

4.1 Rules regarding availability date and systems not yet shipped

If a tester submits results for a hardware or software configuration that has not yet shipped, the submitting company must:

"Generally available" means that the product can be ordered by ordinary customers, ships in a reasonable period after orders are submitted, and at least one customer has received it. (The term "reasonable period" is not specified in this paragraph, because it varies with the complexity of the system. But it seems likely that a reasonable period for a $500 machine would probably be measured in minutes; a reasonable period for a $5,000,000 machine would probably be measured in months.)

It is acceptable to test larger configurations than customers are currently ordering, provided that the larger configurations can be ordered and the company is prepared to ship them. For example, if the SuperHero is available in configurations of 1 to 1000 CPUs, but the largest order received to date is for 128 CPUs, the tester would still be at liberty to test a 1000 CPU configuration and submit the result.

A beta release of a compiler (or other software) can be used in a submission, provided that the performance-related features of the compiler are committed for inclusion in the final product. The tester should practice due diligence to ensure that the tests do not use an uncommitted prototype with no particular shipment plans. An example of due diligence would be a memo from the compiler Project Leader which asserts that the tester's version accurately represents the planned product, and that the product will ship on date X.

The general availability date for software is either the committed customer shipment date for the final product, or the date of the beta, provided that all three of the following conditions are met:

  1. The beta is open to all interested parties without restriction. For example, a compiler posted to the web for general users to download, or a software subscription service for developers, would both be acceptable.
  2. The beta is generally announced. A secret test version is not acceptable.
  3. The final product has a committed ship date within 3 months of the beta's first public release, which is specified in the notes section.

If it is not possible to meet all three of these conditions, then the date of the beta may not be used as the date of general availability. In that case, use the date of the final product (which, then, must be within the 3 month window.)

As an example, suppose that in February 2001 a tester uses the generally downloadable GoFast V5.2 beta which shipped in January 2001, but the final product is committed to ship in Mar, 2001 (i.e. less than 3 months later). It would be acceptable to say something like this:

       sw_avail     = Jan-2001
       sw_compiler  = GoFast Fortran V5.2 (Beta 1)
       notes900     = GoFast Fortran V5.2 (final) will ship March, 2001
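
The availability-date decision can be sketched in Python. This is an illustrative reading, not an official SPEC tool; it assumes the 3-month window is measured in calendar months between the beta's first public release and the committed final ship date, and the dates follow the GoFast example:

```python
from datetime import date

# Illustrative sketch of the availability-date rule in section 4.1 (not an
# official SPEC tool). Assumes the 3-month window is counted in calendar
# months between beta release and committed final ship date.

def sw_avail(beta_release, final_ship, open_to_all, announced):
    """Return the date usable as the software availability date."""
    months_apart = (final_ship.year - beta_release.year) * 12 \
                 + (final_ship.month - beta_release.month)
    if open_to_all and announced and months_apart <= 3:
        return beta_release          # all three conditions met: beta date
    return final_ship                # otherwise: final-product ship date

# The GoFast example: beta shipped Jan-2001, final committed for Mar-2001.
print(sw_avail(date(2001, 1, 15), date(2001, 3, 1), True, True))   # 2001-01-15
```

A secret beta (announced=False) or a final product committed more than 3 months out would instead yield the final product's ship date.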

SPEC is aware that performance results published for systems that have not yet shipped may sometimes be subject to change, for example when a last-minute bugfix reduces the final performance. If something becomes known that reduces performance by more than 2.75% on an overall metric (for example, SPECenvL2002 or SPECenvM2002), SPEC requests that the result be resubmitted.

4.2 Configuration Disclosure

The following sections describe the various elements that make up the disclosure for the system and test configuration used to produce a given test result. The SPEC tools used for the benchmark allow setting this information in the configuration file:

4.2.1 System Identification

4.2.2 Hardware Configuration

4.2.3 Software Configuration

4.2.4 Tuning Information

SPEC is aware that sometimes the spelling of compiler switches, or even the presence of compiler switches, changes between beta releases and final releases. For example, suppose that during a compiler beta the tester specifies:

        f90 -fast -architecture_level 3 -unroll 16

but the tester knows that in the final release the architecture level will be automatically set by -fast, and the compiler driver is going to change to set the default unroll level to 16. In that case, it would be permissible to mention only -fast in the notes section of the full disclosure. The tester is expected to exercise due diligence regarding such flag reporting, to ensure that the disclosure correctly records the intended final product. An example of due diligence would be a memo from the compiler Project Leader which promises that the final product will spell the switches as reported. SPEC may request that such a memo be generated and that a copy be provided to SPEC.

4.3 Test Results Disclosure

The actual test results consist of the elapsed times and ratios for the individual benchmarks and the overall SPEC metric produced by running the benchmarks via the SPEC tools. The required use of the SPEC tools ensures that the results generated are based on benchmarks built, run, and validated according to the SPEC run rules. Below is a list of the measurement components for the SPEC HPC2002 suite and metric:

4.3.1 Metrics

All runs of a specific benchmark when using the SPEC tools are required to have validated correctly.

The benchmark executables must have been built according to the rules described in section 2 above.

4.4 Research and Academic usage of HPC

SPEC encourages use of the HPC2002 suites in academic and research environments.

Additional guidelines for academic and research publications may be found in the HPG section of the SPEC web site (www.spec.org/hpg).

4.5 Disclosures

If a SPEC HPC2002 licensee publicly discloses an HPC2002 result (for example in a press release, academic paper, magazine article, or public web site), and does not clearly mark the result as an estimate, any SPEC member may request that the rawfile(s) from the run(s) be sent to SPEC. The rawfile(s) must be made available to all interested members no later than 10 working days after the request.

If the tester is not ready to make a formal submission (for example, because SPEC requires a fee for non-member submissions, because the system will not ship within 3 months, or because the compilers are for research and will never be made generally available), the result will be neither formally reviewed nor posted on the SPEC web page.

But when public claims are made about HPC2002 results, whether by vendors or by academic researchers, SPEC reserves the right to also comment publicly on those claims, for example if it should occur that the rawfile is not made available, or shows substantially different performance from the tester's claim, or shows obvious violations of the run rules.

5. Run Rule Exceptions

If for some reason, the tester cannot run the benchmarks as specified in these rules, the tester can seek SPEC HPG approval for performance-neutral alternatives. No publication may be done without such approval. HPG maintains a Policies and Procedures document that defines the procedures for such exceptions.

6. Revision History

Version 1 written August 2002.