ABSTRACT
This document sets the requirements to build, run, and report on the SPEC CPU2006 benchmarks. These rules may be updated from time to time. Revisions are posted at http://www.spec.org/cpu2006/Docs/runrules.html.
Testers are required to comply with the version posted as of the date of their testing. In the event of substantive changes, a notice will be posted at SPEC's top-level page, http://www.spec.org, to define a transition period during which compliance with the new rules is phased in.
Edit history:
3-Aug-2011: updates for V1.2:
At the top, clarify that these rules may be updated from time to time.
In rule 4.2, note that system descriptions may be edited after a test.
Add rule 4.2.1.1 about identifying equivalent systems.
In rule 4.2.3, reference the updated mechanism for setting the auto parallel field.
In rule 4.2.4(g), emphasize that all tuning must be documented.
In rule 4.2.5 clarify which information belongs in flags files vs. other locations.
Add rule 4.2.7 on how to document cross compiles, referencing it from 2.0.6.
Replace rule 4.7 with a reference to the SPEC-wide Fair Use rule.
Improve consistency of terminology for SPECspeed metrics vs. SPECrate metrics.
8-Apr-2008: release candidate 2 for V1.1:
Explain a philosophy of estimates in rule 1.6, and clarify marking of estimates in rule 4.8.
Tweak rule 2.1.1 to clarify that the rule is discussing "benchmark" source code.
Clarify documentation of system state and tuning, in rules 3.1.2, 4.2.3 (paragraphs b and g), and 4.2.4 (paragraphs f, g).
Add rule 3.2.5 for parallel setup and parallel test.
Expand rule 4.2.3 on automatic parallelization and reporting thereof.
Add new rule 4.2.6 regarding disclosure of configurations for user-built systems.
In rule 4.3.2 allow conversions in both directions between SPECspeed metrics and 1 copy SPECrate metrics.
In rule 4.6 note that a required disclosure is considered public information.
23-Jul-2006: version V1.0
Overview
1. Philosophy
2. Building SPEC CPU2006
3. Running SPEC CPU2006
4. Results Disclosure
5. Run Rule Exceptions
Detailed Contents
1. Philosophy
1.1 Purpose
1.2 A SPEC CPU2006 Result Is An Observation
1.2.1 Test Methods
1.2.2 Conditions of Observation
1.2.3 Assumptions About the Tester
1.3 A SPEC CPU2006 Result Is A Declaration of Expected Performance
1.3.1 Reproducibility
1.3.2 Obtaining Components
1.4 A SPEC CPU2006 Result is a Claim about Maturity of Performance Methods
1.5 Peak and base builds
1.6 Estimates
1.7 About SPEC
1.7.1 Publication on SPEC's web site is encouraged
1.7.2 Publication on SPEC's web site is not required
1.7.3 SPEC May Require New Tests
1.7.4 SPEC May Adapt the Suites
1.8 Usage of the Philosophy Section
2.0 Building SPEC CPU2006
2.0.1 (removed)
2.0.2 SPEC's tools must be used
2.0.3 The runspec build environment
2.0.4 Continuous Build requirement
2.0.5 Changes to the runspec build environment
2.0.6 Cross-compilation allowed
2.0.7 Individual builds allowed
2.0.8 Tester's assertion of equivalence between build types
2.1 General Rules for Selecting Compilation Flags
2.1.1 Cannot use names
2.1.2 Limitations on library substitutions
2.1.3 Feedback directed optimization is allowed in peak
2.1.4 Limitations on size changes
2.1.5 Portability flags
2.2 Base Optimization Rules
2.2.1 Safe
2.2.2 Same for all
2.2.3 Feedback directed optimization must not be used in base
2.2.4 Assertion flags must NOT be used in base
2.2.5 Floating point reordering allowed
2.2.6 (removed)
2.2.7 Safety and Standards Conformance
2.2.8 Base build environment
2.2.9 Portability Switches for Data Models
2.2.10 Cross-module optimization
2.2.11 Alignment switches are allowed
2.2.12 Pointer sizes
3. Running SPEC CPU2006
3.1 System Configuration
3.1.1 File Systems
3.1.2 System State
3.2 Additional Rules for Running SPECrate Tests
3.2.1 Number of copies in peak
3.2.2 Number of copies in base
3.2.3 Single file system
3.2.4 Submit
3.2.5 Parallel setup and parallel test
3.3 Continuous Run Requirement
3.4 Run-time environment
3.5 Base, peak, and basepeak
3.6 Run time dynamic optimization
4. Results Disclosure
4.1 Rules regarding availability date and systems not yet shipped
4.1.1 Pre-production software can be used
4.1.2 Software component names
4.1.3 Specifying dates
4.1.4 If dates are not met
4.1.5 Performance changes for pre-production systems
4.2 Configuration Disclosure
4.2.1 System Identification
4.2.1.1 Identification of Equivalent Systems
4.2.2 Hardware Configuration
4.2.3 Software Configuration
4.2.4 Tuning Information
4.2.5 Description of Tuning Options ("Flags File")
4.2.6 Configuration Disclosure for User Built Systems
4.2.7 Documentation for cross-compiles
4.3 Test Results Disclosure
4.3.1 SPECspeed Metrics
4.3.2 Throughput Metrics
4.3.3 Performance changes for production systems
4.4 Metric Selection
4.5 Research and Academic usage of CPU2006
4.6 Required Disclosures
4.7 Fair Use
4.8 Estimates are Allowed
5. Run Rule Exceptions
This section is an overview of the purpose, definitions, methods, and assumptions for the SPEC CPU2006 run rules.
The purpose of the SPEC CPU2006 benchmark and its run rules is to further the cause of fair and objective CPU benchmarking. The rules help ensure that published results are meaningful, comparable to other results, and reproducible. SPEC believes that the user community benefits from an objective series of tests which serve as a common reference.
Per the SPEC license agreement, all SPEC CPU results disclosed in public -- whether in writing or in verbal form -- must adhere to the SPEC CPU Run and Reporting Rules, or be clearly described as estimates.
A published SPEC CPU2006 result is three things:
A published SPEC CPU2006 result is an empirical report of performance observed when carrying out certain computationally intensive tasks.
SPEC supplies the CPU2006 benchmarks in the form of source code, which testers are not allowed to modify except under certain very restricted circumstances. SPEC CPU2006 includes 29 benchmarks, organized into 2 suites: an integer suite of 12 benchmarks, known as CINT2006; and a floating point suite of 17 benchmarks, known as CFP2006.
Note: this document avoids the (otherwise common) usage "CPU2006 suite" (singular), instead insisting on "CPU2006 suites" (plural). Thus a rule that requires consistency within a suite means that consistency is required across the set of 12, or across the set of 17; not the set of 29.
The tester supplies compilers and the System Under Test (SUT). The tester may set optimization flags and, where needed, portability flags, in a SPEC config file. SPEC supplies tools which automatically:
The CPU2006 benchmarks (code + workload) have been designed to fit within about 1GB of memory (when compiled with 32-bit pointers), i.e. within the capabilities of systems that allow user applications to use 32 bits (4GB).
(SPEC is aware that some systems that are commonly described as "32-bit" may provide a smaller number of bits to user applications, for example if one or more bits are reserved to privileged code. SPEC is also aware that there are many ways to spend profligate amounts of virtual memory. Therefore, although 32-bit systems are within the design center for the CPU2006 suites, SPEC does not guarantee any particular memory size for the benchmarks, nor that they will necessarily fit on all systems that are described as 32-bit.)
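The address-space arithmetic above can be checked with a short sketch. It assumes binary units (1GB read as 2**30 bytes). The function name and the reserved_bits parameter are illustrative only and are not part of any SPEC tool.

```python
GIB = 2 ** 30  # assumption: "GB" in the text is read as 2**30 bytes


def addressable_bytes(pointer_bits, reserved_bits=0):
    """Bytes a user application can address when some pointer bits are reserved."""
    return 2 ** (pointer_bits - reserved_bits)


# All 32 bits available to the user application.
full_32 = addressable_bytes(32)
# Hypothetical case: one bit reserved to privileged code.
reduced_32 = addressable_bytes(32, reserved_bits=1)

print(full_32 // GIB, "GiB addressable with 32 usable bits")
print(reduced_32 // GIB, "GiB addressable with 31 usable bits")
```

With one bit reserved the usable space halves, which illustrates why a system described as "32-bit" may offer less room than the full 4GB.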
The report that certain performance has been observed is meaningful only if the conditions of observation are stated. SPEC therefore requires that a published result include a description of all performance-relevant conditions.
It is assumed that the tester:
The person who actually carries out the test is, therefore, the first and the most important audience for these run rules. The rules attempt to help the tester by trying to be clear about what is and what is not allowed.
A published SPEC CPU2006 result is a declaration that the observed level of performance can be obtained by others. Such declarations are widely used by vendors in their marketing literature, and are expected to be meaningful to ordinary customers.
It is expected that later testers can obtain a copy of the SPEC CPU2006 suites, obtain the components described in the original result, and reproduce the claimed performance, within a small range to allow for run-to-run variation.
Therefore, it is expected that the components used in a published result can in fact be obtained, with the level of quality commonly expected for products sold to ordinary customers. Such components are required to:
The judgment of whether a component meets the above list may sometimes pose difficulty, and various references are given in these rules to guidelines for such judgment. But by way of introduction, imagine a vendor-internal version of a compiler, designated only by an internal code name, unavailable to customers, which frequently generates incorrect code. Such a compiler would fail to provide a suitable environment for general programming, and would not be ready for use in a SPEC CPU2006 result.
A published SPEC CPU result carries an implicit claim that the performance methods it employs are more than just "prototype" or "experimental" or "research" methods; it is a claim that there is a certain level of maturity and general applicability in its methods. Unless clearly described as an estimate, a published SPEC result is a claim that the performance methods employed (whether hardware or software, compiler or other):
SPEC is aware of the importance of optimizations in producing the best performance. SPEC is also aware that it is sometimes hard to draw an exact line between legitimate optimizations that happen to benefit SPEC benchmarks, versus optimizations that exclusively target the SPEC benchmarks. However, with the list above, SPEC wants to increase awareness among implementers and end users of unwanted benchmark-specific optimizations that would be incompatible with SPEC's goal of fair benchmarking.
The tester must describe the performance methods that are used in terms that a performance-aware user can follow, so that users can understand how the performance was obtained and can determine whether the methods may be applicable to their own applications. The tester must be able to make a credible public claim that a class of applications in the real world may benefit from these methods.
"Peak" metrics may be produced by building each benchmark in the suite with a set of optimizations individually selected for that benchmark. The optimizations selected must adhere to the set of general benchmark optimization rules described in section 2.1 below. This may also be referred to as "aggressive compilation".
"Base" metrics must be produced by building all the benchmarks in the suite with a common set of optimizations. In addition to the general benchmark optimization rules (section 2.1), base optimizations must adhere to a stricter set of rules described in section 2.2.
These additional rules serve to form a "baseline" of performance that can be obtained with a single set of compiler switches, single-pass make process, and a high degree of portability, safety, and performance.
The choice of a single set of switches and single-pass make process is intended to reflect the performance that may be attained by a user who is interested in performance, but who prefers not to invest the time required for tuning of individual programs, development of training workloads, and development of multi-pass Makefiles.
SPEC allows base builds to assume that the program follows the relevant language standard (i.e. it is portable). But this assumption may be made only where it does not interfere with getting the expected answer. For all testing, SPEC requires that benchmark outputs match an expected set of outputs, typically within a benchmark-defined tolerance to allow for implementation differences among systems.
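The idea of matching outputs "within a benchmark-defined tolerance" can be illustrated with a minimal sketch. This is not SPEC's validation tool; the function name, the abstol and reltol parameters, and the sample values are assumptions chosen for illustration.

```python
def within_tolerance(observed, expected, abstol=None, reltol=None):
    """Return True if observed matches expected exactly or within a tolerance.

    abstol: allowed absolute difference (hypothetical parameter).
    reltol: allowed difference relative to the expected value (hypothetical).
    """
    if observed == expected:
        return True
    diff = abs(observed - expected)
    if abstol is not None and diff <= abstol:
        return True
    if reltol is not None and expected != 0 and diff <= reltol * abs(expected):
        return True
    return False


# Illustrative values only: a small implementation difference passes,
# a large one does not.
small_difference_ok = within_tolerance(1.0000001, 1.0, reltol=1e-6)
large_difference_ok = within_tolerance(1.1, 1.0, reltol=1e-6)
print(small_difference_ok, large_difference_ok)
```

A comparison of this kind lets systems with slightly different floating point behavior both validate, while still rejecting an answer that is simply wrong.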
Because the SPEC CPU benchmarks are drawn from the compute intensive portion of real applications, some of them use popular practices that compilers must commonly cater for, even if those practices are nonstandard. In particular, some of the programs (and, therefore, all of base) may have to be compiled with settings that do not exploit all optimization possibilities that would be possible for programs with perfect standards compliance.
In base, the compiler may not make unsafe assumptions that are more aggressive than what the language standard allows.
Finally, though, as a performance suite, SPEC CPU has throughout its history allowed certain common optimizations to nevertheless be included in base, such as reordering of operands in accordance with algebraic identities.
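Reordering operands according to algebraic identities can change floating point results slightly, which is why the allowance is worth stating explicitly. The following sketch shows the effect with IEEE 754 double precision; the specific operands are chosen only for illustration.

```python
# Addition is associative in algebra, but not in floating point arithmetic.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)

# The two groupings differ in the last bit, although both are valid
# evaluations of the same algebraic expression.
print(left_to_right == regrouped)
print(abs(left_to_right - regrouped))
```

The difference is tiny, and an output tolerance such as the one SPEC requires benchmarks to define is what allows a reordered computation to still validate.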
Rules for building the benchmarks are described in section 2.
SPEC CPU2006 metrics may be estimated. All estimates must be clearly designated as such.
This philosophy section has described how a "result" has certain characteristics: e.g. a result is an empirical report of performance, includes a full disclosure of performance-relevant conditions, can be reproduced, uses mature performance methods. By contrast, estimates may fail to provide one or even all of these characteristics.
Nevertheless, estimates have long been seen as valuable for SPEC CPU benchmarks. Estimates are set at inception of a new chip design and are tracked carefully through analytic, simulation, and HDL (Hardware Description Language) models. They are validated against prototype hardware and, eventually, production hardware. With chip designs taking years, and requiring very large investments, estimates are central to corporate roadmaps. Such roadmaps may compare SPEC CPU estimates for several generations of processors, and, explicitly or by implication, contrast one company's products and plans with another's.
SPEC wants the CPU benchmarks to be useful, and part of that usefulness is allowing the metrics to be estimated.
The key philosophical point is simply that estimates must be clearly distinguished from results.
SPEC encourages the review of CPU2006 results by the relevant subcommittee, and subsequent publication on SPEC's web site (http://www.spec.org/cpu2006). SPEC uses a peer-review process prior to publication, in order to improve consistency in the understanding, application, and interpretation of these run rules.
Review by SPEC is not required. Testers may publish rule-compliant results independently. No matter where published, all results publicly disclosed must adhere to the SPEC Run and Reporting Rules, or be clearly marked as estimates. SPEC may take action if the rules are not followed.
In cases where it appears that the run rules have not been followed, SPEC may investigate such a claim and require that a result be regenerated, or may require that the tester correct the deficiency (e.g. make the optimization more general purpose or correct problems with code generation).
The SPEC Open Systems Group reserves the right to adapt the SPEC CPU2006 suites as it deems necessary to preserve its goal of fair benchmarking. Such adaptations might include (but are not limited to) removing benchmarks, modifying codes or workloads, adapting metrics, republishing old results adapted to a new metric, or requiring retesting by the original tester.
This philosophy section is intended to introduce concepts of fair benchmarking. It is understood that in some cases, this section uses terms that may require judgment, or which may lack specificity. For more specific requirements, please see the sections below.
In case of a conflict between this philosophy section and a run rule in one of the sections below, normally the run rule found below takes priority.
Nevertheless, there are several conditions under which questions should be resolved by reference to this section: (a) self-conflict: if rules below are found to impose incompatible requirements; (b) ambiguity: if they are unclear or silent with respect to a question that affects how a result is obtained, published, or interpreted; (c) obsolescence: if the rules below are made obsolete by changing technical circumstances or by directives from superior entities within SPEC.
When questions arise as to interpretation of the run rules:
Interested parties should seek first to resolve questions based on the rules as written in the sections that follow. If this is not practical (because of problems of contradiction, ambiguity, or obsolescence), then the principles of the philosophy section should be used to resolve the issue.
The SPEC CPU subcommittee should be notified of the issue. Contact information may be found via the SPEC web site, www.spec.org.
SPEC may choose to issue a ruling on the issue at hand, and may choose to amend the rules to avoid future such issues.
SPEC has adopted a set of rules defining how SPEC CPU2006 benchmark suites must be built and run to produce peak and base metrics.
(This rule, formerly present in CPU2000, is now covered in section 1.5.)
With the release of SPEC CPU2006 suites, a set of tools based on GNU Make and Perl5 is supplied to build and run the benchmarks. To produce publication-quality results, these SPEC tools must be used. This helps ensure reproducibility of results by requiring that all individual benchmarks in the suites are run in the same way and that a configuration file is available that defines the optimizations used.
The primary tool is called runspec (runspec.bat for Microsoft Windows). It is described in the runspec documentation in the Docs subdirectory of the SPEC root directory -- in a Bourne shell that would be called ${SPEC}/Docs/, or on Microsoft Windows %SPEC%\Docs\.
Some Fortran programs in the floating point suite need to be preprocessed, for example to choose variable sizes depending on whether -DSPEC_CPU_LP64 has been set. Fortran preprocessing must be done using the SPEC-supplied preprocessor, even if the vendor's compiler has its own preprocessor. Runspec will automatically enforce this requirement by invoking the SPEC preprocessor.
SPEC supplies pre-compiled versions of the tools for a variety of platforms. If a new platform is used, please see tools-build.html in the Docs directories for information on how to build the tools, and how to obtain approval for them. SPEC's approval is required for the tool build, so a log must be generated during the build.
For more complex methods of compilation, for example feedback-driven compilation, SPEC has provided hooks in the tools so that such compilation and execution is possible (see the tools documentation for details). Only if such compilation and execution unexpectedly proves impossible may the tester ask for permission to use performance-neutral alternatives (see section 5).
When runspec is used to build the SPEC CPU2006 benchmarks, it must be used in generally available, documented, and supported environments (see section 1), and any aspects of the environment that contribute to performance must be disclosed to SPEC (see section 4).
On occasion, it may be possible to improve run time performance by environmental choices at build time. For example, one might install a performance monitor, turn on an operating system feature such as bigpages, or set an environment variable that causes the cc driver to invoke a faster version of the linker.
It is difficult to draw a precise line between environment settings that are reasonable versus settings that are not. Some settings are obviously not relevant to performance (such as hostname), and SPEC makes no attempt to regulate such settings. But for settings that do have a performance effect, for the sake of clarity, SPEC has chosen that:
(a) The tester may install whatever software the tester wishes, including performance-enhancing software, but such software must be installed prior to starting the builds, must remain installed throughout the builds, and must be documented, supported, generally available, and disclosed to SPEC.
(b) The tester may set whatever system configuration parameters the tester wishes, but these must be applied at boot time, documented, supported, generally available, and disclosed to SPEC. "Dynamic" system parameters (i.e. ones that do not require a reboot) must nevertheless be applied at boot time, except as provided under section 2.0.5.
(c) After the boot process is completed, environment settings may be made as follows:
* to specify resource limits (for example, as in the Bourne shell ulimit command), and
* to select major components of the compilation system -- for example, as in:
setenv CC_LOC /net/dist/version73/cc
setenv LD_LOC /net/opt/dist/ld-fast
-- but these settings must be documented; supported; generally available; disclosed to SPEC; made PRIOR to starting the build; and must not change during the build, except as provided in section 2.0.5.
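One way a tester might confirm that environment settings did not change during a build is to snapshot the environment before and after and compare the two. The sketch below is an assumption about how such a check could look, not a SPEC-provided mechanism. The helper name is hypothetical, and the changed LD_LOC value is invented to show a detected difference.

```python
def environment_changes(before, after):
    """Return {name: (old, new)} for every variable that differs between snapshots."""
    changed = {}
    for key in sorted(set(before) | set(after)):
        if before.get(key) != after.get(key):
            changed[key] = (before.get(key), after.get(key))
    return changed


# Snapshot taken PRIOR to starting the build (values from the example above).
before_build = {
    "CC_LOC": "/net/dist/version73/cc",
    "LD_LOC": "/net/opt/dist/ld-fast",
}
# Hypothetical snapshot after the build, with one setting altered mid-build.
after_build = dict(before_build, LD_LOC="/net/opt/dist/ld")

changes = environment_changes(before_build, after_build)
print(changes)
```

In practice the snapshots could be taken with dict(os.environ); an empty result would indicate that no variable changed between the two snapshots.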
As described in section 1, it is expected that testers can reproduce other testers' results. In particular, it must be possible for a new tester to compile both the base and peak benchmarks for an entire suite (i.e. CINT2006 or CFP2006) in one execution of runspec, with appropriate command line arguments and an appropriate configuration file, and obtain executable binaries that are (from a performance point of view) equivalent to the binaries used by the original tester.
The simplest and least error-prone way to meet this requirement is for the original tester to take production hardware, production software, a SPEC config file, and the SPEC tools and actually build the benchmarks in a single invocation of runspec on the System Under Test (SUT). But SPEC realizes that there is a cost to benchmarking and would like to address this, for example through the rules that follow regarding cross-compilation and individual builds. However, in all cases, the tester is taken to assert that the compiled executables will exhibit the same performance as if they all had been compiled with a single invocation of runspec (see 2.0.8).
SPEC CPU2006 base binaries must be built using the environment rules of section 2.0.3, and must not rely upon any changes to the environment during the build.
Note 1: base cross compiles using multiple hosts are allowed (2.0.6), but the performance of the resulting binaries must not depend upon environmental differences among the hosts. It must be possible to build performance-equivalent base binaries with one set of switches (2.2.2), in one execution of runspec (2.0.4), on one host, with one environment (2.0.3).
For a peak build, the environment may be changed, subject to the following constraints:
The environment change must be accomplished using the SPEC-provided config file hooks (such as fdo_pre0).
The environment change must be fully disclosed to SPEC (see section 4).
The environment change must not be incompatible with a Continuous Build (see section 2.0.4).
The environment change must be accomplished using simple shell commands. It is not permitted to invoke a more complex entity unless that entity is provided as part of a generally-available software package.
Examples:
Note 2: peak cross compiles using multiple hosts are allowed (2.0.6), but the performance of the resulting binaries must not depend upon environmental differences among the hosts. It must be possible to build performance-equivalent peak binaries with one config file, in one execution of runspec (2.0.4), in the same execution of runspec that built the base binaries, on one host, starting from the environment used for the base build (2.0.3), and changing that environment only through config file hooks (2.0.5).
It is permitted to use cross-compilation, that is, a building process where the benchmark executables are built on a system (or systems) that differ(s) from the SUT. The runspec tool must be used on all systems (typically with -a build on the host(s) and -a validate on the SUT).
Documentation of cross-compiles is described in section 4.2.7.
It is permitted to use more than one host in a cross-compilation. If more than one host is used in a cross-compilation, they must be sufficiently equivalent so as not to violate rule 2.0.4. That is, it must be possible to build the entire suite on a single host and obtain binaries that are equivalent to the binaries produced using multiple hosts.
The purpose of allowing multiple hosts is so that testers can save time when recompiling many programs. Multiple hosts must NOT be used in order to gain performance advantages due to environmental differences among the hosts. In fact, the tester must exercise great care to ensure that any environment differences are performance neutral among the hosts, for example by ensuring that each has the same version of the operating system, the same performance software, the same compilers, and the same libraries. The tester must exercise due diligence to ensure that differences that appear to be performance neutral - such as differing MHz or differing memory amounts on the build hosts - are in fact truly neutral.
Multiple hosts must NOT be used in order to work around system or compiler incompatibilities (e.g. compiling the SPECfp2006 C benchmarks on a different OS version than the SPECfp2006 Fortran benchmarks in order to meet the different compilers' respective OS requirements), since that would violate the Continuous Build rule (2.0.4).
It is permitted to build the benchmarks with multiple invocations of runspec, for example during a tuning effort. But the executables must be built using a consistent set of software. If a change to the software environment is introduced (for example, installing a new version of the C compiler which is expected to improve the performance of one of the floating point benchmarks), then all affected benchmarks must be rebuilt (in this example, all the C benchmarks in the floating point suite).
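The "rebuild all affected benchmarks" requirement can be sketched as a lookup over the languages each benchmark uses. The benchmark names below are hypothetical placeholders rather than real CPU2006 benchmarks, and the helper is illustrative only.

```python
def benchmarks_to_rebuild(languages_by_benchmark, changed_language):
    """Return, sorted by name, every benchmark that uses the changed compiler's language."""
    return sorted(
        name
        for name, languages in languages_by_benchmark.items()
        if changed_language in languages
    )


# Hypothetical suite: names and language assignments are placeholders.
languages_by_benchmark = {
    "bench_c_only": {"C"},
    "bench_fortran_only": {"Fortran"},
    "bench_mixed": {"C", "Fortran"},
    "bench_cxx": {"C++"},
}

# A new C compiler was installed, so every benchmark containing C code is affected.
rebuild = benchmarks_to_rebuild(languages_by_benchmark, "C")
print(rebuild)
```

Note that a mixed-language benchmark counts as affected even if only part of it is compiled by the changed compiler.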
The previous 4 rules (2.0.4 through 2.0.7) may appear to contradict each other, but the key word in 2.0.4 is the word "possible".
Consider the following sequence of events:
In this example, the tester is taken to be asserting that the above sequence of events produces binaries that are, from a performance point of view, equivalent to binaries that it would have been possible to build in a single invocation of the tools.
If there is some optimization that can only be applied to individual benchmark builds, but which it is not possible to apply in a continuous build, the optimization must not be used.
Rule 2.0.8 is intended to provide some guidance about the kinds of practices that are reasonable, but the ultimate responsibility for result reproducibility lies with the tester. If the tester is uncertain whether a cross-compile or an individual benchmark build is equivalent to a full build on the SUT, then a full build on the SUT is required (or, in the case of a true cross-compile which is documented as such, a single runspec -a build is required on a single host). Although full builds add to the cost of benchmarking, in some instances a full build in a single runspec may be the only way to ensure that results will be reproducible.
The following rules apply to compiler flag selection for SPEC CPU2006 Peak and Base Metrics. Additional rules for Base Metrics follow in section 2.2.
Benchmark source file or variable or subroutine names must not be used within optimization flags or compiler/build options.
Identifiers used in preprocessor directives to select alternative benchmark source code are also forbidden, except for a rule-compliant library substitution (2.1.2) or an approved portability flag (2.1.5).
For example, if a benchmark source code uses one of:
#ifdef IDENTIFIER
#ifndef IDENTIFIER
#if defined IDENTIFIER
#if !defined IDENTIFIER
to provide alternative source code under the control of a compiler option such as -DIDENTIFIER, such a switch may not be used unless it meets the criteria of 2.1.2 or 2.1.5.
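A config file could be screened for such switches with a check along the following lines. This is a sketch under stated assumptions, not a SPEC tool. The set of identifiers tested by the benchmark source and the set of approved identifiers are supplied by the caller, FAST_PATH and UNRELATED are hypothetical names, and -O2 stands in for an ordinary optimization flag.

```python
def forbidden_defines(flags, source_identifiers, approved):
    """Return -D identifiers that select alternative benchmark source without approval."""
    forbidden = []
    for flag in flags:
        if not flag.startswith("-D"):
            continue
        name = flag[2:].split("=", 1)[0]  # "-DNAME=value" -> "NAME"
        if name in source_identifiers and name not in approved:
            forbidden.append(name)
    return forbidden


# Identifiers assumed to appear in #ifdef-style directives in the benchmark source.
source_identifiers = {"SPEC_CPU_LP64", "HOST_WORDS_BIG_ENDIAN", "FAST_PATH"}
# Identifiers assumed to meet the criteria of 2.1.2 or 2.1.5.
approved = {"SPEC_CPU_LP64", "HOST_WORDS_BIG_ENDIAN"}

flags = ["-O2", "-DSPEC_CPU_LP64", "-DFAST_PATH=1", "-DUNRELATED"]
violations = forbidden_defines(flags, source_identifiers, approved)
print(violations)
```

Only FAST_PATH is reported here: it selects alternative source and is not approved, whereas UNRELATED is not tested by the source at all.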
Flags which substitute pre-computed (e.g. library-based) routines for routines defined in the benchmark on the basis of the routine's name must not be used. Exceptions are:
a) the function alloca. It is permitted to use a flag that substitutes the system's builtin_alloca. Such a flag may be applied to individual benchmarks (in both base and peak).
b) the level 1, 2 and 3 BLAS functions in the CFP2006 benchmarks, and the netlib-interface-compliant FFT functions. Such substitution may be used in a peak run, but must not be used in base.
Note: rule 2.1.2 does not forbid flags that select alternative implementations of library functions defined in an ANSI/ISO language standard. For example, such flags might select an optimized library of these functions, or allow them to be inlined.
Feedback directed optimization may be used in peak. Only the training input (which is automatically selected by runspec) may be used for the run(s) that generate(s) feedback data.
Optimization with multiple feedback runs is also allowed (build, run, build, run, build...).
The requirement to use only the train data set at compile time shall not be taken to forbid the use of run-time dynamic optimization tools that would observe the reference execution and dynamically modify the in-memory copy of the benchmark. However, such tools must not in any way affect later executions of the same benchmark (for example, when running multiple times in order to determine the median run time). Such tools must also be disclosed in the publication of a result, and must be used for the entire suite (see section 3.3).
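The median mentioned above is why a dynamic optimizer must not carry state from one execution to the next: each run is meant to be an independent sample. A minimal illustration follows, with invented timings and an assumed count of three runs.

```python
import statistics

# Hypothetical run times in seconds for three executions of one benchmark.
run_times_seconds = [412.0, 405.5, 409.2]

# The median is the middle value once the runs are sorted.
median_time = statistics.median(run_times_seconds)
print(median_time)
```

If a tool learned from the first run and sped up the later ones, the later samples would no longer measure the same thing, and the median would be skewed.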
Flags that change a data type size to a size different from the default size of the compilation system are not allowed. Exceptions are: a) the C long type may be set to 32 or greater bits; b) pointer sizes may be set in a manner which requires, or which assumes, that the benchmarks (code+data) fit into 32 bits of address space.
Rule 2.2.2 requires that all benchmarks use the same flags in base. Portability flags are an exception to this rule: they may differ from one benchmark to another, even in base. Such flags are subject to two major requirements:
The initial published results for CPU2006 will include a reviewed set of portability flags on several operating systems; later users who propose to apply additional portability flags must prepare a justification for their use.
A proposed portability flag will normally be approved if one of the following conditions holds:
(a) The flag selects a performance-neutral alternate benchmark source, and the benchmark cannot build and execute correctly on the given platform unless the alternate source is selected. (Examples might be flags such as -DHOST_WORDS_BIG_ENDIAN, -DHAVE_SIGNED_CHAR.)
(b) The flag selects a compiler mode that allows basic parsing of the input source program, and it is not possible to set that flag for all programs of the given language in the suite. (An example might be -fixedform, to select Fortran source code fixed format.)
(c) The flag selects features from a certain version of the language, and it is not possible to set that flag for all programs of the given language in the suite. (An example might be -language:c89.)
(d) The flag solves a data model problem, as described in section 2.2.9.
(e) The flag selects a resource limit, and it is not possible to set that flag for all programs of the given language in the suite.
A proposed portability flag will normally not be approved unless it is essential in order to successfully build and run the benchmark.
If more than one solution can be used for a problem, the subcommittee will review attributes such as precedent from previously published results, performance neutrality, standards compliance, amount of code affected, impact on the expressed original intent of the program, and good coding practices (in rough order of priority).
If a benchmark is discovered to violate the relevant standard, that may or may not be reason for the subcommittee to grant a portability flag. If the justification for a portability flag is standards compliance, the tester must include a specific reference to the offending source code module and line number, and a specific reference to the relevant sections of the appropriate standard. The tester should also address impact on the other attributes mentioned in the previous paragraph.
If a given portability problem (within a given language) occurs in multiple places within a suite, then, in base, the same method(s) must be applied to solve all instances of the problem.
If a library is specified as a portability flag, SPEC may request that the table of contents of the library be included in the disclosure.
In addition to the rules listed in section 2.1 above, the selection of optimizations to be used to produce SPEC CPU2006 Base Metrics includes the following:
The optimizations used are expected to be safe, and it is expected that system or compiler vendors would endorse the general use of these optimizations by customers who seek to achieve good application performance.
The requirements that optimizations be safe, and that they generate correct code for a class of programs larger than the suites themselves (rule 1.4), are normally interpreted as requiring that the system, as used in base, implement the language correctly. "The language" is defined by the appropriate ANSI/ISO standard (C99, Fortran-95, C++ 98).
The principle of standards conformance is not automatically applied, because SPEC has historically allowed certain exceptions:
Otherwise, a deviation from the standard that is not performance neutral, and gives the particular implementation a CPU2006 performance advantage over standard-conforming implementations, is considered an indication that the requirements about "safe" and "correct code" optimizations are probably not met. Such a deviation may be a reason for SPEC to find a result not rule-conforming.
If an optimization causes any SPEC CPU2006 benchmark to fail to validate, and if the relevant portion of this benchmark's code is within the language standard, the failure is taken as additional evidence that an optimization is not safe.
Regarding C++: Note that for C++ applications, the standard calls for support of both run-time type information (RTTI) and exception handling. The compiler, as used in base, must enable these.
For example, a compiler enables exception handling by default; it can be turned off with --noexcept. The switch --noexcept is not allowed in base.
For example, a compiler defaults to no run time type information, but allows it to be turned on via --rtti. The switch --rtti must be used in base.
Regarding accuracy: Because language standards generally do not set specific requirements for accuracy, SPEC has also chosen not to do so. Nevertheless:
In cases where the class of appropriate applications appears to be so narrowly drawn as to constitute a "benchmark special", that may be a reason for SPEC to find a result non-conforming.
In base, the same compiler must be used for all modules of a given language within a benchmark suite. Except for portability flags (see 2.1.5 above), all flags or options that affect the transformation process from SPEC-supplied source to completed executable must be the same, including but not limited to:
All flags must be applied in the same order for all compiles of a given language.
Note that the SPEC tools provide methods to set flags on a per-language basis.
For example, if a tester sets:
fp=base:
COPTIMIZE = -O4
FOPTIMIZE = -O5
then the floating point C benchmarks will be compiled with -O4 and the floating point Fortran benchmarks with -O5. (This is legal: there is no requirement to compile C with the same optimization level as Fortran.)
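The base consistency requirement above (same flags, in the same order, for every compile of a given language) can be sketched as a simple check. This is only an illustration, not part of the SPEC tools: the function name, the tuple layout, and the benchmark labels are hypothetical, and the input is assumed to have had approved portability flags removed already.

```python
def inconsistent_languages(compiles):
    """Return the set of languages whose base flags are not identical everywhere.

    `compiles` is an iterable of (benchmark, language, flags) tuples, where
    `flags` is the ordered list of non-portability flags used in base.
    (Hypothetical data layout, chosen for illustration.)
    """
    reference = {}   # language -> first flag tuple seen for that language
    bad = set()
    for _benchmark, language, flags in compiles:
        ref = reference.setdefault(language, tuple(flags))
        # Order matters: the same flags in a different order also fail.
        if tuple(flags) != ref:
            bad.add(language)
    return bad


# C uses -O4 everywhere and Fortran uses -O5 everywhere: consistent,
# even though the two languages use different optimization levels.
example = [
    ("bench_c1", "C", ["-O4"]),
    ("bench_c2", "C", ["-O4"]),
    ("bench_f1", "F", ["-O5"]),
]
print(inconsistent_languages(example))
```

Differing flags between languages are accepted; only differences within one language are reported.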
Regarding benchmarks that have been written in more than one language:
In a mixed-language benchmark, the tools automatically compile each source module with the options that have been set for its language.
Continuing the example just above, a benchmark that uses both C and Fortran would have its C modules compiled with -O4 and its Fortran modules with -O5. This, too, is legal.
In order to link an executable for a mixed-language benchmark, the tools need to decide which link options to apply (e.g. those defined in CLD/CLDOPT vs. those in FLD/FLDOPT vs. those in CXXLD/CXXLDOPT). This decision is based on benchmark classifications that were determined during development of CPU2006. For reasons of link time library inclusion, the classifications were not made based on percentage of code nor on the language of the main routine; rather, the classifications have been set to either F (for mixed Fortran/C benchmarks) or CXX (for benchmarks that include C++).
Link options must be consistent in a base build. For example, if FLD is set to /usr/opt/advanced/ld for pure Fortran benchmarks, the same setting must be used for any mixed language benchmarks that have been classified, for purpose of linking, as Fortran.
Inter-module optimization and mixed-language benchmarks:
For mixed-language benchmarks, if the compilers have an incompatible inter-module optimization format, flags that require inter-module format compatibility may be dropped from base optimization of mixed-language benchmarks. The same flags must be dropped from all benchmarks that use the same combination of languages. All other base optimization flags for a given language must be retained for the modules of that language.
For example, suppose that a suite has exactly two benchmarks that employ both C and Fortran, namely 997.CFmix1 and 998.CFmix2. A tester uses a C compiler and Fortran compiler that are sufficiently compatible to allow their object modules to be linked together, but not sufficiently compatible to allow inter-module optimization. The C compiler spells its inter-module optimization switch -ifo, and the Fortran compiler spells its switch --intermodule_optimize. In this case, the following would be legal:
fp=base:
COPTIMIZE = -fast -O4 -ur=8 -ifo
FOPTIMIZE = --prefetch:all --optimize:5 --intermodule_optimize
FLD=/usr/opt/advanced/ld
FLDOPT=--nocompress --lazyload --intermodule_optimize
997.CFmix1,998.CFmix2=base:
COPTIMIZE = -fast -O4 -ur=8
FOPTIMIZE = --prefetch:all --optimize:5
FLD=/usr/opt/advanced/ld
FLDOPT=--nocompress --lazyload
Following the precedence rules as explained in config.html, the above section specifiers set default tuning for the C and Fortran benchmarks in the floating point suite, but the tuning is modified for the two mixed-language benchmarks to remove switches that would have attempted inter-module optimization.
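The rule illustrated by this example (inter-module flags may be dropped for mixed-language benchmarks, while all other base flags must be retained) can be sketched as a check. This is a minimal sketch under stated assumptions: the function name is hypothetical, flags are treated as whitespace-separated tokens, and the tester supplies the set of flags that require inter-module format compatibility.

```python
def override_is_legal(default_flags, override_flags, intermodule_flags):
    """Check a mixed-language override against the suite-wide base flags.

    Legal (in this sketch) means: the override is the default flag list
    with zero or more inter-module flags removed, and nothing added or
    reordered. All three arguments are illustrative assumptions.
    """
    default = default_flags.split()
    override = override_flags.split()

    # The override must be an in-order subsequence of the default list.
    remaining = iter(default)
    if not all(flag in remaining for flag in override):
        return False

    # Every flag that was dropped must be an inter-module flag.
    dropped = [flag for flag in default if flag not in override]
    return all(flag in intermodule_flags for flag in dropped)


# Values taken from the config example above.
print(override_is_legal("-fast -O4 -ur=8 -ifo", "-fast -O4 -ur=8", {"-ifo"}))
```

A real review would also need to confirm that the same flags are dropped from all benchmarks that use the same combination of languages, which this sketch does not cover.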
Feedback directed optimization must not be used in base for SPEC CPU2006. (This is a change from SPEC CPU2000.)
An assertion flag is one that supplies semantic information that the compilation system did not derive from the source statements of the benchmark.
With an assertion flag, the programmer asserts to the compiler that the program has certain nice properties that allow the compiler to apply more aggressive optimization techniques (for example, that there is no aliasing via C pointers). The problem is that there can be legal programs (possibly strange, but still standard-conforming programs) where such a property does not hold. These programs could crash or give incorrect results if an assertion flag is used. This is the reason why such flags are sometimes also called "unsafe flags". Assertion flags should never be applied to a production program without previous careful checks; therefore they must not be used for base.
Exception: a tester is free to turn on a flag that asserts that the benchmark source code complies with the relevant standard (e.g. -ansi_alias). Note, however, that if such a flag is used, it must be applied to all compiles of the given language (C, C++, or Fortran), while still passing SPEC's validation tools with correct answers for all the affected programs.
Base results may use flags which affect the numerical accuracy or sensitivity by reordering floating-point operations based on algebraic identities.
(This rule, formerly present in CPU2000, has been removed for CPU2006.)
This rule, formerly present in CPU2000, has been merged into rule 2.2.1 for CPU2006.
The system environment must not be manipulated during a build of base. For example, suppose that an environment variable called bigpages can be set to yes or no, and the default is no. The tester must not change the choice during the build of the base binaries. See section 2.0.5.
Normally, it is expected that the data model (such as pointer sizes, sizes of int, etc) will be consistent in base for all compilation of a given language. In particular, several benchmarks use -DSPEC_CPU_LP64, -DSPEC_CPU_P64, and/or -DSPEC_CPU_ILP64 to control the data model. If one of these flags is used in base, then normally it should be set for all benchmarks of the given language in the suite for base.
If for some reason it is not practical to use a consistent data model in base, then SPEC may choose to grant a portability flag and allow use of an inconsistent data model in base.
(i) For example, suppose that it is preferable to use a certain system in 64-bit mode, but that a benchmark is found, unexpectedly, to have a source code limitation that prevents such usage.
(ii) For example, suppose that a certain compiler combination runs into data model difficulties due to the presence of mixed-language benchmarks in a suite.
The tester could describe the problem to SPEC and request that SPEC allow use of an inconsistent data model in base. SPEC would consider such a request using the same process outlined in rule 2.1.5, including consideration of the technical arguments as to the nature of the data model problem and consideration of the practicality of technical alternatives, if any. SPEC might or might not grant the portability flag. SPEC might also choose to fix source code limitations, if any, that are causing difficulty.
Frequently, performance may be improved via optimizations that work across source modules, for example -ifo, -xcrossfile, or -IPA. Some compilers may require the simultaneous presentation of all source files for inter-file optimization, as in:
cc -ifo -o a.out file1.c file2.c
Other compilers may be able to do cross-module optimization even with separate compilation, as in:
cc -ifo -c -o file1.o file1.c
cc -ifo -c -o file2.o file2.c
cc -ifo -o a.out file1.o file2.o
By default, the SPEC tools operate in the latter mode, but they can be switched to the former through the config file option ONESTEP=yes.
ONESTEP is not allowed in base. (This is a change from CPU2000.)
Switches that cause data to be aligned on natural boundaries may be used in base.
In base, pointer sizes may be set in a manner which requires, or which assumes, that the benchmarks (code+data) fit into 32 bits of address space.
SPEC requires the use of a single file system to contain the directory tree for the SPEC CPU2006 suite being run. SPEC allows any type of file system (disk-based, memory-based, NFS, DFS, FAT, NTFS etc.) to be used. The type of file system must be disclosed in reported results.
There is a config file feature that allows a user to define a directory tree to hold the run directories (along with some other outputs; please see the discussion of output_root in config.html). This feature may be used in a reportable run. If it is used,
The system state (for example, "Multi-User", "Single-User", "Safe Mode With Networking") may be selected by the tester. This state must be disclosed. As described in rule 4.2.4, the tester must also disclose whether any services or daemons are shut down, and any changes to tuning parameters.
For SPECint_rate2006 and SPECfp_rate2006 (peak), the tester is free to choose the number of concurrent copies for each individual benchmark independently of the other benchmarks.
The median value that is used must, for each benchmark, come from three runs with the same number of copies. However, this number may be different between benchmarks.
For SPECint_rate_base2006 and SPECfp_rate_base2006, the tester must select a single value to use as the number of concurrent copies to be applied to all benchmarks in the suite.
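The copy-count rules above can be sketched as two small checks. This is an illustration only: the function names and data layout are hypothetical, and it assumes that the value whose median is taken is the elapsed time of each run.

```python
from statistics import median


def median_of_three(runs):
    """Return (copies, median_seconds) for one benchmark.

    `runs` is a list of (copies, seconds) pairs. The three runs must all
    use the same number of copies. (Hypothetical layout for illustration.)
    """
    if len(runs) != 3:
        raise ValueError("expected exactly three runs")
    copy_counts = {copies for copies, _ in runs}
    if len(copy_counts) != 1:
        raise ValueError("all three runs must use the same number of copies")
    return copy_counts.pop(), median([seconds for _, seconds in runs])


def base_copy_count(copies_by_benchmark):
    """Return the single copy count used in base, or raise if it varies."""
    counts = set(copies_by_benchmark.values())
    if len(counts) != 1:
        raise ValueError("base requires one copy count for the whole suite")
    return counts.pop()


print(median_of_three([(8, 100.0), (8, 90.0), (8, 95.0)]))
```

In peak, `median_of_three` would be applied per benchmark and different benchmarks could return different copy counts; in base, `base_copy_count` expresses the additional requirement that the count be the same across the suite.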
The multiple concurrent copies of the benchmark must be executed using data from different directories within the same file system. Each copy of the test must have its own working directory, which must contain all the input files needed for the actual execution of the benchmark, and all output files when created. The output of each copy of the benchmark must be validated to be the correct output.
Note: although benchmark inputs are duplicated across run directories, the benchmark binary itself is only placed into the run directories once.
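The layout requirement (one working directory per copy, each holding its own inputs, all under one directory tree) can be sketched as follows. The SPEC tools create run directories themselves; this sketch only illustrates the idea, and the function name and the `copy_NNNN` directory naming are hypothetical, not the layout the tools actually use.

```python
import shutil
from pathlib import Path


def make_run_dirs(root, input_files, copies):
    """Create one working directory per copy and duplicate the inputs into it.

    All directories are created under `root`, so they share whatever file
    system `root` lives on. Returns the list of created directories.
    """
    run_dirs = []
    for index in range(copies):
        run_dir = Path(root) / f"copy_{index:04d}"  # hypothetical naming
        run_dir.mkdir(parents=True)
        for input_file in input_files:
            # Each copy gets its own private duplicate of every input file.
            shutil.copy(input_file, run_dir)
        run_dirs.append(run_dir)
    return run_dirs
```

Each copy's output files would then be written into, and validated from, its own directory.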
The config file option submit may be used to assign work to processors. It is commonly used for SPECrate tests, but can also be used for the non-rate (SPECspeed) case. The tester may, if desired:
The submit command must not be used to change the run time environment (see section 3.4). In addition, if a testbed description is referenced by a submit option, the same description must be used by all benchmarks.
In base, the submit command must be the same for all benchmarks in a suite (integer or fp). In peak, different benchmarks may use different submit commands.
For reportable runs, substantial time may be required during the setup phase, as the tools write run directories for every copy, and validate that benchmark binaries get the correct answers for the (non-timed) test/train workloads. As of SPEC CPU2006 V1.1, new features have been added to allow these operations to complete more quickly by optionally doing more operations in parallel, typically by assigning tasks to multiple processors. The new features are commonly used for SPECrate tests, but may also be used for the non-rate (SPECspeed) case.
Testers may use the options submit, parallel_setup, parallel_test, and related features, for the same purposes, and subject to the same rule 3.2.4, as the use of submit for the reference workload: that is, benchmark setup jobs may be placed on desired processors, arithmetic may be done to figure out where to place a job, and so forth. In particular, note that the run time environment may not be changed during setup.
Using the config file features bench_post_setup and/or post_setup, at the conclusion of setup of each benchmark and/or at the conclusion of the setup of all benchmarks, a system command may be issued to cause the benchmark data to be written to stable storage (e.g. sync). Note: It is not the intent of this run rule to provide a hook for a more generalized cleanup of memory; the intent is simply to allow dirty file system data to be written to stable storage.
Note: it is not required that parallel setup, parallel test, and the actual reference run be done using identical methods. Within the limits of the features provided (see config.html), there may be some differences - for example, the reference run might run 128 copies, while parallel setup uses only 12.
All benchmark executions, including the validation steps, contributing to a particular result page must occur continuously, that is, in one execution of runspec.
For a reportable run, the runspec tool will run all three workloads (test, train, and ref), and will ensure that the correct answer is obtained for all three. (Note: the execution and validation of test and train is not part of the timing of the benchmark - it is only an additional test for correct operation of the binary.)
SPEC does not attempt to regulate the run-time environment for the benchmarks, other than to require that the environment be:
For example, if each of the following:
run level: single-user
OS tuning: bigpages=yes, cpu_affinity=hard
file system: in memory
were set prior to the start of runspec, unchanged during the run, described in the notes section of the result page, and documented and supported by a vendor for general use, then these options could be used in a published CPU2006 result.
Note 1: Item (a) is intended to forbid all means by which a tester might change the environment. In particular, it is forbidden to change the environment during the run using config file hooks such as submit or monitor_pre_bench.
For example, it would not be acceptable to use submit to cause different benchmarks to pick differing page sizes, differing number of threads, or differing choices for local vs. shared memory.
Note 2: Although the tester is not allowed to change the run-time environment, it is acceptable to select choices at compile time that cause benchmark binaries to carry information about their run time requirements.
For example, a compiler choice could be made that causes binaries to request running with bigpages, and, for peak only, that choice could differ from benchmark to benchmark.
If a result page will contain both base and peak results, a single runspec invocation must be used for the runs. When both base and peak are run, the tools run the base executables first, followed by the peak executables.
It is permitted to publish base results as peak. This can be accomplished in various ways, all of which are allowed:
Set basepeak=yes in the config file for individual benchmarks.
In this case, the tools will run the same binary for both base and peak; however, the base times will be reported for both base and peak. (The reason for running the binary during both base and peak is to remove the possibility that skipping a benchmark altogether might somehow change the performance of some other benchmark.)
Set basepeak=yes in the config file for an entire suite.
In this case, the peak runs will be skipped and base results will be reported as both base and peak for the suite.
Select the --basepeak option when using rawformat.
Doing so will cause a new rawfile to be written, with base results copied to peak. It is permitted to use this feature to copy all of the base results to peak, or just the results for selected benchmarks.
Notes:
1. It is permitted but not required to compile in the same runspec invocation as the execution. See rule 2.0.6 regarding cross compilation.
2. It is permitted but not required to run both the integer suite and the floating point suite in a single invocation of runspec.
As used in these run rules, the term "run-time dynamic optimization" (RDO) refers broadly to any method by which a system adapts to improve performance of an executing program based upon observation of its behavior as it runs. This is an intentionally broad definition, intended to include techniques such as:
RDO may be under control of hardware, software, or both.
Understood this broadly, RDO is already commonly in use, and usage can be expected to increase. SPEC believes that RDO is useful, and does not wish to prevent its development. Furthermore, SPEC views at least some RDO techniques as appropriate for base, on the grounds that some techniques may require no special settings or user intervention; the system simply learns about the workload and adapts.
However, benchmarking a system that includes RDO presents a challenge. A central idea of SPEC benchmarking is to create tests that are repeatable: if you run a benchmark suite multiple times, it is expected that results will be similar, although there will be a small degree of run-to-run variation. But an adaptive system may recognize the program that it is asked to run, and "carry over" lessons learned in the previous execution; therefore, it might complete a benchmark more quickly each time it is run. Furthermore, unlike in real life, the programs in the benchmark suites are presented with the same inputs each time they are run: value prediction is too easy if the inputs never change. In the extreme case, an adaptive system could be imagined that notices which program is about to run, notices what the inputs are, and which reduces the entire execution to a print statement. In the interest of benchmarking that is both repeatable and representative of real-life usage, it is therefore necessary to place limits on RDO carry-over.
Run time dynamic optimization is allowed, subject to the usual provisions that the techniques must be generally available, documented, and supported. It is also subject to the conditions listed in the rules immediately following.
Rule 4.2 applies to run-time dynamic optimization: any settings which the tester has set to non-default values must be disclosed. If RDO requires any hardware resources, these must be included in the description of the hardware configuration.
For example, suppose that a system can be described as a 64-core system. After experimenting for a while, the tester decides that optimum SPECrate throughput is achieved by dedicating 4 cores to the run-time dynamic optimizer, and running only 60 copies of the benchmarks. The system under test is still correctly described as a 64-core system, even though only 60 cores ran SPEC code.
Run time dynamic optimization is subject to rule 3.4: settings cannot be changed at run-time. But Note 2 of rule 3.4 also applies to RDO: for example, in peak it would be acceptable to compile a subset of the benchmarks with a flag that suggests to the run-time dynamic optimizer that code rearrangement should be attempted. Of course, rule 2.1.1 also would apply: such a flag could not tell RDO which routines to rearrange.
If run-time dynamic optimization is effectively enabled for base (after taking into account the system state at run-time and any compilation flags that interact with the run-time state), then RDO must comply with 2.2.1, the safety rule. It is understood that the safety rule has sometimes required judgment, including deliberation by SPEC in order to determine its applicability. The following is intended as guidance for the tester and for SPEC:
If an RDO system optimizes a SPEC benchmark in a way which allows it to successfully process the SPEC-supplied inputs, that is not enough to demonstrate safety. If it can be shown that a different, but valid, input causes the program running under RDO to fail (either by giving a wrong answer or by exiting), where such failure does not occur without RDO; and if it is not a fault of the original source code; then this is taken as evidence that the RDO method is not safe.
If an RDO system requires that programs use a subset of the relevant ANSI/ISO language standard, or requires that they use non-standard features, then this is taken as evidence that it is not safe.
But an RDO system is allowed to assume that the programs adhere to the relevant ANSI/ISO language standard.
As described in section 3.6.1, SPEC has an interest in preventing carry-over of information from run to run. Specifically, no information may be carried over which identifies the specific program or executable image. Here are some examples of behavior that is, and is not, allowed.
It doesn't matter whether the information is intentionally stored, or just "left over"; if it's about a specific program, it's not allowed:
If information is left over from a previous run that is not associated with a specific program, that is allowed:
Any form of RDO that uses memory about a specific program is forbidden:
The system is allowed to respond to the currently running program, and to the overall workload:
SPEC requires a full disclosure of results and configuration details sufficient to reproduce the results. For results published on its web site, SPEC also requires that base results be published whenever peak results are published. If peak results are published outside of the SPEC web site (http://www.spec.org/cpu2006/) in a publicly available medium, the tester must supply base results on request. Results published under non-disclosure, or marked for company internal use or as company confidential, are not considered "publicly" available.
A full disclosure of results must include:
A full disclosure of results must include sufficient information to allow a result to be independently reproduced. If a tester is aware that a configuration choice affects performance, then s/he must document it in the full disclosure.
Note: this rule is not meant to imply that the tester must describe irrelevant details or provide massively redundant information.
For example, if the SuperHero Model 1 comes with a write-through cache, and the SuperHero Model 2 comes with a write-back cache, then specifying the model number is sufficient, and no additional steps need to be taken to document the cache protocol. But if the Model 3 is available with both write-through and write-back caches, then a full disclosure must specify which cache is used.
For information on how to publish a result on SPEC's web site, contact the SPEC office. Contact information is maintained at the SPEC web site, http://www.spec.org/.
If a tester publishes results for a hardware or software configuration that has not yet shipped,
The component suppliers must have firm plans to make production versions of all components generally available, within 3 months of the first public release of the result (whether first published by the tester or by SPEC); and
The tester must specify the general availability dates that are planned.
Note 1: "Generally available" is defined in the SPEC Open Systems Group Policy document, which can be found at http://www.spec.org/osg/policy.html.
Note 2: It is acceptable to test larger configurations than customers are currently ordering, provided that the larger configurations can be ordered and the company is prepared to ship them.
For example, if the SuperHero is available in configurations of 1 to 1000 CPUs, but the largest order received to date is for 128 CPUs, the tester would still be at liberty to test a 1000 CPU configuration and publish the result.
A "pre-production", "alpha", "beta", or other pre-release version of a compiler (or other software) can be used in a test, provided that the performance-related features of the software are committed for inclusion in the final product.
The tester must practice due diligence to ensure that the tests do not use an uncommitted prototype with no particular shipment plans. An example of due diligence would be a memo from the compiler Project Leader which asserts that the tester's version accurately represents the planned product, and that the product will ship on date X.
The final, production version of all components must be generally available within 3 months after first public release of the result.
When specifying a software component name in the results disclosure, the component name that should be used is the name that customers are expected to be able to use to order the component, as best as can be determined by the tester. It is understood that sometimes this may not be known with full accuracy; for example, the tester may believe that the component will be called "TurboUnix V5.1.1" and later find out that it has been renamed "TurboUnix V5.2", or even "Nirvana 1.0". In such cases, an editorial request can be made to update the result after publication.
Some testers may wish to also specify the exact identifier of the version actually used in the test (for example, "build 20020604"). Such additional identifiers may aid in later result reproduction, but are not required; the key point is to include the name that customers will be able to use to order the component.
The configuration disclosure includes fields for both "Hardware Availability" and "Software Availability". In both cases, the date which must be used is the date of the component which is the last of the respective type to become generally available.
If a software or hardware date changes, but still falls within 3 months of first publication, a result page may be updated on request to SPEC.
If a software or hardware date changes to more than 3 months after first publication, the result is considered Non-Compliant. For procedures regarding Non-Compliant results, see the SPEC Open Systems Group (OSG) Policy Document, http://www.spec.org/osg/policy.html.
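The availability rules above can be sketched as a small date calculation. This is an illustration under stated assumptions: the function names are hypothetical, and "3 months" is interpreted here as three calendar months from the first publication date, which is an assumption rather than a definition taken from the rules.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return `start` moved forward by `months` calendar months.

    The day is clamped to the last day of the target month
    (for example, 30 November plus 3 months gives the end of February).
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def availability_date(component_dates):
    """The date to disclose: that of the last component to become available."""
    return max(component_dates)


def within_availability_window(first_publication, component_dates):
    """True if the last component ships within 3 months of first publication."""
    deadline = add_months(first_publication, 3)
    return availability_date(component_dates) <= deadline


published = date(2011, 11, 30)
hardware = [date(2011, 12, 1), date(2012, 2, 15)]
print(availability_date(hardware), within_availability_window(published, hardware))
```

The same check would be applied separately to the hardware components and to the software components, since each has its own availability field.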
SPEC is aware that performance results for pre-production systems may sometimes be subject to change, for example when a last-minute bugfix reduces the final performance.
For results measured on pre-production systems, if the tester becomes aware of something that will reduce production system performance by more than 1.75% on an overall metric (for example, SPECfp_base2006 or SPECfp2006), the tester is required to republish the result, and the original result shall be considered non-compliant.
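The 1.75% threshold can be expressed as a short calculation. This is a minimal sketch: the function name is hypothetical, and the reduction is assumed to be measured relative to the originally published metric.

```python
def must_republish(published_metric, production_metric, threshold_pct=1.75):
    """True if production performance falls short of the published
    overall metric by more than `threshold_pct` percent."""
    reduction_pct = (published_metric - production_metric) / published_metric * 100.0
    return reduction_pct > threshold_pct


# A published overall metric of 40.0 that drops to 39.0 in production
# is a 2.5% reduction, which exceeds the threshold.
print(must_republish(40.0, 39.0))
```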
The following sections describe the various elements that make up the disclosure of the system configuration tested. The SPEC tools allow setting this information in the configuration file, prior to starting the measurement (i.e. prior to the runspec command).
It is also acceptable to update the information after a measurement has been completed, by editing the rawfile. Rawfiles include a marker that separates the user-editable portion from the rest of the file.
# =============== do not edit below this point ===================
Edits are forbidden beyond that marker.
(There is information about rawfile updating in the rawformat section of the document utility.html.)
SPEC recommends that measurements be done on the actual systems for which results are claimed. Nevertheless, SPEC recognizes that there is a cost of benchmarking, and that multiple publications from a single measurement may sometimes be appropriate. For example, two systems badged as "Model A" versus "Model B" may differ only in the badge itself; in this situation, differences are sometimes described as only "paint deep", and a tester may wish to perform only a single test (i.e. the runspec tool is invoked only once, and multiple rawfiles are prepared with differing system descriptions).
Although paint is usually not a performance-relevant difference, for other differences it can be difficult to draw a precise line as to when two similar systems should no longer be considered equivalent. For example, what if Model A and B come from different vendors? Use differing firmware, power supplies, or line voltage? Support additional types or numbers of disks, busses, interconnects, or other devices?
For SPEC CPU, a single measurement may be published as multiple equivalent results provided that all of the following requirements are met:
Performance differences from factors such as those listed in the paragraph above (paint, vendor, firmware, and so forth) are within normal run-to-run variation.
The CPU is the same.
The motherboards are the same:
same motherboard manufacturer
same electrical devices (for example, IO support chips, memory slots, PCI slots)
same physical shape.
The memory systems are the same:
same caches
same memory interconnect
same number of memory modules
memory modules are run at the same speed
memory modules comply with same specifications, where applicable (for example, the same labels as determined by the JEDEC DDR3 DIMM Label Specification).
As tested, all hardware components are supported on both systems.
For example, suppose that the Model A and Model B meet the requirements listed above, including a motherboard with the same number of DIMM slots. The Model A can be fully populated with 96 DIMMs. Due to space and thermal considerations, the Model B can only be half-populated; i.e. it is not supported with more than 48 DIMMs. If the actual system under test is the Model A, the tester must fill only the DIMM slots that are allowed to be filled for both systems.
Disclosures must reference each other, and must state which system was used for the actual measurement. For example:
This result was measured on the Acme Model A. The Acme Model A and the Bugle Model B are equivalent.
When a single measurement is used for multiple systems, SPEC may ask for a review of the differences between the systems, may ask for substantiation of the requirements above, and/or may require that additional documentation be included in the publications.
CPU Name: A manufacturer-determined processor formal name.
CPU Characteristics: Technical characteristics to help identify the processor.
This field must be used to disambiguate which processor is used, unless the CPU is already unambiguously designated by the combination of the fields "CPU Name", "CPU MHz", "FPU", and "Level (n) Cache".
In addition, SPEC encourages use of this field to make it easier for the reader to identify a processor, even if the processor choice is not, technically, ambiguous.
SPEC does not require that CPU2006 results be published on the SPEC web site, although such publication is encouraged. For results that are published on its web site, SPEC is likely to use this field to note CPU technical characteristics that SPEC may deem useful for queries, and may adjust its contents from time to time.
Some processor differences may not be relevant to performance, such as differences in packaging, distribution channels, or CPU revision levels that affect a SPEC CPU2006 overall performance metric by less than 1.75%. In those cases, SPEC does not require disambiguation as to which processor was tested.
For example, when first introduced, the TurboBlaster series is available with only one instruction set, and runs at speeds up to 2GHz. Later, a second instruction set (known as "Arch2") is introduced and older processors are commonly, but informally, referred to as having employed "Arch1", even though they were not sold with that term at the time. Chips with Arch2 are sold at speeds of 2GHz and higher. The manufacturer has chosen to call both Arch1 and Arch2 chips by the same formal chip name (TurboBlaster).
1. A 2.0GHz TurboBlaster result is published. Since the formal chip name is the same, and since both Arch1 and Arch2 are available at 2.0GHz, the CPU Characteristics field must be used to identify whether this is an Arch1 or Arch2 chip.
2. A 2.2GHz TurboBlaster result is published. In this case, there is technically no ambiguity, since all 2.2GHz results use Arch2. Nevertheless, the tester is encouraged to note that the chip uses Arch2, to help the reader disambiguate the processors.
3. As an aid to technical readers doing queries, SPEC may decide to adjust all the TurboBlaster results that have been posted on its website by adding either "Arch1" or "Arch2" to all posted results.
4. The 2.2GHz TurboBlaster is available in an OEM package and a Consumer package. These are highly similar, although the OEM version has additional testing features for use by OEMs. But these are both 2.2GHz TurboBlasters, with the same cache structure, same instruction set, and, within run-to-run variation, the same CPU2006 performance. In this case, it is not necessary to specify whether the OEM or Consumer version was tested.
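The disambiguation rule and the TurboBlaster examples above can be sketched as a small check. This is only an illustration: the `Processor` record, the `variant` label, and the catalog contents are hypothetical, and the FPU and cache strings are placeholders rather than values from any real disclosure. The sketch assumes that packaging differences with no performance effect (such as OEM versus Consumer) are recorded under the same `variant`.

```python
from collections import namedtuple

# Hypothetical record: the first four fields mirror "CPU Name", "CPU MHz",
# "FPU", and "Level (n) Cache"; `variant` is an assumed label for the
# performance-relevant difference (for example "Arch1" or "Arch2").
Processor = namedtuple("Processor", "name mhz fpu caches variant")


def characteristics_required(tested, catalog):
    """Return True if the four identifying fields leave the variant ambiguous."""
    key = (tested.name, tested.mhz, tested.fpu, tested.caches)
    variants = {
        p.variant
        for p in catalog
        if (p.name, p.mhz, p.fpu, p.caches) == key
    }
    return len(variants) > 1


CACHES = "placeholder cache description"
catalog = [
    Processor("TurboBlaster", 2000, "Integrated", CACHES, "Arch1"),
    Processor("TurboBlaster", 2000, "Integrated", CACHES, "Arch2"),
    # OEM and Consumer packages: same performance, so the same variant label.
    Processor("TurboBlaster", 2200, "Integrated", CACHES, "Arch2"),
    Processor("TurboBlaster", 2200, "Integrated", CACHES, "Arch2"),
]

# Example 1: a 2.0GHz part could be Arch1 or Arch2, so the field is required.
print(characteristics_required(catalog[0], catalog))
# Examples 2 and 4: every 2.2GHz part is Arch2, so the field is optional.
print(characteristics_required(catalog[2], catalog))
```

Even when the check reports no ambiguity, the rule still encourages the tester to fill in the field as an aid to the reader.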
CPU MHz: a numeric value expressed in megahertz. That is, do not say "1.0 GHz", say "1000". The value here is to be the speed at which the CPU is run, even if the chip itself is sold at a different clock rate. That is, if you "over-clock" or "under-clock" the part, disclose here the actual speed used.
FPU
Number of CPUs in System. As of early 2006, it is assumed that processors can be described as containing one or more "chips", each of which contains some number of "cores", each of which can run some number of hardware "threads". Fields are provided in the results disclosure for each of these. If industry practice evolves such that these terms are no longer sufficient to describe processors, SPEC may adjust the field set.
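The current fields are:
hw_ncores: the total number of cores enabled during the test
hw_nchips: the number of chips enabled during the test
hw_ncoresperchip: the number of cores in each chip
hw_nthreadspercore: the number of hardware threads enabled per core during the test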
Regarding the fields in the above list that mention the word "enabled": if a chip, core, or thread is available for use during the test, then it must be counted. If one of these resources is disabled - for example by a firmware setting prior to boot - then it need not be counted, but the tester must exercise due diligence to ensure that disabled resources are truly disabled, and not silently giving help to the result.
Regarding the remaining field (hw_ncoresperchip), the tester must count the cores irrespective of whether they are enabled.
Example: In the following tests, the SUT is a TurboBlaster Model 32-64-256, which contains 32 chips. Each chip has 2 cores. Each core can run 4 hardware threads.
A 256-copy SPECint_rate2006 test uses all the available resources. It is reported as:
hw_ncores: 64 hw_nchips: 32 hw_ncoresperchip: 2 hw_nthreadspercore: 4
The same system is tested with a 24-copy SPECint_rate2006 test, without changing the system configuration. Even though they are now only lightly loaded, all the above resources are still configured into the SUT; therefore the SUT must still be described as:
hw_ncores: 64 hw_nchips: 32 hw_ncoresperchip: 2 hw_nthreadspercore: 4
The system is halted, and firmware commands are entered to disable all but 3 of the chips. All resources are available on the remaining 3 chips. The system is rebooted and a 24-copy test is run once more. This time, the resources are:
hw_ncores: 6 hw_nchips: 3 hw_ncoresperchip: 2 hw_nthreadspercore: 4
The system is halted, and firmware commands are entered to enable 24 chips; but only 1 core is enabled per chip, and hardware threading is turned off. The system is booted, and a 24-copy test is run. The resources this time are:
hw_ncores: 24 hw_nchips: 24 hw_ncoresperchip: 2 hw_nthreadspercore: 1
Note: if resources are disabled, the method(s) used for such disabling must be documented and supported.
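The counting rules in the example above can be sketched in a few lines. This is a minimal illustration, not part of the SPEC toolset: the `SutConfig` record and its attribute names are hypothetical, and the sketch assumes that every enabled chip has the same number of enabled cores and threads. Note that the number of copies run is deliberately not an input, because it does not change what is configured into the SUT.

```python
from dataclasses import dataclass


@dataclass
class SutConfig:
    """Hypothetical description of the SUT as configured for a test."""
    chips_enabled: int             # chips available for use during the test
    cores_enabled_per_chip: int    # cores enabled on each enabled chip
    cores_built_per_chip: int      # cores in each chip, enabled or not
    threads_enabled_per_core: int  # hardware threads enabled per core


def disclosure_fields(cfg):
    """Map a configuration to the four CPU-count disclosure fields."""
    return {
        "hw_ncores": cfg.chips_enabled * cfg.cores_enabled_per_chip,
        "hw_nchips": cfg.chips_enabled,
        # Counted irrespective of whether the cores are enabled.
        "hw_ncoresperchip": cfg.cores_built_per_chip,
        "hw_nthreadspercore": cfg.threads_enabled_per_core,
    }


# The three configurations from the example above.
full_system = SutConfig(32, 2, 2, 4)
three_chips = SutConfig(3, 2, 2, 4)
one_core_per_chip = SutConfig(24, 1, 2, 1)

for cfg in (full_system, three_chips, one_core_per_chip):
    print(disclosure_fields(cfg))
```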
Number of CPUs orderable. Specify the number of processors that can be ordered, using whatever units the customer would use when placing an order. If necessary, provide a mapping from that unit to the chips/cores units just above. For example:
1 to 8 TurboCabinets. Each TurboCabinet contains 4 chips.
Level 1 (primary) Cache: Size, location, number of instances (e.g. "32 KB I + 64 KB D on chip per core")
Level 2 (secondary) Cache: Size, location, number of instances
Level 3 (tertiary) Cache: Size, location, number of instances
Other Cache: Size, location, number of instances
Memory: Size in MB/GB. Performance relevant information as to the memory configuration must be included, either in the field or in the notes section. If there is one and only one way to configure memory of the stated size, then no additional detail need be disclosed. But if a buyer of the system has choices to make, then the result page must document the choices that were made by the tester.
For example, the tester may need to document number of memory carriers, size of DIMMs, banks, interleaving, access time, or even arrangement of modules: which sockets were used, which were left empty, which sockets had the bigger DIMMs.
Exception: if the tester has evidence that a memory configuration choice does not affect performance, then SPEC does not require disclosure of the choice made by the tester.
For example, if a 1GB system is known to perform identically whether configured with 8 x 128MB DIMMs or 4 x 256MB DIMMs, then SPEC does not require disclosure of which choice was made.
Disk Subsystem: Size (MB/GB), Type (SCSI, Fast SCSI etc.), other performance-relevant characteristics. The disk subsystem used for the SPEC CPU2006 run directories must be described. If other disks are also performance relevant, then they must also be described.
Other Hardware: Additional equipment added to improve performance
System State: On Linux systems with multiple run levels, the system state must be described by stating the run level and a very brief description of the meaning of that run level, for example:
System State: Run level 4 (multi-user with display manager)
On other systems:
If the system is installed and booted using default options, document the System State as "Default".
If the system is used in a non-default mode, document the system state using the vocabulary appropriate to that system (for example, "Safe Mode with Networking", "Single User Mode").
Note: some Unix (and Unix-like) systems have deprecated the concept of "run levels", preferring other terminology for state description. In such cases, the system state field should use the vocabulary recommended by the operating system vendor.
Additional detail about system state may be added in free form notes.
File System Type used for the SPEC CPU2006 run directories
Compilers:
Auto Parallel: Whether any benchmarks are automatically optimized to use multiple threads, cores, and/or chips. Set this field to "yes" if at least one benchmark does so, and disclose in the flag descriptions or notes section which benchmarks and/or libraries are parallelized.
Note 1: It is acceptable for library functions (e.g. math functions, strcmp, memcpy, memset, std::__find) to be used in a manner that allows them to spread their work across multiple hardware threads, cores, or chips. If one or more library functions are used in this manner, that counts as auto parallelization, for purposes of this field.
Note 2: sometimes libraries are referred to as "thread safe" or "SMP safe" when implemented in a manner that allows multiple calling threads from a single process. Such an implementation is not, by itself, enough to require setting the field to "yes"; the point is whether the library routine itself causes multiple threads of work to be generated.
Note 3: incidental operating system usage of hardware resources due to interrupt processing and system services does not count as "Auto Parallelization" for purposes of this field. (Of course, all available cpu resources must be disclosed, as described in rule 4.2.2 (e).)
Note 4: parallel directives, such as OpenMP directives, are disabled for SPEC CPU2006; but compilers are allowed to do automatic parallelization.
Note 5: As of CPU2006 V1.1, this report field is set via several other fields, as described in config.html, in the section About Auto Parallel Reporting.
Scripted Installations and Pre-configured Software: In order to reduce the cost of benchmarking, test systems are sometimes installed using automatic scripting, or installed as preconfigured system images. A tester might use a set of scripts that configure the corporate-required customizations for IT Standards, or might install by copying a disk image that includes Best Practices of the performance community. SPEC understands that there is a cost to benchmarking, and does not forbid such installations, with the proviso that the tester is responsible to disclose how end users can achieve the claimed performance (using appropriate fields above).
Example: the Corporate Standard Jumpstart Installation Script has 73 documented customizations and 278 undocumented customizations, 34 of which no one remembers. Of the various customizations, 17 are performance relevant for SPEC CPU2006 - and 4 of these are in the category "no one remembers". The tester is nevertheless responsible for finding and documenting all 17. Therefore to remove doubt, the tester prudently decides that it is less error-prone and more straightforward to simply start from customer media, rather than the Corporate Jumpstart.
System Services: If performance relevant system services or daemons are shut down (e.g. remote management service, disk indexer / defragmenter, spyware defender, screen savers) these must be documented in the notes section. Incidental services that are not performance relevant may be shut down without being disclosed, such as the print service on a system with no printers attached. The tester remains responsible for the results being reproducible as described.
System and other tuning: Operating System tuning selections and other tuning that has been selected by the tester (including but not limited to firmware/BIOS, environment variables, kernel options, file system tuning options, and options for any other performance-relevant software packages) must be documented in the configuration disclosure in the rawfile. The meaning of the settings must also be described, in either the free form notes or in the flags file. The tuning parameters must be documented and supported.
Any additional notes such as listing any use of SPEC-approved alternate sources or tool changes.
For example, suppose the tester uses a pre-release compiler with:
f90 -O4 --newcodegen --loopunroll:outerloop:alldisable
but the tester knows that the new code generator will be automatically applied in the final product, and that the spelling of the unroll switch will be simpler than the spelling used here. The recommended spelling for customers who wish to achieve the effect of the above command will be:
f90 -O4 -no-outer-unroll
In this case, the flags report will include the actual spelling used by the tester, but a note should be added to document the spelling that will be recommended for customers.
SPEC CPU2006 provides benchmarks in source code form, which are compiled under control of SPEC's toolset. Compilation flags are detected and reported by the tools with the help of "flag description files". Such files provide information about the syntax of flags and their meaning.
a. Flags file required: A result will be marked "invalid" unless it has an associated flag description file. A description of how to write one may be found at http://www.spec.org/cpu2006/Docs/flag-description.html.
b. Flags description files are not limited to compiler flags. Although these descriptions have historically been called "flags files", flag description files are also used to describe other performance-relevant options.
c. Notes section or flags file? As mentioned above (rule 4.2.4), all tuning must be disclosed, and the meaning of the tuning options must be described. In general, it is recommended that the result page should state what tuning has been done, and the flags file should state what it means. As an exception, if a definition is brief, it may be more convenient, and it is allowed, to simply include the definition in the notes section.
d. Required detail: The level of detail in the description of a flag is expected to be sufficient so that an interested technical reader can form a preliminary judgment of whether he or she would also want to apply the option.
This requirement is phrased as a "preliminary judgment" because a complete judgment of a performance option often requires testing with the user's own application, to ensure that there are no unintended consequences.
At minimum, if a flag has implications for safety, accuracy, or standards conformance, such implications must be disclosed.
For example, one might write:
When --algebraII is used, the compiler is allowed to use the rules of elementary algebra to simplify expressions and perform calculations in an order that it deems efficient. This flag allows the compiler to perform arithmetic in an order that may differ from the order indicated by programmer-supplied parentheses.
The final sentence of the preceding paragraph is an example of a deviation from a standard which must be disclosed.
e. Description of Feedback-directed optimization: If feedback directed optimization is used, the description must indicate whether training runs:
Hardware performance counters are often available to provide information such as branch mispredict frequencies, cache misses, or instruction frequencies. If they are used during the training run, the description needs to note this; but SPEC does not require a description of exactly which performance counters are used.
As with any other optimization, if the optimizations performed have effects regarding safety, accuracy, or standards conformance, these effects must be described.
f. Flag file sources: It is acceptable to build flags files using previously published results, or to reference a flags file provided by someone else (e.g. a compiler vendor). Doing so does not relieve an individual tester of the responsibility to ensure that his or her own result is accurate, including all its descriptions.
SPEC CPU results are for systems, not just for chips: it is required that a user be able to obtain the system described in the result page and reproduce the result (within a small range for run-to-run variation).
Nevertheless, SPEC recognizes that chip and motherboard suppliers have a legitimate interest in CPU benchmarking. For those suppliers, the performance-relevant hardware components typically are the cpu chip, motherboard, and memory; but users would not be able to reproduce a result using only those three. To actually run the benchmarks, the user has to supply other components, such as a case, power supply, and disk; perhaps also a specialized CPU cooler, extra fans, a disk controller, graphics card, network adapter, BIOS, and configuration software.
Such systems are sometimes referred to as "white box", "home built", "kit built", or by various informal terms. For SPEC purposes, the key point is that the user has to do extra work in order to reproduce the performance of the tested components; therefore, this document refers to such systems as "user built".
For user built systems, the configuration disclosure must supply a parts list sufficient to reproduce the result. As of the listed availability dates in the disclosure, the user should be able to obtain the items described in the disclosure, spread them out on an anti-static work area, and, by following the instructions supplied with the components, plus any special instructions in the SPEC disclosure, build a working system that reproduces the result. It is acceptable to describe components using a generic name (e.g. "Any ATX case"), but the recipe must also give specific model names or part numbers that the user could order (e.g. "such as a Mimble Company ATX3 case").
Component settings that are listed in the disclosure must be within the supported ranges for those components. For example, if the memory timings are manipulated in the BIOS, the selected timings must be supported for the chosen type of memory.
Components for a user built system may be divided into two kinds: performance-relevant (for SPEC CPU), and non-performance-relevant. For example, SPEC CPU benchmark scores are affected by memory speed, and motherboards often support more than one choice for memory; therefore, the choice of memory type is performance-relevant. By contrast, the motherboard needs to be mounted in a case. Which case is chosen is not normally performance-relevant; it simply has to be the correct size (e.g. ATX, microATX, etc).
Performance-relevant components must be described in fields for "Configuration Disclosure" (see rules 4.2.2, and 4.2.3). These fields begin with hw_ or sw_ in the config file, as described in config.html (including hw_other and sw_other, which can be used for components not already covered by other fields). If more detail is needed beyond what will fit in the fields, add more information under the free-form notes.
Components that are not performance-relevant are to be described in the free-form notes.
Example:
hw_cpu_name = Frooble 1500
hw_memory = 2 GB (2x 1GB Mumble Inc Z12 DDR2 1066)
sw_other = SnailBios 17
notes_plat_000 =
notes_plat_005 = The BIOS is the Mumble Inc SnailBios Version 17,
notes_plat_010 = which is required in order to set memory timings
notes_plat_015 = manually to DDR2-800 5-5-5-15. The 2 DIMMs were
notes_plat_020 = configured in dual-channel mode.
notes_plat_025 =
notes_plat_030 = A standard ATX case is required, along with a 500W
notes_plat_035 = (minimum) ATX power supply [4-pin (+12V), 8-pin (+12V)
notes_plat_040 = and 24-pin are required]. An AGP or PCI graphics
notes_plat_045 = adapter is required in order to configure the system.
notes_plat_050 =
notes_plat_055 = The Frooble 1500 CPU chip is available in a retail box,
notes_plat_060 = part 12-34567, with appropriate heatsinks and fan assembly.
notes_plat_065 =
notes_plat_070 = As tested, the system used a Mimble Company ATX3 case,
notes_plat_075 = a Frimble Ltd PS500 power supply, and a Frumble
notes_plat_080 = Corporation PCIe Z19 graphics adapter.
notes_plat_085 =
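To make the split between disclosure fields and free-form notes concrete, the following sketch separates `hw_`/`sw_` fields from `notes_` lines in a config fragment like the one above. It is an illustration only: `split_disclosure` is a hypothetical helper, it is not how the SPEC tools parse config files, and it assumes the simple `name = value` layout shown in the example.

```python
def split_disclosure(config_text):
    """Separate hw_/sw_ fields from notes_ lines in a config fragment."""
    fields = {}
    notes = []
    for line in config_text.splitlines():
        if "=" not in line:
            continue  # not a "name = value" line
        # Split on the first "=" only, so values may themselves contain "=".
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if name.startswith(("hw_", "sw_")):
            fields[name] = value
        elif name.startswith("notes_"):
            notes.append(value)
    return fields, notes


sample = """\
hw_cpu_name = Frooble 1500
hw_memory = 2 GB (2x 1GB Mumble Inc Z12 DDR2 1066)
sw_other = SnailBios 17
notes_plat_000 =
notes_plat_070 = As tested, the system used a Mimble Company ATX3 case,
"""

fields, notes = split_disclosure(sample)
print(fields)
print(notes)
```

In this fragment the CPU, memory, and BIOS land in the fields, while the case, which is not performance-relevant, appears only in the notes.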
Additional notes:
Note 1: Regarding graphics adapters:
Note 2: Regarding power modes: Sometimes CPU chips are capable of running with differing performance characteristics according to how much power the user would like to spend. If non-default power choices are made for a user built system, those choices must be documented in the notes section.
Note 3: Regarding cooling systems: Sometimes CPU chips are capable of running with degraded performance if the cooling system (fans, heatsinks, etc.) is inadequate. When describing user built systems, the notes section must describe how to provide cooling that allows the chip to achieve the measured performance.
It was mentioned in section 2 that it is allowed to build on a different system than the system under test. This section describes when and how to document such builds.
(a) Circumstances under which additional documentation is required for the build environment
If all components of the build environment are available for the run environment, and if both belong to the same product family and are running the same operating system versions, then this is not considered a cross-compilation. The fact that the binaries were built on a different system than the run time system does not need to be documented.
If the software used to build the benchmark executables is not available on the SUT, or if the host system provides performance gains via specialized tuning or hardware not available on the SUT, the host system(s) and software used for the benchmark building process must be documented.
Sometimes, the person building the benchmarks may not know which of the two previous paragraphs apply, because the benchmark binaries and config file are redistributed to other users who run the actual tests. In this situation, the build environment must be documented.
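The three paragraphs above amount to a small decision procedure, sketched below. The function and parameter names are hypothetical. One case is not spelled out in the text (tools available on the SUT, but a different product family or operating system version, with no special gains); the sketch assumes, conservatively, that the build environment is documented in that case too.

```python
from typing import Optional


def build_environment_must_be_documented(
    tools_available_on_sut: Optional[bool],
    same_family_and_os: Optional[bool],
    host_gives_special_gains: Optional[bool],
) -> bool:
    """Decide whether the build host and software need to be documented.

    Pass None for anything the builder does not know, for example when
    binaries are redistributed to other users who run the tests.
    """
    answers = (tools_available_on_sut, same_family_and_os, host_gives_special_gains)
    if any(answer is None for answer in answers):
        return True  # unknown circumstances: document the build environment
    if not tools_available_on_sut or host_gives_special_gains:
        return True  # explicitly required by the rule
    if same_family_and_os:
        return False  # not considered a cross-compilation
    return True  # assumption: the uncovered case is treated conservatively


# Same family, same OS version, all build tools also available on the SUT.
print(build_environment_must_be_documented(True, True, False))
# Binaries redistributed; the builder cannot know the run environment.
print(build_environment_must_be_documented(None, None, None))
```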
(b) How to document a build environment.
The actual test results consist of the elapsed times and ratios for the individual benchmarks and the overall SPEC metric produced by running the benchmarks via the SPEC tools. The required use of the SPEC tools ensures that the results generated are based on benchmarks built, run, and validated according to the SPEC run rules. Below is a list of the measurement components for each SPEC CPU2006 suite and metric:
o CINT2006 SPECspeed Metrics:
    SPECint_base2006 (Required Base result)
    SPECint2006 (Optional Peak result)
o CFP2006 SPECspeed Metrics:
    SPECfp_base2006 (Required Base result)
    SPECfp2006 (Optional Peak result)
The elapsed time in seconds for each of the benchmarks in the CINT2006 or CFP2006 suite is given, and the ratio to the reference machine (a Sun UltraSparc II system at 296MHz) is calculated. The SPECint_base2006 and SPECfp_base2006 metrics are calculated as a geometric mean of the individual ratios, where each ratio is based on the median execution time from three runs. All runs of a specific benchmark when using the SPEC tools are required to have validated correctly.
The benchmark executables must have been built according to the rules described in section 2 above.
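As a rough sketch of the SPECspeed arithmetic described above: each benchmark's ratio is taken from the median of three elapsed times, and the overall metric is the geometric mean of those ratios. The sketch assumes the ratio is the reference machine's time divided by the measured time, so that larger is faster. The function names, benchmark labels, and all numbers are hypothetical; real results must come from the SPEC tools.

```python
import math
from statistics import median


def benchmark_ratio(reference_seconds, run_seconds):
    """Ratio for one benchmark, based on the median of three runs."""
    if len(run_seconds) != 3:
        raise ValueError("expected elapsed times from exactly three runs")
    # Assumption: ratio = reference time / measured time.
    return reference_seconds / median(run_seconds)


def geometric_mean(values):
    """Geometric mean, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def overall_speed_metric(results):
    """results maps a benchmark label to (reference_seconds, [t1, t2, t3])."""
    ratios = [benchmark_ratio(ref, runs) for ref, runs in results.values()]
    return geometric_mean(ratios)


# Hypothetical reference and elapsed times, in seconds.
results = {
    "bench_a": (1000.0, [100.0, 100.0, 100.0]),
    "bench_b": (400.0, [100.0, 90.0, 110.0]),
}
print(overall_speed_metric(results))
```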
o CINT2006 SPECrate Metrics:
    SPECint_rate_base2006 (Required Base result)
    SPECint_rate2006 (Optional Peak result)
o CFP2006 SPECrate Metrics:
    SPECfp_rate_base2006 (Required Base result)
    SPECfp_rate2006 (Optional Peak result)
The SPECrate (throughput) metrics are calculated based on the execution of benchmark binaries that are built using the same rules as binaries built for SPECspeed metrics. However, the tester may select the number of concurrent copies of each benchmark to be run. The same number of copies must be used for all benchmarks in a base test. This is not true for the peak results where the tester is free to select any combination of copies. The number of copies selected is usually a function of the number of CPUs in the system.
The SPECrate metric calculated for each benchmark is a function of:
    the number of copies run * reference factor for the benchmark / elapsed time in seconds
which yields a rate in jobs/time. The SPECrate overall metrics are calculated as a geometric mean from the individual SPECrate metrics using the median result from three runs. As with the SPECspeed metric, all copies of the benchmark during each run are required to have validated correctly.
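The SPECrate arithmetic can be sketched the same way. This is an illustration under stated assumptions: the reference factor is treated as an opaque per-benchmark input, the function names and numbers are hypothetical, and the check on copy counts reflects only the base rule that all benchmarks use the same number of copies.

```python
import math
from statistics import median


def benchmark_rate(copies, reference_factor, run_seconds):
    """Median rate over three runs: copies * reference factor / elapsed."""
    if len(run_seconds) != 3:
        raise ValueError("expected elapsed times from exactly three runs")
    rates = [copies * reference_factor / t for t in run_seconds]
    return median(rates)


def overall_rate_metric(results, base=True):
    """results maps a label to (copies, reference_factor, [t1, t2, t3])."""
    copy_counts = {copies for copies, _, _ in results.values()}
    if base and len(copy_counts) > 1:
        raise ValueError("a base test must use the same number of copies "
                         "for all benchmarks")
    rates = [benchmark_rate(c, f, runs) for c, f, runs in results.values()]
    # Geometric mean of the per-benchmark rates.
    return math.exp(sum(math.log(r) for r in rates) / len(rates))


# Hypothetical 4-copy base run; times are in seconds.
results = {
    "bench_a": (4, 1000.0, [200.0, 200.0, 200.0]),
    "bench_b": (4, 500.0, [100.0, 100.0, 100.0]),
}
print(overall_rate_metric(results))
```

Passing `base=False` skips the copy-count check, since peak results may use any combination of copies.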
It is permitted to use the SPEC tools to generate a 1-copy SPECrate disclosure from a 1-copy SPECspeed run. The reverse is also permitted.
As mentioned above, performance may sometimes change for pre-production systems; but this is also true of production systems (that is, systems that have already begun shipping). For example, a later revision to the firmware, or a mandatory OS bugfix, might reduce performance.
For production systems, if the tester becomes aware of something that reduces performance by more than 1.75% on an overall metric (for example, SPECfp_base2006 or SPECfp2006), the tester is encouraged but not required to republish the result. In such cases, the original result is not considered non-compliant. The tester is also encouraged, but not required, to include a reference to the change that makes the results different (e.g. "with OS patch 20020604-02").
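The 1.75% threshold can be checked with simple arithmetic. The sketch below assumes the reduction is measured relative to the originally published overall metric, which the rule does not state explicitly; the function names and the sample numbers are hypothetical.

```python
def performance_drop_percent(original, revised):
    """Reduction in an overall metric, as a percentage of the original."""
    # Assumption: the drop is expressed relative to the original value.
    return (original - revised) * 100.0 / original


def republication_encouraged(original, revised, threshold_percent=1.75):
    """True if the metric fell by more than the threshold."""
    return performance_drop_percent(original, revised) > threshold_percent


# Hypothetical overall metric before and after a firmware revision.
print(performance_drop_percent(100.0, 98.0))
print(republication_encouraged(100.0, 98.0))
print(republication_encouraged(100.0, 99.0))
```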
Publication of peak results is considered optional by SPEC, so the tester may choose to publish only base results. Since by definition base results adhere to all the rules that apply to peak results, the tester may choose to refer to these results by either the base or peak metric names (e.g. SPECint_base2006 or SPECint2006).
It is permitted to publish base-only results. Alternatively, the use of the flag basepeak is permitted, as described in section 3.5.
SPEC encourages use of the CPU2006 suites in academic and research environments. It is understood that experiments in such environments may be conducted in a less formal fashion than that demanded of testers who publish on the SPEC web site. For example, a research environment may use early prototype hardware that simply cannot be expected to stay up for the length of time required to meet the Continuous Run requirement (see section 3.3), or may use research compilers that are unsupported and are not generally available (see section 1).
Nevertheless, SPEC would like to encourage researchers to obey as many of the run rules as practical, even for informal research. SPEC respectfully suggests that following the rules will improve the clarity, reproducibility, and comparability of research results.
Where the rules cannot be followed, SPEC requires that the deviations from the rules be clearly disclosed, and that any SPEC metrics (such as SPECint2006) be clearly marked as estimated.
It is especially important to clearly distinguish results that do not comply with the run rules when the areas of non-compliance are major, such as not using the reference workload, or only being able to correctly validate a subset of the benchmarks.
If a SPEC CPU2006 licensee publicly discloses a CPU2006 result (for example in a press release, academic paper, magazine article, or public web site), and does not clearly mark the result as an estimate, any SPEC member may request that the rawfile(s) from the run(s) be sent to SPEC. The rawfiles must be made available to all interested members no later than 10 working days after the request. The rawfile is expected to be complete, including configuration information (section 4.2 above).
A required disclosure is considered public information as soon as it is provided, including the configuration description.
For example, Company A claims a result of 1000 SPECint_rate2006. A rawfile is requested, and supplied. Company B notices that the result was achieved by stringing together 50 chips in single-user mode. Company B is free to use this information in public (e.g. it could compare the Company A machine vs. a Company B machine that scores 999 using only 25 chips in multi-user mode).
Review of the result: Any SPEC member may request that a required disclosure be reviewed by the SPEC CPU subcommittee. At the conclusion of the review period, if the tester does not wish to have the result posted on the SPEC result pages, the result will not be posted. Nevertheless, as described above, the details of the disclosure are public information.
When public claims are made about CPU2006 results, whether by vendors or by academic researchers, SPEC reserves the right to take action if the rawfile is not made available, or shows different performance than the tester's claim, or has other rule violations.
Consistency and fairness are guiding principles for SPEC. To help assure that these principles are met, any organization or individual who makes public use of SPEC benchmark results must do so in accordance with the SPEC Fair Use Rule, as posted at http://www.spec.org/fairuse.html.
SPEC CPU2006 metrics may be estimated. All estimates must be clearly identified as such. It is acceptable to estimate a single metric (for example, SPECint_rate2006, or SPECfp_base2006, or the elapsed seconds for 401.bzip2).
Note that it is permitted to estimate only the peak metric; one is not required to provide a corresponding estimate for base.
SPEC requires that every use of an estimated number be clearly marked with "est." or "estimated" next to each estimated number, rather than burying a footnote at the bottom of a page.
For example, say that the JumboFast will achieve estimated performance of:
Model 1
    SPECint_base2006 50 est.
    SPECint2006 60 est.
Model 2
    SPECint_rate2006 70 est.
    SPECfp_rate2006 80 est.
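One way to make sure every estimated number carries its own marker is to attach "est." at the point where the number is formatted, rather than adding it afterwards. The helpers below are a hypothetical sketch of that idea; the names and layout are assumptions, not part of any SPEC tool.

```python
def format_metric(name, value, estimated):
    """Render one metric, with "est." right next to an estimated number."""
    text = f"{name} {value:g}"
    return f"{text} est." if estimated else text


def format_claims(model, metrics):
    """metrics is a list of (name, value, estimated) tuples for one model."""
    lines = [model]
    lines += ["    " + format_metric(*metric) for metric in metrics]
    return "\n".join(lines)


print(format_claims("Model 1", [
    ("SPECint_base2006", 50, True),
    ("SPECint2006", 60, True),
]))
```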
If estimates are used in graphs, the word "estimated" or "est." must be plainly visible within the graph, for example in the title, the scale, the legend, or next to each individual result that is estimated.
Note: the term "plainly visible" in this rule is not defined; it is intended as a call for responsible design of graphical elements. Nevertheless, for the sake of giving at least rough guidance, here are two examples of the right way and wrong way to mark estimated results in graphs:
Licensees are encouraged to give a rationale or methodology for any estimates, together with other information that may help the reader assess the accuracy of the estimate. For example:
Those who publish estimates are encouraged to publish actual SPEC CPU2006 metrics as soon as possible.
If for some reason, the tester cannot run the benchmarks as specified in these rules, the tester can seek SPEC approval for performance-neutral alternatives. No publication may be done without such approval. The SPEC Open Systems Group (OSG) maintains a Policies and Procedures document that defines the procedures for such exceptions.
Copyright 1999-2011 Standard Performance Evaluation Corporation All Rights Reserved