766.femflow_r
SPEC CPU®2026 Benchmark Description

Benchmark Name

766.femflow_r

Benchmark Program General Category

Fluid dynamics: high-order finite element method.

Benchmark Authors

FemFlow was written by Martin Kronbichler, <Martin[dot]Kronbichler[at]rub[dot]de>. FemFlow is derived from ExaDG and uses the deal.II finite element library; their authors may be found via the links below.

766.femflow_r was submitted to the SPEC CPU v8 Benchmark Search Program by Martin Kronbichler.

Benchmark Description

The FemFlow program solves the compressible Navier-Stokes equations with high-order finite element methods. The equations are solved on locally refined meshes using a splitting approach: the hyperbolic terms are discretized with a discontinuous Galerkin method on hexahedral elements of a 3D mesh (polynomial degree 4, over-integration with 6 points per direction) and advanced explicitly in time with Heun's method, while the parabolic terms are treated implicitly with a Crank-Nicolson scheme. The associated linear systems are solved separately for velocity and temperature with a conjugate gradient method preconditioned by the matrix diagonal. The code relies on the deal.II finite element library, in particular its matrix-free infrastructure, to compute the underlying integrals quickly with sum factorization.
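The implicit solve described above can be illustrated by a diagonally preconditioned (Jacobi) conjugate gradient iteration. The sketch below is illustrative only and not taken from FemFlow; a small dense SPD matrix stands in for the matrix-free operator that the real code applies.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of a conjugate gradient solver preconditioned by the matrix
// diagonal, as used for the implicit velocity and temperature systems.
// A small dense SPD matrix replaces the matrix-free operator here.
std::vector<double> cg_jacobi(const std::vector<std::vector<double>> &A,
                              const std::vector<double> &b,
                              double tol = 1e-12, std::size_t max_iter = 100)
{
  const std::size_t n = b.size();
  auto apply_A = [&](const std::vector<double> &x) {
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = 0; j < n; ++j)
        y[i] += A[i][j] * x[j];
    return y;
  };
  auto dot = [&](const std::vector<double> &u, const std::vector<double> &v) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
      s += u[i] * v[i];
    return s;
  };

  std::vector<double> x(n, 0.0), r = b, z(n), p(n);
  for (std::size_t i = 0; i < n; ++i)
    z[i] = r[i] / A[i][i];          // diagonal (Jacobi) preconditioner
  p = z;
  double rz = dot(r, z);
  for (std::size_t it = 0; it < max_iter; ++it) {
    const std::vector<double> Ap = apply_A(p);
    const double alpha = rz / dot(p, Ap);
    for (std::size_t i = 0; i < n; ++i) {
      x[i] += alpha * p[i];
      r[i] -= alpha * Ap[i];
    }
    if (std::sqrt(dot(r, r)) < tol)
      break;
    for (std::size_t i = 0; i < n; ++i)
      z[i] = r[i] / A[i][i];
    const double rz_new = dot(r, z);
    const double beta = rz_new / rz;
    rz = rz_new;
    for (std::size_t i = 0; i < n; ++i)
      p[i] = z[i] + beta * p[i];
  }
  return x;
}
```

The diagonal preconditioner keeps each iteration cheap and matrix-free, which matches the benchmark's avoidance of assembled sparse matrices.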

The benchmark involves three main types of operations that execute in tight succession. First, the sum-factorization algorithms underlying the finite-element integrals perform a sequence of regular arithmetic operations in the form of tensor contractions. The code expresses them in a form similar to matrix-matrix multiplication with several different matrix sizes, the most common being 5x5 times 5x25, 6x5 times 5x25, and 6x6 times 6x36. These operations are followed by the evaluation of the physical terms of the Navier-Stokes equations at the points of a quadrature formula, including convective and viscous effects in three-dimensional space. The temporary results of these operations form a local data set of up to 300 kB with high re-use. To expose opportunities for SIMD vectorization, the benchmark adds an "inner array" dimension of 8 to each step; mathematically, this corresponds to executing the same operations on several cells at once. As opposed to the intrinsics-based across-element vectorization in the deal.II library, the benchmark keeps the underlying loops in an abstraction class VectorizedArray<double, 8>.
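The "inner array" idea can be sketched as follows. This is a simplified stand-in, not deal.II's actual VectorizedArray class: each scalar in the tensor contraction is replaced by a fixed number of lanes, so the same arithmetic runs for several mesh cells at once and the compiler can map the lane loop onto SIMD units.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Simplified stand-in for an across-element vectorization type: N lanes,
// one per mesh cell, with element-wise arithmetic.
template <typename T, std::size_t N>
struct VectorizedArraySketch {
  std::array<T, N> lane{};
  VectorizedArraySketch &operator+=(const VectorizedArraySketch &o) {
    for (std::size_t l = 0; l < N; ++l)
      lane[l] += o.lane[l];
    return *this;
  }
  friend VectorizedArraySketch operator*(const VectorizedArraySketch &a,
                                         const VectorizedArraySketch &b) {
    VectorizedArraySketch r;
    for (std::size_t l = 0; l < N; ++l)
      r.lane[l] = a.lane[l] * b.lane[l];
    return r;
  }
};

// One tensor contraction of the sum-factorization kernel, shaped like a
// small matrix-matrix product C(MxK) * X(KxP), e.g. M=K=5, P=25. With
// Number = VectorizedArraySketch<double, 8>, every scalar operation below
// acts on 8 cells at once.
template <std::size_t M, std::size_t K, std::size_t P, typename Number>
void contract(const Number (&C)[M][K], const Number (&X)[K][P],
              Number (&Y)[M][P])
{
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t p = 0; p < P; ++p) {
      Y[i][p] = Number{};
      for (std::size_t k = 0; k < K; ++k)
        Y[i][p] += C[i][k] * X[k][p];
    }
}
```

Because the kernel is templated on the number type, the same source compiles to scalar code with double and to cell-batched code with the vectorized type.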

The second challenging ingredient of the benchmark concerns the code immediately surrounding the arithmetically intensive steps above: sequences of memory-intensive accesses with intermediate-range data dependencies a few hundred kB apart, including indirect addressing and BLAS-1 type vector operations. Third, the benchmark involves unstructured data re-arrangement operations required by the setup routines of dynamic mesh adaptation at regular intervals.
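The memory-bound phase can be sketched with two idioms; the function names here are illustrative, not from FemFlow. Indirect addressing gathers cell-local degrees of freedom from a global solution vector through an index list, and BLAS-1 style updates (here an axpy) stream through long vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indirect read: collect cell-local degrees of freedom from a global
// vector via an index list. Illustrative names, not FemFlow's API.
void gather(const std::vector<double> &global,
            const std::vector<std::size_t> &dof_indices,
            std::vector<double> &local)
{
  local.resize(dof_indices.size());
  for (std::size_t i = 0; i < dof_indices.size(); ++i)
    local[i] = global[dof_indices[i]];
}

// BLAS-1 style streaming update: y += a * x.
void axpy(double a, const std::vector<double> &x, std::vector<double> &y)
{
  for (std::size_t i = 0; i < y.size(); ++i)
    y[i] += a * x[i];
}
```

Both loops are limited by memory bandwidth rather than arithmetic, which is why they stress a different part of the machine than the tensor contractions.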

When optimally compiled, the code is likely to make good use of SIMD units and processor caches.

The benchmark code is derived from a real application, with a series of simplifications to make it self-contained. The real application is ExaDG (High-Order Discontinuous Galerkin for the Exa-Scale), available at github.com/exadg/exadg, and the particular test case is also the basis of a tutorial program in the deal.II library, www.dealii.org/developer/doxygen/deal.II/step_67.html.

Some mathematical steps in the algorithm have been removed to make the application self-contained. For example, the benchmark version does not include the multigrid solver that is run in practice, because it would need external library support for algebraic multigrid, which would make platform-agnostic execution more challenging. Furthermore, several numerical ingredients of the outer algorithm, such as more sophisticated time stepping algorithms, stabilization approaches for flows at higher Reynolds and Mach numbers, and support for more complicated geometries via unstructured and curved meshes, have been removed, as these would run similar inner loops. These modifications make the algorithm less robust than in actual practice, but they reduce the original application to fewer than 2000 lines of code beyond the main deal.II library, while nevertheless retaining the key elements of the original computational fluid dynamics application, as listed in the first paragraph of this section.

Input Description

On the command line, specify the name of a parameter file which contains various values that control the program. For example, the very small (un-timed) size=test workload is intended to provide a quick sanity check that a valid binary has been built. It uses:

subsection Flow parameters
  set gamma       = 1.4
  set R           = 287.0
  set c_v         = 717.5
  set c_p         = 1.951219512195122e-03
  set viscosity   = 6.25e-04
  set lambda      = 1.717622810030917e-06
  set Mach number = 0.1
end

subsection Control parameters
  set output tick         = 0.05
  set refine tick         = 0.3
  set Courant number      = 0.05
  set refinements         = 1
  set end time            = 0.01
  set print debug timings = false
end

If one wishes to vary the work done by the benchmark, the parameters that are most likely to be useful are the number of refinements (useful range: 0 to 4) and the end time (useful range: 0.01 to 4). The train and refrate workloads pick values for these that cause the benchmark to meet SPEC's requirements for approximate duration.

If you enable debug timings, detailed information will be printed about program phases.

Output Description

The kinetic energy and dissipation are printed after each time step. Furthermore, the program prints the average number of linear iterations in the iterative conjugate gradient solvers employed for the viscous effects, in order to verify that the expected convergence rates of the iterative solvers are reached. The outputs are validated against SPEC's expected answers.

Programming Language

C++

Threading Model

Although there are source code references to std::thread, only 1 thread is active at a time.

Known Portability Issues

GNU/Linux systems implement C++ std::thread using POSIX Threads. Although some systems automatically include the needed support, this is not universal. Surprises have been seen when changing OS versions, or libraries, or compilers; or when FDO is added; or when combining C and C++ modules. Typically, it is safest to add -pthread to all compile and link lines for all SPEC CPU benchmarks that use std::thread. Please see the $SPEC/config directory for Example config files that demonstrate how to conveniently do so.

Sources and Licensing

FemFlow is licensed under the GNU GPL v3, because that is the license of ExaDG.

deal.II is licensed under the GNU LGPL 2.1 or later. Deal.II can be found at https://www.dealii.org. The version of deal.II used for this benchmark is commit 95acc98c42161077e0dfa09cf67356e2cdc90473.

The copy of deal.II used in this benchmark contains a stripped-down version of BOOST, with minor modifications from the version 1.70.0 that was originally imported into deal.II. BOOST is licensed under the BOOST Software License.

A detailed version history of this benchmark can be found here.

References

Copyright © 2026 Standard Performance Evaluation Corporation (SPEC®)