707.ntest_r
SPEC CPU®2026 Benchmark Description

Benchmark Name

707.ntest_r

Benchmark Program General Category

Game AI

Benchmark Authors

Chris Welty and Vlad Petric

707.ntest_r was submitted to the SPEC CPU v8 Benchmark Search Program by Vlad Petric vlad [at] petric [dot] cc.

Benchmark Description

Othello is a combinatorial game played on an 8x8 board with stones that are black on one side and white on the other. The game starts from a fixed initial position of 4 stones, two black and two white, arranged in an X formation in the center of the board. Players take turns making moves, starting with the black player. On each move, a stone of the mover's color is placed on the board. The newly placed stone must capture a contiguous line of opponent pieces by enclosing it between itself and another of the mover's pieces, in any direction: horizontal, vertical, or diagonal. Capturing entails flipping the stones from the opponent's color to the mover's color. If a player can't move, they have to pass. If both players pass, the match is over.
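The capture rule above can be sketched in a few lines of C++. This is only an illustration: the Board type, the Cell enum, and the flipsFor function are hypothetical names chosen for this example and are not NTest's internal representation. A move is legal exactly when it flips at least one stone.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical board model for illustration only.
enum class Cell : std::uint8_t { Empty, Black, White };
using Board = std::array<Cell, 64>;  // index = row * 8 + col, row 0 = rank 1, col 0 = file a

inline Cell opponent(Cell c) {
    return c == Cell::Black ? Cell::White : Cell::Black;
}

// Initial position: white on d4 and e5, black on e4 and d5.
inline Board initialBoard() {
    Board b;
    b.fill(Cell::Empty);
    b[3 * 8 + 3] = Cell::White;  // d4
    b[4 * 8 + 4] = Cell::White;  // e5
    b[3 * 8 + 4] = Cell::Black;  // e4
    b[4 * 8 + 3] = Cell::Black;  // d5
    return b;
}

// Returns the indices of the stones that would be flipped if `mover`
// placed a stone at (row, col). An empty result means the move is illegal.
std::vector<int> flipsFor(const Board& b, int row, int col, Cell mover) {
    assert(row >= 0 && row < 8 && col >= 0 && col < 8);
    std::vector<int> flipped;
    if (b[row * 8 + col] != Cell::Empty) return flipped;
    const Cell opp = opponent(mover);
    for (int dr = -1; dr <= 1; ++dr) {
        for (int dc = -1; dc <= 1; ++dc) {
            if (dr == 0 && dc == 0) continue;
            // Walk over a contiguous run of opponent stones in this direction.
            std::vector<int> line;
            int r = row + dr, c = col + dc;
            while (r >= 0 && r < 8 && c >= 0 && c < 8 && b[r * 8 + c] == opp) {
                line.push_back(r * 8 + c);
                r += dr;
                c += dc;
            }
            // The run is captured only if it ends on one of the mover's stones.
            const bool onBoard = r >= 0 && r < 8 && c >= 0 && c < 8;
            if (!line.empty() && onBoard && b[r * 8 + c] == mover) {
                flipped.insert(flipped.end(), line.begin(), line.end());
            }
        }
    }
    return flipped;
}
```

For example, calling flipsFor(initialBoard(), 2, 3, Cell::Black) asks what Black captures by opening at d3; in this model the only flipped stone is the white one on d4.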

An Othello game takes at most 60 moves (not counting passes), since every move places a new stone on the board. The vast majority of competitive Othello games complete in 59 or 60 moves, though in some cases they can end much sooner than that.

For more information about Othello's gameplay, see https://www.wikihow.com/Play-Othello

As in most other combinatorial games, the ability to foresee a number of moves into the future is critical for good play. Modern (as of 2023) computer engines can easily evaluate as far as 32 moves ahead, while human players typically see a single-digit number of moves ahead. In fact, Othello engines have been able to defeat the world champion since 1997 (1), though the game itself remains unsolved.

NTest is a leading Othello engine, developed by Chris Welty and Vlad Petric (2). It employs a negascout algorithm from the minimax family with multi-probcut and a strong evaluation function based on pattern databases (3).
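To make the search algorithm concrete, the following is a minimal sketch of negascout (also known as principal variation search) over an explicit toy game tree. It is not NTest's code: the Node structure and the leaf and inner helpers are hypothetical, leaf values stand in for the pattern-based evaluation function, and multi-probcut pruning is omitted entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical toy game tree. `value` is the static score of the node
// from the point of view of the side to move at that node.
struct Node {
    int value;
    std::vector<Node> children;  // empty for terminal positions
};

inline Node leaf(int value) { return Node{value, {}}; }
inline Node inner(std::vector<Node> kids) { return Node{0, std::move(kids)}; }

// Negascout in negamax form: the first child is searched with the full
// (alpha, beta) window, later children are probed with a null window and
// re-searched only when the probe suggests they may improve on alpha.
int negascout(const Node& n, int depth, int alpha, int beta) {
    assert(alpha < beta);
    if (depth == 0 || n.children.empty()) return n.value;
    bool first = true;
    for (const Node& child : n.children) {
        int score;
        if (first) {
            score = -negascout(child, depth - 1, -beta, -alpha);
            first = false;
        } else {
            // Null-window probe: cheap test of "is this child better than alpha?"
            score = -negascout(child, depth - 1, -alpha - 1, -alpha);
            if (score > alpha && score < beta) {
                // The probe failed high, so re-search with the full window.
                score = -negascout(child, depth - 1, -beta, -alpha);
            }
        }
        alpha = std::max(alpha, score);
        if (alpha >= beta) break;  // cutoff: the opponent will avoid this line
    }
    return alpha;
}
```

As a usage example, for a root whose first child has leaves 3 and 5 and whose second child has leaves -2 and 9, negascout(root, 2, -100, 100) returns 3, the same value a plain minimax search would produce. The second child is refuted by the null-window probe after looking at only one of its leaves.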

The SPEC benchmark version of NTest evaluates a large number of midgame (36 remaining moves) and endgame (20 remaining moves) positions. The endgame positions are evaluated perfectly, while the midgame positions are evaluated partially (not at complete depth). For each evaluated position, the benchmark prints the score associated with the position (a positive score represents a win for the mover, a negative score represents a loss, and zero is a draw). For endgame positions, the printed score is the final score with perfect play. For midgame positions, the printed score is an estimate of the final score with perfect play (reasonably accurate, but not exact).

(1) https://archive.nytimes.com/www.nytimes.com/library/cyber/week/080997othello.html

(2) https://en.wikipedia.org/wiki/Computer_Othello

(3) Michael Buro, "Experiments with Multi-ProbCut and a New High-Quality Evaluation Function for Othello", 1997 https://skatgame.net/mburo/ps/improve.pdf

How NTest Works

NTest plays through the game up to a certain point (based on a set number of moves or empty squares) and then evaluates the board position.

At that point, it figures out:

- What is the best move next?
- What is the score or expected outcome?

Types of Evaluations

There are two main kinds of searches:

1. Full Evaluation
- Explores the game completely to the end.
- Gives a guaranteed result (win/loss/draw, called WLD).
- Can also focus on maximizing score.
- Takes a lot of computing time, especially in the middle of the game.

2. Partial Evaluation
- Only explores partway ahead.
- Gives estimates, not guaranteed outcomes.
- Much faster, especially for complex mid-game positions.

Performance Notes

- Endgame positions (few empty squares) are faster to evaluate.
- Partial evaluations are much quicker than full ones.
- Full WLD evaluations are faster than full score-maximization evaluations.
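One common reason a WLD search is cheaper than a score-maximizing search is that it only has to decide on which side of zero the result lies, so it can run with a very narrow alpha-beta window. The sketch below illustrates this on a toy tree by counting visited leaves. The Pos structure and the function names are hypothetical, and whether NTest implements its WLD mode in exactly this way is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical toy position. `score` is the final disc difference
// from the point of view of the side to move (used at leaves only).
struct Pos {
    int score;
    std::vector<Pos> next;
};

// Fail-soft alpha-beta in negamax form. `leaves` counts visited leaves.
int alphaBeta(const Pos& p, int alpha, int beta, int& leaves) {
    assert(alpha < beta);
    if (p.next.empty()) {
        ++leaves;
        return p.score;
    }
    int best = -65;  // below any possible Othello score
    for (const Pos& q : p.next) {
        const int lower = std::max(alpha, best);
        best = std::max(best, -alphaBeta(q, -beta, -lower, leaves));
        if (best >= beta) break;  // cutoff
    }
    return best;
}

// Exact result: search with the widest possible window.
int solveExact(const Pos& root, int& leaves) {
    return alphaBeta(root, -64, 64, leaves);
}

// Win/loss/draw only: the window (-1, +1) is enough to tell the sign.
// Returns +1 for a win, 0 for a draw, -1 for a loss (for the mover).
int solveWld(const Pos& root, int& leaves) {
    const int s = alphaBeta(root, -1, 1, leaves);
    return (s > 0) - (s < 0);
}
```

On a root with two replies, the first leading to leaves 4 and 6 and the second to leaves 10 and 8, solveExact returns 8 after visiting all four leaves. solveWld reports a win after visiting only the first two, because any winning reply is good enough.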

Input Description

GGF files contain Othello games, written one per line. Each game starts with (; and ends with ;).

Each move is recorded as the player (B or W) followed by the square in brackets, in a notation similar to that of chess. For example, B[h4] means Black plays at h4.

NTest only cares about the sequence of moves, not the extra game info in the file.
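Extracting just the move sequence from one GGF game line could look roughly like the sketch below. It is not NTest's parser: the Move structure and parseMoves function are hypothetical, and it assumes that every property has the form NAME[value], that the first two characters of a B or W value name the square, and that values not naming a square (such as passes) can simply be skipped.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical move record: player is 'B' or 'W', col and row are 0..7
// (col 0 = file a, row 0 = rank 1).
struct Move {
    char player;
    int col;
    int row;
};

std::vector<Move> parseMoves(const std::string& game) {
    std::vector<Move> moves;
    std::size_t i = 0;
    while (i < game.size()) {
        // Read a property name: a run of letters directly followed by '['.
        const std::size_t start = i;
        while (i < game.size() && std::isalpha(static_cast<unsigned char>(game[i]))) ++i;
        if (i == start) {  // not a letter, e.g. '(' or ';'
            ++i;
            continue;
        }
        if (i >= game.size() || game[i] != '[') continue;
        const std::string name = game.substr(start, i - start);
        const std::size_t close = game.find(']', i);
        if (close == std::string::npos) break;  // malformed tail, stop
        const std::string value = game.substr(i + 1, close - i - 1);
        i = close + 1;
        // Only B[...] and W[...] are moves. Names such as PB or PW
        // (player names) are other game info and are ignored.
        if ((name == "B" || name == "W") && value.size() >= 2) {
            const int col = std::tolower(static_cast<unsigned char>(value[0])) - 'a';
            const int row = value[1] - '1';
            if (col >= 0 && col < 8 && row >= 0 && row < 8) {
                moves.push_back(Move{name[0], col, row});
            }
        }
    }
    return moves;
}
```

For a made-up line such as (;GM[Othello]PB[alice]PW[bob]B[f5]W[d6]B[c3];), this returns three moves, the first being Black at column 5, row 4 (f5). Reading whole property names rather than searching for "B[" is what keeps PB[alice] from being mistaken for a move.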

An Othello game has at most 60 moves, because the board has 64 squares and 4 of them are already occupied by the initial stones.

The NTest SPEC benchmark uses different search settings and different game sets depending on the evaluation goals:

- test uses a single game (OneGame.ggf), which it walks to 10 remaining squares (also known as a depth of 10), and then does a partial evaluation of depth 6 (4 empty squares remaining).
- train uses 100 games from September 2004, which it fully evaluates at depth 19 (19 remaining empty squares). The file is Othello.60e4.ggf.
- refrate uses 290 games from April 2010, which it walks down to depth 20 and then partially evaluates down to 4 remaining squares (evaluation depth of 16). The file is Othello.154.ggf.
- refspeed uses 19986 games from 1994 to 2004, which it fully evaluates at depth 20. The file is Othello.01e4.ggf.
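The relationship between position depth, search depth, and remaining empty squares in the settings above is simple arithmetic, sketched here with hypothetical helper names. It assumes that no passes occur, so that every move fills exactly one square.

```cpp
// Hypothetical helpers for illustration, assuming no passes.
constexpr int kBoardSquares = 64;
constexpr int kInitialStones = 4;

// Number of stones that must be placed to reach a position with
// `positionDepth` empty squares.
constexpr int stonesToPlace(int positionDepth) {
    return kBoardSquares - kInitialStones - positionDepth;
}

// Empty squares left at the search horizon when a position with
// `positionDepth` empties is searched `searchDepth` moves deep.
constexpr int emptiesAtHorizon(int positionDepth, int searchDepth) {
    return positionDepth - searchDepth;
}

// A full evaluation is one whose horizon is the end of the game.
constexpr bool isFullEvaluation(int positionDepth, int searchDepth) {
    return emptiesAtHorizon(positionDepth, searchDepth) == 0;
}

static_assert(emptiesAtHorizon(10, 6) == 4, "test: depth 10, search depth 6");
static_assert(emptiesAtHorizon(20, 16) == 4, "refrate: depth 20, search depth 16");
static_assert(isFullEvaluation(19, 19), "train: full evaluation at depth 19");
static_assert(stonesToPlace(20) == 40, "40 stones placed to reach 20 empties");
```

Under these assumptions, both the test and refrate workloads stop their partial evaluation with 4 empty squares remaining, while train and refspeed search all the way to the end of the game.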

All of these games are taken from the Thor database of official Othello games (an arbitrary selection of existing ggf files, as of 2020). They include recorded over-the-board games, online games played as part of a tournament, and computer-versus-human and computer-versus-computer games played in an official setting. For each of the four run types, there is a control file in the respective input directory. It contains the command line parameters for NTest: (i) the game file, (ii) the depth of the position, and (iii) the depth of the search.

Other input files

The coefficients/ subdirectory contains the binary database of patterns. Patterns are the main input to NTest's evaluation (goodness) function, which estimates the value of a position without expanding the position's subtree.

It also includes an mpc stats file, which tracks the quality of the evaluation function (how accurate it is compared with a fully evaluated result) and serves no purpose for the SPEC submission. Its impact on the total run time is negligible.

The resource/ subdirectory contains a single file, solver12.txt, which holds a set of depth-12 positions that are fully evaluated in the start-up testing phase.

Note that only the coefficients are critical - the other two can be removed, or extracted to an out-of-band test suite.

Output Description

The output is a text file showing the state of the game board after each move. For verification, the output files from the run must match the expected output exactly.

Programming Language

C++

Threading Model

The SPECrate version is single-threaded. The SPECspeed version uses OpenMP.

Known Portability Issues

None.

Sources and Licensing

All the NTest code is distributed under the GNU General Public License Version 3 (see GPLv3.txt).

SPEC added a version of the Mersenne-Twister PRNG that is licensed by its authors (Makoto Matsumoto, Takuji Nishimura, and Mutsuo Saito) under a BSD license.

The original NTest is available on GitHub at github.com/vladpetric/ntest. The SPEC CPU version is based on commit hash fc7d6b26.
