838.diamond_s
Metagenomics and protein sequencing
Benjamin Buchfink <buchfink[at]gmail [dot] com>, github.com/bbuchfink
838.diamond_s was submitted to the SPEC CPU v8 Benchmark Search Program by Benjamin Buchfink.
DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. DIAMOND is considered a high-performance replacement for BLAST, the Basic Local Alignment Search Tool from the National Institutes of Health.
The key features are:
DIAMOND's primary inputs are FASTA files. FASTA is a scientific data format used to store nucleic acid sequences (such as DNA sequences) or protein sequences. A single file may contain multiple sequences, so it is sometimes referred to as the FASTA database format. FASTA files can be viewed and analyzed using any DNA analysis software. These files have a suffix of .fa or .fasta, and may be gzipped, since DIAMOND reads gzipped files on the fly using the zlib library. FASTA files can also be compressed into .dmnd files, which are binary archives specific to the DIAMOND program.
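As an illustration of the FASTA layout described above, the following is a minimal Python sketch of a reader, not DIAMOND's own parser. It assumes the common convention that each record starts with a `>` header line followed by one or more sequence lines, and that gzipped files carry a `.gz` suffix. The function names and the sample records are hypothetical.

```python
import gzip
import io


def open_maybe_gzip(path):
    """Open a FASTA file as text; assumes gzipped files end in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input, held in memory for demonstration.
example = io.StringIO(">seq1 hypothetical protein\nMKV\nLAA\n>seq2\nMST\n")
records = list(read_fasta(example))
```

For a file on disk, `read_fasta(open_maybe_gzip(path))` would iterate over its records in the same way.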
The command lines specify a database (-d) and a query (-q). At a high level, the program searches for the query sequences in the database. Some switches control the sensitivity of the search (--ultra-sensitive, --sensitive, --mid-sensitive, --fast, etc.). These switches can be discovered by running "./diamond help".
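The shape of such a command line can be sketched in Python by assembling an argument list. This is an illustration only, not the benchmark's actual command line: the `blastp` subcommand, the file names, and the helper function are assumptions, while -d, -q, and the sensitivity switches come from the description above and -p is DIAMOND's thread-count option.

```python
import shlex


def build_diamond_command(database, query, threads=1, sensitivity=None):
    """Assemble an argument list for a DIAMOND search (hypothetical helper)."""
    # "blastp" is an assumed subcommand; the benchmark may use another.
    argv = ["./diamond", "blastp", "-d", database, "-q", query, "-p", str(threads)]
    if sensitivity is not None:
        # For example "sensitive" becomes the switch "--sensitive".
        argv.append("--" + sensitivity)
    return argv


# Hypothetical file names, for illustration only.
argv = build_diamond_command(
    "reference.dmnd", "queries.fasta", threads=4, sensitivity="sensitive"
)
command_line = shlex.join(argv)
```

Building a list rather than a single string keeps each argument separate, which avoids quoting problems if the list is later handed to `subprocess.run`.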
More documentation and tutorials are available at github.com/bbuchfink/diamond/wiki.
The FASTA files used to construct the SPEC CPU benchmarks were downloaded from the sources below. A savvy user can download alternative protein sequences and databases to craft their own command lines.
The input parameters also describe the output format. In SPEC CPU, we use the following string, which is decoded below.
--outfmt 6 qseqid sseqid slen mismatch gapopen qstart qend sstart send
| Token | Description |
| --- | --- |
| qseqid | Query sequence ID |
| sseqid | Subject sequence ID |
| slen | Subject sequence length |
| mismatch | Number of mismatches |
| gapopen | Number of gap openings |
| qstart | Start of alignment in query |
| qend | End of alignment in query |
| sstart | Start of alignment in subject |
| send | End of alignment in subject |
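To show how the nine columns map onto a record, here is a minimal Python sketch that parses one output line. It assumes the fields are tab-separated, as in BLAST tabular format 6, and that every column after the two IDs is an integer. The `Hit` type, the `parse_hit` function, and the sample line are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment line, in the column order given by --outfmt 6 above."""
    qseqid: str
    sseqid: str
    slen: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int


def parse_hit(line):
    """Split one tab-separated line into a Hit; raise on a wrong column count."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    # The two IDs stay strings; the remaining seven columns are integers.
    return Hit(fields[0], fields[1], *(int(f) for f in fields[2:]))


# Made-up sample line, not taken from the benchmark's expected output.
hit = parse_hit("query1\tsubjectA\t350\t12\t1\t5\t120\t10\t125\n")
aligned_query_span = hit.qend - hit.qstart + 1
```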
The benchmark output is a list of pairwise alignments in the columnar format described above, one per line. We verify by checking this output against the expected outputs. Floating-point sensitivity has sometimes affected the order of these pairwise alignments, causing the lines to be printed in a different order than expected. The runcpu program mitigates this issue before the verification step by sorting the output alphabetically at the end of the run, using the sort.pl script found in 838.diamond_s/data/all/input.
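The idea behind sorting before verification can be sketched in Python as follows. This is not the sort.pl script itself, only an illustration of an order-insensitive comparison: both outputs are split into lines and sorted, so two runs that emit the same alignments in different orders compare equal. The function names and sample lines are hypothetical.

```python
def normalize(text):
    """Return the lines of an output in sorted order."""
    return sorted(text.splitlines())


def outputs_match(actual, expected):
    """Compare two outputs while ignoring the order of their lines."""
    return normalize(actual) == normalize(expected)


# Made-up sample outputs: same alignments, different line order.
run_a = "q1\ts2\t300\t4\t0\t1\t90\t3\t92\nq1\ts1\t350\t12\t1\t5\t120\t10\t125\n"
run_b = "q1\ts1\t350\t12\t1\t5\t120\t10\t125\nq1\ts2\t300\t4\t0\t1\t90\t3\t92\n"
same_content = outputs_match(run_a, run_b)
```

Sorting whole lines keeps duplicates intact, so an output with a repeated or missing alignment still fails the comparison.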
C++, C
The benchmark uses std::thread with a thread pool. The number of threads can be set by either the SPEC CPU 2026 config file or the runcpu command; the requested number is passed to the program using the -p option on its command line.
GNU/Linux systems implement C++ std::thread using POSIX Threads. Although some systems automatically include the needed support, this is not universal. Surprises have been seen when changing OS versions, or libraries, or compilers; or when FDO is added; or when combining C and C++ modules. Typically, it is safest to add -pthread to all compile and link lines for all SPEC CPU benchmarks that use std::thread. Please see the $SPEC/config directory for Example config files that demonstrate how to conveniently do so.
DIAMOND is available at github.com/bbuchfink/diamond. The version used in the SPEC CPU benchmark began with commit hash 21a32fc on March 11, 2023.
DIAMOND is distributed under the GPL-3, as seen in Diamond.license.txt. The /lib/blast and /lib/alp directories were dedicated to the Public Domain by the National Center for Biotechnology Information (NCBI). Other included library sources are licensed under compatible terms:
SPEC added a version of the Mersenne-Twister PRNG that is licensed by its authors (Makoto Matsumoto, Takuji Nishimura, and Mutsuo Saito) under a BSD license.
spec_random_distributions.h is sampled from the LLVM project, which is distributed under the Apache License v2.0 with LLVM Exceptions.
The genomic input databases are licensed freely and available for commercial use:
Copyright © 2026 Standard Performance Evaluation Corporation (SPEC®)