838.diamond_s
SPEC CPU®2026 Benchmark Description

Benchmark Name

838.diamond_s

Benchmark Program General Category

Metagenomics and protein sequencing

Benchmark Authors

Benjamin Buchfink <buchfink[at]gmail [dot] com>, github.com/bbuchfink

838.diamond_s was submitted to the SPEC CPU v8 Benchmark Search Program by Benjamin Buchfink.

Benchmark Description

DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. DIAMOND is considered a high-performance replacement for BLAST, the Basic Local Alignment Search Tool from the National Institutes of Health.

The key features are:

Pairwise alignment of proteins and translated DNA at 100x-10,000x the speed of BLAST.
Protein clustering of up to tens of billions of proteins.
Frameshift alignments for long read analysis.
Low resource requirements; suitable for running on standard desktops or laptops.
Various output formats, including BLAST pairwise, tabular and XML, as well as taxonomic classification.

Input Description

DIAMOND's primary inputs are FASTA files. FASTA is a scientific data format used to store nucleic acid sequences (such as DNA sequences) or protein sequences. A single file may contain multiple sequences, so it is sometimes referred to as the FASTA database format. FASTA files can be viewed and analyzed using any DNA analysis software. These files have a suffix of .fa or .fasta and may be gzipped, since DIAMOND can read gzipped files on the fly using the zlib library. FASTA files can also be compressed into .dmnd files, which are binary archives specific to the DIAMOND program.
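
As an illustration of the format, the following is a minimal Python sketch of a FASTA reader. It assumes the common FASTA convention that each record starts with a ">" header line followed by one or more sequence lines; the helper names (read_fasta, open_fasta) and the sample sequences are invented for this example and are not part of DIAMOND or the benchmark.

```python
import gzip
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def open_fasta(path: str) -> TextIO:
    """Open a plain or gzipped FASTA file as text, chosen by the .gz suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# Hypothetical two-record input, held in memory for the demonstration.
sample = io.StringIO(">seq1 example protein\nMKTAYIAK\nQRQISFVK\n>seq2\nMSDNE\n")
records = list(read_fasta(sample))
print(records)
```

For a file on disk, the same reader could be used as read_fasta(open_fasta("uniprot_sprot.fasta.gz")).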

The command lines specify a database (-d) and a query (-q). At a high level, the program searches for the query sequences in the database. Some switches control the sensitivity of the search (--ultra-sensitive, --sensitive, --mid-sensitive, --fast, etc.). These switches can be discovered by running "./diamond help".
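
A rough sketch of how such a command line might be assembled is shown below in Python. The -d, -q, and sensitivity switches and the --outfmt string come from this description; the blastp subcommand and the file names are assumptions made for illustration, and the actual SPEC CPU command lines may differ.

```python
from typing import List

# Output columns requested by the benchmark, as listed in this description.
OUTFMT_FIELDS = [
    "qseqid", "sseqid", "slen", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send",
]


def build_diamond_args(database: str, query: str,
                       sensitivity: str = "--sensitive") -> List[str]:
    """Return an argv list for a protein search (hypothetical helper)."""
    if not sensitivity.startswith("--"):
        raise ValueError("sensitivity must be a switch such as --fast")
    return [
        "./diamond", "blastp",   # subcommand is an assumption, not from this text
        "-d", database,          # database to search in
        "-q", query,             # query sequences to search for
        sensitivity,             # e.g. --fast, --sensitive, --ultra-sensitive
        "--outfmt", "6", *OUTFMT_FIELDS,
    ]


# File names below are placeholders, not the benchmark's real inputs.
args = build_diamond_args("example_db.dmnd", "example_query.fasta", "--fast")
print(" ".join(args))
```

Building the arguments as a list rather than a single string avoids shell quoting problems if the list is later passed to subprocess.run.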

More documentation and tutorials are available at github.com/bbuchfink/diamond/wiki.

The FASTA files used to construct the SPEC CPU benchmarks were downloaded from the sources below. A savvy user can download alternative protein sequences and databases to craft their own command lines.

The Swiss Institute of Bioinformatics
- UniProt downloads: www.uniprot.org/release-notes/downloads
- uniprot_sprot.fasta complete: uniprot_sprot.fasta.gz
Berkeley's Structural Classification of Proteins (SCOPe)
- ASTRAL version 2.08: astral-scopedom-seqres-gd-sel-gs-bib-95-2.08.fa
- Related data including older versions (e.g. 2.07): scop.berkeley.edu/downloads
The Global Proteome Machine, the home of proteomics crowd-sourced "Big Data"
- The common Repository of Adventitious Proteins, cRAP, is an attempt to create a list of proteins commonly found in proteomics experiments that are present either by accident or through unavoidable contamination of protein samples: thegpm.org/crap

Output Description

The input parameters also describe the output format. In SPEC CPU, we use the following string, which is decoded below.

--outfmt 6 qseqid sseqid slen mismatch gapopen qstart qend sstart send

Token	Description
qseqid	Query sequence ID
sseqid	Subject sequence ID
slen	Subject sequence length
mismatch	Number of mismatches
gapopen	Number of gap openings
qstart	Start of alignment in query
qend	End of alignment in query
sstart	Start of alignment in subject
send	End of alignment in subject
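
To show how the nine columns map onto a record, here is a small Python sketch that parses one output line. It assumes the columns are tab-separated, as in BLAST tabular output; the Hit type, the parse_hit helper, and the sample line are invented for this example.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment, with fields in the order given by the --outfmt string."""
    qseqid: str    # query sequence id
    sseqid: str    # subject sequence id
    slen: int      # subject sequence length
    mismatch: int  # number of mismatches
    gapopen: int   # number of gap openings
    qstart: int    # start of alignment in query
    qend: int      # end of alignment in query
    sstart: int    # start of alignment in subject
    send: int      # end of alignment in subject


def parse_hit(line: str) -> Hit:
    """Split one tab-separated line into a typed Hit record."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(Hit._fields):
        raise ValueError(f"expected {len(Hit._fields)} columns, got {len(cols)}")
    # The two ids stay as strings; the remaining seven columns are integers.
    return Hit(cols[0], cols[1], *(int(c) for c in cols[2:]))


# Made-up line for illustration only; not taken from the benchmark output.
hit = parse_hit("query1\tsubject9\t250\t3\t1\t5\t120\t10\t126\n")
print(hit.sseqid, hit.slen, hit.qend - hit.qstart + 1)
```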

The benchmark output is a list of pairwise alignments in the columnar format described above, one alignment per line. The output is verified by comparing it against the expected output. Floating-point sensitivity has occasionally affected the order of these alignments, causing lines to be printed in a different order than expected. The runcpu program mitigates this issue before the verification step by sorting the output alphabetically at the end of the run, using the sort.pl script found in 838.diamond_s/data/all/input.
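
The idea behind sorting before verification can be sketched in Python as follows. This is not the sort.pl script itself, only an illustration of why sorted output compares equal regardless of print order; the normalize helper and the sample lines are invented for this example.

```python
from typing import List


def normalize(output: str) -> List[str]:
    """Return the non-empty output lines in sorted order."""
    return sorted(line for line in output.splitlines() if line)


# The same two alignments, printed in different orders (made-up data).
produced = "q2\ts7\t90\t0\t0\t1\t90\t1\t90\nq1\ts3\t120\t2\t0\t4\t80\t6\t82\n"
expected = "q1\ts3\t120\t2\t0\t4\t80\t6\t82\nq2\ts7\t90\t0\t0\t1\t90\t1\t90\n"

raw_match = produced == expected                           # order-sensitive
sorted_match = normalize(produced) == normalize(expected)  # order-insensitive
print(raw_match, sorted_match)
```

A plain lexicographic sort is enough here because only the set of lines matters for the comparison, not their original order.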

Programming Language

C++, C

Threading Model

The benchmark uses std::thread with a thread pool. The number of threads can be set by either the SPEC CPU 2026 config file or the runcpu command; the requested number is passed to the program using the -p option on its command line.

Known Portability Issues

GNU/Linux systems implement C++ std::thread using POSIX Threads. Although some systems automatically include the needed support, this is not universal. Surprises have been seen when changing OS versions, or libraries, or compilers; or when FDO is added; or when combining C and C++ modules. Typically, it is safest to add -pthread to all compile and link lines for all SPEC CPU benchmarks that use std::thread. Please see the $SPEC/config directory for Example config files that demonstrate how to conveniently do so.

Sources and Licensing

DIAMOND is available at github.com/bbuchfink/diamond. The version used in the SPEC CPU benchmark began with commit hash 21a32fc on March 11, 2023.

DIAMOND is distributed under the GPL-3, as seen in Diamond.license.txt. The /lib/blast and /lib/alp directories were dedicated to the Public Domain by the National Center for Biotechnology Information. Other included library sources are licensed under compatible terms:

Eigen library license is Mozilla 2.0: Eigen.license.txt. The Eigen library includes some BSD-licensed components copyrighted by Intel Corporation, Benoit Jacob, and Gael Guennebaud.
IPS4o library license is BSD-2: Ips4o.license.txt, copyright Daniel Ferizovic, Michael Axtmann, and Sascha Witt.
Wfa2 library license is MIT: Wfa2.license.txt
MemoryPool library is MIT: MemoryPool.license.txt
BLAST library is public domain since it is a United States Government Work: Blast.license.txt
MIO library is MIT: Mio.license.txt
ALP library is public domain since it is a United States Government Work: Alp.license.txt
PrefixScan license is MIT: Pfscan.license.txt
Zlib is distributed under the Zlib license: Zlib.license.txt
GSL is distributed under GPLv3
Portable Snippets endianness library (psnip) is licensed under the Creative Commons Zero 1.0 Universal Public Domain Dedication: Psnip.license.txt. Portions of the library are based on code from Sean Eron Anderson's Bit Twiddling Hacks which are dedicated to the public domain: BitTwiddling.license.txt.

SPEC added a version of the Mersenne-Twister PRNG that is licensed by its authors (Makoto Matsumoto, Takuji Nishimura, and Mutsuo Saito) under a BSD license.

spec_random_distributions.h is sampled from the LLVM project, which is distributed under the Apache License v2.0 with LLVM Exceptions.

The genomic input databases are licensed freely and available for commercial use:

UniProt is available under the Creative Commons Attribution 4.0 International license: Uniprot.license.txt.
SCOPe is freely available for all usage under their own license text: SCOPe.license.txt.
TheGPM is freely available for all usage under their own license text: TheGPM.license.txt.

References

Github repository: github.com/bbuchfink/diamond
Scientific papers
- Buchfink B, Reuter K, Drost HG, Sensitive protein alignments at tree-of-life scale using DIAMOND, Nature Methods 18, 366-368 (2021). doi:10.1038/s41592-021-01101-x
- Buchfink B, Xie C, Huson DH, Fast and sensitive protein alignment using DIAMOND, Nature Methods 12, 59-60 (2015). doi:10.1038/nmeth.3176
- Fox NK, Brenner SE, Chandonia JM. SCOPe: Structural Classification of Proteins - extended, integrating SCOP and ASTRAL data and classification of new structures, Nucleic Acids Research 42:D304-309 (2014). doi:10.1093/nar/gkt1240
- Chandonia JM, Guan L, Lin S, Yu C, Fox NK, Brenner SE. SCOPe: Improvements to the Structural Classification of Proteins - extended Database to facilitate Variant Interpretation and Machine Learning, Nucleic Acids Research 50:D553-559 (2022). doi:10.1093/nar/gkab1054
- The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023, Nucleic Acids Research 51:D523-D531 (2023). doi:10.1093/nar/gkac1052
Blog: Consider Working on Genomics, https://claymcleod.dev/blog/2022-11-19-consider-working-on-genomics.html
