Modern x86 CPUs implement advanced instruction sets, such as SSE and AVX, which may greatly help performance. However, when distributing precompiled binaries (think about Debian, CentOS, Anaconda, etc.), we often prefer to fall back on older instruction sets for the sake of portability. Is there a way to dynamically choose CPU instruction sets at runtime such that we can achieve performance and portability at the same time?
Yes, the answer is CPU dispatch. For a program that supports CPU dispatch, we typically compile it on a recent CPU to generate a fat(ish) binary that contains multiple implementations of a function or a code block with different instruction sets. At run time, the program dynamically chooses among these implementations based on the features of the CPU it is running on.
I first heard of “CPU dispatch” from an Intel developer a few years ago. Unfortunately, googling “CPU dispatch” does not give me much relevant information immediately even today. This post aims to briefly explain the strategies to implement CPU dispatch in C/C++.
On x86, my preferred way to implement CPU dispatch is to detect the supported SIMD instruction sets with the CPUID instruction, which can be invoked with inline x86 assembly or with the __cpuid intrinsics specific to MS VC++. The following shows an example.
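#include <stdio.h>
#ifdef _MSC_VER
#include <intrin.h> /* declares __cpuidex() */
#endif

#define SIMD_SSE     0x1
#define SIMD_SSE2    0x2
#define SIMD_SSE3    0x4
#define SIMD_SSE4_1  0x8
#define SIMD_SSE4_2  0x10
#define SIMD_AVX     0x20
#define SIMD_AVX2    0x40
#define SIMD_AVX512F 0x80

/* Run CPUID for the given leaf and sub-leaf; r[] receives EAX, EBX, ECX and EDX. */
static void x86_cpuid(unsigned leaf, unsigned subleaf, unsigned r[4])
{
#ifdef _MSC_VER
    int t[4];
    __cpuidex(t, (int)leaf, (int)subleaf);
    r[0] = t[0], r[1] = t[1], r[2] = t[2], r[3] = t[3];
#else
    asm volatile("cpuid" : "=a" (r[0]), "=b" (r[1]), "=c" (r[2]), "=d" (r[3]) : "a" (leaf), "c" (subleaf));
#endif
}

unsigned x86_simd(void)
{
    unsigned r[4], max_leaf, flag = 0;
    x86_cpuid(0, 0, r); /* leaf 0: EAX holds the highest supported basic leaf */
    max_leaf = r[0];
    x86_cpuid(1, 0, r); /* leaf 1: SSE to AVX bits live in EDX (r[3]) and ECX (r[2]) */
    if (r[3]>>25&1) flag |= SIMD_SSE;
    if (r[3]>>26&1) flag |= SIMD_SSE2;
    if (r[2]>>0 &1) flag |= SIMD_SSE3;
    if (r[2]>>19&1) flag |= SIMD_SSE4_1;
    if (r[2]>>20&1) flag |= SIMD_SSE4_2;
    if (r[2]>>28&1) flag |= SIMD_AVX;
    if (max_leaf >= 7) { /* leaf 7, sub-leaf 0: AVX2 and AVX512F bits live in EBX (r[1]) */
        x86_cpuid(7, 0, r);
        if (r[1]>>5 &1) flag |= SIMD_AVX2;
        if (r[1]>>16&1) flag |= SIMD_AVX512F;
    }
    return flag;
}

int main()
{
    printf("%x\n", x86_simd());
    return 0;
}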
It is known to work with gcc-4.4, icc-15.0, clang-8.0 and msvc-14.0, so it is fairly portable.
The second way is to use a GCC built-in: __builtin_cpu_supports(). This function tests whether the CPU the program is running on supports a given instruction set. It is a relatively new function, only available in recent C compilers. I can confirm it works with gcc-4.9 on Linux and clang-8.1.0 on Mac. Clang-8.0.0 has this built-in but is buggy: it compiles but can’t link. The Intel C compiler (ICC) v15.0 has a similar problem. MS VC++ doesn’t support this function. The IBM compiler appears to have a similar built-in, though it only tests Power-related instruction sets. On x86, this second approach is simpler but less portable.
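To make the idea concrete, here is a minimal sketch of how __builtin_cpu_supports() could drive an actual dispatch through a function pointer. It is only an illustration under a few assumptions: it needs GCC or Clang, it relies on the target function attribute (a GCC/Clang extension) to build one function with AVX2 enabled, and the names sum_generic, sum_avx2, sum_choose and sum_i32 are made up for this example. On other compilers or architectures it simply falls back to the baseline version.

```c
#include <stddef.h>

/* Assumption: only GCC-compatible compilers on x86 get the AVX2 variant. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#endif

/* Baseline version, built with whatever the global compiler flags allow. */
static long sum_generic(const int *x, size_t n)
{
    long s = 0;
    size_t i;
    for (i = 0; i < n; ++i) s += x[i];
    return s;
}

#ifdef HAVE_X86_DISPATCH
/* Same loop; the target attribute lets the compiler use AVX2 in this function only. */
__attribute__((target("avx2")))
static long sum_avx2(const int *x, size_t n)
{
    long s = 0;
    size_t i;
    for (i = 0; i < n; ++i) s += x[i];
    return s;
}
#endif

typedef long (*sum_func_t)(const int*, size_t);

/* Pick an implementation based on what the running CPU reports. */
static sum_func_t sum_choose(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init(); /* only strictly needed before constructors have run */
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_generic;
}

/* Public entry point: resolves the implementation on the first call.
 * The lazy initialization is not thread-safe; call it once before spawning threads. */
long sum_i32(const int *x, size_t n)
{
    static sum_func_t f = 0;
    if (f == 0) f = sum_choose();
    return f(x, n);
}
```

Callers only ever see sum_i32(); which variant runs is decided once, on the machine where the binary executes. A hand-written intrinsics version would slot in the same way as sum_avx2.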
ICC has a similar built-in with an interesting name: _may_i_use_cpu_feature(). Alternatively, ICC can create multiple versions of a function with the compiler extension __declspec(cpu_dispatch()). Gcc-4.8+ has a similar feature, though for C++ only. I don’t like these methods because they are not portable at all.
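For a flavor of what compiler-driven multiversioning looks like, below is a small sketch using the target_clones attribute, which later GCC releases (6 and newer) also accept in plain C. Treat it as an illustration rather than a recommendation: it assumes GCC on x86 Linux with glibc, because the attribute relies on the ifunc mechanism, and the MULTIVERSION macro and the dot_i32 function are hypothetical names. Anywhere else the macro expands to nothing and a single baseline version is built, which shows the portability cost fairly well.

```c
#include <stdlib.h> /* size_t; on glibc systems this also defines __GLIBC__ */

/* Assumption: target_clones needs GCC 6+ and ifunc support (x86 Linux with glibc).
 * Clang is excluded here only to keep the guard conservative. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 6 && \
    defined(__linux__) && defined(__GLIBC__) && \
    (defined(__x86_64__) || defined(__i386__))
#define MULTIVERSION __attribute__((target_clones("avx2", "sse4.2", "default")))
#else
#define MULTIVERSION /* no multiversioning: one baseline build */
#endif

/* With the attribute active, the compiler emits one clone per listed target
 * plus a resolver that picks a clone when the symbol is bound at run time. */
MULTIVERSION
long dot_i32(const int *x, const int *y, size_t n)
{
    long s = 0;
    size_t i;
    for (i = 0; i < n; ++i) s += (long)x[i] * y[i];
    return s;
}
```

There is no detection code to write here, but the dispatch is tied to one toolchain and one platform.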
By the way, there were some interesting discussions on supporting CPU dispatch in the C++ standard. The thread covers several strategies mentioned here. It went down, though.