Power of assume in C++

One of my favorite features of C++ is the set of “micro-optimizations” that likely have minuscule performance benefits but still make it feel like you’ve shaved off a few milliseconds with tiny changes to code. Things like the [[likely]] and [[unlikely]] attributes, which let the compiler re-order or inline branches depending on how likely a branch is to be taken. Or #pragma omp parallel for (an OpenMP directive rather than core C++), which splits a loop’s iterations across threads.

One interesting attribute that I haven’t found many resources or benchmarks for is the new [[assume]] attribute, introduced in C++23, though previously available through compiler extensions such as Clang’s __builtin_assume. From cppreference:

Specifies that the given expression is assumed to always evaluate to true at a given point in order to allow compiler optimizations based on the information given.
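As a minimal sketch of what that looks like in practice, the function below (the name divide_by_32 is just for illustration) promises the optimizer that its argument is never negative. It assumes a compiler with C++23 [[assume]] support, for example GCC 13 or later with -std=c++23.

```cpp
// Illustrative example; requires C++23 support for [[assume]].
int divide_by_32(int x) {
  // Promise the optimizer that x is never negative.
  // If this is ever false at runtime, the behavior is undefined.
  [[assume(x >= 0)]];
  // With the assumption in place, the signed division can compile to a
  // single shift, without the extra instructions that make negative
  // values round toward zero.
  return x / 32;
}
```

Calling divide_by_32(64) returns 2 with or without the attribute; the assumption only changes which instructions the compiler is allowed to emit.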

There are also some interesting tidbits in Herb Sutter’s paper on assumptions, such as:

It’s never evaluated, but it’s UB if it is ever false
It should be “For careful use by experts only”
Used “case by case where the optimizer could/should be performing an optimization”
Assumptions should be used 1000x less than asserts

Regardless, I wanted to look at some notable compiler differences & optimizations from assumptions.

Vectorizations

Vectorization was the first thing I could think of: when a compiler vectorizes a loop, it also has to emit a fallback path that processes the leftover elements without vectorization.

Consider this easily vectorizable loop:

void add_arrays(
  float* a,
  float* b,
  float* result,
  int n
) {
  for (int i = 0; i < n; i++) {
    result[i] = a[i] + b[i];
  }
}

Thanks to the power of SIMD, the compiler is able to use addps and SIMD registers to vectorize this loop so that it adds 4/8/16 elements at once.

However, since a packed add handles 4, 8 or 16 floats at a time (depending on whether the target supports SSE, AVX or AVX-512), the compiler has to emit cleanup code for the leftover elements whenever n is not a multiple of the vector width. Adding [[assume(n % 16 == 0)]] removes that cleanup code for the remainders, along with the comparisons at the start of and during the loop that check for remaining items. I didn’t include the ASM diff here, as it cuts about 30 lines.
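For reference, a sketch of the loop with the assumption added. The function name add_arrays_assume is chosen here for illustration, and the attribute needs a C++23 compiler.

```cpp
// Same loop as add_arrays, plus a promise that n is a multiple of 16.
// Calling this with any other n is undefined behavior.
void add_arrays_assume(float* a, float* b, float* result, int n) {
  [[assume(n % 16 == 0)]];
  for (int i = 0; i < n; i++) {
    result[i] = a[i] + b[i];
  }
}
```

The result is identical to the baseline for any n that satisfies the assumption; only the generated remainder handling differs.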

Benchmarks were run on Sapphire Rapids 4x P-core @ 2.3GHz with Nanobench & GCC 16

Benchmark	Switch	ns/op
2048 x 10K Baseline	-O2	167,765
2048 x 10K Assume	-O2	142,118
2048 x 10K Baseline	-O3	139,998
2048 x 10K Assume	-O3	136,782

As seen, with -O2, adding the assumption shows an improvement (about 15%), but there is very little difference between the two -O3 programs (about 2%). As GCC vectorizes more aggressively with -O3, the improvement from the assumption mostly disappears.
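The numbers above came from Nanobench. For a rough reproduction without that dependency, a std::chrono timing loop is one option. This is a sketch of a swapped-in technique, not the harness used for the table: ns_per_call and bench_add_arrays are hypothetical helpers, and the array size of 2048 is only a guess at what the “2048 x 10K” label means.

```cpp
#include <chrono>

// Hypothetical helper (not part of Nanobench): average wall-clock
// nanoseconds per call of fn over the given number of iterations.
template <typename Fn>
double ns_per_call(Fn&& fn, int iterations) {
  using clock = std::chrono::steady_clock;
  const auto start = clock::now();
  for (int i = 0; i < iterations; i++) fn();
  const auto stop = clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / iterations;
}

void add_arrays(float* a, float* b, float* result, int n) {
  for (int i = 0; i < n; i++) result[i] = a[i] + b[i];
}

double bench_add_arrays(int iterations) {
  constexpr int kN = 2048;  // assumed size, see the note above
  static float a[kN], b[kN], result[kN];
  for (int i = 0; i < kN; i++) {
    a[i] = static_cast<float>(i);
    b[i] = 1.0f;
  }
  // Crude guard that discourages the compiler from discarding the work;
  // a real benchmark library handles this more robustly.
  volatile float sink = 0.0f;
  return ns_per_call([&] {
    add_arrays(a, b, result, kN);
    sink = result[kN - 1];
  }, iterations);
}
```

Swapping add_arrays for the variant with the assumption and compiling both at -O2 and -O3 should give a comparable before/after pair, though absolute numbers will differ from the table.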

There are also some nano-optimizations you can make, such as assuming that n > 0 or that the pointers are aligned, but because they don’t affect the hot loop, they likely have minuscule effects.

One interesting note is that I couldn’t get GCC to generate 512-bit wide instructions (using ZMM registers), likely because AVX-512 instructions can be slower than 256-bit wide instructions, though I’m not 100% sure.

Conditionals

One of the frequent uses of assume that I found in the few videos about assumptions is conditionals. Say that you have a function like so:

int get_value(int* ptr) {
  if (ptr == nullptr) {
    throw "Exception!!";
  }

  return *ptr;
}

Now, if we know in a caller that ptr != nullptr, the compiler is able to optimize away the conditional check if you put [[assume(ptr != nullptr)]] before the call to get_value(), provided it inlines get_value into the caller. Adding the assumption eliminates:

test    rax, rax
je      .L13        ; Exception handling section
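A sketch of the caller side of that setup. The caller name read_known_non_null is made up for illustration, and whether the check actually disappears depends on the compiler deciding to inline get_value.

```cpp
// Callee with a null check, as above; inline so the compiler may fold
// it into callers.
inline int get_value(int* ptr) {
  if (ptr == nullptr) {
    throw "Exception!!";
  }
  return *ptr;
}

// Hypothetical caller that knows its pointer is valid.
int read_known_non_null(int* ptr) {
  // Undefined behavior if ptr is ever null here.
  [[assume(ptr != nullptr)]];
  // Once get_value is inlined, the null test and the jump to the
  // exception path can be dropped.
  return get_value(ptr);
}
```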

However, if we force inlining off using noinline and have another caller that does not assume ptr != nullptr, the function body for get_value remains the same, null check included. So if a function can be inlined, assumptions can help. But if it can’t, and there are callers that can’t make the same assumption (which is likely, because why else would you have the conditional check), the advantage disappears.
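The non-inlined case can be sketched like this. The [[gnu::noinline]] spelling is specific to GCC and Clang, and all three function names are illustrative.

```cpp
// Forced out of line, so there is a single shared function body.
[[gnu::noinline]] int get_value_noinline(int* ptr) {
  if (ptr == nullptr) {
    throw "Exception!!";
  }
  return *ptr;
}

// This caller promises a non-null pointer, but the callee's body is
// shared, so the null check inside it still runs.
int caller_with_assume(int* ptr) {
  [[assume(ptr != nullptr)]];
  return get_value_noinline(ptr);
}

// This caller makes no promise and relies on the check in the callee.
int caller_without_assume(int* ptr) {
  return get_value_noinline(ptr);
}
```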

So in the case of conditionals, it’s really only inlined functions that see the advantage.

Alignment

Consider the following snippet:

void scale_unaligned(float* data, int n, float factor) {
  for (int i = 0; i < n; i++)
    data[i] *= factor;
}

What if we could guarantee that data is aligned? We can use float* data_ptr = std::assume_aligned<32>(data) (from <memory>, available since C++20) before the for loop and index through data_ptr, which generates:

movaps  xmm0, XMMWORD PTR [rax]
add     rax, 16
mulps   xmm0, xmm1
movaps  XMMWORD PTR [rax-16], xmm0

rather than movups, the unaligned move. On older architectures (e.g. AMD Bulldozer, Piledriver), unaligned moves are slower than aligned moves, though on anything from the last decade they perform the same. So perhaps in the very narrow case where you care about an obsolete architecture’s performance, pointer alignment assumptions can be useful.

Closing Notes

Overall, it was a struggle finding any significant optimizations from assumptions. Compilers are already smart enough to make pretty solid assumptions themselves, so apart from very specific scenarios, adding my own didn’t affect the assembly output at all.

I was hoping that for non-inlined functions, if I marked the function as pure or the compiler could guarantee no side effects, a caller with certain assumptions could jump to the region after the conditional checks in the function body, while callers without such assumptions jump to the start of the function body. Or that the compiler would generate multiple versions of the function depending on which conditional checks can be skipped.
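Until a compiler does this automatically, the same effect can be approximated by hand by splitting the function into two entry points. This is only a sketch of that manual workaround, with hypothetical names and the GCC/Clang-specific [[gnu::noinline]] attribute, not something a compiler generates from assumptions.

```cpp
// Entry point for callers that already know ptr is valid.
// Precondition (assumed): ptr != nullptr.
[[gnu::noinline]] int get_value_unchecked(int* ptr) {
  return *ptr;
}

// Entry point for everyone else: performs the check, then forwards
// to the unchecked body.
[[gnu::noinline]] int get_value_checked(int* ptr) {
  if (ptr == nullptr) {
    throw "Exception!!";
  }
  return get_value_unchecked(ptr);
}
```

Callers that can guarantee a valid pointer call get_value_unchecked directly and skip the test; the cost is that the guarantee now lives in the caller’s choice of function rather than in an attribute.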

Another interesting suggestion that I’ve heard is, with the introduction of contracts, having an assume setting that automatically inserts assumptions - it is not yet part of the spec, but it would be a free and safe optimization.
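A rough way to emulate that idea today is a macro that checks in debug builds and assumes in release builds. EXPECTS is a hypothetical macro name used here for illustration, not a standard or contracts facility.

```cpp
#include <cassert>

// Hypothetical macro: a checked precondition in debug builds, an
// optimizer assumption when NDEBUG is defined.
#ifdef NDEBUG
#define EXPECTS(cond) [[assume(cond)]]
#else
#define EXPECTS(cond) assert(cond)
#endif

int get_value_contract(int* ptr) {
  EXPECTS(ptr != nullptr);
  return *ptr;
}
```

In the release configuration a violated precondition is no longer diagnosed and becomes undefined behavior, so this is only as safe as the debug-build testing that backs it.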

But as assumptions are right now, I can’t seem to find any significant uses.