I was talking with someone today who really, really wanted sqrtps to be used in some code they were writing. And because of a quirk in clang (still there as of clang 18.1.0), if you happened to use -ffast-math, clang would butcher the use of the intrinsic. So for the code:

__m128 test(const __m128 vec)
{
    return _mm_sqrt_ps(vec);
}

Clang would compile it correctly without fast-math:

test:                                   # @test
        sqrtps  xmm0, xmm0
        ret

And create this monstrosity with -ffast-math:

.LCPI0_0:
        .long   0xbf000000                      # float -0.5
        .long   0xbf000000                      # float -0.5
        .long   0xbf000000                      # float -0.5
        .long   0xbf000000                      # float -0.5
.LCPI0_1:
        .long   0xc0400000                      # float -3
        .long   0xc0400000                      # float -3
        .long   0xc0400000                      # float -3
        .long   0xc0400000                      # float -3
test:
        rsqrtps xmm1, xmm0
        movaps  xmm2, xmm0
        mulps   xmm2, xmm1
        movaps  xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1]
        mulps   xmm3, xmm2
        mulps   xmm2, xmm1
        addps   xmm2, xmmword ptr [rip + .LCPI0_1]
        mulps   xmm2, xmm3
        xorps   xmm1, xmm1
        cmpneqps        xmm0, xmm1
        andps   xmm0, xmm2
        ret

The optimization flow here in LLVM is:

  • Under fast-math, LLVM treats sqrt(x) as equal to x * rsqrt(x), so it uses rsqrtps instead.
  • But rsqrtps is only an approximation (the architecture allows a relative error of up to 1.5 * 2^-12), and Intel and AMD implement it differently, so results can differ between the two.
  • So whenever LLVM emits rsqrtps like this, it also emits a Newton-Raphson refinement step to correct the precision across CPU implementations - that is the extra arithmetic in the listing above, spelled out in the sketch after this list.
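
To take the mystery out of that listing, here is roughly the computation it encodes, written out with plain intrinsics. This is just a sketch of the math (the function name sqrt_via_rsqrt is made up for illustration), not what LLVM literally generates:

#include <xmmintrin.h>

// Approximate sqrt(x) as x * rsqrt(x), then apply one Newton-Raphson step.
// The final compare-and-mask forces sqrt(0) == 0, because rsqrt(0) is +inf
// and 0 * +inf would otherwise produce a NaN.
__m128 sqrt_via_rsqrt(__m128 x)
{
    const __m128 r  = _mm_rsqrt_ps(x);                  // rsqrtps
    const __m128 xr = _mm_mul_ps(x, r);                 // x * r
    // (-0.5 * x * r) * (x * r * r - 3) == x * r * 0.5 * (3 - x * r * r)
    const __m128 refined = _mm_mul_ps(
        _mm_mul_ps(_mm_set1_ps(-0.5f), xr),
        _mm_add_ps(_mm_mul_ps(xr, r), _mm_set1_ps(-3.0f)));
    // Zero the lanes where x == 0, matching the cmpneqps/andps tail.
    return _mm_and_ps(_mm_cmpneq_ps(x, _mm_setzero_ps()), refined);
}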

The ‘fix’ here is just to use inline assembly to guarantee you’ll always get the instruction selection you want:

__m128 test(__m128 vec)
{
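    // The "=x" output and "x" input constraints pin vec to SSE registers, and
    // because the instruction is spelled out in inline assembly the optimizer
    // cannot rewrite the sqrtps away.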
    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}

But there is one additional thing I’d advocate you do if you need to use inline assembly - write your own constant folding.

See, the one downside to the inline assembly above is that if test is inlined and vec is a constant, the compiler won’t constant fold it. For example:

__attribute__((always_inline)) __m128 test(__m128 vec)
{
    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}

__m128 call_test()
{
    return test(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
}

Will produce:

test:
        sqrtps  xmm0, xmm0
        ret
.LCPI1_0:
        .long   0x3f800000                      # float 1
        .long   0x40000000                      # float 2
        .long   0x40400000                      # float 3
        .long   0x40800000                      # float 4
call_test:
        movaps  xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = [1.0E+0,2.0E+0,3.0E+0,4.0E+0]
        sqrtps  xmm0, xmm0
        ret

So even under inlining, when we could have constant folded the call away entirely, we are still executing sqrtps when we don’t have to. So what is the fix?

LLVM has an is_constant intrinsic, which you can get at via the GCC extension __builtin_constant_p that Clang also supports. If we extend our test above to check whether vec is constant, we can call _mm_sqrt_ps when it is, and benefit from the constant folder doing its thing and removing the call entirely. So our code becomes:

__attribute__((always_inline)) __m128 test(__m128 vec)
{
    if (__builtin_constant_p(vec))
    {
        return _mm_sqrt_ps(vec);
    }

    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}

__m128 call_test()
{
    return test(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
}

And we get:

call_test:
        movaps  xmm0, xmmword ptr [rip + .LCPI11_0] # xmm0 = [1.0E+0,2.0E+0,3.0E+0,4.0E+0]
        sqrtps  xmm0, xmm0
        ret

What the heck?! It hasn’t constant folded! Turns out GCC is a bit picky with this builtin, and it looks like Clang has inherited that funky behaviour: you cannot use it with a vector type, even though LLVM happily has support for it in the IR. But there is a workaround, an ugly one:

__attribute__((always_inline)) __m128 test(__m128 vec)
{
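    // Clang lets you index a __m128 like an array, so we can ask
    // __builtin_constant_p about each lane individually.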
    if (__builtin_constant_p(vec[0]) &&
      __builtin_constant_p(vec[1]) &&
      __builtin_constant_p(vec[2]) &&
      __builtin_constant_p(vec[3]))
    {
        return _mm_sqrt_ps(vec);
    }

    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}

__m128 call_test()
{
    return test(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
}

Will produce:

.LCPI15_0:
        .long   0x3f800000                      # float 1
        .long   0x3fb504f3                      # float 1.41421354
        .long   0x3fddb3d7                      # float 1.73205078
        .long   0x40000000                      # float 2
call_test:
        movaps  xmm0, xmmword ptr [rip + .LCPI15_0] # xmm0 = [1.0E+0,1.41421354E+0,1.73205078E+0,2.0E+0]
        ret

Nice! We’ve got the constant folding we want. And, just as nicely, if we mark test as noinline instead, the code for test becomes:

test:
        sqrtps  xmm0, xmm0
        ret

Meaning the branch is folded away. In both cases we now get the behaviour we want: we’ve written our own constant folder. Nice! You can see the full example on godbolt.
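
For reference, the noinline variant mentioned above is just the same function with the attribute swapped - roughly:

__attribute__((noinline)) __m128 test(__m128 vec)
{
    // Since test is never inlined, vec is never known to be constant here,
    // so __builtin_constant_p(...) folds to false, the branch disappears,
    // and only the inline assembly remains.
    if (__builtin_constant_p(vec[0]) &&
      __builtin_constant_p(vec[1]) &&
      __builtin_constant_p(vec[2]) &&
      __builtin_constant_p(vec[3]))
    {
        return _mm_sqrt_ps(vec);
    }

    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}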

It’d be nice if we could just use the vector in __builtin_constant_p, but I think the LLVM folks have purposefully tried to match what GCC would do. I’d personally advocate for a loosening of the builtin, and I might file a GitHub issue about just that.