I was looking at hardware rounding modes the other day, and I realised I had no concept of how costly it is to change the hardware rounding mode for floating-point operations.

For those that don’t know - modern hardware often includes a way to set the rounding mode that floating-point operations use when the exact result of an operation cannot be encoded in the format. There are four alternatives that I am aware of:

  • Round to nearest - behaves like the math you’d do by hand: rounds up or down to the nearest number that can fit in the representation. Often the default rounding mode on CPUs.
  • Round to zero - results are truncated, always rounding towards 0. Often the default rounding mode on GPUs.
  • Round to positive infinity - always round up.
  • Round to negative infinity - always round down.

On GPUs - in Vulkan, for instance - you can use VK_KHR_shader_float_controls to specify the floating-point rounding mode for an entire shader execution.
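
As a rough illustration - a sketch, assuming a valid VkPhysicalDevice and Vulkan 1.2 headers (where the KHR struct became core) - here’s how you’d query which rounding modes a device lets shaders request; the mode itself is then requested via SPIR-V execution modes:

#include <vulkan/vulkan.h>
#include <stdio.h>

// Query which shader rounding modes the device supports.
// RTE = round-to-nearest-even, RTZ = round-towards-zero.
void print_float_controls(VkPhysicalDevice physical_device) {
  VkPhysicalDeviceFloatControlsProperties float_controls = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FLOAT_CONTROLS_PROPERTIES,
  };
  VkPhysicalDeviceProperties2 properties = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
    .pNext = &float_controls,
  };
  vkGetPhysicalDeviceProperties2(physical_device, &properties);

  printf("float32 RTE supported: %u\n",
         float_controls.shaderRoundingModeRTEFloat32);
  printf("float32 RTZ supported: %u\n",
         float_controls.shaderRoundingModeRTZFloat32);
}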

But on CPUs there is often a hardware register that controls the floating-point rounding - on x86 it is the MXCSR register and on 64-bit Arm it is the FPCR register.

C lets you get and set the current rounding mode with the fenv.h functions fegetround and fesetround. So how costly is it to use these operations?
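
As a quick demonstration of the API - a minimal sketch that prints the same float division under each of the four modes (the volatile reads defeat constant folding so the division actually happens at runtime):

#include <fenv.h>
#include <stdio.h>

// Tell the compiler we access the floating-point environment
// (some compilers ignore this pragma and need a flag like
// GCC’s -frounding-math instead).
#pragma STDC FENV_ACCESS ON

int main(void) {
  volatile float a = 1.0f, b = 3.0f;

  fesetround(FE_TONEAREST);
  printf("to nearest: %.10f\n", a / b);

  fesetround(FE_UPWARD);
  printf("upward:     %.10f\n", a / b);

  fesetround(FE_DOWNWARD);
  printf("downward:   %.10f\n", a / b);

  fesetround(FE_TOWARDZERO);
  printf("to zero:    %.10f\n", a / b);

  return 0;
}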

I did a few tests on an Arm-based MacBook Air M1 and on an x86-based Threadripper Pro.

The tests were as follows:

  • Set the rounding mode to round-to-nearest.
  • Run 1024 floating-point adds in a benchmark.
  • Then run the same benchmark, but with each call changing the rounding mode to round-to-zero before the add and resetting it to the default afterwards (sketched in code after this list).
  • Then run variants doing 4, 8, and 16 adds per mode change, amortizing the cost of changing the rounding mode.
  • Then test whether checking the rounding mode before changing it saves any cycles - get the rounding mode, and only set it if it doesn’t already match - benchmarked both when it matches and when it doesn’t.
  • And lastly, run the whole thing above with calls to tanf instead of adds.
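
To make the flip-per-add and amortized variants concrete, here’s roughly the shape of those loops - a sketch rather than the actual harness, with data and count standing in for the benchmark’s inputs:

#include <fenv.h>

// Worst case: flip the rounding mode around every single add.
float sum_flip_every_add(const float *data, int count) {
  float sum = 0.0f;
  for (int i = 0; i < count; i++) {
    fesetround(FE_TOWARDZERO);
    sum += data[i];
    fesetround(FE_TONEAREST);  // reset to the default
  }
  return sum;
}

// Amortized: one mode change per batch of adds (batch = 4, 8, or 16).
float sum_amortized(const float *data, int count, int batch) {
  float sum = 0.0f;
  for (int i = 0; i < count; i += batch) {
    fesetround(FE_TOWARDZERO);
    for (int j = i; j < i + batch && j < count; j++) {
      sum += data[j];
    }
    fesetround(FE_TONEAREST);
  }
  return sum;
}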

So what did we find?

On Arm it is 30x slower when you set the rounding mode. On x86 it is 69x slower. I’d guess that writing these registers causes a full pipeline stall in the hardware (the CPU won’t speculate past the write), but I don’t know for sure.

On Arm it is only 1.9x slower if the rounding mode you want is actually already being used on the CPU (and so you don’t have to reset it). On x86 it is 8.7x slower. Here’s the code to do this check:

#include <fenv.h>

int old_mode = fegetround();
if (old_mode != FE_TOWARDZERO) {
  // The mode differs - switch, do the work, then restore it.
  fesetround(FE_TOWARDZERO);
  // Do your operation here!
  fesetround(old_mode);
} else {
  // Already round-to-zero - no need to touch the register.
  // Do your operation here!
}

I’m not sure if the performance difference between Arm and x86 comes from the cost of querying the register, or just from the different hardware costs of doing a branch like this.

The tanf data didn’t show anything hugely interesting - it matched the performance characteristics of the floating-point add.

So perhaps unsurprisingly - it’s slow! Probably about as slow as I would have expected if you were flipping the rounding mode regularly. This is obviously not how these APIs are meant to be used, but you could imagine a situation where you call some foreign code and want to ensure it hasn’t messed with the control registers - and this gives you some idea of the cost of guarding against that.
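
For that foreign-code case, the check-before-set trick above suggests a guard along these lines - a sketch, with foreign_call standing in for whatever external code you don’t trust:

#include <fenv.h>

// Call into foreign code, then restore the rounding mode only if it
// actually changed - fegetround is the cheap path, fesetround the
// expensive one.
void guarded_call(void (*foreign_call)(void)) {
  const int expected = fegetround();

  foreign_call();

  if (fegetround() != expected) {
    fesetround(expected);
  }
}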