Can Function Inlining Affect Floating Point Outputs? Exploring FMA and Other Consistency Issues
June 2023
At my job, I’m refactoring a 30k LOC codebase that simulates learning in the mammalian brain. The emergent behavior of these large brain models is hard to test for, so we opted to take the safe route and preserve bit-equality in the weights of the trained model to guarantee that we are not breaking anything. (The main downside of this hash-based regression testing during refactoring is that it doesn’t check numerical stability.) The most common change we applied was inlining many small functions to increase readability.
This raises the question: Can function inlining affect the output of a numerical program? I’m interested in both the programmer inlining a function manually in their editor and the compiler doing it automatically during an optimization pass.
It turns out that yes, there are multiple ways in which inlining can change results, and specifics depend on the interplay of your language spec, compiler and hardware. One of the big reasons why compilers do inlining is that it increases the scope for optimizations. Most optimization passes run on individual functions, so inlining gives the compiler more code to work with, and more opportunities to apply potentially result-changing optimizations.
Here’s a concrete example that, compiled with gcc -O3 -march=haswell, produces different results depending on whether the function is inlined or not (godbolt):
#include <cstdio>

__attribute__((noinline))
float mulNoInline(float x, float y) {
    return x * y;
}

float mulInline(float x, float y) {
    return x * y;
}

float global = 1.2485889846f;

int main() {
    // After inlining, the compiler can fuse the multiply and the add into an FMA.
    float inlineRes = mulInline(global, global) + 1.0f;
    // The call boundary blocks fusion: multiply, round, then add.
    float noInlineRes = mulNoInline(global, global) + 1.0f;
    if (inlineRes != noInlineRes) {
        printf("Results are not equal!\n");
    }
}
The results differ in the last two bits: inlineRes is 0x4023C63D while noInlineRes is 0x4023C63E.
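If you want to inspect those bit patterns yourself, here is a small helper I’d use (printBits is my own addition, not part of the original example):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Print a float's raw bit pattern in hex by bit-casting it to a 32-bit integer.
void printBits(const char* name, float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // memcpy avoids strict-aliasing issues
    std::printf("%s = 0x%08X\n", name, (unsigned)bits);
}

Calling printBits("inlineRes", inlineRes) and printBits("noInlineRes", noInlineRes) inside main prints the two values above.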
In this post I’ll focus on what’s probably the most common bit-changing optimization that can be exposed after inlining: multiply-add fusion. (Even more bets are off with -ffast-math, since it exposes a lot more optimization potential.) After being mindful of FMA fusion, we managed to perform an extensive refactoring of our codebase without changing the results of our program by a single bit.
The Fused Multiply Add Instruction (FMA)
FMA performs this operation: (a * b) + c, where all variables are floating point numbers. On x86, the most common instruction is called VFMADD, which was added with Intel Haswell. (This means not every x86-64 CPU has support for FMA, and you’ll need to pass at least -march=haswell or -mfma to get the compiler to emit it.)
There are two lenses through which I look at the FMA instruction:
Performance and precision.
FMA Performance: Latency & Throughput
FMA tends to be faster than doing a MUL followed by an ADD. I wrote microbenchmarks and looked at instruction latencies, and FMA is around ~30% faster than a multiply followed by an add. As a rule of thumb, one FMA takes about 4 cycles on a recent x86 CPU, which is as fast as a single FP multiply or a single FP add. I assume that pipelining, macro fusion (macro-op fusion is when an arithmetic instruction and a branch instruction are fused into a single μop by the CPU; from what I can tell, this optimization cannot be applied to FMA) and other processor wizardry will affect these results in practice.

One caveat on older Intel x86 architectures like Haswell: transforming (a * b) + (c * d) into fma(a, b, c * d) is actually bad for performance. With ILP we can run the two multiplies in parallel, so the latency of the original is roughly latency(mul) + latency(add), while the transformed version takes latency(mul) + latency(fma). On Haswell the transformed version has worse performance since the latencies are: add: 3 cycles, mul: 5 cycles, fma: 5 cycles.

Despite being more complex than add or mul instructions, FMA is also a single μop according to Agner Fog’s tables. Clang’s performance model for x86 and AArch64 assumes FMA is faster than MUL and ADD for all microarchitectures (for fp32 and fp64 data types).
Besides latency, we also care about throughput: on Ice Lake, throughput is 2 ops/cycle for FMA, equal to ADD and MUL. (Interestingly, when FMA first came out with Haswell, latency and throughput were 5 cycles and 2 ops/cycle for FMA, 3 cycles and 1 op/cycle for FADD, and 5 cycles and 2 ops/cycle for FMUL, so FMA had higher throughput than a simple add.) You can test how much FMA fusion affects your program’s speed by compiling with -ffp-contract=off and removing any usage of std::fma.
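If you want to measure this yourself, here is a minimal sketch of the kind of dependent-chain microbenchmark I have in mind (illustrative, not the exact benchmark I ran): built with contraction enabled and -march=haswell, the loop body becomes a single FMA; built with -ffp-contract=off, it becomes a multiply followed by an add.

#include <chrono>
#include <cstdio>

int main() {
    // volatile inputs keep the compiler from folding the loop away entirely.
    volatile float a = 1.0000001f, b = 0.9999999f;
    float acc = 1.0f;

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000; ++i) {
        // Dependent chain: each iteration needs the previous acc, so this
        // measures latency of (mul + add) vs. a single fused FMA.
        acc = acc * a + b;
    }
    auto end = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = end - start;
    // Print acc so the result is observed and the loop is not removed.
    printf("acc = %f, time = %.3f s\n", acc, elapsed.count());
}

Comparing a run compiled with -O3 -march=haswell against one that adds -ffp-contract=off gives a rough idea of what fusion buys.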
FMA Precision: Infinitely Precise Intermediate Results
Floating point math requires rounding since we can only represent a finite set of numbers. Besides increased performance, the second reason to use FMA is higher precision due to only rounding once instead of twice.
- Without FMA:
RoundToFloat32(RoundToFloat32(a * b) + c)
- With FMA:
RoundToFloat32((a * b) + c)
For modern processors, the bit result of FMA is specified by the IEEE 754 standard (since the 2008 revision) for every possible input. This means that every processor’s FMA instruction (including CPU vector units and Nvidia GPUs) will produce exactly the same output given the same inputs. (For IEEE datatypes, that is; no guarantees if you’re using bfloat16 or the 19-bit TensorFloat32 format.) IEEE requires that operations be “exactly rounded”: the result is as if it had first been computed to infinite precision, then rounded. For FMA, this means the result is as if the multiply had been computed to infinite precision and we only rounded to 32 / 64 bits once, after the add.
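To see the single rounding in action, here is a small sketch of my own, reusing the constant from the example above; compile it with -ffp-contract=off so the split multiply and add don’t get silently re-fused:

#include <cmath>
#include <cstdio>

int main() {
    float a = 1.2485889846f;
    float c = 1.0f;

    float p = a * a;                         // product rounded to float (rounding #1)
    float twoRoundings = p + c;              // sum rounded to float (rounding #2)
    float oneRounding  = std::fmaf(a, a, c); // exact product, rounded only once after the add

    printf("two roundings: %.9g\n", twoRoundings);
    printf("one rounding:  %.9g\n", oneRounding);
}

The two printed values should correspond to the noInlineRes and inlineRes bit patterns from the first example.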
Floating Point Consistency and FMA
So it sounds like FMA is faster and more accurate; what’s the issue? Consistency. Depending on whether float res = a * b + c; turns into an fma(a,b,c) or an add(mul(a,b),c) sequence of instructions, the results will differ for some inputs.
IEEE does not help us much here. This is because IEEE is mainly a specification of hardware, not of programming languages. While the IEEE standard specifies the result of VFMADD231SS (the x86 instruction) down to the bit, it does not specify what the result of writing float res = a * b + c; in C++ should be. This is for multiple reasons: first, the C++ standard does not require IEEE-compliant data formats; second, IEEE does not specify the intermediate precision of operations. The C++ draft has this note (also see SO answer): “The values of the floating-point operands and the results of floating-point expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.” That is not very specific, but it seems to permit fusion by default. Some options allowed by the C++ spec are (the third one is spelled out in code after the list):
- Fuse into a single FMA instruction (1 rounding)
- Do not fuse, execute with 32-bit intermediate precision (2 roundings)
- Do not fuse, execute with 64-bit intermediate precision then round down to 32-bit result (3 roundings)
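The first two options correspond to std::fmaf(a, b, c) and to a plain split multiply-then-add (as in the sketch above). The third option can also be written out explicitly; a small illustration of my own of “compute with 64-bit intermediate precision, then round down”:

float wideIntermediate(float a, float b, float c) {
    double p = (double)a * (double)b;  // product rounded to 64 bits (exact here: a float product fits in a double)
    double s = p + (double)c;          // sum rounded to 64 bits
    return (float)s;                   // final rounding down to 32 bits
}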
As we can see, floating point consistency is not just about the language; it’s a play with many actors. The main participants are:
- The language specification: Does it allow fusing a * b + c without explicitly coding for an FMA? Both C++ and Go do. (From the Go spec: “An implementation may combine multiple floating-point operations into a single fused operation, possibly across statements, and produce a result that differs from the value obtained by executing and rounding the instructions individually.”) Both languages also have std::fma and math.FMA for generating FMA instructions explicitly. Notice that FMA fusion doesn’t require any fastmath in C++.
- The compiler: Does it fuse a * b + c into a single FMA instruction? Clang does so by default (-ffp-contract=on) as long as the mul and add are part of the same statement. GCC even fuses across statements by default (-ffp-contract=fast), as does the gc Go compiler. If cross-statement fusion is allowed, then inlining may expose more opportunities for FMA fusion that were not previously visible to the compiler.
- The hardware: Does it have an FMA instruction? Pre-Haswell x86-64 CPUs do not. What happens if you use std::fma on non-FMA hardware? (A runtime check for FMA support is sketched right after this list.)
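As a side note on that last question, one way to check at runtime whether the CPU you’re running on actually has FMA is the __builtin_cpu_supports builtin available in GCC and Clang on x86 (a minimal sketch):

#include <cstdio>

int main() {
    // __builtin_cpu_supports queries CPUID at runtime (GCC/Clang, x86 only).
    if (__builtin_cpu_supports("fma")) {
        printf("This CPU has FMA support.\n");
    } else {
        printf("No FMA here; std::fma has to be emulated in software.\n");
    }
}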
Let’s look a bit closer at the second actor, the compiler, taking Clang as our main example.
FMA Fusion in the Clang Frontend and Backend
As I mentioned, FMA fusion is explicitly allowed by the C++ spec. Clang has the -ffp-contract flag to control what gets fused. (The flag’s description says “Specify when the compiler is permitted to form fused floating-point operations, such as fused multiply-add (FMA)”. x86-64 has more fused operators besides FMA, but I’ve never seen them used explicitly: VFMSUB (a * b - c), VFNMADD (-a * b + c) and VFNMSUB (-a * b - c).) Possible settings are:
- on (the default): fuse, but not across statements. So float tmp = a * b; float res = tmp + c; will not get fused, but float res = a * b + c; will. This happens in the compiler frontend during an AST rewrite.
- fast: also fuse across statements. This is one of the flags enabled by -ffast-math. It is implemented in the architecture-specific compiler backend.
- off: do not fuse.
- fast-honor-pragmas: I have never seen this used. It refers to the FP_CONTRACT pragma, which comes from the C standard (a usage sketch follows below).
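For completeness, here is roughly what using that pragma looks like (a sketch of my own; to my knowledge Clang honors the standard #pragma STDC FP_CONTRACT spelling, while GCC largely ignores it and relies on -ffp-contract instead):

#include <cstdio>

float contractionAllowed(float a, float b, float c) {
    return a * b + c;  // the compiler may contract this into a single FMA
}

float contractionDisabled(float a, float b, float c) {
#pragma STDC FP_CONTRACT OFF
    return a * b + c;  // contraction is disabled for this block
}

int main() {
    float a = 1.2485889846f;
    printf("%.9g %.9g\n", contractionAllowed(a, a, 1.0f), contractionDisabled(a, a, 1.0f));
}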
By default, Clang is more defensive than GCC, which has
-ffp-contract=fast
as the default
setting.
Here’s a Godbolt example
where GCC by default changes the result after inlining, while Clang does
not.
Let’s look at how FMA fusion is implemented in the Clang source code. There are two places where fusion can happen: in the frontend and the backend.
FMA Fusion in the Clang Frontend
The relevant code is in
CodeGen.
The frontend performs optimizations that are independent of the
architecture being compiled for. If -ffp-contract=on
and the operations
are fusible, it will emit an llvm.fmuladd
instruction. Interestingly, if
-ffp-contract=fast
, the frontend doesn’t do any fusion but delegates
that task to the architecture-dependent backends.
Notably, there are two intrinsics in LLVM IR for FMA: llvm.fma and llvm.fmuladd.
- llvm.fma: perform the fused multiply-add operation. […] Return the same value as a corresponding libm ‘fma’ function but without trapping or setting errno.
- llvm.fmuladd: The LLVM langref has this to say: represent[s] multiply-add expressions that can be fused if the code generator determines that (a) the target instruction set has support for a fused operation, and (b) that the fused operation is more efficient than the equivalent, separate pair of mul and add instructions.
The main difference here is that llvm.fmuladd may or may not get fused, at the code generator’s discretion, while llvm.fma always returns the same value as the corresponding C math library’s fma function.
Here’s a code sample:
C++:
#include <cmath>

float f(float a, float b, float c){
    // required (by the C++ standard) to perform fma with 1 rounding
    return std::fma(a,b,c);
}

float g(float a, float b, float c){
    // no requirements from the C++ standard about whether to fuse or not
    return a * b + c;
}
LLVM IR:
define dso_local float @_Z1ffff(float %a, float
%b, float %c) local_unnamed_addr {
entry:
%0 = tail call float @llvm.fma.f32(float %a, float %b, float %c)
ret float %0
}
define dso_local float @_Z1gfff(float %a, float
%b, float %c) local_unnamed_addr {
entry:
%0 = tail call float @llvm.fmuladd.f32(float %a, float %b, float %c)
ret float %0
}
The resulting assembly is the same for both:
f(float, float, float):
vfmadd132ss xmm0, xmm2, xmm1
ret
g(float, float, float):
vfmadd132ss xmm0, xmm2, xmm1
ret
Since -ffp-contract=on fuses only within the same statement and is hence very programming-language specific, it makes sense that this happens in the frontend. The backend will unfuse the fmuladd again if the hardware does not have an FMA instruction, or if it is slower than doing FMul + FAdd. (As an edge case, this leads to some fun consistency issues with constant folding on architectures that do not support FMA.)
FMA Fusion in the Clang Backend
When -ffp-contract=fast is set, the backend is responsible for fusing into FMA. Unlike the frontend, at this stage Clang can take
architecture-specific timing tables into account. For x86, Clang
assumes
that FMA is always faster than fmul and fadd for fp32 and fp64.
The implementation of the backend depends on the architecture being
built for. For x86, the DAG combiner that emits the FMA when
-ffp-contract=fast
is enabled is implemented here (simplified):
/// Try to perform FMA combining on a given FADD node.
template <class MatchContextClass>
SDValue DAGCombiner::visitFADDForFMACombine(SDNode *N) {
  SDValue N0 = N->getOperand(0);
  SDValue N1 = N->getOperand(1);
  EVT VT = N->getValueType(0);
  MatchContextClass matcher(DAG, TLI, N);
  const TargetOptions &Options = DAG.getTarget().Options;

  // Floating-point multiply-add with intermediate rounding.
  bool HasFMAD = !UseVP && (LegalOperations && TLI.isFMADLegal(DAG, N));
  // Floating-point multiply-add without intermediate rounding.
  bool HasFMA =
      TLI.isFMAFasterThanFMulAndFAdd(DAG.getMachineFunction(), VT) &&
      (!LegalOperations || matcher.isOperationLegalOrCustom(ISD::FMA, VT));

  // No valid opcode, do not combine.
  if (!HasFMAD && !HasFMA)
    return SDValue();

  bool CanReassociate =
      Options.UnsafeFPMath || N->getFlags().hasAllowReassociation();
  bool AllowFusionGlobally = (Options.AllowFPOpFusion == FPOpFusion::Fast ||
                              Options.UnsafeFPMath || HasFMAD);
  // If the addition is not contractable, do not combine.
  if (!AllowFusionGlobally && !N->getFlags().hasAllowContract())
    return SDValue();

  if (TLI.generateFMAsInMachineCombiner(VT, OptLevel))
    return SDValue();

  // Always prefer FMAD to FMA for precision.
  unsigned PreferredFusedOpcode = HasFMAD ? ISD::FMAD : ISD::FMA;
  bool Aggressive = TLI.enableAggressiveFMAFusion(VT);

  auto isFusedOp = [&](SDValue N) {
    return matcher.match(N, ISD::FMA) || matcher.match(N, ISD::FMAD);
  };

  // Is the node an FMUL and contractable either due to global flags or
  // SDNodeFlags.
  auto isContractableFMUL = [AllowFusionGlobally, &matcher](SDValue N) {
    if (!matcher.match(N, ISD::FMUL))
      return false;
    return AllowFusionGlobally || N->getFlags().hasAllowContract();
  };

  // If we have two choices trying to fold (fadd (fmul u, v), (fmul x, y)),
  // prefer to fold the multiply with fewer uses.
  if (Aggressive && isContractableFMUL(N0) && isContractableFMUL(N1)) {
    if (N0->use_size() > N1->use_size())
      std::swap(N0, N1);
  }

  // fold (fadd (fmul x, y), z) -> (fma x, y, z)
  if (isContractableFMUL(N0) && (Aggressive || N0->hasOneUse())) {
    return matcher.getNode(PreferredFusedOpcode, SL, VT, N0.getOperand(0),
                           N0.getOperand(1), N1);
  }
}
Conclusion: How Function Inlining Can Change Floating Point Results
We saw how inlining increases the scope of the optimizer, allowing it to perform more potentially result-changing optimizations. One of the most common ones is FMA fusion, which combines a mul and an add into a single instruction that is faster and more accurate. But FMA yields different results compared to a separate mul and add, which round the intermediate result. Whether or not this is an issue depends on your use case. Retaining bit-equal results becomes hard when fastmath is enabled.
FMA is not the only way in which inlining can affect results; another source of differences is the intermediate precision of computations. However, in my experience, disabling cross-statement FMA fusion allows pretty extensive refactorings without changing results. The specifics of FMA fusion depend strongly on the language spec, the compiler and the hardware, so your mileage may vary.
Further Resources
- A note: most content online about floating point consistency was written before AVX was widely established. Issues arising from unspecified intermediate precision (e.g. Intel’s 80-bit x87 FPU) were much more widespread back then. These problems are less relevant nowadays, as compilers mostly emit AVX instructions with 32-bit intermediate precision for floating point math.
- Prof. Higham’s blog is a good introduction to floating point arithmetic.
- Fabien Sanglard published the most intuitive explanation of the floating point format that I’ve ever come across.
- I enjoyed the Handbook of Floating-Point Arithmetic, particularly the chapter called “Languages and Compilers”.
- Bruce Dawson has a great series of blog posts about floating point consistency.
- Yosefk’s blog post “Consistency: how to defeat the purpose of IEEE floating point” talks about how IEEE was mainly written for HPC folks, who care more about performance and accuracy, than for game developers, who may care more about consistent outputs.