Can Function Inlining Affect Floating Point Outputs? Exploring FMA and Other Consistency Issues
June 2023
At my job, I’m refactoring a 30k LOC codebase that simulates learning in the mammalian brain. The emergent behavior of these large brain models is hard to test for, so we opted to take the safe route and preserve bit-equality in the weights of the trained model to guarantee that we are not breaking anything. (The main downside of this hash-based regression testing during refactoring is that it doesn’t check numerical stability.) The most common change we applied was inlining many small functions to increase readability.
This raises the question: Can function inlining affect the output of a numerical program? I’m interested in both the programmer inlining a function manually in their editor and the compiler doing it automatically during an optimization pass.
It turns out that yes, there are multiple ways in which inlining can change results, and specifics depend on the interplay of your language spec, compiler and hardware. One of the big reasons why compilers do inlining is that it increases the scope for optimizations. Most optimization passes run on individual functions, so inlining gives the compiler more code to work with, and more opportunities to apply potentially result-changing optimizations.
Here’s a concrete example that, compiled with gcc -O3 -march=haswell, produces different results depending on whether the function is inlined or not (godbolt):
#include <cstdio>

__attribute__((noinline))
float mulNoInline(float x, float y) {
    return x * y;
}

float mulInline(float x, float y) {
    return x * y;
}

float global = 1.2485889846f;

int main() {
    // After inlining, the compiler can fuse the multiply and the add into an FMA.
    float inlineRes = mulInline(global, global) + 1.0f;
    // The call boundary blocks fusion: multiply, round, then add.
    float noInlineRes = mulNoInline(global, global) + 1.0f;
    if (inlineRes != noInlineRes) {
        printf("Results are not equal!\n");
    }
}
The results differ in the last two bits: inlineRes is 0x4023C63D while noInlineRes is 0x4023C63E.
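If you want to inspect those bit patterns yourself, here is a small helper I’d use (printBits is my own addition, not part of the original example):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Print a float's raw bit pattern in hex by bit-casting it to a 32-bit integer.
void printBits(const char* name, float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // memcpy avoids strict-aliasing issues
    std::printf("%s = 0x%08X\n", name, (unsigned)bits);
}

Calling printBits("inlineRes", inlineRes) and printBits("noInlineRes", noInlineRes) inside main prints the two values above.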
In this post I’ll focus on what’s probably the most common bit-changing optimization that can be exposed after inlining: multiply-add fusion. (Even more bets are off with -ffast-math, since it exposes a lot more optimization potential.) After being mindful of FMA fusion, we managed to perform an extensive refactoring of our codebase without changing the results of our program by a single bit.
The Fused Multiply Add Instruction (FMA)
FMA performs this operation: (a * b) + c, where all variables are floating point numbers. On x86, the most common instruction is called VFMADD, which was added with Intel Haswell. (This means not every x86-64 CPU has support for FMA, and you’ll need to pass at least -march=haswell or -mfma to get the compiler to emit it.)
There are two lenses through which I look at the FMA instruction:
Performance and precision.
FMA Performance: Latency & Throughput
FMA tends to be faster than doing a MUL followed by an ADD. I wrote microbenchmarks and looked at instruction latencies, and FMA is around ~30% faster than a multiply followed by an add. As a rule of thumb, one FMA takes about 4 cycles on a recent x86 CPU, which is as fast as a single FP multiply or a single FP add. I assume that pipelining, macro fusion (macro-op fusion is when an arithmetic instruction and a branch instruction are fused into a single μop by the CPU; from what I can tell, this optimization cannot be applied to FMA) and other processor wizardry will affect these results in practice.

One caveat on older Intel x86 architectures like Haswell: transforming (a * b) + (c * d) into fma(a, b, c * d) is actually bad for performance. With ILP we can run the two multiplies in parallel, so the latency of the original is roughly latency(mul) + latency(add), while the transformed version takes latency(mul) + latency(fma). On Haswell the transformed version has worse performance since the latencies are: add: 3 cycles, mul: 5 cycles, fma: 5 cycles.

Despite being more complex than add or mul instructions, FMA is also a single μop according to Agner Fog’s tables. Clang’s performance model for x86 and AArch64 assumes FMA is faster than MUL and ADD for all microarchitectures (for fp32 and fp64 data types).
Besides latency, we also care about throughput: on Ice Lake, throughput is 2 ops/cycle for FMA, equal to ADD and MUL. (Interestingly, when FMA first came out with Haswell, latency and throughput were 5 cycles and 2 ops/cycle for FMA, 3 cycles and 1 op/cycle for FADD, and 5 cycles and 2 ops/cycle for FMUL, so FMA had higher throughput than a simple add.) You can test how much FMA fusion affects your program’s speed by compiling with -ffp-contract=off and removing any usage of std::fma.
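If you want to measure this yourself, here is a minimal sketch of the kind of dependent-chain microbenchmark I have in mind (illustrative, not the exact benchmark I ran): built with contraction enabled and -march=haswell, the loop body becomes a single FMA; built with -ffp-contract=off, it becomes a multiply followed by an add.

#include <chrono>
#include <cstdio>

int main() {
    // volatile inputs keep the compiler from folding the loop away entirely.
    volatile float a = 1.0000001f, b = 0.9999999f;
    float acc = 1.0f;

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000; ++i) {
        // Dependent chain: each iteration needs the previous acc, so this
        // measures latency of (mul + add) vs. a single fused FMA.
        acc = acc * a + b;
    }
    auto end = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = end - start;
    // Print acc so the result is observed and the loop is not removed.
    printf("acc = %f, time = %.3f s\n", acc, elapsed.count());
}

Comparing a run compiled with -O3 -march=haswell against one that adds -ffp-contract=off gives a rough idea of what fusion buys.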
FMA Precision: Infinitely Precise Intermediate Results
Floating point math requires rounding since we can only represent a finite set of numbers. Besides increased performance, the second reason to use FMA is higher precision due to only rounding once instead of twice.
- Without FMA:
RoundToFloat32(RoundToFloat32(a * b) + c)
- With FMA:
RoundToFloat32((a * b) + c)
For modern processors, the bit result of FMA is specified by the IEEE 754 standard (since the 2008 revision) for every possible input. This means that every processor’s FMA instruction (including CPU vector units and Nvidia GPUs) will produce exactly the same output given the same inputs. (For IEEE datatypes, that is; no guarantees if you’re using bfloat16 or the 19-bit TensorFloat32 format.) IEEE requires that operations be “exactly rounded”: the result is as if it had first been computed to infinite precision, then rounded. For FMA, this means the result is as if the multiply had been computed to infinite precision and we only rounded to 32 / 64 bits once, after the add.
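To see the single rounding in action, here is a small sketch of my own, reusing the constant from the example above; compile it with -ffp-contract=off so the split multiply and add don’t get silently re-fused:

#include <cmath>
#include <cstdio>

int main() {
    float a = 1.2485889846f;
    float c = 1.0f;

    float p = a * a;                         // product rounded to float (rounding #1)
    float twoRoundings = p + c;              // sum rounded to float (rounding #2)
    float oneRounding  = std::fmaf(a, a, c); // exact product, rounded only once after the add

    printf("two roundings: %.9g\n", twoRoundings);
    printf("one rounding:  %.9g\n", oneRounding);
}

The two printed values should correspond to the noInlineRes and inlineRes bit patterns from the first example.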
Floating Point Consistency and FMA
So it sounds like FMA is faster and more accurate; what’s the issue? Consistency. Depending on whether float res = a * b + c; turns into an fma(a,b,c) or an add(mul(a,b),c) sequence of instructions, the results will differ for some inputs.
IEEE does not help us much here. This is because IEEE is mainly a specification of hardware, not of programming languages. While the IEEE standard specifies the result of VFMADD231SS (the x86 instruction) down to the bit, it does not specify what the result of writing float res = a * b + c; in C++ should be. This is for multiple reasons: first, the C++ standard does not require IEEE-compliant data formats; second, IEEE does not specify the intermediate precision of operations. The C++ draft has this note (also see SO answer): “The values of the floating-point operands and the results of floating-point expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.” That is not very specific, but it seems to permit fusion by default. Some options allowed by the C++ spec are (the third one is spelled out in code after the list):
- Fuse into a single FMA instruction (1 rounding)
- Do not fuse, execute with 32-bit intermediate precision (2 roundings)
- Do not fuse, execute with 64-bit intermediate precision then round down to 32-bit result (3 roundings)
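The first two options correspond to std::fmaf(a, b, c) and to a plain split multiply-then-add (as in the sketch above). The third option can also be written out explicitly; a small illustration of my own of “compute with 64-bit intermediate precision, then round down”:

float wideIntermediate(float a, float b, float c) {
    double p = (double)a * (double)b;  // product rounded to 64 bits (exact here: a float product fits in a double)
    double s = p + (double)c;          // sum rounded to 64 bits
    return (float)s;                   // final rounding down to 32 bits
}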
As we can see, floating point consistency is not just about the language; it’s a play with many actors. The main participants are:
- The language specification: Does it allow fusing a * b + c without explicitly coding for an FMA? Both C++ and Go do. (From the Go spec: “An implementation may combine multiple floating-point operations into a single fused operation, possibly across statements, and produce a result that differs from the value obtained by executing and rounding the instructions individually.”) Both languages also have std::fma and math.FMA for generating FMA instructions explicitly. Notice that FMA fusion doesn’t require any fastmath in C++.
- The compiler: Does it fuse a * b + c into a single FMA instruction? Clang does so by default (-ffp-contract=on) as long as the mul and add are part of the same statement. GCC even fuses across statements by default (-ffp-contract=fast), as does the gc Go compiler. If cross-statement fusion is allowed, then inlining may expose more opportunities for FMA fusion that were not previously visible to the compiler.
- The hardware: Does it have an FMA instruction? Pre-Haswell x86-64 CPUs do not. What happens if you use std::fma on non-FMA hardware? (A runtime check for FMA support is sketched right after this list.)
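As a side note on that last question, one way to check at runtime whether the CPU you’re running on actually has FMA is the __builtin_cpu_supports builtin available in GCC and Clang on x86 (a minimal sketch):

#include <cstdio>

int main() {
    // __builtin_cpu_supports queries CPUID at runtime (GCC/Clang, x86 only).
    if (__builtin_cpu_supports("fma")) {
        printf("This CPU has FMA support.\n");
    } else {
        printf("No FMA here; std::fma has to be emulated in software.\n");
    }
}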
Let’s look a bit closer at the second actor, the compiler, taking Clang as our main example.
FMA Fusion in the Clang Frontend and Backend
As I mentioned, FMA fusion is explicitly allowed by the C++ spec. Clang has the -ffp-contract flag to control what gets fused. (The flag’s description says “Specify when the compiler is permitted to form fused floating-point operations, such as fused multiply-add (FMA)”. x86-64 has more fused operators besides FMA, but I’ve never seen them used explicitly: VFMSUB (a * b - c), VFNMADD (-a * b + c) and VFNMSUB (-a * b - c).) Possible settings are:
- on (the default): fuse, but not across statements. So float tmp = a * b; float res = tmp + c; will not get fused, but float res = a * b + c; will. This happens in the compiler frontend during an AST rewrite.
- fast: also fuse across statements. This is one of the flags enabled by -ffast-math. It is implemented in the architecture-specific compiler backend.
- off: do not fuse.
- fast-honor-pragmas: I have never seen this used. It refers to the FP_CONTRACT pragma, which comes from the C standard (a usage sketch follows below).
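For completeness, here is roughly what using that pragma looks like (a sketch of my own; to my knowledge Clang honors the standard #pragma STDC FP_CONTRACT spelling, while GCC largely ignores it and relies on -ffp-contract instead):

#include <cstdio>

float contractionAllowed(float a, float b, float c) {
    return a * b + c;  // the compiler may contract this into a single FMA
}

float contractionDisabled(float a, float b, float c) {
#pragma STDC FP_CONTRACT OFF
    return a * b + c;  // contraction is disabled for this block
}

int main() {
    float a = 1.2485889846f;
    printf("%.9g %.9g\n", contractionAllowed(a, a, 1.0f), contractionDisabled(a, a, 1.0f));
}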
By default, Clang is more defensive than GCC, which has
-ffp-contract=fast
as the default
setting.
Here’s a Godbolt example
where GCC by default changes the result after inlining, while Clang does
not.
Let’s look at how FMA fusion is implemented in the Clang source code. There are two places where fusion can happen: in the frontend and the backend.
FMA Fusion in the Clang Frontend
The relevant code is in
CodeGen.
The frontend performs optimizations that are independent of the
architecture being compiled for. If -ffp-contract=on
and the operations
are fusible, it will emit an llvm.fmuladd
instruction. Interestingly, if
-ffp-contract=fast
, the frontend doesn’t do any fusion but delegates
that task to the architecture-dependent backends.
Notably, there are two intrinsics in LLVM IR for FMA: llvm.fma and llvm.fmuladd.
- llvm.fma: perform the fused multiply-add operation. […] Return the same value as a corresponding libm ‘fma’ function but without trapping or setting errno.
- llvm.fmuladd: The LLVM langref has this to say: represent[s] multiply-add expressions that can be fused if the code generator determines that (a) the target instruction set has support for a fused operation, and (b) that the fused operation is more efficient than the equivalent, separate pair of mul and add instructions.
The main difference here is that llvm.fmuladd may or may not get fused, at the code generator’s discretion, while llvm.fma always returns the same value as the corresponding C math library’s fma function.
Here’s a code sample:
C++:
#include <cmath>

float f(float a, float b, float c){
    // required (by the C++ standard) to perform fma with 1 rounding
    return std::fma(a,b,c);
}

float g(float a, float b, float c){
    // no requirements from the C++ standard about whether to fuse or not
    return a * b + c;
}
LLVM IR:
define dso_local float @_Z1ffff(float %a, float
%b, float %c) local_unnamed_addr {
entry:
%0 = tail call float @llvm.fma.f32(float %a, float %b, float %c)
ret float %0
}
define dso_local float @_Z1gfff(float %a, float
%b, float %c) local_unnamed_addr {
entry:
%0 = tail call float @llvm.fmuladd.f32(float %a, float %b, float %c)
ret float %0
}
The resulting assembly is the same for both:
f(float, float, float):
vfmadd132ss xmm0, xmm2, xmm1
ret
g(float, float, float):
vfmadd132ss xmm0, xmm2, xmm1
ret
Since -ffp-contract=on fuses only within the same statement and is hence very programming-language specific, it makes sense that this happens in the frontend. The backend will unfuse the fmuladd again if the hardware does not have an FMA instruction, or if it is slower than doing FMul + FAdd. (As an edge case, this leads to some fun consistency issues with constant folding on architectures that do not support FMA.)
FMA Fusion in the Clang Backend
When -ffp-contract=fast is set, the backend is responsible for fusing into FMA. Unlike the frontend, at this stage Clang can take
architecture-specific timing tables into account. For x86, Clang
assumes
that FMA is always faster than fmul and fadd for fp32 and fp64.
The implementation of the backend depends on the architecture being
built for. For x86, the DAG combiner that emits the FMA when
-ffp-contract=fast
is enabled is implemented here (simplified):
/// Try to perform FMA combining on a given FADD node.
template <class MatchContextClass>
SDValue DAGCombiner::visitFADDForFMACombine(SDNode *N) {
  SDValue N0 = N->getOperand(0);
  SDValue N1 = N->getOperand(1);
  EVT VT = N->getValueType(0);
  MatchContextClass matcher(DAG, TLI, N);
  const TargetOptions &Options = DAG.getTarget().Options;

  // Floating-point multiply-add with intermediate rounding.
  bool HasFMAD = !UseVP && (LegalOperations && TLI.isFMADLegal(DAG, N));
  // Floating-point multiply-add without intermediate rounding.
  bool HasFMA =
      TLI.isFMAFasterThanFMulAndFAdd(DAG.getMachineFunction(), VT) &&
      (!LegalOperations || matcher.isOperationLegalOrCustom(ISD::FMA, VT));

  // No valid opcode, do not combine.
  if (!HasFMAD && !HasFMA)
    return SDValue();

  bool CanReassociate =
      Options.UnsafeFPMath || N->getFlags().hasAllowReassociation();
  bool AllowFusionGlobally = (Options.AllowFPOpFusion == FPOpFusion::Fast ||
                              Options.UnsafeFPMath || HasFMAD);
  // If the addition is not contractable, do not combine.
  if (!AllowFusionGlobally && !N->getFlags().hasAllowContract())
    return SDValue();

  if (TLI.generateFMAsInMachineCombiner(VT, OptLevel))
    return SDValue();

  // Always prefer FMAD to FMA for precision.
  unsigned PreferredFusedOpcode = HasFMAD ? ISD::FMAD : ISD::FMA;
  bool Aggressive = TLI.enableAggressiveFMAFusion(VT);

  auto isFusedOp = [&](SDValue N) {
    return matcher.match(N, ISD::FMA) || matcher.match(N, ISD::FMAD);
  };

  // Is the node an FMUL and contractable either due to global flags or
  // SDNodeFlags.
  auto isContractableFMUL = [AllowFusionGlobally, &matcher](SDValue N) {
    if (!matcher.match(N, ISD::FMUL))
      return false;
    return AllowFusionGlobally || N->getFlags().hasAllowContract();
  };

  // If we have two choices trying to fold (fadd (fmul u, v), (fmul x, y)),
  // prefer to fold the multiply with fewer uses.
  if (Aggressive && isContractableFMUL(N0) && isContractableFMUL(N1)) {
    if (N0->use_size() > N1->use_size())
      std::swap(N0, N1);
  }

  // fold (fadd (fmul x, y), z) -> (fma x, y, z)
  if (isContractableFMUL(N0) && (Aggressive || N0->hasOneUse())) {
    return matcher.getNode(PreferredFusedOpcode, SL, VT, N0.getOperand(0),
                           N0.getOperand(1), N1);
  }
}
Conclusion: How Function Inlining Can Change Floating Point Results
We saw how inlining increases the scope of the optimizer, allowing it to perform more potentially result-changing optimizations. One of the most common ones is FMA fusion, which combines a mul and an add into a single instruction that is faster and more accurate. But FMA yields different results compared to a separate mul and add, which round the intermediate result. Whether or not this is an issue depends on your use case. Retaining bit-equal results becomes hard when fastmath is enabled.
FMA is not the only way in which inlining can affect results; another source of differences is the intermediate precision of computations. However, in my experience, disabling cross-statement FMA fusion allows pretty extensive refactorings without changing results. The specifics of FMA fusion depend strongly on the language spec, the compiler and the hardware, so your mileage may vary.
Further Resources
- A note: most content online about floating point consistency was written before AVX was widely established. Issues arising from unspecified intermediate precision (e.g. Intel’s 80-bit x87 FPU) were much more widespread back then. These problems are less relevant nowadays, as compilers mostly emit AVX instructions with 32-bit intermediate precision for floating point math.
- Prof. Higham’s blog is a good introduction to floating point arithmetic.
- Fabien Sanglard published the most intuitive explanation of the floating point format that I’ve ever come across.
- I enjoyed the Handbook of Floating-Point Arithmetic, particularly the chapter called “Languages and Compilers”.
- Bruce Dawson has a great series of blog posts about floating point consistency.
- Yosefk’s blog post “Consistency: how to defeat the purpose of IEEE floating point” talks about how IEEE was mainly written for HPC folks, who care more about performance and accuracy, than for game developers, who may care more about consistent outputs.