The -march flag itself is GCC-specific, but the general advice is universal: don’t forget to tell your compiler that it can take full advantage of your spiffy new CPU! I should know better, but I’ve been forgetting to specify -march when compiling upb.
Here’s an extreme example of why. Take an innocent-looking function like:
int float_to_int(float f) {
    return (int)f;
}
Looks simple enough, right? Unfortunately, float -> int casts are stupidly expensive on x86. Without any -m flags, gcc compiles this to:
sub    $0x8, %esp         ; allocate stack space
fnstcw 0x6(%esp)          ; save floating-point control word
flds   0xc(%esp)          ; push floating-point param onto fp stack
movzwl 0x6(%esp), %eax    ; move prev fp control word into %eax
mov    $0xc, %ah          ; set rounding mode of control word to "truncate"
mov    %ax, 0x4(%esp)     ; save it *back* to the stack
fldcw  0x4(%esp)          ; set the floating-point control word to truncate
fistp  0x2(%esp)          ; store integer from the fp stack to the stack
fldcw  0x6(%esp)          ; set the fp control word back to what it was
movzwl 0x2(%esp), %eax    ; read the value into eax (the return value)
add    $0x8, %esp         ; give the stack space back
ret
This would be funny if it weren’t so sad. All these gymnastics are needed because the C standard requires the cast to truncate toward zero, but the x87 floating-point unit normally runs in round-to-nearest mode. So the code has to save the control word, switch the rounding mode to truncate, do the store, and then switch the mode back.
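To make the cast's semantics concrete: the conversion discards the fractional part, so it only differs from rounding toward negative infinity for negative inputs. Here is a minimal sketch of the difference. The names float_to_int_floor and check_truncation are hypothetical helpers added for comparison; they are not part of upb.

```c
#include <assert.h>

/* The cast from the example above: C discards the fractional part,
   i.e. it truncates toward zero. */
int float_to_int(float f) {
    return (int)f;
}

/* Hypothetical helper for comparison: round toward negative infinity.
   Built on top of the truncating cast so it needs no libm. */
int float_to_int_floor(float f) {
    int i = (int)f;                 /* truncated toward zero */
    return (f < (float)i) ? i - 1 : i;
}

/* Hypothetical self-check showing where the two disagree. */
void check_truncation(void) {
    assert(float_to_int(2.7f) == 2);         /* positive: both agree */
    assert(float_to_int_floor(2.7f) == 2);
    assert(float_to_int(-2.7f) == -2);       /* negative: truncation goes up */
    assert(float_to_int_floor(-2.7f) == -3); /* floor goes down */
}
```

For positive inputs the two agree, which is probably why "round down" and "truncate" get used interchangeably; the difference only shows up below zero.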
Compiling exactly the same code with -msse2 allows the compiler to take advantage of an SSE-only instruction, and the above is replaced with:
cvttss2si 0x4(%esp), %eax ; convert value to integer with truncation
ret
The difference in this case is astounding. Hopefully this will motivate you never to forget the -march flag!
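One way to notice a forgotten flag is to have the program report what the compiler was allowed to use. GCC predefines the macros __SSE__, __SSE2__ and __SSE3__ when the corresponding instruction sets are enabled, so a rough sketch could look like this. The function names sse_level and print_sse_level are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the highest SSE level enabled at compile time, judged by the
   compiler's predefined macros. Only SSE through SSE3 are checked here. */
const char *sse_level(void) {
#if defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}

/* Hypothetical helper: print the level so a build can be sanity-checked. */
void print_sse_level(void) {
    const char *level = sse_level();
    assert(level[0] != '\0');  /* every branch returns a non-empty name */
    printf("compiled with SSE level: %s\n", level);
}
```

Calling print_sse_level() from a build with no -m flags and again from a build with -march=core2 should make the difference visible. The exact result without flags depends on the target and on how your gcc was configured.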
The right thing to do in my case is compile with -march=core2. When I compile with -march=core2 or -msse3, the compiler emits the not-quite-as-terse:
sub     $0x4,%esp
flds    0x8(%esp)
fisttpl (%esp)
mov     (%esp),%eax
add     $0x4,%esp
ret
I’m really not sure why gcc prefers this version when SSE3 is available. It seems to be more work than the SSE2 version. I tend to believe gcc knows what it’s doing here, but I’d love to learn why.