A few days ago I posted "GCC: the impressive and the disappointing", where I looked at some cases where GCC produces not-quite-optimal code. One of the comments on that post was (emphasis mine):
So, it seems like there is a much better way to give the compiler a shot at doing the right thing: [snip suggestion]. I think you will find the compiler will generate quite efficient code in this case, particularly if you look at the real execution overhead, rather than what the assembler looks like.
This is a common attitude I encounter when I am discussing my attempts to optimize my protocol buffer decoding library upb. Programmers love to tell other programmers that they are prematurely optimizing, and most of the time they’re right. I’m sure to some people it seems ludicrous that I would be looking at assembly language output to determine whether it is efficient enough. For 99.99% of programs, it would be. But I’m working in one of those rare domains where it actually matters. And today I encountered pretty convincing evidence that the compiler’s bad code is actually affecting me.
The compiler’s bad code in this case is an example of a bug I previously filed on GCC: "struct returned by value generates useless stores". Though I had previously observed that bug only by inspecting assembly language output, today I had it show up on an actual profile as clear as day. Here is a screenshot from Shark:
To summarize, the compiler took the code:
typedef struct {
  upb_flow_t flow;  // An enum defined elsewhere.
  void *closure;
} upb_sflow_t;

upb_flow_t upb_dispatch_startsubmsg([...]) {
  // [...]
  upb_sflow_t sflow = f->cb.startsubmsg([...]);
  if (sflow.flow != UPB_CONTINUE) {
    // [...]
  }
  // [...]
}
…and turned that function call/test into this awful machine code (here in its Intel-syntax form):
call QWORD PTR [r12 + 16]
mov DWORD PTR [rbp - 64], eax
mov QWORD PTR [rbp - 56], rdx
mov rax, QWORD PTR [rbp - 64] ; loads rax with data it already has.
mov QWORD PTR [rbp - 48], rax ; stores rax into the stack a second time.
mov QWORD PTR [rbp - 40], rdx ; stores rdx into the stack a second time.
mov edx, DWORD PTR [rbp - 48] ; loads edx with data already in rax.
test edx, edx
…and then (this is the important part) in an actual profile it shows up as being 43.4% of the execution time of a hot function in my program.
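If you want to see what your own compiler does with this pattern, here is a minimal stand-alone sketch of it. This is not the upb code: every `demo_` name and the enum values are stand-ins I made up for illustration, and the only things carried over are the shape of the struct (an enum plus a pointer) and the call-through-a-function-pointer-then-test-one-field structure. Under the System V x86-64 calling convention a struct of this shape comes back in `rax` and `rdx`, which is why the listing above starts by storing those two registers.

```c
#include <stddef.h>

/* Hypothetical stand-in for upb_flow_t; the values are assumptions. */
typedef enum { DEMO_CONTINUE = 0, DEMO_BREAK = 1 } demo_flow_t;

/* Same shape as upb_sflow_t: an enum followed by a pointer. */
typedef struct {
  demo_flow_t flow;
  void *closure;
} demo_sflow_t;

/* Hypothetical callback type that returns the struct by value. */
typedef demo_sflow_t (*demo_startsubmsg_fn)(void *closure);

/* Example callback: continue if it was given a closure, otherwise break. */
demo_sflow_t demo_startsubmsg(void *closure) {
  demo_sflow_t ret;
  ret.flow = (closure != NULL) ? DEMO_CONTINUE : DEMO_BREAK;
  ret.closure = closure;
  return ret;
}

/* Mirrors the call/test above: an indirect call, then a branch on one field. */
demo_flow_t demo_dispatch(demo_startsubmsg_fn cb, void *closure) {
  demo_sflow_t sflow = cb(closure);
  if (sflow.flow != DEMO_CONTINUE) {
    return sflow.flow;
  }
  return DEMO_CONTINUE;
}
```

Compiling this with something like `gcc -O2 -S -masm=intel` and reading the body of `demo_dispatch` shows what gets emitted between the `call` and the `test`. Whether the redundant stores appear will depend on the GCC version and flags, so treat it as a way to check rather than a guaranteed reproduction.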
This is not a slam against the GCC developers. GCC is a big and complex piece of software, and they have to prioritize all sorts of different bugs, feature requests, new hardware, etc.
This is just a reminder to those who jump to dare-I-say “premature” conclusions about what is premature optimization: some of us really are working in domains where things like virtual function overhead, branch predictability, and the efficiency of the compiler’s code make a difference.
