It’s been a while since I’ve posted about upb, but I’ve been busy improving it! I think the biggest achievement I can mention is that the core upb APIs (upb_handlers, upb_def, and upb_bytestream/upb_bytesink) are converging to the point where I’m comfortable with people starting to experiment with them. I’m not promising they won’t change at all, but I’m a lot more confident in their overall structure and semantics than I have been previously.
Notably, I think upb’s deserialization is ready for casual and experimental use. Definitely don’t trust any data to it until it’s better tested, though. I won’t be releasing until things have converged a bit more (and are better tested).
I gave a talk about upb at Google yesterday that was well-received. One question that comes up is “how are you beating the generated code from the protobuf compiler?” For the record, my JIT appears to be about 25% faster than Google’s protobuf release on a completely apples-to-apples test. It is a bit surprising, since my code has basically the same structure as protobuf’s generated C++ — I don’t invent any new optimizations or anything like that. I think it really comes down to generating better assembly than gcc is.
We live in a day and age where common wisdom is that you can’t beat a good C++ compiler, or at least not by much, and I think this is probably true for 99% of use cases. But I was first inspired to think that I could beat the C++ compiler in this case by reading this mailing list post from Mike Pall where he explains why you can still beat the compiler for interpreter main loops. A protobuf parser is surprisingly similar to a byte-code interpreter main loop, so I thought I’d give it a shot.
Below is just the simplest example I could dig up of a side-by-side comparison of my code vs. the compiler’s. What follows is the code to parse a single fixed64 value:
; upb JIT assembly: mov rdx,QWORD PTR [rbx+0x2] ; load fixed64 val out of buffer add rbx,0xa ; advance buffer by 10 (2 for tag) mov QWORD PTR [r12+0x40],rdx ; store fixed64 value in message or BYTE PTR [r12+0x1],0x4 ; set hasbit cmp rbx,QWORD PTR [r15+0xaf8] ; check for end-of-buffer jae <end of buffer> mov rcx,QWORD PTR [rbx] ; load next tag cmp cx,0x1b0 ; next field+wt in order? je <expected next field>
There’s not a lot left to cut away here. Compare with protobuf/gcc-generated code:
; protobuf/gcc code: mov ecx,DWORD PTR [rbx+0x10] ; load buffer end mov rax,QWORD PTR [rbx+0x8] ; load buffer sub ecx,eax ; if (buffer_end_ - buffer_ <= 7) cmp ecx,0x7 jle <error> mov rax,QWORD PTR [rax] ; load fixed64 val mov rdx,QWORD PTR [rbp-0x48] ; load this mov QWORD PTR [rdx],rax ; store fixed64 val in this add QWORD PTR [rbx+0x8],0x8 ; advance buffer or DWORD PTR [r12+0x74],0x800 ; set hasbit mov rdx,QWORD PTR [rbx+0x10] ; load buffer end mov rax,QWORD PTR [rbx+0x8] ; load buffer mov ecx,edx sub ecx,eax ; if (buffer_end_ - buffer_ <= 1) cmp ecx,0x1 jle <end of file> cmp BYTE PTR [rax],0xb0 ; check first byte of next tag jne <lookup field> cmp BYTE PTR [rax+0x1],0x1 ; check second byte of next tag jne <lookup field>
There is some poor register allocation going on here — gcc is repeatedly loading buffer_ and buffer_end_ even though it has plenty of registers to play with. All in all, gcc is generating over twice the number of instructions, over twice the number of loads, and more stores too. Note that this is taken from the middle of a very large C++ function with a big switch statement and a bunch of gotos, and I think it’s just difficult for a compiler to do good register allocation under these constraints.
If the C++ compiler could know the difference between fast paths and slow paths and do the register allocation solely for the fast paths (spilling everything for the slow paths) it might have a better shot. But still I think it’s just a hard problem.