
The curse of "high-impact medium-generality" optimizations

Updated: Oct 15, 2018


Originally posted at https://github.com/lukego/blog/issues/21


We can rate software optimizations as low/medium/high in terms of their impact and their generality.


Impact is how much difference the optimization makes when it works; generality is how broadly applicable the optimization is across different situations. Thinking about these factors explicitly can be helpful for predicting how beneficial (or harmful!) an optimization will be in practice.


That's putting it mildly, though: this blog entry is really a rant. I hate high-impact medium-generality optimizations. Let me show you three examples to explain why. The first is drawn from CPU design for illustration, and the other two are specific problems that I want to fix in RaptorJIT.


CPUs and memory alignment


CPU hardware tends to operate on fixed-size data at fixed alignments in memory. Memory is accessed in fixed-size blocks (cache lines) with fixed alignment, 64-bit values are stored at 64-bit aligned addresses whenever possible, and so on. CPU hardware optimizations are easier to implement when they can make assumptions about alignment. If the alignment is not known then there are a whole bunch of additional corner cases to worry about.


So what approaches do CPUs take when it comes to accessing "unaligned" data? There are a few:

  1. Make unaligned access illegal.

  2. Make unaligned access slow.

  3. Make unaligned access fast.

The "low generality" approach is to completely outlaw unaligned memory access. This punts the problem to somebody else: the compiler or the programmer. ARM was famous for this.


The "medium generality" approach is to support unaligned access but to make it slow. This half-punts the problem: the compiler and programmer are free to write straightforward code, but they may be surprised to find that their software runs much faster or slower depending on the addresses of certain data structures, and that may motivate them to write a bunch of tricky special-case code. Intel SIMD instructions worked this way in earlier silicon implementations.


The "high generality" approach is to support unaligned access with the same high performance as aligned access. This takes ownership of the problem: memory access is fast, period. Compilers and programmers don't need to worry about the low-level special cases: they can go back to thinking about the problem they actually care about. Intel SIMD instructions mostly work this way in later silicon.


I hope it is obvious that the high-impact high-generality approach is the best by far. The programmer writes straightforward source, the compiler assembles straightforward machine code, and the benchmarks show the performance straightforwardly.


Less obvious is that the high-impact medium-generality approach really sucks. It provides a false sense of security. You can write your nice straightforward code but this punts the complexity to your benchmarks: simple testing will probably exercise the sweet spots, making everything look great, but more thorough testing will reveal uncertainty and unpredictability. (Sad face.) If you care about worst-case performance, and you don't have control over the layout of your data in memory, then this can defeat the whole purpose of having a fast path in the first place: you can't depend on that optimization and you need to constantly worry about it screwing up your performance tests.


(The high-impact low-generality approach is probably fine most of the time, since you won't be tricked into thinking it works when it doesn't, though for this example of simply accessing data in memory it's not much fun.)


Tracing JIT and loop optimization


RaptorJIT (via LuaJIT) has a feature called loop optimization. This is a very powerful optimization. However, it is one of these awful "high-impact medium-generality" features. It is completely awesome when it works, but it is really hard to predict whether it will be there when you need it.


The idea of loop optimization is simple: on the first iteration of a loop you simply execute the loop body, but for the later iterations you skip anything that you already did in the first iteration. This cuts out a whole bunch of work: assertions that don't need to be rechecked, calculation results that are already in registers, side-effects that have already been effected, and so on.


The gotcha is that it is only applicable when the later iterations execute exactly the same way as the first one. Your control flow has to be the same, so you need to always be taking the same branch of your if-then-else statements, and the types of your local variables have to be the same too. On any iteration where these conditions don't hold you will get an exit to a slow path that runs without the optimizations.


Let us look at a simple example of loop optimization. Here is a simple Lua program with a loop that repeatedly stores the current loop counter into a global variable:

for i = 1, 100 do
   myglobal = i
end

Here is the machine code for the first iteration of the loop:

12a0eff65  mov dword [0x00021410], 0x1
12a0eff70  cvttsd2si ebp, [rdx]
12a0eff74  mov ebx, [rdx-0x8]
12a0eff77  mov ecx, [rbx+0x8]
12a0eff7a  cmp dword [rcx+0x1c], +0x3f
12a0eff7e  jnz 0x12a0e0010	->0
12a0eff84  mov edx, [rcx+0x14]
12a0eff87  mov rdi, 0xfffffffb00030620
12a0eff91  cmp rdi, [rdx+0x320]
12a0eff98  jnz 0x12a0e0010	->0
12a0eff9e  lea eax, [rdx+0x318]
12a0effa4  cmp dword [rcx+0x10], +0x00
12a0effa8  jnz 0x12a0e0010	->0
12a0effae  xorps xmm0, xmm0
12a0effb1  cvtsi2sd xmm0, ebp
12a0effb5  movsd [rax], xmm0
12a0effb9  test byte [rcx+0x4], 0x4
12a0effbd  jz 0x12a0effd4
12a0effbf  and byte [rcx+0x4], 0xfb
12a0effc3  mov edi, [0x000213f4]
12a0effca  mov [0x000213f4], ecx
12a0effd1  mov [rcx+0xc], edi
12a0effd4  add ebp, +0x01
12a0effd7  cmp ebp, +0x64
12a0effda  jg 0x12a0e0014	->1

That is quite a lot, eh! That's because we are finding the globals hashtable and locating the myglobal slot. Doing this correctly takes quite a few instructions: this is a dynamic language, so the table could have been updated or replaced after the machine code was compiled, and we need to detect that to know whether the hashtable slot we want to look in is still valid, and so on.


Here is what the subsequent iterations look like thanks to loop optimization:

->LOOP:
12a0effe0  xorps xmm7, xmm7
12a0effe3  cvtsi2sd xmm7, ebp
12a0effe7  movsd [rax], xmm7
12a0effeb  add ebp, +0x01
12a0effee  cmp ebp, +0x64
12a0efff1  jle 0x12a0effe0	->LOOP
12a0efff3  jmp 0x12a0e001c	->3

That's a bit better, right? We take our loop index in ebp, convert that into a Lua number (double float) in xmm7, store that in the hashtable slot whose address we already loaded into rax on the first iteration, and then we bump our index and check for loop termination.


Great stuff, right? Wrong! The trouble is that loop optimization is only medium-generality. There are myriad ways that we can tweak this simple example to throw a spanner in the works. All it takes is to have some variation in either the types used in the loop or the branches that are taken.


Let me show you a fun example of screwing it up:

-- Fast version
for i = 1, 1e9 do
   -- Store loop index in global
   myglobal = i         
end

-- Slow version
for i = 1, 1e9 do
   -- Store "is loop index even?" flag
   myglobal = i%2 == 0
end

You might reasonably expect that both of these loops would run at about the same speed, with the second one paying a modest price for doing some extra arithmetic. But you would be wrong, wrong, wrong, wrong, wrong. On both counts.


The first version averages 1.5 cycles (6 instructions) per iteration while the second version averages 9.1 cycles (22.5 instructions) per iteration. That's roughly a six-fold slowdown.

The reason is not extra arithmetic: it's types. Let me explain in two halves. First, the JIT uses a statically typed intermediate representation for generating machine code: it always knows the exact type of the value in each register. Second, the JIT does not have the concept of a boolean type: a value can be typed as true or as false but not as true-or-false. This means that the JIT needs to generate two copies of the loop body, one for even iterations and one for odd iterations, and the constant branching between them wipes out the benefit of loop optimization.
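One way to sidestep the type split, sketched here purely as an illustration (the function and variable names are my own invention, not from the post), is to store the parity as a number instead of a boolean, so that the trace sees a single stable type on every iteration:

```lua
-- Sketch: store parity as a number (0 or 1) rather than a boolean.
-- The stored value then has one type on every iteration, so the
-- trace does not need two copies of the loop body.
local function store_parity(n)
   for i = 1, n do
      myglobal = i % 2   -- always a number, never true/false
   end
   return myglobal
end
```

Whether this recovers the full fast-path speed depends on the particular JIT; the point is only that keeping the types uniform inside the loop is exactly the precondition that loop optimization needs.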


Clear as mud?


I see this as a massive problem that I want to fix in RaptorJIT. I don't want to have high-impact medium-generality optimizations. It's hostile to users and it violates the principle of least astonishment. I am thinking very hard about how to solve this.


One approach would be to upgrade the feature to high-impact high-generality. This would require making the optimization effective even when the branches and types are diverse. This seems superficially like a hard problem but may have a pragmatic solution.


Another approach would be to downgrade the feature to medium-impact medium-generality. This would be done by making separate optimizations that speed up the bad case, so that the relative importance of loop optimization is reduced. Then application developers don't have to worry so much about whether loop optimization "kicks in" because they will get pretty fast code either way. This could be done by doing more of the work during the JIT phase before the program runs: perhaps we insert the hashtable slot address directly into the machine code, without any safety checks, and separately ensure that if the hashtable is resized then the machine code is flushed and recompiled.


Life is not boring when you are maintaining a tracing JIT, let me tell you.


Tracing JIT and allocation sinking


"Allocation sinking" is another problematic high-impact medium-generality optimization. It allows you to create complex objects and have them stored purely in registers, without ever allocating on the heap or running the garbage collector. Except when it doesn't... which can be very hard to predict in practice.
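For a concrete (hypothetical) picture of the kind of code allocation sinking targets, consider a loop that builds a short-lived table on every iteration; the example below is my own sketch, not taken from the RaptorJIT sources:

```lua
-- Sketch: the temporary table p is a candidate for allocation
-- sinking. When the optimization applies, p lives entirely in
-- registers and is never allocated on the heap; when it does not
-- apply, every iteration allocates a table and the garbage
-- collector gets involved.
local function sum_points(n)
   local sum = 0
   for i = 1, n do
      local p = { x = i, y = 2 * i }  -- short-lived temporary
      sum = sum + p.x + p.y
   end
   return sum
end
```

The program's result is the same either way; only the heap traffic, and therefore the performance, differs.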


I have a plan to downgrade allocation sinking to medium-impact medium-generality by optimizing the slow path. Specifically, I want to switch the heap from using 64-bit words to using 96-bit words so that typed pointers can be stored without a heap allocation (boxing). However, this blog entry has become rather long, so if you want to know more about that you'd better look at raptorjit/raptorjit#93.


Summary


There is no permanent place in the world for high-impact medium-generality optimizations. They are simply too hostile to users.


In the long run you need to either make them very general so that users can depend on them working, or you need to separately optimize the edge cases to soften the impact when the special case does not apply.


This is a major topic in RaptorJIT development! You are welcome to come and help with that project if you like :-)


P.S. Please leave a comment with your most hated examples of high-impact medium-generality optimizations in other domains!