
Performance Optimization, SIMD and Cache


Hello! My name is Sergiy Migdalskiy. I'm a programmer at Valve, and I'm going to talk about performance optimization strategies today. This is a 10,000-foot overview of the little things that have a big impact on performance.

I will never forget my first computer. Like the Soviet reality behind my window, it was very austere. It didn't even have a real keyboard. It felt very fast, because there wasn't a whole lot I could do with it. The ZX Spectrum was my first user-friendly computer. It had a BASIC interpreter and a real keyboard. And it wasn't fast enough, because there was a whole lot I could do with it.

The Spectrum had an 8-bit Zilog CPU. This is it; by the way, you can make out individual transistors here. With just 8,500 transistors, you can have a fully functioning CPU. A lot has changed. The Pentium had millions of transistors. Some versions of Intel's Haswell or AMD's Bulldozer have over a billion. The tip of a pin in this picture could fit a hundred Zilogs, at least if someone fabricated Zilogs with a 22nm process.

This is how tough we used to have it. This is a floating-point multiplication routine for the Zilog. It's not even full floating point: it ignores the least significant 8 bits. This is how easy we have it these days. This is how we multiply a full float. In fact, we can also multiply quadruplets or octuplets of numbers the same way. Lots of things got simpler and faster since then.

And how do we use this tremendous power? Here is a virtual function call. It may cause 3 cache misses: when we fetch the object pointer, then the vtable pointer, then the pointer to the function. That's before the actual function is even called. And to boot, you can miss the instruction cache or the TLB.

This is the scope of the problem. Each car is an instruction here. A cache miss has to wait for main memory, and main memory is very far: it's 500 cycles away on the last-gen consoles. While this instruction is waiting, the rest of the program is waiting, too. If I made this animation to time scale, it would take 10 minutes. And a virtual call can cause this stall 3 times in a row. This is how many instructions could have run if we didn't have a cache miss: 500. It's not quite the same on a modern PC, but it's still bad. I'll talk about it in a minute.

So why is main memory so slow? Well, it's billions of capacitors, with just hundreds of lanes connecting them to the CPU. That's why there are layers upon layers of cache between them. The typical bandwidth nowadays is from 8 to 36 GByte/s, usually shared between 2 to 6 cores and sometimes the GPU. It's not that slow if you use it right. But if you do miss the cache, the problem is that main memory is latent. CPU and RAM are quite fast individually; they're just far from each other. They can talk fast, they just won't hear a reply until later. It's like having a lag in your phone conversation.

Chasing data down a chain of pointers exacerbates this problem. Every arrow in the chain is a memory access, and it has to complete before the next one can execute. The CPU asks for one cache line at a time and waits… The main memory delivers one cache line at a time and waits… Both CPU and RAM are basically waiting for each other most of the time. Everything should ideally be one pointer lookup away. Data dependency chains should be short. This will let the CPU prefetch more things in parallel. The out-of-order speculative execution engine will even do it for you automatically. I'll talk more about it in a minute.

You never read only one float from memory. You can only fetch the whole cache line, so use it all, in its entirety. Align and pack your data into cache-line-sized chunks. I have a rule of thumb that the CPU should not starve. It shouldn't wait for memory. Here is a typical low-end machine: 2 GByte per core per second, and about 2 GHz. That's 1 byte per CPU cycle.
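The talk doesn't show code for this, but here is a minimal sketch of what "align and pack your data into cache-line-sized chunks" can look like. The struct name, field layout and group size are made up for illustration, not taken from any real engine.

```cpp
#include <cstddef>

// Hypothetical particle data, packed so that one cache line holds a whole
// group of values that are processed together.
struct alignas(64) ParticlePositions   // 64 bytes = one cache line on x86
{
    float x[8];   // 32 bytes
    float y[8];   // 32 bytes -> the line is fully used by data we actually touch
};

static_assert(sizeof(ParticlePositions) == 64, "should occupy exactly one cache line");

// Streaming over a contiguous array of these blocks gives the hardware
// prefetcher one linear stream, and every fetched line is consumed fully.
float SumX(const ParticlePositions* blocks, std::size_t count)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        for (int j = 0; j < 8; ++j)
            sum += blocks[i].x[j];
    return sum;
}
```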
I pack the data, and then reuse it as much as I can before it's evicted from the tiny, puny L1 cache, hopefully pulling no more than 1 byte per cycle in the process. I estimate the cycles and bytes by hand. Well, I really count full cache lines and operations, but it's close enough. And I also measure and verify the assumptions, using all the tools I can get my hands on. One does not replace the other: tools have a poor signal-to-noise ratio, and estimates are error-prone. Always double-check yourself.

Fundamentally, when the CPU has enough stuff to do, it's easy to fix cache misses. It's good to be compute bound: I can concentrate on optimizing the logic, and I can see immediate and predictable results. Classic computer science training prepares us for exactly this scenario. It concentrates on minimizing the number of operations to make things go faster. When you are compute bound, that's exactly the right thing to do. If the CPU doesn't have enough stuff to do, if you are not compute bound, you lose ticks to cache misses. This is how much per miss. With stalls of these proportions, no wonder performance is often poor and unpredictable.

I hope I convinced you by now that cache misses are pretty bad. But wait a minute… the last-gen console PowerPCs were dual-issue. That's 2 instructions per tick. So all in all it's more like we lost 1000 instructions per cache miss. That's a theoretical number, but I did reach 1.6 average instructions per cycle in optimized code.

Typical memory latency on a modern computer is about 50 nanoseconds, maybe 60. That's about 150 ticks on a 3 GHz CPU, not 500. In reality it's longer, because it takes time for the CPU to realize the address is not in cache and to send the query to memory. Also, other cores and the GPU compete for the same memory bus. That increases latency, but let's stick to the slightly delusional 150 ticks. That's much better than 500, right? Well, here is the thing: you can run 3 or more instructions per tick on an Ivy Bridge CPU. The moves here are just register renames. It's kind of a cool shortcut for the CPU: moves cost almost nothing because they don't even go through the whole CPU pipeline. They don't use execution units, at least on many of the latest CPUs. But the increments are real arithmetic operations, and it's pretty cool we can run 3 of them per tick.

So let's compare the old with the new. If you stall a modern CPU for 150 cycles, the theoretical opportunity cost is about 450 simple instructions. You'll never achieve 450, but you can get half of that quite easily. And in the last gen, it wasn't unheard of to run 2 cycles per instruction and not worry about it. So the cache miss opportunity cost was about 250 instructions last gen and is about 220 today. So it's faster, but not cheaper, today.

So, what about out-of-order, you might ask? Doesn't the CPU just go ahead and leave the stalled instruction behind, with the other instructions running speculatively ahead? Any instruction stalled on a cache miss should just be reordered until the memory read is complete. Unfortunately, it doesn't work quite that way. You fetch data from memory because you need it, usually right away. So stalling one instruction stalls the whole dependency graph downstream, like those yellow cars: they can't move anywhere. But what about the other instructions that don't depend on your memory data? Few instructions can make real progress with some variable in an unknown state. And the reorder buffer is not infinitely long: Haswell has 192 entries in its reorder buffer. It also has a limited number of other resources that can shrink the reorder window. So yes, sometimes it helps.

To count cycles on modern superscalar architectures, you need to be comfortable with two numbers. This is familiar to most of you: every instruction has latency and throughput. Latency is how many cycles it takes for the result of one instruction to be available to the next. A latency of 4 means you can't use the result of the instruction until 4 cycles after it issued. Throughput is how often the instruction can issue. For example, if a multiply has 4-cycle latency but a throughput of 1, you can issue one multiply every cycle, as long as they are independent. If the throughput is, for example, 1/2, it means you can issue half an instruction per cycle; in other words, only every other cycle. It's kind of a quirky terminology. Sometimes people use reciprocal throughput: a throughput of 1/2 is the same as a reciprocal throughput of 2. So, every other cycle. The compiler tries to schedule instructions from other dependency chains after an instruction with high latency. Latency and throughput numbers in CPU documentation assume the best case; there are a lot of hazards that can make them worse.

Here is an observation: instructions depend on each other. You cannot compute the length of a vector until you've fetched the vector from memory. You cannot divide it by its length if you haven't computed the length yet. Instructions form a dependency graph. The edge length in this graph is latency. The longest path is the dependency chain that limits how fast you can execute this graph. The latency becomes huge if you have a cache miss; it can delay the whole program. But at least if you have many short chains that can run in parallel, the CPU can reorder around the cache miss and get some work done. And if you have two misses in two parallel chains, chances are the CPU can fetch both cache lines at the same time. And that's huge savings. Even if you have a few short dependency chains but they all miss cache, the CPU will simply run out of reorder buffer, or reservation stations, or register renames, or some other resource, and stall. Sure, it'll wait for several cache lines at the same time, and for sure that's better than waiting for them one after another. But avoiding cache misses is still the #1 priority.

Say you fixed all your cache misses. Lots of short chains still work best. Optimizing compilers can schedule parallel dependency chains side by side, and they do it extremely well. Shorter-latency chains, and even individual instructions with shorter latency, are much easier to schedule. The compiler can reorder instructions better at compile time, and at runtime the reorder buffer has more instructions to choose from. This improves CPU throughput, because more instructions can issue in parallel. The more short chains in a linear block of code, the better they can schedule and utilize the CPU.

It was hard to hide latencies with the in-order CPUs of the last gen; some instructions had a latency of a dozen cycles. It's rarely an issue nowadays, with short latencies and out-of-order execution. But exploiting and optimizing for multi-issue capability is much more subtle. In some very limited sense, having multi-issue is a little bit like having a faster tick rate, with correspondingly slower latency and throughput. We still have to increase the instructions-per-clock metric, and avoid starving the CPU execution engine. But now it's very hard to analyze the behavior of the reorder buffer.
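None of this appears as code in the talk, but a minimal sketch of the "many short chains" idea is a reduction split across several independent accumulators; the function and variable names are illustrative.

```cpp
#include <cstddef>

// One long dependency chain: every add waits for the previous add's result,
// so the loop runs at roughly one add-latency per element.
float SumSerial(const float* v, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}

// Four short chains: the accumulators are independent, so their adds can be
// in flight at the same time and the loop is limited by throughput instead.
// (Note: floating-point reassociation can change the result slightly.)
float SumFourChains(const float* v, std::size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += v[i + 0];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i)   // leftover elements
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}
```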
Here's another fun little insight that a GPU engineer shared with me. What's the latency of an add instruction? Is it, like, 1 cycle? Well, it depends on your perspective. Real-world CPUs fetch bytes of memory, decode instructions, issue and reorder instructions, execute instructions, retire instructions, put things into complex queues between all those stages, and do a whole lot of other things. A CPU obviously cannot fetch, decode, reorder, execute and retire an instruction in a single cycle. It can literally take hundreds of cycles between when the CPU sees this tiny little add and when it retires it, or cancels it, as luck would have it. The CPU is heavily pipelined to do its very best to hide this dirty little secret from you: it's huge and latent. And this is what makes it possible to hide a lot of nasty stalls. This might also be why those prefetches you casually sprinkled around your code last time… had little measurable effect.

Which reminds me: prefetches are tricky. First of all, the prefetch instruction is treated as merely a recommendation by many CPUs. You can try to insert a memory load or compare instruction directly in place of a prefetch; then the CPU has no choice but to fetch the memory. But it's much better to structure the code and data for the hardware prefetcher, because it's hard to beat the hardware prefetcher. If your data is in a linear stream, the hardware prefetcher will work perfectly; software prefetches are unnecessary and sometimes even harmful in this case. If your data is random, prefetching early enough can sometimes help, but often it won't. For instance, cache misses are often overlapped with each other and with execution. If you fix just one of two cache misses that happen in parallel, the other one is still there, so it won't help performance.

There are valid situations when you don't have enough work per byte. In those cases, you want to maximize the system memory bandwidth, because if the CPU is idling, you don't want the memory bus to also be idling. If you memcopy slower than 8 GByte/s, there's room for improvement. And by memcopy I mean any task that is memory bound, and by 8 GByte/s I mean whatever the available bandwidth is. Until then, you are just wasting resources. Something is very wrong with brainless memcopy-like jobs: the CPU is essentially free when you're waiting for memory. You should do something with it! Sometimes you can compress data, or join two jobs together to use the CPU more. Tiling reduces memory traffic by utilizing cache better. On the flip side, memory bandwidth is free, too, when the CPU is too busy, but that doesn't happen too often. In short, I try to load-balance CPU and memory.

Prefetching is sort of like double-buffering: you fill (or prefetch) one buffer while working on the other. It works best when the CPU prefetches for you, and that works best when you use a couple of linear streams of data. It doesn't invalidate the simple fact of life: if you are eating bytes faster than they can be delivered, you'll starve the CPU.

My desktop computer is supposed to have an aggregate memory bandwidth of about 37 GB/s. But simply reading an array in memory goes at 13.5 GB/s, about 1/3 of the overall bandwidth. Why? The CPU has bottlenecks, or spare capacity, depending on your point of view. One thread with two streams can go at 18 GB/s, half of the system's bandwidth. Two threads with two streams each go to 25 GB/s, and 4 threads with 4 streams go up to 36 GB/s. It's OK to split data into a few streams; in fact, it may even be beneficial. But don't count on every thread having access to the full system bandwidth.
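As a hedged illustration of the prefetch discussion, here is a sketch of a gather loop that prefetches a few iterations ahead with `_mm_prefetch`. The data layout and look-ahead distance are made up; for a plain linear stream the hardware prefetcher makes this unnecessary, and the hint may be ignored or even hurt, so always measure.

```cpp
#include <cstddef>
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

// Gather values through an index table. Because the accesses into 'values'
// are effectively random, we request each line a few iterations early.
// LOOKAHEAD is a guess that has to be tuned by measurement.
float GatherSum(const float* values, const int* indices, std::size_t n)
{
    constexpr std::size_t LOOKAHEAD = 16;
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
    {
        if (i + LOOKAHEAD < n)
            _mm_prefetch(reinterpret_cast<const char*>(&values[indices[i + LOOKAHEAD]]),
                         _MM_HINT_T0);
        sum += values[indices[i]];
    }
    return sum;
}
```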
So what's the minimal bandwidth you should count on? Let's say we have the familiar low-end quad core, something like a Core2 Quad with DDR2 memory. It has 2 GB/s per core, which is exactly one byte per CPU cycle. You can usually run at least 1.5 instructions per cycle; in physics code, let's say that's one math instruction and half an address or logic instruction. That means that if I count the bytes, and count the math operations in my code, I want them to be in balance. If there's more data, I know I want to compress and pack it into fewer cache lines. It's OK to run more computations, but this is my ideal target. I try to keep it simple.

Writes normally don't stall on modern CPUs. The CPU can put a write into a queue, and the queue is long enough. If you try to load data right after a store, modern CPUs forward the data from the store queue. On the last-gen consoles this always caused a stall; not anymore. Unless you try to play tricks with unions and mix and match data of different sizes, you should have no problems. For example, if you combine two 16-bit integers into one by writing and then reading from memory, that may cause problems. It's a gotcha to be aware of. To write a byte that's not in cache, the CPU usually fetches the whole cache line, and only then writes the byte. So a memory write can cause a cache miss, or a cache line migration from core to core. It doesn't stall, but it still pollutes the cache. It may sometimes be cheaper to do a few checks and not write anything out, especially if another core may be accessing the same cache line. You don't want to steal cache lines from another core.

If your branch is random, it's fair to say it will mispredict half the time. So random branches are bad. All CPUs try to predict branches nowadays; some are more sophisticated than others. Sophisticated branch prediction is not for us PC game programmers, mainly because it differs from CPU to CPU, and we want our games to perform well on all of them. Every CPU predicts slightly differently, but they all have something in common: if you have consistent branches, you'll have consistent prediction. When your conditional changes is when you incur the misprediction penalty. After one or two passes it's predicted correctly every time. The two passes may be required because branch predictors are too smart: some will see an occasional mispredicted branch as a fluke, then they mispredict again, and only then switch their prediction. But these are details that distract from the big picture. You just don't want your branches to change direction often. A predicted branch costs you literally nothing, and if a branch always goes the same way, pretty much any modern CPU will predict it correctly.

Function calls and returns are very special kinds of branches. The return is a form of indirect jump, which is hard to predict, and function calls happen very often. That's why most modern CPUs have a hardware optimization called the Return Stack Buffer. The CPU keeps track of a few return addresses and predicts returns perfectly. So you shouldn't be afraid to call non-inlined functions; it's really cheap these days, at least if you don't do any trickery that throws off the prediction. But you should be aware of recursion deeper than a dozen calls. If you blow through the CPU's Return Stack Buffer depth, each return may cost you over a dozen cycles.
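The talk doesn't give code for the store-forwarding gotcha, but here is a hedged sketch of the kind of size-mixing pattern it warns about; the union and function are purely illustrative (and mixing access sizes through a union is also shaky C++ in its own right).

```cpp
#include <cstdint>

// Illustrative only: two narrow stores immediately reloaded as one wide value.
union PackedInput
{
    std::uint16_t halves[2];
    std::uint32_t whole;
};

std::uint32_t CombineHalves(std::uint16_t lo, std::uint16_t hi)
{
    PackedInput p;
    p.halves[0] = lo;   // 16-bit store
    p.halves[1] = hi;   // 16-bit store
    return p.whole;     // 32-bit load overlapping two pending narrow stores:
                        // many CPUs cannot forward this from the store queue,
                        // so the load waits for the stores to reach the cache.

    // Cheaper and well-defined: combine in registers instead, e.g.
    //   return std::uint32_t(lo) | (std::uint32_t(hi) << 16);
}
```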
Here is a specific example of how I measure performance and bandwidth. In a cloth solver I have bulky cloth elements moving 16 verts at a time. It takes 600 clock cycles to run ~1000 instructions. The data is 5 cache lines streamed (320 bytes), and ~8 cache lines scattered/gathered on top of that. That's ~46 cycles per cache line, a bit worse than 1 byte per cycle. I'm still ALU bound, which is good: optimizing the math actually gives a perf benefit, and perf is quite predictable. Unfortunately, it also means every extra math operation makes it a bit slower. I'm consuming memory at 4.7 GB/s per thread. That's enough to affect other threads in some limited cases. There's room for improvement, but I'm happy enough with these metrics and I move on, because my bottlenecks are elsewhere.

I was asked how I measured the stats from the previous slide. It's actually very low tech. Even though I love profiling stuff with fancy tools, I still measure my loops with the time stamp counter. Since I write my SIMD loops without branches, they always execute the same code path, so unless they stall randomly, the time per iteration is very repeatable. I just sum up all the cycles and all the iteration counts. This gives me the average cycles per iteration, and I can derive a lot of metrics from there. In this case I just carefully counted the cache lines by hand. It's not too hard, because I organize my data in large aligned blocks. The 8 random-access cache lines are there because I read and write 16 vertices per iteration, and I can only place pairs of those vertices next to each other in the same cache line. I assumed conservatively that all my data is cold, meaning not in cache. And I also assumed ultra-conservatively that I won't hit the same cache lines with the randomly accessed vertices. This is actually wrong, because I specifically optimize the order of computation to be cache friendly, so the memory traffic of 4.7 GB/s is probably overly pessimistic. For the instruction count I just copy-paste the assembly and count the lines. I should really count micro-ops, and I could do that with IACA.
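The talk describes the measurement but doesn't show it; here is a minimal sketch of timing a loop with the time stamp counter and accumulating cycles and iteration counts. `__rdtsc` comes from `<x86intrin.h>` on GCC/Clang (or `<intrin.h>` on MSVC); `SolveOneElement` is a stand-in name, not the actual solver.

```cpp
#include <cstddef>
#include <cstdint>
#include <x86intrin.h>   // __rdtsc (use <intrin.h> on MSVC)

// Stand-in for the real per-element solver kernel.
void SolveOneElement(std::size_t i);

struct LoopStats
{
    std::uint64_t cycles = 0;
    std::uint64_t iterations = 0;

    double AverageCyclesPerIteration() const
    {
        return iterations ? double(cycles) / double(iterations) : 0.0;
    }
};

void SolveAll(std::size_t count, LoopStats& stats)
{
    std::uint64_t t0 = __rdtsc();
    for (std::size_t i = 0; i < count; ++i)
        SolveOneElement(i);
    std::uint64_t t1 = __rdtsc();

    // Accumulate over many frames; the ratio gives average cycles/iteration,
    // from which cycles per cache line, GB/s, etc. can be derived by hand.
    stats.cycles += t1 - t0;
    stats.iterations += count;
}
```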
Out-of-order execution represents an interesting compromise. On one hand, it uses real estate on the chip (not a whole bunch, but some appreciable amount), so it reduces the raw throughput of the chip. But it also improves it, by making better use of the resources. In the real world, it's a net improvement. I don't want to spend my days poring over low-level engine routines. I want to be creative, and computers to handle the mundane tasks. Out-of-order does just that, and it does it to the whole program, not just the hotspot loops. So it's very useful.

So let me summarize what we've covered so far. Cache misses are expensive, and we saw just how expensive they are. You should always count cache lines, not bytes, for bandwidth computations. Out-of-order execution hides latency and pipelines everything, but we can help it by disentangling dependency chains. RAM is fast but far from the CPU; we can help by laying out data for the hardware prefetcher. And there's the 1-byte-per-op rule of thumb for math-intensive physics code.

What's missing from this picture? How about a weird little trick that will let you lose 32 bytes in a day or less? You know what, I miss the simplicity of the 8-bit times. A program used to always load at the same offset. Every variable was always at the same absolute address. You just baked the game binary, and when it loaded it was ready to play immediately. Not anymore. We allocate memory dynamically. We can't just bake pointers into the disk image. And we waste 8 bytes per pointer, even though we're really far from using all that address space.

This is a 10,000-foot strategic overview talk, but I wanted to share this technique we use at Valve. It's a simple templatized class that behaves like a pointer, but it's position-independent, and it's binary compatible between 32-bit and 64-bit code. We save the data to disk raw, with all the relative pointers, then load it at runtime, and it's ready to go. Very simple. No fix-ups or decoding. Same data in 32-bit and 64-bit. And it generally doesn't cost any more than raw pointers to use.

So how do we do it? There are two ways to address memory: directly, with an absolute address, and indirectly, relative to another address. A normal pointer is the absolute address. A relative pointer is a relative offset, from itself, in bytes. You can move a block of memory with relative pointers, save it and load it from disk, and it's still valid. An offset of 0 would point to the pointer itself; that's not very useful, so 0 is treated as the NULL pointer. We use this extensively to bake out all the physics collision data and load it at runtime, without multiple dynamic allocations or fixups. One natural limitation is that you have to move the piece of memory as one contiguous block; you can't fragment it. 32-bit offsets let you address +/- 2 GBytes.

Here's the implementation: we have an offset, and an overloaded operator -> that just computes the pointer from the relative offset. Note that this doesn't result in NULL when the offset is zero, so we assert in that case; but we do check for null in assignment. Just like with normal pointers, you shouldn't dereference a NULL.
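The slide's actual class isn't reproduced here, but a minimal sketch of such a relative pointer might look like the following. The name `offset_ptr` and the details are my own, not Valve's code.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a position-independent "relative pointer": stores a signed byte
// offset from its own address. Offset 0 is reserved as the NULL encoding.
template <typename T>
class offset_ptr
{
public:
    offset_ptr() : m_offset(0) {}

    offset_ptr& operator=(T* p)
    {
        // NULL is handled here, in assignment, by storing offset 0.
        m_offset = p ? std::int32_t(reinterpret_cast<std::intptr_t>(p) -
                                    reinterpret_cast<std::intptr_t>(this))
                     : 0;
        return *this;
    }

    T* get() const
    {
        // A zero offset would point at the pointer itself, which is never
        // what you want, so assert instead of silently returning garbage.
        assert(m_offset != 0);
        return reinterpret_cast<T*>(
            reinterpret_cast<std::intptr_t>(this) + m_offset);
    }

    T* operator->() const { return get(); }
    T& operator*() const  { return *get(); }
    explicit operator bool() const { return m_offset != 0; }

private:
    std::int32_t m_offset;   // 4 bytes, identical in 32- and 64-bit builds
};
```

In this sketch, the object holding the `offset_ptr` and its target must live (and be saved and loaded) inside the same contiguous block, which is exactly the "sandbox" constraint described next.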
So, to recap. In this big, huge address space you have a sandbox. It can be model data, the collision of the world, a nav mesh, or anything else that logically belongs together. The sandbox doesn't have to be small: with 32-bit offsets, you can easily address within 2 GBytes. You can also make the offset be in words, not bytes; then you make all your pointers 4-byte aligned and address +/- 8 GBytes. Anything can point to anything arbitrarily within the sandbox. Pointers are relative offsets and take 4 bytes. You can also make some offsets 2 or even 1 byte, and you can make the sandbox very small.

So this whole sandbox can be moved about arbitrarily. It can be dumped to disk and loaded from disk, and all the relative pointers remain valid. The beauty of this method is that you don't need to write any serialization code. There's no need for any reflection API. It's very simple. We usually don't want to move the memory while we are constructing the data; it's inconvenient. On the other hand, we don't know in advance how much memory we'll need. So we do a VirtualAlloc, guessing how much data we may ever need, and commit it incrementally as we allocate more data. It works very well in practice, except when you want to compile a 3 GB texture on a 32-bit machine.

Here's a little comparison of a linked list with normal pointers and with offsets. The declarations are very similar. The pointers that point inside the data block change to offsets; we use the template class for that. All the data fields remain the same. Alignment and size can shrink. We have to use an explicit allocator to stay in the sandbox; at Valve, we use a simple stack allocator. This makes the data layout predictable and explicit. The assignment operator handles the NULL case. In other respects, the behavior is the same. The runtime class is just a few lines of code. The tool-time allocator class is maybe 50 lines and has a call to VirtualAlloc, but it's still pretty simple. There's no runtime overhead: byte offsets cost nothing on modern CPUs. Code gen and register usage are slightly different, but it doesn't matter in practice. The format on disk is exactly the same as in memory; you can even memory-map files. And the data becomes more compact, because pointers aren't 8 bytes large anymore. That's good for cache.

There are some limitations, of course. There's no automatic versioning, and versioning is important for data that you cannot re-ship or recompile easily. In practice we use this for data that changes frequently and isn't easy to construct: we can rebuild it often, it can be an arbitrarily complex structure, and it's very efficient to load and use. The data structures must reside in a contiguous block of data, so it's less than ideal for constructing on the fly, but it is ideal for data you load from disk and never change. It's also an extra concept to learn for programmers new to the code base. That's not a trivial consideration in large teams.
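To make the linked-list comparison above concrete, here is a hedged sketch, reusing the hypothetical `offset_ptr` from earlier together with a trivial bump allocator standing in for the stack allocator the talk mentions; none of this is Valve's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// offset_ptr<T> is the relative-pointer sketch shown earlier in this article.

// Classic node: an 8-byte pointer on 64-bit, not relocatable, not safe to dump to disk.
struct ContactNodePtr
{
    float           impulse;
    ContactNodePtr* next;
};

// Sandbox node: a 4-byte offset, identical layout in 32- and 64-bit builds,
// still valid after the whole block is moved, saved, or memory-mapped.
struct ContactNodeOff
{
    float                      impulse;
    offset_ptr<ContactNodeOff> next;
};

// Trivial bump allocator standing in for the tool-time stack allocator:
// everything it hands out lives inside one contiguous sandbox buffer.
// (Alignment handling is omitted for brevity.)
class BumpAllocator
{
public:
    BumpAllocator(char* buffer, std::size_t size)
        : m_cur(buffer), m_end(buffer + size) {}

    template <typename T>
    T* Alloc()
    {
        assert(m_cur + sizeof(T) <= m_end);
        T* p = new (m_cur) T();
        m_cur += sizeof(T);
        return p;
    }

private:
    char* m_cur;
    char* m_end;
};
```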
So we talked about how the CPU works, and then about a weird little trick that can help you lose 32 bytes in a day or less. But this is a talk about strategy, after all, so what's missing is the bigger picture: some systemic approach, with strong theoretical underpinnings. I wonder, what would that be?

See, back in the day things were simple. The CPU took an instruction and executed it. It had a manual that listed all the instruction timings. I could predict with 100% certainty how fast anything would be. That world is now gone. Unfortunately, there is no established theory for designing globally optimal algorithms on modern hardware. There's well-developed asymptotic complexity analysis, the Big O notation, but it's often uninformative if your data isn't infinitely large. There are also glorified tips, tricks, and best practices that we learn, publish, and apply, but those have no established unifying theory. Optimization is still half art, half science, and a pinch of magic.

Obviously, any optimization effort should include complexity analysis, early: you fix the algorithm first, and count cycles later. CPUs work very non-linearly, though. Adding an instruction may literally cost zero, and the exact same operation may cost 200 cycles elsewhere. You can't just count instructions and be done. But nobody has repealed the fundamental laws. Here is my favorite complexity: nada. If I don't have to do something, I don't want to do it. Cheat it. Talk to the designer. Chances are that not everything you do is absolutely necessary. Another of my favorite complexities is O(1). For example, cache the results, hopefully with an O(1) lookup; memoization is another name for this trick. Do stuff once per frame. We were trying to optimize ragdoll creation in Left 4 Dead. After struggling with hitches for some time, we just queued up and staggered the ragdolls. That fixed a lot of hitches, and nobody noticed.
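The talk doesn't show the Left 4 Dead code; here is a hedged sketch of the general "queue and stagger expensive work" idea, with made-up names and a per-frame budget chosen arbitrarily.

```cpp
#include <cstddef>
#include <deque>

// Stand-in for whatever request data the real system would carry.
struct RagdollRequest { int entityId; };

// Hypothetical expensive creation function.
void CreateRagdollNow(const RagdollRequest& req);

class RagdollQueue
{
public:
    // Instead of creating ragdolls the moment entities die (and hitching),
    // requests are queued...
    void Push(const RagdollRequest& req) { m_pending.push_back(req); }

    // ...and serviced a few per frame, spreading the cost over time.
    void Update(std::size_t budgetPerFrame = 2)
    {
        for (std::size_t i = 0; i < budgetPerFrame && !m_pending.empty(); ++i)
        {
            CreateRagdollNow(m_pending.front());
            m_pending.pop_front();
        }
    }

private:
    std::deque<RagdollRequest> m_pending;
};
```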
Approximate what you can. Especially in game physics, it's all about approximating and simplifying the real-world physics into the ground. I'm sure we can all come up with all sorts of cheap tricks. But if nobody notices, that's not a cheap trick anymore; that's a valuable technique.

Array traversal and linked-list traversal are both O(N) complexity. Does that mean they're the same? They kind of look the same to me. A linked list is certainly very cache unfriendly, but even if it's 200 times slower, it's still O(N). Big O doesn't capture the constant, and on a modern CPU the constant is very relevant. In the old times of 8-bit CPUs there were no such nuances; it was, in fact, about as fast to iterate an array as a linked list. It was popular, and useful, to compare algorithms by the flop count. Today, that can be very misleading.

Here's an example of Big O being misleading. What should we use, a red-black tree or a hash table? Think about a simple lookup in a map with a thousand elements. In the classical variant of an RB-tree, you touch 10 to 20 cache lines. That's a lot of cache misses for a lookup. That's also a lot of branch mispredictions, unless your binary search is branchless, because the binary search branch is effectively random. Although if you insert in order, then traverse in order, it'll be cache friendly. In a classic hash table, you can have horrible hash collisions; they can make the worst-case lookup linear. But if you simply use a sensible hash and bucket count, you can expect to touch on average 3 or 4 cache lines, almost regardless of the number of elements. And the branches are inherently predictable. So what's more important: algorithmic complexity or cache misses? In the real world, the operation cost makes a difference. Should you use more CPU or more memory? It's a judgement call, a tradeoff between practical performance and theoretical or worst-case scalability or reliability. And we should pay more attention to the cache friendliness of our algorithms. Instructions actually take different numbers of ticks, and that's fine. But if you don't know whether a branch will take 1 tick or 20, or whether a register load will take 4 ticks or 20, then you can't make good decisions about your code. You just make wild guesses. Eliminating stalls makes performance predictable, and makes decisions informed.

So suppose your algorithm is rock-solid optimal, you have no cache misses because you prefetch everything perfectly, your instructions pipeline well, and you never have any stalls. And suppose you are computation bound. There are a lot of such cases during physics optimization. What next? SIMD is an obvious next stop. SIMD is by no means trivial to use. It imposes a lot of limitations: you have to have multiple things to process at the same time (no big revelation here), and computations in different SIMD lanes must not affect each other. Even if you have 4 things to do, it doesn't mean you can do them in parallel. If you have 4 contacts to solve, they may affect the same rigid bodies, so you may have to decide how to blend the results together.

SIMD doesn't have to be super hard, though. I'll share an easy workflow that I've used for the past 10 or so years. I start with a scalar path, keeping in mind the kinds of things SIMD can do. So I use no branches at this stage, for example, but I use reciprocal square roots, min, max and selects generously. And I process one thing at a time, but in a way that allows me to group things into groups of 4 or 8 independent items.
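As an illustration of that branch-free scalar style (not code from the talk), here is a sketch of a distance-constraint step written with min/max/select-style operations and a square-root reciprocal, so each line maps directly onto a SIMD equivalent later. The math is a generic stick constraint, not Valve's solver.

```cpp
#include <algorithm>   // std::min, std::max
#include <cmath>       // std::sqrt

struct Vec3 { float x, y, z; };

static inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static inline float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Pull p0 and p1 so their distance moves toward restLength.
// No branches: the divide-by-zero guard and the stiffness clamp are handled
// with max()/min(), so the same code can run unchanged in each SIMD lane.
void SolveStick(Vec3& p0, Vec3& p1, float restLength, float stiffness)
{
    Vec3  d      = p1 - p0;
    float lenSq  = std::max(Dot(d, d), 1e-12f);        // never exactly zero
    float invLen = 1.0f / std::sqrt(lenSq);             // rsqrt in the SIMD version
    float err    = 1.0f - restLength * invLen;           // relative length error
    float k      = std::min(std::max(stiffness, 0.0f), 1.0f) * 0.5f;  // clamp, split between ends
    Vec3  corr   = d * (err * k);
    p0 = p0 + corr;
    p1 = p1 - corr;
}
```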
One single rope solver, for example, would not be SIMDizable this way, because it only allows sequential computations. You need to be very deliberate with your data layout at the stage when you are planning for SIMD, or you'll have to load 4 separate elements and then shuffle them. That's more than 1 instruction per element, which is worse than the scalar case. In the near future we'll have special gather instructions, but I suspect they won't fix this problem completely, because fundamentally the CPU will still have to hunt down multiple pieces of data and shuffle them together. So it'll probably cost us latency. And when you save the data out, you have the same problem: you have to shuffle first and save the data element by element. So you can easily end up with SIMD code that's slower than scalar. You don't have to interleave all the data right away, but you need to think about it early; otherwise it will be hard to lay it out correctly later.

There's a glimmer of hope, though. When code is scheduled nice and tight, you're usually bound on only one pipe or issue port. In physics that's normally the arithmetic pipe: the U pipe on the older PowerPCs, or issue ports 0, 1 or 5 on newer Intels. And guess what? That leaves the other pipe or ports idling half the time. That means any additional instructions that can issue on the other pipes are essentially free, and those are normally the logical and shuffle pipes. Why is that interesting? Because even when you do math in SIMD at maximum throughput, you can still swizzle and transpose your data essentially for free on the other pipes, sometimes, as long as you don't do it too much. Of course, you should never rely on your code being scheduled ideally tight. So your input data should be swizzled so that you load 4 elements with one load instruction, as much as possible. You can't always pack the data that way, but you should do it as much as you can. For example, you can swizzle and pack static model data this way pretty much always. You'll have to duplicate data sometimes, but that's OK: RAM is cheap these days. It's the bandwidth that is at a premium.
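The talk doesn't include a layout example, but here is a minimal sketch of the "load 4 elements with one load" idea, contrasting an array-of-structures layout with a structure-of-arrays block; the names are illustrative.

```cpp
#include <xmmintrin.h>   // SSE: __m128, _mm_load_ps

// Array-of-structures: the x components of 4 particles are 16 bytes apart,
// so getting them into one SIMD register takes 4 loads plus shuffles.
struct ParticleAoS { float x, y, z, w; };

// Structure-of-arrays block: 4 particles interleaved by component, so one
// aligned 16-byte load brings in x0..x3 directly.
struct alignas(16) ParticleBlockSoA
{
    float x[4];
    float y[4];
    float z[4];
};

__m128 LoadFourX(const ParticleBlockSoA& block)
{
    return _mm_load_ps(block.x);   // x0, x1, x2, x3 in a single instruction
}
```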
Because the operations of loading, storing and shuffling can pipeline well with computations, some gather-scatter activity can hide behind the math. The 1-op-per-byte rule is handy here: if you limit how much you load, you also limit how much you shuffle. But it's still best to avoid it as much as you can. The inputs and outputs of a SIMDized algorithm are good candidates for gather-scatter, because it's usually unreasonable to build an API with pre-interleaved structures. But when I have full control over the design of a data structure, and the API is not an issue but performance is, I interleave the data as much as I can.

After I do this, I replace all operations with their SIMD equivalents. This is the part where I actually start using SIMD operations. Sometimes I templatize code and data by the float type. This function computes an approximate Givens rotation; if you pass float4 as the template argument, it will compute 4 Givens rotations. I wouldn't recommend it if you are not very familiar with SIMD. Also, there's a tool that does this for you; it's called ISPC. I take the scalar data structures, and I replace all types and operations with their SIMD equivalents: a float becomes 4 floats, an int becomes 4 ints, and so on. And it just works! This is an example of a scalar-path vs SIMD-path struct in a cloth solver. Of course, when you compute 4 of something in parallel, there had better be no dependencies between those 4 things. It takes some forethought. Some problems are notoriously hard, for example some matrix factorizations. And some are easy, like this one: here, one quad is solved per lane as a single block. Same with rods. I just make sure the quads and rods don't touch, that's all. Works like a charm.

So, to SIMDize something, you need 4 or 8 items that need similar processing, and they must be independent of each other. There's a myth I've heard: that SIMD is always a premature optimization. You shouldn't write SIMD right away, but when you design data structures and algorithms without SIMD in mind, it's hard to separate the data into independent, homogeneous lanes cleanly afterwards. If you don't do that cleanly, you have to waste time shuffling and preparing data, which destroys the benefits of SIMDizing.

There's no easy recipe for SIMDizing 4 things with differing computations, but here are a couple of recipes for handling branches. The first is the most commonly used, and it's what happens on GPUs: if you have a condition, you execute both the "if" and "else" branches for all elements, and then select one of the results. Everything that's not selected is wasted computation, but it can easily be faster than running the scalar path with branches. Here's another branch trick: you just sort the elements by the conditional, one list for "if" and one list for "else". You can even SIMD-sort them 8 at a time, without branches. Then you run the "if" and "else" code only on the elements that need it. No wasted computations. Obviously, the same caveats about ordering and dependent elements apply. This is harder to pull off, but it's a much more powerful technique.

You'll often get almost 4x perf with SIMD, or 8x in the case of AVX. That is, if your problem is separable, and if you lay out your data right, and if you plan for SIMD ahead of time, both code and data. SIMDization is very problem dependent. Compilers are very bad at auto-vectorizing nontrivial code. It's very hard to SIMDize something that wasn't meant for it, or if the problem is not separable by itself. But 2/3 of CS:GO players have CPUs with AVX support, so the future is bright for SIMD techniques.
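As a hedged sketch of the first branch recipe (compute both sides, then select), here is an SSE version of a per-element "clamp or scale" decision; the operation itself is invented for illustration.

```cpp
#include <xmmintrin.h>   // SSE
#include <smmintrin.h>   // SSE4.1: _mm_blendv_ps

// Scalar logic, one element at a time:
//   out = (x > limit) ? x * damping : x;
// SIMD version: evaluate both results for all 4 lanes, then blend by the mask.
__m128 DampAboveLimit(__m128 x, __m128 limit, __m128 damping)
{
    __m128 mask   = _mm_cmpgt_ps(x, limit);     // all-ones where x > limit
    __m128 damped = _mm_mul_ps(x, damping);     // "if" side, computed for every lane
    return _mm_blendv_ps(x, damped, mask);      // pick damped where the mask is set, x elsewhere
    // Without SSE4.1, the same select is: or(and(mask, damped), andnot(mask, x)).
}
```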
I'd like to extend special thanks to my art team: 3D art by Anna Bibikova and 2D art by Heather Campbell. Thank you very much!


