Prefetching in memory-intensive applications
In the previous post we looked at various, sometimes intrusive and complicated, methods of compacting data structures to improve memory and cache usage. Today we will continue with another method of improving performance for data-intensive applications: prefetching.
Data flow in a computer system
Before showing examples, let's think about how and when data actually moves from memory to the CPU. Imagine a very simplistic example: calculating the sum of array elements.
uint64_t data[64000];
uint64_t sum = 0;
for (int i = 0; i < 64000; i++) {
    sum += data[i];
}
The simplest system has memory functionally connected directly to the CPU, which means data is loaded and stored between processor registers and memory. On such a system, memory will be accessed every time data[i] is referenced. In a modern CPU that would be a disaster, as access to RAM is slow, in the range of ~100ns (where one CPU cycle is ~0.4ns). Fortunately, most CPUs, including small embedded ones, are more sophisticated than that.
To improve performance, back in the ’80s mainstream computer systems were enhanced with a cache, in most cases many levels of it. A cache is a very fast memory sitting between the CPU and RAM, usually physically integrated into the CPU. On systems equipped with a cache, when reading a value from memory into a register, a chunk of data near the requested address is loaded into the cache. This chunk has a fixed size (almost always 64 bytes), is aligned to its size, and is called a cache line. Whenever accessing data from this chunk, there is no need to read it from main memory, but rather from the much faster cache. In our example, where data is accessed in 64-bit (8-byte) chunks, the CPU will have to wait on a memory access only every 8th iteration.
This design was enhanced even further with the introduction of speculative prefetching.
This idea is based on the fact that, in most cases, data is accessed predictably and can be read into the cache ahead of time. The simplest case of predictability is sequential, linear memory access, exactly as in our example. Any relatively modern CPU will notice that data is accessed linearly and read a few consecutive cache lines ahead, so they are available when needed. This way, in our example there will be almost no memory stalls (at least in theory). This process of reading data from RAM into the cache ahead of time is called prefetching. Predicting which cache line to read next is not limited to sequential access; it also works well for strided access, where every Nth element in memory is processed. There is much more to that topic, like multiple cache levels, cache policies, and cache coherence protocols, but describing those details would be out of scope for this article.
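For illustration, a strided pattern could look like the following sketch (a hypothetical Record struct, not code from the benchmark): the loop touches memory with a constant 64-byte stride, which the hardware prefetcher can recognize just as easily as purely linear access.

struct Record {
    uint64_t key;
    uint64_t padding[7]; // pads the struct to 64 bytes, one cache line
};

uint64_t SumKeys(const Record* data, size_t count) {
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += data[i].key; // constant-stride access, prefetcher-friendly
    }
    return sum;
}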
Practical examples
Prefetching descriptors
Let’s get back to our example of processing packet descriptors from the previous article. We will use the structure with the best overall performance profile, PacketDescriptorPacked, and try to improve it further with prefetching. Prefetching of an arbitrary memory location can be done using __builtin_prefetch, whose details can be found in the GCC documentation. The only obligatory argument is a pointer to the memory location requested for prefetching. Of course, not only the exact address will be loaded, but an entire 64-byte cache line. The pointer can be invalid or null, which is useful because in most cases bounds checking can be skipped, saving a few CPU cycles. There are two additional arguments to __builtin_prefetch, but those are not important here; we will come back to them in the bonus section.
Now let’s try to add prefetching to the benchmark loop. How much data should we prefetch ahead of time? In theory, this depends on many factors like cache size, memory latency, and the time needed to process one cache line's worth of data. This is very difficult to evaluate properly, so the best option is to just try multiple values and measure.
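A minimal sketch of how this could look in the loop (function and field names as well as the prefetch distance are placeholders; the actual benchmark code in the repository may differ):

constexpr size_t kPrefetchDistance = 16; // value to tune by measurement

uint64_t ProcessDescriptorsSketch(const PacketDescriptorPacked* descriptors,
                                  size_t count) {
    uint64_t checksum = 0;
    for (size_t i = 0; i < count; i++) {
        // Request the descriptor kPrefetchDistance elements ahead; the address
        // may point past the end of the array, which __builtin_prefetch tolerates.
        __builtin_prefetch(&descriptors[i + kPrefetchDistance]);
        checksum += descriptors[i].length; // placeholder for the real per-descriptor work
    }
    return checksum;
}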
The performance improvement was moderate at most. The reason was already mentioned before: the CPU is smart enough to do the prefetching for us. The access pattern in the benchmark is completely linear, which is extremely easy to predict. Data is being prefetched transparently, and probably the only reason for the observed improvement is the hardware prefetcher being too conservative and not prefetching more than 1 kB ahead. We can also observe performance degradation for small values, which is an important indication that prefetching is not free and introduces some overhead: e.g. prefetching data already in the cache is a no-op that wastes CPU cycles.
Prefetching payload
Another example applicable to working with network packets is iterating over packet descriptors while also processing the payload pointed to by one of the descriptor fields. This has a dramatic impact on the memory access pattern: descriptors are placed back to back, one after another, but payloads can be in completely different memory locations, not adjacent to one another. This pattern is impossible to predict and leaves more room for performance improvements.
A benchmark simulating this kind of processing is implemented in the ProcessDescriptorsPayloadRead function. The algorithm is completely trivial: iterate over the descriptors and sum up the first word of each packet's payload. Having learned from the previous example, we are not prefetching descriptors, only payloads, for a varying number of packets ahead. The CPU has no way of knowing our data structures, so issuing prefetch instructions gives it a hint about which data will be needed soon.
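A sketch of the idea (field and function names are placeholders; the real implementation in the repository may differ):

uint64_t SumPayloadFirstWords(const PacketDescriptorPacked* descriptors,
                              size_t count, size_t prefetch_distance) {
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        // Prefetch the payload of a packet we will process soon. A bounds check
        // is needed here because we dereference the descriptor to obtain the
        // payload pointer; the prefetch itself would tolerate any address.
        if (i + prefetch_distance < count) {
            __builtin_prefetch(descriptors[i + prefetch_distance].payload);
        }
        // Process the current packet: sum up the first 64-bit word of its payload.
        sum += *reinterpret_cast<const uint64_t*>(descriptors[i].payload);
    }
    return sum;
}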
Surprisingly, there was even less performance improvement than previously, and performance degraded when prefetching more than 256 packets ahead. This is because the time spent processing data is so minuscule compared to the time needed to load (or prefetch) it that memory bandwidth becomes the limiting factor. No amount of prefetching can improve that significantly: the memory bus is busy almost all the time, and giving hints about which data will be needed is not going to help. For prefetching to make sense and improve anything, it needs a chance of running “in the background”. There has to be some period when the CPU is processing the current batch of data so that future data can be prefetched.
Prefetching payload with heavy processing
A modified example implemented in ProcessDescriptorsPayloadReadHeavy simulates more time-consuming data processing. Now there are periods where the CPU is busy crunching data already available in the cache, and future data can be loaded in the background.
Here the performance improvement is much more significant. Performance went from 5 mpps when not prefetching to 11 mpps when prefetching 4 packets ahead. For such a simple, safe and localized change, this is an impressive result. It shows prefetching as a powerful performance improvement for memory-bound applications. There are some prerequisites, like a certain processing-to-memory-access ratio and keeping the prefetching logic as minimal as possible so that our gains are not diminished by it, but overall this is a good and relatively simple optimization technique.
Bonus: optional arguments
__builtin_prefetch has two additional parameters: rw and locality.
rw indicates whether the prefetched memory is going to be read or written. This sounds simple in theory, but it is more complicated to make it work in practice. One issue is that only the most modern CPUs support it; according to the GCC manual, Intel has supported it since the Broadwell architecture. So there is a need either for targeting the application only at the latest CPUs or for CPU architecture dispatching. The second, more complex issue is that it is hard to find a case where write prefetching makes any difference. At first glance it seems that when loading data into the cache, it makes no difference whether the CPU is going to read or write it later. But there are some cases in multi-core and especially multi-socket systems where it can make quite a lot of difference. It mostly concerns invalidating the cache on remote CPUs, so that when writing to the cache there is no need for a cache sync-up between CPUs. This is not easy to simulate and this topic requires more study on my side.
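A minimal sketch of how the rw argument could be used (field names and the prefetch distance are placeholders; the argument must be a compile-time constant, 0 for read and 1 for write):

void MarkDescriptors(PacketDescriptorPacked* descriptors, size_t count) {
    constexpr size_t kPrefetchDistance = 16; // value to tune by measurement
    for (size_t i = 0; i < count; i++) {
        // Second argument: 1 = prefetch for write. The hint lets the cache line
        // be requested in a writable state up front, avoiding a separate
        // coherence transaction when the store happens later.
        __builtin_prefetch(&descriptors[i + kPrefetchDistance], 1);
        descriptors[i].flags |= 1; // placeholder read-modify-write of a field
    }
}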
As for locality, it is supposed to indicate a degree of temporal locality. A value of zero means that the data has no temporal locality; higher values mean more temporal locality. This probably has something to do with the cache level the data is loaded into, but let's just try it out and see what difference it makes.
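Since locality, like rw, must be a compile-time constant, one way to sweep over its values (a sketch, not necessarily how the benchmark in the repository does it) is to parametrize the prefetch with a template:

template <int kLocality> // must be a compile-time constant between 0 and 3
inline void PrefetchPayload(const void* payload) {
    __builtin_prefetch(payload, /*rw=*/0, kLocality);
}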
                 LOCALITY
PREFETCHING   0    1    2    3
0            62   62   62   62
1            57   57   57   57
2            58   58   58   58
4            59   59   59   59
8            61   61   61   61
16           63   63   63   63
32           62   62   62   62
64           62   62   62   62
128          62   62   62   62
256          60   60   61   60
384          60   60   60   60
512          58   58   58   58
768          52   52   52   52
1024         47   47   47   47
There is no measurable difference for any value of locality, so either it requires some special conditions to work or it is just not functional for this CPU architecture.
Tracing memory performance
It is not trivial to correctly identify and solve memory bottlenecks, even in simple applications. How can we know that our application is limited by memory bandwidth? Which changes reduce memory stalls? Fortunately, the CPU provides various hardware counters for many types of events, like memory reads, reads served by a cache (cache hits), and reads served by main memory (cache misses). Those counters can be accessed programmatically, as instrumentation in code, e.g. for performance-critical parts of the application, or by using the perf stat tool.
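Programmatic access on Linux typically goes through the perf_event_open system call; the following is only a rough sketch of the idea (not code from this article's repository), with error handling omitted:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstring>
#include <cstdint>
#include <cstdio>

// Open one hardware counter (e.g. PERF_COUNT_HW_CACHE_MISSES) for this process.
static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    attr.disabled = 1;       // start stopped; enable only around the hot code
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return static_cast<int>(syscall(SYS_perf_event_open, &attr,
                                    0 /* this process */, -1 /* any CPU */,
                                    -1 /* no group */, 0));
}

void MeasureCacheMisses(void (*hot_code)()) {
    int fd = open_counter(PERF_COUNT_HW_CACHE_MISSES);
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    hot_code();              // the performance-critical section to measure
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cache_misses = 0;
    read(fd, &cache_misses, sizeof(cache_misses));
    printf("cache-misses: %llu\n", static_cast<unsigned long long>(cache_misses));
    close(fd);
}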
Running perf stat requires specifying which counters we are interested in and the application we want to measure:
perf stat -e cache-misses,cache-references,cycles,instructions,L1-dcache-loads,L1-dcache-misses,LLC-loads,LLC-load-misses ./PacketDescriptorsPrefetching
Here we have specified counters covering various levels of the data cache, as well as ones that can show instructions per cycle.
Results without payload prefetching:
Prefetching 0 pointers ahead
Payload read processing performance long: 4 mpps
2191257897 cache-misses # 78.771 % of all cache refs (49.99%)
2781792905 cache-references (50.00%)
364710082877 cycles (50.01%)
236285106758 instructions # 0.65 insn per cycle (62.51%)
38528300611 L1-dcache-loads (62.51%)
3000142952 L1-dcache-misses # 7.79% of all L1-dcache hits (62.51%)
1209407796 LLC-loads (62.50%)
670348079 LLC-load-misses # 55.43% of all LL-cache hits (62.49%)
Results with prefetching:
Prefetching 4 pointers ahead
Payload read processing performance long: 9 mpps
2156820650 cache-misses # 78.719 % of all cache refs (50.00%)
2739896913 cache-references (50.00%)
176376504963 cycles (50.01%)
239466479151 instructions # 1.36 insn per cycle (62.51%)
39593265565 L1-dcache-loads (62.51%)
2957529169 L1-dcache-misses # 7.47% of all L1-dcache hits (62.50%)
574832230 LLC-loads (62.49%)
48264412 LLC-load-misses # 8.40% of all LL-cache hits (62.49%)
Somewhat surprisingly, the overall cache-misses and L1-dcache-misses stay roughly the same. But there is a clear difference in LLC-load-misses (LLC stands for last level cache, so L3 on this particular CPU). The number of LLC load misses, i.e. events where data was not present in any cache level and had to be read from main memory, has decreased drastically. The improved insn per cycle also shows that the CPU was able to spend more time executing code and less time waiting on memory.
Summary
Hopefully, this post has shown that prefetching is a viable technique for improving performance in memory-intensive applications. Some knowledge about the CPU and cache architecture, as well as some profiling tools, can be very helpful in chasing bottlenecks. It is also worth remembering that prefetching is very architecture-dependent and volatile: what works well on one CPU can behave differently on another. It is always important to know the range of architectures our application will run on and, after making changes based on an educated guess, to measure the results. Things like speculative prefetching in the CPU, proper timing, and multi-core systems can make it tricky.
All the code used to write this article can be found, as usual, on my GitHub.