Cache Coding Techniques
CS 233 · Loop optimization strategies — loop fusion, fission, inversion, tiling, and prefetching
Loop Fusion
Merge two loops over the same range into one to exploit temporal locality
B = number of elements of A that fit in one cache block
for (int j = 0; j < LARGE; j++) {
sum += A[j];
}
for (int j = 0; j < LARGE; j++) {
product *= A[j];
}Why it works
Both loops access A[j] with the same stride. In the original, A[j] is loaded for the first loop, evicted by the time the second loop runs (for large arrays), then loaded again. Fusing them means each cache line is loaded once and both sum and product are updated before the line leaves cache.
Loop Fission
Split a loop body into two separate loops to reduce working-set size in the inner loop
B = number of elements of A that fit in one cache block
for (int j = 0; j < LARGE; j++) {
sum += A[j];
for (int k = 0; k < LARGE; k++) {
other_sum += B[j][k];
}
}Why it works
In the original, the inner B[j][k] loop sweeps a large 2-D array on every outer iteration, evicting the recently loaded A[j] from cache before the next outer iteration can reuse it. Splitting the loops gives A[j] its own pass where each cache line is fully used before moving on.
Loop Inversion
Swap inner and outer loops to turn a high-stride access into a unit-stride access
B = number of elements of a per cache block
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
sum += a[j];
}
}Why it works
In the original, the outer i loop iterates m times, and each time the inner j loop reloads every element of a[j] from scratch (it gets evicted between outer iterations). After inversion, a[j] is loaded once per j step and kept in cache for all m inner iterations.
When to use
- ✓You have nested loops.
- ✓Incrementing the inner loop has a large stride, but the outer loop would have a small stride if swapped.
- ✓Common cases: column traversals of 2-D arrays, struct arrays where the inner loop traverses the array instead of individual struct fields.
Tiling (Blocking)
Process data in cache-sized tiles to exploit temporal locality across loop iterations
When to use
- ✓You are re-using the same data multiple times. Tiling does NOT help if there is no data re-use.
- ✓The re-use lacks temporal locality — data is evicted before it can be reused.
- ✓Other techniques (fusion, fission, inversion) are insufficient to create temporal locality.
- ✓Watch for column and row traversals in the same innermost loop (or equivalent struct patterns).
Choosing tile size
- ✓All active tiles must fit inside the cache until you are done with them.
- ✓Make the tile as large as possible given the above constraint.
- ✓Remember to account for the size of your data type when calculating tile sizes.
1-D Tiling
Tile over one dimension to keep a chunk of a[j] resident while iterating over b[i].
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
sum += a[j] / b[i];
}
}Multi-Dimensional Tiling
Matrix multiplication: tile over i, j, and k so that TILE_SIZE × TILE_SIZE sub-matrices of a, b, and c all fit in cache simultaneously.
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++)
c[i*N+j] += a[i*N+k] * b[k*N+j];Prefetching
Issue a cache load early so the data arrives before it is needed
When to use
- ✓Other techniques cannot further reduce your miss rate.
- ✓Your access pattern is predictable.
- ✓The number of future predictable accesses is large.
Guidelines
- ✓A cache miss costs 100–200 clock cycles. Prefetch far enough ahead that the miss resolves before your code catches up.
- ✓Do not prefetch so far ahead that you exceed the entire cache size.
- ✓Prefetch the address of the element, not the element itself — use
&A[i+k].
Syntax
void __builtin_prefetch(const void *addr);Example
/* Prefetch element 8 iterations ahead */
for (int i = 0; i < N; i++) {
__builtin_prefetch(&A[i + 8]);
sum += A[i];
}__builtin_prefetch(A[i+8]) instead of __builtin_prefetch(&A[i+8]). You must pass the address of the element, not the element value.