Cache Coding Techniques

CS 233 · Loop optimization strategies — loop fusion, fission, inversion, tiling, and prefetching

Loop Fusion

Merge two loops over the same range into one to exploit temporal locality

Before2 × LARGE / B

→

AfterLARGE / B

B = number of elements of A that fit in one cache block

for (int j = 0; j < LARGE; j++) {
  sum += A[j];
}
for (int j = 0; j < LARGE; j++) {
  product *= A[j];
}

2 × LARGE/B misses

Why it works

Both loops access A[j] with the same stride. In the original, A[j] is loaded for the first loop, evicted by the time the second loop runs (for large arrays), then loaded again. Fusing them means each cache line is loaded once and both sum and product are updated before the line leaves cache.

Loop Fission

Split a loop body into two separate loops to reduce working-set size in the inner loop

BeforeLARGE misses (A accessed with stride 1 but thrashed by inner B loop)

→

AfterLARGE / B misses for A

B = number of elements of A that fit in one cache block

for (int j = 0; j < LARGE; j++) {
  sum += A[j];
  for (int k = 0; k < LARGE; k++) {
    other_sum += B[j][k];
  }
}

LARGE misses on A

Why it works

In the original, the inner B[j][k] loop sweeps a large 2-D array on every outer iteration, evicting the recently loaded A[j] from cache before the next outer iteration can reuse it. Splitting the loops gives A[j] its own pass where each cache line is fully used before moving on.

Loop Inversion

Swap inner and outer loops to turn a high-stride access into a unit-stride access

Beforem × (n / B) misses

→

Aftern / B misses

B = number of elements of a per cache block

for (int i = 0; i < m; i++) {
  for (int j = 0; j < n; j++) {
    sum += a[j];
  }
}

m × n/B misses

Why it works

In the original, the outer i loop iterates m times, and each time the inner j loop reloads every element of a[j] from scratch (it gets evicted between outer iterations). After inversion, a[j] is loaded once per j step and kept in cache for all m inner iterations.

When to use

✓You have nested loops.
✓Incrementing the inner loop has a large stride, but the outer loop would have a small stride if swapped.
✓Common cases: column traversals of 2-D arrays, struct arrays where the inner loop traverses the array instead of individual struct fields.

Tiling (Blocking)

Process data in cache-sized tiles to exploit temporal locality across loop iterations

When to use

✓You are re-using the same data multiple times. Tiling does NOT help if there is no data re-use.
✓The re-use lacks temporal locality — data is evicted before it can be reused.
✓Other techniques (fusion, fission, inversion) are insufficient to create temporal locality.
✓Watch for column and row traversals in the same innermost loop (or equivalent struct patterns).

Choosing tile size

✓All active tiles must fit inside the cache until you are done with them.
✓Make the tile as large as possible given the above constraint.
✓Remember to account for the size of your data type when calculating tile sizes.

1-D Tiling

Tile over one dimension to keep a chunk of a[j] resident while iterating over b[i].

for (int i = 0; i < M; i++) {
  for (int j = 0; j < N; j++) {
    sum += a[j] / b[i];
  }
}

poor temporal reuse on a[]

Multi-Dimensional Tiling

Matrix multiplication: tile over i, j, and k so that TILE_SIZE × TILE_SIZE sub-matrices of a, b, and c all fit in cache simultaneously.

for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      c[i*N+j] += a[i*N+k] * b[k*N+j];

full N³ reloads

Prefetching

Issue a cache load early so the data arrives before it is needed

When to use

✓Other techniques cannot further reduce your miss rate.
✓Your access pattern is predictable.
✓The number of future predictable accesses is large.

Guidelines

✓A cache miss costs 100–200 clock cycles. Prefetch far enough ahead that the miss resolves before your code catches up.
✓Do not prefetch so far ahead that you exceed the entire cache size.
✓Prefetch the address of the element, not the element itself — use &A[i+k].

Syntax

void __builtin_prefetch(const void *addr);

Example

/* Prefetch element 8 iterations ahead */
for (int i = 0; i < N; i++) {
  __builtin_prefetch(&A[i + 8]);
  sum += A[i];
}

Common mistake: writing __builtin_prefetch(A[i+8]) instead of __builtin_prefetch(&A[i+8]). You must pass the address of the element, not the element value.