Intel ARCHITECTURE IA-32 - Software Prefetching Usage Checklist


IA-32 Intel® Architecture Optimization


The performance loss caused by poor utilization of resources can be
completely eliminated by scheduling the prefetch instructions correctly.
As shown in Figure 6-3, prefetch instructions are issued two vertex
iterations ahead. This assumes that only one vertex is processed per
iteration and that each iteration needs a new data cache line. As a
result, when iteration n (vertex V_n) is being processed, the requested
data has already been brought into cache. In the meantime, the
front-side bus is transferring the data needed for iteration n+1
(vertex V_{n+1}). Because there is no dependence between the V_{n+1}
data and the execution of V_n, the latency of the data access for
V_{n+1} can be entirely hidden behind the execution of V_n. Under such
circumstances, no “bubbles” are present in the pipelines and thus the
best possible performance can be achieved.
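
The two-iterations-ahead schedule described above can be sketched in C
roughly as follows. This is only an illustration under stated
assumptions, not the code behind Figure 6-3: the Vertex layout, the
64-byte padding, the PREFETCH_T0 macro, and process_vertices are
hypothetical names chosen for the example, and the summation merely
stands in for real per-vertex work. The _mm_prefetch intrinsic comes
from <xmmintrin.h>; a compiler builtin is used as a fallback on other
targets.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#elif defined(__GNUC__)
#define PREFETCH_T0(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_T0(p) ((void)0)   /* no prefetch available: plain loop */
#endif

/* Hypothetical vertex layout, padded to 64 bytes so that each iteration
   touches exactly one new cache line (line size is an assumption). */
typedef struct {
    float x, y, z, w;
    float pad[12];
} Vertex;

enum { PREFETCH_DISTANCE = 2 };    /* iterations ahead */

/* Sums x + y + z over all vertices; a stand-in for real vertex work. */
float process_vertices(const Vertex *v, size_t count)
{
    float acc = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        /* Request the data for V_{i+2} while V_i is being processed. */
        if (i + PREFETCH_DISTANCE < count)
            PREFETCH_T0(&v[i + PREFETCH_DISTANCE]);
        acc += v[i].x + v[i].y + v[i].z;
    }
    return acc;
}
```

The distance of two is taken from the description above; the right
value for a given loop depends on memory latency and on how much work
each iteration performs.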

Prefetching is useful for inner loops that have heavy computations, or
that are close to the boundary between being compute-bound and
memory-bandwidth-bound.

Prefetching is probably not very useful for loops that are predominantly
memory-bandwidth-bound.

When data is already located in the first-level cache, prefetching can
be useless and could even degrade performance, because the extra µops
either back up waiting for outstanding memory accesses or may be dropped
altogether. This behavior is platform-specific and may change in the
future.

Software Prefetching Usage Checklist

The following checklist covers the issues that need to be addressed or
resolved to use the software prefetch instruction properly:

• Determine software prefetch scheduling distance

• Use software prefetch concatenation

• Minimize the number of software prefetches

• Mix software prefetch with computation instructions

• Use cache blocking techniques (for example, strip mining)
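
As a rough illustration of the last item, strip mining splits a
multi-pass loop into strips small enough to stay in cache, so that the
second pass finds its data already cached instead of fetching it from
memory again. The sketch below is a minimal example under assumptions:
the function name scale_then_sum, the two passes, and the strip length
STRIP are hypothetical, and a suitable strip length depends on the
cache size of the target processor.

```c
#include <stddef.h>

enum { STRIP = 1024 };   /* hypothetical strip length, in elements */

/* Pass 1 scales each element; pass 2 sums the scaled values.
   Both passes run on one strip before moving to the next, so pass 2
   reuses data that pass 1 has just brought into cache. */
float scale_then_sum(float *data, size_t count, float factor)
{
    float sum = 0.0f;
    for (size_t base = 0; base < count; base += STRIP) {
        /* Last strip may be shorter than STRIP. */
        size_t end = (count - base < STRIP) ? count : base + STRIP;

        for (size_t i = base; i < end; ++i)   /* pass 1 on this strip */
            data[i] *= factor;

        for (size_t i = base; i < end; ++i)   /* pass 2 on the same strip */
            sum += data[i];
    }
    return sum;
}
```

Without the outer strip loop, a large array would be streamed through
the cache once per pass.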
