Intel ARCHITECTURE IA-32 - Avoid Excessive Software Prefetches; Improve Effective Latency of Cache Misses

IA-32 Intel® Architecture Optimization

Avoid Excessive Software Prefetches

Pentium 4 and Intel Xeon processors have an automatic hardware
prefetcher. It can bring data and instructions into the unified
second-level cache based on prior reference patterns. In most situations,
the hardware prefetcher is likely to reduce system memory latency
without explicit intervention from software prefetches. It is also
preferable to adjust data access patterns in the code to take advantage of
the characteristics of the automatic hardware prefetcher to improve
locality or mask memory latency. Using software prefetch instructions
excessively or indiscriminately will inevitably cause performance
penalties, because doing so wastes the command and data bandwidth of the
system bus.
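As an illustration of adjusting data access patterns, the following minimal sketch in C contrasts two traversals of the same matrix. The function names and the flat row-major layout are assumptions made for this example and do not come from the manual. The first loop reads consecutive addresses, a regular pattern of the kind a hardware prefetcher can follow. The second strides by a full row on every access, so it touches a different part of memory each time.

```c
#include <stddef.h>

/* Hypothetical helper: the matrix is stored row-major in a flat array,
   so element (r, c) lives at m[r * cols + c]. */

/* Unit-stride traversal: consecutive iterations read adjacent
   addresses, giving a simple forward reference pattern. */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Large-stride traversal: consecutive iterations are
   cols * sizeof(int) bytes apart, so locality is poor
   when a row is larger than a cache line. */
long sum_col_major(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}
```

Both functions return the same sum; only the order of memory references differs. Which one runs faster on a given processor, and by how much, should be confirmed by measurement.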

Software prefetches also delay the hardware prefetcher from starting
to fetch data needed by the processor core. They consume critical
execution resources and can result in stalled execution. The guidelines
for using software prefetch instructions are described in Chapter 2. The
techniques for using the automatic hardware prefetcher are discussed in
Chapter 6.

User/Source Coding Rule 28. (M impact, L generality) Avoid excessive use
of software prefetch instructions and allow the automatic hardware
prefetcher to work. Excessive use of software prefetches can significantly
and unnecessarily increase bus utilization.
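The rule above can be sketched in C as follows. This is an illustration under stated assumptions, not code from the manual: it uses the GCC/Clang builtin `__builtin_prefetch` in place of a specific prefetch instruction, assumes a 64-byte cache line, and picks an arbitrary prefetch distance of 8 lines. The function names are hypothetical.

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines; adjust for the target processor. */
#define CACHE_LINE_BYTES 64
#define INTS_PER_LINE (CACHE_LINE_BYTES / sizeof(int))

/* Assumption: illustrative distance only; tune by measurement. */
#define PREFETCH_AHEAD_LINES 8

/* No software prefetch: the unit-stride loop is left to the
   automatic hardware prefetcher. */
long sum_plain(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Excessive: one prefetch per element, so most prefetches target
   a cache line that an earlier prefetch already requested. */
long sum_prefetch_every_element(const int *a, size_t n)
{
    long total = 0;
    const size_t ahead = PREFETCH_AHEAD_LINES * INTS_PER_LINE;
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            __builtin_prefetch(&a[i + ahead], 0, 3);
        total += a[i];
    }
    return total;
}

/* Restrained: one prefetch per line-sized block of elements. */
long sum_prefetch_per_line(const int *a, size_t n)
{
    long total = 0;
    const size_t ahead = PREFETCH_AHEAD_LINES * INTS_PER_LINE;
    for (size_t i = 0; i < n; i += INTS_PER_LINE) {
        if (i + ahead < n)
            __builtin_prefetch(&a[i + ahead], 0, 3);
        size_t end = (i + INTS_PER_LINE < n) ? i + INTS_PER_LINE : n;
        for (size_t j = i; j < end; j++)
            total += a[j];
    }
    return total;
}
```

All three functions compute the same result. For a sequential scan like this one, `sum_plain` is usually the place to start, since the access pattern is already regular. The per-line variant only shows how to bound the number of prefetches issued if measurement indicates that software prefetch is needed at all.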

Improve Effective Latency of Cache Misses

System memory access latency due to cache misses is affected by bus
traffic. This is because bus read requests must be arbitrated along with
other requests for bus transactions. Reducing the number of outstanding
bus transactions helps improve effective memory access latency.

One technique to improve effective latency of memory read transactions
is to use multiple overlapping bus reads to reduce the latency of sparse
reads. In situations where there is little locality of data or when memory
reads need to be arbitrated with other bus transactions, the effective
