Intel ARCHITECTURE IA-32


General Optimization Guidelines 2


write misses; only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC.

Assembly/Compiler Coding Rule 28. (H impact, L generality) If an inner loop writes to more than four arrays (four distinct cache lines), apply loop fission to break up the body of the loop so that each of the resulting loops writes to no more than four arrays in each iteration.
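A minimal sketch of the loop fission that Rule 28 describes, written in C. The function names, the six output arrays, and the arithmetic are hypothetical and chosen only for illustration. The sketch assumes the arrays do not overlap, so splitting the loop does not change the results.

```c
#include <assert.h>
#include <stddef.h>

/* Before fission: each iteration stores to six arrays, which is more
 * than the four write-combining buffers guaranteed to be available. */
void fill_fused(size_t n, const float *src,
                float *a, float *b, float *c,
                float *d, float *e, float *f)
{
    assert(n == 0 || src != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = src[i] + 1.0f;
        b[i] = src[i] + 2.0f;
        c[i] = src[i] + 3.0f;
        d[i] = src[i] + 4.0f;
        e[i] = src[i] + 5.0f;
        f[i] = src[i] + 6.0f;
    }
}

/* After fission: two loops, each storing to three arrays per
 * iteration, so neither loop writes to more than four arrays. */
void fill_fissioned(size_t n, const float *src,
                    float *a, float *b, float *c,
                    float *d, float *e, float *f)
{
    assert(n == 0 || src != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = src[i] + 1.0f;
        b[i] = src[i] + 2.0f;
        c[i] = src[i] + 3.0f;
    }
    for (size_t i = 0; i < n; i++) {
        d[i] = src[i] + 4.0f;
        e[i] = src[i] + 5.0f;
        f[i] = src[i] + 6.0f;
    }
}
```

The fissioned version reads src twice, so the split trades some extra load traffic for fewer simultaneously active store streams. Whether that is a net gain depends on the workload and should be measured.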

The write-combining buffers are used for stores of all memory types. They are particularly important for writes to uncached memory: writes to different parts of the same cache line can be grouped into a single, full-cache-line bus transaction instead of going across the bus as several partial writes (which they otherwise would, since they are not cached). Avoiding partial writes can have a significant impact on bus bandwidth-bound graphics applications, where graphics buffers are in uncached memory. Separating writes to uncached memory and writes to writeback memory into separate phases helps ensure that the write-combining buffers can fill before they are evicted by other write traffic. Eliminating partial write transactions has been found to have a performance impact on the order of 20% for some applications. Because the cache lines are 64 bytes, a write to the bus for 63 bytes will result in 8 partial bus transactions.
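The arithmetic behind the 63-byte case can be sketched as follows. This is a simplified model, not a description of the actual bus protocol. It assumes (the text above does not state this) that a partially filled buffer is flushed as one partial write per 8-byte-aligned chunk that contains modified bytes, while a completely filled 64-byte line goes out as a single transaction. The function name and constants are hypothetical.

```c
#include <assert.h>

#define CACHE_LINE_BYTES    64u
#define PARTIAL_CHUNK_BYTES  8u  /* assumed size of one partial write */

/* Number of bus transactions needed to flush one write-combining
 * buffer that holds `len` contiguous modified bytes starting at byte
 * `offset` within its cache line, under the assumptions stated above. */
unsigned wc_flush_transactions(unsigned offset, unsigned len)
{
    assert(offset + len <= CACHE_LINE_BYTES);
    if (len == 0)
        return 0;               /* nothing to write */
    if (len == CACHE_LINE_BYTES)
        return 1;               /* full line: one combined transaction */

    /* Count the 8-byte chunks touched by the modified byte range. */
    unsigned first = offset / PARTIAL_CHUNK_BYTES;
    unsigned last  = (offset + len - 1) / PARTIAL_CHUNK_BYTES;
    return last - first + 1;
}
```

Under these assumptions, wc_flush_transactions(0, 64) evaluates to 1 and wc_flush_transactions(0, 63) evaluates to 8: leaving a single byte of the line unwritten turns one bus transaction into eight.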

When coding functions that execute simultaneously on two threads, reducing the number of writes that are allowed in an inner loop will help take full advantage of write-combining store buffers. For write-combining buffer recommendations for Hyper-Threading Technology, see Chapter 7.

Store ordering and visibility are also important issues for write combining. When a write to a write-combining buffer for a previously unwritten cache line occurs, there will be a read-for-ownership (RFO). If a subsequent write happens to another write-combining buffer, a separate RFO may be caused for that cache line. Subsequent writes to the first cache line and write-combining buffer will be delayed until the second RFO has been serviced to guarantee properly ordered visibility of the writes. If the memory type for the writes is write-combining, there will
