Intel ARCHITECTURE IA-32 - Shared-Memory Optimization; Minimize Sharing of Data between Physical Processors

Multi-Core and Hyper-Threading Technology 7

block size for loop blocking should be determined by dividing the target

cache size by the number of logical processors available in a physical

processor package. Typically, some cache lines are needed to access

data that are not part of the source or destination buffers used in cache

blocking, so the block size can be chosen between one quarter and one

half of the target cache (see also Chapter 3).

Software can use the deterministic cache parameter leaf of CPUID to

discover which subset of logical processors shares a given cache.

(See Chapter 6.) Therefore, the guideline above can be extended to allow all

the logical processors serviced by a given cache to use the cache

simultaneously, by placing an upper limit on the block size equal to the total

size of the cache divided by the number of logical processors serviced

by that cache. This technique can also be applied to single-threaded

applications that will be used as part of a multitasking workload.
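
The upper limit described above can be sketched in C as follows. This is a minimal illustration, not code from this manual: the function names (cache_size_bytes, sharing_count, block_size_limit) are hypothetical, and the register values are passed in as plain integers so the arithmetic can be checked without executing CPUID. The sketch assumes the documented layout of the deterministic cache parameter leaf (CPUID leaf 4), where every field is stored as the value minus one. Note that the EAX field reports the maximum number of addressable logical processor IDs sharing the cache, which may be larger than the number of logical processors actually present.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Total cache size in bytes, decoded from the EBX and ECX values that
   CPUID leaf 4 returns for one cache level (fields are value - 1). */
size_t cache_size_bytes(uint32_t ebx, uint32_t ecx)
{
    size_t line_size  = (ebx & 0xFFFu) + 1;          /* EBX[11:0]  */
    size_t partitions = ((ebx >> 12) & 0x3FFu) + 1;  /* EBX[21:12] */
    size_t ways       = ((ebx >> 22) & 0x3FFu) + 1;  /* EBX[31:22] */
    size_t sets       = (size_t)ecx + 1;             /* ECX[31:0]  */
    return ways * partitions * line_size * sets;
}

/* Maximum number of logical processor IDs that share the cache,
   decoded from EAX[25:14] (stored as value - 1). */
unsigned sharing_count(uint32_t eax)
{
    return ((eax >> 14) & 0xFFFu) + 1;
}

/* Upper limit on the blocking size: total cache size divided by the
   number of logical processors serviced by that cache. */
size_t block_size_limit(size_t cache_bytes, unsigned sharers)
{
    assert(sharers > 0);
    return cache_bytes / sharers;
}
```

For example, an 8-way cache with 64-byte lines and 1024 sets (512 KB) that is shared by two logical processors gives an upper limit of 256 KB per logical processor. The one-quarter to one-half guideline can then be applied within that limit.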

User/Source Coding Rule 32. (H impact, H generality) Use cache blocking

to improve locality of data access. Target one quarter to one half of the cache

size when targeting IA-32 processors supporting Hyper-Threading Technology

or target a block size that allows all the logical processors serviced by a cache

to share that cache simultaneously.
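
As one possible illustration of this rule, the sketch below applies cache blocking to a matrix transpose. It is an assumption-laden example rather than code from this manual: the names transpose_blocked and tile_edge_for_budget are hypothetical, and the byte budget passed to tile_edge_for_budget stands for whatever fraction of the cache (for example, one quarter to one half) the caller has decided to target. The tile edge is chosen so that one source tile and one destination tile of float elements fit in that budget together.

```c
#include <assert.h>
#include <stddef.h>

/* Largest tile edge (in elements) such that a source tile plus a
   destination tile of floats fit in budget_bytes. Returns at least 1. */
size_t tile_edge_for_budget(size_t budget_bytes)
{
    size_t tile = 1;
    while (2 * (tile + 1) * (tile + 1) * sizeof(float) <= budget_bytes)
        tile++;
    return tile;
}

/* Transpose an n-by-n row-major matrix from src into dst, one tile at
   a time, so that each tile is reused while it is still cached. */
void transpose_blocked(const float *src, float *dst, size_t n, size_t tile)
{
    assert(tile > 0);
    for (size_t ii = 0; ii < n; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            /* Clamp the tile at the matrix edge. */
            size_t i_end = (ii + tile < n) ? ii + tile : n;
            size_t j_end = (jj + tile < n) ? jj + tile : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

A caller might pass a budget derived from the cache size and the number of logical processors sharing it, then call transpose_blocked with the resulting tile edge. The best tile size for a given workload is still worth confirming by measurement.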

Shared-Memory Optimization

Maintaining cache coherency between discrete processors frequently

involves moving data across a bus that operates at a clock rate

substantially slower than the processor frequency.

Minimize Sharing of Data between Physical Processors

When two threads are executing on two physical processors and sharing

data, reading from or writing to shared data usually involves several bus

transactions (including snooping, request for ownership changes, and

sometimes fetching data across the bus). A thread accessing a large

amount of shared memory is likely to have poor processor-scaling

performance.
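
One common way to reduce such sharing, shown here only as a sketch and not as a method prescribed by this manual, is to let each thread accumulate into a private local variable and write to shared memory once, into a result slot that occupies its own cache line. The names padded_sum, sum_private, and combine are hypothetical, and the 64-byte line size is an assumption that should be replaced by the value reported for the target processor. Thread creation is omitted; sum_private is the body each thread would run on its own range.

```c
#include <stddef.h>

#define ASSUMED_LINE_BYTES 64  /* assumption: check the target's line size */

/* One result slot per thread. The alignment forces each slot onto its
   own cache line, so threads do not write to the same line. */
struct padded_sum {
    _Alignas(ASSUMED_LINE_BYTES) long value;
};

/* Worker body: sum data[begin..end) in a private local variable and
   publish the result with a single write to the thread's own slot. */
void sum_private(const int *data, size_t begin, size_t end,
                 struct padded_sum *slot)
{
    long local = 0;
    for (size_t i = begin; i < end; i++)
        local += data[i];
    slot->value = local;  /* the only write to shared memory */
}

/* Run by one thread after all workers have finished. */
long combine(const struct padded_sum *slots, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += slots[i].value;
    return total;
}
```

Compared with having every thread add directly into one shared accumulator, this arrangement limits each worker to a single store to its result slot, which is expected to reduce the coherency traffic described above.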
