Intel ARCHITECTURE IA-32


General Optimization Guidelines 2


Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in a 16-byte chunk. (p. 2-22)

Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end loop branches in a 16-byte chunk. (p. 2-22)

Assembly/Compiler Coding Rule 12. (M impact, MH generality) If the average number of total iterations is less than or equal to 100, use a forward branch to exit the loop. (p. 2-23)

Assembly/Compiler Coding Rule 13. (H impact, M generality) Unroll small loops until the overhead of the branch and the induction variable generally accounts for less than about 10% of the execution time of the loop. (p. 2-27)
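As an illustration of Rule 13, the following is a minimal sketch of manual unrolling in C. The function names `sum_simple` and `sum_unrolled4` and the unroll factor of 4 are assumptions chosen for the example, not taken from the manual. The right factor depends on the measured loop overhead, and an optimizing compiler may already unroll such a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one compare/branch and one index update per element. */
long sum_simple(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Unrolled by 4 (hypothetical factor): the compare/branch and the
 * index update are paid once per four elements instead of once per
 * element. */
long sum_unrolled4(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    size_t i = 0;

    /* Main body: runs while at least four elements remain. */
    for (; n - i >= 4; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }

    /* Remainder loop: handles the last 0 to 3 elements. */
    for (; i < n; i++)
        total += a[i];

    return total;
}
```

Both functions return the same result. Whether the unrolled form is faster on a given processor should be confirmed by measurement.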

Assembly/Compiler Coding Rule 14. (H impact, M generality) Avoid unrolling loops excessively, as this may thrash the trace cache or instruction cache. (p. 2-27)

Assembly/Compiler Coding Rule 15. (M impact, M generality) Unroll loops that are frequently executed and that have a predictable number of iterations so that the number of iterations drops to 16 or fewer, unless this increases code size so that the working set no longer fits in the trace cache. If the loop body contains more than one conditional branch, unroll so that the number of iterations is 16 / (number of conditional branches). (p. 2-27)
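The arithmetic in Rule 15 can be sketched as a small helper. The function `unroll_factor_rule15` and its parameters are hypothetical names introduced for this example. It computes only the smallest unroll factor that brings the iteration count down to the rule's target, and does not model the trace cache caveat, which has to be checked separately.

```c
#include <assert.h>

/* Hypothetical helper: smallest unroll factor such that the unrolled
 * loop runs at most 16 / cond_branches iterations.
 * trip_count    - predictable iteration count of the original loop
 * cond_branches - conditional branches in the loop body, counted as
 *                 the rule states (assumed to be between 1 and 16) */
unsigned unroll_factor_rule15(unsigned trip_count, unsigned cond_branches)
{
    assert(cond_branches >= 1u && cond_branches <= 16u);

    /* Target iteration count after unrolling. */
    unsigned target_iters = 16u / cond_branches;

    /* Ceiling division, written to avoid overflow for large counts. */
    unsigned factor = trip_count / target_iters
                    + (trip_count % target_iters != 0u);

    /* A factor of 1 means the loop is left as it is. */
    return factor > 0u ? factor : 1u;
}
```

For example, a 64-iteration loop with one conditional branch gets a factor of 4 (16 iterations remain), while the same loop with two conditional branches gets a factor of 8 (8 iterations remain).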

Assembly/Compiler Coding Rule 16. (H impact, H generality) Align data on natural operand-size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries. (p. 2-30)
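One way to request the 16-byte alignment that Rule 16 asks for is sketched below. It assumes a C11 hosted implementation that provides `alignas` and `aligned_alloc`. The names `coeffs`, `is_aligned16`, and `alloc_floats_aligned16` are hypothetical, and other toolchains may need compiler-specific alternatives.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Statically allocated buffer placed on a 16-byte boundary. */
alignas(16) float coeffs[8];

/* Returns nonzero if p lies on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Heap allocation on a 16-byte boundary. The size is rounded up
 * because aligned_alloc expects a multiple of the alignment.
 * Release the result with free(). A count of 0 has
 * implementation-defined behavior. */
float *alloc_floats_aligned16(size_t count)
{
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, rounded);
}
```

A buffer obtained this way can be checked with `is_aligned16` before it is handed to code that uses aligned vector loads and stores.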

Assembly/Compiler Coding Rule 17. (H impact, M generality) Pass parameters in registers instead of on the stack where possible. Passing arguments on the stack is a case of a store followed by a reload. While IA-32 processors optimize this sequence by providing the value to the load directly from the memory order buffer, without the need to access the data cache, floating-point values incur a significant forwarding latency. Passing floating-point arguments in registers (preferably XMM registers) should avoid this long-latency operation. (p. 2-33)
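How parameters are passed is decided by the calling convention, which standard C does not control. The sketch below assumes GCC targeting 32-bit x86, where the `regparm` and `sseregparm` function attributes request register passing. The macro names `PASS_INT_IN_REGS` and `PASS_FP_IN_XMM` and both functions are hypothetical. On any other compiler or target the macros expand to nothing and the default convention applies.

```c
#include <assert.h>

/* Assumption: GCC on IA-32. regparm(3) passes up to three integer
 * arguments in registers. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__i386__)
#define PASS_INT_IN_REGS __attribute__((regparm(3)))
#else
#define PASS_INT_IN_REGS
#endif

/* Assumption: GCC on IA-32 with SSE2 enabled. sseregparm passes
 * floating-point arguments in SSE (XMM) registers. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__i386__) \
    && defined(__SSE2__)
#define PASS_FP_IN_XMM __attribute__((sseregparm))
#else
#define PASS_FP_IN_XMM
#endif

/* Floating-point arguments: the case the rule singles out. */
PASS_FP_IN_XMM double scale_add(double x, double scale, double offset)
{
    return x * scale + offset;
}

/* Integer arguments passed in registers instead of on the stack. */
PASS_INT_IN_REGS int clamp_int(int v, int lo, int hi)
{
    assert(lo <= hi);
    return v < lo ? lo : (v > hi ? hi : v);
}
```

Declaring small helpers `static inline` is another option: once the compiler inlines the call, no arguments are passed at all.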
