• The Pentium 4 processor can correctly predict the exit branch for an 
inner loop that has 16 or fewer iterations, if that number of iterations 
is predictable and there are no conditional branches in the loop. 
Therefore, if the loop body size is not excessive, and the probable 
number of iterations is known, unroll inner loops until they have a 
maximum of 16 iterations. On the Pentium M processor, do not 
unroll loops that have more than 64 iterations.
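As a rough illustration of the last point, the sketch below takes a loop with a fixed trip count of 64 and unrolls it by a factor of 4, which leaves 16 iterations of the loop-exit branch. The function names, the array length N, and the unroll factor are assumptions chosen for illustration. Whether the unrolled form is actually faster depends on the loop body and on the compiler, which may already unroll on its own.

```c
#include <stddef.h>

#define N 64  /* hypothetical fixed trip count, known at compile time */

/* Rolled form: the loop-exit branch executes 64 times. */
int sum_rolled(const int a[N])
{
    int sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by 4: the loop-exit branch executes 64 / 4 = 16 times,
 * and the loop body contains no conditional branches. */
int sum_unrolled4(const int a[N])
{
    int sum = 0;
    for (size_t i = 0; i < N; i += 4) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    return sum;
}
```

Because N is a multiple of 4 here, no remainder loop is needed; a trip count that is not a multiple of the unroll factor would require one.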
The potential costs of unrolling loops are:
• Excessive unrolling, or unrolling of very large loops, can lead to 
increased code size. This can be harmful if the unrolled loop no 
longer fits in the trace cache (TC).
• Unrolling loops whose bodies contain branches increases demands 
on the BTB capacity. If the number of iterations of the unrolled loop 
is 16 or fewer, the branch predictor should be able to correctly predict 
branches in the loop body that alternate direction.
Assembly/Compiler Coding Rule 13. (H impact, M generality) Unroll small 
loops until the overhead of the branch and the induction variable accounts, 
generally, for less than about 10% of the execution time of the loop.
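One way to reason about Rule 13 is the simple cost model sketched below. It assumes that the loop overhead (branch plus induction-variable update) and the useful work per original iteration can each be expressed as a fixed cost in the same unit, for example cycles obtained by profiling. That is a simplification, and the function name and parameters are hypothetical. After unrolling by a factor u, the model puts the overhead fraction at overhead / (overhead + u * work), and the function returns the smallest u that brings this below a given percentage.

```c
/* Smallest unroll factor u >= 1 such that
 *     overhead / (overhead + u * work) < max_percent / 100
 * under the simplified model that one iteration of the unrolled loop
 * costs (overhead + u * work). Returns 0 for invalid arguments.
 * The names and the cost model itself are illustrative assumptions. */
unsigned min_unroll_factor(unsigned overhead, unsigned work,
                           unsigned max_percent)
{
    if (work == 0 || max_percent == 0 || max_percent >= 100)
        return 0;

    unsigned u = 1;
    /* Integer form of the inequality above:
     * keep unrolling while 100 * overhead >= max_percent * total cost. */
    while (100u * overhead >= max_percent * (overhead + u * work))
        u++;
    return u;
}
```

For example, with an overhead of 2 units and 3 units of work per original iteration, the model suggests unrolling by 7, since 2 / (2 + 7 * 3) is about 8.7%. The result is only a starting point; Rule 14 still limits how far unrolling should go.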
Assembly/Compiler Coding Rule 14. (H impact, M generality) Avoid 
unrolling loops excessively, as this may thrash the trace cache or instruction 
cache.
Assembly/Compiler Coding Rule 15. (M impact, M generality) Unroll 
loops that are frequently executed and that have a predictable number of 
iterations to reduce the number of iterations to 16 or fewer, unless this 
increases code size so that the working set no longer fits in the trace cache or 
instruction cache. If the loop body contains more than one conditional branch, 
then unroll so that the number of iterations is 16/(number of conditional branches).
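The arithmetic in Rule 15 can be sketched as a small helper. It assumes that the trip count and the number of conditional branches in the original loop body are known, and it reads "conditional branches" as branches inside the body, not counting the loop-exit branch. That reading, the integer rounding, and the function names are assumptions made for illustration.

```c
/* Target iteration count after unrolling, under one reading of Rule 15:
 * 16 when the body has at most one conditional branch,
 * otherwise 16 / branches (integer division, never below 1). */
unsigned target_iterations(unsigned branches)
{
    if (branches <= 1)
        return 16;
    unsigned t = 16 / branches;
    return t > 0 ? t : 1;
}

/* Smallest unroll factor that brings trip_count down to the target,
 * that is, ceil(trip_count / target). A remainder loop is needed when
 * the factor does not divide trip_count evenly. */
unsigned unroll_factor(unsigned trip_count, unsigned branches)
{
    unsigned t = target_iterations(branches);
    if (trip_count <= t)
        return 1;
    return (trip_count + t - 1) / t;
}
```

Under these assumptions, a 64-iteration loop with no branches in its body would be unrolled by 4, and the same loop with two conditional branches in its body would be unrolled by 8. The code-size condition in the rule still has to be checked separately.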
Example 2-10 shows how unrolling enables other optimizations.