Intel ARCHITECTURE IA-32 - Guidelines for Optimizing Floating-Point Code

IA-32 Intel® Architecture Optimization

Guidelines for Optimizing Floating-point Code

User/Source Coding Rule 10. (M impact, M generality) Enable the compiler’s use of SSE, SSE2 or SSE3 instructions with appropriate switches.
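As an illustration of Rule 10, the sketch below shows a simple single-precision loop of the kind a compiler can map onto SSE instructions once the right switches are enabled. The function name `saxpy` is hypothetical and not from the manual. The switches themselves depend on the compiler: with GCC on a 32-bit x86 target, for example, `-O3 -msse2 -mfpmath=sse` is one plausible combination, and adding `-S` produces an assembly listing to inspect. Treat those options as an assumption about one toolchain, not as a recommendation from this guide.

```c
#include <stddef.h>

/* Hypothetical helper: computes y[i] = a * x[i] + y[i] for n floats.
 * The restrict qualifiers promise the compiler that x and y do not
 * overlap, which removes one common obstacle to vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether the loop is actually vectorized should be confirmed in the generated assembly rather than assumed.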

Follow this procedure to investigate the performance of your floating-point application:

• Understand how the compiler handles floating-point code.

• Look at the assembly dump and see what transforms are already performed on the program.

• Study the loop nests in the application that dominate the execution time.

• Determine why the compiler is not creating the fastest code.

• See if there is a dependence that can be resolved.

• Determine the problem area: bus bandwidth, cache locality, trace cache bandwidth or instruction latency. Focus on optimizing the problem area. For example, adding prefetch instructions will not help if the bus is already saturated. If trace cache bandwidth is the problem, added prefetch µops may degrade performance.
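To make the "dependence that can be resolved" step concrete, here is a minimal sketch of one common case: a floating-point reduction. In the serial version every addition waits for the previous one, so instruction latency bounds the loop. Splitting the sum across several independent accumulators shortens that chain. Both function names are hypothetical, and the choice of four accumulators is an assumption for illustration. Note that reassociating the additions can change the rounding of the result, which is why compilers generally do not apply this transform on their own without a relaxed floating-point mode.

```c
#include <stddef.h>

/* Baseline: each addition depends on the result of the previous one. */
float sum_serial(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        s += x[i];
    }
    return s;
}

/* Four independent partial sums break the single dependence chain. */
float sum_four_accumulators(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    /* Combine the partial sums, then handle any leftover elements. */
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) {
        s += x[i];
    }
    return s;
}
```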

For floating-point coding, follow all the general coding recommendations discussed in this chapter, including:

• blocking the cache

• using prefetch

• enabling vectorization

• unrolling loops
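As one example of the recommendations listed above, the following sketch applies cache blocking to a square matrix transpose. The function name `transpose_blocked` and the tile size of 32 are assumptions chosen for illustration. A suitable tile size depends on the cache geometry of the target processor and should be tuned by measurement.

```c
#include <stddef.h>

enum { BLOCK = 32 };  /* assumed tile edge; tune for the target cache */

/* Transposes the n x n row-major matrix src into dst, one tile at a
 * time, so that the lines touched in src and dst stay cache-resident
 * while a tile is being processed. */
void transpose_blocked(size_t n, const float *src, float *dst)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            /* Clamp the tile bounds when n is not a multiple of BLOCK. */
            size_t i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
            size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;

            for (size_t i = ii; i < i_end; ++i) {
                for (size_t j = jj; j < j_end; ++j) {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
}
```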

User/Source Coding Rule 11. (H impact, ML generality) Make sure your application stays in range to avoid denormal values and underflows.

Out-of-range numbers cause very high overhead.
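One way to check whether an application's data drifts out of range is to count the denormal (subnormal) values in a buffer. The sketch below does this with the standard C99 `fpclassify` macro. The function name `count_subnormals` is hypothetical, and this is a diagnostic aid under that assumption, not a technique prescribed by this guide.

```c
#include <math.h>
#include <stddef.h>

/* Returns how many of the n floats in x are subnormal (denormal). */
size_t count_subnormals(const float *x, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (fpclassify(x[i]) == FP_SUBNORMAL) {
            ++count;
        }
    }
    return count;
}
```

A nonzero count on intermediate results in a hot loop suggests that rescaling the data, or otherwise keeping values in the normal range, is worth investigating.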

User/Source Coding Rule 12. (M impact, ML generality) Do not use double precision unless necessary. Set the precision control (PC) field in the x87 FPU control word to “Single Precision”. This allows single precision (32-bit) computation to complete faster on some operations (for example, divides due
