Intel ARCHITECTURE IA-32

General Optimization Guidelines 2

to early out). However, be careful of introducing more than a total of two values for the floating-point control word, or there will be a large performance penalty. See “Floating-point Modes”.

User/Source Coding Rule 13. (H impact, ML generality) Use fast float-to-int routines, FISTTP, or SSE2 instructions. If coding these routines, use the fisttp instruction if SSE3 is available, or the cvttss2si and cvttsd2si instructions if coding with Streaming SIMD Extensions 2.

Many libraries do more work than is necessary. The FISTTP instruction in SSE3 can convert floating-point values to 16-bit, 32-bit, or 64-bit integers using truncation without accessing the floating-point control word (FCW). The cvttss2si/cvttsd2si instructions save many µops and some store-forwarding delays compared with some compiler implementations. Both approaches avoid changing the rounding mode.
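As an illustrative sketch (not taken from this guide), the truncating conversions can be reached from C through the SSE2 intrinsics `_mm_cvttsd_si32` and `_mm_cvttss_si32`, which correspond to cvttsd2si and cvttss2si. The function names `trunc_to_int` and `truncf_to_int` are hypothetical, and the plain-cast fallback is an assumption added so that the code still builds where SSE2 is not enabled.

```c
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics (also pulls in the SSE ones) */
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: double -> 32-bit int, truncating toward zero.
 * The SSE2 path does not read or modify the x87 control word. */
static int trunc_to_int(double x)
{
#ifdef HAVE_SSE2
    return _mm_cvttsd_si32(_mm_set_sd(x));   /* cvttsd2si */
#else
    return (int)x;   /* fallback: a C cast also truncates toward zero */
#endif
}

/* Hypothetical helper: float -> 32-bit int, truncating toward zero. */
static int truncf_to_int(float x)
{
#ifdef HAVE_SSE2
    return _mm_cvttss_si32(_mm_set_ss(x));   /* cvttss2si */
#else
    return (int)x;
#endif
}
```

When a compiler already generates SSE2 scalar code for floating-point math, a plain `(int)x` cast is commonly compiled to the same instructions. The explicit intrinsics are mainly of interest where the compiler would otherwise emit an x87 sequence that saves, changes, and restores the FCW.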

User/Source Coding Rule 14. (M impact, ML generality) Break dependence chains where possible.

Removing data dependences enables the out-of-order engine to extract more instruction-level parallelism (ILP) from the code. When summing up the elements of an array, use partial sums instead of a single accumulator. For example, to calculate z = a + b + c + d, instead of:

    x = a + b;
    y = x + c;
    z = y + d;

use:

    x = a + b;
    y = c + d;
    z = x + y;
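The same idea applied to the array case mentioned above might look like the following sketch. The function name `sum_partial` and the choice of four accumulators are assumptions made for illustration; the best number of accumulators depends on the latency and throughput of the add instruction on the target processor.

```c
#include <stddef.h>

/* Hypothetical helper: sum n floats using four independent partial sums.
 * Each accumulator forms its own dependence chain, so up to four adds
 * can be in flight at once instead of one. */
static float sum_partial(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Remainder: fewer than four elements are left. */
    for (; i < n; ++i)
        s0 += a[i];

    /* Combine the partial sums pairwise, as in z = x + y above. */
    return (s0 + s1) + (s2 + s3);
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a single-accumulator sum. This is usually acceptable but should be checked where bit-exact results are required.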

User/Source Coding Rule 15. (M impact, ML generality) Usually, math libraries take advantage of the transcendental instructions (for example, fsin) when evaluating elementary functions. If there is no critical need to evaluate the transcendental functions using the extended precision of 80 bits, applications should consider an alternate, software-based approach, such as a look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing the
