Intel ARCHITECTURE IA-32 - SIMD Optimizations and Microarchitectures; Packed Floating-Point Performance


Optimizing for SIMD Floating-point Applications 5


SIMD Optimizations and Microarchitectures

Pentium M, Intel Core Solo, and Intel Core Duo processors have a different microarchitecture than the Intel NetBurst microarchitecture. The following sub-section discusses optimizing SIMD code that targets Intel Core Solo and Intel Core Duo processors.

Packed Floating-Point Performance

Most packed SIMD floating-point code will speed up on Intel Core Solo processors relative to Pentium M processors. This is due to improvements in decoding packed SIMD instructions.

The improvement in packed floating-point performance on the Intel Core Solo processor over the Pentium M processor depends on several factors. Generally, code that is decoder-bound and/or has a mixture of integer and packed floating-point instructions can expect significant gain. Code that is limited by execution latency and has a “cycles per instruction” ratio greater than one will not benefit from the decoder improvement.

movaps xmm0, Vector1   ; the destination has a3, a2, a1, a0
movaps xmm1, Vector2   ; the destination has b3, b2, b1, b0
movaps xmm2, Vector3   ; the destination has c3, c2, c1, c0
movaps xmm3, Vector4   ; the destination has d3, d2, d1, d0
mulps  xmm0, xmm1      ; a3b3, a2b2, a1b1, a0b0
mulps  xmm2, xmm3      ; c3d3, c2d2, c1d1, c0d0
haddps xmm0, xmm2      ; the destination has c3d3+c2d2,
                       ; c1d1+c0d0, a3b3+a2b2, a1b1+a0b0
haddps xmm0, xmm0      ; the destination has
                       ; c3d3+c2d2+c1d1+c0d0, a3b3+a2b2+a1b1+a0b0,
                       ; c3d3+c2d2+c1d1+c0d0, a3b3+a2b2+a1b1+a0b0

Example 5-13 Calculating Dot Products from AOS (continued)
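The lane comments in the listing above can be checked with a small scalar model. The following is a minimal sketch in portable C, not the SSE3 code itself. The type vec4 and the functions model_mulps, model_haddps, two_dot_products, and example_check are hypothetical names introduced here for illustration. Lane 0 is assumed to be the lowest element, so "a3, a2, a1, a0" in the comments corresponds to v[3] down to v[0].

```c
#include <assert.h>

/* Hypothetical four-lane single-precision vector.
   v[0] is the lowest lane (a0), v[3] the highest (a3). */
typedef struct { float v[4]; } vec4;

/* Scalar model of mulps: lane-wise multiply. */
static vec4 model_mulps(vec4 x, vec4 y) {
    vec4 r;
    for (int i = 0; i < 4; ++i) {
        r.v[i] = x.v[i] * y.v[i];
    }
    return r;
}

/* Scalar model of "haddps dst, src": the two low result lanes hold
   the adjacent-pair sums of dst, the two high lanes those of src. */
static vec4 model_haddps(vec4 dst, vec4 src) {
    vec4 r;
    r.v[0] = dst.v[0] + dst.v[1];
    r.v[1] = dst.v[2] + dst.v[3];
    r.v[2] = src.v[0] + src.v[1];
    r.v[3] = src.v[2] + src.v[3];
    return r;
}

/* Mirrors the instruction sequence of the listing:
   two mulps followed by two haddps. */
static vec4 two_dot_products(vec4 a, vec4 b, vec4 c, vec4 d) {
    vec4 x0 = model_mulps(a, b);   /* a3b3, a2b2, a1b1, a0b0 */
    vec4 x2 = model_mulps(c, d);   /* c3d3, c2d2, c1d1, c0d0 */
    x0 = model_haddps(x0, x2);     /* pairwise sums of both products */
    return model_haddps(x0, x0);   /* c.d, a.b, c.d, a.b (high to low) */
}

/* Illustrative check with made-up input values. */
static void example_check(void) {
    vec4 a = {{1.0f, 2.0f, 3.0f, 4.0f}};
    vec4 b = {{5.0f, 6.0f, 7.0f, 8.0f}};
    vec4 c = {{1.0f, 1.0f, 1.0f, 1.0f}};
    vec4 d = {{2.0f, 3.0f, 4.0f, 5.0f}};
    vec4 r = two_dot_products(a, b, c, d);
    assert(r.v[0] == 70.0f);  /* a.b in lane 0 */
    assert(r.v[1] == 14.0f);  /* c.d in lane 1 */
    assert(r.v[2] == 70.0f);  /* a.b repeated in lane 2 */
    assert(r.v[3] == 14.0f);  /* c.d repeated in lane 3 */
}
```

Calling example_check() from a test harness exercises the model. In the model, as in the comments of the listing, each dot product appears twice in the result, so a caller needs to read only the two low lanes.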
