Intel ARCHITECTURE IA-32 - Example 3-8 Simple Four-Iteration Loop

IA-32 Intel® Architecture Optimization

The examples that follow illustrate coding adjustments that enable an
algorithm to benefit from the Streaming SIMD Extensions (SSE). The same
techniques may be used for single-precision floating-point,
double-precision floating-point, and integer data under SSE2, SSE, and
MMX technology.

As a basis for the usage model discussed in this section, consider the
simple loop shown in Example 3-8.

Note that the loop runs for only four iterations. Four single-precision
values occupy 16 bytes, which exactly fills one 128-bit XMM register, so
the loop can be replaced directly with Streaming SIMD Extensions code.

For optimal use of the Streaming SIMD Extensions instructions that
require data aligned on a 16-byte boundary, all examples in this chapter
assume that the arrays passed to the routine, a, b, and c, are aligned
to 16-byte boundaries by the calling routine. For methods to ensure this
alignment, refer to the application notes for the Pentium 4 processor.
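As an illustration of the alignment assumption only, the following C sketch shows one way a calling routine might obtain and check 16-byte-aligned storage. It is not taken from the application notes: the helper names is_aligned16 and alloc_floats16 are hypothetical, and the code assumes a C11 toolchain that provides aligned_alloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns nonzero if p lies on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: allocates room for n floats on a 16-byte boundary.
   Assumes C11 aligned_alloc is available. The byte count is rounded up to
   a multiple of 16, as aligned_alloc requires. Release with free(). */
static float *alloc_floats16(size_t n)
{
    size_t bytes = (n * sizeof(float) + 15u) & ~(size_t)15u;
    float *p = aligned_alloc(16, bytes);
    assert(p == NULL || is_aligned16(p));
    return p;
}
```

For arrays with static or automatic storage, a C11 declaration such as `_Alignas(16) float a[4];` requests the same alignment without a heap allocation.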

The sections that follow provide details on the coding methodologies:
inlined assembly, intrinsics, C++ vector classes, and automatic
vectorization.

Example 3-8 Simple Four-Iteration Loop

void add(float *a, float *b, float *c)
{
    int i;

    /* Four single-precision additions: c[i] = a[i] + b[i] */
    for (i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
}
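To make the "simple replacement" concrete, here is a minimal sketch of how the four-iteration loop might be expressed with SSE intrinsics. It is an illustration under stated assumptions, not the manual's own listing: the function name add_sse is hypothetical, the target is assumed to be an x86 compiler with SSE enabled, and all three pointers are assumed to be 16-byte aligned as described above.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Sketch: SSE version of the four-iteration loop.
   Assumes a, b, and c each point to four floats on a 16-byte boundary. */
void add_sse(float *a, float *b, float *c)
{
    /* Aligned loads and stores fault on misaligned addresses,
       so check the assumption in debug builds. */
    assert(((uintptr_t)a & 15u) == 0);
    assert(((uintptr_t)b & 15u) == 0);
    assert(((uintptr_t)c & 15u) == 0);

    __m128 va = _mm_load_ps(a);           /* load a[0..3] */
    __m128 vb = _mm_load_ps(b);           /* load b[0..3] */
    _mm_store_ps(c, _mm_add_ps(va, vb));  /* c[0..3] = a[0..3] + b[0..3] */
}
```

One packed add replaces all four scalar additions. If the alignment assumption cannot be guaranteed, the unaligned variants _mm_loadu_ps and _mm_storeu_ps could be substituted, typically at some cost in performance.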
