Dispatch and Execution Timing 4-17
18524C/0—Nov1996 AMD-K5 Processor Technical Reference Manual
4.2.3 Integer Dot Product Example
This example illustrates an optimal code sequence for an integer
dot product that performs multiply/accumulate operations (MACs)
at a rate of one every 3 cycles. The array size is a constant.
The loop is unrolled so that the MACs for even and odd elements
run in parallel, each accumulating into its own register (EDX for
even elements, EBP for odd elements). The two partial sums are
combined into the final sum outside the loop, which is also where
the final iteration for an odd-sized array is performed.
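; ESI -> array A, EDI -> array B (32-bit elements)
; assumed on entry: ECX = 0, EDX = 0, EBP = 0
mac_loop:
        MOV   EAX, [ESI][ECX*4]       ;load A(i)
        MOV   EBX, [ESI][ECX*4]+4     ;load A(i+1)
        IMUL  EAX, [EDI][ECX*4]       ;A(i) * B(i)
        IMUL  EBX, [EDI][ECX*4]+4     ;A(i+1) * B(i+1)
        ADD   ECX, 2                  ;advance index by two elements
        ADD   EDX, EAX                ;accumulate even sum
        ADD   EBP, EBX                ;accumulate odd sum
        CMP   ECX, EVEN_ARRAY_SIZE    ;array size rounded down to even
        JL    mac_loop                ;loop while ECX < EVEN_ARRAY_SIZE
;do final MAC here for odd-sized arrays
        ADD   EDX, EBP                ;final sum = even sum + odd sum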
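The same two-accumulator technique can be modeled in C as a minimal sketch. The function name `dot_product` and its parameters are hypothetical. Unlike the manual's example, the array length is passed at run time rather than being a constant, and the sketch assumes no product or sum overflows 32 bits (IMUL and ADD would wrap silently, while signed overflow is undefined in C). `even_sum` and `odd_sum` stand in for EDX and EBP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of mac_loop: even_sum stands in for EDX and
   odd_sum for EBP. Assumes no product or sum overflows 32 bits. */
int32_t dot_product(const int32_t *a, const int32_t *b, size_t n)
{
    int32_t even_sum = 0;                 /* EDX */
    int32_t odd_sum = 0;                  /* EBP */
    size_t even_n = n & ~(size_t)1;       /* EVEN_ARRAY_SIZE: n rounded down to even */

    /* Unrolled by two: the two multiply/accumulate chains are independent. */
    for (size_t i = 0; i < even_n; i += 2) {
        even_sum += a[i] * b[i];          /* A(i) * B(i) */
        odd_sum += a[i + 1] * b[i + 1];   /* A(i+1) * B(i+1) */
    }

    /* Final MAC for odd-sized arrays, done outside the loop. */
    if (n & 1)
        even_sum += a[n - 1] * b[n - 1];

    return even_sum + odd_sum;            /* final sum */
}
```

Because the two sums do not depend on each other inside the loop, each iteration's two MACs can overlap. A C compiler will not necessarily reproduce the manual's hand-scheduled instruction order.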

Table 4-1. Integer Instructions (continued)

Instruction          Opcode Format            Fastpath or  Execution  Timing
Mnemonic                                      Microcode    Unit
XCHG reg, reg        0_0x_1000011x_xxx_xxx    F            alu        1/1
                                                           alu        1/1
                                                           alu        2/2
XCHG mem, reg        0_1x_1000011x_xxx_xxx    F            ld         1/1
                                                           st         1/1/2
                                                           alu        1/2
XOR reg, reg         0_0x_001100xx_xxx_xxx    F            alu        1/1
XOR reg, mem         0_1x_0011001x_xxx_xxx    F            ld         1/1
                                                           alu        1/2
XOR mem, reg         0_1x_0011000x_xxx_xxx    F            ld         1/1
                                                           alu        1/2
                                                           st         1/1/3
XOR AL/AX/EAX, imm   0_xx_0011010x_xxx_xxx    F            alu        1/1
XOR reg, imm         0_0x_100000xx_110_xxx    F            alu        1/1
XOR mem, imm         0_1x_100000xx_110_xxx    F            ld         1/1
                                                           alu        1/2
                                                           st         1/1/3
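The Opcode Format strings in Table 4-1 give individual bits as 0, 1, or x (don't care). The sketch below, with hypothetical helper names `bits_match` and `opcode_matches`, checks an opcode byte and the ModR/M reg field against the third and fourth fields of such a string. It assumes the fourth field is the ModR/M reg field, which is consistent with the XOR reg/mem, imm rows using 110 (the /6 opcode extension of opcodes 80h-83h). It does not interpret the first two fields.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns true if the low `width` bits of `value` match `pattern`, a string
   of '0', '1' and 'x' (don't care) written most significant bit first. */
bool bits_match(const char *pattern, unsigned value, size_t width)
{
    if (strlen(pattern) != width)
        return false;
    for (size_t i = 0; i < width; i++) {
        unsigned bit = (value >> (width - 1 - i)) & 1u;
        char c = pattern[i];
        if (c == 'x')
            continue;              /* don't-care bit */
        if (c != '0' && c != '1')
            return false;          /* malformed pattern */
        if ((unsigned)(c - '0') != bit)
            return false;
    }
    return true;
}

/* Hypothetical helper: matches the opcode-byte field and the ModR/M reg
   field (bits 5..3) of a Table 4-1 opcode format. The leading fields of
   the format are not interpreted here. */
bool opcode_matches(const char *opcode_pat, const char *reg_pat,
                    uint8_t opcode, uint8_t modrm)
{
    unsigned reg = (modrm >> 3) & 7u;
    return bits_match(opcode_pat, opcode, 8) && bits_match(reg_pat, reg, 3);
}
```

For example, `opcode_matches("100000xx", "110", 0x81, 0xF0)` is true for XOR EAX, imm32 (81h with ModR/M F0h, reg field 110). With ModR/M C0h (reg field 000, which selects ADD) it is false.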