Code Optimization 4-3
18524C/0—Nov1996 AMD-K5 Processor Technical Reference Manual
■ Loops—Unroll loops to expose more parallelism and reduce
loop overhead, even with branch prediction. Inline small
routines to avoid procedure-call overhead. In both cases,
however, consider the cost of possible increased register
usage, which might add load/store instructions for register
spilling.
■ Indexed Addressing—There is no penalty for base + index
addressing in the AMD-K5 processor. However, future
implementations may have such a penalty to achieve a
higher overall clock rate.
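
The loop-unrolling guideline above can be sketched in C as follows. This is an illustrative example only, not code from this manual: the function names sum_rolled and sum_unrolled4 and the unroll factor of four are assumptions chosen for the sketch. The unrolled version performs one loop branch per four elements and keeps four independent partial sums, at the cost of three additional live registers, which is the register-usage trade-off noted above.

```c
#include <stddef.h>

/* Baseline: one add and one loop branch per element. */
long sum_rolled(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Unrolled by four (hypothetical factor): one loop branch per four
 * elements, with four partial sums that do not depend on each other. */
long sum_unrolled4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Remainder loop for the last 0 to 3 elements. */
    for (; i < n; i++)
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}
```

Whether the unrolled form is actually faster depends on the compiler's register allocation; if the extra partial sums are spilled to memory, the added loads and stores may cancel the gain.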
4.1.2 Techniques Specific to the AMD-K5 Processor
■ Jumps and Loops—JCXZ requires 1 cycle (correctly
predicted) and therefore is faster than a TEST/JZ sequence, in
contrast to the Pentium processor, in which JCXZ requires 5 or 6
cycles. All forms of LOOP take 2 cycles (correctly
predicted), which is also faster than the Pentium processor's 7
or 8 cycles.
■ Multiplies—Independent IMULs can be pipelined at one
per cycle with 4-cycle latency, in contrast to the Pentium
processor's serialized 9-cycle time. (MUL has the same
latency, although the implicit AX usage of MUL prevents
independent, parallel MUL operations.)
■ Dispatch Conflicts—Load-balancing (that is, selecting
instructions for parallel decode) is still important, but to a
lesser extent than on the Pentium processor. In particular,
arrange instructions to avoid execution-unit dispatching
conflicts. (See Section 4.2 on page 4-5.)
■ Instruction Prefixes—There is no penalty for instruction
prefixes, including combinations such as segment-size and
operand-size prefixes. This is particularly important for
16-bit code. However, future implementations may have
penalties for the use of these prefixes.
■ Byte Operations—For byte operations, the high and low
bytes of AX, BX, CX, and DX are effectively independent
registers that can be operated on in parallel. For example,
reading AL does not have a dependency on an outstanding
write to AH.
■ Move and Convert—MOVZX, MOVSX, CBW, CWDE, CWD, and
CDQ all take 1 cycle (2 cycles for memory-based input), in
contrast to the Pentium processor's 2 or 3 cycles.
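
The Multiplies guideline above can be illustrated with a short C sketch. It is a hypothetical example, not code from this manual: the function names product_chain and product_two_chains are assumptions, and it is also an assumption that the compiler emits register-form IMUL instructions for these multiplies. In the first function every multiply waits for the previous result, so the 4-cycle latency is paid on each step. In the second, the two chains are independent, so consecutive multiplies may overlap in the pipeline.

```c
#include <stddef.h>
#include <stdint.h>

/* Dependent chain: each multiply needs the previous product. */
uint32_t product_chain(const uint32_t *a, size_t n)
{
    uint32_t p = 1;
    for (size_t i = 0; i < n; i++)
        p *= a[i];
    return p;
}

/* Two independent chains: the multiply into p1 does not depend on
 * the multiply into p0, so the two can be in flight together. */
uint32_t product_two_chains(const uint32_t *a, size_t n)
{
    uint32_t p0 = 1, p1 = 1;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        p0 *= a[i];
        p1 *= a[i + 1];
    }
    /* Odd element left over when n is not a multiple of two. */
    if (i < n)
        p0 *= a[i];

    return p0 * p1;
}
```

Both functions return the same value for any input, because unsigned multiplication wraps modulo 2^32 and is therefore associative and commutative. Any speed difference depends on the instructions the compiler actually generates.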