
Code Optimization 4-3

18524C/0—Nov1996 AMD-K5 Processor Technical Reference Manual

■ Loops—Unroll loops to get more parallelism and reduce loop overhead, even with branch prediction. Inline small routines to avoid procedure-call overhead. In both cases, however, consider the cost of possible increased register usage, which might add load/store instructions for register spilling.

■ Indexed Addressing—There is no penalty for base + index addressing in the AMD-K5 processor. However, future implementations may have such a penalty to achieve a higher overall clock rate.

4.1.2 Techniques Specific to the AMD-K5 Processor

■ Jumps and Loops—JCXZ requires 1 cycle (correctly predicted) and is therefore faster than a TEST/JZ pair. On the Pentium processor, by contrast, JCXZ requires 5 or 6 cycles. All forms of LOOP take 2 cycles (correctly predicted), which is also faster than the Pentium processor's 7 or 8 cycles.

■ Multiplies—Independent IMULs can be pipelined at one per cycle with a 4-cycle latency, in contrast to the Pentium processor's serialized 9-cycle time. (MUL has the same latency, although its implicit use of AX prevents independent, parallel MUL operations.)

■ Dispatch Conflicts—Load-balancing (that is, selecting instructions for parallel decode) is still important, but to a lesser extent than on the Pentium processor. In particular, arrange instructions to avoid execution-unit dispatching conflicts. (See Section 4.2 on page 4-5.)

■ Instruction Prefixes—There is no penalty for instruction prefixes, including combinations such as segment-size and operand-size prefixes. This is particularly important for 16-bit code. However, future implementations may have penalties for the use of these prefixes.

■ Byte Operations—For byte operations, the high and low bytes of AX, BX, CX, and DX are effectively independent registers that can be operated on in parallel. For example, reading AL does not have a dependency on an outstanding write to AH.

■ Move and Convert—MOVZX, MOVSX, CBW, CWDE, CWD, and CDQ all take 1 cycle (2 cycles for memory-based input), in contrast to the Pentium processor's 2 or 3 cycles.

