4-4 Performance
AMD-K5 Processor Technical Reference Manual 18524C/0—Nov1996
■ Bit Scan—BSF and BSR take 1 cycle (2 cycles for memory-
based input), in contrast to the Pentium processor's data-
dependent 6 to 34 cycles.
■ Bit Test—BT, BTS, BTR, and BTC take 1 cycle for register-
based operands, and 2 or 3 cycles for memory-based operands
with an immediate bit offset, in contrast to the Pentium
processor's 4 to 9 cycles. Register-based bit-offset forms on
the AMD-K5 processor take 5 cycles. If the semantics of the
register-based bit-offset form are desired (where the bit
offset can cover a very large bit string in memory), it is
better to emulate this form with simpler instructions that can
be interleaved with independent instructions for greater
parallelism.
■ Floating-Point Top-of-Stack Bottleneck—The AMD-K5
processor has a pipelined floating-point unit. Greater
parallelism can be achieved by using FXCH in parallel with
floating-point operations to alleviate the top-of-stack
bottleneck, as in the Pentium processor. The AMD-K5 processor
also permits integer operations (ALU, branch, load/store) in
parallel with floating-point operations.
■ Locating Branch Targets—Performance can be sensitive to
code alignment, especially in tight loops. Locating branch
targets within the first 17 bytes of the 32-byte cache line
maximizes the opportunity for parallel execution at the target.
NOPs can be added to adjust this alignment. The AMD-K5
processor executes NOPs (opcode 90h) at the rate of two per
cycle. Adding NOPs is even more effective if they execute
in parallel with existing code. Other instructions of greater
length, such as a register-immediate TEST instruction, can
be used as NOPs to minimize the overhead of such padding.
■ Branch Prediction—There are two branch prediction bits in
a 32-byte instruction cache line. One bit applies to the first
16 bytes of the line and the second bit applies to the second
16 bytes of the line. For effective branch prediction, code
should be generated with one branch per 16-byte line half.
The prediction is associated with the half-line containing
the last byte of the branch instruction.
■ Address-Generation Interlocks (AGIs)—The AMD-K5
processor does not suffer from the single-cycle penalty that
the 486 and Pentium processors incur when a result from
execution or from a data-cache access is used to form a cache
address, so it is not necessary to avoid these situations.
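
The Bit Test guideline above recommends emulating the register-based bit-offset form with simpler operations. The following C sketch illustrates only the arithmetic that such an emulation has to perform: a signed bit offset is split into a byte displacement (rounded toward negative infinity) and a bit position within that byte. The function names are hypothetical, and byte granularity is an assumption chosen for clarity; it selects the same bit as a wider access on a little-endian machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte displacement for a signed bit offset: floor(bit_offset / 8).
   Computed in 64 bits so that negating INT32_MIN cannot overflow. */
static ptrdiff_t byte_index_of(int32_t bit_offset)
{
    int64_t n = bit_offset;
    return (ptrdiff_t)(n >= 0 ? n / 8 : -((-n + 7) / 8));
}

/* Bit position within the selected byte: bit_offset mod 8, always 0..7. */
static unsigned bit_in_byte(int32_t bit_offset)
{
    return (unsigned)bit_offset & 7u;
}

/* Hypothetical helper: returns the selected bit of the bit string at base. */
static int bit_test(const uint8_t *base, int32_t bit_offset)
{
    return (base[byte_index_of(bit_offset)] >> bit_in_byte(bit_offset)) & 1;
}

/* Hypothetical helper: returns the old bit value, then sets the bit. */
static int bit_test_and_set(uint8_t *base, int32_t bit_offset)
{
    uint8_t *p = base + byte_index_of(bit_offset);
    uint8_t mask = (uint8_t)(1u << bit_in_byte(bit_offset));
    int old = (*p & mask) != 0;
    *p |= mask;
    return old;
}
```

Each step here is a simple shift, mask, add, or load, which is the kind of operation that can be interleaved with independent instructions. The actual instruction sequence and its cycle count on the AMD-K5 processor are not shown and would need to be checked against the instruction timing tables.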
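
The Locating Branch Targets and Branch Prediction guidelines above both reduce to arithmetic on addresses within a 32-byte cache line. As a minimal sketch, a code generator or assembler post-pass might use helpers like the following. The function names are hypothetical and the tool that would call them is an assumption; only the 17-byte window, the 32-byte line, and the 16-byte half-line rule come from the text.

```c
#include <stdint.h>

enum { LINE_BYTES = 32, TARGET_WINDOW = 17, HALF_BYTES = 16 };

/* Number of padding bytes needed before a branch target so that it lands
   within the first 17 bytes (offsets 0..16) of a 32-byte cache line.
   Targets already inside the window need no padding; otherwise the target
   is pushed to the start of the next line. */
static unsigned pad_for_branch_target(uint32_t target_addr)
{
    unsigned offset = target_addr % LINE_BYTES;
    return (offset < TARGET_WINDOW) ? 0u : (unsigned)LINE_BYTES - offset;
}

/* Index of the 16-byte half-line that holds the prediction for a branch:
   the half-line containing the LAST byte of the branch instruction.
   branch_len must be at least 1. */
static uint32_t prediction_half_line(uint32_t branch_addr, unsigned branch_len)
{
    uint32_t last_byte = branch_addr + branch_len - 1u;
    return last_byte / HALF_BYTES;
}

/* Two branches compete for one prediction bit when their last bytes fall
   in the same half-line. */
static int branches_share_prediction(uint32_t addr_a, unsigned len_a,
                                     uint32_t addr_b, unsigned len_b)
{
    return prediction_half_line(addr_a, len_a) ==
           prediction_half_line(addr_b, len_b);
}
```

The padding count says nothing about which filler to use. As noted above, a few longer instructions generally cost less than a run of single-byte NOPs.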