4-4 Performance
AMD-K5 Processor Technical Reference Manual 18524C/0 Nov1996
Bit Scan: BSF and BSR take 1 cycle (2 cycles for memory-based
input), in contrast to the Pentium processor's data-dependent
6 to 34 cycles.
Bit Test: BT, BTS, BTR, and BTC take 1 cycle for register-based
operands, and 2 or 3 cycles for memory-based operands with
immediate bit-offset, in contrast to the Pentium processor's
4 to 9 cycles. Register-based bit-offset forms on the AMD-K5
processor take 5 cycles. If the semantics of the register-based
bit-offset form are desired (where the bit offset can cover a
very large bit string in memory), it is better to emulate this
with simpler instructions that can be interleaved with
independent instructions for greater parallelism.
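
As a rough illustration of the emulation suggested above, the following C sketch splits a large bit offset into a word index and a bit position, then uses a plain load, shift, and mask. The function name `bit_test_emulated` is hypothetical, and the sketch assumes a non-negative bit offset and 32-bit words; it is not taken from the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: tests one bit in a long bit string using only a
   shift, a mask, and a load, in place of the register-offset form of BT.
   Assumes a non-negative bit offset and 32-bit words. */
int bit_test_emulated(const uint32_t *base, uint32_t bit_offset)
{
    assert(base != NULL);
    uint32_t word_index = bit_offset >> 5;    /* which 32-bit word holds the bit */
    uint32_t bit_in_word = bit_offset & 31u;  /* bit position inside that word */
    return (int)((base[word_index] >> bit_in_word) & 1u);
}
```

A compiler would typically turn each of these steps into a simple instruction, which leaves room to interleave independent work between them.
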
Floating-Point Top-of-Stack Bottleneck: The AMD-K5 processor
has a pipelined floating-point unit. Greater parallelism can be
achieved by using FXCH in parallel with floating-point
operations to alleviate the top-of-stack bottleneck, as in the
Pentium processor. The AMD-K5 processor also permits integer
operations (ALU, branch, load/store) in parallel with
floating-point operations.
Locating Branch Targets: Performance can be sensitive to code
alignment, especially in tight loops. Locating branch targets
to the first 17 bytes of the 32-byte cache line maximizes the
opportunity for parallel execution at the target. NOPs can be
added to adjust this alignment. The AMD-K5 processor executes
NOPs (opcode 90h) at the rate of two per cycle. Adding NOPs is
even more effective if they execute in parallel with existing
code. Other instructions of greater length, such as a
register-immediate TEST instruction, can be used as NOPs to
minimize the overhead of such padding.
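
The alignment rule above can be sketched as a small C helper that reports how many padding bytes an assembler or code generator would need to insert before a branch target. The name `padding_for_branch_target` is hypothetical, and the sketch assumes that "the first 17 bytes" means line offsets 0 through 16.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SIZE     32u  /* instruction cache line size in bytes */
#define TARGET_WINDOW 17u  /* assumed: targets at line offsets 0..16 are preferred */

/* Hypothetical helper: number of padding bytes needed before a branch
   target at `address` so that it falls within the preferred window. */
uint32_t padding_for_branch_target(uint32_t address)
{
    uint32_t offset = address & (LINE_SIZE - 1u);
    /* Already inside the window: no padding. Otherwise pad to the next line. */
    uint32_t pad = (offset < TARGET_WINDOW) ? 0u : LINE_SIZE - offset;
    /* Postcondition: the padded address lies inside the window. */
    assert(((address + pad) & (LINE_SIZE - 1u)) < TARGET_WINDOW);
    return pad;
}
```

Under this assumption the padding never exceeds 15 bytes, and it could be filled with single-byte NOPs or with fewer, longer instructions as the text suggests.
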
Branch Prediction: There are two branch prediction bits in a
32-byte instruction cache line. One bit applies to the first
16 bytes of the line and the second bit applies to the second
16 bytes of the line. For effective branch prediction, code
should be generated with one branch per 16-byte line half.
The prediction is associated with the half-line containing
the last byte of the branch instruction.
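
To make the half-line rule concrete, the following C sketch computes which 16-byte half-line a branch's prediction bit would belong to, based on the address of the instruction's last byte, and checks whether two branches would compete for the same bit. Both function names are hypothetical and serve only as an illustration of the rule described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index of the 16-byte half-line that contains the
   last byte of a branch instruction. */
uint32_t prediction_half_line(uint32_t branch_address, uint32_t branch_length)
{
    assert(branch_length >= 1u);
    uint32_t last_byte = branch_address + branch_length - 1u;
    return last_byte >> 4;  /* divide by 16 to get the half-line index */
}

/* Hypothetical helper: nonzero if two branches end in the same half-line
   and would therefore share a single prediction bit. */
int branches_share_prediction_bit(uint32_t addr_a, uint32_t len_a,
                                  uint32_t addr_b, uint32_t len_b)
{
    return prediction_half_line(addr_a, len_a) ==
           prediction_half_line(addr_b, len_b);
}
```

Note that a 2-byte branch starting at offset 0Fh ends at offset 10h, so it is associated with the second half-line even though it starts in the first.
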
Address-Generation Interlocks (AGIs): The AMD-K5 processor does
not suffer from the single-cycle penalty that the 486 and
Pentium processors have when a result from execution or from a
data-cache access is used to form a cache address, so it is not
necessary to avoid these situations.
