AMD AMD5K86 - Techniques Specific to the AMD5K86 Processor

18524B/0-Mar1996 AMD AMD5K86 Processor Technical Reference Manual

Loops: Unroll loops to get more parallelism and reduce loop overhead, even with branch prediction. Inline small routines to avoid procedure-call overhead. In both cases, however, consider the cost of possible increased register usage, which might add load/store instructions for register spilling.

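As a rough illustration of the unrolling advice, the following C sketch sums an array in a plain loop and in a version unrolled by four. The function names and the unroll factor are assumptions chosen for this example, not taken from the manual, and whether unrolling helps depends on the compiler and on register pressure.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain loop: one add plus one loop-control branch per element. */
int32_t sum_simple(const int32_t *a, size_t n)
{
    int32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Unrolled by four (hypothetical factor): one loop-control branch per
 * four elements, and four partial sums that do not depend on each
 * other, so the adds can overlap. */
int32_t sum_unrolled4(const int32_t *a, size_t n)
{
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Remainder loop for the last 0 to 3 elements. */
    for (; i < n; i++)
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}
```

The unrolled version keeps four accumulators live instead of one, which is the extra register usage the paragraph above warns about. On a register-poor x86 target, a larger unroll factor could force spills that cancel the gain.
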
Indexed Addressing: There is no penalty for base + index addressing in the AMD5K86 processor. However, future implementations may have such a penalty to achieve a higher overall clock rate.

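To make the trade-off concrete, here is a small C sketch of the same copy written two ways. The function names are hypothetical, and the mapping to addressing modes is only what a compiler might plausibly emit. An optimizing compiler is free to turn either form into the other.

```c
#include <stddef.h>
#include <stdint.h>

/* Indexed form: each access is naturally expressed as base + index. */
void copy_indexed(int32_t *dst, const int32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Pointer form: each access uses a single base pointer that is
 * advanced after every element, so no index register is needed. */
void copy_pointers(int32_t *dst, const int32_t *src, size_t n)
{
    const int32_t *end = src + n;
    while (src != end)
        *dst++ = *src++;
}
```

Since the paragraph above reports no penalty for base + index addressing on this processor, the indexed form costs nothing extra here. The pointer form is the kind of rewrite that might matter on an implementation that does penalize indexed addressing.
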
4.1.2 Techniques Specific to the AMD5K86 Processor

Code Optimization

Jumps and Loops: JCXZ requires 1 cycle (correctly predicted) and therefore is faster than a TEST/JZ, in contrast to the Pentium processor, in which JCXZ requires 5 or 6 cycles. All forms of LOOP take 2 cycles (correctly predicted), which is also faster than the Pentium processor's 7 or 8 cycles.

Multiplies: Independent IMULs can be pipelined at one per cycle with 4-cycle latency, in contrast to the Pentium processor's serialized 9-cycle time. (MUL has the same latency, although the implicit AX usage of MUL prevents independent, parallel MUL operations.)

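The difference between dependent and independent multiplies can be sketched in C as follows. The function names are hypothetical, and the code assumes the compiler emits one multiply instruction per `*=`, which it is not required to do. Unsigned arithmetic is used so that wraparound is well defined.

```c
#include <stddef.h>
#include <stdint.h>

/* One dependency chain: every multiply needs the previous result, so
 * each one waits out the full multiply latency. */
uint32_t product_one_chain(const uint32_t *a, size_t n)
{
    uint32_t p = 1;
    for (size_t i = 0; i < n; i++)
        p *= a[i];
    return p;
}

/* Two chains: the multiplies into p0 and p1 do not depend on each
 * other, so a pipelined multiplier can overlap them. */
uint32_t product_two_chains(const uint32_t *a, size_t n)
{
    uint32_t p0 = 1, p1 = 1;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        p0 *= a[i];
        p1 *= a[i + 1];
    }
    if (i < n)          /* odd element left over */
        p0 *= a[i];

    return p0 * p1;
}
```

Both functions return the same value because unsigned multiplication modulo 2^32 is associative and commutative. With the 4-cycle latency and one-per-cycle issue rate quoted above, up to four independent chains could in principle be in flight at once, at the cost of more live registers.
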
Dispatch Conflicts: Load-balancing (that is, selecting instructions for parallel decode) is still important, but to a lesser extent than on the Pentium processor. In particular, arrange instructions to avoid execution-unit dispatching conflicts. (See Section 4.2 on page 4-5.)

Instruction Prefixes: There is no penalty for instruction prefixes, including combinations such as segment-size and operand-size prefixes. This is particularly important for 16-bit code. However, future implementations may have penalties for the use of these prefixes.

Byte Operations: For byte operations, the high and low bytes of AX, BX, CX, and DX are effectively independent registers that can be operated on in parallel. For example, reading AL does not have a dependency on an outstanding write to AH.

Move and Convert: MOVZX, MOVSX, CBW, CWDE, CWD, and CDQ all take 1 cycle (2 cycles for memory-based input), in contrast to the Pentium processor's 2 or 3 cycles.

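For readers working in C rather than assembly, these instructions correspond to ordinary widening conversions. The sketch below uses hypothetical function names, and the instruction named in each comment is only what an x86 compiler might typically choose, not a guarantee.

```c
#include <stdint.h>

/* Signed 8-bit to 32-bit: the sign bit is copied into the upper bits
 * (the kind of conversion MOVSX performs). */
int32_t widen_signed8(int8_t v)
{
    return v;
}

/* Unsigned 8-bit to 32-bit: the upper bits are filled with zeros
 * (the kind of conversion MOVZX performs). */
uint32_t widen_unsigned8(uint8_t v)
{
    return v;
}

/* Signed 32-bit to 64-bit: the upper half becomes all copies of the
 * sign bit, as CDQ does when it extends EAX into EDX:EAX. */
int64_t widen_signed32(int32_t v)
{
    return v;
}
```

Because these conversions take a single cycle on this processor according to the figures above, there is little reason to avoid narrow data types merely to dodge the widening step.
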
4-3
