AMD AMD5K86 - Techniques Specific to the AMD5K86 Processor


18524B/0-Mar1996

AMD5K86 Processor Technical Reference Manual

• Loops: Unroll loops to get more parallelism and reduce loop overhead even with branch prediction. Inline small routines to avoid procedure-call overhead. In both cases, however, consider the cost of possible increased register usage, which might add load/store instructions for register spilling.

• Indexed Addressing: There is no penalty for base + index addressing in the AMD5K86 processor. However, future implementations may have such a penalty to achieve a higher overall clock rate.
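The loop-unrolling advice above can be illustrated with a minimal C sketch. The function names `sum_simple` and `sum_unrolled4` are hypothetical and do not come from the manual, and whether a given compiler keeps all four partial sums in registers on a particular x86 target is an assumption, not something the manual states.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: one add per element, plus the compare and
   branch overhead on every iteration. */
uint32_t sum_simple(const uint32_t *a, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by four with four independent partial sums. The loop
   overhead is paid once per four elements, and the four adds do not
   depend on one another. The cost is four live accumulators instead
   of one, which is the register-pressure trade-off noted above. */
uint32_t sum_unrolled4(const uint32_t *a, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Handle the remaining 0 to 3 elements. */
    for (; i < n; i++)
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}
```

Both functions return the same result for any input. If the extra accumulators force the compiler to spill registers to memory, the added loads and stores may cancel the gain, so the unroll factor is worth measuring rather than assuming.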

4.1.2 Techniques Specific to the AMD5K86 Processor

Code Optimization

• Jumps and Loops: JCXZ requires 1 cycle (correctly predicted) and therefore is faster than a TEST/JZ, in contrast to the Pentium processor in which JCXZ requires 5 or 6 cycles. All forms of LOOP take 2 cycles (correctly predicted), which is also faster than the Pentium processor's 7 or 8 cycles.

• Multiplies: Independent IMULs can be pipelined at one per cycle with 4-cycle latency, in contrast to the Pentium processor's serialized 9-cycle time. (MUL has the same latency, although the implicit AX usage of MUL prevents independent, parallel MUL operations.)

• Dispatch Conflicts: Load-balancing (that is, selecting instructions for parallel decode) is still important, but to a lesser extent than on the Pentium processor. In particular, arrange instructions to avoid execution-unit dispatching conflicts. (See Section 4.2 on page 4-5.)

• Instruction Prefixes: There is no penalty for instruction prefixes, including combinations such as segment-size and operand-size prefixes. This is particularly important for 16-bit code. However, future implementations may have penalties for the use of these prefixes.

• Byte Operations: For byte operations, the high and low bytes of AX, BX, CX, and DX are effectively independent registers that can be operated on in parallel. For example, reading AL does not have a dependency on an outstanding write to AH.

• Move and Convert: MOVZX, MOVSX, CBW, CWDE, CWD, CDQ all take 1 cycle (2 cycles for memory-based input), in contrast to the Pentium processor's 2 or 3 cycles.
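The point about independent multiplies in the list above can be sketched in C by contrasting a dependent multiply chain with a loop whose multiplies are independent of one another. The function names `product_chain` and `dot_product` are hypothetical, and it is an assumption that a compiler maps these multiplications to IMUL and schedules them as described; the manual itself discusses the instructions, not C code.

```c
#include <stddef.h>
#include <stdint.h>

/* Dependent chain: each multiply consumes the previous product, so
   every multiply must wait for the full latency of the one before
   it. Pipelined multiplies cannot help here. */
uint32_t product_chain(const uint32_t *a, size_t n)
{
    uint32_t p = 1;
    for (size_t i = 0; i < n; i++)
        p *= a[i];
    return p;
}

/* Independent multiplies: a[i] * b[i] and a[i + 1] * b[i + 1] do not
   depend on each other, so a processor that pipelines multiplies can
   start the second before the first has finished. Two accumulators
   keep the additions independent as well. */
uint32_t dot_product(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t s0 = 0, s1 = 0;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    /* Handle a trailing element when n is odd. */
    if (i < n)
        s0 += a[i] * b[i];

    return s0 + s1;
}
```

With the figures quoted above (4-cycle latency, one multiply issued per cycle), a run of independent multiplies can in principle overlap almost completely, while the dependent chain pays the full latency for each step. Actual timing depends on the surrounding instructions and on what the compiler generates.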

