AMD AMD5K86 - Dispatch and Execution Timing


AMD5K86 Processor Technical Reference Manual, 18524B/0-Mar1996, page 4-4

• Bit Scan: BSF and BSR take 1 cycle (2 cycles for memory-based input), in contrast to the Pentium processor's data-dependent 6 to 34 cycles.

• Bit Test: BT, BTS, BTR, and BTC take 1 cycle for register-based operands, and 2 or 3 cycles for memory-based operands with an immediate bit offset, in contrast to the Pentium processor's 4 to 9 cycles. Register-based bit-offset forms on the AMD5K86 processor take 5 cycles. If the semantics of the register-based bit-offset form are desired (where the bit offset can cover a very large bit string in memory), it is better to emulate this with simpler instructions that can be interleaved with independent instructions for greater parallelism.

• Floating-Point Top-of-Stack Bottleneck: The AMD5K86 processor has a pipelined floating-point unit. Greater parallelism can be achieved by using FXCH in parallel with floating-point operations to alleviate the top-of-stack bottleneck, as in the Pentium processor. The AMD5K86 processor also permits integer operations (ALU, branch, load/store) in parallel with floating-point operations.

• Locating Branch Targets: Performance can be sensitive to code alignment, especially in tight loops. Locating branch targets to the first 17 bytes of the 32-byte cache line maximizes the opportunity for parallel execution at the target. NOPs can be added to adjust this alignment. The AMD5K86 processor executes NOPs (opcode 90h) at the rate of two per cycle. Adding NOPs is even more effective if they execute in parallel with existing code. Other instructions of greater length, such as a register-based TEST instruction, can be used as NOPs to minimize the overhead of such padding.

• Branch Prediction: There are two branch prediction bits in a 32-byte instruction cache line. One bit applies to the first 16 bytes of the line and the second bit applies to the second 16 bytes of the line. For effective branch prediction, code should be generated with one branch per 16-byte line half.

• Address-Generation Interlocks (AGIs): The AMD5K86 processor does not suffer from the single-cycle penalty that the 486 and Pentium processors have when a result from execution or from a data-cache access is used to form a cache address, so it is not necessary to avoid these situations.
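The emulation suggested in the Bit Test item can be sketched in C as a shift to select the doubleword, a mask to select the bit, and an ordinary load. This is only an illustration of the logic, not code from the manual: the helper names `bit_test` and `bit_test_and_set` are hypothetical, the bit string is assumed to be stored as 32-bit words, and only non-negative offsets are handled (the register-offset forms of BT treat the offset as signed). In practice the sequence would be written in assembly so that its simple steps can be interleaved with independent instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: test bit `offset` of a bit string held in
   32-bit words, using only a shift, a mask, and a load. */
static int bit_test(const uint32_t *base, uint32_t offset)
{
    assert(base != NULL);
    uint32_t word = base[offset >> 5];            /* offset / 32 selects the doubleword */
    return (int)((word >> (offset & 31u)) & 1u);  /* offset % 32 selects the bit */
}

/* Hypothetical helper: same addressing, but also sets the bit and
   returns its previous value (the BTS-style behavior). */
static int bit_test_and_set(uint32_t *base, uint32_t offset)
{
    assert(base != NULL);
    uint32_t *word = &base[offset >> 5];
    uint32_t mask = 1u << (offset & 31u);
    int old = (*word & mask) != 0;
    *word |= mask;
    return old;
}
```

For example, with a four-word array whose second word holds 0x10, `bit_test(bits, 36)` reports a set bit because offset 36 maps to bit 4 of word 1.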
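The alignment rule in the Locating Branch Targets item reduces to simple arithmetic on the target address. The sketch below is an assumption-laden illustration, not a tool described in the manual: the function name `pad_bytes_for_target` is hypothetical, and "the first 17 bytes" is read as offsets 0 through 16 within the 32-byte line. It returns how many padding bytes (for example, NOPs) would move a branch target into that window.

```c
#include <assert.h>
#include <stdint.h>

enum {
    LINE_BYTES = 32,    /* instruction cache line size, from the text */
    TARGET_WINDOW = 17  /* "first 17 bytes", assumed to mean offsets 0..16 */
};

/* Hypothetical helper: number of padding bytes to insert before a
   branch target so that it falls within the first 17 bytes of a line. */
static unsigned pad_bytes_for_target(uint32_t target_addr)
{
    unsigned offset = target_addr % LINE_BYTES;
    /* Already inside the window: no padding. Otherwise pad up to the
       start of the next line. */
    unsigned pad = (offset < TARGET_WINDOW) ? 0u : (unsigned)LINE_BYTES - offset;
    /* Postcondition: the padded target lies inside the window. */
    assert((target_addr + pad) % LINE_BYTES < TARGET_WINDOW);
    return pad;
}
```

A target at line offset 16 needs no padding, while one at offset 17 needs 15 bytes to reach the start of the next line. Whether that padding is worth its cost depends on how hot the loop is, which this sketch does not model.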

Performance
