Sun Microsystems UltraSPARC-I


Sun Microelectronics


16. Code Generation Guidelines

16.3.2 D-Cache Timing

The latency of a load to the D-Cache depends on the opcode. For unsigned loads, data can be used two cycles after the load. For instance, if the first two instructions in the instruction buffer are a load and an instruction dependent on that load, the grouping logic will break the group after the load and a bubble will be inserted in the pipeline the following cycle. Code compiled for an earlier SPARC processor with a load-use penalty of one cycle will show a penalty of about 0.1 CPI just for this rule; thus, it is very important to separate loads from their use.
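The effect of separating a load from its use can be illustrated with a toy in-order model. This is only a sketch under stated assumptions: it issues one instruction per cycle (the real grouping logic issues several), it tracks only load results, and the tuple encoding and the name count_bubbles are hypothetical, not part of any Sun tool.

```python
# Toy model: one instruction issues per cycle, in order. The real grouping
# logic can issue several instructions per cycle; that is ignored here.
LOAD_LATENCY = 2  # cycles before unsigned load data can be used (from the text)


def count_bubbles(program, load_latency=LOAD_LATENCY):
    """Return the number of bubble cycles caused by load-use hazards.

    `program` is a list of (dest, sources, is_load) tuples. This encoding
    is an assumption made for the sketch.
    """
    ready = {}   # register -> first cycle at which its loaded value is usable
    cycle = 0
    bubbles = 0
    for dest, sources, is_load in program:
        # Stall until every source produced by an earlier load is available.
        need = max((ready.get(src, 0) for src in sources), default=0)
        if need > cycle:
            bubbles += need - cycle
            cycle = need
        if is_load:
            ready[dest] = cycle + load_latency
        elif dest is not None:
            # Overwritten by a non-load result; assumed usable next cycle.
            ready.pop(dest, None)
        cycle += 1
    return bubbles


# A load immediately followed by its use.
back_to_back = [("r1", ["r2"], True), ("r3", ["r1"], False)]
# The same pair with one independent instruction scheduled in between.
separated = [("r1", ["r2"], True), ("r4", ["r5"], False), ("r3", ["r1"], False)]
```

Under these assumptions, count_bubbles(back_to_back) gives one bubble and count_bubbles(separated) gives none. Passing load_latency=1 models the earlier processors mentioned above, for which the back-to-back pair costs nothing.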

16.3.2.1 Signed Loads

All signed loads smaller than 64 bits must be separated from their use by three cycles; otherwise, an extra bubble is inserted in the pipeline to force the separation between the load and its use. Floating-point loads are not sign extended, so they have a latency of two cycles.

Once a signed load (smaller than 64 bits) is encountered in the instruction stream, all subsequent consecutive loads (signed or unsigned) also return data in three cycles; otherwise, there would be a collision between two loads returning data. As soon as a cycle without a load appears in the pipeline, the latency of loads is brought back to two cycles.
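The latency rule above behaves like a two-state machine, which can be sketched as follows. The per-cycle encoding is an assumption made for illustration: each entry stands for one cycle and is "signed" (a signed load smaller than 64 bits), "load" (any other load), or None (a cycle without a load). The function name load_latencies is hypothetical.

```python
def load_latencies(cycles):
    """Return the load-to-use latency of the load in each cycle.

    `cycles` has one entry per cycle: "signed", "load", or None.
    Cycles without a load get None in the result.
    """
    latencies = []
    extended = False  # True while a run of loads begun by a signed load continues
    for op in cycles:
        if op is None:
            # A cycle without a load brings the latency back to two cycles.
            extended = False
            latencies.append(None)
            continue
        if op == "signed":
            extended = True
        latencies.append(3 if extended else 2)
    return latencies


example = load_latencies(["load", "signed", "load", "load", None, "load"])
```

For this sequence the sketch yields [2, 3, 3, 3, None, 2]: the two loads that directly follow the signed load inherit its three-cycle latency, and the load after the empty cycle is back to two.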

Note: The SPARC-V8 LD instruction is replaced with LDUW in SPARC-V9; the new instruction does not require sign extension.

16.3.3 Data Alignment

SPARC-V9 requires that all accesses be aligned on an address that is a multiple of the size of the access; otherwise, a mem_address_not_aligned trap is generated. This is especially important for double-precision floating-point loads, which should be aligned on an 8-byte boundary. If misalignment is determined to be possible at compile time, it is better to use two LDF (load floating-point, single precision) instructions and avoid the trap. UltraSPARC supports single-precision loads mixed with double-precision operations, so that the case above can execute without penalty (except for the additional load). If a trap does occur, UltraSPARC dedicates a trap vector for this specific misalignment, which reduces the overall penalty of the trap.
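The choice between one double-precision load and two LDF instructions can be sketched as below. This is a simplified illustration, not a description of any actual compiler: a real compiler decides from what it can prove about alignment at compile time, whereas the sketch takes a concrete address. The helper names are hypothetical, and LDDF is used here as the SPARC-V9 mnemonic for the double-precision floating-point load.

```python
def is_aligned(address, size):
    """True if `address` is a multiple of the access size in bytes."""
    return address % size == 0


def plan_double_fp_load(address):
    """Pick the loads for a double-precision value at `address`.

    Returns a list of (mnemonic, address) pairs. Hypothetical helper.
    """
    if is_aligned(address, 8):
        # Properly aligned: a single 8-byte load is safe.
        return [("LDDF", address)]
    if is_aligned(address, 4):
        # Only 4-byte aligned: two single-precision loads avoid the trap.
        return [("LDF", address), ("LDF", address + 4)]
    # Each LDF is a 4-byte access and would itself need 4-byte alignment.
    raise ValueError("address is not 4-byte aligned")
```

For example, plan_double_fp_load(16) selects a single LDDF, while plan_double_fp_load(20) selects LDF at 20 and LDF at 24, trading one extra load for the avoided trap.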

Grouping load data is desirable, since a D-Cache sub-block can contain either four properly aligned single-precision operands or two properly aligned double-precision operands (eight and four, respectively, for a D-Cache line). As we shall
