IA-32 Intel® Architecture Optimization
Figure 4-5 pextrw Instruction

[Figure: the MMX source register MM (bits 63-0) holds four words X4 X3 X2 X1; pextrw copies the selected word (here X1) into the low word of the 32-bit destination register R32 (bits 31-0), zero-filling bits 31-16.]

Example 4-7 pextrw Instruction Code

; Input:
;    eax        address of the source value
;    immediate  "0" (selects word 0)
; Output:
;    edx        32-bit integer register containing the
;               extracted word in the low-order bits and
;               the high-order bits zero-extended
movq    mm0, [eax]
pextrw  edx, mm0, 0

Insert Word

The pinsrw instruction loads a word from the lower half of a 32-bit integer register, or from memory, and inserts it into the MMX technology destination register at the position selected by the two least significant bits of the immediate constant. The insertion leaves the other three words of the destination register untouched (see Figure 4-6 and Example 4-8).
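Example 4-8 appears later in the manual and is not reproduced on this page. As a stand-in, here is a minimal sketch of the register form of pinsrw described above; it assumes the word to insert sits in the low 16 bits of edx and that mm0 already holds four words X4 X3 X2 X1 (illustrative names, not taken from the manual):

; Input:
;    edx  32-bit integer register; bits 15:0 hold the word to insert
;    mm0  destination register holding words X4 X3 X2 X1
; Output:
;    mm0  holds X4 X3 X2 Y1, where Y1 is the low word of edx
pinsrw  mm0, edx, 0    ; imm8 = 0 selects word position 0; words 1-3 unchanged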


Intel IA-32 Architecture Specifications

General
Instruction Set: x86
Instruction Set Type: CISC
Memory Segmentation: Supported
Operating Modes: Real mode, Protected mode, Virtual 8086 mode
Max Physical Address Size: 36 bits (with PAE)
Max Virtual Address Size: 32 bits
Architecture: IA-32 (Intel Architecture, 32-bit)
Addressable Memory: 4 GB (up to 64 GB with Physical Address Extension)
Floating Point Registers: 8 x 80-bit
MMX Registers: 8 x 64-bit
SSE Registers: 8 x 128-bit
Registers: General-purpose (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP); segment (CS, DS, SS, ES, FS, GS); instruction pointer (EIP); flags (EFLAGS)
Floating Point Unit: Yes (x87)

Summary

Introduction

Chapter 1: IA-32 Intel® Architecture Processor Family Overview

SIMD Technology

Introduces SIMD computations and technologies like MMX, SSE, SSE2, and SSE3.

Summary of SIMD Technologies

Details MMX technology and Streaming SIMD Extensions (SSE, SSE2, SSE3) capabilities.

Intel® Extended Memory 64 Technology (Intel® EM64T)

Describes Intel EM64T as an extension to IA-32, increasing linear address space to 64 bits.

Intel NetBurst® Microarchitecture

Describes features and operation of the Intel NetBurst microarchitecture for high performance.

Overview of the Intel NetBurst Microarchitecture Pipeline

Details the pipeline structure: in-order issue front end, out-of-order superscalar execution core, and in-order retirement.

The Front End

Explains the front end's functions: instruction fetching, decoding into µops, and branch prediction.

The Out-of-order Core

Discusses the core's ability to execute instructions out of order for parallelism and dispatching µops.

Retirement

Explains the retirement logic that tracks and commits µop results according to program order.

Branch Prediction

Details branch prediction importance, mechanisms, and static prediction for deeply pipelined processors.

Execution Core Detail

Describes how the execution core optimizes performance by handling common cases efficiently.

Instruction Latency and Throughput

Explains port restrictions, result latencies, and issue latencies; this information helps software order instructions.

Execution Units and Issue Ports

Details the four issue ports and associated execution units that dispatch µops to the core.

Caches

Describes the on-chip cache hierarchy (up to three levels) of the Intel NetBurst microarchitecture and its organization.

Data Prefetch

Discusses software and hardware mechanisms for prefetching data to improve memory access performance.

Loads and Stores

Explains techniques used to speed up memory operations like speculative execution and reordering.

Store Forwarding

Describes how store data can be forwarded directly to a subsequent load, outlining requirements and restrictions.

Intel® Pentium® M Processor Microarchitecture

Provides an overview of the Pentium M processor microarchitecture, highlighting differences from NetBurst.

The Front End

Explains the Pentium M processor's front end, consisting of fetch/decode unit and instruction cache.

Microarchitecture of Intel® Core™ Solo and Intel® Core™ Duo Processors

Describes enhancements in Core Solo/Duo microarchitecture over Pentium M for performance and power efficiency.

Microarchitecture Pipeline and Multi-Core Processors

Discusses how the pipeline of multi-core processors resembles that of single-core implementations, and notes differences in cache hierarchy.

Shared Cache in Intel Core Duo Processors

Explains the shared second-level cache and single bus interface in Intel Core Duo processors for minimizing bus traffic.

Chapter 2: General Optimization Guidelines

Tuning to Achieve Optimum Performance

Discusses key factors for achieving optimal processor performance like branch prediction and memory access.

Tuning to Prevent Known Coding Pitfalls

Lists and explains common coding pitfalls that limit performance on IA-32 processors.

General Practices and Coding Guidelines

Provides guidelines derived from performance factors and highlights practices using performance tools.

Use Available Performance Tools

Discusses using Intel C++ Compiler and VTune Performance Analyzer for code optimization.

Optimize Performance Across Processor Generations

Suggests strategies like CPUID dispatch and compatible code for performance across processor generations.

Optimize Branch Predictability

Explains techniques to improve branch predictability and optimize instruction prefetching.

Optimize Memory Access

Provides guidelines for optimizing code and data memory accesses, including alignment and store-forwarding.

Optimize Floating-point Performance

Offers advice on optimizing floating-point operations, including precision and SIMD capabilities.

Optimize Instruction Selection

Focuses on selecting instructions for path length, minimizing uops, and maximizing retirement throughput.

Optimize Instruction Scheduling

Discusses considering latencies and resource constraints for instruction scheduling.

Enable Vectorization

Explains how to enable vectorization for parallelism, focusing on data types and loop nesting.

Coding Rules, Suggestions and Tuning Hints

Presents rules, suggestions, and hints for performance tuning, ranked by impact and generality.

Performance Tools

Introduces Intel C++ Compiler and VTune Performance Analyzer for application optimization.

General Compiler Recommendations

Discusses using compilers tuned for target microarchitecture and compiler switches for optimization.

VTune™ Performance Analyzer

Explains how to use VTune Performance Analyzer for performance monitoring and identifying coding pitfalls.

Processor Perspectives

Compares coding recommendations across different microarchitectures like Pentium M and Core processors.

CPUID Dispatch Strategy and Compatible Code Strategy

Discusses using CPUID for processor generation identification and compatible code strategies for performance.

Transparent Cache-Parameter Strategy

Explains using CPUID's deterministic cache parameter leaf for forward-compatible coding.

Threading Strategy and Hardware Multi-Threading Support

Covers hardware multi-threading support like dual-core and Hyper-Threading Technology for application design.

Chapter 3: Coding for SIMD Architectures

Checking for Processor Support of SIMD Technologies

Explains how to check for MMX, SSE, SSE2, and SSE3 support in processors and operating systems.

Checking for MMX Technology Support

Details using CPUID to check for MMX technology support via feature flags in the EDX register.

Checking for Streaming SIMD Extensions Support

Outlines steps to check for SSE support, including processor and OS checks using CPUID.

Checking for Streaming SIMD Extensions 2 Support

Explains how to check for SSE2 support, similar to SSE, focusing on processor and OS requirements.

Checking for Streaming SIMD Extensions 3 Support

Details checking for SSE3 support, including processor and OS checks, and feature bits in CPUID.

Considerations for Code Conversion to SIMD Programming

Guides developers on evaluating code for SIMD conversion by asking key questions about benefits and requirements.

Coding Techniques

Discusses vectorization, memory access dependencies, and loop strip-mining for SIMD architecture.

Coding Methodologies

Compares trade-offs between hand-coded assembly, intrinsics, and automatic vectorization for SIMD programming.

Automatic Vectorization

Explains how the Intel C++ Compiler automatically vectorizes loops and the techniques used to identify vectorizable loops.

Chapter 4: Optimizing for SIMD Integer Applications

General Rules on SIMD Integer Code

Provides overall rules for SIMD integer code, including intermixing with x87, type checking, and prefetching.

Using SIMD Integer with x87 Floating-point

Discusses rules and considerations for mixing 64-bit SIMD integer instructions with x87 floating-point registers.

Using the EMMS Instruction

Explains the EMMS instruction's role in clearing the x87 stack for proper x87 code operation after MMX.

Data Alignment

Emphasizes the importance of 8-byte alignment for 64-bit SIMD integer data and 16-byte for 128-bit data.

Data Movement Coding Techniques

Covers techniques for gathering and re-arranging data for efficient SIMD computation.

Unsigned Unpack

Explains how MMX unpack instructions can be used to zero-extend unsigned numbers.

Signed Unpack

Details using the psrad instruction for sign-extending values during unpacking.

Interleaved Pack with Saturation

Explains packssdw instruction for packing signed doublewords into saturated signed words.

Interleaved Pack without Saturation

Describes pack instructions for packing words without saturation, where only the low-order bits are significant so that overflow cannot occur.

Non-Interleaved Unpack

Explains unpack instructions for merging operands without interleaving, focusing on doublewords.

Extract Word

Details the pextrw instruction for extracting words and moving them to a 32-bit register.

Insert Word

Explains the pinsrw instruction for loading words into MMX destination registers at specified positions.

Move Byte Mask to Integer

Describes the pmovmskb instruction for creating a bit mask from the most significant bits of bytes.

Packed Shuffle Word for 64-bit Registers

Explains the pshufw instruction for selecting words from an MMX register or memory into a destination MMX register, using an immediate operand.

Packed Shuffle Word for 128-bit Registers

Details pshuflw/pshufhw/pshufd for shuffling word/double-word fields within 128-bit registers.

Unpacking/interleaving 64-bit Data in 128-bit Registers

Explains punpcklqdq/punpckhqdq for interleaving 64-bit source operands into 128-bit destination registers.

Data Movement

Discusses instructions enabling data movement from 64-bit SIMD integer registers to 128-bit SIMD registers.

Conversion Instructions

Mentions new instructions for 4-wide conversion of single-precision to double-word integer data.

Generating Constants

Shows code segments for generating frequently used constants in SIMD registers using specific instructions.

Building Blocks

Describes instructions and algorithms for implementing common code building blocks efficiently.

Absolute Difference of Unsigned Numbers

Explains computing absolute difference of two unsigned numbers using subtract with unsigned saturation.

Absolute Difference of Signed Numbers

Details a sorting technique using XOR for calculating absolute difference of two signed numbers.

Absolute Value

Shows how to compute the absolute value |x| of signed words using pmaxsw and psubw instructions.

Clipping to an Arbitrary Range [high, low]

Explains clipping values to a range [high, low] using packed-add and packed-subtract with saturation.

Highly Efficient Clipping

Provides techniques for clipping signed words and unsigned bytes to arbitrary ranges using specific instructions.

Clipping to an Arbitrary Signed Range [high, low]

Shows how to clip signed words to an arbitrary range using pmaxsw and pminsw instructions.

Clipping to an Arbitrary Unsigned Range [high, low]

Details clipping unsigned values to a range [high, low] using packed-add and packed-subtract with saturation.

Packed Max/Min of Signed Word and Unsigned Byte

Explains pmaxsw and pminsw for signed words, and pmaxub and pminub for unsigned bytes.

Packed Multiply High Unsigned

Describes the pmulhuw and pmulhw instructions, which multiply unsigned/signed words and return the high-order 16 bits of each product.

Packed Sum of Absolute Differences

Explains psadbw instruction for computing absolute difference of unsigned bytes and summing them.

Packed Average (Byte/Word)

Describes the pavgb and pavgw instructions, which add unsigned elements and shift each result right by one bit to produce a rounded average.

Complex Multiply by a Constant

Explains complex multiplication using pmaddwd instruction, requiring data formatted into 16-bit values.

Packed 32*32 Multiply

Details PMULUDQ instruction for unsigned multiply on double-word operands.

Packed 64-bit Add/Subtract

Describes PADDQ/PSUBQ instructions for adding/subtracting quad-word operands.

128-bit Shifts

Explains pslldq/psrldq instructions for shifting operands by bytes specified by an immediate operand.

Memory Optimizations

Discusses techniques to improve memory accesses using larger block sizes and avoiding data mixing.

Partial Memory Accesses

Addresses issues with large loads after small stores or vice versa, and how to avoid stalls.

Supplemental Techniques for Avoiding Cache Line Splits

Explains using LDDQU instruction to avoid cache line splits when loading non-16-byte aligned data.

Increasing Bandwidth of Memory Fills and Video Fills

Provides guidelines for obtaining higher bandwidth and shorter latencies for sequential memory fills.

Increasing Memory Bandwidth Using the MOVDQ Instruction

Explains using movdq for storing data to UC/WC memory to reduce stores per fill cycle.

Increasing Memory Bandwidth by Loading and Storing to and from the Same DRAM Page

Discusses how keeping loads and stores within the same DRAM page improves bandwidth by avoiding DRAM page-miss latency.

Increasing UC and WC Store Bandwidth by Using Aligned Stores

Explains how aligned stores yield higher bandwidth for UC/WC memory by avoiding cache line boundary crossing.

Converting from 64-bit to 128-bit SIMD Integer

Simplifies porting 64-bit integer applications to SSE2 by using 128-bit instructions and considering alignment.

SIMD Optimizations and Microarchitectures

Discusses optimizing SIMD code for Intel Core Solo and Core Duo processors, considering microarchitectural differences.

Packed SSE2 Integer versus MMX Instructions

Compares favoring 128-bit SIMD integer instructions over 64-bit MMX instructions on Core processors for performance.

Chapter 5: Optimizing for SIMD Floating-point Applications

General Rules for SIMD Floating-point Code

Provides rules for optimizing floating-point code with SIMD instructions, balancing port utilization and exceptions.

Planning Considerations

Lists issues programmers should consider for achieving optimum performance with SIMD floating-point instructions.

Using SIMD Floating-point with x87 Floating-point

Explains mixing SIMD floating-point code with x87 floating-point or 64-bit SIMD integer code.

Scalar Floating-point Code

Discusses SIMD floating-point instructions operating on least-significant operands (scalar) and their advantages.

Data Alignment

Emphasizes 16-byte alignment for SIMD floating-point data to avoid exceptions and performance penalties.

Data Arrangement

Discusses arranging data contiguously for SIMD registers to improve performance, reduce cache misses, and increase throughput.

Vertical versus Horizontal Computation

Compares vertical data processing in SSE/SSE2 with horizontal data movement for SIMD operations.

Data Swizzling

Explains rearranging AoS format data into SoA format using unpcklps/unpckhps and movlps/movhps.

SSE3 and Complex Arithmetics

Demonstrates using SSE3 for complex number multiplication and division, benefiting from AoS data structures.

SIMD Optimizations and Microarchitectures

Discusses optimizing SIMD code for Intel Core Solo and Core Duo processors, focusing on packed floating-point performance.

Packed Floating-Point Performance

Compares packed SIMD floating-point code performance on Core Solo vs. Pentium M processors, noting decoder improvements.

Chapter 6: Optimizing Cache Usage

General Prefetch Coding Guidelines

Provides guidelines to reduce memory traffic and utilize bandwidth by leveraging hardware prefetchers and compiler intrinsics.

Hardware Prefetching of Data

Explains automatic data prefetching in Pentium 4/Xeon/M and Core processors, covering its characteristics and triggers.

Prefetch and Cacheability Instructions

Discusses prefetch and cacheability control instructions for managing data caching and minimizing pollution.

The Non-temporal Store Instructions

Describes streaming, non-temporal stores (movntps, movntdq) for managing data retention and minimizing cache pollution.

Fencing

Explains the necessity of fencing operations (sfence, lfence, mfence) for ensuring store data visibility and ordering.

Streaming Non-temporal Stores

Details how streaming stores improve performance by increasing store bandwidth and reducing cache disturbance.

Memory Type and Non-temporal Stores

Discusses considerations when memory type (UC, WC, WB) interacts with non-temporal store hints.

Write-Combining

Explains WC semantics for ensuring coherence and using fencing for producer-consumer models.

Streaming Store Usage Models

Covers coherent and non-coherent requests for streaming stores, emphasizing the need for sfence.

Coherent Requests

Describes how streaming stores work with WC memory and the role of sfence for coherency in MP systems.

Non-coherent Requests

Details non-coherent requests from I/O devices and the use of sfence for coherency with WC memory mapping.

Streaming Store Instruction Descriptions

Describes movntq/movntdq for non-temporal integer stores and movntps for non-temporal single-precision float stores.

The fence Instructions

Introduces sfence, lfence, and mfence instructions for ensuring memory ordering and visibility.

The sfence Instruction

Explains sfence for ensuring global visibility of stores before subsequent store instructions.

The lfence Instruction

Details lfence for ensuring global visibility of load instructions before subsequent load instructions.

The mfence Instruction

Explains mfence for ensuring global visibility of both load and store instructions before subsequent memory references.

The clflush Instruction

Describes clflush for invalidating cache lines and its usage with fencing for speculative memory references.

Memory Optimization Using Prefetch

Discusses software-controlled prefetch and automatic hardware prefetch for memory optimization.

Software-controlled Prefetch

Explains software prefetch instructions for hiding data access latency and its characteristics.

Software Prefetching Usage Checklist

Covers essential issues for using software prefetch instructions effectively, including scheduling distance and concatenation.

Software Prefetch Scheduling Distance

Defines prefetch scheduling distance (PSD) and provides a simplified equation for its calculation.

Software Prefetch Concatenation

Explains prefetch concatenation to bridge execution pipeline bubbles and remove memory de-pipelining stalls.

Minimize Number of Software Prefetches

Discusses reducing software prefetches by unrolling loops or software-pipelining to avoid performance penalties.

Mix Software Prefetch with Computation Instructions

Advises interspersing prefetch instructions with computational instructions to improve instruction-level parallelism.

Software Prefetch and Cache Blocking Techniques

Explains cache blocking techniques like strip-mining and loop blocking to improve temporal locality and cache hit rates.

Chapter 7: Multi-Core and Hyper-Threading Technology

Performance and Usage Models

Discusses how performance gains in multi-processor/core systems are affected by usage models and parallelism.

Multithreading

Explains exploiting task-level parallelism in workloads using multi-threading and Amdahl's law for performance gain.

Multitasking Environment

Covers how hardware multi-threading exploits task-level parallelism in single-threaded applications scheduled concurrently.

Programming Models and Multithreading

Discusses parallelism, workload, thread interaction, and hardware utilization as key concepts in multithreaded design.

Domain Decomposition

Describes creating identical or similar threads to process data subsets independently, leveraging duplicated execution resources.

Functional Decomposition

Explains programming separate threads for different functions to achieve flexible thread-level parallelism.

Specialized Programming Models

Introduces specialized models like "producer-consumer" for multi-core processors, minimizing bus traffic.

Producer-Consumer Threading Models

Illustrates the basic scheme of interaction between producer and consumer threads, emphasizing synchronization and cache usage.

Tools for Creating Multithreaded Applications

Introduces Intel Compilers with OpenMP support and automatic parallelization, plus development tools like Thread Checker.

Programming with OpenMP Directives

Covers OpenMP directives for shared memory parallelism, benefits of directive-based processing, and thread scheduling.

Automatic Parallelization of Code

Explains how OpenMP directives and compiler options like -Qparallel can automatically transform serial code to parallel.

Supporting Development Tools

Introduces Intel Thread Checker for finding threading errors and Thread Profiler for analyzing performance bottlenecks.

Intel® Thread Checker

Details using Intel Thread Checker to locate threading errors like data races, stalls, and deadlocks.

Thread Profiler

Explains using Thread Profiler to analyze threading performance, identify bottlenecks, and visualize execution timelines.

Optimization Guidelines

Summarizes optimization guidelines for multithreaded applications across five areas: thread synchronization, bus utilization, memory, front-end, and execution resources.

Key Practices of Thread Synchronization

Provides key practices for minimizing thread synchronization costs, including PAUSE instruction, spin-locks, and thread-blocking APIs.

Key Practices of System Bus Optimization

Covers managing bus traffic for high data throughput and quick response, including locality, prefetching, and write transactions.

Key Practices of Memory Optimization

Summarizes practices for optimizing memory operations, including cache blocking, data sharing, and access patterns.

Minimize Sharing of Data between Physical Processors

Advises minimizing data sharing between threads on different physical processors to avoid contention and improve scaling.

Minimize Data Access Patterns that are Offset by Multiples of 64 KB in Each Thread

Addresses 64 KB aliasing conditions causing cache evictions and how to eliminate them for better frequency scaling.

Key Practices of Front-end Optimization

Outlines practices for front-end optimization on Hyper-Threading Technology processors, focusing on loop unrolling and code size.

Avoid Excessive Loop Unrolling

Advises avoiding excessive loop unrolling to maintain Trace Cache efficiency and manage code size.

Optimization for Code Size

Focuses on optimizing code size to improve Trace Cache locality and delivered trace length, especially for multithreaded applications.

Using Thread Affinities to Manage Shared Platform Resources

Explains using thread affinities and CPUID to manage logical processors and their relationships for optimized resource sharing.

Generality and Performance Impact

Discusses ranking recommendations by local impact and generality, noting subjectivity and variability.

Thread Synchronization

Highlights the importance of careful thread synchronization design and implementation to avoid performance reduction.

Choice of Synchronization Primitives

Guides selection of synchronization primitives, favoring compiler intrinsics or OS interlocked APIs for atomic updates.

Synchronization for Short Periods

Discusses using spin-wait loops for fast response synchronization and the impact of processor architecture.

Synchronization for Longer Periods

Provides guidelines for spin-wait loops not expected to be released quickly, including OS services and processor idle states.

Avoid Coding Pitfalls in Thread Synchronization

Illustrates common pitfalls in thread synchronization, like polling loops and improper use of Sleep(), advising thread-blocking APIs.

Prevent Sharing of Modified Data and False-Sharing

Explains performance penalties from shared modified data and false sharing in multi-core/HT environments due to cache coherency.

Placement of Shared Synchronization Variable

Advises optimal spacing (128 bytes) for synchronization variables to minimize cache coherency traffic and prevent false-sharing.

System Bus Optimization

Discusses managing bus bandwidth and locality enhancements for multi-threaded applications to improve processor scaling.

Conserve Bus Bandwidth

Focuses on improving code and data locality to conserve bus command bandwidth and reduce instruction fetches.

Understand the Bus and Cache Interactions

Warns about exceeding second-level cache or bus capacity with parallel threads, potentially causing performance degradation.

Avoid Excessive Software Prefetches

Advises against excessive software prefetches to avoid wasting bus bandwidth and consuming execution resources.

Improve Effective Latency of Cache Misses

Suggests using overlapping memory reads to reduce latency of sparse reads and improve effective memory access latency.

Use Full Write Transactions to Achieve Higher Data Rate

Recommends using full write transactions (64 bytes) over partial writes to increase data throughput and bus efficiency.

Memory Optimization

Focuses on efficient cache operation through cache blocking, shared memory optimization, and eliminating aliased data accesses.

Cache Blocking Technique

Details loop blocking for reducing cache misses and improving memory access performance by fitting data into cache.

Shared-Memory Optimization

Discusses maintaining cache coherency and optimizing data sharing between processors for performance.

Minimize Sharing of Data between Physical Processors

Advises minimizing data sharing between threads on different physical processors to avoid contention and improve scaling.

Batched Producer-Consumer Model

Introduces batched producer-consumer design to minimize bus traffic and optimize work buffer usage.

Eliminate 64-KByte Aliased Data Accesses

Addresses 64 KB aliasing conditions causing cache evictions and how to eliminate them for better frequency scaling.

Preventing Excessive Evictions in First-Level Data Cache

Explains how multiple threads accessing private stack data can cause excessive cache line evictions.

Per-thread Stack Offset

Suggests using per-thread stack offsets to prevent private stack accesses from thrashing the first-level data cache.

Per-instance Stack Offset

Proposes adding per-instance stack offsets to avoid cache line evictions when multiple application instances run in lock step.

Chapter 8: 64-bit Mode Coding Guidelines

Use Legacy 32-Bit Instructions When The Data Size Is 32 Bits

Recommends using 32-bit instructions for 32-bit data to save instruction bytes and reduce code size.

Use Extra Registers to Reduce Register Pressure

Suggests using additional 64-bit general purpose and XMM registers to avoid spilling values to the stack.

Use 64-Bit by 64-Bit Multiplies That Produce 128-Bit Results Only When Necessary

Advises preferring 64x64 multiplies yielding 64-bit results over 128-bit results due to performance impact.

Sign Extension to Full 64-Bits

Explains optimizing sign extension by extending to the full 64 bits rather than to a 32-bit destination, reducing µops and improving performance.

Alternate Coding Rules for 64-Bit Mode

Provides alternative coding rules for 64-bit mode, emphasizing using 64-bit registers and instructions where appropriate.

Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic

Recommends using native 64-bit arithmetic instructions over 32-bit register pairs for better performance.

Use 32-Bit Versions of CVTSI2SS and CVTSI2SD When Possible

Suggests using 32-bit versions of CVTSI2SS/SD for converting signed integers to floating-point values when sufficient.

Using Software Prefetch

Recommends following guidelines in Chapters 2 and 6 for choosing between hardware and software prefetching.

Chapter 9: Power Optimization for Mobile Usages

Overview

Introduces power optimization techniques for mobile applications, considering performance and battery life.

ACPI C-States

Explains ACPI C-states (C0-C3) for managing processor idle states and reducing static power consumption.

Processor-Specific C4 and Deep C4 States

Discusses processor-specific C-states like C4 for aggressive static power reduction by lowering voltage.

Guidelines for Extending Battery Life

Provides guidelines to conserve battery life by adapting power management, avoiding spin loops, and reducing workload.

Adjust Performance to Meet Quality of Features

Suggests reducing feature performance or quality to extend battery life, and using OS APIs for power status.

Reducing Amount of Work

Explains reducing processor energy consumption by minimizing active workload execution time and cycles.

Platform-Level Optimizations

Covers platform-level techniques like caching CD/DVD data, switching off unused devices, and network usage for power saving.

Handling Sleep State Transitions

Advises applications to be aware of sleep state transitions and react appropriately to preserve state and connectivity.

Using Enhanced Intel SpeedStep® Technology

Explains using Enhanced Intel SpeedStep Technology to adjust processor frequency/voltage for lower power consumption and energy savings.

Enabling Intel® Enhanced Deeper Sleep

Discusses consolidating computations into larger chunks to enable deeper C-States and reduce static power consumption.

Multi-Core Considerations

Highlights special considerations for multi-core processors, especially dual-core architecture, for power savings.

Enhanced Intel SpeedStep® Technology

Explains transforming single-threaded applications for multi-core processors to enable lower frequency and voltage operation.

Thread Migration Considerations

Discusses performance anomalies due to OS scheduling and multi-core unaware power management affecting thread migration.

Multi-core Considerations for C-States

Covers how multi-core-unaware C-state coordination can affect power savings and the need for coordinated hardware/software.

Appendix A: Application Performance Tools

Intel C++ Compiler and Intel® Fortran Compiler

Discusses Intel compilers' features for generating optimized code, including vectorization and profile-guided optimization.

Intel Debugger

Explains Intel Debugger (IDB) for debugging C++, Fortran, and mixed language programs, including XMM register viewing.

VTune Performance Analyzer

Introduces VTune analyzer for collecting, analyzing, and displaying Intel architecture-specific performance data.

Intel® Performance Libraries

Lists Intel Performance Libraries like MKL and IPP, optimized for Intel processors, providing platform compatibility.

Intel Threading Tools

Introduces Intel Thread Checker and Thread Profiler for analyzing and debugging multithreaded applications.

Intel® Thread Checker

Details using Intel Thread Checker to locate threading errors like data races, stalls, and deadlocks.

Thread Profiler

Explains using Thread Profiler to analyze threading performance, identify bottlenecks, and visualize execution timelines.

Intel® Software College

Mentions Intel Software College as a resource for training on SSE2, Threading, and IA-32 architecture.

Code Optimization Options

Describes specific compiler options like -O1, -O2, -O3, -Qx, and -Qax for optimizing code performance.

Targeting a Processor (-Gn)

Explains using -Gn option to target specific Intel architecture processors for maximum performance.

Automatic Processor Dispatch Support (-Qx[extensions] and -Qax[extensions])

Covers -Qx and -Qax options for generating processor-specific code based on extensions, with runtime checks.

Vectorizer Switch Options

Details vectorizer switch options like -Qx, -Qax, -Qvec_report, and -Qrestrict for controlling loop vectorization.

Loop Unrolling

Explains how compilers automatically unroll loops with specific switches and how to disable it.

Multithreading with OpenMP*

Discusses shared memory parallelism using OpenMP directives, library functions, and environment variables.

Automatic Multithreading

Explains how Intel compilers can generate multithreaded code automatically for simple loops with no dependencies.

Inline Expansion of Library Functions (-Oi, -Oi-)

Describes default inline expansion of library functions for faster execution and potential issues.

Floating-point Arithmetic Precision (-Op, -Op-, -Qprec, -Qprec_div, -Qpc, -Qlong_double)

Covers options for controlling optimization that may affect floating-point arithmetic precision.

Rounding Control Option (-Qrcd)

Explains using -Qrcd option to improve floating-point calculation performance by controlling rounding mode.

Interprocedural and Profile-Guided Optimizations

Details Interprocedural Optimization (IPO) and Profile-Guided Optimization (PGO) for improving code performance.

Interprocedural Optimization (IPO)

Explains using -Qip and -Qipo options to analyze and optimize code between procedures within and across source files.

Profile-Guided Optimization (PGO)

Describes creating instrumented programs to generate dynamic information for optimizing heavily traveled code paths.

Appendix B: Using Performance Monitoring Events

Pentium 4 Processor Performance Metrics

Introduces performance metrics specific to Pentium 4 and Intel Xeon processors based on Intel NetBurst microarchitecture.

Pentium 4 Processor-Specific Terminology

Explains terms like Bogus, Non-bogus, Retire, and Bus Ratio used in performance monitoring.

Bogus, Non-bogus, Retire

Defines bogus instructions/uops (cancelled due to misprediction) vs. retired/non-bogus (committed architectural state changes).

Bus Ratio

Defines Bus Ratio as the ratio of processor clock to bus clock, used in Bus Utilization metric.

Replay

Explains replay mechanism where uops are reissued due to unsatisfied execution conditions like cache misses or resource constraints.

Assist

Describes assist events where hardware needs microcode assistance, like for floating-point underflow conditions.

Tagging

Explains tagging as a means of marking uops for counting at retirement, allowing multiple tags per uop for detection.

Counting Clocks

Describes mechanisms to count processor clock cycles (Non-Halted, Non-Sleep, Timestamp Counter) for performance monitoring.

Non-Halted Clockticks

Defines Non-Halted Clockticks as clocks when a logical processor is active and not in power-saving states.

Non-Sleep Clockticks

Defines Non-Sleep Clockticks as clocks when the physical processor is not in sleep or power-saving states.

Time Stamp Counter

Explains the Time Stamp Counter increments on clock signal activity and can be read via RDTSC instruction.

Microarchitecture Notes

Provides notes on microarchitectural elements like Trace Cache Events and Bus/Memory Metrics for correct metric interpretation.

Trace Cache Events

Discusses trace cache performance, its relation to bottlenecks, and metrics for determining front-end performance.

Bus and Memory Metrics

Explains understanding transaction sizes, queues, sectoring, and prefetching for correct interpretation of bus/memory metrics.

Usage Notes for Specific Metrics

Provides event-specific notes for interpreting performance metrics, especially those related to BSQ cache references.

Metrics Descriptions and Categories

Lists performance metrics categorized by General, Branching, Trace Cache/Front End, Memory, Bus, and Characterization.

Appendix C: IA-32 Instruction Latency and Throughput

Overview

Provides an overview of issues related to instruction selection and scheduling, and performance impact of applying information.

Definitions

Defines key terms like Instruction Name, Latency, Throughput, and Execution units for IA-32 instructions.

Latency and Throughput

Presents latency and throughput information for IA-32 instructions, including SSE, MMX, and general-purpose instructions.

Latency and Throughput with Register Operands

Provides IA-32 instruction latency and throughput data for register operands, covering SSE3, SSE2, MMX, and general-purpose instructions.

Latency and Throughput with Memory Operands

Discusses latency and throughput for instructions with memory operands, noting longer latencies compared to register operands.

Appendix D: Stack Alignment

Stack Frames

Describes stack alignment conventions for esp-based and ebp-based stack frames, and the importance of 16-byte alignment for __m128 data.

Aligned esp-Based Stack Frames

Details creating esp-based stack frames with compiler padding for alignment, applicable when debug info and exception handling are not needed.

Aligned ebp-Based Stack Frames

Explains ebp-based frames with padding before return address, used when debug info or exception handling is present.

Stack Frame Optimizations

Discusses compiler optimizations for aligned frames, including bypassing unnecessary alignment code and using static function alignment.

Inlined Assembly and ebx

Warns against modifying the ebx register in inlined assembly functions that use dynamic stack alignment without saving/restoring it.

Appendix E: Mathematics of Prefetch Scheduling Distance

Simplified Equation

Presents a simplified equation to compute Prefetch Scheduling Distance (PSD) based on lookup, transfer, and computation latencies.

Mathematical Model for PSD

Defines parameters like psd, Tc, Tl, Tb, and CPI used in PSD calculation and discusses their dependencies.

No Preloading or Prefetch

Explains the traditional sequential approach without preloading/prefetching and the resulting execution pipeline stalls.

Compute Bound (Case:Tc >= Tl + Tb)

Analyzes the compute-bound case where compute latency exceeds memory latency plus transfer latency, indicating PSD=1.

Compute Bound (Case: Tl + Tb > Tc > Tb)

Examines the compute-bound scenario where iteration latency equals computation latency, suggesting PSD > 1.

Memory Throughput Bound (Case: Tb >= Tc)

Discusses memory throughput bound cases where memory latency dominates, making prefetch benefits marginal.

Example

Provides example calculations for PSD based on given conditions for computation and memory throughput latencies.

Example E-1 Calculating Insertion for Scheduling Distance of 3

Illustrates using prefetchnta instruction with a scheduling distance of 3, showing data usage in iteration i+3.

Example E-2 Accesses per Iteration, Example 1

Shows a graph of accesses per iteration vs. computation clocks, illustrating latency reduction with prefetching.

Example E-7 Accesses per Iteration, Example 2

Presents results for prefetching multiple cache lines per iteration, showing differing burst latencies.
