| Specification | Detail |
|---|---|
| Instruction Set | x86 |
| Instruction Set Type | CISC |
| Memory Segmentation | Supported |
| Operating Modes | Real mode, Protected mode, Virtual 8086 mode |
| Max Physical Address Size | 36 bits (with PAE) |
| Max Virtual Address Size | 32 bits |
| Architecture | IA-32 (Intel Architecture, 32-bit) |
| Addressable Memory | 4 GB (up to 64 GB with Physical Address Extension) |
| Floating-Point Registers | 8 × 80-bit (x87) |
| MMX Registers | 8 × 64-bit |
| SSE Registers | 8 × 128-bit (XMM) |
| Registers | General-purpose (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP), segment (CS, DS, SS, ES, FS, GS), instruction pointer (EIP), flags (EFLAGS) |
| Floating-Point Unit | Yes (x87) |
Introduces SIMD computations and technologies like MMX, SSE, SSE2, and SSE3.
Details MMX technology and Streaming SIMD Extensions (SSE, SSE2, SSE3) capabilities.
Describes Intel EM64T as an extension to IA-32, increasing linear address space to 64 bits.
Describes features and operation of the Intel NetBurst microarchitecture for high performance.
Details the pipeline structure: in-order issue front end, out-of-order superscalar execution core, and in-order retirement.
Explains the front end's functions: instruction fetching, decoding into µops, and branch prediction.
Discusses the core's ability to execute instructions out of order for parallelism and dispatching µops.
Explains the retirement logic that tracks and commits µop results according to program order.
Details branch prediction importance, mechanisms, and static prediction for deeply pipelined processors.
Describes how the execution core optimizes performance by handling common cases efficiently.
Explains port restrictions, result latencies, and issue latencies that assist software in ordering instructions.
Details the four issue ports and associated execution units that dispatch µops to the core.
Describes the three levels of on-chip cache in the Intel NetBurst microarchitecture and their organization.
Discusses software and hardware mechanisms for prefetching data to improve memory access performance.
Explains techniques used to speed up memory operations like speculative execution and reordering.
Describes how store data can be forwarded directly to a subsequent load, outlining requirements and restrictions.
Provides an overview of the Pentium M processor microarchitecture, highlighting differences from NetBurst.
Explains the Pentium M processor's front end, consisting of fetch/decode unit and instruction cache.
Describes enhancements in Core Solo/Duo microarchitecture over Pentium M for performance and power efficiency.
Discusses how multi-core processors resemble single-core implementations and cache hierarchy differences.
Explains the shared second-level cache and single bus interface in Intel Core Duo processors for minimizing bus traffic.
Discusses key factors for achieving optimal processor performance like branch prediction and memory access.
Lists and explains common coding pitfalls that limit performance on IA-32 processors.
Provides guidelines derived from performance factors and highlights practices using performance tools.
Discusses using Intel C++ Compiler and VTune Performance Analyzer for code optimization.
Suggests strategies like CPUID dispatch and compatible code for performance across processor generations.
Explains techniques to improve branch predictability and optimize instruction prefetching.
Provides guidelines for optimizing code and data memory accesses, including alignment and store-forwarding.
Offers advice on optimizing floating-point operations, including precision and SIMD capabilities.
Focuses on selecting instructions for path length, minimizing uops, and maximizing retirement throughput.
Discusses considering latencies and resource constraints for instruction scheduling.
Explains how to enable vectorization for parallelism, focusing on data types and loop nesting.
Presents rules, suggestions, and hints for performance tuning, ranked by impact and generality.
Introduces Intel C++ Compiler and VTune Performance Analyzer for application optimization.
Discusses using compilers tuned for target microarchitecture and compiler switches for optimization.
Explains how to use VTune Performance Analyzer for performance monitoring and identifying coding pitfalls.
Compares coding recommendations across different microarchitectures like Pentium M and Core processors.
Discusses using CPUID for processor generation identification and compatible code strategies for performance.
Explains using CPUID's deterministic cache parameter leaf for forward-compatible coding.
Covers hardware multi-threading support like dual-core and Hyper-Threading Technology for application design.
Explains how to check for MMX, SSE, SSE2, and SSE3 support in processors and operating systems.
Details using CPUID to check for MMX technology support via feature flags in the EDX register.
Outlines steps to check for SSE support, including processor and OS checks using CPUID.
Explains how to check for SSE2 support, similar to SSE, focusing on processor and OS requirements.
Details checking for SSE3 support, including processor and OS checks, and feature bits in CPUID.
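A minimal sketch of the processor-side feature checks described above, using the compiler-specific `<cpuid.h>` header available in GCC/Clang (an assumption; MSVC uses `__cpuid` from `<intrin.h>`). CPUID leaf 01H reports MMX in EDX bit 23, SSE in EDX bit 25, SSE2 in EDX bit 26, and SSE3 in ECX bit 0; operating-system support for saving XMM state must still be verified separately, as the manual notes.

```c
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;                               /* CPUID leaf 1 not available */
    printf("MMX : %u\n", (edx >> 23) & 1);      /* CPUID.01H:EDX[23] */
    printf("SSE : %u\n", (edx >> 25) & 1);      /* CPUID.01H:EDX[25] */
    printf("SSE2: %u\n", (edx >> 26) & 1);      /* CPUID.01H:EDX[26] */
    printf("SSE3: %u\n", (ecx >>  0) & 1);      /* CPUID.01H:ECX[0]  */
    return 0;
}
```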
Guides developers on evaluating code for SIMD conversion by asking key questions about benefits and requirements.
Discusses vectorization, memory access dependencies, and loop strip-mining for SIMD architecture.
Compares trade-offs between hand-coded assembly, intrinsics, and automatic vectorization for SIMD programming.
Explains how the Intel C++ Compiler automatically vectorizes loops and the techniques used to identify vectorizable loops.
Provides overall rules for SIMD integer code, including intermixing with x87, type checking, and prefetching.
Discusses rules and considerations for mixing 64-bit SIMD integer instructions with x87 floating-point registers.
Explains the EMMS instruction's role in clearing the x87 stack for proper x87 code operation after MMX.
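A minimal sketch of the EMMS usage just described, via the `_mm_empty` intrinsic: MMX code aliases the x87 register stack, so the stack must be marked empty before x87 floating-point code runs again.

```c
#include <mmintrin.h>

void add_words_then_float(__m64 *dst, __m64 a, __m64 b, double *acc)
{
    *dst = _mm_add_pi16(a, b);   /* MMX work uses the x87 register stack   */
    _mm_empty();                 /* EMMS: mark the x87 stack empty again   */
    *acc += 1.0;                 /* x87/double code is now safe to execute */
}
```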
Emphasizes the importance of 8-byte alignment for 64-bit SIMD integer data and 16-byte for 128-bit data.
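A short sketch of the alignment guidance above: aligned 128-bit loads and stores (movdqa) require 16-byte-aligned addresses, and `_mm_malloc`/`_mm_free` (provided by the Intel and GCC tool chains) give aligned heap storage. Statics and locals can instead use `__declspec(align(16))` or `__attribute__((aligned(16)))`.

```c
#include <emmintrin.h>

void zero_first_block(void)
{
    /* 16-byte aligned heap buffer suitable for 128-bit SIMD accesses */
    int *buf = (int *)_mm_malloc(1024 * sizeof(int), 16);
    if (!buf)
        return;
    _mm_store_si128((__m128i *)buf, _mm_setzero_si128());  /* aligned store */
    _mm_free(buf);
}
```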
Covers techniques for gathering and re-arranging data for efficient SIMD computation.
Explains how MMX unpack instructions can be used to zero-extend unsigned numbers.
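A minimal sketch of the unsigned (zero-extend) unpack idiom just described, shown with the 128-bit form: interleaving the source with a zero register widens unsigned bytes into words.

```c
#include <emmintrin.h>

__m128i zero_extend_low_bytes(__m128i v)
{
    const __m128i zero = _mm_setzero_si128();
    return _mm_unpacklo_epi8(v, zero);   /* punpcklbw v, 0 -> 8 zero-extended words */
}
```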
Details using the psrad instruction for sign-extending values during unpacking.
Explains packssdw instruction for packing signed doublewords into saturated signed words.
Describes pack instructions for packing words without saturation, using low-order bits to prevent overflow.
Explains unpack instructions for merging operands without interleaving, focusing on doublewords.
Details the pextrw instruction for extracting words and moving them to a 32-bit register.
Explains the pinsrw instruction for loading words into MMX destination registers at specified positions.
Describes the pmovmskb instruction for creating a bit mask from the most significant bits of bytes.
Explains the pshufw instruction for selecting words from an MMX register or memory into the destination using an immediate operand.
Details pshuflw/pshufhw/pshufd for shuffling word/double-word fields within 128-bit registers.
Explains punpcklqdq/punpckhqdq for interleaving 64-bit source operands into 128-bit destination registers.
Discusses instructions enabling data movement from 64-bit SIMD integer registers to 128-bit SIMD registers.
Mentions new instructions for 4-wide conversion of single-precision to double-word integer data.
Shows code segments for generating frequently used constants in SIMD registers using specific instructions.
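A minimal sketch of the constant-generation idiom referenced above: comparing a register with itself produces all ones, and shifts then turn that value into other frequently used constants without a memory load.

```c
#include <emmintrin.h>

__m128i word_constants(void)
{
    __m128i z     = _mm_setzero_si128();
    __m128i ones  = _mm_cmpeq_epi32(z, z);       /* pcmpeqd: all bits set       */
    __m128i w_one = _mm_srli_epi16(ones, 15);    /* each 16-bit word = 0x0001   */
    __m128i w_msb = _mm_slli_epi16(ones, 15);    /* each 16-bit word = 0x8000   */
    return _mm_add_epi16(w_one, w_msb);          /* combine just to return both */
}
```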
Describes instructions and algorithms for implementing common code building blocks efficiently.
Explains computing absolute difference of two unsigned numbers using subtract with unsigned saturation.
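A minimal sketch of the unsigned absolute-difference idiom just described: with unsigned saturation, one of the two subtractions clamps to zero, so OR-ing them yields |a − b| per byte.

```c
#include <emmintrin.h>

__m128i absdiff_u8(__m128i a, __m128i b)
{
    return _mm_or_si128(_mm_subs_epu8(a, b),    /* psubusb: max(a-b, 0) per byte */
                        _mm_subs_epu8(b, a));   /* psubusb: max(b-a, 0) per byte */
}
```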
Details a sorting technique using XOR for calculating absolute difference of two signed numbers.
Shows how to compute the absolute value |x| of signed words using pmaxsw and psubw instructions.
Explains clipping values to a range [high, low] using packed-add and packed-subtract with saturation.
Provides techniques for clipping signed words and unsigned bytes to arbitrary ranges using specific instructions.
Shows how to clip signed words to an arbitrary range using pmaxsw and pminsw instructions.
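A minimal sketch of the signed-word clipping just described, assuming low ≤ high: pmaxsw raises values below the lower bound, pminsw lowers values above the upper bound.

```c
#include <emmintrin.h>

__m128i clip_s16(__m128i x, short low, short high)
{
    x = _mm_max_epi16(x, _mm_set1_epi16(low));   /* pmaxsw: x = max(x, low)  */
    x = _mm_min_epi16(x, _mm_set1_epi16(high));  /* pminsw: x = min(x, high) */
    return x;
}
```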
Details clipping unsigned values to a range [high, low] using packed-add and packed-subtract with saturation.
Explains pmaxsw and pminsw for signed words, and pmaxub and pminub for unsigned bytes.
Describes pmulhuw and pmulhw instructions for multiplying unsigned/signed words.
Explains psadbw instruction for computing absolute difference of unsigned bytes and summing them.
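A minimal sketch of psadbw as described above: absolute byte differences are summed within each 8-byte half of the operands, leaving one 16-bit partial sum per 64-bit lane, which the caller then combines.

```c
#include <emmintrin.h>

unsigned int sad16(__m128i a, __m128i b)
{
    __m128i sad = _mm_sad_epu8(a, b);                      /* psadbw: two partial sums */
    return (unsigned int)_mm_cvtsi128_si32(sad)            /* low-lane sum             */
         + (unsigned int)_mm_cvtsi128_si32(_mm_srli_si128(sad, 8)); /* high-lane sum   */
}
```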
Describes the pavgb and pavgw instructions, which add unsigned data elements plus 1 and shift the result right by one bit to produce a rounded average.
Explains complex multiplication using pmaddwd instruction, requiring data formatted into 16-bit values.
Details PMULUDQ instruction for unsigned multiply on double-word operands.
Describes PADDQ/PSUBQ instructions for adding/subtracting quad-word operands.
Explains pslldq/psrldq instructions for shifting operands by bytes specified by an immediate operand.
Discusses techniques to improve memory accesses using larger block sizes and avoiding data mixing.
Addresses issues with large loads after small stores or vice versa, and how to avoid stalls.
Explains using LDDQU instruction to avoid cache line splits when loading non-16-byte aligned data.
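A minimal sketch of the LDDQU usage above (SSE3): it loads 16 bytes from an address that may straddle a cache-line boundary without the split-load penalty of movdqu.

```c
#include <pmmintrin.h>

__m128i load_unaligned(const unsigned char *p)   /* p need not be 16-byte aligned */
{
    return _mm_lddqu_si128((const __m128i *)p);  /* lddqu */
}
```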
Provides guidelines for obtaining higher bandwidth and shorter latencies for sequential memory fills.
Explains using 16-byte movdq (double quadword) stores to UC/WC memory to reduce the number of stores per fill cycle.
Discusses how keeping accesses within the same DRAM page improves bandwidth by avoiding the added latency of DRAM page misses.
Explains how aligned stores yield higher bandwidth for UC/WC memory by avoiding cache line boundary crossing.
Discusses converting 64-bit SIMD integer (MMX) code to SSE2 by using the 128-bit instructions, with attention to data alignment requirements.
Discusses optimizing SIMD code for Intel Core Solo and Core Duo processors, considering microarchitectural differences.
Compares favoring 128-bit SIMD integer instructions over 64-bit MMX instructions on Core processors for performance.
Provides rules for optimizing floating-point code with SIMD instructions, balancing port utilization and exceptions.
Lists issues programmers should consider for achieving optimum performance with SIMD floating-point instructions.
Explains mixing SIMD floating-point code with x87 floating-point or 64-bit SIMD integer code.
Discusses SIMD floating-point instructions operating on least-significant operands (scalar) and their advantages.
Emphasizes 16-byte alignment for SIMD floating-point data to avoid exceptions and performance penalties.
Discusses arranging data contiguously for SIMD registers to maximize performance, minimize cache misses, and improve throughput.
Compares vertical data processing in SSE/SSE2 with horizontal data movement for SIMD operations.
Explains rearranging SoA format data to AoS format using unpcklps/unpckhps and movlps/movhps.
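A minimal sketch of the SoA-to-AoS rearrangement described above, using the `_MM_TRANSPOSE4_PS` macro from `xmmintrin.h` (which is itself built from unpcklps/unpckhps and high/low move instructions). The four source arrays and the destination are assumed to be 16-byte aligned.

```c
#include <xmmintrin.h>

void soa_to_aos(const float *x, const float *y, const float *z, const float *w,
                float *aos /* 16 floats: four (x,y,z,w) tuples */)
{
    __m128 r0 = _mm_load_ps(x);
    __m128 r1 = _mm_load_ps(y);
    __m128 r2 = _mm_load_ps(z);
    __m128 r3 = _mm_load_ps(w);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   /* r0..r3 now hold (x,y,z,w) tuples */
    _mm_store_ps(aos +  0, r0);
    _mm_store_ps(aos +  4, r1);
    _mm_store_ps(aos +  8, r2);
    _mm_store_ps(aos + 12, r3);
}
```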
Demonstrates using SSE3 for complex number multiplication and division, benefiting from AoS data structures.
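A minimal sketch of an SSE3 complex multiply on AoS data, in the spirit of the example described above: each `__m128` holds two complex numbers as [re0, im0, re1, im1]; movsldup/movshdup duplicate the real and imaginary parts and addsubps combines the partial products.

```c
#include <pmmintrin.h>

__m128 cmul2(__m128 a, __m128 b)
{
    __m128 re = _mm_moveldup_ps(a);             /* movsldup: [a.re, a.re, ...]      */
    __m128 im = _mm_movehdup_ps(a);             /* movshdup: [a.im, a.im, ...]      */
    __m128 bs = _mm_shuffle_ps(b, b, 0xB1);     /* swap re/im within each pair      */
    return _mm_addsub_ps(_mm_mul_ps(re, b),     /* addsubps: re*b -/+ im*swap(b)    */
                         _mm_mul_ps(im, bs));   /* -> [re*re-im*im, re*im+im*re]    */
}
```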
Discusses optimizing SIMD code for Intel Core Solo and Core Duo processors, focusing on packed floating-point performance.
Compares packed SIMD floating-point code performance on Core Solo vs. Pentium M processors, noting decoder improvements.
Provides guidelines to reduce memory traffic and utilize bandwidth by leveraging hardware prefetchers and compiler intrinsics.
Explains automatic data prefetching in Pentium 4/Xeon/M and Core processors, covering its characteristics and triggers.
Discusses prefetch and cacheability control instructions for managing data caching and minimizing pollution.
Describes streaming, non-temporal stores (movntps, movntdq) for managing data retention and minimizing cache pollution.
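A minimal sketch of the non-temporal store pattern described above: movntdq-style stores bypass the caches, and the trailing sfence makes the write-combined data globally visible before any later stores. The destination is assumed to be 16-byte aligned.

```c
#include <emmintrin.h>
#include <stddef.h>

void fill_nt(__m128i *dst, __m128i value, size_t n /* count of 16-byte blocks */)
{
    for (size_t i = 0; i < n; i++)
        _mm_stream_si128(dst + i, value);   /* movntdq: non-temporal store        */
    _mm_sfence();                           /* order/flush the streaming stores   */
}
```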
Explains the necessity of fencing operations (sfence, lfence, mfence) for ensuring store data visibility and ordering.
Details how streaming stores improve performance by increasing store bandwidth and reducing cache disturbance.
Discusses considerations when memory type (UC, WC, WB) interacts with non-temporal store hints.
Explains WC semantics for ensuring coherence and using fencing for producer-consumer models.
Covers coherent and non-coherent requests for streaming stores, emphasizing the need for sfence.
Describes how streaming stores work with WC memory and the role of sfence for coherency in MP systems.
Details non-coherent requests from I/O devices and the use of sfence for coherency with WC memory mapping.
Describes movntq/movntdq for non-temporal integer stores and movntps for non-temporal single-precision float stores.
Introduces sfence, lfence, and mfence instructions for ensuring memory ordering and visibility.
Explains sfence for ensuring global visibility of stores before subsequent store instructions.
Details lfence for ensuring global visibility of load instructions before subsequent load instructions.
Explains mfence for ensuring global visibility of both load and store instructions before subsequent memory references.
Describes clflush for invalidating cache lines and its usage with fencing for speculative memory references.
Discusses software-controlled prefetch and automatic hardware prefetch for memory optimization.
Explains software prefetch instructions for hiding data access latency and its characteristics.
Covers essential issues for using software prefetch instructions effectively, including scheduling distance and concatenation.
Defines prefetch scheduling distance (PSD) and provides a simplified equation for its calculation.
Explains prefetch concatenation to bridge execution pipeline bubbles and remove memory de-pipelining stalls.
Discusses reducing software prefetches by unrolling loops or software-pipelining to avoid performance penalties.
Advises interspersing prefetch instructions with computational instructions to improve instruction-level parallelism.
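A minimal sketch of interleaving software prefetch with computation, as advised above. `PDIST` (the prefetch scheduling distance) is a hypothetical tuning constant; the body is a stand-in for real work.

```c
#include <xmmintrin.h>
#include <stddef.h>

#define PDIST 8   /* assumed scheduling distance; tune per the guidance above */

void process_array(float *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PDIST < n)
            _mm_prefetch((const char *)&a[i + PDIST], _MM_HINT_NTA);
        a[i] = a[i] * 2.0f + 1.0f;   /* stand-in for the per-element computation */
    }
}
```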
Explains cache blocking techniques like strip-mining and loop blocking to improve temporal locality and cache hit rates.
Details loop blocking for reducing cache misses and improving memory access performance by fitting data into cache.
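A minimal sketch of the loop-blocking transformation described above: the j loop is strip-mined so each BLOCK-sized strip of `b[]` stays resident in cache while it is reused across all i. `BLOCK` is a hypothetical tile size chosen so the strip fits in the target cache level.

```c
#include <stddef.h>

#define BLOCK 256   /* assumed tile size; size it to the target cache */

void blocked_accumulate(float *out, const float *a, const float *b, size_t n)
{
    for (size_t jj = 0; jj < n; jj += BLOCK)            /* strip-mine the j loop  */
        for (size_t i = 0; i < n; i++)                   /* reuse the b[] strip    */
            for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                out[i] += a[i] * b[j];
}
```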
Discusses how performance gains in multi-processor/core systems are affected by usage models and parallelism.
Explains exploiting task-level parallelism in workloads using multi-threading and Amdahl's law for performance gain.
Covers how hardware multi-threading exploits task-level parallelism in single-threaded applications scheduled concurrently.
Discusses parallelism, workload, thread interaction, and hardware utilization as key concepts in multithreaded design.
Describes creating identical or similar threads to process data subsets independently, leveraging duplicated execution resources.
Explains programming separate threads for different functions to achieve flexible thread-level parallelism.
Introduces specialized models like "producer-consumer" for multi-core processors, minimizing bus traffic.
Illustrates the basic scheme of interaction between producer and consumer threads, emphasizing synchronization and cache usage.
Introduces Intel Compilers with OpenMP support and automatic parallelization, plus development tools like Thread Checker.
Covers OpenMP directives for shared memory parallelism, benefits of directive-based processing, and thread scheduling.
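A minimal sketch of the directive-based shared-memory parallelism described above: a single OpenMP pragma parallelizes an independent loop, with the Intel compiler's -Qopenmp (Windows) or -openmp (Linux) switch enabling the directives.

```c
void scale(float *a, float s, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= s;   /* iterations are independent, so they may run in parallel */
}
```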
Explains how OpenMP directives and compiler options like -Qparallel can automatically transform serial code to parallel.
Introduces Intel Thread Checker for finding threading errors and Thread Profiler for analyzing performance bottlenecks.
Details using Intel Thread Checker to locate threading errors like data races, stalls, and deadlocks.
Explains using Thread Profiler to analyze threading performance, identify bottlenecks, and visualize execution timelines.
Summarizes optimization guidelines for multithreaded applications across five areas: thread synchronization, bus utilization, memory, front-end, and execution resources.
Provides key practices for minimizing thread synchronization costs, including PAUSE instruction, spin-locks, and thread-blocking APIs.
Covers managing bus traffic for high data throughput and quick response, including locality, prefetching, and write transactions.
Summarizes practices for optimizing memory operations, including cache blocking, data sharing, and access patterns.
Advises minimizing data sharing between threads on different physical processors to avoid contention and improve scaling.
Addresses 64 KB aliasing conditions causing cache evictions and how to eliminate them for better frequency scaling.
Outlines practices for front-end optimization on Hyper-Threading Technology processors, focusing on loop unrolling and code size.
Advises avoiding excessive loop unrolling to maintain Trace Cache efficiency and manage code size.
Focuses on optimizing code size to improve Trace Cache locality and delivered trace length, especially for multithreaded applications.
Explains using thread affinities and CPUID to manage logical processors and their relationships for optimized resource sharing.
Discusses ranking recommendations by local impact and generality, noting subjectivity and variability.
Highlights the importance of careful thread synchronization design and implementation to avoid performance reduction.
Guides selection of synchronization primitives, favoring compiler intrinsics or OS interlocked APIs for atomic updates.
Discusses using spin-wait loops for fast response synchronization and the impact of processor architecture.
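A minimal sketch of the short-duration spin-wait just described: the pause instruction (via `_mm_pause`) de-pipelines the loop, reducing the memory-order-violation penalty on exit and saving power on Hyper-Threading Technology processors.

```c
#include <emmintrin.h>

void spin_until_set(volatile int *sync_var)
{
    while (*sync_var == 0)
        _mm_pause();   /* pause: hint that this is a spin-wait loop */
}
```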
Provides guidelines for spin-wait loops not expected to be released quickly, including OS services and processor idle states.
Illustrates common pitfalls in thread synchronization, like polling loops and improper use of Sleep(), advising thread-blocking APIs.
Explains performance penalties from shared modified data and false sharing in multi-core/HT environments due to cache coherency.
Advises optimal spacing (128 bytes) for synchronization variables to minimize cache coherency traffic and prevent false-sharing.
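A minimal sketch of the 128-byte placement rule above: padding each synchronization variable out to its own 128-byte region (and placing instances on 128-byte boundaries, e.g. with `__declspec(align(128))` or `__attribute__((aligned(128)))`) keeps it from sharing a sector with other hot data and causing false sharing.

```c
typedef struct {
    volatile long lock;
    char pad[128 - sizeof(long)];   /* pad the variable out to 128 bytes */
} padded_lock_t;
```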
Discusses managing bus bandwidth and locality enhancements for multi-threaded applications to improve processor scaling.
Focuses on improving code and data locality to conserve bus command bandwidth and reduce instruction fetches.
Warns about exceeding second-level cache or bus capacity with parallel threads, potentially causing performance degradation.
Advises against excessive software prefetches to avoid wasting bus bandwidth and consuming execution resources.
Suggests using overlapping memory reads to reduce latency of sparse reads and improve effective memory access latency.
Recommends using full write transactions (64 bytes) over partial writes to increase data throughput and bus efficiency.
Focuses on efficient cache operation through cache blocking, shared memory optimization, and eliminating aliased data accesses.
Details loop blocking for reducing cache misses and improving memory access performance by fitting data into cache.
Discusses maintaining cache coherency and optimizing data sharing between processors for performance.
Advises minimizing data sharing between threads on different physical processors to avoid contention and improve scaling.
Introduces batched producer-consumer design to minimize bus traffic and optimize work buffer usage.
Addresses 64 KB aliasing conditions causing cache evictions and how to eliminate them for better frequency scaling.
Explains how multiple threads accessing private stack data can cause excessive cache line evictions.
Suggests using per-thread stack offsets to prevent private stack accesses from thrashing the first-level data cache.
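A minimal sketch of the per-thread stack offset idea just described, assuming a POSIX-style `alloca` (on Windows, `_alloca` from `<malloc.h>` plays the same role): each thread shifts its stack pointer by a different amount at entry so private stack accesses from sibling threads do not alias in the first-level data cache. `worker()` is a hypothetical routine for the thread's real work.

```c
#include <alloca.h>
#include <stddef.h>

extern void worker(int thread_id);   /* hypothetical per-thread work routine */

void thread_entry(int thread_id)
{
    /* offset this thread's stack by a thread-specific multiple of 1 KB */
    volatile char *pad = (volatile char *)alloca((size_t)thread_id * 1024 + 64);
    pad[0] = 0;            /* touch it so the adjustment is not optimized away */
    worker(thread_id);
}
```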
Proposes adding per-instance stack offsets to avoid cache line evictions when multiple application instances run in lock step.
Outlines practices for front-end optimization on Hyper-Threading Technology processors, focusing on loop unrolling and code size.
Advises avoiding excessive loop unrolling to maintain Trace Cache efficiency and manage code size.
Focuses on optimizing code size to improve Trace Cache locality and delivered trace length, especially for multithreaded applications.
Explains using thread affinities and CPUID to manage logical processors and their relationships for optimized resource sharing.
Recommends using 32-bit instructions for 32-bit data to save instruction bytes and reduce code size.
Suggests using additional 64-bit general purpose and XMM registers to avoid spilling values to the stack.
Advises preferring 64x64 multiplies yielding 64-bit results over 128-bit results due to performance impact.
Explains optimizing sign extension by extending to the full 64 bits rather than to 32 bits, reducing µops and improving performance.
Provides alternative coding rules for 64-bit mode, emphasizing using 64-bit registers and instructions where appropriate.
Recommends using native 64-bit arithmetic instructions over 32-bit register pairs for better performance.
Suggests using 32-bit versions of CVTSI2SS/SD for converting signed integers to floating-point values when sufficient.
Recommends following guidelines in Chapters 2 and 6 for choosing between hardware and software prefetching.
Introduces power optimization techniques for mobile applications, considering performance and battery life.
Explains ACPI C-states (C0-C3) for managing processor idle states and reducing static power consumption.
Discusses processor-specific C-states like C4 for aggressive static power reduction by lowering voltage.
Provides guidelines to conserve battery life by adapting power management, avoiding spin loops, and reducing workload.
Suggests reducing feature performance or quality to extend battery life, and using OS APIs for power status.
Explains reducing processor energy consumption by minimizing active workload execution time and cycles.
Covers platform-level techniques like caching CD/DVD data, switching off unused devices, and network usage for power saving.
Advises applications to be aware of sleep state transitions and react appropriately to preserve state and connectivity.
Explains using Enhanced Intel SpeedStep Technology to adjust processor frequency/voltage for lower power consumption and energy savings.
Discusses consolidating computations into larger chunks to enable deeper C-States and reduce static power consumption.
Highlights special considerations for multi-core processors, especially dual-core architecture, for power savings.
Explains transforming single-threaded applications for multi-core processors to enable lower frequency and voltage operation.
Discusses performance anomalies due to OS scheduling and multi-core unaware power management affecting thread migration.
Covers how multi-core-unaware C-state coordination can affect power savings and the need for coordinated hardware/software.
Discusses Intel compilers' features for generating optimized code, including vectorization and profile-guided optimization.
Explains Intel Debugger (IDB) for debugging C++, Fortran, and mixed language programs, including XMM register viewing.
Introduces VTune analyzer for collecting, analyzing, and displaying Intel architecture-specific performance data.
Lists Intel Performance Libraries like MKL and IPP, optimized for Intel processors, providing platform compatibility.
Introduces Intel Thread Checker and Thread Profiler for analyzing and debugging multithreaded applications.
Details using Intel Thread Checker to locate threading errors like data races, stalls, and deadlocks.
Explains using Thread Profiler to analyze threading performance, identify bottlenecks, and visualize execution timelines.
Mentions Intel Software College as a resource for training on SSE2, Threading, and IA-32 architecture.
Describes specific compiler options like -O1, -O2, -O3, -Qx, and -Qax for optimizing code performance.
Explains using -Gn option to target specific Intel architecture processors for maximum performance.
Covers -Qx and -Qax options for generating processor-specific code based on extensions, with runtime checks.
Details vectorizer switch options like -Qx, -Qax, -Qvec_report, and -Qrestrict for controlling loop vectorization.
Explains how compilers automatically unroll loops with specific switches and how to disable it.
Discusses shared memory parallelism using OpenMP directives, library functions, and environment variables.
Explains how Intel compilers can generate multithreaded code automatically for simple loops with no dependencies.
Describes default inline expansion of library functions for faster execution and potential issues.
Covers options for controlling optimization that may affect floating-point arithmetic precision.
Explains using -Qrcd option to improve floating-point calculation performance by controlling rounding mode.
Details Interprocedural Optimization (IPO) and Profile-Guided Optimization (PGO) for improving code performance.
Explains using -Qip and -Qipo options to analyze and optimize code between procedures within and across source files.
Describes creating instrumented programs to generate dynamic information for optimizing heavily traveled code paths.
Introduces performance metrics specific to Pentium 4 and Intel Xeon processors based on Intel NetBurst microarchitecture.
Explains terms like Bogus, Non-bogus, Retire, and Bus Ratio used in performance monitoring.
Defines bogus instructions/uops (cancelled due to misprediction) vs. retired/non-bogus (committed architectural state changes).
Defines Bus Ratio as the ratio of processor clock to bus clock, used in Bus Utilization metric.
Explains replay mechanism where uops are reissued due to unsatisfied execution conditions like cache misses or resource constraints.
Describes assist events where hardware needs microcode assistance, like for floating-point underflow conditions.
Explains tagging as a means of marking uops for counting at retirement, allowing multiple tags per uop for detection.
Describes mechanisms to count processor clock cycles (Non-Halted, Non-Sleep, Timestamp Counter) for performance monitoring.
Defines Non-Halted Clockticks as clocks when a logical processor is active and not in power-saving states.
Defines Non-Sleep Clockticks as clocks when the physical processor is not in sleep or power-saving states.
Explains the Time Stamp Counter increments on clock signal activity and can be read via RDTSC instruction.
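A minimal sketch of reading the time-stamp counter mentioned above via the `__rdtsc` intrinsic (`<x86intrin.h>` on GCC/Clang is an assumption; MSVC provides it in `<intrin.h>`). Because of out-of-order execution, a serializing or fencing instruction is needed around the reads for precise measurements.

```c
#include <x86intrin.h>

unsigned long long cycles_elapsed(void (*fn)(void))
{
    unsigned long long start = __rdtsc();   /* RDTSC: read TSC into EDX:EAX */
    fn();
    return __rdtsc() - start;
}
```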
Provides notes on microarchitectural elements like Trace Cache Events and Bus/Memory Metrics for correct metric interpretation.
Discusses trace cache performance, its relation to bottlenecks, and metrics for determining front-end performance.
Explains understanding transaction sizes, queues, sectoring, and prefetching for correct interpretation of bus/memory metrics.
Provides event-specific notes for interpreting performance metrics, especially those related to BSQ cache references.
Lists performance metrics categorized by General, Branching, Trace Cache/Front End, Memory, Bus, and Characterization.
Provides an overview of issues related to instruction selection and scheduling, and the performance impact of applying this information.
Defines key terms like Instruction Name, Latency, Throughput, and Execution units for IA-32 instructions.
Presents latency and throughput information for IA-32 instructions, including SSE, MMX, and general-purpose instructions.
Provides IA-32 instruction latency and throughput data for register operands, covering SSE3, SSE2, MMX, and general-purpose instructions.
Discusses latency and throughput for instructions with memory operands, noting longer latencies compared to register operands.
Describes stack alignment conventions for esp-based and ebp-based stack frames, and the importance of 16-byte alignment for __m128 data.
Details creating esp-based stack frames with compiler padding for alignment, applicable when debug info and exception handling are not needed.
Explains ebp-based frames with padding before return address, used when debug info or exception handling is present.
Discusses compiler optimizations for aligned frames, including bypassing unnecessary alignment code and using static function alignment.
Warns against modifying the ebx register in inlined assembly functions that use dynamic stack alignment without saving/restoring it.
Presents a simplified equation to compute Prefetch Scheduling Distance (PSD) based on lookup, transfer, and computation latencies.
Defines parameters like psd, Tc, Tl, Tb, and CPI used in PSD calculation and discusses their dependencies.
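One plausible reading of the simplified relation, written with the parameters listed above (the exact constants and rounding are as given in the manual's derivation):

$$ psd = \left\lceil \frac{T_l + T_b}{T_c} \right\rceil $$

where $T_l$ is the memory lookup (leadoff) latency, $T_b$ the data transfer (burst) latency of a cache line, and $T_c$ the computation latency per loop iteration, roughly $CPI$ times the number of instructions per iteration.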
Explains the traditional sequential approach without preloading/prefetching and the resulting execution pipeline stalls.
Analyzes the compute-bound case where compute latency exceeds memory latency plus transfer latency, indicating PSD=1.
Examines the compute-bound scenario where iteration latency equals computation latency, suggesting PSD > 1.
Discusses memory throughput bound cases where memory latency dominates, making prefetch benefits marginal.
Provides example calculations for PSD based on given conditions for computation and memory throughput latencies.
Illustrates using prefetchnta instruction with a scheduling distance of 3, showing data usage in iteration i+3.
Shows a graph of accesses per iteration vs. computation clocks, illustrating latency reduction with prefetching.
Presents results for prefetching multiple cache lines per iteration, showing differing burst latencies.