Appendix B. Performance tooling and empirical performance analysis 169
Finding alignment issues
Improperly aligned code or data can cause performance degradation. By default, the IBM
compilers and linkers correctly align code and data, including stack and statically allocated
variables. Incorrect typecasting can result in references to storage that are not correctly
aligned. There are two types of alignment issues to be concerned with:
Alignment issues that are handled by microcode in the POWER7 processor.
Alignment issues that are handled through alignment interrupts.
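As an illustration of how typecasting can produce a misaligned reference, the following minimal C sketch tests whether an address is aligned and reads a 64-bit value from an arbitrary byte offset without casting the pointer. The helper names is_aligned and load_u64 are hypothetical and are not part of any IBM tool or library. Copying with memcpy avoids dereferencing a misaligned pointer, but it does not make the underlying data aligned; where possible, the data layout itself should be corrected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if p is a multiple of align.
 * align must be a nonzero power of two. */
bool is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: read a 64-bit value from any byte address.
 * A cast such as *(const uint64_t *)p would be a misaligned reference
 * whenever p is not 8-byte aligned; memcpy sidesteps the cast. */
uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

For example, for a byte buffer that starts on an 8-byte boundary, is_aligned(buf + 4, 8) returns false, so a cast of buf + 4 to a 64-bit pointer type is the kind of reference this section describes.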
Examples of alignment issues that are handled by microcode with a performance penalty in
the POWER7 processor are loads that cross a 128-byte boundary and stores that cross a
4 KB page boundary. To give an indication of the penalty for this type of misalignment, on a
4 GHz processor, a nine-instruction loop that contains an 8-byte load that crosses a 128-byte
boundary takes double the time of the same loop with the load correctly aligned.
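The boundary conditions described above can be checked arithmetically. The following C sketch uses a hypothetical helper, crosses_boundary, that reports whether an access of a given size at a given address spans a multiple of the boundary (128 bytes for loads and 4096 bytes for stores in the examples above). It is only an aid for reasoning about addresses and does not measure the penalty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if the bytes [addr, addr + size) lie in more
 * than one boundary-sized block. boundary must be a nonzero power of two. */
bool crosses_boundary(uintptr_t addr, size_t size, uintptr_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    if (size == 0) {
        return false;               /* an empty access touches no block */
    }
    /* Compare the block index of the first and the last byte accessed. */
    return (addr / boundary) != ((addr + size - 1) / boundary);
}
```

With this helper, an 8-byte load at offset 124 from a 128-byte aligned base crosses the boundary (it touches bytes 124 through 131), while the same load at offset 120 does not.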
Alignment issues that are handled by microcode can be detected by running hpmcount or
hpmstat. The hpmcount command is a command-line utility that runs a command and collects
statistics from the POWER7 PMU while the command runs. To detect alignment issues that
are handled by microcode, run hpmcount to collect data for group 38. An example is provided
in Example B-8.
Example B-8 Example of the results of the hpmcount command
# hpmcount -g 38 ./unaligned
Group: 38
Counting mode: user
Counting duration: 21.048874056 seconds
PM_LSU_FLUSH_ULD (LRQ unaligned load flushes) : 4320840034
PM_LSU_FLUSH_UST (SRQ unaligned store flushes) : 0
PM_LSU_FLUSH_LRQ (LRQ flushes) : 450842085
PM_LSU_FLUSH_SRQ (SRQ flushes) : 149
PM_RUN_INST_CMPL (Run instructions completed) : 19327363517
PM_RUN_CYC (Run cycles) : 84219113069
Normalization base: time
Counting mode: user
Derived metric group: General
[ ] Run cycles per run instruction : 4.358
The hpmstat command is similar to hpmcount, except that it collects performance data on a
system-wide basis, rather than just for the execution of a command.
Generally, scenarios in which the ratio of (LRQ unaligned load flushes + SRQ unaligned store
flushes) divided by Run instructions completed is greater than 0.5% must be further
investigated. The tprof command can be used to further pinpoint where in the code the
unaligned storage references are occurring. To pinpoint unaligned loads, the -E
PM_MRK_LSU_FLUSH_ULD flag is added to the tprof command line, and to pinpoint unaligned
stores, the -E PM_MRK_LSU_FLUSH_UST flag is added. When these flags are used, tprof
generates a profile that samples unaligned loads and stores instead of using
time-based sampling.
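The 0.5% guideline can be applied directly to the counters that hpmcount reports. The following C sketch is one way to express the calculation; the function and macro names are hypothetical, and the counter values must be copied by hand from the hpmcount output. With the values from Example B-8 (4320840034 unaligned load flushes, 0 unaligned store flushes, and 19327363517 run instructions completed), the ratio is roughly 22%, far above the threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Threshold from the text: 0.5% of run instructions completed. */
#define UNALIGNED_FLUSH_THRESHOLD 0.005

/* Hypothetical helper: (PM_LSU_FLUSH_ULD + PM_LSU_FLUSH_UST) divided by
 * PM_RUN_INST_CMPL, returned as a fraction (0.01 means 1%). */
double unaligned_flush_ratio(uint64_t uld_flushes,
                             uint64_t ust_flushes,
                             uint64_t run_inst_cmpl)
{
    assert(run_inst_cmpl > 0);
    return ((double)uld_flushes + (double)ust_flushes) / (double)run_inst_cmpl;
}

/* Hypothetical helper: true if the ratio exceeds the 0.5% guideline. */
bool needs_alignment_investigation(uint64_t uld_flushes,
                                   uint64_t ust_flushes,
                                   uint64_t run_inst_cmpl)
{
    return unaligned_flush_ratio(uld_flushes, ust_flushes, run_inst_cmpl)
           > UNALIGNED_FLUSH_THRESHOLD;
}
```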