IBM Power7 - Page 118

102 POWER7 and POWER7+ Optimization and Tuning Guide

To disable the use of mmap for malloc requests (which includes Fortran allocations), set the
maximum value to zero:

MALLOC_MMAP_MAX_=0

To disable the trim threshold, set the value to negative one:

MALLOC_TRIM_THRESHOLD_=-1

Trimming and using mmap are two different ways of releasing unused memory back to the
system. Disabling both changes the normal behavior of malloc across C and Fortran programs,
which in some cases can change the performance characteristics of the program. To compare
the default behavior with both settings applied, run the program each way:

# ./my_program
# MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 ./my_program

Depending on how your application uses memory and on its data locality, this change might
have no effect or might improve performance.
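
The same two settings can also be applied from inside a C program. The following is a minimal sketch, assuming the GNU C library, where mallopt() and the M_MMAP_MAX and M_TRIM_THRESHOLD parameters are declared in <malloc.h>. The helper name disable_mmap_and_trim is hypothetical and does not come from this guide.

```c
#include <malloc.h>   /* glibc-specific: mallopt(), M_MMAP_MAX, M_TRIM_THRESHOLD */

/* Hypothetical helper (not from the guide): applies the same settings as
 * MALLOC_MMAP_MAX_=0 and MALLOC_TRIM_THRESHOLD_=-1.
 * mallopt() returns 1 on success and 0 on error, so this returns 1 only
 * when both settings were accepted. */
int disable_mmap_and_trim(void)
{
    int ok_mmap = mallopt(M_MMAP_MAX, 0);         /* never service malloc with mmap */
    int ok_trim = mallopt(M_TRIM_THRESHOLD, -1);  /* never trim the top of the heap */
    return ok_mmap && ok_trim;
}
```

A program would typically call such a helper once, early in main(), before the allocations of interest. Unlike the environment variables, which apply from process startup, mallopt() affects only the allocations made after the call.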

Linux malloc considerations

The Linux GNU C run time includes a default malloc implementation that is optimized for
multi-threading and medium-sized allocations. For smaller allocations (less than
MMAP_THRESHOLD), the default malloc implementation uses sbrk() to allocate blocks of
storage, called arenas, which are then suballocated for smaller malloc requests. Larger
allocations (greater than MMAP_THRESHOLD) are each allocated by an anonymous mmap, one per
request.

The default values are listed here:

DEFAULT_MXFAST            64 (for 32-bit) or 128 (for 64-bit)
DEFAULT_TRIM_THRESHOLD    128 * 1024
DEFAULT_TOP_PAD           0
DEFAULT_MMAP_THRESHOLD    128 * 1024
DEFAULT_MMAP_MAX          65536
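
Each of these defaults has a corresponding mallopt() parameter in the GNU C library. The following sketch, assuming glibc's <malloc.h>, passes the listed values through mallopt() to show which parameter name matches which default. The struct and function names are hypothetical.

```c
#include <malloc.h>   /* glibc-specific: mallopt() and the M_* parameters */
#include <stddef.h>

/* Hypothetical pairing of a mallopt() parameter with a value. */
struct malloc_setting {
    int param;
    int value;
};

/* Applies the default values listed above and returns how many calls
 * mallopt() accepted (it returns 1 on success and 0 on error). */
int apply_listed_defaults(void)
{
    const struct malloc_setting settings[] = {
        { M_MXFAST,         sizeof(size_t) == 8 ? 128 : 64 },
        { M_TRIM_THRESHOLD, 128 * 1024 },
        { M_TOP_PAD,        0 },
        { M_MMAP_THRESHOLD, 128 * 1024 },
        { M_MMAP_MAX,       65536 },
    };
    int accepted = 0;

    for (size_t i = 0; i < sizeof settings / sizeof settings[0]; i++)
        accepted += mallopt(settings[i].param, settings[i].value);

    return accepted;
}
```

This is for illustration only and is not a strict no-op: according to the mallopt man page, explicitly setting M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or M_MMAP_MAX disables glibc's dynamic adjustment of the mmap threshold.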

Storage within arenas can be reused without kernel intervention. The default malloc
implementation uses trylock techniques to detect contention between POSIX threads, and
then tries to assign each thread its own arena. This approach works well when the same thread
frees the storage that it allocates, but it results in more contention when malloc storage is
passed between producer and consumer threads. The default malloc implementation also
tries to use atomic operations and more granular critical sections (lock and unlock) to
enhance parallel thread execution. The trade-off is better multi-thread execution at the
expense of a longer malloc path length, with multiple atomic operations per call.

Large allocations (greater than MMAP_THRESHOLD) require a kernel syscall for each malloc()
and free(). The Linux Virtual Memory Management (VMM) policy does not allocate any real
memory pages to an anonymous mmap() until the application touches those pages. The
benefit of this policy is that real memory is not allocated until it is needed. The downside is
that, as the application begins to populate the new allocation with data, it takes a page fault
on the first touch of each page so that the kernel can allocate and zero-fill that page. The
processing cost is therefore paid when the memory is first touched, rather than earlier, when
the original mmap is done. In addition, this first-touch timing can affect the NUMA placement
of each memory page.

Such storage is unmapped by free(), so each new large malloc allocation starts with a flurry
of page faults. This situation is partially mitigated by the larger (64 KB) default page size of
Red Hat Enterprise Linux and SUSE Linux Enterprise Server on Power Systems, which results
in fewer page faults than with 4 KB pages.
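
The effect of page size can be estimated with simple arithmetic. The following sketch assumes at most one first-touch fault per page and ignores huge pages and other kernel optimizations. The function name pages_needed is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: number of pages (and so the upper bound on
 * first-touch faults under the stated assumption) needed to back an
 * allocation of `bytes` bytes. Rounds up to a whole page. */
size_t pages_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

Under this assumption, a 1 MB allocation spans 256 pages of 4 KB but only 16 pages of 64 KB, so the larger page size reduces the first-touch faults for that allocation by a factor of 16.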
