102 POWER7 and POWER7+ Optimization and Tuning Guide
To disable the use of mmap for malloc requests (including Fortran ALLOCATE statements), set
the maximum value to zero:
MALLOC_MMAP_MAX_=0
To disable trimming, set the trim threshold to -1:
MALLOC_TRIM_THRESHOLD_=-1
Trimming and using mmap are two different ways of releasing unused memory back to the
system. Disabling both changes the normal behavior of malloc in C and Fortran programs,
which in some cases can change the performance characteristics of the program. To apply
both settings, either export the two variables in the shell before running the program, or set
them on the command line for a single run:
- # ./my_program
- # MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1 ./my_program
Depending on how your application uses memory and on its data locality, this change might
have no effect or might improve performance.
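The same two settings can also be made from inside a program. In glibc, mallopt() accepts the parameters M_MMAP_MAX and M_TRIM_THRESHOLD, which correspond to the MALLOC_MMAP_MAX_ and MALLOC_TRIM_THRESHOLD_ environment variables. The following is a minimal sketch, assuming glibc's <malloc.h>; the helper name disable_mmap_and_trim() is hypothetical. It affects only allocations made after it runs, so it belongs near the top of main().

```c
#include <malloc.h>  /* glibc-specific: mallopt(), M_MMAP_MAX, M_TRIM_THRESHOLD */

/* Hypothetical helper: the in-program equivalent of
 *   MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1
 * Returns 1 if glibc accepted both settings, 0 otherwise. */
int disable_mmap_and_trim(void)
{
    /* Never satisfy a request with a dedicated mmap(); every
     * allocation comes from the arenas instead. */
    int ok_mmap = mallopt(M_MMAP_MAX, 0);

    /* -1 is converted to the largest unsigned value, so the free space
     * at the top of the heap never reaches the trim threshold. */
    int ok_trim = mallopt(M_TRIM_THRESHOLD, -1);

    return ok_mmap == 1 && ok_trim == 1;
}
```

The environment variables remain the simpler option when the program cannot be rebuilt.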
Linux malloc considerations
The GNU C Library (glibc) on Linux includes a default malloc implementation that is optimized
for multithreading and medium-sized allocations. For smaller allocations (less than
MMAP_THRESHOLD), it uses sbrk() to obtain large blocks of storage, called arenas, which are
then suballocated to satisfy smaller malloc requests. Larger allocations (greater than
MMAP_THRESHOLD) are each satisfied by a separate anonymous mmap().
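The threshold split can be observed directly. The sketch below assumes glibc 2.33 or later, where mallinfo2() reports in its hblks field the number of chunks currently allocated with mmap(); the helper served_by_mmap() is hypothetical. Because glibc raises the mmap threshold after a large mmap'd chunk is freed, only the first large request in a fresh process is a reliable probe.

```c
#include <malloc.h>   /* glibc-specific: mallinfo2() (glibc 2.33 or later) */
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: reports whether a malloc(size) request was
 * satisfied by a dedicated anonymous mmap() rather than from an arena.
 * mallinfo2().hblks counts the chunks currently allocated with mmap(). */
bool served_by_mmap(size_t size)
{
    size_t before = mallinfo2().hblks;
    void *p = malloc(size);
    if (p == NULL)
        abort();
    size_t after = mallinfo2().hblks;
    free(p);  /* if p was mmap'd, free() unmaps it immediately */
    return after > before;
}
```

Under the default 128 KB threshold, served_by_mmap(1024) would be expected to return false in a fresh process, and a first call of served_by_mmap(1 << 20) to return true.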
The default values are listed here:
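Parameter                Default value
DEFAULT_MXFAST           64 bytes (32-bit) or 128 bytes (64-bit)
DEFAULT_TRIM_THRESHOLD   128 * 1024 bytes (128 KB)
DEFAULT_TOP_PAD          0 bytes
DEFAULT_MMAP_THRESHOLD   128 * 1024 bytes (128 KB)
DEFAULT_MMAP_MAX         65536 (maximum number of simultaneous mmap() allocations)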
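These defaults can be overridden with environment variables such as MALLOC_MMAP_MAX_ and MALLOC_TRIM_THRESHOLD_ (glibc also reads MALLOC_MMAP_THRESHOLD_ and MALLOC_TOP_PAD_) or with mallopt(). The sketch below assumes glibc and uses a hypothetical helper name; it pins each parameter to the value in the table. One side effect worth knowing: according to the mallopt(3) man page, setting M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or M_MMAP_MAX disables glibc's dynamic adjustment of the mmap threshold, so explicitly restating the defaults is not a no-op.

```c
#include <malloc.h>  /* glibc-specific: mallopt() and the M_* parameter names */
#include <stddef.h>

/* Hypothetical helper: explicitly sets each tunable to the default
 * listed in the table and returns how many of the five mallopt()
 * calls succeeded (mallopt() returns 1 on success, 0 on error).
 * Side effect: the mmap threshold stops adjusting dynamically. */
int pin_malloc_defaults(void)
{
    int ok = 0;
    /* 64 on 32-bit, 128 on 64-bit, matching DEFAULT_MXFAST */
    ok += mallopt(M_MXFAST, (int)(64 * sizeof(size_t) / 4));
    ok += mallopt(M_TRIM_THRESHOLD, 128 * 1024);
    ok += mallopt(M_TOP_PAD, 0);
    ok += mallopt(M_MMAP_THRESHOLD, 128 * 1024);
    ok += mallopt(M_MMAP_MAX, 65536);
    return ok;
}
```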
Storage within arenas can be reused without kernel intervention. The default malloc
implementation uses trylock techniques to detect contention between POSIX threads and then
tries to assign each thread its own arena. This approach works well when the thread that
allocates storage is also the thread that frees it, but it results in more contention when malloc
storage is passed between producer and consumer threads. The default malloc implementation
also uses atomic operations and finer-grained critical sections (lock and unlock) to improve
parallel thread execution. This is a trade-off: better multithreaded scalability at the expense
of a longer malloc path length, with multiple atomic operations per call.
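The producer/consumer case can be sketched with POSIX threads: every buffer is allocated by one thread and freed by another, which is the pattern in which per-thread arena assignment helps least. This is an illustration of the access pattern, not a benchmark, and the exact contention behavior depends on the glibc version. The names run_handoff(), struct ptr_queue, and QUEUE_CAP are hypothetical. Compile with -pthread.

```c
#include <pthread.h>
#include <stdlib.h>

#define QUEUE_CAP 16  /* hypothetical capacity of the hand-off queue */

/* Bounded FIFO of pointers shared by one producer and one consumer. */
struct ptr_queue {
    void *slots[QUEUE_CAP];
    size_t head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

static void queue_push(struct ptr_queue *q, void *p)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->slots[q->tail] = p;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static void *queue_pop(struct ptr_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *p = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return p;
}

struct handoff {
    struct ptr_queue q;
    long items;  /* number of buffers passed from producer to consumer */
    long sum;    /* written only by the consumer */
};

/* Producer: every buffer is allocated by this thread... */
static void *producer(void *arg)
{
    struct handoff *h = arg;
    for (long i = 0; i < h->items; i++) {
        long *buf = malloc(16 * sizeof *buf);
        if (buf == NULL)
            abort();
        buf[0] = i;
        queue_push(&h->q, buf);
    }
    queue_push(&h->q, NULL);  /* end-of-stream marker */
    return NULL;
}

/* ...and released by this one: the cross-thread free pattern. */
static void *consumer(void *arg)
{
    struct handoff *h = arg;
    long *buf;
    while ((buf = queue_pop(&h->q)) != NULL) {
        h->sum += buf[0];
        free(buf);
    }
    return NULL;
}

/* Hypothetical driver: returns 0 + 1 + ... + (items - 1). */
long run_handoff(long items)
{
    struct handoff h = { .items = items, .sum = 0 };
    pthread_t prod, cons;

    pthread_mutex_init(&h.q.lock, NULL);
    pthread_cond_init(&h.q.not_empty, NULL);
    pthread_cond_init(&h.q.not_full, NULL);

    pthread_create(&prod, NULL, producer, &h);
    pthread_create(&cons, NULL, consumer, &h);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);

    pthread_cond_destroy(&h.q.not_full);
    pthread_cond_destroy(&h.q.not_empty);
    pthread_mutex_destroy(&h.q.lock);
    return h.sum;
}
```

For example, run_handoff(1000) passes 1000 buffers across threads and returns 499500.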
Large allocations (greater than MMAP_THRESHOLD) require a kernel system call for each
malloc() and free(). The Linux Virtual Memory Management (VMM) policy does not assign any
real memory pages to an anonymous mmap() until the application touches those pages. The
benefit of this policy is that real memory is not allocated until it is needed. The downside is
that, as the application begins to populate the new allocation with data, it takes a page fault
on the first touch of each page while the kernel allocates and zero-fills that page. The
processing cost therefore moves from the original mmap() call to the first touch of each page.
In addition, the timing of this first touch can affect the NUMA placement of each memory page,
because the page is placed when it is first touched rather than when mmap() is called.
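The first-touch cost can be measured with getrusage(), whose ru_minflt field counts minor page faults. The sketch below is illustrative: first_touch_faults() is a hypothetical helper that writes one byte per page so that every page of a new allocation is touched exactly once. Without transparent huge pages, the result should be close to the allocation size divided by the page size (the page holding malloc's chunk header has already been touched by malloc itself); transparent huge pages, where enabled, can reduce the count.

```c
#include <stdlib.h>
#include <sys/resource.h>  /* getrusage(), struct rusage */
#include <unistd.h>        /* sysconf(_SC_PAGESIZE) */

/* Hypothetical helper: allocates `bytes`, touches one byte per page,
 * and returns the number of minor page faults the touching loop caused. */
long first_touch_faults(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);  /* for example 4 KB or 64 KB */
    char *buf = malloc(bytes);
    if (buf == NULL)
        abort();

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);

    /* volatile keeps the compiler from discarding stores to memory
     * that is freed immediately afterward */
    volatile char *p = buf;
    for (size_t off = 0; off < bytes; off += (size_t)page)
        p[off] = 1;

    getrusage(RUSAGE_SELF, &after);
    free(buf);
    return after.ru_minflt - before.ru_minflt;
}
```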
Because free() unmaps such storage, each new large malloc allocation starts with a flurry of
page faults. This situation is partially mitigated by the larger (64 KB) default page size of
Red Hat Enterprise Linux and SUSE Linux Enterprise Server on Power Systems: populating the
same amount of memory takes one-sixteenth as many page faults as with 4 KB pages.
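As a worked example of the arithmetic, assuming a page-aligned allocation and one fault per base page: populating 1 MB takes 1,048,576 / 4,096 = 256 faults with 4 KB pages, but 1,048,576 / 65,536 = 16 faults with 64 KB pages. The hypothetical helper below rounds up to whole pages.

```c
#include <stddef.h>

/* Hypothetical helper: first-touch page faults needed to populate
 * `bytes` of a fresh, page-aligned anonymous mapping, assuming one
 * fault per base page (no transparent huge pages). */
size_t faults_for(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;  /* round up to whole pages */
}
```

For example, faults_for(100000, 65536) is 2, because a partially filled page still takes a fault.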
