Thursday 31 October 2019

1557506 - Linux paging improvements

You experience high swapping activity on your Linux system, e.g. when triggering a filesystem backup or similar, although the SAP applications are sized to completely fit into the system's main memory. This results in high response times on the application level as well.

Other Terms

SUSE, SLES, SLES for SAP, Enterprise Server, suse, SuSE, Novell, novell, LINUX, linux, Page Cache, page cache, pagecache, filesystem cache, pagecache swapping, swapping, swappiness, paging, swap partition, swap file, pagecache_limit_mb, page cache limit, memory usage, poor performance, performance problem, bad performance

Reason and Prerequisites

This SAP note describes how to configure a limit for the Linux kernel page cache using the tunables vm.pagecache_limit_mb and vm.pagecache_limit_ignore_dirty, which are available in SUSE Linux Enterprise Server for SAP Applications 11 (all Service Packs) and later versions. We recommend setting up the page cache limit only if you experience a performance problem caused by heavy paging / swapping, as described below. To avoid a bug that might lead to system freezes on systems with a large number of CPU cores, we recommend installing a certain minimum version of the SLES Linux kernel. Please refer to section "Known Problems" for more information.

SAP solutions can need large amounts of memory to process business data in memory. Besides the memory needed for the applications, the Linux kernel uses the remaining free memory for other purposes such as the page cache. The Linux page cache keeps data read from hard disk in main memory to speed up access in case the same data is accessed again. It also caches writes to the filesystem, which results in better write throughput and hides latencies for disk writes.
The Linux page cache (called filesystem cache on some other systems) is the reason why a Linux system will always show only a small amount of "free" memory after running for a certain time, since data from every hard disk access is kept in the page cache as long as there is enough free memory available. The Linux kernel will shrink the page cache automatically if applications need more memory.

(For more details, please search resources on the Internet, like http://linux-mm.org/Low_On_Memory, as well.)

If the remaining memory available for application data ("free" plus page cache) gets low, the Linux kernel can decide to page out currently inactive memory into the swap partition / swapfile in order to free up main memory. This typically improves the performance of the system, since unused memory areas no longer occupy the limited main memory. The freed-up memory can be used e.g. by applications -- or by the page cache in order to improve I/O performance. The Linux kernel decides which memory areas are paged out based on an LRU (least recently used) algorithm. Recently used memory areas stay in the main memory. Memory areas that have not been used recently may be paged out to the disk.

Faced with filesystem load such as backup jobs, the Linux kernel may decide to prefer the page cache over SAP application memory areas, if these memory areas have not recently been used (depending on the memory activities this does not necessarily mean hours). It pages out these memory areas to the disk (into the swap partition or swapfile). When the application then tries to access a memory region that has been paged out, the response time is poor, since these memory areas have to be read from hard disk (paged back in). Worse, when the SAP solution running on Java incurs a Java Garbage Collection which touches all of the memory used by the Java heap, the system starts heavy page-in activity and the system may appear unresponsive for an extended period of time.

The Linux kernel's behavior of paging out rarely accessed memory pages and using memory as page cache is in general beneficial for overall system throughput. It is therefore not considered a bug and cannot be changed unconditionally -- but it can be sub-optimal for specific SAP workloads.

Solution

Backup applications can bypass the page cache by using Direct I/O, or by telling the Linux kernel that the recently accessed files (now in the page cache) should not be cached, using posix_fadvise(POSIX_FADV_DONTNEED) or posix_fadvise(POSIX_FADV_NOREUSE). Unfortunately, this is not commonly supported by backup applications and thus is often no help.
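As an illustration of the idea above, here is a minimal sketch that copies a file while asking the kernel not to keep the data in the page cache. It assumes GNU coreutils dd, whose nocache flag issues such a discard request; the helper name copy_without_cache is hypothetical and not part of any backup product. The request is advisory, so the kernel may still cache some of the data.

```shell
# Hypothetical helper: copy a file while asking the kernel not to keep
# the copied data in the page cache (assumes GNU coreutils dd).
copy_without_cache() {
    src=$1
    dst=$2
    # iflag=nocache / oflag=nocache request that cached pages of the
    # source and destination file be discarded; the request is advisory.
    dd if="$src" of="$dst" bs=1M iflag=nocache oflag=nocache 2>/dev/null
}

# Example call (paths are placeholders):
#   copy_without_cache /path/to/datafile /backup/datafile
```

On filesystems that support it, iflag=direct or oflag=direct could be used instead to bypass the page cache via Direct I/O.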
A new Linux kernel feature has been developed for SUSE Linux Enterprise Server (SLES), which allows the system administrator to limit the amount of page cache that the kernel uses when there is competition between application memory and page cache. It tells the Linux kernel that once the page cache is filled to the configured limit, application memory is more important and should not be paged out. No application memory pages will be paged out to disk if the memory footprint of the workload plus the configured page cache limit does not exceed the amount of physical RAM in the system. If there is plenty of free memory, the kernel will continue to use it as page cache in order to speed up filesystem operations. In this case, the kernel exceeds the configured page cache limit. As soon as an application allocates additional memory, the page cache will be shrunk until the configured limit is reached. Below the limit, the page cache competes with application memory as if the feature were turned off.
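The condition described above (workload footprint plus page cache limit must not exceed physical RAM) can be checked with simple arithmetic. The following is only a sketch; the function name fits_in_ram and the sample numbers are illustrative assumptions, and all values are given in MB.

```shell
# Hypothetical helper: succeeds if the workload footprint plus the
# configured page cache limit fits into physical RAM (all values in MB).
fits_in_ram() {
    workload_mb=$1
    limit_mb=$2
    ram_mb=$3
    [ $((workload_mb + limit_mb)) -le "$ram_mb" ]
}

# Example: 64 GB RAM (65536 MB) with a page cache limit of 4096 MB.
#   fits_in_ram 60000 4096 65536   -> succeeds (64096 <= 65536)
#   fits_in_ram 62000 4096 65536   -> fails    (66096 >  65536)
```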
Two Linux kernel tunables have been introduced:
  • vm.pagecache_limit_mb (/proc/sys/vm/pagecache_limit_mb)
  • vm.pagecache_limit_ignore_dirty (/proc/sys/vm/pagecache_limit_ignore_dirty)
These tunables influence the size of the Linux kernel page cache.
To set the Linux kernel parameters for testing, simply use 'echo' to write the values to the respective proc-filesystem files.
For example:
echo 1024 > /proc/sys/vm/pagecache_limit_mb
echo 2 > /proc/sys/vm/pagecache_limit_ignore_dirty
The parameters will remain in effect until they are changed again or until the system is rebooted.
To permanently configure these parameters, please add them to /etc/sysctl.conf, e.g.
vm.pagecache_limit_mb = 1024
vm.pagecache_limit_ignore_dirty = 2
The parameters are then automatically set upon boot via /etc/init.d/boot.sysctl, which can also manually be invoked using
/etc/init.d/boot.sysctl start

- or -
sysctl -p
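When the permanent configuration is scripted, it helps to avoid duplicate entries in the configuration file. The following sketch replaces an existing assignment of a key or appends a new one. The function name set_sysctl_entry is hypothetical, and the file to edit is passed explicitly so that the helper can be tried on a copy first.

```shell
# Hypothetical helper: set "key = value" in a sysctl-style file,
# replacing any earlier assignment of the same key.
set_sysctl_entry() {
    conf_file=$1
    key=$2
    value=$3
    tmp_file=$(mktemp) || return 1
    # Escape the dots in the key so that they match literally.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    if [ -f "$conf_file" ]; then
        # Keep every line except earlier assignments of this key.
        grep -v -E "^[[:space:]]*${key_re}[[:space:]]*=" "$conf_file" > "$tmp_file"
    fi
    printf '%s = %s\n' "$key" "$value" >> "$tmp_file"
    cat "$tmp_file" > "$conf_file" && rm -f "$tmp_file"
}

# Example call (run as root, then reload with "sysctl -p"):
#   set_sysctl_entry /etc/sysctl.conf vm.pagecache_limit_mb 1024
#   set_sysctl_entry /etc/sysctl.conf vm.pagecache_limit_ignore_dirty 2
```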
Recommended values
It is recommended that you only set a page cache limit for SAP workloads if the symptoms described at the top of this note are observed. For SAP systems and SAP databases that have no swap space or only a small amount of swap space configured, we generally recommend leaving this feature disabled.
  • vm.pagecache_limit_mb

    The recommended value up to a value of 64 GB is 1/16 (~6%) of the amount of RAM, but not less than 512 MByte.
    <  8 GB:  512  (recommended min. limit)
      8 GB:  512  (=  8 * 1024 MB / 16)
      16 GB: 1024  (= 16 * 1024 MB / 16)
      32 GB: 2048  (= 32 * 1024 MB / 16)
      64 GB: 4096  (= 64 * 1024 MB / 16)

    Above 64 GB, the recommended limit is 2% of the amount of RAM, but not less than 4096 MB.

    Note for SAP HANA: SAP HANA uses a memory allocation scheme that never allocates more than 97% of the total amount of memory. Using a maximum of 2% for the page cache ensures that the page cache will never compete with memory allocated by the HANA database.

    128 GB: 4096 (=recommended limit)
    256 GB: 5243 (=2% of 256 * 1024 MB)
    512 GB: 10486 (=2% of 512 * 1024 MB)
    1024 GB: 20972 (=2% of 1024 * 1024 MB)
    2048 GB: 41943 (=2% of 2048 * 1024 MB)
    4096 GB: 83886 (=2% of 4096  * 1024 MB)
    8192 GB: 167772 (=2% of 8192  * 1024 MB)
    16384 GB: 335544 (=2% of 16384  * 1024 MB)
    [..]

    Important notes:
    • These values are recommendations and may not fit your situation. You may set higher or lower values depending on your workload. Higher values consume more expensive RAM. You should always make sure that your main application (like the database or application server) is configured so that it does not consume more memory than the amount of RAM minus the max. value of the page cache. Lower values might cause system instabilities under certain conditions, especially with workloads that create a high system load.
    • Please configure your SAP system in such a way that you have enough free memory for the page cache, the OS and other services (like databases, service daemons, etc.).
      In HANA-environments, please consider adjusting the Global Allocation Limit accordingly (see SAP note 1999997 - FAQ: SAP HANA Memory for more details).
    • Keep in mind that the configured page cache size can still be exceeded. However, when exceeding the limit, the page cache will never put pressure on memory allocations of other applications. For more details, please read the section 'Technical details'.

  • vm.pagecache_limit_ignore_dirty

    If there are a lot of local writes and it is OK to throttle them by limiting the writeback caching, we recommend that you set the value to 0. If writing mainly happens to NFS filesystems, the default 1 should be left untouched. A value of 2 would be a middle ground, not limiting local writeback caching as much, but potentially resulting in some paging.
These parameters have been found efficient in preventing SAP systems from paging out application memory and thus help to prevent performance regressions. Setting the pagecache_limit_mb to non-zero however does limit the Linux kernel's ability to take paging decisions and thus can have negative performance impact for some workloads, especially if low limits are set.
Thus the recommendation to only configure limits if needed.


A negative performance impact has not been observed in any SAP test case unless much lower limits than the recommended ones have been set.
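The sizing rule for vm.pagecache_limit_mb given above can be expressed as a small calculation. This is only a sketch: the function name recommended_pagecache_limit_mb is hypothetical, the RAM size is passed as a whole number of GB, and values are rounded to the nearest MB so that the results match the tables in this note.

```shell
# Hypothetical helper: print the recommended vm.pagecache_limit_mb
# for a given amount of RAM in GB, following the rule in this note.
recommended_pagecache_limit_mb() {
    ram_gb=$1
    ram_mb=$((ram_gb * 1024))
    if [ "$ram_gb" -le 64 ]; then
        # Up to 64 GB: 1/16 of RAM, but not less than 512 MB.
        limit_mb=$((ram_mb / 16))
        min_mb=512
    else
        # Above 64 GB: 2% of RAM (rounded), but not less than 4096 MB.
        limit_mb=$(( (ram_mb * 2 + 50) / 100 ))
        min_mb=4096
    fi
    if [ "$limit_mb" -lt "$min_mb" ]; then
        limit_mb=$min_mb
    fi
    echo "$limit_mb"
}

# Example calls:
#   recommended_pagecache_limit_mb 32     prints 2048
#   recommended_pagecache_limit_mb 256    prints 5243
```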

We encourage you to share your experience and ask questions about these tunables at linux@sap.com.
Technical details
  • /proc/sys/vm/pagecache_limit_mb

    This tunable sets a soft limit of the page cache in megabytes. If it is set to zero (default), the paging behavior is unchanged from the standard Linux kernel behavior. If non-zero, it will configure a soft page cache limit -- the page cache can still grow above the configured size if the system has a significant amount of free memory. However, when exceeding the limit, the page cache will never put pressure on the memory allocations of other applications. Furthermore, it will immediately be shrunk down to the configured limit if an application requires the memory. Using the remaining free memory for the page cache helps to improve the overall performance of a system. (See above for recommended values.)
  • /proc/sys/vm/pagecache_limit_ignore_dirty

    This tunable influences whether dirty memory is considered part of the limited page cache, as dirty memory cannot be freed up as easily as clean memory (it has to be written out first). By setting this to 0, dirty (unmapped) memory will be considered freeable and the Linux kernel will try to write the pages out when enforcing the page cache limit. This effectively enforces a severe limit on the writeback cache for filesystem writes -- typically much lower than the normal limit (percentage) configured via vm.dirty_ratio (/proc/sys/vm/dirty_ratio). By setting this to 1 (ignore all dirty memory in the page cache when enforcing the limit), the page cache can actually grow well beyond the configured limit if lots of writes happen to local filesystems. Values larger than 1 are also possible and result in a fraction of the dirty pages being considered freeable. (See above for recommended values.)
More details are available in /usr/src/linux/Documentation/vm/pagecache-limit
from the kernel-source package.
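To observe how the soft limit behaves on a running system, the current page cache size can be compared with the configured limit. The sketch below reads the "Cached:" value (in kB) from /proc/meminfo as an approximation of the page cache size; the kernel's own accounting for the limit may differ. Both function names are hypothetical.

```shell
# Hypothetical helper: print the approximate page cache size in MB,
# taken from the "Cached:" line of a meminfo-style file (values in kB).
pagecache_mb() {
    meminfo_file=${1:-/proc/meminfo}
    awk '$1 == "Cached:" { printf "%d\n", $2 / 1024 }' "$meminfo_file"
}

# Hypothetical helper: succeeds if a non-zero limit (in MB) is set and
# the page cache is currently larger than that limit.
pagecache_over_limit() {
    limit_mb=$1
    current_mb=$(pagecache_mb "${2:-/proc/meminfo}")
    [ "$limit_mb" -gt 0 ] && [ "$current_mb" -gt "$limit_mb" ]
}

# Example call on a system that provides the tunable:
#   pagecache_over_limit "$(cat /proc/sys/vm/pagecache_limit_mb)" \
#       && echo "page cache is above the soft limit"
```

Because the limit is soft, a result above the limit is expected as long as the system has free memory.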
Availability of this solution
The solution has been developed for the Linux kernel shipped with SUSE Linux Enterprise Server 11 SP1 and has been shipping as an *experimental* feature. After successful real-world validation with a number of pilot customers, this feature is now generally supported for SUSE Linux Enterprise Server for SAP Applications 11 SP1 and later service pack versions, as well as with SUSE Linux Enterprise Server for SAP Applications 12.
Known Problems
Due to a bug in previous versions of the SUSE Linux Enterprise Server kernel, systems with a large number of CPU cores might freeze when they are under heavy load. This bug has been fixed and the patch is available as of the kernel versions listed below. We strongly recommend updating your kernel.
  • SUSE Linux Enterprise Server 11 SP3: Kernel version 3.0.101-0.40 (SUSE TID) or later versions
  • SUSE Linux Enterprise Server 11 SP2: Kernel version 3.0.101-0.7.23.1 or later versions
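A running kernel can be compared against the minimum versions listed above with a version-aware sort. This is a sketch that assumes a sort implementation supporting the -V option (as in GNU coreutils); the function name kernel_at_least is hypothetical. The output of "uname -r" may carry a flavor suffix, which might need to be stripped before the comparison.

```shell
# Hypothetical helper: succeeds if the current version string is equal
# to or newer than the minimum version string (uses "sort -V").
kernel_at_least() {
    current=$1
    minimum=$2
    # The smaller of the two versions sorts first; the check passes
    # when that smaller version is the required minimum.
    lowest=$(printf '%s\n%s\n' "$minimum" "$current" | sort -V | head -n 1)
    [ "$lowest" = "$minimum" ]
}

# Example call for SLES 11 SP3:
#   kernel_at_least "$(uname -r)" 3.0.101-0.40 || echo "kernel update recommended"
```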