varnish-numa

NUMA awareness in Varnish Enterprise

Manual section: 7

Non-Uniform Memory Access (NUMA)

NUMA is a memory architecture used in multiprocessing systems. In a NUMA system, memory banks are local to certain CPUs and remote to others, while access remains transparent to the application regardless of where it is running. A local memory bank is cheaper to access as it is directly connected to the NUMA node where the task is currently running, whereas a remote memory bank costs more to access because it sits further away from the NUMA node and additional work is required to keep the accessed memory cache coherent for the running task. The penalties typically observed by the application when accessing a remote memory bank include higher latencies, higher CPU usage, and lower throughput.
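
As a rough illustration, the numactl tool can print the topology and the relative distances between nodes. The output below is abridged and comes from a hypothetical two-node system; the numbers will differ on real hardware:

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 1 cpus: 8 9 10 11 12 13 14 15
node distances:
node   0   1
  0:  10  21
  1:  21  10

In the distance matrix, the diagonal represents local access, while higher values indicate more expensive remote access.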

The Linux kernel has a NUMA auto-balancing feature that migrates pages to where the application is currently running, using heuristics that try to reduce the number of remote accesses. However, we have found that active participation from the application yields better performance than the auto-balancing feature by itself.
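
For reference, whether the kernel auto-balancing feature is enabled can be checked through the kernel.numa_balancing sysctl, where a value of 1 means enabled:

$ sysctl kernel.numa_balancing
kernel.numa_balancing = 1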

A key performance factor on a NUMA system is the balance of the hardware itself. A problem typically observed for a NUMA aware application on an unbalanced NUMA system is that only some NUMA nodes are busy while other nodes remain mostly idle. A common cause of part of the hardware sitting idle is the presence of a single Network Interface Card (NIC). Having one NIC per NUMA node helps prevent network operations from crossing an expensive NUMA boundary.
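
The NUMA node a NIC is attached to can be read from sysfs; the interface name eth0 below is only an example, and a value of -1 means the device is not tied to a particular node:

$ cat /sys/class/net/eth0/device/numa_node
0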

On some systems it can make sense to make the application NUMA aware even at the expense of some idle NUMA nodes. The question is whether the application performs better running on one NUMA node than being continuously migrated between NUMA nodes in an unbalanced NUMA system.
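
One way to approximate this comparison, assuming numactl is available, is to benchmark a test instance constrained to a single NUMA node (node 0 in this sketch) against an unconstrained instance:

$ numactl --cpunodebind=0 --membind=0 varnishd -a :80 -f /etc/varnish/default.vcl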

One way to visualize the NUMA topology of a system is to use the lstopo tool; another is to check the verbose output of lspci. A third option is to look for the numa_node virtual files under the sysfs subsystem.
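
For example, assuming the hwloc and pciutils packages are installed, any of the following can be used to inspect the topology:

$ lstopo
$ lspci -vv | grep -i 'numa node'
$ cat /sys/devices/system/node/node*/cpulist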

NUMA aware Varnish

Varnish has a concept of thread pools, where each thread pool contains a set of worker threads. This makes a natural abstraction layer for NUMA aware Varnish, which is crucial since shared memory access between NUMA nodes should be avoided whenever possible. One way to ensure that tasks running in worker threads never cross over to another NUMA node for task-scoped data structures is to colocate memory pools with thread pools.

We assign a NUMA node to each thread pool in a round-robin fashion. We also internally keep the thread_pools parameter at a minimum of the number of NUMA nodes detected on the system. If the thread_pools parameter is configured higher than this minimum, we assign more thread pools to the individual NUMA nodes. This should give each NUMA node an equal number of thread pools, but the balance can be off if the thread_pools parameter is not divisible by the number of NUMA nodes available on the system.
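
The configured value of the thread_pools parameter can be inspected at runtime with varnishadm:

$ varnishadm param.show thread_pools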

In order to make the transition between the kernel and Varnish efficient, we pin new sessions to the thread pool where the client arrived in the kernel.

It is not possible to eliminate all NUMA crossing scenarios. Data structures shared between multiple tasks, such as the cache itself, are exempt from NUMA node pinning. Likewise, NUMA awareness is not a criterion for book and store selection in persisted caches with MSE. There are also background threads that run outside of the worker thread pools.

Running NUMA aware Varnish

To run NUMA aware Varnish, the version needs to be at least Varnish Enterprise 6.0.8r2. However, our general recommendation is to run the latest stable release whenever possible, as it comes with the latest fixes and other improvements.

Part of the NUMA aware feature of Varnish Enterprise relies on a recent version of the Linux kernel. Thus, the minimum supported platforms are the following (an example of checking a running host is shown after the list):

  • Debian 12 (Bookworm)
  • Ubuntu 20.04 LTS (Focal Fossa)
  • Red Hat Enterprise Linux 8 (RHEL, AlmaLinux)
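
For example, the running kernel and distribution of a host can be checked with:

$ uname -r
$ grep PRETTY_NAME /etc/os-release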

To make Varnish NUMA aware, you need to enable two parameters: reuseport and numa_aware. These parameters can only be set on the command line, and both are immutable after startup.

The varnishd command line will then look like this:

$ varnishd -V
varnishd (varnish-plus-6.0.8r2 revision 68c65b70a04ef588aa124a91207d78252452c381)
Copyright (c) 2006 Verdens Gang AS
Copyright (c) 2006-2021 Varnish Software AS
$ varnishd -a :80 -f /etc/varnish/default.vcl -p reuseport=on -p numa_aware=on
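
Once varnishd is running, the two parameters can be verified with varnishadm, for example:

$ varnishadm param.show reuseport
$ varnishadm param.show numa_aware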

We recommend that you perform testing before applying this change to all of your servers. See below for metrics to look for.

NUMA Statistics

One way to tell if a system is running optimally with regards to NUMA is to use the numastat(8) tool to evaluate the page allocation performance. A sample run could look something like this:

$ numastat
                             node0         node1
numa_hit                  76557759      92126519
numa_miss                 30772308      30827638
numa_foreign              30827638      30772308
interleave_hit              106507        103832
local_node                76502227      92086995
other_node                30827840      30867162

The numa_hit counter should be high on each node while the numa_miss counter should be low in comparison. Please check the manual page for numastat(8) for details on the other counters. It is possible to look at the same metrics through the sysfs subsystem:

# cat /sys/devices/system/node/node*/numastat

The Intel Performance Counter Monitor (Intel PCM) tool exists for Intel CPUs to get finer details on how much traffic traverses the NUMA nodes at a given time (have a look at the pcm-numa.x program): https://github.com/intel/pcm

It is also possible to set up a Grafana dashboard to watch these metrics: https://github.com/intel/pcm/tree/master/scripts/grafana

Similar tools may exist for other platforms.