.. role:: ref(emphasis)

.. _varnish-numa(7):

============
varnish-numa
============

------------------------------------
NUMA awareness in Varnish Enterprise
------------------------------------

:Manual section: 7

Non-Uniform Memory Access (NUMA)
================================

NUMA is a memory design for multiprocessor systems in which each memory bank is
local to some CPUs and remote to others. Access is transparent: an application
can reach any memory bank regardless of where it is running, but the cost
differs. A local memory bank is cheaper to access because it is usually directly
connected to the NUMA node where the task is currently running. A remote memory
bank costs more because it is further away from that NUMA node and some form of
synchronization is required to keep the accessed memory cache coherent from the
point of view of the running task. The application typically observes the
penalty of remote access as higher latency, higher CPU usage, and lower
throughput, among other effects.

The Linux kernel has a NUMA auto-balance feature that migrates pages to where
the application is currently running, using heuristics that try to reduce the
number of remote accesses. However, we have found that active participation
from the application yields better performance than the auto-balance feature
by itself.
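
To see whether the auto-balance feature is active on a given host, the kernel
exposes a sysctl knob. The following is a minimal sketch, assuming the knob is
available at ``/proc/sys/kernel/numa_balancing`` (it is absent on kernels built
without NUMA balancing). The function name and its optional path argument are
hypothetical and only exist for illustration and testing::

  # Report the state of the kernel NUMA auto-balance feature.
  # A value of 0 means disabled; any other value means some form of
  # balancing is enabled.
  numa_balancing_state() {
      knob="${1:-/proc/sys/kernel/numa_balancing}"
      if [ ! -r "$knob" ]; then
          echo "unavailable"
      elif [ "$(cat "$knob")" = "0" ]; then
          echo "disabled"
      else
          echo "enabled"
      fi
  }

  numa_balancing_state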

A key performance factor on a NUMA system is the balance of the hardware itself.
On an unbalanced NUMA system, a NUMA aware application typically keeps some
NUMA nodes busy while other nodes sit mostly idle. A common cause of partly
idle hardware is the presence of a single Network Interface Card (NIC). Having
one NIC per NUMA node helps prevent network operations from crossing an
expensive NUMA boundary.

On some systems it can make sense to make the application NUMA aware even at the
expense of some idle NUMA nodes. The question is whether the application will
perform better running on one NUMA node than being continuously migrated
between NUMA nodes in an unbalanced NUMA system.

One way to visualize the NUMA nodes of a system is to use the `lstopo` tool.
Another way is to check the verbose output of `lspci`. A third option is to
look for the `numa_node` virtual file under the sysfs subsystem.
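
As an illustration of the sysfs approach, the following sketch lists each
network interface with the NUMA node of its backing device. It assumes the
usual Linux layout where a physical NIC exposes
``/sys/class/net/<nic>/device/numa_node``. The function name and its optional
directory argument are hypothetical::

  # List each network interface together with the NUMA node of its
  # backing device. Virtual interfaces such as lo have no device
  # directory and are skipped. A value of -1 means the kernel reports
  # no NUMA affinity for the device.
  nic_numa_nodes() {
      net_dir="${1:-/sys/class/net}"
      for nic in "$net_dir"/*; do
          node_file="$nic/device/numa_node"
          if [ -r "$node_file" ]; then
              printf '%s %s\n' "$(basename "$nic")" "$(cat "$node_file")"
          fi
      done
  }

  nic_numa_nodes

If all NICs report the same node on a multi-node system, network traffic
handled on the other nodes has to cross a NUMA boundary.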

NUMA aware Varnish
==================

Varnish has a concept of thread pools where each thread pool contains a set of
worker threads. Thread pools make a natural abstraction layer for NUMA aware
Varnish, because shared memory access between NUMA nodes should be avoided
whenever possible. One way to ensure that tasks running in worker threads never
cross over to another NUMA node for task-scoped data structures is to colocate
memory pools with thread pools.

We assign a NUMA node to each thread pool in a round robin fashion. We also
internally keep the `thread_pools` parameter at or above the number of NUMA
nodes detected on the system. If `thread_pools` is configured higher than this
minimum, we assign more thread pools to each individual NUMA node. This gives
each NUMA node an equal number of thread pools, but the balance will be off if
`thread_pools` is not divisible by the number of NUMA nodes available on the
system.
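
The effect of a non-divisible pool count can be worked out by hand. The sketch
below is not Varnish code: it only mirrors the round robin description above,
under the assumption that pool *i* lands on node *i* modulo the node count,
starting at node 0. The function name is hypothetical::

  # Print how many thread pools land on each NUMA node when pools are
  # assigned round robin (assumed: pool i goes to node i % nodes).
  pools_per_node() {
      pools="$1"
      nodes="$2"
      node=0
      while [ "$node" -lt "$nodes" ]; do
          count=$(( pools / nodes ))
          # The first (pools % nodes) nodes receive one extra pool.
          if [ "$node" -lt $(( pools % nodes )) ]; then
              count=$(( count + 1 ))
          fi
          printf 'node%d: %d pools\n' "$node" "$count"
          node=$(( node + 1 ))
      done
  }

  pools_per_node 5 2

With 5 thread pools on 2 NUMA nodes, one node ends up with 3 pools and the
other with 2, whereas 4 pools would split evenly.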

In order to make the transition between the kernel and Varnish efficient,
we pin new sessions to the thread pool where the client arrived in the kernel.

It is not possible to eliminate all NUMA crossing scenarios. Data structures
shared between multiple tasks, such as the cache itself, are exempt from NUMA
node pinning. Likewise, NUMA awareness is not a criterion for book and store
selection in persisted caches with MSE. There are also background threads
outside of worker thread pools.


Running NUMA aware Varnish
==========================

To run NUMA aware Varnish, you need at least Varnish Enterprise 6.0.8r2.
However, our general recommendation is to run the latest stable release
whenever possible, as it includes the latest fixes and other improvements.

Part of the NUMA aware feature of Varnish Enterprise relies on a recent version
of the Linux kernel. Thus, the minimum supported platforms are:

- Debian 12 (Bookworm)

- Ubuntu 20.04 LTS (Focal Fossa)

- Red Hat Enterprise Linux 8 (RHEL, AlmaLinux)

To make Varnish NUMA aware, you need to enable two parameters:
`reuseport` and `numa_aware`. These parameters can only be set on the command
line and both parameters are immutable after that.

The `varnishd` command line will then look like this::

  $ varnishd -V
  varnishd (varnish-plus-6.0.8r2 revision 68c65b70a04ef588aa124a91207d78252452c381)
  Copyright (c) 2006 Verdens Gang AS
  Copyright (c) 2006-2021 Varnish Software AS
  $ varnishd -a :80 -f /etc/varnish/default.vcl -p reuseport=on -p numa_aware=on

We recommend that you perform testing before applying this change to all of
your servers. See below for metrics to look for.

NUMA Statistics
===============

One way to tell if a system is running optimally with regard to NUMA is to use
the `numastat(8)` tool to evaluate the page allocation performance.
A sample run could look something like this::

  $ numastat
                               node0         node1
  numa_hit                  76557759      92126519
  numa_miss                 30772308      30827638
  numa_foreign              30827638      30772308
  interleave_hit              106507        103832
  local_node                76502227      92086995
  other_node                30827840      30867162

The `numa_hit` counter should be high on each node while the `numa_miss` counter
should be low in comparison. Please check the manual page for `numastat(8)` for
details on the other counters. It is possible to look at the same metrics
through the sysfs subsystem::

  $ cat /sys/devices/system/node/node*/numastat
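
To compare the two counters at a glance, they can be turned into a ratio. The
sketch below reads the per-node sysfs files and prints ``numa_miss`` as a
percentage of ``numa_hit + numa_miss`` for each node. It assumes the files
contain one ``name value`` pair per line. The function name and its optional
directory argument are hypothetical, and the ratio is only an illustrative
indicator, not a threshold defined by Varnish::

  # For each NUMA node, print the share of pages allocated on that
  # node that were originally intended for another node:
  # numa_miss / (numa_hit + numa_miss), as a percentage.
  numa_miss_ratio() {
      node_root="${1:-/sys/devices/system/node}"
      for stat_file in "$node_root"/node*/numastat; do
          [ -r "$stat_file" ] || continue
          node=$(basename "$(dirname "$stat_file")")
          awk -v node="$node" '
              $1 == "numa_hit"  { hit = $2 }
              $1 == "numa_miss" { miss = $2 }
              END {
                  total = hit + miss
                  if (total > 0)
                      printf "%s %.2f%%\n", node, 100 * miss / total
              }' "$stat_file"
      done
  }

  numa_miss_ratio

The counters are cumulative since boot, so comparing two readings taken before
and after a test run gives a clearer picture than a single reading.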

For Intel CPUs, the Intel Performance Counter Monitor (Intel PCM) tool gives
finer details on how much traffic is traversing between the NUMA nodes at a
given time (have a look at the **pcm-numa.x** program):
https://github.com/intel/pcm

It is also possible to set up a Grafana dashboard to watch these metrics:
https://github.com/intel/pcm/tree/master/scripts/grafana

Similar tools may exist for other platforms.

SEE ALSO
========

* :ref:`varnishd(1)`
* :ref:`vmod_utils(3)`
* `numastat(8) <numastat(8)>`_
* `numactl(8) <numactl(8)>`_
* `lstopo(1) <lstopo(1)>`_


COPYRIGHT
=========

* Copyright (c) 2024 Varnish Software

* Author: Asad Sajjad Ahmed <asadsa@varnish-software.com>
* Author: Alve Elde <alve@varnish-software.com>
* Author: Dridi Boukelmoune <dridi@varnish-software.com>
