.. role:: ref(emphasis)

.. _varnish-mse(7):

===========
varnish-mse
===========

----------------------
Massive Storage Engine
----------------------

:Manual section: 7


Massive Storage Engine
======================

The Massive Storage Engine (MSE) is an advanced stevedore for Varnish
Cache Plus. The stevedore is the component that handles storing the cached
objects and their metadata, and keeping track of which objects in the
cache are most relevant, and which to purge if needed to make room for new
content. MSE adds several advanced features compared to standard
stevedores that ship with Varnish:

* Compact memory object structure

  MSE has a more compact object storage structure giving less storage
  overhead. This is most noticeable for small objects.

* Fair LRU eviction strategy

  When evicting content to make room for fresh content in the cache, the
  fetch task that does the eviction will be given priority to the space
  that it made available. This ensures that fetches do not fail due to
  other simultaneous fetch tasks stealing the space from under them.

* Large caches using disks to cache objects

  MSE can use disks as backing for object data, enabling cache sizes that
  are much larger than the available system memory. Content that is
  frequently used will be kept in memory, while less frequently used
  content will be read from disk instead of fetching from the backend.

* Persisted caches

  MSE will persist the disk stored objects, keeping the content in the
  cache between planned and unplanned restarts of the Varnish daemon.

* Memory Governor

  MSE features a mechanism that will automatically adjust the size of the
  cache according to the process memory consumption. This makes it easy to
  set up the right cache limits, and ensures the best utilization of the
  available memory even in shifting load conditions.


Configuration and usage
=======================

MSE uses a structured configuration file to describe the layout of the
devices to use for object storage. The syntax of the configuration file is
shown in the examples below.

The configuration structure is hierarchical. At the top level there is one
and exactly one `environment`, which configures the global sizes and rules
to use for memory held object fragments.

An environment by itself configures a non-persistent cache. This makes MSE
behave like a regular Varnish instance with a memory only cache, much like
when using the default `malloc` stevedore, while giving the benefits of
the compact object memory structure and fair LRU eviction.

To configure a persisted cache using disk for object storage, one or more
`books` with associated `stores` need to be configured in the
environment. The `books` contain metadata internal to MSE, while the
`stores` contain the object data used by Varnish.

Once the configuration file has been created, MSE can be enabled using the
`-s mse[,<path-to-config-file>]` option to the Varnish daemon.
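
As an illustration, a command line that points MSE at a configuration file
might look like the following. The listen address, VCL file name and
configuration file path are placeholders chosen for this sketch, not
required locations::

  $ varnishd -a :80 -f varnish.vcl -s mse,/var/lib/mse/mse.conf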


Persisted caching
=================

When books and stores are configured, the cached objects are also
persisted, keeping the content between restarts of the Varnish daemon.

The `book` is an embedded database that contains the necessary metadata
about the objects stored in the cache. This includes e.g. the hash values
for the objects, their associated TTLs and Vary matching information. The
`book` also contains the maps for where in the `store` the payload data
for the object resides, and the lists mapping free `store` space. Lastly
the book has one journal file to persist bans on the system, and one
journal file for each configured store to speed up metadata updates. All
the data files that make up the book are kept in a directory, and the path
to this directory is given in the configuration file.

Each `book` needs to have at least one `store` associated with it. The
`store` holds the object payload data, consisting of its attributes
(object headers, ESI instructions etc) and the object body. Each store
is a single large file in the filesystem, that can contain any number
of objects within it. New objects are assigned to a store on a round
robin basis.

Keeping books and stores configured and stored separately is useful
when the disks to use may not have the same IO capacity. It would be
advisable to e.g. keep the book on a fast SSD type of drive, while using
the larger but slower rotating disk for the store.
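
A minimal sketch of such a split layout is shown here. It assumes that a
fast SSD is mounted at `/var/lib/mse/ssd` and a large rotating disk at
`/var/lib/mse/hdd`; both mount points and all IDs are hypothetical::

  env: {
      id = "myenv";
      memcache_size = "auto";

      books = ( {
          id = "book1";
          # Metadata on the (assumed) SSD mount point.
          directory = "/var/lib/mse/ssd/book1";
          database_size = "1G";

          stores = ( {
              id = "store1";
              # Object payload on the (assumed) rotating disk mount point.
              filename = "/var/lib/mse/hdd/store1.dat";
              size = "1T";
          } );
      } );
  };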

The data files, both for books and stores, need to be initialized before
starting the Varnish daemon for the first time. This is done using the
bundled `mkfs.mse` utility, passing the configuration file as an option.
See the :ref:`mkfs.mse(1)` manpage for details.

All the data files are marked with a version marker identifying the on
disk data format. If this version marker does not match that of the
Varnish daemon, Varnish will refuse to start, and the files will have
to be recreated using the `mkfs.mse` utility with the `-f` force
option, clearing the cache in the process. If a new release of Varnish
Cache Plus comes with a new on disk format, the changelog entry will
clearly say so.
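
The two invocations might look as follows. This is only a sketch: the `-c`
option for passing the configuration file and the file path are assumptions
made here, so check :ref:`mkfs.mse(1)` for the exact syntax. The `-f`
option is the force option described above, and it clears the cache::

  # First-time initialization of the books and stores.
  $ mkfs.mse -c /var/lib/mse/mse.conf

  # Recreate the data files after an on disk format change.
  $ mkfs.mse -f -c /var/lib/mse/mse.conf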


Memory Governor
===============

MSE has a feature called the memory governor that will automatically
adjust the size of the cache in response to the memory usage of the
Varnish cache worker process.

Traditionally the way to manage memory usage in Varnish has been by
controlling how much memory to use for cache storage. That is an upper
limit on how many bytes of actual object payload data Varnish will keep in
memory, and once the limit is reached objects will be removed from the
cache by LRU (Least Recently Used) order to make room for new content.

This approach has several limitations. First it does not take the cost of
running the cache server itself into account. Each Varnish instance will
need memory space in order to run all of the cache worker threads (client
requests and backend fetches), scratch workspace regions for parsing and
constructing HTTP request and response data, as well as lookup trees and
other internal data structures.

Second, it does not take transient objects into account. Transient objects
are a subgroup of objects that are either short-lived (TTL of less than 10
seconds), Hit-For-Pass or Hit-For-Miss stub objects, or private objects
(cached request bodies and any active delivery pass objects). The typical
setup of Varnish will run without an upper bound on the amount of memory
that can be allocated for transient usage, requiring the main storage
limit to be lowered to accommodate the expected transient usage. (Note that
while it is possible to put a cap on Transient memory usage, that comes
with its own limitations.)

Third, the memory overhead will vary with the traffic patterns the Varnish
instance is handling. The request rate most Varnish servers handle will
not be the same during work hours and evenings, or on weekdays and
weekends. With a heavy request load, the amount of memory needed for
request handling and general overhead will increase, and likely also the
transient usage.

The end result is that configuring the right cache size becomes a case of
trial and error in order to dial in the right setting for each Varnish
instance, and the trial period can typically be a week long in order to
get a good baseline usage.

The memory governor allows the system administrator to take a new approach
to configuring the way Varnish uses memory. Instead of configuring the
amount of memory to be spent on cache payload data and making sure to set
it low enough to accommodate the expected overhead from the above, one
specifies the amount of memory the Varnish cache worker process itself
should consume. Varnish will then self-regulate, increasing or decreasing
the cache size in order to keep the memory usage constant.


Enabling the memory governor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The memory governor is a component that sits partly in core Varnish Cache
and partly in the MSE stevedore, where the core part measures and reports on
memory usage, while the MSE part reacts and adjusts the cache
size. Because of this relationship, there is a requirement that in order
to enable the memory governor, there can only be a single stevedore
instance configured (only one `-s` argument on the `varnishd` command
line), and that needs to be of type MSE. In persisted MSE setups one can
continue to use multi-book and multi-store MSE configurations, and use the
MSE VMOD to route objects to specific stores internally in MSE, but there
needs to be one and only one instance of MSE configured in the Varnish
daemon.

To enable the memory governor, one simply sets the `memcache_size` key in
the MSE configuration file (top level environment section) to "auto". This
instructs MSE to take the memory allowance limits from the memory
governor.

The target memory consumption can be configured by setting the
`memory_target` Varnish runtime parameter to the required memory
size. This can be set at startup by adding a `-p
memory_target=<memory-usage>` option on the `varnishd` command line. The
parameter may also be adjusted at runtime using `varnishadm`. The new
target value will take effect immediately, but when lowering the target it
may take a moment before the new target is reached.
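
For example, the target could be changed and then inspected on a running
instance as sketched below. The value `64G` is purely illustrative and
assumes that the parameter accepts the usual Varnish size suffixes::

  $ varnishadm param.set memory_target 64G
  $ varnishadm param.show memory_target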

Note that in order for the memory governor to work, the kernel must
provide certain memory statistics. At startup a test is done to make sure
that the necessary per-process statistics are present and readable through
`/proc/<pid>/status`, and Varnish will refuse to start if the information is
missing. The required status fields are VmRSS, RssAnon, RssFile and
VmSwap.
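
The presence of these fields can be checked by hand before deploying. The
following shell sketch is not the test Varnish itself performs. It reads
`/proc/self/status`, which describes the `grep` process rather than
`varnishd`, on the assumption that the kernel exposes the same set of
fields for every process::

  report=$(
      for field in VmRSS RssAnon RssFile VmSwap; do
          # Each field appears as "<name>:" at the start of a line.
          if grep -q "^${field}:" /proc/self/status; then
              echo "${field}: present"
          else
              echo "${field}: MISSING"
          fi
      done
  )
  echo "$report"

Any field reported as `MISSING` suggests that the kernel does not provide
the statistics the memory governor needs.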


Memory sizing
~~~~~~~~~~~~~

When the memory governor is in effect, what Varnish regulates is the
process' private memory usage, that is, the Varnish process' virtual memory
mappings without any file backing (sometimes called anonymous memory
maps). This means that open files, for example the Varnish shared memory
log and the Varnish counters, have zero cost from the governor's point of
view, and the cache will not be purged in an attempt to undo that kind of
memory usage. This is because the memory used by these files belongs not
directly to the process, but to the kernel's page cache. The kernel has its
own algorithms that control the page cache and what parts of files to keep
in memory and which to drop, based on the memory pressure the system as a
whole is experiencing. The memory governor makes sure that the memory
pressure the Varnish process adds to the system is constant, but leaves it
to the kernel to handle the page cache and what file content is best kept
in memory using the leftover space.

When setting the memory target, one needs to take into account the base
memory cost of the rest of the system. If there are multiple services
running on the system host (e.g. the Hitch TLS proxy), the space they will
need must also be taken into account.

When running with persisted MSE, the page cache space of the books should
also be taken into consideration. The books that contain the metadata
about the persisted objects are written using regular file IO, and the
available page cache memory will allow the kernel to speed up access to
this data. How much of a speed-up, and the threshold level where further
page cache space does not give any benefit, will depend on the traffic
pattern, so some experimentation may be required. (Note that MSE's store
files that contain the actual object payload data are accessed using
direct IO directly into the private memory space of the worker process,
and thus bypass the kernel's page cache.)

As a baseline safe default, we recommend setting the `memory_target`
parameter at 80% of the available system memory. This may then be further
tuned at runtime if necessary.
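
The arithmetic behind that baseline can be sketched in shell. This sketch
takes `MemTotal` from `/proc/meminfo` as the available system memory and
assumes that `memory_target` accepts a value with an `M` suffix. Both are
assumptions made for illustration::

  # MemTotal is reported in kB in /proc/meminfo.
  mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)

  # 80% of the total, rounded down to whole megabytes.
  memory_target_mb=$(( mem_total_kb * 80 / 100 / 1024 ))

  echo "-p memory_target=${memory_target_mb}M"

The printed option can then be added to the `varnishd` command line. On a
host that also runs other services, a lower percentage may be more
appropriate.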


Transient objects under Memory Governor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Under the memory governor, Varnish' handling of transient objects will
change slightly. Without the governor, there would be a dedicated
stevedore of type `malloc` to handle this, keeping the transient objects
separated from non-transient objects. With the governor enabled, all
objects are handled by the single MSE stevedore instance, including the
transient objects. The dedicated counter segment for the Transient
stevedore disappears. Also the transient objects will be subject to LRU
eviction when needed, just like regular objects.

When using MSE in a persisted configuration, transient objects will become
memory-only objects, meaning they will not be persisted.


Using MSE with SELinux
======================

If SELinux is enabled, the MSE `books` and `stores` need to be located on
a path that the SELinux policy allows the Varnish daemon to access. The
policy shipped with the packages enables the `/var/lib/mse/*` path for
this purpose. If your `stores` are on separate drives, you will need to
mount those drives below that path.


Example memory only configuration
=================================

The following is an example configuration for a memory only cache using
100 GB of memory to hold cached objects::

  env: {
	id = "myenv";
	memcache_size = "100G";
  };


Example memory governor configuration
=====================================

This example configuration file sets up a non-persisted MSE with memory
governor::

  env: {
	id = "mse";
	memcache_size = "auto";
  };

The memory usage will be controlled by the `memory_target` runtime
parameter.

As a convenience shortcut it is possible to get the exact MSE
configuration shown above (with the environment ID set to "mse") and
memory governor enabled by omitting the MSE configuration file
completely. The `varnishd` command line will then typically look like::

  $ varnishd -a :80 -f varnish.vcl -s mse -p memory_target=<size>


Example with 2 books each holding 2 stores with memory governor
===============================================================

The following example demonstrates how to configure multiple `books` each
holding multiple `stores`. The memory governor is enabled, and Varnish
will adjust the memory space used for holding frequently accessed content
automatically. There will be 2 books, each configured for 1 GB of metadata
space. Each book has 2 stores, each holding 1 TB of object data::

  env: {
	id = "myenv";
	memcache_size = "auto";

	books = ( {
		id = "book1";
		directory = "/var/lib/mse/book1";
		database_size = "1G";

		stores = ( {
			id = "store-1-1";
			filename = "/var/lib/mse/stores/disk1/store-1-1.dat";
			size = "1T";
		}, {
			id = "store-1-2";
			filename = "/var/lib/mse/stores/disk2/store-1-2.dat";
			size = "1T";
		} );
	}, {
		id = "book2";
		directory = "/var/lib/mse/book2";
		database_size = "1G";

		stores = ( {
			id = "store-2-1";
			filename = "/var/lib/mse/stores/disk3/store-2-1.dat";
			size = "1T";
		}, {
			id = "store-2-2";
			filename = "/var/lib/mse/stores/disk4/store-2-2.dat";
			size = "1T";
		} );
	} );
  };


Example of non-homogeneous setup with a default storage selection
=================================================================

This example is similar to the previous, but the stores are of
different sizes. In addition, a default set of stores is selected
through the `default_stores` parameter.

The two books in the configuration are of the same size, while their
stores' sizes differ by an order of magnitude. Although using spinning
disks is often a bad idea, it can be a good option if you have many huge,
uncommonly requested files that you want to cache.

In the example, imagine that the second book contains two stores on
spinning disks, while the first book uses faster SSD/NVMe drives. By
default, through the `default_stores` parameter, only the first book's
stores will be selected during object insertion::

  env: {
	id = "myenv";
	memcache_size = "100G";

	books = ( {
		id = "book1";
		directory = "/var/lib/mse/book1";
		database_size = "1G";

		stores = ( {
			id = "store-1-1";
			filename = "/var/lib/mse/stores/disk1/store-1-1.dat";
			size = "1T";
		}, {
			id = "store-1-2";
			filename = "/var/lib/mse/stores/disk2/store-1-2.dat";
			size = "1T";
		} );
	}, {
		id = "book2";
		directory = "/var/lib/mse/book2";
		database_size = "1G";

		stores = ( {
			id = "store-2-1";
			filename = "/var/lib/mse/stores/disk3/store-2-1.dat";
			size = "10T";
		}, {
			id = "store-2-2";
			filename = "/var/lib/mse/stores/disk4/store-2-2.dat";
			size = "10T";
		} );
	} );
	default_stores = "book1";
  };

It is possible to use `vmod_mse` to override the store selection on
each individual backend request, and this will be the only way to get
objects into `book2` with the above configuration.
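
A VCL fragment that performs such an override could look like the sketch
below. The `Content-Type` condition is a hypothetical rule invented for
this illustration, and the fragment assumes that `mse.set_stores()` is
called from `vcl_backend_response`. See :ref:`vmod_mse(3)` for the
authoritative description::

  import mse;

  sub vcl_backend_response {
      # Hypothetical rule: keep large video files on the slower,
      # larger stores of book2.
      if (beresp.http.Content-Type ~ "^video/") {
          mse.set_stores("book2");
      }
  }

Responses that do not match the condition keep the selection given by
`default_stores`, which is `book1` in the configuration above.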


Tags and an example of how they are used
========================================

It is possible to attach tags to books and individual stores, and use
these to select which stores to use. Consider the following, somewhat silly,
example::

  env: {
	id = "myenv";
	memcache_size = "100G";

	books = ( {
		id = "book1";
		directory = "/var/lib/mse/book1";
		database_size = "1G";
		tags = "red";

		stores = ( {
			id = "store-1-1";
			filename = "/var/lib/mse/stores/disk1/store-1-1.dat";
			size = "1T";
		}, {
			tags = ( "orange", "store-2-1" );
			id = "store-1-2";
			filename = "/var/lib/mse/stores/disk2/store-1-2.dat";
			size = "1T";
		} );
	}, {
		id = "book2";
		directory = "/var/lib/mse/book2";
		database_size = "1G";
		tags = ( "pink", "red" );

		stores = ( {
			id = "store-2-1";
			filename = "/var/lib/mse/stores/disk3/store-2-1.dat";
			size = "1T";
			tags = "green";
		}, {
			id = "store-2-2";
			filename = "/var/lib/mse/stores/disk4/store-2-2.dat";
			size = "1T";
			tags = ( "blue", "book1", "red" );
		} );
	} );
	default_stores = "none";
  };

The example above is equal to the second example, but tags have been
added on both the books and the stores, and `default_stores` is set to
the special value `"none"` (indicating that objects should be *memory
only* by default, as described in the `vmod_mse` manual).

Tags can be specified either as a single string, or as a list of
strings, and they can be applied to books and stores.

When a set of stores is selected, either by using `default_stores` or
`vmod_mse`, the string will be matched against book names, store
names, and tags. In the example above, `mse.set_stores("red");` will
select all stores, since both books have been tagged `"red"`. Even
though `store-2-2` is tagged `"red"` twice (in the book and the store
itself), it will not be chosen twice as often as the other `"red"`
stores.

There is no discrimination of names and tags, so
`mse.set_stores("book1");` will select all the stores in book1, and
store-2-2, since this store has the `"book1"` tag. Similarly,
`mse.set_stores("store-2-1");` will select two stores, one because of
a matching name, and the other because a tag matches.

Read more about store selection in `vmod_mse(3)`, where it is also explained
how stores are weighted after selection, and how to change it.
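
The matching rules can be illustrated with a small VCL sketch written
against the tagged configuration above. The URL patterns and the backend
definition are hypothetical, and the sketch assumes that
`mse.set_stores()` is called from `vcl_backend_response`::

  vcl 4.1;

  import mse;

  backend default {
      .host = "127.0.0.1";
      .port = "8080";
  }

  sub vcl_backend_response {
      if (bereq.url ~ "^/archive/") {
          # "green" is a tag on store-2-1 only.
          mse.set_stores("green");
      } else if (bereq.url ~ "^/static/") {
          # "red" is a tag on both books, so all four stores match.
          mse.set_stores("red");
      }
      # Other objects keep default_stores = "none", that is, memory only.
  }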


Degraded mode and configuration example
=======================================

It is possible to enable fault tolerance for an MSE environment, in which
case it may start with a subset of its books and stores. In the event of a
device failure, misconfiguration, or any other reason that would result in
a book or a store not successfully opening during startup, MSE can ignore
these failures and proceed in degraded mode.

If an environment is successfully loaded, but corrupted, a fault may occur
after startup. In this case Varnish tries to catch the error and cache it
on disk for the next startup. This cache is a directory containing one text
file per MSE environment that can be edited to unregister failed books and
stores after restoring them to a pristine state.

Configuring a fault-tolerant environment can be done like this::

  env: {
	id = "my-degradable-env";
	memcache_size = "auto";
	degradable = true;
	degradable_cache = "/var/lib/mse/degradable_cache";

	# define books and stores
  };

MSE3 was not initially designed with fault tolerance, so failures will first
manifest as panics. As the cache process crashes, the manager process might
cache the MSE error before restarting a new cache process. Soon Varnish is
ready to serve traffic, with a degraded persistent storage capacity.


Configuration key flags
=======================

Some configuration keys have flags associated with them. Their meanings
are listed below.

*required*
	The configuration key is required, and it is an error to not
	specify it.

*persisted*
	The configuration key is persisted, and can not be changed without
	recreating the book.


Configuration key types
=======================

The following types of configuration keys exist, each listed with its
expected value format.

*id*
	An identification string, maximum 16 characters long. Give the
	value in double quotes.

*bytes*
	Value identifies a byte count. Give the value in double
        quotes. Accepts the suffixes k, m, g, t and p.

*string*
	Regular string. Give the value in double quotes.

*bool*
	Boolean value. Accepts true or false, without quotes.

*double*
	Floating point number. Give the value without quotes.

*unsigned*
	Unsigned integer value. Give the value without quotes.
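
The quoting rules can be seen side by side in the following fragment,
which only uses keys that appear in the examples in this manual. The type
noted in each comment is inferred from those examples. The parameter lists
below are authoritative::

  env: {
      # id: quoted, at most 16 characters
      id = "myenv";

      # bytes: quoted, with a size suffix
      memcache_size = "100G";

      # bool: unquoted
      degradable = true;

      # string: quoted
      degradable_cache = "/var/lib/mse/degradable_cache";
  };

Values of type *double* and *unsigned* are likewise written without
quotes, for example `0.5` or `4`.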


Environment configuration parameters
====================================

.. include:: ../include/mse-params-env.rst


Book configuration parameters
=============================

.. include:: ../include/mse-params-book.rst


Store configuration parameters
==============================

.. include:: ../include/mse-params-store.rst


SEE ALSO
========

* :ref:`varnishd(1)`
* :ref:`mkfs.mse(1)`
* :ref:`vmod_mse(3)`


COPYRIGHT
=========

* Copyright (c) 2018 Varnish Software

* Author: Martin Blix Grydeland <martin@varnish-software.com>
* Author: Dridi Boukelmoune <dridi@varnish-software.com>
