.. role:: ref(emphasis)

.. _mse4-persisted(7):

==============
mse4-persisted
==============

-------------------------
Persisted caching in MSE4
-------------------------

:Manual section: 7

Persisted caching in MSE4
=========================

MSE4 offers the possibility of storing cached content not only in memory,
but also on disk. This makes it possible to extend the cache size of the
system beyond the available system memory. Upon system restarts the
on-disk content will be kept and made available again without having to
fetch it from the backend. This mode of operation is referred to as
persisted caching. The other option, where content is only kept in memory
and never written to disk, is referred to as ephemeral caching. This
document will also use the terms persisted objects and ephemeral objects,
referring to objects that have a disk backing and those that don't.

When Varnish responds to client requests by delivering cached data, that
data is always delivered from memory. This is true for both ephemeral and
persisted caching. Where persisted caching differs is that for an object
known to be on disk, memory buffers are allocated and filled by reading
from disk rather than by getting the bytes from a backend. Once the
buffers are in memory, they are used mostly in the same way as for
ephemeral caching. This means that the data will be shared among all
client requests requesting the same data, and only a single copy is kept
in memory regardless of how many simultaneous clients are using it.

When a persisted object first enters the cache through a backend fetch,
memory buffers are allocated as needed to hold the content. Like for
ephemeral caching, the buffer bytes received from the backend are
immediately made available for use by any client requests streaming the
data. This means that the use of persisted caching does not impose an IO
delay for writing the data to disk before it can be used to deliver
cached content. Once the buffers are completely filled with data from the
backend, the disk writeout is done asynchronously.

Reading data back into memory from disk is done on demand. When it is
found that a byte belonging to a persisted object is not in memory, a
memory buffer for a byte range that covers the requested byte is allocated
and filled by reading from disk. Once the buffer has been filled, any
waiting delivery tasks will be notified and they can resume the delivery.

The memory buffers used to hold persisted objects will be kept in memory
for as long as possible, even if there are no active client connections
requesting the data. It is only when the available memory is getting low
and space needs to be made to hold other content that the buffers can be
evicted. The algorithm to choose what content to evict is a variant of
Least Recently Used (LRU), and ephemeral objects and persisted memory
buffers alike can be chosen. If a persisted memory buffer is found to be
the least recently used and chosen for eviction, only that buffer itself
is evicted. The other object buffers will stay in memory until they happen
to be chosen as the eviction candidate.

Between the on-demand read from disk mechanism and the least recently used
buffer eviction, only the specific parts of an object that are actually
requested are kept in memory. This means that, for example, large objects
that only see demand for the very beginning of the object will not
require the entire object to be held in memory.


File devices
============

File device is the term used to describe the large files that MSE4 uses
to store information on disk. It refers to the actual file itself
residing in the file system, and not the drive or device that holds the
file system data.

The books and stores are the file devices that are used to store persisted
objects to disk. The books hold the meta data about the objects, while the
stores hold the actual object data, including object headers and the
object body bytes.

The books and stores are kept in separate files to provide flexibility in
how the data is laid out on the available drives. If the system has a
heterogeneous set of drives where some are faster than others, it is
recommended to provision the system so that the books are kept on the
faster drives and the stores on the slower drives. If the drives are all
equal, it is recommended to have one book on each drive, and then any
number of stores as required on each drive using that book for meta data.

The Book
~~~~~~~~

The book is a type of custom database developed specifically for MSE4 to
provide fast and consistent updates to the set of objects in the cache,
while minimizing the amount of IO needed in order to record changes to the
set of objects. All data in the book is checksummed for consistency, and
any updates are journaled to keep the order of operations consistent.

For each object, the book stores all of the meta data that Varnish needs
in order to figure out if the object matches a client request. This
includes the object hash value and Vary match data, as well as the object
lifetime parameters (time-to-live, grace period etc.). In addition it
stores any Ykeys associated with the object, and finally a list of byte
offset, length and checksum triplets that shows where in the store the
actual headers and object body data are stored.

The book database is comprised of a number of fixed size slots. A slot is
wide enough to describe one persisted object in the cache, with a
combined maximum of 4 data chunks or Ykeys. Slots will be chained together
to describe larger objects or objects with many Ykeys, with each
additional chained slot increasing the number of data chunks or Ykeys
described by 9.

The layout of the book and its slot capacity is determined at the time the
file device is created, and is influenced by several configuration
settings. The maximum number of slots a given book can hold can be
queried by giving the `headers` command to `mkfs.mse4` and looking at the
`maxslots` key.
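
For example, assuming the MSE4 configuration file resides at the
hypothetical path ``/etc/varnish/mse4.conf``, and assuming the `headers`
command takes the same `-c` argument as the `configure` command, the slot
capacity could be inspected with::

   $ mkfs.mse4 -c /etc/varnish/mse4.conf headers

Look for the `maxslots` key in the output for each book.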

Each book can provide meta data storage for up to 16 stores. The set of
slots available is shared among all of the stores the book is managing.

To speed up operations and avoid blocking on IO in critical data paths,
the entire book slot table is kept in memory at runtime. The IO generated
by the book after startup will only be writes to record the slot table
changes.

The Store
~~~~~~~~~

The store is where the actual payload bytes of objects are stored. This
includes the object HTTP headers, the object body bytes and any auxiliary
attributes stored with the objects. All of the store bytes for an object
are always kept in the same store.

The data stored does not contain any structure information; it is just a
series of byte chunks of varying length stored consecutively in one large
file. All of the data needed to stitch an object back together, as well as
the checksums for the data chunks of the store, is kept in the book.


Configuring MSE4 for use with persisted caching
===============================================

To enable persisted caching, the file devices in which the persisted
objects will be stored need to be defined. This is done in an MSE4
specific configuration file.

The configuration file declares the file devices to be used, their sizes
and their location in the file system. The books and stores are listed
hierarchically in the configuration, where the stores sharing a book for
meta data storage are listed under the book configuration.

Please see :ref:`mse4-config(7)` for information and an example of how to
structure the configuration file.

The Varnish daemon needs to be configured to use MSE4, and the path to the
configuration file specified. To do this, give a single "`-s
mse4,<path-to-mse4-configuration>`" argument to the Varnish daemon.
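
For example, a minimal Varnish daemon invocation could look like the
following, where the file paths are hypothetical::

   $ varnishd -a :80 -f /etc/varnish/default.vcl \
        -s mse4,/etc/varnish/mse4.conf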

Creating the file devices
~~~~~~~~~~~~~~~~~~~~~~~~~

Before Varnish and MSE4 can start using the file devices, they need to be
created. For this purpose a special utility program called `mkfs.mse4` is
provided.

As a one-time operation before starting the Varnish daemon, execute this
command:

::

   $ mkfs.mse4 -c <path-to-configuration-file> configure

Please see :ref:`mkfs.mse4(1)` for more information about the `mkfs.mse4`
utility.


Object Creation
===============

When a new object enters the cache, MSE4 will need to make a decision on
how to store the content, whether to persist it and if so to which store
it should be written.

Only regular cached content can be persisted. Special objects like
``hit-for-pass`` and ``hit-for-miss`` will always become
ephemeral. Temporary objects that are created to hold request bodies or to
handle passes are also always ephemeral.

The VCL program can also set whether to attempt to persist the object or
not. The vmod ``mse4`` has a function called `set_storage()` for this
purpose. See :ref:`vmod_mse4(3)` for more information.
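
As a sketch, a VCL program could direct objects from
``vcl_backend_response``. The store name and the exact argument accepted
by `set_storage()` are assumptions made for illustration here; see
:ref:`vmod_mse4(3)` for the authoritative signature::

   vcl 4.1;

   import mse4;

   sub vcl_backend_response {
       # Hypothetical: persist video objects to the store named
       # "store1" (the name is assumed for illustration).
       if (beresp.http.Content-Type ~ "^video/") {
           mse4.set_storage("store1");
       }
   }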

Content Category
~~~~~~~~~~~~~~~~

The set of stores an object can be persisted to is determined by the
object's assigned content category. Each category has a list of stores
assigned, and a store can only be assigned to one category. 

Note that when there are no category definitions in the MSE4 configuration
file, a default root category is created that has all of the configured
stores automatically assigned to it.

If the object's content category does not list any assigned stores, the
object becomes ephemeral.

See :ref:`mse4-categories(7)` for more information on how to configure and
use content categories.

Store Selection
~~~~~~~~~~~~~~~

When the assigned object category has multiple stores to choose from, one
is selected based on the configured store selection algorithm. The
algorithm to use is determined by the value of the `store_select` key in
the selected category's configuration, or, if not specified, by the
environment level configuration key `default_store_select` (which defaults
to `smooth`).

When running the store selection algorithm, only stores in state
``ONLINE`` are considered. Stores in any other state are ignored, and the
algorithm is executed as if they were not part of the set. If none of the
stores in the set are ``ONLINE``, the object becomes ephemeral. See the
"Runtime drive failures" section below for more information about store
state.

The possible values for `store_select` are:

* `smooth` (default)

  The ``smooth`` algorithm gives weights to the possible stores by the sum
  of the available free space in the store and the size of the store. The
  store is then selected by a random draw from the weighted stores. The
  ``smooth`` algorithm favours a large and mostly empty store to ensure
  that it is filled efficiently when empty, and once all stores are filled
  it transitions to a ``size`` based store selection.

* `size`

  The ``size`` algorithm gives weights to the possible stores by the size
  of the store, and a store is then selected by a random draw from the
  weighted stores. This makes the store selection probability be according
  to the size, making a store that is twice the size of another be
  selected twice as often.

* `available`

  The ``available`` algorithm gives weights to the stores according to how
  much free space is available in the store. This gives priority to the
  store with the largest amount of free space.

* `round-robin`

  The ``round-robin`` algorithm will simply choose a store in a round
  robin fashion.
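
As an illustration only, a category could override the algorithm through
its `store_select` key. The key name and values above come from this
manual, but the surrounding configuration syntax is an assumption; refer
to :ref:`mse4-config(7)` for the authoritative format::

   store_select = "size";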


Bans and persisted caching
==========================

Banning is a cache invalidation mechanism where an administrator issues a
statement which is tested against each and every object in the cache. If
the statement evaluates as `true`, the object will be invalidated (removed
from the cache). It is a very flexible system for cache invalidation, but
it does come at a rather high cost.

So that issuing a new ban statement does not block while testing each and
every object, the actual execution of the ban statement is done in a lazy
manner. This means that the statements are simply added to the cache
without doing any searching and matching, and the actual execution of the
ban statement is then done as a background task. During cache look-up,
any object will first have any not yet matched ban statements applied to
it, and will be invalidated at that time if one of them matches.

For this scheme to work with persisted caches, it becomes necessary to
also persist any ban statements that haven't yet been tested against all
of the cached objects. For this purpose, each book sets aside an area to
store the most recently issued ban statements on disk. The size of this
area is controlled by the "``banjournal_size``" book parameter.


Persisted cache bootstrapping
=============================

When the Varnish daemon starts up using MSE4 configured for persisted
caching, a bootstrap process is performed. The slot table from each
configured book will be read into memory, and all of the slots verified by
checksum for consistency. Any valid objects described will be added to the
main Varnish look-up tree so that the object's presence is known during
cache look-ups.

Invalid slots found during the bootstrap will be cleared and reclaimed
before normal cache operations commence. Invalid slots could for example
be the result of an unclean shutdown of the system and subsequent aborted
IO operations.

The bootstrap process may also identify any number of invalid objects. The
slots they occupy will also be cleared and reclaimed. Reasons for
invalidating an object during the bootstrap include:

* The lifetime parameters indicate that the object is too old. This can
  happen when the object expired while Varnish was not running.

* The object's ban timestamp is invalid. This can happen if the book's ban
  journal is too small to keep all of the active bans in the system. Since
  there is no way of knowing whether the object would've been invalidated
  by one of the missing bans, the object is removed as a precaution.

* The store that holds the object's cache data is offline, missing or
  otherwise not available. When the store is not accessible, several
  Varnish mechanisms like cache invalidations and bans will not be applied
  to the affected objects. If the store is then made accessible again on a
  later restart, objects that would've been invalidated may resurface. To
  safeguard against this, inaccessible objects are invalidated during the
  bootstrap.

At the end of the bootstrap process, a message is given in ``syslog``
detailing how many objects were removed from the cache. XXX: This is not
yet implemented.


Store checksum verification
===========================

All of the store data can be checksummed upon first being written to the
stores, and then verified when read back. The checksums for the content
are stored in the book.

This functionality is enabled or disabled by the "``write_checksum``" and
"``verify_checksum``" store configuration parameters. By default checksum
verification is turned on.

When data is read back into memory from the store and the checksum
verification fails, the object will automatically be evicted from the
cache. This ensures that no other client request can successfully make a
cache hit on that object from that point in time, and subsequent attempts
at getting the content will cause a cache fetch from the backend.

However, because content is only read back into memory on demand, a
delivery process may be well under way, with the beginning of an object
already delivered to the client, by the time the checksum verification
fails on a later chunk of the object. It is then too late to communicate
the bad object to the client, and the only option left is to cause a
delivery failure. Clients may experience short reads and a forced
connection close in this situation. Note that it is ensured that no byte
from the invalid read will ever be communicated to the client.


Persisted object eviction
=========================

When either the book runs out of slots or the store is full, content will
be evicted from the cache to free up resources.

Book free slot management
~~~~~~~~~~~~~~~~~~~~~~~~~

When new persisted objects are added to the cache, free slots are needed
in the store's book in which to store the object meta data.

Every book holds a reserve of free slots ready to be handed out. The size
of this reserve is controlled by the `slot_reserve` book configuration
key.

When the reserve runs low, a background task frees up slots by evicting
the least recently used object from among all of the objects stored in
the book.

Store free space management
~~~~~~~~~~~~~~~~~~~~~~~~~~~

When new persisted objects are added to the cache, free space in the store
needs to be assigned. For efficiency reasons, the available free space is
not mapped in its entirety. Rather, a background task exists to fill a
reserve with known free space regions meeting a minimum requirement for
contiguous byte ranges. The fetch tasks grab byte ranges from the reserve
to assign them to new persisted objects.

The goal for this background task is controlled by the `reserve_size`
store configuration key. Upon reaching this goal the task will sleep, and
be woken again when the reserve drops below the goal. The available
reserve is shown in the `MSE4_STORE.<store-id>.g_reserve_bytes` VSC
counter.

When the background task fails to meet its goal through natural decay
alone (objects being deleted due to their lifetime settings), persisted
content will be evicted from the cache to make new space.

When evicting content to free up store space, it is necessary to work
towards creating contiguous free areas to avoid store space
fragmentation. To achieve this, the space is considered a segment at a
time in a round robin fashion, with the segment size determined by the
`segment_size` configuration key. All objects that have one or more
allocations in that segment and are not frequently accessed are evicted.


Runtime drive failures
======================

If a drive IO error is reported at runtime, MSE4 is able to take a book or
store offline without causing system downtime. The only disruption of
service will be that the objects that were hosted on the affected book or
store will be removed from the cache.

The IO errors that are monitored and will cause the file devices to be
taken offline are any error codes reported back to MSE4 from the operating
system during IO operations.

When multiple file devices are hosted on the same drive, a failure of one
will not automatically take the others offline. However, when a book
device fails it will also fail all of the stores that use that book for
meta data.

When a file device fails, MSE4 will immediately cease all IO operations to
the device. All of the objects hosted on the device will be
invalidated. Any active delivery tasks using any object being invalidated
may have to fail the delivery as a result. Clients may experience short
reads and a forced connection close in this situation.

After a file device has failed and become offline, an administrator may
reset the device and bring it back up into service again. This would
presumably be done after having executed a hot swap of the affected drive
to replace the faulty hardware. When a file device reset is performed, the
device will always come back into service empty. That means that any
cached objects that were on the file device will have been lost.

File device state
~~~~~~~~~~~~~~~~~

A file device will be in one of three states at runtime. The states are:

* ONLINE

  When a file device is in state ``ONLINE``, it operates normally.

* FAILING

  Whenever a file device fails, it will first transition from ``ONLINE``
  to the ``FAILING`` state. While in this state, active clean up tasks are
  still being performed on the cached objects that this device holds. MSE4
  will still be holding file descriptors open on the device during this
  period, and an administrator should ensure that no file devices on an
  affected drive are in this state before hotplugging the drive.

  A file device in the ``FAILING`` state cannot be reset.

* OFFLINE

  Once the clean up tasks have been completed, the file device will
  transition from the ``FAILING`` state to the ``OFFLINE`` state. At this
  point in time, MSE4 will have closed all of the file descriptors to the
  file device, and it is safe to hotplug the drive.

  Once a file device has reached the ``OFFLINE`` state, it is possible to
  reset the device. This will reinitialize the file device and bring it
  back to the ``ONLINE`` state.

The current state of the file devices can be queried by using the CLI
command `mse4.status`.

A boolean ``online`` status is also reported through the ``varnishstat``
counters ("``MSE4_BOOK.<book-id>.online``" and
"``MSE4_STORE.<store-id>.online``"). This counter value will be `1` when
the file device state is ``ONLINE``, and `0` otherwise.
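
For example, the state could be checked from the command line as follows,
assuming a default Varnish setup where `varnishadm` can reach the running
daemon::

   $ varnishadm mse4.status
   $ varnishstat -1 -f "MSE4_BOOK.*.online" -f "MSE4_STORE.*.online"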

The MSE4 statelog
~~~~~~~~~~~~~~~~~

MSE4 uses a statelog file to record state changes to the file
devices. This log resides by default in
`/var/lib/mse/<hostname>.mse4_statelog`. The statelog file serves two
purposes. The primary use is to initialize a file device's assumed state
after a restart of Varnish, to make sure that a failed drive stays in the
failed state until the drive has been replaced. Secondly, it serves as a
log of the IO errors that have happened on the device.

The log is an ASCII file, and records any state change that has happened
to a file device. For each file device, the very last state change
recorded will be the state assumed after a restart.

Manual device failure
~~~~~~~~~~~~~~~~~~~~~

The administrator can induce a file device failure manually. This can be a
useful tool if e.g. ``S.M.A.R.T.`` monitoring is indicating that a drive
is starting to experience issues, and a proactive drive replacement needs
to be performed.

To cause a file device to fail, execute the CLI command "`mse4.fail
<book-or-store-id>`". This will have the same effect as if an IO operation
on the file device reported an IO error.
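
For example, to fail a store with the hypothetical id ``store1``::

   $ varnishadm mse4.fail store1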

Drive hotplugging
~~~~~~~~~~~~~~~~~

When a file device reaches the ``OFFLINE`` state, MSE4 will have closed
all of its file descriptors pointing to the file device. This makes it
possible to safely unmount the file system upon which the file resides,
and to perform a hotplug of the drive.

Note that the administrator should make sure that all of the MSE4 file
devices residing on a given drive are in the ``OFFLINE`` state before
unmounting the file system.

The instructions for how to hotplug a drive are out of scope for this
manual.

After a fresh drive has been installed, a file system needs to be created
on the drive, and the drive remounted.

File device reset
~~~~~~~~~~~~~~~~~

An MSE4 file device in the ``OFFLINE`` state can be reset at runtime to
bring it back to the ``ONLINE`` state.

To do this, execute the CLI command "`mse4.reset <book-or-store-id>`".

A store device cannot be reset if its book device is not ``ONLINE``. If
resetting both a book and its stores, reset the book first.

The reset command will recreate the file device from scratch, using the
configuration parameters as described in the MSE4 configuration file at
the time the Varnish daemon was started. If a file of the same name
already exists, it will first be deleted.

A successful reset will mark the file device as ``ONLINE`` in the
statelog.
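
For example, to reset a failed book and one of its stores, reset the book
first (the ids ``book1`` and ``store1`` are hypothetical)::

   $ varnishadm mse4.reset book1
   $ varnishadm mse4.reset store1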


File device resizing
====================

The Varnish daemon will on startup print a warning message to syslog if
the actual file devices presented to it differ in size and layout from
what the configuration suggests. It will start up as normal though, and
use the applicable settings (file and journal sizes) as stored in the file
devices, disregarding the configuration options.

To bring the file devices into compliance with the configuration, it is
necessary to perform an offline file device resize operation. This can be
done using the `mkfs.mse4` utility and the `resize` command. The Varnish
daemon must be stopped in order to execute this command. Please see
:ref:`mkfs.mse4(1)` for more information about running this command.
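
With the Varnish daemon stopped, the resize could be performed as follows,
where the configuration path is hypothetical::

   $ mkfs.mse4 -c /etc/varnish/mse4.conf resize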


See also
========

:ref:`mse4-config(7)`
:ref:`mse4-categories(7)`
:ref:`mkfs.mse4(1)`
