varnish-mse¶
Massive Storage Engine¶
Manual section: 7
Massive Storage Engine¶
The Massive Storage Engine (MSE) is an advanced stevedore for Varnish Cache Plus. The stevedore is the component that stores the cached objects and their metadata, keeps track of which objects in the cache are most relevant, and decides which objects to purge when room is needed for new content. MSE adds several advanced features compared to the standard stevedores that ship with Varnish:
Compact memory object structure
MSE has a more compact object storage structure, giving less storage overhead. This is most noticeable for small objects.
Fair LRU eviction strategy
When evicting content to make room for fresh content in the cache, the fetch task that does the eviction will be given priority to the space that it made available. This ensures that fetches do not fail due to other simultaneous fetch tasks stealing the space from under them.
Large caches using disks to cache objects
MSE can use disks as backing for object data, enabling cache sizes that are much larger than the available system memory. Content that is frequently used will be kept in memory, while less frequently used content will be read from disk instead of fetching from the backend.
Persisted caches
MSE will persist the disk stored objects, keeping the content in the cache between planned and unplanned restarts of the Varnish daemon.
Memory Governor
MSE features a mechanism that will automatically adjust the size of the cache according to the process memory consumption. This makes it easy to set up the right cache limits, and ensures the best utilization of the available memory even in shifting load conditions.
Configuration and usage¶
MSE uses a structured configuration file to describe the layout of the devices to use for object storage. The syntax of the configuration file is shown in the examples below.
The configuration structure is hierarchical. At the top level there is exactly one environment, which configures the global sizes and rules to use for memory-held object fragments.
An environment by itself configures a non-persistent cache. This makes MSE behave like a regular Varnish instance with a memory only cache, much like when using the default malloc stevedore, while giving the benefits of the compact object memory structure and fair LRU eviction.
To configure a persisted cache using disk for object storage, one or more books with associated stores need to be configured in the environment. The books contain metadata internal to MSE, while the stores contain the object data used by Varnish.
Once the configuration file has been created, MSE can be enabled using the -s mse[,<path-to-config-file>] option to the Varnish daemon.
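As a hedged sketch (the file paths are placeholders and not part of the examples elsewhere in this manual), a typical invocation with a configuration file looks like:
$ varnishd -a :80 -f /etc/varnish/default.vcl -s mse,/etc/varnish/mse.conf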
Persisted caching¶
When books and stores are configured, the cached objects are also persisted, keeping the content between restarts of the Varnish daemon.
The book is an embedded database that contains the necessary metadata about the objects stored in the cache. This includes e.g. the hash values for the objects, their associated TTLs and Vary matching information. The book also contains the maps for where in the store the payload data for the object resides, and the lists mapping free store space. Lastly the book has one journal file to persist bans on the system, and one journal file for each configured store to speed up metadata updates. All the data files that make up the book are kept in a directory, and the path to this directory is given in the configuration file.
Each book needs to have at least one store associated with it. The store holds the object payload data, consisting of its attributes (object headers, ESI instructions etc.) and the object body. Each store is a single large file in the filesystem that can contain any number of objects. New objects are assigned to a store on a round robin basis.
Keeping books and stores configured and stored separately is useful when the available disks do not have the same IO capacity. It can, for example, be advisable to keep the book on a fast SSD-type drive, while using a larger but slower rotating disk for the store.
The data files, both for books and stores, need to be initialized before starting the Varnish daemon for the first time. This is done using the bundled mkfs.mse utility, passing the configuration file as an option. See the mkfs.mse manpage for details.
All the data files are marked with a version marker identifying the on disk data format. If this version marker does not match that of the Varnish daemon, Varnish will refuse to start, and the files will have to be recreated using the mkfs.mse utility with the -f force option, clearing the cache in the process. If a new release of Varnish Cache Plus comes with a new on disk format, the changelog entry will clearly say so.
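As a hedged sketch of the workflow (the configuration file path is a placeholder, and the option used to pass the configuration file is assumed here to be -c; consult the mkfs.mse manpage for the authoritative syntax):
# Initialize the books and stores described in the configuration file
$ mkfs.mse -c /etc/varnish/mse.conf
# Recreate the data files after an on-disk format change, clearing the cache
$ mkfs.mse -f -c /etc/varnish/mse.conf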
Memory Governor¶
MSE has a feature called the memory governor that will automatically adjust the size of the cache in response to the memory usage of the Varnish cache worker process.
Traditionally the way to manage memory usage in Varnish has been by controlling how much memory to use for cache storage. That is an upper limit on how many bytes of actual object payload data Varnish will keep in memory, and once the limit is reached objects will be removed from the cache by LRU (Least Recently Used) order to make room for new content.
This approach has several limitations. First it does not take the cost of running the cache server itself into account. Each Varnish instance will need memory space in order to run all of the cache worker threads (client requests and backend fetches), scratch workspace regions for parsing and constructing HTTP request and response data, as well as lookup trees and other internal data structures.
Second, it does not take transient objects into account. Transient objects are a subgroup of objects that are either short-lived (TTL of less than 10 seconds), Hit-For-Pass and Hit-For-Miss stub objects, or private objects (cached request bodies and any active delivery pass objects). The typical setup of Varnish will run without an upper bound on the amount of memory that can be allocated for transient usage, requiring the main storage limit to be lowered to accommodate the expected transient usage. (Note that while it is possible to put a cap on Transient memory usage, that comes with its own limitations).
Third, the memory overhead will vary with the traffic patterns the Varnish instance is handling. The request rate most Varnish servers handle will not be the same during work hours and evenings, or on weekdays and weekends. With a heavy request load, the amount of memory needed for request handling and general overhead will increase, and likely also the transient usage.
The end result is that configuring the right cache size becomes a matter of trial and error for each Varnish instance, and the trial period can typically be a week long in order to get a good baseline usage.
The memory governor allows the system administrator to take a new approach to configuring the way Varnish uses memory. Instead of configuring the amount of memory to be spent on cache payload data and making sure to set it low enough to accommodate the expected overhead from the above, one specifies the amount of memory the Varnish cache worker process itself should consume. Varnish will then self-regulate, increasing or decreasing the cache size in order to keep the memory usage constant.
Enabling the memory governor¶
The memory governor is implemented partly in core Varnish Cache and partly in the MSE stevedore: the core part measures and reports on memory usage, while the MSE part reacts and adjusts the cache size. Because of this relationship, the memory governor can only be enabled when there is a single stevedore instance configured (only one -s argument on the varnishd command line), and that instance needs to be of type MSE. In persisted MSE setups one can continue to use multi-book and multi-store MSE configurations, and use the MSE VMOD to route objects to specific stores internally in MSE, but there needs to be one and only one instance of MSE configured in the Varnish daemon.
To enable the memory governor, one simply sets the memcache_size key in the MSE configuration file (top level environment section) to “auto”. This instructs MSE to take the memory allowance limits from the memory governor.
The target memory consumption can be configured by setting the memory_target Varnish runtime parameter to the required memory size. This can be set at startup by adding a -p memory_target=<memory-usage> option on the varnishd command line. The parameter may also be adjusted at runtime using varnishadm. The new target value will take effect immediately, but when lowering the target it may take a moment before the new target is reached.
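For example, adjusting the target on a running instance can be done with varnishadm's param.set command (the size shown is purely illustrative):
$ varnishadm param.set memory_target 48G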
Note that in order for the memory governor to work, the kernel must provide certain memory statistics. At startup a test is done to make sure that the necessary per-process statistics are present and readable through /proc/<pid>/status, and Varnish will refuse to start if the information is missing. The required status fields are VmRSS, RssAnon, RssFile and VmSwap.
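To verify up front that the kernel exposes these fields, one can inspect the status file of a running process, for instance the Varnish cache worker process (substitute its pid for <pid>):
$ grep -E '^(VmRSS|RssAnon|RssFile|VmSwap):' /proc/<pid>/status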
Memory sizing¶
When the memory governor is in effect, what Varnish regulates is the process' private memory usage, that is, the Varnish process' virtual memory mappings without any file backing (sometimes called anonymous memory maps). That means that the cost of open files, for example the Varnish shared memory log and the Varnish counters, is zero from the governor's point of view, and the cache will not be purged in an attempt to undo that kind of memory usage. This is because the memory consumed by these files belongs not to the process directly, but to the kernel's page cache. The kernel has its own algorithms that control the page cache and decide which parts of files to keep in memory and which to drop, based on the memory pressure the system as a whole is experiencing. The memory governor makes sure that the memory pressure the Varnish process adds to the system is constant, but leaves it to the kernel to handle the page cache and decide what file content is best kept in memory using the leftover space.
When setting the memory target, one needs to take into account the base memory cost of the rest of the system. If there are multiple services running on the same host (e.g. the Hitch TLS proxy), the space they will need must also be taken into account.
When running with persisted MSE, the page cache space needed for the books should also be taken into consideration. The books that contain the metadata about the persisted objects are written using regular file IO, and the available page cache memory will allow the kernel to speed up access to this data. How much of a speed-up, and the threshold beyond which further page cache space does not give any benefit, will depend on the traffic pattern, so some experimentation may be required. (Note that MSE's store files that contain the actual object payload data are accessed using direct IO directly into the private memory space of the worker process, and thus bypass the kernel's page cache).
As a baseline safe default, we recommend setting the memory_target parameter at 80% of the available system memory. This may then be further tuned at runtime if necessary.
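As a hedged illustration of this rule of thumb (the host size and co-located services are assumptions, not a recommendation): on a 64 GB host running Varnish alongside Hitch, 80% of the total is roughly 51 GB, so a reasonable starting point, assuming the configuration file enables the memory governor (memcache_size = "auto"), would be:
$ varnishd -a :80 -f varnish.vcl -s mse,/etc/varnish/mse.conf -p memory_target=51G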
Transient objects under Memory Governor¶
Under the memory governor, Varnish's handling of transient objects changes slightly. Without the governor, there would be a dedicated stevedore of type malloc to handle these, keeping the transient objects separated from non-transient objects. With the governor enabled, all objects are handled by the single MSE stevedore instance, including the transient objects, and the dedicated counter segment for the Transient stevedore disappears. The transient objects will also be subject to LRU eviction when needed, just like regular objects.
When using MSE in a persisted configuration, transient objects will become memory-only objects, meaning they will not be persisted.
Using MSE with SELinux¶
If SELinux is enabled, the MSE books and stores need to be located on a path that the SELinux policy allows the Varnish daemon to access. The policy shipped with the packages enables the /var/lib/mse/* path for this purpose. If your stores are on separate drives, you will need to mount those drives below that path.
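For example, a dedicated store drive could be mounted below the allowed path before initializing the store files (the device name is a placeholder):
$ mkdir -p /var/lib/mse/stores/disk1
$ mount /dev/<store-device> /var/lib/mse/stores/disk1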
Example memory only configuration¶
The following is an example configuration for a memory only cache using 100 GB of memory to hold cached objects:
env: {
id = "myenv";
memcache_size = "100G";
};
Example memory governor configuration¶
This example configuration file sets up a non-persisted MSE with memory governor:
env: {
id = "mse";
memcache_size = "auto";
};
The memory usage will be controlled by the memory_target runtime parameter.
As a convenience shortcut it is possible to get the exact MSE configuration shown above (with the environment ID set to “mse”) and memory governor enabled by omitting the MSE configuration file completely. The varnishd command line will then typically look like:
$ varnishd -a :80 -f varnish.vcl -s mse -p memory_target=<size>
Example with 2 books each holding 2 stores with memory governor¶
The following example demonstrates how to configure multiple books, each holding multiple stores. The memory governor is enabled, and Varnish will automatically adjust the memory space used for holding frequently accessed content. There will be 2 books, each configured with 1 GB of metadata space. Each book has 2 stores, each holding 1 TB of object data:
env: {
id = "myenv";
memcache_size = "auto";
books = ( {
id = "book1";
directory = "/var/lib/mse/book1";
database_size = "1G";
stores = ( {
id = "store-1-1";
filename = "/var/lib/mse/stores/disk1/store-1-1.dat";
size = "1T";
}, {
id = "store-1-2";
filename = "/var/lib/mse/stores/disk2/store-1-2.dat";
size = "1T";
} );
}, {
id = "book2";
directory = "/var/lib/mse/book2";
database_size = "1G";
stores = ( {
id = "store-2-1";
filename = "/var/lib/mse/stores/disk3/store-2-1.dat";
size = "1T";
}, {
id = "store-2-2";
filename = "/var/lib/mse/stores/disk4/store-2-2.dat";
size = "1T";
} );
} );
};
Example of non-homogeneous setup with a default storage selection¶
This example is similar to the previous, but the stores are of different sizes. In addition, a default set of stores is selected through the default_stores parameter.
The two books in the configuration are of the same size, while their stores' sizes differ by an order of magnitude. Although caching to spinning disks is often a bad idea, it can be a good option if you have many huge, uncommonly requested files that you want to cache.
In the example, imagine that the second book contains two stores on spinning disks, while the first book uses faster SSD/NVMe drives. By default, through the default_stores parameter, only the first book's stores will be selected during object insertion:
env: {
id = "myenv";
memcache_size = "100G";
books = ( {
id = "book1";
directory = "/var/lib/mse/book1";
database_size = "1G";
stores = ( {
id = "store-1-1";
filename = "/var/lib/mse/stores/disk1/store-1-1.dat";
size = "1T";
}, {
id = "store-1-2";
filename = "/var/lib/mse/stores/disk2/store-1-2.dat";
size = "1T";
} );
}, {
id = "book2";
directory = "/var/lib/mse/book2";
database_size = "1G";
stores = ( {
id = "store-2-1";
filename = "/var/lib/mse/stores/disk3/store-2-1.dat";
size = "10T";
}, {
id = "store-2-2";
filename = "/var/lib/mse/stores/disk4/store-2-2.dat";
size = "10T";
} );
} );
default_stores = "book1";
};
It is possible to use vmod_mse to override the store selection on each individual backend request, and this will be the only way to get objects into book2 with the above configuration.
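A minimal VCL sketch of such an override, assuming the configuration above; the backend definition and the Content-Type condition are placeholders for illustration only:
vcl 4.1;

import mse;

backend default {
    .host = "origin.example.com";  # placeholder origin
}

sub vcl_backend_response {
    if (beresp.http.Content-Type ~ "^video/") {
        # Send large, rarely requested objects to the spinning-disk stores
        mse.set_stores("book2");
    } else {
        # Everything else goes to the default (SSD/NVMe) stores
        mse.set_stores("book1");
    }
}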
Tags and an example of how they are used¶
It is possible to attach tags to books and individual stores, and use these to select which stores objects are placed in. Consider the following, somewhat silly, example:
env: {
id = "myenv";
memcache_size = "100G";
books = ( {
id = "book1";
directory = "/var/lib/mse/book1";
database_size = "1G";
tags = "red";
stores = ( {
id = "store-1-1";
filename = "/var/lib/mse/stores/disk1/store-1-1.dat";
size = "1T";
}, {
tags = ( "orange", "store-2-1" );
id = "store-1-2";
filename = "/var/lib/mse/stores/disk2/store-1-2.dat";
size = "1T";
} );
}, {
id = "book2";
directory = "/var/lib/mse/book2";
database_size = "1G";
tags = ( "pink", "red" );
stores = ( {
id = "store-2-1";
filename = "/var/lib/mse/stores/disk3/store-2-1.dat";
size = "1T";
tags = "green";
}, {
id = "store-2-2";
filename = "/var/lib/mse/stores/disk4/store-2-2.dat";
size = "1T";
tags = ( "blue", "book1", "red" );
} );
} );
default_stores = "none";
};
The example above mirrors the earlier two-book example, but tags have been added on both the books and the stores, and default_stores is set to the special value “none” (indicating that objects should be memory only by default, as described in the vmod_mse manual).
Tags can be specified either as a single string, or as a list of strings, and they can be applied to books and stores.
When a set of stores is selected, either by using default_stores or vmod_mse, the string will be matched against book names, store names, and tags. In the example above, mse.set_stores("red"); will select all stores, since both books have been tagged “red”. Even though store-2-2 is tagged “red” twice (in the book and the store itself), it will not be chosen twice as often as the other “red” stores.
There is no discrimination between names and tags, so mse.set_stores("book1"); will select all the stores in book1, plus store-2-2, since this store has the “book1” tag. Similarly, mse.set_stores("store-2-1"); will select two stores, one because of a matching name, and the other because a tag matches.
Read more about store selection in vmod_mse(3), where it is also explained how stores are weighted after selection, and how to change the weighting.
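As a small VCL sketch tying tag selection to the example above (the backend definition is a placeholder; selecting by tag works exactly like selecting by name):
vcl 4.1;

import mse;

backend default {
    .host = "origin.example.com";  # placeholder origin
}

sub vcl_backend_response {
    # "red" matches a tag on both books, so all four stores become candidates
    mse.set_stores("red");
}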
Degraded mode and configuration example¶
It is possible to enable fault tolerance for an MSE environment, in which case it may start with a subset of its books and stores. In the event of a device failure, misconfiguration, or any other reason that would result in a book or a store not successfully opening during startup, MSE can ignore these failures and proceed in degraded mode.
If an environment is successfully loaded, but corrupted, a fault may occur after startup. In this case Varnish tries to catch the error and cache it on disk for the next startup. This cache is a directory containing one text file per MSE environment that can be edited to unregister failed books and stores after restoring them to a pristine state.
Configuring a fault-tolerant environment can be done like this:
env: {
id = "my-degradable-env";
memcache_size = "auto";
degradable = true;
degradable_cache = "/var/lib/mse/degradable_cache";
# define books and stores
};
MSE3 was not initially designed with fault tolerance in mind, so failures will first manifest as panics. When the cache process crashes, the manager process may cache the MSE error before restarting a new cache process. Varnish is then soon ready to serve traffic again, with a degraded persistent storage capacity.
Configuration key flags¶
Some configuration keys have flags associated with them. Their meanings are listed below.
- required
- The configuration key is required, and it is an error to not specify it.
- persisted
- The configuration key is persisted, and can not be changed without recreating the book.
Configuration key types¶
The following types of configuration keys exist, listed with their expected value formats.
- id
- An identification string, maximum 16 characters long. Give the value in double quotes.
- bytes
- Value identifies a byte count. Give the value in double quotes. Accepts k, m, g, t and p suffix.
- string
- Regular string. Give the value in double quotes.
- bool
- Boolean value. Accepts true or false, without quotes.
- double
- Floating point number. Give the value without quotes.
- unsigned
- Unsigned integer value. Give the value without quotes.
Environment configuration parameters¶
memcache_size¶
Memory cache size
- Type: bytes_auto
- Default: 1G
- Minimum value: 4M
The number of bytes of memory to use for object storage in this environment. If set to “auto”, the Varnish memory governor will be activated, and the memory usage will be automatically controlled (see varnish-mse(7)).
memcache_chunksize¶
Memory cache chunk size
- Type: bytes
- Default: 4M
- Minimum value: 4k
Maximum size of memory chunks allocated for object fragments. Also defines the size of AIO requests.
memcache_metachunksize¶
Memory cache meta chunk size
- Type: bytes
- Default: 4k
- Minimum value: 4k
Target size of the memory chunk containing the metadata (object attributes, including the object headers) of stored objects. This is an indication only, and if necessary a larger chunk will be used, up to memcache_chunksize. Any leftover space will be used for object body data.
default_stores¶
Default store selection
- Type: string
Specifies which stores will be selected if no specific store is selected in VCL (through the mse VMOD). This functions in the same way as mse.set_stores(), so it is legal to specify a book name, a store name, or a tag that is shared among any number of books and stores.
degradable¶
Persistence fault tolerance
- Type: bool
- Default: false
Allow the environment to work in degraded mode, with a subset of its books and stores, or in memory only if none are available. Broken or missing books and stores are skipped during startup.
If a book or store becomes unavailable during Varnish’s runtime it will result in a panic, since this error is unrecoverable. If the origin of the error can be attributed to a specific book or store, it is added to the degradable cache to remember it across restarts of the cache process.
degradable_cache¶
Cache of degraded MSE components
- Type: string
- Default: /usr/local/var/mse/degradable_cache
MSE works in degraded mode when missing or broken books and stores could not be loaded during startup. If file corruption is detected after startup, caching the resulting errors prevents panic loops.
The degradable cache is a directory containing, for each environment, a file called <id>.env, where <id> refers to the environment's symbolic ID. This file can be edited at rest before (re)starting the cache process to remove the references to books and stores after they have been repaired. The degradable cache is only populated or consumed by MSE environments running in degraded mode.
Each line contains the following fields:
- a UTC timestamp
- a type (book, store)
- an identifier
- a status code
- a description
A single degradable cache cannot be shared by multiple environments. If multiple varnishd servers run on the same host with similar MSE configurations, care must be taken not to use the same path.
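As a hedged illustration (ids and paths are placeholders), two varnishd instances on the same host would each point their environment at a distinct directory in their respective configuration files:
# mse.conf for the first instance
env: {
    id = "env-a";
    memcache_size = "auto";
    degradable = true;
    degradable_cache = "/var/lib/mse/env-a/degradable_cache";
};
# mse.conf for the second instance
env: {
    id = "env-b";
    memcache_size = "auto";
    degradable = true;
    degradable_cache = "/var/lib/mse/env-b/degradable_cache";
};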
varylib_tblsize¶
Varylib table size
- Type: bytes
- Default: 4k
Size of the table chunk used for the vary library.
Book configuration parameters¶
directory¶
Directory path for book storage
- Type: string
- Flags: required
Specifies the directory where the book’s data files should be stored.
database_size¶
Size of database
- Type: bytes
- Default: 1G
- Minimum value: 100k
Size of the embedded database. Upon first initialization the storage file will be preallocated to this size. The size may be increased at a later stage by increasing this value and restarting Varnish. The storage file will then grow up to the new limit as needed. Decreasing the value will not have any effect.
database_readers¶
Number of simultaneous readers
- Type: unsigned
- Default: 4096
- Minimum value: 126
Size of reader table for database readers. This defines the maximum number of simultaneous readers.
database_sync¶
Synchronize database to disk
- Type: bool
- Default: true
Synchronize the database to disk at the end of transactions. This enables transaction consistency also in the event of operating system or power failures. If these types of errors are not expected, one may turn off the synchronization to disk. When turned off, there will still be consistency with regard to the cached objects and cache invalidations in case of planned or unplanned restarts of the Varnish daemon.
database_insert_timeout¶
Database object insertion timeout
- Type: double
- Default: 0.1
- Minimum value: 0
- Maximum value: 1
Experimental: The maximum time to wait for database access on object insertion. If this timeout expires, the attempt is aborted, and the fetch is done as a memory only object instead (not persisted). Note: This parameter and the interface to control this behaviour is subject to change.
database_waterlevel¶
Database high waterlevel ratio
- Type: double
- Default: 0.9
- Minimum value: 0.1
- Maximum value: 0.99
The maximum fill level of the embedded database as a ratio. When this fill level is reached, new transactions are queued until the background worker thread has purged enough objects to get below the fill level again.
database_waterlevel_hysterisis¶
Database waterlevel hysterisis
- Type: double
- Default: 0.05
- Minimum value: 0.0
- Maximum value: 0.5
The embedded database fill level at which the dedicated worker thread to remove objects will be started, expressed as the ratio difference to the database waterlevel. When this level is reached, the background worker thread to purge objects is started, but transactions are not queued until the database waterlevel is reached.
database_waterlevel_snipecount¶
Waterlevel purge batch size
- Type: unsigned
- Default: 10
- Minimum value: 1
Number of objects to snipe in each batch during database waterlevel purging.
banlist_size¶
Banlist journal size
- Type: bytes
- Default: 1M
- Minimum value: 8192
The active bans on the system are stored in a journal file of this size. If the space is exhausted, they will bleed off into the embedded database.
Store configuration parameters¶
filename¶
Storage file name
- Type: string
- Flags: required
The full path filename of the storage file that will hold the persisted objects.
size¶
Store size
- Type: bytes
- Default: 1G
- Minimum value: 100k
- Flags: persisted
The size of the store in bytes.
align¶
Store chunk alignment
- Type: bytes
- Default: 4k
- Minimum value: 4k
- Flags: persisted
Align all store allocations to multiples of this amount.
minfreechunk¶
Minimum size of a free chunk
- Type: bytes
- Default: 4k
- Minimum value: 4k
The minimum size of a free store chunk to keep track of. Chunks smaller than this will instead be added to allocations as extra overhead.
aio_requests¶
Number of simultaneous AIO requests
- Type: unsigned
- Default: 128
- Minimum value: 1
- Maximum value: 65534
The number of concurrent AIO requests allowed. If exceeded, threads will be queued waiting for an AIO context to become available. Defines the concurrency against the system IO device.
aio_db_handles¶
Number of read only database handles set aside for the AIO
- Type: unsigned
- Default: 16
- Minimum value: 1
The AIO subsystem will have this many read only database handles in a pool to use when needing to read metadata from the book. If the pool is exhausted, the thread will be queued until a database handle becomes available.
aio_write_queue_overflow¶
Experimental: Do not store objects when write queue is too long
- Type: bool
- Default: true
When this option is enabled and more than aio_write_queue_overflow_len threads are already queued for write IO, then the store selection is overridden and the object becomes a memory only cached object. The object is still cached, and will live in the memory cache according to the normal rules, but will not be disk backed, and if removed through LRU eviction will have to be fetched again from the backend. Note: This parameter and the interface to control this behaviour is subject to change.
aio_write_queue_overflow_len¶
Experimental: Queue length at which to not store objects
- Type: unsigned
- Default: 1
- Minimum value: 1
The AIO write queue length at which the store bypass option takes effect. Note: This parameter and the interface to control this behaviour is subject to change.
journal_size¶
Store metadata update journal size
- Type: bytes
- Default: 1M
- Minimum value: 8192
Object metadata updates are temporarily stored in a journal file of this size, and applied to the embedded database as part of the next transaction. If the space is exhausted, additional database transactions are made to flush the journal.
waterlevel_painted¶
Fraction of store objects painted as candidates for waterlevel purge
- Type: double
- Default: 0.33
- Minimum value: 0
- Maximum value: 1
This fraction of the objects on the store LRU will be painted as candidates for waterlevel purge when there is too little space available for new objects in the cache. The painted objects are all eligible for purging, and which object is purged is based on the location it occupies in the store, so that continuous free areas may be created.
waterlevel_threads¶
Number of waterlevel threads to use
- Type: unsigned
- Default: 1
- Minimum value: 1
This scales the number of threads to use when purging objects to achieve the set waterlevel for the store. More threads will achieve more concurrency to reach the goal faster.
waterlevel_minchunksize¶
Minimum chunk size to include in the store waterlevel measurement
- Type: bytes
- Default: 512k
- Minimum value: 16k
When calculating the current fill level of the store, only free chunks of at least this size are considered. This defines the level of fragmentation that is accepted; chunks of this size and larger do not constitute unwanted fragmentation.
waterlevel¶
Store waterlevel
- Type: double
- Default: 0.9
- Minimum value: 0.1
- Maximum value: 0.99
The maximum fill level of the store given as a ratio of the store size. When this fill level is reached, attempts to allocate space for new objects will be queued until the waterlevel drops below this level.
waterlevel_hysterisis¶
Store waterlevel hysterisis
- Type: double
- Default: 0.05
- Minimum value: 0.0
- Maximum value: 0.5
The fill level at which the background worker threads that purge objects and create continuous free space are started, expressed as the ratio difference to the store waterlevel. When the fill level drops below this level, the background threads are paused.
waterlevel_snipecount¶
Store waterlevel purge batch size
- Type: unsigned
- Default: 10
- Minimum value: 1
The number of objects each background purge thread will remove from the cache in each step before remeasuring the fill level.
COPYRIGHT¶
- Copyright (c) 2018 Varnish Software
- Author: Martin Blix Grydeland <martin@varnish-software.com>
- Author: Dridi Boukelmoune <dridi@varnish-software.com>