Recently we’ve been working on writing some new output logging plug-ins for Mistral.
For those of you who don’t know, Mistral is designed with an API to allow our customers to push log data out to a reporting system of their choice. We provide several sample plug-ins to push data out to common systems such as InfluxDB, IBM Spectrum LSF RTM and others.
InfluxDB was the first time-series database that we wrote a plug-in for, simply because it was the most common product used by our customers. Over time we’ve found InfluxDB useful internally for its simplicity and, in conjunction with Grafana, its ability to produce demonstrations that effectively show the range of what can be achieved with Mistral log data.
However, some customers don’t, won’t or can’t use InfluxDB. Therefore we aim to have a set of sample plug-ins that can be used either out of the box or as the basis for a custom set-up unique to an individual customer’s requirements.
Graphite and Elasticsearch
Our most recent work is to round out our support of the common TSDB implementations by looking at Graphite and Elasticsearch.
The job of the plug-ins in these cases is not complex. In principle we simply insert Mistral log data unmodified into an appropriate data model within the TSDB in question. In practice this can be more difficult than it sounds.
One of the benefits of Mistral is that metrics are recorded at the job level when used in conjunction with a job scheduler. This is one of the most useful features for our customers as it allows reporting against, and control of, individual jobs. However, when we started to look at Graphite this presented a problem.
When we started to test our new plug-in using a small data set containing about a dozen different job ids, we kept getting intermittent failures. Some test runs would complete perfectly, others would store only a few of the records, and others would fail entirely.
Technically Graphite is not a TSDB, but rather a system used to store and produce graphs from time series data. By default, Graphite uses Whisper to store data using a single file per metric. These files are created the first time each metric is seen, with enough space to hold all the data Whisper has been told to retain.
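To give a feel for what that up-front allocation means, here is a rough sketch that estimates the size of a single Whisper file. It relies on our understanding of the Whisper on-disk layout (a 16-byte metadata header, a 12-byte descriptor per archive and 12 bytes per data point), and the retention policy shown is purely illustrative rather than a recommended configuration.

```python
import struct

# Assumed Whisper on-disk layout (our reading of the format, not part of Mistral):
METADATA_SIZE = struct.calcsize("!2LfL")    # aggregation type, max retention, xFilesFactor, archive count
ARCHIVE_INFO_SIZE = struct.calcsize("!3L")  # offset, seconds per point, number of points
POINT_SIZE = struct.calcsize("!Ld")         # timestamp plus double-precision value


def whisper_file_size(archives):
    """Estimate the file size in bytes for a list of (seconds_per_point, points) archives."""
    header = METADATA_SIZE + ARCHIVE_INFO_SIZE * len(archives)
    data = sum(points for _, points in archives) * POINT_SIZE
    return header + data


# Illustrative retention: 10 second resolution for 6 hours, then 1 minute resolution for 7 days.
example_archives = [(10, 6 * 60 * 60 // 10), (60, 7 * 24 * 60)]
print(whisper_file_size(example_archives))
```

Under these assumptions every new metric costs the same fixed amount of disk space the moment it is first seen, regardless of how few data points it ever receives.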
As it turns out, Whisper’s storage model is explicitly optimised for metrics that are stable over time, such as CPU load, temperature etc. In fact, the overhead of creating storage for a new metric is so high that Graphite has configuration to limit the rate at which new metrics can be created. Any data for new metrics seen above this rate is silently dropped.
The theory is that you don’t care if you lose some data around the time you start collecting the metric as you are going to be recording it in perpetuity. As you can imagine, this does not work well when creating metrics that include an ever-increasing job id.
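The effect is easy to reproduce with a toy model. The sketch below is not Graphite's implementation; it simply assumes a per-minute cap on newly created metrics (we believe this corresponds to the MAX_CREATES_PER_MINUTE setting in carbon.conf) and drops any data point that would need a new metric once the cap is reached. The metric path layout, the metric names and the limit of 10 are all hypothetical.

```python
from collections import defaultdict


def simulate_creates(records, max_creates_per_minute):
    """Toy model of a create-rate limit.

    records: iterable of (timestamp_seconds, metric_path, value).
    Returns (stored, dropped) lists of records.
    """
    known_paths = set()
    creates_in_minute = defaultdict(int)  # minute bucket -> metrics created
    stored, dropped = [], []
    for timestamp, path, value in records:
        if path not in known_paths:
            minute = timestamp // 60
            if creates_in_minute[minute] >= max_creates_per_minute:
                # Over the limit: the data point is dropped without any error.
                dropped.append((timestamp, path, value))
                continue
            creates_in_minute[minute] += 1
            known_paths.add(path)
        stored.append((timestamp, path, value))
    return stored, dropped


# A dozen jobs, two hypothetical metrics each, all arriving within the same minute.
records = [
    (0, f"mistral.job.{job_id}.{metric}", 1.0)
    for job_id in range(1000, 1012)
    for metric in ("read_bytes", "write_bytes")
]
stored, dropped = simulate_creates(records, max_creates_per_minute=10)
print(len(stored), len(dropped))
```

Because every job id produces a fresh set of metric paths, even a dozen jobs generate more new metrics in one minute than the assumed limit allows, and the remainder disappear without any error being reported.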
We now have a sample Graphite plug-in. However, even our small test data set was enough to make the default configuration drop metrics when the tests were run too often in a short space of time, so we were forced to conclude that we cannot recommend using Graphite for storing Mistral data. To store data reliably using the default Whisper backend we would have to discard too much of the information that makes Mistral the tool that it is.
In contrast, Elasticsearch handles Mistral log data beautifully, adding structure within its JSON document storage model. After a few days getting to grips with Elasticsearch’s internals, we now have a fully supported logging plug-in that “just works”.
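As an illustration of why the document model is a comfortable fit, the sketch below turns a couple of log records into the newline-delimited JSON body that the Elasticsearch bulk API accepts. It is not the code from our plug-in: the index name and every field name in the example records are hypothetical, chosen only to show that the job id becomes an ordinary field rather than part of a metric name.

```python
import json


def to_bulk_body(records, index_name="mistral"):
    """Build a bulk request body: one action line followed by one document line per record."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(record))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical records; real Mistral log fields will differ.
example_records = [
    {"timestamp": "2018-01-01T00:00:00Z", "job_id": "1000", "label": "read_bytes", "value": 4096},
    {"timestamp": "2018-01-01T00:00:00Z", "job_id": "1001", "label": "read_bytes", "value": 8192},
]
print(to_bulk_body(example_records), end="")
```

A new job id here is just a new value in an existing field, so nothing has to be created up front when an unseen job starts reporting.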