How to optimise the data you collect with Mistral

Mistral can collect an awful lot of data about what a job is doing. Often, you won’t want all that data as there is simply too much information to process.

With all monitoring and telemetry solutions you must make a trade-off between scalability and detail. Mistral will perform much more efficiently if you limit what you ask it to collect. If you ask the tool to monitor all I/O activity, you may add a noticeable overhead to the runtime of the application.

Mistral lets you choose your own trade-off by giving you access to simple monitoring rules that can be toggled on and off as well as adjusted in other ways to change what is collected. The monitoring rules are in a text file called a contract.

There are several ways that you can reduce the amount of data collected. For example:

  1. Have fewer rules in the contract: choose what you want to measure and be selective. You can edit the contract at any time, even while jobs are running.
  2. Increase the time frame from 1s to something longer. Note, however, that if you set the time frame to be longer than the job, you will get aggregate data for the whole job.
  3. Increase the threshold of each rule so that data is only logged when that threshold is met. The threshold is the last field of each rule in a Mistral contract. This ensures that you only get data that is interesting to you, as you can adjust the contract until you are logging data only from the applications you want to hear about.
  4. Sample jobs at random or on a similar basis so that not all jobs are profiled. The launch script can simply toss a coin to decide whether to enable Mistral.
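
The coin toss in point 4 could look something like the Python sketch below. It is only an illustration: the launcher path, the sample rate and the example job are placeholders invented here, not names defined by Mistral, so substitute whatever your site's launch script uses to start a job under Mistral.

```python
import random
import shlex

# Hypothetical values for illustration only; neither is defined by Mistral.
MISTRAL_LAUNCHER = "/path/to/mistral-launcher"  # placeholder for your site's start-up command
SAMPLE_RATE = 0.1  # profile roughly one job in ten


def should_profile(sample_rate, rng=random):
    """Toss a weighted coin: True means this job should run under Mistral."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return rng.random() < sample_rate


def build_command(job_command, profile, launcher=MISTRAL_LAUNCHER):
    """Prefix the job with the launcher only when it was selected for profiling."""
    return [launcher, *job_command] if profile else list(job_command)


if __name__ == "__main__":
    job = ["./my_simulation", "--input", "data.in"]  # example job, also hypothetical
    command = build_command(job, should_profile(SAMPLE_RATE))
    print(shlex.join(command))
```

A fixed sample rate keeps the expected overhead across the cluster predictable. If you would rather profile every Nth job or only certain users, only `should_profile` needs to change.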

Below are two examples of contracts:

  • The ‘kitchen sink’ contract that contains rules to monitor I/O for the whole file system. There are also sample rules to monitor other mount points. All rules are set to trigger on just one I/O operation, so they will log all I/O activity.
  • A lightweight contract that monitors bandwidth and meta data operations above a certain level. If the I/O counts or bandwidth are below the threshold then nothing will be logged. This means that this contract will only log data for applications doing high levels of I/O.

The static Healthcheck report isn’t quite as effective if you deviate from the ‘kitchen sink’ contract, but you could always do an initial trace with the ‘kitchen sink’ contract and then move on to more specific rules.

For help with Mistral contracts, contact the Ellexus team at support@ellexus.com

The kitchen sink

# Version, Type of Contract, Timeframe
2,monitortimeframe,1s
# Rule format:
# LABEL,PATH,CALL-TYPE,SIZE-RANGE,MEASUREMENT,THRESHOLD

# all meta data I/O operations by mount point
count-meta-all,mount:/*,open+delete+access+create+fschange,all,count,0
# read and write bandwidth by mount point
bw-read-all,mount:/*,read,all,bandwidth,0B
bw-write-all,mount:/*,write,all,bandwidth,0B

# meta data I/O counts
count-open_A,/,open,all,count,0
count-access_A,/,access,all,count,0
count-create_A,/,create,all,count,0
count-delete_A,/,delete,all,count,0
count-fschange_A,/,fschange,all,count,0

# seek I/O counts indicating random I/O
count-seek_all_A,/,seek,all,count,0

# The following rules are broken down into small (0-32kB), medium
# (32kB-100MB) and large (100MB+) operations.
# Ideally applications will do streaming I/O of medium size
# with low meta data counts and low seek counts.
# read I/O counts
count-read_0-32kB_A,/,read,-32kB,count,0
count-read_32kB-100MB_A,/,read,32kB-100MB,count,0
count-read_100MBplus_A,/,read,100MB-,count,0

# write I/O counts
count-write_0-32kB_A,/,write,-32kB,count,0
count-write_32kB-100MB_A,/,write,32kB-100MB,count,0
count-write_100MBplus_A,/,write,100MB-,count,0

# read I/O bandwidth
bw-read_0-32kB_A,/,read,-32kB,bandwidth,0B
bw-read_32kB-100MB_A,/,read,32kB-100MB,bandwidth,0B
bw-read_100MBplus_A,/,read,100MB-,bandwidth,0B

# write I/O bandwidth
bw-write_0-32kB_A,/,write,-32kB,bandwidth,0B
bw-write_32kB-100MB_A,/,write,32kB-100MB,bandwidth,0B
bw-write_100MBplus_A,/,write,100MB-,bandwidth,0B

# I/O Size distribution
# Combined read+write rules, for different size ranges
count_4kB_read_write,/,read+write,-4kB,count,0
count_8kB_read_write,/,read+write,4kB-8kB,count,0
count_16kB_read_write,/,read+write,8kB-16kB,count,0
count_32kB_read_write,/,read+write,16kB-32kB,count,0
count_64kB_read_write,/,read+write,32kB-64kB,count,0
count_128kB_read_write,/,read+write,64kB-128kB,count,0
count_256kB_read_write,/,read+write,128kB-256kB,count,0
count_512kB_read_write,/,read+write,256kB-512kB,count,0
count_1024kB_read_write,/,read+write,512kB-1024kB,count,0
count_2048kB_plus_read_write,/,read+write,2048kB-,count,0

bandwidth_4kB_read_write,/,read+write,-4kB,bandwidth,0B
bandwidth_8kB_read_write,/,read+write,4kB-8kB,bandwidth,0B
bandwidth_16kB_read_write,/,read+write,8kB-16kB,bandwidth,0B
bandwidth_32kB_read_write,/,read+write,16kB-32kB,bandwidth,0B
bandwidth_64kB_read_write,/,read+write,32kB-64kB,bandwidth,0B
bandwidth_128kB_read_write,/,read+write,64kB-128kB,bandwidth,0B
bandwidth_256kB_read_write,/,read+write,128kB-256kB,bandwidth,0B
bandwidth_512kB_read_write,/,read+write,256kB-512kB,bandwidth,0B
bandwidth_1024kB_read_write,/,read+write,512kB-1024kB,bandwidth,0B
bandwidth_2048kB_plus_read_write,/,read+write,2048kB-,bandwidth,0B

The lightweight contract

# Version, Type of Contract, Timeframe
2,monitortimeframe,10s
# Rule format:
# LABEL,PATH,CALL-TYPE,SIZE-RANGE,MEASUREMENT,THRESHOLD
# all meta data I/O operations by mount point
count-meta-all,mount:/*,open+delete+access+create+fschange,all,count,10000

# read and write bandwidth by mount point
bw-read-all,mount:/*,read,all,bandwidth,0B
bw-write-all,mount:/*,write,all,bandwidth,0B

# meta data I/O counts
count-open,/,open,all,count,10000
count-access,/,access,all,count,10000
count-create,/,create,all,count,10000
count-delete,/,delete,all,count,10000
count-fschange,/,fschange,all,count,10000

# I/O bandwidth
bw-read_all,/,read,all,bandwidth,1GB
bw-write_all,/,write,all,bandwidth,1GB
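
Moving from the kitchen sink towards a lightweight contract mostly means editing the time frame in the header and the last field of each rule. As a rough sketch, the Python below does that mechanically for contract text in the format shown above. The function name `lighten_contract` and its parameters are made up for this example, and it assumes that the first non-comment line is the header and that every later non-comment line is a six-field rule. It is not an Ellexus tool.

```python
def lighten_contract(text, timeframe="10s", count_threshold="10000",
                     bandwidth_threshold="1GB"):
    """Return a copy of a contract with a new time frame and raised thresholds.

    Assumed layout (taken from the example contracts): the first non-comment
    line is the header VERSION,TYPE,TIMEFRAME and every later non-comment
    line is LABEL,PATH,CALL-TYPE,SIZE-RANGE,MEASUREMENT,THRESHOLD.
    """
    thresholds = {"count": count_threshold, "bandwidth": bandwidth_threshold}
    out = []
    header_seen = False
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            out.append(line)  # keep comments and blank lines untouched
            continue
        fields = stripped.split(",")
        if not header_seen:
            fields[-1] = timeframe  # header: replace the time frame
            header_seen = True
        elif len(fields) == 6 and fields[4] in thresholds:
            fields[5] = thresholds[fields[4]]  # rule: replace the threshold
        out.append(",".join(fields))
    return "\n".join(out)


if __name__ == "__main__":
    kitchen_sink = """\
# Version, Type of Contract, Timeframe
2,monitortimeframe,1s
count-open_A,/,open,all,count,0
bw-read_0-32kB_A,/,read,-32kB,bandwidth,0B
"""
    print(lighten_contract(kitchen_sink))
```

The sketch treats every rule alike, so any rule you want to keep at a zero threshold, or remove altogether, still needs editing by hand.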