Tuning vovfilerd

You may need to refine the tuning of the L1, L2, and L3 limits that define the behavior of vovfilerd for each specific filer. We suggest a simple experiment that will help you see the effect of storage-aware scheduling and may also help you decide on the tuning.

The experiment consists of running two I/O-intensive workloads (using dd), first with Storage Aware Scheduling and then without. The difference in behavior can be monitored in the browser interface. The suggested workload consists only of "write-to-disk" operations, so it is by definition an extreme workload. Real workloads will behave better than this experiment.

First, bring up the GUI for the filer you want to test (in this example we assume the nickname for the filer is FS1), so you can keep an eye on the measured latency on the filer.
% vovfilerdgui -show
% vovfilerdgui -f FS1 &


Figure 1.
Next, run the workload with storage-aware scheduling, i.e. using the resource that represents the filer, in this case Filer:FS1:
% cd some/directory/on/filer/FS1
% mkdir OUT
% setenv VOV_JOBPROJ SASyes
% time nc run -w -r Filer:FS1 -array 5000 dd of=OUT/dd_@INDEX@.out if=/dev/zero count=1024 bs=102400
Next, run the same workload without the resource, so that no restriction is placed on it:
% cd some/directory/on/filer/FS1
% mkdir OUT
% setenv VOV_JOBPROJ SASno
% nc run -array 5000 dd of=OUT/dd_@INDEX@.out if=/dev/zero count=1024 bs=102400

If you have a big enough farm, say close to 1000 cores, you will see that this second workload uses all the machines you have and creates an even higher level of latency on the filer. The jobs, which should take about 1 second each, may end up taking minutes to run because of the congestion. If the jobs also required a license, each job would hold that license for much longer than it did with storage-aware scheduling.
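To get a feel for how heavy this workload is, here is a back-of-the-envelope sizing of the data it writes. The numbers come directly from the dd arguments used in the commands above; the snippet itself is just plain POSIX sh arithmetic, not part of any vov tool:

```shell
# Back-of-the-envelope sizing of the dd experiment's write load.
# Values match the dd arguments used in the workload commands above.
count=1024      # dd blocks written per job
bs=102400       # bytes per block
jobs=5000       # size of the -array
per_job=$((count * bs))        # bytes written by one job
total=$((per_job * jobs))      # bytes written by the whole array
echo "per-job: $((per_job / 1024 / 1024)) MiB"
echo "total:   $((total / 1024 / 1024 / 1024)) GiB"
```

Each job writes 100 MiB, so the full array pushes roughly half a terabyte at the filer; with ~1000 jobs running at once and no throttling, it is easy to see why latency spikes.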

To quantify the difference between the two modes of operation, navigate the browser interface (Home > Workload > Job Plots) to look at this report. You will have to define a precise time range for the report, something of the form "20190506T150000-20190506T170000", then report by project with no binning.
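If you prefer to build the time-range string from the command line rather than typing it by hand, something like the following works. This is a hypothetical helper, not a vov command; it assumes GNU date (for the -d option), and the two-hour window is just an example:

```shell
# Build a "YYYYMMDDTHHMMSS-YYYYMMDDTHHMMSS" range string in the format
# the Job Plots report expects, covering the last two hours.
# Assumes GNU date (-d); adjust for BSD/macOS date if needed.
end=$(date +%Y%m%dT%H%M%S)
start=$(date -d '2 hours ago' +%Y%m%dT%H%M%S)
range="${start}-${end}"
echo "$range"
```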


Figure 2.

In this report you see the two workloads (in this plot the SASno experiment comes before the SASyes experiment). Look at the average run time of the jobs. In the SASyes experiment, the average duration is 5 seconds, while in the SASno experiment the average duration is 2m47s, i.e. 33 times longer!