Preemption Rules

Preemption is controlled by persistent objects called preemption rules, which collectively define:
  • The resources that can be preempted
  • Conditions under which preemption should be triggered
  • Methods used to revoke those resources
There are two ways to manage preemption rules:
  1. Create and edit the rules with Accelerator's web interface. The rules are stored within the vovserver and saved to disk from time to time. The rules are automatically loaded when Accelerator is restarted.
  2. Define the rules in the Tcl syntax configuration file at vnc.swd/vovpreemptd/config.tcl. When the vovpreemptd daemon starts, it reads the config.tcl file containing the preemption rule information. The daemon monitors this file and reads it again upon changes. The daemon also creates the info.tcl file which contains information about the daemon and serves as a lock file to prevent two instances of the daemon from running. The daemon tracks the modification time of the info.tcl file, and will exit if the time is changed, for example by another instance of vovpreemptd.

Preemption rules define the conditions under which preemption is to be performed. These rules are defined using the VovPreemptRule command (described below), Accelerator's web page interface, or both.

The preemption rules are grouped into different pools.
  • Only one rule in a pool fires for a given iteration; rules are considered in order (as defined with the -order <N> option). This ordering can be used to set up escalation: if the current rule has not fired, the next rule is considered. For example, the first rule could target a small set of preemptable jobs, and the second rule a much larger set of preemptable jobs.
  • Multiple pools allow different preemption strategies to be considered in parallel; one pool could be for Design Verification jobs, while another pool could be for spice simulation jobs.

By default, all rules are added to the pool called mainpool. For multiple rules to fire during each preemption cycle, the rules must be organized into different pools.

Every preemption rule must have a unique name within its preemption pool, specified with the -rulename NAME option.
Note: If the same name is used for multiple rules, the last definition prevails.
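
As a sketch of the pool and ordering behavior described above, the following config.tcl fragment places two ordered rules in mainpool and a third rule in a separate pool. The rule names, the pool name spicepool, the job classes, and the order values are hypothetical; only the options themselves are taken from this page.

```tcl
# Hypothetical rules for illustration; adjust names and selection rules to your site.

# First rule considered in mainpool (lowest -order): a small set of preemptable jobs.
VovPreemptRule -rulename "dvSmallSet"     \
    -pool        mainpool                 \
    -order       10                       \
    -preempting  "JOBCLASS==dv_urgent"    \
    -preemptable "JOBCLASS==dv_scratch"   \
    -method      SUSPEND

# Considered only if the first rule did not fire: a much larger set of preemptable jobs.
VovPreemptRule -rulename "dvLargeSet"     \
    -pool        mainpool                 \
    -order       20                       \
    -preempting  "JOBCLASS==dv_urgent"    \
    -preemptable "PRIORITY<@PRIORITY@"    \
    -method      SUSPEND

# A separate pool is evaluated in parallel, so this rule can fire in the same cycle.
VovPreemptRule -rulename "spiceUrgent"    \
    -pool        spicepool                \
    -preempting  "JOBCLASS==spice_urgent" \
    -preemptable "JOBCLASS==spice_batch"  \
    -method      SUSPEND
```

Because the two mainpool rules share a pool, at most one of them fires per iteration; the spicepool rule is independent of both.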

Preemption Conditions

The preemption condition can be defined in one of the following ways.
  • There is a bucket of jobs that matches a given selection rule (use the -preempting and -bucketage options)
  • There is a bucket of jobs that is waiting for at least one of a list of resources (use the -waitingfor and -bucketage options)
  • There is a resource map that is controlled by MultiQueue and MultiQueue is currently requesting a drastic reduction (at least 10% of the current "in-use" S count) in the amount allocated to this queue (see option -multiqueueres). The 10% threshold can be controlled by means of the option -mqthresh.
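
The first two kinds of condition can be sketched as follows. The rule names, job class, and resource name are hypothetical, and both rules rely on the default GENERIC rule type; treat this as an illustration of the option pairs rather than a recommended configuration.

```tcl
# Hypothetical names; both rules use the default rule type.

# Condition 1: a bucket whose top job matches a selection rule,
# with a bucket age greater than 2 minutes.
VovPreemptRule -rulename "urgentClassWaiting" \
    -preempting  "JOBCLASS==urgent"           \
    -bucketage   2m                           \
    -preemptable "PRIORITY<@PRIORITY@"

# Condition 2: a bucket waiting for at least one of the listed resources,
# with a bucket age greater than 5 minutes.
VovPreemptRule -rulename "licenseStarved"     \
    -waitingfor  "License:hsim"               \
    -bucketage   5m                           \
    -preemptable "PRIORITY<@PRIORITY@"
```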

The weakest types of preemption are RESERVE_RESOURCES and RESERVE_TASKERS. RESERVE_RESOURCES simply reserves some resources for the job at the top of the bucket, and RESERVE_TASKERS simply reserves a tasker for that job. The intent is that, while the resources or taskers are reserved, other jobs terminate and thereby allow the job at the top of the bucket to be dispatched. The reservation is controlled by the options -reservetime, -reservefor, and -reservenum.

If the preemption type is not RESERVE_RESOURCES or RESERVE_TASKERS, then the system looks for jobs that can be preempted, i.e., that can be either killed or suspended.

The strongest type of preemption is FREE_TASKERS, which looks for ways to preempt all jobs currently running on a tasker.

Search for Preemptable Jobs

In this search for preemptable jobs:
  • Exclude jobs that have the preemptable flag set to zero;
  • Exclude jobs that are "system" jobs (like job resumers, zip jobs, ... )
  • Exclude jobs that are labeled as top jobs, that is, jobs that have caused a preemption in the past because they were the top job in a preempting bucket. These jobs are left alone for at least 10 minutes, a time interval that can be controlled with the option -donotdisturb <timeSpec>.
Searching for preemptable jobs is done in the following order:
  • Look for jobs that can be killed (more precisely, withdrawn and resubmitted). These are the jobs that satisfy the -preemptable selection rule and that are younger than -killage. These jobs must also be useful, in the sense that they hold some resources requested by the preempting job. Preemption will not kill jobs that are not considered useful for the preempting job.
  • If no job can be killed, then look for jobs that satisfy the -preemptable selection rule and are also useful. If any such job is found, preemption is attempted using the method specified by the -method option. Some jobs are resilient to some preemption methods, so care is applied to validate that the method has been effective.
  • If the preempted job is successfully suspended, then a resumer job associated with the suspended job is created. The resumer job is an invocation of the script vovjobresumer. The resumer job inherits the grabbed resources from the suspended job, meaning that it will be executed only when all resources grabbed from the suspended job become available. The resource list of the resumer job can also be augmented with the option -resumeres.
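
The options that steer this search can be combined in a single rule, sketched below. The rule name, the resource, and the time values are hypothetical; the comments map each option to the step it influences.

```tcl
# Hypothetical rule showing the options that steer the search for preemptable jobs.
#   -donotdisturb : leave former top jobs alone for 20 minutes instead of the default 10
#   -preemptable  : selection rule for candidate jobs (@PRIORITY@ is the top job's priority)
#   -killage      : candidates younger than 1 minute are killed and resubmitted
#   -method       : how older candidates are preempted
#   -resumeres    : extra resources appended to the resumer job's resource list
VovPreemptRule -rulename "searchTuning"  \
    -preempting   "PRIORITY>=8"          \
    -waitingfor   "License:hsim"         \
    -preemptable  "PRIORITY<@PRIORITY@"  \
    -donotdisturb 20m                    \
    -killage      1m                     \
    -method       SUSPEND                \
    -resumeres    "License:hsim"
```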

Preemption Rule Types

Preemption was previously defined as the process of revoking resources from a running job or reserving resources in order to start a 'more urgent' queued job that needs specific resources that are not currently available. Consequently, rules are divided into specific types based on how the resources are to be made available to such a 'more urgent' job.

The rule type is specified via the -ruletype option in VovPreemptRule, and the default rule type is "GENERIC".

Some of the options in VovPreemptRule are only meaningful for certain rule types. The following options apply to all rule types.
  • -pool
  • -rulename
  • -ruletype
  • -order
  • -debug
  • -enabled
  • -fireonce
  • -preempting
  • -bucketage
  • -waitingfor

RESERVE_TASKERS

The rules with type RESERVE_TASKERS are the simplest and least intrusive. When fired, such a rule adds a reservation of the specified tasker(s) for a specified period of time. When jobs terminate on a reserved tasker, the freed slots are reserved for the jobs in the preempting bucket.

Note: A preempt rule is recomputed every few seconds; because of this, a short reserve time, such as 10 seconds or so, is typically sufficient.
The options in VovPreemptRule required for specifying this rule type are:
  • -reservenum
  • -reservetasker
  • -reservefor
  • -reservetime
  • -preempttaskerspec

Since no jobs are preempted, options such as -method and -preemptable are not needed and will be ignored.

For example, the following rule reserves one tasker (-reservenum 1) from the tasker list dram4 for one minute for any job in the hsim_critical jobclass:

#
# Reserve some machines for one minute if there is a critical hsim job.
#
VovPreemptRule -rulename "ReserveOnlyHsimHw" \
   -preempting "jobclass==hsim_critical"     \
   -ruletype RESERVE_TASKERS                  \
   -reservetasker "taskerlist:dram4"        \
   -reservefor   "JOBCLASS hsim_critical"    \
   -reservenum   1  \
   -reservetime  "1m"

RESERVE_RESOURCES

The preemption rules with type RESERVE_RESOURCES are similar to the RESERVE_TASKERS type; however, instead of reserving taskers, these rules reserve resources for a specific reservation period. The resources reserved are the resources that the preempting job is waiting for, and they are reserved for the preempting job for the time specified via the -reservetime option. Since no jobs are preempted, the options -method and -preemptable are ignored. The following option is meaningful for the RESERVE_RESOURCES rule type:
  • -reservetime
In the following example, licenses are reserved for a high priority job of class LargeJob that has been waiting for more than 5 minutes.
VovPreemptRule 						\
    -pool     "mainpool" 				\
    -rulename "ReserveLicenseLargeJobHighPriority" 	\
    -ruletype "RESERVE_RESOURCES" 			\
    -preempting "Priority>8 JOBCLASS=LargeJob" 		\
    -bucketage "5m" 					\
    -waitingfor "License:*" 				\
    -reservetime 2m

MULTIQUEUE

Options for this type are:
  • -multiqueueres
  • -mqthresh
  • -donotdisturb
  • -preemptable
  • -killage
  • -method
  • -skipresumedjob
  • -resumeres
  • -numjobs
  • -maxattempts
  • -sortjobsby
In the following example, if the difference between the MultiQueue allocation of License:hsim and the actual usage is greater than 25%, then the jobs using that resource are preempted using the AUTOMATIC method.
VovPreemptRule -rulename "mqPreemptHsim" \
    -ruletype MULTIQUEUE                 \
    -multiqueueres License:hsim          \
    -mqthresh 0.25                       \
    -pool multiqueue                     \
    -method AUTOMATIC

GENERIC

The preemption type GENERIC is the most common. It is designed to find running jobs that can be preempted to free the resources needed to dispatch the preempting job.

Options for this type are:
  • -donotdisturb
  • -preemptable
  • -killage
  • -method
  • -skipresumedjob
  • -resumeres
  • -numjobs
  • -maxattempts
  • -sortjobsby

For a more detailed explanation of the available options, see the vovpreemptrule command reference below.

An example rule follows:
#
# The preempting rule is activated when any job with priority greater than
# or equal to 8 is waiting for License:hsim AND has been waiting
# more than 2 minutes.  It will preempt any job that has priority less
# than the preempting job AND is using resource License:hsim.
# The preemptable job will be killed and resubmitted if it has been
# running less than 1 minute.  Otherwise, it will be preempted via the
# SIGTSTP method.
#
VovPreemptRule -rulename "priority"    \
    -ruletype GENERIC                  \
    -preempting "PRIORITY>=8"          \
    -waitingfor "License:hsim"         \
    -bucketage  2m                     \
    -preemptable "PRIORITY<@PRIORITY@" \
    -killage    1m                     \
    -method SIGTSTP
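
The remaining GENERIC options limit and order the candidates. The following variant is a sketch only: the rule name and all values are hypothetical, and the sort fields are the two that appear in the documented default ('PRIORITY ASC, AGE ASC').

```tcl
# Hypothetical variant of the rule above that also limits and orders candidates.
#   -numjobs        : preempt at most 2 jobs per bucket per preemption cycle
#   -maxattempts    : stop applying the rule to a top job after 5 attempts
#   -sortjobsby     : order candidates by ascending age, then ascending priority
#   -skipresumedjob : do not preempt jobs resumed within the last 5 minutes
VovPreemptRule -rulename "priorityTuned"   \
    -ruletype       GENERIC                \
    -preempting     "PRIORITY>=8"          \
    -waitingfor     "License:hsim"         \
    -bucketage      2m                     \
    -preemptable    "PRIORITY<@PRIORITY@"  \
    -method         SIGTSTP                \
    -numjobs        2                      \
    -maxattempts    5                      \
    -sortjobsby     "AGE ASC, PRIORITY ASC" \
    -skipresumedjob 5m
```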

FAST_FAIRSHARE

FAST_FAIRSHARE preemption is intended to help speed up FairShare. The rule type is mainly used by the NC web page preemption rule entry page to pre-enter the relevant FairShare-related fields for the preempting and preemptable conditions in the selection rules. Internally, it is processed exactly the same as the GENERIC rule type.

FREE_TASKERS

This is one of the strongest preemption types, because it can preempt all jobs on a tasker at the same time to make space for the preempting job. This type preempts as many taskers, and the jobs running on them, as are needed to run the jobs in the preempting bucket. The number of taskers preempted never exceeds -preempttaskernum.

Options for this type are:
  • -donotdisturb
  • -preemptable
  • -preempttaskerspec
  • -preempttaskernum
  • -method
  • -skipresumedjob
  • -resumeres
In the following example, high priority jobs in the jobclass "design" requiring 4 or more cores are allowed to preempt groups of "regression" jobs. With a bucketage of 10 seconds, this rule fires about once every 10 seconds for each bucket.
VovPreemptRule 				\
    -pool     "mainpool" 		\
    -rulename "taskerFree" 		\
    -ruletype "FREE_TASKERS" 		\
    -preempttaskerspec "TASKERLIST:default" 	\
    -waitingfor HW 			\
    -bucketage  10 \
    -preempting "JOBCLASS==design PRIORITY>=8 REQCORES>=4" 		\
    -preemptable "JOBCLASS==regression"

Preemption Timing

-numjobs N
The maximum number of jobs preempted per bucket per preemption cycle.
Default value is -1
-maxattempts N
The maximum number of attempts to match the rule for a preempting job.
Setting this to zero (0) disables the check, meaning that the rule can be matched an unlimited number of times, which is useful, for example, for RESERVE_* type rules.
Default value is 10

The -maxattempts option limits the number of times the preemption rule will be applied to the top job in a preempting bucket after no preemptable targets are found.

The default preemption cycle length is 3s. Because this is short, it may appear that more than one job is being preempted during a given cycle. To make the number of jobs preempted per cycle more apparent, set the preemptionPeriod parameter in policy.tcl to a longer period. For example:
set config(preemptionPeriod) 10s

The number of jobs preempted per cycle is also limited to a fraction of the size of the preempting bucket.

For example, consider a situation with the following characteristics:
  • preemptionPeriod of 10s
  • SIGTSTP method
  • a central resource with -total 4
  • 4 preemptable jobs that consume a single resource and 10 preempting jobs that consume 4 resources each
  • -numjobs 1
  • -maxattempts 3
The preemptable jobs are running before the preempting jobs are added. Once a preempting job starts, it runs indefinitely (the other preempting jobs were added only to make the preempting bucket large enough to test -numjobs). Initially no preempting job is running, and the rule triggers for the top job 000001112 in the preempting bucket. Because that job needs 4 resources and -numjobs is 1, the rule fires for 4 cycles, preempting one job per cycle, after which job 000001112 can execute. Subsequently the rule fires for job 000001117, but no suitable preemptable jobs are available, so after the third cycle (-maxattempts 3) the rule is no longer applied to job 000001117.
Rule triggers for job 000001112 in bucket 000001114 (jobproj==urgent_job_53244).
GENERIC PreemptRule Rule_53244 trying to preempt up to 1 jobs.  4 preemptable targets found
Rule triggers for job 000001112 in bucket 000001114 (jobproj==urgent_job_53244).
GENERIC PreemptRule Rule_53244 trying to preempt up to 1 jobs.  3 preemptable targets found
Rule triggers for job 000001112 in bucket 000001114 (jobproj==urgent_job_53244).
GENERIC PreemptRule Rule_53244 trying to preempt up to 1 jobs.  2 preemptable targets found
Rule triggers for job 000001112 in bucket 000001114 (jobproj==urgent_job_53244).
GENERIC PreemptRule Rule_53244 trying to preempt up to 1 jobs.  1 preemptable targets found
Rule triggers for job 000001117 in bucket 000001114 (jobproj==urgent_job_53244).
GENERIC PreemptRule Rule_53244 trying to preempt up to 1 jobs.  0 preemptable targets found
Rule triggers for job 000001117 in bucket 000001114 (jobproj==urgent_job_53244).
GENERIC PreemptRule Rule_53244 trying to preempt up to 1 jobs.  0 preemptable targets found
Rule triggers for job 000001117 in bucket 000001114 (jobproj==urgent_job_53244).
GENERIC PreemptRule Rule_53244 trying to preempt up to 1 jobs.  0 preemptable targets found
Permanently skip this preemption rule for top job 000001117 since it exceeds maximum attempts of 3
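
A configuration along the following lines could produce a log like the one above. This is a reconstruction under assumptions: the rule name and the preempting project are read from the log, while the -preemptable selection rule and the upper-case JOBPROJ field spelling are hypothetical.

```tcl
# In policy.tcl: lengthen the preemption cycle, as in the scenario above.
set config(preemptionPeriod) 10s

# In the preemption rule definitions. The -preemptable selection rule and
# the JOBPROJ field spelling are assumptions; the log only shows the bucket
# as "jobproj==urgent_job_53244".
VovPreemptRule -rulename "Rule_53244"             \
    -ruletype    GENERIC                          \
    -preempting  "JOBPROJ==urgent_job_53244"      \
    -preemptable "JOBPROJ==background_job_53244"  \
    -method      SIGTSTP                          \
    -numjobs     1                                \
    -maxattempts 3
```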

Command Line Interface for Preemption Rules

Here are some useful commands to manage preemption rules.

% vovshow -preemptrules
002772887 test            testFreeTaskerRule              KILL+RESUBMIT    101
002774006 Micron          mic_pri                        KILL+RESUBMIT    102
002774004 Micron          mic_mq                         KILL+RESUBMIT    101
002775275 mainpool        FormalRegressions              SUSPEND          101
002774085 mainpool        testStealResource              AUTOMATIC        102
002775412 mainpool        PreemptAth                     AUTOMATIC        103
002774511 mainpool        byPriority                     SUSPEND          101
002777727 RegrTestPool    RegrTestPriority1523033980     KILL+RESUBMIT     50
002778828 RegrTestPool    RegrTestThomas                 0:*:EXT,KILL 10:WITHDRAWN:RESUBMIT   50
002777743 RegrTestPool    ReserveTaskersForTest           AUTOMATIC        101
002774829 mainpool        HelpStiffJobs                  AUTOMATIC        101
002778200 mainpool        pRule4613                      SUSPEND          101
002778064 mainpool        rr41523034013                  AUTOMATIC         50
002775429 HERO            Test_Priority_Same_User        SUSPEND          101
002777656 TESTPOOL        TestMethodnormal               AUTOMATIC        103
002778839 RegrTestPoolMQ  RegrTestMQ1523034523           SUSPEND           55
% vovforget -preemptrules
If you know the VovId of a preemption rule, you can use it in these commands:
% vovshow ID_OF_PREEMPT_RULE
...
% vovforget ID_OF_PREEMPT_RULE
...
Preemption methods are created only in policy.tcl:
% vovshow -preemptmethods
 1 JOBHANDLER_VOVSH     0:*:EXT,SIGTSTP,vovsh 5:SUSPENDED:NOLMREMOVE
 2 SIGTSTP+LMREMOVE     *:RETRACING:SIGTSTP 5:WAIT:SUSPEND 10:SUSPENDED:LMREMOVE 20:LMREMOVED:DONE
 3 SUSPEND              *:*:SUSPEND                   
 4 SIGTSTP+SUSPEND      *:RETRACING:SIGTSTP 5:WAIT:SUSPEND
 5 SIGTSTP              *:*:TSTP                      
 6 KILL+RESUBMIT        0:*:KILL 3:WAIT:NOP 30:WITHDRAWN:RESUBMIT
 7 LMREMOVE             *:*:SUSPEND 10:SUSPENDED:LMREMOVE 20:LMREMOVED:DONE
 8 AUTOMATIC            *:*:*                         
 9 JOBHANDLER           0:*:EXT,SIGTSTP,tclsh* 5:SUSPENDED:NOLMREMOVE

Tcl Interface to Preemption Rules

To dump the rules to a file, use the command VovDumpPreemptionRules:
# This is Tcl.
VovDumpPreemptionRules NameOfFile.tcl
At the low level, you can use these procedures to manipulate preemption rules:
% vovshow -api preempt
vtk_preemptrule_create DESCRIPTION_ARRAY
vtk_preemptrule_modify DESCRIPTION_ARRAY
vtk_preemptrule_forget ID
vtk_preemptrule_delete ID 
vtk_preemptrule_find   POOL RULENAME
vtk_preemptrule_get    ID RESULT_ARRAY
vtk_preemptrule_delete_all 
vtk_preemptrule_forget_all
To preempt a specific job, call:
# This is Tcl.
vtk_transition_preempt  jobId [-noop] [-manualresume] [-method METHOD] [-resumeres RESLIST]
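
As a sketch of how these procedures might be combined in a vovsh session, the following looks up a rule by pool and name, prints its fields, and then preempts one job. The return value of vtk_preemptrule_find is an assumption (it is taken to be the rule's VovId), the pool and rule names are copied from the vovshow listing above, and the job ID is a placeholder.

```tcl
# Assumes a vovsh session connected to the Accelerator project.

# Look up a rule and read its fields into the array 'rule'.
# Assumption: vtk_preemptrule_find returns the VovId of the matching rule.
set ruleId [vtk_preemptrule_find mainpool byPriority]
vtk_preemptrule_get $ruleId rule
parray rule

# Preempt one specific job with an explicitly chosen method.
# The job ID below is a placeholder.
set jobId 000001112
vtk_transition_preempt $jobId -method SUSPEND
```

parray is the standard Tcl procedure for printing every element of an array, which is convenient here because the field names returned for a rule are not listed on this page.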

vovpreemptrule

Usage: VovPreemptRule -rulename NAME [OPTIONS]

Options:
    -pool             POOLNAME    -- The rule belongs to a pool of rules.
                                     At most one preemption can occur for each pool
                                     in each preemption cycle (default: mainpool)
    -rulename         NAME        -- Required.
    -ruletype         TYPE        -- The type of preemption rule. Allowed values are
                                     GENERIC, FAST_FAIRSHARE, MULTIQUEUE, RESERVE_RESOURCES,
                                     RESERVE_SLAVES, and FREE_SLAVES (default: GENERIC)
    -order            INTEGER     -- Specify the order of evaluation of rules within the same pool.
                                     Rules are evaluated from low to high order.  If not specified
                                     the order is assigned automatically based on order of declaration.
                                     Typical range is small positives from 0 to 1000, but the order
                                     can be any integer.
    -enabled          BOOL        -- To enable and disable the rule.
    -enable           BOOL        -- Same as -enabled (obsolete).
    -debug            BOOL        -- To control debugging flag for this rule.
    -fireonce         BOOL        -- To control the fire-once flag.
    -preempting       SELRULE     -- A selection rule for the top job in a bucket.
    -waitingfor       RESLIST     -- If set, the top job in the bucket must be waiting
                                     for at least one of the
                                     given resources in order to trigger a preemption.
                                     If the RESLIST contains the string 'HW', then
                                     preemption is triggered if a job waits for a slot.
    -bucketage        TIMESPEC    -- Only apply the preemption if the bucket age
                                     is greater than the specified value.

    -multiqueueres    RESLIST     -- Trigger preemption if a multiqueue resource (rank>20)
                                     is imbalanced.
    -mqthresh        THRESHOLD    -- Percent reduction in MQ allocation that triggers
                                     preemption. Default 0.1=10%

    -donotdisturb     TIMESPEC   -- Do not preempt a job that was a top-job (i.e. a job
                                    that triggered some preemption) for at least the specified
                                    time (default 10m)
    -preemptable      SELRULE    -- A selection rule for the running jobs
                                    that should be preempted.  Any field of the
                                    form @FIELD@ is replaced by the corresponding
                                    value for the top job in the bucket.

    -preemptslavespec SPEC       -- If preempting job is waiting for hardware,
                                    preempt slaves that match the given SPEC. 
                                    The SPEC may include "SlaveList:NAMEOFSLAVELIST"
                                    and selection rules for slaves, like "HOSTNAME=lnx01,lnx02 RANDOM>5000"
                                    Used for FREE_SLAVES rules.
    -preemptslavenum  N          -- For FREE_SLAVES rules, how many slaves to preempt for each bucket
                                    that matches the preempting rule. In any case, we never preempt more
                                    slaves than there are jobs in the bucket. Default is 1.
                                    Use a higher number if you are preempting many jobs for better performance.

    -killage          TIMESPEC   -- Jobs younger than this age are simply killed
                                    and resubmitted.  Limited to 7 days max and default
                                    is 0 which implies that killage is not used.
    -method           METHOD     -- The method to be used to recover license
                                    resources from the job. Allowed values are
                                    SUSPEND, LMREMOVE, RESERVE, AUTOMATIC.
                                    If AUTOMATIC, then each license is removed using
                                    the specific method defined with VovPreemptMethod.
                                    Default: AUTOMATIC
    -skipresumedjob   TIMESPEC   -- Do not preempt jobs that have been resumed
                                    no more than TIMESPEC ago.
                                    Default: 2m

    -reservetime      TIMESPEC   -- How long resources should be reserved for the
                                    top job when attempting preemption (default 20)
    -reservetype      RESTYPE    -- Deprecated. Use reservefor.
    -reservenum       N          -- How many slaves are to be reserved for RESERVE_SLAVES rule type.
                                    Default: 1
    -reserveslave     SLAVENAMES -- If a job is waiting for hardware, this is a space-separated list
                                    of slaves to reserve for the job.  It is also possible to include
                                    a slave list by using the keyword 'SlaveList:NAME_OF_SLAVE_LIST'.
                                    -preemptslavespec is used if the field is empty.
    -reservefor       RESSPEC    -- Specify how to reserve a slave. The RESSPEC is
                                    a space-separated list of KEY VALUE, where KEY is
                                    one of BUCKET USER GROUP JOBCLASS JOBPROJ OSGROUP JOBID and VALUE
                                    is a comma-separated list of values (also symbolic like @USER@).
                                    VALUE of BUCKET and JOBID entered here is ignored and 
                                    preempting job ID and bucket ID are used.
                                    Default is BUCKET.

    -resumeres        RESLIST    -- List of resources to append to the resumer job.
                                    RESLIST can contain field references (e.g. @HOST@)
                                    which are taken from the preempted job.
    -resumedelay      TIMESPEC   -- Set the minimum delay before executing the resumer job,
                                    where TIMESPEC is the span of time between job
                                    suspension and future time when the resumer job
                                    will be considered for scheduling again.
                                    Default: 5s

    -numjobs          N          -- The maximum number of jobs preempted per bucket per
                                    preemption cycle.
    -maxattempts      N          -- The maximum number of attempts to match the rule for a
                                    preempting job. Setting this to zero (0) disables the check
                                    meaning that the rule can be matched an unlimited number of times
                                    which is useful for example for RESERVE_* type rules.
    -sortjobsby       N          -- Criteria to sort/order potential preemptable jobs.
                                    Format is:
                                    <fieldname> [ASC|DESC] [, <fieldname> [ASC|DESC]]*.
                                    Default is 'PRIORITY ASC, AGE ASC'.

Debug VovPreemptRule

To make sure the rules work as intended, it is useful to look at how the preemption algorithms work in detail. Detailed logs are written to a log file, separate from the server log, named server_preemption_DATE.log.

Set the server parameter preemption.log.verbosity to a number between 0 and 10. For preempt rules of interest, turn on the debug flag through the web UI. To log all preempt rules, set the server parameter preemption.log.allrules to 1.
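
A minimal sketch of these settings follows. It assumes that the two server parameters can be set through the config array in policy.tcl in the same way as preemptionPeriod; the rule itself is hypothetical and only illustrates the documented -debug option as an alternative to the web UI.

```tcl
# In policy.tcl. Assumption: these server parameters are set through the
# config array, like preemptionPeriod.
set config(preemption.log.verbosity) 5
set config(preemption.log.allrules)  0

# In the rule definitions: turn on the debug flag for one rule of interest
# (hypothetical rule, using the documented -debug option).
VovPreemptRule -rulename "debugMe"       \
    -preempting  "PRIORITY>=8"           \
    -preemptable "PRIORITY<@PRIORITY@"   \
    -debug       1
```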

The following shows the kinds of messages logged at each verbosity level.
  • Taskers preempted, jobs preempted, reservations made on taskers and resources, and durations of preempted jobs.
  • Preempt rules that trigger for jobs in each bucket. Reasons why rules get disabled. Time taken to process rules if it is significant.
  • Which taskerlist is used. Which tasker is missing.
  • Why each tasker is not selected. Reasons may be bad tasker status, HW not compatible, already reserved, or select rule not applicable.
  • Report all job status being preempted. Each job preempted with which plan. Reserving critical resource. Skip job after max attempt.
  • Report how all preempted jobs are handled. Which jobcontrol method is applied.
  • Why preempt rule is not triggered. Why taskers are not chosen (already reserved, invalid reserve spec.).
  • Details about choosing preemptable target such as waiting for HW and SW, running jobs that have resources managed by Allocator, jobs that have useful resources, wait reasons, preemptable analysis, missing resources.
  • Time taken to process preempt rules.
  • Which rule is disabled. Miscellaneous messages.