vovtaskermgr

The main way to start, configure, and stop the taskers is with the vovtaskermgr command. This command acts relative to the VOV-project enabled in the shell where it is issued.

The file taskers.tcl in the project.swd directory stores the configuration information used by this command.
Note: Changes made to taskers.tcl are not automatically propagated to the running vovtaskers. To apply them to running vovtaskers, use the update subcommand.

A vovtasker listed in the taskers.tcl file may be running or stopped. The show subcommand gives information on the running vovtaskers currently connected to the vovserver. The list subcommand gives the names of all the vovtaskers defined in taskers.tcl, whether running or stopped.
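
The edit-then-update workflow described above can be sketched as a small POSIX sh function. This is only an illustration: the function name apply_tasker_config and the VOVTASKERMGR override variable are hypothetical helpers introduced here, not part of the product. The subcommands and the -down option are the ones listed in the usage message.

```shell
# Hypothetical override: set VOVTASKERMGR="echo vovtaskermgr" to preview
# the commands instead of running them.
: "${VOVTASKERMGR:=vovtaskermgr}"

# Apply an edited taskers.tcl and then review the result.
apply_tasker_config() {
    $VOVTASKERMGR update &&     # propagate taskers.tcl changes to running taskers
    $VOVTASKERMGR list &&       # every tasker defined in taskers.tcl
    $VOVTASKERMGR show -down    # configured taskers that are currently down
}
```

Run it in a shell where the VOV project is enabled, since vovtaskermgr acts on that project.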

Starting Many Taskers in Parallel

Starting hundreds of taskers can take some time. You can speed up the process by running several start commands concurrently, each with the -random option, which starts the taskers in random order.

For example:
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &
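
The same pattern can be written as a loop, which makes the number of concurrent commands a parameter. The following is a minimal POSIX sh sketch; start_taskers_parallel and the VOVTASKERMGR override variable are hypothetical names used only for this example.

```shell
# Hypothetical override: set VOVTASKERMGR="echo vovtaskermgr" to preview
# the commands instead of running them.
: "${VOVTASKERMGR:=vovtaskermgr}"

# Launch N concurrent "start -random" commands (default 6) and wait for
# all of them to return.
start_taskers_parallel() {
    workers=${1:-6}
    i=0
    while [ "$i" -lt "$workers" ]; do
        $VOVTASKERMGR start -random &
        i=$((i + 1))
    done
    wait    # block until every background start command has finished
}
```

Calling start_taskers_parallel 6 corresponds to the six commands shown above. The best number of concurrent commands depends on the site and is not specified here.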

Tasker Configuration on the Fly

Many vovtasker characteristics can be changed on the fly using vovtaskermgr configure. For example, you can change the capacity of a tasker, i.e. the maximum number of jobs that the tasker can take, with:
% vovtaskermgr configure -capacity 8 pluto
Setting the capacity to zero effectively disables the tasker. A message can be set at the same time to record the reason:
% vovtaskermgr configure -capacity 0 pluto
% vovtaskermgr configure -message "Temporarily disabled by John" pluto
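
The two commands above can be wrapped in a pair of helper functions. This is a sketch in POSIX sh; disable_tasker, enable_tasker, and the VOVTASKERMGR override variable are hypothetical names, and the capacity passed to enable_tasker is whatever value suits the host.

```shell
# Hypothetical override: set VOVTASKERMGR="echo vovtaskermgr" to preview
# the commands instead of running them.
: "${VOVTASKERMGR:=vovtaskermgr}"

# Disable a tasker: capacity 0, plus a message explaining why.
disable_tasker() {
    tasker=$1
    reason=$2
    $VOVTASKERMGR configure -capacity 0 "$tasker" &&
    $VOVTASKERMGR configure -message "$reason" "$tasker"
}

# Re-enable a tasker by giving it a non-zero capacity again.
enable_tasker() {
    tasker=$1
    capacity=$2
    $VOVTASKERMGR configure -capacity "$capacity" "$tasker"
}
```

As the usage message notes for CONFIGURE, such changes persist only until the tasker is stopped.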

Tasker Capacity

The core count and the capacity of a vovtasker can be overridden manually. By default, the capacity follows the core count, but it can also be set manually via the -T option or by defining the SLOTS/N consumable resource via the -r option, where N is a positive integer. In all cases, the capacity directly affects the number of slot licenses that will be requested.
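
As a worked example of the defaults, the sketch below applies the rules stated in the CONFIGURE section of the usage message: capacity defaults to the number of cores, max_capacity to 2*CAPACITY, and max_load to CAPACITY+0.5. The function name tasker_defaults is hypothetical, and the script only does the arithmetic; it does not query any tasker.

```shell
# Print the documented default values for a tasker with the given number
# of cores (a whole number).
tasker_defaults() {
    cores=$1
    capacity=$cores                   # default capacity = CORES
    max_capacity=$((2 * capacity))    # default max_capacity = 2*CAPACITY
    max_load="${capacity}.5"          # default max_load = CAPACITY+0.5
    echo "capacity=$capacity max_capacity=$max_capacity max_load=$max_load"
}
```

For an 8-core host, tasker_defaults 8 prints capacity=8 max_capacity=16 max_load=8.5.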

Tasker Reservation

Below is an example of using vovtaskermgr to set a reservation on a tasker. In this case, the tasker called 'pluto' is reserved for user 'john' for 2 days:
% vovtaskermgr reserve -user john -duration 2d pluto

If you wish for the vovtaskers to be reserved when they start, use the -reserve option in the taskers.tcl file.
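
A reservation is usually followed by a check and, later, by a cancellation. The POSIX sh sketch below combines the reserve command with the RESERVESHOW subcommand and the -cancel option from the usage message. The function names and the VOVTASKERMGR override variable are hypothetical, and the placement of -cancel before the tasker names is an assumption based on the general option syntax.

```shell
# Hypothetical override: set VOVTASKERMGR="echo vovtaskermgr" to preview
# the commands instead of running them.
: "${VOVTASKERMGR:=vovtaskermgr}"

# Reserve one or more taskers for a user, then list current reservations.
reserve_for_user() {
    user=$1
    duration=$2
    shift 2    # the remaining arguments are tasker names
    $VOVTASKERMGR reserve -user "$user" -duration "$duration" "$@" &&
    $VOVTASKERMGR reserveshow
}

# Cancel the reservation on the named taskers.
cancel_reservation() {
    $VOVTASKERMGR reserve -cancel "$@"
}
```

For example, reserve_for_user john 2d pluto issues the reservation for 'pluto' described in this section and then shows the current reservations.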

vovtaskermgr: Usage Message

USAGE:
    vovtaskermgr <SUBCOMMAND> [options] [taskerList]

    SUBCOMMAND is case-insensitive.

    The taskerList consists of tasker names or tasker ids.

SUBCOMMAND is one of:
    LIST           -- List all hosts named in the taskers.tcl file.
    RESTART        -- Same as STOP followed by START.
    REFRESH        -- Refresh cached environments and equivalences.
                      The default behavior is for taskers to obtain the
                      equivalences from the server. If changes are made to the
                      equiv.tcl file, the server will need to be instructed to
                      reread the file using the "vovproject reread" command
                      prior to requesting a tasker refresh.
                      If VOVEQUIV_CACHE_FILE is set to "legacy", a host-based
                      equivalence cache file will be created and updated in
                      the SWD/equiv.caches directory. If VOVEQUIV_CACHE_FILE
                      is set to a file path, the specified file will be used
                      instead.
    SHOW           -- Show info about connected or down taskers.
    PRINTSTATUS    -- Tell taskers to print their status in their log file.
    START          -- Start configured taskers.  If a list of hosts is
                      given, start taskers only on those hosts.  Otherwise,
                      start all configured taskers that are not running.
    UPDATE         -- Update configuration of running taskers.
    RESERVE        -- Reserve specified taskers.
    RESERVESHOW    -- Show current tasker reservations.
    CONFIGURE      -- Reconfigure the specified taskers on-the-fly.
                      Changes only persist until the tasker is stopped.
    STOP           -- Stop taskers; let jobs finish, unless -force is given.
    CANCELSHUTDOWN -- Revert stopped but still running taskers to normal
                      so they continue running and accept new jobs.
    ROTATELOG      -- Recreate new log files for specified taskers
                      if log files are missing, create tasker log directories
                      if needed, and have no impact on tasker startup logs.
    CLOSE [MSG]    -- Close taskers from accepting jobs. Closed taskers will
                      start and run, but will do so in a suspended state,
                      displaying the closure message, until opened by the
                      administrator. The default closure message is
                      'Closed by administrator'.
    OPEN [MSG]     -- Open taskers to accept jobs. The accompanying message
                      will be displayed on running taskers until another
                      message is generated during the course of normal
                      operation. Taskers that are not running will not display
                      the message after starting. The default opening message
                      is an empty string.
Global Options are:
    -l            -- Use longer format with LIST (may be repeated).
    -v            -- Increase verbosity of messages.
    -cfgfile      -- Specify path to tasker config file, relative to SWD.
                     Default: taskers.tcl
    -failover     -- Restrict operation to dedicated failover taskers only.

Options for SHOW are:
    -nameonly     -- Show only the names of the connected taskers.
    -nameid       -- Show only the names and ids of the connected taskers.
    -resourceonly -- Show only the resources of the connected taskers.
    -down         -- Show names of configured taskers that are down.
    -license      -- Show licensed capabilities of connected taskers.
    -taskergroups -- Show tasker group for each connected tasker.

Options for START and RESTART are:
    -server      -- Start the taskers by rsh/ssh from the vovserver host.
                    By default, the taskers are started
                    by the host that executes this script.
    -random      -- Start taskers in random order.
                    This is useful to start a large pool of taskers,
                    by running multiple concurrent commands like:
                      % vovtaskermgr start -random &
                      % vovtaskermgr start -random &
                      % vovtaskermgr start -random &
    -nolog       -- Redirect tasker output to /dev/null.
                    Useful to avoid huge log files in /usr/tmp
    -confirmafter <TIMESPEC>
                 -- Wait for the given time specification after the last start
                    request for the list of taskers being started, then print
                    whether each tasker has successfully started and connected
                    to the vovserver. Only taskers in the READY, WRKNG, FULL, or
                    OVRLD state will be considered as running.

Options for RESERVE are:
    -user        -- Reserve the tasker(s) for given list of users
                    (comma separated list)
    -group       -- Reserve the tasker(s) for given list of FairShare groups
                    (comma separated list)
    -jobclass    -- Reserve the tasker(s) for given list of jobclasses
                    (comma separated list)
    -jobproj     -- Reserve the tasker(s) for given list of job projects
                    (comma separated list)
    -osgroup     -- Reserve the tasker(s) for given list of Unix groups
                    (comma separated list)
    -bucketid    -- Reserve the tasker(s) for given list of queue buckets
                    (comma separated list)
    -id          -- Reserve the tasker(s) for given list of jobs
                    (comma separated list of job ids)
    -start       -- Reservation start time
    -end         -- Reservation end time
    -duration    -- Reservation duration (VOV timespec)
    -cancel      -- Cancel the reservation on tasker(s)

Options for STOP are:
    -force       -- Stop taskers with force. BEWARE: kills running jobs.
    -noconfirm   -- Do not prompt for confirmation. Default is to prompt.
    -all         -- Stop all running taskers.
    -sick <TIMESPEC>
                 -- Stop all taskers that have been sick for at least the given
                    time specification, as compared against the last time a
                    heartbeat was received by the server for each sick tasker.
                    All jobs running on a sick tasker being stopped will be
                    marked as failed in the server, even if the job does,
                    or has, completed successfully while the tasker is sick.
                    It is recommended to check tasker host connectivity before
                    using this function and allow for the tasker to reconnect
                    and send a heartbeat in case connectivity is restored.

Parameters for CONFIGURE are:
    -allowcoredump <bool>    -- Control core-dump behavior.
    -autokillmethod <d|n|v>  -- Control autokill method.
    -capacity <CAP>[MAXCAP]  -- Specify capacity and optionally the
                                max-capacity of the tasker. The capacity is
                                the maximum number of jobs that can be run by
                                tasker. The max_capacity is the maximum slots
                                a tasker can be expanded to have when jobs are
                                suspended. The default value for capacity is
                                equal to the number of CORES present. The
                                default value for max_capacity is 2*CAPACITY.
                                Use N, N/N, CORES[-+*/]N, CORES[-+*/]N/N,
                                N/CORES[-+*/]N, CORES[-+*/]N/CORES[-+*/]N to
                                make adjustments from the default.
                                Examples: 4, 4/8, CORES-2, CORES*0.8,
                                          CORES+0/20, CORES+2/CORES*2
    -cpus <N>                -- Number of CPUs in this machine.
    -debugcontainers <bool>  -- Enable debug logging of container activity.
    -debugjobcontrol <bool>  -- Enable debug logging of job control activity.
    -debugmultienv   <bool>  -- Enable debug logging of environment switching.
    -debugnuma       <bool>  -- Enable debug logging of NUMA activity.
    -debugusageinfo  <bool>  -- Enable debug logging of memory usage analysis.
    -maxload <MAXLOAD>       -- Maximum load above which new jobs are refused.
                                The default value for max_load is
                                CAPACITY+0.5.
                                Use 0 or less than 0 to specify default value.
                                Use N or CAPACITY[-+*/]N to make adjustments
                                from the default.
                                Examples: 12.0, CAPACITY+2, CAPACITY*2
    -maxwaitnostart <N>      -- How long to wait for a job to start.
    -maxwaittoreconnect <N>  -- How long to wait before reconnect.
    -message <string>        -- Set vovtasker message.
    -numabindtonode <bool>   -- Bind to entire NUMA node or individual cores.
                                Default is to bind to entire NUMA node.
    -resources  <string>     -- vovtasker resources.
    -taskergroup <string>    -- The tasker group.
    -minramfree <N>          -- Minimum amount of free RAM in MB.
    -name <string>           -- Name of vovtasker.
    -ramsentry <bool>        -- Activate/Deactivate RAM SENTRY.
    -efftotram <N>           -- Effective total RAM in MB.
    -retrychdir <N>          -- Specify number of retries for failed chdirs.
    -retrychdirsleep <N>     -- Specify the sleep interval time between
                                retries for failed chdirs.
    -retrychdirbackoff <N>   -- Specify the factor multiplied to the sleep
                                interval to increase sleep interval between
                                retries for failed chdirs.
    -liverecorder on|off     -- Enable/disable Live Recorder debugging
                                capability (linux64 only).
    -liverecorder.logdir <string>
                             -- Specify the directory in which the Live
                                Recorder recording file should be saved. The
                                directory must exist. Default is "/tmp".
    -liverecorder.logsize <N>
                             -- Specify the Live Recorder log size in MB.
                                Default: 256, Min: 256, Max: 65536.
    -liverecorder.mode <string>
                             -- Specify the Live Recorder mode, which is one of
                                the following: tasker, subtasker, both.
                                Note that enabling subtasker recording results
                                in a recording file for each job executed on
                                the tasker.
                                Default: tasker.
    -rawpower                -- Specify a raw power figure for initial tasker
                                startup.
    -mindisk                 -- Specify minimum /tmp disk in MB or
                                percentage (0%-99%, for example, 10%)
                                required for tasker startup.
    -coeff                   -- Specify a scaling factor from 0.01-100.0
                                used to derate tasker power.
    -sendenv <name>          -- Send a named environment to a tasker.
    -setenv VAR=VALUE        -- Set a variable in the tasker environment.
                                ("VAR=VALUE" must be quoted on Windows)
    -taskerheartbeat <N>     -- Specify the heartbeat for a tasker.
    -unsetenv VAR            -- Unset a variable in the tasker environment.

EXAMPLES:
    % vovtaskermgr show
    % vovtaskermgr show -nameid
    % vovtaskermgr start
    % vovtaskermgr start unix1
    % vovtaskermgr start -random            -- Start taskers in random order.
    % vovtaskermgr update
    % vovtaskermgr restart
    % vovtaskermgr stop                     -- Stop all taskers, let running
                                               jobs finish.
    % vovtaskermgr stop -noconfirm          -- Like above, no confirmation
                                               required.
    % vovtaskermgr stop -force              -- Kill running jobs now
                                               (-noconfirm implied).
    % vovtaskermgr reserve -user john \
             -duration 3h jupiter           -- Reserve tasker jupiter for user
                                               john for 3h from now
    % vovtaskermgr configure -message "shutdown 1PM" farm11 farm12
    % vovtaskermgr printstatus farm11
    % vovtaskermgr rotatelog                -- Recreate missing log files for
                                               all connected taskers
    % vovtaskermgr rotatelog farm2 farm11   -- Recreate missing log files for
                                               tasker farm2 farm11

    % vovtaskermgr configure jupiter -sendenv BASE
                                            -- send the BASE environment to
                                               tasker jupiter
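
The -confirmafter and -sick options described above both take a time specification. The POSIX sh sketch below shows one way they might be used. The function names and the VOVTASKERMGR override variable are hypothetical, and the time specification is passed in by the caller in the same form as the 3h and 2d values used elsewhere on this page.

```shell
# Hypothetical override: set VOVTASKERMGR="echo vovtaskermgr" to preview
# the commands instead of running them.
: "${VOVTASKERMGR:=vovtaskermgr}"

# Start the named taskers (or all stopped ones if none are named) and
# report, after the given timespec, which of them connected.
start_and_confirm() {
    timespec=$1
    shift
    $VOVTASKERMGR start -confirmafter "$timespec" "$@"
}

# Review connected taskers, then stop those sick for at least the timespec.
stop_sick() {
    $VOVTASKERMGR show &&
    $VOVTASKERMGR stop -sick "$1"
}
```

Because jobs on a sick tasker that is stopped are marked as failed, check host connectivity before calling stop_sick, as the usage message recommends.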