2019.01 Update 5 Release

New Features and Enhancements

The following new features and enhancements were introduced in this software release:

Table 1.
Product Internal Number Case Number Description
All VOV-11299 25102 Fixed an error that caused a "Server is operating on a non-internal object" message to be printed in the server log. The error was triggered by querying for the "why" status of a job that has an input dependency.
All VOV-11350   Fixed statistics for some hierarchical sets.
Accelerator VOV-7535 20833 The proc VovGetRevokeDelay {} can now be redefined in vovresourced/config.tcl under the SWD directory to customize the revoke delay used in vovreconciled. This allows the revoke delay from job classes to override the default value of RESD(revokeDelay). The proc definition has been added to the Altair Accelerator Administrator Guide. In addition, the verbosity levels of various messages have been modified per customer requests.
Accelerator VOV-3765 20136 The nc run command now supports an option to control the number of times a job can be rescheduled:
  • maxresched <N> -- Maximum number of times the job can be rescheduled.
  • Must be >= 1 and <= 10 (default 10).
This is implemented via the MAX_RESCHEDULE property on the job.
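
The documented bounds can be illustrated with a short Python sketch that checks a maxresched value before it is handed to nc run. The helper name check_maxresched and the idea of validating on the client side are assumptions made for illustration; this is not how Accelerator itself enforces the limit.

```python
# Hypothetical helper, not part of Accelerator. It only mirrors the
# documented bounds for maxresched: 1 <= N <= 10, default 10.
MAXRESCHED_MIN = 1
MAXRESCHED_MAX = 10
MAXRESCHED_DEFAULT = 10


def check_maxresched(value=None):
    """Return a usable maxresched value, or raise ValueError."""
    if value is None:
        # No value given: fall back to the documented default.
        return MAXRESCHED_DEFAULT
    n = int(value)
    if not MAXRESCHED_MIN <= n <= MAXRESCHED_MAX:
        raise ValueError(
            f"maxresched must be between {MAXRESCHED_MIN} "
            f"and {MAXRESCHED_MAX}, got {n}"
        )
    return n
```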
Accelerator Plus VOV-11260 24568, 24890, 25001, 25249 Added the policy parameter fairshare.overshoot.damping (1 = enabled, 0 = disabled), which controls whether FairShare restricts the number of jobs scheduled for groups that are over budget.
Accelerator Plus VOV-11337   Added accounts option (-A) for PBS Pro resource list for Accelerator Plus.
Accelerator Plus, FlowTracer VOV-11188   A configuration value has been added to the Accelerator Plus configuration file, SWD/vovwxd/config.tcl, to allow the user to limit how many consecutive failures of a slave job in the base queue are allowed before no further attempts are made to create slaves for a bucket. The default value is 0 (no limit). This prevents a malformed job from causing churn in the system.
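
The behavior of such a limit can be sketched in Python as a per-bucket counter. The class and method names are hypothetical. The sketch also assumes that a successful slave job resets the count and that slave creation stops once the count reaches the limit; the release note does not state either detail.

```python
class BucketFailureTracker:
    """Illustrative per-bucket counter of consecutive slave job failures."""

    def __init__(self, max_consecutive_failures=0):
        # 0 means "no limit", matching the documented default.
        self.limit = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, succeeded):
        # Assumption: any success resets the run of failures.
        if succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    def may_create_slaves(self):
        if self.limit == 0:
            return True
        # Assumption: stop once the count reaches the limit.
        return self.consecutive_failures < self.limit
```

With a limit of 3, three failures in a row block further slave creation for that bucket, while a tracker left at the default of 0 never blocks.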
Allocator VOV-9411 23688 Added support for hierarchical Altair Allocators. Please see documentation for details.

Resource plots in the child LA are identical to the plots in the top-most parent LA. In other words, they show the data for the entire resource pool, and not the sub-set of the resource corresponding to the child LA.

The slave definition in the child LA (in the <swd>/slaves.tcl file) should use the hostname of the slave, and not localhost.

For example, 'jaguar' is the hostname here:
vtk_slave_define jag -host jaguar \
    -resources "runMq" \
    -maxload 20.0 \
    -mindisk 0 -disablejobstats 1
This hostname must match the hostname used when adding the child LA to the parent LA (in <swd>/vovlad/config.tcl). For example:
LA::AddSite child_la@jaguar child_la {} -host jaguar -port 8787 -la 1 -version SAME
FlowTracer VOV-11462   This update brings feature parity with vovlsfd. For example, LSFjobname can now be overridden on a per-job basis. Bucket reservations, rather than resource strings, are now used to map jobs to batch-submitted vovslaves. Code to address jobs that have an xdur greater than maxlife has moved into the vovslave itself.

Resolved Issues

The following issues were resolved in this release.

Table 2.
Product Internal Number Case Number Description
All VOV-10110   vovwxd cleaner log files are now preserved for the time specified by the delCleanerLog,older config parameter in vovwxd/config.tcl.
All VOV-11326   Slave slot licenses will be released when a slave exits in Auto Licensing mode.
All VOV-11294   The /local/registry/system-accelerator folder was not always writable because it was created with the user's umask permissions. It is now created with 777 permissions.
All VOV-11350   Fixed statistics for some hierarchical sets.
Accelerator VOV-8012 21662 VOV_LM_VARNAMES functionality will now be available for interactive jobs.
Accelerator VOV-11347 25138 Fixed an issue in vovfsgroup loadconfig where the weight and window values of the FairShare group were not getting set to the values in the config file.
Accelerator VOV-11242 25036 Broken HTML links in the Altair Accelerator Training Guide have been fixed.
Accelerator VOV-10926 24819 Fixed issue with interactive jobs (nc -I) failing when run with a PRECMD that reschedules the job. This also fixes the issue of the PTY overriding the exit code from the PRECMD.
Accelerator VOV-11307 25103 Fixed a race condition in the job fostering system, which is used to properly account for jobs running on a host that has had its vovslave restarted, that could cause the foster jobs to fail and the restarted vovslave to refuse any future stop requests.
Accelerator VOV-11305 25109 A new autokill sequence is now prevented from starting while an existing sequence is still being processed. Prior to this change, an autokill sequence that took more than 5 minutes to process would result in a new sequence starting before the existing one completed. This could leave the slave in a loop that might never end.
Accelerator VOV-11296 25093 A change to the nc hosts command that displayed all reservations a slave may have was backed out due to adverse performance effects on the server. The nc hosts command now shows only the "dominant" reservation, i.e. the oldest unexpired reservation.
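
The "oldest unexpired" rule can be illustrated with a small Python sketch. The Reservation fields (name, created, expires) are hypothetical and are not the server's actual data model. Timestamps are plain numbers, and a reservation is treated as expired once its expiry time is at or before the current time.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Reservation:
    # Hypothetical fields for illustration only.
    name: str
    created: float
    expires: float


def dominant_reservation(
    reservations: Iterable[Reservation], now: float
) -> Optional[Reservation]:
    """Return the oldest reservation that has not expired, if any."""
    unexpired = [r for r in reservations if r.expires > now]
    if not unexpired:
        return None
    # "Oldest" is taken to mean the earliest creation time.
    return min(unexpired, key=lambda r: r.created)
```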
Accelerator, Accelerator Plus VOV-11338 25079 Fixed issues with job resource usage reporting by including detached processes with unique gpids and session ids by matching VOV_JOBID and VOV_SLAVE_PID. The VOV_JOBID to be matched will be taken from the transaction object rather than depending on the subslave environment. Also added NC_JOBID and NC_SLAVE_PID env variables so that WX and NC slaves can both correctly track processes.
Accelerator VOV-11358   Fixed an issue where running Altair Accelerator in interactive mode (nc run -I) with both the input and output redirected would result in lost output.
Accelerator VOV-11336 CS0120656, CS0120663 Fixed issue where preempted jobs may have been prematurely resumed preventing the preempting job from running.
Accelerator VOV-11180   Invalid values for the "RAM", "CORES", and "SLOTS" resources are now handled. Numeric values in the range [0 - 2147483647] are allowed.
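
The allowed range can be expressed as a simple check, sketched here in Python. The function name is hypothetical, and the sketch assumes the values are whole numbers supplied as text; it does not reproduce the product's own parsing.

```python
# Upper bound quoted in the release note (largest signed 32-bit integer).
RESOURCE_VALUE_MAX = 2147483647


def is_valid_resource_value(text):
    """Return True if text is an integer in [0, 2147483647]."""
    try:
        n = int(text)
    except (TypeError, ValueError):
        # Non-numeric input such as "abc" or "1.5" is rejected.
        return False
    return 0 <= n <= RESOURCE_VALUE_MAX
```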
Accelerator VOV-11160   The output of nc getfield JOB CPUTIME with an uppercase "CPUTIME" will now accurately show time in milliseconds instead of 0.
Accelerator VOV-11210   Changed the 'cputime' type from integer to integer64 in the vovshow -fields command output.
Accelerator VOV-11222   Underscores have been removed from the Node Field Names help topic to reflect the updated behavior.
Accelerator VOV-11799 CS0120663 Fixed issue where preempted jobs may have been prematurely resumed preventing the preempting job from running.
Accelerator VOV-11828 CS0120715 The output of "nc getfield JOB CPUTIME" with an uppercase "CPUTIME" will now accurately show time in milliseconds instead of 0.
Accelerator Plus VOV-11276 24834, 25080 Fixed an issue with array submission in WX that would lead to "Illegal set id" errors. This also fixes an issue that resulted in log file conflicts with the error messages "Error: OnLaunchError for <queue>,time: <timestamp>, err: Launcher job failed:" and "FATAL ERROR: Cannot use FILEX <log_filename>"
Accelerator Plus VOV-11234   Fixed an issue with core file generation on the signals SIGSEGV and SIGBUS.
Accelerator Plus VOV-11115   Internal optimization of the WX slave creation process.
Accelerator Plus VOV-11191   vovwxd will no longer create extraneous slave objects and/or processes when launching slaves using the vovlsf.tcl driver.
Accelerator Plus, FlowTracer VOV-11646   vovwxd should no longer attempt to provision extra slaves when the number of pending slaves is sufficient to handle the currently queued load.
Accelerator Plus VOV-11677   Fixed an issue that prevented vovwxd from launching more slaves after the limit was increased in the SWD/vovwxd/config.tcl file unless the vovwxd daemon was restarted.
Accelerator Plus VOV-11645   Fixed issue that caused the PBS_JOBID environment variable to be modified to contain the numeric part of the job ID only.
Accelerator Plus VOV-11630   Modified PBS driver script to use the -V submission option for launcher jobs to ensure that all environment variables required for slave operation are set in the slave's environment. Also added a new configuration item, CONFIG(pbsBin), in the vovwxd configuration file that can be used to specify the location of the PBS binaries (default: /opt/pbs/bin).
Allocator VOV-11306 25123 Fixed a crash that was introduced in 2019.01 u4. The call stack for the crash would have entries similar to the following:
Received signal: SIGSEGV 11 3/15 
vovserver(_Z25vovGenFirstAttachmentFASTPK9VovObjectjl+0x21) [0x753bd1] 
Received signal: SIGSEGV 11 4/15 
TResJobSaIS2_EEPNS_19MatchFalseOOQResultE+0x312) [0x79ecb2] 
Received signal: SIGSEGV 11 5/15 
vovserver(_ZN8VovMQRes12matchHandlesERSt6vectorIP11VovFTResJobSaIS2_EEl+0x140) [0x7bd420] 
Received signal: SIGSEGV 11 6/15 
vovserver(_ZN12VovMQManager27matchAndDistributeResourcesER8VovTracelb+0x3ab) [0x7b011b] 
FlowTracer VOV-10095   vovslaves running under WX or FT with vovwxd will have the environment variable VOV_SLAVE_NAME set to the name of the FT slave spawned by vovwxd. The VOVSLAVE environment variable will no longer be set. CONFIG(slave,timeout) will be passed to the vovslaves launched by vovwxd as the -t option, setting the time allowed for the new vovslave to connect to vovserver.
FlowTracer VOV-11259   Added support for Force Validation of 'PHANTOM' files.
FlowTracer VOV-11187   A prescripts subdirectory can now be placed inside the vovwxd/launcher directory, where it is immune from periodic cleanup by the vovwxd stale file cleaner. The LSFpre: special resource now uses this directory as the base directory for a specified prescript.

Example: LSFpre:mypre.sh results in an LSF submission option of -E ./prescripts/mypre.sh.
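
The mapping in this example can be sketched in Python as follows. The function name is hypothetical and the code only mirrors the single documented case (LSFpre:mypre.sh becoming -E ./prescripts/mypre.sh); it is not the driver's actual implementation.

```python
def lsfpre_to_option(resource):
    """Map an 'LSFpre:<script>' resource to LSF submission arguments.

    Returns an empty list when the resource is not an LSFpre resource
    or names no script.
    """
    prefix = "LSFpre:"
    if not resource.startswith(prefix):
        return []
    script = resource[len(prefix):]
    if not script:
        return []
    # The prescripts directory is the base directory for the script.
    return ["-E", f"./prescripts/{script}"]
```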

FlowTracer VOV-1171   Fixed problem which prevented vovwxd from launching additional slaves as expected when more jobs are added to a bucket that has active jobs.
Monitor VOV-11366   Fixed an issue with the Detailed Plots report that resulted in a Tcl error when generating a report for a feature with no usage for the specified time range. Also fixed an issue with the Usage Trends report that resulted in a Tcl error when generating a report for a feature with a capacity of 1000 or more tokens.
Monitor VOV-11467   Fixed an issue with all SFD packages for Windows where, in some Windows configurations, the controls for installing and controlling a Windows Service were disabled because administrative rights were not detected properly.