What is a Simulation
Simulations allow what-if analysis for determining the most productive way to scale and configure an HPC system.
Simulations evaluate how proposed changes to an HPC system will affect core utilization and job throughput. Run a simulation to perform what-if analysis before scaling HPC resources up or down, for example by increasing memory or CPUs, purchasing more hardware, turning off unused nodes, or changing scheduling policies.
A simulation is based on a snapshot of the HPC system. The snapshot contains the following information:
- Server status (qstat -Bf)
- Queue status (qstat -Qf)
- Resource information (resourcedef file)
- Scheduler information (sched_priv and qmgr list sched)
- Node information (pbsnodes -av)
- Accounting logs
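As a rough illustration of the command-based items listed above, the following Python sketch runs the same PBS Professional commands and collects their output. It is not how the product creates a snapshot; the names SNAPSHOT_COMMANDS and collect are hypothetical, and passing the qmgr directive through the -c option is an assumption about how "qmgr list sched" would be invoked non-interactively. Commands that are not installed are skipped, so the sketch is safe to run on a machine without PBS Professional.

```python
import shutil
import subprocess

# Commands named in the list above (hypothetical grouping for illustration).
SNAPSHOT_COMMANDS = {
    "server_status": ["qstat", "-Bf"],
    "queue_status": ["qstat", "-Qf"],
    "scheduler_info": ["qmgr", "-c", "list sched"],
    "node_info": ["pbsnodes", "-av"],
}


def collect(commands=SNAPSHOT_COMMANDS, timeout=30):
    """Run each command found on PATH and return its output keyed by item name.

    Items whose command is missing or fails map to None.
    """
    results = {}
    for name, argv in commands.items():
        if shutil.which(argv[0]) is None:
            results[name] = None
            continue
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        results[name] = proc.stdout if proc.returncode == 0 else None
    return results
```

The resourcedef file, the sched_priv directory, and the accounting logs are files rather than command output, so they would be copied instead of captured this way.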
To perform a simulation, you select a snapshot, make the proposed changes to the HPC system configuration, and select a workload time interval. You can run the simulation against all historical workload data or against a subset bounded by a start date and an end date.
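Selecting a workload interval can be pictured as a simple date filter over the captured jobs. The sketch below is illustrative only: the job records are made up, the function name select_workload is hypothetical, and filtering on submission time is an assumption, since the criterion the product actually uses is not stated here.

```python
from datetime import datetime


def select_workload(jobs, begin=None, end=None):
    """Return the jobs submitted within [begin, end]; None means unbounded."""
    return [
        job
        for job in jobs
        if (begin is None or job["submitted"] >= begin)
        and (end is None or job["submitted"] <= end)
    ]


# Made-up workload used only to exercise the filter.
jobs = [
    {"id": "1", "submitted": datetime(2024, 1, 10)},
    {"id": "2", "submitted": datetime(2024, 2, 5)},
    {"id": "3", "submitted": datetime(2024, 3, 20)},
]

all_jobs = select_workload(jobs)  # entire history
february = select_workload(jobs, datetime(2024, 2, 1), datetime(2024, 2, 29))
```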
When you run a simulation, you can change the current configuration of your HPC system by tuning scheduling policies, by increasing or decreasing the size of groups of execution nodes called Node Classes, or both. Execution nodes are grouped by their RAM, CPU, and GPU signature; that is, execution nodes that have the same number of CPUs, the same number of GPUs, and the same amount of RAM belong to the same Node Class. The figure below demonstrates this grouping.
For instance, you may want to run a simulation that increases the number of nodes in Node Class Type_0 from 2 to 4. The simulation determines how adding two execution hosts, each having 16 CPUs, 4 GPUs, and 64 GB of RAM, affects core utilization and job throughput for the selected workload interval.
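The grouping and the Type_0 example can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function names and the dictionary layout are hypothetical, and the second node class (32 CPUs, 0 GPUs, 128 GB) is invented so that there is something to group. Only the Type_0 signature and its node counts come from the text above.

```python
from collections import Counter


def node_classes(nodes):
    """Count execution nodes per (cpus, gpus, ram_gb) signature."""
    return Counter((n["cpus"], n["gpus"], n["ram_gb"]) for n in nodes)


def total_cores(classes):
    """Sum the CPUs across all node classes."""
    return sum(cpus * count for (cpus, _gpus, _ram), count in classes.items())


# Hypothetical inventory: two Type_0 nodes plus three nodes of an invented class.
nodes = (
    [{"cpus": 16, "gpus": 4, "ram_gb": 64}] * 2
    + [{"cpus": 32, "gpus": 0, "ram_gb": 128}] * 3
)

baseline = node_classes(nodes)
type_0 = (16, 4, 64)

# What-if change: grow Type_0 from 2 to 4 nodes.
proposed = baseline.copy()
proposed[type_0] = 4

print(total_cores(baseline), total_cores(proposed))
```

With this inventory, the proposed configuration adds 32 cores (two hosts with 16 CPUs each) to the baseline.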
Once a simulation is complete, the results of the simulation are plotted and compared to the results of a baseline simulation. A baseline simulation takes as input the original HPC configuration as provided by the snapshot and the workload selected by the user for the simulation.
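To make the comparison concrete, the sketch below computes the two metrics for a baseline run and a simulated run. The definitions are common ones and are assumptions here, since the exact formulas used by the product are not given: core utilization is taken as core-seconds consumed divided by core-seconds available, and job throughput as completed jobs per hour. All numbers are made up.

```python
def core_utilization(jobs, total_cores, interval_seconds):
    """Fraction of available core-seconds consumed by the jobs."""
    used = sum(j["cores"] * j["run_seconds"] for j in jobs)
    return used / (total_cores * interval_seconds)


def throughput_per_hour(jobs, interval_seconds):
    """Completed jobs per hour over the interval."""
    return len(jobs) / (interval_seconds / 3600)


INTERVAL = 3600  # one-hour workload interval (made-up)

# Jobs completed in the baseline run on 128 cores (made-up).
baseline_jobs = [
    {"cores": 64, "run_seconds": 3600},
    {"cores": 32, "run_seconds": 1800},
]
# With 160 cores, one more job completes within the interval (made-up).
simulated_jobs = baseline_jobs + [{"cores": 32, "run_seconds": 1800}]

baseline_util = core_utilization(baseline_jobs, 128, INTERVAL)
simulated_util = core_utilization(simulated_jobs, 160, INTERVAL)
baseline_tp = throughput_per_hour(baseline_jobs, INTERVAL)
simulated_tp = throughput_per_hour(simulated_jobs, INTERVAL)
```

In this made-up case throughput rises from 2 to 3 jobs per hour while utilization drops slightly, which is the kind of trade-off that comparing against a baseline is meant to expose.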
What Happens During a Simulation
During a simulation, the Simulate solver takes as input an HPC snapshot of an existing PBS Professional cluster, runs a virtual version of the cluster (submitting and running jobs based on the captured workload), and outputs a new snapshot that contains the resulting simulated PBS Professional accounting logs. Because both the input and the output contain PBS Professional accounting logs, Analyze can be used to analyze both.
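Because real and simulated results share the accounting log format, the same parsing code can read either. The following sketch assumes the semicolon-delimited record layout used by PBS Professional accounting logs (timestamp, record type, job identifier, then space-separated key=value pairs). The sample record is fabricated for illustration, parse_record is a hypothetical helper, and values containing spaces are not handled.

```python
from datetime import datetime


def parse_record(line):
    """Split one accounting record into its four fields and parse key=value pairs."""
    timestamp, record_type, record_id, message = line.rstrip("\n").split(";", 3)
    attrs = {}
    for token in message.split():
        # Split on the first "=" only, so values such as "1:ncpus=16" stay intact.
        key, sep, value = token.partition("=")
        if sep:
            attrs[key] = value
    return {
        "time": datetime.strptime(timestamp, "%m/%d/%Y %H:%M:%S"),
        "type": record_type,
        "id": record_id,
        "attrs": attrs,
    }


# Fabricated job-end ("E") record.
sample = (
    "03/15/2024 10:30:00;E;1234.pbsserver;user=alice queue=workq "
    "start=1710495000 end=1710498600 Resource_List.select=1:ncpus=16 "
    "resources_used.walltime=01:00:00"
)

record = parse_record(sample)
run_seconds = int(record["attrs"]["end"]) - int(record["attrs"]["start"])
```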
Why Simulate a Workload Manager
Examples of why you may want to simulate your Workload Manager:
- Explore Policies - What if I reduce the research team’s CPU limit below 100?
- Verify Service Level Agreements - Can preemption improve turnaround for VIP jobs without delaying others?
- Optimize Infrastructure - How many nodes should I buy for my workload? What kinds of nodes? What if I had more nodes?
- Debug - Why are jobs going into a wait state when they should be running?
Architecture
The primary external application programming interface to PBS Professional is the Batch Interface Library (IFL). Any batch service request can be invoked through calls to the IFL, which provides a user-callable function corresponding to each batch client command. All PBS Professional daemons use this library to communicate with the PBS Professional Server.
During a simulation, the Batch Interface Library is replaced by a Simulate library. The Simulate IFL interacts with the snapshot rather than with the PBS Professional Server. The advantage of this architecture is that the PBS Professional Scheduler and the PBS Professional commands (qsub, qstat, qmgr, pbsnodes, etc.) require no code changes; they simply communicate with the Simulate IFL instead of the standard IFL.
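The library substitution described above can be illustrated with a small Python sketch. It is a conceptual model only: the real IFL is a C library, and the names BatchInterface, SnapshotIFL, stat_nodes, submit, and free_cores are hypothetical. The point is that client code written against the interface runs unchanged when a snapshot-backed implementation is supplied.

```python
from __future__ import annotations

from typing import Protocol


class BatchInterface(Protocol):
    """Hypothetical stand-in for the calls a client makes through the IFL."""

    def stat_nodes(self) -> list[dict]: ...

    def submit(self, job: dict) -> str: ...


class SnapshotIFL:
    """Answers requests from snapshot data instead of a live server."""

    def __init__(self, snapshot: dict):
        self._nodes = list(snapshot["nodes"])
        self._jobs: dict[str, dict] = {}

    def stat_nodes(self) -> list[dict]:
        return [dict(n) for n in self._nodes]

    def submit(self, job: dict) -> str:
        job_id = f"{len(self._jobs) + 1}.sim"
        self._jobs[job_id] = dict(job)
        return job_id


def free_cores(ifl: BatchInterface) -> int:
    """Client code: depends only on the interface, not on what backs it."""
    return sum(n["cpus"] for n in ifl.stat_nodes() if n["state"] == "free")


# Made-up snapshot content.
snapshot = {
    "nodes": [
        {"name": "n1", "cpus": 16, "state": "free"},
        {"name": "n2", "cpus": 16, "state": "offline"},
    ]
}
sim = SnapshotIFL(snapshot)
```

Here free_cores plays the role of a scheduler or command: it would work the same way with an implementation that talks to a live server.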