High Performance Computing

The following frequently asked questions are for SPMD: Domain Decomposition Method (DDM).

How many MPI processes (-np) and threads (-nt) per MPI process should I use for DDM runs?

This depends on whether DDM is run on a cluster of separate machines or on a shared memory machine with multiple processors/sockets.

To run parallel MPI processes, distributed memory (with parallel access) is essential. If a single node contains multiple sockets (each with one processor), then in principle one MPI process per socket can be run on the node, provided sufficient RAM is available to hold all MPI processes in memory simultaneously. However, if sufficient distributed memory is not available in RAM, it is typically more efficient to use Shared Memory Parallelization (SMP) instead of DDM and use multiple cores within the node in parallel via the -nt run option.

When each node has only enough RAM to execute a single serial OptiStruct run, activate SMP on each node by splitting the run into multiple threads (using more than four threads, that is, going beyond -nt=4, is usually not effective for such nodes).

Example:

On a 4-node cluster with 2 sockets per node and 8 cores per node (4 cores per socket), you can run:
  • Insufficient RAM (one MPI process per node): optistruct <inputfile> -ddm -np 4 -nt 4
  • Sufficient RAM (one MPI process per socket): optistruct <inputfile> -ddm -np 8 -nt 4

Shared memory machine with multiple processors:

It is important to avoid an out-of-core solution on a shared memory machine with multiple processors/sockets. Otherwise, multiple MPI processes will compete for system I/O resources and slow down the entire solution. Use -core in and limit the number of MPI processes -np to make sure OptiStruct runs in in-core mode. The number of MPI processes -np is usually dictated by the memory demand. Once -np is set, you can determine the number of threads per MPI process based on the total number of cores available.
Note: Total memory usage is the sum of the OptiStruct usage listed in the memory estimation section of the .out file and the MUMPS usage listed in the .stat file. The MUMPS estimate is also listed in the memory estimation section of the .out file.
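For example (the file name and counts are illustrative; verify from the .out and .stat memory estimates that the in-core requirement is actually met), an in-core DDM run on a single 2-socket machine with 16 cores could look like:

  optistruct <inputfile> -ddm -np 2 -nt 8 -core in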
Cluster with separate machines:

A generally cautious method is to set the number of threads per MPI process (-nt) equal to the number of cores per socket, and the number of MPI processes per machine equal to the number of sockets in each machine. You can extrapolate from this to the cluster environment. For example, if one machine in a cluster is equipped with two typical Intel Xeon Ivy Bridge CPUs, you can set two MPI processes per machine and 12 threads per MPI process (-nt=12), since a typical Ivy Bridge CPU consists of 12 cores.
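Extrapolating to an illustrative cluster of four such machines (8 sockets and 96 cores in total), the run would use 8 MPI processes with 12 threads each; how the processes are placed on the hosts depends on your MPI setup and is not shown here:

  optistruct <inputfile> -ddm -np 8 -nt 12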

Starting from version 2018.0, OptiStruct also supports the -cores run option, wherein you only need to specify the total number of cores available for your run (regardless of whether it is a cluster run or a single-node run). OptiStruct automatically assigns -np and -nt based on the total core count specified via -cores.
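For the same illustrative 96-core cluster, the -cores invocation could be as simple as the following (assuming the -ddm flag is still given to request a DDM run):

  optistruct <inputfile> -ddm -cores 96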

Will DDM use less memory for each MPI process than in the serial run?

Yes, the memory per MPI process for a DDM solution is significantly reduced compared to serial runs. DDM is designed for extremely large models on machine clusters. The out-of-core mode also scales very well across multiple MPI processes because the total I/O is distributed and the smaller per-process I/O is better cached by system memory.

Will DDM use less disk space for each MPI process than in the serial run?

Yes. Disk space usage is also distributed.

Can DDM be used in Normal Mode Analysis and Dynamic Analysis?

Yes, refer to the Supported Solution Sequences for DDM Level 2 Parallelization (Geometric Partitioning) section. Both DDM levels (1 and 2) are supported for Direct Frequency Response Analysis, whereas geometric partitioning (level 2) is generally supported for most solutions.

Can DDM be used in Nonlinear Analysis?

Yes, see Supported Solution Sequences for DDM Level 2 Parallelization (Geometric Partitioning). DDM level 1 (task-based parallelization) is not supported for Nonlinear Analysis. However, geometric partitioning via DDM level 2 is generally supported for most solutions.

Can DDM be used in Optimization runs?

Yes, DDM can be used in Analysis and Optimization. For details, refer to Supported Solution Sequences for DDM Level 2 Parallelization (Geometric Partitioning).

Can DDM be used if I have only one subcase?

Yes, the solver utilizes multiple processors/sockets/machines to perform matrix factorizations and analysis.

If I have multiple subcases, should I use DDM?

Yes, DDM is applicable to multiple subcases as well. The run may be even more efficient if the multiple subcases are supported by DDM level 1 (task-based parallelization). Again, this depends on the available memory and disk space resources.

How do I run OptiStruct DDM over a LAN?

It is possible to run OptiStruct DDM over a LAN. Follow the corresponding MPI manual to set up separate working directories on each node where OptiStruct SPMD is launched.

Is it better to run on a cluster of separate machines or on shared memory machine(s) with multiple CPUs?

There is no single answer to this question. If the shared memory machine has sufficient memory to run all tasks in-core, expect faster solution times, as MPI communication is not slowed down by the network. But if the tasks have to run out-of-core, then computations are slowed down by disk read/write delays. Multiple tasks on the same machine may also compete for disk access and, in extreme situations, even result in wall clock times slower than those of serial (non-MPI) runs.