NVIDIA V100 (Volta) versus NVIDIA A100 (Ampere)

Systems

DGX-1: 8x NVIDIA V100 (16 GB, max. 4 GPUs used); 2x Intel Xeon E5 2698 v4; 512 GB RAM; DGX OS 4.0.7; NVIDIA Driver 410.129
DGX-A100: 8x NVIDIA A100 (40 GB, max. 4 GPUs used); 2x AMD EPYC 7742; 1 TB RAM; DGX OS 4.99.11; NVIDIA Driver 450.51
nanoFluidX Software Stack: nanoFluidX 2020.0 with single-precision floating-point arithmetic; CUDA 8.0 GA2 (8.0.61); Open MPI 2.1.6 (with CUDA support)

Results

Minimal Cube: Simple cube of static fluid particles at rest; minimal case to estimate the raw performance of the solver core; two sizes:

Small: ~7 million fluid particles (size of a relatively small production case, slightly smaller than what is recommended for four GPUs)

Big: ~57 million fluid particles (size of a bigger production case)

All Minimal Cube runs cover 1,000 timesteps.

Dambreak: Collapsing water column under gravity in a domain (indicated by lines); good case for evaluating performance with tricky particle distributions, and therefore an indicator of whether load balancing works; two sizes:

Small: ~7 million fluid particles and ~9 million total (slightly smaller than what is recommended for four GPUs)

Big: ~54 million fluid particles and ~64 million total (size of a bigger production case)

All Dambreak runs cover 10,000 timesteps.

Altair E-Gearbox: Showcase by Altair for an e-mobility application; Size: ~6.5 million fluid particles and ~12 million total (slightly smaller than what is recommended for four GPUs); Run covers 10,000 timesteps

Aerospace Gearbox: Another showcase, for aerospace gearbox applications; chosen as a somewhat bigger case than the previous one, so that it fully benefits from four or more GPUs; Size: ~21 million fluid particles and ~26.7 million total; Run covers 10,000 timesteps

Conclusions

The theoretical performance of the solver core is approximately 25 percent higher on the A100 than on the V100 (based on Minimal Cube, with a tendency toward a larger gain for very large cases). Industrial cases seem to benefit slightly more (up to 30 percent). There is no noticeable difference in scalability.

Additional Notes

Performance data in the graphs is always relative to one V100 on the DGX-1.
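The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that performance is taken as the inverse of wall-clock time for a fixed number of timesteps; the report does not state the exact metric. The function name and all timings below are hypothetical placeholders, not measured values from these runs.

```python
def relative_performance(baseline_seconds, run_seconds):
    """Performance of a run relative to the baseline run (baseline = 1.0).

    Assumes performance is inversely proportional to wall-clock time
    for the same case and the same number of timesteps.
    """
    if baseline_seconds <= 0 or run_seconds <= 0:
        raise ValueError("wall-clock times must be positive")
    return baseline_seconds / run_seconds


# Hypothetical wall-clock times in seconds, keyed by (GPU type, GPU count).
# These are placeholders for illustration only, NOT benchmark results.
timings = {
    ("V100", 1): 1000.0,
    ("V100", 2): 520.0,
    ("A100", 1): 800.0,
}

# One V100 on the DGX-1 is the reference for every other run.
baseline = timings[("V100", 1)]
relative = {key: relative_performance(baseline, t) for key, t in timings.items()}

for (gpu, count), value in sorted(relative.items()):
    print(f"{count}x {gpu}: {value:.2f}x relative to 1x V100")
```

With this convention, a value of 1.25 for one A100 corresponds to the "25 percent faster" wording used in the conclusions.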
All cases were run with the WEIGHTED particle interaction scheme.
All solver output was deactivated to focus on solver performance; in general, however, enabling output does not change the results significantly.
There is a slight performance uncertainty because of the CUDA version: compilation for the V100 and A100 compute capabilities is not possible in CUDA 8.0, and newer versions might apply more tailored optimizations. The resulting improvement may be on the order of a few percent, which is generally not significant.
Scalability between one and two GPUs is usually slightly impaired because some parts related to multi-GPU execution may be skipped entirely in single-GPU runs.
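One way to make the scalability note concrete is to compute strong-scaling efficiency against a chosen baseline run. The sketch below is an illustration only: the function name and the timings are hypothetical, and it again assumes that wall-clock time for a fixed case is the underlying measurement. Using the two-GPU run as the baseline is one possible way to exclude the single-GPU shortcut from the comparison.

```python
def scaling_efficiency(base_seconds, base_gpus, run_seconds, run_gpus):
    """Strong-scaling efficiency of a run relative to a baseline run.

    1.0 means ideal scaling: the time shrinks in proportion to the
    increase in GPU count. Values below 1.0 indicate overhead.
    """
    if min(base_seconds, run_seconds) <= 0 or min(base_gpus, run_gpus) <= 0:
        raise ValueError("times and GPU counts must be positive")
    speedup = base_seconds / run_seconds   # measured speedup
    ideal = run_gpus / base_gpus           # speedup under perfect scaling
    return speedup / ideal


# Hypothetical wall-clock times in seconds per GPU count (placeholders,
# NOT measured values from this benchmark).
seconds = {1: 1000.0, 2: 550.0, 4: 280.0}

# Efficiency measured from the single-GPU run, which may skip multi-GPU work.
from_one = scaling_efficiency(seconds[1], 1, seconds[4], 4)

# Efficiency measured from the two-GPU run, which already pays that cost.
from_two = scaling_efficiency(seconds[2], 2, seconds[4], 4)

print(f"1 -> 4 GPUs: {from_one:.2f}")
print(f"2 -> 4 GPUs: {from_two:.2f}")
```

In this made-up example the two-GPU baseline yields the higher efficiency, which is the pattern the note describes: part of the apparent loss comes from the step from one to two GPUs rather than from the further scaling.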