There’s a very good article here about GPU architecture that is worth reading if you’re reasonably literate about computer hardware and want to understand more about how GPUs work.
The interesting thing about this for CAE applications is…almost none of that execution-level stuff matters. The only thing that matters about GPUs is that, per dollar, they offer more memory bandwidth than CPUs do. The compute kernels in CFD codes tend to do between 0.1 and 2 FLOPs per byte, with the result that memory bandwidth, not floating-point capability, is the limiting factor. To make the math simple, we can handwave it to 1 FLOP per byte…in other words, a system will be able to sustain about 1 TFLOPS for every 1 TB/s of bandwidth it has. This will of course vary by code and by the processor’s ability to actually ingest data at its rated rate. So the fact that, for example, an 8-card NVIDIA H100 HGX box has a rated theoretical maximum of 7.9 PFLOPS is irrelevant to CFD. It has 27 TB/s of memory bandwidth, so we can estimate that CFD codes will be able to achieve around 27 TFLOPS, or roughly 0.3% of the theoretical max.
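For concreteness, here is that handwave as a tiny program. It is just the napkin math from above in code form; the bandwidth and peak numbers are the ones quoted in this paragraph, and the 1 FLOP per byte is the same assumption, not a measurement.

```
// Back-of-the-envelope roofline estimate for a bandwidth-bound code:
// sustained FLOP rate ~= memory bandwidth * arithmetic intensity.
// Host-only code; compiles with nvcc or any C++ compiler.
#include <cstdio>

int main() {
    const double bandwidth_TBs  = 27.0;  // 8-card H100 HGX aggregate HBM bandwidth (TB/s)
    const double peak_PFLOPS    = 7.9;   // rated theoretical peak quoted above
    const double flops_per_byte = 1.0;   // handwaved CFD arithmetic intensity

    // 1 TB/s at 1 FLOP per byte sustains 1 TFLOPS
    const double sustained_TFLOPS = bandwidth_TBs * flops_per_byte;
    const double fraction_of_peak = sustained_TFLOPS / (peak_PFLOPS * 1000.0);

    printf("Estimated sustained: %.0f TFLOPS (%.2f%% of peak)\n",
           sustained_TFLOPS, 100.0 * fraction_of_peak);  // ~27 TFLOPS, ~0.34% of peak
    return 0;
}
```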
That sounds rather awful, but at a $300,000 list price for the node (disclaimer: prices will vary by vendor, availability, etc.), I assure you, it’s a steal. If I spend $300,000 on x86 servers, I’m liable to get somewhere in the neighborhood of 12 TB/s of memory bandwidth: a dual-socket x86 server with eight channels of DDR4-3200 per socket nets about 400 GB/s per node for $10,000, and thirty of those come to 12 TB/s. Now, there are a variety of configurations of GPUs and CPUs that come in at different price/performance targets, but in general, you will find GPUs deliver about 2x-5x the performance of CPUs for the same dollar amount. Thus, the CFD industry is very excited about GPUs.
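The same napkin math for bandwidth per dollar, using the list prices quoted above (illustrative only; real quotes vary):

```
// Rough bandwidth-per-dollar comparison. Host-only code; the prices are the
// illustrative list prices from the text, not quotes.
#include <cstdio>

int main() {
    const double gpu_TBs = 27.0, gpu_cost = 300000.0;  // 8-card H100 HGX node
    const double cpu_TBs = 0.4,  cpu_cost = 10000.0;   // dual-socket DDR4-3200 node

    const double gpu_GBs_per_dollar = gpu_TBs * 1000.0 / gpu_cost;  // ~0.09
    const double cpu_GBs_per_dollar = cpu_TBs * 1000.0 / cpu_cost;  // ~0.04

    printf("GPU node: %.3f GB/s per dollar\n", gpu_GBs_per_dollar);
    printf("CPU node: %.3f GB/s per dollar\n", cpu_GBs_per_dollar);
    printf("Ratio:    ~%.1fx\n", gpu_GBs_per_dollar / cpu_GBs_per_dollar);  // ~2.3x
    return 0;
}
```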
My excitement is somewhat guarded. One reason is that GPUs’ internal compute architecture is very restrictive. Those 32-wide warps of stream processors stall out if one thread needs to do something slightly different from the others, which is very common in unstructured CFD. If I recall correctly, the developers of hypre, a popular sparse linear solver library used in many CFD codes, found that while their unstructured IJ solver is indeed very fast on GPUs, it has a much, much lower ceiling than the structured-grid solvers (the Struct/SStruct interfaces), because the irregular data structure of CSR matrices stalls the processor out so much. Add something really irregular, like spray parcels shattering, evaporating, and merging, and SIMT really isn’t the right architecture. From the perspective of CFD, then, the GPU itself, meaning that array of thousands of stream processors, is a necessary evil. That lovely, ultra-fast memory subsystem is mated to an extremely expensive processor that just isn’t designed to do what we need. It’s fantastic at AI and rendering, really! But the more physics you throw into a CFD run, the less the data access patterns look like AI or rendering.
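To make the divergence point concrete, here is a minimal scalar CSR sparse matrix-vector kernel, one row per thread. This is an illustrative sketch of the access pattern, not hypre’s implementation.

```
// Minimal CSR sparse matrix-vector product (y = A*x), one row per thread.
// Illustration only: each thread's loop length is its row's nonzero count,
// so the 32 threads of a warp finish at different times and the whole warp
// waits for the longest row. The x[col[j]] loads are also scattered gathers,
// so they rarely coalesce into full-width memory transactions.
__global__ void spmv_csr_scalar(int n_rows,
                                const int    *row_ptr,  // size n_rows + 1
                                const int    *col,      // column index of each nonzero
                                const double *val,      // value of each nonzero
                                const double *x,
                                double       *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    double sum = 0.0;
    // Divergent loop: neighboring rows in an unstructured mesh may have
    // 4 nonzeros or 40, and adjacent threads handle adjacent rows.
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        sum += val[j] * x[col[j]];
    y[row] = sum;
}
```

On a structured grid the stencil width is fixed, so every thread in a warp runs the same number of iterations and the accesses fall into regular, coalescable patterns; that regularity is a large part of why the structured solvers see a much higher ceiling on GPUs.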
The second issue is that this super-fast bandwidth comes from ultra-expensive 3D-stacked high-bandwidth memory (HBM). The world’s biggest GPU right now, the MI300X, has just 192 GB of HBM3. For its 5.3 TB/s of memory bandwidth, I’d like to have more like 3-8 TB of memory. Oh sure, these things are fantastic at large eddy simulation for aerodynamics, maybe even with a flamelet-generated manifold or a two-phase flow. But if you need to do detailed chemical kinetics, which can add thousands of transported species to the flow, your memory requirements explode. An 8-card H100 box, with its 640 GB of RAM, suddenly feels kind of small when you want to run a 200M-cell case with 500M Lagrangian spray particles and over 1000 chemical species.
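A quick footprint estimate shows why. My assumptions, not figures from anyone’s code: double precision, one stored scalar per species per cell, and a nominal 200 bytes per Lagrangian particle; real codes carry more than this.

```
// Back-of-the-envelope memory footprint for the case described above.
// Host-only code; the per-particle size is a rough assumption.
#include <cstdio>

int main() {
    const double cells              = 200e6;
    const double species            = 1000.0;
    const double particles          = 500e6;
    const double bytes_per_scalar   = 8.0;    // FP64
    const double bytes_per_particle = 200.0;  // position, velocity, size, T, composition...

    const double species_TB  = cells * species * bytes_per_scalar / 1e12;  // ~1.6 TB
    const double particle_TB = particles * bytes_per_particle / 1e12;      // ~0.1 TB

    printf("Species fields alone: %.1f TB\n", species_TB);
    printf("Lagrangian particles: %.1f TB\n", particle_TB);
    printf("8-card H100 box:      %.2f TB of HBM\n", 640.0 / 1000.0);
    return 0;
}
```

That is 1.6 TB for a single copy of the species fields alone, before the mesh, the primary flow variables, the chemistry work arrays, or any solver storage.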

Now we throw in a new wrinkle: not every algorithm is amenable to SIMT computation. I know people who are insistent that every large-scale computing task can be refactored to run on GPUs. All it takes is inventing the right algorithm. Perhaps they are right. But frankly, when so much meshing code in CFD is still serial, this clearly falls into the “much easier said than done” category.
And now, loop all the way back to the main issue: current-gen CPUs don’t have enough memory bandwidth to stay busy. A 4th-gen EPYC server with a pair of 9654 CPUs has about 900 GB/s of bandwidth, but those CPUs are theoretically capable of, I believe, a total of 28.8 TFLOPS. Even with 3D V-Cache giving CFD codes anywhere from a 15%-30% speed boost, CFD codes aren’t coming anywhere close to utilizing all the floating-point performance on an EPYC.
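Applying the same 1-FLOP-per-byte handwave from earlier to that node makes the gap explicit:

```
// The earlier bandwidth-bound estimate, applied to a dual-socket EPYC 9654 node.
// Host-only code; the peak and bandwidth figures are the ones quoted in the text.
#include <cstdio>

int main() {
    const double bandwidth_TBs  = 0.9;   // ~900 GB/s of DDR5
    const double peak_TFLOPS    = 28.8;  // theoretical peak as quoted
    const double flops_per_byte = 1.0;   // handwaved CFD arithmetic intensity

    const double sustained_TFLOPS = bandwidth_TBs * flops_per_byte;  // ~0.9 TFLOPS
    printf("Estimated sustained: %.1f TFLOPS (~%.0f%% of peak)\n",
           sustained_TFLOPS, 100.0 * sustained_TFLOPS / peak_TFLOPS);  // ~3%
    return 0;
}
```

Call it around 3% of theoretical peak before the extra cache helps.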
Now, you go to war with the army you have, and you do your CFD with the computational hardware you’ve got. Nobody’s giving me 29 TB/s of memory bandwidth for a 192-core CPU node. But what if they did?
What if you gave me 5 TB/s of memory bandwidth on that node? What if you could give me substantially more capacity than a single GPU’s 80 GB, like 3-8 TB? Would that change the market?