Trucks are a big part of getting work done in the modern world. It’s safe to say that without trucks, our world would look radically different. One truck you see around Texas a lot is the one-ton “dually” pickup. It’s a big truck, and it hauls a lot of stuff.
But for serious hauling? Nothing beats a semi tractor. Even with the extended cab, optional diesel engine, and dually wheels, the nicest pickup in the world isn’t what you need to haul 50 tons of cargo from L.A. to St. Louis. The pickup has a better unloaded 0-60 time and better fuel mileage, but the more tons of cargo you have to haul, the less appealing it becomes.
Now, what’s the fundamental difference between these trucks? I want you to look beyond raw numbers like horsepower and torque and think about the philosophy behind their design. At heart, even the biggest commercial pickup you can buy has its roots in a consumer vehicle. A fully loaded commercial pickup, like a Ram 3500 Dually with crew cab, extended bed, and Cummins turbodiesel engine, traces back to the much more modest Ram 1500 you can find in suburban driveways all over America. Effectively, it’s a consumer truck that’s been enhanced to commercial grade.
By contrast, there’s no consumer version of a semi tractor. Mack makes a lot of trucks, but no matter how many features you strip off a Mack Anthem, it’s never going to be a daily driver. It’s not just the motor, either; it’s the entire design of the vehicle. A modern pickup diesel has more power and torque than the semi engines of the 1980s, but you still wouldn’t want to haul a 15-ton trailer across the country with one. That’s just not what it’s built for as a complete system. A semi tractor is designed to do one thing: haul big loads. In fact, when it’s unloaded, it’s just a slow, unwieldy truck with terrible fuel mileage.
What’s true of trucks is just as true of server CPU cores. Even the largest, most expensive x86 server CPUs share a design heritage with desktop and laptop cores. They’re not exactly the same, of course. The server version will typically have more memory channels, more PCIe lanes, and other features to make it enterprise-capable. But what makes it fundamentally similar to a consumer CPU core is that it is designed to execute a single task as quickly as possible. Modern x86 CPUs do have hyperthreading (two-way SMT) to run two tasks at once, but the benefit tends to be small, and Intel has already retired the technology on its desktop CPUs. The main way x86 server CPUs deliver higher task throughput is simply by adding more cores.
A Power core, however, is very different. Power cores are huge, about twice as large as x86 cores. But while an x86 core has all kinds of advanced technologies to execute a single task as quickly as possible, a Power core has an internally parallel design to execute eight tasks as quickly as possible. It’s not about the speed of the individual task; it’s about delivering the full load of tasks, much like a big rig loaded with cargo will deliver 20 tons of goods faster than a pickup truck can.
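If you want to see this design difference for yourself, here’s a quick sketch (my own illustration, assuming a Linux system that exposes the standard sysfs CPU-topology files) that prints which hardware threads share a single physical core:

```python
# Quick check of how many hardware threads share one physical core on Linux.
# A sketch only; it assumes the standard sysfs CPU-topology files are present.
from pathlib import Path

siblings = Path("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list")
# On an SMT-8 Power system this prints eight CPU ids (e.g. "0-7");
# on a typical x86 box with hyperthreading enabled, just two.
print("Hardware threads sharing core 0:", siblings.read_text().strip())
```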
Let’s look at a simple example of a parallel process: the Monte Carlo benchmark in FinanceBench. Monte Carlo (MC) simulations are popular in the finance world for risk analysis. FinanceBench’s MC only runs on 8 threads, so we’re going to limit ourselves to just 8 Power cores. As we load more and more jobs onto the cores, the individual jobs do slow down…but those jobs are running simultaneously. The job multiplier outweighs the slowdown, resulting in a net overall speedup.
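If you haven’t run into the technique before, here’s a minimal sketch of the kind of calculation involved. To be clear, this is not FinanceBench’s kernel; it’s a toy that prices a European call option by averaging simulated payoffs across a pool of worker processes, with option parameters made up purely for illustration:

```python
# Toy Monte Carlo option pricer: a stand-in for the kind of work the
# FinanceBench MC benchmark does, not its actual code.
import math
import random
from multiprocessing import Pool

# Hypothetical option parameters, chosen only for illustration.
S0, K, r, sigma, T = 100.0, 105.0, 0.02, 0.25, 1.0
PATHS_PER_WORKER = 250_000

def simulate(seed: int) -> float:
    """Average discounted payoff of a European call over many random paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(PATHS_PER_WORKER):
        z = rng.gauss(0.0, 1.0)
        terminal = S0 * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
        total += max(terminal - K, 0.0)
    return math.exp(-r * T) * total / PATHS_PER_WORKER

if __name__ == "__main__":
    workers = 8  # mirrors the benchmark's eight threads
    with Pool(workers) as pool:
        estimates = pool.map(simulate, range(workers))
    print(f"Estimated option price: {sum(estimates) / len(estimates):.4f}")
```

Each worker simulates its share of the paths independently, which is why Monte Carlo work divides so cleanly across hardware threads.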
A single SMT-8 core is really two small SMT-4 cores connected together, so adding a second task doubles our throughput. The surprising thing is that adding more and more tasks keeps increasing throughput until, with all eight hardware threads loaded, our job throughput is 3.6x higher. In practice, we would always run at least two jobs per core, so SMT-8 gives us about a 2x speedup over that baseline.
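To make that arithmetic concrete, here’s a tiny back-of-the-envelope model. The per-job slowdown below isn’t a separate measurement; it’s implied by the 3.6x throughput figure quoted above:

```python
# Throughput model: N concurrent jobs on a core, each running `s` times slower
# than it would alone, still deliver a net throughput gain of N / s.
def throughput_speedup(jobs_per_core: int, per_job_slowdown: float) -> float:
    return jobs_per_core / per_job_slowdown

# A 3.6x gain with all eight hardware threads loaded implies each job ran
# roughly 8 / 3.6, or about 2.2x, slower than it would have on an idle core.
implied_slowdown = 8 / 3.6
print(f"Implied per-job slowdown at SMT-8: {implied_slowdown:.2f}x")
print(f"Net throughput speedup: {throughput_speedup(8, implied_slowdown):.1f}x")
```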
Let’s visualize what happens to queue time and compute time if we want to run eight jobs on this system. This is what job execution looks like when we run two jobs per core:
Four jobs per core:
And eight jobs per core:
Load up the cores, get more work done. Like I said, it’s like a big rig with a full trailer: if you measure a big rig in terms of cargo-miles per hour, it’s the fastest vehicle on the road!
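The idea behind those pictures boils down to a simple makespan model: fewer jobs per core means more batches waiting in the queue, while more jobs per core means each batch runs somewhat slower. Here’s a sketch using the throughput factors quoted earlier for two and eight jobs per core:

```python
# Makespan model for eight identical jobs on eight cores, stacked in batches
# of `jobs_per_core`. Later batches wait in the queue while earlier ones run.
# Per-job slowdown is back-calculated as jobs_per_core / throughput_gain.
def makespan(total_jobs: int, jobs_per_core: int, throughput_gain: float,
             single_job_time: float = 1.0) -> float:
    batches = total_jobs // jobs_per_core
    per_job_slowdown = jobs_per_core / throughput_gain
    return batches * single_job_time * per_job_slowdown

print(makespan(8, 1, 1.0))  # one job at a time: 8.0 time units
print(makespan(8, 2, 2.0))  # two jobs per core: 4.0 time units
print(makespan(8, 8, 3.6))  # all eight at once: ~2.2 time units
```

Less time waiting in the queue more than makes up for each individual job running slower.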
Let’s look at a somewhat more realistic example. Your typical HPC system has a pretty heterogeneous workload running on it. I ran four programs (WRF, OpenRadioss, OpenFOAM, and LAMMPS), two instances of each, with every job running at np = 120 and “stacked” on the cores. I compared that to running each job at np = 240 sequentially, the fastest each one could go.
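For clarity, here’s a sketch of how that kind of comparison can be scored. The runtimes below are illustrative placeholders, not the measured WRF, OpenRadioss, OpenFOAM, or LAMMPS numbers: the baseline is the sum of each job’s fastest standalone (np = 240) runtime, and the stacked case is the wall-clock time for the whole batch running together.

```python
# Net speedup for a batch of jobs: total time to clear the batch one job at a
# time (each at its fastest) versus the wall-clock time with all jobs stacked.
def net_speedup(sequential_runtimes: list[float], stacked_makespan: float) -> float:
    return sum(sequential_runtimes) / stacked_makespan

# Placeholder numbers only: eight jobs that each take 1.0 hour alone would need
# 8.0 hours run back to back; clearing them stacked in 4.7 hours is a 1.7x gain.
print(f"Net speedup: {net_speedup([1.0] * 8, 4.7):.1f}x")
```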
Once again, we see massive gains from running all the jobs together and saturating the cores with as much work as they can handle. The net speedup is 1.7x, pretty close to the 2x we saw with the much more synthetic benchmark. Since HPC machines typically have large batches of jobs to process, this technology is clearly well suited to accelerating total job throughput in HPC environments.