If money isn’t an issue, you can buy all the most expensive consumer PC parts and build a mighty PC just to check your email and scroll through social media. Of course, this isn’t how most people buy things; it’s not even how rich people buy things, as that’s not a great way to stay rich. Instead, most people look at what they want to do with a computer and then find a computer that has suitable hardware.
In the home market, there’s a decent amount of choice, but once you get to the workstation and server market, there are even more powerful options for even more money. For example, the best PC you can build at home tops out at 16 cores (or 24 if you count Intel’s efficiency cores). You can also get a powerful GPU. Technically, you can get multiple powerful GPUs, but you can’t use them together, as SLI/NVLink is essentially dead.
In the server and workstation market, you can get far more cores in a CPU, up to 96 in AMD’s EPYC lineup. You can also get GPUs with more capable interconnects and more VRAM. CPU cores, though, are where a lot of the money goes, especially in the HPC (High-Performance Computing), hyperscaler, and supercomputing worlds. So, what do you do if you need more than 96 cores in one computer? Add more CPUs, obviously.
Multi-Socket Motherboards
Of course, you can’t just slap a second CPU on any old motherboard; there’d be nowhere for it to go. You need specific hardware. AMD supports placing two of its EPYC server CPUs on the same motherboard, for a total of 192 cores and 384 threads. Intel’s latest server CPUs max out at 40 cores, though the previous generation featured a 56-core model. Intel, however, supports up to 8 CPUs on a single motherboard. That’s 320 or 448 cores and 640 or 896 threads. While this is overkill for checking Instagram, some workloads can use all this horsepower.
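If you’re curious how many physical sockets a machine actually has, Linux exposes the topology through sysfs. Here’s a minimal C sketch, assuming a Linux system with the standard sysfs layout, that counts distinct physical packages by reading each CPU’s physical_package_id (error handling is kept to a bare minimum):

```c
/* Sketch: count the distinct physical CPU packages (sockets) on a Linux
 * system by reading sysfs. Assumes the standard sysfs topology layout. */
#include <stdio.h>

int main(void) {
    char path[128];
    int seen[64] = {0};  /* assume at most 64 packages for this sketch */
    int sockets = 0;

    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;  /* no more CPUs to inspect */
        int pkg;
        if (fscanf(f, "%d", &pkg) == 1 && pkg >= 0 && pkg < 64 && !seen[pkg]) {
            seen[pkg] = 1;
            sockets++;
        }
        fclose(f);
    }
    printf("%d socket(s) detected\n", sockets);
    return 0;
}
```

On a typical desktop, this should print 1 socket; on a dual-socket EPYC board, 2.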
The problem comes from memory. Four things generally limit CPUs. The first is a lack of things to do; sometimes, the CPU just isn’t loaded. Next, you have power: there’s only so much power you can draw before you start damaging the CPU, and limits are in place to ensure the CPU isn’t at risk of burning out under full load. Closely related is temperature: the more power you use, the more heat you generate and have to dissipate, and overheating is just as bad as too much power, as things start to melt. The final limitation is memory access.
A CPU typically needs a lot of data to perform a lot of processing, and all of that data is stored in RAM. Unfortunately, RAM is pretty slow compared to a CPU, which can leave the CPU idle for “ages” before it gets the data it needs to operate. CPU cache helps a lot, but it’s so small it can’t cover everything, and the main memory still has to be accessed.
Memory Latency
To minimize the effect of RAM being slow, it is physically placed as close to the CPU as possible. This is why RAM is always located directly next to the CPU socket on a motherboard. But what happens if you have multiple CPUs on a single motherboard? Then each CPU can reach the RAM directly next to it faster than the RAM next to another CPU. “Oh no,” you might say, “some memory is slightly slower.” But this is a real issue that can have a surprisingly profound effect on performance. This concept is called Non-Uniform Memory Access, or NUMA.
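On Linux, you can see this layout programmatically with libnuma. The sketch below, assuming libnuma is installed (compile with -lnuma), prints how many NUMA nodes the system has and which node each CPU belongs to. On a single-socket desktop, you’ll usually see just one node:

```c
/* Sketch: query the NUMA topology with libnuma (compile with -lnuma). */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();
    int cpus  = numa_num_configured_cpus();
    printf("%d NUMA node(s), %d CPU(s)\n", nodes, cpus);

    /* Map each CPU to the NUMA node (and therefore the RAM) closest to it. */
    for (int cpu = 0; cpu < cpus; cpu++)
        printf("CPU %2d -> node %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}
```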
NUMA provides a mechanism for the operating system to understand that while it can access all the memory, some regions are faster to reach from some CPUs than from others. Where possible, the OS then stores data for tasks running on CPU1 in the RAM directly next to CPU1. Similarly, data needed by a task running on CPU2 is stored in the RAM directly next to CPU2. Of course, with limited RAM capacities and massive data sets, staying within these confines is not always possible. Still, best efforts are made, and they have a significant impact on performance.
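Applications can also opt into this placement explicitly rather than relying on the OS’s best efforts. Here’s a sketch using libnuma on Linux (link with -lnuma); the node number and buffer size are arbitrary choices for illustration. It pins the current thread to node 0’s CPUs and backs a buffer with node 0’s RAM, keeping compute and data together:

```c
/* Sketch: keep a thread and its data on the same NUMA node with libnuma. */
#include <stdio.h>
#include <string.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0)
        return 1;

    int node = 0;                      /* arbitrary: treat node 0 as "local" */
    size_t size = 64UL * 1024 * 1024;  /* 64 MiB, also arbitrary */

    numa_run_on_node(node);            /* schedule this thread on node 0's CPUs */

    char *buf = numa_alloc_onnode(size, node);  /* bind the buffer to node 0 RAM */
    if (!buf)
        return 1;

    memset(buf, 0xAB, size);  /* touching the pages faults them in on node 0 */
    printf("buffer allocated and touched on node %d\n", node);

    numa_free(buf, size);
    return 0;
}
```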
Memory access over a single channel is also sequential. This means that when two different CPUs try to access data on the same channel, one directly connected to the DIMM and the other a NUMA hop away, the second CPU doesn’t just wait for its own request; it has to wait for the other processor’s request to complete first. As such, wherever possible, data should be stored in the RAM directly next to the CPU that will need it.
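To make the local-versus-remote gap concrete, here’s an illustrative microbenchmark sketch (Linux + libnuma, requires at least two NUMA nodes). It times the same read pass over a buffer placed on the local node and over one placed on the farthest node. A sequential pass measures bandwidth as much as latency, and the exact numbers vary by machine, but the remote pass is typically measurably slower:

```c
/* Sketch: compare a read pass over node-local RAM versus remote RAM.
 * Requires Linux, libnuma (-lnuma), and at least two NUMA nodes. */
#include <stdio.h>
#include <time.h>
#include <numa.h>

#define SIZE (256UL * 1024 * 1024)  /* 256 MiB per buffer */

/* Read one byte per cache line across the whole buffer; return seconds. */
static double read_pass(volatile char *buf) {
    struct timespec t0, t1;
    unsigned long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < SIZE; i += 64)
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;  /* volatile reads keep the loop from being optimized away */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA system with at least two nodes\n");
        return 1;
    }
    numa_run_on_node(0);  /* keep this thread on node 0's CPUs */

    char *local  = numa_alloc_onnode(SIZE, 0);                /* node-0 RAM */
    char *remote = numa_alloc_onnode(SIZE, numa_max_node());  /* farthest node */
    if (!local || !remote)
        return 1;

    /* First touch faults the pages in on their assigned nodes. */
    for (size_t i = 0; i < SIZE; i += 4096) {
        local[i] = 1;
        remote[i] = 1;
    }

    printf("local pass:  %.3f s\n", read_pass(local));
    printf("remote pass: %.3f s\n", read_pass(remote));

    numa_free(local, SIZE);
    numa_free(remote, SIZE);
    return 0;
}
```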
Conclusion
NUMA stands for Non-Uniform Memory Access. It’s a term used in computer systems with multiple physical CPUs. It refers to the fact that a CPU has lower memory latency to the RAM directly surrounding it than to the RAM surrounding another CPU. The extra latency decreases system performance in multiple ways, and NUMA is a way to inform the operating system that this is the case.
This awareness allows the OS to optimize memory usage and data locality based on the CPU that needs the data. Where possible, all data for the processes running on a CPU is stored in the RAM directly attached to that CPU. When the local RAM doesn’t have enough capacity, data can spill over into the RAM around other CPUs. Again, where possible, the number of NUMA hops is minimized to reduce latency.