Modern CPUs run incredibly fast and can significantly outperform the system RAM. Left unchecked, this speed imbalance between CPU and memory would leave the processor sitting idle much of the time, waiting for data to arrive before it can continue executing. To prevent this, and to allow CPUs to keep getting faster, a CPU cache is used.
How does a CPU cache speed up a CPU?
The CPU cache is designed to be as fast as possible and to hold the data that the CPU requests. Its speed is optimised in three ways: latency, bandwidth, and proximity. The CPU cache operates at very low latencies, minimising the time it takes for a result to be returned. For example, the Intel i9-9900K has cache latencies of 0.8, 2.4, and 11.1 nanoseconds for the L1, L2, and L3 caches respectively. In comparison, the latency of modern high-speed RAM is on the order of 14 nanoseconds.
Tip: The cache levels will be explained in more detail later, but simply put, the lower-numbered levels of cache are faster but more expensive, so they have smaller capacities. A nanosecond is a billionth of a second, so a latency of 0.8 nanoseconds means a result is returned in less than a billionth of a second.
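To put those latency figures side by side, here is a quick bit of arithmetic using the i9-9900K numbers quoted above. The figures are the article's; the sketch simply divides them to show how many times sooner each cache level answers than RAM does.

```python
# Latency figures quoted above for the i9-9900K, in nanoseconds.
latency_ns = {"L1": 0.8, "L2": 2.4, "L3": 11.1, "RAM": 14.0}

# How many times sooner each cache level returns a result than RAM.
speedup_vs_ram = {level: round(latency_ns["RAM"] / ns, 1)
                  for level, ns in latency_ns.items() if level != "RAM"}
print(speedup_vs_ram)  # {'L1': 17.5, 'L2': 5.8, 'L3': 1.3}
```

Even the large, shared L3 still answers faster than RAM; the tiny per-core L1 answers more than seventeen times faster.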
In terms of bandwidth, the CPU cache offers significant performance improvements over traditional storage and RAM. Read speeds of the L1 and L3 cache can peak at 2.3 TB/s and 370 GB/s respectively, while the bandwidth of RAM is typically around 40 GB/s. This increased bandwidth means that the CPU cache can transfer data to the CPU a lot faster than RAM can.
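The bandwidth difference is easier to feel as transfer time. This sketch uses the peak figures quoted above; real sustained bandwidth varies with workload, so treat these as ballpark numbers.

```python
# Time to move a block of data at the peak bandwidths quoted above.
def transfer_us(size_bytes, bytes_per_second):
    """Microseconds needed to move size_bytes at the given bandwidth."""
    return size_bytes / bytes_per_second * 1e6

MB = 10**6
print(round(transfer_us(MB, 40e9), 2))    # 25.0 -- 1 MB from RAM (40 GB/s)
print(round(transfer_us(MB, 370e9), 2))   # 2.7  -- 1 MB from L3 (370 GB/s)
print(round(transfer_us(MB, 2.3e12), 2))  # 0.43 -- 1 MB from L1 (2.3 TB/s)
```

The same megabyte streams out of L1 roughly 57 times faster than out of RAM, which is why keeping hot data in cache matters so much.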
To achieve the maximum possible speeds, the CPU cache is built into the silicon of the CPU die itself. This minimises the distance that any electrical signals need to travel, keeping the latency as low as possible. For example, when the L3 cache was first moved from the motherboard to the CPU die, the processor of the time (Pentium 4 EE) gained a 10-20% performance improvement.
CPU cache architecture
Modern CPUs generally use three layers of CPU cache labelled L1-3, with lower-numbered caches being closer to the CPU cores, faster, and more expensive. Each individual CPU core in a multi-core CPU has its own L1 cache. It is typically split into two portions, the L1I and L1D. The L1I is used to cache instructions for the CPU while L1D is used to cache the data on which those instructions are to be performed.
Each CPU core typically also has its own L2 cache on a modern CPU. The L2 cache is larger and slower than the L1 cache and is used primarily to store data that wouldn’t otherwise fit in the L1 cache. By having a dedicated L2 cache per core, cache contention is avoided. Cache contention is where different cores fight to claim cache space for their own workloads, which can lead to important data being cleared from the cache.
The L3 cache is typically shared between all the CPU cores of the processor. Again, the L3 cache is slower than the L2 cache but is cheaper and larger. By providing a shared cache it’s possible to reduce the amount of data that would be duplicated on lower levels of per-core cache.
Tip: As an example of cache sizes, Intel’s i9-9900K has 64KB of L1 and 256KB of L2 cache per core (for totals of 512KB L1 and 2MB L2 across its eight cores), plus a 16MB shared L3 cache.
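The layout from the tip above can be written out as data. The per-core sizes are the quoted i9-9900K figures, and the eight-core count is that CPU's; the totals fall out by multiplication.

```python
# The i9-9900K cache layout described above, as data.
cores = 8
per_core_kb = {"L1": 64, "L2": 256}  # each core gets its own L1 and L2
l3_shared_kb = 16 * 1024             # one 16 MB L3 shared by all cores

# Per-core caches scale with the core count; the shared L3 does not.
totals_kb = {level: kb * cores for level, kb in per_core_kb.items()}
print(totals_kb)  # {'L1': 512, 'L2': 2048} -> 512 KB L1, 2 MB L2 in total
```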
How is the CPU cache used?
All levels of the CPU cache are used to speed up processor performance by caching data from RAM. When a CPU requests data, it searches through its cache layers first in an attempt to get the data as fast as possible. If the data is found, known as a cache hit, the CPU can continue its processing. If the data isn’t in the cache, known as a cache miss, the CPU has to check the RAM, and then the hard drive if the data isn’t there either. The fastest layers are always checked first for maximum performance.
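That lookup order can be sketched as a toy model: each level that misses adds its own lookup time before the next level is tried. The latencies reuse the i9-9900K figures from earlier; the hit decisions here are illustrative, not measured.

```python
# Toy model of the lookup order described above: L1, then L2, then L3,
# then RAM. Each miss costs that level's latency before moving on.
def access_time_ns(found_in):
    """Total latency in ns when the data is first found at level found_in."""
    levels = [("L1", 0.8), ("L2", 2.4), ("L3", 11.1), ("RAM", 14.0)]
    total = 0.0
    for name, latency in levels:
        total += latency  # pay this level's lookup cost
        if name == found_in:
            return total
    raise ValueError(f"unknown level: {found_in}")

print(access_time_ns("L1"))            # 0.8 -- a hit in L1 is the best case
print(round(access_time_ns("RAM"), 1)) # 28.3 -- every cache level missed first
```

This is why cache misses hurt: missing every level costs more than a plain RAM access would, since the failed lookups still take time.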
To help ensure the data the CPU needs is already in the cache when it’s requested, the cache attempts to predict what data the CPU might need next, a technique called prefetching. For example, if the CPU has requested some data for an image it’s rendering, the cache may pre-emptively fetch more of the image data so it can be fed to the CPU as fast as possible.
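A minimal sketch of that idea is sequential prefetching: on a miss, fetch the requested block and the next few blocks as well, betting that access will continue in order (as when streaming image data). The class name, block numbering, and prefetch depth here are made up for illustration.

```python
# Toy sequential prefetcher: a miss pulls in the missed block plus the
# next few, so a sequential scan mostly hits after the first miss.
class PrefetchingCache:
    def __init__(self, prefetch_depth=2):
        self.lines = set()                  # block numbers currently cached
        self.prefetch_depth = prefetch_depth
        self.hits = self.misses = 0

    def access(self, block):
        if block in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            # Fetch the missed block and the next prefetch_depth blocks.
            for b in range(block, block + 1 + self.prefetch_depth):
                self.lines.add(b)

cache = PrefetchingCache(prefetch_depth=2)
for block in range(9):   # sequential scan of blocks 0..8
    cache.access(block)
print(cache.misses, cache.hits)  # 3 6 -- prefetching turned most misses into hits
```

Without prefetching, all nine accesses would miss; with it, only every third block does, which is the kind of win the cache is chasing when it guesses ahead.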