Every modern processor features a small amount of cache memory. Over the past few decades, cache architectures have become increasingly complex: the number of cache levels has increased, the size of each block has grown, and cache associativity has undergone several changes as well. But before we dive into the specifics, what exactly is cache memory and why do you need it? And since modern processors include L1, L2, and L3 caches, what’s the difference between these cache levels?
Cache Memory vs System Memory: SRAM vs DRAM
Cache memory is based on the much faster (and more expensive) Static RAM (SRAM), while system memory uses the slower Dynamic RAM (DRAM). The main difference between the two is that an SRAM cell is built entirely from transistors in CMOS technology (typically six per bit), while a DRAM cell stores each bit as charge on a capacitor paired with a transistor.
DRAM needs to be constantly refreshed (because the capacitors leak charge) to retain data. This refreshing consumes extra power and contributes to DRAM being slower. SRAM doesn’t have to be refreshed and is much faster. However, its higher cost and lower density per bit have prevented mainstream adoption as main memory, limiting its use to processor cache.
Why Is Cache Memory Important in Processors?
Modern processors are light years ahead of their primitive ancestors from the 80s and early 90s. These days, top-end consumer chips run at well over 4GHz, while most DDR4 memory modules are clocked below 1800MHz. As a result, system memory is too slow to work directly with CPUs without severely slowing them down. This is where cache memory comes in. It acts as an intermediary between the two, storing small chunks of frequently used data or, in some cases, the memory addresses of that data.
L1, L2 and L3 Cache: What’s the Difference?
In contemporary processors, cache memory is divided into three levels: L1, L2 and L3, in order of increasing size and decreasing speed. L3 is the largest and also the slowest cache level (3rd Gen Ryzen CPUs feature up to 64MB of L3 cache). L2 and L1 are much smaller and faster than L3 and are separate for each core. Older processors didn’t include a third-level L3 cache, so the system memory interacted directly with the L2 cache.
L1 cache is further divided into two sections: the L1 data cache and the L1 instruction cache. The latter contains the instructions that need to be executed by the CPU, while the former holds the data those instructions read and write, including results that will be written back to main memory.
The L1 instruction cache doesn’t only hold instructions; it also holds pre-decode data and branching information. Furthermore, while the L1 data cache often acts as an output cache, the instruction cache behaves like an input cache. This is helpful when loops are executed, as the required instructions are right next to the fetch unit.
Flagship consumer CPUs include up to 512KB of L1 cache in total (64KB per core), while server parts feature almost twice as much.
L2 cache is much larger than L1 but also slower. It ranges from 4MB to 8MB in total on flagship CPUs (512KB per core). Each core has its own L1 and L2 cache, while the last level, the L3 cache, is shared across all the cores.
L3 cache is the last-level cache, the one closest to main memory. It varies from 10MB to 64MB on consumer parts, while server chips feature as much as 256MB. Furthermore, AMD’s Ryzen CPUs have a much larger cache compared to rival Intel chips. This is because of AMD’s MCM (multi-chip module) design versus the monolithic design on the Intel side.
When the CPU needs data, it first searches the associated core’s L1 cache. If it’s not found there, the L2 and L3 caches are searched next. If the necessary data is found, it’s called a cache hit. If the data isn’t present in any cache level, it’s called a cache miss: the CPU has to request that the data be loaded into the cache from main memory or storage, which takes time and adversely affects performance.
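The lookup order described above can be sketched in a few lines of Python. This is only an illustration: the function name `lookup`, the example addresses, and the choice to model each level as a plain set of addresses are all assumptions made for the sketch, not how real hardware is built.

```python
def lookup(address, levels):
    """Search each cache level in order and report where the address was found.

    `levels` is a list of (name, cached_addresses) pairs ordered from the
    fastest level to the slowest. Returns "miss" if no level holds the address.
    """
    for name, contents in levels:
        if address in contents:
            return name  # cache hit at this level
    return "miss"        # cache miss: data must come from main memory


# Hypothetical cache contents, chosen only for the example.
levels = [
    ("L1", {0x1000}),
    ("L2", {0x1000, 0x2000}),
    ("L3", {0x1000, 0x2000, 0x3000}),
]

for addr in (0x1000, 0x2000, 0x3000, 0x4000):
    print(hex(addr), "->", lookup(addr, levels))
```

The further down the list a hit occurs, the longer the access takes; the last address falls through every level and would be fetched from main memory.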
Generally, the cache hit rate improves when the cache size is increased. This is especially true in the case of gaming and other latency-sensitive workloads.
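To get a feel for how hit rate responds to cache size, here is a small simulation. It is a sketch under stated assumptions: the cache is modeled as fully associative with least-recently-used (LRU) replacement, and the access trace is synthetic (80% of accesses go to 32 "hot" lines). Neither the replacement policy nor the trace comes from a real processor or workload.

```python
import random
from collections import OrderedDict


def hit_rate(trace, capacity):
    """Replay `trace` through an LRU cache that holds `capacity` lines."""
    cache = OrderedDict()
    hits = 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return hits / len(trace)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic trace: 80% of accesses hit 32 "hot" lines, the rest 224 "cold" ones.
trace = [
    rng.randrange(32) if rng.random() < 0.8 else rng.randrange(32, 256)
    for _ in range(20_000)
]

for capacity in (8, 32, 128, 256):
    print(f"{capacity:>3} lines: hit rate {hit_rate(trace, capacity):.2%}")
```

With LRU replacement, a larger cache never does worse than a smaller one on the same trace, so the printed hit rates rise (or stay flat) as the capacity grows.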
With the basic explanations out of the way, let’s talk about how the system memory maps to the cache memory. The cache memory is divided into blocks (also called sets). These blocks are in turn divided into n 64-byte lines. The system memory is divided into the same number of blocks as the cache, and then the two are linked.
Take a 512 KB cache with 64-byte lines: it is divided into 8,192 lines (512 KB / 64 bytes), which are then grouped into blocks. This is called n-way set-associative cache. With a 2-way associative cache, each block contains two lines; 4-way has four lines per block, 8-way has eight, and 16-way has sixteen.
If that 512 KB cache is 4-way associative, it has 2,048 blocks (8,192 / 4). With 1GB of system RAM, the RAM is likewise divided into 2,048 blocks of 512 KB each (1GB / 2,048), and each one is linked to one of the 4-line cache blocks.
In the same way, with a 16-way associative cache, the 512 KB cache is divided into 512 blocks of 16 lines each, linked to 512 blocks of 2,048 KB each in the memory. When the required data isn’t in the lines of a cache block, the cache controller replaces one of that block’s lines with the required data from memory so the processor can continue executing.
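The arithmetic in this example is easy to check with a short script. The sketch below uses the figures from the text (a 512 KB cache, 64-byte lines, 1GB of RAM) and follows the article's simplified model of splitting RAM into as many blocks as the cache has. The function name `set_layout` is hypothetical.

```python
KB = 1024
GB = 1024 ** 3


def set_layout(cache_bytes, line_bytes, ways, ram_bytes):
    """Return (total cache lines, cache blocks, size of each RAM block in bytes)."""
    lines = cache_bytes // line_bytes   # e.g. 512 KB / 64 B = 8192 lines
    blocks = lines // ways              # each block holds `ways` lines
    ram_block_bytes = ram_bytes // blocks
    return lines, blocks, ram_block_bytes


for ways in (2, 4, 8, 16):
    lines, blocks, ram_block = set_layout(512 * KB, 64, ways, 1 * GB)
    print(f"{ways:>2}-way: {lines} lines, {blocks} blocks, "
          f"{ram_block // KB} KB of RAM per block")
```

Raising the associativity keeps the total number of lines fixed but reduces the number of blocks, so each cache block ends up covering a larger slice of RAM.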
N-way set-associative cache is the most commonly used mapping method. There are two more methods, known as direct mapping and fully associative mapping. In the former, each memory block is hard-linked to exactly one cache line, while in the latter, any cache line can hold any memory address. Fully associative mapping has the highest hit rate. However, it’s costly to implement and is mostly avoided by chipmakers.
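The three mapping methods can be compared by looking at how a memory address is split into a tag, a block (set) index, and a byte offset within the line. The sketch below assumes the common textbook scheme in which the index is taken from the line number modulo the number of blocks; the text above does not spell out the indexing scheme, and real processors may hash addresses differently. The function name `split_address` and the sample address are hypothetical.

```python
KB = 1024


def split_address(address, cache_bytes, line_bytes, ways):
    """Split an address into (tag, block index, byte offset within the line).

    ways == 1 models direct mapping; ways == total number of lines models
    fully associative mapping; anything in between is n-way set-associative.
    """
    num_blocks = cache_bytes // (line_bytes * ways)
    offset = address % line_bytes
    line_number = address // line_bytes
    index = line_number % num_blocks   # which cache block the address maps to
    tag = line_number // num_blocks    # identifies the line within that block
    return tag, index, offset


address = 0x12345678
total_lines = (512 * KB) // 64

for label, ways in (("direct-mapped", 1),
                    ("4-way", 4),
                    ("fully associative", total_lines)):
    tag, index, offset = split_address(address, 512 * KB, 64, ways)
    print(f"{label:>17}: tag={tag}, block={index}, offset={offset}")
```

In the direct-mapped case, the index pins the address to a single line. In the fully associative case there is only one block, so the index is always 0 and the tag alone must be compared against every line, which is a large part of why that design is costly to build.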