NVIDIA to Devs: Keep your Game Localized to Fewer CPU Threads for Best Performance

NVIDIA has advised developers to optimize their game code for fewer CPU threads to deliver the best performance. Although desktop PCs offer up to 32 threads, not all are equal. A physical thread (core) isn’t the same as a logical thread born out of hyperthreading. Similarly, the P-cores on Intel’s hybrid processors are much more potent than the accompanying low-power E-cores. Consequently, game developers are better off limiting their code to fewer CPU threads to maximize performance and minimize complexity.

Many CPU-bound games actually degrade in performance when the core count increases beyond a certain point, so the benefits of the extra threading parallelism are outweighed by the overhead.

On high-end desktop systems with greater than eight physical cores for example, some titles can see performance gains of up to 15% by reducing the thread count of their worker pools to be less than the core count of the CPU.

The reasons for the performance drop are complex and varied. Where one title may see a performance drop of 10%, another may see a performance gain of 10% on the same system, thus highlighting the difficulty in providing a one-size-fits-all solution across all titles and all systems.
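As a rough illustration of sizing a worker pool below the core count, here is a minimal C++ sketch. The helper names, the assumed SMT width of 2, and the reserve of two cores are all hypothetical and not taken from NVIDIA's guidance. Because results differ per title, the sketch also accepts a per-title override that would be chosen by profiling.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: estimate physical cores from the logical count.
// Assumption: smt_width = 2. This is wrong on hybrid CPUs (E-cores have no
// SMT) and on systems with SMT disabled; real code should query the OS.
unsigned estimate_physical_cores(unsigned logical_cores, unsigned smt_width = 2) {
    if (logical_cores == 0) return 1;      // hardware_concurrency() may return 0
    if (smt_width == 0) smt_width = 1;
    return std::max(1u, logical_cores / smt_width);
}

// Hypothetical helper: worker count = physical cores minus a reserve.
// override_count == 0 means "no override"; otherwise the override is
// clamped to the range [1, physical_cores].
unsigned pick_worker_count(unsigned physical_cores, unsigned reserved_cores,
                           unsigned override_count = 0) {
    physical_cores = std::max(1u, physical_cores);
    if (override_count != 0) {
        return std::min(override_count, physical_cores);
    }
    return physical_cores > reserved_cores ? physical_cores - reserved_cores : 1u;
}

// Convenience wrapper using the standard library's logical core count.
// The reserve of 2 is an arbitrary starting point, not a recommendation.
unsigned default_worker_count() {
    unsigned logical = std::thread::hardware_concurrency();
    return pick_worker_count(estimate_physical_cores(logical), 2);
}
```

A title that benchmarks best with eight workers on a 16-core part would call pick_worker_count(16, 2, 8); one with no tuned value would fall back to 14.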

NVIDIA suggests that on high-end CPUs with more than 8 cores (Ryzen 9 7950X/Core i9-13900K), some games run faster when they use fewer worker threads than the CPU has cores. This is most likely for titles that are highly sensitive to memory/cache latency. Limiting the threads to a single core cluster (a CCD or the P-cores) keeps the critical game data in that cluster’s cache instead of sharing it with a “far-away” core.

High-end CPUs with 16 or more cores have a single-core boost clock and a multi-core boost clock. The boost clock often decreases as more cores are loaded (to keep power in check). Therefore, reducing the number of threads can improve the boost clocks. However, this won’t work if the game is compute-bound rather than memory/cache-bound. The result will vary with the complexity of the workload.

Hardware performance: Higher-core-count CPUs sometimes have lower CPU speeds. Reducing the number of threads may enable the active cores to boost their frequency.

Hardware resource contention: Reducing the thread count can often decrease the pressure on the memory subsystem, reducing latency and enabling the CPU caches to be more efficient. This is especially true for chiplet-based architectures that do not have a unified L3 cache. Threads executing on different chiplets can cause high cache thrashing.
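One way to experiment with keeping threads on a single chiplet or core cluster is thread affinity. The sketch below is Linux-only (glibc's sched_setaffinity) and is not something NVIDIA's guidance prescribes. The helper names are hypothetical, and the idea that a contiguous range of logical CPU IDs maps to one CCD is an assumption; real code should read the topology from the OS instead.

```cpp
#include <sched.h>   // Linux/glibc: cpu_set_t, CPU_* macros, sched_setaffinity

// Hypothetical helper: build a mask covering logical CPUs
// [first_cpu, first_cpu + count). Assumption: these IDs belong to one
// cluster (for example one CCD); this is not guaranteed on any system.
cpu_set_t make_cluster_mask(int first_cpu, int count) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = first_cpu; cpu < first_cpu + count; ++cpu) {
        if (cpu >= 0 && cpu < CPU_SETSIZE) {
            CPU_SET(cpu, &mask);
        }
    }
    return mask;
}

// Restrict the calling thread to the given mask.
// Returns false if the kernel rejects the request (for example when the
// mask contains no CPU that exists on this machine).
bool pin_current_thread(const cpu_set_t& mask) {
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}
```

Hard affinity can also hurt, since it stops the scheduler from moving a thread off a busy core, so this is best treated as a profiling experiment rather than a default.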

Executing threads on both logical cores of a single physical core (hyperthreading or simultaneous multi-threading) can add latency as both threads must share the physical resource (caches, instruction pipelines, and so on). If a critical thread is sharing a physical core, then its performance may decrease. Targeting physical core counts instead of logical core counts can help to reduce this on larger core count systems.

Software resource contention: Locks and atomics can have much higher latency when accessed by many threads concurrently, adding to the memory pressure. False sharing can exacerbate this.
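False sharing happens when threads write to different variables that sit on the same cache line. A common mitigation is to pad per-thread data to a cache line each, as in this minimal sketch. The 64-byte line size is an assumption (typical for current x86 desktop CPUs), and the type names and the array size of 4 are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Each counter is aligned to, and therefore occupies, its own cache line,
// so two workers bumping neighbouring counters do not share a line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

// Hypothetical per-worker statistics block for a pool of 4 workers.
struct WorkerStats {
    PaddedCounter jobs_done[4];
};

// Increment one worker's counter. Relaxed ordering is enough for a
// statistics counter that is not used to synchronize other data.
void record_job(WorkerStats& stats, std::size_t worker_index) {
    stats.jobs_done[worker_index].value.fetch_add(1, std::memory_order_relaxed);
}

static_assert(alignof(PaddedCounter) == kCacheLine, "counter must be line-aligned");
static_assert(sizeof(PaddedCounter) == kCacheLine, "counter must fill one line");
```

The cost is memory: each 8-byte counter now takes 64 bytes, which is usually acceptable for a handful of hot per-thread values.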

OS scheduling issues: An over-subscription of threads to active cores leads to a high number of context switches which can be expensive and can put extra pressure on the CPU memory subsystem.

On systems with P/E cores, work is scheduled first to physical P cores, then E cores, and then hyperthreaded logical P cores. Using fewer threads than total physical cores enables background threads, such as OS threads, to execute on the E cores without disrupting critical threads running on P cores by executing on their sibling logical cores.

Power management: Reducing the number of threads can enable more cores to be parked, saving power and potentially allowing the remaining cores to run at a higher frequency.

Core parking has been seen to be sensitive to high thread counts, causing issues with short bursty threads failing to trigger the heuristic to unpark cores. Having longer running, fewer threads helps the core parking algorithms.
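One way to end up with fewer, longer-running units of work is to split a job into one contiguous chunk per worker rather than many tiny tasks. The helper below is a hypothetical sketch of that partitioning; whether it actually helps core parking on a given system is something to measure.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split the index range [0, item_count) into at most
// worker_count contiguous half-open ranges [begin, end). Range lengths
// differ by at most one item, and no empty ranges are produced.
std::vector<std::pair<std::size_t, std::size_t>>
split_into_chunks(std::size_t item_count, std::size_t worker_count) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    if (item_count == 0 || worker_count == 0) {
        return chunks;
    }
    const std::size_t n = std::min(item_count, worker_count);
    const std::size_t base = item_count / n;    // minimum items per chunk
    const std::size_t extra = item_count % n;   // first `extra` chunks get one more
    std::size_t begin = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        chunks.emplace_back(begin, begin + len);
        begin += len;
    }
    return chunks;
}
```

Each worker would then process one range from start to finish, so 10,000 items on 6 workers become 6 long tasks instead of thousands of short ones.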

Then there’s the matter of hyper-threading (or Simultaneous Multi-threading). As you may have observed in several games, hyper-threading can degrade performance by making two threads compete for the resources of one physical core. This can be especially glaring if one of the threads is a critical thread. In this case, you’re reducing the resources available to the primary thread by scheduling a second one. Even though CPU utilization increases with HT, the latency-sensitive critical thread will slow down.

The introduction of hybrid cores adds another layer of complexity to code optimization. Here, you want most (if not all) game threads on the P-cores to avoid unnecessary stutters or lags. Keeping the thread count below the physical core count keeps most game threads on the P-cores, pushing the system and OS threads to the E-cores.

Via: NVIDIA