Intel’s ISL team has created a neural network that converts rendered footage from GTA V into photorealistic imagery that is hard to distinguish from real captured video. The approach is broadly similar in spirit to NVIDIA’s DLSS, with the key differences that it isn’t integrated into the game engine and doesn’t run in real time.
In the above shot, you can see how this convolutional network works, and below you can see how DLSS integrates into game engines. Like DLSS, the enhancement network is fed auxiliary image maps such as depth, exposure, surface normals, lighting, and jitter offsets, which help it distinguish the various objects in a scene and sort them into classes. This information is also used to modulate the network’s features based on the rendered image. The discriminator then compares the enhanced output against real photographs and assigns each frame a realism score, while an LPIPS score (the perceptual difference between the rendered input and the final image) keeps the output faithful to the original scene. Inside the discriminator, a VGG-16 network extracts “perceptual features” from the enhanced and real images at varying depths/levels, while the RSN derives label maps that are consistent across object classes for both the rendered and real images. The perceptual features are then compared under these label maps to produce the realism score, which in turn helps the convolutional network improve its output.
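To make the LPIPS idea above concrete, here is a minimal NumPy sketch of an LPIPS-style perceptual distance: feature maps from several network layers are unit-normalized per pixel and their squared differences are averaged. This is an illustrative simplification, not Intel’s implementation; the feature maps here are hypothetical stand-ins for the activations a pretrained network like VGG-16 would produce, and real LPIPS additionally applies learned per-channel weights.

```python
import numpy as np

def lpips_style_distance(feats_a, feats_b):
    """Average normalized feature-map distance across layers (LPIPS-style).

    feats_a, feats_b: lists of arrays shaped (H, W, C), one per layer,
    standing in for deep-network activations of the two images.
    """
    total = 0.0
    for fa, fb in zip(feats_a, feats_b):
        # Unit-normalize each spatial position's feature vector along channels,
        # so the comparison measures direction of the features, not magnitude.
        fa = fa / (np.linalg.norm(fa, axis=-1, keepdims=True) + 1e-8)
        fb = fb / (np.linalg.norm(fb, axis=-1, keepdims=True) + 1e-8)
        # Squared difference summed over channels, averaged over space.
        total += np.mean(np.sum((fa - fb) ** 2, axis=-1))
    return total / len(feats_a)

# Toy demo with random "activations" for three layers.
rng = np.random.default_rng(0)
feats_img = [rng.normal(size=(4, 4, 8)) for _ in range(3)]
feats_same = [f.copy() for f in feats_img]          # identical image
feats_diff = [f + rng.normal(size=f.shape) for f in feats_img]  # altered image

print(lpips_style_distance(feats_img, feats_same))  # ~0 for identical inputs
print(lpips_style_distance(feats_img, feats_diff))  # > 0 for differing inputs
```

Minimizing a distance like this between the rendered input and the enhanced output is what keeps the enhancement network from hallucinating content that changes the scene.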
The resulting images are quite impressive: they stay nearly identical to the original rendered input both geometrically and semantically while also remaining stable over time. It’s unclear what hardware was used, but the real reference images were captured in German cities (the Cityscapes dataset). Furthermore, since the processing was done offline, it’s hard to say how much of a performance overhead this would add to a real-time rendered game. Still, this looks very much like the future of gaming, and it likely won’t be long before we see techniques like this integrated into video games.