Accelerating Stereo Rendering via Image Warping and Spatio-Temporal Supersampling
Sipeng Yang, Junhao Zhuge, Jiayu Ji, Qingchuan Zhu, Xiaogang Jin
State Key Lab of CAD&CG, Zhejiang University
Abstract
Achieving immersive virtual reality (VR) experiences typically requires extensive computational resources to ensure high-definition visuals, high frame rates, and low latency in stereoscopic rendering. This challenge is particularly pronounced for lower-tier and standalone VR devices with limited processing power. To accelerate rendering, existing supersampling and image reprojection techniques have shown significant potential, yet to date, no previous work has explored their combination to minimize stereo rendering overhead.

In this paper, we introduce a lightweight supersampling framework that integrates image reprojection with spatio-temporal supersampling to accelerate stereo rendering. Our approach effectively leverages the temporal and spatial redundancies inherent in stereo videos, enabling rapid image generation for unshaded viewpoints and providing resolution-enhanced, anti-aliased images for both binocular viewpoints. We first blend a rendered low-resolution (LR) frame with accumulated temporal samples to construct a high-resolution (HR) frame. This HR frame is then reprojected to the other viewpoint to directly synthesize its image. To address disocclusions in reprojected images, we fill holes using accumulated history data followed by low-pass filtering, ensuring high-quality results with minimal delay. Extensive evaluations on both PC and standalone devices confirm that our framework requires only a short runtime to generate high-fidelity images, making it an effective solution for stereo rendering across various VR platforms.
Methodology
Our method efficiently synthesizes one viewpoint from the other in each stereo pair and enhances the resolution of both views, while requiring only a single LR shaded image per frame.
Sample Accumulation: We first use image backward warping to align the history HR frame with the currently rendered LR frame. Invalid pixels in the history frame are discarded, and regions with shading changes are clamped to the current frame. After this rectification, we blend the rendered LR frame with the rectified history HR frame to obtain the current HR frame.
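For concreteness, below is a minimal NumPy sketch of this sample-accumulation step, assuming per-pixel motion vectors, a history-validity mask, and float images in [0, 1]; the function names and the fixed blending weight are illustrative choices, not the paper's implementation.

```python
# Minimal sketch of sample accumulation: warp history, rectify it against
# the current frame, then blend. Assumes float images in [0, 1].
import numpy as np

def backward_warp(history_hr, motion_uv):
    """Fetch history pixels at (x - u, y - v) with nearest-neighbor sampling."""
    h, w, _ = history_hr.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs - motion_uv[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys - motion_uv[..., 1]).astype(int), 0, h - 1)
    return history_hr[src_y, src_x]

def clamp_to_neighborhood(warped, current, radius=1):
    """Clamp warped history colors to the local min/max of the current frame."""
    lo = np.full_like(current, np.inf)
    hi = np.full_like(current, -np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = np.roll(current, (dy, dx), axis=(0, 1))
            lo = np.minimum(lo, shifted)
            hi = np.maximum(hi, shifted)
    return np.clip(warped, lo, hi)

def accumulate(current_hr, history_hr, motion_uv, valid_mask, alpha=0.1):
    """Blend the upsampled current frame with rectified history samples."""
    warped = backward_warp(history_hr, motion_uv)
    warped = clamp_to_neighborhood(warped, current_hr)
    blended = alpha * current_hr + (1.0 - alpha) * warped
    # Fall back to the current frame where the history is invalid (disocclusions).
    return np.where(valid_mask[..., None], blended, current_hr)
```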
View Synthesis: We calculate the projection vectors from the rendered viewpoint to the synthesized one and align the rendered HR frame to the synthesized viewpoint using image backward warping. Then, accumulated history pixels, followed by low-pass filtering, are used to fill holes in the warped image. To avoid flickering caused by alternating the projected viewpoint across frames, our supersampling framework incorporates a heuristic pixel selection method to detect and rectify shading changes, ensuring high temporal stability in the video.
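The sketch below illustrates a simplified version of this view-synthesis step for a rectified stereo pair, assuming per-pixel depth is available at the synthesized viewpoint; the disparity model d = f·b/z, the out-of-bounds hole detection, and the Gaussian fallback filter are simplifications for illustration rather than the paper's exact formulation.

```python
# Simplified view synthesis: reproject the rendered eye to the other eye via
# horizontal disparity, then fill disocclusions from accumulated history plus
# a low-pass fallback. Assumes rectified cameras and float images in [0, 1].
import numpy as np
from scipy.ndimage import gaussian_filter

def reproject_to_other_eye(rendered_hr, target_depth, focal_px, baseline, eps=1e-6):
    """Backward-warp the rendered eye into the other eye (here: left -> right)."""
    h, w, _ = rendered_hr.shape
    # baseline is in the same units as depth; disparity is in pixels.
    disparity = focal_px * baseline / np.maximum(target_depth, eps)
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.round(xs + disparity).astype(int)   # sample location in the rendered eye
    valid = (src_x >= 0) & (src_x < w)             # simplified hole detection: out of bounds only
    warped = np.zeros_like(rendered_hr)
    warped[valid] = rendered_hr[ys[valid], src_x[valid]]
    return warped, valid

def fill_disocclusions(warped, hole_mask, history_target_eye, sigma=3.0):
    """Fill holes from accumulated history first, then apply a low-pass fallback."""
    filled = np.where(hole_mask[..., None], history_target_eye, warped)
    # Blur the filled image and use it only inside hole regions so that
    # correctly warped pixels keep their sharpness.
    blurred = gaussian_filter(filled, sigma=(sigma, sigma, 0))
    return np.where(hole_mask[..., None], blurred, filled)

# Illustrative usage:
# warped, valid = reproject_to_other_eye(left_hr, right_depth, focal_px, baseline)
# right_hr = fill_disocclusions(warped, ~valid, history_right_hr)
```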
The resulting HR stereo image pair undergoes the necessary post-processing operations in the rendering pipeline and is displayed on screen. In the next frame, we shade the other viewpoint (left or right eye) and apply the same approach for resolution enhancement and view synthesis.
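As a toy illustration of this alternating schedule, the snippet below simply swaps which eye is shaded and which is synthesized every frame; it is a scheduling sketch only, not the paper's renderer.

```python
# Toy illustration of the alternating schedule: each frame shades one eye at
# low resolution and synthesizes the other, swapping eyes every frame.
def eyes_for_frame(frame_index):
    """Return (rendered_eye, synthesized_eye) for the given frame index."""
    return ("left", "right") if frame_index % 2 == 0 else ("right", "left")

# Frames 0..3 -> (left, right), (right, left), (left, right), (right, left)
print([eyes_for_frame(i) for i in range(4)])
```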
Supersampling method for stereo rendering
Comparisons
Resolution Enhancement
To assess resolution enhancement performance, we compared our approach with state-of-the-art (SOTA) non-deep-learning real-time rendering super-resolution methods. Only videos from a single viewpoint were used to facilitate the comparison. The table below presents quantitative comparisons of reconstructed HR image quality at an upscaling factor of ×2. The results show that our method surpasses previous methods across the evaluated scenes in both PSNR and SSIM. Notably, compared with our baseline Mob-FGSR, our approach shows clear improvements, with average gains of 0.48 dB in PSNR and 0.004 in SSIM. A sketch of the metric computation is given after the table.
Comparison of resolution enhancement methods
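For reference, PSNR and SSIM values of this kind can be reproduced with standard implementations; the following hedged sketch uses scikit-image, and the function and variable names are assumptions rather than the paper's evaluation harness.

```python
# Sketch of per-frame PSNR/SSIM evaluation averaged over a video sequence.
# Images are assumed to be floats in [0, 1] with a trailing channel axis.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(reconstructed, ground_truth):
    """Return (PSNR in dB, SSIM) for one reconstructed HR frame."""
    psnr = peak_signal_noise_ratio(ground_truth, reconstructed, data_range=1.0)
    ssim = structural_similarity(ground_truth, reconstructed,
                                 channel_axis=-1, data_range=1.0)
    return psnr, ssim

def evaluate_sequence(recon_frames, gt_frames):
    """Average the per-frame metrics over one video sequence."""
    scores = [evaluate_pair(r, g) for r, g in zip(recon_frames, gt_frames)]
    return tuple(np.mean(scores, axis=0))
```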
The figure below provides a qualitative visual comparison, where our method retains sharper textures without suffering from ghosting. Conversely, methods such as FSR 2 and TAAU occasionally produce ghosting artifacts due to insufficient rejection of invalid history samples in disoccluded regions.
Resolution Enhancement
Unshaded View Synthesis
To evaluate image synthesis performance in isolation, we disabled the resolution enhancement component of our method and compared against SOTA image synthesis methods for stereo rendering. All methods took a single-viewpoint 1080p image as input and synthesized the unshaded viewpoint. In terms of quantitative metrics, our method achieves better results than the re-shading approach HRR in some cases. This improvement is attributed to our pixel selection strategy, which more accurately reconstructs non-Lambertian surface reflections, closely aligning with the ground truth.
Comparison of unshaded view synthesis methods
To provide a clearer comparison of the disocclusion filling results across methods, we present visual results in the figure below, where key disoccluded areas are magnified. Here, the HRR method, which employs a re-shading strategy for disocclusions, theoretically delivers the best filling results. Our approach also achieves satisfactory results by filling with history information, outperforming the EHIW method, which relies solely on a depth-based low-pass filter.
Unshaded View Synthesis
Performance
We conducted detailed performance evaluations of our supersampling method on both a PC (NVIDIA GeForce RTX 4070 Ti GPU) and a standalone VR device (Qualcomm Snapdragon 8 Gen 3 SoC). Runtime metrics for all components of our framework are presented in the table below. By leveraging fast image backward warping, we achieve significantly reduced runtimes.
On the PC, enhancing a 1080p input to 4K takes just 0.39 ms, and synthesizing a 4K image for the unshaded viewpoint takes only 1.52 ms. In comparison, the EHIW method requires 5.94–8.97 ms, while the HRR method takes 0.71–1.02 ms plus additional time for hole filling. Compared to existing methods, our approach thus offers strong low-latency performance.
Additionally, runtime measurements on the standalone device show that enhancing to a 1440×1440 image requires just 2.20 ms, while synthesizing the alternate viewpoint takes only 4.24 ms. These results confirm the broad applicability of our supersampling approach across hardware platforms.
Runtime performance analysis of our method and alternative view synthesis approaches
Code
The code will be released upon acceptance.