No-Reference Rendered Video Quality Assessment:
Dataset and Metrics
Jiayu Ji1
Qingchuan Zhu1
Zhiyao Yang2
1State Key Lab of CAD&CG, Zhejiang University
2OPPO Computing & Graphics Research Institute
Abstract
Quality assessment of videos is crucial for many computer graphics applications, including video games, virtual reality, and augmented reality, where visual performance has a significant impact on user experience. When test videos cannot be perfectly aligned with references, or when references are unavailable altogether, no-reference video quality assessment (NR-VQA) methods become indispensable. However, existing NR-VQA datasets and metrics focus primarily on camera-captured videos; applying them directly to rendered videos yields biased predictions, since rendered videos are more prone to temporal artifacts.

To address this, we present a large rendering-oriented video dataset with subjective quality annotations, together with an NR-VQA metric designed specifically for rendered videos. The proposed dataset includes a wide range of 3D scenes and rendering settings, with quality scores annotated on different display types to better reflect real-world application scenarios. Building on this dataset, we calibrate our NR-VQA metric to assess rendered video quality by considering both image quality and temporal stability. We compare our metric to existing NR-VQA metrics, demonstrating its superior performance on rendered videos. Finally, examples illustrate how our metric can be used to benchmark supersampling methods and evaluate frame generation strategies for real-time rendering.

Dataset

The proposed dataset, ReVQ-2k:
  • Features a diverse range of 3D scenes.
  • Applies a variety of rendering configurations, supplemented by various supersampling and post-processing techniques.
  • Includes perceptual quality scores annotated on both smartphone screens and desktop displays to replicate real-world scenarios.

3D Scenes

Examples of 3D scenes from the ReVQ-2k dataset: Showcases, Urban, Interior, Landscapes, Cartoon, and Weather.

Rendering Configuration

Our rendering settings follow the scalability reference of Unreal Engine; each setting group is varied across three quality levels, as summarized below.
Resolution Scale (levels 0 / 1 / 2):
  • r.ScreenPercentage = 50 / 75 / 100

View Distance (levels 0 / 1 / 2):
  • r.ViewDistanceScale = 0.4 / 0.6 / 1.0

Anti-Aliasing (levels 0 / 1 / 2):
  • r.PostProcessAAQuality = 0 / 2 / 4

Post Process (levels 0 / 1 / 2):
  • r.MotionBlurQuality = 0 / 3 / 4
  • r.BlurGBuffer = 0 / 0 / 1
  • r.AmbientOcclusionLevels = 0 / 1 / 3
  • r.AmbientOcclusionRadiusScale = 1.7 / 1.7 / 1
  • r.DepthOfFieldQuality = 0 / 1 / 2
  • r.RenderTargetPoolMin = 300 / 350 / 400
  • r.LensFlareQuality = 0 / 0 / 2
  • r.SceneColorFringeQuality = 0 / 0 / 1
  • r.EyeAdaptationQuality = 0 / 0 / 2
  • r.BloomQuality = 4 / 4 / 5
  • r.FastBlurThreshold = 0 / 2 / 7
  • r.Upscale.Quality = 1 / 2 / 3
  • r.Tonemapper.GrainQuantization = 0 / 0 / 1

Texture Quality (levels 0 / 1 / 2):
  • r.Streaming.MipBias = 2.5 / 1 / 0
  • r.MaxAnisotropy = 0 / 2 / 8
  • r.Streaming.PoolSize = 200 / 400 / 1000

Effects Quality (levels 0 / 1 / 2):
  • r.TranslucencyLightingVolumeDim = 24 / 32 / 64
  • r.RefractionQuality = 0 / 0 / 2
  • r.SSR = 0 / 0 / 1
  • r.SceneColorFormat = 3 / 3 / 4
  • r.DetailMode = 0 / 1 / 2
  • r.TranslucencyVolumeBlur = 0 / 0 / 1
  • r.MaterialQualityLevel = 0 / 1 / 1

Supersampling (FSR2 / DLSS / TAAU):
  • FSR2: r.FidelityFX.FSR2.QualityMode=2
  • DLSS: r.NGX.DLSS.Quality=-1
  • TAAU: r.TemporalAA.Upsampling=1
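As an aside, the sketch below shows one way a quality level from the table above could be turned into a list of console commands. The dictionary is only an excerpt covering the Texture Quality group, and how the commands are passed to Unreal Engine (in-game console, configuration file, or command line) is left open rather than prescribed by our pipeline.

# Minimal sketch: emit the console variables for one quality level of one
# setting group. Values are copied from the Texture Quality rows above
# (excerpt only); feeding them to Unreal Engine is left to the reader.
TEXTURE_QUALITY = {
    0: {"r.Streaming.MipBias": 2.5, "r.MaxAnisotropy": 0, "r.Streaming.PoolSize": 200},
    1: {"r.Streaming.MipBias": 1, "r.MaxAnisotropy": 2, "r.Streaming.PoolSize": 400},
    2: {"r.Streaming.MipBias": 0, "r.MaxAnisotropy": 8, "r.Streaming.PoolSize": 1000},
}

def console_commands(level):
    """Return 'cvar=value' strings for the requested texture-quality level."""
    return [f"{cvar}={value}" for cvar, value in TEXTURE_QUALITY[level].items()]

print("\n".join(console_commands(2)))
# r.Streaming.MipBias=0
# r.MaxAnisotropy=8
# r.Streaming.PoolSize=1000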
Labeling Software
The labeling software on the desktop and smartphone platforms.

Quality Scores

  • OA-MOS: The overall mean opinion score reflects the overall quality of videos, considering factors such as colorfulness, block artifacts, visual blurring, and temporal disruptions like flickering.
  • TS-MOS: The temporal stability mean opinion score reflects the temporal stability of videos, focusing on issues like flickering, moving jaggies, and other temporal artifacts that critically affect rendered video quality.
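
Both are standard mean opinion scores: if a video \( v \) receives ratings \( r_{i,v} \) from \( N \) participants, its score is the average of those ratings, as shown below. Any additional normalization or outlier screening (e.g., following ITU-R BT.500) is an assumption about the protocol rather than something specified here.

\[ \mathrm{MOS}_v = \frac{1}{N} \sum_{i=1}^{N} r_{i,v} \]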

Quality scores are reported separately for the 720p, 1080p, and 2K videos, as well as for the dataset as a whole.

Metrics
Building on the ReVQ-2k dataset, we develop a new two-stream NR-VQA metric to predict rendered video quality. The architecture of our proposed model is divided into two main components: image quality assessment and temporal stability analysis.
Image Quality Assessment: In this stream, we evaluate the overall image quality of videos. Given the extensive analysis in existing literature, we adopt the cropping strategy and Swin transformer (Swin-T) model employed in FAST-VQA. This stream considers factors such as clarity and appropriate exposure, in alignment with existing NR-VQA methods, and evaluates static rendering artifacts like Moiré patterns.
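As a rough illustration of this cropping strategy, the sketch below splices small patches sampled from a uniform grid into a single fragment, in the spirit of FAST-VQA's grid mini-patch sampling; the grid size, patch size, and sampling details are illustrative assumptions rather than the exact configuration of our model.

import numpy as np

def sample_fragment(frame, grid=7, patch=32, rng=None):
    """Splice one patch from each cell of a grid x grid partition of `frame`
    (H, W, C) into a (grid*patch, grid*patch, C) fragment, a simplified take
    on FAST-VQA-style grid mini-patch sampling. Assumes every grid cell is at
    least `patch` pixels on each side."""
    rng = rng or np.random.default_rng()
    h, w, c = frame.shape
    cell_h, cell_w = h // grid, w // grid
    fragment = np.zeros((grid * patch, grid * patch, c), dtype=frame.dtype)
    for gy in range(grid):
        for gx in range(grid):
            # Random top-left corner of the mini-patch inside this grid cell.
            y0 = gy * cell_h + rng.integers(0, cell_h - patch + 1)
            x0 = gx * cell_w + rng.integers(0, cell_w - patch + 1)
            fragment[gy * patch:(gy + 1) * patch,
                     gx * patch:(gx + 1) * patch] = frame[y0:y0 + patch, x0:x0 + patch]
    return fragment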
Temporal Stability Analysis: In this stream, we crop a series of images from consecutive video frames, align them via motion estimation, and assess their temporal stability using image differencing. The results from both streams are then combined through a multilayer perceptron (MLP) to regress the overall video quality.
Overview of our NR-VQA model
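The two-stream fusion itself can be pictured with the minimal sketch below: pooled features from the image quality stream and the temporal stability stream are concatenated and regressed to a quality score by an MLP. The feature dimensions, hidden width, and activation are illustrative assumptions; in our model the image quality features come from the Swin-T backbone adopted from FAST-VQA.

import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Fuse image-quality and temporal-stability features with an MLP.
    Feature dimensions and hidden width are illustrative assumptions."""
    def __init__(self, iq_dim=768, ts_dim=256, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(iq_dim + ts_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),   # regressed overall quality score
        )

    def forward(self, iq_feat, ts_feat):
        # iq_feat: (B, iq_dim) pooled features from the image-quality stream
        # ts_feat: (B, ts_dim) pooled features from the temporal-stability stream
        return self.head(torch.cat([iq_feat, ts_feat], dim=-1)).squeeze(-1)

# Usage with random placeholder features:
model = TwoStreamFusion()
score = model(torch.randn(4, 768), torch.randn(4, 256))  # -> shape (4,)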
Image Differencing
With generated motion vectors and disocclusion maps, each subset \( K_i \) is subjected to backward warping and disoccluded-pixel removal, resulting in an aligned subset. We compute image differences between adjacent frames and between frames separated by one, two, and three intervals, enabling detection of flickering at various frequencies. The image differences are then fed into an image-difference detector built on depth-wise separable convolutions to evaluate pixel-level temporal stability. Finally, the stability maps from all subsets are passed through average pooling and an MLP to regress the temporal stability score.
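The sketch below illustrates the alignment and multi-interval differencing steps with NumPy, using a nearest-neighbour backward warp for simplicity; the learned image-difference detector and the exact motion-vector conventions are omitted and should be treated as assumptions.

import numpy as np

def backward_warp(frame, motion, disocclusion):
    """Nearest-neighbour backward warp of `frame` (H, W, C) by per-pixel
    motion vectors (H, W, 2, in pixels); `disocclusion` (H, W, bool) marks
    pixels with no valid correspondence. A simplification of the alignment
    performed with the renderer's motion vectors."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + motion[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + motion[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x], ~disocclusion

def multi_interval_differences(aligned, valid, offsets=(1, 2, 3, 4)):
    """Absolute differences between frame t and frame t+k for several temporal
    offsets; disoccluded pixels are zeroed out. `aligned` is (T, H, W, C) and
    `valid` is (T, H, W). Offsets 1-4 cover adjacent frames and frames
    separated by one, two, and three intervals."""
    diffs = []
    for k in offsets:
        mask = (valid[k:] & valid[:-k])[..., None]
        diffs.append(np.abs(aligned[k:] - aligned[:-k]) * mask)
    return diffs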
Comparison
The quantitative analysis in this table clearly demonstrates that our proposed model significantly outperforms existing SOTA methods. Without temporal stability supervision, our model exceeds the top-performing baseline by 1.1% and 0.8% in PLCC and SRCC, respectively. This margin increases to 3.9% and 3.2% when our model incorporates temporal stability scores for training.
Quantitative comparison of various NR-VQA methods on the ReVQ-2k dataset
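For reference, the PLCC and SRCC values reported in the comparison can be computed as in the sketch below; whether a nonlinear (e.g., four-parameter logistic) mapping is fitted before PLCC, as some VQA protocols do, is an assumption not specified here.

import numpy as np
from scipy.stats import pearsonr, spearmanr

def plcc_srcc(predicted, mos):
    """Pearson (PLCC) and Spearman (SRCC) correlations between predicted
    scores and subjective MOS values."""
    predicted, mos = np.asarray(predicted), np.asarray(mos)
    return pearsonr(predicted, mos)[0], spearmanr(predicted, mos)[0]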