Full-Reference Video Quality Assessment Using Structural Similarity (SSIM) Index

Improving Video Quality Assessment with Structural Similarity Index

by Praveen Gurav*, Gurulingappa Patil

- Published in Journal of Advances in Science and Technology, E-ISSN: 2230-9659

Volume 12, Issue No. 25, Dec 2016, Pages 305 - 311 (7)

Published by: Ignited Minds Journals


ABSTRACT

Video quality assessment is a key topic in the field of Quality of Service (QoS) for mobile phones today. Its goal is to evaluate whether a distorted video is of good quality by quantifying the difference between the original and the distorted video. To assess the quality of an arbitrarily distorted or compressed video, the visual features of the distorted video are compared with those of the original video. Objective video quality measures play important roles in a variety of video processing applications, such as compression, communication, printing, analysis, registration, restoration, enhancement and watermarking. Most quality assessment approaches proposed in the literature are error sensitivity-based methods. In this paper, we follow a new algorithm, the Structural Similarity (SSIM) index, in designing video quality metrics; it uses structural distortion as an estimate of perceived visual distortion. The algorithm is simple and straightforward, makes real-time implementation easy, correlates very consistently with subjective measures, and delivers more accurate results than the other objective video quality measures, MSE and PSNR.

KEYWORDS

Video Quality Assessment, Structural Similarity (SSIM) Index, QoS, mobile phones, distorted video, original video, visual features, objective video quality measures, error sensitivity-based methods, structural distortion, perceived visual distortion, real time implementation, subjective measures, MSE, PSNR

I. INTRODUCTION

The field of image and video processing generally deals with signals that are prepared for human consumption. Movies on DVDs or images and video delivered over the Internet are examples of such signals. Before an image or video is presented to a human observer, it usually goes through many stages of pre- and post-processing. Each stage of processing may introduce distortion and reduce the quality of the final display. One way to determine the quality of video is to ask human observers for their opinions, but such a method is expensive and of limited use. This is why researchers pursue objective quality assessment methods that can automatically predict perceived image or video quality. These objective quality measurement methods are useful in a variety of image and video processing applications, such as compression, communication, printing, displaying, analysis, registration, restoration, enhancement and watermarking. Generally speaking, these methods can be employed in three ways. First, they can be used to monitor image/video quality for quality control systems. Second, they can be employed to benchmark image/video processing systems and algorithms. Third, they can be embedded into image/video processing systems to optimize algorithms and parameter settings [1–5]. Currently, the most commonly used full-reference (FR) objective image and video distortion/quality metrics are signal-to-noise ratio (SNR), mean squared error (MSE) and peak signal-to-noise ratio (PSNR). MSE and PSNR are widely used because they are simple to calculate, have clear physical meanings, and are mathematically easy to deal with for optimization purposes. However, they have also been widely criticized for not correlating well with perceived quality. In the last three decades, a great deal of effort has been made to develop objective image and video quality assessment methods that incorporate perceptual quality measures by considering human visual system (HVS) characteristics. The video quality experts group (VQEG) was formed to develop, validate and standardize new objective measurement methods for video quality. Although the Phase I test for FR television video quality assessment achieved only limited success, VQEG continues its work with the Phase II test for FR quality assessment for television and with reduced-reference (RR) and no-reference (NR) quality assessment.


It is worth noting that many of the proposed objective image/video quality assessment approaches employ a common error sensitivity-based philosophy, which is motivated by psychophysical and physiological vision research. The basic principle is to regard the distorted signal being evaluated as the sum of a perfect quality reference signal and an error signal. The task of perceptual image quality assessment is then to evaluate how strongly the error signal is perceived by the HVS according to the characteristics of human visual error sensitivity.

II. THEORETICAL DESCRIPTION

Methods for Video Quality Assessment:

There are two classes of assessment methods: subjective and objective. For a subjective test, human viewers are required to rate the quality of video clips. In most testing scenarios, pairs of video clips are compared, where one clip is the source (reference) clip and the other is the degraded clip, which has been processed in some manner. Subjective assessment is a costly and time-consuming process, but it yields accurate results for any given evaluation. This type of assessment is mainly necessary in situations such as final product evaluation and standardization processes, where quality must be assured. Objective test methods, instead of relying on a human viewer watching the video clips, examine the actual video signal. With the introduction of digital video technologies, visually noticeable artifacts appear that are different from analogue artifacts; therefore, new objective test methods are needed. These measurement methods analyse the video signal in the video image space, employing knowledge of the HVS. An algorithm tries to measure the spatial degradation of the video images and the temporal alignment degradation of the sequence [9–13].

The objective quality measurement methods have been classified into the following five main categories, depending on the type of input data used for quality assessment:

Media-Layer Models: These models use the speech or video signal to compute the Quality of Experience (QoE). They do not require any information about the system under test and hence are best applied to scenarios such as codec comparison and codec optimization.

Parametric Packet-Layer Models: Unlike the media-layer models, the parametric packet-layer models predict the QoE only from the packet-header information and do not have access to the media signals. This makes them a lightweight solution for predicting QoE, since they do not have to process the media signals.

Parametric Planning Models: These models use quality planning parameters for networks and terminals to predict the QoE. As a result, they require a priori knowledge about the system that is being tested.

Bitstream-Layer Models: These models use encoded bitstream information together with the packet-layer information used in parametric packet-layer models for measuring QoE.

Hybrid Models: These models mainly combine two or more of the preceding models.

The media-layer objective quality assessment methods can be further categorized as full-reference (FR), reduced-reference (RR), and no-reference (NR), depending on whether a reference, partial information about a reference, or no reference is used in assessing the quality, respectively. Full- and reduced-reference methods are important for the evaluation of video systems in non-real-time scenarios where both (i) the original (reference) video data or a reduced feature data set, and (ii) the distorted video data are available.

Compression Artifacts: The quality of a degraded video sequence is mainly affected by two factors: compression and transmission. On the compressor side, the algorithms use a block-based discrete cosine transform (DCT) and quantization of the DCT coefficients to compress the images and to reduce temporal (frame-to-frame) redundancies. In these coding schemes, compression distortions are caused by this operation, mainly by the quantization. Other factors affecting the visual quality are motion prediction and the size of the decoding buffer. Compression artifacts are usually correlated with movement in the pictures. Distortions can be divided into spatial and temporal coding distortions.
Some of them are Blockiness, Blurring, Ringing, Mosquito Noise, Quantization Noise and Jerkiness.

Transmission Errors: One source of distortion is the transmission of the bit stream over a noisy channel. For most applications the bit stream needs to be transported in such a way that it can be decoded and displayed in real time. When media are transported over a noisy channel, two types of impairment can occur: packets can be lost, or they can be delayed to the point where they are not received in time for decoding. Both cases have the same effect: some parts of the media stream are not available because packets are missing. A loss of data does not only affect the corrupted block; it can also affect the stream up to the next fully received (intra-coded) frame. The visual effects of lost or corrupted packets depend on the decoder's ability to deal with the bit stream. Some decoders apply concealment methods in order to minimize the errors, while others never recover from certain errors [14, 15].


Structural Distortion Measurement: The commonly used full-reference objective image and video distortion/quality metrics, mean squared error (MSE) and peak signal-to-noise ratio (PSNR), are widely used because they are simple to calculate and mathematically easy to deal with. However, they have been criticized because they do not correlate well with perceived quality measurements. Great effort has been made to develop objective image and video quality assessment methods that consider HVS characteristics. Most of the proposed models share a common error sensitivity-based philosophy, which is motivated by psychophysical and physiological vision research. It follows the principle of regarding a distorted signal as the sum of a perfect quality reference signal and an error signal. The task of a video quality assessment algorithm is then to predict how strongly the error signal is perceived by the HVS according to the characteristics of human visual error sensitivity. Structural distortion measurement, by contrast, is based on the fact that natural image signals are highly structured: a structured signal exhibits strong dependencies between its samples. Most error sensitivity-based approaches use so-called Minkowski error metrics, which are independent of the signal structure because they rely on pointwise signal differencing. The motivation of the proposed approach is therefore to find a way to compare the structures of the reference and the distorted signals. The main differences between the new approach and the error-sensitivity-based philosophy are the following: 1. Image degradations are considered as perceived structural loss instead of perceived errors. 2. The new approach is a top-down approach that simulates the hypothesized functionality of the overall HVS, whereas the error-sensitivity-based philosophy follows a bottom-up approach, simulating the function of each relevant component in the HVS and combining them together. 3. The error-sensitivity-based philosophy suffers from issues such as the "suprathreshold" problem and "natural image complexity".
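For comparison, the two conventional metrics discussed above can be computed directly from the pixel data. The following Python sketch is illustrative only and is not part of the original experiments; the function names are ours. It assumes two equally sized 8-bit frames stored as numpy arrays.

    import numpy as np

    def mse(ref, dist):
        """Mean squared error between two equally sized frames."""
        ref = ref.astype(np.float64)
        dist = dist.astype(np.float64)
        return float(np.mean((ref - dist) ** 2))

    def psnr(ref, dist, max_val=255.0):
        """Peak signal-to-noise ratio in dB (max_val = 255 for 8-bit frames)."""
        err = mse(ref, dist)
        if err == 0:
            return float("inf")  # identical frames
        return 10.0 * np.log10(max_val ** 2 / err)

Both measures depend only on the pointwise error energy, which is exactly the property criticized above: they are blind to whether the error disturbs the image structure or not.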

IV. STRUCTURAL SIMILARITY (SSIM) INDEX

There may be different implementations of the new philosophy, depending on how the concepts of "structural information" and "structural distortion" are interpreted and quantified. Here, from an image formation point of view, we consider the "structural information" in an image to be those attributes that represent the structure of objects in the scene, independent of the average luminance and contrast of the image. This leads to an image quality assessment approach that separates the measurement of luminance, contrast and structural distortions. The structural similarity (SSIM) index measurement system diagram is shown in Figure 1. Let x and y be two non-negative signals that have been aligned with each other (e.g., two image patches extracted from the same spatial location of the two images being compared), and let μx, μy, σx², σy² and σxy be the mean of x, the mean of y, the variance of x, the variance of y, and the covariance of x and y, respectively. Here, the mean and the standard deviation (square root of the variance) of a signal are roughly considered as estimates of the luminance and the contrast of the signal. The covariance (normalized by the standard deviations) can be thought of as a measure of how closely the two signals vary together, i.e., how strongly they are correlated. We define the luminance, contrast and structure comparison measures as follows:

l(x, y) = 2μxμy / (μx² + μy²),   c(x, y) = 2σxσy / (σx² + σy²),   s(x, y) = σxy / (σxσy)

Notice that these terms are conceptually independent in the sense that the first two terms only depend on the luminance and the contrast of the two images being compared, respectively, and purely changing the luminance or the contrast of either image has no impact on the third term. Geometrically, s(x, y) corresponds to the cosine of the angle between the vectors x − μx and y − μy, independent of the lengths of these vectors. Although s(x, y) does not use a direct descriptive representation of the image structures, it reflects the similarity between two image structures: it equals one if and only if the structures of the two image signals being compared are exactly the same (recall that we consider structural information as those image attributes other than the luminance and contrast information). When (μx² + μy²)(σx² + σy²) ≠ 0, the similarity index measure between x and y corresponds to

Q(x, y) = l(x, y)·c(x, y)·s(x, y) = 4σxyμxμy / [(μx² + μy²)(σx² + σy²)]   (2)


Fig. 1: Diagram of the Structural Similarity (SSIM) Measurement System.
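The three comparison measures and the combined index of equation (2) translate almost directly into code. The following Python sketch is an illustration, not the authors' implementation; names such as comparison_terms are ours. It computes the measures for two aligned patches and, as noted below, is undefined for flat patches, which is exactly the instability the SSIM index addresses.

    import numpy as np

    def comparison_terms(x, y):
        """Luminance, contrast and structure comparison of two aligned patches."""
        x = x.astype(np.float64).ravel()
        y = y.astype(np.float64).ravel()
        mu_x, mu_y = x.mean(), y.mean()
        sigma_x, sigma_y = x.std(ddof=1), y.std(ddof=1)
        sigma_xy = np.sum((x - mu_x) * (y - mu_y)) / (x.size - 1)
        l = 2 * mu_x * mu_y / (mu_x**2 + mu_y**2)              # luminance comparison
        c = 2 * sigma_x * sigma_y / (sigma_x**2 + sigma_y**2)  # contrast comparison
        s = sigma_xy / (sigma_x * sigma_y)                     # structure comparison
        return l, c, s

    def quality_index(x, y):
        """Combined index of eq. (2); unstable when means or variances are near zero."""
        l, c, s = comparison_terms(x, y)
        return l * c * s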

If the two signals are represented discretely as x = {xi | i = 1, 2, ..., N} and y = {yi | i = 1, 2, ..., N}, then the statistical features can be estimated as follows (the sums run over i = 1, ..., N; μy and σy² are defined analogously):

μx = (1/N) Σi xi   (3)

σx² = [1/(N − 1)] Σi (xi − μx)²   (4)

σxy = [1/(N − 1)] Σi (xi − μx)(yi − μy)   (5)

One problem with (2) is that when (μx² + μy²) or (σx² + σy²) is close to 0, the resulting measurement becomes unstable. This effect has been frequently observed in our experiments, especially over flat regions in images. In order to avoid this problem, we have modified equation (2). The resulting new measure is named the Structural Similarity (SSIM) index between signals x and y:

SSIM(x, y) = [(2μxμy + C1)(2σxy + C2)] / [(μx² + μy² + C1)(σx² + σy² + C2)]   (6)

Two constants, C1 and C2, are added, which are given by C1 = (K1L)² and C2 = (K2L)²   (7). The SSIM index satisfies the following conditions: 1. Symmetry: SSIM(x, y) = SSIM(y, x); 2. Boundedness: SSIM(x, y) ≤ 1; 3. Unique maximum: SSIM(x, y) = 1 if and only if x = y (in discrete representations, xi = yi for all i = 1, 2, ..., N). If one of the signals is regarded to have perfect quality, then the SSIM index provides a quantitative measurement of the quality of the other image signal. The SSIM indexing algorithm is applied for quality assessment of still images using a sliding-window approach. The window size is fixed at 8 × 8 in this paper. The SSIM indices are calculated within the sliding window, which moves pixel by pixel from the top-left to the bottom-right corner of the image. This results in an SSIM index map of the image, which can also be regarded as the quality map of the distorted image being evaluated. The overall quality value is defined as the average of the quality map, or, equivalently, the mean SSIM (MSSIM) index.
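As an illustration of equations (6) and (7) and of the sliding-window procedure described above, the following Python sketch computes the SSIM index of a patch and the MSSIM of a grayscale image. The values K1 = 0.01 and K2 = 0.03 are not specified in this paper; they are assumptions commonly used in the SSIM literature, and L = 255 assumes 8-bit pixels.

    import numpy as np

    def ssim_patch(x, y, L=255.0, K1=0.01, K2=0.03):
        """SSIM index of eq. (6) for two aligned patches.
        K1 and K2 are small constants; the values here are common
        choices from the SSIM literature, not taken from this paper."""
        C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2   # eq. (7)
        x = x.astype(np.float64).ravel()
        y = y.astype(np.float64).ravel()
        mu_x, mu_y = x.mean(), y.mean()
        var_x, var_y = x.var(ddof=1), y.var(ddof=1)
        cov_xy = np.sum((x - mu_x) * (y - mu_y)) / (x.size - 1)
        return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
               ((mu_x**2 + mu_y**2 + C1) * (var_x + var_y + C2))

    def mssim(ref, dist, win=8):
        """Mean SSIM over an 8x8 window sliding pixel by pixel (still-image case)."""
        H, W = ref.shape
        scores = [ssim_patch(ref[i:i+win, j:j+win], dist[i:i+win, j:j+win])
                  for i in range(H - win + 1) for j in range(W - win + 1)]
        return float(np.mean(scores))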

V. VIDEO QUALITY ASSESSMENT

A hybrid video quality assessment method was developed, where the proposed quality indexing approach (with C1 = C2 = 0) was combined with blocking and blurring measures as well as a texture classification algorithm. In this paper, we attempt to use a much simpler method, which employs the SSIM index as a single measure for various types of distortions.

Fig. 2: Proposed Video Quality Assessment System.

The diagram of the proposed video quality assessment system is shown in Figure 2. The quality of the distorted video is measured at three levels: the local region level, the frame level, and the sequence level. First, local sampling areas are extracted from corresponding frames and spatial locations in the original and the distorted video sequences, respectively. The sampling areas are randomly selected 8 × 8 windows. This differs from the method used for still images, where all possible sampling windows are selected as the sliding window moves pixel by pixel across the whole image; here, only a proportion of all possible 8 × 8 windows is selected. We use the number of sampling windows per video frame (Rs) to represent the sampling density. Our experiments show that a properly selected Rs can largely reduce computational cost while still maintaining reasonably robust measurement results. The SSIM indexing approach is then applied to the Y, Cb and Cr colour components independently, and the results are combined into a local quality measure using a weighted summation. Let SSIMij^Y, SSIMij^Cb and SSIMij^Cr denote the SSIM index values of the Y, Cb and Cr components of the j-th sampling window in the i-th frame, respectively.


The local quality index is given by

SSIMij = WY·SSIMij^Y + WCb·SSIMij^Cb + WCr·SSIMij^Cr

where the weights are fixed in our experiments at WY = 0.8, WCb = 0.1 and WCr = 0.1, respectively. In the second level of quality evaluation, the local quality values are combined into a frame-level quality index using

Qi = [ Σj wij·SSIMij ] / [ Σj wij ],   j = 1, ..., Rs

where Qi denotes the quality index of the i-th frame in the video sequence and wij is the weighting value given to the j-th sampling window in the i-th frame. Finally, in the third level, the overall quality of the entire video sequence is given by

Q = [ Σi Wi·Qi ] / [ Σi Wi ],   i = 1, ..., F

where F is the number of frames and Wi is the weighting value assigned to the i-th frame. If all the frames and all the sampling windows in every frame are weighted equally (wij = 1 and Wi = 1 for all i and j), this leads to a quality measure equal to the average SSIM index of all sampling windows in all frames. Such a weighting assignment may not be optimal, because different regions and different frames may be of different importance to human observers. Optimal weighting assignment is difficult because many psychological aspects are involved, which may depend on the content and context of the video sequence being observed. However, certain appropriate adjustments around the all-equal-weighting selection may help to improve the prediction accuracy of the quality assessment algorithm. In this paper, two simple adjustment methods are employed. The first is based on the observation that dark regions usually do not attract fixations and therefore should be assigned smaller weighting values. We use the mean value (as given in (3)) of the Y component as an estimate of the local luminance, and the local weighting wij is reduced accordingly for dark sampling windows. The second adjustment considers the case when very large global motion occurs. Some image distortions are perceived differently when the background of the video is moving very fast (usually corresponding to high-speed camera movement). For example, severe blurring is usually perceived as a very unpleasant type of distortion in still images or slowly moving video; however, the same amount of blur may not be as important in a frame with large motion, perhaps because large perceptual motion blur occurs at the same time. Such differences cannot be captured by the intra-frame SSIM index, which does not involve any motion information. Our experiments also indicate that the proposed algorithm is less stable when very large global motion occurs. Therefore, we give smaller weighting to large-motion frames to improve the robustness of the algorithm. First, for each sampling window, we use a block-based motion estimation algorithm to evaluate its motion with respect to the adjacent (next) frame. Suppose mij represents the motion vector length of the j-th sampling window in the i-th frame; the motion level of the i-th frame is then estimated from the average of the mij values, normalized by a constant KM that serves as a normalization factor of the frame motion level. We use KM = 16 in our experiments. The frame weighting Wi is then reduced for frames whose motion level is large.
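The three-level pooling described above can be sketched in Python as follows, reusing the ssim_patch function from the previous sketch. This is an illustrative skeleton with all-equal weights (wij = Wi = 1); the luminance- and motion-based weight adjustments are omitted, and the representation of frames as dictionaries of full-resolution Y/Cb/Cr planes is our assumption.

    import numpy as np

    W_Y, W_CB, W_CR = 0.8, 0.1, 0.1   # colour-component weights from the paper

    def local_ssim(ref_yuv, dist_yuv, top, left, win=8):
        """Weighted Y/Cb/Cr SSIM of one sampling window.
        ref_yuv and dist_yuv are dicts of full-resolution planes (assumed 4:4:4)."""
        q = 0.0
        for plane, w in (("Y", W_Y), ("Cb", W_CB), ("Cr", W_CR)):
            r = ref_yuv[plane][top:top+win, left:left+win]
            d = dist_yuv[plane][top:top+win, left:left+win]
            q += w * ssim_patch(r, d)
        return q

    def sequence_quality(ref_frames, dist_frames, Rs=50, win=8, seed=0):
        """Three-level pooling with all-equal weights (wij = Wi = 1)."""
        rng = np.random.default_rng(seed)
        frame_scores = []
        for ref, dist in zip(ref_frames, dist_frames):
            H, W = ref["Y"].shape
            tops = rng.integers(0, H - win + 1, size=Rs)
            lefts = rng.integers(0, W - win + 1, size=Rs)
            local = [local_ssim(ref, dist, t, l, win) for t, l in zip(tops, lefts)]
            frame_scores.append(np.mean(local))   # frame-level index Qi
        return float(np.mean(frame_scores))       # sequence-level index Q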

VI. SIMULATION RESULTS

We use videos with different types of distortions to test the Structural Similarity (SSIM) method of video quality assessment. In this project we tested video distorted by a wide variety of corruptions: additive Gaussian noise, impulsive salt-and-pepper noise, multiplicative speckle noise, and blurring, each with different variance values (0.05, 0.01 and 0.1). The computed results are tabulated in Tables 1 and 2. It can be observed from the results of the SSIM method that the higher the value of Q, the higher the quality of the video.
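As a usage illustration (with a synthetic frame rather than the test videos of Tables 1 and 2), such distortions can be generated and scored as follows. The noise variances are interpreted on a normalized [0, 1] intensity scale, which is our assumption, and the mssim function is the sketch given earlier.

    import numpy as np

    yy, xx = np.mgrid[0:72, 0:88]
    ref = ((2.0 * xx + 1.5 * yy) % 256).astype(np.float64)   # synthetic gradient "frame"

    rng = np.random.default_rng(1)

    # Additive Gaussian noise with variance 0.01 on a [0, 1] scale (sigma = 0.1 * 255)
    noisy = np.clip(ref + rng.normal(0.0, np.sqrt(0.01) * 255.0, ref.shape), 0.0, 255.0)

    # Impulsive salt-and-pepper noise affecting about 5 % of the pixels
    sp = ref.copy()
    mask = rng.random(ref.shape) < 0.05
    sp[mask] = rng.choice([0.0, 255.0], size=int(mask.sum()))

    print("Gaussian noise  Q =", round(mssim(ref, noisy), 4))
    print("Salt-and-pepper Q =", round(mssim(ref, sp), 4))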


The new quality index exhibits very consistent correlation with the subjective measures. Our experimental results indicate that it significantly outperforms the MSE under different types of image distortions. It is perhaps surprising that such a simple, mathematically defined quality index performs so well without any HVS model explicitly employed. The success of this quality index is due to its strong ability to measure the structural distortion that occurs during the video degradation process. This is a clear distinction from MSE, which is sensitive to the energy of errors rather than to structural distortion.

VII. CONCLUSION

We designed a new objective video quality assessment system. Our experimental results indicate that it significantly outperforms the MSE under different types of image distortions. The key feature of the proposed method is the use of structural distortion instead of error sensitivity-based measurement for quality evaluation. Experiments on the VQEG FR-TV Phase I test dataset show that it correlates well with perceived video quality. One of the most attractive features of the proposed method is perhaps its simplicity. Note that no complicated procedures (such as spatial and temporal filtering, linear transformations, object segmentation, texture classification, blur evaluation, and blockiness estimation) are involved. This implies that the SSIM index is a simple formula that inherently has effective normalization power for various types of image structures and distortions. The simplicity of the algorithm also makes real-time implementation easy. In addition, the speed of the algorithm can be further adjusted by tuning the frame sampling rate parameter Rs. Our experiments show that reasonably robust performance can be obtained with a relatively small sampling rate (e.g., Rs < 100), allowing real-time software implementation on moderate-speed computers. The proposed method has been found to be consistent with many observations of HVS behaviour. For example, the blocking artifact in JPEG-compressed images may significantly impair the "structure" in smooth image regions, but is less disturbing in highly textured regions; this is captured very well in the quality maps. Other HVS behaviours, however, are not yet modelled; for example, vertical distortions may appear more significant than horizontal distortions. How to systematically connect and adjust the proposed quality index in accordance with psychophysical and physiological HVS studies remains an open problem.

VIII. FUTURE WORK

In order to improve the proposed algorithm, many issues need further investigation in the future. One important issue is related to motion. The current SSIM index is oriented towards the comparison of still-image structures. Notice that there are several significant outliers, where the model gives much lower scores than it should. In fact, most of these significant outliers correspond to video sequences with large global motions (such as SRC5, SRC9 and SRC19 in the VQEG Phase I test dataset). So far, no method has been found to naturally incorporate motion information into the SSIM index measure. We have attempted to apply the same SSIM index measure as in (6) to 3-dimensional windows (instead of the current intra-frame 2-dimensional windows); unfortunately, no significant improvement has been observed. Another issue concerns bursts of errors. For example, when most of the frames in a video sequence have high quality but a few are damaged and have extremely low quality, human observers tend to give a lower quality score than the average over all frames. To address this problem, a non-linear pooling method (instead of the weighted summation used in this project) may need to be applied. Furthermore, how to measure and incorporate colour distortions also needs further investigation.
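One possible form of non-linear pooling, given here only as an illustration of the idea and not as the method this paper proposes, is to average only the worst few per cent of the frame-level scores, so that a short burst of badly damaged frames dominates the overall score. The 10 % figure below is an arbitrary illustrative choice.

    import numpy as np

    def worst_percentile_pooling(frame_scores, pct=10):
        """Average only the worst pct% of frame-level quality scores.
        The 10% default is an illustrative assumption, not a value from the paper."""
        scores = np.sort(np.asarray(frame_scores, dtype=np.float64))
        k = max(1, int(np.ceil(len(scores) * pct / 100.0)))
        return float(scores[:k].mean())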

REFERENCES

1. Eugen Rodel. Video Quality Assessment.

2. ITU-R Recommendation BT.500-10. Methodology for the subjective assessment of the quality of television pictures. ITU, Geneva, Switzerland; 2000.

3. VQEG. Final report from the Video Quality Experts Group on the validation of objective models of video quality assessment. 2000. Available at http://www.vqeg.org.

4. Z. Wang, L. Lu, A. C. Bovik. Video quality assessment based on structural distortion measurement. Signal Processing: Image Communication. 2004; 19.

5. Z. Wang, A. C. Bovik. A universal image quality index. IEEE Signal Processing Letters. 2002; 9.

6. S. Winkler. Vision models and quality metrics for image processing applications. Ph.D. Thesis, Lausanne, Switzerland; 2000.

7. S. Winkler, C. J. van den Branden Lambrecht, M. Kunt. Vision and video: models and applications to image and video processing. Chap. 10, Kluwer Academic Publishers; 2001.

8. Z. Wang, H. R. Sheikh, A. C. Bovik. Objective video quality assessment. In: The Handbook of Video Databases: Design and Applications. CRC Press; 2003.

9. VQEG: The Video Quality Experts Group. http://www.vqeg.org/.

10. H. R. Sheikh, Z. Wang, A. C. Bovik, et al. Image and video quality assessment research at LIVE. http://live.ece.utexas.edu/research/quality.

11. Muthukumar S., Krishnan N., Pasupathi P., Deepa S. Analysis of image inpainting techniques with exemplar, poisson, successive elimination and 8 pixel neighborhood methods. International Journal of Computer Applications. 2010; 9(11).

12. Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, et al. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing. 2004; 13(4).

13. Radim Javůrek. Efficient models for objective video quality assessment.

14. Z. Wang, L. Lu, A. C. Bovik. Video quality assessment using structural distortion measurement. In: Proc. IEEE Int. Conf. Image Processing. 2002; 3: 65–68.

15. Shyamprasad Chikkerur, Vijay Sundaram, Martin Reisslein, et al. Objective video quality assessment methods: A classification, review, and performance comparison. IEEE Transactions on Broadcasting. 2011; 57(2).

Corresponding Author Praveen Gurav*

Dept. of E&CE, KLECET Chikodi

E-Mail – pgurav25@gmail.com