Multiple Selection Approximation for Improved Spatio-Temporal Prediction in Video Coding
Abstract
In this contribution, a novel spatio-temporal prediction algorithm for video coding is introduced. The algorithm exploits temporal as well as spatial redundancies for effectively predicting the signal to be encoded. To achieve this, it operates in two stages: first, motion compensated prediction is applied to the block being encoded; afterwards, this preliminary temporal prediction is refined by forming a joint model of the initial predictor and the spatially adjacent, already transmitted blocks. The novel algorithm outperforms earlier refinement algorithms in speed and prediction quality. Compared to pure motion compensated prediction, the mean data rate can be reduced by up to 15% and a gain of up to 1.16 dB in PSNR can be achieved for the considered sequences.
Index Terms— Signal extrapolation, Video coding, Prediction
1 Introduction
The transmission and playback of digital video data have become more and more popular in the past years. This widespread usage of digital video is only possible because video sequences can be compressed efficiently. Modern hybrid video codecs such as H.264/AVC [1] use several techniques to achieve this compression. One important part of hybrid video codecs is prediction, which aims at extrapolating the signal parts to be encoded from already transmitted signal parts. Since only signal parts that are already available at the decoder are used as source for prediction, only the difference between the predicted signal and the original signal has to be transmitted. Consequently, the amount of data to transmit, and with it the coding efficiency, directly depends on the prediction quality.
For predicting the signal, current video codecs perform either a temporal or a spatial extrapolation. Spatial extrapolation is obtained by continuing the signal from already transmitted regions of the currently processed frame into the region to be encoded; as described in [1], H.264/AVC, e.g., uses several modes for spatial prediction. Temporal extrapolation, in contrast, uses previously transmitted, completely decoded frames. According to [2], these frames are used to perform motion compensated prediction for the area being encoded: in a previous frame, an area is determined that fits the signal being encoded best, and the displacement between the two areas is transmitted as side information. The decoder can then use this information to select the same area from the already transmitted frame and use it for prediction, as illustrated by the sketch below. Even though most modern video codecs can intelligently switch between spatial and temporal prediction, the combined utilization of spatial and temporal redundancies for forming a predictor is seldom performed. Only very few spatio-temporal prediction algorithms exist; two examples are the Pixelwise Adaptive Spatio-Temporal Prediction from [3] and the Joint Predictive Coding from [4].
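To make the temporal path concrete, the following minimal sketch shows full-search block matching in Python/NumPy. The function name, the SAD matching criterion, and all parameter values are illustrative assumptions of this sketch, not the actual implementation of any particular codec.

```python
import numpy as np

def motion_compensate(ref, cur, top, left, bs=16, rng=8):
    """Minimal full-search block matching (illustrative sketch).

    ref       : previously decoded frame (2-D array)
    cur       : current frame
    top, left : position of the block to predict
    bs        : block size, rng : search range in samples
    Returns the best-matching reference block and its motion vector.
    """
    block = cur[top:top + bs, left:left + bs]
    best_cost, best_mv = np.inf, (0, 0)
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + bs > ref.shape[0] or x + bs > ref.shape[1]:
                continue
            cand = ref[y:y + bs, x:x + bs]
            cost = np.abs(cand.astype(np.int64) - block).sum()  # SAD criterion
            if cost < best_cost:
                best_cost, best_mv = cost, (dy, dx)
    dy, dx = best_mv
    return ref[top + dy:top + dy + bs, left + dx:left + dx + bs], best_mv
```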
In [5], we recently proposed another spatio-temporal prediction algorithm. This algorithm works in two stages: in the first stage, a preliminary temporal prediction is obtained by motion compensation; afterwards, the preliminary prediction is spatially refined by Frequency Selective Approximation (FSA) in order to incorporate the redundancies with the adjacent, already transmitted areas into the prediction. Although this algorithm is able to improve coding efficiency significantly, it has the drawback of being computationally very expensive. To cope with this, in [6] we proposed a modification which is noticeably faster but achieves a slightly lower coding efficiency than the original algorithm. In the scope of this contribution, we propose a novel refinement step, the Multiple Selection Approximation (MSA), that combines the advantages of the two earlier algorithms. In the next section, the idea of spatially refined motion compensation is reviewed and the two older algorithms are presented briefly in order to point out their strengths and weaknesses. Afterwards, the novel algorithm is introduced, before its performance is demonstrated by simulations.
2 Spatially Refined Motion Compensated Prediction

For outlining the idea of spatially refined motion compensated prediction, a common hybrid video codec operating in line scan order is regarded. In Fig. 1, the block diagram of such a codec is shown, including the spatial refinement step which can take place subsequent to motion compensated prediction. The block to be predicted is denoted by $\mathcal{B}$ and is centered at position $(m_0, n_0)$ in the current frame. When operating in line scan order, block $\mathcal{B}$ has four adjoining blocks that have already been transmitted and decoded. These blocks are subsumed in region $\mathcal{R}$, called the reconstructed area. For the spatial refinement, the projection area $\mathcal{P}$ of size $3 \times 3$ macroblocks is now regarded. As shown in Fig. 2, area $\mathcal{P}$ is centered on the block to be predicted and further contains the already transmitted blocks and four padding blocks.
The spatially refined motion compensated prediction operates in two stages for forming a predictor for the signal in . First, motion compensated prediction is performed for this block to obtain a preliminary estimate of the signal at the expense of spending some rate for the motion vector. Afterwards, a joint model is formed for union , the approximation area. Then, the samples corresponding to area are taken from the model and are used for prediction. This joint model is used to incorporate temporal as well as spatial dependencies into the predictor and therewith form a better predictor. Since only signal parts that are also available at the decoder are used for model generation, the decoder can build the model in exactly the same way as the encoder.
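As an illustration of how the involved signals could be arranged, the following sketch assembles the projection area from the four decoded neighbors and the motion compensated block; all names and layout details are assumptions of this sketch.

```python
import numpy as np

def build_projection_area(decoded, mc_block, top, left, bs=16):
    """Assemble the 3x3-macroblock projection area P (illustrative sketch).

    decoded   : frame buffer holding the already reconstructed samples
    mc_block  : preliminary motion compensated predictor for block B
    top, left : position of block B in the frame
    Returns the unrefined signal s over P and a mask marking padding samples.
    """
    P = np.zeros((3 * bs, 3 * bs))
    padding = np.ones((3, 3), dtype=bool)  # True where no data is available
    # the four already transmitted neighbors in line scan order:
    # left, top-left, top, and top-right of block B
    for (by, bx) in [(1, 0), (0, 0), (0, 1), (0, 2)]:
        y, x = top + (by - 1) * bs, left + (bx - 1) * bs
        if 0 <= y and 0 <= x and y + bs <= decoded.shape[0] and x + bs <= decoded.shape[1]:
            P[by * bs:(by + 1) * bs, bx * bs:(bx + 1) * bs] = decoded[y:y + bs, x:x + bs]
            padding[by, bx] = False
    # center: the motion compensated estimate of block B
    P[bs:2 * bs, bs:2 * bs] = mc_block
    padding[1, 1] = False
    return P, np.kron(padding, np.ones((bs, bs), dtype=bool))
```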
The model to be generated over area $\mathcal{P}$ is denoted by $g[m,n]$ with spatial coordinates $m$ and $n$. In general, area $\mathcal{P}$ is of size $M \times N$ samples and contains the unrefined signal $s[m,n]$, depicting the reconstructed blocks and the motion compensated block. The model

$g[m,n] = \sum_{k \in \mathcal{K}} c_k \, \varphi_k[m,n]$ (1)

results from superimposing the mutually orthogonal two-dimensional basis functions $\varphi_k[m,n]$ with corresponding weights $c_k$. In set $\mathcal{K}$, the indices of all basis functions used for the model generation are subsumed. So the task is to determine which basis functions to use for the model and to determine the appropriate weights in such a way that the model fits the original signal to be predicted.
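For illustration, the following sketch constructs an orthonormal two-dimensional basis and evaluates the superposition of (1). The real-valued 2-D DCT is used here in place of the DFT basis employed in this work, purely to keep the sketch simple.

```python
import numpy as np

def dct_basis(M, N):
    """Orthonormal separable 2-D DCT-II basis over an M x N area.

    Returns an array of shape (M*N, M, N); basis[k] is one basis function.
    (The paper uses the 2-D DFT basis set; the real-valued DCT is used
    here only to keep the sketch simple.)
    """
    def dct1d(L):
        n = np.arange(L)
        mat = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * L))
        mat *= np.sqrt(2.0 / L)
        mat[0] /= np.sqrt(2.0)
        return mat  # rows are orthonormal 1-D basis vectors

    bu, bv = dct1d(M), dct1d(N)
    return np.einsum('um,vn->uvmn', bu, bv).reshape(M * N, M, N)

def model_from_weights(weights, basis, selected):
    """Evaluate the model g of (1) for a given index set."""
    g = np.zeros(basis.shape[1:])
    for k in selected:
        g += weights[k] * basis[k]
    return g
```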
The already proposed algorithms for spatial refinement mentioned above utilize either Frequency Selective Approximation (FSA) [5] or Relaxed Best Approximation (RBA) [6] for the model generation. In doing so, both algorithms iteratively generate the model by selecting one (FSA) or several (RBA) basis functions per iteration to be added to the model in a certain iteration step. Thereby, the basis functions are selected in such a way that the model $g^{(\nu)}[m,n]$ in the $\nu$-th iteration step approximates $s[m,n]$, meaning that in every iteration step the weighted approximation error energy

$E^{(\nu)} = \sum_{(m,n) \in \mathcal{P}} w[m,n] \left( s[m,n] - g^{(\nu)}[m,n] \right)^2$ (2)
is reduced. The weighting function $w[m,n]$ is used to exclude the padding blocks and to control the influence the samples have on the model generation depending on their position. Samples far away from $\mathcal{B}$ are in general only weakly correlated with the signal being predicted and thus have to get a smaller weight; they thereby have less influence on the model generation than samples close to $\mathcal{B}$. Since the preliminary temporal extrapolation through motion compensation already is a good estimate of the original signal, this part has to get a relatively high and constant weight. According to [5, 6], the weighting function is described by

$w[m,n] = \begin{cases} \delta & \text{for } (m,n) \in \mathcal{B} \\ \hat{\rho}^{\sqrt{(m-m_0)^2 + (n-n_0)^2}} & \text{for } (m,n) \in \mathcal{R} \\ 0 & \text{else} \end{cases}$ (3)

with $\delta$ controlling the weight of the motion compensated estimate and an exponentially decreasing weight for the samples in $\mathcal{R}$, controlled by decay factor $\hat{\rho}$.
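A possible realization of the weighting function (3) is sketched below; the values chosen for $\delta$ and $\hat{\rho}$ are illustrative placeholders, not the values used in [5, 6].

```python
import numpy as np

def weighting_function(M, N, bs=16, delta=2.0, rho=0.8, padding_mask=None):
    """Weighting function w[m, n] of (3) (parameter values illustrative).

    Constant weight delta on the motion compensated block B in the center,
    exponentially decaying weight rho**distance on the reconstructed area R,
    zero on the padding blocks.
    """
    m, n = np.mgrid[0:M, 0:N]
    m0, n0 = (M - 1) / 2.0, (N - 1) / 2.0          # center of block B
    dist = np.sqrt((m - m0) ** 2 + (n - n0) ** 2)
    w = rho ** dist                                 # decaying weight on R
    center = (slice(M // 2 - bs // 2, M // 2 + bs // 2),
              slice(N // 2 - bs // 2, N // 2 + bs // 2))
    w[center] = delta                               # constant weight on B
    if padding_mask is not None:
        w[padding_mask] = 0.0                       # exclude padding blocks
    return w
```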
For determining which basis functions to add to the model in a certain iteration step, the approximation error

$r^{(\nu)}[m,n] = s[m,n] - g^{(\nu)}[m,n]$ (4)

from the previous iteration step is projected onto all basis functions. In doing so, again the weighting function is used to control the influence of the samples depending on their position. This leads, for all $k$, to the projection coefficients

$p_k^{(\nu)} = \dfrac{\sum_{(m,n) \in \mathcal{P}} r^{(\nu-1)}[m,n] \, \varphi_k[m,n] \, w[m,n]}{\sum_{(m,n) \in \mathcal{P}} w[m,n] \, \varphi_k^2[m,n]}$ (5)

resulting from the residual's weighted projection onto the basis function $\varphi_k[m,n]$.
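The weighted projection of (5) can be computed for all basis functions at once, e.g. as follows (a sketch under the conventions of the previous code fragments).

```python
import numpy as np

def projection_coefficients(residual, basis, w):
    """Weighted projection of the residual onto every basis function, (5).

    residual : r[m, n] from (4);  basis : (K, M, N);  w : weighting function.
    Returns p[k] for all k.
    """
    num = np.einsum('mn,kmn->k', residual * w, basis)
    den = np.einsum('mn,kmn->k', w, basis ** 2)
    return num / den
```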
From this point on, FSA and RBA differ. Using FSA, based on the projection coefficients one basis function is selected to be added to the model; the portion this basis function has of the residual is added to the model and removed from the residual. Using RBA, in contrast, several basis functions are selected in every iteration step and the input signal is then projected onto the space spanned by all basis functions from this and the previous iterations. For this reason, the approximation of $s[m,n]$ is superior to FSA and far fewer iterations are needed, as shown in [6]. But it is important to note that the final aim is not to approximate $s[m,n]$, but to estimate the original weights of the different basis functions. For RBA, the problem arises that the estimation can fail if a basis function which is only weakly represented in the original signal is selected and added to the model. This is due to the fact that the basis functions are not orthogonal anymore if evaluated over area $\mathcal{A}$ in combination with the weighting function. Hence, the strongly represented basis functions that have been estimated well up to the selection of a weakly represented one get distorted, as they have to compensate the portion the weakly represented one has in their direction. To illustrate this problem, the two-dimensional example in Fig. 3 is regarded. There, the top subfigure shows the signal $s$, emanating from the superposition of $\varphi_1$, $\varphi_2$ and $\varphi_3$ with the original weights $c_1$, $c_2$ and $c_3$. Since the original distribution of the weights is not known, FSA and RBA aim at estimating the weights from $s$. The bottom left subfigure shows the estimation after two iterations of FSA; obviously, the estimated weights correspond well to the original ones. In contrast to this, the bottom right subfigure shows the estimation after two iterations of RBA. There, $\hat{c}_3$ is estimated too large since $s$ is projected onto the space spanned by $\varphi_1$ and $\varphi_3$, and the portion of the unselected $\varphi_2$ cannot be accounted for. And as $\varphi_1$ and $\varphi_3$ are not orthogonal, $\hat{c}_1$ also has to be changed to compensate the portion of $\varphi_3$ in the direction of $\varphi_1$. This leads to clearly distorted estimates $\hat{c}_1$ and $\hat{c}_3$.
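The following small numeric experiment reproduces this effect with three hypothetical, non-orthogonal unit vectors standing in for the weighted basis functions: greedy one-function-at-a-time estimation stays close to the original weights, whereas a joint least-squares projection onto a subspace that happens to include the weakly represented function but misses a strong one distorts both estimated weights.

```python
import numpy as np

# Hypothetical example: v1 and v2 orthogonal, v3 correlated with both.
v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
v3 = np.array([0.3, 0.5, np.sqrt(1 - 0.3**2 - 0.5**2)])
V = np.stack([v1, v2, v3])

s = 5.0 * v1 + 4.0 * v2 + 0.5 * v3          # original weights: 5.0, 4.0, 0.5

# FSA-like greedy estimation: one basis function per iteration
r, est = s.copy(), np.zeros(3)
for _ in range(2):
    p = V @ r                                # projection coefficients
    k = int(np.argmax(np.abs(p)))
    est[k] += p[k]                           # add the portion to the model ...
    r -= p[k] * V[k]                         # ... and remove it from the residual
print("greedy estimates:", est)              # ~ (5.15, 4.25, 0): close to original

# Joint projection onto a subspace that contains the weakly represented v3
# but misses the strong v2, whose portion cannot be accounted for.
A = np.stack([v1, v3], axis=1)
t, *_ = np.linalg.lstsq(A, s, rcond=None)
print("joint LS on {v1, v3}:", t)            # ~ (4.34, 2.70): both weights distorted
```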
So, although RBA achieves a better approximation, its model generation is inferior to FSA, as FSA is able to estimate the weights more accurately. At the same time, RBA needs far fewer iterations than FSA and is computationally far less expensive. In the following, we propose a novel model generation algorithm, the Multiple Selection Approximation (MSA). MSA combines the advantages of FSA and RBA, meaning that it has the same robustness against a bad basis function selection as FSA and is computationally nearly as efficient as RBA.
3 Multiple Selection Approximation
The robustness of FSA is caused by the fact that in every iteration only the just selected basis function is touched and the model generated so far is not modified, whereas the high speed of RBA originates from the circumstance that in every iteration step several basis functions are selected and the unrefined signal is projected onto the subspace spanned by all basis functions selected so far. In order to combine the advantages of both, MSA selects several basis functions per iteration and projects the residual onto the subspace spanned by only the basis functions selected in the current iteration. The other basis functions, which are not involved in this iteration step, are left untouched.
The principal behavior of MSA is akin to FSA and RBA. Again, the model from (1) is generated iteratively, where the initial model $g^{(0)}[m,n]$ is set to zero. To determine which basis functions to select in an iteration, first of all, again a weighted projection of residual $r^{(\nu-1)}[m,n]$ is performed onto all basis functions according to (5), yielding the projection coefficients $p_k^{(\nu)}$. Then, the decrement

$\Delta E_k^{(\nu)} = \left( p_k^{(\nu)} \right)^2 \sum_{(m,n) \in \mathcal{P}} w[m,n] \, \varphi_k^2[m,n]$ (6)
of the weighted approximation error energy that could be achieved if basis function $\varphi_k[m,n]$ alone were removed from the residual is computed. In doing so, the basis functions that contribute strongly to the residual can be identified and a set of basis functions to use can be determined. As mentioned before, the projection onto a subspace can fail if basis functions which are only weakly represented in the residual are included in the subspace. Thus, only the basis functions are selected that alone could lead to a decrement

$\Delta E_k^{(\nu)} \geq \gamma \cdot \max_{j} \Delta E_j^{(\nu)}$ (7)

larger than the energy fraction threshold $\gamma$ times the maximal possible decrement. In order to limit the size of the subspace, and with it the complexity of the subsequent projection, further only the $N_{\max}$ basis functions corresponding to the largest decrements are selected. The indices of the basis functions fulfilling both requirements are then subsumed in set $\mathcal{S}^{(\nu)}$.
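A sketch of this selection step, combining the decrement computation of (6) with the threshold test of (7) and the size limit, could look as follows; the default values for gamma and n_max are again illustrative.

```python
import numpy as np

def select_basis_functions(p, basis, w, gamma=0.2, n_max=20):
    """Basis function selection of MSA according to (6) and (7) (sketch).

    p : projection coefficients from (5);  basis : (K, M, N);  w : weights.
    Returns the index set S of basis functions for the subspace projection.
    """
    norms = np.einsum('mn,kmn->k', w, basis ** 2)
    dE = p ** 2 * norms                        # decrement of (6)
    eligible = dE >= gamma * dE.max()          # energy fraction threshold, (7)
    candidates = np.flatnonzero(eligible)
    # keep only the n_max largest decrements to limit the subspace size
    order = candidates[np.argsort(dE[candidates])[::-1]]
    return order[:n_max]
```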
Subsequent to the basis function selection, residual $r^{(\nu-1)}[m,n]$ is projected onto the space spanned by the selected functions. For this, the squared weighted distance

$d = \sum_{(m,n) \in \mathcal{P}} w[m,n] \left( r^{(\nu-1)}[m,n] - \sum_{k \in \mathcal{S}^{(\nu)}} t_k \, \varphi_k[m,n] \right)^2$ (8)
between the residual and the weighted projection onto the subspace is minimized. There, the variables $t_k$ depict the new projection coefficients to determine. The minimization is carried out by setting the partial derivatives

$\dfrac{\partial d}{\partial t_l} = -2 \sum_{(m,n) \in \mathcal{P}} w[m,n] \left( r^{(\nu-1)}[m,n] - \sum_{k \in \mathcal{S}^{(\nu)}} t_k \, \varphi_k[m,n] \right) \varphi_l[m,n]$ (9)

with respect to all $t_l$ to zero. Evaluating the equations above yields the following system of equations which has to be solved for calculating the projection coefficients:

$\sum_{k \in \mathcal{S}^{(\nu)}} t_k \sum_{(m,n) \in \mathcal{P}} w[m,n] \, \varphi_k[m,n] \, \varphi_l[m,n] = \sum_{(m,n) \in \mathcal{P}} w[m,n] \, r^{(\nu-1)}[m,n] \, \varphi_l[m,n] \quad \forall\, l \in \mathcal{S}^{(\nu)}$ (10)
In the equation above, the new index $l$ is introduced to distinguish between the equation index and the index of the summation in each equation. As this system of equations is maximally of size $N_{\max} \times N_{\max}$ and as the terms $\sum_{(m,n)} w[m,n] \, \varphi_k[m,n] \, \varphi_l[m,n]$ form a symmetric matrix, it can be solved efficiently by utilizing, e.g., a Cholesky decomposition.
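The subspace projection of (8)-(10) then amounts to setting up the Gram system and solving it, e.g. with SciPy's Cholesky routines, as in the following sketch.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def subspace_projection(residual, basis, w, S):
    """Weighted projection of the residual onto span{phi_k, k in S}, (8)-(10).

    Sets up the symmetric system of (10) and solves it with a Cholesky
    decomposition, returning the projection coefficients t_k.
    """
    sub = basis[S]                                    # selected basis functions
    G = np.einsum('kmn,lmn->kl', sub * w, sub)        # Gram matrix of (10)
    rhs = np.einsum('mn,lmn->l', residual * w, sub)   # right-hand side of (10)
    return cho_solve(cho_factor(G), rhs)
```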
The projection coefficients $t_k$ now can be used to estimate the weights of the corresponding basis functions in the model. The weights are obtained from the projection coefficients by compensating the remaining orthogonality deficiency, resulting in

$\hat{c}_k^{(\nu)} = \gamma_{\mathrm{od}} \cdot t_k \,, \quad \forall\, k \in \mathcal{S}^{(\nu)}.$ (11)
As outlined in detail in [7], the problem with orthogonality deficiency is that, although the basis functions are orthogonal with respect to area $\mathcal{P}$, they are not orthogonal if evaluated in area $\mathcal{A}$ in combination with the weighting function. Due to this orthogonality deficiency, the weight of a basis function can get overemphasized by portions of other, not selected, basis functions aiming in a similar direction. To compensate for this, the orthogonality deficiency factor $\gamma_{\mathrm{od}}$ causes only a fraction of the projection coefficient to be taken, and overemphasizing a basis function is thereby prevented. If the portion is estimated too small, the same basis function can get selected again in a later iteration.
At the end of each iteration step, the model is updated by adding the selected basis functions:

$g^{(\nu)}[m,n] = g^{(\nu-1)}[m,n] + \sum_{k \in \mathcal{S}^{(\nu)}} \hat{c}_k^{(\nu)} \, \varphi_k[m,n].$ (12)

The steps described above are repeated until a predefined number of iterations is reached. Finally, area $\mathcal{B}$ is cut out of the model $g[m,n]$ and is used for predicting the block to be coded.
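Putting the pieces together, one complete MSA model generation could be sketched as follows, reusing the helper functions from the previous fragments; all parameter values are illustrative placeholders.

```python
import numpy as np

def msa(s, basis, w, iterations=4, gamma=0.2, n_max=20, gamma_od=0.7):
    """Multiple Selection Approximation (sketch; parameter values illustrative).

    s     : unrefined signal over the projection area
    basis : (K, M, N) array of basis functions
    w     : weighting function of (3)
    Returns the generated model g.
    """
    g = np.zeros_like(s)                              # initial model
    weights = np.zeros(len(basis))
    for _ in range(iterations):
        r = s - g                                     # residual, (4)
        p = projection_coefficients(r, basis, w)      # (5)
        S = select_basis_functions(p, basis, w, gamma, n_max)  # (6), (7)
        t = subspace_projection(r, basis, w, S)       # (8)-(10)
        c_hat = gamma_od * t                          # orthogonality deficiency, (11)
        weights[S] += c_hat
        g = g + np.einsum('k,kmn->mn', c_hat, basis[S])  # model update, (12)
    return g
```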
4 Simulations and Results
For evaluating the abilities of the novel algorithm, it was implemented into the H.264/AVC reference encoder JM10.2, Baseline Profile, Level 2.0. Thereby, motion compensation is carried out with quarter-pel accuracy and a fixed maximum search range, and one reference frame is used. In order to compare the prediction quality, rate control is switched off and fixed QPs are used. The CIF sequences "Crew", "Discovery City", "Discovery Orient", "Foreman", and "Vimto" are encoded in IPPP order.
Altogether, four different prediction algorithms are considered: pure motion compensation without spatial refinement, and spatially refined motion compensation where the refinement is carried out either by FSA [5], by RBA [6], or by the novel MSA. Since for some blocks the spatial refinement can produce a worse predictor than pure motion compensation, the encoder has to test whether the spatial refinement should be applied or not. This simple rate-distortion optimization is performed by comparing the predictors to the original block in terms of the mean squared error, as sketched below. The decision whether the refinement is applied or not further has to be transmitted to the decoder. For this, one additional bit per macroblock is added as a worst case estimate for the additional side information.
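The mode decision described above reduces to a simple MSE comparison; the following sketch (with hypothetical names) illustrates it.

```python
import numpy as np

def choose_predictor(original, mc_pred, refined_pred):
    """Mode decision sketch: keep the refinement only if it reduces the MSE.

    Returns the chosen predictor and the flag signaled to the decoder
    (one additional bit per macroblock as a worst case estimate).
    """
    mse_mc = np.mean((original - mc_pred) ** 2)
    mse_ref = np.mean((original - refined_pred) ** 2)
    use_refinement = mse_ref < mse_mc
    return (refined_pred if use_refinement else mc_pred), use_refinement
```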
The basis functions used for model generation by the three refinement algorithms are the ones from the two-dimensional discrete Fourier transform. In [5, 6], this set of basis functions has already proven to be well suited for the model generation. Considering the weighting function, the preliminary temporally extrapolated block is weighted by the constant $\delta$ and the neighboring blocks are weighted by the exponentially decreasing function with decay factor $\hat{\rho}$. A fixed number of iterations is used for the model generation by FSA and by RBA, whereas for RBA maximally 20 basis functions can get selected in an iteration step. For MSA, the number of iterations performed, the energy fraction threshold $\gamma$, the maximum number of basis functions per iteration $N_{\max}$, and the orthogonality deficiency compensation $\gamma_{\mathrm{od}}$ are likewise set to fixed values. Fortunately, none of the mentioned parameters is very critical or sequence dependent: all of them can be varied widely without noticeably impairing the prediction quality.
Fig. 4 shows the rate-distortion curves for the sequences "Discovery City" and "Discovery Orient" for the four different prediction algorithms. Obviously, the coding efficiency can be improved significantly by applying spatial refinement subsequent to motion compensated prediction, independently of the actual refinement algorithm. In order to compare the performance of the different refinement algorithms for the regarded sequences, Table 1 lists the mean rate reduction and the mean PSNR gain compared to pure motion compensation. The mean rate reduction and the mean gain are computed over the complete QP range according to [8]. Regarding Table 1, one can see that the rate can be reduced by up to 15% and the PSNR can be improved by up to 1.16 dB compared to motion compensated prediction. By further comparing the different refinement algorithms among each other, it becomes obvious that MSA is able to outperform the other two algorithms, achieving additional gains over both FSA and RBA.
Table 1 further shows the mean processing time per frame for the spatial refinement. Thereby, the spatial refinement is carried out in MATLAB v7.6 on one core of an Intel Core2 CPU. To accomplish this, the motion compensated block and the neighboring blocks are transferred from the JM reference software to MATLAB, where the refinement is carried out. After refinement, the processed block is transferred back to the H.264/AVC encoder. Comparing the processing times of the three algorithms, one can see that MSA is nearly as fast as RBA and several times faster than FSA. The speed-up of MSA compared to FSA is slightly smaller than the ratio of the iteration counts, since with MSA every iteration is a little more complex due to the projection onto the subspace. But overall, the reduction of the number of iterations leads to the significant reduction in processing time. Altogether, the novel refinement algorithm is able to combine the strengths of the other two algorithms: it is nearly as fast as RBA and at the same time is able to sustain, or rather even improve, the higher prediction quality of FSA.
Table 1: Mean rate reduction, mean PSNR gain, and mean processing time per frame of the three refinement algorithms, compared to pure motion compensated prediction.

| | "Crew" | "Discovery City" | "Discovery Orient" | "Foreman" | "Vimto" |
|---|---|---|---|---|---|
| **Mean rate reduction** | | | | | |
| MSA | | | | | |
| FSA [5] | | | | | |
| RBA [6] | | | | | |
| **Mean PSNR gain** | | | | | |
| MSA | | | | | |
| FSA [5] | | | | | |
| RBA [6] | | | | | |
| **Mean processing time per frame** | | | | | |
| MSA | | | | | |
| FSA [5] | | | | | |
| RBA [6] | | | | | |
5 Conclusion
In the scope of this paper, we presented a novel spatio-temporal prediction algorithm for video coding as an improvement over two existing ones. The novel algorithm combines the advantages of the two older ones, leading to an improved prediction quality while simultaneously reducing the processing time. Compared to classical motion compensated prediction, a rate reduction of up to 15% and a PSNR gain of up to 1.16 dB can be achieved for the considered sequences. The novel algorithm outperforms the better of the two older algorithms in terms of PSNR and at the same time is nearly as fast as the faster of the two.
Although the novel algorithm shows a significant improvement over the two older ones, our current research aims at further reducing the computational complexity.
References
- [1] Iain Richardson, H.264 & MPEG-4 Video Compression, Wiley, West Sussex, England, Aug. 2003.
- [2] Frédéric Dufaux and Fabrice Moscheni, “Motion estimation techniques for digital TV: A review and a new contribution,” Proceedings of the IEEE, vol. 83, no. 6, pp. 858–876, June 1995.
- [3] M. G. Day and J. A. Robinson, “Residue-free video coding with pixelwise adaptive spatio-temporal prediction,” IET Image Processing, vol. 2, no. 3, pp. 131–138, 2008.
- [4] Wenfei Jiang, Longin Jan Latecki, Wenyu Liu, Hui Liang, and Ken Gorman, "A video coding scheme based on joint spatiotemporal and adaptive prediction," IEEE Trans. Image Process., vol. 18, pp. 1025–1036, 2009.
- [5] Jürgen Seiler and André Kaup, “Spatio-temporal prediction in video coding by spatially refined motion compensation,” in Proc. Int. Conf. on Image Processing (ICIP), San Diego, USA, 12.-15. Oct. 2008, pp. 2788–2791.
- [6] Jürgen Seiler, Haricharan Lakshman, and André Kaup, “Spatio-temporal prediction in video coding by best approximation,” in Proc. Picture Coding Symposium (PCS), Chicago, USA, May 6-8 2009.
- [7] Jürgen Seiler and André Kaup, “Fast orthogonality deficiency compensation for improved frequency selective image extrapolation,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, USA, 31. March - 4. April 2008, pp. 781–784.
- [8] Gisle Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” Tech. Rep., ITU-T VCEG Meeting, Austin, Texas, USA, document VCEG-M33, April 2001.