Decoding of framewise compressed-sensed video via interframe total variation minimization

Abstract. Compressed sensing is the theory and practice of sub-Nyquist sampling of sparse signals of interest. Perfect reconstruction may then be possible with significantly fewer data than the number required by Nyquist sampling. In this work, we consider a video system in which acquisition is performed via framewise pure compressed sensing. The burden of quality video sequence reconstruction then falls solely on the decoder side. We show that effective decoding can be carried out at the receiver/decoder in the form of interframe total variation minimization. Experimental results demonstrate these developments.


1 Introduction
By the Nyquist-Shannon sampling theory, to reconstruct a signal without error, the sampling rate must be at least twice the highest frequency of the signal. Compressed sensing (CS) is the theory and practice of sub-Nyquist sampling of sparse signals of interest.2,3 Rather than collecting an entire Nyquist ensemble of signal samples, CS can reconstruct sparse signals from a small number of (random3 or deterministic4) linear measurements via convex optimization,5 linear regression,6,7 or greedy recovery algorithms.8 An example of a CS application that has attracted interest is the "single-pixel camera" architecture,9 in which a still image can be produced from significantly fewer captured measurements than the number of desired/reconstructed image pixels. A desirable next-step development is compressive video streaming. In the present work, we consider a video transmission system in which the transmitter/encoder performs pure, direct compressed-sensing acquisition without the benefits of the familiar sophisticated forms of video encoding. This setup is of interest, for example, in problems that involve large wireless multimedia networks of primitive low-complexity, power-limited video sensors. CS is potentially an enabling technology in this context,10 as video acquisition would require minimal or no computational power at all, yet transmission bandwidth would still be greatly reduced. In such a case, the burden of quality video reconstruction falls solely on the receiver/decoder side. In comparison, conventional predictive encoding schemes [H.26411 or high-efficiency video coding (HEVC)12] are known to offer great transmission bandwidth savings for targeted video quality levels, but place strong complexity and power-consumption demands on the encoder side.
The transmission bandwidth and the quality of the reconstructed CS video are determined by the number of collected measurements, which, based on CS principles, should be proportional to the sparsity level of the signal. The challenge of implementing a well-compressed and well-reconstructed CS-based video streaming system rests on developing effective sparse representations and corresponding video recovery algorithms. Several methods for CS video recovery have already been proposed, each relying on a different sparse representation. An intuitive (JPEG-motivated) approach is to independently recover each frame using the two-dimensional discrete cosine transform (2D-DCT)13 or a two-dimensional discrete wavelet transform (2D-DWT).14 As an improvement that enhances sparsity by exploiting correlations among successive frames, several frames can be jointly recovered under a three-dimensional DWT (3D-DWT)14 or a 2D-DWT applied to interframe difference data.15 To enhance sparse representation and exploit motion among successive video frames, a video sequence is divided into key frames and CS frames in Refs. 16 and 17. Whereas each key frame is reconstructed individually using a fixed basis (e.g., 2D-DWT or 2D-DCT), each CS frame is reconstructed conditionally using a basis adaptively generated from adjacent, already reconstructed key frames. In Refs. 18-20, each frame of a compressed-sensed video sequence is reconstructed iteratively using adaptively generated Karhunen-Loève transform (KLT) bases from neighboring frames.
Another approach to compressed-sensed signal recovery is total-variation (TV) minimization. TV minimization, also known as TV regularization, has been widely used in the past as an image denoising algorithm.21,22 It is based on the principle that signals with excessive, likely spurious detail have high TV (that is, the integral of the absolute gradient of the signal is high); reducing the TV of the reconstructed signal while staying consistent with the collected samples removes unwanted detail while preserving important information such as edges.24-27 In Refs. 28 and 29, a multiframe CS video encoder was proposed with interframe TV minimization decoding.
Although promising, such a system requires complex and expensive spatial-temporal light modulators that make the technique difficult to implement in practice.
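To make the TV-denoising principle concrete, here is a deliberately simple 1-D sketch that minimizes a least-squares fidelity term plus λ·TV by fixed-step subgradient descent. This toy is ours for illustration (the function name, step sizes, and solver choice are assumptions, not the methods of Refs. 21 and 22, which use more sophisticated algorithms):

```python
import numpy as np

def tv_denoise_1d(y, lam=0.5, step=0.02, iters=300):
    """Minimize 0.5*||x - y||^2 + lam * sum_k |x[k+1] - x[k]|
    by plain fixed-step subgradient descent (illustrative only)."""
    x = y.astype(np.float64).copy()
    for _ in range(iters):
        sg = np.sign(np.diff(x))     # subgradient of the TV term
        g = x - y                    # gradient of the fidelity term
        g[:-1] -= lam * sg           # d|x[k+1]-x[k]|/dx[k]   = -sign(.)
        g[1:] += lam * sg            # d|x[k+1]-x[k]|/dx[k+1] = +sign(.)
        x -= step * g
    return x
```

The output flattens small fluctuations while keeping large jumps, i.e., it lowers TV while staying consistent with the data, which is the behavior the decoder relies on.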
In the present work, we propose a system that consists of a pure framewise CS video encoder in which each video frame is encoded independently using compressive sensing. Such a CS video acquisition system can be implemented directly with existing CS imaging technology. At the receiver/decoder, we develop and describe in detail a procedure by which multiple independently encoded video frames are jointly and successfully recovered via sliding-window-based interframe TV minimization.
The rest of this paper is organized as follows. In Sec. 2, we briefly review TV-based CS signal recovery principles. In Sec. 3, the proposed framewise CS video acquisition system with interframe TV minimization decoding is described in detail. Experimental results are presented and examined in Sec. 4, and a few conclusions are drawn in Sec. 5.

2 Compressive Sampling with TV Minimization Reconstruction

In this section, we briefly review 2-D and 3-D signal acquisition by CS and recovery using sparse gradient constraints (TV minimization). If the signal of interest is a 2-D image $X \in \mathbb{R}^{m \times n}$ and $x = \mathrm{vec}(X) \in \mathbb{R}^{N}$, $N = mn$, represents the vectorization of $X$ via column concatenation, then CS measurements of $X$ are collected in the form of
\[
y = \Phi\,\mathrm{vec}(X), \tag{1}
\]
with a linear measurement matrix $\Phi \in \mathbb{R}^{P \times N}$, $P \ll N$.24-27 If $x_{i,j}$ denotes the pixel in the $i$'th row and $j$'th column of $X$, the discrete spatial gradient of $X$ at $x_{i,j}$ can be interpreted as the 2-D vector of horizontal and vertical differences (with periodic boundaries),
\[
D_{ij}[X] = \begin{bmatrix} D_{h;ij}[X] \\ D_{v;ij}[X] \end{bmatrix},
\quad
D_{h;ij}[X] = \begin{cases} x_{i,j+1} - x_{i,j}, & j < n, \\ x_{i,1} - x_{i,j}, & j = n, \end{cases}
\quad
D_{v;ij}[X] = \begin{cases} x_{i+1,j} - x_{i,j}, & i < m, \\ x_{1,j} - x_{i,j}, & i = m, \end{cases}
\tag{2}
\]
and the anisotropic 2D-TV of $X$ is simply the sum of the magnitudes of this discrete gradient at every pixel,
\[
\mathrm{TV}(X) = \sum_{i=1}^{m} \sum_{j=1}^{n} \left( |D_{h;ij}[X]| + |D_{v;ij}[X]| \right). \tag{3}
\]
To reconstruct $X$, we can solve the convex program
\[
\hat{X} = \arg\min_{X'} \mathrm{TV}(X') \quad \text{subject to} \quad \Phi\,\mathrm{vec}(X') = y. \tag{4}
\]
However, in practical situations the measurement vector $y$ may be corrupted by noise. Then, CS acquisition of $X$ can be formulated as
\[
y = \Phi\,\mathrm{vec}(X) + e, \tag{5}
\]
where $e$ is an unknown noise vector bounded by a presumably known power amount $\|e\|_{\ell_2} \le \epsilon$, $\epsilon > 0$. To recover $X$, we can use 2D-TV minimization as in Eq.
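The anisotropic 2D-TV is straightforward to compute directly. The sketch below is an illustration, not the paper's code (`tv2d` is a name of our choosing), using the periodic (wrap-around) boundary convention of the gradient definitions:

```python
import numpy as np

def tv2d(X):
    """Anisotropic 2D-TV: sum over all pixels of |horizontal gradient|
    + |vertical gradient|, with periodic (wrap-around) boundaries."""
    dh = np.roll(X, -1, axis=1) - X  # D_h: x[i, j+1] - x[i, j], wraps at j = n
    dv = np.roll(X, -1, axis=0) - X  # D_v: x[i+1, j] - x[i, j], wraps at i = m
    return np.abs(dh).sum() + np.abs(dv).sum()
```

A constant image has zero TV, while edges contribute in proportion to their length, which is why TV minimization suppresses scattered noise yet preserves edges.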
(4) with a relaxed constraint in the form of
\[
\hat{X} = \arg\min_{X'} \mathrm{TV}(X') \quad \text{subject to} \quad \|\Phi\,\mathrm{vec}(X') - y\|_{\ell_2} \le \epsilon. \tag{6}
\]
Moving on now to the needs of the specific CS video work presented in this paper, if the underlying signal is a video signal $F \in \mathbb{R}^{m \times n \times q}$ representing a stack of $q$ successive frames $F_t \in \mathbb{R}^{m \times n}$, $t = 1, \ldots, q$, then concatenating the columns of all $F_1, \ldots, F_q$ results in a length-$mnq$ vector $f = \mathrm{vec}(F)$. If $f_{i,j,t}$ denotes the pixel at the $i$'th row and $j$'th column of frame $F_t$, then the horizontal, vertical, and temporal gradients at $f_{i,j,t}$ can be defined, respectively, as
\[
D_{h;ijt}[F] = \begin{cases} f_{i,j+1,t} - f_{i,j,t}, & j < n, \\ f_{i,1,t} - f_{i,j,t}, & j = n, \end{cases}
\quad
D_{v;ijt}[F] = \begin{cases} f_{i+1,j,t} - f_{i,j,t}, & i < m, \\ f_{1,j,t} - f_{i,j,t}, & i = m, \end{cases}
\quad
D_{t;ijt}[F] = \begin{cases} f_{i,j,t+1} - f_{i,j,t}, & t < q, \\ f_{i,j,1} - f_{i,j,t}, & t = q. \end{cases}
\tag{7}
\]
Correspondingly, the spatial-temporal gradient of $F$ at $f_{i,j,t}$ can be interpreted as the 3-D vector $[D_{h;ijt}[F],\, D_{v;ijt}[F],\, D_{t;ijt}[F]]^{T}$, and the anisotropic 3D-TV of $F$ is simply the sum of the magnitudes of this discrete gradient at every pixel:
\[
\mathrm{TV}_{3\mathrm{D}}(F) = \sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{t=1}^{q} \left( |D_{h;ijt}[F]| + |D_{v;ijt}[F]| + |D_{t;ijt}[F]| \right). \tag{8}
\]
To reconstruct the frame sequence $F$ from noiseless measurements $y = \Phi\,\mathrm{vec}(F)$, we can solve the convex program
\[
\hat{F} = \arg\min_{F'} \mathrm{TV}_{3\mathrm{D}}(F') \quad \text{subject to} \quad \Phi\,\mathrm{vec}(F') = y. \tag{9}
\]
The reconstruction of $F$ from noisy measurements can be formulated as the 3D-TV decoding
\[
\hat{F} = \arg\min_{F'} \mathrm{TV}_{3\mathrm{D}}(F') \quad \text{subject to} \quad \|\Phi\,\mathrm{vec}(F') - y\|_{\ell_2} \le \epsilon. \tag{10}
\]
If the individual frames $F_1, \ldots, F_q$ in $F$ are highly time-correlated, then a pixelwise temporal DCT generally improves sparsity. As illustrated in Fig.
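Extending the TV computation to three dimensions only adds the temporal difference term. A minimal sketch, again with the periodic convention of the temporal gradient above (`tv3d` is an illustrative name of ours):

```python
import numpy as np

def tv3d(F):
    """Anisotropic 3D-TV of a frame stack F of shape (m, n, q):
    summed |horizontal| + |vertical| + |temporal| gradient magnitudes."""
    dh = np.roll(F, -1, axis=1) - F  # horizontal differences
    dv = np.roll(F, -1, axis=0) - F  # vertical differences
    dt = np.roll(F, -1, axis=2) - F  # D_t: f[i,j,t+1] - f[i,j,t], wraps at t = q
    return np.abs(dh).sum() + np.abs(dv).sum() + np.abs(dt).sum()
```

For $q$ identical frames the temporal term vanishes and the 3D-TV equals $q$ times the per-frame 2D-TV; any interframe similarity lowers the objective, which is exactly the temporal redundancy the joint decoder exploits.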
1, each temporal-length-$q$ ($q = 4$, for example) vector $f_{i,j} = [f_{i,j,1}, \ldots, f_{i,j,q}]^{T}$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, consisting of the pixels at spatial position $(i,j)$ across $q$ successive frames, can be represented as
\[
f_{i,j} = \Psi_{\mathrm{DCT}}\, c_{i,j}, \tag{11}
\]
where $\Psi_{\mathrm{DCT}}$ is the 1D-DCT basis and $c_{i,j}$ is the transform-domain coefficient vector. The resulting coefficient matrix $C_1$ represents the frequency component that remains unchanged over time (dc), and the subsequent coefficient matrices $C_t$, $t = 2, \ldots, q$, represent frequency components of increasing time variability. Because each matrix $C_t$, $t = 1, \ldots, q$, is expected to have small TV, they can be jointly recovered in the form of
\[
(\hat{C}_1, \ldots, \hat{C}_q) = \arg\min_{C_1, \ldots, C_q} \sum_{t=1}^{q} \mathrm{TV}(C_t) \quad \text{subject to} \quad \|\Phi\,\mathrm{vec}(\mathrm{DCT}^{-1}(C_1, \ldots, C_q)) - y\|_{\ell_2} \le \epsilon, \tag{12}
\]
where $\mathrm{DCT}^{-1}(C_1, \ldots, C_q)$ stands for the pixelwise inverse 1D-DCT. Subsequently, the complete frame sequence $F$ can be reconstructed simply as
\[
\hat{F} = \mathrm{DCT}^{-1}(\hat{C}_1, \ldots, \hat{C}_q). \tag{13}
\]
Below, we will refer to this form of interframe CS reconstruction as TV-DCT decoding.
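The pixelwise temporal DCT is simply a 1D-DCT applied along the time axis at every pixel. A sketch using SciPy's orthonormal DCT (the wrapper names are ours):

```python
import numpy as np
from scipy.fft import dct, idct

def temporal_dct(F):
    """Pixelwise 1D-DCT along the temporal axis of F (shape (m, n, q)).
    C[:, :, 0] is the dc plane; higher planes vary faster in time."""
    return dct(F, axis=2, norm='ortho')

def temporal_idct(C):
    """Pixelwise inverse 1D-DCT (the DCT^{-1} used at reconstruction)."""
    return idct(C, axis=2, norm='ortho')
```

For frames that barely change over time, nearly all energy lands in the dc plane, so each coefficient plane is either near zero or as smooth as the frames themselves, which is why a small per-plane TV is a reasonable prior.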
3 Proposed CS Video System

CS-based signal acquisition with TV-based reconstruction, as described in Sec. 2, can be applied to video coding. In Refs. 28 and 29, the video frame sequence is divided into cubes, and each cube consisting of multiple frames is vectorized and compressed-sensed using a large-scale sensing matrix. At the decoder, each cube of video frames is recovered from the received measurements via 3D-TV decoding as in Eq. (10) or via TV-DCT decoding as in Eqs. (12) and (13). However, such a multiframe CS acquisition system requires simultaneous access (hence, some form of temporal storage) to the whole cube of frames, which is impractical and, arguably, defies the core intention of compressed sensing.
In this paper, we propose a practical CS video acquisition system that performs pure, direct framewise encoding. In the simple compressive video encoding block diagram shown in Fig. 2, each frame $F_t$ of size $m \times n$, $t = 1, 2, \ldots, T$, is viewed as a vectorized column $f_t \in \mathbb{R}^{N}$, $N = mn$. CS is performed by projecting $f_t$ onto a $P \times N$ random measurement matrix $\Phi_t$,
\[
y_t = \Phi_t f_t, \quad t = 1, 2, \ldots, T, \tag{14}
\]
where each $\Phi_t$ is generated by randomly permuting the columns of an order-$k$ ($k \ge N$ and a multiple of four) Walsh-Hadamard (WH) matrix, followed by arbitrary selection of $P$ rows from the $k$ available WH rows (if $k > N$, only $N$ arbitrary columns are utilized). This class of WH measurement matrices has the advantages of easy implementation (antipodal $\pm 1$ entries), fast transformation, and satisfactory reconstruction performance, as we will see later on. A richer class of matrices can be found in Refs. 30 and 31. To quantize the elements of the resulting measurement vector $y_t \in \mathbb{R}^{P}$ (block Q in Fig. 2), in this work we follow a simple adaptive quantization approach with two codeword lengths. A positive threshold $\eta > 0$ is chosen such that 1% of the elements in $y_1$ have magnitude above $\eta$. For every measurement vector $y_t$, $t = 1, 2, \ldots, T$, 16-bit uniform scalar quantization is used for elements with magnitudes larger than $\eta$, and 8-bit uniform scalar quantization is used for the remaining elements. The resulting quantized values $\tilde{y}_t$ are then indexed and transmitted to the decoder.
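The randomized WH construction can be sketched as follows. For convenience we use a power-of-two order, a special case of the multiple-of-four orders the text allows, and `wh_sensing_matrix` is an illustrative name of ours:

```python
import numpy as np
from scipy.linalg import hadamard

def wh_sensing_matrix(P, N, rng):
    """Randomly permute the columns of an order-k Walsh-Hadamard matrix
    (k = smallest power of two >= N), keep N of them, then pick P rows."""
    k = 1 << (N - 1).bit_length()        # smallest power of two >= N
    H = hadamard(k)                      # antipodal +/-1 entries
    cols = rng.permutation(k)[:N]        # random column permutation / selection
    rows = rng.permutation(k)[:P]        # arbitrary selection of P rows
    return H[np.ix_(rows, cols)]
```

Because the entries are $\pm 1$, the product $\Phi_t f_t$ reduces to additions and subtractions, and fast WH transforms exist, which is what keeps this encoder nearly computation-free.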
To reconstruct the independently encoded CS video frames, a simplistic idea is to recover each frame independently via 2D-TV decoding by Eq. (6). However, such a decoding scheme does not exploit the interframe similarities of a video sequence. We propose, instead, to jointly recover multiple individually encoded CS frames via interframe TV minimization. As shown in Fig. 3, the proposed interframe CS video decoder collects and concatenates a group of $q$ dequantized measurement vectors $\hat{y}_t \in \mathbb{R}^{P}$, $t = 1, \ldots, q$, to form $\hat{y} \in \mathbb{R}^{qP}$. Because each individual dequantized vector is of the form $\hat{y}_t = \Phi_t f_t + e_t$ with noise $e_t$, $\hat{y}$ can be represented as
\[
\hat{y} = \Phi f + e, \tag{15}
\]
where $\Phi \in \mathbb{R}^{(qP) \times (qN)}$ is the block-diagonal matrix
\[
\Phi = \mathrm{diag}(\Phi_1, \Phi_2, \ldots, \Phi_q), \tag{16}
\]
$f$ is the concatenation of the $q$ vectorized frames,
\[
f^{T} = [f_1^{T}\; f_2^{T}\; \ldots\; f_q^{T}], \tag{17}
\]
and $e$ is the concatenation of the noise vectors,
\[
e^{T} = [e_1^{T}\; e_2^{T}\; \ldots\; e_q^{T}]. \tag{18}
\]
The decoder then performs 3D-TV decoding on the $q$ frames [Fig. 3],
\[
\hat{F} = \arg\min_{F'} \mathrm{TV}_{3\mathrm{D}}(F') \quad \text{subject to} \quad \|\Phi\,\mathrm{vec}(F') - \hat{y}\|_{\ell_2} \le \epsilon. \tag{19}
\]
Although Eq. (19) may be considered a powerful joint 3D-TV recovery procedure for general 2-D CS-acquired video, for highly temporally correlated video frames better reconstruction quality may be achieved via TV-temporal-DCT decoding [Fig. 3],
\[
(\hat{C}_1, \ldots, \hat{C}_q) = \arg\min_{C_1, \ldots, C_q} \sum_{t=1}^{q} \mathrm{TV}(C_t) \quad \text{subject to} \quad \|\Phi\,\mathrm{vec}(\mathrm{DCT}^{-1}(C_1, \ldots, C_q)) - \hat{y}\|_{\ell_2} \le \epsilon. \tag{20}
\]
$F$ can then be reconstructed simply by
\[
\hat{F} = \mathrm{DCT}^{-1}(\hat{C}_1, \ldots, \hat{C}_q). \tag{21}
\]
In Eqs. (20) and (21), we carried out interframe decoding for each independent group of $q$ frames. To further exploit interframe similarities and capture local motion among adjacent groups of frames, we now propose a sliding-window TV-DCT decoder. The concept of such a decoder is depicted in Fig. 4.
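The block-diagonal stacking described above is a simple assembly step, and the joint product equals the concatenation of the per-frame products. A toy-sized numerical check (all names and dimensions are ours):

```python
import numpy as np
from scipy.sparse import block_diag

# Stack q per-frame systems y_t = Phi_t f_t into one joint system y = Phi f.
q, P, N = 3, 4, 6
rng = np.random.default_rng(1)
Phis = [rng.choice([-1, 1], size=(P, N)) for _ in range(q)]  # per-frame Phi_t
Phi = block_diag(Phis).toarray()                             # shape (q*P, q*N)
f = rng.standard_normal(q * N)                               # stacked frames
y = Phi @ f
# Same result as computing each frame's measurements separately:
y_frames = np.concatenate([Phis[t] @ f[t * N:(t + 1) * N] for t in range(q)])
```

The block-diagonal structure means the joint system adds no encoder-side coupling between frames; the coupling enters only through the joint TV objective at the decoder.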
Initially, the decoder performs TV-DCT decoding on the first $q$ ($q = 4$, for example) frames $F_1, \ldots, F_q$ specified by a decoding window of length $q$ [Fig. 4(a)], using the block-diagonal matrix $\Phi$ with diagonal elements $\Phi_1, \ldots, \Phi_q$. The reconstructed frames are called $\hat{F}_{1,1}$, $\hat{F}_{2,1}$, $\hat{F}_{3,1}$, $\hat{F}_{4,1}$ [Fig. 4(b)], where $\hat{F}_{t,l}$ represents the $l$'th reconstruction of the $t$'th frame. Then, the decoding window shifts one frame to the right, performs TV-DCT decoding on $F_2, \ldots, F_{q+1}$ using the matrix $\Phi$ with diagonal elements $\Phi_2, \ldots, \Phi_{q+1}$, and produces the reconstructed frames $\hat{F}_{2,2}$, $\hat{F}_{3,2}$, $\hat{F}_{4,2}$, $\hat{F}_{5,1}$. The decoder continues on with sliding-window TV-DCT decoding until the last group of frames $F_{T-q+1}, \ldots, F_T$ is recovered. Final reconstruction of each frame $\hat{F}_t$ is executed by taking the average of all its decodings,
\[
\hat{F}_t = \frac{1}{L_t} \sum_{l=1}^{L_t} \hat{F}_{t,l}, \tag{22}
\]
where $L_t$ is the number of decoding windows that contain frame $t$.
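The window scheduling and averaging logic is independent of the inner TV-DCT solver, so it can be sketched with the solver stubbed out (function names and the identity stub are ours for illustration):

```python
import numpy as np

def sliding_window_reconstruct(decode_window, measurements, q):
    """measurements: list of T per-frame measurement vectors.
    decode_window maps q consecutive entries to q reconstructed frames.
    Each frame's final estimate averages every window decoding covering it."""
    T = len(measurements)
    acc = [None] * T
    cnt = [0] * T
    for s in range(T - q + 1):               # window slides one frame at a time
        recs = decode_window(measurements[s:s + q])
        for i, r in enumerate(recs):
            acc[s + i] = r if acc[s + i] is None else acc[s + i] + r
            cnt[s + i] += 1
    return [a / c for a, c in zip(acc, cnt)]

# With a perfect (identity) stub decoder, averaging returns the frames exactly:
frames = [np.full((2, 2), v) for v in (1.0, 2.0, 3.0, 4.0, 5.0)]
out = sliding_window_reconstruct(lambda w: list(w), frames, q=2)
```

Interior frames are decoded up to $q$ times, once per window that covers them, while the first and last frames appear in fewer windows; the per-frame counter handles both cases.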
Compared to the simple (nonsliding-window) TV-DCT decoder of Eqs. (20) and (21), the sliding-window TV-DCT decoder enforces sparsity for any $q$ successive frames in the video sequence. Hence, it preserves sharp temporal changes for pixels that undergo fast motion within any $q$-frame sequence and smooths intensities for static or slow-motion pixels in the same decoding window.

4 Experimental Results
In this section, we study experimentally the performance of the proposed CS video systems by evaluating the peak signal-to-noise ratio (PSNR), as well as the perceptual quality, of reconstructed video sequences. Two test sequences, Container and Highway, with CIF resolution 352 × 288 pixels and a frame rate of 30 frames/s, are used. Processing is carried out only on the luminance component.
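PSNR for 8-bit luminance frames follows directly from the mean squared error; a standard sketch (the helper name is ours):

```python
import numpy as np

def psnr_db(ref, rec, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference frame and a
    reconstructed frame (peak = 255 for 8-bit luminance)."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak * peak / mse)
```

Per-sequence figures are then obtained by averaging the per-frame PSNR values, as done for the rate-distortion curves below.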
At our trivial, pure CS encoder side, each frame is handled as a vectorized column of length N = 101376 multiplied by a P × N randomized partial WH matrix Φ_t. The sensing matrix is referred to as varying Φ_t if it is independently generated to encode each frame, and as fixed Φ if it is generated only once to encode all frames in the video sequence. The elements of the captured P-dimensional measurement vector are quantized and then transmitted to the decoder. In our experiments, P = 12672, 25344, 38016, 50688, and 63360 are used, producing corresponding bit rates of 3071.7, 6143.4, 9215.1, 12287, and 15358 kbps. (Considering the quantization scheme described in Sec. 3 and the frame rate of 30 frames/s, the bit rate can be calculated as (16 × 0.01P + 8 × 0.99P) × 30/1000 kbps.) With an Intel i5-2410M 2.30-GHz processor, the encoding time per frame is well within 0.1 s, whereas the H.264/AVC JM reference software programmed in C++ requires about 1.55 s with low-complexity configurations.11 At the decoder side, we chose the TVAL3 software28,29 for reconstruction, motivated by its low complexity and satisfactory recovery performance. In our experimental studies of the slow-motion Container sequence, five CS video systems are examined: (1) baseline fixed Φ acquisition with frame-by-frame 2D-TV decoding [Eq. (6)]; (2) fixed Φ and (3) varying Φ_t acquisition with TV-DCT decoding [Eqs. (20) and (21)]; (4) 3D-TV decoding with fixed Φ [Eq. (19)]; and (5) varying Φ_t acquisition with sliding-window TV-DCT decoding [Eqs. (20), (21), and (22)].
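The quoted bit rates follow directly from the two-level quantizer arithmetic; a quick check (the helper name is ours):

```python
def bitrate_kbps(P, fps=30):
    """Per the Sec. 3 quantizer: 1% of the P measurements per frame get
    16-bit codewords, the remaining 99% get 8-bit codewords, at fps frames/s."""
    bits_per_frame = 16 * 0.01 * P + 8 * 0.99 * P   # = 8.08 * P bits
    return bits_per_frame * fps / 1000.0
```

For example, P = 12672 gives 8.08 × 12672 × 30 / 1000 ≈ 3071.7 kbps, matching the first operating point listed above.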
Figure 5 shows the decodings of the 28th frame of Container produced by the sliding-window TV-DCT decoder with varying Φ_t and window size q = 20 [Fig. 5(b)], the TV-DCT decoder with varying Φ_t [Fig. 5(c)], the TV-DCT decoder with fixed Φ [Fig. 5(d)], the 3D-TV decoder with fixed Φ [Fig. 5(e)], and the 2D-TV decoder with fixed Φ [Fig. 5(f)]. It can be observed that the 2D-TV decoder, as well as the fixed Φ TV-DCT decoder, suffers noticeable performance loss over the whole image, whereas the varying Φ_t sliding-window TV-DCT decoder demonstrates considerable reconstruction quality improvement. (As usual, pdf formatting of the present article tends to dampen perceptual quality differences between Figs. 5(a)-5(f) that are quite pronounced in video playback. Figure 6 is the usual attempt to capture average differences quantitatively.) These findings are consistent with the belief that varying Φ_t, t = 1, …, q, in Eq. (16) results in a joint block-diagonal recovery matrix Φ that is more likely to satisfy the restricted isometry property (RIP)3 for a given data sparsity level.
Figure 6 shows the rate-distortion characteristics of the five decoders for the Container video sequence. The PSNR values (in dB) are averaged over 100 frames. Evidently, the varying Φ_t TV-DCT decoder outperforms the fixed Φ TV-DCT decoder for all P values, as well as the fixed Φ 2D-TV decoder over the medium-low to high bit-rate range, with gains of as much as 5 dB. The proposed varying Φ_t sliding-window TV-DCT decoder further improves performance by up to an additional 2.5 dB.
For the Highway sequence with fixed Φ framewise CS acquisition, Fig. 7 shows the decodings of the 54th frame produced by the sliding-window TV-DCT decoder with window size q = 4 [Fig. 7

5 Conclusions
We propose an interframe TV-minimizing decoder for video streaming systems with plain framewise CS encoding. Each group of successive frames is jointly decoded by minimizing the TV of the pixelwise DCT along the temporal direction (TV-DCT decoding). To capture local motion across adjacent frames, a sliding-window decoding structure was developed in which a decoding window specifies the group of frames to be decoded. As the window shifts forward one frame at a time, multiple decodings are produced for each frame in the video sequence, and their average forms the final reconstructed frame. Experimental results demonstrate that the proposed sliding-window interframe TV-minimizing decoder significantly outperforms the intraframe 2D-TV-minimizing decoder, as well as 3D-TV CS decoding schemes. In terms of future work, to further reduce encoder/decoder complexity while maintaining satisfactory video reconstruction quality, we may develop block-level CS video acquisition systems with rate-adaptive sampling at the encoder and deterministically designed measurement matrices to facilitate efficient encoding/decoding.

Fig. 6 Rate-distortion studies on the Container sequence.

Table 1 Empirical q values for Container.

Table 2 Empirical q values for Highway.