With the rapid development of wireless communication and mobile display devices, mobile multimedia has become available in many commercial areas, and among its forms, video content is considered an important medium of information. In particular, because a variety of display devices such as tablet computers, cellular phones, smartphones, and handheld personal computers have been released in the market, and such devices have different display resolutions, video needs to be adapted to the different resolutions and aspect ratios of display panels. That is, the spatial resolution of a video is downsized or upsized by resizing algorithms in order to use the video content more effectively. However, because simple resizing techniques such as scaling and cropping do not take into account the dominant contents of a video (e.g., a person in the picture), transformation or distortion of such salient objects is inevitable. Therefore, it is necessary to develop a new content-based video resizing method that can preserve the dominant contents of an image while changing its size.
Figure 1(b)–1(e) are 30% horizontally reduced versions of Fig. 1(a). Because the scaling method adjusts the sampling rate uniformly over the whole image, if the scaling ratio differs from the aspect ratio of the source image, the contents of the source image are distorted [Fig. 1(b)]. Therefore, content-based image resizing methods have been studied in order to prevent this visual distortion. Cropping is effective for displaying a region of interest (ROI) where dominant objects are located. Santella et al. proposed a semi-automatic cropping technique,1 which finds the important content and crops the image accordingly. However, cropping-based methods discard the region outside the ROI when the resolution of the target display is much smaller than that of the original video, and they cannot correctly preserve sparse multiple objects [Fig. 1(c)]. To solve this problem, Liu et al.2 proposed the fisheye-view warping technique, which preserves the dominant region while the other regions are warped [Fig. 1(d)]. Fisheye-view warping preserves the main content of an image as much as possible but has the disadvantage of severely distorting the rest of the video information. Recently, Avidan et al.3 introduced the seam carving technique, which is known to achieve high image scaling performance with low quality loss in the retargeted image [Fig. 1(e)]. In order to resize images, this method removes or inserts the pixels of a seam, which is defined as a vertically (or horizontally) connected path of pixels with the minimum gradient energy. In addition, studies on further methods that preserve the contents while changing the size of an image are in progress.13–15
In a video, the content-based geometric transformations16,17 differ from those of static images. A static image is processed only in the spatial dimensions; in contrast, a video requires consideration of the relationship between adjacent frames because the time dimension is added. Even without considering this relationship, the contents of each individual frame can be preserved. However, the irregular movement of the contents' locations across frames then generates a shaking phenomenon (jitter), because the connectivity along the time axis is lost. Therefore, it is essential to protect the temporal continuity of the contents to prevent this shaking phenomenon, which implies that a new content-based geometric conversion algorithm should be applied to videos.
There have been several classes of video retargeting approaches. Setlur et al.16 generate a motion illustration by using the principal motion direction in a video to detect and accentuate a moving object's motion in a single static frame. Liu et al.17 perform video retargeting using an automatic pan-and-scan method that moves a cropping window in each frame. Furthermore, additional studies on methodologies that maintain the dominant contents while changing the size of an image or a video are in progress.13–15
Video carving,18,19 the application of seam carving to video, uses a three-dimensional (3-D) cube that connects the frames along the time axis. Rubinstein et al.18 introduced an improved seam carving algorithm for image and video retargeting, which applies forward energy instead of the gradient value to evaluate the energy of a pixel. Chen et al.19 proposed a video carving method that handles a two-dimensional (2-D) connected surface of pixels in the 3-D space-time volume constructed from consecutive frames of a video. However, since the location and geometric shape of the contents change across the video frames, a 2-D connected surface that maintains spatial and temporal connectivity over the whole video cannot be obtained simply. Therefore, in order to attain effective video retargeting, the entire 3-D space-time volume has to be analyzed while considering the energy along the spatial and temporal connectivity of the 2-D surface (Fig. 2). Because this 2-D connected surface is obtained by applying the graph cut technique,20,21 which requires a large amount of memory and high-complexity operations in both Rubinstein's and Chen's methods, a novel real-time retargeting technique is required for systems with limited resources, such as mobile devices.
In this paper, a novel video resizing algorithm that preserves the dominant contents of video frames is proposed. The proposed method determines the 2-D connected paths for each frame by considering both the spatial and the temporal correlation between frames to prevent jitter and jerkiness at a reduced computational cost. As a result, the method runs in real time with low memory consumption.
The proposed technique operates on a shot basis, where a shot is a sequence of consecutive images taken by a single camera, so that all frames within a shot have similar features. First, in order to separate each shot of a video effectively, a shot change is detected by monitoring the brightness differences and the histogram differences, which are sensitive to motion and color change, respectively.22,23 If a shot change occurs and a new shot begins, the first frame of the shot is resized using the conventional seam carving technique for static images. At this time, the seams extracted by the seam carving technique and their coordinates are stored. The proposed technique can then calculate the new seams of the next frame in real time using the newly proposed forward energy, instead of creating a 3-D cube that requires information on all of the video frames. The image is then resized along these seams.
This paper is organized as follows. In the next section, the conventional seam carving algorithm is briefly introduced. The proposed algorithm is presented in Sec. 3. Section 4 presents and discusses the experimental results. Finally, our conclusions are given in Sec. 5.
Review of Conventional Seam Carving
The seam carving method extracts the seam along which the change in energy is the lowest in the image, and controls the image size by adding or removing a pixel at each coordinate of the seam. A seam is a connected vertical or horizontal path composed of one pixel per row and/or column. For an image, the vertical seam is defined in Eq. (1) and its energy in Eq. (2). Dynamic programming24,25 is applied in order to reduce the amount of calculation. The first stage of the dynamic programming finds the cumulative minimum energy map, given in Eq. (3), by using the vertical-seam condition of Eq. (1) and the matrix structure of the image; the vertical seam is then found from each cumulative minimum energy value through a reverse (backtracking) search. The number of candidate vertical seams is identical to the horizontal size of the image, since one cumulative minimum energy value exists per column. The optimal seam among the vertical seams is found by the reverse search starting from the pixel whose cumulative minimum energy value is the smallest. The optimal horizontal seam can be found in the same way.
The image size can be controlled by adding or deleting video data at the coordinates of the optimal seam. Several seams are required in order to control the image size by arbitrary amounts. To extract several seams, after excluding the pixels corresponding to the seam extracted first, the next seam is extracted by updating the cumulative minimum energy map. The reason for excluding the pixels corresponding to the previous seam coordinates when finding a new seam is to satisfy the definition of a seam. The energy of the pixels comprising the optimal seam is low; therefore, if the pixels of an already selected seam are not removed, the probability that these pixels are selected again is high, overlapping pixels between seams are generated, and the definition of a seam is no longer satisfied. If the definition of a seam is not satisfied, the same pixel is referred to repeatedly when converting the image size, and distortion of the resulting image is generated. Because updating the cumulative minimum energy map is needed in order to prevent this distortion, a total processing time delay is inevitable, and the delay grows rapidly as the resolution of the image to be adjusted increases.
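To make the dynamic-programming procedure concrete, the following Python sketch implements vertical seam extraction and removal. The function names and the simple difference-based energy are illustrative assumptions, not the paper's exact Eqs. (1)–(3); the energy map is recomputed after every removal so that successive seams never overlap.

```python
import numpy as np

def difference_energy(img):
    # Illustrative energy: absolute forward differences in x and y
    # (a stand-in for the gradient energy of the paper's Eq. (2)).
    img = img.astype(float)
    dx = np.abs(np.diff(img, axis=1, append=img[:, -1:]))
    dy = np.abs(np.diff(img, axis=0, append=img[-1:, :]))
    return dx + dy

def find_vertical_seam(energy):
    # Cumulative minimum energy map (cf. Eq. (3)):
    # M[i, j] = e[i, j] + min of the three upper neighbors.
    h, w = energy.shape
    M = energy.copy()
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 1, w - 1)
            M[i, j] += M[i - 1, lo:hi + 1].min()
    # Reverse search: backtrack from the smallest cumulative value.
    seam = np.empty(h, dtype=int)
    seam[-1] = int(np.argmin(M[-1]))
    for i in range(h - 2, -1, -1):
        j = seam[i + 1]
        lo, hi = max(j - 1, 0), min(j + 1, w - 1)
        seam[i] = lo + int(np.argmin(M[i, lo:hi + 1]))
    return seam

def remove_vertical_seam(img, seam):
    h, w = img.shape
    keep = np.ones((h, w), dtype=bool)
    keep[np.arange(h), seam] = False
    return img[keep].reshape(h, w - 1)

def carve(img, n_seams):
    # Recompute the energy after each removal so seams never overlap.
    out = img.astype(float).copy()
    for _ in range(n_seams):
        out = remove_vertical_seam(out, find_vertical_seam(difference_energy(out)))
    return out
```

Because the cumulative map is rebuilt for every extracted seam, the delay discussed above grows with both the number of seams and the image resolution.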
Proposed Image Resizing Algorithm in Video
As shown in Fig. 3, the proposed real-time content-aware video resizing system is composed of three parts: shot change detection (SCD), seam generation, and image resizing (Appendix). If a shot change is detected and a new shot is initiated, the stored seam information of the previous frame is discarded and new seams are searched for using the seam carving technique for static images. Then, after the information about the searched seams is stored, the frame is resized to the target size. On the other hand, if a shot continues, the seams of the current frame are calculated using the stored seam information of the previous frame, and the frame is resized using the generated seams.
Detecting Shot Change
Because the frame rate of a video is more than 10 fps, the shot change detection is performed every 10 frames. First, the feature values are extracted between two consecutive frames.
For the stability of the algorithm, the shot change detection is not performed until 10 feature values are gathered. After 10 feature values are gathered, the largest and the second largest feature values are extracted and the difference between the two values is calculated. The shot change between two consecutive frames is detected through the following equations.
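Since the detection equations themselves are not reproduced above, the following Python sketch illustrates one plausible realization of the described scheme; the exact feature combination, the window size, and the `ratio` threshold are assumptions.

```python
import numpy as np
from collections import deque

def frame_features(prev, curr, bins=16):
    # Two complementary measures: mean brightness difference (sensitive
    # to motion) and normalized histogram difference (sensitive to
    # color change). Summing them is an assumed combination.
    bright = np.abs(curr.astype(float) - prev.astype(float)).mean()
    h1, _ = np.histogram(prev, bins=bins, range=(0, 256))
    h2, _ = np.histogram(curr, bins=bins, range=(0, 256))
    hist = np.abs(h1 - h2).sum() / prev.size
    return bright + hist

class ShotChangeDetector:
    # Gathers the last `window` feature values and declares a shot change
    # when the newest value is the largest and exceeds the second-largest
    # by the factor `ratio` (both parameters are assumptions).
    def __init__(self, window=10, ratio=3.0):
        self.values = deque(maxlen=window)
        self.ratio = ratio

    def update(self, feature):
        self.values.append(feature)
        if len(self.values) < self.values.maxlen:
            return False  # not enough history yet, for stability
        largest, second = sorted(self.values)[-2:][::-1]
        return feature == largest and largest > self.ratio * max(second, 1e-6)
```

A sudden jump in the feature value relative to the recent history is flagged as a shot boundary, while gradual motion within a shot leaves the detector quiet.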
Since the conventional seam carving for a static image is applied to the first frame after a shot change, the frequency of shot changes affects the speed of the algorithm. However, in general videos, scene changes do not occur frequently enough to hinder real-time processing.
Deriving Seam in the First Frame
After a shot change occurs, the conventional seam carving for a static image is applied to the first frame. All the coordinate and energy values of the seams of the first frame are stored so that this information can be used when finding the seams of the next frame. The following equations show the stored seam information for a frame.
Generating Seam of Current Frame by New Scheme
When a shot change does not occur, that is, when the current frame belongs to the same shot as the previous frame, the seams of the current frame are extracted with reference to the seam information stored in the buffer. Because a visual correlation (a similarity) exists between consecutive frames within a shot, the energy distributions of neighboring frames are also correlated, and the seams of a frame are therefore analogous to those of its neighboring frames. Thus, the seams of the current frame are derived within a specified range determined by the coordinates of the seams of the previous frame. At this time, the temporal connection of the seams has to be considered. If the seams of each frame were generated independently, without correlation, jitter and jerkiness would occur. The visual artifact of jitter occurs mainly because of a difference in the numbers of seams around the dominant contents in each frame. For example, assume that in the first frame, three seams and five seams are extracted to the left and right of some content, respectively, while in the following second frame, five seams and three seams are extracted to the left and right of the same content. If the image size is changed identically for the two frames, the relative locations of the content in the two frames differ by two pixels. This problem is jitter, which arises when the process of extracting seams independently is repeated for each frame. Figure 4 shows the results of independently expanding the size of consecutive frames by seam carving.
If we pay attention to the picture in the red circle in each frame of Fig. 4, we can observe that seven seams and one seam exist to the left and right of the red circle in the first frame, respectively, whereas six seams and two seams exist to the left and right of the red circle in the second frame. In the original video, the picture in the red circle stays in a fixed location. However, in the images expanded independently by seam carving, the picture in the second frame moves one pixel to the left compared to the first frame. If these processes are repeated, the content in the red circle shakes severely.
Therefore, in a video, preventing the shaking phenomenon is more important than finding the optimum seam. This section presents a new process to extract seams that prevents the shaking phenomenon and preserves the form of the dominant content.
Seam-ordering of current frame
Since the seams of a frame can overlap, the conventional seam carving extracts the next seam only after removing the previous one. Figure 5 shows an overlapped coordinate between the first seam and the second seam.
In Fig. 5, overlapped coordinates are generated at the location where the first seam and the second seam meet. If the coordinates of the overlapped part are used when the image size is modified by the seams, the result will be incorrect by one pixel at the location of the overlap, and a distortion of the image occurs. Therefore, a specific order is imposed on the seams: the seam order of the current frame is identical to that of the previous frame. For example, the information of the 4th seam of the previous frame is stored in order to obtain the 4th seam of the current frame. Equation (7) indicates that the seam information of the previous frame is referenced in order to produce the corresponding seam of the current frame, where f denotes the index of the current frame and n the index of the current seam.
Energy cost of a pixel
The conventional seam carving method considers the energy of each pixel to determine a seam, and various energy functions exist. The amount of change in the pixel value, the spatial forward energy, the standard deviation, edge information,26 gradient vector flow,27 the energy of high-level tasks (e.g., a face detector), etc., can be used as the energy, and a different result image is produced depending on the energy function. Among them, the spatial forward energy, which performs well, uses the differences between the pixels adjacent to a pixel: if the pixel is selected as part of a seam and removed, the adjacent pixels should be smoothly connected. The spatial forward energy is defined in Eq. (8), which is used to find the vertical seam; the horizontal seam is obtained by the same method. In calculating the spatial forward energy, only one among the left-up, up, and right-up directions is selected, restricted to the pixels for which spatial connectivity is maintained.
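As a concrete illustration, the following Python sketch computes the three spatial forward-energy costs for a single pixel, following the standard forward-energy formulation of Rubinstein et al.;18 the function name and the boundary handling (out-of-bounds neighbors fall back to the pixel itself) are illustrative assumptions.

```python
import numpy as np

def spatial_forward_energy(img, i, j):
    # Costs of removing pixel (i, j) for the three seam directions:
    # the energy each choice would *insert* by joining new neighbors.
    I = img.astype(float)
    h, w = I.shape
    left = I[i, j - 1] if j > 0 else I[i, j]
    right = I[i, j + 1] if j < w - 1 else I[i, j]
    up = I[i - 1, j] if i > 0 else I[i, j]
    c_up = abs(right - left)            # seam continues straight up
    c_left = c_up + abs(up - left)      # seam came from the upper-left
    c_right = c_up + abs(up - right)    # seam came from the upper-right
    return c_left, c_up, c_right
```

The direction with the smallest cost is the one whose removal leaves the smoothest junction between the newly adjacent pixels.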
The spatial forward energy performs well on static images, but not on videos, because the correlation between frames is not considered. In this paper, the temporal forward energy is proposed as an energy that considers the correlation between frames. The temporal forward energy can guarantee the continuity of a seam in the time domain.
Figure 6 shows the three possible cases of a vertical seam under the temporal forward energy, where each pixel is indexed by its position within its frame. As shown in Fig. 6, we search for the seam whose removal inserts the minimal amount of energy between two consecutive frames. These seams are not necessarily minimal in their own energy, but they leave fewer artifacts in the resulting image after removal. This coincides with the assumption that two neighboring frames have piecewise-smooth intensity at the same pixel position, which is a popular assumption in the literature. The temporal forward energy according to the position of the pixel to be removed is defined in Eq. (9).
In calculating the temporal forward energy, only one among the left-down, down, and right-down directions is selected, restricted to the pixels for which temporal connectivity is maintained.
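Because Eq. (9) is not reproduced above, the following Python sketch shows one plausible analogue of the described temporal forward energy, penalizing the mismatch against the previous frame for the three temporally connected positions; the exact cost terms and the function name are assumptions.

```python
import numpy as np

def temporal_forward_energy(prev, curr, i, j):
    # Hedged sketch: the cost of removing curr[i, j] combines the energy
    # inserted by joining its horizontal neighbors with the mismatch
    # against the previous frame at the left-down, down, and right-down
    # temporally connected positions.
    P, C = prev.astype(float), curr.astype(float)
    w = C.shape[1]
    left = C[i, j - 1] if j > 0 else C[i, j]
    right = C[i, j + 1] if j < w - 1 else C[i, j]
    costs = []
    for dj in (-1, 0, 1):                # left-down, down, right-down
        jj = min(max(j + dj, 0), w - 1)
        # mismatch between the frames at the candidate position
        costs.append(abs(right - left) + abs(C[i, jj] - P[i, jj]))
    return tuple(costs)
```

A seam position whose intensity agrees with the previous frame incurs a low temporal cost, which is what keeps the seam continuous along the time axis.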
Generating seam of continuous frames
The coordinates of the pixels that are temporally connected with the reference seam of the previous frame are selected as the starting coordinates of a seam, and they form the set of seam candidates. The next coordinate of the seam is obtained with reference to the current seam coordinate and the reference seam. The conditions for finding the next coordinate are given by
1. The candidate pixel and the current seam pixel are spatially connected (spatial connection).
2. The candidate pixel and the reference seam of the previous frame are temporally connected (temporal connection).
Equation (10) describes the process of finding the candidate pixels (CanPix) satisfying the above conditions. Figure 7 shows an example of the spatial connection condition, the temporal connection condition, and the set CanPix satisfying both conditions.
The set CanPix is composed of the pixels satisfying both the spatial connection and the temporal connection, and these pixels become the candidates for a seam that guarantees continuity in the time domain. The spatial forward energy and the temporal forward energy of the candidate pixels are computed, and the pixel with the smallest sum of the two energy values is included in the seam, as in Eq. (11). The seams are thus generated through Eqs. (8), (9), and (11), and therefore the proposed technique resizes the video without distortion of the primary contents or visual artifacts.
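The candidate-selection rule can be sketched as follows in Python; the helper names and the row-by-row tracing loop are illustrative assumptions, with the per-row SFE and TFE values assumed to be precomputed.

```python
def candidate_pixels(prev_col, curr_col, width):
    # Columns satisfying both conditions: spatially connected to the seam
    # pixel just chosen in the current frame (curr_col +/- 1) and
    # temporally connected to the previous frame's reference seam
    # (prev_col +/- 1).
    spatial = {curr_col - 1, curr_col, curr_col + 1}
    temporal = {prev_col - 1, prev_col, prev_col + 1}
    return sorted(c for c in spatial & temporal if 0 <= c < width)

def trace_seam(ref_seam, sfe_rows, tfe_rows):
    # Follow the previous frame's seam row by row, choosing at each row
    # the candidate with the smallest SFE + TFE sum.
    seam = [ref_seam[0]]  # start temporally connected to the reference
    for i in range(1, len(ref_seam)):
        cands = candidate_pixels(ref_seam[i], seam[-1], len(sfe_rows[i]))
        seam.append(min(cands, key=lambda c: sfe_rows[i][c] + tfe_rows[i][c]))
    return seam
```

Restricting the search to the intersection of the two neighborhoods is what avoids recomputing a full cumulative energy map for every frame.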
The image size is modified by the coordinates of all the seams finally determined in the current frame. When reducing the image size, as many seams as the difference in size between the original video and the target video are removed in seam order, one at a time. On the other hand, when expanding the image size, pixel values are inserted at the coordinates of the seams in seam order. Figure 8 shows examples of the process of controlling the image size. First, a seam map is generated from the coordinates of the seams in the stored seam information. The size of the seam map is identical to that of the original image, and the corresponding seam numbers are stored at the coordinates of the seams, as shown in Fig. 8(a). The image size is then controlled using the produced seam map. When reducing the image size, as shown in Fig. 8(b), the seam map is searched and the pixels at the coordinates of the first seam are removed. After the size of the image has been reduced by one seam, the referred seam is removed from the seam map in order to update the coordinates affected by the removed seam. The image size is reduced by repeating this process for the required number of seams.
On the other hand, when the image size is enlarged, as shown in Fig. 8(c), empty spaces are inserted at the same coordinates as the coordinates of a seam. The pixel values generated by an interpolation method then fill the empty spaces, and the image size is expanded. After the size of the image has been expanded by one seam, the referred seam is inserted into the seam map in order to update the coordinates affected by the inserted seam. The target image is obtained by repeating this process.
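The seam-map bookkeeping for reduction can be sketched as follows in Python (expansion is analogous, inserting interpolated columns instead of deleting them); the seam-numbering convention and function names are illustrative assumptions.

```python
import numpy as np

def build_seam_map(shape, seams):
    # Seam map: same size as the image; each cell holds the seam number
    # (1-based, in extraction order) at that seam's coordinates, 0
    # elsewhere. `seams` lists one column index per row for each seam.
    smap = np.zeros(shape, dtype=int)
    for n, cols in enumerate(seams, start=1):
        for i, j in enumerate(cols):
            smap[i, j] = n
    return smap

def reduce_by_seam_map(img, smap, k):
    # Remove the first k seams one at a time; deleting the referred seam
    # from the map as well keeps the remaining coordinates up to date.
    out, m = img.copy(), smap.copy()
    h = out.shape[0]
    for n in range(1, k + 1):
        cols = np.argmax(m == n, axis=1)       # column of seam n per row
        keep = np.ones(out.shape, dtype=bool)
        keep[np.arange(h), cols] = False
        out = out[keep].reshape(h, -1)
        m = m[keep].reshape(h, -1)
    return out
```

Because the map shrinks together with the image, later seam numbers automatically land on their shifted columns without any explicit coordinate arithmetic.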
In this section, the performance of three image resizing techniques is evaluated, namely, the bilinear method, the technique of applying Avidan's algorithm3 to a video, and the proposed technique. Extensive experimental testing and comparison were performed on several sequences with different characteristics: “SOCCER,” “COASTGUARD,” and “MOTHER & DAUGHTER” are in CIF format (352×288), and “IN TO TREE” is in 720p format (1280×720). All sequences have 300 frames and were horizontally enlarged by 30%. First, each method was evaluated on the basis of its runtime and average memory usage, which are the most important factors in real-time processing. The experiments were performed on a 1.86-GHz dual-core processor with 2 GB of memory. In order to enhance the reliability of the measured values, the same process was repeated 10 times, and the averages of the resulting values were compared.
Run-times for different algorithms (s).
|Algorithm||352×288 pixels||1280×720 pixels|
Memory usages for different algorithms (KB).
|Algorithm||352×288 pixels||1280×720 pixels|
Because Avidan's algorithm requires many operations and a large storage space in order to analyze all frames of a video, it cannot be performed on a system with limited resources such as a mobile terminal. However, the proposed algorithm runs about 25 times faster than Avidan's algorithm and achieves a runtime comparable to that of the bilinear method, as shown in Table 1. Since the proposed algorithm processes 12 frames per second for CIF sequences, real-time processing is possible for systems with a frame rate of up to 12 frames per second.
Since the proposed algorithm is designed for mobile terminals, memory usage is also important. As shown in Table 2, the proposed method requires about one third of the memory of Avidan's algorithm. Because the new seams of the current frame are computed with reference to the seam information of the previous frame, the memory usage of the proposed method is similar to that of the bilinear method, which is commonly used to resize images on mobile devices.
Next, whether the main content was maintained and whether the shaking phenomenon exists were compared through the resulting frames of each algorithm. Figure 9 shows “SOCCER” (174th frame), “COASTGUARD” (62nd frame), and “MOTHER & DAUGHTER” (60th frame) from the results of each algorithm.
Compared to the source image in Fig. 9(a), the result of the bilinear technique in Fig. 9(b) indicates that the shapes of the primary contents have been broadened. However, in the images resulting from Avidan's algorithm and the proposed algorithm, the shapes of the contents are similar to those in the original image. Thus, the proposed algorithm maintains the main content of the image.
Finally, the differences between the experimental results and the source image are measured by the error rate. Table 3 shows numerically, in terms of the error rate, how much the result images of the proposed method and of Avidan's method differ from the original video.
Error rates for different algorithms.
|Algorithm||352×288 pixels||1280×720 pixels|
As shown in Table 3, the images produced by the proposed method have a smaller error rate and are more similar to the original video than those of Avidan's method.
Figure 10 shows the differences between adjacent frames in “IN TO TREE” (frames 33–36). Because these frames belong to a single shot, any differences between adjacent frames are small.
As shown in Fig. 11(a), because the technique applying Avidan's algorithm to video does not consider the relation between adjacent frames, the shaking phenomenon occurs and large differences between neighboring frames are generated. On the other hand, because the proposed algorithm considers the correlation between adjacent frames, there is no shaking phenomenon, and the differences between neighboring frames are similar to those in the original video, as shown in Fig. 11(b).
The results have been presented only for the horizontal direction. In order to control the image size in both directions, the proposed algorithm is simply applied twice: once in the horizontal direction and once in the vertical direction.
A novel video resizing algorithm that preserves the dominant contents of video frames was proposed. Because a visual correlation (a similarity) exists between consecutive frames within a shot, the energy distributions of neighboring frames are also correlated, and the seams of a frame are analogous to those of its neighboring frames. Thus, the seams of the current frame are derived within a specified range determined by the coordinates of the seams of the previous frame. The proposed method determines the 2-D connected paths for each frame by considering both the spatial and temporal correlations between frames to prevent jitter and jerkiness. The conventional seam carving requires too much complexity and a large amount of memory because all frames of the video have to be analyzed; therefore, it cannot be performed on a system such as a mobile terminal. The proposed algorithm has a fast processing speed similar to that of the bilinear method, while preserving the main content of an image to the greatest extent possible. In addition, because its memory usage is remarkably small compared with the existing seam carving method, the proposed algorithm is usable in mobile terminals, which have limited memory resources. Computer simulation results indicate that the proposed technique provides better objective performance, subjective image quality, shaking-phenomenon removal, and content conservation than conventional algorithms.
F = number of frames
N = number of seams
for (f = 1; f <= F; f++)
    perform shot change detection
    for (n = 1; n <= N; n++)
        if f is the first frame or a shot change occurred
            calculate SFE for each pixel of the frame, excluding already extracted seams
            extract one seam using dynamic programming on the SFE
            update and accumulate the seam information
        else
            calculate SFE and TFE for the pixels satisfying spatial and temporal
                connectivity with the nth seam of the previous frame
            generate one seam considering the SFE and TFE values and the location
                of the nth seam of the previous frame
            update and accumulate the seam information
    create the new resized frame using the seam information
Daehyun Park received BS and MS degrees in computer engineering from the Department of Computer and Communications Engineering, Kangwon National University, in 2007 and 2009, respectively. He is now a PhD candidate in computer engineering in the same department at Kangwon National University. His research interests are in the areas of video signal processing and multimedia communications.
Kanghee Lee received BS and MS degrees in computer engineering from the Department of Computer and Communications Engineering, Kangwon National University, in 2009 and 2011, respectively. His research interests are in the areas of video signal processing and multimedia communications.
Yoon Kim received BS, MS, and PhD degrees in electronic engineering from the Department of Electronic Engineering, Korea University, in 1993, 1995, and 2003, respectively. In 2004, he joined the Department of Computer and Communications Engineering, Kangwon National University, where he is currently an associate professor. From 1995 to 1999, he was with LG-Philips LCD Co., where he was involved in research and development on digital image equipment. His research interests are in the areas of video signal processing, multimedia communications, and wireless sensor networks.