All high-definition-television receivers must adapt a variety of transmitted video formats to a single native display format. This article explores three of the video-signal-processing functions that make this possible: de-interlacing, frame-rate conversion, and super-resolution.
by Yunwei Jia, Quang Dam Le, Larry Pearlstein, and Philip Swan
ADVANCED VIDEO-PROCESSING algorithms are essential for delivering the best possible picture quality on high-definition-television (HDTV) receivers. Video-processing functions that attempt to recover missing picture information have become increasingly important for adapting a set of different input video formats to a native display format. The picture information to be recovered may have been lost due to optical or electronic processing, or it may simply never have been captured. The physical world has nearly infinite spatial and temporal resolution, but image-capture devices preserve only a limited amount of information, and subsequent processing often results in additional information loss. This article considers three functions for recovering picture information: de-interlacing, frame-rate conversion, and super-resolution. For each function, a brief survey of algorithmic approaches is presented, along with the relative implementation complexity and picture quality associated with each approach.
De-Interlacing
De-interlacing is the process whereby interlaced-format pictures are converted to progressive-format pictures. Currently, this function can be found in virtually all projection, plasma, and liquid-crystal-display (LCD) HDTV receivers. The sophistication of the algorithms used varies widely, but high-end de-interlacers generally incorporate the techniques of cadence detection, spatial-temporal interpolation, and directional interpolation. While motion-estimation and motion-compensation (ME/MC) technology is quite mature, its successful incorporation into de-interlacers has faced many challenges. Historically, panel response times and panel motion artifacts have obscured the primary benefits of ME/MC, which also carries higher cost and complexity; many implementations to date have exhibited objectionable motion artifacts, and ME/MC is often not coupled successfully with top-notch implementations of the three core de-interlacing techniques. These problems will be solved in products in the near future. New de-interlacers will reliably turn all 1080i content into the sharpest possible 1080p content. This, in turn, will lead to even greater consumer appreciation of the highest-resolution display technologies.
One of the first things a good de-interlacer will do is use a "cadence detection" technique to determine how the source was authored. For example, when 24-frame-per-second (fps) film has been converted to 60-field-per-second video, there is a 3:2 pulldown cadence in the video.
If the interlaced content was originally authored as film, the de-interlacer can "cheat" by simply weaving fields together that come from the same film frame. However, there are several complicating real-life factors:
• There are several film frame rates.
• Still scenes, scene cuts, and "bad edits" can disrupt the continuity of the cadence.
• Subtitles can appear and disappear out of phase with the film's cadence.
• Some content is made by mixing film and video.
• Some content will have repetitive background noise superimposed on it.
• Some content contains detailed patterns that make analysis difficult.
These factors make reliable detection of a film cadence very challenging in practice. The most advanced de-interlacers will first use very sensitive signal-processing algorithms to extract useful raw information from the background noise of the video. Then, they will apply complex heuristics designed to convert that raw data into a real-time decision about whether or not it is safe to simply weave pixels from two or more selected fields together. It is important to note that if a de-interlacer uses this approach, it is not actually creating any new pixels. Instead, it is simply using existing pixels to create progressive video at the lower frame rate of the originally authored content.
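As a rough illustration of the raw signal-processing step, the sketch below (Python with NumPy, operating on hypothetical field data and an assumed noise threshold) measures same-parity field differences and checks whether the near-zero differences fall on a single phase of a period-five pattern, the signature of 3:2 pulldown. Production cadence detectors are far more elaborate, for all of the reasons listed above.

```python
import numpy as np

def field_difference(f_a, f_b):
    """Mean absolute difference between two fields of the same parity."""
    return np.mean(np.abs(f_a.astype(np.float64) - f_b.astype(np.float64)))

def detect_32_pulldown(fields, noise_floor=1.0):
    """Very simplified 3:2 cadence check (illustrative only).

    `fields` is a list of successive fields (2-D arrays); because parity
    alternates, field i and field i-2 share the same parity.  In 3:2
    pulldown, one field in every five repeats the field two positions
    earlier, so the same-parity difference drops to (near) zero at a
    fixed phase with period 5.  `noise_floor` is an assumed threshold
    below which a difference counts as a repeated field.
    """
    diffs = np.array([field_difference(fields[i], fields[i - 2])
                      for i in range(2, len(fields))])
    repeats = diffs < noise_floor              # candidate repeated fields
    for phase in range(5):                     # look for a clean period-5 phase
        mask = np.zeros_like(repeats)
        mask[phase::5] = True
        if np.array_equal(repeats, mask):
            return True, phase                 # cadence locked at this phase
    return False, None
```

Note that this toy detector demands a perfectly clean pattern; the heuristics in a real de-interlacer exist precisely to cope with the noise, bad edits, and mixed content that break such a simple test.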
If the content is not film – and thus reconstructing the original frames is not an option – then the de-interlacer will actually fabricate new pixels using advanced interpolation techniques. The input pixels at a de-interlacer's disposal are generally divided into two groups:
(a) Pixels from the correct time that are nearby in space.
(b) Pixels at the correct position that are nearby in time.
The first challenge a de-interlacer faces is to decide which group of pixels is most likely to provide a good basis for interpolation. For example, if the scene is static (i.e., there is no motion in the scene), then the pixels from group (b) make the best basis for interpolation, referred to as "temporal interpolation." If the scene is moving, then the pixels from group (a) make the best basis for interpolation, referred to as "spatial interpolation." This process is illustrated in Fig. 1, which shows rasters for three successive fields. "M" is a missing pixel that the de-interlacing algorithm will construct. If the algorithm interpolates using the "T" pixels, it is performing temporal interpolation. If it uses the "S" pixels, it is performing a simple form of spatial interpolation. A state-of-the-art de-interlacer will adaptively adjust the amount of spatial and temporal interpolation on a pixel-by-pixel basis. This ability is the second core component of every good de-interlacer.
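A minimal sketch of this pixel-by-pixel adaptation, assuming simple two-tap spatial and temporal interpolators and a crude per-pixel motion measure derived from the two temporal neighbors, might look like the following; real de-interlacers use considerably more sophisticated motion detectors and filters.

```python
import numpy as np

def interpolate_missing_line(above, below, prev_line, next_line, motion_k=8.0):
    """Motion-adaptive de-interlacing of one missing line (a sketch).

    above, below         : existing lines directly above/below the missing
                           line in the current field (the "S" pixels)
    prev_line, next_line : the co-located line in the previous and next
                           fields (the "T" pixels)
    motion_k             : assumed constant mapping the local temporal
                           difference to a 0..1 blend factor
    """
    above, below = above.astype(np.float64), below.astype(np.float64)
    prev_line, next_line = prev_line.astype(np.float64), next_line.astype(np.float64)

    spatial  = 0.5 * (above + below)           # spatial interpolation
    temporal = 0.5 * (prev_line + next_line)   # temporal interpolation

    # Per-pixel motion measure: the larger the temporal difference,
    # the more the output leans toward the spatial estimate.
    motion = np.clip(np.abs(next_line - prev_line) / motion_k, 0.0, 1.0)
    return motion * spatial + (1.0 - motion) * temporal
```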
When the de-interlacer favors spatial de-interlacing, the best de-interlacers will use an advanced interpolation technique called directional interpolation. The "D" and "S" pixels shown in Fig. 1 would be used for this more complex form of spatial interpolation. Directional interpolation is the signal-processing equivalent of sanding a piece of wood with the grain rather than across the grain. A directional interpolator will detect the orientation of edges in the scene and interpolate in parallel with the edge orientation. When this works, the quality of the interpolated scene is much greater because jagged edges are eliminated in the output. However, in real-life scenes, there are many situations in which it is not wise to interpolate in parallel with the detected edge orientation. Therefore, an advanced algorithm also incorporates intelligence that allows it to use this interpolation technique only when it is appropriate to do so. The ability to perform good directional spatial interpolation is the third core component of every good de-interlacer.
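For illustration, a basic edge-based line-averaging scheme (a much-simplified form of directional interpolation, not the specific algorithm of any particular de-interlacer) tests a few candidate directions between the lines above and below the missing pixel and averages along the direction with the best match:

```python
import numpy as np

def directional_interpolate(above, below, max_shift=2):
    """Edge-based line averaging for one missing line (a sketch).

    For each missing pixel x, candidate directions pair pixel (x - d) on
    the line above with pixel (x + d) on the line below.  The direction
    with the smallest absolute difference is assumed to follow the local
    edge, and the output is the average along that direction.
    """
    above, below = above.astype(np.float64), below.astype(np.float64)
    width = above.shape[0]
    out = np.empty(width)
    for x in range(width):
        best_cost, best_val = None, None
        for d in range(-max_shift, max_shift + 1):
            xa, xb = x - d, x + d
            if 0 <= xa < width and 0 <= xb < width:
                cost = abs(above[xa] - below[xb])
                if best_cost is None or cost < best_cost:
                    best_cost, best_val = cost, 0.5 * (above[xa] + below[xb])
        out[x] = best_val
    return out
```

The "intelligence" mentioned above would, for example, fall back to plain vertical averaging when no candidate direction is clearly better than the others; that safeguard is omitted here.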
It is possible to detect and compensate for the motion in a scene using motion-estimation and motion-compensation (ME/MC) techniques. Motion-vector fields provide yet a third interpolation option – to perform temporal interpolation using pixels found along the motion vector. This is illustrated in Fig. 1 where if, for example, field-to-field motion of two pixels in the x direction and two pixels in the y direction is estimated near the missing pixel M, then the MC pixels would provide a good basis for temporal interpolation. When they work properly, ME/MC techniques increase the perceived sharpness and reduce the flicker in a detailed moving scene. However, as with cadence detection, spatial-temporal adaptation, and directional spatial interpolation, there are many real-life cases where the application of ME/MC techniques is a mixed blessing. If applied universally, some scene elements will be improved, but others will suffer from a reduction in perceived quality. Thus, for best overall picture quality, ME/MC techniques must be coupled with an intelligent component that applies ME/MC selectively only when and where it will provide improvements beyond the other techniques mentioned above.
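A toy version of such motion-compensated temporal interpolation simply fetches the displaced ("MC") pixels from the neighboring fields along the estimated motion vector and averages them; the vector itself is assumed to come from a separate motion estimator.

```python
import numpy as np

def mc_temporal_pixel(prev_field, next_field, y, x, mv_y, mv_x):
    """Motion-compensated temporal interpolation of one missing pixel (sketch).

    (y, x) is the missing pixel position M; (mv_y, mv_x) is the estimated
    field-to-field motion near M.  The previous field is sampled one
    motion step backward and the next field one step forward, and the
    two samples are averaged.
    """
    h, w = prev_field.shape
    yp, xp = int(np.clip(y - mv_y, 0, h - 1)), int(np.clip(x - mv_x, 0, w - 1))
    yn, xn = int(np.clip(y + mv_y, 0, h - 1)), int(np.clip(x + mv_x, 0, w - 1))
    return 0.5 * (float(prev_field[yp, xp]) + float(next_field[yn, xn]))
```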
Frame-Rate Conversion
Frame-rate conversion (FRC) is the process whereby input pictures captured at one frame rate, e.g., motion-picture film at 24 fps, are converted to a different frame rate for display, e.g., 60 fps, which is the native display frame rate for many current HDTV receivers. FRC, which creates new pictures when they are needed for display, is a function currently found only in high-end HDTV receivers, but is moving rapidly toward the mainstream. FRC addresses the following problems:
• Reduction of the "judder" effect whereby motion pictures captured at low rates appear to exhibit motion that is not smooth. An example of this is the reproduction of material captured on film at 24 fps, where the 3:2 sequence produces a 12-Hz component in the temporal spectrum of the displayed image.
• Adaptation between an input video signal frame rate and a different native display frame rate. An example of this is the display of content created for distribution at 50 Hz on devices that are only capable of reproduction at 60 Hz.
• Reduction of the motion-blur phenomenon that occurs due to eye tracking of moving scenery reproduced on a device with sustained illumination.1 An example of this is the motion blur associated with large-area LCD devices.
High-quality FRC involves composing a hypothetical video frame at a new display time that is consistent with two or more observed video frames, each associated with different capture times. Strictly speaking, FRC is an ill-posed problem in that there is no unique solution to the problem formulation. Practically, however, most real sequences are sufficiently well behaved that it is possible to deliver dramatic improvements in picture quality via FRC.
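As a rough sketch of how such an in-between frame can be composed, the code below assumes a per-pixel motion-vector field defined at the positions of the interpolated frame (supplied by a motion estimator of the kind discussed below), projects pixels from the two observed frames toward the intermediate display time along those vectors, and averages them. Occlusion handling and fallback logic, which are essential in practice, are omitted.

```python
import numpy as np

def interpolate_frame(frame0, frame1, mv, alpha=0.5):
    """Motion-compensated frame interpolation (sketch).

    frame0, frame1 : the two observed frames (2-D arrays)
    mv             : motion field of shape (H, W, 2) holding the (dy, dx)
                     displacement from frame0 to frame1, defined at the
                     pixel positions of the interpolated frame
    alpha          : fractional display time between the frames (0..1)
    """
    h, w = frame0.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Sample frame0 a fraction alpha back along the vector and frame1 a
    # fraction (1 - alpha) forward, then average the two samples.
    y0 = np.clip(np.round(ys - alpha * mv[..., 0]).astype(int), 0, h - 1)
    x0 = np.clip(np.round(xs - alpha * mv[..., 1]).astype(int), 0, w - 1)
    y1 = np.clip(np.round(ys + (1 - alpha) * mv[..., 0]).astype(int), 0, h - 1)
    x1 = np.clip(np.round(xs + (1 - alpha) * mv[..., 1]).astype(int), 0, w - 1)

    return 0.5 * (frame0[y0, x0].astype(np.float64) +
                  frame1[y1, x1].astype(np.float64))
```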
FRC must address the issues of multiple independent motions, complex motion such as rotation and zoom, and occlusions. These are illustrated in Fig. 2, which shows two input frames to the FRC algorithm and the interpolated frame created by the FRC algorithm. Note that the bird and tree each move differently from the lumberjack and that the tree undergoes rotational motion. In addition, the falling tree occludes the bird, while portions of the lumberjack are revealed.
One of the most important components of any FRC algorithm is the creation of an accurate motion-vector field, which describes how different objects in a scene move. Although a variety of effective techniques have been developed for motion-vector estimation in the context of digital-video compression (e.g., MPEG-2), these are generally unsuitable for use in FRC. The requirements for motion-vector estimation in the context of FRC are distinguished by the following:
• High-quality FRC requires the assignment of motion vectors at very fine granularity, ideally one motion vector per pixel.
• FRC requires high-precision motion vectors (e.g., one-quarter pixel accuracy or better).
• FRC requires the reconstruction of true motion information – the simple identification of the motion vector producing the best pattern match is unsuitable for use in FRC. This issue is especially important when dealing with pictures containing periodic patterns.
• Motion estimation for FRC should be robust to changes in lighting and noise.
• In the context of FRC, two or more pictures are used for reference in creating an interpolated picture. When areas of the reference pictures are covered or uncovered by foreground motion, it is essential that the motion-vector field include an indication of which reference picture should be used in creating a given interpolated pixel.
Fig. 2: Image sequence showing multiple motions, occlusions, and complex motion.
Techniques for motion-vector estimation involve the identification of similar patterns of pixels in two or more pictures. In general, there are two classes of algorithms for pattern matching which are in common use – sum of absolute differences (SAD) and phase plane correlation (PPC).2,3
SAD-type algorithms compute the sum of the absolute values of the pixel differences between a region of pixels in a first picture and an identically shaped region in a second picture that has been displaced horizontally by an amount dx and vertically by an amount dy. Many candidate values of dx and dy are evaluated, and the pair that produces the minimum SAD value indicates the best pixel match.
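A naive full-search SAD matcher over a small range can be written in a few lines; the sketch below is purely illustrative, since real implementations use hierarchical searches, much larger ranges, and dedicated hardware.

```python
import numpy as np

def sad_block_search(prev, curr, y, x, block=8, search=7):
    """Full-search SAD block matching (sketch).

    Finds the displacement (dy, dx), within +/- `search` pixels, that
    minimizes the sum of absolute differences between the block x block
    region at (y, x) in `curr` and the displaced region in `prev`.
    """
    target = curr[y:y + block, x:x + block].astype(np.float64)
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + block > prev.shape[0] or xx + block > prev.shape[1]:
                continue
            cand = prev[yy:yy + block, xx:xx + block].astype(np.float64)
            sad = np.abs(target - cand).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad
```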
PPC algorithms are generally equivalent to performing the following steps:
• Extract a rectangle of pixels from each of a first and a second reference picture.
• Apply a windowing function to the rectangles of pixels.
• Compute the two-dimensional fast Fourier transform (FFT) of each of the rectangles of pixels.
• Compute the point-by-point complex product of the first array of FFT values and the complex conjugate of the second array of FFT values.
• Normalize the resulting complex product to produce an array of phase values, i.e., complex values with unity complex modulus.
• Perform the inverse FFT operation on the normalized product. The output of the inverse FFT operation is the PPC.
• Identify the coordinates of peaks in the PPC – use these coordinates as motion-vector candidates.
• Evaluate the set of motion-vector candidates to determine the best motion-vector estimate for each pixel.
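These steps map almost directly onto a few lines of FFT code. The sketch below, assuming two equally sized pixel blocks and a Hann window, returns the correlation surface whose peak coordinates serve as motion-vector candidates.

```python
import numpy as np

def phase_plane_correlation(block_a, block_b, eps=1e-9):
    """Phase plane correlation between two equally sized pixel blocks (sketch)."""
    h, w = block_a.shape
    window = np.outer(np.hanning(h), np.hanning(w))       # windowing function

    fa = np.fft.fft2(block_a * window)                    # 2-D FFTs
    fb = np.fft.fft2(block_b * window)

    cross = fa * np.conj(fb)                              # complex product
    cross /= (np.abs(cross) + eps)                        # keep phase only
    ppc = np.real(np.fft.ifft2(cross))                    # inverse FFT -> PPC
    return np.fft.fftshift(ppc)                           # center zero displacement

# With this sign convention, a peak at offset (dy, dx) from the center
# indicates that block_a looks like block_b shifted by roughly (dy, dx).
```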
The PPC algorithm can be viewed as an approximation to the cross-correlation function between two picture regions, after performing a spectral equalization operation, which tends to enhance detail. The detail-enhancement effect makes the PPC more robust than SAD-type algorithms at identifying true motion in regions that are relatively flat in texture.
While a good pixel match is important, it is not sufficient by itself for estimating the true motion captured by a video sequence. In order to better estimate the true motion of objects, additional constraints are typically employed, including motion-vector-field spatial smoothness (try to keep motion vectors similar to those in their neighborhood) and temporal consistency (evaluate evidence that the motion-vector estimate at a given position in the current picture is consistent with the vectors determined in previous pictures).
Although the PPC requires the implementation of a complicated series of arithmetic operations, the FFT algorithm is more computationally efficient at pattern matching over a large search area than a commensurate application of a SAD-type algorithm. Thus, with the PPC algorithm, it is feasible to implement an FRC solution with accurate motion-vector estimation over the large pull-in range required for performing FRC on HDTV video formats. Because FRC for LCD motion-blur reduction is now sufficiently mature, some LCD panels for HDTV are offered with this feature integrated into the panel subsystem.
Super-Resolution
All HDTV receivers are capable of digitally up-sampling video received in a standard-definition-television (SDTV) format to match the native high-definition display format. The traditional approach used to perform spatial up-sampling involves a straightforward application of linear poly-phase interpolation filters. Although the traditional approach results in a representation of each picture with a higher pixel density, it does not recover lost picture information and, therefore, does not actually increase the perceived resolution of the pictures. Recently, HDTV receivers have begun to include more advanced techniques for up-sampling video, which can actually increase perceived resolution. These advanced techniques are generally referred to as super-resolution (SP-RES) processing.
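For reference, a minimal poly-phase up-sampler for one video line, using an assumed 4-tap Catmull-Rom cubic kernel with two phases for 2x scaling, is sketched below; as noted above, such filtering raises the pixel density but adds no new picture information.

```python
import numpy as np

def catmull_rom_weights(t):
    """4-tap Catmull-Rom cubic weights for sub-pixel phase t in [0, 1)."""
    return np.array([
        -0.5 * t**3 + t**2 - 0.5 * t,
         1.5 * t**3 - 2.5 * t**2 + 1.0,
        -1.5 * t**3 + 2.0 * t**2 + 0.5 * t,
         0.5 * t**3 - 0.5 * t**2,
    ])

def upsample_line_2x(line):
    """2x horizontal up-sampling of one line with a two-phase poly-phase filter."""
    line = line.astype(np.float64)
    padded = np.pad(line, 2, mode='edge')
    phases = [catmull_rom_weights(0.0), catmull_rom_weights(0.5)]
    out = np.empty(2 * len(line))
    for x in range(len(line)):
        taps = padded[x + 1:x + 5]            # 4 input samples around x
        out[2 * x]     = taps @ phases[0]     # phase 0: co-sited with input
        out[2 * x + 1] = taps @ phases[1]     # phase 0.5: halfway between samples
    return out
```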
SP-RES refers to the process of converting one or more pictures from a low-spatial-resolution sequence to a picture at a higher spatial resolution through image processing. SP-RES is commonly found in instrumentation applications such as satellite imaging, astronomical imaging, remote sensing, and medical imaging. In SP-RES, multiple low-resolution pictures may be used to generate high-resolution details that would not be produced by traditional up-sampling techniques. For example, in Fig. 3, three low-resolution frames [LR(n-1), LR(n), and LR(n+1)] in the input video sequence are used to produce a high-resolution frame SR(n) in the output video sequence.
SP-RES can be formulated as an inversion problem in the following fashion. Let X be an unknown high-resolution image of a scene and Y be the set of observed (or received) low-resolution images. The imaging process from X to Y may include camera motion, object motion in the scene, optical blurring, motion blurring, down-sampling, and noise corruption. Overall, the imaging process can be abstracted in the following linear equation:
Y = HX + N, (1)
where H represents the system matrix of the imaging process and N represents additive random noise. In SP-RES, the objective is to find an estimate of the high-resolution image X from the observed low-resolution images Y.
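In the SP-RES literature surveyed in Ref. 4, the system matrix for the k-th low-resolution observation is commonly factored into motion (warping), blur, and down-sampling operators, giving one equation of the form of Eq. (1) per observation:

Y_k = D B_k M_k X + N_k,   k = 1, ..., K,

where M_k warps the high-resolution image X according to the camera or scene motion for observation k, B_k models optical and motion blur, D is the down-sampling operator, and N_k is additive noise.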
Numerous algorithms have been developed for SP-RES, and a recent overview of these algorithms can be found in Ref. 4. Roughly speaking, SP-RES algorithms can be categorized into two groups: motion-based methods and motion-free methods.
1. In motion-based SP-RES, one tries to track an object in multiple low-resolution images and combine these spatially shifted versions of the object into a high-resolution image of the object. Clearly, accurate estimation of the true motion of the object is the key for these motion-based methods. Moreover, the true motion represented in the low-resolution images must include subpixel shifts so that additional information can be extracted from each of the low-resolution images to produce a high-resolution image.
2. In motion-free SP-RES, one uses cues other than motion in the low-resolution images to obtain high-resolution details; examples of such cues include known samples of low-resolution images and the corresponding high-resolution images, and edges in the low-resolution images.
In the following, we briefly describe three SP-RES algorithms that have the potential to be used in HDTV receivers. The first two are motion-based methods and the last is a motion-free method.
• Iterative back-projection (IBP) methods: This type of method5 follows a simulate-and-correct approach in constructing a high-resolution image from multiple low-resolution images. First, an initial estimate of the high-resolution image is constructed; this may be done by simple spatial scaling techniques such as bi-cubic filtering or poly-phase filtering. Second, assuming explicit knowledge of the imaging process from the scene to the observation [i.e., the system matrix H in Eq. (1)], one can simulate the imaging process starting from the initial estimate of the high-resolution image and produce simulated low-resolution images. Third, the simulated low-resolution images are compared with the observed images, and the residue is used to update the estimate of the high-resolution image. IBP methods provide an effective framework for estimating the high-resolution image from the low-resolution images. The main disadvantages include the requirement of explicit knowledge of the imaging process and high computational complexity. A simplified sketch of this simulate-and-correct loop is given after this list.
• Bayesian methods: In this type of method,6 one tries to solve the system [Eq. (1)] in an optimal sense, within a probabilistic framework. Considering X, Y, and N as realizations of random processes, one attempts to find the estimate of X with the highest probability of occurrence given the observed low-resolution pictures, Y. Iterative techniques are generally used to solve the optimization problem. The main advantage of this type of method is the incorporation of prior probabilistic knowledge into the solution. The challenge is to find an appropriate probability model that accurately describes high-resolution images. (The underlying optimization is restated after this list.)
• Learning-based methods: In some applications, a database of a number of low-resolution images and their corresponding high-resolution images may exist. In such cases, it is possible to learn from the database some knowledge of the imaging process and utilize the knowledge in SP-RES. The learning process can be done off-line and therefore can use complicated methods. Based on the results of the learning process, the construction of high-resolution images from low-resolution images can be done in real time. The challenge of this type of method is the development of a representative training set that will permit the learning process to be sufficiently general to handle the diverse types of video sequences that may be viewed on a television receiver.
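To make the simulate-and-correct structure of the IBP approach concrete, the sketch below assumes hypothetical simulate_lr (the forward imaging model for observation k: warp, blur, and down-sample) and back_project (spreading a low-resolution residual back onto the high-resolution grid) operators; it illustrates the framework rather than any particular published implementation.

```python
import numpy as np

def iterative_back_projection(lr_images, simulate_lr, back_project,
                              init_hr, iterations=10, step=1.0):
    """Iterative back-projection super-resolution (sketch).

    lr_images    : list of observed low-resolution images
    simulate_lr  : hypothetical function (hr, k) -> simulated LR image k,
                   modelling the system matrix H of Eq. (1)
    back_project : hypothetical function (residual, k) -> HR-grid correction
    init_hr      : initial high-resolution estimate (e.g., bi-cubic upscale)
    """
    hr = init_hr.astype(np.float64).copy()
    for _ in range(iterations):
        correction = np.zeros_like(hr)
        for k, lr in enumerate(lr_images):
            simulated = simulate_lr(hr, k)             # simulate the imaging process
            residual = lr.astype(np.float64) - simulated
            correction += back_project(residual, k)    # spread the error back
        hr += step * correction / len(lr_images)       # correct the estimate
    return hr
```

For the Bayesian methods, the estimate sought is usually the maximum a posteriori (MAP) solution, which in the notation of Eq. (1) can be written as

X_MAP = arg max_X p(X | Y) = arg max_X p(Y | X) p(X),

where p(Y | X) follows from the noise model and p(X) is the prior model of high-resolution images whose selection the text above identifies as the main challenge.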
Consumer-electronics products claiming super-resolution capabilities are starting to appear on the market. Algorithms for super-resolution in consumer electronics will become increasingly sophisticated over the next few years.
Conclusion
The immense popularity of large HDTV displays has driven the development and implementation of highly sophisticated algorithms for video processing. All of the algorithms discussed in this article attempt to tackle the difficult problem of recovering missing picture information. In the case of super-resolution, the missing information is high-resolution pixel data. De-interlacing attempts to create missing scan lines of picture data. In FRC, entirely new pictures are created.
Acknowledgments
The authors gratefully recognize their colleagues Daniel Doswald, Marinko Karanovic, Finn Wredenhagen, and Jason Yang for their contributions to the subject matter discussed in this article, and Ruth Valiante for her artistic assistance.
References
1. H. Pan, X. F. Feng, and S. Daly, "LCD motion blur modeling and analysis," Proc. ICIP, II-21–24 (2005).
2. G. A. Thomas, "Television motion measurement for DATV and other applications," BBC Research Department Report, BBC RD 1987/11.
3. A. Pelagotti and G. de Haan, "High-quality-picture rate up-conversion for video on TV and PC," Proc. Philips Conf. Digital Signal Processing, paper 4.1 (Nov. 1999).
4. S. C. Park, M. K. Park, and M. G. Kang, "Super-resolution image reconstruction: a technical overview," IEEE Signal Processing Magazine 20, No. 3, 21–36 (May 2003).
5. S. Peleg, D. Keren, and L. Schweitzer, "Improving image resolution by using subpixel motion," Pattern Recognition Letters 5, No. 3, 223–226 (March 1987).
6. R. Schultz and R. Stevenson, "A Bayesian approach to image expansion for improved definition," IEEE Trans. Image Processing 3, No. 3, 233–242 (May 1994).