Natural and Intuitive User Interfaces with Perceptual Computing Technologies
The ways we interface and interact with computing, communications, and entertainment devices are changing. A transition to natural and intuitive user interfaces promises to usher in a new class of exciting and immersive applications and user experiences.
by Achintya K. Bhowmik
HUMAN–COMPUTER interaction and user-interface paradigms have been undergoing a surge of innovation and rapid evolution in recent years. The human interface with computers has already been transformed over the past decades in a major way, with graphical user interfaces that employ a mouse as the input device replacing the old command-line interfaces based on text inputs. The rise of touch technology has coincided with breakthrough advances in mobile-display technologies, yielding power-efficient, thin, and light devices with stunning visual performance.1 We are now witnessing the next interface evolution, with the advent of more natural user interfaces in which the user interacts with computing, communications, and entertainment devices using gesture, voice, and multi-modal inputs, as well as touch.
These new interface technologies and the ensuing applications present exciting opportunities for the display and consumer-electronics industries at large. This article outlines the key technologies, recent developments, and trends toward realizing natural user interfaces, as well as implementations of new interactive systems and applications.
Touch Inputs
The recent transformation of the display subsystem from a visual information device to an interactive one is largely due to the integration of touch sensing into the display module. With the capability of sensing contact locations on its surface, a display morphs from a one-way device that merely shows visual content into a two-way interface that also directly receives user inputs and thus enables interactive applications. It took decades of development after the first report of capacitive touch screens, by Johnson in 1965,2 for the technology and its usage to go mainstream with consumers. Recently, touch-screen technology and its commercial deployment have been undergoing a phase of fast adoption and growth, thanks to the widespread proliferation of smartphones and tablets. Touch-enabled display screens are increasingly appearing in modern laptop computers, especially in the category of ultrabooks.
Let us take a quick look at the market size to gain an appreciation of the footprint of touch-screen technology. The industry now ships well over a billion touch-screen units per year. As shown in Fig. 1, the market for touch-screen technologies has grown at a rapid pace recently, both in terms of total units and revenues.3
Fig. 1: Overall touch-screen market growth from 2007 to 2012 is depicted by unit volume (solid line, left axis) and revenue (dashed line, right axis). Source: DisplaySearch.
While projected-capacitive technology has been the main enabler of touch penetration in consumer devices in recent years, other technologies for sensing contact location on the surface of a display include analog resistive, surface capacitive, surface acoustic wave, infrared, camera-based optical, LCD in-cell, bending wave, force-sensing, planar scatter detection, vision-based, electromagnetic resonance, and combinations of technologies.4
Multi-Modal Inputs: Toward Perceptual Computing
We human beings use multi-modal interface schemes to comprehend our surroundings and communicate with each other in our daily lives, seamlessly combining gestures, voice, touch, facial expressions, eye gaze, emotions, and context. We have evolved into highly interactive beings, aided by a sophisticated 3-D visual-perception system, auditory capabilities, skin with tactile sensitivity, and other perceptual sensors. Well more than half of the human brain is dedicated to processing perceptual signals,5 which enables us to understand the space, beings, and objects around us and to interact in contextually aware, natural, and intuitive ways. As depicted in Fig. 2, the human-visual system provides 3-D depth perception through a binocular imaging scheme, allowing us to navigate and interact with objects in 3-D space.
Fig. 2: At left, more than half of the human brain is dedicated to processing perceptual signals, enabling us to see, hear, understand, and interact with each other in natural ways. At right, the human-visual system perceives the world in 3-D with a binocular imaging scheme.
Taking a page from nature’s playbook, we are now adding human-like sensing and perception capabilities to computing and communications devices, to give them the abilities to “see,” “hear,” and “understand” human actions and instructions in natural and intuitive ways and use these new capabilities to interact with us.6
The industry is witnessing significant innovations and early commercial implementations of gesture-based interaction schemes based on real-time image capture and inference technologies. Until recently, most efforts were focused on 2-D computer vision and image-processing techniques for gesture-input recognition,7 taking advantage of the 2-D image sensors that are now a ubiquitous part of computing and communications devices. However, implementations based on 2-D image sensors and processing are limited to simple gestures. Recent breakthroughs in 3-D imaging technologies are now enabling fine-grain rich user interactions and object manipulations in 3-D space in front of the display. There are various methods for capturing and interpreting real-time 3-D user inputs; three of the most prominent are (1) stereo-imaging-based computer vision, (2) projected structured-light-based 3-D imaging, and (3) time-of-flight techniques for depth-map determination.8
Stereo-imaging-based 3-D computer-vision techniques attempt to mimic the human-visual system, in which two calibrated imaging devices laterally displaced from each other capture synchronized images of the scene, and the depth for the image pixels is extracted from the binocular disparity. The technique is illustrated in Fig. 3 (left), where O and Oʹ are the two camera centers with focal length f, forming images of an object, X, at positions x and xʹ in their respective image planes. In this simple case, it can be shown that the distance of the object, perpendicular to the baseline connecting the two camera centers, is inversely proportional to the binocular disparity: z = Bf/(x – xʹ), where B is the baseline distance between the camera centers. Algorithms for determining binocular disparity and depth information from stereo images have been widely researched, and further advances continue to be made.9
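The relation above can be applied per pixel to convert a disparity map from a rectified stereo pair into a depth map. The sketch below illustrates this, assuming the baseline B is in meters and the focal length f is in pixels; the function name and the numbers in the example are illustrative, not taken from any particular system.

```python
import numpy as np

def depth_from_disparity(disparity_px, baseline_m, focal_px):
    """Convert a disparity map (pixels) to a depth map (meters): z = B*f/d.

    Assumes a rectified stereo pair; pixels with zero or negative
    disparity are marked invalid (NaN).
    """
    disparity = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > 0
    depth[valid] = baseline_m * focal_px / disparity[valid]
    return depth

# Example: with a 6-cm baseline and a 600-pixel focal length, a 30-pixel
# disparity corresponds to 0.06 * 600 / 30 = 1.2 m.
print(depth_from_disparity([[30.0]], baseline_m=0.06, focal_px=600.0))
```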
In the case of structured-light-based computer-vision methods, a patterned or “structured” beam of light, typically infrared, is projected onto the object or scene of interest; the image of the light pattern deformed due to the shape of the object or scene is then captured using an image sensor, and, finally, the depth map and 3-D geometric shape of the object or scene are determined using the distortion of the projected optical pattern. This is conceptually illustrated in Fig. 3 (middle).10
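One simplified way to see the geometry, which is not the specific color-stripe, dynamic-programming method of Ref. 10, is to treat the projector as the second camera of a stereo pair: the shift of a known pattern feature relative to its position on a calibrated reference plane encodes depth by triangulation. The sketch below assumes such a reference-plane calibration; the baseline, focal-length, and reference-depth values are placeholders.

```python
import numpy as np

def depth_from_pattern_shift(shift_px, ref_depth_m, baseline_m, focal_px):
    """Triangulate depth from the shift of a projected pattern feature.

    With the projector acting as a second camera, a feature observed on a
    reference plane at depth z0 shifts by d pixels when the surface lies
    at depth z, with d = f*B*(1/z - 1/z0). Inverting this gives z.
    """
    shift = np.asarray(shift_px, dtype=np.float64)
    inv_depth = shift / (focal_px * baseline_m) + 1.0 / ref_depth_m
    return 1.0 / inv_depth

# Example: with a 7.5-cm baseline, a 580-pixel focal length, and a 2-m
# reference plane, a feature shifted by +5 pixels lies nearer than 2 m.
print(depth_from_pattern_shift(5.0, ref_depth_m=2.0,
                               baseline_m=0.075, focal_px=580.0))
```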
The time-of-flight method measures the depth map by illuminating an object or scene with a beam of pulsed infrared light and determining the time it takes for the light pulse to be detected on an imaging device after being reflected from the object or scene. The system typically comprises a full-field range-imaging capability, including an amplitude-modulated illumination source and an image-sensor array with a high-speed shutter. Fig. 3 (right) conceptually illustrates a method for converting the phase shifts of the reflected optical pulse into light intensity,11 which allows determination of the depth map.
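For the amplitude-modulated (continuous-wave) variant, the phase shift of the returned light is commonly recovered from four intensity samples taken at quarter-cycle offsets and then converted to distance. The sketch below shows this standard four-phase calculation under an assumed sampling convention and modulation frequency; it is a generic illustration rather than the specific system of Ref. 11.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def tof_depth_four_phase(a0, a1, a2, a3, mod_freq_hz):
    """Estimate depth from four samples of an amplitude-modulated ToF pixel.

    a0..a3 are intensity samples taken at 0, 90, 180, and 270 degrees of
    modulation phase offset (one common convention). The phase shift of
    the returned light maps to distance as d = c * phase / (4 * pi * f_mod),
    with an unambiguous range of c / (2 * f_mod).
    """
    phase = np.arctan2(np.asarray(a3, dtype=float) - a1,
                       np.asarray(a0, dtype=float) - a2)
    phase = np.mod(phase, 2.0 * np.pi)   # fold into [0, 2*pi)
    return C * phase / (4.0 * np.pi * mod_freq_hz)

# Example: at 20-MHz modulation, a quarter-cycle phase shift (pi/2)
# corresponds to roughly 1.87 m.
print(tof_depth_four_phase(0.0, -1.0, 0.0, 1.0, mod_freq_hz=20e6))
```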
Fig. 3: Principles of 3-D image-capture technologies include, at left, stereo-3D imaging based on the geometric relationship between object distance and binocular disparity; in the middle, a projected structured-light 3-D imaging method10; and, at right, a time-of-flight range-imaging technique using pulsed light reflection.11
With real-time acquisition of 3-D data points using the techniques described above, rich human–computer interaction schemes can be implemented using recognition and inference techniques that enable interactions beyond touch screens. Besides these 3-D computer-vision technologies, there has also been significant interest and effort in the research community in developing human–computer interfaces utilizing voice input, processing, and output.12 Recent developments in this domain are starting to yield commercial success, with applications in mobile devices, consoles, and automotive markets. Furthermore, the combination of these sensing and perception domains makes it possible to implement multi-modal user interfaces.
As an example of such an implementation, we have recently developed and released the Intel Perceptual Computing Software Development Kit (SDK), which includes a small, lightweight, USB-powered 3-D imaging device with dual-array microphones, a set of libraries consisting of 3-D computer-vision and speech-processing algorithms, and Application Programming Interfaces (APIs) for application developers to utilize these libraries.13 These tools allow developers to create immersive applications and interactive experiences that incorporate close-range hand and finger-level tracking, fine-grain gesture and pose recognition, speech recognition, facial analysis, 2-D/3-D object tracking, and augmented reality.
The 3-D imaging device included in the aforementioned SDK is specifically designed for close-range interactivity and includes a 3-D depth sensor, a high-definition RGB sensor, and built-in dual-array microphones. Figure 4 shows some of the important capabilities and tools provided in the SDK, utilizing real-time computer-vision and image-processing algorithms.
Fig. 4: At top left, a depth image is captured using the 3-D imaging device included in the Perceptual Computing SDK. The top middle figure shows color-image and face analysis, and the top right figure shows 2-D/3-D tracking and augmented reality. The bottom row shows hand and finger-level recognition and tracking.
The image-capture module in the SDK provides an 8-bit RGB image and a 16-bit depth map, enabling the reconstruction of 3-D point clouds. The audio-capture module provides 1–2-channel PCM/IEEE-Float audio streams. The close-range finger-tracking module includes geometric node tracking and 7-point tracking of fingertips, palm center, and elbow; advanced computer-vision algorithms provide estimates of positions, volumes, openness, and handedness, recognition of a standardized set of poses (such as thumb up/down and peace) and gestures (such as swipe left/right/up/down, circle, and wave), and label maps for the hand image and its parameters. The face tracking and analysis module includes landmark detection (eyes, nose, and mouth), facial-attribute detection (age group and gender), smile and blink detection, and face recognition. Other modules incorporate speech recognition and synthesis, with voice command and control, short-sentence dictation, and text-to-speech synthesis; and 2-D/3-D object tracking, with tracking of planar surfaces, reporting of position, orientation, and other parameters, and tracking of 3-D objects based on 3-D models.
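As a concrete illustration of how a depth map of this kind becomes a 3-D point cloud, the sketch below back-projects each valid depth pixel through a pinhole-camera model. This is a generic reconstruction rather than the SDK's own API; the intrinsic parameters (fx, fy, cx, cy) and the image size in the example are placeholder values.

```python
import numpy as np

def depth_to_point_cloud(depth_mm, fx, fy, cx, cy):
    """Back-project a 16-bit depth map (millimeters) into an N x 3 point cloud (meters).

    Uses the pinhole model: X = (u - cx) * z / fx, Y = (v - cy) * z / fy.
    Zero depth values are treated as invalid and dropped.
    """
    depth_m = np.asarray(depth_mm, dtype=np.float64) / 1000.0
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth_m > 0
    z = depth_m[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.column_stack((x, y, z))

# Example with a small synthetic depth map and placeholder intrinsics.
depth = np.full((240, 320), 800, dtype=np.uint16)   # flat surface at 0.8 m
cloud = depth_to_point_cloud(depth, fx=290.0, fy=290.0, cx=160.0, cy=120.0)
print(cloud.shape)   # (76800, 3)
```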
Besides the libraries described above, we have also developed and released a computationally efficient articulated 3-D hand-skeletal tracking technique based on a physical-simulation approach using the 3-D imaging sensor.14 This method fits a 3-D model of a hand into the depth image or 3-D point cloud generated by the 3-D imaging device, on a frame-by-frame basis, and imposes constraints based on physiology for accurate tracking of the hand and the individual fingers despite occasional occlusions. As shown in Fig. 5, this technology enables fine-grain versatile manipulation of objects in 3-D space.
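The internals of the released tracking library are not reproduced here; as a heavily simplified illustration of the general idea of fitting a model to the observed 3-D data frame by frame, the sketch below fits the center of a single fixed-radius sphere (a stand-in for, say, the palm) to a point cloud by least squares. A real articulated tracker optimizes many joint parameters under physiological constraints, as described above; the names and values here are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sphere_center(points, radius, x0):
    """Fit the center of a fixed-radius sphere to a 3-D point cloud.

    Minimizes the mean squared distance between each point and the sphere
    surface -- a toy, single-primitive analog of fitting an articulated
    hand model to the depth data on a per-frame basis (the previous
    frame's result would normally serve as the initial guess x0).
    """
    pts = np.asarray(points, dtype=np.float64)

    def cost(center):
        d = np.linalg.norm(pts - center, axis=1) - radius
        return np.mean(d ** 2)

    result = minimize(cost, np.asarray(x0, dtype=np.float64), method="Nelder-Mead")
    return result.x

# Example: noisy points sampled near a 4-cm-radius sphere centered at (0, 0, 0.5).
rng = np.random.default_rng(0)
dirs = rng.normal(size=(500, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
points = np.array([0.0, 0.0, 0.5]) + 0.04 * dirs + 0.002 * rng.normal(size=(500, 3))
print(fit_sphere_center(points, radius=0.04, x0=[0.0, 0.0, 0.4]))
```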
Fig. 5: An articulated model-based hand-skeletal tracking technology enables fine-grain versatile manipulation of objects in 3-D space. Source: Intel.
A New Class of Interactive Applications
Besides enhancing user interactions for some traditional applications, a new class of highly interactive applications and user experiences is made possible by multi-modal interaction schemes that combine multiple inputs such as touch, voice, face, and 3-D gesture recognition in intuitive and engaging ways. Figure 6 shows two examples that are naturally enabled by 3-D gesture interactions in front of the display, rather than traditional 2-D inputs such as a mouse or touch screen. The image on the left shows a scenario in which the user is expected to reach out and “grab” a door knob, “turn” it, and “pull” it out of the plane of the display to “open” the door. The image on the right shows a “slingshot” application, in which the user “pulls” a sling with the fingers, directs it in 3-D space, and “releases” it to hit and break the targeted elements of a 3-D structure.
Fig. 6: Examples of interactive applications and experiences enabled by 3-D imaging technologies include manipulations of objects in the 3-D space in front of the display. Source: Intel.
These actions would be quite difficult to implement intuitively with a mouse, keyboard, or touch screen. Implementations of 3-D gesture interactions using 3-D computer-vision algorithms result in more natural and intuitive user experiences for this type of application.
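The gesture logic behind these demos is not published here; a common heuristic for this kind of interaction, sketched below under assumed tracked landmarks (five fingertips and a palm center, in meters), is to declare the hand “closed” when the fingertips collapse toward the palm, which an application could map to grabbing the door knob or holding the sling. The threshold value is illustrative.

```python
import numpy as np

def is_grabbing(fingertips_xyz, palm_xyz, close_threshold_m=0.055):
    """Crude open/closed-hand heuristic from tracked 3-D landmarks.

    Returns True when the mean fingertip-to-palm distance drops below a
    threshold; an application could treat the closed hand as a "grab"
    and the reopening of the hand as the release of a slingshot.
    """
    tips = np.asarray(fingertips_xyz, dtype=np.float64)   # shape (5, 3)
    palm = np.asarray(palm_xyz, dtype=np.float64)         # shape (3,)
    mean_dist = np.mean(np.linalg.norm(tips - palm, axis=1))
    return mean_dist < close_threshold_m

# Example: fingertips clustered ~3 cm from the palm read as a closed hand.
tips = np.array([[0.03, 0.0, 0.5]] * 5)
print(is_grabbing(tips, palm_xyz=[0.0, 0.0, 0.5]))   # True
```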
Besides gesture interactions and object manipulations in 3-D space, real-time 3-D imaging can also transform video conferencing, remote collaboration, and video blogging applications, as users can easily be subtracted from the background, or a custom background can be inserted using the depth map generated by the 3-D imaging device. Another category of applications that can be dramatically enhanced is augmented reality, where rendered graphical content is added to captured image sequences. Beyond the traditional augmented reality applications that currently use 2-D cameras, 3-D imaging can augment video content with 3-D models of objects and scenes and allow users to interact with elements in the augmented world.
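As an illustration of the background-subtraction idea, the sketch below segments the user with a simple depth threshold, assuming the depth map is registered to the color image; the cutoff distance and frame sizes are placeholder values rather than anything prescribed by a particular device.

```python
import numpy as np

def remove_background(color_rgb, depth_mm, max_depth_mm=1200):
    """Keep only pixels closer than a depth cutoff; zero out the rest.

    A simple depth-threshold segmentation: in a video-chat scenario the
    user sits well in front of the background, so pixels beyond the
    cutoff (or with no valid depth) can be blanked or replaced with a
    virtual backdrop.
    """
    depth = np.asarray(depth_mm)
    mask = (depth > 0) & (depth < max_depth_mm)   # foreground mask
    out = np.zeros_like(color_rgb)
    out[mask] = np.asarray(color_rgb)[mask]
    return out, mask

# Example with synthetic frames: user at ~0.8 m, wall at ~2.5 m.
color = np.random.randint(0, 255, size=(240, 320, 3), dtype=np.uint8)
depth = np.full((240, 320), 2500, dtype=np.uint16)
depth[60:180, 100:220] = 800
fg, mask = remove_background(color, depth)
print(mask.sum())   # number of foreground pixels: 120 * 120 = 14400
```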
Enabling Enhanced Interactions
Just as the introduction of the mouse and the graphical user interface three decades ago brought about numerous new applications on computers, and the proliferation of the touch interface enabled another set of new applications on smartphones and tablets over the past few years, 3-D user interfaces based on perceptual computing promise to usher in a new class of exciting and interactive applications on computing, communications, and entertainment devices.
This article has examined key natural user input technologies behind the emerging multi-modal interaction paradigms that are paving the path to the era of perceptual computing and a new class of highly interactive applications and user experiences. Recent advances in 3-D computer vision with real-time 3-D image-capture techniques and inference algorithms, combined with improvements in speech recognition, promise to take human–computer interactions one step further into the 3-D space in front of the display.
References
1A. Bhowmik et al., eds., “Mobile Displays: Technology & Applications” (John Wiley & Sons, Ltd., New Jersey, 2008).
2E. Johnson, “Touch Display – A novel input/output device for computers,” Electronics Lett. 1, 219–220 (1965).
3DisplaySearch, “Touch-Panel Market Analysis Annual Reports,” 2008–2012.
4G. Walker, “A Review of Technologies for Sensing Contact Location on the Surface of a Display,” J. Soc. Info. Display 20, 413–440 (2012).
5R. Snowden et al., “Basic Vision: An Introduction to Visual Perception” (Oxford University Press, 2006).
6http://www.ted.com/pages/intel_achin_bhowmik.
7S. Rautaray, “Real Time Hand Gesture Recognition System for Dynamic Applications,” Intl. J. UbiComp 3, 21–31 (2012).
8A. Bhowmik, “3D Computer Vision,” SID Seminar Lecture Notes, M-9 (2012).
9M. Brown et al., “Advances in Computational Stereo,” IEEE Trans. Pattern Analysis and Machine Intell. 25, 8 (2003).
10L. Zhang et al., “Rapid Shape Acquisition Using Color Structured Light and Multi-Pass Dynamic Programming,” IEEE Intl. Symp. on 3D Data Proc. Vis. Trans., 24–36 (2002).
11A. Jongenelen, “Development of a Compact, Configurable, Real-time Range Imaging System,” Ph.D. Dissertation, Victoria University of Wellington (2011).
12S. Furui et al., “Fundamental Technologies in Modern Speech Recognition,” IEEE Signal Processing Magazine 16 (2012).
13http://www.intel.com/software/perceptual
14http://software.intel.com/en-us/articles/the-intel-skeletal-hand-tracking-library-experimental-release