
Multimodal Interaction

Cooperative Intelligence

Aggregating Input for High-level Conversation

In software systems, input/output forms such as text, sound, and images are each a type of "modality." HRI-JP is researching a "multimodal dialogue system" that handles combined input and output of text, sound, images, and video. One application of such a multimodal system is the next-generation navigation system.

The next-generation navigation system is required not only to instruct "Turn left at the next corner," but also to instruct "Turn left at the corner of the tall red building," which requires analyzing the surrounding environment. If the user responds with "Do you mean the Red Lion Hotel?", the navigation system is expected to clarify: "Yes. Please turn left before the Red Lion Hotel." Thus, the next-generation navigation system must analyze and understand both the user's question and the environment. Research that combines language processing, real-world references, and visual interpretation is necessary to realize this level of technology.

To realize route navigation that uses not only map locations and building names but also the driver's speech and the view through the windshield, multimodal processing is required that aggregates speech recognition, language understanding, and image processing technologies.
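The following is a minimal sketch of what such aggregation might look like: a recognized utterance, landmarks extracted from the camera view, and a planned maneuver are combined into one grounded instruction. The function names, fields, and stub implementations (recognize_speech, detect_landmarks, Landmark) are illustrative assumptions, not HRI-JP's actual APIs.

```python
from dataclasses import dataclass

@dataclass
class Landmark:
    name: str            # e.g. "Red Lion Hotel"
    color: str           # dominant color seen by the camera
    bearing_deg: float   # direction relative to the vehicle heading

def recognize_speech(audio) -> str:
    """Stub for speech recognition; returns the user's utterance as text."""
    return "Do you mean the Red Lion Hotel?"

def detect_landmarks(frame) -> list[Landmark]:
    """Stub for image processing; returns landmarks visible through the windshield."""
    return [Landmark(name="Red Lion Hotel", color="red", bearing_deg=-15.0)]

def ground_instruction(turn: str, utterance: str, landmarks: list[Landmark]) -> str:
    """Combine the planned maneuver, the user's question, and visible landmarks."""
    for lm in landmarks:
        if lm.name.lower() in utterance.lower():
            return f"Yes. Please turn {turn} before the {lm.name}."
    # Fall back to a purely visual description if no landmark name was confirmed.
    if landmarks:
        return f"Turn {turn} at the corner of the {landmarks[0].color} building."
    return f"Turn {turn} at the next corner."

if __name__ == "__main__":
    utterance = recognize_speech(audio=None)
    landmarks = detect_landmarks(frame=None)
    print(ground_instruction("left", utterance, landmarks))
```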

Group Conversation with a Robot

At HRI-JP, we have been researching a group conversation system that enables a robot to converse with multiple users simultaneously. The Human-Machine Dialogue Enhancer (HALOGEN), a multimodal dialogue research platform, enhances the language information processing capability of HRiME.

HALOGEN can collect and analyze information not only from language but also from voice, images, and video. It extracts information such as tone of voice, direction of the face, emotion, gestures, gender, and age, and combines this information with the language data to identify the current speaker and judge the flow of the conversation.
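As a rough illustration of this kind of cue fusion, the sketch below scores each user from per-modality cues and picks the most likely current speaker. The field names, weights, and scoring rule are assumptions made for this example; they are not HALOGEN's actual interface.

```python
from dataclasses import dataclass

@dataclass
class UserCues:
    user_id: str
    voice_activity: float   # 0..1, how strongly speech is localized at this user
    facing_robot: float     # 0..1, how directly the face points at the robot
    gesturing: float        # 0..1, amount of ongoing gesture

def current_speaker(users: list[UserCues]) -> str:
    """Pick the most likely current speaker by combining modalities."""
    def score(u: UserCues) -> float:
        # Voice activity dominates; gaze and gesture break ties.
        return 0.6 * u.voice_activity + 0.3 * u.facing_robot + 0.1 * u.gesturing
    return max(users, key=score).user_id

print(current_speaker([
    UserCues("A", voice_activity=0.9, facing_robot=0.2, gesturing=0.5),
    UserCues("B", voice_activity=0.1, facing_robot=0.8, gesturing=0.1),
]))  # -> "A"
```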

It is difficult for a robot to decide whether or not to respond when there are multiple users present. A system can respond to a user in a one-on-one situation relatively easily, but if multiple users speak in front of the robot, it may mistakenly respond to a monologue or to a conversation between humans. HALOGEN focuses on the motions and postures of users to estimate "response obligation" and solve this problem. We have observed that users tend to look towards the robot and wait for a while after speaking because they are uncertain whether the robot will respond. The observation that the user is waiting is also used to decide whether to respond. HALOGEN aims to enhance Cooperative Intelligence between humans and robots.

When two people are talking to each other in front of the robot, the robot does not respond while only the two of them are talking. When a person speaks while looking at the robot and then stops gesturing to wait for a response, the robot can recognize that the person is addressing it.
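The following is a simplified, rule-based sketch of such a "response obligation" check, using only the cues mentioned above (gaze toward the robot and a pause after speaking). The field names and the waiting threshold are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class UtteranceContext:
    gaze_on_robot: bool         # was the speaker looking at the robot at the end?
    silence_after_s: float      # how long the speaker has waited since finishing
    addressed_other_user: bool  # did gaze/turn-taking point at another person?

def should_respond(ctx: UtteranceContext, wait_threshold_s: float = 0.8) -> bool:
    """Return True when the robot is likely obliged to respond."""
    if ctx.addressed_other_user:
        return False   # human-to-human exchange: stay silent
    return ctx.gaze_on_robot and ctx.silence_after_s >= wait_threshold_s

print(should_respond(UtteranceContext(True, 1.2, False)))   # True: addressed and waiting
print(should_respond(UtteranceContext(False, 2.0, True)))   # False: talking to another person
```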

Interpreting Objects as Humans Do

In addition to the microphone array, we research the extraction of information using a camera or depth sensor for a better understanding of the environment. HRI-JP is engaged in using technologies to detect, recognize, and identify objects and faces, both human and animal. We conduct experiments to further understand environmental interaction by attaching sensors to a robot and letting it explore indoors. The robot collects information about its own location and the direction and number of sound sources while moving around, and then combines the sounds and images to create a 3D map.
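As a small sketch of one step in this process, the code below projects a sound source heard at a given bearing and range into map coordinates using the robot's current pose, so it could later be merged with visual landmarks. The pose representation and function names are assumptions for illustration, not the actual mapping pipeline.

```python
import math
from dataclasses import dataclass

@dataclass
class Pose:
    x: float
    y: float
    heading_rad: float   # robot orientation in the map frame

def sound_source_in_map(pose: Pose, bearing_rad: float, range_m: float) -> tuple[float, float]:
    """Convert a sound source heard at (bearing, range) relative to the robot
    into map-frame coordinates."""
    angle = pose.heading_rad + bearing_rad
    return (pose.x + range_m * math.cos(angle),
            pose.y + range_m * math.sin(angle))

# Example: a source heard 30 degrees to the left, 2 m away.
print(sound_source_in_map(Pose(1.0, 0.5, 0.0), math.radians(30), 2.0))
```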

We are also researching a system that enables natural interaction with humans using multimodal information from sounds (voices), finger motions, and gestures. Humans tend to have individual preferences when interacting with an intelligent system: some people may want to use a voice command to turn on a lamp, while others prefer a pointing gesture to execute the same command. Our system is designed to support these different preferences and to act accordingly for each user.
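A minimal sketch of this idea is shown below: either a voice command or a pointing gesture can trigger the same action, filtered by a per-user preference. The event structure, preference table, and action names are assumptions made for this example.

```python
from typing import Optional

def handle_event(event: dict, preferences: dict[str, str]) -> Optional[str]:
    """Map a voice or gesture event to an action, respecting user preference."""
    user = event["user"]
    preferred = preferences.get(user, "any")
    if preferred not in ("any", event["modality"]):
        return None   # ignore modalities this user does not want to use
    if event["modality"] == "voice" and "lamp" in event["text"]:
        return "turn_on_lamp"
    if event["modality"] == "gesture" and event["target"] == "lamp":
        return "turn_on_lamp"
    return None

prefs = {"alice": "voice", "bob": "gesture"}
print(handle_event({"user": "alice", "modality": "voice", "text": "turn on the lamp"}, prefs))
print(handle_event({"user": "bob", "modality": "gesture", "target": "lamp"}, prefs))
```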

The image shows how the system combines the shape and color of objects recognized by the RGB sensor with the distances measured by the depth sensor. This allows objects to be recognized more accurately and their positions to be identified precisely.