Expressive and multimodal speech synthesis

Multimodal speech synthesis, or audio-visual speech synthesis, deals with the automatic generation of voice and facial animation from arbitrary text, with the facial animation lip-synced to the generated audio. Applications range from research on human communication and perception, to tools for the hearing impaired, to spoken and multimodal interfaces in human-computer interaction. Seeing the face can significantly improve the intelligibility of both natural and synthetic speech, especially under degraded acoustic conditions. Moreover, facial expressions can signal emotion, add emphasis to speech, and support interaction in a dialogue.
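As a rough illustration of what lip-syncing facial animation to generated audio involves, the sketch below turns a sequence of phonemes with durations into a time-stamped track of mouth shapes (visemes). It assumes that the synthesizer exposes phoneme durations; the phoneme-to-viseme table, the viseme names, and the functions viseme_track and viseme_at are hypothetical and far simpler than what a real system would use.

```python
from dataclasses import dataclass

# Hypothetical, heavily simplified phoneme-to-viseme table (illustration only).
PHONEME_TO_VISEME = {
    "p": "closed", "b": "closed", "m": "closed",
    "f": "lip_teeth", "v": "lip_teeth",
    "a": "open", "o": "rounded", "u": "rounded",
    "sil": "rest",
}


@dataclass
class VisemeKey:
    start: float   # seconds from the start of the audio
    end: float     # seconds from the start of the audio
    viseme: str


def viseme_track(phonemes):
    """Turn (phoneme, duration_in_seconds) pairs into time-stamped viseme keys.

    Consecutive phonemes that share a viseme are merged into a single key.
    Phonemes missing from the table fall back to the "rest" viseme.
    """
    keys = []
    t = 0.0
    for phoneme, duration in phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "rest")
        if keys and keys[-1].viseme == viseme:
            keys[-1].end = t + duration      # extend the previous key
        else:
            keys.append(VisemeKey(t, t + duration, viseme))
        t += duration
    return keys


def viseme_at(keys, time):
    """Return the viseme active at a given audio time ("rest" outside the track)."""
    for key in keys:
        if key.start <= time < key.end:
            return key.viseme
    return "rest"
```

Because the keys share the audio time axis, a renderer can query viseme_at with the current playback time on every frame. A production system would also interpolate between mouth shapes and model coarticulation rather than switching abruptly.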

The face plays a critical role in interpersonal communication; expressiveness (or the lack of it) provides strong cues about what a person is thinking or feeling. This has always been a problem for virtual characters. The earliest virtual characters lacked emotional visual cues, so users had to rely on their own interpretation to build a connection. Now that human-computer interaction is moving toward an increasingly realistic visual style, conveying emotion matters more than ever. The eyes and the lips are two of the main conduits of emotion. Lip thickness, lip press, and lip motion synchronized with what is being said are key elements of a believable virtual character. Virtual characters with dead eyes or puppet-like mouths are no longer acceptable.

Realistic facial animation can only be achieved by resorting to real face motion capture data: the subtleties and nuances of facial expressions are too complex to model analytically or specify manually. Recent progress in face motion capture technologies and video analysis algorithms has enabled high-quality solutions that can track even the finest details of the human face, whether from a single-camera video stream or from depth data produced by a low-budget sensor. Sophisticated retargeting algorithms can then analyze the captured performance and map it onto the face of a high-quality, high-polygon, rigged 3D character model, even in real time. The outcome can be remarkable: virtual characters that emote in believable ways, taking human-computer interaction to a different level.
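One common and very simple way to picture retargeting is through blendshape weights: if the tracker outputs per-frame expression weights and the character rig has blendshapes with matching names, the captured weights can drive the character directly. The sketch below assumes exactly that setup. The function name, the blendshape names, and the tiny two-vertex mesh are hypothetical, and real retargeting algorithms are considerably more sophisticated.

```python
def apply_blendshapes(neutral, deltas, weights):
    """Deform a neutral mesh with weighted blendshape offsets.

    neutral: list of (x, y, z) vertex positions of the character at rest.
    deltas:  dict mapping a blendshape name to per-vertex (dx, dy, dz) offsets.
    weights: dict mapping a blendshape name to a captured weight for one frame.

    Weights are clamped to [0, 1]; captured shapes that the character rig
    does not provide are ignored.
    """
    result = [list(vertex) for vertex in neutral]
    for name, weight in weights.items():
        if name not in deltas:
            continue                      # rig has no such shape: skip it
        weight = min(1.0, max(0.0, weight))
        for i, (dx, dy, dz) in enumerate(deltas[name]):
            result[i][0] += weight * dx
            result[i][1] += weight * dy
            result[i][2] += weight * dz
    return [tuple(vertex) for vertex in result]


# Hypothetical two-vertex "character" with two blendshapes.
neutral_mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
character_shapes = {
    "jaw_open": [(0.0, -1.0, 0.0), (0.0, -0.5, 0.0)],
    "smile": [(0.2, 0.1, 0.0), (0.2, 0.1, 0.0)],
}

# One captured frame; "brow_raise" is not on this rig and is skipped.
captured_frame = {"jaw_open": 0.5, "brow_raise": 1.0}
posed_mesh = apply_blendshapes(neutral_mesh, character_shapes, captured_frame)
```

Applying this function once per captured frame yields an animation that follows the performance. The choice of a shared, named set of blendshapes is what lets one captured performance drive differently shaped characters in this simplified picture.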

The possibility of synchronously capturing speech and high-quality facial data from real speakers, and then exploiting these data to develop multimodal speech synthesizers by augmenting traditional speech synthesis algorithms, presents a challenging and highly promising area of research.