DECface: An Automatic Lip-Synchronization Algorithm for Synthetic Faces

Keith Waters and Thomas Levergood

Digital Equipment Corporation

Cambridge Research Laboratory
Technical report CRL 93/4

Figure 1 Partial viseme table

Abstract

This report addresses the problem of automatically synchronizing computer-generated faces with synthetic speech. The complete process is called DECface, which provides a novel form of face-to-face communication and the ability to create a new range of talking, personable synthetic characters. Based on plain ASCII text input, a synthetic speech segment is generated and synchronized in real time to a graphical display of an articulating mouth and face. The key component of DECface is the run-time facility that adaptively synchronizes the graphical display of the face to the audio.

Figure 2 The full DECtalk phoneme to viseme mapping table

Introduction

In English, lip-reading is based on the observation of forty-five phonemes and associated visemes [Wal82]. Traditionally, lip-reading has been considered to be a completely visual process developed by the small number of people who are completely deaf. There are, however, three mechanisms employed in visual speech perception: auditory, visual, and audio-visual. Those with hearing impairment concern themselves with the audio-visual, placing emphasis on observing the context in which words are spoken, such as posture and facial expression. Speech comprises a mixture of audio frequencies, and every speech sound belongs to one of the two main classes known as vowels and consonants. Vowels and consonants belong to basic linguistic units known as phonemes, which can be mapped into visible mouth shapes known as visemes. Each vowel has a distinctive mouth shape, and viseme groups such as {p,m,b} and {f,v} can be reliably observed like the vowels, although confusion among individual consonants within each viseme group is more common [McG85]. Despite the low threshold between understanding and misunderstanding, the discrete phonetics provide a useful abstraction, because they group together speech sounds that have common acoustic or articulatory features. We use phonemes and visemes as the basic units of visible articulatory mouth shapes.
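The many-to-one relationship between phonemes and visemes described above can be illustrated with a minimal Python sketch. It is not the DECtalk mapping table from Figure 2: only the {p,m,b} and {f,v} groups come from the text, while the group labels, the `neutral` fallback, and the function names are assumptions made for illustration.

```python
# Illustrative phoneme-to-viseme grouping. Only the {p,m,b} and {f,v}
# groups are taken from the text; the labels are hypothetical names,
# not entries from the DECtalk table.
VISEME_GROUPS = {
    "bilabial": {"p", "m", "b"},
    "labiodental": {"f", "v"},
}

# Invert the grouping so each phoneme looks up its single viseme.
PHONEME_TO_VISEME = {
    phoneme: viseme
    for viseme, group in VISEME_GROUPS.items()
    for phoneme in group
}


def to_visemes(phonemes, default="neutral"):
    """Map a phoneme sequence to viseme labels, using `default`
    (an assumed placeholder) for phonemes missing from the table."""
    return [PHONEME_TO_VISEME.get(p, default) for p in phonemes]


def visually_confusable(a, b):
    """True when two known phonemes share a viseme, i.e. they look
    alike on the lips even though they sound different."""
    viseme = PHONEME_TO_VISEME.get(a)
    return viseme is not None and viseme == PHONEME_TO_VISEME.get(b)
```

Under these assumptions, `visually_confusable("p", "b")` is true while `visually_confusable("p", "f")` is false, which mirrors the observation that consonants are confused within a viseme group more often than across groups.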

Figure 3 The wireframe of DECface and an interactive variation

References
  1. Waters, K. and Levergood, T. An Automatic Lip-Synchronization Algorithm for Synthetic Faces, ACM Multimedia, San Francisco, pages 149-156, October 1994
  2. US Patent: 5,657,426, Method and Apparatus for Producing Audio-Visual Synthetic Speech, Waters, K. and Levergood, T. Issued: August 12th 1997
  3. US Patent: 5,884,267, Automated Speech Alignment for Continuous Natural Speech, Goldenthal, B., Van Thong, J. and Waters, K. Issued: March 16th 1999