143rd ASA Meeting, Pittsburgh, June 3-7, 2002
Session on Naturalness in Synthesized Speech and Music
Session organizer: Dr Sten Ternström, KTH, Stockholm, www.speech.kth.se
This session is hosted jointly by the ASA committees for Musical Acoustics and Speech Communication. It will be scheduled on June 3 or June 4. Your submission is welcome!
Abstracts are due by February 1st, on-line submission at http://asa.aip.org
Professor Ingo Titze, Denver, U.S.
Professor Hideki Kawahara, Wakayama, Japan
Professor Roger B Dannenberg, Pittsburgh, U.S.
Professor Xavier Serra, Barcelona, Spain
What does "naturalness" mean? In our experience with synthesized sounds, be they intended as spoken or as musical, all of us have found many syntheses to be inadequate or unconvincing. We can hear that something is wrong, either with the sound source (voice or instrument), or with the player/speaker, or with the way the message has been verbally or musically encoded. We tend to lump these very different shortcomings together under the common heading of "unnatural" or "lacking in naturalness". Do they have something in common, or must we deal with them one by one?
The layered transport model. Both music and speech can be envisaged as forms of communication that use layered transport protocols. As a very loose example:
-- Message (the story to be told)
-- Script (words, musical phrases)
-- Symbols (phonemes, notes)
-- Gestures (forces and velocities)
-- Converters (voice, musical instruments, synthesizers, ears)
-- Physical signals (sound waves)
Almost any non-trivial communication would be based on such multi-level protocols. The physical representation of the message varies with the protocol level, and must be interpreted on the appropriate level, and in the appropriate context (society, language, musical genre, etc). In technical communication systems, such as computer networks, protocol levels are strictly isolated from one another and must communicate only through carefully defined interfaces, which are laboriously spelled out by standards committees. In human communication, however, a protocol appears to be loosely arrived at through context and experience. Furthermore, our brain has access to and control of several levels in this communication simultaneously. For example, our appreciation of music can be heightened by superior acoustics, a fine instrument, a competent player, an inspired composer, healthy ears, and extensive experience as listeners. When many of these factors reach a high standard, great experiences can be had. As listeners, we can also choose to ignore certain shortcomings of the transport layers. A natural voice can continue to sound natural in origin, even though the transmitted voice signal has been subjected to gross distortion. This observation seems to indicate that naturalness can be cued on some cognitive level as well as on a lower perceptual level. When we try to discuss naturalness, perhaps it might help to treat each level separately.
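Purely as an illustration of this layered view, the levels listed above can be written down as a trivial data structure. This is only a sketch for discussion; the layer names follow the list above, and everything else (the Python representation, the example message) is invented for the occasion.

  # Illustrative only: the protocol layers from the list above, top to bottom.
  LAYERS = [
      ("message",    "the story to be told"),
      ("script",     "words, musical phrases"),
      ("symbols",    "phonemes, notes"),
      ("gestures",   "forces and velocities"),
      ("converters", "voice, musical instruments, synthesizers, ears"),
      ("signal",     "sound waves"),
  ]

  def trace_down(message: str) -> None:
      """Enumerate the levels a message would pass through on its way to sound.

      A real system would re-encode the content at each level; here we only
      list the levels, to make the 'layered transport' picture concrete.
      """
      for name, description in LAYERS:
          print(f"{name:>10}  ({description})  carrying: {message!r}")

  trace_down("a short greeting, or a short melodic phrase")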
The source-filter modelling approach that was prevalent in speech synthesis from the 1960's to the 1980's seemed to reach an impasse in the 1990's. As workers in source-filter synthesis tried to get rid of a buzzy, mechanical voice quality, and also to achieve natural-sounding transitions between segments, many concluded that the challenge of mapping the human voice organ and its control systems onto a source/formant space in sufficient detail would be overwhelming. They turned instead to concatenative techniques using prerecorded segments and transitions, because the result "sounds more natural"; and they were more successful, at least in commercial terms. This means that we must concede that even tiny segments of the recorded natural voice have desirable properties that we do not fully understand, but which are prized by listeners. In the meantime, a few brave workers have laboured on with physically based models of phonation and articulation, but it appears to be a long haul.
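For concreteness, here is a minimal sketch of the source-filter idea in Python/NumPy: an idealized pulse-train source passed through a cascade of second-order formant resonators, in the Klatt style. The formant frequencies, bandwidths and F0 below are round illustrative values, not taken from any particular synthesizer; the "buzzy, mechanical" quality mentioned above is easy to hear in exactly this kind of signal, since the source has no spectral tilt, aspiration noise, or cycle-to-cycle variation.

  import numpy as np

  def resonator(x, freq, bw, fs):
      """Second-order digital resonator (Klatt-style formant filter)."""
      r = np.exp(-np.pi * bw / fs)
      c = -r * r
      b = 2.0 * r * np.cos(2.0 * np.pi * freq / fs)
      a = 1.0 - b - c                     # normalizes gain to 1 at 0 Hz
      y = np.zeros_like(x)
      y1 = y2 = 0.0
      for n, xn in enumerate(x):
          yn = a * xn + b * y1 + c * y2
          y[n] = yn
          y2, y1 = y1, yn
      return y

  def vowel(f0=110.0, dur=0.5, fs=16000,
            formants=((700, 130), (1220, 70), (2600, 160))):
      """Very rough /a/-like vowel: pulse-train source -> cascade of formants."""
      n = int(dur * fs)
      source = np.zeros(n)
      source[::int(round(fs / f0))] = 1.0   # idealized glottal pulses
      out = source
      for freq, bw in formants:             # cascade the formant resonators
          out = resonator(out, freq, bw, fs)
      return out / np.max(np.abs(out))      # normalize to +/-1 for playback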
In the music technology field, a very similar path has been taken, although it does not manifest itself quite as clearly. In spite of what the marketers might like to tell you, the technological evolution of music synthesis has taken only a few really major steps since the 1960's. The first was when analog electronic synthesizers of the Robert Moog period came within the reach of musicians. The second was when all-digital techniques made mass-marketing possible on a larger scale, heralded by Yamaha's successful FM synthesizer in 1983, which was based on John Chowning's work. The industry now perceived that a major selling point for a synthesizer was that it sounded like the real thing. The quest for tonal fidelity led to step three: the era of the sampler, which is little more than a sophisticated waveform memory and not really a synthesizer at all. This technique entirely dominates the market at the moment, and it is roughly analogous to the concatenative synthesis of speech. A sampler sounds reasonable in many applications, especially in accompaniment, where detailed expressiveness is less crucial than in solos. However, it relies on sonic attributes that are "canned" but not well explained. Developers are now busy with the fourth step: real-time physical modelling of vibrating mechanical systems. (In parallel, there is also a "retro" trend toward the digital-domain emulation of the early analog synthesizers.)
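Chowning's FM principle, referred to above, fits in a few lines: the instantaneous phase of a carrier sinusoid is modulated by another sinusoid, and the modulation index controls the spectral richness. The sketch below shows only the bare two-operator case, with guessed envelope and index values; the commercial instruments combined several such operators with carefully programmed envelopes.

  import numpy as np

  def fm_tone(fc=440.0, ratio=2.0, index=3.0, dur=1.0, fs=44100):
      """Two-operator FM: env(t) * sin(2*pi*fc*t + I(t) * sin(2*pi*fm*t))."""
      t = np.arange(int(dur * fs)) / fs
      fm = ratio * fc                     # modulator frequency
      env = np.exp(-3.0 * t)              # simple percussive amplitude envelope
      idx = index * np.exp(-3.0 * t)      # decaying index: spectrum grows purer
      return env * np.sin(2 * np.pi * fc * t + idx * np.sin(2 * np.pi * fm * t))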
It might seem very sophisticated to synthesize e.g. a guitar by simulating the physical vibrations of the entire guitar. It is almost trivial that if we succeed in performing a sufficiently detailed simulation, it must sound like the real thing. However, just as with sampling, this does not tell us anything about which of the sonic properties really matter for naturalness. Since our two eardrums have only a few degrees of freedom, it somehow seems like overkill to simulate thousands of vibrating points in order to produce one or two audio signals. Or does it? Is naturalness perhaps represented on such a high level of abstraction that it has little to do with the sound itself?
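There is a middle ground between full physical simulation and sampling that illustrates the point: even a drastically reduced physical model can sound convincingly string-like. The Karplus-Strong plucked-string algorithm, sketched below with guessed parameter values, is essentially one delay line with a lowpass filter in its feedback loop, yet it captures much of what makes a plucked string recognizable. Which of its properties do the perceptual work remains, as argued above, an open question.

  import numpy as np

  def karplus_strong(f0=196.0, dur=2.0, fs=44100, decay=0.996):
      """Plucked string: noise burst in a delay line with lowpass feedback."""
      n = int(dur * fs)
      delay = int(round(fs / f0))                 # delay length sets the pitch
      buf = np.random.uniform(-1.0, 1.0, delay)   # 'pluck' = random initial shape
      out = np.zeros(n)
      for i in range(n):
          out[i] = buf[i % delay]
          # two-point average acts as a lowpass: high partials decay faster,
          # as they do on a real string
          buf[i % delay] = decay * 0.5 * (buf[i % delay] + buf[(i + 1) % delay])
      return out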
Repetition is unnatural, and easily perceived. The physical world never exactly repeats itself. Even if a marimba player strikes exactly the same point on the bar with exactly the same velocity vector, she will most probably strike the bar in a different phase relationship to the previous excitation each time. Unfortunately for synthesis, our sense of hearing has an uncanny ability to detect repetition in sounds, from repeated individual vibratory cycles all the way up to verses of a pop song or movements of a symphony. Auditory adaptation unfailingly prevents repeating sounds from catching our attention; we are first bored with the "sameness" and then perhaps ignore the sound completely. This is a serious difficulty, e.g. for music synthesis using sampled waveforms. Adding various forms of randomness in order to avoid strict repetition is often tried but rarely succeeds. If we knew the details of our perceptual mechanisms better, would we be in a better position to devise synthesis methods that retain the listener's interest? Should we instil in the synthesis some generic randomness and/or noise? And is there any recipe for which perceptual entities must fluctuate? By how much, and in what way?
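As an example of the kind of "generic randomness" that is often tried, the sketch below re-triggers a single recorded sample with a small random detune and level change on every strike, so that no two strikes are bit-identical. The jitter amounts and the linear-interpolation resampling are arbitrary choices made only for illustration; whether such uncorrelated jitter is the perceptually right kind of variation is precisely the question raised above.

  import numpy as np

  def retrigger(sample, fs, onset_times, pitch_jitter=0.3, amp_jitter=1.0, seed=None):
      """Re-trigger one recorded sample at the given onset times (seconds),
      detuning it by up to +/- pitch_jitter percent and changing its level by
      up to +/- amp_jitter dB on every strike, so that no strike repeats exactly.
      """
      rng = np.random.default_rng(seed)
      out = np.zeros(int((max(onset_times) + 1.1 * len(sample) / fs) * fs) + 1)
      for t in onset_times:
          detune = 1.0 + rng.uniform(-pitch_jitter, pitch_jitter) / 100.0
          gain = 10.0 ** (rng.uniform(-amp_jitter, amp_jitter) / 20.0)
          # crude variable-rate playback via linear interpolation
          idx = np.arange(0.0, len(sample) - 1, detune)
          strike = gain * np.interp(idx, np.arange(len(sample)), sample)
          start = int(round(t * fs))
          out[start:start + len(strike)] += strike
      return out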
Player/speaker control, and feedback. The human motor control system has certain properties that, when manifested in sound, strongly cue our perception of "live" control. Voice flutter in F0 is one example; our way of realizing gestures is another. Even a single sine wave can be expressive to some degree if it is expertly controlled in amplitude and frequency. Although the sine wave itself is not very natural, such control gestures can represent something that is recognizably natural. In speech terms, one might correspondingly say that the articulatory motions and prosody must be acceptably inferable by the listener.
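The expressive sine wave can be made concrete with a sketch like the one below: a single sinusoid driven by a slow amplitude gesture, a vibrato that fades in after about half a second, and a small amount of slowly varying random F0 flutter. All rates and depths are guesses intended only to sound plausibly "live"; they are not measured values.

  import numpy as np

  def expressive_sine(f0=440.0, dur=2.5, fs=44100,
                      vib_rate=5.5, vib_depth=0.006, flutter_depth=0.002, seed=None):
      """One sinusoid with an amplitude gesture, delayed vibrato, and F0 flutter."""
      n = int(dur * fs)
      t = np.arange(n) / fs
      # amplitude gesture: swell in over 0.4 s, then decay slowly
      env = np.minimum(t / 0.4, 1.0) * np.exp(-1.2 * np.maximum(t - 0.4, 0.0))
      # vibrato that fades in after about half a second, as players tend to do
      vib = vib_depth * np.sin(2 * np.pi * vib_rate * t) \
            * np.clip((t - 0.5) / 0.5, 0.0, 1.0)
      # flutter: slowly varying random F0 deviation of a few tenths of a percent
      rng = np.random.default_rng(seed)
      flutter = np.convolve(rng.standard_normal(n), np.ones(2205) / 2205, mode="same")
      flutter *= flutter_depth / (np.max(np.abs(flutter)) + 1e-12)
      freq = f0 * (1.0 + vib + flutter)               # instantaneous frequency
      phase = 2 * np.pi * np.cumsum(freq) / fs        # integrate to get phase
      return env * np.sin(phase)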
Perhaps speech and music synthesis diverge somewhat at this point: a speech synthesizer should not violate our preconceptions of natural speech gestures, but an electronic musical instrument could or even should rely on its human player to provide the gestures; unless we are in the business of trying to replace musicians, or of composing directly for a distributable medium.
How are gestures relevant to naturalness? Is it conceivable that proprioceptive and/or auditory feedback in the player/speaker are pivotal components of "liveness"? Should we therefore build synthesizers that "listen" to themselves?
Is naturalness important? I asked a few composers how they feel about natural sounds, and their response was that naturalness in itself is not a primary objective. In the music synthesis domain, naturalness takes second place to expressiveness; it is more important to composers and musicians that an instrument be adequately expressive than that its sound be in some sense natural, although there seems to be some overlap between these terms. In most applications of speech synthesis, intelligibility is more important than naturalness, but again, it seems plausible that naturalness will promote intelligibility. It certainly promotes public acceptance of synthetic voices.
Realism. Fortunately, if somewhat surprisingly, the issue of "naturalness" seems to be quite distinct from the issue of "realism", in the sense of trying to mimic the actual presence of a speaker or an instrument in the room. This latter topic is conveniently deferred to the discipline of virtual acoustics.
Digital technology, to which we are de facto committed in this age, is in itself entirely unnatural and inherently repetitive. But for us as researchers, this is a good thing. We will get nothing for free; everything that we accomplish in the digital domain must be completely specified from beginning to end. That may be cumbersome, but it may also be good news for building new knowledge.
Please regard the above only as inspiration, and feel free to contribute any approach to, or aspect of, natural-sounding synthesis that you wish to discuss.
Sten Ternström, session organizer