SONATA

Project Description


1  Abstract

The aim of this project is to contribute to the understanding of which aspects of sound are important for perceived naturalness in speech and music by supplying metrics that quantify naturalness. This multidisciplinary research will combine psychoacoustic experiments with sound synthesis and signal processing. Metrics for perceived naturalness are envisaged to contribute to the development of natural speech synthesis, and it is hoped that other fields, such as telecommunication, digital musical instruments, and hearing-aid technology, will profit from this knowledge as well.

2  Motivation

With the evolution of digital audio technology and communication, old analog products such as the record player, the radio, and the telephone are being replaced by better, faster, and smaller digital versions. At the same time, new products and services become realisable and gradually accessible to everyone: the pocket-size mobile telephone, music downloads from the Internet, automatic telephone services, talking patient journals in hospitals, and individually adapted hearing aids for the hearing-impaired, to mention a few applications.

In communication systems, increasing transmission rates enable more realistic communication, for instance by adding live images to simulate the presence of the other participants, or by spatialisation of the sound, i.e. positioning each sound source (speaker) in the room. When simultaneity and the minimisation of delays are also addressed, this should increase the comprehensibility of the communication and make the situation comfortable for the user. However, with the fast evolution in this field, there should also be room for increasing the sound quality even further, just as sound quality is a central issue in hifi systems. Naturalness is in many cases an important factor in perceived sound quality, and it increases the quality of the communication because human attributes like expression and feelings are better transmitted. Just as sound quality can be retained through intelligent compression when the communication lines are slow, the attributes of naturalness should be formulated and quantified so that loss of naturalness can be systematically avoided in such compression processes.

In speech synthesis, the situation is similar: merely making synthetic speech comprehensible has been a great challenge. Until recently, probably the most common way to produce natural speech has been to replay recorded phrases, as is common in automatic telephone services. This method offers little flexibility but a high degree of naturalness. The market is enormous (automatic telephone services, individual road information in cars, portable patient journals at hospitals, and auralisation of books, for instance), and large efforts are therefore devoted to this problem. Unit selection synthesis appears to be the state of the art at present. This method produces spoken sentences with a high degree of naturalness, as long as the database of prerecorded and segmented natural speech is large enough to contain suitable segments for the desired sentence. If this is not the case, segments must be modified to fit, and this often results in loss of naturalness.

Naturalness is also a very important attribute in musical applications, such as synthesis of musical instruments (of which the keyboard is the best known) and performance (orchestration systems, automatic accompaniment, etc.). There is therefore already a strong focus on naturalness in this field, a resource that will be exploited in the present project.

In order to measure or estimate perceived naturalness, a set of metrics should be established. There is to my knowledge a lack of such metrics, but they are nonetheless important for optimisation of systems with respect to naturalness, for example for compression of audio information for transmission or for real-time synthesis of speech while retaining the information necessary for high naturalness (and of course also high intelligibility etc.).

Beyond the systems above, good metrics for naturalness may, for instance, help improve hearing aids and other ear terminals for communication. They may also be a useful tool for optimising the acoustics of rooms designed for listening, from small hifi rooms and video conference rooms to concert halls and cinemas.

A fundamental study of naturalness in audio is of great interest and not surprisingly encouraged by all the research groups contacted for collaboration with this project.

3  Project limits and aims

The overall aim of this project is to contribute to the understanding of which aspects of sound are important for perceived naturalness in speech and music, and to present a set of metrics to quantify naturalness.

Naturalness is a vast subject. Even limiting ourselves to human speech and music, lack of naturalness may manifest itself at all levels: from phrases without meaning, expression, or intention at the top level, via wrong or abnormal grammar/harmony or prosody/melody, down to a metallic voice, robotic rhythm, or other voice defects at the bottom level. Many of the factors that contribute to naturalness are stochastic in nature, and these manifest themselves particularly at the lower levels. We thus limit this study to the lower-level aspects of naturalness, although these cannot be completely disconnected from the higher levels.

Metrics for perceived naturalness should enable us to quantify, in terms of signal parameters, what makes speech or music sound static, flat, robotic, or metallic, for example. In order to establish such metrics, the following intermediate aims are proposed:

  • point out synergy effects when research in speech technology and in music acoustics is combined
  • establish thresholds for when the brain recognises repeating wave patterns
  • identify artifacts that cause loss of naturalness due to concatenation and manipulation of prerecorded speech elements

Two international publications are expected as the outcome of the project: the first about repetition thresholds, and the second giving an overview of related literature and a proposal for metrics describing lower-level aspects of naturalness in speech and music. Intermediate presentations at conferences are also encouraged, not least because of the valuable feedback they may give, and a simple demonstrator will be made. Throughout the project, a web site will be maintained with information about progress and results.

4  Background

4.1  Speech synthesis

In a historical perspective [1], speech synthesis was attacked in the 1960s with a source-filter approach. This involved physically based models of the human voice organs, from the vibration of the vocal cords to the filtering by the resonances of the mouth and vocal tract. In the 1990s, research instead concentrated on concatenation of prerecorded segments of phonemes and their transitions, i.e. methods based on signal processing rather than physical models.
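The source-filter idea can be sketched as a glottal pulse train passed through a cascade of resonant filters; a minimal illustration, where the function names, formant frequencies, and bandwidths are made up for the example and not taken from any cited system:

```python
import numpy as np

def resonator(x, freq, bandwidth, sr):
    """Second-order IIR resonance: a digital model of one formant
    (a vocal-tract resonance) in source-filter synthesis."""
    r = np.exp(-np.pi * bandwidth / sr)
    theta = 2.0 * np.pi * freq / sr
    a1, a2 = -2.0 * r * np.cos(theta), r * r
    b0 = 1.0 - r                      # rough gain normalisation
    y = np.zeros(len(x))
    y1 = y2 = 0.0                     # filter state
    for n in range(len(x)):
        y[n] = b0 * x[n] - a1 * y1 - a2 * y2
        y1, y2 = y[n], y1
    return y

def crude_vowel(f0, formants, duration, sr):
    """Glottal source modelled as an impulse train at the pitch
    period, filtered through cascaded formant resonators."""
    n = int(duration * sr)
    source = np.zeros(n)
    source[:: int(sr / f0)] = 1.0     # one impulse per pitch period
    out = source
    for freq, bw in formants:
        out = resonator(out, freq, bw, sr)
    return out
```

For example, crude_vowel(110, [(700, 90), (1200, 110), (2600, 160)], 0.5, 16000) gives a buzzy, /a/-like vowel; the static source is of course far from a natural voice, which is precisely the kind of deficiency this project addresses.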

The most modern technique in speech synthesis is automatic unit selection synthesis [2]. This method uses a database of short or long segments (or units) of natural speech and joins suitable units together to fit the phonetic transcription to be converted to speech. Because it largely avoids computer manipulation of the waveforms, the method gives close to natural speech quality; it is the state of the art in speech synthesis and already forms the basis of several speech-synthesis products.

The FONEMA project [3] aims to automate the database generation and adapt the method to Norwegian speech. How to choose which units to join is also an important issue. This is both a question of technical similarity on both sides of the joint, such as continuity of the pitch, the spectrum, and the phase of the signal, and a matter of what sounds good, i.e. the naturalness. Although naturalness in speech synthesis is an underlying aim of the FONEMA project, the question of naturalness will not be addressed there directly. This is one of the reasons for the proposed project on naturalness, and a close collaboration is planned.
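The join criteria mentioned here can be illustrated as a simple cost function. The sketch below assumes a hypothetical feature layout (boundary pitch and a spectral-envelope vector per unit) and is not the actual FONEMA selection criterion:

```python
import math

def join_cost(unit_a, unit_b, w_pitch=1.0, w_spec=1.0):
    """Concatenation cost between the end of unit_a and the start of
    unit_b: a pitch-discontinuity term (in octaves, i.e. perceptually
    scaled) plus a spectral-envelope distance at the joint."""
    d_pitch = abs(math.log2(unit_a["end_f0"] / unit_b["start_f0"]))
    d_spec = math.dist(unit_a["end_spec"], unit_b["start_spec"])
    return w_pitch * d_pitch + w_spec * d_spec

def best_next_unit(current, candidates):
    """Greedy selection: pick the candidate that joins most smoothly
    onto the current unit. (Real systems optimise the unit sequence
    over the whole sentence, e.g. with dynamic programming.)"""
    return min(candidates, key=lambda c: join_cost(current, c))
```

A zero cost means identical pitch and envelope at the joint; the open question for this project is how such technical distances relate to perceived naturalness.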

4.2  Music acoustics

In the music acoustics field, we see a similar evolution in which three important advances may be recognised [1]: the analog electronic synthesiser, based for instance on subtractive synthesis; the frequency-modulation synthesiser, where timbre control was extended by cyclic variations of frequency and amplitude; and finally the sampler, which in simple terms replays prerecorded sounds. While the naturalness of the resulting sounds is seldom convincing, the same sounds may be accepted as natural when they are controlled by a human musician: a pianist on a keyboard (which uses sampling techniques) will add her own interpretation and expression to give a natural result. The unnaturalness of each sampled tone will to some degree be masked by the fact that the tones are controlled by a human.

One way to get around the problem of naturalness in musical sounds is to turn the listener's attention away from these defects. If the pianist accompanies a singer, for instance, our attention is drawn towards the soloist, so less is demanded of the pianist. Indeed, automatic accompaniment systems have been proposed, although they cannot yet serve as more than rehearsal tools for the soloist. It should be mentioned that efforts are also currently being put into understanding how to make computers play a piece from a piano score in a natural way, i.e. performology, primarily as an aid for the composer, but equally useful for getting an impression of one of the many musical pieces that no one living today has heard.

With the evolution of the digital computer, real-time control of the sound is now becoming possible. Research has therefore started to turn towards real-time simulation, i.e. synthesis during performance, so that human control parameters may be taken into account in the sound production [4]. This may be based on physical modelling, often combined with models resulting from analysis of real sounds [5]. As we are concerned with speech, the singing-voice models are of particular interest, for instance the early rule-based formant synthesisers "Musse" at KTH in Sweden [6] and "Chant" at IRCAM in France [7].

Finally, an important aspect of naturalness in music is the nuances of timbre (or tone colour), the perceptible spectral attribute that distinguishes two sounds of the same loudness and pitch. Timbre modelling has become a field in itself, and sets of metrics for timbre already exist [5,8]. There is less focus on timbre in speech processing, largely because the character of vocal sounds is dominated by the formants (spectral peaks) of the vowels, and the formants move significantly during speech. Perhaps timbre could instead be related to voice properties, just as each musical instrument has its own timbre. Nevertheless, advances in timbre research in music (e.g. [9]) may give significant contributions to the naturalness of speech.
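One of the simplest timbre metrics of this kind is the spectral centroid, a widely used correlate of perceived brightness; a minimal sketch:

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Amplitude-weighted mean frequency of the magnitude spectrum.
    A higher centroid generally corresponds to a brighter timbre."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))
```

A pure 440 Hz tone has its centroid at 440 Hz, while adding strong upper harmonics raises it; metrics for naturalness would need to capture how such quantities fluctuate over time, not just their static values.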

4.3  Psychoacoustics and neuroscience

Psychoacoustics is the field concerned with how acoustic stimuli are perceived [10,11]. For instance, for sounds at different pitches to be perceived as equally loud, the pressure amplitude of a sound at 100 Hz must be over ten times higher than that of a 1000 Hz sound. Psychoacoustic measurements can be performed by presenting two sounds and asking the subject to choose the one having more of a certain attribute (the two-alternative forced-choice method, 2AFC). Another common way is to present a single sound and ask the subject whether or not it contains a certain attribute.
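The 2AFC procedure can be illustrated with a simulated listener. The logistic psychometric function below is a standard modelling assumption, not a result of this project, and the level and threshold units are arbitrary:

```python
import math
import random

def p_correct_2afc(level, threshold, slope=1.0):
    """Psychometric function for a 2AFC task: the probability of a
    correct response rises from chance (0.5) to 1.0 as the stimulus
    level exceeds the listener's threshold."""
    return 0.5 + 0.5 / (1.0 + math.exp(-slope * (level - threshold)))

def constant_stimuli_2afc(levels, threshold, trials=200, seed=1):
    """Simulate a constant-stimuli 2AFC experiment: at each level,
    count how often a simulated listener picks the correct interval."""
    rng = random.Random(seed)
    return {lv: sum(rng.random() < p_correct_2afc(lv, threshold)
                    for _ in range(trials)) / trials
            for lv in levels}
```

Fitting such a function to the measured proportions gives the threshold; at the threshold itself the expected proportion correct is 75%, midway between chance and certainty.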

Cognitive neuroscience is concerned with the relationship between brain and thought. For this project we are mainly interested in using the event-related potential (ERP) method, which allows us to study perceptual and cognitive processing with an excellent temporal resolution, of the order of milliseconds [12]. This method is rather easy to use: electrodes are attached to the scalp, and the changes in the brain's electrical activity (evoked potentials) are recorded and time-locked to the event of interest, for instance the start of a sound. Because the evoked potentials are of small amplitude compared to the ongoing electroencephalogram (EEG), a number of events belonging to the same category need to be averaged together to improve the signal-to-noise ratio. This method can easily be combined with psychoacoustic tests of the perception of a sound (e.g. [13]).
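The averaging step can be sketched as follows; a minimal illustration with synthetic data (the waveform and noise are made up, not real EEG):

```python
import numpy as np

def average_epochs(eeg, event_samples, pre, post):
    """Cut epochs time-locked to each event and average them. With N
    epochs, the background EEG, being uncorrelated with the events,
    averages down by roughly sqrt(N), leaving the evoked potential."""
    epochs = [eeg[s - pre: s + post] for s in event_samples
              if s - pre >= 0 and s + post <= len(eeg)]
    return np.mean(epochs, axis=0)
```

With a small evoked response buried in noise of much larger amplitude, a few hundred time-locked epochs are typically enough for the averaged waveform to resemble the underlying response.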

5  Project planning

Note that what follows are the initial thoughts of how to achieve the goals.

5.1  State of the art

Musical acoustics, with strong relations to psychoacoustics, and text-to-speech synthesis are two large fields, each producing a considerable number of publications per year. Although some publications cross the boundary between the fields, there is a large potential in a combined literature survey of both, limited to publications that concern the "lower-level" aspects of naturalness (see section 3). A systematic study of published research is a good starting point for this project, although it should be limited to 3-4 months. It is expected that studies related to naturalness in music will be more or less directly exploitable in speech technology, and possibly the other way around as well.

5.2  Thresholds for perception of repetition

In nature nothing repeats exactly, and the brain is used to nature's small variations. If a singer holds a long tone without attempting to vary (develop) it, the brain gets bored and concentrates on other things, just as it gets used to the noise from a busy road outside the house. A good singer therefore changes the intensity or adds a vibrato, for instance, to maintain a tension so that the listener's brain continues to process the sound. Now, if a computer "sings" a long tone by continuously repeating a 20 ms wave period taken from a real singer, the brain will first tire of the lack of variation. Then, after a short time, we become aware of the repeating pattern, often even consciously: we recognise that the sound is not natural.

A first aspect to study in this project is thus the time it takes from the moment a repeating-pattern tone is presented until it is rejected as unnatural. This can be achieved by studying one specific component of the evoked potential, the "mismatch negativity" [14], which can mark when the brain detects that a sound is repeating. The stimuli for such experiments should at first be sustained, nonvarying tones from a real singer and static sounds made by repeating one of the wave periods of the real singer. From this experiment we hope to establish a time threshold for the detection of repetition.
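The static stimuli can be generated simply by looping one extracted period; a minimal sketch, where a synthetic period stands in for one cut from a recorded singer:

```python
import numpy as np

def static_tone(period, duration, sample_rate):
    """Build a perfectly repeating ('static') tone by looping a single
    extracted wave period, e.g. 20 ms cut from a sustained sung vowel."""
    n_total = int(duration * sample_rate)
    reps = -(-n_total // len(period))        # ceiling division
    return np.tile(period, reps)[:n_total]
```

Looping one 882-sample period (20 ms at 44.1 kHz) for two seconds yields the exactly repeating stimulus, while the natural reference keeps the singer's original fluctuations; in practice the cut points would also need smoothing to avoid clicks at the joins.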

A second experiment should consider whether, and to what extent, this time threshold depends on the brain already having been exposed to a repeating sound. This will be done using stimuli containing two similar parts, both made from a repeating wave form, but with the first part lasting a little less or a little more than the time threshold found in the first experiment. The brain should thus not detect the repetition in the first part, but give us a time threshold for detection of repetition in the second part. Will this threshold be the same as in the first experiment, or will the brain be more alert and demand changes more and more often?

By combining knowledge obtained using the ERP method and psychoacoustic experiments, we should be able to determine how often a change is needed in the signal in order to make it appear natural. This should contribute to a better understanding of how acoustic stimuli are treated by the brain.

5.3  Concatenation and manipulation of speech segments

In the unit selection method, smooth concatenation of prerecorded speech segments is essential. If the database does not contain units that fit together sufficiently well, signal manipulation is necessary, for instance to change the natural pitch or the duration of a segment. For such modifications, it is common to use signal-processing methods such as the overlap-add methods PSOLA [15] and MBROLA [16], or the harmonic-plus-noise model (HNM) [17], which separates the harmonic and noise parts and manipulates them separately. Comparisons between such methods have already been performed [18,19]; the object of this part of the project is rather to examine why these methods degrade the naturalness of the signal and to what extent they can be used without perceptible degradation. This will be done by recording the same sound twice but with different pitch, for instance. One of the signals is then manipulated to have the same pitch as the other, and loss of naturalness for various types and degrees of modification is studied by psychoacoustic two-alternative forced-choice tests.
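The kind of manipulation under test can be illustrated with a heavily simplified time-domain overlap-add sketch in the spirit of TD-PSOLA; unlike the published algorithms, which first detect pitch marks in the signal, it assumes a constant, known pitch period:

```python
import numpy as np

def pitch_shift_ola(signal, period_in, period_out):
    """Simplified PSOLA-style pitch modification: Hann-windowed,
    two-period grains centred on input pitch marks are overlap-added
    at a new spacing (period_out), changing the perceived pitch while
    roughly preserving the spectral envelope and the duration."""
    win = np.hanning(2 * period_in)
    marks = np.arange(period_in, len(signal) - period_in, period_in)
    out = np.zeros(len(signal))
    t = period_in
    while t < len(signal) - period_in:
        m = marks[np.argmin(np.abs(marks - t))]   # nearest input mark
        grain = signal[m - period_in: m + period_in] * win
        out[t - period_in: t + period_in] += grain
        t += period_out
    return out
```

Re-spacing grains from a 100-sample period to an 80-sample spacing raises the pitch by a factor 100/80 at unchanged duration; it is exactly the windowing, grain repetition, and re-spacing seen here that can introduce the audible artifacts this part of the project will examine.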

During the work on this part of the project, simple programs in Matlab and/or C will be written to process the audio signals. At the end, these will be assembled into a demonstrator program.

References

[1] S. Ternström, Introduction to Call for papers of The ASA meeting in Pittsburgh 2002: http://www.speech.kth.se/~sten/Naturalness_Call.htm

[2] A. J. Hunt and A. W. Black, 1996. "Unit selection in a concatenative speech synthesis system using a large speech database," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP'96), vol. 1, pp. 373-376

[3] FONEMA: http://www.iet.ntnu.no/projects/fonema

[4] J. Tro, "Control and Virtualisation of Musical Instruments," MOSART midterm meeting: http://www.diku.dk/musinf/mosart/midterm/T3/Tro-control.pdf

[5] R. Kronland-Martinet, P. Guillemain, and S. Ystad, "Timbre Modeling and Analysis-Synthesis of Sounds," MOSART midterm meeting: http://www.diku.dk/musinf/mosart/midterm/T2/Kronland-T2-timbre-A4.pdf

[6] J. Sundberg: "Synthesis of Singing by Rule," in Current Directions in Computer Music Research. Ed.: Mathews and Pierce. The MIT Press, 1989, pp. 45-56.

[7] X. Rodet, P. Cointe, J.B. Barrière, and Y. Potard, "The CHANT Project: Modelization and Production," ICMC, Venice, Sep. 1982

[8] H. Järveläinen and G. De Poli, "Overview on timbre perception and modeling," MOSART midterm meeting: http://www.diku.dk/musinf/mosart/midterm/T2/d22b.pdf

[9] K. Jensen: "Timbre Models of Musical Sounds, Ph.D. Dissertation," DIKU Report 99/7, 1999 (http://www.aaue.dk/~krist)

[10] E. Zwicker and H. Fastl. "Psychoacoustics: Facts and Models," Springer Verlag 1991

[11] J. G. Roederer. "The Physics and Psychophysics of Music," 3rd ed., Springer-Verlag, 1994

[12] M. Kutas and C. van Petten, "Event-related brain potential studies of language," In P.K. Ackles, J.R. Jennings, and M.G.H. Coles (Eds.), Advances in Psychophysiology, J.A.I Press, Greenwich, Connecticut, 1988, vol. 3, pp. 139-187

[13] M. Besson, "The musical brain: neural substrates of music perception," Special issue in Neuromusicology, J. New Music Research, vol. 28, 1999, pp. 20-31

[14] R. Näätänen, "Mismatch negativity (MMN): perspectives for application," Int'l J. of Psychophysiology, vol. 37, 2000, pp. 3-10

[15] E. Moulines and F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, vol. 9 (1990), pp. 453-467

[16] The MBROLA project: http://www.tcts.fpms.ac.be/synthesis/mbrola.html

[17] J. Laroche, Y. Stylianou, and E. Moulines. "HNS: Speech modification based on a harmonic + noise model," ICASSP 93, Minneapolis, USA, 1993

[18] T. Dutoit, "High quality text-to-speech synthesis: A comparison of four candidate algorithms," Proc. IEEE Int'l. Conf. Acoust., Speech, Signal Processing, 1994, pp. 565-568

[19] A. Syrdal, Y. Stylianou, L. Garrison, A. Conkie and J. Schroeter. "TD-PSOLA versus Harmonic plus Noise Model in diphone based speech synthesis," ICASSP 98, Seattle, USA, 1998, pp. 273-276

[20] MOSART: http://www.diku.dk/musinf/mosart

[21] S. Ystad, C. Magne, S. Farner, G. Pallone, C. Astesano, V. Pasdeloup, R. Kronland-Martinet, and M. Besson. "Influence of rhythmic, melodic, and semantic violations in language and music on the electrical activity in the brain." Proceedings of the Stockholm Music Acoustics Conference 2003, Stockholm, Sweden, August 2003, pp. 671-674


Home: http://www.pvv.ntnu.no/~farner, E-mail: farner(a)pvv.ntnu.no
Last modified: Mon Apr 18 13:23:53 CEST 2005