Communication is a fundamental human need. For people who acquire a communication impairment, for example because of severe hearing loss, social isolation and loneliness may be the most serious consequence (cf. Nordeng, von Tetzchner and Martinsen, 1985). Speech is the most important form of communication. In addition to face-to-face communication, we have accustomed ourselves to the fact that our voice can be transmitted over nearly unlimited distances using electronic telecommunication systems, and the telephone is by far the most used telecommunication service. Today there are about 800 million telephone subscribers worldwide. Adding up all the other telecommunication services (non-voice services such as data, fax, telex and teletex), there are about 30 million subscribers, that is, less than 4 per cent of those using a telephone. According to several prognoses, the share of non-voice services will grow. However, the prognoses also indicate that this share will remain under 10 per cent (compared to telephony). This also means that the new networks and services (e.g. narrow-band ISDN and wide-band IBCN) will primarily be used for speech transmission.
The use of the standard audio telephone, analogue or digital, depends on the human capacity to produce and hear speech. Nevertheless, the audio telephone is not only used for communication between people. A person can also use the telephone to operate a machine with the help of speech recognition and synthetic speech. Man-machine interaction with speech input and (synthetic) output normally has a very limited and task-oriented goal, but it may still give many people access to new services (audiotex) and make old ones more user-friendly, because the user utilises the basic skills of speech and hearing.
Thus, understanding of speech and hearing is fundamental for understanding telephone communication. This chapter presents the basic processes of speech production and hearing, the coding of speech in the telephone network, and the production and recognition of speech by computers.
Speech is produced by the cooperation of the lungs, the glottis (with the vocal cords) and the articulation tract (mouth and nose cavity). Figure 3.1 shows a cross section of the human speech organ. For the production of voiced sounds, the lungs press air through the glottis; the vocal cords vibrate, interrupt the air stream and produce a quasi-periodic pressure wave. The pressure impulses are commonly called pitch impulses, and the frequency of the pressure signal is the pitch frequency or fundamental frequency. It is the part of the voice signal that defines the speech melody. If we speak with a constant pitch frequency, the speech sounds monotonous, but in normal speech the pitch frequency changes continuously.
The pitch impulses stimulate the air in the mouth and nasal cavity. When the cavities resonate, they radiate a sound wave, which is the speech signal. Both cavities act as resonators with characteristic resonance frequencies, called formant frequencies. These formants are numbered so that the lowest frequency is number one, the second lowest number two, and so on. Since the shape of the mouth cavity can be changed considerably, we are able to pronounce a great many different sounds.
In the case of unvoiced sounds, the voice onset time (VOT) is a little longer, i.e. the vocal cords are open for a longer time, allowing the air stream from the lungs to arrive at the articulation tract directly. Thus the excitation of the vocal tract is more noise-like. This means that the first formant is relatively less intense than the higher ones (second, third etc.).
Speech production may be illustrated by a simple model (Figure 3.2). Here the lungs are replaced by a DC source, the vocal cords by an impulse generator and the articulation tract by a linear filter system. A noise generator takes care of the unvoiced excitation. In practice, all generated sounds have a mixed excitation, i.e. the excitation consists of both voiced and unvoiced portions. In this model, the portions are adjusted by two potentiometers (Fellbaum, 1984).
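The model of Figure 3.2 can be made concrete with a few lines of program code. The following Python sketch is only an illustration of the source-filter principle described above, not an implementation taken from this chapter; the pitch value, the mixing weights of the two "potentiometers" and the filter coefficients are arbitrary assumptions.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                      # sampling frequency in Hz
duration = 0.5                 # half a second of "speech"
pitch = 120                    # assumed fundamental frequency in Hz
n = int(fs * duration)

# Voiced excitation: quasi-periodic impulse train (the "pitch impulses")
voiced = np.zeros(n)
voiced[::fs // pitch] = 1.0

# Unvoiced excitation: white noise from the noise generator
unvoiced = np.random.randn(n)

# The two "potentiometers" of Figure 3.2: mixing weights for the two excitations
g_voiced, g_unvoiced = 0.8, 0.2
excitation = g_voiced * voiced + g_unvoiced * unvoiced

# Articulation tract modelled as a linear (all-pole) filter; the coefficients
# are arbitrary placeholders standing for one vocal-tract shape, i.e. one sound
a = [1.0, -1.3, 0.8]
speech = lfilter([1.0], a, excitation)
```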
Figure 3.3 shows the speech signal of the German word "löschen" (extinguish). The vowels "ö" and "e" can be clearly identified by their high signal amplitude and quasi-periodic structure, while the unvoiced sound "sch" has low energy and a noise-like structure. Considering the spectrum (Figure 3.4), the energy of the voiced sounds is concentrated at low frequencies, while the unvoiced sound has a much wider spectrum.
The sense of hearing is a prerequisite for acoustic communication, and in the following, some important facts about hearing will be summarised.
It is customary to distinguish three different parts of the ear: the outer, middle and inner ear (Figure 3.5).
The outer ear comprises the auricle (pinna) and the external auditory canal, which ends at the eardrum (tympanic membrane). The canal acts as a resonator with a resonance frequency of 2-4 kHz. In this frequency range, which is also the main energy range of speech, the ear has its greatest sensitivity.
The air pressure fluctuations result in movements of the tympanic membrane, and the amplitude of these movements is extremely small. For normal speech loudness at a frequency of 1 kHz, the amplitude is of the order of 10⁻¹¹ metres. This is less than the diameter of a hydrogen atom and at the limit of measurability.
The middle ear consists of a small, air-filled cavity. It has a connection to the nose and mouth cavity (the Eustachian tube), which serves to equalise the air pressure on both sides of the tympanic membrane. The middle ear contains three tiny bones (ossicles) called hammer, anvil and stapes. These bones transmit and amplify the oscillations of the tympanic membrane to the oval window of the cochlea. They have three tasks: transformation of the air impedance to the liquid impedance of the cochlea, much like an electrical transformer; amplification of the tympanic membrane oscillations; and protection of the inner ear (by blocking the movement) in the case of excessively high sound pressures.
It is important to note that this 'blocking mechanism' needs a certain reaction time (60 to 100 ms). For very rapidly developing high sound pressures (explosions etc.) the system is not fast enough, and there is a serious danger of inner ear damage.
The inner ear is the cochlea (Figure 3.6). Figure 3.7 is a magnified cross section of the cochlea. On the right side is the basilar membrane with the organ of Corti. With the aid of hair cells, it transforms membrane movements into nerve potentials which are transmitted to the brain.
The cochlea contains three liquid-filled tubes: the scala vestibuli, the scala media and the scala tympani. The scala media is a separate closed system, while the scala vestibuli and the scala tympani are connected through a small hole at the distant end of the cochlea, the helicotrema. This is shown in Figure 3.8, which is a view of the unrolled cochlea. When sound waves impinge, the oscillations of the tympanic membrane are transmitted via the three ossicles to the oval window. Due to the oscillations of the oval window, a travelling wave is set up along the basilar membrane through the cochlea. Somewhere along the basilar membrane the wave reaches a maximum, the position of which depends on the frequency of the oscillations. At this maximum the basilar membrane is bent back and forth, and with it the hair cells, which are arranged along the whole membrane. The cells stimulate the nerve fibres and the stimulus information is transmitted to the brain. We thus have a frequency-locus-neural transformation of the sound. The basilar membrane is tense and narrow at the beginning (near the oval window) and slacker and wider at the end (near the helicotrema); it has been shown that high frequencies are analysed at the proximal end and low frequencies at the distal end of the basilar membrane (von Bekesy, 1960).
The human ear can perceive acoustic frequencies between about 16 Hz and 16 kHz. This range diminishes with age. The dynamic range is about 130 dB between a sound which is just audible (threshold of hearing) and one which is at the pain level (threshold of pain). The range of 130 dB is immense and exceeds by far the dynamic range of acoustic devices like microphones etc.
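To put the figure of 130 dB into perspective, the decibel value can be converted into amplitude and power ratios with the standard conversion formulas; this calculation is general knowledge and not taken from the chapter.

```python
range_db = 130                           # dynamic range of the ear in dB
pressure_ratio = 10 ** (range_db / 20)   # ratio of sound pressures: about 3.2 million
intensity_ratio = 10 ** (range_db / 10)  # ratio of sound intensities: 10**13
print(pressure_ratio, intensity_ratio)
```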
Figure 3.9 shows the "audibility area" in frequency and amplitude coordinates. The sensitivity of the ear is dependent on both amplitude and frequency. The most sensitive area is between 500 and 5000 Hz; for lower or higher frequencies the sensitivity decreases rapidly. This is also the area where acoustic speech signals have the most relevant information. The dynamic range of speech (between 50 and 80 dB) is located in the middle of the hearing area. We may assume that this is not accidental but an evolutionary development in which speech and hearing functions have been adapted to each other.
Although analogue telephony is the most widely used form today, digital speech coding is gradually taking over. Digital coding has a number of advantages over analogue transmission, but also some disadvantages. Since the advantages clearly predominate, there is no doubt that the wholly digitised telecommunication network will be implemented rapidly.
Because of the high transmission and storage capacity needed for digitised speech, effective coding schemes for bit rate compression have been developed. On the other hand, the introduction of high-capacity telecommunication channels (e.g. fibre optics) and low-cost memory may diminish the problem of the extended bandwidth that is needed. The basic digital coding technique is the so-called "Pulse Code Modulation" (PCM). Figure 3.10 shows the processing procedure divided into six steps.
Step 1:
The starting point is the analogue speech signal. Speech produced by the human vocal tract has no sharp frequency limits. This makes it necessary to use a filter to limit the speech signal to the frequency range of 300-3400 Hz.
Step 2:
Sampling of the analogue signal. According to the sampling theorem, the sampling frequency has to be more than twice the maximum frequency of the analogue signal (i.e. 2 x 3400 Hz). A sampling frequency of 8 kHz has been standardized.
Step 3:
Quantizing and coding of the sampled signal is a procedure that converts the sampled signal into a sequence of binary digits. To this end, the amplitude range of the signal is divided into equally spaced intervals. Each successive sampled signal value is replaced by the mean value of the interval to which it belongs. The difference between the true value and the mean interval value yields an error known as the quantization error. This error decreases if the number of intervals (for the same amplitude range) is increased, because the intervals then become narrower. Each interval is represented by a binary number. The assignment of intervals to numbers is called coding, and the numbers used are the code words.
Step 4:
The code words are transformed into digital voltage signals, in our case a PCM signal. As shown in Figure 3.10, a voltage impulse appears when the binary digit is "1", while "0" means "no impulse". For example, the digits "10" (fourth value) are formed by the sequence of an impulse and a non-impulse.
Step 5:
After transmission, the PCM signal must be decoded. Since there is a one-to-one relation between the PCM pulses and the code word, and between the code word and the assigned interval mean amplitude value, the quantized signal samples can be transformed back to a set of amplitudes.
Step 6:
A filtering process produces the analogue speech signal from the sampled digital sequence.
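The chain of steps 2 to 6 may be illustrated by the following sketch, which covers uniform quantization only; the band-limiting filter of step 1 and the reconstruction filter of step 6 are omitted. The eight sample values and the choice of 16 intervals are arbitrary assumptions made to keep the example short; the standard, as described below, uses 256 non-uniform intervals.

```python
import numpy as np

fs = 8000                       # standardized sampling frequency (step 2)
bits = 4                        # 16 intervals, for illustration only
levels = 2 ** bits
x = np.array([0.1, 0.6, -0.3, 0.9, -0.8, 0.2, 0.05, -0.55])  # sampled signal in [-1, 1)

# Step 3: uniform quantization -- assign each sample to the interval it falls into
step = 2.0 / levels
index = np.clip(np.floor((x + 1.0) / step).astype(int), 0, levels - 1)

# Coding: each interval number becomes a binary code word
codewords = [format(int(i), f"0{bits}b") for i in index]   # step 4 would send these as pulses

# Step 5: decoding -- back from code word to the mean value of the assigned interval
decoded_index = np.array([int(c, 2) for c in codewords])
reconstructed = -1.0 + (decoded_index + 0.5) * step

# The quantization error is the difference between original and reconstruction
error = x - reconstructed
print(codewords)
print(np.max(np.abs(error)) <= step / 2)                   # error never exceeds half an interval
```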
The reconstructed analogue signal deviates from the original signal because of the quantization error. This error appears as a noise signal superimposed on the received speech signal. The disturbing effect may be reduced by increasing the number of amplitude intervals. An increase in the number of intervals, however, means longer code words and a higher bit rate.
The disturbing effect of the quantization noise depends on the level of the speech signal. If the signal amplitude is high, it suppresses or masks the noise. For low amplitudes, the noise is clearly audible and for very low amplitudes, the speech is completely masked.
Consequently, the relevant measure of speech quality is not the noise level itself, but the ratio of signal level to noise level, called the signal-to-noise ratio. High quality means a signal-to-noise ratio with a constant and high value. In order to keep it constant, the noise level has to be small when the amplitude of the speech signal is small. This leads directly to a modified quantization scheme, where the interval width depends on the speech signal level. This scheme is called non-uniform quantization. The law behind it is logarithmic; it has been internationally standardized (A-law or µ-law) and has resulted in 256 amplitude intervals, which can be coded with 8-bit code words. Based on an 8 kHz sampling frequency, the final result is a bit rate of 64 kbit/s.
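The principle of the logarithmic, non-uniform quantization can be sketched as follows. The sketch uses the continuous µ-law compression formula with µ = 255 followed by a uniform 8-bit quantizer, which yields the 256 intervals and, at 8000 samples per second, the bit rate of 64 kbit/s; the segmented A-law and µ-law tables of the actual standard are not reproduced here.

```python
import numpy as np

mu = 255.0                       # mu-law parameter (256 intervals, 8 bits)
fs = 8000                        # samples per second
bits = 8
print(fs * bits)                 # 64000 bit/s = 64 kbit/s

def mu_compress(x):
    """Logarithmic compression: small amplitudes get relatively finer intervals."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_expand(y):
    """Inverse operation on the receiving side."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.linspace(-1.0, 1.0, 9)            # a few test amplitudes

# Compress, then quantize the compressed signal uniformly with 8 bits
compressed = mu_compress(x)
levels = 2 ** bits
step = 2.0 / levels
index = np.clip(np.floor((compressed + 1.0) / step), 0, levels - 1)
quantized = -1.0 + (index + 0.5) * step

# Expand back to the original amplitude scale
reconstructed = mu_expand(quantized)
```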
These considerations, made for digital speech transmission, are also valid for storage of speech. Thus, with PCM, the storage of one second of speech requires 64 kbit.
Man-machine interaction has two directions: when the machine speaks, it is called speech output, when the machine is operated by speech, it is called speech input.
For most technical applications, only a limited vocabulary is needed. The system has to articulate various error, alarm or confirmation messages, control instructions, standardized question phrases, help functions etc. An important application in the field of rehabilitation is the acoustic keyboard for blind people, where speech sounds corresponding to one or more keys may be articulated, allowing the person to check which key he or she has pressed and possibly to learn to type (Fellbaum, 1986; 1987).
A limited vocabulary is usually made from natural speech which has been stored in a memory. This is called digitised speech. It can be replayed at will, and such a system is therefore called a replay system. Another term, usually found in the American literature, is voice store and forward system.
Digitised speech sounds natural, and its quality depends on the coding technique and the bit rate. The bit rate, in turn, determines the required storage capacity. To give an impression of today's state of the art: on a standard computer board we can store about one minute of high-quality PCM speech. With a slight degradation of the speech quality, this can be doubled to about two minutes. Finally, with the aid of a special speech compression technique, called linear predictive coding (LPC), it is possible to store about half an hour of speech with a moderate, but still acceptable, quality.
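These storage figures can be checked with a simple calculation; the LPC bit rate of about 2.4 kbit/s assumed below is a typical textbook value, not a figure given in this chapter.

```python
pcm_rate = 64_000          # bit/s for standard PCM
lpc_rate = 2_400           # bit/s, a typical LPC rate (assumption)

one_minute_pcm = pcm_rate * 60        # 3 840 000 bits, about 480 kbyte
half_hour_lpc = lpc_rate * 30 * 60    # 4 320 000 bits, about 540 kbyte

# Both amounts are of the same order (roughly half a megabyte), which is
# consistent with one board holding either a minute of PCM or half an hour of LPC
print(one_minute_pcm / 8 / 1000, "kbyte for one minute of PCM speech")
print(half_hour_lpc / 8 / 1000, "kbyte for half an hour of LPC speech")
```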
In some cases, an unlimited speech vocabulary is needed. This is the case, for example, with reading machines for blind people, which transform written text into speech. Since these machines should be able to read any text, it would be impossible to record the vocabulary in advance. Hence another technique is used, called speech synthesis (Figure 3.11). Its principle will be briefly explained here.
Speech comprises short phonetic elements, such as phones, diphones (double phonemes) and others, which are joined to form continuous speech. It has been shown that a restricted number of phonetic elements is sufficient for generating an unlimited speech vocabulary (O'Shaughnessy, 1987).
Although the phonetic elements are taken from natural speech, the generated speech sounds artificial. There are two reasons for this: firstly, the various sound transitions which are typical of natural speech are replaced by more or less standardized transitions, and secondly, the phonetic elements are neutral, i.e. they carry no stress or speech melody. The prosodic elements must be added artificially, a task which is still a subject of research.
In speech synthesis, i.e. transforming text into speech, three main stages of processing may be distinguished (cf. Figure 3.11).
Linguistic-phonetic transcription: The first step transforms the ordinary text into phonetic symbols which describe the pronunciation much more precisely than orthographic text.
Phonemization: In the second stage, the symbols are assigned to the related phonetic elements.
Signal reconstruction: In the third step, the phonetic elements are joined to continuous speech.
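The three stages can be summarised schematically as follows. The tiny lexicon and the silent placeholder "diphones" in this sketch are purely hypothetical; a real system needs complete transcription rules and a full inventory of recorded phonetic elements.

```python
import numpy as np

# Stage 1: linguistic-phonetic transcription (here reduced to a toy lexicon lookup)
lexicon = {"no": ["n", "oU"]}                    # hypothetical entry

# Stage 2: phonemization -- map the phonetic symbols to stored diphone units
diphones = {("#", "n"): np.zeros(80),            # placeholder waveforms,
            ("n", "oU"): np.zeros(80),           # 10 ms each at 8 kHz
            ("oU", "#"): np.zeros(80)}

def synthesize(word):
    symbols = ["#"] + lexicon[word] + ["#"]      # "#" marks silence at the word edges
    units = [diphones[pair] for pair in zip(symbols, symbols[1:])]
    # Stage 3: signal reconstruction -- join the elements to continuous speech
    return np.concatenate(units)

speech = synthesize("no")
```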
For the user to be able to use speech input, the machine must be able to recognise patterns of speech sounds. This is a task that is much more complex than producing synthetic speech. One of the reasons is that a human listener may be able to understand even poor quality speech, whereas a machine that cannot interpret the sound pattern properly will be of very little use. There are, nevertheless, several arguments for developing speech recognition.
The recognition is based on comparisons between patterns which are stored in the computer's memory and utterances spoken by the user operating the machine. For the machine to make these comparisons, it has to "learn" how the commands are spoken by the user. Thus, it is necessary to distinguish between a learning or training phase, and the working phase when the voice is actually used to operate the machine (Figure 3.12).
In the training phase, each word of the relevant vocabulary is spoken by the user, then processed and stored by the machine as a reference pattern in the memory. At the same time, each spoken word and its associated command is typed into the machine on a keyboard and assigned to the corresponding speech pattern. After all the command words have been entered, the system is ready for the working phase. When a word is spoken by the user, it is first processed in the same way as in the training phase and then stored in a comparator. All the reference patterns from the training phase are then compared successively with the current speech pattern; the reference pattern with the maximum similarity is selected and the corresponding command is executed.
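A minimal sketch of this training and comparison procedure is given below. It assumes that each utterance has already been processed into a fixed-length feature vector, which sidesteps the time-alignment problem a real recogniser must solve (for example by dynamic time warping); the command words and feature values are made up for illustration.

```python
import numpy as np

references = {}                              # reference patterns from the training phase

def train(command, feature_vector):
    """Training phase: store the processed utterance as a reference pattern."""
    references[command] = np.asarray(feature_vector, dtype=float)

def recognise(feature_vector):
    """Working phase: compare against all references and pick the most similar one."""
    pattern = np.asarray(feature_vector, dtype=float)
    distances = {cmd: np.linalg.norm(pattern - ref) for cmd, ref in references.items()}
    return min(distances, key=distances.get)  # smallest distance = maximum similarity

# Hypothetical usage with made-up three-dimensional feature vectors
train("yes", [0.9, 0.1, 0.3])
train("no",  [0.2, 0.8, 0.5])
print(recognise([0.85, 0.15, 0.35]))         # -> "yes"
```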
This procedure is called speaker-dependent word recognition. More complex forms of speech recognition, such as recognition of continuous speech, meet with considerable problems, and much further research is needed before such systems can appear on the market. On the other hand, a simple form of speech recognition is sufficient for many practical applications.
The vocabulary that must be recognized by a system depends strongly on the commands that may be used and the number of possible replies. The extreme case would be a two-word vocabulary, for example "yes" and "no". It is likely that there is an optimal number of alternatives: a small vocabulary leads to very long interactions, while too many commands may be difficult for the user to remember. In addition, recognition becomes more complex as the vocabulary grows.
In my opinion, the development of speech recognition should primarily be directed toward systems which are robust (particularly against environmental noise), reliable and cheap, rather than towards sophisticated systems with continuous speech recognition, which may only work satisfactorily in a sound-sheltered laboratory environment.
Electronic speech processing will play an increasing role in telecommunications and in man-machine interactions, including applications for people with disabilities.
Although speech input and output techniques have been discussed separately, it must be emphasised that they belong together. In a human dialogue, we expect a spoken answer when we address somebody, and there is no reason why this should be different in a man-machine interaction.
To optimize man-machine interaction, however, is not easy, and the resources allocated to this field leave much to be desired. This is a major reason why technical systems often suffer from a low level of acceptance.
Concerning people with disabilities, modern telecommunication services and speech processing techniques may prove extremely helpful if the needs of these groups are taken into consideration. However, speech processing may also become a hindrance to integration if the development is focused exclusively on the needs of non-disabled people.
References:
Fellbaum, K. (1984). Sprachverarbeitung und Sprachübertragung. Berlin: Springer-Verlag.
Fellbaum, K. (1986). Research Project "Communication System for the Blind". Paper presented at the International Workshop "Communication Systems for the Blind", Florence, November 1986.
Fellbaum, K. (Ed.) (1987). Electronic communication aids. Proceedings of the Big Tech '86 Workshop. Berlin: Weidler Verlag.
Green, D.M. (1976). An introduction to hearing. Hillsdale, New Jersey: Lawrence Erlbaum.
Lapp, R.E. (1966). Schall und Gehör. The Netherlands: TIME-LIFE International.
Nordeng, H., von Tetzchner, S. & Martinsen, H. (1985). Forced language change as a result of acquired deafness. International Journal of Rehabilitation Research, 8, 71-74.
O'Shaughnessy, D. (1987). Speech communication. London: Addison-Wesley.
Steinbuch, K. (1977). Kommunikationstechnik. Berlin: Springer-Verlag.
von Bekesy, G. (1960). Experiments in hearing. New York: McGraw-Hill.