Georgios Kouroupetroglou and Géza Németh
The wide area of speech technology includes speech recognition and understanding, speech synthesis, speech coding and enhancement, speaker verification and identification, speech analysis and feature extraction, as well as speech evaluation and assessment. The term "voice processing" has been chosen to indicate the means of utilizing speech technology to facilitate the dissemination and retrieval of information. Voice processing functions provide a specific set of capabilities that use voice as the dominant communication modality. Voice processing frequently interacts with other environments, such as host databases and telecommunications systems, as part of a larger communication and information management system.
Nearly all the achievements of speech technology in the past decade have been based on a historical development in which new knowledge has been added to old knowledge piece by piece rather than through drastic change. Perhaps the most drastic change is in the field of tools rather than in the understanding of the speech code. Nevertheless, considerable progress can be seen in terms of improvements in quality and accuracy.
Possibly the most important progress in the last few years is that speech technology has matured sufficiently and has entered the market, providing either stand-alone innovative products or products that have been incorporated into a wide range of systems. Nowadays speech technology can be applied effectively to facilitate at least three very general categories of user needs: interpersonal communication, access to information and control of the environment.
This chapter briefly reviews individual sections of speech technology and examines its level of application development with regard to disabled and elderly people.
Although the intelligibility of the available speech synthesis systems is generally high for a number of languages, there is evidence from recent work that a new level of quality, based on a number of new techniques, might be possible in speech synthesis for text-to-speech systems. The focus is on naturalness, prosody and flexibility (different voices or voice personalities, speaking styles, dialects, accents and languages). Improved models will open the way to more realistic synthesis of child and female voices. Other important research areas are the modelling of emotions and speech synthesis from meaning. The quality and intelligibility of synthetic or natural speech are notoriously difficult to measure, but progress has recently been made in this field.
There has been steady progress in the field of speech recognition over recent years with two trends: the academic (achieved by improved techniques mainly in stochastic modelling, search and neural networks) and the pragmatic (the technology provides simple, low level interaction with machines, replacing buttons or switches). The latter does a useful job now, while the former mainly makes promises for the future. In pragmatic systems, the emphasis has been on accuracy, robustness and on computational efficiency permitting real-time performance with affordable hardware (Rudnicky et al., 1994).
The problem of speech recognition is very difficult for four main reasons: a) the basic units of speech are hard to recognize; b) continuous speech adds further difficulties; c) speaker and environmental differences are very important; and d) the human language understanding process is unknown. The following design choices must be taken into account in order to simplify the problem: isolated-word versus continuous recognition, training requirement (speaker-dependent, speaker-independent or adaptive), vocabulary size (2 to 100 words), operational environment (studio, office, telephone or industrial) and grammar/language model (i.e. additional constraints for higher accuracy).
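To illustrate why isolated-word, small-vocabulary recognition is so much more tractable than continuous speech, a classic simplification matches an incoming feature sequence against one stored template per vocabulary word using dynamic time warping (DTW). The sketch below is illustrative only: the "feature frames" are single floats and the two-word vocabulary is invented, whereas a real recognizer would use e.g. cepstral feature vectors.

```python
# Minimal dynamic time warping (DTW) matcher for isolated-word
# recognition over stored templates. Feature frames are simplified
# to 1-D floats; real systems use multi-dimensional feature vectors.

def dtw_distance(a, b):
    """Cumulative DTW distance between two feature sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch template
                                 d[i][j - 1],      # stretch utterance
                                 d[i - 1][j - 1])  # match frames
    return d[n][m]

def recognize(utterance, templates):
    """Return the vocabulary word whose template is closest under DTW."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

# Hypothetical two-word command vocabulary with toy feature contours.
templates = {"yes": [1.0, 2.0, 3.0, 2.0], "no": [3.0, 1.0, 1.0]}
print(recognize([1.1, 2.1, 2.9, 2.2], templates))  # → yes
```

Each extra constraint listed above (isolated words, known speaker, tiny vocabulary) shows up here as a simplification: whole-word templates, no search over word boundaries, and no language model.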
General-purpose speech recognition systems with the following specifications: speaker-dependent (i.e. requiring training by the user), isolated-word, large-vocabulary (5000-50,000 words) and with a close-talking microphone, were commercially introduced in 1990. They have found broad-ranging applications, predominantly in dictation and document creation. Although they are still slower than skilled typists (>60 words per minute throughput), they can be productive for word-by-word dictation and can be used over telephone lines with lower accuracy.
Speaker-independent, discrete-word, small-vocabulary speech recognition has been successfully applied in areas such as control and speed dialling of mobile telephones. Limited-vocabulary command/control capabilities are now generally available inexpensively (under ECU 225) or even "bundled free" with some new computer systems. Large-vocabulary, discrete-word, speaker-independent speech recognition systems are still expensive, ranging from ECU 1 500 to ECU 3 700 and up.
Recently, telephone companies have started to deploy speaker-independent recognition systems that work over telephone lines. Although they handle only small vocabularies (5-10 words), they seem to have very strong commercial potential for special services. Improved recognition accuracy and robustness against environmental and speaker variations are critical for real applications, yet not a single recognizer today works well enough under these conditions. A number of operational prototype systems for large-vocabulary, continuous speech recognition have already been demonstrated, but they are still limited in their application domains (e.g. X-ray reports).
Speech Dialogue Systems
Speech dialogue systems (in which speech synthesis and recognition are used in a man-machine dialogue framework) have been demonstrated for applications such as Automatic Teller Machines and public telephone booths equipped with hands-free voice diallers. Currently operational automated telephone information systems are of simple dialogue complexity and understanding (menu style, isolated words, limited fluent speech, 100-word vocabulary size), but laboratory prototypes exist with 1000 words and sophisticated (mixed initiative, multilingual, natural language, co-operative, intelligent error recovery) dialogue complexity and understanding. Fluent interaction between human and machine requires the explicit use of cognitive engineering resources that do not yet fully exist. Also, a theory of man-machine interaction by speech and other modalities is lacking.
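The menu-style dialogue systems described above can be modelled as a simple finite-state machine: each state prompts the user, accepts a few expected words, and a recognized word selects the next state, while out-of-vocabulary input triggers a basic error-recovery reprompt. A minimal sketch, with invented states and vocabulary (a banking menu is assumed purely for illustration):

```python
# Minimal finite-state, menu-style dialogue manager: each state maps
# an expected spoken word to a successor state. Unrecognized input
# keeps the current state (a crude form of error recovery: reprompt).

DIALOGUE = {
    "main":     {"prompt": "Say 'balance' or 'transfer'.",
                 "next": {"balance": "balance", "transfer": "transfer"}},
    "balance":  {"prompt": "Here is your balance. Say 'main' to return.",
                 "next": {"main": "main"}},
    "transfer": {"prompt": "Say 'main' to cancel.",
                 "next": {"main": "main"}},
}

def step(state, word):
    """Advance the dialogue; unknown words keep the current state."""
    return DIALOGUE[state]["next"].get(word, state)

state = "main"
for word in ["balance", "mumble", "main"]:  # "mumble" is rejected input
    state = step(state, word)
print(state)  # → main
```

The sophisticated laboratory prototypes mentioned above replace this rigid state table with mixed-initiative dialogue management and natural-language understanding, which is precisely where the missing cognitive engineering resources are felt.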
Identifying non-linguistic speech features, such as text-independent speaker, sex and language identification, in order to facilitate man-machine interaction, is being researched. Significant steps toward automatic interpretation of telephony speech have been made in Japan. Translation of spontaneous language for face-to-face and telephone conversations has made some progress and rudimentary systems may be expected in the near future.
Speech coding is very important for digital communications, land mobile radio systems, cordless digital telephone systems, satellite-based communications, voice storage, and multipoint and multimedia teleconferencing. Good-to-excellent speech coding quality can be achieved today at bit rates of 1 bit/sample (8 kbit/s). Expectations are that the rates can be reduced to less than 0.5 bit/sample (4 kbit/s) within the next decade. In mobile communications, bit rates in the range of 2-8 kbit/s are highly important for the economic use of frequencies and to provide significant overhead capacity for error protection.
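The bit-rate figures above follow directly from the conventional telephone-band sampling rate of 8 kHz: at 1 bit per sample the coder produces 8 kbit/s, and halving this to 0.5 bit/sample gives 4 kbit/s. The arithmetic can be written out as:

```python
# Coded bit rate = bits per sample x sampling rate.
# Telephone-band speech is conventionally sampled at 8 kHz.

SAMPLE_RATE_HZ = 8000

def bit_rate_kbps(bits_per_sample):
    """Coder output rate in kbit/s for a given bits-per-sample figure."""
    return bits_per_sample * SAMPLE_RATE_HZ / 1000

print(bit_rate_kbps(1.0))  # → 8.0 kbit/s (good-to-excellent quality today)
print(bit_rate_kbps(0.5))  # → 4.0 kbit/s (target within the next decade)
```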
Applications for Disabled and Elderly People
The estimated number of people with speech/language difficulties living in the European Union (1993) is just over three quarters of a million and is projected to rise by a further seven thousand by the year 2000. This estimate was obtained with the same materials and methods as in (Carruthers et al., 1993).
Speech technology has contributed to more realistic user interfaces, especially in the area of computer and telecommunications technology. A detailed description of the applications of speech and voice processing to telecommunications can be found in (Rabiner, 1994).
Eyes-free and/or hands-free communication and control enables the user to freely communicate without having physical contact with the communication device, either for controlling the communication flow (i.e. initiating the call, transferring it to another number, etc.) or for communicating with the other party. This provides a seamless environment where control and communication are handled identically, i.e. by means of speech technology as well. In order to fully exploit these possibilities, users should be allowed to freely choose their preferred input/output modalities (e.g. parallel keyboard/recognizer input, speech only output).
During the last decade, the community of speech technology has shown its interest in various aspects of research and development for people with disabilities (Granstrom et al., 1993).
The main applications of speech technology for disabled and elderly people can be grouped as follows:
a) Voice Output Communication Aids: (VOCA) for persons who are non-speaking, speech impaired or deaf
Although computer-based vocal prostheses for non-speaking people are becoming increasingly common in the market, they are often expensive, slow in operation and unable to convey the feelings of the user. Most of them accept various non-orthographic written communication systems based on graphic signs (COST 219, 1991; Kouroupetroglou 1993, 1994). Prototypes have demonstrated the enhancement of a communication prosthesis with vocal emotional effects (e.g. happiness, sadness, grief) derived from textual and pragmatic information, as well as strategies for more acceptable speech production through input acceleration techniques. Research is under way to establish a modular and open architecture offering considerable flexibility and widespread reusability in the design and development of VOCAs, as well as of more general interpersonal communication systems (Stephanidis, 1994). Other developments aim to provide an integrated communication prosthesis for non-speaking people based on conversational analysis and artificial intelligence techniques. Synthesis of talking faces can improve the perception process and might be especially useful for deaf people. Two promising approaches to human-to-speech-synthesizer interfacing are the glove-driven speech synthesizer and the EEG-based brain-computer interface for patients with hemiplegia or quadriplegia who cannot perform the normal movements of the articulators for speech production.
b) Speech recognition in writing, programming, environmental control, and computer-aided design for persons with motor disabilities
The available speech recognition products can increase the written communication rate (text entry rate) for physically disabled individuals compared to a keyboard alone or a keyboard with word prediction. Systems that replace mouse actions with speech in graphical user interfaces (direct graphical manipulation) for people with impaired motor control of their hands are already on the market. Currently available speaker-dependent, isolated-word recognizers allow a larger vocabulary than speaker-independent recognizers. Since these systems can be adapted to the individual voice of the user, they could be a solution for controlling the environment or telecommunications devices by reproducible voice commands that need not be understandable to other persons. Hands-free, voice-controlled, simple telecommunication terminals for people with motor disabilities are becoming a reality.
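The word-prediction technique mentioned above raises text entry rate by letting the user select a whole word after typing only a few letters. A toy sketch of the idea, assuming a small frequency-ranked lexicon (the word list and counts here are invented):

```python
# Toy word-prediction aid: given a typed prefix, offer the most
# frequent lexicon words starting with it, so the user can select a
# completion instead of typing every letter (raising text entry rate).

LEXICON = {  # hypothetical word -> frequency count
    "speech": 50, "speaker": 30, "special": 20, "the": 90, "speed": 10,
}

def predict(prefix, n=3):
    """Return up to n completions of `prefix`, most frequent first."""
    matches = [w for w in LEXICON if w.startswith(prefix)]
    return sorted(matches, key=lambda w: -LEXICON[w])[:n]

print(predict("spe"))  # → ['speech', 'speaker', 'special']
```

Real prediction aids go further, adapting the frequency counts to the individual user and exploiting preceding-word context, but the keystroke-saving principle is the same.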
c) Speech synthesis in reading, writing and programming aids for persons who are blind or visually disabled
Although a variety of systems are already available as products, and although pocket reading devices for visually impaired and blind people are under development, improvements in speech quality and the man-machine interface are expected to satisfy the real needs of the users. The designers of such systems must be aware of the fact that a significant proportion of visually impaired people have at least one other disability.
d) Speech training methods and devices
Several computer-assisted aids for speech training are now on the market and more sophisticated ones are under development. For example, aids for speech training, assessment and rehabilitation for profoundly deaf people use text-to-speech to generate model parameters for the person with a disability to follow. They also use visual feedback after speech analysis, or an animated display of inferred tongue, lip and jaw movements during speech production (sometimes based on neural networks), to help treat speech disorders in real time.
e) Processing of speech in cochlear implants, hearing aids and tactile aids
In speech-processing aids for severely and profoundly deaf persons, speech information is transmitted as tactile or visual signals, or recoded into new auditory signals or into signals that directly stimulate the auditory nerve (cochlear implants) (Goldstein, 1994). The latter can be used as speech perception aids, speech production aids, or both. A few tactile aids are now commercially available, but they have not yet been shown to give any substantial improvement in the ability to lip-read speech or to learn to speak. A real breakthrough has been made with cochlear implants, aids based on direct electrical stimulation of the auditory nerve. Speech intelligibility in adverse listening conditions (e.g. reverberant and/or noisy environments) and performance in real environments have been studied. Pocket hearing aids based on digital signal processing and real-time Very Large Scale Integration (VLSI) implementations have been developed. Digital speech processing techniques that enhance significant speech features, or transpose them into the residual hearing area, in order to increase overall speech intelligibility have also been developed (Engebretson, 1994). There are also expectations of using current speech recognition technology to design communication aids for hearing-impaired persons who need qualified transcription services (e.g. a new and more efficient design of relay services).
Speech technology is of particular interest for the augmentation of direct man-machine communication, as it mimics the most natural and most preferred way of communication between human beings. Speech synthesis is a stable technology nowadays, although its output is still far from natural human speech production, and it can be used in a wide range of low-cost applications. Speech recognition is beginning to have a broad impact; nevertheless, it is still inadequate when compared to human capabilities. The development of man-machine interfaces for personal-computer-mediated communicators seems to be the most promising application area of speech recognition. Speech dialogue systems will also play an important role in human-computer interaction.
Speech technology and voice processing have already shown in a pragmatic way their usefulness for disabled and elderly people in application areas such as communication aids, hearing aids and control of the environment. It is of the utmost importance to define open interfaces for speech input/output devices in order to allow flexible system configuration; this would allow wider use of these systems in general and provide extremely important functionality for people with special needs. New techniques under development in speech research laboratories will play a key role in future man-machine interaction and personal communication. As the application of speech technology is a multidisciplinary sector, stronger synergy and collaboration with other scientific sectors are required in order to continue to foster pioneering work in the field and to transform research and technology innovations into system and product innovations. A better understanding of user needs, and of how best to meet these needs through future research and development, is also required.
CARRUTHERS, S., HUMPHREYS, A. and SANDHU, J. (1993). The Market for R.T. in Europe: a Demographic Study of Need, in Rehabilitation Technology, Edit. Ballabio, E. I. Placencia-Porrero and Puig de la Bellacasa, R., IOS Press, pp.158-163.
COST 219, von Tetzchner, S. (Ed.), (1991). Issues in Telecommunication and Disability. Use of Graphic Communication Systems in Telecommunications, pp. 280-288. Published by CEC.
COST 219, Klause, G. (Ed.), (1995). Proceedings of Seminar on Applied Speech Processing and Voice Recognition in Telecommunications, in Potsdam, March 1995.
ENGEBRETSON, M. (1994). Benefits of Digital Hearing Aids, IEEE Eng. in Medicine and Biology Mag., Vol. 13, No 2, April-May 1994, pp. 238-248.
GOLDSTEIN, M. (1994). Auditory Periphery as Speech Signal Processor, IEEE Eng. in Medicine and Biology Mag., Vol. 13, No 2, April-May 1994, pp. 186-196.
GRANSTROM, B., HUNNICUTT, S. and SPENS, K., Editors (1993). Speech and Language Technology for Disabled Persons, Proc. of an ESCA Workshop, Stockholm, May 31- June 2.
KOUROUPETROGLOU, G., ANAGNOSTOPOULOS, A., VIGLAS, C., PAPAKOSTAS, G. and CHAROUPIAS, A. (1993). The BLISPHON Alternative Communication System for the Speechless Individual, Proc. of an ESCA Workshop, Stockholm, May 31- June 2, pp. 107-110.
KOUROUPETROGLOU, G. and VIGLAS C. (1994). MULTIGRACE, a Multimedia Learning and Teaching Environment for Graphic Interpersonal Communication Systems, Proc. ISAAC'94 Conference, Maastricht NL, 9-13 October, pp. 407-409.
RABINER, L. (1994). Applications of Voice Processing to Telecommunications, Proc. of the IEEE, Vol. 82, No 2, February 1994, pp. 199-228.
RUDNICKY, A., HAUPTMANN, A. and KAI-FU LEE, (1994). Survey of Current Speech Technology, Communications of the ACM, Vol. 37, No 3, March 1994, pp. 52-57.
STEPHANIDIS, C. and KOUROUPETROGLOU, G. (1994). Human Machine Interface Technology and Interpersonal Communication Aids, Proc. Fifth COST 219 Conference, Tregastel - France, June 7-8. ISBN 951-33-0004-8.