

IBM Speech-to-Speech Translation

material courtesy of IBM


The goal of the Speech-to-Speech Translation (S2S) research is to enable real-time, interpersonal communication via natural spoken language for people who do not share a common language. The Multilingual Automatic Speech-to-Speech Translator (MASTOR) system is the first S2S system that allows for bidirectional (English-Mandarin) free-form speech input and output.


The research leading to MASTOR was initiated in 2001 as an IBM adventurous research project and was also selected for funding by the Defense Advanced Research Projects Agency (DARPA) CAST program (formerly called the “Babylon” program).

MASTOR combines IBM's cutting-edge technologies in automatic speech recognition, understanding and synthesis. The tight coupling of speech recognition and understanding effectively mitigates the effects of recognition errors and the non-grammatical input common in colloquial conversational speech (as opposed to the well-formed written text or read speech of dictation or broadcast news), resulting in a highly robust system for limited domains. MASTOR currently provides bidirectional English-Mandarin translation of unconstrained, free-form natural speech with a large vocabulary (over 30,000 words in each direction) in multiple domains, including travel, emergency medical diagnosis, and defense-oriented force protection and security. MASTOR runs in real time on a laptop and has also been ported to a handheld PDA with minimal performance degradation. Both versions of the system performed strongly in the February and August 2004 DARPA evaluations across all criteria, including task completion rate, usability, and user satisfaction. The IBM team was also the only team in the DARPA CAST program able to present a stand-alone bidirectional speech-to-speech translation system on a PDA, because its accurate, highly optimized algorithms and code require the least memory and processing power for adequate performance.

[Figure: The GUI for S2S in the Medical Domain]

DARPA and the tech community recognize MASTOR as a breakthrough in spoken language translation for its ability to produce usable bidirectional translated output from free-form spoken input on real portable devices. MASTOR has been showcased on many occasions, including CeBIT 2003, DARPATech 2004, and technology demonstrations to U.S. senators and to the deputy director of the Department of Defense. Yuqing Gao, the principal researcher on the project, has received two awards from the DARPA CAST program for technology progress. The innovation has also been highlighted widely by the media, including the BBC, an MIT Technology Review article featuring “10 Emerging Technologies That Will Change Your World,” and National Public Radio's “Marketplace Morning Report.”

Construction of robust speech-to-speech translation systems to facilitate cross-lingual oral communication has been a dream of speech and natural language researchers for decades. It is technically extremely difficult because of the need to integrate a set of complex technologies – Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), Machine Translation (MT), Natural Language Generation (NLG), and Text-to-Speech Synthesis (TTS) – that are far from mature on an individual basis, much less when cascaded together. Blindly chaining ASR, MT and TTS components does not produce acceptable results, because typical machine translation technologies, oriented primarily toward well-formed written text, cannot adequately process conversational speech rife with imperfect syntax and speech recognition errors. Initial work in this area in the 1990s, for example by researchers at CMU and Japan's ATR labs, resulted in systems severely limited to a small vocabulary or otherwise constrained in the variety of expressions supported. Currently, the only commercially available speech translation technology is the Phraselator, a simple unidirectional translation device customized for military use: it matches the input against a fixed set of English sentences, plays back the corresponding voice recordings in a foreign language, and cannot handle bidirectional speech.
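The difficulty of blind cascading can be seen with a back-of-the-envelope calculation: if each stage simply passes its output (errors included) to the next, per-stage accuracies multiply. The accuracy figures below are illustrative assumptions, not measurements from any system.

```python
# Illustrative only: assumed per-stage accuracies for a naively cascaded
# pipeline. Errors compound multiplicatively, so even individually
# decent components yield a poor end-to-end result.
stages = {"ASR": 0.90, "NLU": 0.92, "MT": 0.85, "NLG": 0.95, "TTS": 0.98}

end_to_end = 1.0
for name, accuracy in stages.items():
    end_to_end *= accuracy

print(f"end-to-end accuracy ~ {end_to_end:.2f}")  # prints: end-to-end accuracy ~ 0.66
```

This is why MASTOR tightly couples recognition and understanding rather than treating each component as an independent black box.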

[Figure: IBM MASTOR Architecture]

MASTOR’s innovations include:

- methods that automatically extract the most likely meaning of a spoken utterance and store it in a tree-structured set of concepts such as "actions" and "needs";
- methods that take the tree-based output of a statistical semantic parser and transform the semantic concepts so that the same set of concepts is expressed in a way appropriate for the target language;
- statistical natural language generation methods that take the resulting transformed concepts and generate a sentence in the target language;
- generation of proper inflections by filtering hypotheses with an n-gram statistical language model;
- techniques that significantly raise the annotation automation rate to improve language and domain portability;
- direct modeling approaches, such as maximum entropy (MaxEnt) models, finite-state transducers, and semantic structured language models, that tightly couple and unify the speech recognition and understanding processes;
- unsupervised adaptation algorithms that quickly adapt to new speakers and acoustic environments;
- algorithms and acoustic models that deliver high accuracy in noisy environments, such as exhibition and hospital halls, and with far-field microphones, as in handheld devices;
- language models that handle colloquial Mandarin, whose vocabulary and expressions differ significantly from those of standard Mandarin, as well as advanced language modeling techniques that require far less training data.
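The first four innovations above form a concept-based translation pipeline: parse the utterance into a concept tree, transform the concepts for the target language, generate candidate realizations, and rerank them with an n-gram language model. The toy sketch below (not IBM's code; every rule, lexicon entry, and probability is a hand-written illustration, with Mandarin shown as pinyin) shows the shape of that pipeline.

```python
# Toy sketch of a concept-based translation pipeline:
# parse -> concept tree -> generate candidates -> rank with a bigram LM.
# All names, rules, and probabilities are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Concept:
    label: str                           # e.g. "NEED", "ACTION"
    value: str = ""
    children: list = field(default_factory=list)

def parse(utterance: str) -> Concept:
    """Stand-in for the statistical semantic parser: map a source
    utterance to a tree of language-independent concepts."""
    root = Concept("S")
    if "need" in utterance.split():      # toy rule: "I need X" -> NEED(X)
        root.children.append(Concept("NEED", utterance.split()[-1]))
    return root

def generate_candidates(tree: Concept, lexicon: dict) -> list:
    """Stand-in for statistical NLG: emit several surface realizations
    of the concept tree in the target language (pinyin here)."""
    outs = []
    for c in tree.children:
        if c.label == "NEED":
            w = lexicon.get(c.value, c.value)
            outs += [f"wo yao {w}", f"wo xuyao {w}", f"{w} wo yao"]
    return outs

def ngram_score(sentence: str, bigram_logp: dict) -> float:
    """Score a candidate with a bigram LM; unseen bigrams get a floor,
    so ungrammatical word orders are penalized."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(bigram_logp.get(bg, -5.0) for bg in zip(words, words[1:]))

lexicon = {"water": "shui"}
bigram_logp = {("<s>", "wo"): -0.1, ("wo", "xuyao"): -0.5,
               ("xuyao", "shui"): -0.3, ("shui", "</s>"): -0.2,
               ("wo", "yao"): -1.0, ("yao", "shui"): -0.8}

tree = parse("I need water")
best = max(generate_candidates(tree, lexicon),
           key=lambda s: ngram_score(s, bigram_logp))
print(best)  # prints: wo xuyao shui
```

The LM reranking step is what selects a fluent word order from the over-generated candidates, which is the role the n-gram filter plays in MASTOR's generation stage.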


  Related Publications
Fu Hua Liu, Liang Gu, Yuqing Gao and Michael Picheny, "Use of Statistical N-Gram Models in Natural Language Generation for Machine Translation," ICASSP 2003, IEEE, April 2003.

Yuqing Gao, Bowen Zhou, Zijian Diao, Jeffrey Sorensen and Michael Picheny, "MARS: A Statistical Semantic Parsing and Generation-Based Multilingual Automatic tRanslation System," Journal of Machine Translation, Vol. 17, 185-212, 2002.

Ruhi Sarikaya, Yuqing Gao, Michael Picheny and Hakan Erdogan, "Semantic Confidence Measurement for Spoken Dialog Systems", IEEE Trans. Speech and Audio Processing, July, 2005.

Bowen Zhou, Daniel Dechelotte and Yuqing Gao, "Two-way Speech-to-Speech Translation on Handheld Devices", Int. Conf. of Spoken Language Processing (ICSLP), Korea, Oct. 2004.

Liang Gu and Yuqing Gao, "On Feature Selection in Maximum Entropy Approach to Statistical Concept-based Speech-to-Speech Translation," Int. Workshop on Spoken Language Translation, Kyoto, Japan, Oct. 2004.

Hong-Kwang Jeff Kuo and Yuqing Gao, "Maximum Entropy Direct Model as a Unified Model for Acoustic Modeling in Speech Recognition," in Proc. of Int. Conf. of Spoken Language Processing (ICSLP), Korea, Oct. 2004.


Yuqing Gao:
Principal Investigator of the Year 2002 – DARPA CAST Program
Industrial Principal Investigator of the Year 2003 – DARPA CAST Program



News and Information
"IBM Researchers voice translation hopes...", Mark Hachman, 2003.

"IBM Developing Translation Software", by Lisa Bowman, CNET, April 24, 2003.

"Future Tech: 20 Hot Technologies to Watch", by Cade Metz, PC Magazine, July 1, 2003.

"Voice Ideas: Automating the Tower of Babel", by Dr. Judith Markowitz, Speech Technology Magazine, Sept/Oct, 2004.

"Machines Not Lost in Translation", by Ann Harrison, Wired News, March 9, 2005.

“Soon, you too can speak Chinese – with a little computer help. We test new translation technology,” by Anders W. Hagen, April 25, 2005.

“Universal Translator” – one of the “10 Emerging Technologies That Will Change Your World,” by Greg Huang, MIT Technology Review, January 2004.

S2S DEMONSTRATION VIDEO CLIP from CeBIT 2004, CeBIT Germany, 2004.



Innovator's corner  


Yuqing Gao
What is the most exciting potential future use for the work you're doing?
Speech-to-speech translation systems have the potential to revolutionize the way people around the world who do not speak a common language communicate with one another. Thousands of different languages are spoken; imagine being able to communicate with anyone instantly through the assistance of a universal translator. Breaking such communication barriers would lead to tremendous growth in cultural understanding, and helping people accept and live with one another's differences would make for a very rewarding future.

What is the most interesting part of your research?
The most interesting part of such an innovative task is taking on challenges that no one has yet conquered. It is exciting to take part in finding ways to recognize, understand and translate languages through advanced techniques. Having so many ambitious goals to pursue is energizing on a daily basis. No day goes by without new obstacles, and every day is a new experience in this field.

What inspired you to go into this field?
Around five years ago, it felt as if speech recognition technology had reached a plateau, at least if we continued with the same statistical approach. I believed that speech recognition needed new perspectives that could strengthen it for practical use, such as language-independent semantic meaning representation, dialogue and pragmatic context, and so on. I viewed speech translation as an integrated task of speech recognition and understanding. This inspired me to start this project and look for ways to steer the research in a new direction.

What is your favorite invention of all time?
The modern wireless appliances that are available. I came from a telecommunications background, and having ways to get information from miles away at such high speeds is very valuable to me. Almost everything has become wireless and more convenient for people to use. Nobody walks around without a cellphone and a laptop in the technology fields these days.

 Research team members
Yuqing Gao
Liang Gu
Hong-Kwang Jeff Kuo
Antti-Veikko Rosti
Ruhi Sarikaya
Bowen Zhou

  Related Research
Disciplines: Computer Science
Research Areas: User Interface Technologies, Human Computer Interaction
Research Labs: Watson Research Center


