The goal of Speech-to-Speech Translation (S2S)
research is to enable real-time, interpersonal communication via
natural spoken language for people who do not share a common
language. The Multilingual Automatic Speech-to-Speech Translator
(MASTOR) system is the first S2S system that allows for
bidirectional (English-Mandarin) free-form speech input and output.
The research leading to MASTOR
was initiated in 2001 as an IBM
adventurous research project
and was subsequently selected for funding by the Defense Advanced
Research Projects Agency (DARPA) under its CAST program
(formerly called the "Babylon" program).
MASTOR combines IBM's
cutting-edge technologies in automatic speech
recognition, understanding, and synthesis. The tight coupling of
speech recognition and understanding effectively mitigates the
effects of speech recognition errors and non-grammatical inputs
common in conversational colloquial speech (as opposed to
well-formed written text or read speech in dictation or
broadcast news) on the quality of the translated output,
resulting in a highly robust system for limited domains. MASTOR
currently has bidirectional English-Mandarin translation
capabilities on unconstrained free-form natural speech input
with a large vocabulary (over 30,000 words for each direction)
in multiple domains, including travel, emergency medical
diagnosis and defense-oriented force protection and security.
MASTOR runs in real-time on a laptop, and has also been ported
to a handheld PDA, with minimal performance degradation. Both
versions of the system displayed outstanding performance in the
February and August 2004 DARPA evaluations across all criteria,
including task completion rate, usability, and user satisfaction.
The IBM team was also the only team able to present a
stand-alone bidirectional speech-to-speech translation system on
a PDA in the DARPA CAST program, because its highly accurate
and optimized algorithms and code require the least memory and
processing power for adequate performance.
[Figure: The GUI for S2S in the Medical Domain]
DARPA and the tech community recognize MASTOR as a breakthrough in
spoken language translation for its ability to produce
usable bidirectional translated output from free-form spoken
input on real portable devices. MASTOR has been showcased on
many occasions, including
technology demonstrations to U.S. senators and to the deputy
director of the Department of Defense. Yuqing Gao, the principal
researcher on the project, has received two awards from the DARPA
CAST program for technology progress. The innovation has also
been highlighted widely by the media, including the BBC, an MIT
Technology Review article featuring “10 Emerging Technologies
That Will Change Your World,” and on National Public Radio's
“Marketplace Morning Report.”
Construction of robust systems
for speech-to-speech translation to facilitate cross-lingual
oral communication has been the dream of speech and natural
language researchers for decades. It is technically extremely
difficult because of the need to integrate a set of complex
technologies – Automatic Speech Recognition (ASR), Natural
Language Understanding (NLU), Machine Translation (MT), Natural
Language Generation (NLG), and Text-to-Speech Synthesis (TTS) –
that are far from mature on an individual basis, much less when
cascaded together. Blindly integrating ASR, MT and TTS
components does not provide acceptable results because typical
machine translation technologies, primarily oriented towards
well-formed written text, are not adequate to process
conversational speech rife with imperfect syntax and
speech recognition errors. Initial work in this area in the
1990s, for example, by researchers at CMU and Japan’s ATR labs,
resulted in systems severely limited to a small vocabulary or
otherwise constrained in the variety of expressions supported.
Currently, the only commercially available speech translation
technology is the Phraselator, a simple unidirectional translation
device customized for military use. It matches the input against a
fixed set of English sentences and plays back the
corresponding voice recordings in the foreign language; it cannot
handle bidirectional speech.
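To see concretely why a blind cascade breaks down, consider the toy
sketch below. Every name and the tiny lexicon in it are invented for
illustration (this is not MASTOR's pipeline): string stand-ins for
ASR, word-for-word MT, and TTS are simply chained, so the fillers and
repetitions typical of conversational speech pass straight through,
and even correctly recognized words come out in source-language order.

```python
# Toy stand-ins for a blindly cascaded ASR -> MT -> TTS pipeline.
# All names and the lexicon are hypothetical illustrations.

TOY_LEXICON = {"where": "哪里", "is": "在", "the": "", "hospital": "医院"}

def toy_asr(audio: str) -> str:
    # Pretend recognition: a string stands in for the audio signal.
    return audio.lower()

def toy_mt(text: str) -> str:
    # Word-for-word translation: the well-formed-text assumption that
    # breaks on disfluent, misrecognized conversational speech.
    return " ".join(TOY_LEXICON.get(w, f"<{w}?>") for w in text.split())

def toy_tts(text: str) -> str:
    return f"[spoken] {text}"

# A colloquial, disfluent utterance defeats the naive cascade: fillers
# and repeats survive untranslated, and word order stays English.
print(toy_tts(toy_mt(toy_asr("uh where where is um the hospital"))))
# -> [spoken] <uh?> 哪里 哪里 在 <um?>  医院
```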
MASTOR's innovations include the following (a toy sketch of the
concept-to-sentence path appears after the list):
- methods that automatically extract the most likely meaning of a
spoken utterance and store it in a tree-structured set of concepts
such as "actions" and "needs";
- methods that take the tree-based output of a statistical semantic
parser and transform the semantic concepts in the tree to express
the same set of concepts in a way appropriate for the other language;
- methods for statistical natural language generation that take the
resulting set of transformed concepts and generate a sentence in the
target language;
- generation of proper inflections by filtering hypotheses with an
n-gram statistical language model;
- techniques for significantly raising the annotation automation rate
to improve language and domain portability;
- direct modeling approaches, such as maximum entropy (MaxEnt)
models, finite-state transducers, and semantic structured language
models, that tightly couple and unify the speech recognition and
understanding processes;
- unsupervised adaptation algorithms that quickly adapt to new
speakers and acoustic environments;
- algorithms and acoustic models that provide high accuracy in noisy
environments, such as exhibition and hospital halls, and with
far-field microphones, as on handheld devices;
- language models that handle colloquial Mandarin, whose vocabulary
and expressions differ significantly from those of standard Mandarin,
and advanced language modeling techniques that require a much smaller
amount of training data.
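To make the data flow concrete, here is a minimal sketch of the
concept-based generation path, under stated assumptions: the concept
tree, the template realizer, and the bigram scores are all invented
for illustration, whereas MASTOR's actual parser, transfer, and
generation components are statistical models trained from data.

```python
# Sketch: parse to a concept tree, transfer the concepts, generate
# candidate realizations, and filter them with an n-gram LM.

from dataclasses import dataclass, field

@dataclass
class Concept:
    label: str                              # e.g. "NEED", "OBJECT"
    value: str = ""
    children: list = field(default_factory=list)

# 1) The statistical semantic parser's output: meaning as a concept tree.
parse = Concept("REQUEST", children=[
    Concept("NEED", "need"),
    Concept("OBJECT", "doctor"),
])

# 2) Concept transfer: re-express the same concepts for the target
#    language (a no-op here; in general concepts are reordered/relabeled).
def transfer(tree: Concept) -> Concept:
    return tree

# 3) Statistical NLG would propose several realizations differing in
#    word order and inflection; we fake two candidates.
def realize(tree: Concept) -> list[str]:
    obj = next(c.value for c in tree.children if c.label == "OBJECT")
    return [f"I need a {obj}", f"I needs a {obj}"]

# 4) Filter hypotheses with an n-gram LM to pick proper inflections.
BIGRAM_LOGP = {("i", "need"): -0.5, ("need", "a"): -0.7,
               ("a", "doctor"): -0.9, ("i", "needs"): -4.0,
               ("needs", "a"): -2.5}
UNSEEN = -8.0  # crude stand-in for real smoothing

def lm_score(sentence: str) -> float:
    w = sentence.lower().split()
    return sum(BIGRAM_LOGP.get(bg, UNSEEN) for bg in zip(w, w[1:]))

best = max(realize(transfer(parse)), key=lm_score)
print(best)  # -> I need a doctor
```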
[Figure: IBM MASTOR Architecture]
Fu Hua Liu, Liang Gu, Yuqing Gao and Michael Picheny, "Use of
Statistical N-Gram Models in Natural Language Generation for Machine
Translation," in Proc. ICASSP 2003, IEEE, April 2003.
Yuqing Gao, Bowen Zhou, Zijian Diao, Jeffrey Sorensen and Michael
Picheny, "MARS: A Statistical Semantic Parsing and Generation-Based
Multilingual Automatic Translation System," Machine Translation,
Vol. 17, pp. 185-212, 2002.
Ruhi Sarikaya, Yuqing Gao, Michael Picheny and Hakan Erdogan,
"Semantic Confidence Measurement for Spoken Dialog Systems," IEEE
Transactions on Speech and Audio Processing.
Bowen Zhou, Daniel Dechelotte and Yuqing Gao, "Two-way
Speech-to-Speech Translation on Handheld Devices," in Proc. Int.
Conf. on Spoken Language Processing (ICSLP), Korea, Oct. 2004.
Liang Gu and Yuqing Gao, "On Feature Selection in Maximum Entropy
Approach to Statistical Concept-based Speech-to-Speech Translation,"
in Proc. Int. Workshop on Spoken Language Translation, Kyoto, Japan,
Oct. 2004.
Hong-Kwang Jeff Kuo and Yuqing Gao, "Maximum Entropy Direct Model as
a Unified Model for Acoustic Modeling in Speech Recognition," in
Proc. Int. Conf. on Spoken Language Processing (ICSLP), Korea,
Oct. 2004.
Principal Investigator of the Year 2002 – DARPA CAST Program
Industrial Principal Investigator of the Year 2003 – DARPA CAST Program
"Researchers voice translation hopes...," by Mark Hachman,
ExtremeTech.com, 2003.
"Developing Translation Software," by Lisa Bowman, CNET News.com,
April 24, 2003.
"Tech: 20 Hot Technologies to Watch," by Cade Metz, PC Magazine,
July 1, 2003.
"Ideas: Automating the Tower of Babel," by Dr. Judith Markowitz,
Speech Technology Magazine, Sept/Oct.
"Not Lost in Translation," by Ann Harrison, Wired News,
March 9, 2005.
"Soon, you too can speak Chinese - with a little computer help. We
test new translation technology," by Anders W. Hagen, Dagbladet.no,
April 25, 2005.
"Translator – one of the 10 Emerging Technologies That Will Change
Your World," by Greg Huang, MIT Technology Review, January 2004.
S2S DEMONSTRATION VIDEO CLIP
from CeBIT, Germany, 2004.
What is the most exciting potential future
use for the work you're doing?
Speech-to-speech translation systems have the potential to revolutionize the
way people around the world who do not speak a common
language communicate with one another. There are thousands
of different languages spoken; imagine being able to
communicate with anyone instantly through the assistance of
a universal translator. Breaking such communication barriers
would lead to tremendous growth in cultural understanding.
Helping people accept and live with one another's
differences would make for a very rewarding future.
What is the most interesting part of your research?
The most interesting
part of research on such an innovative task is taking on
challenges that have yet to be conquered. It is exciting
to take part in finding ways to
recognize, understand and translate languages through
advanced techniques. Having so many ambitious goals to
account for is very energizing on a daily
basis. No day goes by without new obstacles staring at me.
Every day is a new experience in this field.
What inspired you to go into this field?
Around five years ago,
it felt as if speech recognition technology had reached a
plateau if we continued with the same statistical approach. I
believed that speech recognition needed new perspectives
that could strengthen it for practical uses, such as
language-independent semantic meaning representation,
dialogue and pragmatic context, and so on. I viewed speech
translation as an integrated task of speech recognition and
understanding. This is what inspired me to start
this project and look for ways to take the research in a new
direction.
What is your favorite invention of all time?
The modern wireless
appliances that are available. I came from a
telecommunications background, and having ways to get
information from miles away at such high speeds is very
valuable to me. Almost everything has become wireless and
more convenient for people to use. Nobody in the technology
field walks around without a cellphone and a laptop these days.
Hong-Kwang Jeff Kuo
Research Areas: Human Computer Interaction
Research Labs: Watson