Creating a Computer Voice That People Like
By JOHN MARKOFF
When computers speak, how human should they sound?
This was a question that a team of six IBM linguists, engineers and marketers faced in 2009, when they began designing a function that turned text into speech for Watson, the company’s “Jeopardy!”-playing artificial intelligence program.
Eighteen months later, a carefully crafted voice — sounding not quite human but also not quite like HAL 9000 from the movie “2001: A Space Odyssey” — expressed Watson’s synthetic character in a highly publicized match in which the program defeated two of the best human “Jeopardy!” players.
The challenge of creating a computer “personality” is now one that a growing number of software designers are grappling with as computers become portable and users with busy hands and eyes increasingly use voice interaction.
Machines are listening, understanding and speaking, and not just computers and smartphones. Voices have been added to a wide range of everyday objects like cars and toys, as well as household information “appliances” like the home-companion robots Pepper and Jibo, and Alexa, the voice of the Amazon Echo speaker device.
A new design science is emerging in the pursuit of building what are called “conversational agents,” software programs that understand natural language and speech and can respond to human voice commands.
However, the creation of such systems, led by researchers in a field known as human-computer interaction design, is still as much an art as it is a science.
It is not yet possible to create a computerized voice that is indistinguishable from a human one for anything longer than short phrases that might be used for weather forecasts or communicating driving directions.
Most software designers acknowledge that they are still faced with crossing the “uncanny valley,” in which voices that are almost human-sounding are actually disturbing or jarring. The phrase was coined by the Japanese roboticist Masahiro Mori in 1970. He observed that as graphical animations became more humanlike, there was a point at which they would become creepy and weird before improving to become indistinguishable from videos of humans.
The same is true for speech.
“Jarring is the way I would put it,” said Brian Langner, senior speech scientist at ToyTalk, a technology firm in San Francisco that creates digital speech for things like the Barbie doll. “When the machine gets some of those things correct, people tend to expect that it will get everything correct.”
Beyond correct pronunciation, there is the even larger challenge of correctly placing human qualities like inflection and emotion into speech. Linguists call this “prosody,” the ability to add correct stress, intonation or sentiment to spoken language.
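One concrete handle developers have on prosody today is the Speech Synthesis Markup Language, or SSML, a W3C standard that many text-to-speech engines accept in some form. The short Python sketch below simply builds an SSML string with illustrative rate and pitch values; which attributes a particular engine honors varies.

    # A minimal sketch of annotating text with SSML prosody hints.
    # SSML is a W3C standard, but engine support differs, and the
    # rate/pitch values used here are purely illustrative.

    def with_prosody(text: str, rate: str = "medium", pitch: str = "+0st") -> str:
        """Wrap plain text in SSML so a synthesizer can vary pacing and intonation."""
        return (
            "<speak>"
            f'<prosody rate="{rate}" pitch="{pitch}">{text}</prosody>'
            "</speak>"
        )

    # A slightly faster, higher-pitched delivery of a short answer.
    print(with_prosody("What is Toronto?", rate="fast", pitch="+2st"))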
Today, even with all the progress, it is not possible to completely represent rich emotions in human speech via artificial intelligence. The first experimental-research results — gained from employing machine-learning algorithms and huge databases of human emotions embedded in speech — are just becoming available to speech scientists.
Synthesized speech is created in a variety of ways. The highest-quality techniques for natural-sounding speech begin with a human voice that is used to generate a database of parts and even subparts of speech spoken in many different ways. A human voice actor may spend from 10 hours to hundreds of hours, if not more, recording for each database.
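The technique described here is commonly known as unit-selection, or concatenative, synthesis: the system stitches recorded snippets together, balancing how well each snippet matches the target sound against how smoothly it joins its neighbors. The Python sketch below is a toy illustration of that trade-off, not any vendor's actual pipeline; the units, features and cost weights are invented.

    # Simplified sketch of unit-selection ("concatenative") synthesis: pick one
    # recorded unit per target phone, trading off how well a unit matches the
    # target (the "target cost") against how smoothly it joins its neighbor
    # (the "join cost"). The database, features and weights are invented.

    from dataclasses import dataclass

    @dataclass
    class Unit:
        phone: str      # phoneme label, e.g. "AH"
        pitch: float    # average pitch of the recorded unit, in Hz
        clip_id: int    # index into the recorded-voice database

    def target_cost(unit: Unit, phone: str, desired_pitch: float) -> float:
        # Penalize wrong phonemes heavily, pitch mismatches lightly.
        return (1000.0 if unit.phone != phone else 0.0) + abs(unit.pitch - desired_pitch)

    def join_cost(prev: Unit, cur: Unit) -> float:
        # Penalize large pitch jumps at the concatenation point.
        return abs(prev.pitch - cur.pitch)

    def select_units(targets, database):
        """Greedy left-to-right selection (real systems use dynamic programming)."""
        chosen = []
        for phone, desired_pitch in targets:
            best = min(
                database,
                key=lambda u: target_cost(u, phone, desired_pitch)
                + (join_cost(chosen[-1], u) if chosen else 0.0),
            )
            chosen.append(best)
        return chosen

    # Toy database and target sequence; the output lists which clips were chosen.
    db = [Unit("HH", 120.0, 0), Unit("AH", 118.0, 1), Unit("AH", 180.0, 2), Unit("L", 121.0, 3)]
    print([u.clip_id for u in select_units([("HH", 120.0), ("AH", 120.0), ("L", 120.0)], db)])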
The importance and difficulty of adding an intangible emotional quality can be seen in the 2013 science fiction movie “Her,” in which a lonely office worker played by Joaquin Phoenix falls in love with Samantha, the synthetic voice of an advanced computer operating system.
That voice was ultimately portrayed by Scarlett Johansson, after the film’s director Spike Jonze decided the voice of the original actress did not convey the romantic relationship between human and machine that he was trying to portray.
The roots of modern speech synthesis technology lie in the early work of the Scottish computer scientist Alan Black, who is now a professor at the Language Technologies Institute at Carnegie Mellon University.
Mr. Black acknowledges that even though major progress has been made, speech synthesis systems do not yet achieve humanlike perfection. “The problem is we don’t have good controls over how we say to these synthesizers, ‘Say this with feeling,’ ” he said.
For those like the developers at ToyTalk who design entertainment characters, errors may not be fatal, since the goal is to entertain or even to make their audience laugh. However, for programs that are intended to collaborate with humans in commercial situations or to become companions, the challenges are more subtle.
These designers often say they do not want to try to fool the humans that the machines are communicating with, but they still want to create a humanlike relationship between the user and the machine.
IBM, for example, recently ran a television ad featuring a conversation between the influential singer-songwriter Bob Dylan and the Watson program in which Mr. Dylan abruptly leaves the stage when the program tries to sing. Watson, as it happens, is a terrible singer.
The advertisement does a good job of expressing IBM’s goal of conveying a not-quite-human savant: the researchers wanted a voice that was not too humanlike and, by extension, not creepy.
“Jeopardy!” was a particularly challenging speech synthesis problem for IBM’s researchers because although the answers were short, there were a vast number of possible mispronunciation pitfalls.
“The error rate, in just correctly pronouncing a word, was our biggest problem,” said Andy Aaron, a researcher in the Cognitive Environments Laboratory at IBM Research.
Several members of the team spent more than a year creating a giant database of correct pronunciations to cut the errors to as close to zero as possible. Phrases like brut Champagne, carpe diem and sotto voce presented potential minefields of errors, making it impossible to follow pronunciation guidelines blindly.
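A common way to handle such trouble phrases is an exception lexicon, a hand-curated table of pronunciations that takes precedence over a synthesizer’s automatic letter-to-sound rules. The Python sketch below illustrates the idea; the phoneme strings are written in an ARPAbet-like notation and are illustrative only, not entries from IBM’s database.

    # Sketch of an exception lexicon: hand-curated pronunciations (illustrative
    # only) that take precedence over automatic letter-to-sound rules.

    EXCEPTION_LEXICON = {
        "brut": "B R UW T",
        "carpe diem": "K AA R P EY   D IY EH M",
        "sotto voce": "S OW T OW   V OW CH EY",
    }

    def pronounce(phrase: str, letter_to_sound) -> str:
        """Use the curated pronunciation if one exists; otherwise fall back to rules."""
        return EXCEPTION_LEXICON.get(phrase.lower()) or letter_to_sound(phrase)

    # Example: a known phrase comes from the lexicon; unknown text falls back
    # to a (deliberately naive) placeholder rule.
    print(pronounce("carpe diem", letter_to_sound=lambda p: p.upper()))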
The researchers interviewed 25 voice actors, looking for a particular human sound from which to build the Watson voice. Narrowing it down to the voice they liked best, they then played with it in various ways, at one point even frequency-shifting it so that it sounded like a child.
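Frequency-shifting a recording in this way can be done with off-the-shelf signal-processing tools. The snippet below is one possible sketch using the librosa library, with an invented file name and an arbitrary four-semitone shift.

    # Sketch: shifting a recorded voice up by a few semitones, one way to make
    # an adult recording sound more childlike. Uses the librosa and soundfile
    # libraries; the file name and four-semitone shift are invented.

    import librosa
    import soundfile as sf

    y, sr = librosa.load("voice_sample.wav", sr=None)           # keep the native sample rate
    y_child = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)  # raise pitch by 4 semitones
    sf.write("voice_sample_child.wav", y_child, sr)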
“This type of persona was strongly rejected by just about everyone,” said Michael Picheny, a senior manager at the Watson Multimodal Lab for IBM Research. “We didn’t want the voice to sound hyper-enthusiastic.”
The researchers looked for a machine voice that was slow, steady and, most important, “pleasant.” In the end, acting more as artists than engineers, they fine-tuned the program. The voice they arrived at is clearly a computer’s, but it sounds optimistic, even a bit peppy.
“A good computer-machine interface is a piece of art and should be treated as such,” Mr. Picheny said.
As speech technology continues to improve, there will be new, compelling and possibly perturbing applications.
Imperson, a software firm based in Israel that develops conversational characters for entertainment, is now considering going into politics. Imperson’s idea is that during a campaign, a politician would be able to deploy an avatar on a social media platform that could engage voters. A plausible-sounding Ted Cruz or Donald Trump could articulate the candidate’s positions on any possible subject.
“The audience wants to have an interactive conversation with a candidate,” said Eyal Pfeifel, co-founder and chief technology officer of Imperson. “People will understand, and there will be no uncanny-valley problem.”
Original article:
http://www.nytimes.com/2016/02/15/technology/creating-a-computer-voice-that-people-like.html