January 16, 2008

Text-To-Speech Foreign Language
Lima, Peru

AT&T has released a demo of its Natural Voices text-to-speech synthesizer, which can speak text using the linguistic rules of various languages and dialects. AT&T is a large telephone company, and I imagine they are using this technology to keep developing dialog prompts for automated phone systems (though the applications reach as far as in-dash car navigation and screen readers for the visually impaired).

When I announced my son's birth, I mentioned the various pronunciations of his name and the one my girlfriend and I have settled on. This demo does a fun job of taking names and pronouncing them in different languages:

Natural Voices TTS speaking 'Aidric Heimburger' in U.S. English, UK English, Spanish, German, and French.

Since I'm sporting a wildly Germanic family name, and 'Aidric' undoubtedly originates from the same region, it's little surprise his name sounds so good in its original tongue (although we choose to pronounce it as you would in English).

I do find it interesting that German found its way onto the list. I read that it's a popular second language (I took it back in high school myself), though most people believe it is useful only in business, since most German speakers have themselves studied English for many, many years.

The United Nations has six official languages under its charter: Arabic, Chinese, English, French, Russian, and Spanish. A global overlay of these languages looks something like this:

If you're looking to learn a foreign language or two, I'd shoot for one of those.

More about Text-To-Speech:

Text-To-Speech (TTS) is often described as two conceptual stages. In the first stage, the system decides how the text should be spoken: how each word should be pronounced, what length and pitch each phoneme should have, and so on. In the second stage, the system does its best to create audio that matches the specifications produced by stage one.
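To make the two stages a little more concrete, here is a toy Python sketch. Nothing in it comes from AT&T's software: the tiny lexicon, the pitch rule, the fixed phoneme length, and the names plan_speech and render_audio are all my own assumptions, and a plain sine tone stands in for real speech.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 8000  # samples per second; an arbitrary choice for this sketch

# Hypothetical toy lexicon (word -> phoneme symbols), invented for illustration.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


@dataclass
class PhonemeSpec:
    phoneme: str
    duration_s: float
    pitch_hz: float


def plan_speech(text):
    """Stage one: decide how the text should be spoken."""
    specs = []
    for i, word in enumerate(text.lower().split()):
        # Words missing from the toy lexicon are silently skipped.
        for phoneme in LEXICON.get(word, []):
            # Made-up rule: pitch drops 10 Hz with each successive word.
            specs.append(PhonemeSpec(phoneme, 0.08, 120.0 - 10.0 * i))
    return specs


def render_audio(specs):
    """Stage two: create samples that match the stage-one specification."""
    samples = []
    for spec in specs:
        count = int(round(spec.duration_s * SAMPLE_RATE))
        for k in range(count):
            angle = 2 * math.pi * spec.pitch_hz * k / SAMPLE_RATE
            samples.append(math.sin(angle))  # a tone standing in for speech
    return samples


specs = plan_speech("hello world")
audio = render_audio(specs)
```

The point is only the split: plan_speech never touches audio, and render_audio never looks at the text.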

TTS software has little or no understanding of the text being read. It uses rules, lists, dictionaries, etc. to make very sophisticated guesses about how a piece of text should be read. While general performance can be quite good, some decisions are intrinsically hard to make without some level of understanding. For example, the word "bass" is pronounced differently in the phrases "bass drum" and "bass boat". Intonation depends in many cases on the writer's intention, which often cannot be inferred in short texts even by human readers. As a result, TTS systems will occasionally make mistakes and can be fooled by carefully constructed texts.
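Here is a rough Python sketch of what a rule-and-list guess for "bass" might look like. The cue words, the ARPAbet-style phoneme strings, and the pronounce function are my own illustration, not anything taken from the Natural Voices software.

```python
# Hypothetical rule table: each homograph gets a default pronunciation
# plus a list of (context cue words, pronunciation) rules.
HOMOGRAPHS = {
    "bass": {
        "default": "B EY S",
        "rules": [
            ({"drum", "guitar", "clef"}, "B EY S"),   # the musical sense
            ({"boat", "fishing", "lake"}, "B AE S"),  # the fish sense
        ],
    },
}


def pronounce(word, context_words):
    """Guess a pronunciation from nearby words; None if the word is not listed."""
    entry = HOMOGRAPHS.get(word.lower())
    if entry is None:
        return None
    context = {w.lower() for w in context_words}
    for cues, pronunciation in entry["rules"]:
        if cues & context:  # any cue word appears in the context
            return pronunciation
    return entry["default"]  # no cue matched, so fall back to a plain guess


drum = pronounce("bass", ["bass", "drum"])   # "B EY S"
boat = pronounce("bass", ["bass", "boat"])   # "B AE S"
```

With no cue word nearby (say, "I love bass"), the sketch simply falls back to its default, which is exactly the kind of guess that can go wrong.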

The type of TTS we do is called a "concatenative" system, meaning that we record a human speaker to make a voice database. We re-use small chunks of the recordings to create new sentences containing words that were never recorded. Further, we do "unit selection" synthesis. This means that we use large voice databases and do clever searches on-the-fly to find chunks in the voice database that best match the requested sentences.
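The description above does not say how the "clever searches" work. A common textbook way to frame unit selection is as a cheapest-path search: each recorded chunk gets a "target cost" for how well it matches what was requested, and each pair of neighboring chunks gets a "join cost" for how awkward the splice is. The Python sketch below follows that framing; the cost formulas, the penalty of 20, the tiny database, and all the names are my own assumptions rather than AT&T's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    index: int       # position of the chunk in the original recording
    phoneme: str
    pitch_hz: float


def target_cost(unit, wanted_pitch):
    """How far the recorded chunk is from what was requested."""
    return abs(unit.pitch_hz - wanted_pitch)


def join_cost(prev, cur):
    """Chunks that were adjacent in the recording splice for free."""
    if cur.index == prev.index + 1:
        return 0.0
    return 20.0 + abs(cur.pitch_hz - prev.pitch_hz)  # made-up penalty


def select_units(database, targets):
    """targets is a list of (phoneme, wanted_pitch); returns the cheapest chunks."""
    if not targets:
        return []
    layers = []  # one dict per target: unit -> (best total cost, previous unit)
    for pos, (phoneme, pitch) in enumerate(targets):
        candidates = [u for u in database if u.phoneme == phoneme]
        if not candidates:
            raise ValueError(f"no recorded unit for phoneme {phoneme!r}")
        layer = {}
        for unit in candidates:
            cost = target_cost(unit, pitch)
            if pos == 0:
                layer[unit] = (cost, None)
                continue
            prev_layer = layers[-1]
            best_prev = min(
                prev_layer, key=lambda p: prev_layer[p][0] + join_cost(p, unit)
            )
            total = prev_layer[best_prev][0] + join_cost(best_prev, unit) + cost
            layer[unit] = (total, best_prev)
        layers.append(layer)
    # Walk back from the cheapest final chunk to recover the whole sequence.
    unit = min(layers[-1], key=lambda u: layers[-1][u][0])
    path = [unit]
    for pos in range(len(layers) - 1, 0, -1):
        unit = layers[pos][unit][1]
        path.append(unit)
    path.reverse()
    return path


DATABASE = [
    Unit(0, "K", 100.0), Unit(1, "AE", 100.0), Unit(2, "T", 100.0),
    Unit(3, "AE", 130.0), Unit(4, "T", 128.0),
]
chosen = select_units(DATABASE, [("K", 100.0), ("AE", 125.0), ("T", 125.0)])
```

In this toy database the search picks chunks 0, 1, 2: the higher-pitched "AE" and "T" are closer to the requested pitch, but splicing them onto the "K" costs more than reusing the run that was recorded together.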




