In one of the most memorable commercials of the ’70s, Ella Fitzgerald famously asked: “Is it live, or is it Memorex®?” The commercial highlighted the defining feature of the new Memorex® tape system: the recorded sound was so good that you couldn’t tell the difference between the original and the recorded “copy.” Today, another technology, one that takes us from text to speech, is changing the way marketers sell virtually everything: Digital Text-to-Speech (TtS).


Text to speech: the history

Speech synthesis, the artificial production of human speech, goes back much further than one would expect. Doctor and physicist Christian Gottlieb Kratzenstein was one of the first to duplicate the human voice. Focusing on the vowels produced by the vocal cords, he won first prize at the academy of St. Petersburg in 1780 for his “vowel organ.” Eleven years later, building on Kratzenstein’s work, Wolfgang von Kempelen of Hungary added consonant sounds using replicas of the lips and tongue in his “acoustical-mechanical speech machine.”

Fast forward 170 years to 1961, when John Larry Kelly, Jr. and Louis Gerstman of Bell Labs used an IBM 704 computer to synthesize speech; Kelly’s voice recorder synthesizer (vocoder) recreated the song “Daisy Bell.” In fact, it is well documented that Arthur C. Clarke was inspired to give HAL 9000 in 2001: A Space Odyssey the ability to sing the same song. In subsequent years, we have developed and explored many different voice synthesis models, techniques, and applications (of particular note was the introduction of TI’s Speak & Spell learning machine in 1978).

Clearly, we have been working on and perfecting the replication of the human voice for hundreds of years. There are numerous books and websites that document our continuing journey to perfect vocal reproduction. If you want to learn more about advancements in speech replication, look for sites that let you hear the major milestones and compare results from just five years ago with today’s.


Text to speech for a memorable customer experience

We have come a long way in this journey and it is time for us to not only appreciate where we are, but to get inspired by it. It’s also time for marketers working in all sectors, in companies of every size, to start leveraging text to speech technologies to capture interest, generate response, and deliver a memorable customer experience.

Amazingly, most of the voices you hear today on smartphones, websites, and electronic games are not created by humans. In reality, they don’t need to be. Text to speech produces male and female voices from a wide selection of phonemes (the sound components of spoken language) and accents. Today, the technology is sophisticated enough to closely recreate vocal intonations, including raising the voice on a question, increasing volume to emphasize a point, and pausing deliberately for maximum effect.

Not surprisingly, as detailed in a press release for a MarketsandMarkets study, the rapid growth of text to speech technology is being fueled by government investment in educating differently abled persons and, of course, by the meteoric growth in handheld devices. The study reports that the text to speech market was valued at USD 1.3B in 2016 and is expected to grow to USD 3.03B by 2022 (a CAGR of 15.21% between 2017 and 2022).
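For readers who want to check how a compound annual growth rate (CAGR) ties such figures together, here is a minimal Python sketch. The three numbers are the ones quoted above; the function names are illustrative, and the study’s 2017 base value is not given in this post, so the code only derives what the quoted rate implies.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


def implied_start(end_value: float, rate: float, years: int) -> float:
    """Starting value implied by an end value and a constant annual rate."""
    return end_value / (1 + rate) ** years


# Figures quoted from the study (USD billions).
value_2016 = 1.3
value_2022 = 3.03
quoted_rate = 0.1521  # stated for 2017 to 2022

# Rate implied by the two quoted endpoints over the six years 2016 to 2022.
rate_2016_2022 = cagr(value_2016, value_2022, 6)

# 2017 market size implied by the quoted rate (an inference, not a study figure).
implied_2017 = implied_start(value_2022, quoted_rate, 5)

print(f"2016-2022 CAGR from endpoints: {rate_2016_2022:.2%}")
print(f"Implied 2017 value: USD {implied_2017:.2f}B")
```

The endpoint-based rate comes out a little over 15%, close to the quoted figure, and the quoted rate implies a 2017 market of roughly USD 1.49B.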

For corporate marketers, dynamic, personalized media are ideal for integrating TtS engines. But the one medium that traditionally has not been dynamic, namely video, may have the greatest potential of them all. Here’s why.

We have watched, and enjoyed, movies on TV, on the big screen, and now on little screens for decades. They are memorable because they engage both our auditory and visual senses. When they are well produced, with emotion-grabbing music, great scripts, fine acting, and excellent direction, no other medium can equally stimulate our minds and affect our emotions. The chance to learn something new, to travel to another world, or to walk in someone else’s shoes are just some of the reasons why we love movies. But now, with the advent of Personalized Video together with TtS, the experience has become more personal, relevant, and response-generating than ever before.


The right time is now

There are specific reasons why now is the right time to start creating and deploying personalized video with TtS campaigns. These include:

  • The technology is remarkably capable: intonation, pausing, localization (translations), and accents are reproducible and virtually indistinguishable from the human voice.
  • Although we hear TtS all the time on our GPS and phone messages, we don’t often hear it in videos, because its incorporation into this medium is relatively new. Personalized video with TtS continues to grow, but for now it is novel, and it has the power to capture a viewer’s interest and hold it.
  • By combining personalized voiceover with customer-specific images and text, you achieve the “trifecta” of intimate dialogue, and direct the exact message to each recipient.
  • The cost of TtS is minimal in relation to live voiceovers, especially if you want to change the recording for various audiences, or for one person at a time.
  • Developing TtS voiceovers is as simple as sending the script to a TtS engine and adding a few simple codes before and after words (discussed later in this post) to create natural human intonations and expressions.
  • The TtS production process, although highly sophisticated, is now available to purchase as a highly efficient, cloud-enabled service, either separately or as part of a personalized video project.


From text to speech to satisfied customers

There are examples where personalized videos with TtS have resulted in double-digit response rates. One example is the City of Ancona tax video by Doxee, in which the TtS voiceover contains specific information about each citizen, based on the current records held by the city, that is used to compute their tax payment. If the information is correct, the recipient is directed to a website to see the amount and approve it; if what the computer-generated voiceover “says” is incorrect, the citizen can go to the same site and correct the data. The end result is a satisfied citizen and a very low-cost, effective campaign for household census and tax verification. Read more about this campaign in the Doxee post.

As far as choices of TtS technology service companies are concerned, eLearning Industry published a 2017 update to its Top Ten List of TtS Software (originally posted in 2015). Since then, its first choice, “Ivona Speech Cloud,” an Amazon company, has been renamed “Polly TTS.” This technology is what drives “Alexa” and, if Amazon has its way, “she” will be reminding you to raise your heat, turn on the coffee maker, start the car, and everything else you do!

Behind the “Polly” curtain is some very cool AI. Deep learning is used to synthesize speech that sounds like a human voice. There are dozens of lifelike voices in many different languages, which provide the flexibility needed to build applications that work in countries around the world. One of the most critical aspects of TtS engines is their ability to support lexicons and SSML tags (the codes referred to earlier), which let you control aspects of speech such as pronunciation, volume, pitch, and speaking rate.
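To make the idea of SSML tags concrete, here is a minimal Python sketch that wraps a plain script in a few standard SSML elements: a pause (break), a louder emphasized phrase (emphasis and prosody volume), and a raised pitch on a question (prosody pitch). The function name, the sample wording, and the company name are hypothetical, and individual TtS voices may support only a subset of these tags, so treat it as an illustration rather than a recipe for any particular engine.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(greeting: str, key_point: str, question: str) -> str:
    """Wrap plain text in standard SSML tags that shape intonation."""
    parts = [
        "<speak>",
        escape(greeting),
        # Deliberate pause before the main message.
        '<break time="500ms"/>',
        # Stress the key point and speak it louder.
        '<prosody volume="loud"><emphasis level="strong">'
        + escape(key_point)
        + "</emphasis></prosody>",
        '<break time="300ms"/>',
        # Raise the pitch slightly for the closing question.
        '<prosody pitch="+10%">' + escape(question) + "</prosody>",
        "</speak>",
    ]
    return "".join(parts)


# Hypothetical sample text; escape() protects characters such as "&".
ssml = build_ssml(
    "Thank you for being a customer of Smith & Sons.",
    "Your appointment is this Saturday.",
    "Can we count on seeing you?",
)

# Parsing confirms the markup is well-formed XML before it is sent anywhere.
root = ET.fromstring(ssml)
print(ssml)
```

The resulting string is what would be sent to a TtS engine in place of plain text. With Amazon Polly, for example, the request has to declare that the input is SSML rather than plain text.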


From text to speech to Personalized Video

So, why should you care about Polly? Because it is a service that is being used together with Doxee Pvideo® to create relevant, response-generating voiceovers for each video. It’s just a short step from text to speech to Personalized Video.

Just imagine clicking on a video that reminds you to stop in for service on your car and hearing the following in your preferred language: “We want to thank the Roberts family for being a great customer since 2016. We know you want to keep your silver Mercedes C-Class in perfect condition, so we would like to invite you to your own Customer Appreciation Day on Saturday, December 8, 2018. Please stop in to see Marie and pick up your complimentary set of new floor mats. For more information, please see your personal website at”

This voiceover is personal enough to capture your attention, but when you add text and graphics specifically related to the owner’s silver Mercedes and reference their personal service representative, the result is a customer experience that cannot be matched in any other medium.
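A per-recipient script like the one quoted above could be produced by filling a template from customer records before it is sent to the TtS engine. The following Python sketch shows one way to do that; the field names, the second sample record, and the function name are hypothetical and are not taken from Doxee Pvideo® or any particular product.

```python
from string import Template

# Voiceover script with placeholders for customer-specific data.
VOICEOVER = Template(
    "We want to thank the $family_name family for being a great customer "
    "since $customer_since. We know you want to keep your $vehicle in "
    "perfect condition, so we would like to invite you to your own "
    "Customer Appreciation Day on $event_date. Please stop in to see "
    "$rep_name and pick up your complimentary $gift."
)


def render_voiceover(record: dict) -> str:
    """Fill the script for one customer.

    substitute() raises KeyError when a field is missing, so an
    incomplete record is caught before any audio is generated.
    """
    return VOICEOVER.substitute(record)


# Sample records; the second one is invented for illustration.
customers = [
    {
        "family_name": "Roberts",
        "customer_since": "2016",
        "vehicle": "silver Mercedes C-Class",
        "event_date": "Saturday, December 8, 2018",
        "rep_name": "Marie",
        "gift": "set of new floor mats",
    },
    {
        "family_name": "Bianchi",
        "customer_since": "2014",
        "vehicle": "blue Fiat 500",
        "event_date": "Saturday, December 15, 2018",
        "rep_name": "Paolo",
        "gift": "car wash voucher",
    },
]

scripts = [render_voiceover(customer) for customer in customers]
for script in scripts:
    print(script)
```

Each rendered script is then a separate input to the TtS engine, which is what makes one voiceover per viewer practical.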

For marketers, there is a time to pay attention to trends. When they become more than that, it is time to watch and learn. But if the technology provides a unique opportunity to reach customers and create a memorable, interactive customer experience, dipping your proverbial “toe in the water”—maybe even jumping in the pool—is the smart thing to do.

Listen and you can hear the future of TtS. Take advantage of it.