Hacking Google’s Text To Speech “API”
When I was at my previous job, one task I had was localizing a large set of phrases into multiple languages, as both text and audio files. I did this by using the awesome Google Translate API.
The Google Translate website has features for translating text and playing audio of it in the translated language. There’s no official API for getting audio, though. Luckily, I’ve never let a lack of an official API stop me before. I had read a few old blog posts about how Google’s undocumented TTS API could be used, albeit with a 100-character limit. Going over 100 characters would result in a truncated audio file, and some of the text I needed to output to audio was longer than that. It turns out that with a little help from Chrome’s web inspector, I could replicate the functionality of the Google Translate site.
The first thing to check out is the URL scheme of the audio files, which looks like this:
http://translate.google.com/translate_tts?ie=UTF-8&q=hello%20world&tl=en&total=1&idx=0&textlen=11&prev=input
Breaking down the parameters, “ie” is the text’s encoding, “q” is the text to convert to audio, “tl” is the text language, “total” is the total number of chunks (more on that later), “idx” is which chunk we’re on, “textlen” is the length of the text in that chunk and “prev” is not really important.
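Here’s a rough sketch of how that URL could be put together in Python. The function name build_tts_url and its defaults are made up for illustration and are not taken from the script; only the parameter names and the base URL come from the example above.

```python
from urllib.parse import quote, urlencode

# Base URL copied from the example above.
BASE_URL = "http://translate.google.com/translate_tts"


def build_tts_url(text, lang="en", total=1, idx=0):
    """Build the audio URL for one chunk of text (hypothetical helper)."""
    params = {
        "ie": "UTF-8",         # encoding of the text
        "q": text,             # the text to convert to audio
        "tl": lang,            # the text language
        "total": total,        # total number of chunks
        "idx": idx,            # index of this chunk
        "textlen": len(text),  # length of the text in this chunk
        "prev": "input",       # not really important
    }
    # quote_via=quote encodes spaces as %20 rather than "+",
    # which matches the URL shown above.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_tts_url("hello world"))
```

Calling it with “hello world” should reproduce the example URL above, including textlen=11.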
The Google Translate site itself gets around its own character limit by breaking big blocks of text into “chunks”. It seems to try to break along punctuation, but for super long sentences it will also break in the middle of a sentence, which ends up sounding pretty weird. Using the Gettysburg Address as an example, Google makes a separate request for the chunk “civil war”.
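I don’t know Google’s exact chunking rules, so the following is only a guess at the general idea: cut at the last punctuation mark that fits within the limit, fall back to the last space, and as a last resort cut mid-word. The name split_into_chunks and the choice of punctuation marks are assumptions for illustration.

```python
import re

MAX_CHARS = 100  # the character limit mentioned earlier


def split_into_chunks(text, limit=MAX_CHARS):
    """Split text into pieces of at most `limit` characters (a sketch)."""
    text = " ".join(text.split())  # collapse runs of whitespace
    chunks = []
    while len(text) > limit:
        window = text[:limit]
        # Prefer to cut right after the last punctuation mark in the window.
        marks = list(re.finditer(r"[.,;:!?]", window))
        if marks:
            cut = marks[-1].end()
        else:
            # Otherwise cut at the last space, or mid-word if there is none.
            space = window.rfind(" ")
            cut = space if space > 0 else limit
        chunks.append(text[:cut].strip())
        text = text[cut:].strip()
    if text:
        chunks.append(text)
    return chunks


speech = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal."
)
for idx, chunk in enumerate(split_into_chunks(speech)):
    print(idx, len(chunk), chunk)
```

The number of chunks and each chunk’s position are what would go into the “total” and “idx” parameters.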
In order to download audio files for longer chunks of text, I wrote up a Python script that broke the text down and made separate requests to Google. The script would write all of the audio responses to one file, and somehow, it worked! Just to be safe, I also set my script up to use Google’s Flash player as the referer (sic) and set the user agent to a version of Firefox.
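A minimal sketch of that download step might look like the code below. The header values are placeholders, not the ones the script actually sends: the Referer here is just the Translate home page standing in for the Flash player URL, and the User-Agent is an arbitrary Firefox-style string. The helper names are also hypothetical, and the endpoint is undocumented, so it may not respond the way it did when this was written.

```python
import shutil
from urllib.request import Request, urlopen

# Placeholder header values (assumptions, see the note above).
HEADERS = {
    "Referer": "http://translate.google.com/",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) "
                  "Gecko/20100101 Firefox/10.0",
}


def make_request(url):
    """Wrap a chunk URL in a request carrying the spoofed headers."""
    return Request(url, headers=HEADERS)


def download_chunks(urls, out_path, opener=urlopen):
    """Fetch each chunk in order and append the raw bytes to one file."""
    with open(out_path, "wb") as out:
        for url in urls:
            with opener(make_request(url)) as response:
                shutil.copyfileobj(response, out)
```

The opener argument only exists so the function can be tried out without touching the network. As for why simply appending the responses works: my best guess is that the audio comes back as MP3, which is made of independent frames, so most players happily read several files glued together.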
At the time, I didn’t want to release the code as it was being used for some uber top secret stuff. But since I’m not working on that project anymore, I refactored the original code into a command line Python script. Along the way I had to learn how to use Python’s argparse, which is a pretty neat way of parsing command line arguments.
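If you haven’t used argparse before, here’s roughly what the setup looks like. The argument names and defaults below are invented for illustration and may not match the flags the actual script accepts.

```python
import argparse


def parse_args(argv=None):
    """Parse command line arguments (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Convert text to speech audio."
    )
    parser.add_argument("text", help="the text to convert to audio")
    parser.add_argument("-l", "--language", default="en",
                        help="language code for the voice (default: en)")
    parser.add_argument("-o", "--output", default="out.mp3",
                        help="file to write the audio to (default: out.mp3)")
    # argv=None makes argparse read sys.argv[1:], the usual CLI behavior.
    return parser.parse_args(argv)


args = parse_args(["hello world", "-l", "ja"])
print(args.text, args.language, args.output)
```

You get help text and error messages for free, which beats picking through sys.argv by hand.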
The project is available on GitHub right now, so go grab it and try it out. If you’re curious what the output sounds like, here’s a recording of female Abraham Lincoln reciting the Gettysburg Address (yes, she mispronounces some words). One fun thing to try out is using clashing input and output languages. Here’s Female Japanese Abraham Lincoln reciting the same speech (she just seems to be spelling words, slacker).
If you enjoyed this hack, let me know and I could post some other ones I’ve been working on. And if you find a way to improve the code (probably not difficult at all) go ahead and submit a pull request on GitHub. And if you’re from Google, please don’t shut down my Gmail and AdSense accounts.